cs.CL

[1] UrduBench: An Urdu Reasoning Benchmark using Contextually Ensembled Translations with Human-in-the-Loop cs.CL | cs.AI

Muhammad Ali Shafique, Areej Mehboob, Layba Fiaz, Muhammad Usman Qadeer, Hamza Farooq

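TL;DR: This paper introduces UrduBench, a benchmark for evaluating the reasoning ability of large language models in Urdu. Using a framework that combines contextually ensembled translation with human validation, the authors translate several English reasoning benchmarks (MGSM, MATH-500, CommonSenseQA, and OpenBookQA) into Urdu and systematically evaluate a range of LLMs under different prompting strategies.

Details

Motivation: Addresses the lack of standardized reasoning benchmarks for low-resource languages, Urdu in particular. The gap stems from the sensitivity of machine translation and from prior work focusing on general language tasks rather than reasoning.

Result: A comprehensive evaluation of reasoning-oriented and instruction-tuned LLMs on the translated UrduBench analyzes performance differences across datasets, task difficulty levels, architectures, model scales, and language consistency tests. Multi-step and symbolic reasoning tasks prove highly challenging in Urdu, and stable language alignment is a key prerequisite for robust reasoning.

Insight: The contribution is a scalable framework that combines contextually ensembled translation with human validation to build reasoning benchmarks, and the approach generalizes to other low-resource languages. Viewed objectively, its systematic evaluation offers empirical insight into multilingual reasoning failures and underscores the importance of language alignment in cross-lingual reasoning.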

Abstract: Recent advances in large language models (LLMs) have led to strong reasoning capabilities; however, evaluating such models in low-resource languages remains challenging due to the lack of standardized benchmarks. In particular, Urdu reasoning evaluation has been limited by the sensitivity of machine translation and an emphasis on general language tasks rather than reasoning benchmarks. In this paper, we propose a contextually ensembled translation framework with human-in-the-loop validation that leverages multiple translation systems to develop Urdu reasoning benchmarks while preserving contextual and structural integrity. Using this framework, we translate widely adopted reasoning and question-answering benchmarks, including MGSM, MATH-500, CommonSenseQA, and OpenBookQA, into Urdu, collectively referred to as UrduBench, and conduct a comprehensive evaluation of both reasoning-oriented and instruction-tuned LLMs across multiple prompting strategies. Our analysis reveals performance differences across (1) four datasets, (2) five task difficulty levels, (3) diverse model architectures, (4) multiple model scaling settings, and (5) language consistency tests. We find that multi-step and symbolic reasoning tasks pose significant challenges in Urdu, and that stable language alignment is a critical prerequisite for robust reasoning. Overall, our work establishes a scalable methodology for standardized reasoning evaluation in Urdu and provides empirical insights into multilingual reasoning failures. This experimental setup is also broadly applicable to other low-resource languages. The code and datasets will be publicly released.


[2] Large Language Models Naively Recover Ethnicity from Individual Records cs.CL

Noah Dasanaike

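TL;DR: This paper shows that large language models (LLMs) can infer race or ethnicity from names alone with high accuracy, exceeding traditional Bayesian Improved Surname Geocoding (BISG), and can be applied without additional training data to many cultures and classification schemes outside the United States.

Details

Motivation: Addresses the limitations of traditional BISG for race inference: geographic restriction, limited accuracy (especially on balanced samples), and systematic bias (e.g., misclassifying minorities in wealthier neighborhoods as White). The paper explores LLMs as a more general and more accurate alternative.

Result: On balanced samples from Florida and North Carolina voter files (with self-reported race), LLM classification reaches up to 84.7% accuracy, well above BISG's 68.2%. It also performs well in validations in Lebanon (religious sect, 64.3%) and India (MPs from reserved constituencies, 99.2%; land records with caste classification, 74.0%). Enabling extended reasoning can add 1-3 percentage points, and including metadata such as party registration reaches 86.7%.

Insight: The paper reveals how well LLMs extract sensitive demographic attributes from names without additional training data, overcoming the geographic and categorical limits of traditional models. It also shows that small transformer models fine-tuned on LLM-generated labels can exceed BISG accuracy while enabling low-cost local deployment, giving large-scale social science research a new tool.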

Abstract: I demonstrate that large language models can infer ethnicity from names with accuracy exceeding that of Bayesian Improved Surname Geocoding (BISG) without additional training data, enabling inference outside the United States and to contextually appropriate classification categories. Using stratified samples from Florida and North Carolina voter files with self-reported race, LLM-based classification achieves up to 84.7% accuracy, outperforming BISG (68.2%) on balanced samples. I test six models including Gemini 3 Flash, GPT-4o, and open-source alternatives such as DeepSeek v3.2 and GLM-4.7. Enabling extended reasoning can improve accuracy by 1-3 percentage points, though effects vary across contexts; including metadata such as party registration reaches 86.7%. LLM classification also reduces the income bias inherent in BISG, where minorities in wealthier neighborhoods are systematically misclassified as White. I further validate using Lebanese voter registration with religious sect (64.3% accuracy), Indian MPs from reserved constituencies (99.2%), and Indian land records with caste classification (74.0%). Aggregate validation across India, Uganda, Nepal, Armenia, Chile, and Costa Rica using original full-count voter rolls demonstrates that the method recovers known population distributions where naming conventions are distinctive. For large-scale applications, small transformer models fine-tuned on LLM labels exceed BISG accuracy while enabling local deployment at no cost.


[3] Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models cs.CL | cs.LG

Zhaoyi Li, Jiatong Li, Gangwei Jiang, Linqi Song, Defu Lian

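TL;DR: This paper systematically studies the performance drop of large language models when the number of reasoning hops exceeds the training distribution. It finds that errors concentrate at token positions belonging to a few critical error types and stem from competition among internal attention heads. It proposes a lightweight test-time correction method that dynamically identifies and deactivates erroneous processing heads to improve hop generalization.

Details

Motivation: Addresses the sharp performance drop in chain-of-thought reasoning when the required number of reasoning steps exceeds the training distribution, and investigates the internal mechanism behind the failure.

Result: Extensive experiments across tasks from multiple domains and different LLMs show that the proposed test-time correction consistently improves reasoning hop generalization.

Insight: Shows that hop generalization failures arise from specific attention heads (erroneous processing heads) amplifying incorrect reasoning trajectories, and proposes a lightweight correction that intervenes dynamically at inference time without retraining. This offers a new perspective on understanding and improving LLM reasoning robustness.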

Abstract: Chain-of-thought (CoT) reasoning has become the standard paradigm for enabling Large Language Models (LLMs) to solve complex problems. However, recent studies reveal a sharp performance drop in reasoning hop generalization scenarios, where the required number of reasoning steps exceeds training distributions while the underlying algorithm remains unchanged. The internal mechanisms driving this failure remain poorly understood. In this work, we conduct a systematic study on tasks from multiple domains, and find that errors concentrate at token positions of a few critical error types, rather than being uniformly distributed. Closer inspection reveals that these token-level erroneous predictions stem from internal competition mechanisms: certain attention heads, termed erroneous processing heads (ep heads), tip the balance by amplifying incorrect reasoning trajectories while suppressing correct ones. Notably, removing individual ep heads during inference can often restore the correct predictions. Motivated by these insights, we propose test-time correction of reasoning, a lightweight intervention method that dynamically identifies and deactivates ep heads in the reasoning process. Extensive experiments across different tasks and LLMs show that it consistently improves reasoning hop generalization, highlighting both its effectiveness and potential.


[4] MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation cs.CL | cs.AI

Tianyi Xu, Kosei Uemura, Alfred Malengo Kondoro, Tadesse Destaw Belay, Catherine Nana Nyaah Essuman

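TL;DR: This paper introduces MGSM-Pro, an extension of the multilingual MGSM dataset that provides five instantiations per question (varying names, digits, and irrelevant context) to evaluate the robustness of large language models in multilingual mathematical reasoning. Many low-resource languages show marked performance drops on digit instantiations that differ from the original test set, and models differ in their robustness to digit changes.

Details

Motivation: Benchmarks for multilingual mathematical reasoning lag behind English in both difficulty and recency. GSM-Symbolic showed high variance across instantiations of the same question, but only in English, so the evaluation needs to be extended to multilingual settings to assess model robustness more fully.

Result: Evaluation across nine languages shows that low-resource languages suffer large performance drops when digit instantiations change. Among proprietary models, Gemini 2.5 Flash and GPT-4.1 are less robust to digit instantiation, while Claude 4.0 Sonnet is more robust. Among open models, GPT-OSS 120B and DeepSeek V3 show stronger robustness.

Insight: The contribution is extending the GSM-Symbolic approach to multilingual settings and building MGSM-Pro to measure sensitivity to instantiation changes systematically. Viewed objectively, it exposes fragility to details such as digits in multilingual math reasoning, particularly in low-resource languages, and its recommendation to evaluate with at least five digit-varying instantiations is a useful reference for future benchmark design.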

Abstract: Large language models have made substantial progress in mathematical reasoning. However, benchmark development for multilingual evaluation has lagged behind English in both difficulty and recency. Recently, GSM-Symbolic showed strong evidence of high variance when models are evaluated on different instantiations of the same question; however, the evaluation was conducted only in English. In this paper, we introduce MGSM-Pro, an extension of the MGSM dataset using the GSM-Symbolic approach. Our dataset provides five instantiations per MGSM question by varying names, digits and irrelevant context. Evaluations across nine languages reveal that many low-resource languages suffer large performance drops when tested on digit instantiations different from those in the original test set. We further find that some proprietary models, notably Gemini 2.5 Flash and GPT-4.1, are less robust to digit instantiation, whereas Claude 4.0 Sonnet is more robust. Among open models, GPT-OSS 120B and DeepSeek V3 show stronger robustness. Based on these findings, we recommend evaluating each problem using at least five digit-varying instantiations to obtain a more robust and realistic assessment of math reasoning.
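
The idea of producing several instantiations of one question by varying names, digits, and irrelevant context can be sketched as follows. This is a minimal illustration only: the template, name list, distractor sentences, and number ranges are all hypothetical and are not taken from MGSM-Pro.

```python
import random

# Hypothetical template and pools; MGSM-Pro's real templates are not shown in the abstract.
TEMPLATE = ("{name} has {a} apples and buys {b} more. {distractor} "
            "How many apples does {name} have now?")
NAMES = ["Amina", "Kofi", "Yuki", "Lucia", "Tariq"]
DISTRACTORS = ["The market was busy that day.", "It was raining outside."]


def instantiate(seed):
    """Return one (question, answer) pair with a seeded choice of name, digits, and context."""
    rng = random.Random(seed)
    a, b = rng.randint(2, 50), rng.randint(2, 50)
    question = TEMPLATE.format(
        name=rng.choice(NAMES), a=a, b=b, distractor=rng.choice(DISTRACTORS)
    )
    # The gold answer is recomputed from the sampled digits.
    return question, a + b


def make_instances(n=5):
    """Build n instantiations of the same underlying problem."""
    return [instantiate(seed) for seed in range(n)]


if __name__ == "__main__":
    for question, answer in make_instances():
        print(answer, "|", question)
```

Scoring a model on all five instances, rather than on one fixed instance, is what makes it possible to see the variance the abstract describes.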


[5] MoCo: A One-Stop Shop for Model Collaboration Research cs.CL

Shangbin Feng, Yuyang Bai, Ziyuan Yang, Yike Wang, Zhaoxuan Tan

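TL;DR: MoCo is a Python library for model collaboration research. It integrates 26 model collaboration methods (spanning routing, text, logit, and model-parameter exchange) and 25 evaluation datasets (covering reasoning, QA, code, safety, and more), aiming to unify and advance research on collaboration among multiple language models. Experiments show that collaboration strategies outperform non-collaborative models in 61.0% of (model, data) settings on average, with gains of up to 25.8%.

Details

Motivation: Existing research on model collaboration is scattered and lacks systematic comparison. MoCo aims to consolidate it, establish model collaboration as a systematic research direction, and provide a unified platform for executing, benchmarking, and comparing collaboration algorithms.

Result: In extensive experiments with MoCo, most collaboration strategies outperform non-collaborative models in 61.0% of (model, data) settings on average, and the most effective methods outperform by up to 25.8%. The paper also analyzes the scaling of collaboration strategies and their training/inference efficiency.

Insight: MoCo's contribution is a comprehensive toolkit that supports standardizing and scaling model collaboration research. It highlights that collaborative systems can solve problems where single language models struggle, and supports a vision of an open, modular, decentralized AI future.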

Abstract: Advancing beyond single monolithic language models (LMs), recent research increasingly recognizes the importance of model collaboration, where multiple LMs collaborate, compose, and complement each other. Existing research on this topic has mostly been disparate and disconnected, from different research communities, and lacks rigorous comparison. To consolidate existing research and establish model collaboration as a school of thought, we present MoCo: a one-stop Python library of executing, benchmarking, and comparing model collaboration algorithms at scale. MoCo features 26 model collaboration methods, spanning diverse levels of cross-model information exchange such as routing, text, logit, and model parameters. MoCo integrates 25 evaluation datasets spanning reasoning, QA, code, safety, and more, while users could flexibly bring their own data. Extensive experiments with MoCo demonstrate that most collaboration strategies outperform models without collaboration in 61.0% of (model, data) settings on average, with the most effective methods outperforming by up to 25.8%. We further analyze the scaling of model collaboration strategies, the training/inference efficiency of diverse methods, highlight that the collaborative system solves problems where single LMs struggle, and discuss future work in model collaboration, all made possible by MoCo. We envision MoCo as a valuable toolkit to facilitate and turbocharge the quest for an open, modular, decentralized, and collaborative AI future.


[6] CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding cs.CL

Jiahao Huo, Yu Huang, Yibo Yan, Ye Pan, Yi Cao

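TL;DR: This paper proposes CausalEmbed, an auto-regressive multi-vector generation method for building visual document embeddings. It targets the large storage overhead that multimodal large language models incur in visual document retrieval by representing each page with many visual tokens. By applying an iterative margin loss during contrastive training, it learns compact, well-structured representations, sharply reducing token count while maintaining strong performance.

Details

Motivation: When multimodal large language models generate high-quality multi-vector embeddings for visual document retrieval, each page requires thousands of visual tokens. The resulting storage overhead limits practical use, so a more efficient way to generate embeddings is needed.

Result: Across various backbones and benchmarks, CausalEmbed performs efficient visual document retrieval with only dozens of visual tokens, a 30-155x reduction in token count, while maintaining highly competitive performance.

Insight: Contributions include auto-regressive generation of multi-vector embeddings, an iterative margin loss that improves representation structure, and a flexible test-time scaling strategy, which together offer a new angle on the generative paradigm in multimodal document retrieval.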

Abstract: Although Multimodal Large Language Models (MLLMs) have shown remarkable potential in Visual Document Retrieval (VDR) through generating high-quality multi-vector embeddings, the substantial storage overhead caused by representing a page with thousands of visual tokens limits their practicality in real-world applications. To address this challenge, we propose an auto-regressive generation approach, CausalEmbed, for constructing multi-vector embeddings. By incorporating iterative margin loss during contrastive training, CausalEmbed encourages the embedding models to learn compact and well-structured representations. Our method enables efficient VDR tasks using only dozens of visual tokens, achieving a 30-155x reduction in token count while maintaining highly competitive performance across various backbones and benchmarks. Theoretical analysis and empirical results demonstrate the unique advantages of auto-regressive embedding generation in terms of training efficiency and scalability at test time. As a result, CausalEmbed introduces a flexible test-time scaling strategy for multi-vector VDR representations and sheds light on the generative paradigm within multimodal document retrieval.


[7] Self-Improving Pretraining: using post-trained models to pretrain better models cs.CL | cs.AI | cs.LG

Ellen Xiaoqing Tan, Shehzaad Dhuliawala, Jing Xu, Ping Yu, Sainbayar Sukhbaatar

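TL;DR: This paper proposes Self-Improving Pretraining, a pretraining method intended to improve the factuality, safety, and overall quality of large language model generations from the ground up. It streams documents and uses reinforcement learning (RL) to improve the next K generated tokens at each step. A strong post-trained model judges candidate generations (model rollouts, the original suffix, and a rewritten suffix) for quality, safety, and factuality, steering the model toward better behavior during pretraining itself.

Details

Motivation: Current approaches to safety, factuality, and quality rely mainly on expensive, carefully curated datasets and multiple stages of fine-tuning and alignment, yet they cannot guarantee the correction of patterns learned during pretraining. These issues therefore need to be addressed during pretraining, to keep unsafe or hallucinated outputs from becoming deeply embedded in the model's core behavior.

Result: Compared with standard pretraining, the method yields relative improvements of 36.2% in factuality and 18.5% in safety, and up to 86.3% win rate improvement in overall generation quality.

Insight: The core idea is to bring a post-trained model into pretraining as a judge and to use RL with high-quality feedback over the original text, the rewritten text, and the model's own rollouts to optimize generation behavior directly during pretraining. This offers a new paradigm for building safer, more reliable large language models from the source.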

Abstract: Ensuring safety, factuality and overall quality in the generations of large language models is a critical challenge, especially as these models are increasingly deployed in real-world applications. The prevailing approach to addressing these issues involves collecting expensive, carefully curated datasets and applying multiple stages of fine-tuning and alignment. However, even this complex pipeline cannot guarantee the correction of patterns learned during pretraining. Therefore, addressing these issues during pretraining is crucial, as it shapes a model’s core behaviors and prevents unsafe or hallucinated outputs from becoming deeply embedded. To tackle this issue, we introduce a new pretraining method that streams documents and uses reinforcement learning (RL) to improve the next K generated tokens at each step. A strong, post-trained model judges candidate generations – including model rollouts, the original suffix, and a rewritten suffix – for quality, safety, and factuality. Early in training, the process relies on the original and rewritten suffixes; as the model improves, RL rewards high-quality rollouts. This approach builds higher quality, safer, and more factual models from the ground up. In experiments, our method gives 36.2% and 18.5% relative improvements over standard pretraining in terms of factuality and safety, and up to 86.3% win rate improvements in overall generation quality.


[8] Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation cs.CL | cs.AI

Yuan Sui, Bryan Hooi

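TL;DR: This paper proposes CoNL, a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. It addresses the difficulty of training LLMs on non-verifiable tasks, where ground-truth labels are absent and evaluator quality limits the training signal.

Details

Motivation: Training LLMs on non-verifiable tasks (such as creative writing and ethical reasoning) is hard. Without ground-truth labels, existing LLM-as-Judge approaches are constrained by the evaluator's own quality: a judge that cannot recognize good solutions cannot provide useful training signals, and evaluation biases (e.g., favoring verbosity over quality) remain unaddressed.

Result: Experiments on five benchmarks show that CoNL consistently improves over self-rewarding baselines while maintaining stable training. The abstract does not give specific quantitative results or state whether the method reaches state of the art.

Insight: The contribution is meta-evaluation via multi-agent self-play: a critique is scored by whether it helps improve a solution, which jointly optimizes generation and judging without external judges or ground truth. Viewed objectively, internalizing evaluation as self-supervised learning may improve LLM adaptability on open-ended tasks.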

Abstract: Training large language models (LLMs) for non-verifiable tasks, such as creative writing, dialogue, and ethical reasoning, remains challenging due to the absence of ground-truth labels. While LLM-as-Judge approaches offer a scalable alternative to human feedback, they face a fundamental limitation: performance is constrained by the evaluator’s own quality. If the judge cannot recognize good solutions, it cannot provide useful training signals, and evaluation biases (e.g., favoring verbosity over quality) remain unaddressed. This motivates meta-evaluation: the ability to evaluate and improve the evaluator itself. We introduce CoNL, a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. Our key insight: critique quality can be measured by whether it helps others improve their solutions. In CoNL, multiple agents sharing the same policy engage in structured conversations to propose, critique, and revise solutions. Critiques that enable solution improvements earn a diagnostic reward, creating explicit supervision for meta-evaluation and enabling joint optimization of generation and judging capabilities through self-play, without external judges or ground truth. Experiments on five benchmarks show that CoNL achieves consistent improvements over self-rewarding baselines while maintaining stable training.


[9] SOUP: Token-level Single-sample Mix-policy Reinforcement Learning for Large Language Models cs.CL

Lei Yang, Wei Bi, Chenxi Sun, Renren Jin, Deyi Xiong

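TL;DR: This paper proposes SOUP, a reinforcement learning framework for large language model post-training that unifies off-policy and on-policy learning at the token level. The prefix of a sequence is sampled from historical policies (the off-policy part) and the continuation is generated on-policy. Token-level importance ratios let the method exploit off-policy information while keeping training stable.

Details

Motivation: Addresses the limited exploration and early saturation that on-policy RL methods (such as GRPO) suffer in language model post-training due to low sampling diversity, while avoiding the policy mismatch and instability caused by existing off-policy methods that mix entire trajectories.

Result: Extensive experiments show that SOUP consistently outperforms standard on-policy training and existing off-policy extensions.

Insight: The contribution is a fine-grained, single-sample mix-policy training paradigm that unifies off- and on-policy learning at the token level. Confining off-policy influence to the prefix and applying importance ratios balances exploration and stability, improving both exploration and final performance in LLM RL.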

Abstract: On-policy reinforcement learning (RL) methods widely used for language model post-training, like Group Relative Policy Optimization (GRPO), often suffer from limited exploration and early saturation due to low sampling diversity. While off-policy data can help, current approaches that mix entire trajectories cause significant policy mismatch and instability. In this work, we propose the Single-sample Mix-pOlicy Unified Paradigm (SOUP), a framework that unifies off- and on-policy learning within individual samples at the token level. It confines off-policy influence to the prefix of a generated sequence sampled from historical policies, while the continuation is generated on-policy. Through token-level importance ratios, SOUP effectively leverages off-policy information while preserving training stability. Extensive experiments demonstrate that SOUP consistently outperforms standard on-policy training and existing off-policy extensions. Our further analysis clarifies how our fine-grained, single-sample mix-policy training can improve both exploration and final performance in LLM RL.
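
As a rough illustration of token-level importance ratios over a mixed sample, the sketch below treats the first `prefix_len` tokens as off-policy and the rest as on-policy. The function names, the choice of a ratio of exactly 1 for on-policy tokens, and the unclipped objective are all assumptions made for illustration; the abstract does not specify SOUP's actual loss.

```python
import math


def token_ratios(logp_current, logp_behavior, prefix_len):
    """Per-token importance ratios for one mixed sample (hypothetical helper).

    Tokens before prefix_len came from a historical policy, so they get
    exp(logp_current - logp_behavior). Later tokens are assumed to be
    sampled from the current policy, so their ratio is taken as 1.
    """
    ratios = []
    for t, (lc, lb) in enumerate(zip(logp_current, logp_behavior)):
        ratios.append(math.exp(lc - lb) if t < prefix_len else 1.0)
    return ratios


def sequence_objective(logp_current, logp_behavior, prefix_len, advantage):
    """Mean of ratio * advantage over tokens (an assumed, unclipped surrogate)."""
    ratios = token_ratios(logp_current, logp_behavior, prefix_len)
    return sum(r * advantage for r in ratios) / len(ratios)


if __name__ == "__main__":
    cur = [-1.0, -2.0, -0.5]
    old = [-1.0, -1.0, -3.0]
    print(token_ratios(cur, old, prefix_len=2))
    print(sequence_objective(cur, old, prefix_len=2, advantage=2.0))
```

The point of the sketch is only that the correction is applied per token and only where the sampling policy differs from the current one.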


[10] Note2Chat: Improving LLMs for Multi-Turn Clinical History Taking Using Medical Notes cs.CL

Yang Zhou, Zhenting Sheng, Mingrui Tan, Yuting Song, Jun Zhou

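TL;DR: This paper proposes Note2Chat, a framework that converts real-world medical notes into high-quality doctor-patient dialogue data and applies a three-stage fine-tuning strategy, substantially improving large language model performance on multi-turn clinical history taking.

Details

Motivation: Addresses the shortcomings of large language models in dynamic, multi-turn diagnostic settings that require iterative questioning and hypothesis refinement, while avoiding dependence on scarce and sensitive dialogue data.

Result: On clinical reasoning, the method achieves gains of +16.9 F1 and +21.0 Top-1 diagnostic accuracy over GPT-4o.

Insight: Contributions include: 1) a decision tree-guided generation and refinement pipeline that builds dialogue data from medical notes; 2) a three-stage fine-tuning strategy (supervised learning, simulated data augmentation, preference learning); 3) a new paradigm that reframes multi-turn history taking as a sequence of single-turn reasoning problems, improving interpretability and enabling local supervision and dynamic adaptation.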

Abstract: Effective clinical history taking is a foundational yet underexplored component of clinical reasoning. While large language models (LLMs) have shown promise on static benchmarks, they often fall short in dynamic, multi-turn diagnostic settings that require iterative questioning and hypothesis refinement. To address this gap, we propose Note2Chat, a note-driven framework that trains LLMs to conduct structured history taking and diagnosis by learning from widely available medical notes. Instead of relying on scarce and sensitive dialogue data, we convert real-world medical notes into high-quality doctor-patient dialogues using a decision tree-guided generation and refinement pipeline. We then propose a three-stage fine-tuning strategy combining supervised learning, simulated data augmentation, and preference learning. Furthermore, we propose a novel single-turn reasoning paradigm that reframes history taking as a sequence of single-turn reasoning problems. This design enhances interpretability and enables local supervision, dynamic adaptation, and greater sample efficiency. Experimental results show that our method substantially improves clinical reasoning, achieving gains of +16.9 F1 and +21.0 Top-1 diagnostic accuracy over GPT-4o. Our code and dataset can be found at https://github.com/zhentingsheng/Note2Chat.


[11] ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas cs.CL

Xiaoyu Tian, Haotian Wang, Shuaiting Chen, Hao Zhou, Kaichi Yu

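TL;DR: This paper proposes ASTRA, a fully automated end-to-end framework for training tool-augmented language model agents. Through scalable data synthesis and verifiable reinforcement learning, it addresses the challenges existing methods face in training robust tool-using agents: manual intervention, non-verifiable simulated environments, and unstable long-horizon, multi-turn learning.

Details

Motivation: Addresses current challenges in training tool-augmented LLM agents for multi-step decision making: the need for manual intervention, dependence on non-verifiable simulated environments, exclusive reliance on either supervised fine-tuning or reinforcement learning, and unstable long-horizon, multi-turn learning.

Result: Experiments on multiple agentic tool-use benchmarks show that ASTRA-trained models achieve state-of-the-art performance at comparable scales, approaching closed-source systems while preserving core reasoning ability.

Insight: The contribution is the integration of two complementary components: a pipeline that synthesizes diverse, structurally grounded trajectories from the static topology of tool-call graphs, and a framework that converts the rich compositional topology of human semantic reasoning into independent, code-executable, rule-verifiable environments. This couples data synthesis with deterministic multi-turn RL. A unified training methodology then integrates supervised fine-tuning with online RL, using trajectory-level rewards to balance task completion and interaction efficiency.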

Abstract: Large language models (LLMs) are increasingly used as tool-augmented agents for multi-step decision making, yet training robust tool-using agents remains challenging. Existing methods still require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either supervised fine-tuning (SFT) or reinforcement learning (RL), and struggle with stable long-horizon, multi-turn learning. To address these challenges, we introduce ASTRA, a fully automated end-to-end framework for training tool-augmented language model agents via scalable data synthesis and verifiable reinforcement learning. ASTRA integrates two complementary components. First, a pipeline that leverages the static topology of tool-call graphs synthesizes diverse, structurally grounded trajectories, instilling broad and transferable tool-use competence. Second, an environment synthesis framework that captures the rich, compositional topology of human semantic reasoning converts decomposed question-answer traces into independent, code-executable, and rule-verifiable environments, enabling deterministic multi-turn RL. Based on this method, we develop a unified training methodology that integrates SFT with online RL using trajectory-level rewards to balance task completion and interaction efficiency. Experiments on multiple agentic tool-use benchmarks demonstrate that ASTRA-trained models achieve state-of-the-art performance at comparable scales, approaching closed-source systems while preserving core reasoning ability. We release the full pipelines, environments, and trained models at https://github.com/LianjiaTech/astra.


[12] Do Not Waste Your Rollouts: Recycling Search Experience for Efficient Test-Time Scaling cs.CL | cs.LG

Xinglin Wang, Jiayi Shi, Shaoxiong Feng, Peiwen Yuan, Yiwei Li

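TL;DR: This paper proposes Recycling Search Experience (RSE), a training-free strategy for improving the reasoning efficiency of large language models in test-time scaling. It distills the intermediate trajectories produced during search (both successful conclusions and failure patterns) into a shared experience bank, turning isolated search attempts into a cumulative process that avoids redundant computation and explores the solution space more efficiently.

Details

Motivation: Existing test-time scaling methods treat each rollout as a disposable sample and discard its intermediate insights, so models repeatedly re-derive discovered conclusions or revisit known dead ends, wasting substantial compute. The paper targets this systemic memorylessness to make search more efficient.

Result: Extensive experiments on HMMT24, HMMT25, IMO-Bench, and HLE show that, at comparable computational cost, RSE consistently outperforms strong baselines and achieves state-of-the-art scaling efficiency.

Insight: The core idea is a training-free, self-guided experience recycling framework that recasts search from a series of independent trials into a cumulative process. Two takeaways: 1) a shared experience bank explicitly stores and reuses intermediate conclusions (positive recycling) and failure patterns (negative recycling); 2) a theoretical analysis formalizes the efficiency gain over independent sampling, giving a systematic way to reduce redundant computation at inference time.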

Abstract: Test-Time Scaling enhances the reasoning capabilities of Large Language Models by allocating additional inference compute to broaden the exploration of the solution space. However, existing search strategies typically treat rollouts as disposable samples, where valuable intermediate insights are effectively discarded after each trial. This systemic memorylessness leads to massive computational redundancy, as models repeatedly re-derive discovered conclusions and revisit known dead ends across extensive attempts. To bridge this gap, we propose Recycling Search Experience (RSE), a self-guided, training-free strategy that turns test-time search from a series of isolated trials into a cumulative process. By actively distilling raw trajectories into a shared experience bank, RSE enables positive recycling of intermediate conclusions to shortcut redundant derivations and negative recycling of failure patterns to prune encountered dead ends. Theoretically, we provide an analysis that formalizes the efficiency gains of RSE, validating its advantage over independent sampling in solving complex reasoning tasks. Empirically, extensive experiments on HMMT24, HMMT25, IMO-Bench, and HLE show that RSE consistently outperforms strong baselines with comparable computational cost, achieving state-of-the-art scaling efficiency.
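
The shared experience bank can be pictured with a small sketch. Everything here is hypothetical: the `ExperienceBank` class, the prompt wording, and the `rollout` and `distill` callables stand in for model calls that the abstract does not describe.

```python
from dataclasses import dataclass, field


@dataclass
class ExperienceBank:
    """Holds distilled conclusions (positive) and dead ends (negative)."""
    conclusions: list = field(default_factory=list)
    dead_ends: list = field(default_factory=list)

    def add(self, conclusions, dead_ends):
        # Deduplicate so repeated rollouts do not bloat the prompt.
        for item in conclusions:
            if item not in self.conclusions:
                self.conclusions.append(item)
        for item in dead_ends:
            if item not in self.dead_ends:
                self.dead_ends.append(item)

    def render(self, problem):
        """Build the prompt for the next rollout from the bank's contents."""
        lines = [f"Problem: {problem}"]
        if self.conclusions:
            lines.append("Established so far:")
            lines += [f"- {c}" for c in self.conclusions]
        if self.dead_ends:
            lines.append("Dead ends to avoid:")
            lines += [f"- {d}" for d in self.dead_ends]
        return "\n".join(lines)


def search(problem, rollout, distill, n_rollouts):
    """Run rollouts sequentially, feeding each one the accumulated bank.

    rollout(prompt) -> trajectory and distill(trajectory) ->
    (conclusions, dead_ends) are caller-supplied placeholders.
    """
    bank = ExperienceBank()
    for _ in range(n_rollouts):
        trajectory = rollout(bank.render(problem))
        bank.add(*distill(trajectory))
    return bank
```

The contrast with independent sampling is that each rollout's prompt depends on what earlier rollouts found, so the loop is cumulative rather than a set of isolated trials.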


[13] Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents cs.CL

Hojae Han, Heeyun Jung, Jongyoon Kim, Seung-won Hwang

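TL;DR: The paper proposes DAVID-GRPO, a framework that addresses the sparse exploration, difficult credit assignment, and unstable training that small language models face in multi-hop reasoning under resource constraints. It stabilizes early learning with minimal supervision, assigns retrieval credit based on evidence recall, and resamples truncated near-miss trajectories to improve exploration.

Details

Motivation: Existing reinforcement learning methods depend on large models and dense exploration. Under realistic resource constraints, small language model agents with limited rollout budgets face sparse exploration, sparse credit assignment, and unstable training, and end up in a low-cost, low-accuracy regime. The paper aims to break this trade-off so that small models can achieve strong multi-hop reasoning under resource constraints.

Result: With agents of up to 1.5B parameters trained on only four RTX 3090 GPUs, DAVID-GRPO consistently outperforms prior RL methods designed for large-scale settings on six multi-hop QA benchmarks.

Insight: Contributions include: 1) a budget-efficient RL framework whose inductive biases (such as evidence-recall-based credit assignment and trajectory resampling) improve training stability and exploration efficiency for small models; 2) evidence that, with the right method, small agents can be trained at low cost with high accuracy, challenging the assumption that resources and performance must trade off.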

Abstract: While reinforcement learning (RL) has empowered multi-turn reasoning agents with retrieval and tools, existing successes largely depend on extensive on-policy rollouts in high-cost, high-accuracy regimes. Under realistic resource constraints that cannot support large models or dense explorations, however, small language model agents fall into a low-cost, low-accuracy regime, where limited rollout budgets lead to sparse exploration, sparse credit assignment, and unstable training. In this work, we challenge this trade-off and show that small language models can achieve strong multi-hop reasoning under resource constraints. We introduce DAVID-GRPO, a budget-efficient RL framework that (i) stabilizes early learning with minimal supervision, (ii) assigns retrieval credit based on evidence recall, and (iii) improves exploration by resampling truncated near-miss trajectories. Evaluated on agents up to 1.5B parameters trained on only four RTX 3090 GPUs, DAVID-GRPO consistently outperforms prior RL methods designed for large-scale settings on six multi-hop QA benchmarks. These results show that with the right inductive biases, small agents can achieve low training cost with high accuracy.
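
Evidence recall, on which component (ii) bases retrieval credit, is the fraction of gold evidence documents that the agent retrieved. The sketch below computes it and folds it into a reward; the `shaped_reward` combination and its weight are assumptions for illustration and are not described in the abstract.

```python
def evidence_recall(retrieved_ids, gold_ids):
    """Fraction of gold evidence documents that appear among the retrieved ones."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    return len(gold & set(retrieved_ids)) / len(gold)


def shaped_reward(answer_correct, retrieved_ids, gold_ids, recall_weight=0.5):
    """Hypothetical reward: answer correctness plus weighted evidence recall."""
    base = 1.0 if answer_correct else 0.0
    return base + recall_weight * evidence_recall(retrieved_ids, gold_ids)


if __name__ == "__main__":
    # One of the two gold documents was retrieved, so recall is 0.5.
    print(evidence_recall(["d1", "d3", "d7"], ["d1", "d2"]))
    print(shaped_reward(False, ["d1", "d3", "d7"], ["d1", "d2"]))
```

A recall term of this kind gives a wrong-answer trajectory partial credit for finding relevant evidence, which is one way to make the reward less sparse under a small rollout budget.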


[14] Toward Culturally Aligned LLMs through Ontology-Guided Multi-Agent Reasoning cs.CL | cs.AI | cs.IR | cs.MA | cs.SI

Wonduk Seo, Wonseok Choi, Junseo Koh, Juhyeon Lee, Hyunjin An

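TL;DR: This paper proposes OG-MAR, a framework that improves the cultural alignment of large language models through ontology-guided multi-agent reasoning. It builds a global cultural ontology from World Values Survey data, retrieves ontology-consistent relations and demographically similar profiles at inference time to instantiate multiple value-persona agents, and has a judgment agent synthesize their outputs while enforcing ontology consistency and demographic proximity.

Details

Motivation: Addresses the misalignment, inconsistency, and poor interpretability of large language models in culturally sensitive decision making, which stem from skewed pretraining data and the absence of structured value representations.

Result: On regional social-survey benchmarks across four LLM backbones, OG-MAR improves cultural alignment and robustness over competitive baselines and produces more transparent reasoning traces.

Insight: The contribution is the combination of a structured cultural ontology with multi-agent reasoning. Instantiating value-persona agents from demographic profiles and enforcing ontology consistency yields cultural value alignment that is better grounded, more consistent, and more interpretable.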

Abstract: Large Language Models (LLMs) increasingly support culturally sensitive decision making, yet often exhibit misalignment due to skewed pretraining data and the absence of structured value representations. Existing methods can steer outputs, but often lack demographic grounding and treat values as independent, unstructured signals, reducing consistency and interpretability. We propose OG-MAR, an Ontology-Guided Multi-Agent Reasoning framework. OG-MAR summarizes respondent-specific values from the World Values Survey (WVS) and constructs a global cultural ontology by eliciting relations over a fixed taxonomy via competency questions. At inference time, it retrieves ontology-consistent relations and demographically similar profiles to instantiate multiple value-persona agents, whose outputs are synthesized by a judgment agent that enforces ontology consistency and demographic proximity. Experiments on regional social-survey benchmarks across four LLM backbones show that OG-MAR improves cultural alignment and robustness over competitive baselines, while producing more transparent reasoning traces.


[15] TACLer: Tailored Curriculum Reinforcement Learning for Efficient Reasoning cs.CL | cs.AI

Huiyuan Lai, Malvina Nissim

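TL;DR: This paper proposes TACLer, a tailored curriculum reinforcement learning framework for large language models (LLMs) that aims to improve learning and reasoning efficiency on complex reasoning tasks. By gradually increasing data complexity and introducing a hybrid Thinking/NoThinking reasoning mode, it raises accuracy while reducing compute cost and inference token usage.

Details

Motivation: Eliciting long chain-of-thought (CoT) reasoning for complex tasks typically requires large-scale RL training, which is computationally expensive and often leads to overthinking with redundant intermediate steps. A more efficient approach to learning and reasoning is needed.

Result: On four math datasets, TACLer cuts training compute by over 50% compared with long thinking models, reduces inference token usage by over 42% relative to the base model, and improves accuracy by over 9%, consistently outperforming state-of-the-art NoThinking and Thinking baselines.

Insight: Contributions include: 1) tailored curriculum learning that adapts what is learned to the model's proficiency; 2) a hybrid Thinking/NoThinking reasoning paradigm that balances accuracy and efficiency. Viewed objectively, combining curriculum learning with RL to optimize the LLM reasoning process offers a new route to efficient training and inference.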

Abstract: Large Language Models (LLMs) have shown remarkable performance on complex reasoning tasks, especially when equipped with long chain-of-thought (CoT) reasoning. However, eliciting long CoT typically requires large-scale reinforcement learning (RL) training, while often leading to overthinking with redundant intermediate steps. To improve learning and reasoning efficiency, while preserving or even enhancing performance, we propose TACLer, a model-tailored curriculum reinforcement learning framework that gradually increases the complexity of the data based on the model’s proficiency in multi-stage RL training. TACLer features two core components: (i) tailored curriculum learning that determines what knowledge the model lacks and needs to learn in progressive stages; (ii) a hybrid Thinking/NoThinking reasoning paradigm that balances accuracy and efficiency by enabling or disabling the Thinking mode. Our experiments show that TACLer yields a twofold advantage in learning and reasoning: (i) it reduces computational cost, cutting training compute by over 50% compared to long thinking models and reducing inference token usage by over 42% relative to the base model; and (ii) it improves accuracy by over 9% over the base model, consistently outperforming state-of-the-art NoThinking and Thinking baselines across four math datasets with complex problems.


[16] Procedural Pretraining: Warming Up Language Models with Abstract Data cs.CL | cs.LG

Liangze Jiang, Zachary Shinnick, Anton van den Hengel, Hemanth Saratchandran, Damien Teney

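TL;DR: This paper proposes procedural pretraining: before standard pretraining, the language model is warmed up on abstract structured data generated by formal languages and simple algorithms. This markedly improves algorithmic skills (such as context recall) and lets subsequent pretraining on natural language, code, and math reach the same or better results with less data.

Details

Motivation: The prevailing paradigm pretrains language models directly on web-scale corpora. The paper explores an alternative in which the model is first exposed to abstract structured data, much as humans learn simple logic and mathematics before higher reasoning, to ease the later acquisition of rich semantic knowledge.

Result: On context recall (Needle-in-a-haystack), pretraining on Dyck sequences (balanced brackets) lifts accuracy from 10% to 98%. For larger models (up to 1.3B parameters), front-loading as little as 0.1% procedural data significantly outperforms standard pretraining on C4, CodeParrot, and DeepMind-Math, and reaches the same loss with only 55%, 67%, and 86% of the original data, respectively.

Insight: The core idea is procedural pretraining as a lightweight method that points toward disentangling knowledge acquisition from reasoning. Mechanistically, procedural data instils non-trivial structure in both attention and MLP layers: the former matters most for structured domains such as code, the latter for language. This opens a path to combining multiple forms of procedural data to accelerate and improve pretraining.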

Abstract: Pretraining directly on web-scale corpora is the de facto paradigm for building language models. We study an alternative setting where the model is initially exposed to abstract structured data, as a means to ease the subsequent acquisition of rich semantic knowledge, much like humans learn simple logic and mathematics before higher reasoning. We specifically focus on procedural data, generated by formal languages and other simple algorithms, as such abstract data. We first diagnose the algorithmic skills that different forms of procedural data can improve, often significantly. For example, on context recall (Needle-in-a-haystack), the accuracy jumps from 10% to 98% when pretraining on Dyck sequences (balanced brackets). Second, we study how these gains are reflected in pretraining larger models (up to 1.3B). We find that front-loading as little as 0.1% procedural data significantly outperforms standard pretraining on natural language, code, and informal mathematics (C4, CodeParrot, and DeepMind-Math datasets). Notably, this procedural pretraining enables the models to reach the same loss value with only 55%, 67%, and 86% of the original data, respectively. Third, we explore the mechanisms behind these gains and find that procedural pretraining instils non-trivial structure in both attention and MLP layers. The former is particularly important for structured domains (e.g. code), and the latter for language. Finally, we lay a path for combining multiple forms of procedural data. Our results show that procedural pretraining is a simple, lightweight means of improving performance and accelerating language model pretraining, ultimately suggesting the promise of disentangling knowledge acquisition from reasoning in LLMs.
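
For readers unfamiliar with Dyck sequences, the sketch below generates random balanced-bracket strings and checks them. It illustrates the kind of procedural data the abstract mentions; the bracket alphabet, the 0.5 open probability, and the function names are assumptions and do not reflect the paper's actual data generator.

```python
import random

# Assumed alphabet of three bracket types.
PAIRS = {"(": ")", "[": "]", "{": "}"}


def dyck_sequence(n_pairs, rng):
    """Generate a random balanced string containing exactly n_pairs bracket pairs."""
    out, stack = [], []
    opens_left = n_pairs
    while opens_left or stack:
        # Open a bracket when nothing is pending, or by coin flip while opens remain.
        if opens_left and (not stack or rng.random() < 0.5):
            bracket = rng.choice(list(PAIRS))
            stack.append(bracket)
            out.append(bracket)
            opens_left -= 1
        else:
            # Otherwise close the most recently opened bracket.
            out.append(PAIRS[stack.pop()])
    return "".join(out)


def is_balanced(s):
    """Check that every bracket is closed by its matching partner, in order."""
    stack = []
    for ch in s:
        if ch in PAIRS:
            stack.append(ch)
        elif not stack or PAIRS[stack.pop()] != ch:
            return False
    return not stack


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(dyck_sequence(6, rng))
```

Predicting the correct closing bracket requires tracking the most recent unclosed opener, which is plausibly why such data exercises recall-like skills.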


[17] CoFrGeNet: Continued Fraction Architectures for Language Generation cs.CL | cs.AI

Amit Dhurandhar, Vijil Chenthamarakshan, Dennis Wei, Tejaswini Pedapati, Karthikeyan Natesan Ramamurthy

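TL;DR: This paper proposes CoFrGeNets, a new family of generative model architectures based on continued fractions, intended to replace Multi-head Attention and Feed-Forward Networks in Transformers. With new architectural components and custom gradient formulations, the approach reduces parameters (to 1/2 to 2/3 of the original model) and shortens pre-training time on models such as GPT2-xl and Llama3, while remaining competitive with, and sometimes superior to, the original models on downstream classification, Q&A, reasoning, and text understanding tasks.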

Details

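Motivation: Transformers are currently the preferred architecture for language generation, but they have large parameter counts and high compute costs. Inspired by continued fractions, this paper proposes a new function class and a corresponding architecture that replace the core Transformer components with fewer parameters, reducing model complexity and training cost.

Result: Experiments cover two different Transformer architectures, GPT2-xl (1.5B) and Llama3 (3.2B); GPT2-xl is pre-trained on OpenWebText and GneissWeb, and Llama3 on the docling data mix. With 1/2 to 2/3 of the original parameters and shorter pre-training time, the models perform on par with, and sometimes better than, the original models on downstream classification, Q&A, reasoning, and text understanding tasks.

Insight: The contributions are: (1) a new function class for generative modeling and architectural components based on continued fractions (CoFrGeNets) that can replace Multi-head Attention and Feed-Forward Networks in Transformers; (2) custom gradient formulations that optimize the components more accurately and efficiently than standard PyTorch gradients; (3) components that act as plug-in replacements, requiring little change to existing Transformer training or inference procedures and easing industrial deployment. Viewed objectively, the method substantially reduces parameter count and training time while preserving performance, offering a new direction for efficient language model design.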

Abstract: Transformers are arguably the preferred architecture for language generation. In this paper, inspired by continued fractions, we introduce a new function class for generative modeling. The architecture family implementing this function class is named CoFrGeNets - Continued Fraction Generative Networks. We design novel architectural components based on this function class that can replace Multi-head Attention and Feed-Forward Networks in Transformer blocks while requiring far fewer parameters. We derive custom gradient formulations to optimize the proposed components more accurately and efficiently than using standard PyTorch-based gradients. Our components are a plug-in replacement requiring little change in training or inference procedures that have already been put in place for Transformer-based models, thus making our approach easy to incorporate in large industrial workflows. We experiment on two very different Transformer architectures, GPT2-xl (1.5B) and Llama3 (3.2B), pre-training the former on OpenWebText and GneissWeb and the latter on the docling data mix, which consists of nine different datasets. Results show that the performance on downstream classification, Q&A, reasoning and text understanding tasks of our models is competitive and sometimes even superior to the original models with $\frac{2}{3}$ to $\frac{1}{2}$ the parameters and shorter pre-training time. We believe that future implementations customized to hardware will further bring out the true potential of our architectures.
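
For readers unfamiliar with the underlying mathematical object, the sketch below evaluates a finite continued fraction a0 + 1/(a1 + 1/(a2 + ...)) with exact rational arithmetic. It illustrates only the classical definition. The abstract does not specify how CoFrGeNets parameterize or train such expressions, so this is not the authors' layer, and the name `continued_fraction` is a hypothetical helper.

```python
from fractions import Fraction


def continued_fraction(coeffs):
    """Evaluate a0 + 1/(a1 + 1/(a2 + ... + 1/an)) exactly.

    coeffs is a non-empty list [a0, a1, ..., an]. Evaluation runs from the
    innermost term outward; a zero intermediate value raises
    ZeroDivisionError.
    """
    value = Fraction(coeffs[-1])
    for a in reversed(coeffs[:-1]):
        value = a + 1 / value
    return value


if __name__ == "__main__":
    # [1; 2, 2, 2, 2] is a truncation of the expansion of sqrt(2).
    approx = continued_fraction([1, 2, 2, 2, 2])
    print(approx, float(approx))  # 41/29, close to 1.4142
```

Short truncations already give good rational approximations. That is one intuition for why a continued-fraction function class might be parameter-efficient, though the abstract itself makes no such argument.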


[18] Zonkey: A Hierarchical Diffusion Language Model with Differentiable Tokenization and Probabilistic Attention cs.CL

Alon Rozental

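TL;DR: Zonkey is a hierarchical diffusion language model that uses a differentiable tokenizer and a Probabilistic Attention mechanism to build an end-to-end trainable pipeline from raw characters to document-level representations, aiming to overcome the limitations of fixed tokenizers (such as BPE) in conventional LLMs.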

Details

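Motivation: To address the problem that fixed, non-differentiable tokenizers (such as BPE) in large language models hinder end-to-end optimization and adapt poorly to noisy or domain-specific data.

Result: Trained end-to-end on Wikipedia, Zonkey generates coherent, variable-length text from noise, exhibits emergent hierarchical structure, and shows promising qualitative alignment to data distributions compared with entropy-based learnable tokenizers.

Insight: The innovations include a differentiable tokenizer (Segment Splitter) that learns probabilistic beginning-of-sequence decisions; a Probabilistic Attention mechanism that supports soft masking and variable-length outputs; hierarchical compression combined with the Denoising Diffusion Mixed Model (DDMM) for stable denoising in latent space; and a Stitcher that ensures overlap invariance across segments. Together these advance toward fully gradient-based LLMs.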

Abstract: Large language models (LLMs) have revolutionized natural language processing, yet they remain constrained by fixed, non-differentiable tokenizers like Byte Pair Encoding (BPE), which hinder end-to-end optimization and adaptability to noisy or domain-specific data. We introduce Zonkey, a hierarchical diffusion model that addresses these limitations through a fully trainable pipeline from raw characters to document-level representations. At its core is a differentiable tokenizer (Segment Splitter) that learns probabilistic beginning-of-sequence (BOS) decisions, enabling adaptive splits that emerge as linguistically meaningful (e.g., word boundaries at spaces, sentence starts at periods) without explicit supervision. This differentiability is enabled by our novel Probabilistic Attention mechanism, which incorporates position-specific existence probabilities to simulate soft masking over theoretically infinite sequences while preserving gradients. Sequences decay probabilistically rather than relying on end-of-sequence tokens, supporting variable-length outputs. Hierarchical levels compress sequences into higher abstractions (e.g., character n-grams to word-like vectors, then sentence-like), with reconstruction via our Denoising Diffusion Mixed Model (DDMM) for stable and efficient denoising in latent space. A Stitcher ensures overlap invariance across segments. Trained end-to-end on Wikipedia, Zonkey generates coherent, variable-length text from noise, demonstrating emergent hierarchies and promising qualitative alignment to data distributions compared to entropy-based learnable tokenizers. Our approach advances toward fully gradient-based LLMs, with potential for better domain adaptation and scalable generation. We release the source code for training and reproducing our experiments.


[19] KID: Knowledge-Injected Dual-Head Learning for Knowledge-Grounded Harmful Meme Detection cs.CL | cs.AI

Yaocong Li, Leihan Zhang, Le Zhang, Qiang Yan

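TL;DR: This paper proposes KID (Knowledge-Injected Dual-Head Learning), a framework for knowledge-grounded harmful meme detection. It uses label-constrained distillation to decompose complex meme understanding into structured reasoning chains and a dual-head architecture that jointly optimizes semantic generation and classification objectives, achieving SOTA performance on multilingual datasets.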

Details

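Motivation: Existing approaches focus mainly on intra-modal and inter-modal signal analysis, but understanding implicit toxicity often depends on background knowledge not explicitly present in the meme itself, so external knowledge is needed to improve detection.

Result: On five multilingual datasets spanning English, Chinese, and low-resource Bengali, KID achieves SOTA on both binary and multi-label harmful meme detection, improving over the previous best methods by 2.1% to 19.7% on the primary evaluation metrics.

Insight: The innovations are explicit reasoning chains, built through label-constrained distillation, that link visual evidence, background knowledge, and classification labels, and a dual-head architecture that jointly optimizes generation and classification objectives, yielding aligned linguistic reasoning and stable decision boundaries. Knowledge injection and dual-head joint learning contribute complementarily to robust, generalizable meme understanding.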

Abstract: Internet memes have become pervasive carriers of digital culture on social platforms. However, their heavy reliance on metaphors and sociocultural context also makes them subtle vehicles for harmful content, posing significant challenges for automated content moderation. Existing approaches primarily focus on intra-modal and inter-modal signal analysis, while the understanding of implicit toxicity often depends on background knowledge that is not explicitly present in the meme itself. To address this challenge, we propose KID, a Knowledge-Injected Dual-Head Learning framework for knowledge-grounded harmful meme detection. KID adopts a label-constrained distillation paradigm to decompose complex meme understanding into structured reasoning chains that explicitly link visual evidence, background knowledge, and classification labels. These chains guide the learning process by grounding external knowledge in meme-specific contexts. In addition, KID employs a dual-head architecture that jointly optimizes semantic generation and classification objectives, enabling aligned linguistic reasoning while maintaining stable decision boundaries. Extensive experiments on five multilingual datasets spanning English, Chinese, and low-resource Bengali demonstrate that KID achieves SOTA performance on both binary and multi-label harmful meme detection tasks, improving over previous best methods by 2.1%–19.7% across primary evaluation metrics. Ablation studies further confirm the effectiveness of knowledge injection and dual-head joint learning, highlighting their complementary contributions to robust and generalizable meme understanding. The code and data are available at https://github.com/PotatoDog1669/KID.


[20] Enhancing Conversational Agents via Task-Oriented Adversarial Memory Adaptation cs.CL

Yimin Deng, Yuqing Fu, Derong Xu, Yejing Wang, Wei Ni

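TL;DR: This paper proposes an Adversarial Memory Adaptation mechanism (AMA) that aligns offline memory construction and update with downstream task objectives by simulating task execution, addressing the difficulty conversational agents have with long conversations under context window limits. AMA comprises three agents, a challenger, an evaluator, and an adapter, which generate question-answer pairs, assess responses, and analyze errors during the offline phase to provide task-aware supervision signals, improving the memory system's adaptability to downstream tasks.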

Details

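Motivation: In existing memory systems, offline memory construction and update typically follow a predefined workflow or generic metrics and lack task-specific supervision. This misaligns offline memory preparation with task requirements and hurts downstream task performance.

Result: Extensive experiments on the long-dialogue benchmark LoCoMo demonstrate AMA's effectiveness; the mechanism can be integrated into various existing memory systems to improve performance.

Insight: The innovation is a task-aware adversarial adaptation process in the offline phase, driven by simulated downstream inference (generated question-answer pairs) and error analysis. It updates both the memory construction strategy and the memory content, closing the gap between offline preparation and online task requirements.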

Abstract: Conversational agents struggle to handle long conversations due to context window limitations. Therefore, memory systems are developed to leverage essential historical information. Existing memory systems typically follow a pipeline of offline memory construction and update, and online retrieval. Despite the flexible online phase, the offline phase remains fixed and task-independent. In this phase, memory construction operates under a predefined workflow and fails to emphasize task relevant information. Meanwhile, memory updates are guided by generic metrics rather than task specific supervision. This leads to a misalignment between offline memory preparation and task requirements, which undermines downstream task performance. To this end, we propose an Adversarial Memory Adaptation mechanism (AMA) that aligns memory construction and update with task objectives by simulating task execution. Specifically, first, a challenger agent generates question answer pairs based on the original dialogues. The constructed memory is then used to answer these questions, simulating downstream inference. Subsequently, an evaluator agent assesses the responses and performs error analysis. Finally, an adapter agent analyzes the error cases and performs dual level updates on both the construction strategy and the content. Through this process, the memory system receives task aware supervision signals in advance during the offline phase, enhancing its adaptability to downstream tasks. AMA can be integrated into various existing memory systems, and extensive experiments on long dialogue benchmark LoCoMo demonstrate its effectiveness.


[21] Distribution-Aware Reward Estimation for Test-Time Reinforcement Learning cs.CL

Bodong Du, Xuanqi Huang, Xiaomeng Li

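TL;DR: This paper proposes Distribution-Aware Reward Estimation (DARE), a method that addresses biased reward estimation in test-time reinforcement learning (TTRL). Rather than relying on the single outcome of majority voting, it uses the full empirical rollout distribution, augmented with an exploration bonus and a distribution pruning mechanism, to produce more robust and informative reward estimates.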

Details

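Motivation: Existing TTRL methods rely on majority voting to produce deterministic rewards, implicitly assuming that the majority rollout provides a reliable learning signal. The authors argue this assumption is fragile: majority voting reduces the rollout distribution to a single outcome, discards information about non-majority but correct action candidates, and yields systematically biased reward estimates.

Result: Extensive experiments on challenging reasoning benchmarks (such as AIME 2024 and AMC) show that DARE improves optimization stability and final performance over recent baselines, with relative improvements of 25.3% on AIME 2024 and 5.3% on AMC.

Insight: The core innovation is shifting reward estimation from a single majority outcome to the full empirical rollout distribution, with an exploration bonus and distribution pruning to strengthen exploration and denoise rewards. This offers a new way to handle uncertainty and distributional information in reinforcement learning and may transfer to other settings that learn from unsupervised or weakly supervised feedback.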

Abstract: Test-time reinforcement learning (TTRL) enables large language models (LLMs) to self-improve on unlabeled inputs, but its effectiveness critically depends on how reward signals are estimated without ground-truth supervision. Most existing TTRL methods rely on majority voting (MV) over rollouts to produce deterministic rewards, implicitly assuming that the majority rollout provides a reliable learning signal. We show that this assumption is fragile: MV reduces the rollout distribution into a single outcome, discarding information about non-majority but correct action candidates, and yields systematically biased reward estimates. To address this, we propose Distribution-Aware Reward Estimation (DARE), which shifts reward estimation from a single majority outcome to the full empirical rollout distribution. DARE further augments this distribution-based reward with an exploration bonus and a distribution pruning mechanism for non-majority rollout exploration and reward denoising, yielding a more informative and robust reward estimation. Extensive experiments on challenging reasoning benchmarks show that DARE improves optimization stability and final performance over recent baselines, achieving relative improvements of 25.3% on challenging AIME 2024 and 5.3% on AMC.
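
The contrast between a majority-vote reward and a reward read off the empirical rollout distribution can be sketched as follows. This is a deliberately simplified illustration: using the raw answer frequency as the reward is an assumption made here, and DARE's exploration bonus and distribution pruning are omitted because the abstract does not give their formulas. Both function names are hypothetical.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Reward 1.0 for rollouts matching the most common answer, else 0.0."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


def empirical_frequency_rewards(answers):
    """Reward each rollout by the empirical frequency of its final answer.

    Simplifying assumption for illustration only, not DARE's estimator.
    """
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]


if __name__ == "__main__":
    # Final answers from six hypothetical rollouts on one unlabeled question.
    rollouts = ["42", "42", "42", "41", "41", "7"]
    print(majority_vote_rewards(rollouts))
    print(empirical_frequency_rewards(rollouts))
```

Under majority voting the two rollouts answering "41" get exactly the same reward as the outlier "7". The frequency-based variant keeps them distinguishable, which is the kind of information the abstract says majority voting discards.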


[22] Mil-SCORE: Benchmarking Long-Context Geospatial Reasoning and Planning in Large Language Models cs.CL

Aadi Palnitkar, Mingyang Mao, Nicholas Waytowich, Vinicius G. Goecks, Tinoosh Mohsenin

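TL;DR: This paper presents the Mil-SCORE benchmark, the first dataset of expert-authored, multi-hop questions grounded in a complex simulated military planning scenario, designed to evaluate large language models' reasoning and planning over long-context, multi-modal geospatial information.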

Details

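Motivation: Existing long-context benchmarks lack realistic scenarios that require selective reading and integration of heterogeneous, multi-modal information. The gap is especially clear for geospatial planning problems, such as planning large-scale military operations, which demand fast and accurate processing of maps, orders, intelligence reports, and other distributed data.

Result: The paper provides an evaluation protocol and baseline results for a range of contemporary vision-language models. Current systems perform poorly on Mil-SCORE, leaving substantial headroom and indicating that realistic, scenario-level long-context planning remains challenging.

Insight: The innovation is the first geospatially rich long-context benchmark aimed at high-stakes decision-making and planning. Its questions target both factual recall and multi-step reasoning (constraints, strategy, spatial analysis), providing a challenging testbed for evaluating how well models combine tactical and spatial reasoning.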

Abstract: As large language models (LLMs) are applied to increasingly longer and more complex tasks, there is a growing need for realistic long-context benchmarks that require selective reading and integration of heterogeneous, multi-modal information sources. This need is especially acute for geospatial planning problems, such as those found in planning for large-scale military operations, which demand fast and accurate reasoning over maps, orders, intelligence reports, and other distributed data. To address this gap, we present MilSCORE (Military Scenario Contextual Reasoning), to our knowledge the first scenario-level dataset of expert-authored, multi-hop questions grounded in a complex, simulated military planning scenario used for training. MilSCORE is designed to evaluate high-stakes decision-making and planning, probing LLMs’ ability to combine tactical and spatial reasoning across multiple sources and to reason over long-horizon, geospatially rich context. The benchmark includes a diverse set of question types across seven categories targeting both factual recall and multi-step reasoning about constraints, strategy, and spatial analysis. We provide an evaluation protocol and report baseline results for a range of contemporary vision-language models. Our findings highlight substantial headroom on MilSCORE, indicating that current systems struggle with realistic, scenario-level long-context planning, and positioning MilSCORE as a challenging testbed for future work.


[23] Embodied Task Planning via Graph-Informed Action Generation with Large Lanaguage Model cs.CL

Xiang Li, Ning Yan, Masood Mortazavi

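TL;DR: This paper proposes GiG, a new embodied task planning framework that addresses the poor strategy coherence and environment-constraint violations large language models exhibit in long-horizon planning. The framework organizes the agent's memory with a Graph-in-Graph architecture, encodes environmental states with a graph neural network, and adds a bounded lookahead module to strengthen planning.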

Details

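Motivation: Large language models show strong zero-shot reasoning, but as embodied agents in long-horizon planning they often fail because of context window limits or hallucinated transitions that violate environment logic. This paper targets that challenge so that agents can decompose high-level intent into actionable sub-goals in a dynamic, observed environment.

Result: On three embodied planning benchmarks (Robotouille Synchronous, Robotouille Asynchronous, and ALFWorld), the method outperforms state-of-the-art baselines, with Pass@1 gains of up to 22%, 37%, and 15%, respectively, at comparable or lower computational cost.

Insight: The main innovations are the Graph-in-Graph architecture for structuring an embodied agent's memory, retrieval of structure-aware priors by clustering graph embeddings, and a bounded lookahead module that uses symbolic transition logic for grounded action projection. This offers a new way to combine large language models with structured environment representations and planning logic.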

Abstract: While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning. Unlike open-ended text generation, embodied agents must decompose high-level intent into actionable sub-goals while strictly adhering to the logic of a dynamic, observed environment. Standard LLM planners frequently fail to maintain strategy coherence over extended horizons due to context window limitations, or hallucinate transitions that violate constraints. We propose GiG, a novel planning framework that structures embodied agents’ memory using a Graph-in-Graph architecture. Our approach employs a Graph Neural Network (GNN) to encode environmental states into embeddings, organizing these embeddings into action-connected execution trace graphs within an experience memory bank. By clustering these graph embeddings, the framework enables retrieval of structure-aware priors, allowing agents to ground current decisions in relevant past structural patterns. Furthermore, we introduce a novel bounded lookahead module that leverages symbolic transition logic to enhance the agents’ planning capabilities through grounded action projection. We evaluate our framework on three embodied planning benchmarks: Robotouille Synchronous, Robotouille Asynchronous, and ALFWorld. Our method outperforms state-of-the-art baselines, achieving Pass@1 performance gains of up to 22% on Robotouille Synchronous, 37% on Asynchronous, and 15% on ALFWorld with comparable or lower computational cost.


[24] OVD: On-policy Verbal Distillation cs.CL

Jing Xiong, Hui Shen, Shansan Gong, Yuxin Cheng, Jianghan Shen

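TL;DR: This paper proposes On-policy Verbal Distillation (OVD), an efficient framework for knowledge distillation. It replaces conventional token-level probability matching with trajectory matching based on discrete verbal scores (0–9) produced by teacher models, which substantially reduces memory consumption, removes the constraint that token-level alignment places on the student model's exploration, and allows effective use of interactive environment feedback.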

Details

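Motivation: Existing token-level on-policy distillation methods require token-level alignment between student and teacher models, which restricts the student's exploration, prevents effective use of interactive environment feedback, and causes severe memory bottlenecks in reinforcement learning.

Result: Extensive experiments on Web question answering and mathematical reasoning show that OVD substantially outperforms existing methods, with up to 12.9% absolute improvement in average EM on Web Q&A tasks and up to 25.7% on math benchmarks (when trained with only one random sample), along with better training efficiency.

Insight: The core innovation is moving the distillation supervision signal from token-level probability matching to trajectory-level matching based on the teacher's verbal scores. This removes the token-alignment constraint, frees the student model to explore, and greatly reduces memory overhead.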

Abstract: Knowledge distillation offers a promising path to transfer reasoning capabilities from large teacher models to efficient student models; however, existing token-level on-policy distillation methods require token-level alignment between the student and teacher models, which restricts the student model’s exploration ability, prevents effective use of interactive environment feedback, and suffers from severe memory bottlenecks in reinforcement learning. We introduce On-policy Verbal Distillation (OVD), a memory-efficient framework that replaces token-level probability matching with trajectory matching using discrete verbal scores (0–9) from teacher models. OVD dramatically reduces memory consumption while enabling on-policy distillation from teacher models with verbal feedback, and avoids token-level alignment, allowing the student model to freely explore the output space. Extensive experiments on Web question answering and mathematical reasoning tasks show that OVD substantially outperforms existing methods, delivering up to +12.9% absolute improvement in average EM on Web Q&A tasks and up to +25.7% gain on math benchmarks (when trained with only one random sample), while also exhibiting superior training efficiency. Our project page is available at https://OVD.github.io


[25] Token-Guard: Towards Token-Level Hallucination Control via Self-Checking Decoding cs.CL | cs.AI

Yifan Zhu, Huiqiang Rong, Haoran Luo

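TL;DR: This paper proposes Token-Guard, a token-level hallucination control method based on self-checking decoding. It performs internal verification at each reasoning step to detect hallucinated tokens before they propagate, and dynamically corrects errors through latent-space scoring, iterative pruning, and regeneration.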

Details

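Motivation: To address hallucination in large language models (LLMs): existing methods such as RAG and RLHF are resource-intensive, while decoding-based methods lack explicit hallucination control, so a lightweight method with fine-grained control is needed.

Result: Experiments on HALU datasets show that Token-Guard substantially reduces hallucinations and improves generation accuracy.

Insight: The innovation is refining hallucination control to the token level, using self-checking decoding and latent-space risk scoring for dynamic, modular error detection and correction, which offers a scalable solution for reliable LLM outputs.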

Abstract: Large Language Models (LLMs) often hallucinate, generating content inconsistent with the input. Retrieval-Augmented Generation (RAG) and Reinforcement Learning with Human Feedback (RLHF) can mitigate hallucinations but require resource-intensive retrieval or large-scale fine-tuning. Decoding-based methods are lighter yet lack explicit hallucination control. To address this, we present Token-Guard, a token-level hallucination control method based on self-checking decoding. Token-Guard performs internal verification at each reasoning step to detect hallucinated tokens before they propagate. Candidate fragments are further evaluated in a latent space with explicit hallucination risk scoring, while iterative pruning and regeneration dynamically correct detected errors. Experiments on HALU datasets show Token-Guard substantially reduces hallucinations and improves generation accuracy, offering a scalable, modular solution for reliable LLM outputs. Our code is publicly available.


[26] Causal Autoregressive Diffusion Language Model cs.CL

Junhao Ruan, Bei Li, Yongjing Yin, Pengcheng Huang, Xin Chen

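TL;DR: This paper proposes the Causal Autoregressive Diffusion (CARD) framework, which combines the training efficiency of autoregressive models (ARMs) with the high-throughput inference of diffusion models. CARD reformulates the diffusion process within a strictly causal attention mask, enabling dense per-token supervision in a single forward pass. For optimization stability it introduces a soft-tailed masking scheme and a context-aware reweighting mechanism derived from signal-to-noise principles, and it supports dynamic parallel decoding. Experiments show that CARD outperforms existing discrete diffusion baselines and reduces training latency by 3× compared with block diffusion methods.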

Details

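Motivation: To address the slow inference of autoregressive models and the low training efficiency of diffusion models by unifying the strengths of both, toward next-generation efficient large language models.

Result: CARD outperforms existing discrete diffusion baselines while reducing training latency by 3× compared with block diffusion methods, achieving ARM-level data efficiency together with the latency benefits of parallel generation.

Insight: The innovations are reformulating the diffusion process within a strictly causal attention mask, which enables dense supervision in a single forward pass; soft-tailed masking and context-aware reweighting to stabilize optimization; and confidence-based dynamic parallel decoding that uses KV-caching to generate variable-length token sequences.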

Abstract: In this work, we propose Causal Autoregressive Diffusion (CARD), a novel framework that unifies the training efficiency of ARMs with the high-throughput inference of diffusion models. CARD reformulates the diffusion process within a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass. To address the optimization instability of causal diffusion, we introduce a soft-tailed masking schema to preserve local context and a context-aware reweighting mechanism derived from signal-to-noise principles. This design enables dynamic parallel decoding, where the model leverages KV-caching to adaptively generate variable-length token sequences based on confidence. Empirically, CARD outperforms existing discrete diffusion baselines while reducing training latency by $3\times$ compared to block diffusion methods. Our results demonstrate that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation, establishing a robust paradigm for next-generation efficient LLMs.


[27] Thinking Out of Order: When Output Order Stops Reflecting Reasoning Order in Diffusion Language Models cs.CL | cs.AI

Longxuan Yu, Yu Fu, Shaorong Zhang, Hui Liu, Mukund Varma T

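TL;DR: This paper studies the ability of diffusion language models to decouple output order from reasoning order. When prompts ask for the answer before the reasoning, autoregressive models degrade sharply because of their fixed generation order, whereas masked diffusion language models remain stable, a property termed “order robustness”. The authors validate this on the newly constructed ReasonOrderQA benchmark and analyze the underlying mechanism and failure conditions.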

Details

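Motivation: The fixed left-to-right generation order of autoregressive language models becomes a fundamental limitation when the required output structure conflicts with the natural reasoning order (for example, when presentation or schema constraints require the answer before the explanation), forcing the model to commit to an answer prematurely.

Result: On GSM8K, Math500, and the newly proposed ReasonOrderQA benchmark, when prompts request answers first, autoregressive models show up to a 67% relative accuracy drop compared with standard chain-of-thought ordering, while masked diffusion language models drop by at most 14% in relative terms, showing marked order robustness.

Insight: The core idea is to use the parallel iterative refinement of masked diffusion language models to decouple computation order from output structure. MDLMs achieve order robustness by stabilizing simpler tokens (such as reasoning steps) earlier in the diffusion process than complex tokens (such as final answers). The paper also identifies failure conditions where this advantage weakens, delimiting the boundaries of order robustness.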

Abstract: Autoregressive (AR) language models enforce a fixed left-to-right generation order, creating a fundamental limitation when the required output structure conflicts with natural reasoning (e.g., producing answers before explanations due to presentation or schema constraints). In such cases, AR models must commit to answers before generating intermediate reasoning, and this rigid constraint forces premature commitment. Masked diffusion language models (MDLMs), which iteratively refine all tokens in parallel, offer a way to decouple computation order from output structure. We validate this capability on GSM8K, Math500, and ReasonOrderQA, a benchmark we introduce with controlled difficulty and order-level evaluation. When prompts request answers before reasoning, AR models exhibit large accuracy gaps compared to standard chain-of-thought ordering (up to 67% relative drop), while MDLMs remain stable ($\leq$14% relative drop), a property we term “order robustness”. Using ReasonOrderQA, we present evidence that MDLMs achieve order robustness by stabilizing simpler tokens (e.g., reasoning steps) earlier in the diffusion process than complex ones (e.g., final answers), enabling reasoning tokens to stabilize before answer commitment. Finally, we identify failure conditions where this advantage weakens, outlining the limits required for order robustness.
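
The "relative drop" figures quoted above are ratios against the baseline accuracy, not percentage-point differences. The sketch below spells out that arithmetic; the function name `relative_drop` and the example accuracies are invented for illustration and are not results from the paper.

```python
def relative_drop(baseline_acc, new_acc):
    """Return the drop from baseline_acc to new_acc as a fraction of baseline."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (baseline_acc - new_acc) / baseline_acc


if __name__ == "__main__":
    # Made-up numbers: reasoning-first accuracy 0.80, answer-first 0.40.
    cot_first, answer_first = 0.80, 0.40
    drop = relative_drop(cot_first, answer_first)
    # A 40-point absolute gap here corresponds to a 50% relative drop.
    print(f"absolute gap: {cot_first - answer_first:.2f}, relative drop: {drop:.0%}")
```

The same absolute gap is a larger relative drop for a weaker baseline, so relative drops are most meaningful when read alongside the baseline accuracies.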


[28] MasalBench: A Benchmark for Contextual and Cross-Cultural Understanding of Persian Proverbs in LLMs cs.CL

Ghazal Kalhor, Behnam Bahrak

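TL;DR: This paper introduces MasalBench, a benchmark for evaluating large language models' contextual and cross-cultural understanding of Persian proverbs in the low-resource Persian language. The study evaluates eight state-of-the-art LLMs and finds that they identify Persian proverbs in context well (accuracy above 0.90), but performance drops considerably when identifying equivalent English proverbs (0.79 accuracy for the best model), revealing limitations of current LLMs in cultural knowledge and analogical reasoning.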

Details

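Motivation: The ability of current multilingual LLMs to understand figurative language, especially proverbs, in low-resource languages such as Persian has not been adequately evaluated. Because proverbs are a key component of everyday conversation, a dedicated benchmark is needed to test LLMs' contextual and cross-cultural understanding.

Result: Eight SOTA LLMs were evaluated on MasalBench. On identifying Persian proverbs in context, model accuracy exceeds 0.90; on the cross-cultural task of identifying equivalent English proverbs, the best model reaches only 0.79 accuracy, a clear drop.

Insight: The innovation is MasalBench, the first benchmark focused on Persian proverb understanding, which underlines the importance of evaluating LLMs' culture-specific knowledge in low-resource languages. Viewed objectively, the work provides an extensible framework for assessing cross-cultural understanding in other low-resource languages and highlights the shortcomings of current multilingual LLMs in cultural transfer and analogical reasoning, which is instructive for improving their cultural adaptability.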

Abstract: In recent years, multilingual Large Language Models (LLMs) have become an inseparable part of daily life, making it crucial for them to master the rules of conversational language in order to communicate effectively with users. While previous work has evaluated LLMs’ understanding of figurative language in high-resource languages, their performance in low-resource languages remains underexplored. In this paper, we introduce MasalBench, a comprehensive benchmark for assessing LLMs’ contextual and cross-cultural understanding of Persian proverbs, which are a key component of conversation in this low-resource language. We evaluate eight state-of-the-art LLMs on MasalBench and find that they perform well in identifying Persian proverbs in context, achieving accuracies above 0.90. However, their performance drops considerably when tasked with identifying equivalent English proverbs, with the best model achieving 0.79 accuracy. Our findings highlight the limitations of current LLMs in cultural knowledge and analogical reasoning, and they provide a framework for assessing cross-cultural understanding in other low-resource languages. MasalBench is available at https://github.com/kalhorghazal/MasalBench.


[29] $G^2$-Reader: Dual Evolving Graphs for Multimodal Document QA cs.CL

Yaxin Du, Junru Song, Yifan Zhou, Cheng Wang, Jiahao Gu

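TL;DR: This paper proposes $G^2$-Reader, a dual-graph system for multimodal document question answering. It evolves a Content Graph to preserve document-native structure and cross-modal semantics, and maintains a Planning Graph to track intermediate findings and guide stepwise evidence retrieval, addressing the structural damage and retrieval drift that conventional retrieval-augmented generation suffers on long documents with interleaved text, tables, and images.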

Details

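Motivation: The aim is to fix two main weaknesses of conventional retrieval-augmented generation when reading long multimodal documents (interleaved text, tables, and images): flat chunking breaks document-native structure and cross-modal alignment, and iterative retrieval in long contexts tends to loop on partial evidence or drift into irrelevant sections.

Result: On the VisDoMBench benchmark across five multimodal domains, $G^2$-Reader with Qwen3-VL-32B-Instruct reaches 66.21% average accuracy, outperforming several strong baselines and a standalone GPT-5 (53.08%) and setting a new SOTA.

Insight: The innovation is a mechanism in which two graphs co-evolve: the Content Graph gives a structured representation of the document, and the Planning Graph, an agentic directed acyclic graph, manages the reasoning process. The design combines global structure awareness with stepwise, stateful retrieval planning and offers a reusable framework for complex multimodal document understanding.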

Abstract: Retrieval-augmented generation is a practical paradigm for question answering over long documents, but it remains brittle for multimodal reading where text, tables, and figures are interleaved across many pages. First, flat chunking breaks document-native structure and cross-modal alignment, yielding semantic fragments that are hard to interpret in isolation. Second, even iterative retrieval can fail in long contexts by looping on partial evidence or drifting into irrelevant sections as noise accumulates, since each step is guided only by the current snippet without a persistent global search state. We introduce $G^2$-Reader, a dual-graph system, to address both issues. It evolves a Content Graph to preserve document-native structure and cross-modal semantics, and maintains a Planning Graph, an agentic directed acyclic graph of sub-questions, to track intermediate findings and guide stepwise navigation for evidence completion. On VisDoMBench across five multimodal domains, $G^2$-Reader with Qwen3-VL-32B-Instruct reaches 66.21% average accuracy, outperforming strong baselines and a standalone GPT-5 (53.08%).


[30] VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning cs.CL

Yibo Wang, Yongcheng Jing, Shunyu Liu, Hao Guan, Rong-cheng Tu

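TL;DR: This paper proposes VTC-R1, a new efficient long-context reasoning paradigm that renders intermediate reasoning segments into compact images used as “optical memory” and feeds them iteratively into vision-language models, addressing the computational efficiency bottleneck of long-context reasoning.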

Details

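Motivation: Long-context reasoning strengthens large language models on complex tasks but introduces severe computational efficiency bottlenecks. Existing efficient approaches often rely on complex additional training or external models for compression, which limits scalability and discards critical fine-grained information.

Result: Extensive experiments on benchmarks such as MATH500, AIME25, AMC23, and GPQA-D show that VTC-R1 consistently outperforms standard long-context reasoning and significantly improves inference efficiency, with a 2.7x speedup in end-to-end latency.

Insight: The core innovation is integrating vision-text compression into the reasoning process, using images as a compact intermediate representation (optical memory) instead of processing lengthy textual traces. This achieves a high compression ratio (3.4x) and high efficiency, offering a scalable solution for reasoning-intensive applications.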

Abstract: Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks, yet it introduces severe efficiency bottlenecks due to the computational complexity. Existing efficient approaches often rely on complex additional training or external models for compression, which limits scalability and discards critical fine-grained information. In this paper, we propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process. Instead of processing lengthy textual traces, VTC-R1 renders intermediate reasoning segments into compact images, which are iteratively fed back into vision-language models as “optical memory.” We construct a training dataset based on OpenR1-Math-220K, achieving 3.4x token compression, and fine-tune representative VLMs, Glyph and Qwen3-VL. Extensive experiments on benchmarks such as MATH500, AIME25, AMC23 and GPQA-D demonstrate that VTC-R1 consistently outperforms standard long-context reasoning. Furthermore, our approach significantly improves inference efficiency, achieving 2.7x speedup in end-to-end latency, highlighting its potential as a scalable solution for reasoning-intensive applications. Our code is available at https://github.com/w-yibo/VTC-R1.


[31] Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers cs.CL | cs.AI

Xin Chen, Feng Jiang, Yiqian Zhang, Hardy Chen, Shuo Yan

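TL;DR: This paper proposes Proactive Interactive Reasoning (PIR), a new paradigm that turns reasoning LLMs from passive solvers relying on internal “blind self-thinking” into interactive reasoners that proactively ask the user questions to clarify premises and intent. The method is implemented through uncertainty-aware supervised fine-tuning and user-simulator-based policy optimization, and it significantly improves performance while reducing computation on mathematical reasoning, code generation, and document editing tasks.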

Details

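Motivation: Existing reasoning LLMs based on Chain-of-Thought (CoT) are limited by “blind self-thinking”: they carry out extensive internal reasoning even when critical information is missing or ambiguous, leading to inefficiency or errors. This paper targets premise- and intent-level uncertainty rather than knowledge uncertainty alone.

Result: Experiments on mathematical reasoning, code generation, and document editing show that PIR consistently outperforms strong baselines, with up to 32.70% higher accuracy, 22.90% higher pass rate, and a 41.36 BLEU improvement, while cutting reasoning computation and unnecessary interaction turns by nearly half. Reliability evaluations on factual knowledge, question answering, and missing-premise scenarios further confirm its generalization and robustness.

Insight: The core innovation is treating interactive clarification as an intrinsic part of the reasoning process rather than an external tool call, with a dedicated fine-tuning and reinforcement learning framework that trains the model to recognize uncertainty and ask about it. This offers a route to AI assistants that are more efficient, more reliable, and closer to how humans collaborate.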

Abstract: Reasoning-oriented Large Language Models (LLMs) have achieved remarkable progress with Chain-of-Thought (CoT) prompting, yet they remain fundamentally limited by a blind self-thinking paradigm: performing extensive internal reasoning even when critical information is missing or ambiguous. We propose Proactive Interactive Reasoning (PIR), a new reasoning paradigm that transforms LLMs from passive solvers into proactive inquirers that interleave reasoning with clarification. Unlike existing search- or tool-based frameworks that primarily address knowledge uncertainty by querying external environments, PIR targets premise- and intent-level uncertainty through direct interaction with the user. PIR is implemented via two core components: (1) an uncertainty-aware supervised fine-tuning procedure that equips models with interactive reasoning capability, and (2) a user-simulator-based policy optimization framework driven by a composite reward that aligns model behavior with user intent. Extensive experiments on mathematical reasoning, code generation, and document editing demonstrate that PIR consistently outperforms strong baselines, achieving up to 32.70% higher accuracy, 22.90% higher pass rate, and 41.36 BLEU improvement, while reducing nearly half of the reasoning computation and unnecessary interaction turns. Further reliability evaluations on factual knowledge, question answering, and missing-premise scenarios confirm the strong generalization and robustness of PIR. Model and code are publicly available at: https://github.com/SUAT-AIRI/Proactive-Interactive-R1


[32] DynaWeb: Model-Based Reinforcement Learning of Web Agents cs.CL | cs.AI

Hang Ding, Peidong Liu, Junqiao Wang, Ziwei Ji, Meng Cao

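TL;DR: This paper proposes DynaWeb, a model-based reinforcement learning framework for training autonomous web agents. It trains a web world model to predict naturalistic web page representations given agent actions, creating a synthetic web environment in which the agent policy can generate large numbers of simulated trajectories for efficient online reinforcement learning.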

Details

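Motivation: Training autonomous web agents with large language models and reinforcement learning is hampered by interaction with the live internet, which is inefficient, costly, and risky, so a more efficient and scalable training method is needed.

Result: Experiments on the WebArena and WebVoyager benchmarks show that DynaWeb consistently and significantly improves the performance of state-of-the-art open-source web agent models.

Insight: The innovation is applying model-based reinforcement learning to web agent training: the policy is simulated by “imagination” in a synthetic environment (the world model), and real expert trajectories are interleaved to improve training stability and sample efficiency, providing a scalable solution for online agentic reinforcement learning.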

Abstract: The development of autonomous web agents, powered by Large Language Models (LLMs) and reinforcement learning (RL), represents a significant step towards general-purpose AI assistants. However, training these agents is severely hampered by the challenges of interacting with the live internet, which is inefficient, costly, and fraught with risks. Model-based reinforcement learning (MBRL) offers a promising solution by learning a world model of the environment to enable simulated interaction. This paper introduces DynaWeb, a novel MBRL framework that trains web agents through interacting with a web world model trained to predict naturalistic web page representations given agent actions. This model serves as a synthetic web environment where an agent policy can dream by generating vast quantities of rollout action trajectories for efficient online reinforcement learning. Beyond free policy rollouts, DynaWeb incorporates real expert trajectories from training data, which are randomly interleaved with on-policy rollouts during training to improve stability and sample efficiency. Experiments conducted on the challenging WebArena and WebVoyager benchmarks demonstrate that DynaWeb consistently and significantly improves the performance of state-of-the-art open-source web agent models. Our findings establish the viability of training web agents through imagination, offering a scalable and efficient way to scale up online agentic RL.


cs.CV

[33] MA-LipNet: Multi-Dimensional Attention Networks for Robust Lipreading cs.CV

Matteo Rossi

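TL;DR: This paper proposes MA-LipNet, a multi-dimensional attention network for more robust lipreading. It sequentially applies three modules (channel attention, joint spatial-temporal attention, and separate spatial-temporal attention) to refine visual features along the channel, spatial, and temporal dimensions, strengthening feature discriminability and suppressing irrelevant information.

Details

Motivation: Because articulatory gestures are subtle, existing lipreading methods often suffer from limited feature discriminability and poor generalization. The paper addresses this by purifying visual features along multiple dimensions.

Result: Extensive experiments on the CMLR and GRID datasets show that MA-LipNet significantly reduces character error rate and word error rate, validating its effectiveness and superiority over several state-of-the-art methods.

Insight: The novelty is a sequential framework of three dedicated attention modules that refines features from coarse to fine across multiple dimensions. The core insight is that channel, joint spatial-temporal, and separate spatial-temporal attention work together to improve the modeling of subtle lip movements and overall robustness.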

Abstract: Lipreading, the technology of decoding spoken content from silent videos of lip movements, holds significant application value in fields such as public security. However, due to the subtle nature of articulatory gestures, existing lipreading methods often suffer from limited feature discriminability and poor generalization capabilities. To address these challenges, this paper delves into the purification of visual features from temporal, spatial, and channel dimensions. We propose a novel method named Multi-Attention Lipreading Network (MA-LipNet). The core of MA-LipNet lies in its sequential application of three dedicated attention modules. Firstly, a Channel Attention (CA) module is employed to adaptively recalibrate channel-wise features, thereby mitigating interference from less informative channels. Subsequently, two spatio-temporal attention modules with distinct granularities, Joint Spatial-Temporal Attention (JSTA) and Separate Spatial-Temporal Attention (SSTA), are leveraged to suppress the influence of irrelevant pixels and video frames. The JSTA module performs a coarse-grained filtering by computing a unified weight map across the spatio-temporal dimensions, while the SSTA module conducts a more fine-grained refinement by separately modeling temporal and spatial attentions. Extensive experiments conducted on the CMLR and GRID datasets demonstrate that MA-LipNet significantly reduces the Character Error Rate (CER) and Word Error Rate (WER), validating its effectiveness and superiority over several state-of-the-art methods. Our work highlights the importance of multi-dimensional feature refinement for robust visual speech recognition.


[34] Non-Markov Multi-Round Conversational Image Generation with History-Conditioned MLLMs cs.CV | cs.AI

Haochen Zhang, Animesh Sinha, Felix Juefei-Xu, Haoyu Ma, Kunpeng Li

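TL;DR: This paper proposes a method for non-Markov multi-round conversational image generation. It builds datasets using strategies such as rollback-style editing and name-based multi-round personalization, and designs a history-conditioned training and inference framework with token-level caching, addressing the weakness of existing methods on long-range history dependencies.

Details

Motivation: Most existing multi-round conversational image generation models and benchmarks are effectively Markov: the next output depends mainly on the most recent image. As a result, models cannot handle complex non-Markov interactions in which users refer to earlier states, undo changes, or reference entities across rounds.

Result: Experiments show that explicitly training for non-Markov interactions yields substantial gains in multi-round consistency and instruction compliance while maintaining strong single-round image editing and personalization.

Insight: The contributions are: 1) non-Markov multi-round data construction strategies (such as rollback-style editing and name-based personalization) that force the model to retrieve earlier visual states; 2) a history-conditioned framework with token-level caching that prevents multi-round identity drift; 3) a reconstruction-based DiT detokenizer and a multi-stage fine-tuning curriculum that improve high-fidelity image reconstruction and editable personalization.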

Abstract: Conversational image generation requires a model to follow user instructions across multiple rounds of interaction, grounded in interleaved text and images that accumulate as chat history. While recent multimodal large language models (MLLMs) can generate and edit images, most existing multi-turn benchmarks and training recipes are effectively Markov: the next output depends primarily on the most recent image, enabling shortcut solutions that ignore long-range history. In this work we formalize and target the more challenging non-Markov setting, where a user may refer back to earlier states, undo changes, or reference entities introduced several rounds ago. We present (i) non-Markov multi-round data construction strategies, including rollback-style editing that forces retrieval of earlier visual states and name-based multi-round personalization that binds names to appearances across rounds; (ii) a history-conditioned training and inference framework with token-level caching to prevent multi-round identity drift; and (iii) enabling improvements for high-fidelity image reconstruction and editable personalization, including a reconstruction-based DiT detokenizer and a multi-stage fine-tuning curriculum. We demonstrate that explicitly training for non-Markov interactions yields substantial improvements in multi-round consistency and instruction compliance, while maintaining strong single-round editing and personalization.


[35] AI-based Prediction of Biochemical Recurrence from Biopsy and Prostatectomy Samples cs.CV

Andrea Camilloni, Chiara Micoli, Nita Mulliqi, Erik Everett Palm, Thorgerdur Palsdottir

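TL;DR: This study develops an AI-based model that uses prostate biopsy slides to predict the risk of biochemical recurrence after radical prostatectomy. The model is trained on the STHLM3 cohort and validated on three external cohorts; it generalizes across them, improves risk stratification when combined with clinical variables, and improves incrementally on the guideline-based CAPRA-S score.

Details

Motivation: Current prognostic tools for biochemical recurrence after radical prostatectomy are imprecise; more accurate prediction is needed to support clinical decisions.

Result: On three external validation cohorts (LEOPARD, CHIMERA, TCGA-PRAD), the model achieves 5-year time-dependent AUCs of 0.64, 0.70, and 0.70, respectively. Adding clinical variables enables statistically significant risk stratification and an incremental improvement over the CAPRA-S score.

Insight: The novelty is using foundation models and attention-based multiple instance learning to extract features from biopsy slides, and showing that an AI model trained on biopsies generalizes to prostatectomy specimens, supporting preoperative and postoperative decisions. The study also notes that the added value of AI-based multimodal approaches over simpler models needs critical evaluation.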

Abstract: Biochemical recurrence (BCR) after radical prostatectomy (RP) is a surrogate marker for aggressive prostate cancer with adverse outcomes, yet current prognostic tools remain imprecise. We trained an AI-based model on diagnostic prostate biopsy slides from the STHLM3 cohort (n = 676) to predict patient-specific risk of BCR, using foundation models and attention-based multiple instance learning. Generalizability was assessed across three external RP cohorts: LEOPARD (n = 508), CHIMERA (n = 95), and TCGA-PRAD (n = 379). The image-based approach achieved 5-year time-dependent AUCs of 0.64, 0.70, and 0.70, respectively. Integrating clinical variables added complementary prognostic value and enabled statistically significant risk stratification. Compared with guideline-based CAPRA-S, AI incrementally improved postoperative prognostication. These findings suggest biopsy-trained histopathology AI can generalize across specimen types to support preoperative and postoperative decision making, but the added value of AI-based multimodal approaches over simpler predictive models should be critically scrutinized in further studies.


[36] Towards Mitigating Modality Bias in Vision-Language Models for Temporal Action Localization cs.CV

Jiaqi Li, Guangming Wang, Shuntian Zheng, Minzhe Ni, Xiaoman Lu

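TL;DR: This paper proposes ActionVLM, a framework that mitigates the modality bias of vision-language models in temporal action localization. A debiasing reweighting module estimates the advantage of language over vision-only predictions and dynamically adjusts the weight of the language modality, while a residual aggregation strategy treats language as a complementary refinement of vision, keeping vision as the dominant signal and using language adaptively.

Details

Motivation: Existing VLM-based temporal action localization methods rely too heavily on linguistic priors at the expense of visual performance, producing a pronounced modality bias. The paper aims to mitigate this bias systematically so that the model combines the visual and language modalities better.

Result: On the THUMOS14 benchmark, the model outperforms the previous state of the art by up to 3.2% mAP.

Insight: The novelty is a debiasing reweighting mechanism that dynamically estimates the language advantage, together with an aggregation strategy that treats language as a residual complement to the visual signal. This reduces overconfidence from linguistic priors and strengthens temporal reasoning.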

Abstract: Temporal Action Localization (TAL) requires identifying both the boundaries and categories of actions in untrimmed videos. While vision-language models (VLMs) offer rich semantics to complement visual evidence, existing approaches tend to overemphasize linguistic priors at the expense of visual performance, leading to a pronounced modality bias. We propose ActionVLM, a vision-language aggregation framework that systematically mitigates modality bias in TAL. Our key insight is to preserve vision as the dominant signal while adaptively exploiting language only when beneficial. To this end, we introduce (i) a debiasing reweighting module that estimates the language advantage (the incremental benefit of language over vision-only predictions) and dynamically reweights the language modality accordingly, and (ii) a residual aggregation strategy that treats language as a complementary refinement rather than the primary driver. This combination alleviates modality bias, reduces overconfidence from linguistic priors, and strengthens temporal reasoning. Experiments on THUMOS14 show that our model outperforms the state of the art by up to 3.2% mAP.
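
To make the two components concrete, here is a minimal sketch of advantage-gated residual aggregation. It assumes the language advantage is a simple score difference and that the gate is a clipped linear function of it; both choices, and the names `language_advantage`, `residual_aggregate`, and `scale`, are illustrative assumptions rather than the paper's formulation.

```python
def language_advantage(vision_score, fused_score):
    """Assumed definition: gain of the language-augmented score over vision only."""
    return fused_score - vision_score

def residual_aggregate(vision_logits, language_logits, advantage, scale=1.0):
    """Add language as a residual on top of vision, gated by its advantage.

    The gate is clipped to [0, 1], so language contributes nothing when it
    does not help and never outweighs the vision signal.
    """
    weight = max(0.0, min(1.0, scale * advantage))
    return [v + weight * l for v, l in zip(vision_logits, language_logits)]

# Language helps here, so it refines the vision logits.
refined = residual_aggregate([1.0, 2.0], [0.4, -0.4], advantage=0.5)
# Language hurts here, so the vision logits pass through unchanged.
unchanged = residual_aggregate([1.0, 2.0], [0.4, -0.4], advantage=-0.3)
```

The point of the residual form is that a zero gate recovers the vision-only prediction exactly, which matches the stated goal of keeping vision dominant.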


[37] Shape of Thought: Progressive Object Assembly via Visual Chain-of-Thought cs.CV

Yu Huo, Siyu Zhang, Kun Zeng, Haoyue Liu, Owen Lee

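TL;DR: This paper proposes Shape-of-Thought (SoT), a visual chain-of-thought framework that addresses the brittleness of text-to-image models under compositional structural constraints (such as generative numeracy, attribute binding, and part-level relations). It trains a unified multimodal autoregressive model to generate interleaved textual plans and rendered intermediate states, enabling progressive shape assembly without external engines.

Details

Motivation: Current multimodal text-to-image models achieve good visual fidelity but remain brittle under compositional structural constraints, particularly generative numeracy, attribute binding, and part-level relations.

Result: After fine-tuning on SoT-26K, the model reaches 88.4% on component numeracy and 84.8% on structural topology, about 20% above text-only baselines. These results are evaluated on the T2S-CompBench benchmark, which measures structural integrity and trace faithfulness.

Insight: The novelty is a visual chain-of-thought framework that learns shape-assembly logic implicitly by generating intermediate visual states and textual plans, avoiding explicit geometric representations. It establishes a paradigm for transparent, process-supervised compositional generation and introduces a large-scale dataset and an evaluation benchmark to support this direction.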

Abstract: Multimodal models for text-to-image generation have achieved strong visual fidelity, yet they remain brittle under compositional structural constraints, notably generative numeracy, attribute binding, and part-level relations. To address these challenges, we propose Shape-of-Thought (SoT), a visual CoT framework that enables progressive shape assembly via coherent 2D projections without external engines at inference time. SoT trains a unified multimodal autoregressive model to generate interleaved textual plans and rendered intermediate states, helping the model capture shape-assembly logic without producing explicit geometric representations. To support this paradigm, we introduce SoT-26K, a large-scale dataset of grounded assembly traces derived from part-based CAD hierarchies, and T2S-CompBench, a benchmark for evaluating structural integrity and trace faithfulness. Fine-tuning on SoT-26K achieves 88.4% on component numeracy and 84.8% on structural topology, outperforming text-only baselines by around 20%. SoT establishes a new paradigm for transparent, process-supervised compositional generation. The code is available at https://anonymous.4open.science/r/16FE/. The SoT-26K dataset will be released upon acceptance.


[38] Bidirectional Cross-Perception for Open-Vocabulary Semantic Segmentation in Remote Sensing Imagery cs.CV

Jianzheng Wang, Huan Ni

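TL;DR: This paper proposes SDCI, a spatial-regularization-aware dual-branch collaborative inference framework for training-free open-vocabulary semantic segmentation (OVSS) of high-resolution remote sensing imagery. It improves geometric localization and semantic prediction through a cross-model attention fusion module, a bidirectional cross-graph diffusion refinement module, and a convex-optimization-based superpixel collaborative prediction mechanism.

Details

Motivation: High-resolution remote sensing imagery contains densely distributed land-cover objects with complex boundaries, which raises the demands on geometric localization and semantic prediction. Existing training-free OVSS methods usually fuse CLIP and vision foundation models through one-way injection and shallow post-processing, which struggles to meet these demands.

Result: Experiments on multiple remote sensing semantic segmentation benchmarks show that the method outperforms existing approaches. Ablation studies further confirm that traditional object-based remote sensing image analysis methods that exploit superpixel structures remain effective within deep learning frameworks.

Insight: The contributions are: 1) collaborative inference with bidirectional guidance via cross-model attention fusion; 2) iterative refinement of segmentation score reliability via bidirectional cross-graph diffusion; 3) combining low-level superpixel structures with convex optimization to refine object boundaries. Viewed objectively, the method effectively combines the superpixel priors of traditional remote sensing analysis with deep learning models, offering a new approach to open-vocabulary segmentation in complex scenes.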

Abstract: High-resolution remote sensing imagery is characterized by densely distributed land-cover objects and complex boundaries, which places higher demands on both geometric localization and semantic prediction. Existing training-free open-vocabulary semantic segmentation (OVSS) methods typically fuse CLIP and vision foundation models (VFMs) using “one-way injection” and “shallow post-processing” strategies, making it difficult to satisfy these requirements. To address this issue, we propose a spatial-regularization-aware dual-branch collaborative inference framework for training-free OVSS, termed SDCI. First, during feature encoding, SDCI introduces a cross-model attention fusion (CAF) module, which guides collaborative inference by injecting self-attention maps into each other. Second, we propose a bidirectional cross-graph diffusion refinement (BCDR) module that enhances the reliability of dual-branch segmentation scores through iterative random-walk diffusion. Finally, we incorporate low-level superpixel structures and develop a convex-optimization-based superpixel collaborative prediction (CSCP) mechanism to further refine object boundaries. Experiments on multiple remote sensing semantic segmentation benchmarks demonstrate that our method achieves better performance than existing approaches. Moreover, ablation studies further confirm that traditional object-based remote sensing image analysis methods leveraging superpixel structures remain effective within deep learning frameworks. Code: https://github.com/yu-ni1989/SDCI.
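
The iterative random-walk diffusion mentioned for the BCDR module can be sketched in a generic form. The update rule below (a restart-style random walk), the `alpha` and `steps` parameters, and the `cross_diffuse` pairing in which each branch's scores are smoothed over the other branch's graph are all assumptions chosen for illustration; the abstract does not give the actual equations.

```python
def row_normalize(affinity):
    """Turn a non-negative affinity matrix into a row-stochastic transition matrix.

    Assumes every row has a positive sum.
    """
    return [[w / sum(row) for w in row] for row in affinity]

def diffuse(scores, affinity, alpha=0.5, steps=10):
    """Random walk with restart: s <- alpha * P s + (1 - alpha) * s0."""
    transition = row_normalize(affinity)
    current = list(scores)
    for _ in range(steps):
        propagated = [sum(p * s for p, s in zip(row, current)) for row in transition]
        current = [alpha * p + (1 - alpha) * s0 for p, s0 in zip(propagated, scores)]
    return current

def cross_diffuse(scores_a, scores_b, affinity_a, affinity_b, alpha=0.5, steps=10):
    """One possible reading of 'bidirectional cross-graph': swap the graphs."""
    return (diffuse(scores_a, affinity_b, alpha, steps),
            diffuse(scores_b, affinity_a, alpha, steps))

full_graph = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
smoothed = diffuse([1.0, 0.0, 1.0], full_graph, alpha=0.5, steps=1)
```

In this toy case the isolated low score at the middle node is pulled up toward its neighbors while the restart term keeps every node anchored to its initial value.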


[39] FRISM: Fine-Grained Reasoning Injection via Subspace-Level Model Merging for Vision-Language Models cs.CV | cs.LG

Chenyu Huang, Peng Ye, Xudong Tan, Jinhan Mu, Shenghe Zheng

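TL;DR: This paper proposes FRISM, a fine-grained reasoning injection framework that strengthens the reasoning ability of vision-language models through subspace-level model merging. It decomposes the task vectors of a large reasoning model with singular value decomposition and learns to adaptively adjust the scaling coefficient of each subspace, injecting reasoning ability at fine granularity while avoiding loss of visual capability.

Details

Motivation: Existing methods usually operate at the coarse-grained layer level, which forces a trade-off between injecting reasoning ability and preserving visual ability. The paper aims to overcome this limitation with fine-grained, subspace-level model merging.

Result: Across diverse visual reasoning benchmarks, FRISM consistently achieves state-of-the-art performance, improving reasoning ability without harming the model's original visual ability.

Insight: The novelty is the observation that reasoning ability is encoded in specific subspaces, and an SVD-based subspace-level decomposition of task vectors with adaptive scaling coefficients. In addition, a label-free self-distillation strategy performs dual-objective optimization on common vision-language perception datasets, achieving fine-grained and balanced capability injection.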

Abstract: Efficiently enhancing the reasoning capabilities of Vision-Language Models (VLMs) by merging them with Large Reasoning Models (LRMs) has emerged as a promising direction. However, existing methods typically operate at a coarse-grained layer level, which often leads to a trade-off between injecting reasoning capabilities and preserving visual capabilities. To address this limitation, we propose FRISM (Fine-grained Reasoning Injection via Subspace-level model Merging), a fine-grained reasoning injection framework based on subspace-level model merging. Observing that reasoning capabilities are encoded in distinct subspaces, FRISM decomposes LRM task vectors via Singular Value Decomposition (SVD) and adaptively tunes the scaling coefficients of each subspace through learning to realize fine-grained reasoning injection. Furthermore, we introduce a label-free self-distillation learning strategy with a dual-objective optimization using common vision-language perception datasets. Extensive experiments demonstrate that FRISM effectively improves reasoning capabilities without compromising the model's original visual capabilities, consistently achieving state-of-the-art performance across diverse visual reasoning benchmarks.
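
In symbols, the subspace-level merge might look as follows. This is a hedged sketch, not the paper's notation: it assumes the usual task-vector definition (fine-tuned weights minus shared base weights) for a single weight matrix, and the symbols $\Delta$, $W_{\text{base}}$, $W_{\text{LRM}}$, $W_{\text{VLM}}$, $r$, and the learned coefficients $\alpha_i$ are all introduced here for illustration.

```latex
\Delta = W_{\text{LRM}} - W_{\text{base}}, \qquad
\Delta = U \Sigma V^{\top} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top}, \qquad
W_{\text{merged}} = W_{\text{VLM}} + \sum_{i=1}^{r} \alpha_i \, \sigma_i \, u_i v_i^{\top}
```

Under this reading, a layer-level merge corresponds to forcing all $\alpha_i$ in a layer to share one value, whereas learning a separate $\alpha_i$ per singular direction is what makes the injection fine-grained.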


[40] Generative Recall, Dense Reranking: Learning Multi-View Semantic IDs for Efficient Text-to-Video Retrieval cs.CV

Zecheng Zhao, Zhi Chen, Zi Huang, Shazia Sadiq, Tong Chen

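TL;DR: This paper proposes GRDR, a two-stage text-to-video retrieval method that uses a generative retrieval model for fast recall and a dense retrieval model for fine-grained reranking. It assigns multiple query-guided semantic IDs to each video to address the semantic ambiguity and cross-modal misalignment of conventional generative retrieval, substantially improving storage and retrieval efficiency while keeping accuracy high.

Details

Motivation: Generative retrieval, when used as the recall stage of two-stage text-to-video retrieval, has two key problems: semantic ambiguity (each video matches many kinds of queries, but conventional methods assign only one semantic ID) and cross-modal misalignment (semantic IDs are derived only from visual features, without text supervision). The paper addresses both to improve the quality of the recalled candidate set.

Result: On text-to-video retrieval benchmarks, GRDR matches strong dense retrievers in accuracy while reducing index storage by an order of magnitude and achieving up to 300× speedup in full-corpus retrieval.

Insight: The novelty is a query-guided multi-view tokenizer that generates multiple semantic IDs per video to expose different semantic access paths, and joint training of the tokenizer and the generative retriever through a shared codebook so that semantic IDs act as a semantic bridge between text and video. This offers a new approach to efficient, accurate multimodal retrieval.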

Abstract: Text-to-Video Retrieval (TVR) is essential in video platforms. Dense retrieval with dual-modality encoders leads in accuracy, but its computation and storage scale poorly with corpus size. Thus, real-time large-scale applications adopt two-stage retrieval, where a fast recall model gathers a small candidate pool, which is then reranked by an advanced dense retriever. Because the candidate pool is so much smaller than the corpus, the reranking model can be any off-the-shelf dense retriever without hurting efficiency, meaning the recall model bounds two-stage TVR performance. Recently, generative retrieval (GR) has replaced dense video embeddings with discrete semantic IDs, retrieving by decoding text queries into ID tokens. GR offers near-constant inference and storage complexity, and its semantic IDs capture high-level video features via quantization, making it ideal for quickly eliminating irrelevant candidates during recall. However, as a recall model in two-stage TVR, GR suffers from (i) semantic ambiguity, where each video satisfies diverse queries but is forced into one semantic ID; and (ii) cross-modal misalignment, as semantic IDs are solely derived from visual features without text supervision. We propose Generative Recall and Dense Reranking (GRDR), which designs a novel GR method to raise the quality of recalled candidates. GRDR assigns multiple semantic IDs to each video using a query-guided multi-view tokenizer exposing diverse semantic access paths, and jointly trains the tokenizer and generative retriever via a shared codebook to cast semantic IDs as the semantic bridge between texts and videos. At inference, trie-constrained decoding generates a compact candidate set, which is reranked by a dense model for fine-grained matching. Experiments on TVR benchmarks show GRDR matches strong dense retrievers in accuracy while reducing index storage by an order of magnitude and accelerating full-corpus retrieval by up to 300×.
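
The inference path (trie-constrained decoding followed by dense reranking) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: semantic IDs are short integer tuples, `token_score` stands in for the generative retriever's per-token score, `dense_score` stands in for the dense reranker, and plain beam search is used as the decoder.

```python
def build_trie(video_ids):
    """Index semantic IDs in a trie; each video may own several IDs (multi-view)."""
    trie = {}
    for video, ids in video_ids.items():
        for semantic_id in ids:
            node = trie
            for token in semantic_id:
                node = node.setdefault(token, {})
            # Integer tokens never collide with this string key.
            node.setdefault("_videos", set()).add(video)
    return trie

def constrained_beam_search(trie, token_score, length, beam_size):
    """Decode only token sequences that exist in the trie, keeping the best beams."""
    beams = [((), 0.0, trie)]
    for _ in range(length):
        expanded = []
        for prefix, score, node in beams:
            for token, child in node.items():
                if token == "_videos":
                    continue
                expanded.append((prefix + (token,), score + token_score(prefix, token), child))
        beams = sorted(expanded, key=lambda beam: beam[1], reverse=True)[:beam_size]
    candidates = set()
    for _, _, node in beams:
        candidates |= node.get("_videos", set())
    return candidates

def rerank(candidates, dense_score):
    """Order the recalled candidates with a (stand-in) dense similarity score."""
    return sorted(candidates, key=dense_score, reverse=True)

video_ids = {"v1": [(1, 2), (3, 4)], "v2": [(1, 5)], "v3": [(3, 6)]}
toy_scores = {1: 0.9, 3: 0.1, 2: 0.6, 5: 0.5, 4: 0.2, 6: 0.1}
trie = build_trie(video_ids)
recalled = constrained_beam_search(trie, lambda prefix, tok: toy_scores[tok], length=2, beam_size=2)
ranked = rerank(recalled, {"v1": 0.3, "v2": 0.8, "v3": 0.1}.get)
```

Because decoding can only follow paths present in the trie, every finished beam maps to at least one real video, and the dense model only has to score the handful of recalled candidates rather than the full corpus.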


[41] Thinker: A vision-language foundation model for embodied intelligence cs.CV | cs.AI

Baiyu Pan, Daqin Luo, Junpeng Yang, Jiyuan Wang, Yixuan Zhang

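TL;DR: This paper proposes Thinker, a large vision-language foundation model that targets two main problems large vision-language models face in robotics: confusion between third-person and first-person perspectives, and a tendency to overlook information at the end of a video during temporal reasoning. By building a large-scale dataset for robotic perception and reasoning and introducing a simple, effective method that combines key frames with the full video sequence, the model achieves state-of-the-art results on two commonly used task-planning benchmarks.

Details

Motivation: Large vision-language models applied to robotics (embodied intelligence) have two key weaknesses: perspective confusion (for example, failing to distinguish first person from third person) and neglect of video-ending information during temporal reasoning. The paper addresses both to improve performance on real-world interaction tasks.

Result: The model achieves state-of-the-art (SOTA) results on two of the most commonly used benchmark datasets in task planning.

Insight: The contributions are: 1) a large-scale dataset built specifically for robotic perception and reasoning, covering ego-view videos, visual grounding, spatial understanding, and chain-of-thought data; 2) a simple, effective method for improving video understanding that feeds key frames and the full video sequence jointly, improving the model's handling of temporal information.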

Abstract: When large vision-language models are applied to the field of robotics, they encounter problems that are simple for humans yet error-prone for models. Such issues include confusion between third-person and first-person perspectives and a tendency to overlook information in video endings during temporal reasoning. To address these challenges, we propose Thinker, a large vision-language foundation model designed for embodied intelligence. We tackle the aforementioned issues from two perspectives. Firstly, we construct a large-scale dataset tailored for robotic perception and reasoning, encompassing ego-view videos, visual grounding, spatial understanding, and chain-of-thought data. Secondly, we introduce a simple yet effective approach that substantially enhances the model’s capacity for video comprehension by jointly incorporating key frames and full video sequences as inputs. Our model achieves state-of-the-art results on two of the most commonly used benchmark datasets in the field of task planning.


[42] LAMP: Learning Universal Adversarial Perturbations for Multi-Image Tasks via Pre-trained Models cs.CV

Alvi Md Ishmam, Najibul Haque Sarker, Zaber Ibn Abdul Hakim, Chris Thomas

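TL;DR: This paper proposes LAMP, a black-box universal adversarial perturbation attack on multi-image multimodal large language models. Using an attention constraint and a cross-image contagion mechanism, it disrupts the model's ability to aggregate information across images without modifying every input.

Details

Motivation: Existing adversarial attacks mainly target single-image settings and mostly assume white-box access, while the vulnerabilities of multi-image MLLMs remain unexplored. LAMP addresses universal adversarial attacks on multi-image tasks in the black-box setting.

Result: Experiments show that LAMP outperforms existing SOTA baselines across multiple vision-language tasks and models, achieving the highest attack success rates.

Insight: The contributions are an attention constraint that blocks cross-image information aggregation, a cross-image contagious constraint that spreads the perturbation, and an index-attention suppression loss that makes the attack position-invariant, offering a new approach to black-box multimodal adversarial attacks.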

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable performance across vision-language tasks. Recent advancements allow these models to process multiple images as inputs. However, the vulnerabilities of multi-image MLLMs remain unexplored. Existing adversarial attacks focus on single-image settings and often assume a white-box threat model, which is impractical in many real-world scenarios. This paper introduces LAMP, a black-box method for learning Universal Adversarial Perturbations (UAPs) targeting multi-image MLLMs. LAMP applies an attention-based constraint that prevents the model from effectively aggregating information across images. LAMP also introduces a novel cross-image contagious constraint that forces perturbed tokens to influence clean tokens, spreading adversarial effects without requiring all inputs to be modified. Additionally, an index-attention suppression loss enables a robust position-invariant attack. Experimental results show that LAMP outperforms SOTA baselines and achieves the highest attack success rates across multiple vision-language tasks and models.


[43] Hypersolid: Emergent Vision Representations via Short-Range Repulsion cs.CV | cs.AI | cs.LG

Esteban Rodríguez-Betancourt, Edgar Casasola-Murillo

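TL;DR: The paper proposes Hypersolid, a self-supervised learning method that reinterprets representation learning as a discrete packing problem and uses short-range hard-ball repulsion to prevent local collisions, thereby avoiding representation collapse. The method performs well on fine-grained and low-resolution classification tasks.

Details

Motivation: Representation collapse is a recurring challenge in self-supervised learning. Existing methods usually rely on global regularization; this paper instead takes a discrete-packing view in which information is preserved by maintaining injectivity.

Result: Hypersolid performs well on fine-grained and low-resolution classification tasks, with a high-separation geometric regime that preserves augmentation diversity.

Insight: The novelty is treating representation learning as a discrete packing problem and using short-range hard-ball repulsion to prevent local collisions. This gives a new geometric perspective on avoiding representation collapse that could carry over to other self-supervised learning frameworks.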

Abstract: A recurring challenge in self-supervised learning is preventing representation collapse. Existing solutions typically rely on global regularization, such as maximizing distances, decorrelating dimensions or enforcing certain distributions. We instead reinterpret representation learning as a discrete packing problem, where preserving information simplifies to maintaining injectivity. We operationalize this in Hypersolid, a method using short-range hard-ball repulsion to prevent local collisions. This constraint results in a high-separation geometric regime that preserves augmentation diversity, excelling on fine-grained and low-resolution classification tasks.
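
A short-range repulsion term differs from global regularization in that pairs of embeddings farther apart than some radius contribute nothing at all. The sketch below shows one way to write such a penalty; the quadratic overlap form, the Euclidean distance, and the names `hard_ball_repulsion` and `radius` are assumptions for illustration and are not taken from the paper.

```python
import math

def hard_ball_repulsion(embeddings, radius):
    """Penalize only pairs of embeddings closer than `radius` (local collisions).

    Pairs at distance >= radius add exactly zero, so the term is short-range.
    """
    penalty = 0.0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            distance = math.dist(embeddings[i], embeddings[j])
            overlap = max(0.0, radius - distance)
            penalty += overlap ** 2
    return penalty

# Well-separated points incur no penalty; nearly colliding points do.
separated = hard_ball_repulsion([[0.0, 0.0], [3.0, 4.0]], radius=1.0)
colliding = hard_ball_repulsion([[0.0, 0.0], [0.1, 0.0]], radius=0.5)
```

Driving this penalty to zero keeps distinct inputs at distinct, non-overlapping locations, which is the injectivity condition the abstract describes, without pushing already-separated points any farther apart.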


[44] Lightweight High-Fidelity Low-Bitrate Talking Face Compression for 3D Video Conference cs.CV | cs.AI

Jianglong Li, Jun Xu, Bingcong Lu, Zhengxue Cheng, Hongwei Hu

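TL;DR: This paper proposes a lightweight, high-fidelity, low-bitrate 3D talking face compression framework for 3D video conferencing. It combines FLAME-based parametric modeling with 3D Gaussian Splatting (3DGS) neural rendering and transmits only essential facial metadata, achieving high-quality face reconstruction at extremely low bitrates.

Details

Motivation: Achieving high-fidelity 3D talking face representation at low bitrates is a challenge for 3D video conferencing. Traditional 2D video compression fails to preserve fine-grained geometric and appearance details, while implicit neural rendering methods such as NeRF are too computationally expensive.

Result: Experiments show that the method achieves superior rate-distortion performance and high-quality facial rendering at extremely low bitrates, making it suitable for real-time 3D video conferencing.

Insight: The novelty is combining a parametric face model (FLAME) with efficient 3D Gaussian Splatting rendering, plus a compact representation and compression scheme (such as Gaussian attribute compression and MLP optimization), which delivers high-fidelity, low-bitrate transmission while staying lightweight.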

Abstract: The demand for immersive and interactive communication has driven advancements in 3D video conferencing, yet achieving high-fidelity 3D talking face representation at low bitrates remains a challenge. Traditional 2D video compression techniques fail to preserve fine-grained geometric and appearance details, while implicit neural rendering methods like NeRF suffer from prohibitive computational costs. To address these challenges, we propose a lightweight, high-fidelity, low-bitrate 3D talking face compression framework that integrates FLAME-based parametric modeling with 3DGS neural rendering. Our approach transmits only essential facial metadata in real time, enabling efficient reconstruction with a Gaussian-based head model. Additionally, we introduce a compact representation and compression scheme, including Gaussian attribute compression and MLP optimization, to enhance transmission efficiency. Experimental results demonstrate that our method achieves superior rate-distortion performance, delivering high-quality facial rendering at extremely low bitrates, making it well-suited for real-time 3D video conferencing applications.


[45] GeoRC: A Benchmark for Geolocation Reasoning Chains cs.CV | cs.AI | cs.CL | cs.LG

Mohit Talreja, Joshua Diao, Jim Thannikary James, Radu Casapu, Tejas Santanam

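TL;DR: This paper introduces GeoRC, the first benchmark for geolocation reasoning chains, which evaluates how well vision-language models explain the visual evidence behind their geolocation predictions. It finds that large closed-source VLMs rival human experts in location prediction accuracy but lag well behind them in producing auditable reasoning chains, and that open-weights VLMs perform even worse.

Details

Motivation: Existing vision-language models do well at geolocation prediction but often cannot explain which image evidence supports a prediction, and frequently hallucinate scene attributes to justify it. This exposes limitations in fine-grained visual attribute extraction and interpretability.

Result: The benchmark is built on the GeoGuessr game (Google Street View covering more than 100 countries) and contains 500 query scenes with 800 expert-annotated reasoning chains. Large closed-source VLMs (such as Gemini and GPT-5) match human experts in location prediction accuracy, but their reasoning chains are of much lower quality. Open-weights VLMs (such as Llama and Qwen) fail catastrophically, performing only slightly better than a baseline in which an LLM hallucinates a reasoning chain from oracle knowledge of the photo location with no visual information. Among the judging strategies evaluated, LLM-as-a-judge with Qwen 3 correlates best with human scoring.

Insight: The novelty is the first benchmark focused on geolocation reasoning chains, which systematically exposes the large gap between current VLMs and human experts in interpretability, fine-grained visual attribute extraction (such as license plate shape, architectural style, and soil properties), and high-resolution image understanding, and underlines the importance of evaluating auditable reasoning.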

Abstract: Vision Language Models (VLMs) are good at recognizing the global location of a photograph – their geolocation prediction accuracy rivals the best human experts. But many VLMs are startlingly bad at explaining which image evidence led to their prediction, even when their location prediction is correct. The reasoning chains produced by VLMs frequently hallucinate scene attributes to support their location prediction (e.g. phantom writing, imagined infrastructure, misidentified flora). In this paper, we introduce the first benchmark for geolocation reasoning chains. We focus on the global location prediction task in the popular GeoGuessr game, which draws from Google Street View spanning more than 100 countries. We collaborate with expert GeoGuessr players, including the reigning world champion, to produce 800 ground truth reasoning chains for 500 query scenes. These expert reasoning chains address hundreds of different discriminative visual attributes such as license plate shape, architecture, and soil properties, to name just a few. We evaluate LLM-as-a-judge and VLM-as-a-judge strategies for scoring VLM-generated reasoning chains against our expert reasoning chains and find that Qwen 3 LLM-as-a-judge correlates best with human scoring. Our benchmark reveals that while large, closed-source VLMs such as Gemini and GPT 5 rival human experts at predicting locations, they still lag behind human experts when it comes to producing auditable reasoning chains. Open-weights VLMs such as Llama and Qwen catastrophically fail on our benchmark – they perform only slightly better than a baseline in which an LLM hallucinates a reasoning chain with oracle knowledge of the photo location but no visual information at all. We believe the gap between human experts and VLMs on this task points to VLM limitations at extracting fine-grained visual attributes from high-resolution images.


[46] Token Entropy Regularization for Multi-modal Antenna Affiliation Identification cs.CV

Dong Chen, Ruoyu Li, Xinyan Zhang, Jialei Xu, Ruoseng Zhao

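TL;DR: This paper proposes a new multimodal paradigm for antenna affiliation identification that fuses base station video, antenna geometric features, and Physical Cell Identity (PCI) signals, recasting the task as multimodal classification and matching. Because the communications domain lacks analogous data, which makes cross-modal alignment difficult, the authors design a dedicated training framework and introduce a novel Token Entropy Regularization module in the pretraining stage to address representation alignment.

Details

Motivation: Antenna affiliation identification currently relies on manual tower inspections, which are cumbersome and error-prone, so an automated solution is needed. Publicly available pretrained Transformers struggle with this task because the communications domain lacks analogous data, which hampers cross-modal alignment.

Result: Experiments show that the proposed Token Entropy Regularization module accelerates convergence and yields significant performance gains. Further analysis reveals that the entropy of the first token is modality-dependent.

Insight: The novelty is framing antenna affiliation identification as a multimodal classification and matching task, and designing a dedicated training framework and the Token Entropy Regularization module to strengthen cross-modal alignment despite scarce domain data. The entropy analysis offers a new angle on multimodal representations.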

Abstract: Accurate antenna affiliation identification is crucial for optimizing and maintaining communication networks. Current practice, however, relies on the cumbersome and error-prone process of manual tower inspections. We propose a novel paradigm shift that fuses video footage of base stations, antenna geometric features, and Physical Cell Identity (PCI) signals, transforming antenna affiliation identification into multi-modal classification and matching tasks. Publicly available pretrained transformers struggle with this unique task due to a lack of analogous data in the communications domain, which hampers cross-modal alignment. To address this, we introduce a dedicated training framework that aligns antenna images with corresponding PCI signals. To tackle the representation alignment challenge, we propose a novel Token Entropy Regularization (TER) module in the pretraining stage. Our experiments demonstrate that TER accelerates convergence and yields significant performance gains. Further analysis reveals that the entropy of the first token is modality-dependent. Code will be made available upon publication.


[47] WorldBench: Disambiguating Physics for Diagnostic Evaluation of World Models cs.CV

Rishi Upadhyay, Howard Zhang, Jim Solomon, Ayush Agrawal, Pranay Boreddy

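TL;DR: This paper introduces WorldBench, a video benchmark designed for disentangled evaluation that isolates and assesses a world model's understanding of a single physical concept or law at a time. The benchmark has two levels: intuitive physical understanding (such as object permanence) and low-level physical constants and material properties (such as friction coefficients).

Details

Motivation: Existing physics-based video benchmarks suffer from concept entanglement: a single test evaluates several physical laws at once, which limits diagnostic power. WorldBench addresses this to enable finer-grained, scalable evaluation of the physical reasoning of world models.

Result: When current state-of-the-art video world models are evaluated on WorldBench, all tested models show consistent failure patterns on particular physics concepts and lack the physical consistency needed to generate reliable real-world interactions.

Insight: The novelty is a disentangled, concept-specific evaluation framework that pinpoints a model's deficiencies on individual physical concepts, providing a finer-grained evaluation tool for building more robust and generalizable world models.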

Abstract: Recent advances in generative foundational models, often termed “world models,” have propelled interest in applying them to critical tasks like robotic planning and autonomous system training. For reliable deployment, these models must exhibit high physical fidelity, accurately simulating real-world dynamics. Existing physics-based video benchmarks, however, suffer from entanglement, where a single test simultaneously evaluates multiple physical laws and concepts, fundamentally limiting their diagnostic capability. We introduce WorldBench, a novel video-based benchmark specifically designed for concept-specific, disentangled evaluation, allowing us to rigorously isolate and assess understanding of a single physical concept or law at a time. To make WorldBench comprehensive, we design benchmarks at two different levels: 1) an evaluation of intuitive physical understanding with concepts such as object permanence or scale/perspective, and 2) an evaluation of low-level physical constants and material properties such as friction coefficients or fluid viscosity. When SOTA video-based world models are evaluated on WorldBench, we find specific patterns of failure in particular physics concepts, with all tested models lacking the physical consistency required to generate reliable real-world interactions. Through its concept-specific evaluation, WorldBench offers a more nuanced and scalable framework for rigorously evaluating the physical reasoning capabilities of video generation and world models, paving the way for more robust and generalizable world-model-driven learning.


[48] Do Pathology Foundation Models Encode Disease Progression? A Pseudotime Analysis of Visual Representations cs.CV

Pritika Vig, Ren-Chin Wu, William Lotter

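TL;DR: This paper investigates whether vision foundation models in computational pathology implicitly learn and encode continuous disease progression from the discrete images they are trained on. Applying diffusion pseudotime, a method from single-cell transcriptomics, the authors analyze how several pathology-specific foundation models organize disease states in representation space. They find that the models recover disease progression trajectories significantly better than chance, and that trajectory fidelity correlates strongly with few-shot classification performance.

Details

Motivation: The goal is to test whether the latent representations of vision foundation models (especially pathology-specific ones) capture the continuous biological processes, such as disease progression, underlying their training data. Such representations are hypothesized to better reflect the underlying biology, support more robust generalization, and enable quantitative analysis of features associated with disease transitions.

Result: Across four cancer progressions and six models, all pathology-specific models recover trajectory orderings significantly better than null baselines, with vision-only models reaching the highest fidelity (τ > 0.78 on the CRC-Serrated dataset). Model rankings by trajectory fidelity on reference diseases strongly predict few-shot classification performance on held-out diseases (ρ = 0.92), and exploratory analysis shows cell-type composition varying smoothly along the inferred trajectories, consistent with known stromal remodeling patterns.

Insight: The novelty is transferring diffusion pseudotime from single-cell analysis to the analysis of visual representations, in order to quantify how well a model encodes continuous processes, and proposing trajectory fidelity as a measure of representation quality that complements downstream task performance. The framework could extend to other domains where continuous processes are observed through static snapshots.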

Abstract: Vision foundation models trained on discretely sampled images achieve strong performance on classification benchmarks, yet whether their representations encode the continuous processes underlying their training data remains unclear. This question is especially pertinent in computational pathology, where we posit that models whose latent representations implicitly capture continuous disease progression may better reflect underlying biology, support more robust generalization, and enable quantitative analyses of features associated with disease transitions. Using diffusion pseudotime, a method developed to infer developmental trajectories from single-cell transcriptomics, we probe whether foundation models organize disease states along coherent progression directions in representation space. Across four cancer progressions and six models, we find that all pathology-specific models recover trajectory orderings significantly exceeding null baselines, with vision-only models achieving the highest fidelities ($τ > 0.78$ on CRC-Serrated). Model rankings by trajectory fidelity on reference diseases strongly predict few-shot classification performance on held-out diseases ($ρ = 0.92$), and exploratory analysis shows cell-type composition varies smoothly along inferred trajectories in patterns consistent with known stromal remodeling. Together, these results demonstrate that vision foundation models can implicitly learn to represent continuous processes from independent static observations, and that trajectory fidelity provides a complementary measure of representation quality beyond downstream performance. While demonstrated in pathology, this framework could be applied to other domains where continuous processes are observed through static snapshots.
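
Trajectory fidelity of this kind is typically a rank correlation between the inferred pseudotime and the known ordering of disease stages. The sketch below assumes that the reported $τ$ is a Kendall rank correlation and uses the tau-b variant, which handles the ties that arise when many samples share a stage; both the choice of statistic and the toy data are assumptions, not details confirmed by the abstract.

```python
import math

def kendall_tau_b(x, y):
    """Kendall's tau-b: rank agreement between two sequences, corrected for ties."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    total_pairs = n * (n - 1) // 2
    denominator = math.sqrt((total_pairs - ties_x) * (total_pairs - ties_y))
    return (concordant - discordant) / denominator if denominator else 0.0

# Hypothetical data: two samples per stage, pseudotime inferred from embeddings.
stages = [0, 0, 1, 1, 2, 2]
pseudotime = [0.1, 0.2, 0.4, 0.5, 0.8, 0.9]
fidelity = kendall_tau_b(stages, pseudotime)
```

A value near 1 means samples from later stages consistently receive later pseudotime, while a value near 0 means the inferred ordering carries no stage information.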


[49] SR$^{2}$-Net: A General Plug-and-Play Model for Spectral Refinement in Hyperspectral Image Super-Resolution cs.CV

Ji-Xuan He, Guohang Zhuang, Junge Bo, Tingyi Li, Chen Ling

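TL;DR: This paper proposes SR²-Net, a lightweight plug-and-play spectral rectification network for hyperspectral image super-resolution (HSI-SR). It follows an enhance-then-rectify pipeline with a hierarchical spectral-spatial synergy attention module and a manifold consistency rectification module, improving the spectral fidelity and reconstruction quality of existing HSI-SR models without modifying their architectures.

Details

Motivation: Existing HSI-SR methods mainly exploit spatial correlations to raise spatial resolution but often neglect spectral consistency across bands, leading to spurious oscillations and physically implausible artifacts. Enforcing spectral consistency through network architecture design, in turn, costs generality and flexibility.

Result: Extensive experiments on multiple benchmarks and diverse backbones show that SR²-Net consistently improves spectral fidelity and overall reconstruction quality with negligible computational overhead.

Insight: The novelty is a general plug-and-play rectifier that, through the enhance-then-rectify pipeline and a degradation-consistency loss, brings a physical prior (a compact spectral manifold) into reconstruction, ensuring physically plausible spectra without sacrificing generality.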

Abstract: Hyperspectral image super-resolution (HSI-SR) aims to enhance spatial resolution while preserving spectrally faithful and physically plausible characteristics. Recent methods have made great progress by leveraging spatial correlations to enhance spatial resolution. However, these methods often neglect spectral consistency across bands, leading to spurious oscillations and physically implausible artifacts. While spectral consistency can be addressed by designing the network architecture, doing so results in a loss of generality and flexibility. To address this issue, we propose a lightweight plug-and-play rectifier built on physical priors, the Spectral Rectification Super-Resolution Network (SR$^{2}$-Net), which can be attached to a wide range of HSI-SR models without modifying their architectures. SR$^{2}$-Net follows an enhance-then-rectify pipeline consisting of (i) Hierarchical Spectral-Spatial Synergy Attention (H-S$^{3}$A) to reinforce cross-band interactions and (ii) Manifold Consistency Rectification (MCR) to constrain the reconstructed spectra to a compact, physically plausible spectral manifold. In addition, we introduce a degradation-consistency loss to enforce data fidelity by encouraging the degraded SR output to match the observed low-resolution input. Extensive experiments on multiple benchmarks and diverse backbones demonstrate consistent improvements in spectral fidelity and overall reconstruction quality with negligible computational overhead. Our code will be released upon publication.
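The degradation-consistency idea mentioned above can be illustrated with a small sketch. The abstract does not specify the degradation operator or the distance, so this example assumes s×s average pooling as the degradation and a mean absolute error as the penalty, applied to a single spectral band. The names `avg_pool` and `degradation_consistency_loss` are hypothetical.

```python
def avg_pool(img, s):
    """Downsample a 2D band (list of rows) by s x s average pooling (assumed degradation)."""
    h, w = len(img), len(img[0])
    return [
        [
            sum(img[y * s + dy][x * s + dx] for dy in range(s) for dx in range(s)) / (s * s)
            for x in range(w // s)
        ]
        for y in range(h // s)
    ]

def degradation_consistency_loss(sr_band, lr_band, s):
    """Mean absolute difference between the degraded SR output and the observed LR input."""
    degraded = avg_pool(sr_band, s)
    diffs = [
        abs(a - b)
        for row_d, row_l in zip(degraded, lr_band)
        for a, b in zip(row_d, row_l)
    ]
    return sum(diffs) / len(diffs)

# Hypothetical 4x4 super-resolved band and its 2x2 low-resolution observation.
sr = [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
lr = [[1, 2], [3, 4]]
loss_consistent = degradation_consistency_loss(sr, lr, 2)                  # 0.0
loss_mismatch = degradation_consistency_loss(sr, [[2, 2], [3, 4]], 2)      # 0.25
```

In a full model the same penalty would be summed over all bands and added to the reconstruction loss; the weighting between the two terms is not given in the abstract.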


[50] Towards Geometry-Aware and Motion-Guided Video Human Mesh Recovery cs.CVPDF

Hongjun Chen, Huan Zheng, Wencheng Han, Jianbing Shen

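TL;DR: This paper proposes HMRMamba, a new paradigm for video human mesh recovery that addresses the physically implausible results existing methods produce because they rely on flawed intermediate 3D pose anchors and cannot effectively model complex spatiotemporal dynamics. It introduces two core modules: a Geometry-Aware Lifting Module built on a novel dual-scan Mamba architecture, and a Motion-guided Reconstruction Network, which together yield a reliable 3D pose sequence and improve the spatiotemporal consistency of the final mesh.

Details

Motivation: Existing video-based 3D human mesh recovery methods often produce physically implausible results, which stem from reliance on flawed intermediate 3D pose anchors and an inability to effectively model complex spatiotemporal dynamics.

Result: Comprehensive evaluations on the 3DPW, MPI-INF-3DHP, and Human3.6M benchmarks confirm that HMRMamba outperforms existing methods in both reconstruction accuracy and temporal consistency, setting a new state of the art while offering superior computational efficiency.

Insight: The core novelty is pioneering the use of Structured State Space Models for human mesh recovery, with two key modules: a Geometry-Aware Lifting Module based on a dual-scan Mamba architecture, which draws geometric cues directly from image features to build a reliable sequence of 3D pose anchors, and a Motion-guided Reconstruction Network, which explicitly processes temporal motion patterns to improve mesh coherence and robustness. This offers a new approach to spatiotemporal modeling and physical plausibility in video human mesh recovery.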

Abstract: Existing video-based 3D Human Mesh Recovery (HMR) methods often produce physically implausible results, stemming from their reliance on flawed intermediate 3D pose anchors and their inability to effectively model complex spatiotemporal dynamics. To overcome these deep-rooted architectural problems, we introduce HMRMamba, a new paradigm for HMR that pioneers the use of Structured State Space Models (SSMs) for their efficiency and long-range modeling prowess. Our framework is distinguished by two core contributions. First, the Geometry-Aware Lifting Module, featuring a novel dual-scan Mamba architecture, creates a robust foundation for reconstruction. It directly grounds the 2D-to-3D pose lifting process with geometric cues from image features, producing a highly reliable 3D pose sequence that serves as a stable anchor. Second, the Motion-guided Reconstruction Network leverages this anchor to explicitly process kinematic patterns over time. By injecting this crucial temporal awareness, it significantly enhances the final mesh’s coherence and robustness, particularly under occlusion and motion blur. Comprehensive evaluations on 3DPW, MPI-INF-3DHP, and Human3.6M benchmarks confirm that HMRMamba sets a new state-of-the-art, outperforming existing methods in both reconstruction accuracy and temporal consistency while offering superior computational efficiency.


[51] Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation cs.CV | cs.LGPDF

Zihan Su, Hongyang Wei, Kangrui Cen, Yong Wang, Guanhua Chen

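TL;DR: This paper proposes UniMRG, an architecture-agnostic post-training method that strengthens the understanding capabilities of Unified Multimodal Models (UMMs) by training them to generate multiple intrinsic representations of the input image (pixel, depth, and segmentation). Experiments show that it notably improves fine-grained perception, reduces hallucinations, and improves spatial understanding while also boosting generation capabilities.

Details

Motivation: UMMs aim for understanding and generation to reinforce each other, but using generation to improve understanding remains underexplored. This paper investigates auxiliary generation tasks as a way to strengthen UMM understanding.

Result: Extensive experiments across diverse UMM architectures show that the method notably enhances fine-grained perception, reduces hallucinations, improves spatial understanding, and simultaneously boosts generation capabilities.

Insight: The novelty is to use generation of multiple intrinsic representations (pixel, depth, segmentation) as auxiliary tasks, so that the model captures complementary information about appearance, spatial relations, and structural layout and thereby gains a deeper, more comprehensive visual understanding. It is a simple yet effective post-training strategy.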

Abstract: Unified Multimodal Models (UMMs) integrate both visual understanding and generation within a single framework. Their ultimate aspiration is to create a cycle where understanding and generation mutually reinforce each other. While recent post-training methods have successfully leveraged understanding to enhance generation, the reverse direction of utilizing generation to improve understanding remains largely unexplored. In this work, we propose UniMRG (Unified Multi-Representation Generation), a simple yet effective architecture-agnostic post-training method. UniMRG enhances the understanding capabilities of UMMs by incorporating auxiliary generation tasks. Specifically, we train UMMs to generate multiple intrinsic representations of input images, namely pixel (reconstruction), depth (geometry), and segmentation (structure), alongside standard visual understanding objectives. By synthesizing these diverse representations, UMMs capture complementary information regarding appearance, spatial relations, and structural layout. Consequently, UMMs develop a deeper and more comprehensive understanding of visual inputs. Extensive experiments across diverse UMM architectures demonstrate that our method notably enhances fine-grained perception, reduces hallucinations, and improves spatial understanding, while simultaneously boosting generation capabilities.


[52] MPF-Net: Exposing High-Fidelity AI-Generated Video Forgeries via Hierarchical Manifold Deviation and Micro-Temporal Fluctuations cs.CVPDF

Xinan He, Kaiqing Lin, Yue Zhou, Jiaming Zhong, Wei Ye

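TL;DR: This paper proposes MPF-Net, a hierarchical dual-path framework for detecting high-fidelity AI-generated video forgeries. It rests on the premise that AI-generated videos are essentially products of a manifold-fitting process, so the residuals between consecutive frames exhibit a structured, homogeneous characteristic termed Manifold Projection Fluctuations (MPF).

Details

Motivation: With the rapid progress of video generation models such as Veo and Wan, the visual quality of synthetic content has made macro-level semantic errors and temporal inconsistencies no longer prominent, yet the distinction between real videos and high-fidelity fakes remains traceable. The paper aims to expose forgeries by capturing the structured pixel composition logic inherent in AI videos.

Result: The abstract gives no specific quantitative results or benchmark figures, but claims that the framework can detect subtle computational fingerprints in high-fidelity videos that successfully reside on the natural manifold (on-manifold) and evade spatial detection.

Insight: The core novelty is the concept of Manifold Projection Fluctuations (MPF) and a hierarchical dual-path filtering framework built on it: the Static Manifold Deviation Branch uses large-scale vision foundation models to capture spatial anomalies, while the Micro-Temporal Fluctuation Branch analyzes the structured MPF that persists in visually perfect sequences, giving comprehensive coverage of both global manifold deviations and subtle computational fingerprints.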

Abstract: With the rapid advancement of video generation models such as Veo and Wan, the visual quality of synthetic content has reached a level where macro-level semantic errors and temporal inconsistencies are no longer prominent. However, this does not imply that the distinction between real videos and cutting-edge high-fidelity fakes is untraceable. We argue that AI-generated videos are essentially products of a manifold-fitting process rather than a physical recording. Consequently, the pixel composition logic of the residuals between consecutive frames in AI videos exhibits a structured and homogeneous characteristic. We term this phenomenon 'Manifold Projection Fluctuations' (MPF). Driven by this insight, we propose a hierarchical dual-path framework that operates as a sequential filtering process. The first, the Static Manifold Deviation Branch, leverages the refined perceptual boundaries of Large-Scale Vision Foundation Models (VFMs) to capture residual spatial anomalies or physical violations that deviate from the natural real-world manifold (off-manifold). For the remaining high-fidelity videos that successfully reside on-manifold and evade spatial detection, we introduce the Micro-Temporal Fluctuation Branch as a secondary, fine-grained filter. By analyzing the structured MPF that persists even in visually perfect sequences, our framework ensures that forgeries are exposed regardless of whether they manifest as global real-world manifold deviations or subtle computational fingerprints.


[53] From Implicit Ambiguity to Explicit Solidity: Diagnosing Interior Geometric Degradation in Neural Radiance Fields for Dense 3D Scene Understanding cs.CVPDF

Jiangsan Zhao, Jakob Geipel, Kryzysztof Kusnierek

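TL;DR: This paper reveals Interior Geometric Degradation in Neural Radiance Fields (NeRFs) in dense, self-occluding scenes: under heavy occlusion, implicit density fields reconstruct hollow or fragmented structures, which leads to instance undercounting. To overcome this limitation, the authors propose an explicit geometric reconstruction pipeline based on Sparse Voxel Rasterization that projects 2D instance masks onto an explicit voxel grid and enforces geometric separation, substantially raising the instance recovery rate in dense scenes.

Details

Motivation: NeRFs perform well in multi-view reconstruction, but their reliability for quantitative 3D analysis in dense, self-occluding scenes remains unclear. The paper aims to diagnose the interior geometric degradation that NeRFs exhibit under heavy occlusion and to explore a more reliable geometric reconstruction approach.

Result: Experiments on synthetic datasets show that state-of-the-art mask-supervised NeRFs reach an instance recovery rate of about 89% in dense scenes, whereas the proposed explicit SVRaster approach, initialized from SfM feature geometry, raises the recovery rate to 95.8%. With degraded segmentation masks, the method recovers 43% more instances than implicit baselines.

Insight: The core novelty is diagnosing and naming the Interior Geometric Degradation failure mode of NeRFs and proposing a reconstruction pipeline that combines explicit geometric priors (SfM features) with sparse voxel rasterization. The key insight is that explicit geometric priors are necessary for reliable quantitative analysis in highly self-occluding scenes, which challenges the suitability of purely implicit representations for complex scenes.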

Abstract: Neural Radiance Fields (NeRFs) have emerged as a powerful paradigm for multi-view reconstruction, complementing classical photogrammetric pipelines based on Structure-from-Motion (SfM) and Multi-View Stereo (MVS). However, their reliability for quantitative 3D analysis in dense, self-occluding scenes remains poorly understood. In this study, we identify a fundamental failure mode of implicit density fields under heavy occlusion, which we term Interior Geometric Degradation (IGD). We show that transmittance-based volumetric optimization satisfies photometric supervision by reconstructing hollow or fragmented structures rather than solid interiors, leading to systematic instance undercounting. Through controlled experiments on synthetic datasets with increasing occlusion, we demonstrate that state-of-the-art mask-supervised NeRFs saturate at approximately 89% instance recovery in dense scenes, despite improved surface coherence and mask quality. To overcome this limitation, we introduce an explicit geometric pipeline based on Sparse Voxel Rasterization (SVRaster), initialized from SfM feature geometry. By projecting 2D instance masks onto an explicit voxel grid and enforcing geometric separation via recursive splitting, our approach preserves physical solidity and achieves a 95.8% recovery rate in dense clusters. A sensitivity analysis using degraded segmentation masks further shows that explicit SfM-based geometry is substantially more robust to supervision failure, recovering 43% more instances than implicit baselines. These results demonstrate that explicit geometric priors are a prerequisite for reliable quantitative analysis in highly self-occluding 3D scenes.


[54] MultiModal Fine-tuning with Synthetic Captions cs.CVPDF

Shohei Enomoto, Shin’ya Yamaguchi

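TL;DR: This paper proposes a novel multimodal fine-tuning approach that uses multimodal large language models (MLLMs) to generate synthetic image captions for unimodal datasets, turning them into multimodal ones and bridging the modality gap between pre-training and fine-tuning. Combining carefully designed prompts, a supervised contrastive loss, and an inference technique based on class-averaged text embeddings, it outperforms baselines on 13 image classification benchmarks, with particularly significant gains in few-shot learning.

Details

Motivation: To address a fundamental gap between pre-training and fine-tuning of deep neural networks: pre-training has shifted to multimodal learning with enhanced visual understanding, while fine-tuning remains predominantly unimodal, which limits the benefit of rich pre-trained representations.

Result: Across 13 image classification benchmarks, the approach outperforms baseline methods, with significant improvements in few-shot learning scenarios.

Insight: The novelty is a new paradigm that uses MLLM-generated synthetic captions to turn unimodal datasets into multimodal ones, together with prompts that incorporate class labels and domain context, a supervised contrastive loss, and an inference technique based on class-averaged text embeddings from multiple synthetic captions, effectively bridging the gap between multimodal pre-training and unimodal fine-tuning.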

Abstract: In this paper, we address a fundamental gap between pre-training and fine-tuning of deep neural networks: while pre-training has shifted from unimodal to multimodal learning with enhanced visual understanding, fine-tuning predominantly remains unimodal, limiting the benefits of rich pre-trained representations. To bridge this gap, we propose a novel approach that transforms unimodal datasets into multimodal ones using Multimodal Large Language Models (MLLMs) to generate synthetic image captions for fine-tuning models with a multimodal objective. Our method employs carefully designed prompts incorporating class labels and domain context to produce high-quality captions tailored for classification tasks. Furthermore, we introduce a supervised contrastive loss function that explicitly encourages clustering of same-class representations during fine-tuning, along with a new inference technique that leverages class-averaged text embeddings from multiple synthetic captions per image. Extensive experiments across 13 image classification benchmarks demonstrate that our approach outperforms baseline methods, with particularly significant improvements in few-shot learning scenarios. Our work establishes a new paradigm for dataset enhancement that effectively bridges the gap between multimodal pre-training and fine-tuning. Our code is available at https://github.com/s-enmt/MMFT.
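The class-averaged text embedding inference described above can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: it assumes caption and image embeddings come from a shared image-text encoder (toy 2D vectors stand in for them here) and that classification picks the class prototype with the highest cosine similarity. The names `class_prototypes` and `predict` are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def class_prototypes(caption_embs, labels):
    """Average the unit-normalized caption embeddings of each class, then renormalize."""
    sums, counts = {}, {}
    for emb, label in zip(caption_embs, labels):
        unit = normalize(emb)
        acc = sums.setdefault(label, [0.0] * len(unit))
        for k, x in enumerate(unit):
            acc[k] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: normalize([x / counts[label] for x in acc]) for label, acc in sums.items()}

def predict(image_emb, prototypes):
    """Return the class whose prototype has the highest cosine similarity to the image."""
    query = normalize(image_emb)
    return max(prototypes, key=lambda c: sum(a * b for a, b in zip(query, prototypes[c])))

# Hypothetical embeddings of synthetic captions, two per class.
toy_captions = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.8]]
toy_labels = ["cat", "cat", "dog", "dog"]
protos = class_prototypes(toy_captions, toy_labels)
```

Averaging several captions per class smooths out the wording of any single caption, which is one plausible reason for using multiple synthetic captions.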


[55] Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention cs.CV | cs.AI | cs.CLPDF

Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao

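TL;DR: This paper proposes Spava, a sequence-parallelism-aware approximate attention framework for long-video understanding that targets the efficiency bottleneck caused by dense computation in the prefill stage of large multimodal models. By distributing approximate attention across multiple GPUs, it reduces computation and increases parallelism, so longer videos can be processed efficiently without compressing visual embeddings.

Details

Motivation: Existing methods accelerate long-video inference by compressing visual embeddings or applying sparse attention on a single GPU, but they yield limited acceleration or degraded performance, which restricts the ability to handle longer, more complex videos.

Result: With system-level optimizations such as load balancing and fused forward passes, Spava achieves speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, respectively, without notable performance loss.

Insight: The novelty is combining sequence parallelism with approximate attention, using distributed multi-GPU computation to cut the cost of long-video inference while avoiding the performance loss caused by compressing visual embeddings. System-level optimizations further unlock the speedup potential.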

Abstract: The efficiency of long-video inference remains a critical bottleneck, mainly due to the dense computation in the prefill stage of Large Multimodal Models (LMMs). Existing methods either compress visual embeddings or apply sparse attention on a single GPU, yielding limited acceleration or degraded performance and restricting LMMs from handling longer, more complex videos. To overcome these issues, we propose Spava, a sequence-parallel framework with optimized attention that accelerates long-video inference across multiple GPUs. By distributing approximate attention, Spava reduces computation and increases parallelism, enabling efficient processing of more visual embeddings without compression and thereby improving task performance. System-level optimizations, such as load balancing and fused forward passes, further unleash the potential of Spava, delivering speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss. Code available at https://github.com/thunlp/APB


[56] Mining Forgery Traces from Reconstruction Error: A Weakly Supervised Framework for Multimodal Deepfake Temporal Localization cs.CVPDF

Midou Guo, Qilin Yin, Wei Lu, Xiangyang Luo, Rui Yang

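TL;DR: This paper proposes RT-DeepLoc, a weakly supervised temporal deepfake localization framework that identifies forged segments through the reconstruction errors of a Masked Autoencoder (MAE) trained only on authentic data, and introduces an Asymmetric Intra-video Contrastive Loss (AICL) to make localization more robust and discriminative.

Details

Motivation: Modern deepfakes have evolved into localized, intermittent manipulations that require fine-grained temporal localization, while frame-level annotation is prohibitively expensive, so weakly supervised methods that rely only on video-level labels are needed.

Result: Extensive experiments on large-scale datasets, including LAV-DF, show that RT-DeepLoc achieves state-of-the-art performance in weakly supervised temporal forgery localization.

Insight: The novelty lies in using the reconstruction error of an MAE trained only on authentic data as a fine-grained localization cue, and in introducing the AICL loss to establish a stable decision boundary, which enhances local discrimination while preserving generalization to unseen forgeries.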

Abstract: Modern deepfakes have evolved into localized and intermittent manipulations that require fine-grained temporal localization. The prohibitive cost of frame-level annotation makes weakly supervised methods, which rely only on video-level labels, a practical necessity. To this end, we propose Reconstruction-based Temporal Deepfake Localization (RT-DeepLoc), a weakly supervised temporal forgery localization framework that identifies forgeries via reconstruction errors. Our framework uses a Masked Autoencoder (MAE) trained exclusively on authentic data to learn its intrinsic spatiotemporal patterns; this allows the model to produce significant reconstruction discrepancies for forged segments, effectively providing the missing fine-grained cues for localization. To robustly leverage these indicators, we introduce a novel Asymmetric Intra-video Contrastive Loss (AICL). By focusing on the compactness of authentic features guided by these reconstruction cues, AICL establishes a stable decision boundary that enhances local discrimination while preserving generalization to unseen forgeries. Extensive experiments on large-scale datasets, including LAV-DF, demonstrate that RT-DeepLoc achieves state-of-the-art performance in weakly supervised temporal forgery localization.
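The way reconstruction errors can act as localization cues can be pictured with a simple sketch. It assumes that a per-frame reconstruction error is already available from a model trained on authentic data, and it turns those errors into segments with a fixed threshold. The thresholding rule and the name `localize_segments` are illustrative assumptions; the abstract does not describe how RT-DeepLoc converts errors into segments.

```python
def localize_segments(errors, threshold):
    """Group consecutive frames whose error exceeds the threshold into (start, end) segments."""
    segments = []
    start = None
    for t, err in enumerate(errors):
        if err > threshold and start is None:
            start = t                          # a suspicious run begins
        elif err <= threshold and start is not None:
            segments.append((start, t - 1))    # the run ended at the previous frame
            start = None
    if start is not None:                      # run extends to the last frame
        segments.append((start, len(errors) - 1))
    return segments

# Hypothetical per-frame reconstruction errors for a 7-frame clip.
toy_errors = [0.1, 0.1, 0.6, 0.7, 0.2, 0.1, 0.9]
forged = localize_segments(toy_errors, threshold=0.5)  # [(2, 3), (6, 6)]
```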


[57] Hypernetwork-Based Adaptive Aggregation for Multimodal Multiple-Instance Learning in Predicting Coronary Calcium Debulking cs.CVPDF

Kaito Shiku, Ichika Seo, Tetsuya Matoba, Rissei Hino, Yasuhiro Nakano

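TL;DR: This paper presents the first attempt to predict from CT images whether coronary artery calcifications require debulking. It formulates the task as a multiple-instance learning (MIL) problem and proposes HyperAdAgFormer, a hypernetwork-based adaptive aggregation transformer that adapts its feature aggregation strategy to each patient's tabular data. Experiments on a clinical dataset demonstrate its effectiveness.

Details

Motivation: When assessing the need for coronary calcium debulking, physicians adjust their decision criteria according to each patient's tabular data, and conventional MIL methods struggle to accommodate this kind of per-patient adaptation.

Result: Experiments on a clinical dataset show that HyperAdAgFormer is effective, but the abstract reports no specific quantitative results or comparison with the state of the art.

Insight: The novelty is feeding tabular data as a conditioning input through a hypernetwork that dynamically modulates feature aggregation in MIL, enabling patient-adaptive medical image analysis. The idea may extend to other personalized medicine tasks that combine multimodal data.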

Abstract: In this paper, we present the first attempt to estimate the necessity of debulking coronary artery calcifications from computed tomography (CT) images. We formulate this task as a Multiple-instance Learning (MIL) problem. The difficulty of this task lies in that physicians adjust their focus and decision criteria for device usage according to tabular data representing each patient’s condition. To address this issue, we propose a hypernetwork-based adaptive aggregation transformer (HyperAdAgFormer), which adaptively modifies the feature aggregation strategy for each patient based on tabular data through a hypernetwork. The experiments using the clinical dataset demonstrated the effectiveness of HyperAdAgFormer. The code is publicly available at https://github.com/Shiku-Kaito/HyperAdAgFormer.


[58] Vision KAN: Towards an Attention-Free Backbone for Vision with Kolmogorov-Arnold Networks cs.CVPDF

Zhuoqin Yang, Jiansong Zhang, Xiaoling Luo, Xu Wu, Zheng Lu

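TL;DR: This paper proposes Vision KAN (ViK), an attention-free vision backbone based on Kolmogorov-Arnold Networks (KANs). At its core is the MultiPatch-RBFKAN module, which combines a patch-wise nonlinear transform with Radial Basis Function-based KANs, axis-wise separable mixing for efficient local propagation, and low-rank global mapping for long-range interaction, achieving image classification performance competitive with attention at linear complexity.

Details

Motivation: Attention mechanisms can model long-range dependencies, but their quadratic complexity limits scalability and attention weights are hard to interpret. Recent work shows that strong performance is achievable without pairwise attention, so this paper explores an efficient, interpretable alternative grounded in KAN theory.

Result: Experiments on ImageNet-1K show that ViK achieves competitive classification accuracy while retaining linear complexity.

Insight: The novelty is bringing KAN theory into vision backbone design through MultiPatch-RBFKAN, a unified token mixer that uses a patch-wise grouping strategy and lightweight operators to tame the cost of full KANs on high-resolution features, offering a theoretically grounded and efficient alternative for attention-free architectures.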

Abstract: Attention mechanisms have become a key module in modern vision backbones due to their ability to model long-range dependencies. However, their quadratic complexity in sequence length and the difficulty of interpreting attention weights limit both scalability and clarity. Recent attention-free architectures demonstrate that strong performance can be achieved without pairwise attention, motivating the search for alternatives. In this work, we introduce Vision KAN (ViK), an attention-free backbone inspired by the Kolmogorov-Arnold Networks. At its core lies MultiPatch-RBFKAN, a unified token mixer that combines (a) patch-wise nonlinear transform with Radial Basis Function-based KANs, (b) axis-wise separable mixing for efficient local propagation, and (c) low-rank global mapping for long-range interaction. Employed as a drop-in replacement for attention modules, this formulation tackles the prohibitive cost of full KANs on high-resolution features by adopting a patch-wise grouping strategy with lightweight operators to restore cross-patch dependencies. Experiments on ImageNet-1K show that ViK achieves competitive accuracy with linear complexity, demonstrating the potential of KAN-based token mixing as an efficient and theoretically grounded alternative to attention.
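For readers unfamiliar with RBF-based KANs, the generic building block can be sketched as below: each input-output edge carries a learnable univariate function expressed as a weighted sum of Gaussian radial basis functions, and an output sums its incoming edge functions. This shows the general RBF-KAN idea only; it is not the MultiPatch-RBFKAN module, whose details the abstract does not give. The names `rbf_basis` and `rbf_kan_layer` and the weight layout are hypothetical.

```python
import math

def rbf_basis(x, centers, width):
    """Evaluate Gaussian radial basis functions exp(-((x - c) / width)^2) at a scalar x."""
    return [math.exp(-((x - c) / width) ** 2) for c in centers]

def rbf_kan_layer(x, weights, centers, width):
    """KAN-style layer: y_j = sum_i phi_ji(x_i), with phi_ji a weighted sum of RBFs.

    weights[j][i][k] is the coefficient of basis k on the edge from input i to output j.
    """
    outputs = []
    for edge_weights in weights:
        y = 0.0
        for x_i, coeffs in zip(x, edge_weights):
            y += sum(w * b for w, b in zip(coeffs, rbf_basis(x_i, centers, width)))
        outputs.append(y)
    return outputs

# Hypothetical layer: one input, two outputs, three basis centers.
toy_centers = [-1.0, 0.0, 1.0]
toy_weights = [[[0.0, 1.0, 0.0]], [[1.0, 0.0, 1.0]]]
toy_out = rbf_kan_layer([0.0], toy_weights, toy_centers, width=1.0)  # [1.0, 2/e]
```

Unlike an MLP, which applies a fixed activation after a linear map, the nonlinearity here sits on every edge and is itself learned through the basis coefficients.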


[59] WMVLM: Evaluating Diffusion Model Image Watermarking via Vision-Language Models cs.CVPDF

Zijin Yang, Yu Sun, Kejiang Chen, Jiawei Zhao, Jun Jiang

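TL;DR: This paper proposes WMVLM, the first unified and interpretable evaluation framework for diffusion model image watermarking based on vision-language models (VLMs). It targets the shortcomings of existing methods when evaluating residual and semantic watermarks: no unified framework, limited interpretability, incomplete security considerations, and inappropriate metrics.

Details

Motivation: Existing evaluation methods for diffusion model image watermarking have significant limitations, including the lack of a unified framework, uninterpretable results, neglect of comprehensive security, and inappropriate metrics for semantic watermarks. WMVLM aims to fill these gaps.

Result: Experiments show that WMVLM generalizes strongly across datasets, diffusion models, and watermarking methods, and outperforms state-of-the-art VLMs.

Insight: The main novelties are redefined quality and security metrics for each watermark type (artifact strength and erasure resistance for residual watermarks, latent distribution shifts for semantic watermarks) and a progressive three-stage training strategy that lets the model achieve classification, scoring, and interpretable text generation in turn, yielding a unified evaluation framework with interpretable output.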

Abstract: Digital watermarking is essential for securing generated images from diffusion models. Accurate watermark evaluation is critical for algorithm development, yet existing methods have significant limitations: they lack a unified framework for both residual and semantic watermarks, provide results without interpretability, neglect comprehensive security considerations, and often use inappropriate metrics for semantic watermarks. To address these gaps, we propose WMVLM, the first unified and interpretable evaluation framework for diffusion model image watermarking via vision-language models (VLMs). We redefine quality and security metrics for each watermark type: residual watermarks are evaluated by artifact strength and erasure resistance, while semantic watermarks are assessed through latent distribution shifts. Moreover, we introduce a three-stage training strategy to progressively enable the model to achieve classification, scoring, and interpretable text generation. Experiments show WMVLM outperforms state-of-the-art VLMs with strong generalization across datasets, diffusion models, and watermarking methods.


[60] PathReasoner-R1: Instilling Structured Reasoning into Pathology Vision-Language Model via Knowledge-Guided Policy Optimization cs.CVPDF

Songhan Jiang, Fengchun Liu, Ziyue Wang, Linghan Cai, Yongbing Zhang

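TL;DR: This paper proposes PathReasoner-R1, a framework that instills structured reasoning into pathology vision-language models. It builds PathReasoner, the first large-scale whole-slide image reasoning dataset, with a knowledge-guided generation pipeline, and combines trajectory-masked supervised fine-tuning with reasoning-oriented reinforcement learning so that the model produces verifiable, evidence-based reasoning chains, improving clinical trustworthiness.

Details

Motivation: Current pathology vision-language models map visual input directly to diagnostic conclusions without a verifiable, evidence-linked reasoning process, which severely limits clinical trust and hinders expert error correction.

Result: On the PathReasoner dataset and multiple public benchmarks, PathReasoner-R1 achieves state-of-the-art performance across various image scales.

Insight: The core novelties are: 1) the first large-scale, knowledge-guided structured reasoning dataset for pathology WSIs; 2) a training framework that combines trajectory-masked supervised fine-tuning with reasoning-oriented reinforcement learning; 3) a knowledge-aware multi-granular reward function (including an Entity Reward mechanism) strictly aligned with knowledge graphs, which guides the model to optimize for logical consistency rather than mere outcome matching and thereby improves robustness.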

Abstract: Vision-Language Models (VLMs) are advancing computational pathology with superior visual understanding capabilities. However, current systems often reduce diagnosis to directly outputting conclusions without verifiable evidence-linked reasoning, which severely limits clinical trust and hinders expert error rectification. To address these barriers, we construct PathReasoner, the first large-scale dataset of whole-slide image (WSI) reasoning. Unlike previous work reliant on unverified distillation, we develop a rigorous knowledge-guided generation pipeline. By leveraging medical knowledge graphs, we explicitly align structured pathological findings and clinical reasoning with diagnoses, generating over 20K high-quality instructional samples. Based on this dataset, we propose PathReasoner-R1, which synergizes trajectory-masked supervised fine-tuning with reasoning-oriented reinforcement learning to instill structured chain-of-thought capabilities. To ensure medical rigor, we engineer a knowledge-aware multi-granular reward function incorporating an Entity Reward mechanism strictly aligned with knowledge graphs. This effectively guides the model to optimize for logical consistency rather than mere outcome matching, thereby enhancing robustness. Extensive experiments demonstrate that PathReasoner-R1 achieves state-of-the-art performance on both PathReasoner and public benchmarks across various image scales, equipping pathology models with transparent, clinically grounded reasoning capabilities. Dataset and code are available at https://github.com/cyclexfy/PathReasoner-R1.


[61] Similarity of Processing Steps in Vision Model Representations cs.CVPDF

Matéo Mahaut, Marco Baroni

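TL;DR: This paper studies whether different vision models converge to similar intermediate processing steps and operations, not just to similar final representations. By quantifying distances between model representations at different processing stages, it analyzes how CNN and transformer models differ in their information-processing paths.

Details

Motivation: Prior work suggests that large models tend to converge to similar "universal" representations, but it is unclear whether models reach them through the same intermediate steps. The paper investigates convergence throughout processing across vision models, in particular the similarity of intermediate steps.

Result: Although layers at similar positions in different models have the most similar representations, significant differences remain: classifier models discard low-level image statistics in their final layers, and transformer models change representations more smoothly from layer to layer than CNNs. These findings are supported by the quantitative distance analysis.

Insight: The novelty is analyzing model convergence from the perspective of processing-step dynamics, which reveals fundamental differences between CNN and transformer architectures in their information-processing paths and offers a more qualitative view of the underlying processes in vision models.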

Abstract: Recent literature suggests that the bigger the model, the more likely it is to converge to similar, "universal" representations, despite different training objectives, datasets, or modalities. While this literature shows that there is an area where model representations are similar, we study here how vision models might get to those representations – in particular, do they also converge to the same intermediate steps and operations? We therefore study the processes that lead to convergent representations in different models. First, we quantify distance between different model representations at different stages. We follow the evolution of distances between models throughout processing, identifying the processing steps which are most different between models. We find that while layers at similar positions in different models have the most similar representations, strong differences remain. Classifier models, unlike the others, will discard information about low-level image statistics in their final layers. CNN- and transformer-based models also behave differently, with transformer models applying smoother changes to representations from one layer to the next. These distinctions clarify the level and nature of convergence between model representations, and enable a more qualitative account of the underlying processes in image models.


[62] RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning cs.CVPDF

Shiqi Huang, Shuting He, Bihan Wen

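TL;DR: This paper proposes RSGround-R1, a reasoning-guided, position-aware post-training framework that addresses the challenges posed by the vast spatial scale and high semantic ambiguity of remote sensing visual grounding. Through Chain-of-Thought supervised fine-tuning and reinforcement fine-tuning, combined with a newly designed positional reward and spatial consistency guided optimization, it progressively strengthens the spatial reasoning of multimodal large language models and shows superior performance and generalization on RSVG benchmarks.

Details

Motivation: In remote sensing visual grounding, the large spatial scale and strong semantic ambiguity of scenes make descriptions rely heavily on positional cues, which poses unique challenges for the spatial reasoning of multimodal large language models.

Result: Extensive experiments on RSVG benchmarks show that the model achieves superior performance and generalization.

Insight: The novelties include the reasoning-guided, position-aware post-training framework RSGround-R1; Chain-of-Thought supervised fine-tuning to establish explicit position awareness; a new positional reward for reinforcement fine-tuning; and a spatial consistency guided optimization scheme that ensures stable convergence. Viewed objectively, the work combines chain-of-thought reasoning with reinforcement learning and designs rewards and optimization mechanisms tailored to the spatial characteristics of remote sensing scenes, which improves model performance on complex spatial reasoning tasks.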

Abstract: Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions. Owing to the vast spatial scale and high semantic ambiguity of remote sensing scenes, these descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning. To leverage this unique feature, we propose a reasoning-guided, position-aware post-training framework, dubbed RSGround-R1, to progressively enhance spatial understanding. Specifically, we first introduce Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) using synthetically generated RSVG reasoning data to establish explicit position awareness. Reinforcement Fine-Tuning (RFT) is then applied, augmented by our newly designed positional reward that provides continuous and distance-aware guidance toward accurate localization. Moreover, to mitigate incoherent localization behaviors across rollouts, we introduce a spatial consistency guided optimization scheme that dynamically adjusts policy updates based on their spatial coherence, ensuring stable and robust convergence. Extensive experiments on RSVG benchmarks demonstrate superior performance and generalization of our model.
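A continuous, distance-aware reward of the kind mentioned above could look like the sketch below. The abstract gives no formula, so this is purely an assumed form: the reward decays exponentially with the distance between the predicted and ground-truth box centers, normalized by the image diagonal. The function `positional_reward` and the scale `sigma` are hypothetical.

```python
import math

def box_center(box):
    """Center of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def positional_reward(pred_box, gt_box, img_w, img_h, sigma=0.1):
    """Reward in (0, 1]: 1 for coincident centers, decaying smoothly with distance."""
    (px, py), (gx, gy) = box_center(pred_box), box_center(gt_box)
    # Normalize by the image diagonal so the reward is independent of image size.
    dist = math.hypot(px - gx, py - gy) / math.hypot(img_w, img_h)
    return math.exp(-dist / sigma)

# Hypothetical boxes in a 1000 x 1000 aerial image.
gt = (100, 100, 200, 200)
near = positional_reward((110, 105, 210, 205), gt, 1000, 1000)
far = positional_reward((600, 600, 700, 700), gt, 1000, 1000)
```

Unlike a binary IoU-threshold reward, a signal of this kind still distinguishes a near miss from a distant one, which is the property the abstract describes as continuous and distance-aware.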


[63] OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models cs.CVPDF

Yufeng Zhong, Lei Chen, Xuanle Zhao, Wenkang Han, Liming Zheng

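TL;DR: This paper proposes OCRVerse, the first end-to-end holistic OCR method, which unifies text-centric OCR (such as document recognition) and vision-centric OCR (visually information-dense images such as charts and web pages). With data engineering that spans multiple domains and a two-stage SFT-RL multi-domain training strategy, it achieves competitive performance on both task types.

Details

Motivation: Existing OCR methods focus mainly on text recognition and neglect the extraction of visual elements from visually information-dense images such as charts and web pages, even though such images are widespread on the internet and have significant application value.

Result: Experiments show that OCRVerse achieves competitive results on both text-centric and vision-centric data types, comparable to large-scale open-source and closed-source models.

Insight: The novelty is the first end-to-end holistic OCR framework, which achieves cross-domain fusion and avoids data conflicts through a two-stage SFT-RL training strategy: SFT mixes cross-domain data to establish initial knowledge, and RL designs personalized rewards for the characteristics of each domain.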

Abstract: The development of large vision language models drives the demand for managing and applying massive amounts of multimodal data, making OCR technology, which extracts information from visual images, increasingly popular. However, existing OCR methods primarily focus on recognizing text elements from images or scanned documents (Text-centric OCR), neglecting the identification of visual elements from visually information-dense image sources (Vision-centric OCR), such as charts, web pages and science plots. In reality, these visually information-dense images are widespread on the internet and have significant real-world application value, such as data visualization and web page analysis. In this technical report, we propose OCRVerse, the first holistic OCR method in an end-to-end manner that enables unified text-centric OCR and vision-centric OCR. To this end, we construct comprehensive data engineering to cover a wide range of text-centric documents, such as newspapers, magazines and books, as well as vision-centric rendered composites, including charts, web pages and scientific plots. Moreover, we propose a two-stage SFT-RL multi-domain training method for OCRVerse. SFT directly mixes cross-domain data to train and establish initial domain knowledge, while RL focuses on designing personalized reward strategies for the characteristics of each domain. Specifically, since different domains require various output formats and expected outputs, we provide sufficient flexibility in the RL stage to customize reward signals for each domain, thereby improving cross-domain fusion and avoiding data conflicts. Experimental results demonstrate the effectiveness of OCRVerse, achieving competitive results across text-centric and vision-centric data types, even comparable to large-scale open-source and closed-source models.


[64] CAF-Mamba: Mamba-Based Cross-Modal Adaptive Attention Fusion for Multimodal Depression Detection cs.CV | cs.CY | cs.HCPDF

Bowen Zhou, Marc-André Fiedler, Ayoub Al-Hamadi

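TL;DR: This paper proposes CAF-Mamba, a novel Mamba-based framework for multimodal depression detection. Its cross-modal adaptive attention fusion mechanism captures cross-modal interactions both explicitly and implicitly and dynamically adjusts each modality's contribution, overcoming the limitations of existing methods: limited feature types, neglect of explicit cross-modal interactions, and simple fusion by concatenation or static weighting.

Details

Motivation: Most existing deep learning approaches to depression detection rely on limited feature types, overlook explicit cross-modal interactions, and fuse features by simple concatenation or static weighting, which limits performance. The paper aims to resolve these issues and achieve more effective multimodal fusion.

Result: Experiments on two in-the-wild benchmark datasets, LMVD and D-Vlog, show that CAF-Mamba consistently outperforms existing methods and achieves state-of-the-art (SOTA) performance.

Insight: The main novelty is bringing the Mamba architecture to multimodal depression detection and designing a cross-modal adaptive attention fusion mechanism that combines explicit and implicit interactions and dynamically learns and adjusts the contribution of each modality, offering a new approach to multimodal fusion.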

Abstract: Depression is a prevalent mental health disorder that severely impairs daily functioning and quality of life. While recent deep learning approaches for depression detection have shown promise, most rely on limited feature types, overlook explicit cross-modal interactions, and employ simple concatenation or static weighting for fusion. To overcome these limitations, we propose CAF-Mamba, a novel Mamba-based cross-modal adaptive attention fusion framework. CAF-Mamba not only captures cross-modal interactions explicitly and implicitly, but also dynamically adjusts modality contributions through a modality-wise attention mechanism, enabling more effective multimodal fusion. Experiments on two in-the-wild benchmark datasets, LMVD and D-Vlog, demonstrate that CAF-Mamba consistently outperforms existing methods and achieves state-of-the-art performance.


[65] When Gradient Optimization Is Not Enough: $\dagger$ Dispersive and Anchoring Geometric Regularizer for Multimodal Learning cs.CV | cs.LGPDF

Zixuan Xia, Hao Wang, Pengcheng Weng, Yanyu Qian, Yangxin Xu

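TL;DR: This paper argues that strong optimization alone does not guarantee well-structured representations in multimodal learning: models often exhibit geometric pathologies such as intra-modal representation collapse and sample-level cross-modal inconsistency. The authors therefore propose a lightweight geometry-aware regularization framework that constrains intermediate embeddings with an intra-modal dispersive regularization and an inter-modal anchoring regularization to improve representation geometry.

Details

Motivation: Even under balanced training schemes, multimodal models suffer from geometric pathologies in their representations, such as intra-modal representation collapse and cross-modal inconsistency, which degrade unimodal robustness and multimodal fusion.

Result: Extensive experiments on multiple multimodal benchmarks show that the method consistently improves both multimodal and unimodal performance and effectively mitigates modality trade-offs.

Insight: The novelty is treating representation geometry as a control axis in multimodal learning and proposing a plug-and-play complementary regularization framework that needs no architectural changes and explicitly regulates geometry through dispersive and anchoring constraints rather than enforcing rigid alignment.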

Abstract: Multimodal learning aims to integrate complementary information from heterogeneous modalities, yet strong optimization alone does not guarantee well-structured representations. Even under carefully balanced training schemes, multimodal models often exhibit geometric pathologies, including intra-modal representation collapse and sample-level cross-modal inconsistency, which degrade both unimodal robustness and multimodal fusion. We identify representation geometry as a missing control axis in multimodal learning and propose a lightweight geometry-aware regularization framework. The framework enforces two complementary constraints on intermediate embeddings: an intra-modal dispersive regularization that promotes representation diversity, and an inter-modal anchoring regularization that bounds sample-level cross-modal drift without rigid alignment. The proposed regularizer is plug-and-play, requires no architectural modifications, and is compatible with various training paradigms. Extensive experiments across multiple multimodal benchmarks demonstrate consistent improvements in both multimodal and unimodal performance, showing that explicitly regulating representation geometry effectively mitigates modality trade-offs.
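The two constraints described above can be made concrete with a rough sketch. The abstract does not give the loss formulas, so the forms below are assumptions: the dispersive term is taken as the mean pairwise cosine similarity within one modality (high when representations collapse), and the anchoring term as a hinge on the distance between paired embeddings from two modalities, which penalizes only drift beyond a margin and so does not force exact alignment. All function names and the margin value are hypothetical.

```python
import math
from itertools import combinations

def unit(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dispersive_loss(embs):
    """Mean pairwise cosine similarity within one modality; lower means more spread out."""
    units = [unit(e) for e in embs]
    sims = [sum(a * b for a, b in zip(u, v)) for u, v in combinations(units, 2)]
    return sum(sims) / len(sims)

def anchoring_loss(embs_a, embs_b, margin):
    """Hinge on paired cross-modal distance: zero while the drift stays within the margin."""
    total = 0.0
    for a, b in zip(embs_a, embs_b):
        ua, ub = unit(a), unit(b)
        drift = math.sqrt(sum((x - y) ** 2 for x, y in zip(ua, ub)))
        total += max(0.0, drift - margin)
    return total / len(embs_a)

# Hypothetical embeddings: a collapsed modality versus a well-spread one.
collapsed = dispersive_loss([[1.0, 1.0], [2.0, 2.0]])   # close to 1.0
spread = dispersive_loss([[1.0, 0.0], [0.0, 1.0]])      # 0.0
```

Both terms would be added to the task loss with small weights, which fits the abstract's description of a plug-and-play regularizer that leaves the architecture unchanged.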


[66] Multimodal Visual Surrogate Compression for Alzheimer’s Disease Classification cs.CVPDF

Dexuan Ding, Ciyuan Peng, Endrowednes Kuantama, Jingcai Guo, Jia Wu

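TL;DR: This paper proposes Multimodal Visual Surrogate Compression (MVSC), a new method for Alzheimer’s disease (AD) classification. It compresses and adapts high-dimensional 3D structural MRI (sMRI) volumes into compact 2D visual surrogate features that align better with frozen 2D foundation models, which then extract strong representations for the final classification.

Details

Motivation: Existing sMRI representation learning methods suffer from high computational cost, loss of cross-slice relations, or limited discriminative power; MVSC aims to address these challenges.

Result: Extensive experiments on three large-scale Alzheimer’s disease benchmarks show that MVSC outperforms state-of-the-art methods on both binary and multi-class classification tasks.

Insight: The novelty lies in using a text-guided global context encoder and a text-enhanced slice fusion module to compress 3D medical images into 2D surrogates, so that the strong representations of pretrained 2D foundation models can be exploited. This is a new multimodal compression and adaptation strategy.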

Abstract: High-dimensional structural MRI (sMRI) images are widely used for Alzheimer’s Disease (AD) diagnosis. Most existing methods for sMRI representation learning rely on 3D architectures (e.g., 3D CNNs), slice-wise feature extraction with late aggregation, or training-free feature extraction using 2D foundation models (e.g., DINO). However, these three paradigms suffer from high computational cost, loss of cross-slice relations, and limited ability to extract discriminative features, respectively. To address these challenges, we propose Multimodal Visual Surrogate Compression (MVSC). It learns to compress and adapt large 3D sMRI volumes into compact 2D features, termed visual surrogates, which are better aligned with frozen 2D foundation models to extract powerful representations for final AD classification. MVSC has two key components: a Volume Context Encoder that captures global cross-slice context under textual guidance, and an Adaptive Slice Fusion module that aggregates slice-level information in a text-enhanced, patch-wise manner. Extensive experiments on three large-scale Alzheimer’s disease benchmarks demonstrate that MVSC performs favourably on both binary and multi-class classification tasks compared with state-of-the-art methods.


[67] ChartE$^{3}$: A Comprehensive Benchmark for End-to-End Chart Editing cs.CVPDF

Shuo Li, Jiajun Sun, Zhekai Wang, Xiaoran Fan, Hui Li

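TL;DR: This paper presents ChartE$^{3}$, a comprehensive benchmark for end-to-end chart editing that directly evaluates a model’s ability to edit charts without relying on intermediate natural language programs or code-level supervision. The benchmark contains over 1,200 high-quality samples covering two dimensions, local editing (such as font or color adjustments) and global editing (such as data filtering and trend line addition), and evaluates from both objective and subjective perspectives.

Details

Motivation: Most existing chart editing methods use pipeline-based designs that rely on natural language or code as an intermediate representation, which makes it hard to execute complex edits faithfully. A benchmark that directly evaluates end-to-end chart editing is therefore needed.

Result: Extensive benchmarking of state-of-the-art multimodal large language models reveals substantial performance gaps on global editing tasks, highlighting critical limitations in current end-to-end chart editing capabilities.

Insight: The novelty lies in an end-to-end chart editing benchmark that does not depend on intermediate representations. High-quality samples are built through a carefully designed data pipeline with human curation, supporting multi-dimensional evaluation from local to global edits and providing a more direct measure of model capability.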

Abstract: Charts are a fundamental visualization format for structured data analysis. Enabling end-to-end chart editing according to user intent is of great practical value, yet remains challenging due to the need for both fine-grained control and global structural consistency. Most existing approaches adopt pipeline-based designs, where natural language or code serves as an intermediate representation, limiting their ability to faithfully execute complex edits. We introduce ChartE$^{3}$, an End-to-End Chart Editing benchmark that directly evaluates models without relying on intermediate natural language programs or code-level supervision. ChartE$^{3}$ focuses on two complementary editing dimensions: local editing, which involves fine-grained appearance changes such as font or color adjustments, and global editing, which requires holistic, data-centric transformations including data filtering and trend line addition. ChartE$^{3}$ contains over 1,200 high-quality samples constructed via a well-designed data pipeline with human curation. Each sample is provided as a triplet of a chart image, its underlying code, and a multimodal editing instruction, enabling evaluation from both objective and subjective perspectives. Extensive benchmarking of state-of-the-art multimodal large language models reveals substantial performance gaps, particularly on global editing tasks, highlighting critical limitations in current end-to-end chart editing capabilities.


[68] DreamActor-M2: Universal Character Image Animation via Spatiotemporal In-Context Learning cs.CV | cs.AIPDF

Mingshuang Luo, Shuang Liang, Zhengkun Rong, Yuxuan Luo, Tianshu Hu

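TL;DR: DreamActor-M2 is a universal character image animation framework. Through spatiotemporal in-context learning, it fuses the appearance of a reference image and the motion cues of a driving video into a unified latent space, achieving high-fidelity video synthesis without explicit pose priors such as skeletons.

Details

Motivation: Existing character animation methods face two fundamental challenges: a trade-off between identity preservation and motion consistency (the “see-saw” effect), and an over-reliance on explicit pose priors that prevents generalization to arbitrary non-humanoid characters.

Result: Extensive experiments on the proposed AW Bench, a benchmark covering diverse character types and motion scenarios, show that DreamActor-M2 achieves state-of-the-art performance with superior visual fidelity and strong cross-domain generalization.

Insight: The core innovation is to model motion conditioning as an in-context learning problem within a two-stage paradigm: (1) fusing appearance and motion cues in a unified latent space; (2) generating pseudo cross-identity training pairs through a self-bootstrapped data synthesis pipeline. This moves from pose-dependent control to end-to-end RGB-driven animation and markedly improves generalization.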

Abstract: Character image animation aims to synthesize high-fidelity videos by transferring motion from a driving sequence to a static reference image. Despite recent advancements, existing methods suffer from two fundamental challenges: (1) suboptimal motion injection strategies that lead to a trade-off between identity preservation and motion consistency, manifesting as a “see-saw” effect, and (2) an over-reliance on explicit pose priors (e.g., skeletons), which inadequately capture intricate dynamics and hinder generalization to arbitrary, non-humanoid characters. To address these challenges, we present DreamActor-M2, a universal animation framework that reimagines motion conditioning as an in-context learning problem. Our approach follows a two-stage paradigm. First, we bridge the input modality gap by fusing reference appearance and motion cues into a unified latent space, enabling the model to jointly reason about spatial identity and temporal dynamics by leveraging the generative prior of foundational models. Second, we introduce a self-bootstrapped data synthesis pipeline that curates pseudo cross-identity training pairs, facilitating a seamless transition from pose-dependent control to direct, end-to-end RGB-driven animation. This strategy significantly enhances generalization across diverse characters and motion scenarios. To facilitate comprehensive evaluation, we further introduce AW Bench, a versatile benchmark encompassing a wide spectrum of character types and motion scenarios. Extensive experiments demonstrate that DreamActor-M2 achieves state-of-the-art performance, delivering superior visual fidelity and robust cross-domain generalization. Project Page: https://grisoon.github.io/DreamActor-M2/


[69] Dynamic Topology Awareness: Breaking the Granularity Rigidity in Vision-Language Navigation cs.CVPDF

Jiankun Peng, Jianyuan Guo, Ying Xu, Yue Liu, Jiashuang Yan

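TL;DR: This paper proposes DGNav, a framework that addresses the “Granularity Rigidity” problem of topological planning methods in Vision-Language Navigation in Continuous Environments (VLN-CE). It dynamically adjusts graph construction thresholds through a scene-aware adaptive strategy and uses a Dynamic Graph Transformer to fuse multimodal information and reconstruct graph connectivity, refining the topology on demand in complex environments.

Details

Motivation: Existing navigation methods based on explicit topological maps sample nodes with fixed geometric thresholds and cannot adapt to varying environmental complexity. They over-sample in simple areas, causing computational redundancy, and under-sample in high-uncertainty regions, increasing collision risk. This is the “Granularity Rigidity” problem.

Result: Extensive experiments on the R2R-CE and RxR-CE benchmarks show that DGNav delivers superior navigation performance and strong generalization. Ablation studies confirm that the framework achieves an optimal trade-off between navigation efficiency and safe exploration.

Insight: The core innovation is a context-aware mechanism that modulates map density and connectivity on the fly. First, a scene-aware adaptive strategy based on the dispersion of predicted waypoints enables “densification on demand”. Second, a Dynamic Graph Transformer fuses visual, linguistic, and geometric cues into dynamic edge weights, filtering out topological noise and improving instruction adherence.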

Abstract: Vision-Language Navigation in Continuous Environments (VLN-CE) presents a core challenge: grounding high-level linguistic instructions into precise, safe, and long-horizon spatial actions. Explicit topological maps have proven to be a vital solution for providing robust spatial memory in such tasks. However, existing topological planning methods suffer from a “Granularity Rigidity” problem. Specifically, these methods typically rely on fixed geometric thresholds to sample nodes, which fails to adapt to varying environmental complexities. This rigidity leads to a critical mismatch: the model tends to over-sample in simple areas, causing computational redundancy, while under-sampling in high-uncertainty regions, increasing collision risks and compromising precision. To address this, we propose DGNav, a framework for Dynamic Topological Navigation, introducing a context-aware mechanism to modulate map density and connectivity on-the-fly. Our approach comprises two core innovations: (1) A Scene-Aware Adaptive Strategy that dynamically modulates graph construction thresholds based on the dispersion of predicted waypoints, enabling “densification on demand” in challenging environments; (2) A Dynamic Graph Transformer that reconstructs graph connectivity by fusing visual, linguistic, and geometric cues into dynamic edge weights, enabling the agent to filter out topological noise and enhancing instruction adherence. Extensive experiments on the R2R-CE and RxR-CE benchmarks demonstrate DGNav exhibits superior navigation performance and strong generalization capabilities. Furthermore, ablation studies confirm that our framework achieves an optimal trade-off between navigation efficiency and safe exploration. The code is available at https://github.com/shannanshouyin/DGNav.
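
The scene-aware adaptive strategy can be illustrated with a small sketch. The abstract only says that graph construction thresholds are modulated by the dispersion of predicted waypoints; everything else here is an assumption made for illustration: dispersion is measured as the mean distance of 2D waypoints from their centroid, and the node-merging threshold shrinks linearly as dispersion grows, so that the graph becomes denser in more uncertain scenes. The names `waypoint_dispersion`, `adaptive_threshold`, `t_min`, `t_max`, and `scale` are hypothetical.

```python
import math


def waypoint_dispersion(waypoints):
    """Mean Euclidean distance of 2D waypoints from their centroid."""
    n = len(waypoints)
    cx = sum(x for x, _ in waypoints) / n
    cy = sum(y for _, y in waypoints) / n
    return sum(math.hypot(x - cx, y - cy) for x, y in waypoints) / n


def adaptive_threshold(waypoints, t_min=0.25, t_max=1.0, scale=2.0):
    """Map dispersion to a node-merging threshold (assumed linear mapping).

    Low dispersion  -> threshold near t_max (coarse graph, fewer nodes).
    High dispersion -> threshold near t_min (dense graph, more nodes).
    """
    ratio = min(waypoint_dispersion(waypoints) / scale, 1.0)
    return t_max - ratio * (t_max - t_min)
```

A fixed-threshold planner corresponds to `t_min == t_max`, which is the rigidity the paper argues against.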


[70] Synthetic-to-Real Domain Bridging for Single-View 3D Reconstruction of Ships for Maritime Monitoring cs.CV | cs.AI | cs.GRPDF

Borja Carrillo-Perez, Felix Sattler, Angel Bueno Rodriguez, Maurice Stephan, Sarah Barnes

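TL;DR: This paper presents an efficient single-view 3D ship reconstruction pipeline for maritime monitoring. The method is trained entirely on synthetic data and needs only a single real image at inference. It uses the Splatter Image network, which represents objects as sparse sets of 3D Gaussians for fast reconstruction, bridges the gap between synthetic and real images with a YOLOv8 segmentation module and custom preprocessing, and provides interactive 3D visualization through postprocessing.

Details

Motivation: Existing 3D reconstruction methods typically require multi-view supervision or annotated 3D ground truth, or are computationally intensive, which makes real-time deployment in maritime monitoring difficult. An efficient single-view reconstruction approach that needs no real 3D annotations is therefore required.

Result: Quantitative evaluation on synthetic validation data shows high reconstruction fidelity. Qualitative results on real maritime images from the ShipSG dataset confirm the potential for transfer to operational maritime settings, and the system provides interactive 3D inspection without real 3D annotations.

Insight: The contributions include a domain-bridging strategy that trains entirely on synthetic data and transfers to real scenes, fast single-view reconstruction with the Splatter Image network, and an integrated segmentation and postprocessing pipeline for practical deployment. Viewed objectively, the work demonstrates the feasibility of synthetic-data training for real monitoring tasks and offers a scalable solution for real-time 3D visualization.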

Abstract: Three-dimensional (3D) reconstruction of ships is an important part of maritime monitoring, allowing improved visualization, inspection, and decision-making in real-world monitoring environments. However, most state-of-the-art 3D reconstruction methods require multi-view supervision, annotated 3D ground truth, or are computationally intensive, making them impractical for real-time maritime deployment. In this work, we present an efficient pipeline for single-view 3D reconstruction of real ships by training entirely on synthetic data and requiring only a single view at inference. Our approach uses the Splatter Image network, which represents objects as sparse sets of 3D Gaussians for rapid and accurate reconstruction from single images. The model is first fine-tuned on synthetic ShapeNet vessels and further refined with a diverse custom dataset of 3D ships, bridging the domain gap between synthetic and real-world imagery. We integrate a state-of-the-art segmentation module based on YOLOv8 and custom preprocessing to ensure compatibility with the reconstruction network. Postprocessing steps include real-world scaling, centering, and orientation alignment, followed by georeferenced placement on an interactive web map using AIS metadata and homography-based mapping. Quantitative evaluation on synthetic validation data demonstrates strong reconstruction fidelity, while qualitative results on real maritime images from the ShipSG dataset confirm the potential for transfer to operational maritime settings. The final system provides interactive 3D inspection of real ships without requiring real-world 3D annotations. This pipeline provides an efficient, scalable solution for maritime monitoring and highlights a path toward real-time 3D ship visualization in practical applications. Interactive demo: https://dlr-mi.github.io/ship3d-demo/.


[71] CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models cs.CVPDF

Junming Huang, Weiwei Xu

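TL;DR: This paper proposes CG-MLLM, a novel multimodal large language model that performs both 3D captioning and high-resolution 3D generation within a single framework. It uses a Mixture-of-Transformer architecture in which a token-level autoregressive Transformer and a block-level autoregressive Transformer handle content at different granularities. Combining a pretrained vision-language backbone with a specialized 3D VAE latent space enables long-context interaction between standard tokens and spatial blocks.

Details

Motivation: Existing methods for 3D content generation are limited: they produce only low-resolution meshes or coarse structural proxies and cannot natively capture fine-grained geometric detail. This paper explores and extends the capabilities of large language models in 3D content generation.

Result: Experiments show that CG-MLLM significantly outperforms existing multimodal large language models in generating high-fidelity 3D objects, effectively bringing high-resolution 3D content creation into the mainstream LLM paradigm.

Insight: The novelty lies in a unified MLLM framework that supports both 3D captioning and generation, a Mixture-of-Transformer architecture that decouples modeling needs at different granularities, and the integration of a vision-language model with a 3D VAE latent space for efficient long-context 3D content generation.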

Abstract: Large Language Models (LLMs) have revolutionized text generation and multimodal perception, but their capabilities in 3D content generation remain underexplored. Existing methods compromise by producing either low-resolution meshes or coarse structural proxies, failing to capture fine-grained geometry natively. In this paper, we propose CG-MLLM, a novel Multi-modal Large Language Model (MLLM) capable of 3D captioning and high-resolution 3D generation in a single framework. Leveraging the Mixture-of-Transformer architecture, CG-MLLM decouples disparate modeling needs, where the Token-level Autoregressive (TokenAR) Transformer handles token-level content, and the Block-level Autoregressive (BlockAR) Transformer handles block-level content. By integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space, CG-MLLM facilitates long-context interactions between standard tokens and spatial blocks within a single integrated architecture. Experimental results show that CG-MLLM significantly outperforms existing MLLMs in generating high-fidelity 3D objects, effectively bringing high-resolution 3D content creation into the mainstream LLM paradigm.


[72] MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods cs.CVPDF

Honglin Lin, Zheng Liu, Yun Zhu, Chonghan Qin, Juekai Lin

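TL;DR: This paper introduces MMFineReason, a large-scale multimodal reasoning dataset with 1.8M samples and 5.1B solution tokens, whose high-quality reasoning annotations are distilled from Qwen3-VL-235B-A22B-Thinking. It is built through a three-stage pipeline and covers STEM problems, visual puzzles, games, and complex diagrams. The MMFineReason-2B/4B/8B models, obtained by fine-tuning Qwen3-VL-Instruct on this dataset, set new state-of-the-art results for their size class and show notable parameter efficiency.

Details

Motivation: Open-source vision language models lag behind proprietary systems in visual reasoning, mainly because high-quality reasoning data is lacking. Existing datasets offer limited coverage of challenging domains such as STEM diagrams and visual puzzles, and lack consistent, long-form Chain-of-Thought annotations.

Result: Models fine-tuned on MMFineReason establish new state-of-the-art results for their size class. MMFineReason-4B surpasses Qwen3-VL-8B-Thinking, and MMFineReason-8B outperforms even Qwen3-VL-30B-A3B-Thinking while approaching Qwen3-VL-32B-Thinking. With the difficulty-aware filtering strategy, a subset of just 7% (123K samples) achieves performance comparable to the full dataset.

Insight: The novelty lies in a systematic three-stage data construction pipeline for a large-scale, high-quality multimodal reasoning dataset. The work also reveals a “less is more” phenomenon (efficient data selection through difficulty-aware filtering) and a synergistic effect in which reasoning-oriented data composition simultaneously boosts general capabilities.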

Abstract: Recent advances in Vision Language Models (VLMs) have driven significant progress in visual reasoning. However, open-source VLMs still lag behind proprietary systems, largely due to the lack of high-quality reasoning data. Existing datasets offer limited coverage of challenging domains such as STEM diagrams and visual puzzles, and lack consistent, long-form Chain-of-Thought (CoT) annotations essential for eliciting strong reasoning capabilities. To bridge this gap, we introduce MMFineReason, a large-scale multimodal reasoning dataset comprising 1.8M samples and 5.1B solution tokens, featuring high-quality reasoning annotations distilled from Qwen3-VL-235B-A22B-Thinking. The dataset is established via a systematic three-stage pipeline: (1) large-scale data collection and standardization, (2) CoT rationale generation, and (3) comprehensive selection based on reasoning quality and difficulty awareness. The resulting dataset spans STEM problems, visual puzzles, games, and complex diagrams, with each sample annotated with visually grounded reasoning traces. We fine-tune Qwen3-VL-Instruct on MMFineReason to develop MMFineReason-2B/4B/8B versions. Our models establish new state-of-the-art results for their size class. Notably, MMFineReason-4B successfully surpasses Qwen3-VL-8B-Thinking, and MMFineReason-8B even outperforms Qwen3-VL-30B-A3B-Thinking while approaching Qwen3-VL-32B-Thinking, demonstrating remarkable parameter efficiency. Crucially, we uncover a “less is more” phenomenon via our difficulty-aware filtering strategy: a subset of just 7% (123K samples) achieves performance comparable to the full dataset. We also reveal a synergistic effect where reasoning-oriented data composition simultaneously boosts general capabilities.


[73] Past- and Future-Informed KV Cache Policy with Salience Estimation in Autoregressive Video Diffusion cs.CVPDF

Hanmo Chen, Chenghao Xu, Xu Yang, Xuan Chen, Cheng Deng

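TL;DR: This paper proposes PaFu-KV, a new KV cache policy for long-sequence video generation with autoregressive video diffusion models. A lightweight salience estimation head scores the importance of individual tokens so that, at inference time, key information is retained and redundant cache entries are discarded, improving the trade-off between generation quality and efficiency.

Details

Motivation: Existing autoregressive video generation methods usually rely on heuristic KV cache policies that ignore differences in token importance during long-sequence generation. This causes critical spatiotemporal information to be lost and redundant cache to accumulate, degrading both the quality and the efficiency of video generation.

Result: Extensive experiments on multiple benchmarks show that the method preserves high-fidelity video generation quality while significantly reducing KV cache capacity and memory footprint, which accelerates inference and enables more efficient long-sequence video generation.

Insight: The core innovation is a Past- and Future-Informed KV cache policy (PaFu-KV). A salience estimation head distilled from a bidirectional teacher quantifies the temporal heterogeneity of token contributions, enabling informed management of cache contents and offering a new approach to memory efficiency in long-sequence generation.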

Abstract: Video generation is pivotal to digital media creation, and recent advances in autoregressive video generation have markedly enhanced the efficiency of real-time video synthesis. However, existing approaches generally rely on heuristic KV Cache policies, which ignore differences in token importance in long-term video generation. This leads to the loss of critical spatiotemporal information and the accumulation of redundant, invalid cache, thereby degrading video generation quality and efficiency. To address this limitation, we first observe that token contributions to video generation are highly time-heterogeneous and accordingly propose a novel Past- and Future-Informed KV Cache Policy (PaFu-KV). Specifically, PaFu-KV introduces a lightweight Salience Estimation Head distilled from a bidirectional teacher to estimate salience scores, allowing the KV cache to retain informative tokens while discarding less relevant ones. This policy yields a better quality-efficiency trade-off by shrinking KV cache capacity and reducing memory footprint at inference time. Extensive experiments on benchmarks demonstrate that our method preserves high-fidelity video generation quality while accelerating inference, thereby enabling more efficient long-horizon video generation. Our code will be released upon paper acceptance.
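
The retention step can be sketched as follows, assuming the salience scores have already been produced (in the paper they come from the distilled Salience Estimation Head; here they are plain inputs). The top-k rule and the function name `evict_kv_cache` are illustrative assumptions, not the authors' implementation.

```python
def evict_kv_cache(cache, salience, capacity):
    """Keep the `capacity` cache entries with the highest salience scores.

    `cache` is a list of per-token entries (e.g. key/value pairs) and
    `salience` is a parallel list of scores. Surviving entries keep their
    original temporal order.
    """
    if len(cache) <= capacity:
        return list(cache)
    by_score = sorted(range(len(cache)), key=lambda i: salience[i], reverse=True)
    keep = sorted(by_score[:capacity])  # restore temporal order
    return [cache[i] for i in keep]
```

A heuristic sliding-window policy would instead keep the most recent `capacity` entries regardless of score, which is the kind of baseline the abstract contrasts against.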


[74] VideoAesBench: Benchmarking the Video Aesthetics Perception Capabilities of Large Multimodal Models cs.CVPDF

Yunhao Li, Sijing Wu, Zhilin Gao, Zicheng Zhang, Qi Jia

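TL;DR: This paper presents VideoAesBench, a comprehensive benchmark for evaluating the video aesthetics perception capabilities of large multimodal models. It contains 1,804 videos from multiple sources and several question formats, and is used to evaluate 23 open-source and commercial models.

Details

Motivation: Large multimodal models perform well on many visual perception tasks, but their ability to assess video aesthetic quality, a fundamental human ability, has not been sufficiently explored.

Result: The evaluation finds that current large multimodal models possess only basic video aesthetics perception ability, and their performance remains incomplete and imprecise.

Insight: The novelty lies in a comprehensive benchmark covering diverse video content, multiple question formats, and holistic aesthetic dimensions, together with novel open-ended questions for video aesthetics description. It provides a testbed and insights for explainable video aesthetics assessment.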

Abstract: Large multimodal models (LMMs) have demonstrated outstanding capabilities in various visual perception tasks, which has in turn made the evaluation of LMMs significant. However, the capability of video aesthetic quality assessment, which is a fundamental ability for humans, remains underexplored for LMMs. To address this, we introduce VideoAesBench, a comprehensive benchmark for evaluating LMMs’ understanding of video aesthetic quality. VideoAesBench has several significant characteristics: (1) Diverse content including 1,804 videos from multiple video sources including user-generated (UGC), AI-generated (AIGC), compressed, robotic-generated (RGC), and game videos. (2) Multiple question formats containing traditional single-choice questions, multi-choice questions, True or False questions, and novel open-ended questions for video aesthetics description. (3) Holistic video aesthetics dimensions including visual form related questions from 5 aspects, visual style related questions from 4 aspects, and visual affectiveness questions from 3 aspects. Based on VideoAesBench, we benchmark 23 open-source and commercial large multimodal models. Our findings show that current LMMs possess only basic video aesthetics perception ability, and their performance remains incomplete and imprecise. We hope VideoAesBench can serve as a strong testbed and offer insights for explainable video aesthetics assessment.


[75] Zero-Shot Video Restoration and Enhancement with Assistance of Video Diffusion Models cs.CVPDF

Cong Cao, Huanjing Yue, Shangbin Xie, Xin Liu, Jingyu Yang

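TL;DR: This paper proposes the first method that uses video diffusion models to assist image-based zero-shot video restoration and enhancement, addressing the temporal flickering that arises when existing image methods are applied to video. Through homologous and heterogeneous latents fusion, a COT-based fusion ratio strategy, and temporal-strengthening post-processing, the method improves temporal consistency without any training.

Details

Motivation: Existing diffusion-based zero-shot image restoration and enhancement methods cause severe temporal flickering when applied to video, so a video restoration and enhancement framework that maintains temporal consistency is needed.

Result: Experiments show that the method performs better on zero-shot video restoration and enhancement tasks and effectively improves temporal consistency.

Insight: The contributions include using video diffusion models to assist image methods, the homologous and heterogeneous latents fusion with a COT-based fusion ratio strategy, and temporal-strengthening post-processing. These techniques could be reused in other video generation or editing tasks to improve temporal coherence.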

Abstract: Although diffusion-based zero-shot image restoration and enhancement methods have achieved great success, applying them to video restoration or enhancement leads to severe temporal flickering. In this paper, we propose the first framework that utilizes rapidly developing video diffusion models to assist the image-based method in maintaining more temporal consistency for zero-shot video restoration and enhancement. We propose homologous latents fusion, heterogeneous latents fusion, and a COT-based fusion ratio strategy to utilize both homologous and heterogeneous text-to-video diffusion models to complement the image method. Moreover, we propose temporal-strengthening post-processing to utilize the image-to-video diffusion model to further improve temporal consistency. Our method is training-free and can be applied to any diffusion-based image restoration and enhancement method. Experimental results demonstrate the superiority of the proposed method.


[76] Just Noticeable Difference Modeling for Deep Visual Features cs.CVPDF

Rui Zhao, Wenrui Li, Lin Zhu, Yajing Zheng, Weisi Lin

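TL;DR: This paper proposes FeatJND, a just noticeable difference (JND) modeling method for deep visual features. It predicts the maximum tolerable per-feature perturbation map that preserves downstream task performance, and is validated on image classification, detection, and instance segmentation.

Details

Motivation: Deep visual features are increasingly used as the interface in vision systems, which creates a need to describe feature characteristics and control feature quality for machine perception. Extending JND to deep feature space provides a task-aligned tolerance boundary as a reference for controlling feature quality under constrained resources.

Result: Under matched distortion strength, FeatJND-based perturbations preserve higher task performance more consistently than unstructured Gaussian perturbations. In a token-wise dynamic quantization application, FeatJND-guided step-size allocation outperforms random step-size permutation and a global uniform step size under the same noise budget.

Insight: The novelty lies in extending the JND concept from the image domain to deep feature space with a task-aligned JND formulation. Viewed objectively, the method provides an interpretable perturbation boundary for feature quality control and can optimize resource allocation by suppressing non-critical feature regions.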

Abstract: Deep visual features are increasingly used as the interface in vision systems, motivating the need to describe feature characteristics and control feature quality for machine perception. Just noticeable difference (JND) characterizes the maximum imperceptible distortion for images under human or machine vision. Extending it to deep visual features naturally meets the above demand by providing a task-aligned tolerance boundary in feature space, offering a practical reference for controlling feature quality under constrained resources. We propose FeatJND, a task-aligned JND formulation that predicts the maximum tolerable per-feature perturbation map while preserving downstream task performance. We propose a FeatJND estimator at standardized split points and validate it across image classification, detection, and instance segmentation. Under matched distortion strength, FeatJND-based distortions consistently preserve higher task performance than unstructured Gaussian perturbations, and attribution visualizations suggest FeatJND can suppress non-critical feature regions. As an application, we further apply FeatJND to token-wise dynamic quantization and show that FeatJND-guided step-size allocation yields clear gains over random step-size permutation and global uniform step size under the same noise budget. Our code will be released after publication.
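
The step-size allocation idea can be illustrated with a small worked example. The allocation rule below is an assumption, not the paper's: each token's quantization step is made proportional to its predicted tolerance, and all steps are rescaled so that the average noise power matches a fixed budget, using the standard approximation that a uniform quantizer with step $\Delta$ has noise variance $\Delta^2/12$. The names `allocate_step_sizes` and `uniform_step_size` are hypothetical.

```python
import math


def allocate_step_sizes(tolerance, noise_budget):
    """Per-token steps proportional to (positive) tolerance values, scaled so
    that mean(step**2 / 12) equals `noise_budget`."""
    mean_sq = sum(t * t for t in tolerance) / len(tolerance)
    scale = math.sqrt(12.0 * noise_budget / mean_sq)
    return [scale * t for t in tolerance]


def uniform_step_size(noise_budget):
    """Global baseline: a single step with the same noise budget."""
    return math.sqrt(12.0 * noise_budget)
```

Under this rule, tokens predicted to tolerate more distortion receive coarser steps and sensitive tokens receive finer ones, while the total noise budget is the same as for the uniform baseline.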


[77] Deep Models, Shallow Alignment: Uncovering the Granularity Mismatch in Neural Decoding cs.CVPDF

Yang Du, Siyuan Dai, Yonghao Song, Paul M. Thompson, Haoteng Tang

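TL;DR: This paper proposes Shallow Alignment, a contrastive learning strategy that addresses the granularity mismatch between human and machine vision in neural visual decoding. It aligns neural signals with intermediate-layer representations of visual encoders rather than their final outputs, striking a better balance between low-level texture details and high-level semantic features. Experiments show that it significantly outperforms standard final-layer alignment on multiple benchmarks.

Details

Motivation: Existing neural visual decoding methods overlook a fundamental granularity mismatch between human and machine vision: deep vision models emphasize semantic invariance by suppressing local texture information, whereas neural signals preserve an intricate mixture of low-level visual attributes and high-level semantic content. This paper aims to resolve that mismatch.

Result: Extensive experiments on multiple benchmarks show that Shallow Alignment significantly outperforms standard final-layer alignment, with performance gains ranging from 22% to 58%. The method also unlocks the scaling law in neural visual decoding, so that decoding performance scales predictably with the capacity of the pretrained vision backbone.

Insight: The core innovation is the Shallow Alignment strategy, which contrastively aligns neural signals with intermediate-layer representations of visual encoders. This offers a new perspective: using the finer-grained intermediate representations of deep models, rather than only the final high-level semantic output, better matches the mixed nature of neural signals, which markedly improves decoding performance and reveals a scaling law.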

Abstract: Neural visual decoding is a central problem in brain-computer interface research, aiming to reconstruct human visual perception and to elucidate the structure of neural representations. However, existing approaches overlook a fundamental granularity mismatch between human and machine vision, where deep vision models emphasize semantic invariance by suppressing local texture information, whereas neural signals preserve an intricate mixture of low-level visual attributes and high-level semantic content. To address this mismatch, we propose Shallow Alignment, a novel contrastive learning strategy that aligns neural signals with intermediate representations of visual encoders rather than their final outputs, thereby striking a better balance between low-level texture details and high-level semantic features. Extensive experiments across multiple benchmarks demonstrate that Shallow Alignment significantly outperforms standard final-layer alignment, with performance gains ranging from 22% to 58% across diverse vision backbones. Notably, our approach effectively unlocks the scaling law in neural visual decoding, enabling decoding performance to scale predictably with the capacity of pre-trained vision backbones. We further conduct systematic empirical analyses to shed light on the mechanisms underlying the observed performance gains.
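
As a rough illustration of contrastive alignment against an intermediate layer, the sketch below computes a one-directional InfoNCE loss between neural-signal embeddings and visual features taken from a chosen encoder layer. The abstract does not state the exact loss, the temperature, or which layer is used, so all of these are assumptions; `info_nce`, `select_layer`, and `layer_index` are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(neural, visual, temperature=0.1):
    """Mean cross-entropy of matching neural[i] to visual[i] against all
    other visual[j] in the batch (one direction only)."""
    total = 0.0
    for i, z in enumerate(neural):
        logits = [cosine(z, v) / temperature for v in visual]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(neural)


def select_layer(layer_outputs, layer_index):
    """Pick the alignment target: an intermediate layer instead of the last.

    `layer_outputs[k]` holds the batch of features from encoder layer k.
    """
    return layer_outputs[layer_index]
```

Final-layer alignment corresponds to `layer_index = -1`; the shallow variant would pass an earlier index, with the best layer presumably found empirically.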


[78] PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing cs.CVPDF

Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang

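TL;DR: This paper introduces PaddleOCR-VL-1.5, an upgraded multi-task vision-language model that achieves a new state-of-the-art (SOTA) accuracy of 94.5% on OmniDocBench v1.5. To rigorously evaluate robustness against real-world physical distortions, the authors propose the Real5-OmniDocBench benchmark, on which the enhanced model also attains SOTA performance. The model further adds seal recognition and text spotting while remaining an efficient, ultra-compact 0.9B-parameter architecture.

Details

Motivation: The goal is a robust and efficient multi-task vision-language model dedicated to parsing real-world documents affected by physical distortions, and a benchmark that addresses the shortcomings of existing ones in evaluating this kind of robustness.

Result: The model reaches a SOTA accuracy of 94.5% on OmniDocBench v1.5 and also achieves SOTA performance on the newly proposed Real5-OmniDocBench benchmark, which evaluates physical distortions including scanning, skew, warping, screen photography, and illumination.

Insight: The main contributions are: (1) Real5-OmniDocBench, a new benchmark dedicated to evaluating robustness to physical distortions of document images; (2) the integration of seal recognition and text spotting into a unified, ultra-compact 0.9B VLM, balancing multi-task capability with high efficiency.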

Abstract: We introduce PaddleOCR-VL-1.5, an upgraded model achieving a new state-of-the-art (SOTA) accuracy of 94.5% on OmniDocBench v1.5. To rigorously evaluate robustness against real-world physical distortions, including scanning, skew, warping, screen-photography, and illumination, we propose the Real5-OmniDocBench benchmark. Experimental results demonstrate that this enhanced model attains SOTA performance on the newly curated benchmark. Furthermore, we extend the model’s capabilities by incorporating seal recognition and text spotting tasks, while remaining a 0.9B ultra-compact VLM with high efficiency. Code: https://github.com/PaddlePaddle/PaddleOCR


[79] Causal World Modeling for Robot Control cs.CV | cs.ROPDF

Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang

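TL;DR: This paper proposes LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously, using video world modeling alongside vision-language pre-training as a new foundation for robot learning. The framework has three key designs, a shared latent space, a closed-loop rollout mechanism, and an asynchronous inference pipeline, and shows long-horizon manipulation, data efficiency, and strong generalization on simulation benchmarks and in real-world scenarios.

Details

Motivation: The motivation is to use video world models to understand the causality between actions and visual dynamics and thereby predict the near future. This provides a more independent and effective foundation for robot control and addresses the difficulties of conventional methods with long-horizon manipulation and data efficiency.

Result: In evaluations on simulation benchmarks and real-world scenarios, the model shows significant promise in long-horizon manipulation, data efficiency in post-training, and generalization to novel configurations, reaching an advanced level of performance.

Insight: The contributions include a shared latent space that integrates vision and action tokens, a closed-loop rollout mechanism that continually acquires environmental feedback, and an asynchronous inference pipeline that parallelizes action prediction and execution. These designs improve the efficiency and adaptability of robot control.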

Abstract: This work highlights that video world modeling, alongside vision-language pre-training, establishes a fresh and independent foundation for robot learning. Intuitively, video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. Inspired by this, we introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. Our model features three carefully crafted designs: (1) a shared latent space, integrating vision and action tokens, driven by a Mixture-of-Transformers (MoT) architecture, (2) a closed-loop rollout mechanism, allowing for ongoing acquisition of environmental feedback with ground-truth observations, (3) an asynchronous inference pipeline, parallelizing action prediction and motor execution to support efficient control. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations. The code and model are made publicly available to facilitate the community.


[80] Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving cs.CVPDF

Linhan Wang, Zichong Yang, Chen Bai, Guoxiang Zhang, Xiaotong Liu

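TL;DR: This paper proposes Drive-JEPA, a framework that combines the Video Joint-Embedding Predictive Architecture (V-JEPA) with multimodal trajectory distillation for end-to-end autonomous driving. It first pretrains a ViT encoder on large-scale driving videos to produce predictive representations aligned with trajectory planning. It then introduces a proposal-centric planner that distills diverse trajectories from a simulator and from human driving, with a momentum-aware selection mechanism that improves the stability and safety of behavior.

Details

Motivation: End-to-end autonomous driving currently uses self-supervised video pretraining to learn transferable planning representations, but existing video world models bring only limited gains in scene understanding. In addition, the inherent ambiguity of driving scenes, which usually provide only a single human trajectory, makes multimodal behaviors hard to learn.

Result: On the NAVSIM benchmark, the V-JEPA representation alone, combined with a simple transformer-based decoder, outperforms prior methods by 3 PDMS in the perception-free setting. The complete Drive-JEPA framework achieves 93.3 PDMS on v1 and 87.8 EPDMS on v2, setting a new state of the art (SOTA).

Insight: The novelty lies in combining a video predictive architecture (V-JEPA) with multimodal trajectory distillation. Simulator-generated trajectories overcome the single-trajectory limitation of driving data, and a momentum-aware selection mechanism stabilizes behavior selection, offering a new approach to learning robust, multimodal planning representations for end-to-end driving.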

Abstract: End-to-end autonomous driving increasingly leverages self-supervised video pretraining to learn transferable planning representations. However, pretraining video world models for scene understanding has so far brought only limited improvements. This limitation is compounded by the inherent ambiguity of driving: each scene typically provides only a single human trajectory, making it difficult to learn multimodal behaviors. In this work, we propose Drive-JEPA, a framework that integrates Video Joint-Embedding Predictive Architecture (V-JEPA) with multimodal trajectory distillation for end-to-end driving. First, we adapt V-JEPA for end-to-end driving, pretraining a ViT encoder on large-scale driving videos to produce predictive representations aligned with trajectory planning. Second, we introduce a proposal-centric planner that distills diverse simulator-generated trajectories alongside human trajectories, with a momentum-aware selection mechanism to promote stable and safe behavior. When evaluated on NAVSIM, the V-JEPA representation combined with a simple transformer-based decoder outperforms prior methods by 3 PDMS in the perception-free setting. The complete Drive-JEPA framework achieves 93.3 PDMS on v1 and 87.8 EPDMS on v2, setting a new state-of-the-art.


[81] Understanding Multimodal Complementarity for Single-Frame Action Anticipation cs.CVPDF

Manuel Benavent-Lledo, Konstantinos Bacharidis, Konstantinos Papoutsakis, Antonis Argyros, Jose Garcia-Rodriguez

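TL;DR: This paper challenges the conventional assumption that action anticipation requires dense temporal information and explores anticipation from a single visual observation. After systematically analyzing complementary multimodal information (RGB appearance, depth-based geometric cues, and semantic representations of past actions) and the influence of fusion strategies, keyframe selection policies, and past-action history sources, the authors consolidate the most effective design choices into AAG+, a refined single-frame anticipation framework.

Details

Motivation: The motivation is to determine how much information about future actions is already encoded in a single frame and how it can be exploited effectively, thereby challenging the implicit assumption that action anticipation must rely on temporal video information.

Result: On challenging anticipation benchmarks including IKEA-ASM, Meccano, and Assembly101, AAG+ consistently improves upon the original AAG and achieves performance comparable to, or exceeding, that of state-of-the-art video-based methods.

Insight: The novelty lies in a systematic study of multimodal complementarity in single-frame action anticipation and in consolidating the effective design choices. The work clarifies when dense temporal modeling is necessary and when a carefully selected single frame is sufficient, offering new insights into the potential and limits of single-frame action anticipation.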

Abstract: Human action anticipation is commonly treated as a video understanding problem, implicitly assuming that dense temporal information is required to reason about future actions. In this work, we challenge this assumption by investigating what can be achieved when action anticipation is constrained to a single visual observation. We ask a fundamental question: how much information about the future is already encoded in a single frame, and how can it be effectively exploited? Building on our prior work on Action Anticipation at a Glimpse (AAG), we conduct a systematic investigation of single-frame action anticipation enriched with complementary sources of information. We analyze the contribution of RGB appearance, depth-based geometric cues, and semantic representations of past actions, and investigate how different multimodal fusion strategies, keyframe selection policies and past-action history sources influence anticipation performance. Guided by these findings, we consolidate the most effective design choices into AAG+, a refined single-frame anticipation framework. Despite operating on a single frame, AAG+ consistently improves upon the original AAG and achieves performance comparable to, or exceeding, that of state-of-the-art video-based methods on challenging anticipation benchmarks including IKEA-ASM, Meccano and Assembly101. Our results offer new insights into the limits and potential of single-frame action anticipation, and clarify when dense temporal modeling is necessary and when a carefully selected glimpse is sufficient.


[82] MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources cs.CV | cs.AIPDF

Baorui Ma, Jiahui Yang, Donglin Di, Xuancheng Zhang, Jianxun Cui

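TL;DR: The paper proposes Metric Anything, a scalable pretraining framework that learns metric depth estimation from noisy, heterogeneous 3D sources without manually engineered prompts, camera-specific modeling, or task-specific architectures. Its core is the Sparse Metric Prompt, created by randomly masking depth maps, which serves as a universal interface that decouples spatial reasoning from sensor and camera biases. Using about 20M image-depth pairs, the work demonstrates a scaling trend in metric depth for the first time. The pretrained model excels at prompt-driven tasks such as depth completion and super-resolution, while its distilled prompt-free student reaches SOTA on tasks such as monocular depth estimation and camera intrinsics recovery, and the pretrained encoder improves the spatial intelligence of multimodal large language models.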

Details

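Motivation: To address the challenges of applying the foundation-model scaling paradigm to metric depth estimation, namely heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in noisy cross-source 3D data.

Result: Pretrained on about 20M image-depth pairs spanning 10,000 camera models, the method shows a clear scaling trend in metric depth for the first time. The distilled prompt-free student reaches SOTA on monocular depth estimation, camera intrinsics recovery, single/multi-view metric 3D reconstruction, and VLA planning.

Insight: The novelty is the Sparse Metric Prompt as a universal interface that decouples spatial reasoning from sensor and camera biases, which makes scalable metric depth pretraining on large, noisy, heterogeneous data possible and provides the first evidence that metric depth estimation also benefits from scaling laws. The method needs no task-specific architecture, and the pretrained visual encoder also strengthens the spatial intelligence of multimodal large language models.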

Abstract: Scaling has powered recent advances in vision foundation models, yet extending this paradigm to metric depth estimation remains challenging due to heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in noisy cross-source 3D data. We introduce Metric Anything, a simple and scalable pretraining framework that learns metric depth from noisy, diverse 3D sources without manually engineered prompts, camera-specific modeling, or task-specific architectures. Central to our approach is the Sparse Metric Prompt, created by randomly masking depth maps, which serves as a universal interface that decouples spatial reasoning from sensor and camera biases. Using about 20M image-depth pairs spanning reconstructed, captured, and rendered 3D data across 10000 camera models, we demonstrate, for the first time, a clear scaling trend in the metric depth track. The pretrained model excels at prompt-driven tasks such as depth completion, super-resolution and Radar-camera fusion, while its distilled prompt-free student achieves state-of-the-art results on monocular depth estimation, camera intrinsics recovery, single/multi-view metric 3D reconstruction, and VLA planning. We also show that using the pretrained ViT of Metric Anything as a visual encoder significantly boosts Multimodal Large Language Model capabilities in spatial intelligence. These results show that metric depth estimation can benefit from the same scaling laws that drive modern foundation models, establishing a new path toward scalable and efficient real-world metric perception. We open-source MetricAnything at http://metric-anything.github.io/metric-anything-io/ to support community research.
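The abstract describes the Sparse Metric Prompt only as "created by randomly masking depth maps". A minimal Python sketch of that idea follows; the function name `sparse_metric_prompt`, the `keep_ratio` parameter, and the convention that 0.0 marks a missing depth value are all assumptions made for illustration, not details taken from the paper.

```python
import random


def sparse_metric_prompt(depth, keep_ratio=0.05, seed=0):
    """Randomly mask a dense depth map (hypothetical helper, not the paper's code).

    Each valid pixel (depth > 0) is kept with probability `keep_ratio`.
    Returns (sparse_depth, mask); masked-out pixels are 0.0 in sparse_depth.
    """
    rng = random.Random(seed)
    sparse, mask = [], []
    for row in depth:
        keep_row = [d > 0 and rng.random() < keep_ratio for d in row]
        mask.append(keep_row)
        sparse.append([d if keep else 0.0 for d, keep in zip(row, keep_row)])
    return sparse, mask


# Toy 32x32 depth map in metres (synthetic values, for illustration only).
depth = [[1.0 + 0.1 * (r + c) for c in range(32)] for r in range(32)]
sparse, mask = sparse_metric_prompt(depth, keep_ratio=0.1, seed=42)
kept = sum(sum(row) for row in mask)
print(f"kept {kept} of {32 * 32} depth values as the sparse prompt")
```

In such a setup the model would receive the image together with `sparse` and `mask`. Because the prompt carries metric values directly, it plausibly removes the need to model each sensor or camera explicitly, which is the decoupling the abstract refers to.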


[83] Unsupervised Decomposition and Recombination with Discriminator-Driven Diffusion Models cs.CV | cs.AIPDF

Archer Wang, Emile Anand, Yilun Du, Marin Soljačić

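TL;DR: This paper proposes an unsupervised decomposition-and-recombination method based on discriminator-driven diffusion models. It learns factorized latent representations of data and generates new samples by recombining factors across sources. Validated on image and robotic video data, the method achieves lower FID and better disentanglement (measured by MIG and MCC) on CelebA-HQ, Virtual KITTI, CLEVR, and Falcor3D, and on the LIBERO benchmark it generates diverse robotic video trajectories by recombining learned action components, significantly increasing state-space coverage.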

Details

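Motivation: To learn factorized latent representations from complex data more effectively without factor-level supervision, and to ensure that new samples generated by recombining these factors are physically and semantically consistent.

Result: On CelebA-HQ, Virtual KITTI, CLEVR, and Falcor3D, the method outperforms implementations of prior baselines on FID, MIG, and MCC. On the LIBERO robotics benchmark, the diverse sequences generated by recombining action components significantly increase state-space coverage.

Insight: The novelty is an adversarial training signal from a discriminator trained to distinguish single-source samples from samples generated by recombining factors across sources, which pushes the generator toward physically and semantically consistent recombinations. This offers an effective way to learn reusable components without supervision and to perform high-quality compositional generation.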

Abstract: Decomposing complex data into factorized representations can reveal reusable components and enable synthesizing new samples via component recombination. We investigate this in the context of diffusion-based models that learn factorized latent spaces without factor-level supervision. In images, factors can capture background, illumination, and object attributes; in robotic videos, they can capture reusable motion components. To improve both latent factor discovery and quality of compositional generation, we introduce an adversarial training signal via a discriminator trained to distinguish between single-source samples and those generated by recombining factors across sources. By optimizing the generator to fool this discriminator, we encourage physical and semantic consistency in the resulting recombinations. Our method outperforms implementations of prior baselines on CelebA-HQ, Virtual KITTI, CLEVR, and Falcor3D, achieving lower FID scores and better disentanglement as measured by MIG and MCC. Furthermore, we demonstrate a novel application to robotic video trajectories: by recombining learned action components, we generate diverse sequences that significantly increase state-space coverage for exploration on the LIBERO benchmark.


[84] Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models cs.CV | cs.AIPDF

Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao

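TL;DR: This paper proposes Vision-DeepResearch, a new multimodal deep-research paradigm that strengthens the deep-research capability of multimodal large language models (MLLMs). By performing multi-turn, multi-entity, and multi-scale visual and textual search, it robustly retrieves evidence from search engines under heavy real-world visual noise, in order to solve complex questions that require aggregating information from multiple sources.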

Details

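Motivation: For tasks requiring extensive factual information, existing MLLMs are usually augmented with a "reasoning-then-tool-call" paradigm. In multimodal search, however, they tend to assume that a single image query or text query is enough to retrieve the key evidence, which is unrealistic in real-world scenarios with substantial visual noise. Their reasoning depth and search breadth are also limited, which makes it hard to solve complex questions that require aggregating diverse visual and textual evidence.

Result: Vision-DeepResearch substantially outperforms existing multimodal deep-research MLLMs, as well as workflows built on strong closed-source foundation models such as GPT-5, Gemini-2.5-pro, and Claude-4-Sonnet.

Insight: The novelty is a multimodal deep-research paradigm that supports dozens of reasoning steps and hundreds of engine interactions, with deep-research capability internalized into the MLLM via cold-start supervision and reinforcement learning. The result is a strong end-to-end model that robustly handles complex, noisy real-world search tasks.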

Abstract: Multimodal large language models (MLLMs) have achieved remarkable success across a broad range of vision tasks. However, because they are constrained by the capacity of their internal world knowledge, prior work has proposed augmenting MLLMs with a "reasoning-then-tool-call" scheme over visual and textual search engines, obtaining substantial gains on tasks requiring extensive factual information. These approaches typically define multimodal search in a naive setting, assuming that a single full-level or entity-level image query and a few text queries suffice to retrieve the key evidence needed to answer the question, which is unrealistic in real-world scenarios with substantial visual noise. Moreover, they are often limited in reasoning depth and search breadth, making it difficult to solve complex questions that require aggregating evidence from diverse visual and textual sources. To address these limitations, we propose Vision-DeepResearch, a new multimodal deep-research paradigm that performs multi-turn, multi-entity, and multi-scale visual and textual search to robustly hit real-world search engines under heavy noise. Vision-DeepResearch supports dozens of reasoning steps and hundreds of engine interactions, while internalizing deep-research capabilities into the MLLM via cold-start supervision and RL training, resulting in a strong end-to-end multimodal deep-research MLLM. It substantially outperforms existing multimodal deep-research MLLMs and workflows built on strong closed-source foundation models such as GPT-5, Gemini-2.5-pro and Claude-4-Sonnet. The code will be released at https://github.com/Osilly/Vision-DeepResearch.


[85] SINA: A Circuit Schematic Image-to-Netlist Generator Using Artificial Intelligence cs.CV | cs.AI | eess.SYPDF

Saoud Aldowaish, Yashwanth Karumanchi, Kai-Chen Chiang, Soroosh Noorzad, Morteza Fayazi

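TL;DR: This paper presents SINA, an open-source, fully automated circuit schematic image-to-netlist generator. It integrates deep learning for accurate component detection, Connected-Component Labeling for precise connectivity extraction, and Optical Character Recognition for retrieving component reference designators, and uses a Vision-Language Model for reliable designator assignment. In experiments, SINA reaches 96.47% overall netlist-generation accuracy.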

Details

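Motivation: Existing methods for converting circuit schematic images into machine-readable netlists struggle with component recognition and connectivity inference.

Result: SINA achieves 96.47% overall netlist-generation accuracy, which the authors report as 2.72x higher than state-of-the-art approaches.

Insight: The main novelty is integrating deep learning, Connected-Component Labeling, Optical Character Recognition, and a Vision-Language Model into one unified, fully automated pipeline that jointly handles component detection, connectivity extraction, and designator assignment, which markedly improves overall netlist-generation accuracy.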

Abstract: Current methods for converting circuit schematic images into machine-readable netlists struggle with component recognition and connectivity inference. In this paper, we present SINA, an open-source, fully automated circuit schematic image-to-netlist generator. SINA integrates deep learning for accurate component detection, Connected-Component Labeling (CCL) for precise connectivity extraction, and Optical Character Recognition (OCR) for component reference designator retrieval, while employing a Vision-Language Model (VLM) for reliable reference designator assignments. In our experiments, SINA achieves 96.47% overall netlist-generation accuracy, which is 2.72x higher than state-of-the-art approaches.
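To make the connectivity-extraction step concrete, here is a minimal Python sketch of Connected-Component Labeling on a binary wire mask. It assumes the detected components have already been erased from the image, so that each remaining connected group of wire pixels corresponds to one net; that preprocessing, the 4-connectivity choice, and the name `label_components` are illustrative assumptions rather than details from the paper.

```python
from collections import deque


def label_components(grid):
    """Label 4-connected regions of nonzero cells with integers 1, 2, 3, ...

    Returns (labels, count); background cells keep label 0.
    """
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and labels[r][c] == 0:
                # Found an unlabeled wire pixel: flood-fill its whole region.
                count += 1
                labels[r][c] = count
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] and labels[ny][nx] == 0):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


# Toy wire mask: 1 = wire pixel, 0 = background.
wires = [
    [1, 1, 1, 0, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 0, 0, 1],
]
labels, count = label_components(wires)
print(count)  # 3 separate nets in this toy mask
```

A netlist builder could then assign each component terminal the label of the wire region it touches; terminals that share a label are on the same net.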


[86] EditYourself: Audio-Driven Generation and Manipulation of Talking Head Videos with Diffusion Transformers cs.CV | cs.GR | cs.LG | cs.MMPDF

John Flynn, Wolfgang Paier, Dimitar Dinev, Sam Nhut Nguyen, Hayk Poghosyan

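TL;DR: EditYourself is a Diffusion Transformer (DiT) based framework for audio-driven video-to-video (V2V) editing. It enables transcript-based modification of talking head videos, supporting seamless addition, removal, and retiming of visually spoken content while preserving motion, temporal coherence, speaker identity, and accurate lip synchronization.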

Details

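Motivation: Existing generative video models fall short when editing pre-recorded videos, especially when minor changes to the spoken script must preserve motion, temporal coherence, speaker identity, and lip synchronization.

Result: For audio-driven V2V editing, the method achieves precise lip synchronization and temporally coherent restructuring, including the synthesis of realistic human motion in newly added segments, while maintaining visual fidelity and identity consistency over long durations. The authors present it as a foundational step toward generative video models as practical tools for professional video post-production.

Insight: The novelty is augmenting a general-purpose video diffusion model with audio conditioning and region-aware, edit-focused training extensions, and restructuring existing performances via spatiotemporal inpainting, which points to a practical route for generative models in video editing.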

Abstract: Current generative video models excel at producing novel content from text and image prompts, but leave a critical gap in editing existing pre-recorded videos, where minor alterations to the spoken script require preserving motion, temporal coherence, speaker identity, and accurate lip synchronization. We introduce EditYourself, a DiT-based framework for audio-driven video-to-video (V2V) editing that enables transcript-based modification of talking head videos, including the seamless addition, removal, and retiming of visually spoken content. Building on a general-purpose video diffusion model, EditYourself augments its V2V capabilities with audio conditioning and region-aware, edit-focused training extensions. This enables precise lip synchronization and temporally coherent restructuring of existing performances via spatiotemporal inpainting, including the synthesis of realistic human motion in newly added segments, while maintaining visual fidelity and identity consistency over long durations. This work represents a foundational step toward generative video models as practical tools for professional video post-production.


[87] Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions cs.CVPDF

Xiaoxiao Sun, Mingyang Li, Kun yuan, Min Woo Sun, Mark Endo

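TL;DR: This paper introduces the VI-Probe framework to systematically study how large Vision-Language Models (VLMs) behave on classic visual illusions. It finds that the persistence of model responses under changes to illusion images arises from several mechanisms rather than one, challenging single-cause explanations.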

Details

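Motivation: To understand whether VLMs facing visual illusions actually perceive the visual change or merely recall memorized patterns, and thereby to reveal how visual perception and language-driven recall interact.

Result: Experiments show heterogeneous causes across model families: GPT-5 exhibits memory override, Claude-Opus-4.1 shows perception-memory competition, and Qwen variants suggest visual-processing limits. These findings are quantified with Polarity-Flip Consistency, the Template Fixation Index, and an illusion multiplier normalized against matched controls.

Insight: The novelty is VI-Probe, a controllable visual-illusion framework that combines graded perturbations with matched visual controls to separate visually grounded perception from language-driven recall. It motivates probing-based evaluation that measures both model knowledge and sensitivity to controlled visual change.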

Abstract: Large Vision-Language Models (VLMs) often answer classic visual illusions “correctly” on original images, yet persist with the same responses when illusion factors are inverted, even though the visual change is obvious to humans. This raises a fundamental question: do VLMs perceive visual changes or merely recall memorized patterns? While several studies have noted this phenomenon, the underlying causes remain unclear. To move from observations to systematic understanding, this paper introduces VI-Probe, a controllable visual-illusion framework with graded perturbations and matched visual controls (without illusion inducer) that disentangles visually grounded perception from language-driven recall. Unlike prior work that focuses on averaged accuracy, we measure stability and sensitivity using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier normalized against matched controls. Experiments across different families reveal that response persistence arises from heterogeneous causes rather than a single mechanism. For instance, GPT-5 exhibits memory override, Claude-Opus-4.1 shows perception-memory competition, while Qwen variants suggest visual-processing limits. Our findings challenge single-cause views and motivate probing-based evaluation that measures both knowledge and sensitivity to controlled visual change. Data and code are available at https://sites.google.com/view/vi-probe/.


[88] UEval: A Benchmark for Unified Multimodal Generation cs.CV | cs.CLPDF

Bo Li, Yida Yin, Wenhao Chai, Xingyu Fu, Zhuang Liu

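TL;DR: The paper introduces UEval, a benchmark for evaluating unified multimodal models that generate both images and text. It contains 1,000 expert-curated questions covering 8 real-world tasks, each requiring output that combines images and text. To handle the difficulty of evaluating open-ended multimodal generation, the paper designs a rubric-based automatic scoring system: a multimodal large language model drafts initial rubrics, which human experts then validate, yielding 10,417 validated rubric criteria. Evaluation shows that current unified models perform poorly on UEval, that reasoning models outperform non-reasoning ones, and that transferring reasoning traces significantly narrows the gap.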

Details

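Motivation: Existing evaluation methods struggle with open-ended multimodal generation, and unified models that generate both images and text in particular lack a comprehensive, fine-grained benchmark.

Result: On UEval, GPT-5-Thinking scores only 66.4 out of 100 and the best open-source model only 49.1, which shows that the benchmark is challenging. Reasoning models often outperform non-reasoning ones, and transferring reasoning traces from a reasoning model to a non-reasoning model significantly narrows the gap.

Insight: The novelty is an expert-curated benchmark that requires joint image and text generation, together with a fine-grained, scalable rubric system that combines MLLM drafting with human validation. The results suggest that reasoning may be important for tasks requiring complex multimodal understanding and generation, and that reasoning traces can be transferred between models.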

Abstract: We introduce UEval, a benchmark to evaluate unified models, i.e., models capable of generating both images and text. UEval comprises 1,000 expert-curated questions that require both images and text in the model output, sourced from 8 real-world tasks. Our curated questions cover a wide range of reasoning types, from step-by-step guides to textbook explanations. Evaluating open-ended multimodal generation is non-trivial, as simple LLM-as-a-judge methods can miss the subtleties. Different from previous works that rely on multimodal Large Language Models (MLLMs) to rate image quality or text accuracy, we design a rubric-based scoring system in UEval. For each question, reference images and text answers are provided to an MLLM to generate an initial rubric, consisting of multiple evaluation criteria, and human experts then refine and validate these rubrics. In total, UEval contains 10,417 validated rubric criteria, enabling scalable and fine-grained automatic scoring. UEval is challenging for current unified models: GPT-5-Thinking scores only 66.4 out of 100, while the best open-source model reaches merely 49.1. We observe that reasoning models often outperform non-reasoning ones, and transferring reasoning traces from a reasoning model to a non-reasoning model significantly narrows the gap. This suggests that reasoning may be important for tasks requiring complex multimodal understanding and generation.
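The abstract does not spell out how rubric criteria are turned into a score out of 100. One plausible reading, sketched below in Python, is that a judge marks each criterion as satisfied or not, a question's score is the percentage of satisfied criteria, and the benchmark score is the mean over questions. This aggregation rule and the names `rubric_score` and `benchmark_score` are assumptions for illustration, not the paper's definition.

```python
def rubric_score(criteria_met):
    """Score one question as the percentage of rubric criteria satisfied (assumed rule)."""
    if not criteria_met:
        raise ValueError("a rubric needs at least one criterion")
    return 100.0 * sum(criteria_met) / len(criteria_met)


def benchmark_score(per_question):
    """Average the per-question rubric scores (assumed unweighted mean)."""
    return sum(rubric_score(q) for q in per_question) / len(per_question)


# Hypothetical judge verdicts for three questions (True = criterion satisfied).
judgments = [
    [True, True, False, True],  # 3 of 4 criteria met -> 75.0
    [True, False],              # 1 of 2 criteria met -> 50.0
    [True, True, True],         # 3 of 3 criteria met -> 100.0
]
overall = benchmark_score(judgments)
print(overall)  # 75.0
```

Scoring per criterion rather than asking a judge for one holistic grade is what would make the result fine-grained: each lost point traces back to a specific, human-validated requirement.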


cs.CY [Back]

[89] Moral Outrage Shapes Commitments Beyond Attention: Multimodal Moral Emotions on YouTube in Korea and the US cs.CY | cs.AI | cs.CL | cs.SIPDF

Seongchan Park, Jaehong Kim, Hyeonseung Kim, Heejin Bin, Sue Moon

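TL;DR: This study develops a multimodal moral emotion classifier by fine-tuning a vision-language model, uses it to analyze moral emotional framing in the thumbnails and titles of videos from mainstream news channels on YouTube, and applies it to about 400,000 videos from Korea and the United States. It finds that other-condemning rhetoric expressing moral outrage raises user engagement across cultures, with significant effects ranging from passive viewing to active commenting.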

Details

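Motivation: In the attention economy, understanding how media rhetoric shapes audience engagement is crucial. This study examines how multimodal moral emotional framing by mainstream news channels on YouTube influences user behavior in Korea and the United States, with a focus on expressions of moral outrage.

Result: Other-condemning moral outrage consistently increases all forms of engagement (views, likes, and comments) across cultures, with effects ranging from passive viewing to active commenting.

Insight: The novelty is a multimodal moral emotion classifier for Korean and English that combines visual and textual information to analyze media content. The analysis indicates that moral outrage is an effective emotional strategy that attracts attention and drives active participation, but that it may also deepen polarization. The released classifiers give future research a reproducible tool.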

Abstract: Understanding how media rhetoric shapes audience engagement is crucial in the attention economy. This study examines how moral emotional framing by mainstream news channels on YouTube influences user behavior across Korea and the United States. To capture the platform's multimodal nature, combining thumbnail images and video titles, we develop a multimodal moral emotion classifier by fine-tuning a vision-language model. The model is trained on human-annotated multimodal datasets in both languages and applied to approximately 400,000 videos from major news outlets. We analyze engagement levels including views, likes, and comments, representing increasing degrees of commitment. The results show that other-condemning rhetoric (expressions of moral outrage that criticize others morally) consistently increases all forms of engagement across cultures, with effects ranging from passive viewing to active commenting. These findings suggest that moral outrage is a particularly effective emotional strategy, attracting not only attention but also active participation. We discuss concerns about the potential misuse of other-condemning rhetoric, as such practices may deepen polarization by reinforcing in-group and out-group divisions. To facilitate future research and ensure reproducibility, we publicly release our Korean and English multimodal moral emotion classifiers.


[90] Industrialized Deception: The Collateral Effects of LLM-Generated Misinformation on Digital Ecosystems cs.CY | cs.AI | cs.CL | cs.SIPDF

Alexander Loth, Martin Kappes, Marc-Oliver Pahl

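TL;DR: Building on the authors' 2024 survey, this paper updates the perspective on generative AI and misinformation, moving from literature review to practical countermeasures. It reports changes in the threat landscape, including higher-quality AI-generated content from LLMs and multimodal systems. The core contributions are JudgeGPT, a platform for evaluating human perception of AI-generated news, and RogueGPT, a controlled stimulus generation engine for research, which together form an experimental pipeline for studying how humans perceive and detect AI-generated misinformation. The findings show that detection capabilities have improved but the competition between generation and detection continues. The paper also discusses mitigation strategies such as LLM-based detection and inoculation approaches, and the dual-use nature of generative AI.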

Details

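Motivation: To respond to the growing harm that misinformation generated by large language models (LLMs) and multimodal systems causes to digital ecosystems, moving from theoretical review to practical tools and strategies for studying and mitigating the problem.

Result: Detection of AI-generated misinformation has improved, but the competition between generation and detection techniques continues. The paper conducts human perception studies on its own JudgeGPT and RogueGPT platforms; it reports no quantitative comparison on a standard benchmark and makes no SOTA claim.

Insight: The novelty is the combined JudgeGPT and RogueGPT pipeline, which pairs human perception evaluation with controlled misinformation generation and provides a systematic tool for empirical study of AI misinformation. Its practice-oriented shift from broad survey to concrete tooling and strategy, and its emphasis on the generation-versus-detection race and on the dual-use nature of AI, are useful reference points for follow-up research and countermeasure design.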

Abstract: Generative AI and misinformation research has evolved since our 2024 survey. This paper presents an updated perspective, transitioning from literature review to practical countermeasures. We report on changes in the threat landscape, including improved AI-generated content through Large Language Models (LLMs) and multimodal systems. Central to this work are our practical contributions: JudgeGPT, a platform for evaluating human perception of AI-generated news, and RogueGPT, a controlled stimulus generation engine for research. Together, these tools form an experimental pipeline for studying how humans perceive and detect AI-generated misinformation. Our findings show that detection capabilities have improved, but the competition between generation and detection continues. We discuss mitigation strategies including LLM-based detection, inoculation approaches, and the dual-use nature of generative AI. This work contributes to research addressing the adverse impacts of AI on information quality.


cs.LG [Back]

[91] Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning cs.LG | cs.AI | cs.CL | cs.CVPDF

Chengzu Li, Zanyi Wang, Jiaang Li, Yi Xu, Han Zhou

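TL;DR: The paper proposes using video generation models as the intermediate steps of visual reasoning, with generated frames simulating the dynamics from an initial state to a solution, and validates the idea on two tasks: Maze Navigation and Tangram Puzzle.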

Details

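Motivation: To address the weakness of Vision-Language Models in fine-grained spatial understanding and continuous action planning, and to explore video generation models as a paradigm for visual reasoning.

Result: On Maze Navigation and Tangram Puzzle, the model shows strong zero-shot generalization, adapting to unseen data distributions without finetuning. With visual context as control and longer generated videos, it maintains high visual consistency and generalizes better to complex paths.

Insight: The novelty is treating video generation as a scalable visual reasoning paradigm rather than merely a media tool. Key findings are the effectiveness of visual context as explicit control and a visual test-time scaling law: increasing the generated video length improves zero-shot generalization.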

Abstract: Vision-Language Models have excelled at textual reasoning, but they often struggle with fine-grained spatial understanding and continuous action planning, failing to simulate the dynamics required for complex visual reasoning. In this work, we formulate visual reasoning by means of video generation models, positing that generated frames can act as intermediate reasoning steps between initial states and solutions. We evaluate their capacity in two distinct regimes: Maze Navigation for sequential discrete planning with low visual change and Tangram Puzzle for continuous manipulation with high visual change. Our experiments reveal three critical insights: (1) Robust Zero-Shot Generalization: In both tasks, the model demonstrates strong performance on unseen data distributions without specific finetuning. (2) Visual Context: The model effectively uses visual context as explicit control, such as agent icons and tangram shapes, enabling it to maintain high visual consistency and adapt its planning capability robustly to unseen patterns. (3) Visual Test-Time Scaling: We observe a test-time scaling law in sequential planning; increasing the generated video length (visual inference budget) empowers better zero-shot generalization to spatially and temporally complex paths. These findings suggest that video generation is not merely a media tool, but a scalable, generalizable paradigm for visual reasoning.


[92] Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification cs.LG | cs.AI | cs.CLPDF

Yiju Guo, Tianyi Hu, Zexu Sun, Yankai Lin

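TL;DR: This paper proposes the Less Noise Sampling Framework (LENS) to make Reinforcement Learning with Verifiable Rewards (RLVR) more efficient on LLM reasoning tasks. Its core idea is to purify instructions by identifying and removing interference tokens in the prompt, which reduces wasted exploration and improves sampling success and training stability.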

Details

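Motivation: Under limited rollout budgets, existing RLVR methods suffer from inefficient exploration, low sampling success, and unstable training on complex tasks. The authors find that many exploration failures stem not from the difficulty of the task itself but from noise introduced by a small number of interference tokens in the prompt.

Result: LENS significantly outperforms the GRPO baseline, with a 3.88% average gain and over 1.6x faster convergence.

Insight: The core novelty is showing how strongly interference tokens in the prompt affect RLVR exploration efficiency, and proposing a two-stage framework that first purifies the instruction and then uses the successful rollouts to supervise policy optimization on the original noisy prompt. This gives RLVR research a new angle: improving rollout efficiency by pruning interference tokens.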

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but remains constrained by inefficient exploration under limited rollout budgets, leading to low sampling success and unstable training in complex tasks. We find that many exploration failures arise not from problem difficulty, but from a small number of prompt tokens that introduce interference. Building on this insight, we propose the Less Noise Sampling Framework (LENS), which first purifies prompts by identifying and removing interference tokens, then transfers successful rollouts from the purification process to supervise policy optimization on the original noisy prompts, enabling the model to learn to ignore interference in real-world, noisy prompting settings. Experimental results show that LENS significantly outperforms GRPO, delivering higher performance and faster convergence, with a 3.88% average gain and over 1.6x speedup. Our work highlights the critical role of pruning interference tokens in improving rollout efficiency, offering a new perspective for RLVR research.


[93] From Consistency to Complementarity: Aligned and Disentangled Multi-modal Learning for Time Series Understanding and Reasoning cs.LG | cs.AI | cs.CL | cs.CVPDF

Hang Ni, Weijia Zhang, Fei Wang, Zezhi Shao, Hao Liu

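TL;DR: This paper proposes MADI, a multimodal large language model enhanced with fine-grained alignment and disentangled interaction, to address fine-grained cross-modal misalignment and semantic entanglement in time series understanding and reasoning. It outperforms general-purpose LLMs and specialized MLLMs on synthetic and real-world benchmarks.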

Details

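Motivation: In multimodal time series understanding, fine-grained temporal misalignment across modalities and severe entanglement between shared and modality-specific semantics hinder localized interpretation and complementary reasoning.

Result: On synthetic and real-world benchmarks, MADI consistently outperforms general-purpose LLMs and time-series-specialized MLLMs.

Insight: The novelties are (1) physically grounded, fine-grained Patch-level Alignment, (2) Discrete Disentangled Interaction, which separates modality-common semantics into compact discrete latents and adaptively synergizes the purified modality-unique information, and (3) Critical-token Highlighting, which emphasizes informative, query-relevant signals.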

Abstract: Advances in multi-modal large language models (MLLMs) have inspired time series understanding and reasoning tasks that enable natural language querying over time series and produce textual analyses of complex temporal dynamics. Recent attempts hybridize numerical time series with their visualized plots, facilitating precise value reasoning and visual structure comprehension for comprehensive time series understanding by MLLMs. However, effective cross-modal integration remains challenging due to fine-grained temporal misalignment across modalities and severe entanglement between shared and modality-specific semantics, which hinder localized interpretation and complementary reasoning. To address these issues, we propose MADI, a multi-modal LLM enhanced with fine-grained alignment and disentangled interaction, featuring (1) Patch-level Alignment, which enforces physically grounded fine-grained correspondence across heterogeneous modalities, (2) Discrete Disentangled Interaction, which separates modality-common semantics into compact discrete latents and adaptively synergizes the purified modality-unique information, and (3) Critical-token Highlighting, which emphasizes informative, query-relevant signals for robust reasoning. Experiments on synthetic and real-world benchmarks show that MADI consistently outperforms general-purpose LLMs and time-series-specialized MLLMs.


[94] Breaking the Overscaling Curse: Thinking Parallelism Before Parallel Thinking cs.LG | cs.AI | cs.CLPDF

Yiming Wang, Zhuosheng Zhang, Rui Wang

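TL;DR: This paper proposes T2, a lightweight method for breaking the overscaling curse in parallel thinking. The curse arises because the system assigns one global parallelism level N to all samples, while sample heterogeneity means some samples reach comparable performance with a smaller level N', so compute is wasted. T2 uses latent representations to estimate the optimal parallelism level for each sample before decoding, which cuts cost substantially while maintaining performance.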

Details

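Motivation: To resolve the incompatibility between system-level efficacy and sample-level efficiency in parallel-thinking inference, that is, the overscaling curse, which wastes compute budget.

Result: Experiments show that T2 significantly reduces computational cost while maintaining comparable performance, enabling more efficient parallel thinking.

Insight: The novelty is formalizing and quantifying the overscaling curse, showing how universal and severe it is, and proposing a lightweight method based on latent representations that estimates each sample's optimal parallelism level before decoding.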

Abstract: Parallel thinking enhances LLM reasoning by multi-path sampling and aggregation. In system-level evaluations, a global parallelism level N is allocated to all samples, typically set large to maximize overall dataset accuracy. However, due to sample heterogeneity, some samples can achieve comparable performance with a smaller N' < N, causing budget redundancy. This incompatibility between system-level efficacy and sample-level efficiency constitutes the overscaling curse. In this paper, we formalize and quantify the overscaling curse, showing its universality and severity in practice, and analyze its trigger mechanism. We then propose a lightweight method, T2, to break the overscaling curse, which utilizes latent representations to estimate the optimal parallelism level for each sample before decoding. Experiments show that T2 significantly reduces cost while maintaining comparable performance, enabling more efficient parallel thinking.
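The budget redundancy behind the overscaling curse can be illustrated with a toy Python example. The sketch assumes that aggregation is majority voting over sampled answers and measures, after the fact, the smallest number of paths whose vote already matches the vote at the full level N. This is only an oracle-style illustration of the redundancy; T2 itself estimates the level before decoding from latent representations, which is not reproduced here. The names `majority_vote` and `smallest_sufficient_n` and the sample answers are hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    counts = Counter(answers)
    # max() returns the first maximal key, and Counter keeps insertion order.
    return max(counts, key=counts.get)


def smallest_sufficient_n(answers):
    """Smallest prefix length whose majority vote equals the vote over all paths."""
    full_vote = majority_vote(answers)
    for n in range(1, len(answers) + 1):
        if majority_vote(answers[:n]) == full_vote:
            return n
    return len(answers)


# Hypothetical sampled answers for two questions at a global level N = 8.
samples = {
    "easy": ["42"] * 8,
    "hard": ["7", "9", "7", "9", "9", "9", "3", "9"],
}
needed = {name: smallest_sufficient_n(paths) for name, paths in samples.items()}
uniform_budget = sum(len(paths) for paths in samples.values())
adaptive_budget = sum(needed.values())
print(needed)                           # {'easy': 1, 'hard': 5}
print(uniform_budget, adaptive_budget)  # 16 6
```

In this toy case the easy question needs one path and the hard one needs five, so a uniform N = 8 spends 16 paths where 6 would give the same aggregated answers.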


[95] Lossy Common Information in a Learnable Gray-Wyner Network cs.LG | cs.CV | cs.ITPDF

Anderson de Andrade, Alon Harell, Ivan V. Bajić

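TL;DR: This paper proposes a learnable Gray-Wyner network that separates shared information from task-specific information across multiple vision tasks to reduce representational redundancy. Using the notion of lossy common information, it designs an optimization objective that balances the tradeoffs in learning such representations, and shows on several vision benchmarks that the method reduces redundancy and outperforms independent coding.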

Details

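Motivation: Conventional codecs ignore the information shared across vision tasks, which leads to redundant and inefficient representations. The paper uses the Gray-Wyner network from information theory as a framework for separating common and task-specific information.

Result: Comparing three codec architectures on two-task scenarios spanning six vision benchmarks, the approach substantially reduces redundancy and consistently outperforms independent coding.

Insight: The novelty is bringing classical Gray-Wyner theory into modern machine learning: a learnable three-channel codec disentangles the information, and the notion of lossy common information characterizes the tradeoffs in representation learning.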

Abstract: Many computer vision tasks share substantial overlapping information, yet conventional codecs tend to ignore this, leading to redundant and inefficient representations. The Gray-Wyner network, a classical concept from information theory, offers a principled framework for separating common and task-specific information. Inspired by this idea, we develop a learnable three-channel codec that disentangles shared information from task-specific details across multiple vision tasks. We characterize the limits of this approach through the notion of lossy common information, and propose an optimization objective that balances inherent tradeoffs in learning such representations. Through comparisons of three codec architectures on two-task scenarios spanning six vision benchmarks, we demonstrate that our approach substantially reduces redundancy and consistently outperforms independent coding. These results highlight the practical value of revisiting Gray-Wyner theory in modern machine learning contexts, bridging classic information theory with task-driven representation learning.


[96] Visual-Guided Key-Token Regularization for Multimodal Large Language Model Unlearning cs.LG | cs.CVPDF

Chengyi Cai, Zesheng Ye, Peike Li, Bo Han, Jianzhong Qi

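TL;DR: This paper addresses unlearning in Multimodal Large Language Models (MLLMs) and proposes Visual-Guided Key-Token Regularization (ViKeR). The method uses irrelevant visual inputs to predict ideal post-unlearning token-level distributions and regularizes the unlearning process with them, thereby prioritizing the key tokens in an answer.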

Details

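Motivation: Existing MLLM unlearning methods largely adopt approaches developed for LLMs. They treat all answer tokens uniformly, ignoring their varying importance in unlearning, and they focus only on the language modality, ignoring visual cues that indicate key tokens.

Result: Experiments on the MLLMU and CLEAR benchmarks show that the method performs unlearning effectively while mitigating forgetting and maintaining response coherence.

Insight: The novelty is defining key tokens in unlearning via information entropy and analyzing the method through visually guided token-level gradient reweighting, which amplifies updates on key tokens. This lets multimodal unlearning exploit visual and linguistic information together.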

Abstract: Unlearning in Multimodal Large Language Models (MLLMs) prevents the model from revealing private information when queried about target images. Existing MLLM unlearning methods largely adopt approaches developed for LLMs. They treat all answer tokens uniformly, disregarding their varying importance in the unlearning process. Moreover, these methods focus exclusively on the language modality, disregarding visual cues that indicate key tokens in answers. In this paper, after formulating the problem of unlearning in multimodal question answering for MLLMs, we propose Visual-Guided Key-Token Regularization (ViKeR). We leverage irrelevant visual inputs to predict ideal post-unlearning token-level distributions and use these distributions to regularize the unlearning process, thereby prioritizing key tokens. Further, we define key tokens in unlearning via information entropy and discuss ViKeR’s effectiveness through token-level gradient reweighting, which amplifies updates on key tokens. Experiments on MLLMU and CLEAR benchmarks demonstrate that our method effectively performs unlearning while mitigating forgetting and maintaining response coherence.


cs.GR [Back]

[97] JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion cs.GR | cs.CVPDF

Anthony Chen, Naomi Ken Korem, Tavi Halperin, Matan Ben Yosef, Urska Jelercic

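TL;DR: This paper proposes JUST-DUB-IT, which adapts a pretrained audio-visual foundation diffusion model to video-to-video dubbing via a lightweight LoRA. Conditioned on an input audio-video, it jointly generates translated audio and synchronized facial motion, preserving speaker identity and lip synchronization while remaining robust to complex motion and real-world dynamics.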

Details

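Motivation: Existing video dubbing solutions usually rely on complex, task-specific pipelines that perform poorly in real-world settings. Audio-visual foundation models have shown an unprecedented ability in multimodal generation and editing, which opens new opportunities for downstream tasks such as video dubbing. The paper aims to exploit this generative prior in a simpler, more effective single-model approach.

Result: Compared with existing dubbing pipelines, the method produces high-quality dubbed videos with improved visual fidelity, lip synchronization, and robustness.

Insight: The novelty is using the generative ability of the foundational audio-video diffusion model itself to synthesize paired multilingual videos containing language switches for training the LoRA, which removes the dependence on large-scale real paired data. Combining a strong generative prior with lightweight adaptation gives an efficient, scalable recipe for multimodal tasks.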

Abstract: Audio-Visual Foundation Models, which are pretrained to jointly generate sound and visual content, have recently shown an unprecedented ability to model multi-modal generation and editing, opening new opportunities for downstream tasks. Among these tasks, video dubbing could greatly benefit from such priors, yet most existing solutions still rely on complex, task-specific pipelines that struggle in real-world settings. In this work, we introduce a single-model approach that adapts a foundational audio-video diffusion model for video-to-video dubbing via a lightweight LoRA. The LoRA enables the model to condition on an input audio-video while jointly generating translated audio and synchronized facial motion. To train this LoRA, we leverage the generative model itself to synthesize paired multilingual videos of the same speaker. Specifically, we generate multilingual videos with language switches within a single clip, and then inpaint the face and audio in each half to match the language of the other half. By leveraging the rich generative prior of the audio-visual model, our approach preserves speaker identity and lip synchronization while remaining robust to complex motion and real-world dynamics. We demonstrate that our approach produces high-quality dubbed videos with improved visual fidelity, lip synchronization, and robustness compared to existing dubbing pipelines.


cs.CR [Back]

[98] On the Adversarial Robustness of Large Vision-Language Models under Visual Token Compression cs.CR | cs.AI | cs.CVPDF

Xinwei Zhang, Hangcheng Liu, Li Bai, Hao Wang, Qingqing Ye

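TL;DR: This paper studies the adversarial robustness of large vision-language models (LVLMs) under visual token compression. It shows that existing encoder-based attacks overestimate the robustness of compressed models because of an optimization-inference mismatch, and proposes the Compression-AliGnEd attack (CAGE) to assess robustness more accurately.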

Details

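Motivation: Visual token compression accelerates LVLMs, but its adversarial robustness has not been explored. Existing attacks yield overly optimistic robustness estimates because of a mismatch between optimization and inference: perturbations are optimized on the full-token representation, while inference passes through a token-compression bottleneck.

Result: Across diverse plug-and-play compression mechanisms and datasets, CAGE consistently achieves lower robust accuracy than the baseline, which shows that robustness assessments ignoring compression can be overly optimistic.

Insight: The novelty is CAGE, which aligns perturbation optimization with compressed inference through expected feature disruption and rank distortion alignment, without assuming access to the deployed compression mechanism or its token budget. It motivates compression-aware security evaluation and defenses for efficient LVLMs.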

Abstract: Visual token compression is widely used to accelerate large vision-language models (LVLMs) by pruning or merging visual tokens, yet its adversarial robustness remains unexplored. We show that existing encoder-based attacks can substantially overestimate the robustness of compressed LVLMs, due to an optimization-inference mismatch: perturbations are optimized on the full-token representation, while inference is performed through a token-compression bottleneck. To address this gap, we propose the Compression-AliGnEd attack (CAGE), which aligns perturbation optimization with compression inference without assuming access to the deployed compression mechanism or its token budget. CAGE combines (i) expected feature disruption, which concentrates distortion on tokens likely to survive across plausible budgets, and (ii) rank distortion alignment, which actively aligns token distortions with rank scores to promote the retention of highly distorted evidence. Across diverse representative plug-and-play compression mechanisms and datasets, our results show that CAGE consistently achieves lower robust accuracy than the baseline. This work highlights that robustness assessments ignoring compression can be overly optimistic, calling for compression-aware security evaluation and defenses for efficient LVLMs.


[99] RedSage: A Cybersecurity Generalist LLM cs.CR | cs.AI | cs.CLPDF

Naufal Suryanto, Muzammal Naseer, Pengfei Li, Syed Talal Wasim, Jinhui Yi

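TL;DR: This paper presents RedSage, an open-source, locally deployable cybersecurity generalist LLM assistant. It is built by continual pretraining on large-scale cybersecurity data, followed by instruction tuning on data generated by an agentic augmentation pipeline that simulates expert workflows, combined with general open-source LLM data. The paper also introduces the RedSage-Bench evaluation benchmark and validates the model on multiple cybersecurity and general benchmarks.

Details

Motivation: Cybersecurity operations need assistant LLMs that support diverse workflows without exposing sensitive data. Existing solutions either rely on proprietary APIs with privacy risks or use open models that lack domain adaptation.

Result: At the 8B scale, RedSage achieves consistently better results, surpassing baseline models by up to +5.59 points on cybersecurity benchmarks (e.g., CTI-Bench, CyberMetric, SECURE) and by up to +5.05 points on Open LLM Leaderboard general tasks.

Insight: The contributions are: 1) a high-quality cybersecurity continual-pretraining dataset built through large-scale web filtering and manual collection; 2) an agentic augmentation pipeline that simulates expert workflows to generate multi-turn fine-tuning data; 3) a combination of domain-aware pretraining and post-training that improves both domain expertise and general reasoning and instruction following. The approach demonstrates the effectiveness of domain-specific data augmentation and training strategies.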

Abstract: Cybersecurity operations demand assistant LLMs that support diverse workflows without exposing sensitive data. Existing solutions either rely on proprietary APIs with privacy risks or on open models lacking domain adaptation. To bridge this gap, we curate 11.8B tokens of cybersecurity-focused continual pretraining data via large-scale web filtering and manual collection of high-quality resources, spanning 28.6K documents across frameworks, offensive techniques, and security tools. Building on this, we design an agentic augmentation pipeline that simulates expert workflows to generate 266K multi-turn cybersecurity samples for supervised fine-tuning. Combined with general open-source LLM data, these resources enable the training of RedSage, an open-source, locally deployable cybersecurity assistant with domain-aware pretraining and post-training. To rigorously evaluate the models, we introduce RedSage-Bench, a benchmark with 30K multiple-choice and 240 open-ended Q&A items covering cybersecurity knowledge, skills, and tool expertise. RedSage is further evaluated on established cybersecurity benchmarks (e.g., CTI-Bench, CyberMetric, SECURE) and general LLM benchmarks to assess broader generalization. At the 8B scale, RedSage achieves consistently better results, surpassing the baseline models by up to +5.59 points on cybersecurity benchmarks and +5.05 points on Open LLM Leaderboard tasks. These findings demonstrate that domain-aware agentic augmentation and pre/post-training can not only enhance cybersecurity-specific expertise but also help to improve general reasoning and instruction-following. All models, datasets, and code are publicly available.


cs.AI [Back]

[100] Bridging the Arithmetic Gap: The Cognitive Complexity Benchmark and Financial-PoT for Robust Financial Reasoning cs.AI | cs.CLPDF

Boxiang Zhao, Qince Li, Zhonghao Wang, Yi Wang, Peng Cheng

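TL;DR: To address the “arithmetic hallucinations” and “cognitive collapse” that large language models exhibit in financial quantitative reasoning, this paper proposes the Cognitive Complexity Benchmark (CCB) and the Iterative Dual-Phase Financial-PoT framework. CCB is built from 95 real-world Chinese A-share annual reports and uses a three-dimensional taxonomy (Data Source, Mapping Difficulty, Result Unit) to assess reasoning degradation under high cognitive load. Financial-PoT uses a neuro-symbolic architecture that separates semantic variable extraction and logic formulation from computation, which runs in an iterative, self-correcting Python sandbox, and it substantially improves robustness and accuracy on complex financial tasks.

Details

Motivation: In financial quantitative reasoning, LLMs commonly suffer from “arithmetic hallucinations” (incorrect numerical results) and “cognitive collapse” (systemic reasoning failure). The paper aims to improve their reliability in domains such as financial analysis that require tight alignment between semantic understanding and quantitative computation.

Result: On the proposed CCB, standard Chain-of-Thought performs poorly on complex tasks, whereas Financial-PoT raises the average accuracy of Qwen3-235B from 59.7% to 67.3% and achieves gains of up to 10-fold on high-complexity reasoning tasks.

Insight: The claimed contributions are: 1) CCB, described as the first three-dimensional benchmark that strictly quantifies cognitive complexity in financial reasoning; 2) the Iterative Dual-Phase Financial-PoT neuro-symbolic framework, which ensures deterministic execution through architectural decoupling (separating semantics from computation) and an iterative, self-correcting sandbox. Viewed objectively, the core insight is that in precision-critical domains, enforcing architectural decoupling to align semantic understanding with quantitative computation is a key enabler of reliability, and this architectural insight is transferable.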

Abstract: While Large Language Models excel at semantic tasks, they face a critical bottleneck in financial quantitative reasoning, frequently suffering from “Arithmetic Hallucinations” and a systemic failure mode we term “Cognitive Collapse”. To strictly quantify this phenomenon, we introduce the Cognitive Complexity Benchmark (CCB), a robust evaluation framework grounded in a dataset constructed from 95 real-world Chinese A-share annual reports. Unlike traditional datasets, the CCB stratifies financial queries into a three-dimensional taxonomy, Data Source, Mapping Difficulty, and Result Unit, enabling the precise diagnosis of reasoning degradation in high-cognitive-load scenarios. To address these failures, we propose the Iterative Dual-Phase Financial-PoT framework. This neuro-symbolic architecture enforces a strict architectural decoupling: it first isolates semantic variable extraction and logic formulation, then offloads computation to an iterative, self-correcting Python sandbox to ensure deterministic execution. Evaluation on the CCB demonstrates that while standard Chain-of-Thought falters on complex tasks, our approach offers superior robustness, elevating the Qwen3-235B model’s average accuracy from 59.7% to 67.3% and achieving gains of up to 10-fold in high-complexity reasoning tasks. These findings suggest that architectural decoupling is a critical enabling factor for improving reliability in financial reasoning tasks, providing a transferable architectural insight for precision-critical domains that require tight alignment between semantic understanding and quantitative computation.
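The decoupling described in this abstract can be illustrated with a minimal sketch, assuming the first phase has already produced a dictionary of extracted variables and a formula string. The function name `evaluate_formula`, the variable names, the sample values, and the restriction to four arithmetic operators are all assumptions made for illustration; the paper's iterative, self-correcting sandbox is not reproduced here.

```python
import ast
import operator

# Arithmetic operators the evaluator accepts (an illustrative restriction).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate_formula(formula, variables):
    """Deterministically evaluate an arithmetic formula over named variables."""

    def walk(node):
        # Binary arithmetic: evaluate both sides, then apply the operator.
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        # Variable reference: look it up in the extracted values.
        if isinstance(node, ast.Name):
            return variables[node.id]
        # Numeric literal.
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        # Anything else (calls, attributes, ...) is rejected.
        raise ValueError("unsupported expression")

    return walk(ast.parse(formula, mode="eval").body)


# Hypothetical phase-1 output standing in for model-extracted figures.
extracted = {"revenue": 200.0, "cost": 150.0}
margin = evaluate_formula("(revenue - cost) / revenue", extracted)
```

Keeping the arithmetic outside the language model means a wrong answer can only come from a wrong variable or a wrong formula, both of which can be inspected.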


[101] Do Reasoning Models Enhance Embedding Models? cs.AI | cs.CLPDF

Wun Yu Chan, Shaojin Chen, Huihao Jing, Kwun Hang Lau, Elton Chun-Chai Li

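TL;DR: This paper examines whether reasoning models trained with Reinforcement Learning with Verifiable Rewards (RLVR) improve the semantic representations of embedding models. On MTEB and BRIGHT, embedding models initialized from RLVR-tuned backbones show no consistent advantage over their base counterparts. Using the proposed Hierarchical Representation Similarity Analysis (HRSA) framework, the paper shows that RLVR mainly induces an irreversible reorganization of the latent manifold's local geometry and a reversible drift of the coordinate basis, while preserving global manifold geometry and linear readout. As a result, subsequent contrastive learning strongly aligns base- and reasoning-initialized models, a phenomenon termed Manifold Realignment.

Details

Motivation: To determine whether enhanced reasoning ability translates into better semantic representations, that is, whether RLVR-trained reasoning models yield gains when used as embedding initializations.

Result: Evaluation on MTEB and BRIGHT shows a null effect: embedding models initialized from RLVR-tuned backbones have no consistent performance advantage over base models.

Insight: The novelty is the HRSA framework, which decomposes similarity across representation, geometry, and function levels, together with the discovery of Manifold Realignment. Viewed objectively, the analysis indicates that RLVR, unlike Supervised Fine-Tuning (SFT), optimizes trajectories within an existing semantic landscape rather than fundamentally restructuring the landscape itself.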

Abstract: State-of-the-art embedding models are increasingly derived from decoder-only Large Language Model (LLM) backbones adapted via contrastive learning. Given the emergence of reasoning models trained via Reinforcement Learning with Verifiable Rewards (RLVR), a natural question arises: does enhanced reasoning translate to superior semantic representations when these models serve as embedding initializations? Contrary to expectation, our evaluation on MTEB and BRIGHT reveals a null effect: embedding models initialized from RLVR-tuned backbones yield no consistent performance advantage over their base counterparts when subjected to identical training recipes. To unpack this paradox, we introduce Hierarchical Representation Similarity Analysis (HRSA), a framework that decomposes similarity across representation, geometry, and function levels. HRSA reveals that while RLVR induces an irreversible reorganization of the latent manifold’s local geometry and a reversible drift of the coordinate basis, it preserves the global manifold geometry and linear readout. Consequently, subsequent contrastive learning drives strong alignment between base- and reasoning-initialized models, a phenomenon we term Manifold Realignment. Empirically, our findings suggest that unlike Supervised Fine-Tuning (SFT), RLVR optimizes trajectories within an existing semantic landscape rather than fundamentally restructuring the landscape itself.


[102] Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization cs.AI | cs.CLPDF

Jiecong Wang, Hao Peng, Chunyang Liu

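TL;DR: This paper proposes PLaT, a framework that recasts latent reasoning as planning. By decoupling reasoning from verbalization, it plans a deterministic trajectory in continuous hidden states, while a separate decoder turns latent thoughts into text when necessary, so the model can decide dynamically when to stop reasoning.

Details

Motivation: Existing Chain-of-Thought methods operate in discrete token space, which leads to high computational cost and reasoning path collapse. Existing latent reasoning methods act as opaque end-to-end mappings and require a pre-defined number of latent steps at inference.

Result: On mathematical benchmarks, PLaT achieves lower greedy accuracy than baselines but scales better in terms of reasoning diversity, suggesting that it learns a more robust and broader solution space.

Insight: The novelty is decoupling reasoning from verbalization and modeling reasoning as a deterministic trajectory of latent planning states, which lets the model terminate reasoning dynamically and provides a transparent, scalable foundation for inference-time search. Viewed objectively, this decoupled design may open a new path toward efficient and robust LLM reasoning.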

Abstract: Chain-of-Thought (CoT) empowers Large Language Models (LLMs) to tackle complex problems, but remains constrained by the computational cost and reasoning path collapse when grounded in discrete token spaces. Recent latent reasoning approaches attempt to optimize efficiency by performing reasoning within continuous hidden states. However, these methods typically operate as opaque end-to-end mappings from explicit reasoning steps to latent states, and often require a pre-defined number of latent steps during inference. In this work, we introduce PLaT (Planning with Latent Thoughts), a framework that reformulates latent reasoning as planning by fundamentally decoupling reasoning from verbalization. We model reasoning as a deterministic trajectory of latent planning states, while a separate Decoder grounds these thoughts into text when necessary. This decoupling allows the model to dynamically determine when to terminate reasoning rather than relying on fixed hyperparameters. Empirical results on mathematical benchmarks reveal a distinct trade-off: while PLaT achieves lower greedy accuracy than baselines, it demonstrates superior scalability in terms of reasoning diversity. This indicates that PLaT learns a robust, broader solution space, offering a transparent and scalable foundation for inference-time search.


[103] System 1&2 Synergy via Dynamic Model Interpolation cs.AI | cs.CLPDF

Chenxu Yang, Qingyi Si, Chong Tian, Xiyu Liu, Dingyu Yao

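TL;DR: This paper proposes DAMI, a dynamic model interpolation framework that uses dynamic parameter interpolation to let a language model adapt between intuitive System 1 and deliberative System 2 behavior, improving reasoning while remaining efficient.

Details

Motivation: Existing methods improve the efficiency of System 2 models by controlling output length. The authors argue that this treats the symptom rather than the cause, and they shift to capability control, which modulates how models think rather than what they produce.

Result: On five mathematical reasoning benchmarks, DAMI achieves higher accuracy than the Thinking model while remaining efficient, combining the efficiency of System 1 with the reasoning depth of System 2.

Insight: The novelty is the shift from output control to capability control: existing checkpoints are combined through dynamic parameter interpolation to configure cognitive depth, with no additional training. The paper also proposes two ways to estimate reasoning intensity, one based on preference learning and one based on confidence, to set a query-specific cognitive depth.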

Abstract: Training a unified language model that adapts between intuitive System 1 and deliberative System 2 remains challenging due to interference between their cognitive modes. Recent studies have thus pursued making System 2 models more efficient. However, these approaches focused on output control, limiting what models produce. We argue that this paradigm is misaligned: output length is merely a symptom of the model’s cognitive configuration, not the root cause. In this work, we shift the focus to capability control, which modulates how models think rather than what they produce. To realize this, we leverage existing Instruct and Thinking checkpoints through dynamic parameter interpolation, without additional training. Our pilot study establishes that linear interpolation yields a convex, monotonic Pareto frontier, underpinned by representation continuity and structural connectivity. Building on this, we propose DAMI (DynAmic Model Interpolation), a framework that estimates a query-specific Reasoning Intensity $λ(q)$ to configure cognitive depth. For training-based estimation, we develop a preference learning method encoding accuracy and efficiency criteria. For zero-shot deployment, we introduce a confidence-based method leveraging inter-model cognitive discrepancy. Experiments on five mathematical reasoning benchmarks demonstrate that DAMI achieves higher accuracy than the Thinking model while remaining efficient, effectively combining the efficiency of System 1 with the reasoning depth of System 2.
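The linear parameter interpolation this abstract refers to can be sketched as follows. This is only an illustration under stated assumptions: checkpoints are represented as plain dictionaries of float lists, the function name `interpolate` is hypothetical, and the convention that a weight of 0 gives the Instruct checkpoint and 1 gives the Thinking checkpoint is assumed rather than taken from the paper. How the weight is estimated per query is not shown.

```python
def interpolate(instruct, thinking, lam):
    """Blend two checkpoints parameter by parameter.

    Assumed convention: lam = 0 returns the Instruct weights,
    lam = 1 returns the Thinking weights.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if instruct.keys() != thinking.keys():
        raise ValueError("checkpoints must share parameter names")

    merged = {}
    for name in instruct:
        a, b = instruct[name], thinking[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # theta = (1 - lam) * theta_instruct + lam * theta_thinking
        merged[name] = [(1.0 - lam) * x + lam * y for x, y in zip(a, b)]
    return merged


# Toy checkpoints with a single two-element parameter.
instruct_ckpt = {"w": [0.0, 2.0]}
thinking_ckpt = {"w": [1.0, 4.0]}
halfway = interpolate(instruct_ckpt, thinking_ckpt, 0.5)
```

With real models the same formula would be applied tensor by tensor, which requires the two checkpoints to share an architecture.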


[104] The Path of Least Resistance: Guiding LLM Reasoning Trajectories with Prefix Consensus cs.AI | cs.CLPDF

Ishan Jindal, Sai Prashanth Akuthota, Jayant Taneja, Sachin Dev Sharma

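TL;DR: The paper proposes PoLR (Path of Least Resistance), an inference-time method that uses prefix consistency to guide LLM reasoning trajectories efficiently. It clusters short prefixes of reasoning traces, identifies the dominant cluster, and expands all paths in that cluster, preserving the accuracy of Self-Consistency while substantially reducing token usage and latency.

Details

Motivation: Inference strategies such as Self-Consistency are computationally expensive because they fully expand every reasoning trace, which greatly increases token usage and latency.

Result: On GSM8K, MATH500, AIME24/25, and GPQA-DIAMOND, PoLR consistently matches or exceeds Self-Consistency while reducing token usage by up to 60% and wall-clock latency by up to 50%.

Insight: The novelty is being the first method to use prefix consistency for efficient reasoning. A theoretical analysis based on mutual information and entropy explains why early reasoning steps carry strong signals predictive of final correctness. The method is fully complementary to adaptive inference methods and can serve as a drop-in pre-filter that makes Self-Consistency substantially more efficient and scalable without model fine-tuning.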

Abstract: Large language models achieve strong reasoning performance, but inference strategies such as Self-Consistency (SC) are computationally expensive, as they fully expand all reasoning traces. We introduce PoLR (Path of Least Resistance), the first inference-time method to leverage prefix consistency for compute-efficient reasoning. PoLR clusters short prefixes of reasoning traces, identifies the dominant cluster, and expands all paths in that cluster, preserving the accuracy benefits of SC while substantially reducing token usage and latency. Our theoretical analysis, framed via mutual information and entropy, explains why early reasoning steps encode strong signals predictive of final correctness. Empirically, PoLR consistently matches or exceeds SC across GSM8K, MATH500, AIME24/25, and GPQA-DIAMOND, reducing token usage by up to 60% and wall-clock latency by up to 50%. Moreover, PoLR is fully complementary to adaptive inference methods (e.g., Adaptive Consistency, Early-Stopping SC) and can serve as a drop-in pre-filter, making SC substantially more efficient and scalable without requiring model fine-tuning.
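The cluster-then-expand procedure in this abstract can be sketched roughly as below. The abstract does not say how prefixes are clustered, so this sketch groups them by exact match after whitespace and case normalization, which is an assumption chosen for simplicity. The names `dominant_prefix_cluster`, `polr_vote`, and the `expand` callback (a stand-in for continuing a trace with the model) are hypothetical.

```python
from collections import Counter


def dominant_prefix_cluster(prefixes, key=lambda p: p.strip().lower()):
    """Return the indices of the prefixes that fall in the largest group.

    Grouping by a normalized string key is an illustrative stand-in for
    whatever clustering the method actually uses.
    """
    keys = [key(p) for p in prefixes]
    top_key, _ = Counter(keys).most_common(1)[0]
    return [i for i, k in enumerate(keys) if k == top_key]


def polr_vote(prefixes, expand):
    """Expand only the dominant cluster, then majority-vote over its answers.

    `expand(prefix)` is assumed to return the final answer of the completed
    trace. Returns the winning answer and the number of traces expanded.
    """
    keep = dominant_prefix_cluster(prefixes)
    answers = [expand(prefixes[i]) for i in keep]
    return Counter(answers).most_common(1)[0][0], len(keep)


sampled = [
    "Let x be the number",
    "let x be the number ",
    "First, factor",
    "Let x be the number",
]
answer, n_expanded = polr_vote(sampled, lambda prefix: "42")
```

Only three of the four sampled traces are expanded here; the savings reported in the abstract come from skipping the traces outside the dominant cluster.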


[105] ShardMemo: Masked MoE Routing for Sharded Agentic LLM Memory cs.AI | cs.CLPDF

Yang Zhao, Chengxiao Dai, Yue Xiu, Mengying Kou, Yuliang Zheng

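TL;DR: This paper presents ShardMemo, a budgeted tiered memory service for agentic LLM systems with three tiers: Tier A (per-agent working state), Tier B (sharded evidence with shard-local approximate nearest neighbor indexes), and Tier C (a versioned skill library). Its core idea is Tier B's scope-before-routing mechanism, which uses structured eligibility constraints to mask ineligible shards before routing or search and casts shard probing as masked mixture-of-experts routing over eligible shards. On LoCoMo, long-context HotpotQA, and ToolBench, ShardMemo outperforms existing baselines in task performance, retrieval efficiency, and latency.

Details

Motivation: Agentic LLM systems rely on external memory, but centralized indexes and heuristic partitions become bottlenecks as memory volume and parallel access grow. The paper targets efficient, scalable memory retrieval for large-scale, concurrent agent systems.

Result: On LoCoMo, ShardMemo improves over the strongest baseline (GAM) by +5.11 to +6.82 F1 across question categories. Under a fixed-budget routing setting (B_probe=3), it improves over cosine-to-prototype shard routing by +6.87 F1 while reducing retrieval work (VecScan 521 to 414, -20.5%) and p95 latency (95 ms to 76 ms). On long-context HotpotQA, it achieves 63.41/61.88/57.95 F1 at 56K/224K/448K tokens. On ToolBench, Tier C reaches 0.97 Precision@3 and 1.94 StepRed (+10.2% and +7.2% over embedding-similarity retrieval).

Insight: The main contributions are: 1) a three-tier budgeted memory service architecture; 2) the scope-before-routing principle in Tier B, which pre-filters shards through structured eligibility constraints; 3) the formulation of shard probing as masked mixture-of-experts routing, with cost-aware gating and a router trained from evidence-to-shard supervision. Together these offer an efficient, scalable approach to memory management for large-scale agent systems, particularly for reducing unnecessary retrieval computation and latency.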

Abstract: Agentic large language model (LLM) systems rely on external memory for long-horizon state and concurrent multi-agent execution, but centralized indexes and heuristic partitions become bottlenecks as memory volume and parallel access grow. We present ShardMemo, a budgeted tiered memory service with Tier A per-agent working state, Tier B sharded evidence with shard-local approximate nearest neighbor (ANN) indexes, and Tier C, a versioned skill library. Tier B enforces scope-before-routing: structured eligibility constraints mask ineligible shards before routing or ANN search. We cast shard probing as masked mixture-of-experts (MoE) routing over eligible shards, probing up to $B_{\mathrm{probe}}$ shards via Top-$B_{\mathrm{probe}}$ or adaptive Top-$P$, and use cost-aware gating over profile/observation/session shard families; the router is trained from evidence-to-shard supervision. On LoCoMo, ShardMemo improves over the strongest baseline (GAM) by +5.11 to +6.82 F1 across question categories. Under a fixed-budget routing setting ($B_{\mathrm{probe}}=3$), ShardMemo improves over cosine-to-prototype shard routing by +6.87 F1 while reducing retrieval work (VecScan 521->414, -20.5%) and p95 latency (95->76 ms). On long-context HotpotQA, ShardMemo achieves 63.41/61.88/57.95 F1 at 56K/224K/448K tokens. On ToolBench, Tier C reaches 0.97 Precision@3 and 1.94 StepRed (+10.2% and +7.2% over embedding-similarity retrieval).
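The scope-before-routing step with a fixed probe budget can be sketched as a mask followed by a top-k selection. This is a simplified illustration: the router scores are given as a plain dictionary rather than produced by a trained router, cost-aware gating and adaptive Top-P are omitted, and the function name `route` and the shard identifiers are hypothetical.

```python
def route(scores, eligible, b_probe):
    """Pick up to b_probe shards to probe.

    Ineligible shards are masked out first (scope), and only then are the
    remaining shards ranked by router score (routing).
    """
    candidates = [shard for shard in scores if shard in eligible]
    candidates.sort(key=lambda shard: scores[shard], reverse=True)
    return candidates[:b_probe]


# Hypothetical router scores for four shards.
router_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
# Shard "a" scores highest but fails the eligibility constraints.
probed = route(router_scores, eligible={"b", "c", "d"}, b_probe=2)
```

Masking before ranking means a high-scoring but ineligible shard never consumes part of the probe budget.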


[106] Depth-Recurrent Attention Mixtures: Giving Latent Reasoning the Attention it Deserves cs.AI | cs.CL | cs.LGPDF

Jonas Knupp, Jan Hendrik Metzen, Jeremias Bohn, Georg Groh, Kristian Kersting

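TL;DR: This paper proposes Dreamer, a framework of depth-recurrent attention mixtures that combines sequence attention, depth attention, and sparse expert attention to address limitations of existing depth-recurrent models in parameter sharing, the hidden-size bottleneck, and computational efficiency. On language reasoning benchmarks, it needs 2 to 8x fewer training tokens to reach the same accuracy as FLOP-, parameter-, and memory-matched SOTA models, and with the same training tokens it outperforms SOTA models about 2x larger.

Details

Motivation: Prior depth-recurrent work has three main problems: it lacks combined FLOP-, parameter-, and memory-matched baselines; it underuses depth-recurrence because of partially fixed layer stacks; and it ignores the bottleneck of a constant hidden size, which restricts many-step latent reasoning.

Result: On language reasoning benchmarks, Dreamer models need 2 to 8x fewer training tokens than FLOP-, parameter-, and memory-matched SOTA models to reach the same accuracy, and they outperform roughly 2x larger SOTA models given the same training tokens. Expert selection diversity is also 2 to 11x larger than in SOTA MoE models.

Insight: The novelty is a modular framework of depth-recurrent attention mixtures that alleviates the hidden-size bottleneck through attention along depth, decouples scaling dimensions, and scales efficiently. Viewed objectively, optimizing depth-recurrence through attention improves the model's knowledge usage efficiency and reasoning ability.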

Abstract: Depth-recurrence facilitates latent reasoning by sharing parameters across depths. However, prior work lacks combined FLOP-, parameter-, and memory-matched baselines, underutilizes depth-recurrence due to partially fixed layer stacks, and ignores the bottleneck of constant hidden-sizes that restricts many-step latent reasoning. To address this, we introduce a modular framework of depth-recurrent attention mixtures (Dreamer), combining sequence attention, depth attention, and sparse expert attention. It alleviates the hidden-size bottleneck through attention along depth, decouples scaling dimensions, and allows depth-recurrent models to scale efficiently and effectively. Across language reasoning benchmarks, our models require 2 to 8x fewer training tokens for the same accuracy as FLOP-, parameter-, and memory-matched SOTA, and outperform ca. 2x larger SOTA models with the same training tokens. We further present insights into knowledge usage across depths, e.g., showing 2 to 11x larger expert selection diversity than SOTA MoEs.


[107] Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving cs.AI | cs.CVPDF

Weitong Lian, Zecong Tang, Haoran Li, Tianjian Gao, Yifei Wang

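TL;DR: This paper proposes Drive-KD, a framework that decomposes autonomous driving into three capabilities (perception, reasoning, and planning) and uses layer-specific attention as the distillation signal to build capability-specific single-teacher models. These are then unified into a multi-teacher knowledge distillation framework that uses asymmetric gradient projection to mitigate gradient conflicts, efficiently transferring the capabilities of large vision-language models (VLMs) to small models.

Details

Motivation: Large models (LLMs/VLMs) for autonomous driving demand substantial GPU memory and have high inference latency, while conventional supervised fine-tuning (SFT) struggles to close the capability gap of small models.

Result: The distilled InternVL3-1B model, with about 42 times less GPU memory and about 11.4 times higher throughput, achieves better overall performance on DriveBench than the pretrained 78B model from the same family, and it surpasses GPT-5.1 on the planning dimension.

Insight: The novelty lies in decomposing autonomous driving into a capability triad for targeted distillation, using layer-specific attention as the distillation signal, and designing a multi-teacher framework with asymmetric gradient projection to combine capabilities while mitigating gradient conflicts, which offers a new direction for efficient autonomous driving VLMs.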

Abstract: Autonomous driving is an important and safety-critical task, and recent advances in LLMs/VLMs have opened new possibilities for reasoning and planning in this domain. However, large models demand substantial GPU memory and exhibit high inference latency, while conventional supervised fine-tuning (SFT) often struggles to bridge the capability gaps of small models. To address these limitations, we propose Drive-KD, a framework that decomposes autonomous driving into a “perception-reasoning-planning” triad and transfers these capabilities via knowledge distillation. We identify layer-specific attention as the distillation signal to construct capability-specific single-teacher models that outperform baselines. Moreover, we unify these single-teacher settings into a multi-teacher distillation framework and introduce asymmetric gradient projection to mitigate cross-capability gradient conflicts. Extensive evaluations validate the generalization of our method across diverse model families and scales. Experiments show that our distilled InternVL3-1B model, with ~42 times less GPU memory and ~11.4 times higher throughput, achieves better overall performance than the pretrained 78B model from the same family on DriveBench, and surpasses GPT-5.1 on the planning dimension, providing insights toward efficient autonomous driving VLMs.


[108] Semantic Content Determines Algorithmic Performance cs.AI | cs.CLPDF

Martiño Ríos-García, Nawaf Alampara, Kevin Maik Jablonka

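TL;DR: Using the WhatCounts benchmark, this paper shows that the performance of frontier LLMs on algorithmic tasks such as counting varies substantially with the semantic type of the input (cities, chemicals, names, and so on). This indicates that LLMs do not truly implement algorithms but approximate them in a way that depends on the semantics of the arguments.

Details

Motivation: To test the basic principle that an algorithm's behavior should be independent of the semantic content of its arguments, and to examine whether LLMs actually exhibit this invariance.

Result: On WhatCounts, frontier LLMs show over 40% accuracy variation depending solely on the semantic type of the items being counted. Controlled ablations rule out confounds, and small amounts of unrelated fine-tuning shift the gap unpredictably.

Insight: The core novelty is the atomic design of WhatCounts, which isolates semantic sensitivity from factors such as reasoning complexity. Viewed objectively, the analysis shows that LLMs' algorithmic approximations carry hidden dependencies on input semantics, which raises fundamental doubts about the reliability of LLMs used as functions or agents.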

Abstract: Counting should not depend on what is being counted; more generally, any algorithm’s behavior should be invariant to the semantic content of its arguments. We introduce WhatCounts to test this property in isolation. Unlike prior work that conflates semantic sensitivity with reasoning complexity or prompt variation, WhatCounts is atomic: count items in an unambiguous, delimited list with no duplicates, distractors, or reasoning steps for different semantic types. Frontier LLMs show over 40% accuracy variation depending solely on what is being counted - cities versus chemicals, names versus symbols. Controlled ablations rule out confounds. The gap is semantic, and it shifts unpredictably with small amounts of unrelated fine-tuning. LLMs do not implement algorithms; they approximate them, and the approximation is argument-dependent. As we show with an agentic example, this has implications beyond counting: any LLM function may carry hidden dependencies on the meaning of its inputs.
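An atomic counting probe of the kind this abstract describes could be set up roughly as follows. The prompt wording, the delimiter, the function names, and the sample records are all assumptions made for illustration and do not come from the benchmark itself; no model is called, and the predicted counts are placeholder values.

```python
def build_prompt(items, delimiter=" | "):
    """Format a duplicate-free, delimited list as a counting question."""
    if len(set(items)) != len(items):
        raise ValueError("items must not contain duplicates")
    return "How many items are in this list?\n" + delimiter.join(items)


def accuracy_by_type(records):
    """Exact-match accuracy per semantic type.

    `records` is an iterable of (semantic_type, true_count, predicted_count).
    """
    totals, hits = {}, {}
    for semantic_type, truth, predicted in records:
        totals[semantic_type] = totals.get(semantic_type, 0) + 1
        hits[semantic_type] = hits.get(semantic_type, 0) + (truth == predicted)
    return {t: hits[t] / totals[t] for t in totals}


prompt = build_prompt(["Lima", "Oslo", "Cairo"])

# Placeholder results: same list lengths, different semantic types.
results = [
    ("cities", 5, 5),
    ("cities", 7, 7),
    ("chemicals", 5, 4),
    ("chemicals", 7, 7),
]
per_type = accuracy_by_type(results)
gap = max(per_type.values()) - min(per_type.values())
```

Because list length and format are held fixed across types, any gap between types in such a setup would be attributable to what is being counted.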


[109] Epistemic Context Learning: Building Trust the Right Way in LLM-Based Multi-Agent Systems cs.AI | cs.CL | cs.MAPDF

Ruiwen Zhou, Maojia Song, Xiaobao Wu, Sitao Cheng, Xunjian Yin

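TL;DR: This paper proposes Epistemic Context Learning (ECL), a reasoning framework that addresses the lack of robustness in multi-agent systems whose agents blindly follow misleading peers. ECL builds peer profiles from historical interactions so that agents can assess peer reliability and learn from trustworthy peers when uncertain. Experiments show that ECL enables small models such as Qwen 3-4B to outperform a history-agnostic baseline 8x their size (Qwen 3-30B) and pushes frontier models to near-perfect performance.

Details

Motivation: Agents in multi-agent systems lack robustness and tend to conform blindly to misleading peers, owing to both sycophancy and an inadequate ability to evaluate peer reliability.

Result: By accurately identifying reliable peers, ECL enables Qwen 3-4B to outperform a history-agnostic Qwen 3-30B baseline 8x its size, and it boosts frontier models to near-perfect (100%) performance. The method generalizes well across multi-agent configurations, and trust modeling accuracy correlates strongly with final answer quality.

Insight: The core novelty is shifting the task from evaluating peer reasoning quality to estimating peer reliability from interaction history, and proposing the ECL framework to build and use peer profiles explicitly. Viewed objectively, using historical interactions as additional input to model dynamic trust, and optimizing with reinforcement learning, is an effective and reusable approach to improving collaborative robustness in multi-agent systems.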

Abstract: Individual agents in multi-agent (MA) systems often lack robustness, tending to blindly conform to misleading peers. We show this weakness stems from both sycophancy and inadequate ability to evaluate peer reliability. To address this, we first formalize the learning problem of history-aware reference, introducing the historical interactions of peers as additional input, so that agents can estimate peer reliability and learn from trustworthy peers when uncertain. This shifts the task from evaluating peer reasoning quality to estimating peer reliability based on interaction history. We then develop Epistemic Context Learning (ECL): a reasoning framework that conditions predictions on explicitly-built peer profiles from history. We further optimize ECL by reinforcement learning using auxiliary rewards. Our experiments reveal that ECL enables small models like Qwen 3-4B to outperform a history-agnostic baseline 8x its size (Qwen 3-30B) by accurately identifying reliable peers. ECL also boosts frontier models to near-perfect (100%) performance. We show that ECL generalizes well to various MA configurations and we find that trust is modeled well by LLMs, revealing a strong correlation between trust modeling accuracy and final answer quality.
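The idea of estimating peer reliability from interaction history, rather than judging each peer's reasoning, can be illustrated with a simple frequency-based stand-in. This is not the ECL method, which conditions an LLM on explicitly built peer profiles; the Laplace-smoothed score, the function names, and the sample history below are assumptions chosen only to make the idea concrete.

```python
def reliability(history):
    """Laplace-smoothed share of past answers each peer got right.

    `history` maps a peer name to a list of booleans, one per past
    interaction (True means the peer's answer was correct).
    """
    return {
        peer: (sum(outcomes) + 1) / (len(outcomes) + 2)
        for peer, outcomes in history.items()
    }


def most_trusted_answer(history, answers):
    """Return the current answer of the peer with the highest reliability.

    Peers with no recorded history get the neutral prior 0.5.
    """
    scores = reliability(history)
    best_peer = max(answers, key=lambda peer: scores.get(peer, 0.5))
    return answers[best_peer]


past = {"A": [True, True, False], "B": [False, False]}
current = {"A": "x", "B": "y"}
trust = reliability(past)
chosen = most_trusted_answer(past, current)
```

The point of the sketch is the shift in input: the decision depends on each peer's track record, not on how persuasive the current answer looks.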


[110] SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding cs.AI | cs.CVPDF

Ahmed Y. Radwan, Christos Emmanouilidis, Hina Tabassum, Deval Pandya, Shaina Raza

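TL;DR: This paper introduces SONIC-O1, a comprehensive benchmark for systematically evaluating the audio-video understanding of multimodal large language models (MLLMs) in real-world settings. It covers 13 conversational domains and contains 4,958 human-verified annotations with demographic metadata. Tasks include open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales. Experiments reveal a substantial gap on temporal localization and performance disparities across demographic groups.

Details

Motivation: Research on MLLMs has focused mainly on static image understanding, while their ability to process sequential audio-video data remains underexplored, so a high-quality benchmark is needed to systematically evaluate MLLM performance in real-world settings.

Result: Experiments on closed- and open-source models show that while the MCQ accuracy gap between the two model families is relatively small, there is a substantial 22.6% performance difference in temporal localization between the best closed-source and open-source models. Performance degrades further across demographic groups, indicating persistent disparities in model behavior.

Insight: The novelty is a comprehensive, human-verified benchmark for real-world audio-video understanding (SONIC-O1) that emphasizes temporal understanding and robustness across social groups, and that provides an open, reproducible evaluation suite and dataset for future research.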

Abstract: Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding. We release SONIC-O1 for reproducibility and research: Project page: https://vectorinstitute.github.io/sonic-o1/ Dataset: https://huggingface.co/datasets/vector-institute/sonic-o1 Github: https://github.com/vectorinstitute/sonic-o1 Leaderboard: https://huggingface.co/spaces/vector-institute/sonic-o1-leaderboard


[111] From Meta-Thought to Execution: Cognitively Aligned Post-Training for Generalizable and Reliable LLM Reasoning cs.AI | cs.CLPDF

Shaojie Wang, Liang Zhang

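TL;DR: This paper proposes a two-stage post-training framework inspired by human cognition to improve the generalization and reliability of LLM reasoning. It has two core components, Chain-of-Meta-Thought (CoMT) and Confidence-Calibrated Reinforcement Learning (CCRL), which correspond to acquiring abstract strategies and optimizing concrete execution, respectively.

Details

Motivation: Current LLM post-training (SFT followed by RL) optimizes complete reasoning trajectories as the basic unit. This is fundamentally misaligned with the two-stage human cognitive process of first acquiring abstract strategies and then adapting them to specific instances, and it limits the generalization of learned strategies.

Result: Experiments across four models and eight benchmarks show improvements of 2.19% in-distribution and 4.63% out-of-distribution over standard methods, while reducing training time by 65-70% and token consumption by 50%.

Insight: The core novelty is aligning post-training with the two-stage human cognitive process: CoMT focuses on learning abstract reasoning patterns decoupled from specific execution to improve generalization, and CCRL optimizes intermediate steps with confidence-aware rewards to prevent error propagation and improve execution reliability. This cognitively aligned design yields both better performance and substantially better training efficiency.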

Abstract: Current LLM post-training methods optimize complete reasoning trajectories through Supervised Fine-Tuning (SFT) followed by outcome-based Reinforcement Learning (RL). While effective, a closer examination reveals a fundamental gap: this approach does not align with how humans actually solve problems. Human cognition naturally decomposes problem-solving into two distinct stages: first acquiring abstract strategies (i.e., meta-knowledge) that generalize across problems, then adapting them to specific instances. In contrast, by treating complete trajectories as basic units, current methods are inherently problem-centric, entangling abstract strategies with problem-specific execution. To address this misalignment, we propose a cognitively-inspired framework that explicitly mirrors the two-stage human cognitive process. Specifically, Chain-of-Meta-Thought (CoMT) focuses supervised learning on abstract reasoning patterns without specific executions, enabling acquisition of generalizable strategies. Confidence-Calibrated Reinforcement Learning (CCRL) then optimizes task adaptation via confidence-aware rewards on intermediate steps, preventing overconfident errors from cascading and improving execution reliability. Experiments across four models and eight benchmarks show 2.19% and 4.63% improvements in-distribution and out-of-distribution respectively over standard methods, while reducing training time by 65-70% and token consumption by 50%, demonstrating that aligning post-training with human cognitive principles yields not only superior generalization but also enhanced training efficiency.


[112] ProRAG: Process-Supervised Reinforcement Learning for Retrieval-Augmented Generation cs.AI | cs.CL | cs.IRPDF

Zhao Wang, Ziliang Zhao, Zhicheng Dou

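TL;DR: This paper proposes ProRAG, a process-supervised reinforcement learning framework for Retrieval-Augmented Generation (RAG) that integrates fine-grained step-level supervision to improve model performance on complex reasoning tasks.

Details

Motivation: Traditional outcome-based RL suffers from reward sparsity and inefficient credit assignment on long-trajectory reasoning tasks, so models may reach correct answers through flawed logic or redundant retrieval steps, a failure termed “process hallucination”. Existing process-aware methods lack on-policy exploration and struggle to decouple step-level credit from global outcomes.

Result: Extensive experiments on five multi-hop reasoning benchmarks show that ProRAG achieves better overall performance than outcome-based and process-aware RL baselines, particularly on complex long-horizon tasks, validating the effectiveness of fine-grained process supervision.

Insight: The novelty is a four-stage framework: supervised policy warmup, construction of an MCTS-based Process Reward Model (PRM), PRM-guided reasoning refinement, and process-supervised reinforcement learning with a dual-granularity advantage mechanism. By aggregating step-level process rewards with global outcome signals, it provides precise feedback for every action.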

Abstract: Reinforcement learning (RL) has become a promising paradigm for optimizing Retrieval-Augmented Generation (RAG) in complex reasoning tasks. However, traditional outcome-based RL approaches often suffer from reward sparsity and inefficient credit assignment, as coarse-grained scalar rewards fail to identify specific erroneous steps within long-horizon trajectories. This ambiguity frequently leads to “process hallucinations”, where models reach correct answers through flawed logic or redundant retrieval steps. Although recent process-aware approaches attempt to mitigate this via static preference learning or heuristic reward shaping, they often lack the on-policy exploration capabilities required to decouple step-level credit from global outcomes. To address these challenges, we propose ProRAG, a process-supervised reinforcement learning framework designed to integrate learned step-level supervision into the online optimization loop. Our framework consists of four stages: (1) Supervised Policy Warmup to initialize the model with a structured reasoning format; (2) construction of an MCTS-based Process Reward Model (PRM) to quantify intermediate reasoning quality; (3) PRM-Guided Reasoning Refinement to align the policy with fine-grained process preferences; and (4) Process-Supervised Reinforcement Learning with a dual-granularity advantage mechanism. By aggregating step-level process rewards with global outcome signals, ProRAG provides precise feedback for every action. Extensive experiments on five multi-hop reasoning benchmarks demonstrate that ProRAG achieves superior overall performance compared to strong outcome-based and process-aware RL baselines, particularly on complex long-horizon tasks, validating the effectiveness of fine-grained process supervision. The code and model are available at https://github.com/lilinwz/ProRAG.


[113] JADE: Bridging the Strategic-Operational Gap in Dynamic Agentic RAG cs.AI | cs.CL | cs.IRPDF

Yiqun Chen, Erhan Zhang, Tianyi Hu, Shijie Wang, Zixuan Yang

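TL;DR: This paper proposes JADE, a framework that addresses the strategic-operational mismatch in dynamic agentic Retrieval-Augmented Generation (RAG) caused by optimizing the planner and executors separately. It models the multi-turn workflow as a cooperative multi-agent team on a single shared backbone and optimizes it end to end with outcome-based rewards, so that planning and execution co-adapt.

Details

Motivation: Existing dynamic agentic RAG paradigms are polarized. They either optimize modules jointly within fixed-graph architectures, which limits dynamism, or support dynamic planning while treating executors as frozen black-box tools. In the second case, high-level planning strategies fail to materialize because executors are not adapted, producing a strategic-operational mismatch.

Result: Empirical results show that joint optimization in JADE turns disjoint modules into a synergistic system, yielding significant performance improvements, and that dynamic workflow orchestration enables a flexible balance between efficiency and effectiveness.

Insight: The novelty is a unified joint-optimization framework that uses a shared backbone and reward-driven end-to-end learning to promote co-adaptation between planner and executors, bridging the strategic-operational gap and improving the overall effectiveness of dynamic agentic RAG systems.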

Abstract: The evolution of Retrieval-Augmented Generation (RAG) has shifted from static retrieval pipelines to dynamic, agentic workflows where a central planner orchestrates multi-turn reasoning. However, existing paradigms face a critical dichotomy: they either optimize modules jointly within rigid, fixed-graph architectures, or empower dynamic planning while treating executors as frozen, black-box tools. We identify that this decoupled optimization creates a “strategic-operational mismatch,” where sophisticated planning strategies fail to materialize due to unadapted local executors, often leading to negative performance gains despite increased system complexity. In this paper, we propose JADE (Joint Agentic Dynamic Execution), a unified framework for the joint optimization of planning and execution within dynamic, multi-turn workflows. By modeling the system as a cooperative multi-agent team unified under a single shared backbone, JADE enables end-to-end learning driven by outcome-based rewards. This approach facilitates co-adaptation: the planner learns to operate within the capability boundaries of the executors, while the executors evolve to align with high-level strategic intent. Empirical results demonstrate that JADE transforms disjoint modules into a synergistic system, yielding remarkable performance improvements via joint optimization and enabling a flexible balance between efficiency and effectiveness through dynamic workflow orchestration.


[114] Self-Compression of Chain-of-Thought via Multi-Agent Reinforcement Learning cs.AI | cs.CLPDF

Yiqun Chen, Jinyuan Feng, Wei Yang, Meizhi Zhong, Zhengliang Shi

TL;DR: 本文提出了一种基于多智能体强化学习的思维链自压缩框架(SCMA),旨在减少大型推理模型中的冗余推理步骤,从而降低推理开销。该框架通过分割智能体将推理过程分解为逻辑块,评分智能体评估每个块的重要性,并以此指导推理智能体在训练中优化推理链,最终在保持准确性的同时显著缩短响应长度。

Details

Motivation: 现有基于强化学习的方法通过简单结合长度惩罚和结果奖励来压缩推理链,但难以平衡简洁性与准确性,强制缩短可能损害关键推理逻辑。本文旨在解决这一局限性,选择性地惩罚冗余块,同时保留必要的推理逻辑。

Result: 在不同模型规模上的实验表明,SCMA能够将响应长度减少11.1%至39.0%,同时将准确性提升4.33%至10.02%,达到了在压缩推理链的同时提升性能的效果。消融研究和定性分析验证了多智能体强化学习框架的协同优化能产生更强大的大型推理模型。

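Insight: The novelty lies in a multi-agent RL framework that uses two specialized agents, one for segmentation and one for scoring, to selectively compress redundant reasoning chunks instead of applying a single global penalty, giving finer control over the length-accuracy trade-off. Viewed objectively, this modular division of labor and the importance-weighted penalty offer a new approach to efficient reasoning and may enable more efficient model deployment.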

Abstract: The inference overhead induced by redundant reasoning undermines the interactive experience and severely bottlenecks the deployment of Large Reasoning Models. Existing reinforcement learning (RL)-based solutions tackle this problem by coupling a length penalty with outcome-based rewards. This simplistic reward weighting struggles to reconcile brevity with accuracy, as enforcing brevity may compromise critical reasoning logic. In this work, we address this limitation by proposing a multi-agent RL framework that selectively penalizes redundant chunks, while preserving essential reasoning logic. Our framework, Self-Compression via MARL (SCMA), instantiates redundancy detection and evaluation through two specialized agents: a Segmentation Agent for decomposing the reasoning process into logical chunks, and a Scoring Agent for quantifying the significance of each chunk. The Segmentation and Scoring agents collaboratively define an importance-weighted length penalty during training, incentivizing a Reasoning Agent to prioritize essential logic without introducing inference overhead during deployment. Empirical evaluations across model scales demonstrate that SCMA reduces response length by 11.1% to 39.0% while boosting accuracy by 4.33% to 10.02%. Furthermore, ablation studies and qualitative analysis validate that the synergistic optimization within the MARL framework fosters emergent behaviors, yielding more powerful LRMs compared to vanilla RL paradigms.
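
To make the idea of an importance-weighted length penalty concrete, here is a minimal Python sketch. It is not the formula from the paper, which the abstract does not give: the function names, the assumption that importance scores lie in [0, 1], and the weight `lam` are all hypothetical choices made for illustration.

```python
from typing import List


def importance_weighted_length_penalty(chunk_lengths: List[int],
                                       importance: List[float]) -> float:
    """Fraction of tokens spent on low-importance chunks (hypothetical form).

    chunk_lengths: token count of each reasoning chunk (from a segmenter).
    importance:    score in [0, 1] per chunk (from a scorer); 1 = essential.
    """
    if len(chunk_lengths) != len(importance):
        raise ValueError("one importance score per chunk is required")
    total = sum(chunk_lengths)
    if total == 0:
        return 0.0
    # Each chunk contributes its length scaled by how redundant it is.
    redundant = sum(n * (1.0 - s) for n, s in zip(chunk_lengths, importance))
    return redundant / total


def shaped_reward(correct: bool,
                  chunk_lengths: List[int],
                  importance: List[float],
                  lam: float = 0.5) -> float:
    """Outcome reward minus the weighted penalty; lam is an assumed weight."""
    outcome = 1.0 if correct else 0.0
    return outcome - lam * importance_weighted_length_penalty(
        chunk_lengths, importance)
```

Under these assumptions, a correct answer with a 10-token essential chunk and a 30-token redundant chunk is penalized for the 30 redundant tokens only, whereas a plain length penalty would charge all 40 tokens equally.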


[115] Exploring Reasoning Reward Model for Agents cs.AI | cs.CL

Kaixuan Fan, Kaituo Feng, Manyuan Zhang, Tianshuo Peng, Zhixun Li

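TL;DR: This paper proposes the Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that provides structured feedback on agent trajectories, comprising an explicit reasoning trace, a focused critique, and an overall score. Building on these signals, the authors systematically study three integration strategies (Reagent-C, Reagent-R, and Reagent-U) and validate them with extensive evaluation on 12 diverse benchmarks.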

Details

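Motivation: Most current agentic reinforcement learning (Agentic RL) trains on sparse, outcome-based rewards. Such feedback cannot distinguish the quality of intermediate reasoning, which leads to suboptimal training results.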

Result: Extensive evaluation on 12 diverse benchmarks shows that the Reagent-U strategy yields substantial performance gains, reaching 43.7% on GAIA and 46.2% on WebWalkerQA.

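Insight: The novelty is a structured, multi-faceted reasoning reward model (Agent-RRM) that goes beyond conventional sparse outcome rewards to evaluate and guide an agent's reasoning process at a fine granularity. The paper also systematically explores different strategies for integrating this feedback into training, offering a new approach to process-based agent training.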

Abstract: Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still rely on sparse outcome-based rewards for training. Such feedback fails to differentiate intermediate reasoning quality, leading to suboptimal training results. In this paper, we introduce Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that produces structured feedback for agentic trajectories, including (1) an explicit reasoning trace, (2) a focused critique that provides refinement guidance by highlighting reasoning flaws, and (3) an overall score that evaluates process performance. Leveraging these signals, we systematically investigate three integration strategies: Reagent-C (text-augmented refinement), Reagent-R (reward-augmented guidance), and Reagent-U (unified feedback integration). Extensive evaluations across 12 diverse benchmarks demonstrate that Reagent-U yields substantial performance leaps, achieving 43.7% on GAIA and 46.2% on WebWalkerQA, validating the effectiveness of our reasoning reward model and training schemes. Code, models, and datasets are all released to facilitate future research.


cs.NI

[116] ViTMAlis: Towards Latency-Critical Mobile Video Analytics with Vision Transformers cs.NI | cs.CV | cs.MM

Miao Zhang, Guanzhen Wu, Hao Fang, Yifei Zhu, Fangxin Wang

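TL;DR: This paper proposes ViTMAlis, a dynamic device-to-edge offloading framework designed for dense prediction models based on vision transformers (ViTs). Through a mixed-resolution inference strategy, it trades off speed and accuracy flexibly at runtime, addressing the high ViT inference latency in latency-critical mobile video analytics.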

Details

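Motivation: As mobile video analytics applications move from CNNs to ViTs to exploit their global context modeling and generalization, deploying ViTs in latency-critical scenarios becomes a significant challenge. In dense prediction tasks in particular, high-resolution inputs exacerbate the quadratic computational complexity of ViTs, so inference latency becomes the primary bottleneck, rather than network transmission as in conventional CNN-based offloading.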

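Result: Experiments with a prototype implemented on commodity mobile and edge devices show that, compared with state-of-the-art accuracy-centric, content-aware, and latency-adaptive baselines, ViTMAlis significantly reduces end-to-end offloading latency while improving user-perceived rendering accuracy.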

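Insight: The novelty lies in a dynamic mixed-resolution inference strategy for ViT backbones and the ViT-native device-edge collaborative offloading framework built on it, which adapts dynamically to network conditions and video content and jointly reduces transmission and inference latency, providing a practical foundation for next-generation mobile intelligence.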

Abstract: Edge-assisted mobile video analytics (MVA) applications are increasingly shifting from using vision models based on convolutional neural networks (CNNs) to those built on vision transformers (ViTs) to leverage their superior global context modeling and generalization capabilities. However, deploying these advanced models in latency-critical MVA scenarios presents significant challenges. Unlike traditional CNN-based offloading paradigms where network transmission is the primary bottleneck, ViT-based systems are constrained by substantial inference delays, particularly for dense prediction tasks where the need for high-resolution inputs exacerbates the inherent quadratic computational complexity of ViTs. To address these challenges, we propose a dynamic mixed-resolution inference strategy tailored for ViT-backboned dense prediction models, enabling flexible runtime trade-offs between speed and accuracy. Building on this, we introduce ViTMAlis, a ViT-native device-to-edge offloading framework that dynamically adapts to network conditions and video content to jointly reduce transmission and inference delays. We implement a fully functional prototype of ViTMAlis on commodity mobile and edge devices. Extensive experiments demonstrate that, compared to state-of-the-art accuracy-centric, content-aware, and latency-adaptive baselines, ViTMAlis significantly reduces end-to-end offloading latency while improving user-perceived rendering accuracy, providing a practical foundation for next-generation mobile intelligence.
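
The quadratic cost mentioned in the abstract can be illustrated with a short back-of-the-envelope calculation in Python. This is a sketch under stated assumptions, not the ViTMAlis strategy itself: the patch size of 16, the region layout, and all function names are hypothetical, and the pair count is only a proxy for global self-attention cost.

```python
from typing import List, Tuple


def num_tokens(height: int, width: int, patch: int = 16) -> int:
    """Number of patch tokens for an image, assuming non-overlapping patches."""
    return (height // patch) * (width // patch)


def attention_pairs(height: int, width: int, patch: int = 16) -> int:
    """Token-pair count for global self-attention, which scales as N^2."""
    n = num_tokens(height, width, patch)
    return n * n


def mixed_resolution_tokens(regions: List[Tuple[int, int, int]],
                            patch: int = 16) -> int:
    """Total tokens when each region (height, width, downscale) is
    downsampled by its own factor before patching (hypothetical layout)."""
    return sum(num_tokens(h // d, w // d, patch) for h, w, d in regions)


# 1024x1024 input: 64 * 64 = 4096 tokens.
full = num_tokens(1024, 1024)
# Keeping one 1024x512 half at full resolution and downscaling the other
# half by 2 gives 2048 + 512 = 2560 tokens.
mixed = mixed_resolution_tokens([(1024, 512, 1), (1024, 512, 2)])
```

Halving the resolution of the whole frame cuts the token count by 4 and the attention pair count by 16, which is why lowering resolution only in selected regions can save substantial inference time.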


physics.flu-dyn

[117] Learning Transient Convective Heat Transfer with Geometry Aware World Models physics.flu-dyn | cs.CV

Onur T. Doganay, Alexander Klawonn, Martin Eigel, Hanno Gottschalk

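TL;DR: This paper proposes a geometry-aware world model architecture for learning transient physical processes, targeting convective heat transfer in 2D transient computational fluid dynamics. Based on the LongVideoGAN video generation architecture, the model simulates complex spatiotemporal dynamics by adding a twofold conditioning mechanism (global physical parameters and local geometric masks) and an architectural adaptation that supports arbitrary channel dimensions.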

Details

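Motivation: Conventional PDE simulations are computationally expensive and hard to use in real-time applications, while existing generative video architectures lack the specific control and data compatibility that physical simulation requires. A dedicated architecture for learning transient physics is therefore needed.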

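Result: On a 2D transient CFD convective heat transfer problem, the conditioned model reproduces the complex temporal dynamics and spatial correlations of the training data. Generalization is evaluated on unseen geometric configurations, showing potential for controlled simulation synthesis but also current limitations in spatial precision for out-of-distribution samples.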

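Insight: The novel elements are the twofold conditioning mechanism (combining global parameters with local geometric information) and support for arbitrary channel dimensions beyond the RGB constraint. Viewed objectively, the work combines generative modeling with the requirements of physical simulation and offers a more controllable and data-compatible architecture for surrogate modeling.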

Abstract: Partial differential equation (PDE) simulations are fundamental to engineering and physics but are often computationally prohibitive for real-time applications. While generative AI offers a promising avenue for surrogate modeling, standard video generation architectures lack the specific control and data compatibility required for physical simulations. This paper introduces a geometry aware world model architecture, derived from a video generation architecture (LongVideoGAN), designed to learn transient physics. We introduce two key architecture elements: (1) a twofold conditioning mechanism incorporating global physical parameters and local geometric masks, and (2) an architectural adaptation to support arbitrary channel dimensions, moving beyond standard RGB constraints. We evaluate this approach on a 2D transient computational fluid dynamics (CFD) problem involving convective heat transfer from buoyancy-driven flow coupled to a heat flow in a solid structure. We demonstrate that the conditioned model successfully reproduces complex temporal dynamics and spatial correlations of the training data. Furthermore, we assess the model’s generalization capabilities on unseen geometric configurations, highlighting both its potential for controlled simulation synthesis and current limitations in spatial precision for out-of-distribution samples.


cs.IR

[118] Thinking Broad, Acting Fast: Latent Reasoning Distillation from Multi-Perspective Chain-of-Thought for E-Commerce Relevance cs.IR | cs.AI | cs.CL

Baopu Qiu, Hao Chen, Yuanrong Wu, Changtong Zan, Chao Wei

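TL;DR: This paper proposes a new framework for relevance modeling in e-commerce search that addresses two problems of existing methods: reliance on single-perspective chain-of-thought (CoT) reasoning, and discarding the CoT structure at inference. The framework has two core parts. The teacher model uses multi-perspective CoT to generate diverse rationales and combines supervised fine-tuning with direct preference optimization to build a more robust reasoner. The student model, through a novel latent reasoning knowledge distillation method, is equipped with a lightweight inference-time latent reasoning extractor, so that it efficiently inherits the LLM's sophisticated reasoning capabilities.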

Details

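Motivation: Existing e-commerce relevance models based on LLMs and CoT have two key limitations. First, they rely on single-perspective CoT reasoning, which fails to capture the multifaceted nature of e-commerce relevance (for example user intent, attribute-level matching, and business-specific rules). Second, although CoT strengthens reasoning, its high inference latency makes knowledge distillation necessary for real-time deployment, yet current distillation methods discard the CoT rationale structure at inference, using it only as a transient auxiliary signal and wasting its reasoning utility.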

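Result: Offline experiments and online A/B tests on an e-commerce search advertising platform serving tens of millions of users daily show that the method delivers significant offline gains and clear benefits in both commercial performance and user experience.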

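Insight: The contributions are: 1. multi-perspective CoT, which captures the multi-dimensional nature of e-commerce relevance, combined with SFT and DPO to optimize the teacher model; 2. latent reasoning knowledge distillation, which lets the student model keep exploiting CoT reasoning semantics at inference time through a lightweight internal module (the latent reasoning extractor) rather than only during training, so that it better inherits complex reasoning capabilities while staying efficient and low-latency.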

Abstract: Effective relevance modeling is crucial for e-commerce search, as it aligns search results with user intent and enhances customer experience. Recent work has leveraged large language models (LLMs) to address the limitations of traditional relevance models, especially for long-tail and ambiguous queries. By incorporating Chain-of-Thought (CoT) reasoning, these approaches improve both accuracy and interpretability through multi-step reasoning. However, two key limitations remain: (1) most existing approaches rely on single-perspective CoT reasoning, which fails to capture the multifaceted nature of e-commerce relevance (e.g., user intent vs. attribute-level matching vs. business-specific rules); and (2) although CoT-enhanced LLMs offer rich reasoning capabilities, their high inference latency necessitates knowledge distillation for real-time deployment, yet current distillation methods discard the CoT rationale structure at inference, using it as a transient auxiliary signal and forfeiting its reasoning utility. To address these challenges, we propose a novel framework that better exploits CoT semantics throughout the optimization pipeline. Specifically, the teacher model leverages Multi-Perspective CoT (MPCoT) to generate diverse rationales and combines Supervised Fine-Tuning (SFT) with Direct Preference Optimization (DPO) to construct a more robust reasoner. For distillation, we introduce Latent Reasoning Knowledge Distillation (LRKD), which endows a student model with a lightweight inference-time latent reasoning extractor, allowing efficient and low-latency internalization of the LLM’s sophisticated reasoning capabilities. Evaluated in offline experiments and online A/B tests on an e-commerce search advertising platform serving tens of millions of users daily, our method delivers significant offline gains, showing clear benefits in both commercial performance and user experience.


[119] Influence Guided Sampling for Domain Adaptation of Text Retrievers cs.IR | cs.CL

Meet Doshi, Vishwajeet Kumar, Yulong Li, Jaydeep Sen

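TL;DR: This paper proposes Inf-DDS, a lightweight reinforcement learning driven sampling framework for domain-adaptation training of text retrieval models. The method adaptively reweights training datasets using influence-based reward signals, prioritizing datasets that maximize performance on a target development set, which significantly improves retrieval performance across a range of text retrieval tasks while substantially reducing GPU compute cost.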

Details

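Motivation: The paper addresses how to sample from diverse corpora and search tasks when training general-purpose open-domain dense retrieval systems. Conventional uniform sampling and expert-supervised approaches are not well optimized, and although the training data sampling strategy strongly affects embedding model performance, how to find the optimal strategy has not been adequately studied.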

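Result: Evaluated on a wide range of text retrieval tasks, the method clearly improves retrieval performance and domain adaptation over existing gradient-based sampling methods while being 1.5x to 4x cheaper in GPU compute. Specifically, it achieves an absolute NDCG@10 improvement of 5.03 when training a multilingual bge-m3 model and of 0.94 when training all-MiniLM-L6-v2, even when starting from expert-assigned weights.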

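Insight: The novelty is an RL-driven adaptive sampling framework that uses influence-guided reward signals to adjust dataset weights dynamically, giving lightweight and efficient training optimization. Viewed objectively, the method formalizes sampling as policy optimization, avoiding the limitations of fixed or heuristic sampling and offering a scalable data management approach for embedding model training.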

Abstract: General-purpose open-domain dense retrieval systems are usually trained with a large, eclectic mix of corpora and search tasks. How should these diverse corpora and tasks be sampled for training? Conventional approaches sample them uniformly, proportional to their instance population sizes, or depend on human-level expert supervision. It is well known that the training data sampling strategy can greatly impact model performance. However, how to find the optimal strategy has not been adequately studied in the context of embedding models. We propose Inf-DDS, a novel reinforcement learning driven sampling framework that adaptively reweighs training datasets guided by influence-based reward signals and is much more lightweight with respect to GPU consumption. Our technique iteratively refines the sampling policy, prioritizing datasets that maximize model performance on a target development set. We evaluate the efficacy of our sampling strategy on a wide range of text retrieval tasks, demonstrating strong improvements in retrieval performance and better adaptation compared to existing gradient-based sampling methods, while also being 1.5x to 4x cheaper in GPU compute. Our sampling strategy achieves a 5.03 absolute NDCG@10 improvement while training a multilingual bge-m3 model and an absolute NDCG@10 improvement of 0.94 while training all-MiniLM-L6-v2, even when starting from expert-assigned weights on a large pool of training datasets.
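
The idea of iteratively refining a dataset-sampling policy from reward signals can be sketched in a few lines of Python. This is a generic softmax policy with an expected policy-gradient update, not the Inf-DDS algorithm: how the influence-based rewards are computed is not described in the abstract, so they are treated here as given inputs, and all names and the learning rate are hypothetical.

```python
import math
import random
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: sampling probabilities per dataset."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def update_logits(logits: Sequence[float],
                  rewards: Sequence[float],
                  lr: float = 1.0) -> List[float]:
    """One gradient-ascent step on the expected reward sum_i p_i * r_i.

    The gradient with respect to logit i is p_i * (r_i - baseline),
    where baseline is the expected reward under the current policy.
    """
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    return [l + lr * p * (r - baseline)
            for l, p, r in zip(logits, probs, rewards)]


def sample_dataset(names: Sequence[str],
                   logits: Sequence[float],
                   rng: random.Random) -> str:
    """Draw the dataset that supplies the next training batch."""
    return rng.choices(list(names), weights=softmax(logits), k=1)[0]
```

Starting from uniform logits, a dataset whose reward exceeds the current average gains sampling probability after each update, while the others lose it, which mirrors the prioritization described in the abstract.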


cs.RO

[120] InspecSafe-V1: A Multimodal Benchmark for Safety Assessment in Industrial Inspection Scenarios cs.RO | cs.CV

Zeyi Liu, Shuang Liu, Jihai Min, Zhaoheng Zhang, Jun Cen

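TL;DR: InspecSafe-V1 is the first multimodal benchmark dataset for safety assessment in industrial inspection, aimed at the bottleneck of reliable perception and safety assessment for AI systems in complex, dynamic industrial sites. The dataset is collected from the routine operations of real inspection robots in real environments and covers five representative industrial scenarios: tunnels, power facilities, sintering equipment, oil and gas petrochemical plants, and coal conveyor trestles. It contains 5,013 inspection instances from 41 robots at 2,239 valid inspection sites, with pixel-level segmentation annotations, semantic scene descriptions, and safety level labels.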

Details

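Motivation: Most current public datasets are limited to simulated data sources or single-modality sensing, or lack fine-grained object-level annotations, which hinders robust scene understanding and multimodal safety reasoning in industrial foundation models.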

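Result: The paper releases the InspecSafe-V1 dataset, which pairs annotated visible-spectrum images with seven synchronized sensing modalities (infrared video, audio, depth point clouds, radar point clouds, gas measurements, temperature, and humidity), providing a benchmark for multimodal anomaly recognition, cross-modal fusion, and comprehensive safety assessment.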

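Insight: The novelty is the first multimodal industrial safety assessment benchmark built from real inspection robot operations. Its synchronized multimodal data, pixel-level annotations, and task-oriented safety labels provide a key resource for training and evaluating industrial foundation models.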

Abstract: With the rapid development of industrial intelligence and unmanned inspection, reliable perception and safety assessment for AI systems in complex and dynamic industrial sites have become a key bottleneck for deploying predictive maintenance and autonomous inspection. Most public datasets remain limited by simulated data sources, single-modality sensing, or the absence of fine-grained object-level annotations, which prevents robust scene understanding and multimodal safety reasoning for industrial foundation models. To address these limitations, InspecSafe-V1 is released as the first multimodal benchmark dataset for industrial inspection safety assessment that is collected from routine operations of real inspection robots in real-world environments. InspecSafe-V1 covers five representative industrial scenarios, including tunnels, power facilities, sintering equipment, oil and gas petrochemical plants, and coal conveyor trestles. The dataset is constructed from 41 wheeled and rail-mounted inspection robots operating at 2,239 valid inspection sites, yielding 5,013 inspection instances. For each instance, pixel-level segmentation annotations are provided for key objects in visible-spectrum images. In addition, a semantic scene description and a corresponding safety level label are provided according to practical inspection tasks. Seven further synchronized sensing modalities are provided, namely infrared video, audio, depth point clouds, radar point clouds, gas measurements, temperature, and humidity, to support multimodal anomaly recognition, cross-modal fusion, and comprehensive safety assessment in industrial environments.


[121] 4D-CAAL: 4D Radar-Camera Calibration and Auto-Labeling for Autonomous Driving cs.RO | cs.CV

Shanliang Yao, Zhuoxiao Li, Runwei Guan, Kebin Cao, Meng Xia

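TL;DR: This paper proposes 4D-CAAL, a unified framework for autonomous driving that addresses extrinsic calibration between 4D radar and camera as well as automatic labeling of radar point clouds. The method designs a dual-purpose calibration target that combines a checkerboard with a corner reflector, and develops a robust correspondence matching algorithm for accurate calibration. It then uses the calibrated sensor relationship to transfer camera-based segmentation annotations to radar point clouds through geometric projection and multi-feature optimization, achieving automatic labeling.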

Details

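Motivation: 4D radar is critical for autonomous driving, but fusing it effectively with cameras requires accurate extrinsic calibration, and developing radar perception algorithms requires large-scale annotated datasets. Existing calibration methods often use separate targets optimized for a single modality, which makes correspondences hard to establish, while manually labeling sparse radar data is labor-intensive and unreliable.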

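Result: Extensive experiments show that the method achieves high calibration accuracy while significantly reducing manual annotation effort, thereby accelerating the development of robust multi-modal perception systems for autonomous driving.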

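Insight: The novelty is a unified calibration and auto-labeling framework whose core is a new dual-purpose calibration target (a checkerboard on the front surface for camera detection and a corner reflector at the center of the back surface for radar detection) together with a robust correspondence matching algorithm. This provides an efficient, automated data labeling solution for multi-modal sensor fusion.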

Abstract: 4D radar has emerged as a critical sensor for autonomous driving, primarily due to its enhanced capabilities in elevation measurement and higher resolution compared to traditional 3D radar. Effective integration of 4D radar with cameras requires accurate extrinsic calibration, and the development of radar-based perception algorithms demands large-scale annotated datasets. However, existing calibration methods often employ separate targets optimized for either visual or radar modalities, complicating correspondence establishment. Furthermore, manually labeling sparse radar data is labor-intensive and unreliable. To address these challenges, we propose 4D-CAAL, a unified framework for 4D radar-camera calibration and auto-labeling. Our approach introduces a novel dual-purpose calibration target design, integrating a checkerboard pattern on the front surface for camera detection and a corner reflector at the center of the back surface for radar detection. We develop a robust correspondence matching algorithm that aligns the checkerboard center with the strongest radar reflection point, enabling accurate extrinsic calibration. Subsequently, we present an auto-labeling pipeline that leverages the calibrated sensor relationship to transfer annotations from camera-based segmentations to radar point clouds through geometric projection and multi-feature optimization. Extensive experiments demonstrate that our method achieves high calibration accuracy while significantly reducing manual annotation effort, thereby accelerating the development of robust multi-modal perception systems for autonomous driving.
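
The geometric projection step of label transfer can be illustrated with a minimal Python sketch using a standard pinhole camera model. It covers only projection and mask lookup; the multi-feature optimization mentioned in the abstract is not modeled. The function names, the zero-skew intrinsics, the nearest-pixel lookup, and the background label are all assumptions for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def project_point(point: Sequence[float], R: Matrix, t: Sequence[float],
                  K: Matrix) -> Optional[Tuple[float, float]]:
    """Project a 3D radar point into pixel coordinates (u, v).

    R (3x3) and t (3,) are the radar-to-camera extrinsics from calibration;
    K (3x3) holds the camera intrinsics (skew assumed to be zero).
    Returns None when the point lies behind the camera.
    """
    x, y, z = point
    # Camera-frame coordinates: p_cam = R @ p_radar + t
    pc = [R[i][0] * x + R[i][1] * y + R[i][2] * z + t[i] for i in range(3)]
    if pc[2] <= 0.0:
        return None
    u = K[0][0] * pc[0] / pc[2] + K[0][2]
    v = K[1][1] * pc[1] / pc[2] + K[1][2]
    return u, v


def transfer_labels(points: Sequence[Sequence[float]], R: Matrix,
                    t: Sequence[float], K: Matrix,
                    mask: Sequence[Sequence[int]],
                    background: int = 0) -> List[int]:
    """Give each radar point the class id of the mask pixel it lands on."""
    height, width = len(mask), len(mask[0])
    labels = []
    for p in points:
        uv = project_point(p, R, t, K)
        label = background
        if uv is not None:
            col, row = int(round(uv[0])), int(round(uv[1]))
            if 0 <= row < height and 0 <= col < width:
                label = mask[row][col]
        labels.append(label)
    return labels
```

Points that fall behind the camera or outside the image keep the background label in this sketch; a real pipeline would also need to handle occlusion and radar noise, which is presumably where additional optimization comes in.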


[122] DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation cs.RO | cs.CV

Haozhe Xie, Beichen Wen, Jiarui Zheng, Zhaoxi Chen, Fangzhou Hong

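TL;DR: This paper proposes DynamicVLA, a vision-language-action (VLA) framework for dynamic object manipulation. By integrating three key designs, a compact vision encoder, continuous inference, and latent-aware action streaming, the framework addresses the poor performance of existing VLA models in dynamic scenarios that require rapid perception, temporal anticipation, and continuous control.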

Details

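Motivation: Existing vision-language-action models generalize well in static manipulation but struggle in dynamic manipulation scenarios that require fast responses and adaptation to object motion, which remains an open challenge.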

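Result: Extensive evaluation on the authors' newly built Dynamic Object Manipulation benchmark shows marked improvements in response speed, perception, and generalization, positioning DynamicVLA as a unified dynamic manipulation framework across embodiments.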

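Insight: The main contributions are: 1) a compact VLA architecture with a convolutional vision encoder for fast multimodal inference; 2) a continuous inference mechanism that lowers latency and adapts to object motion; 3) latent-aware action streaming that bridges the gap between perception and execution. In addition, the authors build a large-scale synthetic and real-world dynamic manipulation dataset, filling a data gap in this area.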

Abstract: Manipulating dynamic objects remains an open challenge for Vision-Language-Action (VLA) models, which, despite strong generalization in static manipulation, struggle in dynamic scenarios requiring rapid perception, temporal anticipation, and continuous control. We present DynamicVLA, a framework for dynamic object manipulation that integrates temporal reasoning and closed-loop adaptation through three key designs: 1) a compact 0.4B VLA using a convolutional vision encoder for spatially efficient, structurally faithful encoding, enabling fast multimodal inference; 2) Continuous Inference, enabling overlapping reasoning and execution for lower latency and timely adaptation to object motion; and 3) Latent-aware Action Streaming, which bridges the perception-execution gap by enforcing temporally aligned action execution. To fill the missing foundation of dynamic manipulation data, we introduce the Dynamic Object Manipulation (DOM) benchmark, built from scratch with an auto data collection pipeline that efficiently gathers 200K synthetic episodes across 2.8K scenes and 206 objects, and enables fast collection of 2K real-world episodes without teleoperation. Extensive evaluations demonstrate remarkable improvements in response speed, perception, and generalization, positioning DynamicVLA as a unified framework for general dynamic object manipulation across embodiments.


cs.MA

[123] Learning to Communicate Across Modalities: Perceptual Heterogeneity in Multi-Agent Systems cs.MA | cs.AI | cs.CV | cs.LG

Naomi Pitzer, Daniela Mihai

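TL;DR: This paper studies emergent communication under perceptual heterogeneity in multi-agent systems. Using a heterogeneous multi-step binary communication game, it explores how agents with different modalities develop shared structured representations without perceptual alignment. Despite perceptual misalignment, multimodal systems still converge to class-consistent messages grounded in perceptual input; unimodal systems communicate more efficiently, while multimodal agents need more information exchange and show higher uncertainty. Bit perturbation experiments indicate that meaning is encoded in a distributional rather than compositional manner, and interoperability analyses show that systems trained in different perceptual worlds cannot communicate directly, though limited fine-tuning enables cross-system communication.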

Details

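Motivation: Most existing emergent communication research assumes homogeneous modalities or aligned representational spaces and overlooks the perceptual heterogeneity of real-world settings. This paper explores how agents develop communication when they differ in modality and lack perceptual grounding.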

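Result: In the heterogeneous multi-step binary communication game, unimodal systems communicate more efficiently, using fewer bits and achieving lower classification entropy, while multimodal agents require greater information exchange and exhibit higher uncertainty. Bit perturbation experiments provide strong evidence that meaning is encoded distributionally, and interoperability analyses show that systems trained in different perceptual worlds cannot communicate directly, though limited fine-tuning enables successful cross-system communication.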

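Insight: The novelty lies in positioning emergent communication as a framework for studying how agents adapt and transfer representations across heterogeneous modalities. The work reveals that meaning is encoded distributionally and that limited fine-tuning can enable cross-system communication, opening new directions for both theory and experimentation.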

Abstract: Emergent communication offers insight into how agents develop shared structured representations, yet most research assumes homogeneous modalities or aligned representational spaces, overlooking the perceptual heterogeneity of real-world settings. We study a heterogeneous multi-step binary communication game where agents differ in modality and lack perceptual grounding. Despite perceptual misalignment, multimodal systems converge to class-consistent messages grounded in perceptual input. Unimodal systems communicate more efficiently, using fewer bits and achieving lower classification entropy, while multimodal agents require greater information exchange and exhibit higher uncertainty. Bit perturbation experiments provide strong evidence that meaning is encoded in a distributional rather than compositional manner, as each bit’s contribution depends on its surrounding pattern. Finally, interoperability analyses show that systems trained in different perceptual worlds fail to directly communicate, but limited fine-tuning enables successful cross-system communication. This work positions emergent communication as a framework for studying how agents adapt and transfer representations across heterogeneous modalities, opening new directions for both theory and experimentation.
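
Two of the measurements named in the abstract, classification entropy and bit perturbation, can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the authors' experimental code: the receiver is a hypothetical stand-in that scores a message by the AND of its first two bits, chosen only to show how one bit's effect can depend on its surrounding pattern.

```python
import math
from typing import Callable, List, Sequence, Tuple

Message = Tuple[int, ...]


def entropy_bits(probs: Sequence[float]) -> float:
    """Shannon entropy (in bits) of a receiver's class distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def flip_bit(message: Message, index: int) -> Message:
    """Return a copy of the binary message with one bit inverted."""
    return message[:index] + (1 - message[index],) + message[index + 1:]


def bit_sensitivity(message: Message,
                    receiver: Callable[[Message], float]) -> List[float]:
    """Absolute change in the receiver's output when each bit is flipped."""
    base = receiver(message)
    return [abs(receiver(flip_bit(message, i)) - base)
            for i in range(len(message))]


def toy_receiver(message: Message) -> float:
    """Hypothetical receiver: responds only when bits 0 and 1 are both set."""
    return float(message[0] * message[1])
```

With this toy receiver, flipping bit 0 changes the output for the message (1, 1, 0) but not for (1, 0, 0), so the bit has no fixed meaning of its own; that context dependence is the kind of pattern the abstract cites as evidence against compositional encoding.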