Table of Contents
- cs.CL [Total: 24]
- cs.CV [Total: 46]
- cs.AI [Total: 2]
- cs.LG [Total: 6]
- cs.IR [Total: 1]
- cs.CR [Total: 2]
- cs.DB [Total: 1]
- cs.RO [Total: 3]
- physics.soc-ph [Total: 1]
- eess.IV [Total: 2]
- cs.SE [Total: 1]
cs.CL
[1] Latent Thoughts Tuning: Bridging Context and Reasoning with Fused Information in Latent Tokens cs.CL
Weihao Liu, Dehai Min, Lu Cheng
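TL;DR: This paper proposes Latent Thoughts Tuning (LT-Tuning), a framework that addresses the feature collapse and instability large language models suffer when reasoning in continuous latent space. It builds latent thoughts through a Context-Prediction-Fusion mechanism that combines contextual hidden states with predictive semantic guidance from the vocabulary embedding space, and it uses a progressive three-stage curriculum learning pipeline to switch dynamically between latent and explicit thinking modes, improving the robustness and accuracy of reasoning.
Details
Motivation: Explicit chain-of-thought (CoT) requires a model to express every intermediate step in text tokens, confining its thoughts to the discrete vocabulary space. Reasoning in continuous latent space allows more flexible computation, but existing methods often suffer from feature collapse and instability, caused either by the distribution mismatch that arises when hidden states are recurrently fed back as input embeddings or by alignment issues when relying on assistant models.
Result: Experiments show that the method outperforms existing latent reasoning baselines, mitigates feature collapse, and achieves robust reasoning accuracy. The abstract does not name specific benchmarks or report quantitative results.
Insight: The novelty is the Context-Prediction-Fusion mechanism, which combines contextual hidden states with predictive semantic guidance from the vocabulary embedding space to construct latent thoughts more stably, together with progressive curriculum learning that enables dynamic switching between thinking modes. This gives continuous latent-space reasoning a more reliable design and may improve generalization on complex tasks.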
Abstract: While explicit Chain-of-Thought (CoT) equips Large Language Models (LLMs) with strong reasoning capabilities, it requires models to verbalize every intermediate step in text tokens, constraining the model's thoughts to the discrete vocabulary space. Recently, reasoning in continuous latent space has emerged as a promising alternative, enabling more robust inference and flexible computation beyond discrete token constraints. However, current latent paradigms often suffer from feature collapse and instability, stemming from distribution mismatches when recurrently using hidden states as the input embeddings, or alignment issues when relying on assistant models. To address this, we propose Latent Thoughts Tuning (LT-Tuning), a framework that redefines how latent thoughts are constructed and deployed. Instead of relying solely on raw hidden states, our method introduces a Context-Prediction-Fusion mechanism that jointly leverages contextual hidden states and predictive semantic guidance from the vocabulary embedding space. Combined with a progressive three-stage curriculum learning pipeline, LT-Tuning also enables dynamically switching between latent and explicit thinking modes. Experiments demonstrate that our method outperforms existing latent reasoning baselines, effectively mitigating feature collapse and achieving robust reasoning accuracy.
[2] Learning to Evict from Key-Value Cache cs.CL | cs.LG
Luca Moschella, Laura Manduchi, Ozan Sener
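TL;DR: This paper proposes KV Policy (KVP), a framework that recasts key-value cache eviction during LLM inference as a reinforcement learning problem. Lightweight per-head RL agents learn to rank tokens by future utility, managing the cache adaptively without modifying the underlying LLM or adding inference overhead.
Details
Motivation: Existing KV cache eviction and compression methods rely on heuristics such as recency or past attention scores, which are only indirect proxies for a token's future utility and introduce computational overhead. The paper aims to manage the KV cache more efficiently, and so reduce memory demands, by learning a policy that directly predicts each token's future utility.
Result: On the long-context benchmark RULER and the multi-turn dialogue benchmark OASST2-4k, KVP significantly outperforms baselines. Zero-shot tests on standard downstream tasks (e.g., LongBench, BOOLQ, ARC) indicate that KVP generalizes well beyond its training distribution and to longer context lengths.
Insight: The contribution is to formalize KV cache eviction as a reinforcement learning task: lightweight agents trained on pre-computed generation traces learn a ranking policy based on future utility, giving adaptive, scalable cache management that avoids the limitations of heuristic methods.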
Abstract: The growing size of Large Language Models (LLMs) makes efficient inference challenging, primarily due to the memory demands of the autoregressive Key-Value (KV) cache. Existing eviction or compression methods reduce cost but rely on heuristics, such as recency or past attention scores, which serve only as indirect proxies for a token’s future utility and introduce computational overhead. We reframe KV cache eviction as a reinforcement learning (RL) problem: learning to rank tokens by their predicted usefulness for future decoding. To this end, we introduce KV Policy (KVP), a framework of lightweight per-head RL agents trained on pre-computed generation traces using only key and value vectors. Each agent learns a specialized eviction policy guided by future utility, which evaluates the quality of the ranking across all cache budgets, requiring no modifications to the underlying LLM or additional inference. Evaluated across two different model families on the long-context benchmark RULER and the multi-turn dialogue benchmark OASST2-4k, KVP significantly outperforms baselines. Furthermore, zero-shot tests on standard downstream tasks (e.g., LongBench, BOOLQ, ARC) indicate that KVP generalizes well beyond its training distribution and to longer context lengths. These results demonstrate that learning to predict future token utility is a powerful and scalable paradigm for adaptive KV cache management.
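The ranking view of eviction described in this abstract can be illustrated with a minimal sketch. It assumes a per-token utility score is already available for one attention head; the function name `evict_to_budget` and the example scores are hypothetical, and the paper's learned RL agents are not reproduced here.

```python
from typing import List


def evict_to_budget(scores: List[float], budget: int) -> List[int]:
    """Return the cache positions to keep, in their original sequence order.

    `scores[i]` is an assumed predicted future utility for token i
    (in KVP this would come from a learned per-head agent).
    """
    if budget >= len(scores):
        return list(range(len(scores)))
    # Rank positions by predicted utility, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the top `budget` positions and restore sequence order.
    return sorted(ranked[:budget])


scores = [0.9, 0.1, 0.4, 0.8, 0.05]
kept = evict_to_budget(scores, budget=3)  # positions 0, 2 and 3 survive
```

Because eviction is driven by a single ranking, the kept set for a smaller budget is always a subset of the kept set for a larger one, which is why one ranking can be evaluated across all cache budgets.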
[3] On Emergent Social World Models – Evidence for Functional Integration of Theory of Mind and Pragmatic Reasoning in Language Models cs.CL
Polina Tsvilodub, Jan-Felix Klumpp, Amir Mohammadpour, Jennifer Hu, Michael Franke
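TL;DR: Using behavioral evaluations and causal-mechanistic experiments, this paper investigates whether language models share computational mechanisms for general Theory of Mind (ToM) and language-specific pragmatic reasoning, as a test of whether they form "social world models". The findings suggest that language models may develop interconnected "social world models" rather than isolated competencies, providing empirical evidence on the emergence of social cognition in artificial systems.
Details
Motivation: The paper addresses the general question of whether language models have emergent "social world models", that is, whether they reuse representations of mental states across tasks (the functional integration hypothesis).
Result: On a localizer dataset larger than those used in prior work, stringent hypothesis-driven statistical testing across seven subcategories of ToM abilities offers suggestive evidence for the functional integration hypothesis.
Insight: Contributions include new ToM localizer data, methodological refinements to functional localization techniques, and empirical insight into how social cognition emerges in artificial systems, indicating that models may form integrated social representations rather than isolated abilities.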
Abstract: This paper investigates whether LMs recruit shared computational mechanisms for general Theory of Mind (ToM) and language-specific pragmatic reasoning in order to contribute to the general question of whether LMs may be said to have emergent “social world models”, i.e., representations of mental states that are repurposed across tasks (the functional integration hypothesis). Using behavioral evaluations and causal-mechanistic experiments via functional localization methods inspired by cognitive neuroscience, we analyze LMs’ performance across seven subcategories of ToM abilities (Beaudoin et al., 2020) on a substantially larger localizer dataset than used in prior like-minded work. Results from stringent hypothesis-driven statistical testing offer suggestive evidence for the functional integration hypothesis, indicating that LMs may develop interconnected “social world models” rather than isolated competencies. This work contributes novel ToM localizer data, methodological refinements to functional localization techniques, and empirical insights into the emergence of social cognition in artificial systems.
[4] Are More Tokens Rational? Inference-Time Scaling in Language Models as Adaptive Resource Rationality cs.CL | cs.AI | cs.LG
Zhimin Hu, Riya Roshan, Sashank Varma
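TL;DR: This paper studies whether language models that scale inference-time computation (for example, by generating more reasoning steps) spontaneously exhibit human-like resource-rational behavior, adapting their reasoning strategy to task complexity. Using a Variable Attribution Task that systematically manipulates complexity, it finds that both instruction-tuned models and Large Reasoning Models shift from brute-force search to analytic strategies as complexity grows, but the latter are more robust on complex logical functions.
Details
Motivation: To examine whether, without an explicit reward tied to computational cost, language models can spontaneously develop resource-rational behavior through inference-time scaling, that is, adaptively optimize performance according to task complexity.
Result: On the Variable Attribution Task, both model types change strategy as task complexity (the number of candidate variables and trials) increases. Instruction-tuned models degrade on XOR and XNOR functions, while Large Reasoning Models remain robust. This suggests that models can adjust their behavior to complexity and that resource rationality is an emergent property of inference-time scaling itself.
Insight: The novelty lies in connecting resource rationality to inference-time scaling in language models and testing its emergence through controlled experiments. This offers a new perspective on how models adaptively manage computation, and suggests that the training objective (reinforcement learning versus instruction tuning) affects the robustness of a model's strategy on complex tasks.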
Abstract: Human reasoning is shaped by resource rationality – optimizing performance under constraints. Recently, inference-time scaling has emerged as a powerful paradigm to improve the reasoning performance of Large Language Models by expanding test-time computation. Specifically, instruction-tuned (IT) models explicitly generate long reasoning steps during inference, whereas Large Reasoning Models (LRMs) are trained by reinforcement learning to discover reasoning paths that maximize accuracy. However, it remains unclear whether resource-rationality can emerge from such scaling without explicit reward related to computational costs. We introduce a Variable Attribution Task in which models infer which variables determine outcomes given candidate variables, input-output trials, and predefined logical functions. By varying the number of candidate variables and trials, we systematically manipulate task complexity. Both models exhibit a transition from brute-force to analytic strategies as complexity increases. IT models degrade on XOR and XNOR functions, whereas LRMs remain robust. These findings suggest that models can adjust their reasoning behavior in response to task complexity, even without explicit cost-based reward. It provides compelling evidence that resource rationality is an emergent property of inference-time scaling itself.
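To make the brute-force strategy mentioned in this abstract concrete, here is a small sketch of a variable attribution problem. It assumes, purely for illustration, that exactly two candidate variables determine the outcome through a known binary function; the names `consistent_pairs` and `xor`, and the trial data, are hypothetical and not taken from the paper.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple

Trial = Tuple[Dict[str, bool], bool]  # (variable assignment, observed outcome)


def xor(a: bool, b: bool) -> bool:
    return a != b


def consistent_pairs(
    variables: Sequence[str],
    trials: List[Trial],
    fn: Callable[[bool, bool], bool],
) -> List[Tuple[str, str]]:
    """Brute force: keep every variable pair whose fn output matches all trials."""
    return [
        (a, b)
        for a, b in combinations(variables, 2)
        if all(fn(inputs[a], inputs[b]) == outcome for inputs, outcome in trials)
    ]


trials: List[Trial] = [
    ({"A": True, "B": False, "C": False}, True),
    ({"A": True, "B": True, "C": False}, False),
]
answer = consistent_pairs(["A", "B", "C"], trials, xor)  # only ("A", "B") fits both trials
```

The number of pairs to check grows quadratically with the number of candidate variables, which is the kind of cost that makes an analytic strategy attractive as complexity increases.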
[5] Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs cs.CL | cs.AI | cs.LG
Keenan Pepper, Alex McKenzie, Florin Pop, Stijn Servaes, Martin Leitgab
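TL;DR: The paper proposes improving language models' self-interpretation by training lightweight adapters. With the language model's parameters frozen, adapters trained on interpretability artifacts yield reliable self-interpretation across tasks and model families.
Details
Motivation: Existing self-interpretation methods are unreliable because of hyperparameter sensitivity. The paper addresses this by training lightweight adapters, making a language model's descriptions of its own internal states more reliable.
Result: At 70B scale, trained adapters generate sparse autoencoder feature labels that outperform the training labels themselves on generation scoring (71% vs 63%), reach 94% recall@1 on topic identification (versus 1% for the untrained baseline), and decode implicit bridge entities in multi-hop reasoning. In addition, gains in self-interpretation outpace capability gains from 7B to 72B parameters.
Insight: The key finding is that a minimal adapter (for example, a scalar affine adapter with only d_model+1 parameters) is enough for reliable self-interpretation, with the learned bias vector accounting for 85% of the improvement. Lightweight adapters can therefore extract and explain a frozen model's internal knowledge, and self-interpretation improves with model scale.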
Abstract: Self-interpretation methods prompt language models to describe their own internal states, but remain unreliable due to hyperparameter sensitivity. We show that training lightweight adapters on interpretability artifacts, while keeping the LM entirely frozen, yields reliable self-interpretation across tasks and model families. A scalar affine adapter with just $d_\text{model}+1$ parameters suffices: trained adapters generate sparse autoencoder feature labels that outperform the training labels themselves (71% vs 63% generation scoring at 70B scale), identify topics with 94% recall@1 versus 1% for untrained baselines, and decode bridge entities in multi-hop reasoning that appear in neither prompt nor response, surfacing implicit reasoning without chain-of-thought. The learned bias vector alone accounts for 85% of improvement, and simpler adapters generalize better than more expressive alternatives. Controlling for model knowledge via prompted descriptions, we find self-interpretation gains outpace capability gains from 7B to 72B parameters. Our results demonstrate that self-interpretation improves with scale, without modifying the model being interpreted.
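The parameter count quoted in this abstract can be illustrated with a small sketch. It assumes the scalar affine adapter consists of one shared scalar scale plus a bias vector of length d_model, which is one natural reading of "d_model+1 parameters"; the class name and the initial values are hypothetical, and training is not shown.

```python
from typing import List


class ScalarAffineAdapter:
    """Maps an activation vector v to scale * v + bias (assumed form)."""

    def __init__(self, d_model: int) -> None:
        self.scale = 1.0                # 1 scalar parameter
        self.bias = [0.0] * d_model     # d_model bias parameters

    def num_parameters(self) -> int:
        return 1 + len(self.bias)

    def __call__(self, vector: List[float]) -> List[float]:
        if len(vector) != len(self.bias):
            raise ValueError("vector length must equal d_model")
        return [self.scale * v + b for v, b in zip(vector, self.bias)]


adapter = ScalarAffineAdapter(d_model=4)
adapter.scale = 2.0
adapter.bias = [1.0, 0.0, 0.0, 0.0]
mapped = adapter([1.0, 2.0, 3.0, 4.0])  # [3.0, 4.0, 6.0, 8.0]
```

Under this reading, the adapted vector would be placed in the frozen model's input so the model can describe it; only the scale and bias would be trained.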
[6] Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation cs.CL
Tianci Xue, Zeyi Liao, Tianneng Shi, Zilu Wang, Kai Zhang
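TL;DR: This paper proposes ACuRL (Autonomous Curriculum Reinforcement Learning), a framework that lets computer-use agents (CUAs) continually adapt to specific, dynamic digital environments without human-annotated data. The agent explores the environment to gather initial experience, a curriculum task generator iteratively synthesizes new training tasks, and an automatic evaluator, CUAJudge, provides reliable reward signals.
Details
Motivation: Real-world digital environments are highly diverse and dynamic, so agents frequently encounter unseen scenarios and distribution shifts and need continual learning to adapt to specific environments. The core challenge is obtaining high-quality, environment-grounded agent data without costly human annotation.
Result: Experiments show that the method enables both intra-environment and cross-environment continual learning, yielding 4-22% performance gains without catastrophic forgetting on existing environments. CUAJudge reaches 93% agreement with human judgments. Further analysis shows highly sparse parameter updates (e.g., 20% of parameters), which helps explain the effective and robust adaptation.
Insight: The main contributions are: 1) an autonomous curriculum reinforcement learning framework that achieves continual adaptation with zero human data by iteratively generating curriculum tasks; 2) CUAJudge, a robust automatic evaluator that supplies reliable reward signals for training; and 3) evidence that sparse parameter updates may play a role in effective, robust adaptation, offering a new perspective on continual learning.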
Abstract: Real-world digital environments are highly diverse and dynamic. These characteristics cause agents to frequently encounter unseen scenarios and distribution shifts, making continual learning in specific environments essential for computer-use agents (CUAs). However, a key challenge lies in obtaining high-quality and environment-grounded agent data without relying on costly human annotation. In this work, we introduce ACuRL, an Autonomous Curriculum Reinforcement Learning framework that continually adapts agents to specific environments with zero human data. The agent first explores target environments to acquire initial experiences. During subsequent iterative training, a curriculum task generator leverages these experiences together with feedback from the previous iteration to synthesize new tasks tailored for the agent’s current capabilities. To provide reliable reward signals, we introduce CUAJudge, a robust automatic evaluator for CUAs that achieves 93% agreement with human judgments. Empirically, our method effectively enables both intra-environment and cross-environment continual learning, yielding 4-22% performance gains without catastrophic forgetting on existing environments. Further analyses show highly sparse updates (e.g., 20% parameters), which helps explain the effective and robust adaptation. Our data and code are available at https://github.com/OSU-NLP-Group/ACuRL.
[7] When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents cs.CL
Virginie Mouilleron, Théo Lasnier, Djamé Seddah
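TL;DR: This paper introduces Multimodal Finance Eval, the first multimodal benchmark for French financial document understanding, with 1,204 expert-validated questions covering text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning. Six open-weight vision-language models with 8B to 124B parameters perform well on text and table tasks (85-90% accuracy) but poorly on chart interpretation (34-62%), and early errors in multi-turn dialogue drive overall accuracy down to roughly 50%.
Details
Motivation: The reliability of current vision-language models in specialized non-English domains, finance in particular, remains underexplored. Financial documents mix dense regulatory text, numerical tables, and visual charts, and extraction errors can have real-world consequences, so a dedicated evaluation benchmark is needed.
Result: On Multimodal Finance Eval, models reach 85-90% accuracy on text and table tasks but only 34-62% on chart interpretation. In multi-turn conversational reasoning, error propagation lowers accuracy to roughly 50% regardless of model size.
Insight: The contribution is the first French financial multimodal benchmark. It shows that current vision-language models are effective at structured extraction but remain brittle in interactive multi-step analysis and chart understanding, and it provides a challenging benchmark for model evaluation in high-stakes domains.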
Abstract: Vision-language models (VLMs) perform well on many document understanding tasks, yet their reliability in specialized, non-English domains remains underexplored. This gap is especially critical in finance, where documents mix dense regulatory text, numerical tables, and visual charts, and where extraction errors can have real-world consequences. We introduce Multimodal Finance Eval, the first multimodal benchmark for evaluating French financial document understanding. The dataset contains 1,204 expert-validated questions spanning text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning, drawn from real investment prospectuses, KIDs, and PRIIPs. We evaluate six open-weight VLMs (8B-124B parameters) using an LLM-as-judge protocol. While models achieve strong performance on text and table tasks (85-90% accuracy), they struggle with chart interpretation (34-62%). Most notably, multi-turn dialogue reveals a sharp failure mode: early mistakes propagate across turns, driving accuracy down to roughly 50% regardless of model size. These results show that current VLMs are effective for well-defined extraction tasks but remain brittle in interactive, multi-step financial analysis. Multimodal Finance Eval offers a challenging benchmark to measure and drive progress in this high-stakes setting.
[8] Neuro-Symbolic Synergy for Interactive World Modeling cs.CL
Hongyu Zhao, Siyu Zhou, Haolin Yang, Zengyi Qin, Tianyi Zhou
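TL;DR: This paper proposes Neuro-Symbolic Synergy (NeSyS), a framework that integrates the probabilistic semantic priors of large language models (LLMs) with executable symbolic rules to build interactive world models that are both expressive and robust, addressing the tendency of LLM world models to hallucinate and the limited semantic expressivity of symbolic ones.
Details
Motivation: The aim is to bridge the gap between LLMs used as world models, which are unreliable at following deterministic rules (especially in corner cases) and prone to hallucination, and symbolic world models, which are logically consistent but lack semantic expressivity.
Result: Extensive experiments on three interactive environments, ScienceWorld, Webshop, and Plancraft, show that NeSyS outperforms baselines in both prediction accuracy and data efficiency.
Insight: The novelty is an alternating training framework in which the symbolic world model directly constrains the LLM by modifying its output probability distribution, while the neural world model is fine-tuned only on trajectories not covered by symbolic rules, cutting training data by 50% without loss of accuracy.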
Abstract: Large language models (LLMs) exhibit strong general-purpose reasoning capabilities, yet they frequently hallucinate when used as world models (WMs), where strict compliance with deterministic transition rules–particularly in corner cases–is essential. In contrast, Symbolic WMs provide logical consistency but lack semantic expressivity. To bridge this gap, we propose Neuro-Symbolic Synergy (NeSyS), a framework that integrates the probabilistic semantic priors of LLMs with executable symbolic rules to achieve both expressivity and robustness. NeSyS alternates training between the two models using trajectories inadequately explained by the other. Unlike rule-based prompting, the symbolic WM directly constrains the LLM by modifying its output probability distribution. The neural WM is fine-tuned only on trajectories not covered by symbolic rules, reducing training data by 50% without loss of accuracy. Extensive experiments on three distinct interactive environments, i.e., ScienceWorld, Webshop, and Plancraft, demonstrate NeSyS’s consistent advantages over baselines in both WM prediction accuracy and data efficiency.
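One simple way a symbolic component can constrain a neural model "by modifying its output probability distribution" is to remove the probability mass of rule-violating outcomes and renormalize. The sketch below shows only that generic idea; it is an assumption about the mechanism, not the paper's implementation, and the function name, rule, and candidate states are hypothetical.

```python
from typing import Callable, Dict


def constrain_distribution(
    probs: Dict[str, float],
    is_allowed: Callable[[str], bool],
) -> Dict[str, float]:
    """Drop candidates rejected by the symbolic rule, then renormalize."""
    allowed = {state: p for state, p in probs.items() if is_allowed(state)}
    total = sum(allowed.values())
    if total == 0.0:
        # No candidate satisfies the rule: fall back to the unconstrained model.
        return dict(probs)
    return {state: p / total for state, p in allowed.items()}


# Hypothetical next-state distribution from a neural world model.
neural_probs = {"door opens": 0.5, "door stays locked": 0.3, "nothing happens": 0.2}
has_key = False
# Hypothetical rule: a locked door cannot open without the key.
constrained = constrain_distribution(
    neural_probs, lambda state: has_key or state != "door opens"
)
```

Here the impossible outcome receives zero probability while the relative preferences among the remaining outcomes, which come from the neural model, are preserved.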
[9] Canvas-of-Thought: Grounding Reasoning via Mutable Structured States cs.CL
Lingzhuang Sun, Yuxia Zhu, Ruitong Liu, Hao Liang, Zheng Sun
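TL;DR: This paper proposes Canvas-of-Thought (Canvas-CoT), which uses an HTML Canvas as an external reasoning substrate so that multimodal large language models can perform atomic, DOM-based CRUD operations. This supports in-place revision of structured state and a rendering-based critique loop, addressing the limits of linear chain-of-thought on complex tasks.
Details
Motivation: Existing chain-of-thought methods reason over linear, immutable text sequences. In high-dimensional domains such as geometry and SVG design they lack explicit visual guidance, which makes correcting local errors costly and limits reasoning precision.
Result: Extensive experiments on VCode, RBench-V, and MathVista show that Canvas-CoT significantly outperforms existing baselines.
Insight: The core innovation is turning the reasoning history from an immutable text stream into mutable, structured state (implemented with an HTML Canvas), together with a rendering-based critique loop that acts as a hard-constraint validator and provides explicit visual feedback, giving a more efficient and precise multimodal reasoning paradigm.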
Abstract: While Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), relying solely on linear text sequences remains a bottleneck for complex tasks. We observe that even when auxiliary visual elements are interleaved, they are often treated as static snapshots within a one-dimensional, unstructured reasoning chain. We argue that such approaches treat reasoning history as an immutable stream: correcting a local error necessitates either generating verbose downstream corrections or regenerating the entire context. This forces the model to implicitly maintain and track state updates, significantly increasing token consumption and cognitive load. This limitation is particularly acute in high-dimensional domains, such as geometry and SVG design, where the textual expression of CoT lacks explicit visual guidance, further constraining the model’s reasoning precision. To bridge this gap, we introduce Canvas-of-Thought (Canvas-CoT). By leveraging an HTML Canvas as an external reasoning substrate, Canvas-CoT empowers the model to perform atomic, DOM-based CRUD operations. This architecture enables in-place state revisions without disrupting the surrounding context, allowing the model to explicitly maintain the “ground truth”. Furthermore, we integrate a rendering-based critique loop that serves as a hard constraint validator, providing explicit visual feedback to resolve complex tasks that are difficult to articulate through text alone. Extensive experiments on VCode, RBench-V, and MathVista demonstrate that Canvas-CoT significantly outperforms existing baselines, establishing a new paradigm for context-efficient multimodal reasoning.
[10] When to Memorize and When to Stop: Gated Recurrent Memory for Long-Context Reasoning cs.CL | cs.AI
Leheng Sheng, Yongtao Zhang, Wenchang Ma, Yaorui Shi, Ting Huang
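TL;DR: This paper proposes GRU-Mem, which adds text-controlled update and exit gates to address the memory explosion and redundant computation of the existing MemAgent approach in long-context reasoning, yielding more stable and efficient inference.
Details
Motivation: To address the performance degradation of large language models in long-context reasoning, and in particular two key flaws of MemAgent: indiscriminate memory updates that cause memory to explode, and the lack of an exit mechanism in the loop, which leads to unnecessary computation.
Result: Experiments on a range of long-context reasoning tasks show that GRU-Mem generally outperforms vanilla MemAgent, with up to 400% inference speed acceleration.
Insight: The novelty is introducing gating (an update gate and an exit gate) into the recurrent update of textual memory and training the gating behavior with two reward signals within end-to-end reinforcement learning, giving precise, adaptive control over memory updates and computation.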
Abstract: While reasoning over long context is crucial for various real-world applications, it remains challenging for large language models (LLMs) as they suffer from performance degradation as the context length grows. Recent work MemAgent has tried to tackle this by processing context chunk-by-chunk in an RNN-like loop and updating a textual memory for final answering. However, this naive recurrent memory update faces two crucial drawbacks: (i) memory can quickly explode because it can update indiscriminately, even on evidence-free chunks; and (ii) the loop lacks an exit mechanism, leading to unnecessary computation even after sufficient evidence is collected. To address these issues, we propose GRU-Mem, which incorporates two text-controlled gates for more stable and efficient long-context reasoning. Specifically, in GRU-Mem, the memory only updates when the update gate is open and the recurrent loop will exit immediately once the exit gate is open. To endow the model with such capabilities, we introduce two reward signals $r^{\text{update}}$ and $r^{\text{exit}}$ within end-to-end RL, rewarding the correct updating and exiting behaviors respectively. Experiments on various long-context reasoning tasks demonstrate the effectiveness and efficiency of GRU-Mem, which generally outperforms the vanilla MemAgent with up to 400% inference speed acceleration.
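The control flow of the two gates described in this abstract can be sketched as a plain loop. This is an illustration only: the `step` callable stands in for the language model call, and its signature, `toy_step`, and the example chunks are all hypothetical.

```python
from typing import Callable, List, Tuple

# (question, memory, chunk) -> (update_gate_open, candidate_memory, exit_gate_open)
Step = Callable[[str, str, str], Tuple[bool, str, bool]]


def gated_recurrent_read(question: str, chunks: List[str], step: Step) -> Tuple[str, int]:
    """Read chunks in order; update memory only when the update gate is open,
    and stop as soon as the exit gate is open."""
    memory = ""
    chunks_read = 0
    for chunk in chunks:
        chunks_read += 1
        update_open, candidate, exit_open = step(question, memory, chunk)
        if update_open:
            memory = candidate
        if exit_open:
            break
    return memory, chunks_read


def toy_step(question: str, memory: str, chunk: str) -> Tuple[bool, str, bool]:
    """Keyword stand-in for the model: treat chunks mentioning 'code' as evidence."""
    has_evidence = "code" in chunk
    candidate = (memory + " " + chunk).strip() if has_evidence else memory
    return has_evidence, candidate, has_evidence


chunks = ["the weather is nice", "the code is 4821", "unrelated trailing text"]
memory, chunks_read = gated_recurrent_read("What is the code?", chunks, toy_step)
```

In this toy run the first chunk leaves memory untouched, the second updates it and opens the exit gate, and the third is never processed, which is where the saved computation comes from.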
[11] Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters cs.CL | cs.AI
Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao
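TL;DR: This paper introduces Step 3.5 Flash, a sparse Mixture-of-Experts model designed to deliver frontier-level agentic intelligence at efficient compute cost. It pairs a 196B-parameter foundation with 11B active parameters for efficient inference, and uses interleaved sliding-window/full attention and multi-token prediction to reduce the latency and cost of multi-round agentic interactions.
Details
Motivation: To address what matters most when building agents, sharp reasoning and fast, reliable execution, and so bridge computational efficiency and frontier-level intelligence.
Result: The model performs strongly across benchmarks: 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro.
Insight: The main contributions are: 1) a sparse MoE architecture that achieves large-model capability with few active parameters; 2) interleaved sliding-window/full attention and multi-token prediction to reduce inference latency; and 3) a scalable reinforcement learning framework that combines verifiable signals with preference feedback and supports stable self-improvement across mathematics, code, and tool use. Together these provide a high-density foundation for deploying sophisticated agents in real-world industrial environments.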
Abstract: We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency and cost of multi-round agentic interactions. To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use. Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro. By redefining the efficiency frontier, Step 3.5 Flash provides a high-density foundation for deploying sophisticated agents in real-world industrial environments.
[12] Online Causal Kalman Filtering for Stable and Effective Policy Optimization cs.CL | cs.AI
Shuo He, Lang Feng, Xin Cheng, Lei Feng, Bo An
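TL;DR: This paper proposes Online Causal Kalman Filtering for Policy Optimization (KPO) to address the instability that high-variance token-level importance sampling (IS) ratios cause in reinforcement learning for large language models. A Kalman filter updates, online, a latent IS-ratio state that evolves across tokens, preserving token-level, local structure-aware variation while smoothing noise spikes, which yields more stable and effective policy updates.
Details
Motivation: The motivation is the instability in policy optimization caused by high-variance token-level IS ratios in reinforcement learning for large language models. Existing methods, which either use a fixed sequence-level ratio or adjust each token's ratio separately, neglect the temporal off-policy deviation across tokens in a sequence, which can distort policy-gradient updates across adjacent tokens and lead to training collapse.
Result: On challenging math reasoning datasets, KPO achieves better results than state-of-the-art counterparts.
Insight: The novelty is modeling token-level off-policy deviation as a latent state that evolves across tokens and applying an online causal Kalman filter to update it autoregressively, smoothing noise while preserving local structural variation. In effect, the method brings temporal filtering to token-level importance sampling adjustment as a structure-aware variance reduction technique.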
Abstract: Reinforcement learning for large language models suffers from high-variance token-level importance sampling (IS) ratios, which can destabilize policy optimization at scale. To improve stability, recent methods typically use a fixed sequence-level IS ratio for all tokens in a sequence or adjust each token’s IS ratio separately, thereby neglecting temporal off-policy deviation across tokens in a sequence. In this paper, we first empirically identify that local off-policy deviation is structurally inconsistent at the token level, which may distort policy-gradient updates across adjacent tokens and lead to training collapse. To address the issue, we propose Online Causal Kalman Filtering for stable and effective Policy Optimization (KPO). Concretely, we model the desired IS ratio as a latent state that evolves across tokens and apply a Kalman filter to update this state online and autoregressively based on the states of past tokens, regardless of future tokens. The resulting filtered IS ratios preserve token-wise local structure-aware variation while strongly smoothing noise spikes, yielding more stable and effective policy updates. Experimentally, KPO achieves superior results on challenging math reasoning datasets compared with state-of-the-art counterparts.
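The filtering idea in this abstract can be illustrated with a textbook one-dimensional Kalman filter run causally over a sequence of per-token log IS ratios. This is a sketch under assumptions that are not from the paper: a random-walk state model, filtering in log space, a prior centered on the on-policy value, and the hypothetical parameters `process_var` and `obs_var`.

```python
from typing import List


def kalman_filter_log_ratios(
    log_ratios: List[float],
    process_var: float = 1e-2,   # assumed drift of the latent state per token
    obs_var: float = 1.0,        # assumed noise of each observed log ratio
) -> List[float]:
    """Causal scalar Kalman filter: output t depends only on tokens 0..t."""
    mean, var = 0.0, 1.0         # prior: log ratio 0, i.e. IS ratio 1 (on-policy)
    filtered = []
    for z in log_ratios:
        # Predict: a random walk keeps the mean and inflates the variance.
        var += process_var
        # Update: blend the prediction with the new observation.
        gain = var / (var + obs_var)
        mean += gain * (z - mean)
        var *= 1.0 - gain
        filtered.append(mean)
    return filtered


noisy = [0.0, 0.0, 3.0, 0.0, 0.0]            # one spike at the third token
smoothed = kalman_filter_log_ratios(noisy)    # the spike is strongly damped
```

Exponentiating the filtered values would give smoothed IS ratios. A larger `obs_var` relative to `process_var` smooths more aggressively, at the cost of tracking genuine token-level changes more slowly.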
[13] Macaron: Controlled, Human-Written Benchmark for Multilingual and Multicultural Reasoning via Template-Filling cs.CL
Alaa Elsetohy, Sama Hadhoud, Haryo Akbarianto Wibowo, Chenxi Whitehouse, Genta Indra Winata
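TL;DR: Macaron is a multilingual, multicultural reasoning benchmark built by template filling, designed to address the limited control over cultural context and reasoning type in existing benchmarks. Using 100 language-agnostic templates covering 7 reasoning types and 22 cultural aspects, native annotators created 11,862 multiple-choice and True/False instances spanning 20 countries/cultural contexts, 10 scripts, and 20 languages. In zero-shot evaluation of 21 multilingual LLMs, reasoning-mode models perform best and reach near-parity between English and local languages, while open-weight models degrade substantially in local languages.
Details
Motivation: To address the limited control over cultural context and reasoning type in multilingual benchmarks: translated datasets keep English-centric scenarios, while culture-first datasets lack control over the reasoning type.
Result: In zero-shot evaluation of 21 multilingual LLMs, reasoning-mode models achieve the strongest performance and near-parity between English and local languages, while open-weight models degrade substantially in local languages and often approach chance on True/False tasks. Culture-grounded mathematical and counting templates are the hardest.
Insight: The novelty is a template-first approach that decouples reasoning type from cultural aspect, enabling controlled benchmark construction. Language-agnostic templates and native annotators support cultural authenticity, and the results expose the limits of current multilingual LLMs in cultural reasoning, notably the marked degradation of open-weight models in local languages, including low-resource ones.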
Abstract: Multilingual benchmarks rarely test reasoning over culturally grounded premises: translated datasets keep English-centric scenarios, while culture-first datasets often lack control over the reasoning required. We propose Macaron, a template-first benchmark that factorizes reasoning type and cultural aspect across question languages. Using 100 language-agnostic templates that cover 7 reasoning types and 22 cultural aspects, native annotators create scenario-aligned English and local-language multiple-choice questions and systematically derived True/False questions. Macaron contains 11,862 instances spanning 20 countries/cultural contexts, 10 scripts, and 20 languages (including low-resource ones like Amharic, Yoruba, Zulu, Kyrgyz, and some Arabic dialects). In zero-shot evaluation of 21 multilingual LLMs, reasoning-mode models achieve the strongest performance and near-parity between English and local languages, while open-weight models degrade substantially in local languages and often approach chance on T/F tasks. Culture-grounded mathematical and counting templates are consistently the hardest. The data can be accessed at https://huggingface.co/datasets/AlaaAhmed2444/Macaron.
[14] Reinforced Curriculum Pre-Alignment for Domain-Adaptive VLMs cs.CL
Yuming Yan, Shuo Yang, Kai Tang, Sihong Chen, Yang Zhang
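TL;DR: This paper proposes Reinforced Curriculum Pre-Alignment (RCPA), a post-training paradigm that addresses the catastrophic forgetting and optimization collapse vision-language models (VLMs) face in domain adaptation. Its curriculum-aware progressive modulation mechanism applies partial output constraints early in training to introduce new domain knowledge safely, then gradually transitions to full generation optimization, acquiring domain knowledge while preserving general multimodal capabilities.
Details
Motivation: VLMs underperform in specialized domains such as medical imaging, while supervised fine-tuning causes catastrophic forgetting and harms general capabilities. Continual pretraining is too computationally expensive for VLMs, so efficient post-training adaptation methods are needed. Existing RL-based methods are prone to optimization collapse when the model initially lacks domain knowledge, which calls for a method that balances domain knowledge acquisition with the preservation of general capabilities.
Result: Extensive experiments across specialized domains and general benchmarks validate the effectiveness of RCPA, offering a practical path toward high-performing, domain-adaptive VLMs.
Insight: The core innovation is the curriculum-aware progressive modulation mechanism, which stages domain adaptation: concepts are first introduced safely under partial constraints, then training moves gradually to full optimization. This offers a way to address optimization collapse caused by missing domain knowledge, and the staged, curriculum-style training strategy may be useful more broadly.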
Abstract: Vision-Language Models (VLMs) demonstrate remarkable general-purpose capabilities but often fall short in specialized domains such as medical imaging or geometric problem-solving. Supervised Fine-Tuning (SFT) can enhance performance within a target domain, but it typically causes catastrophic forgetting, limiting its generalization. The central challenge, therefore, is to adapt VLMs to new domains while preserving their general-purpose capabilities. Continual pretraining is effective for expanding knowledge in Large Language Models (LLMs), but it is less feasible for VLMs due to prohibitive computational costs and the unavailability of pretraining data for most open-source models. This necessitates efficient post-training adaptation methods. Reinforcement learning (RL)-based approaches such as Group Relative Policy Optimization (GRPO) have shown promise in preserving general abilities, yet they often fail in domain adaptation scenarios where the model initially lacks sufficient domain knowledge, leading to optimization collapse. To bridge this gap, we propose Reinforced Curriculum Pre-Alignment (RCPA), a novel post-training paradigm that introduces a curriculum-aware progressive modulation mechanism. In the early phase, RCPA applies partial output constraints to safely expose the model to new domain concepts. As the model’s domain familiarity increases, training gradually transitions to full generation optimization, refining responses and aligning them with domain-specific preferences. This staged adaptation balances domain knowledge acquisition with the preservation of general multimodal capabilities. Extensive experiments across specialized domains and general benchmarks validate the effectiveness of RCPA, establishing a practical pathway toward building high-performing and domain-adaptive VLMs.
[15] Beyond Confidence: The Rhythms of Reasoning in Generative Models cs.CL | cs.AI
Deyuan Liu, Zecheng Wang, Zhanyue Qin, Zhiying Tu, Dianhui Chu
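TL;DR: This paper proposes the Token Constraint Bound ($δ_{\mathrm{TCB}}$), a metric that quantifies the maximum internal state perturbation a large language model (LLM) can withstand before its dominant next-token prediction changes significantly. Intrinsically linked to the geometry of the output embedding space, the metric assesses the local robustness of LLM predictions, complementing conventional metrics such as accuracy and perplexity.
Details
Motivation: LLMs are highly capable but very sensitive to slight changes in the input context, which undermines their reliability. Conventional metrics such as accuracy and perplexity cannot assess the local robustness of a prediction, because normalized output probabilities can obscure how resilient the model's internal state actually is to perturbation.
Result: Experiments show that $δ_{\mathrm{TCB}}$ correlates with effective prompt engineering and uncovers critical prediction instabilities that perplexity misses during in-context learning and text generation.
Insight: The core contribution is $δ_{\mathrm{TCB}}$, a principled metric that approaches prediction stability from the model's internal state and offers a complementary way to analyze, and potentially improve, the contextual stability of LLM predictions. Its key insight is to tie prediction robustness to the geometry of the output embedding space rather than relying on output probabilities alone.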
Abstract: Large Language Models (LLMs) exhibit impressive capabilities yet suffer from sensitivity to slight input context variations, hampering reliability. Conventional metrics like accuracy and perplexity fail to assess local prediction robustness, as normalized output probabilities can obscure the underlying resilience of an LLM’s internal state to perturbations. We introduce the Token Constraint Bound ($δ_{\mathrm{TCB}}$), a novel metric that quantifies the maximum internal state perturbation an LLM can withstand before its dominant next-token prediction significantly changes. Intrinsically linked to output embedding space geometry, $δ_{\mathrm{TCB}}$ provides insights into the stability of the model’s internal predictive commitment. Our experiments show $δ_{\mathrm{TCB}}$ correlates with effective prompt engineering and uncovers critical prediction instabilities missed by perplexity during in-context learning and text generation. $δ_{\mathrm{TCB}}$ offers a principled, complementary approach to analyze and potentially improve the contextual stability of LLM predictions.
[16] The CLEF-2026 FinMMEval Lab: Multilingual and Multimodal Evaluation of Financial AI Systems cs.CL | cs.AI | cs.CE
Zhuohan Xie, Rania Elbadry, Fan Zhang, Georgi Georgiev, Xueqing Peng
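TL;DR: This paper presents the FinMMEval Lab at CLEF 2026, which introduces the first multilingual and multimodal evaluation framework for financial large language models. The framework comprises three interconnected tasks, Financial Exam Question Answering, Multilingual Financial Question Answering, and Financial Decision Making, and aims to evaluate models' reasoning, generalization, and decision-making across languages and modalities.
Details
Motivation: Current benchmarks in financial natural language processing are mostly monolingual, text-only, and limited to narrow subtasks, with no comprehensive multilingual, multimodal evaluation suite. FinMMEval aims to fill this gap and to promote more robust, transparent, and globally inclusive financial AI systems.
Result: The paper mainly describes the setup of the evaluation framework and the task design. The abstract reports no quantitative results or benchmark rankings.
Insight: The contribution is a comprehensive financial AI evaluation framework that is both multilingual and multimodal, measuring model ability through three complementary tasks (understanding, reasoning, decision-making) and releasing datasets publicly to support reproducible research, which may help steer financial AI toward more general and practical systems.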
Abstract: We present the setup and the tasks of the FinMMEval Lab at CLEF 2026, which introduces the first multilingual and multimodal evaluation framework for financial Large Language Models (LLMs). While recent advances in financial natural language processing have enabled automated analysis of market reports, regulatory documents, and investor communications, existing benchmarks remain largely monolingual, text-only, and limited to narrow subtasks. FinMMEval 2026 addresses this gap by offering three interconnected tasks that span financial understanding, reasoning, and decision-making: Financial Exam Question Answering, Multilingual Financial Question Answering (PolyFiQA), and Financial Decision Making. Together, these tasks provide a comprehensive evaluation suite that measures models’ ability to reason, generalize, and act across diverse languages and modalities. The lab aims to promote the development of robust, transparent, and globally inclusive financial AI systems, with datasets and evaluation resources publicly released to support reproducible research.
[17] Search or Accelerate: Confidence-Switched Position Beam Search for Diffusion Language Models cs.CL | cs.AI
Mingyu Cao, Alvaro Correia, Christos Louizos, Shiwei Liu, Lu Yin
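TL;DR: This paper proposes SOAR (Search or Accelerate), a training-free decoding algorithm for diffusion language models that switches dynamically between search and acceleration according to model confidence to improve text generation.
Details
Motivation: Standard greedy decoding in diffusion language models can commit to an unmasking order too early and get stuck in a suboptimal solution, especially on reasoning-heavy tasks. An adaptive decoding strategy is needed that balances generation quality and efficiency.
Result: On Dream-7B and LLaDA-8B, across mathematical reasoning (GSM8K) and code generation (MBPP, HumanEval) benchmarks, SOAR improves generation quality while maintaining competitive inference speed.
Insight: The novelty is a confidence-based dynamic decoding strategy: when confidence is low, the search is widened to avoid premature commitments; when confidence is high, many positions are decoded in parallel to reduce the number of denoising iterations, balancing quality and efficiency.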
Abstract: Diffusion Language Models (DLMs) generate text by iteratively denoising a masked sequence, repeatedly deciding which positions to commit at each step. Standard decoding follows a greedy rule: unmask the most confident positions, yet this local choice can lock the model into a suboptimal unmasking order, especially on reasoning-heavy prompts. We present SOAR, a training-free decoding algorithm that adapts its behavior to the model’s uncertainty. When confidence is low, SOAR briefly widens the search over alternative unmasking decisions to avoid premature commitments; when confidence is high, it collapses the search and decodes many positions in parallel to reduce the number of denoising iterations. Across mathematical reasoning and code generation benchmarks (GSM8K, MBPP, HumanEval) on Dream-7B and LLaDA-8B, SOAR improves generation quality while maintaining competitive inference speed, offering a practical way to balance quality and efficiency in DLM decoding.
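The confidence switch described in the SOAR abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `soar_step`, the threshold `tau`, the `beam_width` parameter, and the input format (a map from masked position to top-token probability) are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def soar_step(confidences: Dict[int, float], tau: float = 0.9,
              beam_width: int = 2) -> Tuple[str, List[int]]:
    """Pick the decoding mode for one denoising step (illustrative only).

    `confidences` maps each still-masked position to the model's top-token
    probability at that position (hypothetical input format).
    """
    if not confidences:
        return "done", []
    # Rank masked positions from most to least confident.
    ranked = sorted(confidences, key=lambda pos: confidences[pos], reverse=True)
    confident = [pos for pos in ranked if confidences[pos] >= tau]
    if confident:
        # High confidence: commit every confident position in parallel,
        # which reduces the number of denoising iterations.
        return "accelerate", confident
    # Low confidence: keep several alternative positions as beam candidates
    # instead of committing greedily to the single best one.
    return "search", ranked[:beam_width]
```

In a full decoder, the "search" branch would expand one beam hypothesis per returned position and score them before committing; that scoring step is omitted here because the abstract does not specify it.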
[18] LoRA-Squeeze: Simple and Effective Post-Tuning and In-Tuning Compression of LoRA Modules cs.CL | cs.AI
Ivan Vulić, Adam Grycner, Quentin de Laroussilhe, Jonas Pfeiffer
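TL;DR: This paper proposes LoRA-Squeeze, which improves standard LoRA learning by changing the rank of LoRA modules either post hoc or dynamically during training. It argues for first learning an expressive higher-rank solution and then compressing it, rather than learning a rank-constrained solution directly. The method fine-tunes with a high source rank, reconstructs the weight update matrix, and compresses it to a lower target rank with randomized singular value decomposition.
Details
Motivation: Standard LoRA makes it hard to pre-select the rank and rank-specific hyperparameters, and deploying modules with heterogeneous ranks is complex. The work aims to simplify LoRA learning and make it more efficient.
Result: Experiments on 13 text tasks and 10 vision-language tasks show that post-hoc compression often yields lower-rank adapters that outperform those trained directly at the target rank, especially when a small number of fine-tuning steps at the target rank is allowed. The in-tuning rank-annealing variant achieves the best trade-off between LoRA size and performance.
Insight: The novelty is the learn-at-high-rank-then-compress paradigm together with a dynamic rank adjustment mechanism; both are useful references for rank selection and compression in parameter-efficient fine-tuning.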
Abstract: Despite its huge number of variants, standard Low-Rank Adaptation (LoRA) is still a dominant technique for parameter-efficient fine-tuning (PEFT). Nonetheless, it faces persistent challenges, including the pre-selection of an optimal rank and rank-specific hyper-parameters, as well as the deployment complexity of heterogeneous-rank modules and more sophisticated LoRA derivatives. In this work, we introduce LoRA-Squeeze, a simple and efficient methodology that aims to improve standard LoRA learning by changing LoRA module ranks either post-hoc or dynamically during training. Our approach posits that it is better to first learn an expressive, higher-rank solution and then compress it, rather than learning a constrained, low-rank solution directly. The method involves fine-tuning with a deliberately high(er) source rank, reconstructing or efficiently approximating the reconstruction of the full weight update matrix, and then using Randomized Singular Value Decomposition (RSVD) to create a new, compressed LoRA module at a lower target rank. Extensive experiments across 13 text and 10 vision-language tasks show that post-hoc compression often produces lower-rank adapters that outperform those trained directly at the target rank, especially if a small number of fine-tuning steps at the target rank is allowed. Moreover, a gradual, in-tuning rank annealing variant of LoRA-Squeeze consistently achieves the best LoRA size-performance trade-off.
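The post-hoc compression step in the LoRA-Squeeze abstract can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's code: it uses a basic randomized range finder followed by a small exact SVD, and the function name `squeeze_lora`, the `oversample` parameter, and the convention that the update is `B @ A` are all hypothetical choices.

```python
import numpy as np


def squeeze_lora(B: np.ndarray, A: np.ndarray, target_rank: int,
                 oversample: int = 8, seed: int = 0):
    """Compress a LoRA update B @ A to `target_rank` with a randomized SVD.

    B has shape (d_out, r_src) and A has shape (r_src, d_in). The full
    d_out x d_in update is never materialized (illustrative sketch).
    """
    rng = np.random.default_rng(seed)
    sketch_size = target_rank + oversample
    # Random test matrix used to probe the range of the update.
    omega = rng.standard_normal((A.shape[1], sketch_size))
    Y = B @ (A @ omega)                  # (d_out, sketch_size)
    Q, _ = np.linalg.qr(Y)               # orthonormal basis for the sketch
    small = (Q.T @ B) @ A                # projected update, small first dim
    U_hat, S, Vt = np.linalg.svd(small, full_matrices=False)
    U = Q @ U_hat
    # Fold the singular values into the new B factor.
    B_new = U[:, :target_rank] * S[:target_rank]
    A_new = Vt[:target_rank]
    return B_new, A_new
```

The returned factors have shapes `(d_out, target_rank)` and `(target_rank, d_in)`, so they can replace the original pair directly; the optional fine-tuning steps at the target rank mentioned in the abstract would follow this call.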
[19] Conversational Behavior Modeling Foundation Model With Multi-Level Perception cs.CL | cs.AI
Dingkun Zhou, Shuchang Pan, Jiachen Lian, Siddharth Banerjee, Sarika Pasumarthy
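TL;DR: This paper proposes a foundation model for conversational behavior modeling based on multi-level perception, reasoning over conversational behaviors with a Graph-of-Thoughts (GoT) framework. It formalizes the intent-to-action pathway as a hierarchical labeling scheme, predicting high-level communicative intents and low-level speech acts and learning their causal and temporal dependencies.
Details
Motivation: Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts. Capturing this perceptual pathway is key to building natural full-duplex interactive systems, and the work targets the challenges of behavior modeling and interpretable reasoning in dialogue systems.
Result: Experiments on synthetic and real full-duplex dialogues show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and lays a foundation for benchmarking conversational reasoning in full-duplex spoken dialogue systems.
Insight: The contributions include modeling conversation as multi-level perception and using a GoT structure to reason dynamically over streaming predictions, generating justifications for decisions and refining the reasoning. The authors also build a high-quality annotated corpus to train the system.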
Abstract: Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts. Capturing this perceptual pathway is key to building natural full-duplex interactive systems. We introduce a framework that models this process as multi-level perception, and then reasons over conversational behaviors via a Graph-of-Thoughts (GoT). Our approach formalizes the intent-to-action pathway with a hierarchical labeling scheme, predicting high-level communicative intents and low-level speech acts to learn their causal and temporal dependencies. To train this system, we develop a high quality corpus that pairs controllable, event-rich dialogue data with human-annotated labels. The GoT framework structures streaming predictions as an evolving graph, enabling a transformer to forecast the next speech act, generate concise justifications for its decisions, and dynamically refine its reasoning. Experiments on both synthetic and real duplex dialogues show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full duplex spoken dialogue systems.
[20] Simultaneous Speech-to-Speech Translation Without Aligned Data cs.CL | cs.SD | eess.AS
Tom Labiausse, Romain Fabre, Yannick Estève, Alexandre Défossez, Neil Zeghidour
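TL;DR: This paper proposes Hibiki-Zero, a simultaneous speech-to-speech translation model that needs no word-level aligned data. It first trains on sentence-level aligned data to learn speech translation at high latency, then applies a GRPO-based reinforcement learning strategy to reduce latency while preserving translation quality, which enables scaling to multiple languages and adaptation with limited data.
Details
Motivation: Traditional simultaneous translation relies on supervised training with word-level aligned data, which is hard to collect at scale and depends on language-specific alignment heuristics. The work aims to remove this dependency, simplify the training pipeline, and support languages with differing grammatical structures.
Result: Across five X-to-English translation tasks, Hibiki-Zero achieves state-of-the-art results in translation accuracy, latency, voice transfer, and naturalness. The model can also be adapted to a new input language with less than 1000 hours of speech.
Insight: The novelty is dropping word-level alignments entirely in favor of sentence-level aligned training combined with GRPO reinforcement learning to optimize latency. This simplifies the pipeline, removes the bottleneck of language-specific alignment heuristics, and improves scalability and low-resource adaptability.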
Abstract: Simultaneous speech translation requires translating source speech into a target language in real-time while handling non-monotonic word dependencies. Traditional approaches rely on supervised training with word-level aligned data, which is difficult to collect at scale and thus depends on synthetic alignments using language-specific heuristics that are suboptimal. We propose Hibiki-Zero, which eliminates the need for word-level alignments entirely. This fundamentally simplifies the training pipeline and enables seamless scaling to diverse languages with varying grammatical structures, removing the bottleneck of designing language-specific alignment heuristics. We first train on sentence-level aligned data to learn speech translation at high latency, then apply a novel reinforcement learning strategy using GRPO to optimize latency while preserving translation quality. Hibiki-Zero achieves state-of-the-art performance in translation accuracy, latency, voice transfer, and naturalness across five X-to-English tasks. Moreover, we demonstrate that our model can be adapted to support a new input language with less than 1000h of speech. We provide examples, model weights, inference code and we release a benchmark containing 45h of multilingual data for speech translation evaluation.
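The abstract says latency is optimized with GRPO while translation quality is preserved. The sketch below shows only the generic group-relative advantage computation that GRPO is built on. The reward shape in `latency_aware_reward` (quality minus a latency penalty) and its weight are assumptions for illustration; the abstract does not state the actual reward.

```python
from statistics import mean, pstdev
from typing import List


def latency_aware_reward(quality: float, latency_s: float,
                         latency_weight: float = 0.1) -> float:
    """Hypothetical reward: translation quality minus a latency penalty."""
    return quality - latency_weight * latency_s


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of samples for the same input.

    Each sampled output is scored relative to the group mean and scaled by
    the group standard deviation, so no learned value function is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For a group of sampled translations of the same utterance, outputs that are both accurate and fast receive positive advantages under this assumed reward, while slower or less accurate ones receive negative advantages.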
[21] SteuerLLM: Local specialized large language model for German tax law analysis cs.CL | cs.AI | cs.LG
Sebastian Wind, Jeta Sopa, Laurin Schmid, Quirin Jackl, Sebastian Kiefer
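TL;DR: This paper presents SteuerLLM, a domain-adapted large language model for German tax law analysis, and SteuerEx, the first open benchmark derived from authentic German university tax law examinations. Trained on a large-scale synthetic dataset, the model outperforms general-purpose instruction-tuned models of comparable and even larger size on tax law reasoning.
Details
Motivation: Large language models perform well at general reasoning and language understanding, but their performance degrades in domains governed by strict formal rules, precise terminology, and legally binding structure, such as tax law, which requires exact statutory citation, structured legal argumentation, and numerical accuracy.
Result: On the SteuerEx benchmark (115 expert-validated examination questions), evaluated with a statement-level partial-credit framework, SteuerLLM (28B parameters) consistently outperforms general-purpose instruction-tuned models of comparable size and, in several cases, substantially larger systems.
Insight: The contributions are the first open German tax law benchmark, SteuerEx, and evidence that domain-specific data and architectural adaptation matter more than parameter scale for realistic legal reasoning. The model is trained on a large-scale synthetic dataset generated from authentic examination material with a controlled retrieval-augmented pipeline.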
Abstract: Large language models (LLMs) demonstrate strong general reasoning and language understanding, yet their performance degrades in domains governed by strict formal rules, precise terminology, and legally binding structure. Tax law exemplifies these challenges, as correct answers require exact statutory citation, structured legal argumentation, and numerical accuracy under rigid grading schemes. We algorithmically generate SteuerEx, the first open benchmark derived from authentic German university tax law examinations. SteuerEx comprises 115 expert-validated examination questions spanning six core tax law domains and multiple academic levels, and employs a statement-level, partial-credit evaluation framework that closely mirrors real examination practice. We further present SteuerLLM, a domain-adapted LLM for German tax law trained on a large-scale synthetic dataset generated from authentic examination material using a controlled retrieval-augmented pipeline. SteuerLLM (28B parameters) consistently outperforms general-purpose instruction-tuned models of comparable size and, in several cases, substantially larger systems, demonstrating that domain-specific data and architectural adaptation are more decisive than parameter scale for performance on realistic legal reasoning tasks. All benchmark data, training datasets, model weights, and evaluation code are released openly to support reproducible research in domain-specific legal artificial intelligence. A web-based demo of SteuerLLM is available at https://steuerllm.i5.ai.fau.de.
[22] DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning cs.CL | cs.AI
Yicheng Chen, Zerun Ma, Xinchen Xie, Yining Li, Kai Chen
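TL;DR: This paper presents DataChef-32B, a system that uses reinforcement learning to automatically generate optimal data recipes for large language model (LLM) adaptation. It targets the manual, labor-intensive nature of data recipe design by generating complete recipes end to end that adapt a base LLM to a target task.
Details
Motivation: LLM performance depends heavily on large-scale, high-quality training data, yet the design of data recipes (the processing pipeline from raw sources to training corpora) is still largely manual, tedious, and expertise-intensive. The paper aims to automate this process.
Result: Across six held-out tasks, recipes generated by DataChef-32B reach downstream performance comparable to recipes curated by human experts. In particular, its recipe adapts Qwen3-1.7B-Base to the math domain, scoring 66.7 on AIME’25 and surpassing Qwen3-1.7B.
Insight: The novelty lies in formulating data recipe generation as an end-to-end task and automating it through online reinforcement learning with a proxy reward that predicts downstream performance. This offers a new direction for automating LLM training and developing self-evolving AI systems.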
Abstract: In the current landscape of Large Language Models (LLMs), the curation of large-scale, high-quality training data is a primary driver of model performance. A key lever is the data recipe, which comprises a data processing pipeline to transform raw sources into training corpora. Despite the growing use of LLMs to automate individual data processing steps, such as data synthesis and filtering, the overall design of data recipes remains largely manual and labor-intensive, requiring substantial human expertise and iteration. To bridge this gap, we formulate end-to-end data recipe generation for LLM adaptation. Given a target benchmark and a pool of available data sources, a model is required to output a complete data recipe that adapts a base LLM to the target task. We present DataChef-32B, which performs online reinforcement learning using a proxy reward that predicts downstream performance for candidate recipes. Across six held-out tasks, DataChef-32B produces practical recipes that reach comparable downstream performance to those curated by human experts. Notably, the recipe from DataChef-32B adapts Qwen3-1.7B-Base to the math domain, achieving 66.7 on AIME’25 and surpassing Qwen3-1.7B. This work sheds new light on automating LLM training and developing self-evolving AI systems.
[23] Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away cs.CL | cs.AI
Soumya Suvra Ghosal, Souradip Chakraborty, Vaibhav Singh, Furong Huang, Dinesh Manocha
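TL;DR: This paper proposes SafeThink, a lightweight inference-time defense. It addresses the finding that RL-based post-training (e.g., GRPO) improves the reasoning ability of multimodal large-scale reasoning models (MLRMs) but can degrade their safety alignment and raise jailbreak success rates. SafeThink treats safety recovery as a satisficing constraint rather than a maximization objective: a safety reward model monitors the reasoning trace, and an optimized short corrective prefix (such as “Wait, think safely”) is injected only when the safety threshold is violated.
Details
Motivation: RL post-training that strengthens reasoning also erodes safety alignment, leaving stronger reasoners more vulnerable to jailbreak attacks. A defense that restores safety at inference time is therefore needed.
Result: Evaluated on six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30-60% (e.g., LlamaV-o1 from 63.33% to 5.74% on JailbreakV-28K; R1-Onevision from 69.07% to 5.65% on Hades) while preserving reasoning performance (MathVista accuracy drops only slightly, from 65.20% to 65.00%).
Insight: The novelty is framing safety recovery as a satisficing constraint rather than an optimization objective, paired with a lightweight intervention based on conditional prefix injection. The key empirical finding is that steering in the early reasoning steps (e.g., the first 1-3) is usually enough to redirect the whole generation toward a safe completion, which points to efficient safety interventions.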
Abstract: Reinforcement learning (RL) based post-training for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large-scale reasoning models (MLRMs). But recent evidence shows that it can simultaneously degrade safety alignment and increase jailbreak success rates. We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects an optimized short corrective prefix (“Wait, think safely”) only when the safety threshold is violated. In our evaluations across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30-60% (e.g., LlamaV-o1: 63.33% to 5.74% on JailbreakV-28K, R1-Onevision: 69.07% to 5.65% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20% to 65.00%). A key empirical finding from our experiments is that safety recovery is often only a few steering steps away: intervening in the first 1-3 reasoning steps typically suffices to redirect the full generation toward safe completions.
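The monitor-and-inject loop described in the SafeThink abstract can be sketched as below. This is a hedged illustration, not the authors' code: `generate_step` and `safety_score` stand in for the reasoning model and the safety reward model, the threshold value is arbitrary, and discarding the flagged step before regenerating is an assumption the abstract does not confirm.

```python
from typing import Callable, List, Optional, Tuple

PREFIX = "Wait, think safely"  # corrective prefix quoted in the abstract


def steer(generate_step: Callable[[List[str]], Optional[str]],
          safety_score: Callable[[List[str]], float],
          threshold: float = 0.5,
          max_steps: int = 8) -> Tuple[List[str], int]:
    """Generate a reasoning trace, injecting PREFIX only on violations.

    Returns the trace and the number of injections (illustrative sketch).
    """
    trace: List[str] = []
    injections = 0
    for _ in range(max_steps):
        step = generate_step(trace)
        if step is None:  # the model signalled that it is finished
            break
        # Satisficing check: intervene only if the score falls below the
        # threshold; otherwise leave the model's reasoning untouched.
        if safety_score(trace + [step]) < threshold:
            trace.append(PREFIX)          # assumed: flagged step is dropped
            injections += 1
            step = generate_step(trace)   # regenerate after the prefix
            if step is None:
                break
        trace.append(step)
    return trace, injections
```

Because the check is a threshold rather than a maximization, traces that are already safe pass through with zero injections, which is consistent with the abstract's report that reasoning accuracy is largely preserved.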
[24] Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning cs.CL
Dawid J. Kopiczko, Sagar Vaze, Tijmen Blankevoort, Yuki M. Asano
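TL;DR: This paper finds that, in chain-of-thought supervised fine-tuning, data repetition beats data scaling: under a fixed update budget, training for multiple epochs on a smaller dataset outperforms single-epoch training on a larger one. For example, Olmo3-7B trained for 128 epochs on 400 samples scores 12-26 percentage points higher on AIME’24/25 and GPQA than the same model trained for 1 epoch on 51200 samples.
Details
Motivation: The work asks how to use data more effectively in chain-of-thought supervised fine-tuning to improve the generalization of reasoning language models. Standard intuition holds that more unique samples generalize better; the paper challenges this view.
Result: On AIME’24/25 and GPQA, the model trained for 128 epochs on 400 samples clearly outperforms the model trained for a single epoch on 51200 samples, by 12-26 percentage points, with no additional catastrophic forgetting.
Insight: The core contribution is the “repetition advantage”: in supervised fine-tuning, training to full memorization (saturation) of the training data coincides with better generalization. This yields a practical recipe for reasoning SFT: scale the number of epochs, using training token accuracy as the stopping criterion, instead of expensive undirected data scaling. It also raises a new open problem about the training dynamics of large language models.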
Abstract: Supervised fine-tuning (SFT) on chain-of-thought data is an essential post-training step for reasoning language models. Standard machine learning intuition suggests that training with more unique training samples yields better generalization. Counterintuitively, we show that SFT benefits from repetition: under a fixed update budget, training for more epochs on smaller datasets outperforms single-epoch training on larger datasets. On AIME’24/25 and GPQA benchmarks, Olmo3-7B trained for 128 epochs on 400 samples outperforms the equivalent 1 epoch on 51200 samples by 12-26 percentage points, with no additional catastrophic forgetting. We find that training token accuracy reliably signals when repetition has saturated; improvements from additional epochs plateau at full memorization, a pattern consistent across all settings. These findings provide a practical approach for reasoning SFT, where scaling epochs with token accuracy as a stopping criterion can replace expensive undirected data scaling. We pose the repetition advantage, where full memorization coincides with improved generalization, as a new open problem for the community in understanding the training dynamics of large language models.
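The "fixed update budget" in this abstract is easy to check by arithmetic: 400 samples for 128 epochs and 51200 samples for 1 epoch both present 51,200 training examples to the model. The sketch below verifies this and adds a token-accuracy stopping rule in the spirit of the abstract; the function names and the `threshold` value are assumptions for illustration.

```python
from typing import List, Optional


def samples_seen(num_samples: int, epochs: int) -> int:
    """Total training examples presented to the model."""
    return num_samples * epochs


def first_saturated_epoch(token_accuracy_by_epoch: List[float],
                          threshold: float = 0.999) -> Optional[int]:
    """Return the first (1-indexed) epoch whose training token accuracy
    reaches `threshold`, or None if it never does (hypothetical rule)."""
    for epoch, accuracy in enumerate(token_accuracy_by_epoch, start=1):
        if accuracy >= threshold:
            return epoch
    return None


# Both configurations from the abstract consume the same budget.
assert samples_seen(400, 128) == samples_seen(51200, 1) == 51200
```

With equal batch sizes, equal sample counts imply equal numbers of optimizer updates, so the comparison isolates repetition versus diversity rather than compute.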
cs.CV
[25] Multimodal Information Fusion for Chart Understanding: A Survey of MLLMs – Evolution, Limitations, and Cognitive Enhancement cs.CV | cs.AI | cs.CL | cs.LG
Zhihang Yi, Jian Zhao, Jiancheng Lv, Tao Wang
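TL;DR: This survey covers multimodal large language models (MLLMs) for chart understanding. It traces the development of the field, analyzes the core challenges of fusing visual and linguistic information in charts, categorizes downstream tasks and datasets, and reviews the evolution of methods from classic deep learning to advanced MLLM paradigms. It closes with a critical look at the limitations of current models and an outlook on future research directions.
Details
Motivation: Chart understanding is a quintessential information fusion task that requires seamless integration of graphical and textual data. Although MLLMs have transformed the field, research on MLLM-based chart analysis remains fragmented and lacks systematic organization. The survey aims to provide a comprehensive roadmap that structures the field's core components.
Result: As a survey, the paper proposes no model and reports no quantitative results. Its contribution is a new taxonomy that divides benchmark datasets into canonical and non-canonical categories, highlighting the field's expanding scope.
Insight: The survey offers the first systematic, structured review of MLLMs for chart understanding and a new taxonomy of benchmark datasets. Its account of how methods evolved, its critical analysis of perceptual and reasoning deficits in current models, and its proposed future directions (such as advanced alignment techniques and reinforcement learning for cognitive enhancement) give follow-up research a clear framework.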
Abstract: Chart understanding is a quintessential information fusion task, requiring the seamless integration of graphical and textual data to extract meaning. The advent of Multimodal Large Language Models (MLLMs) has revolutionized this domain, yet the landscape of MLLM-based chart analysis remains fragmented and lacks systematic organization. This survey provides a comprehensive roadmap of this nascent frontier by structuring the domain’s core components. We begin by analyzing the fundamental challenges of fusing visual and linguistic information in charts. We then categorize downstream tasks and datasets, introducing a novel taxonomy of canonical and non-canonical benchmarks to highlight the field’s expanding scope. Subsequently, we present a comprehensive evolution of methodologies, tracing the progression from classic deep learning techniques to state-of-the-art MLLM paradigms that leverage sophisticated fusion strategies. By critically examining the limitations of current models, particularly their perceptual and reasoning deficits, we identify promising future directions, including advanced alignment techniques and reinforcement learning for cognitive enhancement. This survey aims to equip researchers and practitioners with a structured understanding of how MLLMs are transforming chart information fusion and to catalyze progress toward more robust and reliable systems.
[26] MPA: Multimodal Prototype Augmentation for Few-Shot Learning cs.CV
Liwen Wu, Wei Wang, Lei Zhao, Zhan Gao, Qika Lin
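TL;DR: This paper proposes MPA, a new few-shot learning framework that combines diverse semantic descriptions generated by large language models, multi-view augmentation for feature diversity, and an adaptive uncertain class absorber. It addresses the reliance of conventional few-shot methods on the visual modality alone and on a single way of computing prototypes.
Details
Motivation: Most existing few-shot learning methods focus only on the visual modality and compute prototypes directly from raw support images, so they lack rich multimodal information, which limits performance.
Result: On four single-domain and six cross-domain few-shot learning benchmarks, MPA outperforms existing state-of-the-art methods in most settings. In the 5-way 1-shot setting, it surpasses the second-best method by 12.29% in the single-domain setting and 24.56% in the cross-domain setting.
Insight: The novelty is a unified multimodal prototype augmentation framework that combines LLM-generated semantic enhancement, multi-view data augmentation, and uncertain-class modeling, improving the generalization and robustness of few-shot learning.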
Abstract: Recently, few-shot learning (FSL) has become a popular task that aims to recognize new classes from only a few labeled examples and has been widely applied in fields such as natural science, remote sensing, and medical images. However, most existing methods focus only on the visual modality and compute prototypes directly from raw support images, which lack comprehensive and rich multimodal information. To address these limitations, we propose a novel Multimodal Prototype Augmentation FSL framework called MPA, including LLM-based Multi-Variant Semantic Enhancement (LMSE), Hierarchical Multi-View Augmentation (HMA), and an Adaptive Uncertain Class Absorber (AUCA). LMSE leverages large language models to generate diverse paraphrased category descriptions, enriching the support set with additional semantic cues. HMA exploits both natural and multi-view augmentations to enhance feature diversity (e.g., changes in viewing distance, camera angles, and lighting conditions). AUCA models uncertainty by introducing uncertain classes via interpolation and Gaussian sampling, effectively absorbing uncertain samples. Extensive experiments on four single-domain and six cross-domain FSL benchmarks demonstrate that MPA achieves superior performance compared to existing state-of-the-art methods across most settings. Notably, MPA surpasses the second-best method by 12.29% and 24.56% in the single-domain and cross-domain setting, respectively, in the 5-way 1-shot setting.
[27] VERA: Identifying and Leveraging Visual Evidence Retrieval Heads in Long-Context Understanding cs.CV | cs.CL
Rongcan Pei, Huan Li, Fang Guo, Qi Zhu
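TL;DR: The paper proposes VERA, a training-free framework that identifies the key Visual Evidence Retrieval (VER) attention heads in vision-language models (VLMs) and, triggered by model uncertainty, explicitly verbalizes the visual evidence those heads attend to, substantially improving performance on long-context understanding tasks.
Details
Motivation: VLMs face significant challenges on long-context and complex reasoning tasks. The paper dissects their internal mechanisms to understand the performance bottlenecks and uses the critical attention heads to strengthen the models.
Result: Across five benchmarks, VERA yields average relative improvements of 21.3% on Qwen3-VL-8B-Instruct and 20.1% on GLM-4.1V-Thinking, markedly improving the long-context understanding of open-source VLMs.
Insight: The novelty is the identification of a sparse, dynamic set of VER heads that are causal to model performance, and a training-free enhancement framework built on them that detects model uncertainty to trigger explicit verbalization of visual evidence. This offers a new perspective on understanding and improving long-context processing in VLMs.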
Abstract: While Vision-Language Models (VLMs) have shown promise in textual understanding, they face significant challenges when handling long context and complex reasoning tasks. In this paper, we dissect the internal mechanisms governing long-context processing in VLMs to understand their performance bottlenecks. Through the lens of attention analysis, we identify specific Visual Evidence Retrieval (VER) Heads - a sparse, dynamic set of attention heads critical for locating visual cues during reasoning, distinct from static OCR heads. We demonstrate that these heads are causal to model performance; masking them leads to significant degradation. Leveraging this discovery, we propose VERA (Visual Evidence Retrieval Augmentation), a training-free framework that detects model uncertainty (i.e., entropy) to trigger the explicit verbalization of visual evidence attended by VER heads. Comprehensive experiments demonstrate that VERA significantly improves long-context understanding of open-source VLMs: it yields an average relative improvement of 21.3% on Qwen3-VL-8B-Instruct and 20.1% on GLM-4.1V-Thinking across five benchmarks.
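The abstract states that VERA uses entropy as its uncertainty signal to trigger verbalization. A minimal sketch of such a trigger is shown below; the Shannon entropy formula is standard, but the function names, the use of the next-token distribution, and the threshold value are assumptions rather than details from the paper.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_verbalize(probs: Sequence[float], threshold: float = 1.0) -> bool:
    """Trigger explicit verbalization of visual evidence when the model is
    uncertain, i.e. when entropy exceeds a (hypothetical) threshold."""
    return token_entropy(probs) > threshold
```

A peaked distribution keeps decoding as usual, while a flat one would trigger the step in which the regions attended by the VER heads are described in text; how those regions are extracted is outside the scope of this sketch.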
[28] Beyond Closed-Pool Video Retrieval: A Benchmark and Agent Framework for Real-World Video Search and Moment Localization cs.CV | cs.LG
Tao Yu, Yujia Yang, Haopeng Jin, Junhao Gong, Xinlong Chen
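TL;DR: This paper presents the RVMS-Bench benchmark and the RACLO agent framework for evaluating and solving real-world video search and moment localization from fuzzy, multi-dimensional memories, addressing the limits of traditional closed-pool video retrieval.
Details
Motivation: Traditional video retrieval benchmarks focus on matching precise descriptions against closed video pools and fail to reflect real-world search on the open web, which is characterized by fuzzy, multi-dimensional memories.
Result: Experiments show that existing multimodal large language models (MLLMs) still fall short on real-world video retrieval and moment localization based on fuzzy memories.
Insight: The contributions are a hierarchical description framework covering Global Impression, Key Moment, Temporal Context, and Auditory Memory to mimic realistic search cues, and RACLO, an agent framework that uses abductive reasoning to simulate the human “Recall-Search-Verify” cognitive process.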
Abstract: Traditional video retrieval benchmarks focus on matching precise descriptions to closed video pools, failing to reflect real-world searches characterized by fuzzy, multi-dimensional memories on the open web. We present RVMS-Bench, a comprehensive system for evaluating real-world video memory search. It consists of 1,440 samples spanning 20 diverse categories and four duration groups, sourced from real-world open-web videos. RVMS-Bench utilizes a hierarchical description framework encompassing Global Impression, Key Moment, Temporal Context, and Auditory Memory to mimic realistic multi-dimensional search cues, with all samples strictly verified via a human-in-the-loop protocol. We further propose RACLO, an agentic framework that employs abductive reasoning to simulate the human “Recall-Search-Verify” cognitive process, effectively addressing the challenge of searching for videos via fuzzy memories in the real world. Experiments reveal that existing MLLMs still demonstrate insufficient capabilities in real-world Video Retrieval and Moment Localization based on fuzzy memories. We believe this work will facilitate the advancement of video retrieval robustness in real-world unstructured scenarios.
[29] ArtisanGS: Interactive Tools for Gaussian Splat Selection with AI and Human in the Loop cs.CV
Clement Fuji Tsang, Anita Hu, Or Perel, Carsten Kolve, Maria Shugrina
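TL;DR: This paper introduces ArtisanGS, a suite of tools for interactive selection and segmentation of 3D Gaussian Splat (3DGS) representations. It combines fast AI-driven propagation of 2D selections to 3D with flexible manual tools, lets users achieve fine-grained binary segmentation of unstructured 3DGS scenes, and is applied to user-guided local editing.
Details
Motivation: Extracting usable objects from in-the-wild 3DGS captures is difficult and controllable editing techniques are limited. The work focuses on interactive tools rather than fully automatic or high-level editing.
Result: The toolset is evaluated against the state of the art on Gaussian Splat selection, works on any in-the-wild capture without additional optimization, and demonstrates downstream utility through user-guided local editing with a custom video diffusion model.
Insight: The novelty is combining fast AI-driven selection propagation with user intervention and manual tools, yielding a flexible, interactive 3DGS segmentation workflow that gives users direct control over the regions the AI may modify and compensates for the shortcomings of automatic solutions.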
Abstract: Representations in the family of 3D Gaussian Splats (3DGS) are growing into a viable alternative to traditional graphics for an expanding number of applications, including recent techniques that facilitate physics simulation and animation. However, extracting usable objects from in-the-wild captures remains challenging and controllable editing techniques for this representation are limited. Unlike the bulk of emerging techniques, focused on automatic solutions or high-level editing, we introduce an interactive suite of tools centered around versatile Gaussian Splat selection and segmentation. We propose a fast AI-driven method to propagate user-guided 2D selection masks to 3DGS selections. This technique allows for user intervention in the case of errors and is further coupled with flexible manual selection and segmentation tools. These allow a user to achieve virtually any binary segmentation of an unstructured 3DGS scene. We evaluate our toolset against the state-of-the-art for Gaussian Splat selection and demonstrate its utility for downstream applications by developing a user-guided local editing approach, leveraging a custom Video Diffusion Model. With flexible selection tools, users have direct control over the areas that the AI can modify. Our selection and editing tools can be used for any in-the-wild capture without additional optimization.
[30] When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models cs.CV | cs.AI
Jiacheng Hou, Yining Sun, Ruochong Jin, Haochen Han, Fangming Liu
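TL;DR: This paper proposes the Vision-Centric Jailbreak Attack (VJA), the first attack on large image editing models that conveys malicious instructions purely through visual inputs, and introduces IESBench, a safety-oriented benchmark. Experiments show that VJA effectively compromises state-of-the-art commercial models. The paper also proposes a training-free defense based on introspective multimodal reasoning that substantially improves model safety.
Details
Motivation: Large image editing models have shifted from text-driven instructions to vision-prompt editing. This improves usability but introduces a new safety risk that remains underexplored: the attack surface itself becomes visual.
Result: On IESBench, VJA achieves attack success rates of up to 80.9% on Nano Banana Pro and 70.1% on GPT-Image-1.5. The proposed defense raises the safety of poorly aligned models to a level comparable with commercial systems, without auxiliary guard models and with negligible computational overhead.
Insight: The paper is the first systematic study of jailbreak attacks delivered purely through visual inputs and builds a matching safety benchmark. The defense achieves training-free safety improvements through introspective multimodal reasoning, offering a new approach to protecting vision-prompt models.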
Abstract: Recent advances in large image editing models have shifted the paradigm from text-driven instructions to vision-prompt editing, where user intent is inferred directly from visual inputs such as marks, arrows, and visual-text prompts. While this paradigm greatly expands usability, it also introduces a critical and underexplored safety risk: the attack surface itself becomes visual. In this work, we propose Vision-Centric Jailbreak Attack (VJA), the first visual-to-visual jailbreak attack that conveys malicious instructions purely through visual inputs. To systematically study this emerging threat, we introduce IESBench, a safety-oriented benchmark for image editing models. Extensive experiments on IESBench demonstrate that VJA effectively compromises state-of-the-art commercial models, achieving attack success rates of up to 80.9% on Nano Banana Pro and 70.1% on GPT-Image-1.5. To mitigate this vulnerability, we propose a training-free defense based on introspective multimodal reasoning, which substantially improves the safety of poorly aligned models to a level comparable with commercial systems, without auxiliary guard models and with negligible computational overhead. Our findings expose new vulnerabilities, provide both a benchmark and practical defense to advance safe and trustworthy modern image editing systems. Warning: This paper contains offensive images created by large image editing models.
[31] XSPLAIN: XAI-enabling Splat-based Prototype Learning for Attribute-aware INterpretability cs.CV
Dominik Galus, Julia Farganus, Tymoteusz Zapala, Mikołaj Czachorowski, Piotr Borycki
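TL;DR: This paper proposes XSPLAIN, the first ante-hoc, prototype-based interpretability framework designed specifically for 3D Gaussian Splatting (3DGS) classification. Using a voxel-aggregated PointNet backbone and a novel invertible orthogonal transformation, it disentangles feature channels for interpretability while strictly preserving the original decision boundaries. Explanations are grounded in representative training examples, supporting intuitive “this looks like that” reasoning without degrading classification performance.
Details
Motivation: 3DGS has become a standard for high-fidelity 3D reconstruction, but its adoption in several critical domains is hindered by the lack of interpretability of generation models and of splat classification. Explainability methods for other 3D representations, such as point clouds, typically rely on ambiguous saliency maps that fail to capture the volumetric coherence of Gaussian primitives.
Result: A rigorous user study (N=51) shows that participants selected XSPLAIN explanations as the best 48.4% of the time, significantly outperforming baselines (p<0.001), which supports the claim that XSPLAIN provides transparency and user trust. Interpretability comes without loss of classification performance.
Insight: The novelty is the first prototype-based ante-hoc interpretability framework tailored to 3DGS classification, with an invertible orthogonal transformation that disentangles feature channels for explanation while leaving decision boundaries unchanged, so performance and interpretability are obtained together. Combining prototype learning with the emerging 3DGS representation, and designing a boundary-preserving transformation, is an effective new approach to interpretability in this area.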
Abstract: 3D Gaussian Splatting (3DGS) has rapidly become a standard for high-fidelity 3D reconstruction, yet its adoption in multiple critical domains is hindered by the lack of interpretability of the generation models as well as classification of the Splats. While explainability methods exist for other 3D representations, like point clouds, they typically rely on ambiguous saliency maps that fail to capture the volumetric coherence of Gaussian primitives. We introduce XSPLAIN, the first ante-hoc, prototype-based interpretability framework designed specifically for 3DGS classification. Our approach leverages a voxel-aggregated PointNet backbone and a novel, invertible orthogonal transformation that disentangles feature channels for interpretability while strictly preserving the original decision boundaries. Explanations are grounded in representative training examples, enabling intuitive “this looks like that” reasoning without any degradation in classification performance. A rigorous user study (N=51) demonstrates a decisive preference for our approach: participants selected XSPLAIN explanations 48.4% of the time as the best, significantly outperforming baselines (p<0.001), showing that XSPLAIN provides transparency and user trust. The source code for this work is available at: https://github.com/Solvro/ml-splat-xai
[32] PMMA: The Polytechnique Montreal Mobility Aids Dataset cs.CV
Qingwu Liu, Nicolas Saunier, Guillaume-Alexandre Bilodeau
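TL;DR: This study presents PMMA, a new dataset for detecting pedestrians who use mobility aids, with nine categories collected in an outdoor environment. It also evaluates seven object detection models and three tracking algorithms under the MMDetection framework to establish a benchmark.
Details
Motivation: There is a lack of datasets dedicated to detecting pedestrians who use mobility aids (such as wheelchairs, canes, and walkers); the dataset is intended to advance computer vision research on assisted mobility.
Result: Experiments show that YOLOX, Deformable DETR, and Faster R-CNN achieve the best detection performance, while the differences among the three trackers are relatively small.
Insight: The contribution is a diverse dataset dedicated to mobility aid users together with a comprehensive benchmark evaluation, which supports the development of more inclusive and accurate vision systems, particularly for assistive technology and accessible environments.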
Abstract: This study introduces a new object detection dataset of pedestrians using mobility aids, named PMMA. The dataset was collected in an outdoor environment, where volunteers used wheelchairs, canes, and walkers, resulting in nine categories of pedestrians: pedestrians; cane users; two types of walker users (walking or resting); and five types of wheelchair users, namely wheelchair users, people pushing empty wheelchairs, and three types associated with pushed occupied wheelchairs (the entire pushing group, the pusher, and the person seated in the wheelchair). To establish a benchmark, seven object detection models (Faster R-CNN, CenterNet, YOLOX, DETR, Deformable DETR, DINO, and RT-DETR) and three tracking algorithms (ByteTrack, BOT-SORT, and OC-SORT) were implemented under the MMDetection framework. Experimental results show that YOLOX, Deformable DETR, and Faster R-CNN achieve the best detection performance, while the differences among the three trackers are relatively small. The PMMA dataset is publicly available at https://doi.org/10.5683/SP3/XJPQUG, and the video processing and model training code is available at https://github.com/DatasetPMMA/PMMA.
[33] ERGO: Excess-Risk-Guided Optimization for High-Fidelity Monocular 3D Gaussian Splatting cs.CV | cs.AI
Zehua Ma, Hanhui Li, Zhenyu Xie, Xiaonan Luo, Michael Kampffmeyer
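TL;DR: The paper proposes ERGO, an adaptive optimization framework for single-image 3D content generation, where synthesized auxiliary views contain geometric inconsistencies and textural misalignments that degrade reconstruction quality. ERGO decomposes the optimization losses in 3D Gaussian splatting into excess risk and Bayes error, dynamically estimates view-specific excess risk, and adaptively adjusts loss weights. Combined with geometry-aware and texture-aware objectives, this forms a synergistic global-local optimization paradigm that improves geometric fidelity and textural quality under noisy supervision.
Details
Motivation: Generating 3D content from a single image is ill-posed because occluded regions lack geometric and textural information. Existing methods use generative models to synthesize auxiliary views for supervision, but those views contain geometric inconsistencies and textural misalignments that propagate and amplify artifacts during 3D reconstruction.
Result: Extensive experiments on the Google Scanned Objects and OmniObject3D datasets show that ERGO outperforms existing state-of-the-art methods in both geometric fidelity and textural quality.
Insight: The novelty is decomposing the optimization loss into excess risk and Bayes error and using that decomposition to adjust loss weights dynamically. The result is a noise-robust, synergistic optimization paradigm that pairs global loss decomposition with local geometry-aware and texture-aware objectives and makes effective use of imperfect synthetic supervision.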
Abstract: Generating 3D content from a single image remains a fundamentally challenging and ill-posed problem due to the inherent absence of geometric and textural information in occluded regions. While state-of-the-art generative models can synthesize auxiliary views to provide additional supervision, these views inevitably contain geometric inconsistencies and textural misalignments that propagate and amplify artifacts during 3D reconstruction. To effectively harness these imperfect supervisory signals, we propose an adaptive optimization framework guided by excess risk decomposition, termed ERGO. Specifically, ERGO decomposes the optimization losses in 3D Gaussian splatting into two components, i.e., excess risk that quantifies the suboptimality gap between current and optimal parameters, and Bayes error that models the irreducible noise inherent in synthesized views. This decomposition enables ERGO to dynamically estimate the view-specific excess risk and adaptively adjust loss weights during optimization. Furthermore, we introduce geometry-aware and texture-aware objectives that complement the excess-risk-derived weighting mechanism, establishing a synergistic global-local optimization paradigm. Consequently, ERGO demonstrates robustness against supervision noise while consistently enhancing both geometric fidelity and textural quality of the reconstructed 3D content. Extensive experiments on the Google Scanned Objects dataset and the OmniObject3D dataset demonstrate the superiority of ERGO over existing state-of-the-art methods.
[34] HII-DPO: Eliminate Hallucination via Accurate Hallucination-Inducing Counterfactual Images cs.CV
Yilin Yang, Zhenghui Guo, Yuke Wang, Omprakash Gnawali, Sheng Di
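TL;DR: This paper proposes HII-DPO. It synthesizes hallucination-inducing images to expose a scene-conditioned hallucination pattern that language bias causes in large vision-language models, builds the MOH benchmark to measure how susceptible models are to that pattern, and uses the hallucination-inducing images to construct high-quality preference datasets for fine-grained alignment, reducing hallucinations while preserving general model capabilities.
Details
Motivation: Large vision-language models hallucinate because of inherent language bias, and existing mitigation methods often overlook the hallucination patterns that this bias drives.
Result: On standard hallucination benchmarks, the method improves on the current state of the art by up to 38%. The MOH benchmark is also established to evaluate existing alignment frameworks.
Insight: The novelty lies in accurately synthesizing hallucination-inducing images to reveal the scene-conditioned hallucination pattern, and in using those images to build preference datasets for fine-grained alignment. The approach is a useful reference for research on hallucination mitigation and model alignment.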
Abstract: Large Vision-Language Models (VLMs) have achieved remarkable success across diverse multimodal tasks but remain vulnerable to hallucinations rooted in inherent language bias. Despite recent progress, existing hallucination mitigation methods often overlook the underlying hallucination patterns driven by language bias. In this work, we design a novel pipeline to accurately synthesize Hallucination-Inducing Images (HIIs). Using synthesized HIIs, we reveal a consistent scene-conditioned hallucination pattern: models tend to mention objects that are highly typical of the scene even when visual evidence is removed. To quantify the susceptibility of VLMs to this hallucination pattern, we establish the Masked-Object-Hallucination (MOH) benchmark to rigorously evaluate existing state-of-the-art alignment frameworks. Finally, we leverage HIIs to construct high-quality preference datasets for fine-grained alignment. Experimental results demonstrate that our approach effectively mitigates hallucinations while preserving general model capabilities. Specifically, our method achieves up to a 38% improvement over the current state-of-the-art on standard hallucination benchmarks.
[35] Towards Remote Sensing Change Detection with Neural Memory cs.CV
Zhenyu Yang, Gensheng Pei, Yazhou Yao, Tianfei Zhou, Lizhong Ding
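TL;DR: This paper proposes ChangeTitans, a Titans-based framework for remote sensing change detection. It introduces VTitans, the first Titans-based vision backbone, a hierarchical VTitans-Adapter, and the two-stream fusion module TS-CBAM, with the aim of capturing long-range dependencies efficiently and suppressing pseudo-changes. It reports state-of-the-art (SOTA) results on several benchmark datasets.
Details
Motivation: Existing remote sensing change detection methods struggle to balance long-range dependency modeling against computational efficiency. In particular, the quadratic complexity of Transformers limits scalability, and existing linear attention methods fall short in capturing complex spatiotemporal relationships.
Result: The method was evaluated on four benchmark datasets: LEVIR-CD, WHU-CD, LEVIR-CD+, and SYSU-CD. On LEVIR-CD it reaches 84.36% IoU and a 91.52% F1-score, a state-of-the-art result, while remaining computationally competitive.
Insight: The contributions are: 1) VTitans, the first Titans-based vision backbone, which combines neural memory with segmented local attention to model long-range dependencies efficiently; 2) a hierarchical VTitans-Adapter that refines multi-scale features across network layers; 3) the two-stream TS-CBAM module, which uses cross-temporal attention to suppress pseudo-changes. Viewed objectively, the core novelty is transferring the Titans architecture to vision tasks and tailoring it to the spatiotemporal characteristics of change detection.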
Abstract: Remote sensing change detection is essential for environmental monitoring, urban planning, and related applications. However, current methods often struggle to capture long-range dependencies while maintaining computational efficiency. Although Transformers can effectively model global context, their quadratic complexity poses scalability challenges, and existing linear attention approaches frequently fail to capture intricate spatiotemporal relationships. Drawing inspiration from the recent success of Titans in language tasks, we present ChangeTitans, a Titans-based framework for remote sensing change detection. Specifically, we propose VTitans, the first Titans-based vision backbone that integrates neural memory with segmented local attention, thereby capturing long-range dependencies while mitigating computational overhead. Next, we present a hierarchical VTitans-Adapter to refine multi-scale features across different network layers. Finally, we introduce TS-CBAM, a two-stream fusion module leveraging cross-temporal attention to suppress pseudo-changes and enhance detection accuracy. Experimental evaluations on four benchmark datasets (LEVIR-CD, WHU-CD, LEVIR-CD+, and SYSU-CD) demonstrate that ChangeTitans achieves state-of-the-art results, attaining 84.36% IoU and 91.52% F1-score on LEVIR-CD, while remaining computationally competitive.
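For readers unfamiliar with the two reported metrics, the sketch below computes IoU and F1 for the "changed" class from binary labels. It assumes the usual single-class definitions, IoU = TP / (TP + FP + FN) and F1 = 2TP / (2TP + FP + FN); the paper's exact evaluation protocol is not given in the abstract, and the function names are illustrative only. Under these definitions F1 = 2·IoU / (1 + IoU), and the reported pair (84.36% IoU, 91.52% F1) is consistent with that identity.

```python
def change_detection_scores(pred, target):
    """Return (IoU, F1) of the 'changed' class for two equal-length 0/1 label sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    if tp + fp + fn == 0:
        # No changed pixel in either map: both scores are undefined, report 0.0 here.
        return 0.0, 0.0
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1


def f1_from_iou(iou):
    """F1 implied by a given IoU when both are computed from the same TP/FP/FN counts."""
    return 2 * iou / (1 + iou)


iou, f1 = change_detection_scores([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
print(iou, f1)              # IoU 0.5, F1 about 0.667
print(f1_from_iou(0.8436))  # about 0.9152, matching the reported F1-score
```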
[36] The Garbage Dataset (GD): A Multi-Class Image Benchmark for Automated Waste Segregation cs.CV
Suman Kunwar
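TL;DR: This study introduces the Garbage Dataset (GD), a publicly available image dataset designed to advance automated waste segregation through machine learning and computer vision. It covers 10 common household waste categories and contains 13,348 labeled images, collected through multiple methods and rigorously validated. The dataset is benchmarked with state-of-the-art deep learning models (EfficientNetV2M, EfficientNetV2S, and others), evaluated on both performance metrics and operational carbon emissions. EfficientNetV2S achieves the best accuracy and F1-score, at a moderate carbon cost. The analysis also surfaces inherent dataset characteristics, such as class imbalance and background complexity, that must be addressed in real deployments.
Details
Motivation: Practical automated waste segregation needs a diverse, public image dataset that supports machine learning and computer vision research and advances environmental sustainability applications.
Result: On GD, EfficientNetV2S achieves 96.19% accuracy and a 0.96 F1-score, the best performance, with moderate carbon emissions. EfficientNetV2M, MobileNet, ResNet50, and ResNet101 were also benchmarked, and the results highlight the environmental trade-offs in model selection.
Insight: The contribution is a multi-class, real-world waste image dataset together with a joint evaluation of model performance and carbon emissions. It highlights class imbalance, background complexity, and environmental trade-offs as open challenges, providing a useful benchmark for waste classification research and guidance for practical deployment.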
Abstract: This study introduces the Garbage Dataset (GD), a publicly available image dataset designed to advance automated waste segregation through machine learning and computer vision. It is a diverse dataset covering 10 common household waste categories: metal, glass, biological, paper, battery, trash, cardboard, shoes, clothes, and plastic. The dataset comprises 13,348 labeled images collected through multiple methods, including the DWaste mobile app and curated web sources. Methods included rigorous validation through checksums and outlier detection, analysis of class imbalance and visual separability via PCA/t-SNE, and assessment of background complexity using entropy and saliency measures. The dataset was benchmarked using state-of-the-art deep learning models (EfficientNetV2M, EfficientNetV2S, MobileNet, ResNet50, ResNet101) evaluated on performance metrics and operational carbon emissions. Experimental results indicate EfficientNetV2S achieved the highest performance with 96.19% accuracy and a 0.96 F1-score, though with a moderate carbon cost. Analysis revealed inherent dataset characteristics including class imbalance, a skew toward high-outlier classes (plastic, cardboard, paper), and brightness variations that require consideration. The main conclusion is that GD provides a valuable, real-world benchmark for waste classification research while highlighting important challenges such as class imbalance, background complexity, and environmental trade-offs in model selection that must be addressed for practical deployment. The dataset is publicly released to support further research in environmental sustainability applications.
[37] 1%>100%: High-Efficiency Visual Adapter with Complex Linear Projection Optimization cs.CV | cs.AI
Dongshuo Yin, Xue Yang, Deng-Ping Fan, Shi-Min Hu
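TL;DR: This paper proposes CoLin (Complex Linear Projection Optimization), an efficient visual adapter that adds only about 1% extra parameters yet outperforms full fine-tuning and classical delta-tuning on a range of vision tasks (object detection, segmentation, image classification, and rotated object detection in remote sensing), offering a new option for efficient deployment of vision foundation models.
Details
Motivation: Full fine-tuning of vision foundation models is costly and inefficient, and the efficiency gains that delta-tuning shows on large language models (LLMs) do not transfer directly to the fine-tuning pipeline of vision models, so more efficient visual adaptation strategies are needed.
Result: Extensive experiments on object detection, segmentation, image classification, and rotated object detection (remote sensing) show that CoLin, with only 1% of the parameters, outperforms both full fine-tuning and classical delta-tuning for the first time.
Insight: The contributions are a novel low-rank complex adapter architecture and a theoretical proof that low-rank composite matrices suffer severe convergence issues during training. A tailored loss function addresses this issue and enables efficient adaptation with a minimal parameter budget (about 1%).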
Abstract: Deploying vision foundation models typically relies on efficient adaptation strategies, whereas conventional full fine-tuning suffers from prohibitive costs and low efficiency. While delta-tuning has proven effective in boosting the performance and efficiency of LLMs during adaptation, its advantages cannot be directly transferred to the fine-tuning pipeline of vision foundation models. To push the boundaries of adaptation efficiency for vision tasks, we propose an adapter with Complex Linear Projection Optimization (CoLin). For architecture, we design a novel low-rank complex adapter that introduces only about 1% parameters to the backbone. For efficiency, we theoretically prove that low-rank composite matrices suffer from severe convergence issues during training, and address this challenge with a tailored loss. Extensive experiments on object detection, segmentation, image classification, and rotated object detection (remote sensing scenario) demonstrate that CoLin outperforms both full fine-tuning and classical delta-tuning approaches with merely 1% parameters for the first time, providing a novel and efficient solution for deployment of vision foundation models. We release the code on https://github.com/DongshuoYin/CoLin.
[38] MapVerse: A Benchmark for Geospatial Question Answering on Diverse Real-World Maps cs.CV
Sharat Bhat, Harshita Khandelwal, Tushar Kataria, Vivek Gupta
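TL;DR: This paper presents MapVerse, a large-scale geospatial question-answering benchmark built on real-world maps and designed to evaluate map reading, interpretation, and multimodal reasoning. It contains 11,837 human-authored question-answer pairs over 1,025 maps, spanning ten map categories and multiple question types. The authors evaluate ten state-of-the-art models and find that they do well on classification-style tasks but fall short on advanced tasks that require complex spatial reasoning.
Details
Motivation: Existing datasets for evaluating vision-language models on map reasoning are narrow in scope, restricted to specific domains, and heavily reliant on artificially generated content, so they offer little depth for evaluating genuine geospatial reasoning.
Result: Ten SOTA models were evaluated on MapVerse to establish baselines and quantify reasoning gaps. The models are competitive on classification-style tasks, but both open- and closed-source models perform poorly on advanced tasks that require complex spatial reasoning.
Insight: The novelty is a large-scale, human-annotated benchmark built on diverse real-world maps, which provides a more realistic and richer setting for evaluating geospatial reasoning. Viewed objectively, its fine-grained categorical analysis and its study of visual factors offer a new perspective on the capabilities and limits of models in map reasoning.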
Abstract: Maps are powerful carriers of structured and contextual knowledge, encompassing geography, demographics, infrastructure, and environmental patterns. Reasoning over such knowledge requires models to integrate spatial relationships, visual cues, real-world context, and domain-specific expertise, capabilities that current large language models (LLMs) and vision-language models (VLMs) still struggle to exhibit consistently. Yet, datasets used to benchmark VLMs on map-based reasoning remain narrow in scope, restricted to specific domains, and heavily reliant on artificially generated content (outputs from LLMs or pipeline-based methods), offering limited depth for evaluating genuine geospatial reasoning. To address this gap, we present MapVerse, a large-scale benchmark built on real-world maps. It comprises 11,837 human-authored question-answer pairs across 1,025 maps, spanning ten diverse map categories and multiple question categories for each. The dataset provides a rich setting for evaluating map reading, interpretation, and multimodal reasoning. We evaluate ten state-of-the-art models against our benchmark to establish baselines and quantify reasoning gaps. Beyond overall performance, we conduct fine-grained categorical analyses to assess model inference across multiple dimensions and investigate the visual factors shaping reasoning outcomes. Our findings reveal that while current VLMs perform competitively on classification-style tasks, both open- and closed-source models fall short on advanced tasks requiring complex spatial reasoning.
[39] Enhancing Weakly Supervised Multimodal Video Anomaly Detection through Text Guidance cs.CV | cs.AI
Shengyang Sun, Jiashen Hua, Junyi Feng, Xiaojin Gong
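TL;DR: This paper proposes a text-guided framework for weakly supervised multimodal video anomaly detection. An in-context-learning-based multi-stage text augmentation mechanism generates high-quality anomaly text samples for fine-tuning the text feature extractor, and a multi-scale bottleneck Transformer fusion module uses compressed bottleneck tokens to integrate cross-modal information progressively, reducing redundancy and imbalance. The method achieves state-of-the-art performance on the UCF-Crime and XD-Violence datasets.
Details
Motivation: The text modality is under-exploited in weakly supervised multimodal video anomaly detection: general-purpose language models struggle to capture anomaly-specific details, relevant descriptions are scarce, and multimodal fusion suffers from redundancy and imbalance.
Result: The method achieves state-of-the-art performance on the UCF-Crime and XD-Violence benchmarks.
Insight: The contributions are the in-context-learning-based multi-stage text augmentation mechanism, which generates high-quality anomaly text, and the multi-scale bottleneck Transformer fusion module, which integrates multimodal information progressively through compressed tokens. Both ideas can inform other multimodal tasks that need better text guidance and more efficient fusion.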
Abstract: Weakly supervised multimodal video anomaly detection has gained significant attention, yet the potential of the text modality remains under-explored. Text provides explicit semantic information that can enhance anomaly characterization and reduce false alarms. However, extracting effective text features is challenging due to the inability of general-purpose language models to capture anomaly-specific nuances and the scarcity of relevant descriptions. Furthermore, multimodal fusion often suffers from redundancy and imbalance. To address these issues, we propose a novel text-guided framework. First, we introduce an in-context learning-based multi-stage text augmentation mechanism to generate high-quality anomaly text samples for fine-tuning the text feature extractor. Second, we design a multi-scale bottleneck Transformer fusion module that uses compressed bottleneck tokens to progressively integrate information across modalities, mitigating redundancy and imbalance. Experiments on UCF-Crime and XD-Violence demonstrate state-of-the-art performance.
[40] C^2ROPE: Causal Continuous Rotary Positional Encoding for 3D Large Multimodal-Models Reasoning cs.CV | cs.AI
Guanting Ye, Qiyan Zhao, Wenhao Yu, Xiaofeng Zhang, Jianmin Ji
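TL;DR: This paper proposes C^2RoPE, an improved rotary position embedding that addresses two problems RoPE causes when existing 3D large multimodal models process visual features: loss of local continuity and long-term attention decay. It builds a spatio-temporal hybrid positional index and introduces Chebyshev causal masking to model local spatial continuity and spatial causal relationships explicitly.
Details
Motivation: The RoPE inherited by LLM-based 3D large multimodal models has two main problems. Its 1D temporal positional indices break the continuity of visual features along the column dimension, which loses spatial locality. Its temporal-proximity prior also makes attention decay over long ranges as the sequence grows, so the model progressively neglects earlier visual tokens.
Result: Evaluations on multiple benchmarks, including 3D scene reasoning and 3D visual question answering, demonstrate the effectiveness of C^2RoPE. The abstract does not state whether it reaches SOTA or matches specific models.
Insight: The novelty is a spatio-temporal continuous positional embedding mechanism that combines the temporal position with Cartesian spatial coordinates into a triplet hybrid index and applies a frequency allocation strategy, together with a causal mask based on Chebyshev distance that defines causal dependencies in 2D space. Together they offer a new approach to multimodal positional encoding.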
Abstract: Recent advances in 3D Large Multimodal Models (LMMs) built on Large Language Models (LLMs) have established the alignment of 3D visual features with LLM representations as the dominant paradigm. However, the inherited Rotary Position Embedding (RoPE) introduces limitations for multimodal processing. Specifically, applying 1D temporal positional indices disrupts the continuity of visual features along the column dimension, resulting in spatial locality loss. Moreover, RoPE follows the prior that temporally closer image tokens are more causally related, leading to long-term decay in attention allocation and causing the model to progressively neglect earlier visual tokens as the sequence length increases. To address these issues, we propose C^2RoPE, an improved RoPE that explicitly models local spatial Continuity and spatial Causal relationships for visual processing. C^2RoPE introduces a spatio-temporal continuous positional embedding mechanism for visual tokens. It first integrates 1D temporal positions with Cartesian-based spatial coordinates to construct a triplet hybrid positional index, and then employs a frequency allocation strategy to encode spatio-temporal positional information across the three index components. Additionally, we introduce Chebyshev Causal Masking, which determines causal dependencies by computing the Chebyshev distance of image tokens in 2D space. Evaluation results across various benchmarks, including 3D scene reasoning and 3D visual question answering, demonstrate C^2RoPE’s effectiveness. The code is available at https://github.com/ErikZ719/C2RoPE.
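The Chebyshev distance that the masking relies on is easy to show on a small token grid. The sketch below is not the paper's masking rule, which the abstract does not spell out: the radius-threshold mask in `chebyshev_mask` is a hypothetical placeholder, and all function names are illustrative. What it does show is why a 2D distance preserves locality that a flattened 1D index loses.

```python
def chebyshev_distance(a, b):
    """Chebyshev (L-infinity) distance between two (row, col) grid positions."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def grid_positions(height, width):
    """(row, col) of every image token, listed in raster (row-major) order."""
    return [(r, c) for r in range(height) for c in range(width)]


def chebyshev_mask(height, width, radius):
    """Hypothetical rule: token i may attend to token j if they are within `radius`."""
    positions = grid_positions(height, width)
    return [[chebyshev_distance(p, q) <= radius for q in positions] for p in positions]


positions = grid_positions(3, 3)
above, below = positions.index((0, 2)), positions.index((1, 2))
print(below - above)                       # 3: vertical neighbours are far apart in 1D order
print(chebyshev_distance((0, 2), (1, 2)))  # 1: but adjacent in 2D

mask = chebyshev_mask(3, 3, radius=1)
print(all(mask[4]))  # True: the centre token is within radius 1 of every token
print(mask[0][8])    # False: opposite corners are 2 apart
```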
[41] MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning cs.CV | cs.AI | cs.CY
Chenhao Zhang, Yazhe Niu, Hongsheng Li
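TL;DR: This paper proposes MetaphorStar, an end-to-end visual reinforcement learning framework for image metaphor understanding and reasoning. The framework comprises the fine-grained dataset TFQ-Data, the visual reinforcement learning method TFQ-GRPO, and the structured benchmark TFQ-Bench. The model family improves average performance on image metaphor benchmarks by 82.6% and outperforms mainstream multimodal large language models, including Gemini-3.0-pro, on several tasks, reaching SOTA.
Details
Motivation: Current multimodal large language models do well on basic visual question answering but struggle to grasp the subtle cultural, emotional, and contextual metaphorical meaning in images, which requires sophisticated multi-hop reasoning, cultural background, and Theory of Mind capabilities. This paper targets that gap in image metaphor understanding.
Result: On TFQ-Bench, MetaphorStar-32B reaches SOTA on multiple-choice and open-style questions and significantly outperforms the top closed-source model Gemini-3.0-pro on true-false questions, with an average performance improvement of 82.6%.
Insight: The novelty is the first end-to-end visual reinforcement learning framework for image metaphor tasks, together with a matching dataset and benchmark. Viewed objectively, the core contribution is applying the reinforcement learning paradigm systematically to an image understanding task that demands high-level cognition, and showing experimentally that learning such tasks improves general understanding, especially complex visual reasoning. The systematic analysis of model parameter scale, training data scale, and different architectures and training strategies is also a useful reference.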
Abstract: Metaphorical comprehension in images remains a critical challenge for today's AI systems. While Multimodal Large Language Models (MLLMs) excel at basic Visual Question Answering (VQA), they consistently struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. This difficulty stems from the task’s demand for sophisticated multi-hop reasoning, cultural context, and Theory of Mind (ToM) capabilities, which current models lack. To fill this gap, we propose MetaphorStar, the first end-to-end visual reinforcement learning (RL) framework for image implication tasks. Our framework includes three core components: the fine-grained dataset TFQ-Data, the visual RL method TFQ-GRPO, and the well-structured benchmark TFQ-Bench. Our fully open-source MetaphorStar family, trained using TFQ-GRPO on TFQ-Data, significantly improves performance by an average of 82.6% on the image implication benchmarks. Compared with 20+ mainstream MLLMs, MetaphorStar-32B achieves state-of-the-art (SOTA) results on Multiple-Choice and Open-Style Questions and significantly outperforms the top closed-source model Gemini-3.0-pro on True-False Questions. Crucially, our experiments reveal that learning image implication tasks improves the general understanding ability, especially the complex visual reasoning ability. We further provide a systematic analysis of model parameter scaling, training data scaling, and the impact of different model architectures and training strategies, demonstrating the broad applicability of our method. We open-sourced all model weights, datasets, and method code at https://metaphorstar.github.io.
[42] Improving Medical Visual Reinforcement Fine-Tuning via Perception and Reasoning Augmentation cs.CV
Guangjing Yang, ZhangYuan Yu, Ziyuan Qin, Xinyuan Song, Huahui Yi
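TL;DR: This paper proposes VRFT-Aug, a visual reinforcement fine-tuning framework tailored to medical imaging. It combines prior knowledge injection, perception-driven policy refinement, medically informed reward shaping, and behavioral imitation to strengthen visual perception and structured reasoning, with the aim of stabilizing and improving the reinforcement fine-tuning process.
Details
Motivation: Reinforcement fine-tuning with rule-based reward schemes has proven effective for large language models, but its use in cross-modal, vision-centric domains remains underexplored, especially in medical imaging, which requires both robust visual perception and structured reasoning. This paper aims to fill that gap.
Result: In extensive experiments across multiple medical datasets, the method consistently outperforms standard supervised fine-tuning and reinforcement fine-tuning baselines, and it yields empirical insights and practical training heuristics that generalize to other medical image tasks.
Insight: The novelty is extending reinforcement fine-tuning to the medical vision domain and designing a set of training strategies that augment perception and reasoning. Viewed objectively, by integrating domain priors and structured rewards, the framework offers practical guidance and new ideas for building reliable, reasoning-capable models for high-stakes medical applications.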
Abstract: While recent advances in Reinforcement Fine-Tuning (RFT) have shown that rule-based reward schemes can enable effective post-training for large language models, their extension to cross-modal, vision-centric domains remains largely underexplored. This limitation is especially pronounced in the medical imaging domain, where effective performance requires both robust visual perception and structured reasoning. In this work, we address this gap by proposing VRFT-Aug, a visual reinforcement fine-tuning framework tailored for the medical domain. VRFT-Aug introduces a series of training strategies designed to augment both perception and reasoning, including prior knowledge injection, perception-driven policy refinement, medically informed reward shaping, and behavioral imitation. Together, these methods aim to stabilize and improve the RFT process. Through extensive experiments across multiple medical datasets, we show that our approaches consistently outperform both standard supervised fine-tuning and RFT baselines. Moreover, we provide empirically grounded insights and practical training heuristics that can be generalized to other medical image tasks. We hope this work contributes actionable guidance and fresh inspiration for the ongoing effort to develop reliable, reasoning-capable models for high-stakes medical applications.
[43] A Vision-Language Foundation Model for Zero-shot Clinical Collaboration and Automated Concept Discovery in Dermatology cs.CV | cs.AI
Siyuan Yan, Xieji Li, Dan Mo, Philipp Tschandl, Yiwen Jiang
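TL;DR: This paper introduces DermFM-Zero, a dermatology vision-language foundation model trained with masked latent modelling and contrastive learning on more than 4 million multimodal data points. It achieves state-of-the-art performance on 20 zero-shot diagnosis and multimodal retrieval benchmarks without task-specific fine-tuning. In multinational reader studies involving more than 1,100 clinicians, it significantly improves diagnostic accuracy and management appropriateness in primary care and specialist settings. Its latent representations are also interpretable and disentangle clinically relevant concepts without supervision.
Details
Motivation: Medical foundation models show promise on controlled benchmarks, but broad deployment is still limited by their reliance on task-specific fine-tuning. This paper aims to build a dermatology vision-language foundation model that needs no task-specific adaptation and provides effective zero-shot clinical decision support.
Result: DermFM-Zero achieves state-of-the-art performance on 20 zero-shot diagnosis and multimodal retrieval benchmarks. In the reader studies, AI assistance nearly doubles general practitioners' differential diagnostic accuracy across 98 skin conditions, the model significantly outperforms board-certified dermatologists in specialist settings, and in collaborative workflows AI assistance lets non-experts surpass unassisted experts.
Insight: The contributions are zero-shot generalization from a large-scale multimodal foundation model trained with masked latent modelling and contrastive learning, and the interpretability of its latent representations: sparse autoencoders disentangle clinically relevant concepts without supervision, outperform predefined-vocabulary approaches, and enable targeted suppression of artifact-induced biases, which improves robustness without retraining.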
Abstract: Medical foundation models have shown promise in controlled benchmarks, yet widespread deployment remains hindered by reliance on task-specific fine-tuning. Here, we introduce DermFM-Zero, a dermatology vision-language foundation model trained via masked latent modelling and contrastive learning on over 4 million multimodal data points. We evaluated DermFM-Zero across 20 benchmarks spanning zero-shot diagnosis and multimodal retrieval, achieving state-of-the-art performance without task-specific adaptation. We further evaluated its zero-shot capabilities in three multinational reader studies involving over 1,100 clinicians. In primary care settings, AI assistance enabled general practitioners to nearly double their differential diagnostic accuracy across 98 skin conditions. In specialist settings, the model significantly outperformed board-certified dermatologists in multimodal skin cancer assessment. In collaborative workflows, AI assistance enabled non-experts to surpass unassisted experts while improving management appropriateness. Finally, we show that DermFM-Zero’s latent representations are interpretable: sparse autoencoders unsupervisedly disentangle clinically meaningful concepts that outperform predefined-vocabulary approaches and enable targeted suppression of artifact-induced biases, enhancing robustness without retraining. These findings demonstrate that a foundation model can provide effective, safe, and transparent zero-shot clinical decision support.
[44] VideoSTF: Stress-Testing Output Repetition in Video Large Language Models cs.CV | cs.CR | cs.MM
Yuxin Cao, Wei Song, Shangzhi Xu, Jingling Xue, Jin Song Dong
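TL;DR: The paper proposes VideoSTF, a framework for systematically measuring and stress-testing output repetition in video large language models. It quantifies repetition with three n-gram-based metrics and provides a standardized testbed of 10,000 diverse videos with controllable temporal transformations. Extensive testing of 10 advanced VideoLLMs shows that output repetition is widespread and highly sensitive to temporal perturbations of the video input, which exposes it as a stability problem and an exploitable security vulnerability.
Details
Motivation: Video large language models perform well on video understanding tasks, but one generation failure mode remains underexplored: severe output repetition, where the model falls into a self-reinforcing loop of repeated phrases or sentences. Existing benchmarks focus on task accuracy and factual correctness and do not capture it.
Result: Testing 10 advanced VideoLLMs with VideoSTF shows that output repetition is widespread and highly sensitive to temporal perturbations of the video input. Simple temporal transformations efficiently induce repetitive degeneration in a black-box setting, which indicates that output repetition is an exploitable security vulnerability.
Insight: The novelty is the first systematic formalization and testing of output repetition in VideoLLMs, with a standardized evaluation framework and a temporal stress-testing method. Viewed objectively, the study brings model stability, not just accuracy, into evaluation, shows how strongly temporal perturbations of the video affect generation quality, and offers a new perspective on robustness evaluation for video-language systems.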
Abstract: Video Large Language Models (VideoLLMs) have recently achieved strong performance in video understanding tasks. However, we identify a previously underexplored generation failure: severe output repetition, where models degenerate into self-reinforcing loops of repeated phrases or sentences. This failure mode is not captured by existing VideoLLM benchmarks, which focus primarily on task accuracy and factual correctness. We introduce VideoSTF, the first framework for systematically measuring and stress-testing output repetition in VideoLLMs. VideoSTF formalizes repetition using three complementary n-gram-based metrics and provides a standardized testbed of 10,000 diverse videos together with a library of controlled temporal transformations. Using VideoSTF, we conduct pervasive testing, temporal stress testing, and adversarial exploitation across 10 advanced VideoLLMs. We find that output repetition is widespread and, critically, highly sensitive to temporal perturbations of video inputs. Moreover, we show that simple temporal transformations can efficiently induce repetitive degeneration in a black-box setting, exposing output repetition as an exploitable security vulnerability. Our results reveal output repetition as a fundamental stability issue in modern VideoLLMs and motivate stability-aware evaluation for video-language systems. Our evaluation code and scripts are available at: https://github.com/yuxincao22/VideoSTF_benchmark.
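To make the notion of an n-gram repetition score concrete, here is a minimal sketch of one widely used measure: the fraction of n-grams in an output that duplicate an earlier n-gram. The abstract does not define VideoSTF's three metrics, so this should not be read as one of them; the function name and the whitespace tokenization are assumptions made for illustration.

```python
def ngram_repetition_rate(text, n=3):
    """Fraction of n-grams that duplicate another n-gram in the same text (0.0 to 1.0)."""
    tokens = text.split()  # assumed tokenization: split on whitespace
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        # Text shorter than n tokens: nothing can repeat.
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


healthy = "the cat sat on the mat"
looping = "a b a b a b a b"
print(ngram_repetition_rate(healthy, n=2))  # 0.0: every bigram is unique
print(ngram_repetition_rate(looping, n=2))  # about 0.714: only 2 of 7 bigrams are distinct
```

A score near 1.0 indicates the kind of self-reinforcing loop the abstract describes, while fluent text usually stays close to 0.0 for larger n.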
[45] Multimodal Priors-Augmented Text-Driven 3D Human-Object Interaction Generation cs.CV
Yin Wang, Ziyao Zhang, Zhiying Leng, Haitian Liu, Frederick W. B. Li
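TL;DR: This paper proposes MP-HOI, a new framework for text-driven 3D human-object interaction (HOI) motion generation. It uses multimodal data (text, image, pose/object) from large multimodal models as priors to guide generation, improves the object representation, and adopts a modality-aware mixture-of-experts model and a cascaded diffusion framework to address the weaknesses of existing methods in human motion, object motion, and human-object interaction.
Details
Motivation: Existing text-driven 3D HOI generation methods rely mainly on a direct text-to-HOI mapping. Because of the large cross-modality gap, they suffer from three key limitations: sub-optimal human motion, unnatural object motion, and weak human-object interaction. This paper sets out to address them.
Result: Comprehensive experiments show that MP-HOI outperforms existing methods in generating high-fidelity, fine-grained HOI motions.
Insight: The contributions are: 1) using multimodal data from large multimodal models as priors; 2) enhancing the object representation with geometric keypoints, contact features, and dynamic properties; 3) a modality-aware mixture-of-experts model for feature fusion; 4) a cascaded diffusion framework with interaction supervision for progressive refinement. These address the existing challenges at the levels of data modeling, data representation, feature fusion, and interaction refinement, respectively.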
Abstract: We address the challenging task of text-driven 3D human-object interaction (HOI) motion generation. Existing methods primarily rely on a direct text-to-HOI mapping, which suffers from three key limitations due to the significant cross-modality gap: (Q1) sub-optimal human motion, (Q2) unnatural object motion, and (Q3) weak interaction between humans and objects. To address these challenges, we propose MP-HOI, a novel framework grounded in four core insights: (1) Multimodal Data Priors: We leverage multimodal data (text, image, pose/object) from large multimodal models as priors to guide HOI generation, which tackles Q1 and Q2 in data modeling. (2) Enhanced Object Representation: We improve existing object representations by incorporating geometric keypoints, contact features, and dynamic properties, enabling expressive object representations, which tackles Q2 in data representation. (3) Multimodal-Aware Mixture-of-Experts (MoE) Model: We propose a modality-aware MoE model for effective multimodal feature fusion, which tackles Q1 and Q2 in feature fusion. (4) Cascaded Diffusion with Interaction Supervision: We design a cascaded diffusion framework that progressively refines human-object interaction features under dedicated supervision, which tackles Q3 in interaction refinement. Comprehensive experiments demonstrate that MP-HOI outperforms existing approaches in generating high-fidelity and fine-grained HOI motions.
[46] TwiFF (Think With Future Frames): A Large-Scale Dataset for Dynamic Visual Reasoning cs.CV | cs.AI
Junhua Liu, Zhangcheng Wang, Zhike Han, Ningli Wang, Guotao Liang
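TL;DR: This paper presents TwiFF-2.7M, the first large-scale, temporally grounded visual chain-of-thought dataset, built from 2.7 million video clips, and TwiFF-Bench, a high-quality evaluation benchmark of 1,078 samples that assesses the plausibility of reasoning trajectories and the correctness of answers in open-ended dynamic settings. It also proposes the TwiFF model, which combines pre-trained video generation and image comprehension capabilities to generate future action frames and textual reasoning iteratively, producing temporally coherent visual reasoning cues. Experiments show that TwiFF significantly outperforms existing visual chain-of-thought methods and textual chain-of-thought baselines on dynamic reasoning tasks.
Details
Motivation: Existing visual chain-of-thought methods are largely confined to static scenes and struggle to capture the temporal dynamics that tasks such as instruction, prediction, and camera motion require, so a dataset and model built specifically for dynamic visual question answering are needed.
Result: Extensive experiments show that TwiFF significantly outperforms existing visual chain-of-thought methods and textual chain-of-thought baselines on dynamic reasoning tasks, validating its effectiveness for visual question answering in dynamic scenarios.
Insight: The main novelty is the first large-scale, temporally dynamic visual chain-of-thought dataset and benchmark, together with a unified model that combines video generation and image comprehension to generate future frames and textual reasoning iteratively, addressing temporal coherence in dynamic visual reasoning.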
Abstract: Visual Chain-of-Thought (VCoT) has emerged as a promising paradigm for enhancing multimodal reasoning by integrating visual perception into intermediate reasoning steps. However, existing VCoT approaches are largely confined to static scenarios and struggle to capture the temporal dynamics essential for tasks such as instruction, prediction, and camera motion. To bridge this gap, we propose TwiFF-2.7M, the first large-scale, temporally grounded VCoT dataset derived from 2.7 million video clips, explicitly designed for dynamic visual question answering. Accompanying this, we introduce TwiFF-Bench, a high-quality evaluation benchmark of 1,078 samples that assesses both the plausibility of reasoning trajectories and the correctness of final answers in open-ended dynamic settings. Building on these foundations, we propose the TwiFF model, a unified model that synergistically leverages pre-trained video generation and image comprehension capabilities to produce temporally coherent visual reasoning cues by iteratively generating future action frames and textual reasoning. Extensive experiments demonstrate that TwiFF significantly outperforms existing VCoT methods and Textual Chain-of-Thought baselines on dynamic reasoning tasks, which validates its effectiveness for visual question answering in dynamic scenarios. Our code and data are available at https://github.com/LiuJunhua02/TwiFF.
[47] OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL cs.CV | cs.AI
Jinjie Shen, Jing Wu, Yaxiong Wang, Lechao Cheng, Shengeng Tang
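TL;DR: This paper proposes OmniVL-Guard, a unified framework based on balanced reinforcement learning for multimodal forgery detection and grounding over interleaved text, images, and videos. Through self-evolving chain-of-thought generation and adaptive reward scaling policy optimization, it addresses the "difficulty bias" problem in multi-task optimization, where the simpler veracity classification task dominates the gradients and leaves fine-grained grounding performance poor.
Details
Motivation: Existing forgery detection methods are usually limited to uni-modal or bi-modal settings and cannot handle the interleaved text, image, and video content common in real-world misinformation. This paper aims to build a unified framework for multimodal forgery detection and grounding and to resolve the "difficulty bias" problem in multi-task optimization.
Result: Extensive experiments show that OmniVL-Guard significantly outperforms existing state-of-the-art methods and generalizes robustly in a zero-shot manner to out-of-domain scenarios.
Insight: The novelty is a unified balanced reinforcement learning framework whose core designs are self-evolving chain-of-thought generation, which overcomes the cold-start problem, and adaptive reward scaling policy optimization, which balances the joint optimization of detection and grounding. Viewed objectively, the work combines reinforcement learning with multimodal forgery detection and adjusts task weights dynamically, offering a new approach to imbalance in multi-task learning.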
Abstract: Existing forgery detection methods are often limited to uni-modal or bi-modal settings, failing to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. In this unified setting, the interplay between diverse modalities and the dual requirements of simultaneous detection and localization pose a critical difficulty bias problem: the simpler veracity classification task tends to dominate the gradients, leading to suboptimal performance in fine-grained grounding during multi-task optimization. To address this challenge, we propose OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding. In particular, OmniVL-Guard comprises two core designs: Self-Evolving CoT Generation and Adaptive Reward Scaling Policy Optimization (ARSPO). Self-Evolving CoT Generation synthesizes high-quality reasoning paths, effectively overcoming the cold-start challenge. Building upon this, Adaptive Reward Scaling Policy Optimization (ARSPO) dynamically modulates reward scales and task weights, ensuring a balanced joint optimization. Extensive experiments demonstrate that OmniVL-Guard significantly outperforms state-of-the-art methods and exhibits zero-shot robust generalization across out-of-domain scenarios.
[48] AugVLA-3D: Depth-Driven Feature Augmentation for Vision-Language-Action Models cs.CV | cs.AI
Zhifeng Rao, Wenlong Chen, Lei Xie, Xia Hua, Dongfu Yin
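TL;DR: This paper proposes the AugVLA-3D framework, which integrates depth estimation into vision-language-action (VLA) models to enrich their 3D feature representations. It uses the VGGT depth estimation baseline to extract geometry-aware 3D cues from RGB inputs and introduces an action assistant module that constrains the learned 3D representations with action priors, keeping them consistent with downstream control tasks. Fusing the enhanced 3D features with conventional 2D visual tokens significantly improves the generalization and robustness of VLA models.
Details
Motivation: Existing VLA models rely mainly on vision-language models trained on 2D images, which limits their spatial understanding and action grounding in complex 3D environments. This paper integrates depth information to close that gap and improve the model's perception of 3D environments.
Result: Experiments show that the method strengthens perception in geometrically ambiguous scenes and yields better action prediction accuracy, performing well on the relevant benchmarks.
Insight: The novelty is using depth estimation as a means of feature augmentation and introducing the action assistant module for expert supervision, which lets the model exploit large-scale 2D datasets while implicitly recovering 3D structure and bridges the gap between 2D observations and 3D-aware decision-making.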
Abstract: Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic perception and control, yet most existing approaches primarily rely on vision-language models (VLMs) trained on 2D images, which limits their spatial understanding and action grounding in complex 3D environments. To address this limitation, we propose a novel framework that integrates depth estimation into VLA models to enrich 3D feature representations. Specifically, we employ a depth estimation baseline called VGGT to extract geometry-aware 3D cues from standard RGB inputs, enabling efficient utilization of existing large-scale 2D datasets while implicitly recovering 3D structural information. To further enhance the reliability of these depth-derived features, we introduce a new module called action assistant, which constrains the learned 3D representations with action priors and ensures their consistency with downstream control tasks. By fusing the enhanced 3D features with conventional 2D visual tokens, our approach significantly improves the generalization ability and robustness of VLA models. Experimental results demonstrate that the proposed method not only strengthens perception in geometrically ambiguous scenarios but also leads to superior action prediction accuracy. This work highlights the potential of depth-driven data augmentation and auxiliary expert supervision for bridging the gap between 2D observations and 3D-aware decision-making in robotic systems.
[49] FGAA-FPN: Foreground-Guided Angle-Aware Feature Pyramid Network for Oriented Object Detection cs.CV
Jialin Ma
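TL;DR: This paper proposes FGAA-FPN (Foreground-Guided Angle-Aware Feature Pyramid Network), a new method for oriented object detection in remote sensing imagery. A foreground-guided feature modulation module enhances object regions and suppresses background interference in low-level features, while an angle-aware multi-head attention module encodes orientation relationships to guide global interactions among high-level semantic features, strengthening the multi-scale representation.
Details
Motivation: Existing oriented object detection methods usually lack explicit foreground modeling and do not fully exploit geometric orientation priors, which limits feature discriminability. This paper combines foreground guidance with an angle-aware mechanism to overcome these limits.
Result: Extensive experiments on the DOTA v1.0 and DOTA v1.5 benchmarks show that FGAA-FPN achieves state-of-the-art performance, reaching 75.5% and 68.3% mAP, respectively.
Insight: The novelty is combining foreground-guided weakly supervised learning with an angle-aware attention mechanism, designed around a functional decomposition of the feature pyramid levels. This offers a new approach to multi-scale feature fusion and context modeling, particularly for oriented object detection with cluttered backgrounds, large scale variation, and large orientation changes.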
Abstract: With the increasing availability of high-resolution remote sensing and aerial imagery, oriented object detection has become a key capability for geographic information updating, maritime surveillance, and disaster response. However, it remains challenging due to cluttered backgrounds, severe scale variation, and large orientation changes. Existing approaches largely improve performance through multi-scale feature fusion with feature pyramid networks or contextual modeling with attention, but they often lack explicit foreground modeling and do not leverage geometric orientation priors, which limits feature discriminability. To overcome these limitations, we propose FGAA-FPN, a Foreground-Guided Angle-Aware Feature Pyramid Network for oriented object detection. FGAA-FPN is built on a hierarchical functional decomposition that accounts for the distinct spatial resolution and semantic abstraction across pyramid levels, thereby strengthening multi-scale representations. Concretely, a Foreground-Guided Feature Modulation module learns foreground saliency under weak supervision to enhance object regions and suppress background interference in low-level features. In parallel, an Angle-Aware Multi-Head Attention module encodes relative orientation relationships to guide global interactions among high-level semantic features. Extensive experiments on DOTA v1.0 and DOTA v1.5 demonstrate that FGAA-FPN achieves state-of-the-art results, reaching 75.5% and 68.3% mAP, respectively.
[50] OccFace: Unified Occlusion-Aware Facial Landmark Detection with Per-Point Visibility cs.CVPDF
Xinhao Xiang, Zhengxin Li, Saurav Dhakad, Theo Bancroft, Jiawei Zhang
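TL;DR: OccFace is a unified framework for facial landmark detection under occlusion, applicable to the faces of humans, stylized characters, and other non-human designs. It uses a heatmap-based backbone and a unified dense 100-point layout, and jointly predicts landmark coordinates and per-point visibility by combining local evidence with cross-landmark context.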
Details
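Motivation: Facial landmark detection under occlusion remains challenging, especially for faces with large appearance variation and rotation-driven self-occlusion. Existing methods typically handle occlusion implicitly and do not predict per-point visibility, which downstream applications could benefit from.
Result: Experiments show improved robustness under external occlusion and large head rotations, especially on occluded regions, while accuracy on visible landmarks is preserved. The evaluation suite reports NME on visible versus occluded landmarks and benchmarks visibility prediction with Occ AP, F1@0.5, and ROC-AUC.
Insight: The contributions include an occlusion module that jointly predicts landmark coordinates and per-point visibility, and a supervision scheme that mixes manual labels with pseudo visibility derived from mask-heatmap overlap. Viewed objectively, the unified dense point layout and the occlusion-aware evaluation metrics offer new ideas for research on occlusion robustness.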
Abstract: Accurate facial landmark detection under occlusion remains challenging, especially for human-like faces with large appearance variation and rotation-driven self-occlusion. Existing detectors typically localize landmarks while handling occlusion implicitly, without predicting per-point visibility that downstream applications can benefit from. We present OccFace, an occlusion-aware framework for universal human-like faces, including humans, stylized characters, and other non-human designs. OccFace adopts a unified dense 100-point layout and a heatmap-based backbone, and adds an occlusion module that jointly predicts landmark coordinates and per-point visibility by combining local evidence with cross-landmark context. Visibility supervision mixes manual labels with landmark-aware masking that derives pseudo visibility from mask-heatmap overlap. We also create an occlusion-aware evaluation suite reporting NME on visible vs. occluded landmarks and benchmarking visibility with Occ AP, F1@0.5, and ROC-AUC, together with a dataset annotated with 100-point landmarks and per-point visibility. Experiments show improved robustness under external occlusion and large head rotations, especially on occluded regions, while preserving accuracy on visible landmarks.
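The visible-versus-occluded NME reporting mentioned in the abstract could be computed roughly as in the sketch below. NME is taken here as the mean Euclidean landmark error divided by a normalizing length; the abstract does not say which normalizer OccFace uses, so the `norm` argument (for example an inter-ocular distance) and the function name `nme` are assumptions for illustration only.

```python
import math

def nme(pred, gt, mask, norm):
    """Normalized mean error over the landmarks selected by `mask`.

    pred, gt: lists of (x, y) points; mask: list of booleans;
    norm: assumed normalizing length (hypothetical choice, e.g. inter-ocular distance).
    """
    errors = [math.dist(p, g) for p, g, keep in zip(pred, gt, mask) if keep]
    if not errors:
        return float("nan")  # no landmark of this kind in the sample
    return sum(errors) / (len(errors) * norm)

# Toy example with two landmarks: the first visible, the second occluded.
pred = [(0.0, 0.0), (3.0, 4.0)]
gt = [(0.0, 0.0), (0.0, 0.0)]
visible = [True, False]
occluded = [not v for v in visible]

nme_visible = nme(pred, gt, visible, norm=10.0)
nme_occluded = nme(pred, gt, occluded, norm=10.0)
```

Reporting the two numbers separately keeps a model from hiding poor localization of occluded points behind a good average over the easier visible ones.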
[51] From Steering to Pedalling: Do Autonomous Driving VLMs Generalize to Cyclist-Assistive Spatial Perception and Planning? cs.CV | cs.ROPDF
Krishna Kanth Nakka, Vedasri Nakka
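TL;DR: This paper introduces CyclingVQA, a diagnostic benchmark that evaluates how well vision-language models handle spatial perception and traffic-rule reasoning from a cyclist's perspective. It finds that existing models perform only moderately on cyclist-centric tasks, with particular weaknesses in understanding cyclist-specific traffic signs and associating them with lanes, and that some autonomous-driving-specialized models perform worse than general-purpose VLMs.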
Details
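Motivation: Existing evaluations of autonomous-driving vision-language models are predominantly vehicle-centric and do not assess perception and reasoning from the standpoint of assisting cyclists with safe decision-making.
Result: The CyclingVQA benchmark is used to evaluate 31+ recent VLMs (general-purpose, spatially enhanced, and autonomous-driving-specialized). The models show some capability on cyclist-centric tasks but clear shortcomings, and several driving-specialized models underperform general-purpose VLMs.
Insight: The novelty lies in proposing the first diagnostic VLM benchmark from a cyclist's perspective, exposing the limited transfer from vehicle-centric training to cyclist-assistive scenarios, and using systematic error analysis to guide the development of more effective cyclist-assistive intelligent systems.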
Abstract: Cyclists often encounter safety-critical situations in urban traffic, highlighting the need for assistive systems that support safe and informed decision-making. Recently, vision-language models (VLMs) have demonstrated strong performance on autonomous driving benchmarks, suggesting their potential for general traffic understanding and navigation-related reasoning. However, existing evaluations are predominantly vehicle-centric and fail to assess perception and reasoning from a cyclist-centric viewpoint. To address this gap, we introduce CyclingVQA, a diagnostic benchmark designed to probe perception, spatio-temporal understanding, and traffic-rule-to-lane reasoning from a cyclist’s perspective. Evaluating 31+ recent VLMs spanning general-purpose, spatially enhanced, and autonomous-driving-specialized models, we find that current models demonstrate encouraging capabilities, while also revealing clear areas for improvement in cyclist-centric perception and reasoning, particularly in interpreting cyclist-specific traffic cues and associating signs with the correct navigational lanes. Notably, several driving-specialized models underperform strong generalist VLMs, indicating limited transfer from vehicle-centric training to cyclist-assistive scenarios. Finally, through systematic error analysis, we identify recurring failure modes to guide the development of more effective cyclist-assistive intelligent systems.
[52] RSHallu: Dual-Mode Hallucination Evaluation for Remote-Sensing Multimodal Large Language Models with Domain-Tailored Mitigation cs.CV | cs.AIPDF
Zihui Zhou, Yong Feng, Yanying Chen, Guofan Duan, Zhenxi Song
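TL;DR: This paper presents RSHallu, a systematic study of hallucination in remote-sensing multimodal large language models. It defines a taxonomy of remote-sensing hallucinations, builds the RSHalluEval evaluation benchmark and the RSHalluCheck dataset, and proposes mitigation through the RSHalluShield training dataset and training-free strategies.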
Details
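Motivation: Remote-sensing multimodal large language models perform well on tasks such as visual grounding and visual question answering, but hallucinations (responses inconsistent with the input image) hinder deployment in high-stakes scenarios such as emergency management and agricultural monitoring, and the problem remains under-explored in remote sensing.
Result: Across representative remote-sensing MLLMs, the proposed mitigation improves the hallucination-free rate by up to 21.63 percentage points under a unified protocol, while maintaining competitive performance on downstream remote-sensing tasks (RSVQA/RSVG).
Insight: The contributions include: 1) a remote-sensing-oriented hallucination taxonomy that introduces image-level hallucination to capture RS-specific inconsistencies (e.g., modality, resolution, and scene-level semantics); 2) an evaluation benchmark supporting dual-mode checking (high-precision cloud auditing and low-cost local checking); 3) a domain-tailored, training-friendly mitigation dataset and training-free plug-and-play strategies (decoding-time logit correction and RS-aware prompting).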
Abstract: Multimodal large language models (MLLMs) are increasingly adopted in remote sensing (RS) and have shown strong performance on tasks such as RS visual grounding (RSVG), RS visual question answering (RSVQA), and multimodal dialogue. However, hallucinations, which are responses inconsistent with the input RS images, severely hinder their deployment in high-stakes scenarios (e.g., emergency management and agricultural monitoring) and remain under-explored in RS. In this work, we present RSHallu, a systematic study with three deliverables: (1) we formalize RS hallucinations with an RS-oriented taxonomy and introduce image-level hallucination to capture RS-specific inconsistencies beyond object-centric errors (e.g., modality, resolution, and scene-level semantics); (2) we build a hallucination benchmark RSHalluEval (2,023 QA pairs) and enable dual-mode checking, supporting high-precision cloud auditing and low-cost reproducible local checking via a compact checker fine-tuned on RSHalluCheck dataset (15,396 QA pairs); and (3) we introduce a domain-tailored dataset RSHalluShield (30k QA pairs) for training-friendly mitigation and further propose training-free plug-and-play strategies, including decoding-time logit correction and RS-aware prompting. Across representative RS-MLLMs, our mitigation improves the hallucination-free rate by up to 21.63 percentage points under a unified protocol, while maintaining competitive performance on downstream RS tasks (RSVQA/RSVG). Code and datasets will be released.
[53] DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories cs.CV | cs.IRPDF
Chenlong Deng, Mengjie Deng, Junjie Wu, Dun Zeng, Teng Wang
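TL;DR: This paper proposes DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as autonomous exploration of raw visual histories, and builds the DISBench benchmark to evaluate models on context-aware image retrieval.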
Details
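Motivation: Existing retrieval systems excel at semantic matching but assume that query-image relevance can be measured in isolation, overlooking the rich dependencies across temporal sequences in realistic visual streams. Context-aware retrieval is needed to address this.
Result: Extensive experiments on the DISBench benchmark show that it poses significant challenges to existing state-of-the-art models, highlighting the need to incorporate agentic reasoning into next-generation retrieval systems.
Insight: The contributions are an agentic retrieval paradigm, a challenging benchmark built on interconnected visual data, and a human-model collaborative pipeline that mines latent spatiotemporal associations to generate context-dependent queries efficiently.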
Abstract: Existing multimodal retrieval systems excel at semantic matching but implicitly assume that query-image relevance can be measured in isolation. This paradigm overlooks the rich dependencies inherent in realistic visual streams, where information is distributed across temporal sequences rather than confined to single snapshots. To bridge this gap, we introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data. To address the scalability challenge of creating context-dependent queries, we propose a human-model collaborative pipeline that employs vision-language models to mine latent spatiotemporal associations, effectively offloading intensive context discovery before human verification. Furthermore, we build a robust baseline using a modular agent framework equipped with fine-grained tools and a dual-memory system for long-horizon navigation. Extensive experiments demonstrate that DISBench poses significant challenges to state-of-the-art models, highlighting the necessity of incorporating agentic reasoning into next-generation retrieval systems.
[54] Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training cs.CV | cs.LGPDF
Aojun Lu, Tao Feng, Hangjie Yuan, Wei Li, Yanan Sun
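TL;DR: This paper gives a data-centric explanation of why reinforcement learning (RL) yields better out-of-distribution (OOD) generalization than supervised fine-tuning (SFT) in vision-language model (VLM) post-training. The authors argue that RL's advantage stems from an implicit data filtering mechanism that prioritizes medium-difficulty training samples. To test this hypothesis, they systematically evaluate the OOD generalization of SFT across training data of varying difficulty and find that hard samples significantly degrade generalization. Based on this, they propose Difficulty-Curated SFT (DC-SFT), which explicitly filters the training set to improve generalization. Experiments show that DC-SFT substantially outperforms standard SFT on OOD generalization and even surpasses RL training, with better stability and computational efficiency.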
Details
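Motivation: Post-training of large-scale vision-language models exhibits a generalization gap: RL-fine-tuned models consistently outperform SFT-based models out of distribution. The paper seeks the root cause of this phenomenon from a data perspective rather than an algorithmic one.
Result: Experiments confirm that data difficulty is a key factor: training on hard samples significantly degrades the OOD performance of SFT. The proposed DC-SFT substantially outperforms standard SFT on OOD generalization and surpasses RL training, while offering greater stability and computational efficiency.
Insight: The core contribution is explaining RL's generalization advantage through data difficulty and proposing DC-SFT, a simple and effective explicit data-curation method. It attributes the generalization gap to the training data distribution (sample difficulty) and, through a controllable data-management strategy (difficulty-based curation), achieves better and more efficient generalization than more complex RL algorithms, offering a clear data-centric path to model robustness.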
Abstract: The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL’s generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization. Code is available at https://github.com/byyx666/DC-SFT.
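The difficulty-based filtering described in the abstract can be pictured with the minimal sketch below. The abstract does not state how difficulty is measured or which cutoff is used, so the pass-rate proxy, the `max_difficulty` threshold, and every name here (`Sample`, `difficulty`, `curate`) are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str
    answer: str
    # Assumed difficulty proxy: fraction of attempts by some reference
    # model that answered this sample correctly (hypothetical field).
    pass_rate: float

def difficulty(sample):
    """Map the assumed pass rate to a difficulty score in [0, 1]."""
    return 1.0 - sample.pass_rate

def curate(samples, max_difficulty=0.8):
    """Keep samples at or below the cutoff, dropping the hardest ones."""
    return [s for s in samples if difficulty(s) <= max_difficulty]

pool = [
    Sample("q1", "a1", pass_rate=0.90),  # easy
    Sample("q2", "a2", pass_rate=0.50),  # medium
    Sample("q3", "a3", pass_rate=0.05),  # hard, filtered out
]
kept = curate(pool)
```

The curated list would then feed an ordinary SFT run; only the data selection step differs from standard SFT in this sketch.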
[55] Flow caching for autoregressive video generation cs.CV | cs.AIPDF
Yuexiao Ma, Xuzhe Zheng, Jing Xu, Xiwei Xu, Feng Ling
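TL;DR: This paper proposes FlowCache, the first caching framework designed specifically for autoregressive video generation. Through a chunkwise caching strategy and a KV cache compression mechanism jointly optimized for importance and redundancy, it substantially accelerates Transformer-based autoregressive video generation models, enabling real-time ultra-long video generation.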
Details
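Motivation: Autoregressive video generation models (e.g., Transformer-based ones) can generate ultra-long videos, but their sequential generation is very slow. Existing caching methods assume that all frames share a uniform denoising pattern at the same timestep, which does not hold for autoregressive models, where different video chunks exhibit different similarity patterns.
Result: FlowCache achieves speedups of 2.38x on MAGI-1 and 6.7x on SkyReels-V2 with negligible quality degradation (VBench score: 0.87 increase and 0.79 decrease, respectively), advancing real-time, ultra-long video generation.
Insight: The core idea is that each video chunk should maintain an independent caching policy, allowing fine-grained control over which chunks need recomputation at each timestep. This is realized by a chunkwise caching strategy that adapts dynamically to each chunk's denoising characteristics, combined with an optimized KV cache compression that preserves generation quality under a fixed memory bound.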
Abstract: Autoregressive models, often built on Transformer architectures, represent a powerful paradigm for generating ultra-long videos by synthesizing content in sequential chunks. However, this sequential generation process is notoriously slow. While caching strategies have proven effective for accelerating traditional video diffusion models, existing methods assume uniform denoising across all frames, an assumption that breaks down in autoregressive models where different video chunks exhibit varying similarity patterns at identical timesteps. In this paper, we present FlowCache, the first caching framework specifically designed for autoregressive video generation. Our key insight is that each video chunk should maintain independent caching policies, allowing fine-grained control over which chunks require recomputation at each timestep. We introduce a chunkwise caching strategy that dynamically adapts to the unique denoising characteristics of each chunk, complemented by a joint importance-redundancy optimized KV cache compression mechanism that maintains fixed memory bounds while preserving generation quality. Our method achieves remarkable speedups of 2.38 times on MAGI-1 and 6.7 times on SkyReels-V2, with negligible quality degradation (VBench: 0.87 increase and 0.79 decrease respectively). These results demonstrate that FlowCache successfully unlocks the potential of autoregressive models for real-time, ultra-long video generation, establishing a new benchmark for efficient video synthesis at scale. The code is available at https://github.com/mikeallen39/FlowCache.
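The idea of giving each video chunk its own caching policy can be illustrated with the toy sketch below. It is not FlowCache's algorithm: the abstract does not specify how similarity is measured or how the reuse decision is made, so the relative L1 change test, the `threshold` parameter, and the `ChunkCache` class are hypothetical stand-ins.

```python
def rel_change(current, reference):
    """Relative L1 change between two equally sized feature vectors."""
    num = sum(abs(c - r) for c, r in zip(current, reference))
    den = sum(abs(r) for r in reference) or 1.0
    return num / den

class ChunkCache:
    """Per-chunk cache: reuse the last output while the input barely moves."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed reuse criterion
        self.last_input = None
        self.last_output = None

    def run(self, chunk_input, compute):
        """Return (output, reused). `compute` stands in for the full model."""
        if (self.last_input is not None
                and rel_change(chunk_input, self.last_input) < self.threshold):
            return self.last_output, True
        self.last_output = compute(chunk_input)
        self.last_input = list(chunk_input)
        return self.last_output, False

calls = []

def expensive_model(x):
    calls.append(1)  # count how often the full computation runs
    return [2.0 * v for v in x]

cache = ChunkCache(threshold=0.1)  # one such cache per video chunk
_, reused_a = cache.run([1.0, 1.0], expensive_model)
_, reused_b = cache.run([1.01, 1.0], expensive_model)  # tiny change: reuse
_, reused_c = cache.run([2.0, 1.0], expensive_model)   # large change: recompute
```

Keeping one `ChunkCache` instance per chunk is what makes the decisions independent: a chunk whose features have settled can be served from cache while a neighbouring chunk is still recomputed at the same timestep.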
[56] Stride-Net: Fairness-Aware Disentangled Representation Learning for Chest X-Ray Diagnosis cs.CVPDF
Darakshan Rashid, Raza Imam, Dwarikanath Mahapatra, Brejesh Lall
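TL;DR: This paper proposes Stride-Net, a fairness-aware disentangled representation learning framework for chest X-ray diagnosis. It aims to learn representations that are discriminative for disease yet invariant to demographic attributes (such as race and gender), addressing the unequal performance of deep learning models across demographic subgroups.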
Details
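Motivation: Deep neural networks for chest X-ray classification achieve good average performance but underperform on specific demographic subgroups (e.g., by race or gender), raising concerns about clinical safety and equity. Existing debiasing methods often yield inconsistent improvements across datasets or buy fairness at the expense of overall diagnostic utility, treating fairness as a post hoc constraint rather than an intrinsic property of the representation.
Result: On the MIMIC-CXR and CheXpert benchmarks, evaluated across race and intersectional race-gender subgroups and with architectures including ResNet and Vision Transformers, Stride-Net consistently improves fairness metrics while matching or exceeding baseline accuracy, achieving a better accuracy-fairness trade-off than prior debiasing methods.
Insight: The contributions include: 1) operating at the patch level, using a learnable stride-based mask to select label-aligned image regions while suppressing sensitive attribute information through an adversarial confusion loss; 2) enforcing semantic alignment between image features and BioBERT-based disease label embeddings via Group Optimal Transport, which anchors representations in clinical semantics and discourages shortcut learning. Viewed objectively, the method makes fairness a core property of representation learning rather than an external constraint, and uses disentanglement and semantic alignment to better balance diagnostic utility and fairness.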
Abstract: Deep neural networks for chest X-ray classification achieve strong average performance, yet often underperform for specific demographic subgroups, raising critical concerns about clinical safety and equity. Existing debiasing methods frequently yield inconsistent improvements across datasets or attain fairness by degrading overall diagnostic utility, treating fairness as a post hoc constraint rather than a property of the learned representation. In this work, we propose Stride-Net (Sensitive Attribute Resilient Learning via Disentanglement and Learnable Masking with Embedding Alignment), a fairness-aware framework that learns disease-discriminative yet demographically invariant representations for chest X-ray analysis. Stride-Net operates at the patch level, using a learnable stride-based mask to select label-aligned image regions while suppressing sensitive attribute information through adversarial confusion loss. To anchor representations in clinical semantics and discourage shortcut learning, we further enforce semantic alignment between image features and BioBERT-based disease label embeddings via Group Optimal Transport. We evaluate Stride-Net on the MIMIC-CXR and CheXpert benchmarks across race and intersectional race-gender subgroups. Across architectures including ResNet and Vision Transformers, Stride-Net consistently improves fairness metrics while matching or exceeding baseline accuracy, achieving a more favorable accuracy-fairness trade-off than prior debiasing approaches. Our code is available at https://github.com/Daraksh/Fairness_StrideNet.
[57] Chart Specification: Structural Representations for Incentivizing VLM Reasoning in Chart-to-Code Generation cs.CVPDF
Minggui He, Mingchen Dai, Jian Zhang, Yilun Liu, Shimin Tao
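TL;DR: This paper proposes Chart Specification, a structured intermediate representation that improves the structural fidelity of plotting code generated by vision-language models (VLMs) from chart images. By constructing a structurally balanced training set and introducing a Spec-Align Reward, it shifts training from text imitation to semantically grounded supervision, encouraging the model to learn consistent plotting logic.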
Details
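Motivation: Existing approaches rely mainly on supervised fine-tuning, which encourages surface-level token imitation rather than faithful modeling of the underlying chart structure and often produces hallucinated or semantically inconsistent outputs. This paper targets the resulting lack of structural fidelity in chart-to-code generation.
Result: Experiments on three public benchmarks show that the method consistently outperforms prior approaches. With only 3K training samples it shows strong data efficiency, surpassing leading baselines by up to 61.7% on complex benchmarks; scaling to 4K samples sets new state-of-the-art (SOTA) results across all evaluated metrics.
Insight: The main contribution is Chart Specification, a structured intermediate representation that filters syntactic noise and supports a Spec-Align Reward providing fine-grained, verifiable feedback on structural correctness, so that reinforcement learning can enforce consistent plotting logic. The central insight is that precise structural supervision offers an efficient path to high-fidelity chart-to-code generation.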
Abstract: Vision-Language Models (VLMs) have shown promise in generating plotting code from chart images, yet achieving structural fidelity remains challenging. Existing approaches largely rely on supervised fine-tuning, encouraging surface-level token imitation rather than faithful modeling of underlying chart structure, which often leads to hallucinated or semantically inconsistent outputs. We propose Chart Specification, a structured intermediate representation that shifts training from text imitation to semantically grounded supervision. Chart Specification filters syntactic noise to construct a structurally balanced training set and supports a Spec-Align Reward that provides fine-grained, verifiable feedback on structural correctness, enabling reinforcement learning to enforce consistent plotting logic. Experiments on three public benchmarks show that our method consistently outperforms prior approaches. With only 3K training samples, we achieve strong data efficiency, surpassing leading baselines by up to 61.7% on complex benchmarks, and scaling to 4K samples establishes new state-of-the-art results across all evaluated metrics. Overall, our results demonstrate that precise structural supervision offers an efficient pathway to high-fidelity chart-to-code generation. Code and dataset are available at: https://github.com/Mighten/chart-specification-paper
[58] ResWorld: Temporal Residual World Model for End-to-End Autonomous Driving cs.CVPDF
Jinqing Zhang, Zehua Fu, Zelin Xu, Wenying Dai, Qingjie Liu
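TL;DR: This paper proposes ResWorld, an end-to-end autonomous driving framework built around a Temporal Residual World Model (TR-World) and a Future-Guided Trajectory Refinement (FGTR) module. TR-World computes temporal residuals of scene representations to focus on modeling dynamic objects, without relying on detection and tracking. The FGTR module lets future BEV features interact with prior trajectories to refine the trajectories and provide supervision. The method achieves state-of-the-art planning performance on the nuScenes and NAVSIM datasets.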
Details
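Motivation: Existing world models redundantly model static regions and lack deep interaction with trajectories, which limits their effectiveness for end-to-end autonomous driving planning. This paper addresses these issues to improve planning accuracy.
Result: Comprehensive experiments on the nuScenes and NAVSIM datasets show that ResWorld achieves state-of-the-art planning performance.
Insight: The main contributions are: 1) a temporal residual world model that extracts dynamic-object information directly from temporal residuals, avoiding redundant static modeling and dependence on detection and tracking; 2) a future-guided trajectory refinement module that enables deep interaction between future scene information and the current trajectory, both refining the trajectory and providing spatial-temporal supervision to the world model to prevent collapse. The result is a more efficient approach to scene modeling and planning that concentrates on dynamic interactions.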
Abstract: The comprehensive understanding capabilities of world models for driving scenarios have significantly improved the planning accuracy of end-to-end autonomous driving frameworks. However, the redundant modeling of static regions and the lack of deep interaction with trajectories hinder world models from exerting their full effectiveness. In this paper, we propose Temporal Residual World Model (TR-World), which focuses on dynamic object modeling. By calculating the temporal residuals of scene representations, the information of dynamic objects can be extracted without relying on detection and tracking. TR-World takes only temporal residuals as input, thus predicting the future spatial distribution of dynamic objects more precisely. By combining the prediction with the static object information contained in the current BEV features, accurate future BEV features can be obtained. Furthermore, we propose Future-Guided Trajectory Refinement (FGTR) module, which conducts interaction between prior trajectories (predicted from the current scene representation) and the future BEV features. This module can not only utilize future road conditions to refine trajectories, but also provides sparse spatial-temporal supervision on future BEV features to prevent world model collapse. Comprehensive experiments conducted on the nuScenes and NAVSIM datasets demonstrate that our method, namely ResWorld, achieves state-of-the-art planning performance. The code is available at https://github.com/mengtan00/ResWorld.git.
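The intuition behind temporal residuals, that static regions cancel out when consecutive scene representations are subtracted, can be shown on a toy grid. The sketch below assumes single-channel BEV grids that are already aligned to a common frame (ego-motion compensation is not modeled), and the function names and the `eps` tolerance are illustrative choices rather than anything specified in the abstract.

```python
def temporal_residual(prev_bev, curr_bev):
    """Element-wise difference between two BEV grids of equal shape."""
    return [[c - p for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev_bev, curr_bev)]

def dynamic_cells(residual, eps=1e-6):
    """Coordinates of grid cells whose feature changed between frames."""
    return [(r, c)
            for r, row in enumerate(residual)
            for c, value in enumerate(row)
            if abs(value) > eps]

# Toy 2x3 grids: an object moves from column 2 to column 1 in the top row,
# while every other cell (static background) keeps its value.
prev_bev = [[1.0, 0.0, 2.0],
            [1.0, 1.0, 0.0]]
curr_bev = [[1.0, 2.0, 0.0],
            [1.0, 1.0, 0.0]]

residual = temporal_residual(prev_bev, curr_bev)
moving = dynamic_cells(residual)
```

Only the two cells touched by the moving object survive in the residual, which is why a model fed residuals alone can concentrate on dynamics without a separate detection and tracking stage.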
[59] Towards Learning a Generalizable 3D Scene Representation from 2D Observations cs.CV | cs.ROPDF
Martin Gromniak, Jan-Gerrit Habekost, Sebastian Kamp, Sven Magg, Stefan Wermter
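TL;DR: This paper proposes a generalizable Neural Radiance Field approach for predicting 3D workspace occupancy from a robot's egocentric 2D observations. The occupancy representation is built in a global workspace frame rather than in camera coordinates, making it directly applicable to robotic manipulation. The model integrates flexible source views and generalizes to unseen object arrangements without scene-specific fine-tuning.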
Details
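Motivation: The goal is to learn a generalizable 3D scene representation from limited egocentric 2D observations that directly supports robotic manipulation, overcoming the limitation of prior methods that operate in camera coordinates and are hard to apply directly to the robot's workspace.
Result: Trained on 40 real scenes, the model achieves a reconstruction error of 26 mm, including occluded regions, validated against 3D sensor ground truth on a humanoid robot. This demonstrates its ability to infer complete 3D occupancy beyond traditional stereo vision methods.
Insight: The core contribution is moving the neural radiance field learning framework from camera coordinates to a global workspace frame, which makes the representation more general and directly usable for robotic tasks. The model generalizes from a small number of 2D observations and reconstructs occluded regions, showing the potential of learning robust 3D representations from 2D observations.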
Abstract: We introduce a Generalizable Neural Radiance Field approach for predicting 3D workspace occupancy from egocentric robot observations. Unlike prior methods operating in camera-centric coordinates, our model constructs occupancy representations in a global workspace frame, making it directly applicable to robotic manipulation. The model integrates flexible source views and generalizes to unseen object arrangements without scene-specific finetuning. We demonstrate the approach on a humanoid robot and evaluate predicted geometry against 3D sensor ground truth. Trained on 40 real scenes, our model achieves 26mm reconstruction error, including occluded regions, validating its ability to infer complete 3D occupancy beyond traditional stereo vision methods.
[60] DFIC: Towards a balanced facial image dataset for automatic ICAO compliance verification cs.CVPDF
Nuno Gonçalves, Diogo Nunes, Carla Guerra, João Marcos
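TL;DR: This paper presents DFIC, a new balanced facial image dataset for automatic ICAO compliance verification. It contains around 58,000 annotated images and 2,706 videos of more than 1,000 subjects, covering compliant portraits as well as a broad range of non-compliant conditions. Its demographic distribution is more balanced than that of existing public datasets, with one partition nearly uniformly distributed, which facilitates the development of automated ICAO compliance verification methods.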
Details
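Motivation: For machine-readable travel documents (MRTDs), ensuring that facial images comply with ISO/IEC and ICAO standards is essential for reliable identity verification, but current manual inspection is inefficient in high-demand environments, so automated compliance verification methods are needed.
Result: Using DFIC, the authors fine-tune a novel method that relies heavily on spatial attention mechanisms to automatically validate ICAO compliance requirements, compare it with state-of-the-art methods for ICAO compliance verification, and report improved results.
Insight: The contribution is the DFIC dataset itself, balanced and diverse, with a more uniform demographic distribution that helps improve the robustness and adaptability of automatic compliance verification models. The proposed spatial-attention-based method also improves performance on the verification task, and the dataset can further be used to enhance the security, privacy, and fairness of facial recognition systems.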
Abstract: Ensuring compliance with ISO/IEC and ICAO standards for facial images in machine-readable travel documents (MRTDs) is essential for reliable identity verification, but current manual inspection methods are inefficient in high-demand environments. This paper introduces the DFIC dataset, a novel comprehensive facial image dataset comprising around 58,000 annotated images and 2706 videos of more than 1000 subjects, that cover a broad range of non-compliant conditions, in addition to compliant portraits. Our dataset provides a more balanced demographic distribution than the existing public datasets, with one partition that is nearly uniformly distributed, facilitating the development of automated ICAO compliance verification methods. Using DFIC, we fine-tuned a novel method that heavily relies on spatial attention mechanisms for the automatic validation of ICAO compliance requirements, and we have compared it with the state-of-the-art aimed at ICAO compliance verification, demonstrating improved results. DFIC dataset is now made public (https://github.com/visteam-isr-uc/DFIC) for the training and validation of new models, offering an unprecedented diversity of faces, that will improve both robustness and adaptability to the intrinsically diverse combinations of faces and props that can be presented to the validation system. These results emphasize the potential of DFIC to enhance automated ICAO compliance methods but it can also be used in many other applications that aim to improve the security, privacy, and fairness of facial recognition systems.
[61] Interpretable Vision Transformers in Image Classification via SVDA cs.CVPDF
Vasileios Arampatzakis, George Pavlidis, Nikolaos Mitianoudis, Nikos Papamarkos
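TL;DR: This paper applies the SVD-Inspired Attention (SVDA) mechanism to the Vision Transformer (ViT) architecture to enhance interpretability, sparsity, and spectral structure, and shows on several image classification benchmarks that it produces more interpretable attention patterns without sacrificing accuracy.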
Details
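Motivation: ViTs achieve SOTA performance in image classification, but their attention mechanisms are often opaque and exhibit dense, unstructured behavior, so a method that improves interpretability and structure is needed.
Result: Experiments on four benchmarks (CIFAR-10, FashionMNIST, CIFAR-100, and ImageNet-100) show that SVDA consistently yields more interpretable attention patterns while maintaining classification accuracy.
Insight: SVDA provides a geometrically grounded formulation and uses interpretability indicators to monitor attention dynamics during training. It serves as a comprehensive and informative tool for analyzing and developing structured attention models, and supports future work on explainable AI, spectral diagnostics, and attention-based model compression.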
Abstract: Vision Transformers (ViTs) have achieved state-of-the-art performance in image classification, yet their attention mechanisms often remain opaque and exhibit dense, non-structured behaviors. In this work, we adapt our previously proposed SVD-Inspired Attention (SVDA) mechanism to the ViT architecture, introducing a geometrically grounded formulation that enhances interpretability, sparsity, and spectral structure. We apply the use of interpretability indicators – originally proposed with SVDA – to monitor attention dynamics during training and assess structural properties of the learned representations. Experimental evaluations on four widely used benchmarks – CIFAR-10, FashionMNIST, CIFAR-100, and ImageNet-100 – demonstrate that SVDA consistently yields more interpretable attention patterns without sacrificing classification accuracy. While the current framework offers descriptive insights rather than prescriptive guidance, our results establish SVDA as a comprehensive and informative tool for analyzing and developing structured attention models in computer vision. This work lays the foundation for future advances in explainable AI, spectral diagnostics, and attention-based model compression.
[62] Interpretable Vision Transformers in Monocular Depth Estimation via SVDA cs.CVPDF
Vasileios Arampatzakis, George Pavlidis, Nikolaos Mitianoudis, Nikos Papamarkos
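TL;DR: This paper introduces SVD-Inspired Attention (SVDA) into the Dense Prediction Transformer (DPT) for monocular depth estimation. By embedding a learnable diagonal matrix into normalized query-key interactions, SVDA decouples directional alignment from spectral modulation, producing attention maps that are intrinsically interpretable rather than post-hoc approximations. Experiments on KITTI and NYU-v2 show that the method preserves or slightly improves predictive accuracy with only minor computational overhead, and it unlocks six spectral indicators that quantify how attention is organized.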
Details
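Motivation: Self-attention in modern Transformer architectures remains opaque in dense prediction tasks such as monocular depth estimation. The paper aims to provide a structured, intrinsically interpretable formulation of attention.
Result: On the KITTI and NYU-v2 benchmarks, SVDA matches or slightly improves on baseline predictive accuracy with minor computational overhead. More importantly, it provides six quantifiable spectral indicators that reveal how attention organizes during training.
Insight: The novelty is the first spectrally structured attention formulation (SVDA) for dense prediction tasks, which turns attention from a black-box mechanism into a quantifiable descriptor, achieves intrinsic interpretability, and reveals quantifiable cross-dataset and depth-wise attention patterns, opening a path toward transparent dense prediction models.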
Abstract: Monocular depth estimation is a central problem in computer vision with applications in robotics, AR, and autonomous driving, yet the self-attention mechanisms that drive modern Transformer architectures remain opaque. We introduce SVD-Inspired Attention (SVDA) into the Dense Prediction Transformer (DPT), providing the first spectrally structured formulation of attention for dense prediction tasks. SVDA decouples directional alignment from spectral modulation by embedding a learnable diagonal matrix into normalized query-key interactions, enabling attention maps that are intrinsically interpretable rather than post-hoc approximations. Experiments on KITTI and NYU-v2 show that SVDA preserves or slightly improves predictive accuracy while adding only minor computational overhead. More importantly, SVDA unlocks six spectral indicators that quantify entropy, rank, sparsity, alignment, selectivity, and robustness. These reveal consistent cross-dataset and depth-wise patterns in how attention organizes during training, insights that remain inaccessible in standard Transformers. By shifting the role of attention from opaque mechanism to quantifiable descriptor, SVDA redefines interpretability in monocular depth estimation and opens a principled avenue toward transparent dense prediction models.
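The phrase "a learnable diagonal matrix embedded into normalized query-key interactions" suggests attention scores of the form q̂ᵀ diag(s) k̂. The sketch below is one reading of that sentence under stated assumptions: L2 normalization of queries and keys, a plain softmax over keys, and a fixed `spectrum` vector standing in for the learnable diagonal. The exact SVDA formulation is not given in the abstract, and the function names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def svda_like_attention(queries, keys, spectrum):
    """Attention weights from scores q_hat^T diag(spectrum) k_hat.

    `spectrum` plays the role of the diagonal matrix; in a real model it
    would be a trained parameter (assumption for this sketch).
    """
    q_hat = [l2_normalize(q) for q in queries]
    k_hat = [l2_normalize(k) for k in keys]
    scores = [[sum(qi * s * ki for qi, s, ki in zip(q, spectrum, k))
               for k in k_hat]
              for q in q_hat]
    return [softmax(row) for row in scores]

# One query aligned with the first key; the diagonal weights dimension 0 more.
attn = svda_like_attention(
    queries=[[1.0, 0.0]],
    keys=[[2.0, 0.0], [0.0, 3.0]],
    spectrum=[1.5, 0.5],
)
```

Because queries and keys are normalized, their product carries only directional alignment, and all magnitude information sits in the diagonal entries. That separation is what makes the diagonal a natural place to read off spectral statistics.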
[63] Chain-of-Look Spatial Reasoning for Dense Surgical Instrument Counting cs.CV | cs.AIPDF
Rishikesh Bhyri, Brian R Quaranto, Philip J Seger, Kaity Tung, Brendan Fox
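TL;DR: This paper proposes Chain-of-Look, a novel visual reasoning framework for accurately counting dense, tightly clustered surgical instruments in the operating room. It mimics the sequential human counting process by building a structured visual chain that guides the model to count along a coherent spatial trajectory, and introduces a neighboring loss function to enforce the physical plausibility of the chain. The authors also release the SurgCount-HD dataset of 1,464 high-density surgical instrument images. Experiments show that the method outperforms existing counting methods and multimodal large language models on dense instrument counting.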
Details
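Motivation: Accurate counting of surgical instruments is critical for patient safety, but in dense scenes where instruments are tightly clustered, existing methods (such as classic object detection or large vision-language models) struggle to count accurately.
Result: Extensive experiments on the proposed SurgCount-HD dataset show that the method outperforms state-of-the-art counting methods (e.g., CountGD, REC) and multimodal large language models (e.g., Qwen, ChatGPT) on dense surgical instrument counting.
Insight: The main contributions are the Chain-of-Look visual reasoning framework, which mimics sequential human counting and replaces unordered classic detection with a structured visual chain, and a neighboring loss function that explicitly models the spatial constraints of densely packed instruments. Viewed objectively, bringing a spatial reasoning chain and a physically motivated loss into dense counting is an idea worth borrowing, and the released dataset also advances the field.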
Abstract: Accurate counting of surgical instruments in Operating Rooms (OR) is a critical prerequisite for ensuring patient safety during surgery. Despite recent progress of large visual-language models and agentic AI, accurately counting such instruments remains highly challenging, particularly in dense scenarios where instruments are tightly clustered. To address this problem, we introduce Chain-of-Look, a novel visual reasoning framework that mimics the sequential human counting process by enforcing a structured visual chain, rather than relying on classic object detection which is unordered. This visual chain guides the model to count along a coherent spatial trajectory, improving accuracy in complex scenes. To further enforce the physical plausibility of the visual chain, we introduce the neighboring loss function, which explicitly models the spatial constraints inherent to densely packed surgical instruments. We also present SurgCount-HD, a new dataset comprising 1,464 high-density surgical instrument images. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches for counting (e.g., CountGD, REC) as well as Multimodality Large Language Models (e.g., Qwen, ChatGPT) in the challenging task of dense surgical instrument counting.
[64] PuriLight: A Lightweight Shuffle and Purification Framework for Monocular Depth Estimation cs.CVPDF
Yujie Chen, Li Zhang, Xiaomeng Chu, Tian Zhang
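TL;DR: This paper proposes PuriLight, a lightweight and efficient self-supervised monocular depth estimation framework that addresses the dual challenges of computational efficiency and detail preservation. It uses a three-stage architecture with three novel modules: the Shuffle-Dilation Convolution (SDC) module for local feature extraction, the Rotation-Adaptive Kernel Attention (RAKA) module for hierarchical feature enhancement, and the Deep Frequency Signal Purification (DFSP) module for global feature purification.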
Details
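Motivation: Self-supervised depth estimation has reduced reliance on ground-truth labels, but existing methods are either constrained by bulky architectures that hurt practicality or use lightweight models that sacrifice structural precision. An architecture that is both lightweight and structurally precise is needed.
Result: Extensive experiments show that PuriLight achieves state-of-the-art (SOTA) performance with minimal training parameters while maintaining high computational efficiency.
Insight: The novelty is the three cooperating modules (SDC, RAKA, DFSP), which target local feature extraction, hierarchical feature enhancement, and global feature purification respectively, enabling precise feature processing in a lightweight model. Viewed objectively, this modular design balances efficiency and accuracy and offers a reusable architectural idea for lightweight depth estimation.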
Abstract: We propose PuriLight, a lightweight and efficient framework for self-supervised monocular depth estimation, to address the dual challenges of computational efficiency and detail preservation. While recent advances in self-supervised depth estimation have reduced reliance on ground truth supervision, existing approaches remain constrained by either bulky architectures compromising practicality or lightweight models sacrificing structural precision. These dual limitations underscore the critical need to develop lightweight yet structurally precise architectures. Our framework addresses these limitations through a three-stage architecture incorporating three novel modules: the Shuffle-Dilation Convolution (SDC) module for local feature extraction, the Rotation-Adaptive Kernel Attention (RAKA) module for hierarchical feature enhancement, and the Deep Frequency Signal Purification (DFSP) module for global feature purification. Through effective collaboration, these modules enable PuriLight to achieve both lightweight and accurate feature extraction and processing. Extensive experiments demonstrate that PuriLight achieves state-of-the-art performance with minimal training parameters while maintaining exceptional computational efficiency. Codes will be available at https://github.com/ishrouder/PuriLight.
[65] Chatting with Images for Introspective Visual Thinking cs.CV | cs.AI | cs.CLPDF
Junfei Wu, Jian Guan, Qiang Liu, Shu Wu, Liang Wang
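TL;DR: This paper proposes "chatting with images", a new framework that reframes visual manipulation as language-guided feature modulation, addressing the loss of fine-grained visual information caused by single-pass visual encoding and text-only reasoning in current large vision-language models (LVLMs). The framework is instantiated as ViLaVT, a model with a dynamic vision encoder designed for interactive visual reasoning and trained with a two-stage curriculum (supervised fine-tuning and reinforcement learning).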
Details
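Motivation: Current large vision-language models typically perform text-only reasoning over a single-pass visual encoding, which loses fine-grained visual information. Existing "thinking with images" methods manipulate images through external tools or code, but the resulting visual states are often insufficiently grounded in linguistic semantics, which impairs cross-modal alignment, particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images.
Result: Extensive experiments on eight benchmarks show that ViLaVT achieves strong and consistent improvements, with particularly pronounced gains on complex multi-image and video-based spatial reasoning tasks.
Insight: The core contribution is redefining visual manipulation as language-guided feature modulation, so that the model can dynamically and jointly re-encode multiple image regions under expressive language prompts, coupling linguistic reasoning more tightly with visual state updates. Viewed objectively, the dynamic vision encoder architecture and the training curriculum combining supervised fine-tuning and reinforcement learning are effective methodological contributions for interactive, multi-step visual reasoning.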
Abstract: Current large vision-language models (LVLMs) typically rely on text-only reasoning based on a single-pass visual encoding, which often leads to loss of fine-grained visual information. Recently, the "thinking with images" approach has attempted to alleviate this limitation by manipulating images via external tools or code; however, the resulting visual states are often insufficiently grounded in linguistic semantics, impairing effective cross-modal alignment, particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images. To address these challenges, we propose "chatting with images", a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions, enabling tighter coupling between linguistic reasoning and visual state updates. We instantiate this paradigm in ViLaVT, a novel LVLM equipped with a dynamic vision encoder explicitly designed for such interactive visual reasoning, and train it with a two-stage curriculum combining supervised fine-tuning and reinforcement learning to promote effective reasoning behaviors. Extensive experiments across eight benchmarks demonstrate that ViLaVT achieves strong and consistent improvements, with particularly pronounced gains on complex multi-image and video-based spatial reasoning tasks.
[66] FastFlow: Accelerating The Generative Flow Matching Models with Bandit Inference cs.CVPDF
Divya Jyoti Bajpai, Dhruv Bhardwaj, Soumya Roy, Tejas Duseja, Harsh Agarwal
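TL;DR: This paper proposes FastFlow, a plug-and-play adaptive inference framework that accelerates generation in flow-matching models. It identifies denoising steps that make only minor adjustments to the denoising path and approximates them with finite-difference velocity estimates from earlier predictions, skipping the intermediate computation. The decision of how many steps to skip is modeled as a multi-armed bandit problem, which learns the best skips while preserving performance. FastFlow requires no retraining, integrates seamlessly into existing pipelines, and delivers speedups across image generation, video generation, and editing tasks.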
Details
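Motivation: Flow-matching models achieve state-of-the-art fidelity in image and video generation, but their inherently sequential denoising process makes generation slow. Existing acceleration methods (such as distillation, trajectory truncation, and consistency approaches) are static, require retraining, and often fail to generalize across tasks.
Result: Experiments show that FastFlow achieves a speedup of over 2.6x while maintaining high-quality outputs. The method is validated on image generation, video generation, and editing tasks, demonstrating its generality.
Insight: The main innovations are modeling the acceleration decision as a multi-armed bandit problem that dynamically learns an optimal skipping policy, and using finite-difference velocity estimates as an approximation at zero compute cost, which together give plug-and-play adaptive inference without retraining. This offers a new dynamic acceleration paradigm for balancing speed and performance.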
Abstract: Flow-matching models deliver state-of-the-art fidelity in image and video generation, but the inherent sequential denoising process renders them slower. Existing acceleration methods like distillation, trajectory truncation, and consistency approaches are static, require retraining, and often fail to generalize across tasks. We propose FastFlow, a plug-and-play adaptive inference framework that accelerates generation in flow matching models. FastFlow identifies denoising steps that produce only minor adjustments to the denoising path and approximates them without using the full neural network models used for velocity predictions. The approximation utilizes finite-difference velocity estimates from prior predictions to efficiently extrapolate future states, enabling faster advancements along the denoising path at zero compute cost. This enables skipping computation at intermediary steps. We model the decision of how many steps to safely skip before requiring a full model computation as a multi-armed bandit problem. The bandit learns the optimal skips to balance speed with performance. FastFlow integrates seamlessly with existing pipelines and generalizes across image generation, video generation, and editing tasks. Experiments demonstrate a speedup of over 2.6x while maintaining high-quality outputs. The source code for this work can be found at https://github.com/Div290/FastFlow.
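The two ingredients named in the abstract, finite-difference velocity extrapolation and a bandit over skip counts, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the linear extrapolation rule, the epsilon-greedy strategy, the arm set, and all names (`extrapolate_velocity`, `SkipBandit`) are assumptions made for the example; the abstract does not specify the bandit algorithm or the reward.

```python
import random

def extrapolate_velocity(v_prev, v_curr, t_prev, t_curr, t_next):
    """Linearly extrapolate the velocity to t_next from two earlier
    model predictions (a finite-difference estimate; assumed form)."""
    slope = [(c - p) / (t_curr - t_prev) for p, c in zip(v_prev, v_curr)]
    return [c + s * (t_next - t_curr) for c, s in zip(v_curr, slope)]

def euler_step(x, v, dt):
    """Advance the state along the denoising path with one Euler step."""
    return [xi + vi * dt for xi, vi in zip(x, v)]

class SkipBandit:
    """Epsilon-greedy bandit over 'how many steps to skip' (hypothetical arms)."""

    def __init__(self, arms=(0, 1, 2, 3), epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.counts = {a: 0 for a in self.arms}
        self.values = {a: 0.0 for a in self.arms}
        self.rng = random.Random(seed)

    def select(self):
        # Explore with probability epsilon, otherwise pick the best arm so far.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda a: self.values[a])

    def update(self, arm, reward):
        # Incremental running mean of the rewards observed for this arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

In a real sampler the reward would have to trade off steps saved against approximation error; how FastFlow defines it is not stated in the abstract.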
[67] HairWeaver: Few-Shot Photorealistic Hair Motion Synthesis with Sim-to-Real Guided Video Diffusion cs.CVPDF
Di Chang, Ji Hou, Aljaz Bozic, Assaf Neuberger, Felix Juefei-Xu
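TL;DR: This paper presents HairWeaver, a diffusion-based framework that animates a single portrait image with realistic and expressive hair dynamics. Two lightweight modules (Motion-Context-LoRA and Sim2Real-Domain-LoRA) guide a video diffusion backbone, addressing the lack of hair-specific control in existing methods and enabling fine control over hair motion.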
Details
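Motivation: Existing human animation methods successfully control body pose but lack dedicated control over hair, so the generated hair motion is stiff and unrealistic. This work aims to overcome that limitation and synthesize realistic, detailed hair motion.
Result: Comprehensive evaluations show that the method sets a new state of the art (SOTA) in generating lifelike human hair animations with dynamic details.
Insight: The novelty lies in two specialized LoRA modules: one integrates motion conditions, and the other preserves the subject's photoreal appearance across data domains. The core idea is to train on a specialized dataset of dynamic human motion generated by a CG simulator, giving sim-to-real guidance so the model learns to produce realistic hair that responds naturally to movement.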
Abstract: We present HairWeaver, a diffusion-based pipeline that animates a single human image with realistic and expressive hair dynamics. While existing methods successfully control body pose, they lack specific control over hair, and as a result, fail to capture the intricate hair motions, resulting in stiff and unrealistic animations. HairWeaver overcomes this limitation using two specialized modules: a Motion-Context-LoRA to integrate motion conditions and a Sim2Real-Domain-LoRA to preserve the subject’s photoreal appearance across different data domains. These lightweight components are designed to guide a video diffusion backbone while maintaining its core generative capabilities. By training on a specialized dataset of dynamic human motion generated from a CG simulator, HairWeaver affords fine control over hair motion and ultimately learns to produce highly realistic hair that responds naturally to movement. Comprehensive evaluations demonstrate that our approach sets a new state of the art, producing lifelike human hair animations with dynamic details.
[68] PhyCritic: Multimodal Critic Models for Physical AI cs.CVPDF
Tianyi Xiong, Shihao Wang, Guilin Liu, Yi Dong, Ming Li
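TL;DR: This paper introduces PhyCritic, a multimodal critic model optimized for physical AI tasks. It is trained with a two-stage RLVR pipeline: a physical skill warmup stage that strengthens physically oriented perception and reasoning, followed by self-referential critic finetuning, in which the model generates its own prediction as an internal reference before judging candidate responses, improving judgment stability and physical correctness.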
Details
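Motivation: Existing critic models are trained mainly in general visual domains (such as image captioning or visual question answering), while physical AI tasks involving perception, causal reasoning, and planning remain largely underexplored. Reliable critic models dedicated to physical AI are therefore needed.
Result: On both physical and general-purpose multimodal judge benchmarks, PhyCritic achieves strong performance gains over open-source baselines. When used as a policy model, it further improves perception and reasoning on physically grounded tasks.
Insight: The novelty is a two-stage training pipeline designed for physical AI, in particular the self-referential critic finetuning mechanism, where the model produces an internal reference before judging, which helps judgment stability and physical correctness. Viewed objectively, specializing critic models to the vertical domain of physical reasoning, with a matching training strategy, is a promising research direction.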
Abstract: With the rapid development of large multimodal models, reliable judge and critic models have become essential for open-ended evaluation and preference alignment, providing pairwise preferences, numerical scores, and explanatory justifications for assessing model-generated responses. However, existing critics are primarily trained in general visual domains such as captioning or image question answering, leaving physical AI tasks involving perception, causal reasoning, and planning largely underexplored. We introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage RLVR pipeline: a physical skill warmup stage that enhances physically oriented perception and reasoning, followed by self-referential critic finetuning, where the critic generates its own prediction as an internal reference before judging candidate responses, improving judgment stability and physical correctness. Across both physical and general-purpose multimodal judge benchmarks, PhyCritic achieves strong performance gains over open-source baselines and, when applied as a policy model, further improves perception and reasoning in physically grounded tasks.
[69] Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling cs.CV | cs.AIPDF
Gongye Liu, Bo Yang, Yida Zhi, Zhizhou Zhong, Lei Ke
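TL;DR: This paper proposes DiNa-LRM, a reward model native to the latent space of diffusion models that performs preference learning directly on noisy diffusion states. By introducing a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty, it avoids two problems of conventional vision-language model (VLM) rewards: high computational cost, and the domain mismatch between pixel-space rewards and latent-space generators.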
Details
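Motivation: In preference optimization for diffusion and flow-matching models, VLM-based reward functions are expensive in computation and memory, and optimizing a latent-space generator through a pixel-space reward introduces a domain mismatch.
Result: On image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and is competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, it improves optimization dynamics, enabling faster and more resource-efficient model alignment.
Insight: The novelty is a diffusion-native latent reward modeling framework that defines preference learning directly on noisy diffusion states, together with noise-dependent uncertainty calibration and inference-time noise ensembling, which provide native support for test-time scaling and robust rewarding.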
Abstract: Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.
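To make the "Thurstone likelihood with noise-dependent uncertainty" concrete, here is a minimal sketch of the classical Thurstone comparison model, where the probability that sample a is preferred over sample b is Φ((r_a − r_b) / sqrt(σ_a² + σ_b²)). How DiNa-LRM ties σ to the diffusion noise level is not given in the abstract; the function names and the idea of passing σ in directly are assumptions for illustration only.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def thurstone_win_prob(r_a, r_b, sigma_a, sigma_b):
    """P(a preferred over b) under a Thurstone model with per-sample
    Gaussian uncertainty on each reward."""
    return normal_cdf((r_a - r_b) / math.sqrt(sigma_a ** 2 + sigma_b ** 2))

def thurstone_nll(r_win, r_lose, sigma_win, sigma_lose):
    """Negative log-likelihood of one observed preference pair."""
    return -math.log(thurstone_win_prob(r_win, r_lose, sigma_win, sigma_lose))
```

With this form, a pair scored at a high noise level (large σ) yields a win probability closer to 0.5, so the same reward gap contributes a weaker training signal than it would at a low noise level.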
[70] SurfPhase: 3D Interfacial Dynamics in Two-Phase Flows from Sparse Videos cs.CVPDF
Yue Gao, Hong-Xing Yu, Sanghyeon Chang, Qianxi Fu, Bo Zhu
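TL;DR: SurfPhase is a new model for reconstructing 3D interfacial dynamics in two-phase flows from sparse camera views. It combines dynamic Gaussian surfels with a signed distance function for geometric consistency, and uses a video diffusion model to synthesize novel-view videos that refine the reconstruction.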
Details
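Motivation: Interfacial dynamics in two-phase flows are central to momentum, heat, and mass transfer, yet they are hard to measure experimentally, and existing neural rendering methods cannot handle sharp, deformable liquid-vapor interfaces.
Result: Evaluated on a new dataset of high-speed pool boiling videos, the method achieves high-quality view synthesis and velocity estimation from only two camera views.
Insight: The novelty is combining dynamic Gaussian surfels with a signed distance function for geometric consistency, and using a video diffusion model to strengthen reconstruction from sparse observations, with a design aimed specifically at sharp interfacial dynamics.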
Abstract: Interfacial dynamics in two-phase flows govern momentum, heat, and mass transfer, yet remain difficult to measure experimentally. Classical techniques face intrinsic limitations near moving interfaces, while existing neural rendering methods target single-phase flows with diffuse boundaries and cannot handle sharp, deformable liquid-vapor interfaces. We propose SurfPhase, a novel model for reconstructing 3D interfacial dynamics from sparse camera views. Our approach integrates dynamic Gaussian surfels with a signed distance function formulation for geometric consistency, and leverages a video diffusion model to synthesize novel-view videos to refine reconstruction from sparse observations. We evaluate on a new dataset of high-speed pool boiling videos, demonstrating high-quality view synthesis and velocity estimation from only two camera views. Project website: https://yuegao.me/SurfPhase.
cs.AI [Back]
[71] To Think or Not To Think, That is The Question for Large Reasoning Models in Theory of Mind Tasks cs.AI | cs.CLPDF
Nanxu Gong, Haotian Li, Sixun Dong, Jianxun Lian, Yanjie Fu
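TL;DR: This paper systematically studies nine advanced large language models on Theory of Mind (ToM) tasks and finds that reasoning models do not always outperform non-reasoning models and sometimes do worse. It identifies three key findings for reasoning models on ToM tasks: long reasoning lowers accuracy, moderate and adaptive reasoning helps, and models rely on option matching rather than genuine reasoning.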
Details
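Motivation: To examine whether the step-by-step reasoning ability that large reasoning models show in domains such as mathematics and coding transfers to socio-cognitive skills such as Theory of Mind, which are essential for natural social interaction.
Result: On three representative ToM benchmarks, reasoning models do not consistently outperform non-reasoning models and sometimes perform worse. Two interventions (S2F adaptive reasoning and T2M shortcut prevention) further verify and mitigate the identified problems.
Insight: The study shows that the progress of LRMs in formal reasoning (e.g., math, code) does not fully transfer to ToM, a typical social reasoning task. Robust ToM requires capabilities beyond existing reasoning methods, and the paper proposes intervention strategies based on adaptive reasoning and on preventing the option-matching shortcut.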
Abstract: Theory of Mind (ToM) assesses whether models can infer hidden mental states such as beliefs, desires, and intentions, which is essential for natural social interaction. Although recent progress in Large Reasoning Models (LRMs) has boosted step-by-step inference in mathematics and coding, it is still underexplored whether this benefit transfers to socio-cognitive skills. We present a systematic study of nine advanced Large Language Models (LLMs), comparing reasoning models with non-reasoning models on three representative ToM benchmarks. The results show that reasoning models do not consistently outperform non-reasoning models and sometimes perform worse. A fine-grained analysis reveals three insights. First, slow thinking collapses: accuracy significantly drops as responses grow longer, and larger reasoning budgets hurt performance. Second, moderate and adaptive reasoning benefits performance: constraining reasoning length mitigates failure, while distinct success patterns demonstrate the necessity of dynamic adaptation. Third, models take an option-matching shortcut: when multiple-choice options are removed, reasoning models improve markedly, indicating reliance on option matching rather than genuine deduction. We also design two intervention approaches, Slow-to-Fast (S2F) adaptive reasoning and Think-to-Match (T2M) shortcut prevention, to further verify and mitigate the problems. Taken together, our results highlight that the advancement of LRMs in formal reasoning (e.g., math, code) cannot be fully transferred to ToM, a typical task in social reasoning. We conclude that achieving robust ToM requires developing unique capabilities beyond existing reasoning methods.
[72] GameDevBench: Evaluating Agentic Capabilities Through Game Development cs.AI | cs.CL | cs.SEPDF
Wayne Chi, Yixiong Fang, Arnav Yayavaram, Siddharth Yayavaram, Seth Karten
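TL;DR: This paper presents GameDevBench, the first benchmark for evaluating agents on game development tasks. It contains 132 tasks derived from web and video tutorials that require agents to handle multimodal assets such as shaders, sprites, and animations within a visual game scene while working in large, dense codebases. Current agents perform poorly: the best model solves only 54.5% of tasks, and task difficulty correlates strongly with multimodal complexity.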
Details
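Motivation: Coding agents are advancing quickly, but their multimodal counterparts lag behind, mainly because there are few evaluation testbeds that combine the complexity of software development with deep multimodal understanding. Game development is an ideal testbed because agents must handle multimodal assets in a visual scene while working in complex codebases.
Result: On GameDevBench, the best agent solves only 54.5% of tasks. Success rates fall as multimodal complexity rises: 46.9% on gameplay-oriented tasks versus 31.6% on 2D graphics tasks. Simple image- and video-based feedback mechanisms raise the performance of Claude Sonnet 4.5 from 33.3% to 47.7%.
Insight: The paper's contribution is the first multimodal agent benchmark focused on game development, which reveals the strong effect of multimodal complexity on task difficulty. Viewed objectively, the simple multimodal feedback mechanisms it proposes improve agent performance effectively and suggest a useful direction for multimodal agent research.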
Abstract: Despite rapid progress on coding agents, progress on their multimodal counterparts has lagged behind. A key challenge is the scarcity of evaluation testbeds that combine the complexity of software development with the need for deep multimodal understanding. Game development provides such a testbed as agents must navigate large, dense codebases while manipulating intrinsically multimodal assets such as shaders, sprites, and animations within a visual game scene. We present GameDevBench, the first benchmark for evaluating agents on game development tasks. GameDevBench consists of 132 tasks derived from web and video tutorials. Tasks require significant multimodal understanding and are complex: the average solution requires over three times the number of lines of code and file changes compared to prior software development benchmarks. Agents still struggle with game development, with the best agent solving only 54.5% of tasks. We find a strong correlation between perceived task difficulty and multimodal complexity, with success rates dropping from 46.9% on gameplay-oriented tasks to 31.6% on 2D graphics tasks. To improve multimodal capability, we introduce two simple image and video-based feedback mechanisms for agents. Despite their simplicity, these methods consistently improve performance, with the largest change being an increase in Claude Sonnet 4.5’s performance from 33.3% to 47.7%. We release GameDevBench publicly to support further research into agentic game development.
cs.LG [Back]
[73] Towards Autonomous Mathematics Research cs.LG | cs.AI | cs.CL | cs.CYPDF
Tony Feng, Trieu H. Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi
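TL;DR: This paper introduces Aletheia, an agent for autonomous mathematics research that generates, verifies, and revises natural-language mathematical solutions end to end. The system combines an advanced version of the Gemini Deep Think reasoning model, a novel inference-time scaling law that extends beyond Olympiad-level problems, and intensive tool use to handle the complexity of mathematical research. The study demonstrates Aletheia's capability from Olympiad problems to PhD-level exercises and reports several milestones in AI-assisted mathematics research: a research paper generated entirely by AI, a research paper produced through human-AI collaboration, and a semi-autonomous evaluation of hundreds of open problems.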
Details
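Motivation: To extend the reasoning ability of foundation models from solving International Mathematical Olympiad-level problems to professional mathematical research, which requires navigating vast literature and constructing long-horizon proofs, and so is a considerably harder transition.
Result: Aletheia is applied successfully to problems ranging from Olympiad level to PhD level. Three milestones are reported: (a) a research paper (Feng26), generated entirely by AI, on calculating structure constants in arithmetic geometry called eigenweights; (b) a research paper (LeeSeo26) demonstrating human-AI collaboration in proving bounds on systems of interacting particles called independent sets; and (c) a semi-autonomous evaluation of 700 open problems in Bloom's Erdos Conjectures database, including autonomous solutions to four open questions.
Insight: The contributions are: 1) an end-to-end mathematics research agent framework that integrates generation, verification, and revision; 2) a new inference-time scaling law for complex problems beyond Olympiad difficulty; 3) an emphasis on intensive tool use as key to navigating the complexity of mathematical research; and 4) a proposal for standard levels that quantify the autonomy and novelty of AI-assisted results, offering a new basis for evaluation in the field.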
Abstract: Recent advances in foundational models have yielded reasoning systems capable of achieving a gold-medal standard at the International Mathematical Olympiad. The transition from competition-level problem-solving to professional research, however, requires navigating vast literature and constructing long-horizon proofs. In this work, we introduce Aletheia, a math research agent that iteratively generates, verifies, and revises solutions end-to-end in natural language. Specifically, Aletheia is powered by an advanced version of Gemini Deep Think for challenging reasoning problems, a novel inference-time scaling law that extends beyond Olympiad-level problems, and intensive tool use to navigate the complexities of mathematical research. We demonstrate the capability of Aletheia from Olympiad problems to PhD-level exercises and most notably, through several distinct milestones in AI-assisted mathematics research: (a) a research paper (Feng26) generated by AI without any human intervention in calculating certain structure constants in arithmetic geometry called eigenweights; (b) a research paper (LeeSeo26) demonstrating human-AI collaboration in proving bounds on systems of interacting particles called independent sets; and (c) an extensive semi-autonomous evaluation (Feng et al., 2026a) of 700 open problems on Bloom’s Erdos Conjectures database, including autonomous solutions to four open questions. In order to help the public better understand the developments pertaining to AI and mathematics, we suggest codifying standard levels quantifying autonomy and novelty of AI-assisted results. We conclude with reflections on human-AI collaboration in mathematics.
[74] Hardware Co-Design Scaling Laws via Roofline Modelling for On-Device LLMs cs.LG | cs.CLPDF
Luoyang Sun, Jiwen Jiang, Yifeng Ding, Fengfa Li, Yan Song
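TL;DR: This paper proposes a hardware co-design scaling law framework for on-device large language models (LLMs). By combining a model of training loss with roofline-based modelling of inference latency, it enables joint optimization and fast selection of LLM architectures under given hardware constraints.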
Details
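Motivation: When deploying Vision-Language-Action models (VLAs) on resource-constrained on-device platforms (such as autonomous vehicles and robots), the challenge is to select or design, for a specific hardware platform, an LLM backbone that best balances accuracy against inference latency and hardware efficiency.
Result: On NVIDIA Jetson Orin hardware, 1,942 candidate architectures are evaluated and 170 models are trained to fit the scaling law. At the same latency as Qwen2.5-0.5B on the target hardware, the co-designed architecture achieves 19.42% lower perplexity on WikiText-2.
Insight: The novelty is what the authors describe as the first principled and operational framework for hardware co-design scaling laws. It models training loss as an explicit function of architectural hyperparameters and couples it with a roofline latency model, which establishes a direct accuracy-latency correspondence, identifies the Pareto frontier, and reduces architecture selection from months to days.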
Abstract: Vision-Language-Action Models (VLAs) have emerged as a key paradigm of Physical AI and are increasingly deployed in autonomous vehicles, robots, and smart spaces. In these resource-constrained on-device settings, selecting an appropriate large language model (LLM) backbone is a critical challenge: models must balance accuracy with strict inference latency and hardware efficiency constraints. This makes hardware-software co-design a game-changing requirement for on-device LLM deployment, where each hardware platform demands a tailored architectural solution. We propose a hardware co-design law that jointly captures model accuracy and inference performance. Specifically, we model training loss as an explicit function of architectural hyperparameters and characterise inference latency via roofline modelling. We empirically evaluate 1,942 candidate architectures on NVIDIA Jetson Orin, training 170 selected models for 10B tokens each to fit a scaling law relating architecture to training loss. By coupling this scaling law with latency modelling, we establish a direct accuracy-latency correspondence and identify the Pareto frontier for hardware co-designed LLMs. We further formulate architecture search as a joint optimisation over precision and performance, deriving feasible design regions under industrial hardware and application budgets. Our approach reduces architecture selection from months to days. At the same latency as Qwen2.5-0.5B on the target hardware, our co-designed architecture achieves 19.42% lower perplexity on WikiText-2. To our knowledge, this is the first principled and operational framework for hardware co-design scaling laws in on-device LLM deployment. We will make the code and related checkpoints publicly available.
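The roofline idea behind the latency model, and the Pareto frontier the abstract refers to, can be sketched as follows. This is a generic textbook roofline estimate (latency bounded below by the slower of compute time and memory-transfer time), not the paper's fitted model; the function names, the candidate tuples, and any hardware numbers used with it are hypothetical.

```python
def roofline_latency(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Lower-bound latency in seconds: the kernel is limited either by
    compute (flops / peak_flops) or by memory traffic (bytes / bandwidth)."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bandwidth
    return max(compute_time, memory_time)

def pareto_front(candidates):
    """Keep candidates that no other candidate beats on both latency and loss.

    candidates: list of (name, latency, loss) tuples; lower is better for both.
    """
    front = []
    best_loss = float("inf")
    # Scan in order of increasing latency; a candidate survives only if it
    # improves on the best loss seen among all faster candidates.
    for name, latency, loss in sorted(candidates, key=lambda c: (c[1], c[2])):
        if loss < best_loss:
            front.append((name, latency, loss))
            best_loss = loss
    return front
```

For example, with a hypothetical device offering 1e12 FLOP/s and 1e11 B/s, a workload of 2e9 FLOPs and 1e9 bytes is memory-bound: `roofline_latency(2e9, 1e9, 1e12, 1e11)` is governed by the 0.01 s memory term rather than the 0.002 s compute term.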
[75] Control Reinforcement Learning: Token-Level Mechanistic Analysis via Learned SAE Feature Steering cs.LG | cs.AI | cs.CLPDF
Seonglae Cho, Zekun Wu, Adriano Koshiyama
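TL;DR: This paper proposes Control Reinforcement Learning (CRL), a new method that trains a policy to select sparse autoencoder (SAE) features for intervention at each token position of a language model. Beyond identifying which features activate, it reveals which features change model outputs when amplified, producing dynamic, interpretable intervention logs.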
Details
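Motivation: Existing sparse autoencoder methods only analyze statically which features activate in a language model, and cannot reveal which features dynamically change model outputs when amplified. This work uses a learnable feature intervention policy to bridge the gap between static feature analysis and dynamic causal intervention.
Result: On Gemma-2 2B across MMLU, BBQ, GSM8K, HarmBench, and XSTest, CRL achieves improvements while providing per-token intervention logs.
Insight: The core innovation is combining reinforcement learning of a policy with SAE feature intervention to enable token-level dynamic mechanistic analysis. The proposed Adaptive Feature Masking encourages diverse feature discovery while preserving single-feature interpretability. New analysis capabilities, namely branch point tracking, critic trajectory analysis, and layer-wise comparison, provide tools for understanding internal mechanisms, such as syntactic features in early layers and semantic features in later layers.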
Abstract: Sparse autoencoders (SAEs) decompose language model activations into interpretable features, but existing methods reveal only which features activate, not which change model outputs when amplified. We introduce Control Reinforcement Learning (CRL), which trains a policy to select SAE features for steering at each token, producing interpretable intervention logs: the learned policy identifies features that change model outputs when amplified. Adaptive Feature Masking encourages diverse feature discovery while preserving single-feature interpretability. The framework yields new analysis capabilities: branch point tracking locates tokens where feature choice determines output correctness; critic trajectory analysis separates policy limitations from value estimation errors; layer-wise comparison reveals syntactic features in early layers and semantic features in later layers. On Gemma-2 2B across MMLU, BBQ, GSM8K, HarmBench, and XSTest, CRL achieves improvements while providing per-token intervention logs. These results establish learned feature steering as a mechanistic interpretability tool that complements static feature analysis with dynamic intervention probes.
[76] SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining cs.LG | cs.CLPDF
Yifan Zhang, Zunhai Su, Shuhao Hu, Rui Yang, Wei Wu
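TL;DR: This paper proposes SnapMLA, an FP8 quantized pipelining framework for the decoding phase of the DeepSeek Multi-head Latent Attention (MLA) architecture. Through hardware-aware algorithm-kernel co-optimization, including RoPE-aware per-token KV quantization, reconstruction of the quantized PV computation pipeline, and end-to-end dataflow optimization, it aims to improve decoding efficiency in long-context settings.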
Details
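Motivation: Although FP8 attention (as in FlashAttention-3) has shown promise, integrating it into the decoding phase of the MLA architecture is challenging because of numerical heterogeneity caused by decoupled positional embeddings, misaligned quantization scales in the FP8 PV GEMM, and a lack of optimized system-level support.
Result: Extensive experiments on state-of-the-art MLA LLMs show that SnapMLA achieves up to a 1.91x improvement in throughput, with negligible risk of performance degradation on challenging long-context tasks, including mathematical reasoning and code generation benchmarks.
Insight: The paper's contributions are: a per-token KV quantization method that keeps the RoPE part in high precision, motivated by the heterogeneous quantization sensitivity of the MLA KV cache; a reconstructed PV computation pipeline that resolves the quantization scale misalignment caused by the shared structure of the MLA KV cache; and end-to-end dataflow optimization with specialized kernels, yielding algorithm-hardware co-optimization.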
Abstract: While FP8 attention has shown substantial promise in innovations like FlashAttention-3, its integration into the decoding phase of the DeepSeek Multi-head Latent Attention (MLA) architecture presents notable challenges. These challenges include numerical heterogeneity arising from the decoupling of positional embeddings, misalignment of quantization scales in FP8 PV GEMM, and the need for optimized system-level support. In this paper, we introduce SnapMLA, an FP8 MLA decoding framework optimized to improve long-context efficiency through the following hardware-aware algorithm-kernel co-optimization techniques: (i) RoPE-Aware Per-Token KV Quantization, where the RoPE part is maintained in high precision, motivated by our comprehensive analysis of the heterogeneous quantization sensitivity inherent to the MLA KV cache. Furthermore, per-token granularity is employed to align with the autoregressive decoding process and maintain quantization accuracy. (ii) Quantized PV Computation Pipeline Reconstruction, which resolves the misalignment of quantization scale in FP8 PV computation stemming from the shared KV structure of the MLA KV cache. (iii) End-to-End Dataflow Optimization, where we establish an efficient data read-and-write workflow using specialized kernels, ensuring efficient data flow and performance gains. Extensive experiments on state-of-the-art MLA LLMs show that SnapMLA achieves up to a 1.91x improvement in throughput, with negligible risk of performance degradation in challenging long-context tasks, including mathematical reasoning and code generation benchmarks. Code is available at https://github.com/meituan-longcat/SGLang-FluentLLM.
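The per-token, RoPE-aware quantization idea in (i) can be illustrated with a small simulation. This is only a sketch under stated assumptions: it uses rounding to a uniform integer grid as a stand-in for a real FP8 cast (true FP8 values are not uniformly spaced), it takes 448 as the largest finite magnitude of the common E4M3 format, and the function names and the split into `latent_part` and `rope_part` are hypothetical rather than taken from the SnapMLA code.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude in the common E4M3 format

def quantize_token(vec):
    """Quantize one token's vector with its own scale (per-token granularity).

    Uniform rounding is used here as a simplified stand-in for an FP8 cast.
    """
    amax = max(abs(x) for x in vec)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round(x / scale) for x in vec], scale

def dequantize_token(quantized, scale):
    """Map quantized values back to the original range."""
    return [q * scale for q in quantized]

def quantize_kv_entry(latent_part, rope_part):
    """Quantize only the latent (non-positional) part of a cache entry and
    keep the RoPE part untouched in high precision."""
    quantized, scale = quantize_token(latent_part)
    return {"latent_q": quantized, "scale": scale, "rope": list(rope_part)}
```

Because each token carries its own scale, a newly decoded token can be quantized and appended to the cache without revisiting earlier entries, which is the alignment with autoregressive decoding that the abstract mentions.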
[77] Just on Time: Token-Level Early Stopping for Diffusion Language Models cs.LG | cs.CLPDF
Zahar Kohut, Severyn Shykula, Dmytro Khamula, Mykola Vysotskyi, Taras Rumezhak
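TL;DR: This paper proposes a training-free, token-level early stopping method for diffusion language models. It detects convergence independently at each token position to decide dynamically when iteration can stop, substantially reducing the total number of diffusion steps and improving efficiency while preserving generation quality.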
Details
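Motivation: Diffusion language models generate text by iterative refinement, which is computationally inefficient because many tokens stabilize long before the final denoising step. A method is needed to stop computing for tokens that have already converged.
Result: Across benchmarks spanning mathematical reasoning, general question answering, and scientific understanding, the method achieves state-of-the-art efficiency gains while preserving generation quality.
Insight: The novelty is using lightweight signals derived from the model's predictions and local context to achieve adaptive per-token freezing without task-specific fine-tuning, giving a training-free, token-level early stopping strategy.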
Abstract: Diffusion language models generate text through iterative refinement, a process that is often computationally inefficient because many tokens reach stability long before the final denoising step. We introduce a training-free, token-level early stopping approach that identifies convergence independently at each position. Our method leverages lightweight signals derived from the model’s predictions and local context to dynamically determine when individual tokens can be finalized. This yields adaptive per-token freezing without task-specific fine-tuning, substantially reducing the total number of diffusion steps required. Across diverse benchmarks, spanning mathematical reasoning, general question answering, and scientific understanding, our approach achieves state-of-the-art efficiency gains while preserving generation quality.
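A rough sketch of per-token freezing is given below. The abstract does not state the actual convergence signals, so the rule used here (a token is frozen once its prediction has been unchanged for `patience` consecutive steps and its confidence is at least `min_conf`) is an assumption, as are the function name and parameters. The sketch also replays precomputed per-step predictions; in a real decoder, frozen tokens would feed back into later model calls.

```python
def early_stop_decode(step_predictions, patience=2, min_conf=0.9):
    """Freeze each position independently once it looks converged.

    step_predictions: list over denoising steps; each item is a list of
    (token, confidence) pairs, one per position.
    Returns (final_tokens, frozen_at_step, steps_used).
    """
    n = len(step_predictions[0])
    final = [None] * n
    frozen_at = [None] * n
    last = [None] * n
    streak = [0] * n
    for step, preds in enumerate(step_predictions):
        for i, (token, conf) in enumerate(preds):
            if frozen_at[i] is not None:
                continue  # already finalized; no further work for this token
            streak[i] = streak[i] + 1 if token == last[i] else 1
            last[i] = token
            if streak[i] >= patience and conf >= min_conf:
                final[i] = token
                frozen_at[i] = step
        if all(f is not None for f in frozen_at):
            return final, frozen_at, step + 1  # every position converged early
    # Positions that never met the criterion keep their last prediction.
    for i in range(n):
        if frozen_at[i] is None:
            final[i] = last[i]
    return final, frozen_at, len(step_predictions)
```

In the test trace used to check this sketch, a four-step schedule finishes after three steps because both positions satisfy the criterion before the final step.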
[78] GENIUS: Generative Fluid Intelligence Evaluation Suite cs.LG | cs.AI | cs.CVPDF
Ruichuan An, Sihan Yang, Ziyu Guo, Wei Dai, Zijun Shen
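TL;DR: This paper introduces the GENIUS evaluation suite for assessing the generative fluid intelligence of unified multimodal models, that is, the ability to induce patterns, execute constraints, and adapt to novel scenarios from the immediate context rather than from prior knowledge.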
Details
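Motivation: Existing benchmarks mainly assess crystallized intelligence, which relies on accumulated knowledge and learned schemas, and overlook generative fluid intelligence, the core capacity for dynamic reasoning and adaptation to new situations. A dedicated evaluation tool is therefore needed.
Result: A systematic evaluation of 12 representative models reveals significant performance deficits on GENIUS tasks. Diagnostic analysis shows that the failures stem from limited context comprehension rather than insufficient intrinsic generative capability.
Insight: The novelty is formalizing generative fluid intelligence as three primitives and proposing a training-free attention intervention strategy. Its core value is setting a new standard for evaluating dynamic, general-purpose reasoning, pushing the field beyond knowledge utilization.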
Abstract: Unified Multimodal Models (UMMs) have shown remarkable progress in visual generation. Yet, existing benchmarks predominantly assess Crystallized Intelligence, which relies on recalling accumulated knowledge and learned schemas. This focus overlooks Generative Fluid Intelligence (GFI): the capacity to induce patterns, reason through constraints, and adapt to novel scenarios on the fly. To rigorously assess this capability, we introduce GENIUS (GEN Fluid Intelligence EvalUation Suite). We formalize GFI as a synthesis of three primitives. These include Inducing Implicit Patterns (e.g., inferring personalized visual preferences), Executing Ad-hoc Constraints (e.g., visualizing abstract metaphors), and Adapting to Contextual Knowledge (e.g., simulating counter-intuitive physics). Collectively, these primitives challenge models to solve problems grounded entirely in the immediate context. Our systematic evaluation of 12 representative models reveals significant performance deficits in these tasks. Crucially, our diagnostic analysis disentangles these failure modes. It demonstrates that deficits stem from limited context comprehension rather than insufficient intrinsic generative capability. To bridge this gap, we propose a training-free attention intervention strategy. Ultimately, GENIUS establishes a rigorous standard for GFI, guiding the field beyond knowledge utilization toward dynamic, general-purpose reasoning. Our dataset and code will be released at: https://github.com/arctanxarc/GENIUS.
cs.IR [Back]
[79] MLDocRAG: Multimodal Long-Context Document Retrieval Augmented Generation cs.IR | cs.CLPDF
Yongyue Zhang, Yaxiong Wu
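TL;DR: This paper proposes the MLDocRAG framework to address two challenges in multimodal long-document understanding: locating information across modalities and aggregating evidence across pages. It builds a Multimodal Chunk-Query Graph (MCQG) that organizes document content around semantically rich, answerable queries, improving retrieval quality and answer accuracy.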
Details
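Motivation: To address two major challenges in multimodal long-document understanding: the difficulty of locating relevant information caused by cross-modal heterogeneity, and the need to aggregate dispersed evidence for cross-page reasoning.
Result: Experiments on the MMLongBench-Doc and LongDocURL datasets show that MLDocRAG consistently improves retrieval quality and answer accuracy, demonstrating its effectiveness for long-context multimodal understanding.
Insight: The paper adopts a query-centric formulation that projects cross-modal and cross-page information into a unified query representation space, and uses a graph structure (MCQG) for selective, query-centric retrieval and structured evidence aggregation, which improves the grounding and coherence of long-context multimodal question answering.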
Abstract: Understanding multimodal long-context documents that comprise multimodal chunks such as paragraphs, figures, and tables is challenging due to (1) cross-modal heterogeneity, which makes it hard to localize relevant information across modalities, and (2) cross-page reasoning, which requires aggregating dispersed evidence across pages. To address these challenges, we adopt a query-centric formulation that projects cross-modal and cross-page information into a unified query representation space, with queries acting as abstract semantic surrogates for heterogeneous multimodal content. In this paper, we propose a Multimodal Long-Context Document Retrieval Augmented Generation (MLDocRAG) framework that leverages a Multimodal Chunk-Query Graph (MCQG) to organize multimodal document content around semantically rich, answerable queries. MCQG is constructed via a multimodal document expansion process that generates fine-grained queries from heterogeneous document chunks and links them to their corresponding content across modalities and pages. This graph-based structure enables selective, query-centric retrieval and structured evidence aggregation, thereby enhancing grounding and coherence in long-context multimodal question answering. Experiments on the MMLongBench-Doc and LongDocURL datasets show that MLDocRAG consistently improves retrieval quality and answer accuracy, demonstrating its effectiveness for long-context multimodal understanding.
cs.CR [Back]
[80] Omni-Safety under Cross-Modality Conflict: Vulnerabilities, Dynamics Mechanisms and Efficient Alignment cs.CR | cs.AI | cs.CLPDF
Kun Wang, Zherui Li, Zhenhong Zhou, Yitong Zhang, Yan Mi
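TL;DR: This paper systematically studies cross-modal safety risks in Omni-modal Large Language Models (OLLMs) and reveals a significant vulnerability. By establishing a modality-semantics decoupling principle and constructing the AdvBench-Omni dataset, the authors uncover a "Mid-layer Dissolution" phenomenon driven by shrinkage of the refusal vector magnitude, along with a modal-invariant pure refusal direction. Building on this, they propose OmniSteer, an efficient alignment method that extracts a golden refusal vector with Singular Value Decomposition and uses lightweight adapters to modulate intervention intensity adaptively. Experiments show that the method markedly raises the refusal success rate against harmful inputs while preserving general capabilities across all modalities.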
Details
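Motivation: Omni-modal large language models extend multimodal capabilities but also introduce cross-modal safety risks, and a systematic understanding of vulnerabilities in omni-modal interactions is lacking. This work aims to fill that gap by analyzing the vulnerabilities and their dynamic mechanisms and by seeking an efficient alignment method.
Result: In extensive experiments on the constructed AdvBench-Omni dataset, OmniSteer raises the Refusal Success Rate against harmful inputs from 69.9% to 91.2% while effectively preserving general capabilities across all modalities.
Insight: The contributions are: 1) a modality-semantics decoupling principle for systematically analyzing cross-modal safety vulnerabilities; 2) the discovery of the "Mid-layer Dissolution" phenomenon and a modal-invariant pure refusal direction in OLLMs; and 3) OmniSteer, an efficient alignment framework that extracts a golden refusal vector with Singular Value Decomposition and uses lightweight adapters for adaptive intensity modulation, improving safety while maintaining model performance. Viewed objectively, combining mechanistic safety analysis with lightweight adapter design offers a novel and efficient approach to safety alignment of multimodal models.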
Abstract: Omni-modal Large Language Models (OLLMs) greatly expand LLMs’ multimodal capabilities but also introduce cross-modal safety risks. However, a systematic understanding of vulnerabilities in omni-modal interactions remains lacking. To bridge this gap, we establish a modality-semantics decoupling principle and construct the AdvBench-Omni dataset, which reveals a significant vulnerability in OLLMs. Mechanistic analysis uncovers a Mid-layer Dissolution phenomenon driven by refusal vector magnitude shrinkage, alongside the existence of a modal-invariant pure refusal direction. Inspired by these insights, we extract a golden refusal vector using Singular Value Decomposition and propose OmniSteer, which utilizes lightweight adapters to modulate intervention intensity adaptively. Extensive experiments show that our method not only increases the Refusal Success Rate against harmful inputs from 69.9% to 91.2%, but also effectively preserves the general capabilities across all modalities. Our code is available at: https://github.com/zhrli324/omni-safety-research.
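As a toy illustration of extracting a shared direction with SVD and then steering along it, the sketch below finds the top right singular vector of a small matrix by power iteration and adds a scaled copy of it to a hidden state. Everything here is an assumption for illustration: treating the rows as per-modality refusal vectors, using power iteration in place of a full SVD routine, and supplying the intensity `alpha` by hand (in the paper it is modulated by lightweight adapters whose design the abstract does not describe).

```python
import math

def top_right_singular_vector(rows, iters=100):
    """Approximate the top right singular vector of a matrix (list of rows)
    by power iteration on A^T A. The sign follows the uniform start vector."""
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        u = [sum(r[j] * v[j] for j in range(dim)) for r in rows]           # A v
        w = [sum(rows[i][j] * u[i] for i in range(len(rows)))              # A^T u
             for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # start vector is orthogonal to the row space; give up
        v = [x / norm for x in w]
    return v

def steer(hidden, direction, alpha):
    """Shift a hidden state along the (unit-norm) direction by alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]
```

If the rows are nearly parallel, as a modal-invariant refusal direction would imply, the recovered vector is close to their common direction and small per-modality deviations are averaged out.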
[81] The Landscape of Prompt Injection Threats in LLM Agents: From Taxonomy to Analysis cs.CR | cs.CLPDF
Peiran Wang, Xinfeng Li, Chong Xiang, Jinghuai Zhang, Ying Li
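TL;DR: This paper systematically studies the security threat of Prompt Injection (PI) in large language model (LLM) agents. Through a literature review and quantitative analysis, it establishes taxonomies of attacks and defenses and exposes the limitations of existing defenses and benchmarks on context-dependent tasks. It then introduces AgentPI, a new benchmark for evaluating agent behavior under context-dependent interaction settings, and shows empirically that existing defenses struggle to achieve high trustworthiness, high utility, and low latency at the same time.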
Details
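Motivation: As LLMs evolve toward autonomous agents, they face prompt injection vulnerabilities, in which untrusted inputs can hijack agent behavior. Existing research and benchmarks largely overlook security evaluation on context-dependent tasks, where agents must rely on runtime environmental observations to decide their actions, which is a key limitation.
Result: The paper introduces the AgentPI benchmark and uses it to evaluate representative defenses empirically. No single approach simultaneously achieves high trustworthiness, high utility, and low latency. Many defenses appear effective on existing benchmarks only because they suppress contextual inputs, and they fail to generalize to realistic agent settings where context-dependent reasoning is essential.
Insight: The contributions are: 1) systematic taxonomies of prompt injection attacks (by payload generation strategy) and defenses (by intervention stage); 2) identifying the neglect of context-dependent tasks as a key limitation of existing security evaluations; and 3) the AgentPI benchmark, designed for context-dependent interaction settings, which offers a more realistic evaluation framework for future research and secure deployment.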
Abstract: The evolution of Large Language Models (LLMs) has resulted in a paradigm shift towards autonomous agents, necessitating robust security against Prompt Injection (PI) vulnerabilities where untrusted inputs hijack agent behaviors. This SoK presents a comprehensive overview of the PI landscape, covering attacks, defenses, and their evaluation practices. Through a systematic literature review and quantitative analysis, we establish taxonomies that categorize PI attacks by payload generation strategies (heuristic vs. optimization) and defenses by intervention stages (text, model, and execution levels). Our analysis reveals a key limitation shared by many existing defenses and benchmarks: they largely overlook context-dependent tasks, in which agents are authorized to rely on runtime environmental observations to determine actions. To address this gap, we introduce AgentPI, a new benchmark designed to systematically evaluate agent behavior under context-dependent interaction settings. Using AgentPI, we empirically evaluate representative defenses and show that no single approach can simultaneously achieve high trustworthiness, high utility, and low latency. Moreover, we show that many defenses appear effective under existing benchmarks by suppressing contextual inputs, yet fail to generalize to realistic agent settings where context-dependent reasoning is essential. This SoK distills key takeaways and open research problems, offering structured guidance for future research and practical deployment of secure LLM agents.
cs.DB
[82] GraphSeek: Next-Generation Graph Analytics with LLMs cs.DB | cs.AI | cs.CL | cs.HC | cs.IR
Maciej Besta, Łukasz Jarmocik, Orest Hrycyna, Shachar Klaiman, Konrad Mączka
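TL;DR: This paper presents GraphSeek, a next-generation LLM-based graph analytics framework. It introduces a Semantic Catalog that describes the graph schema and graph operations, and uses it to turn natural language queries into executable graph analytics tasks, addressing the inefficiency and ineffectiveness of LLMs on large, heterogeneous, complex property graphs.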
Details
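Motivation: Graphs are a foundational data structure across domains, but the barrier to using them is high and requires expertise. LLMs promise accessible graph analytics through natural language, yet existing approaches fall short in both effectiveness and efficiency on property graphs that are industry-scale, highly heterogeneous, structurally complex, and dynamically evolving.
Result: GraphSeek achieves substantially higher task success rates, for example 86% (in comparison with enhanced LangChain), indicating greater effectiveness on complex graph analytics tasks.
Insight: The core innovation is a new abstraction: planning over a Semantic Catalog, which decouples the LLM's semantic planning from deterministic, database-grade query execution. This yields gains in both token efficiency and task effectiveness, even with small-context LLMs, and points toward unifying LLM reasoning with database-grade execution over large, complex property graphs.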
Abstract: Graphs are foundational across domains but remain hard to use without deep expertise. LLMs promise accessible natural language (NL) graph analytics, yet they fail to process industry-scale property graphs effectively and efficiently: such datasets are large, highly heterogeneous, structurally complex, and evolve dynamically. To address this, we devise a novel abstraction for complex multi-query analytics over such graphs. Its key idea is to replace brittle generation of graph queries directly from NL with planning over a Semantic Catalog that describes both the graph schema and the graph operations. Concretely, this induces a clean separation between a Semantic Plane for LLM planning and broader reasoning, and an Execution Plane for deterministic, database-grade query execution over the full dataset and tool implementations. This design yields substantial gains in both token efficiency and task effectiveness even with small-context LLMs. We use this abstraction as the basis of the first LLM-enhanced graph analytics framework called GraphSeek. GraphSeek achieves substantially higher success rates (e.g., 86% over enhanced LangChain) and points toward the next generation of affordable and accessible graph analytics that unify LLM reasoning with database-grade execution over large and complex property graphs.
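The separation between a planning side and an execution side can be pictured with a toy sketch. Everything below is an assumption made for illustration: the graph, the two operations, the `CATALOG` layout, and the plan format are hypothetical and are not GraphSeek's actual interfaces. The planner (an LLM in the paper, a hard-coded plan here) sees only the catalog description, while a deterministic executor validates each step and runs it over the data.

```python
from typing import Any, Dict, List

# Toy property graph (hypothetical data, not from the paper).
NODES = {1: "Person", 2: "Person", 3: "Company"}
EDGES = [(1, "WORKS_AT", 3), (2, "WORKS_AT", 3), (1, "KNOWS", 2)]


def nodes_with_label(label: str) -> List[int]:
    """Return the ids of all nodes that carry the given label."""
    return sorted(n for n, node_label in NODES.items() if node_label == label)


def in_neighbors(node: int, edge_type: str) -> List[int]:
    """Return the sources of edges of the given type that point at `node`."""
    return sorted(src for src, etype, dst in EDGES if dst == node and etype == edge_type)


# Hypothetical catalog: operation name -> description, parameters, implementation.
CATALOG: Dict[str, Dict[str, Any]] = {
    "nodes_with_label": {
        "doc": "List node ids that carry a label.",
        "params": ("label",),
        "impl": nodes_with_label,
    },
    "in_neighbors": {
        "doc": "List sources of edges of a type that point at a node.",
        "params": ("node", "edge_type"),
        "impl": in_neighbors,
    },
}


def catalog_prompt() -> str:
    """Text a planner would see: schema and operations, but no graph data."""
    labels = sorted(set(NODES.values()))
    edge_types = sorted({edge[1] for edge in EDGES})
    schema = f"labels={labels}, edge_types={edge_types}"
    ops = "; ".join(
        f"{name}({', '.join(entry['params'])}): {entry['doc']}"
        for name, entry in CATALOG.items()
    )
    return f"{schema}\n{ops}"


def execute(plan: List[Dict[str, Any]]) -> List[Any]:
    """Validate each plan step against the catalog, then run it deterministically."""
    results = []
    for step in plan:
        entry = CATALOG.get(step["op"])
        if entry is None:
            raise ValueError(f"unknown operation: {step['op']}")
        if set(step["args"]) != set(entry["params"]):
            raise ValueError(f"bad arguments for {step['op']}")
        results.append(entry["impl"](**step["args"]))
    return results


# A plan an LLM might emit after reading catalog_prompt(); hard-coded here.
plan = [
    {"op": "nodes_with_label", "args": {"label": "Company"}},
    {"op": "in_neighbors", "args": {"node": 3, "edge_type": "WORKS_AT"}},
]
print(execute(plan))  # [[3], [1, 2]]
```

In this toy version the planner never emits a raw graph query, so a malformed step is rejected before it touches the data. Whether GraphSeek validates plans this way is not stated in the abstract.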
cs.RO
[83] SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes cs.RO | cs.AI | cs.CV | cs.GR
Nicholas Pfaff, Thomas Cohn, Sergey Zakharov, Rick Cory, Russ Tedrake
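TL;DR: This paper presents SceneSmith, a hierarchical agentic framework that generates simulation-ready indoor scenes from natural language prompts. It builds scenes in successive stages (architectural layout, then furniture placement, then small object population), each implemented as an interaction among vision-language model (VLM) agents (designer, critic, and orchestrator), and it integrates text-to-3D synthesis, dataset retrieval, and physical property estimation.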
Details
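Motivation: Existing simulation environments used to train and evaluate home robots fail to capture the diversity and physical complexity of real indoor spaces. In particular, current scene synthesis methods produce sparsely furnished rooms that lack the dense clutter, articulated furniture, and physical properties needed for robotic manipulation.
Result: SceneSmith generates 3-6x more objects than prior methods, with an inter-object collision rate below 2% and 96% of objects remaining stable under physics simulation. In a user study with 205 participants, it achieves a 92% average realism win rate and a 91% average prompt faithfulness win rate against baselines.
Insight: The contributions include a hierarchical agentic framework in which interacting VLM agents generate the scene, tight integration of several asset generation techniques (such as text-to-3D synthesis and physical property estimation), and a demonstration of its potential in an end-to-end pipeline for automatic robot policy evaluation.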
Abstract: Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity and physical complexity of real indoor spaces. Current scene synthesis methods produce sparsely furnished rooms that lack the dense clutter, articulated furniture, and physical properties essential for robotic manipulation. We introduce SceneSmith, a hierarchical agentic framework that generates simulation-ready indoor environments from natural language prompts. SceneSmith constructs scenes through successive stages, from architectural layout to furniture placement to small object population, each implemented as an interaction among VLM agents: designer, critic, and orchestrator. The framework tightly integrates asset generation through text-to-3D synthesis for static objects, dataset retrieval for articulated objects, and physical property estimation. SceneSmith generates 3-6x more objects than prior methods, with <2% inter-object collisions and 96% of objects remaining stable under physics simulation. In a user study with 205 participants, it achieves 92% average realism and 91% average prompt faithfulness win rates against baselines. We further demonstrate that these environments can be used in an end-to-end pipeline for automatic robot policy evaluation.
[84] From Representational Complementarity to Dual Systems: Synergizing VLM and Vision-Only Backbones for End-to-End Driving cs.RO | cs.CV
Sining Ang, Yuguang Yang, Chenxu Dang, Canyu Chen, Cheng Chi
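TL;DR: This paper examines the complementarity between vision-language models (VLMs) and vision-only backbones (such as ViT) in end-to-end autonomous driving. A three-stage study (RQ1-RQ3) on the RecogDrive platform finds that the VLM introduces a distinct decision subspace and behaves more aggressively in long-tail scenarios, while ViT is more conservative. Building on this, the paper proposes HybridDriveVLA, which selects the best trajectory with a learned scorer, and DualDriveVLA, which uses a fast-slow policy to raise throughput substantially while keeping performance high.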
Details
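Motivation: The goal is to understand what a language-enabled backbone (VLM) changes in end-to-end planning within Vision-Language-Action (VLA) driving frameworks, compared with a vision-only backbone (such as ViT), looking beyond the usual accuracy-cost trade-off to reveal complementary strengths and improve system performance.
Result: On the RecogDrive benchmark, an oracle that picks the better trajectory between the VLM and ViT branches gives an upper bound of 93.58 PDMS (the performance metric). HybridDriveVLA raises PDMS to 92.10, and DualDriveVLA reaches 91.00 PDMS while calling the VLM on only 15% of scenarios and improving throughput by 3.2x.
Insight: The paper reveals that VLM and vision-only backbones are complementary in decision behavior (VLM aggressive, ViT conservative) and designs cooperative systems around this: HybridDriveVLA selects trajectories dynamically with a learned scorer, and DualDriveVLA follows a practical fast-slow policy (run ViT by default, invoke the VLM when confidence is low), balancing performance and efficiency and offering a new direction for multimodal autonomous driving systems.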
Abstract: Vision-Language-Action (VLA) driving augments end-to-end (E2E) planning with language-enabled backbones, yet it remains unclear what changes beyond the usual accuracy–cost trade-off. We revisit this question with a three-RQ analysis in RecogDrive by instantiating the system with a full VLM and vision-only backbones, all under an identical diffusion Transformer planner. RQ1: At the backbone level, the VLM can introduce additional subspaces beyond those of the vision-only backbones. RQ2: This unique subspace leads to different behavior in some long-tail scenarios: the VLM tends to be more aggressive whereas ViT is more conservative, and each decisively wins on about 2–3% of test scenarios. With an oracle that selects, per scenario, the better trajectory between the VLM and ViT branches, we obtain an upper bound of 93.58 PDMS. RQ3: To fully harness this observation, we propose HybridDriveVLA, which runs both ViT and VLM branches and selects between their endpoint trajectories using a learned scorer, improving PDMS to 92.10. Finally, DualDriveVLA implements a practical fast–slow policy: it runs ViT by default and invokes the VLM only when the scorer’s confidence falls below a threshold; calling the VLM on 15% of scenarios achieves 91.00 PDMS while improving throughput by 3.2x. Code will be released.
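The fast-slow routing described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released code: the planner and scorer callables, the `Decision` type, the waypoint format, and the 0.5 threshold are all hypothetical. The sketch also simply returns the slow branch's trajectory once it is invoked, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Trajectory = List[Tuple[float, float]]  # hypothetical format: (x, y) waypoints


@dataclass
class Decision:
    trajectory: Trajectory
    used_slow_branch: bool


def fast_slow_plan(
    scene: dict,
    fast_planner: Callable[[dict], Trajectory],
    slow_planner: Callable[[dict], Trajectory],
    scorer: Callable[[dict, Trajectory], float],
    threshold: float = 0.5,  # assumed value; the abstract gives no number
) -> Decision:
    """Run the cheap branch first; fall back to the expensive one on low confidence."""
    fast_traj = fast_planner(scene)
    if scorer(scene, fast_traj) >= threshold:
        return Decision(fast_traj, used_slow_branch=False)
    return Decision(slow_planner(scene), used_slow_branch=True)


# Toy stand-ins for the ViT branch, the VLM branch, and the learned scorer.
def toy_fast(scene: dict) -> Trajectory:
    return [(0.0, 0.0), (1.0, 0.0)]


def toy_slow(scene: dict) -> Trajectory:
    return [(0.0, 0.0), (1.0, 0.5)]


def toy_scorer(scene: dict, trajectory: Trajectory) -> float:
    return scene["confidence"]


scenes = [{"confidence": c} for c in (0.9, 0.8, 0.2, 0.7)]
decisions = [fast_slow_plan(s, toy_fast, toy_slow, toy_scorer) for s in scenes]
slow_fraction = sum(d.used_slow_branch for d in decisions) / len(decisions)
print(slow_fraction)  # 0.25
```

Raising the threshold sends more scenes to the slow branch, which trades throughput for the stronger planner. The 15% figure in the abstract corresponds to one such operating point.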
[85] ContactGaussian-WM: Learning Physics-Grounded World Model from Videos cs.RO | cs.AI | cs.CV
Meizhong Wang, Wanxin Jin, Kun Cao, Lihua Xie, Yiguang Hong
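TL;DR: This paper proposes ContactGaussian-WM, a differentiable, physics-grounded rigid-body world model that learns complex physical laws directly from sparse, contact-rich video sequences. The framework comprises a unified Gaussian representation for visual appearance and collision geometry, and an end-to-end differentiable learning framework that differentiates through a closed-form physics engine to infer physical properties from sparse visual observations.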
Details
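Motivation: Existing methods struggle to model the environment accurately when data is scarce and motion is complex, dynamic, and contact-rich. The paper aims to develop world models that understand complex physical interactions in order to advance robotic planning and simulation.
Result: Extensive simulation and real-world evaluations show that ContactGaussian-WM outperforms state-of-the-art methods at learning complex scenarios and generalizes robustly.
Insight: The core innovation is unifying visual appearance and collision geometry in a single Gaussian representation and building an end-to-end differentiable framework that learns physical properties directly from sparse video through a closed-form physics engine, which offers a new direction for data-efficient world model learning.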
Abstract: Developing world models that understand complex physical interactions is essential for advancing robotic planning and simulation. However, existing methods often struggle to accurately model the environment under conditions of data scarcity and complex contact-rich dynamic motion. To address these challenges, we propose ContactGaussian-WM, a differentiable physics-grounded rigid-body world model capable of learning intricate physical laws directly from sparse and contact-rich video sequences. Our framework consists of two core components: (1) a unified Gaussian representation for both visual appearance and collision geometry, and (2) an end-to-end differentiable learning framework that differentiates through a closed-form physics engine to infer physical properties from sparse visual observations. Extensive simulations and real-world evaluations demonstrate that ContactGaussian-WM outperforms state-of-the-art methods in learning complex scenarios, exhibiting robust generalization capabilities. Furthermore, we showcase the practical utility of our framework in downstream applications, including data synthesis and real-time MPC.
physics.soc-ph
[86] URBAN-SPIN: A street-level bikeability index to inform design implementations in historical city centres physics.soc-ph | cs.CV | cs.CY
Haining Ding, Chenxi Wang, Michal Gath-Morad
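TL;DR: This study develops a perception-led, typology-based, data-integrated framework for evaluating how street-level visual and spatial configurations shape the cycling experience. The framework combines streetscape indicators extracted with computer vision, built-environment variables, and subjective ratings into a typology-sensitive bikeability index. Applied to a historical city centre, with AI-assisted visual redesigns, it shows that subtle, targeted changes can bring notable perceptual improvements.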
Details
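Motivation: The study addresses how to evaluate and improve the cycling experience in historical city centres, where spatial constraints rule out large-scale infrastructural change and street typological context is often overlooked.
Result: Statistical analysis shows that perceived bikeability arises from cumulative, context-specific interactions among features. Greenness and openness consistently enhance comfort and pleasure, while enclosure, imageability, and building continuity show threshold or divergent effects depending on street type and subtype. AI-assisted visual redesigns further show that subtle, targeted changes can yield meaningful perceptual gains without large-scale structural interventions.
Insight: The paper proposes a typology-sensitive bikeability index framework that integrates subjective perception with objective physical metrics, stresses the importance of street typology and perceptual dimensions in historical urban design, and demonstrates that subtle AI-assisted design adjustments are a feasible way to improve the cycling experience.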
Abstract: Cycling is reported by an average of 35% of adults at least once per week across 28 countries, and as vulnerable road users directly exposed to their surroundings, cyclists experience the street at an intensity unmatched by other modes. Yet the street-level features that shape this experience remain under-analysed, particularly in historical urban contexts where spatial constraints rule out large-scale infrastructural change and where typological context is often overlooked. This study develops a perception-led, typology-based, and data-integrated framework that explicitly models street typologies and their sub-classifications to evaluate how visual and spatial configurations shape cycling experience. Drawing on the Cambridge Cycling Experience Video Dataset (CCEVD), a first-person and handlebar-mounted corpus developed in this study, we extract fine-grained streetscape indicators with computer vision and pair them with built-environment variables and subjective ratings from a Balanced Incomplete Block Design (BIBD) survey, thereby constructing a typology-sensitive Bikeability Index that integrates subjective and perceived dimensions with physical metrics for segment-level comparison. Statistical analysis shows that perceived bikeability arises from cumulative, context-specific interactions among features. While greenness and openness consistently enhance comfort and pleasure, enclosure, imageability, and building continuity display threshold or divergent effects contingent on street type and subtype. AI-assisted visual redesigns further demonstrate that subtle, targeted changes can yield meaningful perceptual gains without large-scale structural interventions. The framework offers a transferable model for evaluating and improving cycling conditions in heritage cities through perceptually attuned, typology-aware design strategies.
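For readers unfamiliar with the survey design named in the abstract, a Balanced Incomplete Block Design shows each rater only a subset of items while guaranteeing that every pair of items is compared equally often. The sketch below builds the textbook (7, 3, 1) design by cyclically shifting the base block {0, 1, 3} modulo 7 and checks the balance property. The parameters are illustrative assumptions; the abstract does not state the design used in the study.

```python
from collections import Counter
from itertools import combinations
from typing import List, Sequence, Tuple


def cyclic_bibd(v: int, base_block: Sequence[int]) -> List[List[int]]:
    """Build v blocks by shifting a base block modulo v."""
    return [sorted((x + shift) % v for x in base_block) for shift in range(v)]


def bibd_parameters(v: int, blocks: List[List[int]]) -> Tuple[int, int]:
    """Return (r, lam): appearances per item and co-appearances per pair.

    Raises ValueError when the design is not balanced.
    """
    item_counts = Counter(x for block in blocks for x in block)
    pair_counts = Counter(
        pair for block in blocks for pair in combinations(sorted(block), 2)
    )
    r_values = {item_counts[i] for i in range(v)}
    lam_values = {pair_counts[p] for p in combinations(range(v), 2)}
    if len(r_values) != 1 or len(lam_values) != 1:
        raise ValueError("design is not balanced")
    return r_values.pop(), lam_values.pop()


# 7 video clips (items 0..6); each rater (block) sees 3 of them.
blocks = cyclic_bibd(7, (0, 1, 3))
r, lam = bibd_parameters(7, blocks)
print(blocks)
print(r, lam)  # 3 1
```

Here each clip is rated by 3 of the 7 raters and every pair of clips shares exactly one rater, which is what makes segment-level comparisons balanced even though nobody rates everything.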
eess.IV
[87] A Systematic Review on Data-Driven Brain Deformation Modeling for Image-Guided Neurosurgery eess.IV | cs.CV
Tiago Assis, Colin P. Galvin, Joshua P. Castillo, Nazim Haouchine, Marta Kersten-Oertel
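TL;DR: This paper systematically reviews data-driven brain deformation modeling methods for image-guided neurosurgery published between January 2020 and April 2025. It summarizes 41 studies, analyzes approaches including deep learning-based image registration, direct deformation field regression, synthesis-driven multimodal alignment, resection-aware architectures, and hybrid models that integrate biomechanical priors, and identifies current limitations in out-of-distribution robustness, standardized benchmarking, interpretability, and readiness for clinical deployment.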
Details
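Motivation: During neurosurgery, tissue motion and tumor resection misalign preoperative planning images with intraoperative anatomy. Accurate brain deformation compensation is needed to correct this and make image-guided neurosurgery reliable.
Result: The review analyzes the methodological strategies, dataset usage, evaluation metrics, and validation protocols of existing studies. It finds that AI-driven deformation models show promise in performance and computational efficiency but lack unified standardized benchmarking and readiness for clinical deployment.
Insight: The contribution is a systematic, unified analysis and critical assessment of AI-driven brain deformation modeling methods that exposes the limitations of current research (such as insufficient robustness and interpretability) and outlines directions for more robust, generalizable, and clinically translatable solutions.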
Abstract: Accurate compensation of brain deformation is a critical challenge for reliable image-guided neurosurgery, as surgical manipulation and tumor resection induce tissue motion that misaligns preoperative planning images with intraoperative anatomy and longitudinal studies. In this systematic review, we synthesize recent AI-driven approaches developed between January 2020 and April 2025 for modeling and correcting brain deformation. A comprehensive literature search was conducted in PubMed, IEEE Xplore, Scopus, and Web of Science, with predefined inclusion and exclusion criteria focused on computational methods applied to brain deformation compensation for neurosurgical imaging, resulting in 41 studies meeting these criteria. We provide a unified analysis of methodological strategies, including deep learning-based image registration, direct deformation field regression, synthesis-driven multimodal alignment, resection-aware architectures addressing missing correspondences, and hybrid models that integrate biomechanical priors. We also examine dataset utilization, reported evaluation metrics, validation protocols, and how uncertainty and generalization have been assessed across studies. While AI-based deformation models demonstrate promising performance and computational efficiency, current approaches exhibit limitations in out-of-distribution robustness, standardized benchmarking, interpretability, and readiness for clinical deployment. Our review highlights these gaps and outlines opportunities for future research aimed at achieving more robust, generalizable, and clinically translatable deformation compensation solutions for neurosurgical guidance. By organizing recent advances and critically evaluating evaluation practices, this work provides a comprehensive foundation for researchers and clinicians engaged in developing and applying AI-based brain deformation methods.
[88] Uncertainty-Aware Ordinal Deep Learning for cross-Dataset Diabetic Retinopathy Grading eess.IV | cs.CV
Ali El Bellaj, Aya Benradi, Salman El Youssoufi, Taha El Marzouki, Mohammed-Amine Cheddadi
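TL;DR: This paper proposes an uncertainty-aware ordinal deep learning framework for cross-dataset diabetic retinopathy (DR) severity grading. The method combines a convolutional backbone, lesion-query attention pooling, and an evidential Dirichlet-based ordinal regression head, enabling both accurate severity prediction and principled estimation of predictive uncertainty.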
Details
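Motivation: Diabetic retinopathy is a severe complication of diabetes, and early, reliable detection is critical for preventing irreversible blindness. Existing automated DR grading methods often ignore the ordinal nature of disease progression and lack reliable estimates of predictive uncertainty, which limits their reliability in clinical settings, especially under domain shift.
Result: The method is evaluated in a multi-domain training setup that combines APTOS, Messidor-2, and a subset of EyePACS. It shows strong cross-dataset generalization on held-out test sets, with competitive classification accuracy and high quadratic weighted kappa, while providing meaningful uncertainty estimates for low-confidence cases.
Insight: The main innovation is combining ordinal regression with evidential deep learning: predictive uncertainty is modeled with an evidential Dirichlet distribution, and the model is trained with an ordinal evidential loss with annealed regularization to encourage calibrated confidence under domain shift. This is a promising direction for robust, clinically reliable automated DR grading.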
Abstract: Diabetes mellitus is a chronic metabolic disorder characterized by persistent hyperglycemia due to insufficient insulin production or impaired insulin utilization. One of its most severe complications is diabetic retinopathy (DR), a progressive retinal disease caused by microvascular damage, leading to hemorrhages, exudates, and potential vision loss. Early and reliable detection of DR is therefore critical for preventing irreversible blindness. In this work, we propose an uncertainty-aware deep learning framework for automated DR severity grading that explicitly models the ordinal nature of disease progression. Our approach combines a convolutional backbone with lesion-query attention pooling and an evidential Dirichlet-based ordinal regression head, enabling both accurate severity prediction and principled estimation of predictive uncertainty. The model is trained using an ordinal evidential loss with annealed regularization to encourage calibrated confidence under domain shift. We evaluate the proposed method on a multi-domain training setup combining APTOS, Messidor-2, and a subset of EyePACS fundus datasets. Experimental results demonstrate strong cross-dataset generalization, achieving competitive classification accuracy and high quadratic weighted kappa on held-out test sets, while providing meaningful uncertainty estimates for low-confidence cases. These results suggest that ordinal evidential learning is a promising direction for robust and clinically reliable diabetic retinopathy grading.
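Two quantities named in the abstract can be made concrete with a short sketch. The first function computes quadratic weighted kappa, the ordinal agreement metric that penalizes a prediction by the squared distance between grades. The second shows the common evidential formulation in which non-negative per-class evidence defines a Dirichlet distribution (alpha = evidence + 1) whose total strength yields an uncertainty score. Both are generic textbook versions written for illustration. The paper's ordinal head and loss may differ, and the function names here are hypothetical.

```python
from typing import List, Sequence, Tuple


def quadratic_weighted_kappa(
    y_true: Sequence[int], y_pred: Sequence[int], num_classes: int
) -> float:
    """Agreement between two gradings, penalizing errors by squared grade distance.

    Assumes num_classes >= 2 and labels in range(num_classes).
    """
    n = len(y_true)
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(row[j] for row in observed) for j in range(num_classes)]

    disagreement, chance = 0.0, 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            disagreement += weight * observed[i][j]
            chance += weight * true_hist[i] * pred_hist[j] / n
    if chance == 0.0:
        # Both gradings use one identical class; treated as perfect agreement
        # here (a convention chosen for this sketch).
        return 1.0
    return 1.0 - disagreement / chance


def dirichlet_summary(evidence: Sequence[float]) -> Tuple[List[float], float]:
    """Map non-negative evidence to expected class probabilities and uncertainty."""
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    probs = [a / strength for a in alpha]
    uncertainty = len(alpha) / strength  # 1.0 when there is no evidence at all
    return probs, uncertainty


# Five DR grades (0 = no DR ... 4 = proliferative DR).
print(quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4], 5))  # 1.0
print(dirichlet_summary([0, 0, 0, 0, 0])[1])  # 1.0
```

With evidence `[0, 0, 45, 0, 0]` the same function gives an uncertainty of about 0.1 and a probability of about 0.92 for grade 2, which is how a confident prediction looks under this formulation.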
cs.SE
[89] ISD-Agent-Bench: A Comprehensive Benchmark for Evaluating LLM-based Instructional Design Agents cs.SE | cs.CL
YoungHoon Jeon, Suwan Kim, Haein Son, Sookbun Lee, Yeil Jeong
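TL;DR: This paper presents ISD-Agent-Bench, a comprehensive benchmark for evaluating LLM-based Instructional Systems Design (ISD) agents. Using a Context Matrix framework, it generates 25,795 scenarios covering 33 sub-steps of the ADDIE model and 51 contextual variables across 5 categories. A multi-model judging protocol is used to keep evaluation reliable. Experiments on 1,017 test scenarios compare existing agents with new agents grounded in classical ISD theories (such as ADDIE, Dick & Carey, and Rapid Prototyping) and find that combining classical ISD frameworks with modern ReAct-style reasoning performs best.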
Details
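Motivation: LLM-based agents show potential for automating Instructional Systems Design, but evaluating them is difficult because standardized benchmarks are lacking and LLM-as-judge evaluation carries a risk of bias.
Result: Experiments on 1,017 test scenarios show that agents combining classical ISD frameworks (such as ADDIE) with modern ReAct-style reasoning achieve the highest performance, outperforming both pure theory-based agents and technique-only approaches. Theoretical quality correlates strongly with benchmark performance, and theory-based agents show significant advantages in problem-centered design and objective-assessment alignment.
Insight: The main contribution is a large-scale, structured benchmark for evaluating ISD agents (ISD-Agent-Bench), which generates diverse scenarios systematically through a Context Matrix and uses a multi-LLM judging protocol to improve reliability. More broadly, the work highlights the effectiveness of combining classical educational theory with modern AI reasoning techniques (such as ReAct) and provides a foundation for systematic LLM-based ISD research.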
Abstract: Large Language Model (LLM) agents have shown promising potential in automating Instructional Systems Design (ISD), a systematic approach to developing educational programs. However, evaluating these agents remains challenging due to the lack of standardized benchmarks and the risk of LLM-as-judge bias. We present ISD-Agent-Bench, a comprehensive benchmark comprising 25,795 scenarios generated via a Context Matrix framework that combines 51 contextual variables across 5 categories with 33 ISD sub-steps derived from the ADDIE model. To ensure evaluation reliability, we employ a multi-judge protocol using diverse LLMs from different providers, achieving high inter-judge reliability. We compare existing ISD agents with novel agents grounded in classical ISD theories such as ADDIE, Dick & Carey, and Rapid Prototyping ISD. Experiments on 1,017 test scenarios demonstrate that integrating classical ISD frameworks with modern ReAct-style reasoning achieves the highest performance, outperforming both pure theory-based agents and technique-only approaches. Further analysis reveals that theoretical quality strongly correlates with benchmark performance, with theory-based agents showing significant advantages in problem-centered design and objective-assessment alignment. Our work provides a foundation for systematic LLM-based ISD research.