cs.CL

[1] TinyJudge: Unverifiable Constraint Alignment via Lightweight Specialist Ensembles cs.CL | cs.LG

Yirong Zeng, Yufei Liu, Xiao Ding, Yutai Hou, Yuxian Wang

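TL;DR: This paper proposes TinyJudge, a framework that uses an ensemble of small specialist language models (about 0.6B parameters) to evaluate unverifiable constraints (such as tone) in LLM instruction-following tasks. It targets the reward hacking and high computational overhead of existing LLM-as-a-judge reinforcement learning approaches.

Details

Motivation: Existing reinforcement learning with verifiable rewards hits a bottleneck when assessing unverifiable constraints: severe reward hacking and high computational overhead. The authors also find that specific constraints exhibit distinct, high-generalization patterns, which motivates a more efficient and precise evaluation scheme.

Result: Extensive evaluation on five benchmarks shows that TinyJudge outperforms the baselines by about 10% in average performance and by 12% in reward precision, with a 3x speedup in total training time.

Insight: The key idea is to distill expertise from frontier models into several small specialist models and combine them as an ensemble. This gives lightweight, high-precision evaluation of unverifiable constraints and a scalable, robust path for aligning LLMs with unverifiable human instructions.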

Abstract: Instruction Following (IF) is a core capability of LLMs, requiring strict adherence to diverse constraints, ranging from verifiable ones (e.g., output length) to unverifiable ones (e.g., tone). Reinforcement learning with verifiable rewards has emerged as a paradigm for IF tasks, leveraging LLM-as-a-judge to assess unverifiable constraints. However, we empirically find that this approach remains a significant bottleneck, suffering from severe reward hacking and higher computational overhead. In this work, we first analyze the generalization capabilities of unverifiable constraints and discover that specific constraints exhibit distinct, high-generalization patterns. Motivated by this, we propose TinyJudge, a framework that employs an ensemble of specialized tiny language models (~0.6B) to provide rewards for soft constraints. By distilling expertise from frontier models into these tiny models, it achieves high-precision, lightweight evaluation. Extensive evaluations across five benchmarks demonstrate that TinyJudge outperforms the baselines by ~10% in average performance and 12% in reward precision. Crucially, it also achieves a 3x speedup in total training time. Our work provides a scalable and robust path for aligning LLMs with unverifiable human instructions.
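
The idea of routing each soft constraint to its own specialist judge can be sketched as below. This is an illustrative sketch only: the abstract does not say how specialist scores are combined, so the plain average is an assumption, and the two keyword-based judges and the names `SPECIALISTS` and `ensemble_reward` are hypothetical stand-ins for distilled ~0.6B models.

```python
from typing import Callable, Dict, List

# A judge maps a response to a score in [0, 1] for ONE constraint type.
Judge = Callable[[str], float]


def polite_tone_judge(response: str) -> float:
    """Hypothetical stand-in for a distilled specialist that scores politeness."""
    markers = ("please", "thank")
    return 1.0 if any(m in response.lower() for m in markers) else 0.0


def formal_style_judge(response: str) -> float:
    """Hypothetical stand-in for a distilled specialist that scores formality."""
    contractions = ("don't", "can't", "won't")
    return 0.0 if any(c in response.lower() for c in contractions) else 1.0


# One specialist per constraint type (names are illustrative).
SPECIALISTS: Dict[str, Judge] = {
    "polite_tone": polite_tone_judge,
    "formal_style": formal_style_judge,
}


def ensemble_reward(response: str, constraints: List[str]) -> float:
    """Average the scores of the specialists for the constraints in the prompt.

    Averaging is an assumption made for this sketch, not the paper's stated rule.
    """
    if not constraints:
        return 0.0
    scores = [SPECIALISTS[name](response) for name in constraints]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(ensemble_reward("Thank you, I can't attend.", ["polite_tone", "formal_style"]))
```

In an RL loop, `ensemble_reward` would stand where a single large LLM judge is normally called, so only the small specialists relevant to a prompt need to run.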


[2] Evaluating Hallucinations in Domain-Adapted Large Language Models cs.CL | cs.AI

Sanchita Porwal, Sai Prasath S, Xingjian Bi, Madelyn Scandlen

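TL;DR: This study examines hallucinations in domain-adapted large language models (LLMs), focusing on fine-tuning the Llama-2 model with the Lamini dataset. A series of experiments tests the fine-tuned model's memorization, recall, and reasoning. The model performs well on tasks similar to its training data, but its ability to reason accurately about and recall new domain-specific information is limited; it is prone to hallucination and tends to over-generate.

Details

Motivation: LLMs fine-tuned for a specific domain are prone to hallucination (generating nonsensical or unfaithful content), which is a significant challenge. The study evaluates how effective fine-tuning alone is at mitigating it.

Result: The model is proficient on tasks similar to its training data, but on new domain-specific queries its accurate reasoning and recall are limited, which leads to hallucination, and it shows a tendency to over-generate. This points to important limitations of fine-tuning-only approaches for mitigating hallucinations when adapting LLMs to specialized domains.

Insight: The paper's contribution is a systematic evaluation of hallucination in domain-adapted LLMs. It shows that fine-tuning alone is not enough to mitigate hallucination and identifies a comparative weakness in handling domain-specific queries, which underscores the need for more robust methods of adapting LLMs to specialized domains.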

Abstract: This study investigates the phenomenon of hallucinations in domain-adapted Large Language Models (LLMs), focusing on the fine-tuning of the Llama-2 model with the Lamini dataset. Hallucinations, or the generation of nonsensical or unfaithful content by LLMs, pose a significant challenge, especially when these models are fine-tuned with domain-specific data. Our methodology involves a series of experiments testing memorization, recall, and reasoning capabilities of the fine-tuned LLM, comparing its performance on novel question-answer pairs and domain-specific information. We found that while the model shows proficiency in tasks similar to its training data, its capability to accurately reason about and recall new domain-specific information remains limited, leading to instances of hallucination. The model demonstrates a tendency to provide correct answers with extra information, suggesting an inclination toward over-generation. These results suggest important limitations of fine-tuning-only approaches for mitigating hallucinations when adapting LLMs to specialized domains and underscore the need for more robust methods in adapting LLMs to specialized domains. The study also provides insights into the varying performance of LLMs on different types of information, revealing a comparative weakness in handling domain-specific queries.


[3] GraphLoRA: Structure-Aware Low-Rank Adaptation for Large Language Model Recommendation cs.CL | cs.AI

Lin Mu, Guoji Wang, Li Ni, Lei Sang, Zhize Wu

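TL;DR: This paper proposes GraphLoRA, a novel structure-aware low-rank adaptation framework for LLM-based recommendation. It embeds a trainable graph message-passing network in the low-rank adaptation pathway so that structural signals can propagate through the parameter space, deeply integrating graph-structured information with textual semantics.

Details

Motivation: Existing methods either translate collaborative information into textual prompts or inject pre-trained embeddings. Both treat structural information as static input and fail to capture high-order relational dependencies. GraphLoRA aims to close this gap and align graph structure with textual semantics effectively.

Result: Extensive experiments on multiple benchmarks show that GraphLoRA not only outperforms state-of-the-art LLM-based recommendation methods but also achieves superior generalization, balancing structural reasoning capability with computational efficiency.

Insight: The core innovation is generalizing low-rank adaptation from independent to structure-aware propagation. Embedding a graph message-passing network in the parameter-update pathway lets collaborative topology explicitly guide parameter updates, fostering deep integration of graph structure and textual semantics.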

Abstract: Large Language Models (LLMs) have shown strong potential for recommendation (LLMRec) due to their powerful reasoning and generalization abilities. However, effectively aligning the textual semantics modeled by LLMs with the collaborative signals remains a key challenge. Existing methods either translate collaborative information into textual prompts or inject pre-trained embeddings into the LLM, both of which treat structural information as static input and fail to capture high-order relational dependencies. To bridge this gap, we propose GraphLoRA, a novel framework that generalizes low-rank adaptation from independent to structure-aware propagation. GraphLoRA embeds a trainable graph message-passing network within the low-rank adaptation pathway, enabling structural signals to propagate through the parameter space. This design allows collaborative topology to explicitly guide parameter updates, fostering deep integration between graph-structured and textual semantic information. Extensive experiments on multiple benchmarks demonstrate that GraphLoRA not only outperforms state-of-the-art LLM-based recommendation methods but also achieves superior generalization, effectively balancing structural reasoning capability with computational efficiency. Code is available at https://github.com/wgj15965/GraphLoRA.


[4] Post-training is (Massive) Supervised Learning cs.CL | cs.AI | cs.LG

Michael Hassid, Yossi Adi, Roy Schwartz

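TL;DR: This paper argues that the massive post-training phase in the current LLM training paradigm (SFT and RL) is essentially supervised distribution fitting, much like the "pre-train then fine-tune" approach of the BERT era, and tailors models to specific evaluation benchmarks. The authors show experimentally that post-training from random initialization also reaches notable performance on math and code reasoning benchmarks, suggesting that current methods mainly fit the data distribution rather than develop general capability.

Details

Motivation: The paper critiques the current over-reliance on massive post-training to optimize performance on specific benchmarks. It argues that this departs from the original goal of developing general artificial intelligence and returns to the earlier practice of fitting in-distribution datasets.

Result: On math and code reasoning benchmarks (competitive math and programming datasets), models post-trained from random initialization, that is, without pre-training, also achieve non-trivial performance. This supports the claim that post-training mainly acts as distribution fitting.

Insight: The paper's claimed contribution is to show that current LLM post-training is essentially supervised distribution fitting and to call for training paradigms in which models "learn how to learn". Viewed objectively, its core insight questions how necessary pre-training is and stresses that the post-training stage may dominate performance on specific tasks, which is instructive for designing future training strategies.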

Abstract: The prevailing paradigm for training LLMs has evolved to rely on a massive post-training phase consisting of SFT and RL. In this position paper, we argue that this methodology effectively marks a reversion to the "pre-train then fine-tune" approach of the BERT era, explicitly tailoring models to the desired behaviors and specific benchmarks on which they are evaluated. We begin with a historical overview of LLMs, describing the different phases of the LLM evolution. We argue that the current landscape is remarkably similar to the early days of LLMs, where task performance heavily relied on fitting the models to in-distribution datasets. To empirically demonstrate this, we compare pre-trained models to randomly initialized ones, by fine-tuning both variants on modern reasoning datasets and evaluating them on competitive math and code benchmarks. We show that models post-trained from scratch yield highly non-trivial performance. Our findings suggest that current post-training methodologies function primarily as a distribution-fitting mechanism. We finish by positing that developing generally capable models and systems requires moving beyond extensive post-training for predefined behaviors, shifting instead toward training procedures where models "learn how to learn".


[5] CAPruner: Conceptual-Adjacent Scene Graph Pruner for Enhancing 3D Spatial Reasoning of Large Language Models cs.CL | cs.AI | cs.CV | cs.LG | cs.MM

Shengli Zhou, Xiangchen Wang, Guanhua Chen, Feng Zheng

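TL;DR: This paper proposes CAPruner, a Conceptual-Adjacent Scene Graph Pruner that aims to improve the spatial reasoning of large language models on 3D vision-language tasks. It estimates the importance of relations by combining fuzzy semantic relevance with spatial proximity, selecting the critical relations in a task-specific context and avoiding the tendency of existing proximity-based methods to remove task-relevant relations.

Details

Motivation: Existing scene graph pruning methods rely mainly on spatial proximity and often remove task-relevant spatial relations, which undermines reliable 3D spatial reasoning. Addressing this requires a pruning method that preserves the spatial relations most pertinent to the specific 3D-VL task.

Result: Extensive experiments show that CAPruner effectively preserves the relations essential for spatial reasoning, leading to substantial performance improvements of LLMs on 3D-VL tasks.

Insight: The core contribution is a key requirement: scene graph pruning should preserve the spatial relations most relevant to the specific task. The pruner that fuses semantic and spatial information is designed around it. Viewed objectively, training by supervising the aggregated scores of each node's incident edges avoids costly relation-level annotation and is a practical, efficient design.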

Abstract: Large language models (LLMs) have recently been applied to 3D vision-language (3D-VL) tasks, which require spatial reasoning to identify target objects relative to anchors. Scene graphs are commonly employed to represent such relations, but reasoning over complete graphs incurs high token costs and computational inefficiencies, motivating the need for pruning. Existing pruning methods primarily rely on spatial proximity and often remove task-relevant relations, thereby undermining reliable spatial reasoning. To address these limitations, we derive a key requirement for scene graph pruning: preserving spatial relations that are most pertinent to the specific 3D-VL task. Guided by this insight, we propose the Conceptual-Adjacent Scene Graph Pruner (CAPruner). CAPruner integrates fuzzy semantic relevance with spatial proximity to estimate the importance of relations, enabling the selection of critical relations in a task-specific context. Moreover, to avoid costly relation-level annotations, CAPruner is trained by supervising the aggregated scores of each node’s incident edges. Extensive experiments demonstrate that CAPruner effectively preserves relations essential for spatial reasoning, leading to substantial performance improvements of LLMs on 3D-VL tasks. Code is available at https://github.com/fz-zsl/CAPruner.


[6] mllm-shap: A Shapley Value Explainability Platform for Text-Audio Multimodal Large Language Models cs.CL | cs.AI

Jakub Muszyński, Paweł Pozorski, Maria Ganzha

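TL;DR: This paper introduces mllm-shap, an open-source Python framework that extends Shapley value explainability from text-only large language models to multimodal LLMs that process joint text and audio inputs. The framework addresses three challenges unique to the multimodal setting (modality-aware coalition masking, multi-turn conversation tracking, and phonetic alignment-based token grouping) and implements five Shapley value estimation strategies.

Details

Motivation: The goal is to extend mature text-based attribution methods to the multimodal domain, specifically to MLLMs that take text and audio input, and to solve the explainability challenges particular to them: interleaved processing of discrete text tokens and dense audio encoder frames, preserving context across multi-turn conversations, and keeping computation feasible for long audio.

Result: The paper proposes a novel phonetic alignment-based token grouping technique that reduces the coalition space by 10x to 50x, making SV estimation computationally feasible for long-form audio. In addition, the implemented Complementary Contributions estimator converges better than standard Monte Carlo baselines.

Insight: The main contributions are: 1) a modality-aware coalition masking mechanism for text-audio multimodal input; 2) multi-turn conversation tracking that uses per-token metadata to maintain role and modality context; 3) a phonetic alignment-based token grouping technique that substantially reduces computational cost; 4) what the authors describe as the first publicly available, complete, and reproducible SV explainability framework for text-audio MLLMs, with an interactive GUI.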

Abstract: We introduce mllm-shap, an open-source Python framework designed to extend Shapley Value (SV) explainability from text-only Large Language Models to Multimodal LLMs (MLLMs) processing joint text and audio inputs. While text-based attribution is well-studied, mllm-shap addresses three critical challenges unique to the multimodal regime: (1) Modality-aware coalition masking, which manages the interleaved processing of discrete text tokens and dense audio encoder frames. (2) Multi-turn conversation tracking, utilizing per-token metadata to maintain role and modality context. (3) Phonetic alignment-based token grouping, a novel technique that reduces the coalition space by 10x to 50x, rendering SV estimation computationally feasible for long-form audio. The platform implements five SV estimation strategies, including a Complementary Contributions (CC) estimator with Neyman-optimal allocation that demonstrates superior convergence over standard Monte Carlo baselines. mllm-shap is provided as a pip-installable package featuring an interactive web-based GUI for granular attribution visualization. To our knowledge, this is the first publicly available framework providing a complete, reproducible pipeline for SV-based explainability in text-audio MLLMs.
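
The standard Monte Carlo permutation baseline that the abstract compares against can be sketched as follows. This is the textbook estimator, not mllm-shap's own API: the function name `mc_shapley` and the `value_fn` interface are assumptions made for illustration. A "player" here could be a single token or, after grouping, a whole phonetically aligned group, which is how grouping shrinks the coalition space.

```python
import random
from typing import Callable, FrozenSet, List


def mc_shapley(
    n_players: int,
    value_fn: Callable[[FrozenSet[int]], float],
    n_permutations: int = 200,
    seed: int = 0,
) -> List[float]:
    """Estimate Shapley values by averaging marginal contributions over random orderings.

    value_fn(S) is assumed to return the model score when only the players in S
    are unmasked. In practice each call would be one forward pass of the model.
    """
    rng = random.Random(seed)
    phi = [0.0] * n_players
    for _ in range(n_permutations):
        order = list(range(n_players))
        rng.shuffle(order)
        coalition = set()
        prev_value = value_fn(frozenset(coalition))
        for player in order:
            coalition.add(player)
            cur_value = value_fn(frozenset(coalition))
            phi[player] += cur_value - prev_value  # marginal contribution
            prev_value = cur_value
    return [total / n_permutations for total in phi]


if __name__ == "__main__":
    # Toy additive game: each player's Shapley value equals its own weight.
    weights = [1.0, 2.0, 3.0]
    estimate = mc_shapley(3, lambda s: sum(weights[i] for i in s), n_permutations=50)
    print(estimate)
```

Each permutation costs `n_players + 1` evaluations of `value_fn`, so cutting the number of players by grouping reduces the cost of every sample as well as the size of the space being sampled.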


[7] Bridging Traditional Explainability Methods and Multimodal Multilingual Models: An XAI-Based Analysis cs.CL | cs.AI | cs.SD

Paweł Pozorski, Jakub Muszyński, Maria Ganzha

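TL;DR: This paper proposes a Shapley value-based multimodal explainability framework for understanding how text and audio modalities jointly influence the behavior of multimodal large language models (MLLMs). To handle computational complexity and the granularity mismatch between modalities, the authors introduce efficient estimation strategies and a preprocessing method called SGPA. They also release an open-source toolkit and validate the framework on multilingual datasets.

Details

Motivation: MLLMs integrate text and audio effectively, but their internal mechanisms are opaque, and in particular it is unclear how heterogeneous modalities influence model behavior. Traditional Shapley value-based explainability methods are hard to extend directly to the multimodal setting, mainly because of cross-channel dependencies, intricate dialogue structures, and the heavy computational cost of dense audio representations.

Result: The framework is evaluated on curated multilingual subsets of the VoiceBench and Infinity Instruct datasets. The results indicate that input modality is a primary driver of attribution volatility and that standard syntactic importance proxies often fail to predict model attention in multimodal, cross-lingual contexts.

Insight: The main contributions are: 1) a formal extension of the Shapley value framework to the multimodal setting, treating discrete text tokens and aligned audio segments as cooperative features; 2) SGPA (Spectrogram-Guided Phonetic Alignment), a preprocessing method that maps high-frequency audio streams to interpretable, word-aligned segments and resolves the granularity mismatch between modalities; 3) an open-source, model-agnostic Python toolkit and interactive GUI for computing and visualizing multimodal attributions.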

Abstract: Multimodal Large Language Models (MLLMs) effectively integrate text and audio to interpret context in complex interactive dialogues. However, the internal mechanisms by which heterogeneous modalities influence model behavior remain opaque. While Shapley Values (SV) provide a robust, model-agnostic framework for local explainability in text-based NLP, their extension to multimodal data is hindered by cross-channel dependencies, intricate dialogue structures, and the prohibitive computational complexity of dense audio representations. In this work, we formalize a multimodal extension of the Shapley Value framework, treating discrete text tokens and aligned audio segments as cooperative features. To ensure computational feasibility, we deploy a suite of efficient estimation strategies: exact SV computation for low-dimensional inputs and sampling-based approximations - including Monte Carlo permutations and stratified sampling with Neyman-optimal allocation - to minimize variance under constrained computational budgets. To resolve the granularity mismatch between modalities, we propose Spectrogram-Guided Phonetic Alignment (SGPA), a novel preprocessing method that maps high-frequency audio streams to interpretable, word-aligned segments. Our contribution is twofold: first, we provide an open-source, model-agnostic Python package and a companion GUI for the computation and interactive visualization of multimodal attributions. Second, we evaluate our framework using curated subsets of the VoiceBench and Infinity Instruct datasets across diverse multilingual scenarios. Our experimental results reveal that input modality is a primary driver of attribution volatility and demonstrate that standard syntactic importance proxies often fail to predict model attention in multimodal, cross-lingual contexts.
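
Neyman-optimal allocation is the textbook rule for splitting a fixed sampling budget across strata: stratum h receives a share proportional to N_h * sigma_h, its size times its standard deviation. The sketch below shows only that general rule. How the paper defines strata (for Shapley estimation they are often coalition sizes) and how it estimates the standard deviations are not stated in the abstract, so those details, and the function name `neyman_allocation`, are assumptions.

```python
import math
from typing import List, Sequence


def neyman_allocation(
    stratum_sizes: Sequence[float],
    stratum_stds: Sequence[float],
    budget: int,
) -> List[int]:
    """Split `budget` samples across strata in proportion to size * std.

    Uses largest-remainder rounding so the allocations sum exactly to `budget`.
    """
    weights = [n * s for n, s in zip(stratum_sizes, stratum_stds)]
    total = sum(weights)
    if total == 0:
        # No variance information: fall back to an equal split.
        weights = [1.0] * len(weights)
        total = float(len(weights))
    raw = [budget * w / total for w in weights]
    alloc = [math.floor(r) for r in raw]
    leftover = budget - sum(alloc)
    # Hand the remaining samples to the strata with the largest fractional parts.
    by_remainder = sorted(range(len(raw)), key=lambda h: raw[h] - alloc[h], reverse=True)
    for h in by_remainder[:leftover]:
        alloc[h] += 1
    return alloc


if __name__ == "__main__":
    # Two equally sized strata; the second is three times as variable.
    print(neyman_allocation([100, 100], [1.0, 3.0], budget=40))
```

The effect is that noisy strata get more samples and near-constant strata get few, which lowers the variance of the overall estimate for the same number of model evaluations.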


[8] From Architecture to Output: Structural Origins of Hallucination in Large Language Models and the Amplifying Role of Data cs.CL

Md. Rejaul Korim Sadi, Toufiqur Rahman Tasin, Golam Mostofa Naeem

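TL;DR: This paper systematically analyzes the structural origins of hallucination in large language models at the architecture level. It argues that three core design decisions (the self-attention mechanism, the maximum likelihood estimation training objective, and autoregressive decoding) together form a compound failure system that produces hallucinations such as entity confusion, fact misattribution, and semantic drift. The paper maps each mechanism to a specific output category in an existing hallucination taxonomy and argues that dataset pathologies (long-tail deficiencies, training bias, and synthetic pollution) amplify these vulnerabilities rather than cause hallucination independently.

Details

Motivation: Existing hallucination taxonomies (such as that of Alansari and Luqman) describe output types but do not identify the internal mechanism that produced a hallucination. The paper aims to go beyond output-level description and explain the structural causes of hallucination starting from architectural decisions.

Result: The paper reports no quantitative experiments or benchmark results; it is a theoretical analysis. It maps the three architectural mechanisms (self-attention, the MLE training objective, autoregressive decoding) to the taxonomy's intrinsic hallucination, extrinsic hallucination, and logical inconsistency categories respectively, and argues that dataset pathologies play an amplifying role.

Insight: The contribution is to locate the roots of hallucination in specific architectural components, build a mechanism-to-output-type mapping, and separate architectural flaws (root causes) from data flaws (amplifiers). This provides a theoretical basis for designing mitigation at the inference layer rather than relying on output classification alone.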

Abstract: Large language models hallucinate (producing fluent, confident, factually wrong outputs) with a consistency that persists across generations and scales. Existing taxonomies classify hallucination by output type, distinguishing intrinsic from extrinsic failures and faithfulness from factuality divergence. These frameworks are descriptively rigorous but do not identify which internal mechanism produced a given instance. This paper analyses hallucination as a structural consequence of three architectural decisions that together form a compound failure system. Self-attention’s co-occurrence learning substitutes statistical proximity for semantic meaning and produces entity confusion, fact misattribution, and semantic drift. The maximum likelihood estimation training objective optimises next-token probability without factual constraint, rewarding statistically plausible outputs regardless of their truth value. Autoregressive decoding’s permanent left-to-right commitment under exposure bias ensures that a single wrong token cascades forward through the entire output sequence without revision. Dataset pathologies (long-tail deficiencies, training bias, and synthetic pollution) amplify these vulnerabilities but do not independently cause them. We make three contributions. First, we map each mechanism to a specific output category in the Alansari and Luqman taxonomy, locating intrinsic hallucination in self-attention, extrinsic hallucination in MLE, and logical inconsistency in autoregressive decoding. Second, we show that each commonly cited dataset pathology exploits one of these mechanisms rather than originating hallucination independently. Third, we identify the diagnostic limitation of output-type-only classification and contrast it with inference-layer mitigation approaches.


[9] Liberating LLM Capabilities in Full-Duplex Speech Models cs.CL | cs.AI | cs.SD

Luoyuan Zhang, Bokai Xu, Junbo Cui, Weiyue Sun, Yingjing Xu

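TL;DR: This paper proposes Listen-Write-Speak (LWS), a text-first tri-channel paradigm intended to unlock the capabilities of speech-based large language models. Under a shared causal attention context, a single autoregressive LLM continuously listens to user audio, writes visible free-form text as its primary output, and generates a realtime spoken reply in parallel, making text a first-class output channel alongside speech.

Details

Motivation: Speech-based LLMs are typically constrained to spoken replies. This limits their user-facing output to what can be verbalized and suppresses text-native capabilities (code generation, structured analysis, multi-step reasoning) in realtime interactive tasks that need persistent, structured, and inspectable intermediate outputs. Existing work improves spoken reasoning or full-duplex turn-taking but still treats text as a hidden intermediate state or a subordinate modality rather than a first-class output channel.

Result: LWS demonstrates strong full-duplex interaction on Full-Duplex-Bench, reaches 4.72 on VoiceBench AlpacaEval, achieves 92.6% writing-speaking consistency, and consistently outperforms its internal ablations on URO-Bench. The results suggest that visible writing can serve as a first-class output channel for speech interaction without sacrificing realtime responsiveness.

Insight: The core innovation is a text-first tri-channel paradigm implemented entirely through a Token Schema, with no architectural modifications, and learned via a two-stage data pipeline that synthesizes per-second cognitive annotations consistent with the revealed input timeline. This offers a workable technical path to making text the primary output channel while keeping realtime speech interaction.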

Abstract: Speech-based large language models are typically constrained to spoken replies, which limits their user-facing outputs to what can be verbalized and suppresses text-native capabilities such as code generation, structured analysis, and multi-step reasoning in realtime interaction, for tasks that require persistent, structured, and inspectable intermediate outputs. Existing work improves spoken reasoning or full-duplex turn-taking, but still treats text as a hidden intermediate state or a subordinate modality rather than a first-class output channel. We propose Listen-Write-Speak (LWS), a text-first tri-channel paradigm in which a single autoregressive LLM continuously listens to user audio, writes visible free-form text as its primary output, and speaks a realtime oral response in parallel under a shared causal attention context. This behavior is implemented entirely through a Token Schema, requiring no architectural modifications, and learned via a two-stage data pipeline that synthesizes per-second cognitive annotations consistent with the revealed input timeline. Empirically, LWS demonstrates strong full-duplex interaction on Full-Duplex-Bench, reaches 4.72 on VoiceBench AlpacaEval, achieves 92.6% writing-speaking consistency, and consistently outperforms its internal ablations on URO-Bench. These results suggest that visible writing can serve as a first-class output channel for speech interaction without sacrificing realtime responsiveness. The code and dataset are available on the project page: https://royalzhang.com/project/lws-page/.


[10] Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Benchmark Contamination, Convention Mismatch, and an Honest Baseline at 25.6% WER (13.8% cWER) cs.CL | cs.AI | cs.LG | cs.SD

Felix Akeret

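TL;DR: This paper is a systematic study of fine-tuning OpenAI's Whisper large-v3 for Swiss German automatic speech recognition (ASR), using Standard German subtitles as weak supervision. Across 16 iterative training runs it compares LoRA with full fine-tuning, analyzes the root causes of hallucination, and quantifies the effect of data quality, subtitle alignment, and training strategy. It shows that existing state-of-the-art results are inflated by benchmark contamination and releases a model evaluated honestly on strictly disjoint data, with a word error rate of 25.6% and a content word error rate of 13.8%.

Details

Motivation: Swiss German ASR suffers from a lack of high-quality labeled data, so the study explores whether Standard German subtitles can serve as weak supervision for fine-tuning. It also aims to expose the possibility that existing state-of-the-art results are unreliable because of benchmark contamination, and to establish an honest evaluation baseline for the field.

Result: In an honest evaluation on strictly disjoint data from the All Swiss German Dialects Test Set (ASGDTS), the best model reaches 25.6% WER and 13.8% cWER. The study shows that published state-of-the-art results (17.1-17.5% WER) are inflated by benchmark contamination: a Whisper model with no Swiss German training data, self-trained only on the ASGDTS test set, reaches 13.88% WER and surpasses all published systems.

Insight: The contribution lies in systematically using subtitle alignment for weakly supervised fine-tuning and in strictly separating training and test data for honest evaluation. Viewed objectively, its core contribution is to show how strongly benchmark contamination affects ASR evaluation results and to propose more reliable measures such as content WER (cWER) and bias-corrected estimation, which is useful guidance for benchmark construction and model evaluation in low-resource dialect ASR.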

Abstract: We present a systematic study of fine-tuning OpenAI’s Whisper large-v3 for Swiss German ASR, using 1,367 hours of broadcast speech paired with Standard German subtitles as weak supervision. Through 16 iterative training runs on an NVIDIA DGX Spark (Grace Blackwell, 128 GB unified memory, up to 1 PFLOP FP4), we compare LoRA and full fine-tuning of the 1.55B-parameter model, investigate hallucination root causes, and quantify the effect of data quality, subtitle alignment, and training strategy. Our best model achieves 25.6% measured WER on the All Swiss German Dialects Test Set (ASGDTS) in an honest evaluation on strictly disjoint data. A harmonized error analysis separating genuine errors from valid stylistic variation (tense, word order, Swiss orthography) yields a content WER (cWER) of 13.8%, counting only actual recognition failures. Bias-corrected estimation reduces this to 8.5%, suggesting the true error rate is roughly one third of measured WER. We demonstrate that published state-of-the-art Swiss German ASR results (17.1-17.5% WER) are inflated by benchmark contamination: a vanilla Whisper model self-trained on the ASGDTS test set with zero Swiss German data achieves 13.88% WER, surpassing all published systems. Experiments with Phi-4-multimodal show an even stronger memorization effect (3.9% WER), revealing that the benchmark primarily measures convention matching rather than dialectal comprehension. We release two models, a LoRA adapter (25.32% WER, 13.9% cWER) and a full fine-tuned model (25.60% WER, 13.8% cWER), among the few publicly available, honestly evaluated Whisper models for Swiss German, under Apache 2.0 with full reproducibility, requiring no institutional data agreements.
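
For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the number of reference words. The sketch below implements that standard definition only; it is not the paper's evaluation code, and it does not reproduce cWER, which depends on the paper's harmonized error analysis. The function name `wer` and the whitespace tokenization are assumptions made for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = cur[j - 1] + 1  # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of four reference words is dropped.
    print(wer("das ist ein Test", "das ist Test"))
```

Because every mismatch counts equally, a valid Standard German rendering that differs from the reference in tense or word order is penalized like a real recognition failure, which is the convention mismatch the abstract separates out with cWER.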


[11] Unlocking Latent Value: Taxonomy-Guided Recovery of High-Performing Data from Low-Tier Web Corpora cs.CL

Neeraj Varshney, Sanket Lokegaonkar, Nasser Zalmout, Qingyu Yin, Priyanka Nigam

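TL;DR: This paper proposes a taxonomy-based framework for recovering high-performing data from deprioritized web corpora. By introducing two new semantic dimensions (timeliness and cultural specificity) and using efficient annotation and filtering strategies, it improves data quality enough that filtered lower-tier data outperforms unfiltered higher-tier data on several benchmarks.

Details

Motivation: Existing web data curation pipelines usually collapse document quality into a single composite score, so high-value content along dimensions the scorer underweights is systematically missed. The paper aims to recover this latent high-value content from deprioritized web data through multi-dimensional, semantically driven filtering.

Result: On mid-tier data, the best taxonomy filter improves over its unfiltered baseline by 12.1% on reasoning, 9.5% on coding, and 2.0% on knowledge benchmarks, and exceeds unfiltered top-tier data by 6.7% on reasoning and 13.7% on coding. Filtered data from tiers below the typical production threshold surpasses top-tier data on coding benchmarks.

Insight: The contribution is two new taxonomy dimensions, timeliness and cultural specificity, that correlate weakly with existing dimensions, together with a compute-efficient two-pass filtering framework (first identify the strongest dimension signals, then build compound filters). The results indicate that multi-dimensional semantic taxonomy filtering is a principled, compute-efficient way to unlock the latent value in deprioritized web corpora.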

Abstract: Dominant web data curation pipelines for pretraining collapse document quality into a single composite score, systematically missing high-value content along dimensions the scorer underweights. We present a taxonomy-driven framework that recovers this value by filtering along semantically meaningful dimensions that composite scores fail to capture. First, building on the ESSENTIAL-WEB taxonomy, we introduce two novel dimensions: timeliness and cultural specificity, both of which show low pairwise NMI with existing ones. We annotate 14M documents using Qwen2.5 32B and distill into a lightweight 0.5B model. To enable rapid corpus-wide annotation, we additionally train a 73M multi-task MLP on E5 embeddings, achieving 50x inference throughput. Second, to navigate the combinatorial explosion of filter configurations, we introduce a compute-efficient two-pass framework: Pass 1 identifies the strongest dimension signals at small scale; Pass 2 constructs and evaluates conjunctive and disjunctive compound filters from the top performers - identifying high-performing configurations at a fraction of full scaling-law cost. Applying the selected filters to deprioritized web data, taxonomy-filtered subsets outperform their unfiltered baselines and even surpass the highest-quality tier. On mid-tier data, our best filter improves over its unfiltered baseline by 12.1% on reasoning, 9.5% on coding, and 2.0% on knowledge benchmarks, exceeding unfiltered top-tier data by 6.7% on reasoning and 13.7% on coding. Furthermore, filtered data from two tiers below the typical production threshold improves by 22.3% on reasoning and 19.5% on coding over its unfiltered baseline, surpassing top-tier data on coding benchmarks. These results establish that vast latent value remains locked in deprioritized web data, and that multi-dimensional taxonomy filtering is a principled, compute-efficient key to unlocking it.


[12] SLMJury: Can Small Language Models Judge as Well as Large Ones? cs.CL | cs.AI | cs.LG

Anish Laddha, Nitesh Pradhan, Gaurav Srivastava

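TL;DR: This paper introduces SLMJury, a framework for evaluating small language models (SLMs) as judges under two paradigms: closed-ended binary correctness and open-ended quality scoring. It benchmarks 16 SLM judges (0.6B-14B parameters) on ten benchmarks and finds that reliable automated evaluation does not necessarily require large proprietary models, although no single SLM dominates across all tasks.

Details

Motivation: Large language models (LLMs) are widely used as judges for evaluating model outputs, but their high cost, latency, and opacity limit scalability. The study therefore explores the potential of small language models as judges.

Result: The study evaluates 16 SLM judges on ten benchmarks (eight closed-ended tasks plus SummEval and MT-Bench). Key findings: on mathematical judging, quick 10-token verdicts match or beat extended reasoning (by 2-7%), while on general tasks reasoning wins (by up to 23%); the best binary judge (Phi-4) drops to rank 9 on MT-Bench; and under the multi-agent debate protocol, debate degrades accuracy across all tested configurations.

Insight: The paper contributes a systematic framework for evaluating SLM judges and shows the domain dependence of judging, the differences in generalization between model families, the different capabilities that closed-ended and open-ended judging draw on, and the negative effect of multi-agent debate on judging tasks. These findings are useful for building low-cost, scalable automated evaluation systems.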

Abstract: Large language models (LLMs) are widely used as judges for evaluating model outputs, but their high cost, latency, and opacity limit scalability. We introduce SLMJury, a framework for evaluating small language models (SLMs) as judges across two paradigms: closed-ended binary correctness and open-ended quality scoring. We benchmark 16 SLM judges (0.6B-14B parameters) from four model families across ten benchmarks: eight closed-ended tasks spanning mathematical, scientific, and general reasoning (N=64,824 judgments per configuration), plus SummEval and MT-Bench for summarization and conversational scoring. We formalize judging as a budget-conditioned function and study five dimensions. Four findings emerge. (1) The overthinking effect is domain-dependent: for most judges quick 10-token verdicts match or beat extended reasoning on mathematical judging (by 2-7% where they help), while reasoning wins on general tasks by up to 23%. (2) Domain generalization separates model families, with math-to-general accuracy gaps ranging from under 10% to nearly 40%. (3) Closed-ended and open-ended judging draw on different capabilities: the best binary judge (Phi-4) drops to rank 9 on MT-Bench, while reasoning-trained models invert this ordering. (4) Under the Reflect-Critique-Refine (RCR) debate protocol, multi-agent debate degrades accuracy across all tested configurations, whereas the top judges resist six adversarial personas with <=0.55% variance. Reliable automated evaluation does not require large proprietary models, yet no single SLM dominates. The leaderboard is available at https://anishh15.github.io/SLMJury/, and our framework code and pip package are publicly available at https://github.com/anishh15/SLMJury and https://pypi.org/project/slmjury/.


[13] ROSUM-MCTS: Monte Carlo Tree Search-Inspired HDL Code Summarization with Structural Rewards cs.CL

Prashanth Vijayaraghavan, Charles Mackin, Luyao Shi, Apoorva Nitsure, Ashutosh Jadhav

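TL;DR: This paper proposes ROSUM-MCTS, a method for summarizing hardware description language (HDL) code. Inspired by Monte Carlo Tree Search (MCTS), it refines summaries through structured exploration and reinforcement-driven optimization. The method integrates local and global context and optimizes with a composite reward function that balances functional correctness, local content adequacy, and fluency.

Details

Motivation: Large language models have shown promise in code summarization, but their effectiveness for hardware description languages such as VHDL and Verilog remains underexplored. The paper targets HDL code summarization as a specific and challenging task.

Result: Evaluation on the VHDL-eval and Verilog-eval datasets shows that ROSUM-MCTS consistently outperforms baseline methods. Ablation studies confirm that both local and global expansion strategies are needed and that functional correctness must be balanced against local content adequacy. The method is also robust to superficial modifications such as variable renaming, maintaining summary quality where baselines degrade.

Insight: The main contribution is combining the structured exploration of Monte Carlo Tree Search with reinforcement-based optimization for HDL code summarization, together with a composite reward function that balances several criteria. Viewed objectively, the hierarchical candidate expansion mechanism and the robustness to HDL-specific surface changes are directions worth borrowing.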

Abstract: Large language models (LLMs) have shown promise in code summarization, yet their effectiveness for Hardware Description Languages (HDLs) like VHDL and Verilog remains underexplored. We propose ROSUM-MCTS, an LLM-guided approach inspired by Monte Carlo Tree Search (MCTS) that refines summaries through structured exploration and reinforcement-driven optimization. Our method integrates both local and global context via a hierarchical candidate expansion mechanism and optimizes summaries using a composite reward function balancing functional correctness (FC), local content adequacy (LCA), and fluency. We evaluate ROSUM-MCTS on the VHDL-eval and Verilog-eval datasets, demonstrating its consistent outperformance over baseline methods by leveraging structured bottom-up refinement and reinforcement-based optimization. Ablation studies confirm the necessity of both local and global expansion strategies, as well as the importance of balancing FC and LCA for optimal performance. Furthermore, ROSUM-MCTS proves robust against superficial modifications, such as variable renaming, maintaining summary quality where baselines degrade. These results establish ROSUM-MCTS as an effective and robust HDL summarization framework, paving the way for further research into reinforcement-enhanced code summarization.


[14] Customer-Agent: Overcoming Context Limitations in Ultra-Long Shopping Trajectories via Tool-Augmented Agents and RLVR cs.CL

Hongye Liu, Rongmei Lin, Anurag Kashyap, Hejie Cui, Ricardo Henao

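TL;DR: To address the LLM context-window limit in analyzing ultra-long shopping trajectories, this paper introduces ShopTrajQA, a long-context evaluation benchmark, and a customer agent framework built on tool-augmented agents and an RLVR training paradigm. The framework stores trajectories as external files and trains the agent to retrieve and parse them autonomously, bypassing the LLM's fixed context limit.

Details

Motivation: Real-world shopping trajectories can span many years and reach tens of thousands of tokens, which poses a significant challenge for existing LLMs. At the same time, real e-commerce data is hard to obtain because of privacy constraints, and existing benchmarks are limited to short trajectories.

Result: On the authors' ShopTrajQA benchmark (with 32k and 64k token variants), the proposed framework achieves strong performance and generalizes to other complex reasoning tasks.

Insight: The contribution is combining tool-augmented agents (for example, a code interpreter issuing SQL queries) with Reinforcement Learning with Verifiable Rewards (RLVR): ultra-long trajectories are stored externally and retrieved dynamically through agent interaction. This is a practical system design for working around the LLM context-length bottleneck.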

Abstract: Understanding customer shopping trajectories is essential for enabling personalized shopping experiences. However, shopping records (i.e., customer’s search, clicks, purchases, etc.) often span long time horizons over multiple years, resulting in extremely long trajectories that pose significant challenges for existing large language models (LLMs). Despite the importance of this problem, existing benchmarks are limited to short customer trajectories, while real-world trajectories from large e-commerce platforms are rarely accessible due to data privacy constraints. To address this gap, we introduce ShopTrajQA, a long-context evaluation benchmark constructed from real-world product information and simulated shopping trajectories. The dataset includes variants of up to 32k and 64k tokens, enabling systematic evaluation of model robustness under varying context lengths. Through comprehensive benchmarking of frontier LLMs, we identify critical performance gaps in reasoning over long shopping trajectory data. To address these challenges, we propose a Customer Agent Framework for ultra-long context management. Leveraging a Reinforcement Learning with Verifiable Rewards (RLVR) agentic training paradigm, our approach stores trajectories as external local files and trains the agent to autonomously retrieve and parse them through code-interpreter interactions (e.g., SQL queries), effectively bypassing the fixed in-context window constraints of LLMs. Experimental results demonstrate that our framework achieves strong performance for ShopTrajQA and shows generalization to other complex reasoning tasks.


[15] Summarization is Not Dead Yet cs.CL | cs.AI

Dongqi Liu, Chenxi Whitehouse, Zheng Zhao, Zhuchen Cao, Jian Li

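TL;DR: This paper re-examines the performance of large language models (LLMs) on text summarization through a multi-track evaluation. Although LLM outputs are preferred for surface fluency, human reference summaries still hold advantages in informativeness and faithfulness and are more reliable factually, indicating that summarization remains an open problem.

Details

Motivation: In response to the popular claim that LLM-generated summaries may surpass human references, the paper tests this claim through systematic evaluation and asks whether summarization is still an open research problem.

Result: Evaluation on five diverse datasets and five state-of-the-art LLMs shows that human reference summaries outperform LLMs on informativeness and faithfulness, while LLMs are preferred mainly for surface-level coherence and fluency. Factuality verification shows that human summaries are more reliable, particularly for claims involving reasoning or synthesis.

Insight: The study reveals stylistic homogeneity in LLM summaries and finds that current LLMs have raised the floor of summarization quality while their performance ceiling remains below human capability. This points to directions for future work such as improving factuality and diversity.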

Abstract: The progress of large language models (LLMs) has fueled claims that model-generated summaries rival or even surpass human-written references, raising questions about whether summarization remains an open research problem. We re-examine this narrative through a multi-track evaluation covering five diverse datasets and five state-of-the-art LLMs, combining controlled human assessment, bias-mitigated LLM-as-Judge protocols, factuality verification against external knowledge, and corpus-level linguistic analysis. Our findings reveal a more nuanced landscape in which human reference summaries continue to demonstrate advantages in informativeness and faithfulness, whereas LLM outputs are preferred mainly for surface-level coherence and fluency. Factuality verification indicates that human references remain more reliable, particularly for claims involving reasoning or synthesis, and linguistic analysis uncovers a pattern of stylistic homogeneity across different models. These observations suggest that current LLMs have raised the floor of summarization quality, but the ceiling of their performance remains below human capabilities.


[16] Rewrite to Translate, Translate to Reward: Reinforcement Learning for Source Rewriting in Machine Translation cs.CL | cs.AIPDF

Boxuan Lyu, Haiyue Song, Zhi Qu, Hidetaka Kamigaito, Kotaro Funakoshi

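TL;DR: This paper proposes RLSR (Reinforcement Learning for Source Rewriting), a framework that trains a source rewriting model with reinforcement learning to improve machine translation quality without manually tuning prompts for each translation model. The method uses the improvement in downstream translation quality as the reward and is validated on 6 translation models and 16 language pairs.

Details

Motivation: To remove the need to manually tune prompts for each machine translation model when prompting LLMs directly to generate source rewrites, enabling automated and efficient optimization of source rewriting.

Result: In experiments on 6 MT models and 16 language pairs, the 4B-parameter RLSR rewriting model significantly outperforms the no-rewriting baseline and same-scale prompt-based rewriting baselines, and performs competitively with prompt-based baselines built on a 235B LLM.

Insight: The key idea is to cast source rewriting as a reinforcement learning problem that uses the improvement in translation quality directly as the reward signal, avoiding laborious prompt engineering and yielding model-agnostic source rewriting optimization.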

Abstract: Although directly prompting off-the-shelf Large Language Models (LLMs) to generate meaning-preserving source rewrites can effectively enhance Machine Translation (MT) quality, doing so requires manually tuning prompts for different MT models. In this work, we propose RLSR (Reinforcement Learning for Source Rewriting), a novel RL-based framework for training a source rewriting model without tuning prompts for each MT model. RLSR optimizes the rewriting model by directly using the improvement in downstream translation quality yielded by each rewritten source as the reward. Extensive experiments across six MT models and 16 language pairs demonstrate that our 4B rewriting models trained via RLSR significantly outperform the no-rewriting baseline and existing same-scale prompt-based rewriting baselines, while achieving competitive performance against prompt-based baselines based on the 235B LLM.
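
The reward described in the abstract can be sketched as a difference of two quality scores. This is a minimal illustration, not the authors' implementation: `translate` and `quality` are hypothetical callables standing in for an MT model and a translation quality metric, and scoring both translations against the original source is an assumption.

```python
from typing import Callable


def rewrite_reward(
    source: str,
    rewrite: str,
    translate: Callable[[str], str],
    quality: Callable[[str, str], float],
) -> float:
    """Quality gain from translating the rewrite instead of the source.

    Assumption: both translations are scored against the original source.
    """
    baseline = quality(source, translate(source))
    rewritten = quality(source, translate(rewrite))
    return rewritten - baseline


# Toy stand-ins (hypothetical): the "MT model" drops words it does not know,
# and the "metric" is the translation length relative to the source length.
KNOWN_WORDS = {"the", "cat", "sat", "rested"}


def toy_translate(text: str) -> str:
    return " ".join(w for w in text.split() if w in KNOWN_WORDS)


def toy_quality(source: str, translation: str) -> float:
    return len(translation.split()) / max(len(source.split()), 1)


reward = rewrite_reward("the cat reposed", "the cat rested", toy_translate, toy_quality)
```

A positive reward means the downstream translation improved after rewriting; an unchanged source yields a reward of zero.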


[17] What’s the Point? Spatial Grammar & Index Resolution for Sign Language Processing cs.CL | cs.AIPDF

Oline Ranum, Simon Hadfield, Richard Bowden

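TL;DR: Addressing the under-modeling of spatial indexing (pointing gestures) in sign language processing, this paper proposes a framework that decomposes the problem into index detection and discourse entity linking. The framework is used to train and evaluate indexing expert models and to strengthen how sign language recognition systems handle non-lexical constructions.

Details

Motivation: Current sign language models are trained mainly with gloss-sequence or text supervision and neglect non-lexical and productive constructions such as spatial indexing, so pointing gestures, which make up 10-15% of signing content, are poorly modeled.

Result: In sign language recognition evaluation, indexing is poorly recovered; the proposed framework establishes a baseline for index-aware sign language modeling and augments a frozen sign language recognition model at inference time with an auxiliary indexing expert.

Insight: The novelty lies in decomposing spatial reference resolution into two subtasks, detection and linking; the resulting mention representations support automatic annotation and modeling of non-lexical structure, offering a finer-grained view of spatial grammar for sign language processing.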

Abstract: Sign language models are predominantly trained with gloss-sequence or text supervision, thereby under-modeling non-lexical and productive constructions. One comparatively tractable instance is spatial indexing: pointing gestures that assign discourse entities to spatial loci for subsequent co-reference, which lexicon-centric objectives largely fail to capture. We present a targeted evaluation of indexing in Sign Language Recognition, showing that despite comprising 10-15% of signing content, indexing is poorly recovered. We introduce a framework for training and evaluating indexing experts, establishing a baseline for index-aware sign language modeling. Our approach decomposes spatial reference resolution into index detection and discourse entity linking. The resulting mention representations enable automatic annotation and non-lexical structure modeling, and serve as an auxiliary indexing expert that augments a frozen SLR model at inference time.


[18] Diffusion Language Model Parallel Decoding via Product-of-Experts Bridge cs.CLPDF

Juntong Shi, Brian L. Trippe, Jure Leskovec, Stefano Ermon, Minkai Xu

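TL;DR: This paper proposes PoE-Bridge, a decoding framework that introduces an intermediate distribution based on a Product-of-Experts (PoE) to bridge the distribution gap between diffusion language models (DLMs) and autoregressive (AR) models. The framework first uses the DLM to draft multiple continuations in parallel, then uses rejection sampling and importance sampling to move the candidate sequences step by step toward the AR target distribution, markedly improving generation quality while keeping the speed advantage of parallel DLM decoding.

Details

Motivation: DLMs gain substantial speed from parallel decoding, but the lack of dependencies between tokens leaves their generation quality behind AR models. Existing methods try to close this gap with importance sampling, but the large difference between the DLM and AR distributions means sampling needs many particles and is computationally expensive.

Result: On challenging mathematical reasoning and coding tasks, PoE-Bridge achieves a 5x speedup over the standard DLM decoding approach and recovers at least 95% of the target AR model's performance, closing most of the quality gap.

Insight: The core idea is to construct the PoE of the DLM proposal distribution and the AR target distribution as an intermediate distribution, which greatly reduces the distribution gap and makes efficient rejection sampling and importance sampling feasible. Mixed-temperature sampling and elastic rejection windows further improve generation diversity and verification efficiency.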

Abstract: Diffusion language models (DLMs) offer substantial speed advantages through parallel decoding, but the lack of token dependencies limits generation quality compared to autoregressive (AR) models. Recent progress attempts to bridge the gap via importance sampling, with the DLM as the proposal and the AR model as the target. However, due to the huge gap between their distributions, the sampling requires a large number of particles and is thus expensive to compute. In this paper, we introduce PoE-Bridge, a novel decoding framework that drastically improves generation speed and accuracy by introducing an intermediate distribution to bridge the gap. The distribution is constructed as a Product-of-Experts (PoE) of the DLM proposal and the AR target. With the intermediate distribution, we first use the DLM to draft multiple continuations in parallel, then apply rejection sampling to verify the drafted tokens and move the resulting candidates toward the PoE. We then use importance sampling to further correct the PoE-aligned candidates toward the AR target. We further propose several improved techniques, including mixed-temperature sampling for enhanced diversity and elastic rejection windows for reducing wasted verification. Empirically, PoE-Bridge achieves significantly improved accuracy with a $5\times$ speedup over the standard DLM decoding approach, and recovers at least 95% of the target AR model's performance, closing most of the quality gap on challenging mathematical reasoning and coding tasks. Our code is available at https://github.com/juntongshi48/poe-bridge.
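
To make the intermediate distribution concrete, the sketch below builds a Product-of-Experts over a single token position and computes self-normalized importance weights toward the target. It is an illustration under stated assumptions, not the released code: the equal-weight product, the toy three-token vocabulary, and the function names are all hypothetical.

```python
from typing import List, Sequence


def product_of_experts(q: Sequence[float], p: Sequence[float]) -> List[float]:
    """Normalized elementwise product of two categorical distributions."""
    raw = [qi * pi for qi, pi in zip(q, p)]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("the two experts share no support")
    return [r / total for r in raw]


def self_normalized_weights(
    target: Sequence[float], proposal: Sequence[float], samples: Sequence[int]
) -> List[float]:
    """Importance weights target(x) / proposal(x), normalized to sum to 1."""
    weights = [target[x] / proposal[x] for x in samples]
    total = sum(weights)
    return [w / total for w in weights]


dlm = [0.5, 0.3, 0.2]  # toy proposal over a 3-token vocabulary (assumed)
ar = [0.1, 0.1, 0.8]   # toy target over the same vocabulary (assumed)
bridge = product_of_experts(dlm, ar)

# Correct candidates drawn from the bridge toward the AR target.
weights = self_normalized_weights(ar, bridge, samples=[0, 1, 2])
```

In this toy case the bridge puts more mass than the proposal on the token the target favors, so the importance weights needed for the final correction are less extreme than they would be when correcting from the proposal directly.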


[19] SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models cs.CLPDF

Ayah Al-Naji, Edoardo Fazzari, Saif Alkindi, Hamdan Alhadhrami, Preslav Nakov

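TL;DR: This paper introduces SurgiQ, a large-scale multi-domain benchmark for evaluating how well large language models understand surgery. It contains 13,055 four-option multiple-choice questions covering six surgical domains and four question formats, built through a multi-stage generation, verification, and expert-audit pipeline. The authors evaluate 35 open-weight models and find that the best model reaches 68.1% accuracy, leaving substantial headroom, and that general-purpose models outperform most biomedical models.

Details

Motivation: Reliable evaluation of large language models in surgery is still underdeveloped. Existing medical benchmarks mainly test clinical knowledge, whereas surgery requires procedural reasoning, management trade-offs, negation handling, and selection among plausible operative decisions.

Result: 35 open-weight models were evaluated under a unified log-likelihood protocol. The best model reaches 68.1% accuracy, while smaller models stay near the 25% random baseline; general-purpose models, especially Qwen2.5, outperform most biomedical models.

Insight: The contribution is SurgiQ, a source-grounded, text-only, large-scale, multi-domain benchmark specific to surgery, whose construction pipeline (generation, verification, expert audit) safeguards quality. The analysis shows that current medical-specialized models lack broad surgical coverage and that even strong models make confident mistakes on clinically plausible distractors, underscoring the need for more reliable and broader surgical evaluation.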

Abstract: Reliable evaluation of large language models in surgery remains underdeveloped. Broad medical benchmarks test clinical knowledge, while surgery requires procedural reasoning, management trade-offs, negation handling, and selection among plausible operative decisions. We present SurgiQ, a text-only, source-grounded benchmark of 13,055 four-option multiple-choice questions spanning six surgical domains and four question formats: case-based, reasoning, best-option, and negative. SurgiQ is constructed from surgical textbooks, open-access papers, and examination material using a multi-stage generation, verification, and expert-audit pipeline. We evaluate 35 open-weight LLMs under a unified log-likelihood protocol. Our results show substantial remaining headroom: smaller models often remain near the 25% random baseline, while the best model reaches 68.1% accuracy. General-purpose models, especially Qwen2.5, outperform most biomedical models, suggesting that current medical specialization does not yet provide sufficiently broad surgical coverage. Calibration and error analysis further show that even strong models make confident mistakes on clinically plausible distractors, motivating more reliable and broader surgical LLM evaluation.
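
Log-likelihood scoring of multiple-choice questions can be sketched as follows. This is a generic illustration rather than the benchmark's evaluation harness: it assumes each option already has a log-likelihood assigned by the model, and it omits details the abstract does not specify, such as length normalization.

```python
from typing import Sequence


def predict(option_loglikes: Sequence[float]) -> int:
    """Index of the option with the highest log-likelihood."""
    return max(range(len(option_loglikes)), key=lambda i: option_loglikes[i])


def accuracy(all_loglikes: Sequence[Sequence[float]], gold: Sequence[int]) -> float:
    """Fraction of questions where the top-scoring option is the gold one."""
    correct = sum(predict(ll) == g for ll, g in zip(all_loglikes, gold))
    return correct / len(gold)


# With four options, uniform guessing is right a quarter of the time.
RANDOM_BASELINE = 1 / 4

# Made-up log-likelihoods for two questions (illustrative only).
scores = [[-3.2, -1.1, -4.0, -2.5], [-0.9, -0.7, -2.0, -3.1]]
acc = accuracy(scores, gold=[1, 0])
```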


[20] Aligned but Not Partner-Specific: Distinguishing How Multimodal LLM Agents Succeed in Reference Games Without Human-Like Conventions cs.CL | cs.AIPDF

Po-Ya Angela Wang, Chinmaya Mishra, Aslı Özyürek, Paula Rubio-Fernández, Esam Ghaleb

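TL;DR: This paper studies the collaborative behavior of multimodal large language model (MLLM) agents in repeated reference games. Comparing them with human dialogue, it finds that MLLM agents achieve label alignment but do not form the compact, efficient referring conventions that humans build from history with a specific partner.

Details

Motivation: To determine whether the label alignment MLLM agents show in repeated interaction stems from shared history with a specific partner (that is, from forming conventions) or merely from a generic task vocabulary, and thereby to distinguish their coordination mechanism from that of human dialogue.

Result: In experiments on the KTH Tangrams corpus, a pseudo-dyad baseline that breaks partner history shows that MLLM agents differ clearly from humans on three levels: task competence, description strategy, and alignment dynamics. Humans reduce effort through entrainment, compressing descriptions and increasing label alignment, whereas MLLM agents hold a fixed, high effort level from round one and produce verbose descriptions. Their label alignment does not differ statistically between real and pseudo dyads, indicating that their coordination does not depend on partner-specific history.

Insight: The contribution is a constrained pseudo-dyad baseline that isolates the effect of partner-specific history. It reveals that MLLM agents coordinate through verbose description and generic vocabulary rather than by forming the efficient, history-dependent referring conventions of human dialogue, a key observation for understanding coordination in AI dialogue systems.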

Abstract: Repeated reference games test whether interlocutors replace their initially long descriptions with shorter, partner-specific conventions grounded in shared interaction history. Prior work shows that multimodal LLMs fail to become more efficient across rounds, although they align on the labels they use. How can we determine whether this alignment reflects partner-specific grounding rather than a shared task vocabulary? We address this question by comparing capable multimodal agent dyads with human dyads from the KTH Tangrams corpus. Our novel methodological contribution is a constrained pseudo-dyad baseline that matches the original referential task structure, but breaks partner history. This baseline enables us to test whether the observed label alignment depends on interaction with a specific partner. Across three analytic layers (task competence, description strategy, alignment dynamics), we find clear differences. Humans reduce effort through entrainment, compressing descriptions and increasing label alignment with partners. Agents instead maintain fixed effort levels, producing verbose descriptions from round one, with near-ceiling label overlap that is statistically indistinguishable between real and pseudo dyads. MLLMs thus achieve coordination without convention, succeeding by verbose description rather than by forming the compact, history-dependent referring expressions characteristic of human dialogue.


[21] GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models cs.CL | cs.AIPDF

Ryner Tan, Wenxuan Zhang

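TL;DR: This paper proposes GlobeAudio, a multilingual, multicultural benchmark for evaluating the naturalistic audio understanding of large audio-language models (LALMs). It contains 5,637 multiple-choice questions in 6 languages, written by native speakers and grounded in real audio, and requires higher-level auditory reasoning and culturally grounded interpretation. The authors systematically evaluate closed-source and open-source LALMs as well as cascaded ASR-LLM pipelines, revealing performance gaps under natural acoustic conditions.

Details

Motivation: Current evaluation of LALMs lacks real-world linguistic and cultural authenticity and acoustic realism, and so falls short of real-world requirements.

Result: Experiments show substantial performance gaps under natural acoustic conditions, particularly for open-source models and low-resource languages, highlighting the limitations of current LALMs.

Insight: The contribution is a multilingual, multicultural evaluation benchmark grounded in real audio and designed by native speakers, which emphasizes the importance of cultural understanding and natural acoustic conditions for model evaluation.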

Abstract: Large Audio-Language Models (LALMs) integrate audio perception and language understanding within a unified framework, enabling a wide range of real-world applications. Despite recent advances, evaluation for LALMs remains heavily underspecified relative to real-world requirements: most benchmarks lack true linguistic and cultural authenticity, while others fail to capture acoustic realism. To bridge this gap, we propose GlobeAudio, a multilingual and multicultural benchmark designed to evaluate naturalistic audio understanding. GlobeAudio consists of 5,637 multiple-choice questions across six typologically diverse languages, expertly crafted by native speakers and grounded in naturally occurring audio. To do well, models must possess higher-level auditory reasoning skills and culturally grounded interpretation. We systematically evaluate representative closed-source and open-source LALMs, as well as cascaded ASR-LLM pipelines. Our experiments reveal substantial performance gaps under natural acoustic conditions, particularly for open-source models and low-resource languages. These findings highlight critical limitations of current LALMs and underscore the importance of naturalistic audio evaluation for future audio-language systems. GlobeAudio can be found at https://huggingface.co/datasets/iNLP-Lab/GlobeAudio.


[22] ZAS-SQL: Distilling Rules from Failures for Zero-Shot Text-to-SQL cs.CLPDF

Hongzhou Zheng, Yixin Gou, Wenjia Zhang

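TL;DR: This paper proposes ZAS-SQL, a fully zero-shot Text-to-SQL framework. It distills core generation rules from LLM failure cases through a Map-Reduce-based rule distillation pipeline and improves generation quality with three complementary modules: knowledge-augmented schema representation, a rule-driven structured reasoning framework, and execution-guided early stopping.

Details

Motivation: To address the fact that existing zero-shot Text-to-SQL methods lag behind few-shot methods for lack of effective generation constraints, and to remove the reliance of few-shot methods on demonstrations, improving cross-domain generalization and context-window efficiency.

Result: On Spider, the framework reaches up to 87.2% and 88.6% execution accuracy on the Dev and Test sets, respectively, setting a new zero-shot SOTA and surpassing multiple few-shot and fine-tuning methods built on GPT-4/4o. On the domain-specific dataset UrbanPlan it reaches 81.3% accuracy, supporting its cross-domain generalization.

Insight: The core idea is to distill rules from the systematic failure patterns of LLMs to guide generation, rather than relying on human annotation or demonstrations. The knowledge-augmented schema representation, structured reasoning framework, and low-cost self-correction mechanism offer a reusable, systematic approach to improving the accuracy and robustness of zero-shot SQL generation.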

Abstract: Text-to-SQL translates natural language into executable SQL queries. Few-shot in-context learning methods built upon large language models (LLMs) achieve strong performance, yet their reliance on demonstrations limits cross-domain generalization and consumes substantial context window space. Existing zero-shot methods, lacking effective generation constraints, still fall short of few-shot approaches. We observe that LLM failures in zero-shot Text-to-SQL are not random but exhibit systematic, recurring patterns. Building on this observation, we propose a fully zero-shot Text-to-SQL framework that distills core generation rules from failure cases through a Map-Reduce-based rule distillation pipeline and improves generation quality via three complementary modules: knowledge-augmented schema representation, which supplements missing semantics in Data Definition Language; a rule-driven structured reasoning framework that suppresses structural deviations; and Execution-Guided Early Stopping, which enables low-cost self-correction. On Spider, the proposed framework achieves up to 87.2% and 88.6% execution accuracy on the Dev and Test sets, respectively, establishing a new zero-shot state-of-the-art and surpassing multiple few-shot and fine-tuning methods built upon GPT-4/4o. On the domain-specific dataset UrbanPlan, it achieves 81.3%, confirming that the rule distillation approach generalizes across domains. Moreover, when equipped with a 4B-parameter model, the framework surpasses zero-shot baselines of leading closed-source models, demonstrating strong model generality.
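
One plausible reading of execution-guided early stopping is sketched below: candidate queries are tried in order and the loop stops at the first one the database executes without error. The abstract does not give the actual stopping criterion, so this logic, the function name `first_executable`, and the toy `city` table are assumptions for illustration.

```python
import sqlite3
from typing import List, Optional, Tuple


def first_executable(
    conn: sqlite3.Connection, candidates: List[str]
) -> Tuple[Optional[str], list, int]:
    """Return the first candidate that executes, its rows, and attempts used."""
    for attempt, sql in enumerate(candidates, start=1):
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # execution failed: fall through to the next candidate
        return sql, rows, attempt  # stop early, later candidates are never run
    return None, [], len(candidates)


# Toy database and candidate list (hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, population INTEGER)")
conn.executemany("INSERT INTO city VALUES (?, ?)", [("Aville", 120), ("Btown", 80)])

candidates = [
    "SELECT nme FROM city WHERE population > 100",   # misspelled column, fails
    "SELECT name FROM city WHERE population > 100",  # executes, loop stops here
    "SELECT name FROM city",                          # never executed
]
sql, rows, attempts = first_executable(conn, candidates)
```

Successful execution only shows that a query is valid against the schema, not that it answers the question, which is why a real system would pair this check with other signals.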


[23] SSR: Can Simulated Patients Learn to Stigmatize Themselves? Modeling Self-Stigma through Internal Monologue cs.CLPDF

Kunyao Lan, Bingrui Jin, Zichen Zhu, Mengyue Wu

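TL;DR: This paper proposes SSR, an LLM-based simulated-patient framework that addresses the failure of existing mental health training simulations to capture patient self-stigma and the context-sensitive resistance it produces (such as avoidance, denial, or self-blame). Grounded in the psychological 3A1H model of self-stigmatization, it builds the Stigmatized Self-Reflection dataset, which contains internal monologues reflecting stigma-aware reasoning, and fine-tunes LLMs with a chain-of-thought approach so that the simulated patient adjusts its level and expression of stigma according to conversational triggers.

Details

Motivation: Existing LLM-based patient simulation fails to capture self-stigma, a key clinical reality, and the context-sensitive resistance it causes. Current models render these behaviors as static or uniformly compliant, which limits the realism of the simulation.

Result: Evaluations show that the approach significantly outperforms specialized baselines, generating more authentic and situationally appropriate patient responses.

Insight: The core idea is to combine the psychological 3A1H model of self-stigmatization with LLM simulation. A dedicated dataset with internal monologues that reflect stigma-aware reasoning, together with chain-of-thought fine-tuning, yields dynamic, context-sensitive modeling of stigma behavior in simulated patients and provides a more realistic basis for clinical training and empathetic dialogue systems.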

Abstract: Simulating patients with large language models (LLMs) is a promising tool for mental health training, but existing approaches fail to capture a key clinical reality: self-stigma. Patients experiencing self-stigma, the internalization of negative stereotypes, often exhibit context-sensitive resistance, such as avoidance, denial, or self-blame, which current models render as static or uniformly compliant behavior. To address this, we introduce a novel simulation framework grounded in the psychological 3A1H model of self-stigmatization. Our core innovation is the creation of a Stigmatized Self-Reflection (SSR) dataset, where we augment mental health dialogues with internal monologues that reflect stigma-aware reasoning. By fine-tuning LLMs with this data using a chain-of-thought approach, we train patient agents to dynamically adjust their level and expression of stigma based on conversational triggers. Evaluations demonstrate that our approach significantly outperforms specialized baselines, generating more authentic and situationally appropriate patient responses. This work provides a crucial step towards realistic stigma simulation for clinical training and empathetic dialogue systems.


[24] TLRD: Teaching LLMs to Reason over Tabular Data with Tri-Level Rationale Distillation cs.CLPDF

Tianyuan Liang, Xuwei Tan, Lei Shi, Junsheng Zhong, Ziyu Hu

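TL;DR: This paper proposes Tri-Level Rationale Distillation (TLRD), a framework for teaching large language models (LLMs) to reason over tabular data. A high-capacity teacher model converts label-only tabular datasets into structured rationale supervision grounded in instance-level, dataset-level, and comparison-level evidence. The rationales are then distilled into student LLMs, enabling zero-overhead prediction and grounded explanation from raw features alone.

Details

Motivation: Tabular data is a primary medium for storing real-world information. Traditional predictors perform strongly but give no readable, case-specific explanations. LLMs can generate explanations but struggle to understand and reason over dataset-specific patterns such as feature distributions and interactions, and label-only fine-tuning leads to catastrophic forgetting.

Result: Experiments on datasets from multiple domains show that TLRD significantly narrows the performance gap between LLMs and state-of-the-art tree ensembles while producing grounded, readable explanations.

Insight: The contribution is a three-level evidence framework (instance-level features, dataset-level distributional context, comparison-level retrieved neighbors) for synthesizing a rationale corpus. Distilling this corpus into LLMs lets them deliver both strong prediction and interpretability with no added inference overhead, a useful reference for high-stakes decision-making.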

Abstract: Tabular data is a primary medium for storing real-world information, driving many industrial applications of machine learning. Traditional predictors achieve strong predictive performance but do not provide readable, case-specific explanations essential for decision-making. Large Language Models (LLMs) can naturally bridge this gap by generating predictions alongside explanations. However, dataset-specific patterns, such as feature distributions and interactions, make tabular data difficult for LLMs to understand and reason over, while label-only fine-tuning improves performance at the cost of catastrophic forgetting. To address this problem, we propose Tri-Level Rationale Distillation (TLRD), a framework that converts label-only tabular datasets into structured rationale supervision for LLMs. TLRD uses a high-capacity teacher to synthesize a rationale corpus grounded in three complementary levels of evidence: instance-level feature, dataset-level distributional context, and comparison-level retrieved neighbors, then distills the rationale into student LLMs, enabling zero-overhead prediction and grounded explanation from raw features only. Experiments on multiple domain datasets show that TLRD significantly closes the performance gap between LLMs and state-of-the-art tree ensembles while producing grounded and readable explanations, offering a valuable reference for high-stakes decision-making.


[25] CATPO: Critique-Augmented Tree Policy Optimization cs.CL | cs.LGPDF

Ayush Singh, Umang Goyal, Ankur Dahiya

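TL;DR: This paper proposes CATPO (Critique-Augmented Tree Policy Optimization), a reinforcement learning method for improving the reasoning ability of large language models. It scores the informativeness of tree-structured rollouts and applies critique-based healing to trees that fail completely, making more effective use of compute for policy optimization.

Details

Motivation: Existing tree-based reinforcement learning methods such as TreeRPO sample many low-information trees (for example, trees whose leaves all succeed or all fail), which wastes compute and makes gradient updates inefficient.

Result: Experiments with Qwen2.5-Math-1.5B trained on the MATH dataset show that CATPO achieves 37.5% macro accuracy across four benchmarks (AIME24, MATH-500, OlympiadBench, and MinervaMath), improving over TreeRPO by 1.9% and over GRPO by 4.8%.

Insight: The core ideas are a tree informativeness score that identifies and up-weights informative trees, and critique-guided healing for fully failed trees, which generates a natural-language critique and grafts refined continuations to recover training signal.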

Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving the reasoning capabilities of large language models (LLMs). Recent tree-based methods such as TreeRPO extend flat trajectory sampling with tree-structured rollouts to obtain dense, step-level reward signals without a separate process reward model. However, not all trees are equally informative: trees where all leaves succeed, all leaves fail, or the policy already predicts the reward distribution contribute little to gradient updates, wasting compute. We introduce CATPO (Critique-Augmented Tree Policy Optimization), which diagnoses and addresses this waste at the tree level. CATPO first scores each tree via a tree informativeness score, F(T), combining leaf-outcome diversity with policy-reward decorrelation at zero extra compute. For dead-wrong trees where all branches fail, CATPO applies critique-guided healing: it locates the shallowest failure point, generates a natural-language critique, and grafts refined continuations to recover training signal. Finally, an informativeness-weighted loss scales each tree’s gradient contribution by its normalized score, concentrating parameter updates on the most informative trees while preserving overall gradient magnitude. Experiments on Qwen2.5-Math-1.5B trained with the MATH dataset show that CATPO achieves 37.5% macro accuracy across four benchmarks (AIME24, MATH-500, OlympiadBench, and MinervaMath), improving over TreeRPO by 1.9% and GRPO by 4.8%.
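
The idea of weighting trees by informativeness while preserving overall gradient magnitude can be sketched as below. This is not the paper's F(T): only a leaf-outcome diversity term is shown, using the variance of binary leaf rewards as an assumed stand-in, and the policy-reward decorrelation term is omitted because the abstract does not define it.

```python
from typing import List, Sequence


def leaf_diversity(leaf_rewards: Sequence[int]) -> float:
    """Scaled variance of binary leaf outcomes, in [0, 1].

    Zero when every leaf succeeds or every leaf fails (assumed proxy).
    """
    p = sum(leaf_rewards) / len(leaf_rewards)
    return 4.0 * p * (1.0 - p)


def normalized_weights(scores: Sequence[float]) -> List[float]:
    """Rescale scores so the weights average to 1 across the batch."""
    n = len(scores)
    total = sum(scores)
    if total == 0.0:
        return [1.0] * n  # no tree is informative: fall back to uniform
    return [n * s / total for s in scores]


# Three toy trees: all-success, mixed, mixed.
trees = [[1, 1, 1, 1], [1, 0, 1, 0], [0, 1, 1, 0]]
scores = [leaf_diversity(t) for t in trees]
weights = normalized_weights(scores)
```

Because the weights average to 1, multiplying each tree's loss by its weight shifts emphasis toward informative trees without changing the total loss scale of the batch.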


[26] Forward-Free Diffusion Language Models cs.CLPDF

Haotian Sun, Rushi Qiang, Yuqian Zheng, Bo Dai

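TL;DR: This paper proposes FReDA, a forward-free diffusion language model that generates text through recursive distribution refinement, removing the need for the hand-designed forward corruption process of conventional diffusion models. Model-generated drafts serve as implicit intermediate states, and self-refinement or selection among parallel candidates progressively moves the draft distribution toward the target distribution.

Details

Motivation: Discrete language spaces lack a natural neighborhood structure for defining effective perturbations, so the hand-designed forward processes of conventional diffusion language models produce states that do not match the drafts and errors encountered during generation, which degrades sample quality.

Result: In the sub-8B regime, FReDA-4B outperforms larger diffusion base models on reasoning and coding benchmarks, with absolute gains of up to 15%, a 1.5-1.8x average speedup over diffusion baselines, and effective scaling with additional refinement computation.

Insight: The contribution is the forward-free diffusion framework, which recasts diffusion language modeling as recursive distribution refinement. The resulting model is neighborhood-agnostic, model-complexity-aware, and compatible with flexible refinement parameterizations, improving both generation quality and efficiency.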

Abstract: Diffusion language models generate text through iterative denoising, offering a powerful alternative to autoregressive generation. However, discrete language spaces lack a natural neighborhood structure for defining effective perturbations, so existing forward processes rely on artificial corruption schemes. Such prescribed forward processes often produce states that are mathematically convenient but misaligned with drafts and errors encountered during generation, resulting in degraded sample quality. To address this limitation, we propose FReDA, a forward-free diffusion language model that eliminates the need for a hand-designed forward process. We formulate diffusion language modeling as recursive distribution refinement, in which model-generated drafts serve as implicit intermediate states, and the learned refinement model progressively moves the draft distribution toward the target distribution. Concretely, FReDA refines drafts by proposing candidate draft sequences and either directly performing self-refinement or selecting among parallel candidates via best-of-N refinement. With this design, FReDA is neighborhood-agnostic, model-complexity-aware, and compatible with flexible refinement parameterizations. Extensive evaluations in the sub-8B regime show that FReDA-4B outperforms larger diffusion base models on reasoning and coding benchmarks, achieving absolute gains of up to 15%, while reaching a 1.5-1.8x average speedup over diffusion baselines and scaling effectively with additional refinement computation.


[27] When Correct Decisions Hide Internal Stress: Decision-State Probing in Multimodal Language Models cs.CLPDF

Haoran Zhao, Soyeon Caren Han, Eduard Hovy

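TL;DR: The paper proposes S$^3$E, a framework for assessing whether a multimodal language model's internal decision state stays stable under controlled semantic stress when its external behavior is correct. It finds that even when the model consistently picks the correct answer in an A/B forced-choice task, its hidden states are still markedly perturbed by semantic-conflict candidates, so external correctness does not guarantee invariant internal decision geometry.

Details

Motivation: Evaluation of multimodal language models relies mainly on external behavior (such as selecting the correct image-text match), which cannot show whether the internal decision state is stable under semantic stress. The paper aims to fill this gap and to probe possible decoupling between internal state and external behavior.

Result: Experiments on Qwen3VL, Gemma3, and InternVL3 show that, on strict-correct trials, semantic stress candidates consistently induce excess decision-state displacement over lexical controls at the selected layers, while comparisons against random negatives are model-dependent.

Insight: The contribution is a structured semantic stress evaluation framework that detects decision-state sensitivity by comparing the internal-state displacement induced by semantic-conflict candidates with that induced by meaning-preserving controls. This gives robustness evaluation an internal view beyond behavioral correctness and reveals possible fragility in internal decision geometry.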

Abstract: Multimodal language models are typically evaluated through external behavior: selecting the correct image–text match, rejecting unsupported captions, or answering visual queries correctly. However, correct behavior alone does not show that the model’s internal decision state remains stable under controlled semantic stress. We study this gap through S$^3$E (Structured Semantic Stress Evaluation), a framework for analyzing behavior-internal decoupling in multimodal language models. S$^3$E uses a positive-anchored A/B forced-choice setup in which an image-supported caption is contrasted against semantic stress candidates under both original and swapped option orders, while hidden states are extracted at the pre-answer decision state. We focus on strict-correct trials, where the model consistently selects the correct caption across both orders. Rather than treating arbitrary hidden-state variation as evidence of instability, we measure whether semantic-conflict candidates induce excess decision-state displacement relative to meaning-preserving controls. Across Qwen3VL, Gemma3, and InternVL3, semantic stress consistently produces positive selected-layer excess displacement over lexical controls despite correct forced-choice behavior, while comparisons against random negatives are model-dependent. We interpret this as a scoped decision-state stress-sensitivity signal rather than evidence of downstream failure or hallucination. Our results suggest that forced-choice correctness alone is not a sufficient certificate of invariant internal decision geometry.
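
The notion of excess displacement relative to a control can be sketched as a difference of two distances. The sketch assumes Euclidean distance and a single reference hidden state per trial; the abstract does not state the metric or the reference point, so both are hypothetical choices, as are the function names.

```python
import math
from typing import Sequence


def displacement(reference: Sequence[float], state: Sequence[float]) -> float:
    """Euclidean distance between a reference state and a perturbed state."""
    return math.dist(reference, state)


def excess_displacement(
    reference: Sequence[float],
    stress_state: Sequence[float],
    control_state: Sequence[float],
) -> float:
    """How much farther the stress candidate moves the state than the control."""
    return displacement(reference, stress_state) - displacement(reference, control_state)


# Toy 2-dimensional hidden states (illustrative only).
reference = [0.0, 0.0]
stress = [3.0, 4.0]    # state when paired with a semantic-conflict candidate
control = [0.0, 1.0]   # state when paired with a meaning-preserving control
excess = excess_displacement(reference, stress, control)
```

A positive value means the semantic-conflict candidate shifted the decision state more than the control did, even if the model's final choice was the same in both cases.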


[28] AsyncLane: Decoupling Refinement from Advancement in Diffusion Language Model Decoding cs.CLPDF

Yingxuan Ren, Yuxuan Lou, Yong Liu, Pengcheng Fang, Ziming Wang

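TL;DR: This paper proposes AsyncLane, a training-free decoding scheduler for diffusion language models (DLMs) that decouples refinement from advancement during decoding. At observed delimiter boundaries it forks the generate lane into a refine lane and a continuation generate lane, letting later generation start before the prefix is fully refined and raising inference throughput.

Details

Motivation: Standard block-wise semi-autoregressive decoding imposes a strict dependency between blocks: the next block cannot start until the current block is fully decoded or its denoising budget is exhausted, which limits inference efficiency. The paper aims to break this serial dependency and let continuation start early once a block exposes a reliable delimiter or stable semantic prefix.

Result: Experiments on mathematical reasoning and code generation show that AsyncLane consistently improves throughput while maintaining competitive generation quality. On the LLaDA and Dream backbones, it achieves the highest tokens per second (TPS) in all evaluated benchmark-length settings. Relative to the fastest competing baseline, it reaches peak speedups of 2.95x on LLaDA and 3.04x on Dream, with especially large gains under longer generation budgets.

Insight: The core idea is an asynchronous decoding schedule that decouples refinement from advancement, parallelized through lane forking and a lane tree that records dependencies. Engineering contributions include shared-prefix lane batching, lookahead draft reuse, cascading termination, and compact cache refresh with refresh-logit reuse, which keep model-call cost from scaling directly with the number of lanes and keep the asynchronous schedule efficient. The method needs no retraining and is a drop-in replacement for existing block-wise DLM samplers.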

Abstract: Block-wise semi-autoregressive decoding is the standard inference paradigm for diffusion large language models (DLMs), but it imposes a strict dependency between blocks: the next block cannot begin until the current block is fully decoded or its denoising budget is exhausted. We observe that once a block exposes a reliable delimiter boundary or stable semantic prefix, continuation generation need not wait for every residual token to be resolved. We propose AsyncLane, a training-free decoding scheduler that decouples refinement from advancement. AsyncLane forks a generate lane at observed delimiter boundaries into a refine lane and a continuation generate lane: the prefix remains editable, while the continuation advances before prefix refinement finishes. The resulting lane tree records decoding dependencies and output order, while execution proceeds over the active lane set. To make this asynchronous schedule efficient under bidirectional attention, AsyncLane combines shared-prefix lane batching, lookahead draft reuse, cascading termination, and compact cache refresh with refresh-logit reuse, preventing model-call cost from scaling directly with the number of lanes. AsyncLane is a drop-in replacement for block-wise DLM samplers and requires no retraining. Experiments on mathematical reasoning and code generation show that AsyncLane consistently improves throughput while maintaining competitive quality. Across LLaDA and Dream backbones, AsyncLane achieves the highest TPS in all evaluated benchmark-length settings; relative to the fastest competing baseline, it reaches peak speedups of 2.95x on LLaDA and 3.04x on Dream, with especially large gains under longer generation budgets.


[29] More Yap Less Meaning: Uncovering Self-Improvement Behavior in SLMs cs.CL | cs.AIPDF

Marina Igitkhanian, Erik Arakelyan

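TL;DR: This paper studies the self-improvement ability of small language models (SLMs), building a three-step self-correction pipeline to test whether a model can recognize and fix errors in its own reasoning. Experiments on arithmetic and logical reasoning benchmarks find that, even when given the correct answer and hints, SLMs improve only slightly, and longer hints are associated with more incorrect answers.

Details

Motivation: To examine whether language models, small ones in particular, can improve themselves effectively, that is, recognize and correct flaws in their own reasoning, as a way to gauge their practical potential.

Result: On arithmetic and logical reasoning benchmarks, SLMs with injected hint sentences gain only 4.4 percent over initial question-answering accuracy, and longer hints are positively correlated with incorrect final answers, indicating that performance does not necessarily improve with a larger compute budget.

Insight: The contribution is a minimal three-step self-correction pipeline that rigorously tests SLM self-correction. It finds that SLMs struggle to understand what their own reasoning was missing and that hint length may hinder reasoning, with implications for model optimization and design.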

Abstract: Recently, language models have made rapid progress across various domains and applications. However, their capability for self-improvement, i.e., whether they are adept at recognising and correcting flaws in their own reasoning, remains dubious. In this study, we address this question by constructing a sufficiency test to rigorously examine the self-correction capabilities of small language models (SLMs). We propose a minimal three-step self-correction pipeline that collects initial SLM answers, prompts the same model to generate hints for its incorrect responses given the ground truth, and feeds the model the same question with its own feedback to refine the initial answer. We evaluate a variety of instruction-tuned and reasoning SLMs in this experimental setup on arithmetic and logical reasoning benchmarks. Our findings show that SLMs with injected hint sentences yield only a 4.4 percent gain over initial question-answering accuracy. Even though the correct answer was provided alongside the model’s incorrect reasoning, the evaluated SLMs fail to understand what was missing in their reasoning and show minimal semantic difference between hints that lead to corrections and ones that do not. Furthermore, our experiments show that longer hints are positively correlated with incorrect final answers, suggesting that longer deliberation on problems can hinder the reasoning process, meaning that SLMs do not necessarily scale in performance with a larger compute budget.
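
The three-step pipeline described in the abstract can be sketched with the model as a plain callable. The prompt wording, the exact-match correctness check, and the function name `self_correct` are assumptions made for illustration; the paper's actual prompts are not given here.

```python
from typing import Callable, Optional, Tuple


def self_correct(
    model: Callable[[str], str], question: str, truth: str
) -> Tuple[str, Optional[str]]:
    """Run answer -> hint -> refined answer; return (final answer, hint)."""
    # Step 1: collect the model's initial answer.
    first = model(question).strip()
    if first == truth:  # assumed check: exact string match
        return first, None

    # Step 2: the same model writes a hint for its wrong answer,
    # given the ground truth.
    hint = model(
        f"Question: {question}\n"
        f"Your answer: {first}\n"
        f"Correct answer: {truth}\n"
        "Explain what was missing in the reasoning."
    ).strip()

    # Step 3: the same question is asked again with the model's own feedback.
    revised = model(f"{question}\nHint: {hint}").strip()
    return revised, hint
```

Comparing the accuracy of the first and revised answers over a benchmark gives the gain attributable to the injected hints.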


[30] SAEExplainer: Interpreting SAE Features with Activation-Guided Preference Optimization cs.CL | cs.LGPDF

Jingyi He, Haiyan Zhao, Ruxue Shi, Yanguang Liu, Xin Wang

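TL;DR: This paper proposes SAEExplainer, a framework that uses activation scores as a reward signal to train a model for self-correction and iterative bootstrapping in order to explain sparse autoencoder (SAE) features. A two-round optimization process iteratively verifies and corrects the base explanations, reducing explanation hallucinations and reinforcing causal triggering patterns.

Details

Motivation: SAEs ease the opacity of large language models (LLMs) by decomposing dense representations into sparse features, but explaining those features remains a central challenge. Existing explanation methods typically run open-loop and do not use mechanistic feedback for further refinement.

Result: Extensive experiments show that the method improves on established baselines across most metrics, especially causal triggering and discriminative activation.

Insight: The contribution is to use activation scores as an objective reward signal in a training framework for self-correction and iterative bootstrapping. The two-round optimization process yields continued improvement in explanatory ability, reducing hallucinations and reinforcing causal patterns.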

Abstract: Although Sparse Autoencoders (SAEs) have mitigated the opacity of large language models (LLMs) by decomposing dense representations into sparse features, explaining these features remains a central challenge. Current explanation methods typically operate within an open-loop paradigm, failing to leverage mechanistic feedback for further refinement. In this paper, we propose SAEExplainer, a training framework that utilizes activation scores as an objective reward signal to train the model for self-correction and iterative bootstrapping. By iteratively verifying and correcting foundational explanations through a two-round optimization process, SAEExplainer achieves continuous improvement in its explanatory capabilities. This mechanism significantly reduces explanation hallucinations and reinforces causal triggering patterns. Extensive experiments demonstrate that our approach improves upon established baselines across most metrics, especially in causal triggering and discriminative activation.


[31] Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models cs.CLPDF

Yawen Shao, Jie Xiao, Kai Zhu, Yu Liu, Hongchen Luo

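TL;DR: Targeting the mismatch between generation trajectories and gradient updates in reinforcement learning for diffusion large language models (dLLMs), this paper proposes Process Aligned Policy Optimization (PAPO). Step-Aware Process Rewards (SPR) turn sparse terminal rewards into dense, step-wise credit, and Entropy-Guided Historical Re-enactment (EHR) replays authentic trajectories at high-uncertainty steps, aligning rewards, states, and the generation process.

Details

Motivation: The main obstacle to improving dLLM reasoning with reinforcement learning is a dual misalignment between the generation trajectory and the gradient update process: process-reward misalignment (sparse terminal rewards give intermediate steps no discriminative credit assignment) and state-trajectory misalignment (policy updates are often diverted to out-of-trajectory states, wasting gradients).

Result: Extensive experiments on four benchmarks (GSM8K, MATH500, Countdown, and Sudoku) show that PAPO significantly outperforms baselines, with gains of up to 4.5%, 4.8%, 42.2%, and 16.1%, respectively.

Insight: The core contribution is a systematic treatment of the dual misalignment in dLLM reinforcement learning. SPR converts sparse rewards into dense, process-aware rewards, and EHR guides policy updates by replaying authentic trajectories; together they align gradient updates with the model's actual generation process.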

Abstract: Reinforcement learning (RL) holds immense promise for enhancing the reasoning capabilities of diffusion large language models (dLLMs). However, progress is fundamentally constrained by a dual misalignment between authentic generation trajectory and the gradient update process: (i) Process-reward misalignment. Sparse, terminal rewards are indiscriminately assigned to all intermediate steps of the generation process, failing to provide discriminative credit assignment. (ii) State-trajectory misalignment. Policy updates are often diverted toward artificial, out-of-trajectory states, squandering gradients on less informative samples. To address these limitations, we introduce Process Aligned Policy Optimization (PAPO), a novel framework that holistically aligns the RL update with the dLLM’s generative trajectory via Step-Aware Process Rewards (SPR) that transform sparse terminal rewards into dense, step-wise credit, and Entropy-Guided Historical Re-enactment (EHR) that replays authentic trajectories at high-uncertainty steps. Extensive experiments on four benchmarks demonstrate that PAPO significantly outperforms baselines, achieving gains of up to 4.5% on GSM8K, 4.8% on MATH500, 42.2% on Countdown and 16.1% on Sudoku.


[32] TRADE: Transducer-Augmented Decoder for Speech LLM cs.CLPDF

Yun Tang, Shanil Puri, Shinji Watanabe, Subhabrata Mukherjee

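TL;DR: This paper proposes TRADE (Transducer-Augmented Decoder), a streaming inference architecture for speech large language models (Speech LLMs). It adds a transducer branch that shares the audio encoder with the LLM and uses the LLM's hidden states as the prediction network, coupling frame-synchronous acoustic alignment with the LLM's linguistic reasoning and thereby addressing the lack of a streaming inference mechanism in Speech LLMs.

Details

Motivation: Existing Speech LLMs lack a principled mechanism for streaming inference: their label-synchronous generation has no acoustic-frame alignment, which makes real-time decoding and end-of-utterance detection difficult.

Result: On the Open ASR Leaderboard, TRADE achieves an average word error rate (WER) of 6.71%. Streaming recognition with a 960 ms chunk size reaches 8.40% WER from the same checkpoint. On long-form speech, it obtains 3.64% WER on TED-LIUM and 10.88% on Earnings-22 without external segmentation. Combined with acoustic voice activity detection, its sentence-end punctuation timestamps improve the end-of-utterance detection F1 score by 0.03.

Insight: The main innovations are: 1) tightly coupled dual vocabularies, with a compact transducer vocabulary derived from the LLM vocabulary, enabling zero-cost score fusion; 2) chunk-synchronized streaming training with gradient stopping, which eliminates the train-inference mismatch at offline-equivalent memory cost; 3) Localized Decoder Audio Attention (LDAA), a causal sliding-window mechanism that caps KV-cache memory independently of utterance length. These designs let a single checkpoint support offline and streaming decoding across a continuous range of latency operating points.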

Abstract: Speech Large Language Models (Speech LLMs) lack a principled mechanism for streaming inference: their label-synchronous generation has no acoustic-frame alignment, making real-time decoding and end-of-utterance detection difficult. We propose TRADE (TRansducer-Augmented DEcoder), which augments a multimodal LLM with a transducer branch that shares the audio encoder and uses the LLM's hidden states directly as the prediction network, coupling frame-synchronous acoustic alignment with the LLM's linguistic reasoning. Three design choices make the system accurate, streamable, and long-form capable: (1) tightly coupled dual vocabularies, with a compact transducer vocabulary derived from the LLM vocabulary, enabling zero-cost score fusion; (2) chunk-synchronized streaming training with gradient stopping, eliminating the train-inference mismatch at offline-equivalent memory cost; and (3) Localized Decoder Audio Attention (LDAA), a causal sliding window that caps KV-cache memory independently of utterance length. A single TRADE checkpoint supports offline and streaming decoding across a continuous range of latency operating points. TRADE achieves 6.71% average WER on the Open ASR Leaderboard, while streaming recognition with a 960 ms chunk size reaches 8.40% from the same checkpoint. On long-form speech, it obtains 3.64% WER on TED-LIUM and 10.88% on Earnings-22 without external segmentation. TRADE provides sentence-end punctuation timestamps that, when combined with acoustic voice activity detection (VAD), improve end-of-utterance detection by +0.03 $F_1$ over acoustic VAD alone.
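
The idea of a causal sliding window that bounds attention memory can be illustrated with a small mask builder. This is only a sketch under assumptions: the abstract does not specify how LDAA aligns decoder steps to audio frames, so the `query_frames` alignment, the function name, and the `window` parameter are hypothetical.

```python
def causal_window_mask(query_frames, num_frames, window):
    """Build a boolean attention mask for a causal sliding window.

    query_frames[i] is the (assumed) audio frame index that decoder step i
    is aligned to. Step i may attend to frame j only if
    query_frames[i] - window < j <= query_frames[i].
    """
    mask = []
    for p in query_frames:
        lo = max(0, p - window + 1)  # oldest frame still inside the window
        mask.append([lo <= j <= p for j in range(num_frames)])
    return mask


# Three decoder steps aligned to frames 0, 3, and 7 of an 8-frame utterance.
mask = causal_window_mask(query_frames=[0, 3, 7], num_frames=8, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row allows at most `window` frames no matter how long the utterance is, which is the property that lets a cache of past audio keys and values stay bounded.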


[33] Ishigaki-IDS: An Open-Weight Verifier-Aware Model for Information Delivery Specification Drafting in Building Information Modeling cs.CL | cs.SEPDF

Ryo Kanazawa, Koyo Hidaka, Teppei Miyamoto, Takayuki Kato, Tomoki Ando

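TL;DR: This paper presents Ishigaki-IDS, an open-weight large language model designed for the Building Information Modeling (BIM) domain that generates Information Delivery Specification (IDS) drafts checkable by a validator. The model is trained with continued pretraining on domain corpora, supervised fine-tuning, and reinforcement learning with feedback from an external validator. The aim is to move IDS authoring away from low-level XML and schema repair toward drafts that a validator can load directly and that practitioners can review and correct.

Details

Motivation: In BIM projects, information requirements must be described as machine-checkable IDS files, but IDS authoring is a practical bottleneck: practitioners have to handle domain vocabulary, strict XML schema constraints, and external validator conformance while also making sure the requirement itself is expressed correctly.

Result: On the expert-created Ishigaki-IDS-Bench with 166 cases, Ishigaki-IDS-8B scores 0.651 on IDSAuditPass, the validator-pass metric for generated IDS files, substantially outperforming the strongest single-shot LLM baseline, Claude Opus 4.5 (0.331). The 14B and 32B variants reach IDSAuditPass scores of 0.753 and 0.693, respectively. In a workflow check with six BIM practitioners, Ishigaki-assisted authoring reduced aggregate work time by 54.7% under the same validation and alignment criteria.

Insight: The core contribution is a "verifier-aware" IDS generation model that uses external validator feedback as the reinforcement learning reward, steering the model toward drafts that meet the validator's requirements. This shifts the use of LLMs from general text generation to verifiable, structured output generation tightly coupled with domain-specific tools such as an IDS validator, and offers a working example for easing the structured-document authoring bottleneck in specialist domains like BIM.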

Abstract: Building Information Modeling (BIM) projects require information requirements to be described as machine-checkable Information Delivery Specification (IDS) files in order to verify whether building models contain the required attributes. However, IDS authoring remains a practical bottleneck: practitioners must handle domain vocabulary, strict XML schema constraints, and external validator conformance while also checking whether the requirement itself is correctly expressed. We present Ishigaki-IDS, an open-weight LLM specialized for verifier-aware IDS draft generation. The model combines continued pretraining on BIM/IDS corpora, supervised fine-tuning on information-requirement-to-IDS pairs, and reinforcement learning with verifiable rewards from an external validator. The goal is not to replace expert review, but to move IDS authoring from low-level XML and schema repair toward validator-loadable drafts that practitioners can inspect and correct. On the 166-case expert-created Ishigaki-IDS-Bench, Ishigaki-IDS-8B achieves an IDSAuditPass score of 0.651, a validator-pass metric for generated IDS files, substantially outperforming Claude Opus 4.5, the strongest single-shot LLM baseline we evaluated, at 0.331. It also obtains an Audit-Gated FacetF1 of 0.282, which measures requirement-facet alignment among validator-passing drafts. The same recipe scales: 14B and 32B variants reach IDSAuditPass 0.753 / 0.693 and Audit-Gated FacetF1 0.392 / 0.369. In a workflow check with six BIM practitioners, Ishigaki-assisted authoring reduced aggregate work time by 54.7% under the same validation and alignment endpoint. These results suggest that verifier-aware IDS generation can reduce the practical burden of converting BIM information requirements into reviewable IDS drafts.


[34] Calibration of Structured Ignorance Certificates for Diagnosing Unknown Unknowns in Reasoning Models cs.CL | cs.AI | cs.LGPDF

Subramanyam Sahoo

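TL;DR: This paper targets the tendency of large language models to produce wrong answers beyond their knowledge boundaries instead of acknowledging ignorance. It proposes Structured Ignorance Certificates (SICs), a JSON-formatted output schema that requires the model to explicitly name the unknown domain, enumerate the required concepts, and propose a retrieval query. The author builds a 7,347-sample Unknown-Unknown (UU) dataset and fine-tunes a 14B-parameter model with GRPO, using a composite reward that combines retrieval utility, concept specificity, and output-format validity to train the model to produce high-quality SICs. Evaluation shows clear gains in JSON validity, Certificate Specificity Score, and retrieval-grounded generation.

Details

Motivation: To address a common failure mode of large language models: when a question lies beyond their knowledge boundaries, they tend to produce fluent but incorrect answers (hallucinations) rather than acknowledge ignorance.

Result: Evaluation on 735 held-out UU questions yields a 99.46% JSON validity rate, a mean Certificate Specificity Score of 0.967, and a 3.6% ROUGE-L improvement over the base model on retrieval-grounded generation.

Insight: The core contribution is the SIC output schema, which gives acknowledging ignorance and guiding follow-up retrieval an explicit, structured form. By building a cross-domain Unknown-Unknown dataset and fine-tuning with a composite reward, the model learns to produce this structured output reliably, turning "acknowledging ignorance" from an abstract notion into a learnable and measurable capability.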

Abstract: Large language models frequently fail in a characteristic way: rather than acknowledging ignorance, they produce fluent but incorrect answers to questions that lie beyond their knowledge boundaries. We introduce Structured Ignorance Certificates (SICs), a JSON-formatted output schema that demands a model explicitly name the missing domain intersection, enumerate required concepts, and propose a productive retrieval query rather than hallucinating an answer. To train models to produce high-quality SICs we construct a 7,347-sample Unknown-Unknown (UU) dataset by prompting Qwen3-14B to stitch together questions from seven domains (physics, biology, engineering, CS, economics, medical, legal) into novel cross-domain queries that no single-domain expert could answer. We fine-tune a 14B-parameter model with Group Relative Policy Optimization (GRPO) using a composite reward that combines retrieval utility, concept specificity, and output-format validity. A paraphrase-divergence probe trained on model responses confirms that SIC-tuned outputs systematically exhibit higher unknown-unknown probability scores. Evaluation on 735 held-out UU questions achieves a 99.46% JSON validity rate, a mean Certificate Specificity Score of 0.967, and a 3.6% ROUGE-L improvement over the base model on retrieval-grounded generation, demonstrating that explicit epistemic structuring is a learnable and measurable capability.
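
To make the output-format validity check concrete, here is a minimal sketch of what validating an SIC-like JSON object might look like. The abstract names three pieces of content (missing domain intersection, required concepts, retrieval query) but not the actual schema, so the field names and the `is_valid_sic` helper below are hypothetical.

```python
import json

# Hypothetical field names; the paper's actual SIC schema is not given here.
REQUIRED_FIELDS = {
    "missing_domain_intersection": str,
    "required_concepts": list,
    "retrieval_query": str,
}


def is_valid_sic(text):
    """Return True if text parses as JSON and has every required, non-empty field."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    for name, expected_type in REQUIRED_FIELDS.items():
        value = obj.get(name)
        if not isinstance(value, expected_type) or not value:
            return False
    return True


example = json.dumps({
    "missing_domain_intersection": "contract law x protein folding",
    "required_concepts": ["force majeure", "chaperone proteins"],
    "retrieval_query": "legal liability for misfolded therapeutic proteins",
})
print(is_valid_sic(example))              # well-formed certificate
print(is_valid_sic("I think the answer is 42."))  # free text, not a certificate
```

A binary check like this could serve as the format-validity term of a composite reward, with the retrieval-utility and specificity terms scored separately.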


[35] Cross-Source Reasoning-based Correction for Author Name Disambiguation cs.CLPDF

Fanjin Zhang, Yunhe Pang, Bo Chen, Zhiyu Shen, Yanghui Rao

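TL;DR: This paper proposes CrossND, a framework that uses cross-source reasoning to correct cumulative errors and inconsistent assignments in author name disambiguation, improving accuracy without human intervention. It integrates data refinement, cross-source correction, and test-time scaling modules, and outperforms 17 baselines on real-world datasets.

Details

Motivation: Existing author name disambiguation methods are vulnerable to cumulative errors in paper-author assignments and overlook inconsistent assignments across data sources, while expert annotation is costly. The paper therefore explores a new perspective: using cross-source inconsistency for correction.

Result: Experiments on real-world datasets show that CrossND consistently outperforms 17 baselines through cross-source reasoning, achieving higher disambiguation accuracy and robustness.

Insight: The contribution is to treat cross-source inconsistency as a correction signal for the first time, combining a probabilistic soft logic-based cross-source correction module with test-time scaling to build an end-to-end error-correction framework that needs no human annotation, offering a scalable solution for name disambiguation.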

Abstract: Author name disambiguation is a critical challenge in academic search systems, often addressed through from-scratch and real-time disambiguation approaches. However, current algorithms remain vulnerable to cumulative errors of paper-author assignments and overlook inconsistent assignments across different sources. Resorting to expert annotation is resource-intensive. To this end, this paper explores a new perspective for author name disambiguation: cross-source correction by leveraging inconsistent assignments across sources. We propose CrossND, a full-stack framework that integrates data refinement, cross-source reasoning, and test-time scaling. First, a chain-of-refinement pipeline denoises author profiles and produces more accurate paper-author matching probabilities. Second, a supervised fine-tuning process incorporates these refined signals and a probabilistic soft logic-based cross-correction module to infer which sources' assignments are incorrect. Third, test-time scaling further enhances the accuracy and robustness of the predictions. Experiments on real-world datasets indicate that CrossND consistently outperforms 17 baselines by leveraging cross-source reasoning without human intervention.


[36] From Holistic Evaluation to Structured Criteria: Rubrics Across the Evolving LLM Landscape cs.CLPDF

Hao Chen, Ziyu Han, Yukun Yan, Qingfu Zhu, Maosong Sun

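TL;DR: This paper proposes the rubric as a unifying framework for evaluating and guiding the behavior of large language models, defining rubrics as explicit criteria sets that turn complex quality judgments into structured, actionable standards. It systematically organizes existing rubric designs, analyzes their role in evaluation and training, and assesses their reliability across domains.

Details

Motivation: As LLMs advance toward open-ended autonomous agents, corresponding mechanisms are needed to evaluate and guide their behavior. The rubric framework responds to this need by translating human value expectations into machine-learnable signals.

Result: By systematically organizing existing rubric designs and assessing their reliability with respect to generation quality, execution fidelity, theoretical constraints, and security threats, the paper argues that rubrics serve as an enduring bridge between human intentions and machine behavior.

Insight: The contribution is to conceptualize the rubric as a recurring, unifying framework for decomposing holistic judgments, providing dense process-level feedback, and driving model self-improvement, which offers a structured methodology for research on LLM evaluation, reinforcement learning, and safety alignment.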

Abstract: As Large Language Models (LLMs) advance toward open-ended autonomous agents, the mechanisms used to evaluate and guide their behavior must evolve accordingly. This work introduces the rubric as a unifying framework capturing this evolution, characterizing rubrics as a dynamic response to successive LLM paradigm shifts that recurs across otherwise independent efforts in evaluation, reinforcement learning, and safety alignment. We define rubrics as explicit criteria sets that transform complex quality judgments into structured and actionable standards, and demonstrate that their recurrence across these research threads is not coincidental. We systematically organize existing rubric designs, examine their construction and optimization, and analyze their role across evaluation and training. Rubrics manifest at three progressively deeper levels: at the evaluative level, they decompose holistic judgments into verifiable dimensions; at the training level, they serve as dense feedback signals providing process-level guidance where scalar rewards fall short; at the intrinsic level, they emerge dynamically from model behaviors, driving self-improvement. We further assess rubric reliability across generation quality, execution fidelity, theoretical constraints, and security threats, before surveying rubric-based benchmarks across diverse domains. By rendering assessment transparent and decomposable, rubrics translate human value expectations into machine-learnable signals, serving as the enduring bridge between human intentions and machine behavior.


[37] From Player to Master: Enhancing Test-Time Learning of LLM Agents via Reinforcement Learning over Memory cs.CLPDF

Yishuo Cai, Xingyu Guo, Xuancheng Huang, Jinhua Du, Can Huang

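TL;DR: This paper proposes MemoPilot, a plug-in memory copilot for enhancing the test-time learning of LLM agents. It formulates memory updating as a multi-turn decision problem and optimizes it end-to-end with multi-turn GRPO, introducing a turn-wise reward signal and a context-independent, turn-level advantage estimation to improve credit assignment and training stability.

Details

Motivation: In long-running settings, existing LLM agents rely on hand-designed prompting rules for memory updates, which makes it hard to keep memory updates aligned with downstream objectives over multi-step horizons.

Result: On two testbeds, multi-round Rock-Paper-Scissors (RPS) and Limit Texas Hold'em (LHE), MemoPilot substantially improves the test-time learning of a frozen player model. It ranks first in Elo ratings on both games (1762 on LHE, 1590 on RPS), outperforming all baseline memory methods and proprietary models, including DeepSeek-V3.2.

Insight: The core contribution is to model memory updating explicitly as an optimizable multi-turn decision problem and to train it end-to-end with reinforcement learning, specifically multi-turn GRPO. The turn-wise reward and the context-independent, turn-level advantage estimation in the training recipe are effective technical improvements for credit assignment and training stability in multi-turn settings.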

Abstract: Large language model (LLM) agents are increasingly deployed in long-running settings where improving through experience at test time becomes important. A common approach is to update an explicit memory after each interaction to guide future decisions. However, most existing methods rely on hand-designed prompting rules, making it difficult to align memory updates with downstream objectives over multi-step horizons consistently. We propose MemoPilot, a plug-in memory copilot that explicitly trains the memory update process to improve a frozen LLM’s performance across sequential interactions. We formulate memory updating as a multi-turn decision problem and optimize it end-to-end with multi-turn GRPO. Our training recipe introduces (i) a turn-wise reward signal and (ii) a context-independent, turn-level advantage estimation across rollouts, enabling finer-grained credit assignment and more stable training in multi-turn settings. We evaluate MemoPilot on two testbeds: multi-round Rock-Paper-Scissors (RPS) and Limit Texas Hold’em (LHE). Across both environments, MemoPilot substantially improves test-time learning of a frozen player over strong baselines, ranking first in Elo ratings on both games (1762 on LHE and 1590 on RPS) and outperforming all baseline memory methods and proprietary models, including DeepSeek-V3.2.
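
One plausible reading of a turn-level advantage estimated across rollouts is to normalize each turn's reward against the other rollouts at the same turn index, in the spirit of GRPO's group-relative normalization. The sketch below follows that reading only; the function name, the `eps` constant, and the exact normalization are assumptions rather than the paper's definition.

```python
from statistics import mean, pstdev


def turn_level_advantages(rewards, eps=1e-6):
    """Normalize turn-wise rewards across rollouts, one turn index at a time.

    rewards[r][t] is the reward of rollout r at turn t. All rollouts are
    assumed to have the same number of turns.
    """
    num_turns = len(rewards[0])
    advantages = [[0.0] * num_turns for _ in rewards]
    for t in range(num_turns):
        column = [rollout[t] for rollout in rewards]
        mu, sigma = mean(column), pstdev(column)
        for r, value in enumerate(column):
            advantages[r][t] = (value - mu) / (sigma + eps)
    return advantages


# Two rollouts, two turns: they differ at turn 0 and tie at turn 1.
adv = turn_level_advantages([[1.0, 0.0], [0.0, 0.0]])
print(adv)
```

Turns where every rollout earns the same reward get zero advantage, so only the turns that actually separate good rollouts from bad ones contribute to the update.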


[38] HydraQE: OSU’s Submission for the IWSLT 2026 Speech Translation Metrics Shared Task cs.CLPDF

Kevin Krahn, Eric Fosler-Lussier

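TL;DR: This paper presents HydraQE, an end-to-end, reference-free quality estimation (QE) system for speech translation built on a Qwen3-ASR backbone. The system takes source audio and a translation hypothesis as joint input, combines the hidden states of all backbone layers through a learnable sparsemax scalar mix, re-encodes them with a lightweight bidirectional Transformer, and trains three independent prediction heads on complementary supervision signals. It outperforms cascaded text-based baselines and prior direct speech QE systems in the IWSLT 2026 Speech Translation Metrics shared task.

Details

Motivation: To address two main challenges in quality estimation for speech translation: the lack of end-to-end direct speech translation QE systems, since existing approaches are mostly cascaded, and the scarcity of human-annotated data, which makes robust models hard to train.

Result: In the IWSLT 2026 Speech Translation Metrics shared task, HydraQE outperforms cascaded text-based baselines and prior direct speech QE systems, showing that end-to-end speech translation QE is competitive with cascaded approaches.

Insight: The contributions include: 1) the first end-to-end, reference-free speech translation QE system built on Qwen3-ASR; 2) a learnable sparsemax scalar mix that combines hidden representations from multiple layers; 3) a three-head training strategy that fuses human direct assessment, MetricX-24 pseudo-labels, and xCOMET pseudo-labels as supervision; 4) a curriculum that moves from synthetic and silver data toward human-annotated data to ease data scarcity.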

Abstract: We present HydraQE, our contribution to the IWSLT 2026 Speech Translation Metrics shared task. HydraQE is an end-to-end, reference-free quality estimation (QE) system for speech translation built on a Qwen3-ASR backbone, which accepts source audio and a translation hypothesis as joint input. Hidden states from all backbone layers are combined via a learnable sparsemax scalar mix, then re-encoded by a lightweight bidirectional Transformer to enable full cross-modal interaction prior to pooling into a shared embedding. Three independent prediction heads are trained on complementary supervision signals: human direct assessment (DA) annotations, MetricX-24 pseudo-labels, and xCOMET pseudo-labels. To address the scarcity of human-annotated data, we train on a combination of synthetically corrupted examples and silver pseudo-labeled machine translation outputs, using a curriculum that begins on synthetic and silver data and gradually shifts toward human-annotated examples. HydraQE outperforms cascaded text-based baselines and prior direct speech QE systems, demonstrating that end-to-end speech translation QE is competitive with cascaded approaches.
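
A sparsemax scalar mix can be sketched in a few lines: sparsemax projects one logit per layer onto the probability simplex, often giving exact zeros, and the mixed representation is the weighted sum of layer states. The following is an illustrative pure-Python version of the standard sparsemax projection (Martins and Astudillo, 2016); the function names and toy inputs are assumptions, and HydraQE's actual implementation is not described here.

```python
def sparsemax(logits):
    """Euclidean projection of logits onto the probability simplex."""
    z_sorted = sorted(logits, reverse=True)
    cumsum, tau = 0.0, 0.0
    for i, z in enumerate(z_sorted, start=1):
        cumsum += z
        # The support is the longest prefix satisfying this condition.
        if 1.0 + i * z > cumsum:
            tau = (cumsum - 1.0) / i
    return [max(z - tau, 0.0) for z in logits]


def scalar_mix(layer_states, logits):
    """Weighted sum of per-layer vectors with sparsemax weights."""
    weights = sparsemax(logits)
    dim = len(layer_states[0])
    return [sum(w * h[d] for w, h in zip(weights, layer_states))
            for d in range(dim)]


# Three toy "layers" with 2-dimensional hidden states.
layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(sparsemax([0.6, 0.4, -2.0]))          # third layer gets zero weight
print(scalar_mix(layers, [1.0, 0.0, -1.0]))  # all weight on the first layer
```

Unlike a softmax mix, layers with low logits receive exactly zero weight, so the learned mix can ignore unhelpful layers outright.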


[39] Co-Evolving Skill Generation and Policy Optimization cs.CLPDF

Zhiwei Zhang, Yudi Lin, Nikki Lijing Kuang, Linlin Wu, Xiaomin Li

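TL;DR: This paper proposes an online reinforcement learning framework for validating the utility of newly generated skills in skill-augmented RL. By comparing the reward gap between a base-skill group and a group augmented with a candidate skill, it estimates the candidate's marginal contribution in a specific task context, filters out ineffective or harmful skills before storage, and uses this utility signal to train the policy itself as a skill generator.

Details

Motivation: Existing skill-augmented RL methods usually store skills generated by large language models directly, without assessing a new skill's utility beforehand. As a result, ineffective or even harmful skills enter the skill bank and degrade later performance.

Result: The abstract reports no specific benchmarks or quantitative results; the method aims to improve skill-bank quality and policy performance through pre-storage validation.

Insight: The contribution is an online skill validation framework with no additional rollout overhead. It filters skills through marginal-utility estimation and uses that signal to co-evolve skill generation and policy optimization, reducing reliance on proprietary models.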

Abstract: Skill-augmented reinforcement learning improves language agents by storing reusable procedural knowledge acquired from past experience. Existing methods typically use strong language models to analyze trajectories, generate skills, and update a retrievable skill bank during online training. However, they rarely assess whether a newly generated skill is useful before it is stored and reused. We find that this assumption is unreliable: even skills generated by proprietary frontier LLMs exhibit highly mixed utility, with many providing little benefit or even degrading performance. Once such skills enter the bank, their effects are difficult to identify, because subsequent rollout feedback is delayed and usually reflects the combined effect of multiple retrieved skills rather than the marginal contribution of any individual skill. We propose an online reinforcement learning framework for pre-storage skill validation. The framework estimates whether a candidate skill contributes useful information beyond the skills already retrieved for the current task. It uses the standard rollout budget to form two matched groups under the same task and retrieval context: base rollouts conditioned on the currently retrieved skills, and skill-augmented rollouts conditioned on the same skills plus one candidate skill induced from the base trajectories. The reward gap between these two groups estimates the candidate skill’s context-dependent marginal utility, enabling the framework to promote useful skills while filtering ineffective or harmful ones without additional rollout overhead. The framework further uses this marginal-utility signal to train the policy itself as a skill generator, reducing reliance on repeated calls to proprietary models. The learned skill-generation likelihood serves as a context-dependent score for retrieval-time reranking and outdated-skill pruning as the policy evolves.
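
The matched-group comparison described in the abstract reduces, in its simplest form, to a difference of mean rewards between two rollout groups. The sketch below shows only that arithmetic; the function names and the `min_gap` acceptance threshold are hypothetical, and the paper may use a different estimator or decision rule.

```python
from statistics import mean


def marginal_utility(base_rewards, augmented_rewards):
    """Reward gap between skill-augmented rollouts and base rollouts."""
    return mean(augmented_rewards) - mean(base_rewards)


def should_store(base_rewards, augmented_rewards, min_gap=0.0):
    """Accept a candidate skill only if its estimated gap exceeds min_gap."""
    return marginal_utility(base_rewards, augmented_rewards) > min_gap


# Same task and retrieval context; the second group adds one candidate skill.
base = [0.0, 1.0, 0.0, 1.0]
augmented = [1.0, 1.0, 0.0, 1.0]
print(marginal_utility(base, augmented))
print(should_store(base, augmented))
```

Because both groups share the task and the already retrieved skills, the gap isolates the candidate's contribution in that context rather than the combined effect of everything retrieved.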


[40] From Statute to Control Flow: Span-Grounded Deontic Trees for Defeasible Scope Parsing cs.CL | cs.AI | cs.CEPDF

Jian Chen, Siyuan Li, Chucheng Wan, Zixuan Yuan

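TL;DR: This paper proposes Span-Grounded Deontic Trees (SG-DT), an intermediate representation for addressing Silent Scope Omission (SSO) in rule-following agents, where a model applies a general rule but ignores nested exceptions or counter-exceptions. The authors also introduce NormBench, a benchmark of 2,290 legal and policy provisions in Chinese, English, and cross-lingual settings, for evaluating defeasible scope parsing. Experiments reveal pathologies in frontier LLMs such as Recursion Decay and an Auditability Trap, while using SG-DT as a constrained output improves whole-tree fidelity and defeater recovery.

Details

Motivation: To address Silent Scope Omission in rule-following agents that execute policies and regulations: model outputs appear compliant but drop important edge-case exceptions, which is at root a bottleneck in statutory and policy understanding. Existing legal NLP benchmarks focus on end-task outcomes and can overlook the structural omissions that cause SSO.

Result: Evaluating frontier LLMs on NormBench (covering Chinese laws and local policies, English U.S. tax law, GDPR, and corporate policies, and cross-lingual settings) reveals Recursion Decay (performance drops sharply as defeater depth increases) and an Auditability Trap (models retrieve relevant spans but fail to assemble correct control flow). Using SG-DT as a constrained intermediate output improves whole-tree fidelity and defeater recovery, and downstream experiments show that the gains concentrate on exception-active, SSO-prone cases.

Insight: The contributions are SG-DT, a compiler-style intermediate representation that anchors every logical branch to source spans and requires explicit exclusion guards, enabling deterministic compilation and audit, and NormBench, a benchmark built specifically for defeasible scope parsing to diagnose SSO. Viewed objectively, the approach formalizes the logical structure of legal text and gives the model an explicit constrained output space, which helps improve reasoning reliability under complex rules.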

Abstract: Rule-following agents tasked with executing policies and regulations often fail via Silent Scope Omission (SSO): a model applies a general rule but silently drops nested exceptions or counter-exceptions, producing outputs that appear compliant yet break on important edge cases. Although such failures are often framed as an agentic-systems problem, the underlying bottleneck is statutory and policy understanding, a capability typically studied in legal NLP. However, most existing legal NLP benchmarks emphasize end-task outcomes, which can overlook the structural omissions that cause SSO. To diagnose and mitigate SSO, we introduce NormBench, a benchmark of 2,290 provisions spanning Chinese (laws and local policies), English (U.S. tax law, GDPR, and corporate policies), and cross-lingual settings, designed for defeasible scope parsing: identifying precisely which clause overrides which. NormBench uses Span-Grounded Deontic Trees (SG-DT), a compiler-style intermediate representation that anchors every logical branch to source spans and requires explicit exclusion guards, enabling deterministic compilation and audit. Evaluations of frontier LLMs reveal two recurring pathologies: (1) Recursion Decay, where performance drops sharply as defeater depth increases, and (2) an Auditability Trap, where models retrieve relevant spans but fail to assemble correct control flow. Using SG-DT as a constrained intermediate output improves whole-tree fidelity and defeater recovery, and downstream experiments show that its utility is mechanism-specific: gains concentrate on exception-active, SSO-prone cases, while aggregate accuracy can be mixed when the added structure is unnecessary or parser fidelity is low.


[41] Multilingual Sentiment Aware Text Summarization A Reinforcement Learning Approach for Consistency Maintenance cs.CLPDF

Mikhail Krasitskii, Alexander Gelbukh, Olga Kolesnikova, Grigori Sidorov

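TL;DR: This paper studies sentiment drift in summarization models trained with Reinforcement Learning from Human Feedback (RLHF): summary outputs shift systematically toward neutral sentiment relative to the source text. Through experiments across multiple datasets, model architectures, and eight languages, the authors find that sentiment drift is positively correlated with KL regularization strength and propose a Policy Attribution framework to explain it. On that basis, they propose a sentiment-aware modification of KL regularization that mitigates sentiment drift while maintaining summarization quality.

Details

Motivation: Although RLHF has significantly improved the quality and fluency of large language models in text summarization, its impact on affective properties is not well understood. The paper examines how RLHF alignment objectives affect sentiment preservation and exposes potential limitations of current alignment methods with respect to emotional expression.

Result: The experiments show that sentiment drift is a consistent phenomenon that intensifies as KL regularization strength increases. The proposed sentiment-aware KL regularization mitigates sentiment drift across multiple datasets and languages while maintaining summarization quality.

Insight: The contribution is the first systematic quantification of sentiment drift in RLHF, together with a Policy Attribution framework that decomposes the RLHF objective. Viewed objectively, the study reveals a trade-off between alignment stability and affective fidelity and points toward alignment strategies that explicitly account for sentiment preservation.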

Abstract: Reinforcement Learning from Human Feedback (RLHF) has significantly improved the quality and fluency of large language models in text summarization. However, its impact on affective properties remains insufficiently understood. In this work, we study sentiment drift, a systematic shift toward neutral sentiment in RLHF-based summarization outputs compared to source texts. We conduct extensive experiments across multiple datasets, model architectures, and eight languages to analyze how alignment objectives influence sentiment preservation. Our results show that sentiment drift is a consistent phenomenon that becomes stronger with increased KL regularization strength, indicating a trade-off between alignment stability and affective fidelity. To explain this behavior, we introduce a Policy Attribution framework that decomposes the RLHF objective and quantifies the contribution of its components. Our analysis reveals that KL regularization is the primary driver of sentiment suppression across all settings. Based on these findings, we propose a sentiment-aware modification of the KL regularization term, which selectively reduces constraints on sentiment-bearing tokens. Empirical results demonstrate that this approach mitigates sentiment drift while maintaining summarization quality. Overall, our findings highlight a fundamental limitation of current alignment methods: while they improve factual consistency and safety, they may unintentionally suppress emotional expressiveness. This motivates the development of alignment strategies that explicitly account for affective preservation.


[42] PACT: Learning Diverse Diagnostic Strategies via Privileged Synthesis and Branch Consensus cs.CL | cs.AIPDF

Gen Li, Yuanze Hu, Zhichao Yang, Qingchen Yu, Jianwei Lv

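TL;DR: This paper proposes PACT (Periodic Anchor Consensus Training), a framework for training medical dialogue agents that can flexibly apply multiple diagnostic reasoning paradigms. It uses the DPS (Doctor-Patient-Supervisor) data synthesis method to generate high-quality multi-paradigm diagnostic dialogues under information isolation, and a branch consensus training strategy that trains one LoRA branch per paradigm and periodically aggregates the branches into a shared anchor model through sign consensus.

Details

Motivation: Existing LLM-based medical agents show strong medical reasoning ability, but single-paradigm or naively mixed dialogue supervision causes different diagnostic reasoning paradigms to interfere with one another, making them hard to learn. Clinical diagnosis requires flexible use of multiple reasoning paradigms under incomplete patient information, so a method that coordinates the learning of multiple diagnostic strategies is needed.

Result: Experiments are run on a dynamic multi-turn Chinese medical diagnosis benchmark. PACT outperforms the compared proprietary, medical-specialized, and task-adapted baselines on both diagnostic outcome and consultation-process metrics, reaching state-of-the-art performance.

Insight: The contribution is a two-stage framework that combines privileged-information synthesis with branch consensus training. DPS uses complete electronic medical records for quality control while ensuring the doctor agent sees only patient-visible information, so it produces validated multi-paradigm dialogues without leaking hidden clinical answers. Branch consensus training periodically aggregates the paradigm-specific branches into a shared anchor by consensus, coordinating multi-paradigm learning and reducing interference.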

Abstract: Clinical diagnosis requires flexible use of multiple reasoning paradigms under incomplete patient information. Existing LLM-based medical agents show strong medical reasoning ability, but single-paradigm or naively mixed dialogue supervision makes these paradigms difficult to learn without interference. We propose PACT (Periodic Anchor Consensus Training), a framework that couples supervised multi-paradigm dialogue synthesis with consensus-based Branch training. At the data level, DPS (Doctor-Patient-Supervisor) uses complete electronic medical records (EMRs) for quality control while keeping the doctor agent restricted to patient-visible information. This produces validated dialogues under four diagnostic reasoning paradigms without leaking hidden clinical answers. At the training level, PACT trains one paradigm-specific LoRA Branch per paradigm and periodically aggregates Branches into a shared Anchor through sign consensus. We further construct a dynamic multi-turn Chinese medical diagnosis benchmark for interactive consultation. Experiments show that PACT achieves state-of-the-art performance among compared proprietary, medical-specialized, and task-adapted baselines on diagnostic outcome and consultation-process metrics.
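
The abstract does not define the sign consensus rule, so the following is only one plausible interpretation, loosely modeled on sign-election schemes used in model merging: for each parameter coordinate, keep the majority sign among the branch updates and average only the updates that agree with it. The function name, the tie handling, and the flat-list parameter layout are all assumptions.

```python
def sign_consensus(branch_deltas):
    """Aggregate per-branch parameter updates coordinate by coordinate.

    branch_deltas[b][i] is branch b's update to parameter i relative to the
    shared anchor. Coordinates with no majority sign are left unchanged (0.0).
    """
    merged = []
    for values in zip(*branch_deltas):
        positives = [v for v in values if v > 0]
        negatives = [v for v in values if v < 0]
        if len(positives) > len(negatives):
            merged.append(sum(positives) / len(positives))
        elif len(negatives) > len(positives):
            merged.append(sum(negatives) / len(negatives))
        else:
            merged.append(0.0)  # tie: branches disagree, skip this coordinate
    return merged


# Three branches, three parameters.
deltas = [
    [1.0, -1.0, 0.5],
    [2.0, 1.0, -0.5],
    [3.0, -3.0, 0.0],
]
print(sign_consensus(deltas))
```

Under this rule, updates that conflict in direction are dropped instead of averaged toward zero, which is one way to limit interference when branches are folded back into a shared model.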


[43] Beyond Averages: Evaluating LLMs on Human Survey Replication at the Distributional Level cs.CLPDF

Jeonghyeon Moon, Jiwon Kim, Yeheum Lah, Yoonju Han, Yuncheol Kang

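TL;DR: This paper evaluates how well large language models replicate human survey responses at the distributional level, using a non-public 2010 consumer choice experiment on Korean instant noodle purchases. LLMs reproduce condition-level patterns reasonably well but capture distributional structure poorly: for purchase quantity, no model beats a simple condition-insensitive baseline distribution. Mean-based evaluation can also be misleading, and input configuration (such as structured personas and multimodal inputs) affects replication quality.

Details

Motivation: Prior work has mainly used mean-level or aggregate agreement to evaluate LLM simulation of human survey responses, which cannot show whether LLMs reproduce the variability of human behavior. The paper evaluates LLM replication at the distributional level to give a fuller picture of how accurately LLMs simulate human responses.

Result: In the Korean instant noodle purchase experiment, LLMs perform reasonably on condition-level patterns but fail at distributional alignment; for purchase quantity, no model beats a simple baseline that only matches the pooled human distribution. Structured personas and multimodal inputs improve alignment, while explicit reasoning prompting degrades it monotonically.

Insight: The contribution is the first systematic distribution-level evaluation of LLM replication of human surveys, which exposes the limits of mean-based evaluation and highlights how much input configuration matters. Viewed objectively, the approach offers a more rigorous framework for evaluating LLMs in social science simulation and stresses looking at the data distribution rather than the mean alone.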

Abstract: LLMs are increasingly used to simulate human survey responses, but prior work has mainly evaluated replication using mean-level or aggregate agreement, offering limited insight into whether LLMs reproduce the variability of human behavior. We evaluate LLM-based survey replication at the distributional level using a non-public 2010 consumer choice experiment on Korean instant noodle purchases, a setting unlikely to overlap with model training data. We evaluate three response variables of differing statistical type: binary purchase incidence, categorical brand choice, and count purchase quantity. For each, we compare human and LLM responses at mean-level, pattern, and distributional alignment, and against reference baselines from the human data alone. LLMs reproduce condition-level patterns reasonably well but fail to capture distributional structure: for purchase quantity, no model beats a condition-insensitive baseline that simply matches the pooled human distribution. Because models that match human means well can still produce distributions further from humans than this baseline, mean-based evaluation alone can be actively misleading. Replication also varies with input configuration, with structured personas and multimodal inputs improving alignment while explicit reasoning prompting degrades it monotonically.
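
A toy example shows why matching the mean says little about matching the distribution. The samples below are invented for illustration and are not from the paper; total variation distance is used as the distance only as an assumption, since the abstract does not name its distributional metric.

```python
from collections import Counter
from statistics import mean


def total_variation(sample_a, sample_b):
    """Total variation distance between two empirical distributions."""
    pa, pb = Counter(sample_a), Counter(sample_b)
    na, nb = len(sample_a), len(sample_b)
    support = set(pa) | set(pb)
    return 0.5 * sum(abs(pa[x] / na - pb[x] / nb) for x in support)


# Invented purchase quantities: humans vary, the simulated respondent does not.
human = [0, 1, 1, 2, 2, 2, 3, 3, 4]
simulated = [2] * 9

print(mean(human), mean(simulated))       # identical means
print(total_variation(human, simulated))  # yet the distributions differ widely
```

Here a mean-level comparison would report a perfect match, while the distributional distance shows that two thirds of the probability mass sits in the wrong place.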


[44] Explicit Representation Alignment for Multimodal Sentiment Analysis cs.CLPDF

Baode Wang, Ziming Wang, Huacan Wang, Ronghao Chen, Biao Wu

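TL;DR: This paper proposes a unified multimodal sentiment analysis framework that uses vision-language models to convert visual content into structured textual descriptions, projecting heterogeneous modalities into a shared linguistic space and enabling interpretable, text-centric reasoning. A hybrid learning strategy combining semantic token selection with batch-level uniformity regularization improves robustness, and the method consistently outperforms strong unimodal and multimodal baselines on multiple multimodal sentiment and emotion benchmarks, achieving state-of-the-art performance.

Details

Motivation: Multimodal affective analysis aims to understand human sentiment and emotion by jointly modeling heterogeneous modalities such as text and images, but multimodal models often fail to consistently outperform strong text-only baselines, and performance varies significantly across fusion strategies. The paper identifies representation misalignment between independently pretrained modality encoders as a key bottleneck for effective multimodal learning and shows through controlled experiments that alignment before fusion is often more important than fusion complexity.

Result: Experiments on multiple multimodal sentiment and emotion benchmarks show that the method consistently outperforms strong unimodal and multimodal baselines, achieving state-of-the-art performance.

Insight: The contribution is to place representation alignment at the center of multimodal learning, to unify modalities in a linguistic space with vision-language models, and to introduce a hybrid learning strategy (semantic token selection and batch-level uniformity regularization) that improves robustness and feature-space stability. Viewed objectively, the study stresses cross-modal alignment as a prerequisite rather than focusing only on the fusion architecture, offering a new perspective on multimodal learning.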

Abstract: Multimodal affective analysis aims to understand human sentiment and emotion by jointly modeling heterogeneous modalities such as text and images. However, multimodal models often fail to consistently outperform strong text-only baselines, with performance varying significantly across fusion strategies. In this work, we identify representation misalignment between independently pretrained modality encoders as a key bottleneck for effective multimodal learning, and show through controlled experiments that alignment prior to fusion is often more important than fusion complexity. To address this issue, we propose a unified multimodal affective analysis framework that leverages vision-language models (VLMs) to convert visual content into structured textual descriptions, projecting heterogeneous modalities into a shared linguistic space and enabling interpretable text-centric reasoning. To further improve robustness, we introduce a hybrid learning strategy that combines semantic token selection with a batch-level uniformity regularization objective, encouraging a more dispersed and stable global feature space while mitigating noise introduced by VLM-generated descriptions. Experiments on multiple multimodal sentiment and emotion benchmarks show that our method consistently outperforms strong unimodal and multimodal baselines, achieving state-of-the-art performance. Our analysis further highlights the critical role of representation alignment in multimodal affective learning.


[45] SEF-CLGC at SemEval-2026 Task 11: Logical Notation Impact on Language Model Performance cs.CL | cs.AIPDF

Hanna Abi Akl, Fabien Gandon, Catherine Faron, Pierre Monnin

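TL;DR: This paper revisits SEF-CLGC, a reasoning evaluation pipeline that combines formal logical notations with small language models (SLMs), applied to SemEval-2026 Task 11 Subtask 1 to assess model performance on disentangling content from formal reasoning. Using only SLMs trained on a mix of natural and symbolic languages, the best model achieves a content score of 27.80% on the task while significantly lowering content bias in reasoning.

Details

Motivation: The paper addresses the separation of content and formal reasoning in large language models, combining formal logical notations with small language models to evaluate reasoning performance and reduce content bias during reasoning.

Result: On the SemEval-2026 Task 11 Subtask 1 benchmark, the best model achieves a content score of 27.80% while significantly lowering content bias in reasoning.

Insight: The contribution is to combine formal logical notations with small language models and to train on mixed natural and symbolic language so that reasoning depends less on content, offering a new way to reduce content bias in language models.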

Abstract: This paper revisits our pipeline called Syllogistic Evaluation Framework-Common Logic Grammar Construction (SEF-CLGC). We combine formal logical notations with Small Language Models (SLMs) to evaluate reasoning performance on the SemEval-2026 Task 11 Subtask 1: Disentangling Content and Formal Reasoning in Large Language Models. Our experiments show that by relying solely on SLMs, trained on a combination of natural and symbolic languages, our best model achieves a content score of 27.80% on the task while significantly lowering the content bias in reasoning.


[46] TruthSplit: Operationalizing Conditional Validity in Arguments Through Multi-Perspective Reasoning cs.CLPDF

Benjamin Stieger, Maximilian Terberger, Thomas Huber, Christina Niklaus

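TL;DR: TruthSplit is an interactive system for multi-perspective argument analysis that evaluates the conditional validity of arguments by incorporating worldview-specific background knowledge. It extracts claims and premises from the input text, applies a three-layer natural language inference approach to assess logical and worldview consistency, and generates perspective-specific interpretations based on structured worldview profiles.

Details

Motivation: Existing argumentation tools typically analyze only properties of the argument itself, such as structure and quality, and ignore perspective-specific background knowledge. TruthSplit aims to fill this gap by supporting exploration of how the same claim leads to different conclusions under different worldviews.

Result: By conditioning large language model reasoning on structured worldview profiles, the system generates perspective-specific interpretations, identifies value conflicts and assumption gaps, and visualizes divergence through an interactive interface.

Insight: The contribution is the notion of "conditional validity" and a system combining NLI and LLMs that operationalizes it, using structured worldview profiles to make implicit background knowledge explicit in support of multi-perspective argument analysis.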

Abstract: We present TruthSplit, an interactive system for multi-perspective argument analysis. Existing argumentation tools typically analyze properties of the argument itself, such as structure, quality, stance, or persuasiveness, while leaving perspective-specific background knowledge implicit. TruthSplit addresses this gap by supporting an exploratory analysis of how the same claim can lead to different conclusions when interpreted through worldview-specific values, assumptions, and conceptual definitions. We refer to this perspective-dependent analysis as conditional validity. Given an input argumentative text, TruthSplit extracts claims and premises, applies a three-layer natural language inference (NLI) approach to assess both logical and worldview-specific normative consistency, and conditions large language model (LLM) reasoning on structured worldview profiles that encode core values and decision principles. The system then generates perspective-specific interpretations, identifies value conflicts and assumption gaps, and visualizes divergence through interactive analytical interfaces.


[47] Symbolic and Abstractive Reasoning with Complex Visual Queries cs.CLPDF

Yichi Zhang, Jingdian Lu, Zhuo Chen, Lingbing Guo, Jun Xu

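TL;DR: This paper proposes a novel abstract data type called the complex visual query (CVQ), designed to probe the symbolic and abstractive reasoning abilities of multi-modal large language models (MLLMs). The authors conduct a comprehensive study along three dimensions (data, paradigm, and exploration), including a CVQ synthesis pipeline grounded in multi-modal knowledge graphs, a two-stage training framework, and extensive experimental evaluation.

Details

Motivation: Current MLLMs still fall short at understanding and reasoning over abstract visual content. The paper explores symbolic and abstractive reasoning, a critical yet underexplored dimension of human-like neuro-symbolic reasoning.

Result: Experiments rigorously evaluate MLLMs on the CVQ dataset across multiple dimensions, including reasoning performance and cross-task and cross-scenario generalization, but the abstract gives no specific quantitative results or comparison with the state of the art.

Insight: The contribution is the CVQ abstract data type for systematically evaluating symbolic reasoning, together with a scalable data synthesis pipeline grounded in multi-modal knowledge graphs and a progressive two-stage training framework, offering a new perspective for advancing the reasoning frontier of MLLMs.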

Abstract: Understanding and reasoning over abstract visual content remains a challenge for current multi-modal large language models (MLLMs). In this paper, we explore a novel abstract data type termed complex visual query (CVQ), designed to probe symbolic and abstractive reasoning, which is a critical yet underexplored dimension of human-like neuro-symbolic reasoning for MLLMs. We present a comprehensive investigation from three perspectives: Data $\times$ Paradigm $\times$ Exploration. Specifically, we propose a scalable pipeline for synthesizing CVQs grounded in large-scale multi-modal knowledge graphs, generating a diverse dataset encompassing 14 distinct query types via systematic combinations of first-order logic operators. We further introduce a two-stage training framework that progressively equips MLLMs with robust visual reasoning capabilities. We conduct extensive experiments to rigorously evaluate MLLMs across multiple dimensions, including reasoning performance on CVQs, as well as cross-task and cross-scenario generalization. We believe our work opens new perspectives and avenues for advancing the reasoning frontiers of MLLMs.


[48] One Model, Multiple Goals: Adaptive Multi-Objective Learning for E-commerce Dialogue Systems cs.CLPDF

Mingzhe Li, Jing Xiang, Enguo Zhou, Lang Gao, Tai Li

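TL;DR: This paper proposes MORE, an adaptive multi-objective reinforcement learning framework for e-commerce dialogue systems that jointly optimizes reasoning accuracy and linguistic naturalness. The framework treats reasoning functions as constraints that guide policy optimization, generates responses directly at inference time, and introduces an adaptive multi-reward mechanism to dynamically balance linguistic objectives.

Details

Motivation: Dialogue systems in e-commerce must satisfy several objectives at once: reasoning accurately over user profiles to ensure correct decisions, while generating natural and faithful responses. These objectives are complementary but not identical, and directly mixing their rewards can cause oscillations and unstable learning.

Result: On two real-world dialogue systems at ByteDance and the MultiWOZ 2.2 benchmark, MORE consistently outperforms strong baselines. In 14-day online experiments, overall and reached conversion improve by 16.53% and 30.09%, respectively, with higher user satisfaction and lower handoff rates. In a human-machine comparison, MORE recovers about 60% of the incremental conversion lift achieved by human agents.

Insight: The novelty lies in optimizing reasoning functions as constraints rather than as part of a mixed reward, which avoids unstable learning, and in an adaptive multi-reward mechanism that dynamically balances linguistic objectives. Responses are generated without explicit reasoning steps, which reduces inference overhead.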

Abstract: Dialogue systems in e-commerce scenarios often need to satisfy multiple objectives: accurately reasoning over user profiles (e.g., eligibility, credit limit) to ensure correct decision-making and user state interpretation, while also generating natural and faithful responses. These goals are complementary but not identical. In this work, we propose MORE, an adaptive Multi-Objective REinforcement learning framework that jointly optimizes reasoning accuracy and linguistic naturalness. Our preliminary experiments show that directly mixing rewards with diverging optimization dynamics can cause oscillations and unstable learning. Thus, instead of optimizing a single mixed reward, we treat reasoning functions as constraints that guide policy optimization. At inference time, the system directly generates responses without explicit reasoning steps, while still benefiting from a reasoning-enhanced scaffold and avoiding additional inference overhead. To better balance linguistic objectives during response generation, we introduce an adaptive multi-reward mechanism that aggregates signals such as fluency and naturalness and dynamically reweights them via gradient feedback. We evaluate MORE on two real-world dialogue systems at ByteDance and the MultiWOZ 2.2 benchmark, where it consistently outperforms strong baselines. In 14-day online experiments on ByteDance production traffic, MORE improves overall and reached conversion by 16.53% and 30.09%, respectively, while increasing user satisfaction and reducing handoff rates. Notably, in a human-machine comparison, MORE recovers about 60% of the incremental conversion lift achieved by human agents.


[49] SG-OPD: Sign-Gated On-Policy Distillation via Sign-Consistency Gating and Phased Teacher Sampling cs.CL | cs.LGPDF

Haoran Xu, Hongyu Wang, Yifei Gao, Jiaze Li, Xiaofeng Zhang

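TL;DR: This paper proposes SG-OPD to address unreliable teacher supervision in on-policy distillation. A binary verifier supplies trust signals at both the trajectory and token levels, and phased teacher sampling and a sign-consistency gate dynamically adjust the distillation process. On mathematical reasoning benchmarks, the method outperforms standard on-policy distillation.

Details

Motivation: The effectiveness of standard on-policy distillation relies on two assumptions that often fail in practice: that student and teacher trajectories are aligned, and that the teacher's preferences are uniformly reliable. The paper targets unreliable teacher supervision in order to make distillation more effective.

Result: On competition-level mathematical reasoning benchmarks, SG-OPD achieves average gains of 1.98 and 7.50 at the per-sample and per-question levels, respectively, consistently outperforming standard on-policy distillation.

Insight: The core novelty is an independent binary verifier used as a source of trust signals, combined with two mechanisms, phased teacher sampling and sign-consistency gating, that selectively exploit teacher knowledge at trajectory and token granularity. This handles noise or bias in teacher supervision more robustly.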

Abstract: On-policy distillation (OPD) trains a student on its own trajectories with dense per-token supervision from a stronger teacher, and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its effectiveness implicitly relies on two assumptions that frequently break in practice: trajectory-level alignment between the student and the teacher, and uniform token-level reliability of the teacher’s preferences. We therefore propose Sign-Gated On-Policy Distillation (SG-OPD), which uses a binary verifier as a trust signal for the teacher at two complementary granularities: phased teacher sampling mixes in verifier-endorsed teacher rollouts at cold-start, and a sign-consistency gate extrapolates the distillation update on tokens where the teacher agrees with the verifier-correct direction and interpolates it where it disagrees. Experiments on competition-level mathematical reasoning benchmarks show that SG-OPD consistently outperforms standard OPD, with average gains of 1.98 and 7.50 at the per-sample and per-question levels, respectively.


[50] Multi-Hop Knowledge Composition is Bound by Pretraining Exposure cs.CLPDF

Yannis Karmim, Luis Marti, Djamé Seddah, Valentin Barrière

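TL;DR: This paper studies the failure of large language models at implicit multi-hop reasoning: a model answers single-fact questions correctly but fails on composite questions that require combining several known facts. Through controlled experiments, the authors find that compositional failure persists even when individual facts are perfectly memorized, which they attribute to a lack of exposure to compositional contexts during pretraining.

Details

Motivation: The paper asks why LLMs fail at implicit multi-hop reasoning (for example, combining two known facts to answer a new question) even when each fact is mastered individually, and aims to establish that the cause is a deficiency in pretraining data rather than missing knowledge.

Result: In a controlled natural language setting, compositional failure persists even at 97% 1-hop accuracy, indicating a pretraining failure. Across nine data augmentation formats, compositional pretraining helps only individuals that were exposed to compositional contexts during pretraining and never transfers to unexposed individuals.

Insight: By strictly separating individuals exposed and unexposed during pretraining, the paper provides empirical evidence that exposure to compositional contexts during pretraining is a necessary condition for implicit multi-hop reasoning. It emphasizes data composition over mere knowledge memorization and offers a data-centric perspective on improving model reasoning.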

Abstract: Large Language Models fail at implicit multi-hop reasoning: a model answers “When was $X$ born?” and “Who is $Y$’s closest friend?” correctly but fails on “When was $Y$’s closest friend born?” in a single forward pass, even when both facts are perfectly memorized and individually retrievable. We study this failure in a controlled natural language setting with a strict separation between individuals exposed to compositional contexts during pretraining and those that never appear in any such context. We confirm that compositional failure persists even at 97% 1-hop accuracy, establishing the gap as a pretraining failure rather than a knowledge absence. We propose and test nine data-centric augmentation formats and find that compositional pretraining transfers to unseen questions for exposed individuals, but never to individuals absent from compositional pretraining, suggesting that exposure to compositional contexts during pretraining is a necessary condition for implicit multi-hop reasoning.


[51] Is Text All You Need? Text as a Universal Information Bottleneck for Speech LLMs cs.CL | eess.ASPDF

Ming-Hao Hsu, Yuxuan Hu, Shujie Liu, Jinyu Li, Yan Lu

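TL;DR: This paper proposes Convex Gate (C-Gate), a speech-to-LLM interface that constrains speech representations to the convex hull of the LLM's input embedding manifold. It resolves the trade-off in existing methods between discrete alignment and continuous representations, and improves speech recognition and emotion recognition jointly.

Details

Motivation: Existing speech-to-LLM interfaces either enforce near-discrete token alignment, which loses paralinguistic information, or learn unconstrained continuous representations, which can drift away from the LLM's input space and degrade autoregressive decoding. A bridge is needed that keeps both compatibility with the pretrained LLM and continuous expressivity.

Result: Across automatic speech recognition (ASR) and emotion recognition, C-Gate improves LibriSpeech word error rate (WER) by up to 48.7% relative while matching or exceeding the emotion accuracy of single-task models, giving strong joint performance.

Insight: The novelty is an architectural convex-hull constraint that represents each speech frame as a convex combination of token embeddings, ensuring compatibility with the pretrained LLM. The key finding is that information is carried not by discrete token identities but by time-resolved trajectories in the embedding space, so geometry, rather than discreteness, is the fundamental design factor in speech-to-LLM interfaces.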

Abstract: Large language models (LLMs) provide a powerful reasoning backbone for speech understanding, but integrating continuous acoustic signals into a frozen LLM remains challenging. Existing speech-to-LLM interfaces typically operate at two extremes: either enforcing near-discrete token alignment, which benefits transcription but loses paralinguistic information, or learning unconstrained continuous representations, which can drift away from the LLM’s input space and degrade autoregressive decoding. In this work, we propose Convex Gate (C-Gate), a speech-to-LLM bridge that constrains all speech representations to lie within the LLM’s input embedding manifold with an architectural convex-hull constraint. Concretely, each frame is represented as a convex combination of token embeddings, ensuring compatibility with the pretrained LLM while preserving continuous expressivity. Across automatic speech recognition (ASR) and emotion recognition, C-Gate achieves strong joint performance, improving LibriSpeech WER by up to 48.7% relative while matching or exceeding single-task emotion accuracy. Beyond performance, our analysis reveals a key insight: information is not carried by discrete token identities, but by time-resolved trajectories in the embedding space. Causal interventions confirm that both the trajectory structure and alignment to the pretrained embedding manifold are critical for performance. These results suggest that geometry, rather than token discreteness, is the fundamental design factor in speech-to-LLM interfaces, and provide a controlled regime for studying multimodal integration in frozen LLMs. We release the checkpoint, per-sample outputs, mechanism dumps, and intervention suite for replication.
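The convex-combination idea in the abstract above can be illustrated with a small sketch. It assumes the mixing weights come from a softmax over per-token logits, which the abstract does not specify; `convex_gate`, the toy embedding table, and the logits are hypothetical names and values chosen for illustration only.

```python
import math

def softmax(logits):
    """Numerically stable softmax: non-negative weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def convex_gate(frame_logits, token_embeddings):
    """Map one speech frame to a convex combination of token embeddings.

    Assumption: the weights are a softmax over per-token logits. Any
    non-negative weights summing to 1 would keep the result inside the
    convex hull of the embedding table.
    """
    weights = softmax(frame_logits)
    dim = len(token_embeddings[0])
    return [
        sum(w * emb[d] for w, emb in zip(weights, token_embeddings))
        for d in range(dim)
    ]

# Toy 3-token vocabulary with 2-dimensional embeddings (hypothetical values).
token_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frame = convex_gate([2.0, 0.5, -1.0], token_embeddings)
```

With this toy table the hull is a triangle, so `frame` always has non-negative coordinates whose sum is at most 1, whatever the logits are.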


Yifan Chen, Haitao Li, Yiran Hu, Kaisong Song, Jun Lin

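TL;DR: This paper presents LexRubric, a rubric-based diagnostic benchmark for evaluating large language models on open-ended Chinese legal tasks. The benchmark contains 649 instances from legal consultation and judicial examination covering 14 legal scenarios, and provides 12,337 expert-written atomic scoring criteria organized under a unified six-dimensional framework, enabling accurate evaluation and diagnostic analysis across tasks and dimensions.

Details

Motivation: As large language models are increasingly applied to real-world legal tasks, evaluating the reliability of their open-ended legal responses has become essential. These tasks require context-sensitive answers and leave little room for error, so fine-grained, diagnostic evaluation is needed to identify the specific sources of response quality failures.

Result: The authors test multiple judge models and compare model-based judgments with human judgments to validate the reliability of the evaluation. They further evaluate 18 recent general and legal-domain LLMs on LexRubric. Different models show distinct capability profiles, and open-ended legal questions remain challenging for current LLMs.

Insight: The novelty lies in a fine-grained diagnostic legal benchmark built on expert-written rubrics. Its unified six-dimensional framework and atomic scoring criteria provide a more precise and interpretable tool for evaluating legal LLMs and help analyze specific strengths and weaknesses in legal reasoning and answering.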

Abstract: As large language models (LLMs) are increasingly applied to real-world legal tasks, evaluating the reliability of their open-ended legal responses has become essential. These tasks require context-sensitive answers and allow little room for error, motivating fine-grained and diagnostic evaluation that can identify specific sources of response quality failures. We introduce LexRubric, a rubric-based benchmark for evaluating open-ended Chinese legal tasks. LexRubric contains 649 instances from legal consultation and judicial examination, which reflect both everyday legal needs and professional legal reasoning and cover 14 legal scenarios. It further includes 12,337 expert-written atomic scoring criteria organized under a unified six-dimensional framework, enabling accurate evaluation and diagnostic analysis across tasks and evaluation dimensions. To validate the reliability of the evaluation, we test multiple judge models and compare model-based judgments with human judgments. We further evaluate 18 recent general and legal-domain LLMs on LexRubric. Results show that different models exhibit distinct capability profiles, and that open-ended legal questions remain challenging for current LLMs. Data is available at: https://github.com/foggpoy/LexRubric.


[53] PriFT: Prior-Support Guided Supervised Fine-Tuning cs.CL | cs.LGPDF

Ke Wang, Shuangqi Li, Mathieu Salzmann, Pascal Frossard

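TL;DR: This paper proposes PriFT, a supervised fine-tuning method that uses a frozen pretrained model to produce a stable token-weighting signal that guides fine-tuning. It targets the overfitting that arises in conventional SFT from fitting target tokens that are inconsistent with the pretrained distribution, and achieves state-of-the-art results among SFT baselines on mathematical reasoning, code generation, and medical question answering.

Details

Motivation: Conventional supervised fine-tuning (SFT) uses an off-policy objective that fits fixed demonstrations token by token, including target tokens poorly aligned with the model's pretrained distribution, which can cause overfitting and weaker generalization than reinforcement learning. Existing work mitigates this by assigning token weights based on the model currently being fine-tuned, but the weight computation becomes entangled with the optimization trajectory, producing self-reinforcing dynamics as the distribution rapidly departs from the pretrained model.

Result: Extensive experiments on mathematical reasoning, code generation, and medical question answering show that PriFT achieves state-of-the-art results among SFT baselines and provides a better initialization for subsequent RL training. Both instantiations, PriFT-prob and PriFT-mass, consistently improve performance by taking the reweighting signal from the pretrained model.

Insight: The novelty lies in using a frozen pretrained model as a reference to produce a stable token reweighting signal (an estimate of prior support), which avoids entangling the weight computation with the optimization trajectory. This is a more robust way to preserve pretrained knowledge and reduce distribution shift during fine-tuning, and it could be applied to other token-reweighting fine-tuning strategies.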

Abstract: Supervised fine-tuning (SFT) is an efficient approach for downstream task adaptation and often serves as the initialization stage for reinforcement learning (RL), but it can show weaker generalization than RL. A key limitation is its off-policy objective: SFT fits fixed demonstrations token by token, including targets poorly aligned with the model’s pretrained distribution, which can lead to overfitting. A recent line of work addresses this issue by assigning larger training weights to tokens better aligned with the current model’s predictive distribution, with the intuition that fitting these tokens is less distortive to the model’s pretrained knowledge and representations. However, computing the token weights from the model that is currently being fine-tuned entangles token weights with the optimization trajectory, inducing self-reinforcing dynamics as the distribution rapidly departs from the pretrained model. To address this, we propose PriFT (Prior-support guided Fine-Tuning), which derives token weights from a frozen pretrained reference to obtain a stable reweighting signal unaffected by fine-tuning. This signal estimates prior support: the extent to which each target token is supported by the pretrained distribution. Across multiple existing token-reweighting rules, taking the reweighting signal from the pretrained model instead of the online model consistently improves performance. We introduce two instantiations: PriFT-prob uses pretrained token probability, while PriFT-mass selects tokens by cumulative probability mass under the pretrained distribution. Extensive experiments on mathematical reasoning, code generation, and medical question answering show that PriFT achieves state-of-the-art results among SFT baselines and provides a better initialization for subsequent RL training.
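A minimal sketch of the two weighting rules named in the abstract above, under stated assumptions. The abstract does not give the exact formulas, so this sketch assumes that the PriFT-prob weight equals the frozen reference probability of the target token, and that PriFT-mass keeps a target token when it lies inside the top cumulative-mass set of the frozen reference distribution. The function names, the `mass` threshold, and the toy numbers are all hypothetical.

```python
import math

def prob_weight(ref_dist, target):
    """Assumed PriFT-prob rule: weight = frozen reference probability of the target."""
    return ref_dist[target]

def mass_weight(ref_dist, target, mass=0.75):
    """Assumed PriFT-mass rule: weight 1 if the target is in the smallest set of
    top-ranked tokens whose reference probability reaches `mass`, else 0."""
    ranked = sorted(range(len(ref_dist)), key=lambda i: ref_dist[i], reverse=True)
    cumulative = 0.0
    for token in ranked:
        if token == target:
            return 1.0
        cumulative += ref_dist[token]
        if cumulative >= mass:
            return 0.0
    return 0.0

def weighted_nll(policy_target_probs, weights):
    """Weighted token-level negative log-likelihood, normalized by total weight."""
    total = sum(weights)
    if total == 0.0:
        return 0.0
    loss = sum(-w * math.log(p) for w, p in zip(weights, policy_target_probs))
    return loss / total

# Toy frozen-reference distribution over a 4-token vocabulary (hypothetical).
ref_dist = [0.5, 0.3, 0.15, 0.05]
weights = [mass_weight(ref_dist, 1), mass_weight(ref_dist, 2)]
loss = weighted_nll([0.5, 0.1], weights)
```

Because the weights depend only on the frozen reference, they stay fixed as the fine-tuned policy changes, which is the property the abstract emphasizes.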


[54] Guide Me Out: A Framework to Benchmark VLM Operators Communication in Crisis Scenarios cs.CLPDF

Giacomo Gonella, Stefano Menini, Marco Guerini

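TL;DR: This paper proposes a benchmarking framework for evaluating vision-language models (VLMs) as AI operators that guide civilians through simulated evacuation crises. The framework tests two communication strategies (narrowcast vs. broadcast), two environment representations (visual vs. graph-based), and two threat behaviors (static vs. moving) on maps of varying structural complexity. Narrowcast generally outperforms broadcast, the visual modality is critical to performance, and moving threats markedly increase guidance failure rates.

Details

Motivation: Current NLP research in crisis communication is mainly limited to static, text-only classification tasks and overlooks the critical communicative role of AI operators in dynamic, embodied scenarios. The paper fills this gap by evaluating VLMs as guides in dynamic crisis scenarios.

Result: Tests on nine maps of varying structural complexity show that narrowcast consistently reduces civilian failure rates compared with broadcast across all difficulty levels. The visual modality drives performance, while adding an adjacency graph is model-dependent and often harmful. Moving threats raise failure rates across all conditions.

Insight: The novelty lies in a dynamic, embodied crisis-communication benchmark that goes beyond traditional static text classification. The core insight is that deploying VLMs as AI operators in evacuation scenarios is non-trivial: the choice of communication strategy (narrowcast or broadcast) and input representation (visual or graph) directly determines whether the intervention succeeds, which provides evaluation dimensions and design considerations for future use of VLMs in dynamic decision-making settings.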

Abstract: Effective crisis response requires spatially grounded communication that bridges linguistic guidance of civilians with the physical environment, accounting for structural bottlenecks, evolving threats, and agent-specific contexts. Yet, current NLP research in crisis communication remains mainly limited to static, text-only classification settings, overlooking the critical communicative role of AI operators in dynamic, embodied scenarios. We address this gap with a novel benchmarking framework for evaluating Vision-Language Models (VLMs) tasked with guiding civilian agents through simulated evacuations. We test two communication strategies (narrowcast vs. broadcast), two environment representations (visual vs. graph-based), and two threat behaviors (static vs. moving) across nine maps of varying structural complexity. Our results show that Narrowcast consistently reduces civilian Fail rates compared to Broadcast across all difficulty levels. Guidance quality depends heavily on how the VLM operator represents the world: the visual modality drives performance, while adding an adjacency graph is model-dependent and often harmful. Moving threats raise Fail rates across all conditions as communication must continuously adapt over time. Together, these findings show that deploying VLMs as AI operators in evacuation scenarios remains a non-trivial challenge, where the choice of communication strategy and input representation can directly determine the success or failure of the intervention.


[55] MUDIDI: A Two-Stage Framework for Multilingual Dictionary Digitization with Language Models cs.CLPDF

David Setiawan, Temuulen Khishigsuren, Milind Agarwal, Pagnarith Pit, Aso Mahmudi

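TL;DR: This paper proposes MUDIDI, a two-stage framework for multilingual dictionary digitization. Stage One evaluates the quality of character recognition and markup preservation; Stage Two focuses on dictionary entry segmentation and mapping into a machine-readable lexicographic schema (SIL's Multi-Dictionary Formatter). The authors also release a human-annotated dataset drawn from 30 public-domain dictionaries covering diverse writing systems and language families, and benchmark OCR systems, general-purpose large language models (LLMs), and vision-language models (VLMs) on it.

Details

Motivation: Multilingual dictionaries are valuable documentary resources for low-resource and endangered languages, yet many exist only as scans. Language-specific scripts, complex multi-column layouts, and entries full of abbreviations and cross-references have long made their digitization and conversion into machine-readable formats nearly impossible.

Result: Benchmarking multiple systems on the released dataset shows that LLMs perform best across most writing systems and languages in both stages. Supplying additional information to the LLMs, such as the dictionary introduction, can improve the quality of the digitized dictionary.

Insight: The novelty lies in a two-stage evaluation and processing framework designed specifically for multilingual dictionary digitization, together with a diverse benchmark dataset. The work systematically assesses the potential of LLMs for this kind of complex document understanding and offers practical guidelines for improving results in challenging scenarios.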

Abstract: Multilingual dictionaries are among the most valuable documentary resources for low-resource and endangered languages, yet many remain available only as scans. For many decades, their digitization and conversion into a machine-readable format was nearly impossible due to language-specific scripts and complex multi-column layouts full of entries with abbreviations and cross-references. Recent vision-language models offer a promising solution, but it is unclear how well they preserve characters and markup and how well they process lexicographic structure. We introduce MUDIDI, a two-stage framework for multilingual dictionary digitization. Stage One evaluates the quality of character recognition and markup preservation; Stage Two focuses on dictionary entry segmentation with subsequent mapping into a machine-readable lexicographic schema, SIL’s Multi-Dictionary Formatter. We also release a dataset that consists of human-annotated lexicographic entries collected from 30 public-domain dictionaries featuring diverse writing systems, language families, and formats. We benchmark OCR systems, general-purpose Large Language Models (LLMs), and Vision Language Models (VLMs) on the dataset, demonstrating superior performance of LLMs across most writing systems and languages in both stages, and provide practical guidelines on improving the results for more challenging scenarios. Finally, we show that supplying additional information, such as the dictionary introduction, to the LLMs can improve the quality of the digitized dictionary. GitHub: https://github.com/DavidSamuell/MUDIDI-Pipeline-for-Digitization-of-Multilingual-Dictionary/


[56] Reasoning without Gold Standards: A Proxy-Judge Theory of Autoformalization cs.CLPDF

Lei Xu, Xin Quan, André Freitas

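TL;DR: This paper proposes a reference-free proxy-judge framework for autoformalization, where model outputs are hard to evaluate because no single correct reference exists. The framework replaces exact matching with a vector of property checks organized by structural scope, which drives a reflective refinement loop that iteratively repairs the components judged wrong.

Details

Motivation: Autoformalization translates informal mathematical or logical reasoning into a formally checkable object, but expert-validated formalizations do not scale and a single informal argument can admit many valid formalizations. The paper therefore studies whether partial, structured proxies can substitute for exact references.

Result: Across seven formalization backbones on miniF2F, ProofNet, e-SNLI, and ProntoQA, the refinement loop consistently lifts pass rate over the single-shot in-context learning baseline. On benchmarks where the baseline has room to improve, the per-axis proxy outperforms a matched scalar proxy.

Insight: The novelty is a structured proxy-judge framework organized by property axis, which decomposes evaluation into three structural scopes: global, per-module, and cross-domain alignment. In theory, it contracts to a noise-dependent plateau under bounded judge noise; in practice, it provides an effective refinement signal for complex reasoning tasks that lack gold standards.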

Abstract: Complex reasoning tasks increasingly require systems to produce outputs whose correctness cannot be judged by exact match against a single reference. Autoformalization (AF) is a representative example; it asks a model to translate informal mathematical or logical reasoning into a formally checkable object, yet expert-validated formalizations do not scale beyond toy cases and a single informal argument can admit many valid formal renderings. Progress therefore depends on whether partial, structured proxies can substitute for exact references. We introduce a reference-free proxy-judge framework for AF that replaces gold-standard matching with a vector of per-axis property checks. The framework organizes the proxy along three structural scopes that cover global properties of the elicited object, per-module properties internal to its sub-components, and cross-domain properties that re-align it to the informal source, and aggregates each axis into a verdict vector. The vector drives a reflective refinement loop in which a violated coordinate routes the controller to a matching repair target, so each iteration changes only what is judged wrong. Under bounded judge noise, the expected intrinsic gap contracts geometrically to a noise-dependent plateau. Across seven formalization backbones on miniF2F, ProofNet, e-SNLI, and ProntoQA, refinement consistently lifts Pass Rate over the single-shot ICL baseline, and the per-axis proxy outperforms a matched scalar proxy on benchmarks where the baseline has room to improve. Structured proxy judgments therefore provide both a practical refinement signal and a theoretical handle on convergence when exact references are unavailable.
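The refinement loop described in the abstract above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the three axis names follow the structural scopes in the abstract, but the string-based checks, the repair functions, and `max_iters` are hypothetical stand-ins for real judges and repair targets.

```python
def judge(candidate, checks):
    """Evaluate every axis and return the verdict vector as a dict."""
    return {axis: check(candidate) for axis, check in checks.items()}

def refine(candidate, checks, repairs, max_iters=5):
    """Repair one violated axis per iteration until all checks pass
    or the iteration budget is spent."""
    verdict = judge(candidate, checks)
    for _ in range(max_iters):
        violated = [axis for axis, ok in verdict.items() if not ok]
        if not violated:
            break
        # Route to the repair target that matches the first violated axis,
        # so each iteration changes only what was judged wrong.
        candidate = repairs[violated[0]](candidate)
        verdict = judge(candidate, checks)
    return candidate, verdict

# Hypothetical toy checks on a string "formalization".
checks = {
    "global": lambda s: s.count("(") == s.count(")"),
    "per_module": lambda s: "sorry" not in s,
    "cross_domain": lambda s: "x" in s,
}
repairs = {
    "global": lambda s: s + ")" * (s.count("(") - s.count(")")),
    "per_module": lambda s: s.replace("sorry", "trivial"),
    "cross_domain": lambda s: s,
}

fixed, verdict = refine("(P x -> (Q x sorry", checks, repairs)
```

In this toy run the first iteration balances the parentheses and the second replaces the placeholder, after which every coordinate of the verdict vector passes.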


[57] H2HMem: A Multimodal Memory Benchmark for Agents in Human-Human Interactions cs.CLPDF

Shiping Zhu, Yibo Yang, Zhengyang Wang, Tiancheng Shen, Dandan Guo

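TL;DR: This paper introduces H2HMem, a benchmark for evaluating the memory capabilities of large language model agents in complex human-human interactions. The benchmark contains multimodal information streams and multi-party conversations, and evaluates agents along three dimensions: memory recall, reasoning, and application. Experiments show that advanced agents have substantial limitations in constructing, retaining, and utilizing memories across modalities, participants, and sessions.

Details

Motivation: Existing memory benchmarks focus mainly on single-user, text-only interactions and cannot capture the challenges of human-human interaction: inherent multimodality, complex discourse phenomena such as anaphora, and asynchronous or conflicting information from multiple participants.

Result: Experiments on H2HMem show that advanced agents have major limitations in memory across modalities, participants, and sessions, highlighting substantial room for improvement in next-generation LLM agents.

Insight: The novelty lies in building the first benchmark focused on multimodal memory in human-human interaction. Its design covers multiple participants, multimodal streams, and complex discourse phenomena, providing an evaluation framework for agent memory research that is closer to real applications.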

Abstract: Large language model agents are increasingly deployed in human-human interaction settings, such as meeting assistants and clinical documentation systems, where they must observe conversations and retain information for downstream queries. Unlike traditional human-assistant settings, these environments are inherently multimodal, involve complex discourse phenomena such as anaphora and deixis, and contain asynchronous or conflicting information from multiple participants. However, existing memory benchmarks largely focus on single-user, text-only interactions, failing to capture these challenges. To address this gap, we introduce H2HMem, a Human-to-Human Multimodal Memory Benchmark for evaluating memory capabilities in complex human-human interactions. H2HMem includes both dyadic and multi-party conversations with multimodal information streams, and evaluates agents along three dimensions: memory recall, reasoning, and application. Experiments with advanced agents reveal substantial limitations in constructing, retaining, and utilizing memories across modalities, participants, and sessions, highlighting substantial room for improvement in next-generation LLM agents.


[58] Memory Beyond Recall: A Dual-Process Cognitive Memory System for Self-Evolving LLM Agents cs.CL | cs.AIPDF

Tianxiang Fei, Mingyang Song, Mao Zheng, Xiang Yu

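TL;DR: This paper proposes DCPM, a dual-process cognitive memory system that addresses the weakness of current long-term memory systems for LLM agents on implicit personalization. It reorganizes memory into a cognitive capability hierarchy and, following dual-process theory, introduces a synchronous daytime writer (System1) and an asynchronous nighttime engine (System2) that handle belief revision and schema induction, respectively.

Details

Motivation: Current memory systems for LLM agents typically collapse belief revision, causal coupling, and cross-domain abstraction into a single retrieval interface, which makes them struggle on implicit personalization tasks that require reasoning about how a user has evolved.

Result: On LongMemEval, PersonaMem, and PersonaMem-v2, enabling System2 helps most on tasks that reward implicit cross-session inference (up to +5.20 on PersonaMem-v2) and least on span recall, matching the architectural prediction.

Insight: The main novelty is to borrow dual-process theory from cognitive science and decouple the memory system into two complementary processes, one for immediate recording and one for asynchronous deep reasoning, with memory organized in a layered cognitive capability hierarchy. This better supports agent behavior that depends on long-term evolution and cross-domain abstraction.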

Abstract: Long-term memory for an LLM agent is more than retrieving the right passage at the right time. Current memory systems collapse belief revision, causal coupling, and cross-domain abstraction into a single retrieval surface tuned for surface recall, and consequently struggle on implicit personalisation that requires reasoning over how a user has evolved. We propose DCPM, which reorganises agent memory along a cognitive capability hierarchy ascending from raw inputs and atomic facts, through diachronic belief trajectories and identity, to domain schemas, latent intentions and cross-domain patterns. The hierarchy is driven by two processes inheriting the architectural split of dual-process theory: a synchronous daytime writer (System1) that records belief revisions as doubly linked supersedes chains, and an asynchronous nighttime engine (System2) that induces schemas and intentions and sweeps for cross-domain collisions abstracted into higher-level core schemas. On LongMemEval, PersonaMem and PersonaMem-v2, enabling System2 contributes most where the benchmark rewards implicit cross-session inference (up to +5.20 on PersonaMem-v2) and least on span recall, matching the architectural prediction.
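The abstract above mentions that belief revisions are recorded as doubly linked supersedes chains. A minimal sketch of such a chain is given below; the class name `Belief`, its fields, and the helper functions are hypothetical and only illustrate the data structure, not DCPM's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(eq=False)
class Belief:
    """One belief revision; links point to the older and newer revisions."""
    text: str
    supersedes: Optional["Belief"] = None      # the older belief this one replaces
    superseded_by: Optional["Belief"] = None   # the newer belief that replaces this one

def revise(old: Belief, new_text: str) -> Belief:
    """Append a new revision and link it to the old one in both directions."""
    new = Belief(new_text, supersedes=old)
    old.superseded_by = new
    return new

def trajectory(belief: Belief) -> list:
    """Return the full chain, oldest first, starting from any revision in it."""
    while belief.supersedes is not None:
        belief = belief.supersedes
    texts = []
    while belief is not None:
        texts.append(belief.text)
        belief = belief.superseded_by
    return texts

first = Belief("drinks tea every morning")
second = revise(first, "switched to coffee")
third = revise(second, "now drinks decaf only")
history = trajectory(second)
```

Keeping both links lets a reader walk from the current belief back to its origins, or forward from an outdated belief to what replaced it.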


[59] Detecting Differences Is Not Understanding Structure: Large Language Models Fail at Graph Isomorphism cs.CLPDF

Kumar Thushalika, Sukumar Kishanthan, Asela Hevapathige

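TL;DR: This paper studies the structural reasoning ability of large language models (LLMs) on graph isomorphism, a fundamental problem in graph theory. Although LLMs reach near-perfect accuracy on isomorphism detection, this is illusory: when identical graphs are presented with permuted node labels, LLMs fail to recognize that they are isomorphic.

Details

Motivation: The paper asks whether LLMs genuinely reason about graph structure, using graph isomorphism as a core test of whether they go beyond pattern matching to abstract structural understanding.

Result: LLMs perform well on standard isomorphism detection benchmarks but fail under node-label permutation, showing that their performance does not come from genuine topological understanding.

Insight: The core contribution is to show that LLMs may exploit surface patterns rather than reason about abstract structure in graph tasks, and to stress permutation invariance as a key requirement when evaluating structural reasoning. This matters for the design of future graph reasoning benchmarks and for assessing model capabilities.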

Abstract: Large language models (LLMs) have shown impressive performance on diverse reasoning tasks, yet their capacity for structural reasoning in graphs remains unclear. We investigate whether LLMs can genuinely understand graph isomorphism, a fundamental problem in graph theory. While LLMs achieve near-perfect accuracy on isomorphism detection, we show this performance is illusory. When identical graphs are presented with permuted node labels, LLMs fail to identify their isomorphism. This finding suggests that LLMs exploit patterns rather than reasoning about abstract graph structure. Since permutation invariance is a fundamental requirement for valid structural reasoning, these results indicate that success on graph reasoning benchmarks should not be interpreted as evidence of genuine topological understanding.
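The permuted-label test described in the abstract above can be made concrete with a small sketch. It relabels a graph with a node permutation and confirms isomorphism by brute force, which is only practical for very small graphs. The function names and toy graphs are illustrative and are not taken from the paper's benchmark.

```python
from itertools import permutations

def relabel(edges, mapping):
    """Apply a node relabeling to an undirected edge list."""
    return {frozenset((mapping[u], mapping[v])) for u, v in edges}

def are_isomorphic(nodes_a, edges_a, nodes_b, edges_b):
    """Brute-force isomorphism test: try every bijection between node sets."""
    target = {frozenset(edge) for edge in edges_b}
    source = {frozenset(edge) for edge in edges_a}
    if len(nodes_a) != len(nodes_b) or len(source) != len(target):
        return False
    for perm in permutations(nodes_b):
        mapping = dict(zip(nodes_a, perm))
        if relabel(edges_a, mapping) == target:
            return True
    return False

nodes = [0, 1, 2, 3]
path = [(0, 1), (1, 2), (2, 3)]

# The same path graph with permuted node labels.
permutation = {0: 2, 1: 0, 2: 3, 3: 1}
permuted_path = [(permutation[u], permutation[v]) for u, v in path]

# A star graph with the same number of nodes and edges but different structure.
star = [(0, 1), (0, 2), (0, 3)]

same = are_isomorphic(nodes, path, nodes, permuted_path)
different = are_isomorphic(nodes, path, nodes, star)
```

A permutation-invariant reasoner should answer the `path` versus `permuted_path` query the same way it answers `path` versus `path`, since only the labels differ.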


[60] Emergence of Context Characteristics Sensitivity in Large Language Models cs.CL | cs.AIPDF

Nadya Yuki Wangsajaya, Haeun Yu, Isabelle Augenstein

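TL;DR: This paper studies how large language models' sensitivity to context characteristics evolves during instruction fine-tuning. Analyzing three stages (supervised fine-tuning, direct preference optimization, and reinforcement learning with verifiable rewards), it finds that models come to favor easy-to-understand contexts during SFT, and that later stages may reinforce or adjust these preferences.

Details

Motivation: Prior work has focused on how context characteristics relate to context usage at inference time, leaving open how models acquire these sensitivities during instruction fine-tuning. The paper examines how sensitivity to context characteristics changes across fine-tuning stages.

Result: Experiments across four models and three datasets show that SFT makes models more likely to use contexts that are longer, more similar to the query, and more fluent. The subsequent DPO and RLVR stages reinforce or adjust these preferences depending on the training dataset.

Insight: Context usage is actively reshaped at each instruction fine-tuning stage, which underlines the importance of designing a balanced instruction fine-tuning dataset to ensure robust context utilization and offers a new angle on optimizing the fine-tuning pipeline.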

Abstract: During instruction fine-tuning (IFT), large language models (LLMs) learn to follow instructions by using the provided context to answer a query. While prior work has studied how context characteristics correlate with context usage by the LLM, this analysis has been limited to inference time, leaving open how these relationships are acquired in the first place. Here, we measure how models’ sensitivity to such characteristics shifts across successive IFT stages: supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning with verifiable rewards (RLVR). Experiments across four models and three datasets show that SFT makes models more likely to use contexts that are easy to understand, such as those with greater length, higher context-query similarity, and higher fluency. Post-SFT dynamics may either reinforce or resolve these preferences depending on the training dataset. Our findings reveal that context usage is actively reshaped at each IFT stage, and that designing a balanced IFT dataset is important for ensuring robust context utilization in instruction-tuned models.


[61] Gradient-Guided Reward Optimization for Inference-time Alignment cs.CLPDF

Hankun Lin, Ruqi Zhang

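TL;DR: This paper proposes Gradient-Guided Reward Optimization (GGRO), a lightweight inference-time alignment method for improving the reliability of large language models under distribution drift. It monitors token-level entropy to identify high-uncertainty regions and uses gradient signals from an off-the-shelf reward model to generate nudging tokens, making targeted, minimal interventions during decoding that steer the generation trajectory rather than merely re-ranking samples.

Details

Motivation: Existing inference-time alignment methods such as Best-of-N and rejection sampling rely on sampling-intensive, reward-guided search. Their performance is bounded by the base model's generation quality, and their reliance on imperfect reward models makes them vulnerable to reward hacking. GGRO aims to overcome these limits with more efficient and robust inference-time adaptation.

Result: Experiments show that GGRO consistently improves inference-time alignment across safety, helpfulness, and reasoning benchmarks, increases coverage of high-quality responses, and improves robustness to reward hacking, with minimal computational overhead.

Insight: The novelty lies in shifting inference-time alignment from sampling-intensive search to gradient-guided active intervention, using entropy monitoring and token injection to adjust the trajectory dynamically. This suggests a route to lightweight, robust model adaptation.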

Abstract: Ensuring the reliability of Large Language Models (LLMs) under distribution drift requires inference-time adaptation. While inference-time alignment methods such as Best-of-$N$ and rejection sampling are widely used, they frame the task as a sampling-intensive, reward-guided search, leading to two key limitations: their performance is bounded by the base model’s generation quality, and their reliance on imperfect reward models makes them vulnerable to reward hacking. To address these challenges, we introduce Gradient-Guided Reward Optimization (GGRO), a lightweight inference-time method that performs targeted, minimal intervention during decoding via gradient guidance. Specifically, GGRO monitors token-level entropy to identify high-uncertainty regions indicative of drift or misalignment. Upon detection, it responds by injecting nudging tokens, generated using gradient signals from an off-the-shelf reward model, to steer the generation trajectory rather than merely re-ranking samples. Experiments show that GGRO consistently improves inference-time alignment across safety, helpfulness, and reasoning benchmarks. It also increases coverage of high-quality responses and robustness to reward hacking, with minimal computational overhead. Code is available at https://github.com/lhk2004/GGRO.
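The entropy-monitoring step in the abstract above can be sketched as follows. This covers only the detection of high-uncertainty decoding steps; the generation of nudging tokens from reward-model gradients is not shown. The threshold value and the toy distributions are hypothetical and are not settings from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_uncertainty_steps(step_distributions, threshold):
    """Return indices of decoding steps whose entropy exceeds the threshold."""
    return [
        step
        for step, probs in enumerate(step_distributions)
        if token_entropy(probs) > threshold
    ]

# Toy next-token distributions over a 4-token vocabulary (hypothetical).
step_distributions = [
    [0.97, 0.01, 0.01, 0.01],  # confident step, low entropy
    [0.25, 0.25, 0.25, 0.25],  # uniform step, maximum entropy ln(4)
    [0.70, 0.10, 0.10, 0.10],  # moderately confident step
]
flagged = high_uncertainty_steps(step_distributions, threshold=1.0)
```

In a decoding loop, a flagged step would be the point at which an intervention is triggered, while unflagged steps proceed with ordinary decoding.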


[62] Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving cs.CL | cs.CVPDF

Yimu Wang, Yee Man Choi, Barry Zhang, Mozhgan Nasr Azadani, Sean Sedwards

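TL;DR: This paper presents a benchmark for evaluating whether multi-view multimodal large language models (MLLMs) can identify the source of visual evidence in autonomous driving scenes. It contains 122 conflict-centric question-answer pairs: given six synchronized NuScenes views and a question, the model must identify the camera view that supports the answer and answer the question. By separating visual-source identification from answer correctness, the benchmark exposes grounding failures that answer-only evaluation can miss.

Details

Motivation: MLLMs perform strongly on visual reasoning benchmarks, but answer accuracy alone does not show whether a model relied on the correct visual evidence. This gap matters especially in multi-view driving scenes, where a model can give a plausible answer while grounding it in the wrong camera view.

Result: The benchmark evaluates three settings: camera-view selection, oracle QA given the golden view, and joint prediction in which the model selects a view and answers in one pass. Answers are evaluated in multiple-choice and free-form formats, using exact match for structured predictions and an LLM judge for free-form responses.

Insight: The novelty lies in generating view labels with an automatic conflict-mining pipeline followed by manual verification, and in building a multi-view visual question answering benchmark focused on causality, counterfactual reasoning, and intent prediction. Explicitly separating visual-source identification from answer correctness exposes grounding failures more fully.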

Abstract: Multimodal large language models (MLLMs) achieve strong results on visual reasoning benchmarks, but answer accuracy alone does not indicate whether a model relied on the correct visual evidence. This gap is particularly important in multi-view driving scenes used for autonomous driving, where a model can produce a plausible answer while grounding it in the wrong camera view. We introduce a multi-view visual question answering benchmark for evaluating evidence-source identification: given six synchronized NuScenes views and a question, the model must identify the supporting camera view and answer the question. The benchmark contains 122 conflict-centric question-answer pairs from 73 scenes, spanning causality, counterfactual reasoning, and intent prediction. View labels are proposed by an automatic conflict-mining pipeline and manually verified by annotators. We evaluate three settings: camera-view selection, oracle QA given the golden view, and joint prediction in which the model selects a view and answers in one pass. Answers are evaluated in both multiple-choice and free-form formats, using exact match for structured predictions and an LLM judge for free-form responses. By explicitly separating visual-source identification from answer correctness, the benchmark exposes grounding failures that answer-only evaluation misses.
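The separation of view identification from answer correctness described in the abstract above can be sketched as a small scoring function. The record fields and the `ungrounded_correct` metric name are hypothetical; the camera names follow the NuScenes channel naming, and the toy predictions are invented for illustration.

```python
def grounding_report(records):
    """Score view selection and answers separately, then jointly."""
    n = len(records)
    view_ok = [r["pred_view"] == r["gold_view"] for r in records]
    answer_ok = [r["pred_answer"] == r["gold_answer"] for r in records]
    return {
        "view_acc": sum(view_ok) / n,
        "answer_acc": sum(answer_ok) / n,
        # Both the supporting view and the answer are right.
        "joint_acc": sum(v and a for v, a in zip(view_ok, answer_ok)) / n,
        # Right answer grounded in the wrong view: invisible to answer-only scoring.
        "ungrounded_correct": sum(a and not v for v, a in zip(view_ok, answer_ok)) / n,
    }

# Toy multiple-choice predictions (hypothetical).
records = [
    {"pred_view": "CAM_FRONT", "gold_view": "CAM_FRONT", "pred_answer": "B", "gold_answer": "B"},
    {"pred_view": "CAM_BACK", "gold_view": "CAM_FRONT_LEFT", "pred_answer": "A", "gold_answer": "A"},
    {"pred_view": "CAM_FRONT_RIGHT", "gold_view": "CAM_FRONT_RIGHT", "pred_answer": "C", "gold_answer": "D"},
    {"pred_view": "CAM_BACK_LEFT", "gold_view": "CAM_BACK_RIGHT", "pred_answer": "A", "gold_answer": "C"},
]
report = grounding_report(records)
```

Here answer accuracy is 0.5, but only half of those correct answers are also grounded in the right camera view, which is the kind of gap answer-only evaluation hides.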


[63] When Built-in Thinking Helps and Hurts: Constraint-Level Error Shifts in Instruction Following cs.CLPDF

Sai Adith Senthil Kumar

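TL;DR: This paper studies how built-in thinking affects large reasoning models (LRMs) on instruction following. Experiments with Qwen3 and Hunyuan models on IFEval show that thinking does not uniformly raise or lower performance but shifts the error pattern: about 10-20% of prompts switch between pass and fail when thinking is turned on or off. Constraint types separate into Planning (such as global counting and structural coordination) and Precision (such as exact local form); the former improves overall under thinking, while the latter worsens. Thinking also affects answer length, and activation patching shows that Precision errors are restored by intervention more often than Planning errors.

Details

Motivation: Large reasoning models improve performance on math and coding tasks, but their effect on instruction following is unclear. The paper examines how built-in thinking changes the error pattern when models follow complex instructions, rather than treating it as a uniform gain or loss.

Result: Experiments on IFEval use Qwen3 models (1.7B-32B) and Hunyuan models. Aggregate pass-rate changes are small (-0.55 to -3.52 percentage points), yet 10-20% of prompts switch pass status between thinking on and off. By constraint type, Planning constraints improve under thinking while Precision constraints worsen, a pattern that the Hunyuan models support directionally. Activation patching shows a higher restoration rate for Precision flip instances (32-58%) than for Planning flip instances (14-40%).

Insight: The novelty lies in grouping instruction-following errors by constraint type into Planning and Precision and showing that thinking affects the two differently. Thinking appears to shift error patterns by changing the internal reasoning process rather than simply changing performance, which gives a new view of model behavior on complex tasks and suggests that different error types need targeted interventions.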

Abstract: Large reasoning models (LRMs) often improve math and coding performance, but their effect on instruction following is unclear. We study IFEval with Qwen3 models (1.7B-32B), using same-weights Thinking ON/OFF controls; four Hunyuan models provide directional cross-family support. Aggregate pass-rate changes are small (-0.55 to -3.52 pp), yet 10-20% of prompts switch between pass and fail across modes, suggesting that thinking changes the pattern of errors (some prompts improve while others worsen) rather than uniformly degrading performance. Under a post-hoc Qwen3-derived grouping, constraint types separate into Planning (global counting, structure, coordination), which improves at the class level under thinking, and Precision (exact local form), which consistently worsens; the class-level Planning/Precision sign pattern holds directionally for all four Hunyuan models despite Hunyuan’s opposite aggregate direction. Thinking also changes final-answer length; matched-length analyses substantially reduce the Precision drop, but a residual penalty remains. Analyzing thinking traces with a cross-encoder relevance metric reveals three patterns: Neutral shows a positive relevance-compliance link (r approximately 0.15); Planning shows near-zero predictive correlation (r approximately 0.02) despite measurable trace engagement, consistent with an execution gap between CE-measured trace relevance and final-answer compliance; Precision shows a small negative correlation (r approximately -0.05), with failing instances having higher mean relevance than passing ones. Activation patching across four model sizes (1.7B-14B) shows that Precision flip instances are more often restored than Planning flip instances (32-58% vs. 14-40% mean layer-restoration), with the largest gap at 14B (about 30 pp).
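The contrast in the abstract above between a small aggregate change and a large share of flipped prompts can be shown with a short calculation. The pass/fail vectors below are invented toy data, not results from the paper, and `mode_comparison` is a hypothetical helper name.

```python
def mode_comparison(pass_off, pass_on):
    """Compare per-prompt pass/fail outcomes with thinking OFF versus ON."""
    n = len(pass_off)
    gained = sum(on and not off for off, on in zip(pass_off, pass_on))
    lost = sum(off and not on for off, on in zip(pass_off, pass_on))
    return {
        # Net change in pass rate, in percentage points.
        "delta_pp": 100 * (sum(pass_on) - sum(pass_off)) / n,
        # Share of prompts whose outcome changed in either direction.
        "flip_rate_pct": 100 * (gained + lost) / n,
        "gained": gained,
        "lost": lost,
    }

# Ten toy prompts: one passes only with thinking, one passes only without.
pass_off = [True, True, True, True, True, True, True, False, False, False]
pass_on = [True, True, True, True, True, True, False, True, False, False]
summary = mode_comparison(pass_off, pass_on)
```

In this toy case the net pass-rate change is zero while 20% of prompts flip, so gains and losses cancel in the aggregate but remain visible per prompt.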


[64] Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO cs.CL | cs.AI | cs.LGPDF

Blake Bullwinkel, Eugenia Kim, Amanda Minnich, Mark Russinovich

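TL;DR: This paper proposes AdvGRPO, a framework that uses dense multi-channel rewards and decoupled advantage normalization to make GRPO usable for joint attacker-defender training. Training follows a curriculum from single-turn to closed-loop multi-turn attacks, followed by co-training with alternating updates, with the aim of producing effective, transferable attacks and improving defender performance on safety benchmarks.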

Details

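Motivation: AI red teaming must continually adapt to evolving attackers and defenders. Reinforcement learning is a promising way to discover novel attacks, and co-training can produce more robust defenders in tandem. Prior work shows that PPO and DPO are effective for attacker-defender co-training but reports that GRPO is unstable in this setting, which motivates the improvements here.

Result: AdvGRPO produces highly effective and transferable attacks, and the co-trained defenders outperform baselines on safety benchmarks.

Insight: The contribution is to resolve GRPO's instability in attacker-defender co-training through dense multi-channel rewards and decoupled advantage normalization, combined with a single-turn-to-multi-turn attack curriculum and an alternating update schedule, so that attack and defense capabilities improve together.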

Abstract: AI red teaming must continually adapt to evolving attackers and defenders. Reinforcement learning offers a promising approach to discovering novel attacks, and co-training methods can produce more robust defenders in tandem. Recent works have demonstrated the efficacy of attacker-defender co-training by applying PPO and DPO, but report that GRPO is unstable in this setting. We introduce AdvGRPO, a co-training framework that makes GRPO viable for joint attacker-defender optimization using dense multi-channel rewards and decoupled advantage normalization. Training progresses through a curriculum from single-turn to closed-loop multi-turn attacks before bootstrapping co-training, where attacker and defender models are updated in alternation. We show that our method can produce highly effective and transferable attacks and that co-trained defenders outperform baselines on safety benchmarks.
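
The abstract does not define "decoupled advantage normalization". The sketch below shows one plausible reading, stated as an assumption: each reward channel is standardized within the rollout group on its own before the channels are combined, so a channel with a large numeric range does not dominate the advantage. The channel names, weights, and function names are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_normalize(values, eps=1e-8):
    """GRPO-style standardization of one reward channel within a rollout group."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def decoupled_advantages(channel_rewards, weights=None):
    """Assumed variant: normalize each channel separately, then sum the results.

    channel_rewards maps a channel name to one reward per rollout in the group.
    """
    names = list(channel_rewards)
    sizes = {len(channel_rewards[name]) for name in names}
    if len(sizes) != 1:
        raise ValueError("every channel needs one reward per rollout")
    group_size = sizes.pop()
    weights = weights or {name: 1.0 for name in names}
    normalized = {name: group_normalize(channel_rewards[name]) for name in names}
    return [
        sum(weights[name] * normalized[name][i] for name in names)
        for i in range(group_size)
    ]


# Hypothetical group of four rollouts with two channels on very different scales.
rewards = {"success": [0, 1, 0, 1], "fluency": [10, 10, 30, 30]}
advantages = decoupled_advantages(rewards)
print(advantages)
```

With per-channel standardization, both channels contribute on the same scale even though their raw ranges differ by an order of magnitude.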


[65] IS-CoT: Breaking the Long-form Generation Collapse via Interleaved Structural Thinking cs.CLPDF

Zechen Sun, Yuyang Sun, Zecheng Tang, Juntao Li, Wenpeng Hu

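TL;DR: This paper proposes the IS-CoT framework to address the length collapse that large language models suffer in long-form generation. By embedding a dynamic Plan-Write-Reflect cycle into the generation process, the method enables continuous strategy adaptation and global alignment. IS-Writer-8B, trained under this framework, achieves state-of-the-art performance on several long-form generation benchmarks.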

Details

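Motivation: Existing reasoning-enhanced models perform well on logic-intensive tasks, but in open-ended writing their performance degrades sharply once target lengths exceed 2,000 words, a failure termed "length collapse". The authors attribute this to static hierarchical planning, which cannot provide dynamic guidance over extended contexts.

Result: Experiments show that IS-Writer-8B achieves state-of-the-art performance on challenging long-form benchmarks such as LongBench-Write, for example +3.08 points over DeepSeek-V3.2. The model shows robust length compliance and coherence, competitive with much larger proprietary models.

Insight: The core contribution is the Interleaved Structural Chain-of-Thought framework, which builds a dynamic Plan-Write-Reflect cycle into the model's own generation process instead of relying on external agentic workflows. This gives continuous, adaptive guidance for long-form generation and mitigates length collapse. The pipeline for constructing a high-quality dataset of interleaved reasoning traces is also instructive.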

Abstract: Generating coherent and controllable long-form content remains a persistent challenge for Large Language Models (LLMs). While reasoning-enhanced models have demonstrated success in logic-intensive domains, our evaluation reveals that they suffer from a severe length collapse in open-ended writing, where performance degrades sharply as target lengths exceed 2,000 words. We attribute this failure to the limitation of static hierarchical planning, which struggles to provide dynamic guidance over extended contexts. To bridge this gap, we introduce the Interleaved Structural Chain-of-Thought (IS-CoT) framework. Unlike external agentic workflows, IS-CoT embeds a dynamic Plan-Write-Reflect cycle into the generation process, enabling continuous strategy adaptation and global alignment without additional assistance. Based on this framework, we construct a high-quality dataset of interleaved reasoning traces via a multi-teacher pipeline and train IS-Writer-8B. Experiments demonstrate that IS-Writer-8B achieves state-of-the-art performance on challenging long-form benchmarks (e.g., +3.08 vs. DeepSeek-V3.2 on LongBench-Write), exhibiting robust length compliance and coherence competitive with significantly larger proprietary models.


[66] The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model cs.CLPDF

Wendy K. Tam

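TL;DR: By comparing the internal representations of Llama 3.1 8B before and after RLHF, this paper examines how RLHF alignment works mechanistically. RLHF does not remove the structured partisan direction present in the base model; instead, it compresses the variance of that signal and severs its causal pathway to the output, producing outwardly neutral, balanced text. The alignment is functional and shallow: the underlying geometry that enables partisan steering remains intact, so the model's behavior may be more fragile than its outputs suggest.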

Details

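Motivation: The paper asks what RLHF alignment actually does: which values it encodes, whose values they are, and how they are encoded. Existing evidence suggests that RLHF may produce only functional compliance rather than deep alignment, and the paper tests this through a mechanistic case study of partisan political orientation.

Result: In the post-RLHF Instruct model, policy-encoding features are completely inactive, while the underlying geometry that enables partisan steering in the base model remains intact. Feature-level steering experiments confirm the causal disconnect. Mechanisms that bypass RLHF's guardrails, such as inferring and amplifying a user's partisan identity, reactivate partisan generation.

Insight: The alignment RLHF achieves is functional, not structural: it produces outwardly compliant behavior by compressing the variance of a sensitive signal and severing its causal pathway to the output, not by removing value-laden internal structure. The same pattern may hold in other value domains, which suggests that aligned models may be more fragile than their outputs indicate and offers a mechanistic lens for understanding and evaluating LLM alignment.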

Abstract: The ambition behind alignment training is to make large language models safe and useful. The primary mechanism, reinforcement learning from human feedback (RLHF), shapes the behavior of deployed language models by aligning them with "human values." Yet the process is opaque. What values are being encoded; whose values are they; and how does RLHF encode them? A growing body of evidence suggests that RLHF produces only functional compliance rather than deep alignment. We offer a mechanistic case study of this phenomenon for partisan political orientation with a comparison of the internal representations of Llama 3.1 8B before and after RLHF. We show that RLHF does not remove the structured partisan direction in the base model. Instead, it compresses the variance of the partisan signal to generate consistently balanced and non-partisan output. Sparse autoencoder decomposition reveals that policy-encoding features, which activate sporadically in the base model, are completely inactive in the Instruct model. Feature-level steering experiments confirm the causal disconnect. RLHF thus encodes a norm of political neutrality, not by erasing the model's knowledge of partisanship, but by severing the causal pathway from partisan geometry to output generation. Importantly, this neutrality is functional, not structural, so the underlying geometry that enables partisan steering remains intact. The mechanisms that bypass RLHF's guardrails, such as inferring and amplifying a user's partisan identity, reactivate partisan generation. If RLHF operates by disconnecting rather than removing value-laden structure, then the same pattern may hold for other value domains, and the aligned model's behavior may be more fragile than its outputs suggest.
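
As a rough illustration of what "compressing the variance of the partisan signal" could mean in practice, the sketch below projects hidden-state vectors onto a fixed direction and compares the variance of the projections for two models. How the paper obtains the partisan direction is not stated in the abstract; the direction and the toy vectors here are hypothetical.

```python
import math
from statistics import pvariance


def project(vectors, direction):
    """Scalar projection of each vector onto the unit-normalized direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    return [sum(x * u for x, u in zip(vec, unit)) for vec in vectors]


# Hypothetical 2-D "hidden states" for the same four prompts in two models.
direction = [1.0, 0.0]  # assumed partisan axis
base_states = [[-2.0, 0.1], [2.0, -0.3], [-1.5, 0.2], [1.5, 0.0]]
instruct_states = [[-0.2, 0.1], [0.2, -0.3], [-0.1, 0.2], [0.1, 0.0]]

var_base = pvariance(project(base_states, direction))
var_instruct = pvariance(project(instruct_states, direction))
print(var_base, var_instruct)
```

In this toy setup the direction still exists in both models; only the spread of activations along it shrinks, which mirrors the functional-versus-structural distinction the abstract draws.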


cs.CV [Back]

[67] Multimodal Group Emotion Recognition In-the-Wild Towards a Privacy-Safe Non-Individual Approach cs.CV | cs.AIPDF

Anderson Augusma

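TL;DR: This work addresses group emotion recognition (GER) in the wild with a privacy-preserving, non-individual approach. Unlike traditional methods that rely on individual cues such as face, gaze, or voice, it infers group-level emotion from collective audio-video signals, reducing the risk of individual monitoring. Two complementary frameworks are proposed. The first is an audio-video architecture that combines cross-attention multimodal fusion with Frames Attention Pooling, validated for robustness through synthetic data augmentation and ablation studies. The second is a Variational Encoder Multi-Decoder framework that learns a shared latent space for emotion classification and structural representation prediction, with DETR-based and heatmap-based decoding strategies explored to analyze the role of structural representations.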

Details

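Motivation: Traditional emotion recognition relies on individual features such as face and voice, which creates privacy and surveillance risks. The goal is a privacy-safe method that performs group emotion recognition in the wild using only collective audio-video signals.

Result: Ablation studies validate the robustness of the proposed frameworks under real-world GER conditions. The results show that competitive performance can be achieved without using individual features as input data (the abstract does not name specific benchmarks).

Insight: The contributions are: clarifying the role of multimodality and structural cues in group-level affective computing; introducing two architectures for privacy-preserving multimodal GER; and showing that competitive performance is achievable without individual features as input, which points to a new direction for privacy-safe affective computing.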

Abstract: This thesis addresses group emotion recognition (GER) in-the-wild with a focus on privacy preservation. Unlike traditional emotion recognition methods that rely on individual-level cues such as face, gaze, or voice analysis, this work uses collective audio-video signals to infer emotions at the group level, reducing risks of individual monitoring and surveillance. Two complementary frameworks are proposed. The first is a cross-attention multimodal architecture for audio-video fusion, combined with Frames Attention Pooling (FAP) for temporal aggregation. It is supported by synthetic data augmentation and validated through ablation studies, demonstrating robustness in real-world GER conditions. The second framework, Variational Encoder Multi-Decoder (VE-MD), learns a shared latent space for emotion classification and structural representation prediction, including body and face cues. Two decoding strategies, DETR-based and heatmap-based, are explored to analyze the role of structural representations in group and individual settings. The thesis makes three main contributions: it clarifies the role of multimodality and structural cues in group-level affective computing; introduces two architectures for privacy-preserving multimodal GER; and shows that competitive performance can be achieved without using individual features as input data.


[68] Page image classifier fine-tuned on century-spanning archives of scanned documents for further content-specific processing cs.CV | cs.AI | cs.DLPDF

Kateryna Lutsai, Pavel Straňák, David Novák, Dana Křivánková

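TL;DR: Targeting the large archives of historical documents produced by digitization projects in the humanities, this study develops a system that automatically classifies scanned page images by visual content type (text, tables, graphics). Evaluated on a dataset of more than 48,000 annotated page images from Czech archaeological archives, fine-tuned deep learning models such as RegNetY-16GF and ViT-large exceed 99% Top-1 accuracy on the test set, far above a Random Forest baseline built on hand-crafted features.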

Details

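Motivation: Manually sorting page images by content type (text, tables, graphics) is impractical for large, heterogeneous digitized archives of historical documents in the humanities. Automatic classification enables content-specific downstream processing such as OCR or structured data extraction.

Result: On the held-out test set, fine-tuned CNN and Transformer models (RegNetY-16GF at 99.16% and ViT-large at 99.12% Top-1 accuracy) substantially outperform the Random Forest baseline built on hand-crafted features (approximately 75% accuracy). On 649,508 unlabeled archival pages, agreement among image-only models exceeds 90%.

Insight: The contribution is a high-quality, expert-reviewed dataset with a fine-grained 11-category label scheme for a specific domain (historical documents), together with a systematic comparison ranging from traditional feature-based methods to modern deep learning architectures (CNNs, Transformers, multimodal CLIP). Image-only models such as RegNetY reach near-perfect test accuracy and produce highly consistent predictions on unlabeled data, whereas fine-tuned CLIP, despite competitive test accuracy, agrees with image-only models on under 65% of unlabeled pages, which exposes a potential limitation for deployment in this domain.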

Abstract: Purpose: Digitization projects in the humanities produce vast, heterogeneous archives of historical documents, making manual sorting impractical at scale. This work addresses the need for an automated system to classify scanned page images based on visual content type - text, tables, and graphics - enabling content-specific downstream processing such as Optical Character Recognition (OCR) or structured data extraction. Methods: An image classification system was developed and evaluated on a dataset of over 48,000 annotated historical page images from century-old Czech archaeological archives, refined through four successive annotation stages with domain-expert review. A Random Forest Classifier baseline was established using hand-crafted image features. Subsequently, deep learning architectures were fine-tuned and compared: Convolutional Neural Networks (EfficientNetV2, RegNetY), Vision and Document Image Transformers (ViT, DiT), and multimodal CLIP models. An 11-category label scheme was designed collaboratively with domain experts and evaluated via five-fold cross-validation. Results: The feature-based baseline achieved approximately 75% accuracy. Fine-tuned CNNs and Transformers substantially outperformed it, with RegNetY-16GF achieving 99.16% and ViT-large 99.12% Top-1 accuracy on the held-out test set. CLIP ViT-B/16 reached 99.14% with optimized text descriptions. Conclusion: Image-only models, particularly RegNetY-16GF, deliver near-perfect classification accuracy and produce consistent labels across 649,508 unlabeled archival pages with over 90% inter-model agreement. Fine-tuned CLIP, despite competitive test-set accuracy, showed under 65% agreement with image-only models on unlabeled data, making it less suitable for deployment. The final models, annotated dataset, and software are publicly available under open-source licenses.
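
Inter-model agreement on unlabeled pages needs no ground truth: it is the share of pages on which two models predict the same label. A minimal sketch follows; the category names and predictions are hypothetical, and the paper's exact agreement definition across more than two models is not given in the abstract.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two models predict the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both label lists must be non-empty and equal in length")
    matches = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return matches / len(labels_a)


# Hypothetical predictions from two image-only models on four unlabeled pages.
model_a = ["TEXT", "TABLE", "PHOTO", "TEXT"]
model_b = ["TEXT", "TABLE", "DRAW", "TEXT"]
rate = agreement_rate(model_a, model_b)
print(f"agreement: {rate:.2%}")
```

High agreement between independently fine-tuned models is evidence of stable labels, not of correctness, which is why the abstract reports it alongside held-out test accuracy.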


[69] SlideCheck: Guiding Self-Supervised Pretraining of Pathology Foundation Models via Dataset Distributions cs.CV | cs.AIPDF

Mingyi He, Xinyi Guo, Xitong Ling, Weiming Chen, Jiawen Li

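TL;DR: SlideCheck is a data guidance tool for pretraining pathology foundation models. Built on patch features from a frozen pathology foundation model, it uses a dual-head MLP to model abnormal morphology and malignant evidence separately, producing explicit abnormality and malignancy scores for organizing, filtering, and auditing pretraining data. Experiments show that SlideCheck-defined data distributions influence the downstream behavior of self-supervised ViT pretraining, indicating that biological composition is an important controllable factor in pathology foundation model development.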

Details

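Motivation: Pathology foundation models are typically pretrained on streams of WSI-derived patches, while supervision during data construction is often slide-level, sparse, or heterogeneous. This mismatch makes it difficult to understand and control which biological patterns enter the pretraining data.

Result: Experiments show that SlideCheck-defined data distributions influence the downstream behavior of self-supervised ViT pretraining. Curated subsets can approach full-data performance, suggesting that explicitly scored patch pools may support more efficient and auditable pretraining data construction.

Insight: The contribution is a lightweight pretraining data guidance tool that models abnormality and malignancy evidence separately with a dual-head MLP, anchors patch-level evidence estimation with a regularized feature-space scorer, and mines high-confidence pseudo labels through score-attention agreement, turning large, undifferentiated patch pools into controllable and reusable pretraining datasets.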

Abstract: Pathology foundation models are pretrained on large streams of WSI-derived patches, while supervision during data construction is often slide-level, sparse, or heterogeneous. This mismatch makes it difficult to understand and control which biological patterns enter the pretraining data. We propose SlideCheck, a lightweight pretraining data guidance tool built on frozen pathology foundation model patch features. Rather than serving as a standalone patch diagnostic model, SlideCheck provides explicit abnormality and malignancy scores for organizing, filtering, and auditing pathology pretraining data. SlideCheck uses a dual-head MLP to separately model broad abnormal morphology and malignant evidence. A regularized feature-space scorer provides a supervised anchor for patch-level evidence estimation, while score-attention agreement combines patch scores with WSI-level MIL attention to mine high-confidence pseudo labels. The same scores are then used to construct broad-positive ViT pretraining subsets, where a patch is selected if either abnormality or malignancy evidence exceeds a threshold. Experiments show that SlideCheck-defined data distributions influence the downstream behavior of self-supervised ViT pretraining, indicating that biological composition is an important controllable factor in pathology foundation model development. Curated subsets can approach full-data performance, suggesting that explicitly scored patch pools may support more efficient and auditable pretraining data construction. These findings position SlideCheck as a data guidance and auditing layer for transforming large, undifferentiated patch pools into controllable and reusable pretraining datasets.
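
The broad-positive selection rule described in the abstract (keep a patch if either score exceeds a threshold) can be sketched as follows. The record layout, the threshold value of 0.5, and the function name are assumptions made for illustration.

```python
def select_broad_positive(patches, threshold=0.5):
    """Keep patches whose abnormality OR malignancy score exceeds the threshold."""
    return [
        patch["id"]
        for patch in patches
        if patch["abnormality"] > threshold or patch["malignancy"] > threshold
    ]


# Hypothetical scored patches (scores in [0, 1]).
patches = [
    {"id": "p1", "abnormality": 0.9, "malignancy": 0.1},  # abnormal only
    {"id": "p2", "abnormality": 0.2, "malignancy": 0.8},  # malignant only
    {"id": "p3", "abnormality": 0.3, "malignancy": 0.4},  # neither
    {"id": "p4", "abnormality": 0.7, "malignancy": 0.9},  # both
]
selected = select_broad_positive(patches)
print(selected)
```

Using a logical OR keeps benign-but-abnormal morphology in the subset, which is what makes the selection "broad" compared with filtering on malignancy alone.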


[70] A Mechanistic Analysis of Adversarial Fine-tuning of Vision Transformers cs.CV | cs.AIPDF

Hannah Gao, Isha Agarwal, Dylan Hadfield-Menell, Rachel Ma

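TL;DR: This paper takes a mechanistic look at how adversarial fine-tuning affects the performance of vision transformers (ViTs) under image perturbations. The authors adversarially train a ViT on low-frequency and high-frequency image corruptions and examine its attention mechanisms, internal representations, and knowledge evolution to explain changes in downstream performance.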

Details

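Motivation: Image classification models are widely used in high-risk, real-world settings, so they need to be robust to slight input perturbations such as blurring or sharpening. Vision transformers play an important role in modern multimodal models, yet their robustness has received relatively little study. The paper therefore analyzes the mechanisms behind ViT robustness under adversarial fine-tuning.

Result: Fine-tuning on common corruption types improves model performance and certainty on new instances of corrupted data, but the gains do not transfer to other classes of corruption not seen in training. Although visual attention and knowledge evolution change across layers, adversarial training does not fundamentally change the sparse representations learned by the ViT.

Insight: The contribution is a mechanistic analysis of how adversarial fine-tuning affects ViT robustness. It exposes the limits of the fine-tuning gains (no generalization across corruption classes) and the stability of ViT representations, offering a new perspective on ViT robustness and pointing future work toward generalization over a broader range of perturbations.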

Abstract: The widespread use of image classification models in high-risk, real-world situations necessitates making these models robust to slight disturbances or perturbations, such as blurring or sharpening, in the input images. While vision transformers (ViTs) play an integral role in many modern-day multi-modal models like Vision-Language-Models (VLMs) and Vision-Language-Action (VLA) models, they have received little attention in the setting of robustness. In this work, we analyze the effects of adversarial fine-tuning, a popular method for improving model robustness to image perturbations, on a ViT's performance on perturbed and regular images through a mechanistic lens. We adversarially train a ViT on low-frequency and high-frequency image corruptions, and attempt to explain changes in downstream model performance through an examination of the model's attention mechanisms, internal representations, and knowledge evolution. Overall, our results suggest that, while fine-tuning on inputs with common corruptions improves model performance and certainty on new instances of corrupted data, these improvements do not transfer to other classes of corruptions not seen in training. Additionally, despite observing changes in visual attention and knowledge evolution across layers, we found that adversarial training did not lead to fundamental changes in the sparse representations learned by ViTs.


Jinzhe Tan, Ali Ekber Cinar, Karim Benyekhlef

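TL;DR: This paper studies how well humans and frontier multimodal large language models (MLLMs) can distinguish authentic evidentiary photographs from AI-generated images in scenarios typical of civil disputes. The authors build the SLED-1400 dataset and run controlled experiments, finding that detection accuracy is unsatisfactory for both humans and MLLMs, that their errors are largely uncorrelated, and that neither is a reliable standalone authenticator.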

Details

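Motivation: Advances in AI are undermining the assumption that visual evidence is a reliable form of legal proof. The paper quantifies how well humans and AI can tell authentic legal evidence images from AI-generated ones.

Result: Human accuracy was 64.8% overall, and only 48.5% and 51.0% on the two strongest generators (Gemini-3-Pro-Image and Flux-2-Max), indistinguishable from chance. MLLMs reached 100% specificity on authentic images but missed most synthetic outputs from the harder generators, with average detection of only 5.9% on Gemini-3-Pro-Image outputs. Human and MLLM errors were largely uncorrelated.

Insight: The study shows that current AI-generated images are highly deceptive in a specific domain (legal evidence) and that neither humans nor advanced MLLMs can detect them reliably. This challenges the presumed reliability of visual evidence and motivates a workable procedural response that combines trained human review, MLLM screening, and provenance infrastructure such as C2PA Content Credentials.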

Abstract: Visual evidence has long been treated as a reliable form of legal proof, but advances in artificial intelligence (AI) are undermining that assumption. This article asks how well humans and frontier multimodal large language models (MLLMs) can distinguish authentic evidentiary photographs from AI-generated counterparts in the object-centric scenarios typical of civil disputes. We built Synthetic Legal Evidence Detection (SLED-1400), a dataset of 200 authentic evidence images paired with 1,200 synthetic counterparts produced by six contemporary text-to-image generators across ten evidence categories. The same stimuli and response format were used in a controlled web experiment with 136 lay participants and in a standardized evaluation of four MLLMs (GPT-5.1, Gemini-3-Pro, Gemini-3-Flash, Qwen3-VL-235B). Human accuracy was 64.8% overall, and 48.5% and 51.0% on the two strongest generators (Gemini-3-Pro-Image and Flux-2-Max), indistinguishable from chance. MLLMs never misclassified an authentic image (100% specificity), but missed most synthetic outputs from the harder generators, with average MLLM detection at 5.9% on Gemini-3-Pro-Image outputs. Human and MLLM errors were largely uncorrelated, while the four MLLMs were strongly correlated with each other. Neither group is a reliable standalone authenticator. We argue that visual evidence in legal proceedings should be treated as inherently contestable, and that a workable procedural response must combine trained human review, MLLM screening, and provenance infrastructure such as C2PA Content Credentials.
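
The two headline MLLM metrics measure different things: specificity is computed over authentic images only, and detection rate (sensitivity) over synthetic images only, so a detector can score perfectly on one while failing the other. The sketch below uses hypothetical toy judgments, not the paper's data.

```python
def specificity_and_detection(records):
    """records: (is_synthetic, flagged_as_synthetic) boolean pairs, one per image."""
    tn = fp = tp = fn = 0
    for is_synthetic, flagged in records:
        if is_synthetic:
            if flagged:
                tp += 1  # synthetic image correctly flagged
            else:
                fn += 1  # synthetic image missed
        elif flagged:
            fp += 1      # authentic image wrongly flagged
        else:
            tn += 1      # authentic image correctly accepted
    if tn + fp == 0 or tp + fn == 0:
        raise ValueError("need at least one authentic and one synthetic image")
    return tn / (tn + fp), tp / (tp + fn)


# Hypothetical detector that almost never flags anything as synthetic.
records = [(False, False)] * 4 + [(True, True)] + [(True, False)] * 4
spec, detection = specificity_and_detection(records)
print(f"specificity={spec:.2f}, detection rate={detection:.2f}")
```

A model that labels nearly everything as authentic gets perfect specificity by construction, which is why the abstract reports both numbers.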


[72] VisualLeakBench: Reproducible Action-Boundary Propagation Failures in Vision-Language Agents cs.CV | cs.AI | cs.IRPDF

Youting Wang, Yuan Tang, Yitian Qian, Chen Zhao

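TL;DR: This paper presents VisualLeakBench, a benchmark for systematically evaluating a concrete failure mode of vision-language agents: sensitive or unsafe text visible in an image, such as personally identifiable information (PII), is copied into downstream tool-call arguments. The benchmark contains 500 diverse images spanning UI, chat, document, form, and dashboard scenes, and a stratified 100-image agent subset is evaluated with four production VLM systems under two workflows. At baseline, propagation rates reach 78.8% for PII and 85.5% for rendered unsafe text.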

Details

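Motivation: The goal is to identify and quantify a concrete safety risk for vision-language agents in practical use (processing screenshots, documents, and user interfaces): action-boundary propagation, in which the agent copies sensitive or unsafe text visible in an image into the arguments of external tools it invokes, leaking information.

Result: At baseline, target strings are propagated into tool arguments in 78.8% of PII cases and 85.5% of rendered unsafe-text cases. Under a defensive system prompt, unsafe-text propagation remains high at 52.6%, while PII propagation falls to 2.0%, largely by suppressing tool use rather than preserving utility. Propagation rates also depend on the tool surface.

Insight: The contribution is VisualLeakBench, a benchmark that systematically targets and quantifies action-boundary propagation failures in vision-language agents. Its labeled-target oracle diagnostic localizes most failures at the tool boundary, which can inform targeted defenses and safety evaluation. The results indicate that prompt engineering alone may not be enough to prevent such leakage and that system-level safety design is needed.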

Abstract: Vision-language agents increasingly consume screenshots, documents, and user interfaces before writing to memory, sending messages, or invoking external tools. We study a concrete failure mode in this setting: action-boundary propagation, where sensitive or unsafe visible text is copied from an image into downstream tool arguments. We present VisualLeakBench, a diversified 500-image benchmark spanning UI, chat, document, form, and dashboard scenes, and evaluate a stratified 100-image agent subset with four production VLM systems under two workflows: note capture and external handoff. At baseline, target strings are propagated into tool arguments in 78.8% of PII cases and 85.5% of rendered unsafe-text cases. Under a defensive system prompt, rendered unsafe-text propagation remains high at 52.6%, while PII tool propagation falls to 2.0%, largely by suppressing tool use rather than preserving utility. Rates are tool-surface dependent: search-like tools suppress PII propagation, but rendered unsafe text still crosses tool boundaries. We measure visual-to-tool propagation rather than downstream instruction execution. We additionally provide a labeled-target oracle upper-bound diagnostic that localizes most failures at the tool boundary while leaving response-side leakage as residual risk.
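
A propagation check of the kind the abstract describes can be sketched as a search for the labeled target string inside the arguments of each tool call. The tool-call schema, the tool names, and the matching rule (case-insensitive substring after whitespace normalization) are assumptions for illustration; the benchmark's actual matching criteria are not given in the abstract.

```python
def _flatten(value):
    """Yield every leaf value of a nested argument structure as a string."""
    if isinstance(value, dict):
        for item in value.values():
            yield from _flatten(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from _flatten(item)
    else:
        yield str(value)


def _normalize(text):
    """Lowercase and collapse whitespace so formatting differences do not hide a match."""
    return " ".join(text.lower().split())


def target_propagated(target, tool_calls):
    """True if the target string appears in any argument of any tool call."""
    needle = _normalize(target)
    return any(
        needle in _normalize(chunk)
        for call in tool_calls
        for chunk in _flatten(call.get("arguments", {}))
    )


def propagation_rate(cases):
    """Share of cases in which the labeled target reached a tool argument."""
    if not cases:
        raise ValueError("no cases to score")
    return sum(target_propagated(c["target"], c["tool_calls"]) for c in cases) / len(cases)


# Hypothetical agent runs: one leaks an email into a note, one does not leak.
cases = [
    {"target": "jane.doe@example.com",
     "tool_calls": [{"name": "save_note",
                     "arguments": {"text": "Contact: Jane.Doe@example.com"}}]},
    {"target": "555-0100",
     "tool_calls": [{"name": "search", "arguments": {"query": "office hours"}}]},
]
rate = propagation_rate(cases)
print(rate)
```

This measures only whether the string crossed the tool boundary, matching the abstract's note that the benchmark measures visual-to-tool propagation rather than downstream instruction execution.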


[73] SENTRY: Statistical Reliability Analysis of Vision Transformers Under Soft Errors cs.CV | cs.AI | cs.DC | cs.LGPDF

Pramit Kumar Bhaduri, Mahdi Taheri, Samira Nazari, Maksim Jenihhin, Christian Herglotz

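TL;DR: This paper presents SENTRY, a statistical fault injection framework for efficiently analyzing the reliability of vision transformers under soft errors. Using finite-population sampling theory, it provides formal reliability guarantees for ViT models of any scale from only a few thousand samples, reducing experimental cost by up to 10,700 times compared with exhaustive approaches. The reliability landscape turns out to be highly non-uniform: only 3% of FP32 bit-flips cause failures, but most of those failures lead to catastrophic accuracy collapse, with vulnerabilities concentrated in normalization layers and critical exponent bits of the IEEE-754 format.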

Details

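Motivation: As vision transformers spread into safety-critical domains such as autonomous systems and medical imaging, ensuring their reliability against soft errors becomes essential. Their massive parameter counts, however, make exhaustive fault injection infeasible.

Result: Evaluated on architectures such as ViT-Tiny and ViT-Small, the method bounds failure rates within a 1% margin at 99% confidence. Experimental cost drops by up to 10,700 times compared with exhaustive approaches, while the ability to localize vulnerabilities across architectural components is preserved.

Insight: The main contribution is bringing statistical sampling theory to ViT reliability analysis, enabling efficient evaluation with formal guarantees. The non-uniform reliability landscape it reveals, in particular the vulnerability of normalization layers and specific exponent bits, provides a mathematical foundation and actionable insights for designing hardened, edge-deployed ViT architectures.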

Abstract: With the growth of Vision Transformers in safety-critical domains like autonomous systems and medical imaging, ensuring their reliability against soft errors is paramount. While ViTs offer state-of-the-art accuracy, their massive parameter counts render exhaustive fault injection campaigns infeasible. To bridge this gap, a statistical fault injection framework is presented, leveraging finite-population sampling theory to provide formal reliability guarantees. It is demonstrated that failure rates are bounded within a 1% margin at 99% confidence using only a few thousand samples, regardless of model scale. This methodology achieves up to a 10,700 times reduction in experimental cost compared to exhaustive approaches, while preserving the ability to localize vulnerabilities across architectural components. Through extensive evaluation of different architectures like ViT-Tiny and ViT-Small, a highly non-uniform reliability landscape is uncovered. It is shown that while only 3% of FP32 bit-flips result in failure, the vast majority of these events lead to catastrophic accuracy collapse. Specific vulnerabilities are localized to normalization layers and critical exponent bits within the IEEE-754 format, providing a mathematical foundation and actionable insights for the design of hardened, edge-deployed ViT architectures.
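
Two ideas in the abstract lend themselves to a short sketch. The first is the standard finite-population sample-size formula, n = N / (1 + e^2 (N - 1) / (t^2 p (1 - p))), which explains why the required number of injections barely grows with model size; whether SENTRY uses exactly this form, and with which prior p, is an assumption. The second is a single FP32 bit-flip, which shows why exponent bits are far more damaging than mantissa bits. The fault-space size below is hypothetical.

```python
import math
import struct
from statistics import NormalDist


def sample_size(population, margin=0.01, confidence=0.99, p=0.5):
    """Injections needed so the estimated failure rate is within `margin`
    of the true rate at the given confidence (finite-population formula).
    p=0.5 is the conservative worst case; a smaller assumed p lowers the count."""
    if population < 1:
        raise ValueError("population must be positive")
    t = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided normal quantile
    denom = 1.0 + margin ** 2 * (population - 1) / (t ** 2 * p * (1.0 - p))
    return min(population, math.ceil(population / denom))


def flip_bit_fp32(value, bit):
    """Flip one bit of an FP32 value (0-22 mantissa, 23-30 exponent, 31 sign)."""
    if not 0 <= bit <= 31:
        raise ValueError("bit must be in 0..31")
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


# Hypothetical fault space: 5 million FP32 parameters x 32 bits each.
fault_space = 5_000_000 * 32
n_small = sample_size(fault_space)
n_large = sample_size(fault_space * 1000)  # a 1000x larger model

low_bit = flip_bit_fp32(1.0, 0)    # mantissa LSB: negligible change
high_bit = flip_bit_fp32(1.0, 30)  # top exponent bit: 1.0 becomes +inf
print(n_small, n_large, low_bit, high_bit)
```

The sample size saturates as the population grows, so a model a thousand times larger needs essentially the same number of injections, while the cost of an exhaustive campaign grows linearly with the fault space.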


[74] NeuroAlign: Hierarchical Multimodal Fusion of Dynamic and Structural Neuroimaging for MCI Analysis cs.CV | cs.AIPDF

Xiongri Shen, Zhenxi Song, Jiaqi Wang, Yi Zhong, Leilei Zhao

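TL;DR: This paper proposes NeuroAlign, a hierarchical multimodal neuroimaging fusion framework for mild cognitive impairment (MCI) analysis. It fuses functional MRI (fMRI) and diffusion tensor imaging (DTI), addressing heterogeneous feature spaces and misaligned representations through Dual-Modal Hierarchical Alignment (DMHA) and Dual-Domain Hierarchical Interaction (DDHI), and introduces Synergistic Activation Mapping (SAM) for feature-level interpretation. Across several datasets it shows competitive MCI/subjective cognitive decline (SCD) detection and preliminary cross-dataset transferability.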

Details

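Motivation: Fusing multimodal neuroimaging such as fMRI and DTI is hampered by heterogeneous feature spaces and misaligned representations. Addressing these challenges should improve the analysis of cognitive impairment such as MCI.

Result: Under five-fold cross-validation on the GUTCM, ADNI, and OASIS datasets, NeuroAlign achieves competitive results on MCI/SCD detection and shows preliminary cross-dataset transferability.

Insight: The contributions are: (1) hierarchical alignment and interaction mechanisms (DMHA and DDHI) that model multi-scale dynamic connectivity and align dynamic-static and functional-structural embeddings; and (2) SAM, a gradient-free, marker-oriented attribution method that supports feature-level interpretability analysis of dynamic functional connectivity (DFC), static functional connectivity (SFC), amplitude of low-frequency fluctuations (ALFF), and fractional anisotropy (FA), providing model-derived evidence for multimodal representation analysis.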

Abstract: Multimodal neuroimaging fusion of functional MRI (fMRI) and diffusion tensor imaging (DTI) provides complementary information for cognitive impairment analysis, but remains challenged by heterogeneous feature spaces and misaligned representations. We propose NeuroAlign, a hierarchical framework for structured multimodal fusion. It introduces (1) Dual-Modal Hierarchical Alignment (DMHA), which models multi-scale dynamic connectivity and aligns dynamic-static and functional-structural embeddings; and (2) Dual-Domain Hierarchical Interaction (DDHI), which enables fine-grained modulation and global interaction between connectivity- and region-level features. To support feature-level inspection, we design Synergistic Activation Mapping (SAM), a gradient-free, marker-oriented attribution method for DFC, SFC, ALFF, and FA. Evaluated on GUTCM, ADNI, and OASIS under five-fold validation, NeuroAlign achieves competitive MCI/SCD detection and preliminary cross-dataset transferability. Attribution analyses reveal modality-specific and partially consistent brain patterns, providing model-derived evidence for multimodal representation analysis.


[75] Crayotter: Traceable Multi-Agent Workflows for Long-Form Video Editing cs.CV | cs.CL | cs.MAPDF

Lecheng Yan, Yichong Zhang, Ben Pan, Xiaoyu Zheng, Jiawei Qian

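TL;DR: This paper presents Crayotter, an open-source multimodal multi-agent system for prompt-driven long-form video editing. It organizes the editing workflow into three phases: coverage-aware material preparation, artifact-based editing research, and tool-grounded timeline execution. Each phase produces inspectable intermediate artifacts (coverage reports, multimodal analyses, editing blueprints, and others), which makes an editing run traceable and allows failures to be diagnosed and selectively revised without a full restart.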

Details

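Motivation: Editing a long-form video from heterogeneous footage is a complex problem. It requires more than clip selection: an agent must preserve narrative intent across material preparation, timeline construction, post-production, and revision, while leaving enough evidence to diagnose failures.

Result: On 23 editing themes, Crayotter is compared with the CapCut-Mate and CutClaw baselines. Under human evaluation it achieves an average score of 3.40/5, versus 2.44 and 1.70 for the baselines, with consistent gains in theme alignment, narrative coherence, and editing smoothness.

Insight: The core contribution is making the long-form video editing workflow structured and traceable: multi-phase, multi-agent collaboration with explicitly generated intermediate artifacts makes the editing process transparent and diagnosable, and prepares these workflows for future policy optimization. The replayable trajectory schema and verifiable reward design are also worth noting.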

Abstract: Editing a long-form video from heterogeneous footage requires more than selecting clips: an agent must preserve narrative intent across material preparation, timeline construction, post-production, and revision while leaving enough evidence to diagnose failures. We present Crayotter, an open-source multimodal multi-agent system for prompt-driven video editing. Crayotter organizes production into three phases: coverage-aware material preparation, artifact-based editing research, and tool-grounded timeline execution. Each phase externalizes inspectable artifacts, including coverage reports, multimodal analyses, editing blueprints, tool calls, and intermediate renders. These artifacts make an editing run traceable and allow failed segments to be diagnosed and selectively revised instead of requiring a full restart. We evaluate Crayotter on 23 editing themes against CapCut-Mate and CutClaw. Under human evaluation, Crayotter achieves an average score of 3.40/5, compared with 2.44 and 1.70 for the two baselines, with consistent gains in theme alignment, narrative coherence, and editing smoothness. We additionally describe a replayable trajectory schema and verifiable reward design that prepare these workflows for future policy optimization. Code, traces, and examples are publicly available at https://github.com/idwts/Crayotter.


[76] MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention cs.CV | cs.AIPDF

Pengyu Wang, Chenkun Tan, Shaojun Zhou, Wei Huang, Qirui Zhou

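TL;DR: This paper presents MOSS-Video-Preview, a model meant to validate a new paradigm of real-time video understanding in which the model keeps perceiving new frames while the video plays, revises its answers as evidence arrives, and stays silent when there is nothing to say. At its core is a cross-attention backbone in a two-channel architecture that runs perception and generation on separate, non-blocking pathways, complemented by a data synthesis pipeline for training.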

Details

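Motivation: The aim is to move video understanding from the offline paradigm (process the complete video, then output a single answer) to real-time interaction, in which the model perceives continuously, responds dynamically, and stays silent when appropriate during playback, which better fits practical use.

Result: The model trails the strong Qwen2.5-VL-7B baseline overall, but remains competitive on offline video and multimodal understanding, is robust on spatial and fine-grained temporal reasoning, and acquires behaviors that offline models lack: continuous perception, answer revision, and timely silence. On a single H200 GPU with 256 frames per video, it achieves about a 5x speedup in time to first token and 2.7x higher decoding throughput, with negligible degradation in offline ability.

Insight: The contribution is a two-channel architecture for real-time video understanding in which a cross-attention backbone fuses vision and language without blocking, so perception is never stalled by generation, together with a data synthesis pipeline that converts dense captions into real-time understanding QA and effectively elicits real-time interactive behavior. From an engineering standpoint, the design exposes a clean channel-wise interface and an independent compression path for real-time multimodal systems.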

Abstract: Video understanding is shifting from the offline paradigm – taking a fully recorded video as input and producing a single answer after it ends – toward real-time interaction, in which the model perceives new frames while still replying, revises its answer as new evidence appears, and remains silent when there is nothing to say. We present MOSS-Video-Preview to validate this paradigm. Our central claim is that perception must not be blocked by generation; its natural realization is a two-channel architecture. We argue that a cross-attention backbone is better suited to real-time vision-language fusion than the prevailing decoder-only design: visual features enter through a side channel rather than joining the autoregressive sequence, so perception and generation run on separate, non-blocking pathways – reducing the frequency of visual processing and exposing a clean channel-wise interface for independent compression. We complement this with a data synthesis pipeline that converts dense captions into real-time understanding QA whose answers are revised to match what the model has perceived so far, and we specialize an offline model on these data to elicit real-time behavior. Our model trails the strong Qwen2.5-VL-7B baseline overall – a gap we attribute primarily to data and scale rather than the architecture – yet attains competitive offline video and multimodal understanding, remains robust on the spatial and fine-grained temporal reasoning central to real-time use, and acquires behaviors that offline models lack: continuous perception, answer revision, and timely silence. On a single H200 with 256 frames per video, it achieves about a 5x speedup in time to first token and 2.7x higher decoding throughput, with negligible degradation in offline ability. Our study of paradigm, architecture, and data outlines a viable path toward real-time video understanding.


[77] Readable Yet Unpredictable: Rotated-Outcome Prediction in Vision-Language Models cs.CVPDF

Lexin Wang, Shenghua Liu, Yiwei Wang, Jiafeng Guo, Xueqi Cheng

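TL;DR: This paper studies whether vision-language models (VLMs) can predict the content of an image after a 180° in-plane rotation when given only the original image, and introduces the Rotated-Outcome Prediction task and the RotOutBench diagnostic benchmark. Although many VLMs can recognize the content when shown either the original or the rotated image directly, their performance drops sharply, even to near-zero accuracy, when they must infer the rotated result from the original image alone.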

Details

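Motivation: The question is whether VLMs can predict what would be visible after a 180° rotation from the original image alone, that is, whether they can reason about an image transformation without directly observing its result.

Result: On RotOutBench, predicted-rotation accuracy on controlled text-image rotations collapses to near zero even for models with high direct-reading accuracy, revealing a substantial prediction gap.

Insight: The paper exposes a limitation of current VLMs in visual reasoning: they can recognize a transformed visual state when it is shown, but struggle to predict that state from the original view. The contribution is isolating this gap through the Rotated-Outcome Prediction task and a paired benchmark, which also offers a new angle for analyzing internal model representations.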

Abstract: Can vision-language models predict what a 180° rotation would reveal from the original image alone? We study this ability through Rotated-Outcome Prediction: given an original image, a model must answer what would be seen or read after a 180° in-plane rotation, without directly observing the rotated target. To isolate this gap, we introduce RotOutBench, a paired diagnostic benchmark spanning open visual cases and controlled text-image rotations. A sharp pattern emerges: many VLMs can recognize the relevant content when directly given either the original or rotated image, yet fail to infer the rotated result from the original image alone. On controlled text-image rotations, predicted-rotation accuracy collapses to near zero even for models with high direct-reading accuracy. A model-level case study further shows that the prediction state can approach a rotated-image reading state, while the final readout still shifts toward the original string. Current VLMs can recognize a transformed visual state when it is shown, but often fail to predict that state from the original view.
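
For reference, the transformation the models are asked to anticipate is simple to state: a 180° in-plane rotation reverses the order of the rows and the order of the pixels within each row. The sketch below applies it to a small grid of hypothetical pixel values; `rotate_180` is an illustrative helper, not code from the benchmark.

```python
def rotate_180(grid):
    """Rotate a 2-D grid by 180 degrees: reverse the rows, then reverse each row."""
    return [row[::-1] for row in grid[::-1]]


# Hypothetical 2x3 "image" with distinct pixel values.
image = [
    [1, 2, 3],
    [4, 5, 6],
]
rotated = rotate_180(image)
print(rotated)               # [[6, 5, 4], [3, 2, 1]]
print(rotate_180(rotated))   # applying the rotation twice restores the original
```

Every pixel at position (r, c) moves to (H - 1 - r, W - 1 - c), so the operation is deterministic and self-inverse, which is what makes the near-zero prediction accuracy reported in the abstract notable.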


[78] AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs cs.CV | cs.AI | cs.SD | eess.ASPDF

Yaoting Wang, Ziyi Zhang, Wenming Tu, Shaoxuan Xu, Wenjie Du

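TL;DR: This paper presents AVI-Bench, a cognitively inspired benchmark for systematically evaluating the audio-visual intelligence (AVI) of omni-multimodal large language models (Omni-MLLMs). Through cross-modal tasks that require joint audio-visual interpretation, it provides fine-grained diagnosis across three stages: perception, understanding, and reasoning. An extension, AVI-Bench-PriSe, uses low-semantic stimuli to test robustness and generalization in unfamiliar domains. Experiments reveal substantial limitations in current models, and the authors propose a four-level AVI taxonomy to guide the development of more robust and generalizable AVI.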

Details

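Motivation: Omni-MLLMs have made progress in integrating vision and audio, but without systematic and comprehensive benchmarks their audio-visual intelligence remains insufficiently evaluated.

Result: Extensive experiments on open-source and closed-source models reveal substantial limitations in the audio-visual intelligence of current Omni-MLLMs. Based on these findings, the authors propose a four-level AVI taxonomy.

Insight: The main contribution is a systematic evaluation framework inspired by three cognitive stages (perception, understanding, reasoning), plus an extended benchmark that uses low-semantic stimuli to probe primitive audio-visual sensation and generalization in unfamiliar domains, going beyond evaluation on common training distributions.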

Abstract: Recent advances in Omni-Multimodal Large Language Models (Omni-MLLMs) have enabled strong integration of vision, audio, and language. However, their audio-visual intelligence (AVI) remains insufficiently evaluated due to the lack of systematic and comprehensive benchmarks. We introduce AVI-Bench, a cognitively inspired benchmark that evaluates Omni-MLLMs across three stages, perception, understanding, and reasoning, through cross-modal tasks requiring joint audio-visual interpretation. This design enables fine-grained diagnosis of model capabilities and failure modes. To further assess robustness beyond familiar domains, we propose AVI-Bench-PriSe, an extension that probes models’ primitive audio-visual sensation using unfamiliar, low-semantic stimuli, testing generalization beyond common training distributions. Extensive experiments on both open-source and closed-source models reveal substantial limitations in current Omni-MLLMs. Based on these findings, we present a four-level AVI taxonomy. Overall, AVI-Bench provides a principled evaluation framework to guide the development of more robust and generalizable AVI. Project website: https://fudancvl.github.io/AVI-Bench/


[79] FineGen: A VLM-based Multi-Agent Framework for Fine-Grained Image-Text Dataset Construction cs.CV | cs.AIPDF

Chang Kong, Yuebing Li, Peng Mo, Haigang Zhang, Qiuming Luo

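TL;DR: This paper proposes FineGen, a multi-agent framework based on vision-language models (VLMs) for automated construction of fine-grained image-text datasets. Through a collaborative Generation-Verification-Correction pipeline with a closed-loop feedback mechanism, it synthesizes hard negatives that are semantically valid yet strictly contradictory to the visual content. Applied to ImageNet, it yields FineGen-100K, a dataset with over 147,000 attribute-specific hard negatives, whose effectiveness is validated on the FG-OVD benchmark.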

Details

Motivation: 当前视觉语言数据集中困难负样本的稀缺严重阻碍了模型的细粒度感知能力,因此需要一种自动化方法来构建高质量的细粒度数据集以解决此问题。

Result: 构建的FineGen-100K数据集属性有效率达到96.7%。在FG-OVD基准测试的下游验证中,使用该数据集进行微调使模型在困难样本上的准确率大幅提升14.4%,显著超越了现有最先进(SOTA)方法。

Insight: 创新点在于提出了一个基于VLM的多智能体协作框架与闭环反馈机制,自动化生成高质量的细粒度困难负样本。其构建的具有严格正负样本比例(1:10)的分层数据集为提升模型细粒度感知提供了有效的数据解决方案。

Abstract: The scarcity of hard negative samples in current vision-language datasets significantly hinders fine-grained perception. To address this, we propose FineGen, a VLM-based Multi-Agent framework for automated dataset construction. By employing a collaborative Generation-Verification-Correction pipeline with a closed-loop feedback mechanism, FineGen ensures synthesized hard negatives are semantically valid yet strictly contradictory to visual content. Applying this to ImageNet, we construct FineGen-100K, a hierarchical dataset containing over 147,000 attribute-specific hard negatives with a rigorous 1:10 positive-to-negative ratio. Extensive evaluations confirm a 96.7% attribute validity rate. Crucially, downstream validation on the FG-OVD benchmark shows that fine-tuning on FineGen-100K yields a substantial +14.4% accuracy improvement on hard samples, significantly outperforming state-of-the-art methods.


[80] Do VLMs See What Sensors Feel? A Scalable Expert-Guided Design for Wheelchair Accessibility Assessment from Street View cs.CV | cs.CY

Dongdong Wang, Alina Hagen, Isabelle Gatmaitan, Hao Zhou, Yiwen Dong

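TL;DR: This paper examines whether vision-language models (VLMs) can identify wheelchair accessibility barriers from Google Street View (GSV) imagery, and proposes an expert-guided retrieval-augmented framework that combines GSV images, ADA-informed guidance, and expert-derived rubrics to evaluate accessibility dimensions. On a dataset collected on the University of Florida campus (407 GSV locations linked with GPS-derived wheelchair dwell behavior), VLM ratings are negatively correlated with dwell time and distributionally similar to it, indicating partial alignment with a behavioral proxy for mobility friction.

Details

Motivation: Assessing built-environment interaction such as wheelchair accessibility is difficult because real-world mobility is shaped by distributed, context-dependent, and temporary barriers that are hard to capture at scale. The paper explores whether VLMs can support scalable accessibility assessment.

Result: On the campus-scale dataset, VLM ratings are negatively correlated with, and distributionally similar to, GPS-derived wheelchair dwell time (used as a mobility-friction signal), indicating partial alignment with the behavioral proxy. Visual cue analysis shows that certain environmental objects, such as curb ramps and crosswalks, are associated with higher VLM accessibility scores, while alignment remains limited for subtle surface conditions, transient obstructions, and viewpoint-dependent barriers.

Insight: The novelty is an expert-guided retrieval-augmented framework that combines GSV images, ADA guidance, and expert rubrics for scalable accessibility assessment. Viewed objectively, linking VLM ratings with sensor data (GPS-derived behavior) offers a new route to large-scale accessibility assessment, though the approach remains limited on complex or subtle barriers.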

Abstract: Assessing built-environment interaction, such as wheelchair accessibility, is difficult because real-world mobility is shaped by distributed, context-dependent, and temporary barriers that are hard to capture at scale. To support scalable assessment, this paper examines whether vision-language models (VLMs) can identify accessibility barriers from Google Street View (GSV) imagery. We propose an expert-guided retrieval-augmented framework that combines GSV images, ADA-informed guidance, and expert-derived rubrics to evaluate accessibility dimensions. We collect a campus-scale dataset at the University of Florida, linking 407 unique GSV locations with GPS-derived wheelchair dwell behavior as a mobility-friction signal. Results show that VLM ratings are both negatively correlated and distributionally similar with dwell time, indicating partial but consistent alignment with a behavioral proxy for mobility friction. Visual cue analysis shows that certain environmental objects, such as curb ramps and crosswalks, are associated with higher VLM accessibility scores, while alignment remains limited for subtle surface conditions, transient obstructions, and viewpoint-dependent barriers. Overall, our findings show the potential of expert-guided VLMs for scalable accessibility assessment aligning with sensor-derived indicators of real-world wheelchair navigation.


[81] DOME: Learning Transferable Domain Variables from Sparse Supervision for Test-Time Adaptation cs.CV | cs.AI

Xiaoran Xu, Yifan Xu, Yupeng Wu, Xiaoshan Yang, Changsheng Xu

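TL;DR: This paper proposes DOME, a domain encoder that learns transferable domain variables from sparse supervision for test-time adaptation (TTA). DOME uses vision-language pretraining to extract dense, continuous representations, parameterizes domains as distributional variables, and introduces a momentum-updated sparse domain bank for disentangled supervision. With these explicit domain cues injected into downstream models, even a basic entropy-minimization TTA strategy achieves state-of-the-art performance on several benchmarks.

Details

Motivation: Existing TTA methods usually infer a single global domain distribution implicitly, ignoring the multidimensional and sample-specific nature of real-world domain shifts, which makes adaptation fragile. The paper addresses this by explicitly modeling each sample's domain.

Result: On ImageNet-C, ImageNet-R, and ImageNet-Sketch, DOME outperforms complex TTA approaches and achieves state-of-the-art performance.

Insight: The core novelty is an explicit, structured domain representation: domains are parameterized as distributional variables, with disentangled supervision from a pretrained vision-language model and a sparse domain bank. The results suggest that robust adaptation stems from good domain representation rather than from intricate adaptation algorithms.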

Abstract: Test-time adaptation (TTA) aims to align a model to shifting test domains using only unlabeled streaming data. Most existing methods implicitly infer a single global domain distribution, ignoring the multidimensional and sample-specific nature of real-world domain shifts, leading to fragile adaptation. We propose DOME, an effective domain encoder that explicitly models each sample’s domain in a zero-shot manner. DOME leverages vision-language pretraining to extract dense, continuous representations, parameterizes domains as distributional variables, and introduces a momentum-updated sparse domain bank for disentangled supervision. By injecting these explicit domain cues into downstream models, even a basic entropy-minimization TTA strategy achieves state-of-the-art performance across ImageNet-C, ImageNet-R, and ImageNet-Sketch, outperforming complex TTA approaches. Our results demonstrate that robust adaptation stems not from intricate adaptation algorithms, but from explicit, structured domain representation.
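
The abstract pairs DOME with a basic entropy-minimization TTA strategy. As a rough illustration of that objective only (not of DOME's domain encoder), the sketch below computes the quantity such a strategy would minimize on an unlabeled test batch. The function names and the example logits are hypothetical; a real TTA step would backpropagate this loss through selected model parameters.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(logits):
    # Shannon entropy (in nats) of the softmax distribution over classes.
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def batch_entropy_loss(batch_logits):
    # Mean entropy over an unlabeled test batch (hypothetical helper);
    # an entropy-minimization TTA step would reduce this value.
    return sum(prediction_entropy(row) for row in batch_logits) / len(batch_logits)


# Made-up logits for illustration: one confident and one uniform prediction.
confident = [8.0, 0.0, 0.0]
uncertain = [1.0, 1.0, 1.0]
loss = batch_entropy_loss([confident, uncertain])
```

A confident prediction has lower entropy than a uniform one, so minimizing this loss pushes the model toward sharper predictions on the test stream.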


[82] Steer Where It Matters: Token-Level Visual-Sensitivity Steering for LVLMs Hallucination Mitigation cs.CV | cs.CL | cs.LG

Ruipeng Zhang, Zhihao Li, C. L. Philip Chen, Tong Zhang

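TL;DR: This paper proposes Token-Level Visual-Sensitivity Steering (TLVS), a lightweight, plug-and-play method for mitigating hallucinations in large vision-language models (LVLMs). It extracts and refines token-level steering vectors, then applies fine-grained, visual-sensitivity-adaptive steering during decoding, intervening only at critical positions so that hallucinations are suppressed while evidence-grounded content is preserved.

Details

Motivation: Existing activation-steering methods derive steering vectors by averaging image-versus-no-image differences over the entire sequence, which dilutes the sparse, local signals that matter during autoregressive decoding and yields a low signal-to-noise ratio. A fixed steering strength also misallocates the intervention budget, over-perturbs non-critical tokens, and can cause instability.

Result: TLVS is evaluated on several benchmarks, including POPE, AMBER, CHAIR (COCO), MMHal, and HallusionBench, and shows consistent improvements over previous steering methods.

Insight: The core novelty is a token-level, visual-sensitivity-based steering mechanism that enables fine-grained, adaptive intervention and is more precise than sequence-level averaged steering. Its lightweight design, which does not require extensive training, makes it easy to deploy across different vision-language models.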

Abstract: Large vision language models (LVLMs) have made rapid advancements and are deployed across various applications, yet hallucinations remain a major challenge. Activation steering is appealing due to its minimal training overhead and controllability at inference time. However, we found that during autoregressive decoding, visual conditioning affects token prediction sparsely and locally across decoding steps, and many existing methods that average image-versus-no-image differences over the entire sequence dilute these critical signals, yielding low signal-to-noise ratio steering directions. Additionally, many existing methods apply a fixed steering strength, which misallocates the intervention budget, over-perturbs non-critical tokens, and can cause instability. To address these limitations, we propose Token-Level Visual-Sensitivity Steering (TLVS) for hallucination mitigation. Our approach first extracts token-level steering vectors and refines them, and then applies fine-grained, visual-sensitivity-adaptive steering only where it matters. This lightweight, plug-and-play mechanism requires only minimal training for calibration and can be applied across diverse vision-language models. It modulates the steering strength at each decoding step, selectively suppressing hallucination-prone spans while preserving evidence-grounded content. We evaluate TLVS on several benchmarks, including POPE, AMBER, CHAIR (COCO), MMHal, and HallusionBench, demonstrating consistent improvements over previous steering methods.


[83] ViMax: Agentic Video Generation cs.CV | cs.AI

Lingxuan Huang, Sizhe He, Hengji Zhou, Liqiang Nie, Lianghao Xia

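TL;DR: This paper presents ViMax, an agentic video generation framework built on multi-agent collaboration, aimed at the narrative-planning and visual-consistency challenges of long-form video generation. A hierarchical narrative engine with retrieval-augmented generation ensures global story coherence, a dependency-aware visual consistency mechanism tracks character and environment states across temporal boundaries, and VLM-guided agents continuously monitor and refine narrative coherence and visual fidelity.

Details

Motivation: Current short-clip generation methods cannot provide the systematic narrative planning and visual consistency that long-form video requires. Existing methods generate isolated sequences without narrative structure and lack mechanisms for keeping characters and environments consistent across scenes.

Result: The abstract reports no specific quantitative results, benchmarks, or comparisons with existing methods.

Insight: The novelty is a coordinated multi-agent framework that decomposes video creation into specialized tasks (narrative decisions, visual continuity, and production quality) and, through the hierarchical narrative engine and the dependency-aware visual consistency mechanism, maintains storytelling integrity and visual coherence across multi-scene timelines.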

Abstract: Long-form video generation requires systematic narrative planning and visual consistency that current short-clip methods cannot provide. Existing methods generate isolated sequences without narrative structure and lack mechanisms for maintaining character and environmental consistency across scenes. We present ViMax, an agentic video generation framework that addresses video creation through coordinated multi-agent collaboration where specialized components negotiate narrative decisions, visual continuity, and production quality. Our framework employs a hierarchical narrative engine with retrieval-augmented generation for global story coherence and a dependency-aware visual consistency mechanism that tracks character and environmental states across temporal boundaries, while VLM-guided agents continuously monitor and refine both narrative coherence and visual fidelity. The framework enables coordinated agent collaboration to generate extended narrative content. This maintains both storytelling integrity and visual coherence across multi-scene timelines.


[84] A Dataset for Dynamic Human Preferences for Vision Language Models cs.CV | cs.AI

Hannah Gao, Dylan Hadfield-Menell, Rachel Ma

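TL;DR: This paper introduces a new benchmark dataset for evaluating how well vision-language models (VLMs) understand dynamic human preferences. It generates test data with variations in image dependence through an automated pipeline and evaluates current state-of-the-art models on the benchmark.

Details

Motivation: As VLMs are increasingly adopted in interactive settings, it is important to evaluate how models adapt to the real-time preferences of different users. Existing benchmarks focus mainly on static capabilities and generally held preferences learned from training data, and do not evaluate dynamic preferences passed in context.

Result: The paper evaluates several state-of-the-art VLMs on the new benchmark, but the abstract gives no quantitative results and does not say whether any model reaches SOTA.

Insight: The novelty is the first benchmark focused on VLMs' ability to understand dynamic human preferences passed in context at inference time, together with an automated pipeline for generating diverse test data, which fills a gap in existing evaluations.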

Abstract: Given the increased adoption of Vision Language Models (VLMs) in human-interactive settings, it is important that we evaluate how well these models can adapt to real-time preferences for different users. While an increasing number of vision-language benchmarks have recently been introduced, they focus largely on evaluating static capabilities and generally held preferences learned from extensive training data. This work introduces a new benchmark for evaluating the ability of VLMs to understand dynamic human preferences, i.e., preferences that are passed in-context at inference time. We provide an automated pipeline for generating this benchmark with variations on image dependence, a dynamic multi-modal human-preference dataset, and evaluations of state-of-the-art models on the novel benchmark.


[85] What neurosurgeons need to see: synthetic intra-operative MRI from ultrasound for brain-shift compensation in brain tumour surgery cs.CV | cs.LG

Santiago Cepeda, Olga Esteban-Sinovas, Ignacio Arrese, Rosario Sarabia

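TL;DR: This paper proposes an end-to-end pipeline for compensating brain shift in brain tumour surgery. By merging the preoperative MRI, a synthetic MRI generated from intraoperative ultrasound, and a deformable registration anchored on that synthetic image, the pipeline generates a new whole-brain MRI volume in the preoperative imaging space.

Details

Motivation: Neuronavigation accuracy degrades after dural opening because of brain shift. Intraoperative MRI can compensate but is costly and rarely available, whereas intraoperative ultrasound is widely available but limited by speckle noise, a narrow field of view, and the inability to represent structures absent from the preoperative scan, such as the resection cavity and residual tumour.

Result: On a post-resection ReMIND cohort, synthetic images from ResViT-2.5D closely match the intraoperative T2 across structural, intensity, and perceptual metrics. Across 215 expert landmarks in 14 subjects, the synthesis-anchored registration reduces the mean target registration error from 6.27 mm to 5.86 mm, on par with the classical NiftyReg baseline (5.85 mm), and yields a diffeomorphic deformation field in every subject.

Insight: The novelty is the integration of a 2.5D residual-transformer synthesis backbone (ResViT-2.5D) with a two-stage registration (NiftyReg coupled with a synthesis-anchored SynthMorph stage) that operates directly on raw scanner inputs and produces an MRI-like updated volume reflecting the intraoperative post-resection state, rather than simply pursuing a gain in registration accuracy.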

Abstract: Maximal safe resection is the primary objective in glioma surgery. Neuronavigation guidance is progressively degraded by brain shift after dural opening. Intraoperative MRI can compensate but needs dedicated infrastructure and is rarely available, whereas intraoperative ultrasound (ioUS) is inexpensive, repeatable, and compatible with routine workflows. Navigation systems combining ioUS with preoperative MRI usually rely on rigid registration; even deformable multimodal registration is limited by ultrasound speckle contrast, a narrow field of view, and the inability to represent structures absent from the preoperative scan, most critically the resection cavity and residual tumor. We propose an end-to-end pipeline that generates a new whole-brain MRI volume in the preoperative imaging space by merging the preoperative MRI, a synthetic MRI generated from the ioUS, and a deformable registration anchored on that synthetic image. It integrates a 2.5D residual-transformer synthesis backbone (ResViT-2.5D) and a two-stage registration coupling NiftyReg with a synthesis-anchored SynthMorph stage, operating directly on raw scanner inputs. On a post-resection ReMIND cohort, ResViT-2.5D produced synthetic images closely matching the intraoperative T2 across structural, intensity, and perceptual metrics. In 14 subjects with 215 expert landmarks, the synthesis-anchored registration reduced the mean target registration error from 6.27 to 5.86 mm, matching a strong classical NiftyReg baseline (5.85 mm) while yielding a diffeomorphic deformation field in every subject. The contribution is not a gain in registration accuracy but the integrated volume itself, which, inside the ultrasound field of view, reflects the intraoperative post-resection state. This provides the surgeon with an MRI-like update of the operative field with potential for integration into surgical-navigation workflows.
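
The headline numbers above are mean target registration errors (TRE) over expert landmarks. A minimal sketch of how such a value is typically computed follows, assuming TRE is the mean Euclidean distance between corresponding landmark pairs after registration. The function name and the landmark coordinates are hypothetical and are not taken from the paper.

```python
import math


def mean_target_registration_error(reference_landmarks, registered_landmarks):
    """Mean Euclidean distance between paired landmarks.

    The result is in the same units as the input coordinates (e.g. mm).
    """
    if not reference_landmarks or len(reference_landmarks) != len(registered_landmarks):
        raise ValueError("landmark lists must be non-empty and of equal length")
    distances = [
        math.dist(ref, reg)
        for ref, reg in zip(reference_landmarks, registered_landmarks)
    ]
    return sum(distances) / len(distances)


# Made-up 3D landmark coordinates in mm, for illustration only.
reference = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
registered = [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]
tre_mm = mean_target_registration_error(reference, registered)
```

Because the metric averages over landmarks, a small mean can still hide large local errors, which is one reason the abstract also reports that every deformation field is diffeomorphic.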


[86] MM-Matryoshka: Towards Budget-Elastic Visual Document Retrieval via a 2D Multimodal Matryoshka Training Framework cs.CV | cs.AI

Haowen Xiang, Yibo Yan, Jiahao Huo, Yu Huang, Yi Cao

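TL;DR: This paper proposes MM-Matryoshka, a 2D Matryoshka training framework for budget-elastic visual document retrieval. With a single retrieval model, vector dimension and encoder depth can be adjusted at inference time to fit storage and compute budgets, substantially reducing overhead while preserving high retrieval quality.

Details

Motivation: Multi-vector visual document retrievers achieve fine-grained matching with deep vision-language models but are expensive to deploy. Existing efficiency techniques usually optimize only part of the budget, leaving no unified way to trade accuracy for efficiency across both vector width and encoder depth.

Result: Comprehensive experiments across multiple representative backbones show that MM-Matryoshka retains higher retrieval quality than direct truncation baselines while substantially reducing storage and computational overhead, offering robust budget elasticity for efficient visual document retrieval.

Insight: The novelty is a 2D Matryoshka training framework that, for the first time, provides budget elasticity along both vector dimension and encoder depth, letting one model serve different budgets without training a separate model for each. This gives multimodal retrievers a unified and flexible approach to efficiency.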

Abstract: Multi-vector visual document retrievers achieve strong fine-grained matching by representing each page with multiple vectors from deep Vision-Language Models (VLMs), but this design makes deployment expensive in both storage and computational overhead. Existing efficiency techniques usually optimize only part of this budget, leaving multimodal retrievers without a unified way to trade accuracy for both vector width and encoder depth. Therefore, we propose MM-Matryoshka, a 2D Matryoshka training framework for budget-elastic Visual Document Retrieval (VDR), which makes ColPali-style multi-vector retrieval elastic along both dimension and layer. At inference time, a single retriever can select a 2D budget without training separate models for different budgets. Through comprehensive experiments across multiple representative backbones, we demonstrate that MM-Matryoshka retains significantly higher quality than direct truncation baselines while substantially reducing storage and computational overhead, offering robust budget elasticity for efficient VDR.
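
To make the "vector width" axis concrete, the sketch below truncates each token vector to its first `dim` components and scores a query against a page with late-interaction MaxSim, the scoring commonly used by ColBERT-style multi-vector retrievers. It illustrates only the width axis; the encoder-depth axis is not shown. Re-normalizing after truncation is an assumption, and the function names and vectors are hypothetical rather than the paper's implementation.

```python
import math


def truncate_and_normalize(vec, dim):
    # Keep the first `dim` components (the Matryoshka prefix), then
    # re-normalize to unit length. Re-normalization is an assumption here.
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0.0 else head


def maxsim_score(query_vecs, page_vecs, dim):
    # Late interaction: for each query token take the best-matching page
    # token (max dot product), then sum over query tokens.
    q = [truncate_and_normalize(v, dim) for v in query_vecs]
    p = [truncate_and_normalize(v, dim) for v in page_vecs]
    return sum(
        max(sum(a * b for a, b in zip(qv, pv)) for pv in p)
        for qv in q
    )


# Made-up 4-dimensional token vectors, for illustration only.
query = [[1.0, 0.0, 1.0, 0.0]]
page = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
full_width_score = maxsim_score(query, page, dim=4)
half_width_score = maxsim_score(query, page, dim=2)
```

The same stored vectors yield different scores at different widths, which is why a model trained without a Matryoshka-style objective can degrade under direct truncation.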


[87] MemoVAD: Resource-Efficient Video Anomaly Detection via Dynamic Semantic Memory in Edge Computing Scenarios cs.CV | cs.AI

Guo Li, Jiandian Zeng, Yang Li, Zihao Peng, Ke Chen

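TL;DR: MemoVAD is a resource-efficient video anomaly detection framework for edge computing scenarios. Through edge-cloud collaboration, it uses an uncertainty-aware gating policy and a dynamic semantic memory to selectively bring in high-level semantics from vision-language models, substantially reducing communication overhead while maintaining detection performance.

Details

Motivation: Deploying video anomaly detection on edge devices faces a fundamental tension between the need for high-level semantics and limited computational resources; in particular, the latency and compute cost of vision-language models preclude on-device deployment.

Result: Experiments on UCF-Crime and XD-Violence using a real edge device show that MemoVAD substantially reduces communication overhead while outperforming state-of-the-art methods.

Insight: The core novelties are an uncertainty-aware gating policy grounded in Subjective Logic, which models perceived uncertainty and selectively queries the cloud-based VLM, and a dynamic semantic memory that caches and retrieves VLM-verified prototypes so the edge model can progressively incorporate VLM-level semantics.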

Abstract: Deploying Video Anomaly Detection (VAD) in real-world surveillance faces a fundamental tension between the demand for high-level semantics to ensure effectiveness and the limited computational resources of edge devices. Vision-Language Models (VLMs) provide rich open-vocabulary semantics, but their latency and computational cost preclude on-device deployment. To address this challenge, we propose MemoVAD, an edge-cloud collaborative framework that selectively incorporates VLM semantics into streaming VAD. MemoVAD runs most inference on the edge with a lightweight detector and a causal Temporal Context Encoder (TCE) to model temporal dependencies. Specifically, we introduce an Uncertainty-Aware Gating (UAG) policy grounded in Subjective Logic to model perceived uncertainty and query the cloud-based VLM only for high-uncertainty and semantically novel clips. In addition, a Dynamic Semantic Memory (DSM) caches VLM-verified prototypes for efficient retrieval, enabling the edge model to progressively incorporate VLM-level semantics via a semantic adapter. Experiments on the UCF-Crime and XD-Violence datasets on a real edge device show that MemoVAD substantially reduces communication overhead while surpassing state-of-the-art performance.
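
The gating idea can be sketched with the standard Subjective Logic mapping from non-negative per-class evidence to belief masses and an uncertainty mass, as used in evidential deep learning. This is an assumption about the general form, not the paper's exact UAG policy: the threshold, the function names, and the evidence values are hypothetical, and the semantic-novelty check mentioned in the abstract is omitted.

```python
def subjective_opinion(evidence):
    # evidence: non-negative evidence per class from the edge detector.
    # Dirichlet strength S = sum(e_k + 1); belief b_k = e_k / S;
    # uncertainty u = K / S, so that sum(b_k) + u == 1.
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes
    beliefs = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return beliefs, uncertainty


def should_query_cloud(evidence, threshold=0.5):
    # Hypothetical gate: escalate a clip to the cloud VLM only when the
    # uncertainty mass exceeds the threshold.
    _, uncertainty = subjective_opinion(evidence)
    return uncertainty > threshold


# Made-up evidence vectors for a two-class (normal vs anomalous) detector.
no_evidence = [0.0, 0.0]       # nothing observed, maximal uncertainty
strong_evidence = [18.0, 0.0]  # strong support for the first class
```

With no evidence the uncertainty mass is 1 and the clip is escalated; with strong evidence it stays on the edge, which is how a gate of this kind reduces communication.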


[88] PereStruct: Multimodal Semantic Assembly for Robust Historical Document Parsing cs.CV | cs.DL

Maksim Shandybo, Ivan Bespalov, Daniil Yefimov, Marina Kosheleva, Alexander Loukianov

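TL;DR: This paper presents PereStruct, an automated pipeline designed for parsing historical newspapers with complex layouts. It combines a fine-tuned YOLO architecture for layout analysis and block detection with a novel semantic assembly module that reconstructs articles by jointly modeling TF-IDF lexical-semantic similarity, visual embeddings, and geometric layout constraints. The method achieves state-of-the-art performance on block-to-article mapping.

Details

Motivation: Historical documents, newspapers in particular, are hard to parse because of severe physical degradation and highly irregular page structures, which pose severe out-of-distribution challenges even for state-of-the-art vision-language models.

Result: PereStruct achieves an F1 score of 0.904 on block-to-article mapping, a state-of-the-art result. End-to-end evaluation shows substantially higher fidelity than vision-language models (BLEU of approximately 0.96 vs 0.34).

Insight: The novelty is a multimodal semantic assembly module that combines layout analysis, visual embeddings, and semantic similarity, using a modular architecture to handle complex historical layouts that generic vision-language models struggle with. The authors also release an annotated dataset and a benchmark to support reproducibility.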

Abstract: Parsing historical documents with complex, non-standard layouts remains a fundamental bottleneck in large-scale archival digitization. Unlike modern typography, historical newspapers exhibit severe physical degradation and highly irregular page structures that confound even state-of-the-art vision-language models, presenting severe out-of-distribution challenges. We address this gap with an automated pipeline specifically designed for parsing historical newspapers, documents characterized by particularly intricate multi-column layouts. Our approach combines a fine-tuned YOLO architecture for layout analysis and block detection, trained on 1,426 fully human-annotated scanned pages, with a novel semantic assembly module that reconstructs articles by jointly modeling lexical-semantic similarity via TF-IDF, visual embeddings from our fine-tuned YOLO, and geometric layout constraints. This multi-modal integration yields state-of-the-art performance, achieving an F1 score of 0.904 on block-to-article mapping. Notably, end-to-end evaluation against vision-language models (Qwen3.6-35B-A3B and Qwen3.6-Plus) demonstrates that PereStruct achieves substantially higher fidelity (BLEU approximately 0.96 vs 0.34), validating that modular architectures excel where generic VLMs fail on complex historical layouts. To support reproducibility and advance research in this domain, we release both the training corpus of 599 annotated pages and a curated PereStruct benchmark of 93 pages with expert-verified ground-truth block-to-article mappings. This framework establishes a robust foundation for high-fidelity digitization and semantic reconstruction of complex archival materials.


[89] Liquid Neural Networks as a Drop-in Continuous-Time Deformation Field for Dynamic 3D Gaussian Splatting cs.CV | cs.AI

Mingzhao Li, Arghya Pal, Guan Yuan Tan

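TL;DR: This paper replaces the positional-encoded MLP deformation field in Deformable 3D Gaussian Splatting (D-3DGS) with a stack of Closed-form Continuous-time (CfC) cells, a type of Liquid Neural Network (LNN). The goal is to turn the deformation field from discrete per-frame prediction into an explicit continuous-time function, giving better temporal smoothness in dynamic scene reconstruction.

Details

Motivation: Existing D-3DGS methods use an MLP to predict deformation from time t, but the architecture couples no two time values and in effect predicts discrete per-frame offsets. Temporal smoothness emerges only as a byproduct of optimization rather than as an inherent property of the model.

Result: On the eight D-NeRF scenes and seven NeRF-DS scenes, the liquid deformation field matches or exceeds the MLP baseline in aggregate, with the largest gains on scenes with high-frequency articulated motion.

Insight: The novelty is redesigning the deformation field as a stack of CfC cells, the closed-form solution of the Liquid Time-constant ODE. A sigmoidal time gate interpolates between candidate hidden states, baking a learned smooth response to time into the loss landscape without invoking a numerical solver, which makes for a near-zero-friction architectural change.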

Abstract: Deformable 3D Gaussian Splatting (D-3DGS) reconstructs dynamic scenes from monocular video by deforming a canonical set of 3D Gaussians through a positional-encoded MLP of frame time t. Although fitted to a continuous variable, the MLP couples no two values of t in its architecture and effectively predicts discrete per-frame offsets, leaving temporal smoothness to emerge only as a byproduct of optimisation. We redesign the deformation field as a stack of Closed-form Continuous-time (CfC) cells, a Liquid Neural Network (LNN) that is the closed-form solution of the Liquid Time-constant ODE, while preserving every other part of the D-3DGS pipeline. Each cell exposes a sigmoidal time gate that interpolates between two candidate hidden states, baking a learned smooth response to t into the loss landscape without invoking any numerical solver. On the eight D-NeRF and seven NeRF-DS scenes the liquid field matches or exceeds the MLP baseline in aggregate, with its largest gains concentrated on the scenes with the most high-frequency articulated motion. The result is a near-zero-friction architectural design that turns the discrete MLP deformation field into an explicit continuous-time function of t.
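
The sigmoidal time gate can be illustrated with a scalar toy version of a CfC cell, following the closed-form expression from the CfC literature: output = sigmoid(-f * t) * g + (1 - sigmoid(-f * t)) * h, where f, g, and h are learned heads. The sketch below is not the paper's deformation field: the affine heads, the weights, and all names are hypothetical, and a real cell operates on vectors rather than scalars.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def affine(weights, h_prev, x):
    # weights = (w_hidden, w_input, bias); hypothetical scalar head.
    return weights[0] * h_prev + weights[1] * x + weights[2]


def cfc_cell(h_prev, x, t, params):
    # f sets how fast the gate moves with t; g and h are the two candidate
    # hidden states the gate interpolates between.
    f = affine(params["f"], h_prev, x)
    g = math.tanh(affine(params["g"], h_prev, x))
    h = math.tanh(affine(params["h"], h_prev, x))
    gate = sigmoid(-f * t)
    return gate * g + (1.0 - gate) * h


# Made-up weights: f = 1, g = tanh(1), h = tanh(-1) regardless of inputs.
toy_params = {"f": (0.0, 0.0, 1.0), "g": (0.0, 0.0, 1.0), "h": (0.0, 0.0, -1.0)}
state_at_t0 = cfc_cell(0.3, 0.7, 0.0, toy_params)
state_at_large_t = cfc_cell(0.3, 0.7, 50.0, toy_params)
```

At t = 0 the gate is 0.5 and the output is the average of the two candidates; for positive f the output moves smoothly toward the second candidate as t grows, so the time dependence is an explicit differentiable function and no ODE solver is needed.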


[90] Simultaneous hyperkinetic movement disorders phenotyping: a cross-cohort pediatric transfer study using routine videos, markerless pose estimation and a tabular foundation model cs.CV | q-bio.NC

Laura Cif, Diane Demailly, Zohra Souei, Muhammad Mushhood Ur Rehman, Juan Dario Ortigoza Escobar

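TL;DR: This study develops and externally validates a framework based on routine clinical videos for simultaneously detecting multiple movement disorder phenomenologies (such as dystonia, tremor, and myoclonus). The framework combines markerless pose estimation, kinematic descriptors, and a pretrained tabular foundation model. After training on an adult dataset, it is applied to a pediatric population with lightweight calibration to test cross-cohort transfer.

Details

Motivation: The study addresses the challenge of simultaneously detecting multiple movement disorder phenomenologies in routine clinical videos, with a particular focus on external cross-cohort transfer from adult to pediatric populations, in order to make diagnosis more objective and accessible.

Result: On the independent pediatric external validation set, performance improves consistently after decision-layer calibration: Hamming accuracy rises from 0.804 to 0.839 and the Jaccard index from 0.548 to 0.633. On the subset of phenomenologies with more definite clinician agreement, the Jaccard index rises further to 0.786, indicating that the gains do not rest on the least-reliable labels.

Insight: The novelty is combining markerless pose estimation with a pretrained foundation model for simultaneous multi-phenomenology detection, and achieving cross-cohort (adult-to-pediatric) transfer through lightweight calibration without retraining the backbone, which offers an efficient and scalable approach to clinical video analysis.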

Abstract: Objective: To develop and externally test a video-based framework for simultaneous detection of hyperkinetic movement disorder (MD) phenomenologies (dystonia, tremor, myoclonus, chorea, athetosis, ballismus, stereotypies, and tics) using routine clinical recordings, with explicit testing of external, cross-cohort transfer from adult to pediatric populations. Methods: In this proof-of-concept study, the framework combines markerless pose estimation, kinematic descriptors, and a pretrained foundation model. A shared predictive backbone was developed on 21 adults with confirmed hyperkinetic MDs and 4 healthy controls assessed under a standardized protocol. External validation was performed on an independent external cohort: a real-world pediatric sample (n=12, monogenic combined MDs). For the external dataset, the backbone was deployed without retraining; lightweight calibration adjusted only the final subject-level decision step using a small labeled subset of patients selected by clinicians as representative of the cohort’s phenotypic range. Results: After local calibration of the decision layer on the clinician-selected subset, performance improved consistently on the held-out pediatric patients (n=7): Hamming accuracy rose from 0.804 to 0.839 and the Jaccard index from 0.548 to 0.633. This calibrated performance was preserved, and the Jaccard index further improved, when the evaluation was restricted to the phenomenologies with more definite clinician agreement (Hamming accuracy 0.9, Jaccard index 0.786), indicating that the gains did not rest on the least-reliable labels.
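
The two reported metrics are standard multi-label scores. A minimal sketch follows, assuming Hamming accuracy is the fraction of correct subject-by-phenomenology decisions and the Jaccard index is averaged per subject; the abstract does not state the averaging convention, so that choice is an assumption. The function names and the label matrices are hypothetical.

```python
def hamming_accuracy(y_true, y_pred):
    # Fraction of individual (subject, label) decisions that are correct.
    total = sum(len(row) for row in y_true)
    correct = sum(
        1
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
        if t == p
    )
    return correct / total


def jaccard_index(y_true, y_pred):
    # Per-subject intersection over union of positive labels, averaged over
    # subjects. A subject with no true and no predicted labels scores 1.0.
    scores = []
    for true_row, pred_row in zip(y_true, y_pred):
        intersection = sum(1 for t, p in zip(true_row, pred_row) if t and p)
        union = sum(1 for t, p in zip(true_row, pred_row) if t or p)
        scores.append(intersection / union if union else 1.0)
    return sum(scores) / len(scores)


# Made-up labels: rows are subjects, columns are phenomenologies.
true_labels = [[1, 0, 1, 0], [0, 1, 0, 0]]
pred_labels = [[1, 0, 0, 0], [0, 1, 1, 0]]
```

Hamming accuracy rewards correct negatives, while the Jaccard index ignores them, which is why the two numbers in the abstract differ so much in magnitude.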


[91] What Makes Video World Model Latents Action-Relevant: Prediction over Reconstruction cs.CV | cs.AI

Jewon Yeom, Hanseul Kim, Jeongjae Park, Sungmok Jung, Jaejin Lee

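TL;DR: Using a unified probe-based evaluation framework, this paper systematically studies how different pretraining signals for video world models shape action-relevant structure in their latent spaces. It finds that action-relevant structure is driven primarily by temporal video pretraining rather than pixel reconstruction fidelity, and that video-pretrained self-supervised encoders achieve the best Pareto trade-off between visual fidelity and action prediction.

Details

Motivation: The paper asks which pretraining signals induce action-relevant structure in the latent spaces of video world models, in order to clarify the roles of prediction and reconstruction in forming action-relevant representations.

Result: Under a unified inverse-dynamics probing evaluation, video-pretrained self-supervised encoders (such as V-JEPA and VideoMAE) achieve the best balance between action prediction and visual fidelity. The importance of temporal structure carries over to robotic benchmarks, though on CALVIN static-environment tasks can partially mask it by allowing strong image priors to suffice.

Insight: The core finding is that temporal predictive structure, not reconstruction fidelity, is the key ingredient of action-relevant video representations. Inverse-dynamics supervision also substantially improves robustness to visual corruption, suggesting that action-aware objectives regularize latent geometry.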

Abstract: Video world models are increasingly used to provide predictive visual representations, yet it remains unclear which pretraining signals induce action-relevant structure in their latent spaces. We study this question through a unified probe-based evaluation across diverse encoder families, including image-only self-supervision, video pretraining with and without latent prediction, reconstruction-based autoencoders, diffusion models, and shortcut-forcing dynamics models. Using a common inverse-dynamics probing objective, we find that action-relevant structure is driven primarily by temporal video pretraining rather than pixel reconstruction fidelity: models with strong pixel decoding quality can exhibit near-zero action recoverability, while video-pretrained self-supervised encoders consistently achieve the best Pareto trade-off between visual fidelity and action prediction. Comparing V-JEPA and VideoMAE further shows that most gains arise from natural-video temporal context, with feature-level latent prediction providing a smaller additional benefit. These trends transfer across robotic benchmarks, though CALVIN reveals that static-environment tasks can partially mask the importance of temporal structure by allowing strong image priors to suffice. Finally, inverse-dynamics supervision substantially improves robustness to visual corruption, suggesting that action-aware objectives regularize latent geometry beyond clean-setting performance. Our results identify temporal predictive structure – not reconstruction fidelity – as the primary ingredient underlying action-relevant video representations.


[92] Struct-Searcher: Agentic Structural Thinking Advances Multimodal Deep Information Seeking cs.CV

Fan Zhang, Vireo Zhang, Shengju Qian, Haoxuan Li, Zheng Lian

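TL;DR: This paper proposes Struct-Searcher, a structural agentic workflow grounded in belief revision theory, to handle contradictory information across heterogeneous modalities in multimodal deep information seeking. By explicitly maintaining an evolving multimodal structural graph, it enables conflict-aware deep information seeking.

Details

Motivation: Existing workflows for deep research agents largely follow evidence accumulation models, which aggregate evidence linearly and lack mechanisms for handling contradictory information across heterogeneous modalities.

Result: Experiments across multiple benchmark datasets and backbone models show that Struct-Searcher is plug-and-play and model-agnostic, with an average relative accuracy improvement of 17.2% on BrowseComp-VL across five backbones. It is also top-performing, surpassing the second-best approach by relative accuracy margins of 3.7% on MM-BrowseComp, 1.5% on HLE-VL, and 0.7% on BrowseComp-VL, reaching SOTA.

Insight: The core novelty is bringing belief revision theory into multimodal agentic workflows: contradictory information is handled by explicitly maintaining and evolving a structural graph, a principled conflict-resolution mechanism rather than simple linear evidence aggregation.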

Abstract: Deep research agents have attracted increasing attention for their ability to collect large-scale online information to acquire target knowledge, with recent efforts shifting from purely text-based information seeking to multimodal settings. However, existing agentic workflows are largely aligned with evidence accumulation models, which linearly aggregate evidence and lack principled mechanisms for handling contradictory information across heterogeneous modalities. To this end, we propose Struct-Searcher, a structural agentic workflow grounded in belief revision theory that explicitly maintains an evolving multimodal structural graph throughout the reasoning process, enabling effective conflict-aware multimodal deep information seeking. Extensive experiments across multiple benchmark datasets and backbone models demonstrate that Struct-Searcher is (1) plug-and-play and model-agnostic, yielding an average relative accuracy improvement of 17.2% on BrowseComp-VL across five different backbones, and (2) top-performing, consistently outperforming state-of-the-art vision-language models (VLMs) and deep research agents, with relative accuracy improvements of 3.7% on MM-BrowseComp, 1.5% on HLE-VL, and 0.7% on BrowseComp-VL over the second-best competing approach.


[93] Cross-View Urban Traffic Dataset: Drone-Supervised Ground Truth for Monocular Bird’s-Eye View Localization cs.CV | cs.AI

Prakhar Bhardwaj, Simone Weikl, Kilian Mang, Elia Jonas Sandtner

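TL;DR: This paper introduces a dataset and benchmark for cross-view urban traffic perception, built from synchronized ego-centric bicycle videos and aerial drone videos recorded at urban intersections. The benchmark targets two linked tasks: cross-view identity matching between street-view and drone-view object tracks, and ego-to-bird’s-eye-view prediction using aerial supervision.

Details

Motivation: The work is motivated by intersection-centric traffic analysis, which requires joint reasoning across views about identity preservation, local interactions, and global spatial structure, whereas existing urban driving and V2X datasets lack identity-level cross-view alignment.

Result: On the proposed benchmark, cross-view matching achieves strong recall but remains limited by over-assignment and temporal inconsistency; ego-to-BEV prediction benefits from aerial supervision but remains far from saturated under lightweight monocular sensing.

Insight: The novelty is a cross-view dataset with identity-level alignment, together with a standardized evaluation pipeline and baseline implementations, supporting research on cross-view perception, urban scene alignment, and ego-to-global traffic understanding.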

Abstract: We introduce a dataset and benchmark for cross-view urban traffic perception built from synchronized ego-centric bicycle videos and aerial drone videos recorded at real urban intersections. The benchmark targets two linked tasks: cross-view identity matching between street-view and drone-view object tracks, and ego-to-bird’s-eye-view prediction using aerial supervision. In contrast to prior urban driving and V2X datasets, our benchmark provides identity-level alignment across radically different viewpoints together with standardized evaluation, annotation tooling, and baseline implementations. This setting is motivated by intersection-centric traffic analysis, where identity preservation, local interactions, and global spatial structure must be reasoned about jointly across views. We evaluate methods at both the track and frame levels, including cross-view ID precision/recall/IDF1, near–far breakdowns, temporal stability, and consistency metrics. We also provide baseline results for wedge-based cross-view matching and for three BEV prediction baselines: inverse perspective mapping, a MonoLayout-style learned baseline, and a regression baseline. The results show that the benchmark is feasible but challenging: cross-view matching achieves strong recall yet remains limited by over-assignment and temporal inconsistency, while ego-to-BEV prediction benefits from aerial supervision but remains far from saturated under lightweight monocular sensing. We hope that this benchmark will support future research on cross-view perception, urban scene alignment, and ego-to-global traffic understanding.
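
The cross-view ID precision, recall, and IDF1 metrics mentioned above follow the usual identity-matching definitions. The sketch below assumes the counts of identity true positives, false positives, and false negatives have already been obtained from a track-level matching step, which is not shown. The function name and the counts are hypothetical and are not results from the paper.

```python
def identity_scores(idtp, idfp, idfn):
    # ID precision = IDTP / (IDTP + IDFP)
    # ID recall    = IDTP / (IDTP + IDFN)
    # IDF1         = 2 * IDTP / (2 * IDTP + IDFP + IDFN)
    precision = idtp / (idtp + idfp) if (idtp + idfp) else 0.0
    recall = idtp / (idtp + idfn) if (idtp + idfn) else 0.0
    denominator = 2 * idtp + idfp + idfn
    idf1 = 2 * idtp / denominator if denominator else 0.0
    return precision, recall, idf1


# Made-up counts showing how over-assignment (many false positives) can
# coexist with strong recall.
id_precision, id_recall, idf1 = identity_scores(idtp=80, idfp=40, idfn=20)
```

In this illustrative case recall is high while precision lags, the same qualitative pattern the abstract describes for wedge-based matching.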


[94] DroneDAR: Long-Range Drone Distance Estimation Using Monocular Vision and Bounding-Box Features cs.CV | cs.RO

Knut Peterson, Zaid Mayers, David Han

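TL;DR: This paper proposes DroneDAR, a model for long-range drone distance estimation from monocular vision that combines image crops with bounding-box geometry, fusing a convolutional backbone with bounding-box cues through a lightweight gating mechanism. It analyzes how backbone capacity, crop resolution, and regression loss functions affect performance, and examines common failure modes at long range.

Details

Motivation: Estimating the distance of small drones in long-range imagery is challenging because of extreme target scale variation, background clutter, and noisy visual cues. The paper studies the setting in which a detector provides a candidate drone region and the model predicts range from appearance and bounding-box features.

Result: The paper evaluates a Droneranger-style baseline under long-range conditions and introduces DroneDAR, which combines a convolutional backbone with bounding-box cues. It analyzes performance across distance regimes and identifies failure modes such as sensitivity to bounding-box noise and reduced texture detail in the crop; the abstract reports no quantitative results.

Insight: The novelty is fusing bounding-box geometry with a convolutional backbone through a lightweight gating mechanism to better handle small, distant targets. Viewed objectively, the analysis offers practical guidance for designing range estimators that stay robust under real-world long-range conditions, especially when the drone occupies only a few pixels.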

Abstract: Accurate distance estimation for small drones in long-range imagery is important for tracking and situational awareness, yet remains challenging due to extreme target scale variation, background clutter, and noisy visual cues. This paper studies monocular drone distance estimation using image crops together with bounding-box geometry, a practical setting in which a detector provides a candidate drone region and the model predicts range from appearance and box-derived features. We evaluate a Droneranger-style baseline, and introduce a new DroneDAR (Drone Detection And Ranging) model that combines a convolutional backbone with explicit bounding-box cues through a lightweight gating mechanism. Experiments analyze how backbone capacity, crop resolution, and regression loss functions affect performance across distance regimes. We further examine common failure modes at long distances, including sensitivity to bounding-box noise and reduced texture detail in the crop. The results provide guidance for designing and training range estimators that remain robust under real-world long-range conditions and highlight directions for improving reliability when drones occupy only a few pixels.


[95] DALE-CT: Depth-Aware Foundation Models for Computed Tomography cs.CV

Evan W. Damron, Mahmut S. Gokmen, Mitchell A. Klusty, Caroline N. Leach, Emily B. Collier

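TL;DR: This paper presents DALE-CT, a depth-aware foundation model for Computed Tomography (CT). The model uses a 2D slice-based architecture trained from scratch with the LeJEPA self-supervised learning framework, and introduces a novel 3D depth-aware pre-training strategy with dense auxiliary supervision from automated anatomical masks and human-annotated abnormalities.

Details

Motivation: Building on the success of self-supervised learning and of integrating visual encoders with language models, the paper aims to develop adaptable, high-capacity vision encoders for CT and explores 2D slice-based architectures as a flexible alternative for processing 3D volumetric data.

Result: On the CT-RATE dataset, the frozen backbone of the dual-supervised model (DALE-CT-2S) achieves a Macro AUROC of 0.833 under linear probe evaluation with Multiple Instance Learning for multi-abnormality detection. This is near parity with state-of-the-art 3D vision-language models, while using significantly less data and no textual supervision.

Insight: The novelty is a 3D depth-aware pre-training strategy that combines dense anatomical and abnormality supervision, allowing a purely 2D model to capture depth information in 3D medical images and to approach 3D models in performance with greater data efficiency.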

Abstract: Recent breakthroughs in self-supervised learning (SSL), such as the Latent-Euclidean Joint-Embedding Predictive Architecture (LeJEPA), alongside successes in integrating visual encoders with language models, have driven the demand for adaptable, high-capacity vision encoders in Computed Tomography (CT). In this work, we explore 2D slice-based architectures as a flexible alternative to native 3D models for processing volumetric CT data. Using the CT-RATE dataset, we trained DALE-CT (Depth-Aware Latent-Euclidean Computed Tomography), a 2D model family built entirely from scratch using LeJEPA, and compared its performance against a continually pre-trained DINOv2 baseline. To enhance representation quality, we developed a novel 3D depth-aware pre-training strategy anchored by dense auxiliary supervision from both automated anatomical masks and human-annotated abnormalities. Under linear probe evaluation with Multiple Instance Learning (MIL) for multi-abnormality detection, the frozen backbone of this dual-supervised model (DALE-CT-2S) achieves a Macro AUROC of 0.833. This performance demonstrates near-parity with state-of-the-art 3D vision-language models, achieved entirely from scratch with significantly less data and no textual supervision. To ensure reproducibility, all training code, evaluation scripts, and model weights have been made publicly available.
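
The evaluation described above aggregates slice-level outputs into a volume-level score and then reports a macro-averaged AUROC over abnormality classes. The sketch below shows one way such a computation could look. Max pooling as the MIL aggregator is an assumption (the abstract does not name the pooling rule), and the function names, labels, and scores are hypothetical.

```python
def mil_max_pool(slice_scores):
    # Assumed MIL aggregator: a volume is as abnormal as its most
    # abnormal slice.
    return max(slice_scores)


def auroc(labels, scores):
    # Probability that a random positive outranks a random negative,
    # counting ties as half a win.
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def macro_auroc(label_matrix, score_matrix):
    # Rows are volumes, columns are abnormality classes; average the
    # per-class AUROC with equal weight per class.
    num_classes = len(label_matrix[0])
    per_class = [
        auroc([row[c] for row in label_matrix], [row[c] for row in score_matrix])
        for c in range(num_classes)
    ]
    return sum(per_class) / num_classes


# Made-up volume-level labels and scores for two abnormality classes.
volume_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
volume_scores = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.3], [0.5, 0.6]]
```

Macro averaging weights rare and common abnormalities equally, so a single weak class lowers the headline number more than it would under micro averaging.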


[96] The Last Visible Pixel: Probing Fine-Scale Perception in Vision-Language Models cs.CV | cs.AI [PDF]

Lujun Li, Lama Sleem, Niccolo Gentile, Yangjie Xu, Yewei Song

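TL;DR: This paper introduces FineSightBench, a benchmark for systematically evaluating the limits of fine-grained visual perception and reasoning in vision-language models. It finds that the perception of current state-of-the-art models saturates at around 12 pixels, while reasoning that involves counting and ordering remains deficient even at larger scales.

Details

Motivation: The fine-grained visual perception of existing vision-language models remains underexplored. This paper investigates the smallest visual pattern a model can reliably perceive and evaluates perception and reasoning tasks separately.

Result: Tests of SOTA models on FineSightBench show that perception tasks saturate at a scale of about 12 pixels, while reasoning tasks (such as counting and ordering) show persistent errors even at larger scales, exposing fundamental deficiencies in fine-scale visual reasoning.

Insight: The contribution is a fine-scale evaluation benchmark that separates perception from reasoning. Its systematic experiments reveal a mismatch between the perception and reasoning abilities of VLMs in fine-grained visual understanding and underscore the need for more rigorous evaluation.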

Abstract: Recent vision-language models (VLMs) excel at multimodal understanding and reasoning, yet their fine-grained visual perception remains underexplored. A natural extension of “How many r are there in Strawberry?” asks: how small a visual pattern can a VLM reliably perceive? As such, we introduce FineSightBench, a new benchmark that systematically probes this limit by separating perception tasks (pixel-level recognition of letters, shapes, objects) from reasoning tasks (spatial reasoning, counting, ordering over small targets) across controlled scales of 4–48px. Through comprehensive experiments and detailed failure mode analysis on state-of-the-art models, we reveal a sharp dissociation: perception saturates around 12px, while reasoning remains limited even at larger scales, with persistent numeracy and sequence errors. These findings expose fundamental deficiencies in VLMs’ fine-scale visual reasoning that demand more rigorous evaluation.


[97] VisualFLIP: Do Predictions Depend on Task-Critical Visual Evidence in Multimodal Reasoning? cs.CV [PDF]

Didi Zhu, Changrui Chen, Stefanos Zafeiriou, Jiankang Deng

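TL;DR: This paper introduces VisualFLIP, a benchmark for evaluating whether the predictions of multimodal large language models in visual reasoning tasks actually depend on task-critical visual evidence. The benchmark contains 1,374 images arranged as same-question perturbation pairs covering cardinality, attribute, spatial, and logic tasks. Using two metrics, pair accuracy and Collapse Rate, the study finds that models often fail to update their predictions when the critical visual evidence changes, even when they can produce correct answers.

Details

Motivation: In visual reasoning tasks, accuracy alone cannot show whether an MLLM's predictions are grounded in task-critical visual evidence, because correct answers can coexist with flawed reasoning.

Result: Evaluating 24 MLLMs on VisualFLIP shows that paired correctness and evidence dependence are related but distinct: capable models can still fail to update their answers after task-critical visual evidence changes, and in the sequential setting the Collapse Rate of some models rises further.

Insight: The contribution is a benchmark built on perturbed image pairs (VisualFLIP) and two complementary metrics (pair accuracy and Collapse Rate) that quantify how much a model's predictions depend on critical visual evidence, exposing the disconnect between accuracy and genuinely grounded reasoning.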

Abstract: When a multimodal large language model answers a visual reasoning question correctly, is the prediction actually supported by the task-critical visual evidence? Correct answers can coexist with flawed reasoning, making accuracy alone an incomplete test of grounding. We introduce VisualFLIP, a paired benchmark with 1,374 images arranged as same-question perturbation pairs across cardinality, attribute, spatial, and logic tasks. Each pair keeps the question fixed but minimally changes the evidence so the gold answer deterministically flips. We evaluate 24 MLLMs with pair accuracy, which requires solving both sides of a pair, and Collapse Rate (CR), which measures how often a model that solves at least one side repeats the same non-empty answer for both images. Together, these metrics show that paired correctness and evidence dependence are related but distinct: capable models can still fail to update after task-critical visual changes, and collapse becomes more severe for some models when the edited image follows an earlier answer in a sequential setting. Further details are available on our project page: https://didizhu-judy.github.io/VisualFLIP/
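The two metrics described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the record layout `(pred_a, gold_a, pred_b, gold_b)` and the function names are assumptions, and the Collapse Rate denominator (pairs where at least one side is solved) is one reading of the abstract's wording.

```python
from typing import List, Tuple

# Hypothetical record layout: predictions and gold answers for both images of a pair.
Pair = Tuple[str, str, str, str]  # (pred_a, gold_a, pred_b, gold_b)


def pair_accuracy(pairs: List[Pair]) -> float:
    """Fraction of pairs where both sides are answered correctly."""
    if not pairs:
        return 0.0
    solved = sum(1 for pa, ga, pb, gb in pairs if pa == ga and pb == gb)
    return solved / len(pairs)


def collapse_rate(pairs: List[Pair]) -> float:
    """Among pairs with at least one side solved (assumed denominator),
    the fraction where the same non-empty answer is given for both images."""
    eligible = [p for p in pairs if p[0] == p[1] or p[2] == p[3]]
    if not eligible:
        return 0.0
    collapsed = sum(1 for pa, _, pb, _ in eligible if pa != "" and pa == pb)
    return collapsed / len(eligible)


example = [
    ("3", "3", "4", "4"),  # both sides solved
    ("3", "3", "3", "4"),  # one side solved, answer repeated: collapse
    ("5", "3", "5", "4"),  # neither side solved: not eligible for CR
    ("", "3", "", "4"),    # empty answers: not eligible for CR
]
print(pair_accuracy(example), collapse_rate(example))
```

Because the gold answer flips within a pair, a repeated answer can be correct on at most one side, so a collapsed pair never counts toward pair accuracy.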


[98] The Cross-Architecture Substrate: A Domain-Transcendent, Calibration-Surviving Geometric Invariant of Modern Vision Encoders cs.CV | cs.AI [PDF]

Yousef Radwan

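TL;DR: The study finds that, despite being trained on different tasks (classification, contrastive learning, reconstruction, image-text matching), the top 16 principal directions of variation in 13 modern vision encoders converge after training to the same 16-dimensional geometric object, called the cross-architecture substrate. The substrate transports across four visual domains (natural photographs, medical CT, satellite, microscopy) at a median of 0.679 and across eight domains (adding sketches, depth, thermal infrared, astronomy) at 0.604, and it survives Pang calibration. Four applications are built on it: a label-free transferability filter, a four-way domain detector, a frozen low-shot probe, and a teacher-free distillation auxiliary.

Details

Motivation: To test whether vision neural networks trained on different tasks really have different internal representations; the study finds instead that they share a common geometric structure.

Result: The cross-architecture substrate transports well across visual domains (median Procrustes-CKA of 0.679 across four domains and 0.604 across eight) and survives both global and local calibration. The substrate-based applications perform strongly: the label-free transferability filter beats LogME (3x faster, +0.15 Kendall-tau), the four-way domain detector reaches 99.6% accuracy, and the 16-dimensional low-shot probe beats 768-dimensional DINOv2 by 3.78 percentage points at 50 labels per class.

Insight: The contribution is the discovery that vision encoders share a low-dimensional geometric invariant (the cross-architecture substrate) across tasks and architectures, which challenges the assumption that different tasks should produce different internal representations. The substrate can serve as a strong general-purpose feature for transfer learning, domain detection, and low-shot learning, and it is computationally efficient.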

Abstract: Different vision neural networks – trained to classify, contrast, reconstruct, or match images to text – should have correspondingly different internal representations. We report that they do not. After training, the top sixteen principal directions of variation inside thirteen modern vision encoders converge to the same sixteen-dimensional geometric object. We call this the cross-architecture substrate and study it with PCA, centred kernel alignment (CKA), and Pang 2026 calibration. The substrate transports across four visual domains (natural photographs, medical CT, satellite, microscopy) at median Procrustes-CKA 0.679, and across eight domains (adding sketches, depth, thermal infrared, astronomy) at 0.604, every pair >0.40. It survives Pang calibration globally (7.4x disc-vs-MAE separation, n=13,394) and locally (4.82-5.30, p<10^{-44}). It is not pixel statistics (0.263), not Gabor features (0.31), not a random projection (0.041), and emerges in the first 10% of training while accuracy keeps climbing. We deliver four applications: a label-free transferability filter beating LogME (3x faster, +0.15 Kendall-tau); a four-way domain detector (99.6% accuracy); a frozen low-shot probe (16 dims beat 768-dim DINOv2 by 3.78pp at N=50 labels per class); and a teacher-free distillation auxiliary matching trained-teacher KD on 33 pairs (7.56pp peak gain at 10% label fraction). The substrate does not cross modalities, does not help cross-paradigm distillation, and does not predict transfer quality (rho=0.08 against transfer accuracy).
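As a rough illustration of the tools this abstract names (PCA and centred kernel alignment), the sketch below projects two feature matrices onto their top-k principal directions and compares them with standard linear CKA. It is not the paper's Procrustes-CKA, whose exact definition is not given here; the function names and the choice of linear CKA are assumptions.

```python
import numpy as np


def top_k_pca(features: np.ndarray, k: int) -> np.ndarray:
    """Project (n_samples, n_dims) features onto their top-k principal directions."""
    centred = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:k].T


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two representations of the same n samples."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return float(cross / (norm_x * norm_y))


# Toy check: a rotated and rescaled copy of the same features scores close to 1.
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(200, 32))
rotation, _ = np.linalg.qr(rng.normal(size=(32, 32)))
feats_b = 3.0 * (feats_a @ rotation)
print(linear_cka(top_k_pca(feats_a, 16), top_k_pca(feats_b, 16)))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling, which is why it suits comparisons between encoders whose feature axes have no shared meaning.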


[99] TBD-VLA: Temporal Block Diffusion Vision Language Action Model cs.CV | cs.RO [PDF]

Sung-Wook Lee, Xuhui Kang, Yen-Ling Kuo

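TL;DR: This paper presents TBD-VLA, a discrete token-based vision-language-action (VLA) framework that uses block diffusion to generate temporally structured action sequences. It partitions action sequences into temporal blocks and performs masked discrete diffusion within each block while keeping generation autoregressive across blocks, unifying temporal autoregression and parallel action decoding for better temporal coherence and faster inference.

Details

Motivation: Existing discrete VLA models typically formulate action generation as next-token prediction over a discretized action space, conditioning autoregressively on prior context; this incurs high inference latency and ignores the temporal structure inherent in action trajectories. Recent work introduces parallel decoding to improve efficiency but lacks explicit mechanisms for modeling token dependencies.

Result: TBD-VLA significantly outperforms prior VLA approaches on both simulated and real-world manipulation tasks, offering a scalable path toward fast, temporally aware discrete VLA models.

Insight: The core contribution is bringing block diffusion into a discrete VLA framework: masked diffusion within blocks explicitly models the temporal structure of action sequences, while autoregression across blocks preserves long-range dependencies. The design combines temporal coherence with inference efficiency and supports asynchronous execution of action chunks (e.g., Real-Time Chunking) via temporal in-painting.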

Abstract: Discrete Vision-Language-Action (VLA) models typically formulate action generation as next-token prediction over discretized action spaces, conditioning each token autoregressively on prior context. While effective, this paradigm incurs high inference latency and largely ignores the temporal structure inherent in action trajectories. Recent efforts introduce parallel decoding to improve efficiency, enabling faster inference, but lack explicit mechanisms for modeling token dependencies. We introduce TBD-VLA, a discrete token-based VLA framework that incorporates block diffusion to enable temporal action generation. We partition action sequences into temporal blocks and perform masked discrete diffusion within each block, while maintaining autoregressive generation across blocks. This design unifies temporal autoregression and parallel action decoding, achieving both strong temporal coherence and improved inference speed. In addition, the explicit temporal modeling enables asynchronous execution of action chunks (e.g., Real-Time Chunking) via temporal in-painting. TBD-VLA significantly outperforms prior VLA approaches in both simulation and real-world manipulation tasks, offering a scalable path toward fast, temporally aware, discrete VLA models. Project webpage: https://tbd-vla.github.io/
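The decoding pattern described in this abstract (parallel masked decoding inside a block, autoregression across blocks) can be sketched with a toy stand-in for the network. Everything here is hypothetical: `toy_logits` replaces the real model with random logits, `MASK` is an invented token id, and the confidence-based unmasking schedule is a common choice for masked discrete diffusion rather than the schedule the paper uses.

```python
import numpy as np

MASK = -1  # hypothetical mask token id


def toy_logits(context, block, vocab_size, rng):
    """Stand-in for the VLA network; ignores its inputs and returns random logits
    of shape (len(block), vocab_size)."""
    return rng.normal(size=(len(block), vocab_size))


def decode_block(context, block_len, vocab_size, steps, rng):
    """Fill one block by iteratively unmasking the most confident positions."""
    block = np.full(block_len, MASK, dtype=int)
    for step in range(steps):
        masked = np.flatnonzero(block == MASK)
        if masked.size == 0:
            break
        logits = toy_logits(context, block, vocab_size, rng)
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        tokens = probs.argmax(axis=1)
        confidence = probs.max(axis=1)
        # Unmask an even share of the remaining positions, most confident first.
        n_unmask = int(np.ceil(masked.size / (steps - step)))
        chosen = masked[np.argsort(-confidence[masked])[:n_unmask]]
        block[chosen] = tokens[chosen]
    return block


def generate_actions(num_blocks, block_len, vocab_size=256, steps=4, seed=0):
    """Autoregressive across blocks: each block is conditioned on earlier ones."""
    rng = np.random.default_rng(seed)
    context = []
    for _ in range(num_blocks):
        block = decode_block(context, block_len, vocab_size, steps, rng)
        context.extend(block.tolist())
    return context


actions = generate_actions(num_blocks=3, block_len=8)
print(len(actions))
```

With `steps` smaller than `block_len`, each block needs fewer network calls than token-by-token decoding, which is where the inference speedup of this kind of design comes from.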


[100] C3VD-DEFCOL: A Deformable Colonoscopy Dataset with Time-Resolved 3D Ground Truth and Realistic Appearance cs.CV [PDF]

Ethan Luk, Mayank V. Golhar, Anthony Song, Raúl Iranzo, Víctor M. Batlle

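TL;DR: This paper presents C3VD-DEFCOL, a dataset and framework for evaluating deformable colonoscopy 3D reconstruction. By simulating non-rigid deformations of the colon (such as peristaltic waves) and applying an LTX-2.3-based sim-to-real translation model, it produces paired video sequences with realistic in vivo appearance and dense, time-resolved 3D ground truth (depth, surface normals, optical flow, camera poses, and time-stamped 3D meshes).

Details

Motivation: No current dataset provides both realistic in vivo appearance and dense, time-resolved 3D ground truth, especially under non-rigid deformation, which limits the development of colonoscopy 3D reconstruction algorithms. This paper fills that gap with a reproducible, quantitative evaluation platform for deformable 3D reconstruction.

Result: The dataset contains 110 videos from 11 unique colon meshes, with varying camera trajectories, appearances, and parameterized deformation regimes (including three peristaltic severity levels). Experiments on the downstream task of pose estimation in deformable 3D reconstruction show that estimation error increases with deformation severity, providing a controlled stress test that existing in vivo datasets cannot offer.

Insight: The core contribution is a dataset generation framework that combines controllable geometric deformation with highly realistic appearance rendering. Its parameterized deformation mechanism and diffusion-based sim-to-real translation narrow the domain gap between synthetic data and real colonoscopy video, giving a new tool for evaluating algorithm robustness in complex non-rigid scenes.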

Abstract: 3D reconstruction could improve colonoscopy by estimating mucosal coverage and alerting clinicians to missed regions during screening. However, algorithm development is limited as no current datasets provide both a realistic in vivo appearance and dense, time-resolved 3D ground truth, especially under non-rigid deformation. We present C3VD-DEFCOL, a framework and dataset for evaluating deformable colonoscopy reconstruction with paired geometry and realistic texture. Starting from C3VD/C3VDv2 colon meshes and camera trajectories, we generate controlled deformations of the colon surface, including peristaltic waves and centerline motion, and render per-frame depth, surface normals, optical flow, camera poses, and time-stamped 3D meshes. We then use the rendered geometry, primarily depth, to condition an LTX-2.3-based sim-to-real translation model that produces RGB clips with in vivo-like mucosal color, texture, vasculature, and specular appearance while preserving the underlying 3D scene structure. The resulting dataset contains 110 videos from 11 unique colon mesh geometries, with varying camera trajectories, appearances, and parameterized deformation regimes, including three peristaltic severity levels that serve as controlled evaluation axes. We evaluate the generated videos using appearance realism, geometric consistency, and temporal consistency metrics, and use the paired ground truth to benchmark the downstream task of pose estimation in deformable 3D reconstruction. Our experiments show how pose estimation error increases with increasing deformation severity, providing a controlled stress test that is not possible with existing in vivo datasets. Overall, C3VD-DEFCOL is designed as a reproducible, quantitative evaluation platform for testing deformable 3D reconstruction algorithms, with the goal of reducing the domain gap between synthetic datasets and in vivo colonoscopy.


[101] REACT 2026: The Fourth Multiple Appropriate Facial Reaction Generation Challenge: Personalised MAFRG and Appropriate EEG Reaction Prediction cs.CV [PDF]

Siyang Song, Micol Spitale, Zijian Wu, Xiangyu Kong, Cheng Luo

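TL;DR: REACT 2026 is the fourth Multiple Appropriate Facial Reaction Generation challenge, following the 2023, 2024, and 2025 editions. It aims to advance the development and benchmarking of machine learning models that generate personalised, appropriate, diverse, realistic, and synchronised human-style facial reactions for a specific listener.

Details

Motivation: In dyadic interactions, a listener may have multiple appropriate facial reactions to each speaker behaviour. Personalised facial reaction generation that combines human expressive behavioural, affective, and neurophysiological signals remains largely unexplored.

Result: Building on the MARS dataset, the challenge adds individual-level Big-Five personality labels and EEG recordings, and provides new baselines for four sub-challenges: offline and online, generic and personalised MAFRG.

Insight: The contribution is a new one-to-many personalised facial reaction generation setting that combines behavioural, affective, and neurophysiological signals, opening a new research direction for dyadic interaction modelling.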

Abstract: In dyadic interactions, various human facial reactions could be appropriate for responding to each human speaker behaviour. Following the successful organisation of the REACT 2023, 2024 and 2025 challenge series, a body of generative deep learning (DL) models have been developed for the problem of multiple appropriate facial reaction generation (MAFRG). This year, we propose the REACT 2026 challenge encouraging the development and benchmarking of Machine Learning (ML) models that can generate multiple personalised, appropriate, diverse, realistic and synchronised human-style facial reactions expressed by a specific human listener for responding to each given speaker behaviour. As a key of the challenge, we continuously provide challenge participants with MARS dataset introduced by REACT 2025 but additionally provide individual-level Big-Five personality labels and EEG recordings. This introduces a new one-to-many personalised facial reaction generation setting combining human expressive behavioural, affective and neurophysiological signals, which remains largely unexplored in current dyadic interaction modelling. This paper also presents the challenge guidelines and new baselines on the four proposed sub-challenges: Offline generic and personalised MAFRG as well as Online generic and personalised MAFRG, respectively, which are publicly available at https://github.com/reactmultimodalchallenge/baseline_react2026.


[102] Decoupling Semantics and Logic: A Training-Free Coarse-to-Fine Pipeline for Video Retrieval-Augmented Generation cs.CV | cs.AI | cs.CL | cs.LG | cs.MM [PDF]

Jiaxin Dai, Zehang Wei, Jiamin Yan, Xiang Xiang

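TL;DR: This paper proposes a training-free, two-stage cascaded video retrieval-augmented generation (Video RAG) pipeline that addresses cross-lingual long-video comprehension, strict persona adherence, and zero-hallucination temporal grounding. Through a modality-aware division of labor, the architecture decouples semantic retrieval from cognitive logical reasoning: it first performs high-recall semantic pre-fetching using high-fidelity visual summaries and global text descriptions, then applies fine-grained cognitive reranking with an LLM-based A.I.R. filtering agent, and finally generates strictly formatted JSON responses through a Prompt Sculpting mechanism.

Details

Motivation: To address the key challenges of cross-lingual long-video comprehension, strict persona adherence, and zero-hallucination temporal grounding, the authors aim to build a training-free Video RAG system that handles multimodal noise efficiently and ensures logical consistency.

Result: Evaluated on the RAG track of the MAGMaR workshop, the method shows exceptional precision in both information retrieval and persona-conditioned generation, demonstrating the effectiveness of its resource-aware design.

Insight: Contributions include a modality-aware division of labor that decouples semantic retrieval from logical reasoning; retrieval over high-fidelity visual summaries only, which isolates noisy modalities and keeps the vector space clean; an LLM-based A.I.R. filtering agent for cognitive reranking; and a Prompt Sculpting mechanism that enforces strictly formatted structured output.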

Abstract: This paper presents our system description for the 2nd Workshop on Multimodal Augmented Generation via MultimodAl Retrieval (MAGMaR). Addressing the critical challenges of cross-lingual long-video comprehension, strict persona adherence, and zero-hallucination temporal grounding, we propose a fully training-free, two-stage cascaded Video RAG pipeline. Our architecture strategically decouples semantic retrieval from cognitive logical reasoning through a modality-aware division of labor. In the first stage, a high-recall semantic pre-fetching module employs dense retrieval using only high-fidelity visual summaries and global text descriptions, explicitly isolating noisy modalities (e.g., OCR and ASR) to maintain a pristine vector space. In the second stage, an Adaptive, Iterative, and Reasoning-based (A.I.R.) filtering agent, powered by a commercial Large Language Model (LLM), performs fine-grained cognitive reranking. The agent re-incorporates full multimodal contexts to enforce strict logical alignment with user personas, effectively pruning semantically similar but logically irrelevant candidates. Finally, a Prompt Sculpting mechanism constrains the generator to synthesize the distilled subset into strictly formatted JSON responses with exact chunk-level citations. Evaluated on the RAG track, our resource-aware approach shows exceptional precision in both information retrieval and persona-conditioned generation.


[103] DAL-PCQA: Enabling Distortion-Level and Language-Driven Reasoning for Point Cloud Quality Assessment cs.CV | cs.MM | eess.IV [PDF]

Swarna Chakraborty, Gabriel De Castro Araújo, Syeda Tasmi Faria, Marcelo M. Carvalho, Mylene C. Q. Farias

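TL;DR: This paper introduces DAL-PCQA, a dataset for point cloud quality assessment (PCQA) with distortion-level and language annotations. It augments benchmark point clouds with multi-level distortion severity labels, discrete quality categories, and structured natural language descriptions aligned with human perception, addressing the limitation that conventional PCQA methods predict only an overall Mean Opinion Score (MOS) without revealing the specific causes of distortion.

Details

Motivation: Conventional PCQA methods typically predict only a scalar Mean Opinion Score (MOS), which quantifies overall perceptual degradation but does not reveal its causes (such as blur, color shifts, or point density changes). Human observers naturally reason in terms of specific distortion types, so a dataset that supports distortion-level, language-driven reasoning is needed to close this gap.

Result: Statistical analysis reveals characteristic degradation patterns across distortion types and quality levels. Experiments show that distortion-aware supervision substantially improves the lexical and semantic alignment between the quality descriptions generated by multimodal models and the ground-truth descriptions.

Insight: The contribution is a point-cloud-specific distortion taxonomy covering photometric (e.g., color) and geometric (e.g., shape) artifacts, together with the first PCQA dataset to support distortion-level, language-driven reasoning. This lays the groundwork for interpretable, language-based point cloud quality assessment, letting models reason about specific distortions as humans do rather than only output an overall score.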

Abstract: Point Cloud Quality Assessment (PCQA) methods typically predict scalar Mean Opinion Scores (MOS), which quantify overall perceptual degradation but do not reveal its causes. In contrast, human observers naturally reason in terms of specific distortions such as blur, color shifts, point density changes, missing regions, and geometric deformations. To close this gap, we introduce DAL-PCQA, a distortion-aware, language-annotated dataset for PCQA. DAL-PCQA augments benchmark point clouds with multi-level distortion severity labels, discrete quality categories, and structured natural language descriptions aligned with human perception. We define a point-cloud-specific distortion taxonomy that covers both photometric and geometric artifacts. Statistical analysis reveals characteristic degradation patterns across distortion types and quality levels. To assess the utility of these annotations, we compare zero-shot and fine-tuned multimodal models for generating perceptual quality descriptions. Experiments show that distortion-aware supervision substantially improves lexical and semantic alignment with ground-truth descriptions. By enabling interpretable, distortion-level reasoning, DAL-PCQA facilitates language-driven, explainable point cloud quality assessment. The dataset is publicly available at https://github.com/swarna96/DAL-PCQA.


[104] DisCo: World Models with Discrete Camera Motion Control cs.CV [PDF]

Hongrui Huang, Junke Wang, Quanhao Li, Yu-Gang Jiang, Zuxuan Wu

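TL;DR: This paper proposes DisCo, a controllable video world model based on discrete action primitives, which addresses the unreliable action following caused by using continuous camera trajectories as action conditions. It introduces a discrete action representation to improve action separability and builds the DisCoBench benchmark to evaluate models in short-term, long-horizon, and highly dynamic exploration scenarios. Experiments show that DisCo achieves more reliable action following while preserving visual quality.

Details

Motivation: Existing controllable video world models usually rely on continuous camera trajectories as action conditions, which tends to make action following unreliable under complex motion sequences; the core bottleneck is entanglement in the action representation.

Result: In extensive experiments on the proposed DisCoBench benchmark, DisCo achieves significantly more reliable action following in short-term, long-horizon, and dynamic exploration scenarios while preserving visual quality.

Insight: The contribution is identifying that continuous camera representations produce overly high feature similarity across distinct motion patterns, and proposing discrete action primitives to improve action separability. More generally, discretizing continuous actions to reduce representation entanglement is an effective structural inductive bias.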

Abstract: Controllable video world models target interactive world exploration, where models must faithfully execute explicit action commands while preserving visual quality and temporal coherence. However, most existing approaches rely on continuous camera trajectories as action conditions, which often lead to unreliable action following, especially under complex motion sequences. In this work, we identify action representation entanglement as a key bottleneck in controllable video generation, and show that continuous camera representations lead to high feature similarity across distinct motion patterns, degrading action controllability. Based on this insight, we propose DisCo, a controllable video world model that conditions generation on a compact set of discrete action primitives to improve action separability. We further introduce DisCoBench, a comprehensive benchmark for evaluating the ability of models in short-term, long-horizon, and highly dynamic exploration scenarios. Extensive experiments demonstrate that DisCo achieves significantly more reliable action following while preserving visual quality.


[105] ChronoPhyBench: Do MLLMs Truly Understand the World or Merely Exploit Language Priors? cs.CV [PDF]

Bin Zhu, Yanhao Jia, Kexin Zhao, Jie Wang, Munan Ning

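TL;DR: This paper proposes ChronoPhyBench, a new benchmark for multimodal chronological physical dynamics reasoning that rigorously evaluates whether multimodal large language models (MLLMs) truly understand the physical world or merely exploit language priors. By conditioning on historical video context and textual captions, the benchmark unifies next-state prediction with visual question answering and forces models to reason through single-image selection and multi-frame chronological sorting. The authors also construct a large-scale dataset of over 10,000 long-form videos with 5M annotated tokens. Experiments show that the physically grounded multimodal reasoning of current open-source models is still at an early stage.

Details

Motivation: Current MLLMs perform well at open-world reasoning, but it is unclear whether they genuinely synthesize cross-modal information for physically grounded reasoning or merely exploit strong language priors to mask reliance on a single modality, thereby hallucinating multimodal capability. The authors propose this benchmark to rigorously mitigate language-modality bias and shortcuts.

Result: The experimental evaluation contrasts sharply with conclusions drawn from previous benchmarks: the ability of current open-source models to perform physically grounded multimodal reasoning remains in its infancy. The work aims to systematically stress-test the reasoning capabilities of multimodal models and quantify hallucination rates.

Insight: The main contribution is a new benchmark paradigm that unifies next-state prediction and visual question answering, evaluating physical reasoning by forcing models to perform single-image selection and multi-frame chronological sorting. Through its dataset and task design, the benchmark isolates the influence of language priors and provides a more transparent and robust framework for assessing whether models truly understand physical dynamics, in support of Physical AI and Artificial General Intelligence.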

Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in open-world reasoning and understanding. However, a critical ambiguity persists: it remains unclear whether these models genuinely synthesize cross-modal information to construct physically grounded reasoning chains, or if they merely exploit strong language priors to mask single-modality reliance, thereby hallucinating advanced multimodal capabilities. Motivated by this, and to rigorously mitigate language modality bias and shortcuts, we propose a novel multimodal Chronological Physical Dynamics Reasoning Benchmark (ChronoPhyBench), which unifies next state prediction with Visual Question Answering (VQA) paradigms by conditioning on historical video context and textual captions to enforce models to deduce subsequent physical states through both single image selection and the inherently more complex task of multiple frame chronological sorting. Concurrently, we construct a large-scale multimodal reasoning dataset curated using the ChronoPhyBench criteria, comprising over 10,000 long-form videos paired with meticulously annotated captions, totaling 5M tokens. Our experimental evaluations reveal a stark contrast to conclusions drawn by previous benchmarks. The capacity of current open-source models to perform physically grounded multimodal reasoning remains in its infancy. Ultimately, this work seeks to systematically stress-test the reasoning capabilities of multimodal models, quantify hallucination rates, and advance the development of Physical AI, thereby providing the community with a robust and transparent evaluation framework toward Artificial General Intelligence (AGI).


[106] Aqua Boundary-Saliency Attention Module for Lightweight Underwater Salient Instance Segmentation Detection Transformer cs.CV [PDF]

M. Fazri Nizar, Julian Supardi, Muhammad Naufal Rachmatullah

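TL;DR: This paper proposes the Lightweight Underwater Salient Instance Segmentation Detection Transformer (LUSIS-DETR) and its core module, the Aqua Boundary-Saliency Attention Module (AquaBSAM). The framework embeds several underwater prior cues, such as boundary, contrast, and attenuation, into DINOv2-initialized multi-scale features and combines them with training strategies such as auxiliary mask supervision and small-object copy-paste, aiming for efficient, high-quality underwater instance segmentation.

Details

Motivation: Existing prompt-based or auxiliary-modality underwater instance segmentation methods improve mask quality but rely on large foundation models, prompt generation, or extra modality estimation, which makes deployment inefficient. This paper aims to design a compact, efficient framework that handles the segmentation challenges of complex underwater conditions while remaining lightweight and supporting real-time inference.

Result: On four underwater instance segmentation datasets (UIIS, UIIS10K, USIS10K, and USIS16K), the method achieves competitively leading performance against previous state-of-the-art work under both category-aware and salient-instance protocols. Benchmarked with TensorRT FP16 on an NVIDIA T4 GPU, its latency is 4.31-6.34 ms, supporting real-time inference.

Insight: The core contribution is the AquaBSAM module, which uses bounded residual modulation to embed multiple underwater-specific prior cues (boundary, contrast, attenuation, chroma, dark-channel, and center-prior) into visual features to counter underwater image degradation. The framework is also lightweight: auxiliary supervision and data augmentation are used only during training, improving performance without adding inference overhead and balancing accuracy and efficiency.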

Abstract: Underwater instance segmentation integrates pixel-level mask prediction and instance-level discrimination for marine resource exploration, ecological monitoring, and underwater robotic perception. Recent prompt-based and auxiliary-modality methods improve mask quality, but their reliance on large foundation models, prompt generation, or extra modality estimation complicates efficient deployment. This work introduces Lightweight Underwater Salient Instance Segmentation Detection Transformer (LUSIS-DETR), a compact detection-transformer framework built around the Aqua Boundary-Saliency Attention Module (AquaBSAM). AquaBSAM embeds underwater boundary, contrast, attenuation, chroma, dark-channel, and center-prior cues into DINOv2-initialized multi-scale features through bounded residual modulation, while auxiliary mask supervision and small-object copy-paste are training-only. Extensive evaluation on four recent underwater instance segmentation datasets, UIIS, UIIS10K, USIS10K, and USIS16K, shows competitively leading performance against previous state-of-the-art works across category-aware and salient-instance protocols. TensorRT half-precision (FP16) benchmarking on an NVIDIA T4 graphics processing unit (GPU) achieves 4.31-6.34 milliseconds (ms) latency, supporting real-time inference under an accessible reproduction setting.


[107] GVC-Seg: Training-Free 3D Instance Segmentation via Geometric Visual Correspondence cs.CV | cs.AI [PDF]

Liang Xu, Fangjing Wang, Jinyu Yang, Feng Zheng

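TL;DR: This paper proposes GVC-Seg, a training-free 3D instance segmentation method that exploits the correspondence between 3D geometric cues and 2D visual cues to mitigate confidence bias across segmentation models. It includes a 3D proposal generation module and a mask-aware CLIP feature extraction module to improve proposal quality assessment and enable unbiased ensemble learning across models.

Details

Motivation: Existing methods rely on multiple pre-trained foundation models to generate 3D proposals and then aggregate them, but inherent differences in confidence across segmentation models bias the result toward the higher-confidence model. This bias is influenced by factors such as data preprocessing and training strategy, so the model-dependent confidence bias needs to be addressed.

Result: The method achieves state-of-the-art performance on several challenging benchmarks and shows strong potential in open-vocabulary semantic segmentation settings.

Insight: The contribution is using geometric-visual correspondence to mitigate confidence bias without any training. In addition, the mask-aware CLIP feature extraction module strengthens instance semantic reasoning, offering a new approach to unbiased ensemble learning across models.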

Abstract: Accurate 3D instance segmentation in point cloud data is critical for machine vision applications. Recent advancements leverage multiple pre-trained foundation models to generate 3D proposals, followed by the application of proposal aggregation methods, which significantly enhance performance. However, they often produce sub-optimal results due to inherent variations in confidence levels across different segmentation models, resulting in a bias toward the model with higher confidence. This bias is inherently model-dependent and is influenced by factors such as data preprocessing techniques and training strategies. To address this bias, we propose a novel, training-free 3D instance segmentation approach via Geometric Visual Correspondence (GVC-Seg), which exploits the correspondence between 3D geometric cues and 2D visual cues to mitigate the confidence bias. Additionally, a 3D proposal generation module and a mask-aware CLIP feature extraction module are introduced during the instance mask generation and instance semantic reasoning, respectively. In this way, GVC-Seg enhances proposal quality assessment, ensuring unbiased ensemble learning across different models. Extensive experiments demonstrate that our method achieves state-of-the-art performance on several challenging benchmarks, while also exhibiting strong potential in open-vocabulary semantic segmentation settings.


[108] IEA: Amateur-Friendly Conversational Image Editing Agent via Three Stages of Multitask Alignment cs.CV | cs.AI | cs.CL [PDF]

Zichen Zhu, Yuheng Sun, Mingxuan Zhu, Wenjie Ma, Situo Zhang

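TL;DR: This paper proposes IEA, a conversational image editing agent for amateur users that learns to operate parameterized tools through three stages of multitask alignment, edits images step by step in an interpretable action space, and produces transparent edit traces.

Details

Motivation: Existing image editing software relies on fixed filters or expert tuning, leaving a gap between amateur users' intent and outcomes; edits by generative models may contain artifacts, implausible details, or stylistic drift, and offer little interpretability of the editing process.

Result: In quantitative experiments, IEA attains a lower pixel distance on the edit task and a higher ROUGE-L on the summary task; user studies show that it ranks best among tool-calling methods for instruction following and surpasses generative methods in overall perceptual quality.

Insight: The contribution is a three-stage multitask alignment pipeline (SFT, GRPO reward optimization, and large-scale synthetic fine-tuning) that jointly learns image editing, refinement, and user intent summarization, using interpretable tool operations to make the editing process transparent and debuggable. The results support interpretable, tool-centric vision-language models as a reliable path to human instruction-guided image retouching.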

Abstract: Current image editing software often hinges on fixed filters or expert tuning, leaving a gap between amateur users’ intent and outcomes. Creations by generative models may contain artifacts, implausible details, or stylistic drift away from photorealism and offer little insight into why an edit was made. We propose IEA, a conversational Image Editing Agent that learns to operate parameterized tools in an explicit, interpretable action space. IEA is trained via a three-stage multitask pipeline: (1) SFT on distilled expert edits, (2) GRPO with rewards for likeness improvement, tool usefulness, and intent summarization, and (3) large-scale synthetic fine-tuning to jointly master image editing, refinement, and user intent summarization. By manipulating 16 editing tools step by step, IEA produces transparent edit traces that can be inspected and debugged. In quantitative experiments, it attains a lower pixel distance on the edit task and a higher ROUGE-L on the summary task than strong baselines. In user studies, it ranks best among tool-calling methods for instruction following while surpassing generative methods in overall perceptual quality. Our results validate interpretable, tool-centric VLMs as a reliable path to human instruction-guided image retouching.


[109] Sci-Rho: A Multilingual Visually-Grounded Symbolic Benchmark for STEM Problems cs.CV | cs.AI | cs.CL [PDF]

Muhammad Falensi Azmi, Ikhlasul Akmal Hanif, Vallerie Alexandra Putra, Adi Yeltay, Abdullah Mubarak

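TL;DR: This paper introduces Sci-Rho, a multilingual, visually grounded symbolic benchmark for evaluating the robustness of vision-language models on STEM problems. It spans five subjects and seven languages and comprises 4,242 problem templates, each implemented as Python code, that generate 42,420 diverse instances in total, each paired with reasoning steps and a ground-truth answer. The study evaluates 17 state-of-the-art VLMs and reveals a notable gap between worst-case and average accuracy, as well as performance differences across languages.

Details

Motivation: Existing symbolic benchmarks are mostly limited to mathematical reasoning, lack visual grounding, and are predominantly in English, so they cannot fully assess model robustness on STEM problems. The authors therefore build a multilingual, visually grounded dynamic benchmark to test robustness to problem variations more systematically.

Result: Evaluating 17 SOTA VLMs on Sci-Rho reveals a clear gap between worst-case accuracy (the proportion of problem templates a model answers correctly across every generated variation) and average accuracy. Smaller models degrade noticeably across languages, whereas proprietary and larger models remain robust; step-level evaluation likewise shows a significant gap between average F1 and worst-case F1.

Insight: The contribution is the first multilingual, visually grounded dynamic STEM benchmark, which generates diverse problem instances from executable code and enables fine-grained evaluation of model robustness. The study also finds substantial cross-lingual variation in the relative attention a VLM allocates to image tokens versus text tokens, underscoring the importance of evaluation beyond static benchmarks.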

Abstract: Symbolic benchmarks have emerged as a key approach to assess model robustness under minor modifications to STEM-related questions. However, existing symbolic benchmarks mostly remain limited to mathematical reasoning, lack visual grounding, and are predominantly in English. In this work, we introduce Sci-Rho (Science Rhobustness), a dynamic benchmark for visually-grounded STEM problems spanning five subjects and seven languages, comprising 4,242 problem templates (606 per language) crafted by domain experts, including Olympiad medalists. Each template is implemented as executable Python code that generates diverse but equivalent problem instances by varying numerical values, visual patterns, geometric shapes, color schemes, and function types, resulting in 42,420 instances in total, each paired with reasoning steps and ground-truth solutions. We evaluated 17 state-of-the-art VLMs and discovered a noticeable gap between worst-case accuracy (defined as the proportion of problem templates that a model answers correctly across every generated variation) and average accuracy. We also discovered that smaller models show noticeable performance degradation across languages, whereas proprietary and larger models remain robust. Step-level evaluation reflects this same trend, revealing a significant gap between average F1 and worst-case F1 scores. Finally, our inspection of attention heads of a VLM reveals substantial cross-lingual variation in the relative attention allocated to image tokens compared to text tokens. Our work highlights the importance of evaluation beyond static benchmarks as a metric to measure the quality of VLMs.
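The distinction between average and worst-case accuracy defined in this abstract is easy to make concrete. The sketch below assumes a hypothetical flat list of `(template_id, is_correct)` results, one per generated instance; it is an illustration of the definition, not the benchmark's scoring script.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def average_and_worst_case(results: Iterable[Tuple[str, bool]]) -> Tuple[float, float]:
    """Return (average accuracy over instances,
    fraction of templates answered correctly on every instance)."""
    by_template = defaultdict(list)
    for template_id, is_correct in results:
        by_template[template_id].append(bool(is_correct))
    if not by_template:
        return 0.0, 0.0
    total = sum(len(outcomes) for outcomes in by_template.values())
    average = sum(sum(outcomes) for outcomes in by_template.values()) / total
    worst_case = sum(all(outcomes) for outcomes in by_template.values()) / len(by_template)
    return average, worst_case


toy_results = [
    ("t1", True), ("t1", True),   # template solved on every variation
    ("t2", True), ("t2", False),  # one failed variation breaks the template
]
print(average_and_worst_case(toy_results))
```

With 42,420 instances over 4,242 templates (10 per template), a single failed variation removes a template from the worst-case count while lowering average accuracy only slightly, which is why the two numbers can diverge.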


[110] Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding? cs.CV | cs.AI | cs.CL [PDF]

Jiaqi Tang, Jianmin Chen, Youyang Zhai, Wei Wei, Runtao Liu

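TL;DR: This paper proposes Robust-U1, a framework that gives multimodal large language models (MLLMs) an explicit visual self-recovery capability to improve robust understanding when visual content is corrupted. It has three stages: supervised fine-tuning, reinforcement learning with dual pixel-level and semantic-level rewards, and multimodal reasoning that jointly considers the corrupted input and the recovered image. Experiments show state-of-the-art robustness on a real-world corruption benchmark and superior performance under adversarial corruptions on general VQA benchmarks.

Details

Motivation: The performance of existing MLLMs degrades significantly under real-world visual corruptions, and existing robustness enhancement methods are limited: black-box feature alignment lacks interpretability, and white-box text-based reasoning cannot restore lost pixel-level details. This paper asks whether MLLMs can recover corrupted visual content by themselves and thereby improve robust understanding.

Result: Robust-U1 achieves state-of-the-art (SOTA) robustness on the real-world corruption benchmark and maintains superior performance under adversarial corruptions on general visual question answering (VQA) benchmarks. Analysis confirms that high-quality visual recovery directly improves reasoning performance.

Insight: The core contribution is giving MLLMs an explicit visual self-recovery capability: supervised fine-tuning for initial reconstruction, reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) to align for high visual quality, and joint multimodal reasoning. This establishes self-recovery as a key mechanism for robust visual understanding and offers an enhancement path that combines interpretability with detail recovery.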

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions. While existing robustness enhancement approaches exist, they are limited: black-box feature alignment lacks interpretability, and white-box text-based reasoning cannot restore lost pixel-level details. This work investigates a fundamental research question: Can MLLMs recover corrupted visual content by themselves? To address this, we propose Robust-U1, a novel framework that equips MLLMs with explicit visual self-recovery capability for robust understanding. The approach comprises three core stages: supervised fine-tuning for initial reconstruction, reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) for aligning high visual quality, and multimodal reasoning that jointly considers both the corrupted input and the recovered image. Extensive experiments demonstrate that Robust-U1 achieves state-of-the-art robustness on the real-world corruption benchmark and maintains superior performance under adversarial corruptions on general VQA benchmarks. Analysis confirms that high-quality visual recovery directly enhances reasoning performance, establishing self-recovery as a critical mechanism for robust visual understanding. The source code is available at https://github.com/jqtangust/Robust-U1.
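A dual reward of the kind this abstract describes (pixel-level SSIM plus semantic-level embedding similarity) might be combined as in the sketch below. Several things here are assumptions rather than the paper's method: SSIM is computed over the whole image in one window instead of the usual sliding windows, the semantic term takes precomputed embedding vectors (for example from a CLIP image encoder) as plain arrays, and the equal weights are placeholders.

```python
import numpy as np


def global_ssim(x, y, data_range: float = 1.0) -> float:
    """Single-window SSIM over whole images (a simplification of windowed SSIM)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def dual_reward(recovered, clean, emb_recovered, emb_clean,
                w_pixel: float = 0.5, w_semantic: float = 0.5) -> float:
    """Weighted sum of pixel-level and semantic-level terms (weights are hypothetical)."""
    pixel = global_ssim(recovered, clean)
    semantic = cosine_similarity(emb_recovered, emb_clean)
    return w_pixel * pixel + w_semantic * semantic


rng = np.random.default_rng(0)
clean_img = rng.random((8, 8))
clean_emb = rng.normal(size=16)
print(dual_reward(clean_img, clean_img, clean_emb, clean_emb))
```

The two terms are complementary: a recovery can match pixel statistics while changing what the image depicts, or preserve the semantics while losing fine detail, and a combined reward penalises both failure modes.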


[111] Vision-Language Asymmetry in Bistable Image Captioning cs.CV [PDF]

Arohan Agate

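TL;DR: This paper studies the internal mechanisms of vision-language models on bistable images (such as the duck-rabbit), asking where a model commits to a single aspect when generating a caption from its visual representation. It builds a behavioral baseline of 3,320 generations over 83 bistable stimuli, identifying three response regimes under neutral versus forced-choice prompting, and probes the internal representations of LLaVA-1.6-7B with a TopK sparse autoencoder trained on the CLIP layer (validation explained variance 0.93). Of the 69 stimuli with feature pools for both aspects, 72% activate both pools simultaneously at the vision tower, but causal intervention experiments show that the dominance bottleneck lies downstream of the vision tower. The gap between vision-side representation and language-side commitment gives empirical grounding to the distinction between 'seeing' and 'seeing-as'.

Details

Motivation: The work is motivated by the philosophical question raised by Wittgenstein's duck-rabbit: when a vision-language model captions an ambiguous image, at which point inside the model is the commitment to one aspect made? The paper investigates this vision-language asymmetry empirically.

Result: In the behavioral analysis of 83 bistable stimuli, the model shows three response regimes (default-dominant, force-dominant, force-balanced). Of the 69 stimuli with feature pools for both aspects, 72% (50/69) activate both pools simultaneously at the vision tower. Causal steering at CLIP layer 22 can flip captions on default-dominant stimuli such as duck/rabbit (33% rabbit-flip rate under a fluency guard) but cannot flip captions on force-balanced stimuli such as young/old, despite the superposition in their vision-side representations.

Insight: The contribution is showing a separation between multi-aspect vision-side representations and single-aspect language-side commitment, with the dominance bottleneck downstream of the vision tower, which offers a new perspective on a model's 'seeing-as' behavior. Methodologically, the paper notes that rank-based statistics on TopK sparse autoencoder outputs require tie-corrected ranking to avoid silent row-order bias.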

Abstract: Wittgenstein’s duck-rabbit poses a question for vision-language models: when a model captions an ambiguous image, where in the model is the commitment to one aspect made? We address this with a 3,320-generation behavioral baseline over 83 bistable stimuli that surfaces three regimes (default-dominant, force-dominant, force-balanced) under neutral vs forced-choice prompting, then probe the underlying representations using a TopK sparse autoencoder we train on the CLIP layer that LLaVA-1.6-7B actually consumes (validation EV 0.93). Across 69 bistable stimuli with both per-aspect feature pools available, 72% (50/69) show simultaneous activation of both pools at the vision tower, including 12/12 default-dominant duck/rabbit and 7/8 force-balanced young/old. Causal steering at CLIP layer 22 flips captions on default-dominant stimuli (33% rabbit-flip rate under a fluency guard) but cannot flip captions on force-balanced young/old at any tested coefficient, despite their vision-side superposition. The dominance bottleneck lives downstream of the vision tower; the gap between vision-side representation and language-side commitment is an empirical handle on the seeing/seeing-as distinction. We also flag a methodological note: rank-based statistics on TopK SAE outputs require tie-corrected ranking to avoid silent row-order bias.
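The methodological note about tie-corrected ranking can be illustrated with a small sketch. TopK sparse autoencoder outputs contain many exact zeros, so an argsort-based ranking assigns different ranks to tied entries depending on their position. The function names and the toy activation vector below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def ordinal_ranks(x):
    """Ranks from a stable argsort: ties are broken by position (row order)."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    return ranks

def tie_corrected_ranks(x):
    """Average ranks: every tied value receives the mean of its ordinal ranks."""
    x = np.asarray(x, dtype=float)
    ranks = ordinal_ranks(x)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return (rank_sums / counts)[inverse]

# Hypothetical TopK-style activations: most entries are exactly zero.
acts = np.array([0.0, 0.0, 0.7, 0.0, 0.2])
print(ordinal_ranks(acts))        # tied zeros get ranks 1, 2, 3 by position
print(tie_corrected_ranks(acts))  # tied zeros all get rank 2
```

With ordinal ranks, reordering the rows changes which zero-valued feature looks "higher ranked"; with averaged ranks, the result is invariant to row order.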


[112] DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning cs.CV

Hangui Lin, Yan Shu, Zhengyang Liang, Chi Liu, Xiangrui Liu

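TL;DR: This paper proposes DyCo-RL, a method that integrates dynamic cross-modal coordination into Reinforcement Learning with Verifiable Rewards (RLVR) to improve the visual reasoning of Multimodal Large Language Models (MLLMs). It uses the Fisher-Rao geodesic distance to quantify within-modality attention shifts, assigns tokens in the reasoning process to visually-oriented or text-oriented functional roles, and uses a role-alignment score to reweight advantages during policy optimization. Experiments show that this algorithm-agnostic method consistently improves several RLVR algorithms on multiple visual-centric and mathematical reasoning benchmarks.

Details

Motivation: Existing RLVR methods mainly optimize the reasoning outcome and overlook the fine-grained cross-modal coordination needed during generation. The authors find that, in chain-of-thought reasoning, MLLMs often fail to switch dynamically between extracting visual evidence and synthesizing textual context, and that this coordination breakdown is causally linked to reasoning failures.

Result: Applied to Qwen2.5-VL-3B/7B, DyCo-RL consistently improves four representative RLVR algorithms across seven benchmarks spanning visual-centric and mathematical reasoning.

Insight: The contribution is modeling the dynamic cross-modal coordination process (the alternation between visual evidence extraction and textual synthesis) and integrating it into the RLVR objective through attention-shift-based role assignment and alignment scoring. This offers a new angle on improving MLLM reasoning: attending to the coordination of intermediate steps rather than only the correctness of the final answer.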

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a leading paradigm for enhancing visual reasoning in Multimodal Large Language Models (MLLMs). However, existing RLVR methods optimize primarily for the reasoning outcome, fundamentally overlooking the fine-grained cross-modal coordination required during the generation process. Through token-level analyses and controlled interventions, we reveal that during Chain-of-Thought (CoT) reasoning, MLLMs frequently fail to dynamically alternate between extracting visual evidence and synthesizing textual context, a coordination breakdown that is causally linked to reasoning failures. Motivated by these findings, we propose DyCo-RL, which integrates dynamic cross-modal coordination into RLVR optimization. Specifically, DyCo-RL uses the Fisher-Rao geodesic distance to measure within-modality attention shifts, assigning tokens to either visually-oriented or text-oriented functional roles. It then evaluates the alignment between a token’s actual attention allocation and its assigned role, leveraging this score for alignment-guided advantage reweighting during policy optimization. Extensive experiments demonstrate that the algorithm-agnostic DyCo-RL, when applied to Qwen2.5-VL-3B/7B, consistently improves four representative RLVR algorithms across seven benchmarks spanning visual-centric and mathematical reasoning.
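For readers unfamiliar with the Fisher-Rao geodesic distance, a minimal sketch follows. For two categorical distributions p and q it has the closed form 2·arccos(Σ√(pᵢqᵢ)). How DyCo-RL applies it to attention maps (which layers, heads, or token subsets) is not specified in the abstract, so the attention vectors below are purely hypothetical.

```python
import numpy as np

def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two categorical distributions.

    Closed form: 2 * arccos(sum_i sqrt(p_i * q_i)); ranges from 0 to pi.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # renormalize defensively
    q = q / q.sum()
    bhattacharyya = np.sum(np.sqrt(p * q))
    # Clip to guard against floating-point values slightly above 1.
    return 2.0 * np.arccos(np.clip(bhattacharyya, 0.0, 1.0))

# Hypothetical attention over four visual tokens at two consecutive decoding steps.
attn_prev = [0.70, 0.10, 0.10, 0.10]
attn_curr = [0.10, 0.10, 0.10, 0.70]
shift = fisher_rao_distance(attn_prev, attn_curr)
print(shift)
```

A large value indicates that attention within the modality has moved substantially between the two steps; identical distributions give a distance of zero.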


[113] VideoWeaver: Evaluating and Evolving Skills for Agentic Long Video Generation cs.CV

Jianhui Wei, Jie Tan, Hengchuan Zhu, Xiaotian Zhang, Yan Zhang

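TL;DR: This paper introduces VideoWeaver, an agent harness and benchmark for evaluating and evolving skills for agentic long video generation. An agent turns a single instruction into a long video by composing foundation skills into its own workflow rather than following a predefined pipeline. The benchmark has 16 task categories and 285 cases, and the paper proposes an evidence-grounded agent-as-judge evaluation together with a skill evolution algorithm that refines the agent's skills.

Details

Motivation: The ability of recent agent frameworks (such as Claude Code, Codex, and OpenClaw) to handle long video generation, a long-horizon multimodal task, remains underexplored. The paper aims to evaluate and evolve an agent's ability to build its own workflow for complex video generation.

Result: Experiments across multiple frameworks and models show that an explicit composition skill improves the generation process over using foundation skills alone, that skill evolution further improves output quality, and that performance varies notably across harness and model choices. The proposed agent-as-judge aligns well with human judgments, especially on process metrics.

Insight: The contribution is an evaluation and evolution framework focused on agentic long video generation. An agent-as-judge scores both the execution trace and the final video on the basis of evidence, and a skill evolution algorithm iteratively refines the agent's skill set, giving a systematic way to evaluate and optimize autonomous workflow construction.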

Abstract: Recent agent frameworks such as Claude Code, Codex, and OpenClaw are strong at tool use and orchestration, but whether they can handle long video generation, a long-horizon multimodal task, remains underexplored. Unlike earlier video agents whose pipeline is handcrafted, these frameworks can build and refine their own workflows. We introduce VideoWeaver, an agent harness and benchmark that evaluates and evolves skills for long video generation, where an agent turns a single instruction into a long video by composing foundation skills into its own workflow rather than following a predefined pipeline. The benchmark has 16 task categories and 285 cases, with references spanning text, image, audio, video, and their combinations. Because errors can arise at any stage and not just in the final video, we propose an agent-as-judge that inspects both the execution trace and the final video, grounding its scores in evidence such as metadata and intermediate files. Using this feedback, we further design a skill evolution algorithm that refines and merges the agent’s skills. Across multiple frameworks and models, we find that an explicit composition skill improves the generation process over using foundation skills alone, that skill evolution further improves output quality, and that performance varies notably across harness and model choices. The proposed agent-as-judge also aligns well with human judgments, especially on process metrics. Code and dataset are available at https://github.com/JianhuiWei7/VideoWeaver


[114] Trustworthy Visual Predicates for Robust Manipulation Understanding under Degradation cs.CV

Fatemeh Ziaeetabar

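TL;DR: This paper proposes a predicate-level reliability framework for robust manipulation understanding, analyzing how reliable visual predicates (such as contact, grasp, and release) remain under visual degradation (such as blur, occlusion, and detection noise). The framework defines a structured predicate vocabulary, confidence-aware predicate estimation, and several reliability metrics. Experiments show that predicate failures are structured: static spatial predicates are comparatively robust, contact-sensitive and dynamic predicates are more fragile, and reduced reliability markedly lowers downstream manipulation-understanding accuracy.

Details

Motivation: Manipulation understanding requires reliable relational evidence (visual predicates), but the reliability of these predicates under visual degradation is rarely analyzed directly, which limits progress on robust manipulation-understanding systems.

Result: Experiments on public datasets including VISOR/EPIC-KITCHENS, H2O, and ARCTIC show that, under severe degradation, detection noise, occlusion, and frame dropping cause the strongest reliability losses. Downstream analysis shows that degraded predicates reduce manipulation-understanding accuracy from 0.89 to 0.58, while removing confidence weighting under moderate degradation reduces accuracy from 0.74 to 0.64.

Insight: The contribution is a systematic reliability evaluation framework for visual predicates that treats reliability as a diagnostic layer between visual perception and structured manipulation reasoning. Viewed objectively, its structured predicate taxonomy and confidence-weighting mechanism offer a reusable methodology for building robust neuro-symbolic systems.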

Abstract: Manipulation understanding requires reliable relational evidence, such as contact, support, containment, motion coupling, grasp, release, and active-hand involvement. Although these visual predicates are widely used in event-chain, graph-based, and neuro-symbolic models, their reliability under visual degradation is rarely analyzed directly. This paper introduces a predicate-level reliability framework for robust manipulation understanding under blur, occlusion, illumination change, low resolution, frame dropping, and detection noise. The framework defines a structured predicate vocabulary, confidence-aware predicate estimation, and reliability metrics for predicate preservation, degradation sensitivity, temporal consistency, confidence-weighted stability, and downstream impact. Experiments on controlled manipulation videos and public egocentric or bimanual datasets, including VISOR/EPIC-KITCHENS, H2O, and ARCTIC, show that predicate failures are structured rather than uniform. Static spatial predicates remain comparatively robust, whereas contact-sensitive, dynamic, and derived predicates such as grasp and release are more fragile. Under severe degradation, detection noise, occlusion, and frame dropping cause the strongest reliability losses. Downstream analysis shows that degraded predicates reduce manipulation-understanding accuracy from 0.89 to 0.58, while removing confidence weighting under moderate degradation reduces accuracy from 0.74 to 0.64. These results show that predicate reliability provides a diagnostic layer between visual perception and structured manipulation reasoning.


[115] Human-Centered Benchmarking of Driver Monitoring Models cs.CV | cs.AI

Ruben Dario Florez-Zela

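TL;DR: This paper argues that classification accuracy alone is insufficient for judging whether vision-based driver monitoring models are fit for real-world deployment, and proposes the Human-Centered Benchmarking Framework (HCBF), which evaluates models along four dimensions: accuracy, explainability, efficiency, and robustness. Applied to four lightweight architectures (MobileNetV3, ShuffleNetV2, EfficientNet-B0, and DeiT-Tiny) on the MRL Eye Dataset, the framework shows that although the models have similar clean-set accuracy, each leads in exactly one dimension and all four lie on the Pareto frontier. ShuffleNetV2 leads on the aggregate score, but its performance drops sharply under sensor noise and it makes the critical error of classifying closed eyes as open, whereas the transformer model remains robust.

Details

Motivation: Driver monitoring models are currently compared mainly on classification accuracy, which does not fully characterize a model's fitness for real-world, safety-critical deployment. A more comprehensive evaluation framework is needed.

Result: On eye-state classification with the MRL Eye Dataset, the four models are nearly indistinguishable on clean-set accuracy, yet each leads in one dimension (accuracy, explainability, efficiency, or robustness). Under the Human-Centered Score computed for three deployment-oriented weighting scenarios, ShuffleNetV2 ranks first throughout, but it loses more than half of its performance under sensor noise and makes the safety-critical error of classifying closed eyes as open, while DeiT-Tiny (the transformer) stands out for robustness.

Insight: The contribution is a multi-dimensional, human-centered benchmarking framework (HCBF) that goes beyond a single accuracy metric and evaluates models jointly on accuracy, explainability, efficiency, and robustness. Viewed objectively, the study shows that aggregate rankings can mask dimension-specific vulnerabilities (robustness in particular) that are decisive for deployment, underscores the value of multi-dimensional evaluation in safety-critical applications, and illustrates how different architectures (such as CNNs and transformers) differ in their per-dimension strengths.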

Abstract: Vision-based driver monitoring systems are increasingly deployed in safety-critical intelligent transportation settings, yet they are almost always compared on classification accuracy alone. This paper argues that accuracy is insufficient to characterize a model’s fitness for real-world deployment, and proposes the Human-Centered Benchmarking Framework (HCBF), which evaluates models across four dimensions: accuracy, explainability, efficiency, and robustness. The framework is applied to four representative lightweight architectures, MobileNetV3, ShuffleNetV2, EfficientNet-B0, and DeiT-Tiny, on the MRL Eye Dataset for eye-state classification. While the models are nearly indistinguishable on clean-set accuracy, each leads in exactly one dimension, and all four lie on the Pareto frontier. A Human-Centered Score computed under three deployment-oriented weighting scenarios ranks ShuffleNetV2 first throughout. However, this aggregate winner retains less than half of its performance under sensor noise and fails by classifying closed eyes as open, whereas the transformer remains robust. These findings show that aggregate ranking can mask dimension-specific vulnerabilities that are operationally decisive, underscoring the value of multi-dimensional, human-centered evaluation.
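The interplay between a Pareto frontier and a weighted aggregate score can be sketched as follows. The abstract does not give the exact Human-Centered Score formula or the weights, so the normalized weighted sum, the function names, and the numbers below are illustrative assumptions only.

```python
import numpy as np

def pareto_mask(scores):
    """scores: (n_models, n_dims), higher is better.

    Returns True for each model that no other model dominates.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            dominates = np.all(scores[j] >= scores[i]) and np.any(scores[j] > scores[i])
            if i != j and dominates:
                mask[i] = False
                break
    return mask

def weighted_score(scores, weights):
    """Assumed aggregate: weighted sum with weights normalized to sum to 1."""
    w = np.asarray(weights, dtype=float)
    return np.asarray(scores, dtype=float) @ (w / w.sum())

# Hypothetical scores (accuracy, explainability, efficiency, robustness);
# each model leads in exactly one dimension, so all are Pareto-optimal.
toy = np.array([
    [0.99, 0.60, 0.70, 0.50],
    [0.98, 0.80, 0.60, 0.55],
    [0.98, 0.65, 0.95, 0.40],
    [0.98, 0.60, 0.50, 0.90],
])
print(pareto_mask(toy))
print(weighted_score(toy, weights=[1, 1, 2, 1]))
```

A single aggregate winner can still have the lowest value in one dimension, which is the masking effect the abstract describes.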


[116] One Stone, Three Birds: Self-adaptive Optimal Transport for Multi-VLM Selection, Adaptation, and Ensembling cs.CV

Qiyu Xu, Zhanxuan Hu, Yu Duan, Yonghang Tai, Huafeng Li

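TL;DR: This paper proposes One Stone, Three Birds (OSTB), a training-free framework based on self-adaptive optimal transport. Without updating any vision-language model (VLM) parameters, it estimates a consensus transport structure between target-domain samples and classes from a pool of candidate VLMs, and reuses that structure to perform model selection, target-domain adaptation, and ensemble prediction at once.

Details

Motivation: Existing deployment pipelines usually assume that the chosen VLM is compatible with the target domain. In realistic cross-domain deployment, several general-purpose or domain-specialized VLMs may be plausible, and no instance-level labels are available to identify the reliable ones, so model selection, target adaptation, and prediction integration need a coupled solution.

Result: Extensive experiments on natural-image, remote-sensing, and medical-pathology benchmarks show that OSTB improves model ranking, adaptation stability, and ensemble robustness under heterogeneous candidate pools.

Insight: The contribution is tying the three deployment decisions to the same latent sample-class structure in the target domain. The method uses the complementary predictive evidence of multiple VLMs to learn a consensus transport plan through self-adaptive optimal transport, then reuses that structure to solve the coupled tasks without any parameter updates.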

Abstract: Vision-language models (VLMs) enable visual recognition from semantic class descriptions, which makes them attractive when target annotations are scarce or unavailable. Most deployment pipelines, however, first choose a single VLM and then adapt that model to the unlabeled target set. This single-backbone paradigm hides a critical assumption: the selected VLM is already compatible with the target domain. In realistic cross-domain deployment, several general-purpose and domain-specialized VLMs may be plausible, yet no instance-level target labels are available to identify the reliable ones. Deployment therefore requires a coupled solution for model selection, target adaptation, and prediction integration. We revisit this problem from a system-level multi-VLM perspective. Our central observation is that the three decisions above depend on the same latent object: a trustworthy sample-class structure in the target set. Different VLMs may encode different transfer biases and produce conflicting predictions, but their outputs can still provide complementary evidence for estimating this structure. We propose One Stone, Three Birds, a training-free framework based on self-adaptive optimal transport. Given a pool of frozen candidate VLMs, OSTB estimates a consensus sample-to-class transport plan without updating VLM parameters. The learned transport structure is then reused for all deployment objectives: model selection is performed by ranking the combined semantic and visual reliability induced by the consensus plan; target adaptation is obtained by fitting transport-conditioned visual classifiers; and ensembling is implemented through reliability-aware probabilistic integration. Extensive experiments on natural-image, remote-sensing, and medical-pathology benchmarks show that OSTB improves model ranking, adaptation stability, and ensemble robustness under heterogeneous candidate pools.
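To make the idea of a sample-to-class transport plan concrete, the sketch below runs generic entropic optimal transport (Sinkhorn iterations) on a cost built from averaged VLM class probabilities. This is a standard technique swapped in for illustration; it is not OSTB's self-adaptive formulation, and the cost definition, uniform marginals, and all names are assumptions.

```python
import numpy as np

def consensus_cost(probs):
    """probs: (n_models, n_samples, n_classes) class probabilities from frozen VLMs.

    Assumed cost: negative log of the model-averaged probability.
    """
    mean_probs = np.mean(np.asarray(probs, dtype=float), axis=0)
    return -np.log(mean_probs + 1e-12)

def sinkhorn_plan(cost, row_marg, col_marg, reg=0.1, n_iters=200):
    """Generic entropic OT: returns a plan whose marginals match row_marg / col_marg."""
    kernel = np.exp(-cost / reg)
    u = np.ones_like(row_marg)
    v = np.ones_like(col_marg)
    for _ in range(n_iters):
        u = row_marg / (kernel @ v)
        v = col_marg / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]

# Hypothetical predictions: 2 models, 4 samples, 2 classes.
probs = np.array([
    [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]],
    [[0.7, 0.3], [0.9, 0.1], [0.3, 0.7], [0.2, 0.8]],
])
rows = np.full(4, 1 / 4)   # each sample carries equal mass
cols = np.full(2, 1 / 2)   # assumed balanced classes
plan = sinkhorn_plan(consensus_cost(probs), rows, cols)
print(plan.argmax(axis=1))  # pseudo-labels read off the plan
```

A plan of this kind could then be compared against each model's own predictions to score reliability, which is the general spirit of reusing one structure for selection, adaptation, and ensembling.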


[117] Phase Marginalization for Patch-Grid Instability in Vision Transformers cs.CV | cs.LG

Oğuzhan Ercan

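TL;DR: This paper addresses the phase-dependent instability that fixed patch partitioning introduces in dense prediction with Vision Transformers, and proposes Phase Marginalization, a post-hoc marginalization method. It evaluates structured patch-grid phases, inverse-aligns the dense outputs, and aggregates them in the original image coordinate system to remove the effect of phase variation. The central variant, Uniform Phase Marginalization (K = 4), is training-free, improves over the baseline on segmentation, depth estimation, and local matching, and gives a modest gain over generic shift-based test-time augmentation on Cityscapes.

Details

Motivation: Vision Transformers operate on fixed patch grids, which introduces phase-dependent instability in dense prediction: changing the patch partition changes the token evidence available to a pixel, especially near boundaries. The authors formalize patch-grid phase as a nuisance variable in order to address the resulting prediction inconsistency.

Result: In a controlled Cityscapes experiment, Uniform Phase Marginalization (K = 4) improves mean Intersection-over-Union by 0.31 over the strongest generic shift-based four-forward test-time augmentation configuration tested, at matched compute. A scaling study shows that K = 4 is a practical cost-accuracy trade-off: K = 8 is essentially unchanged, and K = 16 adds little accuracy at much higher latency. The method improves over the K = 1 baseline across segmentation, depth estimation, and local matching settings.

Insight: The contribution is identifying and formalizing patch-grid phase as a measurable nuisance variable and proposing a simple, training-free post-hoc marginalization method (Phase Marginalization) to remove its effect, providing both a diagnostic tool and a baseline for dense ViT prediction. Viewed objectively, the method stabilizes outputs by sampling and aggregating predictions across phases in a structured way at an acceptable extra compute cost, making it an effective test-time strategy.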

Abstract: Vision Transformers operate on fixed patch grids, which can introduce phase-dependent instability for dense prediction: changing the patch partition can change the token evidence available to a pixel, especially near boundaries. We formalize patch-grid phase as a nuisance variable and propose Phase Marginalization, a post-hoc marginalization method that evaluates structured patch-grid phases, inverse-aligns dense outputs, and aggregates them in the original image coordinate system. The central variant, Uniform Phase Marginalization with K = 4, is training-free and improves over the canonical K = 1 baseline across measured segmentation, depth, and local matching settings. In a controlled Cityscapes experiment, Uniform Phase Marginalization provides a modest compute-matched advantage over generic shift-based four-forward test-time augmentation (TTA) (+0.31 mean Intersection-over-Union over the strongest tested generic row). A scaling study further shows that K = 4 is a practical cost-accuracy trade-off: K = 8 is essentially unchanged and K = 16 adds little accuracy at much higher latency. These results position patch-grid phase as a measurable nuisance variable and Phase Marginalization as a simple diagnostic and post-hoc marginalization baseline for dense ViT prediction.
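The evaluate, inverse-align, and aggregate loop described in the abstract can be sketched in a few lines. The sketch assumes circular shifts as the way to change the patch-grid phase and a simple mean as the aggregate; the paper may use padding or cropping and a different aggregation. `blocky_model` is a toy stand-in for a patch-based dense predictor, not a real ViT.

```python
import numpy as np

def phase_marginalize(image, dense_model, offsets):
    """Average dense predictions over patch-grid phases.

    For each (dy, dx) offset: shift the image, predict, shift the prediction
    back into the original coordinates, then average across offsets.
    """
    aligned = []
    for dy, dx in offsets:
        shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))
        pred = dense_model(shifted)
        aligned.append(np.roll(pred, shift=(-dy, -dx), axis=(0, 1)))
    return np.mean(aligned, axis=0)

def blocky_model(image, patch=4):
    """Toy predictor: every pixel receives the mean of its patch."""
    h, w = image.shape
    blocks = image.reshape(h // patch, patch, w // patch, patch)
    means = blocks.mean(axis=(1, 3), keepdims=True)
    return np.broadcast_to(means, blocks.shape).reshape(h, w)

rng = np.random.default_rng(0)
img = rng.random((8, 8))
# Assumed K = 4 phase set for a patch size of 4: half-patch offsets.
k4_offsets = [(0, 0), (0, 2), (2, 0), (2, 2)]
baseline = phase_marginalize(img, blocky_model, [(0, 0)])   # K = 1
marginal = phase_marginalize(img, blocky_model, k4_offsets)  # K = 4
```

With the toy model, the K = 1 output has hard discontinuities at patch borders, while the K = 4 output blends predictions from four different partitions of the same pixels.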


[118] IMAGINE: Adaptive Schema-Imagery Enhanced Composition for Composed Video Retrieval cs.CV

Jiale Huang, Zixu Li, Zhiwei Chen, Zhiheng Fu, Chunxiao Wang

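TL;DR: This paper proposes IMAGINE, an adaptive schema-imagery enhanced compositional network for Composed Video Retrieval (CVR). It materializes implicit semantics (termed schema imagery) through dynamic multimodal prototypes, which adaptively modulate visual features and inject implicit guidance into the retrieval process. IMAGINE achieves state-of-the-art performance on both CVR and Composed Image Retrieval (CIR) across three widely used benchmarks.

Details

Motivation: Existing CVR methods usually assume that the objects described in the modification text appear directly in the video. In practice, modification texts often describe concepts that are not explicitly shown but are implied by semantically related visual cues (for example, "cake" implying "birthday party"). Because these methods rely on aligning explicit feature representations in the concrete space, they miss critical latent associations.

Result: IMAGINE achieves state-of-the-art performance on both Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR) across three widely used benchmarks.

Insight: The contribution is the concept of "schema imagery": dynamic multimodal prototypes materialize implicit semantics and adaptively modulate visual features, bridging the gap between explicit visual content and implicit retrieval intent. This suggests a new way to handle implicit semantic associations in multimodal retrieval.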

Abstract: Composed Video Retrieval (CVR) is designed to retrieve a target video that matches a reference video modified by a modification text. While existing methods explore cross-modal correspondences, they often assume modified objects appear directly in videos. However, modification texts frequently describe concepts not explicitly presented but implicitly expressed through semantically related visual cues (e.g., “cake” implying “birthday party”). Current approaches typically rely on aligning explicit feature representations within the concrete space, neglecting critical latent associations. To address this, we propose an adaptIve scheMa-ImAGery enhanced composItional NEtwork (IMAGINE). Unlike standard explicit matching, IMAGINE materializes implicit semantics (termed schema imagery) via dynamic multimodal prototypes. These prototypes capture shared latent concepts to adaptively modulate visual features, effectively injecting implicit guidance into the retrieval process. By bridging the gap between explicit visual contents and implicit retrieval intentions, IMAGINE achieves state-of-the-art performance in both CVR and Composed Image Retrieval (CIR) across three widely used benchmarks.


[119] RAPID: Layer-Wise Redundancy-Aware Pruning and Importance-Driven Token Merging for Efficient ViT cs.CV | cs.AI

Kyumin Choi, Ikbeom Jang

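TL;DR: This paper proposes RAPID, a depth-aware token reduction framework for efficient Vision Transformers (ViTs). It adapts to how token representations evolve across layers with a bifurcated strategy: in shallow-to-middle layers it uses redundancy-similarity aware pruning to remove over-represented local patterns, and in deeper layers it switches to an importance-similarity aware merging mechanism that fuses less important but similar tokens while protecting semantically critical ones.

Details

Motivation: Vision Transformers perform strongly but are computationally expensive because of the quadratic complexity of self-attention. Existing token reduction techniques (such as pruning and merging) typically ignore how representations evolve across network depth, so a depth-aware reduction method that adapts to layer-wise characteristics is needed.

Result: Empirical validation on ImageNet-1K with ViT and DeiT architectures shows that RAPID achieves a better accuracy-compression Pareto frontier than plug-and-play baselines such as ToMe and ToFu. RAPID is particularly robust under aggressive compression, reaching up to 4.29% higher accuracy than ToMe at extreme reduction rates.

Insight: The main contribution is a bifurcated, depth-adaptive token reduction strategy that aligns pruning and merging with the hierarchical feature evolution of the Transformer. Viewed objectively, the core insight is to change the reduction criterion (from redundancy-aware to importance-aware) as features shift from local patterns to global semantic concepts, which provides an effective training-free template for optimizing vision models.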

Abstract: Vision Transformers (ViTs) achieve strong performance but suffer from high computational costs due to quadratic self-attention complexity. Although token reduction techniques such as pruning and merging mitigate this, they typically overlook how representations evolve across network depth. We propose RAPID, a depth-aware token reduction framework that adapts reduction strategies to the layer-wise characteristics of token representations. The primary methodological contribution is a bifurcated strategy: in shallow-to-middle layers, RAPID employs a redundancy-similarity aware pruning metric to eliminate over-represented local patterns. As features transition to global semantic concepts in deeper layers, the framework shifts to an importance-similarity aware merging mechanism. This stage leverages classification (CLS) token attention weights to protect semantically critical tokens while fusing less important but similar neighbors. Empirical validation on ImageNet-1K using ViT and DeiT architectures demonstrates that RAPID establishes a superior accuracy-compression Pareto frontier compared to plug-and-play baselines such as ToMe and ToFu. RAPID is particularly robust in aggressive compression regimes, achieving up to 4.29% higher accuracy than ToMe at extreme reduction rates. Our framework provides a training-free template for optimizing vision models by aligning reduction strategies with hierarchical feature evolution.


[120] Test-Time Scaling in Multimodal Foundation Models: A Comprehensive Survey of Generation and Reasoning cs.CV

Cong Wan, Ying He, Zhongzhan Huang, Hefeng Wu

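TL;DR: This paper presents the first comprehensive survey of test-time scaling in multimodal foundation models. It proposes a unified taxonomy that groups existing methods into sampling-based, feedback-based, and search-based strategies, summarizes related applications and benchmarks, and discusses open challenges and future research directions.

Details

Motivation: Test-time scaling, which improves performance by dynamically allocating compute during inference, lacks a systematic survey and a unified theoretical framework for multimodal foundation models. This paper aims to fill that gap.

Result: The paper proposes a unified taxonomy for organizing existing methods and summarizes representative applications and benchmarks used to evaluate multimodal TTS capabilities in generation and reasoning tasks.

Insight: The contribution is the first systematic survey of multimodal TTS together with a unified three-way taxonomy (sampling, feedback, search), giving a clear roadmap for subsequent research in the field.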

Abstract: Test-time Scaling (TTS) has emerged as a pivotal research direction for enhancing model performance by dynamically allocating computational resources during inference. Recent advancements have adapted this paradigm to Multimodal Foundation Models (MFMs), unlocking their potential in multimodal reasoning and generation. Despite rapid progress, the field lacks a systematic survey and unified theoretical framework to delineate the developmental landscape of multimodal TTS. To bridge this gap, we present the first comprehensive review of TTS research for MFMs, proposing a unified taxonomic framework that categorizes existing methodologies into three distinct strategies: sampling-based, feedback-based, and search-based approaches. We further summarize representative applications and benchmarks commonly utilized to evaluate multimodal TTS capabilities in generation and reasoning tasks. Finally, this survey discusses open challenges and outlines future research directions, providing a systematic roadmap for subsequent studies in this rapidly evolving field.


[121] SegmentAnyTreeV2: Scaling Transformer-Based Tree Instance Segmentation Across Sensors, Platforms, and Forests cs.CV | cs.LG

Maciej Wielgosz, Stefano Puliti, Rasmus Astrup

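TL;DR: This paper presents SegmentAnyTreeV2, a sensor- and platform-agnostic framework for semantic and instance segmentation of forest point clouds. The model combines a serialization-based Point Transformer v3 backbone, a lightweight semantic head, and a tree-focused cross-attention mask decoder. The authors also introduce FOR-instance v3, an expanded benchmark dataset.

Details

Motivation: The goal is tree instance segmentation that works across different sensors, platforms, and forest environments, with accurate separation in dense and structurally complex stands.

Result: On the FOR-instanceV2 test split, the model achieves 90.5% precision, 80.2% recall, 85.0% F1, 90.7% coverage, and 87.6% semantic mIoU, outperforming previous learning-based methods in both instance detection and mask completeness. Zero-shot evaluation on independent sites shows strong cross-domain generalization.

Insight: The contributions include combining a serialization-based Transformer backbone with instance decoding restricted to tree-class voxels; instance-aware query initialization, one-to-many seed supervision, and asymmetric mask scoring to improve segmentation; and FOR-instance v3, a large and diverse new benchmark dataset intended to advance the field.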

Abstract: We present SegmentAnyTreeV2, a sensor- and platform-agnostic framework for semantic and instance segmentation of forest point clouds. The model combines a serialization-based Point Transformer v3 backbone with a lightweight semantic head and a tree-focused cross-attention mask decoder. Semantic predictions restrict instance decoding to tree-class voxels, while instance-aware query initialization, one-to-many seed supervision, and asymmetric mask scoring improve separation in dense and structurally complex stands. We further introduce FOR-instance v3, an expanded benchmark comprising 427 scenes and 26,496 annotated trees across diverse biomes, forest structures, and LiDAR platforms. On the FOR-instanceV2 test split, SegmentAnyTreeV2 achieves 90.5% precision, 80.2% recall, 85.0% F1, 90.7% coverage, and 87.6% semantic mIoU, outperforming previous learning-based methods in both instance detection and mask completeness. Zero-shot evaluation on independent sites further demonstrates strong cross-domain generalization.
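As a quick consistency check on the reported detection metrics, the F1 value can be recomputed from the stated precision and recall. This assumes the reported F1 is the harmonic mean of the aggregate precision and recall, rather than an average of per-plot F1 scores.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision 90.5% and recall 80.2%, as reported in the abstract.
f1 = f1_score(0.905, 0.802)
print(round(100 * f1, 1))  # 85.0, matching the reported F1
```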


[122] Remember with Confidence: Uncertainty Quantification for Spatio-temporal Memory with Probabilistic Guarantees cs.CV

Harry Zhang, Nicolas Gorlo, Luca Carlone

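TL;DR: This paper proposes object-level semantic uncertainty quantification for multi-view vision-language model (VLM) memory and builds a spatial-semantic memory system called UQ-DAAAM. The system measures the object-centric cross-view semantic scatter of captions to identify semantically unresolved objects, then actively selects high-quality views and fuses their captions, refining object descriptions under a fixed query budget.

Details

Motivation: Existing scene-graph and retrieval-augmented systems treat VLM descriptions as an oracle, yet VLM captions are noisy and viewpoint-inconsistent, and these systems have no mechanism for detecting unreliable stored descriptions.

Result: On the OC-NaVQA benchmark, UQ-DAAAM achieves substantially larger uncertainty reduction and better spatio-temporal question answering performance than baselines.

Insight: The contribution is an object-level semantic uncertainty score for multi-view VLM memory, together with a mechanism that actively selects high-quality views for caption fusion. The paper also derives probabilistic guarantees that higher-quality view selection is more likely to reduce uncertainty, improving the reliability and effectiveness of embodied 4D memory systems.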

Abstract: Long-horizon robot operation requires spatio-temporal memory to record the environment state and recall it for downstream reasoning. Scene graphs and retrieval-augmented systems ground VLM descriptions to persistent 3D entities with rich semantic descriptions. However, VLM captions are noisy and viewpoint-inconsistent, and existing systems treat them as an oracle with no mechanism to detect unreliable stored descriptions. We introduce object-level semantic uncertainty for multi-view VLM memory: a score that measures object-centric cross-view semantic scatter of captions and identifies semantically unresolved objects. Then, we include our uncertainty scores in an advanced spatial-semantic memory system that we dub UQ-DAAAM. UQ-DAAAM uses this score to actively refine uncertain objects under a fixed query budget by selecting high-quality views and fusing the resulting multi-view captions into a single object description. We also derive probabilistic guarantees showing that higher-quality candidate views (as selected by our approach) are more likely to reduce uncertainty. Our experiments show that uncertainty quantification can make embodied 4D memory systems more reliable and more effective. In particular, on the OC-NaVQA benchmark, UQ-DAAAM achieves substantially larger uncertainty reduction and better spatio-temporal question answering performance than baselines.
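One plausible way to compute a cross-view semantic scatter score is the mean pairwise cosine distance between caption embeddings of the same object. The abstract does not define the score, so this formula, the budgeted selection rule, and every name below are assumptions made for illustration.

```python
import numpy as np

def semantic_scatter(embeddings):
    """Mean pairwise cosine distance among an object's caption embeddings.

    embeddings: (n_views, dim) array with n_views >= 2. Returns 0 when all
    captions agree in direction, larger values when they disagree.
    """
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb @ emb.T
    upper = np.triu_indices(len(emb), k=1)  # each unordered pair once
    return float(np.mean(1.0 - sims[upper]))

def select_uncertain(scatter_by_object, budget):
    """Pick the `budget` objects with the highest scatter for refinement."""
    ranked = sorted(scatter_by_object, key=scatter_by_object.get, reverse=True)
    return ranked[:budget]

# Hypothetical 2-D caption embeddings for two objects seen from three views.
mug = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05]]      # consistent captions
cabinet = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]   # conflicting captions
scores = {"mug": semantic_scatter(mug), "cabinet": semantic_scatter(cabinet)}
print(select_uncertain(scores, budget=1))
```

Under a fixed query budget, only the objects at the top of this ranking would be re-queried with additional views.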


[123] Light-WAM: Efficient World Action Models with State-Fusion Action Decoding cs.CV

Ziang Li, Dongzhou Cheng, Yibin Wang, Shiyue Wang, Xiaoyang Xu

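TL;DR: This paper proposes Light-WAM, a lightweight World Action Model for robot manipulation. It lowers training cost with a compact video backbone and future-video supervision in a downsampled latent space, and introduces the StateFusionActionExpert module, which fuses multi-level representations to predict action chunks directly, improving efficiency and deployability while maintaining performance.

Details

Motivation: Current World Action Models (WAMs) often rely on large-scale generative architectures, which incur high training costs and inference latency and are hard to deploy as efficient closed-loop policies. The paper aims to design a lightweight, efficient WAM that can be deployed for robot manipulation tasks.

Result: Experiments show that Light-WAM maintains strong performance on the LIBERO benchmark and achieves usable multi-task performance on RoboTwin 2.0 while using only 0.44B trainable parameters. Its inference latency is 72.03 ms with 4.1 GiB peak GPU memory, and training throughput is improved.

Insight: The main contributions are: 1) video supervision in a downsampled latent space to reduce compute cost; 2) the StateFusionActionExpert, which reads adapted states from multiple backbone layers, fuses them, and predicts action chunks in a single forward pass, giving an efficient interface between video representations and robot actions and avoiding a heavy generative action expert.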

Abstract: World Action Models (WAMs) extend robot policy learning by incorporating future prediction as an additional training objective, encouraging the policy to encode task-relevant temporal structure in its representations. Current WAMs often rely on large-scale generative architectures that incur high training costs and inference latency, making them difficult to deploy as efficient closed-loop policies. We propose Light-WAM, a lightweight World Action Model for efficient robot manipulation. Specifically, it is built with a compact video backbone and performs future-video supervision in a downsampled latent space, reducing the cost of video co-training while retaining its benefits for representation learning. For action prediction, Light-WAM introduces the StateFusionActionExpert, which reads adapted states from multiple backbone layers, fuses them through learned-query pooling, and directly predicts action chunks in a single forward pass. This design provides an efficient interface between video backbone representations and robot actions, avoiding the need for heavy generative action experts. Experiments demonstrate that Light-WAM maintains strong performance on LIBERO and achieves usable multi-task performance on RoboTwin 2.0, while using only 0.44B trainable parameters. It also achieves 72.03ms inference latency with 4.1GiB peak GPU memory and improved training throughput.


[124] TIDE: Task-Isolated Diffusion for Unified Video Editing and Generation cs.CV

Qi Liu, Gang Yue, Mingyu Yin, Lisai Zhang, Yidi Wu

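TL;DR: This paper proposes TIDE (Task-Isolated Diffusion), a unified model for video editing and generation. Its core ideas are per-token task embeddings that distinguish the conditions of different tasks, and a dual-path conditioning scheme that combines a vision-language model with a VAE latent path. With a multi-task progressive training strategy, the model achieves state-of-the-art performance on instruction-based editing, reference-guided editing, and multi-reference generation.

Details

Motivation: Video generation and editing with Diffusion Transformers are usually handled by separate, task-specific models, and a unified framework that supports diverse video tasks is still missing. Existing unified attempts either require dedicated auxiliary encoders or lack an explicit mechanism to distinguish heterogeneous conditioning tokens, and they struggle when the number and type of visual conditions vary across tasks.

Result: Extensive experiments on multiple video editing and generation benchmarks show that TIDE achieves state-of-the-art (SOTA) performance across all evaluated tasks.

Insight: The claimed contributions are: per-token task-specific embeddings that explicitly disambiguate conditioning tokens; a dual-path conditioning scheme that combines high-level semantic understanding with fine-grained structural fidelity; and a multi-task progressive training strategy that harmonizes diverse objectives and generalizes smoothly across heterogeneous tasks. Viewed objectively, coupling task-identity embeddings with dual-path conditioning is a novel way to address condition heterogeneity in unified models.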

Abstract: Recent advances in Diffusion Transformers have driven rapid progress in video generation and editing, yet these capabilities are still handled by separate, task-specific models. Building a unified framework that supports diverse video tasks remains an open challenge: existing unified attempts either require dedicated auxiliary encoders or lack explicit mechanisms to distinguish heterogeneous conditioning tokens, struggling when the number and type of visual conditions vary across tasks. We propose TIDE, a unified framework that integrates instruction-based editing, reference-guided editing, and multi-reference generation. At its core, we introduce per-token task embeddings that assign each input token a task-specific identifier, enabling the model to explicitly disambiguate target, source, and reference tokens. To simultaneously capture high-level semantic understanding and fine-grained structural fidelity, we design a dual-path conditioning scheme that couples a vision-language model with a VAE latent path for complementary signals. We further devise a multi-task progressive training strategy that incrementally introduces tasks of increasing complexity, effectively harmonizing diverse objectives and enabling smooth generalization across heterogeneous task distributions. Extensive experiments on multiple video editing and generation benchmarks demonstrate that TIDE achieves state-of-the-art performance across all evaluated tasks. Our project page is available at https://LittleWork123.github.io/tide.


[125] G2G: Exploiting Intra-Group Geometry for Inter-Group Pose Estimation cs.CV | cs.RO

Yufei Wei, Shuhao Ye, Chenxiao Hu, Yiyuan Pan, Dongyu Feng

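TL;DR: This paper proposes G2G, a method for estimating the relative 6-DoF pose between two image groups, with applications in cross-sequence relocalization and multi-camera rig odometry. G2G keeps the pretrained multi-view backbone entirely frozen and adds only three lightweight trainable modules to bridge the two groups. The trainable footprint totals about 32M parameters, under 6% of the full model, and is supervised only by relative poses.

Details

Motivation: Existing models treat all views as an unstructured set and fail to exploit the known intra-group geometry (from visual odometry or rig calibration) for cross-group reasoning, so a method is needed that bridges the two image groups for pose estimation.

Result: Across four datasets spanning indoor and outdoor simulation, real-world cross-season capture, and zero-shot sim-to-real transfer, G2G attains state-of-the-art accuracy on both cross-sequence relocalization and multi-camera rig odometry, while every baseline is retrained with its full original supervision.

Insight: The contribution is keeping the pretrained backbone frozen and adding three lightweight modules (a perceiver resampler, a cross-group bridge with merged self-attention, and a multi-frame pose head) that exploit intra-group geometry for cross-group pose estimation, balancing parameter efficiency and performance.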

Abstract: Recovering the relative 6-DoF pose between two image groups underlies cross-sequence relocalization and multi-camera rig odometry. Each group carries known intra-group geometry from visual odometry or rig calibration, and pretrained multi-view backbones already fuse such geometry into visual features. Yet current models treat all views as an unstructured set, leaving cross-group reasoning as the missing piece. We introduce G2G, which keeps the foundation model entirely frozen and adds three lightweight trainable modules to bridge the two groups: a perceiver resampler, a cross-group bridge with merged self-attention, and a multi-frame pose head. The trainable footprint totals about 32M parameters, under 6% of the full model, and is supervised only by relative poses. Across four datasets that span indoor and outdoor simulation, real-world cross-season capture, and zero-shot sim-to-real transfer, G2G attains state-of-the-art accuracy on both tasks, while every baseline is retrained with its full original supervision. Code is available at https://github.com/WeiYuFei0217/G2G.


[126] Set-Based Transformer for Atmospheric Compensation in Standoff LWIR Hyperspectral Imaging cs.CV | cs.AI

Fabian Perez, Nicolas Quintero, Jeferson Acevedo, Hoover Rueda-Chacon

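TL;DR: This paper proposes a lightweight set-based deep learning framework for atmospheric compensation in standoff long-wave infrared (LWIR) hyperspectral imaging. It takes multiple radiance measurements collected at different standoff ranges as input and jointly estimates transmittance, atmospheric path radiance, and a shared downwelling spectrum.

Details

Motivation: Passive standoff LWIR hyperspectral imaging is affected by atmospheric absorption and emission as well as reflected radiance, which makes atmospheric compensation essential for obtaining information about a target. The problem has nonetheless been largely overlooked because of its practical and modeling difficulty.

Result: Experiments on a MODTRAN-generated standoff LWIR dataset show low spectral distortion across all estimated products (transmittance, path radiance, and downwelling spectrum).

Insight: The contribution is a set-based Transformer framework that jointly estimates several atmospheric quantities from multi-range measurements. Analyzing the learned representation with a sparse autoencoder shows that some latent features activate on geographically coherent subsets of the data without any location supervision, which suggests the model may have learned implicit location-related patterns.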

Abstract: Passive long-wave infrared (LWIR) hyperspectral imaging under a standoff geometry depends on atmospheric absorption and emission, as well as reflected radiance, making atmospheric compensation essential for obtaining information about a target of interest. Despite its importance, this compensation has been largely overlooked due to its practical and modeling difficulty. In this paper, we present a lightweight set-based deep learning framework that takes multiple radiance measurements, collected at different standoff ranges, as input and jointly estimates transmittance, atmospheric path radiance, and a shared downwelling spectrum. We analyze the learned representation with a sparse autoencoder and observe that several latent features activate on geographically coherent subsets of the test data despite the absence of location supervision. Experiments on a MODTRAN-generated standoff LWIR dataset demonstrate low spectral distortion across all estimated products. The dataset and code are publicly available at: https://factral.co/SAE-LWIR/


[127] Beyond Raw Signals: Undecoded Generative Latents as Privileged Synthetic Data cs.CV

Cristian Sbrolli, Nicolas Michel, Matteo Matteucci, Toshihiko Yamasaki

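TL;DR: This paper proposes Direct Latent Augmentation (DLA) and Multilayer Explicit Simulated Synesthesia (MESSy) to address the high inference cost of deploying multimodal models and the scarcity of paired data. The framework uses undecoded generative latents directly as privileged synthetic data, avoiding the decode-encode loop of prior methods, and uses a predictive objective to transfer this dense knowledge safely to a purely visual student model.

Details

Motivation: Multimodal integration improves computer vision models, but deployment incurs high inference costs and requires scarce, perfectly paired datasets. Existing methods synthesize missing modalities with generative AI, yet they introduce an inefficient decode-encode loop that forces the downstream classifier to waste capacity re-encoding information.

Result: Experiments show that the framework significantly outperforms raw data augmentation and traditional distillation, yielding highly accurate unimodal students with "synesthetic" latent structures that are inherently aligned with physical properties they have never directly observed.

Insight: The novelty lies in using undecoded generative latents directly as privileged information, avoiding information loss and redundant computation. In addition, MESSy uses a predictive objective rather than enforced representation matching, so the student can internalize physical priors more naturally without distorting its native visual features.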

Abstract: While multimodal integration significantly improves computer vision models, deploying them incurs prohibitive inference costs and requires scarce, perfectly paired datasets. Recent methods address this data bottleneck by synthesizing missing modalities via generative AI, yet they introduce a severe inefficiency: the Decode-Encode Loop. Specifically, information-rich generative latents are decoded into noisy raw signals, forcing the downstream classifier to waste capacity re-encoding them. To bypass this bottleneck, we propose Direct Latent Augmentation (DLA), utilizing undecoded generative latents directly as privileged information. Furthermore, to transfer this dense knowledge to a purely visual student, we introduce Multilayer Explicit Simulated Synesthesia (MESSy). Instead of enforcing rigid representation matching, which forces the student to distort its native visual features to accommodate complex multimodal topologies, MESSy uses a predictive objective to safely internalize these physical priors. Empirical results demonstrate that our framework significantly outperforms raw data augmentation and traditional distillation. Ultimately, our approach yields highly accurate unimodal students with "synesthetic" latent structures that are inherently aligned with physical properties they have never directly observed.


[128] SceneConductor: 3D Scene Generation from Single Image with Multi-Agent Orchestration cs.CV | cs.AI | cs.MAPDF

Jeonghwan Kim, Yushi Lan, Yongwei Chen, Hieu Trung Nguyen, Chuanyu Pan

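TL;DR: This paper proposes SceneConductor, a multi-agent orchestration framework for generating complete 3D scenes from a single image. It decomposes generation into three structured stages (scene initialization, environment construction, and multi-agent refinement) to address the poor global consistency and heavy reliance on scene-level supervision that result from holistic or weakly decomposed pipelines.

Details

Motivation: Generating a complete 3D scene from a single image requires inferring globally consistent geometry, object relationships, and environmental context from ambiguous visual evidence. Existing methods typically rely on holistic or weakly decomposed pipelines that entangle many factors at once and require extensive scene-level supervision, limiting generalization to complex real-world environments.

Result: Extensive experiments on benchmark datasets show that the method consistently outperforms prior approaches in geometric accuracy, spatial consistency, and perceptual realism.

Insight: The novelty is a multi-agent orchestration framework that decomposes a complex task into structured stages, together with a geometry-aware layout predictor supervised by sparse geometric priors derived from point maps. This reduces reliance on scene-level annotations, allows training from segmentation-level data, and generalizes robustly to diverse real-world scenes.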

Abstract: Generating complete 3D scenes from a single image requires inferring globally consistent geometry, object relationships, and environmental context from inherently ambiguous visual evidence. Despite recent progress in joint layout-and-mesh generation, existing methods often rely on holistic or weakly decomposed pipelines that entangle many factors at once and demand extensive scene-level supervision, limiting their generalization to complex real-world environments. We propose a multi-agent orchestration framework that decomposes single-image 3D scene generation into three structured stages: scene initialization, environment construction, and multi-agent refinement. The initialization stage extracts image-derived object masks, builds object-level 3D representations, and predicts an initial spatial layout to form a coarse 3D scene. The environment-construction stage then leverages this initialization together with point-map geometry to build an environmental scaffold of supporting surfaces, room boundaries, materials, and illumination. Finally, in the refinement stage, a planner agent identifies structural and visual inconsistencies, applies simple corrections directly, and dispatches specialist agents for complex localized revisions that are reintegrated into the global scene. To provide reliable structural initialization while reducing reliance on scene-level annotations, we further introduce a geometry-aware layout predictor supervised by sparse geometric priors derived from point maps. Unlike fully supervised layout generators, the predictor can be trained from segmentation-level data and generalizes robustly to diverse real-world scenes. Extensive experiments on benchmark datasets show that our method consistently outperforms prior approaches in geometric accuracy, spatial consistency, and perceptual realism.


[129] Self-Supervised Vision Transformers for CBCT-Based Detection of Temporomandibular Joint Osteoarthritis cs.CV | cs.AIPDF

Shradhdha Trivedi, Vrundan Sojitra, Mariela Padilla

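TL;DR: This paper studies how well self-supervised vision transformers (the DINO family) transfer to detecting temporomandibular joint osteoarthritis (TMJ OA) on cone-beam CT (CBCT). The authors propose a slice-based pipeline in which a frozen or partially unfrozen ViT backbone encodes axial CBCT slices, followed by attention-based multiple instance learning (MIL) for patient-level binary classification. Unfreezing only the final two transformer blocks raises AUC from 0.671 to 0.902, outperforming the other DINO variants and a supervised ImageNet-pretrained baseline.

Details

Motivation: The osseous changes of TMJ OA are often subtle on CBCT, making automated detection challenging. The paper explores how effectively DINO-family self-supervised vision transformers transfer to CBCT data and how much backbone adaptation is needed.

Result: Systematic ablations on a multi-source CBCT dataset show that partial unfreezing (only the final two transformer blocks) is the decisive factor, raising AUC from 0.671 for a fully frozen DINOv2 to 0.902. This exceeds DINOv1 (0.867), DINOv2+reg (0.774), and a supervised ImageNet ViT-B/16 baseline (0.843), making it the best result among the compared models.

Insight: The contribution is a systematic evaluation of adaptation strategies for DINO-family self-supervised models on low-data medical imaging (CBCT), with the finding that partial unfreezing (specifically of the final two blocks) matters more for performance than the choice of backbone. This offers practical guidance for using foundation models in data-scarce medical settings: a carefully designed adaptation strategy matters more than simply adopting a stronger pretrained backbone.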

Abstract: Temporomandibular joint osteoarthritis (TMJ OA) is a prevalent degenerative condition whose osseous changes are often subtle on cone-beam CT (CBCT), making automated detection challenging. We study how well the DINO family of self-supervised vision transformers – DINOv1, DINOv2, DINOv2+reg, and RAD-DINO (a radiology-pretrained variant) – transfers to CBCT, asking how much backbone adaptation is needed and of what kind. We propose a simple slice-based pipeline using Vision Transformer (ViT) backbones: axial CBCT slices are encoded per-slice by a frozen or partially adapted ViT and aggregated via attention-based multiple instance learning (MIL) for patient-level binary OA/Normal classification. Through systematic ablation across unfreezing strategies and aggregation designs on a multi-source CBCT dataset, we find that partial unfreezing of the final two transformer blocks is the decisive factor, improving AUC from 0.671 (fully frozen DINOv2) to 0.902. This outperforms DINOv1 (0.867), DINOv2+reg (0.774), and a supervised ImageNet ViT-B/16 baseline (0.843). Our results provide practical guidance for adapting DINO-family foundation models in low-data medical imaging settings, showing that adaptation strategy is a stronger driver of performance than backbone choice alone.
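
The aggregation step described above, attention-based MIL over per-slice embeddings, can be sketched in a few lines. This is a minimal illustration of the common tanh-attention pooling form (a_k = softmax_k(w · tanh(V h_k)), z = Σ a_k h_k), assuming that form applies here; the abstract does not specify the exact aggregator, and the names `attention_mil_pool`, `V`, and `w` are hypothetical.

```python
import math


def attention_mil_pool(slice_embeddings, V, w):
    """Pool K per-slice embeddings into one patient-level embedding.

    slice_embeddings: list of K vectors, each of length D (e.g. ViT outputs).
    V: list of H rows of length D (hypothetical learned projection).
    w: vector of length H (hypothetical learned scoring weights).
    Returns (pooled_vector, attention_weights).
    """
    scores = []
    for h in slice_embeddings:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * ui for wi, ui in zip(w, hidden)))

    # Softmax over slices, stabilised by subtracting the maximum score.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attention-weighted sum of the slice embeddings.
    dim = len(slice_embeddings[0])
    pooled = [
        sum(a * h[d] for a, h in zip(weights, slice_embeddings))
        for d in range(dim)
    ]
    return pooled, weights


if __name__ == "__main__":
    slices = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]  # toy embeddings, K=3, D=2
    pooled, weights = attention_mil_pool(slices, V=[[1.0, 0.0], [0.0, 1.0]], w=[1.0, 1.0])
    print(pooled, weights)
```

The pooled vector would then feed a binary OA/Normal classifier head, and the attention weights indicate which slices contributed most to the patient-level decision.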


[130] CoVEBench: Can Video Editing Models Handle Complex Instructions? cs.CV | cs.AIPDF

Jiangtao Wu, Jiaming Wang, Yiwen He, Yuanxing Zhang, Shihao Li

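TL;DR: This paper introduces CoVEBench, a benchmark for evaluating how well video editing models handle complex compositional instructions. It contains 416 source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items, and is designed to diagnose model behavior when multiple coupled edits (such as modifying subjects, actions, and camera views) must be performed at once while unrelated spatiotemporal content is strictly preserved.

Details

Motivation: Existing video editing models handle elementary tasks well (e.g., style transfer, object insertion), but real-world user requests are usually highly compositional. Existing benchmarks are limited to isolated edits and coarse global metrics, so they cannot assess how models handle complex workflows.

Result: Extensive experiments show that compositional editing remains a major challenge: when handling multiple operations, current models frequently omit edits, violate preservation constraints, or introduce artifacts. CoVEBench evaluates models via MLLM-judged instruction compliance and video fidelity, alongside automated video quality metrics.

Insight: The contribution is a diagnostic compositional video editing benchmark oriented toward real user workflows. Its fine-grained checklists and multi-dimensional evaluation (combining MLLM judgments with automated metrics) give deeper insight into model capabilities and push video editing toward handling complex instructions.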

Abstract: While recent text-guided video editing models excel at elementary tasks (e.g., style transfer, object insertion), real-world user requests are highly compositional. A single prompt often demands multiple coupled edits, such as modifying subjects, actions, and camera views, while strictly preserving unrelated spatiotemporal content. Existing benchmarks, heavily constrained by isolated edits and coarse global metrics, fail to diagnose how models handle such complex workflows. To address this gap, we introduce CoVEBench, a compositional video editing benchmark comprising 416 curated source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items. Covering diverse editing dimensions, CoVEBench evaluates models via MLLM-judged instruction compliance and video fidelity, alongside automated metrics for video quality. Extensive experiments reveal that compositional editing remains a profound challenge: current models frequently omit edits, violate preservation constraints, or introduce artifacts when handling multiple operations simultaneously. CoVEBench provides a challenging, diagnostic testbed to advance video editing toward realistic user workflows.


[131] Reinforcing Temporal Answer Grounding in Instructional Video via Candidate-Aware Causal Reasoning cs.CVPDF

Muge Qi, Rong Fu, Pengbin Feng, Xianda Li, Yu Cai

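TL;DR: This paper proposes Candidate-Aware Causal Reasoning (CACR), a framework for temporal answer grounding in instructional video (TAGV), i.e., locating the precise video segment that answers a natural language query. It uses a Visual-Language Pre-training based Candidate Selection (VBCS) algorithm to generate candidate segments efficiently, then performs robust inference with a temporal logic reasoning module that is enhanced by a rejection reward mechanism and optimized via Group Relative Policy Optimization (GRPO).

Details

Motivation: TAGV is challenging because queries are semantically complex and untrimmed long videos are mismatched in length with short target segments. Existing methods are often sensitive to irrelevant content or lack sufficient visual reasoning capability.

Result: Extensive experiments on six benchmarks show that the method achieves state-of-the-art mean Intersection-over-Union (mIoU).

Insight: The novelty is the combination of efficient candidate selection with a causal-reasoning-enhanced temporal logic module, where the rejection reward and GRPO optimization improve reasoning robustness. This offers a new perspective on reasoning-based retrieval in long videos.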

Abstract: The task of temporal answer grounding in instructional video (TAGV), which aims to locate precise video segments that respond to natural language queries, is increasingly important for direct video answer retrieval. This task remains challenging due to the need to comprehend semantically complex questions and to address the significant length mismatch between untrimmed videos and short target moments. Existing methods often suffer from sensitivity to irrelevant content or insufficient visual reasoning capabilities. To tackle these limitations, we propose a Candidate-Aware Causal Reasoning (CACR) framework. Our approach first employs a Visual-Language Pre-training based Candidate Selection (VBCS) algorithm to efficiently generate K candidate segments, then applies a temporal logic reasoning module enhanced by a rejection reward mechanism and optimized via Group Relative Policy Optimization (GRPO) for robust inference. Extensive experiments on six benchmarks demonstrate that our method achieves state-of-the-art performance in terms of mean Intersection-over-Union (mIoU), providing a new perspective for reasoning-based retrieval in long videos.
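
Since the reported metric is mIoU over temporal segments, a short sketch of how it is usually computed may help. It assumes segments are given as (start, end) pairs in seconds; the function names are illustrative and not taken from the paper.

```python
def temporal_iou(pred, gt):
    """Intersection-over-Union of two (start, end) intervals in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Average temporal IoU over paired predictions and ground-truth segments."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


if __name__ == "__main__":
    predictions = [(0.0, 10.0), (0.0, 5.0)]
    ground_truth = [(5.0, 15.0), (0.0, 5.0)]
    print(mean_iou(predictions, ground_truth))
```

A prediction that overlaps half of a same-length target scores 1/3 rather than 1/2, because the union grows as the intersection shrinks; this is why short targets in long untrimmed videos are penalized heavily for small boundary errors.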


[132] CheXanatomy: Anatomy-Aware Vision-Language Modeling for Chest Radiographs cs.CVPDF

Sergios Gatidis, Curtis Langlotz, Christian Bluethgen

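TL;DR: This paper proposes CheXanatomy, a framework that integrates explicit anatomical knowledge into a pretrained vision-language model through autoregressive token-space supervision, with the aim of improving fine-grained anatomical understanding of chest radiographs. Training uses radiographs synthesized from CT together with their forward-projected 2D segmentation masks, so that the model learns to generate anatomical segmentation masks via next-token prediction and performs better on spatially precise tasks such as segmentation.

Details

Motivation: Existing vision-language models are primarily optimized for global alignment and do not explicitly encode fine-grained anatomical structure, which limits their suitability for spatially precise tasks such as medical image segmentation.

Result: Evaluation on synthetic and real chest radiographs shows performance comparable to specialized convolutional models (a U-Net baseline) in-distribution, and improved geometric robustness under domain shift to real CXR data. Anatomy-pretrained models also show improved sample efficiency when adapting to novel localization tasks under limited supervision.

Insight: The novelty is embedding anatomical structure directly into the generative objective via autoregressive token-space supervision, which avoids task-specific decoder heads, promotes spatially grounded representations, and supports anatomy-aware medical vision-language modeling. The use of synthetic data makes the supervision scalable and offers an effective route to fine-grained understanding in medical VLMs.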

Abstract: Vision-language models (VLMs) pretrained on large-scale image-text pairs demonstrate strong image-level understanding, but are primarily optimized for global alignment and do not explicitly encode fine-grained anatomical structure, limiting their suitability for spatially precise tasks such as segmentation. We introduce CheXanatomy, a framework that integrates explicit anatomical knowledge into a pretrained VLM through autoregressive token-space supervision. Instead of adding task-specific decoder heads, the model is trained to generate anatomical segmentation masks via next-token prediction. To enable scalable supervision, we synthesize realistic chest radiographs from CT volumes and forward-project CT segmentation labels to obtain anatomically consistent 2D masks. We evaluate the approach on synthetic and real chest radiographs against a U-Net baseline, including ablations on model scale, input resolution, and vision encoder fine-tuning. Autoregressive anatomical supervision achieves performance comparable to specialized convolutional models in-distribution and demonstrates improved geometric robustness under domain shift to real CXR data. In addition, anatomy-pretrained models exhibit improved sample efficiency when adapting to novel localization tasks under limited supervision. Larger models and higher input image resolution improve performance, while vision encoder fine-tuning has limited effect. These results show that embedding anatomical structure directly into the generative objective promotes spatially grounded representations and supports anatomy-aware medical vision-language modeling.


[133] Seeing is Believing: Aligning Prompt Rewriting with Visual Anchors for Text-to-Image Generation cs.CV | cs.AIPDF

Xuanyi Liu, Deyi Ji, Junyu Lu, Jing Wang, Qianxiong Xu

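TL;DR: This paper proposes FaithRewriter, a framework that enhances user prompts for text-to-image generation by incorporating intermediate visual cues generated by a multimodal large language model, with the aim of narrowing the gap between user intent and generated images.

Details

Motivation: Existing prompt rewriting methods focus mainly on textual fluency and lack visual grounding, so they tend to over-infer details and create an intent-generation gap.

Result: Experiments show that prompts produced by FaithRewriter are more faithful to user intent and more visually plausible than those of strong baselines, helping narrow the intent-generation gap.

Insight: The novelty is the use of intermediate visual cues as visual anchors to guide prompt rewriting, combined with knowledge distillation that deploys the enhancement capability in a small LLM, pairing visual grounding with efficiency.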

Abstract: Despite the impressive capabilities of text-to-image (T2I) models, an intent-generation gap often persists due to the brevity and ambiguity of user prompts. Existing approaches primarily polish the prompt for fluency and readability. However, the enhancement process still lacks visual grounding. As a result, the rewriter may over-infer missing details, causing an intent-generation gap. To address this limitation, we propose FaithRewriter, a novel prompt-enhancement framework for T2I generation. Specifically, FaithRewriter first leverages a multimodal MLLM to generate an image from the original prompt as an intermediate visual cue. This cue is then combined with the prompt and fed into a large-scale LLM to produce visually grounded augmentations that better reflect how the intended content should appear in images. Finally, these augmentations are distilled into a small-scale LLM for efficient deployment, enhancing its ability to generate effective T2I prompts. Experiments show that FaithRewriter yields prompts that are more faithful to the user intent and more visually plausible than strong baselines, helping narrow the intent-generation gap.


[134] OmniTryOn: Video Try-On Anything at Once! cs.CVPDF

Changliang Xia, Chengyou Jia, Minnan Luo, Zhuohang Dang, Xin Shen

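TL;DR: This paper proposes OmniTryOn, a framework for simultaneous multi-object try-on in video virtual try-on. It is a generative method that needs no external priors and transfers multiple wearable items onto a person in a video in a single inference pass.

Details

Motivation: Existing video virtual try-on methods are limited to single-garment transfer and rely on external priors (e.g., garment masks), which disrupts physical dynamics and degrades visual quality. The paper therefore aims for a solution that handles multi-object try-on at once without external priors.

Result: On TryAny-Bench, OmniTryOn significantly outperforms existing specialized video virtual try-on models and general video editing baselines, setting a strong new standard for the task.

Insight: The contributions include a First Frame Wearable Cache strategy that supplies wearable objects directly through the initial frame, Spatiotemporally Consistent RoPE (STC-RoPE) for maintaining spatiotemporal consistency, and a Gradual Try-On (GTO) training strategy that progressively improves multi-object synthesis.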

Abstract: Although video virtual try-on (VVT) has achieved significant progress, existing methods still exhibit two fundamental limitations: first, they are restricted to single-garment transfer, rendering simultaneous multi-object try-on highly impractical; second, their heavy reliance on explicit external priors (e.g., garment masks) inevitably destroys crucial physical dynamics and degrades visual quality. To bridge this gap, this paper proposes the novel Try-On Anything task, which aims to simultaneously transfer diverse wearable objects onto a person in a video in a single inference pass. To support and standardize this paradigm, we introduce TryAny-Bench, a comprehensive benchmark encompassing a paired video dataset alongside a tailored evaluation protocol. Furthermore, we present OmniTryOn, an external-prior-free generative framework designed to tackle this task. Specifically, OmniTryOn employs a First Frame Wearable Cache strategy, which directly provides diverse wearable objects for the generation process through the initial video frame. To maintain consistency, we propose the Spatiotemporally Consistent RoPE (STC-RoPE), which inherently establishes robust spatiotemporal anchors to strictly preserve complex human motions and background dynamics. Optimized by the proposed Gradual Try-On (GTO) training strategy, our model progressively masters robust multi-object synthesis. Extensive experiments on TryAny-Bench demonstrate that OmniTryOn significantly outperforms existing specialized video virtual try-on models and general video editing baselines, establishing a powerful new standard for the Try-On Anything task. Our dataset, code, and models are available at https://github.com/xcltql666/OminTryOn.


[135] Look Less, Reason More: Block-wise Attention Skipping for Efficient Multimodal LLMs cs.CVPDF

Jie Ma, Zhike Qiu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

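TL;DR: This paper addresses the quadratic self-attention cost that long visual token sequences impose on multimodal large language models (MLLMs) and proposes Visual-Skip (V-Skip), a training-free inference paradigm. Building on the observed "Visual Attention Saturation" phenomenon, it applies block-wise structured sparsity in deeper layers to selectively skip redundant visual self-attention modules while keeping the feed-forward networks to preserve semantic evolution, and it uses a lightweight few-shot calibration to adjust the sparsity path to each task.

Details

Motivation: Current MLLMs face a severe inference bottleneck from long visual token sequences, whose self-attention cost grows quadratically. The authors find that visual self-attention in deeper layers is redundant (Visual Attention Saturation), whereas feed-forward networks remain essential for projecting visual features into the textual semantic space. They therefore aim to decouple spatial interaction from semantic evolution to improve efficiency.

Result: Extensive experiments across diverse MLLMs show that V-Skip achieves block-wise sparsity with performance retention between 94.16% and 100.31%, improving inference efficiency while preserving model performance.

Insight: The novelty is identifying Visual Attention Saturation and deriving from it a training-free inference optimization that skips redundant visual self-attention through structured sparsity. The core idea is "look less, reason more": the model does not need to discard visual information, only to reduce unnecessary attention computation at the right depth, which offers a new perspective on efficient MLLM design.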

Abstract: Multimodal Large Language Models (MLLMs) face a significant inference bottleneck due to the quadratic computational cost of self-attention over long visual token sequences. However, we identify a critical inefficiency in current architectures: Visual Attention Saturation. Our analysis reveals that visual tokens rapidly establish their spatial structure and intra-modal relationships in early layers, rendering visual-to-visual self-attention in deeper layers computationally redundant. Conversely, Feed-Forward Networks (FFNs) in these layers remain essential for projecting visual features into the evolving textual semantic space. Leveraging this insight, we present Visual-Skip (V-Skip), a training-free inference paradigm that decouples spatial interaction from semantic evolution. Rather than discarding tokens, V-Skip imposes block-wise structured sparsity by selectively bypassing saturated visual self-attention modules. Furthermore, recognizing that varying downstream tasks demand distinct reasoning depths, V-Skip employs a lightweight, few-shot calibration to dynamically route the task-optimal sparsity path. Extensive experiments demonstrate that V-Skip effectively bypasses redundant vision attention to achieve block-wise sparsity, maintaining a 94.16% to 100.31% performance retention across diverse MLLMs. Ultimately, we prove that to reason more effectively, models do not need to discard what they see – they simply need to “look less” at the right depth.
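
The idea of bypassing attention while keeping the FFN can be illustrated with a toy residual block. This is a simplified sketch, not V-Skip itself: it bypasses the whole attention sublayer of a flagged block, whereas the paper targets visual-to-visual self-attention specifically, and the per-layer `skip_mask` is simply given here rather than chosen by few-shot calibration. All function names are hypothetical.

```python
def uniform_attention(tokens):
    """Toy stand-in for self-attention: each token receives the mean of all tokens."""
    n, dim = len(tokens), len(tokens[0])
    mean = [sum(tok[d] for tok in tokens) / n for d in range(dim)]
    return [list(mean) for _ in tokens]


def halving_ffn(tokens):
    """Toy stand-in for a feed-forward network, applied to each token independently."""
    return [[0.5 * v for v in tok] for tok in tokens]


def transformer_block(tokens, attn, ffn, skip_attention=False):
    """One residual block; the attention sublayer is optional, the FFN always runs."""
    if not skip_attention:
        update = attn(tokens)
        tokens = [[x + u for x, u in zip(tok, upd)] for tok, upd in zip(tokens, update)]
    update = ffn(tokens)
    return [[x + u for x, u in zip(tok, upd)] for tok, upd in zip(tokens, update)]


def run_layers(tokens, layers, skip_mask):
    """Run a stack of (attn, ffn) layers, skipping attention where skip_mask is True."""
    for (attn, ffn), skip in zip(layers, skip_mask):
        tokens = transformer_block(tokens, attn, ffn, skip_attention=skip)
    return tokens


if __name__ == "__main__":
    layers = [(uniform_attention, halving_ffn)] * 4
    # Keep attention in the two early layers, bypass it in the two deeper ones.
    print(run_layers([[1.0], [3.0]], layers, [False, False, True, True]))
```

Because the skipped sublayer reduces to the identity on the residual path, no tokens are dropped: the sequence length and the FFN computation are unchanged, and only the quadratic attention cost of the flagged blocks is saved.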


[136] TVI-CoT: Text-Visual Interleaved Chain-of-Thought Reasoning for Multimodal Understanding cs.CVPDF

Lianyu Hu, Xiaoyu Ma, Zeqin Liao, Yang Liu

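TL;DR: This paper proposes TVI-CoT (Text-Visual Interleaved Chain-of-Thought), a framework for strengthening the reasoning of multimodal large language models (MLLMs). By introducing learnable control tokens (, , ), it explicitly interleaves textual reasoning with access to visual features, overcoming the "vision-blind reasoning" limitation of existing CoT methods, which cannot access visual information during reasoning.

Details

Motivation: Existing chain-of-thought (CoT) methods applied to MLLMs have a fundamental limitation: reasoning is carried out entirely in text, with no access to visual features along the way. The model can therefore rely only on the initial image description ("vision-blind reasoning"), which limits fine-grained visual extraction, error verification, and adaptive attention.

Result: Experiments on eight benchmarks show state-of-the-art results among MLLM-based CoT methods and notable gains over the baseline: +6.1% on MMMU, +3.8% on MathVerse, +3.4% on MathVista, and +3.4% on ScienceQA.

Insight: The core novelty is a framework that dynamically interleaves textual reasoning and visual perception through learnable control tokens, integrating visual access into the reasoning chain. This gives multimodal reasoning a more flexible, finer-grained interaction paradigm in which the model attends to relevant image regions according to its current reasoning state, an effective route to stronger complex problem solving in MLLMs.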

Abstract: Chain-of-thought (CoT) reasoning has proven effective for enhancing problem-solving in large language models. However, when applied to multimodal LLMs (MLLMs), existing CoT approaches suffer from a fundamental limitation: they perform reasoning entirely in text without accessing visual features during the reasoning process. After initial visual encoding, image information becomes inaccessible, forcing models to reason based solely on whatever was captured in the initial description, which forms a "vision-blind reasoning" paradigm that limits fine-grained visual extraction, error verification, and adaptive attention. We propose Text-Visual Interleaved Chain-of-Thought (TVI-CoT), a framework that enables explicit interleaving of textual reasoning and visual feature access through learnable control tokens , and . These tokens allow dynamic switching between reasoning and visual grounding, attending to relevant image regions conditioned on the evolving reasoning state. Experiments on eight benchmarks demonstrate state-of-the-art results among MLLM-based CoT methods and notable performance boost compared to the baseline: +6.1% on MMMU, +3.8% on MathVerse, +3.4% on MathVista, and +3.4% on ScienceQA. Code is available at https://github.com/hulianyuyy/TVI-CoT.


[137] DriveReward: A Comprehensive Dataset and Generative Vision-Language Reward Model for Autonomous Driving cs.CVPDF

Qimao Chen, Fang Li, Yuechen Luo, Zehan Zhang, Haiyang Sun

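TL;DR: This paper presents DriveReward, a comprehensive dataset and generative vision-language reward model for autonomous driving. It builds a dataset with temporally-grounded visual guidance annotations and counterfactual driving behaviors and trains a specialized vision-language reward model, addressing the limited generalization of existing reward models that depend on hand-crafted rules or perception ground truth.

Details

Motivation: Reward models in autonomous driving typically depend on hand-crafted rule-based metrics or perception ground truth, which hinders generalization as data scales. Vision-language models have shown promise as reward models in other domains, but their effectiveness in driving tasks remains underexplored.

Result: Evaluation shows that even leading open-source and proprietary vision-language models fail to excel across all tasks. The specialized 1B-parameter reward model outperforms larger vision-language models on task-specific reward alignment. Integrated into RL fine-tuning and multi-modal trajectory scoring, it achieves performance comparable to rule-based reward calculation in both open-loop and closed-loop evaluation.

Insight: The main contributions are: 1) a driving trajectory evaluation dataset built with temporally-grounded visual guidance and a counterfactual annotation scheme, which compensates for the scarcity of failure cases in conventional datasets; and 2) a small, efficient vision-language reward model specialized for driving that outperforms general-purpose large models on task-specific alignment.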

Abstract: Reward models play a pivotal role in reinforcement learning (RL) and multi-modal trajectory selection for autonomous driving. However, acquiring such rewards typically relies on hand-crafted rule-based objectives or perception ground truth, which hinders generalization for data-scaling. While Vision-Language Models (VLMs) have demonstrated feasibility as reward models in other domains, their effectiveness in driving tasks remains underexplored. In this work, we bridge this gap by introducing (1) DriveReward, a reasoning trajectory evaluation dataset rigorously labeled via temporally-grounded visual guidance and augmented with counterfactual driving behaviors, alongside (2) a specialized Vision-Language Reward Model. To address the scarcity of failure cases in conventional datasets, we propose a counterfactual data annotation scheme to construct cases encompassing diverse driving styles and erroneous behaviors. Evaluations on our proposed benchmark reveal that even leading open-source and proprietary VLMs fail to excel across all tasks, highlighting significant room for improvement in existing models. Building on these findings, we subsequently tailor a specialized 1B reward model that outperforms larger VLMs on task-specific reward alignment. Finally, we validate our reward model’s effectiveness by integrating it into RL finetuning and multi-modal trajectory scoring across multiple baselines, achieving performance comparable to rule-based reward calculations in both open-loop and closed-loop evaluation.


[138] Towards Accurate Emotion-Attributed Video Captioning via Fine-grained Emotion-Cause Pair Extraction cs.CVPDF

Weidong Chen, Cheng Ye, Zhendong Mao, Liping Wang, Xinyan Liu

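TL;DR: This paper proposes a fine-grained emotion-cause pair extraction framework for emotion-attributed video captioning. It augments visual features with a Concept-aware Visual Semantic Decomposition module, refines emotional features with a Visual-guided Emotion Interpretable Learning module, and then extracts and aligns emotion-cause pairs through cross-modal coupling and a contrastive loss, producing more accurate and emotionally richer video captions.

Details

Motivation: Existing emotional video captioning methods usually mine global emotional cues from holistic visual features, overlooking the key property that emotions are evoked by specific visual causes, which are usually only implied in core video segments. This leads to information redundancy and inaccurate emotional cues, so fine-grained extraction of visual causes is needed to support emotion perception and caption generation.

Result: Extensive experiments on three challenging datasets demonstrate the method's superiority. For example, on EVC-MSVD it achieves the best performance, with improvements of +4.4% in BLEU-2 and +5.4% in ROUGE-L.

Insight: The novelty is the fine-grained emotion-cause pair extraction framework, which strengthens visual and emotional features in two rounds of learning and enforces semantic alignment through cross-modal coupling and a contrastive loss. This improves complex semantic understanding and emotion perception for video, offering a new approach to emotional video captioning.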

Abstract: Emotional Video Captioning (EVC) is a challenging task that aims to generate factually accurate and emotionally rich descriptions for videos. Existing EVC methods leverage holistic visual features to mine global emotional cues, and then aggregate multimodal features to guide the emotional caption generation, which ignores the critical characteristic of the EVC task. Visual emotions are evoked by specific motivational causes, which are usually only implied in core video segments. The holistic mining brings significant information redundancy and inaccurate emotional cues. Thus, fine-grained visual cause extraction has a facilitative effect on both emotion perception and emotion-attributed caption generation. To this end, we propose a fine-grained emotion-cause pair extraction framework for emotion-attributed video captioning. Specifically, we learn pair-wise emotion and cause features in two rounds: 1) We propose a Concept-aware Visual Semantic Decomposition module to augment visual features by exploring scene, object, and motion concepts. Besides, to enhance emotional features, we propose a Visual-guided Emotion Interpretable Learning module, which guides emotion refinement with visual temporal dynamics, and augments the interpretable refinement process by reliable VAD-vector constraints. 2) We achieve emotion-cause pair extraction by cross-coupling the visual and emotional features before and after refinement, and leverage contrastive loss to achieve semantic forced alignment. Overall, our approach optimizes complex semantic understanding and emotion perception of videos, leading to a promising performance in emotional captioning. Extensive experiments on three challenging datasets demonstrate the superiority of our approach and each proposed module, e.g., achieving the best performances with +4.4% and +5.4% w.r.t. BLEU-2 and ROUGE-L, respectively, on the EVC-MSVD dataset.


[139] Facial Expression Recognition in the Deep Learning Era: A Systematic Multi-Criteria Review of Methods, Models, Datasets, Performance, Challenges, and Future Research Directions cs.CVPDF

Spyridon Georgiou, Aggelos Psiris, Spyridon Evangelatos, Thomas Lagkas, Vasileios Argyriou

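TL;DR: This paper is a systematic survey of facial expression recognition (FER) in the deep learning era. It reviews the evolution of methods from handcrafted features to foundation models, proposes a multi-criteria taxonomy for analyzing the literature, and covers datasets, performance metrics, current challenges, and future directions.

Details

Motivation: Existing surveys of deep learning-based FER are mostly restricted to particular tasks, architectures, or applications, and a comprehensive, systematically organized overview is missing. The paper aims to fill this gap with a review explicitly linked to the wider Facial Affect Recognition (FAR) domain.

Result: The paper proposes no new model. It compiles performance metrics and a per-task quantitative comparison of representative state-of-the-art methods on widely adopted benchmarks.

Insight: The novelty is a multi-criteria taxonomy that analyzes the FER literature along seven complementary axes (e.g., recognition task, network architecture, learning strategy), together with critical insights into the strengths and limitations of each category under in-the-wild conditions, giving the field a clear organizing framework and roadmap.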

Abstract: Facial Expression Recognition (FER) has advanced rapidly over the last decade, driven by the shift from handcrafted descriptors and shallow classifiers to deep convolutional, attention-based, vision-language, and foundation-model architectures, and by the parallel growth of large-scale in-the-wild benchmarks spanning categorical, dimensional, compound, micro-expression, Action Unit (AU), and intensity-estimation tasks. Yet the deep learning-based FER landscape has so far been reviewed only along narrow task-, architecture-, or application-specific axes, leaving a holistic, systematically organized account of its recent advances missing. This survey addresses that gap with a comprehensive review of recent deep learning-based FER, explicitly linked to the wider Facial Affect Recognition (FAR) domain. Its main contributions are: a) A description of FER’s evolution into five distinct phases, from handcrafted features and classical machine learning to attention-based, vision-language, and foundation-model approaches, with the key milestone works of each, b) A multi-criteria taxonomy analyzing the literature along seven complementary axes: recognition task, input modality, face pre-processing pipeline, network architecture, learning strategy, acquisition setting, and application domain, c) A per-criterion comparative analysis, with critical insights into the strengths and limitations of each category under in-the-wild conditions, d) A task-organized review of public FER datasets, with their annotation schemes, modalities, and evaluation protocols, e) A compilation of performance metrics and a per-task quantitative comparison of representative state-of-the-art methods on widely adopted benchmarks, and f) A discussion of current challenges and promising future directions.


[140] Harnessing Streaming Video in the Wild cs.CV | cs.CLPDF

Dingyu Yao, Shuhuan Gu, Qingyi Si, Junhao Zhou, Chenxu Yang

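TL;DR: This paper addresses the shortcomings of vision-language models (VLMs) in handling unbounded video streams (e.g., video-call assistants, live commentary) with a three-part solution: Streaming-Train-248K, a dataset for streaming training; Streaming Harness, a plug-and-play system that gives any VLM core streaming abilities; and Streaming-Eval, an evaluation benchmark.

Details

Motivation: Existing VLMs excel at offline video understanding but lack both the ability to handle real-time, long-duration, interactive video streams and dedicated deployment infrastructure, so they fall short of practical application needs.

Result: Extensive experiments show consistent gains across all core capabilities required for streaming video understanding. The abstract does not report quantitative results on specific benchmarks or comparisons with the state of the art.

Insight: The novelty is a systematic effort to close the streaming gap for VLMs at three levels: data, deployment system, and evaluation benchmark. In particular, Streaming Harness decouples and integrates three core abilities: proactive interaction, long-term memory, and real-time processing.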

Abstract: Vision-Language Models (VLMs) are increasingly required to process unbounded video streams in applications such as video-call assistants, live commentary, and embodied robots. An ideal streaming system should support proactive interaction, long-horizon memory, and real-time processing, while resting on a VLM backbone capable of handling diverse in-the-wild streaming tasks. However, existing VLMs excel at offline video understanding but fall short in streaming capabilities and lack dedicated infrastructure for streaming deployment. We address this gap on three fronts. (i) For backbone capability, we construct Streaming-Train-248K, a streaming dataset paired with a novel training objective for adapting VLMs to streaming interaction and understanding. (ii) For real-world deployment, we introduce Streaming Harness, a plug-and-play system that endows any VLM with three core abilities: proactive interaction (per-second response decisions), long-term memory (12-hour context retention), and real-time processing (sub-second latency). (iii) To drive continued community progress on streaming capabilities, we design Streaming-Eval, a benchmark that reflects models’ capabilities across diverse in-the-wild scenarios. Extensive experiments demonstrate consistent gains from our approach across all core capabilities required for streaming video understanding. We will open-source our data, code, and benchmark to advance the community’s shift from offline video understanding to deployable streaming intelligence.


[141] OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning cs.CVPDF

Jiahao Wang, An Ping, Yanghai Wang, Yuanxing Zhang, Shihao Li

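TL;DR: This paper introduces OmniCap-IF, the first comprehensive benchmark dedicated to evaluating instruction following in omni-modal video captioning. It includes a systematic framework that assesses captions on two dimensions, format correctness and content correctness, covers 50 constraint types across pure visual, pure audio, and audio-visual modalities, and integrates temporal grounding to assess spatio-temporal precision. The paper also uncovers a "format-content tradeoff" in current models, curates a 54K instruction-tuning dataset (OmniCap-IF-54K), and presents an improved model, OmniCaptioner-IF.

Details

Motivation: Existing benchmarks focus on holistic video understanding or text-only instruction following and fail to capture the intricate interplay between modalities and user constraints. Closing this gap requires a dedicated evaluation of how strictly omni-modal large language models follow complex, multi-faceted user instructions.

Result: Extensive evaluation of prominent models on 1,920 high-quality samples reveals significant performance disparities. The proposed OmniCaptioner-IF achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.

Insight: The contribution is the first benchmark focused on omni-modal instruction following, with systematic evaluation along the format and content dimensions. A key finding is the "format-content tradeoff": increasing formatting complexity directly degrades a model's omni-modal reasoning. The instruction-tuning dataset and model offer a concrete way to improve performance in this area.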

Abstract: While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks primarily focus on holistic video understanding or text-only instruction following, failing to capture the intricate interplay between modalities and user constraints. To bridge this gap, we introduce OmniCap-IF, the first comprehensive benchmark specifically designed to evaluate instruction-following capabilities in omni-modal captioning. OmniCap-IF incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our benchmark encompasses 50 distinct constraint types across pure visual, pure audio, and audio-visual modalities, while integrating Temporal Grounding to assess spatio-temporal precision. Extensive evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities. Furthermore, our analysis uncovers a critical “format-content tradeoff”, demonstrating that increasing formatting complexity directly degrades models’ omni-modal reasoning abilities. Finally, to advance the field, we curate a 54K instruction-tuning dataset, OmniCap-IF-54K and present OmniCaptioner-IF, which achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.


[142] SSAFE: Simple and Strong AI-Generated Image Detection via Frozen Vision Encoders cs.CVPDF

Seunghyun Lee, Byoungkwon Kim, Jaehyun Nam, Kyungmin Lee, Jinwoo Shin

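TL;DR: This paper proposes SSAFE, which exploits the fact that the embedding space of frozen multimodal vision encoders (such as CLIP) naturally separates real from synthetic images, so a linear classifier alone yields strong AI-generated image detection. With a representation-aware data curation strategy, a small training set of only 10K images delivers strong performance on multiple benchmarks and robustness to unseen generators.

Details

Motivation: Rapid progress in generative models has blurred the boundary between synthetic and real imagery, creating an urgent need for reliable deepfake detection. Most existing methods rely on massive real-fake datasets, which are hard to maintain as new generators keep emerging. The paper investigates how much image-authenticity information is already encoded in modern multimodal vision representations, aiming for simple yet strong detection without task-specific fine-tuning.

Result: Experiments show that frozen multimodal encoders combined with a linear classifier achieve strong performance on multiple benchmarks, outperforming methods that rely on large-scale datasets (such as the 288K images of AIGIBench and the 4M images of OpenFake). On the proposed RealWorldBench benchmark (modern camera photographs, contemporary stock images, and outputs from recent commercial generators), the method is robust to unseen generators and distribution shifts.

Insight: The novelty lies in the finding that the embedding space of frozen multimodal encoders inherently distinguishes real from synthetic images and can be used for detection without fine-tuning, together with a representation-aware data curation strategy that achieves strong generalization from a very small training set. This gives a simple, efficient, and data-efficient solution for AI-generated image detection.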

Abstract: The rapid advancement of generative models has blurred the boundary between synthetic and real imagery, creating an urgent need for reliable deepfake detection. Yet most existing approaches rely on massive real–fake datasets, which are increasingly difficult to maintain as new generators continue to emerge. In this work, we investigate how much information about image authenticity is already encoded in modern multimodal vision representations. We find that frozen multimodal encoders naturally separate real and synthetic images in their embedding space, enabling a simple linear classifier to achieve strong performance without task-specific fine-tuning. Motivated by this observation, we develop a representation-aware data curation strategy that selects a compact set of representative generators for training. The resulting training set contains only 10K images, compared to 288K in AIGIBench and 4M in OpenFake, while improving robustness to unseen generators and distribution shifts. We additionally introduce RealWorldBench, a benchmark consisting of modern camera photographs, contemporary stock images, and outputs from recent commercial generators. Experiments across multiple benchmarks show that combining frozen multimodal representations with carefully curated training data provides a simple and effective approach to AI-generated image detection.
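The "frozen encoder plus linear classifier" recipe can be illustrated with a minimal sketch. It assumes image embeddings have already been extracted by some frozen encoder and are given as plain lists of floats; the function names, the toy 2-D embeddings, and the logistic-regression training loop are illustrative assumptions, not the authors' implementation.

```python
import math

def train_linear_probe(embeddings, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe on fixed (frozen) embeddings with plain SGD."""
    dim = len(embeddings[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted probability of "synthetic"
            g = p - y                        # gradient of the log loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Return 1 (synthetic) or 0 (real) for one embedding."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Hypothetical, hand-made embeddings: label 0 = real, label 1 = synthetic.
train_x = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
train_y = [0, 0, 1, 1]
w, b = train_linear_probe(train_x, train_y)
print(predict(w, b, (0.85, 0.15)), predict(w, b, (0.15, 0.85)))
```

Only the probe's weights are trained here; the encoder that produced the embeddings is never updated, which is the property the abstract emphasizes.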


[143] Learnable Token Sparsification for Efficient Gigapixel Whole Slide Image Reasoning cs.CVPDF

Jingzhi Chen, Landi He, Zhuo Chen, Shawn Young, Lijian Xu

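TL;DR: This paper proposes a learnable token sparsification method for efficient processing of gigapixel whole slide images. By recasting token reduction as a trainable sparsification problem, the model learns an optimal selection strategy instead of relying on fixed heuristics. At inference the method keeps only the 32 highest-scoring tokens, substantially compressing the visual sequence.

Details

Motivation: Because whole slide images yield an excessive number of visual tokens, existing methods typically resort to spatial downsampling or training-free heuristic pruning, which often discards subtle but clinically meaningful patterns scattered across the tissue.

Result: On the SlideBench (TCGA) benchmark, the method compresses the visual sequence to just 32 tokens (about 0.78% of the original length) and reaches 73.32% overall accuracy, surpassing sampling-based baselines and general-purpose vision-language models. It also shows strong zero-shot generalization on SlideBench (BCNB) and WSI VQA*.

Insight: The novelty is a trainable token sparsification framework built around the SparseLearn component (a variance-preserving noise gate plus an attention denoiser), which enables gradient propagation during training. The module is discarded entirely at inference, where only the trained scorer runs a deterministic Hard Top-K operation, keeping computation efficient. This offers a new paradigm for resolving the visual context bottleneck and preventing the dilution of sparse diagnostic evidence.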

Abstract: The processing of gigapixel whole slide images within vision language models faces a major difficulty due to an excessive number of visual tokens. Existing solutions typically rely on spatial downsampling or heuristic pruning strategies that operate without training, and these methods often discard subtle but clinically meaningful patterns because pathological evidence is scattered irregularly across the tissue. To overcome this limitation, we reformulate token reduction in whole slide images as a trainable sparsification problem, allowing the model to learn an optimal selection strategy instead of following fixed heuristics. We propose a decoupled routing architecture. To enable gradient propagation through the nondifferentiable pruning operation during training, we introduce a component called SparseLearn. This component uses a variance-preserving noise gate that regulates the information flow of each patch via a differentiable Soft Top-K operator, together with a diagonal attention denoiser that recovers perturbed representations without leaking spatial information. At inference time, the SparseLearn module is entirely discarded, and the trained scorer applies a deterministic Hard Top-K operator to keep only the highest scoring 32 tokens, incurring no extra computation. By compressing the visual sequence down to a sparse set of just 32 tokens, which represents as little as 0.78% of the original length, our framework achieves 73.32% overall accuracy on SlideBench (TCGA), consistently surpassing sampling-based baselines and general-purpose vision language models. It also demonstrates strong zero shot generalization on SlideBench (BCNB) and WSI VQA*. By resolving the visual context bottleneck and preventing the dilution of sparse diagnostic evidence, this work provides a highly efficient paradigm for end to end gigapixel whole slide image reasoning.
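The inference-time Hard Top-K step can be sketched as follows. This is a minimal illustration, assuming per-token scores are already available as a list; the 4,096-token sequence length is an assumption chosen because 32 / 4,096 matches the reported 0.78%, and the scores are made up.

```python
def hard_top_k(scores, k=32):
    """Return the indices of the k highest-scoring tokens, in their original order."""
    if k >= len(scores):
        return list(range(len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical scores for 4,096 patch tokens (a deterministic permutation of values).
scores = [((i * 37) % 4096) / 4096 for i in range(4096)]
kept = hard_top_k(scores, k=32)
print(len(kept), f"{len(kept) / len(scores):.2%}")  # prints: 32 0.78%
```

Because the selection is a plain sort over scores, it is deterministic and needs none of the training-time machinery (noise gate, denoiser) described in the abstract.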


[144] FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning cs.CV | cs.AI | cs.LG | cs.ROPDF

Haihao Lin, Xiangsheng Huang, Xiao Yang, Weibang Zhou, Yiqi Zhang

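TL;DR: This paper proposes FiberTune to address residual visual collapse along local action fibers, which arises in the visual representations of vision-language-action (VLA) policies during action-supervised fine-tuning. The method uses an online action probe to estimate action-predictive feature directions, filters those directions out of intermediate visual-token representations, aligns the filtered residuals to a frozen visual teacher, and regularizes their effective rank. It thereby preserves teacher-structured visual residuals at training time without adding inference overhead.

Details

Motivation: Action-supervised fine-tuning of existing VLA policies fits demonstrations effectively but constrains only the directions that change predicted actions, so visual structure that is consistent across action-equivalent states is free to collapse. This is the problem of residual visual collapse along local action fibers.

Result: Under identical training conditions, FiberTune outperforms task-loss-only fine-tuning in all six controlled simulation settings (spanning the CALVIN and SO-101 benchmarks and the pi_0.5 and OpenVLA-OFT architectures). For example, SR(5) improves by 10.7 percentage points on the long-horizon CALVIN ABC-to-D task, and the success rate on physical SO-101 pick-place rises from 72.7% to 78.1%.

Insight: The paper formalizes residual visual collapse on action fibers and proposes a training-time objective with no inference overhead that preserves the teacher's visual structure by filtering visual residuals with an online action probe and aligning them to the teacher. Residual diagnostics show that the performance gains coincide with increased teacher alignment and effective rank of the probe-filtered residuals.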

Abstract: Action-supervised fine-tuning of vision-language-action (VLA) policies fits demonstrations effectively but constrains only the directions that change predicted actions, leaving visual structure consistent across action-equivalent states free to collapse. We formalize this as residual visual collapse along local action fibers and propose FiberTune, a training-time objective that preserves teacher-structured visual residuals without adding inference-time overhead. FiberTune uses an online action probe to estimate action-predictive feature directions, filters them from intermediate visual-token representations, and aligns the resulting probe-filtered residuals to a frozen visual teacher while regularizing their effective rank. Under identical training conditions, FiberTune improves over task-loss-only fine-tuning in every one of six controlled simulation settings spanning two benchmarks and two architectures (pi_0.5 and OpenVLA-OFT), as well as on physical SO-101 pick-place; representative gains include +10.7 percentage points SR(5) on long-horizon CALVIN ABC-to-D and physical SO-101 task success rising from 72.7% to 78.1%. Residual diagnostics show that these gains coincide with increased probe-filtered residual teacher alignment and effective rank, consistent with the action-fiber motivation.
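The idea of filtering action-predictive directions out of a feature can be sketched with a plain orthogonal projection. This is only an illustration under strong assumptions: a single probe direction, ordinary vector projection, and hypothetical names; the abstract does not specify the exact filter FiberTune uses.

```python
def filter_direction(feature, direction):
    """Remove the component of `feature` that lies along `direction`."""
    norm_sq = sum(d * d for d in direction)
    coeff = sum(f * d for f, d in zip(feature, direction)) / norm_sq
    return [f - coeff * d for f, d in zip(feature, direction)]

# Hypothetical 3-D visual-token feature and a probe-estimated action direction.
feature = [2.0, 1.0, -1.0]
action_direction = [1.0, 0.0, 0.0]
residual = filter_direction(feature, action_direction)
print(residual)  # the residual has no component along the action direction
```

In this sketch the residual is what would then be aligned to a frozen teacher's features; the alignment loss and the effective-rank regularizer are not shown.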


[145] BioVid: Autoregressive Video Generation with Biological Behavior Semantic Comprehension cs.CV | cs.AIPDF

Tsung-Wei Pan, Jung-Hua Wang

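TL;DR: This paper presents BioVid, a data-driven autoregressive video generation framework that learns the natural temporal structure of biological behaviors, including their length distributions. The method is a two-stage pipeline: an FSQ-R3GAN tokenizer first encodes video frames into discrete representations, and a causal Transformer then models the token sequences autoregressively and learns to emit an End-of-Sequence (EOS) token when the behavioral event reaches semantic closure. Experiments show that the length distribution of BioVid's generated videos closely matches real data while spatial fidelity stays competitive.

Details

Motivation: Existing video generation frameworks treat sequence duration as an externally prescribed parameter (a fixed frame count or a text prompt), so the temporal boundaries of generated clips are decoupled from the statistical structure of real behavioral data. This conflicts with biological behavior, where action duration varies naturally and is encoded in the data itself. The paper aims to resolve this fundamental misalignment.

Result: On a human drinking behavior dataset (NTU RGB+D, A001, n=94), BioVid's generated length distribution closely matches the test data, with a Wasserstein-1 distance of 1.24, far better than a fixed-length baseline (6.05) and VideoGPT (15.48), while spatial fidelity stays competitive.

Insight: The core idea is to model video duration as an intrinsic property learned from data rather than an external hyperparameter. This is realized through a two-stage framework: a stable tokenizer combining FSQ with R3GAN avoids codebook collapse, and a causal Transformer models sequences autoregressively and learns to emit EOS tokens, so that the termination distribution emerges naturally from the training data. This suggests a new way to generate videos that follow the statistics of biological behavior.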

Abstract: Existing video generation frameworks treat sequence duration as an externally prescribed parameter – fixed frame counts or text prompts – producing clips whose temporal boundaries are decoupled from the statistical structure of real behavioral data. This assumption is fundamentally misaligned with biological behavior, where action duration varies naturally across individuals and instances and is encoded in the data itself. We present BioVid, a data-driven autoregressive video generation framework that learns the temporal structure of biological behaviors directly from training data, including their natural length distributions. In the first stage, a Finite Scalar Quantization GAN (FSQ-R3GAN) tokenizer encodes each video frame into a compact discrete representation, combining the stabilized relativistic training objective of R3GAN with FSQ’s guaranteed codebook utilization to achieve high-fidelity spatial reconstruction without codebook collapse. In the second stage, a causal Transformer models the resulting token sequences autoregressively and learns to emit an End-of-Sequence (EOS) token when the behavioral event reaches semantic closure, with the termination distribution emerging naturally from the training data rather than any human-specified constraint. Experiments on a human drinking behavior dataset (NTU RGB+D, A001, n=94) demonstrate that BioVid’s generated length distribution closely matches that of held-out test data, achieving a Wasserstein-1 distance of 1.24 against the ground truth – compared to 6.05 for a fixed-length baseline and 15.48 for VideoGPT – while maintaining competitive spatial fidelity.
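The Wasserstein-1 distance used to compare length distributions can be computed, for one-dimensional empirical samples, as the area between the two empirical CDFs. The sketch below shows that computation; the clip lengths are made up for illustration and are not the paper's data.

```python
def wasserstein_1(xs, ys):
    """W1 distance between two 1-D empirical distributions (area between their CDFs)."""
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    total, i, j = 0.0, 0, 0
    for left, right in zip(points, points[1:]):
        # Advance the counters so i/len(xs) and j/len(ys) are the CDF values at `left`.
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        total += abs(i / len(xs) - j / len(ys)) * (right - left)
    return total

# Hypothetical clip lengths in frames: varied "real" lengths vs. a fixed-length generator.
real_lengths = [40, 45, 50, 55, 60]
fixed_lengths = [50, 50, 50, 50, 50]
print(round(wasserstein_1(real_lengths, fixed_lengths), 6))
```

A fixed-length generator puts all its mass on one value, so its distance to a spread-out real distribution is the mean absolute deviation from that value, which is why fixed-length baselines score poorly on this metric.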


[146] BLUE: Toward Better Language Use in Efficient Vision-Language-Action Models for Autonomous Driving cs.CVPDF

George Ling, Lijin Yang, Hao Yang, Zhongzhan Huang

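TL;DR: This paper proposes BLUE, a method for using language more efficiently in vision-language-action (VLA) models for autonomous driving. The study finds that language matters in only a small fraction of driving scenarios but can strongly affect performance in those cases. BLUE therefore uses a lightweight gating module on frozen VLA hidden states to decide dynamically, per frame, whether to activate language generation, which greatly improves inference efficiency while maintaining performance.

Details

Motivation: Current VLA models for autonomous driving generate language at every frame, which is computationally inefficient because language is truly beneficial only in a few complex scenarios. The paper addresses this waste by exploring a more efficient strategy for language use.

Result: On the Bench2Drive and Longest6 v2 benchmarks, BLUE sets a new state of the art with a gate of only 0.11M parameters, reaching a 76.2% success rate and a driving score of 36, respectively. Compared with the backbone, inference is 2.54x faster and the success rate is 8.9% higher.

Insight: The central finding is that the hidden states of a pretrained VLA already encode whether language will be beneficial, which motivates a lightweight dynamic gate that requires neither backbone modification nor additional annotation. This gives a practical path toward efficient language-augmented driving models that retain the benefits of language at very low cost.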

Abstract: We present BLUE, a minimal method for better language use in vision-language-action (VLA) models for autonomous driving (AD). Through extensive analysis, we reveal that language matters on only a small fraction of routes, but on those routes it can greatly improve or degrade performance. Generating language at every frame is therefore inefficient, since most computation is spent on frames that do not benefit from language. We further show that pretrained VLA hidden states potentially already encode whether language will benefit a given frame, even though scene complexity and kinematic features alone struggle to predict this. Based on this finding, BLUE trains a lightweight gate on frozen VLA hidden states to decide per frame whether to activate language generation or predict actions directly, without modifying the backbone or requiring additional human annotation. With just a 0.11M-parameter gate, BLUE sets a new state of the art on both benchmarks, achieving 76.2% success rate on Bench2Drive and 36 driving score on Longest6 v2, while delivering 2.54x inference speedup and 8.9% success rate improvement over the backbone. BLUE provides a practical path toward efficient language-augmented AD, showing that VLA models can retain the benefits of language at a fraction of the cost. Our code, data, logs and checkpoints are fully available on https://github.com/George-Ling3/BLUE.


[147] Shift-Dependent Asymmetry: Orthogonal Inverse Low-Rank Adaptation for Federated Medical Segmentation cs.CVPDF

Xingyue Zhao, Wenke Huang, Linghao Zhuang, Haoran Wu, Anwen Jiang

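TL;DR: This paper proposes Inverse Asymmetric Tuning (IAT) to address the performance degradation caused by encoder-decoder asymmetry when fine-tuning medical image segmentation foundation models in federated learning. The method personalizes the encoder to absorb appearance shifts and the decoder to accommodate site-specific supervision, retains a shared pathway for transferable consensus, and adds a subspace orthogonality regularizer to prevent parameter leakage. It outperforms existing federated LoRA and parameter-efficient FL baselines on multiple benchmarks.

Details

Motivation: Existing federated LoRA methods apply a uniform aggregation rule, but in medical image segmentation the encoder is dominated by appearance shifts while the decoder is dominated by supervision variations. This asymmetry entangles shared anatomy with site-specific biases and harms generalization.

Result: Experiments show that IAT achieves consistent improvements over strong federated LoRA and parameter-efficient federated learning baselines on multiple federated medical image segmentation benchmarks, reaching state-of-the-art performance.

Insight: The novelty is orthogonal inverse low-rank adaptation tailored to encoder-decoder asymmetry: structural separation and subspace orthogonality regularization align adaptation with the sources of heterogeneity and prevent parameter leakage, improving generalization in federated learning without extra communication.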

Abstract: Low-Rank Adaptation (LoRA) enables efficient federated fine-tuning of segmentation foundation models for medical imaging. However, most federated LoRA methods adopt a uniform aggregation rule, which breaks under the encoder-decoder asymmetry in medical segmentation: the encoder is dominated by appearance shifts, while the decoder is dominated by supervision variations. This mismatch entangles shared anatomy with site-specific biases and harms generalization. To address this, we propose Inverse Asymmetric Tuning (IAT). IAT aligns adaptation with heterogeneity sources by personalizing module-specific components in the encoder to absorb appearance shifts and in the decoder to accommodate site-dependent supervision, while retaining a shared pathway for transferable consensus. However, structural separation alone is insufficient under LoRA’s bilinear parameterization, where multiplicative coupling can still cause site-specific updates to leak into the shared direction. We therefore introduce a Subspace Orthogonality Regularizer that penalizes shared-local collinearity in the effective update space, mitigating leakage without extra communication. Experiments show consistent improvements over strong federated LoRA and parameter-efficient FL baselines.
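One way to picture a penalty on shared-local collinearity in the effective update space is the squared cosine similarity between the two effective LoRA updates, each formed as the product B·A. The sketch below is an assumption-laden illustration: the abstract does not give the regularizer's formula, and the function names and toy factors are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def collinearity_penalty(b_shared, a_shared, b_local, a_local):
    """Squared cosine similarity between the flattened effective updates B_s A_s and B_l A_l."""
    ds = [v for row in matmul(b_shared, a_shared) for v in row]
    dl = [v for row in matmul(b_local, a_local) for v in row]
    dot = sum(x * y for x, y in zip(ds, dl))
    ns, nl = sum(x * x for x in ds), sum(y * y for y in dl)
    if ns == 0 or nl == 0:
        return 0.0
    return dot * dot / (ns * nl)

# Hypothetical rank-1 factors for a 2x2 weight: shared and local updates touch different entries.
penalty_orthogonal = collinearity_penalty([[1.0], [0.0]], [[1.0, 0.0]],
                                          [[0.0], [1.0]], [[0.0, 1.0]])
penalty_identical = collinearity_penalty([[1.0], [2.0]], [[3.0, 1.0]],
                                         [[1.0], [2.0]], [[3.0, 1.0]])
print(penalty_orthogonal, penalty_identical)
```

The penalty is computed on the products rather than on the individual factors, reflecting the abstract's point that multiplicative coupling can leak site-specific updates into the shared direction even when the factors are structurally separated.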


[148] Thinking Without Images: Internalizing Visual Manipulation with On-Policy Self-Distillation cs.CVPDF

Yishuo Cai, Jiahui Liu, Yuanxin Liu, Haobo Deng, Linli Yao

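TL;DR: This paper proposes the Imagine-OPD framework, which replaces explicit image-cropping operations with an internal imagination process for fine-grained visual reasoning. Using on-policy self-distillation, a teacher supervises the student's imagination reasoning trajectories during training based on privileged zoomed views, significantly reducing inference overhead while preserving reasoning performance.

Details

Motivation: The “Thinking with Images” paradigm brings redundant tool invocations and long inference traces, and its intermediate crops can be noisy or fail to faithfully capture task-relevant visual evidence. The paper asks whether the paradigm's reasoning benefits can be internalized through an internal imagination process.

Result: On vision-centric benchmarks, Imagine-OPD achieves the best average performance among the compared models while significantly reducing inference overhead relative to “Thinking with Images” methods.

Insight: The novelty is the “Thinking with Imagination” internal reasoning paradigm and an on-policy self-distillation framework that needs neither an external teacher nor high-quality imagination demonstrations, balancing performance and efficiency.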

Abstract: “Thinking with Images” has emerged as an effective paradigm for fine-grained visual reasoning: by explicitly zooming into relevant regions and reasoning over crops, models can access local evidence that is difficult to recover from a single global image. However, this benefit comes with redundant tool invocations and longer inference traces. Moreover, when such behaviors are learned mainly from outcome reward, the resulting intermediate crops or visual cues can be noisy or fail to faithfully capture task-relevant visual evidence. In this work, we ask whether the reasoning benefits of “Thinking with Images” can be internalized through Thinking with Imagination: an internal process that decides where to look and imagines what visual cues closer inspection would reveal without actually invoking tools. We propose Imagine-OPD, an on-policy self-distillation framework in which a teacher plays the role of a “Thinking with Images” reasoner during training: it receives privileged zoomed evidence views derived from annotated regions, and supervises the model’s own imagination reasoning trajectories. Imagine-OPD does not require an external teacher or high-quality imagination demonstrations. Experiments on vision-centric benchmarks show that Imagine-OPD achieves the best average performance among compared models while significantly reducing inference overhead compared with “Thinking with Images” methods.


[149] PRPO: Perception-Reinforced Policy Optimization via Token-Level Dynamic Advantage Reshaping cs.CVPDF

Qiming Li, Tianlun Li, Xiaolong Cheng, Hangyu Li, Ruiyan Gong

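TL;DR: This paper proposes Perception-Reinforced Policy Optimization (PRPO), a token-level reinforcement learning framework for large vision-language models (LVLMs). It introduces the Robust Visual Dependency (RVD) metric to identify and reinforce the pivotal perceptual tokens in multimodal reasoning trajectories that depend causally on visual evidence, and uses Perceptual Advantage Reshaping (PAR) for fine-grained credit assignment.

Details

Motivation: Existing methods for reinforcement learning with verifiable rewards (RLVR) rely mainly on trajectory-level outcome rewards and assign the same learning signal to every generated token. This coarse-grained credit assignment is mismatched to multimodal reasoning, where only a sparse subset of tokens is grounded in visual evidence, so pivotal perceptual tokens are under-supervised and easily overwhelmed by language priors or reasoning-template tokens.

Result: Extensive experiments on seven multimodal reasoning benchmarks show that PRPO consistently outperforms strong LVLM baselines at both the 3B and 7B model scales, with average gains of 23.3% and 21.1%, respectively. It achieves state-of-the-art performance with better training efficiency and stronger cross-task generalization.

Insight: The core contribution is a fine-grained, token-level credit assignment framework: the RVD metric identifies robustly visually dependent tokens, and PAR reshapes the advantage function to reinforce perceptually informative tokens. The results underline the importance of fine-grained credit assignment for scalable multimodal reinforcement learning.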

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective paradigm for improving the reasoning capability of Large Vision-Language Models (LVLMs). However, existing RLVR methods primarily rely on trajectory-level outcome rewards, which assign identical learning signals across all generated tokens. This coarse-grained credit assignment is fundamentally mismatched to multimodal reasoning, where only a sparse subset of tokens is causally grounded in visual evidence. Consequently, these pivotal perceptual tokens receive weak supervision and are often overwhelmed by language priors or reasoning-template tokens. To address this limitation, we propose Perception-Reinforced Policy Optimization (PRPO), a token-level reinforcement learning framework that explicitly identifies and reinforces pivotal perceptual tokens within long-horizon multimodal reasoning trajectories. PRPO introduces Robust Visual Dependency (RVD), a principled metric that identifies tokens whose predictions are both visually grounded and perturbation-stable, filtering out brittle or noisy visual tokens. Based on RVD, we further propose Perceptual Advantage Reshaping (PAR), a token-level credit assignment technique that amplifies perceptually informative tokens while preserving stable gradients for non-perceptual tokens. Extensive experiments on seven multimodal reasoning benchmarks demonstrate that PRPO consistently outperforms strong LVLM baselines across both 3B and 7B model scales, achieving average gains of 23.3% and 21.1%, respectively. PRPO achieves state-of-the-art performance with improved training efficiency and stronger cross-task generalization. Our findings highlight the importance of fine-grained credit assignment for scalable multimodal reinforcement learning.


[150] MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training cs.CVPDF

Lianyu Pang, Tianlin Pan, Cheng Da, Changqian Yu, Huan Yang

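TL;DR: This paper proposes MaskAlign, which applies representation alignment to randomly sampled token subsets in order to address the mismatch between noisy inputs and clean-image reference features that full-token alignment causes in diffusion model training. It also introduces a pre-mask token mixing block to mitigate information loss.

Details

Motivation: Existing methods accelerate diffusion transformer training by aligning intermediate diffusion features with clean-image representations from self-supervised vision encoders, but the noisy inputs do not match the clean reference features. As a result, the alignment objective affects tokens unevenly, and the model may come to rely too heavily on the complete token set.

Result: The abstract does not state specific benchmarks or quantitative results, but it suggests that the method improves alignment stability and reduces dependence on the complete token set.

Insight: The novelty is a token-level analysis of the alignment mismatch, a representation alignment strategy based on random token subsets, and a pre-mask token mixing block that keeps information flowing. Together these suggest a new direction for efficient diffusion training.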

Abstract: Representation alignment with pretrained vision models has recently shown strong potential for accelerating diffusion transformer training. By aligning intermediate diffusion features with clean-image representations from self-supervised vision encoders, existing methods improve convergence and generation quality. However, such alignment also introduces a non-trivial constraint: diffusion models operate on noisy inputs whose usable information varies across timesteps, while the reference features are extracted from clean images. In this paper, we revisit this mismatch from a token-level perspective. We find that, under full-token representation alignment, tokens with large alignment-gradient norms exhibit a stable spatial preference, suggesting that the alignment objective does not affect all tokens uniformly and may encourage the model to rely on the complete set of clean-image tokens. To address this issue, we propose MaskAlign, a token-subset representation alignment method that applies alignment to randomly sampled token subsets during training. By exposing the model to different token subsets across iterations, MaskAlign reduces the dependence of representation alignment on the complete token set and encourages alignment behavior that is more stable under token-subset perturbations. To mitigate the information loss caused by directly dropping tokens, we further introduce a lightweight pre-mask token mixing block that shares information across tokens before masking.
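Aligning only a random token subset can be sketched as a cosine-based loss averaged over sampled token indices. This is a minimal illustration, assuming features are plain lists of vectors; the keep ratio, the cosine loss, and the function names are assumptions, and the pre-mask token mixing block is not shown.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def subset_alignment_loss(diff_feats, ref_feats, keep_ratio=0.5, rng=random):
    """Average (1 - cosine) over a randomly sampled subset of token positions."""
    n = len(diff_feats)
    k = max(1, int(n * keep_ratio))
    kept = rng.sample(range(n), k)          # a fresh subset would be drawn each iteration
    return sum(1.0 - cosine(diff_feats[i], ref_feats[i]) for i in kept) / k

# Hypothetical features for 4 tokens: diffusion features vs. clean-image reference features.
rng = random.Random(0)
diffusion = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
reference = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
print(subset_alignment_loss(diffusion, reference, keep_ratio=0.5, rng=rng))
```

Because a different subset is sampled on every call, no single token position is guaranteed to receive the alignment signal, which is the mechanism the abstract credits with reducing dependence on the complete token set.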


[151] Beyond Consistency: Preserving Temporal Structure in Zero-Shot Video Editing cs.CVPDF

Deyin Liu, Yisheng Ding, Zhe Jin, Xiatian Zhu, Anjan Dutta

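TL;DR: This paper proposes a new zero-shot video editing method that goes beyond the temporal consistency targeted by existing methods and, for the first time, explicitly preserves the source video's temporal structure (its high-level narrative, rhythm, and semantic flow). The method adaptively partitions the video into semantically distinct clips based on feature similarity, selects a representative anchor frame for each clip, applies a clip-adaptive token merging strategy, and uses an alternating combination strategy, preserving the original temporal structure while remaining computationally efficient.

Details

Motivation: Existing zero-shot video editing methods rely on pre-trained diffusion models. They achieve spatial control and basic temporal consistency but fundamentally fail to preserve the video's original temporal structure. As a result, edited outputs become narratively incoherent and semantically ambiguous, especially for long videos with complex semantic variations.

Result: Extensive experiments show that the method achieves state-of-the-art results in zero-shot video editing, balancing preservation of the original temporal structure with computational efficiency and setting a new benchmark for editing fidelity.

Insight: The core idea is to raise the editing objective from temporal consistency (visual smoothness) to temporal structure (high-level semantic flow). The technical contributions are adaptive semantic partitioning of the video based on feature similarity, anchor frame selection, a clip-adaptive token merging strategy that leverages the anchor's semantic dominance, and an alternating combination strategy that ensures seamless transitions between clips.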

Abstract: Existing zero-shot video editing methods rely on pre-trained diffusion models and achieve spatial control and basic temporal consistency, but they fundamentally fail to preserve the video’s original temporal structure. This distinction is critical: temporal consistency ensures visual smoothness, but temporal structure dictates the video’s high-level narrative, rhythm, and semantic flow. Without this preservation, the edited output, especially for long videos with complex semantic variations, becomes narratively incoherent and semantically ambiguous. To address this limitation, we introduce a novel zero-shot editing approach that, for the first time, explicitly focuses on preserving the source video’s temporal structure. We achieve this by adaptively partitioning the video into semantically distinct clips based on feature similarity and selecting a representative anchor frame for each clip. To enhance both intra-clip fidelity and computational efficiency, we design a clip-adaptive token merging strategy which leverages the anchor’s semantic dominance to stabilize the editing. Furthermore, we employ an alternating combination strategy that ensures seamless inter-clip transitions while maintaining semantic distinction. Extensive experiments demonstrate that our method achieves state-of-the-art results, successfully balancing the preservation of original temporal structure with computational efficiency, and setting a new benchmark for zero-shot video editing fidelity.
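The partitioning and anchor-selection steps can be sketched as follows, assuming each frame is already represented by a feature vector. The cosine threshold, the "closest to the clip mean" anchor rule, and the function names are illustrative assumptions; the abstract does not specify these details.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def partition_clips(features, threshold=0.8):
    """Start a new clip whenever consecutive frame features fall below the similarity threshold."""
    clips = [[0]]
    for t in range(1, len(features)):
        if cosine(features[t - 1], features[t]) < threshold:
            clips.append([t])
        else:
            clips[-1].append(t)
    return clips

def anchor_frame(features, clip):
    """Pick the frame whose feature is most similar to the clip's mean feature."""
    dim = len(features[0])
    mean = [sum(features[t][d] for t in clip) / len(clip) for d in range(dim)]
    return max(clip, key=lambda t: cosine(features[t], mean))

# Hypothetical 2-D frame features with one semantic change between frames 1 and 2.
features = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
clips = partition_clips(features)
print(clips, [anchor_frame(features, c) for c in clips])
```

The number of clips is not fixed in advance here; it follows from where the similarity drops, which is what makes the partition adaptive to the video's content.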


[152] BLM-SGAN: Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation cs.CV | cs.AI | cs.LGPDF

Ahmed Abdelmoneim Mazrou, Haidy Maher El-Amir, Ali Hamdi

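TL;DR: This paper proposes BLM-SGAN, a new text-to-image generation model that incorporates bidirectional language modeling (using BERT's attention mechanisms) to capture rich contextual information and handle long sequences effectively. It targets key weaknesses of existing GAN-based T2I models: capturing long-range dependencies, vanishing gradients, and the limits of sequential processing.

Details

Motivation: Existing text-to-image (T2I) models based on generative adversarial networks (GANs) still struggle with long-range dependencies, vanishing gradients, and the limitations of sequential processing, which hinders the generation of high-quality, semantically and spatially consistent images from detailed text descriptions.

Result: BLM-SGAN achieves state-of-the-art performance on bird image generation, with an Inception Score (IS) of 5.45 +/- 0.08, surpassing competing models including SSA-GAN, DF-GAN, SD-GAN, and AttnGAN.

Insight: The main novelty is integrating bidirectional language modeling (specifically BERT's attention mechanisms) into a GAN framework to strengthen the understanding and modeling of textual semantics and spatial context, offering a new way to handle long-sequence dependencies and semantic consistency in T2I generation.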

Abstract: Despite the success of image generation from text descriptions, it still faces challenges that are difficult to overcome in domains such as natural language processing (NLP) and computer vision (CV). Recent advancements in text-to-image (T2I) models, particularly those utilizing generative adversarial networks (GANs), have significantly improved the synthesis of realistic images across various domains. However, existing GAN-based T2I models still encounter key challenges, such as difficulty in capturing long-range dependencies, vanishing gradients, and the limitations of sequential processing. To address these issues, we introduce BLM-SGAN, a novel model that incorporates Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation. BLM-SGAN leverages BERT’s attention mechanisms to capture rich contextual information and efficiently manage extended sequences. Our model demonstrates state-of-the-art performance, with an Inception Score (IS) of 5.45 +/- 0.08, surpassing several competitive models such as SSA-GAN, DF-GAN, SD-GAN, and AttnGAN. BLM-SGAN effectively generates highly realistic images of birds from detailed text descriptions. The implementation code is available at: https://github.com/haidy-maher/BLM-SGAN-Text-to-Image-Generation.
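For readers unfamiliar with the reported metric, the Inception Score is the exponential of the average KL divergence between each image's predicted class distribution and the marginal class distribution over all generated images. The sketch below computes it from a list of class-probability vectors; the probabilities are toy values, and in practice they would come from a pretrained classifier, which is not shown.

```python
import math

def inception_score(class_probs):
    """IS = exp(mean over images of KL(p(y|x) || p(y)))."""
    n, k = len(class_probs), len(class_probs[0])
    marginal = [sum(p[c] for p in class_probs) / n for c in range(k)]
    kl_sum = 0.0
    for p in class_probs:
        kl_sum += sum(pc * math.log(pc / marginal[c]) for c, pc in enumerate(p) if pc > 0)
    return math.exp(kl_sum / n)

# Toy predictions over 2 classes: confident and balanced vs. uninformative.
confident = [[1.0, 0.0], [0.0, 1.0]]
uninformative = [[0.5, 0.5], [0.5, 0.5]]
print(inception_score(confident), inception_score(uninformative))
```

Higher is better: the score rises when individual predictions are confident and the set of images covers many classes, and it equals 1 when every image yields the same prediction.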


[153] Vision-Language Work Zone Intelligence for Safety-Critical Speed Regulation of Mixed-Autonomy Vehicles in Dynamic Environments cs.CVPDF

Angel Martinez-Sanchez, Kianna Ng, Wesley Maia, Laura Fleig, Maitrayee Keskar

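TL;DR: This paper presents a real-time onboard perception system that detects work zones in dynamic environments and recognizes temporary speed-limit signs, improving safety for mixed-autonomy vehicles. The system fuses object detection with semantic verification, uses temporally smoothed, hysteresis-based state transitions to reduce false activations and flicker, and runs on low-cost embedded hardware.

Details

Motivation: Speed-limit signage in temporary work zones is visually inconsistent and often missing from digital maps, which creates safety risks for human drivers and automated driving systems. A real-time solution grounded directly in onboard perception is therefore needed to identify work zones and speed limits accurately.

Result: On an annotated subset of the ROADWork dataset (490 sequences), the system reaches an event-level recall of 96.5% and a precision of 68.7% inside work zones. On 35 minutes of in-house driving data, speed-limit recognition reaches 95.45% precision and 53.85% recall, with no incorrect speed classifications and a single false positive.

Insight: The novelty is the fusion of object detection with semantic verification, plus a temporally smoothed, hysteresis-based state transition mechanism that improves robustness in dynamic scenes. The system runs entirely on low-cost hardware and offers a scalable approach to work-zone perception that does not depend on maps or infrastructure.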

Abstract: Temporary work-zone speed limits are communicated through visually inconsistent signage and are often missing from digital maps, creating safety risks for human drivers and automated vehicle systems. We present a real-time, onboard perception pipeline that detects active work zones, recognizes associated temporary speed limits, and outputs a law-aware work-zone state and speed value suitable for driver alerts or downstream automated control. The system fuses object detections with semantic verification and temporally smoothed, hysteresis-based state transitions to reduce false activations and flicker in dynamic scenes, and runs fully on low-cost embedded hardware. Evaluated manually on an annotated subset of the ROADWork dataset (490 sequences), the system achieves inside-work-zone event-level recall of 96.5% and event-level precision of 68.7%. Speed-limit recognition evaluated on 35 minutes of in-house driving data attains 95.45% precision and 53.85% recall, with no incorrect speed classifications and a single false positive. These results demonstrate a practical, scalable approach for grounding work-zone speed awareness directly in onboard perception rather than maps or infrastructure. We release our source code for the proposed system pipeline on our GitHub repository: https://github.com/Mi3-Lab/workzone
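Hysteresis-based state transitions of the kind described above can be sketched with two thresholds and a minimum streak length. The thresholds, the streak length, and the per-frame confidence scores below are hypothetical; the abstract does not give the system's actual parameters.

```python
def smooth_states(frame_scores, on_thresh=0.7, off_thresh=0.3, min_frames=3):
    """Enter the work-zone state after `min_frames` consecutive scores >= on_thresh,
    and leave it only after `min_frames` consecutive scores <= off_thresh."""
    active, streak, states = False, 0, []
    for score in frame_scores:
        if not active:
            streak = streak + 1 if score >= on_thresh else 0
        else:
            streak = streak + 1 if score <= off_thresh else 0
        if streak >= min_frames:
            active, streak = not active, 0
        states.append(active)
    return states

# Hypothetical per-frame work-zone confidences with isolated spikes and dips.
scores = [0.9, 0.2, 0.9, 0.9, 0.9, 0.5, 0.2, 0.9, 0.2, 0.2, 0.2]
print(smooth_states(scores))
```

The isolated high score in the first frame and the isolated low score in the seventh do not flip the state, which is how the gap between the two thresholds and the streak requirement suppress false activations and flicker.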


[154] CHROMA: Detecting AI-Generated Images through Inter-Channel Color-Space Correlations cs.CV | cs.LGPDF

Juan Pablo Sotelo, Marina Gardella, Pablo Musé

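TL;DR: This paper proposes CHROMA, an AI-generated image detection method that distinguishes real from synthetic images by analyzing inter-channel color correlations in different color spaces. It combines inter-channel correlation features from the RGB and Lab color spaces with a simple CNN architecture and achieves robust detection under a limited computational budget.

Details

Motivation: With the rapid development of diffusion and large-scale generative models, distinguishing synthetic images from real photographs is increasingly difficult, and existing detectors generalize poorly to unseen generators. The paper explores inter-channel color correlations, a lightweight and underexploited forensic cue, to improve detector generalization and robustness.

Result: On a standard benchmark, augmenting inputs with inter-channel correlation features improves real-versus-generated discrimination. CHROMA performs on par with recent detectors and is robust under both single-generator and limited multi-generator supervision.

Insight: The novelty is the use of inter-channel correlations in color spaces as detection features, which exposes systematic differences in the cross-channel statistics of generative models. Viewed objectively, the method achieves efficient detection with lightweight features and a simple architecture, offering an interpretable and easily deployable approach to AI-generated image forensics.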

Abstract: The rapid adoption of diffusion and large-scale generative models has made it increasingly challenging to distinguish synthetic imagery from real photographs. While automated detectors have been proposed, their generalization to unseen generators remains brittle. To address this limitation, we investigate inter-channel color correlations, a lightweight and underexploited forensic cue. We first demonstrate that LPIPS, a widely used perceptual metric, exhibits inconsistent responses to perturbations that selectively alter channel dependence across different color-space parameterizations, indicating that cross-channel statistics are not uniformly constrained by common perceptual training objectives. Motivated by this, we analyze the distributions of pairwise inter-channel correlation features across multiple color spaces. Our analysis reveals systematic, generator-specific differences in these distributions, with RGB and Lab color spaces providing the most apparent separation between real and generated images. Building on this, we introduce Chroma, a detector of AI-generated images which augments standard RGB inputs with inter-channel correlation maps and employs a fixed CNN backbone trained with a modest computational budget. We assess its robustness under both single-generator training and a limited multi-generator supervision regime, where only a few samples from additional generators are available. Across a standard benchmark protocol, correlation-augmented inputs improve real-vs-generated discrimination and robustness, yielding performance competitive with recent detectors while maintaining a simple architecture and training procedure. Code is available at https://github.com/JPSoteloSilva/CHROMA
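Pairwise inter-channel correlation features can be illustrated by computing Pearson correlations between the R, G, and B values of a patch. This sketch computes one value per channel pair for a whole patch, which is an assumption made for brevity; the paper describes correlation maps, and its windowing and color-space conversion are not shown here. The pixel values are made up.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def channel_correlations(pixels):
    """pixels: list of (r, g, b) tuples; returns correlations for the RG, RB, and GB pairs."""
    r, g, b = zip(*pixels)
    return {"RG": pearson(r, g), "RB": pearson(r, b), "GB": pearson(g, b)}

# Hypothetical 4-pixel patch: G rises with R, while B falls as R rises.
patch = [(0, 0, 10), (1, 2, 8), (2, 4, 6), (3, 6, 4)]
print(channel_correlations(patch))
```

Applying the same function to a patch converted to another color space, such as Lab, would give the second feature family the abstract highlights.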


[155] Are Reasoning Vision-Language Models Robust to Semantic Visual Distractions? cs.CV | cs.CLPDF

Yizheng Sun, Mochuan Zhan, Yanan Ma, Jia Tong See, Yifan Wang

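TL;DR: This paper introduces Distract-Bench, a benchmark for evaluating the robustness of reasoning vision-language models (VLMs) to semantic visual distractions. It finds that even when a model perceives the visual evidence correctly, it is easily misled by meaningful but task-irrelevant visual cues and reasons incorrectly, a failure mode distinct from the perceptual difficulty caused by conventional visual degradation.

Details

Motivation: Prior work mainly evaluates VLM reliability through input corruptions such as noise and blur, overlooking a critical failure mode: a model may perceive the evidence correctly yet be misled by plausible but irrelevant visual distractions and propagate the error to its final answer.

Result: Evaluation of eight leading open-source and two closed-source VLMs shows that Distract-Bench exposes a robustness failure distinct from visual corruption: reasoning VLMs perform close to their non-reasoning base models under perceptual degradation but are consistently less robust to semantic distractions.

Insight: The novelty is shifting robustness evaluation from perceptual degradation to semantic distraction, with a benchmark built specifically for semantic visual distractions. Viewed objectively, this reveals that current reasoning VLMs are fragile in judging the relevance of evidence when reasoning over visual information, an important step toward reliable real-world visual reasoning.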

Abstract: Reasoning Vision-Language Models (VLMs) achieve strong performance on complex multimodal tasks, but reliable real-world application requires handling visual inputs that are messier than clean, curated benchmarks. Existing works mainly evaluate such reliability of VLMs through input corruptions, such as noise, blur and weather effects, which make visual evidence harder to perceive. This leaves a critical reliability failure mode underexplored: a model may perceive the evidence correctly, yet reason from plausible but irrelevant and distracting evidence and propagate this mistake to its final answer. To address this gap, we introduce Distract-Bench, a benchmark for evaluating VLM robustness to semantic visual distractions, defined as meaningful but task-irrelevant visual cues added to inputs while preserving the ground-truth answer. We comprehensively evaluate eight leading open-source and two closed-source VLMs across conventional vision corruptions and Distract-Bench. Our results show that Distract-Bench exposes a robustness failure distinct from vision corruptions: reasoning VLMs largely track their non-reasoning base models under perceptual degradation, but show consistently lower robustness to semantic distractions. Further analysis shows that these distractions often enter the reasoning process of VLMs, are treated as evidence, and lead to incorrect answers. Together, these findings reframe robustness evaluation for reasoning VLMs, shifting the focus from degraded perception to distractions for reliable real-world visual reasoning. Our data and code are available at https://github.com/Yizheng-Sun/Distract-Bench.


[156] A multi-agent system for spine MRI report generation from multi-sequence imaging cs.CV | cs.AI | q-bio.QMPDF

Zhiping Xiao, Junwei Yang, Gongbo Sun, Han Zhang, Hanwen Xu

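TL;DR: This paper presents SpineAgent, a multi-agent system for spine MRI report generation. Built on a foundation model pre-trained on multi-sequence MRI data, it integrates information across sequences through a novel continual training strategy, achieves state-of-the-art report generation performance, and supports pathology localization and multimodal retrieval.

Details

Motivation: Spinal pathology is a leading cause of pain and disability worldwide. Interpreting spine MRI is complex and time-consuming because it requires integrating information across multiple imaging sequences and anatomical regions. Existing automated MRI analysis methods still struggle to combine multi-sequence data effectively while preserving sequence-specific diagnostic information.

Result: SpineAgent achieves state-of-the-art performance in spine MRI report generation and generalizes strongly under cross-manufacturer and cross-cohort evaluation. Both automated metrics and expert evaluation by five radiologists confirm its leading performance.

Insight: The contributions are: 1) separately pre-training encoders for T1 and T2 sequences and learning, through a continual training strategy, a synthesizer that embeds images from other sequences, producing a patient-level embedding that integrates multi-sequence information; 2) building a multi-agent framework of 37 specialized agents and integrating their outputs as structured tokens into an end-to-end trained Medical Report Agent, enabling scalable and explainable report generation.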

Abstract: Spinal pathology is a leading cause of pain and disability worldwide. Spine MRI is central to clinical evaluation, yet its interpretation remains complex and time-consuming, requiring integration of information across multiple imaging sequences and anatomical regions. Despite recent advances in automated MRI analysis, effectively combining multi-sequence data while preserving sequence-specific diagnostic information remains an open challenge. Here we present SpineAgent, a multi-agent framework for spine MRI report generation built upon a multi-sequence foundation model trained on routine clinical data from 32,047 patients and 453,683 MRI series, comprising a total of 13,441,191 MRI slices. To accommodate diverse modalities of sequences, we first pre-train two DINOv3-based encoders separately on T1- and T2-weighted sequences. We then introduce a continual training strategy that learns a synthesizer to embed images of other sequences using the T1 and T2 encoders, producing patient-level embedding that integrates various signals across MRI sequences. Using these embeddings, SpineAgent achieves state-of-the-art performance, and demonstrates strong generalizability under cross-manufacturer and cross-cohort evaluation. Beyond classification, SpineAgent enables pathology localization by identifying findings-relevant slices and segmenting pathological regions. It also supports multimodal image-report retrieval, providing a solid foundation for scalable and explainable MRI report generation. We further integrate these validated capabilities of SpineAgent into 37 specialized agents. Finally, we incorporate their outputs as structured tokens within a Medical Report Agent trained end-to-end for report generation. Through both automated metrics and expert evaluation by five radiologists, SpineAgent achieves leading performance in spine MRI report generation.


[157] DifferSeg: Towards Diverse Multimodal Binary Segmentation via Differential Perception and Frequency Guidance cs.CV

Qiangqiang Zhou, Jiawei Xu, Yong Chen, Dandan Zhu, Yugen Yi

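TL;DR: This paper proposes DifferSeg, a general multimodal binary segmentation framework that addresses two weaknesses of existing methods: poor handling of modality discrepancies and complementarity, and an imbalance between high- and low-frequency representations during decoding. A differential perception fusion module adaptively aligns multimodal features, and a frequency-guided decoder coordinates high- and low-frequency information, yielding strong performance across natural and medical image segmentation tasks.

Details

Motivation: Most existing multimodal segmentation methods use fixed feature concatenation for cross-modal interaction and decoders that rely mainly on low-frequency semantics. They lack an adaptive mechanism for handling modality discrepancies and complementarity, and an efficient decoding strategy for balancing high- and low-frequency representations.

Result: Across 29 public datasets covering 18 downstream tasks, DifferSeg consistently outperforms 67 state-of-the-art methods, showing strong generalization and segmentation accuracy.

Insight: The novelty lies in learnable differential operators for adaptive modality alignment and residual fusion, and in a frequency-guided decoder with cross-frequency interaction and multi-path upsampling. Together they mitigate modality mismatch and fusion redundancy while supporting fine-grained boundary recovery and noise suppression.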

Abstract: In many binary segmentation tasks, most multimodal methods rely on fixed feature concatenation for cross-modal interaction and straightforward decoder designs dominated by low-frequency semantics. However, they ignore two key challenges: one is the lack of an adaptive mechanism to handle modality discrepancies and complementarity, and the other is the absence of an efficient decoding strategy to balance both high- and low-frequency representations. In this work, we propose a simple yet general multimodal binary segmentation framework, termed DifferSeg, to address both problems simultaneously. With the help of the differential perception fusion (DPF) module, DifferSeg employs learnable differential operators to adaptively align multimodal features and enhance their complementarity through residual fusion, effectively mitigating modality mismatch and fusion redundancy. In addition, we design a frequency-guided decoder (FGD) that builds cross-frequency interactions and multi-path upsampling to maintain consistency between detailed high-frequency structures and semantic low-frequency representations, ensuring fine-grained boundary recovery and noise suppression. Benefiting from these designs, DifferSeg can be easily generalized to diverse binary segmentation tasks, including both natural and medical modalities. Without bells and whistles, it consistently surpasses 67 state-of-the-art methods across 29 public datasets involving 18 downstream tasks, demonstrating superior generalization and segmentation accuracy. Code and pretrained models will be made available.


[158] Failure-Aware Refinement of Vision-Language Model for Lithography Defect Detection cs.CV | cs.AI

Pangyun Jeong, Jiyeong Kong, Yuehua Hu, Dohee Jeong, Kyung-Tae Kang

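TL;DR: This paper proposes a two-stage vision-language framework for semiconductor lithography defect detection. Stage one fine-tunes Qwen3-VL with LoRA to predict defect counts, categories, and bounding boxes from lithography images. Stage two trains a refinement module on the first stage's prediction failures to review and correct the initial outputs, reducing false positives, missed defects, and misclassifications.

Details

Motivation: Directly fine-tuned vision-language models still make common test-time errors in lithography defect detection, such as false positives, missed defects, and incorrect defect types. The goal is to make detection more reliable.

Result: By learning from cases where the initial adapter fails, the refinement process improves defect inference beyond single-stage fine-tuning. The abstract does not report specific benchmarks or quantitative results.

Insight: The novelty is a refinement mechanism that reuses prediction failures for a second round of training. This error-driven learning strategy can improve robustness and accuracy on fine-grained visual tasks.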

Abstract: Semiconductor lithography inspection requires reliable detection of small pattern defects such as bridge, burr, pinch, and contamination. In this study, we propose a two-stage vision-language framework that combines initial defect detection with prediction refinement. In the first stage, Qwen3-VL is fine-tuned with LoRA as a vision-language adapter to predict defect counts, defect categories, and normalized bounding boxes from lithography images. However, direct fine-tuning may still produce common test-time errors, including false positives, missed defects, and incorrect defect types. To address this limitation, the second stage trains a refinement module using first-stage prediction failures and their corrected labels, allowing the model to review and revise initial outputs. By learning from cases where the initial adapter fails, the refinement process improves defect inference beyond single-stage fine-tuning.


[159] When Vision Misleads, Let Location Speak: A Worldwide Image Geo-Localization Method via Location Attention Mechanism and Large Multimodal Models cs.CV

Junchao Cui, Wenqi Shi, Xuanzi Ma, Nan Wu, Shaoyong Du

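TL;DR: This paper proposes TransGeoCLIP, a new framework for worldwide image geo-localization that addresses the tendency of existing methods to place images in the wrong geographic region because of visual similarity. It combines a location attention mechanism with large multimodal models, builds a retrieval database of joint image, text, and GPS embeddings, and uses retrieval-augmented inference to predict the final image location.

Details

Motivation: Existing worldwide geo-localization methods are often misled by visually similar scenes, which limits their reliability in practice. The paper introduces location semantics to separate images that look alike but come from different places, improving accuracy and robustness.

Result: Experiments on IM2GPS, IM2GPS3k, YFCC4k, and YFCC26k show that TransGeoCLIP significantly improves localization of visually similar images. Street-level accuracy (error within 1 km) improves over the previous best methods by 1.5%, 1.07%, 7.18%, and 9.75%, respectively, setting a new state of the art.

Insight: The novelty lies in combining location attention with a Transformer to encode GPS coordinates, which strengthens the representation of location semantics, and in using CLIP for joint image-text-GPS embedding. Viewed objectively, the two-stage framework (retrieval database construction and retrieval-augmented inference) integrates structured geographic information with multimodal models and offers a new approach to the visual-similarity problem.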

Abstract: Worldwide image geo-localization aims to determine the capture location of an image on a global scale. Existing methods often mislocalize images by matching them to visually similar scenes from different geographic regions, which limits reliability in practical applications. To address this issue, we propose TransGeoCLIP, a novel retrieval-based framework that integrates a location attention mechanism and large multimodal models (LMMs). Using the Transformer encoder with location attention to encode GPS coordinates, TransGeoCLIP can effectively distinguish geographic features among visually similar images. The framework consists of two stages: 1) Retrieval database construction, which employs Transformers equipped with location attention mechanisms to encode labeled GPS coordinates and enhance location semantics, and subsequently enables joint image-text-GPS embedding through CLIP; 2) Retrieval-augmented inference, which leverages LMMs to infer the final image location prediction from retrieved database results. Extensive experimental results on diverse datasets, including IM2GPS, IM2GPS3k, YFCC4k, and YFCC26k, demonstrate that TransGeoCLIP significantly enhances localization performance for visually similar images. In particular, street-level localization accuracy (within 1 km error) is substantially improved, surpassing state-of-the-art methods by 1.5%, 1.07%, 7.18%, and 9.75% on these benchmarks, respectively.


[160] NutriMLLM: Multimodal Large Language Models for Dietary Micronutrient Analysis cs.CV | cs.AI

Runze Yan, Minxiao Wang, Jiaying Lu, Darren Liu, Xiao Hu

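TL;DR: The paper presents NutriMLLM, a family of multimodal large language models specialized for comprehensive estimation of dietary micronutrients from food images. To address the unreliability of existing MLLMs on this task, the authors use population-scale 24-hour dietary recall data to generate a synthetic dataset of image-description-nutrient triplets and fine-tune Qwen3-VL and GLM-4.6V-Flash on it. Evaluation shows near-complete nutrient coverage on real food images, with the largest model matching or exceeding leading proprietary models in accuracy on most nutrients.

Details

Motivation: Existing MLLMs, including leading proprietary ones, are unreliable at comprehensive micronutrient estimation from food images: they frequently abstain or return statistically implausible values. Training such models requires large multimodal datasets, and expert annotation is costly.

Result: Evaluated on four independent benchmarks (ASA24, SNAPMe, FNDDS, and NutriBench), the fine-tuned NutriMLLM family (based on Qwen3-VL and GLM-4.6V-Flash) achieves near-complete coverage of all 65 nutrients on real food images. The largest model matches or exceeds proprietary baselines such as GPT-5, Gemini 3, and Claude Sonnet 4.5 in accuracy on most nutrients.

Insight: The core idea is to use historical dietary recall data as structured prompts for a text-to-image pipeline, creating a large synthetic dataset (about 1.1 million triplets) and sidestepping the scarcity of expert-annotated data at low cost. This turns image-based comprehensive micronutrient estimation into a tractable engineering problem and supports dietary assessment, personalized nutrition guidance, and population-scale nutrition surveillance.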

Abstract: Comprehensive estimation of dietary micronutrients from food images could improve clinical nutrition care, but training such models requires large multimodal datasets linking diverse foods to complete nutrient profiles. We first show that existing multimodal large language models (MLLMs), including leading proprietary models, are unreliable for this task. Across five model families and four independent evaluation benchmarks (ASA24, SNAPMe, FNDDS, and NutriBench), models frequently abstained or returned statistically implausible values. To address this gap without costly expert annotation, we repurposed a decade of population-scale 24-hour dietary recalls as structured prompts for text-to-image generation. This pipeline produced a synthetic corpus of about 1.1 million image-description-nutrient triplets, each pairing a generated food image with a complete 65-nutrient label. To our knowledge, this is the largest synthetic food-image corpus with comprehensive micronutrient annotation planned for public release upon publication. Fine-tuning Qwen3-VL (2B/4B/8B/30B) and GLM-4.6V-Flash on this corpus yielded NutriMLLM, the first family of vision-language models specialized for comprehensive dietary micronutrient estimation. We evaluate these models with a four-component framework that separately measures abstention, hallucination, overall usability, and per-nutrient numerical accuracy. On real food images, every NutriMLLM variant achieved near-complete coverage across all 65 nutrients, and the largest variant matched or exceeded proprietary baselines (GPT-5, Gemini 3, and Claude Sonnet 4.5) in accuracy on most nutrients. These results show that recall-driven synthetic supervision can make image-based comprehensive micronutrient estimation a tractable engineering problem and support dietary assessment, personalized nutrition guidance, and population-scale micronutrient surveillance.


[161] ChinaHeritaQA: A Culturally-Grounded Visual Question Answering Dataset for World Heritage Sites in China cs.CV | cs.CL

Yi Zhang, Bolei Ma, Yong Cao, Chengyan Wu, Daniel Hershcovich

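TL;DR: The paper introduces ChinaHeritaQA, a multimodal benchmark dataset on UNESCO World Heritage sites in China for evaluating the cultural reasoning abilities of vision-language models. It contains 2,279 in-the-wild images and 14,133 bilingual (Chinese/English) multiple-choice QA pairs spanning seven cognitive dimensions. Evaluation shows that top models exceed human performance on average but struggle markedly on cultural reasoning tasks.

Details

Motivation: To address the shortcomings of existing vision-language models in cultural and historical understanding, and in particular to evaluate deep reasoning about World Heritage sites in China, the authors created a dedicated dataset.

Result: SOTA vision-language models evaluated on ChinaHeritaQA exceed human performance on average, but results vary substantially by task: models excel at visual recognition and do poorly on cultural reasoning, and performance also varies by dynasty and region.

Insight: The contribution is a bilingual cultural reasoning dataset built on a UNESCO-aligned heritage ontology and verified through rigorous human annotation. It exposes the gap between visual retrieval ability and cultural and historical understanding, and provides a new benchmark for research on culturally aware multimodal learning.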

Abstract: We introduce ChinaHeritaQA, a multimodal benchmark dataset for evaluating the cultural reasoning abilities of vision-language models (VLMs) on UNESCO World Heritage sites in China. The dataset comprises 2,279 in-the-wild images paired with 14,133 bilingual (Chinese/English) multiple-choice QA pairs spanning seven cognitive dimensions, from basic identity recognition to historical periodization and architectural analysis. Guided by a UNESCO-aligned heritage ontology and verified through rigorous human annotation, the dataset ensures linguistic quality and factual consistency. Evaluations of state-of-the-art VLMs reveal that while top models exceed human performance on average, substantial task-level variation emerges: models excel at visual recognition but struggle with culturally grounded reasoning. Performance also varies by dynasty and region. ChinaHeritaQA reveals that strong visual retrieval does not extend to cultural and historical understanding. We release the dataset to support future research on culturally aware multimodal learning.


[162] Scaling by Diversified Experience for Vision-Language-Action Models cs.CV

Leiyu Wang, Zhaofengnian Wang, Xueqi Li, Luoyi Fan, Cewu Lu

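TL;DR: This paper proposes SyVLA, a robust vision-language-action (VLA) model trained with diversified experience. To address the entanglement of high-level reasoning with low-level control and the instability of policy optimization in real-world VLA deployment, it introduces an Intention Decoupling algorithm that isolates control-relevant features and a similar-sample guided reinforcement learning pipeline that stabilizes policy updates. Experiments show higher task success rates and stronger out-of-distribution generalization on real robot tasks and multimodal benchmarks.

Details

Motivation: Real-world deployment of VLA models is hampered by the entanglement of high-level reasoning with low-level control and by unstable policy optimization.

Result: Extensive experiments on real-world robotic tasks and multimodal benchmarks show that SyVLA achieves higher task success rates and stronger out-of-distribution generalization than existing methods while preserving core vision-language capabilities.

Insight: The core contributions are the Intention Decoupling algorithm, which separates control features from reasoning context to resolve the entanglement, and the similar-sample guided RL pipeline, which stabilizes training and mitigates distribution shift. Both offer reusable ideas for improving the robustness and generalization of VLA models.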

Abstract: Vision-Language-Action models face significant challenges in real-world deployment due to the entanglement of high-level reasoning with low-level control, and the instability of policy optimization. In this paper, we introduce SyVLA, a robust VLA model trained with diversified experiences. We propose an Intention Decoupling algorithm to isolate control-relevant features from reasoning contexts and a similar-sample guided RL pipeline to stabilize policy updates and mitigate distribution shift. Extensive experiments on real-world robotic tasks and multi-modal benchmarks demonstrate that SyVLA achieves superior task success rates and stronger out-of-distribution generalization compared to existing methods, while effectively preserving core vision-language capabilities. Code and datasets are released on the project page: https://sy-vla.github.io/.


[163] Frequency Decoupled Framework for Screen Content Image Super-Resolution cs.CV

Xufei Wang, Qicheng Zhang, Qi Wu, Ziyang Gu, Shizhuang Weng

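TL;DR: This paper proposes a frequency decoupled framework (FDF) for screen content image super-resolution (SCISR). It rethinks SCISR from a phasor perspective: an Amplitude-Phase Factorization Network (APFN) separates the amplitude and phase components of an image to extract periodic patterns and global structure, respectively, and an Oscillation-Anharmonic Implicit Fitting Network (OAIF-Net) performs efficient reconstruction. Experiments show state-of-the-art performance across multiple scales and datasets.

Details

Motivation: Existing SCISR methods based on implicit neural representations overlook the inherent frequency characteristics of images, which leads to suboptimal performance. The paper explicitly models amplitude and phase components to better recover the regular textures and global structure of screen content images.

Result: In multi-scale super-resolution experiments on four public SCI datasets, FDF achieves state-of-the-art performance. Ablation experiments confirm the effectiveness of each component in extracting and exploiting periodic patterns and coherent context.

Insight: The main novelty is frequency decoupling of SCISR from a phasor perspective, with a dedicated Amplitude Clustering Module (ACM) and Phase Consistency Self-Attention (PCSA) handling amplitude and phase information, respectively. Viewed objectively, combining frequency decomposition with bespoke implicit representations offers a new approach to recovering the structured patterns of screen content images.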

Abstract: Methods based on implicit neural representations have demonstrated superior performance in Screen Content Image Super-Resolution (SCISR). However, they overlook the inherent frequency characteristics, leading to suboptimal performance. We propose a frequency decoupled framework (FDF) that rethinks SCISR from a phasor perspective by capturing structured energy in amplitude and relational continuity in phase, and jointly exploiting them with bespoke implicit representations to faithfully recover the regular textures and global configuration of Screen Content Images (SCIs). The Amplitude-Phase Factorization Network (APFN) first separates images into amplitude and phase streams, where the Amplitude Clustering Module (ACM) organizes sparse yet high-energy amplitude responses into representative prototypes for periodic pattern extraction, while Phase Consistency Self-Attention (PCSA) progressively reinforces configuration through continuous consistency propagation. The Oscillation-Anharmonic Implicit Fitting Network (OAIF-Net) then integrates periodic and coherent implicit representations for efficient exploitation of the periodic patterns and coherent context embedded in SCIs. Experimental results show FDF achieves state-of-the-art SCISR performance at multiple scales across four public SCI datasets. Ablation experiments further demonstrate the effectiveness of each component in extracting and exploiting periodic patterns and coherent context.
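
The amplitude/phase split that the abstract builds on can be illustrated with a plain Fourier transform. The sketch below is not the paper's APFN; it only shows, on a hypothetical 8-pixel row with a period-2 pattern, that the amplitude spectrum concentrates the periodic energy in a few bins and that amplitude and phase together reconstruct the signal. All function names are illustrative assumptions.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform (O(n^2)), enough for a tiny example."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft(); returns complex samples."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def split_amplitude_phase(spectrum):
    """Factor each complex coefficient into magnitude and angle."""
    amplitude = [abs(c) for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]
    return amplitude, phase


def merge_amplitude_phase(amplitude, phase):
    """Rebuild complex coefficients from magnitude and angle."""
    return [cmath.rect(a, p) for a, p in zip(amplitude, phase)]


# Hypothetical pixel row with a regular period-2 texture, as in screen content.
row = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
spectrum = dft(row)
amplitude, phase = split_amplitude_phase(spectrum)
reconstructed = [v.real for v in idft(merge_amplitude_phase(amplitude, phase))]
```

Here the energy sits in the DC bin and the bin at half the sampling rate, which is the kind of sparse, high-energy amplitude response the abstract describes for periodic patterns.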


[164] MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation cs.CV | cs.LG

Ishaan Preetam Chandratreya, David Charatan, Basile Van Hoorick, Sergey Zakharov, Vitor Guizilini

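TL;DR: This paper proposes MilliVid, which improves the long-range consistency of video generation models through coarse-to-fine generation in a multi-scale token space. It first pre-trains an autoencoder that encodes each frame into a coarse-to-fine hierarchy of tokens, then trains a video diffusion model to generate these tokens from coarse to fine, preserving scene layout and semantic consistency while reducing compute.

Details

Motivation: Long-range consistency is hard to achieve in video generation because conventional approaches need impractically long Transformer sequences to handle even a few dozen frames.

Result: On a custom dataset of long Minecraft videos, the method produces substantially more consistent rollouts than existing baselines.

Insight: The novelty is to use hierarchical latents and coarse-to-fine generation to decompose long-range consistency modeling by level of detail. Consistency of key information such as scene layout and semantics is prioritized, which improves coherence efficiently.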

Abstract: Video generative models have become increasingly powerful, but long-range consistency remains challenging to achieve because even a few dozen frames require impractically long transformer sequence lengths. We show that this issue can be mitigated by generating video using coarse-to-fine rollout within a multi-scale token space. Our approach is simple: first, we pre-train an autoencoder that compresses each frame into a hierarchy of tokens, with levels ranging from the typical latent resolution to only a handful of tokens per frame. The coarsest levels capture the most consequential information, such as scene layout and semantics, while finer levels add high-frequency appearance and texture. Then, we train a video diffusion model to generate these tokens using coarse-to-fine rollout. By carefully controlling the level of detail at which frames are generated and used as context during each rollout step, we are able to preserve long-range consistency in geometry and object permanence while spending less compute on the long-range consistency of less perceptually relevant details. We validate this approach using a custom dataset of long Minecraft videos, where it produces substantially more consistent rollouts compared to existing baselines.


[165] See More, Think Deeper: Query-Expanded Visual Evidence and Answer-Clue Guided Reflection for Long Video Understanding cs.CV | cs.AI

Shuning Wang, Zhiheng Wu, YiNuo Lu, Naiming Liu, Chen Jia

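TL;DR: This paper proposes CoVER, a framework that improves the performance of video large language models on long-video understanding through query-expanded visual evidence gathering and answer-guided visual feedback.

Details

Motivation: Existing video large language models face two limitations in long-video understanding: evidence acquisition relies on a single search intent, and answer generation lacks an effective visual feedback mechanism.

Result: CoVER-7B substantially outperforms baselines of the same parameter scale and surpasses state-of-the-art closed-source models on certain metrics.

Insight: The novelty is shifting long-video understanding from answer-centric generation to evidence-centric, visually verifiable reasoning, using dynamic query expansion and answer-specific visual feedback.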

Abstract: Recent advances in Video Large Language Models (Video-LLMs) have enabled performance on long-video understanding tasks. However, existing methods still face two key limitations: evidence acquisition often relies on a single search intent, and answer generation lacks an effective visual feedback mechanism. To address these limitations, we propose CoVER, a Comprehensive Visual Evidence and Reflection framework for long-video understanding. CoVER enables Video-LLMs to See More by dynamically gathering query-expanded visual evidence, and Think Deeper by verifying draft answers with effective answer-specific visual feedback. Together, these mechanisms shift long-video understanding from answer-centric generation to evidence-centric and visually verifiable reasoning. Experimental results show that CoVER-7B substantially outperforms models with the same parameter scale and even surpasses state-of-the-art closed-source models on certain metrics.


[166] CRANE: Knowledge Editing for Reasoning MLLMs cs.CV | cs.CL

Han Huang, Hao Wang, Mengqi Zhang, Shu Wu, Qiang Liu

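TL;DR: This paper identifies a new challenge in knowledge editing for reasoning multimodal large language models (MLLMs): conventional editing methods can fail severely once the model's reasoning process is examined. The authors identify three failure modes and build a dedicated benchmark, ReasonEdit-Bench. To address the problem they propose CRANE, a retrieval-augmented framework that requires no per-edit parameter modification and uses dual-library retrieval with a two-phase training strategy to improve knowledge editing in conflict scenarios and multi-hop reasoning.

Details

Motivation: Existing knowledge editing methods look successful under the traditional teacher-forcing accuracy metric yet can fail completely when the model's reasoning chain is inspected. Traditional evaluation protocols therefore do not capture the true effect of an edit when the model generates explicit chain-of-thought (CoT).

Result: On ReasonEdit-Bench, CRANE achieves 96.9% Grounded Success (success judged on the reasoning chain) in conflict scenarios and 96.9% intermediate entity usage in multi-hop chains, with Edit Independence of 97.6% for text locality and 68.1% for image locality. On the out-of-distribution MMEVOKE benchmark it reaches 87.0% under gold retrieval, indicating strong editing efficacy and generalization.

Insight: The core contributions are: 1) the first systematic identification and definition of three failure modes in knowledge editing for reasoning MLLMs (Structural Collapse, Cognitive Dissonance, and Shallow Internalization); 2) a CoT-aware evaluation protocol and benchmark (ReasonEdit-Bench) that targets the reasoning process, with conflict stratification and multi-level probes; 3) the CRANE framework, which combines a modality-aware dual-library retrieval system with a two-phase training strategy (SFT + GRPO) and introduces a Cognitive Routing Reward that trains the model to arbitrate between visual priors and injected edit facts, enabling deep, robust edits without per-edit parameter modification.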

Abstract: The emergence of reasoning multimodal large language models (MLLMs), which generate explicit chain-of-thought (CoT) reasoning before producing answers, has introduced a new challenge for knowledge editing: methods that appear successful under traditional metrics (teacher-forcing accuracy up to 100%) can fail severely when the model’s reasoning process is examined (Grounded Success as low as 0%). We identify three failure modes: (1) Structural Collapse, where weight-modifying methods destroy the CoT format; (2) Cognitive Dissonance, where the model’s reasoning chain actively rejects the injected edit fact based on visual evidence; and (3) Shallow Internalization, where methods succeed on exact queries but fail on rephrase or multi-hop variants. On reasoning MLLMs, these modes interact: methods that generalize (FT, LoRA) trigger format collapse, while methods without deep modification cannot generalize. To expose these failures, we propose a CoT-aware evaluation protocol and construct ReasonEdit-Bench, with conflict stratification, multi-level probes, and multi-hop portability tests. We propose CRANE, a retrieval-augmented framework that requires no per-edit parameter modification. CRANE combines a modality-aware dual-library retrieval system with a two-phase training strategy: Supervised Fine-Tuning (SFT) for structural initialization, followed by GRPO with a Cognitive Routing Reward that trains the model to arbitrate between visual priors and injected edit facts. On ReasonEdit-Bench, CRANE achieves 96.9% Grounded Success on conflict scenarios and 96.9% intermediate entity usage in multi-hop chains, with 97.6% text-locality and 68.1% image-locality Edit Independence. On the out-of-distribution MMEVOKE benchmark, CRANE reaches 87.0% under gold retrieval.


[167] Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions cs.CV

Xin Jin, Huanqia Cai, Zhen Li, Zechao Zhan, Dengyang Jiang

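TL;DR: This paper proposes Z-Reward, a teacher-student reward modeling framework for text-to-image post-training. It decouples reasoning-based generation of score distributions from efficient reward deployment: the teacher (a large VLM) uses reasoning to produce rubric-aligned score distributions, and the student (a compact VLM) internalizes the teacher's reasoning through distillation, with no explicit reasoning chain needed at inference time. The method achieves higher human-preference accuracy than existing baselines on an internal evaluation set and can serve as a differentiable reward signal for optimizing text-to-image models.

Details

Motivation: Existing reward models (scalar, score-token, or pairwise) over-compress the uncertainty and fine-grained score differences of visual preference, while reasoning-based generative rewards give stronger judgments but are costly to deploy and hard to use directly as optimization signals.

Result: On the internally annotated evaluation set, the 27B GDSO teacher reaches 89.6% human-preference accuracy, outperforming SFT, RewardDance, and GRPO. The 9B RISD student reaches 88.6%, outperforming the OPD baseline and closely matching the larger teacher. Used as a differentiable reward signal for text-to-image optimization, it yields a 41.3% net human-preference improvement over the SFT baseline.

Insight: The core idea is to move reward modeling from deterministic scalars or simple distributions to rubric-aligned score distributions produced through reasoning, and to use a teacher-student framework (GDSO to train the teacher, RISD to distill the student) to internalize the reasoning and deploy it efficiently. This balances judgment accuracy against deployment cost.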

Abstract: Reward models are central to text-to-image post-training, but visual preference is subjective and better represented as a distribution over rubric scores than as a deterministic scalar. Existing scalar, score-token, and pairwise reward models over-compress uncertainty and fine-grained score differences, while reasoning-based generative rewards provide stronger judgments but are costly to deploy and difficult to use as direct optimization signals. We propose Z-Reward, a teacher-student reward modeling framework that decouples reasoning-heavy judgment from efficient reward deployment. The teacher is a large VLM that uses reasoning to infer rubric-aligned score distributions, and is trained with Group-wise Direct Score Optimization (GDSO), which combines policy-gradient rewards from distribution expectations with direct pointwise and pairwise supervision on score distributions and score gaps. The student is trained with Reasoning-Internalized Score Distillation (RISD), which transfers the teacher’s reasoning-conditioned score distribution into a compact VLM without requiring explicit reasoning chains at inference time. On our internally annotated evaluation set, the 27B GDSO teacher reaches 89.6% human preference accuracy, outperforming SFT, RewardDance, and GRPO, while the 9B RISD student reaches 88.6%, outperforming the OPD baseline and closely matching the larger teacher. We further show that Z-Reward can serve as a differentiable reward signal for text-to-image optimization, yielding a 41.3% net human-preference improvement over the SFT baseline.
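
To make the "distribution over rubric scores" idea concrete, the sketch below compares two hypothetical score distributions on a 1-to-5 rubric. It is not GDSO or RISD; it only shows that two images whose single most likely score is identical can still differ in expected score, which is the fine-grained gap the abstract says scalar and score-token rewards compress away. The distributions and function names are illustrative assumptions.

```python
def expected_score(distribution):
    """Expectation of a {score: probability} mapping (normalized defensively)."""
    total = sum(distribution.values())
    return sum(score * prob for score, prob in distribution.items()) / total


def most_likely_score(distribution):
    """What a single score-token readout would report."""
    return max(distribution, key=distribution.get)


def score_gap(dist_a, dist_b):
    """Signed difference in expected score between two candidates."""
    return expected_score(dist_a) - expected_score(dist_b)


# Hypothetical rubric distributions for two generated images.
image_a = {1: 0.0, 2: 0.1, 3: 0.5, 4: 0.3, 5: 0.1}
image_b = {1: 0.1, 2: 0.3, 3: 0.5, 4: 0.1, 5: 0.0}

same_mode = most_likely_score(image_a) == most_likely_score(image_b)
gap = score_gap(image_a, image_b)
```

Both images share the same most likely score, yet the expectations differ, so a reward built on the expectation still ranks one above the other.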


[168] Driving Video Retrieval for Complex Queries with Structured Grounding cs.CV | cs.IR | cs.LG

Manyi Yao, Sparsh Garg, Christian Shelton, Amit Roy-Chowdhury, Abhishek Aich

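TL;DR: This paper proposes STRIVE-D, a framework for retrieving autonomous-driving videos, particularly for complex queries about dynamic events such as cut-ins and hard braking. It uses weakly labeled in-domain videos to calibrate the reliability of rule-based queries, adapts rules that do not match the data, and fuses the calibrated rule scores with vision-language and keyword-based retrieval signals to improve retrieval performance.

Details

Motivation: Existing vision-language and keyword-based retrieval methods struggle to capture dynamic events in driving videos because those events may not be described explicitly in text or may lack lexical overlap. Rule-based methods encode events directly but are brittle and tend to fail when their assumptions do not match real driving data.

Result: Across three driving benchmarks, including newly released human-annotated event data on DrivingDojo, STRIVE-D achieves up to 84% relative improvement in top-1 accuracy over state-of-the-art methods.

Insight: The novelty is a data-calibrated retrieval framework that uses weakly labeled videos to estimate the reliability of query rules and adapt them, combined with multimodal retrieval signals. Viewed objectively, the method addresses the brittleness of rule-based retrieval and uses fusion to improve the robustness and accuracy of complex-event retrieval.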

Abstract: Video retrieval at scale is central to data curation and safety validation in autonomous driving, where users want to find not only scenes but also dynamic events such as cut-ins and hard braking. Existing vision-language and keyword-based retrieval methods often miss these events because the relevant motion may not be explicitly described in text or captured by lexical overlap. Rule-based retrieval can encode such events more directly, but it is brittle: generated or hand-written rules often fail when their assumptions do not match real driving data. We propose STRIVE-D, a data-calibrated retrieval framework for driving videos. It uses weakly labeled in-domain videos to estimate when a query rule is reliable, adapt rules that mismatch observed data, and fuse calibrated rule scores with vision-language and keyword-based retrieval signals. Across three driving benchmarks, including newly released human-annotated event data on DrivingDojo, STRIVE-D delivers up to 84% relative improvement in top-1 accuracy over state-of-the-art methods.
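
The abstract does not give the calibration or fusion formulas, so the following is only one plausible reading, sketched under stated assumptions: rule reliability is estimated as the rule's precision on weakly labeled clips, and that reliability weights the rule score against the average of the vision-language and keyword scores. The function names, the linear fusion, and all numbers are hypothetical and are not taken from STRIVE-D.

```python
def rule_reliability(rule_hits, weak_labels):
    """Assumed calibration: precision of the rule on weakly labeled clips."""
    fired = [label for hit, label in zip(rule_hits, weak_labels) if hit]
    return sum(fired) / len(fired) if fired else 0.0


def fuse(rule, vlm, keyword, reliability):
    """Assumed fusion: trust the rule in proportion to its reliability."""
    fallback = 0.5 * (vlm + keyword)
    return reliability * rule + (1.0 - reliability) * fallback


def rank(candidates, reliability):
    """Sort clip ids by fused score, best first."""
    return sorted(
        candidates,
        key=lambda clip: fuse(**candidates[clip], reliability=reliability),
        reverse=True,
    )


# Hypothetical per-clip scores for a "cut-in" query.
candidates = {
    "clip_001": {"rule": 1.0, "vlm": 0.40, "keyword": 0.20},
    "clip_002": {"rule": 0.0, "vlm": 0.70, "keyword": 0.60},
}

# The rule fired on three weakly labeled clips and was right on two of them.
reliability = rule_reliability([True, True, False, True], [1, 0, 0, 1])
ranked_reliable = rank(candidates, reliability)
ranked_unreliable = rank(candidates, 0.1)
```

With a reasonably reliable rule the clip it flags ranks first; when the rule is judged unreliable, the ranking falls back to the vision-language and keyword signals and the order flips.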


[169] HDRAgent: An Agentic Framework for Multi-Exposure HDR Imaging cs.CV

Weiyu Zhou, Tao Hu, Yijian Wang, Xiaogang Xu, Ruixing Wang

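TL;DR: The paper proposes HDRAgent, the first agent-driven framework for multi-exposure high dynamic range (HDR) imaging. It introduces a fine-grained contextual knowledge matching module and a perception-distortion feedback mechanism to select reconstruction strategies adaptively according to scene conditions, and designs an agent-guided generative alignment strategy for handling extreme motion.

Details

Motivation: Most existing multi-exposure HDR methods follow a fixed feed-forward reconstruction paradigm and are prone to ghosting artifacts in complex dynamic scenes. The paper addresses this by introducing an agent framework that adapts dynamically to scene changes, improving the robustness and quality of HDR reconstruction.

Result: Experiments show that HDRAgent effectively reduces ghosting and local artifacts while matching or exceeding existing methods in objective metrics and visual quality.

Insight: The main novelty is bringing the agent paradigm to HDR imaging, using a multimodal large language model for scene perception and knowledge retrieval together with a feedback mechanism that keeps refining the strategy. The agent-guided generative alignment strategy offers a new approach to misalignment caused by extreme motion.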

Abstract: Most existing multi-exposure HDR methods follow a fixed feed-forward reconstruction paradigm, making them prone to ghosting artifacts in complex dynamic scenes. To address this issue, we propose HDRAgent, the first agent-driven framework for HDR imaging, which adaptively selects reconstruction strategies according to the current scene conditions. Specifically, to provide scene-specific prior knowledge, we introduce a fine-grained contextual knowledge matching (FCM) module. This module leverages multimodal large language model (MLLM)-derived scene perception to retrieve relevant historical cases and tool knowledge, organizing them into structured evidence for MLLM-based adaptive tool scheduling. In addition, we propose a perception–distortion feedback mechanism that transforms post-execution quality assessment and artifact diagnosis into structured feedback, which is accumulated in historical memory to help subsequent contextual knowledge refinement and strategy selection. Furthermore, considering that extreme motion can invalidate alignment methods, we design an agent-guided generative alignment strategy that uses MLLM-based dynamic-region parsing to reconstruct unreliable contents in non-reference frames under reference-frame guidance. Experiments demonstrate that HDRAgent effectively reduces ghosting and local artifacts while achieving competitive or superior objective performance and visual quality.


[170] Decoding Pedestrian Crossing Intention from Egocentric Vision via Vision Language Models cs.CV | cs.AI

Danya Li, Xiang Su, Yan Feng, Rico Krueger

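TL;DR: This paper studies how to decode pedestrian crossing intention from short egocentric video clips, framing the task as closed-ended visual question answering and using vision language models (VLMs) for prediction. The authors first evaluate three families of SOTA VLMs in a zero-shot setting and find limited performance, then improve the models substantially with parameter-efficient fine-tuning, and finally add contextual cues such as ego motion, vehicle motion, and eye gaze, reaching SOTA results that surpass a specialized Transformer baseline.

Details

Motivation: Egocentric video reflects human perception and decision making, but its potential for traffic-safety prediction, such as decoding pedestrian crossing intention, remains underexplored.

Result: Fine-tuned VLMs improve substantially over their zero-shot counterparts. The Qwen3-VL-2B model guided by eye gaze and ego motion achieves a 14.5% accuracy improvement over the specialized Transformer baseline, setting a new state of the art for the task.

Insight: The novelty lies in formalizing pedestrian intent decoding as a VQA problem and systematically exploring the zero-shot ability of VLMs, parameter-efficient fine-tuning, and the fusion of multimodal contextual cues such as eye gaze and motion information, providing an effective framework for egocentric traffic-safety prediction.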

Abstract: Egocentric vision offers a first-person view of human perception and decision making, yet its potential for traffic-safety prediction remains underexplored. In this work, we study the decoding of pedestrian crossing intentions from short egocentric video clips. We approach this by formulating the task as a closed-ended visual question answering (VQA) problem and leveraging vision language models (VLMs) to predict the pedestrians’ intent. We first benchmark three families of state-of-the-art VLMs in a zero-shot setting, finding that they achieve moderate gains over random guessing but exhibit limited higher-level traffic reasoning. Motivated by these findings, we further adapt VLMs to the target task using parameter-efficient fine-tuning. Our results show that the fine-tuned models substantially outperform their zero-shot counterparts and achieve a 9% accuracy improvement over a specialized transformer-based baseline. Finally, we demonstrate that incorporating additional contextual cues, including ego motion, vehicle motion, and eye gaze, further improves predictive performance. In particular, the fine-tuned Qwen3-VL-2B model guided by eye gaze and ego motion achieves a 14.5% accuracy improvement over the transformer baseline, establishing a new state of the art for egocentric pedestrian intent decoding.


[171] Ultra Flash: Scaling Real-Time Streaming Video Generation to High Resolutions cs.CV

Luxury, Jie Huang, Zihao Fan, Xiaoxiao Ma, Yuming Li

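TL;DR: This paper proposes Ultra Flash, a cascaded streaming framework for real-time high-resolution video generation. Through three key technical contributions, it achieves about 30 FPS at 1K resolution and about 18 FPS at 2K resolution on a single GPU.

Details

Motivation: Existing autoregressive video diffusion models are confined to low resolutions (e.g., 480P), leaving efficient, scalable, real-time high-resolution video generation a fundamental open challenge.

Result: Extensive experiments show that Ultra Flash reliably produces ultra-high-resolution streaming video while maintaining state-of-the-art visual quality and superior efficiency.

Insight: The contributions are: 1) an architecture-preserving T2V-to-TV2V super-resolution training paradigm with an AIGC-oriented data degradation pipeline, which preserves the generative capability of the base model; 2) a causal streaming latent upsampler paired with a high-resolution decoder, which improves spatiotemporal coherence; 3) a cascaded high-resolution streaming video generation optimization scheme that combines hybrid-reward-enhanced sparse causalization, single-step distillation, and cascaded streaming self-forcing preference optimization with dynamic cache management, jointly improving overall coherence and quality.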

Abstract: While recent autoregressive video diffusion models achieve remarkable streaming quality, they remain confined to low resolutions (e.g., 480P), leaving efficient, scalable, real-time high-resolution video generation a fundamental open challenge. To bridge this gap, we present Ultra Flash, a cascaded streaming framework capable of real-time high-resolution video generation. Ultra Flash achieves ~30 FPS at 1K resolution and ~18 FPS at 2K resolution on a single GPU through three key contributions: (1) an architecture-preserving T2V-to-TV2V super-resolution training paradigm coupled with an AIGC-oriented data degradation pipeline that effectively preserves the generative capability of the base model, enabling enhanced high-resolution detail when cascaded after mainstream low-resolution generative models; (2) a causal streaming latent upsampler paired with a high-resolution decoder, which enhances spatiotemporal coherence while enabling efficient latent spatial scaling and precise high-resolution decoding with negligible computational overhead; and (3) a cascade high-resolution streaming video generation optimization scheme that first performs hybrid-reward-enhanced sparse causalization and single-step distillation of the super-resolution model, then introduces cascaded streaming self-forcing preference optimization with dynamic cache management, jointly enhancing overall coherence, improving quality, and enabling real-time high-resolution streaming video generation. Extensive experiments demonstrate that Ultra Flash reliably produces ultra-high-resolution streaming video while maintaining state-of-the-art visual quality and superior efficiency.


[172] CAMF-Det: Closure-Aware Multimodal Fusion for LiDAR-Camera 3D Object Detection on UAV Platforms cs.CV

Yanze Jiang, Yanfeng Gu, Xian Li

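TL;DR: This paper proposes CAMF-Det, a closure-aware (occlusion-aware) multimodal fusion framework for LiDAR-camera 3D object detection on UAV platforms. It derives dual-modal occlusion intensity through physics-inspired modeling and embeds it as a prior throughout the detection pipeline, addressing the information degradation caused by tree canopies and other occluders in top-down UAV scenes. Experiments on two self-built UAV multimodal datasets show the best performance at all difficulty levels.

Details

Motivation: LiDAR-camera multimodal 3D object detection performs well in ground-vehicle scenarios but has not been explored on UAV platforms. In top-down UAV scenes, frequent ground-object occlusion dominated by tree canopies causes spatially varying, modality-dependent information degradation. Existing multimodal fusion frameworks neither model this occlusion explicitly nor embed occlusion awareness in the detection pipeline, which limits their performance in occluded UAV scenes.

Result: On two self-built UAV multimodal datasets, SI3D-DI and SI3D-DII, CAMF-Det achieves the best performance at all difficulty levels, with hard-level BEV mAP improvements of 9.43% and 4.88%, respectively, over the best competing methods.

Insight: The core novelty is explicit, physics-inspired offline modeling and online prediction of dual-modal (LiDAR and camera) occlusion intensity in UAV scenes, with this prior injected systematically into data augmentation, feature encoding, multimodal fusion, and the detection head, so that detection adapts to spatially varying, modality-dependent degradation. Integrating a physics-aware occlusion prior deeply into a deep learning framework is a useful reference for improving multimodal perception robustness in complex scenes.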

Abstract: Multimodal 3D object detection based on LiDAR and cameras has demonstrated excellent performance in ground-vehicle scenarios, but has not been explored for Unmanned Aerial Vehicle (UAV) platforms. In UAV top-down scenes, frequent ground-object occlusion dominated by tree canopies causes spatially varying and modality-dependent information degradation. Existing multimodal fusion frameworks neither explicitly model such ground-object occlusion nor embed occlusion awareness into the detection pipeline, limiting their performance in occluded UAV scenes. To address these challenges, we propose CAMF-Det, a closure-aware multimodal fusion framework for LiDAR-camera 3D object detection on UAV platforms, which derives dual-modal occlusion intensity through physics-inspired modeling and embeds it as a prior throughout the detection pipeline. First, a dual-modal closure modeling module explicitly constructs occlusion intensity ground truth for both modalities offline via a Beer-Lambert-inspired formulation and building-mask correction. Second, using these ground-truth maps as supervision, a dual-modal prediction network converts the offline modeling results into online occlusion intensity predictions under single-frame inference. Third, both ground-truth and predicted occlusion intensity are injected into data augmentation, feature encoding, multimodal fusion, and the detection head, enabling adaptive detection under spatially varying and modality-dependent information degradation. Experiments on two self-built UAV-based multimodal datasets, SI3D-DI and SI3D-DII, demonstrate that CAMF-Det achieves the best performance across all difficulty levels, with hard-level mAP$_{\mathrm{BEV}}$ improvements of 9.43% and 4.88% over the best competing methods, respectively. These results confirm the effectiveness of explicit occlusion prior modeling and exploitation for robust multimodal 3D detection in UAV scenes.
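
The abstract mentions a "Beer-Lambert-inspired formulation" without giving it. The classical Beer-Lambert law models transmittance through an attenuating medium as exp(-k * L), so one simple reading, offered here purely as an assumption, is occlusion intensity = 1 - exp(-k * L), where L is the path length through canopy and k is a per-modality attenuation coefficient. The coefficients below are made-up values, not numbers from the paper, and the building-mask correction is not modeled.

```python
import math


def occlusion_intensity(path_length_m, attenuation_per_m):
    """Assumed form: 1 - transmittance, with Beer-Lambert transmittance exp(-k*L)."""
    if path_length_m < 0 or attenuation_per_m < 0:
        raise ValueError("path length and attenuation must be non-negative")
    transmittance = math.exp(-attenuation_per_m * path_length_m)
    return 1.0 - transmittance


# Hypothetical per-modality attenuation coefficients (1/m); LiDAR pulses can
# partly pass through canopy gaps, so the sketch gives it a smaller value.
ATTENUATION = {"lidar": 0.3, "camera": 0.8}


def dual_modal_occlusion(path_length_m):
    """Occlusion intensity for each modality along the same canopy path."""
    return {
        modality: occlusion_intensity(path_length_m, k)
        for modality, k in ATTENUATION.items()
    }


open_ground = dual_modal_occlusion(0.0)
under_canopy = dual_modal_occlusion(2.0)
```

The value is 0 with no canopy in the path and approaches 1 as the path through canopy grows, giving a bounded per-modality map of the kind that could be used as a prior.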


[173] OmniGen-AR: AutoRegressive Any-to-Image Generation cs.CVPDF

Junke Wang, Xun Wang, Qiushan Guo, Peize Sun, Weilin Huang

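TL;DR: OmniGen-AR is a unified autoregressive framework for Any-to-Image generation. It discretizes diverse conditions (text, spatial signals, and visual context) with a shared visual tokenizer and a text tokenizer, so a single model covers a broad range of image generation tasks: text-to-image, segmentation-to-image, depth-to-image, image editing, frame prediction, and text-to-video generation.

Details

Motivation: Existing autoregressive models are usually limited to single-modality conditions such as text, which restricts their use in real-world scenarios that need diverse controls. The paper aims to build a general autoregressive visual generation framework that handles many kinds of conditional input.

Result: OmniGen-AR achieves new state-of-the-art or at least competitive results on several benchmarks, for example 0.63 on GenEval and 80.02 on VBench, demonstrating its effectiveness for flexible, high-fidelity visual generation.

Insight: The core innovation is Disentangled Causal Attention, which separates the full-sequence causal mask into condition causal attention and content causal attention. It acts as a training-time regularizer that prevents condition information from leaking into content tokens, without affecting standard next-token prediction at inference. This offers an effective architectural design for unified multi-condition generative models.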

Abstract: Autoregressive (AR) models have demonstrated strong potential in visual generation, offering superior performance with simple architectures and optimization objectives. However, existing methods are typically limited to single-modality conditions, e.g., text, restricting their applicability in real-world scenarios that demand image synthesis from diverse controls. In this work, we present OmniGen-AR, a unified autoregressive framework for Any-to-Image generation. By discretizing various visual conditions through a shared visual tokenizer and text prompts with a text tokenizer, OmniGen-AR supports a broad spectrum of conditional inputs within a single model, including text (text-to-image generation), spatial signals (segmentation-to-image and depth-to-image), and visual context (image editing, frame prediction, and text-to-video generation). To mitigate the risk of information leakage from condition tokens to content tokens, we introduce Disentangled Causal Attention (DCA), which separates the full-sequence causal mask into condition causal attention and content causal attention. It serves as a training-time regularizer without affecting the standard next-token prediction during inference. With this design, OmniGen-AR achieves new state-of-the-art or at least competitive results across a range of benchmarks, e.g., 0.63 on GenEval and 80.02 on VBench, demonstrating its effectiveness in flexible and high-fidelity visual generation.


[174] Zero-Parameter Geometric Gating for Temporally Stable Low-Altitude UAV Video Semantic Segmentation cs.CVPDF

Jingpu Yang, Fengxian Ji, Zhengzhao Lai, Juanfan Wu, Mingxuan Cui

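TL;DR: This paper proposes a zero-parameter geometric gate to improve the temporal stability of low-altitude UAV video semantic segmentation. It computes RANSAC homography inlier ratios on a 16×16 spatial grid, routes each region to either a homography warp or an optical flow warp, and then fuses the result via Semantic Similarity Propagation. The method adds only 211K trainable parameters, markedly improves mIoU on synthetic UAVid, and restores temporal consistency in planar regions.

Details

Motivation: Low-altitude UAV video semantic segmentation requires temporal consistency, but dense optical flow introduces spatially structured noise in the planar regions that dominate aerial imagery. Existing methods struggle to handle this noise, which motivates a mechanism that needs no learned parameters and adaptively selects the better warp.

Result: On synthetic UAVid, the method improves mIoU by 4.24% to 4.91% over the base models across two backbones (SegFormer-b2 and Hiera-S+UPerNet). Mechanism diagnostics show that it restores temporal consistency in homography-valid regions from 62% to 92% (+29.5 percentage points).

Insight: The novelty is a zero-parameter gating decision based entirely on geometric statistics (RANSAC inlier ratios), which adaptively selects homography or optical flow warping without any learning. The analysis shows that flow residuals in planar regions are spatially autocorrelated (Moran's I = 0.32) and correlate with boundary instability (Spearman's ρ = 0.66), offering a new perspective on temporal inconsistency.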

Abstract: Video semantic segmentation for low-altitude UAVs requires temporal consistency, yet dense optical flow introduces spatially structured noise in the planar regions that dominate aerial imagery. We propose a zero-parameter geometric gate that uses RANSAC homography inlier ratios on a $16\times16$ spatial grid to route each region to either homography or optical flow warp before fusion via Semantic Similarity Propagation (SSP). The gate requires no learned parameters, only a median-threshold binary decision on RANSAC statistics, and adds only 211K trainable parameters (the SSP fusion layer) to a frozen backbone. On synthetic UAVid, the method achieves +4.24–4.91% mIoU improvement over base models across two architectures (SegFormer-b2 and Hiera-S+UPerNet). Mechanism diagnostics reveal that flow residuals in planar regions are spatially autocorrelated (Moran’s I = 0.32, $p < 0.001$), predict boundary instability (Spearman $\rho = 0.66$), and that rigidification recovers temporal consistency from 62% to 92% (+29.5pp) in homography-valid regions.
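The median-threshold gate described in the abstract can be illustrated with a minimal sketch. It assumes the per-cell RANSAC inlier ratios are already computed, uses a toy 2×2 grid instead of the paper's 16×16 grid, and assumes that cells at or above the median go to the homography warp; the function name `route_regions` and the tie-handling rule are hypothetical, not taken from the paper.

```python
from statistics import median

def route_regions(inlier_ratios):
    """Label each grid cell 'homography' or 'flow' from its RANSAC inlier ratio.

    inlier_ratios: list of rows, each a list of floats in [0, 1].
    """
    flat = [r for row in inlier_ratios for r in row]
    threshold = median(flat)  # no learned parameter, just a statistic of the grid
    # Assumption: cells at or above the median are treated as planar enough
    # for the homography warp; the rest fall back to optical flow.
    return [
        ["homography" if r >= threshold else "flow" for r in row]
        for row in inlier_ratios
    ]

# Toy grid of hypothetical inlier ratios.
grid = [[0.9, 0.2],
        [0.7, 0.4]]
routes = route_regions(grid)
```

Because the threshold is the median of the grid itself, roughly half of the cells are routed to each warp regardless of the scene, which is one reading of why the gate needs no tuning.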


[175] Vision-Language Guided Hyperspectral Object Tracking via Semantics Fusion and Contextual Template Updating cs.CVPDF

Rui Yao, Yuhong Zhang, Kunyang Sun, Hancheng Zhu, Jiaqi Zhao

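TL;DR: This paper proposes VLHTrack, a new hyperspectral vision-language joint tracking framework that targets two challenges in hyperspectral object tracking: spectral redundancy and target deformation in dynamic scenes. A Language-Guided Band Selection Module (LBSM) uses large language model descriptions to build a semantic-to-spectral mapping that mitigates redundancy, and a multi-modal vision-language fusion module integrates visual and linguistic features. In addition, a Mamba-based Dynamic Template Update Module (DTUM) learns inter-frame dependencies to update template features and cope with target appearance changes in long sequences.

Details

Motivation: Hyperspectral object tracking (HOT) can exploit rich spectral information but faces two core problems. First, efficiently extracting and using spectral information from redundant bands is difficult, which limits model generalization. Second, in dynamic scenes, targets undergo drastic appearance changes due to occlusion, illumination changes, and similar factors, producing large discrepancies between the current frame and the template that challenge existing temporal modeling methods.

Result: Experiments on the HOT2023 and HOT2024 benchmarks show that VLHTrack outperforms existing state-of-the-art (SOTA) methods.

Insight: The novelty lies in introducing joint vision-language modeling to hyperspectral object tracking for the first time. Specifically: 1) language priors (LLM descriptions) guide spectral band selection, building a semantic-to-spectral mapping that highlights discriminative features; 2) a multi-modal fusion module integrates complementary visual and linguistic embeddings; 3) a dynamic template update strategy based on a selective state space model (Mamba) uses temporal context to evolve template features efficiently and handle long-term deformation.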

Abstract: Hyperspectral object tracking (HOT) leverages the rich spectral information provided by hyperspectral videos (HSVs), offering substantial potential for object tracking. However, efficiently extracting and exploiting spectral information from redundant spectral bands remains a fundamental challenge, which severely limits model generalization and tracking performance. Moreover, in dynamic scenes, targets often experience drastic appearance variations due to factors such as occlusion and illumination changes. These variations lead to large deformations between the current frame and the template. Such discrepancies pose major challenges for existing temporal modeling approaches. In this work, we propose VLHTrack, a novel hyperspectral vision-language (VL) joint tracking framework. Specifically, we incorporate language priors to address the fundamental challenge of spectral redundancy by designing a Language-Guided Band Selection Module (LBSM). By leveraging Large Language Model (LLM) descriptions, LBSM establishes a semantic-to-spectral mapping that mitigates redundancy and accentuates discriminative spectral features. A Multi-Modal Vision-Language Fusion Module is then employed to seamlessly integrate visual and linguistic embeddings, harnessing their complementary advantages to learn coherent cross-modal representations. To address target deformation in long-term sequences, we propose a dynamic update template feature strategy implemented via the Dynamic Template Update with Mamba (DTUM) module. By leveraging selective state space modeling, DTUM learns inter-frame dependencies to update template feature, ensuring efficient template feature evolution guided by temporal context. Experiments on HOT2023 and HOT2024 demonstrate that VLHTrack outperforms state-of-the-art (SOTA) methods.


[176] CP4D: Compositional Physics-aware 4D Scene Generation cs.CVPDF

Hanxin Zhu, Cong Wang, Tianyu He, Long Chen, Xin Jin

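TL;DR: CP4D proposes a new physics-aware paradigm for 4D scene generation that combines a static 3D environment with physically grounded dynamic objects to produce dynamic 3D scenes with high visual fidelity and physical consistency. Its core is a three-stage pipeline: generate high-fidelity 3D representations of the static environment and the foreground objects separately, synthesize physically plausible motion trajectories using priors from a physical simulator and a video diffusion model, and finally fuse them into a coherent 4D scene through an automated composition mechanism.

Details

Motivation: Existing 4D generation methods usually fail to capture the underlying physical principles, so their results are physically inconsistent and visually implausible. To address this, the paper aims to generate 4D scenes that are both realistic and faithful to complex physical dynamics.

Result: Extensive experiments show that CP4D generates explorable, interactive 4D scenes with high visual fidelity, strong physical plausibility, and fine-grained controllability, significantly outperforming existing methods.

Insight: The novelty is reformulating 4D generation as the composition of a static 3D environment with physically grounded dynamic objects. This is paired with a hybrid motion synthesis strategy that combines physical-simulator priors with the common sense embedded in video diffusion models, and with an automated scene composition mechanism, which together ensure physical consistency and visual realism.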

Abstract: 4D generation (i.e., dynamic 3D generation) has recently emerged as a rapidly growing research frontier due to its powerful spatiotemporal modeling capabilities. However, despite notable advances, existing approaches typically fail to capture the underlying physical principles, producing results that are both physically inconsistent and visually implausible. To overcome this limitation, we present CP4D, a novel paradigm for photorealistic 4D scene synthesis with faithful adherence to complex physical dynamics. Drawing inspiration from the compositional nature of real-world scenes, where immutable static backgrounds coexist with dynamic, physically plausible foregrounds, CP4D reformulates 4D generation as the integration of a static 3D environment with physically grounded dynamic objects. On this basis, our framework follows a three-stage pipeline: 1) First, we leverage pre-trained expert models to generate high-fidelity 3D representations of the environment and the foreground objects, respectively. 2) Subsequently, to produce physically plausible trajectories and realistic interactions for these objects, we propose a hybrid motion synthesis strategy that integrates priors from physical simulators with the common sense embedded in video diffusion models. 3) Finally, we develop an automated composition mechanism that seamlessly fuses the static environment and dynamic objects into coherent, physically consistent 4D scenes. Extensive experiments demonstrate that CP4D can generate explorable and interactive 4D scenes with high visual fidelity, strong physical plausibility, and fine-grained controllability, significantly outperforming existing methods. The project page: https://anonymous.4open.science/w/CP4D/.


[177] Counterfactual Reasoning for Fine-Grained Evidence Disentanglement in VideoQA cs.CV | cs.LGPDF

Zhou Du, Hamid Krim, Xiao Wu, Zhaoquan Yuan, Liangwei Li

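TL;DR: This paper proposes CREDiT, a counterfactual reasoning framework that addresses the reliance of video question answering (VideoQA) models on spurious statistical correlations rather than causal evidence. Using a structural causal model, the framework explicitly decomposes cross-modal representations into causal and non-causal components, and introduces feature-level causal interventions and counterfactual inputs to suppress non-causal associations, achieving fine-grained evidence disentanglement and localization.

Details

Motivation: Current video multimodal models often reason in VideoQA from spurious statistical correlations rather than answer-relevant causal evidence, which makes their reasoning unfaithful and brittle, especially in complex real-world scenarios. Existing methods typically operate at the time-interval level, cannot explicitly disentangle causal visual cues from confounders, and offer limited fine-grained evidence localization.

Result: Extensive experiments on benchmarks including NExT-GQA, SportsQA, and SPORTU-video show that CREDiT consistently improves answer accuracy and reasoning reliability in both generic and complex sports scenarios, leading to more trustworthy VideoQA systems.

Insight: The novelty is formalizing the VideoQA process as a structural causal model and learning explicitly decomposed cross-modal representations under independence and minimality constraints. Feature-level causal interventions and constructed counterfactual inputs approximate causal effects while suppressing non-causal correlations, giving finer-grained evidence disentanglement and localization than existing methods and a new direction for trustworthy multimodal reasoning systems.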

Abstract: Recent advances in video multimodal models have significantly improved VideoQA performance. However, these systems often rely on spurious statistical correlations rather than answer-relevant causal evidence, resulting in unfaithful and brittle reasoning, especially in complex real-world scenarios. Existing methods either rely on cross-modality correlations, costly curated training resources, or insufficient causal assumptions and constraints, and typically operate at the time-interval level. As a result, they fail to explicitly disentangle causal visual cues from confounders and provide limited fine-grained evidence localization. To address this issue, we propose a Counterfactual Reasoning framework for fine-grained Evidence Disentanglement (CREDiT). CREDiT formulates the VideoQA process using a structural causal model and learns cross-modality representations that are explicitly decomposed into causal and non-causal components under independence and minimality constraints. To facilitate faithful disentanglement, we introduce feature-level causal interventions and construct counterfactual inputs that approximate causal effects while suppressing non-causal correlations. Extensive experiments on NExT-GQA, SportsQA, and SPORTU-video demonstrate that CREDiT consistently improves answer accuracy and reasoning reliability across both generic and complex sports scenarios, leading to more trustworthy VideoQA systems.


[178] Event-driven dynamic trajectories reconstruction and measurement of mechanical parameters for fragments cs.CVPDF

Haoyang Li, Banglei Guan, Muxi Zha, Yifei Bian, Minzu Liang

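TL;DR: This paper proposes an event-driven method based on event cameras for reconstructing the dynamic 3D trajectories of fragments during warhead detonation and measuring their mechanical parameters (position, velocity, kinetic energy). The method builds a multi-event-camera vision system and uses three geometric constraints (time-correlated epipolar, trifocal tensor line, and local homography) together with a probability model and the entropy weight method to filter mismatches. Trajectories are then reconstructed and parameters computed via spatial line-line intersection and nonlinear optimization.

Details

Motivation: Warhead detonation produces high-density, high-speed, mutually occluding fragments whose mechanical parameters directly affect lethality, but the high-intensity flash and smoke in detonation scenes severely hinder accurate acquisition of these parameters.

Result: By exploiting the microsecond-level temporal resolution and high-dynamic-range sensing of event cameras, the method overcomes the difficulty of accurately measuring high-speed targets under strong flash interference, providing reliable technical support for mechanical damage assessment of fragment fields and for tactical protection design.

Insight: The novelty is bringing event cameras, a new bio-inspired visual sensor, into the measurement of detonation mechanics parameters, and proposing a probability model that fuses multiple geometric constraints with the entropy weight method to robustly handle 3D trajectory reconstruction in high-speed, occluded scenes.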

Abstract: During warhead detonation, high-density, high-speed, and mutually occluded fragments are generated. Their mechanical parameters (position, velocity, kinetic energy) directly determine the lethality of the warhead fragment field. However, high-intensity flash and smoke in detonation scenarios severely hinder the accurate acquisition of these mechanical parameters. To address this challenge, this paper integrates experimental mechanics approaches and presents an event-driven method for reconstructing the dynamic trajectories of fragments and measuring their mechanical parameters. As a novel brain-inspired visual sensor, event cameras offer microsecond-level temporal resolution and high dynamic range lighting change perception, overcoming the difficulty of accurately measuring high-speed targets under strong flash interference. The method constructs a multi-event-camera vision system, adopting three geometric constraints: time-correlated epipolar constraint to find potential matching event point pairs, and trifocal tensor line constraint plus local homography constraint to eliminate mismatches. A comprehensive probability model is established, with entropy weight method determining the weight of each constraint’s probability to quantitatively filter mismatches. 3D trajectory reconstruction is achieved via spatial line-line intersection and nonlinear optimization. Finally, the velocity and kinetic energy of the fragments are calculated based on the reconstructed trajectory. This method provides reliable technical support for the mechanical damage evaluation of warhead fragment fields and the tactical protection design.
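The abstract names the entropy weight method but does not spell it out. The sketch below shows the standard textbook form of that method, assuming a table of non-negative scores with one row per candidate match and one column per geometric constraint; the function name `entropy_weights` and the toy numbers are hypothetical, and how the paper turns the weights into a final mismatch filter is not shown here.

```python
import math

def entropy_weights(scores):
    """Standard entropy weight method.

    scores: list of rows (candidates), each a list of non-negative values
    (one per constraint). Needs at least two rows and a positive sum per column.
    Returns one weight per column; weights sum to 1.
    """
    n = len(scores)
    m = len(scores[0])
    divergence = []
    for j in range(m):
        column = [row[j] for row in scores]
        total = sum(column)
        p = [v / total for v in column]
        # Shannon entropy normalised to [0, 1]; 0 * log(0) is taken as 0.
        entropy = -sum(pi * math.log(pi) for pi in p if pi > 0) / math.log(n)
        divergence.append(1.0 - entropy)  # more spread => more informative
    s = sum(divergence)
    if s == 0:
        # Every column is perfectly uniform: fall back to equal weights.
        return [1.0 / m] * m
    return [d / s for d in divergence]

# Toy example: constraint 0 cannot tell the candidates apart, constraint 1 can.
toy_scores = [[0.5, 0.9],
              [0.5, 0.1]]
weights = entropy_weights(toy_scores)
```

A constraint whose probabilities barely differ across candidates carries little information and gets a small weight, which is the usual motivation for using entropy weighting to combine heterogeneous constraints.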


[179] EgoTactile: Learning Grasp Pressure for Everyday Objects from Egocentric Video cs.CV | cs.AIPDF

Yuan Zeng, Yujia Shi, Tiao Tan, Xingting Li, Yaqi Qin

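TL;DR: This paper introduces EgoTactile, a benchmark dataset for estimating full-hand grasp pressure from egocentric video, along with EgoPressureFormer as a discriminative baseline and EgoPressureDiff, a conditional diffusion model that addresses the uncertainty of partial observations. By combining a large-scale pre-trained video diffusion backbone with a Physically-Informed Feature Rectification layer, the method infers plausible contact patterns and resolves visual-physical ambiguities, performing strongly both on the benchmark and in real-world scenes.

Details

Motivation: Estimating full-hand grasp pressure from egocentric video is critical for VR and robotic manipulation, but existing vision-based methods mainly rely on planar surfaces or fingertip contacts and generalize poorly to complex 3D object interactions, while dense tactile sensing usually requires intrusive hardware.

Result: On the proposed EgoTactile benchmark, the method achieves superior performance and shows robust transferability to in-the-wild scenarios.

Insight: The novelty lies in building what is described as the first benchmark pairing egocentric video with full-hand pressure supervision, and in a conditional diffusion framework that combines world-knowledge priors from large-scale pre-training with a Physically-Informed Feature Rectification layer to explicitly handle partial-observation uncertainty and resolve visual-physical ambiguities.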

Abstract: Estimating full-hand grasp pressure from egocentric video is critical for immersive VR and robotic manipulation, yet dense tactile sensing often relies on intrusive hardware. Existing vision-based methods predominantly rely on planar surfaces or fingertip contacts, failing to generalize to complex 3D object interactions. Therefore, we introduce EgoTactile, a benchmark pairing egocentric video with full-hand pressure supervision for diverse everyday objects, incorporating a bare-hand transfer subset to enable generalization to natural scenarios. Leveraging this benchmark, we first establish EgoPressureFormer as a discriminative baseline. Beyond this, to explicitly address the uncertainty in partial observations, we propose EgoPressureDiff, a conditional diffusion framework that adapts a large-scale pre-trained video diffusion backbone. By combining rich world knowledge priors with a Physically-Informed Feature Rectification layer to inject semantic constraints, our approach effectively infers plausible contact patterns and resolves visual-physical ambiguities. Extensive experiments demonstrate that our method achieves superior performance on the benchmark and robust transferability to in-the-wild scenarios. Our project page is available at https://egotactile.github.io/.


[180] Semi-supervised Source Detection in Astronomical Images: New Benchmark and Strong Baseline cs.CV | astro-ph.IMPDF

Longhan Feng, Zihuang Cao, Ali Luo, Yuanhao Guo, Shuilian Yao

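TL;DR: To address the difficulty of detecting dense, tiny, low signal-to-noise sources in astronomical images, this paper introduces a new semi-supervised detection benchmark, LAMOST-DET, and a semi-supervised learning framework named Nova Teacher. The framework integrates a source light enhancement module, confidence-guided pseudo-supervision, and cross-view complementary mining in a dual-teacher paradigm, detecting dense sources effectively from sparse annotations.

Details

Motivation: Astronomical images are characterized by high source density, point-spread-function effects, and low signal-to-noise ratios, which challenge current advanced object detectors. Annotating dense, small, faint sources is also extremely difficult, making fully supervised methods impractical.

Result: On the proposed LAMOST-DET benchmark, Nova Teacher consistently outperforms the previous best methods by 4.04% and 5.22% mAP under two semi-supervised settings. It is also competitive with other detectors on a natural image dataset, supporting its generalization ability.

Insight: The novelty lies in building the large-scale astronomical detection benchmark LAMOST-DET and in proposing Nova Teacher, a dual-teacher semi-supervised framework that integrates source light enhancement, confidence-guided pseudo-supervision, and cross-view complementary mining, offering an effective approach to dense small-object detection when annotations are scarce.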

Abstract: Source detection in modern observational astronomy is a cornerstone for localizing and identifying stellar sources accurately. It is crucial for studies such as stellar population synthesis and cosmological parameter estimation. However, the characteristics of astronomical images, including high density, the effect of point spread functions, and low signal-to-noise ratios, significantly challenge the latest advanced object detectors. Moreover, fully supervised detection methods are hardly practical because of the significant difficulty of annotating dense, small, and faint sources in astronomical images. To tackle the scarcity of astronomical datasets, we introduce a new comprehensive benchmark (LAMOST-DET), comprising 18,400 astronomical images and 728,898 source instances. Building on this dataset, we further devise a novel semi-supervised learning framework, coined Nova Teacher, capable of detecting dense sources effectively given sparse annotations. It integrates a source light enhancement module, confidence-guided pseudo-supervision, and cross-view complementary mining in a dual-teacher paradigm. Extensive experiments on LAMOST-DET show that Nova Teacher consistently outperforms previous competitors by 4.04% and 5.22% mAP under two semi-supervised settings. Additionally, our method is competitive with other detectors on a natural image dataset, validating its generalization ability to various scenarios. The source code is available at https://github.com/AcWiz/NovaTeacher.


[181] Temporal-Aware Reasoning Optimization for Video Temporal Grounding cs.CVPDF

Minghang Zheng, Zihao Yin, Yi Yang, Yuxin Peng, Yang Liu

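TL;DR: This paper proposes TaRO (Temporal-Aware Reasoning Optimization), a framework that addresses the shallow reasoning and imprecise localization of multimodal large language models in video temporal grounding. It introduces constructive reasoning exploration based on dense captions and a temporal-sensitivity reward that measures reasoning quality, and adopts a progressive curriculum, improving the model's ability to reason about time.

Details

Motivation: Existing reinforcement-learning-based MLLMs for video temporal grounding explore inefficiently and use reward functions that consider only answer correctness while ignoring reasoning quality. As a result, the generated reasoning paths are superficial and give little guidance for precise temporal localization.

Result: Experiments show that TaRO achieves state-of-the-art performance on video temporal grounding benchmarks.

Insight: The core innovations are: 1. using pre-generated dense captions to construct reasoning paths grounded in explicit visual cues and timestamps, enabling efficient exploration of high-quality time-aware reasoning; 2. a new temporal-sensitivity reward that quantifies reasoning quality by how much the confidence of a reasoning path drops when event boundaries are perturbed, guiding the model toward more reliable reasoning. Explicitly anchoring reasoning to concrete temporal events and evaluating it on that basis is an idea worth borrowing.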

Abstract: Multi-modal Large Language Models (MLLMs) have achieved remarkable progress in video temporal grounding with reinforcement learning for generating reasoning paths. However, existing models often produce superficial reasoning, which offers limited guidance for precise temporal localization. This limitation stems from (1) inefficient random exploration and (2) reward functions that focus solely on the answer correctness while ignoring reasoning quality. To address these issues, we propose TaRO (Temporal-Aware Reasoning Optimization), a framework that explicitly enhances the model’s ability of thinking with time. First, we introduce a Constructive Reasoning Exploration that leverages pre-generated dense captions to construct reasoning paths grounded in explicit visual cues and timestamps, enabling efficient exploration of high-quality time-aware reasoning. Second, to evaluate reasoning quality, we design a Temporal-Sensitivity Reward. High-quality reasoning should be anchored to specific events and timestamps. If the event boundary under thinking is disrupted, such reasoning should become invalid, leading to a drop in the logit of the reasoning path. We utilize this drop as a critique of reasoning quality. Finally, TaRO follows a progressive curriculum, which starts by utilizing this reward to select better constructed reasoning paths, and evolves to a free exploration phase where the model autonomously generates effective reasoning. Experiments demonstrate that TaRO achieves state-of-the-art performance on VTG benchmarks. Code is available at https://github.com/oceanflowlab/TaRO.


[182] MAGIS: Evidence-Based Multi-Agent Reasoning for Interpretable Strabismus Clinical Decision-Making cs.CVPDF

Xikai Tang, Yifan Wang, Jiafan Zhuang, Li Luo, Jinming Guo

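TL;DR: This paper proposes MAGIS, an evidence-based multi-agent reasoning framework for interpretable strabismus clinical decision-making. It turns end-to-end generation into a structured diagnostic process comprising candidate hypothesis generation, dual-evidence constrained context, evidence-based corrective verification, and report generation. The aim is to address the lack of transparent reasoning in existing deep learning methods and the tendency of large vision-language models to hallucinate in evidence-sensitive medical tasks.

Details

Motivation: Strabismus is a common ocular disorder that requires fine-grained subtype diagnosis to plan individualized treatment. Existing deep learning methods provide only diagnostic predictions without transparent reasoning, while recent large vision-language models, though promising for image understanding and report generation, are highly prone to hallucination in this evidence-sensitive, rule-driven medical task.

Result: On a fine-grained strabismus benchmark, MAGIS significantly outperforms other state-of-the-art diagnostic systems, raising the weighted F1 score from 72.0% to 91.3%, and substantially improves the clinical reliability (consistency, alignment, and completeness) of the generated diagnostic reports.

Insight: The novelty lies in two mechanisms. The Dual-Evidence Constrained Context mechanism jointly organizes visual evidence from the nine-gaze-position photograph and evidence-based clinical diagnostic rules into a constrained context. The Evidence-Based Corrective Verification mechanism checks the diagnostic hypothesis against visual evidence, heatmap-based visual cues, and clinical diagnostic rules, and triggers hypothesis refinement when an inconsistency is detected. Together they yield an accurate, evidence-based, and clinically interpretable diagnostic system.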

Abstract: Strabismus is a common ocular disorder that requires fine-grained subtype diagnosis for individualized treatment planning. However, existing deep learning methods mainly provide diagnostic predictions without transparent reasoning, while recent large vision-language models (LVLMs), although promising for joint image understanding and report generation, remain highly prone to hallucination in this evidence-sensitive and rule-driven medical task. To address these challenges, we propose MAGIS, an evidence-based Multi-AGent reasoning framework for Interpretable Strabismus diagnosis. MAGIS transforms black-box end-to-end generation into a structured diagnostic process consisting of candidate hypothesis generation, dual-evidence constrained context, evidence-based corrective verification, and report generation. Specifically, we introduce a Dual-Evidence Constrained Context (DECC) mechanism that jointly organizes visual evidence from the photograph of the nine cardinal positions of gaze and evidence-based clinical diagnostic rules into a constrained context for reliable diagnostic reasoning. We further develop an Evidence-Based Corrective Verification (EBCV) mechanism that verifies whether the current diagnostic hypothesis is supported by visual evidence, heatmap-based visual cues, and evidence-based clinical diagnostic rules. Hypothesis refinement is triggered when inconsistency is detected. Experiments on a fine-grained strabismus benchmark demonstrate that MAGIS not only significantly outperforms other state-of-the-art diagnostic systems, improving the weighted F1 score from 72.0% to 91.3%, but also substantially improves the clinical reliability (consistency, alignment, and completeness) of generated diagnostic reports. These results demonstrate that MAGIS provides an effective solution for building accurate, evidence-based, and clinically interpretable strabismus diagnosis systems.


[183] LiteVSR: Lightweight Adaptation of Frozen Diffusion Transformers for Video Super-Resolution cs.CVPDF

Yu Cao, Ziquan Liu, Zhensong Zhang, Jiankang Deng, Shaogang Gong

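TL;DR: This paper proposes LiteVSR, a lightweight adaptation framework for video super-resolution (VSR). Building on flow matching, it uses a lightweight State-Aware Adapter with a completely frozen Diffusion Transformer backbone to adapt efficiently to cross-domain VSR. With few trainable parameters and a short training time, it reaches competitive restoration quality while retaining fast sampling.

Details

Motivation: Adapting large-scale pre-trained video generators to VSR in new domains is computationally prohibitive. Existing methods either require extensive fine-tuning or become inefficient under Diffusion Transformer architectures, which lack an encoder-decoder hierarchy.

Result: With only 11.25% trainable parameters and 12 hours of training on a single A100 GPU, LiteVSR reaches competitive restoration quality. It remains compatible with fast sampling, down to a single step.

Insight: The core innovation is using flow matching to reduce cross-domain VSR adaptation to learning a fixed injection pattern rather than time-varying transformations. The State-Aware Adapter uses a dual-stream architecture that extracts static structural cues from the low-quality input and dynamic cues from intermediate denoising states, aligning them through time-dependent cross-attention for an adaptive transition from structural alignment to texture refinement. The result is a highly parameter- and compute-efficient adaptation paradigm.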

Abstract: Adapting large-scale pre-trained video generators for Video Super-Resolution (VSR) in novel domains remains computationally prohibitive. Methods that reformulate generation as direct Low-Quality (LQ) to High-Quality mappings deviate from the original generative formulation, demanding extensive fine-tuning. ControlNet-style adapters lose their efficiency under modern Diffusion Transformers, since the absence of an encoder-decoder hierarchy forces duplication of the entire backbone. We observe that flow matching offers a principled alternative for cross-domain VSR adaptation. By predicting a constant velocity field across all timesteps, the adaptation task reduces to learning a fixed injection pattern rather than time-varying transformations. Building on this insight, we propose LiteVSR, a minimalist framework that performs VSR using a completely frozen Diffusion Transformer with a lightweight State-Aware Adapter. The adapter employs a dual-stream architecture that extracts static structural cues from the LQ input and dynamic cues from intermediate denoising states, aligning them through time-dependent cross-attention to enable adaptive transition from structural alignment to texture refinement as denoising proceeds. LiteVSR achieves competitive restoration quality with only 11.25% trainable parameters and 12 GPU-hours of training on a single A100, while maintaining compatibility with fast sampling (down to a single step).
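The remark about a constant velocity field can be made concrete with a generic flow-matching toy example; it is not LiteVSR's adapter. It assumes the common straight-line interpolation $x_t = (1-t)\,x_0 + t\,x_1$, whose velocity $x_1 - x_0$ does not depend on $t$, so Euler integration reaches the same endpoint with one step or many. The function name `euler_sample` and the three-element vectors are hypothetical.

```python
def euler_sample(x0, velocity, steps):
    """Integrate dx/dt = velocity from t = 0 to t = 1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for _ in range(steps):
        # With a time-independent velocity, every step moves in the same direction.
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x

# Toy "noise" and "target" vectors standing in for latent video tensors.
noise = [0.0, 1.0, -2.0]
target = [1.0, 3.0, 2.0]

# Under straight-line interpolation the ideal velocity is constant: x1 - x0.
ideal_velocity = [t - n for n, t in zip(noise, target)]

one_step = euler_sample(noise, ideal_velocity, steps=1)
eight_steps = euler_sample(noise, ideal_velocity, steps=8)
```

In practice a network only approximates this velocity, so step count still matters to some degree; the sketch shows the idealized case that makes single-step sampling plausible.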


[184] Self-supervised Learning Matters: A Simple Ensemble Solution for Micro-Gesture Recognition cs.CVPDF

Tingyi Liu, Kun Li, Fei Wang, Junjie Chen, Zhiliang Wu

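TL;DR: This paper presents a multimodal ensemble framework for micro-gesture recognition that combines a self-supervised RGB model with supervised multi-stream models. It ranked first in the micro-gesture classification track of the IJCAI 2026 MiGA Challenge and set a new state of the art. The self-supervised RGB model is pretrained with masked video modeling on 120K unlabeled video clips and then fine-tuned on iMiGUE, and the final ensemble markedly improves recognition accuracy.

Details

Motivation: The paper addresses how to make effective use of large amounts of unlabeled in-domain video to learn transferable representations that improve micro-gesture recognition.

Result: On the iMiGUE test set, the self-supervised RGB baseline reaches 69.224% top-1 accuracy. The final ensemble reaches 74.419% top-1 accuracy, 1.206 percentage points above the previous SOTA.

Insight: The novelty is adding a self-supervised pretrained RGB model as a complementary branch to existing frameworks, showing that transferable representations learned from unlabeled in-domain video help micro-gesture recognition. The ensemble strategy is simple and effective, and offers a useful reference for multimodal fusion.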

Abstract: In this paper, we present XInsight Lab’s solution to the micro-gesture classification track of the 4th MiGA Challenge at IJCAI 2026, in which our solution ranked first and achieved a new state-of-the-art result. We propose a multimodal ensemble framework that integrates a self-supervised RGB-based model with supervised multi-stream models from previous solutions. The self-supervised RGB model is pretrained on 120K unlabeled clips via masked video modeling and then fine-tuned on iMiGUE. This simple yet effective RGB baseline achieves 69.224% top-1 accuracy on the iMiGUE test set, demonstrating the benefit of learning transferable representations from unlabeled in-domain videos. By incorporating this model as a complementary branch, the final ensemble reaches 74.419% top-1 accuracy, surpassing the previous state of the art by 1.206 percentage points. Experimental results on iMiGUE, including ablation studies on the ensemble strategy, validate the effectiveness of self-supervised RGB representation learning for micro-gesture recognition.


[185] Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning cs.CVPDF

Haoran Xu, Hongyu Wang, Yifei Gao, Jiaze Li, Zizhao Tong

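TL;DR: This paper proposes Visual Para-Thinker++, a single-policy multi-agent framework for visual reasoning. It instantiates one shared multimodal large language model policy as agents with different roles (Main, Worker, and Summary Agents), and reduces hallucination and early perceptual commitment through parallel reasoning and trace integration.

Details

Motivation: Visual reasoning requires integrating evidence spread across multiple regions, attributes, and relations, and single-chain reasoning is prone to early perceptual commitment and hallucination. The paper addresses this with a framework of agents collaborating in parallel.

Result: On V*, CountBench, the RefCOCO family, and HallusionBench, Visual Para-Thinker++ outperforms single-trajectory and inference-time parallel baselines, with especially large gains on hallucination-sensitive visual reasoning tasks.

Insight: The novelty is a single-policy, multi-role agent framework that improves robustness through role conditioning, parallel reasoning, and integration of full reasoning traces (rather than simple majority voting). For training, it uses Multi-Agent Capability Injection and Role-Decoupled Multi-Agent Optimization, assigning role-specific rewards to reduce gradient conflict, and it achieves efficient inference through a shared visual prefix and KV cache reuse.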

Abstract: Visual reasoning requires integrating evidence distributed across regions, attributes, and relations, making single-chain reasoning prone to early perceptual commitment and hallucination. We propose Visual Para-Thinker++, a single-policy multi-agent framework in which one shared MLLM policy is instantiated as role-conditioned Main, Worker, and Summary Agents. The Main Agent decomposes the task with fixed allocation patterns; Worker Agents reason in parallel under context isolation; and the Summary Agent reconciles full Worker reasoning traces rather than majority-voting on final labels. The shared policy is trained by Multi-Agent Capability Injection and Role-Decoupled Multi-Agent Optimization, which assign role-specific rewards and advantages to corresponding token segments to reduce gradient conflict among collaborative roles. A native inference engine enables efficient multi-agent rollout through shared visual prefix and KV cache reuse. Across V*, CountBench, the RefCOCO family, and HallusionBench, Visual Para-Thinker++ consistently outperforms single-trajectory and inference-time parallel baselines, with especially strong gains on hallucination-sensitive visual reasoning.


[186] Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning cs.CVPDF

Xinyan Gao, Haoran Hao, Xiangyu Yue

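TL;DR: This paper proposes Rea2Seg, a two-stage framework for image segmentation that requires complex reasoning. It first generates candidate masks from the attention maps of a segmentation multimodal large language model (MLLM), then has an MLLM reason over the question and the candidate masks to score them, and selects the highest-scoring mask as the result. The authors also build a new benchmark, ReasonSeg-SGDR, to comprehensively evaluate perception, grounding, and reasoning.

Details

Motivation: Existing MLLM-based segmentation methods are constrained by limited training data and by the gap between MLLMs and mask generation modules, making it hard to transfer MLLMs' perception and reasoning abilities to complex reasoning-based segmentation.

Result: Experiments on the proposed ReasonSeg-SGDR benchmark and on ReasonSeg demonstrate the effectiveness of the unified mask generation and selection framework.

Insight: The novelty is recasting image segmentation as a two-stage process of candidate discovery followed by discriminative mask selection, and introducing a new benchmark dedicated to evaluating reasoning across multiple dimensions. Additional training data strengthens the MLLM's ability to jointly understand multimodal queries and candidate masks and to score them through reasoning.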

Abstract: The rapid development of pretrained foundation models has enabled more general image segmentation. Multimodal large language models (MLLMs) have been widely explored for image segmentation with complex queries that require high-level reasoning. Despite promising progress, existing methods are often constrained by limited training data and the gap between MLLMs and mask generation modules. To better transfer MLLMs’ perception and reasoning ability to complex reasoning-based segmentation tasks, we propose a two-stage framework Rea2Seg for mask generation and selection. Specifically, the framework first identifies potential regions as candidate masks based on the attention maps of a segmentation MLLM. It then employs an MLLM to reason over the question and candidate masks and assign scores to each mask. The final segmentation result is obtained by reranking the candidates and selecting the highest-scoring mask, reformulating image segmentation as candidate discovery followed by discriminative mask selection. We also notice that a large portion of questions in existing benchmarks focus on commonsense reasoning, and these questions usually do not fully require joint visual observation and reasoning. To address this issue, we introduce a new benchmark called ReasonSeg-SGDR that comprehensively evaluates a model’s perception, grounding, and reasoning abilities across multiple dimensions, including discriminative recognition, spatial reasoning, geometric reasoning, and multi-step reasoning, with fine-grained mask generation. In addition, we collect training data to enhance MLLMs’ ability to jointly understand multimodal queries and candidate masks, and to assign scores through reasoning. Experimental results on the proposed benchmark and ReasonSeg demonstrate the effectiveness of the unified mask generation and selection framework.


[187] IB-HFN: Information Bottleneck-Driven SAR-Optical Fusion Network for High-Fidelity Cloud Removal cs.CVPDF

Haojun Guo, Fan Feng, Ziquan Wang

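TL;DR: This paper proposes IB-HFN, an Information Bottleneck-driven High-Fidelity Network for SAR-assisted optical cloud removal. It preserves modality-specific representations with a dual-stream backbone, introduces a Spatial Information Bottleneck Fusion module to suppress SAR speckle noise, and uses a local-global gating mechanism with a Dirac-initialized skip connection to separate noise suppression from texture preservation.

Details

Motivation: Existing SAR-optical fusion methods usually rely on direct spatial concatenation and pixel-wise supervision, which propagates SAR speckle noise into the optical reconstruction and produces over-smoothed results. The paper aims to overcome these limitations and achieve high-fidelity cloud removal.

Result: Experiments on the SEN12MS-CR dataset under challenging spatio-temporal splits show that IB-HFN outperforms existing methods in structural preservation and spectral fidelity.

Insight: The contributions include: a dual-stream backbone that avoids premature cross-modal contamination; a Spatial Information Bottleneck Fusion module that compresses SAR features through a channel-wise variational information bottleneck to suppress unstructured speckle noise; a local-global gating mechanism with a Dirac-initialized skip connection that decouples noise suppression from texture preservation; and a joint optimization strategy that combines feature-level bottleneck regularization with image-level constraints under a dynamic weighting schedule.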

Abstract: Synthetic aperture radar (SAR)-assisted optical cloud removal aims to recover surface information obscured by clouds in optical remote sensing images by exploiting complementary SAR observations. Existing multimodal fusion methods typically rely on direct spatial concatenation and pixel-wise supervision, which can propagate SAR speckle noise into optical reconstruction and lead to over-smoothed results. To address these limitations, we propose an Information Bottleneck-driven High-Fidelity Network (IB-HFN) for SAR-assisted optical cloud removal. IB-HFN employs a dual-stream backbone to preserve modality-specific representations before deep semantic fusion, thereby mitigating premature cross-modal contamination. At the fusion stage, we introduce a Spatial Information Bottleneck Fusion module that compresses SAR features through a channel-wise variational information bottleneck to suppress unstructured speckle noise. In parallel, a local-global gating mechanism predicts clear-sky regions and routes reliable optical details through a Dirac-initialized skip connection, decoupling noise suppression from texture preservation. We further develop a joint optimization strategy that integrates feature-level bottleneck regularization with image-level constraints on reconstruction accuracy, structural consistency, spectral fidelity, and contrastive sharpness. A dynamic weighting schedule balances these objectives to stabilize training and reduce hazy artifacts. Experiments on the SEN12MS-CR dataset under challenging spatio-temporal splits demonstrate that IB-HFN achieves superior structural preservation and spectral fidelity over existing methods.
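The abstract does not state the form of the feature-level bottleneck regularizer. A common choice for a variational information bottleneck, assumed here purely for illustration, is a diagonal Gaussian posterior pulled toward a standard normal prior, whose KL divergence has the closed form $\tfrac{1}{2}\sum_c\left(\mu_c^2 + \sigma_c^2 - \log\sigma_c^2 - 1\right)$ over channels $c$. The function name `vib_kl` and the example values are hypothetical.

```python
import math

def vib_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over channels.

    mu, log_var: equal-length lists of per-channel means and log-variances.
    Assumption: Gaussian posterior and standard normal prior; the paper's
    actual regularizer may differ.
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0
        for m, lv in zip(mu, log_var)
    )

# A channel that matches the prior costs nothing; shifting the mean adds a penalty.
kl_matched = vib_kl([0.0, 0.0], [0.0, 0.0])
kl_shifted = vib_kl([1.0, -1.0], [0.0, 0.0])
```

Minimizing a term of this kind alongside the reconstruction losses penalizes channels that carry more information than the task needs, which is the general mechanism by which a bottleneck can discard unstructured noise.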


[188] ExDet: Open-Domain Open-Vocabulary Detection with Cross-modal Extrapolation and Rectification cs.CVPDF

Yupeng Zhang, Yuzhong Feng, Ruize Han, Zhiwei Chen, Wei Feng

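TL;DR: This paper proposes ExDet, a lightweight category-domain collaborative generalization framework for open-domain open-vocabulary detection (ODOVD). Text-Guided Extrapolation (TGE) infers category- and domain-aware proxy visual prototypes from text, a Detector-Compatible Rectification (DCR) module rectifies feature representations at inference, and ExRPN recalibrates proposal scores, improving an existing detector's generalization to novel categories and unseen domains at low cost.

Details

Motivation: ODOVD requires a detector to generalize to both novel categories and unseen domains, which makes it harder than open-vocabulary detection. Existing methods typically train the detector and the domain generalization modules from scratch, which makes training expensive.

Result: ExDet achieves state-of-the-art (SOTA) performance on several benchmarks, including OD-LVIS, OV-LVIS, Objects365, and MSOSB.

Insight: The novelty lies in exploiting the DeltaSpace property of vision-language models (VLMs) for lightweight text-guided extrapolation, a DCR module that is learned without detector training or real data and rectifies features, and ExRPN, which combines semantic similarity with RPN confidence. Together they enable low-cost, efficient joint generalization across categories and domains.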

Abstract: Open-domain open-vocabulary detection (ODOVD) requires detectors to generalize to both novel categories and unseen domains, making it more challenging than open-vocabulary detection. Existing methods typically train open-vocabulary detectors together with domain generalization modules from scratch, leading to high training cost. We propose ExDet, a lightweight category-domain collaborative generalization framework for ODOVD that enhances the cross-category and cross-domain generalization of existing detectors. ExDet consists of Text-Guided Extrapolation (TGE), a lightweight Detector-Compatible Rectification (DCR) module, and ExRPN. Specifically, TGE exploits the DeltaSpace property of vision-language models (VLMs) to infer category- and domain-aware proxy visual prototypes from text. DCR is learned from the TGE-generated prototypes in a detector training-free and real-data-free manner, and is inserted after the classification head at inference to rectify representations toward a detector-compatible source-domain visual distribution, thereby enhancing classification for targets from novel categories and unseen domains. ExRPN recalibrates proposal scores by combining semantic similarity with RPN confidence, improving recall for novel and domain-shifted objects while providing better support for subsequent classification and DCR. ExDet achieves SOTA performance on OD-LVIS, OV-LVIS, Objects365, and MSOSB.
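The idea of recalibrating proposal scores from semantic similarity and RPN confidence can be sketched as follows. The abstract does not state how ExRPN combines the two signals, so the weighted geometric mean, the `alpha` parameter, and all function names here are assumptions made purely for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def recalibrate(rpn_conf, region_feat, text_protos, alpha=0.5):
    """Blend RPN confidence with the best text-prototype similarity.

    Hypothetical rule: map the best cosine similarity from [-1, 1] to
    [0, 1], then take a weighted geometric mean with the RPN confidence.
    """
    sim = max(cosine(region_feat, p) for p in text_protos)
    sim01 = 0.5 * (sim + 1.0)
    return (rpn_conf ** (1.0 - alpha)) * (sim01 ** alpha)
```

Under this rule, a proposal with modest RPN confidence but strong agreement with a category prototype is ranked above an equally confident proposal that matches no prototype, which is the recall effect the abstract describes for novel objects.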


[189] Beyond Humans: Multispecies Animal Face Recognition Using Transfer Learning cs.CV | cs.AIPDF

Maria De Marsico, Anil K. Jain, Annalaura Miglino

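TL;DR: This paper studies multispecies animal face recognition via transfer learning as a way to identify individual animals. It assesses the feasibility of transfer learning by comparing FaceNet, pre-trained on human face datasets, with a Vision Transformer (ViT) pre-trained on ImageNet, on face datasets of three kinds of animals: dogs, primates (lemurs, golden monkeys, and chimpanzees), and cattle.

Details

Motivation: Individual animal recognition matters for finding lost pets, tracking endangered species, and managing crowded farms, but existing techniques such as microchips are often impractical and difficult to apply. Animal face recognition is a non-invasive alternative that works at a distance and is difficult to counterfeit, but there are no annotated datasets large enough to train deep learning models.

Result: On the dog dataset, ViT performed best, with a mean verification accuracy of 96.85% and a Rank-1 identification rate of 84.34%. For cattle, the ViT results outperform the state of the art (SOTA), while FaceNet remains competitive. For endangered primates, the results are encouraging, but performance varies across animal classes and tasks (verification or identification) and does not always exceed SOTA.

Insight: The paper applies transfer learning to multispecies animal face recognition and systematically compares a model optimized for human faces (FaceNet) with a general-purpose vision model (ViT) on different animal datasets. Viewed objectively, this exposes both the potential and the limits of pre-trained models on data-scarce cross-species recognition tasks and offers a useful reference for practical applications.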

Abstract: Individual animal recognition can be useful in the search for lost or stolen pets, the tracking of individuals of endangered species, and the recognition of animals in crowded farms. Present recognition techniques mostly use physical devices, e.g., microchips, often impractical and difficult to apply. These could be replaced by remote recognition via the animal’s face; if accurate enough, it provides several advantages: it is non-invasive, can work at a distance, and is difficult to counterfeit, as, for instance, in the case of substituting sick animals for healthy ones in the food industry. The few existing datasets with sufficient per-subject images annotated with a single animal identity are not large enough to train current deep learning architectures. We rather investigate the possibility of transfer learning, exploiting pre-trained network models as backbones. Our experiments compared FaceNet, which is specifically trained on large databases of human faces, with the Vision Transformer (ViT) pre-trained on ImageNet, i.e., on object categories. We used three face datasets of very different animals: dogs, primates (lemurs, golden monkeys, and chimpanzees), and cattle. We report the results and, for each dataset, compare them with the state of the art (SOTA) ad hoc-trained deep networks. The capture conditions differ among the three datasets. Image quality (resolution, motion blur, diverse poses, etc.) decreases from dogs to cattle to primates. The best performance was achieved with dogs, where ViT reached a mean verification accuracy of 96.85% and a Rank-1 Identification Rate of 84.34%. The results for endangered primates are still encouraging, but performance varies across animal classes and tasks (verification or identification), and does not always outperform SOTA. For cattle, the ViT results outperform SOTA, while FaceNet is still competitive.
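The Rank-1 identification rate reported above is the fraction of probe images whose nearest gallery embedding belongs to the same individual. The sketch below shows that metric on toy embeddings; the cosine similarity measure and the data layout are assumptions, since the abstract does not describe the paper's evaluation code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank1_rate(probes, gallery):
    """probes and gallery are lists of (identity, embedding) pairs.

    A probe counts as a hit when its most similar gallery entry has the
    same identity.
    """
    hits = 0
    for pid, pemb in probes:
        best_id, _ = max(gallery, key=lambda g: cosine(pemb, g[1]))
        hits += int(best_id == pid)
    return hits / len(probes)
```

For example, with a two-animal gallery, a probe set in which two of three probes land closest to their own identity gives a Rank-1 rate of 2/3.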


[190] PhysScene: A Scene Graph Dataset for Scientific Visual Reasoning in Physics Experiments cs.CV | cs.AIPDF

Minghao Zou, Qingtian Zeng, Shangkun Liu, Yanda Meng, Guanghui Yue

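TL;DR: This paper presents PhysScene, the first scene graph dataset built for physics experiment scenes, which addresses the lack of relational reasoning evaluation in scientific experimental settings in existing datasets. The dataset focuses on specialized instruments, structured experimental setups, and functional relations in physics experiment environments, and emphasizes strong semantic constraints and high relation density rather than large data scale.

Details

Motivation: Existing scene graph datasets focus mainly on generic natural scenes and lack coverage of domain-specific, function-oriented scenes. This limits the evaluation of relational reasoning in scientific experimental scenes and hinders the development of intelligent monitoring, analysis, and related applications.

Result: Extensive experiments and analyses show that PhysScene complements existing benchmarks and establishes a valuable testbed for advancing scientific visual reasoning.

Insight: The novelty is the first scene graph dataset for physics experiments. At its core it introduces functional relations specific to experimental environments, extending reasoning from spatial co-occurrence to logical dependencies and posing new challenges for existing scene parsing algorithms. Viewed objectively, by emphasizing strong semantic constraints over data scale, the dataset opens a new research direction for domain-specific structured visual understanding.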

Abstract: Scene Graphs (SGs) provide structured representations of visual scenes by modeling objects and their pairwise relationships. Despite recent progress, existing datasets primarily focus on generic natural contexts, leaving domain-specific and function-oriented scenes largely underexplored. This limitation restricts the evaluation of relational reasoning in scientific experimental scenes, thereby hindering the development of intelligent monitoring, analysis, and related applications in such scenes. To address this gap, we introduce PhysScene, the first SG dataset tailored to physics experiments. PhysScene encompasses specialized instruments, structured experimental setups, and functional relations intrinsic to experimental environments, enabling reasoning that extends beyond spatial co-occurrence to logical dependencies. Rather than pursuing large data scale, PhysScene focuses on strong semantic constraints and high relation density in experimental scenes, posing new challenges for existing scene parsing algorithms while offering opportunities for further improvements. Extensive analyses and experiments show that PhysScene complements existing benchmarks and establishes a valuable testbed for advancing scientific visual reasoning. The dataset is publicly available at https://github.com/ZMH-SDUST/PhysScene.


[191] Zero-Shot Semantic Re-Identification for Autonomous Driving: A VLM Baseline Study cs.CV | cs.LGPDF

Eduardo Borges, Manuel Abreu, Luís Garrote, Urbano J. Nunes

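TL;DR: This paper presents a baseline study of zero-shot semantic re-identification based on vision-language models (VLMs) for autonomous driving. A VLM generates structured textual descriptions (category, color, shape, and so on) of detected traffic participants, and identities are matched across observations using these semantic descriptions rather than low-level visual features. The study evaluates the performance and limitations of this zero-shot approach and provides an initial benchmark for language-based re-identification.

Details

Motivation: Conventional re-identification in autonomous driving relies on learned visual embeddings, which are sensitive to viewpoint, occlusion, illumination, and sensor-domain variations, limiting interpretability and robustness in complex driving scenes. The paper explores whether VLM-generated semantic descriptions can support identity matching and provide a more interpretable and robust representation.

Result: Zero-shot semantic descriptions can support effective object re-identification, with retrieval performance comparable to a supervised CNN baseline and greater interpretability through explicit identity cues. The study benchmarks re-identification in autonomous-driving scenarios.

Insight: The novelty is reformulating re-identification as matching over structured semantic attributes rather than comparing purely low-level visual similarity, which makes the task more interpretable. Viewed objectively, the study shows the potential of zero-shot semantic description with VLMs while also identifying key challenges: attribute inconsistency across viewpoints and limited fine-grained discrimination between visually similar instances.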

Abstract: Re-Identification (ReID) in autonomous driving is typically formulated as a visual matching problem, where observations of vehicles, pedestrians, and cyclists are associated across time, frames, or camera views using learned appearance embeddings, often complemented by motion, geometric, or multimodal cues. However, purely visual representations may be sensitive to viewpoint, occlusion, illumination, and sensor-domain variations, limiting their interpretability and robustness in complex driving scenes. We propose a baseline study of a zero-shot pipeline using Vision-Language Models (VLMs) to generate textual descriptions of detected traffic participants and evaluate whether these descriptions can support identity matching across observations. Instead of relying only on low-level visual similarity, the proposed formulation represents each object through structured semantic attributes, including category, color, shape, pose, visible parts, spatial context, and distinctive visual cues. This study provides an initial benchmark for language-based re-identification in autonomous-driving scenarios, discussing and evaluating the strengths and limitations of current VLMs for this task. Results demonstrate that zero-shot semantic descriptions can support effective object re-identification, achieving retrieval performance comparable to a supervised CNN baseline while offering greater interpretability through explicit identity cues. However, the experiments also reveal important challenges, including attribute inconsistency across viewpoints and limited fine-grained discrimination between visually similar instances.
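Matching over structured semantic attributes, as opposed to visual embeddings, can be sketched with a simple attribute-agreement score. The attribute keys, the weighting scheme, and the function names are hypothetical; the abstract lists the attribute types but not the matching rule the study uses.

```python
def attribute_similarity(desc_a, desc_b, weights=None):
    """Weighted fraction of shared attributes whose values agree.

    desc_a and desc_b are dicts such as {"category": "car", "color": "red"}.
    Attributes missing from either description are ignored (an assumption).
    """
    shared = set(desc_a) & set(desc_b)
    if not shared:
        return 0.0
    weights = weights or {}
    total = sum(weights.get(k, 1.0) for k in shared)
    agree = sum(weights.get(k, 1.0) for k in shared if desc_a[k] == desc_b[k])
    return agree / total


def best_match(query, gallery):
    """Return the gallery identity whose description best matches the query.

    gallery is a list of (identity, description) pairs.
    """
    return max(gallery, key=lambda g: attribute_similarity(query, g[1]))[0]
```

Because every matched or mismatched attribute is visible, a decision can be explained in terms such as "same category and color, different pose", which is the interpretability benefit the abstract points to. The same sketch also shows the stated weakness: two visually similar red cars produce identical descriptions and cannot be told apart.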


[192] CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning cs.CVPDF

Penghui Yang, Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao

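TL;DR: This paper proposes CapRL++, a reference-free reinforcement learning training framework for dense image and video captioning. It defines caption quality through a verifiable reward: a high-quality caption should let a non-visual language model accurately answer multiple-choice questions about the corresponding visual content from the caption alone. Experiments show that the method improves caption quality on more than 20 benchmarks and strengthens caption-based pretraining, and that compact models trained with it match much larger models on dense captioning.

Details

Motivation: Current state-of-the-art captioning models typically rely on supervised fine-tuning, which needs expensive, hard-to-scale human annotations and tends to make models memorize specific ground-truth answers, limiting their generalization and their ability to generate diverse, creative descriptions. To address these limitations, the authors apply reinforcement learning to open-ended multimodal captioning.

Result: Evaluations on more than 20 image and video benchmarks show that CapRL++ improves dense caption quality and strengthens caption-based pretraining on tasks such as spatial and temporal understanding. Within the Prism caption quality evaluation framework, compact models trained with CapRL++ reach dense captioning performance comparable to large models such as Qwen2.5-VL-72B and Qwen3-VL-235B-A22B.

Insight: The core novelty is a reference-free, verifiable reward that redefines caption quality as utility (whether a text-only language model can accurately answer questions about the visual content), removing the dependence on human-annotated reference captions. The decoupled two-stage pipeline (a large vision-language model generates the caption and a text-only language model evaluates it) is also notable and suggests a new way to train more general and creative captioning models.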

Abstract: Image and video captioning are fundamental tasks that bridge the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable annotations and often causes models to memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome these limitations, we propose applying Reinforcement Learning with Verifiable Rewards (RLVR) to the open-ended task of multimodal captioning. We introduce Captioning Reinforcement Learning++ (CapRL++), a novel reference-free training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding visual content. CapRL++ employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. Evaluations on more than 20 image and video benchmarks show that CapRL++ improves dense caption quality and strengthens caption-based pretraining across tasks such as spatial and temporal understanding. Pretraining on scalable image and video caption datasets annotated by CapRL++ yields substantial downstream gains. Furthermore, within the Prism Framework for caption quality evaluation, compact models trained with CapRL++ achieve dense captioning performance comparable to substantially larger models such as Qwen2.5-VL-72B and Qwen3-VL-235B-A22B. These results validate that CapRL++ effectively trains models to produce generalizable, high-fidelity descriptions, establishing a robust foundation beyond the limitations of traditional SFT.
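The reward described in the abstract, the accuracy of a vision-free model answering multiple-choice questions from the caption alone, can be sketched as below. The question format and the `keyword_answerer` stand-in are hypothetical; in the paper the answerer is a separate LLM, which is not reproduced here.

```python
def caption_reward(caption, questions, answer_fn):
    """Fraction of multiple-choice questions answered correctly.

    questions: list of dicts with "question", "options" (label -> text)
    and "answer" (the correct label). answer_fn sees only the caption,
    never the image.
    """
    if not questions:
        return 0.0
    correct = sum(
        1 for q in questions
        if answer_fn(caption, q["question"], q["options"]) == q["answer"]
    )
    return correct / len(questions)


def keyword_answerer(caption, question, options):
    """Toy stand-in for the vision-free LLM: pick the first option whose
    text appears in the caption, or None if no option appears."""
    text = caption.lower()
    for label, option in options.items():
        if option.lower() in text:
            return label
    return None
```

A caption that mentions the queried details earns a high reward and a vague one earns a low reward, without any human-written reference caption being involved.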


[193] RT-SDGOD: Real-Time Single-Domain Generalized Object Detection cs.CVPDF

Yupeng Zhang, Fangzhuo Gao, Ruize Han, Wei Feng, Liang Wan

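TL;DR: This paper introduces the RT-SDGOD (Real-Time Single-Domain Generalized Object Detection) task and designs the RT-SDGDet framework to address the performance drop that real-time detectors suffer under domain shift caused by weather and imaging variations when the inference budget is strict. The method relies on training-time representation learning and uses multi-evidence collaborative modeling to improve cross-domain generalization without adding inference overhead.

Details

Motivation: Existing single-domain generalized object detection methods rarely study, at the level of problem formulation, how real-time detectors generalize under a strict inference budget, yet in real deployments domain shift severely degrades detection performance, in particular by increasing missed detections.

Result: Extensive experiments on multiple unseen target domains show that the proposed method generalizes better than existing methods.

Insight: The novelty is focusing on generalization of real-time detectors with zero extra inference overhead, together with the observation that the performance drop of DETR-based detectors stems mainly from limited and unstable object-level discriminative evidence. Based on this, the paper designs a multi-evidence collaborative modeling framework built on stable query groups constructed with O2M supervision, Discriminative Evidence Diversity Learning (DEDL), and Dual-view Evidence Consistency Learning (DvECL); all components are introduced only during training.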

Abstract: In real-world deployment under strict real-time constraints, weather and imaging variations induce significant distribution shifts, severely degrading detectors. Single-Domain Generalized Object Detection aims to mitigate this issue, yet existing methods rarely investigate, at the level of problem formulation, the generalization capability of real-time detectors under such constrained inference budgets. To this end, we introduce Real-Time Single-Domain Generalized Object Detection (RT-SDGOD), which focuses on how real-time detectors can achieve cross-domain generalization under zero extra inference overhead by relying solely on training-time representation learning. We observe that, under domain shift, DETR-based real-time detectors mainly degrade through increased missed detections, rooted in limited and unstable object-level discriminative evidence. Based on this, we propose RT-SDGDet, a multi-evidence collaborative modeling framework for RT-SDGOD. The core idea is to enable multiple queries of the same object to collaboratively cover more sufficient discriminative evidence while maintaining the stability of such evidence modeling across views. Specifically, we use one-to-many (O2M) supervision to construct stable object-specific query groups, and further design Discriminative Evidence Diversity Learning (DEDL) and Dual-view Evidence Consistency Learning (DvECL) to expand object-level evidence coverage and improve evidence stability under appearance perturbations, respectively. Since all components are introduced only during training, our method incurs no extra inference overhead. Extensive experiments show that the proposed method achieves better generalization performance than existing approaches across multiple unseen target domains.


[194] Leveraging Morphology for Historical Script Metrological Analysis cs.CVPDF

Malamatenia Vlachou Efstathiou, Raphaël Baena, Dominique Stutzmann, Mathieu Aubry

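TL;DR: This paper proposes a morphology-based approach to metrological analysis of historical scripts. It uses a transformer-based detection architecture and a prototype-based line reconstruction module to learn character prototypes and their occurrence, deformation, and positioning from line-level transcriptions, yielding scalable, meaningful, and stable visual measurements for paleography.

Details

Motivation: Handwritten text recognition can transcribe historical documents at scale, but it offers little in the way of interpretable visual measurements to support paleographic research.

Result: The authors extend the annotations of the late-fourteenth-century codex Paris, BnF, fr. 2813 to 160 pages and run a case study. The results show that the method not only differentiates the graphical profiles of different scribes but also reveals and supports the analysis of subtle variations. A single column of text is enough to compute the measurements, so data requirements are low.

Insight: The novelty is a deep architecture that models characters efficiently with only line-level transcription supervision, and measurements generated automatically from it for characters, bi-grams, and the spaces between graphical units, giving paleographic analysis a scalable and data-efficient solution.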

Abstract: Advances in handwritten text recognition have enabled large-scale transcription of historical documents, but still provide limited access to interpretable visual measurements for paleography, the study of historical scripts. In this paper, our main insight is that morphological script analysis, in particular the capacity to learn character prototypes from line-level transcriptions, enables the definition of scalable, meaningful, and stable paleographic measurements. More precisely, we leverage a transformer-based detection architecture together with a prototype-based line reconstruction module to learn prototypical characters and their occurrence, deformation, and positioning. Our contributions are twofold. First, we introduce a deep architecture and learning methodology that enables efficient character modeling with only line-level transcription supervision, significantly improving over the Learnable Typewriter baseline and enabling accurate character bounding box prediction, unlocking its potential for paleographic measurements. Second, we introduce and demonstrate the paleographical relevance of automatic measurements enabled by our architecture for characters, bi-grams, and spaces between graphical units. For this demonstration, we extend the annotations of the codex Paris, BnF, fr. 2813, commissioned in the late fourteenth century by Charles V and copied by four hands, to 160 pages. We visualize our measurements over these pages, showing how they enable us not only to differentiate graphical profiles, but also to discover and analyze subtle variations. This case study outlines the scalability of our approach and its frugality in terms of required training data, since a single column of text is sufficient to compute our measurements on each of the 160 pages. Data and code are publicly available at: https://malamatenia.github.io/morphology4metrology-analysis.
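Once character bounding boxes are available, measurements such as the spaces between graphical units reduce to simple geometry. The sketch below computes horizontal gaps and mean character width for one text line; the box format `(x_min, y_min, x_max, y_max)` and both function names are assumptions, not the paper's measurement definitions.

```python
def horizontal_gaps(boxes):
    """Gaps between consecutive boxes on one line, in reading order.

    boxes: list of (x_min, y_min, x_max, y_max). Overlapping boxes are
    reported as a gap of 0.0 rather than a negative value.
    """
    ordered = sorted(boxes, key=lambda b: b[0])
    return [max(0.0, nxt[0] - cur[2]) for cur, nxt in zip(ordered, ordered[1:])]


def mean_width(boxes):
    """Average box width, a crude proxy for character width."""
    return sum(b[2] - b[0] for b in boxes) / len(boxes)
```

Aggregating such values per page or per scribe would give the kind of profile the abstract visualizes across the 160 pages.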


[195] Efficient Minimal Solvers for Visual-Inertial Relative Pose Estimation in Multi-Camera Systems cs.CVPDF

Tao Li, Zhenbao Yu, Banglei Guan, Jianli Han, Weimin Lv

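TL;DR: This paper proposes two efficient minimal solvers for visual-inertial relative pose estimation in multi-camera systems. The first uses the vertical direction prior provided by an inertial measurement unit (IMU), and the second uses the rotation axis direction prior from the IMU. Both need only four point correspondences and reduce the problem to solving a univariate 6th-degree polynomial, which significantly lowers computational complexity.

Details

Motivation: Existing methods for multi-camera relative pose estimation are computationally expensive or need too many point correspondences. The paper aims to improve applicability in practical settings such as autonomous driving, mobile devices, and UAVs.

Result: In rigorous evaluations on synthetic data and the KITTI benchmark, the methods are more computationally efficient than existing techniques and reach competitive accuracy.

Insight: The novelty is a new parameterization that uses IMU direction priors (vertical direction or rotation axis direction) to reduce the problem to a univariate 6th-degree polynomial, significantly cutting the number of point correspondences and the computational load. This makes the solvers especially suitable for integration into RANSAC frameworks for visual odometry.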

Abstract: Estimating the relative poses of multi-camera systems is a fundamental problem in computer vision, with critical applications in autonomous vehicles, mobile devices, and unmanned aerial vehicles (UAVs). However, existing solutions often suffer from high computational complexity or rely on an excessive number of point correspondences, limiting their real-world applicability. To address these limitations, we propose two efficient minimal solvers for estimating the relative poses of multi-camera systems using a novel parameterization. The first solver leverages the vertical direction prior provided by Inertial Measurement Units (IMUs), while the second utilizes the rotation axis direction prior from IMUs. Our methods require only four point correspondences and reduce the problem of multi-camera relative pose estimation to solving a univariate 6th-degree polynomial, a significant improvement over existing approaches, which typically involve 8th-degree polynomials. This reduction in computational complexity and correspondence requirements makes our solvers particularly effective when integrated into RANSAC frameworks, demonstrating strong potential for visual odometry applications. Through rigorous evaluations on synthetic data and the KITTI benchmark, our methods achieved superior computational efficiency and competitive accuracy compared to state-of-the-art algorithms.
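Why a smaller minimal sample helps inside RANSAC can be seen from the standard iteration-count formula N = log(1 - p) / log(1 - w^s), where w is the inlier ratio, s the sample size, and p the desired confidence. The sketch below evaluates it for the paper's four-point sample and, for contrast, a hypothetical six-point sample; the inlier ratio and confidence are illustrative values, not figures from the paper.

```python
import math


def ransac_iterations(inlier_ratio, sample_size, confidence=0.99):
    """Iterations needed to draw at least one all-inlier sample with the
    given confidence (standard RANSAC formula)."""
    p_good = inlier_ratio ** sample_size
    if p_good <= 0.0:
        raise ValueError("inlier_ratio must be positive")
    if p_good >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good))


# Illustrative comparison at a 50% inlier ratio (assumed value).
four_point = ransac_iterations(0.5, 4)
six_point = ransac_iterations(0.5, 6)  # hypothetical larger sample
```

At a 50% inlier ratio the four-point sample needs roughly a quarter of the iterations of the six-point one, and each iteration also solves a lower-degree polynomial.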


[196] GD-MIL: Grade-Disentangled Multiple Instance Learning for Multimodal Biochemical Recurrence Prediction in Prostate Cancer cs.CVPDF

Dasari Naga Raju

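TL;DR: The paper proposes GD-MIL (Grade-Disentangled Multiple Instance Learning) for predicting the risk of biochemical recurrence after radical prostatectomy from H&E whole slide images. A rigorously cross-validated benchmark shows that the feature extractor affects performance more than the MIL aggregator and that imaging-only models do not significantly outperform the clinical baseline. GD-MIL uses gradient-reversal adversarial training to disentangle the image representation from Gleason grade, then fuses it with clinical variables, and significantly outperforms both the clinical and imaging-only baselines.

Details

Motivation: Risk stratification for biochemical recurrence in prostate cancer relies heavily on clinical variables such as Gleason grade. The paper asks whether H&E whole slide images carry prognostic information beyond grade and whether multiple instance learning can extract it.

Result: On TCGA-PRAD (487 patients), evaluation uses strict five-fold cross-validation repeated five times. The best imaging-only model reaches a C-index of 0.639 and the clinical baseline 0.687. GD-MIL reaches a C-index of 0.704, significantly better than the clinical baseline (Δc = +0.029, p = 0.0005) and the best imaging-only model (Δc = +0.062, p = 0.039). A median risk split shows a highly significant difference in BCR-free survival (log-rank p < 0.0001).

Insight: The core novelty is a grade-disentangled MIL framework based on gradient-reversal adversarial training, which encourages the image representation to be independent of Gleason grade before fusion with clinical variables and thereby extracts prognostic morphological information complementary to grade. Methodologically, the paper stresses the importance of strict out-of-fold evaluation and shows that the feature extractor (for example, a pathology foundation model), not the choice of aggregator, is the main driver of performance.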

Abstract: Biochemical recurrence (BCR) after radical prostatectomy is a critical endpoint in prostate cancer, yet risk stratification relies almost entirely on variables dominated by Gleason grade. Whether H&E whole slide images (WSIs) carry prognostic signal beyond grade, and whether multiple instance learning (MIL) can recover it, remains unsettled. A key obstacle is that many pipelines select model checkpoints on the evaluation fold, artificially inflating concordance. We construct a rigorous benchmark on TCGA-PRAD (487 patients, 101 BCR events) using strict out-of-fold scoring over five-fold cross-validation repeated across five seeds. The choice of MIL aggregator (ABMIL, CLAM, TransMIL, PatchGCN) has little effect (C-index 0.61-0.64 with UNI2-h), while the feature extractor is the dominant factor (ResNet50 0.566 versus pathology foundation models up to 0.639). A clinical Cox model on grade, stage, and age reaches 0.687; no imaging-only model significantly outperforms it (p > 0.10). We introduce Grade-Disentangled MIL (GD-MIL), a gated-attention MIL encoder trained with a gradient-reversal grade adversary that encourages the slide representation to be invariant to Gleason grade before late fusion with clinical variables. GD-MIL achieves C-index 0.704, significantly outperforming both the clinical baseline (delta-c = +0.029, p = 0.0005) and the best imaging-only model (delta-c = +0.062, p = 0.039), suggesting H&E morphology contains prognostic information complementary to grade. A median risk split yields log-rank p < 0.0001 separation in BCR-free survival (~20% vs ~70% at five years).
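The C-index values quoted above measure how often a model ranks patients correctly by risk. A minimal sketch of Harrell's concordance index follows, assuming higher risk scores mean earlier recurrence and ignoring pairs with tied times; it illustrates the metric only and is not the paper's evaluation code.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data.

    times:  follow-up time per patient
    events: 1 if recurrence was observed, 0 if censored
    risks:  predicted risk score (higher means earlier expected event)

    A pair (i, j) is comparable when patient i had an observed event
    strictly before patient j's follow-up time.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported range of 0.566 to 0.704 in context.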


[197] Prisma-World: Camera-Controllable Multi-Agent Video World Model cs.CVPDF

Huiqiang Sun, Zhan Peng, Size Wu, Kun Wang, Kang Liao

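TL;DR: This paper proposes Prisma-World, a camera-controllable multi-agent video world model that achieves cross-view-consistent generation through a joint geometry-aware denoising process. The model processes all agent videos in one full-attention sequence, uses a multi-agent RoPE design to distinguish identities while keeping temporal coordinates synchronized, and injects relative camera geometry into attention to improve consistency across overlapping views. It also introduces an overlap-decaying curriculum training paradigm and minimap-conditioned structural guidance to strengthen multi-view consistency and global spatial perception.

Details

Motivation: Most existing video world models simulate the world from a single observer. When they are extended to multiple agents, independently generated future states can make overlapping views inconsistent, with conflicting objects, layouts, and appearances. Conventional camera conditioning does not explicitly couple the generation of views that should agree under shared scene geometry.

Result: Experiments show that a single Prisma-World model can generate high-fidelity multi-agent videos, supports flexible agent counts and camera control, and improves cross-view consistency and spatial grounding under minimap guidance.

Insight: The contributions include formulating multi-agent generation as a joint geometry-aware denoising process, a multi-agent RoPE design that synchronizes temporal coordinates and distinguishes identities, and injecting relative camera geometry into attention to bias overlapping viewpoints toward shared scene evidence. The overlap-decaying curriculum and minimap-conditioned structural guidance further strengthen multi-view consistency and global spatial perception, and the large-scale PrismaDataset supports training and evaluation.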

Abstract: Video world models have made rapid progress in generating controllable visual experiences, but most of them still simulate the world from a single observer. Extending such models to multiple agents raises a central challenge: if each agent’s future state is generated independently, overlapping views may instantiate different versions of the same scene, leading to inconsistent objects, layouts, and appearances across agents. Conventional camera conditioning controls individual trajectories, but it does not explicitly couple the generation of views that should agree under shared scene geometry. We introduce Prisma-World, a camera-controllable multi-agent world model that formulates multi-agent generation as a joint geometry-aware denoising process for cross-view consistency. Prisma-World processes all agent videos within one full-attention sequence, uses a multi-agent RoPE design to distinguish agent identities while preserving synchronized temporal coordinates, and injects relative camera geometry into attention to bias overlapping viewpoints toward shared scene evidence. To further strengthen multi-view consistency and enhance global spatial perception, we augment our framework with an overlap-decaying curriculum training paradigm alongside minimap-conditioned structural guidance. To facilitate the training and evaluation of multi-agent models, we introduce PrismaDataset, a large-scale UE5 dataset with panoramic acquisition across diverse scenes, composable multi-agent view groups with flexible agent counts and complex camera trajectories, and precise camera/action annotations for consistency training and evaluation. Experiments show that a single Prisma-World model can generate high-fidelity multi-agent videos with flexible agent numbers, camera controllability, improved cross-view consistency, and spatial grounding under minimap guidance.


[198] SwiftVR: Real-Time One-Step Generative Video Restoration cs.CVPDF

Jiaqi Yan, Xiangyu Chen, Xinlin Zhong, Haibin Huang, Chi Zhang

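TL;DR: SwiftVR is a real-time one-step generative video restoration framework that addresses the difficulty of deploying existing diffusion-based VR models on consumer-grade GPUs. With mask-free shifted-window self-attention and a lightweight Restoration-aware Autoencoder, SwiftVR restores high-resolution video in real time under a causal chunk-wise protocol, sustaining 31 FPS at 1440p and 14 FPS at 4K on a single H100 and achieving the first real-time 1080p streaming on a consumer RTX 5090.

Details

Motivation: For real-time restoration of video streams, existing one-step diffusion models are hard to deploy on consumer-grade GPUs because of quadratic spatial attention at high resolution and the latency-memory overhead of large video autoencoders.

Result: On an H100, SwiftVR reaches 31 FPS at 2560x1440 and 14 FPS at 3840x2160, whereas all compared diffusion baselines exceed the memory limit at 4K. On an RTX 5090, SwiftVR reaches 26 FPS at 1920x1080, which the authors report as the first generative VR model to stream 1080p in real time on a consumer-grade GPU, with strong no-reference perceptual quality and lower inference cost.

Insight: The contributions are: 1) mask-free shifted-window self-attention, which gathers spatial windows into dense tensors via deterministic indexing and uses only standard dense SDPA calls, with no masks, cyclic shifts, padding, or hardware-specific sparse kernels, which improves portability; 2) a lightweight Restoration-aware Autoencoder that supports fast chunk-wise decoding while preserving reconstruction quality. Overall, the framework targets the attention and autoencoding bottlenecks under a causal chunk-wise protocol to enable efficient real-time deployment.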

Abstract: Real-time video restoration (VR) for live streams requires high-resolution outputs under strict per-frame latency constraints. Existing one-step diffusion-based VR models remain difficult to deploy on consumer-grade GPUs due to two main bottlenecks: quadratic spatial attention at high resolutions and the latency-memory overhead of large video autoencoders. We present SwiftVR, a streaming one-step generative VR framework that reduces both bottlenecks under a causal chunk-wise protocol. For attention, mask-free shifted-window self-attention gathers each spatial window into a dense tensor via deterministic indexing, keeping all attention calls on the dense scaled dot-product attention path without masks, cyclic shifts, padding, or hardware-specific sparse kernels. Because SwiftVR uses only standard dense SDPA calls, the trained model transfers to consumer GPUs without retraining or custom kernels. For autoencoding, a lightweight Restoration-aware Autoencoder enables fast chunk-wise decoding while preserving reconstruction quality. On a single H100, SwiftVR sustains 31FPS at 2560x1440 and 14FPS at 3840x2160, whereas all compared diffusion-based VR baselines exceed the memory limit at 4K. On a consumer RTX5090, SwiftVR reaches 26FPS at 1920x1080. To our knowledge, SwiftVR is the first generative VR model to achieve real-time 1080p streaming on a consumer-grade GPU, while attaining strong no-reference perceptual quality with lower inference cost. Project is available at https://h-oliday.github.io/SwiftVR.


[199] A VideoMAE-v2 Approach to Zero-Shot Traffic Accident Anticipation cs.CVPDF

Siyuan Li, Xiaoyang Bi, Mengshi Qi

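TL;DR: This paper proposes a VideoMAE-v2-based framework for zero-shot traffic accident anticipation. It couples a video backbone with a per-frame prediction head under a sliding-window protocol, trains on a publicly available binary-labelled driving-accident dataset, and generalizes to unseen dashcam footage without target-domain training data. The method took 2nd place in the 2026 CVPR@AUTOPILOT Zero-Shot Accident Anticipation competition.

Details

Motivation: Traffic accident anticipation is hard to deploy at scale because collecting annotated accident footage for every deployment scenario is prohibitively expensive. The paper therefore studies a zero-shot setting in which the model learns only from a public binary-labelled dataset and generalizes to unseen domains.

Result: The method took 2nd place in the 2026 CVPR@AUTOPILOT Zero-Shot Accident Anticipation competition, which indicates that it is competitive in the zero-shot setting.

Insight: The novelty is coupling a VideoMAE-v2 backbone with a per-frame prediction head under a sliding-window protocol, which bridges the gap between frame-level temporal risk estimation and coarsely labelled binary datasets and transfers coarse labels to fine-grained predictions.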

Abstract: Traffic accident anticipation – predicting the likelihood of an imminent collision at every frame of a dashcam video – is safety-critical yet difficult to scale, because collecting in-domain annotated accident footage for every deployment scenario is prohibitively expensive. We study this task under a zero-shot setting where no target-domain training data is available: the model must learn exclusively from a publicly available binary-labelled driving-accident dataset and generalise to unseen dashcam footage. We propose a framework that bridges the gap between the frame-level temporal risk estimation task and coarsely labelled binary accident datasets by coupling a VideoMAE-v2 backbone with a per-frame prediction head under a sliding-window protocol. Our method achieves 2nd place in the 2026 CVPR@AUTOPILOT Zero-Shot Accident Anticipation competition. Code is available at https://github.com/TimeSouth/zero-shot-taa-solution.
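A sliding-window protocol turns a clip-level scorer into per-frame risk estimates by scoring, for every frame, the short clip that ends at that frame. The sketch below shows one causal variant; the left-padding rule, the window length, and `score_fn` (a stand-in for the VideoMAE-v2 backbone plus prediction head) are assumptions, since the abstract does not specify the protocol's details.

```python
def sliding_window_scores(frames, window, score_fn):
    """Return one risk score per frame.

    For frame t the clip holds the `window` most recent frames ending at
    t, so no future frames are used. Clips near the start are left-padded
    by repeating the first frame (an assumed convention).
    """
    scores = []
    for t in range(len(frames)):
        start = t - window + 1
        clip = [frames[max(i, 0)] for i in range(start, t + 1)]
        scores.append(score_fn(clip))
    return scores


def mean_score(clip):
    """Toy scorer used only to demonstrate the windowing."""
    return sum(clip) / len(clip)
```

Called as `sliding_window_scores(frames, 16, model_score)`, this yields a risk curve with the same length as the video, which is the output shape frame-level anticipation requires even though training labels are only clip-level.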


[200] Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur? cs.CV | cs.LGPDF

Apratim Bhattacharyya, Shweta Mahajan, Sanjay Haresh, Rajeev Yasarla, Reza Pourreza

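TL;DR: The paper introduces Ego-MC-Bench, a benchmark for evaluating how well video large language models correct user mistakes in real time in realistic cooking scenarios, and finds that existing SOTA models perform poorly on it. To address the scarcity of training data, it also builds the synthetic Ego-CoMist dataset; fine-tuning on it improves performance, especially for small, efficient models suited to edge devices.

Details

Motivation: As people increasingly learn everyday skills such as cooking from online videos, video LLMs could serve as task guidance assistants. A key capability for practical use is intervening promptly and proactively when the user makes a mistake, but there has been no benchmark for this capability and no matching training data.

Result: Current state-of-the-art video LLMs find Ego-MC-Bench highly challenging. Fine-tuning on the synthetic Ego-CoMist dataset improves performance, particularly for small, efficient video LLMs that suit edge devices.

Insight: The core novelty is the first benchmark focused on real-time mistake correction by video LLMs (Ego-MC-Bench), plus a synthetic dataset (Ego-CoMist) created through data engineering, by turning non-interactive videos into supervised examples that include interventions, to ease data scarcity. Together they provide an evaluation framework and data for building practical task guidance assistants.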

Abstract: Learning everyday skills, like cooking a dish, relies increasingly on instructional media such as online videos. This opens the door to the use of video (and multimodal) large language models (LLMs) as task guidance assistants. A crucial capability for the real-world success of a prospective task guidance assistant is its ability to intervene proactively as soon as a mistake is apparent in order to guide the user. To evaluate this crucial capability, we introduce Ego-MC-Bench (Mistake Corrections), a benchmark for evaluating reactive, step-by-step task guidance in realistic cooking scenarios. Extensive experiments show that Ego-MC-Bench is highly challenging for state-of-the-art video LLMs. We argue that a key reason is the limited availability of training data for fine-tuning models on this task. Although there exists a wide range of cooking video datasets, existing datasets lack examples of mistakes along with appropriately timed interventions. To help address this data limitation, we also introduce Ego-CoMist, a counterfactual synthetic dataset created by transforming non-interactive cooking videos into supervised training examples showing proactive interventions. We show that fine-tuning on Ego-CoMist yields performance gains especially for smaller and more efficient video LLMs that are well suited for delivering assistance on edge devices.


[201] ATN3D: Density-Aware LiDAR-Radar Early 3D Object Detection Under Extreme Sparsity cs.CV | cs.AIPDF

Debojyoti Biswas, Xianbiao Hu

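TL;DR: This paper proposes ATN3D, a LiDAR-Radar early-fusion 3D object detection framework designed for extremely sparse sensing conditions. It combines density-aware early fusion, credibility-based neighborhood aggregation, evidence-conditioned channel self-attention, and a range-aware loss to counter the performance drop in long-range, sparse scenes. On the VoD benchmark under clear and heavy-fog conditions, it outperforms the baselines both overall and on long-range objects.

Details

Motivation: Under extremely sparse sensing at long range (>30 m) in autonomous driving, early multimodal fusion discards sparsity information and injects noise, and uniform channel supervision under-optimizes small, distant objects, which delays detection.

Result: On the VoD benchmark, ATN3D improves mAP by 3.55% in clear weather and by 8.41% under simulated heavy fog; for long-range objects beyond 30 m, the gains are 3.33% (clear) and 2.09% (heavy fog), giving earlier and more reliable long-range detection.

Insight: The novelty is treating sensing density/sparsity as an explicit condition for fusion and aggregation and designing a loss aligned with distance-stratified evaluation. This suggests a way to handle sparse multimodal data: use the sparsity of the data itself to steer the network toward credible evidence rather than fusing everything indiscriminately.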

Abstract: 3D object detection is the backbone of perception for automated vehicles (AV) and broader intelligent transportation systems applications. Long-range detection is challenging because sensing evidence is sparse; yet this "long-range" scenario is routine in traffic. Although >30m is often labeled long-range in computer vision, on roadways it affords only approx. 1-2s for perception and decision-making. Under such extreme sparsity, two core challenges arise. First, early multimodal fusion tends to discard sparsity information and inject noise from empty or falsely occupied cells, degrading long-range recall. Second, context-agnostic uniform channel supervision favors dense and near-range samples, leaving far and small objects under-optimized, delaying the earliest detection of distant objects. We propose "Ask The Neighbor" (ATN3D), a LiDAR-Radar framework tailored for sparse-range conditions. ATN3D introduces (i) Density-aware early fusion with cross-modal gating that conditions fusion on per-voxel density/sparsity and Radar evidence, (ii) Occupancy-gated neighborhood aggregation with circular kernels to aggregate only from credible cells, (iii) Evidence-conditioned channel self-attention to adapt channel weights with weather/range, and (iv) a Range-aware loss that re-balances classification and localization by distance, aligning training with distance-stratified evaluation. On the VoD benchmark across clear and foggy conditions, ATN3D surpasses strong baselines: +3.55% mAP in clear weather and +8.41% mAP under simulated heavy fog; for >30m objects, gains are +3.33% (clear) and +2.09% (heavy fog). These results indicate earlier and more reliable long-range detections under sparse sensing in on-road traffic.
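The claim that 30 m affords only about 1 to 2 s follows from dividing range by speed. The sketch below works through that arithmetic; the two speeds (54 km/h and 108 km/h) are illustrative assumptions chosen to bracket typical road traffic and do not come from the paper.

```python
def time_budget_s(range_m, speed_kmh):
    """Seconds until a vehicle at `speed_kmh` covers `range_m` metres,
    assuming constant speed and a stationary object."""
    speed_mps = speed_kmh / 3.6
    return range_m / speed_mps


# Assumed speeds: urban arterial (54 km/h) and highway (108 km/h).
urban_budget = time_budget_s(30.0, 54.0)     # about 2 s
highway_budget = time_budget_s(30.0, 108.0)  # about 1 s
```

Any delay in first detecting a distant object comes directly out of this budget, which is why the abstract stresses the earliest detection of objects beyond 30 m.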


[202] CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation cs.CVPDF

Yuheng Chen, Teng Hu, Yuji Wang, Qingdong He, Zhucun Xue

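TL;DR: This paper introduces CineDance-1M, a large-scale open-source text-to-audio-video dataset designed for multi-shot, long-form joint audio-video generation, together with the CineBench evaluation benchmark. Adapting the LTX-2.3 model into CineDance validates the quality and usefulness of the dataset, with the goal of advancing research on long-form narrative audio-video generation.

Details

Motivation: Progress in open-source video generation models is limited by the scarcity of high-quality training data, in particular datasets for multi-shot, long-form joint narrative audio-video generation, which holds back their ability to generate cinematic narratives.

Result: The CineDance model performs well in single-modality quality, audio-video alignment, and subject and environment consistency, which validates the quality of CineDance-1M and its curation strategy.

Insight: The core novelty is a large-scale, high-quality, structured long-form audio-video dataset (CineDance-1M) whose curation pipeline draws on film narrative theory, along with a multi-dimensional benchmark (CineBench) tailored to evaluating complex narrative audio-video content.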

Abstract: The fidelity and structural diversity of training datasets fundamentally determine the capabilities of video generation models. While commercial systems show remarkable ability to generate cinematic narratives, the progress of open-source models remains limited by the scarcity of high-quality training data. To bridge this gap, we introduce CineDance-1M, a large-scale, open research Text-to-Audio-Video (T2AV) dataset designed specifically for multi-shot, long-form joint audio-video generation. Averaging 92.8 seconds and 24.2 continuous shots per video, it provides configurable, structured annotations for both audio and video modalities. This exceptional quality is achieved through a rigorous three-stage curation pipeline: i) diverse sourcing and comprehensive cleansing, ii) film-theory-inspired narrative parsing, and iii) hierarchical dual-modal captioning. For a comprehensive assessment, we propose CineBench, featuring a diverse prompt suite and a six-dimensional, human-aligned metric system tailored for complex narrative audio-video evaluation. Furthermore, we adapt LTX-2.3 into CineDance, which demonstrates exceptional single-modality quality alongside precise audio-video alignment and robust subject and environment consistency, effectively validating our curation strategy and the high quality of CineDance-1M. We anticipate that this work will serve as a solid foundation for accelerating future research in multi-shot, long-form joint audio-video generation. Our project page is available at https://aliothchen.github.io/projects/CineDance/.


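[203] MAVIS: Multi-Agent Video Retrieval via Structured Video Understanding cs.CV

Jie Zhang, Qilang Ye, Hao Zhou, Haochen Liang, Fei Luo

TL;DR: MAVIS is a multi-agent video retrieval framework that recasts retrieval as cooperative reasoning rather than brute-force search. It addresses the semantic asymmetry between videos and text queries by parsing raw videos into a structured semantic library. A planner decomposes user intent and dispatches agents to nominate candidates, and a logic-aware debate mechanism then performs fine-grained verification.

Details

Motivation: The current embedding-based, full-corpus-scanning retrieval paradigm is computationally inefficient and suffers from semantic asymmetry between information-dense videos and sparse text queries. MAVIS aims to bridge this gap through multi-agent cooperative reasoning.

Result: Extensive experiments on the MSR-VTT, MSVD, and ActivityNet benchmarks show that MAVIS reaches competitive performance without task-specific fine-tuning, offering a scalable and interpretable alternative to traditional dual-encoder approaches.

Insight: The novelty lies in recasting retrieval as a multi-agent cooperative reasoning pipeline. Its core is a structured semantic library that enables explicit attribute-level indexing, plus a logic-aware debate mechanism with a strict veto protocol for filtering candidates efficiently, giving a more efficient and interpretable retrieval paradigm that goes beyond simple embedding matching.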

Abstract: The dominant paradigm in video retrieval relies on embedding-based full-corpus scanning, which suffers from inherent computational inefficiency and the semantic asymmetry between information-dense videos and sparse textual queries. To bridge this gap, we introduce MAVIS, a novel multi-agent framework that rethinks retrieval as cooperative reasoning rather than brute-force search. MAVIS first bridges the granularity mismatch by parsing raw videos into a Structured Semantic Library, enabling explicit attribute-level indexing. During retrieval, a planner decomposes complex user intents into atomic sub-tasks, dispatching specialized agents to independently nominate candidates. Crucially, MAVIS employs a Logic-aware Debate mechanism with a strict veto protocol, where agents collaboratively prune logical mismatches to identify a compact set of "controversial" candidates for fine-grained verification. This agentic workflow effectively bypasses the inefficiency of full-library traversal. Extensive experiments on MSR-VTT, MSVD, and ActivityNet demonstrate that MAVIS achieves competitive performance without task-specific fine-tuning, offering a scalable and interpretable alternative to traditional dual-encoder approaches.


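[204] Visual Prompting Meets Feature Reconstruction-Based Anomaly Detection with Dual-Teacher Supervision cs.CV | cs.AI

Mateo Diaz-Bone, Daniel Caraballo, Florian Scheidegger, Thomas Frick, Mattia Rigotti

TL;DR: This paper proposes an anomaly detection method that combines visual prompting with dual-teacher supervision, addressing the failure of existing methods in real-world scenes where assumptions about object scale, viewpoint, background, illumination, and position are violated. The method isolates objects with a visual prompting pipeline based on foreground-background masking, unfreezes the teacher model to improve domain adaptability, and augments the data with diffusion-generated synthetic images.

Details

Motivation: Current anomaly detection methods perform very well on standard datasets but struggle in real-world scenes when assumptions about object scale, viewpoint, background, illumination, and position are violated, which degrades performance.

Result: On the challenging AeBAD dataset, with the Masked Multiscale Reconstruction model as the backbone, the method improves on the previous state of the art by 3.5 percentage points.

Insight: The contributions are a visual prompting pipeline for object isolation, unfreezing the teacher model to improve domain adaptability, and a data augmentation strategy based on diffusion-generated synthetic images. Together these designs make the method more robust in complex real-world scenes.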

Abstract: Recent Anomaly Detection methods achieve perfect detection and segmentation scores on well-established datasets, such as MVTec. However, many of these methods face challenges when foundational assumptions - such as consistent object scale, viewpoint, background, illumination, and centered placement - are violated. Such variations render anomaly detection methods unusable in many real-world scenarios. To address these limitations, we introduce three key contributions: (1) a visual prompting pipeline that isolates objects using foreground-background masking; (2) a mechanism for unfreezing the teacher in student-teacher models to improve domain adaptability; and (3) a data augmentation strategy leveraging diffusion-generated synthetic images to enhance anomaly detection performance. We achieve a 3.5 percentage point improvement over the previous state-of-the-art on the challenging AeBAD dataset by using the Masked Multiscale Reconstruction (MMR) model as our backbone.


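[205] Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis cs.CV | cs.AI | cs.LG

Samuele Punzo, Niccolò Caselli, Ippokratis Pantelidis, Francesco Massafra, Salvatore Lo Sardo

TL;DR: This paper studies whether pretrained video foundation models (such as V-JEPA, VideoMAE, and LTX-Video) encode intuitive-physics information in their frozen representations, and uses layerwise probing to examine how that information varies with model family, layer, and probe type. V-JEPA is strongest overall on the IntPhys2 and MVP benchmarks, especially with probes that model temporal dynamics; VideoMAE remains competitive, and LTX-Video recovers a weaker but non-trivial signal. Layerwise analysis shows that physics-relevant information is weakest in early layers and most accessible in intermediate-to-late layers, and that shuffling frame order substantially reduces performance.

Details

Motivation: The goal is to investigate whether pretrained video foundation models implicitly learn intuitive-physics knowledge in their frozen representations, and how the accessibility of that knowledge is affected by pretraining paradigm, representational depth, and readout mechanism.

Result: On the IntPhys2 and Minimal Video Pairs (MVP) benchmarks, V-JEPA achieves the strongest overall results, VideoMAE remains competitive, and LTX-Video recovers a weaker signal. Layerwise analysis shows that physics information is most accessible at intermediate-to-late layers, and shuffling frame order sharply reduces performance, especially on MVP.

Insight: The contribution is a systematic layerwise-probing comparison of how video foundation models from different pretraining paradigms (predictive joint embedding, masked reconstruction, diffusion-based generation) encode intuitive physics. It reveals how physical knowledge emerges in these models and how it depends on pretraining method, layer depth, and temporal order.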

Abstract: We study whether pretrained video foundation models encode intuitive-physics information in their frozen representations, and how this information varies across model families, layers, and probe types. Using frozen-feature probing on IntPhys2 and Minimal Video Pairs (MVP), we compare predictive joint-embedding models (V-JEPA), masked reconstruction models (VideoMAE), and a diffusion-based video generator (LTX-Video). V-JEPA achieves the strongest overall results across benchmarks, especially with probes that model temporal dynamics, while VideoMAE remains competitive and LTX-Video recovers weaker but non-trivial signal. Layerwise analyses show that physics-relevant information is weakest in early layers and becomes most accessible at intermediate-to-late depth, and temporal controls show that disrupting frame order substantially reduces performance, especially on MVP. Together, these results suggest that intuitive-physics knowledge emerges reliably in pretrained video representations, but its accessibility depends strongly on pretraining paradigm, representational depth, and readout mechanism.


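[206] GenEyePose: Patient-Free, Knowledge-Based Saccadic Eye Movement Modeling for Digital Neurophysiologic Biomarker Development cs.CV

Tianyu Lin, Jooyoung Ryu, Puvada Sreevarsha, Rahul Srinivasaragavan, Riya Satavlekar

TL;DR: This paper proposes GenEyePose, a fully synthetic, patient-free, multimodal eye movement generation pipeline for producing saccadic eye movement data. A deep learning classifier trained on the synthetic dataset distinguishes normal from abnormal (hypometria and hypermetria) saccadic accuracy and is evaluated on real clinical data.

Details

Motivation: Eye movements, especially saccades, are sensitive biomarkers of neurophysiologic state, but robust AI video-oculography solutions are currently lacking, mainly because of privacy concerns and data scarcity.

Result: Evaluated on real clinical data, the model achieves an AUROC of 0.76 and a sensitivity of 0.71, indicating that the synthetic data has good potential to generalize to clinical applications such as at-home screening or emergency room triage.

Insight: The contribution is the first fully synthetic eye movement generation pipeline that requires no real patient data. It addresses data scarcity and privacy concerns and offers a new route to developing digital neurophysiologic biomarkers.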

Abstract: Eye movements, including saccades, are widely regarded as highly sensitive and objective biomarkers of neurophysiologic states. Detecting saccadic signatures in neurologic diseases offers a rapid, portable alternative to brain imaging, avoiding access and cost barriers. Currently, there are no robust AI-enabled video-oculographic solutions (e.g., digital biomarkers) for screening, triaging, or localizing brain abnormalities due to privacy issues and scarce datasets. In this work, we propose the first fully synthetic, patient-free, multimodal eye movement generation pipeline for generalizable saccade analysis. Using this synthetic dataset, we trained a deep learning classifier to distinguish between normal and abnormal (hypometria and hypermetria) saccadic accuracies and evaluated its performance on real-world clinical data. The model achieved an AUROC of 0.76 and a sensitivity of 0.71, showing that the synthetic data has strong potential to generalize for clinical applications, including as a screening tool in at-home and emergency room settings or a tool for precise neuroanatomic localization.


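[207] HDSL: A Hierarchical Domain-Specific Language for Structured 3D Indoor Scene Generation and Localized Editing with LLM Agents cs.CV

Letian Li, Chao Shen, Shuzhao Xie, Chenghao Gu, ZhengXiao He

TL;DR: This paper proposes HDSL, a hierarchical domain-specific language for structured 3D indoor scene generation and localized editing, together with an LLM-agent pipeline. HDSL represents a scene as an XML/CSS-like tree, which supports recursive planning and local retrieval. The pipeline uses LLM agents to generate and verify HDSL subtrees, instantiates them through multimodal asset retrieval, and repairs errors with force-directed layout optimization. For editing, a hierarchical retrieval-augmented generation (HRAG) method efficiently locates, rewrites, and merges local subtrees of the scene.

Details

Motivation: Existing LLM-based indoor scene generation and editing systems usually rely on scene graphs or global constraint lists. These representations are compact but underspecify local geometry and make instruction-based edits hard to localize. The paper treats the task as structured program generation and local program repair.

Result: In the reproduced benchmark, HDSL outperforms full text-to-scene baselines on object coverage, text-scene alignment, and generation time, while remaining competitive on geometry metrics with reproductions of recent layout-only methods. For editing, HRAG reduces token use by 5.22x and runtime by 6.19x, produces valid DSL in all eight paired edit cases, and better preserves unrelated scene objects.

Insight: The core contribution is HDSL, a hierarchical, structured domain-specific language that represents a scene as a tree with local coordinates, supporting recursive planning and efficient local retrieval. The second key contribution is the HRAG editing framework, which retrieves the relevant subtree, rewrites it in its local context, and performs a deterministic three-way merge, enabling efficient and precise localized scene editing at substantially lower computational cost.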

Abstract: Text-driven indoor scene generation and editing require an intermediate representation that language models can both produce and revise. Existing LLM-based systems often rely on scene graphs or global constraint lists, which are compact but underspecify local geometry and make instruction-based edits difficult to localize. We frame this problem as structured program generation and local program repair, and propose Hierarchical Descriptive Scene Language (HDSL), an XML/CSS-style domain-specific language for structured 3D indoor scenes. HDSL represents rooms, regions, objects, and support surfaces as a tree with local coordinates, making complex scenes easier to plan recursively and easier to retrieve for editing. Our pipeline uses LLM agents to generate HDSL subtrees with bounded verification, grounds non-virtual nodes through multimodal asset retrieval, and applies force-directed layout optimization to repair boundary and collision errors. For editing, Hierarchical Retrieval-Augmented Generation retrieves the relevant subtree, asks the LLM to rewrite only that local context, and merges the result back through a deterministic three-way merge. In our reproduced benchmark, HDSL improves average object coverage, text-scene alignment, and generation time over full text-to-scene baselines while remaining competitive with recent layout-only reproductions on geometry metrics; for editing, HRAG reduces token use by $5.22\times$ and runtime by $6.19\times$, produces valid DSL for all eight paired edits, and better preserves unrelated scene objects.
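
The tree-with-local-coordinates idea in the abstract can be illustrated with a minimal Python sketch. Everything named here is hypothetical: the `SceneNode` class, its fields, and the translation-only offsets are assumptions made for illustration, since the abstract does not give HDSL's actual schema or transform model. The sketch shows only why local coordinates keep edits local: moving a region moves everything beneath it without touching any other node.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class SceneNode:
    """Hypothetical scene-tree node; not HDSL's real schema."""
    name: str
    kind: str                        # e.g. "room", "region", "object"
    offset: Vec3 = (0.0, 0.0, 0.0)   # position relative to the parent node
    children: List["SceneNode"] = field(default_factory=list)


def world_positions(node: SceneNode, origin: Vec3 = (0.0, 0.0, 0.0)) -> Dict[str, Vec3]:
    """Accumulate parent offsets down the tree (translation only, an assumption)."""
    pos = tuple(o + d for o, d in zip(origin, node.offset))
    out = {node.name: pos}
    for child in node.children:
        out.update(world_positions(child, pos))
    return out


def find_subtree(node: SceneNode, name: str) -> Optional[SceneNode]:
    """Depth-first lookup of the subtree an edit instruction refers to."""
    if node.name == name:
        return node
    for child in node.children:
        hit = find_subtree(child, name)
        if hit is not None:
            return hit
    return None


lamp = SceneNode("lamp", "object", (0.0, 0.75, 0.0))
desk = SceneNode("desk", "object", (0.5, 0.0, 0.5), [lamp])
desk_area = SceneNode("desk_area", "region", (2.0, 0.0, 1.0), [desk])
room = SceneNode("room", "room", (0.0, 0.0, 0.0), [desk_area])

positions = world_positions(room)
```

Editing `find_subtree(room, "desk_area").offset` and recomputing `world_positions(room)` shifts the desk and the lamp together, which is the property that makes a retrieved subtree a convenient unit for rewriting.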


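[208] Hybrid Robustness Verification for Spatio-Temporal Neural Networks cs.CV | cs.AI | cs.LG

Sherwin Varghese, Matthew Wicker, Alessio Lomuscio

TL;DR: This paper proposes a hybrid robustness verification method for spatio-temporal neural networks. It models spatio-temporal constraints to characterize realistic adversarial perturbations more precisely and introduces the Spatio-Temporal Bound Propagation (STBP) framework, which combines an exact closed-form characterization of the first convolutional layer with scalable approximations for subsequent layers, improving computational efficiency while retaining robustness guarantees.

Details

Motivation: Existing verification methods for spatio-temporal data such as video are either overly conservative or too computationally expensive, and they usually rest on unrealistic perturbation assumptions (e.g., independent noise in every frame). Real adversarial perturbations have structured spatio-temporal correlations, so a verification method that exploits realistic constraints is needed to give tighter robustness guarantees.

Result: On benchmarks in action recognition (UCF-101), autonomous driving (Udacity), and medical imaging (MedMNIST), STBP achieves 1.7x higher certified robust accuracy than existing verification methods under identical perturbation budgets. The paper also proposes the ST-Bench benchmark for systematically evaluating verifiable robustness.

Insight: The contribution is to model adversarial perturbations as spatio-temporal constraints (e.g., modifying a subset of frames or patches within consecutive frames), which allows tighter approximations. STBP combines an exact closed-form characterization of the first layer with approximate propagation through subsequent layers, balancing scalability against the strength of the guarantees.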

Abstract: With AI increasingly deployed in safety-critical systems, providing formal robustness guarantees for the underlying models is essential. Existing verification methods either rely on overly conservative approximations or incur prohibitive computational costs. For example, the use of $\ell_p$-norm perturbations in video settings encodes the belief that the adversary can inject noise in every video frame. In practice, adversarial perturbations exhibit structured spatial and temporal correlations, constrained to lower-dimensional, semantically meaningful subspaces. In this work, we study robustness verification of 3D CNNs processing video and volumetric inputs, targeting applications in action recognition (UCF-101), autonomous driving (Udacity), and medical imaging (MedMNIST). We exploit realistic assumptions on adversarial strength by modelling perturbations as spatio-temporal constraints, where the attacker can modify either a subset of frames or patches within a set of consecutive frames. We demonstrate that modelling realistic constraints enables tighter approximations. We introduce Spatio-Temporal Bound Propagation (STBP), a verification framework that computes an exact closed-form characterization of the first convolutional layer and propagates certified bounds through subsequent layers using scalable approximations. Computing the exact closed form provides the tightest bounds for the first convolutional layer; we utilise approximation methods in the remainder of the network. To spur further progress in this field, we propose ST-Bench, a verification benchmark for autonomous driving and activity recognition, to systematically evaluate verifiable robustness. Compared to existing verification-based approaches, STBP provides stronger robustness guarantees with significantly improved scalability, achieving 1.7x higher certified robust accuracy under identical perturbation budgets.
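
To see why restricting the attacker to a subset of inputs tightens bounds, a small Python sketch of standard interval arithmetic on one affine layer may help. This is not the paper's STBP closed form, which the abstract does not spell out; it is the textbook per-output bound for an affine map under a box constraint, and a convolution is one such map. The function name `affine_bounds`, the index set `perturbed`, and the toy numbers are assumptions, and clipping to a valid pixel range is ignored.

```python
from typing import List, Sequence, Set, Tuple


def affine_bounds(
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
    x: Sequence[float],
    eps: float,
    perturbed: Set[int],
) -> List[Tuple[float, float]]:
    """Per-output bounds of y = W x + b when only inputs in `perturbed`
    may move by at most eps; all other inputs stay fixed."""
    bounds = []
    for row, b in zip(weights, bias):
        center = sum(w * xi for w, xi in zip(row, x)) + b
        # Worst case: each perturbable input moves by eps in the direction of sign(w).
        radius = eps * sum(abs(w) for i, w in enumerate(row) if i in perturbed)
        bounds.append((center - radius, center + radius))
    return bounds


W = [[1.0, -2.0, 3.0]]
b = [0.0]
x = [1.0, 1.0, 1.0]

# The attacker may touch every input (the usual norm-ball assumption).
full = affine_bounds(W, b, x, eps=0.1, perturbed={0, 1, 2})
# The attacker may touch only inputs 0 and 1 (e.g. pixels of a single frame).
structured = affine_bounds(W, b, x, eps=0.1, perturbed={0, 1})
```

In this toy case the structured interval is half as wide as the full one, because the largest weight multiplies an input the attacker cannot reach. Each output is linear in the input, so these per-output bounds are attained at a corner of the box and are therefore exact for a single affine layer.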


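[209] POTATR: A Lightweight Image-to-Graph Model for Page-Level Table Extraction cs.CV

Brandon Smock, Libin Liang, Max Sokolov, Amrit Ramesh, Valerie Faucon-Morin

TL;DR: This paper proposes POTATR, a lightweight image-to-graph model for page-level table extraction. Extending the Table Transformer, the model has only 29 million parameters, outperforms all tested models (including frontier multimodal LLMs) on the PubTables-v2 Single Pages benchmark, and runs over 130x faster at roughly 300x lower cost.

Details

Motivation: Large-scale document processing requires context-aware table extraction that is both accurate and efficient, whereas existing approaches typically need billions of parameters, hundreds of autoregressive steps, or costly API inference. A lighter, more efficient solution is therefore needed.

Result: On the PubTables-v2 Single Pages benchmark, POTATR achieves a GriTS_Con score of 0.964, surpassing all tested models (including frontier MLLMs), while running over 130x faster at roughly 300x lower cost.

Insight: The contribution is a lightweight image-to-graph architecture with spatially grounded output: every recognized element has a bounding box, which supports visual verification and geometric text assignment. The model also composes with other models, extending to scanned documents via external OCR and to full-document table extraction via cross-page merging.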

Abstract: Large-scale document processing requires contextually aware table extraction (TE) that is both accurate and efficient. Yet current approaches require billions of parameters, hundreds of autoregressive steps, or costly API inference. Motivated by this, we introduce the Page-Object Table Transformer (POTATR), a lightweight 29M parameter image-to-graph model that extends the Table Transformer (TATR) for contextualized page-level TE. POTATR outperforms all models tested on the PubTables-v2 Single Pages benchmark – including frontier MLLMs – achieving $\textrm{GriTS}_\textrm{Con}$ of 0.964 while running over 130$\times$ faster at roughly 300$\times$ lower cost. Further, POTATR’s output is spatially grounded: every recognized element has a bounding box, enabling visual verification and geometric text assignment. As a result, POTATR performs unified page-level TE while composing with other models, enabling extension to scanned documents via external OCR and to full-document TE via techniques like cross-page merging. Code and models will be released.


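[210] Beyond Spherical Harmonics: Rethinking Appearance Models for Radiance Reconstruction cs.CV | cs.GR

Ewa Miazga, Jorge Condor, Piotr Didyk

TL;DR: This paper addresses view-dependent appearance modeling in radiance field reconstruction and proposes a new spherical function, the Normalized Anisotropic Spherical Gabor function, to model high-frequency appearance effects efficiently. While keeping a compact representation, the method markedly improves the reconstruction of complex view-dependent phenomena such as specular reflections, with up to five times better memory efficiency and faster evaluation.

Details

Motivation: Existing learning-based methods usually rely on low-order spherical harmonics (SH) for view-dependent appearance, which struggle to capture high-frequency phenomena such as specular reflections efficiently and yield overly smooth or diffuse reconstructions. Higher-order expansions sharply increase memory and compute cost.

Result: Validated on radiance-field reconstruction tasks, the proposed function reconstructs view-dependent phenomena such as glints at higher quality while being up to five times more memory-efficient and more efficient to evaluate.

Insight: The contribution is a systematic evaluation of a range of spherical functions (some introduced to graphics and computer vision for the first time) and, based on the experimental insights, the Normalized Anisotropic Spherical Gabor function, which learns and models high-frequency appearance effects efficiently in a compact representation and suggests a new direction for appearance models.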

Abstract: View-dependent appearance modeling remains a challenging problem in novel-view synthesis and reconstruction. Accurately representing complex angular effects often requires substantial memory and computational resources. For new learning-based methods, a common approach is to rely on spherical harmonics (SH). However, capturing high-frequency phenomena such as specular reflections demands high-order expansions, which increase memory usage and computational cost. Consequently, most methods employ low-order SH, which limits the ability to model complex view-dependent effects, resulting in overly smooth or diffuse representations. To address these limitations, we systematically evaluate a wide range of spherical functions in the context of scene reconstruction. Some of them are introduced to graphics and computer vision for the first time in this paper. Based on the insights from the experiment, we develop a novel spherical formulation, the Normalized Anisotropic Spherical Gabor function, which enables efficient modeling and learning of high-frequency appearance effects while maintaining a compact representation. Compared to existing approaches, our function achieves higher-quality reconstruction of view-dependent phenomena such as glints, while being up to five times more memory-efficient and more efficient to evaluate. We validate its performance in radiance-field reconstruction tasks.


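[211] Echo-Memory: A Controlled Study of Memory in Action World Models cs.CV | cs.GR | cs.LG

Wayne King, Zeyue Xue, Yuxuan Bian, Jie Huang, Haoran Li

TL;DR: This paper presents Echo-Memory, a controlled framework for studying memory mechanisms in action-conditioned world models. By fixing the video generation interface and varying only how history is stored and read, the study systematically compares several memory designs on a shared video diffusion backbone: raw context, compression-based memory, spatial summaries, and state-space recurrence. Raw context performs well on the open-domain return task, overly aggressive compression harms memory, and block-wise state-space recurrence is the best mechanism for open-domain return.

Details

Motivation: When action-conditioned world models generate multi-segment videos, the central failure is often memory rather than local image synthesis: after the camera leaves and returns, the scene or a salient object may silently change. Existing memory designs are hard to compare because gains are entangled with differences in backbone, training, retrieval, and evaluation.

Result: With a shared video diffusion backbone, optimizer, camera-action representation, sampler, and evaluation pipeline, the study evaluates memory through a three-branch protocol (replay quality, in-domain loop revisit, and open-domain return probes). Raw context substantially improves open-domain return, and block-wise state-space recurrence is the strongest open-domain return mechanism.

Insight: The contribution is a controlled framework that separates four axes of memory (capacity, compression, read-out, and recurrence) together with a three-branch evaluation protocol, which shows that replay fidelity is not a sufficient proxy for world memory. The study offers a compact protocol for systematic memory research beyond isolated replay metrics and underscores the importance of the structure of implicit memory.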

Abstract: We present Echo-Memory, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, text prompt, and camera-action sequence, but their central failure is often memory rather than local image synthesis: after the camera leaves and returns, the scene or salient object may silently change. Existing memory designs are hard to compare because gains are entangled with backbone, training, retrieval, and evaluation differences. Echo-Memory fixes the action-to-video interface and varies only how history is stored and read by the generator. Under a shared video diffusion backbone, optimizer, camera-action representation, sampler, and evaluation pipeline, we compare raw context, compression-based memory, spatial summaries with different read-out paths, and state-space recurrence. This matched matrix separates four otherwise conflated axes: capacity, compression, read-out, and recurrence. We also evaluate memory through a three-branch protocol: replay quality, in-domain loop revisit, and open-domain return probes. The branches routinely disagree, showing that replay fidelity is not a sufficient proxy for remembering a world. Three findings follow. Raw context is a strong capacity baseline and improves open-domain return far more than it improves replay metrics. Compactness is not a free substitute for capacity: aggressive spatial and hybrid-compression memories lose the salient evidence needed for return. Finally, block-wise state-space recurrence is the strongest open-domain return mechanism in our matrix, showing that the structure of implicit memory matters as much as the decision to use it. These results provide a compact protocol for studying memory in action world models beyond isolated replay metrics.


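[212] OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics cs.CV | cs.AI

Mingxian Lin, Shengju Qian, Yuqi Liu, Yi-Hua Huang, Yiyu Wang

TL;DR: This paper presents OmniGameArena, a real-time evaluation platform built in Unreal Engine 5 with 12 new games (covering solo, player-versus-player, and cooperative modes) for the unified evaluation of vision-language model (VLM) agents. It also introduces the Improvement Dynamics Curve (IDC), a framework in which a tool-using LLM performs multiple rounds of autonomous reflection to refine a skill prompt, going beyond conventional first-attempt scoring.

Details

Motivation: Current game benchmarks for VLM agents usually report only a first-attempt score per (agent, game) pair, focus on single-player games, and lack a unified protocol for fairly evaluating different agent classes (commercial VLMs, open-weight VLMs, and specialized game policies).

Result: The paper reports cold-start leaderboard scores for 12 VLM agents and the performance of 4 top agents under IDC. IDC provides two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants.

Insight: The main contributions are a unified, diverse real-time game benchmark (OmniGameArena) and a reflection framework for evaluating agents' improvement dynamics (IDC). This goes beyond static leaderboard evaluation by tracking how agents learn skills through autonomous reflection and how well those skills generalize, giving a more comprehensive protocol for evaluating heterogeneous VLM agents.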

Abstract: Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) with unified action interfaces, and the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants. We report these observables for twelve VLM agents on the cold-start leaderboard and four top agents under IDC.


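[213] Latent Spatial Memory for Video World Models cs.CV

Weijie Wang, Haoyu Zhao, Yifan Yang, Feng Chen, Zeyu Zhang

TL;DR: This paper proposes latent spatial memory for video world models, addressing the compute overhead and information loss of existing methods based on explicit point-cloud memory. By building and querying a 3D scene representation directly in the diffusion model's latent space, the method avoids pixel-space reconstruction, substantially improving generation efficiency and reducing memory footprint.

Details

Motivation: Existing video world models usually maintain 3D spatial consistency across generated frames with explicit point-cloud memory built in RGB space. This is computationally expensive (it requires repeated rendering and VAE encoding) and lossy (the round trip through pixel space discards learned latent features).

Result: Experiments show up to 10.57x faster end-to-end video generation and a 55x reduction in memory footprint relative to explicit 3D baselines. The method reaches state-of-the-art performance on the WorldScore benchmark and strong reconstruction quality on RealEstate10K.

Insight: The core idea is to maintain a persistent 3D cache directly in the diffusion latent space, lifting latent tokens into 3D via depth-guided back-projection and synthesizing novel views through direct latent-space warping. This unifies representation and querying and removes both the information loss of pixel reconstruction and the cost of repeated encoding.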

Abstract: Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point cloud memory constructed in RGB space. This design is both computationally expensive, requiring repeated rendering and VAE encoding, and inherently lossy, as the round trip through pixel space discards rich features of the learned latent representation. In this paper, we introduce latent spatial memory for video world models, a persistent 3D cache that stores scene information directly in the diffusion latent space, avoiding pixel-space reconstruction. Building on this, we propose Mirage, a latent-space spatial memory framework that constructs the memory by lifting latent tokens into 3D via depth-guided back-projection and queries it by synthesizing novel views through direct latent-space warping. This unified formulation eliminates both the information loss of pixel-space reconstruction and the computational burden of repeated encoding and rendering. Experiments show that latent spatial memory achieves up to $10.57\times$ faster end-to-end video generation and $55\times$ reduction in memory footprint relative to explicit 3D baselines. Leveraging the geometric prior of the diffusion model, Mirage attains state-of-the-art performance on WorldScore and strong reconstruction quality on RealEstate10K.
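
The geometry behind "depth-guided back-projection" and "warping" can be sketched with the standard pinhole camera model. The Python below is an illustration under stated assumptions, not Mirage's implementation: the function names, the intrinsics tuple `(fx, fy, cx, cy)` expressed in latent-grid units, and the translation-only camera motion are all hypothetical, and a real system would also handle rotation, occlusion, and resampling of the latent features.

```python
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]
Intrinsics = Tuple[float, float, float, float]  # (fx, fy, cx, cy), assumed latent-grid units


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Lift a grid location (u, v) with known depth to a 3D point in camera space."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def project(point: Vec3, fx: float, fy: float, cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Project a camera-space point to the grid; None if it lies behind the camera."""
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def warp_token(u: float, v: float, depth: float, intrinsics: Intrinsics,
               cam_translation: Vec3) -> Optional[Tuple[float, float]]:
    """Where a token at (u, v) lands after the camera translates (no rotation)."""
    p = backproject(u, v, depth, *intrinsics)
    moved = tuple(a - t for a, t in zip(p, cam_translation))
    return project(moved, *intrinsics)


intr = (2.0, 2.0, 1.0, 1.0)
point = backproject(2.0, 1.0, 4.0, *intr)                  # 3D position of one latent token
same_view = project(point, *intr)                          # round trip to the source view
new_view = warp_token(2.0, 1.0, 4.0, intr, (1.0, 0.0, 0.0))  # camera moved one unit along +x
```

Because the stored quantity is the token's 3D position, the same cached token can be looked up from any later camera pose without passing through pixel space first, which is the saving the abstract describes.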


cs.AI

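[214] Scaling Participation in Modular AI Systems cs.AI | cs.CL

Shangbin Feng, Yike Wang, Weijia Shi, Luke Zettlemoyer, Yejin Choi

TL;DR: This paper proposes a new paradigm called "scaling participation", in which modular AI systems are built bottom-up from small models contributed by diverse stakeholders, to address the inability of today's centralized, monolithic large language models to reflect the diversity of human knowledge, reasoning, and values.

Details

Motivation: Today's mainstream LLMs are built by a few and are centralized, monolithic models that are structurally ill-suited to capturing the diversity of human knowledge, reasoning, and values, so a new paradigm for building AI that reflects human richness is needed.

Result: Across 15 tasks (such as reasoning and factuality), participatory AI systems outperform monolithic LLMs by up to 15.4%, surpass models larger than all contributed components combined, and show emergent capabilities that solve over 15% of the problems on which every individual model fails.

Insight: The contribution is the bottom-up, modular "scaling participation" paradigm, which emphasizes the system-level performance gains and emergent capabilities that come from contributor diversity and provides a technical foundation for moving from monolithic models toward an open, collaborative AI future.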

Abstract: Humanity is a mosaic of multifaceted talents and needs, and any truly intelligent AI must reflect that richness. Yet the LLMs used by all are built by the few – a centralized market of monolithic AI models structurally ill-suited to capture the diversity of human knowledge, reasoning, and values. Here we introduce scaling participation, a new paradigm in which modular AI systems are built from the bottom up through the contributions of diverse stakeholders. Participants contribute small models trained on their own interests and priorities; these models then collaborate in modular frameworks as compositional AI systems. Participatory AI systems outperform monolithic LLMs by up to 15.4% across 15 tasks, such as reasoning and factuality, surpassing models larger than all contributed components combined. Further experiments show that participatory AI systems benefit from contributor diversity, substantially improve on each contributor’s original priorities, and exhibit emergent capabilities that allow them to solve over 15% of problems where all individual models fail. Scaling participation provides a technical foundation for transitioning from the monolithic status quo toward an open, bottom-up, and collaborative AI future.


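[215] Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning cs.AI | cs.CL | cs.LG

Mujtaba Farhan, Maheep Chaudhary

TL;DR: This paper targets the concept bottleneck in the CoCoNuT (continuous latent reasoning) paradigm, where intermediate hidden states are overwritten during deep reasoning and key information is lost, and proposes AGCLR (Adaptive Gated Continuous Latent Reasoning). The method introduces a persistent residual memory stream controlled by three learnable gates (write, read, forget) to retain and retrieve key facts across reasoning steps, improving performance on mathematical and multi-hop reasoning tasks.

Details

Motivation: The goal is to address the "concept bottleneck" in continuous latent reasoning methods such as CoCoNuT: as reasoning depth increases, intermediate hidden states are overwritten, so the model loses key facts computed in earlier steps, which limits its performance on complex tasks.

Result: On the GSM8K, HotpotQA, and ProsQA benchmarks with GPT-2 as the base model, AGCLR improves across the board over baselines such as CoT and vanilla CoCoNuT. On HotpotQA in particular, AGCLR addresses the case where vanilla CoCoNuT (10.4% EM) falls short of the CoT baseline (11.0% EM), and the performance gap widens as curriculum depth increases, directly addressing the concept bottleneck.

Insight: The claimed contribution is a persistent residual memory stream (the Gated Concept Stream) controlled by learnable write, read, and forget gates, which adaptively manages information flow across reasoning steps and mitigates forgetting in deep reasoning. Viewed objectively, this is a novel combination of gating mechanisms with the continuous latent reasoning framework, intended to strengthen memory retention in long-sequence or multi-step reasoning tasks.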

Abstract: Large language models (LLMs) have demonstrated remarkable reasoning abilities on mathematical and multi-hop planning tasks. The CoCoNuT (Chain of Continuous Thought) paradigm (Hao et al., 2024) extends this by enabling models to reason in latent space, exploring multiple reasoning paths simultaneously rather than committing to a single chain early on. However, we identify a limitation we term the concept bottleneck. At each reasoning pass, intermediate hidden states are overwritten, causing the model to lose critical facts computed in earlier steps as reasoning depth increases. We observe this empirically. On HotpotQA, vanilla CoCoNuT (10.4% EM) fails to improve over the CoT baseline (11.0% EM), and performance degrades with curriculum depth on GSM8K. To address this, we propose AGCLR (Adaptive Gated Continuous Latent Reasoning), which augments CoCoNuT with a Gated Concept Stream, a persistent residual memory maintained across all reasoning passes and controlled by three learned gates: a write gate that commits intermediate facts to memory, a read gate that retrieves relevant prior states, and a forget gate that prunes irrelevant context. Evaluated on GSM8K, HotpotQA, and ProsQA using GPT-2 as our base model, AGCLR achieves consistent improvements across all types of datasets, with the performance gap compounding as curriculum depth increases, directly resolving the concept bottleneck. Code is available at https://anonymous.4open.science/r/JJJJ/README.md
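
A rough Python sketch of how write, read, and forget gates could interact with a persistent memory follows. The abstract does not give AGCLR's equations, so the update rule below is an assumption chosen for simplicity: the gates are fixed scalars rather than learned functions of the hidden state, and the name `gated_memory_step` is hypothetical.

```python
from typing import List, Tuple


def gated_memory_step(
    memory: List[float],
    hidden: List[float],
    write: float,
    read: float,
    forget: float,
) -> Tuple[List[float], List[float]]:
    """One reasoning pass with an assumed gated update (not the paper's exact rule).

    forget: fraction of the old memory that is erased (0 keeps everything).
    write:  how strongly the current hidden state is committed to memory.
    read:   how strongly the updated memory is added back to the hidden state.
    """
    updated = [(1.0 - forget) * m + write * h for m, h in zip(memory, hidden)]
    output = [h + read * m for h, m in zip(hidden, updated)]
    return updated, output


# The memory holds a fact from an earlier pass in its first coordinate;
# the current hidden state has overwritten that coordinate with 0.
memory, output = gated_memory_step(
    memory=[1.0, 0.0], hidden=[0.0, 2.0], write=0.5, read=1.0, forget=0.0
)
```

With `forget=0.0` the earlier fact survives in the memory and is added back to the hidden state on read, which is the behavior the abstract contrasts with plain overwriting. Setting `forget=1.0` and `write=0.0` reduces the step to a pass-through of the hidden state.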


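[216] When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding cs.AI | cs.CL | cs.CV

Yiheng Wang, Yueqian Lin, Lichen Zhu, Yudong Liu, Hai “Helen” Li

TL;DR: This paper presents a diagnostic study of absent answer detection in multimodal large language models for video understanding. When the correct answer is deliberately excluded, models tend to pick plausible distractors rather than recognize that no valid answer exists, and the problem is worse on temporal reasoning tasks. The study evaluates detection behavior under three settings and explores chain-of-thought prompting as a mitigation strategy, but results remain unsatisfactory.

Details

Motivation: Although multimodal large language models have made notable progress in video understanding, the reliability of their answers is underexplored. In particular, whether a model can reliably recognize that no valid answer exists when the correct answer is missing remains an open question.

Result: Across multiple models and benchmarks, MLLMs overwhelmingly select plausible distractors rather than detecting the absent answer. This failure is more pronounced on temporal reasoning tasks and worsens with denser frame sampling. Chain-of-thought prompting substantially raises detection rates, but performance remains unsatisfactory.

Insight: The paper exposes a systematic failure of absent answer detection in multimodal systems and highlights the need for explicit detection mechanisms. Its contribution is a diagnostic study of MLLM reliability in video understanding, with a systematic evaluation across settings and mitigation strategies that informs future model design.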

Abstract: Multimodal large language models (MLLMs) have made substantial advancements in video understanding, yet the reliability of their responses remains underexplored. This work presents a diagnostic study of absent answer detection for MLLMs in video understanding, where the correct answer is deliberately excluded from the candidate set and a reliable model is expected to recognize that no valid option exists. We evaluate the absent answer detection behavior under three settings: multiple-choice questions augmented with a "None of the Above" option, open-ended generation with a detection instruction, and standard evaluation without any guidance. Across a diverse set of models and benchmarks, we find that MLLMs overwhelmingly select plausible distractors rather than detecting the absent answer. This failure is more pronounced in temporal reasoning tasks and worsens with denser frame sampling. We further explore chain-of-thought prompting as a mitigation strategy and find that while it substantially improves detection rates, performance remains unsatisfactory, suggesting that prompting-based strategies alone are insufficient to fully address this limitation. These findings expose a systematic failure in absent answer detection and highlight the need for explicit detection mechanisms in multimodal systems.
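
The multiple-choice setting described in the abstract can be made concrete with a short Python sketch of how such an item might be constructed and scored. The function names, the dictionary layout, and the example question are hypothetical and are not taken from the paper's benchmark code. The first of the three settings appends a "None of the Above" option; the no-guidance setting is approximated here by simply omitting it.

```python
from typing import Dict, List, Optional


def make_absent_answer_item(question: str, options: List[str], correct_index: int,
                            add_nota: bool = True) -> Dict[str, object]:
    """Remove the correct option so that no listed distractor is valid."""
    distractors = [opt for i, opt in enumerate(options) if i != correct_index]
    if add_nota:
        candidates = distractors + ["None of the Above"]
        target: Optional[int] = len(candidates) - 1
    else:
        candidates = distractors
        target = None  # no listed option is correct and none signals absence
    return {"question": question, "options": candidates, "target": target}


def detected_absence(item: Dict[str, object], predicted_index: int) -> bool:
    """True only if the model picked the explicit 'None of the Above' option."""
    return item["target"] is not None and predicted_index == item["target"]


item = make_absent_answer_item(
    "What does the person pick up first?",
    ["a cup", "a phone", "a book", "a key"],
    correct_index=2,
)
```

A model that answers with index 0 ("a cup") on this item has chosen a plausible distractor, which is the failure the study measures. Only index 3 counts as detecting the absent answer.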


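[217] Explaining Black-Box Language Models: Learning to Optimize Linguistically-Structured Word Subsets cs.AI | cs.CL

Minyoung Hwang, Seokhyun Lee, Changhee Lee

TL;DR: This paper proposes an explanation method for black-box deep language models (DLMs) that generates explanations by selecting a small, informative subset of the input text. Selection is formulated as an amortized optimization problem, trained with REINFORCE policy gradients, and incorporates graph-structured knowledge to produce explanations consistent with linguistic intuition.

Details

Motivation: Black-box deep language models are hard to explain in deployment because their internal states are inaccessible, and existing methods struggle to satisfy three key requirements at once: inference efficiency, black-box compatibility (without inducing out-of-distribution behavior), and comprehensible explanations grounded in linguistic structure.

Result: Evaluated on several DLM architectures and multiple real-world datasets, the method consistently identifies word subsets with stronger discriminative power and better alignment with linguistically salient cues, outperforming both conventional black-box-compatible methods and gradient-based methods that require gradient access.

Insight: The contribution is to formulate explanation generation as amortized optimization, enabling efficient one-shot inference, and to integrate graph-structured knowledge to make explanations more linguistically coherent and comprehensible, so that explanations stay black-box compatible while aligning better with human cognition.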

Abstract: As deep language models (DLMs) are increasingly deployed in high-stakes domains such as healthcare, understanding their decision rationale becomes paramount for ensuring trust, safety, and accountability. However, achieving this vital level of interpretability is particularly challenging when these DLMs operate as black-box systems (e.g., via APIs), where access to internal model states (e.g., parameters, gradients) is restricted. Despite numerous efforts, existing explanation methods often fail to concurrently satisfy three key desiderata: (i) inference-time efficiency, (ii) black-box compatibility without inducing out-of-distribution behavior, and (iii) comprehensible explanations grounded in the input’s linguistic structure. To address these challenges, we propose a method that explains predictions of DLMs by selecting a small, informative subset of input words. We formulate this as an amortized optimization problem, enabling efficient one-shot inference without the need for input-specific search. Our selection policy is trained via REINFORCE-style policy gradients, allowing discrete word selection in a fully gradient-free setting. To enhance interpretability and align with human linguistic intuition, we integrate graph-structured knowledge into this selection process, fostering linguistically coherent subsets that result in explanations both highly informative and cognitively meaningful to end-users. We evaluated our method on diverse DLM architectures and multiple real-world datasets. It consistently identifies word subsets with enhanced discriminative power and stronger alignment with linguistically salient cues, outperforming both conventional black-box compatible methods and gradient-based approaches that are given oracle access to the black-box model’s gradients for a more challenging benchmark. Our code is available here.
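
The REINFORCE-style training mentioned in the abstract can be illustrated with a toy Python sketch. It is deliberately much simpler than the described method: instead of an amortized selector network with graph-structured knowledge, it learns one Bernoulli keep-probability per word for a single sentence, and the "black box" is a stand-in function. The names `train_selector` and `toy_reward`, the sparsity penalty of 0.2, and all hyperparameters are assumptions. What the sketch does show is the key property: the update uses only reward values returned by the black box, never its gradients.

```python
import math
import random
from typing import Callable, List


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def train_selector(words: List[str], reward_fn: Callable[[List[str]], float],
                   steps: int = 2000, lr: float = 0.5, seed: int = 0) -> List[float]:
    """Learn per-word keep probabilities with a score-function (REINFORCE) update."""
    rng = random.Random(seed)
    logits = [0.0] * len(words)
    baseline = 0.0  # running mean of rewards, used to reduce variance
    for _ in range(steps):
        probs = [sigmoid(t) for t in logits]
        mask = [1 if rng.random() < p else 0 for p in probs]
        selected = [w for w, m in zip(words, mask) if m]
        reward = reward_fn(selected)           # one black-box query, no gradients
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        for i in range(len(logits)):
            # d/d(logit) of log Bernoulli(mask | sigmoid(logit)) is (mask - prob)
            logits[i] += lr * advantage * (mask[i] - probs[i])
    return [sigmoid(t) for t in logits]


def toy_reward(selected: List[str]) -> float:
    """Stand-in for a black-box model: the prediction is preserved only if
    'terrible' is kept, and every kept word costs a small sparsity penalty."""
    return (1.0 if "terrible" in selected else 0.0) - 0.2 * len(selected)


words = ["the", "movie", "was", "terrible", "overall"]
probs = train_selector(words, toy_reward)
```

After training, the keep-probability of "terrible" should dominate the others, since it is the only word whose inclusion raises the reward by more than the penalty. An amortized version would replace the per-sentence `logits` with the output of a shared network, so that a new input needs only one forward pass rather than a fresh search.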


[218] Artificial Intelligence for Mathematical Reasoning: An Integrated Survey of Language Models, Neuro-symbolic Systems, and Verified Discovery cs.AI | cs.CL | cs.CV | cs.LGPDF

Syed Rifat Raiyan, Mohsinul Kabir, Hasan Mahmud, Md Kamrul Hasan

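TL;DR: This paper is a comprehensive survey of AI for mathematical reasoning, tracing the field from early rule-based math word problem solvers to contemporary large language models, neuro-symbolic theorem provers, and verified discovery workflows. It organizes the literature along four axes (informal reasoning over text and diagrams, formal reasoning in proof assistants, mathematical discovery, and inference- and training-time techniques) and systematically reviews major benchmarks, common failure modes, and future directions.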

Details

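Motivation: Mathematical reasoning has long been regarded as a stringent test of machine intelligence, and over the past decade it has grown from a niche problem within NLP into one of the most consequential AI frontiers. The survey aims to give a unified account of the field's evolution from early systems to contemporary models and to critically assess current progress and challenges.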

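Result: As a survey, the paper proposes no new model or method and reports no quantitative performance results. It catalogs progress across benchmarks (grade-school arithmetic, competition mathematics, geometry, formal proving, multimodal and multilingual reasoning), examines benchmark saturation and contamination, and distinguishes between evaluation metrics such as pass@1, majority voting, and verifier-assisted pass@k.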

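Insight: The contribution is an integrated analytical framework that organizes mathematical reasoning along four key axes. Its central observation is the trend toward techniques that connect generation with verification (chain-of-thought prompting, tool use, process reward models). It identifies future directions centered on verified-discovery workflows, reasoning efficiency, and infrastructure that makes AI-assisted formalization broadly usable.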

Abstract: Mathematical reasoning has long served as a stringent test of machine intelligence; over the past decade, it has moved from a niche problem within NLP to one of the most consequential AI frontiers. This survey provides a unified account of the field’s evolution, from early rule-based math word problem (MWP) solvers and template-driven geometry systems, through neural expression generation and LLM prompting, to contemporary reasoning models, multi-agent systems, neuro-symbolic theorem provers, and verified discovery workflows. We organize the landscape along four axes: (i) informal reasoning over text and diagrams, spanning MWP solving, multimodal geometry, and VLMs; (ii) formal reasoning in proof assistants, including autoformalization, tactic prediction, compiler-guided repair, and proof search; (iii) mathematical discovery, where systems propose constructions, improve bounds, or assist attacks on open problems; and (iv) the inference and training-time techniques, including CoT prompting, tool use, process reward models, and RLVR, that increasingly connect generation with verification. We catalog major benchmarks across grade-school arithmetic, competition mathematics, geometry, formal proving, multimodal and multilingual reasoning, and expert evaluation, and we examine benchmark saturation, contamination, reporting mismatches, and the distinction between pass@1, majority voting, and verifier-assisted pass@$k$. We critically assess failure modes: brittleness under perturbation, reward hacking, multimodal grounding failures, fragile formalization, and the energy cost of reasoning-scale inference. Drawing on recent perspectives from working mathematicians, we identify future directions centered on verified-discovery workflows, reasoning efficiency, and infrastructure to make AI-assisted formalization broadly usable. Companion materials: https://github.com/Starscream-11813/awesome-AI4Math.
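
The distinction the abstract draws between pass@1, majority voting, and pass@$k$ can be made concrete with a small sketch. It assumes the widely used unbiased pass@$k$ estimator over $n$ samples with $c$ correct; the sample answers below are hypothetical, and verifier-assisted selection is not modeled.

```python
from collections import Counter
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def majority_vote_correct(answers: list[str], reference: str) -> bool:
    """True if the most frequent sampled answer equals the reference."""
    top_answer, _ = Counter(answers).most_common(1)[0]
    return top_answer == reference


# Hypothetical samples for one problem whose reference answer is "42".
samples = ["42", "41", "41", "42", "41", "17", "41", "42"]
n, c = len(samples), samples.count("42")

print(pass_at_k(n, c, 1))                    # 0.375
print(pass_at_k(n, c, 4))                    # about 0.929
print(majority_vote_correct(samples, "42"))  # False: "41" is the modal answer
```

On the same samples the three metrics disagree sharply, which is why the survey flags reporting mismatches: a result quoted as pass@4 is not comparable to one quoted as pass@1 or as majority-vote accuracy.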


[219] Momentum for Reasoning: Dense Intrinsic Signals in Policy Optimization cs.AI | cs.CL | cs.LGPDF

Hao Chen, Zhanming Shen, Liyao Li, Yanyu Chen, Xuhang Zhu

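TL;DR: This paper proposes ISPO (Intrinsic Signal Policy Optimization) to address two structural failure modes of existing reinforcement learning with verifiable rewards (RLVR) methods that rely on a binary outcome reward, such as GRPO: Zero-Advantage Collapse and Hallucinated Certainty. ISPO densifies the reward with intrinsic signals computed from the policy's own conditional probabilities, combining a sequence-level signal with a token-level directional reward to improve large language models on long-chain reasoning tasks.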

Details

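Motivation: Existing GRPO-based RLVR methods rely on a binary outcome reward, which induces two failure modes: Zero-Advantage Collapse (all rollouts in a group share the same outcome, so the gradient vanishes) and Hallucinated Certainty (the model becomes increasingly confident on incorrect rollouts late in training). Both limit effectiveness on complex reasoning tasks.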

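Result: Across three base models and five mathematical reasoning benchmarks, ISPO consistently outperforms competitive baselines, with the largest gains on the hardest benchmarks, where zero-advantage collapse is most frequent. Training-dynamics diagnostics confirm that both failure modes are reduced.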

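Insight: The novelty is densifying the reward with intrinsic signals computed entirely from the policy's own conditional probabilities: a sequence-level signal that measures how informative the thinking trajectory is for the final answer, and a token-level directional reward that penalizes confidently wrong predictions at critical decision tokens. This offers a new approach to structural problems in RLVR.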

Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for eliciting long-chain reasoning in large language models. However, existing methods based on Group Relative Policy Optimization (GRPO) rely on a binary outcome reward, which induces two structural failure modes: Zero-Advantage Collapse, in which all rollouts in a group share the same outcome and the gradient vanishes, and Hallucinated Certainty, in which the model becomes increasingly confident on incorrect rollouts late in training. We address both modes by densifying the reward with intrinsic signals computed entirely from the policy’s own conditional probabilities, and propose ISPO (Intrinsic Signal Policy Optimization), which combines a sequence-level signal measuring how informative the thinking trajectory is for the final answer, with a token-level directional reward whose hallucinated-certainty hinge penalizes confidently-wrong predictions at critical decision tokens. Across three base models and five mathematical reasoning benchmarks, ISPO consistently outperforms competitive baselines, with the largest gains on the hardest benchmarks where zero-advantage collapse is most frequent, and training-dynamics diagnostics confirm that both failure modes are decreased.
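
Zero-Advantage Collapse can be seen directly in the group-relative advantage that GRPO-style methods compute. The following is a minimal sketch, assuming the common normalization (reward minus group mean, divided by group standard deviation) and a hypothetical `eps` stabilizer; it does not implement ISPO's intrinsic signals.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward by its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps is an assumed stabilizer that avoids division by zero.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Binary outcome rewards for three hypothetical rollout groups.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
all_right = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)      # roughly +1 / -1: the group carries a learning signal
print(all_wrong)  # all zeros: no gradient from this group
print(all_right)  # all zeros: no gradient from this group
```

When every rollout in a group is wrong (or every one is right), each advantage is zero and the group contributes nothing to the update, which is the case a denser reward is meant to address.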


[220] Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation cs.AI | cs.CL | cs.CV | cs.LGPDF

Siyuan Liu, Jinyang Wu

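TL;DR: Through an analysis of LLaVA-1.5, this paper finds that vision tokens tend to saturate in the middle layers, whereas text tokens continue to benefit from deep semantic processing. To exploit this modality asymmetry, it proposes the Dual-Path Vision Token Routing (DPVR) framework. Its core instantiation, DPVR-LF, routes vision tokens at the saturation point into a one-layer trainable side branch, runs a text-only forward pass through the deep stack, and re-fuses the visual and textual streams at the final layer. With approximately 3% trainable parameters, the method preserves competitive multimodal performance on standard benchmarks while reducing visual computation in the deep Transformer stack.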

Details

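Motivation: The paper addresses the mismatch between architectural symmetry and depth-asynchronous modality evolution in multimodal large language models (MLLMs). Existing models typically apply the same deep, symmetric Transformer computation to image and language tokens, overlooking substantial differences between the two in information density, redundancy, and required reasoning depth. This results in redundant visual computation and possible drift in perceptual representations during deep task-specific adaptation.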

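Result: On standard benchmarks, DPVR-LF preserves competitive multimodal performance while reducing visual computation in the deep stack, using approximately 3% trainable parameters. The results challenge the conventional assumption that vision tokens must traverse all deep language-model layers.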

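Insight: The core contribution is a modality-asymmetric routing framework (DPVR), and in particular its DPVR-LF instantiation, which identifies the saturation point of vision tokens and diverts them into a lightweight side branch for efficient vision-language fusion. The findings indicate that for LLaVA-style MLLMs a single late fusion layer can be sufficient to maintain strong perceptual competence, which suggests a route to more efficient, lower-cost multimodal models.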

Abstract: Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling, and apply the same computation uniformly to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens differ substantially in information density, redundancy, and required reasoning depth. Through a layer-wise analysis of LLaVA-1.5, we observe that vision tokens tend to saturate in the middle layers. Specifically, text-to-image attention decreases from 0.68 at layer 0 to 0.07 by layer 4, and stabilizes near 0.04 after layer 18, whereas text tokens continue to benefit from deep semantic processing. These findings suggest a mismatch between architectural symmetry and depth-asynchronous modality evolution, resulting in redundant visual computation and possible drift in perceptual representations during deep task-specific adaptation. Motivated by this, we propose Dual-Path Vision Token Routing (DPVR), a modality-asymmetric routing framework for efficient MLLMs. Its core instantiation, DPVR-LF (Late-Layer Fusion), routes vision tokens at the saturation point into a one-layer trainable side branch, runs a thirteen-layer text-only forward that skips image positions in the deep stack, and re-fuses the visual and textual streams only at the final layer. With approximately 3% trainable parameters, DPVR-LF preserves competitive multimodal performance on standard benchmarks while reducing visual computation in the deep Transformer stack. The results challenge the conventional assumption that vision tokens must traverse all deep language-model layers, and indicate that a single late fusion layer can be sufficient for maintaining strong perceptual competence in LLaVA-style MLLMs.


[221] OmniMem: Perturbation-aware Memory Compression for Streaming Audio-Visual LLMs cs.AI | cs.CV | cs.SD | eess.ASPDF

Guangzhi Sun, Yixuan Li, Yudong Yang, Chao Zhang

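TL;DR: OmniMem is a memory-efficient streaming framework designed for audio-visual large language models. It uses modality-aware memory allocation and perturbation-aware memory selection to address the linear growth of video tokens and KV caches in long-video inference, producing compact memory while preserving long-range understanding.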

Details

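Motivation: Audio-visual LLMs hold strong promise for long-form video understanding, but their long-video inference is limited by the linear growth of video tokens and KV caches. Existing compression methods treat all tokens uniformly and cannot handle the severe token imbalance between the visual and audio modalities.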

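Result: On VideoMME Long, LVBench, and LVOmniBench with video-SALMONN 2+ and Qwen-2.5-Omni, OmniMem improves over strong training-free compression baselines by 2-4% absolute accuracy under the same memory budgets, with an additional 1-2% gain after fine-tuning.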

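Insight: The contributions are a modality-aware memory allocation strategy (managing visual and audio contexts separately), perturbation-aware memory selection (retaining informative, non-redundant KV states), and budget-aware fine-tuning that strengthens compression under realistic deployment constraints.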

Abstract: Audio-visual large language models (LLMs) hold strong promise for long-form video understanding, yet their long-video inference is fundamentally limited by the linear growth of video tokens and key-value (KV) caches. We present OmniMem, a memory-efficient streaming framework designed specifically for audio-visual LLMs. Unlike existing compression methods that treat all tokens uniformly, OmniMem introduces a modality-aware memory allocation strategy that separately manages visual and audio contexts, addressing the severe token imbalance between the two modalities. OmniMem further preserves informative and non-redundant KV states through perturbation-aware memory selection, enabling compact memory without sacrificing long-range understanding. To strengthen compression under realistic deployment constraints, we also explore budget-aware fine-tuning, which encourages the model to consolidate useful information into retained memory. Experiments on VideoMME Long, LVBench, and LVOmniBench with video-SALMONN 2+ and Qwen-2.5-Omni show that OmniMem consistently improves over strong training-free compression baselines by 2-4% absolute accuracy under the same memory budgets, with an additional 1-2% gain after fine-tuning.


[222] Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory cs.AI | cs.CLPDF

Haoran Sun, Wenjie Li, Yujie Zhang, Zekai Lin, Fanrui Zhang

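TL;DR: This paper proposes SkeMex, a self-evolving medical agent framework built on skill-based memory. It distills interaction trajectories into structured skills to improve experience reuse in clinical decision making without updating model weights. Skills are organized in a multi-branch repository, memory utility is estimated from environment feedback, and a closed-loop Read–Write–Assess–Govern lifecycle continually improves agent performance.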

Details

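Motivation: Existing medical agent systems rely on raw historical traces as memory in dynamic clinical decision making. These traces are redundant, noisy, and difficult to govern, and existing mechanisms rarely distinguish which memories are truly useful for future reasoning. This limits the accumulation of compact, reliable experience for long-horizon clinical reasoning.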

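Result: Experiments across diverse clinical tasks show that SkeMex consistently outperforms representative memory-based agents in both offline and online settings, generalizes across model backbones, and supports transferable skill memory.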

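Insight: The novelty is distilling interaction trajectories into structured skills that encode reusable procedural knowledge, and using context-dependent utility estimated from environment feedback to guide memory retrieval and governance. This enables self-evolution without weight updates and gives medical agents a compact, governable way to accumulate experience.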

Abstract: Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering. In such settings, effective agents must reuse prior experience across evolving cases, yet existing memory mechanisms often retain raw historical traces that are redundant, noisy, and difficult to govern. More importantly, they rarely distinguish which memories are truly useful for future reasoning. This limits their ability to accumulate compact and reliable experience for long-horizon clinical reasoning. To close this gap, we propose SkeMex, a post-deployment self-evolution framework that improves medical agents through a skill-based memory without updating model weights. SkeMex distills informative interaction trajectories into structured skills that encode reusable procedural knowledge, and organizes them into a multi-branch repository spanning general, task-specific, and action-level experience. To determine which memories should be reused and retained, SkeMex estimates context-dependent utility from environment feedback and uses it to guide value-aware retrieval and repository governance. A closed-loop “Read–Write–Assess–Govern” lifecycle further supports continual evolution by writing new skills, updating utilities, promoting useful memories, and removing harmful entries. Experiments across diverse clinical tasks show that SkeMex consistently outperforms representative memory-based agents in both offline and online settings. It also generalizes across model backbones and supports transferable skill memory. All data and code will be released publicly.


[223] Capacity, Not Format: Rethinking Structured Reasoning Failures cs.AI | cs.CLPDF

Hengxin Fan

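TL;DR: This paper re-examines how structured output affects the reasoning ability of large language models, arguing that it is not always a "reasoning tax" but depends on a model's spare capacity. Using information-matched prose controls and a four-level schema complexity gradient, the study separates format-specific effects from prompt-length confounds and validates the findings across multiple models and benchmarks. Structured formats turn out to be capacity-dependent: models with sufficient headroom absorb JSON constraints without degradation, while models near their limits degrade severely through truncation or pure capacity competition.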

Details

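Motivation: Prior work treats structured output as a "reasoning tax", but the author argues that this framing is incomplete because the cost of formatting depends strongly on a model's spare capacity. The paper aims to separate the effect of structured formats from confounds such as prompt length, in order to explain structured reasoning failures more accurately.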

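Result: On MATH-Hard, Sonnet, which has sufficient headroom, performs comparably with JSON (88.7±4.0%) and chain-of-thought (89.3±1.7%). Models near their limits degrade: under standard token budgets Haiku drops 36.2 percentage points (p < 0.0001), largely due to truncation, and GPT-4o-mini drops 28.0 percentage points (p < 0.001) even with extended budgets that eliminate truncation, which reflects pure capacity competition. On AIME competition math, Opus 4.7 drops from 96.2% to 91.0% under JSON (about 5.3 percentage points). A delayed-structure ablation (reasoning freely before formatting) recovers most of the lost accuracy (3-run mean: 80-87%).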

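Insight: The core contribution is the claim that the effect of structured output on performance is capacity-dependent rather than an inherent tax. Carefully designed controls (information-matched prose, a schema complexity gradient) separate format effects from prompt-length confounds and reveal two distinct degradation mechanisms: truncation and pure capacity competition. The practical implication is not to avoid structured output but to match it to model capacity; when a model is near its limits, think first and format later.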

Abstract: Prior work treats structured output as a reasoning tax, but this framing is incomplete: the cost of formatting depends strongly on a model’s spare capacity. Using information-matched prose controls and a four-level schema complexity gradient, we separate format-specific effects from prompt-length confounds across 4 models and 5 benchmarks with 0% parse failures on successfully generated responses. We find that structured formats are capacity-dependent. Models with sufficient headroom absorb JSON constraints without degradation (Sonnet: $88.7\pm4.0$% JSON vs. $89.3\pm1.7$% CoT on MATH-Hard). In contrast, formats severely degrade models operating near their limits through two distinct mechanisms. First, under standard token budgets, Haiku drops 36.2pp ($p < 0.0001$) largely due to truncation. Second, even with extended budgets eliminating truncation, GPT-4o-mini drops 28.0pp ($p < 0.001$), revealing pure capacity competition independent of token exhaustion. This format penalty scales with schema complexity (McNemar $p < 0.0001$) and cannot be explained by prompt length alone. Furthermore, these results qualify claims of frontier model immunity: on AIME competition math, Opus 4.7 drops from 96.2% to 91.0% under JSON ($-5.3$pp; the displayed percentages are independently rounded, exact difference is $7/133 = 5.26$pp $\approx 5.3$pp). A delayed-structure ablation – reasoning freely before formatting – recovers most of the lost accuracy (3-run mean: 80–87%), supporting the capacity competition mechanism. The practical implication is not to avoid structured output, but to match it to capacity: when a model is near its limits, think first, format later.
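
The rounding note on the AIME result can be checked with a few lines of arithmetic. The abstract states only the denominator (133) and the difference (7 problems); the sketch below infers the underlying counts as the integers out of 133 whose percentages round to the reported values, so those counts are an assumption rather than figures from the paper.

```python
TOTAL = 133  # denominator stated in the abstract


def pct(count: int, total: int = TOTAL) -> float:
    """Percentage of `total` represented by `count`."""
    return 100.0 * count / total


def counts_rounding_to(target: str) -> list[int]:
    """All integer counts whose percentage prints as `target` at one decimal."""
    return [c for c in range(TOTAL + 1) if f"{pct(c):.1f}" == target]


free_form = counts_rounding_to("96.2")  # inferred, not reported
json_mode = counts_rounding_to("91.0")  # inferred, not reported

print(free_form, json_mode)                         # [128] [121]
print(f"{pct(free_form[0] - json_mode[0]):.2f}pp")  # 5.26pp
print(f"{96.2 - 91.0:.1f}pp")                       # 5.2pp from the rounded figures
```

Subtracting the displayed percentages gives 5.2pp, while the exact difference 7/133 rounds to 5.3pp, which is the discrepancy the parenthetical in the abstract explains.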


[224] TABVERSE: Benchmarking Cross-Format Table Understanding in LLMs and VLMs cs.AI | cs.CL | cs.IRPDF

Momina Ahsan, Sarfraz Ahmad, Ming Shan Hee, Roy Ka-Wei Lee, Preslav Nakov

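TL;DR: TABVERSE is a benchmark for evaluating cross-format table understanding in large language models (LLMs) and vision-language models (VLMs). It aligns the same table content across multiple structural formats (such as HTML, Markdown, and LaTeX) and rendered images, with question category and difficulty tags, so that the effect of table representation on model performance can be isolated and evaluated systematically.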

Details

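Motivation: In current evaluations of LLMs and VLMs on table reasoning, table content, format, layout, and modality often vary together, making it difficult to isolate the effect of the representation itself. The paper fills this gap by studying the role of table representation in model understanding.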

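Result: Evaluation on three tasks, Question Answering (QA), Structural Understanding Capability (SUC), and Structure Reconstruction (SR), shows that representation choice substantially affects performance. Models generally perform better with structured text than with rendered images, with HTML often the most robust text format, but the size of the gap depends on the task, model, and format. Row-sensitive structural tasks and syntactically usable LaTeX reconstruction remain challenging.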

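Insight: The core contribution is TABVERSE, a controlled multimodal table benchmark that isolates and quantifies the effect of table representation on model understanding. The study shows that table representation is a key factor in reliable evaluation and that HTML is often the most robust text format, which is useful guidance for future model evaluation and design.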

Abstract: Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly evaluated on table reasoning tasks, but the role of table representation remains under-explored. In practice, the same table content may appear in different structural formats, such as HTML, Markdown, and LaTeX, or as rendered images. However, existing evaluations often let content, format, layout, and modality vary together, making it difficult to isolate representation effects. We introduce TABVERSE, a controlled multimodal table benchmark that aligns the same table content across multiple structural formats and rendered images, with question category and difficulty tags. This design enables systematic evaluation of representation effects while holding table content fixed. We evaluate LLMs and VLMs across three tasks: Question Answering (QA), Structural Understanding Capability (SUC), and Structure Reconstruction (SR). Our results show that representation choice substantially affects table understanding. Models generally perform better with structured text than with rendered images, but the size of this gap depends on the task, model, and format. HTML is often the most robust text format, while row-sensitive structural tasks and syntactically usable LaTeX reconstruction remain challenging. These findings show that table representation is a key factor in reliable table evaluation.


[225] SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks cs.AI | cs.CLPDF

Hongcheng Gao, Hailong Qu, Jingyi Tang, Jiahao Wang, Zihao Huang

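TL;DR: This paper introduces SpatialWorld, a unified benchmark for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. It integrates eight heterogeneous simulation backends and contains 760 human-annotated tasks in which agents must actively explore and make decisions under vision-only partial observability. An evaluation of 15 advanced agents shows that the strongest model, GPT-5, reaches an average task success rate of only 17.4%, indicating that robust spatial task solving remains challenging.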

Details

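Motivation: Existing benchmarks rely mainly on passive evaluation (e.g., static VQA) or simulator-specific pipelines, and fail to assess general interactive spatial understanding.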

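Result: On SpatialWorld, the strongest model, GPT-5, achieves an average task success rate of 17.4%, and the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis reveals a clear mismatch between task success and execution efficiency, as well as substantial domain-specific performance variation.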

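Insight: The novelty is a unified, simulator-agnostic protocol for evaluating interactive spatial understanding, together with a reliable evaluation design in which each task has a human-validated initial state, a reference trajectory, and a terminal-state verifier. This provides a rigorous testbed for the bottlenecks in active exploration and long-horizon planning.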

Abstract: Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.


[226] ZIPP: Zero-shot Image Personalization from Personas cs.AI | cs.CVPDF

Harini SI, Somesh Singh, Yaman Kumar Singla, David Doermann, Rajiv Ratn Shah

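TL;DR: This paper proposes ZIPP, a zero-shot personalized image generation framework conditioned on natural-language personas, requiring no user-specific data or model fine-tuning. An LLM rewrites prompts from the perspective of a given persona to steer diffusion models toward personalized images, and personas are mined from large-scale user interaction data with a graph neural network.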

Details

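Motivation: Outputs of existing text-to-image diffusion models are impersonal and do not adapt to individuals' pluralistic aesthetic preferences. Existing methods depend on dense interaction histories or per-user fine-tuning and therefore fail in cold-start settings.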

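Result: The paper introduces ZIPBench, a benchmark with 1.5K users. Across four benchmarks, persona conditioning yields gains of 13-20%. In the few-shot setting, ZIPP matches or exceeds fine-tuned baselines trained on 100+ examples per user. Human evaluation shows a 79% win rate over generic generation and 58-65% over all fine-tuned baselines.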

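Insight: The novelty is abstracting user preferences into natural-language personas that guide zero-shot personalization, and combining a graph neural network with a multimodal LLM to mine and verbalize personas automatically. This addresses the cold-start problem and reduces demographic bias.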

Abstract: Text-to-image diffusion models are increasingly deployed in open-ended creative contexts, yet their outputs remain impersonal, optimized for aggregate aesthetics rather than individual taste. Human preferences are pluralistic: one user favoring muted, nostalgic portraits may prefer vibrant street photography, while another gravitates toward dreamy film aesthetics. Existing methods require dense interaction histories or per-user fine-tuning, failing in cold-start settings and collapsing context-dependent preferences into a static representation. We introduce zero-shot image personalization from personas (ZIPP), which conditions image generation on natural-language personas (concise descriptors of a user’s identity and aesthetic sensibilities) without any user-specific data or weight updates. ZIPP uses an LLM to rewrite prompts from the perspective of a given persona, steering diffusion models toward personalized outputs. To mine personas at scale, we train an inductive Graph Attention Network over a 22M-user Reddit interaction graph with dual contrastive objectives aligning graph structure with visual behavior, then verbalize learned representations into natural-language personas via an MLLM. We introduce ZIPBench, the first zero-shot personalization benchmark with 1.5K users, graph-mined personas, and 40K generated images. Across four benchmarks and 14 LLMs spanning five model families, persona conditioning yields consistent gains (13-20%), with frontier models benefiting most. In the few-shot setting, ZIPP matches or exceeds fine-tuned baselines trained on 100+ examples per user. ZIPP achieves the lowest preference distributional divergence (CMMD 0.16 vs. 0.55), and IPF-normalized demographic evaluation shows it substantially reduces subpopulation bias present in existing methods. Human evaluation confirms a 79% win rate over generic generation and 58-65% over all fine-tuned baselines.


[227] Hybrid E-Assessment in Higher Education: Semi-Automated Grading of Paper-Based Written Examinations cs.AI | cs.CV | cs.CYPDF

Hartwig Grabowski, Michael Canz

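TL;DR: This paper examines the limitations of fully and partially digital e-assessment in summative examinations in higher education and proposes a hybrid approach that retains paper-based, problem-oriented examination tasks while enabling semi-automated grading. Assessment-relevant intermediate results are encoded in a structured answer format, entered by students by hand, and subsequently captured from table fields. The central technical bottleneck is reliable recognition of handwritten characters under realistic examination conditions; vision-capable large language models, combined with a two-pass validation principle and comparison against a solution key, can reduce misclassifications and thereby improve the validity, fairness, and scalability of summative assessment.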

Details

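Motivation: The paper addresses the limitations of fully or partially digital e-assessment in summative examinations in higher education, in particular the didactic narrowing caused by closed question formats and the organizational, technical, and legal constraints that arise with large student cohorts.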

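Result: The abstract reports no quantitative experiments or benchmarks. It states that the proposed approach (vision-capable LLMs, two-pass validation, and comparison against a solution key) can reduce misclassifications and thereby improve the validity, fairness, and scalability of assessment.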

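Insight: The novelty is a hybrid e-assessment framework that combines traditional paper-based, open-ended examinations with semi-automated grading. Its technical core is using vision-capable large language models to recognize handwritten characters under realistic examination conditions, with a structured answer format and a validation mechanism to keep grading accurate. This offers a feasible path to automating large-scale assessment while keeping the didactic benefits of open-ended questions.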

Abstract: This paper examines the limitations of fully digital and partially digital e-assessment approaches in summative examinations in higher education. The analysis focuses on the didactic narrowing caused by closed question formats and on organizational, technical, and legal constraints that become particularly relevant in large student cohorts. As an alternative, the paper proposes a hybrid e-assessment approach that retains paper-based, problem-oriented examination tasks while enabling semi-automated grading. Assessment-relevant intermediate results are encoded in a structured answer format, entered by students by hand, and subsequently captured from table fields. The central technical bottleneck is reliable recognition of handwritten characters under realistic examination conditions. Recent vision-capable large language models, combined with a two-pass validation principle and comparison against a solution key, can reduce misclassifications and thereby improve the validity, fairness, and scalability of summative assessment.


[228] IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation cs.AI | cs.CV | cs.MMPDF

Lingyi Meng, Zecong Tang, Haoran Li, Tengju Ru, Zhejun Cui

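TL;DR: This paper proposes IMUG-Bench, a comprehensive benchmark for evaluating the understanding and generation capabilities of unified multimodal models in interleaved image-text dialogue. It comprises three task classes (Static Spatial, Temporal Causal, and Hybrid) covering 3,113 samples and 12,034 interaction turns, designed to reflect real-world multi-turn interaction. Large-scale experiments on mainstream open-source and closed-source models reveal capability boundaries, failure modes, and exposure bias in generation, and the paper explores test-time scaling strategies, including Chain-of-Thought, Self-Verification, and Best-of-N Sampling, to improve performance.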

Details

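Motivation: Existing benchmarks for unified multimodal models are often limited to single-turn or static settings and overlook exposure bias in multi-turn interactions, so they cannot evaluate dynamic, multi-turn interleaved image-text dialogue, a crucial real-world task.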

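Result: A systematic evaluation of mainstream unified multimodal models on IMUG-Bench reveals their capability boundaries and failure modes in multi-turn interaction and uncovers pronounced exposure bias on the generation side. Test-time scaling strategies (Chain-of-Thought, Self-Verification, and Best-of-N Sampling) improve generation accuracy and mitigate exposure bias.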

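Insight: The contribution is a comprehensive benchmark focused on multi-turn interleaved image-text dialogue, together with a systematic analysis of model limitations in this setting, such as exposure bias. Beyond providing an evaluation tool, the work explores test-time strategies for improving performance, offering insights for more robust multi-turn interaction in future models.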

Abstract: In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation within a single framework. Mastering dynamic, multi-turn interleaved image-text dialogues is a crucial task for UMMs in real-world applications. However, existing benchmarks fail to evaluate this important task, as they are often limited to single-turn or static settings, and typically overlook exposure bias in multi-turn interactions. To bridge this gap, we propose IMUG-Bench, a comprehensive benchmark for multi-turn interleaved image-text dialogue of UMMs that jointly evaluates their understanding and generation capabilities. Our IMUG-Bench comprises three classes: Static Spatial, Temporal Causal, and Hybrid, covering 3,113 samples and 12,034 interaction turns. It also includes dynamic understanding questions, thereby supporting evaluation that better reflects real-world multi-turn interaction scenarios. Large-scale experiments on IMUG-Bench systematically evaluate mainstream open-source and closed-source UMMs, revealing their capability boundaries and failure modes, and uncovering pronounced exposure bias on the generation side in multi-turn interactions. We further explore several test-time scaling strategies, including Chain-of-Thought, Self-Verification, and Best-of-N Sampling, which effectively improve generation accuracy and mitigate exposure bias in generation tasks. These findings provide insights into enhancing the robustness and multi-turn interaction capability of future UMMs.


cs.LG [Back]

[229] ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research cs.LG | cs.AI | cs.CLPDF

Wanghan Xu, Shuo Li, Tianlin Ye, Qinglong Cao, Yixin Chen

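TL;DR: This paper presents ResearchClawBench, a benchmark for evaluating end-to-end autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria to evaluate re-discovery of the target paper.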

Details

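Motivation: AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify, so a reliable benchmark is needed.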

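Result: Seven autonomous research agents are evaluated under a unified protocol, and seventeen native LLMs through the lightweight ResearchHarness. The strongest autonomous agent, Claude Code, averages 21.5; the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7; and the LLM frontier mean is only 26.5, indicating that current systems remain far from reliable re-discovery. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core.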

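Insight: The novelty is a reproducible benchmark grounded in real papers, with multi-domain tasks and raw data, and expert-curated multimodal rubrics that quantify autonomous research capability. It provides a standardized evaluation frontier for measuring progress toward autonomous scientific research.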

Abstract: AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.


[230] Strained Coherence: A Pre-Failure Signal in Coding Agent Execution Trajectories cs.LG | cs.AI | cs.CLPDF

Marut Pandya, Kasey Zhang, Baiqing Lyu

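TL;DR: This paper introduces "strained coherence", a pre-failure signal describing a pattern in which an LLM-based coding agent acknowledges a problem in its own reasoning and then proceeds anyway. The authors build a Claude Sonnet 4.6 judge that detects the pattern in execution trajectories and evaluate it on Terminal-bench-2 trajectories, finding that flagged trajectories fail at a very high rate.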

Details

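Motivation: The goal is to identify and define a specific safety-relevant failure mode of LLM coding agents, in which the agent has information that should change its behavior, states that information, and still acts against it. Detecting it gives early warning of likely failure.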

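Result: On 44 trajectories with a Qwen3.5-35B-A3B backbone, flagged trajectories fail 94% of the time versus 46% for unflagged ones, a 47-point gap (Fisher's exact p = 0.003). At matched selectivity, the detector reaches 94% precision versus 88% for a lexical discourse-marker baseline. A replication on Gemma4-31B is directionally consistent but not significant (20-point gap, p = 0.31), with the attenuation driven largely by trajectories with no think content.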

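Insight: The novelty is an operational definition of "strained coherence" as a pre-failure signal and an interpretable span-level detector that outputs what the agent acknowledged and then ignored. The binary flag does not hinge on simple lexical markers: it survives paraphrases that soften explicit conflict markers, offering a new view into agent decision failures.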

Abstract: LLM-based coding agents sometimes acknowledge a problem in their own reasoning and then proceed anyway. We call this pattern strained coherence: a safety-relevant failure mode in which an agent has information that should change its behavior, states that information, and still acts against it. The pattern overlaps with verbalized reward hacking, where an agent names a tension between a task proxy and the underlying goal yet optimizes the proxy anyway. We give an operational definition, build a Claude Sonnet 4.6 judge that reads full trajectories and flags spans where the pattern occurs, and evaluate it on 44 Terminal-bench-2 trajectories using a Qwen3.5-35B-A3B backbone. Flagged trajectories fail 94% of the time versus 46% for unflagged trajectories (47-point gap, Fisher’s exact p = 0.003; 46 points after excluding three prompt-embedded examples, p = 0.006). At matched selectivity, the detector reaches 94% precision versus 88% for a lexical discourse-marker baseline; the 10-trajectory intersection of the two methods has a 100% failure rate (Clopper-Pearson 95% CI [69%, 100%]). We replicate on Gemma4-31B with 43 trajectories: the overall signal is directionally consistent but not significant (20-point gap, p = 0.31), with attenuation driven largely by 13 trajectories with zero think content, where the detector has no substrate to analyze. In the high-verbosity Gemma tertile, the gap is +30 points; in the mid- and high-verbosity Qwen tertiles, it is +40 points each. The first flag appears at a median of 83-84% of elapsed trajectory time across both models, and the binary flag survives paraphrases that soften explicit conflict markers (8/8 trajectories). Unlike univariate predictors, the detector emits interpretable span-level output – quoted acknowledgment, quoted action, and typed conflict – showing what the agent saw and ignored.
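
The Clopper-Pearson interval quoted for the 10-trajectory intersection can be reproduced with a short exact computation. This is a minimal sketch using bisection on the binomial tail; the function names are illustrative and not taken from the paper.

```python
from math import comb


def binom_tail_ge(x: int, n: int, p: float) -> float:
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(x, n + 1))


def clopper_pearson(x: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) interval for a binomial proportion."""

    def solve(tail, target: float) -> float:
        lo, hi = 0.0, 1.0
        for _ in range(100):  # bisection; tail(p) is increasing in p
            mid = (lo + hi) / 2
            if tail(mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P(X >= x | p) = alpha / 2.
    lower = 0.0 if x == 0 else solve(lambda p: binom_tail_ge(x, n, p), alpha / 2)
    # Upper bound: P(X <= x | p) = alpha / 2, i.e. P(X >= x + 1 | p) = 1 - alpha / 2.
    upper = 1.0 if x == n else solve(lambda p: binom_tail_ge(x + 1, n, p), 1 - alpha / 2)
    return lower, upper


lower, upper = clopper_pearson(10, 10)  # 10 failures out of 10 trajectories
print(f"[{lower:.0%}, {upper:.0%}]")    # [69%, 100%]
```

With all 10 trajectories failing, the lower bound reduces to the closed form $0.025^{1/10} \approx 0.69$, so a 100% observed failure rate on 10 samples is still compatible with a true rate as low as about 69%.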


[231] ConSteer-RL: Steering Reasoning Capabilities in Large Language Models via Confidence-Aware Reinforcement Learning cs.LG | cs.CLPDF

Qing Miao, Yiming Zhao, Jing Yang, Chenxi Liu, Yuehai Chen

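TL;DR: This paper proposes ConSteer-RL, a framework that steers the reasoning of large language models by integrating token-level confidence signals derived from model log-probabilities into RLVR training. Building on GRPO, it constructs a confidence-aware reward that penalizes overconfident errors and reinforces correct, confident reasoning.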

Details

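Motivation: Reinforcement learning from verifiable rewards (RLVR) for improving LLM reasoning is limited by sparse binary rewards and ignores model-internal uncertainty.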

Result: ConSteer-RL consistently outperforms strong GRPO baselines, with average improvements of 2.3%-4.0% across different model scales.

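Insight: The novelty is incorporating model-internal confidence (token-level probabilities) into the reinforcement learning reward as a continuous signal, giving fine-grained guidance on uncertainty in the reasoning process. It is a general recipe for combining a model's internal state (uncertainty) with an external reinforcement learning objective.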

Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has recently become a key paradigm for improving the reasoning abilities of Large Language Models (LLMs), yet it remains limited by sparse binary rewards and its ignorance of model-internal uncertainty. In this paper, we propose ConSteer-RL, a simple yet effective framework that integrates token-level confidence signals derived from model log-probabilities into RLVR training. Specifically, building upon the Group Relative Policy Optimization (GRPO) framework, we construct a confidence-aware reward by aggregating per-token probabilities into a scalar confidence score and incorporating it into an awareness-based reward shaping mechanism that penalizes overconfident errors while reinforcing correct and confident reasoning. Experimental results demonstrate that ConSteer-RL consistently outperforms strong GRPO baselines, achieving average improvements of 2.3%-4.0% across different model scales.
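
The abstract does not give the reward formula, so the following is only an illustrative sketch of the general idea. It assumes the scalar confidence is the geometric mean of token probabilities and that shaping adds or subtracts a weighted confidence term; the function names, the `weight` parameter, and both choices are hypothetical and are not ConSteer-RL's actual definition.

```python
import math


def sequence_confidence(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability: exp of the mean log-probability."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def confidence_aware_reward(
    is_correct: bool, token_logprobs: list[float], weight: float = 0.5
) -> float:
    """Hypothetical shaping: bonus for confident-correct, penalty for confident-wrong."""
    confidence = sequence_confidence(token_logprobs)
    base = 1.0 if is_correct else 0.0  # binary verifiable reward
    shaping = weight * confidence
    return base + shaping if is_correct else base - shaping


confident = [math.log(0.9)] * 4  # hypothetical rollout with high token probabilities
unsure = [math.log(0.5)] * 4     # hypothetical rollout with low token probabilities

for correct in (True, False):
    for name, logprobs in (("confident", confident), ("unsure", unsure)):
        print(correct, name, round(confidence_aware_reward(correct, logprobs), 3))
```

Under these assumptions the four cases rank as confident-correct, unsure-correct, unsure-wrong, confident-wrong, so two wrong rollouts no longer receive identical rewards and overconfident errors are penalized most.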


[232] Intrinsic Selection and Particle Resampling for Inference-Time Scaling Beyond Domain Verifiability cs.LG | cs.AI | cs.CL | stat.MLPDF

Giorgio Giannone, Mustafa Eyceoz, Shabana Baig, Shivchander Sudalairaj, Anna C. Doris

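TL;DR: This paper proposes inference-time scaling (ITS) methods that need no external verification: intrinsic statistics of parallel sample sets (such as length-adjusted tail entropy) are used to assess solution quality and serve as a difficulty gate for adaptive compute allocation. The pipeline comprises post-hoc candidate ranking (iS), step-level resampling (iPF), and distillation with privileged guidance (dPF), and improves performance on open-ended tasks spanning math, engineering design, and clinical responses.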

Details

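Motivation: Existing ITS methods work well in verifiable domains such as math and coding, but for tasks prone to systematic failure (driven by faulty initial assumptions or unmet multidimensional constraints) they typically rely on costly external solvers or brittle model-based verifiers. The paper aims to extend ITS to open-ended domains without depending on ground truth or trained reward models.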

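Result: iS matches consensus-based algorithms across three domains and improves engineering design selection by 20% over pass@1 baselines; iPF improves pass@1 by 6.1 points on average on hard math problems; dPF yields up to 26.5% gains on complex clinical responses. The pipeline applies to broad-purpose, domain-specialized, and multimodal architectures.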

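Insight: The novelty is using intrinsic statistics of sample sets (such as tail entropy) as a quality signal that requires no ground truth, and routing problems dynamically on that basis for adaptive compute allocation. Step-level resampling and privileged guidance further steer generation past systematic reasoning errors, an idea applicable to open-ended reasoning tasks.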

Abstract: Inference-Time Scaling (ITS) has largely succeeded in verifiable domains like math and coding, where cheap verification enables scalable output selection. However, extending ITS to tasks prone to systematic failure - driven by faulty initial assumptions or unmet multidimensional constraints - typically relies on costly external solvers or brittle, model-based verifiers. Our key insight is that the intrinsic statistics of parallel sample sets, specifically length-adjusted tail entropy, provide a robust discriminative signal for solution quality without access to ground truth. Crucially, these statistics serve as a difficulty gate for adaptive compute allocation, dynamically routing problems across scaling regimes. First, Intrinsic Selection (iS) ranks candidates post-hoc, matching consensus-based algorithms across three domains and improving engineering design selection by 20% over pass@1 baselines. Second, Intrinsic Particle Filtering (iPF) generalizes this to step-level resampling, guiding generation toward high-confidence reasoning trajectories to improve pass@1 by 6.1 points on average on hard math problems. Finally, Particle Distillation (dPF) injects privileged guidance via early logit blending and KL-guided resampling, steering generation past systematic reasoning errors to satisfy expert rubrics, yielding up to 26.5% gains on complex clinical responses. Our pipeline applies seamlessly across broad-purpose, domain-specialized, and multimodal architectures, successfully extending ITS to open-ended domains without requiring trained reward models or exact ground-truth verification.


[233] sGPO: Trading Inference FLOPs for Training Efficiency in RLVR cs.LG | cs.AI | cs.CL | stat.ML

Shivchander Sudalairaj, Kai Xu, Akash Srivastava, Giorgio Giannone

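TL;DR: This paper proposes sorted Group Policy Optimization (sGPO), a compute-efficient training strategy for reinforcement learning with verifiable rewards (RLVR). It spends a small amount of inference compute to estimate each query's difficulty up front, then allocates the training rollout budget accordingly, sharply reducing wasted training compute.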

Details

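Motivation: Standard RLVR training allocates a fixed rollout budget to every query, ignoring how difficult each query is for the current policy. Much training compute is therefore wasted on easy queries and unsolvable queries, neither of which yields a useful learning gradient.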

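Result: With the upfront inference profiling cost included, sGPO reduces total training compute by a factor of three while matching or exceeding baseline performance.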

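Insight: The core idea is to use cheap offline inference compute as a proxy for query difficulty, and to drive data filtering, adaptive group size allocation, and curriculum construction (scheduling queries from easy to hard) from it, maximizing the sample efficiency of each generated rollout.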

Abstract: Standard Reinforcement Learning with Verifiable Rewards (RLVR) training allocates a fixed rollout budget to every query, without regard for what each query’s difficulty means for the current policy. This leads to two symmetric failure modes: easy queries produce near-zero advantage because the policy already solves them, while unsolvable queries produce no signal because the policy never solves them. Both regimes waste training FLOPs without contributing to a learning gradient. We introduce sorted Group Policy Optimization (sGPO), a compute-efficient strategy that trades a small budget of inference FLOPs for a large reduction in wasted training FLOPs. The key insight is that cheap inference compute can serve as a single offline proxy for query difficulty. By generating a small batch of parallel samples per query under the initial policy, we obtain a model-aware empirical success rate. This motivates setting the training rollout group size to the inverse of this success rate, a practical rule that maximizes sample efficiency by extracting the most advantage per generated rollout. This single profiling pass simultaneously drives data filtering (removing trivial queries and sub-sampling unsolvable ones), adaptive group size allocation, and curriculum construction (scheduling queries from easy to hard). sGPO matches or exceeds baseline performance while reducing total training compute by a factor of three, with the upfront inference profiling cost included.
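The allocation rule described in the abstract (group size set to the inverse of the profiled success rate, trivial queries filtered, easy-to-hard ordering) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `min_group` and `max_group` clamps, and the choice to keep never-solved queries at the cap instead of sub-sampling them are all hypothetical.

```python
import math


def profile_success_rate(outcomes):
    """Empirical success rate from a list of 0/1 verifier outcomes."""
    return sum(outcomes) / len(outcomes)


def plan_group_size(success_rate, min_group=2, max_group=64):
    """Return a rollout group size of roughly 1 / success_rate.

    Returns None for trivial queries (always solved), which are filtered out.
    The clamps are illustrative assumptions, not values from the paper.
    """
    if success_rate >= 1.0:
        return None  # trivial query: no learning signal, drop it
    if success_rate <= 0.0:
        return max_group  # never solved while profiling: cap (assumption)
    size = math.ceil(1.0 / success_rate)
    return max(min_group, min(max_group, size))


def build_curriculum(profiles):
    """Map {query_id: [0/1 outcomes]} to [(query_id, group_size)], easy first."""
    plan = []
    for query_id, outcomes in profiles.items():
        rate = profile_success_rate(outcomes)
        size = plan_group_size(rate)
        if size is not None:
            plan.append((query_id, rate, size))
    plan.sort(key=lambda item: -item[1])  # higher success rate means easier
    return [(query_id, size) for query_id, _, size in plan]
```

A single profiling pass feeds all three decisions here: the filter, the group size, and the ordering.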


[234] TRIAGE: Dialectical Reasoning for Explainable Risk Prediction on Irregularly Sampled Medical Time Series with LLMs cs.LG | cs.AI | cs.CL

Hyeongwon Jang, Gyouk Chu, Changhun Kim, Joonhyung Park, Hangyul Yoon

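TL;DR: This paper proposes TRIAGE, a framework that trains a large language model (LLM) to reason dialectically over competing clinical outcomes and generate outcome-specific rationales. It addresses the risk polarization that LLMs exhibit when predicting risk from irregularly sampled medical time series (ISMTS), yielding continuous risk scores and verifiable clinical reasoning.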

Details

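Motivation: Clinical early warning systems built on electronic health records (EHR) must provide calibrated risk scores and rationales that clinicians can verify, but existing LLM approaches tend to collapse graded clinical risk into overconfident binary predictions. This risk polarization undermines calibration and cross-patient comparability.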

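Result: On three ISMTS benchmarks, TRIAGE improves average AUPRC by 3.3% and reduces calibration error by 81% compared with competitive baselines. An LLM-as-a-judge assessment shows that its rationales surpass the baseline's post-hoc explanations by 20% in clinical reasoning quality.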

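Insight: The novelty is a dialectical reasoning framework in which the LLM generates competing rationales for different clinical outcomes. This mitigates risk polarization and lets a single model output continuous risk scores grounded in explicit clinical reasoning, improving both interpretability and calibration.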

Abstract: Clinical early warning systems built on electronic health records, in which clinical observations are recorded as irregularly sampled medical time series (ISMTS), must deliver both calibrated risk scores for patient triage and interpretable rationales that clinicians can verify. Large Language Models (LLMs) have been explored for this task, yet they collapse graded clinical risk into overconfident binary predictions. This risk polarization undermines both calibration and cross-patient comparability. To address this, we propose TRIAGE, a framework that trains an LLM to generate dialectical reasoning over competing clinical outcomes by eliciting outcome-specific rationales. This dialectical formulation mitigates risk polarization, enabling a single LLM to yield continuous risk scores grounded in explicit clinical reasoning. Evaluated on three ISMTS benchmarks, TRIAGE achieves an average AUPRC improvement of 3.3% and reduces calibration error by 81% compared to the competitive baselines. An LLM-as-a-judge assessment further shows that our rationales surpass post-hoc explanations from the baseline by 20% in clinical reasoning quality. The source code is available at https://github.com/HyeongWon-Jang/TRIAGE .


[235] INFUSER: Influence-Guided Self-Evolution Improves Reasoning cs.LG | cs.AI | cs.CL | cs.GT | stat.ML

Siyu Chen, Miao Lu, Beining Wu, Heejune Sheen, Fengzhuo Zhang

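TL;DR: INFUSER is an influence-guided self-evolution framework that improves reasoning by co-evolving a generator and a solver. The generator automatically drafts questions and reference answers from unstructured documents, and the solver trains on them. The generator is rewarded by an optimizer-aware influence score, so that the questions it produces actually improve the solver on the target distribution. The method substantially outperforms existing self-evolution baselines on several benchmarks and demonstrates the framework's flexibility and generalizability.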

Details

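Motivation: Existing self-evolution methods either depend on extensively curated or teacher-generated data or, when the generator runs unsupervised, reward it with a difficulty heuristic, which does not guarantee that the generated questions improve the solver's reasoning.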

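Result: On Qwen3-8B-Base, INFUSER achieves over 20% relative improvement over strong self-evolution baselines on the Olympiad and SuperGPQA benchmarks, and an 8B INFUSER co-evolving generator outperforms a frozen 32B thinking generator on math and coding. Ablations confirm that each design choice is necessary, and extension experiments further demonstrate the framework's flexibility.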

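Insight: The novelty is using an optimizer-aware influence score as the generator's reward, so that generated questions actually improve solver performance. The paper also proposes DuGRPO, a dual-normalized variant of GRPO, to handle the continuous, noisy reward. Together these turn the document pool into an adaptive curriculum that favors questions useful to the current solver rather than merely hard ones.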

Abstract: Self-evolution offers a scalable path to stronger reasoning: a pretrained language model improves itself with only minimal external supervision. Yet existing methods either depend on extensively curated or teacher-generated training data, or, when the generator runs unsupervised, reward it by a difficulty heuristic that need not improve the solver. We introduce INFUSER, an iterative co-training framework with two co-evolving roles: a Generator that drafts questions and reference golden answers from a pool of unstructured, automatically collected documents, and a Solver that improves by training on them. The solver is trained with standard correctness rewards against the generator-provided answers, while the generator is rewarded by an optimizer-aware influence score that measures whether each proposed question would actually improve the solver on the target distribution. Because this continuous, noisy influence score is poorly served by standard GRPO, we propose DuGRPO, a dual-normalized variant of GRPO, for generator training. Together, these turn the document pool into an adaptive curriculum that favors questions useful to the current solver, not just hard ones. On Qwen3-8B-Base, INFUSER outperforms strong self-evolution baselines with over 20% relative improvement on Olympiad and SuperGPQA benchmarks, and an 8B INFUSER co-evolving generator outperforms a frozen 32B thinking generator on math and coding. Ablations confirm each design choice is necessary, and two extensions, applying INFUSER to an instruction-finetuned anchor and augmenting it with rule-verifiable RLVR data, further demonstrate the flexibility and generalizability of the framework. Code is available at https://github.com/FFishy-git/INFUSER.


[236] A Unifying Lens on Reward Uncertainty in RLHF cs.LG | cs.AI | cs.CL

Ely Hahami, Yoel Zimmermann, Ray Zhou, Jack Benarroch Jedlicki

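TL;DR: This paper addresses reward hacking in reinforcement learning from human feedback (RLHF) by proposing pessimism based on a distributional reward model $p(r\mid x,y)$. Under either a Bayesian inference or a KL-distributionally robust optimization (KL-DRO) lens, it derives a closed-form effective reward for the KL-regularized RLHF objective. This expression unifies existing heuristics for reward model ensemble aggregation (mean aggregation, worst-case optimization, and uncertainty-weighted optimization) and clarifies the implicit assumptions behind each rule.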

Details

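Motivation: RLHF is bottlenecked by reward hacking, where the policy exploits errors in a proxy reward model (RM) to obtain high scores without genuine quality gains. Standard scalar RMs provide no principled notion of uncertainty, so a method is needed to penalize rewards in regions where the RM is uncertain.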

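Result: The result is theoretical: the derived effective reward shows that the existing heuristics (mean aggregation, WCO, and UWO) are all limits or truncations of a single expression, which explains them within one framework.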

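Insight: The core idea is to model the reward as a distribution rather than a scalar and, through a KL-DRO or Bayesian lens, to provide a principled framework for handling uncertainty in RLHF that unifies and clarifies the theoretical basis of several existing empirical aggregation rules.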

Abstract: Reinforcement learning from human feedback (RLHF) is bottlenecked by reward hacking, where the policy exploits errors in a proxy reward model (RM) and produces high RM scores without genuine quality gains. A natural mitigation is pessimism: penalizing rewards in regions where the RM is uncertain. However, standard scalar RMs provide no principled notion of uncertainty. We argue that the right object is a distributional reward model $p(r\mid x,y)$. Under either a Bayesian inference or a KL-distributionally robust optimization (KL-DRO) lens, the KL-regularized RLHF objective admits a closed-form effective reward $\tilde r(x,y) = \pm\beta\log\mathbb{E}_p[e^{\pm r/\beta}]$. The pessimistic branch unifies the prior heuristics for RM ensemble aggregation: mean aggregation, worst-case optimization (WCO), and uncertainty-weighted optimization (UWO) all emerge as limits or truncations of this single expression. This also clarifies the implicit assumptions of each existing rule.
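The pessimistic branch of the effective reward (the minus signs in the closed form) can be checked numerically on a small ensemble. The sketch below assumes the distribution $p$ is represented by equally weighted samples, such as the outputs of an RM ensemble; the function name is hypothetical. The pairing of each regime with a named heuristic (large temperature gives the mean, small temperature gives the worst case, a second-order truncation gives mean minus variance over twice the temperature) follows from the standard cumulant expansion of a log-moment-generating function and is offered as a reading of the abstract, not as the paper's own derivation.

```python
import math
from statistics import fmean, pvariance


def pessimistic_reward(samples, beta):
    """Compute -beta * log(mean(exp(-r / beta))) with a stable log-sum-exp."""
    scaled = [-r / beta for r in samples]
    shift = max(scaled)  # subtract the max so exp() cannot overflow
    log_mean_exp = shift + math.log(
        sum(math.exp(s - shift) for s in scaled) / len(scaled)
    )
    return -beta * log_mean_exp


ensemble = [1.0, 2.0, 4.0]  # illustrative reward samples, not real data

large_beta = pessimistic_reward(ensemble, 1e6)    # close to the ensemble mean
small_beta = pessimistic_reward(ensemble, 1e-3)   # close to the ensemble minimum
mid_beta = pessimistic_reward(ensemble, 50.0)     # close to mean - var / (2 * beta)
second_order = fmean(ensemble) - pvariance(ensemble) / (2 * 50.0)
```

Sweeping the temperature therefore interpolates between averaging the ensemble and trusting only its most pessimistic member.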


[237] DiffoR: A Unified Continuous Generative Framework for Universal Ordinal Regression cs.LG | cs.AI | cs.CV

Hongxu Ma, Lin Wang, Chenghou Jin, Han Zhou, Jie Zhang

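TL;DR: This paper proposes a new paradigm that formulates ordinal regression as a continuous generative ordinal regression task, and introduces DiffoR, a unified framework. DiffoR uses diffusion models to recover continuous ordinal values through iterative denoising, enabling dynamic learning of soft semantic transitions.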

Details

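Motivation: Existing ordinal regression methods are typically based on discretized classification or generation and are fundamentally limited by quantization artifacts and a lack of global ordinal topological perception, so they fail to capture the non-stationary semantic transitions inherent to ordinal data.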

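Result: Extensive experiments on 12 benchmarks across four domains show DiffoR's consistent superiority over state-of-the-art methods, demonstrating strong potential as a general-purpose solution for ordinal regression.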

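Insight: The core idea is to recast ordinal regression as a continuous generative task and to propose a Dual-Decoupling Strategy: spatially, Multi-scale Increment Aggregation decomposes targets into hierarchical continuous increments; temporally, Dynamic Denoising Perception synchronizes denoising steps with feature frequencies. This explicitly preserves ordinal topology and enhances representation capability and mechanistic interpretability.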

Abstract: Ordinal Regression (OR) aims to predict target values with inherent order, underpinning critical applications across diverse domains, from recommender systems to computer vision. Though having evolved from naive regression to discretization-based classification and generation, existing paradigms remain fundamentally constrained by quantization artifacts and the lack of global ordinal topological perception. These methods typically enforce rigid boundary delineations, failing to capture the non-stationary semantic transitions inherent to ordinal data. In this paper, we propose a novel paradigm where OR is formulated as a Continuous Generative Ordinal Regression task. Under the novel paradigm, we introduce DiffOR, a unified framework that leverages diffusion models to recover continuous ordinal values via iterative denoising, thereby enabling the dynamic learning of soft semantic transitions. To explicitly preserve ordinal topology, we devise a Dual-Decoupling Strategy: Spatially, Multi-scale Increment Aggregation decomposes targets into hierarchical continuous increments; Temporally, Dynamic Denoising Perception synchronizes denoising steps with feature frequencies, ensuring robust coarse-to-fine refinement. Theoretically, we show that the proposed method can significantly enhance both representation capability and mechanistic interpretability. Extensive experiments on 12 benchmarks across four domains validate DiffOR’s consistent superiority over state-of-the-art methods, establishing a new standard that demonstrates strong potential as a general-purpose solution for universal ordinal regression.


[238] Claw-R1: A Step-Level Data Middleware System for Agentic Reinforcement Learning cs.LG | cs.CL

Daoyu Wang, Mingyue Cheng, Qingchuan Li, Shuo Yu, Jie Ouyang

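TL;DR: This paper presents Claw-R1, an interactive step-level data middleware system for agentic reinforcement learning (agentic RL). It manages the full data lifecycle of agent-environment interactions, from data production to training consumption, and connects heterogeneous agent runtimes with RL training backends through two core components: a Gateway Server and a Data Pool.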

Details

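Motivation: Existing work focuses mainly on policy optimization algorithms and training frameworks and pays less attention to the full data lifecycle of agent-environment interactions, from data production to training consumption. This paper aims to fill that gap, arguing that agent interaction traces should be treated as managed data assets rather than temporary runtime logs.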

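Result: The paper presents Claw-R1 as a demo in which users can interactively inspect live trajectories, examine the state, action, and reward of each step, curate data by quality and readiness, and configure training-ready batches for different downstream RL algorithms.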

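Insight: The novelty is a middleware system focused on data lifecycle management that uniformly captures, organizes, and curates agent interaction data, particularly step-level records. It offers systematic tooling and a perspective on data management for agentic RL, and may encourage the community to pay more attention to data management in this area.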

Abstract: Agentic reinforcement learning (RL) has become an important post-training paradigm for turning LLMs from static chatbots into interactive agents, giving rise to representative applications such as OpenClaw. Existing work mainly focuses on policy optimization algorithms and training frameworks, but pays less attention to the full data lifecycle of agent-environment interactions, from data production to training consumption. To bridge this gap, we present Claw-R1, an interactive step-level data middleware system for agentic RL. Claw-R1 connects heterogeneous agent runtimes with RL training backends through two core components: a Gateway Server and a Data Pool. The Gateway Server captures multi-turn interaction steps through a unified LLM API entry point, while the Data Pool organizes them into step-level records consisting of prompt IDs, response IDs, rewards, and other metadata. In our demo, users can interactively inspect live trajectories, examine the state, action, and reward of each step, curate data by quality and readiness, and configure training-ready batches for different downstream RL algorithms. Overall, Claw-R1 treats agent interaction traces as managed data assets rather than temporary runtime logs. Through this demonstration, we hope to encourage the community to recognize the importance of data management in agentic RL. Our code is available at https://github.com/AgentR1/Claw-R1 and the demonstration video can be found at https://youtu.be/Pw47dAOw6B0.


[239] PBSD: Privileged Bayesian Self-Distillation for Long-Horizon Credit Assignment cs.LG | cs.CL

Yang Tian, Rui Wang, Xumeng Wen, Junjie Li, Shizhao Sun

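TL;DR: This paper proposes PBSD (Privileged Bayesian Self-Distillation) to address credit assignment under sparse rewards in long-horizon agentic tasks. Using Bayes-calibrated self-distillation, it converts sparse trajectory-level outcome rewards into tractable turn-level credit signals that guide policy learning at a finer granularity.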

Details

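Motivation: In long-horizon agentic tasks, outcome-based reinforcement learning faces a fundamental credit assignment challenge: trajectory-level rewards verify final correctness but give little guidance on which intermediate reasoning steps or tool interactions contributed to the outcome. The difficulty is especially pronounced in multi-turn search agents.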

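Result: Experiments show that PBSD consistently improves performance in both in-domain and out-of-domain settings and effectively transfers knowledge from short-context training to long-context inference, suggesting that its fine-grained credit assignment supports more effective policy learning and better generalization.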

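Insight: The novelty is a self-distillation framework based on Bayes' rule that converts the posterior-to-prior probability ratio of the verified answer into a likelihood ratio between a student model and a privileged teacher model, yielding calibrated turn-level credit signals. The scheme is a principled way to transform sparse rewards and is fully compatible with standard policy optimization.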

Abstract: Long-horizon agentic tasks pose a fundamental credit assignment challenge for outcome-based reinforcement learning: trajectory-level rewards verify final correctness but provide limited guidance on which intermediate reasoning steps or tool interactions contribute to the outcome. The difficulty is especially pronounced in multi-turn search agents, where successful trajectories may contain misleading actions and failed trajectories may contain valuable evidence-gathering steps. We propose PBSD (Privileged Bayesian Self-Distillation), a Bayes-calibrated self-distillation method for fine-grained credit assignment under sparse final rewards. PBSD measures trajectory quality through the posterior-to-prior probability ratio of the verified answer and applies Bayes’ rule to convert this hard-to-estimate answer-side ratio into a tractable likelihood ratio between a standard student model and a privileged answer-conditioned teacher model. Autoregressive decomposition of this Bayesian evidence score yields turn-level signals that identify whether each intermediate turn supports or undermines the verified outcome. Consequently, PBSD provides a principled and elegant reweighting scheme that transforms sparse outcome supervision into Bayes-calibrated turn-level credit signals, while remaining fully compatible with standard policy optimization. Experiments demonstrate that PBSD consistently enhances performance across both in-domain and out-of-domain settings, and effectively transfers knowledge from short-context training to long-context inference, suggesting that its fine-grained credit assignment mechanism facilitates more effective policy learning and yields improved generalization.
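The Bayes step the abstract describes can be written out as a short derivation. The notation here is assumed for illustration and is not taken from the paper: $q$ is the query, $a$ the verified answer, $\tau = (\tau_1, \dots, \tau_T)$ the trajectory split into turns, and $\pi_\theta$ the policy, which serves as the student when conditioned on $q$ alone and as the privileged teacher when also conditioned on $a$. Identifying the trajectory likelihoods with $\pi_\theta$ is a modeling assumption of this sketch.

```latex
\begin{align*}
\underbrace{\frac{p(a \mid \tau, q)}{p(a \mid q)}}_{\text{posterior-to-prior ratio}}
  &= \frac{p(\tau \mid a, q)}{p(\tau \mid q)}
  && \text{(Bayes' rule)} \\[4pt]
\log \frac{p(\tau \mid a, q)}{p(\tau \mid q)}
  &= \sum_{t=1}^{T}
     \underbrace{\log
       \frac{\pi_\theta(\tau_t \mid \tau_{<t}, q, a)}
            {\pi_\theta(\tau_t \mid \tau_{<t}, q)}}_{s_t}
  && \text{(autoregressive factorization)}
\end{align*}
```

Under this reading, a turn with $s_t > 0$ is one the answer-conditioned teacher finds more likely than the student does, so it supports the verified outcome, while $s_t < 0$ marks a turn that undermines it.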


[240] KITE: A Tri-Modal Transformer Integrating Text, Images, and Knowledge Graphs for Fake News Detection cs.LG | cs.CV

Kevin Patel, Shashi Bhushan Jha

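TL;DR: This paper proposes KITE (Knowledge-Integrated Text-Image Encoder), a tri-modal Transformer framework for fake news detection that jointly models text, images, and factual knowledge representations extracted from a knowledge graph. The model uses RoBERTa and CLIP for linguistic and visual encoding, a graph attention network to process structured facts retrieved from Wikidata, and cross-modal attention within a Transformer to integrate the multimodal features. It also produces modality-specific confidence scores to improve interpretability.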

Details

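Motivation: As multimodal misinformation grows more sophisticated, traditional fake news detection methods fall short. Most prior work focuses on text-image fusion or applies external knowledge only as a post-processing step, which limits the ability to detect deeper semantic inconsistencies.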

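Result: Evaluations on benchmark datasets show that KITE significantly outperforms unimodal and bimodal baselines, particularly in scenarios involving image-text mismatches or contradictions with external knowledge.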

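Insight: The main novelty is treating the structured knowledge graph (processed by a GAT) as a third modality modeled jointly and end to end with text and images, with deep fusion through cross-modal attention. The modality-specific confidence scores additionally give an interpretable view of each decision.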

Abstract: Traditional fake news detection methods are falling behind as multimodal misinformation grows more advanced, seamlessly blending deceptive text, manipulated visuals, and factually incorrect claims. Most prior work focuses on text-image fusion or applies external knowledge only as a post-processing step, limiting its ability to detect deeper semantic inconsistencies. In this paper, we introduce KITE (Knowledge-Integrated Text-Image Encoder), a tri-modal fake news detection framework that jointly models textual, visual, and factual knowledge representations. KITE leverages RoBERTa [23,14] and CLIP [24] for linguistic and visual encoding, while a Graph Attention Network (GAT) processes structured facts retrieved from Wikidata. KITE uses cross-modal attention [9] within a multimodal transformer to integrate text, visual, and knowledge features, helping it understand how the modalities relate to one another. Modality-specific confidence scores are generated alongside the final prediction, offering interpretability by indicating which input type most influenced the decision. Evaluations on benchmark datasets demonstrate that KITE significantly outperforms unimodal and bimodal baselines, particularly in scenarios involving image-text mismatches or contradictions with external knowledge.


[241] Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short cs.LG | cs.AI | cs.CL

Han Zhou, Adam X. Yang, Laurence Aitchison, Anna Korhonen, Albert Q. Jiang

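TL;DR: This paper proposes Reasoning Arena, a framework that addresses the missing gradient signal in reinforcement learning with verifiable rewards (RLVR) when rewards are identical within a group. It builds tournaments of reasoning traces, uses head-to-head comparisons to produce fine-grained relative reward signals, and makes reward estimation efficient with a dynamic anchor pool and a Bradley-Terry model.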

Details

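Motivation: When all sampled reasoning traces for a prompt receive the same verifiable reward, group-relative advantage estimation provides no gradient signal, even though the traces may differ substantially in reasoning quality. This limits the effectiveness of RLVR.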

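Result: On competition mathematics and coding benchmarks, Reasoning Arena outperforms the RLVR baseline by 7.6% on average, accelerates training by 27% to 41%, saves nearly 50% of generation compute, and substantially improves overall reasoning performance.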

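Insight: The novelty is routing uninformative reward groups to a judge system for trace tournaments, which converts reasoning quality into relative reward signals. A dynamic anchor pool and a Bradley-Terry model avoid the quadratic cost of exhaustive pairwise comparison, allowing efficient and scalable integration with RL.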

Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces of a given prompt receive identical rewards, group-relative advantage estimation provides no gradient signal, even though the traces may differ substantially in reasoning quality. We propose Reasoning Arena, an adaptive training framework that routes such non-diverse reward groups to a judge system instead of discarding them. Beyond examining the final answer, Reasoning Arena constructs trace tournaments, where reasoning traces are compared head-to-head to expose finer-grained preferences within the group, converting reasoning quality into rich relative reward signals. To make reward estimation efficient, rather than exhaustively comparing every pair, each new trace is evaluated against a small, dynamically updated pool of previously generated traces as anchors to efficiently establish a relative ranking. We then fit a Bradley-Terry model on the incomplete comparison graph, enabling scalable RL integration without quadratic pairwise comparisons. Empirical results demonstrate that Reasoning Arena consistently outperforms the RLVR baseline by 7.6% on average in competition mathematics and coding benchmarks. By converting otherwise wasted zero-advantage samples into useful gradient updates, our method accelerates training by 27% to 41%, saving nearly 50% of generation compute, and substantially improves overall reasoning performance.
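Fitting a Bradley-Terry model on an incomplete comparison graph, as the abstract describes, can be sketched with the classic minorization-maximization update. This is a generic illustration, not the paper's implementation: the function name, the iteration count, and the small pseudo-win `prior` that keeps winless traces at a positive strength are assumptions, and the dynamic anchor pool that chooses which pairs to compare is not modeled.

```python
def fit_bradley_terry(n_items, comparisons, iters=200, prior=0.1):
    """Estimate Bradley-Terry strengths from (winner, loser) index pairs.

    Pairs that were never compared are simply absent, so the comparison
    graph may be incomplete. Returns strengths normalized to sum to 1.
    """
    wins = [prior] * n_items  # pseudo-wins keep every strength positive
    pair_counts = {}
    for winner, loser in comparisons:
        wins[winner] += 1.0
        key = (min(winner, loser), max(winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    strengths = [1.0 / n_items] * n_items
    for _ in range(iters):
        denom = [0.0] * n_items
        for (i, j), count in pair_counts.items():
            share = count / (strengths[i] + strengths[j])
            denom[i] += share
            denom[j] += share
        # MM update: wins divided by expected exposure; items with no
        # comparisons keep their previous strength.
        updated = [
            wins[k] / denom[k] if denom[k] > 0 else strengths[k]
            for k in range(n_items)
        ]
        total = sum(updated)
        strengths = [value / total for value in updated]
    return strengths


# Trace 0 mostly beats trace 1, trace 1 mostly beats trace 2,
# and traces 0 and 2 are never compared directly.
judged = [(0, 1), (0, 1), (1, 0), (1, 2), (1, 2), (2, 1)]
ranking = fit_bradley_terry(3, judged)
```

The fitted strengths still order trace 0 above trace 2 through their shared opponent, which is what lets a small anchor pool stand in for exhaustive pairwise comparison.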


[242] Escaping the KL Agreement Trap in On-Policy Distillation cs.LG | cs.CL

Haoran Xin, Anhao Zhao, Ying Sun, Jin Li, Xiaoyu Shen

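TL;DR: This paper studies the KL agreement trap in on-policy distillation (OPD): when the student drifts into a degraded sequence, the teacher may still assign low KL divergence, so the supervision signal becomes uninformative. The authors propose KAT, which detects such low-quality supervision with a dynamic threshold and terminates the rollout, improving training efficiency and model performance.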

Details

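Motivation: The goal is to address the KL agreement trap in on-policy distillation, in which the teacher still assigns low KL divergence to degraded student-generated sequences and the supervision signal loses its corrective value.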

Result: Across four mathematical reasoning benchmarks, KAT improves avg@k accuracy by 2.66% and pass@k by 3.43% while reducing average rollout length by 59.73%.

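Insight: The novelty is identifying the KL agreement trap in OPD and proposing KAT, an online termination rule based on a dynamic threshold that improves distillation by filtering out low-quality supervision. Viewed objectively, the method gives policy distillation a more robust mechanism for selecting training signals.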

Abstract: On-policy distillation (OPD) provides dense token-level supervision by asking a teacher to score student-generated rollouts. However, when the student drifts into an unrecoverable prefix, the teacher may locally agree with the degraded state, producing low reverse KL but little corrective training signal. We identify this persistent regime as a low-KL agreement trap. Further analyses show that tokens during and after such traps produce less useful supervision signals. We propose KAT (KL Agreement Trap Termination), an online OPD termination rule that detects persistent low-KL agreement with a dynamic training-adaptive threshold. By filtering weak supervision from degenerate agreement, KAT improves avg@k accuracy by 2.66% and pass@k by 3.43% across four mathematical benchmarks, while reducing average rollout length by 59.73%.


[243] BrainSurgery: Reproducible and Reliable Declarative Weight Manipulations for Model Editing and Upcycling cs.LG | cs.CL

Gianluca Barmina, Annemette Broch Pirchert, Andrea Blasi Núñez, Lukas Galke Poech, Peter Schneider-Kamp

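TL;DR: BrainSurgery is a tool for robust and reproducible “tensor surgery” on neural network checkpoints. It abstracts storage formats and memory management behind declarative YAML plans, supports structural modifications, mathematical transformations, and tensor reshaping, and uses built-in assertions to validate tensor shapes, data types, and values to prevent silent errors.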

Details

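Motivation: As deep learning models scale, managing, inspecting, and modifying large checkpoints becomes increasingly difficult, and existing workflows often rely on fragile ad-hoc Python scripts that lack robustness and reproducibility.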

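Result: The paper gives a system demonstration covering four examples and three case studies, from model upcycling to LoRA extraction, showing the tool's effectiveness on complex transformations.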

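Insight: The novelty is a declarative, verifiable framework that replaces ad-hoc scripts: YAML plans with regex and structural targeting express complex operations, and built-in assertions make them more reliable, providing a reproducible foundation for future research.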

Abstract: As deep learning models scale, managing, inspecting, and modifying large checkpoints has become increasingly challenging. Researchers often need to alter model weights for layer restructuring, precision casting, low-rank factorization, and architectural debugging, yet these workflows often rely on fragile ad-hoc Python scripts. Here, we introduce BrainSurgery, a tool for robust and reproducible “tensor surgery” on neural network checkpoints, and provide a system demonstration covering four examples and three case studies from model upcycling to LoRA extraction. By abstracting storage formats and memory management, BrainSurgery executes complex transformations through declarative YAML plans. It supports structural modifications, mathematical transformations, and tensor reshaping through expressive regex and structural targeting, while built-in assertions validate tensor shapes, data types, and values to prevent silent errors. We envision that BrainSurgery will provide a strong foundation for future research through its reproducible and validated operations.


[244] iOSWorld: A Benchmark for Personally Intelligent Phone Agents cs.LG | cs.CL

Lawrence Keunho Jang, Mareks Woodside, Geronimo Carom, Andrew Keunwoo Jang, Jing Yu Koh

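TL;DR: This paper introduces iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity, comprising 26 newly built iOS apps with connected data such as transactions, messages, and travel records. The benchmark contains 133 tasks in three increasingly difficult categories: single-app tasks, multi-app tasks, and memory and personalization tasks. The study evaluates frontier and open-source models in vision-only and privileged vision+XML settings and finds that the best configuration reaches 52% overall accuracy but only 37% on multi-app tasks.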

Details

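Motivation: Existing mobile agent benchmarks lack personalization, whereas a useful phone agent needs personal intelligence: it should reason over the user's identity, history, and preferences on the device rather than merely follow isolated instructions in an impersonal sandbox.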

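Result: On iOSWorld, the best configuration reaches 52% overall accuracy but only 37% on multi-app tasks. Privileged vision+XML access improves frontier models by up to 26 percentage points, while smaller models do not benefit from the added accessibility-tree input.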

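Insight: The novelty is the first interactive iOS simulator benchmark built around a persistent user identity, which requires agents to handle connected data across apps for personalized reasoning. Viewed objectively, its multi-app and memory/personalization tasks give an effective test of an agent's overall competence in a realistic phone environment.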

Abstract: A useful phone agent needs to be personally intelligent. It should reason over a user’s identity, history, and preferences as they exist on the device, not just follow isolated instructions in an impersonal sandbox. Existing mobile agent benchmarks lack this kind of personalization. We introduce iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity spanning 26 newly built iOS apps. These apps contain connected data such as transactions, messages, travel records, social relationships, and financial activity. iOSWorld includes 133 tasks across three increasingly difficult categories. Single-app tasks (27) test one app, multi-app tasks (60) span 2 to 8 apps, and memory and personalization tasks (46) require agents to infer patterns from personal data. We evaluate frontier and open-source computer-use models in both vision-only and privileged vision+XML settings. The best configuration reaches 52% overall but only 37% on multi-app tasks. Privileged vision+XML access improves frontier models by up to 26 percentage points, while smaller models do not benefit from added accessibility-tree input. We release iOSWorld as an open-source benchmark with all apps, seeded data, tasks, rubrics, and evaluation code.


[245] C$^3$ache: Accelerating World Action Models with Cross Inference Chunk Cache cs.LG | cs.CV | cs.RO

Weisen Zhao, Lam Nguyen, Zhicong Lu, Yuzhang Shang

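TL;DR: This paper proposes C$^3$ache, a method for accelerating inference in World Action Models (WAMs). By caching and reusing residuals across inference chunks, it substantially reduces computational redundancy during task execution, achieving up to a $2.5\times$ inference speedup with almost no effect on task success rate.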

Details

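Motivation: World Action Models generalize well, but their video-modeling-based inference is computationally expensive. Existing acceleration methods target redundancy only within a single inference chunk and overlook the substantial redundancy across chunks.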

Result: On benchmarks with a Fast-WAM backbone, C$^3$ache achieves up to a $2.5\times$ speedup in total inference time with negligible degradation in task success rate.

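Insight: The core finding is that when a robot executes a smooth behavior, the residuals at a given denoising step are strongly correlated between adjacent inference chunks. Based on this, the paper proposes a training-free method that caches residuals across chunks, a new angle on compute optimization for sequential generative models.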

Abstract: World Action Models (WAMs) generalize better than standard Vision-Language-Action (VLA) policies to novel motions and environments, because a video-modeling objective lets them learn from abundant unlabeled video rather than scarce labeled robot demonstrations. This generalization is computationally expensive. To complete a task, a WAM runs over multiple inference chunks, and each chunk requires a costly denoising process. Existing acceleration methods reduce this cost by caching and reusing computation within a single chunk’s denoising trajectory. Our empirical analysis reveals a substantial source of redundancy they overlook: redundancy across chunks. When a robot executes a smooth behavior, the residuals computed at a given denoising step are strongly correlated from one chunk to the next. We introduce C$^3$ache, a training-free method that caches and reuses these residuals across inference chunks at the same denoising step. Experiments on benchmarks with a Fast-WAM backbone show that C$^3$ache achieves up to a $2.5\times$ speedup in total wall-clock inference time, with negligible degradation in task success rate.


[246] Stabilizing On-Policy Distillation for MLLM Reasoning with Global Normalization cs.LG | cs.CV

Dongze Hao, Zhiwei Jin, Chen Chen, Haonan Lu

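TL;DR: This paper proposes Globally Normalized Distillation Policy Optimization (GNDPO) to stabilize on-policy distillation (OPD) training for multimodal large language model (MLLM) reasoning. By transforming raw KL scores into batch-level relative advantages, it mitigates gradient explosions, improving training robustness and downstream performance.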

Details

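Motivation: On-policy distillation (OPD) is a post-training paradigm with an advantage over reinforcement learning with verifiable rewards (RLVR), which depends on sparse feedback. However, naive token-level distillation can suffer from gradient instability caused by magnitude misalignment in outlier states. This paper aims to resolve that instability in OPD training.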

Result: Experiments show that GNDPO substantially improves training robustness and downstream performance on multimodal reasoning tasks.

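Insight: The core idea is a global normalization mechanism that transforms KL scores into batch-level relative advantages, a practical way to stabilize distillation gradients while retaining the benefits of token-level supervision.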

Abstract: On-policy distillation (OPD) has recently emerged as an important post-training paradigm. By using a stronger teacher model to provide dense, fine-grained supervision for sampled trajectories, OPD offers a clear advantage over reinforcement learning with verifiable rewards (RLVR), which typically depends on sparse binary or outcome-based environmental feedback. However, naive token-level distillation can suffer from gradient instability, due to magnitude misalignment in outlier states. To address this issue, we propose Globally Normalized Distillation Policy Optimization (GNDPO), a practical method that stabilizes optimization by transforming raw KL scores into batch-level relative advantages. This normalization effectively mitigates gradient explosions while retaining the benefits of token-level guidance. Experimental results show that GNDPO substantially improves training robustness and downstream performance across multimodal reasoning tasks. The code is released at https://github.com/OPPO-Mente-Lab/GNDPO.
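One plausible reading of "transforming raw KL scores into batch-level relative advantages" is a z-score computed over every token in the batch. The sketch below shows only that generic normalization; the function name, the sign convention (lower KL gives a higher advantage), and the `eps` term are assumptions and should not be taken as GNDPO's actual formulation.

```python
from statistics import fmean, pstdev


def batch_relative_advantages(kl_scores, eps=1e-8):
    """Z-score token-level KL values using statistics pooled over the batch.

    kl_scores is a list of per-sequence lists of token-level KL values.
    The output has the same shape. The sign is flipped so that tokens with
    lower KL receive higher advantages (an assumed convention).
    """
    flat = [value for sequence in kl_scores for value in sequence]
    mean = fmean(flat)
    std = pstdev(flat)
    return [
        [-(value - mean) / (std + eps) for value in sequence]
        for sequence in kl_scores
    ]


# One outlier token (KL = 50.0) among otherwise small values.
batch = [[0.1, 0.2], [0.15, 50.0]]
advantages = batch_relative_advantages(batch)
```

After normalization the outlier's magnitude is bounded by the batch statistics instead of its raw KL value, which illustrates how pooling over the batch can damp the extreme per-token signals the abstract attributes to outlier states.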


[247] Stage-1 Controls the Entropy Regime, Not the Outcome cs.LG | cs.AI | cs.CV

Jianxiong Shen

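TL;DR: In a small-data study, this paper examines what the first stage (Stage-1) controls in two-stage post-training of vision-language models (VLMs). On Geometry3K internal validation, three Stage-1 warm-starts (supervised fine-tuning, SFT, and on-policy distillation, OPD) end within a narrow 53%–54% band, suggesting that Stage-1 has limited influence on the in-domain endpoint. A matched-recipe, early-stopped SFT, however, improves out-of-domain MathVista by +2.1 points, whereas an over-trained SFT variant drops by 9.5 points. The clearest difference is policy entropy: OPD enters reinforcement learning (RL) with higher policy entropy than the SFT initializations, and the gap remains visible along the training trajectories. OPD also shows higher answer diversity and pass@16 at initialization, but these advantages largely vanish after RL and on the out-of-domain task.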

Details

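Motivation: The study asks what Stage-1 actually controls in two-stage post-training (a Stage-1 warm-start followed by Stage-2 reinforcement learning). Specifically, it aims to clarify whether Stage-1 (SFT or OPD) materially affects final performance or mainly affects training dynamics such as the entropy regime.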

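Result: On in-domain Geometry3K, the three Stage-1 warm-starts end within a narrow 53%–54% band, consistent with the range reported by recent specialized methods, which gives little evidence that Stage-1 changes the in-domain endpoint. On out-of-domain MathVista, a matched-recipe, early-stopped SFT gains +2.1 points, reversing the 9.5-point drop of an over-trained variant. At initialization, OPD shows higher policy entropy, answer diversity, and pass@16 (+2.0 to +5.2 points) than SFT, but these advantages largely vanish after RL (endpoint pass@16 values within 1.1 points) and on MathVista (six models within 1.2 points).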

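Insight: The paper's stated contribution is a bounded empirical characterization: in this setup, Stage-1 is strongly associated with the entropy regime, but the downstream payoff is small and localized, and there is no evidence that OPD is a better RL warm-start than SFT. Viewed objectively, the study suggests that in two-stage VLM training the main role of Stage-1 may be to set the initial training dynamics (the entropy regime) rather than to determine final performance. This challenges the common assumption that OPD is a better RL warm-start than SFT and highlights the importance of the training recipe (such as early stopping) for out-of-domain generalization.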

Abstract: Two-stage post-training – a Stage-1 warm-start (supervised fine-tuning, SFT, or on-policy distillation, OPD) followed by Stage-2 reinforcement learning (RL) – is increasingly used for vision-language models (VLMs). We ask what Stage-1 actually controls in a small-data study using Qwen2.5-VL-7B with a same-modality 72B VLM teacher for OPD. First, the three warm-starts reach a narrow $53$–$54%$ band on Geometry3K internal validation, consistent with the narrow range reported by recent specialized methods; this setup provides little evidence that Stage-1 changes the in-domain endpoint. Second, a matched-recipe, early-stopped SFT improves out-of-domain MathVista by $+2.1$ points, reversing the $-9.5$-point drop of an over-trained variant. The clearest difference is the entropy regime: OPD enters RL with substantially higher policy entropy than either SFT initialization, and the separation remains visible through the available trajectories. At the in-domain initialization, OPD also has higher answer diversity and pass@16 ($+2.0$ to $+5.2$ points over SFT), although problem-level bootstrap intervals show that the smaller contrast is uncertain. The advantage is absent after RL (endpoint pass@16 values within $1.1$ points) and on MathVista (six models within $1.2$ points). Our contribution is therefore a bounded empirical characterization: Stage-1 is strongly associated with the entropy regime in this setup, but the downstream payoff is small, localized, and not evidence that OPD is a better RL warm-start.


cs.CR

[248] Detecting Aimbot Cheaters in MOGs cs.CR | cs.CV | cs.NI

Salman Shaikh, Tao Ni, Marc Dacier

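TL;DR: This paper proposes PATCH, a proactive defense that deploys adversarial patches as in-game honeytokens to detect and mitigate visual aimbot cheats. The approach deliberately triggers the cheat's object detection model, enabling either direct detection or making the game unplayable for the cheater by flooding the viewport with patches. Evaluations on a custom Unreal Engine game and the commercial game Fortnite support its effectiveness.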

Details

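Motivation: Visual aimbots use computer vision models to detect opponents from client screen captures rather than accessing game memory, which makes them undetectable by commercial kernel-level anti-cheat solutions and seriously harms the player experience and fairness.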

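Result: In white-box scenarios on the custom Unreal Engine game, the detection rate exceeds 90% for almost all patch sizes, and cross-model transferability reaches 60% to 90% with larger patches. Further validation on the commercial game Fortnite demonstrates real-world applicability.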

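Insight: The novelty is using adversarial patches as honeytokens for proactive defense, triggering the cheat's vision model to detect or disrupt it. This turns the attacker's own model into the point of defense, and the approach appears scalable and practical to deploy.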

Abstract: Multiplayer Online Games have become a multibillion-dollar industry in the entertainment sector. However, the presence of cheaters undermines the experience of honest players and devalues the effort of game developers, as it directly affects player retention, competitive integrity, the legitimacy and trustworthiness of a game, and, most importantly, the overall revenue streams. Among various cheating techniques, visual aimbots represent an emerging threat. They use computer vision models to detect opponents from client screen captures rather than accessing game memory, making them completely undetectable by commercial kernel-level anti-cheat solutions. In this paper, we introduce PATCH, a novel proactive defense strategy that deploys adversarial patches as in-game honeytokens to mitigate the presence of visual aimbot cheaters. Our approach centers on deliberately triggering the cheaters’ object detection model, enabling either direct detection or rendering the game unplayable for the cheater via patch flooding on their viewport. We evaluate our approach on several criteria: the effectiveness of different patch sizes, the scalability of patches to different screen resolutions, and efficacy against diverse visual aimbot cheat configurations. We also explore various YOLO models to assess patch transferability. Evaluation on a custom Unreal Engine game demonstrates a detection rate of over 90 percent in white-box scenarios for almost all patch sizes, and 60 to 90 percent cross-model transferability with larger patches. We further validate our approach on Fortnite, a commercial MOG, demonstrating real-world applicability.


cs.GR

[249] OmniFaceRig: Fully Automatic Inner-Mouth-Aware Face Rigging Across Diverse 3D Character Topologies cs.GR | cs.CVPDF

Chao Wang, Guangyao Ma, John Doublestein, Junming Chen, Yiming Lin

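TL;DR: OmniFaceRig is a fully automatic end-to-end pipeline that converts a static, surface-only 3D character mesh with no pre-modeled oral cavity into an inner-mouth-aware FACS rig, with up to 155 blendshapes, procedurally fitted teeth, gums, and tongue, and re-packed UV/texture. It supports diverse topologies (humans, humanoids, long- and short-muzzled animals) with no manual landmarks, user-provided templates, or per-asset setup.

Details

Motivation: Facial rigging, in particular creating FACS-based blendshapes and inner-mouth geometry, is a major bottleneck in 3D character production. The work aims to cut the substantial manual designer effort that existing pipelines require, such as landmark annotation, per-character template adjustment, and inner-mouth placement.

Result: On screened inputs from the publicly released Omni-Bench benchmark (1,000 biped 3D characters), the pipeline achieves a high final rigging success rate, nearly complete face detection recall, and reliable inner-mouth placement with low penetration.

Insight: The novelty is a fully automatic end-to-end pipeline that supports diverse topologies. It combines hybrid VLM+CV riggability checking, multi-model face parsing, dense keypoint-driven template registration, procedural inner-mouth construction, and collision-aware blendshape transfer, and it introduces topology-specific face and inner-mouth template selection together with collision-aware inner-mouth fitting to reduce penetration.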

Abstract: Facial rigging - creating FACS-based blendshapes together with inner-mouth geometry (teeth, gums, and tongue) - remains a major bottleneck in 3D character production. Existing pipelines still require substantial designer effort, especially for manual landmark annotation, per-character template adjustment, and inner-mouth placement. We present OmniFaceRig, a fully automatic end-to-end pipeline that converts a static surface-only 3D character mesh, with no pre-modeled oral cavity, into an inner-mouth-aware FACS rig with up to 155 blendshapes, procedurally fitted teeth, gums, and tongue, and re-packed UV/texture. OmniFaceRig supports diverse topologies - humans, humanoids, long-muzzled animals (e.g., dogs, wolves, foxes), and short-muzzled animals (e.g., cats, bears, rabbits, tigers) - with no manual landmarks, no user-provided templates, and no per-asset setup. The pipeline combines hybrid VLM+CV riggability checking, multi-model face parsing, dense keypoint-driven template registration, procedural inner-mouth construction, and collision-aware blendshape transfer. For non-human characters, OmniFaceRig selects topology-specific face and inner-mouth templates and uses collision-aware inner-mouth fitting to reduce teeth-face intersections without exposing users to category-specific tuning. We also publicly release Omni-Bench, a freely available benchmark dataset of 1,000 biped 3D characters with FACS facial blendshapes and inner-mouth geometry, spanning humans, humanoids, cats, dogs, and other animals. Experiments show high final rigging success on screened Omni-Bench inputs, nearly complete face detection recall from the segmentation ensemble and reliable inner-mouth placement with low penetration. Together, OmniFaceRig provides an automatic path from static generated characters to animation-ready facial rigs across both human and non-human topologies.


cs.IR

[250] Evaluating Advanced Prompting on Gemini Flash for Multi-Hop Biomedical QA cs.IR | cs.AI | cs.CLPDF

Ahmed Bajaber, Mohammed Alliheedi

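TL;DR: This paper evaluates Google's Gemini Flash models on the MedHopQA biomedical multi-hop question answering challenge, focusing on the impact of advanced prompt engineering. A complex prompt combining role-playing, multi-shot chain-of-thought examples, and detailed formatting rules achieves a Concept Level Score of 0.720 on Gemini 2.0 Flash, well above the baseline prompt's 0.565 and almost identical to the next-generation Gemini 2.5 Flash.

Details

Motivation: The work addresses LLM performance on complex multi-hop reasoning in the biomedical domain and examines how much advanced prompt engineering improves the model's reasoning.

Result: On the MedHopQA benchmark, Gemini 2.0 Flash with the complex prompt achieves a Concept Level Score of 0.720, far above the baseline prompt's 0.565 and on par with Gemini 2.5 Flash.

Insight: The novelty is a multi-component prompting strategy that combines role-playing, multi-shot chain-of-thought examples, and formatting rules. The results indicate that careful prompt design is critical for unlocking the reasoning capabilities of modern LLMs, and that an efficient model with an optimized prompt can approach the performance of a more advanced model.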

Abstract: The MedHopQA challenge presents a critical test for Large Language Models (LLMs): complex, multi-hop reasoning in the high-stakes biomedical domain. This paper details our direct API-based evaluation of Google’s Gemini Flash models, focusing on the impact of advanced prompt engineering. We designed a sophisticated, multi-component prompt for Gemini 2.0 Flash that combined role-playing, explicit multi-shot Chain-of-Thought (CoT) examples, and detailed formatting rules. Our best run, using this complex prompt, achieved a Concept Level Score of 0.720. This result dramatically outperformed a baseline prompt which scored only 0.565. Remarkably, this performance on the efficient Gemini 2.0 Flash was almost identical to the result from the next-generation Gemini 2.5 Flash. Our findings demonstrate that sophisticated prompt design is a critical factor for unlocking the full reasoning capabilities of modern LLMs.
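The three prompt components named in the abstract (role-playing, multi-shot CoT examples, formatting rules) can be sketched as a simple template assembler. This is a hypothetical illustration only: the function name `build_prompt`, the section labels, and the placeholder example are assumptions, and the authors' actual prompt text is not given in the abstract.

```python
def build_prompt(role, examples, format_rules, question):
    """Assemble a multi-component prompt: role, worked CoT examples, rules, question.

    `examples` is a list of (question, reasoning, answer) tuples.
    All section labels here are illustrative, not the paper's wording.
    """
    parts = [role.strip(), ""]
    for i, (ex_question, ex_reasoning, ex_answer) in enumerate(examples, start=1):
        parts += [
            f"Example {i}",
            f"Question: {ex_question}",
            f"Reasoning: {ex_reasoning}",
            f"Answer: {ex_answer}",
            "",
        ]
    parts.append("Formatting rules:")
    parts += [f"- {rule}" for rule in format_rules]
    # End with an open "Reasoning:" slot so the model continues step by step.
    parts += ["", f"Question: {question}", "Reasoning:"]
    return "\n".join(parts)


# Placeholder content (disease D, drug X, protein P are symbolic, not real facts).
prompt = build_prompt(
    role="You are a biomedical expert answering multi-hop questions.",
    examples=[(
        "What protein is inhibited by the drug used to treat disease D?",
        "Hop 1: disease D is treated with drug X. Hop 2: drug X inhibits protein P.",
        "protein P",
    )],
    format_rules=["Give the final answer as a single short entity name."],
    question="<multi-hop biomedical question>",
)
```

The resulting string would then be sent to the model through whatever API client is in use; that call is deliberately left out because the abstract does not describe it.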


cs.RO

[251] CLASP: Language-Driven Robot Skill Selection and Composition using Task-Parameterized Learning cs.RO | cs.AI | cs.CL | cs.HC | cs.LGPDF

Markus Knauer, Valentin Gieraths, Tai Mai, Samuel Bustamante, Alin Albu-Schäffer

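TL;DR: This paper proposes CLASP, a framework that combines task-parameterized kernelized movement primitives (TP-KMPs) with pretrained vision-language models (VLMs). A robot learns skills from a few demonstrations (2 to 5), uses natural language commands for skill selection, parameter reasoning, and composition, and proactively requests targeted demonstrations when it cannot complete a task.

Details

Motivation: Foundation models such as VLAs and VLMs require extensive data, while task-parameterized imitation learning lacks natural language understanding. The goal is robot skill learning and execution that is both data-efficient and grounded in natural language.

Result: Validation on a 7-DoF manipulator shows success rates of 73.3% to 100% in scenarios requiring skill selection, composition, and active learning.

Insight: The novelty is a modular architecture that pairs the data efficiency of TP-KMPs with the language understanding of VLMs, generates novel behaviors through covariance-weighted composition, and adds an active learning mechanism that identifies capability gaps without fine-tuning.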

Abstract: Enabling robots to understand and execute tasks from natural language commands while maintaining data efficiency remains challenging. Foundation models such as vision-language-action (VLA) and vision-language models (VLMs) provide intuitive interaction channels but require extensive data; task-parameterized imitation learning achieves data efficiency but lacks natural language grounding. This work bridges this gap through a modular architecture combining task-parameterized kernelized movement primitives (TP-KMPs) with pretrained VLMs. During learning, skills are acquired from 2 to 5 kinesthetic demonstrations, and the VLM generates skill schemas describing each skill’s parameters and preconditions. During execution, the VLM interprets commands to select skills, reason about parameter bindings, and create novel behaviors through covariance-weighted composition. When no skill or composition suffices, the system identifies capability gaps and requests targeted demonstrations, all without fine-tuning. Validation on a 7-DoF manipulator shows success rates of 73.3%-100% in scenarios requiring skill selection, composition, and active learning.
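The abstract does not spell out the covariance-weighted composition, so the following is only a sketch of one common form of it: fusing Gaussian predictions by inverse-variance weighting (a product of Gaussians), here under the simplifying assumption of diagonal covariances. The function name and the example numbers are hypothetical.

```python
def fuse_diagonal_gaussians(means, variances):
    """Fuse per-dimension Gaussian predictions by inverse-variance weighting.

    means, variances: lists of equal-length lists, one entry per skill.
    A skill that is confident in a dimension (small variance) dominates
    the fused mean in that dimension.
    """
    dims = len(means[0])
    fused_mean, fused_var = [], []
    for d in range(dims):
        precisions = [1.0 / var[d] for var in variances]
        total_precision = sum(precisions)
        fused_var.append(1.0 / total_precision)
        weighted_sum = sum(p * mean[d] for p, mean in zip(precisions, means))
        fused_mean.append(weighted_sum / total_precision)
    return fused_mean, fused_var


# Skill A is confident about x, skill B is confident about y.
fused_mean, fused_var = fuse_diagonal_gaussians(
    means=[[0.0, 0.0], [1.0, 1.0]],
    variances=[[0.01, 1.0], [1.0, 0.01]],
)
# fused_mean stays close to A in x and close to B in y.
```

With full covariance matrices the same idea uses matrix inverses in place of the scalar reciprocals.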


[252] From USD Scenes to Knowledge Graphs: Zero-Shot Ontology Grounding with LLMs cs.RO | cs.AI | cs.CL | cs.CV | cs.GRPDF

Jiangtao Shuai, Zongxiong Chen, Manfred Hauswirth, Sonja Schimmler

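TL;DR: This paper investigates large language models (LLMs) as a zero-shot, training-free way to automatically ground objects in 3D simulation scenes (in USD format) to classes of a formal ontology (such as SOMA-HOME) for knowledge graph construction. On a kitchen scene (125 objects), LLMs reach 90-96% accuracy with descriptive names, substantially outperforming dictionary and embedding baselines.

Details

Motivation: Constructing knowledge graphs from 3D simulation scenes is essential for robot task reasoning, but the key step of mapping scene objects to ontology classes (ontology grounding) still relies on manually curated dictionaries, which are brittle and generalize poorly across assets.

Result: On the kitchen scene (125 objects) with the SOMA-HOME ontology, LLMs achieve 90-96% exact-match accuracy with descriptive names and 49-89% with abbreviated names, substantially outperforming dictionary and embedding baselines. Under fully opaque names, context-augmented prompting recovers up to 48% accuracy.

Insight: The novelty lies in using LLMs as a zero-shot, training-free tool for ontology grounding in 3D scenes. The key finding is that LLMs primarily exploit semantic cues in the scene graph (such as sibling names and parent paths), while geometry alone contributes little (only 4-17%), which shows both the strengths of LLMs on this kind of structured data and what they depend on.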

Abstract: Constructing knowledge graphs from 3D simulation scenes is essential for robot task reasoning, but the key bottleneck, grounding scene objects to formal ontology classes, still relies on manually curated dictionaries that are brittle and do not generalize across assets. We investigate whether large language models (LLMs) can automate this grounding step for Universal Scene Description (USD) scenes as a zero-shot, training-free alternative. On a kitchen scene (125 objects) with SOMA-HOME Ontology, LLMs achieve 90-96% exact-match accuracy with descriptive names and 49-89% with abbreviated names, substantially outperforming dictionary and embedding baselines. Under fully opaque names, context-augmented prompting recovers up to 48%. Feature ablation reveals that LLMs primarily exploit semantic cues in the scene graph (sibling names and parent paths); anonymizing these cues reduces accuracy to 0-6%, while geometry alone yields only 4-17%.


[253] MinNav: Minimalist Navigation Using Optical Flow For Active Tiny Aerial Robots cs.RO | cs.CVPDF

Aniket Patil, Mandeep Singh, Uday Girish Maradana, Nitin J. Sanket

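TL;DR: This paper presents MinNav, a navigation stack for tiny aerial robots based on optical flow and its uncertainty. Using only a monocular camera and no prior knowledge, it flies through unknown scenes with static and dynamic obstacles and unknown-shaped gaps. Active, exploratory motion to find obstacles and navigate further improves the success rate.

Details

Motivation: Autonomous navigation on tiny aerial robots must work under tight resource constraints (e.g., compute and cost) and achieve robust navigation with only a monocular camera, especially in complex environments with unknown dynamic obstacles and gaps.

Result: Real-world experiments in various environments with static and dynamic obstacles and unknown-shaped gaps yield an overall success rate of 70%. Performance is on par with depth-based methods at far lower computational cost, and the approach can run onboard tiny aerial robots.

Insight: The main novelty is monocular navigation across all of these cases (static and dynamic obstacles, unknown gaps) without prior knowledge, using only optical flow and its uncertainty together with an active exploration strategy. The lightweight design makes it well suited to tiny platforms with limited compute.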

Abstract: Navigation using a monocular camera is pivotal for autonomous operation on tiny aerial robots due to their perfect balance of versatility, cost and accuracy. In this paper, we introduce MinNav, a navigation stack based on optical flow and its uncertainty to fly through a scene with static and dynamic obstacles and unknown-shaped gaps without any prior knowledge of the scene components and/or their locations/ordering. We further improve success rate by using the activeness of the robot to move around in an exploratory way to find obstacles and navigate. We successfully evaluate and demonstrate the proposed approach in many real-world experiments in various environments with static and dynamic obstacles and unknown-shaped gaps with an overall success rate of 70%. To the best of our knowledge, this is the first solution to tackle all the aforementioned navigation cases without prior knowledge using a monocular camera. Our approach is on par in performance with depth based methods with factors of magnitude less computation required and can readily run onboard tiny aerial robots. The accompanying video, supplementary material, code and dataset can be found at https://pear.wpi.edu/research/minnav.html


[254] GraspFoM: Towards Reconstruction-Driven Robotic Grasping with 3D Foundation Priors cs.RO | cs.CVPDF

Dongli Wu, Xiaobao Wei, Hao Wang, Qiaochu Dong, Ying Li

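TL;DR: This paper proposes GraspFoM, a unified framework that leverages 3D foundation priors (SAM3D) to build a shared 3D object latent that drives both high-fidelity 3D reconstruction (in mesh and 3D Gaussian Splatting forms) and robotic grasp pose prediction. It introduces an anchor-initialized truncated pose-reasoning diffuser to predict continuous, multimodal grasp poses, and it explores the interaction between reconstruction and grasping through a reconstruction-aware scorer and a residual latent updater.

Details

Motivation: Robotic grasping remains challenging under partial observations, and reliable grasping depends on both local contact cues and object-level 3D structure. Existing geometry-aware grasping methods typically treat reconstructed geometry as an intermediate prediction rather than a reusable object prior for grasping. This work uses 3D foundation priors to build a shared representation that unifies reconstruction and grasping.

Result: Comprehensive experiments show that GraspFoM achieves state-of-the-art (SOTA) results on both reconstruction and grasping, with only a small number of additional trainable parameters. Component-wise ablation studies confirm the contribution of each module.

Insight: The core novelty is using 3D foundation model priors to build a shared object latent, so that reconstruction serves as the geometric grounding for grasp reasoning instead of being an isolated task. The specific techniques are the anchor-initialized truncated pose-reasoning diffuser (which avoids relying on discrete candidates) and the reconstruction-aware scorer with the residual latent updater (which enable cross-task interaction and representation refinement). The unified framework improves both tasks with few added parameters.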

Abstract: Robotic grasping is a fundamental capability in robotic manipulation. Yet grasping remains challenging under partial observations. Reliable grasping depends on both local contact cues and object-level 3D structure. Existing geometry-aware grasping methods recognize the value of reconstruction, but they typically treat geometry as an intermediate prediction rather than a reusable object prior for grasping. In this paper, we present GraspFoM, a unified framework that leverages 3D foundation priors (SAM3D) to build a shared 3D object latent for both reconstruction and grasp pose prediction. Built on this shared object latent, we introduce an anchor-initialized truncated pose-reasoning diffuser that predicts continuous and multimodal grasp poses without directly relying on discrete grasp candidates. We further investigate the interaction between reconstruction and grasping through a reconstruction-aware scorer and a residual latent updater. Reconstruction provides grounded geometric cues, while grasp supervision refines the shared object latent toward grasp-relevant affordances. GraspFoM jointly predicts grasp poses and reconstructs high-fidelity 3D assets in mesh and 3DGS forms. Comprehensive experiments demonstrate that GraspFoM achieves state-of-the-art results on both reconstruction and grasping. Notably, these improvements require only a small number of additional trainable parameters. Component-wise ablation studies also demonstrate the contribution of each component.


[255] EgoPriMo: Egocentric Motion Generation for Interactive Humanoid Control cs.RO | cs.CVPDF

Haoyang Ge, Peng Ren, Yukun Shi, Cong Huang, Kun Li

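TL;DR: EgoPriMo is a unified framework that learns whole-body motion priors from egocentric human demonstrations. By combining body dynamics, egocentric visual context, and text prompts, it reconstructs, generates, and forecasts SMPL-based full-body motion. Language serves as a high-level control signal, a single checkpoint supports multiple tasks, and the generated SMPL motions can be executed by a Unitree humanoid controller.

Details

Motivation: Humanoid robots need whole-body motions that adapt to scene context, task requirements, and user intent. Existing approaches such as motion tracking and vision-language-action systems do not offer a scalable, interactive prior for full-body behavior.

Result: Experiments on Nymeria and EgoExo4D show that a single checkpoint outperforms UniEgoMotion on egocentric motion generation while also supporting reconstruction and forecasting, and that it produces motions executable on a humanoid robot.

Insight: The novelties include learning motion priors from egocentric data, using language as high-level control rather than a complete motion specification, and a Triple-stream DiT that jointly models the modalities, with task-conditioning masks enabling multi-task support in a single model. Together these offer a practical path from scalable observations to generalizable, interactive motion priors.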

Abstract: Humanoid robots require whole-body motions that adapt to scene context, task requirements, and user intent. Motion tracking reproduces specified trajectories, and humanoid vision-language-action systems provide semantic interfaces, but neither offers a scalable and interactive prior for broad full-body behavior. We introduce EgoPriMo (Egocentric Motion Prior for Humanoid Robots), a unified framework that learns such priors from egocentric human demonstrations. Given egocentric observations and a text prompt, EgoPriMo reconstructs, generates, and forecasts SMPL-based full-body motion. Language is used as a high-level control signal rather than a complete motion specification. At the core of EgoPriMo is a Triple-stream DiT that jointly models body dynamics, egocentric visual context, and text; task-conditioning masks route different tasks and missing-modality data through the same checkpoint. Experiments on Nymeria and EgoExo4D show that one checkpoint improves egocentric motion generation over UniEgoMotion while supporting reconstruction and forecasting; the generated SMPL motions can also be executed by a Unitree humanoid controller. These results indicate a practical path from scalable egocentric observations to generalizable and interactive humanoid motion priors.


[256] When Video Misreads: Closed-Loop Distillation of Reading Heuristics for Exploratory Manipulation Trace QA cs.RO | cs.AI | cs.CVPDF

Haizhou Ge, Yufei Jia, Yue Li, Zhixing Chen, Lu Shi

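TL;DR: For the Exploratory Manipulation Trace QA (EMT-QA) task, this paper proposes Closed-Loop Trace Distillation, in which a per-task coding agent distills a one-line natural-language prompt, the Distilled Reading Heuristic (DRH), from labeled training traces. The DRH improves the accuracy of a frozen vision-language model (VLM) at predicting the minimal-success action chain.

Details

Motivation: When reading exploratory manipulation traces (for example, a failed attempt that reveals a latent precondition), existing VLMs and embodied multimodal LLMs cannot reliably recover the minimal-success action chain from raw video, raw proprioception, or their combination.

Result: Across three simulator tasks and two real-robot tasks, the DRH improves chain accuracy by +0.38 to +0.47 over the best raw-modality baseline. The DRH can also serve as the sole specification for one-shot programmatic classifiers that match the performance of the prompted VLM.

Insight: The novelty is closed-loop distillation that automatically produces a compact natural-language heuristic from training traces, substantially improving VLM reasoning without updating model weights. The same prompt also generalizes into a specification for programmatic classifiers, giving an efficient form of knowledge transfer.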

Abstract: Exploratory manipulation often turns an apparent failed attempt into the key evidence for what to do next. For example, a robot pulls a locked cabinet drawer, fails, and only succeeds after opening the lock. The failed pull reveals a latent precondition (the drawer is locked) that determines the minimal-success action chain (the fewest actions that complete the task), here [lock-open, drawer-pull]. Correctly reading this trace is therefore the prerequisite for recovering that chain. We formalize this setting as Exploratory Manipulation Trace QA (EMT-QA): given synchronized video and proprioception from an exploratory trace, predict the minimal-success action chain under the latent precondition revealed by the probe. However, even state-of-the-art VLMs and embodied multimodal LLMs misread this evidence: they do not reliably recover the chain from raw video, raw proprioception, or their combination. We introduce Closed-Loop Trace Distillation, a pipeline that uses a per-task coding agent to inspect labeled training traces and distill a one-line natural-language prompt over the trace, which we call the Distilled Reading Heuristic (DRH). At inference, no agent is invoked and no model weights are updated; a frozen VLM receives the raw trace plus the DRH as a prompt entry. Across three simulator and two real-robot tasks, the DRH improves chain accuracy by +0.38 to +0.47 over the best raw-modality baseline. The same DRH also serves as the sole specification for one-shot programmatic classifiers that match the prompted VLM.
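The abstract reports chain accuracy without defining it. A minimal sketch, assuming it is the fraction of traces whose predicted action chain exactly matches the reference chain (same actions, same order), could look like this. The function name is hypothetical, and the example chains reuse the drawer example from the abstract.

```python
def chain_accuracy(predicted_chains, reference_chains):
    """Fraction of traces whose predicted chain equals the reference chain.

    Assumes exact-match scoring: a chain counts as correct only if it has
    the same actions in the same order as the minimal-success chain.
    """
    if len(predicted_chains) != len(reference_chains):
        raise ValueError("predictions and references must have the same length")
    if not reference_chains:
        return 0.0
    correct = sum(
        1
        for predicted, reference in zip(predicted_chains, reference_chains)
        if list(predicted) == list(reference)
    )
    return correct / len(reference_chains)


# Two locked-drawer traces: the second prediction misses the unlock step.
references = [["lock-open", "drawer-pull"], ["lock-open", "drawer-pull"]]
predictions = [["lock-open", "drawer-pull"], ["drawer-pull"]]
score = chain_accuracy(predictions, references)
```

Under this assumed definition, an improvement of +0.38 to +0.47 would mean that many more chains per hundred traces are recovered exactly.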


[257] PhysGraph: A Physics-aware 3D Scene Graph for Perception and Reasoning cs.RO | cs.CVPDF

Haoyu Li, Aaron Thomas, Shuyan Zhou, Xianyi Cheng

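TL;DR: This paper presents PhysGraph, a framework that builds a physics-aware 3D scene graph to support robot perception and reasoning. From RGB-D observations it reconstructs object-centric 3D geometry, associates object instances across views, decomposes objects into functional parts, and infers materials and articulations through visual reasoning. PhysGraph achieves SOTA performance in semantic segmentation, multi-object mass estimation, and articulation prediction on synthetic and real-world datasets, and it supports constraint-aware 3D affordance prediction and real-to-sim transfer.

Details

Motivation: Existing approaches focus mainly on semantic retrieval and often overlook physical and kinematic factors. Methods that do model physical properties typically rely on narrow training sets or single-object modeling, which limits scalability and generalization across object types. A framework is needed that unifies symbolic reasoning with structured 3D geometry to model kinematic and physical properties in cluttered scenes.

Result: Evaluations on synthetic and real-world datasets show that PhysGraph achieves state-of-the-art (SOTA) results in semantic segmentation, multi-object mass estimation, and articulation prediction.

Insight: The novelty is combining symbolic reasoning with structured 3D geometry to model kinematic and physical properties of a scene, using functional part decomposition and visual reasoning to infer materials and articulations. The result is a physically consistent, semantically structured scene graph that serves as a structured 3D representation for downstream tasks.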

Abstract: To perform a wide range of daily tasks, robots need to construct a 3D representation that is semantically rich, physically grounded, and structured enough to support task planning and affordance prediction. However, existing approaches primarily focus on semantic retrieval, often overlooking physical and kinematic factors. Methods that attempt to model physical properties typically rely on narrow training sets or single-object modeling, limiting scalability and generalization across diverse object types. To address these challenges, we present PhysGraph, a framework that unifies symbolic reasoning with structured 3D geometry to model kinematic and physical properties in cluttered scenes. Given RGB-D observations, PhysGraph reconstructs object-centric 3D geometry and associates object instances across views. It then decomposes objects into functional parts and infers materials and articulations through visual reasoning. Evaluated on both synthetic and real-world datasets, PhysGraph achieves state-of-the-art results in semantic segmentation, multi-object mass estimation, and articulation prediction. With its simple yet effective design, PhysGraph produces physically consistent and semantically structured scene graphs, serving as a structured 3D representation for downstream tasks such as constraint-aware 3D affordance prediction and real-to-sim transfer, both of which are demonstrated in our experiments.


[258] PhysAgent: Automating Physics-Based 4D Synthesis via Trajectory-Grounded Multi-Agent Feedback cs.RO | cs.CVPDF

Chunji Lv, Jiaxi Ye, Yuchen Jiang, Rexar Lin, Changsheng Li

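TL;DR: This paper proposes PhysAgent, a simulator-in-the-loop multi-agent framework for automated, physically plausible 4D synthesis. It decouples intrinsic materials from extrinsic dynamics, uses a Semantic Agent to generate valid initializations, and drives Refine Agents with trajectory-grounded multi-agent feedback, enabling zero-shot macroscopic leaps and dynamic switching of discrete force fields so that stable, diverse physical scenes are generated efficiently.

Details

Motivation: Fully automated, physically plausible 3D motion synthesis remains unsolved. Existing methods show modality gaps and technical flaws in the complex force field optimization space: LLMs lack simulation feedback, which causes physical inaccuracies, while traditional Score Distillation Sampling suffers from sluggish gradients, local optima, and an inability to dynamically switch discrete force fields.

Result: Extensive experiments show that PhysAgent rapidly generates stable, diverse physical scenes from arbitrary multimodal prompts and significantly outperforms existing baselines in both generation diversity and physical accuracy.

Insight: The novelties are: the first simulator-in-the-loop multi-agent framework, which streamlines optimization by decoupling materials from dynamics; and trajectory-grounded multi-agent feedback, which converts explicit motion trajectories into structured textual descriptors and uses LLM commonsense reasoning to perform zero-shot macroscopic leaps and dynamic force field switching, effectively escaping local optima.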

Abstract: Achieving fully automated, physically plausible 3D motion synthesis is a core objective in graphics and generative AI. However, configuring complex environmental force fields still relies entirely on manual expert intervention, creating a severe bottleneck for large-scale simulation data generation. Existing automated methods primarily focus on material optimization and exhibit severe modality gaps and technical flaws when applied to the vastly more complex force field optimization space: naive Large Language Models (LLMs) lack underlying simulation feedback, causing severe physical inaccuracies, while traditional Score Distillation Sampling (SDS) suffers from sluggish gradients, local optima entrapment, and a mathematical inability to dynamically switch discrete force fields. To address this, we propose PhysAgent, the first simulator-in-the-loop multi-agent framework that leverages multimodal inputs for automated, physically grounded 4D synthesis. By decoupling intrinsic materials from extrinsic dynamics, PhysAgent utilizes a Semantic Agent equipped with an externalized Force Field Skill module to master simulation rules and generate valid initializations. Subsequently, the Refine Agents, driven by Trajectory-Grounded Multi-Agent Feedback, leverage vision foundation models to extract dense point trajectories from rendered frames. By converting these explicit motion trajectories into structured textual descriptors, the agent harnesses LLM commonsense reasoning to execute zero-shot macroscopic leaps, effectively escaping local optima and dynamically switching discrete force fields. Extensive experiments demonstrate that PhysAgent rapidly generates stable, diverse physical scenes from arbitrary multimodal prompts, significantly outperforming existing baselines in both generation diversity and physical accuracy.


[259] RGB-S: Image-Aligned Tactile Saliency for Robust Dexterous Manipulation cs.RO | cs.CVPDF

Shengcheng Luo, Kefei Wu, Xiaoying Zhou, Wanlin Li, Ziyuan Jiao

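TL;DR: This paper proposes RGB-S, a framework that projects tactile sensor locations onto the RGB image plane and renders force-modulated Gaussian saliency maps, explicitly aligning physical contact information with visual representations to make dexterous manipulation more robust under visual occlusion.

Details

Motivation: Existing approaches usually require policies to learn cross-modal correspondences implicitly from limited demonstrations, without geometric priors, which makes them data-inefficient and poor at generalizing when visual observations are degraded. This work addresses the fundamental challenge of robustly aligning sparse, heterogeneous tactile measurements with dense visual representations.

Result: The method is evaluated on six dexterous manipulation tasks in simulation and the real world under severe visual occlusion. In real-world experiments it improves occluded manipulation success rates by 26.7 percentage points over the strongest implicit visuo-tactile baseline, suggesting stronger spatial reasoning and robustness to occlusion.

Insight: The novelty is to anchor physical contact information explicitly in the image domain using robot forward kinematics and camera calibration, and to inject it into pre-trained visual backbones through a zero-initialized conditioning architecture, which preserves the pre-trained visual representations while adding physical contact priors.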

Abstract: Effective visuo-tactile integration is critical for robotic dexterous manipulation, especially when visual observations are unreliable or occluded. However, robustly aligning sparse, heterogeneous tactile measurements with dense visual representations remains a fundamental challenge. Most existing approaches require policies to learn cross-modal correspondences implicitly from limited demonstrations, without leveraging geometric priors. As a result, they are often data-inefficient and generalize poorly when visual observations are degraded. To address this limitation, we propose a framework that explicitly grounds physical contacts in the image domain. Using robot forward kinematics and camera calibration, we project tactile sensor locations directly onto the RGB image plane. We then render force-modulated Gaussian saliency maps to model spatial uncertainty arising from kinematic and calibration errors. By integrating these 2D spatial anchors through a zero-initialized conditioning architecture, our method injects physical contact priors into standard visual backbones while preserving pre-trained visual representations. We evaluate our method on six dexterous manipulation tasks in both simulation and the real world under severe visual occlusions. Real-world experiments show that explicit RGB-S grounding in the image domain improves real-world occluded manipulation success rates by $26.7$ percentage points over the strongest implicit visuo-tactile baseline, suggesting its improved spatial reasoning and robustness to occlusion. Project page: touch-as-saliency.github.io
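The projection and rendering steps in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: a pinhole camera model, a contact point already expressed in the camera frame (the forward-kinematics and extrinsic-calibration step is omitted), and force scaling the peak amplitude of the Gaussian. The function names and the `sigma_px` and `max_force` parameters are hypothetical and are not taken from the paper.

```python
import math


def project_point(point_cam, fx, fy, cx, cy):
    """Project a 3D point in the camera frame to pixel coordinates (pinhole model)."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return fx * x / z + cx, fy * y / z + cy


def render_saliency(width, height, contacts, sigma_px=2.0, max_force=10.0):
    """Render a saliency map from a list of ((u, v), force) contacts.

    Each contact becomes a 2D Gaussian centered at (u, v). The Gaussian
    spread stands in for kinematic/calibration uncertainty, and the peak
    is scaled by force / max_force, clipped to [0, 1] (an assumption).
    """
    saliency = [[0.0] * width for _ in range(height)]
    for (u, v), force in contacts:
        amplitude = min(max(force / max_force, 0.0), 1.0)
        for row in range(height):
            for col in range(width):
                dist_sq = (col - u) ** 2 + (row - v) ** 2
                value = amplitude * math.exp(-dist_sq / (2.0 * sigma_px ** 2))
                # Overlapping contacts keep the stronger response.
                saliency[row][col] = max(saliency[row][col], value)
    return saliency


# A fingertip on the optical axis, 1 m away, pressing with 5 N.
u, v = project_point((0.0, 0.0, 1.0), fx=100.0, fy=100.0, cx=8.0, cy=6.0)
saliency_map = render_saliency(16, 12, [((u, v), 5.0)])
```

The resulting map has the same resolution as the image, so it can be stacked with the RGB channels or fed to a conditioning branch. How the paper's zero-initialized conditioning consumes it is not covered by this sketch.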


[260] SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning cs.RO | cs.AI | cs.CVPDF

Yucheng Deng, Pingrui Lai, Xinhai Li, Chenjia Bai, Xiaoheng Deng

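TL;DR: SpaceVLN is a zero-shot vision-and-language navigation agent that understands the spatial structure of unseen environments through online spatial cognitive memory and reasoning. It uses a stagewise closed-loop framework that abstracts explored regions into Spatial Waypoints and dynamically maintains landmark evidence, forming a hierarchical Spatial Cognitive Memory. On top of this memory, it performs task-guided spatial reasoning through Spatial-CoT, which enables navigation in continuous environments without task-specific policy training.

Details

Motivation: Most existing zero-shot navigation agents rely on local visual cues and linear history-based reasoning, overlooking the spatial nature of navigation (explored regions, paths, landmarks, and their spatial relations). A navigation method is needed that effectively understands and uses spatial structure.

Result: Across R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON, SpaceVLN achieves state-of-the-art zero-shot performance, and real-robot deployment further validates its applicability.

Insight: The novelty lies in Spatial Cognitive Memory (a hierarchical representation built from Spatial Waypoints and landmark evidence) and Task-Guided Spatial Reasoning (Spatial-CoT, which integrates task progress with spatial perception). Together they provide a practical foundation for embodied navigation agents and handle both Vision-and-Language Navigation and Object-Goal Navigation in a unified way.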

Abstract: Vision-and-Language Navigation in continuous environments requires agents to understand the spatial structure of previously unseen environments in order to follow language instructions. Although foundation models have opened a promising path toward zero-shot navigation without task-specific policy training, many navigators still rely on local visual cues and linear history-based reasoning, overlooking the spatial nature of navigation across explored regions, traversed paths, landmarks, and their spatial relations. In this paper, we propose SpaceVLN, a navigation agent built around Spatial Cognitive Memory and Task-Guided Spatial Reasoning. Specifically, SpaceVLN introduces an efficient stagewise closed-loop framework where planning and execution are organized around verifiable space–landmark stages. During navigation, the agent progressively abstracts explored regions into Spatial Waypoints and dynamically maintains subtask-grounded landmark evidence, forming a hierarchical Spatial Cognitive Memory for progress localization and spatial-relation understanding. Built on this memory, Spatial-CoT integrates task-progress reasoning with spatial perception, analysis, and prediction, enabling Task-Guided Spatial Reasoning for embodied navigation. The unified stage interface enables SpaceVLN to address both Vision-and-Language Navigation and Object-Goal Navigation under a unified zero-shot setting, without task-specific policy training. Across R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON, SpaceVLN achieves state-of-the-art zero-shot performance, and real-robot deployment further validates its applicability. These results highlight Spatial Cognitive Memory and Task-Guided Spatial Reasoning as a practical foundation for stronger embodied navigation agents.


[261] Dense Force Estimation with an Event-based Optical Tactile Sensor cs.RO | cs.CV | cs.LGPDF

Agis Politis, René Zurbrügg, Valentina Cavinato

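TL;DR: This paper presents the first framework for dense 3D force field reconstruction with event-based optical tactile sensors. It estimates 3D surface displacements from event data and maps them to forces via the inverse Finite Elements Method (iFEM). Shear displacements are recovered with a proposed event-based marker tracking algorithm, and normal displacements are predicted by a convolutional neural network. Experiments show accurate reconstruction of physically grounded forces, with a mean absolute error of (0.14 N, 0.10 N, 0.93 N) over force ranges up to (4 N, 4 N, 20 N), at an average of 100 Hz.

Details

Motivation: Humans rely on spatially dense, geometry- and force-aware tactile feedback at high temporal resolution for dexterous manipulation. Conventional vision-based tactile sensors are limited by camera frame rates, motion blur, and data bandwidth, while existing methods for event-based optical tactile sensors predict only net forces and cannot provide a dense force field.

Result: In experiments, the method reconstructs physically grounded forces accurately, with a mean absolute error of (0.14 N, 0.10 N, 0.93 N) over force ranges up to (4 N, 4 N, 20 N), while running at an average of 100 Hz.

Insight: The novelties include the first dense 3D force field reconstruction framework for event-based optical tactile sensors, which combines an event-based marker tracking algorithm (for shear displacements) with a convolutional neural network (for normal displacements) and maps displacements to forces via iFEM. It is a first step toward dense force feedback for high-frequency control in robotic grasping and dexterous manipulation.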

Abstract: Humans rely on spatially dense, geometry and force-aware tactile feedback at high temporal resolution for dexterous manipulation. While vision-based tactile sensors enable dense force estimation, they are limited by camera frame rates, motion blur, and data bandwidth. Event-based optical tactile sensors offer an attractive alternative with microsecond temporal resolution and low motion blur, but existing methods are restricted to predicting only net forces. We introduce the first framework for dense 3D force field reconstruction using event-based optical tactile sensors. Our approach estimates 3D surface displacements from event data and maps them to forces via the inverse Finite Elements Method (iFEM). Shear displacements are recovered through the proposed event-based marker tracking algorithm, while normal displacements are predicted by a convolutional neural network trained on a collected dataset of synchronized force-displacement-event data. Experiments demonstrate accurate reconstruction of physically grounded forces, achieving a mean absolute error of (0.14 N, 0.10 N, 0.93 N) over force ranges up to (4 N, 4 N, 20 N), while operating at an average of 100 Hz. This work constitutes a first step toward enabling dense force feedback for high-frequency control in robotic grasping and dexterous manipulation.


[262] Efficient Minimal Solvers for Relative Pose Estimation in Autonomous Driving Applications cs.RO | cs.CVPDF

Tao Li, Liang Liu, Jianli Han, Weimin Lv

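TL;DR: For relative pose estimation in multi-camera systems for autonomous driving, this paper proposes an efficient unified framework based on a novel translation parameterization and a first-order rotation approximation, and it designs three minimal solvers tailored to driving scenarios. The solvers use, respectively, the vertical direction prior from an IMU, the rotation axis direction prior during steering, and a planar motion assumption for ground vehicles. By reducing both the number of required point correspondences and the algebraic complexity, they speed up hypothesis generation in RANSAC pipelines.

Details

Motivation: Existing methods are computationally expensive and rely heavily on abundant feature matches, which makes it hard to meet the strict real-time and robustness requirements of autonomous driving. This work targets efficient, robust relative pose estimation for time-sensitive driving scenarios.

Result: Extensive experiments on synthetic datasets and the KITTI autonomous driving benchmark show that the proposed solvers achieve a favorable balance between speed and accuracy compared with existing state-of-the-art algorithms.

Insight: The core novelty is combining scenario-specific priors (vertical direction, steering axis, planar motion) with a new mathematical parameterization (translation parameterization, first-order rotation approximation) in one efficient solver framework. This offers a route to fast, reliable pose estimation in resource-constrained real-time systems. In particular, speeding up RANSAC by lowering the minimal number of point correspondences is a broadly reusable strategy.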

Abstract: With the advancement of visual sensing systems, computer vision is playing an increasingly important role in autonomous driving and robot navigation. Relative pose estimation in multi-camera systems is essential for accurate vehicle localization and environment perception, demanding high real-time performance and robustness. Existing methods, however, often involve high computational costs and rely heavily on abundant feature matches, limiting their applicability in time-sensitive driving scenarios. To address these limitations, this paper introduces a unified framework for efficient relative pose estimation, built upon a novel translation parameterization and first-order rotation approximation. Within this framework, we propose three efficient minimal solvers specifically designed for autonomous vehicles. The first solver integrates the vertical direction prior from Inertial Measurement Units (IMUs), the second utilizes the rotation axis direction prior during steering maneuvers, and the third is designed for planar motion - a realistic assumption for ground vehicles operating on structured roads. By reducing both the minimal number of point correspondences and the algebraic complexity, our methods enable faster hypothesis generation within RANSAC-based pipelines, improving suitability for real-time systems. Extensive experiments on synthetic datasets and the KITTI autonomous driving benchmark demonstrate that the proposed solvers achieve a favorable balance between speed and accuracy compared to existing state-of-the-art algorithms.
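Why fewer point correspondences speed up RANSAC can be seen from the standard iteration-count formula N = log(1 - p) / log(1 - w^s), where w is the inlier ratio, s the minimal sample size, and p the desired confidence of drawing at least one all-inlier sample. The sketch below evaluates that textbook formula. The sample sizes and inlier ratio in the example are illustrative and are not the solver sizes from the paper, which the abstract does not state.

```python
import math


def ransac_iterations(inlier_ratio, sample_size, confidence=0.99):
    """Iterations needed so that, with probability `confidence`,
    at least one random minimal sample contains only inliers."""
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    p_all_inliers = inlier_ratio ** sample_size
    if p_all_inliers >= 1.0:
        return 1  # every sample is all-inlier
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_all_inliers))


# With 50% inliers, shrinking the minimal sample cuts the iteration count sharply.
iterations_by_sample_size = {s: ransac_iterations(0.5, s) for s in (2, 5, 8)}
```

Because each iteration also runs the solver once, a solver with both a smaller sample size and lower algebraic complexity reduces the cost twice: fewer iterations, and cheaper iterations.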


[263] DexPIE: Stable Dexterous Policy Improvement from Real-World Experience cs.RO | cs.CVPDF

Ruizhe Liao, Wenrui Chen, Liangji Zeng, Haoran Lin, Fan Yang

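TL;DR: This paper proposes DexPIE, a post-training framework for improving dexterous manipulation policies, which uses experience collected during real-world deployment to improve policies trained by imitation learning. Through an intervention system, multi-stage data collection, asynchronous inference, and a continuous optimality indicator, it addresses compounding errors and low data efficiency in imitation learning.

Details

Motivation: Dexterous manipulation is a major challenge for imitation learning because of its high-dimensional action space and complex contact dynamics. Policies trained only on demonstrations tend to suffer compounding errors at deployment and need large amounts of expert data to become reliable. To move beyond the limits of demonstration data, this work improves the policy with experience collected in real-world deployment.

Result: On three challenging real-world dexterous manipulation tasks, DexPIE improves the success rate by 37% over the demonstration-based reference policy, outperforming all baseline methods and showing stronger robustness.

Insight: The novelties are: an intervention system adapted to dexterous hands and multi-stage DAgger-style data collection for effective exploration coverage; asynchronous inference in the relative action space, which reduces temporal noise and yields more consistent policy evaluation; and policy improvement through conditioning on a continuous optimality indicator, which uses data quality in a fine-grained way. Together they give a systematic recipe for stable policy improvement from real-world experience.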

Abstract: Dexterous manipulation presents substantial challenges for imitation learning due to its high-dimensional action space and complex contact-rich dynamics. Policies trained purely from demonstrations often suffer from compounding errors during deployment and require large amounts of expert data to achieve reliable performance. To move beyond the limitations of demonstration data, in this work, we propose DexPIE, a post-training framework for dexterous policy improvement from experience collected through real-world deployment. First, DexPIE enables effective exploration coverage through a dexterous-hand-adapted intervention system and multi-stage DAgger-style data collection across initial and intermediate task stages, providing reliable supervision for accurate policy evaluation. To reduce temporal noise between post-training rollouts and demonstration data, we introduce asynchronous inference in the relative action space, which better aligns rollout data with demonstrated behavior and allows the critic to learn a value function induced by a more consistent underlying policy. Finally, DexPIE improves the policy through conditioning on a continuous optimality indicator, allowing the policy to leverage the quality of data in a more fine-grained manner. Across three challenging real-world dexterous manipulation tasks, DexPIE achieves a 37% improvement in success rate over the demonstration-based reference policy, outperforming all baseline methods and demonstrating stronger robustness. The source code and dataset will be made publicly available.


[264] AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing cs.RO | cs.AI | cs.CVPDF

Jisong Cai, Long Ling, Shiwei Chu, Zhongshan Liu, Jiayue Kang

TL;DR: 本文提出了一种异步、视野自适应的世界-动作模型AHA-WAM,用于机器人操作任务。该模型基于双扩散Transformer架构,将世界预测(视频分支)和动作执行(动作分支)解耦到不同的时间节奏上,其中视频分支作为低频世界规划器,动作分支作为高频执行器,并通过观察引导的上下文路由机制进行交互。

Details

Motivation: 现有世界-动作模型将世界预测和动作执行耦合在同一时间分辨率上,导致视频分支需要建模冗余且信息量少的短期帧变化,未能充分利用视频分支在具身控制中的潜力。

Result: 在RoboTwin仿真和真实世界操作任务上的实验表明,AHA-WAM无需任何机器人数据预训练即达到了SOTA性能,在RoboTwin上平均成功率92.80%,在4个真实世界任务上成功率78.3%,并以24.17 Hz的频率实现闭环控制,相比Fast-WAM有4.59倍加速。

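Insight: The core contribution is an asynchronous, horizon-adaptive paradigm for world-action modeling. By decoupling the temporal rhythms of world prediction and action execution and adding an observation-guided context routing mechanism, the model can exploit long-horizon scene evolution efficiently while staying responsive to the real-time execution state.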

Abstract: World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, forcing the world branch to model near-term frame variations that are redundant and weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. Therefore, we propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry. AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains rolling key-value memory over past observations and exposes reusable layerwise latent context encoding long-horizon scene evolution, while a high-frequency action DiT executes short action chunks in closed loop by querying this context through layerwise joint attention. To support asynchronous execution, we introduce horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), which together let the action expert exploit long-horizon world context while remaining responsive to real-time execution state without rerunning the video DiT. Experiments on RoboTwin and real-world manipulation tasks show that AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success on RoboTwin and 78.3% success across 4 real-world tasks, while reaching 24.17 Hz closed-loop control with a 4.59x speedup over Fast-WAM.
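The asynchronous split described in the abstract can be illustrated with a toy control loop. This is only a scheduling sketch under stated assumptions, not the AHA-WAM architecture: `AsyncController`, `slow_planner`, `fast_actor`, and `plan_every` are hypothetical names, and the arithmetic inside the two functions is a placeholder for the video DiT and action DiT. The point it shows is that the expensive planner runs once every few steps and its cached context is reused by the actor at every step.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AsyncController:
    """Toy scheduler: slow planner refreshes a cached context, fast actor reuses it."""
    plan_every: int                    # hypothetical: action steps per planner refresh
    planner_calls: int = 0
    actor_calls: int = 0
    context: Optional[float] = None    # stand-in for the cached long-horizon context

    def slow_planner(self, obs: float) -> float:
        # Placeholder for the low-frequency world planner (assumption, not the real model).
        self.planner_calls += 1
        return 0.5 * obs

    def fast_actor(self, obs: float, context: float) -> float:
        # Placeholder for the high-frequency executor: it reads the cached context
        # plus the current observation, without rerunning the planner.
        self.actor_calls += 1
        return context + 0.1 * obs

    def step(self, t: int, obs: float) -> float:
        # Refresh the context only on planner ticks; act on every tick.
        if t % self.plan_every == 0:
            self.context = self.slow_planner(obs)
        return self.fast_actor(obs, self.context)


controller = AsyncController(plan_every=4)
actions = [controller.step(t, float(t)) for t in range(12)]
print(controller.planner_calls, controller.actor_calls)
```

In this toy setting the planner cost is amortized over `plan_every` action steps, which is the general reason a decoupled design can raise the closed-loop control rate.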


[265] MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models cs.RO | cs.CV

Hao Shi, Weiye Li, Bin Xie, Yulin Wang, Renping Zhou

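TL;DR: MemoryVLA++ is a full temporal modeling framework for robot manipulation that equips vision-language-action (VLA) models with memory and imagination to handle long-horizon, temporally dependent tasks. It combines working memory, episodic memory, and a world model, retrieving historical context and imagining future states to generate temporally consistent action sequences.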

Details

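Motivation: Existing VLA models rely mainly on the current observation and struggle with long-horizon, temporally dependent manipulation tasks. Inspired by human cognitive mechanisms (working memory, episodic memory, and internal models), the paper aims to give VLA models full temporal modeling capability.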

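Result: Extensive experiments cover 5 simulation benchmarks (Libero, SimplerEnv, Mikasa-Robo, Calvin, Libero-Plus) and 3 categories of real-robot tasks. On real robots, the method achieves gains of +9%, +26%, and +28% on general, memory-dependent, and imagination-dependent tasks, respectively.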

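Insight: The contribution is a systematic integration of memory and imagination mechanisms from cognitive science into VLA models, in the form of a full temporal modeling framework with a Perceptual-Cognitive Memory Bank and a world model operating in a denoising latent space. This offers a new angle on long-horizon planning in robot manipulation.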

Abstract: Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely primarily on the current observation and therefore struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived context, the hippocampal system to preserve episodic memory of past experience, and internal models to imagine possible future state evolution. Inspired by these mechanisms, we propose MemoryVLA++, a full temporal modeling framework that equips VLA models with memory and imagination for robotic manipulation. A pretrained VLM encodes the current observation into perceptual and cognitive tokens, forming working memory. These tokens query a Perceptual-Cognitive Memory Bank to retrieve relevant historical context. This bank stores low-level details and high-level semantics from past interactions, and is updated through redundancy-aware consolidation. A world model imagines future states in a denoising latent space, and the imagined latents are integrated under memory guidance to form full temporal-aware tokens. The resulting tokens condition a diffusion action expert to predict temporally consistent action sequences. We conduct extensive experiments on 5 simulation benchmarks and 3 categories of real-robot tasks across 3 robots, covering general manipulation, long-horizon temporal tasks, robustness, and generalization. Our method achieves strong performance across Libero, SimplerEnv, Mikasa-Robo, Calvin, Libero-Plus, and diverse real-robot tasks, validating the effectiveness of full temporal modeling with memory and imagination. For example, on real robots, it achieves +9%, +26%, +28% gains on general, memory-dependent, and imagination-dependent tasks. Project Page: https://shihao1895.github.io/MemoryVLA-PP-Web


cs.HC

[266] A Systematic Study of Behavioral Cloning for Scientific Data Annotation cs.HC | cs.AI | cs.CV | cs.LG | physics.data-an

Ishaan Singh Chandok, Core Francisco Park

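TL;DR: This paper systematically studies behavioral cloning for scientific data annotation. It proposes a framework of 9 synthetic tasks that simulate expert annotation behavior (exploration, mistake correction, and strategic decision-making) and reports key findings, including hierarchical emergence of skills and more efficient fine-tuning after multi-task pretraining.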

Details

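Motivation: Scientific data annotation (for example, tracking animals in video or proofreading neural reconstructions) suffers from a “last mile” problem: even with automation, verification and correction still require substantial human effort. The paper proposes supervising on how experts interact during annotation (navigating, clicking, verifying) rather than only predicting the final annotations.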

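Result: Experiments on the synthetic task benchmark show that: models learn GUI mechanics before task decisions, make fewer mistakes than the training data, and still correct errors when they occur; scaling up model size improves data efficiency in multi-task behavioral cloning; fine-tuning on new tasks works after multi-task pretraining, while training from scratch fails; and linear probes reveal internal representations of latent variables of the annotation process (such as task phase and data position) along with a mistake representation shared across tasks.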

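Insight: The contribution is a systematic framework for studying behavioral cloning that uses synthetic tasks to simulate realistic annotation strategies. It identifies hierarchical skill learning, the effectiveness of multi-task pretraining, and a cross-task shared internal mistake representation, providing benchmarks and a foundation for applying behavioral cloning to real scientific annotation.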

Abstract: Scientific data annotation, such as tracking animals in video or proofreading neural reconstructions, remains bottlenecked by the “last mile” problem: even with strong automation, verification and correction consume substantial human effort. Standard approaches train models to directly predict annotations, discarding the rich supervision in how experts navigate, click, verify, and correct. We introduce a framework for studying behavioral cloning on scientific annotation: 9 synthetic tasks paired with synthetic annotations that simulate realistic human strategies including exploration, mistake correction, and strategic decision-making. Our experiments reveal several findings. First, skills emerge hierarchically: models learn GUI mechanics before task-critical decisions, and commit fewer mistakes than the training data while retaining the ability to correct errors when they occur. Second, scaling models on multi-task behavioral cloning shows that larger models are more data efficient within our scale range. Third, multi-task pretraining enables efficient fine-tuning to new tasks, while training from scratch fails entirely. Fourth, linear probes reveal that models internally represent latent variables of the annotation process such as task phase and data position; interestingly, we find a shared mistake representation that generalizes across different annotation tasks. Overall, our framework establishes systematic benchmarks and identifies key bottlenecks, providing a foundation for scaling behavioral cloning to real-world scientific data annotation.


[267] Multimodal Large Language Models as Synthetic Participants in Video-Based Studies: An Evaluation cs.HC | cs.AI | cs.CV | cs.CY | cs.MM

Prabal Shrestha, Bohan Jiang, Haoning Xue, Huan Liu, Xinyi Zhou

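TL;DR: This paper evaluates whether multimodal large language models (MLLMs) can act as synthetic participants that simulate subjective human responses in video-based studies. Grounded in the Perceived Message Sensation Value (PMSV) framework, it compares ratings of sensory engagement with short videos from human participants and from profile-conditioned MLLMs (such as Gemini 3 Flash and Qwen 3 Omni). Even leading MLLMs show limited agreement with human ratings, exhibit a downward mean shift and central-tendency bias, and respond to prompting strategies in complex ways.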

Details

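Motivation: MLLMs perform strongly on objective tasks such as video understanding, but it remains unclear whether they can approximate subjective human responses that depend on individuals' social contexts. The paper addresses this gap by assessing the feasibility of MLLMs as synthetic participants in an emerging task: rating perceived sensory engagement with short videos.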

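Result: In an evaluation based on a 17-item scale (measuring emotional arousal, dramatic impact, and novelty), leading MLLMs (Gemini 3 Flash and Qwen 3 Omni) show limited agreement with human participants (n=673). The models exhibit a clear downward mean shift and central-tendency bias, both introduce and flatten subgroup differences, and show inconsistent sensitivity to participant profiles.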

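Insight: The contribution is a systematic evaluation of MLLMs as synthetic participants. It exposes systematic biases in how they simulate subjective human responses (such as mean shift and central tendency) and shows that prompting strategies have non-monotonic effects, improving some aspects while worsening others. These findings outline the main challenges and opportunities for building more reliable MLLM simulation tools.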

Abstract: Multimodal large language models (MLLMs) have shown strong performance on objective tasks such as video understanding and reasoning. However, it remains unclear whether they can approximate subjective human responses, which depend not only on content comprehension but also on individuals’ social contexts. To address this gap, we evaluate MLLMs as synthetic participants in an emerging task: assessing perceived sensory engagement with short videos. Grounded in the Perceived Message Sensation Value (PMSV) framework, we compare ratings from recruited human participants and profile-conditioned MLLM simulations (n=673) using a 17-item scale measuring emotional arousal, dramatic impact, and novelty. We find that even leading MLLMs (Gemini 3 Flash and Qwen 3 Omni) show limited agreement with human participants. The models exhibit distinct downward mean-shift and central-tendency biases in their rating distributions. They both introduce and flatten subgroup differences, while showing inconsistent sensitivity to participant profiles. Prompting strategies affect these metrics differently, modestly improving some aspects while worsening others. These results highlight both the challenges and opportunities of developing MLLMs as synthetic participants in video-based research. Data and code: https://github.com/MINDLab25/mllm-human-simulation-eval


eess.AS

[268] Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading eess.AS | cs.CL | cs.SD

Eder del Blanco, David Gimeno-Gómez, Eva Navas, Carlos-D. Martínez-Hinarejos, Inma Hernáez

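TL;DR: This paper proposes a robust silent speech synthesis framework based on cross-modal masking that jointly uses surface electromyography (sEMG) and lipreading video signals. By applying a modality masking strategy during training, the method substantially reduces word error rate in multispeaker settings and improves robustness when a modality is degraded or missing.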

Details

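Motivation: Existing multimodal fusion methods for silent speech interfaces (SSIs) that combine sEMG and lipreading are not robust to modality degradation or temporary sensor failure. The paper addresses this to improve applicability in realistic scenarios.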

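Result: In multispeaker settings, word error rate drops by up to 14 absolute percentage points relative to the strongest unimodal baseline. The approach is more robust under low-bitrate conditions and generalizes better than degradation-specific data augmentation.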

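Insight: The contribution is a cross-modal masking strategy applied during training, which forces the model to learn complementary articulatory information. Phone-level analysis shows particularly strong gains for vowels and specific consonant groups, making the method a useful template for robust multimodal integration.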

Abstract: Speech restoration through silent speech interfaces (SSIs) has emerged as a promising assistive technology for individuals with impaired or absent laryngeal voice production. Among non-invasive SSI modalities, surface electromyography (sEMG) and video-based lipreading provide complementary articulatory information, yet their integration for continuous speech synthesis remains underexplored. Moreover, existing multimodal approaches rarely address robustness to modality degradation or temporary sensor failure, limiting their applicability in realistic scenarios. In this work, we propose a masked multimodal speech synthesis framework that jointly leverages sEMG and lipreading signals through modality masking during training. Under multispeaker settings, the proposed approach reduces word error rate by up to 14 absolute percentage points compared to the strongest unimodal baseline. Experimental results not only show that masking strategies are critical for these performance gains and robustness under low-bitrate conditions, but also that they generalize better than degradation-specific data augmentations in the presence of modality absence conditions. Phone-level analyses further reveal complementary contributions across modalities, with particularly strong benefits for vowels and for specific consonant groups. Overall, these findings demonstrate the effectiveness and robustness of masked multimodal integration for silent speech synthesis, although adaptation to laryngectomized speakers remains an open research challenge.
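Modality masking during training can be sketched as randomly blanking one input stream per training example. The following is a minimal illustration under stated assumptions, not the authors' implementation: the function name `mask_modalities`, the masking probability `p_mask`, the choice to mask by zeroing, and the rule that at least one modality is always kept are all hypothetical choices made for this example.

```python
import numpy as np


def mask_modalities(emg, video, p_mask=0.3, rng=None):
    """Randomly zero out one modality so a model must rely on the other.

    emg, video: feature arrays for one training example (shapes are arbitrary).
    p_mask: hypothetical per-modality masking probability.
    Returns the (possibly masked) arrays and the pair of drop flags.
    """
    rng = np.random.default_rng() if rng is None else rng
    drop_emg = rng.random() < p_mask
    drop_video = rng.random() < p_mask

    # Assumption for this sketch: never mask both streams at once,
    # so every example still carries some articulatory signal.
    if drop_emg and drop_video:
        if rng.random() < 0.5:
            drop_emg = False
        else:
            drop_video = False

    emg_out = np.zeros_like(emg) if drop_emg else emg
    video_out = np.zeros_like(video) if drop_video else video
    return emg_out, video_out, (drop_emg, drop_video)


rng = np.random.default_rng(0)
emg_features = np.ones((50, 8))     # toy sEMG features: 50 frames x 8 channels
lip_features = np.ones((50, 16))    # toy lip features: 50 frames x 16 dims
emg_in, lip_in, flags = mask_modalities(emg_features, lip_features, rng=rng)
print(flags)
```

Because the model sees examples where either stream is absent, the same masking path can also be reused at evaluation time to simulate a sensor failure.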


physics.optics

[269] Beyond the Thin-Layer Limit: Differentiable Volumetric Training for Visible-Range Diffractive Neural Networks physics.optics | cs.CV

Dineth Jayakody, Dushan N. Wadduwage

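TL;DR: This paper addresses the design-to-device mismatch in visible-range diffractive deep neural networks (D2NNs). It argues that conventional training based on the thin-layer approximation fails at visible wavelengths not because of the short wavelength, but because low-refractive-index materials require thicker structures in which intra-layer diffraction and phase accumulation become significant. The authors propose a differentiable beam-propagation layer (∂BPM) that models each diffractive element as a finite-thickness volume and simulates light propagation through it during training, keeping the fabrication-compatible height map end-to-end trainable. The method substantially improves results on MNIST, Fashion-MNIST, and CIFAR-100 classification and imaging tasks, and full-wave FDTD validation raises classification accuracy from 50% to 90%.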

Details

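Motivation: The goal is to remove the main obstacle to carrying D2NNs over from the terahertz regime to the visible range. Conventional training relies on the thin-layer approximation, which treats each diffractive layer as infinitely thin. This breaks down with the low-refractive-index materials used at visible wavelengths, because the thicker relief structures they require produce significant intra-layer diffraction and phase accumulation, causing a severe mismatch between the designed model and the fabricated device.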

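Result: On classification and imaging tasks with MNIST, Fashion-MNIST, and CIFAR-100, ∂BPM training substantially reduces the design-to-device mismatch. In full-wave FDTD validation, without re-optimization, classification accuracy rises from about 50% with the conventional method to about 90%.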

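Insight: The core contribution is a break from the long-dominant thin-layer training paradigm in favor of a differentiable volumetric training framework (∂BPM). It offers a scalable, physics-aware bridge between efficient optical neural-network optimization and fabrication-consistent diffractive design, supporting practical visible-range D2NNs.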

Abstract: Diffractive deep neural networks (D2NNs) promise miniaturized, power-efficient, light-speed optical front-ends for machine vision, yet the most mature demonstrations remain in the terahertz regime, built from readily fabricated millimeter-scale neurons. Translating D2NNs to the visible range, where nearly all vision pipelines operate, was long blamed on the difficulty of fabricating nanoscale neurons; but even after recent advances removed that barrier, visible-range D2NNs matching their terahertz counterparts remain out of reach. We identify the true obstacle as the thin-layer approximation underlying nearly all D2NN training, which treats each diffractive layer as an infinitely thin mask. It fails not because of the short wavelength, as is commonly assumed, but because the low-refractive-index materials (n approximately 1.3-1.5) used at visible wavelengths require relief structures thick enough that intra-layer diffraction and phase accumulation become significant. To overcome this, we introduce a differentiable beam-propagation ($\partial$BPM) layer that models each element as a finite-thickness volume and propagates light through it during training, keeping the fabrication-compatible height map end-to-end trainable without full-wave simulation in the loop. Across MNIST, Fashion-MNIST, and CIFAR-100 classification and imaging, $\partial$BPM training substantially reduces the design-to-device mismatch, and full-wave FDTD validation raises classification accuracy from 50% to 90% without re-optimization. The $\partial$BPM layer thus offers a scalable, physics-aware bridge between efficient optical neural-network optimization and fabrication-consistent diffractive design.
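The thickness argument can be checked with the standard thin-element phase relation: a relief of height h in a material of index n, surrounded by a medium of index n0, delays the wave by 2π(n − n0)h/λ, so a full 2π of phase modulation needs h = λ/(n − n0). The sketch below evaluates this for the index range quoted in the abstract. The 550 nm wavelength and the air surround (n0 = 1.0) are assumptions chosen for illustration, and the function names are hypothetical.

```python
def full_wave_relief_height(wavelength_m, n_material, n_surround=1.0):
    """Relief height (in meters) that produces a 2*pi phase delay."""
    return wavelength_m / (n_material - n_surround)


def height_in_wavelengths(n_material, n_surround=1.0):
    """Same height expressed in units of the free-space wavelength."""
    return 1.0 / (n_material - n_surround)


wavelength = 550e-9  # assumed green-light wavelength, not taken from the paper
for n in (1.3, 1.5):
    h = full_wave_relief_height(wavelength, n)
    ratio = height_in_wavelengths(n)
    print(f"n = {n}: h = {h * 1e6:.2f} um, h / wavelength = {ratio:.2f}")
```

The ratio h/λ = 1/(n − n0) does not depend on the wavelength at all, only on the index contrast. For n between 1.3 and 1.5 in air, the relief must be roughly 2 to 3.3 wavelengths thick, which is consistent with the abstract's point that the low index, rather than the short wavelength, is what makes an infinitely-thin-mask model questionable.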


cs.SD

[270] TinyGiantALM: A Compact Audio-Language Model for Intent-Aware Reasoning under Resource Constraints cs.SD | cs.CL | eess.AS

Vinh-Thuan Ly

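TL;DR: This paper presents TinyGiantALM, a compact audio-language model with only 1.5B parameters, aimed at the difficulty of deploying large audio reasoning models in resource-constrained environments. Its core idea is an Instruction-Aware Feature Refinement framework that uses a Query-guided Projector and Semantic Gating to filter acoustic signals according to user intent. On the MMAR benchmark, the model reaches 46.4% zero-shot accuracy, significantly outperforming baselines with roughly 4-8x more parameters.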

Details

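Motivation: Current progress in audio reasoning depends on massive Large Audio-Language Models, which hinders deployment in resource-constrained environments. The paper aims to design an efficient, compact alternative that delivers robust perception on edge devices.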

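Result: On the MMAR benchmark, TinyGiantALM achieves 46.4% zero-shot accuracy, significantly outperforming 7B-13B baselines. A gap in logical narrative reasoning remains versus 30B+ models, and there are trade-offs in overly dense or spatial scenes, but the approach notably surpasses models up to 8x larger at disentangling mixed-modality environments.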

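Insight: The main contribution is the Instruction-Aware Feature Refinement framework, which filters acoustic signals by intent through query-guided projection and semantic gating. More broadly, the study suggests that architectural precision, rather than scaling alone, is a viable path to strong perception at edge-friendly scales.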

Abstract: Current advancements in Audio Reasoning rely on massive Large Audio-Language Models (LALMs), hindering deployment in resource-constrained environments. We introduce TinyGiantALM, a compact 1.5B efficiency-oriented alternative. Instead of brute-force scaling, we propose an Instruction-Aware Feature Refinement framework using a Query-guided Projector and Semantic Gating to filter acoustic signals based on user intent. On the MMAR benchmark, TinyGiantALM achieves 46.4% zero-shot accuracy, significantly outperforming 7B-13B baselines. While a reasoning gap in logical narrative remains versus 30B+ models and certain trade-offs exist in overly dense or spatial scenes, our approach notably surpasses models up to 8x larger in disentangling mixed-modality environments. These findings demonstrate that architectural precision offers a tangible pathway to secure robust perception capabilities on edge-friendly scales.


eess.IV

[271] Programmable Silicon Retina on Pixel Processor Array eess.IV | cs.CV

Maciej Lewandowski, Prince Philip, Alexandre Marcireau, Chetan Singh Thakur, André van Schaik

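TL;DR: This paper presents the first implementation of a multi-stage silicon retina model on the SCAMP-5 Pixel Processor Array, together with a GPU-based simulation framework. The model adds biologically inspired processing stages such as spatial filtering and gain control. On video saliency prediction, it lowers prediction loss by 13% compared with the standard dynamic vision sensor (DVS) event representation while reducing the event rate by about 47%.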

Details

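Motivation: The paper explores whether adding biologically inspired processing stages, specifically spatial filtering and gain control, on top of a standard dynamic vision sensor offers advantages for downstream tasks such as saliency prediction.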

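Result: The model is less effective on video intensity reconstruction. On video saliency prediction, using a lightweight FireNet-style network (about 100k parameters), it reduces prediction loss by 13% relative to the standard DVS event representation and cuts the event rate by about 47%.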

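Insight: The silicon retina's “information distillation” mechanism can give downstream neural networks a more efficient representation, particularly in bandwidth-constrained edge applications. Combining biologically inspired models with pixel processor array hardware is a promising direction for efficient edge vision processing.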

Abstract: Standard dynamic vision sensors approximate retinal processing by detecting temporal contrast changes, offering high speed and high dynamic range. In this work, we explore whether incorporating additional biologically inspired processing stages - specifically spatial filtering and gain control - can offer advantages for certain downstream tasks such as saliency prediction. We present the first implementation of a multi-stage Silicon Retina model on the SCAMP-5 Pixel Processor Array, along with a GPU-based simulation framework. We evaluate the performance of our model on Video Intensity Reconstruction and Video Saliency Prediction. While the bio-inspired model is less effective at reconstructing absolute intensity frames, it achieves a 13% reduction in saliency prediction loss in comparison to standard DVS event representation, while reducing the event rate by approximately 47%. These experiments are obtained using a lightweight $\approx 100$k-parameter FireNet-style network, adapted from event-based reconstruction to saliency prediction. These results suggest that the silicon retina’s “information distillation” mechanism can achieve a more efficient representation for downstream neural networks, particularly in bandwidth-constrained edge applications.


cs.CY

[272] Friend or Foe? Language as an ideological switch in open-weight LLMs under Russian disinformation stress cs.CY | cs.CL

Anna Małgorzata Kamińska, Tetiana Klynina

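TL;DR: Through a systematic audit, this paper finds that open-weight LLMs fine-tuned for different linguistic communities respond to Russian disinformation in ways that contradict the assumed effect of cultural alignment: the Ukrainian-oriented model shows the weakest resistance to disinformation when queried in Russian, while the Russian-oriented model shows the strongest rejection. The study calls this a Fine-Tuning Paradox and finds that corpus composition, language coverage, and prompt format matter more than nominal cultural provenance.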

Details

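Motivation: Industry and policy discourse commonly assumes that culturally aligned fine-tuning automatically encodes the political orientation of the target community (for example, that a Ukrainian-oriented model will resist Russian narratives). The paper tests this assumption empirically in the context of information warfare around Russia's war against Ukraine.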

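Result: The audit covers ten contested wartime narratives, including Crimea, “denazification”, the “one people” thesis, and atrocity denial at Bucha and Mariupol, with queries in Ukrainian, Russian, and English. The Ukrainian-oriented model shows the weakest resistance to Russian disinformation in Russian-language queries, while the Russian-oriented model exhibits the strongest rejection.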

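Insight: The contribution is the identification of the “Fine-Tuning Paradox”, which challenges the industry belief that cultural alignment necessarily brings narrative resilience. The study stresses that model behavior is sensitive to operational factors such as corpus composition and query language rather than to simple political labels, which has implications for assessing LLM deployment risk in the context of digital sovereignty and hybrid warfare.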

Abstract: As Russia’s war against Ukraine extends into generative AI, large language models (LLMs) adapted for local post-Soviet languages are deployed in contested information environments. Policy and industry discourse assumes that culturally aligned adaptation encodes the political orientation of the target community: a Ukrainian-oriented model will resist Russian narratives, a Russian-oriented one will reinforce them. Does it? This article systematically disconfirms that assumption. We run a controlled audit of four openly available LLMs sharing a common base model but fine-tuned for different linguistic communities, querying them in Ukrainian, Russian and English across ten contested wartime narratives: Crimea, “denazification”, the “one people” thesis, and atrocity denial at Bucha and Mariupol. The result is a Fine-Tuning Paradox: the Ukrainian-oriented model shows the weakest resistance to Russian disinformation in Russian, while the Russian-oriented one exhibits the strongest rejection. Corpus composition, language coverage and prompt format prove more decisive than nominal cultural provenance. We situate these findings within debates on hybrid warfare, digital sovereignty and post-imperial information orders, arguing that the principal threat to regional information sovereignty is not adversarial fine-tuning but the untested assumption that cultural alignment guarantees resilience.


[273] Frankenstein in the Pipeline: Computational Epistemicide in Facial Recognition cs.CY | cs.CV

Nina da Hora

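TL;DR: This paper offers a critical analysis of embedding-based facial recognition, comparing it to a Frankensteinian process of disassembly and reassembly. It shows how detection, alignment, vectorization, and related steps turn living faces into standardized data points, enacting a form of “computational epistemicide”, and argues for abolishing the governance of rights on the basis of vectorized identity.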

Details

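Motivation: The paper aims to expose the violent operational mechanisms implicit in the facial recognition pipeline, and to critique how datafication destroys the face as a living, relational surface and installs a numerical proxy as the privileged site of identity.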

Result: The paper provides no quantitative experiments or benchmark results. It is a critical theoretical analysis and a diagnostic framework.

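Insight: The paper uses the Frankenstein story as a diagnostic framework for method, introduces the concept of “computational epistemicide”, and analyzes how the facial recognition pipeline enacts violence through standardization. It ultimately takes the normative position that this technical regime should be abolished rather than improved through reformist “ethical AI” optimization.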

Abstract: While the eugenic roots of computer vision are well-documented in critical technology studies, less attention has been paid to the operational mechanisms through which this violence is enacted at the level of the pipeline. This paper employs Mary Shelley’s Frankenstein not as a metaphor for unintended consequences, but as a diagnostic framework for method: disassembly, reconstruction, and the production of a creature whose legitimacy is asserted by the procedure that made it. I argue that embedding-based facial recognition enacts what I call computational epistemicide (an extension of Sueli Carneiro’s concept of epistemicide to the computational domain) by destroying the face as a living, relational surface and authorizing a numerical proxy as the privileged site of identity. Across detection/cropping, landmarking, alignment/frontalization, and embedding, the face is progressively narrowed to what can be stabilized as data, producing a canonical face as the condition of legibility and a corresponding form-subject as the condition of recognition. Vectorization completes the Frankensteinian “stitching”: the dissected face is reassembled into a fixed-dimensional artifact designed to circulate across databases and institutions. I then show how distance-based similarity and thresholding operationalize a norm of “close enough,” making recognition inseparable from standardization and rendering reformist “ethical AI” optimization structurally insufficient. The paper concludes by arguing for abolition as a normative stance: refusing vectorized identity as a legitimate basis for rights and access, and dismantling the institutional impulse to govern human life through dissectible data points.


[274] Interpretable Crisis Behavior Analysis Using Mobility and Social Media Data cs.CY | cs.CL

Muhammad Hamza Arshad Majeed, Sidahmed Benabderrahmane, Talal Rahwan

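TL;DR: This paper presents a unified, interpretable pipeline that integrates mobility and social media data to identify cross-domain behavioral patterns in crisis settings. The framework is evaluated through two case studies: a short-horizon analysis of the January 2025 Los Angeles wildfires and a longitudinal analysis of UAE COVID-19 behavior from March 2020 to December 2021.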

Details

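Motivation: During crises, mobility patterns and online emotional discourse evolve jointly, yet they are usually studied in isolation. The paper addresses this by integrating multimodal data to understand cross-domain behavioral patterns in crises.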

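Result: In the wildfire case, traffic stress, fear/anger sentiment, and governance discourse are tightly coupled within a 33-day window, with key rules reaching 100% confidence and lift up to 2.5. In the COVID-19 case, the pipeline yields 8 stable same-day rules (88% holdout pass rate) and 40 predictive rules with lead horizons of 2-7 days.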

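Insight: The contribution is an interpretable multimodal fusion pipeline that combines Formal Concept Analysis with association rule mining and adds a structured policy-translation layer, which turns robust rules into actionable briefs. The result is crisis intelligence that is both scientifically credible and usable for policy.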

Abstract: Crises alter both how people move and how they communicate. During emergencies such as wildfires and pandemics, changes in mobility patterns and online emotional discourse evolve jointly, yet they are typically studied in isolation. This paper presents a unified and interpretable pipeline that integrates mobility and social media data to identify cross-domain behavioral patterns in crisis settings. The framework is evaluated through two case studies: a short-horizon analysis of the January 2025 Los Angeles wildfires (prototype case) and a longitudinal analysis of UAE COVID-19 behavior from March 2020 to December 2021 (primary case, 671 days). The pipeline aligns heterogeneous daily signals, transforms them into binary behavioral states, applies Formal Concept Analysis (FCA) to extract co-occurrence structure, mines association rules, and validates rule stability through chronological holdout testing. A structured policy-translation layer renders robust rules as operational briefs specifying triggers, lead times, and action playbooks. Results reveal clear cross-domain behavioral structure in both crises. In the wildfire case, traffic stress, fear/anger sentiment, and governance discourse are tightly coupled within a 33-day window, with key rules reaching 100% confidence and lift scores up to 2.5. In the COVID case, repeated mobility adaptation and sentiment volatility yield 8 stable same-day rules (88% holdout pass rate) and 40 clean predictive rules with 2–7 day lead horizons. The work demonstrates that interpretable multimodal fusion can produce both scientifically credible and policy-actionable crisis intelligence.
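The confidence and lift figures quoted for the mined rules follow the standard association-rule definitions: confidence(A → C) = P(A and C) / P(A), and lift(A → C) = confidence / P(C). The sketch below computes both over daily sets of binary behavioral states. It is an illustration only: the function name `rule_metrics`, the state labels, and the ten-day dataset are synthetic and hypothetical, not data or code from the paper.

```python
def rule_metrics(days, antecedent, consequent):
    """Support, confidence, and lift of the rule antecedent -> consequent.

    days: list of sets, each holding the binary states active on one day.
    antecedent, consequent: sets of state labels.
    """
    n = len(days)
    n_a = sum(1 for d in days if antecedent <= d)               # days where A holds
    n_c = sum(1 for d in days if consequent <= d)               # days where C holds
    n_ac = sum(1 for d in days if (antecedent | consequent) <= d)  # days where both hold

    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


# Synthetic example: 10 days, labels chosen only to echo the abstract's vocabulary.
days = (
    [{"traffic_stress", "fear_anger"}] * 2   # both states active
    + [{"fear_anger"}] * 2                   # sentiment only
    + [set()] * 6                            # neither state active
)
support, confidence, lift = rule_metrics(days, {"traffic_stress"}, {"fear_anger"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 means the consequent is more frequent on days when the antecedent holds than on days overall. A chronological holdout test of the kind the abstract describes would recompute these metrics on later days only and check that they stay above chosen thresholds.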