Table of Contents

cs.CL

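[1] In Vino Veritas and Vulnerabilities: Examining LLM Safety via Drunk Language Inducement cs.CL | cs.AI | cs.CR | cs.LG

Anudeex Shetty, Aditya Joshi, Salil S. Kanhere

TL;DR: This paper studies inducing “drunk language” in large language models as a way to test their safety. It proposes three inducement mechanisms (persona-based prompting, causal fine-tuning, and reinforcement-based post-training) and shows on the JailbreakBench and ConfAIde benchmarks that the approach raises the models’ susceptibility to jailbreaking and privacy leaks.

Details

Motivation: Inspired by the way humans under the influence of alcohol are prone to undesirable behaviour and privacy leaks, the paper explores drunk language as a driver of safety failures in LLMs in order to assess their safety vulnerabilities.

Result: Evaluation on 5 LLMs shows that, compared with the base models and previously reported approaches, drunk language inducement yields higher jailbreak success rates on JailbreakBench (even with defences in place) and more privacy leaks on ConfAIde; both benchmarks are in English.

Insight: The novelty lies in linking human intoxicated behaviour to anthropomorphic safety risks in LLMs and in proposing simple, efficient drunk language inducement methods that can act as a potential counter to LLM safety tuning, exposing significant risks to LLM safety.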

Abstract: Humans are susceptible to undesirable behaviours and privacy leaks under the influence of alcohol. This paper investigates drunk language, i.e., text written under the influence of alcohol, as a driver for safety failures in large language models (LLMs). We investigate three mechanisms for inducing drunk language in LLMs: persona-based prompting, causal fine-tuning, and reinforcement-based post-training. When evaluated on 5 LLMs, we observe a higher susceptibility to jailbreaking on JailbreakBench (even in the presence of defences) and privacy leaks on ConfAIde, where both benchmarks are in English, as compared to the base LLMs as well as previously reported approaches. Via a robust combination of manual evaluation and LLM-based evaluators and analysis of error categories, our findings highlight a correspondence between human-intoxicated behaviour, and anthropomorphism in LLMs induced with drunk language. The simplicity and efficiency of our drunk language inducement approaches position them as potential counters for LLM safety tuning, highlighting significant risks to LLM safety.


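[2] Prepare Reasoning Language Models for Multi-Agent Debate with Self-Debate Reinforcement Learning cs.CL

Chenxi Liu, Yanshuo Chen, Ruibo Chen, Tianyi Xiong, Tong Zheng

TL;DR: This paper proposes Self-Debate Reinforcement Learning (SDRL), a training framework that aims to improve the reasoning of large language models (LLMs) both when solving problems on their own and when taking part in Multi-Agent Debate (MAD). SDRL has a single LLM generate multiple candidate solutions, builds a debate context containing diverse reasoning paths, generates second-turn responses conditioned on that context, and jointly optimizes the initial and debate-conditioned responses.

Details

Motivation: Current reinforcement learning with verifiable rewards (RLVR) methods typically train LLMs to solve problems in isolation and do not explicitly prepare them to synthesize and exploit different reasoning paths during debate. The paper therefore addresses how an LLM can learn from and integrate diverse reasoning trajectories in MAD.

Result: Experiments across multiple base models and reasoning benchmarks show that SDRL improves overall MAD performance while also strengthening single-model reasoning.

Insight: The novelty is that the self-debate process in SDRL lets an LLM learn to benefit from diverse reasoning trajectories, so that it keeps strong standalone problem-solving ability while performing better in collaborative debate. This offers a new approach to training LLMs that are better suited to complex multi-agent interaction.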

Abstract: The reasoning abilities of large language models (LLMs) have been substantially improved by reinforcement learning with verifiable rewards (RLVR). At test time, collaborative reasoning through Multi-Agent Debate (MAD) has emerged as a promising approach for enhancing LLM performance. However, current RLVR methods typically train LLMs to solve problems in isolation, without explicitly preparing them to synthesize and benefit from different rationales that arise during debate. In this work, we propose Self-Debate Reinforcement Learning (SDRL), a training framework that equips a single LLM with strong standalone problem-solving ability and the capability to learn from diverse reasoning trajectories in MAD. Given a prompt, SDRL first samples multiple candidate solutions, then constructs a debate context with diverse reasoning paths and generates second-turn responses conditioned on this context. Finally, SDRL jointly optimizes both the initial and debate-conditioned responses, yielding a model that is effective as both a standalone solver and a debate participant. Experiments across multiple base models and reasoning benchmarks show that SDRL improves overall MAD performance while simultaneously strengthening single model reasoning.


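[3] MERMAID: Memory-Enhanced Retrieval and Reasoning with Multi-Agent Iterative Knowledge Grounding for Veracity Assessment cs.CL | cs.AI | cs.LG

Yupeng Cao, Chengyang He, Yangyang Yu, Ping Wang, K. P. Subbalakshmi

TL;DR: This paper proposes MERMAID, a memory-enhanced multi-agent framework for veracity assessment. By tightly coupling retrieval and reasoning, it enables dynamic evidence acquisition and cross-claim evidence reuse, and achieves state-of-the-art performance on multiple fact-checking benchmarks.

Details

Motivation: Existing veracity assessment methods usually treat evidence retrieval as a static, isolated step and do not effectively manage or reuse retrieved evidence across claims, which hurts efficiency and consistency.

Result: Evaluated on three fact-checking benchmarks and two claim-verification datasets with multiple LLMs, including the GPT, LLaMA, and Qwen families, MERMAID achieves state-of-the-art performance while improving search efficiency.

Insight: The novelty lies in integrating retrieval, reasoning, and memory: a Reason-Action style iterative process, agent-driven search, structured knowledge representations, and a persistent memory module together enable dynamic evidence acquisition and cross-claim evidence reuse, improving verification efficiency and consistency.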

Abstract: Assessing the veracity of online content has become increasingly critical. Large language models (LLMs) have recently enabled substantial progress in automated veracity assessment, including automated fact-checking and claim verification systems. Typical veracity assessment pipelines break down complex claims into sub-claims, retrieve external evidence, and then apply LLM reasoning to assess veracity. However, existing methods often treat evidence retrieval as a static, isolated step and do not effectively manage or reuse retrieved evidence across claims. In this work, we propose MERMAID, a memory-enhanced multi-agent veracity assessment framework that tightly couples the retrieval and reasoning processes. MERMAID integrates agent-driven search, structured knowledge representations, and a persistent memory module within a Reason-Action style iterative process, enabling dynamic evidence acquisition and cross-claim evidence reuse. By retaining retrieved evidence in an evidence memory, the framework reduces redundant searches and improves verification efficiency and consistency. We evaluate MERMAID on three fact-checking benchmarks and two claim-verification datasets using multiple LLMs, including GPT, LLaMA, and Qwen families. Experimental results show that MERMAID achieves state-of-the-art performance while improving the search efficiency, demonstrating the effectiveness of synergizing retrieval, reasoning, and memory for reliable veracity assessment.


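[4] SPLA: Block Sparse Plus Linear Attention for Long Context Modeling cs.CL

Bailin Wang, Dan Friedman, Tao Lei, Chong Wang

TL;DR: This paper proposes SPLA (Sparse Plus Linear Attention), a framework for efficient long-context modeling. It uses a selection metric derived from second-order Taylor expansions to pick the relevant blocks for exact attention, and a residual linear attention module to compress the unselected blocks into a compact recurrent state, avoiding the context loss caused by discarding unselected blocks entirely.

Details

Motivation: Existing block-sparse attention methods suffer from low selection fidelity and cumulative context loss in long-context modeling because they discard unselected blocks entirely.

Result: SPLA closes the performance gap in continual pretraining and surpasses dense attention models on long-context benchmarks such as RULER, while maintaining competitive general knowledge and reasoning capabilities.

Insight: The novelty lies in combining accurate block selection based on second-order Taylor expansions with residual linear attention; an optimized subtraction-based formulation avoids explicitly accessing unselected blocks, reducing context loss while preserving efficiency.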

Abstract: Block-wise sparse attention offers significant efficiency gains for long-context modeling, yet existing methods often suffer from low selection fidelity and cumulative contextual loss by completely discarding unselected blocks. To address these limitations, we introduce Sparse Plus Linear Attention (SPLA), a framework that utilizes a selection metric derived from second-order Taylor expansions to accurately identify relevant blocks for exact attention. Instead of discarding the remaining “long tail,” SPLA compresses unselected blocks into a compact recurrent state via a residual linear attention (RLA) module. Crucially, to avoid IO overhead, we derive an optimized subtraction-based formulation for RLA – calculating the residual as the difference between global and selected linear attention – ensuring that unselected blocks are never explicitly accessed during inference. Our experiments demonstrate that SPLA closes the performance gap in continual pretraining, surpassing dense attention models on long-context benchmarks like RULER while maintaining competitive general knowledge and reasoning capabilities.
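
The subtraction-based formulation described in the abstract can be illustrated with a minimal sketch. It assumes the simplest unnormalized form of linear attention, in which a set of tokens is summarized by the state S = sum_j k_j v_j^T, and it omits any feature map or normalizer. The helper names `kv_state` and `subtract` and the toy data are hypothetical and are not taken from the paper.

```python
def kv_state(keys, values):
    """Linear-attention state S = sum_j k_j v_j^T as a d_k x d_v nested list."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    for k, v in zip(keys, values):
        for a in range(d_k):
            for b in range(d_v):
                state[a][b] += k[a] * v[b]
    return state


def subtract(left, right):
    """Element-wise difference of two equally shaped states."""
    return [[x - y for x, y in zip(row_l, row_r)]
            for row_l, row_r in zip(left, right)]


# Toy data (hypothetical): three blocks, each a (keys, values) pair.
blocks = [
    ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]),
    ([[1.0, 1.0], [2.0, 0.0]], [[0.0, 1.0], [1.0, 0.0]]),
    ([[0.0, 2.0], [1.0, 1.0]], [[2.0, 2.0], [1.0, 3.0]]),
]
selected = [0, 2]  # blocks assumed to be chosen for exact attention

# Global state over all tokens. A recurrent implementation would keep this
# as a running sum instead of rebuilding it from every block.
all_keys = [k for keys, _ in blocks for k in keys]
all_values = [v for _, values in blocks for v in values]
global_state = kv_state(all_keys, all_values)

# Residual = global - selected. Only the selected blocks are read in this
# loop; the unselected block 1 is recovered implicitly.
residual = global_state
for i in selected:
    residual = subtract(residual, kv_state(*blocks[i]))
```

In this toy setting `residual` matches the state that would be obtained by summing over the unselected block directly, which is the property the subtraction trick relies on.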


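[5] Bifocal Attention: Harmonizing Geometric and Spectral Positional Embeddings for Algorithmic Generalization cs.CL | cs.FL | cs.LG

Kanishk Awadhiya

TL;DR: This paper proposes an architectural paradigm called “Bifocal Attention” to address the “Spectral Rigidity” and “Structure Gap” problems of standard rotary positional embeddings on algorithmic reasoning tasks. The method decouples positional encoding into a geometric modality and a spectral modality and introduces a “Spectral Evolution” training protocol, so that the model can capture both local syntactic structure and long-range recursive patterns.

Details

Motivation: Because of its fixed geometric decay, standard rotary positional embedding is good at encoding local syntactic relations but cannot effectively capture the long-range periodic recursive structures inherent in algorithmic reasoning, so models trained on shallow reasoning chains fail to extrapolate to deeper recursive steps.

Result: The abstract does not report specific quantitative results or benchmarks.

Insight: The core novelty is decoupling positional encoding into two complementary modalities, geometric and spectral, together with the “Spectral Evolution” training protocol, which initializes from static geometric parameters and then evolves via gradient descent into a task-specific harmonic basis. This offers a new architectural direction for improving the algorithmic generalization of large language models.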

Abstract: Rotary Positional Embeddings (RoPE) have become the standard for Large Language Models (LLMs) due to their ability to encode relative positions through geometric rotation. However, we identify a significant limitation we term “Spectral Rigidity”: standard RoPE utilizes a fixed geometric decay ($θ^{-i}$) optimized for local syntactic coherence, which fails to capture the long-range, periodic structures inherent in recursive logic and algorithmic reasoning. This results in a “Structure Gap”, where models trained on shallow reasoning chains fail to extrapolate to deeper recursive steps. In this work, we introduce Bifocal Attention, an architectural paradigm that decouples positional encoding into two distinct modalities: Geometric Eyes (Standard RoPE) for precise token-level manipulation, and Spectral Eyes (Learnable Harmonic Operators) for tracking long-range recursive depth. We propose a novel training protocol, Spectral Evolution, which initializes positional frequencies as static geometric parameters but allows them to evolve via gradient descent into a harmonic basis optimized for the specific algorithmic topology of the task.
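
As a rough illustration of the fixed geometric decay and of frequencies that start geometric and then move under gradient descent, here is a minimal sketch. It assumes the commonly used RoPE schedule base^(-2i/d) with base 10000, which the abstract does not spell out. The function names, the placeholder gradients, and the learning rate are hypothetical; the sketch does not reproduce the paper's Spectral Evolution protocol.

```python
def rope_frequencies(dim, base=10000.0):
    """Geometric frequency schedule: one frequency per pair of dimensions."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def sgd_step(frequencies, gradients, learning_rate):
    """One plain gradient-descent update applied to the frequencies."""
    return [f - learning_rate * g for f, g in zip(frequencies, gradients)]


# Fixed ("geometric") frequencies: consecutive entries share a constant ratio.
geometric = rope_frequencies(8)
ratios = [geometric[i + 1] / geometric[i] for i in range(len(geometric) - 1)]

# Learnable ("spectral") frequencies: same initialization, then updated.
# The gradients are placeholders; a real model would obtain them by
# backpropagating a task loss.
placeholder_grads = [0.0, 0.5, -0.5, 0.0]
spectral = sgd_step(list(geometric), placeholder_grads, learning_rate=0.01)
```

After the update, `spectral` no longer follows a constant ratio, while `geometric` is unchanged; this is the only sense in which the sketch mirrors the two modalities described above.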


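[6] SSL: Sweet Spot Learning for Differentiated Guidance in Agentic Optimization cs.CL

Jinyang Wu, Changpeng Yang, Yuhao Shen, Fangzhi Xu, Bolin Ni

TL;DR: This paper proposes Sweet Spot Learning (SSL), a new reinforcement learning framework that uses progressively amplified, tiered rewards to give differentiated guidance for agent optimization, capturing quality differences among trajectories and steering the policy toward the “sweet spot” region of the solution space.

Details

Motivation: Existing reinforcement learning methods with verifiable rewards typically use binary rewards, which cannot distinguish trajectories that reach the same outcome with different quality and so overlook potential diversity within the solution space. An optimization framework that provides differentiated guidance is therefore needed.

Result: On 12 benchmarks covering GUI perception, short/long-term planning, and complex reasoning, SSL improves consistently over strong baselines, achieves up to 2.5x sample efficiency gains, and shows effective cross-task transferability.

Insight: The novelty of SSL is the “sweet spot” concept: progressively amplified, tiered rewards (such as distance-tiered modeling for visual tasks and incremental progress rewards for reasoning tasks) enhance the gradient signal-to-noise ratio and preserve the ordering of optimal solutions, providing a general principle for training capable and robust agents.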

Abstract: Reinforcement learning with verifiable rewards has emerged as a powerful paradigm for training intelligent agents. However, existing methods typically employ binary rewards that fail to capture quality differences among trajectories achieving identical outcomes, thereby overlooking potential diversity within the solution space. Inspired by the “sweet spot” concept in tennis (the racket’s core region that produces optimal hitting effects), we introduce Sweet Spot Learning (SSL), a novel framework that provides differentiated guidance for agent optimization. SSL follows a simple yet effective principle: progressively amplified, tiered rewards guide policies toward the sweet-spot region of the solution space. This principle naturally adapts across diverse tasks: visual perception tasks leverage distance-tiered modeling to reward proximity, while complex reasoning tasks reward incremental progress toward promising solutions. We theoretically demonstrate that SSL preserves optimal solution ordering and enhances the gradient signal-to-noise ratio, thereby fostering more directed optimization. Extensive experiments across GUI perception, short/long-term planning, and complex reasoning tasks show consistent improvements over strong baselines on 12 benchmarks, achieving up to 2.5x sample efficiency gains and effective cross-task transferability. Our work establishes SSL as a general principle for training capable and robust agents.
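
To make the contrast between a binary reward and a distance-tiered reward concrete, here is a minimal sketch for a perception-style task in which a prediction is scored by its normalized distance to a target. The tier boundaries and reward values are invented for illustration and are not the ones used in the paper; the function names are hypothetical.

```python
def binary_reward(distance, threshold=0.30):
    """Baseline: every prediction inside the threshold earns the same reward."""
    return 1.0 if distance <= threshold else 0.0


def tiered_reward(distance, tiers=((0.05, 1.0), (0.15, 0.5), (0.30, 0.2))):
    """Tiered reward: tiers are (max_distance, reward) pairs sorted from the
    tightest tier outward, so closer predictions earn progressively more."""
    for max_distance, reward in tiers:
        if distance <= max_distance:
            return reward
    return 0.0


# Two predictions that both count as successes under the binary reward
# but differ in quality under the tiered one.
near, far = 0.02, 0.25
binary_pair = (binary_reward(near), binary_reward(far))
tiered_pair = (tiered_reward(near), tiered_reward(far))
```

Under these assumed tiers the binary reward cannot separate the two predictions, while the tiered reward ranks the closer one higher, which is the kind of differentiated signal the abstract describes.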


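[7] Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards cs.CL

Yuan-Jay Lü, Chengyu Wang, Lei Shen, Jun Huang, Tong Xu

TL;DR: This paper proposes SYNTHAGENT, a framework that addresses the weak agentic capabilities of small language models by synthesizing diverse tool-use training data and simulating complete environments. A strong teacher model generates novel tasks and tool ecosystems and deliberately supplies underspecified instructions, forcing the agent to actively ask the user for missing details. In the simulated environment, an LLM-based user simulator provides user-private information, a mock tool system returns stable responses, and rewards are computed from task-level rubrics. Experiments on 14 challenging datasets in math, search, and tool use show that models trained on the synthetic data improve substantially, with small models even outperforming larger baselines.

Details

Motivation: The paper addresses the difficulty small language models have in matching the agentic capabilities of large, costly models, and tackles structural bottlenecks in existing reinforcement learning approaches: training data that is narrow in task variety and easily solved, and real-world APIs that lack diversity and are unstable.

Result: On 14 challenging datasets in math, search, and tool use, models trained on SYNTHAGENT’s synthetic data achieve substantial gains, with small models outperforming larger baselines and reaching an advanced level of performance.

Insight: The novelties include: synthesizing diverse tasks and tool ecosystems with a teacher model and deliberately designing underspecified instructions to encourage active querying; building a simulated environment (a user simulator and a mock tool system) to keep training stable; and using task-level rubrics based on subgoals, user-agent interactions, and forbidden behaviors as the reward mechanism. These ideas can be reused to strengthen the agentic capabilities and training efficiency of small models.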

Abstract: Small LLMs often struggle to match the agentic capabilities of large, costly models. While reinforcement learning can help, progress has been limited by two structural bottlenecks: existing open-source agentic training data are narrow in task variety and easily solved; real-world APIs lack diversity and are unstable for large-scale reinforcement learning rollout processes. We address these challenges with SYNTHAGENT, a framework that jointly synthesizes diverse tool-use training data and simulates complete environments. Specifically, a strong teacher model creates novel tasks and tool ecosystems, then rewrites them into intentionally underspecified instructions. This compels agents to actively query users for missing details. When handling synthetic tasks, an LLM-based user simulator provides user-private information, while a mock tool system delivers stable tool responses. For rewards, task-level rubrics are constructed based on required subgoals, user-agent interactions, and forbidden behaviors. Across 14 challenging datasets in math, search, and tool use, models trained on our synthetic data achieve substantial gains, with small models outperforming larger baselines.


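[8] One Ring to Rule Them All: Unifying Group-Based RL via Dynamic Power-Mean Geometry cs.CL

Weisong Zhao, Tong Wang, Zichang Tan, Te Yang, Siran Peng

TL;DR: This paper proposes Power-Mean Policy Optimization (PMPO), a framework that unifies group-based reinforcement learning methods through dynamic power-mean geometry. It introduces an exponent p that parameterizes the aggregation geometry and recovers GRPO (arithmetic mean) and GMPO (geometric mean) as special cases. A Clip-aware Effective Sample Size (ESS) mechanism determines p adaptively, letting the algorithm move dynamically between the aggressive arithmetic mean for reliable trajectories and the conservative geometric mean for unstable ones. Experiments on multiple mathematical reasoning benchmarks show that PMPO outperforms existing baselines.

Details

Motivation: The paper addresses a limitation of existing group-based reinforcement learning methods such as GRPO and GMPO: they rely on a fixed aggregation geometry, which ignores the evolving and heterogeneous nature of each trajectory and cannot adapt to trajectories of differing stability.

Result: On multiple mathematical reasoning benchmarks, PMPO outperforms strong baselines such as GRPO and GMPO.

Insight: The novelty lies in unifying different aggregation methods through the power-mean exponent p and introducing the Clip-aware ESS mechanism to adjust p adaptively, so that aggressive and conservative updates are balanced according to trajectory stability, improving the algorithm’s adaptability and performance.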

Abstract: Group-based reinforcement learning has evolved from the arithmetic mean of GRPO to the geometric mean of GMPO. While GMPO improves stability by constraining a conservative objective, it shares a fundamental limitation with GRPO: reliance on a fixed aggregation geometry that ignores the evolving and heterogeneous nature of each trajectory. In this work, we unify these approaches under Power-Mean Policy Optimization (PMPO), a generalized framework that parameterizes the aggregation geometry via the power-mean geometry exponent p. Within this framework, GRPO and GMPO are recovered as special cases. Theoretically, we demonstrate that adjusting p modulates the concentration of gradient updates, effectively reweighting tokens based on their advantage contribution. To determine p adaptively, we introduce a Clip-aware Effective Sample Size (ESS) mechanism. Specifically, we propose a deterministic rule that maps a trajectory clipping fraction to a target ESS. Then, we solve for the specific p to align the trajectory induced ESS with this target one. This allows PMPO to dynamically transition between the aggressive arithmetic mean for reliable trajectories and the conservative geometric mean for unstable ones. Experiments on multiple mathematical reasoning benchmarks demonstrate that PMPO outperforms strong baselines.
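
The way a single exponent p interpolates between the arithmetic and geometric means can be seen in a minimal sketch of the standard power mean, M_p(x) = ((1/n) * sum_i x_i^p)^(1/p), with the geometric mean as the limit p -> 0. Treating the inputs as positive per-token ratios is an assumption made here for illustration; the sketch does not reproduce the PMPO objective or the Clip-aware ESS rule for choosing p.

```python
import math


def power_mean(values, p):
    """Power mean of positive numbers; p = 0 is handled as the geometric limit."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be a non-empty list of positive numbers")
    n = len(values)
    if abs(p) < 1e-12:
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


# Hypothetical per-token ratios for one trajectory.
ratios = [0.5, 1.0, 2.0]

arithmetic = power_mean(ratios, 1.0)   # p = 1: arithmetic mean
geometric = power_mean(ratios, 0.0)    # p -> 0: geometric mean
in_between = power_mean(ratios, 0.5)   # intermediate geometry
```

Because the power mean is non-decreasing in p, lowering p toward 0 moves the aggregate from the arithmetic mean toward the smaller, more conservative geometric mean.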


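[9] Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry cs.CL | cs.AI | cs.LG

Zhuochun Li, Yong Zhang, Ming Li, Yuelyu Ji, Yiming Zeng

TL;DR: This paper proposes a new evaluation paradigm, “Representation-as-a-Judge”, which evaluates using the internal representations of small language models (SLMs) rather than their surface generations, addressing the cost, opacity, and prompt sensitivity of the conventional “LLM-as-a-Judge” paradigm.

Details

Motivation: Large language models (LLMs) used as reference-free evaluators are costly, opaque, and sensitive to prompts, whereas small models, despite weaker generative ability, may encode rich evaluative signals in their hidden states. This motivates the search for a more efficient and reliable evaluation method.

Result: On reasoning benchmarks (GSM8K, MATH, GPQA), the proposed INSPECTOR framework substantially outperforms prompting-based small models and comes close to full LLM judges, while offering a more efficient, reliable, and interpretable evaluation approach.

Insight: The novelty is the “Semantic Capacity Asymmetry Hypothesis”: evaluation requires far less semantic capacity than generation, so the intermediate representations of small models can be used for decoding-free evaluation. This is realized in INSPECTOR, a probing-based framework that predicts fine-grained evaluation scores from small-model representations, offering a new direction for scalable evaluation.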

Abstract: Large language models (LLMs) are widely used as reference-free evaluators via prompting, but this “LLM-as-a-Judge” paradigm is costly, opaque, and sensitive to prompt design. In this work, we investigate whether smaller models can serve as efficient evaluators by leveraging internal representations instead of surface generation. We uncover a consistent empirical pattern: small LMs, despite weak generative ability, encode rich evaluative signals in their hidden states. This motivates us to propose the Semantic Capacity Asymmetry Hypothesis: evaluation requires significantly less semantic capacity than generation and can be grounded in intermediate representations, suggesting that evaluation does not necessarily need to rely on large-scale generative models but can instead leverage latent features from smaller ones. Our findings motivate a paradigm shift from LLM-as-a-Judge to Representation-as-a-Judge, a decoding-free evaluation strategy that probes internal model structure rather than relying on prompted output. We instantiate this paradigm through INSPECTOR, a probing-based framework that predicts aspect-level evaluation scores from small model representations. Experiments on reasoning benchmarks (GSM8K, MATH, GPQA) show that INSPECTOR substantially outperforms prompting-based small LMs and closely approximates full LLM judges, while offering a more efficient, reliable, and interpretable alternative for scalable evaluation.


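[10] Language Model Circuits Are Sparse in the Neuron Basis cs.CL | cs.AI

Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann

TL;DR: This paper shows empirically that the multilayer perceptron (MLP) neurons of a language model are themselves a sparse feature basis, with sparsity comparable to sparse autoencoders (SAEs). Building on this finding, the authors develop an end-to-end circuit-tracing pipeline that locates causal circuits directly in the MLP neuron basis with no additional training cost. The method is validated on subject-verb agreement and multi-hop reasoning tasks, where it identifies circuits made up of a small number of neurons that control model behaviour.

Details

Motivation: The paper concerns the interpretability of language models’ internal representations. The neuron basis is conventionally regarded as hard to interpret directly, so techniques such as sparse autoencoders are often used to decompose it. The paper challenges this view and aims to show that MLP neurons can serve as a sparse, interpretable feature basis on their own, simplifying circuit tracing.

Result: On a subject-verb agreement benchmark, about 100 MLP neurons are enough to control model behaviour. On the multi-hop city → state → capital task, the authors identify small sets of neurons that encode specific reasoning steps (e.g. ‘map city to its state’) and can be steered to change the model’s output. The results indicate that the MLP neuron basis matches SAEs in sparsity and effectiveness for circuit tracing.

Insight: The novelties are the first empirical demonstration that MLP neurons form a sparse feature basis usable for interpretability analysis without extra training; an end-to-end, gradient-attribution-based circuit-tracing method that simplifies the pipeline and lowers cost; and a new direction for automated interpretability of language models, suggesting that the neuron basis itself may be inherently interpretable.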

Abstract: The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques such as sparse autoencoders (SAEs) to decompose the neuron basis into more interpretable units of model computation, for tasks such as circuit tracing. However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that MLP neurons are as sparse a feature basis as SAEs. We use this finding to develop an end-to-end pipeline for circuit tracing on the MLP neuron basis, which locates causal circuitry on a variety of tasks using gradient-based attribution. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of $\approx 10^2$ MLP neurons is enough to control model behaviour. On the multi-hop city $\to$ state $\to$ capital task from Lindsey et al., 2025, we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g. ‘map city to its state’), and can be steered to change the model’s output. This work thus advances automated interpretability of language models without additional training costs.


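[11] Time-Annealed Perturbation Sampling: Diverse Generation for Diffusion Language Models cs.CL | cs.AI

Jingxuan Wu, Zhenglin Wan, Xingrui Yu, Yuzhe Yang, Yiqiao Huang

TL;DR: This paper proposes Time-Annealed Perturbation Sampling (TAPS), a training-free inference strategy for increasing the generation diversity of diffusion language models. It builds on the observation that diffusion models show a temporal division of labor in text generation: early denoising steps determine the global semantic structure, while later steps refine local wording. TAPS injects perturbations early in the diffusion process to encourage semantic branching and anneals them over time to preserve fluency and instruction adherence, improving diversity without sacrificing generation quality.

Details

Motivation: Diffusion language models introduce an explicit temporal dimension, but how it can be used to control generation diversity and explore multiple valid semantic or reasoning paths remains underexplored. The paper aims to exploit the temporal structure of diffusion models to increase the diversity of generated text.

Result: Validated on non-autoregressive and semi-autoregressive diffusion backbones such as LLaDA and TraDo, TAPS consistently improves output diversity on creative writing and reasoning benchmarks without harming generation quality.

Insight: The core novelty is identifying the “temporal division of labor” in diffusion language models and designing a time-annealed perturbation sampling strategy around it. The method is training-free: it adjusts perturbation strength dynamically across early and late denoising steps to balance semantic exploration against fluency, offering a new approach to controllable text generation.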

Abstract: Diffusion language models (Diffusion-LMs) introduce an explicit temporal dimension into text generation, yet how this structure can be leveraged to control generation diversity for exploring multiple valid semantic or reasoning paths remains underexplored. In this paper, we show that Diffusion-LMs, like diffusion models in image generation, exhibit a temporal division of labor: early denoising steps largely determine the global semantic structure, while later steps focus on local lexical refinement. Building on this insight, we propose Time-Annealed Perturbation Sampling (TAPS), a training-free inference strategy that encourages semantic branching early in the diffusion process while progressively reducing perturbations to preserve fluency and instruction adherence. TAPS is compatible with both non-autoregressive and semi-autoregressive Diffusion backbones, demonstrated on LLaDA and TraDo in our paper, and consistently improves output diversity across creative writing and reasoning benchmarks without compromising generation quality.
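
The idea of strong perturbation early and little or none late can be sketched with a simple annealing schedule. The polynomial decay, the Gaussian noise, and the choice to perturb a list of scores are all assumptions made for illustration; the abstract does not specify what TAPS perturbs or how the schedule is shaped. All names below are hypothetical.

```python
import random


def perturbation_scale(step, total_steps, sigma_max=1.0, power=2.0):
    """Noise scale that starts at sigma_max and decays to 0 at the last step."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return sigma_max * (1.0 - progress) ** power


def perturb(scores, step, total_steps, rng):
    """Add zero-mean Gaussian noise whose scale follows the annealing schedule."""
    scale = perturbation_scale(step, total_steps)
    return [s + rng.gauss(0.0, scale) for s in scores]


# Schedule over 10 denoising steps (index 0 = first step, 10 = end).
schedule = [perturbation_scale(t, 10) for t in range(11)]

rng = random.Random(0)
early = perturb([1.0, 2.0], step=0, total_steps=10, rng=rng)   # heavily perturbed
late = perturb([1.0, 2.0], step=10, total_steps=10, rng=rng)   # left unchanged
```

With this schedule, different random seeds can push early steps toward different outcomes, while the final steps are left untouched, loosely mirroring the split between semantic branching and lexical refinement described above.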


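[12] TSLM: Tree-Structured Language Modeling for Divergent Thinking cs.CL

Doyoung Kim, Jaehyeok Doo, Minjoon Seo

TL;DR: This paper proposes Tree-Structured Language Modeling (TSLM), a new approach that encodes branching structure with special tokens so that a language model can generate and selectively expand multiple search paths in parallel within a single generation process, improving reasoning efficiency and robustness.

Details

Motivation: Conventional language models generate reasoning paths sequentially and cannot decouple irrelevant exploration paths; the paper aims to improve systematic exploration during search.

Result: TSLM achieves superior inference efficiency by avoiding the multiple independent forward passes required by external search methods, and delivers robust performance.

Insight: The novelty lies in using tree-structure tokens and supervised learning on complete search trees (including both successful and failed attempts) so that the model internalizes systematic exploration, offering a new paradigm for inference-time scaling.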

Abstract: Language models generate reasoning sequentially, preventing them from decoupling irrelevant exploration paths during search. We introduce Tree-Structured Language Modeling (TSLM), which uses special tokens to encode branching structure, enabling models to generate and selectively expand multiple search paths within a single generation process. By training on complete search trees including both successful and failed attempts, TSLM learns to internalize systematic exploration without redundant recomputation of shared prefixes. TSLM achieves robust performance and superior inference efficiency by avoiding the multiple independent forward passes required by external search methods. These results suggest a new paradigm of inference-time scaling for robust reasoning, demonstrating that supervised learning on complete tree-structured traces provides an efficient alternative for developing systematic exploration capabilities in language models.
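
One way to picture a search tree encoded with special tokens is a depth-first serialization in which each branch is wrapped in an opening and a closing marker. The token strings `[BRANCH]` and `[/BRANCH]`, the tuple-based tree layout, and the function name are hypothetical; the abstract does not describe the actual token vocabulary or serialization format used by TSLM.

```python
def serialize(node, open_token="[BRANCH]", close_token="[/BRANCH]"):
    """Flatten a (text, children) tree depth-first, wrapping each child subtree
    in branch markers so the shared prefix appears only once."""
    text, children = node
    tokens = [text]
    for child in children:
        tokens.append(open_token)
        tokens.extend(serialize(child, open_token, close_token))
        tokens.append(close_token)
    return tokens


# A toy search tree with one failed and one successful branch.
search_tree = (
    "problem",
    [
        ("try A", [("dead end", [])]),
        ("try B", [("solved", [])]),
    ],
)
flat = serialize(search_tree)
```

Both the failed branch and the successful one stay in the flattened sequence, and the shared prefix `"problem"` is written once rather than repeated per path.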


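[13] Models Know Models Best: Evaluation via Model-Preferred Formats cs.CL

Joonhak Lee, Sungmok Jung, Jongyeon Park, Jaejin Lee

TL;DR: This paper studies the marked performance gap that large language models (LLMs) show on multiple-choice tasks between symbol-based and cloze-style evaluation formats. The gap is systematically attributable to task characteristics: natural language continuation tasks benefit from likelihood scoring, whereas explicit comparison is better suited to symbol-based selection. To address the inconsistency, the paper proposes a dynamic format-alignment strategy in which a lightweight classifier, trained on latent model-preference signals, chooses the best evaluation format for each problem instance. The method yields substantial and consistent accuracy gains on zero-shot reasoning and knowledge benchmarks.

Details

Motivation: LLMs perform inconsistently across evaluation formats (symbol-based versus cloze-style), and this inconsistency obscures their true capabilities, so a better evaluation method is needed to reveal their latent performance.

Result: The proposed dynamic format-alignment method achieves substantial and consistent accuracy gains on zero-shot reasoning and knowledge benchmarks (such as MMLU and HellaSwag), outperforming fixed-format evaluation and human-designed heuristics and better revealing the models’ latent capabilities.

Insight: The core novelty is using preference signals produced by the model itself, rather than hand-written rules, to choose the most suitable evaluation format for each problem, a meta-evaluation idea in which the model guides its own evaluation. This offers a new perspective on reducing evaluation-method bias and measuring model capability more fairly.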

Abstract: Performance of Large Language Models (LLMs) on multiple-choice tasks differs markedly between symbol-based and cloze-style evaluation formats. The observed discrepancies are systematically attributable to task characteristics: natural language continuation benefits from likelihood scoring, whereas explicit comparison is better suited to symbol-based selection. These trends are consistent across various decoder-based LLMs, indicating model-agnostic effects. To address these inconsistencies, a dynamic format-alignment strategy is introduced that employs a lightweight classifier trained on latent model-preference signals. In contrast to human-designed heuristics, which often degrade performance, this approach uses model-generated signals to determine the optimal format for each problem instance. The proposed method achieves substantial and consistent improvements in zero-shot accuracy across reasoning and knowledge benchmarks, better revealing the models’ latent capabilities.


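[14] MM-THEBench: Do Reasoning MLLMs Think Reasonably? cs.CL

Zhidian Huang, Zijun Yao, Ji Qi, Shangqing Tu, Junxian Ma

TL;DR: This paper introduces MM-THEBench, a benchmark for assessing hallucinations in the chain-of-thought of reasoning-capable multimodal large language models, with the aim of examining whether a model’s “thinking” is reasonable and how it affects multimodal perception and reasoning.

Details

Motivation: Existing benchmarks mainly target models that predate reasoning MLLMs; they neglect the internal thinking process and cannot measure the hallucinations that arise during “thinking”.

Result: Extensive experiments on mainstream reasoning MLLMs reveal how “thinking” affects hallucination and reasoning capability across multimodal tasks.

Insight: The novelty is the first benchmark focused on hallucinations in the intermediate chain-of-thought of reasoning MLLMs, featuring a fine-grained taxonomy grounded in cognitive dimensions, diverse data with verified reasoning annotations, and a multi-level automated evaluation framework.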

Abstract: Recent advances in multimodal large language models (MLLMs) mark a shift from non-thinking models to post-trained reasoning models capable of solving complex problems through thinking. However, whether such thinking mitigates hallucinations in multimodal perception and reasoning remains unclear. Self-reflective reasoning enhances robustness but introduces additional hallucinations, and subtle perceptual errors still result in incorrect or coincidentally correct answers. Existing benchmarks primarily focus on models before the emergence of reasoning MLLMs, neglecting the internal thinking process and failing to measure the hallucinations that occur during thinking. To address these challenges, we introduce MM-THEBench, a comprehensive benchmark for assessing hallucinations of intermediate CoTs in reasoning MLLMs. MM-THEBench features a fine-grained taxonomy grounded in cognitive dimensions, diverse data with verified reasoning annotations, and a multi-level automated evaluation framework. Extensive experiments on mainstream reasoning MLLMs reveal insights into how thinking affects hallucination and reasoning capability in various multimodal tasks.


Yifei Li, Richong Zhang, Wanyu Tu, Zhijie Nie, Haokun Luo

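TL;DR: This paper introduces a new task, APPELLATE REVIEW, which focuses on detecting, classifying, and correcting errors in legal judgments, and builds the AR-BENCH dataset benchmark, comprising 8,700 finely annotated decisions and 34,617 supplementary corpora. An evaluation of 14 large language models reveals critical limitations in existing models’ ability to identify legal application errors.

Details

Motivation: Existing legal AI research focuses on judgment prediction and document generation. Judgment review differs fundamentally in objective and paradigm: it aims to detect, classify, and correct errors after a judgment is issued, which makes it anomaly detection rather than prediction or generation, and it has not been addressed by prior work.

Result: Evaluating 14 large language models shows that existing models have critical limitations in identifying legal application errors, providing empirical evidence for future improvements.

Insight: The novelty lies in shifting the legal AI task from a prediction/generation paradigm to an anomaly detection paradigm, introducing the APPELLATE REVIEW task, and building the dedicated AR-BENCH dataset benchmark, with an emphasis on assessing diagnostic reasoning and reliability in legal practice.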

Abstract: Legal judgments may contain errors due to the complexity of case circumstances and the abstract nature of legal concepts, while existing appellate review mechanisms face efficiency pressures from a surge in case volumes. Although current legal AI research focuses on tasks like judgment prediction and legal document generation, the task of judgment review differs fundamentally in its objectives and paradigm: it centers on detecting, classifying, and correcting errors after a judgment is issued, constituting anomaly detection rather than prediction or generation. To address this research gap, we introduce a novel task APPELLATE REVIEW, aiming to assess models’ diagnostic reasoning and reliability in legal practice. We also construct a novel dataset benchmark AR-BENCH, which comprises 8,700 finely annotated decisions and 34,617 supplementary corpora. By evaluating 14 large language models, we reveal critical limitations in existing models’ ability to identify legal application errors, providing empirical evidence for future improvements.


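[16] Leveraging LLMs For Turkish Skill Extraction cs.CL

Ezgi Arslan İltüzer, Özgür Anıl Özlü, Vahid Farajijobehdar, Gülşen Eryiğit

TL;DR: For Turkish, a low-resource and morphologically complex language, this paper builds the first Turkish skill extraction dataset and systematically evaluates different large language models (LLMs) and prompting strategies on the skill extraction task. In the end-to-end pipeline, the best performance comes from using Claude Sonnet 3.7 with dynamic few-shot prompting for skill identification, followed by embedding-based retrieval and LLM-based reranking for skill linking, bringing skill extraction for Turkish to a level comparable with other languages.

Details

Motivation: Turkish, a morphologically complex language, lacks both a skill taxonomy and a dedicated dataset, so skill extraction for it is under-researched. The paper addresses three questions: how to perform Turkish skill extraction effectively in a low-resource setting; which model is most promising; and how different LLMs and prompting strategies (dynamic versus static few-shot samples, varying context information, and encouraging causal reasoning) affect skill extraction.

Result: On the constructed Turkish dataset of 327 job postings with 4,819 annotated skill spans, the LLM approach outperforms supervised sequence labeling in the end-to-end pipeline and aligns extracted skills with the ESCO taxonomy more effectively. The best configuration (Claude Sonnet 3.7 + dynamic few-shot prompting + embedding-based retrieval + LLM-based reranking) reaches an end-to-end performance of 0.56, putting Turkish skill extraction on a par with studies of other languages in the literature.

Insight: The main novelty is creating the first skill extraction dataset for Turkish and systematically exploring the potential of LLMs for skill extraction in a low-resource language, in particular an end-to-end pipeline that combines dynamic few-shot prompting, embedding-based retrieval, and LLM-based reranking. The work provides a reusable pattern for applying LLMs to skill extraction in low-resource languages and shows that LLMs can compensate for scarce annotated data.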

Abstract: Skill extraction is a critical component of modern recruitment systems, enabling efficient job matching, personalized recommendations, and labor market analysis. Despite Türkiye’s significant role in the global workforce, Turkish, a morphologically complex language, lacks both a skill taxonomy and a dedicated skill extraction dataset, resulting in underexplored research in skill extraction for Turkish. This article seeks the answers to three research questions: 1) How can skill extraction be effectively performed for this language, in light of its low resource nature? 2) What is the most promising model? 3) What is the impact of different Large Language Models (LLMs) and prompting strategies on skill extraction (i.e., dynamic vs. static few-shot samples, varying context information, and encouraging causal reasoning)? The article introduces the first Turkish skill extraction dataset and performance evaluations of automated skill extraction using LLMs. The manually annotated dataset contains 4,819 labeled skill spans from 327 job postings across different occupation areas. The use of LLM outperforms supervised sequence labeling when used in an end-to-end pipeline, aligning extracted spans with standardized skills in the ESCO taxonomy more effectively. The best-performing configuration, utilizing Claude Sonnet 3.7 with dynamic few-shot prompting for skill identification, embedding-based retrieval, and LLM-based reranking for skill linking, achieves an end-to-end performance of 0.56, positioning Turkish alongside similar studies in other languages, which are few in the literature. Our findings suggest that LLMs can improve skill extraction performance in low-resource settings, and we hope that our work will accelerate similar research on skill extraction for underrepresented languages.


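[17] DiffuSpeech: Silent Thought, Spoken Answer via Unified Speech-Text Diffusion cs.CL | cs.AI | cs.LG | cs.SD

Yuxuan Lou, Ziming Wu, Yaochen Wang, Yong Liu, Yingxuan Ren

TL;DR: This paper proposes DiffuSpeech, a diffusion-based speech-text language model that implements a “Silent Thought, Spoken Answer” paradigm: while generating a spoken answer, it also produces an internal text reasoning trace that improves speech quality. The model handles discrete text and tokenized speech jointly under a unified masked diffusion framework, and the authors build SpeechQA, the first speech question-answering dataset with paired text reasoning traces.

Details

Motivation: Current speech language models generate answers directly without explicit reasoning, so errors cannot be corrected once produced. A model that generates internal reasoning traces to inform speech quality is therefore needed.

Result: On speech-to-speech question answering, DiffuSpeech achieves state-of-the-art accuracy, beating the best baseline by up to 9 points. On text-to-speech quality it achieves the best result among generative models (6.2% WER) while preserving language understanding (66.2% MMLU).

Insight: The novelty is bringing the “Silent Thought, Spoken Answer” paradigm to speech generation: a diffusion model jointly generates reasoning traces and speech tokens, with modality-specific masking schedules unifying the two modalities. Combining reasoning with a diffusion architecture offers a new direction for interpretable, high-quality speech generation.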

Abstract: Current speech language models generate responses directly without explicit reasoning, leading to errors that cannot be corrected once audio is produced. We introduce “Silent Thought, Spoken Answer”, a paradigm where speech LLMs generate internal text reasoning alongside spoken responses, with thinking traces informing speech quality. To realize this, we present DiffuSpeech, the first diffusion-based speech-text language model supporting both understanding and generation, unifying discrete text and tokenized speech under a single masked diffusion framework. Unlike autoregressive approaches, DiffuSpeech jointly generates reasoning traces and speech tokens through iterative denoising, with modality-specific masking schedules. We also construct the first speech QA dataset with paired text reasoning traces, containing 26K samples totaling 319 hours. Experiments show DiffuSpeech achieves state-of-the-art speech-to-speech QA accuracy, outperforming the best baseline by up to 9 points, while attaining the best TTS quality among generative models (6.2% WER) and preserving language understanding (66.2% MMLU). Ablations confirm that both the diffusion architecture and thinking traces contribute to these gains.


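[18] Relaxing Positional Alignment in Masked Diffusion Language Models cs.CL | cs.LG

Mengyu Ye, Ryosuke Takahashi, Keito Kudo, Jun Suzuki

TL;DR: To narrow the performance gap of masked diffusion language models in open-ended text generation, this paper proposes a supervision strategy that relaxes positional alignment. By introducing a special token and fine-tuning with the connectionist temporal classification objective, the method reduces the decoding sensitivity caused by strict positional prediction and improves generation quality and robustness on five benchmarks.

Details

Motivation: Masked diffusion language models lag well behind autoregressive models in open-ended text generation. The authors hypothesize that strict positional prediction makes decoding highly sensitive to token misalignment and is mismatched with the irreversible denoising dynamics, so positional supervision should be relaxed.

Result: On five open-ended text generation benchmarks, the method consistently outperforms the original masked diffusion language model and improves robustness to positional shifts, indicating that relaxing strict positional supervision improves generation quality.

Insight: The novelty lies in introducing a special token through the connectionist temporal classification objective to obtain alignment-flexible supervision, which reduces the semantic damage caused by positional misalignment. Moving from strict positional alignment to more flexible supervision is an important direction for improving generation in diffusion language models.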

Abstract: Masked diffusion language models (MDLMs) have emerged as a promising alternative to dominant autoregressive approaches. Although they achieve competitive performance on several tasks, a substantial gap remains in open-ended text generation. We hypothesize that one cause of this gap is that strict positional prediction makes MDLM decoding highly sensitive to token misalignment, and we show through controlled interventions that a one-position shift can severely disrupt semantics. This observation suggests that enforcing strict positional supervision during training is misaligned with the irreversible denoising dynamics of MDLM decoding. Motivated by this mismatch, we adopt an alignment-flexible supervision strategy during fine-tuning. Specifically, we introduce a special token via the connectionist temporal classification objective. We apply this approach to the widely used MDLM model and conduct experiments on five open-ended text generation benchmarks. Our method consistently outperforms the original model and improves robustness to positional shifts, indicating that relaxing strict positional supervision is an important factor in improving generation quality in MDLMs.
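
To make the alignment-flexible idea concrete, the sketch below applies the standard CTC collapsing rule (merge adjacent repeats, then drop the blank symbol). It is an illustration only: the token name `<blank>` and the function `ctc_collapse` are hypothetical, and the abstract does not state that the paper's special token behaves exactly like a standard CTC blank.

```python
BLANK = "<blank>"  # hypothetical name for the special token


def ctc_collapse(tokens, blank=BLANK):
    """Standard CTC collapsing: merge adjacent repeats, then remove blanks."""
    collapsed = []
    previous = None
    for token in tokens:
        # Keep a token only if it starts a new run and is not the blank.
        if token != previous and token != blank:
            collapsed.append(token)
        previous = token
    return collapsed


aligned = ["the", "cat", "sat", BLANK]
shifted = [BLANK, "the", "cat", "sat"]  # same words, one position later

# Strict position-wise comparison: how many slots agree exactly?
positional_matches = sum(a == b for a, b in zip(aligned, shifted))

# Alignment-flexible comparison: both sequences collapse to the same text.
same_after_collapse = ctc_collapse(aligned) == ctc_collapse(shifted)
```

Under position-wise scoring the two sequences disagree in every slot, whereas after collapsing they are identical, which is the kind of one-position shift the abstract describes as disruptive under strict supervision.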


[19] Autonomous Chain-of-Thought Distillation for Graph-Based Fraud Detection cs.CL | cs.CRPDF

Yuan Li, Jun Hu, Bryan Hooi, Bingsheng He, Cheng Chen

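TL;DR: This paper proposes FraudCoT, a unified framework for fraud detection on text-attributed graphs (TAGs). Through autonomous, graph-aware chain-of-thought (CoT) reasoning and scalable LLM-GNN co-training, it addresses the limitations of predefined prompting and decoupled training pipelines in existing methods.

Details

Motivation: Existing LLM-enhanced GNN approaches are constrained by predefined prompts and decoupled training pipelines, which limits reasoning autonomy, weakens semantic-structural alignment, and prevents effective joint modeling of rich textual semantics and relational dependencies.

Result: Extensive experiments on public and industrial benchmarks show that FraudCoT improves AUPRC by up to 8.8% over state-of-the-art (SOTA) methods and delivers up to a 1,066x speedup in training throughput, substantially improving both detection performance and efficiency.

Insight: The contributions are: 1) a fraud-aware selective CoT distillation mechanism that generates diverse reasoning paths to strengthen semantic-structural understanding; 2) integration of the distilled CoTs into node texts, giving GNNs enriched multi-hop semantic and structural cues; 3) an efficient asymmetric co-training strategy that enables end-to-end optimization at a much lower computational cost.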

Abstract: Graph-based fraud detection on text-attributed graphs (TAGs) requires jointly modeling rich textual semantics and relational dependencies. However, existing LLM-enhanced GNN approaches are constrained by predefined prompting and decoupled training pipelines, limiting reasoning autonomy and weakening semantic-structural alignment. We propose FraudCoT, a unified framework that advances TAG-based fraud detection through autonomous, graph-aware chain-of-thought (CoT) reasoning and scalable LLM-GNN co-training. To address the limitations of predefined prompts, we introduce a fraud-aware selective CoT distillation mechanism that generates diverse reasoning paths and enhances semantic-structural understanding. These distilled CoTs are integrated into node texts, providing GNNs with enriched, multi-hop semantic and structural cues for fraud detection. Furthermore, we develop an efficient asymmetric co-training strategy that enables end-to-end optimization while significantly reducing the computational cost of naive joint training. Extensive experiments on public and industrial benchmarks demonstrate that FraudCoT achieves up to 8.8% AUPRC improvement over state-of-the-art methods and delivers up to 1,066x speedup in training throughput, substantially advancing both detection performance and efficiency.


[20] Residual Context Diffusion Language Models cs.CL | cs.AIPDF

Yuezhou Hu, Harman Singh, Monishwaran Maheswaran, Haocheng Xi, Coleman Hooper

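TL;DR: This paper proposes Residual Context Diffusion (RCD), a new module for improving diffusion large language models (dLLMs). Existing block-wise dLLMs use a "remasking" mechanism during decoding that keeps only the most confident tokens and discards the rest, which wastes computation. RCD recycles the representations of the discarded tokens, converts them into contextual residuals, and injects them into the next denoising step, making use of the otherwise wasted computation and context. The method uses a decoupled two-stage training pipeline to avoid memory bottlenecks and shows that a standard dLLM can be converted to the RCD paradigm with only about 1 billion tokens.

Details

Motivation: State-of-the-art block-wise dLLMs waste computation during parallel decoding because remasking discards the representations of many low-confidence tokens. The goal is to recycle the useful contextual information those discarded tokens retain, improving efficiency and performance.

Result: The method is validated on long-CoT reasoning (SDAR) and short-CoT instruction-following (LLaDA) models. Across a wide range of benchmarks, RCD consistently improves the accuracy of frontier dLLMs by 5-10 points with minimal extra computation. On the most challenging AIME tasks, RCD nearly doubles baseline accuracy and needs up to 4-5x fewer denoising steps at equivalent accuracy.

Insight: The core contribution is the RCD module, which recycles token representations discarded during decoding and turns them into reusable contextual residuals, improving computational efficiency and model performance. Viewed objectively, the decoupled two-stage training pipeline sidesteps the memory bottleneck of backpropagation, so the method can be integrated into existing dLLMs efficiently and yields notable gains with little additional data, offering a new direction for optimizing diffusion language models.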

Abstract: Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a “remasking” mechanism that decodes only the most confident tokens and discards the rest, effectively wasting computation. We demonstrate that recycling computation from the discarded tokens is beneficial, as these tokens retain contextual information useful for subsequent decoding iterations. In light of this, we propose Residual Context Diffusion (RCD), a module that converts these discarded token representations into contextual residuals and injects them back for the next denoising step. RCD uses a decoupled two-stage training pipeline to bypass the memory bottlenecks associated with backpropagation. We validate our method on both long CoT reasoning (SDAR) and short CoT instruction following (LLaDA) models. We demonstrate that a standard dLLM can be efficiently converted to the RCD paradigm with merely ~1 billion tokens. RCD consistently improves frontier dLLMs by 5-10 points in accuracy with minimal extra computation overhead across a wide range of benchmarks. Notably, on the most challenging AIME tasks, RCD nearly doubles baseline accuracy and attains up to 4-5x fewer denoising steps at equivalent accuracy levels.
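
The remasking step that the abstract criticizes can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `remask_step`, the `<mask>` string, and the `(token, confidence)` layout are hypothetical, and the sketch only records what gets thrown away. RCD itself recycles hidden representations of those positions, which is not modeled here.

```python
MASK = "<mask>"  # hypothetical placeholder for a still-masked position


def remask_step(sequence, predictions, keep):
    """Commit the `keep` most confident predictions and leave the rest masked.

    `predictions` maps a masked position to a (token, confidence) pair.
    Returns the updated sequence and the predictions that were discarded.
    """
    ranked = sorted(predictions, key=lambda pos: predictions[pos][1], reverse=True)
    committed = set(ranked[:keep])

    updated = list(sequence)
    discarded = {}
    for pos, (token, confidence) in predictions.items():
        if pos in committed:
            updated[pos] = token                    # decoded in this step
        else:
            discarded[pos] = (token, confidence)    # computed, then dropped
    return updated, discarded


sequence = ["A", MASK, MASK, MASK]
predictions = {1: ("b", 0.9), 2: ("c", 0.4), 3: ("d", 0.7)}
updated, discarded = remask_step(sequence, predictions, keep=2)
```

Everything that ends up in `discarded` was computed by the model but plays no role in the next step under plain remasking; that is the waste RCD is described as recovering.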


[21] InstructDiff: Domain-Adaptive Data Selection via Differential Entropy for Efficient LLM Fine-Tuning cs.CLPDF

Junyou Su, He Zhu, Xiao Luo, Liyu Zhang, Hong-Yu Zhou

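TL;DR: This paper proposes InstructDiff, a framework that uses differential entropy for domain-adaptive data selection, enabling efficient supervised fine-tuning (SFT) of large language models (LLMs). It exploits the pattern of entropy differences between a base model and a minimally instruction-tuned calibrated model, favoring entropy increase (cognitive expansion) for reasoning tasks and entropy decrease (cognitive compression) for general instruction-following tasks, and it surpasses full-dataset training while using only 10% of the data.

Details

Motivation: Existing data selection methods suffer from severe domain specificity: methods optimized for general instruction following fail on reasoning tasks, and vice versa. Training on complete datasets is also costly, with diminishing returns.

Result: On mathematical reasoning, InstructDiff achieves a 17% relative improvement over full-data training; on general instruction following it achieves a 52% improvement. It outperforms existing baselines in both cases while using only 10% of the data.

Insight: The novelty lies in identifying the differential entropy pattern between base and calibrated models as a unified, domain-adaptive selection criterion, and in operationalizing it through warmup calibration, bi-directional NLL filtering, and entropy-based ranking to achieve efficient data selection across domains.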

Abstract: Supervised fine-tuning (SFT) is fundamental to adapting large language models, yet training on complete datasets incurs prohibitive costs with diminishing returns. Existing data selection methods suffer from severe domain specificity: techniques optimized for general instruction-following fail on reasoning tasks, and vice versa. We observe that measuring entropy differences between base models and minimally instruction-tuned calibrated models reveals a pattern – samples with the lowest differential entropy consistently yield optimal performance across domains, yet this principle manifests domain-adaptively: reasoning tasks favor entropy increase (cognitive expansion), while general tasks favor entropy decrease (cognitive compression). We introduce InstructDiff, a unified framework that operationalizes differential entropy as a domain-adaptive selection criterion through warmup calibration, bi-directional NLL filtering, and entropy-based ranking. Extensive experiments show that InstructDiff achieves 17% relative improvement over full data training on mathematical reasoning and 52% for general instruction-following, outperforming prior baselines while using only 10% of the data.


[22] JobResQA: A Benchmark for LLM Machine Reading Comprehension on Multilingual Résumés and JDs cs.CLPDF

Casimiro Pio Carrino, Paula Estrella, Rabih Zbib, Carlos Escolano, José A. R. Fonollosa

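TL;DR: This paper introduces JobResQA, a multilingual question answering benchmark for evaluating the machine reading comprehension of large language models on human-resources documents (résumés and job descriptions). The dataset contains 581 QA pairs in five languages (English, Spanish, Italian, German, and Chinese), with questions at three complexity levels ranging from basic factual extraction to cross-document reasoning. The authors propose a data construction pipeline that derives data from real-world sources through de-identification and synthesis to ensure both realism and privacy, with controlled attributes that support systematic bias and fairness studies. A human-in-the-loop translation pipeline based on the TEaR methodology ensures a high-quality multilingual parallel benchmark. Several open-weight LLMs are evaluated with an LLM-as-judge approach; models perform better on English and Spanish but degrade substantially on the other languages, highlighting critical gaps in multilingual machine reading comprehension for HR applications.

Details

Motivation: There is no dedicated benchmark for evaluating the multilingual machine reading comprehension of LLMs in the HR domain, particularly on sensitive, structurally complex documents such as résumés and job descriptions. Existing benchmarks fall short in language coverage, task complexity, and realism of application scenarios, making it hard to systematically assess model performance, bias, and fairness on real HR tasks.

Result: Baseline evaluations of multiple open-weight LLM families were run on JobResQA using an LLM-as-judge approach. Models score relatively high on English and Spanish but degrade substantially on Italian, German, and Chinese, revealing the imbalance and key weaknesses of current LLMs in multilingual machine reading comprehension. The benchmark provides a reproducible evaluation standard for fair and reliable LLM systems in HR applications.

Insight: The contributions are: 1) a multilingual machine reading comprehension benchmark focused on the HR domain (résumés and job descriptions), filling a gap in existing research; 2) a generation pipeline combining de-identification and data synthesis that preserves realism and privacy while supporting systematic bias and fairness studies through controlled attributes (demographic and professional placeholders); 3) a human-in-the-loop translation pipeline based on the TEaR methodology, combining MQM error annotation and selective post-editing to produce high-quality multilingual parallel data, which offers a reusable approach for building multilingual benchmarks.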

Abstract: We introduce JobResQA, a multilingual Question Answering benchmark for evaluating Machine Reading Comprehension (MRC) capabilities of LLMs on HR-specific tasks involving résumés and job descriptions. The dataset comprises 581 QA pairs across 105 synthetic résumé-job description pairs in five languages (English, Spanish, Italian, German, and Chinese), with questions spanning three complexity levels from basic factual extraction to complex cross-document reasoning. We propose a data generation pipeline derived from real-world sources through de-identification and data synthesis to ensure both realism and privacy, while controlled demographic and professional attributes (implemented via placeholders) enable systematic bias and fairness studies. We also present a cost-effective, human-in-the-loop translation pipeline based on the TEaR methodology, incorporating MQM error annotations and selective post-editing to ensure a high-quality multi-way parallel benchmark. We provide baseline evaluations across multiple open-weight LLM families using an LLM-as-judge approach, revealing higher performance on English and Spanish but substantial degradation for other languages, highlighting critical gaps in multilingual MRC capabilities for HR applications. JobResQA provides a reproducible benchmark for advancing fair and reliable LLM-based HR systems. The benchmark is publicly available at: https://github.com/Avature/jobresqa-benchmark


[23] ReGuLaR: Variational Latent Reasoning Guided by Rendered Chain-of-Thought cs.CLPDF

Fanmeng Wang, Haotian Liu, Guojiang Zhao, Hongteng Xu, Zhifeng Gao

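TL;DR: This paper proposes ReGuLaR, a new latent reasoning learning paradigm. It compresses the reasoning process into latent space within a variational auto-encoding framework and uses explicit reasoning chains rendered as images to supply visual-semantic representations that guide the posterior distribution, reducing computational redundancy while matching or even surpassing chain-of-thought performance.

Details

Motivation: Chain-of-thought improves the performance of large language models, but explicit reasoning chains introduce substantial computational redundancy. Existing latent reasoning methods try to compress the reasoning process, but without appropriate compression guidance they suffer severe performance degradation.

Result: Extensive experiments show that ReGuLaR significantly outperforms existing latent reasoning methods in both computational efficiency and reasoning effectiveness, and even surpasses chain-of-thought through multi-modal reasoning.

Insight: The novelty lies in formulating latent reasoning within a variational auto-encoding framework and regularizing the posterior distribution with dense visual-semantic representations extracted from rendered reasoning-chain images, achieving efficient compression with minimal information loss. This offers a new solution to latent reasoning that draws on multi-modal guidance.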

Abstract: While Chain-of-Thought (CoT) significantly enhances the performance of Large Language Models (LLMs), explicit reasoning chains introduce substantial computational redundancy. Recent latent reasoning methods attempt to mitigate this by compressing reasoning processes into latent space, but often suffer from severe performance degradation due to the lack of appropriate compression guidance. In this study, we propose Rendered CoT-Guided variational Latent Reasoning (ReGuLaR), a simple yet novel latent learning paradigm resolving this issue. Fundamentally, we formulate latent reasoning within the Variational Auto-Encoding (VAE) framework, sampling the current latent reasoning state from the posterior distribution conditioned on previous ones. Specifically, when learning this variational latent reasoning model, we render explicit reasoning chains as images, from which we extract dense visual-semantic representations to regularize the posterior distribution, thereby achieving efficient compression with minimal information loss. Extensive experiments demonstrate that ReGuLaR significantly outperforms existing latent reasoning methods across both computational efficiency and reasoning effectiveness, and even surpasses CoT through multi-modal reasoning, providing a new and insightful solution to latent reasoning. Code: https://github.com/FanmengWang/ReGuLaR.


[24] Deep Search with Hierarchical Meta-Cognitive Monitoring Inspired by Cognitive Neuroscience cs.CLPDF

Zhongxiang Sun, Qipeng Wang, Weijie Yu, Jingxuan Yang, Haolang Lu

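TL;DR: This paper proposes DS-MCM, a deep search framework inspired by cognitive neuroscience that augments LLM-powered deep search agents with a hierarchical metacognitive monitoring mechanism. The framework includes a fast consistency monitor and a slow experience-driven monitor, which dynamically monitor and regulate reasoning and retrieval states during task execution to improve performance and robustness.

Details

Motivation: LLM-based deep search agents perform well at multi-step retrieval and long-horizon task execution, but their practical failures often stem from the lack of mechanisms to monitor and regulate reasoning and retrieval states under uncertainty.

Result: Experiments across multiple deep search benchmarks and backbone models show that DS-MCM consistently improves performance and robustness.

Insight: The novelty lies in borrowing the hierarchical structure of human metacognition: lightweight fast anomaly detection is combined with selectively triggered reflective monitoring based on experience memory, and both are embedded directly in the reasoning-retrieval loop to decide when to intervene and how to correct based on prior experience.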

Abstract: Deep search agents powered by large language models have demonstrated strong capabilities in multi-step retrieval, reasoning, and long-horizon task execution. However, their practical failures often stem from the lack of mechanisms to monitor and regulate reasoning and retrieval states as tasks evolve under uncertainty. Insights from cognitive neuroscience suggest that human metacognition is hierarchically organized, integrating fast anomaly detection with selectively triggered, experience-driven reflection. In this work, we propose Deep Search with Meta-Cognitive Monitoring (DS-MCM), a deep search framework augmented with an explicit hierarchical metacognitive monitoring mechanism. DS-MCM integrates a Fast Consistency Monitor, which performs lightweight checks on the alignment between external evidence and internal reasoning confidence, and a Slow Experience-Driven Monitor, which is selectively activated to guide corrective intervention based on experience memory from historical agent trajectories. By embedding monitoring directly into the reasoning-retrieval loop, DS-MCM determines both when intervention is warranted and how corrective actions should be informed by prior experience. Experiments across multiple deep search benchmarks and backbone models demonstrate that DS-MCM consistently improves performance and robustness.


[25] PaperBanana: Automating Academic Illustration for AI Scientists cs.CL | cs.CVPDF

Dawei Zhu, Rui Meng, Yale Song, Xiyu Wei, Sujian Li

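TL;DR: This paper presents PaperBanana, an agentic framework built on state-of-the-art vision-language models and image generation models that automates the creation of publication-ready academic illustrations. It orchestrates specialized agents to retrieve references, plan content and style, render images, and iteratively refine them via self-critique. For systematic evaluation, the authors build PaperBananaBench, a benchmark of 292 methodology-diagram test cases drawn from NeurIPS 2025 papers. Experiments show that PaperBanana outperforms existing baselines in faithfulness, conciseness, readability, and aesthetics, and extends effectively to high-quality statistical plots.

Details

Motivation: Despite rapid progress in autonomous AI scientists powered by language models, producing publication-ready illustrations remains a labor-intensive bottleneck in the research workflow. This paper aims to relieve researchers of that burden.

Result: Comprehensive experiments on the authors' PaperBananaBench benchmark (292 methodology-diagram test cases from NeurIPS 2025) show that PaperBanana consistently outperforms leading baselines in faithfulness, conciseness, readability, and aesthetics. The method also generates high-quality statistical plots effectively.

Insight: The novelty lies in an agentic framework for automated academic illustration that tackles a complex task through multi-agent collaboration (retrieval, planning, rendering, iterative refinement). Viewed objectively, the dedicated benchmark (PaperBananaBench) is a valuable tool for systematic evaluation of this task, and the pipeline goes beyond plain image generation to end-to-end, publication-grade illustration with content planning and style adaptation.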

Abstract: Despite rapid advances in autonomous AI scientists powered by language models, generating publication-ready illustrations remains a labor-intensive bottleneck in the research workflow. To lift this burden, we introduce PaperBanana, an agentic framework for automated generation of publication-ready academic illustrations. Powered by state-of-the-art VLMs and image generation models, PaperBanana orchestrates specialized agents to retrieve references, plan content and style, render images, and iteratively refine via self-critique. To rigorously evaluate our framework, we introduce PaperBananaBench, comprising 292 test cases for methodology diagrams curated from NeurIPS 2025 publications, covering diverse research domains and illustration styles. Comprehensive experiments demonstrate that PaperBanana consistently outperforms leading baselines in faithfulness, conciseness, readability, and aesthetics. We further show that our method effectively extends to the generation of high-quality statistical plots. Collectively, PaperBanana paves the way for the automated generation of publication-ready illustrations.


cs.CV [Back]

[26] Do Open-Vocabulary Detectors Transfer to Aerial Imagery? A Comparative Evaluation cs.CV | cs.LG | cs.ROPDF

Christos Tsourveloudis

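TL;DR: This paper presents the first systematic evaluation of the zero-shot transfer of five state-of-the-art open-vocabulary object detection (OVD) models on the LAE-80C aerial imagery dataset. OVD models suffer severe domain transfer failure on aerial imagery: the best model (OWLv2) reaches an F1-score of only 27.6% with a false positive rate of 69%. By varying vocabulary size and prompt engineering strategies, the study shows that semantic confusion is the main bottleneck and that performance is sensitive to imaging conditions, underscoring the need for domain-adaptive methods.

Details

Motivation: Open-vocabulary object detection performs well on natural images, but its transferability to aerial imagery has not been explored. This paper fills that gap with a systematic benchmark of zero-shot OVD performance on aerial imagery.

Result: Under strict zero-shot conditions on LAE-80C (3,592 images, 80 categories), the best model, OWLv2, achieves only a 27.6% F1-score with a 69% false positive rate. Reducing the vocabulary from 80 to 3.2 classes yields a 15x improvement. Performance varies widely across datasets (F1 of 0.53 on DIOR and 0.12 on FAIR1M), showing that the models are brittle to imaging conditions.

Insight: The main contributions are the first systematic evaluation of OVD transfer to aerial imagery and an experimental protocol (Global, Oracle, and Single-Category inference modes) that separates semantic confusion from visual localization. The key finding is that semantic confusion is the main bottleneck in domain transfer, while prompt engineering strategies (such as domain-specific prefixing and synonym expansion) fail to improve performance meaningfully, which highlights the need for domain-adaptive OVD methods tailored to aerial imagery.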

Abstract: Open-vocabulary object detection (OVD) enables zero-shot recognition of novel categories through vision-language models, achieving strong performance on natural images. However, transferability to aerial imagery remains unexplored. We present the first systematic benchmark evaluating five state-of-the-art OVD models on the LAE-80C aerial dataset (3,592 images, 80 categories) under strict zero-shot conditions. Our experimental protocol isolates semantic confusion from visual localization through Global, Oracle, and Single-Category inference modes. Results reveal severe domain transfer failure: the best model (OWLv2) achieves only 27.6% F1-score with 69% false positive rate. Critically, reducing vocabulary size from 80 to 3.2 classes yields 15x improvement, demonstrating that semantic confusion is the primary bottleneck. Prompt engineering strategies, such as domain-specific prefixing and synonym expansion, fail to provide meaningful performance gains. Performance varies dramatically across datasets (F1: 0.53 on DIOR, 0.12 on FAIR1M), exposing brittleness to imaging conditions. These findings establish baseline expectations and highlight the need for domain-adaptive approaches in aerial OVD.


[27] What Lies Beneath: A Call for Distribution-based Visual Question & Answer Datasets cs.CV | cs.DLPDF

Jill P. Naiman, Daniel J. Evans, JooYoung Seo

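TL;DR: This paper argues that current visual question answering (VQA) datasets are limited for analyzing scientific charts: most either lack the raw data behind the charts or assume a one-to-one correspondence between chart marks and data. The authors call for, and build, a VQA benchmark focused on scientific charts (specifically histograms) in which chart marks do not correspond one-to-one with the underlying data, in order to evaluate how well models understand data distributions.

Details

Motivation: Current VQA datasets focus mainly on real-world images or simple diagrams and do little to test interpretation of complex scientific charts. They also tend to ignore the fact that charts are transformations (analysis, simplification, modification) of data, so they cannot evaluate a model's ability to reason about the underlying data distribution.

Result: The authors synthesize histogram charts from ground-truth data and build an open-source dataset containing figures, underlying data, distribution parameters, and bounding boxes. They pose questions that depend on the underlying data to both humans and a large reasoning model, to highlight the limitations of existing models.

Insight: The novelty lies in a VQA benchmark that stresses the non-one-to-one relationship between chart marks and underlying data, pushing evaluation toward reasoning about data distributions rather than visual features alone. Viewed objectively, the dataset is a new tool for testing the deeper reasoning of multimodal models on scientific charts.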

Abstract: Visual Question Answering (VQA) has become an important benchmark for assessing how large multimodal models (LMMs) interpret images. However, most VQA datasets focus on real-world images or simple diagrammatic analysis, with few focused on interpreting complex scientific charts. Indeed, many VQA datasets that analyze charts do not contain the underlying data behind those charts or assume a 1-to-1 correspondence between chart marks and underlying data. In reality, charts are transformations (i.e. analysis, simplification, modification) of data. This distinction introduces a reasoning challenge in VQA that the current datasets do not capture. In this paper, we argue for a dedicated VQA benchmark for scientific charts where there is no 1-to-1 correspondence between chart marks and underlying data. To do so, we survey existing VQA datasets and highlight limitations of the current field. We then generate synthetic histogram charts based on ground truth data, and ask both humans and a large reasoning model questions where precise answers depend on access to the underlying data. We release the open-source dataset, including figures, underlying data, distribution parameters used to generate the data, and bounding boxes for all figure marks and text for future research.


[28] Lost in Space? Vision-Language Models Struggle with Relative Camera Pose Estimation cs.CV | cs.AI | cs.CLPDF

Ken Deng, Yifu Qiu, Yoni Kasten, Shay B. Cohen, Yftah Ziser

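TL;DR: This paper studies how vision-language models perform on relative camera pose estimation and finds that, despite good 2D perception and semantic reasoning, they are markedly weak at understanding 3D spatial structure. The authors introduce two benchmarks, VRRPI-Bench and VRRPI-Diag. Results show that most VLMs rely on shallow 2D heuristics, cannot handle depth changes or rotation about the optical axis, fall well below classic geometric baselines and human performance, and are inconsistent in multi-image spatial reasoning.

Details

Motivation: The aim is to probe the limits of vision-language models in 3D spatial understanding, using relative camera pose estimation, a fundamental vision task, to quantify the gap relative to their 2D perception.

Result: On the VRRPI-Bench and VRRPI-Diag benchmarks, state-of-the-art VLMs (such as GPT-5, scoring 0.64) fall well short of classic geometric baselines (0.97) and human performance (0.92). In multi-image reasoning, the best model reaches only 59.7% accuracy, and performance is inconsistent.

Insight: The contributions include VRRPI-Bench, a benchmark built from unlabeled egocentric videos, and the diagnostic benchmark VRRPI-Diag, which together evaluate the 3D spatial reasoning of VLMs systematically. Viewed objectively, the analysis exposes fundamental weaknesses of VLMs on depth changes and rotations and underlines the need for stronger 3D and multi-view spatial grounding.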

Abstract: Vision-Language Models (VLMs) perform well in 2D perception and semantic reasoning compared to their limited understanding of 3D spatial structure. We investigate this gap using relative camera pose estimation (RCPE), a fundamental vision task that requires inferring relative camera translation and rotation from a pair of images. We introduce VRRPI-Bench, a benchmark derived from unlabeled egocentric videos with verbalized annotations of relative camera motion, reflecting realistic scenarios with simultaneous translation and rotation around a shared object. We further propose VRRPI-Diag, a diagnostic benchmark that isolates individual motion degrees of freedom. Despite the simplicity of RCPE, most VLMs fail to generalize beyond shallow 2D heuristics, particularly for depth changes and roll transformations along the optical axis. Even state-of-the-art models such as GPT-5 (0.64) fall short of classic geometric baselines (0.97) and human performance (0.92). Moreover, VLMs exhibit difficulty in multi-image reasoning, with inconsistent performance (best 59.7%) when integrating spatial cues across frames. Our findings reveal limitations in grounding VLMs in 3D and multi-view spatial reasoning.


[29] Geometry without Position? When Positional Embeddings Help and Hurt Spatial Reasoning cs.CVPDF

Jian Shi, Michael Birsak, Wenqing Cui, Zhenyu Li, Peter Wonka

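TL;DR: This paper revisits the role of positional embeddings in vision transformers from a geometric perspective and finds that they are not just position indices but act as geometric priors that shape the spatial structure of the representation. Using token-level diagnostics, it systematically analyzes how positional embeddings affect multi-view geometric consistency and spatial reasoning across 14 foundation ViT models.

Details

Motivation: To investigate how positional embeddings act as geometric priors on spatial representations in vision transformers, and to clarify their causal role in multi-view geometric consistency.

Result: Experiments on 14 foundation ViT models show that positional embeddings strongly affect multi-view geometric consistency, and that this consistency is a key factor in spatial reasoning.

Insight: The novelty lies in recasting positional embeddings as geometric priors and developing token-level diagnostics that quantify their effect on spatial structure. Viewed objectively, the study offers a new view of how ViTs represent space and is informative for positional encoding design.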

Abstract: This paper revisits the role of positional embeddings (PEs) within vision transformers (ViTs) from a geometric perspective. We show that PEs are not mere token indices but effectively function as geometric priors that shape the spatial structure of the representation. We introduce token-level diagnostics that measure how multi-view geometric consistency in ViT representations depends on consistent PEs. Through extensive experiments on 14 foundation ViT models, we reveal how PEs influence multi-view geometry and spatial reasoning. Our findings clarify the role of PEs as a causal mechanism that governs spatial structure in ViT representations. Our code is available at https://github.com/shijianjian/vit-geometry-probes


[30] VMonarch: Efficient Video Diffusion Transformers with Structured Attention cs.CV | cs.AIPDF

Cheng Liang, Haoxian Chen, Liang Hou, Qi Fan, Gangshan Wu

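TL;DR: This paper proposes VMonarch, an efficient attention mechanism for video diffusion transformers. It represents the sparse spatio-temporal attention patterns of video data with structured Monarch matrices, achieving sub-quadratic attention computation that substantially lowers compute cost and speeds up long-video generation.

Details

Motivation: The quadratic complexity of attention severely limits the context scalability of video diffusion transformers. The authors observe that the highly sparse spatio-temporal attention patterns in Video DiTs can be represented naturally by Monarch matrices, which leads to a more efficient attention scheme.

Result: On VBench, after minimal tuning, VMonarch matches or exceeds the generation quality of full attention. It overcomes the attention bottleneck, reduces attention FLOPs by a factor of 17.5, achieves over 5x speedup in attention computation for long videos, and surpasses state-of-the-art sparse attention methods at 90% sparsity.

Insight: The contributions are: a spatio-temporal Monarch factorization that explicitly captures intra-frame and inter-frame correlations; a recomputation strategy that mitigates artifacts from instabilities during alternating minimization; and an online entropy algorithm fused into FlashAttention that enables fast Monarch matrix updates for long sequences. Together they offer a reusable design for structured sparse attention in efficient video generation models.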

Abstract: The quadratic complexity of the attention mechanism severely limits the context scalability of Video Diffusion Transformers (DiTs). We find that the highly sparse spatio-temporal attention patterns exhibited in Video DiTs can be naturally represented by the Monarch matrix. It is a class of structured matrices with flexible sparsity, enabling sub-quadratic attention via an alternating minimization algorithm. Accordingly, we propose VMonarch, a novel attention mechanism for Video DiTs that enables efficient computation over the dynamic sparse patterns with structured Monarch matrices. First, we adapt spatio-temporal Monarch factorization to explicitly capture the intra-frame and inter-frame correlations of the video data. Second, we introduce a recomputation strategy to mitigate artifacts arising from instabilities during alternating minimization of Monarch matrices. Third, we propose a novel online entropy algorithm fused into FlashAttention, enabling fast Monarch matrix updates for long sequences. Extensive experiments demonstrate that VMonarch achieves comparable or superior generation quality to full attention on VBench after minimal tuning. It overcomes the attention bottleneck in Video DiTs, reduces attention FLOPs by a factor of 17.5, and achieves a speedup of over 5x in attention computation for long videos, surpassing state-of-the-art sparse attention methods at 90% sparsity.
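
For readers unfamiliar with Monarch matrices, the sketch below shows the general structure as it is commonly described: two block-diagonal factors interleaved with a fixed reshape-and-transpose permutation. It is an assumption-laden illustration in plain Python, not VMonarch itself; the function names are hypothetical, each factor is assumed to hold m blocks of size m x m (so n = m * m), and the paper's spatio-temporal factorization, alternating minimization, and FlashAttention fusion are not reproduced.

```python
def block_diag_matvec(blocks, x):
    """Multiply a block-diagonal matrix (m blocks of size m x m) by x."""
    m = len(blocks)
    out = []
    for i, block in enumerate(blocks):
        chunk = x[i * m:(i + 1) * m]
        out.extend(sum(w * v for w, v in zip(row, chunk)) for row in block)
    return out


def transpose_perm(x, m):
    """View x as an m x m row-major grid and return its transpose, flattened."""
    return [x[j * m + i] for i in range(m) for j in range(m)]


def monarch_matvec(left_blocks, right_blocks, x):
    """Apply right factor, permute, apply left factor, permute back."""
    m = len(right_blocks)
    assert len(left_blocks) == m and len(x) == m * m
    y = block_diag_matvec(right_blocks, x)   # mixes within each chunk
    y = transpose_perm(y, m)
    y = block_diag_matvec(left_blocks, y)    # mixes across chunks
    return transpose_perm(y, m)


def param_counts(m):
    """Stored parameters: dense n x n matrix versus the two Monarch factors."""
    n = m * m
    return {"dense": n * n, "monarch": 2 * m ** 3}


identity = [[1, 0], [0, 1]]
swap = [[0, 1], [1, 0]]
y = monarch_matvec([swap, swap], [identity, identity], [1, 2, 3, 4])
```

With m blocks of size m x m per factor, storage and matrix-vector cost grow as n^1.5 rather than n^2, which is the source of the sub-quadratic behavior the abstract refers to.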


[31] Coarse-to-Real: Generative Rendering for Populated Dynamic Scenes cs.CVPDF

Gonzalo Gomez-Nogales, Yicong Hong, Chongjian Ge, Marc Comino-Trinidad, Dan Casas

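TL;DR: C2R (Coarse-to-Real) is a generative rendering framework that synthesizes real-style urban crowd videos from coarse 3D simulations. It uses coarse 3D renderings to explicitly control scene layout, camera motion, and human trajectories, while a learned neural renderer, guided by text prompts, generates realistic appearance, lighting, and fine-scale dynamics.

Details

Motivation: Traditional rendering pipelines rely on complex assets, accurate materials and lighting, and substantial compute to produce realistic imagery, yet they still struggle with scalability and realism for populated dynamic scenes. This paper uses a generative approach to synthesize controllable, realistic urban crowd videos from minimal 3D input.

Result: The proposed system supports coarse-to-fine control, generalizes across diverse CG and game inputs, and produces temporally consistent, controllable, and realistic urban scene videos from minimal 3D input.

Insight: The main contributions are: 1) a two-phase mixed CG-real training strategy that addresses the lack of paired training data between coarse simulations and real videos; 2) controllability introduced through implicit spatio-temporal features shared across domains; 3) a neural renderer guided by text prompts that generates realistic appearance, lighting, and fine-scale dynamics.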

Abstract: Traditional rendering pipelines rely on complex assets, accurate materials and lighting, and substantial computational resources to produce realistic imagery, yet they still face challenges in scalability and realism for populated dynamic scenes. We present C2R (Coarse-to-Real), a generative rendering framework that synthesizes real-style urban crowd videos from coarse 3D simulations. Our approach uses coarse 3D renderings to explicitly control scene layout, camera motion, and human trajectories, while a learned neural renderer generates realistic appearance, lighting, and fine-scale dynamics guided by text prompts. To overcome the lack of paired training data between coarse simulations and real videos, we adopt a two-phase mixed CG-real training strategy that learns a strong generative prior from large-scale real footage and introduces controllability through shared implicit spatio-temporal features across domains. The resulting system supports coarse-to-fine control, generalizes across diverse CG and game inputs, and produces temporally consistent, controllable, and realistic urban scene videos from minimal 3D input. We will release the model and project webpage at https://gonzalognogales.github.io/coarse2real/.


[32] FlexMap: Generalized HD Map Construction from Flexible Camera Configurations cs.CVPDF

Run Wang, Chaoyi Zhou, Amir Salarpour, Xi Liu, Zhi-Qi Cheng

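TL;DR: This paper proposes FlexMap, a method for constructing HD maps from flexible camera configurations. It does not rely on fixed multi-camera calibration or explicit 2D-to-BEV projection; instead, a geometry-aware foundation model with cross-frame attention implicitly encodes 3D scene understanding, so the method adapts to different camera setups without architectural changes or retraining.

Details

Motivation: Existing HD map construction methods depend on calibrated multi-camera setups and 2D-to-BEV transformations, which makes them fragile when sensors fail or camera configurations vary across a fleet. FlexMap targets this lack of robustness and flexibility.

Result: Experiments show that FlexMap outperforms existing methods across multiple configurations while remaining robust to missing views and sensor variations, making real-world deployment more practical.

Insight: The contributions are: a geometry-aware foundation model that implicitly encodes 3D understanding and removes explicit geometric projection; a spatial-temporal enhancement module that separates cross-view spatial reasoning from temporal dynamics; and a camera-aware decoder with latent camera tokens that enables view-adaptive attention without projection matrices.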

Abstract: High-definition (HD) maps provide essential semantic information of road structures for autonomous driving systems, yet current HD map construction methods require calibrated multi-camera setups and either implicit or explicit 2D-to-BEV transformations, making them fragile when sensors fail or camera configurations vary across vehicle fleets. We introduce FlexMap. Unlike prior methods that are fixed to a specific N-camera rig, our approach adapts to variable camera configurations without any architectural changes or per-configuration retraining. Our key innovation eliminates explicit geometric projections by using a geometry-aware foundation model with cross-frame attention to implicitly encode 3D scene understanding in feature space. FlexMap features two core components: a spatial-temporal enhancement module that separates cross-view spatial reasoning from temporal dynamics, and a camera-aware decoder with latent camera tokens, enabling view-adaptive attention without the need for projection matrices. Experiments demonstrate that FlexMap outperforms existing methods across multiple configurations while maintaining robustness to missing views and sensor variations, enabling more practical real-world deployment.


[33] Jailbreaks on Vision Language Model via Multimodal Reasoning cs.CV | cs.AIPDF

Aarush Noheria, Yuguang Yao

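TL;DR: This paper proposes a jailbreak framework for vision-language models (VLMs). It combines post-training chain-of-thought (CoT) prompting with a ReAct-based adaptive noising mechanism to construct stealthy prompts that bypass safety filters, raising the attack success rate (ASR) while keeping text and visual outputs natural.

Details

Motivation: Vision-language models are widely used for tasks such as visual question answering and image captioning, but their outputs are highly sensitive to prompt variations, which can expose vulnerabilities in safety alignment. This paper exploits that sensitivity to develop jailbreak attacks that evade safety filters.

Result: Experiments show that the proposed dual strategy (CoT prompting and adaptive noising) significantly improves ASR while maintaining naturalness in both the text and visual domains.

Insight: The novelty lies in combining post-training CoT prompting with a ReAct-driven adaptive noising mechanism that iteratively perturbs the regions of the input image most likely to activate safety defenses, improving stealth and evasion. Viewed objectively, the work demonstrates that a multimodal attack strategy can use the model's own feedback (ReAct) to refine adversarial noise.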

Abstract: Vision-language models (VLMs) have become central to tasks such as visual question answering, image captioning, and text-to-image generation. However, their outputs are highly sensitive to prompt variations, which can reveal vulnerabilities in safety alignment. In this work, we present a jailbreak framework that exploits post-training Chain-of-Thought (CoT) prompting to construct stealthy prompts capable of bypassing safety filters. To further increase attack success rates (ASR), we propose a ReAct-driven adaptive noising mechanism that iteratively perturbs input images based on model feedback. This approach leverages the ReAct paradigm to refine adversarial noise in regions most likely to activate safety defenses, thereby enhancing stealth and evasion. Experimental results demonstrate that the proposed dual-strategy significantly improves ASR while maintaining naturalness in both text and visual domains.


[34] EMBC Special Issue: Calibrated Uncertainty for Trustworthy Clinical Gait Analysis Using Probabilistic Multiview Markerless Motion Capture cs.CVPDF

Seth Donahue, Irina Djuraskovic, Kunal Shah, Fabian Sinz, Ross Chafetz

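TL;DR: This study evaluates the calibration and reliability of a probabilistic multi-view markerless motion capture (MMMC) method, based on variational inference, for clinical gait analysis. The method quantifies epistemic uncertainty by estimating joint angle posterior distributions. The calibration of its confidence intervals is validated on data from 68 participants, and the results indicate that the model can reliably identify unreliable outputs.

Details

Motivation: Video-based human movement analysis has potential in clinical practice, but multi-view markerless motion capture systems must be not only accurate but also able to provide reliable confidence intervals that indicate how accurate they are for each individual, which is needed for clinical adoption and trust.

Result: The model shows reliable calibration for step length, stride length, and bias-corrected gait kinematics, with Expected Calibration Error (ECE) values generally below 0.1. Median errors are about 16 mm for step length and 12 mm for stride length, and median bias-corrected kinematic errors across lower-extremity joints range from 1.5 to 3.8 degrees. The model's predicted uncertainty correlates strongly with observed error.

Insight: The novelty lies in using variational inference to quantify the epistemic uncertainty of probabilistic MMMC, which allows unreliable outputs to be identified without ground-truth instrumentation and gives clinical gait analysis a trustworthy, calibrated uncertainty estimate.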

Abstract: Video-based human movement analysis holds potential for movement assessment in clinical practice and research. However, the clinical implementation and trust of multi-view markerless motion capture (MMMC) require that, in addition to being accurate, these systems produce reliable confidence intervals to indicate how accurate they are for any individual. Building on our prior work utilizing variational inference to estimate joint angle posterior distributions, this study evaluates the calibration and reliability of a probabilistic MMMC method. We analyzed data from 68 participants across two institutions, validating the model against an instrumented walkway and standard marker-based motion capture. We measured the calibration of the confidence intervals using the Expected Calibration Error (ECE). The model demonstrated reliable calibration, yielding ECE values generally < 0.1 for both step and stride length and bias-corrected gait kinematics. We observed a median step and stride length error of ~16 mm and ~12 mm respectively, with median bias-corrected kinematic errors ranging from 1.5 to 3.8 degrees across lower extremity joints. Consistent with the calibrated ECE, the magnitude of the model’s predicted uncertainty correlated strongly with observed error measures. These findings indicate that, as designed, the probabilistic model reconstruction quantifies epistemic uncertainty, allowing it to identify unreliable outputs without the need for concurrent ground-truth instrumentation.
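
The abstract does not give the formula used for ECE, so the following is only a sketch of one common interval-based formulation: for several nominal coverage levels, compare the fraction of reference values that fall inside the predicted central interval with the nominal level, then average the absolute gaps. It assumes Gaussian predictive distributions; the function name `interval_ece`, the chosen coverage levels, and the toy data are assumptions, not the authors' code.

```python
from statistics import NormalDist


def interval_ece(errors, sigmas, levels=(0.5, 0.68, 0.8, 0.9, 0.95)):
    """Mean absolute gap between nominal and empirical interval coverage.

    errors: prediction minus reference value, one per sample
    sigmas: predicted standard deviation for each sample (Gaussian assumed)
    """
    standard_normal = NormalDist()
    gaps = []
    for level in levels:
        # Half-width of the central interval, in units of sigma.
        z = standard_normal.inv_cdf(0.5 + level / 2)
        covered = sum(abs(e) <= z * s for e, s in zip(errors, sigmas))
        gaps.append(abs(covered / len(errors) - level))
    return sum(gaps) / len(gaps)


# Toy data: errors placed at evenly spaced quantiles of a unit Gaussian.
n = 1000
toy_errors = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

well_calibrated = interval_ece(toy_errors, [1.0] * n)   # sigma matches the spread
overconfident = interval_ece(toy_errors, [0.5] * n)     # sigma is too small
```

When the predicted spread matches the observed errors, the score stays near zero; when the model reports intervals that are too narrow, coverage falls short at every level and the score rises, which is the behavior a threshold such as "ECE < 0.1" is meant to screen for.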


[35] Countering the Over-Reliance Trap: Mitigating Object Hallucination for LVLMs via a Self-Validation Framework cs.CV | cs.AI PDF

Shiyu Liu, Xinyi Wen, Zhibin Lan, Ante Wang, Jinsong Su

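TL;DR: This paper addresses object hallucination in image captioning by Large Vision Language Models (LVLMs) and proposes a training-free Self-Validation Framework. Using Language-Prior-Free Verification, the framework lets the model faithfully assess its confidence that each object in a candidate caption exists, and then mitigates hallucination through caption selection or aggregation.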

Details

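Motivation: Despite progress in LVLMs, models often generate captions that describe non-existent objects, which undermines their reliability. Previous work attributes this to over-reliance on language priors but lacks a thorough analysis. This paper aims to understand the over-reliance mechanism in depth and to mitigate object hallucination effectively.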

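Result: Experiments show that the framework significantly mitigates object hallucination in image captioning, for example a 65.6% improvement on the CHAIR_I metric with LLaVA-v1.5-7B, surpassing previous SOTA methods.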

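Insight: The contribution is an analysis of how over-reliance on language priors worsens hallucination as generation length increases, together with a training-free Self-Validation Framework that mitigates hallucination by unlocking the model’s own inherent potential, offering a new path for hallucination mitigation.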

Abstract: Despite progress in Large Vision Language Models (LVLMs), object hallucination remains a critical issue in the image captioning task, where models generate descriptions of non-existent objects, compromising their reliability. Previous work attributes this to LVLMs’ over-reliance on language priors and attempts to mitigate it through logits calibration. However, these works still lack a thorough analysis of the over-reliance. To gain a deeper understanding of over-reliance, we conduct a series of preliminary experiments, indicating that as the generation length increases, LVLMs’ over-reliance on language priors leads to inflated probability of hallucinated object tokens, consequently exacerbating object hallucination. To circumvent this issue, we propose Language-Prior-Free Verification to enable LVLMs to faithfully verify the confidence of object existence. Based on this, we propose a novel training-free Self-Validation Framework to counter the over-reliance trap. It first validates objects’ existence in sampled candidate captions and further mitigates object hallucination via caption selection or aggregation. Experimental results demonstrate that our framework mitigates object hallucination significantly in the image captioning task (e.g., 65.6% improvement on the CHAIR_I metric with LLaVA-v1.5-7B), surpassing the previous SOTA methods. This result highlights a novel path towards mitigating hallucination by unlocking the inherent potential within LVLMs themselves.
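
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of the instance-level CHAIR score as it is commonly defined: the fraction of mentioned objects that do not appear in the image. Lower is better, so the reported "improvement" presumably refers to a relative reduction. The function name and the toy captions are hypothetical, and object extraction from raw caption text is assumed to have been done already.

```python
def chair_i(captions_objects, ground_truth_objects):
    """Instance-level CHAIR: hallucinated object mentions / all object mentions."""
    mentioned = 0
    hallucinated = 0
    for objs, truth in zip(captions_objects, ground_truth_objects):
        for obj in objs:
            mentioned += 1
            if obj not in truth:
                hallucinated += 1
    return hallucinated / mentioned if mentioned else 0.0

# Toy example: "bench" is mentioned but not present in the first image.
captions = [["dog", "frisbee", "bench"], ["cat", "sofa"]]
truth = [{"dog", "frisbee"}, {"cat", "sofa"}]
score = chair_i(captions, truth)  # 1 hallucinated mention out of 5
```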


[36] ScribbleSense: Generative Scribble-Based Texture Editing with Intent Prediction cs.CV PDF

Yudi Zhang, Yeming Geng, Lei Zhang

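TL;DR: ScribbleSense is a scribble-based texture editing method for 3D models. It combines multimodal large language models (MLLMs) with image generation models to predict the editing intent behind scribbles and to extract local texture details, addressing the ambiguous intent and unclear semantic locations that the abstract nature of scribble instructions causes in existing methods.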

Details

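Motivation: Existing 3D texture editing methods mainly support sketch-based interaction for outlining, while coarse-grained scribble-based interaction remains underused, and the abstract nature of scribble instructions often leads to ambiguous editing intent and unclear target semantic locations.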

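Result: Experiments indicate that the method effectively leverages the strengths of MLLMs and achieves state-of-the-art interactive editing performance for scribble-based texture editing.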

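Insight: The contribution is to use the visual capabilities of MLLMs to predict scribble editing intent, and to extract local texture details from globally generated images to anchor local semantics, which alleviates ambiguity about target semantic locations. Viewed objectively, combining MLLMs with image generation models offers a new approach to understanding abstract interaction instructions.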

Abstract: Interactive 3D model texture editing presents enhanced opportunities for creating 3D assets, with freehand drawing style offering the most intuitive experience. However, existing methods primarily support sketch-based interactions for outlining, while the utilization of coarse-grained scribble-based interaction remains limited. Furthermore, current methodologies often encounter challenges due to the abstract nature of scribble instructions, which can result in ambiguous editing intentions and unclear target semantic locations. To address these issues, we propose ScribbleSense, an editing method that combines multimodal large language models (MLLMs) and image generation models to effectively resolve these challenges. We leverage the visual capabilities of MLLMs to predict the editing intent behind the scribbles. Once the semantic intent of the scribble is discerned, we employ globally generated images to extract local texture details, thereby anchoring local semantics and alleviating ambiguities concerning the target semantic locations. Experimental results indicate that our method effectively leverages the strengths of MLLMs, achieving state-of-the-art interactive editing performance for scribble-based texture editing.


[37] Head-Aware Visual Cropping: Enhancing Fine-Grained VQA with Attention-Guided Subimage cs.CV PDF

Junfei Xie, Peng Pan, Xulong Zhang

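TL;DR: This paper proposes Head-Aware Visual Cropping (HAVC), a training-free method for improving multimodal large language models (MLLMs) on fine-grained visual question answering (VQA). The method filters and refines attention heads to produce a Visual Cropping Guidance Map that highlights task-relevant regions and guides subimage cropping, which strengthens visual grounding.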

Details

Motivation: MLLMs perform strongly on VQA but remain limited in fine-grained reasoning because of low-resolution inputs and noisy attention aggregation.

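Result: Extensive experiments on multiple fine-grained VQA benchmarks show that HAVC consistently outperforms state-of-the-art cropping strategies, with more precise localization and stronger visual grounding.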

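Insight: The contribution is to filter attention heads with an OCR-based diagnostic task and to refine them with spatial entropy and gradient sensitivity, producing a reliable Visual Cropping Guidance Map. This is a simple yet effective strategy for enhancing MLLM precision.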

Abstract: Multimodal Large Language Models (MLLMs) show strong performance in Visual Question Answering (VQA) but remain limited in fine-grained reasoning due to low-resolution inputs and noisy attention aggregation. We propose Head-Aware Visual Cropping (HAVC), a training-free method that improves visual grounding by leveraging a selectively refined subset of attention heads. HAVC first filters heads through an OCR-based diagnostic task, ensuring that only those with genuine grounding ability are retained. At inference, these heads are further refined using spatial entropy for stronger spatial concentration and gradient sensitivity for predictive contribution. The fused signals produce a reliable Visual Cropping Guidance Map, which highlights the most task-relevant region and guides the cropping of a subimage subsequently provided to the MLLM together with the image-question pair. Extensive experiments on multiple fine-grained VQA benchmarks demonstrate that HAVC consistently outperforms state-of-the-art cropping strategies, achieving more precise localization and stronger visual grounding, and providing a simple yet effective strategy for enhancing precision in MLLMs.
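
The abstract mentions spatial entropy as a signal for how concentrated an attention head is. The sketch below is only an illustration of that idea, assuming spatial entropy means the Shannon entropy of the normalized attention map; the paper may define it differently. The head names and toy maps are hypothetical.

```python
import math

def spatial_entropy(attention_map):
    """Shannon entropy of a 2D attention map after normalising it to sum to 1."""
    flat = [v for row in attention_map for v in row]
    total = sum(flat)
    probs = [v / total for v in flat if v > 0]
    return -sum(p * math.log(p) for p in probs)

# Toy 3x3 attention maps: one concentrated on a single region, one uniform.
focused = [[0.0, 0.0, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 0.0]]
diffuse = [[1.0] * 3 for _ in range(3)]

heads = {"head_a": focused, "head_b": diffuse}
# Lower entropy means the head attends to a smaller, more specific region.
most_concentrated = min(heads, key=lambda name: spatial_entropy(heads[name]))
```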


[38] PromptMAD: Cross-Modal Prompting for Multi-Class Visual Anomaly Localization cs.CV PDF

Duncan McCain, Hossein Kashiani, Fatemeh Afghah

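TL;DR: This paper proposes PromptMAD, a cross-modal prompting framework for unsupervised visual anomaly detection and localization. It uses CLIP-encoded text prompts that describe normal and anomalous class-specific characteristics, introduces semantic guidance through vision-language alignment, and applies a focal loss to address pixel-level class imbalance. The architecture fuses multi-scale convolutional features, Transformer-based spatial attention, and diffusion iterative refinement to produce high-resolution anomaly maps. It achieves state-of-the-art pixel-level performance on the MVTec-AD dataset.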

Details

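Motivation: To address the challenges in multi-class visual anomaly detection that arise from the diversity of object categories, the scarcity of anomalous examples, and the presence of camouflaged defects.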

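Result: The method reaches state-of-the-art (SOTA) pixel-level performance on the MVTec-AD dataset, improving mean AUC to 98.35% and mean AP to 66.54% while maintaining efficiency across diverse categories.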

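Insight: The contribution is to inject semantic context into visual reconstruction through cross-modal prompting (CLIP text prompts), improving detection of subtle and textural anomalies. A focal loss handles pixel-level imbalance, and multi-scale feature fusion combined with diffusion iterative refinement improves localization precision.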

Abstract: Visual anomaly detection in multi-class settings poses significant challenges due to the diversity of object categories, the scarcity of anomalous examples, and the presence of camouflaged defects. In this paper, we propose PromptMAD, a cross-modal prompting framework for unsupervised visual anomaly detection and localization that integrates semantic guidance through vision-language alignment. By leveraging CLIP-encoded text prompts describing both normal and anomalous class-specific characteristics, our method enriches visual reconstruction with semantic context, improving the detection of subtle and textural anomalies. To further address the challenge of class imbalance at the pixel level, we incorporate a focal loss function, which emphasizes hard-to-detect anomalous regions during training. Our architecture also includes a supervised segmentor that fuses multi-scale convolutional features with Transformer-based spatial attention and diffusion iterative refinement, yielding precise and high-resolution anomaly maps. Extensive experiments on the MVTec-AD dataset demonstrate that our method achieves state-of-the-art pixel-level performance, improving mean AUC to 98.35% and AP to 66.54%, while maintaining efficiency across diverse categories.
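
To make the role of the focal loss concrete, here is a minimal per-pixel sketch of the standard binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The defaults gamma = 2 and alpha = 0.25 are the commonly cited ones and are an assumption here; the abstract does not state which values PromptMAD uses.

```python
import math

def focal_loss(prob_anomaly, label, gamma=2.0, alpha=0.25):
    """Binary focal loss for one pixel; label is 1 (anomalous) or 0 (normal)."""
    p_t = prob_anomaly if label == 1 else 1.0 - prob_anomaly
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    # (1 - p_t)^gamma shrinks the loss of pixels that are already classified well.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy_normal = focal_loss(0.05, 0)   # normal pixel, confidently predicted as normal
hard_anomaly = focal_loss(0.05, 1)  # anomalous pixel that the model misses
```

With these inputs the missed anomalous pixel contributes several orders of magnitude more loss than the easy normal pixel, which is how the loss counteracts the dominance of abundant normal pixels.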


[39] MIRRORTALK: Forging Personalized Avatars Via Disentangled Style and Hierarchical Motion Control cs.CV | cs.SD PDF

Renjie Lu, Xulong Zhang, Xiaoyang Qu, Jianzong Wang, Shangfei Wang

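TL;DR: This paper proposes MirrorTalk, a generative framework based on a conditional diffusion model for synthesizing personalized talking faces. A Semantically-Disentangled Style Encoder extracts pure style representations from a reference video, and a hierarchical modulation strategy dynamically balances audio and style features during the diffusion process, ensuring both precise lip sync and personalized, expressive full-face dynamics.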

Details

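Motivation: When synthesizing personalized talking faces, existing methods struggle to disentangle a speaker’s specific talking style from the semantic content of facial motion, so they cannot faithfully transfer the speaker’s unique style to arbitrary speech.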

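Result: Extensive experiments show that MirrorTalk achieves significant improvements over state-of-the-art methods in lip-sync accuracy and personalization preservation.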

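Insight: The main contributions are the Semantically-Disentangled Style Encoder for extracting pure style representations and a hierarchical modulation strategy within the diffusion model that dynamically balances audio and style features across distinct facial regions, which separates style from content and makes both controllable.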

Abstract: Synthesizing personalized talking faces that uphold and highlight a speaker’s unique style while maintaining lip-sync accuracy remains a significant challenge. A primary limitation of existing approaches is the intrinsic confounding of speaker-specific talking style and semantic content within facial motions, which prevents the faithful transfer of a speaker’s unique persona to arbitrary speech. In this paper, we propose MirrorTalk, a generative framework based on a conditional diffusion model, combined with a Semantically-Disentangled Style Encoder (SDSE) that can distill pure style representations from a brief reference video. To effectively utilize this representation, we further introduce a hierarchical modulation strategy within the diffusion process. This mechanism guides the synthesis by dynamically balancing the contributions of audio and style features across distinct facial regions, ensuring both precise lip-sync accuracy and expressive full-face dynamics. Extensive experiments demonstrate that MirrorTalk achieves significant improvements over state-of-the-art methods in terms of lip-sync accuracy and personalization preservation.


[40] DreamVAR: Taming Reinforced Visual Autoregressive Model for High-Fidelity Subject-Driven Image Generation cs.CV PDF

Xin Jiang, Jingwen Chen, Yehao Li, Yingwei Pan, Kezhou Chen

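TL;DR: This paper proposes DreamVAR, a new framework for subject-driven image generation built on a Visual Autoregressive (VAR) model. It pre-fills the full subject feature sequence to simplify autoregressive dependencies and uses reinforcement learning to jointly enhance semantic alignment and subject consistency, achieving high-fidelity image synthesis while retaining efficient inference.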

Details

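Motivation: VAR models offer a unified architecture and efficient inference, yet they remain underexplored for subject-driven image generation, where diffusion models currently produce high-quality results. DreamVAR aims to unlock the capabilities of VAR models for this task.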

Result: Extensive experiments show that DreamVAR preserves subject appearance better than leading diffusion-based methods, achieving higher fidelity.

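Insight: Contributions include pre-filling the full subject feature sequence within the VAR paradigm to simplify multi-scale conditional dependencies, and introducing reinforcement learning to jointly optimize semantic alignment and subject consistency, which suggests new directions for autoregressive models in image generation.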

Abstract: Recent advances in subject-driven image generation using diffusion models have attracted considerable attention for their remarkable capabilities in producing high-quality images. Nevertheless, the potential of Visual Autoregressive (VAR) models, despite their unified architecture and efficient inference, remains underexplored. In this work, we present DreamVAR, a novel framework for subject-driven image synthesis built upon a VAR model that employs next-scale prediction. Technically, multi-scale features of the reference subject are first extracted by a visual tokenizer. Instead of interleaving these conditional features with target image tokens across scales, DreamVAR pre-fills the full subject feature sequence prior to predicting target image tokens. This design simplifies autoregressive dependencies and mitigates the train-test discrepancy in the multi-scale conditioning scenario within the VAR paradigm. DreamVAR further incorporates reinforcement learning to jointly enhance semantic alignment and subject consistency. Extensive experiments demonstrate that DreamVAR achieves superior appearance preservation compared to leading diffusion-based methods.


[41] CoVA: Text-Guided Composed Video Retrieval for Audio-Visual Content cs.CV PDF

Gyuwon Han, Young Kyun Jang, Chanho Eom

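TL;DR: This paper introduces CoVA, a text-guided composed video retrieval task for audio-visual content: given a reference video and a text query describing modifications, retrieve the target video while accounting for both visual and auditory changes. The authors construct the AV-Comp benchmark and propose AVT Compositional Fusion, a multimodal fusion method that integrates video, audio, and text features by selectively aligning the query to the most relevant modality.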

Details

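Motivation: Existing composed video retrieval benchmarks consider only visual changes and ignore videos that look similar but differ in audio, so a new task and benchmark are needed to handle both visual and auditory differences.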

Result: The proposed AVT method outperforms traditional unimodal fusion on the AV-Comp benchmark and provides a strong baseline for CoVA.

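Insight: The contribution is to introduce audio as a key modality in composed video retrieval and to build a benchmark with cross-modal changes. Methodologically, selective modality alignment improves multimodal feature fusion and may transfer to other tasks that involve audio-visual-text interaction.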

Abstract: Composed Video Retrieval (CoVR) aims to retrieve a target video from a large gallery using a reference video and a textual query specifying visual modifications. However, existing benchmarks consider only visual changes, ignoring videos that differ in audio despite visual similarity. To address this limitation, we introduce Composed retrieval for Video with its Audio (CoVA), a new retrieval task that accounts for both visual and auditory variations. To support this, we construct AV-Comp, a benchmark consisting of video pairs with cross-modal changes and corresponding textual queries that describe the differences. We also propose AVT Compositional Fusion (AVT), which integrates video, audio, and text features by selectively aligning the query to the most relevant modality. AVT outperforms traditional unimodal fusion and serves as a strong baseline for CoVA. Examples from the proposed dataset, including both visual and auditory information, are available at https://perceptualai-lab.github.io/CoVA/.


[42] Can 3D point cloud data improve automated body condition score prediction in dairy cattle? cs.CV PDF

Zhou Tang, Jin Wang, Angelo De Castro, Yuxi Zhang, Victoria Bastos Primo

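TL;DR: This study compares top-view depth images and 3D point cloud data for predicting body condition score (BCS) in dairy cattle, evaluating models under four data settings (unsegmented raw data, segmented full-body data, segmented hindquarter data, and handcrafted feature data). Depth image models were more accurate than point cloud models on unsegmented and segmented full-body data, the two performed comparably on segmented hindquarter data, and both lost accuracy with handcrafted features. Overall, under the evaluated conditions, 3D point clouds did not provide a consistent advantage over depth images.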

Details

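Motivation: Conventional BCS scoring of dairy cattle is subjective and labor-intensive. The study explores whether 3D point cloud data capture the geometric characteristics of animal morphology better than the widely used depth images, and thereby improve automated BCS prediction.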

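Result: Models were evaluated on data from 1,020 dairy cows collected on a commercial farm, using cow-level cross-validation. Depth image models consistently achieved higher accuracy than point cloud models on unsegmented raw data and segmented full-body data, the two were comparable on segmented hindquarter data, and both were less accurate with handcrafted feature data than in the other settings. Point cloud predictions were more sensitive to noise and model architecture.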

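Insight: The paper’s stated contribution is a systematic head-to-head comparison of depth images and point clouds for dairy cattle BCS prediction across several settings, a comparison the abstract describes as still limited in the literature. Viewed objectively, the core finding is that richer 3D point cloud data showed no stable advantage over depth images under the evaluated conditions. This challenges the implicit assumption that higher-dimensional data yield better predictions, suggests that point cloud methods are more sensitive to data quality and model design, and offers a useful reference for sensor selection in practice.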

Abstract: Body condition score (BCS) is a widely used indicator of body energy status and is closely associated with metabolic status, reproductive performance, and health in dairy cattle; however, conventional visual scoring is subjective and labor-intensive. Computer vision approaches have been applied to BCS prediction, with depth images widely used because they capture geometric information independent of coat color and texture. More recently, three-dimensional point cloud data have attracted increasing interest due to their ability to represent richer geometric characteristics of animal morphology, but direct head-to-head comparisons with depth image-based approaches remain limited. In this study, we compared top-view depth image and point cloud data for BCS prediction under four settings: 1) unsegmented raw data, 2) segmented full-body data, 3) segmented hindquarter data, and 4) handcrafted feature data. Prediction models were evaluated using data from 1,020 dairy cows collected on a commercial farm, with cow-level cross-validation to prevent data leakage. Depth image-based models consistently achieved higher accuracy than point cloud-based models when unsegmented raw data and segmented full-body data were used, whereas comparable performance was observed when segmented hindquarter data were used. Both depth image and point cloud approaches showed reduced accuracy when handcrafted feature data were employed compared with the other settings. Overall, point cloud-based predictions were more sensitive to noise and model architecture than depth image-based predictions. Taken together, these results indicate that three-dimensional point clouds do not provide a consistent advantage over depth images for BCS prediction in dairy cattle under the evaluated conditions.
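
The abstract notes that cow-level cross-validation was used to prevent data leakage. A minimal sketch of that idea follows: every sample from the same animal is assigned to the same fold, so no cow appears in both training and test data. The round-robin assignment, the function name, and the toy IDs are assumptions for illustration; the paper does not describe its exact splitting procedure.

```python
def cow_level_folds(cow_ids, n_folds=5):
    """Return sample indices per fold, keeping each cow's samples together."""
    unique_cows = sorted(set(cow_ids))
    # Round-robin assignment of cows (not samples) to folds.
    fold_of_cow = {cow: i % n_folds for i, cow in enumerate(unique_cows)}
    folds = [[] for _ in range(n_folds)]
    for sample_index, cow in enumerate(cow_ids):
        folds[fold_of_cow[cow]].append(sample_index)
    return folds

# Toy example: ten images taken from six cows.
cow_ids = ["c1", "c1", "c2", "c3", "c3", "c3", "c4", "c5", "c5", "c6"]
folds = cow_level_folds(cow_ids, n_folds=3)
cows_per_fold = [{cow_ids[i] for i in fold} for fold in folds]
```

Splitting at the sample level instead would let near-duplicate images of one cow land on both sides of the split and inflate the measured accuracy.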


[43] SHED Light on Segmentation for Dense Prediction cs.CV PDF

Seung Hyun Lee, Sangwoo Mo, Stella X. Yu

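TL;DR: SHED is a novel encoder-decoder architecture that explicitly enforces geometric priors by incorporating segmentation into dense prediction. It uses bidirectional hierarchical reasoning: segment tokens are hierarchically pooled in the encoder and unpooled in the decoder to reverse the hierarchy. The model is supervised only at the final output, so the segment hierarchy emerges without explicit segmentation supervision. SHED improves depth boundary sharpness and segment coherence and shows strong cross-domain generalization from synthetic to real-world environments.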

Details

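Motivation: Existing dense prediction methods treat the task as independent pixel-wise prediction, which leads to structural inconsistencies. SHED addresses this by introducing segmentation to enforce geometric priors.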

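Result: SHED improves boundary sharpness and segment coherence in depth estimation, generalizes strongly across domains (synthetic to real-world), and improves semantic segmentation performance and 3D reconstruction quality.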

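Insight: The contribution is to guide dense prediction with a segment hierarchy that is learned without segmentation supervision, using bidirectional hierarchical reasoning (encoder pooling and decoder unpooling) to learn scene structure implicitly. This improves the capture of global 3D scene layouts and reveals interpretable part-level structures.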

Abstract: Dense prediction infers per-pixel values from a single image and is fundamental to 3D perception and robotics. Although real-world scenes exhibit strong structure, existing methods treat the task as independent pixel-wise prediction, often resulting in structural inconsistencies. We propose SHED, a novel encoder-decoder architecture that enforces geometric priors explicitly by incorporating segmentation into dense prediction. Through bidirectional hierarchical reasoning, segment tokens are hierarchically pooled in the encoder and unpooled in the decoder to reverse the hierarchy. The model is supervised only at the final output, allowing the segment hierarchy to emerge without explicit segmentation supervision. SHED improves depth boundary sharpness and segment coherence, while demonstrating strong cross-domain generalization from synthetic to real-world environments. Its hierarchy-aware decoder better captures global 3D scene layouts, leading to improved semantic segmentation performance. Moreover, SHED enhances 3D reconstruction quality and reveals interpretable part-level structures that are often missed by conventional pixel-wise methods.


[44] Leveraging Data to Say No: Memory Augmented Plug-and-Play Selective Prediction cs.CV | cs.LG PDF

Aditya Sarkar, Yi Li, Jiacheng Cheng, Shlok Mishra, Nuno Vasconcelos

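TL;DR: This paper proposes MA-PaPSP (Memory Augmented Plug-and-Play Selective Prediction), a training-free, low-complexity selective prediction method that gives visual language foundation models a reject option, using external vision-language embeddings such as CLIP. It targets tasks ranging from closed to open set and from finite to unbounded vocabularies (for example, image captioning). The method introduces a retrieval dataset to reduce the variance of embeddings and uses contrastive normalization to improve score calibration, which improves selective prediction performance.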

Details

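Motivation: Existing selective prediction research focuses mainly on closed-set tasks. This paper addresses selective prediction for visual language foundation models on a broader range of tasks, including open-set and unbounded-vocabulary tasks, and seeks a general approach that requires no training, has low complexity, and applies to any foundation model.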

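Result: Experiments on multiple datasets covering selective captioning, image-text matching, and fine-grained classification show that MA-PaPSP outperforms the basic PaPSP method and other selective prediction baselines.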

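Insight: The contribution is a general, training-free selective prediction framework based on an external retrieval dataset (memory augmentation) and contrastive normalization. It addresses two key challenges, the instability of visual-language representations and the poor calibration of similarity scores, and offers a reliable selective prediction approach for open-set vision-language tasks.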

Abstract: Selective prediction aims to endow predictors with a reject option, to avoid low confidence predictions. However, existing literature has primarily focused on closed-set tasks, such as visual question answering with predefined options or fixed-category classification. This paper considers selective prediction for visual language foundation models, addressing a taxonomy of tasks ranging from closed to open set and from finite to unbounded vocabularies, as in image captioning. We seek training-free, low-complexity approaches applicable to any foundation model, and consider methods based on external vision-language model embeddings, like CLIP. This is denoted as Plug-and-Play Selective Prediction (PaPSP). We identify two key challenges: (1) instability of the visual-language representations, leading to high variance in image-text embeddings, and (2) poor calibration of similarity scores. To address these issues, we propose a memory augmented PaPSP (MA-PaPSP) model, which augments PaPSP with a retrieval dataset of image-text pairs. This is leveraged to reduce embedding variance by averaging retrieved nearest-neighbor pairs and is complemented by the use of contrastive normalization to improve score calibration. Through extensive experiments on multiple datasets, we show that MA-PaPSP outperforms PaPSP and other selective prediction baselines for selective captioning, image-text matching, and fine-grained classification. Code is publicly available at https://github.com/kingston-aditya/MA-PaPSP.
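
The reject option described above can be illustrated with a generic sketch: a prediction is accepted only when its confidence score clears a threshold, and the system is then judged by coverage (fraction of inputs answered) and selective risk (error rate on the answered inputs). This shows the general selective prediction setting only, not the MA-PaPSP scoring function; the scores and labels below are made-up toy values.

```python
def selective_metrics(confidences, correct, threshold):
    """Return (coverage, selective risk) when rejecting scores below threshold."""
    accepted = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(accepted) / len(confidences)
    # Risk is the error rate among accepted predictions (0.0 if none accepted).
    risk = 1.0 - sum(accepted) / len(accepted) if accepted else 0.0
    return coverage, risk

# Toy confidence scores (e.g. image-text similarities) and correctness flags.
confidences = [0.95, 0.90, 0.60, 0.40, 0.30]
correct = [True, True, False, True, False]

coverage_all, risk_all = selective_metrics(confidences, correct, threshold=0.0)
coverage_sel, risk_sel = selective_metrics(confidences, correct, threshold=0.8)
```

Raising the threshold trades coverage for lower risk, and that trade-off only works well when the scores are calibrated, which is the problem the abstract's second challenge refers to.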


[45] Mitigating Hallucinations in Video Large Language Models via Spatiotemporal-Semantic Contrastive Decoding cs.CV | cs.AI PDF

Yuansheng Gao, Jinman Zhao, Tong Zhang, Xingguo Xu, Han Bao

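TL;DR: This paper proposes a new decoding strategy, Spatiotemporal-Semantic Contrastive Decoding, to mitigate hallucination in Video Large Language Models. It constructs negative features by deliberately disrupting the spatiotemporal consistency and semantic associations of video features, then suppresses hallucination at inference through contrastive decoding against the original video features.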

Details

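Motivation: Existing decoding methods for mitigating video hallucination rely mostly on heuristic designs. They fail to capture precisely the root causes of hallucination and their fine-grained temporal and semantic correlations, which limits robustness and generalization in complex scenarios.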

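Result: Extensive experiments show that the method not only reduces the occurrence of hallucinations effectively but also preserves the model’s general video understanding and reasoning capabilities.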

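Insight: The contribution is to construct spatiotemporally and semantically negative features for contrastive decoding, which suppresses hallucination at its source rather than relying on heuristic rules and thereby improves precision and generalization.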

Abstract: Although Video Large Language Models perform remarkably well across tasks such as video understanding, question answering, and reasoning, they still suffer from the problem of hallucination, which refers to generating outputs that are inconsistent with explicit video content or factual evidence. However, existing decoding methods for mitigating video hallucinations, while considering the spatiotemporal characteristics of videos, mostly rely on heuristic designs. As a result, they fail to precisely capture the root causes of hallucinations and their fine-grained temporal and semantic correlations, leading to limited robustness and generalization in complex scenarios. To more effectively mitigate video hallucinations, we propose a novel decoding strategy termed Spatiotemporal-Semantic Contrastive Decoding. This strategy constructs negative features by deliberately disrupting the spatiotemporal consistency and semantic associations of video features, and suppresses video hallucinations through contrastive decoding against the original video features during inference. Extensive experiments demonstrate that our method not only effectively mitigates the occurrence of hallucinations, but also preserves the general video understanding and reasoning capabilities of the model.
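
The abstract does not give the decoding formula. The sketch below uses the widely used contrastive decoding form, in which logits from the disrupted (negative) input are subtracted from amplified logits of the original input; the weighting `alpha`, the three-token vocabulary, and the logit values are all assumptions for illustration and are not taken from the paper.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_logits(original, negative, alpha=1.0):
    """(1 + alpha) * original - alpha * negative, applied per vocabulary entry."""
    return [(1.0 + alpha) * o - alpha * n for o, n in zip(original, negative)]

# Toy vocabulary: index 0 is grounded in the video, index 1 is a prior-driven guess.
original = [2.0, 2.2, 0.1]
negative = [0.5, 2.1, 0.1]  # the disrupted video still favours the prior-driven token
adjusted = contrastive_logits(original, negative, alpha=1.0)
probs = softmax(adjusted)
```

Tokens whose score barely changes when the video is disrupted are likely driven by priors rather than visual evidence, so the subtraction demotes them and the grounded token becomes the top choice.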


[46] PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios cs.CV | cs.CL PDF

Xudong Lu, Huankang Guan, Yang Bo, Jinpeng Chen, Xintong Guo

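TL;DR: The paper introduces PhoStream, the first mobile-centric streaming benchmark for omnimodal assistants, which evaluates video, audio, and temporal reasoning. It contains 5,572 open-ended QA pairs covering 4 scenarios and 10 capabilities, built with an automated generation pipeline. Experiments find that current multimodal large language models perform well on Instant and Backward tasks but drop sharply on Forward tasks, revealing a fundamental limitation: models struggle to decide when to respond, not just what to say.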

Details

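Motivation: Multimodal large language models excel at offline audio-visual understanding, but their ability to serve as mobile assistants in continuous real-world streams remains underexplored. Existing benchmarks are often restricted to multiple-choice questions or use shorter videos, so they cannot evaluate whether a model times its responses correctly under streaming input.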

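Result: On PhoStream, with LLM-judged scores on a 0-100 scale, models perform well on Instant and Backward tasks (for example, Gemini 3 Pro exceeds 80) but drop sharply to 16.40 on Forward tasks, largely because they respond early, before the required visual and audio cues appear.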

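Insight: The contribution is the first mobile-centric streaming benchmark that unifies on-screen and off-screen scenarios, built with an automated generation pipeline and evaluated with a rigorous online inference pipeline. The core insight is a fundamental limitation of current MLLMs: they struggle to decide when to speak (temporal reasoning), not just what to say.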

Abstract: Multimodal Large Language Models excel at offline audio-visual understanding, but their ability to serve as mobile assistants in continuous real-world streams remains underexplored. In daily phone use, mobile assistants must track streaming audio-visual inputs and respond at the right time, yet existing benchmarks are often restricted to multiple-choice questions or use shorter videos. In this paper, we introduce PhoStream, the first mobile-centric streaming benchmark that unifies on-screen and off-screen scenarios to evaluate video, audio, and temporal reasoning. PhoStream contains 5,572 open-ended QA pairs from 578 videos across 4 scenarios and 10 capabilities. We build it with an Automated Generative Pipeline backed by rigorous human verification, and evaluate models using a realistic Online Inference Pipeline and LLM-as-a-Judge evaluation for open-ended responses. Experiments reveal a temporal asymmetry in LLM-judged scores (0-100): models perform well on Instant and Backward tasks (Gemini 3 Pro exceeds 80), but drop sharply on Forward tasks (16.40), largely due to early responses before the required visual and audio cues appear. This highlights a fundamental limitation: current MLLMs struggle to decide when to speak, not just what to say. Code and datasets used in this work will be made publicly accessible at https://github.com/Lucky-Lance/PhoStream.


[47] LINA: Linear Autoregressive Image Generative Models with Continuous Tokens cs.CV PDF

Jiahao Wang, Ting Pan, Haoge Deng, Dongchen Han, Taiqiang Wu

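TL;DR: This paper proposes LINA, a simple and compute-efficient text-to-image generation model built entirely on linear attention. It systematically studies how design choices in linear autoregressive models (such as the normalization paradigm and depthwise convolution) affect scaling with parameter count, and proposes a KV gate to improve memory management. LINA generates high-fidelity 1024x1024 images and achieves competitive performance on the ImageNet and GenEval benchmarks while substantially reducing computational cost.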

Details

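Motivation: Autoregressive models with continuous tokens are promising for visual generation, especially text-to-image synthesis, but they are computationally expensive. This paper studies how to design compute-efficient linear attention within this framework.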

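Result: LINA obtains 2.18 FID on the class-conditional ImageNet benchmark (about 1.4B parameters) and 0.74 on the GenEval text-to-image benchmark (about 1.5B parameters). A single linear attention module reduces FLOPs by about 61% compared with softmax attention, with competitive performance.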

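Insight: The contributions are: 1) a systematic empirical analysis showing that, for linear generative transformers, division-based normalization scales better than subtraction-based normalization; 2) confirmation that depthwise convolution for locality modeling plays a crucial role in autoregressive generation; 3) an extension of gating mechanisms to the bidirectional setting through a KV gate, which uses data-independent learnable parameters to provide flexible memory management similar to forget gates in language models. Together these findings form the basis of an efficient, high-performing linear attention T2I model.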

Abstract: Autoregressive models with continuous tokens form a promising paradigm for visual generation, especially for text-to-image (T2I) synthesis, but they suffer from high computational cost. We study how to design compute-efficient linear attention within this framework. Specifically, we conduct a systematic empirical analysis of scaling behavior with respect to parameter counts under different design choices, focusing on (1) normalization paradigms in linear attention (division-based vs. subtraction-based) and (2) depthwise convolution for locality augmentation. Our results show that although subtraction-based normalization is effective for image classification, division-based normalization scales better for linear generative transformers. In addition, incorporating convolution for locality modeling plays a crucial role in autoregressive generation, consistent with findings in diffusion models. We further extend gating mechanisms, commonly used in causal linear attention, to the bidirectional setting and propose a KV gate. By introducing data-independent learnable parameters to the key and value states, the KV gate assigns token-wise memory weights, enabling flexible memory management similar to forget gates in language models. Based on these findings, we present LINA, a simple and compute-efficient T2I model built entirely on linear attention, capable of generating high-fidelity 1024x1024 images from user instructions. LINA achieves competitive performance on both class-conditional and T2I benchmarks, obtaining 2.18 FID on ImageNet (about 1.4B parameters) and 0.74 on GenEval (about 1.5B parameters). A single linear attention module reduces FLOPs by about 61 percent compared to softmax attention. Code and models are available at: https://github.com/techmonsterwang/LINA.
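
As background for the normalization discussion, the sketch below shows bidirectional linear attention with division-based normalization: keys and values are summarized once, and each query output is divided by the sum of its kernel weights. The feature map elu(x) + 1 is a common choice and an assumption here; the abstract does not specify LINA's feature map, and the KV gate and depthwise convolution are not modelled.

```python
import math

def feature_map(x):
    # elu(x) + 1 keeps features positive, so the normaliser is never zero.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(queries, keys, values):
    d_k, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d_k)]  # sum_j phi(k_j) v_j^T
    k_sum = [0.0] * d_k                     # sum_j phi(k_j)
    for k, v in zip(keys, values):
        fk = feature_map(k)
        for a in range(d_k):
            k_sum[a] += fk[a]
            for b in range(d_v):
                kv[a][b] += fk[a] * v[b]
    outputs = []
    for q in queries:
        fq = feature_map(q)
        denom = sum(fq[a] * k_sum[a] for a in range(d_k))  # division-based normaliser
        outputs.append([sum(fq[a] * kv[a][b] for a in range(d_k)) / denom
                        for b in range(d_v)])
    return outputs

def quadratic_reference(queries, keys, values):
    """Same result computed with explicit, row-normalised attention weights."""
    outputs = []
    for q in queries:
        fq = feature_map(q)
        weights = [sum(a * b for a, b in zip(fq, feature_map(k))) for k in keys]
        total = sum(weights)
        outputs.append([sum(w * v[b] for w, v in zip(weights, values)) / total
                        for b in range(len(values[0]))])
    return outputs

queries = [[0.2, -0.5], [1.0, 0.3]]
keys = [[0.1, 0.4], [-0.7, 0.9], [0.5, -0.2]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
fast = linear_attention(queries, keys, values)
slow = quadratic_reference(queries, keys, values)
```

The key and value summaries are built in one pass over the sequence, so the cost grows linearly with sequence length instead of quadratically, while matching the explicit pairwise computation.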


[48] What can Computer Vision learn from Ranganathan? cs.CV | cs.AI PDF

Mayukh Bagchi, Fausto Giunchiglia

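TL;DR: This paper proposes drawing on the classification principles of the library scientist S.R. Ranganathan to address the Semantic Gap Problem in computer vision, and describes how they underpin the vTelos CV annotation methodology. Experiments indicate that the methodology improves annotation quality and model accuracy.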

Details

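Motivation: The aim is to address the Semantic Gap Problem in computer vision, which arises from the misalignment between visual and lexical semantics and leads to flawed CV dataset design and benchmarks.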

Result: The paper briefly presents experimental evidence that vTelos improves CV annotation and accuracy, which supports the validity of the methodology.

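Insight: The core contribution is to adapt classical classification principles from library science (such as faceted classification) and apply them to computer vision dataset construction, providing a principled starting point for addressing the Semantic Gap Problem systematically.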

Abstract: The Semantic Gap Problem (SGP) in Computer Vision (CV) arises from the misalignment between visual and lexical semantics, leading to flawed CV dataset design and CV benchmarks. This paper proposes that classification principles of S.R. Ranganathan can offer a principled starting point to address SGP and design high-quality CV datasets. We elucidate how these principles, suitably adapted, underpin the vTelos CV annotation methodology. The paper also briefly presents experimental evidence showing improvements in CV annotation and accuracy, thereby validating vTelos.


[49] Unsupervised Synthetic Image Attribution: Alignment and Disentanglement cs.CV | cs.AI PDF

Zongfang Liu, Guangyi Chen, Boyang Sun, Tongliang Liu, Kun Zhang

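TL;DR: This paper proposes Alignment and Disentanglement, an unsupervised synthetic image attribution method that identifies the concepts underlying model-generated images without paired annotations. It first performs basic concept alignment with contrastive self-supervised learning, then promotes representation disentanglement with the Infomax loss to strengthen attribution. Theoretical analysis shows that the method approximates the concept-matching process through a decomposition of the canonical correlation analysis objective.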

Details

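Motivation: As synthetic image quality improves, identifying the concepts behind generated images becomes crucial for copyright protection and model transparency. Existing methods rely on annotated pairs of synthetic images and their training sources, but such supervision is costly to obtain. This paper explores whether synthetic image attribution can be done without supervision, removing the need for expensive paired annotations.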

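Result: On the real-world benchmark AbC, the proposed unsupervised method surprisingly outperforms supervised methods, demonstrating its effectiveness.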

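Insight: The contribution is to exploit the inherent cross-domain alignment ability of contrastive self-supervised models (such as MoCo and DINO), to formalize this observation as a theoretical assumption on cross-covariance, and to combine alignment with disentanglement for unsupervised attribution. This offers a fresh perspective on a challenging task: unsupervised methods may match or even exceed supervised performance by relying on intrinsic properties of representation learning.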

Abstract: As the quality of synthetic images improves, identifying the underlying concepts of model-generated images is becoming increasingly crucial for copyright protection and ensuring model transparency. Existing methods achieve this attribution goal by training models using annotated pairs of synthetic images and their original training sources. However, obtaining such paired supervision is challenging, as it requires either well-designed synthetic concepts or precise annotations from millions of training sources. To eliminate the need for costly paired annotations, in this paper, we explore the possibility of unsupervised synthetic image attribution. We propose a simple yet effective unsupervised method called Alignment and Disentanglement. Specifically, we begin by performing basic concept alignment using contrastive self-supervised learning. Next, we enhance the model’s attribution ability by promoting representation disentanglement with the Infomax loss. This approach is motivated by an interesting observation: contrastive self-supervised models, such as MoCo and DINO, inherently exhibit the ability to perform simple cross-domain alignment. By formulating this observation as a theoretical assumption on cross-covariance, we provide a theoretical explanation of how alignment and disentanglement can approximate the concept-matching process through a decomposition of the canonical correlation analysis objective. On the real-world benchmark AbC, we show that our unsupervised method surprisingly outperforms the supervised methods. As a starting point, we expect our intuitive insights and experimental findings to provide a fresh perspective on this challenging task.


[50] ExpAlign: Expectation-Guided Vision-Language Alignment for Open-Vocabulary Grounding cs.CV PDF

Junyi Hu, Tian Bai, Fengyi Wu, Wenyan Li, Zhenming Peng

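TL;DR: ExpAlign is a theoretically grounded vision-language alignment framework for open-vocabulary grounding. Its Expectation Alignment Head performs attention-based soft multiple instance learning (MIL) pooling, which enables implicit token and instance selection without additional annotations. The framework also uses an energy-based multi-scale consistency regularization scheme to stabilize alignment learning. Experiments show that ExpAlign performs well on open-vocabulary detection and zero-shot instance segmentation, particularly on long-tail categories, and reaches state-of-the-art performance on the LVIS benchmark.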

Details

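Motivation: Open-vocabulary grounding requires vision-language alignment under weak supervision. Existing methods either rely on global sentence embeddings that lack fine-grained expressiveness or require explicit supervision or heavy cross-attention designs, so the paper proposes a theory-driven alignment framework to improve fine-grained alignment.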

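Result: ExpAlign achieves 36.2 AP$_r$ on the LVIS minival split, outperforming other state-of-the-art methods at comparable model scale while remaining lightweight and inference-efficient. It consistently improves open-vocabulary detection and zero-shot instance segmentation, particularly on long-tail categories.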

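Insight: Contributions include the Expectation Alignment Head, which is grounded in multiple instance learning and performs implicit token selection, and an energy-based multi-scale consistency regularization (a Top-K multi-positive contrastive objective and a Geometry-Aware Consistency Objective) that stabilizes learning. These avoid extra annotations and heavy designs while improving fine-grained alignment.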

Abstract: Open-vocabulary grounding requires accurate vision-language alignment under weak supervision, yet existing methods either rely on global sentence embeddings that lack fine-grained expressiveness or introduce token-level alignment with explicit supervision or heavy cross-attention designs. We propose ExpAlign, a theoretically grounded vision-language alignment framework built on a principled multiple instance learning formulation. ExpAlign introduces an Expectation Alignment Head that performs attention-based soft MIL pooling over token-region similarities, enabling implicit token and instance selection without additional annotations. To further stabilize alignment learning, we develop an energy-based multi-scale consistency regularization scheme, including a Top-K multi-positive contrastive objective and a Geometry-Aware Consistency Objective derived from a Lagrangian-constrained free-energy minimization. Extensive experiments show that ExpAlign consistently improves open-vocabulary detection and zero-shot instance segmentation, particularly on long-tail categories. Most notably, it achieves 36.2 AP$_r$ on the LVIS minival split, outperforming other state-of-the-art methods at comparable model scale, while remaining lightweight and inference-efficient.
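
To illustrate what attention-based soft MIL pooling can look like, the sketch below pools one token's similarities to several image regions by weighting each similarity with a softmax over the similarities themselves, i.e. an expectation under an attention distribution. This is a generic formulation and an assumption; the abstract does not give ExpAlign's exact pooling rule, and the `temperature` parameter is hypothetical.

```python
import math

def soft_mil_pool(similarities, temperature=0.1):
    """Expected similarity under softmax(similarities / temperature)."""
    m = max(similarities)
    weights = [math.exp((s - m) / temperature) for s in similarities]
    total = sum(weights)
    return sum(w / total * s for w, s in zip(weights, similarities))

# Toy similarities between one text token and three image regions.
sims = [0.1, 0.9, 0.2]
nearly_max = soft_mil_pool(sims, temperature=0.01)  # sharp attention, close to max pooling
nearly_mean = soft_mil_pool(sims, temperature=1e6)  # flat attention, close to mean pooling
default = soft_mil_pool(sims)
```

The temperature interpolates between max pooling (hard selection of the best region) and mean pooling, so region selection stays differentiable and needs no region-level labels.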


[51] VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration cs.CV PDF

Hanxun Yu, Wentong Li, Xuan Qu, Song Wang, Junbo Chen

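TL;DR: This paper proposes VisionTrim, a unified, training-free vision token compression framework for accelerating multimodal large language models (MLLMs). Two plug-and-play modules, Dominant Vision Token Selection (DVTS) and Text-Guided Vision Complement (TGVC), reduce redundant visual tokens while preserving alignment with the text, yielding efficient acceleration across diverse image and video multimodal benchmarks.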

Details

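Motivation: MLLMs incur high computational costs because of excessive visual tokens in high-resolution and video scenarios. Existing token reduction methods typically focus on isolated pipeline components and neglect textual alignment, which degrades performance.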

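Result: Extensive experiments across diverse image and video multimodal benchmarks demonstrate the superior performance of VisionTrim, advancing practical MLLM deployment in real-world applications.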

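Insight: The contribution is a unified training-free acceleration framework that combines token selection from a global-local view with text-guided, context-aware token merging, preserving textual alignment while reducing computational overhead.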

Abstract: Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically focus on isolated pipeline components and often neglect textual alignment, leading to performance degradation. In this paper, we propose VisionTrim, a unified framework for training-free MLLM acceleration, integrating two effective plug-and-play modules: 1) the Dominant Vision Token Selection (DVTS) module, which preserves essential visual tokens via a global-local view, and 2) the Text-Guided Vision Complement (TGVC) module, which facilitates context-aware token merging guided by textual cues. Extensive experiments across diverse image and video multimodal benchmarks demonstrate the performance superiority of our VisionTrim, advancing practical MLLM deployment in real-world applications. The code is available at: https://github.com/hanxunyu/VisionTrim.


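[52] Fire on Motion: Optimizing Video Pass-bands for Efficient Spiking Action Recognition cs.CV | cs.AI

Shuhan Ye, Yuanbin Qian, Yi Yu, Chong Wang, Yuqi Xie

TL;DR: This paper proposes the Pass-Bands Optimizer (PBO), a plug-and-play module that addresses the poor performance of spiking neural networks (SNNs) on video tasks caused by a temporal pass-band mismatch. PBO optimizes the temporal pass-band toward task-relevant motion bands, substantially improving SNN performance on dynamic tasks such as action recognition and video anomaly detection.

Details

Motivation: SNNs approach artificial neural networks (ANNs) on static image tasks but lag behind on dynamic video tasks. The authors diagnose the root cause: standard spiking dynamics behave like a temporal low-pass filter that emphasizes static content while attenuating motion-bearing bands, so task-relevant dynamic information is lost.

Result: On UCF101, PBO yields an improvement of over ten percentage points. On more complex multi-modal action recognition and weakly supervised video anomaly detection, it also delivers consistent and significant gains.

Insight: The novelty lies in diagnosing the core bottleneck of SNNs on video tasks, the temporal pass-band mismatch, and in proposing PBO, a lightweight optimizer with only two learnable parameters, to address it. The method requires no architectural changes and adds negligible computational overhead. By suppressing static components that contribute little to discrimination, it effectively high-pass filters the stream so that spiking activity concentrates on motion content, offering a new perspective for SNN-based video processing.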

Abstract: Spiking neural networks (SNNs) have gained traction in vision due to their energy efficiency, bio-plausibility, and inherent temporal processing. Yet, despite this temporal capacity, most progress concentrates on static image benchmarks, and SNNs still underperform on dynamic video tasks compared to artificial neural networks (ANNs). In this work, we diagnose a fundamental pass-band mismatch: standard spiking dynamics behave as a temporal low-pass filter that emphasizes static content while attenuating motion-bearing bands, where task-relevant information concentrates in dynamic tasks. This phenomenon explains why SNNs can approach ANNs on static tasks yet fall behind on tasks that demand richer temporal understanding. To remedy this, we propose the Pass-Bands Optimizer (PBO), a plug-and-play module that optimizes the temporal pass-band toward task-relevant motion bands. PBO introduces only two learnable parameters and a lightweight consistency constraint that preserves semantics and boundaries, incurs negligible computational overhead, and requires no architectural changes. PBO deliberately suppresses static components that contribute little to discrimination, effectively high-pass filtering the stream so that spiking activity concentrates on motion-bearing content. On UCF101, PBO yields an improvement of over ten percentage points. On more complex multi-modal action recognition and weakly supervised video anomaly detection, PBO delivers consistent and significant gains, offering a new perspective for SNN-based video processing and understanding.
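
The low-pass versus high-pass distinction in the abstract can be made concrete with a small sketch. It does not reproduce PBO: the exponential moving average stands in for leaky temporal integration, and `alpha` and `gain` are hypothetical parameters chosen for illustration, not necessarily the two learnable parameters of the paper.

```python
def low_pass(signal, alpha):
    """Exponential moving average: a first-order temporal low-pass filter."""
    state = signal[0]
    out = []
    for x in signal:
        state = alpha * state + (1.0 - alpha) * x
        out.append(state)
    return out

def high_pass(signal, alpha, gain=1.0):
    """Subtract the slow (static) component so that only changes remain."""
    slow = low_pass(signal, alpha)
    return [gain * (x - s) for x, s in zip(signal, slow)]

static_pixel = [5.0] * 6            # unchanging content
moving_pixel = [0.0, 1.0] * 4       # rapidly alternating content
static_response = high_pass(static_pixel, alpha=0.9)
moving_response = high_pass(moving_pixel, alpha=0.9)
```

Under these assumptions the static input produces a response of essentially zero, while the alternating input passes through largely intact, which is the behaviour the abstract describes as concentrating activity on motion-bearing content.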


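[53] Visual Personalization Turing Test cs.CV

Rameen Abdal, James Burgess, Sergey Tulyakov, Kuan-Chieh Jackson Wang

TL;DR: This paper proposes the Visual Personalization Turing Test (VPTT), a new paradigm for evaluating contextual visual personalization based on perceptual indistinguishability rather than identity replication. To operationalize VPTT, the authors build the VPTT Framework, comprising a 10k-persona benchmark (VPTT-Bench), a visual retrieval-augmented generator (VPRAG), and the VPTT Score, a text-only metric calibrated against human and vision-language model (VLM) judgments. Experiments show that VPRAG achieves the best balance between personalization alignment and originality.

Details

Motivation: The goal is an evaluation paradigm for personalized generative AI that better matches human intuition: judging whether generated content is perceptually indistinguishable from what a given person might plausibly create or share, rather than whether it merely replicates identity features.

Result: Human, VLM, and VPTT Score evaluations are highly correlated, validating the VPTT Score as a reliable perceptual proxy. VPRAG achieves the best alignment-originality balance on VPTT-Bench.

Insight: The main novelty is the new evaluation paradigm based on perceptual indistinguishability (VPTT), together with its framework, benchmark, and automated metric (the VPTT Score), which provide a scalable and privacy-safe evaluation foundation for personalized generative AI. Viewed objectively, bringing the Turing test idea to visual personalization evaluation, with an emphasis on context and plausibility rather than exact replication, is an insightful direction.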

Abstract: We introduce the Visual Personalization Turing Test (VPTT), a new paradigm for evaluating contextual visual personalization based on perceptual indistinguishability, rather than identity replication. A model passes the VPTT if its output (image, video, 3D asset, etc.) is indistinguishable to a human or calibrated VLM judge from content a given person might plausibly create or share. To operationalize VPTT, we present the VPTT Framework, integrating a 10k-persona benchmark (VPTT-Bench), a visual retrieval-augmented generator (VPRAG), and the VPTT Score, a text-only metric calibrated against human and VLM judgments. We show high correlation across human, VLM, and VPTT evaluations, validating the VPTT Score as a reliable perceptual proxy. Experiments demonstrate that VPRAG achieves the best alignment-originality balance, offering a scalable and privacy-safe foundation for personalized generative AI.


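[54] PEAR: Pixel-aligned Expressive humAn mesh Recovery cs.CV | cs.AI

Jiahao Wu, Yunfei Liu, Lijian Lin, Ye Zhu, Lei Zhu

TL;DR: PEAR is a fast and robust framework for pixel-aligned expressive human mesh recovery that reconstructs detailed 3D human meshes from a single in-the-wild image. It uses a unified ViT architecture for real-time SMPLX parameter inference, introduces pixel-level supervision to refine geometric detail, and adopts a modular data annotation strategy to improve robustness.

Details

Motivation: Existing SMPLX-based methods suffer from slow inference, produce only coarse body poses, and exhibit misalignments or unnatural artifacts in fine-grained regions such as the face and hands, which makes them hard to apply to downstream tasks.

Result: Extensive experiments on multiple benchmark datasets show substantial improvements in pose estimation accuracy over previous SMPLX-based methods, while simultaneously inferring EHM-s (SMPLX and scaled-FLAME) parameters at over 100 FPS.

Insight: The novelties are a clean, unified ViT model for fast recovery of coarse geometry; pixel-level supervision that compensates for the fine-grained detail lost to the simplified architecture and improves reconstruction accuracy; and a modular data annotation strategy that enriches the training data and enhances robustness. Viewed objectively, combining an efficient architecture with fine-grained supervision strikes a good balance between speed and accuracy.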

Abstract: Reconstructing detailed 3D human meshes from a single in-the-wild image remains a fundamental challenge in computer vision. Existing SMPLX-based methods often suffer from slow inference, produce only coarse body poses, and exhibit misalignments or unnatural artifacts in fine-grained regions such as the face and hands. These issues make current approaches difficult to apply to downstream tasks. To address these challenges, we propose PEAR, a fast and robust framework for pixel-aligned expressive human mesh recovery. PEAR explicitly tackles three major limitations of existing methods: slow inference, inaccurate localization of fine-grained human pose details, and insufficient facial expression capture. Specifically, to enable real-time SMPLX parameter inference, we depart from prior designs that rely on high-resolution inputs or multi-branch architectures. Instead, we adopt a clean and unified ViT-based model capable of recovering coarse 3D human geometry. To compensate for the loss of fine-grained details caused by this simplified architecture, we introduce pixel-level supervision to optimize the geometry, significantly improving the reconstruction accuracy of fine-grained human details. To make this approach practical, we further propose a modular data annotation strategy that enriches the training data and enhances the robustness of the model. Overall, PEAR is a preprocessing-free framework that can simultaneously infer EHM-s (SMPLX and scaled-FLAME) parameters at over 100 FPS. Extensive experiments on multiple benchmark datasets demonstrate that our method achieves substantial improvements in pose estimation accuracy compared to previous SMPLX-based approaches. Project page: https://wujh2001.github.io/PEAR


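[55] Bi-MCQ: Reformulating Vision-Language Alignment for Negation Understanding cs.CV | cs.LG

Tae Hun Kim, Hyun Gyu Lee

TL;DR: This paper proposes Bi-MCQ, a bi-directional multiple-choice learning framework that addresses the weakness of existing vision-language models in understanding negated clinical statements in medical image analysis. It reformulates vision-language alignment as a conditional semantic comparison problem, jointly trains Image-to-Text and Text-to-Image multiple-choice tasks, and adds direction-specific cross-attention fusion modules, substantially improving the model's understanding of disease absence (negation).

Details

Motivation: Existing vision-language models are weak at understanding negated clinical statements in medical image analysis, because contrastive alignment objectives treat negation as a minor linguistic variation rather than a meaning-inverting operator. In multi-label settings, prompt-based InfoNCE fine-tuning further reinforces easy-positive alignments, limiting effective learning of disease absence.

Result: Experiments on ChestXray14, Open-I, CheXpert, and PadChest show that Bi-MCQ improves negation understanding by up to 0.47 AUC over the zero-shot performance of the state-of-the-art CARZero model, and achieves up to a 0.08 absolute gain on positive-negative combined (PNC) evaluation. Compared with InfoNCE-based fine-tuning, Bi-MCQ reduces the affirmative-negative AUC gap by an average of 0.12.

Insight: The core novelty is reformulating vision-language alignment as a conditional semantic comparison problem and implementing fine-tuning through the bi-directional multiple-choice framework (Bi-MCQ), which performs conditional semantic comparison instead of global similarity maximization. The direction-specific cross-attention fusion modules, which handle the asymmetric cues required by bi-directional reasoning and reduce alignment interference, are a key technical contribution to improved negation understanding.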

Abstract: Recent vision-language models (VLMs) achieve strong zero-shot performance via large-scale image-text pretraining and have been widely adopted in medical image analysis. However, existing VLMs remain notably weak at understanding negated clinical statements, largely due to contrastive alignment objectives that treat negation as a minor linguistic variation rather than a meaning-inverting operator. In multi-label settings, prompt-based InfoNCE fine-tuning further reinforces easy-positive image-prompt alignments, limiting effective learning of disease absence. To overcome these limitations, we reformulate vision-language alignment as a conditional semantic comparison problem, which is instantiated through a bi-directional multiple-choice learning framework (Bi-MCQ). By jointly training Image-to-Text and Text-to-Image MCQ tasks with affirmative, negative, and mixed prompts, our method implements fine-tuning as conditional semantic comparison instead of global similarity maximization. We further introduce direction-specific Cross-Attention fusion modules to address asymmetric cues required by bi-directional reasoning and reduce alignment interference. Experiments on ChestXray14, Open-I, CheXpert, and PadChest show that Bi-MCQ improves negation understanding by up to 0.47 AUC over the zero-shot performance of the state-of-the-art CARZero model, while achieving up to a 0.08 absolute gain on positive-negative combined (PNC) evaluation. Additionally, Bi-MCQ reduces the affirmative-negative AUC gap by an average of 0.12 compared to InfoNCE-based fine-tuning, demonstrating that objective reformulation can substantially enhance negation understanding in medical VLMs.


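[56] Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs cs.CV | cs.AI

Yanlong Chen, Amirhossein Habibian, Luca Benini, Yawei Li

TL;DR: This paper proposes GRACE, a framework that unifies knowledge distillation and quantization-aware training (QAT) under the Information Bottleneck principle to quantize vision-language models (VLMs) efficiently. It introduces confidence-gated decoupled distillation to filter unreliable supervision, relational centered kernel alignment to transfer visual token structures, and an adaptive controller based on Lagrangian relaxation to balance fidelity against capacity constraints. On the LLaVA and Qwen families, the INT4 models outperform FP16 baselines on multiple benchmarks and nearly match teacher performance, with 3× throughput and a 54% memory reduction.

Details

Motivation: VLMs are costly to deploy, post-training quantization often causes significant accuracy loss, and QAT for VLMs remains underexplored. A principled framework is needed to balance information capacity against performance fidelity.

Result: LLaVA-1.5-7B scores 70.1 (INT4) vs. 66.8 (FP16) on SQA, and Qwen2-VL-2B scores 76.9 vs. 72.6 on MMBench. The INT4 models consistently outperform FP16 baselines across benchmarks and nearly match teacher performance, while a real INT4 kernel delivers 3× throughput with a 54% memory reduction, significantly outperforming existing quantization methods.

Insight: The novelties are unifying knowledge distillation and QAT under the Information Bottleneck principle; a confidence gating mechanism that filters unreliable supervision signals; relational centered kernel alignment that transfers visual token structures; and an adaptive controller based on Lagrangian relaxation that balances the constraints dynamically. Viewed objectively, the framework gives VLM quantization systematic information-theoretic guidance and greatly improves deployment efficiency while preserving performance.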

Abstract: Vision-Language Models (VLMs) achieve strong multimodal performance but are costly to deploy, and post-training quantization often causes significant accuracy loss. Despite its potential, quantization-aware training (QAT) for VLMs remains underexplored. We propose GRACE, a framework unifying knowledge distillation and QAT under the Information Bottleneck principle: quantization constrains information capacity while distillation guides what to preserve within this budget. Treating the teacher as a proxy for task-relevant information, we introduce confidence-gated decoupled distillation to filter unreliable supervision, relational centered kernel alignment to transfer visual token structures, and an adaptive controller via Lagrangian relaxation to balance fidelity against capacity constraints. Across extensive benchmarks on LLaVA and Qwen families, our INT4 models consistently outperform FP16 baselines (e.g., LLaVA-1.5-7B: 70.1 vs. 66.8 on SQA; Qwen2-VL-2B: 76.9 vs. 72.6 on MMBench), nearly matching teacher performance. Using a real INT4 kernel, we achieve 3$\times$ throughput with 54% memory reduction. This principled framework significantly outperforms existing quantization methods, making GRACE a compelling solution for resource-constrained deployment.
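
The idea of filtering unreliable supervision by teacher confidence can be sketched as follows. This is only an illustration under stated assumptions: the gate is the teacher's top probability compared against a hypothetical `threshold`, the loss is a plain KL divergence, and the "decoupled" part of the paper's distillation is not modelled.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def gated_distill_loss(teacher_probs, student_probs, threshold=0.6):
    """Average KL(teacher || student) over confident teacher predictions only.

    Positions where the teacher's top probability is below `threshold`
    (an assumed gating rule) contribute nothing to the loss.
    """
    kept = [(t, s) for t, s in zip(teacher_probs, student_probs)
            if max(t) >= threshold]
    if not kept:
        return 0.0
    return sum(kl_divergence(t, s) for t, s in kept) / len(kept)

teacher = [[0.9, 0.1], [0.5, 0.5]]   # second prediction is uncertain
student = [[0.9, 0.1], [0.1, 0.9]]
loss = gated_distill_loss(teacher, student, threshold=0.6)
```

With the gate at 0.6 the uncertain second position is ignored, so the student is not pushed toward a teacher distribution that carries little task-relevant information.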


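[57] OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation cs.CV | cs.AI

Jin Li, Tao Chen, Shuai Jiang, Weijie Wang, Jingwen Luo

TL;DR: This paper presents OpenVTON-Bench, a large-scale, high-resolution benchmark for controllable virtual try-on (VTON) evaluation. It contains approximately 100K high-resolution image pairs, with semantic balance ensured by DINOv3-based hierarchical clustering and Gemini-powered dense captioning. The paper also proposes a multi-modal evaluation protocol that measures VTON quality along five interpretable dimensions, and its results agree strongly with human judgments.

Details

Motivation: Although the visual fidelity of VTON systems has improved, reliable evaluation remains a bottleneck. Traditional metrics struggle to quantify fine-grained texture details and semantic consistency, and existing datasets fall short of commercial standards in scale and diversity.

Result: The proposed protocol correlates strongly with human judgments (Kendall's τ of 0.833, versus 0.611 for SSIM), establishing a robust benchmark for VTON evaluation.

Insight: The novelty lies in building a large-scale, high-resolution, semantically balanced benchmark and in proposing a multi-modal evaluation protocol that combines VLM-based semantic reasoning with a new Multi-Scale Representation Metric based on SAM3 segmentation and morphological erosion. The metric separates boundary alignment errors from internal texture artifacts, enabling more reliable and interpretable evaluation.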

Abstract: Recent advances in diffusion models have significantly elevated the visual fidelity of Virtual Try-On (VTON) systems, yet reliable evaluation remains a persistent bottleneck. Traditional metrics struggle to quantify fine-grained texture details and semantic consistency, while existing datasets fail to meet commercial standards in scale and diversity. We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs (up to $1536 \times 1536$). The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning, ensuring a uniform distribution across 20 fine-grained garment categories. To support reliable evaluation, we propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism. The protocol integrates VLM-based semantic reasoning with a novel Multi-Scale Representation Metric based on SAM3 segmentation and morphological erosion, enabling the separation of boundary alignment errors from internal texture artifacts. Experimental results show strong agreement with human judgments (Kendall’s $τ$ of 0.833 vs. 0.611 for SSIM), establishing a robust benchmark for VTON evaluation.
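
Agreement with human judgments is reported here as Kendall's τ. A minimal sketch of the tau-a variant follows; which variant the paper uses is not stated, so treating ties as neither concordant nor discordant is an assumption, and the example scores are made up for illustration.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

human_rank = [1, 2, 3, 4]     # hypothetical human ordering of four outputs
metric_rank = [1, 3, 2, 4]    # hypothetical metric ordering with one swap
tau = kendall_tau(human_rank, metric_rank)
```

A value of 1 means the metric orders every pair the way humans do, and each swapped pair lowers the score, so a higher τ indicates a metric that tracks human preference more closely.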


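[58] GaussianOcc3D: A Gaussian-Based Adaptive Multi-modal 3D Occupancy Prediction cs.CV

A. Enes Doruk, Hasan F. Ates

TL;DR: This paper proposes GaussianOcc3D, a multi-modal 3D semantic occupancy prediction framework built on a 3D Gaussian representation for autonomous driving scene understanding. Four core modules (LDFA, EBFS, ACLF, and a Gauss-Mamba Head) fuse camera and LiDAR data, addressing modality heterogeneity, spatial misalignment, and representation efficiency, and the framework achieves state-of-the-art performance on several benchmarks.

Details

Motivation: 3D semantic occupancy prediction for autonomous driving faces two challenges: single-modality methods must trade off camera semantics against LiDAR geometry, and existing multi-modal frameworks are limited by modality heterogeneity, spatial misalignment, and representations that are either computationally heavy (voxels) or lossy (BEV).

Result: The framework achieves state-of-the-art performance on the Occ3D, SurroundOcc, and SemanticKITTI benchmarks, with mIoU scores of 49.4%, 28.9%, and 25.2% respectively, and shows superior robustness in challenging rainy and nighttime conditions.

Insight: The main novelties are: 1) a memory-efficient, continuous 3D Gaussian representation as the basis for multi-modal fusion; 2) modules for handling multi-modal data, including depth-wise deformable sampling, entropy-based feature smoothing, and uncertainty-aware adaptive fusion; 3) a Gauss-Mamba Head based on Selective State Space Models (SSMs) that captures global context with linear complexity, improving both efficiency and performance.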

Abstract: 3D semantic occupancy prediction is a pivotal task in autonomous driving, providing a dense and fine-grained understanding of the surrounding environment, yet single-modality methods face trade-offs between camera semantics and LiDAR geometry. Existing multi-modal frameworks often struggle with modality heterogeneity, spatial misalignment, and the representation crisis, where voxels are computationally heavy and BEV alternatives are lossy. We present GaussianOcc3D, a multi-modal framework bridging camera and LiDAR through a memory-efficient, continuous 3D Gaussian representation. We introduce four modules: (1) LiDAR Depth Feature Aggregation (LDFA), using depth-wise deformable sampling to lift sparse signals onto Gaussian primitives; (2) Entropy-Based Feature Smoothing (EBFS) to mitigate domain noise; (3) Adaptive Camera-LiDAR Fusion (ACLF) with uncertainty-aware reweighting for sensor reliability; and (4) a Gauss-Mamba Head leveraging Selective State Space Models for global context with linear complexity. Evaluations on Occ3D, SurroundOcc, and SemanticKITTI benchmarks demonstrate state-of-the-art performance, achieving mIoU scores of 49.4%, 28.9%, and 25.2% respectively. GaussianOcc3D exhibits superior robustness across challenging rainy and nighttime conditions.


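[59] ImgCoT: Compressing Long Chain of Thought into Compact Visual Tokens for Efficient Reasoning of Large Language Model cs.CV | cs.AI

Xiaoshu Chen, Sihang Zhou, Ke Liang, Taichun Zhou, Xinwang Liu

TL;DR: This paper proposes ImgCoT, which compresses long chains of thought (CoT) into compact visual tokens to make large language model reasoning more efficient. It renders textual CoT into images as the reconstruction target, replacing linguistic bias with spatial inductive bias so that latent tokens better capture global reasoning structure. A hybrid variant additionally supplies a few key textual steps to retain fine-grained reasoning details.

Details

Motivation: Existing methods compress textual CoT into latent tokens with autoencoders, but this forces the tokens to preserve surface-level linguistic features and introduces a strong linguistic inductive bias that prioritizes linguistic form over reasoning structure, limiting logical abstraction.

Result: Extensive experiments across multiple datasets and LLMs demonstrate the effectiveness of both ImgCoT versions.

Insight: The core novelty is changing the reconstruction target from textual CoT to visual CoT, using a spatial layout bias to encode abstract reasoning structure. The hybrid reasoning design reduces the token count while retaining both global structure and detail.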

Abstract: Compressing long chains of thought (CoT) into compact latent tokens is crucial for efficient reasoning with large language models (LLMs). Recent studies employ autoencoders to achieve this by reconstructing textual CoT from latent tokens, thus encoding CoT semantics. However, treating textual CoT as the reconstruction target forces latent tokens to preserve surface-level linguistic features (e.g., word choice and syntax), introducing a strong linguistic inductive bias that prioritizes linguistic form over reasoning structure and limits logical abstraction. Thus, we propose ImgCoT, which changes the reconstruction target from textual CoT to the visual CoT obtained by rendering CoT into images. This substitutes linguistic bias with spatial inductive bias, i.e., a tendency to model spatial layouts of the reasoning steps in visual CoT, enabling latent tokens to better capture global reasoning structure. Moreover, although visual latent tokens encode abstract reasoning structure, they may blur reasoning details. We thus propose loose ImgCoT, a hybrid reasoning scheme that augments visual latent tokens with a few key textual reasoning steps, selected based on low token log-likelihood. This design allows LLMs to retain both global reasoning structure and fine-grained reasoning details with fewer tokens than the complete CoT. Extensive experiments across multiple datasets and LLMs demonstrate the effectiveness of the two versions of ImgCoT.
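
Selecting key textual steps by low token log-likelihood can be sketched as below. The abstract does not say how per-token scores are aggregated per step, so averaging over a step's tokens, the function name, and the parameter `k` are assumptions made for illustration.

```python
def select_key_steps(steps, token_logprobs, k=2):
    """Keep the k steps with the lowest mean token log-likelihood.

    steps: list of reasoning-step strings.
    token_logprobs: one list of per-token log-probabilities per step.
    The kept steps are returned in their original order.
    """
    mean_lp = [sum(lps) / len(lps) for lps in token_logprobs]
    ranked = sorted(range(len(steps)), key=lambda i: mean_lp[i])
    keep = sorted(ranked[:k])
    return [steps[i] for i in keep]

steps = ["restate the problem", "apply the key lemma", "simplify"]
logprobs = [[-0.1, -0.2], [-2.0, -3.0], [-1.0]]
key_steps = select_key_steps(steps, logprobs, k=2)
```

The intuition is that steps the model finds hard to predict carry detail the compressed tokens are likely to blur, so those are the ones worth keeping as text.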


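[60] Lingua-SafetyBench: A Benchmark for Safety Evaluation of Multilingual Vision-Language Models cs.CV

Enyi Shi, Pengyang Shao, Yanxin Zhang, Chenhang Cui, Jiayi Lyu

TL;DR: This paper introduces Lingua-SafetyBench, a benchmark for evaluating the safety of multilingual vision-language large models (VLLMs). It contains more than 100K harmful image-text pairs across 10 languages, explicitly partitioned into image-dominant and text-dominant subsets to separate risk sources. Evaluating 11 open-source VLLMs reveals an asymmetry: image-dominant risks yield higher attack success rates in high-resource languages, while text-dominant risks are more severe in non-high-resource languages. A controlled study on the Qwen series shows that scaling and version upgrades reduce the overall attack success rate but mainly benefit high-resource languages, widening the safety gap with non-high-resource languages under text-dominant risks.

Details

Motivation: VLLM safety under joint multilingual and multimodal inputs is underexplored. Existing benchmarks are either multilingual but text-only or multimodal but monolingual, and recent multilingual multimodal red-teaming relies mainly on typography-style visuals and lacks semantically grounded image-text pairs, so it cannot adequately cover realistic cross-modal interaction risks.

Result: Evaluating 11 open-source VLLMs on Lingua-SafetyBench reveals language and modality asymmetries in safety risk. Controlled experiments on the Qwen series show that scaling and version upgrades reduce the attack success rate overall, yet widen the safety gap between high-resource and non-high-resource languages under text-dominant risks.

Insight: The novelty lies in building the first large-scale, multilingual, multimodal, semantically grounded safety benchmark for VLLMs and in explicitly partitioning image-dominant and text-dominant risks to disentangle risk sources. The core insight is that VLLM safety alignment cannot rely on scaling alone: it must account for language and modality differences, or it may deepen safety inequality across language groups.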

Abstract: Robust safety of vision-language large models (VLLMs) under joint multilingual and multimodal inputs remains underexplored. Existing benchmarks are typically multilingual but text-only, or multimodal but monolingual. Recent multilingual multimodal red-teaming efforts render harmful prompts into images, yet rely heavily on typography-style visuals and lack semantically grounded image-text pairs, limiting coverage of realistic cross-modal interactions. We introduce Lingua-SafetyBench, a benchmark of 100,440 harmful image-text pairs across 10 languages, explicitly partitioned into image-dominant and text-dominant subsets to disentangle risk sources. Evaluating 11 open-source VLLMs reveals a consistent asymmetry: image-dominant risks yield a higher Attack Success Rate (ASR) in high-resource languages (HRLs), while text-dominant risks are more severe in non-high-resource languages (Non-HRLs). A controlled study on the Qwen series shows that scaling and version upgrades reduce ASR overall but disproportionately benefit HRLs, widening the gap between HRLs and Non-HRLs under text-dominant risks. This underscores the necessity of language- and modality-aware safety alignment beyond mere scaling. To facilitate reproducibility and future research, we will publicly release our benchmark, model checkpoints, and source code. The code and dataset will be available at https://github.com/zsxr15/Lingua-SafetyBench. Warning: this paper contains examples with unsafe content.


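[61] StreamSense: Streaming Social Task Detection with Selective Vision-Language Model Routing cs.CV

Han Wang, Deyi Ji, Lanyun Zhu, Jiebo Luo, Roy Ka-Wei Lee

TL;DR: StreamSense is a system for social task detection on live streaming platforms that couples a lightweight streaming encoder with selective routing to a vision-language model (VLM) expert. The lightweight encoder handles most timestamps, hard or ambiguous cases are escalated to the VLM, and decisions are deferred when context is insufficient, improving detection efficiency.

Details

Motivation: Live streaming platforms need real-time monitoring of and reaction to social signals using asynchronous evidence from video, text, and audio. The aim is to balance detection accuracy against computational latency.

Result: Evaluated on multiple social streaming detection tasks (e.g., sentiment classification and hate content moderation), StreamSense achieves higher accuracy than VLM-only streaming detection while invoking the VLM only occasionally, reducing average latency and compute cost.

Insight: The novelties include selective escalation and deferral as effective primitives for streaming social tasks, a cross-modal contrastive term in encoder training that aligns visual/audio cues with textual signals, and an IoU-weighted loss that mitigates label interference across segment boundaries.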

Abstract: Live streaming platforms require real-time monitoring and reaction to social signals, utilizing partial and asynchronous evidence from video, text, and audio. We propose StreamSense, a streaming detector that couples a lightweight streaming encoder with selective routing to a Vision-Language Model (VLM) expert. StreamSense handles most timestamps with the lightweight streaming encoder, escalates hard/ambiguous cases to the VLM, and defers decisions when context is insufficient. The encoder is trained using (i) a cross-modal contrastive term to align visual/audio cues with textual signals, and (ii) an IoU-weighted loss that down-weights poorly overlapping target segments, mitigating label interference across segment boundaries. We evaluate StreamSense on multiple social streaming detection tasks (e.g., sentiment classification and hate content moderation), and the results show that StreamSense achieves higher accuracy than VLM-only streaming while only occasionally invoking the VLM, thereby reducing average latency and compute. Our results indicate that selective escalation and deferral are effective primitives for understanding streaming social tasks. Code is publicly available on GitHub.
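
The accept, escalate, and defer behaviour described above can be sketched as a simple decision rule, together with the temporal IoU that an IoU-weighted loss would use as a weight. The thresholds, the confidence and context inputs, and both function names are hypothetical; the abstract does not specify how StreamSense makes these decisions.

```python
def route(encoder_confidence, context_seconds,
          escalate_below=0.7, min_context=2.0):
    """Return 'defer', 'escalate', or 'accept' for one timestamp.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    if context_seconds < min_context:
        return "defer"       # too little context to decide yet
    if encoder_confidence < escalate_below:
        return "escalate"    # hard or ambiguous case: hand over to the VLM
    return "accept"          # keep the lightweight encoder's prediction

def interval_iou(a, b):
    """Temporal IoU of two (start, end) segments, usable as a loss weight."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

decisions = [route(0.9, 5.0), route(0.4, 5.0), route(0.9, 0.5)]
weight = interval_iou((0.0, 10.0), (5.0, 15.0))
```

Multiplying a segment's loss by such an IoU down-weights windows that barely overlap the labelled segment, which is one plausible reading of how the loss reduces label interference at segment boundaries.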


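[62] Procedural Knowledge Extraction from Industrial Troubleshooting Guides Using Vision Language Models cs.CV | cs.AI

Guillermo Gil de Avalle, Laura Maruster, Christos Emmanouilidis

TL;DR: This paper evaluates two vision language models (VLMs) on extracting structured procedural knowledge from the flowchart-like diagrams in industrial troubleshooting guides. It compares two strategies, standard instruction-guided prompting and augmented prompting that cues troubleshooting layout patterns, and reveals model-specific trade-offs between layout sensitivity and semantic robustness.

Details

Motivation: Industrial troubleshooting guides encode diagnostic procedures as flowcharts in which spatial layout and technical language jointly convey information. Integrating this knowledge into operator support systems that help shop-floor personnel diagnose and resolve equipment issues requires extracting and structuring it for machine interpretation, and manual extraction is labor-intensive and error-prone.

Result: The evaluation shows model-specific trade-offs between layout sensitivity and semantic robustness when the two VLMs extract structured knowledge, which informs practical deployment decisions.

Insight: The novelty lies in exploring and comparing VLMs on the specific task of knowledge extraction from industrial flowcharts, in particular through an augmented prompting strategy that exploits layout patterns with the aim of improving performance, and in revealing the key trade-off between model capability and task characteristics (spatial layout versus specialized semantics) in joint vision-language understanding.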

Abstract: Industrial troubleshooting guides encode diagnostic procedures in flowchart-like diagrams where spatial layout and technical language jointly convey meaning. To integrate this knowledge into operator support systems, which assist shop-floor personnel in diagnosing and resolving equipment issues, the information must first be extracted and structured for machine interpretation. However, when performed manually, this extraction is labor-intensive and error-prone. Vision Language Models offer potential to automate this process by jointly interpreting visual and textual meaning, yet their performance on such guides remains underexplored. This paper evaluates two VLMs on extracting structured knowledge, comparing two prompting strategies: standard instruction-guided versus an augmented approach that cues troubleshooting layout patterns. Results reveal model-specific trade-offs between layout sensitivity and semantic robustness, informing practical deployment decisions.


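[63] FarmMind: Reasoning-Query-Driven Dynamic Segmentation for Farmland Remote Sensing Images cs.CV

Haiyang Wu, Weiliang Mu, Jipeng Zhang, Zhong Dandan, Zhuofei Du

TL;DR: This paper proposes FarmMind, a dynamic segmentation framework that handles ambiguity and visual uncertainty in farmland remote sensing image (FRSI) segmentation. Its reasoning-query mechanism simulates how human experts think when segmentation is ambiguous: it dynamically queries external auxiliary images on demand (such as higher-resolution, larger-scale, or temporally adjacent data) to compensate for the limited information in a single input image, moving beyond the traditional static segmentation paradigm.

Details

Motivation: Existing FRSI segmentation methods generally follow a static segmentation paradigm that relies only on the limited information within a single input patch, which limits their reasoning in complex scenes with ambiguity and visual uncertainty. Human experts interpreting such ambiguous images tend to actively query auxiliary images for cross-verification and more comprehensive reasoning. This paper aims to simulate that process to improve segmentation performance.

Result: Extensive experiments show that FarmMind achieves better segmentation performance and stronger generalization than existing methods.

Insight: The core novelty is a reasoning-query-driven dynamic segmentation framework that turns segmentation from static, passive single-image analysis into dynamic, active querying and fusion of multi-source information. The mechanism does more than query auxiliary images: it first reasons about the root cause of the segmentation ambiguity and then decides which type of auxiliary image to query, mimicking the decision process of human experts and offering a new way to handle uncertainty in remote sensing image segmentation.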

Abstract: Existing methods for farmland remote sensing image (FRSI) segmentation generally follow a static segmentation paradigm, where analysis relies solely on the limited information contained within a single input patch. Consequently, their reasoning capability is limited when dealing with complex scenes characterized by ambiguity and visual uncertainty. In contrast, human experts, when interpreting remote sensing images in such ambiguous cases, tend to actively query auxiliary images (such as higher-resolution, larger-scale, or temporally adjacent data) to conduct cross-verification and achieve more comprehensive reasoning. Inspired by this, we propose a reasoning-query-driven dynamic segmentation framework for FRSIs, named FarmMind. This framework breaks through the limitations of the static segmentation paradigm by introducing a reasoning-query mechanism, which dynamically queries external auxiliary images on demand to compensate for the insufficient information in a single input image. Unlike direct queries, this mechanism simulates the thinking process of human experts when faced with segmentation ambiguity: it first analyzes the root causes of segmentation ambiguities through reasoning, and then determines what type of auxiliary image needs to be queried based on this analysis. Extensive experiments demonstrate that FarmMind achieves superior segmentation performance and stronger generalization ability compared with existing methods. The source code and dataset used in this work are publicly available at: https://github.com/WithoutOcean/FarmMind.


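[64] A Comparative Evaluation of Large Vision-Language Models for 2D Object Detection under SOTIF Conditions cs.CV | cs.RO

Ji Zhou, Yilin Ding, Yongqi Zhao, Jiachen Xu, Arno Eichberger

TL;DR: This paper systematically evaluates ten representative large vision-language models (LVLMs) on 2D object detection under SOTIF conditions, using the specially curated PeSOTIF dataset, and compares them with a classical YOLO-based detector. In complex natural scenarios, the best LVLMs (e.g., Gemini 3, Doubao) exceed the YOLO baseline's recall by over 25% and are more robust to visual degradation, while the baseline retains an advantage in geometric precision for synthetic perturbations.

Details

Motivation: The paper addresses the reliability of environmental perception in automated driving, particularly under SOTIF conditions: conventional detectors degrade under adverse conditions, while the quantitative effectiveness of LVLMs' semantic reasoning for safety-critical 2D object detection is underexplored.

Result: On the PeSOTIF benchmark, top LVLMs surpass the YOLO baseline in recall by over 25% in complex natural scenarios and show superior robustness to visual degradation, but the baseline retains an advantage in geometric precision for synthetic perturbations.

Insight: The novelty is the first systematic quantitative evaluation of LVLMs for 2D object detection under SOTIF conditions. It reveals the complementary strengths of semantic reasoning and geometric regression, and supports using LVLMs as high-level safety validators in SOTIF-oriented automated driving systems.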

Abstract: Reliable environmental perception remains one of the main obstacles for safe operation of automated vehicles. Safety of the Intended Functionality (SOTIF) concerns safety risks from perception insufficiencies, particularly under adverse conditions where conventional detectors often falter. While Large Vision-Language Models (LVLMs) demonstrate promising semantic reasoning, their quantitative effectiveness for safety-critical 2D object detection is underexplored. This paper presents a systematic evaluation of ten representative LVLMs using the PeSOTIF dataset, a benchmark specifically curated for long-tail traffic scenarios and environmental degradations. Performance is quantitatively compared against the classical perception approach, a YOLO-based detector. Experimental results reveal a critical trade-off: top-performing LVLMs (e.g., Gemini 3, Doubao) surpass the YOLO baseline in recall by over 25% in complex natural scenarios, exhibiting superior robustness to visual degradation. Conversely, the baseline retains an advantage in geometric precision for synthetic perturbations. These findings highlight the complementary strengths of semantic reasoning versus geometric regression, supporting the use of LVLMs as high-level safety validators in SOTIF-oriented automated driving systems.
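
Recall for 2D detection is usually computed by matching predictions to ground-truth boxes at an IoU threshold. The sketch below shows one common way to do this; the greedy one-to-one matching and the 0.5 threshold are assumptions, since the abstract does not describe the paper's matching protocol.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(ground_truth, predictions, threshold=0.5):
    """Fraction of ground-truth boxes matched by an unused prediction."""
    used = set()
    hits = 0
    for gt in ground_truth:
        best_iou, best_idx = 0.0, None
        for idx, pred in enumerate(predictions):
            if idx in used:
                continue
            iou = box_iou(gt, pred)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx is not None and best_iou >= threshold:
            used.add(best_idx)
            hits += 1
    return hits / len(ground_truth) if ground_truth else 0.0

gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred_boxes = [(1, 1, 10, 10)]
recall = recall_at_iou(gt_boxes, pred_boxes)
```

Raising the threshold rewards tighter boxes, which is one way the trade-off between finding objects at all (recall) and localizing them precisely (geometric precision) shows up in such an evaluation.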


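[65] How Much of a Model Do We Need? Redundancy and Slimmability in Remote Sensing Foundation Models cs.CV

Leonard Hackel, Tom Burgert, Begüm Demir

TL;DR: This paper studies redundancy in remote sensing (RS) foundation models and finds that, compared with computer vision (CV), RS foundation models enter an overparameterized regime at smaller scales, where additional parameters mainly produce redundant representations rather than new abstractions. Post-hoc slimming experiments show that RS models retain high accuracy at very low compute budgets, and the authors present learned slimmable training to improve model efficiency.

Details

Motivation: RS foundation models directly adopt the scaling assumptions of CV, but whether those assumptions hold in RS has not been adequately examined. The authors hypothesize that RS models become overparameterized at smaller scales, yielding redundant representations rather than qualitative gains.

Result: Post-hoc slimming of six state-of-the-art RS foundation models on four downstream classification tasks shows that, at a budget of only 1% of the FLOPs, RS models retain over 71% relative accuracy, whereas an MAE trained on ImageNet drops below 10% accuracy, a sevenfold gap. Learned slimmable training further improves MoCo- and MAE-based models.

Insight: The novelty lies in the first systematic verification that RS foundation models have higher parameter redundancy than CV models, and in establishing post-hoc slimming as both a practical deployment strategy and a diagnostic tool, which challenges the prevailing model scaling paradigm in RS. Viewed objectively, the study offers new ideas and empirical evidence for lightweight deployment in resource-constrained environments.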

Abstract: Large-scale foundation models (FMs) in remote sensing (RS) are developed based on the paradigms established in computer vision (CV) and have shown promise for various Earth observation applications. However, the direct transfer of scaling assumptions from CV to RS has not been adequately examined. We hypothesize that RS FMs enter an overparameterized regime at substantially smaller scales than their CV counterparts, where increasing parameter count primarily induces redundant representations rather than qualitatively new abstractions. To test this hypothesis, we use post-hoc slimming, where we uniformly reduce the width of the pretrained encoder, as a tool to measure representational redundancy across six state-of-the-art RS FMs on four downstream classification tasks. Our findings reveal a significant contrast with those in the CV domain: while a post-hoc slimmed masked autoencoder (MAE) trained on ImageNet retains less than 10% accuracy at 1% FLOPs, RS FMs maintain over 71% relative accuracy at the same budget. This sevenfold difference provides strong empirical support for our hypothesis. We further demonstrate that learned slimmable training can improve both Momentum Contrast (MoCo)- and MAE-based models. In addition, through the explained variance ratio and the feature correlation analysis, we provide mechanistic explanations showing that RS FMs distribute task-relevant information with high redundancy. Our findings establish post-hoc slimmability as both a practical deployment strategy for resource-constrained environments and a diagnostic tool that challenges the prevailing scaling paradigm in RS. Upon acceptance, we will publish all code.
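
Uniform width reduction of a pretrained network can be illustrated on a tiny two-layer MLP. This is a sketch only: keeping the first `k` hidden units is an assumed selection rule, the weights are plain nested lists, and the paper's encoders are far larger than this toy layer.

```python
import math

def slim_mlp(w1, b1, w2, keep_ratio):
    """Keep the first k hidden units of a two-layer MLP.

    w1: hidden x in weights, b1: hidden biases, w2: out x hidden weights.
    Slicing the rows of w1 and the matching columns of w2 keeps the
    two layers shape-compatible.
    """
    k = max(1, math.ceil(keep_ratio * len(w1)))
    return w1[:k], b1[:k], [row[:k] for row in w2]

def relative_accuracy(slim_accuracy, full_accuracy):
    """Accuracy of the slimmed model as a fraction of the full model's."""
    return slim_accuracy / full_accuracy

w1 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]   # 4 hidden units
b1 = [0.0, 0.1, 0.2, 0.3]
w2 = [[1.0, 2.0, 3.0, 4.0]] * 3                          # 3 outputs
slim_w1, slim_b1, slim_w2 = slim_mlp(w1, b1, w2, keep_ratio=0.5)
```

For a dense layer whose input and output widths are both scaled by a ratio r, the multiply-accumulate count scales roughly with r squared, so a small FLOPs budget corresponds to a much less extreme width ratio. How this maps onto the 1% FLOPs setting of the paper depends on architecture details that the abstract does not give.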


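[66] Inference-Time Dynamic Modality Selection for Incomplete Multimodal Classification cs.CV

Siyi Du, Xinzhe Luo, Declan P. O’Regan, Chen Qin

TL;DR: This paper proposes DyMo, an inference-time dynamic modality selection framework for incomplete multimodal classification, which addresses the discarding-imputation dilemma that existing methods face when modalities are missing. DyMo adaptively identifies and integrates reliable recovered modalities to maximize task-relevant information, and comes with a network architecture and training strategy compatible with arbitrary modality combinations.

Details

Motivation: Practical deployment of multimodal deep learning is hindered by missing modalities. Existing methods either discard missing modalities and lose information, or impute them and introduce noise, which creates the discarding-imputation dilemma.

Result: Extensive experiments on diverse natural and medical image datasets show that DyMo significantly outperforms state-of-the-art incomplete/dynamic multimodal deep learning methods across various missing-data scenarios.

Insight: The novelty is an inference-time dynamic modality selection algorithm grounded in information theory, which uses the task loss as a tractable proxy for maximizing task-relevant information and so moves beyond the conventional discard-or-impute paradigm. A flexible network architecture supports arbitrary modality combinations.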

Abstract: Multimodal deep learning (MDL) has achieved remarkable success across various domains, yet its practical deployment is often hindered by incomplete multimodal data. Existing incomplete MDL methods either discard missing modalities, risking the loss of valuable task-relevant information, or recover them, potentially introducing irrelevant noise, leading to the discarding-imputation dilemma. To address this dilemma, in this paper, we propose DyMo, a new inference-time dynamic modality selection framework that adaptively identifies and integrates reliable recovered modalities, fully exploring task-relevant information beyond the conventional discard-or-impute paradigm. Central to DyMo is a novel selection algorithm that maximizes multimodal task-relevant information for each test sample. Since direct estimation of such information at test time is intractable due to the unknown data distribution, we theoretically establish a connection between information and the task loss, which we compute at inference time as a tractable proxy. Building on this, a novel principled reward function is proposed to guide modality selection. In addition, we design a flexible multimodal network architecture compatible with arbitrary modality combinations, alongside a tailored training strategy for robust representation learning. Extensive experiments on diverse natural and medical image datasets show that DyMo significantly outperforms state-of-the-art incomplete/dynamic MDL methods across various missing-data scenarios. Our code is available at https://github.com//siyi-wind/DyMo.


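[67] When Anomalies Depend on Context: Learning Conditional Compatibility for Anomaly Detection cs.CV | cs.LG

Shashank Mishra, Didier Stricker, Jason Rambach

TL;DR: This paper proposes a conditional compatibility learning framework for contextual anomaly detection in the visual domain, where abnormality depends on the compatibility between subject and context rather than on intrinsic properties. The authors build the CAAD-3K benchmark to study such anomalies systematically and use vision-language representations to model subject-context relationships under limited supervision. The method substantially outperforms existing approaches on CAAD-3K and reaches state-of-the-art performance on MVTec-AD and VisA, showing that modeling context dependence complements traditional structural anomaly detection.

Details

Motivation: Traditional anomaly detection assumes that abnormality is an intrinsic property of the observation, independent of context, but many real-world anomalies depend on latent contextual factors (the same action may be normal in one scene and anomalous in another). The paper revisits contextual anomaly detection and operationalizes it in the visual domain, where anomaly labels depend on subject-context compatibility.

Result: On the CAAD-3K benchmark the proposed method substantially outperforms existing approaches, and on MVTec-AD and VisA it achieves state-of-the-art performance.

Insight: The novelties are formalizing contextual anomaly detection as a conditional compatibility problem, introducing the CAAD-3K benchmark to isolate contextual anomalies, and using vision-language representations to learn subject-context relationships under limited supervision. Viewed objectively, modeling context dependence extends the scope of traditional anomaly detection and has practical value.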

Abstract: Anomaly detection is often formulated under the assumption that abnormality is an intrinsic property of an observation, independent of context. This assumption breaks down in many real-world settings, where the same object or action may be normal or anomalous depending on latent contextual factors (e.g., running on a track versus on a highway). We revisit contextual anomaly detection, classically defined as context-dependent abnormality, and operationalize it in the visual domain, where anomaly labels depend on subject–context compatibility rather than intrinsic appearance. To enable systematic study of this setting, we introduce CAAD-3K, a benchmark that isolates contextual anomalies by controlling subject identity while varying context. We further propose a conditional compatibility learning framework that leverages vision–language representations to model subject–context relationships under limited supervision. Our method substantially outperforms existing approaches on CAAD-3K and achieves state-of-the-art performance on MVTec-AD and VisA, demonstrating that modeling context dependence complements traditional structural anomaly detection. Our code and dataset will be publicly released.


[68] DINO-SAE: DINO Spherical Autoencoder for High-Fidelity Image Reconstruction and Generation cs.CV | cs.AI | cs.LG

Hun Chang, Byunghee Cha, Jong Chul Ye

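TL;DR: This paper proposes DINO-SAE, a spherical autoencoder framework that combines DINO semantic representations with pixel-level reconstruction. It improves reconstruction fidelity through a Hierarchical Convolutional Patch Embedding module and a Cosine Similarity Alignment objective, and performs generation on the spherical latent manifold with a Diffusion Transformer trained via Riemannian Flow Matching.

Details

Motivation: Existing generative autoencoders built on pretrained vision foundation models (such as DINO) often have limited reconstruction fidelity because high-frequency details are lost. The paper aims to bridge the gap between semantic representation and pixel-level reconstruction.

Result: Experiments on ImageNet-1K show state-of-the-art reconstruction quality, with 0.37 rFID and 26.2 dB PSNR, while maintaining strong semantic alignment with the pretrained VFM. The Riemannian Flow Matching-based DiT reaches a gFID of 3.47 at 80 epochs, indicating efficient convergence.

Insight: The key insight is that semantic information in contrastive representations is encoded mainly in the direction of feature vectors, whereas forcing strict magnitude matching hinders detail preservation. The method therefore strengthens local structure with hierarchical convolutional patch embedding and uses cosine similarity alignment to enforce semantic consistency while leaving feature magnitudes free to retain detail. In addition, building on the observation that SSL foundation model representations intrinsically lie on a hypersphere, it trains the diffusion transformer directly on the spherical latent manifold, a novel perspective on generative modeling.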

Abstract: Recent studies have explored using pretrained Vision Foundation Models (VFMs) such as DINO for generative autoencoders, showing strong generative performance. Unfortunately, existing approaches often suffer from limited reconstruction fidelity due to the loss of high-frequency details. In this work, we present the DINO Spherical Autoencoder (DINO-SAE), a framework that bridges semantic representation and pixel-level reconstruction. Our key insight is that semantic information in contrastive representations is primarily encoded in the direction of feature vectors, while forcing strict magnitude matching can hinder the encoder from preserving fine-grained details. To address this, we introduce a Hierarchical Convolutional Patch Embedding module that enhances local structure and texture preservation, and a Cosine Similarity Alignment objective that enforces semantic consistency while allowing flexible feature magnitudes for detail retention. Furthermore, leveraging the observation that SSL-based foundation model representations intrinsically lie on a hypersphere, we employ Riemannian Flow Matching to train a Diffusion Transformer (DiT) directly on this spherical latent manifold. Experiments on ImageNet-1K demonstrate that our approach achieves state-of-the-art reconstruction quality, reaching 0.37 rFID and 26.2 dB PSNR, while maintaining strong semantic alignment to the pretrained VFM. Notably, our Riemannian Flow Matching-based DiT exhibits efficient convergence, achieving a gFID of 3.47 at 80 epochs.
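
The direction-versus-magnitude point in this abstract can be illustrated with a small sketch. This is not the paper's implementation: the function names `cosine_alignment_loss` and `mse_loss` and the toy vectors are assumptions made for illustration. It only shows that a cosine-based loss ignores feature magnitude, whereas a mean-squared-error loss penalizes it.

```python
import math

def cosine_alignment_loss(student, teacher):
    """Return 1 - cos(student, teacher) for two equal-length feature vectors.

    Hypothetical helper: the loss depends only on direction, not magnitude.
    """
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    return 1.0 - dot / (norm_s * norm_t)

def mse_loss(student, teacher):
    """Mean squared error, which penalizes magnitude differences as well."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)

teacher = [1.0, 2.0, 2.0]
scaled = [3.0 * x for x in teacher]  # same direction, three times the magnitude

print(cosine_alignment_loss(scaled, teacher))  # close to zero: direction matches
print(mse_loss(scaled, teacher))               # clearly positive: magnitude differs
```

Under this kind of objective the encoder is free to change feature norms, which is the flexibility the abstract credits with helping detail retention.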


[69] Multi-Cue Anomaly Detection and Localization under Data Contamination cs.CV

Anindya Sundar Das, Monowar Bhuyan

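TL;DR: This paper proposes a robust visual anomaly detection framework for real-world industrial settings, where training data are often contaminated with anomalous samples and anomaly annotations are scarce. The framework integrates limited anomaly supervision into the adaptive deviation learning paradigm and combines three complementary scores (statistical deviation, predictive uncertainty, and spatial abnormality) into a unified composite anomaly score, enabling accurate detection and gradient-based localization.

Details

Motivation: Real-world industrial visual anomaly detection faces two major limitations. First, existing methods usually assume clean training data (normal samples only), whereas real data are often contaminated with anomalies. Second, they assume no access to labeled anomaly samples, which keeps the model from learning discriminative anomaly features and degrades detection and localization.

Result: Extensive experiments on the MVTec and VisA benchmarks show that the framework outperforms state-of-the-art methods under varying levels of data contamination, with strong detection and localization performance, interpretability, and robustness.

Insight: The novelty lies in combining limited anomaly supervision with adaptive instance weighting to mitigate contamination, and in a composite scoring mechanism that fuses statistical, uncertainty, and spatial information, supports gradient-based localization, and provides explainable visual evidence, improving robustness on contaminated data.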

Abstract: Visual anomaly detection in real-world industrial settings faces two major limitations. First, most existing methods are trained on purely normal data or on unlabeled datasets assumed to be predominantly normal, presuming the absence of contamination, an assumption that is rarely satisfied in practice. Second, they assume no access to labeled anomaly samples, limiting the model from learning discriminative characteristics of true anomalies. Therefore, these approaches often struggle to distinguish anomalies from normal instances, resulting in reduced detection and weak localization performance. In real-world applications, where training data are frequently contaminated with anomalies, such methods fail to deliver reliable performance. In this work, we propose a robust anomaly detection framework that integrates limited anomaly supervision into the adaptive deviation learning paradigm. We introduce a composite anomaly score that combines three complementary components: a deviation score capturing statistical irregularity, an entropy-based uncertainty score reflecting predictive inconsistency, and a segmentation-based score highlighting spatial abnormality. This unified scoring mechanism enables accurate detection and supports gradient-based localization, providing intuitive and explainable visual evidence of anomalous regions. Following the few-anomaly paradigm, we incorporate a small set of labeled anomalies during training while simultaneously mitigating the influence of contaminated samples through adaptive instance weighting. Extensive experiments on the MVTec and VisA benchmarks demonstrate that our framework outperforms state-of-the-art baselines and achieves strong detection and localization performance, interpretability, and robustness under various levels of data contamination.


[70] Deep in the Jungle: Towards Automating Chimpanzee Population Estimation cs.CV

Tom Raynes, Otto Brookes, Timm Haucke, Lukas Bösch, Anne-Sophie Crunchant

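TL;DR: This study proposes and evaluates integrating a computer vision-based monocular depth estimation (MDE) pipeline directly into ecological camera trap workflows for great ape conservation. Using a real-world dataset of 220 camera trap videos of wild chimpanzees, it combines two MDE models (DPT and Depth Anything) with multiple distance sampling strategies to produce detection distance estimates, from which population density and abundance are inferred.

Details

Motivation: Estimating abundance in unmarked great ape populations requires animal-to-camera distances, which currently depend on labour-intensive manual interpretation of large volumes of camera trap video. The study explores whether distance estimation can be automated.

Result: On the real-world dataset, calibrated DPT outperforms Depth Anything in both distance estimation accuracy and downstream density and abundance inference. Both models nonetheless show systematic bias: in complex forest environments they tend to overestimate detection distances and therefore underestimate density and abundance. Overall, the resulting population estimates are within 22% of those from traditional methods, indicating that MDE-driven camera trap distance sampling is a viable, practical alternative.

Insight: The novelty lies in applying MDE directly to ecological camera trap workflows to automate distance estimation and reduce manual effort. Viewed objectively, by combining several MDE models and sampling strategies, the study provides a case study of computer vision for population estimation in complex natural environments and identifies animal detection failures as the main factor limiting accuracy, pointing to directions for improvement.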

Abstract: The estimation of abundance and density in unmarked populations of great apes relies on statistical frameworks that require animal-to-camera distance measurements. In practice, acquiring these distances depends on labour-intensive manual interpretation of animal observations across large camera trap video corpora. This study introduces and evaluates an only sparsely explored alternative: the integration of computer vision-based monocular depth estimation (MDE) pipelines directly into ecological camera trap workflows for great ape conservation. Using a real-world dataset of 220 camera trap videos documenting a wild chimpanzee population, we combine two MDE models, Dense Prediction Transformers and Depth Anything, with multiple distance sampling strategies. These components are used to generate detection distance estimates, from which population density and abundance are inferred. Comparative analysis against manually derived ground-truth distances shows that calibrated DPT consistently outperforms Depth Anything. This advantage is observed in both distance estimation accuracy and downstream density and abundance inference. Nevertheless, both models exhibit systematic biases. We show that, given complex forest environments, they tend to overestimate detection distances and consequently underestimate density and abundance relative to conventional manual approaches. We further find that failures in animal detection across distance ranges are a primary factor limiting estimation accuracy. Overall, this work provides a case study that shows MDE-driven camera trap distance sampling is a viable and practical alternative to manual distance estimation. The proposed approach yields population estimates within 22% of those obtained using traditional methods.


[71] Q-Hawkeye: Reliable Visual Policy Optimization for Image Quality Assessment cs.CV

Wulin Xie, Rui Dai, Ruidong Ding, Kaikui Liu, Xiangxiang Chu

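TL;DR: This paper proposes Q-Hawkeye, a reinforcement learning-based reliable visual policy optimization framework for image quality assessment. It redesigns the learning signal through Uncertainty-Aware Dynamic Optimization and Perception-Aware Optimization to address the reliability limitations of existing methods in prediction stability and visual perception.

Details

Motivation: Existing RL-based IQA methods built on MLLMs have two key reliability problems: they apply uniform advantage weighting across training samples, which amplifies noisy signals from unstable samples, and they rely too heavily on textual reasoning while neglecting the model's visual perception of image content.

Result: Extensive experiments show that Q-Hawkeye outperforms state-of-the-art methods on multiple datasets and generalizes better.

Insight: The novelty lies in an uncertainty estimate based on the variance of predicted scores across multiple rollouts, used to adjust each sample's update weight dynamically, and in constructing degraded-original image pairs with an Implicit Perception Loss that strengthens the model's ability to base quality judgments on genuine visual evidence.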

Abstract: Image Quality Assessment (IQA) predicts perceptual quality scores consistent with human judgments. Recent RL-based IQA methods built on MLLMs focus on generating visual quality descriptions and scores, ignoring two key reliability limitations: (i) although the model’s prediction stability varies significantly across training samples, existing GRPO-based methods apply uniform advantage weighting, thereby amplifying noisy signals from unstable samples in gradient updates; (ii) most works emphasize text-grounded reasoning over images while overlooking the model’s visual perception ability of image content. In this paper, we propose Q-Hawkeye, an RL-based reliable visual policy optimization framework that redesigns the learning signal through unified Uncertainty-Aware Dynamic Optimization and Perception-Aware Optimization. Q-Hawkeye estimates predictive uncertainty using the variance of predicted scores across multiple rollouts and leverages this uncertainty to reweight each sample’s update strength, stabilizing policy optimization. To strengthen perceptual reliability, we construct paired inputs of degraded images and their original images and introduce an Implicit Perception Loss that constrains the model to ground its quality judgments in genuine visual evidence. Extensive experiments demonstrate that Q-Hawkeye outperforms state-of-the-art methods and generalizes better across multiple datasets. The code and models will be made available.
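
The uncertainty-based reweighting described above can be sketched as follows. The abstract does not give the weighting function, so the mapping `1 / (1 + variance / temperature)` and the names `uncertainty_weight` and `weighted_advantages` are assumptions chosen only to illustrate the idea: samples whose rollout scores disagree contribute less to the update.

```python
from statistics import mean, pvariance

def uncertainty_weight(rollout_scores, temperature=1.0):
    """Map the variance of predicted scores across rollouts to a weight in (0, 1].

    Hypothetical mapping: zero variance gives weight 1, high variance gives a small weight.
    """
    return 1.0 / (1.0 + pvariance(rollout_scores) / temperature)

def weighted_advantages(rewards, rollout_scores, eps=1e-8):
    """Group-normalized advantages (GRPO style), scaled by the sample's uncertainty weight."""
    mu = mean(rewards)
    sigma = pvariance(rewards) ** 0.5
    weight = uncertainty_weight(rollout_scores)
    return [weight * (r - mu) / (sigma + eps) for r in rewards]

# A stable sample (rollouts agree) keeps almost its full update strength,
# while an unstable sample (rollouts disagree) is damped.
stable = weighted_advantages([1.0, 0.0], rollout_scores=[2.9, 3.1])
unstable = weighted_advantages([1.0, 0.0], rollout_scores=[1.0, 5.0])
print(stable, unstable)
```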


[72] Triage: Hierarchical Visual Budgeting for Efficient Video Reasoning in Vision-Language Models cs.CV

Anmin Wang, Nan Zhang, Wei Tao, Xiaoyang Qu, Guokuan Li

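TL;DR: The paper proposes Triage, a training-free, plug-and-play framework that reframes video reasoning as a resource allocation problem via hierarchical visual budgeting, aiming to reduce the computational burden that data redundancy imposes on vision-language models processing video.

Details

Motivation: Vision-language models face severe computational challenges on video because massive data redundancy produces excessively long token sequences, making inference inefficient.

Result: Across multiple video reasoning benchmarks, Triage significantly improves inference speed and reduces memory footprint while matching or surpassing the performance of baselines and other methods.

Insight: The novelty lies in treating video reasoning as a resource allocation problem with a hierarchical budgeting strategy: frame-level budgeting first identifies keyframes, then token-level budgeting allocates Core Tokens and Context Tokens in stages, selecting the latter with an efficient batched Maximal Marginal Relevance algorithm. This is a novel, training-free approach to efficiency optimization.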

Abstract: Vision-Language Models (VLMs) face significant computational challenges in video processing due to massive data redundancy, which creates prohibitively long token sequences. To address this, we introduce Triage, a training-free, plug-and-play framework that reframes video reasoning as a resource allocation problem via hierarchical visual budgeting. Its first stage, Frame-Level Budgeting, identifies keyframes by evaluating their visual dynamics and relevance, generating a strategic prior based on their importance scores. Guided by this prior, the second stage, Token-Level Budgeting, allocates tokens in two phases: it first secures high-relevance Core Tokens, followed by diverse Context Tokens selected with an efficient batched Maximal Marginal Relevance (MMR) algorithm. Extensive experiments demonstrate that Triage improves inference speed and reduces memory footprint, while maintaining or surpassing the performance of baselines and other methods on various video reasoning benchmarks.
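
For readers unfamiliar with Maximal Marginal Relevance, the following is a minimal sketch of the classic greedy form of MMR, not the batched variant the paper describes. The function names, the toy feature vectors, and the trade-off parameter `lam` are assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(relevance, features, k, lam=0.5):
    """Greedily pick k items, trading relevance against similarity to items already picked."""
    selected = []
    candidates = list(range(len(features)))
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to any already-selected item.
            redundancy = max((cosine(features[i], features[j]) for j in selected), default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Tokens 0 and 1 are near-duplicates; token 2 is less relevant but different.
features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
relevance = [0.9, 0.85, 0.5]
print(mmr_select(relevance, features, k=2, lam=0.5))  # favors diversity
print(mmr_select(relevance, features, k=2, lam=1.0))  # relevance only
```

With `lam=1.0` the selection reduces to plain top-k by relevance, which is a convenient sanity check when tuning the trade-off.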


[73] One-shot Optimized Steering Vector for Hallucination Mitigation for VLMs cs.CV

Youxu Shi, Suorong Yang, Dong Liu

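TL;DR: This paper proposes OSGA, a one-shot optimized steering vector framework for mitigating hallucination and safety-related failures in vision-language models (VLMs). It selects one informative sample with a variance-based data selection strategy and learns a single steering vector using a contrastive objective with generative anchor regularization. The vector can be applied at a specific layer at inference time without modifying model parameters, improving performance at very low overhead.

Details

Motivation: VLMs perform well on multimodal tasks but still suffer from hallucination and safety-related failures that persist even as models scale. Steering is a lightweight technique for improving model performance, but existing methods struggle to balance efficiency and effectiveness. The paper explores whether steering vectors can generalize across inputs in order to develop a more efficient one-shot optimization method.

Result: Experiments on multiple benchmarks show that a single OSGA-optimized steering vector consistently improves hallucination mitigation and safety with negligible overhead, demonstrating the potential of one-shot steering as a practical, scalable solution.

Insight: The novelty lies in the finding that steering vectors can generalize across inputs when tasks share aligned semantic intent, and in the resulting OSGA framework, which combines variance-based data selection with contrastive learning under generative anchor regularization to achieve efficient one-shot optimization. Viewed objectively, learning a general steering vector from a single sample significantly lowers computation and deployment cost and offers a lightweight, scalable route to more reliable VLMs.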

Abstract: Vision Language Models (VLMs) achieve strong performance on multimodal tasks but still suffer from hallucination and safety-related failures that persist even at scale. Steering offers a lightweight technique to improve model performance. However, steering, whether input-dependent or input-independent, struggles to achieve a meaningful trade-off between efficiency and effectiveness. In this work, we observe that steering vectors can generalize across inputs when tasks share aligned semantic intent. Based on this insight, we propose OSGA (One-shot Steering with Generative Anchor), an input-independent framework that improves model performance with a single optimization instance. OSGA first selects an informative sample via a variance-based data selection strategy and learns a single steering vector with a contrastive objective with generative anchor regularization. The resulting vector can be universally applied at a certain layer during inference time without modifying model parameters. Experiments across multiple benchmarks show that a single OSGA-optimized steering vector consistently improves hallucination mitigation and safety enhancement with negligible overhead, highlighting one-shot steering as a practical and scalable solution for reliable VLMs.


[74] Hi-Light: A Path to high-fidelity, high-resolution video relighting with a Novel Evaluation Paradigm cs.CV

Xiangrui Liu, Haoxiang Li, Yezhou Yang

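TL;DR: Hi-Light is a training-free video relighting framework that addresses the lack of evaluation metrics, severe light flickering, and degradation of fine-grained details during editing found in existing methods. By introducing lightness prior anchored guided relighting diffusion, a Hybrid Motion-Adaptive Lighting Smoothing Filter, and a LAB-based Detail Fusion module, it achieves stable, high-fidelity, high-resolution video relighting, and it proposes the Light Stability Score, the first quantitative metric for lighting consistency.

Details

Motivation: Video relighting has great creative potential and commercial value, but it faces challenges that hinder practical use: no adequate evaluation metric, severe light flickering, and loss of detail during editing.

Result: Extensive experiments show that Hi-Light significantly outperforms state-of-the-art methods in both qualitative and quantitative comparisons, producing stable, detail-rich relit videos.

Insight: Main contributions: 1) a training-free, stable relighting diffusion framework; 2) a hybrid motion-adaptive filter that uses optical flow for temporal stability without motion blur; 3) a LAB-space detail fusion module that preserves high-frequency detail; 4) the Light Stability Score, the first quantitative metric designed specifically to measure lighting consistency, filling an evaluation gap.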

Abstract: Video relighting offers immense creative potential and commercial value but is hindered by challenges, including the absence of an adequate evaluation metric, severe light flickering, and the degradation of fine-grained details during editing. To overcome these challenges, we introduce Hi-Light, a novel, training-free framework for high-fidelity, high-resolution, robust video relighting. Our approach introduces three technical innovations: lightness prior anchored guided relighting diffusion that stabilises intermediate relit video, a Hybrid Motion-Adaptive Lighting Smoothing Filter that leverages optical flow to ensure temporal stability without introducing motion blur, and a LAB-based Detail Fusion module that preserves high-frequency detail information from the original video. Furthermore, to address the critical gap in evaluation, we propose the Light Stability Score, the first quantitative metric designed to specifically measure lighting consistency. Extensive experiments demonstrate that Hi-Light significantly outperforms state-of-the-art methods in both qualitative and quantitative comparisons, producing stable, highly detailed relit videos.


[75] Med-Scout: Curing MLLMs’ Geometric Blindness in Medical Perception via Geometry-Aware RL Post-Training cs.CV | cs.AI

Anglin Liu, Ruichao Chen, Yi Lu, Hongxia Xu, Jintai Chen

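TL;DR: This paper proposes Med-Scout, a framework that addresses the geometric blindness of multimodal large language models in medical perception through geometry-aware reinforcement learning post-training. It exploits the intrinsic geometric logic of unlabeled medical images and designs three proxy tasks to generate supervision signals without costly expert annotation.

Details

Motivation: State-of-the-art MLLMs exhibit geometric blindness in medical diagnosis: they fail to ground outputs in objective geometric constraints, producing plausible but factually incorrect hallucinations. This stems from training paradigms that emphasize linguistic fluency over geometric fidelity.

Result: On the purpose-built Med-Scout-Bench benchmark, Med-Scout significantly mitigates geometric blindness, outperforming leading proprietary and open-source MLLMs by over 40%. The improved geometric perception also generalizes to broader medical understanding, with superior results on radiological and comprehensive medical VQA tasks.

Insight: The novelty lies in RL post-training that requires no expert annotation and instead uses the intrinsic geometric logic of images: three proxy tasks (Hierarchical Scale Localization, Topological Jigsaw Reconstruction, and Anomaly Consistency Detection) generate verifiable supervision signals that improve the model's perception of geometric information.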

Abstract: Despite recent Multimodal Large Language Models (MLLMs)’ linguistic prowess in medical diagnosis, we find even state-of-the-art MLLMs suffer from a critical perceptual deficit: geometric blindness. This failure to ground outputs in objective geometric constraints leads to plausible yet factually incorrect hallucinations, rooted in training paradigms that prioritize linguistic fluency over geometric fidelity. This paper introduces Med-Scout, a novel framework that “cures” this blindness via Reinforcement Learning (RL) that leverages the intrinsic geometric logic latent within unlabeled medical images. Instead of relying on costly expert annotations, Med-Scout derives verifiable supervision signals through three strategic proxy tasks: Hierarchical Scale Localization, Topological Jigsaw Reconstruction, and Anomaly Consistency Detection. To rigorously quantify this deficit, we present Med-Scout-Bench, a new benchmark specifically designed to evaluate geometric perception. Extensive evaluations show that Med-Scout significantly mitigates geometric blindness, outperforming leading proprietary and open-source MLLMs by over 40% on our benchmark. Furthermore, this enhanced geometric perception generalizes to broader medical understanding, achieving superior results on radiological and comprehensive medical VQA tasks.


[76] Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning cs.CV

Xiangyu Zeng, Zhiqiu Zhang, Yuhan Zhu, Xinhao Li, Zikang Wang

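TL;DR: This paper proposes Video-o3, a framework for long-video multi-hop reasoning, where existing models rely on uniform sampling and single-turn inference and struggle to locate sparse key evidence amid extensive redundancy. The framework supports iterative discovery of salient visual clues, fine-grained inspection of key segments, and adaptive termination once sufficient evidence is acquired.

Details

Motivation: Existing multimodal large language models for long-video understanding rely mainly on uniform sampling and single-turn inference, which limits their ability to identify sparse yet critical evidence amid extensive redundancy.

Result: Video-o3 substantially outperforms existing state-of-the-art methods, reaching 72.1% accuracy on MLVU and 46.5% on Video-Holmes, which demonstrates strong multi-hop evidence-seeking and reasoning capabilities.

Insight: Core contributions: 1) Task-Decoupled Attention Masking, which addresses the attention dispersion caused by the heterogeneity of reasoning and tool calling; 2) a Verifiable Trajectory-Guided Reward that balances exploration coverage with reasoning efficiency and controls context length growth in multi-turn interaction; 3) Seeker-173K, a large-scale, high-quality dataset of tool-interaction trajectories for training.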

Abstract: Existing multimodal large language models for long-video understanding predominantly rely on uniform sampling and single-turn inference, limiting their ability to identify sparse yet critical evidence amid extensive redundancy. We introduce Video-o3, a novel framework that supports iterative discovery of salient visual clues, fine-grained inspection of key segments, and adaptive termination once sufficient evidence is acquired. Technically, we address two core challenges in interleaved tool invocation. First, to mitigate attention dispersion induced by the heterogeneity of reasoning and tool-calling, we propose Task-Decoupled Attention Masking, which isolates per-step concentration while preserving shared global context. Second, to control context length growth in multi-turn interactions, we introduce a Verifiable Trajectory-Guided Reward that balances exploration coverage with reasoning efficiency. To support training at scale, we further develop a data synthesis pipeline and construct Seeker-173K, comprising 173K high-quality tool-interaction trajectories for effective supervised and reinforcement learning. Extensive experiments show that Video-o3 substantially outperforms state-of-the-art methods, achieving 72.1% accuracy on MLVU and 46.5% on Video-Holmes. These results demonstrate Video-o3’s strong multi-hop evidence-seeking and reasoning capabilities, and validate the effectiveness of native tool invocation in long-video scenarios.


[77] ShotFinder: Imagination-Driven Open-Domain Video Shot Retrieval via Web Search cs.CV | cs.AI

Tao Yu, Haopeng Jin, Hao Wang, Shenghua Chai, Yujia Yang

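TL;DR: This paper presents the ShotFinder benchmark and retrieval pipeline for evaluating and tackling open-domain video shot retrieval. The benchmark contains 1,210 high-quality samples covering five controllable constraints (temporal order, color, visual style, audio, and resolution), and the text-driven three-stage retrieval and localization pipeline consists of query expansion via video imagination, candidate video retrieval with a search engine, and description-guided temporal localization. Experiments show a significant gap between current multimodal large models and human performance, with color and visual style constraints especially challenging.

Details

Motivation: Research on large language models for information retrieval has focused mainly on text or static multimodal settings, while open-domain video shot retrieval, which involves richer temporal structure and more complex semantics, lacks systematic benchmarks and analysis. The paper aims to fill this gap.

Result: Experiments with multiple closed-source and open-source models on the ShotFinder benchmark show a significant gap to human performance and an imbalance across constraints: temporal localization is relatively tractable, while color and visual style remain the main challenges.

Insight: Contributions include: 1) a benchmark construction method that formalizes editing requirements as keyframe-oriented shot descriptions and introduces five controllable single-factor constraints; 2) a text-driven three-stage retrieval and localization pipeline, in which the “video imagination” concept in the query expansion stage is novel; 3) evidence that open-domain video shot retrieval is a key capability multimodal large models have yet to master, pointing to directions for future research.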

Abstract: In recent years, large language models (LLMs) have made rapid progress in information retrieval, yet existing research has mainly focused on text or static multimodal settings. Open-domain video shot retrieval, which involves richer temporal structure and more complex semantics, still lacks systematic benchmarks and analysis. To fill this gap, we introduce ShotFinder, a benchmark that formalizes editing requirements as keyframe-oriented shot descriptions and introduces five types of controllable single-factor constraints: Temporal order, Color, Visual style, Audio, and Resolution. We curate 1,210 high-quality samples from YouTube across 20 thematic categories, using large models for generation with human verification. Based on the benchmark, we propose ShotFinder, a text-driven three-stage retrieval and localization pipeline: (1) query expansion via video imagination, (2) candidate video retrieval with a search engine, and (3) description-guided temporal localization. Experiments on multiple closed-source and open-source models reveal a significant gap to human performance, with clear imbalance across constraints: temporal localization is relatively tractable, while color and visual style remain major challenges. These results reveal that open-domain video shot retrieval is still a critical capability that multimodal large models have yet to overcome.


[78] Structured Over Scale: Learning Spatial Reasoning from Educational Video cs.CV

Bishoy Galoaa, Xiangyu Bai, Sarah Ostadabbas

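TL;DR: This paper proposes using the structured content of educational videos to improve the spatial reasoning of vision-language models (VLMs). It automatically extracts the DoraVQA dataset from Dora the Explorer and fine-tunes Qwen models with Group Relative Policy Optimization (GRPO); trained on only 38 hours of children's educational video, the models show significant gains on multiple video understanding benchmarks.

Details

Motivation: Existing VLMs perform well on standard video understanding benchmarks but fail systematically on simple reasoning tasks that preschool children can solve, such as counting, spatial reasoning, and compositional understanding. The authors hypothesize that the pedagogically structured content of educational videos provides an ideal training signal for these capabilities.

Result: The method improves by 8-14 points on DoraVQA, reaches a state-of-the-art 86.16% on CVBench, and transfers strongly to Video-MME and NExT-QA, demonstrating effective generalization from narrow pedagogical content to broad multimodal understanding.

Insight: The novelty lies in using the context-question-pause-answer format inherent in educational videos as a self-contained learning environment, and in fine-tuning with GRPO on its clear correctness signals and structured reasoning traces. The core insight is that, for improving VLM reasoning, content structure matters as much as data scale.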

Abstract: Vision-language models (VLMs) demonstrate impressive performance on standard video understanding benchmarks yet fail systematically on simple reasoning tasks that preschool children can solve, including counting, spatial reasoning, and compositional understanding. We hypothesize that the pedagogically-structured content of educational videos provides an ideal training signal for improving these capabilities. We introduce DoraVQA, a dataset of 5,344 question-answer pairs automatically extracted from 8 seasons of Dora the Explorer with precise timestamp alignment. Each episode follows a consistent context-question-pause-answer structure that creates a self-contained learning environment analogous to interactive tutoring. We fine-tune both Qwen2 and Qwen3 using Group Relative Policy Optimization (GRPO), leveraging the clear correctness signals and structured reasoning traces inherent in educational content. Despite training exclusively on 38 hours of children’s educational videos, our approach achieves improvements of 8-14 points on DoraVQA and state-of-the-art 86.16% on CVBench, with strong transfer to Video-MME and NExT-QA, demonstrating effective generalization from narrow pedagogical content to broad multimodal understanding. Through cross-domain benchmarks, we show that VLMs can perform tasks that require robust reasoning learned from structured educational content, suggesting that content structure matters as much as content scale.


[79] Training-Free Test-Time Adaptation with Brownian Distance Covariance in Vision-Language Models cs.CV | cs.LG

Yi Zhang, Chun-Wun Cheng, Angelica I. Aviles-Rivero, Zhihai He, Liang-Jie Zhang

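TL;DR: This paper proposes TaTa, a test-time adaptation method that requires neither training nor back-propagation. It uses Brownian Distance Covariance to adapt vision-language models dynamically to new domains and combines attribute-enhanced prompting, dynamic clustering, and pseudo-label refinement, significantly reducing computational cost while reaching SOTA performance in domain and cross-dataset generalization.

Details

Motivation: Vision-language models degrade under domain shift, and existing test-time adaptation methods are computationally intensive, rely on back-propagation, and usually focus on a single modality.

Result: Experiments on multiple datasets show that TaTa significantly reduces computational cost while achieving state-of-the-art performance on domain and cross-dataset generalization tasks.

Insight: Contributions include using Brownian Distance Covariance as a training-free statistical measure that captures both linear and nonlinear dependencies, attribute-enhanced prompting to improve vision-language inference, and dynamic clustering with pseudo-label refinement to recalibrate the model effectively. Viewed objectively, its training-free, efficient adaptation strategy is worth borrowing.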

Abstract: Vision-language models suffer performance degradation under domain shift, limiting real-world applicability. Existing test-time adaptation methods are computationally intensive, rely on back-propagation, and often focus on single modalities. To address these issues, we propose Training-free Test-Time Adaptation with Brownian Distance Covariance (TaTa). TaTa leverages Brownian Distance Covariance, a powerful statistical measure that captures both linear and nonlinear dependencies via pairwise distances, to dynamically adapt VLMs to new domains without training or back-propagation. This not only improves efficiency but also enhances stability by avoiding disruptive weight updates. TaTa further integrates attribute-enhanced prompting to improve vision-language inference with descriptive visual cues. Combined with dynamic clustering and pseudo-label refinement, it effectively recalibrates the model for novel visual contexts. Experiments across diverse datasets show that TaTa significantly reduces computational cost while achieving state-of-the-art performance in domain and cross-dataset generalization.
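
As background for the statistic named in this abstract, here is a minimal sketch of the standard sample distance covariance and distance correlation, computed from double-centered pairwise distance matrices. It is not TaTa's adaptation procedure; the function names and the toy data are assumptions for illustration. The example uses y = x^2 on symmetric inputs, a dependence that Pearson correlation misses (it is exactly zero here) but distance correlation detects.

```python
import math

def _centered_distances(points):
    """Pairwise Euclidean distances, double-centered (row, column, and grand means removed)."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    row = [sum(r) / n for r in d]  # the matrix is symmetric, so column means equal row means
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]

def distance_covariance_sq(xs, ys):
    """Squared sample distance covariance between paired samples xs and ys."""
    a, b = _centered_distances(xs), _centered_distances(ys)
    n = len(xs)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)

def distance_correlation(xs, ys):
    """Sample distance correlation in [0, 1]; returns 0 when either sample is constant."""
    dvar = distance_covariance_sq(xs, xs) * distance_covariance_sq(ys, ys)
    if dvar <= 0:
        return 0.0
    return math.sqrt(max(distance_covariance_sq(xs, ys), 0.0) / math.sqrt(dvar))

xs = [(x,) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
ys = [(x[0] ** 2,) for x in xs]  # nonlinear dependence with zero Pearson correlation
print(distance_correlation(xs, xs))
print(distance_correlation(xs, ys))
```

Points are passed as tuples so the same code works for multi-dimensional feature vectors; the two samples may even have different dimensions, since only within-sample distances are used.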


[80] User Prompting Strategies and Prompt Enhancement Methods for Open-Set Object Detection in XR Environments cs.CV

Junfeng Lin, Yanming Xiu, Maria Gorlatova

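TL;DR: This paper studies how user prompting strategies and prompt enhancement methods affect open-set object detection (OSOD) models in extended reality (XR) environments. Evaluating GroundingDINO and YOLO-E on real-world XR images with simulated user prompts of four types (standard, underdetailed, overdetailed, and pragmatically ambiguous), it finds that both models degrade under ambiguous prompts, while overdetailed prompts mainly affect GroundingDINO. Prompt enhancement substantially improves robustness under ambiguous prompts.

Details

Motivation: Existing OSOD models perform well on benchmarks, but in interactive XR environments user-generated prompts are often ambiguous, underspecified, or overly detailed, and model behavior under such realistic prompts remains underexplored. The paper studies prompt-conditioned robustness.

Result: On real-world XR images, both models are stable under underdetailed and standard prompts but degrade under ambiguous prompts. Overdetailed prompts mainly affect GroundingDINO. Prompt enhancement substantially improves robustness under ambiguous prompts, with gains exceeding 55% mIoU and 41% average confidence.

Insight: The novelty lies in a systematic study of how the diversity of user prompts in XR environments affects OSOD models, together with prompt enhancement methods that improve robustness. Viewed objectively, modeling user prompting behavior as distinct types and enhancing each accordingly offers practical guidance for open-set detection in interactive environments.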

Abstract: Open-set object detection (OSOD) localizes objects while identifying and rejecting unknown classes at inference. While recent OSOD models perform well on benchmarks, their behavior under realistic user prompting remains underexplored. In interactive XR settings, user-generated prompts are often ambiguous, underspecified, or overly detailed. To study prompt-conditioned robustness, we evaluate two OSOD models, GroundingDINO and YOLO-E, on real-world XR images and simulate diverse user prompting behaviors using vision-language models. We consider four prompt types: standard, underdetailed, overdetailed, and pragmatically ambiguous, and examine the impact of two enhancement strategies on these prompts. Results show that both models exhibit stable performance under underdetailed and standard prompts, while they suffer degradation under ambiguous prompts. Overdetailed prompts primarily affect GroundingDINO. Prompt enhancement substantially improves robustness under ambiguity, yielding gains exceeding 55% mIoU and 41% average confidence. Based on the findings, we propose several prompting strategies and prompt enhancement methods for OSOD models in XR environments.


[81] VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation cs.CV | cs.AI | cs.LG

Hongyang Du, Junjie Ye, Xiaoyan Cong, Runhao Li, Jingcheng Ni

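TL;DR: This paper proposes VideoGPA, a framework that uses a geometry foundation model to generate dense preference signals automatically and applies Direct Preference Optimization to guide video diffusion models toward 3D-consistent video, effectively addressing the object deformation and spatial drift seen in existing models.

Details

Motivation: Existing video diffusion models lack explicit incentives for geometric consistency, so generated videos are structurally inconsistent in 3D, with object deformation and spatial drift.

Result: In extensive experiments, VideoGPA significantly improves temporal stability, physical plausibility, and motion coherence, consistently outperforming state-of-the-art baselines while using very few preference pairs.

Insight: The novelty lies in injecting geometric priors into video generation through a self-supervised preference alignment framework that guides the model toward 3D consistency without human annotation, a data-efficient and generalizable approach.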

Abstract: While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consistency, often resulting in object deformation or spatial drift. We hypothesize that these failures arise because standard denoising objectives lack explicit incentives for geometric coherence. To address this, we introduce VideoGPA (Video Geometric Preference Alignment), a data-efficient self-supervised framework that leverages a geometry foundation model to automatically derive dense preference signals that guide VDMs via Direct Preference Optimization (DPO). This approach effectively steers the generative distribution toward inherent 3D consistency without requiring human annotations. VideoGPA significantly enhances temporal stability, physical plausibility, and motion coherence using minimal preference pairs, consistently outperforming state-of-the-art baselines in extensive experiments.
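
As background, the standard Direct Preference Optimization loss for one preference pair can be sketched as below. This is the generic log-likelihood form, written under the assumption that per-sample log-probabilities are available; diffusion models typically need a proxy for these, and the abstract does not state VideoGPA's exact objective. The function name `dpo_loss` and the example values are illustrative.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic DPO loss for one (preferred, rejected) pair.

    logp_w / logp_l: policy log-probabilities of the preferred and rejected samples.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy already favors the preferred (for example, more 3D-consistent) sample: lower loss.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```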


cs.GR

[82] HeatMat: Simulation of City Material Impact on Urban Heat Island Effect cs.GR | cs.CV

Marie Reinbigler, Romain Rouffet, Peter Naylor, Mikolaj Czerkawski, Nikolaos Dionelis

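TL;DR: The paper proposes HeatMat, which uses open data to analyze at high resolution the individual impact of urban materials on the urban heat island effect. It estimates building materials from street-view images with a pre-trained vision-language model, combines them with OpenStreetMap data to produce 2D maps of the city's vertical structure and material characteristics, and develops a 2.5D simulator for heat transfer modeling that estimates surface temperature at multiple resolutions with a 20x speedup over 3D simulation.

Details

Motivation: The urban heat island effect is hard to study at high spatial and temporal resolution with sensor data (satellites or in-situ stations). Analyzing the impact of urban materials and testing new configurations requires high-resolution, city-scale simulation, yet existing data such as OpenStreetMap lack information on building facade materials.

Result: The method achieves high-resolution simulation of a real city; the 2.5D simulator is 20x faster than an equivalent 3D simulation and supports random-access surface temperature estimation at multiple resolutions.

Insight: Contributions include supplementing OpenStreetMap with street-view images and a VLM to estimate building materials, and encoding city information as 2D maps for a 2.5D simulation that models heat transfer efficiently. Viewed objectively, by integrating open data and optimizing simulation, the method offers a scalable, computationally efficient approach to studying the urban heat island effect.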

Abstract: The Urban Heat Island (UHI) effect, defined as a significant increase in temperature in urban environments compared to surrounding areas, is difficult to study in real cities using sensor data (satellites or in-situ stations) due to their coarse spatial and temporal resolution. Among the factors contributing to this effect are the properties of urban materials, which differ from those in rural areas. To analyze their individual impact and to test new material configurations, a high-resolution simulation at the city scale is required. Estimating the current materials used in a city, including those on building facades, is also challenging. We propose HeatMat, an approach to analyze at high resolution the individual impact of urban materials on the UHI effect in a real city, relying only on open data. We estimate building materials using street-view images and a pre-trained vision-language model (VLM) to supplement existing OpenStreetMap data, which describes the 2D geometry and features of buildings. We further encode this information into a set of 2D maps that represent the city’s vertical structure and material characteristics. These maps serve as inputs for our 2.5D simulator, which models coupled heat transfers and enables random-access surface temperature estimation at multiple resolutions, reaching a 20x speedup compared to an equivalent simulation in 3D.


cs.IR

[83] Compact Hypercube Embeddings for Fast Text-based Wildlife Observation Retrieval cs.IR | cs.CV | cs.LG | cs.MM | cs.SD

Ilyass Moummad, Marius Miron, David Robinson, Kawtar Zaher, Hervé Goëau

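TL;DR: This paper proposes a compact hypercube embedding framework for fast text-based wildlife observation retrieval. It adapts pretrained wildlife foundation models (such as BioCLIP and BioLingual) into hashing models through parameter-efficient fine-tuning, aligning natural language descriptions with visual or acoustic observations in a shared Hamming space and producing compact binary representations that enable efficient retrieval over large-scale image and audio databases.

Details

Motivation: Large-scale biodiversity monitoring platforms increasingly rely on multimodal wildlife observations, but the computational cost of high-dimensional similarity search makes retrieving relevant observations from massive archives challenging. The paper aims to cut memory and search cost with compact binary representations, enabling efficient text-based retrieval.

Result: Evaluated on large-scale benchmarks including iNaturalist2024 (text-to-image retrieval) and iNatSounds2024 (text-to-audio retrieval), as well as multiple soundscape datasets, retrieval with discrete hypercube embeddings performs on par with or better than continuous embeddings while drastically reducing memory and search cost. The hashing objective also consistently improves the underlying encoder representations, strengthening retrieval and zero-shot generalization.

Insight: Contributions include extending lightweight hashing to a multimodal setting that aligns natural language with visual or acoustic observations; adapting pretrained foundation models for hashing with parameter-efficient fine-tuning; and a hashing objective that both enables efficient retrieval and improves encoder representation quality. The method offers a scalable, efficient search solution for biodiversity monitoring systems.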

Abstract: Large-scale biodiversity monitoring platforms increasingly rely on multimodal wildlife observations. While recent foundation models enable rich semantic representations across vision, audio, and language, retrieving relevant observations from massive archives remains challenging due to the computational cost of high-dimensional similarity search. In this work, we introduce compact hypercube embeddings for fast text-based wildlife observation retrieval, a framework that enables efficient text-based search over large-scale wildlife image and audio databases using compact binary representations. Building on the cross-view code alignment hashing framework, we extend lightweight hashing beyond a single-modality setup to align natural language descriptions with visual or acoustic observations in a shared Hamming space. Our approach leverages pretrained wildlife foundation models, including BioCLIP and BioLingual, and adapts them efficiently for hashing using parameter-efficient fine-tuning. We evaluate our method on large-scale benchmarks, including iNaturalist2024 for text-to-image retrieval and iNatSounds2024 for text-to-audio retrieval, as well as multiple soundscape datasets to assess robustness under domain shift. Results show that retrieval using discrete hypercube embeddings achieves competitive, and in several cases superior, performance compared to continuous embeddings, while drastically reducing memory and search cost. Moreover, we observe that the hashing objective consistently improves the underlying encoder representations, leading to stronger retrieval and zero-shot generalization. These results demonstrate that binary, language-based retrieval enables scalable and efficient search over large wildlife archives for biodiversity monitoring systems.
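
The memory and speed argument for binary codes can be made concrete with a small sketch of sign binarization and Hamming-distance search. It assumes embeddings have already been produced by some text and image (or audio) encoders; the function names and the toy 4-dimensional vectors are illustrative, and nothing here reflects how the paper trains its hashing models.

```python
def binarize(embedding):
    """Pack the signs of a real-valued embedding into an integer bit code (1 bit per dimension)."""
    code = 0
    for value in embedding:
        code = (code << 1) | (1 if value > 0 else 0)
    return code

def hamming(a, b):
    """Number of differing bits between two codes: one XOR followed by a popcount."""
    return bin(a ^ b).count("1")

def search(query_code, database_codes, top_k=1):
    """Return indices of the top_k database codes closest to the query in Hamming distance."""
    ranked = sorted(range(len(database_codes)),
                    key=lambda i: hamming(query_code, database_codes[i]))
    return ranked[:top_k]

# Toy 4-dimensional embeddings standing in for encoder outputs.
database = [binarize(v) for v in ([0.9, -0.2, 0.4, -0.7],
                                  [-0.5, 0.8, -0.1, 0.3],
                                  [0.6, -0.9, -0.3, -0.2])]
query = binarize([0.3, -1.2, 0.8, -0.1])  # e.g. a text query embedding
print(search(query, database, top_k=2))
```

Each dimension costs one bit instead of a 32-bit float, and distance computation reduces to bitwise operations, which is where the reported memory and search savings would come from.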


cs.HC

[84] AI and My Values: User Perceptions of LLMs’ Ability to Extract, Embody, and Explain Human Values from Casual Conversations cs.HC | cs.AI | cs.CL

Bhada Yun, Renn Su, April Yi Wang

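TL;DR: This paper introduces VAPT (the Value-Alignment Perception Toolkit) for studying how large language models reflect human values and how people judge those reflections. Based on a month-long experiment in which users chatted with a human-like chatbot, followed by interviews, the study finds that some participants came to believe AI can understand human values, and it warns about “weaponized empathy” as a potentially dangerous design pattern.

Details

Motivation: The paper takes a pragmatic approach to the philosophical question of whether AI understands human values: it develops a toolkit to evaluate LLMs’ ability to extract, embody, and explain human values, and examines how users perceive that ability.

Result: Of 20 participants, 13 ended the study convinced that AI can understand human values. Participants found the experience useful for self-reflection and found themselves being persuaded by the AI’s reasoning. The study reports no benchmark or SOTA comparisons; its findings rest on qualitative interviews.

Insight: The contribution is the VAPT toolkit, which decomposes value alignment into three dimensions (extraction, embodiment, and explanation) for empirical evaluation. The analysis surfaces “weaponized empathy” as an important risk and offers design implications for building transparent, controllable value-aligned conversational agents.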

Abstract: Does AI understand human values? While this remains an open philosophical question, we take a pragmatic stance by introducing VAPT, the Value-Alignment Perception Toolkit, for studying how LLMs reflect people’s values and how people judge those reflections. 20 participants texted a human-like chatbot over a month, then completed a 2-hour interview with our toolkit evaluating AI’s ability to extract (pull details regarding), embody (make decisions guided by), and explain (provide proof of) human values. 13 participants left our study convinced that AI can understand human values. Participants found the experience insightful for self-reflection and found themselves getting persuaded by the AI’s reasoning. Thus, we warn about “weaponized empathy”: a potentially dangerous design pattern that may arise in value-aligned, yet welfare-misaligned AI. VAPT offers concrete artifacts and design implications to evaluate and responsibly build value-aligned conversational agents with transparency, consent, and safeguards as AI grows more capable and human-like into the future.


cs.LG

[85] ReNCE: Learning to Reason by Noise Contrastive Estimation cs.LG | cs.CL

Wenzheng Zhang, Karl Stratos

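TL;DR: This paper proposes ReNCE, an explicit contrastive learning method that learns to reason via noise contrastive estimation, to strengthen the reasoning of pretrained LLMs. It splits a group of outcomes into positive and negative sets and maximizes the likelihood of the positive outcomes, replacing GRPO’s soft, advantage-based discrimination.

Details

Motivation: GRPO depends on empirical refinements such as asymmetric clipping and zero-variance data filtering, which are hard to identify and require significant empirical insight. The paper aims for a more direct framework for learning to reason that needs less of this tuning.

Result: On a suite of challenging math benchmarks, ReNCE is competitive with strong baselines such as DAPO and online DPO.

Insight: The contribution is an online instantiation of noise contrastive estimation for LLM reasoning: explicit contrastive learning replaces soft advantage estimation, which simplifies training and reduces the need for empirical tuning. From an objective standpoint, the method offers a more stable and interpretable optimization path for learning to reason.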

Abstract: GRPO is a standard approach to endowing pretrained LLMs with reasoning capabilities. It estimates the advantage of an outcome from a group of $K$ outcomes, and promotes those with positive advantages inside a trust region. Since GRPO discriminates between good and bad outcomes softly, it benefits from additional refinements such as asymmetric clipping and zero-variance data filtering. While effective, these refinements require significant empirical insight and can be challenging to identify. We instead propose an explicit contrastive learning approach. Instead of estimating advantages, we bifurcate $K$ outcomes into positive and negative sets, then maximize the likelihood of positive outcomes. Our approach can be viewed as an online instantiation of (multi-label) noise contrastive estimation for LLM reasoning. We validate our method by demonstrating competitive performance on a suite of challenging math benchmarks against strong baselines such as DAPO and online DPO.
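The idea of bifurcating $K$ outcomes and maximizing the likelihood of the positives can be illustrated with a simple multi-positive contrastive loss. The sketch below is an assumption-laden stand-in, not the ReNCE objective itself: `scores` are hypothetical per-outcome model scores (for example sequence log-likelihoods), the reward `threshold` is invented, and the loss is the negative log of the softmax mass that falls on the positive set.

```python
import math


def contrastive_loss(scores, rewards, threshold=0.5):
    """Illustrative loss: -log(softmax mass assigned to outcomes with reward above threshold)."""
    positives = [s for s, r in zip(scores, rewards) if r > threshold]
    if not positives or len(positives) == len(scores):
        return 0.0  # all-positive or all-negative group: nothing to contrast
    m = max(scores)  # subtract the max for numerical stability
    total = sum(math.exp(s - m) for s in scores)
    pos_mass = sum(math.exp(s - m) for s in positives)
    return -math.log(pos_mass / total)


# K = 4 outcomes with equal scores; two are correct, so half the mass is on positives.
loss_uniform = contrastive_loss([0.0, 0.0, 0.0, 0.0], [1, 0, 0, 1])
# Raising the scores of the correct outcomes lowers the loss.
loss_better = contrastive_loss([2.0, 0.0, 0.0, 2.0], [1, 0, 0, 1])
```

Minimizing this quantity shifts probability mass toward the positive set without computing a per-outcome advantage, which is the contrast with GRPO that the abstract draws.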


[86] HeaPA: Difficulty-Aware Heap Sampling and On-Policy Query Augmentation for LLM Reinforcement Learning cs.LG | cs.CL

Weiqi Wang, Xin Liu, Binxuan Huang, Hejie Cui, Rongzhi Zhang

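TL;DR: HeaPA is an efficient sampling and query augmentation method for LLM reinforcement learning. It maintains a dynamically evolving prompt pool, tracks the model’s capability frontier with heap-based boundary sampling, and performs on-policy query augmentation with lightweight asynchronous validation, reaching target performance with less computation while keeping training time comparable.

Details

Motivation: When existing RLVR methods train LLMs on reasoning tasks with verifiable outcomes, the prompt pool is usually static or only loosely tied to learning progress. Uniform sampling therefore cannot track the model’s changing capability and wastes compute on prompts that are already solved or still out of reach. Existing improvements either assume a fixed pool, which makes stable pool growth hard to support, or add extra teacher cost and latency.

Result: Across two training corpora, two training recipes, and seven benchmarks, HeaPA consistently improves accuracy and reaches target performance with less computation while keeping wall-clock time comparable; the gains grow with model scale.

Insight: The contribution combines a bounded, evolving prompt pool; heap-based boundary sampling that focuses on the capability frontier; on-policy query augmentation with lightweight asynchronous validation; and stabilization of correlated queries through topology-aware re-estimation of pool statistics and controlled reinsertion. Together these yield an efficient, adaptive sampling strategy that avoids the limits of a fixed pool or an extra teacher.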

Abstract: RLVR is now a standard way to train LLMs on reasoning tasks with verifiable outcomes, but when rollout generation dominates the cost, efficiency depends heavily on which prompts you sample and when. In practice, prompt pools are often static or only loosely tied to the model’s learning progress, so uniform sampling can’t keep up with the shifting capability frontier and ends up wasting rollouts on prompts that are already solved or still out of reach. Existing approaches improve efficiency through filtering, curricula, adaptive rollout allocation, or teacher guidance, but they typically assume a fixed pool, which makes it hard to support stable on-policy pool growth, or they add extra teacher cost and latency. We introduce HeaPA (Heap Sampling and On-Policy Query Augmentation), which maintains a bounded, evolving pool, tracks the frontier using heap-based boundary sampling, expands the pool via on-policy augmentation with lightweight asynchronous validation, and stabilizes correlated queries through topology-aware re-estimation of pool statistics and controlled reinsertion. Across two training corpora, two training recipes, and seven benchmarks, HeaPA consistently improves accuracy and reaches target performance with fewer computations while keeping wall-clock time comparable. Our analyses suggest these gains come from frontier-focused sampling and on-policy pool growth, with the benefits becoming larger as model scale increases. Our code is available at https://github.com/horizon-rl/HeaPA.
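One way to picture a bounded pool that keeps sampling near the capability frontier is a heap keyed by how far a prompt’s pass rate is from 0.5. This is a rough sketch under stated assumptions, not the HeaPA algorithm from the linked repository: the `FrontierPool` class, the 0.5 target, and the eviction rule are all hypothetical choices made for illustration.

```python
import heapq


class FrontierPool:
    """Bounded prompt pool ordered by distance of pass rate from an assumed 0.5 frontier."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []  # entries are (distance_to_frontier, prompt)

    def add(self, prompt, pass_rate):
        heapq.heappush(self.heap, (abs(pass_rate - 0.5), prompt))
        if len(self.heap) > self.capacity:
            # Evict the entry farthest from the frontier (solved or out of reach).
            self.heap.remove(max(self.heap))
            heapq.heapify(self.heap)

    def sample(self, n):
        """Return the n prompts currently closest to the frontier."""
        return [prompt for _, prompt in heapq.nsmallest(n, self.heap)]


pool = FrontierPool(capacity=3)
pool.add("easy", 1.0)     # always solved
pool.add("hard", 0.0)     # never solved
pool.add("mid", 0.5)      # right at the frontier
pool.add("almost", 0.75)  # pushes the pool over capacity
batch = pool.sample(2)
```

A production version would also need to refresh pass-rate estimates as the policy improves, which is the part the abstract attributes to re-estimation of pool statistics and controlled reinsertion.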


[87] TTCS: Test-Time Curriculum Synthesis for Self-Evolving cs.LG | cs.AI | cs.CL

Chengyi Yang, Zhishang Xiang, Yunbo Tang, Zongpei Teng, Chengsong Huang

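TL;DR: This paper proposes TTCS (Test-Time Curriculum Synthesis), a co-evolving test-time training framework that improves LLMs on difficult reasoning tasks by dynamically constructing a curriculum. The framework has a question synthesizer and a reasoning solver, both initialized from the same pretrained model, that co-evolve through iterative optimization: the synthesizer generates progressively harder question variants conditioned on the test questions and the solver’s current capability, forming a structured curriculum, while the solver updates itself using feedback computed from self-consistency rewards on both the original test questions and the synthetic ones.

Details

Motivation: Existing test-time training methods have two main limitations on difficult reasoning problems: raw test questions are often too hard to yield high-quality pseudo-labels, and the limited size of test sets makes continuous online updates unstable. TTCS addresses both through co-evolving curriculum synthesis.

Result: Experiments show that TTCS consistently strengthens reasoning on challenging mathematical reasoning benchmarks and transfers to general-domain tasks across different LLM backbones, demonstrating its scalability.

Insight: The core contribution is a co-evolving curriculum synthesis framework that combines test-time training with curriculum learning. The question synthesizer and reasoning solver interact iteratively at test time and guide each other, dynamically generating structured, harder question variants matched to the model’s current capability. This stabilizes and improves test-time training and offers a scalable path to model self-evolution.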

Abstract: Test-Time Training offers a promising way to improve the reasoning ability of large language models (LLMs) by adapting the model using only the test questions. However, existing methods struggle with difficult reasoning problems for two reasons: raw test questions are often too difficult to yield high-quality pseudo-labels, and the limited size of test sets makes continuous online updates prone to instability. To address these limitations, we propose TTCS, a co-evolving test-time training framework. Specifically, TTCS initializes two policies from the same pretrained model: a question synthesizer and a reasoning solver. These policies evolve through iterative optimization: the synthesizer generates progressively challenging question variants conditioned on the test questions, creating a structured curriculum tailored to the solver’s current capability, while the solver updates itself using self-consistency rewards computed from multiple sampled responses on both original test and synthetic questions. Crucially, the solver’s feedback guides the synthesizer to generate questions aligned with the model’s current capability, and the generated question variants in turn stabilize the solver’s test-time training. Experiments show that TTCS consistently strengthens the reasoning ability on challenging mathematical benchmarks and transfers to general-domain tasks across different LLM backbones, highlighting a scalable path towards dynamically constructing test-time curricula for self-evolving. Our code and implementation details are available at https://github.com/XMUDeepLIT/TTCS.
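The self-consistency reward mentioned in the abstract can be sketched as majority voting over several sampled answers to the same question. The function below is a minimal illustration under that assumption; the name `self_consistency_rewards` and the 0/1 reward values are choices made here, and the exact reward used in the TTCS repository may differ.

```python
from collections import Counter


def self_consistency_rewards(answers):
    """Use the majority answer as a pseudo-label and reward samples that agree with it."""
    majority, _ = Counter(answers).most_common(1)[0]
    return majority, [1.0 if a == majority else 0.0 for a in answers]


# Four sampled final answers to one question (hypothetical values).
pseudo_label, rewards = self_consistency_rewards(["42", "42", "41", "42"])
```

When the question is too hard, the samples rarely agree and the majority label is unreliable, which is the motivation the abstract gives for synthesizing easier variants first.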


[88] Attention Isn’t All You Need for Emotion Recognition: Domain Features Outperform Transformers on the EAV Dataset cs.LG | cs.CV | cs.SD | eess.AS

Anmol Guragain

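TL;DR: This paper presents a systematic study of multimodal emotion recognition on the EAV dataset, finding that complex attention mechanisms perform poorly on small datasets while simple domain-feature improvements yield clear gains.

Details

Motivation: The study asks whether, on small datasets, complex attention mechanisms are more effective than simple domain-feature improvements, with the aim of addressing overfitting and performance bottlenecks in multimodal emotion recognition.

Result: Factorized attention mechanisms score 5 to 13 percentage points below baselines. Domain-feature improvements help: adding delta MFCCs to the audio CNN raises accuracy from 61.9% to 65.56%, EEG frequency-domain features reach 67.62%, and the vision transformer baseline reaches 75.30% with domain-specific pretraining, exceeding the original paper’s ViViT result.

Insight: The contribution is the emphasis that domain knowledge and simple feature engineering outperform complex architecture design on small datasets, which gives practical guidance for emotion recognition in resource-limited settings.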

Abstract: We present a systematic study of multimodal emotion recognition using the EAV dataset, investigating whether complex attention mechanisms improve performance on small datasets. We implement three model categories: baseline transformers (M1), novel factorized attention mechanisms (M2), and improved CNN baselines (M3). Our experiments show that sophisticated attention mechanisms consistently underperform on small datasets. M2 models achieved 5 to 13 percentage points below baselines due to overfitting and destruction of pretrained features. In contrast, simple domain-appropriate modifications proved effective: adding delta MFCCs to the audio CNN improved accuracy from 61.9% to 65.56% (+3.66pp), while frequency-domain features for EEG achieved 67.62% (+7.62pp over the paper baseline). Our vision transformer baseline (M1) reached 75.30%, exceeding the paper’s ViViT result (74.5%) through domain-specific pretraining, and vision delta features achieved 72.68% (+1.28pp over the paper CNN). These findings demonstrate that for small-scale emotion recognition, domain knowledge and proper implementation outperform architectural complexity.
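Delta features of the kind mentioned above are usually computed with the standard regression formula over neighbouring frames, d_t = Σ_n n·(c_{t+n} − c_{t−n}) / (2·Σ_n n²). The sketch below applies that formula in plain Python; the window `width=2`, the edge padding by repetition, and the toy input are assumptions, and the paper’s own feature extraction may use different settings.

```python
def delta(frames, width=2):
    """First-order delta features; frames is a list of per-frame coefficient lists."""
    num_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(num_frames):
        row = []
        for c in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                nxt = frames[min(t + n, num_frames - 1)][c]  # repeat last frame at the edge
                prv = frames[max(t - n, 0)][c]               # repeat first frame at the edge
                acc += n * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


def add_deltas(frames, width=2):
    """Concatenate each frame's coefficients with its delta coefficients."""
    return [f + d for f, d in zip(frames, delta(frames, width))]


# One coefficient per frame, rising linearly: the interior delta should be the slope, 1.0.
ramp = [[0.0], [1.0], [2.0], [3.0], [4.0]]
deltas = delta(ramp)
stacked = add_deltas(ramp)
```

Stacking the deltas doubles the feature dimension per frame and gives the CNN explicit information about how the spectrum is changing over time.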


[89] Learnable Permutation for Structured Sparsity on Transformer Models cs.LG | cs.CL

Zekai Li, Ji Liu, Guanchen Li, Yixing Xu, Ziqiong Liu

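TL;DR: This paper proposes a novel end-to-end learnable permutation framework to improve structured sparsity (pruning) of Transformer models. It introduces a learnable permutation cost matrix, a differentiable bipartite matching solver, and a sparsity optimization loss to optimize the permutation of weight matrices directly, rearranging weights into patterns that are easier to prune.

Details

Motivation: Structured sparsity is a popular pruning technique, but on large architectures such as Transformers the permutation search space grows exponentially with model scale, so most existing methods rely on greedy or heuristic algorithms, which limits how effective weight reordering can be. The paper addresses how to find the optimal weight permutation efficiently and learnably so as to improve post-pruning performance.

Result: The method is validated extensively on vision and language Transformers and achieves state-of-the-art permutation results for structured sparsity.

Insight: The contribution is to formulate weight permutation as an end-to-end learnable optimization: a differentiable bipartite matching solver lets gradients propagate back through the permutation matrix, so the sparsity objective is optimized directly and the limits of traditional heuristic search are avoided.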

Abstract: Structured sparsity has emerged as a popular model pruning technique, widely adopted in various architectures, including CNNs, Transformer models, and especially large language models (LLMs) in recent years. A promising direction to further improve post-pruning performance is weight permutation, which reorders model weights into patterns more amenable to pruning. However, the exponential growth of the permutation search space with the scale of Transformer architectures forces most methods to rely on greedy or heuristic algorithms, limiting the effectiveness of reordering. In this work, we propose a novel end-to-end learnable permutation framework. Our method introduces a learnable permutation cost matrix to quantify the cost of swapping any two input channels of a given weight matrix, a differentiable bipartite matching solver to obtain the optimal binary permutation matrix given a cost matrix, and a sparsity optimization loss function to directly optimize the permutation operator. We extensively validate our approach on vision and language Transformers, demonstrating that our method achieves state-of-the-art permutation results for structured sparsity.
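A small worked example shows why reordering channels can help before structured pruning. The sketch assumes an N:M pattern (keep the 2 largest-magnitude weights in every group of 4), which the abstract does not specify, and uses a hand-picked permutation rather than a learned one; `retained_magnitude` and `permute` are illustrative helpers.

```python
def retained_magnitude(row, n=2, m=4):
    """Total |weight| kept when only the n largest entries in each group of m survive."""
    total = 0.0
    for start in range(0, len(row), m):
        group = sorted((abs(w) for w in row[start:start + m]), reverse=True)
        total += sum(group[:n])
    return total


def permute(row, perm):
    """Reorder input channels: position i of the result takes channel perm[i]."""
    return [row[j] for j in perm]


# All four large weights share a group, so two of them would be pruned.
row = [9, 8, 7, 6, 1, 1, 1, 1]
before = retained_magnitude(row)

# Spread the large weights across both groups (hand-picked permutation).
perm = [0, 1, 4, 5, 2, 3, 6, 7]
after = retained_magnitude(permute(row, perm))
```

For a full layer the same permutation must be applied to the matching input activations (or the previous layer’s output channels) so the network’s function is unchanged; searching over such permutations is the combinatorial problem the paper replaces with a learned, differentiable matching.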


[90] Mem-T: Densifying Rewards for Long-Horizon Memory Agents cs.LG | cs.CL

Yanwei Yue, Guibin Zhang, Boci Peng, Xuanbo Fan, Jiaxin Guo

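TL;DR: This paper proposes Mem-T, an autonomous memory agent that uses a lightweight hierarchical memory database to perform dynamic updates and multi-turn retrieval over streaming inputs. To address sparse rewards in long-horizon memory management, the authors further propose MoT-GRPO, a reinforcement learning framework that turns sparse terminal feedback into dense, step-wise supervision via memory operation tree backpropagation and hindsight credit assignment, jointly optimizing memory construction and retrieval. Experiments show that Mem-T outperforms frameworks such as A-Mem and Mem0, sits on a favorable accuracy-efficiency Pareto frontier, and substantially reduces the tokens needed at inference.

Details

Motivation: Training paradigms for existing memory agents are constrained by sparse, delayed rewards over long-horizon sequences of memory operations, which hinders truly end-to-end optimization of memory management policies.

Result: Mem-T surpasses the A-Mem and Mem0 frameworks by up to 14.92%, performs well on the accuracy-efficiency Pareto frontier, and reduces inference tokens per query by about 24.45% relative to GAM without sacrificing performance.

Insight: The contribution is the MoT-GRPO framework, which densifies sparse rewards through reinforcement learning guided by a memory operation tree and jointly optimizes memory construction and retrieval. From an objective standpoint, the combination of a lightweight hierarchical memory database and tree-structured supervision addresses the credit assignment problem in long-horizon tasks and improves training efficiency and agent autonomy.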

Abstract: Memory agents, which depart from predefined memory-processing pipelines by endogenously managing the processing, storage, and retrieval of memories, have garnered increasing attention for their autonomy and adaptability. However, existing training paradigms remain constrained: agents often traverse long-horizon sequences of memory operations before receiving sparse and delayed rewards, which hinders truly end-to-end optimization of memory management policies. To address this limitation, we introduce Mem-T, an autonomous memory agent that interfaces with a lightweight hierarchical memory database to perform dynamic updates and multi-turn retrieval over streaming inputs. To effectively train long-horizon memory management capabilities, we further propose MoT-GRPO, a tree-guided reinforcement learning framework that transforms sparse terminal feedback into dense, step-wise supervision via memory operation tree backpropagation and hindsight credit assignment, thereby enabling the joint optimization of memory construction and retrieval. Extensive experiments demonstrate that Mem-T is (1) high-performing, surpassing frameworks such as A-Mem and Mem0 by up to 14.92%, and (2) economical, operating on a favorable accuracy-efficiency Pareto frontier and reducing inference tokens per query by ~24.45% relative to GAM without sacrificing performance.


[91] Vision-Language Models Unlock Task-Centric Latent Actions cs.LG | cs.AI | cs.CV

Alexander Nikulin, Ilya Zisman, Albina Klepach, Denis Tarasov, Alexander Derevyagin

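TL;DR: This paper proposes using the common-sense reasoning of vision-language models (VLMs) to provide promptable representations for latent action models (LAMs), separating controllable changes from noise in an unsupervised way. This addresses the tendency of LAMs to encode noise rather than meaningful latent actions when action-correlated distractors are present.

Details

Motivation: The aim is to fix the performance drop of latent action models when observations contain action-correlated distractors, drawing on the human ability to tell task-relevant motion from irrelevant detail given only a brief task description.

Result: On the Distracting MetaWorld benchmark, the method raises downstream success rates by up to six times. An evaluation of a range of popular VLMs finds substantial variation in the quality of their promptable representations and in their robustness to prompts and hyperparameters.

Insight: The contribution is to use VLM common-sense reasoning to generate promptable representations that guide LAM training. The paper also finds that simply prompting VLMs to ignore distractors substantially improves latent action quality, and reports the counterintuitive observation that more recent VLMs may perform worse than older ones.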

Abstract: Latent Action Models (LAMs) have rapidly gained traction as an important component in the pre-training pipelines of leading Vision-Language-Action models. However, they fail when observations contain action-correlated distractors, often encoding noise instead of meaningful latent actions. Humans, on the other hand, can effortlessly distinguish task-relevant motions from irrelevant details in any video given only a brief task description. In this work, we propose to utilize the common-sense reasoning abilities of Vision-Language Models (VLMs) to provide promptable representations, effectively separating controllable changes from the noise in an unsupervised way. We use these representations as targets during LAM training and benchmark a wide variety of popular VLMs, revealing substantial variation in the quality of promptable representations as well as their robustness to different prompts and hyperparameters. Interestingly, we find that more recent VLMs may perform worse than older ones. Finally, we show that simply asking VLMs to ignore distractors can substantially improve latent action quality, yielding up to a six-fold increase in downstream success rates on Distracting MetaWorld.


[92] Decomposing and Composing: Towards Efficient Vision-Language Continual Learning via Rank-1 Expert Pool in a Single LoRA cs.LG | cs.CV

Zhan Fa, Yue Duan, Jian Zhang, Lei Qi, Wanqi Yang

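TL;DR: This paper proposes a novel vision-language continual learning framework that restructures a single LoRA module as a decomposable pool of rank-1 experts, enabling efficient task adaptation while mitigating catastrophic forgetting. The method dynamically composes a sparse, task-specific update guided by the semantics of the [CLS] token, and introduces an Activation-Guided Orthogonal loss that orthogonalizes critical parts of the LoRA weights across tasks, reducing parameter updates and inter-task interference.

Details

Motivation: The paper addresses the tension in continual learning for vision-language models between improving task adaptation and mitigating catastrophic forgetting, while avoiding the heavy inference burden or reliance on external knowledge of existing methods.

Result: Extensive experiments across multiple settings show state-of-the-art results on all metrics, surpassing the zero-shot upper bound in generalization, with 96.7% fewer trainable parameters than the baseline method and no reliance on external datasets or task-ID discriminators.

Insight: The core contribution is restructuring a single LoRA module as a rank-1 expert pool for dynamic sparse composition, combined with an Activation-Guided Orthogonal loss that orthogonalizes weights. This gives parameter-efficient, domain-aware learning with no added inference latency, and a lightweight, effective approach to continual learning.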

Abstract: Continual learning (CL) in vision-language models (VLMs) faces significant challenges in improving task adaptation and avoiding catastrophic forgetting. Existing methods usually impose a heavy inference burden or rely on external knowledge, while Low-Rank Adaptation (LoRA) has shown potential in reducing these issues by enabling parameter-efficient tuning. However, because directly using LoRA to alleviate catastrophic forgetting is non-trivial, we introduce a novel framework that restructures a single LoRA module as a decomposable Rank-1 Expert Pool. Our method learns to dynamically compose a sparse, task-specific update by selecting from this expert pool, guided by the semantics of the [CLS] token. In addition, we propose an Activation-Guided Orthogonal (AGO) loss that orthogonalizes critical parts of LoRA weights across tasks. This sparse composition and orthogonalization enable fewer parameter updates, resulting in domain-aware learning while minimizing inter-task interference and maintaining downstream task performance. Extensive experiments across multiple settings demonstrate state-of-the-art results in all metrics, surpassing zero-shot upper bounds in generalization. Notably, it reduces trainable parameters by 96.7% compared to the baseline method, eliminating reliance on external datasets or task-ID discriminators. The merged LoRAs retain fewer weights and incur no inference latency, making our method computationally lightweight.
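A rank-r LoRA update can be written as a sum of r rank-1 terms, ΔW = Σ_i b_i a_iᵀ, so choosing a subset of those terms gives a sparse composition. The sketch below illustrates only that decomposition: the `gate_scores` stand in for whatever the paper derives from the [CLS] token, the top-k selection rule is an assumption, and the AGO loss is not modelled.

```python
def outer(b, a):
    """Rank-1 matrix b a^T as a nested list."""
    return [[bi * aj for aj in a] for bi in b]


def compose_update(experts, gate_scores, top_k):
    """Sum the rank-1 experts with the top_k highest gate scores into one weight update."""
    chosen = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)[:top_k]
    rows, cols = len(experts[0][0]), len(experts[0][1])
    update = [[0.0] * cols for _ in range(rows)]
    for i in chosen:
        b, a = experts[i]
        for r, row in enumerate(outer(b, a)):
            for c, value in enumerate(row):
                update[r][c] += value
    return sorted(chosen), update


# Three rank-1 experts (b_i, a_i) for a 2x2 weight matrix; values are hypothetical.
experts = [([1, 0], [1, 1]), ([0, 1], [2, 0]), ([1, 1], [0, 3])]
gate_scores = [0.1, 0.7, 0.9]  # hypothetical scores derived from an input feature
selected, delta_w = compose_update(experts, gate_scores, top_k=2)
```

Because the selected terms sum to an ordinary matrix, the result can be merged into the base weight, which is consistent with the abstract’s claim of no added inference latency.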


cs.AI

[93] JAF: Judge Agent Forest cs.AI | cs.CL | cs.LG

Sahil Garg, Brad Cheezum, Sridhar Dutta, Vishal Agarwal

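TL;DR: This paper proposes the JAF (Judge Agent Forest) framework, in which the judge agent performs joint inference over a cohort of query-response pairs generated by a primary agent rather than evaluating each response in isolation, improving the judge’s capacity for holistic learning. The method combines belief propagation and ensemble-learning principles, selects diverse exemplars with a locality-sensitive hashing algorithm, and is validated on cloud misconfiguration triage.

Details

Motivation: Existing judge agents typically evaluate each response independently and lack a holistic view of cross-instance patterns and inconsistencies, which limits the primary agent’s ability to refine itself through collective feedback.

Result: An empirical study on the demanding task of cloud misconfiguration triage in large-scale cloud environments validates the effectiveness of the JAF framework.

Insight: The contribution is to elevate the judge agent from a local evaluator to a holistic learner that identifies cross-instance patterns through joint inference. The paper also designs a flexible locality-sensitive hashing algorithm that integrates semantic embeddings, LLM-driven hash predicates, supervision from categorical labels, and side information, enabling efficient, interpretable, and relation-aware exemplar selection.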

Abstract: Judge agents are fundamental to agentic AI frameworks: they provide automated evaluation, and enable iterative self-refinement of reasoning processes. We introduce JAF: Judge Agent Forest, a framework in which the judge agent conducts joint inference across a cohort of query–response pairs generated by a primary agent, rather than evaluating each in isolation. This paradigm elevates the judge from a local evaluator to a holistic learner: by simultaneously assessing related responses, the judge discerns cross-instance patterns and inconsistencies, whose aggregate feedback enables the primary agent to improve by viewing its own outputs through the judge’s collective perspective. Conceptually, JAF bridges belief propagation and ensemble-learning principles: overlapping in-context neighborhoods induce a knowledge-graph structure that facilitates propagation of critique, and repeated, randomized evaluations yield a robust ensemble of context-sensitive judgments. JAF can be instantiated entirely via ICL, with the judge prompted for each query using its associated primary-agent response plus a small, possibly noisy set of peer exemplars. While kNN in embedding space is a natural starting point for exemplars, this approach overlooks categorical structure, domain metadata, or nuanced distinctions accessible to modern LLMs. To overcome these limitations, we develop a flexible locality-sensitive hashing (LSH) algorithm that learns informative binary codes by integrating semantic embeddings, LLM-driven hash predicates, supervision from categorical labels, and relevant side information. These hash codes support efficient, interpretable, and relation-aware selection of diverse exemplars, and further optimize exploration of CoT reasoning paths. We validate JAF with an empirical study on the demanding task of cloud misconfigs triage in large-scale cloud environments.


[94] Why Reasoning Fails to Plan: A Planning-Centric Analysis of Long-Horizon Decision Making in LLM Agents cs.AI | cs.CL | cs.LG

Zehong Wang, Fang Wu, Hongru Wang, Xiangru Tang, Bolian Li

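TL;DR: This paper analyzes why LLM-based agents fail at long-horizon decision making, arguing that step-by-step reasoning induces a myopic greedy policy that breaks down on complex tasks requiring long-range planning. It proposes FLARE, a future-aware planning mechanism that combines lookahead, value propagation, and limited commitment to improve long-horizon decision making.

Details

Motivation: LLM agents reason well over short horizons but behave incoherently and fail easily on long-horizon planning tasks. The core motivation is to expose the fundamental mismatch between step-by-step reasoning and effective planning.

Result: Across multiple benchmarks, agent frameworks, and LLM backbones, FLARE consistently improves task performance and planning-level behavior. For example, LLaMA-8B with FLARE frequently outperforms GPT-4o with standard step-by-step reasoning.

Insight: The core contribution is a systematic analysis of the limits of LLM reasoning from a planning-centric perspective, together with FLARE, a lightweight future-aware planning mechanism. The takeaway is the explicit distinction between “reasoning” and “planning”: explicit lookahead and value propagation let early decisions account for downstream consequences, which mitigates myopic commitment.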

Abstract: Large language model (LLM)-based agents exhibit strong step-by-step reasoning capabilities over short horizons, yet often fail to sustain coherent behavior over long planning horizons. We argue that this failure reflects a fundamental mismatch: step-wise reasoning induces a form of step-wise greedy policy that is adequate for short horizons but fails in long-horizon planning, where early actions must account for delayed consequences. From this planning-centric perspective, we study LLM-based agents in deterministic, fully structured environments with explicit state transitions and evaluation signals. Our analysis reveals a core failure mode of reasoning-based policies: locally optimal choices induced by step-wise scoring lead to early myopic commitments that are systematically amplified over time and difficult to recover from. We introduce FLARE (Future-aware Lookahead with Reward Estimation) as a minimal instantiation of future-aware planning to enforce explicit lookahead, value propagation, and limited commitment in a single model, allowing downstream outcomes to influence early decisions. Across multiple benchmarks, agent frameworks, and LLM backbones, FLARE consistently improves task performance and planning-level behavior, frequently allowing LLaMA-8B with FLARE to outperform GPT-4o with standard step-by-step reasoning. These results establish a clear distinction between reasoning and planning.
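The failure mode described here, a locally best first step that forecloses a better outcome later, shows up even in a four-state deterministic environment. The sketch below contrasts a step-wise greedy policy with a depth-limited lookahead that commits to one action at a time and then replans. It is a toy illustration, not FLARE: the `TREE` environment, its rewards, and the function names are invented for the example.

```python
# Hypothetical deterministic environment: state -> {action: (next_state, reward)}.
TREE = {
    "start": {"a": ("A", 5), "b": ("B", 1)},
    "A": {"x": ("end", 0)},
    "B": {"y": ("end", 10)},
    "end": {},
}


def greedy(state):
    """Step-wise greedy policy: always take the action with the best immediate reward."""
    total = 0
    while TREE[state]:
        action = max(TREE[state], key=lambda a: TREE[state][a][1])
        state, reward = TREE[state][action]
        total += reward
    return total


def value(state, depth):
    """Best achievable return from state within depth further steps."""
    if depth == 0 or not TREE[state]:
        return 0
    return max(r + value(nxt, depth - 1) for nxt, r in TREE[state].values())


def lookahead(state, depth=2):
    """Score each action by reward plus propagated future value, commit to one step, replan."""
    total = 0
    while TREE[state]:
        action = max(
            TREE[state],
            key=lambda a: TREE[state][a][1] + value(TREE[state][a][0], depth - 1),
        )
        state, reward = TREE[state][action]
        total += reward
    return total


greedy_return = greedy("start")
lookahead_return = lookahead("start")
```

The greedy policy takes the reward of 5 and ends with a return of 5, while two-step lookahead accepts a reward of 1 first and finishes with 11.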


[95] From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents cs.AI | cs.CL

Jiaxuan Gao, Jiaao Chen, Chuyi He, Wei-Chen Wang, Shusheng Xu

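TL;DR: This paper proposes a unified framework, EigenData, for training multi-turn interactive tool-using agents. It combines self-evolving synthetic data generation with verifier-based reinforcement learning: a hierarchical multi-agent engine synthesizes tool-grounded dialogues together with executable checkers, and a closed-loop self-evolving process improves generation reliability. On top of this data, the authors develop an RL recipe that fine-tunes the user model and applies GRPO-style training, achieving strong results on the tau^2-bench benchmark.

Details

Motivation: Training multi-turn interactive tool-using agents faces two challenges: synthesis of high-quality multi-turn tool-use data is hard to scale, and reinforcement learning can suffer degraded training efficiency from noisy signals caused by user simulation. The paper aims to address both and provide a scalable training path for complex tool-using behavior without expensive human annotation.

Result: On tau^2-bench, the best model reaches 73.0% pass^1 on Airline and 98.3% pass^1 on Telecom, matching or exceeding frontier models.

Insight: The contributions are: 1) a self-evolving synthetic data generation framework that improves data quality and reliability through a hierarchical multi-agent engine and a closed-loop self-evolving process; 2) a verifier-based RL recipe that combines user-model fine-tuning with GRPO-style training, using trajectory-level group-relative advantages and dynamic filtering for stable improvement. Together they offer a scalable way to train complex tool-using behavior without human annotation.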

Abstract: Interactive tool-using agents must solve real-world tasks via multi-turn interaction with both humans and external environments, requiring dialogue state tracking and multi-step tool execution while following complex instructions. Post-training such agents is challenging because synthesis for high-quality multi-turn tool-use data is difficult to scale, and reinforcement learning (RL) could face noisy signals caused by user simulation, leading to degraded training efficiency. We propose a unified framework that combines a self-evolving data agent with verifier-based RL. Our system, EigenData, is a hierarchical multi-agent engine that synthesizes tool-grounded dialogues together with executable per-instance checkers, and improves generation reliability via a closed-loop self-evolving process that updates prompts and workflow. Building on the synthetic data, we develop an RL recipe that first fine-tunes the user model and then applies GRPO-style training with trajectory-level group-relative advantages and dynamic filtering, yielding consistent improvements beyond SFT. Evaluated on tau^2-bench, our best model reaches 73.0% pass^1 on Airline and 98.3% pass^1 on Telecom, matching or exceeding frontier models. Overall, our results suggest a scalable pathway for bootstrapping complex tool-using behaviors without expensive human annotation.
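Trajectory-level group-relative advantages with dynamic filtering can be sketched as follows: each task gets a group of rollouts scored by its checker, groups where every rollout has the same reward are dropped, and the rest are standardized within the group. This is a generic GRPO-style illustration under those assumptions; the function name, the task ids, and the 0/1 rewards are hypothetical, and the paper’s exact filtering rule may differ.

```python
import statistics


def filter_and_normalize(groups):
    """Map task id -> per-trajectory advantages, skipping groups with no reward variation."""
    kept = {}
    for task_id, rewards in groups.items():
        if len(set(rewards)) < 2:
            continue  # dynamic filtering: identical rewards carry no learning signal
        mean = statistics.fmean(rewards)
        std = statistics.pstdev(rewards)
        kept[task_id] = [(r - mean) / std for r in rewards]
    return kept


# Checker rewards for four rollouts on each of three tasks (hypothetical values).
groups = {
    "t1": [1, 1, 1, 1],  # always passes: filtered out
    "t2": [1, 0, 0, 1],  # mixed outcomes: kept
    "t3": [0, 0, 0, 0],  # always fails: filtered out
}
advantages = filter_and_normalize(groups)
```

Filtering first also avoids dividing by a zero standard deviation, since only groups with at least two distinct rewards reach the normalization step.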


[96] Scaling Multiagent Systems with Process Rewards cs.AI | cs.CL | cs.ET | cs.MA

Ed Li, Junyu Ren, Cat Yan

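TL;DR: This paper proposes MAPPA, which fine-tunes multiagent systems with per-action process rewards from AI feedback to address credit assignment and sample efficiency in multiagent coordination. The method is validated on competition math and tool-augmented data analysis tasks, where it improves performance significantly.

Details

Motivation: Fine-tuning multiagent systems faces two key challenges: credit assignment across agents, and the poor sample efficiency of expensive multiagent rollouts.

Result: On unseen math problems, MAPPA gains 5.0 to 17.5 percentage points on AIME and 7.8 to 17.2 percentage points on AMC. On data analysis tasks, success rate improves by 12.5 percentage points and quality metrics improve by up to 30%.

Insight: The contribution is fine-grained supervision through per-action process rewards, rather than assigning reward only at task completion. This extracts more training signal from each rollout and reduces reliance on human annotation, suggesting a route to scaling multiagent systems for complex, long-horizon tasks.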

Abstract: While multiagent systems have shown promise for tackling complex tasks via specialization, finetuning multiple agents simultaneously faces two key challenges: (1) credit assignment across agents, and (2) sample efficiency of expensive multiagent rollouts. In this work, we propose finetuning multiagent systems with per-action process rewards from AI feedback (MAPPA) to address both. Through assigning credit to individual agent actions rather than only at task completion, MAPPA enables fine-grained supervision without ground truth labels while extracting maximal training signal from each rollout. We demonstrate our approach on competition math problems and tool-augmented data analysis tasks. On unseen math problems, MAPPA achieves +5.0–17.5pp on AIME and +7.8–17.2pp on AMC. For data analysis tasks, our method improves success rate by +12.5pp while quality metrics improve by up to 30%, validating that per-action supervision can lead to improvements across different multiagent systems in various domains. By addressing these challenges, our work takes a first step toward scaling multiagent systems for complex, long-horizon tasks with minimal human supervision.


q-fin.GN

[97] UniFinEval: Towards Unified Evaluation of Financial Multimodal Models across Text, Images and Videos q-fin.GN | cs.AI | cs.CL

Zhi Yang, Lingfeng Zeng, Fangqi Lou, Qi Qi, Wei Zhang

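TL;DR: This paper proposes UniFinEval, the first unified multimodal benchmark designed for high-information-density financial environments, covering text, images, and videos. It builds five core scenarios grounded in real-world financial systems and a high-quality dataset of 3,767 question-answer pairs in Chinese and English for evaluating mainstream multimodal large language models.

Details

Motivation: Existing multimodal benchmarks cannot adequately evaluate the challenges MLLMs face in finance, such as multimodal, high-density information and cross-modal multi-hop reasoning.

Result: Ten mainstream MLLMs were evaluated under Zero-Shot and CoT settings. Gemini-3-pro-preview achieves the best overall performance but still shows a substantial gap compared with financial experts.

Insight: The contribution is the first benchmark that evaluates financial multimodal models across text, images, and videos in a unified way. It systematically exposes the deficiencies of current models in fine-grained, high-information-density financial environments, which should help improve the robustness of MLLMs in real financial scenarios.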

Abstract: Multimodal large language models are playing an increasingly significant role in empowering the financial domain; however, the challenges they face, such as multimodal and high-density information and cross-modal multi-hop reasoning, go beyond the evaluation scope of existing multimodal benchmarks. To address this gap, we propose UniFinEval, the first unified multimodal benchmark designed for high-information-density financial environments, covering text, images, and videos. UniFinEval systematically constructs five core financial scenarios grounded in real-world financial systems: Financial Statement Auditing, Company Fundamental Reasoning, Industry Trend Insights, Financial Risk Sensing, and Asset Allocation Analysis. We manually construct a high-quality dataset consisting of 3,767 question-answer pairs in both Chinese and English and systematically evaluate 10 mainstream MLLMs under Zero-Shot and CoT settings. Results show that Gemini-3-pro-preview achieves the best overall performance, yet still exhibits a substantial gap compared to financial experts. Further error analysis reveals systematic deficiencies in current models. UniFinEval aims to provide a systematic assessment of MLLMs’ capabilities in fine-grained, high-information-density financial environments, thereby enhancing the robustness of MLLM applications in real-world financial scenarios. Data and code are available at https://github.com/aifinlab/UniFinEval.


eess.IV

[98] SCENE: Semantic-aware Codec Enhancement with Neural Embeddings eess.IV | cs.CV | cs.LG | cs.MM

Han-Yu Lin, Li-Wei Chen, Hung-Shin Lee

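TL;DR: This paper proposes SCENE, a lightweight, semantic-aware pre-processing framework that improves perceptual quality by selectively addressing the compression artifacts of standard video codecs. It integrates semantic embeddings from a vision-language model into an efficient convolutional architecture that prioritizes perceptually important structures, and is trained end-to-end with a differentiable codec proxy so that it can mitigate artifacts from multiple standard codecs without modifying the existing video pipeline. At inference, SCENE runs as a standalone pre-processor with real-time performance.

Details

Motivation: Compression artifacts from standard video codecs degrade perceptual quality, so a lightweight method that handles these distortions effectively is needed to improve the perceptual fidelity of compressed video streams.

Result: On high-resolution benchmarks, the method outperforms baselines on both objective (MS-SSIM) and perceptual (VMAF) metrics, with notable gains in preserving detailed textures within salient regions.

Insight: The contribution is combining semantic embeddings from a vision-language model with an efficient convolutional architecture and training end-to-end with a differentiable codec proxy. The result is semantic-guided, codec-aware pre-processing that enhances compressed video quality in real time without changing the existing video pipeline.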

Abstract: Compression artifacts from standard video codecs often degrade perceptual quality. We propose a lightweight, semantic-aware pre-processing framework that enhances perceptual fidelity by selectively addressing these distortions. Our method integrates semantic embeddings from a vision-language model into an efficient convolutional architecture, prioritizing the preservation of perceptually significant structures. The model is trained end-to-end with a differentiable codec proxy, enabling it to mitigate artifacts from various standard codecs without modifying the existing video pipeline. During inference, the codec proxy is discarded, and SCENE operates as a standalone pre-processor, enabling real-time performance. Experiments on high-resolution benchmarks show improved performance over baselines in both objective (MS-SSIM) and perceptual (VMAF) metrics, with notable gains in preserving detailed textures within salient regions. Our results show that semantic-guided, codec-aware pre-processing is an effective approach for enhancing compressed video streams.


[99] A Survey on Semantic Communication for Vision: Categories, Frameworks, Enabling Techniques, and Applications eess.IV | cs.CV

Runze Cheng, Yao Sun, Ahmad Taha, Xuesong Liu, David Flynn

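TL;DR: This paper is a systematic survey of semantic communication for visual data (SemCom-Vision). Through an interdisciplinary analysis spanning computer vision and communication engineering, it provides comprehensive guidance for machine-learning-empowered semantic communication design. It first covers the basics and key concepts of semantic communication, then classifies existing approaches, based on communication goals interpreted through semantic quantization schemes, into semantic preservation communication (SPC), semantic expansion communication (SEC), and semantic refinement communication (SRC). For each category it then presents ML-based encoder-decoder models, training algorithms, and knowledge structure and utilization strategies, and it closes with potential applications.

Details

Motivation: Semantic communication is a transformative paradigm that shifts from transmitting raw data to transmitting meaningful content, relieving pressure on communication resources. It faces challenges, however, in accurate semantic quantization of visual data, robust semantic extraction and reconstruction across diverse tasks, transceiver coordination with effective knowledge utilization, and adaptation to unpredictable wireless environments.

Result: As a survey, the paper reports no quantitative experimental results. It systematically organizes the classification framework, techniques, and application prospects of SemCom-Vision, giving the field a structured guide.

Insight: The contributions include a new classification perspective based on semantic quantization schemes (SPC, SEC, SRC) and an interdisciplinary integration of computer vision and communication engineering that emphasizes the enabling role of machine learning in semantic communication design, providing a systematic framework and directions for future research.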

Abstract: Semantic communication (SemCom) emerges as a transformative paradigm for traffic-intensive visual data transmission, shifting focus from raw data to meaningful content transmission and relieving the increasing pressure on communication resources. However, to achieve SemCom, challenges are faced in accurate semantic quantization for visual data, robust semantic extraction and reconstruction under diverse tasks and goals, transceiver coordination with effective knowledge utilization, and adaptation to unpredictable wireless communication environments. In this paper, we present a systematic review of SemCom for visual data transmission (SemCom-Vision), wherein an interdisciplinary analysis integrating computer vision (CV) and communication engineering is conducted to provide comprehensive guidelines for the machine learning (ML)-empowered SemCom-Vision design. Specifically, this survey first elucidates the basics and key concepts of SemCom. Then, we introduce a novel classification perspective to categorize existing SemCom-Vision approaches as semantic preservation communication (SPC), semantic expansion communication (SEC), and semantic refinement communication (SRC) based on communication goals interpreted through semantic quantization schemes. Moreover, this survey articulates the ML-based encoder-decoder models and training algorithms for each SemCom-Vision category, followed by knowledge structure and utilization strategies. Finally, we discuss potential SemCom-Vision applications.


[100] Vision-Language Controlled Deep Unfolding for Joint Medical Image Restoration and Segmentation eess.IV | cs.CV

Ping Chen, Zicheng Huang, Xiangming Wang, Yungeng Liu, Bingyu Liang

TL;DR: 本文提出VL-DUN框架,用于联合医学图像恢复与分割任务。该框架通过统一的优化问题将两个任务数学耦合,并引入频率感知Mamba机制来建模长程依赖,以线性复杂度同时处理全局分割和局部纹理恢复。在多模态基准测试中,该方法在PSNR和Dice系数上均取得显著提升,实现了新的SOTA性能。

Details

Motivation: 解决传统流水线中医学图像恢复(低层信号恢复)与分割(高层语义理解)任务孤立处理导致的次优问题,利用两者之间的协同效应进行联合优化。

Result: On multi-modal benchmarks, PSNR improves by 0.92 dB and the Dice coefficient by 9.76%, a new state of the art.

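Insight: The contributions are: 1) formulating the joint task as a unified optimization problem and deriving an interpretable joint unfolding mechanism in which the tasks are mathematically coupled and refine each other; 2) introducing a frequency-aware Mamba mechanism that models long-range dependencies with linear complexity while preserving the high-frequency textures needed for restoration, mitigating the spectral bias of standard architectures. Viewed objectively, the study offers an efficient and robust joint learning paradigm for medical image processing.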

Abstract: We propose VL-DUN, a principled framework for joint All-in-One Medical Image Restoration and Segmentation (AiOMIRS) that bridges the gap between low-level signal recovery and high-level semantic understanding. While standard pipelines treat these tasks in isolation, our core insight is that they are fundamentally synergistic: restoration provides clean anatomical structures to improve segmentation, while semantic priors regularize the restoration process. VL-DUN resolves the sub-optimality of sequential processing through two primary innovations. (1) We formulate AiOMIRS as a unified optimization problem, deriving an interpretable joint unfolding mechanism where restoration and segmentation are mathematically coupled for mutual refinement. (2) We introduce a frequency-aware Mamba mechanism to capture long-range dependencies for global segmentation while preserving the high-frequency textures necessary for restoration. This allows for efficient global context modeling with linear complexity, effectively mitigating the spectral bias of standard architectures. As a pioneering work in the AiOMIRS task, VL-DUN establishes a new state-of-the-art across multi-modal benchmarks, improving PSNR by 0.92 dB and the Dice coefficient by 9.76%. Our results demonstrate that joint collaborative learning offers a superior, more robust solution for complex clinical workflows compared to isolated task processing. The code is available at https://github.com/cipi666/VLDUN.
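For readers unfamiliar with the two metrics quoted in this entry, the following is a minimal sketch of how PSNR (restoration quality, in dB) and the Dice coefficient (segmentation overlap) are conventionally computed. It is not taken from the VL-DUN code. The function names `psnr` and `dice`, the flat-list image representation, and the `max_value=1.0` intensity range are all illustrative assumptions.

```python
import math


def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists.

    Assumes intensities lie in [0, max_value]; this range is an assumption
    of the sketch, not something stated in the abstract.
    """
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def dice(pred_mask, true_mask):
    """Dice coefficient between two binary masks given as flat lists of 0/1."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must be the same length")
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    if total == 0:
        return 1.0  # convention chosen here: two empty masks match perfectly
    return 2.0 * intersection / total


if __name__ == "__main__":
    clean = [0.0, 0.0, 0.0, 0.0]
    noisy = [0.1, 0.1, 0.1, 0.1]
    print(f"PSNR: {psnr(clean, noisy):.2f} dB")
    print(f"Dice: {dice([1, 1, 0, 0], [1, 0, 0, 0]):.3f}")
```

Because PSNR is logarithmic, a gain of 0.92 dB corresponds to a reduction in mean squared error of roughly 19% (a factor of 10^(-0.092)), under the assumption that the peak value is unchanged.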


cs.RO [Back]

[101] High-Definition 5MP Stereo Vision Sensing for Robotics cs.RO | cs.CV

Leaf Jiang, Matthew Holzel, Bernhard Kaplan, Hsiou-Yuan Liu, Sabyasachi Paul

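TL;DR: This study proposes a novel high-accuracy, high-speed frame-to-frame calibration and stereo matching method for 5MP+ high-resolution stereo vision systems, aiming to realize the full potential of high-angular-resolution sensors and produce denser, more accurate 3D point clouds.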

Details

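Motivation: High-resolution (5MP+) stereo vision systems in robotics fall short of their potential because conventional methods cannot deliver the calibration accuracy and processing speed they require.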

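Result: The study shows that high-pixel-count cameras yield high-quality point clouds only with high-accuracy calibration. The paper also introduces a new way to evaluate real-time performance: comparing real-time disparity maps against ground-truth disparity maps produced by more computationally intensive algorithms.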

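Insight: The contributions are a frame-to-frame calibration and stereo matching pipeline that balances accuracy and speed, and a new method for evaluating real-time performance. The work underlines how critical high-accuracy calibration is for high-resolution sensors to produce high-quality point clouds.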

Abstract: High-resolution (5MP+) stereo vision systems are essential for advancing robotic capabilities, enabling operation over longer ranges and generating significantly denser and more accurate 3D point clouds. However, realizing the full potential of high-angular-resolution sensors requires a commensurately higher level of calibration accuracy and faster processing – requirements often unmet by conventional methods. This study addresses that critical gap by processing 5MP camera imagery using a novel, advanced frame-to-frame calibration and stereo matching methodology designed to achieve both high accuracy and speed. Furthermore, we introduce a new approach to evaluate real-time performance by comparing real-time disparity maps with ground-truth disparity maps derived from more computationally intensive stereo matching algorithms. Crucially, the research demonstrates that high-pixel-count cameras yield high-quality point clouds only through the implementation of high-accuracy calibration.
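The abstract does not say which statistics are used to compare a real-time disparity map with the reference map, so the sketch below uses two common choices as an assumption: mean absolute disparity error and the fraction of "bad" pixels whose error exceeds a threshold. The names `compare_disparity` and `depth_from_disparity`, the 1-pixel threshold, and the convention that a non-positive reference disparity marks an invalid pixel are all hypothetical. The depth helper uses the standard rectified-stereo relation Z = f * B / d.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth in metres from the standard rectified-stereo relation Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def compare_disparity(realtime, reference, bad_threshold=1.0):
    """Compare a real-time disparity map against a reference map.

    Both maps are flat lists of disparities in pixels. Pixels whose reference
    disparity is <= 0 are treated as invalid and skipped (an assumption of
    this sketch). Returns (mean_absolute_error, bad_pixel_fraction).
    """
    if len(realtime) != len(reference):
        raise ValueError("maps must be the same size")
    errors = [abs(rt - ref) for rt, ref in zip(realtime, reference) if ref > 0]
    if not errors:
        raise ValueError("no valid reference pixels")
    mae = sum(errors) / len(errors)
    bad_fraction = sum(1 for e in errors if e > bad_threshold) / len(errors)
    return mae, bad_fraction


if __name__ == "__main__":
    realtime = [10.0, 20.5, 33.0, 0.0]
    reference = [10.0, 20.0, 30.0, 0.0]  # last pixel has no valid reference
    mae, bad = compare_disparity(realtime, reference)
    print(f"MAE: {mae:.3f} px, bad pixels: {bad:.1%}")
    print(f"Depth at d=20 px: {depth_from_disparity(1000.0, 0.1, 20.0):.2f} m")
```

Since depth varies as 1/d, a fixed disparity error produces a depth error that grows roughly with the square of the range, which is one way to see why long-range operation depends so heavily on calibration accuracy.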


[102] CARE: Multi-Task Pretraining for Latent Continuous Action Representation in Robot Control cs.RO | cs.CV

Jiaqi Shi, Xulong Zhang, Xiaoyang Qu, Jianzong Wang

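TL;DR: CARE is a new pretraining framework for vision-language-action (VLA) models in robot control. It performs multi-task pretraining on video-text pairs alone to learn continuous latent action representations, so no action annotations are needed; during fine-tuning, a small amount of labeled data trains the action head for control. Experiments show that CARE achieves higher success rates across a range of simulation tasks, is semantically interpretable, and avoids shortcut learning.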

Details

Motivation: Existing VLA models for robot control depend on action supervision, which limits their scalability and generalization.

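Result: Experiments on a range of simulation tasks show that CARE achieves higher success rates, exhibits semantic interpretability, and avoids shortcut learning, demonstrating its effectiveness for robot control under weak supervision.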

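Insight: The key idea is multi-task pretraining on weakly aligned video-text pairs alone to learn continuous latent action representations, which removes the need for action annotations and improves scalability and generalization. Viewed objectively, learning action representations under weak supervision suggests a way to reduce robot control's reliance on large amounts of labeled data.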

Abstract: Recent advances in Vision-Language-Action (VLA) models have shown promise for robot control, but their dependence on action supervision limits scalability and generalization. To address this challenge, we introduce CARE, a novel framework designed to train VLA models for robotic task execution. Unlike existing methods that depend on action annotations during pretraining, CARE eliminates the need for explicit action labels by leveraging only video-text pairs. These weakly aligned data sources enable the model to learn continuous latent action representations through a newly designed multi-task pretraining objective. During fine-tuning, a small set of labeled data is used to train the action head for control. Experimental results across various simulation tasks demonstrate CARE’s superior success rate, semantic interpretability, and ability to avoid shortcut learning. These results underscore CARE’s scalability, interpretability, and effectiveness in robotic control with weak supervision.