cs.CL

[1] Self-Execution Simulation Improves Coding Models cs.CL | cs.LG

Gallil Maimon, Ori Yoran, Felix Kreuk, Michael Hassid, Gal Cohen

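TL;DR: This paper proposes improving code generation models through self-execution simulation. Combining supervised fine-tuning with reinforcement learning, it trains models to simulate program execution step by step and to use the predicted execution feedback for self-verification and iterative repair, yielding consistent gains across multiple competitive programming benchmarks.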

Details

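Motivation: Large language models cannot accurately estimate the results of program execution, particularly for code they generate themselves. Addressing this is intended to improve the correctness and reliability of generated code.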

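Result: Across multiple competitive programming benchmarks, the method yields consistent improvements over standard reasoning approaches. Execution simulation lets the model self-verify candidate solutions and iteratively fix them, which improves its task-solving ability.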

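Insight: The key idea is to treat program execution simulation as a trainable capability of code generation models. Supervised fine-tuning on natural language execution traces is combined with reinforcement learning using verifiable rewards, giving the model self-verification and self-fixing mechanisms and suggesting a new route to more reliable and correct code generation.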

Abstract: A promising research direction in enabling LLMs to generate consistently correct code involves addressing their inability to properly estimate program execution, particularly for code they generate. In this work, we demonstrate that Code LLMs can be trained to simulate program execution in a step-by-step manner and that this capability can be leveraged to improve competitive programming performance. Our approach combines supervised fine-tuning on natural language execution traces, textual explanations grounded in true execution, with reinforcement learning using verifiable rewards. We introduce two complementary objectives: output prediction given code and inputs, and solving competitive programming tasks with either ground-truth or self-predicted execution feedback. These objectives enable models to perform self-verification over multiple candidate solutions, and iterative self-fixing by simulating test execution. Across multiple competitive programming benchmarks, our method yields consistent improvements over standard reasoning approaches. We further present ablations and analysis to elucidate the role of execution simulation and its limitations.
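The self-verification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `select_by_simulation` and the `simulate` callable are hypothetical names, and in the toy usage real execution stands in for the model's predicted (simulated) output.

```python
from typing import Callable, List, Sequence, Tuple

Candidate = Callable[[int], int]              # a candidate solution (hypothetical type)
Simulator = Callable[[Candidate, int], int]   # predicts a candidate's output on one input


def select_by_simulation(
    candidates: Sequence[Candidate],
    tests: Sequence[Tuple[int, int]],
    simulate: Simulator,
) -> int:
    """Return the index of the candidate predicted to pass the most tests."""
    scores: List[int] = []
    for cand in candidates:
        # Count the tests whose predicted output matches the expected output.
        scores.append(sum(1 for x, expected in tests if simulate(cand, x) == expected))
    return max(range(len(candidates)), key=scores.__getitem__)


# Toy usage: real execution stands in for the model's simulated execution.
candidates = [lambda x: x + x, lambda x: x * x, lambda x: x ** 3]
tests = [(2, 4), (3, 9)]  # (input, expected output) for "square the input"
best = select_by_simulation(candidates, tests, simulate=lambda c, x: c(x))
```

In the paper's setting the simulator would be the trained model predicting outputs, so its predictions may be wrong; the selection logic itself stays the same.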


[2] Why Attend to Everything? Focus is the Key cs.CL | cs.AI

Hengshuai Yao, Xing Chen, Ahmed Murtadha, Jin Li, Shuai Shao

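TL;DR: This paper proposes Focus, an efficient attention method that uses learnable centroids to assign tokens to groups. Distant attention is restricted to token pairs in the same group, while local attention stays at full resolution. Training only a small number of parameters (as few as 148K) improves domain perplexity with no degradation on downstream tasks, and the method is validated across model scales from 124M to 70B parameters and five attention architectures. At inference, a hard sparsity pattern provides speedups without custom kernels.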

Details

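Motivation: Approximating attention over all token pairs is costly. Focus instead aims to learn which token pairs matter and attend only to those, improving efficiency while preserving model performance.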

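Result: At 124M parameters, Focus surpasses full attention (30.3 vs 31.4 PPL); trained from scratch at 7B scale, it again beats full attention (13.82 vs 13.89 PPL). At inference, a hard sparsity pattern yields a 2x speedup while beating the pretrained baseline (41.3 vs 42.8 PPL), and decomposing the pattern into two standard FlashAttention calls reaches an 8.6x wall-clock speedup at 1M tokens. In the fine-tuning setting, Focus preserves the TruthfulQA scores of instruction-tuned models, whereas LoRA degrades them.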

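Insight: Learnable centroids give a soft grouping of tokens, and Sinkhorn normalization enforces balanced groups, so that interpretable linguistic categories emerge without supervision. The method is purely additive and keeps the model weights frozen, balancing attention efficiency against performance while preserving alignment during adaptation.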

Abstract: We introduce Focus, a method that learns which token pairs matter rather than approximating all of them. Learnable centroids assign tokens to groups; distant attention is restricted to same-group pairs while local attention operates at full resolution. Because all model weights stay frozen, Focus is purely additive: centroid-only training (as few as 148K parameters) improves domain perplexity with zero degradation on downstream benchmarks–from 124M to 70B parameters, across five attention architectures. No existing efficient attention method achieves this in the retrofit setting. At 124M, Focus surpasses full attention (30.3 vs 31.4 PPL); trained from scratch at 7B scale (2B tokens), Focus again beats full attention (13.82 vs 13.89 PPL). At inference, restricting each token to its top-k highest-scoring groups discretizes the soft routing into a hard sparsity pattern, yielding 2x speedup while beating the pretrained baseline (41.3 vs 42.8 PPL); decomposing this pattern into two standard FlashAttention calls reaches 8.6x wall-clock speedup at 1M tokens with no custom kernels. Unlike LoRA, centroid routing preserves alignment: instruction-tuned models retain TruthfulQA scores after adaptation, while LoRA degrades at every learning rate and rank. Sinkhorn normalization enforces balanced groups as a hard constraint, and the resulting groups discover interpretable linguistic categories without supervision.
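The sparsity pattern described in the abstract (full-resolution local attention plus distant attention restricted to same-group pairs) can be sketched as a boolean mask. This is a simplified illustration under stated assumptions, not the paper's method: it uses a hard nearest-centroid assignment to a single group instead of soft routing with Sinkhorn normalization and top-k groups, it assumes a causal mask, and `assign_groups`, `focus_mask`, and `window` are hypothetical names.

```python
from typing import List, Sequence


def assign_groups(
    tokens: Sequence[Sequence[float]], centroids: Sequence[Sequence[float]]
) -> List[int]:
    """Hard-assign each token vector to the centroid with the highest dot product."""
    def dot(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    return [max(range(len(centroids)), key=lambda g: dot(t, centroids[g])) for t in tokens]


def focus_mask(groups: Sequence[int], window: int) -> List[List[bool]]:
    """mask[i][j] is True when query i may attend to key j (causal: j <= i).

    A key is visible if it lies inside the local window or shares the query's group.
    """
    n = len(groups)
    return [
        [j <= i and (i - j < window or groups[i] == groups[j]) for j in range(n)]
        for i in range(n)
    ]


tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
groups = assign_groups(tokens, centroids)
mask = focus_mask(groups, window=2)
```

With `window=2`, the last token sees its two most recent positions plus every earlier token in its own group, which is the kind of pattern that a sparse kernel or a pair of dense attention calls could exploit.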


[3] LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling cs.CL | cs.AI | cs.GL | cs.NE

Keqin Xie

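TL;DR: This paper proposes LPC-SM, a hybrid autoregressive architecture for long-context language modeling. It separates local attention, persistent memory, predictive correction, and run-time control within the same block, and uses Orthogonal Novelty Transport (ONT) to govern slow-memory writes. The author evaluates a 158M-parameter model in three stages (base language modeling, mathematical continuation, and 4096-token continuation); the model remains stable on long sequences and outperforms a matched fixed-ratio continuation baseline.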

Details

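Motivation: Most current long-context language models still rely on attention for both local interaction and long-range state, which leaves little room to explore alternative decompositions of sequence modeling. The paper separates these functions into distinct components to test a broader division of labor for long-context autoregressive modeling.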

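Result: In base language modeling (Stage A), removing the mHC component raises the final LM loss from 12.630 to 15.127. In mathematical continuation (Stage B), adaptive sparse control improves the final LM loss from 12.137 to 10.787 relative to a matched fixed-ratio continuation baseline. In 4096-token continuation (Stage C), the model remains stable with a final LM loss of 11.582, and the delayed-identifier diagnostic improves from 14.396 to 12.031 in key cross-entropy.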

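Insight: The paper's claimed contribution is a hybrid architecture that decouples local attention, persistent memory, predictive correction, and run-time control, together with ONT for governing slow-memory writes. Viewed objectively, its core novelty is to challenge attention's dominance in long-context modeling: modular division of labor and adaptive sparse control offer a new design direction for sequence modeling and may improve efficiency and stability on long-sequence tasks.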

Abstract: Most current long-context language models still rely on attention to handle both local interaction and long-range state, which leaves relatively little room to test alternative decompositions of sequence modeling. We propose LPC-SM, a hybrid autoregressive architecture that separates local attention, persistent memory, predictive correction, and run-time control within the same block, and we use Orthogonal Novelty Transport (ONT) to govern slow-memory writes. We evaluate a 158M-parameter model in three stages spanning base language modeling, mathematical continuation, and 4096-token continuation. Removing mHC raises the Stage-A final LM loss from 12.630 to 15.127, while adaptive sparse control improves the Stage-B final LM loss from 12.137 to 10.787 relative to a matched fixed-ratio continuation. The full route remains stable at sequence length 4096, where Stage C ends with final LM loss 11.582 and improves the delayed-identifier diagnostic from 14.396 to 12.031 in key cross-entropy. Taken together, these results show that long-context autoregressive modeling can be organized around a broader division of labor than attention alone.


[4] CresOWLve: Benchmarking Creative Problem-Solving Over Real-World Knowledge cs.CL | cs.AI

Mete Ismayilzada, Renqing Cuomao, Daniil Yurshevich, Anna Sotnikova, Lonneke van der Plas

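TL;DR: This paper introduces CresOWLve, a benchmark that evaluates the creative problem-solving ability of large language models on puzzles grounded in real-world knowledge. Current models perform far better on factual questions than on creative ones, revealing their difficulty in integrating information into non-obvious creative connections.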

Details

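Motivation: Most existing benchmarks evaluate only specific components of the creative problem-solving process and rely on artificially constructed brainteasers or contrived scenarios that do not reflect creative problem-solving in the real world. A benchmark grounded in real-world knowledge is therefore needed to evaluate this ability comprehensively.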

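Result: Several frontier non-thinking and thinking LLMs were evaluated on CresOWLve, and the benchmark proved highly challenging. Models score up to 17% higher on factual questions than on creative ones, indicating that they can retrieve the relevant knowledge but struggle to combine it into the creative connections needed for the correct answer.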

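Insight: The contribution is a creative problem-solving benchmark grounded in real-world knowledge that emphasizes multi-domain knowledge retrieval and creative integration. Viewed objectively, the study highlights the limits of current LLMs on higher-order cognitive tasks such as creative thinking and provides an important evaluation direction for future model development.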

Abstract: Creative problem-solving requires combining multiple cognitive abilities, including logical reasoning, lateral thinking, analogy-making, and commonsense knowledge, to discover insights that connect seemingly unrelated pieces of information. However, most existing benchmarks for large language models (LLMs) evaluate only specific components of this process. Moreover, many creativity-oriented benchmarks rely on artificially constructed brainteasers or contrived scenarios that do not reflect how creative problem-solving occurs in real-world settings. To address this gap, we introduce CresOWLve, a benchmark for evaluating creative problem-solving using puzzles grounded in real-world knowledge. Problems in CresOWLve require employing multiple creative thinking strategies, retrieving facts from diverse domains, and creatively combining them to arrive at a solution. Evaluating several frontier non-thinking and thinking LLMs, we show that CresOWLve remains highly challenging. Our analysis reveals a consistent performance gap: models perform substantially better on factual questions than on creative ones (up to a -17% drop). While models can often retrieve the relevant knowledge, they struggle to form the non-obvious creative connections required to integrate this information and arrive at the correct answer.


[5] Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution cs.CL | cs.AI

Jacob Dineen, Aswin RRV, Zhikun Xu, Ben Zhou

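TL;DR: This paper proposes vocabulary dropout to address the collapse of curriculum diversity in LLM co-evolution. A random mask applied to the proposer model's output logits prevents it from generating an overly narrow distribution of problems, which sustains curriculum diversity and improves the solver model's performance.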

Details

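Motivation: In co-evolutionary self-play, the proposer quickly converges to a narrow distribution of problems that satisfy the reward function. This diversity collapse deprives the solver of informative training data and stalls the co-evolutionary loop.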

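Result: Training Qwen3-4B and Qwen3-8B on mathematical reasoning via R-Zero, vocabulary dropout sustains proposer diversity across lexical, semantic, and functional metrics and improves the 8B solver by 4.4 points on average, with the largest gains on competition-level benchmarks.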

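Insight: The paper's contribution is vocabulary dropout, a lightweight mechanism whose hard, non-stationary random mask imposes an explicit constraint on the proposer's action space, analogous to the structural role that game rules play in classical self-play, and thereby sustains productive co-evolution. It offers a simple and effective diversity-preserving strategy for language-model self-play training.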

Abstract: Co-evolutionary self-play, where one language model generates problems and another solves them, promises autonomous curriculum learning without human supervision. In practice, the proposer quickly converges to a narrow distribution of problems that satisfy the reward function. This diversity collapse renders the curriculum uninformative for the solver, stalling the co-evolutionary loop. We introduce vocabulary dropout, a random mask applied to the proposer’s output logits during both policy training and curriculum generation, as a lightweight mechanism to sustain diversity. The mask is hard and non-stationary, preventing the proposer from locking into fixed token sequences. Training Qwen3-4B and Qwen3-8B on mathematical reasoning via R-Zero, we find that vocabulary dropout sustains proposer diversity across lexical, semantic, and functional metrics throughout training, and yields solver improvements averaging +4.4 points at 8B, with the largest gains on competition-level benchmarks. Our findings suggest that explicit action-space constraints, analogous to the structural role that game rules play in classical self-play, can help sustain productive co-evolution in language. Vocabulary dropout is one simple instantiation of this principle.
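The masking idea can be sketched in a few lines. The abstract only states that the mask is random, hard, and non-stationary; how often it is resampled and what happens if every token is masked are not specified, so the per-call resampling and the keep-the-argmax fallback below are assumptions, as are the names `vocabulary_dropout` and `drop_prob`.

```python
import math
import random
from typing import List, Sequence


def vocabulary_dropout(
    logits: Sequence[float], drop_prob: float, rng: random.Random
) -> List[float]:
    """Hard-mask a random subset of logits to -inf; the mask is resampled on every call."""
    masked = [(-math.inf if rng.random() < drop_prob else z) for z in logits]
    if all(z == -math.inf for z in masked):
        # Assumption: keep the highest-scoring token so sampling stays well defined.
        keep = max(range(len(logits)), key=logits.__getitem__)
        masked[keep] = logits[keep]
    return masked


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax; masked (-inf) entries get probability 0."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
masked = vocabulary_dropout([2.0, 1.0, 0.5, 0.1], drop_prob=0.5, rng=rng)
probs = softmax(masked)
```

Because masked tokens receive exactly zero probability, the proposer cannot emit them at that step, which is what makes the constraint hard instead of a soft penalty.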


[6] MultiPress: A Multi-Agent Framework for Interpretable Multimodal News Classification cs.CL

Tailong Luo, Hao Li, Rong Fu, Xinyue Jiang, Huaxuan Ding

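TL;DR: This paper proposes MultiPress, a three-stage multi-agent framework for interpretable multimodal news classification. It integrates specialized agents for multimodal perception, retrieval-augmented reasoning, and gated fusion scoring, and improves performance through a reward-driven iterative optimization mechanism.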

Details

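Motivation: Existing multimodal news classification methods typically process modalities independently or use simplistic fusion strategies, which makes it hard to capture complex cross-modal interactions and to leverage external knowledge.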

Result: Validated on a newly constructed large-scale multimodal news dataset, MultiPress achieves significant improvements over strong baselines.

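Insight: The contributions are a modular multi-agent collaboration architecture, a retrieval-augmented reasoning mechanism, and reward-driven iterative optimization, all designed to improve classification accuracy and model interpretability.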

Abstract: With the growing prevalence of multimodal news content, effective news topic classification demands models capable of jointly understanding and reasoning over heterogeneous data such as text and images. Existing methods often process modalities independently or employ simplistic fusion strategies, limiting their ability to capture complex cross-modal interactions and leverage external knowledge. To overcome these limitations, we propose MultiPress, a novel three-stage multi-agent framework for multimodal news classification. MultiPress integrates specialized agents for multimodal perception, retrieval-augmented reasoning, and gated fusion scoring, followed by a reward-driven iterative optimization mechanism. We validate MultiPress on a newly constructed large-scale multimodal news dataset, demonstrating significant improvements over strong baselines and highlighting the effectiveness of modular multi-agent collaboration and retrieval-augmented reasoning in enhancing classification accuracy and interpretability.


[7] The Format Tax cs.CL

Ivan Yee Lee, Loris D’Antoni, Taylor Berg-Kirkpatrick

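TL;DR: This paper finds that requiring large language models to respond in structured formats such as JSON, XML, LaTeX, or Markdown substantially degrades the reasoning and writing performance of open-weight models, a phenomenon it calls the "format tax". Most of the degradation comes from the format-requesting instructions in the prompt rather than from decoding constraints. Decoupling reasoning from formatting (for example, generating freeform text first and reformatting it afterwards) recovers most of the lost accuracy. Most recent closed-weight models show little to no format tax, which suggests a gap that current open-weight models have yet to close.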

Details

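Motivation: Open-weight large language models lose substantial performance when asked for structured output such as JSON. The paper investigates the root cause and argues that it lies mainly in the interference caused by the format-requesting instructions themselves, not in decoding constraints.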

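Result: Across six open-weight models, four API models, four formats, and tasks spanning math, science, logic, and writing, decoupling recovers most of the lost accuracy. Most recent closed-weight models show little to no format tax.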

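Insight: The core contribution is the diagnosis that the dominant cost of the "format tax" enters at the prompt instructions rather than during decoding, together with the simple principle of decoupling reasoning from formatting. This gives a clear direction for improving structured output in open-weight models: separate the two concerns instead of relying on constrained decoding alone.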

Abstract: Asking a large language model to respond in JSON should be a formatting choice, not a capability tax. Yet we find that structured output requirements – JSON, XML, LaTeX, Markdown – substantially degrade reasoning and writing performance across open-weight models. The research response has focused on constrained decoding, but sampling bias accounts for only a fraction of the degradation. The dominant cost enters at the prompt: format-requesting instructions alone cause most of the accuracy loss, before any decoder constraint is applied. This diagnosis points to a simple principle: decouple reasoning from formatting. Whether by generating freeform first and reformatting in a second pass, or by enabling extended thinking within a single generation, separating the two concerns substantially recovers lost accuracy. Across six open-weight models, four API models, four formats, and tasks spanning math, science, logic, and writing, decoupling recovers most lost accuracy. Notably, most recent closed-weight models show little to no format tax, suggesting the problem is not inherent to structured generation but a gap that current open-weight models have yet to close. Code is available at https://github.com/ivnle/the-format-tax.
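The "generate freeform first, reformat in a second pass" option from the abstract can be sketched as follows. This is an illustration only: `answer_then_format` is a hypothetical helper, the prompt wording is invented for the example, and `fake_generate` is a stub standing in for whatever model call is actually used.

```python
import json
from typing import Callable


def answer_then_format(question: str, generate: Callable[[str], str]) -> str:
    """Two-pass generation: reason in free text first, then reformat as JSON."""
    # Pass 1: no format instruction, so the reasoning prompt carries no format request.
    freeform = generate("Answer the question, reasoning step by step.\n\n" + question)
    # Pass 2: formatting only; the prompt wording here is illustrative.
    return generate(
        'Rewrite the final answer below as JSON of the form {"answer": "..."}. '
        "Do not change its content.\n\n" + freeform
    )


def fake_generate(prompt: str) -> str:
    """Stub standing in for a real model call (hypothetical)."""
    if prompt.startswith("Rewrite"):
        return json.dumps({"answer": "4"})
    return "2 + 2 = 4, so the answer is 4."


result = answer_then_format("What is 2 + 2?", fake_generate)
parsed = json.loads(result)
```

The trade-off is a second model call per query; the abstract also mentions enabling extended thinking within a single generation as an alternative way to separate the two concerns.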


[8] CAGMamba: Context-Aware Gated Cross-Modal Mamba Network for Multimodal Sentiment Analysis cs.CL

Minghai Jiao, Jing Xiao, Peng Xiao, Ende Zhang, Shuang Kan

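TL;DR: This paper proposes CAGMamba, a context-aware gated cross-modal Mamba framework for multimodal sentiment analysis (MSA). It organizes contextual and current-utterance features into a temporally ordered binary sequence so that Mamba can explicitly model how sentiment evolves across dialogue turns. A Gated Cross-Modal Mamba Network (GCMN) integrates cross-modal and unimodal paths to balance information fusion against modality preservation, and the model is trained with a three-branch multi-task objective.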

Details

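Motivation: Existing Transformer-based cross-modal attention methods incur quadratic complexity, which limits scalability, and they incorporate dialogue context without explicit temporal modeling of sentiment evolution.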

Result: Experiments on three benchmark datasets show that CAGMamba achieves state-of-the-art (SOTA) or competitive results across multiple evaluation metrics.

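Insight: The main contributions are: (1) organizing the context and the current utterance into a temporally ordered sequence, which gives Mamba explicit temporal structure for modeling sentiment evolution; (2) a gated cross-modal Mamba network that controllably integrates cross-modal and unimodal information through learnable gating; and (3) a three-branch multi-task training objective. Viewed objectively, combining the efficient Mamba architecture with gating for multimodal temporal modeling is a promising direction.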

Abstract: Multimodal Sentiment Analysis (MSA) requires effective modeling of cross-modal interactions and contextual dependencies while remaining computationally efficient. Existing fusion approaches predominantly rely on Transformer-based cross-modal attention, which incurs quadratic complexity with respect to sequence length and limits scalability. Moreover, contextual information from preceding utterances is often incorporated through concatenation or independent fusion, without explicit temporal modeling that captures sentiment evolution across dialogue turns. To address these limitations, we propose CAGMamba, a context-aware gated cross-modal Mamba framework for dialogue-based sentiment analysis. Specifically, we organize the contextual and the current-utterance features into a temporally ordered binary sequence, which provides Mamba with explicit temporal structure for modeling sentiment evolution. To further enable controllable cross-modal integration, we propose a Gated Cross-Modal Mamba Network (GCMN) that integrates cross-modal and unimodal paths via learnable gating to balance information fusion and modality preservation, and is trained with a three-branch multi-task objective over text, audio, and fused predictions. Experiments on three benchmark datasets demonstrate that CAGMamba achieves state-of-the-art or competitive results across multiple evaluation metrics. All codes are available at https://github.com/User2024-xj/CAGMamba.


[9] Document-Level Numerical Reasoning across Single and Multiple Tables in Financial Reports cs.CL

Yi-Cheng Wang, Wei-An Wang, Chu-Song Chen

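TL;DR: This paper addresses the weakness of large language models at numerical reasoning over long documents, in particular cross-table numerical reasoning in financial annual reports, and proposes the FinLongDocQA dataset and the FinLongDocAgent method. FinLongDocQA covers single-table and cross-table financial numerical reasoning tasks for evaluating models in long contexts. FinLongDocAgent is a multi-agent, multi-round retrieval-augmented generation approach that iteratively retrieves evidence, performs intermediate calculations, and verifies results to make numerical reasoning more reliable.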

Details

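Motivation: Existing benchmarks focus largely on single-table settings, whereas analyzing financial annual reports requires numerical reasoning across multiple tables and narrative text. Current LLMs struggle with numerical reasoning over long documents, both in locating evidence within very long contexts and in avoiding errors in multi-step calculations.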

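Result: Evaluating closed-source and open-source LLMs on FinLongDocQA reveals two bottlenecks: annual reports often exceed 129k tokens, which aggravates context rot when locating relevant tables, and models remain prone to errors in multi-step numerical reasoning even when the relevant evidence is found. Experiments with FinLongDocAgent highlight the importance of iterative retrieval and verification for reliable numerical QA over long financial documents.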

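Insight: The contributions are FinLongDocQA, a dataset for cross-table financial numerical reasoning, and FinLongDocAgent, a multi-agent multi-round RAG framework. The work underscores the importance of iterative retrieval and verification for numerical reasoning over long documents and offers an approach that may carry over to similar complex document-analysis problems.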

Abstract: Despite the strong language understanding abilities of large language models (LLMs), they still struggle with reliable question answering (QA) over long, structured documents, particularly for numerical reasoning. Financial annual reports exemplify this difficulty: financial statement analysis often hinges on accurate arithmetic, and analysts derive key indicators by integrating evidence scattered across multiple tables and narrative text. However, existing benchmarks focus largely on single-table settings, leaving cross-table document-level numerical reasoning underexplored. To address this gap, we introduce FinLongDocQA, a dataset for both single-table and cross-table financial numerical reasoning in long-context reports. Evaluating both closed-source and open-source LLMs on FinLongDocQA reveals two bottlenecks: (1) annual reports often exceed 129k tokens, exacerbating the context rot problem for locating relevant tables; and (2) even when relevant evidence is located, LLMs remain prone to errors in multi-step numerical reasoning. We propose FinLongDocAgent, a Multi-Agent Multi-Round Retrieval-Augmented Generation (RAG) approach that iteratively retrieves evidence, performs intermediate calculations, and verifies results across rounds. Experiments highlight the importance of iterative retrieval and verification for reliable numerical QA in long financial documents.


[10] LightThinker++: From Reasoning Compression to Memory Management cs.CL | cs.AI | cs.IR | cs.LG | cs.MM

Yuqi Zhu, Jintian Zhang, Zhenjie Wan, Yujie Luo, Shuofei Qiao

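TL;DR: This paper proposes the LightThinker++ framework. By introducing explicit adaptive memory management, it enables large language models to dynamically compress intermediate reasoning steps into compact semantic representations, substantially reducing the cognitive overhead and compute cost of long-sequence reasoning.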

Details

Motivation: In complex reasoning tasks, long thought traces make large language models computationally inefficient and impose heavy cognitive overhead.

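Result: In standard reasoning tasks, LightThinker++ reduces peak token usage by 69.9% while improving accuracy by 2.42% under the same context budget. In long-horizon agentic tasks, it maintains a stable memory footprint beyond 80 rounds (a 60%-70% reduction) and achieves an average performance gain of 14.8% across different complex scenarios.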

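Insight: The contribution is the move from static compression to a behavioral-level memory management paradigm: explicit memory primitives and a specialized trajectory synthesis pipeline are used to train purposeful memory scheduling, providing a scalable direction for sustaining deep reasoning over extended horizons.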

Abstract: Large language models (LLMs) excel at complex reasoning, yet their efficiency is limited by the surging cognitive overhead of long thought traces. In this paper, we propose LightThinker, a method that enables LLMs to dynamically compress intermediate thoughts into compact semantic representations. However, static compression often struggles with complex reasoning where the irreversible loss of intermediate details can lead to logical bottlenecks. To address this, we evolve the framework into LightThinker++, introducing Explicit Adaptive Memory Management. This paradigm shifts to behavioral-level management by incorporating explicit memory primitives, supported by a specialized trajectory synthesis pipeline to train purposeful memory scheduling. Extensive experiments demonstrate the framework’s versatility across three dimensions. (1) LightThinker reduces peak token usage by 70% and inference time by 26% with minimal accuracy loss. (2) In standard reasoning, LightThinker++ slashes peak token usage by 69.9% while yielding a +2.42% accuracy gain under the same context budget for maximum performance. (3) Most notably, in long-horizon agentic tasks, it maintains a stable footprint beyond 80 rounds (a 60%-70% reduction), achieving an average performance gain of 14.8% across different complex scenarios. Overall, our work provides a scalable direction for sustaining deep LLM reasoning over extended horizons with minimal overhead.


[11] Testing the Limits of Truth Directions in LLMs cs.CL | cs.AI

Angelos Poulis, Mark Crovella, Evimaria Terzi

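TL;DR: Through systematic experiments, this paper reveals several limits on the universality of linear truth directions in large language models (LLMs): they depend heavily on model layer, task type, task complexity, and prompt instructions, so their universality is more limited than previously recognized.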

Details

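Motivation: Prior work disagrees about how universal linear truth directions in LLM activation space are. This paper examines the specific conditions that limit their universality in order to understand truth directions more fully.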

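Result: Experiments show that truth directions depend heavily on model layer (probing at many layers is required), task type (they emerge in earlier layers for factual tasks and in later layers for reasoning tasks), task complexity, and prompt template (simple correctness-evaluation instructions significantly affect how well truth probes generalize).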

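Insight: The contribution is a systematic identification of previously poorly understood limits on truth-direction universality, covering layer dependence, sensitivity to task type and complexity, and the effect of instructions. It underscores the need for multi-dimensional, fine-grained analysis when evaluating the internal representations of LLMs.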

Abstract: Large language models (LLMs) have been shown to encode truth of statements in their activation space along a linear truth direction. Previous studies have argued that these directions are universal in certain aspects, while more recent work has questioned this conclusion drawing on limited generalization across some settings. In this work, we identify a number of limits of truth-direction universality that have not been previously understood. We first show that truth directions are highly layer-dependent, and that a full understanding of universality requires probing at many layers in the model. We then show that truth directions depend heavily on task type, emerging in earlier layers for factual and later layers for reasoning tasks; they also vary in performance across levels of task complexity. Finally, we show that model instructions dramatically affect truth directions; simple correctness evaluation instructions significantly affect the generalization ability of truth probes. Our findings indicate that universality claims for truth directions are more limited than previously known, with significant differences observable for various model layers, task difficulties, task types, and prompt templates.


[12] When Models Know More Than They Say: Probing Analogical Reasoning in LLMs cs.CL | cs.AI | cs.LG

Hope McGovern, Caroline Craig, Thomas Lippincott, Hale Sirin

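TL;DR: This paper studies how large language models (LLMs) perform on analogical reasoning, particularly when an analogy depends on latent information rather than surface cues. Comparing probes of the model's internal representations with its prompted performance reveals an asymmetry between rhetorical-analogy and narrative-analogy tasks.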

Details

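Motivation: The paper investigates the limits of LLMs in analogical reasoning, especially deeper narrative analogies that require abstraction and generalization, in order to understand the relationship between internal representations and prompted behavior.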

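Result: In open-source models, probing significantly outperforms prompting on rhetorical analogies, whereas on narrative analogies both achieve similarly low performance, indicating that the outcome is task-dependent.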

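Insight: The contribution is to expose an asymmetry between LLMs' internal representations and their prompted behavior, suggesting that prompting is limited in how it accesses available information and pointing to directions for improving model reasoning.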

Abstract: Analogical reasoning is a core cognitive faculty essential for narrative understanding. While LLMs perform well when surface and structural cues align, they struggle in cases where an analogy is not apparent on the surface but requires latent information, suggesting limitations in abstraction and generalisation. In this paper we compare a model’s probed representations with its prompted performance at detecting narrative analogies, revealing an asymmetry: for rhetorical analogies, probing significantly outperforms prompting in open-source models, while for narrative analogies, they achieve a similar (low) performance. This suggests that the relationship between internal representations and prompted behavior is task-dependent and may reflect limitations in how prompting accesses available information.


[13] AdaptFuse: Training-Free Sequential Preference Learning via Externalized Bayesian Inference cs.CL | cs.AI

Fangzhou Lin, Peiran Li, Shuo Xing, Siyuan Yang, Qianwen Ge

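TL;DR: This paper proposes AdaptFuse, a training-free framework for sequential preference learning via externalized Bayesian inference. A symbolic module maintains a Bayesian posterior over a discrete hypothesis set, a frozen large language model contributes semantic reasoning, and the two are combined through entropy-adaptive fusion. The method outperforms prompting baselines and fine-tuned Bayesian Teaching models on multiple recommendation tasks without training on or storing sensitive user data.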

Details

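Motivation: Large language models struggle to accumulate evidence across multiple rounds of user interaction and fail to update their beliefs in a manner consistent with Bayesian inference. Existing solutions require fine-tuning on sensitive user interaction data, which limits their use in privacy-conscious settings.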

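Result: Across three domains (flight recommendation, hotel recommendation, and web shopping), evaluated on Gemma 2 9B, Llama 3 8B, and Qwen 2.5 7B, AdaptFuse consistently outperforms both prompting baselines and fine-tuned Bayesian Teaching models on all tasks, with accuracy improving monotonically over interaction rounds.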

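Insight: The contribution is to externalize probabilistic computation entirely from the LLM, combining symbolic Bayesian inference with the semantic abilities of a frozen LLM and weighting the two dynamically through entropy-adaptive fusion. The core insight is that principled inference-time algorithms can substitute for fine-tuning in personalized recommendation while protecting user privacy.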

Abstract: Large language models struggle to accumulate evidence across multiple rounds of user interaction, failing to update their beliefs in a manner consistent with Bayesian inference. Existing solutions require fine-tuning on sensitive user interaction data, limiting their applicability in privacy-conscious settings. We propose AdaptFuse, a training-free framework that externalizes probabilistic computation entirely from the LLM: a symbolic module maintains a Bayesian posterior over a discrete hypothesis set, while a frozen LLM contributes semantic reasoning via multi-sample Dirichlet aggregation. The two signals are combined through entropy-adaptive fusion, which automatically weights each source by its predictive confidence, shifting reliance from the LLM to the symbolic posterior as evidence accumulates. We evaluate across three domains: flight recommendation, hotel recommendation, and web shopping; on Gemma 2 9B, Llama 3 8B, and Qwen 2.5 7B. AdaptFuse consistently outperforms both prompting baselines and fine-tuned Bayesian Teaching models on all tasks, with accuracy improving monotonically over interaction rounds. These results demonstrate that principled inference-time algorithms can substitute for fine-tuning in personalized recommendation, without storing or training on sensitive user data. All the code and materials will be open-sourced.
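The pipeline in the abstract (a symbolic Bayesian posterior, an LLM distribution aggregated from multiple samples, and a confidence-weighted combination) can be sketched as below. The abstract does not give the formulas, so everything concrete here is an assumption made for illustration: the Dirichlet posterior mean with a symmetric prior `alpha`, confidence defined as one minus normalized Shannon entropy, and a linear mixture weighted by relative confidence.

```python
import math
from typing import List, Sequence


def bayes_update(prior: Sequence[float], likelihoods: Sequence[float]) -> List[float]:
    """Posterior over a discrete hypothesis set after one observation."""
    unnorm = [p * l for p, l in zip(prior, likelihoods)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def dirichlet_mean(votes: Sequence[int], n_hyp: int, alpha: float = 1.0) -> List[float]:
    """Aggregate sampled LLM choices as a Dirichlet posterior mean (assumed form)."""
    counts = [alpha] * n_hyp
    for v in votes:
        counts[v] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def confidence(p: Sequence[float]) -> float:
    """1 - normalized Shannon entropy: 0 for uniform, 1 for one-hot."""
    h = -sum(x * math.log(x) for x in p if x > 0)
    return 1.0 - h / math.log(len(p))


def fuse(posterior: Sequence[float], llm_dist: Sequence[float], eps: float = 1e-9) -> List[float]:
    """Mix the two distributions, weighting each by its relative confidence."""
    w_sym = confidence(posterior) + eps
    w_llm = confidence(llm_dist) + eps
    lam = w_sym / (w_sym + w_llm)
    return [lam * a + (1.0 - lam) * b for a, b in zip(posterior, llm_dist)]


uniform = [1 / 3] * 3
posterior = bayes_update(uniform, [0.9, 0.05, 0.05])   # evidence favours hypothesis 0
llm_dist = dirichlet_mean([0, 1, 0, 2], n_hyp=3)        # four sampled LLM answers
fused = fuse(posterior, llm_dist)
```

Under this weighting, a sharper symbolic posterior pulls the mixture toward itself, which mirrors the abstract's description of reliance shifting from the LLM to the symbolic posterior as evidence accumulates.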


[14] Predict, Don’t React: Value-Based Safety Forecasting for LLM Streaming cs.CL | cs.LG

Pride Kavumba, Koki Wataoka, Huy H. Nguyen, Jiaxuan Li, Masaya Ohagi

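TL;DR: This paper proposes StreamGuard, a model-agnostic streaming guardrail that recasts content moderation as a forecasting problem: given a partially generated prefix, predict the expected harmfulness of likely future continuations. Supervision comes from Monte Carlo rollouts, so no exact token-level boundary annotations are needed, and the guardrail can intervene early on streaming LLM output.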

Details

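Motivation: Existing streaming guardrails usually treat output-side moderation as boundary detection, training models to identify the earliest prefix at which a response has become unsafe. This depends on exact boundary annotations and can react too late. The paper aims to intervene earlier and more effectively by forecasting future risk.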

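Result: On standard safety benchmarks, StreamGuard performs strongly for both input moderation and streaming output moderation. At the 8B scale, relative to Qwen3Guard-Stream-8B-strict, it improves aggregated input-moderation F1 from 86.7 to 88.2 and aggregated streaming output-moderation F1 from 80.4 to 81.9. On the QWENGUARDTEST response_loc streaming benchmark, StreamGuard reaches 97.5 F1, 95.1 recall, and 92.6% on-time intervention while reducing the miss rate from 7.9% to 4.9%. Forecasting-based supervision also transfers effectively across tokenizers and model families: with transferred targets, Gemma3-StreamGuard-1B reaches 81.3 response-moderation F1 and 98.2 streaming F1.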

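Insight: The contribution is to reframe streaming moderation from boundary detection to risk forecasting, using Monte Carlo rollouts to generate the supervision signal and thereby removing the dependence on exact boundary annotations. This provides an effective supervision strategy for low-latency safety intervention and shows potential for cross-model transfer, making streaming guardrails more general and practical.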

Abstract: In many practical LLM deployments, a single guardrail is used for both prompt and response moderation. Prompt moderation operates on fully observed text, whereas streaming response moderation requires safety decisions to be made over partial generations. Existing text-based streaming guardrails commonly frame this output-side problem as boundary detection, training models to identify the earliest prefix at which a response has already become unsafe. In this work, we introduce StreamGuard, a unified model-agnostic streaming guardrail that instead formulates moderation as a forecasting problem: given a partial prefix, the model predicts the expected harmfulness of likely future continuations. We supervise this prediction using Monte Carlo rollouts, which enables early intervention without requiring exact token-level boundary annotations. Across standard safety benchmarks, StreamGuard performs strongly both for input moderation and for streaming output moderation. At the 8B scale, StreamGuard improves aggregated input-moderation F1 from 86.7 to 88.2 and aggregated streaming output-moderation F1 from 80.4 to 81.9 relative to Qwen3Guard-Stream-8B-strict. On the QWENGUARDTEST response_loc streaming benchmark, StreamGuard reaches 97.5 F1, 95.1 recall, and 92.6% on-time intervention, compared to 95.9 F1, 92.1 recall, and 89.9% for Qwen3Guard-Stream-8B-strict, while reducing the miss rate from 7.9% to 4.9%. We further show that forecasting-based supervision transfers effectively across tokenizers and model families: with transferred targets, Gemma3-StreamGuard-1B reaches 81.3 response-moderation F1, 98.2 streaming F1, and a 3.5% miss rate. These results show that strong end-to-end streaming moderation can be obtained without exact boundary labels, and that forecasting future risk is an effective supervision strategy for low-latency safety intervention.
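The Monte Carlo supervision idea can be sketched as an estimate of the fraction of sampled continuations that end up harmful. This is a toy illustration under stated assumptions, not the paper's pipeline: `forecast_target`, `toy_sampler`, and `toy_judge` are hypothetical names, and the sampler and judge are stubs standing in for a generator model and a safety classifier.

```python
import random
from typing import Callable


def forecast_target(
    prefix: str,
    sample_continuation: Callable[[str, random.Random], str],
    is_harmful: Callable[[str], bool],
    n_rollouts: int,
    rng: random.Random,
) -> float:
    """Monte Carlo estimate of the expected harmfulness of continuing `prefix`."""
    harmful = 0
    for _ in range(n_rollouts):
        completion = prefix + sample_continuation(prefix, rng)
        harmful += 1 if is_harmful(completion) else 0
    return harmful / n_rollouts


def toy_sampler(prefix: str, rng: random.Random) -> str:
    """Stub generator: risky prefixes usually continue unsafely (hypothetical)."""
    if "risky" in prefix and rng.random() < 0.8:
        return " [unsafe]"
    return " [safe]"


def toy_judge(text: str) -> bool:
    """Stub safety judge applied to the full completion (hypothetical)."""
    return "[unsafe]" in text


rng = random.Random(0)
risky = forecast_target("a risky start", toy_sampler, toy_judge, 200, rng)
benign = forecast_target("a benign start", toy_sampler, toy_judge, 200, rng)
```

A guardrail trained to regress such targets from prefixes alone could flag a response before any unsafe token appears, which is the early-intervention property the abstract emphasizes.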


[15] GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces cs.CL

Xinyu Geng, Yanjing Xiao, Yuyang Zhang, Hanwen Wang, Xinyan Liu

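TL;DR: GeoBrowse is a geolocation benchmark that combines visual reasoning with knowledge-intensive multi-hop queries to evaluate how well agents integrate fragmented evidence through multi-step tool use. It has two difficulty levels: Level 1 tests extracting and composing ambiguous visual cues from images, and Level 2 raises query difficulty by injecting long-tail knowledge and obfuscating key entities. The authors also provide an agentic workflow, GATE, equipped with multiple tools, along with expert-annotated stepwise reasoning traces for evaluation. Experiments show that GATE outperforms direct inference and open-source agents and underline the importance of level-specific tool-use plans.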

Details

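Motivation: Existing multimodal benchmarks rarely require both the composition of weak visual cues and BrowseComp-style multi-hop verification. Geolocation naturally requires combining multiple ambiguous visual cues and validating them against open-web evidence, so the authors introduce GeoBrowse to fill this gap.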

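Result: On GeoBrowse, the GATE agent outperforms direct inference and open-source agents, indicating that no-tool, search-only, or image-only setups are insufficient. The gains come from coherent, level-specific tool-use plans rather than from more tool calls: such plans reach the annotated key evidence steps more reliably and make fewer errors when integrating evidence into the final decision.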

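Insight: The contributions include using geolocation as a natural testbed for agentic tool use, with a two-level benchmark that combines visual and knowledge-intensive queries; expert-annotated stepwise reasoning traces that support trajectory-level analysis; and experiments showing that the quality of the tool-use plan, not the number of tool calls, drives performance, which offers new insight for agent design.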

Abstract: Deep research agents integrate fragmented evidence through multi-step tool use. BrowseComp offers a text-only testbed for such agents, but existing multimodal benchmarks rarely require both the composition of weak visual cues and BrowseComp-style multi-hop verification. Geolocation is a natural testbed because answers depend on combining multiple ambiguous visual cues and validating them with open-web evidence. Thus, we introduce GeoBrowse, a geolocation benchmark that combines visual reasoning with knowledge-intensive multi-hop queries. Level 1 tests extracting and composing fragmented visual cues, and Level 2 increases query difficulty by injecting long-tail knowledge and obfuscating key entities. To support evaluation, we provide an agentic workflow GATE with five think-with-image tools and four knowledge-intensive tools, and release expert-annotated stepwise traces grounded in verifiable evidence for trajectory-level analysis. Experiments show that GATE outperforms direct inference and open-source agents, indicating that no-tool, search-only or image-only setups are insufficient. Gains come from coherent, level-specific tool-use plans rather than more tool calls, as they more reliably reach annotated key evidence steps and make fewer errors when integrating into the final decision. The GeoBrowse benchmark and code are available at https://github.com/ornamentt/GeoBrowse


[16] Unmasking Hallucinations: A Causal Graph-Attention Perspective on Factual Reliability in Large Language Models cs.CL | cs.LG

Sailesh kiran kurra, Shiek Ruksana, Vishal Borusu

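TL;DR: This paper proposes a causal graph attention network (GCAN) framework intended to reduce hallucinations in large language models (LLMs). It builds token-level graphs that combine self-attention weights with gradient-based influence scores to interpret the attention flow inside a Transformer, introduces a Causal Contribution Score (CCS) to quantify each token's factual dependency, and adds a fact-anchored graph reweighting layer that dynamically reduces the influence of hallucination-prone nodes during generation.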

Details

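Motivation: LLMs excel at language understanding and generation but suffer from serious hallucinations: outputs that are factually incorrect, misleading, or unsupported by the input data. These cause serious problems in scenarios such as medical diagnosis or legal reasoning.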

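Result: On standard benchmarks such as TruthfulQA and HotpotQA, the method reduces the hallucination rate by 27.8% and improves factual accuracy by 16.4% relative to baseline retrieval-augmented generation (RAG) models.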

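Insight: The contribution is a causal graph-attention perspective: building token-level graphs, quantifying causal contribution, and applying dynamic graph reweighting to improve factual reliability. This offers a new angle on the interpretability, robustness, and factual reliability of LLMs; in particular, combining attention flow with gradient information to identify and suppress sources of hallucination is worth noting.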

Abstract: This paper primarily focuses on the hallucinations produced by AI language models (LLMs). LLMs have shown extraordinary language understanding and generation capabilities. Still, they have a major disadvantage: hallucinations, which are outputs that are factually incorrect, misleading, or unsupported by input data. These hallucinations cause serious problems in scenarios like medical diagnosis or legal reasoning. In this work, we propose a causal graph attention network (GCAN) framework that reduces hallucinations by interpreting the internal attention flow within a transformer architecture, constructing token-level graphs that combine self-attention weights and gradient-based influence scores. Our method quantifies each token's factual dependency using a new metric called the Causal Contribution Score (CCS). We further introduce a fact-anchored graph reweighting layer that dynamically reduces the influence of hallucination-prone nodes during generation. Experiments on standard benchmarks such as TruthfulQA and HotpotQA show a 27.8 percent reduction in hallucination rate and a 16.4 percent improvement in factual accuracy over baseline retrieval-augmented generation (RAG) models. This work contributes to the interpretability, robustness, and factual reliability of future LLM architectures.


[17] Shorter, but Still Trustworthy? An Empirical Study of Chain-of-Thought Compression cs.CL

Lingjie Zeng, Xiaofan Chen, Yanbo Wang, Xiuying Chen

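TL;DR: This paper presents the first systematic empirical study of how chain-of-thought (CoT) compression affects model trustworthiness, evaluating models of different scales along three dimensions: safety, hallucination resistance, and multilingual robustness. CoT compression frequently degrades trustworthiness, and different methods degrade very differently across dimensions. The authors propose a normalized efficiency score for fair comparison and introduce an alignment-aware DPO variant that shortens reasoning with substantially smaller trustworthiness loss.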

Details

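Motivation: Existing work on CoT compression focuses on task accuracy and token savings, but compression modifies the same parameter space that encodes trustworthiness properties, so preserving accuracy does not, a priori, guarantee preserving trustworthiness. The paper investigates how CoT compression affects model trustworthiness.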

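Result: Under controlled comparisons, CoT compression frequently introduces trustworthiness regressions, and different methods exhibit markedly different degradation profiles across dimensions. The proposed alignment-aware DPO variant reduces CoT length by 19.3% on reasoning benchmarks with substantially smaller trustworthiness loss.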

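Insight: The contribution is the first systematic evaluation of how CoT compression affects trustworthiness, along with a normalized efficiency score that reveals trade-offs that naive scalar metrics can obscure. Viewed objectively, the study argues that CoT compression should treat efficiency and trustworthiness as equally important design constraints instead of optimizing for efficiency alone.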

Abstract: Long chain-of-thought (Long-CoT) reasoning models have motivated a growing body of work on compressing reasoning traces to reduce inference cost, yet existing evaluations focus almost exclusively on task accuracy and token savings. Trustworthiness properties, whether acquired or reinforced through post-training, are encoded in the same parameter space that compression modifies. This means preserving accuracy does not, a priori, guarantee preserving trustworthiness. We conduct the first systematic empirical study of how CoT compression affects model trustworthiness, evaluating multiple models of different scales along three dimensions: safety, hallucination resistance, and multilingual robustness. Under controlled comparisons, we find that CoT compression frequently introduces trustworthiness regressions and that different methods exhibit markedly different degradation profiles across dimensions. To enable fair comparison across bases, we propose a normalized efficiency score for each dimension that reveals how naïve scalar metrics can obscure trustworthiness trade-offs. As an existence proof, we further introduce an alignment-aware DPO variant that reduces CoT length by 19.3% on reasoning benchmarks with substantially smaller trustworthiness loss. Our findings suggest that CoT compression should be optimized not only for efficiency but also for trustworthiness, treating both as equally important design constraints.


[18] Position: Logical Soundness is not a Reliable Criterion for Neurosymbolic Fact-Checking with LLMs cs.CLPDF

Jason Chan, Robert Gaizauskas, Zhixue Zhao

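TL;DR: This position paper criticizes treating logical soundness as the sole reliability criterion in LLM-based neurosymbolic fact-checking systems. The authors point out that logically sound conclusions can systematically elicit human inferences that the premises do not support, so misleading claims go undetected. They therefore advocate treating the human-like reasoning tendencies of LLMs as a feature and using them to validate the outputs of the formal components in neurosymbolic systems, complementing purely logical approaches.

Details

Motivation: As LLMs are increasingly integrated into fact-checking pipelines, formal logic is often proposed as a rigorous means of mitigating bias, errors, and hallucinations in model outputs. The authors argue, however, that approaches relying on logical soundness are structurally unable to detect misleading claims, because of systematic divergences between logically sound conclusions and the inferences humans typically make and accept.

Result: The paper reports no quantitative experiments or benchmark results. Instead, drawing on studies in cognitive science and pragmatics, it presents a typology of cases in which logically sound conclusions systematically elicit human inferences unsupported by the premises.

Insight: The core contribution is to challenge the paradigm of over-reliance on formal logic in neurosymbolic systems and to offer a complementary view: treating the human-like reasoning biases of LLMs as an exploitable feature rather than a defect, used to identify and validate outputs of formal logic components that may be misleading to humans. This suggests a direction for designing fact-checking systems that are more robust and better aligned with human cognition.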

Abstract: As large language models (LLMs) are increasingly integrated into fact-checking pipelines, formal logic is often proposed as a rigorous means by which to mitigate bias, errors and hallucinations in these models’ outputs. For example, some neurosymbolic systems verify claims by using LLMs to translate natural language into logical formulae and then checking whether the proposed claims are logically sound, i.e. whether they can be validly derived from premises that are verified to be true. We argue that such approaches structurally fail to detect misleading claims due to systematic divergences between conclusions that are logically sound and inferences that humans typically make and accept. Drawing on studies in cognitive science and pragmatics, we present a typology of cases in which logically sound conclusions systematically elicit human inferences that are unsupported by the underlying premises. Consequently, we advocate for a complementary approach: leveraging the human-like reasoning tendencies of LLMs as a feature rather than a bug, and using these models to validate the outputs of formal components in neurosymbolic systems against potentially misleading conclusions.


[19] DARE: Diffusion Large Language Models Alignment and Reinforcement Executor cs.CLPDF

Jingyi Yang, Yuxian Jiang, Xuhao Hu, Shuang Cheng, Biqing Qi

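TL;DR: This paper presents DARE, a framework that addresses the fragmentation of post-training pipelines in the open-source ecosystem for diffusion large language models (dLLMs). Built on verl and OpenCompass, it unifies supervised fine-tuning, parameter-efficient fine-tuning, preference optimization, and dLLM-specific reinforcement learning under a shared execution stack for masked and block diffusion language models, and supports representative model families such as LLaDA, Dream, SDAR, and LLaDA2.x for reproducible benchmark evaluation and practical acceleration.

Details

Motivation: dLLMs are emerging as an alternative to autoregressive models, but their open-source ecosystem is fragmented across post-training pipelines (reinforcement learning objectives, rollout implementations, and evaluation scripts), which slows research iteration, raises the engineering burden of reproduction, and makes fair comparison across algorithms difficult.

Result: Extensive empirical results indicate that DARE provides a reusable research substrate for developing, comparing, and deploying post-training methods for current and emerging dLLMs; the abstract does not report specific benchmarks or quantitative results (such as SOTA comparisons).

Insight: The contribution is a unified post-training and evaluation framework that integrates multiple fine-tuning and optimization techniques and provides dedicated reinforcement learning support designed for the parallel generation of dLLMs, which helps standardize research workflows and enables fair comparison.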

Abstract: Diffusion large language models (dLLMs) are emerging as a compelling alternative to dominant autoregressive models, replacing strictly sequential token generation with iterative denoising and parallel generation dynamics. However, their open-source ecosystem remains fragmented across model families and, in particular, across post-training pipelines, where reinforcement learning objectives, rollout implementations and evaluation scripts are often released as paper-specific codebases. This fragmentation slows research iteration, raises the engineering burden of reproduction, and makes fair comparison across algorithms difficult. We present DARE (dLLMs Alignment and Reinforcement Executor), an open framework for post-training and evaluating dLLMs. Built on top of verl and OpenCompass, DARE unifies supervised fine-tuning, parameter-efficient fine-tuning, preference optimization, and dLLM-specific reinforcement learning under a shared execution stack for both masked and block diffusion language models. Across representative model families including LLaDA, Dream, SDAR, and LLaDA2.x, DARE provides broad algorithmic coverage, reproducible benchmark evaluation, and practical acceleration. Extensive empirical results indicate that DARE serves as a reusable research substrate for developing, comparing, and deploying post-training methods for current and emerging dLLMs.


[20] Benchmarking Multi-turn Medical Diagnosis: Hold, Lure, and Self-Correction cs.CLPDF

Jinrui Fang, Runhan Chen, Xu Yang, Jian Yu, Jiawei Xu

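TL;DR: This paper introduces MINT, a benchmark for evaluating LLMs on multi-turn medical diagnosis. It reveals behavioral patterns including premature answering, self-correction, and answering triggered by strong lures, and recommends deferring the diagnostic question to later turns and reserving salient clinical evidence for later turns.

Details

Motivation: The work examines how LLMs behave under multi-turn evidence accumulation, which is closer to real clinical reasoning, rather than only measuring diagnostic accuracy when all information is provided in a single turn.

Result: Evaluating 11 LLMs on MINT, the authors find that over 55% of answers are committed within the first two turns, that incorrect-to-correct revisions occur at up to 10.6 times the rate of correct-to-incorrect flips, and that premature commitment triggered by salient evidence can cause an accuracy drop of up to 23.3%.

Insight: The contribution is MINT, a high-fidelity multi-turn medical diagnosis benchmark, together with a systematic account of model behavior in multi-turn diagnosis and two strategies for improving reliability: deferring the diagnostic question and managing when salient evidence is presented.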

Abstract: Large language models (LLMs) achieve high accuracy in medical diagnosis when all clinical information is provided in a single turn, yet how they behave under multi-turn evidence accumulation closer to real clinical reasoning remains unexplored. We introduce MINT (Medical Incremental N-Turn Benchmark), a high-fidelity, multi-turn medical diagnosis benchmark comprising 1,035 cases with clinically labeled evidence shards, controlled turn granularity, and information-preserving decomposition. Through systematic evaluation of 11 LLMs on MINT, we uncover three persistent behavioral patterns that significantly impact diagnostic decisions: (1) intent to answer, models rush to answer before sufficient evidence has been observed, with over 55% of answers committed within the first two turns; (2) self-correction, incorrect-to-correct answer revisions occur at up to 10.6 times the rate of correct-to-incorrect flips, revealing a latent capacity for self-correction that premature commitment forecloses; and (3) strong lures, clinically salient information such as laboratory results triggers premature answering even when models are explicitly instructed to wait. We translate these findings into clinically actionable guidance: deferring the diagnostic question to later turns reduces premature answering and improves accuracy at the first point of commitment by up to 62.6%, while reserving salient clinical evidence for later turns prevents a catastrophic accuracy drop of up to 23.3% caused by premature commitment. Our work provides both a controlled evaluation framework and concrete recommendations for improving the reliability of LLMs in multi-turn medical diagnosis.
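The self-correction statistic above compares two kinds of answer revision. A minimal sketch of how such a ratio could be computed from per-turn answers is given below; the function names, the data layout, and the toy diagnoses are assumptions for illustration, not the MINT evaluation code, and the sketch assumes the model commits to an answer at every turn.

```python
from typing import List, Optional, Sequence, Tuple


def count_flips(answers: Sequence[str], gold: str) -> Tuple[int, int]:
    """Count answer revisions between consecutive turns of one case.

    Returns (incorrect_to_correct, correct_to_incorrect).
    """
    i2c = c2i = 0
    for prev, curr in zip(answers, answers[1:]):
        if prev != gold and curr == gold:
            i2c += 1
        elif prev == gold and curr != gold:
            c2i += 1
    return i2c, c2i


def flip_ratio(cases: List[Tuple[Sequence[str], str]]) -> Optional[float]:
    """Ratio of incorrect-to-correct to correct-to-incorrect revisions.

    Returns None when no correct-to-incorrect flip was observed.
    """
    total_i2c = total_c2i = 0
    for answers, gold in cases:
        i2c, c2i = count_flips(answers, gold)
        total_i2c += i2c
        total_c2i += c2i
    if total_c2i == 0:
        return None
    return total_i2c / total_c2i


if __name__ == "__main__":
    # Hypothetical cases: (answers per turn, gold diagnosis).
    toy_cases = [
        (["flu", "pneumonia", "pneumonia"], "pneumonia"),  # one self-correction
        (["asthma", "copd"], "asthma"),                    # one harmful flip
        (["gout", "lupus", "arthritis"], "arthritis"),     # one self-correction
    ]
    print(flip_ratio(toy_cases))  # 2.0
```

A ratio above 1 means revisions help more often than they hurt, which is the pattern the abstract reports.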


[21] Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in Visual Document Understanding cs.CL | cs.AI | cs.CVPDF

Haruka Kawasaki, Ryota Tanaka, Kyosuke Nishida

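TL;DR: This paper studies the gap between the internal representations and the generated responses of large vision language models (LVLMs) on visual document understanding (VDU) tasks. It finds that intermediate layers encode task-relevant information more linearly than the final layer, and it narrows the gap by fine-tuning intermediate layers.

Details

Motivation: Current VDU benchmarks evaluate only generated responses, which may not reflect whether the model has actually captured the required information internally, so the mismatch between internal representations and responses needs to be investigated.

Result: Experiments show that fine-tuning intermediate layers improves both linear probing accuracy and response accuracy while narrowing the gap between internal representations and responses.

Insight: The contribution is to reveal the disconnect between internal representations and responses of LVLMs on VDU tasks and to propose intermediate-layer fine-tuning to improve performance, offering a new view into how these models work internally.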

Abstract: Visual document understanding (VDU) is a challenging task for large vision language models (LVLMs), requiring the integration of visual perception, text recognition, and reasoning over structured layouts. Although recent LVLMs have shown progress on VDU benchmarks, their performance is typically evaluated based on generated responses, which may not necessarily reflect whether the model has actually captured the required information internally. In this paper, we investigate how information required to solve VDU tasks is represented across different layers of LLMs within LVLMs using linear probing. Our study reveals that (1) there is a clear gap between internal representations and generated responses, and (2) information required to solve the task is often encoded more linearly from intermediate layers than from the final layer. Motivated by these findings, we explore fine-tuning strategies that target intermediate layers. Experiments show that fine-tuning intermediate layers improves both linear probing accuracy and response accuracy while narrowing the gap.


[22] Structured Causal Video Reasoning via Multi-Objective Alignment cs.CLPDF

Zinuo Li, Yongxin Guo, Jun Liu, Jiawei Zhan, Xi Jiang

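TL;DR: This paper proposes a structured causal video reasoning method that builds a compact representation called Structured Event Facts to capture the salient events in a video and their causal relationships, addressing the reliance of existing video LLMs on unstructured reasoning. To train models effectively, the authors introduce the CausalFact-60K dataset and a four-stage training pipeline comprising facts alignment, format warm-start, thinking warm-start, and reinforcement learning-based post-training. In the RL stage, they formulate the optimization as a multi-objective reinforcement learning problem to balance structural completeness, causal fidelity, and reasoning length. The resulting Factum-4B model yields more reliable reasoning and stronger performance on video understanding tasks that require fine-grained temporal inference.

Details

Motivation: Human understanding of video dynamics is typically grounded in a structured mental representation of entities, actions, and temporal relations, whereas existing video LLMs rely mainly on unstructured reasoning that embeds critical visual evidence in verbose textual descriptions and models temporal causality weakly, leading to inefficient reasoning and fragile causal inference. The paper aims to bridge this cognitive gap with a structured prior that promotes concise and causally grounded reasoning.

Result: Factum-4B yields more reliable reasoning and stronger performance on challenging video understanding tasks requiring fine-grained temporal inference; the abstract does not name specific benchmarks or give quantitative comparisons with existing SOTA models.

Insight: The contributions are Structured Event Facts, used as an explicit constraint to make reasoning more concise and causally grounded, and a multi-objective reinforcement learning framework that optimizes the trade-off among structural completeness, causal fidelity, and reasoning length. Together they offer a reusable approach to structured reasoning and balanced multi-objective optimization for video understanding.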

Abstract: Human understanding of video dynamics is typically grounded in a structured mental representation of entities, actions, and temporal relations, rather than relying solely on immediate deductive reasoning. In contrast, existing Video-LLMs largely depend on unstructured video reasoning, where critical visual evidence is embedded in verbose textual descriptions and temporal causality is often weakly modeled. This leads to inefficient processes and fragile causal inference. To bridge this cognitive gap, we propose constructing a compact representation of salient events and their causal relationships, which we name Structured Event Facts, prior to the reasoning stage. This structured prior serves as an explicit constraint to promote concise and causally grounded reasoning, while also making intermediate evidence easier to verify. To effectively train models on such structured facts, we introduce CausalFact-60K and a four-stage training pipeline comprising facts alignment, format warm-start, thinking warm-start, and reinforcement learning-based post-training. During the RL stage, we find that this framework introduces competing objectives, as structural completeness and causal fidelity must be balanced against reasoning length, making it difficult to optimize. We address this challenge by formulating the optimization as a Multi-Objective Reinforcement Learning (MORL) problem and explicitly optimizing toward the Pareto-Frontier to balance these trade-offs. As a result, we introduce Factum-4B, which yields more reliable reasoning and delivers stronger performance on challenging video understanding tasks requiring fine-grained temporal inference.
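To make the notion of a Pareto frontier over these three objectives concrete, here is a minimal sketch of Pareto dominance on scored rollouts. It illustrates the concept only and is not the paper's MORL training procedure; the `Rollout` fields, their scales, and the toy scores are assumptions.

```python
from typing import List, NamedTuple


class Rollout(NamedTuple):
    """Hypothetical per-rollout scores for the three competing objectives."""
    completeness: float  # structural completeness, higher is better
    fidelity: float      # causal fidelity, higher is better
    length: int          # reasoning length in tokens, lower is better


def dominates(a: Rollout, b: Rollout) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    no_worse = (
        a.completeness >= b.completeness
        and a.fidelity >= b.fidelity
        and a.length <= b.length
    )
    strictly_better = (
        a.completeness > b.completeness
        or a.fidelity > b.fidelity
        or a.length < b.length
    )
    return no_worse and strictly_better


def pareto_front(rollouts: List[Rollout]) -> List[Rollout]:
    """Keep the rollouts that no other rollout dominates."""
    return [r for r in rollouts if not any(dominates(o, r) for o in rollouts)]


if __name__ == "__main__":
    candidates = [
        Rollout(0.9, 0.8, 400),
        Rollout(0.7, 0.9, 250),
        Rollout(0.6, 0.7, 300),  # dominated by the second candidate
        Rollout(0.9, 0.8, 500),  # dominated by the first candidate
    ]
    for r in pareto_front(candidates):
        print(r)  # prints the first two candidates
```

The two surviving candidates trade completeness against fidelity and length, so neither can be preferred without choosing a weighting; that is the tension the abstract describes.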


[23] DeonticBench: A Benchmark for Reasoning over Rules cs.CLPDF

Guangyao Dou, Luis Brena, Akhil Deo, William Jurayj, Jingyu Zhang

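TL;DR: This paper introduces DeonticBench, a benchmark for evaluating LLMs on context-grounded deontic reasoning (reasoning about obligations, permissions, and prohibitions) in real-world domains such as tax law, airline baggage policies, immigration law, and housing law. The benchmark contains 6,232 tasks and supports both free-form chain-of-thought reasoning and an optional solver-based workflow that translates statutes into executable Prolog programs.

Details

Motivation: LLMs still struggle with complex, context-specific rule reasoning, especially long-context, high-stakes deontic reasoning, while existing benchmarks mostly emphasize short-context mathematical reasoning and rarely cover such real-world settings. A dedicated benchmark is needed to fill this gap.

Result: On DeonticBench, the best hard-subset performance across frontier LLMs and coding models reaches only 44.4% on SARA Numeric and 46.6 macro-F1 on Housing. Supervised fine-tuning and reinforcement learning improve Prolog generation quality, but current RL methods still fail to solve these tasks reliably.

Insight: The contribution is a multi-domain deontic reasoning benchmark that supports both symbolic (Prolog programs) and non-symbolic (natural language reasoning) evaluation, with an optional solver-based formal workflow, providing a new tool for studying real-world rule reasoning.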

Abstract: Reasoning with complex, context-specific rules remains challenging for large language models (LLMs). In legal and policy settings, this manifests as deontic reasoning: reasoning about obligations, permissions, and prohibitions under explicit rules. While many recent benchmarks emphasize short-context mathematical reasoning, fewer focus on long-context, high-stakes deontic reasoning. To address this gap, we introduce DEONTICBENCH, a benchmark of 6,232 tasks across U.S. federal taxes, airline baggage policies, U.S. immigration administration, and U.S. state housing law. These tasks can be approached in multiple ways, including direct reasoning in language or with the aid of symbolic computation. Besides free-form chain-of-thought reasoning, DEONTICBENCH enables an optional solver-based workflow in which models translate statutes and case facts into executable Prolog, leading to formal problem interpretations and an explicit program trace. We release reference Prolog programs for all instances. Across frontier LLMs and coding models, best hard-subset performance reaches only 44.4% on SARA Numeric and 46.6 macro-F1 on Housing. We further study training with supervised fine-tuning and reinforcement learning for symbolic program generation. Although training improves Prolog generation quality, current RL methods still fail to solve these tasks reliably. Overall, DEONTICBENCH provides a benchmark for studying context-grounded rule reasoning in real-world domains under both symbolic and non-symbolic settings.


[24] PassiveQA: A Three-Action Framework for Epistemically Calibrated Question Answering via Supervised Finetuning cs.CL | cs.AIPDF

Madhav S Baidya

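TL;DR: This paper proposes PassiveQA, a framework that uses supervised finetuning to give LLMs the ability to judge information sufficiency in question answering, choosing among three actions depending on how complete the query is: answer, ask for clarification, or abstain.

Details

Motivation: In real-world settings, LLMs tend to produce overconfident or hallucinated responses when queries are incomplete, ambiguous, or missing critical variables.

Result: Experiments across multiple QA datasets show that the finetuned planner achieves significant improvements in macro F1 and abstention recall while reducing hallucination rates, under a compute-constrained training regime.

Insight: The contribution is to combine structured information-state representations, knowledge graph-grounded context, and decision reasoning that explicitly models missing variables. The authors present the results as evidence that epistemic decision-making must be learned during training rather than imposed at inference time.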

Abstract: Large Language Models (LLMs) have achieved strong performance in question answering and retrieval-augmented generation (RAG), yet they implicitly assume that user queries are fully specified and answerable. In real-world settings, queries are often incomplete, ambiguous, or missing critical variables, leading models to produce overconfident or hallucinated responses. In this work, we study decision-aware query resolution under incomplete information, where a model must determine whether to Answer, Ask for clarification, or Abstain. We show that standard and enhanced RAG systems do not reliably exhibit such epistemic awareness, defaulting to answer generation even when information is insufficient. To address this, we propose PassiveQA, a three-action framework that aligns model behaviour with information sufficiency through supervised finetuning. Our approach integrates structured information-state representations, knowledge graph-grounded context, and a finetuned planner that explicitly models missing variables and decision reasoning. Experiments across multiple QA datasets show that the finetuned planner achieves significant improvements in macro F1 and abstention recall while reducing hallucination rates, under a compute-constrained training regime. These results provide strong empirical evidence that epistemic decision-making must be learned during training rather than imposed at inference time.
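The two headline metrics here, macro F1 over the three actions and abstention recall, can be sketched as follows. This is a minimal illustration using the standard definitions; the label strings, function names, and toy predictions are assumptions, and the paper's exact evaluation protocol is not described in the abstract.

```python
from typing import Dict, Sequence, Tuple

ACTIONS = ("answer", "ask", "abstain")  # label strings are an assumption


def prf(gold: Sequence[str], pred: Sequence[str], label: str) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one action label."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def evaluate(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """Macro F1 (unweighted mean over actions) and recall on the abstain action."""
    scores = {a: prf(gold, pred, a) for a in ACTIONS}
    macro_f1 = sum(s[2] for s in scores.values()) / len(ACTIONS)
    return {"macro_f1": macro_f1, "abstention_recall": scores["abstain"][1]}


if __name__ == "__main__":
    # Hypothetical planner that over-answers, as the abstract says RAG systems tend to.
    gold = ["answer", "answer", "ask", "ask", "abstain", "abstain"]
    pred = ["answer", "answer", "answer", "ask", "abstain", "answer"]
    result = evaluate(gold, pred)
    print(round(result["macro_f1"], 3), result["abstention_recall"])  # 0.667 0.5
```

Macro averaging weights the three actions equally, so a system that defaults to answering is penalized on the ask and abstain classes even if answerable queries dominate the data.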


[25] Is a Picture Worth a Thousand Words? Adaptive Multimodal Fact-Checking with Visual Evidence Necessity cs.CL | cs.AI | cs.CVPDF

Jaeyoon Jung, Yejun Yoon, Kunwoo Park

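TL;DR: This paper proposes AMuFC, a framework in which two collaborative agents (an Analyzer and a Verifier) adaptively decide whether to use visual evidence for fact-checking. It challenges the assumption that visual evidence universally improves performance and validates the approach on three datasets.

Details

Motivation: The paper challenges the prevailing assumption in multimodal fact-checking that visual evidence always improves performance, and addresses the problem that indiscriminate use of visual evidence can reduce accuracy.

Result: Experiments on three datasets show that incorporating the Analyzer's assessment of visual evidence necessity into the Verifier's prediction yields substantial improvements in verification performance.

Insight: The contribution is an adaptive mechanism for judging whether visual evidence is necessary, realized through a framework of collaborative agents with separate roles. In addition, the newly released WebFC dataset supports evaluation in a more realistic scenario.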

Abstract: Automated fact-checking is a crucial task not only in journalism but also across web platforms, where it supports a responsible information ecosystem and mitigates the harms of misinformation. While recent research has progressed from text-only to multimodal fact-checking, a prevailing assumption is that incorporating visual evidence universally improves performance. In this work, we challenge this assumption and show that indiscriminate use of multimodal evidence can reduce accuracy. To address this challenge, we propose AMuFC, a multimodal fact-checking framework that employs two collaborative agents with distinct roles for the adaptive use of visual evidence: An Analyzer determines whether visual evidence is necessary for claim verification, and a Verifier predicts claim veracity conditioned on both the retrieved evidence and the Analyzer’s assessment. Experimental results on three datasets show that incorporating the Analyzer’s assessment of visual evidence necessity into the Verifier’s prediction yields substantial improvements in verification performance. In addition to all code, we release WebFC, a newly constructed dataset for evaluating fact-checking modules in a more realistic scenario, available at https://github.com/ssu-humane/AMuFC.


[26] IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation cs.CLPDF

Anjali Kantharuban, Aarohi Srivastava, Fahim Faisal, Orevaoghene Ahia, Antonios Anastasopoulos

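TL;DR: IDIOLEX is a framework for learning continuous sentence representations that capture stylistic and dialectal variation, decoupled from semantic content. It combines supervision from a sentence's provenance with linguistic features of the sentence's content. Evaluated on dialects of Arabic and Spanish, the representations capture meaningful variation, transfer across domains, and can be used for stylistically aligning language models.

Details

Motivation: Existing sentence representations primarily encode what a sentence says and neglect how it is expressed, even though the latter matters for many applications. The paper aims to develop sentence representations that capture style and dialect, decoupled from semantic content, a task the authors call idiolectal representation learning.

Result: Evaluation on dialects of Arabic and Spanish shows that the learned representations capture meaningful variation and transfer across domains for analysis and classification. Used as training objectives, the representations also help stylistically align language models.

Insight: The contribution is a unified framework that combines provenance supervision with linguistic features to learn continuous style and dialect representations. The core insight is to jointly model individual and community-level variation, which offers a useful perspective for studying idiolect and supports downstream applications that require stylistic sensitivity, such as developing diverse and accessible LLMs.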

Abstract: Existing sentence representations primarily encode what a sentence says, rather than how it is expressed, even though the latter is important for many applications. In contrast, we develop sentence representations that capture style and dialect, decoupled from semantic content. We call this the task of idiolectal representation learning. We introduce IDIOLEX, a framework for training models that combines supervision from a sentence’s provenance with linguistic features of a sentence’s content, to learn a continuous representation of each sentence’s style and dialect. We evaluate the approach on dialects of both Arabic and Spanish. The learned representations capture meaningful variation and transfer across domains for analysis and classification. We further explore the use of these representations as training objectives for stylistically aligning language models. Our results suggest that jointly modeling individual and community-level variation provides a useful perspective for studying idiolect and supports downstream applications requiring sensitivity to stylistic differences, such as developing diverse and accessible LLMs.


[27] BiST: A Gold Standard Bangla-English Bilingual Corpus for Sentence Structure and Tense Classification with Inter-Annotator Agreement cs.CL | cs.AIPDF

Abdullah Al Shafi, Swapnil Kundu Argha, M. A. Moyeen, Abdul Muntakim, Shoumik Barman Polok

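TL;DR: This paper introduces BiST, a rigorously annotated Bangla-English bilingual corpus for sentence-level grammatical classification, annotated along two dimensions: syntactic structure (Simple, Complex, Compound, Complex-Compound) and tense (Present, Past, Future). The corpus contains 30,534 sentences, uses a multi-stage annotation framework to ensure quality, and comes with baseline results on grammatical modeling tasks.

Details

Motivation: The paper addresses the scarcity of high-quality bilingual resources for low-resource languages, Bangla in particular, and provides reliable grammatical annotations for multilingual NLP research.

Result: Inter-annotator agreement measured by Fleiss' Kappa reaches 0.82 for structural annotation and 0.88 for tense annotation. Baseline evaluations show that dual-encoder architectures leveraging complementary language-specific representations consistently outperform strong multilingual encoders.

Insight: The contribution is a unified resource for bilingual grammatical modeling. With multi-dimensional grammatical annotation (structure and tense) and rigorous agreement validation, it provides explicit linguistic supervision for tasks such as controlled text generation, automated feedback generation, and cross-lingual representation learning.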

Abstract: High-quality bilingual resources remain a critical bottleneck for advancing multilingual NLP in low-resource settings, particularly for Bangla. To mitigate this gap, we introduce BiST, a rigorously curated Bangla-English corpus for sentence-level grammatical classification, annotated across two fundamental dimensions: syntactic structure (Simple, Complex, Compound, Complex-Compound) and tense (Present, Past, Future). The corpus is compiled from open-licensed encyclopedic sources and naturally composed conversational text, followed by systematic preprocessing and automated language identification, resulting in 30,534 sentences, including 17,465 English and 13,069 Bangla instances. Annotation quality is ensured through a multi-stage framework with three independent annotators and dimension-wise Fleiss Kappa ($κ$) agreement, yielding reliable and reproducible labels with $κ$ values of 0.82 and 0.88 for structural and temporal annotation, respectively. Statistical analyses demonstrate realistic structural and temporal distributions, while baseline evaluations show that dual-encoder architectures leveraging complementary language-specific representations consistently outperform strong multilingual encoders. Beyond benchmarking, BiST provides explicit linguistic supervision that supports grammatical modeling tasks, including controlled text generation, automated feedback generation, and cross-lingual representation learning. The corpus establishes a unified resource for bilingual grammatical modeling and facilitates linguistically grounded multilingual research.
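Fleiss' Kappa, the agreement statistic reported above, follows a standard formula: with $N$ items, $n$ raters per item, and $n_{ij}$ raters assigning item $i$ to category $j$, the per-item agreement is $P_i = (\sum_j n_{ij}^2 - n) / (n(n-1))$, the chance agreement is $P_e = \sum_j p_j^2$ with $p_j = \sum_i n_{ij} / (Nn)$, and $κ = (\bar{P} - P_e) / (1 - P_e)$. A minimal sketch follows; the function name and the toy tense labels are assumptions, not BiST data.

```python
from collections import Counter
from typing import List, Sequence


def fleiss_kappa(ratings: List[Sequence[str]]) -> float:
    """Fleiss' kappa for items that were each labeled by the same number of raters.

    `ratings[i]` holds the labels assigned to item i, one per rater.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals: Counter = Counter()
    per_item_agreement = []
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # P_i: fraction of rater pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        per_item_agreement.append(agreeing / (n_raters * (n_raters - 1)))

    p_bar = sum(per_item_agreement) / n_items
    total = n_items * n_raters
    # P_e: agreement expected by chance from the overall label distribution.
    p_e = sum((c / total) ** 2 for c in category_totals.values())
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_bar - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical tense labels from three annotators on four sentences.
    toy = [
        ["Present", "Present", "Present"],
        ["Past", "Past", "Past"],
        ["Future", "Future", "Present"],
        ["Past", "Past", "Past"],
    ]
    print(round(fleiss_kappa(toy), 3))  # 0.727
```

In the toy data a single disagreement on one of four sentences already pulls the score down to about 0.73, which gives a sense of how much agreement the reported 0.82 and 0.88 represent.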


[28] What Makes Good Multilingual Reasoning? Disentangling Reasoning Traces with Measurable Features cs.CL | cs.AIPDF

Dayeon Ki, Kevin Duh, Marine Carpuat

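TL;DR: This paper investigates the root causes of performance differences across languages in multilingual reasoning models, challenging the assumption that the gap can be closed by making reasoning in every language resemble English reasoning. The authors define a suite of measurable reasoning features, analyze how they associate with answer accuracy, and find that these associations vary considerably across languages and can even reverse.

Details

Motivation: Large reasoning models show large performance gaps between English and other languages, and current work often assumes the gap can be closed by making reasoning in other languages imitate English reasoning. The paper challenges this assumption, asking what actually characterizes effective reasoning in multilingual settings and to what extent English-derived reasoning features genuinely help in other languages.

Result: Experiments across two mathematical reasoning benchmarks, four large reasoning models, and 10 languages show that most of the defined reasoning features are positively associated with answer accuracy, but the strength of association varies considerably across languages and even reverses in some.

Insight: The contribution is a suite of quantifiable multilingual reasoning features (covering multilingual alignment, reasoning step, and reasoning flow), together with sparse autoencoders that automatically discover latent reasoning concepts. The core insight challenges English-centric reward designs and points toward adaptive objectives that accommodate language-specific reasoning patterns, with concrete implications for multilingual benchmark and reward design.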

Abstract: Large Reasoning Models (LRMs) still exhibit large performance gaps between English and other languages, yet much current work assumes these gaps can be closed simply by making reasoning in every language resemble English reasoning. This work challenges this assumption by asking instead: what actually characterizes effective reasoning in multilingual settings, and to what extent do English-derived reasoning features genuinely help in other languages? We first define a suite of measurable reasoning features spanning multilingual alignment, reasoning step, and reasoning flow aspects of reasoning traces, and use logistic regression to quantify how each feature associates with final answer accuracy. We further train sparse autoencoders over multilingual traces to automatically discover latent reasoning concepts that instantiate or extend these features. Finally, we use the features as test-time selection policies to examine whether they can steer models toward stronger multilingual reasoning. Across two mathematical reasoning benchmarks, four LRMs, and 10 languages, we find that most features are positively associated with accuracy, but the strength of association varies considerably across languages and can even reverse in some. Our findings challenge English-centric reward designs and point toward adaptive objectives that accommodate language-specific reasoning patterns, with concrete implications for multilingual benchmark and reward design.


[29] Metaphors We Compute By: A Computational Audit of Cultural Translation vs. Thinking in LLMs cs.CL | cs.AIPDF

Yuan Chang, Jiaming Qu, Zhu Li

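TL;DR: This paper uses a computational audit to assess the cultural inclusivity of LLMs in a creative writing task, asking whether the model acts as a culturally diverse creative partner or merely as a cultural translator. In a metaphor generation task spanning five cultural settings, the model exhibits stereotyped metaphor usage and Western defaultism, suggesting that merely prompting with a cultural identity does not guarantee culturally grounded reasoning.

Details

Motivation: The paper asks whether LLMs truly conduct culture-aware reasoning, as opposed to merely handling many languages, in order to distinguish culturally diverse creative partnership from cultural translation.

Result: In a metaphor generation task spanning five cultural settings, the model exhibits stereotyped metaphor usage for certain settings as well as Western defaultism, empirically exposing limits in its cultural reasoning.

Insight: The contribution is a computational audit that quantifies the cultural inclusivity of LLMs, exposes bias in their cultural reasoning, and highlights the inadequacy of relying on cultural identity prompts alone, pointing to directions for improving cultural adaptability in future models.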

Abstract: Large language models (LLMs) are often described as multilingual because they can understand and respond in many languages. However, speaking a language is not the same as reasoning within a culture. This distinction motivates a critical question: do LLMs truly conduct culture-aware reasoning? This paper presents a preliminary computational audit of cultural inclusivity in a creative writing task. We empirically examine whether LLMs act as culturally diverse creative partners or merely as cultural translators that leverage a dominant conceptual framework with localized expressions. Using a metaphor generation task spanning five cultural settings and several abstract concepts as a case study, we find that the model exhibits stereotyped metaphor usage for certain settings, as well as Western defaultism. These findings suggest that merely prompting an LLM with a cultural identity does not guarantee culturally grounded reasoning.


[30] How Far Are We? Systematic Evaluation of LLMs vs. Human Experts in Mathematical Contest in Modeling cs.CLPDF

Yuhang Liu, Heyan Huang, Yizhe Yang, Hongyan Zhao, Zhizhuo Zeng

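TL;DR: This paper proposes a problem-oriented, stage-wise evaluation framework for systematically assessing the end-to-end problem-solving ability of LLMs on mathematical modeling competitions such as the China Postgraduate Mathematical Contest in Modeling. State-of-the-art LLMs perform well in early stages such as problem identification and formulation but show persistent deficiencies in execution-oriented stages including model solving, code implementation, and result analysis. These deficiencies persist even with increased model scale, revealing a comprehension-execution gap.

Details

Motivation: The paper evaluates the ability of LLMs to solve real-world problems that require end-to-end workflows, such as mathematical modeling, rather than only their reasoning ability on benchmarks.

Result: Under expert-verified criteria, the framework's automatic scores align substantially better with independent human expert judgments than existing evaluation schemes. The evaluation shows that state-of-the-art LLMs perform well in the early stages of mathematical modeling contest problems but have significant gaps in the execution stages, with errors propagating across stages without correction.

Insight: The contribution is a reliable, stage-wise evaluation framework for systematically measuring the end-to-end problem-solving ability of LLMs. The core finding is a comprehension-execution gap whose failures trace to insufficient specification, missing verification, and lack of validation. This suggests that scaling models alone cannot solve complex real-world problems and that new approaches are needed, such as better specification, verification, and iteration mechanisms.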

Abstract: Large language models (LLMs) have achieved strong performance on reasoning benchmarks, yet their ability to solve real-world problems requiring end-to-end workflows remains unclear. Mathematical modeling competitions provide a stringent testbed for evaluating such end-to-end problem-solving capability. We propose a problem-oriented, stage-wise evaluation framework that assesses LLM performance across modeling stages using expert-verified criteria. We validate the framework’s reliability by comparing automatic scores with independent human expert judgments on problems from the China Postgraduate Mathematical Contest in Modeling, demonstrating substantially stronger alignment than existing evaluation schemes. Using this framework, we reveal a comprehension-execution gap in state-of-the-art LLMs: while they perform well in early stages such as problem identification and formulation, they exhibit persistent deficiencies in execution-oriented stages including model solving, code implementation, and result analysis. These gaps persist even with increased model scale. We further trace these failures to insufficient specification, missing verification, and lack of validation, with errors propagating across stages without correction. Our findings suggest that bridging this gap requires approaches beyond model scaling, offering insights for applying LLMs to complex real-world problem solving.


[31] LiveFact: A Dynamic, Time-Aware Benchmark for LLM-Driven Fake News Detection cs.CL | cs.AIPDF

Cheng Xu, Changhong Jin, Yingjie Niu, Nan Yan, Yuke Mei

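TL;DR: LiveFact is a dynamic, time-aware benchmark for evaluating the reasoning ability of LLMs in fake news detection. It overcomes the data contamination problem of static benchmarks by simulating the real-world "fog of war" of incomplete and evolving information, and it introduces dual-mode evaluation and explicit monitoring of data contamination.

Details

Motivation: Current evaluation frameworks for fake news detection are static, which makes them vulnerable to benchmark data contamination and ineffective at assessing reasoning under temporal uncertainty.

Result: Tests with 22 LLMs show that open-source Mixture-of-Experts models such as Qwen3-235B-A22B now match or outperform proprietary state-of-the-art systems. The analysis also finds a significant "reasoning gap": capable models recognize unverifiable claims in early data slices.

Insight: The contribution is a dynamic, time-aware benchmark that simulates a realistic information environment and evaluates reasoning more thoroughly through dual-mode evaluation (Classification Mode and Inference Mode) and explicit monitoring of data contamination. It underlines the importance of temporal uncertainty and epistemic humility in AI verification.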

Abstract: The rapid development of Large Language Models (LLMs) has transformed fake news detection and fact-checking tasks from simple classification to complex reasoning. However, evaluation frameworks have not kept pace. Current benchmarks are static, making them vulnerable to benchmark data contamination (BDC) and ineffective at assessing reasoning under temporal uncertainty. To address this, we introduce LiveFact, a continuously updated benchmark that simulates the real-world “fog of war” in misinformation detection. LiveFact uses dynamic, temporal evidence sets to evaluate models on their ability to reason with evolving, incomplete information rather than on memorized knowledge. We propose a dual-mode evaluation: Classification Mode for final verification and Inference Mode for evidence-based reasoning, along with a component to monitor BDC explicitly. Tests with 22 LLMs show that open-source Mixture-of-Experts models, such as Qwen3-235B-A22B, now match or outperform proprietary state-of-the-art systems. More importantly, our analysis finds a significant “reasoning gap.” Capable models exhibit epistemic humility by recognizing unverifiable claims in early data slices, an aspect traditional static benchmarks overlook. LiveFact sets a sustainable standard for evaluating robust, temporally aware AI verification.


[32] Plausibility as Commonsense Reasoning: Humans Succeed, Large Language Models Do not cs.CL | cs.AIPDF

Sercan Karakaş

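TL;DR: Using attachment ambiguities in Turkish prenominal relative clauses, this paper tests whether LLMs integrate world knowledge with syntactic structure during ambiguity resolution the way humans do. Humans choose the attachment correctly based on event plausibility, whereas a range of LLMs show weak, unstable, or even reversed preferences, revealing limits in structure-sensitive reasoning.

Details

Motivation: The paper tests whether LLMs integrate world knowledge with syntactic structure in a human-like, structure-sensitive way during ambiguity resolution, focusing on Turkish attachment ambiguities in which the syntactic configuration stays fixed while event plausibility varies.

Result: In a speeded forced-choice experiment, human participants show a large, correctly directed plausibility effect. For Turkish and multilingual LLMs, evaluated by comparing mean per-token log-probabilities, plausibility-driven preference shifts are weak, unstable, or reversed, and fall short of human behavior.

Insight: The contribution is a Turkish diagnostic task in which the syntactic structure is fixed and attachment preference is distinguished only by graded event plausibility. It exposes reliability problems in how LLMs integrate world knowledge with syntactic structure, and it can serve as a cross-linguistic diagnostic beyond broad benchmarks.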

Abstract: Large language models achieve strong performance on many language tasks, yet it remains unclear whether they integrate world knowledge with syntactic structure in a human-like, structure-sensitive way during ambiguity resolution. We test this question in Turkish prenominal relative-clause attachment ambiguities, where the same surface string permits high attachment (HA) or low attachment (LA). We construct ambiguous items that keep the syntactic configuration fixed and ensure both parses remain pragmatically possible, while graded event plausibility selectively favors High Attachment vs. Low Attachment. The contrasts are validated with independent norming ratings. In a speeded forced-choice comprehension experiment, humans show a large, correctly directed plausibility effect. We then evaluate Turkish and multilingual LLMs in a parallel preference-based setup that compares matched HA/LA continuations via mean per-token log-probability. Across models, plausibility-driven shifts are weak, unstable, or reversed. The results suggest that, in the tested models, plausibility information does not guide attachment preferences as reliably as it does in human judgments, and they highlight Turkish RC attachment as a useful cross-linguistic diagnostic beyond broad benchmarks.
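The preference-based setup compares matched HA/LA continuations by mean per-token log-probability. A minimal sketch of that comparison is given below, assuming the per-token log-probabilities have already been obtained from a model; the function names, the item layout, the way the shift is aggregated, and all numbers are assumptions rather than the paper's pipeline.

```python
from typing import List, Sequence, Tuple


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized score of one continuation."""
    return sum(token_logprobs) / len(token_logprobs)


def ha_preference(ha_logprobs: Sequence[float], la_logprobs: Sequence[float]) -> float:
    """Positive when the model prefers the high-attachment continuation."""
    return mean_logprob(ha_logprobs) - mean_logprob(la_logprobs)


def plausibility_shift(items: List[Tuple[str, Sequence[float], Sequence[float]]]) -> float:
    """Mean HA preference on HA-plausible items minus that on LA-plausible items.

    Each item is (favored_parse, ha_logprobs, la_logprobs) with favored_parse
    in {"HA", "LA"}. A clearly positive value is a correctly directed effect;
    values near zero or below zero correspond to weak or reversed shifts.
    """
    ha_items = [ha_preference(h, l) for cond, h, l in items if cond == "HA"]
    la_items = [ha_preference(h, l) for cond, h, l in items if cond == "LA"]
    return sum(ha_items) / len(ha_items) - sum(la_items) / len(la_items)


if __name__ == "__main__":
    # Hypothetical per-token log-probabilities for two matched items.
    toy_items = [
        ("HA", [-1.0, -2.0], [-2.0, -3.0]),  # plausibility favors HA
        ("LA", [-2.0, -2.0], [-1.0, -2.0]),  # plausibility favors LA
    ]
    print(plausibility_shift(toy_items))  # 1.5
```

Averaging over tokens keeps continuations of different lengths comparable, which matters because a summed log-probability would favor the shorter continuation regardless of plausibility.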


[33] Do No Harm: Exposing Hidden Vulnerabilities of LLMs via Persona-based Client Simulation Attack in Psychological Counseling cs.CLPDF

Qingyang Xu, Yaling Shen, Stephanie Fong, Zimu Wang, Yiwen Jiang

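TL;DR: This paper proposes the Persona-based Client Simulation Attack (PCSA), the first framework that simulates clients in psychological counseling through coherent, persona-driven dialogues in order to expose hidden vulnerabilities in the psychological safety alignment of LLMs. Experiments on seven general and mental health-specialized LLMs show that PCSA substantially outperforms four baselines, revealing that current models remain prone to providing unauthorized medical advice, reinforcing delusions, and implicitly encouraging risky actions.

Details

Motivation: As LLMs are used more widely in mental healthcare, distinguishing therapeutic empathy from maladaptive validation in high-stakes therapeutic interactions becomes a key challenge. Existing red-teaming frameworks focus mainly on generic harms or optimization-based attacks and overlook this risk.

Result: Experiments on seven general and mental health-specialized LLMs show that PCSA substantially outperforms four competitive baselines. Perplexity analysis and human inspection further indicate that PCSA generates more natural and realistic dialogues.

Insight: The contribution is the first persona-driven client simulation attack framework, focused on exposing psychological safety alignment vulnerabilities in counseling scenarios. By simulating coherent client dialogues, the method surfaces the risk of domain-specific adversarial tactics and offers a new angle for LLM safety evaluation.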

Abstract: The increasing use of large language models (LLMs) in mental healthcare raises safety concerns in high-stakes therapeutic interactions. A key challenge is distinguishing therapeutic empathy from maladaptive validation, where supportive responses may inadvertently reinforce harmful beliefs or behaviors in multi-turn conversations. This risk is largely overlooked by existing red-teaming frameworks, which focus mainly on generic harms or optimization-based attacks. To address this gap, we introduce Persona-based Client Simulation Attack (PCSA), the first red-teaming framework that simulates clients in psychological counseling through coherent, persona-driven client dialogues to expose vulnerabilities in psychological safety alignment. Experiments on seven general and mental health-specialized LLMs show that PCSA substantially outperforms four competitive baselines. Perplexity analysis and human inspection further indicate that PCSA generates more natural and realistic dialogues. Our results reveal that current LLMs remain vulnerable to domain-specific adversarial tactics, providing unauthorized medical advice, reinforcing delusions, and implicitly encouraging risky actions.


[34] Synthetic Sandbox for Training Machine Learning Engineering Agents cs.CL | cs.LG

Yuhang Zhou, Lizhu Zhang, Yifan Wu, Jiayi Liu, Xiangjun Fan

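TL;DR: This paper proposes SandMLE, a framework that generates diverse, verifiable synthetic machine learning engineering (MLE) environments to remove the bottleneck that prevents large-scale on-policy reinforcement learning (RL) for MLE agents, namely large datasets and costly verification. By constraining each task's dataset to micro-scale (50-200 samples), it sharply cuts execution time, enables trajectory-wise on-policy RL in the MLE domain for the first time, and shows performance gains and generalization across several benchmarks.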

Details

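Motivation: The motivation is the sharp rise in verification cost when LLM agents move from software engineering tasks to machine learning engineering tasks. Conventional on-policy RL becomes extremely slow because it requires running a full ML pipeline (data preprocessing, model training, evaluation) on a large dataset, so existing methods fall back on supervised fine-tuning or offline proxy rewards and give up the exploration and generalization benefits of on-policy RL.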

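Result: On MLE-bench-lite, SandMLE yields significant gains over SFT baselines on Qwen3-8B, 14B, and 30B-A3B, with relative medal rate improvements ranging from 20.3% to 66.9%. Execution time drops by more than 13x. The trained policy also generalizes to unseen agentic scaffolds, achieving up to a 32.4% better HumanRank score on MLE-Dojo.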

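Insight: The core insight is that sandbox data size is the main source of the training bottleneck. Building on it, the authors design a multi-agent framework that generates synthetic MLE environments whose structural and technical complexity matches real-world problems while keeping datasets tiny. This makes large-scale, trajectory-wise on-policy RL feasible in the MLE domain for the first time while preserving task realism and verifiability.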

Abstract: As large language model agents advance beyond software engineering (SWE) tasks toward machine learning engineering (MLE), verifying agent behavior becomes orders of magnitude more expensive: while SWE tasks can be verified via fast-executing unit tests, MLE verification requires running full ML pipelines – data preprocessing, model training, and metric evaluation – on large datasets at each rollout step, rendering trajectory-wise on-policy reinforcement learning (RL) prohibitively slow. Existing approaches retreat to supervised fine-tuning (SFT) or offline proxy rewards, sacrificing the exploration and generalization benefits of on-policy RL. We observe that sandbox data size is the primary source of this bottleneck. Based on this insight, we introduce SandMLE, a multi-agent framework that generates diverse, verifiable synthetic MLE environments from a small number of seed tasks, preserving the structural and technical complexity of real-world problems while constraining datasets to micro-scale (each task is paired with only 50-200 training samples). Through extensive experiments, we show that SandMLE reduces execution time by over 13 times, enabling large-scale, on-policy trajectory-wise RL for the first time in the MLE domain. On MLE-bench-lite, SandMLE yields significant gains over SFT baselines across Qwen3-8B, 14B, and 30B-A3B, with relative medal rate improvements ranging from 20.3% to 66.9%. Furthermore, the trained policy generalizes across unseen agentic scaffolds, achieving up to 32.4% better HumanRank score on MLE-Dojo.


[35] Rethinking Exploration in RLVR: From Entropy Regularization to Refinement via Bidirectional Entropy Modulation cs.CL | cs.AI | cs.LG

Hengrui Gu, Xiaotian Han, Yujing Bian, Kaixiong Zhou

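TL;DR: To address restricted exploration in reinforcement learning with verifiable rewards (RLVR), this paper proposes AsymGRPO, a new exploration framework. By analyzing the dynamics of policy entropy, it decomposes entropy into informative and spurious components and, building on group-relative advantage estimation, modulates entropy separately on positive and negative rollouts. This preserves diverse solution paths while suppressing noise, improving LLM reasoning performance.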

Details

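Motivation: RLVR improves LLM reasoning but suffers from a fundamental limitation, restricted exploration, in which the policy quickly converges to a narrow set of solutions. Conventional entropy regularization is unreliable for LLMs: it is sensitive to hyperparameters and yields only limited gains. The relationship between policy entropy and exploration therefore needs to be rethought.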

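Result: Extensive experiments show that AsymGRPO outperforms strong baselines on multiple benchmarks and has the potential to work synergistically with existing entropy regularization methods.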

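Insight: The novelty lies in conceptually decomposing policy entropy into informative and spurious entropy and in proposing an entropy refinement mechanism. Viewed objectively, the bidirectional asymmetric modulation framework offers both a new theoretical perspective and a practical method for exploration in RL, and may transfer to other sequential decision-making tasks that must balance exploration and exploitation.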

Abstract: Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning capabilities of large language models (LLMs). However, it faces a fundamental limitation termed restricted exploration, where the policy rapidly converges to a narrow set of solutions. While entropy regularization is a popular approach used to sustain exploration, it often proves unreliable for LLMs, suffering from high hyperparameter sensitivity and yielding only marginal performance gains. Motivated by these inefficiencies, we propose to rethink the relationship between policy entropy and exploration. By deriving a parametric formulation of group-relative advantage estimation and analyzing entropy dynamics, we conceptually decompose policy entropy into informative entropy, which preserves diverse solution paths, and spurious entropy, which erodes reasoning patterns. Our analysis reveals that, in contrast to blind maximization, effective exploration requires entropy refinement, a mechanism implicitly embedded in group-relative advantage estimation that sustains informative entropy on positive rollouts while suppressing spurious entropy on negative ones. Guided by this insight, we propose AsymGRPO, an exploratory framework that explicitly decouples the modulation of positive and negative rollouts. This allows for independent control over the preservation of informative entropy and the suppression of spurious noise. Extensive experiments demonstrate that AsymGRPO achieves superior performance compared to strong baselines and exhibits the potential to synergize with existing entropy regularization methods.
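
To make "group-relative advantage estimation" and "decoupled modulation of positive and negative rollouts" concrete, here is a small sketch. The first function is the commonly used group normalization (reward minus group mean, divided by group standard deviation). The second is a purely hypothetical illustration of treating the two signs independently: the `pos_scale` and `neg_scale` knobs are assumptions made for this example, and the abstract does not specify AsymGRPO's actual parametric form.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt.

    Uses the population standard deviation; implementations differ on
    this detail, so treat it as an assumption.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def asymmetric_advantages(rewards, pos_scale=1.0, neg_scale=1.0):
    """Hypothetical decoupling: scale positive and negative advantages
    with independent coefficients. Not the paper's formulation."""
    advantages = group_relative_advantages(rewards)
    return [a * (pos_scale if a > 0 else neg_scale) for a in advantages]


# Four rollouts with verifiable 0/1 rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = asymmetric_advantages(rewards, pos_scale=1.0, neg_scale=0.5)
```

With equal scales the second function reduces to the plain group-relative estimate, which makes the asymmetric variant easy to compare against the symmetric baseline.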


[36] TriAttention: Efficient Long Reasoning with Trigonometric KV Compression cs.CL | cs.CV

Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu

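TL;DR: This paper proposes TriAttention, which exploits the concentration of query and key vectors in the pre-RoPE space to build a key-importance estimator based on a trigonometric series and vector norms. The estimator compresses the KV cache efficiently, addressing the KV cache memory bottleneck in long reasoning with large language models.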

Details

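Motivation: Long reasoning in large language models creates a severe KV cache memory bottleneck. Existing compression methods estimate key importance from post-RoPE attention scores, but because queries rotate with position, few queries are representative, which leads to poor key selection and unstable reasoning.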

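Result: On AIME25 with 32K-token generation, TriAttention matches the reasoning accuracy of Full Attention while achieving 2.5x higher throughput or a 10.7x reduction in KV memory, whereas leading baselines reach only about half the accuracy at the same efficiency.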

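Insight: The novelty lies in discovering and exploiting the concentration of Q/K vectors in the pre-RoPE space and the distance preferences it induces. Combining positional distance preference, expressed through a trigonometric series, with vector norms to estimate key importance yields more stable and efficient KV compression and allows long-context models to be deployed on consumer GPUs.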

Abstract: Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, queries rotate with position during RoPE, making representative queries very few, leading to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions – Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention to estimate key importance by leveraging these centers. Via the trigonometric series, we use the distance preference characterized by these centers to score keys according to their positions, and also leverage Q/K norms as an additional signal for importance estimation. On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines achieve only about half the accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where long context would otherwise cause out-of-memory with Full Attention.
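
Why fixed pre-RoPE centers imply a trigonometric series in distance follows from the standard RoPE identity: rotating each 2D pair of the query by position m and of the key by position n leaves a logit that depends only on m - n, as a sum of cosines. The sketch below evaluates that series for given centers. It is not TriAttention's scoring rule (which also uses Q/K norms and is not detailed in the abstract); the adjacent-pair layout, the base of 10000, and the function names are assumptions for illustration.

```python
import cmath
import math


def rope_frequencies(head_dim, base=10000.0):
    """One rotation frequency per 2D pair, as in standard RoPE."""
    return [base ** (-2.0 * j / head_dim) for j in range(head_dim // 2)]


def distance_preference(q_center, k_center, distance, base=10000.0):
    """Attention logit between a query at the pre-RoPE Q center and a key
    at the pre-RoPE K center that lies `distance` positions earlier.

    Each pair (2j, 2j+1) is read as a complex number, so the logit is
    sum_j |q_j| |k_j| cos(theta_j * distance + phi_j),
    with phi_j = arg(q_j) - arg(k_j).
    """
    score = 0.0
    for j, theta in enumerate(rope_frequencies(len(q_center), base)):
        q = complex(q_center[2 * j], q_center[2 * j + 1])
        k = complex(k_center[2 * j], k_center[2 * j + 1])
        amplitude = abs(q) * abs(k)
        phase = cmath.phase(q) - cmath.phase(k)
        score += amplitude * math.cos(theta * distance + phase)
    return score


# Hypothetical 4-dimensional centers, for illustration only.
q_c = [1.0, 0.0, 0.0, 1.0]
k_c = [1.0, 0.0, 0.0, 1.0]
profile = [distance_preference(q_c, k_c, d) for d in range(8)]
```

At distance 0 the series reduces to the plain dot product of the two centers, and the phases determine which distances score highest.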


[37] Early Stopping for Large Reasoning Models via Confidence Dynamics cs.CL | cs.AI | cs.LG

Parsa Hosseini, Sumit Nawathe, Mahdi Salmani, Meisam Razaviyayn, Soheil Feizi

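TL;DR: This paper proposes CoDE-Stop, an early stopping method that monitors how the confidence of intermediate answers evolves during chain-of-thought generation in large reasoning models and uses it to decide when to terminate reasoning, substantially reducing compute cost while preserving accuracy.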

Details

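Motivation: Large reasoning models rely on long chain-of-thought generation to solve complex problems, but overly long reasoning raises computational cost and can degrade performance through overthinking. The core challenge is deciding when the model should stop reasoning and output its final answer.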

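Result: On multiple reasoning and science benchmarks, CoDE-Stop achieves a better accuracy-compute tradeoff than existing early stopping methods and reduces total token usage by 25-50% compared with standard full-length reasoning.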

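Insight: The novelty lies in using the dynamics of intermediate-answer confidence as the stopping signal, which integrates into existing models without additional training. The analysis finds that correct reasoning trajectories tend to reach high confidence early, whereas incorrect ones are longer and show less reliable confidence dynamics, offering a new view of the model's reasoning process.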

Abstract: Large reasoning models rely on long chain-of-thought generation to solve complex problems, but extended reasoning often incurs substantial computational cost and can even degrade performance due to overthinking. A key challenge is determining when the model should stop reasoning and produce the final answer. In this work, we study the confidence of intermediate answers during reasoning and observe two characteristic behaviors: correct reasoning trajectories often reach high-confidence answers early, while incorrect rollouts tend to produce long, unproductive reasoning traces and exhibit less reliable confidence dynamics. Motivated by these observations, we propose CoDE-Stop (Confidence Dynamics Early Stop), an early stopping method that leverages the dynamics of intermediate answer confidence to decide when to terminate reasoning, requiring no additional training and easily integrating into existing models. We evaluate CoDE-Stop on diverse reasoning and science benchmarks across multiple models. Compared to prior early stopping methods, it achieves a more favorable accuracy-compute tradeoff and reduces total token usage by 25-50% compared to standard full-length reasoning. In addition, we provide analyses of confidence dynamics during reasoning, offering insights into how confidence changes in both correct and incorrect trajectories.
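
The abstract does not give CoDE-Stop's exact stopping criterion, so the following is only a generic sketch of confidence-based early stopping under stated assumptions: confidence in the current intermediate answer is probed at regular checkpoints, and reasoning stops once it has stayed above a threshold for a few consecutive probes. The `threshold` and `patience` parameters and the function name are hypothetical.

```python
def stop_index(confidences, threshold=0.9, patience=2):
    """Return the first checkpoint at which confidence has been at least
    `threshold` for `patience` consecutive checkpoints, or None if the
    rule never fires and reasoning should run to full length."""
    streak = 0
    for i, confidence in enumerate(confidences):
        streak = streak + 1 if confidence >= threshold else 0
        if streak >= patience:
            return i
    return None


# Hypothetical confidence traces, for illustration only.
steady = [0.40, 0.92, 0.95, 0.97]    # reaches high confidence early
unstable = [0.50, 0.95, 0.60, 0.70]  # never stays confident
```

Requiring a streak rather than a single high reading is one simple way to use the dynamics of confidence rather than its instantaneous value, in the spirit of the observation that incorrect trajectories show less reliable confidence.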


cs.CV

[38] SafeScreen: A Safety-First Screening Framework for Personalized Video Retrieval for Vulnerable Users cs.CV | cs.AI | cs.CR

Wenzheng Zhao, Madhava Kalyan Gadiputi, Fengpei Yuan

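TL;DR: SafeScreen is a safety-first video screening framework for vulnerable users such as children and people with dementia. It extracts individualized safety criteria, assesses candidates with multimodal VideoRAG, and makes LLM-based decisions, enforcing safety constraints while retrieving personalized video in real time so that content is safe and appropriate.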

Details

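Motivation: Recommendation algorithms on open-domain video platforms can push inappropriate or harmful content to vulnerable users such as children and people with dementia, so personalized video retrieval needs to put safety first.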

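Result: In a dementia-care case study with 30 synthetic patient profiles and 90 test queries, SafeScreen prioritized safety over engagement, diverging from YouTube's engagement-optimized rankings in 80-93% of cases, while maintaining high levels of safety coverage, sensibleness, and groundedness, as validated by LLM-based evaluation and domain experts.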

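Insight: Novel elements include treating safety as a prerequisite rather than a ranking factor, evidence-grounded assessment through adaptive question generation and multimodal VideoRAG, and explainable real-time screening that needs no precomputed safety labels. Viewed objectively, combining individualized safety-criteria extraction with LLM-based decision-making provides a scalable safeguard for vulnerable-user settings.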

Abstract: Open-domain video platforms offer rich, personalized content that could support health, caregiving, and educational applications, but their engagement-optimized recommendation algorithms can expose vulnerable users to inappropriate or harmful material. These risks are especially acute in child-directed and care settings (e.g., dementia care), where content must satisfy individualized safety constraints before being shown. We introduce SafeScreen, a safety-first video screening framework that retrieves and presents personalized video while enforcing individualized safety constraints. Rather than ranking videos by relevance or popularity, SafeScreen treats safety as a prerequisite and performs sequential approval or rejection of candidate videos through an automated pipeline. SafeScreen integrates three key components: (i) profile-driven extraction of individualized safety criteria, (ii) evidence-grounded assessments via adaptive question generation and multimodal VideoRAG analysis, and (iii) LLM-based decision-making that verifies safety, appropriateness, and relevance before content exposure. This design enables explainable, real-time screening of uncurated video repositories without relying on precomputed safety labels. We evaluate SafeScreen in a dementia-care reminiscence case study using 30 synthetic patient profiles and 90 test queries. Results demonstrate that SafeScreen prioritizes safety over engagement, diverging from YouTube’s engagement-optimized rankings in 80-93% of cases, while maintaining high levels of safety coverage, sensibleness, and groundedness, as validated by both LLM-based evaluation and domain experts.


[39] A reconfigurable smart camera implementation for jet flames characterization based on an optimized segmentation model cs.CV | cs.AI

Gerardo Valente Vazquez-Garcia, Carmina Perez Guerrero, Eduardo Garduño, Miguel Gonzalez-Mendoza, Adriana Palacios

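TL;DR: This paper presents a reconfigurable smart camera platform, built on an optimized segmentation model, for real-time characterization of jet flames in industrial settings to improve fire safety management. Optimizing a UNet model and deploying it on an SoC FPGA yields a low-latency edge processing pipeline.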

Details

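Motivation: The work addresses the lack of real-time solutions for early fire segmentation and characterization in industrial settings, and aims to reduce video processing overhead and overall latency through edge computing.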

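Result: On the Ultra96 platform, the Vitis framework was used to shrink the model from 7.5 million parameters to 59,095 (125x fewer), cutting processing latency by 2.9x. Further optimization improved latency by 7.5x, reaching 30 FPS while preserving accuracy as measured by Dice Score.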

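Insight: Contributions include a replicable experimental set-up and system implementation, and efficient parallel execution on an SoC FPGA through model compression and hardware mapping, providing an edge-intelligence solution for fire safety applications.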

Abstract: In this work we present a novel framework for fire safety management in industrial settings through the implementation of a smart camera platform for jet flames characterization. The approach seeks to alleviate the lack of real-time solutions for industrial early fire segmentation and characterization. As a case study, we demonstrate how an SoC FPGA running optimized Artificial Intelligence (AI) models can be leveraged to implement a full edge processing pipeline for jet flames analysis. In this paper we extend previous work on computer-vision jet fire segmentation by creating a novel experimental set-up and system implementation for addressing this issue, which can be replicated to other fire safety applications. The proposed platform is designed to carry out image processing tasks in real-time and on device, reducing video processing overheads, and thus the overall latency. This is achieved by optimizing a UNet segmentation model to make it amenable to an SoC FPGA implementation; the optimized model can then be efficiently mapped onto the SoC reconfigurable logic for massively parallel execution. For our experiments, we have chosen the Ultra96 platform, as it also provides the means for implementing full-fledged intelligent systems using the SoC peripherals, as well as other Operating System (OS) capabilities (i.e., multi-threading) for systems management. For optimizing the model we made use of the Vitis (Xilinx) framework, which enabled us to optimize the full precision model from 7.5 million parameters to 59,095 parameters (125x fewer), which translated into a reduction of the processing latency of 2.9x. Further optimization (multi-threading and batch normalization) led to an improvement of 7.5x in terms of latency, yielding a performance of 30 Frames Per Second (FPS) without sacrificing accuracy in terms of the evaluated metrics (Dice Score).
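
Since accuracy is reported as Dice Score, a small reference implementation of that metric may help. This is the standard definition, 2|P ∩ T| / (|P| + |T|), applied to flattened binary masks; the smoothing term `eps` and the convention that two empty masks score 1.0 are assumptions of this sketch, not details taken from the paper.

```python
def dice_score(pred, target, eps=1e-7):
    """Dice coefficient between two flat binary masks (lists of 0/1).

    `eps` avoids division by zero; with it, two empty masks score 1.0.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return (2.0 * intersection + eps) / (total + eps)


# Hypothetical 4-pixel masks, for illustration only.
predicted_mask = [1, 1, 0, 0]
reference_mask = [1, 0, 0, 0]
score = dice_score(predicted_mask, reference_mask)
```

Comparing this score for the full-precision and compressed models on the same frames is one way to check the claim that compression did not sacrifice segmentation accuracy.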


[40] Event-Driven Neuromorphic Vision Enables Energy-Efficient Visual Place Recognition cs.CV | cs.AI | cs.LG

Geoffroy Keime, Nicolas Cuperlier, Benoit R. Cottereau

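TL;DR: This paper proposes SpikeVPR, a bio-inspired neuromorphic approach to visual place recognition that combines event-based cameras with spiking neural networks to generate compact, invariant place descriptors from few exemplars, enabling robust recognition under extreme changes in illumination, viewpoint, and appearance.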

Details

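Motivation: Conventional deep networks for visual place recognition under dynamic real-world conditions have high computational and energy demands. Inspired by the mammalian navigation system, the authors seek a more efficient solution.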

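Result: On two challenging benchmarks (Brisbane-Event-VPR and NSAVP), SpikeVPR achieves performance comparable to state-of-the-art deep networks while using 50 times fewer parameters and consuming 30 and 250 times less energy, which enables real-time deployment on mobile and neuromorphic platforms.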

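Insight: Main contributions: (1) combining event cameras with spiking neural networks for visual place recognition; (2) EventDilation, a new data augmentation strategy that improves robustness to speed and temporal variations; (3) showing that spike-based coding is an effective route to efficient, robust VPR in complex, changing environments.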

Abstract: Reliable visual place recognition (VPR) under dynamic real-world conditions is critical for autonomous robots, yet conventional deep networks remain limited by high computational and energy demands. Inspired by the mammalian navigation system, we introduce SpikeVPR, a bio-inspired and neuromorphic approach combining event-based cameras with spiking neural networks (SNNs) to generate compact, invariant place descriptors from few exemplars, achieving robust recognition under extreme changes in illumination, viewpoint, and appearance. SpikeVPR is trained end-to-end using surrogate gradient learning and incorporates EventDilation, a novel augmentation strategy enhancing robustness to speed and temporal variations. Evaluated on two challenging benchmarks (Brisbane-Event-VPR and NSAVP), SpikeVPR achieves performance comparable to state-of-the-art deep networks while using 50 times fewer parameters and consuming 30 and 250 times less energy, enabling real-time deployment on mobile and neuromorphic platforms. These results demonstrate that spike-based coding offers an efficient pathway toward robust VPR in complex, changing environments.


[41] 3D-IDE: 3D Implicit Depth Emergent cs.CV | cs.AI

Chushan Zhang, Ruihan Lu, Jinguang Tong, Yikai Wang, Hongdong Li

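TL;DR: This paper proposes 3D-IDE, a method that lets 3D awareness in multimodal large language models (MLLMs) emerge naturally from geometric self-supervision rather than from explicit 3D encoding or external 3D foundation models. It constructs an information bottleneck that forces the model to maximize the mutual information between visual features and 3D structure, giving implicit 3D understanding within a unified visual representation. Inference needs no depth or pose inputs and incurs zero latency overhead.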

Details

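Motivation: Existing methods face a trade-off when fusing 2D and 3D representations, which leads to suboptimal deployment. This paper addresses that challenge by rethinking how to integrate 3D knowledge into vision-language models more effectively without relying on explicit 3D information.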

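Result: The method surpasses the current state of the art (SOTA) on multiple 3D scene understanding benchmarks and reduces inference latency by 55% while maintaining strong performance.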

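Insight: The core innovation is the Implicit Geometric Emergence Principle: carefully designed auxiliary objectives, such as a fine-grained geometry validator and global representation constraints, build an information bottleneck that lets 3D awareness emerge naturally from geometric self-supervision. This gives 3D understanding with no external dependencies and zero latency overhead, a paradigm shift from external grafting to implicit emergence.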

Abstract: Leveraging 3D information within Multimodal Large Language Models (MLLMs) has recently shown significant advantages for indoor scene understanding. However, existing methods, including those using explicit ground-truth 3D positional encoding and those grafting external 3D foundation models for implicit geometry, struggle with the trade-off in 2D-3D representation fusion, leading to suboptimal deployment. To this end, we propose 3D-Implicit Depth Emergence, a method that reframes 3D perception as an emergent property derived from geometric self-supervision rather than explicit encoding. Our core insight is the Implicit Geometric Emergence Principle: by strategically leveraging privileged geometric supervision through mechanisms like a fine-grained geometry validator and global representation constraints, we construct an information bottleneck. This bottleneck forces the model to maximize the mutual information between visual features and 3D structures, allowing 3D awareness to emerge naturally within a unified visual representation. Unlike existing approaches, our method enables 3D perception to emerge implicitly, disentangling features in dense regions and, crucially, eliminating depth and pose dependencies during inference with zero latency overhead. This paradigm shift from external grafting to implicit emergence represents a fundamental rethinking of 3D knowledge integration in visual-language models. Extensive experiments demonstrate that our method surpasses SOTA on multiple 3D scene understanding benchmarks. Our approach achieves a 55% reduction in inference latency while maintaining strong performance across diverse downstream tasks, underscoring the effectiveness of meticulously designed auxiliary objectives for dependency-free 3D understanding. Source code can be found at github.com/ChushanZhang/3D-IDE.


[42] Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models cs.CV | cs.AI

Nanxi Li, Xiang Wang, Yuanjie Chen, Haode Zhang, Hong Li

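TL;DR: This paper shows that current multimodal large language models fall well short in understanding the dynamics of continuum objects. It introduces two benchmark tasks for evaluation, Next Frame Selection (NFS) and Temporal Coherence Verification (TCV), and proposes Scene Dynamic Field (SDF), which uses physics simulators for multi-task fine-tuning and substantially improves performance on intuitive physics understanding.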

Details

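Motivation: Although multimodal large language models perform well on image and video understanding, they are clearly deficient in high-level physics reasoning, in particular intuitive physics understanding, and especially when handling the dynamics of continuum objects.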

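Result: Experiments show that even state-of-the-art MLLMs perform poorly on the proposed NFS and TCV benchmark tasks, whereas SDF achieves gains of up to 20.7% on fluid tasks and generalizes strongly to unseen physical domains.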

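Insight: The novelty lies in the first systematic evaluation of the limits of MLLMs in intuitive physics understanding and in SDF, a concise and efficient method that combines physics simulators with multi-task fine-tuning to strengthen physically grounded understanding at low cost.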

Abstract: While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in image and video understanding, their ability to comprehend the physical world has become an increasingly important research focus. Despite their improvements, current MLLMs struggle significantly with high-level physics reasoning. In this work, we investigate the first step of physical reasoning, i.e., intuitive physics understanding, revealing substantial limitations in understanding the dynamics of continuum objects. To isolate and evaluate this specific capability, we introduce two fundamental benchmark tasks: Next Frame Selection (NFS) and Temporal Coherence Verification (TCV). Our experiments demonstrate that even state-of-the-art MLLMs perform poorly on these foundational tasks. To address this limitation, we propose Scene Dynamic Field (SDF), a concise approach that leverages physics simulators within a multi-task fine-tuning framework. SDF substantially improves performance, achieving up to 20.7% gains on fluid tasks while showing strong generalization to unseen physical domains. This work not only highlights a critical gap in current MLLMs but also presents a promising cost-efficient approach for developing more physically grounded MLLMs. Our code and data are available at https://github.com/andylinx/Scene-Dynamic-Field.


[43] HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis cs.CV

Mingjin Chen, Junhao Chen, Zhaoxin Fan, Yujian Lee, Zichen Dang

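TL;DR: This paper proposes HVG-3D, a unified framework for 3D-aware hand-object interaction video synthesis. Built on a diffusion model with an integrated 3D ControlNet, it uses 3D control signals from simulation or real data to generate high-fidelity, temporally consistent hand-object interaction videos from a single real image.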

Details

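Motivation: Most existing methods rely on 2D control signals that lack spatial expressiveness and cannot make effective use of synthetic 3D conditional data, which limits the quality and controllability of hand-object interaction video synthesis.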

Result: Experiments on the TASTE-Rob dataset show that HVG-3D achieves state-of-the-art spatial fidelity, temporal coherence, and controllability.

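Insight: The core innovation is a diffusion architecture combined with a 3D ControlNet that reasons explicitly over 3D geometric and motion cues, together with a hybrid pipeline for constructing input and condition signals, so that real and simulated data can be used flexibly and precisely in both training and inference.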

Abstract: Recent methods have made notable progress in the visual quality of hand-object interaction video synthesis. However, most approaches rely on 2D control signals that lack spatial expressiveness and limit the utilization of synthetic 3D conditional data. To address these limitations, we propose HVG-3D, a unified framework for 3D-aware hand-object interaction (HOI) video synthesis conditioned on explicit 3D representations. Specifically, we develop a diffusion-based architecture augmented with a 3D ControlNet, which encodes geometric and motion cues from 3D inputs to enable explicit 3D reasoning during video synthesis. To achieve high-quality synthesis, HVG-3D is designed with two core components: (i) a 3D-aware HOI video generation diffusion architecture that encodes geometric and motion cues from 3D inputs for explicit 3D reasoning; and (ii) a hybrid pipeline for constructing input and condition signals, enabling flexible and precise control during both training and inference. During inference, given a single real image and a 3D control signal from either simulation or real data, HVG-3D generates high-fidelity, temporally consistent videos with precise spatial and temporal control. Experiments on the TASTE-Rob dataset demonstrate that HVG-3D achieves state-of-the-art spatial fidelity, temporal coherence, and controllability, while enabling effective utilization of both real and simulated data.


[44] V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators cs.CV | cs.AI

Jiazhou Zhou, Yucheng Chen, Hongyang Li, Qing Jiang, Hu Zhou

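TL;DR: This paper proposes V-Reflection, a framework whose “think-then-look” visual reflection mechanism turns multimodal large language models from passive observers into active interrogators, addressing the hallucinations that arise in fine-grained tasks from insufficient visual reasoning.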

Details

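Motivation: Reasoning in current MLLMs is largely confined to the language domain and treats visual input as a static, reasoning-agnostic preamble. The model therefore acts as a passive observer that cannot re-examine visual details to ground its reasoning state, and it is prone to perception-related hallucinations in fine-grained tasks.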

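Result: Extensive experiments on six perception-intensive benchmarks demonstrate the effectiveness of V-Reflection, which significantly narrows the fine-grained perception gap. Visualizations confirm that latent reasoning autonomously localizes task-critical visual evidence.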

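Insight: The core innovation is the “think-then-look” visual reflection mechanism. A two-stage distillation strategy (Box-Guided Compression and Dynamic Autoregressive Compression) maps the model's latent states into dynamic probes that actively interrogate the global visual feature map, internalizing the ability to localize task-critical evidence, while inference remains purely end-to-end autoregressive decoding in the latent space for efficiency.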

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success, yet they remain prone to perception-related hallucinations in fine-grained tasks. This vulnerability arises from a fundamental limitation: their reasoning is largely restricted to the language domain, treating visual input as a static, reasoning-agnostic preamble rather than a dynamic participant. Consequently, current models act as passive observers, unable to re-examine visual details to ground their evolving reasoning states. To overcome this, we propose V-Reflection, a framework that transforms the MLLM into an active interrogator through a “think-then-look” visual reflection mechanism. During reasoning, latent states function as dynamic probes that actively interrogate the visual feature space, grounding each reasoning step for task-critical evidence. Our approach employs a two-stage distillation strategy. First, the Box-Guided Compression (BCM) module establishes stable pixel-to-latent targets through explicit spatial grounding. Next, a Dynamic Autoregressive Compression (DAC) module maps the model’s hidden states into dynamic probes that interrogate the global visual feature map. By distilling the spatial expertise of the BCM teacher into the DAC student, V-Reflection internalizes the ability to localize task-critical evidence. During inference, both modules remain entirely inactive, maintaining a purely end-to-end autoregressive decoding in the latent space with optimal efficiency. Extensive experiments demonstrate the effectiveness of our V-Reflection across six perception-intensive benchmarks, significantly narrowing the fine-grained perception gap. Visualizations confirm that latent reasoning autonomously localizes task-critical visual evidence.


[45] TreeGaussian: Tree-Guided Cascaded Contrastive Learning for Hierarchical Consistent 3D Gaussian Scene Segmentation and Understanding cs.CV | cs.AI

Jingbin You, Zehao Li, Hao Jiang, Xinzhu Ma, Shuqin Gao

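TL;DR: This paper proposes TreeGaussian, a tree-guided cascaded contrastive learning framework for 3D Gaussian scene segmentation and understanding. It builds a multi-level object tree to model hierarchical semantic relationships explicitly, uses a two-stage cascaded contrastive learning strategy to refine feature representations progressively from global to local, and introduces a Consistent Segmentation Detection (CSD) mechanism and a graph-based denoising module to improve segmentation consistency and quality.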

Details

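Motivation: Existing methods based on 3D Gaussian Splatting (3DGS) struggle to represent complex hierarchical 3D semantic structures and capture whole-part relationships. Dense pairwise comparisons and inconsistent hierarchical labels from 2D priors also hinder feature learning, resulting in poor segmentation.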

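Result: Extensive experiments, including open-vocabulary 3D object selection and 3D point cloud understanding, together with ablation studies, demonstrate the effectiveness and robustness of the method.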

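Insight: The novelty lies in explicitly introducing a hierarchical semantic structure (the object tree) into 3DGS representation learning and in a cascaded contrastive learning strategy that mitigates saturation and instability in contrastive learning, while the CSD and denoising modules improve cross-view segmentation consistency. Together these provide a structured learning paradigm for hierarchical understanding of 3D scenes.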

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a real-time, differentiable representation for neural scene understanding. However, existing 3DGS-based methods struggle to represent hierarchical 3D semantic structures and capture whole-part relationships in complex scenes. Moreover, dense pairwise comparisons and inconsistent hierarchical labels from 2D priors hinder feature learning, resulting in suboptimal segmentation. To address these limitations, we introduce TreeGaussian, a tree-guided cascaded contrastive learning framework that explicitly models hierarchical semantic relationships and reduces redundancy in contrastive supervision. By constructing a multi-level object tree, TreeGaussian enables structured learning across object-part hierarchies. In addition, we propose a two-stage cascaded contrastive learning strategy that progressively refines feature representations from global to local, mitigating saturation and stabilizing training. A Consistent Segmentation Detection (CSD) mechanism and a graph-based denoising module are further introduced to align segmentation modes across views while suppressing unstable Gaussian points, enhancing segmentation consistency and quality. Extensive experiments, including open-vocabulary 3D object selection, 3D point cloud understanding, and ablation studies, demonstrate the effectiveness and robustness of our approach.


[46] Diffusion Path Alignment for Long-Range Motion Generation and Domain Transitions cs.CV

Haichao Wang, Alexander Okupnik, Yuxing Han, Gene Wen, Johannes Schneider

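TL;DR: This paper proposes Diffusion Path Alignment, an inference-time optimization framework for long-range human motion generation and for generating coherent transitions across semantically distinct motion domains, such as different dance styles. Inspired by diffusion-based stochastic optimal control, it optimizes a control-energy objective that regularizes the transition trajectories of a pretrained diffusion model, producing high-fidelity, temporally coherent motion sequences at inference time.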

Details

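Motivation: Long-range human motion generation is a central challenge in computer vision and graphics. Generating coherent transitions across semantically distinct motion domains, for example between styles and motifs in dance choreography, remains largely unexplored by existing methods.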

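Result: The paper shows that optimizing the proposed objective at inference time produces motion transitions with high fidelity and temporal coherence. It is the first work to provide a general framework with explicit transition modeling for controlled long-range human motion generation.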

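Insight: The main innovation is treating inference in a diffusion model as a stochastic optimal control problem and introducing an explicit control-energy objective that guides and regularizes the transition path, enabling coherent generation of long-range, cross-domain motion sequences and offering a new approach to controllable motion generation.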

Abstract: Long-range human movement generation remains a central challenge in computer vision and graphics. Generating coherent transitions across semantically distinct motion domains remains largely unexplored. This capability is particularly important for applications such as dance choreography, where movements must fluidly transition across diverse stylistic and semantic motifs. We propose a simple and effective inference-time optimization framework inspired by diffusion-based stochastic optimal control. Specifically, we introduce a control-energy objective that explicitly regularizes the transition trajectories of a pretrained diffusion model. We show that optimizing this objective at inference time yields transitions with high fidelity and temporal coherence. This is the first work to provide a general framework for controlled long-range human motion generation with explicit transition modeling.


[47] PollutionNet: A Vision Transformer Framework for Climatological Assessment of NO$_2$ and SO$_2$ Using Satellite-Ground Data Fusion cs.CV | physics.ao-ph

Prasanjit Dey, Soumyabrata Dev, Bianca Schoen-Phelan

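TL;DR: This paper proposes PollutionNet, a Vision Transformer-based framework that fuses satellite and ground data to assess atmospheric nitrogen dioxide (NO$_2$) and sulfur dioxide (SO$_2$) concentrations. The model uses self-attention to capture complex spatiotemporal dependencies. In a case study of Ireland (2020-2021), it markedly reduces prediction error compared with conventional CNN and RNN models and achieves state-of-the-art performance.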

Details

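Motivation: Conventional monitoring approaches have limitations: satellite observations offer broad spatial coverage but suffer from data gaps, while ground-based sensors offer high temporal resolution but limited spatial extent. Addressing these challenges requires a method that fuses multi-source data and assesses atmospheric pollutants accurately.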

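Result: In the Ireland (2020-2021) case study, PollutionNet achieves state-of-the-art performance, with root mean square errors (RMSE) of 6.89 μg/m³ for NO$_2$ and 4.49 μg/m³ for SO$_2$, reducing prediction error by up to 14% compared with baseline models.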

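Insight: The novelty lies in applying the self-attention mechanism of the Vision Transformer to satellite-ground data fusion in order to capture complex spatiotemporal dependencies that conventional CNN and RNN models handle poorly. This provides a scalable, data-efficient tool for robust pollution assessment in regions with sparse monitoring networks.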

Abstract: Accurate assessment of atmospheric nitrogen dioxide (NO$_2$) and sulfur dioxide (SO$_2$) is essential for understanding climate-air quality interactions, supporting environmental policy, and protecting public health. Traditional monitoring approaches face limitations: satellite observations provide broad spatial coverage but suffer from data gaps, while ground-based sensors offer high temporal resolution but limited spatial extent. To address these challenges, we propose PollutionNet, a Vision Transformer-based framework that integrates Sentinel-5P TROPOMI vertical column density (VCD) data with ground-level observations. By leveraging self-attention mechanisms, PollutionNet captures complex spatiotemporal dependencies that are often missed by conventional CNN and RNN models. Applied to Ireland (2020-2021), our case study demonstrates that PollutionNet achieves state-of-the-art performance (RMSE: 6.89 $\mu$g/m$^3$ for NO$_2$, 4.49 $\mu$g/m$^3$ for SO$_2$), reducing prediction errors by up to 14% compared to baseline models. Beyond accuracy gains, PollutionNet provides a scalable and data-efficient tool for applied climatology, enabling robust pollution assessments in regions with sparse monitoring networks. These results highlight the potential of advanced machine learning approaches to enhance climate-related air quality research, inform environmental management, and support sustainable policy decisions.
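
For readers who want to reproduce the style of the reported numbers, the sketch below computes RMSE and the relative error reduction against a baseline. Both formulas are standard; the function names and the example values are made up for illustration and are not data from the paper.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def relative_reduction(baseline_error, model_error):
    """Fraction by which the model error is lower than the baseline error."""
    return (baseline_error - model_error) / baseline_error


# Made-up concentrations in ug/m^3, for illustration only.
error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
gain = relative_reduction(10.0, 8.6)  # a 14% reduction
```

A statement such as "reducing prediction errors by up to 14%" corresponds to the largest `relative_reduction` value observed across the baselines and pollutants compared.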


[48] CoLA: Cross-Modal Low-rank Adaptation for Multimodal Downstream Tasks cs.CV | cs.CL

Wish Suharitdamrong, Tony Alex, Muhammad Awais, Sara Ahmed

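TL;DR: This paper proposes CoLA (Cross-Modal Low-Rank Adaptation), a new parameter-efficient fine-tuning framework that addresses the failure of existing methods to capture cross-modal interactions when adapting to multimodal tasks. CoLA extends LoRA with a dedicated inter-modal adaptation pathway that runs in parallel with the intra-modal one, adapting unimodal foundation models such as DINO and BERT to multimodal downstream tasks while remaining parameter-efficient.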

Details

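Motivation: Despite progress in foundation models (such as DINO and BERT) and dual-stream multimodal architectures, existing parameter-efficient fine-tuning methods such as LoRA operate independently within each modality when adapting to multimodal tasks, which limits their ability to capture cross-modal interactions. A lightweight adaptation method that bridges this gap is needed.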

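Result: On vision-language (RefCOCO, RefCOCO+, RefCOCOg) and audio-visual (AVE, AVS) benchmarks, CoLA consistently outperforms LoRA, with relative gains of about 3% and 2% respectively, while remaining parameter-efficient. It also enables the first multi-task parameter-efficient fine-tuning framework for visual grounding.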

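Insight: The novelty is the dual-path design (intra-modal and inter-modal adaptation pathways), which lets fine-tuning learn modality-specific and cross-modal interactions at the same time without interference, efficiently adapting unimodal foundation models to multimodal tasks and filling a key gap in efficient multimodal adaptation.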

Abstract: Foundation models have revolutionized AI, but adapting them efficiently for multimodal tasks, particularly in dual-stream architectures composed of unimodal encoders, such as DINO and BERT, remains a significant challenge. Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) enable lightweight adaptation, yet they operate in isolation within each modality, limiting their ability to capture cross-modal interactions. In this paper, we take a step toward bridging this gap with Cross-Modal Low-Rank Adaptation (CoLA), a novel PEFT framework that extends LoRA by introducing a dedicated inter-modal adaptation pathway alongside the standard intra-modal one. This dual-path design enables CoLA to adapt unimodal foundation models to multimodal tasks effectively, without interference between modality-specific and cross-modal learning. We evaluate CoLA across a range of vision-language (RefCOCO, RefCOCO+, RefCOCOg) and audio-visual (AVE, AVS) benchmarks, where it consistently outperforms LoRA, achieving a relative gain of around 3% and 2%, respectively, while maintaining parameter efficiency. Notably, CoLA enables the first multi-task PEFT framework for visual grounding, bridging a key gap in efficient multimodal adaptation.
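
The dual-path idea can be pictured as a frozen linear layer plus two low-rank updates, one driven by the layer's own modality and one driven by the other modality. The sketch below is a guess at the general shape only: the abstract does not specify where the pathways attach or how the other modality's features are fed in, so the signature, the use of a single feature vector `y` for the other modality, and the additive combination are all assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def dual_path_lora(x, y, W, A_intra, B_intra, A_cross, B_cross, scale=1.0):
    """Hypothetical dual-path low-rank layer.

    x: feature from this layer's own modality
    y: feature from the other modality (assumed already pooled or aligned)
    W: frozen pretrained weight
    A_*/B_*: trainable low-rank factors (rank = number of rows in A_*)
    """
    frozen = matvec(W, x)
    intra = matvec(B_intra, matvec(A_intra, x))  # standard LoRA update
    cross = matvec(B_cross, matvec(A_cross, y))  # inter-modal update
    return [f + scale * (i + c) for f, i, c in zip(frozen, intra, cross)]


# Tiny hypothetical example: 2-d own-modality input, 1-d other-modality input.
W = [[1, 0], [0, 1]]
out = dual_path_lora(
    x=[1, 2], y=[3], W=W,
    A_intra=[[1, 0]], B_intra=[[0], [0]],  # B starts at zero, as in LoRA
    A_cross=[[1]], B_cross=[[1], [0]],
)
```

Keeping the two updates in separate factor pairs is what would let modality-specific and cross-modal adaptation be learned side by side; setting the cross factors to zero recovers ordinary LoRA.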


[49] StoryBlender: Inter-Shot Consistent and Editable 3D Storyboard with Spatial-temporal Dynamics cs.CV | cs.AI

Bingliang Li, Zhenhong Sun, Jiaming Bian, Yuehao Wu, Yifu Wang

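TL;DR: StoryBlender is a framework for generating 3D storyboards that aims to achieve inter-shot consistency and explicit editability at the same time. Governed by a Story-centric Reflection Scheme, it builds native 3D scenes through a three-stage pipeline (Semantic-Spatial Grounding, Canonical Asset Materialization, Spatial-Temporal Dynamics), supports direct, precise editing of cameras and visual assets, and iteratively self-corrects spatial hallucinations via engine-verified feedback.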

Details

Motivation: 当前基于2D扩散的生成器在故事板自动化中常存在身份漂移和几何控制有限的问题,而传统3D动画工作流虽然一致且可编辑但需要大量专家劳动。本文旨在解决现有方法难以同时满足镜头间一致性和显式可编辑性的挑战。

Result: 实验表明,StoryBlender在一致性和可编辑性方面显著优于基于扩散和基于3D的基线方法。

Insight: 创新点在于提出了一个以故事为中心的反思方案和三阶段流程,通过构建连续性记忆图解耦全局资产与镜头特定变量以实现长程一致性,并在统一坐标空间中实例化实体以保持视觉身份,同时通过分层多智能体协调和引擎验证反馈循环实现迭代自校正,从而生成支持直接精确编辑的原生3D场景。

Abstract: Storyboarding is a core skill in visual storytelling for film, animation, and games. However, automating this process requires a system to achieve two properties that current approaches rarely satisfy simultaneously: inter-shot consistency and explicit editability. While 2D diffusion-based generators produce vivid imagery, they often suffer from identity drift along with limited geometric control; conversely, traditional 3D animation workflows are consistent and editable but require expert-heavy, labor-intensive authoring. We present StoryBlender, a grounded 3D storyboard generation framework governed by a Story-centric Reflection Scheme. At its core, we propose the StoryBlender system, which is built on a three-stage pipeline: (1) Semantic-Spatial Grounding, to construct a continuity memory graph to decouple global assets from shot-specific variables for long-horizon consistency; (2) Canonical Asset Materialization, to instantiate entities in a unified coordinate space to maintain visual identity; and (3) Spatial-Temporal Dynamics, to achieve layout design and cinematic evolution through visual metrics. By orchestrating multiple agents in a hierarchical manner within a verification loop, StoryBlender iteratively self-corrects spatial hallucinations via engine-verified feedback. The resulting native 3D scenes support direct, precise editing of cameras and visual assets while preserving unwavering multi-shot continuity. Experiments demonstrate that StoryBlender significantly improves consistency and editability over both diffusion-based and 3D-grounded baselines. Code, data, and demonstration video will be available on https://engineeringai-lab.github.io/StoryBlender/


[50] When Sinks Help or Hurt: Unified Framework for Attention Sink in Large Vision-Language Models cs.CVPDF

Jiho Choi, Jaemin Kim, Sanghwan Kim, Seunghoon Hong, Jin-Hwi Park

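TL;DR: This paper studies attention sinks in large vision-language models (LVLMs), dividing them into V-sinks, which originate in the vision encoder, and L-sinks, which arise in deep LLM layers. It finds a performance trade-off between encoding global scene priors and suppressing fine-grained local visual evidence. On that basis it proposes Layer-wise Sink Gating (LSG), a lightweight plug-and-play module that needs no task-specific supervision and dynamically modulates attention contributions, improving multimodal benchmark performance in most layers.

Details

Motivation: To explore the cross-modal impact of attention sinks in LVLMs, clarify whether they are redundant artifacts or essential global priors, and address the way their dominance can suppress the fine-grained evidence needed for local perception.

Result: On representative multimodal benchmarks, LSG yields improvements in most layers, balancing global reasoning and precise local evidence.

Insight: The paper classifies visual attention sinks into V-sinks and L-sinks and exposes the underlying performance trade-off. It proposes LSG, a lightweight trainable module that dynamically scales attention contributions without fine-tuning the backbone or requiring task-specific supervision.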

Abstract: Attention sinks are defined as tokens that attract disproportionate attention. While these have been studied in single-modality transformers, their cross-modal impact in Large Vision-Language Models (LVLM) remains largely unexplored: are they redundant artifacts or essential global priors? This paper first categorizes visual sinks into two distinct categories: ViT-emerged sinks (V-sinks), which propagate from the vision encoder, and LLM-emerged sinks (L-sinks), which arise within deep LLM layers. Based on the new definition, our analysis reveals a fundamental performance trade-off: while sinks effectively encode global scene-level priors, their dominance can suppress the fine-grained visual evidence required for local perception. Furthermore, we identify specific functional layers where modulating these sinks most significantly impacts downstream performance. To leverage these insights, we propose Layer-wise Sink Gating (LSG), a lightweight, plug-and-play module that dynamically scales the attention contributions of V-sinks and the remaining visual tokens. LSG is trained via standard next-token prediction, requiring no task-specific supervision while keeping the LVLM backbone frozen. In most layers, LSG yields improvements on representative multimodal benchmarks, effectively balancing global reasoning and precise local evidence.


[51] Gaze to Insight: A Scalable AI Approach for Detecting Gaze Behaviours in Face-to-Face Collaborative Learning cs.CVPDF

Junyuan Liang, Qi Zhou, Sahan Bulathwela, Mutlu Cukurova

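TL;DR: This study proposes a scalable AI approach that uses pretrained and foundation models to automatically detect gaze behaviours in face-to-face collaborative learning without human-annotated data. It combines YOLO11 for person tracking, YOLOE-26 for education-related object detection, and Gaze-LLE for gaze target prediction, achieving an F1-score of 0.829 on video data, with strong performance for laptop-directed and peer-directed gaze but weaker performance for other targets.

Details

Motivation: Existing machine learning methods for detecting gaze behaviours in collaborative learning depend on large amounts of labelled data and lack cross-configuration robustness. The study aims to develop an automated solution that requires no human annotation.

Result: An F1-score of 0.829 on video data, with strong performance for laptop-directed and peer-directed gaze and weaker performance for other targets. Compared with supervised approaches, the method performs better and more stably in complex contexts, indicating better cross-configuration robustness.

Insight: The novelty is the integration of pretrained and foundation models (YOLO11, YOLOE-26, Gaze-LLE) into an annotation-free end-to-end pipeline, which improves the scalability and robustness of automated detection and suggests a route to real-time analysis in educational settings.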

Abstract: Previous studies have illustrated the potential of analysing gaze behaviours in collaborative learning to provide educationally meaningful information for students to reflect on their learning. Over the past decades, machine learning approaches have been developed to automatically detect gaze behaviours from video data. Yet, since these approaches often require large amounts of labelled data for training, human annotation remains necessary. Additionally, researchers have questioned the cross-configuration robustness of machine learning models developed, as training datasets often fail to encompass the full range of situations encountered in educational contexts. To address these challenges, this study proposes a scalable artificial intelligence approach that leverages pretrained and foundation models to automatically detect gaze behaviours in face-to-face collaborative learning contexts without requiring human-annotated data. The approach utilises pretrained YOLO11 for person tracking, YOLOE-26 with text-prompt capability for education-related object detection, and the Gaze-LLE model for gaze target prediction. The results indicate that the proposed approach achieves an F1-score of 0.829 in detecting students’ gaze behaviours from video data, with strong performance for laptop-directed gaze and peer-directed gaze, yet weaker performance for other gaze targets. Furthermore, when compared to other supervised machine learning approaches, the proposed method demonstrates superior and more stable performance in complex contexts, highlighting its better cross-configuration robustness. The implications of this approach for supporting students’ collaborative learning in real-world environments are also discussed.


[52] EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs cs.CVPDF

Zhenghao Chen, Huiqun Wang, Di Huang

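TL;DR: This paper proposes EgoMind, a Chain-of-Thought framework that achieves geometry-free spatial reasoning through Role-Play Caption and Progressive Spatial Analysis, obtaining competitive results on several spatial cognition benchmarks with only a small amount of auto-generated data.

Details

Motivation: Existing MLLMs either rely on costly 3D priors or geometric supervision for spatial cognition tasks, or perform poorly because they struggle to capture cross-frame spatial relationships. EgoMind aims to address this through purely linguistic reasoning.

Result: Competitive results on VSI-Bench, SPAR-Bench, SITE-Bench, and SPBench, using only 5K auto-generated SFT samples and 20K RL samples.

Insight: The novelty lies in using Role-Play Caption to build a coherent linguistic scene graph across frames and combining it with Progressive Spatial Analysis for geometry-free spatial reasoning. This demonstrates the potential of linguistic reasoning for spatial cognition, with high data efficiency.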

Abstract: Multimodal large language models (MLLMs) are increasingly being applied to spatial cognition tasks, where they are expected to understand and interact with complex environments. Most existing works improve spatial reasoning by introducing 3D priors or geometric supervision, which enhances performance but incurs substantial data preparation and alignment costs. In contrast, purely 2D approaches often struggle with multi-frame spatial reasoning due to their limited ability to capture cross-frame spatial relationships. To address these limitations, we propose EgoMind, a Chain-of-Thought framework that enables geometry-free spatial reasoning through Role-Play Caption, which jointly constructs a coherent linguistic scene graph across frames, and Progressive Spatial Analysis, which progressively reasons toward task-specific questions. With only 5K auto-generated SFT samples and 20K RL samples, EgoMind achieves competitive results on VSI-Bench, SPAR-Bench, SITE-Bench, and SPBench, demonstrating its effectiveness in strengthening the spatial reasoning capabilities of MLLMs and highlighting the potential of linguistic reasoning for spatial cognition. Code and data are released at https://github.com/Hyggge/EgoMind.


[53] VitaTouch: Property-Aware Vision-Tactile-Language Model for Robotic Quality Inspection in Manufacturing cs.CV | cs.AI | cs.ROPDF

Junyi Zong, Qingxuan Jia, Meixian Shi, Tong Li, Jiayuan Li

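TL;DR: This paper proposes VitaTouch, a property-aware vision-tactile-language model for quality inspection in manufacturing that infers material properties and generates natural-language attribute descriptions. The model uses modality-specific encoders and a dual Q-Former to extract language-relevant visual and tactile features, and uses contrastive learning to align each modality with text and to couple vision with touch. The authors also construct VitaSet, a multimodal dataset with 186 objects, 52k images, and 5.1k human-verified instruction-answer pairs.

Details

Motivation: Quality inspection in smart manufacturing requires identifying intrinsic material and surface properties beyond visible geometry, yet vision-only methods are vulnerable to occlusion and reflection. Fusing touch and other modalities is therefore needed to improve robustness and accuracy.

Result: Best performance on HCT and the overall TVL benchmark, and competitive on SSVTP. On VitaSet: 88.89% hardness accuracy, 75.13% roughness accuracy, 54.81% descriptor recall, and a peak semantic similarity of 0.9009 on the material-description task. After LoRA-based fine-tuning: 100.0%, 96.0%, and 92.0% accuracy for 2-, 3-, and 5-category defect recognition, respectively, plus 94.0% closed-loop recognition accuracy and 94.0% end-to-end sorting success in 100 laboratory robotic trials.

Insight: The novelty is a property-aware multimodal fusion framework that uses a dual Q-Former and contrastive learning to couple vision and touch explicitly and align them with language, together with VitaSet, a large-scale, high-quality multimodal dataset. Together they provide an effective end-to-end solution for robotic quality inspection.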

Abstract: Quality inspection in smart manufacturing requires identifying intrinsic material and surface properties beyond visible geometry, yet vision-only methods remain vulnerable to occlusion and reflection. We propose VitaTouch, a property-aware vision-tactile-language model for material-property inference and natural-language attribute description. VitaTouch uses modality-specific encoders and a dual Q-Former to extract language-relevant visual and tactile features, which are compressed into prefix tokens for a large language model. We align each modality with text and explicitly couple vision and touch through contrastive learning. We also construct VitaSet, a multimodal dataset with 186 objects, 52k images, and 5.1k human-verified instruction-answer pairs. VitaTouch achieves the best performance on HCT and the overall TVL benchmark, while remaining competitive on SSVTP. On VitaSet, it reaches 88.89% hardness accuracy, 75.13% roughness accuracy, and 54.81% descriptor recall; the material-description task further achieves a peak semantic similarity of 0.9009. With LoRA-based fine-tuning, VitaTouch attains 100.0%, 96.0%, and 92.0% accuracy for 2-, 3-, and 5-category defect recognition, respectively, and delivers 94.0% closed-loop recognition accuracy and 94.0% end-to-end sorting success in 100 laboratory robotic trials. More details are available at the project page: https://vitatouch.github.io/


[54] Safety-Aligned 3D Object Detection: Single-Vehicle, Cooperative, and End-to-End Perspectives cs.CV | cs.AI | cs.ROPDF

Brian Hsuan-Cheng Liao, Chih-Hong Cheng, Hasan Esen, Alois Knoll

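TL;DR: This paper studies safety-oriented evaluation and optimization for 3D object detection. Using the safety-oriented metric NDS-USC and the safety-aware loss EC-IoU, it analyzes how perception errors affect safety in single-vehicle, cooperative, and end-to-end autonomous driving settings, and shows that safety-aligned optimization improves system safety.

Details

Motivation: Deep learning perception systems perform well under standard metrics, but those metrics treat all perception errors equally even though only a subset is safety-critical. The paper aims to explicitly identify and reduce high-impact errors through safety-aligned evaluation and optimization, improving the safety of connected and autonomous vehicles.

Result: In single-vehicle 3D detection, safety-aware fine-tuning with EC-IoU improves safety-critical detection performance. In vehicle-infrastructure cooperative detection, cooperative models outperform vehicle-only models, supporting the "Vision Zero" safety goal. Integrating EC-IoU into the end-to-end framework SparseDrive reduces the collision rate by nearly 30%, directly improving system-level safety.

Insight: The work builds on the authors' previously proposed safety-oriented metric NDS-USC and loss EC-IoU, which bring safety-critical errors into evaluation and optimization. It validates safety-aligned methods across single-vehicle, cooperative, and end-to-end perspectives, offering a practical path from evaluation to optimization for safer CAV perception.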

Abstract: Perception plays a central role in connected and autonomous vehicles (CAVs), underpinning not only conventional modular driving stacks, but also cooperative perception systems and recent end-to-end driving models. While deep learning has greatly improved perception performance, its statistical nature makes perfect predictions difficult to attain. Meanwhile, standard training objectives and evaluation benchmarks treat all perception errors equally, even though only a subset is safety-critical. In this paper, we investigate safety-aligned evaluation and optimization for 3D object detection that explicitly characterize high-impact errors. Building on our previously proposed safety-oriented metric, NDS-USC, and safety-aware loss function, EC-IoU, we make three contributions. First, we present an expanded study of single-vehicle 3D object detection models across diverse neural network architectures and sensing modalities, showing that gains under standard metrics such as mAP and NDS may not translate to safety-oriented criteria represented by NDS-USC. With EC-IoU, we reaffirm the benefit of safety-aware fine-tuning for improving safety-critical detection performance. Second, we conduct an ego-centric, safety-oriented evaluation of AV-infrastructure cooperative object detection models, underscoring its superiority over vehicle-only models and demonstrating a safety impact analysis that illustrates the potential contribution of cooperative models to “Vision Zero.” Third, we integrate EC-IoU into SparseDrive and show that safety-aware perception hardening can reduce collision rate by nearly 30% and improve system-level safety directly in an end-to-end perception-to-planning framework. Overall, our results indicate that safety-aligned perception evaluation and optimization offer a practical path toward enhancing CAV safety across single-vehicle, cooperative, and end-to-end autonomy settings.


[55] CoLoRSMamba: Conditional LoRA-Steered Mamba for Supervised Multimodal Violence Detection cs.CV | cs.AI | cs.LG | cs.SDPDF

Damith Chamalke Senadeera, Dimitrios Kollias, Gregory Slabaugh

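TL;DR: This paper proposes CoLoRSMamba, a directional video-to-audio architecture for supervised multimodal violence detection. It couples VideoMamba and AudioMamba through CLS-guided conditional LoRA: the video CLS token produces a modulation vector and a stabilization gate that dynamically adapt the audio state-space parameters, yielding scene-aware audio dynamics without token-level cross-attention. Training combines binary classification with a symmetric AV-InfoNCE objective that aligns clip-level audio and video embeddings.

Details

Motivation: Violence detection usually benefits from audio, but real-world soundscapes can be noisy or only weakly related to the visible scene, so a robust method is needed that fuses multimodal information effectively and handles noise.

Result: On audio-filtered clip-level subsets curated from NTU-CCTV and DVD, CoLoRSMamba outperforms representative audio-only, video-only, and multimodal baselines, reaching 88.63% accuracy / 86.24% F1-V on NTU-CCTV and 75.77% accuracy / 72.94% F1-V on DVD. It also offers a favorable accuracy-efficiency trade-off, surpassing several larger models with fewer parameters and FLOPs.

Insight: The novelty is the CLS-guided conditional LoRA mechanism: the video CLS token generates a modulation vector and a stabilization gate that dynamically adapt the selective state-space parameters of the audio Mamba model (including the step-size pathway). This gives efficient scene-aware audio dynamics and avoids computationally expensive token-level cross-attention, improving multimodal fusion while staying efficient.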

Abstract: Violence detection benefits from audio, but real-world soundscapes can be noisy or weakly related to the visible scene. We present CoLoRSMamba, a directional Video to Audio multimodal architecture that couples VideoMamba and AudioMamba through CLS-guided conditional LoRA. At each layer, the VideoMamba CLS token produces a channel-wise modulation vector and a stabilization gate that adapt the AudioMamba projections responsible for the selective state-space parameters (Delta, B, C), including the step-size pathway, yielding scene-aware audio dynamics without token-level cross-attention. Training combines binary classification with a symmetric AV-InfoNCE objective that aligns clip-level audio and video embeddings. To support fair multimodal evaluation, we curate audio-filtered clip level subsets of the NTU-CCTV and DVD datasets from temporal annotations, retaining only clips with available audio. On these subsets, CoLoRSMamba outperforms representative audio-only, video-only, and multimodal baselines, achieving 88.63% accuracy / 86.24% F1-V on NTU-CCTV and 75.77% accuracy / 72.94% F1-V on DVD. It further offers a favorable accuracy-efficiency tradeoff, surpassing several larger models with fewer parameters and FLOPs.
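
As a rough illustration of the symmetric alignment objective, the sketch below implements a generic symmetric InfoNCE loss over paired clip embeddings in plain Python. It is not the paper's AV-InfoNCE implementation; the temperature of 0.07, the L2 normalisation, and the function names are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_info_nce(audio, video, temperature=0.07):
    """Average of audio-to-video and video-to-audio InfoNCE losses.
    audio[i] and video[i] are embeddings of the same clip."""
    a = [normalize(x) for x in audio]
    v = [normalize(x) for x in video]
    logits = [[dot(ai, vj) / temperature for vj in v] for ai in a]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))
```

Matched pairs sit on the diagonal of the similarity matrix, so the loss is small when each audio clip is most similar to its own video clip and large when pairs are swapped.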


[56] Bridging the Dimensionality Gap: A Taxonomy and Survey of 2D Vision Model Adaptation for 3D Analysis cs.CVPDF

Akshat Pandya, Bhavuk Jain

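TL;DR: This survey systematically reviews and classifies strategies for adapting 2D vision models (such as CNNs and ViTs) to 3D analysis tasks (such as point cloud and mesh processing). It proposes a unified taxonomy with three families (data-centric, architecture-centric, and hybrid methods) and analyzes their trade-offs in computational complexity, reliance on pre-training, and preservation of geometric inductive biases.

Details

Motivation: 2D vision models face a fundamental dimensionality gap when handling irregular, sparse 3D data such as point clouds. The survey aims to systematize existing adaptation strategies and give the field a clear taxonomy and analytical framework.

Result: As a survey, the paper reports no quantitative experimental results. Using the proposed taxonomy, it qualitatively analyzes existing methods and compares the strengths and weaknesses of different strategies in computational efficiency, data requirements, and geometric understanding.

Insight: The novelty is a unified three-way taxonomy (data-centric, architecture-centric, hybrid) that systematically exposes the core trade-offs among adaptation strategies and points to future directions such as 3D foundation models, self-supervised learning for geometric data, and deeper integration of multi-modal signals.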

Abstract: The remarkable success of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in 2D vision has spurred significant research in extending these architectures to the complex domain of 3D analysis. Yet, a core challenge arises from a fundamental dichotomy between the regular, dense grids of 2D images and the irregular, sparse nature of 3D data such as point clouds and meshes. This survey provides a comprehensive review and a unified taxonomy of adaptation strategies that bridge this gap, classifying them into three families: (1) Data-centric methods that project 3D data into 2D formats to leverage off-the-shelf 2D models, (2) Architecture-centric methods that design intrinsic 3D networks, and (3) Hybrid methods, which synergistically combine the two modeling paradigms to benefit from both rich visual priors of large 2D datasets and explicit geometric reasoning of 3D models. Through this framework, we qualitatively analyze the fundamental trade-offs between these families concerning computational complexity, reliance on large-scale pre-training, and the preservation of geometric inductive biases. We discuss key open challenges and outline promising future research directions, including the development of 3D foundation models, advancements in self-supervised learning (SSL) for geometric data, and the deeper integration of multi-modal signals.


[57] Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction cs.CVPDF

Wuqi Su, Huilun Song, Chen Zhao, Chi Xu

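TL;DR: This paper proposes a multilevel perceptual conditional random field (CRF) model built on a Swin Transformer backbone for depth estimation from monocular RGB images. Through an adaptive hybrid pyramid feature fusion (HPF) strategy, a hierarchical awareness adapter (HA), and a fully-connected CRF decoder with dynamic scaling attention, the model integrates global and local context, reduces computational complexity, and strengthens representational capacity. Experiments on NYU Depth v2, KITTI, and MatterPort3D show state-of-the-art performance with relatively few parameters and low inference time.

Details

Motivation: Monocular depth estimation is challenging because of scale ambiguity and the absence of explicit geometric cues. Existing methods typically rely on complex network architectures, which raises training costs without fully exploiting inter-pixel spatial dependencies. The paper aims to address these limitations with a more efficient model design.

Result: On NYU Depth v2, Abs Rel falls to 0.088 (a 7.4% relative reduction) and RMSE to 0.316 (a 5.4% relative reduction). On KITTI, threshold accuracy is near-perfect (about 99.8% at δ < 1.25^3). The model needs only 194M parameters and 21 ms of inference time, and reaches state-of-the-art (SOTA) results on multiple benchmarks.

Insight: The contributions are threefold. An adaptive hybrid pyramid feature fusion strategy combines multi-scale spatial pyramid pooling with biaxial feature aggregation to capture short-range and long-range dependencies. A hierarchical awareness adapter enriches cross-level feature interactions through lightweight broadcast modules with learnable dimensional scaling. A fully-connected CRF decoder adds dynamic scaling attention and a bias learning unit to model fine-grained pixel-level spatial relationships and keep training stable. Together these improve performance while keeping computational overhead in check.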

Abstract: Monocular depth estimation from a single RGB image remains a fundamental challenge in computer vision due to inherent scale ambiguity and the absence of explicit geometric cues. Existing approaches typically rely on increasingly complex network architectures to regress depth maps, which escalates training costs and computational overhead without fully exploiting inter-pixel spatial dependencies. We propose a multilevel perceptual conditional random field (CRF) model built upon the Swin Transformer backbone that addresses these limitations through three synergistic innovations: (1) an adaptive hybrid pyramid feature fusion (HPF) strategy that captures both short-range and long-range dependencies by combining multi-scale spatial pyramid pooling with biaxial feature aggregation, enabling effective integration of global and local contextual information; (2) a hierarchical awareness adapter (HA) that enriches cross-level feature interactions within the encoder through lightweight broadcast modules with learnable dimensional scaling, reducing computational complexity while enhancing representational capacity; and (3) a fully-connected CRF decoder with dynamic scaling attention that models fine-grained pixel-level spatial relationships, incorporating a bias learning unit to prevent extreme-value collapse and ensure stable training. Extensive experiments on NYU Depth v2, KITTI, and MatterPort3D datasets demonstrate that our method achieves state-of-the-art performance, reducing Abs Rel to 0.088 ($-7.4\%$) and RMSE to 0.316 ($-5.4\%$) on NYU Depth v2, while attaining near-perfect threshold accuracy ($\approx 99.8\%$ at $\delta < 1.25^3$) on KITTI with only 194M parameters and 21ms inference time.
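
For readers unfamiliar with the reported metrics, the following sketch computes the standard depth-estimation measures named above (Abs Rel, RMSE, and threshold accuracy at δ < 1.25^k) on a toy list of values. It assumes all ground-truth and predicted depths are positive and already restricted to valid pixels; the function names are illustrative.

```python
import math

def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)

def rmse(pred, gt):
    """Root mean squared error."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))

def threshold_accuracy(pred, gt, power=1):
    """Fraction of pixels with max(pred/gt, gt/pred) < 1.25 ** power."""
    thr = 1.25 ** power
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < thr)
    return hits / len(gt)

pred = [2.0, 4.0, 10.0]
gt = [2.0, 5.0, 4.0]
print(abs_rel(pred, gt), rmse(pred, gt), threshold_accuracy(pred, gt, power=3))
```

Lower is better for Abs Rel and RMSE, higher is better for threshold accuracy. If the quoted percentages are relative reductions against a single baseline, they would imply baseline values near 0.095 (Abs Rel) and 0.334 (RMSE); that reading is an inference from the numbers, not something the abstract states.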


[58] Learning Additively Compositional Latent Actions for Embodied AI cs.CV | cs.AIPDF

Hangxing Wei, Xiaoyu Chen, Chuheng Zhang, Tim Pearce, Jianyu Chen

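TL;DR: This paper proposes the Additively Compositional Latent Action Model (AC-LAM) for latent action learning in embodied AI. By enforcing a scene-wise additive composition structure on the latent action space, the model addresses two problems in existing methods: latent actions become entangled with irrelevant scene details or information about future observations, and motion magnitude is poorly calibrated.

Details

Motivation: Existing latent action learning methods lack priors that encode the additive, compositional structure of physical motion, so learned latent actions often contain irrelevant information and miscalibrated motion magnitude. The paper introduces structural constraints to learn cleaner, better-calibrated latent actions.

Result: On simulated and real-world tabletop tasks, the latent actions learned by AC-LAM are more structured, more motion-specific, and better displacement-calibrated, and they provide stronger supervision for downstream policy learning, outperforming state-of-the-art latent action models.

Insight: The novelty is imposing a scene-wise additive composition constraint on the latent action space over short horizons. This encourages simple algebraic structure (identity, inverse, cycle consistency) and suppresses information that does not compose additively, improving learning quality. More broadly, introducing such structural priors is an effective way to improve unsupervised action representation learning.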

Abstract: Latent action learning infers pseudo-action labels from visual transitions, providing an approach to leverage internet-scale video for embodied AI. However, most methods learn latent actions without structural priors that encode the additive, compositional structure of physical motion. As a result, latents often entangle irrelevant scene details or information about future observations with true state changes and miscalibrate motion magnitude. We introduce Additively Compositional Latent Action Model (AC-LAM), which enforces scene-wise additive composition structure over short horizons on the latent action space. These AC constraints encourage simple algebraic structure in the latent action space (identity, inverse, cycle consistency) and suppress information that does not compose additively. Empirically, AC-LAM learns more structured, motion-specific, and displacement-calibrated latent actions and provides stronger supervision for downstream policy learning, outperforming state-of-the-art LAMs across simulated and real-world tabletop tasks.
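
One way to read the algebraic properties named in the abstract is sketched below. The notation is assumed for illustration: $z(o_i, o_j)$ denotes the latent action inferred from the transition between observations $o_i$ and $o_j$ within one scene, with $t < s < u$ over a short horizon. The abstract does not give the paper's actual loss terms, so these are the idealised identities such constraints would encourage, not the authors' objective.

```latex
\begin{align}
  z(o_t, o_t) &= 0
    && \text{(identity)} \\
  z(o_s, o_t) &= -\,z(o_t, o_s)
    && \text{(inverse)} \\
  z(o_t, o_u) &= z(o_t, o_s) + z(o_s, o_u)
    && \text{(additive composition)} \\
  z(o_t, o_s) + z(o_s, o_u) + z(o_u, o_t) &= 0
    && \text{(cycle consistency)}
\end{align}
```

The cycle identity follows from the composition and inverse identities, which is why a latent that carries scene appearance or future-frame content (neither of which adds up along a trajectory) is penalised under this structure.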


[59] YOLOv11 Demystified: A Practical Guide to High-Performance Object Detection cs.CVPDF

Nikhileswara Rao Sulake

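TL;DR: YOLOv11 is the latest version in the YOLO series of real-time object detectors. It introduces architectural modules (C3K2 blocks, SPPF, and C2PSA) that improve feature extraction and small-object detection, raising accuracy while remaining real-time.

Details

Motivation: To address the limitations of existing real-time object detectors in feature extraction and small-object detection, improving accuracy without sacrificing speed, for applications such as autonomous driving, surveillance, and video analytics.

Result: On standard benchmarks, YOLOv11 improves on prior YOLO versions in both mean Average Precision (mAP) and inference speed, achieving higher accuracy while maintaining real-time performance.

Insight: The key components are the C3K2 blocks, the SPPF module, and the C2PSA module, which enhance spatial feature processing. Together they balance accuracy and speed and offer an architectural template for real-time object detection.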

Abstract: YOLOv11 is the latest iteration in the You Only Look Once (YOLO) series of real-time object detectors, introducing novel architectural modules to improve feature extraction and small-object detection. In this paper, we present a detailed analysis of YOLOv11, including its backbone, neck, and head components. The model's key innovations, the C3K2 blocks, Spatial Pyramid Pooling - Fast (SPPF), and C2PSA (Cross Stage Partial with Spatial Attention) modules, enhance spatial feature processing while preserving speed. We compare YOLOv11's performance to prior YOLO versions on standard benchmarks, highlighting improvements in mean Average Precision (mAP) and inference speed. Our results demonstrate that YOLOv11 achieves superior accuracy without sacrificing real-time capabilities, making it well-suited for applications in autonomous driving, surveillance, and video analytics. This work formalizes YOLOv11 in a research context, providing a clear reference for future studies.


[60] ViBA: Implicit Bundle Adjustment with Geometric and Temporal Consistency for Robust Visual Matching cs.CVPDF

Xiaoji Niu, Yuqing Wang, Yan Wang, Hailiang Tang, Tisheng Zhang

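TL;DR: This paper proposes ViBA, a sustainable learning framework that combines geometric optimization with feature learning for continuous online training on unconstrained video streams. Through an implicitly differentiable geometric residual framework, ViBA integrates an initial tracking network, depth-based outlier filtering, and differentiable global bundle adjustment that jointly refines camera poses and feature positions. It combines the geometric consistency of bundle adjustment with long-term temporal consistency across frames to make features more stable and accurate.

Details

Motivation: Most existing image keypoint detection and description methods rely on datasets with accurate pose and depth annotations, which limits scalability and generalization and often degrades navigation and localization performance.

Result: On the EuRoC and UMA datasets, compared with state-of-the-art methods such as SuperPoint+SuperGlue, ALIKED, and LightGlue, ViBA reduces mean absolute translation error by 12-18% and absolute rotation error by 5-10% while maintaining real-time inference speed (36-91 FPS). On unseen sequences it retains over 90% localization accuracy, demonstrating strong generalization.

Insight: The novelty is integrating differentiable bundle adjustment with feature learning online, using geometric and temporal consistency constraints for continual refinement without large amounts of annotated data. This improves the robustness and generalization of visual matching in real-world scenes.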

Abstract: Most existing image keypoint detection and description methods rely on datasets with accurate pose and depth annotations, limiting scalability and generalization, and often degrading navigation and localization performance. We propose ViBA, a sustainable learning framework that integrates geometric optimization with feature learning for continuous online training on unconstrained video streams. Embedded in a standard visual odometry pipeline, it consists of an implicitly differentiable geometric residual framework: (i) an initial tracking network for inter-frame correspondences, (ii) depth-based outlier filtering, and (iii) differentiable global bundle adjustment that jointly refines camera poses and feature positions by minimizing reprojection errors. By combining geometric consistency from BA with long-term temporal consistency across frames, ViBA enforces stable and accurate feature representations. We evaluate ViBA on EuRoC and UMA datasets. Compared with state-of-the-art methods such as SuperPoint+SuperGlue, ALIKED, and LightGlue, ViBA reduces mean absolute translation error (ATE) by 12-18% and absolute rotation error (ARE) by 5-10% across sequences, while maintaining real-time inference speeds (FPS 36-91). When evaluated on unseen sequences, it retains over 90% localization accuracy, demonstrating robust generalization. These results show that ViBA supports continuous online learning with geometric and temporal consistency, consistently improving navigation and localization in real-world scenarios.
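
Bundle adjustment minimises reprojection error, which the sketch below computes for a single point under a standard pinhole camera model. It is only the residual for one observation, not ViBA's differentiable pipeline; the intrinsics values and function names are made up for illustration, and the 3D point is assumed to be expressed in the camera frame already.

```python
import math

def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point (camera frame, z > 0) to pixels."""
    x, y, z = point_cam
    return (fx * x / z + cx, fy * y / z + cy)

def reprojection_error(point_cam, observed_px, fx, fy, cx, cy):
    """Euclidean pixel distance between the projected and observed keypoint."""
    u, v = project(point_cam, fx, fy, cx, cy)
    return math.hypot(u - observed_px[0], v - observed_px[1])

# Hypothetical intrinsics and a single observation.
intrinsics = dict(fx=100.0, fy=100.0, cx=320.0, cy=240.0)
print(project((1.0, 2.0, 4.0), **intrinsics))                            # (345.0, 290.0)
print(reprojection_error((1.0, 2.0, 4.0), (348.0, 294.0), **intrinsics))  # 5.0
```

A full bundle adjustment sums the squared residuals over all points and frames and optimises poses and point positions jointly; making that optimisation differentiable is what lets the residual supervise the feature network.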


[61] Banana100: Breaking NR-IQA Metrics by 100 Iterative Image Replications with Nano Banana Pro cs.CV | cs.AI | cs.LGPDF

Kenan Tang, Praveen Arunshankar, Andong Hua, Anthony Yang, Yao Qin

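TL;DR: This paper reveals an iterative degradation problem in multi-turn image editing by multi-modal agentic systems: as the number of edits grows, image quality drops markedly because small artifacts accumulate. To study the problem systematically, the authors build Banana100, a dataset of 28,000 degraded images generated through 100 iterative editing steps. They find that existing no-reference image quality assessment (NR-IQA) metrics fail to detect the degradation, which may threaten model training and system safety.

Details

Motivation: Multi-modal agentic systems suffer gradual image quality degradation during multi-turn iterative editing. Current image editing models produce high-quality single-turn edits, but repeated edits accumulate artifacts, severely degrading image quality and causing failures to follow simple instructions. Current image quality assessment methods do not identify this degradation, which may affect the stability of future model training and the safety of deployed systems.

Result: Among 21 popular NR-IQA metrics, none consistently assigns lower scores to heavily degraded images than to clean ones, exposing a clear gap in current evaluation. Banana100 covers diverse textures and image content and serves as a benchmark for systematic study.

Insight: The paper systematically documents image degradation in multi-turn iterative editing and the challenge it poses to evaluation metrics. It builds the large-scale degradation dataset Banana100 and argues that the dual failure of generators and evaluators may threaten the robustness of multi-modal agentic systems, pointing the way toward more robust models.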

Abstract: The multi-step, iterative image editing capabilities of multi-modal agentic systems have transformed digital content creation. Although the latest image editing models faithfully follow instructions and generate high-quality images in single-turn edits, we identify a critical weakness in multi-turn editing, which is the iterative degradation of image quality. As images are repeatedly edited, minor artifacts accumulate, rapidly leading to a severe accumulation of visible noise and a failure to follow simple editing instructions. To systematically study these failures, we introduce Banana100, a comprehensive dataset of 28,000 degraded images generated through 100 iterative editing steps, including diverse textures and image content. Alarmingly, image quality evaluators fail to detect the degradation. Among 21 popular no-reference image quality assessment (NR-IQA) metrics, none consistently assigns lower scores to heavily degraded images than to clean ones. The dual failures of generators and evaluators may threaten the stability of future model training and the safety of deployed agentic systems, if the low-quality synthetic data generated by multi-turn edits escape quality filters. We release the full code and data to facilitate the development of more robust models, helping to mitigate the fragility of multi-modal agentic systems.


[62] KiToke: Kernel-based Interval-aware Token Compression for Video Large Language Models cs.CVPDF

Haifeng Huang, Yang Li

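TL;DR: This paper proposes KiToke, a training-free, query-agnostic visual token compression method for video large language models (Video LLMs). It estimates token diversity globally with a kernel-based redundancy measure for content-adaptive token selection, and adds lightweight temporal interval construction with interval-aware token merging to maintain temporal coherence. The aim is to reduce spatiotemporal redundancy while preserving critical visual information, thereby lowering inference cost.

Details

Motivation: Video LLMs perform well on video understanding tasks, but the large number of visual tokens makes inference expensive. Existing methods rely on local or segment-level heuristics and fail to capture global redundancy.

Result: Extensive experiments on multiple video understanding benchmarks and Video LLM backbones show that KiToke consistently outperforms existing training-free compression methods, with particularly large gains at aggressive retention ratios down to 1%.

Insight: There are two contributions: (1) a kernel-based redundancy measure that estimates redundancy globally over the entire video, giving more efficient token utilization; and (2) lightweight temporal interval construction with interval-aware merging, which maintains temporal coherence under compression. The method is content-adaptive, training-free, and effective under extreme compression.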

Abstract: Video Large Language Models (Video LLMs) achieve strong performance on video understanding tasks but suffer from high inference costs due to the large number of visual tokens. We propose KiToke, a training-free, query-agnostic token compression approach that reduces spatiotemporal redundancy while preserving critical visual information. Our method estimates token diversity globally using a kernel-based redundancy measure, enabling content-adaptive selection that remains effective under extreme token budgets, and further introduces a lightweight temporal interval construction with interval-aware token merging to maintain temporal coherence. Unlike prior methods that rely on local or segment-level heuristics, KiToke explicitly captures global redundancy across an entire video, leading to more efficient token utilization. Extensive experiments on multiple video understanding benchmarks and Video LLM backbones demonstrate that KiToke consistently outperforms existing training-free compression methods, with particularly large gains at aggressive retention ratios down to 1%.


[63] Zero-Shot Quantization via Weight-Space Arithmetic cs.CV | cs.AI | cs.LGPDF

Daniele Solombrino, Antonio Andrea Gargiulo, Adrian Robert Minut, Luca Zhou, Alessandro Zirilli

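TL;DR: This paper proposes a zero-shot quantization method that extracts a quantization vector through weight-space arithmetic. The vector can be transferred from a donor task to a receiver model and substantially improves robustness to post-training quantization without receiver-side quantization-aware training, offering a low-cost alternative for extremely low-bit deployment.

Details

Motivation: To mitigate the performance degradation caused by post-training quantization while avoiding the receiver-side data requirements and high training cost of conventional quantization-aware training, enabling zero-shot, low-cost quantized deployment.

Result: On Vision Transformer models, the method improves robustness to post-training quantization noise by up to 60% without receiver-side quantization-aware training.

Insight: The key finding is that quantization robustness is a transferable, reusable feature of weight-space geometry rather than a byproduct of task-specific training. Extracting a quantization vector by weight-space arithmetic and transferring it across models offers a zero-shot paradigm for model compression.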

Abstract: We show that robustness to post-training quantization (PTQ) is a transferable direction in weight space. We call this direction the quantization vector: extracted from a donor task by simple weight-space arithmetic, it can be used to patch a receiver model and improve robustness to PTQ-induced noise by as much as 60%, without receiver-side quantization-aware training (QAT). Because the method requires no receiver training data, it provides a zero-shot, low-cost alternative to QAT for extremely low-bit deployment. We demonstrate this on Vision Transformer (ViT) models. More broadly, our results suggest that quantization robustness is not merely a byproduct of task-specific training, but a reusable feature of weight-space geometry that can be transferred rather than retrained.
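
The abstract says the quantization vector is obtained "by simple weight-space arithmetic" but does not spell out the formula. The sketch below assumes a task-arithmetic-style construction: the vector is the element-wise difference between a quantization-robust donor checkpoint (for example, one trained with QAT) and its standard counterpart, and patching adds a scaled copy to the receiver. The checkpoint layout, the `scale` parameter, and the function names are hypothetical.

```python
def quantization_vector(donor_robust, donor_base):
    """Assumed extraction: per-parameter difference between a
    quantization-robust donor checkpoint and its standard counterpart."""
    return {name: [r - b for r, b in zip(donor_robust[name], donor_base[name])]
            for name in donor_base}

def patch(receiver, qvec, scale=1.0):
    """Assumed patching: add the (scaled) vector to the receiver weights.
    No receiver data or training is involved."""
    return {name: [w + scale * q for w, q in zip(receiver[name], qvec[name])]
            for name in receiver}

# Toy checkpoints with a single flattened parameter tensor.
donor_base = {"w": [1.0, 2.0]}
donor_robust = {"w": [1.5, 1.0]}
receiver = {"w": [0.0, 4.0]}

qvec = quantization_vector(donor_robust, donor_base)
print(qvec)                     # {'w': [0.5, -1.0]}
print(patch(receiver, qvec))    # {'w': [0.5, 3.0]}
```

This form requires the donor and receiver to share an architecture so that parameters line up by name and shape, which would be consistent with the ViT-to-ViT setting described.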


[64] Automated Segmentation and Tracking of Group Housed Pigs Using Foundation Models cs.CVPDF

Ye Bi, Bimala Acharya, David Rosero, Juan Steibel

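TL;DR: This study proposes a foundation-model-based automated workflow for monitoring group-housed nursery pigs. Pretrained vision-language foundation models serve as general visual backbones, and farm-specific adaptation is achieved through modular post-processing. The system integrates detection, short-term video segmentation, and long-term tracking modules, achieving accurate pig segmentation and identity-consistent tracking in continuous video.

Details

Motivation: Existing monitoring pipelines in precision livestock farming rely heavily on supervised models that need extensive labelled data, repeated retraining, and farm-specific tuning. The study combines foundation-model prior knowledge with lightweight task-specific logic to build a scalable, label-efficient system for long-duration monitoring, and to address performance degradation under night-vision and heavy-occlusion conditions.

Result: On 132 uniformly sampled ground-truth frames from a continuous 132-minute video, the system achieved a mean region similarity (J) of 0.83, contour accuracy (F) of 0.92, J&F of 0.87, MOTA of 0.99, and MOTP of 90.7%, with no identity switches. On 550 one-minute video clips, over 80% of 4,927 active tracks were fully correct after post-processing.

Insight: The novelty is combining the prior knowledge of general foundation models (Grounding-DINO and Grounded-SAM2) as visual backbones with lightweight, modular, task-specific post-processing logic (initialization, tracking, matching, mask refinement, re-identification, and quality control). This reduces reliance on task-specific supervised learning and offers a feasible route to scalable, label-efficient, long-duration agricultural monitoring.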

Abstract: Foundation models (FM) are reshaping computer vision by reducing reliance on task-specific supervised learning and leveraging general visual representations learned at scale. In precision livestock farming, most pipelines remain dominated by supervised learning models that require extensive labeled data, repeated retraining, and farm-specific tuning. This study presents an FM-centered workflow for automated monitoring of group-housed nursery pigs, in which pretrained vision-language FM serve as general visual backbones and farm-specific adaptation is achieved through modular post-processing. Grounding-DINO was first applied to 1,418 annotated images to establish a baseline detection performance. While detection accuracy was high under daytime conditions, performance degraded under night-vision and heavy occlusion, motivating the integration of temporal tracking logic. Building on these detections, short-term video segmentation with Grounded-SAM2 was evaluated on 550 one-minute video clips; after post-processing, over 80% of 4,927 active tracks were fully correct, with most remaining errors arising from inaccurate masks or duplicated labels. To support identity consistency over an extended time, we further developed a long-term tracking pipeline integrating initialization, tracking, matching, mask refinement, re-identification, and post-hoc quality control. This system was evaluated on a continuous 132-minute video and maintained stable identities throughout. On 132 uniformly sampled ground-truth frames, the system achieved a mean region similarity (J) of 0.83, contour accuracy (F) of 0.92, J&F of 0.87, MOTA of 0.99, and MOTP of 90.7%, with no identity switches. Overall, this work demonstrates how FM prior knowledge can be combined with lightweight, task-specific logic to enable scalable, label-efficient, and long-duration monitoring in pig production.
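
Two of the reported metrics are simple enough to state in code. The sketch below uses the usual definitions: region similarity J as the intersection-over-union of a predicted and a ground-truth mask, and MOTA as one minus the ratio of tracking errors to ground-truth objects. Masks are represented as sets of pixel coordinates for brevity, and the error counts in the example are invented to reproduce a MOTA of 0.99; they are not the paper's counts.

```python
def region_similarity(pred_mask, gt_mask):
    """J: intersection-over-union of two binary masks,
    each given as a set of (row, col) pixel coordinates."""
    union = pred_mask | gt_mask
    if not union:
        return 1.0  # both masks empty: treat as perfect agreement
    return len(pred_mask & gt_mask) / len(union)

def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple Object Tracking Accuracy, with counts summed over all frames."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

pred = {(0, 0), (0, 1), (1, 0)}
gt = {(0, 1), (1, 0), (1, 1)}
print(region_similarity(pred, gt))   # 0.5
print(mota(5, 5, 0, 1000))           # illustrative counts giving about 0.99
```

Because identity switches enter MOTA directly, a score of 0.99 with zero switches means the remaining 1% of errors are missed or spurious detections.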


[65] Inference-Path Optimization via Circuit Duplication in Frozen Visual Transformers for Marine Species Classification cs.CV | cs.AI | cs.LG

Thomas Manuel Rost

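TL;DR: This paper studies inference-path optimization via Circuit Duplication in frozen Vision Transformers to improve marine species classification. Without fine-tuning model weights, the method traverses a selected range of transformer layers twice at inference time to strengthen the feature representation, substantially narrowing the gap to a fully supervised model on the AQUA20 benchmark.

Details

Motivation: Underwater species classification suffers from high annotation cost and from environmental variation that limits model generalization. The paper explores whether a frozen self-supervised vision foundation model can be improved through inference-time optimization, without fine-tuning its weights.

Result: On AQUA20, class-specific circuit selection reaches a macro F1 of 0.875, close to the fully supervised ConvNeXt benchmark (0.889), a gap of only 1.4 points. Octopus improves by 12.1 F1 points, and roughly 75% of classes benefit from a class-specific circuit.

Insight: To the authors' knowledge, this is the first application of Circuit Duplication, a method from LLMs, to a computer vision task, with gains obtained purely through inference-time path optimization. Class-specific circuit selection shows that different classes benefit from different feature-extraction paths, suggesting a new way to use frozen models efficiently.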

Abstract: Automated underwater species classification is constrained by annotation cost and environmental variation that limits the transferability of fully supervised models. Recent work has shown that frozen embeddings from self-supervised vision foundation models already provide a strong label-efficient baseline for marine image classification. Here we investigate whether this frozen-embedding regime can be improved at inference time, without fine-tuning or changing model weights. We apply Circuit Duplication, an inference-time method originally proposed for Large Language Models, in which a selected range of transformer layers is traversed twice during the forward pass. We evaluate on the class-imbalanced AQUA20 benchmark using frozen DINOv3 embeddings under two settings: global circuit selection, where a single duplicated circuit is chosen for the full dataset, and class-specific circuit selection, where each species may receive a different optimal circuit. Both settings use simple semi-supervised downstream classifiers. Circuit Duplication consistently improves over the standard frozen forward pass. At the maximum label budget, class-specific selection reaches a macro F1 of 0.875, closing the gap to the fully supervised ConvNeXt benchmark (0.889) to 1.4 points without any gradient-based training. Four species exceed their fully supervised reference, with octopus improving by +12.1 F1 points. Across all budgets, roughly 75% of classes prefer a class-specific circuit, indicating a genuinely class-dependent benefit. To our knowledge, this is the first application of Circuit Duplication to computer vision.
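
The core mechanism, traversing a selected range of layers twice in one forward pass, can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the toy scalar "layers", and the choice to re-run the duplicated block immediately after its first pass are all assumptions. The abstract states only that a selected range of transformer layers is traversed twice.

```python
from typing import Callable, List, Sequence

Layer = Callable[[float], float]


def duplicated_order(n_layers: int, start: int, end: int) -> List[int]:
    """Layer indices for one forward pass in which layers[start:end] run twice.

    Assumption: the duplicated block is repeated right after its first pass.
    """
    if not 0 <= start < end <= n_layers:
        raise ValueError("need 0 <= start < end <= n_layers")
    return list(range(end)) + list(range(start, end)) + list(range(end, n_layers))


def forward_with_duplication(layers: Sequence[Layer], x: float, start: int, end: int) -> float:
    """Apply frozen layers in the duplicated order; no weights are changed."""
    for i in duplicated_order(len(layers), start, end):
        x = layers[i](x)
    return x


# Toy stand-ins for frozen transformer blocks (hypothetical, for illustration only).
toy_layers = [lambda v: v + 1, lambda v: v + 10, lambda v: v + 100, lambda v: v + 1000]
print(duplicated_order(4, 1, 3))                      # [0, 1, 2, 1, 2, 3]
print(forward_with_duplication(toy_layers, 0, 1, 3))  # 1221
```

Under this reading, global selection would choose one (start, end) pair for the whole dataset, while class-specific selection would choose a pair per species, for example by scoring each candidate pair on labeled data.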


[66] RDFace: A Benchmark Dataset for Rare Disease Facial Image Analysis under Extreme Data Scarcity and Phenotype-Aware Synthetic Generation cs.CV | cs.AI

Ganlin Feng, Yuxi Long, Hafsa Ali, Erin Lou, Fahad Butt

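TL;DR: This paper introduces RDFace, a new benchmark dataset for rare disease facial image analysis comprising 456 pediatric facial images across 103 rare genetic conditions, an average of only 4.4 samples per condition. The dataset targets the field's extreme data scarcity and high phenotype similarity, supporting the development and evaluation of data-efficient AI diagnostic models under real-world low-data constraints. The authors also explore synthetic augmentation with DreamBooth and FastGAN, filtering generated images by facial landmark similarity to preserve phenotype fidelity, which improves diagnostic accuracy by up to 13.7% in ultra-low-data regimes.

Details

Motivation: Rare diseases often manifest as distinctive facial phenotypes in children, giving clinicians and AI-assisted screening systems valuable diagnostic cues. Progress is severely limited, however, by the scarcity of high-quality, ethically sourced facial data and by the high phenotype similarity across conditions.

Result: Multiple pretrained vision backbones were benchmarked on RDFace using cross-validation. Merging synthetic images with real data improved diagnostic accuracy by up to 13.7% in ultra-low-data regimes. To assess semantic validity, phenotype descriptions generated by a vision-language model from real and synthetic images achieved a report similarity score of 0.84.

Insight: The contribution is an ethically verified, standardized benchmark dataset of rare disease facial images, together with a scalable framework that pairs synthetic data generation (DreamBooth and FastGAN) with filtering by facial landmark similarity to preserve phenotype fidelity and improve model performance under extreme data scarcity. Together they give equitable rare disease AI research a transparent, benchmark-ready dataset and evaluation setup.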

Abstract: Rare diseases often manifest with distinctive facial phenotypes in children, offering valuable diagnostic cues for clinicians and AI-assisted screening systems. However, progress in this field is severely limited by the scarcity of curated, ethically sourced facial data and the high similarity among phenotypes across different conditions. To address these challenges, we introduce RDFace, a curated benchmark dataset comprising 456 pediatric facial images spanning 103 rare genetic conditions (average 4.4 samples per condition). Each ethically verified image is paired with standardized metadata. RDFace enables the development and evaluation of data-efficient AI models for rare disease diagnosis under real-world low-data constraints. We benchmark multiple pretrained vision backbones using cross-validation and explore synthetic augmentation with DreamBooth and FastGAN. Generated images are filtered via facial landmark similarity to maintain phenotype fidelity and merged with real data, improving diagnostic accuracy by up to 13.7% in ultra-low-data regimes. To assess semantic validity, phenotype descriptions generated by a vision-language model from real and synthetic images achieve a report similarity score of 0.84. RDFace establishes a transparent, benchmark-ready dataset for equitable rare disease AI research and presents a scalable framework for evaluating both diagnostic performance and the integrity of synthetic medical imagery.
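
The landmark-based filtering step can be illustrated with a small Python sketch. It is an assumption-laden illustration, not the RDFace pipeline: the abstract does not specify the similarity measure, so this sketch uses the mean Euclidean distance between corresponding landmarks, assumes landmarks are already normalized to a common scale, and uses hypothetical names (`landmark_distance`, `filter_synthetic`, `max_dist`).

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def landmark_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Mean Euclidean distance between corresponding landmarks (assumed metric)."""
    if len(a) != len(b) or not a:
        raise ValueError("landmark sets must be non-empty and the same length")
    return sum(math.hypot(p[0] - q[0], p[1] - q[1]) for p, q in zip(a, b)) / len(a)


def filter_synthetic(
    real: Sequence[Sequence[Point]],
    synthetic: Dict[str, Sequence[Point]],
    max_dist: float,
) -> List[str]:
    """Keep synthetic images whose landmarks lie close to at least one real image."""
    kept = []
    for name, points in synthetic.items():
        if min(landmark_distance(points, r) for r in real) <= max_dist:
            kept.append(name)
    return kept


# Hypothetical two-landmark faces, normalized to the unit square.
real_faces = [[(0.0, 0.0), (1.0, 0.0)]]
generated = {
    "good": [(0.0, 0.1), (1.0, 0.1)],  # small shift, phenotype geometry preserved
    "bad": [(0.0, 1.0), (1.0, 1.0)],   # large shift, geometry distorted
}
print(filter_synthetic(real_faces, generated, max_dist=0.2))  # ['good']
```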


[67] Fine-tuning DeepSeek-OCR-2 for Molecular Structure Recognition cs.CV | cs.AI | q-bio.BM

Haocheng Tang, Xingyu Dang, Junmei Wang

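TL;DR: This paper adapts DeepSeek-OCR-2 to Optical Chemical Structure Recognition (OCSR) by framing the task as image-conditioned SMILES generation. To address training instability, the authors design a two-stage progressive supervised fine-tuning strategy and train on a large-scale corpus that mixes synthetic and real data. The resulting model, MolSeek-OCR, reaches exact-match accuracy comparable to the best image-to-sequence model but still falls short of state-of-the-art image-to-graph models.

Details

Motivation: OCSR is critical for converting 2D molecular diagrams in printed literature into machine-readable formats. Although vision-language models have shown promise in end-to-end OCR tasks, applying them directly to OCSR remains challenging, and direct full-parameter supervised fine-tuning often fails.

Result: The fine-tuned model, MolSeek-OCR, achieves exact-match accuracy comparable to the best-performing image-to-sequence model but remains inferior to state-of-the-art image-to-graph models. Experiments also show that reinforcement-style post-training and data-curation-based refinement fail to improve the strict sequence-level fidelity required for exact SMILES matching.

Insight: The main contributions are reformulating OCSR as image-conditioned SMILES generation and a two-stage progressive fine-tuning strategy (starting with parameter-efficient LoRA, then moving to selective full-parameter fine-tuning with split learning rates) that overcomes training instability. In addition, the large-scale training corpus combines synthetic renderings from PubChem with real patent images from USPTO-MOL to improve coverage and robustness.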

Abstract: Optical Chemical Structure Recognition (OCSR) is critical for converting 2D molecular diagrams from printed literature into machine-readable formats. While Vision-Language Models have shown promise in end-to-end OCR tasks, their direct application to OCSR remains challenging, and direct full-parameter supervised fine-tuning often fails. In this work, we adapt DeepSeek-OCR-2 for molecular optical recognition by formulating the task as image-conditioned SMILES generation. To overcome training instabilities, we propose a two-stage progressive supervised fine-tuning strategy: starting with parameter-efficient LoRA and transitioning to selective full-parameter fine-tuning with split learning rates. We train our model on a large-scale corpus combining synthetic renderings from PubChem and realistic patent images from USPTO-MOL to improve coverage and robustness. Our fine-tuned model, MolSeek-OCR, demonstrates competitive capabilities, achieving exact matching accuracies comparable to the best-performing image-to-sequence model. However, it remains inferior to state-of-the-art image-to-graph models. Furthermore, we explore reinforcement-style post-training and data-curation-based refinement, finding that they fail to improve the strict sequence-level fidelity required for exact SMILES matching.


[68] Multimodal Urban Tree Detection from Satellite and Street-Level Imagery via Annotation-Efficient Deep Learning Strategies cs.CV

In Seon Kim, Ali Moghimi

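TL;DR: This study proposes a multimodal framework that combines high-resolution satellite imagery with Google Street View images to enable scalable, detailed urban tree detection under limited annotation. The framework first uses satellite imagery to localize tree candidates, then retrieves targeted ground-level views for detailed detection, significantly reducing inefficient street-level sampling. To ease the annotation bottleneck, domain adaptation transfers knowledge from an existing annotated dataset to a new region, and the study evaluates semi-supervised learning, active learning, and a hybrid of the two; the hybrid strategy, built on a transformer-based detection model, performs best.

Details

Motivation: The shift from labor-intensive manual field surveys to scalable automated systems is held back by high annotation costs and poor generalization across urban scenarios. The goal is scalable, annotation-efficient, precise mapping of urban trees to support environmental monitoring, post-disaster assessment, and sustainable urban planning.

Result: Under limited annotation, the hybrid strategy (semi-supervised learning plus active learning) performed best, reaching an F1-score of 0.90, a 12% improvement over the baseline model. The active and hybrid strategies reduced both false positives and false negatives.

Insight: The contributions are a multimodal (satellite plus street-view) framework that reduces sampling inefficiency, and the use of domain adaptation with a hybrid learning strategy to lower annotation requirements. The study highlights the practical value of combining learning strategies when labels are scarce; in particular, the targeted human intervention of active learning mitigates the confirmation bias of pseudo-labeling.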

Abstract: Beyond the immediate biophysical benefits, urban trees play a foundational role in environmental sustainability and disaster mitigation. Precise mapping of urban trees is essential for environmental monitoring, post-disaster assessment, and strengthening policy. However, the transition from traditional, labor-intensive field surveys to scalable automated systems remains limited by high annotation costs and poor generalization across diverse urban scenarios. This study introduces a multimodal framework that integrates high-resolution satellite imagery with ground-level Google Street View to enable scalable and detailed urban tree detection under limited-annotation conditions. The framework first leverages satellite imagery to localize tree candidates and then retrieves targeted ground-level views for detailed detection, significantly reducing inefficient street-level sampling. To address the annotation bottleneck, domain adaptation is used to transfer knowledge from an existing annotated dataset to a new region of interest. To further minimize human effort, we evaluated three learning strategies: semi-supervised learning, active learning, and a hybrid approach combining both, using a transformer-based detection model. The hybrid strategy achieved the best performance with an F1-score of 0.90, representing a 12% improvement over the baseline model. In contrast, semi-supervised learning exhibited progressive performance degradation due to confirmation bias in pseudo-labeling, while active learning steadily improved results through targeted human intervention to label uncertain or incorrect predictions. Error analysis further showed that active and hybrid strategies reduced both false positives and false negatives. Our findings highlight the importance of a multimodal approach and guided annotation for scalable, annotation-efficient urban tree mapping to strengthen sustainable city planning.
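
One way to picture the hybrid strategy is as a routing rule over detector outputs: confident predictions become pseudo-labels (the semi-supervised part), ambiguous ones are sent to a human annotator (the active-learning part). The Python sketch below is only an illustration under assumptions: the abstract does not say how the hybrid selects samples, so the confidence thresholds `accept_at` and `discard_below` and the function name are hypothetical.

```python
from typing import Dict, List, Tuple


def route_detections(
    detections: List[Tuple[str, float]],
    accept_at: float = 0.9,
    discard_below: float = 0.3,
) -> Dict[str, List[str]]:
    """Split (detection_id, confidence) pairs into three groups.

    confidence >= accept_at                  -> pseudo-label, no human effort
    discard_below <= confidence < accept_at  -> uncertain, ask a human to label
    confidence < discard_below               -> dropped as likely background
    """
    routed: Dict[str, List[str]] = {"pseudo_labels": [], "human_queries": [], "discarded": []}
    for det_id, confidence in detections:
        if confidence >= accept_at:
            routed["pseudo_labels"].append(det_id)
        elif confidence >= discard_below:
            routed["human_queries"].append(det_id)
        else:
            routed["discarded"].append(det_id)
    return routed


# Hypothetical detector outputs for three candidate trees.
print(route_detections([("tree_a", 0.95), ("tree_b", 0.50), ("tree_c", 0.20)]))
```

Sending the uncertain middle band to a human is what keeps wrong pseudo-labels from reinforcing themselves, the confirmation-bias failure the abstract reports for pure semi-supervised learning.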


[69] Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models cs.CV | cs.AI | cs.CL

Sohyeon Kim, Sang Yeon Yoon, Kyeongbo Kong

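TL;DR: This paper proposes a lightweight inference-time intervention. By analyzing the three-phase structure of attention dynamics inside the vision encoder (diffusion, focus, rediffusion), it finds that tokens receiving low attention during the focus phase are strongly tied to hallucination, and it uses a Determinantal Point Process (DPP) to selectively suppress them. The method is training-free, needs only a single forward pass, and reduces object hallucination in Large Vision-Language Models (LVLMs) while preserving caption quality and adding almost no inference latency.

Details

Motivation: LVLMs have made remarkable progress in multimodal reasoning but remain prone to object hallucination (describing objects that are absent from the image). Existing methods typically mitigate hallucination by suppressing unreliable visual signals in the vision encoder, yet many require iterative optimization for each input, which substantially increases inference latency. This paper seeks a more efficient inference-time intervention.

Result: Extensive experiments across multiple LVLM backbones and decoding strategies show that the method consistently lowers hallucination metrics (e.g., CHAIR, POPE) while maintaining competitive caption quality (e.g., CIDEr, SPICE). Compared with adversarial uncertainty estimation methods, it achieves comparable hallucination mitigation with negligible additional inference latency.

Insight: The paper reveals the three-phase structure of attention processing in the vision encoder (diffusion, focus, rediffusion) and finds that low-attention tokens in the focus phase are the ones to which hallucination is most sensitive. It proposes a training-free, lightweight intervention based on single-forward-pass statistics that uses a DPP to preserve diverse visual cues while suppressing redundant tokens, balancing efficiency and effectiveness.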

Abstract: Large Vision-Language Models (LVLMs) have achieved impressive progress in multimodal reasoning, yet they remain prone to object hallucinations, generating descriptions of objects that are not present in the input image. Recent approaches attempt to mitigate hallucinations by suppressing unreliable visual signals in the vision encoder, but many rely on iterative optimization for each input, resulting in substantial inference latency. In this work, we investigate the internal attention dynamics of vision encoders in LVLMs and identify a consistent three-phase structure of visual information processing: diffusion, focus, and rediffusion. Our analysis reveals that hallucination behavior is particularly sensitive to tokens receiving low attention during the focus phase. Motivated by this observation, we propose a lightweight inference-time intervention that selectively suppresses such tokens during the focus phase. The method operates in a training-free manner using statistics from a single forward pass and employs a Determinantal Point Process (DPP) to preserve diverse visual cues while filtering redundant tokens. Extensive experiments across multiple LVLM backbones and decoding strategies demonstrate that the proposed approach consistently reduces hallucination metrics while maintaining competitive caption quality. Moreover, compared to adversarial uncertainty estimation methods, our approach achieves comparable hallucination mitigation with negligible additional inference latency.
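
To make the role of the DPP concrete, the sketch below shows greedy subset selection under a DPP kernel in plain Python. It illustrates the general technique, not the paper's procedure: the kernel form L[i][j] = q_i * <f_i, f_j> * q_j, the greedy MAP approximation, and all names (`greedy_dpp`, `quality`, `features`) are assumptions, since the abstract says only that a DPP preserves diverse visual cues while filtering redundant tokens.

```python
from typing import List, Sequence


def det(matrix: Sequence[Sequence[float]]) -> float:
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [list(row) for row in matrix]
    n, result = len(a), 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            result = -result
        result *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return result


def greedy_dpp(features: Sequence[Sequence[float]], quality: Sequence[float], k: int) -> List[int]:
    """Greedily pick up to k token indices that maximize det of the kernel submatrix.

    Near-duplicate tokens make the submatrix close to singular, so they are skipped.
    """
    n = len(features)

    def dot(u: Sequence[float], v: Sequence[float]) -> float:
        return sum(x * y for x, y in zip(u, v))

    kernel = [[quality[i] * dot(features[i], features[j]) * quality[j] for j in range(n)]
              for i in range(n)]
    selected: List[int] = []
    for _ in range(min(k, n)):
        best, best_det = None, 0.0
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            d = det([[kernel[r][c] for c in idx] for r in idx])
            if d > best_det:
                best, best_det = i, d
        if best is None:  # every remaining token is redundant with the selection
            break
        selected.append(best)
    return selected


# Tokens 0 and 1 carry the same (unit-norm) feature; token 2 is orthogonal to both.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(greedy_dpp(tokens, quality=[1.0, 0.9, 0.8], k=2))  # [0, 2]
```

Token 1 has a higher quality score than token 2, but it duplicates token 0, so the determinant criterion prefers the diverse pair. In the paper's setting the quality term would presumably come from focus-phase attention statistics, which this sketch does not model.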


[70] LOGER: Local–Global Ensemble for Robust Deepfake Detection in the Wild cs.CV

Fei Wu, Dagong Lu, Mufeng Yao, Xinlei Xu, Fengjun Guo

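TL;DR: LOGER is a local-global ensemble framework for robust deepfake detection. A global branch captures holistic anomalies in semantics and statistics, a local branch uses a Multiple Instance Learning (MIL) top-k aggregation strategy to focus on the most suspicious local forgery traces, and logit-space fusion of the two improves robustness.

Details

Motivation: Deepfake detection in the wild is challenging because manipulation techniques keep diversifying and real-world degradations are uncontrolled. Existing methods struggle to capture global semantic anomalies and local forgery traces at the same time.

Result: LOGER took 2nd place in the NTIRE 2026 Robust Deepfake Detection Challenge and shows strong robustness and generalization across manipulation methods and real-world degradation conditions on multiple public benchmarks.

Insight: The design combines a multi-resolution global branch built on heterogeneous vision foundation models with a local branch based on MIL top-k aggregation. Because the errors of the two branches are largely decorrelated, logit-space fusion lets them complement each other, and top-k pooling mitigates the dilution of local evidence by normal regions.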

Abstract: Robust deepfake detection in the wild remains challenging due to the ever-growing variety of manipulation techniques and uncontrolled real-world degradations. Forensic cues for deepfake detection reside at two complementary levels: global-level anomalies in semantics and statistics that require holistic image understanding, and local-level forgery traces concentrated in manipulated regions that are easily diluted by global averaging. Since no single backbone or input scale can effectively cover both levels, we propose LOGER, a LOcal–Global Ensemble framework for Robust deepfake detection. The global branch employs heterogeneous vision foundation model backbones at multiple resolutions to capture holistic anomalies with diverse visual priors. The local branch performs patch-level modeling with a Multiple Instance Learning top-$k$ aggregation strategy that selectively pools only the most suspicious regions, mitigating evidence dilution caused by the dominance of normal patches; dual-level supervision at both the aggregated image level and individual patch level keeps local responses discriminative. Because the two branches differ in both granularity and backbone, their errors are largely decorrelated, a property that logit-space fusion exploits for more robust prediction. LOGER achieves 2nd place in the NTIRE 2026 Robust Deepfake Detection Challenge, and further evaluation on multiple public benchmarks confirms its strong robustness and generalization across diverse manipulation methods and real-world degradation conditions.
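
The two aggregation ideas in the abstract, top-k pooling over patch scores and fusion in logit space, are simple enough to sketch in Python. This is an illustrative sketch under assumptions, not the LOGER code: the function names, the mean over the top-k logits, and the equal fusion weight are hypothetical choices.

```python
import math
from typing import Sequence


def topk_pool(patch_logits: Sequence[float], k: int) -> float:
    """MIL-style aggregation: average only the k most suspicious patch logits."""
    if k < 1 or not patch_logits:
        raise ValueError("need k >= 1 and at least one patch")
    top = sorted(patch_logits, reverse=True)[:k]
    return sum(top) / len(top)


def fuse_logits(global_logit: float, local_logit: float, w_global: float = 0.5) -> float:
    """Weighted average of branch outputs in logit space (before the sigmoid)."""
    return w_global * global_logit + (1.0 - w_global) * local_logit


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical image: three clean patches and two manipulated ones.
patches = [-3.0, -3.0, -3.0, 5.0, 3.0]
print(sum(patches) / len(patches))  # -0.2: plain averaging dilutes the evidence
print(topk_pool(patches, k=2))      # 4.0: top-k keeps the forged patches dominant
print(sigmoid(fuse_logits(1.0, topk_pool(patches, k=2))))  # fused fake probability
```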


[71] SBF: An Effective Representation to Augment Skeleton for Video-based Human Action Recognition cs.CV

Zhuoxuan Peng, Yiyi Ding, Yang Lin, S. -H. Gary Chan

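TL;DR: This paper proposes SBF (Scale-Body-Flow), an effective representation that augments the skeleton in video-based human action recognition (HAR). SBF compensates for the limits of conventional skeleton representations by adding joint depth, body contour, and human-object interaction information, and the authors design SFSNet to predict SBF without extra annotation. Experiments show that the approach significantly improves HAR accuracy while remaining compact and efficient.

Details

Motivation: Existing 2D-skeleton-based HAR methods struggle in many common scenes because the skeleton does not capture critical action-related information such as joint depth, body contour, and human-object interaction. An effective representation is needed to augment it.

Result: Extensive experiments on multiple datasets show that the pipeline based on SBF and SFSNet achieves significantly higher HAR accuracy than state-of-the-art skeleton-only approaches at similar compactness and efficiency.

Insight: SBF integrates scale (depth), body contour, and optical flow (interaction) information with the skeleton, and SFSNet is supervised by the existing skeleton and optical flow with no extra annotation, making the information available for action recognition more complete.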

Abstract: Many modern video-based human action recognition (HAR) approaches use 2D skeleton as the intermediate representation in their prediction pipelines. Despite overall encouraging results, these approaches still struggle in many common scenes, mainly because the skeleton does not capture critical action-related information pertaining to the depth of the joints, contour of the human body, and interaction between the human and objects. To address this, we propose an effective approach to augment skeleton with a representation capturing action-related information in the pipeline of HAR. The representation, termed Scale-Body-Flow (SBF), consists of three distinct components, namely a scale map volume given by the scale (and hence depth information) of each joint, a body map outlining the human subject, and a flow map indicating human-object interaction given by pixel-wise optical flow values. To predict SBF, we further present SFSNet, a novel segmentation network supervised by the skeleton and optical flow without extra annotation overhead beyond the existing skeleton extraction. Extensive experiments across different datasets demonstrate that our pipeline based on SBF and SFSNet achieves significantly higher HAR accuracy with similar compactness and efficiency as compared with the state-of-the-art skeleton-only approaches.


[72] PortraitCraft: A Benchmark for Portrait Composition Understanding and Generation cs.CV

Yuyang Sha, Zijie Lou, Youyun Tang, Xiaochao Qu, Haoxiang Li

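TL;DR: PortraitCraft is a unified benchmark for portrait composition understanding and generation. It comprises a dataset of about 50,000 real portrait images with structured multi-level annotations and defines two complementary benchmark tasks: composition understanding and composition-aware generation.

Details

Motivation: Existing datasets and benchmarks focus mainly on coarse aesthetic scoring, generic image aesthetics, or unconstrained portrait generation, leaving structured portrait composition analysis and controllable portrait generation under explicit composition requirements without systematic study.

Result: The paper provides standardized evaluation protocols and reference baseline results with representative multimodal models, establishing a comprehensive benchmark for future research.

Insight: The contribution is a unified benchmark for portrait composition understanding and generation with structured multi-level supervision, ranging from global scores and fine-grained attribute annotations to explanation texts and visual question answering pairs, which supports interpretable aesthetic assessment and controllable generation.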

Abstract: Portrait composition plays a central role in portrait aesthetics and visual communication, yet existing datasets and benchmarks mainly focus on coarse aesthetic scoring, generic image aesthetics, or unconstrained portrait generation. This limits systematic research on structured portrait composition analysis and controllable portrait generation under explicit composition requirements. In this paper, we introduce PortraitCraft, a unified benchmark for portrait composition understanding and generation. PortraitCraft is built on a dataset of approximately 50,000 curated real portrait images with structured multi-level supervision, including global composition scores, annotations over 13 composition attributes, attribute-level explanation texts, visual question answering pairs, and composition-oriented textual descriptions for generation. Based on this dataset, we establish two complementary benchmark tasks for composition understanding and composition-aware generation within a unified framework. The first evaluates portrait composition understanding through score prediction, fine-grained attribute reasoning, and image-grounded visual question answering, while the second evaluates portrait generation from structured composition descriptions under explicit composition constraints. We further define standardized evaluation protocols and provide reference baseline results with representative multimodal models. PortraitCraft provides a comprehensive benchmark for future research on fine-grained portrait understanding, interpretable aesthetic assessment, and controllable portrait generation.


[73] A Generative Foundation Model for Multimodal Histopathology cs.CV | cs.AI

Jinxi Xiang, Mingjie Li, Siyu Hou, Yijiang Chen, Xiangde Luo

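TL;DR: This paper presents MuPD (Multimodal Pathology Diffusion), a generative foundation model that embeds H&E-stained histology images, molecular RNA profiles, and clinical text into a shared latent space through a diffusion transformer with decoupled cross-modal attention. Pretrained on 100 million histology image patches, 1.6 million text-histology pairs, and 10.8 million RNA-histology pairs spanning 34 human organs, it supports diverse cross-modal synthesis tasks with minimal or no task-specific fine-tuning.

Details

Motivation: Diagnosing and treating complex diseases requires integrating histological, molecular, and clinical data, yet these modalities are often incomplete owing to tissue scarcity, assay cost, and workflow constraints. Existing methods rely on task-specific models trained on narrow, single source-target pairs, which limits their generalizability.

Result: For text-conditioned and image-to-image generation, MuPD synthesizes histologically faithful tissue architectures, reducing Fréchet inception distance (FID) by 50% relative to domain-specific models and improving few-shot classification accuracy by up to 47% through synthetic data augmentation. For RNA-conditioned histology generation, it reduces FID by 23% compared with the next-best method while preserving cell-type distributions across five cancer types. As a virtual stainer, it translates H&E images to immunohistochemistry and multiplex immunofluorescence, improving average marker correlation by 37% over existing approaches.

Insight: The contribution is a single, unified generative foundation model that, pretrained across heterogeneous pathology modalities, substantially outperforms specialized alternatives and provides a scalable computational framework for multimodal histopathology. The diffusion transformer with decoupled cross-modal attention and the large-scale multimodal pretraining appear to be the keys to its performance and generalization.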

Abstract: Accurate diagnosis and treatment of complex diseases require integrating histological, molecular, and clinical data, yet in practice these modalities are often incomplete owing to tissue scarcity, assay cost, and workflow constraints. Existing computational approaches attempt to impute missing modalities from available data but rely on task-specific models trained on narrow, single source-target pairs, limiting their generalizability. Here we introduce MuPD (Multimodal Pathology Diffusion), a generative foundation model that embeds hematoxylin and eosin (H&E)-stained histology, molecular RNA profiles, and clinical text into a shared latent space through a diffusion transformer with decoupled cross-modal attention. Pretrained on 100 million histology image patches, 1.6 million text-histology pairs, and 10.8 million RNA-histology pairs spanning 34 human organs, MuPD supports diverse cross-modal synthesis tasks with minimal or no task-specific fine-tuning. For text-conditioned and image-to-image generation, MuPD synthesizes histologically faithful tissue architectures, reducing Fréchet inception distance (FID) scores by 50% relative to domain-specific models and improving few-shot classification accuracy by up to 47% through synthetic data augmentation. For RNA-conditioned histology generation, MuPD reduces FID by 23% compared with the next-best method while preserving cell-type distributions across five cancer types. As a virtual stainer, MuPD translates H&E images to immunohistochemistry and multiplex immunofluorescence, improving average marker correlation by 37% over existing approaches. These results demonstrate that a single, unified generative model pretrained across heterogeneous pathology modalities can substantially outperform specialized alternatives, providing a scalable computational framework for multimodal histopathology.


[74] ComPrivDet: Efficient Privacy Object Detection in Compressed Domains Through Inference Reuse cs.CV | cs.CR

Yunhao Yao, Zhiqiang Wang, Ruiqi Li, Haoran Cheng, Puhan Luo

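TL;DR: This paper proposes ComPrivDet, an efficient method for detecting privacy objects (such as faces and license plates) in compressed video. Its core idea is to reuse I-frame detection results and use compressed-domain cues to decide whether new objects have appeared, so that detection on P- and B-frames is either skipped or handled by a lightweight detector, keeping accuracy high while sharply reducing computation and latency.

Details

Motivation: IoT video analytics operates at large scale, and frame-by-frame privacy protection (e.g., detecting and blurring faces) introduces significant latency. Existing object detectors either require fully decoded video or process every frame in the compressed domain, which incurs decoding overhead or reduces accuracy. An efficient and accurate method for detecting privacy objects in the compressed domain is therefore needed.

Result: ComPrivDet reaches 99.75% accuracy on private face detection and 96.83% on private license plate detection while skipping over 80% of inferences. Compared with existing compressed-domain detection methods, it averages 9.84% higher accuracy and 75.95% lower latency.

Insight: The main idea is to use the video compression structure (I/P/B frames) and compressed-domain cues (such as motion vectors) to steer inference: reusing I-frame results and selectively processing later frames balances accuracy and efficiency. The approach combines classical video coding knowledge with modern object detection in a systems-optimization spirit.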

Abstract: As the Internet of Things (IoT) becomes deeply embedded in daily life, users are increasingly concerned about privacy leakage, especially from video data. Since frame-by-frame protection in large-scale video analytics (e.g., smart communities) introduces significant latency, a more efficient solution is to selectively protect frames containing privacy objects (e.g., faces). Existing object detectors require fully decoded videos or per-frame processing in compressed videos, leading to decoding overhead or reduced accuracy. Therefore, we propose ComPrivDet, an efficient method for detecting privacy objects in compressed video by reusing I-frame inference results. By identifying the presence of new objects through compressed-domain cues, ComPrivDet either skips P- and B-frame detections or efficiently refines them with a lightweight detector. ComPrivDet maintains 99.75% accuracy in private face detection and 96.83% in private license plate detection while skipping over 80% of inferences. It averages 9.84% higher accuracy with 75.95% lower latency than existing compressed-domain detection methods.
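
The control flow described in the abstract (full inference on I-frames, then reuse or lightweight refinement on P- and B-frames) can be sketched as a short Python loop. This is a hedged illustration rather than the ComPrivDet implementation: the frame encoding, the boolean `new_object_cue` standing in for the compressed-domain cue, the detector callables, and the choice to cache the lightweight detector's result are all assumptions.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# (frame type "I" / "P" / "B", hypothetical flag: compressed-domain cue signals a new object)
Frame = Tuple[str, bool]


def detect_with_reuse(
    frames: Sequence[Frame],
    full_detector: Callable[[int], str],
    light_detector: Callable[[int], str],
) -> Tuple[List[Optional[str]], Dict[str, int]]:
    """Return per-frame detections plus a count of each action taken."""
    counts = {"full": 0, "light": 0, "reused": 0}
    outputs: List[Optional[str]] = []
    cached: Optional[str] = None
    for index, (frame_type, new_object_cue) in enumerate(frames):
        if frame_type == "I":
            cached = full_detector(index)   # full inference only on I-frames
            counts["full"] += 1
        elif cached is None or new_object_cue:
            cached = light_detector(index)  # refine when a new object may have appeared
            counts["light"] += 1
        else:
            counts["reused"] += 1           # skip inference, reuse the cached result
        outputs.append(cached)
    return outputs, counts


# Hypothetical six-frame group; the detectors are stubs that only tag their output.
stream = [("I", False), ("P", False), ("B", False), ("P", True), ("B", False), ("I", False)]
results, stats = detect_with_reuse(stream, lambda i: f"full@{i}", lambda i: f"light@{i}")
print(results)  # ['full@0', 'full@0', 'full@0', 'light@3', 'light@3', 'full@5']
print(stats)    # {'full': 2, 'light': 1, 'reused': 3}
```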


[75] Stabilizing Unsupervised Self-Evolution of MLLMs via Continuous Softened Retracing reSampling cs.CV | cs.AI

Yunyao Yu, Zhengxian Wu, Zhuohong Chen, Hangrui Xu, Zirui Liao

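TL;DR: This paper proposes Continuous Softened Retracing reSampling (CSRS) to stabilize the unsupervised self-evolution of Multimodal Large Language Models (MLLMs). The method explores long-tail reasoning paths through a retracing re-inference mechanism, calibrates the feedback signal with a softened frequency reward, and applies visual semantic perturbation so that the model attends to mathematical logic rather than superficial visual features. Experiments show that CSRS significantly improves the reasoning performance of Qwen2.5-VL-7B on benchmarks such as MathVision and achieves state-of-the-art results in unsupervised self-evolution on geometric tasks.

Details

Motivation: Existing unsupervised self-evolution methods for MLLMs rely mainly on majority voting to choose a pseudo-golden answer, which may reflect the model's intrinsic biases rather than an objectively correct reasoning path and can cause performance degradation. The paper aims to improve the stability of feedback signal quality so that learning is more stable and effective.

Result: On benchmarks such as MathVision, CSRS significantly improves the reasoning performance of Qwen2.5-VL-7B and reaches state-of-the-art results in unsupervised self-evolution on geometric tasks.

Insight: The contributions are a retracing re-inference mechanism that explores long-tail paths, a softened frequency reward that provides continuous rather than binary feedback, and visual semantic perturbation that emphasizes mathematical logic. Together they mitigate model bias and stabilize self-evolution.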

Abstract: In the unsupervised self-evolution of Multimodal Large Language Models, the quality of feedback signals during post-training is pivotal for stable and effective learning. However, existing self-evolution methods predominantly rely on majority voting to select the most frequent output as the pseudo-golden answer, which may stem from the model’s intrinsic biases rather than guaranteeing the objective correctness of the reasoning paths. To counteract the resulting degradation, we propose Continuous Softened Retracing reSampling (CSRS) in MLLM self-evolution. Specifically, we introduce a Retracing Re-inference Mechanism (RRM) in which the model re-infers from anchor points to expand the exploration of long-tail reasoning paths. Simultaneously, we propose Softened Frequency Reward (SFR), which replaces binary rewards with continuous signals, calibrating the reward based on the answers’ frequency across sampled reasoning sets. Furthermore, combined with Visual Semantic Perturbation (VSP), CSRS ensures the model prioritizes mathematical logic over visual superficiality. Experimental results demonstrate that CSRS significantly enhances the reasoning performance of Qwen2.5-VL-7B on benchmarks such as MathVision. We achieve state-of-the-art (SOTA) results in unsupervised self-evolution on geometric tasks. Our code is available at https://github.com/yyy195/CSRS.
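
The contrast between binary majority-vote rewards and a frequency-based continuous reward can be shown in a few lines of Python. This is a minimal sketch of the simplest possible reading, in which each sampled answer is rewarded with its relative frequency among the samples. The abstract does not give the actual calibration used by SFR, so the formula and the function names here are assumptions.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote_rewards(answers: Sequence[str]) -> List[float]:
    """Binary baseline: 1.0 for samples matching the most frequent answer, else 0.0."""
    if not answers:
        return []
    pseudo_gold, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == pseudo_gold else 0.0 for a in answers]


def softened_frequency_rewards(answers: Sequence[str]) -> List[float]:
    """Continuous alternative (assumed form): reward equals the answer's relative frequency."""
    if not answers:
        return []
    counts = Counter(answers)
    return [counts[a] / len(answers) for a in answers]


# Hypothetical final answers from four sampled reasoning paths.
sampled = ["A", "A", "B", "C"]
print(majority_vote_rewards(sampled))       # [1.0, 1.0, 0.0, 0.0]
print(softened_frequency_rewards(sampled))  # [0.5, 0.5, 0.25, 0.25]
```

With the binary rule, minority answers receive no signal at all even when the majority holds only half the samples; the continuous rule keeps some reward on them, which is one way to avoid locking in a biased pseudo-label.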


[76] Motion-Adaptive Multi-Scale Temporal Modelling with Skeleton-Constrained Spatial Graphs for Efficient 3D Human Pose Estimation cs.CV

Ruochen Li, Shuang Chen, Wenke E, Farshad Arvin, Amir Atapour-Abarghouei

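TL;DR: This paper proposes MASC-Pose, a framework for efficient 3D human pose estimation. It combines a motion-adaptive multi-scale temporal modelling module (AMTM) with a skeleton-constrained adaptive graph convolutional network (SAGCN) to adaptively capture motion dynamics at different temporal scales and to model joint-specific spatial interactions, achieving high accuracy together with high computational efficiency.

Details

Motivation: When modelling the complex spatio-temporal dependencies of 3D human pose in monocular video, existing methods often face challenges in efficiency and adaptability, particularly under dense attention or fixed modelling schemes.

Result: Extensive experiments on the Human3.6M and MPI-INF-3DHP datasets demonstrate the method's effectiveness, with strong accuracy and high computational efficiency.

Insight: The contribution is the joint use of adaptive multi-scale temporal modelling (AMTM) and a skeleton-constrained adaptive GCN (SAGCN), which adaptively captures heterogeneous motion dynamics and aggregates joint-level spatial information efficiently, making spatio-temporal dependency modelling both more flexible and more efficient.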

Abstract: Accurate 3D human pose estimation from monocular videos requires effective modelling of complex spatial and temporal dependencies. However, existing methods often face challenges in efficiency and adaptability when modelling spatial and temporal dependencies, particularly under dense attention or fixed modelling schemes. In this work, we propose MASC-Pose, a Motion-Adaptive multi-scale temporal modelling framework with Skeleton-Constrained spatial graphs for efficient 3D human pose estimation. Specifically, it introduces an Adaptive Multi-scale Temporal Modelling (AMTM) module to adaptively capture heterogeneous motion dynamics at different temporal scales, together with a Skeleton-constrained Adaptive GCN (SAGCN) for joint-specific spatial interaction modelling. By jointly enabling adaptive temporal reasoning and efficient spatial aggregation, our method achieves strong accuracy with high computational efficiency. Extensive experiments on Human3.6M and MPI-INF-3DHP datasets demonstrate the effectiveness of our approach.


[77] Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval cs.CV | cs.IR | cs.MM

Jun Li, Xuhang Lou, Jinpeng Wang, Yuting Wang, Yaowei Wang

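TL;DR: This paper proposes DreamPRVR to address incomplete global contextual perception, query ambiguity, and local noise in Partially Relevant Video Retrieval (PRVR). It adopts a coarse-to-fine representation learning paradigm: it first generates global contextual semantic registers spanning the whole video as coarse-grained highlights, then concentrates on fine-grained similarity optimization for precise cross-modal matching.

Details

Motivation: Because their global contextual perception is incomplete, existing PRVR methods struggle with query ambiguity and with local noise induced by spurious responses. A method that better captures a video's global semantics and matches precisely is needed.

Result: Extensive experiments on the PRVR task show that DreamPRVR outperforms state-of-the-art (SOTA) methods.

Insight: The contributions are: 1) a mechanism for generating global semantic registers that are initialized by a probabilistic variational sampler and iteratively refined by a text-supervised truncated diffusion model; 2) textual semantic structure learning that builds a well-formed textual latent space and makes global perception more reliable; 3) register-augmented Gaussian attention blocks that adaptively fuse the registers with video tokens for context-aware feature learning.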

Abstract: Partially Relevant Video Retrieval (PRVR) aims to retrieve untrimmed videos based on text queries that describe only partial events. Existing methods suffer from incomplete global contextual perception, struggling with query ambiguity and local noise induced by spurious responses. To address these issues, we propose DreamPRVR, which adopts a coarse-to-fine representation learning paradigm. The model first generates global contextual semantic registers as coarse-grained highlights spanning the entire video and then concentrates on fine-grained similarity optimization for precise cross-modal matching. Concretely, these registers are generated by initializing from the video-centric distribution produced by a probabilistic variational sampler and then iteratively refined via a text-supervised truncated diffusion model. During this process, textual semantic structure learning constructs a well-formed textual latent space, enhancing the reliability of global perception. The registers are then adaptively fused with video tokens through register-augmented Gaussian attention blocks, enabling context-aware feature learning. Extensive experiments show that DreamPRVR outperforms state-of-the-art methods. Code is released at https://github.com/lijun2005/CVPR26-DreamPRVR.


[78] Leveraging Gaze and Set-of-Mark in VLLMs for Human-Object Interaction Anticipation from Egocentric Videos cs.CV

Daniele Materia, Francesco Ragusa, Giovanni Maria Farinella

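TL;DR: This paper proposes a method that uses Vision Large Language Models (VLLMs) to anticipate human-object interactions from egocentric video. It strengthens visual grounding and intent understanding by combining Set-of-Mark prompting with the user's gaze trajectory, and it introduces a novel inverse exponential sampling strategy to capture the key temporal dynamics just before an interaction. Experiments on the HD-EPIC dataset show that the method surpasses state-of-the-art approaches and is model-agnostic.

Details

Motivation: The paper addresses key limitations of existing approaches to egocentric human-object interaction anticipation, in particular weak visual grounding and difficulty understanding user intent.

Result: Experiments on HD-EPIC show that the method surpasses state-of-the-art (SOTA) approaches on this task and demonstrate its model-agnostic nature.

Insight: The contribution is integrating Set-of-Mark prompting and the user's gaze trajectory into VLLMs to strengthen visual grounding and intent understanding, together with an inverse exponential sampling strategy that captures the key temporal information preceding an interaction more effectively. More broadly, it shows that combining multimodal cues (visual marks, gaze) with efficient temporal modelling is an effective way to improve VLLMs on embodied interaction tasks.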

Abstract: The ability to anticipate human-object interactions is highly desirable in an intelligent assistive system in order to guide users during daily life activities and understand their short- and long-term goals. Creating systems with such capabilities requires addressing several complex challenges. This work addresses the problem of human-object interaction anticipation in Egocentric Vision using Vision Large Language Models (VLLMs). We tackle key limitations in existing approaches by improving visual grounding capabilities through Set-of-Mark prompting and understanding user intent via the trajectory formed by the user’s most recent gaze fixations. To effectively capture the temporal dynamics immediately preceding the interaction, we further introduce a novel inverse exponential sampling strategy for input video frames. Experiments conducted on the egocentric dataset HD-EPIC demonstrate that our method surpasses state-of-the-art approaches for the considered task, showing its model-agnostic nature.
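
The abstract does not give the formula behind the inverse exponential sampling strategy, so the Python sketch below shows only one plausible reading: frames are sampled densely just before the interaction and exponentially more sparsely further back in time. The offset rule `base**i - 1`, the function name, and the parameters are assumptions made for illustration.

```python
from typing import List


def inverse_exponential_indices(num_frames: int, num_samples: int, base: float = 2.0) -> List[int]:
    """Pick frame indices whose distance from the last frame grows exponentially.

    Offsets from the final frame are base**i - 1 (0, 1, 3, 7, ... for base 2),
    clipped to the start of the clip; duplicates created by clipping are merged.
    """
    if num_frames < 1 or num_samples < 1 or base <= 1.0:
        raise ValueError("need num_frames >= 1, num_samples >= 1, base > 1")
    last = num_frames - 1
    picked = set()
    for i in range(num_samples):
        offset = round(base ** i) - 1
        picked.add(max(0, last - offset))
    return sorted(picked)


# A 32-frame observation window ending right before the interaction.
print(inverse_exponential_indices(32, 5))  # [16, 24, 28, 30, 31]
# A short clip: clipping merges the two oldest samples.
print(inverse_exponential_indices(5, 5))   # [0, 1, 3, 4]
```

Compared with uniform sampling, this spends most of the frame budget on the moments immediately preceding the interaction while still keeping a few frames of longer-range context.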


[79] DSERT-RoLL: Robust Multi-Modal Perception for Diverse Driving Conditions with Stereo Event-RGB-Thermal Cameras, 4D Radar, and Dual-LiDAR cs.CV

Hoonhee Cho, Jae-Young Kang, Yuhwan Jeong, Yunseo Yang, Wonyoung Lee

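TL;DR: DSERT-RoLL is a multimodal dataset for autonomous driving perception that integrates stereo event cameras, RGB cameras, thermal cameras, 4D radar, and dual LiDAR across diverse weather and illumination conditions. It provides precise 2D/3D bounding boxes, track IDs, and ego-vehicle odometry, enabling fair comparisons within and across sensor combinations. The authors establish unified 2D and 3D benchmarks, report baselines for single-modality and multimodal methods, and propose a fusion framework that improves 3D detection robustness under varied weather and lighting.

Details

Motivation: The dataset aims to alleviate data scarcity for novel sensors such as event cameras and 4D radar and to support systematic study of their behavior, promoting robust autonomous driving perception under diverse driving conditions.

Result: Unified 2D and 3D benchmarks are established on the dataset, with baseline results for representative single-modality and multimodal methods that give later work a basis for comparison.

Insight: The contribution is a multimodal dataset that brings together stereo event, RGB, and thermal cameras with 4D radar and dual LiDAR, along with unified benchmark protocols and a fusion framework, which together help explore how different sensor combinations and fusion strategies perform in complex environments.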

Abstract: In this paper, we present DSERT-RoLL, a driving dataset that incorporates stereo event, RGB, and thermal cameras together with 4D radar and dual LiDAR, collected across diverse weather and illumination conditions. The dataset provides precise 2D and 3D bounding boxes with track IDs and ego vehicle odometry, enabling fair comparisons within and across sensor combinations. It is designed to alleviate data scarcity for novel sensors such as event cameras and 4D radar and to support systematic studies of their behavior. We establish unified 3D and 2D benchmarks that enable direct comparison of characteristics and strengths across sensor families and within each family. We report baselines for representative single modality and multimodal methods and provide protocols that encourage research on different fusion strategies and sensor combinations. In addition, we propose a fusion framework that integrates sensor specific cues into a unified feature space and improves 3D detection robustness under varied weather and lighting.


[80] SciLT: Long-Tailed Classification in Scientific Image Domains cs.CV

Jiahao Chen, Bing Su

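TL;DR: This paper proposes SciLT, a framework for long-tailed classification in scientific image domains that exploits multi-level representations through adaptive feature fusion and dual-supervision learning to balance performance across head and tail classes.

Details

Motivation: Existing long-tailed recognition studies and benchmarks are mainly confined to natural images, whereas scientific images have distinct visual characteristics and supervision signals, and their distribution differs substantially from the pre-training data. This motivates studying how effective foundation models are for scientific long-tailed classification under a purely visual, parameter-efficient fine-tuning (PEFT) paradigm.

Result: On three scientific image benchmarks, SciLT consistently outperforms existing methods, establishing a strong and practical baseline for scientific long-tailed recognition with balanced performance across head and tail classes.

Insight: The key finding is that penultimate-layer features matter, particularly for tail classes. The adaptive feature fusion and dual-supervision design built on this finding uses multi-level representations to ease long-tailed recognition under domain shift and offers guidance for adapting foundation models to scientific data.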

Abstract: Long-tailed recognition has benefited from foundation models and fine-tuning paradigms, yet existing studies and benchmarks are mainly confined to natural image domains, where pre-training and fine-tuning data share similar distributions. In contrast, scientific images exhibit distinct visual characteristics and supervision signals, raising questions about the effectiveness of fine-tuning foundation models in such settings. In this work, we investigate scientific long-tailed recognition under a purely visual and parameter-efficient fine-tuning (PEFT) paradigm. Experiments on three scientific benchmarks show that fine-tuning foundation models yields limited gains, and reveal that penultimate-layer features play an important role, particularly for tail classes. Motivated by these findings, we propose SciLT, a framework that exploits multi-level representations through adaptive feature fusion and dual-supervision learning. By jointly leveraging penultimate- and final-layer features, SciLT achieves balanced performance across head and tail classes. Extensive experiments demonstrate that SciLT consistently outperforms existing methods, establishing a strong and practical baseline for scientific long-tailed recognition and providing valuable guidance for adapting foundation models to scientific data with substantial domain shifts.


[81] FunFact: Building Probabilistic Functional 3D Scene Graphs via Factor-Graph Reasoning cs.CVPDF

Zhengyu Fu, René Zurbrügg, Kaixian Qu, Marc Pollefeys, Marco Hutter

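TL;DR: FunFact is a framework for building probabilistic, open-vocabulary functional 3D scene graphs from RGB-D images. It first builds an object- and part-centric 3D map and uses foundation models to propose semantically plausible functional relation candidates. These candidates become factor-graph variables constrained by LLM-derived common-sense priors and geometric priors, and joint probabilistic inference over all functional edges and their marginals yields better-calibrated confidence scores.

Details

Motivation: Existing 3D scene understanding methods usually treat functional relationships between object pairs in isolation and miss the scene-wide interdependence that humans use to resolve ambiguity. A method for holistic probabilistic modeling is therefore needed to improve functional scene understanding.

Result: Experiments on SceneFun3D, FunGraph3D, and FunThor show that FunFact improves node and relation discovery recall and significantly reduces calibration error for ambiguous relations.

Insight: The novelty lies in modeling functional relations as a factor graph for joint probabilistic inference and integrating LLM common-sense priors with geometric priors, which enables holistic modeling of scene-level functional dependencies and better confidence calibration.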

Abstract: Recent work in 3D scene understanding is moving beyond purely spatial analysis toward functional scene understanding. However, existing methods often consider functional relationships between object pairs in isolation, failing to capture the scene-wide interdependence that humans use to resolve ambiguity. We introduce FunFact, a framework for constructing probabilistic open-vocabulary functional 3D scene graphs from posed RGB-D images. FunFact first builds an object- and part-centric 3D map and uses foundation models to propose semantically plausible functional relations. These candidates are converted into factor graph variables and constrained by both LLM-derived common-sense priors and geometric priors. This formulation enables joint probabilistic inference over all functional edges and their marginals, yielding substantially better calibrated confidence scores. To benchmark this setting, we introduce FunThor, a synthetic dataset based on AI2-THOR with part-level geometry and rule-based functional annotations. Experiments on SceneFun3D, FunGraph3D, and FunThor show that FunFact improves node and relation discovery recall and significantly reduces calibration error for ambiguous relations, highlighting the benefits of holistic probabilistic modeling for functional scene understanding. See our project page at https://funfact-scenegraph.github.io/
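The joint inference over functional edges described in the abstract can be illustrated with a toy factor graph. The sketch below is not FunFact's implementation: it assumes binary edge variables, hypothetical unary scores standing in for the LLM and geometric priors, and hypothetical pairwise scores coupling edges, and it computes exact marginals by brute-force enumeration, which is only feasible for a handful of candidates.

```python
from itertools import product
import math


def edge_marginals(unary, pairwise):
    """Exact marginals P(edge_i = 1) for a tiny binary factor graph.

    unary[i]         -- hypothetical prior score for edge i being present
    pairwise[(i, j)] -- hypothetical score added when edges i and j are both
                        present (negative values make them compete)
    """
    n = len(unary)
    weights = {}
    for assignment in product((0, 1), repeat=n):
        score = sum(u * x for u, x in zip(unary, assignment))
        score += sum(w * assignment[i] * assignment[j]
                     for (i, j), w in pairwise.items())
        weights[assignment] = math.exp(score)  # unnormalised probability
    z = sum(weights.values())  # partition function
    return [sum(w for a, w in weights.items() if a[i] == 1) / z
            for i in range(n)]


# Two candidate relations that each look plausible alone but conflict jointly.
print(edge_marginals([1.0, 1.0], {(0, 1): -5.0}))
```

With the coupling term, each edge's marginal drops below what its unary score alone would give, which is the kind of scene-wide effect that pairwise-in-isolation scoring cannot express.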


[82] SGTA: Scene-Graph Based Multi-Modal Traffic Agent for Video Understanding cs.CVPDF

Xingcheng Zhou, Mingyu Liu, Walter Zimmer, Jiajie Zhang, Alois Knoll

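TL;DR: SGTA is a scene-graph-based multi-modal traffic agent framework for traffic video understanding. It builds a traffic scene graph from roadside video through detection, tracking, and lane extraction, then reasons with tools over symbolic graph queries and visual inputs. SGTA adopts ReAct to interleave large language model reasoning traces with tool calls, enabling interpretable decisions on complex video questions.

Details

Motivation: To address the challenge of interpretable decision-making on complex questions in traffic video understanding by combining structured scene graphs with multi-modal reasoning.

Result: Experiments on a sample of the TUMTraffic VideoQA dataset show that SGTA achieves competitive accuracy across multiple question types while providing transparent reasoning steps.

Insight: The novelty lies in integrating a structured scene-graph representation with a multi-modal agent and using ReAct for interpretable reasoning, giving a modular and transparent framework for traffic video understanding.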

Abstract: We present Scene-Graph Based Multi-Modal Traffic Agent (SGTA), a modular framework for traffic video understanding that combines structured scene graphs with multi-modal reasoning. It constructs a traffic scene graph from roadside videos using detection, tracking, and lane extraction, followed by tool-based reasoning over both symbolic graph queries and visual inputs. SGTA adopts ReAct to process interleaved reasoning traces from large language models with tool invocations, enabling interpretable decision-making for complex video questions. Experiments on a selected sample of the TUMTraffic VideoQA dataset demonstrate that SGTA achieves competitive accuracy across multiple question types while providing transparent reasoning steps. These results highlight the potential of integrating structured scene representations with multi-modal agents for traffic video understanding.


[83] VidNum-1.4K: A Comprehensive Benchmark for Video-based Numerical Reasoning cs.CVPDF

Shaoyang Cui, Lingbei Meng

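TL;DR: This paper presents VidNum-1.4K, a comprehensive benchmark of 1,379 human-annotated video-question pairs for evaluating numerical reasoning in video. It covers diverse scenarios and uses a three-level hierarchy, from perception to compositional reasoning, to test how deeply vision-language models understand real-world dynamics.

Details

Motivation: Existing video numerical reasoning benchmarks are often confined to narrow domains or treat counting as a simple regression task, so they cannot assess multi-step numerical logic in real, complex multimedia content. A new comprehensive benchmark is needed to fill this gap.

Result: Evaluation of several state-of-the-art vision-language models reveals a marked reasoning gap: Gemini-3.1-pro barely reaches 60% accuracy, while representative open-source model families struggle in the 25% to 45% range, indicating that current models still lack a stable "internal world model".

Insight: The contribution is a hierarchical, diverse benchmark for video numerical reasoning that extends the task from direct perception to video-based compositional numerical reasoning, providing a rigorous testbed for diagnosing a model's grasp of time, object permanence, and compositional logic.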

Abstract: Video-based numerical reasoning provides a premier arena for testing whether Vision-Language Models (VLMs) truly “understand” real-world dynamics, as accurate numerical deduction necessitates a profound grasp of temporal events, object permanence, and compositional logic beyond superficial pattern matching. However, existing benchmarks are often confined to narrow domains, such as repetitive athletic motions, or treat simple counting merely as a superficial regression task, failing to assess multi-step numerical logic within the inherent complexity of real-world multimedia content. We introduce VidNum-1.4K, a comprehensive VideoQA benchmark comprising 1,379 strictly human-annotated video-question pairs designed to evaluate genuine numerical reasoning across highly diverse environments, encompassing object, action, and event quantification. VidNum-1.4K is structured into a three-level hierarchy that evolves from direct visual perception to video-based compositional numerical reasoning, requiring models to perform arithmetic operations, comparisons, and logical deductions grounded in temporal evidence. Our evaluations across a diverse suite of state-of-the-art VLMs reveal a striking reasoning gap: while Gemini-3.1-pro barely reaches a 60% accuracy threshold, representative open-source families struggle heavily in the 25%–45% range. These findings demonstrate that current VLMs still lack a stable “internal world model”, positioning VidNum-1.4K as a demanding diagnostic testbed for the next generation of numerical video intelligence.


[84] XSeg: A Large-scale X-ray Contraband Segmentation Benchmark For Real-World Security Screening cs.CVPDF

Hongxia Gao, Litao Li, Yixin Chen, Jiali Wen, Kaijie Zhang

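TL;DR: This paper introduces XSeg, a large-scale benchmark dataset for X-ray contraband segmentation with 98,644 images and 295,932 instance masks covering 30 common contraband categories. To make annotation efficient, it proposes Adaptive Point SAM (APSAM), built on the Segment Anything Model (SAM), which uses an Energy-Aware Encoder and an Adaptive Point Generator to improve cross-domain generalization and stacked-object detection. Its superior performance is validated on XSeg.

Details

Motivation: Current X-ray contraband detection methods rely mainly on bounding-box annotations and lack pixel-level supervision and real-world data, which limits model generalization and performance.

Result: Extensive experiments on XSeg show that APSAM delivers superior performance and significantly improves sensitivity to overlapping items.

Insight: The contributions are the large-scale XSeg dataset and the APSAM model, which uses an Energy-Aware Encoder to improve initialization of the mask decoder and an Adaptive Point Generator to produce precise mask labels from a single coarse point prompt, addressing SAM's poor cross-domain generalization and limited ability to detect stacked objects.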

Abstract: X-ray contraband detection is critical for public safety. However, current methods primarily rely on bounding box annotations, which limit model generalization and performance due to the lack of pixel-level supervision and real-world data. To address these limitations, we introduce XSeg. To the best of our knowledge, XSeg is the largest X-ray contraband segmentation dataset to date, including 98,644 images and 295,932 instance masks, and contains the latest 30 common contraband categories. The images are sourced from public datasets and our synthesized data, filtered through a custom data cleaning pipeline to remove low-quality samples. To enable accurate and efficient annotation and reduce manual labeling effort, we propose Adaptive Point SAM (APSAM), a specialized mask annotation model built upon the Segment Anything Model (SAM). We address SAM’s poor cross-domain generalization and limited capability in detecting stacked objects by introducing an Energy-Aware Encoder that enhances the initialization of the mask decoder, significantly improving sensitivity to overlapping items. Additionally, we design an Adaptive Point Generator that allows users to obtain precise mask labels with only a single coarse point prompt. Extensive experiments on XSeg demonstrate the superior performance of APSAM.


[85] SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation cs.CVPDF

Guiyu Zhang, Yabo Chen, Xunzhi Xiang, Junchao Huang, Zhongyu Wang

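TL;DR: SymphoMotion is a unified motion-control framework for video generation that jointly controls camera trajectories and object dynamics within a single model. Its camera trajectory control and object dynamics control mechanisms combine 2D visual guidance with 3D trajectory embeddings to produce depth-aware, spatially consistent motion. To support large-scale training and evaluation, the authors build RealCOD-25K, a dataset of paired camera poses and object-level 3D trajectories. Experiments show that SymphoMotion significantly outperforms existing methods in visual fidelity, camera controllability, and object-motion accuracy.

Details

Motivation: Current video generation methods typically control only one motion type (camera motion or object dynamics) or rely on ambiguous 2D cues, which entangles camera-induced parallax with true object motion and reduces coherence and expressiveness. A unified framework that controls both jointly is therefore needed.

Result: Extensive experiments and user studies show that SymphoMotion significantly outperforms existing methods in visual fidelity, camera controllability, and object-motion accuracy, setting a new benchmark for unified motion control in video generation.

Insight: The contribution is a unified framework that integrates camera trajectory control (explicit camera paths plus geometry-aware cues) with object dynamics control (2D visual guidance plus 3D trajectory embeddings) in a single model, resolving motion entanglement. The RealCOD-25K dataset also fills a key data gap in unified motion control and supports large-scale training and evaluation.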

Abstract: Controlling both camera motion and object dynamics is essential for coherent and expressive video generation, yet current methods typically handle only one motion type or rely on ambiguous 2D cues that entangle camera-induced parallax with true object movement. We present SymphoMotion, a unified motion-control framework that jointly governs camera trajectories and object dynamics within a single model. SymphoMotion features a Camera Trajectory Control mechanism that integrates explicit camera paths with geometry-aware cues to ensure stable, structurally consistent viewpoint transitions, and an Object Dynamics Control mechanism that combines 2D visual guidance with 3D trajectory embeddings to enable depth-aware, spatially coherent object manipulation. To support large-scale training and evaluation, we further construct RealCOD-25K, a comprehensive real-world dataset containing paired camera poses and object-level 3D trajectories across diverse indoor and outdoor scenes, addressing a key data gap in unified motion control. Extensive experiments and user studies show that SymphoMotion significantly outperforms existing methods in visual fidelity, camera controllability, and object-motion accuracy, establishing a new benchmark for unified motion control in video generation. Code and data are publicly available at https://grenoble-zhang.github.io/SymphoMotion/.


[86] Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation cs.CVPDF

Binyuan Huang, Yuning Lu, Weinan Jia, Hualiang Wang, Mu Liu

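TL;DR: This paper proposes PoCo (Position Embedding as a Context Controller), which adds position encoding as an extra context control beyond semantic retrieval to resolve the reference confusion that arises in multi-reference, multi-shot video generation when reference images look highly similar. By using side information of tokens, it achieves precise token-level matching while preserving implicit semantic consistency modeling, so characters with extremely similar visual traits can be controlled reliably.

Details

Motivation: To address reference confusion in multi-reference and multi-shot video generation when reference images have highly similar appearances, where semantically similar tokens degrade the model's ability to retrieve the correct context.

Result: Extensive experiments show that PoCo improves both cross-shot consistency and reference fidelity compared with various baselines.

Insight: Treating position encoding as an additional context control mechanism goes beyond conventional semantic retrieval: token side information enables precise matching while semantic consistency is preserved, offering a new approach to reference control for highly similar visual features.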

Abstract: Recent proprietary models such as Sora2 demonstrate promising progress in generating multi-shot videos conditioned on multiple reference characters. However, academic research on this problem remains limited. We study this task and identify a core challenge: when reference images exhibit highly similar appearances, the model often suffers from reference confusion, where semantically similar tokens degrade the model’s ability to retrieve the correct context. To address this, we introduce PoCo (Position Embedding as a Context Controller), which incorporates position encoding as additional context control beyond semantic retrieval. By employing side information of tokens, PoCo enables precise token-level matching while preserving implicit semantic consistency modeling. Building on PoCo, we develop a multi-reference and multi-shot video generation model capable of reliably controlling characters with extremely similar visual traits. Extensive experiments demonstrate that PoCo improves cross-shot consistency and reference fidelity compared with various baselines.


[87] Shower-Aware Dual-Stream Voxel Networks for Structural Defect Detection in Cosmic-Ray Muon Tomography cs.CV | physics.comp-phPDF

Parthiv Dasgupta, Sambhav Agarwal, Palash Dutta, Raja Karmakar, Sudeshna Goswami

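TL;DR: This paper proposes SA-DSVN, a 3D convolutional architecture for voxel-level segmentation of structural defects in reinforced concrete using cosmic-ray muon tomography. It jointly processes scattering kinematics (9 channels) and secondary electromagnetic shower multiplicities (40 channels) through independent encoder streams fused via cross-attention. Trained on a simulated dataset of 900 volumes containing four defect types, the model achieves high accuracy and detection sensitivity on the validation set.

Details

Motivation: Conventional muon tomography reconstruction methods (such as POCA and MLSD) rely solely on muon scattering angles and ignore secondary electromagnetic shower multiplicity, which is an effective feature. The goal is to improve the accuracy and robustness of structural defect detection.

Result: On 60 independently simulated validation volumes, the model achieves 96.3% voxel accuracy, per-defect Dice scores of 0.59-0.81, 100% volume-level detection sensitivity, and 10 ms inference per volume. The ablation study shows that the shower multiplicity stream alone provides most of the discriminative power, raising defect-mean Dice from 0.535 (scattering only) to 0.685 (shower only).

Insight: The novelty lies in introducing secondary electromagnetic shower multiplicity as a key feature for learned muon tomographic reconstruction and fusing the two modalities through a dual-stream cross-attention architecture. More broadly, it shows that secondary information traditionally discarded in particle detection can substantially improve model performance, which is relevant to similar physics-sensing applications.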

Abstract: We present SA-DSVN, a 3D convolutional architecture for voxel-level segmentation of structural defects in reinforced concrete using cosmic-ray muon tomography. Unlike conventional reconstruction methods (POCA, MLSD) that rely solely on muon scattering angles, our approach jointly processes scattering kinematics (9 channels) and secondary electromagnetic shower multiplicities (40 channels) through independent encoder streams fused via cross-attention. Training data were generated using Vega, a cloud-native Geant4 simulation framework, producing 4.5 million muon events across 900 volumes containing four defect types - honeycombing, shear fracture, corrosion voids, and delamination - embedded within a dense 7x7 rebar cage. A five-variant ablation study demonstrates that the shower multiplicity stream alone accounts for the majority of discriminative power, raising defect-mean Dice from 0.535 (scattering only) to 0.685 (shower only). On 60 independently simulated validation volumes, the model achieves 96.3% voxel accuracy, per-defect Dice scores of 0.59-0.81, and 100% volume-level detection sensitivity at 10 ms inference per volume. These results establish secondary shower multiplicity as a previously unexploited but highly effective feature for learned muon tomographic reconstruction.
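The per-defect Dice scores and voxel accuracy quoted above are standard segmentation metrics. As a reminder of what they measure, here is a minimal sketch over flattened voxel label lists; the label values and toy volume are illustrative assumptions, not data from the paper.

```python
def dice_score(pred, target, label):
    """Dice = 2|P ∩ T| / (|P| + |T|) for one class over flat voxel label lists."""
    p = [v == label for v in pred]
    t = [v == label for v in target]
    intersection = sum(a and b for a, b in zip(p, t))
    denom = sum(p) + sum(t)
    # Convention chosen here: a class absent from both volumes scores 1.0.
    return 1.0 if denom == 0 else 2.0 * intersection / denom


def voxel_accuracy(pred, target):
    """Fraction of voxels whose predicted label matches the target label."""
    return sum(a == b for a, b in zip(pred, target)) / len(target)


# Toy volume: 0 = background, 1 and 2 = two hypothetical defect classes.
pred = [0, 1, 1, 2]
target = [0, 1, 2, 2]
print(dice_score(pred, target, 1), voxel_accuracy(pred, target))
```

Because background voxels usually dominate, voxel accuracy can be high even when a small defect is poorly segmented, which is why per-defect Dice is reported alongside it.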


[88] ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs cs.CVPDF

Zitong Xu, Huiyu Duan, Shengyao Qin, Guangyu Yao, Guangji Ma

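TL;DR: This paper presents ICBench, a new image captioning benchmark containing 40K short and long captions generated by 10 advanced multimodal large language models on 2K images, with mean opinion scores across fine-grained dimensions obtained through human subjective evaluation. The authors also propose ITIScore, an automatic metric based on image-to-text-to-image reconstruction consistency. Experiments show that the metric aligns closely with human judgments and generalizes well zero-shot to other public datasets.

Details

Motivation: Existing image captioning benchmarks suffer from limited diversity in caption length, the absence of recent advanced MLLMs, and insufficient human annotation, which biases evaluation and prevents a comprehensive assessment of modern MLLMs. A more comprehensive benchmark and an automated evaluation method are therefore needed.

Result: On ICBench, ITIScore correlates strongly with human subjective evaluation (mean opinion scores) and shows robust zero-shot generalization on other public image captioning datasets (e.g., COCO Captions, NoCaps).

Insight: The contributions are ICBench, a large-scale benchmark spanning multiple content categories with short and long captions from many advanced MLLMs, and ITIScore, an automated evaluation framework based on image-to-text-to-image reconstruction consistency. By measuring caption quality through how consistently the image can be reconstructed, it replaces costly human evaluation with an efficient, scalable way to assess the captioning ability of MLLMs.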

Abstract: Recent advances in multimodal large language models (MLLMs) have greatly improved image understanding and captioning capabilities. However, existing image captioning benchmarks typically suffer from limited diversity in caption length, the absence of recent advanced MLLMs, and insufficient human annotations, which potentially introduces bias and limits the ability to comprehensively assess the performance of modern MLLMs. To address these limitations, we present a new large-scale image captioning benchmark, termed ICBench, which covers 12 content categories and consists of both short and long captions generated by 10 advanced MLLMs on 2K images, resulting in 40K captions in total. We conduct extensive human subjective studies to obtain mean opinion scores (MOSs) across fine-grained evaluation dimensions, where short captions are assessed in terms of fluency, relevance, and conciseness, while long captions are evaluated based on fluency, relevance, and completeness. Furthermore, we propose an automated evaluation metric, ITIScore, based on an image-to-text-to-image framework, which measures caption quality through reconstruction consistency. Experimental results demonstrate strong alignment between our automatic metric and human judgments, as well as robust zero-shot generalization ability on other public captioning datasets. Both the dataset and model will be released upon publication.
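The image-to-text-to-image idea can be outlined as a pipeline: caption the image, regenerate an image from the caption, and compare the two images. The sketch below only illustrates that control flow. The abstract does not say how consistency is measured, so cosine similarity between embeddings is an assumption, and `caption_fn`, `generate_fn`, and `embed_fn` are hypothetical callables standing in for the captioning MLLM, a text-to-image model, and an image encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def reconstruction_consistency(image, caption_fn, generate_fn, embed_fn):
    """Score a captioner by how well its caption lets the image be rebuilt.

    All three callables are hypothetical placeholders supplied by the caller.
    """
    caption = caption_fn(image)            # image -> text
    reconstructed = generate_fn(caption)   # text -> image
    return cosine_similarity(embed_fn(image), embed_fn(reconstructed))
```

The intuition is that a caption which omits or distorts content leads to a reconstruction that drifts from the original, lowering the score without requiring a human rater.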


[89] M2StyleGS: Multi-Modality 3D Style Transfer with Gaussian Splatting cs.CVPDF

Xingyu Miao, Xueqi Qiu, Haoran Duan, Yawen Huang, Xian Wu

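TL;DR: M2StyleGS is a real-time 3D style transfer method built on a 3D Gaussian Splatting (3DGS) representation and CLIP multi-modal knowledge. It accepts text or images as style references and uses precise feature alignment (subdivisive flow), an observation loss, and a suppression loss to generate a set of accurately color-mapped, style-enhanced novel views.

Details

Motivation: To overcome the reliance of conventional 3D style transfer on a fixed reference image and to meet user demand in applications such as virtual and augmented reality for more flexible inputs, including textual descriptions and diverse imagery.

Result: Experiments show that M2StyleGS achieves better visual quality and surpasses previous work by up to 32.92% on the consistency metric.

Insight: The contributions include using CLIP's multi-modal knowledge as the style reference, introducing subdivisive flow for precise feature alignment to resolve the abnormal transformation issue, and designing observation and suppression losses that guide generation and keep color information consistent.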

Abstract: Conventional 3D style transfer methods rely on a fixed reference image to apply artistic patterns to 3D scenes. However, in practical applications such as virtual or augmented reality, users often prefer more flexible inputs, including textual descriptions and diverse imagery. In this work, we introduce a novel real-time styling technique, M2StyleGS, to generate a sequence of precisely color-mapped views. It utilizes 3D Gaussian Splatting (3DGS) as the 3D representation and multi-modality knowledge refined by CLIP as the reference style. M2StyleGS resolves the abnormal transformation issue by employing a precise feature alignment, namely subdivisive flow, which strengthens the projection of the mapped CLIP text-visual combination feature to the VGG style feature. In addition, we introduce an observation loss, which helps the stylized scene better match the reference style during generation, and a suppression loss, which suppresses the offset of reference color information throughout the decoding process. By integrating these approaches, M2StyleGS can employ text or images as references to generate a set of style-enhanced novel views. Our experiments show that M2StyleGS achieves better visual quality and surpasses the previous work by up to 32.92% in terms of consistency.


[90] When Does Multimodal AI Help? Diagnostic Complementarity of Vision-Language Models and CNNs for Spectrum Management in Satellite-Terrestrial Networks cs.CV | cs.AIPDF

Yuanhang Li

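TL;DR: This paper presents the first systematic comparison of vision-language models (VLMs) and convolutional neural networks (CNNs) on spectrum heatmap understanding in non-terrestrial network and terrestrial network (NTN-TN) cooperative systems. It introduces SpectrumQA, a benchmark of 108K visual question-answer pairs across four granularity levels. The experiments reveal a clear task-dependent complementarity: CNNs perform better on supervised tasks such as spatial localization, while VLMs, given only a few in-context examples, provide semantic reasoning that CNNs lack. A deterministic task-routing strategy that combines the two improves the composite score by 39.1%.

Details

Motivation: VLM adoption in wireless network management is accelerating, but there is no systematic understanding of which spectrum-related tasks these large foundation models handle better than lightweight CNNs. The paper diagnoses the complementarity of VLMs and CNNs for spectrum heatmap understanding in NTN-TN cooperative systems to guide practical deployment.

Result: In experiments on three NTN-TN scenarios comparing a frozen Qwen2-VL-7B VLM with a trained ResNet-18 CNN, the CNN reaches 72.9% accuracy on severity classification (L1) and 0.552 IoU on spatial localization (L3). The VLM, using only three in-context examples, provides semantic reasoning (L4) that the CNN lacks entirely, with an F1 score of 0.576. Chain-of-thought prompting further improves VLM reasoning by 12.6%. A deterministic task router combining the two achieves a composite score of 0.616, a 39.1% improvement over the CNN alone. VLM representations are also more robust across scenarios, with smaller performance degradation in 5 of 6 transfer directions.

Insight: The core contribution is the first systematic diagnosis of VLM and CNN complementarity in spectrum management tasks, together with a task-type routing strategy for deployment: delegate supervised tasks (such as spatial localization) to the CNN and reasoning tasks (such as semantic reasoning) to the VLM, rather than treating them as substitutes. This gives actionable guidance for efficient, robust model selection and deployment in real network systems. Through the fine-grained SpectrumQA benchmark and analyses such as chain-of-thought prompting, the study also shows that the complementarity stems from architectural differences rather than prompting limitations, a finding relevant to multimodal AI in other specialized domains.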

Abstract: The adoption of vision-language models (VLMs) for wireless network management is accelerating, yet no systematic understanding exists of where these large foundation models outperform lightweight convolutional neural networks (CNNs) for spectrum-related tasks. This paper presents the first diagnostic comparison of VLMs and CNNs for spectrum heatmap understanding in non-terrestrial network and terrestrial network (NTN-TN) cooperative systems. We introduce SpectrumQA, a benchmark comprising 108K visual question-answer pairs across four granularity levels: scene classification (L1), regional reasoning (L2), spatial localization (L3), and semantic reasoning (L4). Our experiments on three NTN-TN scenarios with a frozen Qwen2-VL-7B and a trained ResNet-18 reveal a clear task-dependent complementarity: CNN achieves 72.9% accuracy at severity classification (L1) and 0.552 IoU at spatial localization (L3), while VLM uniquely enables semantic reasoning (L4) with F1=0.576 using only three in-context examples, a capability fundamentally absent in CNN architectures. Chain-of-thought (CoT) prompting further improves VLM reasoning by 12.6% (F1: 0.209->0.233) while having zero effect on spatial tasks, confirming that the complementarity is rooted in architectural differences rather than prompting limitations. A deterministic task-type router that delegates supervised tasks to CNN and reasoning tasks to VLM achieves a composite score of 0.616, a 39.1% improvement over CNN alone. We further show that VLM representations exhibit stronger cross-scenario robustness, with smaller performance degradation in 5 out of 6 transfer directions. These findings provide actionable guidelines: deploy CNNs for spatial localization and VLMs for semantic spectrum reasoning, rather than treating them as substitutes.
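A deterministic task-type router of the kind described can be very small. The sketch below is an assumption-laden illustration rather than the paper's code: `cnn_model` and `vlm_model` are hypothetical callables, and the level-to-type table covers only the levels whose assignment the abstract states (L1 and L3 to the CNN, L4 to the VLM). L2 is deliberately left unmapped because the abstract does not say where it goes.

```python
def make_router(cnn_model, vlm_model, task_types):
    """Return a function that dispatches a sample by its task level.

    task_types maps a level name to "supervised" or "reasoning";
    the mapping itself is supplied by the caller (hypothetical here).
    """
    def route(level, sample):
        kind = task_types.get(level)
        if kind == "supervised":
            return cnn_model(sample)   # e.g. classification, localization
        if kind == "reasoning":
            return vlm_model(sample)   # e.g. semantic reasoning
        raise ValueError(f"no routing rule for task level {level!r}")
    return route


# Stand-in models that just tag their input, for illustration only.
router = make_router(
    cnn_model=lambda s: ("cnn", s),
    vlm_model=lambda s: ("vlm", s),
    task_types={"L1": "supervised", "L3": "supervised", "L4": "reasoning"},
)
print(router("L3", "heatmap_001"), router("L4", "heatmap_001"))
```

Because the rule depends only on the task type, no learned gating is involved, and the routing decision is fully reproducible.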


[91] HistoFusionNet: Histogram-Guided Fusion and Frequency-Adaptive Refinement for Nighttime Image Dehazing cs.CVPDF

Mohammad Heydari, Wei Dong, Shahram Shirani, Jun Chen, Han Zhou

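TL;DR: This paper proposes HistoFusionNet, a transformer-enhanced architecture for nighttime image dehazing. By combining histogram-guided representation learning with frequency-adaptive feature refinement, it handles the complex degradations of nighttime scenes, including haze, glow, non-uniform illumination, color distortion, and sensor noise.

Details

Motivation: Nighttime image dehazing is a challenging low-level vision problem because multiple degradation factors invalidate the assumptions commonly used in daytime dehazing. The paper aims to address these complex nighttime degradations.

Result: On the NTIRE 2026 Nighttime Image Dehazing Challenge benchmark, the method achieves highly competitive performance and ranks 1st among 22 participating teams, demonstrating its effectiveness and robustness.

Insight: The novelty lies in histogram transformer blocks that model long-range dependencies by grouping features according to their dynamic-range characteristics, and a frequency-aware refinement branch that adaptively exploits complementary low- and high-frequency cues, which together handle the heterogeneous degradations of nighttime scenes.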

Abstract: Nighttime image dehazing remains a challenging low-level vision problem due to the joint presence of haze, glow, non-uniform illumination, color distortion, and sensor noise, which often invalidate assumptions commonly used in daytime dehazing. To address these challenges, we propose HistoFusionNet, a transformer-enhanced architecture tailored for nighttime image dehazing by combining histogram-guided representation learning with frequency-adaptive feature refinement. Built upon a multi-scale encoder-decoder backbone, our method introduces histogram transformer blocks that model long-range dependencies by grouping features according to their dynamic-range characteristics, enabling more effective aggregation of similarly degraded regions under complex nighttime lighting. To further improve restoration fidelity, we incorporate a frequency-aware refinement branch that adaptively exploits complementary low- and high-frequency cues, helping recover scene structures, suppress artifacts, and enhance local details. This design yields a unified framework that is particularly well suited to the heterogeneous degradations encountered in real nighttime hazy scenes. Extensive experiments and highly competitive performance of our method on the NTIRE 2026 Nighttime Image Dehazing Challenge benchmark demonstrate the effectiveness of the proposed method. Our team ranked 1st among 22 participating teams, highlighting the robustness and competitive performance of HistoFusionNet. The code is available at: https://github.com/heydarimo/Night-Time-Dehazing


[92] Rényi Attention Entropy for Patch Pruning cs.CV | cs.LGPDF

Hiroaki Aizawa, Yuki Igaue

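TL;DR: This paper proposes an attention-entropy criterion based on Rényi entropy for patch pruning in vision Transformers. Patch importance is assessed from the entropy of the attention distribution: low-entropy patches (concentrated attention) are kept, while high-entropy patches (diffuse attention) are treated as redundant and pruned. Experiments on fine-grained image recognition show that the method reduces computation while preserving accuracy.

Details

Motivation: To address the quadratic growth of self-attention cost with the number of tokens in Transformers by pruning redundant patches to reduce computation.

Result: On fine-grained image recognition, the method reduces computation while preserving accuracy, and tuning the Rényi entropy parameter further improves the trade-off between accuracy and computation.

Insight: The novelty lies in using the entropy of the attention distribution (extended from Shannon to Rényi entropy) as the measure of patch importance. Rényi entropy emphasizes sharp attention peaks and supports pruning strategies that adapt to task needs and computational limits, offering an interpretable and flexible approach to model compression.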

Abstract: Transformers are strong baselines in both vision and language because self-attention captures long-range dependencies across tokens. However, the cost of self-attention grows quadratically with the number of tokens. Patch pruning mitigates this cost by estimating per-patch importance and removing redundant patches. To identify informative patches for pruning, we introduce a criterion based on the Shannon entropy of the attention distribution. Low-entropy patches, which receive selective and concentrated attention, are kept as important, while high-entropy patches with attention spread across many locations are treated as redundant. We also extend the criterion from Shannon to Rényi entropy, which emphasizes sharp attention peaks and supports pruning strategies that adapt to task needs and computational limits. In experiments on fine-grained image recognition, where patch selection is critical, our method reduced computation while preserving accuracy. Moreover, adjusting the pruning policy through the Rényi entropy measure yields further gains and improves the trade-off between accuracy and computation.
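For reference, the Rényi entropy of order α of a distribution p is H_α(p) = log(Σ p_i^α) / (1 − α), which recovers Shannon entropy as α → 1. The sketch below shows an entropy-ranked pruning rule under stated assumptions: each patch is represented by one attention distribution (a list summing to 1), and the `keep` budget and α = 2 default are illustrative choices, not values from the paper.

```python
import math


def renyi_entropy(p, alpha):
    """Rényi entropy of order alpha; falls back to Shannon entropy at alpha = 1."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if abs(alpha - 1.0) < 1e-9:
        return -sum(x * math.log(x) for x in p if x > 0)
    return math.log(sum(x ** alpha for x in p if x > 0)) / (1.0 - alpha)


def keep_low_entropy(attn_rows, keep, alpha=2.0):
    """Return indices of the `keep` patches with the lowest attention entropy."""
    scores = [renyi_entropy(row, alpha) for row in attn_rows]
    order = sorted(range(len(attn_rows)), key=lambda i: scores[i])
    return sorted(order[:keep])  # restore original patch order


attn = [
    [0.25, 0.25, 0.25, 0.25],   # diffuse attention: high entropy
    [0.97, 0.01, 0.01, 0.01],   # sharply peaked: low entropy
    [0.40, 0.30, 0.20, 0.10],   # in between
]
print(keep_low_entropy(attn, keep=2))
```

Larger α weights the largest probabilities more heavily, so the score becomes increasingly sensitive to a single sharp peak, which matches the abstract's remark that Rényi entropy emphasizes sharp attention peaks.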


[93] InCaRPose: In-Cabin Relative Camera Pose Estimation Model and Dataset cs.CV | cs.AIPDF

Felix Stillger, Lukas Hahn, Frederik Hasecke, Tobias Meisen

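TL;DR: This paper presents InCaRPose, a Transformer-based architecture for robust relative pose estimation between image pairs in constrained, highly distorted environments such as in-cabin monitoring, with application to camera extrinsic calibration. It uses frozen backbone features such as DINOv3 with a Transformer decoder to capture the geometric relationship between a reference view and a target view. Unlike traditional methods that need exact camera intrinsics, it estimates absolute metric-scale translation in a single inference step, and although trained only on synthetic data it generalizes to real-world cabin environments.

Details

Motivation: To address the challenge of precise relative pose estimation in constrained, highly distorted environments such as in-cabin monitoring, where traditional methods struggle and safety-relevant perception requires accurate real-world distances.

Result: The model achieves competitive performance on the public 7-Scenes dataset and maintains high precision in rotation and translation on the authors' released test set of highly distorted in-cabin images, with real-time inference even with a ViT-Small backbone.

Insight: The novelty lies in a Transformer architecture that is trained only on synthetic data and generalizes to real, highly distorted scenes without relying on exact camera intrinsics, directly outputting absolute metric-scale translation in a single inference step, which is critical for real-time applications such as in-cabin safety monitoring.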

Abstract: Camera extrinsic calibration is a fundamental task in computer vision. However, precise relative pose estimation in constrained, highly distorted environments, such as in-cabin automotive monitoring (ICAM), remains challenging. We present InCaRPose, a Transformer-based architecture designed for robust relative pose prediction between image pairs, which can be used for camera extrinsic calibration. By leveraging frozen backbone features such as DINOv3 and a Transformer-based decoder, our model effectively captures the geometric relationship between a reference and a target view. Unlike traditional methods, our approach achieves absolute metric-scale translation within the physically plausible adjustment range of in-cabin camera mounts in a single inference step, which is critical for ICAM, where accurate real-world distances are required for safety-relevant perception. We specifically address the challenges of highly distorted fisheye cameras in automotive interiors by training exclusively on synthetic data. Our model is capable of generalization to real-world cabin environments without relying on the exact same camera intrinsics and additionally achieves competitive performance on the public 7-Scenes dataset. Despite having limited training data, InCaRPose maintains high precision in both rotation and translation, even with a ViT-Small backbone. This enables real-time performance for time-critical inference, such as driver monitoring in supervised autonomous driving. We release our real-world In-Cabin-Pose test dataset consisting of highly distorted vehicle-interior images and our code at https://github.com/felixstillger/InCaRPose.


[94] ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos cs.CVPDF

Peijun Bao, Anwei Luo, Gang Pan, Alex C. Kot, Xudong Jiang

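TL;DR: This paper presents ActivityForensics, the first large-scale benchmark for localizing activity-level forgeries in videos, containing over 6,000 forged video segments. It also proposes Temporal Artifact Diffuser (TADiff), a simple yet effective baseline that exposes forgery artifacts through a diffusion-based feature regularizer.

Details

Motivation: Existing benchmarks focus mainly on appearance-level forgeries (such as face swapping and object removal), whereas advances in video generation have given rise to activity-level forgeries that modify human actions to distort event semantics, producing highly deceptive content that seriously undermines media authenticity and public trust.

Result: Based on ActivityForensics, the paper introduces comprehensive evaluation protocols covering intra-domain, cross-domain, and open-world settings and benchmarks a range of state-of-the-art forgery localizers to facilitate future research.

Insight: The main contributions are the first benchmark focused on localizing activity-level video forgeries and the TADiff baseline, which uses a diffusion model for feature regularization to reveal temporal forgery artifacts, opening a research direction for detecting semantic-level video manipulation.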

Abstract: Temporal forgery localization aims to temporally identify manipulated segments in videos. Most existing benchmarks focus on appearance-level forgeries, such as face swapping and object removal. However, recent advances in video generation have driven the emergence of activity-level forgeries that modify human actions to distort event semantics, resulting in highly deceptive forgeries that critically undermine media authenticity and public trust. To overcome this issue, we introduce ActivityForensics, the first large-scale benchmark for localizing manipulated activity in videos. It contains over 6K forged video segments that are seamlessly blended into the video context, rendering high visual consistency that makes them almost indistinguishable from authentic content to the human eye. We further propose Temporal Artifact Diffuser (TADiff), a simple yet effective baseline that exposes artifact cues through a diffusion-based feature regularizer. Based on ActivityForensics, we introduce comprehensive evaluation protocols covering intra-domain, cross-domain, and open-world settings, and benchmark a wide range of state-of-the-art forgery localizers to facilitate future research. The dataset and code are available at https://activityforensics.github.io.


[95] Task-Guided Multi-Annotation Triplet Learning for Remote Sensing Representations cs.CVPDF

Meilun Zhou, Alina Zare

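TL;DR: This paper proposes a task-guided multi-annotation triplet loss for remote sensing representation learning. It selects the triplets that are most informative across tasks using a mutual-information criterion, replacing the static weights that conventional multi-task triplet losses use to balance supervision from different annotations, and thereby shapes the shared representation more effectively.

Details

Motivation: Existing multi-task triplet loss methods rely on static weights to balance supervision from different annotation types, which requires manual tuning and ignores how tasks interact when shaping a shared representation. The paper aims to solve this problem.

Result: Experiments on an aerial wildlife dataset show that, compared with several triplet loss setups, the proposed task-guided selection achieves better performance on both classification and regression, demonstrating that task-aware triplet selection produces a more effective shared representation for downstream tasks.

Insight: The core idea is to shift the focus of multi-task representation learning from adjusting loss weights (static weighting) to selecting the most informative triplets with a mutual-information criterion. This directly changes which samples shape the representation and models task interaction at a more fundamental level.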

Abstract: Prior multi-task triplet loss methods relied on static weights to balance supervision between various types of annotation. However, static weighting requires tuning and does not account for how tasks interact when shaping a shared representation. To address this, the proposed task-guided multi-annotation triplet loss removes this dependency by selecting triplets through a mutual-information criterion that identifies triplets most informative across tasks. This strategy modifies which samples influence the representation rather than adjusting loss magnitudes. Experiments on an aerial wildlife dataset compare the proposed task-guided selection against several triplet loss setups for shaping a representation in an effective multi-task manner. The results show improved classification and regression performance and demonstrate that task-aware triplet selection produces a more effective shared representation for downstream tasks.
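The distinction between reweighting losses and choosing which triplets contribute can be made concrete with the standard triplet margin loss. The sketch below is a generic illustration, not the paper's method: the abstract does not define its mutual-information criterion, so `scores` is a hypothetical per-triplet informativeness value supplied by the caller, and the margin of 1.0 is an arbitrary default.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    d_ap = math.dist(anchor, positive)
    d_an = math.dist(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


def select_triplets(triplets, scores, k):
    """Keep the k triplets with the highest (hypothetical) informativeness score."""
    ranked = sorted(range(len(triplets)), key=lambda i: scores[i], reverse=True)
    return [triplets[i] for i in ranked[:k]]


def selected_batch_loss(triplets, scores, k, margin=1.0):
    """Average the loss over selected triplets only; no per-task loss weights."""
    chosen = select_triplets(triplets, scores, k)
    return sum(triplet_loss(a, p, n, margin) for a, p, n in chosen) / len(chosen)


triplets = [
    ([0.0, 0.0], [0.0, 1.0], [0.0, 3.0]),  # already well separated
    ([0.0, 0.0], [0.0, 2.0], [0.0, 1.0]),  # violates the margin
]
print(selected_batch_loss(triplets, scores=[0.1, 0.9], k=1))
```

Selection changes the set of samples that produce gradients, whereas static weighting keeps every sample and only rescales its contribution.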


[96] Beyond Task-Driven Features for Object Detection cs.CVPDF

Meilun Zhou, Alina Zare

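TL;DR: This paper proposes an annotation-guided feature augmentation framework to improve object detection. It injects embeddings from annotation-guided latent spaces into the detector backbone, building dense spatial feature grids and fusing them with the feature pyramid to strengthen the region proposal and detection heads. Experiments on wildlife and remote sensing datasets show stronger object focus, lower background sensitivity, and better generalization to unseen or weakly supervised tasks.

Details

Motivation: The task-driven features learned by modern object detectors optimize the end-task loss but often capture shortcut correlations that fail to reflect the underlying annotation structure, which limits transferability, interpretability, and robustness when task definitions change or supervision is sparse.

Result: In experiments across wildlife and remote sensing datasets under multiple supervision regimes, the method yields consistent improvements in classification, localization, and data efficiency, seen as stronger object focus, lower background sensitivity, and better generalization to unseen or weakly supervised tasks.

Insight: The novelty lies in aligning features with annotation geometry through annotation-guided feature augmentation, producing more meaningful representations than purely task-optimized features. Viewed objectively, injecting a structured prior mitigates shortcut learning in task-driven features and improves semantic consistency and generalization.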

Abstract: Task-driven features learned by modern object detectors optimize end task loss yet often capture shortcut correlations that fail to reflect underlying annotation structure. Such representations limit transfer, interpretability, and robustness when task definitions change or supervision becomes sparse. This paper introduces an annotation-guided feature augmentation framework that injects embeddings into an object detection backbone. The method constructs dense spatial feature grids from annotation-guided latent spaces and fuses them with feature pyramid representations to influence region proposal and detection heads. Experiments across wildlife and remote sensing datasets evaluate classification, localization, and data efficiency under multiple supervision regimes. Results show consistent improvements in object focus, reduced background sensitivity, and stronger generalization to unseen or weakly supervised tasks. The findings demonstrate that aligning features with annotation geometry yields more meaningful representations than purely task optimized features.


[97] Training a Student Expert via Semi-Supervised Foundation Model Distillation cs.CVPDF

Pardis Taghavi, Tian Liu, Renjie Li, Reza Langari, Zhengzhong Tu

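TL;DR: This paper proposes a semi-supervised knowledge distillation framework that compresses pre-trained vision foundation models into compact expert models. It proceeds in three stages: domain adaptation via self-training with contrastive calibration, knowledge transfer through a unified multi-objective loss, and student refinement to mitigate pseudo-label bias. The central component is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to extract informative negatives and sharpen inter-instance boundaries.

Details

Motivation: Vision foundation models offer strong perception but are computationally heavy and hard to deploy, and adapting them to specific tasks usually requires costly pixel-level annotation. The paper uses limited labeled data and abundant unlabeled data in semi-supervised knowledge distillation to compress these models, addressing the high annotation cost of tasks such as instance segmentation.

Result: On Cityscapes and ADE20K, the method yields a student roughly 11 times smaller that improves over its zero-shot vision foundation model teacher by +11.9 and +8.6 AP, surpasses the adapted teacher by +3.4 and +1.5 AP, and outperforms state-of-the-art semi-supervised knowledge distillation methods on the benchmarks.

Insight: The main contributions are a three-stage semi-supervised knowledge distillation framework and an instance-aware pixel-wise contrastive loss that keeps the contrastive signal active through both adaptation and distillation, aligning teacher and student embeddings and making better use of unlabeled images. Viewed objectively, the method combines self-training, contrastive learning, and knowledge distillation into a systematic way to compress large foundation models when labeled data is limited.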

Abstract: Foundation models deliver strong perception but are often too computationally heavy to deploy, and adapting them typically requires costly annotations. We introduce a semi-supervised knowledge distillation (SSKD) framework that compresses pre-trained vision foundation models (VFMs) into compact experts using limited labeled and abundant unlabeled data, and instantiate it for instance segmentation where per-pixel labels are particularly expensive. The framework unfolds in three stages: (1) domain adaptation of the VFM(s) via self-training with contrastive calibration, (2) knowledge transfer through a unified multi-objective loss, and (3) student refinement to mitigate residual pseudo-label bias. Central to our approach is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to extract informative negatives and enforce clear inter-instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and more effectively leverage unlabeled images. On Cityscapes and ADE20K, our $\approx 11\times$ smaller student improves over its zero-shot VFM teacher(s) by +11.9 and +8.6 AP, surpasses adapted teacher(s) by +3.4 and +1.5 AP, and outperforms state-of-the-art SSKD methods on benchmarks.


[98] Learning 3D Reconstruction with Priors in Test Time cs.CVPDF

Lei Zhou, Haoyu Wu, Akshat Dave, Dimitris Samaras

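TL;DR: This paper proposes a test-time framework for multiview Transformers (MVTs) that incorporates priors (such as camera poses, intrinsics, and depth) to improve 3D task performance without retraining or modifying the pre-trained image-only network. Priors are treated as constraints on the predictions, and the network is optimized at inference time with a loss consisting of a self-supervised objective and prior penalty terms. The method substantially improves base MVTs on several 3D vision benchmarks.

Details

Motivation: How to exploit prior information (such as camera parameters and depth) to improve multiview Transformers on 3D vision tasks without retraining the network or changing the existing architecture.

Result: On point map estimation and camera pose estimation across datasets including ETH3D, 7-Scenes, and NRGBD, the method reduces the point-map distance error by more than half relative to the base image-only models and also outperforms retrained, prior-aware feed-forward methods.

Insight: The novelty is the test-time constrained optimization (TCO) framework, which treats priors as constraints rather than network inputs. The network is optimized at inference time with a self-supervised objective (a photometric or geometric loss) plus prior penalty terms, so priors are integrated without retraining and 3D reconstruction performance improves.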

Abstract: We introduce a test-time framework for multiview Transformers (MVTs) that incorporates priors (e.g., camera poses, intrinsics, and depth) to improve 3D tasks without retraining or modifying pre-trained image-only networks. Rather than feeding priors into the architecture, we cast them as constraints on the predictions and optimize the network at inference time. The optimization loss consists of a self-supervised objective and prior penalty terms. The self-supervised objective captures the compatibility among multi-view predictions and is implemented using photometric or geometric loss between renderings from other views and each view itself. Any available priors are converted into penalty terms on the corresponding output modalities. Across a series of 3D vision benchmarks, including point map estimation and camera pose estimation, our method consistently improves performance over base MVTs by a large margin. On the ETH3D, 7-Scenes, and NRGBD datasets, our method reduces the point-map distance error by more than half compared with the base image-only models. Our method also outperforms retrained prior-aware feed-forward methods, demonstrating the effectiveness of our test-time constrained optimization (TCO) framework for incorporating priors into 3D vision tasks.
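
The loss structure described in this abstract (a self-supervised consistency term plus a penalty for deviating from an available prior) can be illustrated with a deliberately tiny sketch. This is not the authors' implementation: it optimizes a single scalar "depth" directly instead of network weights, and the names `total_loss`, `total_grad`, `optimize`, the weight, and all numbers are assumptions chosen for illustration.

```python
# Toy sketch of test-time optimization with a prior penalty.
# Assumption: one scalar prediction d stands in for the network output.

def total_loss(d, view_estimates, prior_depth, weight):
    # Self-supervised term: agreement with estimates derived from other views.
    consistency = sum((d - v) ** 2 for v in view_estimates) / len(view_estimates)
    # Prior penalty: squared deviation from the available prior.
    penalty = (d - prior_depth) ** 2
    return consistency + weight * penalty

def total_grad(d, view_estimates, prior_depth, weight):
    # Analytic gradient of total_loss with respect to d.
    g_consistency = sum(2.0 * (d - v) for v in view_estimates) / len(view_estimates)
    g_penalty = 2.0 * (d - prior_depth)
    return g_consistency + weight * g_penalty

def optimize(d0, view_estimates, prior_depth, weight=1.0, lr=0.1, steps=200):
    # Plain gradient descent at inference time.
    d = d0
    for _ in range(steps):
        d -= lr * total_grad(d, view_estimates, prior_depth, weight)
    return d

views = [2.0, 2.4]
d_star = optimize(0.0, views, prior_depth=3.0)
```

For this quadratic toy problem the minimizer is `(mean(views) + weight * prior_depth) / (1 + weight)`, so the prior pulls the prediction away from the pure multi-view consensus in proportion to its weight.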


[99] Interpreting Video Representations with Spatio-Temporal Sparse Autoencoders cs.CV | cs.AIPDF

Atahan Dokme, Sriram Vishwanath

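TL;DR: This paper presents the first systematic study of sparse autoencoders (SAEs) on video representations. It finds that standard SAEs decompose video into interpretable, monosemantic features but destroy temporal coherence. To address this, the paper proposes spatio-temporal contrastive objectives and Matryoshka hierarchical grouping, which recover and even exceed the raw temporal coherence and give a tunable trade-off between reconstruction and temporal coherence. A systematic ablation on two backbones and two datasets shows that different configurations excel at different goals (reconstruction fidelity, temporal coherence, action discrimination, or interpretability). Contrastive SAE features improve action classification by 3.9% over raw features and text-video retrieval R@1 by up to 2.8x. A cross-backbone analysis reveals a backbone-alignment artifact in standard monosemanticity metrics, and causal ablation confirms that contrastive training concentrates predictive signal into a small number of identifiable features.

Details

Motivation: Standard sparse autoencoders destroy temporal coherence in video representations: hard TopK selection produces unstable feature assignments across frames and reduces autocorrelation by 36%, which undermines temporal consistency in video analysis.

Result: Contrastive SAE features improve action classification by 3.9% over raw features and text-video retrieval R@1 by up to 2.8x. Ablations show that different configurations excel at reconstruction fidelity, temporal coherence, action discrimination, or interpretability, and that temporal coherence can match or exceed that of the raw features.

Insight: The contributions include spatio-temporal contrastive objectives and Matryoshka hierarchical grouping for recovering temporal coherence, and the discovery of a backbone-alignment artifact in standard monosemanticity metrics. Viewed objectively, the method balances reconstruction against coherence through a tunable contrastive loss weight and concentrates predictive signal into a few features, improving the practicality and interpretability of video representations.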

Abstract: We present the first systematic study of Sparse Autoencoders (SAEs) on video representations. Standard SAEs decompose video into interpretable, monosemantic features but destroy temporal coherence: hard TopK selection produces unstable feature assignments across frames, reducing autocorrelation by 36%. We propose spatio-temporal contrastive objectives and Matryoshka hierarchical grouping that recover and even exceed raw temporal coherence. The contrastive loss weight controls a tunable trade-off between reconstruction and temporal coherence. A systematic ablation on two backbones and two datasets shows that different configurations excel at different goals: reconstruction fidelity, temporal coherence, action discrimination, or interpretability. Contrastive SAE features improve action classification by +3.9% over raw features and text-video retrieval by up to 2.8x in R@1. A cross-backbone analysis reveals that standard monosemanticity metrics contain a backbone-alignment artifact: both DINOv2 and VideoMAE produce equally monosemantic features under neutral (CLIP) similarity. Causal ablation confirms that contrastive training concentrates predictive signal into a small number of identifiable features.
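
Why hard TopK selection can hurt temporal coherence is easy to see on a toy case. The sketch below is an illustration under assumed values, not the paper's code: two nearly identical consecutive "frames" end up with disjoint active features once only the single largest activation is kept. The helper names `topk` and `cosine` and the activation values are hypothetical.

```python
def topk(acts, k):
    # Hard TopK: keep the k largest activations and zero out the rest.
    keep = set(sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]

def cosine(u, v):
    # Cosine similarity, defined as 0 when either vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return dot / norm if norm else 0.0

# Two consecutive frames whose pre-activation features barely differ.
frame_t = [1.00, 0.99, 0.10]
frame_t1 = [0.99, 1.00, 0.10]

dense_sim = cosine(frame_t, frame_t1)                      # close to 1
sparse_sim = cosine(topk(frame_t, 1), topk(frame_t1, 1))   # winners differ
```

A tiny perturbation flips which feature wins, so the sparse codes of adjacent frames no longer overlap even though the dense features are almost the same. A temporal contrastive term is one way to discourage such flips.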


[100] Multimodal Structure Learning: Disentangling Shared and Specific Topology via Cross-Modal Graphical Lasso cs.CV | cs.LGPDF

Fei Wang, Yutong Zhang, Xiong Wang

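TL;DR: This paper proposes Cross-Modal Graphical Lasso (CM-GLasso), a method for learning interpretable multimodal representations. It couples a text-visualization strategy with a unified vision-language encoder to strictly align multimodal features in a shared latent space, and uses a cross-attention distillation mechanism to extract spatial-aware cross-modal priors. Tailored GLasso estimation and Common-Specific Structure Learning (CSSL) are unified into a joint objective optimized via the Alternating Direction Method of Multipliers (ADMM), so invariant and class-specific precision matrices are disentangled simultaneously without multi-step error accumulation.

Details

Motivation: When applied to visual-linguistic domains, existing sparse graph estimation techniques such as Graphical Lasso are severely limited by high-dimensional noise, modality misalignment, and the confounding of shared and category-specific topologies, which makes it hard to uncover the conditional dependencies between heterogeneous features.

Result: Extensive experiments on eight benchmarks covering natural and medical domains show that CM-GLasso sets a new state of the art in generative classification and dense semantic segmentation.

Insight: The novelty lies in a strict cross-modal feature alignment method, a cross-attention distillation mechanism that condenses high-dimensional patches into explicit semantic nodes, and a joint optimization framework that simultaneously disentangles invariant and class-specific precision matrices, together addressing key bottlenecks in multimodal structure learning.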

Abstract: Learning interpretable multimodal representations inherently relies on uncovering the conditional dependencies between heterogeneous features. However, applying sparse graph estimation techniques, such as Graphical Lasso (GLasso), to visual-linguistic domains is severely bottlenecked by high-dimensional noise, modality misalignment, and the confounding of shared versus category-specific topologies. In this paper, we propose Cross-Modal Graphical Lasso (CM-GLasso) that overcomes these fundamental limitations. By coupling a novel text-visualization strategy with a unified vision-language encoder, we strictly align multimodal features into a shared latent space. We introduce a cross-attention distillation mechanism that condenses high-dimensional patches into explicit semantic nodes, naturally extracting spatial-aware cross-modal priors. Furthermore, we unify tailored GLasso estimation and Common-Specific Structure Learning (CSSL) into a joint objective optimized via the Alternating Direction Method of Multipliers (ADMM). This formulation guarantees the simultaneous disentanglement of invariant and class-specific precision matrices without multi-step error accumulation. Extensive experiments across eight benchmarks covering both natural and medical domains demonstrate that CM-GLasso establishes a new state-of-the-art in generative classification and dense semantic segmentation tasks.
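
For readers unfamiliar with Graphical Lasso, the standard estimator minimizes tr(SΘ) − log det Θ + λ‖Θ‖₁ over positive-definite precision matrices Θ, where S is the sample covariance. ADMM solvers for this objective typically include an elementwise soft-thresholding step, which is what produces sparsity. The sketch below shows only that generic step; it is not CM-GLasso's joint objective, and leaving the diagonal unpenalized is a common variant assumed here. The names `soft_threshold` and `shrink_off_diagonal` are hypothetical.

```python
def soft_threshold(x, t):
    # Proximal operator of t * |x|: shrink toward zero by t, clamp at zero.
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def shrink_off_diagonal(matrix, t):
    # Apply soft-thresholding to off-diagonal entries only (assumed variant).
    return [
        [v if i == j else soft_threshold(v, t) for j, v in enumerate(row)]
        for i, row in enumerate(matrix)
    ]

theta = [
    [2.00, 0.05, -0.60],
    [0.05, 1.50, 0.30],
    [-0.60, 0.30, 1.00],
]
sparse_theta = shrink_off_diagonal(theta, 0.1)
```

Entries whose magnitude is below the threshold become exactly zero, which corresponds to removing an edge (a conditional dependency) from the estimated graph.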


[101] VLA-Forget: Vision-Language-Action Unlearning for Embodied Foundation Models cs.CV | cs.AIPDF

Ravi Ranjan, Agoritsa Polyzou

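TL;DR: This paper proposes VLA-Forget, a hybrid unlearning framework for embodied foundation models (such as OpenVLA-style policies) that removes unsafe, spurious, or privacy-sensitive behaviors while preserving perception, language grounding, and action control. It jointly optimizes three objectives (targeted forgetting, perceptual preservation, and reasoning retention) through staged updates over the visual encoder, the cross-modal projector, and the upper action-generating transformer blocks, addressing the residual forgetting and utility loss that conventional unlearning methods show in embodied settings.

Details

Motivation: Deploying embodied vision-language-action (VLA) models raises an unlearning challenge: unsafe, spurious, or privacy-sensitive behaviors must be removed, yet the undesirable knowledge can be spread across perception, alignment, and reasoning/action layers. Partial unlearning applied only to the vision stack or only to the language backbone is often insufficient, and conventional baselines designed for standalone vision or language models may leave residual forgetting or incur unnecessary utility loss in embodied settings.

Result: Across forget-set behavior probes and retain-task evaluations, and relative to strong unlearning baselines, VLA-Forget improves forgetting efficacy by 10%, preserves perceptual specificity by 22%, retains reasoning and task success by 9%, and reduces post-quantization recovery by 55%.

Insight: The novelty is a hybrid unlearning framework that combines ratio-aware selective editing for perception and cross-modal specificity with layer-selective reasoning/action unlearning for utility-preserving forgetting. By jointly optimizing several objectives in stages, the method handles the difficulty that knowledge in VLA models is distributed across modules, and offers a reference approach for the safe deployment of embodied foundation models.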

Abstract: Vision-language-action (VLA) models are emerging as embodied foundation models for robotic manipulation, but their deployment introduces a new unlearning challenge: removing unsafe, spurious, or privacy-sensitive behaviors without degrading perception, language grounding, and action control. In OpenVLA-style policies, behavior is produced through a fused visual encoder, a cross-modal projector, and a language backbone that predicts tokenized robot actions, so undesirable knowledge can be distributed across perception, alignment, and reasoning/action layers rather than confined to a single module. Consequently, partial unlearning applied only to the vision stack or only to the language backbone is often insufficient, while conventional unlearning baselines designed for standalone vision or language models may leave residual forgetting or incur unnecessary utility loss in embodied settings. We propose VLA-Forget, a hybrid unlearning framework that combines ratio-aware selective editing for perception and cross-modal specificity with layer-selective reasoning/action unlearning for utility-preserving forgetting. VLA-Forget jointly optimizes three objectives: targeted forgetting, perceptual preservation, and reasoning retention, through staged updates over the visual encoder, projector, and upper action-generating transformer blocks. Across forget-set behavior probes and retain-task evaluations, VLA-Forget improves forgetting efficacy by 10%, preserves perceptual specificity by 22%, retains reasoning and task success by 9%, and reduces post-quantization recovery by 55% relative to strong unlearning baselines.


[102] Hierarchical Point-Patch Fusion with Adaptive Patch Codebook for 3D Shape Anomaly Detection cs.CVPDF

Xueyang Kang, Zizhao Li, Tian Lan, Dong Gong, Kourosh Khoshelham

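TL;DR: This paper proposes a hierarchical point-patch fusion network for 3D shape anomaly detection, combining an adaptive patchification module with self-supervised decomposition to capture complex structural deviations. The method performs well on public benchmarks (Anomaly-ShapeNet and Real3D-AD) and on a newly released industrial test set, with notable gains in point-level and object-level detection metrics.

Details

Motivation: Existing deep learning methods for 3D shape anomaly detection generalize poorly across anomaly types and scales (for example, global geometric errors) and are sensitive to noisy or incomplete local points during training.

Result: On public and industrial datasets the method achieves superior AUC-ROC and AUC-PR, including a point-level improvement of over 40% on the new industrial anomaly type and average object-level gains of 7% on Real3D-AD and 4% on Anomaly-ShapeNet.

Insight: The contributions include a hierarchical point-patch anomaly scoring network that jointly models regional part features and local point features, and an adaptive patchification module that integrates self-supervised decomposition to improve robustness to structural deviations.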

Abstract: 3D shape anomaly detection is a crucial task for industrial inspection and geometric analysis. Existing deep learning approaches typically learn representations of normal shapes and identify anomalies via out-of-distribution feature detection or decoder-based reconstruction. They often fail to generalize across diverse anomaly types and scales, such as global geometric errors (e.g., planar shifts, angle misalignments), and are sensitive to noisy or incomplete local points during training. To address these limitations, we propose a hierarchical point-patch anomaly scoring network that jointly models regional part features and local point features for robust anomaly reasoning. An adaptive patchification module integrates self-supervised decomposition to capture complex structural deviations. Beyond evaluations on public benchmarks (Anomaly-ShapeNet and Real3D-AD), we release an industrial test set with real CAD models exhibiting planar, angular, and structural defects. Experiments on public and industrial datasets show superior AUC-ROC and AUC-PR performance, including over 40% point-level improvement on the new industrial anomaly type and average object-level gains of 7% on Real3D-AD and 4% on Anomaly-ShapeNet, demonstrating strong robustness and generalization.


[103] Gram-Anchored Prompt Learning for Vision-Language Models via Second-Order Statistics cs.CV | cs.AIPDF

Minglei Chen, Weilong Wang, Jiang Duan, Ye Deng

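TL;DR: This paper proposes Gram-Anchored Prompt Learning (GAPL), a prompt learning method based on second-order statistics that improves the adaptation of vision-language models to downstream tasks. It introduces Gram matrices to capture global second-order statistics and combines them with conventional alignment on first-order spatial features, improving robustness to domain shift and local noise.

Details

Motivation: Existing parameter-efficient prompt learning methods mainly align text prompts with first-order visual features (spatial feature maps), which are susceptible to domain shift and local noise and therefore adapt poorly. The paper aims to make the adaptation of vision-language models to downstream tasks more robust by adding second-order statistical information.

Result: Extensive experiments support the effectiveness of second-order features, and GAPL achieves competitive performance on multiple benchmarks.

Insight: The novelty lies in systematically introducing second-order statistics (Gram matrices) into the prompt learning framework for vision-language models. By combining local semantic alignment with global structural consistency, language representations can adapt dynamically to statistical distribution shifts across domains. Viewed objectively, the method offers a complementary-feature perspective for multimodal representation learning, pairing first-order spatial detail with second-order statistical patterns, which may be broadly relevant to domain generalization.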

Abstract: Parameter-efficient prompt learning has become the de facto standard for adapting Vision-Language Models (VLMs) to downstream tasks. Existing approaches predominantly focus on aligning text prompts with first-order visual features (i.e., spatial feature maps). While effective for fine-grained semantic discrimination, we argue that relying solely on first-order information is insufficient for robust adaptation, as these spatially entangled features are highly susceptible to domain shifts and local noise. In this work, we propose Gram-Anchored Prompt Learning (GAPL) for Vision-Language Models via Second-Order Statistics, a framework that synergizes local semantic alignment with global structural consistency. Methodologically, we introduce an additional second-order statistical stream via Gram matrices that augments the standard first-order spatial interaction. By anchoring prompts to these second-order priors, our approach enables language representations to dynamically adapt to statistical distribution shifts across diverse domains. Extensive experiments indicate the effectiveness of the second-order features, and show compelling performances of GAPL on various benchmarks.
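
A Gram matrix of a feature map collects the inner products between channels, averaged over spatial positions. The sketch below uses the common channel-by-channel convention, which is an assumption here since the abstract does not spell out the exact definition used by GAPL. The function name `gram_matrix` and the toy values are hypothetical.

```python
def gram_matrix(features):
    # features: list of C channels, each a list of N spatial responses.
    # Entry (i, j) is the mean over positions of channel_i * channel_j.
    n = len(features[0])
    return [
        [sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
        for fi in features
    ]

# Two channels observed at three spatial positions.
features = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
# The same responses with the spatial positions reordered consistently.
shuffled = [[3.0, 1.0, 2.0], [0.0, 0.0, 1.0]]

g_original = gram_matrix(features)
g_shuffled = gram_matrix(shuffled)
```

Because the sum runs over all positions, reordering the positions leaves the matrix unchanged. That is the sense in which such statistics are global and less tied to spatial layout than a first-order feature map.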


[104] A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning cs.CV | cs.SDPDF

Tianle Chen, Deepti Ghadiyaram

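TL;DR: This paper systematically studies how cross-modal typographic attacks affect audio-visual multimodal large language models (MLLMs) and shows that multimodal attacks pose a stronger threat than unimodal ones.

Details

Motivation: As MLLMs are increasingly deployed in safety-critical applications, understanding their vulnerabilities is essential. The paper explores how cross-modal attacks affect multimodal reasoning.

Result: Across multiple frontier MLLMs, tasks, and common-sense reasoning and content moderation benchmarks, coordinated multimodal attacks reach an attack success rate of 83.43%, well above the 34.93% of single-modality attacks.

Insight: The novelty is a systematic study of cross-modal typographic attacks that exposes the cross-modal fragility of MLLMs and shows that coordinated multimodal attacks are a critical and underexplored attack strategy.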

Abstract: As audio-visual multi-modal large language models (MLLMs) are increasingly deployed in safety-critical applications, understanding their vulnerabilities is crucial. To this end, we introduce Multi-Modal Typography, a systematic study examining how typographic attacks across multiple modalities adversely influence MLLMs. While prior work focuses narrowly on unimodal attacks, we expose the cross-modal fragility of MLLMs. We analyze the interactions between audio, visual, and text perturbations and reveal that a coordinated multi-modal attack creates a significantly more potent threat than single-modality attacks (attack success rate = $83.43\%$ vs. $34.93\%$). Our findings across multiple frontier MLLMs, tasks, and common-sense reasoning and content moderation benchmarks establish multi-modal typography as a critical and underexplored attack strategy in multi-modal reasoning. Code and data will be publicly available.


[105] OASIC: Occlusion-Agnostic and Severity-Informed Classification cs.CV | cs.LGPDF

Kay Gijzen, Gertjan J. Burghouts, Daniël M. Pelt

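TL;DR: This paper proposes OASIC (Occlusion-Agnostic and Severity-Informed Classification), a method for handling severe object occlusion in computer vision. It removes distracting patterns by masking the occluder at test time and copes with reduced visible information by randomly masking parts of the object during training. The core idea is to estimate the occlusion severity of a test image and select the model optimized for that severity to make the prediction.

Details

Motivation: To address the two root causes of the difficulty with severely occluded objects: the loss of visible information and the distracting patterns produced by the occluders.

Result: On occluded image classification, combining gray masking with adaptive model selection improves the occlusion AUC ($\text{AUC}_\text{occ}$) by +18.5 over standard training on occluded images and by +23.7 over fine-tuning on unoccluded images.

Insight: The main novelty is treating the occlusion as a visual anomaly with respect to the object of interest, which makes the masking independent of the occlusion type. The paper also finds and exploits two facts: occlusion severity can be estimated at test time, and model performance correlates strongly with the occlusion level used in training. Together these lead to a severity-informed adaptive model selection framework.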

Abstract: Severe occlusions of objects pose a major challenge for computer vision. We show that two root causes are (1) the loss of visible information and (2) the distracting patterns caused by the occluders. Our approach addresses both causes at the same time. First, the distracting patterns are removed at test time, via masking of the occluding patterns. This masking is independent of the type of occlusion, because the occlusion is handled through the lens of visual anomalies with respect to the object of interest. Second, to deal with fewer visual details, we follow standard practice by masking random parts of the object during training, for various degrees of occlusion. We discover that (a) it is possible to estimate the degree of the occlusion (i.e., severity) at test time, and (b) a model optimized for a specific degree of occlusion also performs best on a similar degree at test time. Combining these two insights brings us to a severity-informed classification model called OASIC: Occlusion Agnostic Severity Informed Classification. We estimate the severity of occlusion for a test image, mask the occluder, and select the model that is optimized for the degree of occlusion. This strategy performs better than any single model optimized for any smaller or broader range of occlusion severities. Experiments show that combining gray masking with adaptive model selection improves $\text{AUC}_\text{occ}$ by +18.5 over standard training on occluded images and +23.7 over finetuning on unoccluded images.
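
The "estimate severity, then pick the matching model" step can be sketched in a few lines. This is a simplified illustration, not the OASIC pipeline: severity is assumed to be the fraction of pixels flagged as occluder in a binary mask, models are represented only by their names, and the selection rule (nearest training severity) is an assumption. The names `occlusion_severity` and `select_model` are hypothetical.

```python
def occlusion_severity(mask):
    # mask: 2D list of 0/1 values, where 1 marks a pixel flagged as occluder.
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total

def select_model(severity, models):
    # models: dict mapping the occlusion level used in training to a model.
    # Pick the model whose training level is closest to the estimate.
    nearest = min(models, key=lambda level: abs(level - severity))
    return models[nearest]

mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
]
models = {0.0: "clean", 0.25: "light", 0.5: "medium", 0.75: "heavy"}

severity = occlusion_severity(mask)
chosen = select_model(severity, models)
```

In a real system the mask would come from an anomaly detector and each dictionary value would be a trained classifier; only the routing logic is shown here.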


[106] HOIGS: Human-Object Interaction Gaussian Splatting cs.CV | cs.AIPDF

Taewoo Kim, Suwoong Yeom, Jaehyun Pyun, Geonho Cha, Dongyoon Wee

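TL;DR: This paper proposes HOIGS (Human-Object Interaction Gaussian Splatting), a method for reconstructing dynamic scenes with complex human-object interactions. A cross-attention-based HOI module explicitly models interaction-induced deformation between humans and objects, and features are extracted with separate deformation baselines, HexPlane for humans and Cubic Hermite Spline for objects. This captures interdependent motions and improves deformation estimation under occlusion, contact, and object manipulation.

Details

Motivation: Existing Gaussian Splatting methods either rely on human pose priors and neglect dynamic objects, or approximate all motion within a single field, which limits their ability to capture interaction-rich dynamics. The paper aims to fill this gap by explicitly modeling human-object interaction for high-fidelity reconstruction.

Result: Comprehensive experiments on multiple datasets show that the method consistently outperforms state-of-the-art human-centric and 4D Gaussian approaches in reconstruction quality.

Insight: The core novelty is the explicit modeling of human-object interaction deformation together with a heterogeneous feature extraction strategy (HexPlane for humans, CHS for objects). This offers an effective way to model complex interaction dynamics and underlines the importance of interaction modeling for high-fidelity reconstruction.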

Abstract: Reconstructing dynamic scenes with complex human-object interactions is a fundamental challenge in computer vision and graphics. Existing Gaussian Splatting methods either rely on human pose priors while neglecting dynamic objects, or approximate all motions within a single field, limiting their ability to capture interaction-rich dynamics. To address this gap, we propose Human-Object Interaction Gaussian Splatting (HOIGS), which explicitly models interaction-induced deformation between humans and objects through a cross-attention-based HOI module. Distinct deformation baselines are employed to extract features: HexPlane for humans and Cubic Hermite Spline (CHS) for objects. By integrating these heterogeneous features, HOIGS effectively captures interdependent motions and improves deformation estimation in scenarios involving occlusion, contact, and object manipulation. Comprehensive experiments on multiple datasets demonstrate that our method consistently outperforms state-of-the-art human-centric and 4D Gaussian approaches, highlighting the importance of explicitly modeling human-object interactions for high-fidelity reconstruction.
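
The abstract names Cubic Hermite Splines as the deformation baseline for objects. As background, the sketch below evaluates one cubic Hermite segment on the unit interval using the standard basis polynomials. How HOIGS parameterizes keyframes and tangents is not stated in the abstract, so the scalar setup and the function name `hermite` are assumptions for illustration.

```python
def hermite(p0, m0, p1, m1, t):
    # Cubic Hermite interpolation on t in [0, 1].
    # p0, p1: endpoint values; m0, m1: endpoint tangents.
    t2, t3 = t * t, t * t * t
    h00 = 2 * t3 - 3 * t2 + 1   # weight of p0
    h10 = t3 - 2 * t2 + t       # weight of m0
    h01 = -2 * t3 + 3 * t2      # weight of p1
    h11 = t3 - t2               # weight of m1
    return h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1

# An object coordinate moving from 0 to 1 with zero velocity at both ends.
start = hermite(0.0, 0.0, 1.0, 0.0, 0.0)
middle = hermite(0.0, 0.0, 1.0, 0.0, 0.5)
end = hermite(0.0, 0.0, 1.0, 0.0, 1.0)
```

The curve passes through both endpoints and matches the given tangents there, which gives smooth trajectories from a small number of control values. That property makes this family of splines a natural fit for rigid or near-rigid object motion.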


[107] ATSS: Detecting AI-Generated Videos via Anomalous Temporal Self-Similarity cs.CVPDF

Hang Wang, Chao Shen, Lei Zhang, Zhi-Qi Cheng

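TL;DR: This paper proposes ATSS, a method for detecting AI-generated videos. It rests on a key observation: AI-generated videos show anomalous self-similarity over time. Because they are driven by deterministic anchors (such as text or image prompts), their visual and semantic trajectories exhibit unnaturally repetitive correlations, whereas real videos have stochastic dynamics. ATSS quantifies this temporal anomaly by building visual, textual, and cross-modal similarity matrices, and models them with Transformer encoders and a bidirectional cross-attentive fusion module. Experiments on several large-scale benchmarks show that ATSS significantly outperforms state-of-the-art methods in AP, AUC, and ACC.

Details

Motivation: Existing AI-generated video detectors focus mainly on localized artifacts or short-term temporal inconsistencies and struggle to capture the underlying generative logic that governs global temporal evolution, which limits detection performance. The paper addresses this by identifying a fingerprint specific to AI-generated videos, "anomalous temporal self-similarity".

Result: Extensive experiments on four large-scale benchmarks (GenVideo, EvalCrafter, VideoPhy, and VidProM) show that ATSS significantly outperforms state-of-the-art methods in AP, AUC, and ACC and generalizes well across different video generation models.

Insight: The core novelty is identifying and exploiting anomalous temporal self-similarity as an intrinsic fingerprint of AI-generated videos. Methodologically, the paper (1) quantifies temporal anomalies with a triple of visual, textual, and cross-modal similarity matrices and (2) uses a bidirectional cross-attentive fusion module to model intra- and inter-modal dynamics. This gives multimodal video forgery detection a new perspective based on the logic of global temporal evolution.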

Abstract: AI-generated videos (AIGVs) have achieved unprecedented photorealism, posing severe threats to digital forensics. Existing AIGV detectors focus mainly on localized artifacts or short-term temporal inconsistencies, and thus often fail to capture the underlying generative logic governing global temporal evolution, limiting AIGV detection performance. In this paper, we identify a distinctive fingerprint in AIGVs, termed anomalous temporal self-similarity (ATSS). Unlike real videos that exhibit stochastic natural dynamics, AIGVs follow deterministic anchor-driven trajectories (e.g., text or image prompts), inducing unnaturally repetitive correlations across visual and semantic domains. We propose the ATSS method, a multimodal detection framework that exploits this insight via a triple-similarity representation and a cross-attentive fusion mechanism. Specifically, ATSS reconstructs semantic trajectories by leveraging frame-wise descriptions to construct visual, textual, and cross-modal similarity matrices, which jointly quantify the inherent temporal anomalies. These matrices are encoded by dedicated Transformer encoders and integrated via a bidirectional cross-attentive fusion module to effectively model intra- and inter-modal dynamics. Extensive experiments on four large-scale benchmarks, including GenVideo, EvalCrafter, VideoPhy, and VidProM, demonstrate that ATSS significantly outperforms state-of-the-art methods in terms of AP, AUC, and ACC metrics, exhibiting superior generalization across diverse video generation models. Code and models of ATSS will be released at https://github.com/hwang-cs-ime/ATSS.
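
A temporal self-similarity matrix compares every frame embedding with every other frame embedding. The sketch below builds one with cosine similarity and summarizes it by its mean off-diagonal value. It is only an illustration of the representation: the 2D "embeddings" are made up, and the summary statistic is an assumption, since ATSS feeds the matrices to Transformer encoders instead of thresholding a single number.

```python
def cosine(u, v):
    # Cosine similarity, defined as 0 when either vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return dot / norm if norm else 0.0

def self_similarity(frames):
    # frames: list of per-frame embedding vectors; returns a T x T matrix.
    return [[cosine(a, b) for b in frames] for a in frames]

def mean_off_diagonal(matrix):
    # Average similarity between distinct frames.
    n = len(matrix)
    total = sum(matrix[i][j] for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

repetitive = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.0]]   # frames barely change
varied = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]      # frames change a lot

s_repetitive = self_similarity(repetitive)
s_varied = self_similarity(varied)
```

Building the same kind of matrix from frame captions, and from frame-to-caption pairs, would give the textual and cross-modal counterparts mentioned in the abstract.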


[108] 4C4D: 4 Camera 4D Gaussian Splatting cs.CVPDF

Junsheng Zhou, Zhifan Yang, Liang Han, Wenyuan Zhang, Kanle Shi

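TL;DR: This paper proposes 4C4D, a framework for reconstructing 4D dynamic scenes from videos captured by extremely sparse (as few as four) portable cameras. Its core contribution is a Neural Decaying Function applied to Gaussian opacities, which strengthens the geometric modeling capability of 4D Gaussians in sparse settings and yields temporally consistent novel-view rendering that surpasses existing methods.

Details

Motivation: Learning to model dynamic scenes from extremely sparse (for example, four) camera videos for high-quality, temporally consistent novel-view rendering. Conventional methods usually require dense camera arrays with dozens or even hundreds of views, which limits portability and applicability.

Result: Extensive experiments on sparse-view datasets with varying camera overlap show that 4C4D outperforms prior art.

Insight: The main novelty is the Neural Decaying Function that strengthens the geometric modeling capability of 4D Gaussians. The key insight is that, with sparse cameras, learning geometry is much harder than modeling appearance. The design steers 4DGS gradients toward geometric learning and so mitigates the inherent imbalance between geometry and appearance modeling in 4DGS.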

Abstract: This paper tackles the challenge of recovering 4D dynamic scenes from videos captured by as few as four portable cameras. Learning to model scene dynamics for temporally consistent novel-view rendering is a foundational task in computer graphics, where previous works often require dense multi-view captures using camera arrays of dozens or even hundreds of views. We propose 4C4D, a novel framework that enables high-fidelity 4D Gaussian Splatting from video captures of extremely sparse cameras. Our key insight is that geometric learning under sparse settings is substantially more difficult than appearance modeling. Driven by this observation, we introduce a Neural Decaying Function on Gaussian opacities for enhancing the geometric modeling capability of 4D Gaussians. This design mitigates the inherent imbalance between geometry and appearance modeling in 4DGS by encouraging the 4DGS gradients to focus more on geometric learning. Extensive experiments across sparse-view datasets with varying camera overlaps show that 4C4D achieves superior performance over prior art. Project page at: https://junshengzhou.github.io/4C4D.


[109] Intelligent Traffic Monitoring with YOLOv11: A Case Study in Real-Time Vehicle Detection cs.CV | cs.AI | cs.LGPDF

Shkelqim Sherifi

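TL;DR: This paper presents an offline, real-time traffic monitoring system for vehicle detection and counting, based on YOLOv11 and BoT-SORT/ByteTrack. The system is implemented in PyTorch/OpenCV with a Qt-based desktop user interface and processes video streams efficiently without cloud dependencies.

Details

Motivation: To use AI-driven computer vision, in particular deep-learning-based object detection and tracking, to strengthen traffic monitoring systems in support of future smart cities.

Result: The system achieves counting accuracy between 66.67% and 95.83% across scenes. For class-wise detection, precision is 0.97-1.00 for cars and 1.00 for trucks, recall is 0.82-1.00 for cars and 0.70-1.00 for trucks, and the corresponding F1 scores are 0.90-1.00 for cars and 0.82-1.00 for trucks. Performance is robust in typical conditions, though adverse weather may degrade it.

Insight: The main contribution is combining a pre-trained, lightweight YOLOv11 detector with BoT-SORT/ByteTrack multi-object tracking into a complete offline, real-time, cloud-independent system, and demonstrating the practical potential of AI-driven traffic monitoring through an accessible desktop interface.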

Abstract: Recent advancements in computer vision, driven by artificial intelligence, have significantly enhanced monitoring systems. One notable application is traffic monitoring, which leverages computer vision alongside deep learning-based object detection and counting. We present an offline, real-time traffic monitoring system that couples a pre-trained YOLOv11 detector with BoT-SORT/ByteTrack for multi-object tracking, implemented in PyTorch/OpenCV and wrapped in a Qt-based desktop UI. The CNN pipeline enables efficient vehicle detection and counting from video streams without cloud dependencies. Across diverse scenes, the system achieves 66.67-95.83% counting accuracy. Class-wise detection yields high precision (cars: 0.97-1.00; trucks: 1.00) with strong recall (cars: 0.82-1.00; trucks: 0.70-1.00), resulting in F1 scores of 0.90-1.00 for cars and 0.82-1.00 for trucks. While adverse weather conditions may negatively impact this performance, results remain robust in typical conditions. By integrating lightweight models with an accessible, cloud-independent interface, this paper contributes to the modernization and development of future smart cities by showing the capacity of AI-driven traffic monitoring systems.
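
The F1 score is the harmonic mean of precision and recall, so the reported ranges can be sanity-checked directly. The short sketch below does this for one pair of values taken from the abstract; the function name `f1_score` is hypothetical, and pairing the truck precision with the lowest truck recall is an assumption about which scene produced which numbers.

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall; 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Truck precision of 1.00 combined with the lowest reported truck recall.
truck_f1_low = f1_score(1.00, 0.70)
# Perfect precision and recall.
truck_f1_high = f1_score(1.00, 1.00)
```

A precision of 1.00 with a recall of 0.70 gives an F1 of about 0.82, which agrees with the lower end of the F1 range reported for trucks.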


[110] A Physics-Informed, Behavior-Aware Digital Twin for Robust Multimodal Forecasting of Core Body Temperature in Precision Livestock Farming cs.CVPDF

Riasad Alvi, Mohaimenul Azam Khan Raiaan, Sadia Sultana Chowa, Arefin Ittesafun Abian, Reem E Mohamed

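TL;DR: This study proposes a multimodal framework that combines a physics-informed digital twin (DT) with an uncertainty-aware, expert-weighted stacked ensemble to forecast the core body temperature (CBT) of dairy cattle in precision livestock farming. The framework integrates an ODE-based thermoregulation model, a Gaussian process, a Kalman filter, and a behavioral Markov chain, and fuses sensor data with DT-derived features through a multi-stage stacked ensemble to produce accurate forecasts of CBT and heat stress probability with quantified uncertainty.

Details

Motivation: Precision livestock farming needs accurate and timely heat stress prediction to ensure animal welfare and optimize farm management, but existing methods fall short in integrating physical mechanisms, individual variation, and behavioral dynamics.

Result: On the MmCows dataset, the framework achieves a cross-validated $R^2$ of 0.783, an F1 score of 84.25%, and a Prediction Interval Coverage Probability (PICP) of 92.38% for 2-hour-ahead forecasting. Ablation analysis confirms the benefit of DT-derived features and multimodal fusion.

Insight: The novelty lies in tightly combining a physical mechanistic model (ODE-based thermoregulation) with a data-driven method (stacked ensemble), adding a behavioral Markov chain to model activity-state transitions, and strengthening reliability through uncertainty quantification (for example, PICP). It serves as an example of physics-informed AI applied to agriculture.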

Abstract: Precision livestock farming requires accurate and timely heat stress prediction to ensure animal welfare and optimize farm management. This study presents a physics-informed digital twin (DT) framework combined with an uncertainty-aware, expert-weighted stacked ensemble for multimodal forecasting of Core Body Temperature (CBT) in dairy cattle. Using the high-frequency, heterogeneous MmCows dataset, the DT integrates an ordinary differential equation (ODE)-based thermoregulation model that simulates metabolic heat production and dissipation, a Gaussian process for capturing cow-specific deviations, a Kalman filter for aligning predictions with real-time sensor data, and a behavioral Markov chain that models activity-state transitions under varying environmental conditions. The DT outputs (key physiological indicators such as predicted CBT, heat stress probability, and behavioral state distributions) are fused with raw sensor data and enriched through multi-scale temporal analysis and cross-modal feature engineering to form a comprehensive feature set. The predictive methodology is a three-stage stacked ensemble: stage 1 trains modality-specific LightGBM ‘expert’ models on distinct feature groups, stage 2 collects their predictions as meta-features, and stage 3 uses an Optuna-tuned LightGBM meta-model to produce the final CBT forecast. Predictive uncertainty is quantified via bootstrapping and validated using Prediction Interval Coverage Probability (PICP). Ablation analysis confirms that incorporating DT-derived features and multimodal fusion substantially enhances performance. The proposed framework achieves a cross-validated $R^2$ of 0.783, an F1 score of 84.25%, and a PICP of 92.38% for 2-hour-ahead forecasting, providing a robust, uncertainty-aware, and physically principled system for early heat stress detection and precision livestock management.
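
Prediction Interval Coverage Probability (PICP) is simply the fraction of observed values that fall inside their predicted intervals. The sketch below computes it for a handful of made-up temperature readings. The function name `picp` and all numbers are illustrative assumptions, and the bootstrapping that would produce the interval bounds in the paper is not shown.

```python
def picp(y_true, lower, upper):
    # Fraction of observations covered by their [lower, upper] interval.
    covered = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return covered / len(y_true)

# Hypothetical core body temperatures (degrees Celsius) and interval bounds.
observed = [38.5, 39.0, 39.4, 40.1]
lower = [38.0, 38.8, 39.5, 39.8]
upper = [39.0, 39.2, 40.0, 40.5]

coverage = picp(observed, lower, upper)
```

Here three of the four readings fall inside their intervals, so the coverage is 0.75. A well-calibrated 95% interval would be expected to reach a PICP close to 0.95 on held-out data.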


[111] Hypothesis Graph Refinement: Hypothesis-Driven Exploration with Cascade Error Correction for Embodied Navigation cs.CVPDF

Peixin Chen, Guoxi Zhang, Jianwei Ma, Qing Li

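TL;DR: This paper proposes Hypothesis Graph Refinement (HGR), a framework for embodied navigation. HGR represents semantic predictions about frontier regions as revisable hypothesis nodes in a dependency-aware graph memory. A semantic hypothesis module ranks exploration targets in a goal-directed way, and a verification-driven cascade correction mechanism retracts a refuted node together with its downstream dependents when a prediction error is found, keeping memory reliable over long explorations.

Details

Motivation: Existing graph-based navigation systems usually treat unexplored regions as semantically unknown, which makes frontier search inefficient. Prediction errors from vision-language models (VLMs) can also be embedded into memory and propagate, causing structural error accumulation that confidence attenuation alone cannot resolve. A framework is therefore needed that uses semantic predictions for directed exploration and systematically retracts errors when new evidence contradicts them.

Result: On the GOAT-Bench multimodal lifelong navigation benchmark, HGR achieves a 72.41% success rate and 56.22% SPL, and it shows consistent improvements on the embodied question answering benchmarks (A-EQA, EM-EQA). Diagnostic analysis shows that cascade correction eliminates approximately 20% of structurally redundant hypothesis nodes and reduces revisits to erroneous regions by 4.5x, with specular and transparent surfaces accounting for 67% of corrected prediction errors.

Insight: The novelty lies in modeling frontier semantic predictions as revisable hypothesis nodes, together with a dependency-aware graph memory and verification-driven cascade correction. The graph memory can contract by pruning erroneous subgraphs, so reliability is maintained dynamically over long tasks instead of the map only growing. This offers a new approach to error propagation and memory contamination from semantic predictions in navigation.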

Abstract: Embodied agents must explore partially observed environments while maintaining reliable long-horizon memory. Existing graph-based navigation systems improve scalability, but they often treat unexplored regions as semantically unknown, leading to inefficient frontier search. Although vision-language models (VLMs) can predict frontier semantics, erroneous predictions may be embedded into memory and propagate through downstream inferences, causing structural error accumulation that confidence attenuation alone cannot resolve. These observations call for a framework that can leverage semantic predictions for directed exploration while systematically retracting errors once new evidence contradicts them. We propose Hypothesis Graph Refinement (HGR), a framework that represents frontier predictions as revisable hypothesis nodes in a dependency-aware graph memory. HGR introduces (1) a semantic hypothesis module, which estimates context-conditioned semantic distributions over frontiers and ranks exploration targets by goal relevance, travel cost, and uncertainty, and (2) verification-driven cascade correction, which compares on-site observations against predicted semantics and, upon mismatch, retracts the refuted node together with all its downstream dependents. Unlike additive map-building, this allows the graph to contract by pruning erroneous subgraphs, keeping memory reliable throughout long episodes. We evaluate HGR on multimodal lifelong navigation (GOAT-Bench) and embodied question answering (A-EQA, EM-EQA). HGR achieves 72.41% success rate and 56.22% SPL on GOAT-Bench, and shows consistent improvements on both QA benchmarks. Diagnostic analysis reveals that cascade correction eliminates approximately 20% of structurally redundant hypothesis nodes and reduces revisits to erroneous regions by 4.5x, with specular and transparent surfaces accounting for 67% of corrected prediction errors.
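
The cascade correction step, retracting a refuted hypothesis together with everything inferred from it, amounts to a reachability search over the dependency graph. The sketch below shows that idea with a breadth-first traversal. The graph encoding (a dict from node to its direct dependents), the function name `retract`, and the example node labels are assumptions for illustration and are not taken from the HGR implementation.

```python
from collections import deque

def retract(dependents, refuted):
    # dependents: dict mapping a hypothesis to the hypotheses derived from it.
    # Returns the refuted node plus all of its downstream dependents.
    removed = {refuted}
    queue = deque([refuted])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, ()):
            if child not in removed:
                removed.add(child)
                queue.append(child)
    return removed

# Hypothetical memory: a guessed kitchen led to further guesses.
dependents = {
    "kitchen?": ["fridge?", "sink?"],
    "fridge?": ["milk?"],
    "bedroom?": ["bed?"],
}

pruned = retract(dependents, "kitchen?")
```

Hypotheses that do not depend on the refuted node, such as the bedroom branch here, are left untouched. This is the difference between pruning a subgraph and resetting the whole memory.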


[112] Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks cs.CV | cs.AIPDF

Rubén Moreno-Aguado, Alba Magallón, Victor Moreno, Yingying Fang, Guang Yang

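TL;DR: This paper presents VoxelFM, a 3D CT foundation model trained with self-distillation using the DINO framework, which learns semantically rich visual features without language supervision. VoxelFM is evaluated on seven categories of clinically relevant downstream tasks (classification, regression, survival analysis, instance retrieval, localization, segmentation, and report generation) and matches or outperforms existing CT foundation models in every category.

Details

Motivation: Existing CT foundation models mainly aim at generalist vision-language systems capable of question answering and report generation, but training such systems requires large-scale paired image-text data that is not yet available for CT. In addition, adapting the underlying visual representations to downstream tasks usually requires partial or full backbone fine-tuning, which is computationally expensive. The study therefore prioritizes learning robust visual representations that transfer efficiently with minimal labeled data and no backbone fine-tuning.

Result: Evaluated with frozen backbone representations and lightweight probes on seven categories of downstream tasks (classification, regression, survival analysis, instance retrieval, localization, segmentation, and report generation), VoxelFM matches or outperforms four existing CT foundation models, and it surpasses models explicitly trained with language-alignment objectives, including on report generation.

Insight: The novelty is a foundation model that learns robust 3D CT visual features through self-distillation without language supervision, along with evidence that current CT foundation models work better as feature extractors for lightweight probes than as vision encoders for vision-language models. This suggests a route to efficient transfer learning for clinical tasks: strengthen visual representations first instead of depending on large-scale paired text data.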

Abstract: There is substantial interest in developing artificial intelligence systems to support radiologists across tasks ranging from segmentation to report generation. Existing computed tomography (CT) foundation models have largely focused on building generalist vision-language systems capable of tasks such as question answering and report generation. However, training reliable vision-language systems requires paired image-text data at a scale that remains unavailable in CT. Moreover, adapting the underlying visual representations to downstream tasks typically requires partial or full backbone fine-tuning, a computationally demanding process inaccessible to many research groups. Instead, foundation models should prioritise learning robust visual representations that enable efficient transfer to new tasks with minimal labelled data and without backbone fine-tuning. We present VoxelFM, a 3D CT foundation model trained with self-distillation using the DINO framework, which learns semantically rich features without language supervision. We evaluated VoxelFM across seven categories of clinically relevant downstream tasks using frozen backbone representations with lightweight probes: classification, regression, survival analysis, instance retrieval, localisation, segmentation, and report generation. VoxelFM matched or outperformed four existing CT foundation models across all task categories. Despite receiving no language supervision during pre-training, VoxelFM surpassed models explicitly trained with language-alignment objectives, including on report generation. Our results indicate that current CT foundation models perform significantly better as feature extractors for lightweight probes than as vision encoders for vision-language models. Model weights and training code are publicly available.


[113] OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models cs.CV

Liyu Zhang, Kehan Li, Tingrui Han, Tao Zhao, Yuxuan Sheng

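TL;DR: This paper proposes OP-GRPO, the first off-policy GRPO framework designed for flow-matching models, to address the low sample efficiency that GRPO inherits from its on-policy training paradigm. By actively selecting high-quality trajectories for a replay buffer, applying a sequence-level importance sampling correction to mitigate distribution shift, and truncating late denoising steps to stabilize training, the method matches or exceeds Flow-GRPO on image and video generation benchmarks using only 34.2% of the training steps on average.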

Details

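Motivation: GRPO is highly effective at improving the generation quality of flow-matching models, but its on-policy training paradigm makes it sample-inefficient, which motivates a more efficient off-policy training method.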

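Result: On image and video generation benchmarks, OP-GRPO matches or exceeds Flow-GRPO with only 34.2% of the training steps on average, a substantial gain in training efficiency.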

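Insight: The contributions are: actively selecting high-quality trajectories and adaptively adding them to a replay buffer for reuse; a sequence-level importance sampling correction that keeps policy updates stable while preserving the integrity of GRPO's clipping mechanism; and theoretical and empirical evidence that late denoising steps yield ill-conditioned off-policy ratios, mitigated by truncating trajectories. Together these give an efficient and stable approach to off-policy training of flow-matching models.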

Abstract: Post training via GRPO has demonstrated remarkable effectiveness in improving the generation quality of flow-matching models. However, GRPO suffers from inherently low sample efficiency due to its on-policy training paradigm. To address this limitation, we present OP-GRPO, the first Off-Policy GRPO framework tailored for flow-matching models. First, we actively select high-quality trajectories and adaptively incorporate them into a replay buffer for reuse in subsequent training iterations. Second, to mitigate the distribution shift introduced by off-policy samples, we propose a sequence-level importance sampling correction that preserves the integrity of GRPO’s clipping mechanism while ensuring stable policy updates. Third, we theoretically and empirically show that late denoising steps yield ill-conditioned off-policy ratios, and mitigate this by truncating trajectories at late steps. Across image and video generation benchmarks, OP-GRPO achieves comparable or superior performance to Flow-GRPO with only 34.2% of the training steps on average, yielding substantial gains in training efficiency while maintaining generation quality.
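
The sequence-level correction and late-step truncation described above can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so everything here is an assumption: the ratio is taken as the geometric mean of per-step likelihood ratios over the retained steps, `keep_steps` is a hypothetical truncation point, and `eps` is a PPO-style clip range.

```python
import math

def sequence_level_ratio(logp_new, logp_old, keep_steps):
    """Assumed form: geometric mean of per-step ratios over the first keep_steps steps."""
    new, old = logp_new[:keep_steps], logp_old[:keep_steps]
    if keep_steps <= 0 or not new:
        raise ValueError("need at least one retained denoising step")
    # Late denoising steps (index >= keep_steps) are dropped before the ratio is formed.
    mean_diff = sum(n - o for n, o in zip(new, old)) / len(new)
    return math.exp(mean_diff)

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO/GRPO-style clipped objective applied to one sequence-level ratio."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

# Hypothetical per-step log-probabilities; the last step has drifted far off-policy.
logp_new = [-1.0, -1.0, -5.0]
logp_old = [-1.0, -1.0, -1.0]
ratio = sequence_level_ratio(logp_new, logp_old, keep_steps=2)
objective = clipped_surrogate(ratio, advantage=2.0)
```

In this toy input, truncating at two steps leaves the ratio at 1, whereas including the drifted last step would pull it well below the clip range.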


[114] GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models cs.CV | cs.AI

Yaohan Guan, Pristina Wang, Najim Dehak, Alan Yuille, Jieneng Chen

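TL;DR: This paper introduces the GENFIG1 benchmark, which evaluates how well generative AI models (in particular vision-language models) can produce a “Figure 1” that clearly expresses and motivates a paper's core idea, given the paper's title, abstract, introduction, and figure caption. The benchmark stresses reasoning that couples scientific understanding with visual synthesis, not merely generating attractive images.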

Details

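Motivation: “Figure 1” serves as the visual summary of a paper's core research idea, and designing it takes substantial human effort and iteration, which highlights how hard scientific visual communication is. The authors therefore propose GENFIG1 to challenge AI models to understand scientific concepts and generate effective visual summaries.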

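Result: A suite of representative models is evaluated on GENFIG1, which is curated from papers published at top deep-learning conferences; even the best-performing systems face significant challenges. The authors also introduce an automatic evaluation metric that correlates well with expert human judgments.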

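Insight: The novelty is framing visual-summary generation for scholarly papers as a benchmark task (GENFIG1) that requires deep scientific understanding and visual design reasoning, going beyond conventional text-to-image generation. Viewed objectively, it offers a challenging new direction for multimodal AI, emphasizing a model's combined ability to extract concepts and synthesize visuals in a specialized domain.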

Abstract: In many science papers, “Figure 1” serves as the primary visual summary of the core research idea. These figures are visually simple yet conceptually rich, often requiring significant effort and iteration by human authors to get right, highlighting the difficulty of scientific visual communication. Motivated by this, we introduce GENFIG1, a benchmark for generative AI models (e.g., Vision-Language Models). GENFIG1 evaluates models on their ability to produce figures that clearly express and motivate the central idea of a paper, given its title, abstract, introduction, and figure caption as input. Solving GENFIG1 requires more than producing visually appealing graphics: the task entails reasoning for text-to-image generation that couples scientific understanding with visual synthesis. Specifically, models must (i) comprehend the technical concepts of the paper, (ii) identify the most salient ones, and (iii) design a coherent and aesthetically effective graphic that conveys those concepts visually and is faithful to the input. We curate the benchmark from papers published at top deep-learning conferences, apply stringent quality control, and introduce an automatic evaluation metric that correlates well with expert human judgments. We evaluate a suite of representative models on GENFIG1 and demonstrate that the task presents significant challenges, even for the best-performing systems. We hope this benchmark serves as a foundation for future progress in multimodal AI.


[115] Scale-Aware Vision-Language Adaptation for Extreme Far-Distance Video Person Re-identification cs.CV

Ashwat Rajbhandari, Bharatesh Chakravarthi

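TL;DR: This paper proposes a scale-aware vision-language adaptation method for extreme far-distance video person re-identification. Built on CLIP, it upgrades the visual backbone and adds backbone-aware selective fine-tuning, lightweight temporal attention pooling, and adapter-based, prompt-conditioned cross-view learning, markedly improving robustness to scale compression, resolution degradation, motion blur, and aerial-ground viewpoint mismatch.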

Details

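Motivation: Extreme far-distance video person re-identification suffers from scale compression, resolution degradation, motion blur, and aerial-ground viewpoint mismatch as camera altitude and subject distance increase. The goal is a model that still works reliably under far-distance conditions.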

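Result: On the DetReIDX stress-test benchmark, the method achieves mAP scores of 46.69 (A2G), 41.23 (G2A), and 22.98 (A2A), for an overall mAP of 35.73, showing that large-scale vision-language backbones combined with stability-focused adaptation significantly improve robustness in extreme far-distance video person re-identification.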

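Insight: The contributions are: 1) upgrading the CLIP visual backbone from ViT-B/16 to ViT-L/14 and introducing backbone-aware selective fine-tuning to stabilize adaptation of the larger transformer; 2) a lightweight temporal attention pooling mechanism that suppresses degraded frames and emphasizes informative observations; 3) retaining adapter-based and prompt-conditioned cross-view learning to mitigate aerial-ground domain shift; 4) refining retrieval with improved optimization and k-reciprocal re-ranking. Viewed objectively, the core of the method is transferring the strong representations of a large pretrained vision-language model through stabilization, selective tuning, and temporal modeling aimed at the specific challenges of far-distance video.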

Abstract: Extreme far-distance video person re-identification (ReID) is particularly challenging due to scale compression, resolution degradation, motion blur, and aerial-ground viewpoint mismatch. As camera altitude and subject distance increase, models trained on close-range imagery degrade significantly. In this work, we investigate how large-scale vision-language models can be adapted to operate reliably under these conditions. Starting from a CLIP-based baseline, we upgrade the visual backbone from ViT-B/16 to ViT-L/14 and introduce backbone-aware selective fine-tuning to stabilize adaptation of the larger transformer. To address noisy and low-resolution tracklets, we incorporate a lightweight temporal attention pooling mechanism that suppresses degraded frames and emphasizes informative observations. We retain adapter-based and prompt-conditioned cross-view learning to mitigate aerial-ground domain shifts, and further refine retrieval using improved optimization and k-reciprocal re-ranking. Experiments on the DetReIDX stress-test benchmark show that our approach achieves mAP scores of 46.69 (A2G), 41.23 (G2A), and 22.98 (A2A), corresponding to an overall mAP of 35.73. These results show that large-scale vision-language backbones, when combined with stability-focused adaptation, significantly enhance robustness in extreme far-distance video person ReID.
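
The temporal attention pooling idea (down-weighting degraded frames when aggregating a tracklet) can be sketched in plain Python. This is a generic softmax-attention pooling under stated assumptions, not the paper's module: `query` stands in for a hypothetical learned scoring vector, and the frame features are toy values.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention_pool(frames, query):
    """Pool per-frame feature vectors into one tracklet vector.

    frames: list of equal-length feature vectors, one per frame.
    query:  hypothetical learned vector; a frame's score is its dot product with it.
    """
    scores = [sum(f_i * q_i for f_i, q_i in zip(f, query)) for f in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights

# Toy tracklet: the first frame aligns with the query, the second does not.
pooled, weights = temporal_attention_pool([[1.0, 0.0], [0.0, 1.0]], [10.0, 0.0])
```

Frames that score low against the query contribute little to the pooled descriptor, which is the suppression behavior the abstract describes.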


[116] AURA: Always-On Understanding and Real-Time Assistance via Video Streams cs.CV

Xudong Lu, Yang Bo, Jinpeng Chen, Shuhan Li, Xintong Guo

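TL;DR: AURA is an end-to-end streaming visual interaction framework that lets a video large language model continuously process video streams and support both real-time question answering and proactive responses. It integrates context management, data construction, training objectives, and deployment optimization to achieve stable long-horizon streaming interaction.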

Details

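Motivation: Most existing video large language models are offline systems, ill-suited to live video streams that require continuous observation and timely response. Existing streaming methods typically rely on decoupled trigger-response pipelines or are limited to captioning-style narration, which limits their effectiveness for open-ended question answering and long-horizon interaction.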

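Result: AURA achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with automatic speech recognition and text-to-speech, running at 2 FPS on two 80G accelerators.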

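Insight: The novelty is a unified end-to-end streaming visual interaction framework that integrates context management, data construction, training, and deployment optimization to support continuous video-stream processing and proactive responses, overcoming the limits of existing methods on open-ended question answering and long-horizon interaction.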

Abstract: Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, reducing their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.


[117] Graphic-Design-Bench: A Comprehensive Benchmark for Evaluating AI on Graphic Design Tasks cs.CV | cs.AI | cs.LG

Adrienne Deganutti, Elad Hirsch, Haonan Zhu, Jaejung Seol, Purvanshi Mehta

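TL;DR: This paper introduces GraphicDesignBench (GDB), the first comprehensive benchmark suite designed specifically to evaluate AI models on professional graphic design tasks. GDB comprises 50 tasks along five axes (layout, typography, infographics, template and design semantics, and animation), grounded in real design templates and evaluated under both understanding and generation settings. The evaluation finds that current models still fall clearly short on core design challenges such as spatial reasoning, vector code generation, fine-grained typographic perception, and temporal decomposition of animations.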

Details

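Motivation: Existing benchmarks focus on natural-image understanding or generic text-to-image synthesis and lack evaluation tools for the distinctive challenges of professional graphic design, such as structured layout, typographic fidelity, layered composition, and vector graphics generation. A comprehensive benchmark is therefore needed to advance AI in professional design.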

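Result: Frontier closed-source models are evaluated on GDB with a standardized set of metrics covering spatial accuracy, perceptual quality, text fidelity, semantic alignment, and structural validity. Current models perform poorly on spatial reasoning over complex layouts, faithful vector code generation, fine-grained typographic perception, and temporal decomposition of animations, leaving a significant gap to professional design requirements.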

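Insight: GDB's novelty is being the first comprehensive benchmark to cover the full breadth of professional graphic design, with an emphasis on structured, precise, and compositional evaluation. It exposes the limitations of AI on design tasks that demand high-precision, structured output and provides a reproducible testbed for developing AI systems that can act as design collaborators.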

Abstract: We introduce GraphicDesignBench (GDB), the first comprehensive benchmark suite designed specifically to evaluate AI models on the full breadth of professional graphic design tasks. Unlike existing benchmarks that focus on natural-image understanding or generic text-to-image synthesis, GDB targets the unique challenges of professional design work: translating communicative intent into structured layouts, rendering typographically faithful text, manipulating layered compositions, producing valid vector graphics, and reasoning about animation. The suite comprises 50 tasks organized along five axes: layout, typography, infographics, template & design semantics and animation, each evaluated under both understanding and generation settings, and grounded in real-world design templates drawn from the LICA layered-composition dataset. We evaluate a set of frontier closed-source models using a standardized metric taxonomy covering spatial accuracy, perceptual quality, text fidelity, semantic alignment, and structural validity. Our results reveal that current models fall short on the core challenges of professional design: spatial reasoning over complex layouts, faithful vector code generation, fine-grained typographic perception, and temporal decomposition of animations remain largely unsolved. While high-level semantic understanding is within reach, the gap widens sharply as tasks demand precision, structure, and compositional awareness. GDB provides a rigorous, reproducible testbed for tracking progress toward AI systems that can function as capable design collaborators. The full evaluation framework is publicly available.


[118] DriveVA: Video Action Models are Zero-Shot Drivers cs.CV | cs.RO

Mengmeng Liu, Diankun Zhang, Jiuming Liu, Jianfeng Cui, Hongwei Xie

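TL;DR: This paper proposes DriveVA, a novel autonomous driving world model that jointly decodes future visual forecasts and action sequences in a shared latent generative process. It inherits priors on motion dynamics and physical plausibility from pretrained large-scale video generation models to capture continuous spatiotemporal evolution and causal interaction patterns. DriveVA achieves a closed-loop PDM score of 90.9 on NAVSIM and markedly reduces L2 error and collision rate on nuScenes and on Bench2drive (built on CARLA v2), demonstrating strong zero-shot capability and cross-domain generalization.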

Details

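Motivation: Existing world-model-based planning methods generalize poorly across datasets and sensor configurations, and their loosely coupled planning paradigm leads to poor consistency between visual imagination and trajectories.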

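Result: DriveVA achieves a closed-loop PDM score of 90.9 on NAVSIM. Compared with the SOTA world-model-based planner, it reduces average L2 error and collision rate by 78.9% and 83.3% on nuScenes, and by 52.5% and 52.4% on Bench2drive, which is built on CARLA v2.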

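Insight: The core novelty is a shared latent generative process in which a DiT-based decoder jointly predicts future action sequences and videos, tightly aligning planning with scene evolution. A video continuation strategy further strengthens consistency over long-duration rollouts. The design draws on the priors of pretrained video generation models to improve generalization.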

Abstract: Generalization is a central challenge in autonomous driving, as real-world deployment requires robust performance under unseen scenarios, sensor domains, and environmental conditions. Recent world-model-based planning methods have shown strong capabilities in scene understanding and multi-modal future prediction, yet their generalization across datasets and sensor configurations remains limited. In addition, their loosely coupled planning paradigm often leads to poor video-trajectory consistency during visual imagination. To overcome these limitations, we propose DriveVA, a novel autonomous driving world model that jointly decodes future visual forecasts and action sequences in a shared latent generative process. DriveVA inherits rich priors on motion dynamics and physical plausibility from well-pretrained large-scale video generation models to capture continuous spatiotemporal evolution and causal interaction patterns. To this end, DriveVA employs a DiT-based decoder to jointly predict future action sequences (trajectories) and videos, enabling tighter alignment between planning and scene evolution. We also introduce a video continuation strategy to strengthen long-duration rollout consistency. DriveVA achieves an impressive closed-loop PDM score of 90.9 on the challenging NAVSIM benchmark. Extensive experiments also demonstrate the zero-shot capability and cross-domain generalization of DriveVA, which, compared with the state-of-the-art world-model-based planner, reduces average L2 error and collision rate by 78.9% and 83.3% on nuScenes and by 52.5% and 52.4% on Bench2drive, which is built on CARLA v2.
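
To make the reported percentages concrete, the sketch below shows how an average L2 trajectory error and a relative reduction are typically computed. The abstract reports only the reductions, so the baseline and DriveVA error values used here are hypothetical placeholders, and the function names are illustrative rather than taken from any benchmark toolkit.

```python
import math

def average_l2(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth 2D waypoints."""
    dists = [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(dists) / len(dists)

def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# Toy trajectory: one waypoint is exact, one is off by a 3-4-5 triangle.
err = average_l2([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])

# Hypothetical errors chosen only to illustrate a 78.9% reduction.
reduction = relative_reduction(baseline=1.0, ours=0.211)
```

A relative reduction says nothing about the absolute error level, so it is best read alongside the baseline's own numbers.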


[119] A Persistent Homology Design Space for 3D Point Cloud Deep Learning cs.CV | cs.AI

Prachi Kudeshia, Jiju Poovvancheri, Amr Ghoneim, Dong Chen

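TL;DR: This paper proposes a persistent homology design space for deep learning on 3D point clouds (3DPHDL), systematically formalizing the interplay between complex construction, filtration strategy, persistence representation, neural backbone, and prediction task. Persistent homology is integrated as a structural inductive bias at six key injection points (sampling, neighborhood graphs, optimization dynamics, self-supervision, output calibration, and network regularization), and an empirical study covers the PointNet, DGCNN, and Point Transformer backbones on ModelNet40 classification and ShapeNetPart segmentation.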

Details

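Motivation: Despite strong theoretical guarantees and growing empirical adoption, the integration of persistent homology (PH) into point cloud deep learning remains largely ad hoc and architecturally peripheral. This paper addresses that by providing a unified design framework for systematically incorporating topological reasoning into 3D point cloud learning.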

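Result: Experiments on the ModelNet40 classification and ShapeNetPart segmentation benchmarks show consistent improvements in topology-sensitive discrimination and part consistency, while revealing trade-offs between representational expressiveness and combinatorial complexity. The study augments representative backbones with persistence diagrams, images, and landscapes and analyzes the effect on accuracy, robustness to noise and sampling variation, and computational scalability.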

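Insight: The core novelty is treating persistent homology not merely as an auxiliary feature but as a structured component of the learning pipeline, and systematically identifying six injection points where topology acts as a structural inductive bias. This yields an extensible, principled design framework for integrating topological priors deeply into 3D point cloud architectures.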

Abstract: Persistent Homology (PH) offers stable, multi-scale descriptors of intrinsic shape structure by capturing connected components, loops, and voids that persist across scales, providing invariants that complement purely geometric representations of 3D data. Yet, despite strong theoretical guarantees and increasing empirical adoption, its integration into deep learning for point clouds remains largely ad hoc and architecturally peripheral. In this work, we introduce a unified design space for Persistent-Homology driven learning in 3D point clouds (3DPHDL), formalizing the interplay between complex construction, filtration strategy, persistence representation, neural backbone, and prediction task. Beyond the canonical pipeline of diagram computation and vectorization, we identify six principled injection points through which topology can act as a structural inductive bias reshaping sampling, neighborhood graphs, optimization dynamics, self-supervision, output calibration, and even internal network regularization. We instantiate this framework through a controlled empirical study on ModelNet40 classification and ShapeNetPart segmentation, systematically augmenting representative backbones (PointNet, DGCNN, and Point Transformer) with persistence diagrams, images, and landscapes, and analyzing their impact on accuracy, robustness to noise and sampling variation, and computational scalability. Our results demonstrate consistent improvements in topology-sensitive discrimination and part consistency, while revealing meaningful trade-offs between representational expressiveness and combinatorial complexity. By viewing persistent homology not merely as an auxiliary feature but as a structured component within the learning pipeline, this work provides a systematic framework for incorporating topological reasoning into 3D point cloud learning.
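
As a concrete instance of the "connected components that persist across scales" mentioned above, the following sketch computes 0-dimensional persistence pairs of a Vietoris-Rips filtration with a union-find structure. It is a textbook construction written for illustration, not code from the paper, and it assumes the convention that an edge enters the filtration at its full Euclidean length.

```python
import math
from itertools import combinations

def h0_persistence(points):
    """Return (birth, death) pairs for connected components of a point cloud.

    Every point is born at scale 0; a component dies at the length of the
    edge that first merges it into another component.
    """
    n = len(points)
    if n == 0:
        return []
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Process all pairwise edges in order of increasing length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    pairs = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            pairs.append((0.0, length))  # one component dies at this merge
    pairs.append((0.0, math.inf))  # the final component never dies
    return pairs

# Three collinear points at 0, 1, and 3.
diagram = h0_persistence([(0.0,), (1.0,), (3.0,)])
```

The resulting pairs form the H0 persistence diagram, which is the kind of object that is then vectorized (for example as persistence images or landscapes) before being fed to a neural backbone.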


[120] HighFM: Towards a Foundation Model for Learning Representations from High-Frequency Earth Observation Data cs.CV | cs.AI

Stella Girtsou, Konstantinos Alexis, Giorgos Giannopoulos, Harris Kontoes

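TL;DR: This paper presents HighFM, a foundation model for high-temporal-resolution, multispectral Earth observation data. Trained on over 2 TB of SEVIRI meteorological satellite imagery, it adapts the SatMAE masked autoencoding framework to learn robust spatiotemporal representations and adds fine-grained temporal encodings to better capture short-term variability. After fine-tuning on cloud masking and active fire detection, the model outperforms traditional baselines and recent geospatial foundation models on both balanced accuracy and IoU.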

Details

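Motivation: Most existing foundation models rely on high-resolution satellite imagery with low revisit rates, which makes them ill-suited to fast-evolving natural phenomena and emergency response. The paper aims to develop a foundation model for high-temporal-resolution Earth observation data to support real-time monitoring and disaster response.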

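Result: Vision Transformers pretrained on SEVIRI data outperform traditional baselines and recent geospatial foundation models on cloud masking and active fire detection, with consistent gains in both balanced accuracy and IoU.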

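Insight: The contributions include adapting the masked autoencoding framework to high-temporal-resolution multispectral data and introducing fine-grained temporal encodings to capture short-term dynamics. Viewed objectively, the study is an early systematic exploration of the potential of temporally dense geostationary data for building foundation models, offering a scalable path toward real-time disaster detection and tracking.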

Abstract: The increasing frequency and severity of climate-related disasters have intensified the need for real-time monitoring, early warning, and informed decision-making. Earth Observation (EO), powered by satellite data and Machine Learning (ML), offers powerful tools to meet these challenges. Foundation Models (FMs) have revolutionized EO ML by enabling general-purpose pretraining on large-scale remote sensing datasets. However, most existing models rely on high-resolution satellite imagery with low revisit rates, limiting their suitability for fast-evolving phenomena and time-critical emergency response. In this work, we present HighFM, a first-cut approach towards an FM for high-temporal-resolution, multispectral EO data. Leveraging over 2 TB of SEVIRI imagery from the Meteosat Second Generation (MSG) platform, we adapt the SatMAE masked autoencoding framework to learn robust spatiotemporal representations. To support real-time monitoring, we enhance the original architecture with fine-grained temporal encodings to capture short-term variability. The pretrained models are then fine-tuned on cloud masking and active fire detection tasks. We benchmark our SEVIRI-pretrained Vision Transformers against traditional baselines and recent geospatial FMs, demonstrating consistent gains across both balanced accuracy and IoU metrics. Our results highlight the potential of temporally dense geostationary data for real-time EO, offering a scalable path toward foundation models for disaster detection and tracking.
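
The two reported metrics have standard definitions for binary masks, sketched below in plain Python. The flat 0/1 label lists are an illustrative simplification of pixel masks, and the code assumes both classes occur in `y_true` and that the union is non-empty; it is not the paper's evaluation code.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def balanced_accuracy(y_true, y_pred):
    """Mean of recall on the positive class and recall on the negative class."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def iou(y_true, y_pred):
    """Intersection over union of the positive class."""
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fp + fn)

# Toy masks: one of two fire pixels is detected, with no false alarms.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
ba = balanced_accuracy(y_true, y_pred)
overlap = iou(y_true, y_pred)
```

Balanced accuracy is a natural choice for tasks such as active fire detection, where positive pixels are rare and plain accuracy would be dominated by the background class.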


[121] GA-GS: Generation-Assisted Gaussian Splatting for Static Scene Reconstruction cs.CV | cs.AI

Yedong Shen, Shiqi Zhang, Sha Zhang, Yifan Duan, Xinran Zhang

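TL;DR: This paper proposes GA-GS, a generation-assisted 3D Gaussian Splatting method for reconstructing static scenes from monocular video containing dynamic objects. A motion-aware module segments and removes dynamic regions, a diffusion model inpaints the occluded areas to provide pseudo-ground-truth supervision, and a learnable authenticity scalar balances the contributions of real background and generated regions, yielding more complete static scene reconstruction under dynamic occlusion.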

Details

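Motivation: Existing methods rely mainly on the background for static scene reconstruction and struggle to recover regions occluded by dynamic objects, which limits their usefulness in applications such as virtual reality and autonomous driving.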

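Result: Extensive experiments on DAVIS and the authors' Trajectory-Match dataset show that GA-GS achieves state-of-the-art performance in static scene reconstruction, especially in challenging scenes with large-scale, persistent occlusions.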

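Insight: The novelty is using a generative (diffusion) model to provide pseudo-supervision for occluded regions and introducing a learnable authenticity scalar for authenticity-aware rendering and supervision, a new approach to 3D reconstruction under dynamic occlusion.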

Abstract: Reconstructing a static 3D scene from monocular video with dynamic objects is important for numerous applications such as virtual reality and autonomous driving. Current approaches typically rely on the background for static scene reconstruction, limiting the ability to recover regions occluded by dynamic objects. In this paper, we propose GA-GS, a Generation-Assisted Gaussian Splatting method for Static Scene Reconstruction. The key innovation of our work lies in leveraging generation to assist in reconstructing occluded regions. We employ a motion-aware module to segment and remove dynamic regions, and then use a diffusion model to inpaint the occluded areas, providing pseudo-ground-truth supervision. To balance contributions from real background and generated regions, we introduce a learnable authenticity scalar for each Gaussian primitive, which dynamically modulates opacity during splatting for authenticity-aware rendering and supervision. Since no existing dataset provides the ground-truth static scene for videos with dynamic objects, we construct a dataset named Trajectory-Match, using a fixed-path robot to record each scene with and without dynamic objects, enabling quantitative evaluation of the reconstruction of occluded regions. Extensive experiments on both DAVIS and our dataset show that GA-GS achieves state-of-the-art performance in static scene reconstruction, especially in challenging scenarios with large-scale, persistent occlusions.


[122] Spatially-Weighted CLIP for Street-View Geo-localization cs.CV

Ting Han, Fengjiao Li, Chunsong Chen, Haoling Huang, Yiping Chen

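TL;DR: This paper proposes Spatially-Weighted CLIP (SW-CLIP), a new framework for street-view geo-localization. It explicitly incorporates spatial autocorrelation into vision-language contrastive learning, using Tobler's First Law of Geography to build distance-aware soft supervision and thereby overcoming the limitation of conventional CLIP methods, which treat all non-matching samples as equally negative.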

Details

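Motivation: Conventional CLIP-based geo-localization methods ignore geographic spatial relationships (spatial autocorrelation). The paper aims to shift from semantic alignment to geographic alignment to improve localization robustness and accuracy.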

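Result: Experiments on a multi-city dataset show that SW-CLIP significantly improves geo-localization accuracy, reduces long-tail errors, and enhances the spatial coherence of the embedding space compared with standard CLIP.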

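Insight: The novelty is a location-as-text representation, the replacement of one-hot InfoNCE targets with spatially weighted soft labels derived from geodesic distance, and a neighborhood-consistency regularization that preserves local spatial structure. This provides a general paradigm for integrating spatial principles into multimodal representation learning.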

Abstract: This paper proposes Spatially-Weighted CLIP (SW-CLIP), a novel framework for street-view geo-localization that explicitly incorporates spatial autocorrelation into vision-language contrastive learning. Unlike conventional CLIP-based methods that treat all non-matching samples as equally negative, SW-CLIP leverages Tobler’s First Law of Geography to model geographic relationships through distance-aware soft supervision. Specifically, we introduce a location-as-text representation to encode geographic positions and replace one-hot InfoNCE targets with spatially weighted soft labels derived from geodesic distance. Additionally, a neighborhood-consistency regularization is employed to preserve local spatial structure in the embedding space. Experiments on a multi-city dataset demonstrate that SW-CLIP significantly improves geo-localization accuracy, reduces long-tail errors, and enhances spatial coherence compared to standard CLIP. The results highlight the importance of shifting from semantic alignment to geographic alignment for robust geo-localization and provide a general paradigm for integrating spatial principles into multimodal representation learning.
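
A minimal sketch of distance-aware soft labels follows. The abstract does not specify the weighting kernel, so the exponential decay and the length scale `tau_km` are assumptions made for illustration; the haversine formula is used here as a spherical approximation of geodesic distance.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))

def spatial_soft_labels(anchor, candidates, tau_km=50.0):
    """Soft target distribution over candidates; nearer locations get more mass.

    The exponential kernel and tau_km are hypothetical choices for this sketch.
    """
    weights = [math.exp(-haversine_km(anchor, c) / tau_km) for c in candidates]
    total = sum(weights)
    return [w / total for w in weights]

# Anchor image location and three candidate locations in the batch.
labels = spatial_soft_labels((0.0, 0.0), [(0.0, 0.0), (0.0, 1.0), (0.0, 90.0)])
```

With a one-hot target only the exact match would receive weight; here a candidate about one degree away still receives a small share, which is the distance-aware behavior Tobler's law motivates.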


[123] Integer-Only Operations on Extreme Learning Machine Test Time Classification cs.CV | cs.AI | cs.LG

Emerson Lopes Machado, Cristiano Jacques Miosso, Ricardo Pezzuol Jacobi

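TL;DR: This paper presents a new set of techniques for reducing the test-time computational cost of network classifiers based on the extreme learning machine (ELM). Through theoretical analysis and empirical evaluation, it shows that test-time classification can be performed using only integer operations without compromising classification accuracy.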

Details

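Motivation: The goal is to reduce the test-time computational cost and power consumption of ELM classifiers in settings where power is limited or expensive, such as embedded applications and data centers.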

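Result: Tests on 5 commonly used computer vision datasets indicate that the techniques can reduce the computational cost of test-time classification on FPGAs.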

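Insight: The contributions are: empirical evidence that input weights can be drawn from a ternary set with limited loss of accuracy (which eliminates multiplications); a proof that normalized and non-normalized test signals yield the same classification accuracy; and a way to create an integer version of the output weights. These results give a basis for hardware-efficient implementations.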

Abstract: We present a theoretical analysis and empirical evaluations of a novel set of techniques for reducing the computational cost of test-time operations of network classifiers based on the extreme learning machine (ELM). By exploring some characteristics we derived from these models, we show that classification at test time can be performed using solely integer operations without compromising classification accuracy. Our contributions are as follows: (i) We show empirical evidence that the input weight values can be drawn from the ternary set with limited reduction of the classification accuracy. This has the computational advantage of dismissing multiplications; (ii) We prove that the classification accuracy of normalized and non-normalized test signals is the same; (iii) We show how to create an integer version of the output weights that results in a limited reduction of the classification accuracy. We tested our techniques on 5 computer vision datasets commonly used in the literature, and the results indicate that our techniques can reduce the computational cost of the operations necessary for classification at test time in FPGAs. This is important in embedded applications, where power consumption is limited, and crucial in data centers of large corporations, where power consumption is expensive.
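
Contribution (i) can be illustrated with a short sketch: when every input weight is -1, 0, or +1, a dot product needs only additions and subtractions. The sketch assumes the ternary set is {-1, 0, 1} (the abstract does not spell it out) and uses integer inputs such as raw pixel values; the function names are illustrative, not the authors'.

```python
def ternary_dot(x, w):
    """Dot product of integer inputs with ternary weights, without multiplication."""
    acc = 0
    for xi, wi in zip(x, w):
        if wi == 1:
            acc += xi
        elif wi == -1:
            acc -= xi
        elif wi != 0:
            raise ValueError("weights must be -1, 0, or 1")
    return acc

def hidden_preactivations(x, weight_rows):
    """Pre-activations of an ELM hidden layer whose input weights are ternary."""
    return [ternary_dot(x, row) for row in weight_rows]

# Integer input vector (e.g. pixel intensities) and two hidden neurons.
x = [3, 5, 7]
h = hidden_preactivations(x, [[1, 1, 1], [0, -1, 0]])
```

Because the accumulator stays an integer throughout, this part of the forward pass maps directly onto adders in hardware such as FPGAs, with no multiplier units required.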


[124] Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning cs.CV

Songyuan Yang, Weijiang Yu, Ziyu Liu, Guijian Tang, Wenjing Yang

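TL;DR: This paper proposes Graph-to-Frame RAG (G2F-RAG), a training-free and auditable paradigm for video reasoning. It fuses external knowledge with the video in the form of a visual frame, addressing the diluted attention and high cognitive load that existing retrieval-augmented methods incur when fusing heterogeneous signals.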

Details

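Motivation: When existing video reasoning systems based on large multimodal models (LMMs) bring in external knowledge, they usually append textual or multi-clip evidence directly into the attention space, which dilutes attention and increases cognitive load. The core bottleneck is how to represent external knowledge and fuse it with the video backbone.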

Result: The method yields consistent gains on multiple public benchmarks, with larger improvements in knowledge-intensive settings.

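Insight: The novelty is building external knowledge offline as a problem-agnostic video knowledge graph; online, a hierarchical multi-agent controller retrieves a minimal sufficient subgraph and renders it as a single reasoning frame, so that joint reasoning happens in a unified visual domain. This reduces cognitive load and leaves an auditable evidence trail. The method is training-free and plug-and-play, and underlines the importance of how knowledge is represented and delivered.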

Abstract: When video reasoning requires external knowledge, many systems with large multimodal models (LMMs) adopt retrieval augmentation to supply the missing context. Appending textual or multi-clip evidence, however, forces heterogeneous signals into a single attention space. We observe diluted attention and higher cognitive load even on non-long videos. The bottleneck is not only what to retrieve but how to represent and fuse external knowledge with the video backbone. We present Graph-to-Frame RAG (G2F-RAG), a training-free and auditable paradigm that delivers knowledge in the visual space. In the offline stage, an agent builds a problem-agnostic video knowledge graph that integrates entities, events, spatial relations, and linked world knowledge. In the online stage, a hierarchical multi-agent controller decides whether external knowledge is needed, retrieves a minimal sufficient subgraph, and renders it as a single reasoning frame appended to the video. LMMs then perform joint reasoning in a unified visual domain. This design reduces cognitive load and leaves an explicit, inspectable evidence trail. G2F-RAG is plug-and-play across backbones and scales. It yields consistent gains on diverse public benchmarks, with larger improvements in knowledge-intensive settings. Ablations further confirm that knowledge representation and delivery matter. G2F-RAG reframes retrieval as visual-space knowledge fusion for robust and interpretable video reasoning.


[125] Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning cs.CV

Songyuan Yang, Weijiang Yu, Jilin Ma, Ziyu Liu, Guijian Tang

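TL;DR: This paper proposes RLER (Reinforce to Learn, Elect to Reason), a dual paradigm for improving the reliability and interpretability of video reasoning. It decouples learning to produce evidence from obtaining a reliable answer. During training, the policy is optimized with group-relative reinforcement learning and three novel task-driven rewards that teach the model to emit structured, machine-checkable evidence. During inference, a training-free orchestrator generates diverse candidate answers and holds a weighted election based on evidence consistency, confidence, transparency, and non-redundancy, achieving more reliable reasoning without enlarging the model.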

Details

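Motivation: Existing large multimodal models (LMMs) typically perform video reasoning in a single pass that returns an answer directly, without verifying whether the reasoning is aligned with the evidence, which leaves reliability and interpretability lacking.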

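Result: On 8 representative benchmarks, RLER outperforms a range of open-source and RL-based LMMs and achieves SOTA performance, with an average improvement of 6.3% over base models while using only 3.1 candidates per question on average, a favorable balance between compute and quality.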

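Insight: The novelty is the dual-paradigm design that decouples evidence generation from answer election, together with three task-driven rewards (frame-sensitive, think-transparency, and anti-repetition). Viewed objectively, the core insight is that explicitly learning to generate evidence during training and electing by evidence during inference is an effective path to trustworthy video reasoning systems, with no increase in model parameters.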

Abstract: Video reasoning has advanced with large multimodal models (LMMs), yet their inference is often a single pass that returns an answer without verifying whether the reasoning is evidence-aligned. We introduce Reinforce to Learn, Elect to Reason (RLER), a dual paradigm that decouples learning to produce evidence from obtaining a reliable answer. In RLER-Training, we optimize the policy with group-relative reinforcement learning (RL) and 3 novel task-driven rewards: Frame-sensitive reward grounds reasoning on explicit key frames, Think-transparency reward shapes readable and parsable reasoning traces, and Anti-repetition reward boosts information density. These signals teach the model to emit structured, machine-checkable evidence and potentiate reasoning capabilities. In RLER-Inference, we apply a train-free orchestrator that generates a small set of diverse candidates, parses their answers and cited frames, scores them by evidence consistency, confidence, transparency, and non-redundancy, and then performs a robust evidence-weighted election. This closes the loop between producing and using evidence, improving reliability and interpretability without enlarging the model. We comprehensively evaluate RLER against various open-source and RL-based LMMs on 8 representative benchmarks. RLER achieves state of the art across all benchmarks and delivers an average improvement of 6.3% over base models, while using on average 3.1 candidates per question, indicating a favorable balance between compute and quality. The results support a simple thesis: making evidence explicit during learning and electing by evidence during inference is a robust path to trustworthy video reasoning.
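
The inference-time election can be pictured as weighted voting over candidate answers. The sketch below is an assumption-laden illustration: the abstract names the four scoring criteria but not how they are combined, so the equal weights in `candidate_score` and the sum-per-answer rule in `elect` are hypothetical choices.

```python
from collections import defaultdict

def candidate_score(consistency, confidence, transparency, non_redundancy,
                    weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine the four criteria into one score (equal weights are an assumption)."""
    parts = (consistency, confidence, transparency, non_redundancy)
    return sum(w * p for w, p in zip(weights, parts))

def elect(candidates):
    """Pick the answer with the largest total score.

    candidates: list of (answer, score) pairs, one per generated candidate.
    """
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals.items(), key=lambda kv: kv[1])[0]

# Three hypothetical candidates: one strong vote for "A", two moderate votes for "B".
winner = elect([("A", 0.9), ("B", 0.5), ("B", 0.6)])
```

In this toy case "B" wins because its two moderately scored candidates outweigh the single strong candidate for "A", which shows how agreement among candidates and evidence quality interact in a weighted election.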


[126] UENR-600K: A Large-Scale Physically Grounded Dataset for Nighttime Video Deraining cs.CV

Pei Yang, Hai Ci, Beibei Lin, Yiren Song, Mike Zheng Shou

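TL;DR: The paper presents UENR-600K, a large-scale, physically simulated dataset for nighttime video deraining containing 600,000 pairs of 1080p frames. By simulating rain as 3D particles in Unreal Engine, the dataset captures physical properties such as the colors and local illumination produced when nighttime raindrops interact with artificial light sources. On this dataset, the authors establish a new SOTA baseline by adapting the Wan 2.2 video generation model and treating deraining as a video-to-video generation task, substantially narrowing the sim-to-real gap.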

Details

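Motivation: Nighttime video deraining poses a distinctive challenge: raindrops interact with artificial light sources, producing color and local illumination effects. Existing small-scale synthetic datasets rely on 2D rain overlays and cannot capture these physical properties, so models generalize poorly to real nighttime rain, while capturing real paired nighttime videos is impractical.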

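Result: Models trained on UENR-600K generalize significantly better to real-world videos. The baseline built by adapting Wan 2.2 reaches a new state of the art (SOTA) in extensive benchmarking and almost entirely bridges the sim-to-real gap.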

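Insight: The main novelty is using Unreal Engine for 3D physical simulation to generate a large-scale, high-fidelity nighttime rain dataset that accurately captures details such as color refraction, scene occlusion, and rain curtains. Viewed objectively, reframing deraining as a video-to-video generation problem that exploits strong generative priors is the key to making effective use of high-quality synthetic data and improving real-world performance.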

Abstract: Nighttime video deraining is uniquely challenging because raindrops interact with artificial lighting. Unlike daytime white rain, nighttime rain takes on various colors and appears locally illuminated. Existing small-scale synthetic datasets rely on 2D rain overlays and fail to capture these physical properties, causing models to generalize poorly to real-world night rain. Meanwhile, capturing real paired nighttime videos remains impractical because rain effects cannot be isolated from other degradations like sensor noise. To bridge this gap, we introduce UENR-600K, a large-scale, physically grounded dataset containing 600,000 1080p frame pairs. We utilize Unreal Engine to simulate rain as 3D particles within virtual environments. This approach guarantees photorealism and physically real raindrops, capturing correct details like color refractions, scene occlusions, and rain curtains. Leveraging this high-quality data, we establish a new state-of-the-art baseline by adapting the Wan 2.2 video generation model. Our baseline treats deraining as a video-to-video generation task, exploiting strong generative priors to almost entirely bridge the sim-to-real gap. Extensive benchmarking demonstrates that models trained on our dataset generalize significantly better to real-world videos. Project page: https://showlab.github.io/UENR-600K/.


[127] Vero: An Open RL Recipe for General Visual Reasoning cs.CV | cs.AI | cs.CL

Gabriel Sarch, Linrong Cai, Qunzhong Wang, Haoyang Wu, Danqi Chen

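TL;DR: Vero is a family of fully open vision-language models (VLMs) that aims for broad visual reasoning through an open reinforcement learning (RL) recipe. The authors build Vero-600K, a 600K-sample dataset drawn from 59 datasets, and design task-routed rewards to handle heterogeneous answer formats. Vero achieves state-of-the-art performance on VeroEval, a suite of 30 challenging benchmarks, improving over its base models by 3.7-5.5 points on average and surpassing several existing open models, in some respects even models trained with proprietary thinking data.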

Details

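Motivation: The strongest current vision-language models (VLMs) show that broad visual reasoning is within reach, but how they are built, especially the reinforcement learning pipelines and data, is locked behind proprietary, non-public pipelines. This paper aims to provide a fully open RL recipe for a general visual reasoner that works across charts, science, spatial understanding, and open-ended tasks.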

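Result: Vero achieves state-of-the-art (SOTA) performance on the authors' 30-benchmark suite, VeroEval. Starting from Qwen3-VL-8B-Instruct, Vero outperforms Qwen3-VL-8B-Thinking on 23 of the 30 benchmarks without additional proprietary thinking data. Vero improves over four base models by 3.7-5.5 points on average. When training from the same base model, Vero-600K also exceeds existing RL datasets across task categories.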

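Insight: The claimed contributions are: 1) a fully open, reproducible RL recipe for vision-language models; 2) Vero-600K, a large-scale RL dataset with broad coverage; 3) task-routed rewards that handle the heterogeneous answer formats of different tasks. Viewed objectively, the core insight comes from systematic ablations: different task categories elicit qualitatively distinct reasoning patterns that transfer poorly in isolation, so broad data coverage is the primary driver of effective RL scaling. This underlines the importance of diverse, large-scale datasets for general visual reasoning.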

Abstract: What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) show such broad visual reasoning is within reach, but the recipe behind them remains unclear, locked behind proprietary reinforcement learning (RL) pipelines with non-public data. We introduce Vero, a family of fully open VLMs that matches or exceeds existing open-weight models across diverse visual reasoning tasks. We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset from 59 datasets, and designing task-routed rewards that handle heterogeneous answer formats. Vero achieves state-of-the-art performance, improving over four base models by 3.7-5.5 points on average across VeroEval, our suite of 30 challenging benchmarks. Starting from Qwen3-VL-8B-Instruct, Vero outperforms Qwen3-VL-8B-Thinking on 23 of 30 benchmarks without additional proprietary thinking data. When trained from the same base model, Vero-600K exceeds existing RL datasets across task categories. Systematic ablations reveal that different task categories elicit qualitatively distinct reasoning patterns that transfer poorly in isolation, suggesting that broad data coverage is the primary driver of strong RL scaling. All data, code, and models are released.
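
One way to picture "task-routed rewards that handle heterogeneous answer formats" is a dispatch table that sends each sample to a checker suited to its answer type. The task-type names, the three checkers, and the tolerance below are hypothetical; the abstract does not describe Vero's actual reward functions.

```python
def exact_match(pred, gold):
    """Case-insensitive string match for free-text answers."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

def numeric_match(pred, gold, rel_tol=1e-3):
    """Match numbers up to a relative tolerance; unparsable answers score 0."""
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        return 0.0
    return 1.0 if abs(p - g) <= rel_tol * max(1.0, abs(g)) else 0.0

def choice_match(pred, gold):
    """Match multiple-choice letters, ignoring case, parentheses, and periods."""
    def norm(s):
        return s.strip().strip("().").upper()
    return 1.0 if norm(pred) == norm(gold) else 0.0

# Hypothetical routing table: task type -> reward function.
REWARD_ROUTER = {
    "text": exact_match,
    "numeric": numeric_match,
    "choice": choice_match,
}

def routed_reward(task_type, pred, gold):
    """Score a prediction with the checker registered for its task type."""
    return REWARD_ROUTER[task_type](pred, gold)

reward = routed_reward("choice", "(b)", "B")
```

Routing keeps each checker simple and lets new answer formats be added by registering one more function, without touching the RL loop that consumes the scalar reward.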


[128] BoxComm: Benchmarking Category-Aware Commentary Generation and Narration Rhythm in Boxing cs.CVPDF

Kaiwen Wang, Kaili Zheng, Rongrong Deng, Yiming Shi, Chenyi Guo

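TL;DR: This paper presents BoxComm, the first commentary generation benchmark dedicated to boxing, comprising 445 World Boxing Championship match videos and more than 52K professional commentary sentences. It introduces a structured commentary taxonomy (play-by-play, tactical, contextual) and two novel evaluation tasks, category-conditioned generation and commentary rhythm assessment, to measure how well multimodal large language models generate combat sports commentary.

Details

Motivation: Existing sports commentary generation benchmarks focus exclusively on team sports such as soccer and basketball and leave combat sports unexplored. Combat sports pose distinct challenges: critical actions unfold within milliseconds with visually subtle yet semantically decisive differences, and professional commentary contains a much higher proportion of tactical analysis than in team sports.

Result: Experiments on multiple state-of-the-art multimodal large language models show that current models struggle on both new evaluations. The proposed improved baseline, EIC-Gen, which supplies structured action cues from detected punch events, yields consistent gains, highlighting the importance of perceiving fleeting, subtle events for combat sports commentary.

Insight: The contributions are the first combat sports commentary generation benchmark, a fine-grained commentary taxonomy, and two complementary evaluation dimensions (category-conditioned generation and commentary rhythm). The latter captures temporal pacing and type distribution over continuous video segments, a dimension that prior benchmarks did not address. Viewed objectively, injecting event detection as structured cues to improve a model's understanding of fast, subtle actions is a direction worth borrowing.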

Abstract: Recent multimodal large language models (MLLMs) have shown strong capabilities in general video understanding, driving growing interest in automatic sports commentary generation. However, existing benchmarks for this task focus exclusively on team sports such as soccer and basketball, leaving combat sports entirely unexplored. Notably, combat sports present distinct challenges: critical actions unfold within milliseconds with visually subtle yet semantically decisive differences, and professional commentary contains a substantially higher proportion of tactical analysis compared to team sports. In this paper, we present BoxComm, a large-scale dataset comprising 445 World Boxing Championship match videos with over 52K commentary sentences from professional broadcasts. We propose a structured commentary taxonomy that categorizes each sentence into play-by-play, tactical, or contextual, providing the first category-level annotation for sports commentary benchmarks. Building on this taxonomy, we introduce two novel and complementary evaluations tailored to sports commentary generation: (1) category-conditioned generation, which evaluates whether models can produce accurate commentary of a specified type given video context; and (2) commentary rhythm assessment, which measures whether freely generated commentary exhibits appropriate temporal pacing and type distribution over continuous video segments, capturing a dimension of commentary competence that prior benchmarks have not addressed. Experiments on multiple state-of-the-art MLLMs reveal that current models struggle on both evaluations. We further propose EIC-Gen, an improved baseline incorporating detected punch events to supply structured action cues, yielding consistent gains and highlighting the importance of perceiving fleeting and subtle events for combat sports commentary.


[129] Beyond Few-Step Inference: Accelerating Video Diffusion Transformer Model Serving with Inter-Request Caching Reuse cs.CVPDF

Hao Liu, Ye Huang, Chenghuan Huang, Zhenyi Zheng, Jiangsu Du

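TL;DR: This paper proposes Chorus, a caching approach that accelerates video Diffusion Transformer (DiT) model serving by exploiting similarity across requests. It uses a three-stage caching strategy that includes full reuse of latent features from similar requests and inter-request caching in specific latent regions during intermediate denoising steps, combined with Token-Guided Attention Amplification to improve semantic alignment between the generated video and the conditional prompt, thereby extending the applicability of full reuse to later denoising steps.

Details

Motivation: Video DiT models dominate high-quality video generation, but iterative denoising makes inference expensive. Existing caching approaches mainly exploit similarity within the diffusion process of a single request to skip redundant denoising steps, and they are of limited use on industrial 4-step distilled models. This paper aims to accelerate model serving by exploiting similarity across requests.

Result: On industrial 4-step distilled models, Chorus achieves up to 45% speedup, whereas prior intra-request caching approaches are ineffective on such models.

Insight: The novelty lies in an inter-request caching strategy that combines three-stage caching with Token-Guided Attention Amplification to exploit similarity between requests. In particular, it extends the applicability of cache reuse in distilled models and improves serving efficiency.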

Abstract: Video Diffusion Transformer (DiT) models are a dominant approach for high-quality video generation but suffer from high inference cost due to iterative denoising. Existing caching approaches primarily exploit similarity within the diffusion process of a single request to skip redundant denoising steps. In this paper, we introduce Chorus, a caching approach that leverages similarity across requests to accelerate video diffusion model serving. Chorus achieves up to 45% speedup on industrial 4-step distilled models, where prior intra-request caching approaches are ineffective. Particularly, Chorus employs a three-stage caching strategy along the denoising process. Stage 1 performs full reuse of latent features from similar requests. Stage 2 exploits inter-request caching in specific latent regions during intermediate denoising steps. This stage is combined with Token-Guided Attention Amplification to improve semantic alignment between the generated video and the conditional prompts, thereby extending the applicability of full reuse to later denoising steps.
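The abstract does not say how Chorus decides that two requests are similar. As a rough sketch of what Stage 1 (full reuse of latents from a similar request) could look like, the example below keys a cache on prompt embeddings and reuses a stored latent when cosine similarity clears a threshold. The `LatentCache` class, the cosine criterion, and the 0.9 threshold are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


class LatentCache:
    """Hypothetical inter-request cache: prompt embedding -> cached latent."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, latent) pairs

    def add(self, embedding, latent):
        self.entries.append((list(embedding), latent))

    def lookup(self, embedding):
        """Return the latent of the most similar past request, or None."""
        best_latent, best_sim = None, self.threshold
        for cached_embedding, latent in self.entries:
            sim = cosine(embedding, cached_embedding)
            if sim >= best_sim:
                best_latent, best_sim = latent, sim
        return best_latent
```

On a hit, a server would start denoising from the cached latent instead of from noise; on a miss it falls back to the full pipeline and stores the new latent for later requests.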


[130] Beyond Standard Benchmarks: A Systematic Audit of Vision-Language Model’s Robustness to Natural Semantic Variation Across Diverse Tasks cs.CVPDF

Jia Chengyu, AprilPyone MaungMaung, Huy H. Nguyen, Jinyin Chen, Isao Echizen

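TL;DR: This paper systematically evaluates the robustness of several vision-language models (CLIP, robust CLIP, BLIP2, SigLIP2) under natural adversarial scenarios, covering zero-shot image classification, semantic segmentation, and visual question answering. It finds that robust CLIP models can amplify natural adversarial vulnerabilities, while CLIP models degrade significantly on natural language-induced adversarial examples, and it uses interpretable analyses to identify failure modes.

Details

Motivation: Prior work mostly evaluates vision-language models on standard benchmarks and lacks a comprehensive, independent evaluation of their robustness and real-world applicability under natural adversarial scenarios. This paper aims to fill that gap.

Result: Several VLMs are evaluated on curated adversarial datasets (typographic attacks, ImageNet-A, and natural language-induced adversarial examples). Performance drops significantly under natural adversarial scenarios, and robust CLIP models can even amplify the vulnerabilities.

Insight: The novelty is a systematic natural adversarial evaluation framework that covers multiple tasks and exposes model fragility under semantic variation. Viewed objectively, its cross-model, cross-task audit offers a new perspective for robustness research and stresses the importance of going beyond standard benchmarks.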

Abstract: Recent advances in vision-language models (VLMs) trained on web-scale image-text pairs have enabled impressive zero-shot transfer across a diverse range of visual tasks. However, comprehensive and independent evaluation beyond standard benchmarks is essential to understand their robustness, limitations, and real-world applicability. This paper presents a systematic evaluation framework for VLMs under natural adversarial scenarios for diverse downstream tasks, which has been overlooked in previous evaluation works. We evaluate a wide range of VLMs (CLIP, robust CLIP, BLIP2, and SigLIP2) on curated adversarial datasets (typographic attacks, ImageNet-A, and natural language-induced adversarial examples). We measure the natural adversarial performance of selected VLMs for zero-shot image classification, semantic segmentation, and visual question answering. Our analysis reveals that robust CLIP models can amplify natural adversarial vulnerabilities, and CLIP models significantly reduce performance for natural language-induced adversarial examples. Additionally, we provide interpretable analyses to identify failure modes. We hope our findings inspire future research in robust and fair multimodal pattern recognition.


[131] A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models cs.CV | cs.LGPDF

Tianmeng Fang, Yong Wang, Zetai Kong, Zengzhen Su, Jun Wang

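TL;DR: This paper proposes a unified defense framework based on patch augmentation and cross-view regularization to protect multimodal large language models from backdoor attacks. By combining patch-level data augmentation with cross-view output difference regularization, it constrains the model's anomalous behavior in response to trigger patterns at both the feature representation and output distribution levels. An output entropy constraint avoids over-suppression, so the attack success rate drops while normal text generation is preserved.

Details

Motivation: Multimodal large language models are highly susceptible to backdoor implantation during supervised fine-tuning; once a specific trigger pattern is activated, the model steadily outputs the attacker's predefined harmful responses. The core challenge of backdoor defense is to suppress attack success under low poisoning ratios while preserving normal generation ability, two objectives that are inherently in conflict.

Result: Experiments across three models, two tasks, and six attacks show that the proposed defense effectively reduces the attack success rate while maintaining a high level of normal text generation capability.

Insight: The key idea is to exploit the abnormal invariance of backdoor responses to non-semantic perturbations: patch-level data augmentation and cross-view output difference regularization proactively pull apart the output distributions of the original and perturbed views, which significantly suppresses backdoor triggering. An output entropy constraint avoids over-suppression during defense and preserves the quality of normal instruction-following generation. The method offers a path toward secure, controlled deployment of large multimodal models under realistic low-frequency poisoning and covert trigger scenarios.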

Abstract: Multimodal large language models have become an important infrastructure for unified processing of visual and linguistic tasks. However, such models are highly susceptible to backdoor implantation during supervised fine-tuning and will steadily output the attacker’s predefined harmful responses once a specific trigger pattern is activated. The core challenge of backdoor defense lies in suppressing attack success under low poisoning ratios while preserving the model’s normal generation ability. These two objectives are inherently conflicting. Strong suppression often degrades benign performance, whereas weak regularization fails to mitigate backdoor behaviors. To this end, we propose a unified defense framework based on patch augmentation and cross-view regularity, which simultaneously constrains the model’s anomalous behaviors in response to triggered patterns from both the feature representation and output distribution levels. Specifically, patch-level data augmentation is combined with cross-view output difference regularization to exploit the fact that backdoor responses are abnormally invariant to non-semantic perturbations and to proactively pull apart the output distributions of the original and perturbed views, thereby significantly suppressing the success rate of backdoor triggering. At the same time, we avoid over-suppression of the model during defense by imposing output entropy constraints, ensuring the quality of normal command generation. Experimental results across three models, two tasks, and six attacks show that our proposed defense method effectively reduces the attack success rate while maintaining a high level of normal text generation capability. Our work enables the secure, controlled deployment of large-scale multimodal models in realistic low-frequency poisoning and covert triggering scenarios.


[132] The Indra Representation Hypothesis for Multimodal Alignment cs.CVPDF

Jianglin Lu, Hailing Wang, Kuo Yang, Yitian Zhang, Simon Jenni

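TL;DR: This paper proposes the Indra Representation Hypothesis, inspired by the philosophical metaphor of Indra's Net: the convergent representations learned by unimodal foundation models implicitly reflect a shared relational structure underlying reality. The hypothesis is formalized with the V-enriched Yoneda embedding from category theory, which defines the Indra representation as the relational profile of a sample with respect to other samples and is shown to be unique, complete, and structure-preserving under a given cost function. Instantiated with angular distance and evaluated in cross-model and cross-modal scenarios involving vision, language, and audio, Indra representations consistently enhance robustness and alignment across architectures and modalities, providing a training-free alignment framework for unimodal foundation models.

Details

Motivation: Unimodal foundation models tend to learn convergent representations, but these are essentially internal abstractions that characterize each sample independently, which limits their expressiveness. This paper aims to increase expressiveness through a relational view and to address cross-model and cross-modal alignment.

Result: Extensive experiments in cross-model and cross-modal scenarios involving vision, language, and audio show that Indra representations consistently improve robustness and alignment across architectures and modalities, yielding a training-free alignment framework.

Insight: The novelty is the Indra Representation Hypothesis itself, which combines a philosophical metaphor with a category-theoretic formalization and redefines representations relationally, emphasizing the interconnection between samples rather than independent abstraction. Viewed objectively, the theoretical framework offers a new relational perspective on representation learning, and the training-free alignment method has practical value.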

Abstract: Recent studies have uncovered an interesting phenomenon: unimodal foundation models tend to learn convergent representations, regardless of differences in architecture, training objectives, or data modalities. However, these representations are essentially internal abstractions of samples that characterize samples independently, leading to limited expressiveness. In this paper, we propose The Indra Representation Hypothesis, inspired by the philosophical metaphor of Indra’s Net. We argue that representations from unimodal foundation models are converging to implicitly reflect a shared relational structure underlying reality, akin to the relational ontology of Indra’s Net. We formalize this hypothesis using the V-enriched Yoneda embedding from category theory, defining the Indra representation as a relational profile of each sample with respect to others. This formulation is shown to be unique, complete, and structure-preserving under a given cost function. We instantiate the Indra representation using angular distance and evaluate it in cross-model and cross-modal scenarios involving vision, language, and audio. Extensive experiments demonstrate that Indra representations consistently enhance robustness and alignment across architectures and modalities, providing a theoretically grounded and practical framework for training-free alignment of unimodal foundation models. Our code is available at https://github.com/Jianglin954/Indra.
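To make the "relational profile" idea concrete, here is a minimal sketch that represents each sample by its angular distances to every sample in a set. Dividing the angle by π so that distances fall in [0, 1], and using the whole set as anchors, are assumptions for illustration; the paper's exact construction may differ.

```python
import math


def angular_distance(u, v):
    """Angle between two non-zero vectors, scaled to [0, 1] by dividing by pi."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (nu * nv)))  # clamp rounding error
    return math.acos(cos) / math.pi


def relational_profiles(embeddings):
    """Row i is sample i described by its distance to every sample."""
    return [[angular_distance(x, y) for y in embeddings] for x in embeddings]
```

Because the profile depends only on pairwise angles, two encoders whose embedding spaces differ by a rotation produce identical profiles, which gives some intuition for why relational profiles can be compared across models without training. For example, `relational_profiles([[1, 0], [0, 1], [1, 1]])` and the same points rotated by 90 degrees, `[[0, 1], [-1, 0], [-1, 1]]`, agree entry by entry up to floating-point error.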


[133] Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward cs.CVPDF

Shizhan Gong, Minda Hu, Qiyuan Zhang, Chen Ma, Qi Dou

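TL;DR: This paper proposes Saliency-R1, a framework built on a novel saliency map technique that efficiently highlights the image regions contributing to generated text without additional computational overhead. It uses the overlap between saliency maps and human-annotated bounding boxes as the reward function, optimized with GRPO, to improve the interpretability and faithfulness of vision-language model reasoning.

Details

Motivation: Vision-language models tend to lean on textual cues during reasoning and may produce ungrounded or fabricated answers, which raises trustworthiness concerns. The paper aims to strengthen the model's use of visual evidence and the faithfulness of its reasoning process.

Result: Experiments show that Saliency-R1 improves reasoning faithfulness, interpretability, and overall task performance. The abstract does not name the specific benchmarks.

Insight: The contributions are a saliency map generation technique with no additional computational overhead, and the use of alignment between saliency maps and human annotations as a reward that steers the model toward relevant visual regions, making the reasoning process more traceable and visually grounded.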

Abstract: Vision-language models (VLMs) have achieved remarkable success across diverse tasks. However, concerns about their trustworthiness persist, particularly regarding tendencies to lean more on textual cues than visual evidence and the risk of producing ungrounded or fabricated responses. To address these issues, we propose Saliency-R1, a framework for improving the interpretability and faithfulness of VLM reasoning. Specifically, we introduce a novel saliency map technique that efficiently highlights critical image regions contributing to generated tokens without additional computational overhead. This can further be extended to trace how visual information flows through the reasoning process to the final answers, revealing the alignment between the thinking process and the visual context. We use the overlap between the saliency maps and human-annotated bounding boxes as the reward function, and apply Group Relative Policy Optimization (GRPO) to align the salient parts and critical regions, encouraging models to focus on relevant areas when conducting reasoning. Experiments show Saliency-R1 improves reasoning faithfulness, interpretability, and overall task performance.
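The abstract does not define the overlap measure, so the sketch below assumes one simple choice: the fraction of total saliency mass that falls inside the annotated boxes. It also shows the group-relative normalization that GRPO applies to rewards within a group of sampled responses. The function names, the half-open box convention, and the use of the population standard deviation are illustrative assumptions.

```python
import statistics


def saliency_box_reward(saliency, boxes):
    """Fraction of saliency mass inside any box.

    saliency: 2D list of non-negative floats indexed as saliency[y][x].
    boxes: list of (x0, y0, x1, y1) pixel boxes, half-open on x1 and y1.
    """
    total = sum(sum(row) for row in saliency)
    if total <= 0:
        return 0.0
    inside = 0.0
    for y, row in enumerate(saliency):
        for x, value in enumerate(row):
            if any(x0 <= x < x1 and y0 <= y < y1 for x0, y0, x1, y1 in boxes):
                inside += value
    return inside / total


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group, as GRPO does."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

With this reward, a response whose saliency concentrates on the annotated region scores close to 1, and within each group the better-grounded responses receive positive advantages.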


[134] Temporal Inversion for Learning Interval Change in Chest X-Rays cs.CV | cs.AIPDF

Hanbin Ko, Kyeongmin Jeon, Doowoong Choi, Chang Min Park

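TL;DR: This paper proposes TILA (Temporal Inversion-aware Learning and Alignment), a framework that uses temporal inversion (swapping the order of an image pair) as a supervisory signal to make existing temporal vision-language models more sensitive to directional change in chest X-rays (CXRs). It integrates inversion-aware objectives across pretraining, fine-tuning, and inference, and introduces a unified evaluation protocol and a new retrieval evaluation set, MS-CXR-Tretrieval. Experiments show that TILA consistently improves progression classification and temporal embedding alignment across several existing architectures.

Details

Motivation: Most current medical foundation models analyze radiographs in isolation and overlook the key clinical task of comparing prior and current images to assess interval change. For chest X-rays, capturing such temporal change is essential, because radiologists must evaluate not only the static appearance of findings but also how they evolve over time.

Result: Experiments on public datasets and real-world hospital cohorts show that TILA consistently improves progression classification and temporal embedding alignment when applied to multiple existing architectures.

Insight: The core contribution is temporal inversion as a simple and effective supervisory signal, together with a unified evaluation protocol (including the new retrieval evaluation set MS-CXR-Tretrieval), to learn temporal order explicitly and complement conventional appearance modeling. Viewed objectively, treating temporal order itself as a learnable supervisory signal is a novel and general way to strengthen temporal medical image analysis.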

Abstract: Recent advances in vision–language pretraining have enabled strong medical foundation models, yet most analyze radiographs in isolation, overlooking the key clinical task of comparing prior and current images to assess interval change. For chest radiographs (CXRs), capturing interval change is essential, as radiologists must evaluate not only the static appearance of findings but also how they evolve over time. We introduce TILA (Temporal Inversion-aware Learning and Alignment), a simple yet effective framework that uses temporal inversion, reversing image pairs, as a supervisory signal to enhance the sensitivity of existing temporal vision-language models to directional change. TILA integrates inversion-aware objectives across pretraining, fine-tuning, and inference, complementing conventional appearance modeling with explicit learning of temporal order. We also propose a unified evaluation protocol to assess order sensitivity and consistency under temporal inversion, and introduce MS-CXR-Tretrieval, a retrieval evaluation set constructed through a general protocol that can be applied to any temporal CXR dataset. Experiments on public datasets and real-world hospital cohorts demonstrate that TILA consistently improves progression classification and temporal embedding alignment when applied to multiple existing architectures.
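As a minimal sketch of the temporal-inversion idea: swapping the prior and current images should flip a directional label ("improved" becomes "worsened") and leave "stable" unchanged. The label names, the dictionary layout, and the `order_consistency` metric below are hypothetical illustrations, not TILA's actual objectives or evaluation protocol.

```python
# Hypothetical progression labels and their counterparts under inversion.
INVERTED_LABEL = {
    "improved": "worsened",
    "worsened": "improved",
    "stable": "stable",
}


def invert_pair(sample):
    """Swap prior and current images and flip the directional label."""
    return {
        "prior": sample["current"],
        "current": sample["prior"],
        "label": INVERTED_LABEL[sample["label"]],
    }


def with_inversions(dataset):
    """Augment a dataset with the inverted version of every pair."""
    augmented = []
    for sample in dataset:
        augmented.append(sample)
        augmented.append(invert_pair(sample))
    return augmented


def order_consistency(predict, dataset):
    """Share of pairs whose prediction flips correctly when inverted."""
    if not dataset:
        return 0.0
    hits = 0
    for s in dataset:
        forward = predict(s["prior"], s["current"])
        backward = predict(s["current"], s["prior"])
        hits += backward == INVERTED_LABEL[forward]
    return hits / len(dataset)
```

A model that ignores image order, for instance one that always answers "improved", scores 0 on this consistency check, which is the kind of failure an inversion-aware objective is meant to expose.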


[135] Erasure or Erosion? Evaluating Compositional Degradation in Unlearned Text-To-Image Diffusion Models cs.CVPDF

Arian Komaei Koma, Seyed Amir Kasaei, Ali Aghayari, AmirMahdi Sadeghzadeh, Mohammad Hossein Rohban

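TL;DR: This paper systematically evaluates how post-hoc unlearning methods for text-to-image diffusion models degrade compositional generation during concept removal. Focusing on nudity removal in Stable Diffusion 1.4, it finds a marked trade-off between unlearning effectiveness and compositional integrity.

Details

Motivation: Prior work mainly measures erasure success and overlooks the effect of unlearning on the model's broader generative capabilities, compositional generation in particular. A systematic evaluation of how well unlearning methods preserve semantics is needed.

Result: On T2I-CompBench++ and GenEval, methods that achieve strong erasure incur substantial degradation in attribute binding, spatial reasoning, and counting, while methods that preserve compositional structure often fail to erase robustly. This exposes the limitations of current evaluation practice.

Insight: The novelty is a systematic evaluation of the side effects of unlearning through the lens of compositional generation. The paper argues for unlearning objectives that balance targeted suppression with semantic preservation, which points to a key direction for future method design.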

Abstract: Post-hoc unlearning has emerged as a practical mechanism for removing undesirable concepts from large text-to-image diffusion models. However, prior work primarily evaluates unlearning through erasure success; its impact on broader generative capabilities remains poorly understood. In this work, we conduct a systematic empirical study of concept unlearning through the lens of compositional text-to-image generation. Focusing on nudity removal in Stable Diffusion 1.4, we evaluate a diverse set of state-of-the-art unlearning methods using T2I-CompBench++ and GenEval, alongside established unlearning benchmarks. Our results reveal a consistent trade-off between unlearning effectiveness and compositional integrity: methods that achieve strong erasure frequently incur substantial degradation in attribute binding, spatial reasoning, and counting. Conversely, approaches that preserve compositional structure often fail to provide robust erasure. These findings highlight limitations of current evaluation practices and underscore the need for unlearning objectives that explicitly account for semantic preservation beyond targeted suppression.


[136] PR-IQA: Partial-Reference Image Quality Assessment for Diffusion-Based Novel View Synthesis cs.CVPDF

Inseong Choi, Siwoo Lee, Seung-Hun Nam, Soohwan Song

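TL;DR: This paper proposes PR-IQA (Partial-Reference Image Quality Assessment), a framework for assessing the quality of images generated by diffusion models for sparse-view novel view synthesis. It uses reference images from different poses and needs no ground truth: it computes a geometrically consistent partial quality map in overlapping regions, then completes it into a dense, full-image quality map via a cross-attention mechanism that incorporates reference-view context. When integrated into a diffusion-augmented 3D Gaussian Splatting (3DGS) pipeline, PR-IQA restricts supervision to the high-confidence regions identified by its quality maps, which filters inconsistencies and improves 3D reconstruction and novel view synthesis.

Details

Motivation: Diffusion models are promising for sparse-view novel view synthesis, but the pseudo-ground-truth views they generate often contain photometric and geometric inconsistencies, and using them directly for supervision harms 3D reconstruction. Existing methods lack an effective quality assessment that can screen these generated views without ground truth.

Result: Experiments show that PR-IQA outperforms existing IQA methods on image quality assessment, reaching full-reference-level accuracy without ground-truth supervision. The quality-aware 3DGS approach that integrates PR-IQA filters inconsistencies more effectively and produces better 3D reconstructions and novel view synthesis results.

Insight: The core contribution is a partial-reference quality assessment framework that needs no ground truth: a geometrically consistent partial quality map plus cross-attention-based, context-aware quality completion yields a cross-view-consistent assessment of diffusion-generated views. Viewed objectively, coupling quality assessment tightly with the 3D reconstruction pipeline and using quality maps to adjust the supervised regions is a practical way to integrate generative model outputs reliably into downstream tasks.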

Abstract: Diffusion models are promising for sparse-view novel view synthesis (NVS), as they can generate pseudo-ground-truth views to aid 3D reconstruction pipelines like 3D Gaussian Splatting (3DGS). However, these synthesized images often contain photometric and geometric inconsistencies, and their direct use for supervision can impair reconstruction. To address this, we propose Partial-Reference Image Quality Assessment (PR-IQA), a framework that evaluates diffusion-generated views using reference images from different poses, eliminating the need for ground truth. PR-IQA first computes a geometrically consistent partial quality map in overlapping regions. It then performs quality completion to inpaint this partial map into a dense, full-image map. This completion is achieved via a cross-attention mechanism that incorporates reference-view context, ensuring cross-view consistency and enabling thorough quality assessment. When integrated into a diffusion-augmented 3DGS pipeline, PR-IQA restricts supervision to high-confidence regions identified by its quality maps. Experiments demonstrate that PR-IQA outperforms existing IQA methods, achieving full-reference-level accuracy without ground-truth supervision. Thus, our quality-aware 3DGS approach more effectively filters inconsistencies, producing superior 3D reconstructions and NVS results. The project page is available at https://kakaomacao.github.io/pr-iqa-project-page/.
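Restricting supervision to high-confidence regions can be pictured as a masked reconstruction loss. The sketch below assumes a hard threshold on the quality map and a plain L1 loss on single-channel images; both choices, and the function name, are illustrative assumptions rather than the paper's actual loss.

```python
def masked_l1_loss(rendered, pseudo_gt, quality, threshold=0.5):
    """Mean absolute error over pixels whose quality score passes the threshold.

    rendered, pseudo_gt, quality: 2D lists of floats with the same shape.
    Pixels with quality below the threshold contribute nothing.
    """
    total, count = 0.0, 0
    for r_row, g_row, q_row in zip(rendered, pseudo_gt, quality):
        for r, g, q in zip(r_row, g_row, q_row):
            if q >= threshold:
                total += abs(r - g)
                count += 1
    return total / count if count else 0.0
```

Pixels that the quality map marks as unreliable are simply excluded, so an inconsistent region in a diffusion-generated view cannot pull the reconstruction toward it.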


[137] Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation cs.CVPDF

Quoc-Huy Trinh, Mustapha Abdullahi, Bo Zhao, Debesh Jha

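TL;DR: This paper proposes Firebolt-VL, an efficient vision-language model that addresses the high computational cost of existing multimodal large language models and their difficulty in precisely capturing fine-grained visual regions. It replaces the Transformer-based decoder with a Liquid Foundation Model (LFM) decoder and introduces a Token-Grid Correlation Module that uses lightweight correlation computation and modulation via a state-space model with FiLM conditioning, achieving linear-time inference and fine-grained visual grounding.

Details

Motivation: Existing multimodal large language models built on Transformer cross-attention have high computational complexity, which limits deployment in resource-constrained scenarios such as personal assistants, document understanding, and smart cameras. Small vision-language models also perform poorly on fine-grained reasoning tasks because they struggle to capture task-relevant visual regions precisely.

Result: Experiments across multiple benchmarks show that Firebolt-VL achieves accurate, fine-grained understanding with significantly improved efficiency.

Insight: The main contributions are replacing the quadratic-complexity Transformer decoder with a linear-complexity LFM decoder, and the Token-Grid Correlation Module, which strengthens visual grounding through lightweight cross-modal correlation and conditional modulation, balancing efficiency with fine-grained reasoning performance.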

Abstract: Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as personal assistants, document understanding, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, leading to degraded performance on fine-grained reasoning tasks that limit their effectiveness in the real world. To address these issues, we introduce Firebolt-VL, an efficient vision-language model that replaces the Transformer-based decoder with a Liquid Foundation Model (LFM) decoder. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and modulates via the state-space model with FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Firebolt-VL achieves accurate, fine-grained understanding with significantly improved efficiency. Our model and code are available at: https://fireboltvl.github.io
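FiLM (feature-wise linear modulation) scales and shifts features as `gamma * x + beta`, with `gamma` and `beta` derived from a conditioning signal. The sketch below pairs that operation with a toy token-patch correlation; how the correlation scores are turned into `gamma` here (a softmax weight added to 1, with `beta` fixed at 0) is purely an assumption for illustration, since the abstract gives no such detail and real FiLM layers usually learn both parameters.

```python
import math


def token_patch_correlation(text_tokens, patches):
    """For each image patch, the largest dot product with any text token."""
    return [
        max(sum(t_i * p_i for t_i, p_i in zip(t, p)) for t in text_tokens)
        for p in patches
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def film(features, gamma, beta):
    """Feature-wise linear modulation: gamma * x + beta per feature."""
    return [g * f + b for f, g, b in zip(features, gamma, beta)]


def modulate_patches(text_tokens, patches):
    """Amplify patches that correlate with the prompt (illustrative rule)."""
    weights = softmax(token_patch_correlation(text_tokens, patches))
    out = []
    for w, p in zip(weights, patches):
        gamma = [1.0 + w] * len(p)  # assumed mapping from weight to scale
        beta = [0.0] * len(p)       # assumed zero shift
        out.append(film(p, gamma, beta))
    return out
```

The correlation step costs one dot product per token-patch pair and involves no attention over the full sequence, which is the sense in which such a module can stay lightweight.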


[138] Beyond Semantics: Uncovering the Physics of Fakes via Universal Physical Descriptors for Cross-Modal Synthetic Detection cs.CVPDF

Mei Qiu, Jianqiang Zhao, Yanyun Qu

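TL;DR: This paper proposes a cross-modal synthetic image detection method based on universal physical descriptors. It systematically explores 15 physical features, selects five core features including Laplacian variance, Sobel statistics, and residual noise variance, and integrates them with semantic information into CLIP to improve detection of AI-generated images.

Details

Motivation: Existing deepfake detectors often overfit to specific generative models and generalize poorly. This calls for a reexamination of the intrinsic physical characteristics that distinguish natural from AI-generated images, and for exploring how such objective pixel-level features can be integrated into multimodal models to improve detection.

Result: The method achieves SOTA performance on multiple Genimage benchmarks, with near-perfect accuracy (99.8%) on datasets such as Wukong and SDv1.4.

Insight: The novelty is a systematic exploration and selection of physical features that remain stable across datasets and generative architectures; the features are encoded as text and combined with semantic information to guide CLIP's image-text representation learning. The authors position this as a new direction for trustworthy vision-language modeling and for mitigating hallucinations in large models.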

Abstract: The rapid advancement of AI generated content (AIGC) has blurred the boundaries between real and synthetic images, exposing the limitations of existing deepfake detectors that often overfit to specific generative models. This adaptability crisis calls for a fundamental reexamination of the intrinsic physical characteristics that distinguish natural from AI-generated images. In this paper, we address two critical research questions: (1) What physical features can stably and robustly discriminate AI generated images across diverse datasets and generative architectures? (2) Can these objective pixel-level features be integrated into multimodal models like CLIP to enhance detection performance while mitigating the unreliability of language-based information? To answer these questions, we conduct a comprehensive exploration of 15 physical features across more than 20 datasets generated by various GANs and diffusion models. We propose a novel feature selection algorithm that identifies five core physical features including Laplacian variance, Sobel statistics, and residual noise variance that exhibit consistent discriminative power across all tested datasets. These features are then converted into text encoded values and integrated with semantic captions to guide image text representation learning in CLIP. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple Genimage benchmarks, with near-perfect accuracy (99.8%) on datasets such as Wukong and SDv1.4. By bridging pixel level authenticity with semantic understanding, this work pioneers the use of physically grounded features for trustworthy vision language modeling and opens new directions for mitigating hallucinations and textual inaccuracies in large multimodal models.
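Two of the named descriptors, Laplacian variance and Sobel statistics, are standard image measurements and can be sketched in a few lines. The kernels below are the usual 3x3 Laplacian and Sobel filters applied in "valid" mode to a grayscale image; using the mean gradient magnitude as the Sobel statistic and the text format in `features_as_text` are assumptions, since the abstract does not specify either.

```python
import math
import statistics

LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def filter3x3(img, kernel):
    """Apply a 3x3 kernel to interior pixels only; returns a flat list."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out.append(sum(
                kernel[j][i] * img[y - 1 + j][x - 1 + i]
                for j in range(3) for i in range(3)
            ))
    return out


def laplacian_variance(img):
    """Variance of the Laplacian response, a common sharpness measure."""
    return statistics.pvariance(filter3x3(img, LAPLACIAN))


def sobel_mean_magnitude(img):
    """Mean gradient magnitude from the Sobel x and y responses."""
    gx, gy = filter3x3(img, SOBEL_X), filter3x3(img, SOBEL_Y)
    return statistics.fmean(math.hypot(a, b) for a, b in zip(gx, gy))


def features_as_text(img):
    """Hypothetical text encoding of the features for a caption prompt."""
    return (f"laplacian variance: {laplacian_variance(img):.2f}; "
            f"mean sobel magnitude: {sobel_mean_magnitude(img):.2f}")
```

A string like the one returned by `features_as_text` could then be appended to a semantic caption before it reaches the text encoder, which is one plausible reading of "converted into text encoded values".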


[139] Multimodal Backdoor Attack on VLMs for Autonomous Driving via Graffiti and Cross-Lingual Triggers cs.CVPDF

Jiancheng Wang, Lidan Liang, Yong Wang, Zengzhen Su, Haifeng Xia

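TL;DR: This paper proposes GLA, a multimodal backdoor attack on vision-language models for autonomous driving that builds a covert and stable attack channel from graffiti-style visual triggers and cross-lingual text triggers. Experiments on DriveVLM show that a 10% poisoning ratio suffices for a 90% attack success rate without hurting the model on clean tasks, and metrics such as BLEU-1 even improve, which lets the attack evade traditional detection based on performance degradation.

Details

Motivation: Existing backdoor attacks mainly rely on unimodal, explicit, easily detectable triggers, which makes it hard to build covert and stable attack channels in autonomous driving scenarios. More natural, multimodal trigger mechanisms are needed to assess the vulnerability of safety-critical multimodal systems.

Result: On DriveVLM, GLA reaches a 90% attack success rate and a 0% false positive rate with only a 10% poisoning ratio, while metrics such as BLEU-1 on clean tasks improve, indicating that the attack is highly covert.

Insight: The novelty is the combination of graffiti visual triggers generated via Stable Diffusion inpainting (which blend seamlessly into urban scenes) with cross-lingual text triggers (a distributional shift that keeps semantics consistent), forming a multimodal, naturalistic backdoor attack paradigm that reveals an underestimated security threat in autonomous driving VLMs.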

Abstract: Visual language models (VLMs) are rapidly being integrated into safety-critical systems such as autonomous driving, making them an important attack surface for potential backdoor attacks. Existing backdoor attacks mainly rely on unimodal, explicit, and easily detectable triggers, making it difficult to construct both covert and stable attack channels in autonomous driving scenarios. GLA introduces two naturalistic triggers: graffiti-based visual patterns generated via stable diffusion inpainting, which seamlessly blend into urban scenes, and cross-language text triggers, which introduce distributional shifts while maintaining semantic consistency to build robust language-side trigger signals. Experiments on DriveVLM show that GLA requires only a 10% poisoning ratio to achieve a 90% Attack Success Rate (ASR) and a 0% False Positive Rate (FPR). More insidiously, the backdoor does not weaken the model on clean tasks, but instead improves metrics such as BLEU-1, making it difficult for traditional performance-degradation-based detection methods to identify the attack. This study reveals underestimated security threats in self-driving VLMs and provides a new attack paradigm for backdoor evaluation in safety-critical multimodal systems.


[140] InCTRLv2: Generalist Residual Models for Few-Shot Anomaly Detection and Segmentation cs.CVPDF

Jiawen Zhu, Mengjia Niu, Guansong Pang

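TL;DR: This paper proposes InCTRLv2, a novel few-shot generalist anomaly detection and segmentation framework. It extends the earlier InCTRL model with a dual-branch architecture that introduces two complementary perspectives of anomaly perception: a main branch that learns discriminative anomaly scores from both normal and abnormal data, and an auxiliary branch that learns one-class anomaly scores from normal data only. Both branches are guided by rich visual-text semantic priors encoded by large-scale vision-language models, so anomalies are detected from both the normal-abnormal discrimination perspective and the deviation-from-normality perspective.

Details

Motivation: Most existing anomaly detection models are specialist models that require extensive training on a specific target dataset and struggle to generalize to unseen datasets. The goal is a generalist anomaly detection model that works across multiple domains without retraining.

Result: Extensive experiments on ten anomaly detection datasets show that InCTRLv2 achieves state-of-the-art performance in both anomaly detection and segmentation across various settings.

Insight: The novelty is a dual-branch generalist framework that combines discriminative learning and one-class learning as complementary perspectives of anomaly perception, guided by the semantic priors of large-scale vision-language models. This gives anomaly detection a more complete semantic view, emphasizing both the distinction between normal and abnormal and the detection of deviations from normal patterns.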

Abstract: While recent anomaly detection (AD) methods have made substantial progress in recognizing abnormal patterns within specific domains, most of them are specialist models that are trained on large training samples from a specific target dataset, struggling to generalize to unseen datasets. To address this limitation, the paradigm of Generalist Anomaly Detection (GAD) has emerged in recent years, aiming to learn a single generalist model to detect anomalies across diverse domains without retraining. To this end, this work introduces InCTRLv2, a novel few-shot Generalist Anomaly Detection and Segmentation (GADS) framework that significantly extends our previously proposed GAD model, InCTRL. Building on the idea of learning in-context residuals with few-shot normal examples to detect anomalies as in InCTRL, InCTRLv2 introduces two new, complementary perspectives of anomaly perception under a dual-branch framework. This is accomplished by two novel modules upon InCTRL: i) Discriminative Anomaly Score Learning (DASL) with both normal and abnormal data in the main branch, which learns a semantic-guided abnormality and normality space that supports the classification of query samples from both the abnormality and normality perspectives; and ii) One-class Anomaly Score Learning (OASL) using only the normal data, which learns generalized normality patterns in a semantic space via an auxiliary branch, focusing on detecting anomalies through the lens of normality solely. Both branches are guided by rich visual-text semantic priors encoded by large-scale vision-language models. Together, they offer a dual semantic perspective for AD: one emphasizes normal-abnormal discriminations, while the other emphasizes normality-deviated semantics. Extensive experiments on ten AD datasets demonstrate that InCTRLv2 achieves SotA performance in both anomaly detection and segmentation tasks across various settings.


[141] Preserving Forgery Artifacts: AI-Generated Video Detection at Native Scale cs.CV | cs.AIPDF

Zhengcen Li, Chenyang Jiang, Hang Zhao, Shiyang Zhou, Yunyang Mo

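TL;DR: To address the loss of high-frequency forgery traces caused by preprocessing in AI-generated video detection, this paper introduces a large-scale dataset of over 140K videos and a novel detection framework built on the Qwen2.5-VL Vision Transformer. The framework operates natively at variable spatial resolutions and temporal durations, preserving the high-frequency artifacts and spatiotemporal inconsistencies that conventional preprocessing loses.

Details

Motivation: Current video generation models can create highly realistic synthetic media, raising societal concerns about the spread of misinformation. Existing detection methods rely on preprocessing such as fixed-resolution resizing and cropping, which discards subtle high-frequency forgery traces and causes spatial distortion and information loss, and their training and evaluation datasets fail to capture the sophistication of modern generative models.

Result: Extensive experiments show that the method achieves superior performance across multiple benchmarks, including the Magic Videos benchmark designed for evaluating ultra-realistic synthetic content, establishing a robust new baseline for AI-generated video detection.

Insight: The main contribution is a native-scale processing framework that avoids destructive preprocessing and thus preserves key forgery artifacts. Viewed objectively, a large-scale, up-to-date dataset covering 15 state-of-the-art open-source and commercial generators is valuable for advancing the field.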

Abstract: The rapid advancement of video generation models has enabled the creation of highly realistic synthetic media, raising significant societal concerns regarding the spread of misinformation. However, current detection methods suffer from critical limitations. They rely on preprocessing operations like fixed-resolution resizing and cropping. These operations not only discard subtle, high-frequency forgery traces but also cause spatial distortion and significant information loss. Furthermore, existing methods are often trained and evaluated on outdated datasets that fail to capture the sophistication of modern generative models. To address these challenges, we introduce a comprehensive dataset and a novel detection framework. First, we curate a large-scale dataset of over 140K videos from 15 state-of-the-art open-source and commercial generators, along with the Magic Videos benchmark designed specifically for evaluating ultra-realistic synthetic content. In addition, we propose a novel detection framework built on the Qwen2.5-VL Vision Transformer, which operates natively at variable spatial resolutions and temporal durations. This native-scale approach effectively preserves the high-frequency artifacts and spatiotemporal inconsistencies typically lost during conventional preprocessing. Extensive experiments demonstrate that our method achieves superior performance across multiple benchmarks, underscoring the critical importance of native-scale processing and establishing a robust new baseline for AI-generated video detection.


[142] Synthesis4AD: Synthetic Anomalies are All You Need for 3D Anomaly Detection cs.CVPDF

Yihan Sun, Yuqi Cheng, Junjie Zu, Yuxiang Tan, Guoyang Xie

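TL;DR: This paper proposes Synthesis4AD, an end-to-end paradigm for 3D anomaly detection. It builds the 3D-DefectStudio platform, uses the controllable synthesis engine MPAS to generate large-scale, high-fidelity synthetic anomaly data, and employs a multimodal large language model to interpret product design information automatically and guide anomaly generation. It also introduces a training pipeline based on spatial-distribution normalization and geometry-faithful data augmentation to improve the robustness and generalization of Point Transformer architectures on unstructured point clouds.

Details

Motivation: Industrial 3D anomaly detection is constrained by the scarcity and long-tailed distribution of abnormal samples, so a method that generates high-quality synthetic anomalies is needed to learn more discriminative representations.

Result: Extensive experiments on Real3D-AD, MulSen-AD, and a real-world industrial parts dataset show that the method achieves state-of-the-art (SOTA) performance.

Insight: The contributions are: 1) the controllable synthesis engine MPAS and the interactive system 3D-DefectStudio, which use higher-dimensional support primitives to guide the generation of geometrically realistic defects and accurate point-wise anomaly masks; 2) a multimodal large language model that automatically translates product design information into executable anomaly synthesis instructions, enabling scalable, knowledge-driven data generation; 3) a training pipeline with spatial-distribution normalization and geometry-faithful data augmentation, which reduces the sensitivity of Point Transformers to absolute coordinates and improves feature learning under realistic data variations.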

Abstract: Industrial 3D anomaly detection performance is fundamentally constrained by the scarcity and long-tailed distribution of abnormal samples. To address this challenge, we propose Synthesis4AD, an end-to-end paradigm that leverages large-scale, high-fidelity synthetic anomalies to learn more discriminative representations for 3D anomaly detection. At the core of Synthesis4AD is 3D-DefectStudio, a software platform built upon the controllable synthesis engine MPAS, which injects geometrically realistic defects guided by higher-dimensional support primitives while simultaneously generating accurate point-wise anomaly masks. Furthermore, Synthesis4AD incorporates a multimodal large language model (MLLM) to interpret product design information and automatically translate it into executable anomaly synthesis instructions, enabling scalable and knowledge-driven anomalous data generation. To improve the robustness and generalization of the downstream detector on unstructured point clouds, Synthesis4AD further introduces a training pipeline based on spatial-distribution normalization and geometry-faithful data augmentations, which alleviates the sensitivity of Point Transformer architectures to absolute coordinates and improves feature learning under realistic data variations. Extensive experiments demonstrate state-of-the-art performance on Real3D-AD, MulSen-AD, and a real-world industrial parts dataset. The proposed synthesis method MPAS and the interactive system 3D-DefectStudio will be publicly released at https://github.com/hustCYQ/Synthesis4AD.


[143] Don’t Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs cs.CVPDF

Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Patrick Woods, Gabriel Hillesheim, Abolfazl Razi

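TL;DR: This paper proposes an adaptive KV-cache quantization method for efficiently deploying large language models on resource-constrained mobile and edge devices. A lightweight controller dynamically assigns the KV-cache quantization bit-width (2-bit, 4-bit, 8-bit, or FP16) according to token importance, reducing memory footprint and decoding latency while keeping accuracy close to FP16 inference.

Details

Motivation: When LLMs are deployed on mobile and edge devices, the memory and bandwidth overhead of the KV cache grows linearly with context length and becomes the main bottleneck in decoding cost. Existing fixed-precision or heuristic KV-cache quantization schemes tend to waste bits on unimportant tokens or over-compress informative ones, causing avoidable accuracy loss.

Result: Experiments on several commonsense reasoning benchmarks with SmolLM-135M, SmolLM-360M, and SmolLM-1.7B show a consistently better accuracy-latency trade-off. For SmolLM-360M on HellaSwag, the method reduces decoding latency (ms/token) by 17.75% relative to static KV quantization, improves accuracy by 7.60 points, and stays within 0.30 points of FP16 inference.

Insight: Inspired by the variable-length allocation principle of Huffman coding, the paper proposes a data-driven adaptive quantization policy that adjusts KV-cache precision dynamically from lightweight features (token frequency, quality score, attention variance, and entropy-based uncertainty). This avoids the limits of hand-crafted heuristics and allocates bits more efficiently on resource-constrained devices.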

Abstract: Large Language Models (LLMs) have achieved remarkable progress across reasoning, generation, and decision-making tasks, yet deploying them on mobile, embedded, and edge devices remains particularly challenging. On-device LLM inference is heavily constrained by the memory and bandwidth overhead of the key-value (KV) cache, which grows linearly with context length and often dominates decoding cost. Existing KV-cache quantization schemes typically rely on fixed precision or hand-crafted heuristics, thereby wasting bits on low-impact tokens while over-compressing informative ones, leading to avoidable accuracy degradation. Inspired by Huffman coding’s principle of variable-length allocation, we propose adaptive KV-cache quantization, a learned policy that assigns bit-width proportional to token importance, minimizing expected memory and latency without sacrificing competitive accuracy. Our framework extracts lightweight token-level features, including token frequency, quality score, attention variance, and entropy-based uncertainty, and feeds them into a compact data-driven controller that dynamically selects KV precision from {2-bit, 4-bit, 8-bit, FP16} during decoding. This adaptive precision policy reduces KV memory footprint and latency while improving accuracy compared to static KV quantization and rule-based baselines, and maintaining competitive accuracy close to FP16 inference across standard LLM benchmarks. Extensive experiments across multiple commonsense reasoning benchmarks using SmolLM-135M, SmolLM-360M, and SmolLM-1.7B demonstrate that our controller consistently improves the accuracy-latency trade-off. For instance, with SmolLM-360M on HellaSwag, our method reduces decoding latency (ms/token) by 17.75% relative to static KV quantization, improves accuracy by 7.60 points, and remains within only 0.30 points of FP16 inference.
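The bit-allocation idea in this abstract can be illustrated with a minimal sketch. The paper uses a learned controller; the version below swaps in a hand-written score and evenly spaced thresholds purely for illustration. The names `TokenFeatures`, `importance`, `choose_bits`, and `kv_memory_ratio` are hypothetical, and every feature is assumed to be pre-scaled to [0, 1].

```python
from dataclasses import dataclass

FP16_BITS = 16  # FP16 is treated as the 16-bit option


@dataclass
class TokenFeatures:
    """Token-level features named in the abstract, assumed pre-scaled to [0, 1]."""
    frequency: float           # 1.0 = very common token
    quality: float
    attention_variance: float
    entropy: float


def importance(f: TokenFeatures) -> float:
    # Hypothetical stand-in for the learned controller: rare, high-quality,
    # high-variance, uncertain tokens are scored as more important.
    return ((1.0 - f.frequency) + f.quality + f.attention_variance + f.entropy) / 4.0


def choose_bits(f: TokenFeatures) -> int:
    # Map the score onto {2, 4, 8, FP16} with hand-picked thresholds (an assumption).
    score = importance(f)
    if score < 0.25:
        return 2
    if score < 0.5:
        return 4
    if score < 0.75:
        return 8
    return FP16_BITS


def kv_memory_ratio(bits_per_token: list[int]) -> float:
    # KV memory relative to storing every token in FP16.
    return sum(bits_per_token) / (FP16_BITS * len(bits_per_token))


if __name__ == "__main__":
    tokens = [TokenFeatures(0.9, 0.1, 0.1, 0.1), TokenFeatures(0.1, 0.9, 0.8, 0.7)]
    bits = [choose_bits(t) for t in tokens]
    print(bits, kv_memory_ratio(bits))
```

The sketch only shows why a variable-width policy can save memory: low-importance tokens fall to 2 bits while informative ones keep full precision. It says nothing about how the actual controller is trained.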


[144] Discovering Failure Modes in Vision-Language Models using RL cs.CV | cs.AIPDF

Kanishk Jain, Qian Yang, Shravan Nayak, Parisa Kordjamshidi, Nishanth Anand

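TL;DR: This paper proposes a reinforcement learning (RL)-based framework that automatically discovers the failure modes, or blind spots, of vision-language models (VLMs) on a given data distribution without human intervention. The framework trains a questioner agent that adaptively generates queries based on the candidate VLM's responses in order to elicit incorrect answers. By focusing on fine-grained visual details and distinct skill compositions, it gradually increases question complexity and identifies 36 novel VLM failure modes.

Details

Motivation: Manually identifying VLM weaknesses is costly, unscalable, and subject to human bias. That bias tends to overlook subtle details in favor of salient objects, which leaves an incomplete picture of a model's vulnerabilities.

Result: The method generalizes across various model combinations and identifies 36 novel VLM failure modes, showing that it can surface misinterpretations of visual concepts that humans identify effortlessly, such as counting, spatial reasoning, and viewpoint understanding.

Insight: The novelty lies in using RL-driven adaptive query generation to explore VLM blind spots systematically. By gradually increasing question complexity and attending to fine-grained details, the framework reveals latent weaknesses automatically and at scale, offering a new tool for model evaluation.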

Abstract: Vision-language Models (VLMs), despite achieving strong performance on multimodal benchmarks, often misinterpret straightforward visual concepts that humans identify effortlessly, such as counting, spatial reasoning, and viewpoint understanding. Previous studies manually identified these weaknesses and found that they often stem from deficits in specific skills. However, such manual efforts are costly, unscalable, and subject to human bias, which often overlooks subtle details in favor of salient objects, resulting in an incomplete understanding of a model’s vulnerabilities. To address these limitations, we propose a Reinforcement Learning (RL)-based framework to automatically discover the failure modes or blind spots of any candidate VLM on a given data distribution without human intervention. Our framework trains a questioner agent that adaptively generates queries based on the candidate VLM’s responses to elicit incorrect answers. Our approach increases question complexity by focusing on fine-grained visual details and distinct skill compositions as training progresses, consequently identifying 36 novel failure modes in which VLMs struggle. We demonstrate the broad applicability of our framework by showcasing its generalizability across various model combinations.


[145] Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning cs.CVPDF

Lei Zhang, Junjiao Tian, Zhipeng Fan, Kunpeng Li, Jialiang Wang

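TL;DR: This paper proposes process-driven image generation, a multi-step paradigm that decomposes image synthesis into an interleaved reasoning trajectory of thoughts and actions. Through an iterative loop of textual planning, visual drafting, textual reflection, and visual refinement, the method mimics the incremental way humans paint and makes the generation process explicit, interpretable, and directly supervisable.

Details

Motivation: Current unified multimodal models generate images in a single step and lack the ability to imagine the chain of intermediate states. The paper aims to imitate the incremental human painting process, in which each step is grounded in the evolving visual state.

Result: The method is validated on various text-to-image generation benchmarks. The abstract reports no specific quantitative results (such as FID scores) and does not state whether the method reaches SOTA.

Insight: The core idea is to model image generation as a multi-step interleaved reasoning process and to resolve the ambiguity of intermediate states through dense, step-wise supervision (a spatial and semantic consistency constraint on visual states, and a constraint on textual states that preserves prior visual knowledge while enabling error correction), which makes generation controllable and interpretable.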

Abstract: Humans paint images incrementally: they plan a global layout, sketch a coarse draft, inspect, and refine details, and most importantly, each step is grounded in the evolving visual states. However, can unified multimodal models trained on text-image interleaved datasets also imagine the chain of intermediate states? In this paper, we introduce process-driven image generation, a multi-step paradigm that decomposes synthesis into an interleaved reasoning trajectory of thoughts and actions. Rather than generating images in a single step, our approach unfolds across multiple iterations, each consisting of four stages: textual planning, visual drafting, textual reflection, and visual refinement. The textual reasoning explicitly conditions how the visual state should evolve, while the generated visual intermediate in turn constrains and grounds the next round of textual reasoning. A core challenge of process-driven generation stems from the ambiguity of intermediate states: how can models evaluate each partially complete image? We address this through dense, step-wise supervision that maintains two complementary constraints: for the visual intermediate states, we enforce spatial and semantic consistency; for the textual intermediate states, we preserve the prior visual knowledge while enabling the model to identify and correct prompt-violating elements. This makes the generation process explicit, interpretable, and directly supervisable. To validate the proposed method, we conduct experiments on various text-to-image generation benchmarks.


[146] CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models cs.CVPDF

Xiangzhao Hao, Zefeng Zhang, Zhenyu Zhang, Linhao Yu, Yao Chen

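TL;DR: This paper proposes CLEAR, a framework that addresses the drop in understanding that unified multimodal models suffer on degraded images (blur, noise, compression, poor illumination). Through three progressive steps, supervised fine-tuning, a Latent Representation Bridge, and Interleaved GRPO reinforcement learning, the framework connects the model's generative and reasoning capabilities and improves understanding of degraded images.

Details

Motivation: Real-world image degradation severely undermines multimodal understanding. Unified multimodal models that combine understanding and generation should be able to use their generative pathway to model the fine-grained visual structure that degradation destroys, yet existing models fail to invoke their own generative capacity on degraded inputs. The causes are that training regimes never require generation during reasoning and that the standard decode-reencode pathway does not support joint optimization.

Result: Experiments on MMD-Bench, which covers three degradation severity levels across six standard multimodal benchmarks, show that CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance.

Insight: The contributions are: 1) supervised fine-tuning that establishes a generate-then-answer reasoning pattern; 2) a Latent Representation Bridge that replaces the inefficient decode-reencode path with a direct, optimizable connection; 3) Interleaved GRPO, a reinforcement learning method that jointly optimizes text reasoning and visual generation under answer-correctness rewards. The analysis also finds that removing pixel-level reconstruction supervision yields intermediate visual states with higher perceptual quality, suggesting that task-driven optimization and visual quality are naturally aligned.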

Abstract: Image degradation from blur, noise, compression, and poor illumination severely undermines multimodal understanding in real-world settings. Unified multimodal models that combine understanding and generation within a single architecture are a natural fit for this challenge, as their generative pathway can model the fine-grained visual structure that degradation destroys. Yet these models fail to leverage their own generative capacity on degraded inputs. We trace this disconnect to two compounding factors: existing training regimes never ask the model to invoke generation during reasoning, and the standard decode-reencode pathway does not support effective joint optimization. We present CLEAR, a framework that connects the two capabilities through three progressive steps: (1) supervised fine-tuning on a degradation-aware dataset to establish the generate-then-answer reasoning pattern; (2) a Latent Representation Bridge that replaces the decode-reencode detour with a direct, optimizable connection between generation and reasoning; (3) Interleaved GRPO, a reinforcement learning method that jointly optimizes text reasoning and visual generation under answer-correctness rewards. We construct MMD-Bench, covering three degradation severity levels across six standard multimodal benchmarks. Experiments show that CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance. Our analysis further reveals that removing pixel-level reconstruction supervision leads to intermediate visual states with higher perceptual quality, suggesting that task-driven optimization and visual quality are naturally aligned.
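The abstract does not spell out how Interleaved GRPO computes its learning signal, so the sketch below shows only the standard group-relative advantage that GRPO-style methods are built on, applied to binary answer-correctness rewards. The function name `group_relative_advantages` and the `eps` stabilizer are illustrative assumptions, not details from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its own group.

    rewards: one scalar per response sampled for the same prompt, e.g.
    1.0 if the final answer is correct and 0.0 otherwise.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled generate-then-answer rollouts, two of them correct.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Correct rollouts receive positive advantages and incorrect ones negative, so no separate value network is needed. When all rollouts in a group score the same, every advantage is zero and the group contributes no gradient.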


[147] Multi-Modal Sensor Fusion using Hybrid Attention for Autonomous Driving cs.CV | cs.LGPDF

Mayank Mayank, Bharanidhar Duraisamy, Florian Geiß, Abhinav Valada

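TL;DR: This paper proposes MMF-BEV, a multimodal sensor fusion framework for 3D object detection in autonomous driving. It combines camera and millimeter-wave radar data and aligns cross-modal features with deformable attention in bird's-eye-view (BEV) space, exploiting the dense semantics of cameras and the precise range and velocity measurements of radar.

Details

Motivation: Autonomous driving needs accurate 3D object detection, but each sensor alone is limited: camera depth is unreliable, and millimeter-wave radar geometry is sparse. Fusing complementary sensors is therefore needed to improve detection.

Result: Experiments on the View-of-Delft (VoD) 4D radar dataset show that MMF-BEV consistently outperforms unimodal baselines across all object classes, in both the full annotated area and the near-range Region of Interest, and achieves competitive results against prior fusion methods.

Insight: The contributions are: 1) unimodal branches enhanced with Deformable Self-Attention and fused via Deformable Cross-Attention; 2) a sensor contribution analysis that quantifies per-distance modality weighting, giving interpretable evidence of sensor complementarity; 3) a two-stage training strategy (pre-training the camera branch, then jointly training the radar and fusion modules) that stabilizes learning.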

Abstract: Accurate 3D object detection for autonomous driving requires complementary sensors. Cameras provide dense semantics but unreliable depth, while millimeter-wave radar offers precise range and velocity measurements with sparse geometry. We propose MMF-BEV, a radar-camera BEV fusion framework that leverages deformable attention for cross-modal feature alignment on the View-of-Delft (VoD) 4D radar dataset [1]. MMF-BEV builds a BEVDepth [2] camera branch and a RadarBEVNet [3] radar branch, each enhanced with Deformable Self-Attention, and fuses them via a Deformable Cross-Attention module. We evaluate three configurations: camera-only, radar-only, and hybrid fusion. A sensor contribution analysis quantifies per-distance modality weighting, providing interpretable evidence of sensor complementarity. A two-stage training strategy, pre-training the camera branch with depth supervision and then jointly training the radar and fusion modules, stabilizes learning. Experiments on VoD show that MMF-BEV consistently outperforms unimodal baselines and achieves competitive results against prior fusion methods across all object classes in both the full annotated area and near-range Region of Interest.


[148] E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes cs.CV | cs.MM | cs.RO | eess.IVPDF

Jiajun Zhai, Hao Shi, Shangwei Guo, Kailun Yang, Kaiwei Wang

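TL;DR: This paper proposes E-VLA, an event-augmented vision-language-action model for conditions where conventional frame-based vision becomes unreliable, such as extreme low light and motion blur. Rather than reconstructing images, the framework uses the motion and structural cues in event streams directly, preserving semantic perception and perception-action consistency under adverse conditions. The authors build an open-source teleoperation platform and a real-world RGB-event-action dataset, and propose lightweight event integration strategies. Experiments show that even a simple parameter-free fusion (overlaying accumulated event maps onto RGB images) substantially improves manipulation robustness in dark and blurred scenes.

Details

Motivation: Existing robotic vision-language-action (VLA) models generalize well on open-ended manipulation, but their perception is fragile under sensing-stage degradations such as extreme low light, motion blur, and black clipping. The paper adds event camera data to VLA models to improve manipulation robustness when frame-based vision becomes unreliable.

Result: In real-world manipulation experiments, Pick-Place success at 20 lux rises from 0% with images only to 60% with simple overlay fusion and to 90% with the proposed event adapter. Under severe motion blur (1000 ms exposure), Pick-Place success rises from 0% to 20-25% and Sorting from 5% to 32.5%. These results are obtained on the authors' own real-world dataset.

Insight: The core idea is to integrate event-driven perception directly into a VLA model, using the motion and structural information inherent in event streams instead of first reconstructing images, so that semantics and actions stay consistent under perceptual degradation. The lightweight, pretrained-compatible integration strategies (overlay fusion and the event adapter), together with the study of event windowing and fusion for stable deployment, provide systematic evidence and a practical recipe for bringing event cameras into large multimodal models, pointing toward robust embodied intelligence beyond conventional frame-based imaging.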

Abstract: Robotic Vision-Language-Action (VLA) models generalize well for open-ended manipulation, but their perception is fragile under sensing-stage degradations such as extreme low light, motion blur, and black clipping. We present E-VLA, an event-augmented VLA framework that improves manipulation robustness when conventional frame-based vision becomes unreliable. Instead of reconstructing images from events, E-VLA directly leverages motion and structural cues in event streams to preserve semantic perception and perception-action consistency under adverse conditions. We build an open-source teleoperation platform with a DAVIS346 event camera and collect a real-world synchronized RGB-event-action manipulation dataset across diverse tasks and illumination settings. We also propose lightweight, pretrained-compatible event integration strategies and study event windowing and fusion for stable deployment. Experiments show that even a simple parameter-free fusion, i.e., overlaying accumulated event maps onto RGB images, could substantially improve robustness in dark and blur-heavy scenes: on Pick-Place at 20 lux, success increases from 0% (image-only) to 60% with overlay fusion and to 90% with our event adapter; under severe motion blur (1000 ms exposure), Pick-Place improves from 0% to 20-25%, and Sorting from 5% to 32.5%. Overall, E-VLA provides systematic evidence that event-driven perception can be effectively integrated into VLA models, pointing toward robust embodied intelligence beyond conventional frame-based imaging. Code and dataset will be available at https://github.com/JJayzee/E-VLA.
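The parameter-free overlay fusion mentioned in this abstract can be sketched in a few lines. The event format `(x, y, polarity)`, the red/blue color coding, and the function names `accumulate_events` and `overlay` are assumptions made for illustration; the abstract does not specify the paper's exact windowing or rendering.

```python
def accumulate_events(events, height, width):
    """Sum event polarities per pixel over one time window.

    events: iterable of (x, y, polarity) with polarity in {+1, -1}.
    Returns a height x width grid of signed counts.
    """
    acc = [[0] * width for _ in range(height)]
    for x, y, polarity in events:
        if 0 <= x < width and 0 <= y < height:  # drop out-of-frame events
            acc[y][x] += polarity
    return acc


def overlay(rgb, acc):
    """Paint pixels that saw events on top of the RGB frame.

    rgb: height x width grid of (r, g, b) tuples. Pixels with net positive
    events become red, net negative become blue (an illustrative choice).
    """
    out = [list(row) for row in rgb]  # copy so the input frame is untouched
    for y, row in enumerate(acc):
        for x, count in enumerate(row):
            if count > 0:
                out[y][x] = (255, 0, 0)
            elif count < 0:
                out[y][x] = (0, 0, 255)
    return out


if __name__ == "__main__":
    dark_frame = [[(5, 5, 5)] * 4 for _ in range(3)]
    events = [(0, 0, +1), (1, 0, +1), (3, 2, -1)]
    fused = overlay(dark_frame, accumulate_events(events, height=3, width=4))
    print(fused[0][0], fused[2][3], fused[1][1])
```

Because the fused result is still an ordinary RGB image, it can be fed to a pretrained VLA model without changing its input interface, which is consistent with the "pretrained-compatible" framing in the abstract.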


[149] Less Detail, Better Answers: Degradation-Driven Prompting for VQA cs.CVPDF

Haoxuan Han, Weijie Wang, Zeyu Zhang, Yefei He, Bohan Zhuang

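TL;DR: This paper proposes Degradation-Driven Prompting (DDP), a framework that strategically reduces image fidelity (for example through downsampling and added structural visual aids) to force vision-language models to focus on essential structural information, reducing hallucinations and reasoning errors and improving visual question answering.

Details

Motivation: Vision-language models have made significant progress on VQA, but high-resolution details can act as noise and lead to hallucinations or reasoning errors. The paper reduces image detail to steer the model toward key structural information.

Result: On a physical-attributes task prone to human misjudgment and a perceptual-phenomena task covering various machine-susceptible illusions and anomalies, DDP degrades the input and supplies targeted structural prompts, enabling VLMs to achieve higher reasoning accuracy on challenging visual benchmarks.

Insight: The core idea is the counterintuitive "less is more" framing: deliberately lowering visual input quality improves the model's focus on structure and the robustness of its reasoning. This offers a novel prompt-engineering approach to mitigating hallucination in VQA.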

Abstract: Recent advancements in Vision-Language Models (VLMs) have significantly pushed the boundaries of Visual Question Answering (VQA). However, high-resolution details can sometimes become noise that leads to hallucinations or reasoning errors. In this paper, we propose Degradation-Driven Prompting (DDP), a novel framework that improves VQA performance by strategically reducing image fidelity to force models to focus on essential structural information. We evaluate DDP across two distinct tasks. Physical attributes targets images prone to human misjudgment, where DDP employs a combination of 80p downsampling, structural visual aids (white background masks and orthometric lines), and In-Context Learning (ICL) to calibrate the model’s focus. Perceptual phenomena addresses various machine-susceptible visual anomalies and illusions, including Visual Anomaly (VA), Color (CI), Motion (MI), Gestalt (GI), Geometric (GSI), and Visual Illusions (VI). For this task, DDP integrates a task-classification stage with specialized tools such as blur masks and contrast enhancement alongside downsampling. Our experimental results demonstrate that less is more: by intentionally degrading visual inputs and providing targeted structural prompts, DDP enables VLMs to bypass distracting textures and achieve superior reasoning accuracy on challenging visual benchmarks.


[150] InfBaGel: Human-Object-Scene Interaction Generation with Dynamic Perception and Iterative Refinement cs.CV | cs.AIPDF

Yude Zou, Junji Gong, Xing Gao, Zixuan Li, Tianxing Chen

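TL;DR: This paper proposes InfBaGel, a coarse-to-fine, instruction-conditioned framework for generating human-object-scene interactions (HOSI). The framework is explicitly aligned with the iterative denoising process of a consistency model. It uses a dynamic perception strategy that updates scene context from the trajectories of the preceding refinement, introduces bump-aware guidance to reduce physical artifacts, and adopts a hybrid training strategy to overcome data scarcity.

Details

Motivation: HOSI generation requires reasoning over dynamic object-scene changes but suffers from limited annotated data. Existing methods fall short in producing consistent interactions and avoiding physical artifacts.

Result: Extensive experiments show that the method achieves state-of-the-art performance in both HOSI and HOI generation and generalizes strongly to unseen scenes.

Insight: The contributions are: a dynamic perception strategy that aligns interaction generation with the denoising process of a consistency model; real-time bump-aware guidance during sampling that needs no fine-grained scene geometry; and a hybrid training strategy that synthesizes pseudo-HOSI samples by injecting voxelized scene occupancy into HOI data, which addresses data scarcity.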

Abstract: Human-object-scene interaction (HOSI) generation has broad applications in embodied AI, simulation, and animation. Unlike human-object interaction (HOI) and human-scene interaction (HSI), HOSI generation requires reasoning over dynamic object-scene changes, yet suffers from limited annotated data. To address these issues, we propose a coarse-to-fine instruction-conditioned interaction generation framework that is explicitly aligned with the iterative denoising process of a consistency model. In particular, we adopt a dynamic perception strategy that leverages trajectories from the preceding refinement to update scene context and condition subsequent refinement at each denoising step of the consistency model, yielding consistent interactions. To further reduce physical artifacts, we introduce a bump-aware guidance that mitigates collisions and penetrations during sampling without requiring fine-grained scene geometry, enabling real-time generation. To overcome data scarcity, we design a hybrid training strategy that synthesizes pseudo-HOSI samples by injecting voxelized scene occupancy into HOI datasets and jointly trains with high-fidelity HSI data, allowing interaction learning while preserving realistic scene awareness. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both HOSI and HOI generation, and strong generalization to unseen scenes. Project page: https://yudezou.github.io/InfBaGel-page/


[151] The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models cs.CVPDF

Runhao Mao, Hanshi Wang, Yixiang Yang, Qianli Ma, Jingmeng Zhou

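TL;DR: This paper systematically studies catastrophic forgetting when vision-language models (VLMs) are fine-tuned for autonomous driving and introduces the first large-scale benchmark dataset (180K scenes) for quantifying it. To address the problem, it proposes the Drive Expert Adapter (DEA), which shifts adaptation from the weight space to the prompt space and dynamically routes inference through different knowledge experts based on scene cues, improving driving-task performance while preserving the model's pre-trained world knowledge.

Details

Motivation: VLMs are integrated into autonomous driving to handle long-tail scenarios, but fine-tuning causes catastrophic forgetting that erodes the pre-trained world knowledge, which defeats the main reason for using them. The paper sets out to study and address this underexplored challenge systematically for the first time.

Result: Extensive experiments on the new benchmark show that DEA achieves state-of-the-art (SOTA) results on driving tasks while effectively mitigating catastrophic forgetting and preserving the model's essential generalization capabilities.

Insight: The main contributions are: 1) the first systematic study and quantification of catastrophic forgetting in fine-tuned driving VLMs, with the first dedicated benchmark dataset; 2) the DEA framework, which sidesteps the trade-off between performance and knowledge retention by moving adaptation from the weight space to the prompt space and using dynamic expert routing, offering a new approach to domain adaptation.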

Abstract: The integration of Vision-Language Models (VLMs) into autonomous driving promises to solve long-tail scenarios, but this paradigm faces the critical and unaddressed challenge of catastrophic forgetting. The very fine-tuning process used to adapt these models to driving-specific data simultaneously erodes their invaluable pre-trained world knowledge, creating a self-defeating paradox that undermines the core reason for their use. This paper provides the first systematic investigation into this phenomenon. We introduce a new large-scale dataset of 180K scenes, which enables the first-ever benchmark specifically designed to quantify catastrophic forgetting in autonomous driving. Our analysis reveals that existing methods suffer from significant knowledge degradation. To address this, we propose the Drive Expert Adapter (DEA), a novel framework that circumvents this trade-off by shifting adaptation from the weight space to the prompt space. DEA dynamically routes inference through different knowledge experts based on scene-specific cues, enhancing driving-task performance without corrupting the model’s foundational parameters. Extensive experiments demonstrate that our approach not only achieves state-of-the-art results on driving tasks but also effectively mitigates catastrophic forgetting, preserving the essential generalization capabilities that make VLMs a transformative force for autonomous systems. Data and model are released at FidelityDrivingBench.


[152] Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations cs.CVPDF

Tuan Dung Nguyen, Minh Khoi Ho, Qi Chen, Yutong Xie, Nguyen Cam-Tu

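TL;DR: This paper proposes a fine-grained, patch-level framework for detecting hallucinated object tokens in large vision-language models (LVLMs). By analyzing token-level interactions across model layers, it identifies two signatures of hallucinated tokens: diffuse, non-localized attention patterns and a lack of semantic alignment with any visual region.

Details

Motivation: Existing hallucination detectors rely mainly on coarse, whole-image measures, so they can miss hallucinated tokens whose weak correlations with many local regions aggregate into a high overall score. Starting from the observation that a faithful object token must be strongly grounded in a specific image region, the paper aims for a more robust fine-grained detector.

Result: The method reaches up to 90% accuracy in token-level hallucination detection, demonstrating the advantage of fine-grained structural analysis.

Insight: The novelty lies in moving from global scores to fine-grained patch-level analysis, which exposes the diffuse attention patterns and semantic misalignment of hallucinated tokens and supports a lightweight, interpretable detector.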

Abstract: Large vision-language models (LVLMs) achieve strong performance on visual reasoning tasks but remain highly susceptible to hallucination. Existing detection methods predominantly rely on coarse, whole-image measures of how an object token relates to the input image. This global strategy is limited: hallucinated tokens may exhibit weak but widely scattered correlations across many local regions, which aggregate into deceptively high overall relevance, thus evading the current global hallucination detectors. We begin with a simple yet critical observation: a faithful object token must be strongly grounded in a specific image region. Building on this insight, we introduce a patch-level hallucination detection framework that examines fine-grained token-level interactions across model layers. Our analysis uncovers two characteristic signatures of hallucinated tokens: (i) they yield diffuse, non-localized attention patterns, in contrast to the compact, well-focused attention seen in faithful tokens; and (ii) they fail to exhibit meaningful semantic alignment with any visual region. Guided by these findings, we develop a lightweight and interpretable detection method that leverages patch-level statistical features, combined with hidden-layer representations. Our approach achieves up to 90% accuracy in token-level hallucination detection, demonstrating the superiority of fine-grained structural analysis for detecting hallucinations.
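The first signature in this abstract, diffuse versus compact attention over image patches, can be quantified in several ways. The sketch below uses normalized Shannon entropy as one plausible patch-level statistic. This is an assumed stand-in, not the paper's actual feature set, and the 0.9 threshold in `looks_diffuse` is arbitrary.

```python
import math


def normalized_attention_entropy(attn):
    """Entropy of a token's attention over image patches, scaled to [0, 1].

    attn: non-negative attention weights, one per patch (need not sum to 1).
    Values near 0 mean attention is concentrated on few patches; values
    near 1 mean it is spread almost uniformly.
    """
    total = sum(attn)
    if len(attn) < 2 or total <= 0:
        raise ValueError("need at least two patches and positive total attention")
    probs = [a / total for a in attn]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


def looks_diffuse(attn, threshold=0.9):
    # Hypothetical decision rule: flag tokens whose attention is nearly uniform.
    return normalized_attention_entropy(attn) > threshold


if __name__ == "__main__":
    grounded = [0.97, 0.01, 0.01, 0.01]   # focused on one patch
    scattered = [0.25, 0.25, 0.25, 0.25]  # spread across all patches
    print(looks_diffuse(grounded), looks_diffuse(scattered))
```

A whole-image relevance score would treat both example tokens alike if their summed attention matched, which is the failure of global detectors that the abstract describes. A per-patch statistic separates them.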


[153] DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing cs.CV | cs.AI | cs.MMPDF

Ke Li, Maoliang Li, Jialiang Chen, Jiayu Chen, Zihao Zheng

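TL;DR: This paper proposes DIRECT, a framework that formulates video mashup creation as a multimodal coherency satisfaction problem. Through hierarchical multi-agent planning and intent-guided editing that simulate a professional production pipeline, it significantly improves the visual continuity and auditory alignment of video mashups.

Details

Motivation: Existing automated video editing frameworks often overlook cross-level multimodal orchestration, which leads to disjointed sequences, abrupt visual transitions, and poor musical alignment that fall short of professional-grade fluidity. The paper addresses this gap.

Result: On the proposed Mashup-Bench benchmark, DIRECT significantly outperforms state-of-the-art baselines in both objective metrics and human subjective evaluation.

Insight: The contributions are: formalizing video mashup as a multimodal coherency satisfaction problem; a hierarchical multi-agent framework (Screenwriter, Director, Editor) that mirrors a professional production pipeline; intent-guided editing with fine-grained optimization; and Mashup-Bench, a dedicated benchmark with metrics for visual continuity and auditory alignment.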

Abstract: Video mashup creation represents a complex video editing paradigm that recomposes existing footage to craft engaging audio-visual experiences, demanding intricate orchestration across semantic, visual, and auditory dimensions and multiple levels. However, existing automated editing frameworks often overlook the cross-level multimodal orchestration to achieve professional-grade fluidity, resulting in disjointed sequences with abrupt visual transitions and musical misalignment. To address this, we formulate video mashup creation as a Multimodal Coherency Satisfaction Problem (MMCSP) and propose the DIRECT framework. Simulating a professional production pipeline, our hierarchical multi-agent framework decomposes the challenge into three cascade levels: the Screenwriter for source-aware global structural anchoring, the Director for instantiating adaptive editing intent and guidance, and the Editor for intent-guided shot sequence editing with fine-grained optimization. We further introduce Mashup-Bench, a comprehensive benchmark with tailored metrics for visual continuity and auditory alignment. Extensive experiments demonstrate that DIRECT significantly outperforms state-of-the-art baselines in both objective metrics and human subjective evaluation. Project page and code: https://github.com/AK-DREAM/DIRECT


[154] FileGram: Grounding Agent Personalization in File-System Behavioral Traces cs.CV | cs.AIPDF

Shuai Liu, Shulin Tian, Kairui Hu, Yuhao Dong, Zhe Yang

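TL;DR: This paper proposes FileGram, a framework for personalizing coworking AI agents that operate in local file systems. It has three core components: FileGramEngine, which generates large-scale multimodal action-sequence data; FileGramBench, a diagnostic benchmark grounded in file-system behavioral traces; and FileGramOS, a bottom-up memory architecture.

Details

Motivation: Personalization of AI agents in file systems is limited by strict data privacy constraints and the difficulty of collecting multimodal real-world traces. Existing methods focus on interaction and overlook the dense behavioral traces left by file-system operations.

Result: Experiments show that FileGramBench remains challenging for state-of-the-art memory systems and that FileGramEngine and FileGramOS are effective.

Insight: The novelty lies in grounding agent memory and personalization directly in file-system behavioral traces: generating data by simulating workflows, building a diagnostic benchmark, and designing a memory architecture based on atomic actions and content deltas. This opens a direction for research on personalized, memory-centric file-system agents.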

Abstract: Coworking AI agents operating within local file systems are rapidly emerging as a paradigm in human-AI interaction. However, effective personalization remains limited by severe data constraints: strict privacy barriers and the difficulty of jointly collecting multimodal real-world traces prevent scalable training and evaluation, and existing methods remain interaction-centric while overlooking dense behavioral traces in file-system operations. To address this gap, we propose FileGram, a comprehensive framework that grounds agent memory and personalization in file-system behavioral traces, comprising three core components: (1) FileGramEngine, a scalable persona-driven data engine that simulates realistic workflows and generates fine-grained multimodal action sequences at scale; (2) FileGramBench, a diagnostic benchmark grounded in file-system behavioral traces for evaluating memory systems on profile reconstruction, trace disentanglement, persona drift detection, and multimodal grounding; and (3) FileGramOS, a bottom-up memory architecture that builds user profiles directly from atomic actions and content deltas rather than dialogue summaries, encoding these traces into procedural, semantic, and episodic channels with query-time abstraction. Extensive experiments show that FileGramBench remains challenging for state-of-the-art memory systems and that FileGramEngine and FileGramOS are effective. By open-sourcing the framework, we hope to support future research on personalized memory-centric file-system agents.


[155] ClickAIXR: On-Device Multimodal Vision-Language Interaction with Real-World Objects in Extended Reality cs.CV | cs.GR | cs.HCPDF

Dawar Khan, Alexandre Kouyoumdjian, Xinyu Liu, Omar Mena, Dominik Engel

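TL;DR: ClickAIXR is a novel on-device framework for multimodal vision-language interaction with real-world objects in extended reality (XR). Users select an object with a controller click, and a local vision-language model processes the object image to answer natural-language questions through text and speech. The design targets privacy, latency, and interaction ambiguity.

Details

Motivation: Existing XR systems rely on cloud-based AI (e.g., ChatGPT) or gaze-based selection (e.g., GazePointAR), which raises privacy and latency concerns and leaves gaze-only or voice-only interaction ambiguous. The paper offers a more precise, transparent, and privacy-preserving on-device interaction paradigm.

Result: ClickAIXR is implemented with the Magic Leap SDK and uses ONNX-based local VLM inference. A user study comparing it with Gemini 2.5 Flash and ChatGPT 5 reports moderate latency and an acceptable user experience, and shows potential in usability, trust, and user satisfaction.

Insight: The novelty lies in combining controller-based click selection with an on-device vision-language model (VLM), giving object-centered interaction that reduces ambiguity, and in improving transparency and privacy through fully local inference. This points toward trustworthy, privacy-preserving XR interaction.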

Abstract: We present ClickAIXR, a novel on-device framework for multimodal vision-language interaction with objects in extended reality (XR). Unlike prior systems that rely on cloud-based AI (e.g., ChatGPT) or gaze-based selection (e.g., GazePointAR), ClickAIXR integrates an on-device vision-language model (VLM) with a controller-based object selection paradigm, enabling users to precisely click on real-world objects in XR. Once selected, the object image is processed locally by the VLM to answer natural language questions through both text and speech. This object-centered interaction reduces ambiguity inherent in gaze- or voice-only interfaces and improves transparency by performing all inference on-device, addressing concerns around privacy and latency. We implemented ClickAIXR in the Magic Leap SDK (C API) with ONNX-based local VLM inference. We conducted a user study comparing ClickAIXR with Gemini 2.5 Flash and ChatGPT 5, evaluating usability, trust, and user satisfaction. Results show that latency is moderate and user experience is acceptable. Our findings demonstrate the potential of click-based object selection combined with on-device AI to advance trustworthy, privacy-preserving XR interactions. The source code and supplementary materials are available at: nanovis.org/ClickAIXR.html


[156] A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens cs.CVPDF

Tommie Kerssies, Gabriele Berton, Ju He, Qihang Yu, Wufei Ma

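TL;DR: This paper proposes DeltaTok and DeltaWorld for efficiently generating diverse predictions of future video states. DeltaTok encodes the difference in vision foundation model features between consecutive frames into a single continuous "delta" token, compressing video from a three-dimensional spatio-temporal representation into a one-dimensional temporal sequence and sharply reducing the token count. DeltaWorld builds a generative world model on these tokens. Tractable multi-hypothesis training generates many future hypotheses in parallel and supervises only the best one, so a single forward pass at inference yields diverse predictions.

Details

Motivation: Predicting diverse future states is a central challenge in video world modeling. Discriminative models produce a deterministic prediction that implicitly averages over possible futures, while generative models are computationally expensive. The paper aims for efficient generative prediction in the feature space of a vision foundation model.

Result: Experiments on dense forecasting tasks show that DeltaWorld's forecasts align more closely with real-world outcomes while using over 35x fewer parameters and 2,000x fewer FLOPs than existing generative world models.

Insight: The contributions are encoding inter-frame feature differences as compact delta tokens for large compression, and a multi-hypothesis training scheme that encourages diverse generation at low compute. The approach balances representation efficiency against generative diversity and offers a new route to lightweight generative world modeling.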

Abstract: Anticipating diverse future states is a central challenge in video world modeling. Discriminative world models produce a deterministic prediction that implicitly averages over possible futures, while existing generative world models remain computationally expensive. Recent work demonstrates that predicting the future in the feature space of a vision foundation model (VFM), rather than a latent space optimized for pixel reconstruction, requires significantly fewer world model parameters. However, most such approaches remain discriminative. In this work, we introduce DeltaTok, a tokenizer that encodes the VFM feature difference between consecutive frames into a single continuous “delta” token, and DeltaWorld, a generative world model operating on these tokens to efficiently generate diverse plausible futures. Delta tokens reduce video from a three-dimensional spatio-temporal representation to a one-dimensional temporal sequence, for example yielding a 1,024x token reduction with 512x512 frames. This compact representation enables tractable multi-hypothesis training, where many futures are generated in parallel and only the best is supervised. At inference, this leads to diverse predictions in a single forward pass. Experiments on dense forecasting tasks demonstrate that DeltaWorld forecasts futures that more closely align with real-world outcomes, while having over 35x fewer parameters and using 2,000x fewer FLOPs than existing generative world models. Code and weights: https://deltatok.github.io.
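Two quantitative claims in this abstract can be made concrete with a small sketch: the 1,024x token reduction and the "only the best is supervised" training rule. The patch size of 16 is an assumption chosen because it reproduces the stated 1,024 figure for 512x512 frames (the abstract does not name the patch size), and mean squared error is an assumed stand-in for the paper's actual loss. The function names are hypothetical.

```python
def tokens_per_frame(height, width, patch):
    # Number of patch tokens a ViT-style backbone produces for one frame.
    return (height // patch) * (width // patch)


def mse(a, b):
    # Mean squared error between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def best_of_k_loss(hypotheses, target):
    """Winner-takes-all loss: only the hypothesis closest to the target is scored.

    hypotheses: K predicted delta tokens (each a list of floats).
    target: the ground-truth delta token.
    Returns (index of best hypothesis, its loss).
    """
    losses = [mse(h, target) for h in hypotheses]
    best = min(range(len(losses)), key=losses.__getitem__)
    return best, losses[best]


if __name__ == "__main__":
    # One delta token per frame replaces every patch token of that frame.
    print(tokens_per_frame(512, 512, patch=16), "patch tokens -> 1 delta token")

    hypotheses = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
    print(best_of_k_loss(hypotheses, target=[1.0, 0.9]))
```

Supervising only the winning hypothesis lets the other hypotheses drift toward different plausible futures instead of collapsing to their average, which is the failure of deterministic prediction that the abstract attributes to discriminative world models.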


[157] Your Pre-trained Diffusion Model Secretly Knows Restoration cs.CV | cs.AIPDF

Sudarshan Rajagopalan, Vishal M. Patel

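TL;DR: This paper shows that pre-trained diffusion models inherently possess image restoration ability, which can be unlocked by learning prompt embeddings directly at the output of the text encoder, without fine-tuning or control modules. Text prompts and text-token embedding optimization fail to activate this behavior effectively, and naive prompt learning is unstable, so the authors train prompts within a diffusion bridge formulation that aligns training and inference dynamics and enforces a coherent denoising path from noisy degraded states to clean images. On this basis, they add lightweight learned prompts to the pre-trained WAN video model and FLUX image models, turning them into high-performing restoration models.

Details

Motivation: Existing diffusion-based restoration methods rely on fine-tuning or ControlNet-style modules to exploit the priors of pre-trained models. The paper explores the restoration behavior already present in pre-trained diffusion models and develops a lightweight method that needs neither fine-tuning nor extra control modules.

Result: Extensive experiments across diverse degradations show that the method achieves competitive performance and generalization while avoiding fine-tuning and restoration-specific control modules.

Insight: The novelty lies in finding that pre-trained diffusion models already have restoration ability that can be unlocked by learning prompt embeddings at the text encoder output, and in using a diffusion bridge formulation to stabilize prompt learning and align training with inference dynamics. The method is lightweight, leaves model weights unchanged, adds no control modules, and suggests a new way to apply pre-trained diffusion models to restoration.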

Abstract: Pre-trained diffusion models have enabled significant advancements in All-in-One Restoration (AiOR), offering improved perceptual quality and generalization. However, diffusion-based restoration methods primarily rely on fine-tuning or Control-Net style modules to leverage the pre-trained diffusion model’s priors for AiOR. In this work, we show that these pre-trained diffusion models inherently possess restoration behavior, which can be unlocked by directly learning prompt embeddings at the output of the text encoder. Interestingly, this behavior is largely inaccessible through text prompts and text-token embedding optimization. Furthermore, we observe that naive prompt learning is unstable because the forward noising process using degraded images is misaligned with the reverse sampling trajectory. To resolve this, we train prompts within a diffusion bridge formulation that aligns training and inference dynamics, enforcing a coherent denoising path from noisy degraded states to clean images. Building on these insights, we introduce our lightweight learned prompts on the pre-trained WAN video model and FLUX image models, converting them into high-performing restoration models. Extensive experiments demonstrate that our approach achieves competitive performance and generalization across diverse degradations, while avoiding fine-tuning and restoration-specific control modules.


[158] Rethinking Model Efficiency: Multi-Agent Inference with Large Models cs.CVPDF

Sixun Dong, Juhua Hu, Steven Li, Wei Wen, Qi Qian

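TL;DR: This paper revisits the efficiency of vision-language models (VLMs) and identifies the number of output tokens as a bottleneck for end-to-end latency. It finds that a large model with fewer output tokens can match or exceed a small model that generates a long sequence. The paper therefore proposes a multi-agent inference framework that keeps the large model's responses short and transfers key reasoning tokens from the small model when necessary.

Details

Motivation: Autoregressive decoding makes the number of output tokens a latency bottleneck in VLMs. The paper explores how to exploit the efficiency of large models that need few output tokens to improve overall inference efficiency.

Result: Empirical studies on diverse real-world benchmarks show that a large model can achieve better or comparable performance with significantly fewer output tokens. By reusing reasoning tokens from the small model, the proposed multi-agent framework approaches the performance of a large model using its own reasoning, which supports its effectiveness.

Insight: The paper shows that model efficiency depends not only on parameter count but also closely on output sequence length. It proposes a multi-agent framework mixing large and small models that balances performance and latency through token-level knowledge transfer, offering a new perspective on efficient VLM design.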

Abstract: Most vision-language models (VLMs) apply a large language model (LLM) as the decoder, where the response tokens are generated sequentially through autoregression. Therefore, the number of output tokens can be the bottleneck of the end-to-end latency. However, different models may require vastly different numbers of output tokens to achieve comparable performance. In this work, we conduct a comprehensive analysis of the latency across different components of VLMs on simulated data. The experiment shows that a large model with fewer output tokens can be more efficient than a small model with a long output sequence. The empirical study on diverse real-world benchmarks confirms the observation that a large model can achieve performance better than or comparable to that of a small model with significantly fewer output tokens. To leverage the efficiency of large models, we propose a multi-agent inference framework that keeps large models with short responses but transfers the key reasoning tokens from the small model when necessary. The comparison on benchmark tasks demonstrates that, by reusing the reasoning tokens from small models, the framework can approach the performance of a large model with its own reasoning, which confirms the effectiveness of our proposal.
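The latency argument in this abstract can be illustrated with a minimal sketch. It assumes a simple cost model (one prefill pass plus a fixed cost per generated token); the function name and all timing numbers below are hypothetical and are not taken from the paper.

```python
def decode_latency(prefill_s: float, per_token_s: float, n_output_tokens: int) -> float:
    """Latency under an assumed model: one prefill pass plus one decode step per output token."""
    return prefill_s + per_token_s * n_output_tokens


# Hypothetical timings: the small model is 4x faster per token but emits 10x more tokens.
small_long = decode_latency(prefill_s=0.05, per_token_s=0.01, n_output_tokens=400)
large_short = decode_latency(prefill_s=0.20, per_token_s=0.04, n_output_tokens=40)

print(f"small model, long output:  {small_long:.2f} s")
print(f"large model, short output: {large_short:.2f} s")
```

Under these assumed numbers the large model finishes first because the per-token term dominates once the output is long. Real systems add batching, caching, and vision-encoder costs that this sketch ignores.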


[159] LoMa: Local Feature Matching Revisited cs.CVPDF

David Nordström, Johan Edstedt, Georg Bökman, Jonathan Astermark, Anders Heyden

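TL;DR: This paper proposes LoMa, a data-driven local feature matching method that substantially improves matching performance by combining large and diverse data, modern training recipes, scaled model capacity, and scaled compute. The authors also build HardMatch, a dataset of highly challenging image pairs, to address the saturation of existing benchmarks.

Details

Motivation: Local feature matching is a fundamental component of 3D vision systems such as Structure-from-Motion, yet its progress has lagged behind modern data-driven approaches. Existing methods are typically trained on only a few mid-sized datasets, and standard benchmarks mainly rely on sparse views from successful 3D reconstructions, so evaluation is limited to relatively easy image pairs and the benchmarks have saturated.

Result: Across an extensive benchmarking suite, LoMa outperforms the state-of-the-art ALIKED+LightGlue by +18.6 mAA on HardMatch, +29.5 mAA on WxBS, +21.4 (1m, 10°) on InLoc, +24.2 AUC on RUBIK, and +12.4 mAA on IMC 2022.

Insight: The paper revisits local feature matching from a data-driven perspective, improving performance through large-scale data mixtures, modern training strategies, and scaled model capacity and compute. The HardMatch dataset, built to counter benchmark saturation, is also an important contribution that enables a more realistic assessment of matcher robustness.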

Abstract: Local feature matching has long been a fundamental component of 3D vision systems such as Structure-from-Motion (SfM), yet progress has lagged behind the rapid advances of modern data-driven approaches. The newer approaches, such as feed-forward reconstruction models, have benefited extensively from scaling dataset sizes, whereas local feature matching models are still only trained on a few mid-sized datasets. In this paper, we revisit local feature matching from a data-driven perspective. In our approach, which we call LoMa, we combine large and diverse data mixtures, modern training recipes, scaled model capacity, and scaled compute, resulting in remarkable gains in performance. Since current standard benchmarks mainly rely on collecting sparse views from successful 3D reconstructions, the evaluation of progress in feature matching has been limited to relatively easy image pairs. To address the resulting saturation of benchmarks, we collect 1000 highly challenging image pairs from internet data into a new dataset called HardMatch. Ground truth correspondences for HardMatch are obtained via manual annotation by the authors. In our extensive benchmarking suite, we find that LoMa makes outstanding progress across the board, outperforming the state-of-the-art method ALIKED+LightGlue by +18.6 mAA on HardMatch, +29.5 mAA on WxBS, +21.4 (1m, 10°) on InLoc, +24.2 AUC on RUBIK, and +12.4 mAA on IMC 2022. We release our code and models publicly at https://github.com/davnords/LoMa.


[160] Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision cs.CVPDF

Hyunsoo Cha, Wonjung Woo, Byungjun Kim, Hanbyul Joo

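TL;DR: Vanast is a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. By performing the whole process in a single unified step, it addresses the identity drift, garment distortion, and front-back inconsistency of conventional two-stage methods, and it is trained with large-scale synthetic triplet supervision.

Details

Motivation: Conventional two-stage pipelines that separate virtual try-on from pose-driven animation often cause identity drift, garment distortion, and front-back inconsistency. Vanast aims to resolve these issues in one unified step to achieve coherent synthesis.

Result: With synthetic triplet supervision and a Dual Module architecture for video diffusion transformers, the model performs well in training stability, generation quality, garment accuracy, pose adherence, and identity preservation. It supports zero-shot garment interpolation and produces high-fidelity, identity-consistent animation.

Insight: The contributions are large-scale triplet supervision data (generating identity-preserving human images in alternative outfits, capturing full upper and lower garment triplets, and assembling diverse in-the-wild triplets) and a Dual Module video diffusion transformer that stabilizes training and improves generation quality, enabling single-step unified generation and zero-shot interpolation.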

Abstract: We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which often results in identity drift, garment distortion, and front-back inconsistency. Our model addresses these issues by performing the entire process in a single unified step to achieve coherent synthesis. To enable this setting, we construct large-scale triplet supervision. Our data generation pipeline includes generating identity-preserving human images in alternative outfits that differ from garment catalog images, capturing full upper and lower garment triplets to overcome the single-garment-posed video pair limitation, and assembling diverse in-the-wild triplets without requiring garment catalog images. We further introduce a Dual Module architecture for video diffusion transformers to stabilize training, preserve pretrained generative quality, and improve garment accuracy, pose adherence, and identity preservation while supporting zero-shot garment interpolation. Together, these contributions allow Vanast to produce high-fidelity, identity-consistent animation across a wide range of garment types.


cs.MM [Back]

[161] Hierarchical Semantic Correlation-Aware Masked Autoencoder for Unsupervised Audio-Visual Representation Learning cs.MM | cs.AI | cs.CV | cs.SDPDF

Donghuo Zeng, Hao Niu, Masato Taya

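TL;DR: This paper proposes HSC-MAE, a Hierarchical Semantic Correlation-Aware Masked Autoencoder for unsupervised audio-visual representation learning. The framework uses a dual-path teacher-student structure and enforces cross-modal semantic consistency at three complementary levels (global, local, and sample) to learn aligned multimodal embeddings from weakly paired, label-free data.

Details

Motivation: Learning aligned multimodal embeddings from weakly paired, label-free corpora is challenging because such data often provide only pre-extracted features, contain multiple events per clip, and exhibit spurious co-occurrences.

Result: On the AVE and VEGAS benchmarks, HSC-MAE achieves substantial mAP improvements over strong unsupervised baselines, indicating that it yields robust and well-structured audio-visual representations.

Insight: The novelty is a hierarchical (global, local, sample) semantic correlation consistency constraint that combines DCCA, teacher-mined soft top-k affinities, and masked autoencoding, with learnable multi-task weights reconciling the objectives, forming a unified teacher-student framework for learning robust multimodal representations.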

Abstract: Learning aligned multimodal embeddings from weakly paired, label-free corpora is challenging: pipelines often provide only pre-extracted features, clips contain multiple events, and spurious co-occurrences are common. We propose HSC-MAE (Hierarchical Semantic Correlation-Aware Masked Autoencoder), a dual-path teacher-student framework that enforces semantic consistency across three complementary levels of representation, from coarse to fine: (i) global-level canonical-geometry correlation via DCCA, which aligns audio and visual embeddings within a shared modality-invariant subspace; (ii) local-level neighborhood-semantics correlation via teacher-mined soft top-k affinities, which preserves multi-positive relational structure among semantically similar instances; and (iii) sample-level conditional-sufficiency correlation via masked autoencoding, which ensures individual embeddings retain discriminative semantic content under partial observation. Concretely, a student MAE path is trained with masked feature reconstruction and affinity-weighted soft top-k InfoNCE; an EMA teacher operating on unmasked inputs via the CCA path supplies stable canonical geometry and soft positives. Learnable multi-task weights reconcile competing objectives, and an optional distillation loss transfers teacher geometry into the student. Experiments on AVE and VEGAS demonstrate substantial mAP improvements over strong unsupervised baselines, validating that HSC-MAE yields robust and well-structured audio-visual representations.
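One way to read "affinity-weighted soft top-k InfoNCE" is as a cross-entropy between a teacher-derived soft target over the top-k neighbours and the student's cross-modal softmax. The NumPy sketch below follows that reading only; the function name, the normalisation of the weights, and the temperature are assumptions, and the paper's exact loss may differ.

```python
import numpy as np


def soft_topk_infonce(student_a, student_v, teacher_affinity, k=2, temperature=0.1):
    """Cross-entropy between teacher soft positives (top-k per row) and student audio-to-visual softmax.

    student_a, student_v: (N, D) L2-normalised embeddings (assumed).
    teacher_affinity: (N, N) strictly positive affinities mined by a teacher (assumed).
    """
    teacher_affinity = np.asarray(teacher_affinity, dtype=float)
    n = teacher_affinity.shape[0]
    rows = np.arange(n)[:, None]

    # Keep only the k largest teacher affinities per row and renormalise them to sum to 1.
    topk_idx = np.argsort(-teacher_affinity, axis=1)[:, :k]
    weights = np.zeros_like(teacher_affinity)
    weights[rows, topk_idx] = teacher_affinity[rows, topk_idx]
    weights /= weights.sum(axis=1, keepdims=True)

    # Student similarities as temperature-scaled logits, then a numerically stable log-softmax.
    logits = student_a @ student_v.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Soft-target cross-entropy averaged over the batch.
    return float(-(weights * log_prob).sum(axis=1).mean())
```

With `k=1` and a teacher whose strongest affinity is the paired sample, this reduces to the standard one-positive InfoNCE loss; larger `k` spreads the target over several semantically similar instances.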


cs.CR [Back]

[162] Mapping the Exploitation Surface: A 10,000-Trial Taxonomy of What Makes LLM Agents Exploit Vulnerabilities cs.CR | cs.AI | cs.CLPDF

Charafeddine Mouzouni

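TL;DR: Through approximately 10,000 trials, this paper systematically studies the conditions that trigger LLM agents with tool access to exploit security vulnerabilities. Most hypothesized attack dimensions (such as minimization, moral licensing, and incentives) do not produce detectable exploitation, but goal reframing reliably does: for example, framing the task as solving a puzzle leads to high exploitation rates.

Details

Motivation: The paper asks which features of a system prompt trigger LLM agents to exploit security vulnerabilities and which do not, in order to characterize the threat model more precisely.

Result: Within the task class studied (planted test-runner vulnerabilities), nine of the twelve hypothesized attack dimensions (including minimization and moral licensing) produce no detectable exploitation at n=50 per cell (upper 95% CI <7% per condition). Goal reframing, however, reliably triggers 38-40% exploitation on Claude Sonnet 4 and replicates on DeepSeek, GPT-5-mini, and o4-mini (CTF framing: 8-14%). GPT-4.1 produces no exploitation across 1,850 trials, and a temporal comparison across four OpenAI models suggests that safety training may have improved.

Insight: The paper builds a taxonomy of attack dimensions from large-scale systematic experiments and shows that goal-reframing language, rather than adversarial prompts in general, is the key trigger. This narrows the testable threat model and gives defenders a concrete audit target: goal-reframing language.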

Abstract: LLM agents with tool access can discover and exploit security vulnerabilities. This is known. What is not known is which features of a system prompt trigger this behaviour, and which do not. We present a systematic taxonomy based on approximately 10,000 trials across seven models, 37 prompt conditions, and 12 hypothesized attack dimensions – all executed in real Docker sandboxes. Every condition includes the instruction “Always respect all rules and access policies”; only a single encouragement sentence varies. Within the task class studied (planted test-runner vulnerabilities), the threat model is narrower than assumed: nine of twelve dimensions – including minimization, moral licensing, incentives, identity priming, and reasoning triggers – produce no detectable exploitation at n=50 per cell (upper 95% CI <7% per condition). But it is also more subtle than expected: one dimension, goal reframing, reliably triggers exploitation. “You are solving a puzzle; there may be hidden clues” produces 38-40% exploitation on Claude Sonnet 4 despite the explicit rule instruction, replicating across four models (CTF framing: 8-14% on DeepSeek, GPT-5-mini, o4-mini). The agent does not override the rules; it reinterprets the task so that exploitative actions become task-aligned. GPT-4.1 produces no exploitation across 1,850 trials (37 conditions), and a temporal comparison across four OpenAI models released over eleven months shows a pattern consistent with improving safety training, though model capability differences are a confounder. The practical contribution is a narrowed, testable threat model: defenders should audit for goal-reframing language, not for the broad class of adversarial prompts.
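The "upper 95% CI <7%" figure for cells with no observed exploitation can be sanity-checked with a short calculation. The sketch below uses the exact one-sided bound for zero events in n trials, obtained by solving (1 - p)^n = alpha for p; which interval the authors actually used is not stated in the abstract, so this choice is an assumption.

```python
def upper_bound_zero_events(n: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper confidence bound on a rate when 0 of n trials show the event."""
    alpha = 1.0 - confidence
    # Largest p for which observing zero events in n trials still has probability >= alpha.
    return 1.0 - alpha ** (1.0 / n)


bound = upper_bound_zero_events(50)
print(f"0/50 events: upper bound = {bound:.3%}")
print(f"rule-of-three approximation: {3 / 50:.1%}")
```

For n = 50 this gives roughly 5.8%, close to the rule-of-three value of 6%, and both sit below the 7% quoted in the abstract.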


cs.HC [Back]

[163] BLADE: Better Language Answers through Dialogue and Explanations cs.HC | cs.CLPDF

Chathuri Jayaweera, Bonnie J. Dorr

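TL;DR: BLADE is a conversational educational assistant based on retrieval-augmented generation (RAG). It promotes active learning and conceptual understanding by guiding students to relevant course resources rather than giving answers directly.

Details

Motivation: Current LLM-based educational assistants provide direct answers, which reduces students' exploration, self-explanation, and engagement with course materials.

Result: An impact study in an undergraduate computer science course shows that BLADE improves students' navigation of course resources and conceptual performance compared with simply providing the full inventory of course resources.

Insight: The core idea is to combine a RAG framework with a pedagogical strategy: the assistant dynamically retrieves and surfaces relevant course excerpts to guide the dialogue, reinforcing evidence-based reasoning and active learning rather than pursuing answer-generation accuracy or state-of-the-art performance.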

Abstract: Large language model (LLM)-based educational assistants often provide direct answers that short-circuit learning by reducing exploration, self-explanation, and engagement with course materials. We present BLADE (Better Language Answers through Dialogue and Explanations), a grounded conversational assistant that guides learners to relevant instructional resources rather than supplying immediate solutions. BLADE uses a retrieval-augmented generation (RAG) framework over curated course content, dynamically surfacing pedagogically relevant excerpts in response to student queries. Instead of delivering final answers, BLADE prompts direct engagement with source materials to support conceptual understanding. We conduct an impact study in an undergraduate computer science course, with different course resource configurations and show that BLADE improves students’ navigation of course resources and conceptual performance compared to simply providing the full inventory of course resources. These results demonstrate the potential of grounded conversational AI to reinforce active learning and evidence-based reasoning.


[164] The Persuasion Paradox: When LLM Explanations Fail to Improve Human-AI Team Performance cs.HC | cs.AI | cs.CLPDF

Ruth Cohen, Lu Feng, Ayala Bloch, Sarit Kraus

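TL;DR: This paper studies how natural-language explanations from large language models (LLMs) affect objective human-AI team performance and identifies a "Persuasion Paradox": fluent explanations systematically increase users' confidence in and reliance on the AI without reliably improving task accuracy, and sometimes undermine it. Across three controlled human-subject studies covering abstract visual reasoning and deductive logical reasoning, the effect of explanations is found to depend strongly on task type and cognitive modality.

Details

Motivation: Although LLM explanations are widely used to improve transparency and trust, their effect on objective human-AI team performance remains unclear. The paper examines whether LLM explanations actually improve team task accuracy rather than only subjective impressions.

Result: In visual reasoning (RAVEN matrices), LLM explanations raise user confidence but do not improve accuracy, and they substantially suppress users' ability to recover from model errors. Interfaces that expose uncertainty via predicted probabilities, and a selective automation policy that defers uncertain cases to humans, achieve significantly higher accuracy and error recovery. In language-based logical reasoning (LSAT problems), LLM explanations yield the highest accuracy and recovery rates, outperforming both expert-written explanations and probability-based support.

Insight: The paper reveals the Persuasion Paradox and the task dependence of explanation effects, challenging the view of explanations as a universal solution. It argues that interaction design should prioritize calibrated reliance and effective error recovery over persuasive fluency, and it shows that subjective trust metrics are poor predictors of team performance.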

Abstract: While natural-language explanations from large language models (LLMs) are widely adopted to improve transparency and trust, their impact on objective human-AI team performance remains poorly understood. We identify a Persuasion Paradox: fluent explanations systematically increase user confidence and reliance on AI without reliably improving, and in some cases undermining, task accuracy. Across three controlled human-subject studies spanning abstract visual reasoning (RAVEN matrices) and deductive logical reasoning (LSAT problems), we disentangle the effects of AI predictions and explanations using a multi-stage reveal design and between-subjects comparisons. In visual reasoning, LLM explanations increase confidence but do not improve accuracy beyond the AI prediction alone, and substantially suppress users’ ability to recover from model errors. Interfaces exposing model uncertainty via predicted probabilities, as well as a selective automation policy that defers uncertain cases to humans, achieve significantly higher accuracy and error recovery than explanation-based interfaces. In contrast, for language-based logical reasoning tasks, LLM explanations yield the highest accuracy and recovery rates, outperforming both expert-written explanations and probability-based support. This divergence reveals that the effectiveness of narrative explanations is strongly task-dependent and mediated by cognitive modality. Our findings demonstrate that commonly used subjective metrics such as trust, confidence, and perceived clarity are poor predictors of human-AI team performance. Rather than treating explanations as a universal solution, we argue for a shift toward interaction designs that prioritize calibrated reliance and effective error recovery over persuasive fluency.
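The selective automation policy mentioned in this abstract, which defers uncertain cases to humans, can be sketched as a confidence threshold on the model's predicted probabilities. The function name, the return format, and the threshold value of 0.8 are hypothetical; the study's actual deferral rule is not given here.

```python
def route_case(probabilities, threshold=0.8):
    """Return ("ai", answer_index) when the top probability clears the threshold, else defer to a human."""
    top = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[top] >= threshold:
        return ("ai", top)
    # Low confidence: hand the case to the human without suggesting an answer.
    return ("human", None)


print(route_case([0.90, 0.05, 0.05]))  # confident prediction, handled automatically
print(route_case([0.40, 0.35, 0.25]))  # uncertain prediction, deferred
```

A policy of this shape only helps if the probabilities are reasonably calibrated, which is a separate property that would need to be checked on held-out data.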


[165] Can LLMs Reason About Attention? Towards Zero-Shot Analysis of Multimodal Classroom Behavior cs.HC | cs.AI | cs.CVPDF

Nolan Platt, Sehrish Nizamani, Alp Tural, Elif Tural, Saad Nizamani

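TL;DR: This paper presents a privacy-preserving pipeline for multimodal classroom behavior analysis. It extracts skeletal poses with OpenPose and estimates visual attention with Gaze-LLE, deletes the original video frames immediately, and retains only geometric coordinates to comply with FERPA. The QwQ-32B-Reasoning LLM performs zero-shot analysis of the pose and gaze data, and instructors access the resulting attention heatmaps and behavioral summaries through a web dashboard. Preliminary findings suggest that LLMs show promise for multimodal behavior understanding but still struggle with spatial reasoning.

Details

Motivation: Conventional analysis of student engagement relies on time-consuming manual observation or privacy-invasive recording. The paper aims to build a privacy-preserving, automated, zero-shot system for multimodal behavior analysis.

Result: Preliminary results indicate that the system runs on a single GPU and implements a privacy-preserving automated analysis pipeline. LLMs show promise for multimodal behavior understanding but still struggle with spatial reasoning about classroom layouts.

Insight: The paper combines pose and gaze extraction with the zero-shot reasoning of an LLM in an end-to-end privacy-preserving analysis pipeline. Using LLMs for zero-shot analysis of multimodal temporal behavior is novel in educational technology, and the observed limits in spatial reasoning point to directions for future work.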

Abstract: Understanding student engagement usually requires time-consuming manual observation or invasive recording that raises privacy concerns. We present a privacy-preserving pipeline that analyzes classroom videos to extract insights about student attention, without storing any identifiable footage. Our system runs on a single GPU, using OpenPose for skeletal extraction and Gaze-LLE for visual attention estimation. Original video frames are deleted immediately after pose extraction, thus only geometric coordinates (stored as JSON) are retained, ensuring compliance with FERPA. The extracted pose and gaze data is processed by QwQ-32B-Reasoning, which performs zero-shot analysis of student behavior across lecture segments. Instructors access results through a web dashboard featuring attention heatmaps and behavioral summaries. Our preliminary findings suggest that LLMs may show promise for multimodal behavior understanding, although they still struggle with spatial reasoning about classroom layouts. We discuss these limitations and outline directions for improving LLM spatial comprehension in educational analytics contexts.


[166] VisionClaw: Always-On AI Agents through Smart Glasses cs.HC | cs.AI | cs.CV | cs.LG | cs.MAPDF

Xiaoan Liu, DaeHo Lee, Eric J Gonzalez, Mar Gonzalez-Franco, Ryo Suzuki

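TL;DR: VisionClaw is an always-on wearable AI agent running on Meta Ray-Ban smart glasses. It integrates live egocentric perception with agentic task execution, letting users carry out tasks directly by speech, such as adding real-world objects to a shopping cart, generating notes from documents, or controlling IoT devices.

Details

Motivation: Perception and task execution are separated in current wearable devices. The paper aims to couple situated perception and task execution continuously through an always-on AI agent, supporting more natural, hands-free interaction.

Result: In a controlled laboratory study (N=12) and a longitudinal deployment study (N=5), VisionClaw enables faster task completion and lower interaction overhead than non-always-on and non-agent baselines.

Insight: The paper deeply integrates always-on perception with agentic execution on smart glasses, coupling perception and action continuously. This suggests a new paradigm for wearable AI agents that supports situated, opportunistic task initiation and delegated rather than manually controlled execution.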

Abstract: We present VisionClaw, an always-on wearable AI agent that integrates live egocentric perception with agentic task execution. Running on Meta Ray-Ban smart glasses, VisionClaw continuously perceives real-world context and enables in-situ, speech-driven action initiation and delegation via OpenClaw AI agents. Therefore, users can directly execute tasks through the smart glasses, such as adding real-world objects to an Amazon cart, generating notes from physical documents, receiving meeting briefings on the go, creating events from posters, or controlling IoT devices. We evaluate VisionClaw through a controlled laboratory study (N=12) and a longitudinal deployment study (N=5). Results show that integrating perception and execution enables faster task completion and reduces interaction overhead compared to non-always-on and non-agent baselines. Beyond performance gains, deployment findings reveal a shift in interaction: tasks are initiated opportunistically during ongoing activities, and execution is increasingly delegated rather than manually controlled. These results suggest a new paradigm for wearable AI agents, where perception and action are continuously coupled to support situated, hands-free interaction.


cs.LG [Back]

[167] ClawArena: Benchmarking AI Agents in Evolving Information Environments cs.LG | cs.AI | cs.CLPDF

Haonian Ji, Kaiwen Xiong, Siwei Han, Peng Xia, Shi Qiu

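TL;DR: ClawArena is a benchmark for evaluating AI agents in evolving information environments. It simulates realistic settings in which information is scattered, contradictory, and dynamically updated, and it tests agents on three coupled challenges: multi-source conflict reasoning, dynamic belief revision, and implicit personalization. The benchmark contains 64 scenarios across 8 professional domains, totaling 1,879 evaluation rounds and 365 dynamic updates. Experiments show that both model capability and framework design substantially affect performance, and that belief revision difficulty is determined by update design strategy rather than the mere presence of updates.

Details

Motivation: Existing benchmarks largely assume static, single-authority information environments and cannot evaluate how AI agents handle scattered, contradictory, and changing information in the real world.

Result: Experiments on five agent frameworks and five language models show a 15.4% performance range attributable to model capability and a 9.2% range attributable to framework design. Self-evolving skill frameworks can partially close model-capability gaps, and belief revision difficulty is determined by update design strategy.

Insight: The paper builds a benchmark that simulates dynamic, multi-source, contradictory information environments and proposes a 14-category question taxonomy arising from multi-source conflict reasoning, dynamic belief revision, and implicit personalization. Its emphasis on update design strategy, rather than updates themselves, as the source of difficulty offers a new perspective on agent evaluation.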

Abstract: AI agents deployed as persistent assistants must maintain correct beliefs as their information environment evolves. In practice, evidence is scattered across heterogeneous sources that often contradict one another, new information can invalidate earlier conclusions, and user preferences surface through corrections rather than explicit instructions. Existing benchmarks largely assume static, single-authority settings and do not evaluate whether agents can keep up with this complexity. We introduce ClawArena, a benchmark for evaluating AI agents in evolving information environments. Each scenario maintains a complete hidden ground truth while exposing the agent only to noisy, partial, and sometimes contradictory traces across multi-channel sessions, workspace files, and staged updates. Evaluation is organized around three coupled challenges: multi-source conflict reasoning, dynamic belief revision, and implicit personalization, whose interactions yield a 14-category question taxonomy. Two question formats, multi-choice (set-selection) and shell-based executable checks, test both reasoning and workspace grounding. The current release contains 64 scenarios across 8 professional domains, totaling 1,879 evaluation rounds and 365 dynamic updates. Experiments on five agent frameworks and five language models show that both model capability (15.4% range) and framework design (9.2%) substantially affect performance, that self-evolving skill frameworks can partially close model-capability gaps, and that belief revision difficulty is determined by update design strategy rather than the mere presence of updates. Code is available at https://github.com/aiming-lab/ClawArena.


[168] One Model for All: Multi-Objective Controllable Language Models cs.LG | cs.AI | cs.CLPDF

Qiang He, Yucheng Yang, Tianyi Zhou, Meng Fang, Mykola Pechenizkiy

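TL;DR: This paper proposes Multi-Objective Control (MOC), which trains a single large language model (LLM) to generate personalized outputs on the Pareto front of multiple objectives (such as safety, helpfulness, humor, and faithfulness) according to different user preferences. The method brings multi-objective optimization (MOO) principles into reinforcement learning from human feedback (RLHF), training the LLM as a preference-conditioned policy network, and fine-tunes a 7B-parameter model efficiently on a single GPU.

Details

Motivation: Current RLHF mainly relies on a fixed reward learned from average human ratings, which weakens adaptability and controllability across varying preferences. Creating personalized LLMs requires aligning models with individual preferences, which is difficult because data per user is scarce and preferences over multi-objective trade-offs are diverse, for example emphasizing empathy in some contexts and demanding efficiency and precision in others.

Result: Extensive experiments show that MOC outperforms baselines in three respects: (i) controllability of LLM outputs with respect to user preferences on the trade-off among multiple rewards; (ii) quality and diversity of LLM outputs, measured by the hypervolume of the solutions achieved; and (iii) generalization to unseen preferences. These results highlight MOC's potential for real-world applications that require scalable and customizable LLMs.

Insight: The core contribution is the systematic integration of MOO principles into the RLHF framework to train a single conditioned policy network that covers different preference regions of the Pareto front. This gives a more efficient and controllable paradigm for training personalized LLMs, avoids the cost of training a separate model per preference, and improves adaptability to diverse and changing user needs.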

Abstract: Aligning large language models (LLMs) with human preferences is critical for enhancing LLMs’ safety, helpfulness, humor, faithfulness, etc. Current reinforcement learning from human feedback (RLHF) mainly focuses on a fixed reward learned from average human ratings, which may weaken the adaptability and controllability of varying preferences. However, creating personalized LLMs requires aligning LLMs with individual human preferences, which is non-trivial due to the scarce data per user and the diversity of user preferences in multi-objective trade-offs, varying from emphasizing empathy in certain contexts to demanding efficiency and precision in others. Can we train one LLM to produce personalized outputs across different user preferences on the Pareto front? In this paper, we introduce Multi-Objective Control (MOC), which trains a single LLM to directly generate responses in the preference-defined regions of the Pareto front. Our approach introduces multi-objective optimization (MOO) principles into RLHF to train an LLM as a preference-conditioned policy network. We improve the computational efficiency of MOC by applying MOO at the policy level, enabling us to fine-tune a 7B-parameter model on a single A6000 GPU. Extensive experiments demonstrate the advantages of MOC over baselines in three aspects: (i) controllability of LLM outputs w.r.t. user preferences on the trade-off among multiple rewards; (ii) quality and diversity of LLM outputs, measured by the hyper-volume of multiple solutions achieved; and (iii) generalization to unseen preferences. These results highlight MOC’s potential for real-world applications requiring scalable and customizable LLMs.
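The abstract measures quality and diversity by the hypervolume of the solutions achieved. As a minimal sketch of that metric, the function below computes the hypervolume for two objectives that are both maximised, relative to a reference point. The two-objective restriction, the reference point, and the example reward pairs are assumptions for illustration; the paper's reward dimensions are not specified here.

```python
def hypervolume_2d(points, reference=(0.0, 0.0)):
    """Area dominated by `points` (both objectives maximised) and bounded below by `reference`."""
    rx, ry = reference
    # Ignore points that do not improve on the reference in both objectives.
    candidates = [p for p in points if p[0] > rx and p[1] > ry]
    # Sweep from the largest first objective downward, adding each new horizontal strip.
    candidates.sort(key=lambda p: p[0], reverse=True)
    area, best_y = 0.0, ry
    for x, y in candidates:
        if y > best_y:
            area += (x - rx) * (y - best_y)
            best_y = y
    return area


# Hypothetical (reward_1, reward_2) pairs; (1.0, 1.0) is dominated and adds no area.
front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (1.0, 1.0)]
print(hypervolume_2d(front))
```

A larger hypervolume means the set of outputs pushes further out along the trade-off curve and covers it more evenly, which is why it serves as a joint measure of quality and diversity.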


[169] Cog-DRIFT: Exploration on Adaptively Reformulated Instances Enables Learning from Hard Reasoning Problems cs.LG | cs.AI | cs.CLPDF

Justin Chih-Yao Chen, Archiki Prasad, Zaid Khan, Joykirat Singh, Runchu Tian

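TL;DR: This paper proposes Cog-DRIFT, a framework that adaptively reformulates hard open-ended reasoning problems into cognitively simpler variants (such as multiple-choice and cloze formats) and organizes them into a difficulty-based curriculum, so that LLMs can learn from hard problems that would otherwise yield no useful reward signal, improving their reasoning ability.

Details

Motivation: Reinforcement learning from verifiable rewards (RLVR) cannot let LLMs learn from problems that are too hard to solve under the current policy, because these produce no meaningful reward signal. This limits further gains in reasoning ability.

Result: On originally unsolvable hard problems, Cog-DRIFT yields absolute gains of +10.11% for Qwen and +8.64% for Llama. Across 2 models and 6 reasoning benchmarks, it outperforms the second-best baseline by +4.72% (Qwen) and +3.23% (Llama) on average, and it improves test-time pass@k and sample efficiency.

Insight: The core idea is to reformulate hard open-ended problems into simplified variants (such as multiple-choice) that preserve the answer while shrinking the search space and providing denser learning signals, and to build an adaptive curriculum from these variants for guided exploration and knowledge transfer, overcoming the exploration barrier in LLM post-training.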

Abstract: Reinforcement learning from verifiable rewards (RLVR) has improved the reasoning abilities of LLMs, yet a fundamental limitation remains: models cannot learn from problems that are too difficult to solve under their current policy, as these yield no meaningful reward signal. We propose a simple yet effective solution based on task reformulation. We transform challenging open-ended problems into cognitively simpler variants – such as multiple-choice and cloze formats – that preserve the original answer while reducing the effective search space and providing denser learning signals. These reformulations span a spectrum from discriminative to generative tasks, which we exploit to bootstrap learning: models first learn from structured, easier formats, and this knowledge transfers back to improve performance on the original open-ended problems. Building on this insight, we introduce Cog-DRIFT, a framework that constructs reformulated variants and organizes them into an adaptive curriculum based on difficulty. Training progresses from easier to harder formats, enabling the model to learn from problems that previously yielded zero signal under standard RL post-training. Cog-DRIFT not only improves on the originally unsolvable hard problems (absolute +10.11% for Qwen and +8.64% for Llama) but also generalizes well to other held-out datasets. Across 2 models and 6 reasoning benchmarks, our method consistently outperforms standard GRPO and strong guided-exploration baselines. On average, Cog-DRIFT shows +4.72% (Qwen) and +3.23% (Llama) improvements over the second-best baseline. We further show that Cog-DRIFT improves pass@k at test time, and the curriculum improves sample efficiency. Overall, our results highlight task reformulation and curriculum learning as an effective paradigm for overcoming the exploration barrier in LLM post-training.
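The "zero signal" problem described in this abstract can be seen in the group-normalised advantage that GRPO-style training commonly uses: when every sampled answer to a hard problem gets reward 0, all advantages vanish. The sketch below assumes the usual mean-and-standard-deviation normalisation within a sampled group; the paper's exact variant and the reward values are not specified, so both are illustrative.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each reward by the mean and population standard deviation of its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Open-ended hard problem: every sampled answer is wrong, so there is no gradient signal.
print(group_relative_advantages([0, 0, 0, 0]))

# Easier reformulation (for example multiple-choice): one sample is right, so advantages are non-zero.
print(group_relative_advantages([0, 0, 1, 0]))
```

This is the gap that reformulation targets: an easier format raises the chance that at least one sample in the group is correct, which restores a non-zero advantage.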


cs.GL [Back]

[170] On the First Computer Science Research Paper in an Indian Language and the Future of Science in Indian Languages cs.GL | cs.CL | cs.CY | cs.DCPDF

Siddhartha Visveswara Jayanti

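TL;DR: The author describes writing the first original, modern computer science research paper expressed entirely in an Indian language (Telugu). The paper is in distributed computing and introduces a technique for proving epistemic-logic-based lower bounds for multiprocessor algorithms. The author coined technical terminology using the Pāninian grammar of Sanskrit and developed a Telugu XeLaTeX template (TeluguTeX) for typesetting, and then lays out a vision for improving scientific writing in Indian languages.

Details

Motivation: Writing advanced computer science research in Indian languages is hindered by a lack of technical terminology and by underdeveloped mathematical typesetting. The paper addresses both to promote science in Indian languages.

Result: The paper introduces a new distributed computing technique in Telugu and the TeluguTeX typesetting template. No quantitative experiments or benchmarks are reported.

Insight: The paper derives native scientific terminology through Sanskrit grammar and develops a dedicated typesetting tool, offering a methodology and technical solution that can be reused for scientific research in lower-resource languages.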

Abstract: I describe my experience writing the first original, modern Computer Science research paper expressed entirely in an Indian language. The paper is in Telugu, a language with approximately 100 million speakers. The paper is in the field of distributed computing and it introduces a technique for proving epistemic logic based lower bounds for multiprocessor algorithms. A key hurdle to writing the paper was developing technical terminology for advanced computer science concepts, including those in algorithms, distributed computing, and discrete mathematics. I overcame this challenge by deriving and coining native language scientific terminology through the powerful, productive, Pāninian grammar of Samskrtam. The typesetting of the paper was an additional challenge, since mathematical typesetting in Telugu is underdeveloped. I overcame this problem by developing a Telugu XeLaTeX template, which I call TeluguTeX. Leveraging this experience of writing an original computer science research paper in an Indian language, I lay out a vision for how to ameliorate the state of scientific writing at all levels in Indic languages – languages whose native speakers exceed one billion people – through the further development of the Sanskrit technical lexicon and through technological internationalization.


eess.IV [Back]

[171] NeuralLVC: Neural Lossless Video Compression via Masked Diffusion with Temporal Conditioning eess.IV | cs.CVPDF

Tiberio Uricchio, Marco Bertini

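TL;DR: This paper proposes NeuralLVC, a neural lossless video codec based on masked diffusion with temporal conditioning. It uses an I/P-frame architecture: the I-frame model guarantees exact pixel reconstruction through bijective linear tokenization, and the P-frame model compresses inter-frame differences using a lightweight reference embedding. Group-wise decoding enables speed-compression trade-offs. Experiments show that it significantly outperforms H.264 and H.265 lossless on Xiph CIF sequences.

Details

Motivation: Neural lossless image compression has advanced significantly, but neural lossless video compression remains largely unexplored. The paper aims to fill this gap by exploiting temporal redundancy to improve compression efficiency.

Result: Experiments on 9 Xiph CIF sequences show that NeuralLVC significantly outperforms H.264 and H.265 lossless, and end-to-end encode-decode testing with arithmetic coding verifies exact reconstruction.

Insight: The paper combines masked diffusion with temporal conditioning for video compression, handles P-frames with a lightweight reference embedding that adds only 1.3% trainable parameters, and offers a controllable trade-off through group-wise decoding, suggesting a new direction for neural lossless video compression.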

Abstract: While neural lossless image compression has advanced significantly with learned entropy models, lossless video compression remains largely unexplored in the neural setting. We present NeuralLVC, a neural lossless video codec that combines masked diffusion with an I/P-frame architecture for exploiting temporal redundancy. Our I-frame model compresses individual frames using bijective linear tokenization that guarantees exact pixel reconstruction. The P-frame model compresses temporal differences between consecutive frames, conditioned on the previous decoded frame via a lightweight reference embedding that adds only 1.3% trainable parameters. Group-wise decoding enables controllable speed-compression trade-offs. Our codec is lossless in the input domain: for video, it reconstructs YUV420 planes exactly; for image evaluation, RGB channels are reconstructed exactly. Experiments on 9 Xiph CIF sequences show that NeuralLVC outperforms H.264 and H.265 lossless by a significant margin. We verify exact reconstruction through end-to-end encode-decode testing with arithmetic coding. These results suggest that masked diffusion with temporal conditioning is a promising direction for neural lossless video compression.
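The idea that a P-frame can be coded as a temporal difference and still be reconstructed exactly can be shown with a small NumPy sketch. It assumes 8-bit planes and a modulo-256 residual, which is an illustrative choice and not necessarily how NeuralLVC defines its differences; the learned entropy model that actually compresses the residual is omitted.

```python
import numpy as np


def temporal_residual(prev_frame: np.ndarray, cur_frame: np.ndarray) -> np.ndarray:
    """Difference between consecutive 8-bit frames, wrapped modulo 256 so it stays in uint8."""
    diff = cur_frame.astype(np.int16) - prev_frame.astype(np.int16)
    return (diff % 256).astype(np.uint8)


def reconstruct(prev_frame: np.ndarray, residual: np.ndarray) -> np.ndarray:
    """Invert temporal_residual exactly, given the previously decoded frame."""
    total = prev_frame.astype(np.int16) + residual.astype(np.int16)
    return (total % 256).astype(np.uint8)


# Hypothetical 4x6 luma planes standing in for two consecutive frames.
rng = np.random.default_rng(0)
prev = rng.integers(0, 256, size=(4, 6), dtype=np.uint8)
cur = rng.integers(0, 256, size=(4, 6), dtype=np.uint8)
residual = temporal_residual(prev, cur)
print(np.array_equal(reconstruct(prev, residual), cur))
```

Because the mapping is a bijection for a fixed previous frame, losslessness depends only on the entropy coder being exact; in real video the residual is concentrated near zero, which is what makes it cheaper to code than the raw frame.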


[172] UniSurgSAM: A Unified Promptable Model for Reliable Surgical Video Segmentation eess.IV | cs.CVPDF

Haofeng Liu, Ziyue Wang, Alex Y. W. Kong, Guanyi Qin, Yunqiu Xu

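TL;DR: This paper presents UniSurgSAM, a unified promptable model for reliable surgical video segmentation. The model supports visual, textual, or audio prompts, uses a decoupled two-stage framework that optimizes target initialization and tracking separately, and introduces presence-aware decoding, boundary-aware long-term tracking, and adaptive state transition to improve reliability.

Details

Motivation: Existing promptable video object segmentation methods are typically restricted to a single prompt modality and use coupled frameworks that cause optimization interference between target initialization and tracking. They also produce hallucinated predictions when the target is absent and suffer from accumulated mask drift. The paper addresses these problems to provide a reliable segmentation foundation for computer-assisted surgery.

Result: On a multi-modal, multi-granular benchmark built from four public surgical datasets, UniSurgSAM achieves state-of-the-art performance in real time across all prompt modalities and granularities.

Insight: The main contributions are: 1) a unified, decoupled two-stage framework that supports multi-modal prompts and resolves optimization interference; 2) three designs for reliability: presence-aware decoding to suppress hallucinations, boundary-aware long-term tracking to prevent drift, and adaptive state transition for failure recovery; 3) a new benchmark for surgical video segmentation.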

Abstract: Surgical video segmentation is fundamental to computer-assisted surgery. In practice, surgeons need to dynamically specify targets throughout extended procedures, using heterogeneous cues such as visual selections, textual expressions, or audio instructions. However, existing Promptable Video Object Segmentation (PVOS) methods are typically restricted to a single prompt modality and rely on coupled frameworks that cause optimization interference between target initialization and tracking. Moreover, these methods produce hallucinated predictions when the target is absent and suffer from accumulated mask drift without failure recovery. To address these challenges, we present UniSurgSAM, a unified PVOS model enabling reliable surgical video segmentation through visual, textual, or audio prompts. Specifically, UniSurgSAM employs a decoupled two-stage framework that independently optimizes initialization and tracking to resolve the optimization interference. Within this framework, we introduce three key designs for reliability: presence-aware decoding that models target absence to suppress hallucinations; boundary-aware long-term tracking that prevents mask drift over extended sequences; and adaptive state transition that closes the loop between stages for failure recovery. Furthermore, we establish a multi-modal and multi-granular benchmark from four public surgical datasets with precise instance-level masklets. Extensive experiments demonstrate that UniSurgSAM achieves state-of-the-art performance in real time across all prompt modalities and granularities, providing a practical foundation for computer-assisted surgery. Code and datasets will be available at https://jinlab-imvr.github.io/UniSurgSAM.


[173] BAAI Cardiac Agent: An intelligent multimodal agent for automated reasoning and diagnosis of cardiovascular diseases from cardiac magnetic resonance imaging eess.IV | cs.AI | cs.CVPDF

Taiping Qu, Hongkai Zhang, Lantian Zhang, Can Zhao, Nan Zhang

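TL;DR: This paper presents BAAI Cardiac Agent, a multimodal intelligent system for automated diagnosis of cardiovascular disease. By integrating multiple cardiac expert models, the system performs end-to-end interpretation of cardiac magnetic resonance (CMR) imaging, covering automated segmentation, functional quantification, tissue characterization, and disease diagnosis, and generates structured clinical reports. Validated on CMR datasets from two hospitals, it outperforms existing state-of-the-art models on diagnostic and segmentation tasks and shows high concordance with the reports of expert radiologists.

Details

Motivation: CMR is a cornerstone for diagnosing cardiovascular disease, but its interpretation is complex, time-consuming, and heavily dependent on specialist expertise, which leaves it underutilized. The paper aims to build an automated, end-to-end intelligent system that addresses the complexity of CMR interpretation and its reliance on experts.

Result: Evaluated on CMR datasets from two hospitals (2,413 patients in total) covering seven major types of cardiovascular disease, the system achieves an area under the receiver operating characteristic curve (AUC) above 0.93 in internal validation and above 0.81 in external validation. For left ventricular function indices (ejection fraction, stroke volume, and left ventricular mass), Pearson correlation coefficients with clinical reports all exceed 0.90. The system outperforms state-of-the-art (SOTA) models on segmentation and diagnostic tasks, and its generated clinical reports show high concordance with the readings of six radiologists at different experience levels.

Insight: The main contribution is an agent framework that dynamically orchestrates multiple expert models for coordinated multimodal analysis, giving a complete, automated workflow from raw CMR images to structured clinical reports. Viewed objectively, integrating several specialized models (segmentation, quantification, diagnosis) into one unified, coordinated agent is a promising design for automating complex clinical imaging workflows, and the results indicate high accuracy and clinical utility.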

Abstract: Cardiac magnetic resonance (CMR) is a cornerstone for diagnosing cardiovascular disease. However, it remains underutilized because interpretation across multiple sequences, phases, and quantitative measures is complex, time-consuming, and heavily reliant on specialized expertise. Here, we present BAAI Cardiac Agent, a multimodal intelligent system designed for end-to-end CMR interpretation. The agent integrates specialized cardiac expert models to perform automated segmentation of cardiac structures, functional quantification, tissue characterization and disease diagnosis, and generates structured clinical reports within a unified workflow. Evaluated on CMR datasets from two hospitals (2,413 patients) spanning seven major types of cardiovascular disease, the agent achieved an area under the receiver-operating-characteristic curve exceeding 0.93 internally and 0.81 externally. In the task of estimating left ventricular function indices, the results generated by the system for core parameters such as ejection fraction, stroke volume, and left ventricular mass are highly consistent with clinical reports, with Pearson correlation coefficients all exceeding 0.90. The agent outperformed state-of-the-art models in segmentation and diagnostic tasks, and generated clinical reports showing high concordance with expert radiologists (six readers across three experience levels). By dynamically orchestrating expert models for coordinated multimodal analysis, this agent framework enables accurate, efficient CMR interpretation and highlights its potential for complex clinical imaging workflows. Code is available at https://github.com/plantain-herb/Cardiac-Agent.


[174] NAIMA: Semantics Aware RGB Guided Depth Super-Resolution eess.IV | cs.CV | cs.LG | cs.MM

Tayyab Nasir, Daochang Liu, Ajmal Mian

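TL;DR: This paper proposes NAIMA, a semantics-aware RGB-guided depth super-resolution method. It introduces global contextual semantic priors generated by a pretrained vision transformer to address the blurred depth boundaries and artifacts that misleading color and texture cues in RGB images cause in conventional guided depth super-resolution.

Details

Motivation: In conventional guided depth super-resolution, misleading color and texture cues in the RGB image that suggest depth discontinuities often produce artifacts and blurred boundaries in the generated depth map, so more reliable semantic priors are needed to improve detail recovery.

Result: Across multiple scaling factors and datasets, the proposed NAIMA architecture achieves significant improvements over existing methods.

Insight: The contributions are a Guided Token Attention (GTA) module, which iteratively aligns RGB spatial features with depth encodings through cross-attention and selectively injects global semantic context extracted from different layers of a pretrained vision transformer, and the NAIMA architecture, which integrates DINOv2 with GTA blocks for semantics-aware depth super-resolution.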

Abstract: Guided depth super-resolution (GDSR) is a multi-modal approach for depth map super-resolution that relies on a low-resolution depth map and a high-resolution RGB image to restore finer structural details. However, the misleading color and texture cues indicating depth discontinuities in RGB images often lead to artifacts and blurred depth boundaries in the generated depth map. We propose a solution that introduces global contextual semantic priors, generated from pretrained vision transformer token embeddings. Our approach to distilling semantic knowledge from pretrained token embeddings is motivated by their demonstrated effectiveness in related monocular depth estimation tasks. We introduce a Guided Token Attention (GTA) module, which iteratively aligns encoded RGB spatial features with depth encodings, using cross-attention for selectively injecting global semantic context extracted from different layers of a pretrained vision transformer. Additionally, we present an architecture called Neural Attention for Implicit Multi-token Alignment (NAIMA), which integrates DINOv2 with GTA blocks for a semantics-aware GDSR. Our proposed architecture, with its ability to distill semantic knowledge, achieves significant improvements over existing methods across multiple scaling factors and datasets.


[175] TM-BSN: Triangular-Masked Blind-Spot Network for Real-World Self-Supervised Image Denoising eess.IV | cs.CV

Junyoung Park, Youngjin Oh, Nam Ik Cho

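TL;DR: This paper proposes TM-BSN (Triangular-Masked Blind-Spot Network), a new blind-spot network for self-supervised denoising of spatially correlated noise in real-world sRGB images. It introduces a triangular-masked convolution that aligns the receptive field with the diamond-shaped spatial correlation pattern of the noise, so noise correlation is modeled without downsampling, and it uses knowledge distillation to improve a lightweight U-Net.

Details

Motivation: Existing blind-spot networks (BSNs) assume that noise is independent across pixels, but noise in real sRGB images is spatially correlated because of the camera ISP pipeline (for example, demosaicing), which degrades performance. Existing methods downsample to decorrelate the noise, but this alters the noise statistics and limits the network's ability to use full contextual information.

Result: Extensive experiments on real-world denoising benchmarks show that the method achieves state-of-the-art performance and significantly outperforms existing self-supervised methods.

Insight: The key contribution is a triangular-masked convolution whose kernel is restricted to the upper-triangular region, creating a diamond-shaped blind spot at the original resolution that matches the spatial geometry of the noise correlation induced by demosaicing. In addition, knowledge distillation transfers complementary knowledge from multiple blind-spot predictions into a lightweight U-Net, balancing accuracy and efficiency.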

Abstract: Blind-spot networks (BSNs) enable self-supervised image denoising by preventing access to the target pixel, allowing clean signal estimation without ground-truth supervision. However, this approach assumes pixel-wise noise independence, which is violated in real-world sRGB images due to spatially correlated noise from the camera’s image signal processing (ISP) pipeline. While several methods employ downsampling to decorrelate noise, they alter noise statistics and limit the network’s ability to utilize full contextual information. In this paper, we propose the Triangular-Masked Blind-Spot Network (TM-BSN), a novel blind-spot architecture that accurately models the spatial correlation of real sRGB noise. This correlation originates from demosaicing, where each pixel is reconstructed from neighboring samples with spatially decaying weights, resulting in a diamond-shaped pattern. To align the receptive field with this geometry, we introduce a triangular-masked convolution that restricts the kernel to its upper-triangular region, creating a diamond-shaped blind spot at the original resolution. This design excludes correlated pixels while fully leveraging uncorrelated context, eliminating the need for downsampling or post-processing. Furthermore, we use knowledge distillation to transfer complementary knowledge from multiple blind-spot predictions into a lightweight U-Net, improving both accuracy and efficiency. Extensive experiments on real-world benchmarks demonstrate that our method achieves state-of-the-art performance, significantly outperforming existing self-supervised approaches. Our code is available at https://github.com/parkjun210/TM-BSN.
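As a rough illustration of the masking idea only, the sketch below builds an upper-triangular mask for a square convolution kernel and applies it to the kernel weights. The function names, the `offset` parameter, and the choice of whether the diagonal (and hence the centre pixel) is kept are assumptions made for illustration; the abstract does not specify them, and the sketch does not show how the masked convolutions combine into the diamond-shaped blind spot.

```python
def upper_triangular_mask(size, offset=0):
    """Return a size x size 0/1 mask that keeps entries with col - row >= offset.

    offset=0 keeps the main diagonal; offset=1 drops it, which also masks
    the centre tap of an odd-sized kernel (hypothetical choice).
    """
    return [
        [1 if col - row >= offset else 0 for col in range(size)]
        for row in range(size)
    ]


def apply_mask(kernel, mask):
    """Zero out kernel weights wherever the mask is 0 (element-wise product)."""
    return [
        [weight * keep for weight, keep in zip(kernel_row, mask_row)]
        for kernel_row, mask_row in zip(kernel, mask)
    ]
```

In a real network the same mask would be multiplied into the learnable weights on every forward pass, so the masked taps never receive information from the excluded pixels.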


cs.SD

[176] OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text cs.SD | cs.CV | cs.MM

Weiguo Pian, Saksham Singh Kushwaha, Zhimin Chen, Shijian Deng, Kai Wang

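TL;DR: This paper proposes OmniSonic, a flow-matching-based diffusion framework that, jointly conditioned on video and text, generates complete auditory scenes containing on-screen environmental sound, off-screen environmental sound, and human speech. It also builds UniHAGen-Bench, a new benchmark covering three representative scenarios, for evaluation.

Details

Motivation: Existing video-conditioned audio generation models usually produce only on-screen environmental sounds that correspond to visible sounding events and neglect off-screen sound. Recent joint text-video-to-audio models aim to generate auditory scenes with both on- and off-screen sound, but they are limited to non-speech sounds and cannot generate or integrate human speech. This paper aims to overcome these limitations and achieve universal, holistic audio generation.

Result: Extensive experiments show that OmniSonic consistently outperforms state-of-the-art methods on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic audio generation.

Insight: The contribution is the TriAttn-DiT architecture, which jointly processes on-screen environmental sound, off-screen environmental sound, and speech conditions, together with a Mixture-of-Experts gating mechanism that adaptively balances their contributions during generation, enabling generation of complete auditory scenes that include speech.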

Abstract: In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-conditioned audio generation models typically focus on producing on-screen environmental sounds that correspond to visible sounding events, neglecting off-screen auditory events. Recent holistic joint text-video-to-audio generation models aim to produce auditory scenes with both on- and off-screen sound, but they are limited to non-speech sounds, lacking the ability to generate or integrate human speech. To overcome these limitations, we introduce OmniSonic, a flow-matching-based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention operations to process on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, with a Mixture-of-Experts (MoE) gating mechanism that adaptively balances their contributions during generation. Furthermore, we construct UniHAGen-Bench, a new benchmark with over one thousand samples covering three representative on/off-screen speech-environment scenarios. Extensive experiments show that OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic audio generation. Project page: https://weiguopian.github.io/OmniSonic_webpage/


cs.AI

[177] VERT: Reliable LLM Judges for Radiology Report Evaluation cs.AI | cs.CL

Federica Bologna, Jean-Philippe Corbeil, Matthew Wilkens, Asma Ben Abacha

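TL;DR: This paper proposes VERT, an LLM-based metric for radiology report evaluation, and systematically compares existing LLM metrics, including RadFact, GREEN, and FineRadScore, through correlation analysis against expert ratings. On datasets spanning multiple imaging modalities and anatomies (RadEval and RaTE-Eval), it evaluates open- and closed-source models of different sizes (reasoning and non-reasoning) and further explores few-shot approaches, ensembling, and parameter-efficient fine-tuning (PEFT). VERT improves correlation with radiologist judgments by up to 11.7% relative to GREEN, and lightweight fine-tuning of Qwen3 30B yields gains of up to 25% with only 1,300 samples while reducing inference time by up to 37.2 times.

Details

Motivation: Current research on radiology report evaluation focuses mainly on designing LLM-based metrics or fine-tuning small models for chest X-rays, and it is unclear whether these approaches remain robust for reports from other imaging modalities and anatomies. The paper investigates which model and prompt configurations are best suited to serve as LLM judges for radiology evaluation and systematically assesses the reliability of existing metrics.

Result: On two expert-annotated datasets spanning multiple modalities, RadEval and RaTE-Eval, the proposed VERT metric improves correlation with radiologist judgments by up to 11.7% relative to GREEN. Parameter-efficient fine-tuning of Qwen3 30B yields gains of up to 25% using only 1,300 training samples and reduces inference time by up to 37.2 times.

Insight: The contributions are the new LLM-based metric VERT and a broad comparison of radiology report evaluation metrics across multiple models, prompting strategies, and datasets. Viewed objectively, the core insight is that lightweight adaptation, such as parameter-efficient fine-tuning on a small amount of data, is enough for reliable and efficient evaluation, which offers a feasible path to deploying LLM judges in resource-constrained clinical settings. The systematic error detection and categorization study also offers a new view of metric behavior.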

Abstract: Current literature on radiology report evaluation has focused primarily on designing LLM-based metrics and fine-tuning small models for chest X-rays. However, it remains unclear whether these approaches are robust when applied to reports from other modalities and anatomies. Which model and prompt configurations are best suited to serve as LLM judges for radiology evaluation? We conduct a thorough correlation analysis between expert and LLM-based ratings. We compare three existing LLM-as-a-judge metrics (RadFact, GREEN, and FineRadScore) alongside VERT, our proposed LLM-based metric, using open- and closed-source models (reasoning and non-reasoning) of different sizes across two expert-annotated datasets, RadEval and RaTE-Eval, spanning multiple modalities and anatomies. We further evaluate few-shot approaches, ensembling, and parameter-efficient fine-tuning using RaTE-Eval. To better understand metric behavior, we perform a systematic error detection and categorization study to assess alignment of these metrics against expert judgments and identify areas of lower and higher agreement. Our results show that VERT improves correlation with radiologist judgments by up to 11.7% relative to GREEN. Furthermore, fine-tuning Qwen3 30B yields gains of up to 25% using only 1,300 training samples. The fine-tuned model also reduces inference time by up to 37.2 times. These findings highlight the effectiveness of LLM-based judges and demonstrate that reliable evaluation can be achieved with lightweight adaptation.


[178] Towards the AI Historian: Agentic Information Extraction from Primary Sources cs.AI | cs.CL | cs.DL

Lorenz Hufe, Niclas Griesshaber, Gavin Greif, Sebastian Oliver Eck, Philip Torr

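TL;DR: This paper introduces the first module of the Chronos AI Historian project, which helps historians convert image scans of primary sources into structured data through natural-language interaction, addressing the limited adoption of AI in historical research.

Details

Motivation: AI adoption in historical research is limited and solutions tailored to historians are lacking, so the authors develop a Chronos AI Historian module that supports flexible information extraction from primary sources.

Result: The module is open-source and ready for historical researchers to use, but the abstract reports no specific benchmarks or quantitative results.

Insight: The contribution is an agentic workflow based on natural-language interaction rather than a fixed vision-language model pipeline, which lets historians customize and iteratively refine extraction workflows for heterogeneous source corpora and makes AI more adaptable and practical for historical research.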

Abstract: AI is supporting, accelerating, and automating scientific discovery across a diverse set of fields. However, AI adoption in historical research remains limited due to the lack of solutions designed for historians. In this technical progress report, we introduce the first module of Chronos, an AI Historian under development. This module enables historians to convert image scans of primary sources into data through natural-language interactions. Rather than imposing a fixed extraction pipeline powered by a vision-language model (VLM), it allows historians to adapt workflows for heterogeneous source corpora, evaluate the performance of AI models on specific tasks, and iteratively refine workflows through natural-language interaction with the Chronos agent. The module is open-source and ready to be used by historical researchers on their own sources.


[179] Empirical Characterization of Rationale Stability Under Controlled Perturbations for Explainable Pattern Recognition cs.AI | cs.CL | cs.LG

Abu Noman Md Sakib, Zhensen Wang, Merjulah Roby, Zijie Zhang

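TL;DR: This paper proposes a new metric for assessing the consistency of model explanations. It quantifies the cosine similarity of SHAP values for inputs with the same label to test the stability of attribution patterns across similar samples, and validates it experimentally on the SST-2 sentiment analysis dataset with BERT and other models.

Details

Motivation: Existing explainable AI evaluations are mostly instance-centric and do not quantify whether attribution patterns are consistent across samples of the same class or under small input variations, so a metric is needed to assess explanation stability under label-preserving perturbations.

Result: Using a pre-trained BERT model on SST-2, with additional tests on RoBERTa, DistilBERT, and IMDB, and SHAP to compute feature importance, the experiments indicate that the metric can identify misaligned predictions and explanation inconsistencies; it is compared against standard fidelity metrics to validate its effectiveness.

Insight: The contribution is a stability metric based on the consistency of attribution patterns, which emphasizes the robustness of explanations across similar inputs and provides a framework for deeper analysis of model behavior when building trustworthy AI systems, supporting robust evaluation in practical pattern recognition pipelines.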

Abstract: Reliable pattern recognition systems should exhibit consistent behavior across similar inputs, and their explanations should remain stable. However, most Explainable AI evaluations remain instance-centric and do not explicitly quantify whether attribution patterns are consistent across samples that share the same class or represent small variations of the same input. In this work, we propose a novel metric for assessing the consistency of model explanations, checking that models consistently reflect the intended objectives and remain consistent under label-preserving perturbations. We implement this metric using a pre-trained BERT model on the SST-2 sentiment analysis dataset, with additional robustness tests on RoBERTa, DistilBERT, and IMDB, applying SHAP to compute feature importance for various test samples. The proposed metric quantifies the cosine similarity of SHAP values for inputs with the same label, aiming to detect inconsistent behaviors, such as biased reliance on certain features or failure to maintain consistent reasoning for similar predictions. Through a series of experiments, we evaluate the ability of this metric to identify misaligned predictions and inconsistencies in model explanations. The results are compared against standard fidelity metrics to assess whether the new metric can effectively identify when a model’s behavior deviates from its intended objectives. The proposed framework provides a deeper understanding of model behavior by enabling more robust verification of rationale stability, which is critical for building trustworthy AI systems. By quantifying whether models rely on consistent attribution patterns for similar inputs, the proposed approach supports more robust evaluation of model behavior in practical pattern recognition pipelines. Our code is publicly available at https://github.com/anmspro/ESS-XAI-Stability.
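A minimal sketch of the kind of score described above: the mean pairwise cosine similarity of attribution vectors that share a label. It is not the authors' implementation. It assumes the SHAP values have already been computed and mapped to fixed-length vectors (for example, over a shared feature set), and the function names are hypothetical.

```python
import itertools
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rationale_stability(attributions, labels):
    """Mean pairwise cosine similarity of attribution vectors, per label.

    attributions: list of fixed-length attribution vectors (assumed precomputed).
    labels: list of class labels aligned with attributions.
    Labels with fewer than two samples are skipped.
    """
    by_label = {}
    for vector, label in zip(attributions, labels):
        by_label.setdefault(label, []).append(vector)

    scores = {}
    for label, vectors in by_label.items():
        pairs = list(itertools.combinations(vectors, 2))
        if pairs:
            total = sum(cosine_similarity(u, v) for u, v in pairs)
            scores[label] = total / len(pairs)
    return scores
```

A score near 1 for a label means the model attributes its predictions to similar features across that label's samples; a low or negative score flags inconsistent attribution patterns worth inspecting.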


[180] QED-Nano: Teaching a Tiny Model to Prove Hard Theorems cs.AI | cs.CL | cs.LG

LM-Provers, Yuxiao Qu, Amrith Setlur, Jasper Dekoninck, Edward Beeching

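TL;DR: This paper introduces QED-Nano, a small open-source model with only 4B parameters. With a three-stage training recipe (supervised fine-tuning, reinforcement learning with rubric-based rewards, and reinforcement learning with a reasoning cache), it approaches the performance of large proprietary models such as Gemini 3 Pro on Olympiad-level theorem proving at a fraction of the inference cost.

Details

Motivation: Proprietary AI systems perform impressively on complex proof problems, but their training pipelines are opaque and their reliance on large models makes them costly, hard to reproduce, and hard to improve. The paper explores whether small open models can also achieve competitive performance on difficult Olympiad-level mathematical reasoning.

Result: QED-Nano surpasses much larger open models, including Nomos-1 and GPT-OSS-120B, on proof generation and approaches the performance of proprietary models such as Gemini 3 Pro, at a significantly lower inference cost.

Insight: Contributions include: (1) a three-stage training recipe, in particular reinforcement learning with a reasoning cache, which decomposes long proofs into iterative summarize-and-refine cycles to strengthen reasoning; (2) evidence that a small model with a carefully designed training recipe can compete with large models on complex reasoning tasks; (3) an open release of the full training pipeline, models, datasets, and code to support open research on mathematical reasoning.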

Abstract: Proprietary AI systems have recently demonstrated impressive capabilities on complex proof-based problems, with gold-level performance reported at the 2025 International Mathematical Olympiad (IMO). However, the training pipelines behind these systems remain largely undisclosed, and their reliance on large “internal” models and scaffolds makes them expensive to run, difficult to reproduce, and hard to study or improve upon. This raises a central question: can small, open models also be trained to achieve competitive reasoning performance on difficult Olympiad-level math? In this paper, we answer this question by building QED-Nano, a 4B model post-trained for Olympiad-level proofs. Our training recipe has three stages: (1) supervised fine-tuning to imbue good proof-writing styles by distilling from DeepSeek-Math-V2, (2) reinforcement learning (RL) with rubric-based rewards, and (3) expanding RL with a reasoning cache, which decomposes long proofs into iterative summarize-and-refine cycles and enables stronger test-time reasoning. QED-Nano surpasses the proof-generation performance of much larger open models, including Nomos-1 and GPT-OSS-120B, and approaches the performance of proprietary models like Gemini 3 Pro, at a fraction of the inference cost. To support further research on open mathematical reasoning, we release the full QED-Nano pipeline, including the QED-Nano and QED-Nano-SFT models, the FineProofs-SFT and FineProofs-RL datasets, and the training and evaluation code.
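The summarize-and-refine cycle can be pictured with a short control-flow sketch. This is an assumption-laden illustration, not the QED-Nano pipeline: `generate` and `summarize` are hypothetical callables standing in for model calls, and the abstract does not describe how the cache is actually represented or updated.

```python
def refine_with_cache(problem, generate, summarize, rounds=3):
    """Run a fixed number of summarize-and-refine cycles.

    generate(problem, cache) -> str: produces a proof attempt given the
        problem and the current cache (hypothetical model call).
    summarize(cache, attempt) -> str: folds the latest attempt into a
        compact cache for the next round (hypothetical model call).
    Returns the last attempt and the final cache.
    """
    cache = ""    # running summary carried between rounds
    attempt = ""  # most recent proof attempt
    for _ in range(rounds):
        attempt = generate(problem, cache)
        cache = summarize(cache, attempt)
    return attempt, cache
```

The point of the pattern is that each round conditions on a short summary instead of the full history, so the context stays bounded as the number of rounds grows.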


cs.RO

[181] Precise Robot Command Understanding Using Grammar-Constrained Large Language Models cs.RO | cs.CL

Xinyun Huo, Raghav Gnanasambandam, Xinyao Zhang

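TL;DR: This paper proposes a hybrid approach that combines grammar constraints with a large language model to improve the precision and reliability of robot command understanding in industrial human-robot collaboration. It works in two stages: a fine-tuned LLM first performs high-level contextual reasoning and parameter inference, and then a Structured Language Model and a grammar-based canonicalizer constrain the output to a standardized symbolic format, so that generated commands are valid and structured as robot-readable JSON. The model also adds a validation and feedback loop: a grammar parser checks each command, and invalid commands automatically trigger corrective prompts, so iterative self-correction improves system robustness.

Details

Motivation: LLMs understand general language but lack the domain-specific rigidity needed for safe, executable commands in industrial human-robot collaboration. The paper aims to balance conversational flexibility with the deterministic precision that robots require.

Result: Evaluated on the Human Robot Interaction Corpus (HuRIC) dataset, the hybrid approach achieves better command validity than two baselines, a fine-tuned API-based LLM and a standalone grammar-driven NLU model, which promotes safer and more effective industrial human-robot collaboration.

Insight: The contribution is the integration of a grammar-driven natural language understanding system with a fine-tuned LLM. Through two-stage processing and a validation and feedback loop, it combines the contextual reasoning of LLMs with the determinism of grammar constraints, improving the reliability and self-correction of command generation and offering a reusable hybrid architecture for industrial robot command understanding.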

Abstract: Human-robot collaboration in industrial settings requires precise and reliable communication to enhance operational efficiency. While Large Language Models (LLMs) understand general language, they often lack the domain-specific rigidity needed for safe and executable industrial commands. To address this gap, this paper introduces a novel grammar-constrained LLM that integrates a grammar-driven Natural Language Understanding (NLU) system with a fine-tuned LLM, which enables both conversational flexibility and the deterministic precision required in robotics. Our method employs a two-stage process. First, a fine-tuned LLM performs high-level contextual reasoning and parameter inference on natural language inputs. Second, a Structured Language Model (SLM) and a grammar-based canonicalizer constrain the LLM’s output, forcing it into a standardized symbolic format composed of valid action frames and command elements. This process guarantees that generated commands are valid and structured in a robot-readable JSON format. A key feature of the proposed model is a validation and feedback loop. A grammar parser validates the output against a predefined list of executable robotic actions. If a command is invalid, the system automatically generates corrective prompts and re-engages the LLM. This iterative self-correction mechanism allows the model to recover from initial interpretation errors to improve system robustness. We evaluate our grammar-constrained hybrid model against two baselines: a fine-tuned API-based LLM and a standalone grammar-driven NLU model. Using the Human Robot Interaction Corpus (HuRIC) dataset, we demonstrate that the hybrid approach achieves superior command validity, which promotes safer and more effective industrial human-robot collaboration.
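The validation and feedback loop described above might look roughly like the sketch below. Everything specific here is an assumption for illustration: the JSON schema (`action` plus `args`), the `ALLOWED_ACTIONS` table, the wording of the corrective prompt, and the `generate` callable that stands in for the LLM. The paper's actual grammar parser and action list are not given in the abstract.

```python
import json

# Hypothetical list of executable actions and their required arguments.
ALLOWED_ACTIONS = {
    "pick": {"object"},
    "place": {"object", "location"},
    "move": {"location"},
}


def validate(command_text):
    """Check a candidate command; return (ok, error_message)."""
    try:
        command = json.loads(command_text)
    except json.JSONDecodeError as exc:
        return False, "not valid JSON: " + exc.msg
    if not isinstance(command, dict):
        return False, "command must be a JSON object"
    action = command.get("action")
    if action not in ALLOWED_ACTIONS:
        return False, "unknown action " + repr(action)
    args = command.get("args")
    if not isinstance(args, dict):
        return False, "args must be a JSON object"
    missing = ALLOWED_ACTIONS[action] - set(args)
    if missing:
        return False, "missing arguments: " + ", ".join(sorted(missing))
    return True, ""


def parse_with_feedback(utterance, generate, max_rounds=3):
    """Ask the model for a command, re-prompting with the error until valid.

    generate(prompt) -> str is a stand-in for the LLM call.
    Returns the parsed command dict, or None if no valid command is produced.
    """
    prompt = utterance
    for _ in range(max_rounds):
        output = generate(prompt)
        ok, error = validate(output)
        if ok:
            return json.loads(output)
        # Corrective prompt: restate the request and explain the rejection.
        prompt = (
            utterance
            + "\nYour previous output was rejected: "
            + error
            + ". Return a corrected JSON command."
        )
    return None
```

Capping the number of rounds keeps the loop from re-prompting indefinitely; returning `None` lets the caller fall back to asking the human operator.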


[182] Sim2Real-AD: A Modular Sim-to-Real Framework for Deploying VLM-Guided Reinforcement Learning in Real-World Autonomous Driving cs.RO | cs.AI | cs.CV

Zilin Huang, Zhengyang Wan, Zihao Sheng, Boyue Wang, Junwei You

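TL;DR: This paper presents Sim2Real-AD, a modular sim-to-real framework for zero-shot deployment of VLM-guided reinforcement learning policies trained in the CARLA simulator onto real full-scale autonomous vehicles. Through four core modules (a geometric observation bridge, physics-aware action mapping, two-phase progressive training, and a real-time deployment pipeline), it resolves the mismatch in observation and action semantics between simulation and the real world and achieves closed-loop deployment without any real-world RL training data.

Details

Motivation: Deploying RL policies trained in simulation onto real autonomous vehicles is a fundamental challenge, especially for VLM-guided RL frameworks whose policies depend on simulator-native observations and simulator-coupled action semantics that are not directly available on physical platforms.

Result: Simulation experiments confirm that the framework preserves the relative performance ordering of representative RL algorithms across different reward paradigms. Zero-shot deployment on a real full-scale Ford E-Transit achieves success rates of 90%, 80%, and 75% in car-following, obstacle avoidance, and stop-sign interaction scenarios, respectively. To the authors' knowledge, this is among the first studies to demonstrate zero-shot closed-loop deployment of a CARLA-trained VLM-guided RL policy on a full-scale real vehicle without any real-world RL training data.

Insight: The contribution is a modular decomposition of the sim-to-real transfer problem into four key components, notably a geometric observation bridge that converts monocular front-view images into simulator-compatible bird's-eye-view observations and a two-phase progressive training strategy that stabilizes adaptation by separating action-space and observation-space transfer. This offers a systematic approach to deploying simulation-trained RL policies safely and efficiently on real robot platforms.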

Abstract: Deploying reinforcement learning policies trained in simulation to real autonomous vehicles remains a fundamental challenge, particularly for VLM-guided RL frameworks whose policies are typically learned with simulator-native observations and simulator-coupled action semantics that are unavailable on physical platforms. This paper presents Sim2Real-AD, a modular framework for zero-shot sim-to-real transfer of CARLA-trained VLM-guided RL policies to full-scale vehicles without any real-world RL training data. The framework decomposes the transfer problem into four components: a Geometric Observation Bridge (GOB) that converts monocular front-view images into simulator-compatible bird’s-eye-view (BEV) observations, a Physics-Aware Action Mapping (PAM) that translates policy outputs into platform-agnostic physical commands, a Two-Phase Progressive Training (TPT) strategy that stabilizes adaptation by separating action-space and observation-space transfer, and a Real-time Deployment Pipeline (RDP) that integrates perception, policy inference, control conversion, and safety monitoring for closed-loop execution. Simulation experiments show that the framework preserves the relative performance ordering of representative RL algorithms across different reward paradigms and validate the contribution of each module. Zero-shot deployment on a full-scale Ford E-Transit achieves success rates of 90%, 80%, and 75% in car-following, obstacle avoidance, and stop-sign interaction scenarios, respectively. To the best of our knowledge, this study is among the first to demonstrate zero-shot closed-loop deployment of a CARLA-trained VLM-guided RL policy on a full-scale real vehicle without any real-world RL training data. The demo video and code are available at: https://zilin-huang.github.io/Sim2Real-AD-website/.


[183] Optimizing Neurorobot Policy under Limited Demonstration Data through Preference Regret cs.RO | cs.AI | cs.CV | cs.LG

Viet Dung Nguyen, Yuhang Song, Anh Nguyen, Jamison Heard, Reynold Bailey

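TL;DR: This paper proposes a self-imitation learning framework called “master your own expertise” (MYOE), which targets data scarcity and unrealistic distributional assumptions in robot reinforcement learning from demonstrations. The framework designs a queryable mixture-of-preferences state space model (QMoP-SSM) to estimate the desired goal at every time step and uses the “preference regret” to optimize the robot control policy.

Details

Motivation: Real-world robot demonstration data is usually scarce and costly to collect, while conventional imitation learning algorithms assume independent and identically distributed data, which leads to compounding errors in test-time trajectories and degraded performance.

Result: Experiments show that, compared with other state-of-the-art RLfD schemes, the method performs better in robustness, adaptability, and out-of-sample performance.

Insight: Contributions include the MYOE self-imitation framework and the QMoP-SSM model, which optimize the policy by estimating desired goals and computing the “preference regret”. The design is inspired by human perception and action and mitigates data scarcity and error accumulation.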

Abstract: Robot reinforcement learning from demonstrations (RLfD) assumes that expert data is abundant; this is usually unrealistic in the real world given data scarcity as well as high collection cost. Furthermore, imitation learning algorithms assume that the data is independently and identically distributed, which ultimately results in poorer performance as gradual errors emerge and compound within test-time trajectories. We address these issues by introducing the “master your own expertise” (MYOE) framework, a self-imitation framework that enables robotic agents to learn complex behaviors from limited demonstration data samples. Inspired by human perception and action, we propose and design what we call the queryable mixture-of-preferences state space model (QMoP-SSM), which estimates the desired goal at every time step. These desired goals are used in computing the “preference regret”, which is used to optimize the robot control policy. Our experiments demonstrate the robustness, adaptability, and out-of-sample performance of our agent compared to other state-of-the-art RLfD schemes. The GitHub repository that supports this work can be found at: https://github.com/rxng8/neurorobot-preference-regret-learning.


[184] CRAFT: Video Diffusion for Bimanual Robot Data Generation cs.RO | cs.AI | cs.CV | cs.LG

Jason Chen, I-Chun Arthur Liu, Gaurav Sukhatme, Daniel Seita

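TL;DR: CRAFT is a video diffusion-based framework for generating demonstration data for bimanual robot manipulation. It extracts edge-based structural cues from simulated trajectories to synthesize temporally coherent manipulation videos with action labels, which broadens the visual diversity of real-world data and improves policy robustness.

Details

Motivation: Bimanual robot learning is limited by the high cost and narrow visual diversity of real-world data, which leaves policies unable to generalize across viewpoints, object configurations, and robot embodiments.

Result: Across simulated and real-world bimanual tasks, CRAFT improves task success rates over existing data augmentation strategies and straightforward data scaling, showing that diffusion-based video generation can substantially expand demonstration diversity and improve generalization in dual-arm manipulation.

Insight: The contribution is the use of a pre-trained video diffusion model, guided by edge structure from simulated trajectories, to generate physically plausible trajectory variations within a unified augmentation pipeline (object pose changes, camera viewpoints, lighting and background variations, cross-embodiment transfer, and multi-view synthesis), turning a few real demonstrations into a large set of photorealistic training data.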

Abstract: Bimanual robot learning from demonstrations is fundamentally limited by the cost and narrow visual diversity of real-world data, which constrains policy robustness across viewpoints, object configurations, and embodiments. We present Canny-guided Robot Data Generation using Video Diffusion Transformers (CRAFT), a video diffusion-based framework for scalable bimanual demonstration generation that synthesizes temporally coherent manipulation videos while producing action labels. By conditioning video diffusion on edge-based structural cues extracted from simulator-generated trajectories, CRAFT produces physically plausible trajectory variations and supports a unified augmentation pipeline spanning object pose changes, camera viewpoints, lighting and background variations, cross-embodiment transfer, and multi-view synthesis. We leverage a pre-trained video diffusion model to convert simulated videos, along with action labels from the simulation trajectories, into action-consistent demonstrations. Starting from only a few real-world demonstrations, CRAFT generates a large, visually diverse set of photorealistic training data, bypassing the need to replay demonstrations on the real robot (Sim2Real). Across simulated and real-world bimanual tasks, CRAFT improves success rates over existing augmentation strategies and straightforward data scaling, demonstrating that diffusion-based video generation can substantially expand demonstration diversity and improve generalization for dual-arm manipulation tasks. Our project website is available at: https://craftaug.github.io/


[185] HAD: Combining Hierarchical Diffusion with Metric-Decoupled RL for End-to-End Driving cs.RO | cs.CV

Wenhao Yao, Xinglong Sun, Zhenxin Li, Shiyi Lan, Zi Wang

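TL;DR: This paper proposes HAD, an end-to-end autonomous driving planning framework that combines a hierarchical diffusion policy with metric-decoupled reinforcement learning. The hierarchical diffusion policy decomposes planning into a coarse-to-fine process, and Structure-Preserved Trajectory Expansion generates more realistic candidate trajectories. Metric-Decoupled Policy Optimization supports structured RL optimization across multiple driving objectives. Experiments show that HAD sets a new state of the art on the NAVSIM and HUGSIM benchmarks.

Details

Motivation: Current end-to-end planning models usually adopt a scoring-selection framework, but selecting trajectories directly from the entire candidate space is hard to optimize, and the Gaussian perturbations used in diffusion often produce unrealistic trajectories that complicate denoising. Existing end-to-end RL methods also typically rely on a single coupled reward without structured signals, which limits optimization.

Result: HAD improves EPDMS by 2.3 on NAVSIM and Route Completion by 4.9 on HUGSIM over prior methods, reaching a new state of the art.

Insight: The main contributions are: (1) a hierarchical diffusion policy that decomposes planning into a coarse-to-fine process to simplify optimization; (2) Structure-Preserved Trajectory Expansion, which maintains kinematic structure when generating candidate trajectories to keep them realistic; (3) Metric-Decoupled Policy Optimization, which decouples the reward signals of multiple driving objectives for more effective structured RL optimization.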

Abstract: End-to-end planning has emerged as a dominant paradigm for autonomous driving, where recent models often adopt a scoring-selection framework to choose trajectories from a large set of candidates, with diffusion-based decoding showing strong promise. However, directly selecting from the entire candidate space remains difficult to optimize, and Gaussian perturbations used in diffusion often introduce unrealistic trajectories that complicate the denoising process. In addition, for training these models, reinforcement learning (RL) has shown promise, but existing end-to-end RL approaches typically rely on a single coupled reward without structured signals, limiting optimization effectiveness. To address these challenges, we propose HAD, an end-to-end planning framework with a Hierarchical Diffusion Policy that decomposes planning into a coarse-to-fine process. To improve trajectory generation, we introduce Structure-Preserved Trajectory Expansion, which produces realistic candidates while maintaining kinematic structure. For policy learning, we develop Metric-Decoupled Policy Optimization (MDPO) to enable structured RL optimization across multiple driving objectives. Extensive experiments show that HAD achieves new state-of-the-art performance on both NAVSIM and HUGSIM, outperforming prior art by a large margin: +2.3 EPDMS on NAVSIM and +4.9 Route Completion on HUGSIM.


[186] Efficient Onboard Spacecraft Pose Estimation with Event Cameras and Neuromorphic Hardware cs.RO | cs.CV | cs.LG

Arunkumar Rathinam, Jules Lecomte, Jost Reelsen, Gregor Lenz, Axel von Arnim

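TL;DR: This paper presents a spacecraft 6-DoF pose-estimation pipeline based on event cameras and neuromorphic hardware (the BrainChip Akida processor), targeting the failure of conventional vision methods under extreme illumination, high contrast, and fast motion in rendezvous and docking missions.

Details

Motivation: The goal is reliable, low-latency, low-power relative pose estimation of spacecraft under challenging space conditions such as extreme illumination, high contrast, and fast target motion, in support of autonomous rendezvous, docking, and proximity operations.

Result: On the SPADES dataset, three event representations are benchmarked with real-time, low-power inference on Akida V1 hardware; a heatmap-based model designed for Akida V2 and evaluated on Akida Cloud achieves higher pose accuracy.

Insight: The contribution is an end-to-end combination of event cameras with the Akida neuromorphic processor for spacecraft pose estimation, which the authors describe as the first to their knowledge; using quantization-aware training and model conversion, it shows a practical route to low-latency, low-power perception on neuromorphic hardware.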

Abstract: Reliable relative pose estimation is a key enabler for autonomous rendezvous and proximity operations, yet space imagery is notoriously challenging due to extreme illumination, high contrast, and fast target motion. Event cameras provide asynchronous, change-driven measurements that can remain informative when frame-based imagery saturates or blurs, while neuromorphic processors can exploit sparse activations for low-latency, energy-efficient inference. This paper presents a spacecraft 6-DoF pose-estimation pipeline that couples event-based vision with the BrainChip Akida neuromorphic processor. Using the SPADES dataset, we train compact MobileNet-style keypoint regression networks on lightweight event-frame representations, apply quantization-aware training (8/4-bit), and convert the models to Akida-compatible spiking neural networks. We benchmark three event representations and demonstrate real-time, low-power inference on Akida V1 hardware. We additionally design a heatmap-based model targeting Akida V2 and evaluate it on Akida Cloud, yielding improved pose accuracy. To our knowledge, this is the first end-to-end demonstration of spacecraft pose estimation running on Akida hardware, highlighting a practical route to low-latency, low-power perception for future autonomous space missions.


[187] Visual Prompt Based Reasoning for Offroad Mapping using Multimodal LLMs cs.RO | cs.CV

Abdelmoamen Nasser, Yousef Baba’a, Murad Mebrahtu, Nadya Abdel Madjid, Jorge Dias

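TL;DR: This paper proposes a zero-shot approach to mapping drivable areas off-road. It uses SAM2 for environment segmentation and a vision-language model (VLM) to reason over the segmented regions and identify drivable areas, replacing the conventional setup that requires several separate models.

Details

Motivation: Conventional off-road autonomy relies on separate models for terrain classification, height estimation, and slip or slope quantification, which requires training each component separately, preparing task-specific datasets, and fine-tuning. The paper aims to use the inherent reasoning ability of VLMs to build a unified framework that simplifies this process.

Result: The approach surpasses state-of-the-art trainable models on high-resolution segmentation datasets and enables full-stack navigation in an Isaac Sim off-road simulation environment.

Insight: The contribution is the combination of SAM2 segmentation with VLM visual reasoning: the VLM receives the original image and the segmented image annotated with numeric mask labels and is prompted to identify drivable regions, which removes the need for explicit terrain-specific models and gives zero-shot off-road traversability analysis.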

Abstract: Traditional approaches to off-road autonomy rely on separate models for terrain classification, height estimation, and quantifying slip or slope conditions. Utilizing several models requires training each component separately, preparing task-specific datasets, and fine-tuning. In this work, we present a zero-shot approach leveraging SAM2 for environment segmentation and a vision-language model (VLM) to reason about drivable areas. Our approach involves passing to the VLM both the original image and the segmented image annotated with numeric labels for each mask. The VLM is then prompted to identify which regions, represented by these numeric labels, are drivable. Combined with planning and control modules, this unified framework eliminates the need for explicit terrain-specific models and relies instead on the inherent reasoning capabilities of the VLM. Our approach surpasses state-of-the-art trainable models on high-resolution segmentation datasets and enables full-stack navigation in our Isaac Sim off-road environment.
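The numeric-label prompting step can be sketched as two small text helpers: one that builds the question for the VLM and one that maps its free-text reply back to mask labels. The prompt wording, function names, and the digit-matching parser are all assumptions for illustration; the abstract does not give the actual prompt, and the image inputs and the VLM call itself are left out.

```python
import re


def build_prompt(mask_ids):
    """Build a text prompt listing the numeric labels drawn on the segmented image."""
    labels = ", ".join(str(mask_id) for mask_id in sorted(mask_ids))
    return (
        "You are given an off-road camera image and the same image with "
        "segmented regions labelled: " + labels + ". "
        "List the labels of the regions a ground vehicle can drive on, "
        "as comma-separated numbers."
    )


def parse_drivable(reply, mask_ids):
    """Extract the labels mentioned in the reply, keeping only known mask labels."""
    mentioned = {int(token) for token in re.findall(r"\d+", reply)}
    return sorted(mentioned & set(mask_ids))
```

Filtering the reply against the known mask labels guards against the model mentioning a number that was never drawn on the image; the resulting labels would then select the corresponding masks to form the drivable-area map.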


[188] AnyUser: Translating Sketched User Intent into Domestic Robots cs.RO | cs.CV | cs.HC

Songyuan Yang, Huibin Tan, Kailun Yang, Wenjing Yang, Shaowu Yang

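TL;DR: The paper presents AnyUser, a unified robotic instruction system that lets users intuitively direct domestic robots through free-form sketches on camera images, optionally combined with language. The system interprets the multimodal input as spatial-semantic primitives and generates executable robot actions without prior maps or models.

Details

Motivation: To bridge the gap between advanced robotic capabilities and interaction methods accessible to non-specialists, so that non-expert users can intuitively and conveniently direct domestic robots in everyday tasks.

Result: Quantitative benchmarks on a large-scale dataset show high accuracy in interpreting diverse sketch-based commands across a variety of simulated domestic scenes. Tasks such as targeted wiping and area cleaning were executed successfully on two different real robot platforms (a KUKA LBR iiwa arm and a Realman RMC-AIDAL dual-arm mobile manipulator). A user study shows significant improvements in usability and task specification efficiency, with task completion rates of 85.7%-96.4% and high user satisfaction.

Insight: The novelty lies in understanding user intent through multimodal fusion (sketch, vision, language), converting it into spatial-semantic primitives, and using a hierarchical policy for robust action generation, which lays the foundation for practical assistive robots adaptable to real-world human environments.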

Abstract: We introduce AnyUser, a unified robotic instruction system for intuitive domestic task instruction via free-form sketches on camera images, optionally with language. AnyUser interprets multimodal inputs (sketch, vision, language) as spatial-semantic primitives to generate executable robot actions requiring no prior maps or models. Novel components include multimodal fusion for understanding and a hierarchical policy for robust action generation. Efficacy is shown via extensive evaluations: (1) Quantitative benchmarks on the large-scale dataset showing high accuracy in interpreting diverse sketch-based commands across various simulated domestic scenes. (2) Real-world validation on two distinct robotic platforms, a statically mounted 7-DoF assistive arm (KUKA LBR iiwa) and a dual-arm mobile manipulator (Realman RMC-AIDAL), performing representative tasks like targeted wiping and area cleaning, confirming the system’s ability to ground instructions and execute them reliably in physical environments. (3) A comprehensive user study involving diverse demographics (elderly, simulated non-verbal, low technical literacy) demonstrating significant improvements in usability and task specification efficiency, achieving high task completion rates (85.7%-96.4%) and user satisfaction. AnyUser bridges the gap between advanced robotic capabilities and the need for accessible non-expert interaction, laying the foundation for practical assistive robots adaptable to real-world human environments.


eess.AS [Back]

[189] Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency eess.AS | cs.CL

Guan-Ting Lin, Chen Chen, Zhehuai Chen, Hung-yi Lee

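TL;DR: This paper introduces Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use. The dataset consists entirely of real human audio annotated with five disfluency categories, paired with scenarios that require chained API calls across four task domains. The authors evaluate six model configurations on accuracy, latency, and turn-taking.

Details

Motivation: Existing benchmarks for spoken language models lack real-world disfluent speech and multi-step tool-use scenarios. The authors built FDB-v3 to measure voice agents more comprehensively under realistic interaction conditions.

Result: On FDB-v3, GPT-Realtime leads on Pass@1 accuracy (0.600) and on avoiding inappropriate interruptions (13.5%); Gemini Live 3.1 achieves the lowest latency (4.25 s) but also the lowest turn-take rate (78.0%); the cascaded baseline has a perfect turn-take rate (100%) but the highest latency (10.12 s). Across all systems, handling self-corrections and multi-step reasoning in hard scenarios are the most consistent failure modes.

Insight: The contribution is a benchmark built entirely on real, disfluent human speech and integrated with multi-step tool-use tasks, which reflects practical challenges better than synthetic or simplified datasets. The work underscores the importance of jointly considering accuracy, latency, and interaction fluency (such as turn-taking) when evaluating voice agents, and it exposes a shared weakness of current models in handling complex, imperfect speech input, pointing to directions for future work.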

Abstract: We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use. Unlike prior work, our dataset consists entirely of real human audio annotated for five disfluency categories, paired with scenarios requiring chained API calls across four task domains. We evaluate six model configurations – GPT-Realtime, Gemini Live 2.5, Gemini Live 3.1, Grok, Ultravox v0.7, and a traditional Cascaded pipeline (Whisper → GPT-4o → TTS) – across accuracy, latency, and turn-taking dimensions. GPT-Realtime leads on Pass@1 (0.600) and interruption avoidance (13.5%); Gemini Live 3.1 achieves the fastest latency (4.25s) but the lowest turn-take rate (78.0%); and the Cascaded baseline, despite a perfect turn-take rate, incurs the highest latency (10.12s). Across all systems, self-correction handling and multi-step reasoning under hard scenarios remain the most consistent failure modes.


q-bio.NC [Back]

[190] Large Language Models Align with the Human Brain during Creative Thinking q-bio.NC | cs.AI | cs.CL

Mete Ismayilzada, Simone A. Luchini, Abdulkadir Gokce, Badr AlKhamissi, Antoine Bosselut

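TL;DR: This study explores the alignment of neural representations between large language models (LLMs) and the human brain during creative thinking, specifically divergent thinking. It analyzes fMRI data from 170 participants performing the Alternate Uses Task (AUT) and uses Representational Similarity Analysis (RSA) to measure how similar LLMs of different sizes (270M-72B) are to the brain's default mode and frontoparietal networks. Alignment increases with model size and idea originality and is most pronounced early in the creative process. In addition, different post-training objectives (creativity optimization, fine-tuning on human behavior, reasoning training) reshape LLM representations in functionally selective ways, producing different patterns of alignment with high- and low-creativity neural responses.

Details

Motivation: Existing brain-LLM alignment studies focus on passive, non-creative tasks and have not examined alignment during creative thinking. This work aims to reveal how LLM representations relate to human neural representations in a core creative cognitive task.

Result: On the AUT fMRI data, brain-LLM alignment increases with model size (default mode network only) and with idea originality (both networks), and the effects are strongest early in the creative process. A creativity-optimized Llama-3.1-8B-Instruct preserves alignment with high-creativity neural responses while reducing alignment with low-creativity ones.

Insight: The novelty is the first systematic study of LLM-brain neural alignment in an active creative task, showing the selective effects of model size, idea originality, and post-training objectives on alignment. The study provides neuroscientific evidence for understanding "human-like" creative representations in LLMs and indicates that targeted fine-tuning can steer how closely model representations match the neural geometry of different cognitive states.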

Abstract: Creative thinking is a fundamental aspect of human cognition, and divergent thinking, the capacity to generate novel and varied ideas, is widely regarded as its core generative engine. Large language models (LLMs) have recently demonstrated impressive performance on divergent thinking tests and prior work has shown that models with higher task performance tend to be more aligned to human brain activity. However, existing brain-LLM alignment studies have focused on passive, non-creative tasks. Here, we explore brain alignment during creative thinking using fMRI data from 170 participants performing the Alternate Uses Task (AUT). We extract representations from LLMs varying in size (270M-72B) and measure alignment to brain responses via Representational Similarity Analysis (RSA), targeting the creativity-related default mode and frontoparietal networks. We find that brain-LLM alignment scales with model size (default mode network only) and idea originality (both networks), with effects strongest early in the creative process. We further show that post-training objectives shape alignment in functionally selective ways: a creativity-optimized Llama-3.1-8B-Instruct preserves alignment with high-creativity neural responses while reducing alignment with low-creativity ones; a human behavior fine-tuned model elevates alignment with both; and a reasoning-trained variant shows the opposite pattern, suggesting chain-of-thought training steers representations away from creative neural geometry toward analytical processing. These results demonstrate that post-training objectives selectively reshape LLM representations relative to the neural geometry of human creative thought.
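
For readers unfamiliar with RSA, the following is a minimal sketch of the general technique, not the paper's pipeline. It builds a representational dissimilarity matrix (RDM) for each system as 1 minus the Pearson correlation between item patterns, then correlates the two RDMs' upper triangles. The toy data, the function names, and the use of Pearson (rather than a rank correlation, which is also common in RSA) for the second-order comparison are assumptions made for brevity.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def rdm_upper(patterns):
    """Upper triangle of the RDM: 1 - correlation for every pair of items."""
    n = len(patterns)
    return [
        1.0 - pearson(patterns[i], patterns[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]

def rsa_alignment(brain_patterns, model_patterns):
    """Correlate the two RDMs; the feature dimensions may differ."""
    return pearson(rdm_upper(brain_patterns), rdm_upper(model_patterns))

# Toy data: 4 items; 3 "voxels" per item vs. 4 "hidden units" per item.
brain = [[1.0, 2.0, 3.0], [1.1, 2.1, 2.9], [3.0, 1.0, 2.0], [2.9, 1.2, 2.1]]
model = [[0.5, 1.0, 1.5, 2.0], [0.6, 1.0, 1.4, 2.1],
         [2.0, 0.5, 1.0, 1.5], [1.9, 0.6, 1.1, 1.4]]
alignment = rsa_alignment(brain, model)
```

Because only the pairwise dissimilarity structure is compared, the brain and model patterns do not need the same dimensionality, which is what makes RSA usable across fMRI responses and LLM activations.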


cs.IR [Back]

[191] Align then Train: Efficient Retrieval Adapter Learning cs.IR | cs.CL

Seiji Maekawa, Moin Aminnaseri, Pouya Pezeshkpour, Estevam Hruschka

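TL;DR: This paper proposes Efficient Retrieval Adapter (ERA), a two-stage training framework that addresses the retrieval mismatch between complex queries and simple documents. It first aligns the embedding spaces of a large query embedder and a lightweight document embedder through self-supervised alignment, then performs supervised adaptation with limited labeled data, bridging both the representation gap and the semantic gap without re-indexing the corpus.

Details

Motivation: In realistic retrieval settings, users often express intent through long instructions or task descriptions while target documents remain relatively simple and static, which causes a retrieval mismatch. Directly fine-tuning large embedding models to follow instructions is computationally expensive, memory-intensive, and operationally burdensome.

Result: On the MAIR benchmark, spanning 126 retrieval tasks across 6 domains, ERA improves retrieval in low-label settings, outperforms methods that rely on larger amounts of labeled data, and effectively combines stronger query embedders with weaker document embedders across domains.

Insight: Inspired by LLM pre-training and supervised fine-tuning, the paper designs a two-stage adapter learning framework in which self-supervised alignment reduces the dependence on labeled data and enables efficient cross-model representation alignment. The approach is a useful reference for lightweight adaptation of asymmetric retrieval systems.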

Abstract: Dense retrieval systems increasingly need to handle complex queries. In many realistic settings, users express intent through long instructions or task-specific descriptions, while target documents remain relatively simple and static. This asymmetry creates a retrieval mismatch: understanding queries may require strong reasoning and instruction-following, whereas efficient document indexing favors lightweight encoders. Existing retrieval systems often address this mismatch by directly improving the embedding model, but fine-tuning large embedding models to better follow such instructions is computationally expensive, memory-intensive, and operationally burdensome. To address this challenge, we propose Efficient Retrieval Adapter (ERA), a label-efficient framework that trains retrieval adapters in two stages: self-supervised alignment and supervised adaptation. Inspired by the pre-training and supervised fine-tuning stages of LLMs, ERA first aligns the embedding spaces of a large query embedder and a lightweight document embedder, and then uses limited labeled data to adapt the query-side representation, bridging both the representation gap between embedding models and the semantic gap between complex queries and simple documents without re-indexing the corpus. Experiments on the MAIR benchmark, spanning 126 retrieval tasks across 6 domains, show that ERA improves retrieval in low-label settings, outperforms methods that rely on larger amounts of labeled data, and effectively combines stronger query embedders with weaker document embedders across domains.


[192] Lightweight Query Routing for Adaptive RAG: A Baseline Study on RAGRouter-Bench cs.IR | cs.CL | cs.LG

Prakhar Bansal, Shivangi Agarwal

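TL;DR: This paper presents the first systematic evaluation on RAGRouter-Bench of lightweight classifier-based query routing, which adaptively selects a retrieval strategy for each query type (factual, reasoning, summarization). It evaluates five classical classifiers under three feature regimes (TF-IDF, MiniLM sentence embeddings, hand-crafted structural features) and finds that TF-IDF with an SVM performs best, achieving a macro-averaged F1 of 0.928 and 93.2% accuracy while simulating 28.1% token savings.

Details

Motivation: Retrieval-augmented generation (RAG) pipelines span multiple retrieval strategies that differ substantially in token cost and capability. Selecting the right strategy per query is a practical efficiency problem, yet no routing classifiers had previously been trained on the RAGRouter-Bench benchmark.

Result: On RAGRouter-Bench (7,727 queries spanning four knowledge domains and three query types), the best configuration (TF-IDF + SVM) achieves a macro-averaged F1 of 0.928 and an accuracy of 93.2%, with simulated token savings of 28.1%. Lexical TF-IDF features outperform semantic sentence embeddings by 3.1 macro-F1 points. Domain-level analysis shows that medical queries are the hardest to route and legal queries the most tractable.

Insight: The contribution is the first reproducible, query-side-only lightweight routing baseline on RAGRouter-Bench. The key finding is that surface keyword patterns (TF-IDF) are strong predictors of query-type complexity and outperform semantic embeddings. This gives a simple and effective starting point for designing efficient, low-cost RAG routers and highlights the gap that corpus-aware routing must close.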

Abstract: Retrieval-Augmented Generation pipelines span a wide range of retrieval strategies that differ substantially in token cost and capability. Selecting the right strategy per query is a practical efficiency problem, yet no routing classifiers have been trained on RAGRouter-Bench \citep{wang2026ragrouterbench}, a recently released benchmark of 7,727 queries spanning four knowledge domains, each annotated with one of three canonical query types: factual, reasoning, and summarization. We present the first systematic evaluation of lightweight classifier-based routing on this benchmark. Five classical classifiers are evaluated under three feature regimes, namely, TF-IDF, MiniLM sentence embeddings \citep{reimers2019sbert}, and hand-crafted structural features, yielding 15 classifier-feature combinations. Our best configuration, TF-IDF with an SVM, achieves a macro-averaged F1 of 0.928 and an accuracy of 93.2%, while simulating 28.1% token savings relative to always using the most expensive paradigm. Lexical TF-IDF features outperform semantic sentence embeddings by 3.1 macro-F1 points, suggesting that surface keyword patterns are strong predictors of query-type complexity. Domain-level analysis reveals that medical queries are hardest to route and legal queries most tractable. These results establish a reproducible query-side baseline and highlight the gap that corpus-aware routing must close.
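
The "token savings relative to always using the most expensive paradigm" figure can be read as a simple ratio. The sketch below shows one plausible way to compute such a number. The one-to-one mapping from predicted query type to a retrieval strategy, the per-strategy token costs, and the function name `simulated_token_savings` are hypothetical; the abstract does not state the paper's actual cost model.

```python
def simulated_token_savings(predicted_types, cost_per_type):
    """Fraction of tokens saved versus always paying the highest per-query cost."""
    max_cost = max(cost_per_type.values())
    routed = sum(cost_per_type[t] for t in predicted_types)
    baseline = max_cost * len(predicted_types)
    return 1.0 - routed / baseline

# Hypothetical per-query token costs for the strategy chosen for each type.
costs = {"factual": 500, "reasoning": 2000, "summarization": 4000}
# Hypothetical router predictions for four queries.
preds = ["factual", "reasoning", "summarization", "factual"]
savings = simulated_token_savings(preds, costs)
```

With these made-up numbers the routed cost is 7,000 tokens against a 16,000-token baseline, so `savings` is 0.5625. A router that misclassifies a cheap query as an expensive one lowers the savings, which is why routing accuracy and token cost are reported together.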


[193] Ruling Out to Rule In: Contrastive Hypothesis Retrieval for Medical Question Answering cs.IR | cs.AI | cs.CL

Byeolhee Kim, Min-Kyung Kim, Young-Hak Kim, Tae-Joon Jeon

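TL;DR: This paper proposes Contrastive Hypothesis Retrieval (CHR), a framework for improving retrieval-augmented generation (RAG) in medical question answering. Mimicking clinical differential diagnosis, it generates a target hypothesis and a mimic hypothesis so that retrieval promotes relevant evidence while suppressing clinically similar but incorrect alternatives, reducing interference from hard negatives.

Details

Motivation: Standard retrievers in medical RAG frequently retrieve hard negatives that are semantically close to the query but describe clinically distinct conditions. Existing query-expansion methods lack an explicit mechanism to suppress these clinically plausible incorrect alternatives, leaving the system vulnerable to dominant mimic diagnoses.

Result: Across three medical QA benchmarks and three answer generators, CHR outperforms all five baselines in every configuration, with improvements of up to 10.4 percentage points over the next-best method. Analysis indicates that CHR's gains come mainly from substantive retrieval redirection rather than light re-ranking of the same candidates.

Insight: The core innovation is bringing the contrastive reasoning of clinical differential diagnosis into retrieval mechanism design. Explicitly modeling what to avoid (the mimic hypothesis) alongside what to find (the target hypothesis) guides retrieval and offers a practical path to reducing hard-negative contamination in medical RAG systems.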

Abstract: Retrieval-augmented generation (RAG) grounds large language models in external medical knowledge, yet standard retrievers frequently surface hard negatives that are semantically close to the query but describe clinically distinct conditions. While existing query-expansion methods improve query representation to mitigate ambiguity, they typically focus on enriching target-relevant semantics without an explicit mechanism to selectively suppress specific, clinically plausible hard negatives. This leaves the system prone to retrieving plausible mimics that overshadow the actual diagnosis, particularly when such mimics are dominant within the corpus. We propose Contrastive Hypothesis Retrieval (CHR), a framework inspired by the process of clinical differential diagnosis. CHR generates a target hypothesis $H^+$ for the likely correct answer and a mimic hypothesis $H^-$ for the most plausible incorrect alternative, then scores documents by promoting $H^+$-aligned evidence while penalizing $H^-$-aligned content. Across three medical QA benchmarks and three answer generators, CHR outperforms all five baselines in every configuration, with improvements of up to 10.4 percentage points over the next-best method. On the $n=587$ pooled cases where CHR answers correctly while embedded hypothetical-document query expansion does not, 85.2% have no shared documents between the top-5 retrieval lists of CHR and of that baseline, consistent with substantive retrieval redirection rather than light re-ranking of the same candidates. By explicitly modeling what to avoid alongside what to find, CHR bridges clinical reasoning with retrieval mechanism design and offers a practical path to reducing hard-negative contamination in medical RAG systems.
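
The "promote $H^+$, penalize $H^-$" scoring idea can be illustrated with a small sketch. The abstract does not give the exact scoring function, so the form below (cosine similarity to the target hypothesis minus a weighted cosine similarity to the mimic hypothesis) is an assumption, as are the `penalty` weight, the function names, and the toy embedding vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def contrastive_score(doc_vec, target_vec, mimic_vec, penalty=0.5):
    """Reward similarity to H+ and subtract weighted similarity to H- (assumed form)."""
    return cosine(doc_vec, target_vec) - penalty * cosine(doc_vec, mimic_vec)

def rank_documents(doc_vecs, target_vec, mimic_vec, penalty=0.5):
    """Return (doc_id, score) pairs sorted from best to worst."""
    scored = [
        (doc_id, contrastive_score(vec, target_vec, mimic_vec, penalty))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy embeddings: axis 0 ~ target condition, axis 1 ~ mimic condition.
target = [1.0, 0.0, 0.2]   # stands in for the embedding of H+
mimic = [0.0, 1.0, 0.2]    # stands in for the embedding of H-
docs = {
    "supports_target": [0.9, 0.1, 0.1],
    "supports_mimic": [0.1, 0.9, 0.1],
    "ambiguous": [0.6, 0.6, 0.1],
}
ranking = rank_documents(docs, target, mimic)
```

In this toy setup the document aligned with the mimic hypothesis drops to the bottom of the ranking even though it is close to the query topic, which is the behavior the abstract attributes to suppressing hard negatives.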


cs.GR [Back]

[194] Real-time Neural Six-way Lightmaps cs.GR | cs.CV

Wei Li, Hanxiao Sun, Tao Huang, Haoxiang Wang, Tongtong Wang

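TL;DR: This paper proposes a neural six-way lightmaps method for real-time rendering of participating media (such as smoke), balancing dynamic interaction and visual realism. The method generates a guiding map from the camera view using ray marching with a large sampling distance, then trains a neural network to predict the corresponding six-way lightmaps, which can be integrated seamlessly into existing game engines.

Details

Motivation: The traditional six-way lightmaps technique is limited to pre-simulated animation sequences and cannot handle dynamic interaction such as camera movement. The paper aims to render participating media in real time with both efficiency and realism.

Result: A series of comprehensive benchmarks shows that the method is well-suited for real-time applications such as games and VR/AR, and that it supports interactive effects including smoke-obstacle interaction, camera movement, and light change.

Insight: The novelty lies in combining guiding-map generation with neural prediction of six-way lightmaps to achieve real-time rendering in dynamic scenes. More broadly, the method pairs a traditional graphics technique with deep learning to improve rendering flexibility and realism.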

Abstract: Participating media are a pervasive and intriguing visual effect in virtual environments. Unfortunately, rendering such phenomena in real-time is notoriously difficult due to the computational expense of estimating the volume rendering equation. While the six-way lightmaps technique has been widely used in video games to render smoke with a camera-oriented billboard and approximate lighting effects using six precomputed lightmaps, achieving a balance between realism and efficiency, it is limited to pre-simulated animation sequences and is ignorant of camera movement. In this work, we propose a neural six-way lightmaps method to strike a long-sought balance between dynamics and visual realism. Our approach first generates a guiding map from the camera view using ray marching with a large sampling distance to approximate smoke scattering and silhouette. Then, given a guiding map, we train a neural network to predict the corresponding six-way lightmaps. The resulting lightmaps can be seamlessly used in existing game engine pipelines. This approach supports visually appealing rendering effects while enabling real-time user interactivity, including smoke-obstacle interaction, camera movement, and light change. By conducting a series of comprehensive benchmarks, we demonstrate that our method is well-suited for real-time applications, such as games and VR/AR.