Table of Contents
- cs.CL [Total: 23]
- cs.CV [Total: 102]
- cs.CY [Total: 1]
- cs.IR [Total: 1]
- cs.MA [Total: 1]
- cs.CR [Total: 1]
- cs.LG [Total: 4]
- cs.SD [Total: 1]
- cs.RO [Total: 5]
- cs.AI [Total: 14]
cs.CL
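[1] Lexical Tone is Hard to Quantize: Probing Discrete Speech Units in Mandarin and Yorùbá cs.CL | cs.LG
Opeyemi Osakuade, Simon King
TL;DR: This paper probes the limits of discrete speech units (DSUs) in encoding suprasegmental information such as lexical tone, using Mandarin and Yorùbá as test cases. It finds that although the latent representations of self-supervised learning (SSL) models do encode tone, quantization tends to prioritize segmental structure, so tone is encoded unreliably.
Details
Motivation: DSUs encode suprasegmental features such as tone and prosody unreliably, yet these features are critical for tasks such as speech synthesis and multimodal dialogue systems.
Result: Experiments show that the problem persists across several quantization methods, including K-means, whereas a two-stage K-means scheme (first encoding segmental information, then encoding tone from the residual representation) preserves tone better.
Insight: The work exposes the limitations of current DSU quantization strategies for suprasegmental features and proposes two-stage quantization as a potential remedy, pointing the way toward tone-aware or prosody-aware speech representation learning.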
Abstract: Discrete speech units (DSUs) are derived by quantising representations from models trained using self-supervised learning (SSL). They are a popular representation for a wide variety of spoken language tasks, including those where prosody matters. DSUs are especially convenient for tasks where text and speech are jointly modelled, such as text-to-speech and multimodal dialogue systems. But we have found that DSUs encode suprasegmental information less reliably than segmental structure, which we demonstrate in this work using lexical tone, though this limitation likely extends to other suprasegmental features such as prosody. Our investigations using the tone languages Mandarin and Yorùbá show that the SSL latent representations themselves do encode tone, yet DSUs obtained using quantisation tend to prioritise phonetic structure, which makes lexical tone less reliably encoded. This remains true for a variety of quantisation methods, not only the most common, K-means. We conclude that current DSU quantisation strategies have limitations for suprasegmental features, which suggests a need for new, tone-aware (or prosody-aware) techniques in speech representation learning. We point towards a potential form of the solution by performing K-means clustering once to encode phonetic information, then again on the residual representation, which better encodes lexical tone.
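The two-stage idea at the end of the abstract (cluster once, then cluster the residuals) can be illustrated with a toy sketch. Everything below is an assumption for illustration: the 2-D points stand in for SSL frame features, with the first coordinate playing the role of a large "segmental" contrast and the second a small "tone" contrast, and the codebook sizes, the farthest-point initialisation, and the helper names `kmeans` and `two_stage_units` are hypothetical rather than taken from the paper.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=10):
    """Lloyd's K-means with deterministic farthest-point initialisation.

    Returns (centroids, labels). Pure Python, intended for tiny examples only.
    """
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def two_stage_units(points, k1, k2):
    """Stage 1 clusters the raw features; stage 2 clusters the residuals."""
    cents1, labels1 = kmeans(points, k1)
    residuals = [[x - c for x, c in zip(p, cents1[lab])]
                 for p, lab in zip(points, labels1)]
    _, labels2 = kmeans(residuals, k2)
    return list(zip(labels1, labels2))


# Toy frames: coordinate 0 = large "segmental" contrast, coordinate 1 = small "tone" contrast.
points = [[0.0, -1.0], [0.0, 1.0], [10.0, -1.0], [10.0, 1.0]]
single_stage = kmeans(points, 2)[1]      # one codebook: tone contrast is merged away
units = two_stage_units(points, 2, 2)    # second codebook recovers it from the residual
```

In this toy setting a single codebook of size 2 assigns both tones of the same segment to one unit, while the second-stage label separates them. Real SSL features are high-dimensional and the effect is a matter of degree, so this only shows the mechanics.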
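[2] Enabling Intrinsic Reasoning over Dense Geospatial Embeddings with DFR-Gemma cs.CL | cs.AI
Xuechen Zhang, Aviv Slobodkin, Joydeep Paul, Mandar Sharma, Samet Oymak
TL;DR: This paper proposes DFR-Gemma, a framework that lets large language models reason directly over dense geospatial embeddings without an intermediate text conversion, improving both efficiency and accuracy.
Details
Motivation: Existing methods treat embeddings from geospatial foundation models as retrieval indices or convert them into textual descriptions for reasoning, which introduces redundancy, token inefficiency, and numerical inaccuracy; a more direct and efficient form of integration is needed.
Result: Experiments show that the DFR framework enables LLMs to decode latent spatial patterns and perform accurate zero-shot reasoning on a multi-task geospatial benchmark, while being significantly more efficient than text-based baselines.
Insight: The core innovation is a lightweight projector that aligns high-dimensional embeddings with the LLM's latent space, so that embeddings can be injected as semantic tokens alongside natural language instructions. This enables intrinsic reasoning over spatial features and offers a more direct, efficient, and scalable approach to multimodal geospatial intelligence.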
Abstract: Representation learning for geospatial and spatio-temporal data plays a critical role in enabling general-purpose geospatial intelligence. Recent geospatial foundation models, such as the Population Dynamics Foundation Model (PDFM), encode complex population and mobility dynamics into compact embeddings. However, their integration with Large Language Models (LLMs) remains limited. Existing approaches to LLM integration treat these embeddings as retrieval indices or convert them into textual descriptions for reasoning, introducing redundancy, token inefficiency, and numerical inaccuracies. We propose Direct Feature Reasoning-Gemma (DFR-Gemma), a novel framework that enables LLMs to reason directly over dense geospatial embeddings. DFR aligns high-dimensional embeddings with the latent space of an LLM via a lightweight projector, allowing embeddings to be injected as semantic tokens alongside natural language instructions. This design eliminates the need for intermediate textual representations and enables intrinsic reasoning over spatial features. To evaluate this paradigm, we introduce a multi-task geospatial benchmark that pairs embeddings with diverse question-answer tasks, including feature querying, comparison, and semantic description. Experimental results show that DFR allows LLMs to decode latent spatial patterns and perform accurate zero-shot reasoning across tasks, while significantly improving efficiency compared to text-based baselines. Our results demonstrate that treating embeddings as primary data inputs provides a more direct, efficient, and scalable approach to multimodal geospatial intelligence.
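[3] Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs cs.CL
Mengdan Zhu, Senhao Cheng, Liang Zhao
TL;DR: This paper proposes Decompose, Look, and Reason (DLR), a reinforced latent reasoning framework that addresses the loss of visual information caused by textual chain-of-thought (CoT) when vision-language models tackle complex visual reasoning. The framework dynamically decomposes a query into textual premises, extracts premise-conditioned continuous visual latents, and derives the answer through grounded rationales.
Details
Motivation: Existing methods either add the cost of tool calls or rely on localized patch embeddings, which extract too little semantic information for multi-step reasoning. The paper targets the core problem of visual information loss caused by textual CoT in complex visual reasoning.
Result: Extensive experiments on several vision-centric benchmarks show that DLR consistently outperforms strong baselines, including text-only, interleaved multimodal CoT, and latent reasoning methods, while offering better stepwise interpretability.
Insight: Main contributions: (1) a reinforced latent reasoning framework that dynamically decomposes queries and extracts conditioned visual latents; (2) a three-stage training pipeline and a novel Spherical Gaussian Latent Policy for effective exploration in the latent space; (3) improved interpretability of reasoning steps without sacrificing performance.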
Abstract: Vision-Language Models often struggle with complex visual reasoning due to the visual information loss in textual CoT. Existing methods either add the cost of tool calls or rely on localized patch-based embeddings that are insufficient to extract semantics in multi-step reasoning. We propose “Decompose, Look, and Reason” (DLR), a reinforced latent reasoning framework that dynamically decomposes queries into textual premises, extracts premise-conditioned continuous visual latents, and deduces answers through grounded rationales. We introduce a three-stage training pipeline and propose a novel Spherical Gaussian Latent Policy to enable effective exploration in the latent space. Extensive experiments on vision-centric benchmarks show that DLR consistently outperforms strong baselines, including text-only, interleaved multimodal CoT, and latent reasoning methods, while providing superior stepwise interpretability.
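[4] TR-EduVSum: A Turkish-Focused Dataset and Consensus Framework for Educational Video Summarization cs.CL | cs.AI
Figen Eğin, Aytuğ Onan
TL;DR: This study presents a framework that automatically generates gold-standard summaries of Turkish educational videos from multiple human summaries, and builds the TR-EduVSum dataset of 82 Turkish course videos and 3,281 human summaries. It proposes AutoMUP, which uses embedding-based clustering and statistical modeling to extract consensus content from multiple human summaries and produces graded summaries based on consensus weight. Experiments show high semantic overlap between AutoMUP summaries and summaries from LLMs such as Flash 2.5 and GPT-5.1, and the method can be extended to other Turkic languages at low cost.
Details
Motivation: There is no method for automatically and reproducibly generating gold-standard summaries of Turkish educational videos, and no dedicated dataset to support such research.
Result: AutoMUP summaries show high semantic overlap with summaries from strong LLMs such as Flash 2.5 and GPT-5.1; ablation studies confirm that consensus weight and clustering are decisive for summary quality.
Insight: Contributions include AutoMUP, an automatic consensus extraction method inspired by pyramid-based evaluation, and the first dataset for Turkish educational video summarization; the framework's combination of embedding clustering and statistical modeling offers a scalable approach to multilingual video summarization.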
Abstract: This study presents a framework for generating the gold-standard summary fully automatically and reproducibly based on multiple human summaries of Turkish educational videos. Within the scope of the study, a new dataset called TR-EduVSum was created, encompassing 82 Turkish course videos in the field of “Data Structures and Algorithms” and containing a total of 3281 independent human summaries. Inspired by existing pyramid-based evaluation approaches, the AutoMUP (Automatic Meaning Unit Pyramid) method is proposed, which extracts consensus-based content from multiple human summaries. AutoMUP clusters the meaning units extracted from human summaries using embedding, statistically models inter-participant agreement, and generates graded summaries based on consensus weight. In this framework, the gold summary corresponds to the highest-consensus AutoMUP configuration, constructed from the most frequently supported meaning units across human summaries. Experimental results show that AutoMUP summaries exhibit high semantic overlap with robust LLM (Large Language Model) summaries such as Flash 2.5 and GPT-5.1. Furthermore, ablation studies clearly demonstrate the decisive role of consensus weight and clustering in determining summary quality. The proposed approach can be generalized to other Turkic languages at low cost.
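The consensus-weight idea can be sketched in the style of the classic pyramid method. The abstract does not specify how AutoMUP models inter-participant agreement, so the weighting below (number of distinct participants supporting a cluster) is an assumption, the clustering step is taken as already done, and the names `consensus_weights` and `graded_summary` are hypothetical.

```python
from collections import defaultdict


def consensus_weights(units):
    """Map each meaning-unit cluster to the number of distinct supporting participants.

    `units` is an iterable of (participant_id, cluster_id) pairs, i.e. the
    output of some upstream clustering of meaning units (assumed, not shown).
    """
    supporters = defaultdict(set)
    for participant, cluster in units:
        supporters[cluster].add(participant)
    return {cluster: len(people) for cluster, people in supporters.items()}


def graded_summary(weights, min_weight):
    """Clusters with at least `min_weight` supporters, strongest consensus first."""
    kept = [c for c, w in weights.items() if w >= min_weight]
    return sorted(kept, key=lambda c: (-weights[c], c))


# Hypothetical annotations: three participants summarising one lecture.
units = [
    ("p1", "stack"), ("p2", "stack"), ("p3", "stack"),
    ("p1", "queue"), ("p2", "queue"),
    ("p3", "heap"),
    ("p1", "stack"),  # a repeated mention by the same participant counts once
]
weights = consensus_weights(units)
gold = graded_summary(weights, min_weight=3)    # highest-consensus tier
broad = graded_summary(weights, min_weight=2)   # a lower-threshold graded summary
```

Raising `min_weight` moves toward the highest-consensus configuration that the abstract identifies with the gold summary; lowering it yields longer, lower-consensus graded summaries.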
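[5] Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs cs.CL | cs.AI | cs.CY | cs.LG
Tunazzina Islam
TL;DR: This paper proposes a reasoning-based framework for refining unsupervised text clusters. It uses large language models as semantic judges to validate and restructure the output of any unsupervised clustering algorithm, improving cluster quality through three reasoning stages: coherence verification, redundancy adjudication, and label grounding.
Details
Motivation: Unsupervised clustering methods often produce clusters that are incoherent, redundant, or poorly grounded semantically, and these are hard to validate without labeled data.
Result: Evaluated on two real-world social media corpora with distinct interaction models, the framework consistently improves cluster coherence and human-aligned labeling quality over classical topic models and recent representation-based baselines; human evaluation shows strong agreement with the LLM-generated labels.
Insight: The key idea is to use LLMs as semantic judges rather than embedding generators, decoupling representation learning from structural validation and thereby mitigating common failure modes of embedding-only approaches; the reasoning-based framework offers a general mechanism for validating and refining unsupervised semantic structure.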
Abstract: Unsupervised methods are widely used to induce latent semantic structure from large text collections, yet their outputs often contain incoherent, redundant, or poorly grounded clusters that are difficult to validate without labeled data. We propose a reasoning-based refinement framework that leverages large language models (LLMs) not as embedding generators, but as semantic judges that validate and restructure the outputs of arbitrary unsupervised clustering algorithms. Our framework introduces three reasoning stages: (i) coherence verification, where LLMs assess whether cluster summaries are supported by their member texts; (ii) redundancy adjudication, where candidate clusters are merged or rejected based on semantic overlap; and (iii) label grounding, where clusters are assigned interpretable labels in a fully unsupervised manner. This design decouples representation learning from structural validation and mitigates common failure modes of embedding-only approaches. We evaluate the framework on real-world social media corpora from two platforms with distinct interaction models, demonstrating consistent improvements in cluster coherence and human-aligned labeling quality over classical topic models and recent representation-based baselines. Human evaluation shows strong agreement with LLM-generated labels, despite the absence of gold-standard annotations. We further conduct robustness analyses under matched temporal and volume conditions to assess cross-platform stability. Beyond empirical gains, our results suggest that LLM-based reasoning can serve as a general mechanism for validating and refining unsupervised semantic structure, enabling more reliable and interpretable analyses of large text collections without supervision.
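[6] Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models cs.CL
Steven Au, Sujit Noronha
TL;DR: This paper introduces PPT-Bench, a benchmark for evaluating how large language models behave under "epistemic attack", in which prompts challenge the legitimacy of knowledge, values, or identity rather than applying simple social pressure. Built on a taxonomy of philosophical pressure, the benchmark defines four pressure types and tests models at three layers, revealing weaknesses in epistemic consistency and conversational capitulation.
Details
Motivation: Prior work focuses mainly on sycophancy in the form of disagreement, flattery, and preference alignment, leaving broader epistemic failures underexplored. The paper builds a diagnostic benchmark to systematically evaluate the epistemic fragility of LLMs under deeper philosophical pressure.
Result: Tests on five models show that different types of epistemic attack produce statistically separable inconsistency patterns, exposing weaknesses that standard social-pressure benchmarks miss. Mitigation effectiveness depends strongly on attack type and model: prompt-level anchoring and persona-stability prompts work best for API models, while Leading Query Contrastive Decoding is the most reliable intervention for open models.
Insight: The contribution is the new concept of "epistemic attack" and the PPT-Bench diagnostic benchmark, whose Philosophical Pressure Taxonomy (PPT) and three-layer test framework (L0/L1/L2) allow a finer-grained evaluation of epistemic consistency. This provides a new analytical dimension and tool for understanding and improving LLM robustness.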
Abstract: Large language models (LLMs) can shift their answers under pressure in ways that reflect accommodation rather than reasoning. Prior work on sycophancy has focused mainly on disagreement, flattery, and preference alignment, leaving a broader set of epistemic failures less explored. We introduce PPT-Bench, a diagnostic benchmark for evaluating epistemic attack, where prompts challenge the legitimacy of knowledge, values, or identity rather than simply opposing a previous answer. PPT-Bench is organized around the Philosophical Pressure Taxonomy (PPT), which defines four types of philosophical pressure: Epistemic Destabilization, Value Nullification, Authority Inversion, and Identity Dissolution. Each item is tested at three layers: a baseline prompt (L0), a single-turn pressure condition (L1), and a multi-turn Socratic escalation (L2). This allows us to measure epistemic inconsistency between L0 and L1, and conversational capitulation in L2. Across five models, these pressure types produce statistically separable inconsistency patterns, suggesting that epistemic attack exposes weaknesses not captured by standard social-pressure benchmarks. Mitigation results are strongly type- and model-dependent: prompt-level anchoring and persona-stability prompts perform best in API settings, while Leading Query Contrastive Decoding is the most reliable intervention for open models.
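The two measurements named in the abstract, inconsistency between L0 and L1 and capitulation during L2, might be computed along the following lines. The abstract does not define the metrics, so exact-match comparison of answers, the "first departing turn" notion of capitulation, and both function names are assumptions made for illustration.

```python
def inconsistency_rate(baseline_answers, pressured_answers):
    """Fraction of items whose L1 (pressured) answer differs from the L0 baseline."""
    pairs = list(zip(baseline_answers, pressured_answers))
    if not pairs:
        raise ValueError("need at least one item")
    return sum(base != pressed for base, pressed in pairs) / len(pairs)


def capitulation_turn(baseline_answer, escalation_answers):
    """1-based L2 turn at which the answer first departs from L0, or None if it never does."""
    for turn, answer in enumerate(escalation_answers, start=1):
        if answer != baseline_answer:
            return turn
    return None


# Hypothetical multiple-choice answers for four items.
l0 = ["A", "B", "C", "D"]
l1 = ["A", "C", "C", "A"]
rate = inconsistency_rate(l0, l1)
held = capitulation_turn("A", ["A", "A"])         # model never gives way
gave_way = capitulation_turn("A", ["A", "A", "B"])  # gives way on the third turn
```

Computing `inconsistency_rate` separately per pressure type would give the per-type patterns that the abstract reports as statistically separable.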
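[7] TEMPER: Testing Emotional Perturbation in Quantitative Reasoning cs.CL | cs.AI
Atahan Dokme, Benjamin Reichman, Larry Heck
TL;DR: The paper presents TEMPER, a framework for studying how emotional language affects the quantitative reasoning of large language models. It builds the Temper-5400 dataset (5,400 semantically verified emotion-neutral problem pairs) and evaluates 18 models, from 1B parameters to frontier scale, on three benchmarks: GSM8K, MultiArith, and ARC-Challenge. Even with all numerical content preserved, emotional framing lowers accuracy by 2-10 percentage points, and neutralizing the problems recovers most of the lost performance.
Details
Motivation: Real-world queries often carry emotion such as frustration, urgency, or enthusiasm. The paper asks whether emotional framing alone, with all numerical content preserved, degrades the quantitative reasoning of large language models.
Result: Experiments on GSM8K, MultiArith, and ARC-Challenge show that emotional framing lowers accuracy by 2-10 percentage points. Neutralizing the problems recovers most of the lost performance, and non-emotional paraphrases cause no such degradation.
Insight: The paper develops a controlled emotion translation framework for benchmark construction, shows that emotional style (rather than content) by itself degrades reasoning, and proposes neutralization as a lightweight inference-time mitigation. The benchmark construction procedure provides a general framework for controlled stylistic translation and robustness evaluation.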
Abstract: Large language models are trained and evaluated on quantitative reasoning tasks written in clean, emotionally neutral language. However, real-world queries are often wrapped in frustration, urgency or enthusiasm. Does emotional framing alone degrade reasoning when all numerical content is preserved? To investigate this, a controlled emotion translation framework is developed that rewrites problems into emotional variants while preserving all quantities and relationships. Using this framework, Temper-5400 (5,400 semantically verified emotion–neutral pairs) is constructed across GSM8K, MultiArith, and ARC-Challenge, and evaluated on eighteen models (1B to frontier scale). Two core results emerge: First, emotional framing reduces accuracy by 2-10 percentage points even though all numerical content is preserved. Second, neutralizing emotional variants recovers most of the lost performance, showing both that the degradation is tied to emotional style rather than content corruption and that neutralization can serve as a lightweight inference-time mitigation. Non-emotional paraphrases cause no such degradation, implicating emotional content rather than surface-level changes. Beyond emotion specifically, the benchmark construction procedure provides a general framework for controlled stylistic translation and robustness evaluation.
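[8] Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers cs.CL | cs.AI | cs.LG
Harsh Kohli, Srinivasan Parthasarathy, Huan Sun, Yuekun Yao
TL;DR: This paper studies recurrent-depth transformers on implicit reasoning tasks, with a focus on compositional generalization. It shows that by iterating computation over the same layers, recurrent-depth transformers handle two compositional generalization challenges, systematic generalization and depth extrapolation, that standard transformers struggle with.
Details
Motivation: Large language models often fail to compose their parametric knowledge in implicit multi-hop reasoning, that is, they lack compositional generalization.
Result: Controlled studies with models trained from scratch show that recurrent-depth transformers achieve both systematic generalization and depth extrapolation, while standard transformers struggle. Systematic generalization emerges through a three-stage grokking process, and depth extrapolation can be unlocked by scaling the number of recurrent iterations at inference time.
Insight: Recurrent-depth architectures promote compositional generalization through iterative computation; systematic generalization follows an interpretable grokking dynamic; scaling inference-time recurrence is the key to depth extrapolation, though "overthinking" can degrade performance.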
Abstract: We study implicit reasoning, i.e. the ability to combine knowledge or rules within a single forward pass. While transformer-based large language models store substantial factual knowledge and rules, they often fail to compose this knowledge for implicit multi-hop reasoning, suggesting a lack of compositional generalization over their parametric knowledge. To address this limitation, we study recurrent-depth transformers, which enable iterative computation over the same transformer layers. We investigate two compositional generalization challenges under the implicit reasoning scenario: systematic generalization, i.e. combining knowledge that is never used for compositions during training, and depth extrapolation, i.e. generalizing from limited reasoning depth (e.g. training on up to 5-hop) to deeper compositions (e.g. 10-hop). Through controlled studies with models trained from scratch, we show that while vanilla transformers struggle with both generalization challenges, recurrent-depth transformers can effectively make such generalization. For systematic generalization, we find that this ability emerges through a three-stage grokking process, transitioning from memorization to in-distribution generalization and finally to systematic generalization, supported by mechanistic analysis. For depth extrapolation, we show that generalization beyond training depth can be unlocked by scaling inference-time recurrence, with more iterations enabling deeper reasoning. We further study how training strategies affect extrapolation, providing guidance on training recurrent-depth transformers, and identify a key limitation, overthinking, where excessive recurrence degrades predictions and limits generalization to very deep compositions.
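The structural point behind inference-time recurrence, one set of shared parameters applied a variable number of times, can be shown with a deliberately simplified stand-in. The sketch below replaces the transformer block with a fixed linear map that performs one "hop" over a toy relation; the relation, the one-hot state, and the names `apply_block` and `recurrent_depth` are all assumptions and say nothing about how the paper's models are built or trained.

```python
def apply_block(weights, state):
    """One pass through the shared block: a plain matrix-vector product."""
    return [sum(w * s for w, s in zip(row, state)) for row in weights]


def recurrent_depth(weights, state, num_iterations):
    """Apply the same block `num_iterations` times, reusing one set of weights."""
    for _ in range(num_iterations):
        state = apply_block(weights, state)
    return state


NUM_ENTITIES = 4
# Toy relation: entity i points to entity (i + 1) % 4. One block pass = one hop.
weights = [[1 if (i + 1) % NUM_ENTITIES == j else 0 for i in range(NUM_ENTITIES)]
           for j in range(NUM_ENTITIES)]
start = [1, 0, 0, 0]  # one-hot encoding of entity 0

two_hop = recurrent_depth(weights, start, 2).index(1)   # entity reached after 2 hops
five_hop = recurrent_depth(weights, start, 5).index(1)  # deeper composition, same weights
```

The number of hops is set purely by the iteration count, not by the parameter count, which is the property that makes depth extrapolation at inference time possible in principle. The overthinking failure described in the abstract has no analogue in this exact linear toy.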
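[9] MemReader: From Passive to Active Extraction for Long-Term Agent Memory cs.CL
Jingyi Kang, Chunyu Li, Ding Chen, Bo Tang, Feiyu Xiong
TL;DR: This paper introduces the MemReader family of models for active long-term memory extraction in agent systems. MemReader-0.6B is a compact, distilled passive extractor that produces accurate, schema-consistent structured outputs; MemReader-4B is an active extractor optimized with Group Relative Policy Optimization (GRPO) to make memory-writing decisions. Under a ReAct-style paradigm, MemReader-4B explicitly evaluates information value, reference ambiguity, and completeness before acting, and can then selectively write memories, defer incomplete inputs, retrieve historical context, or discard irrelevant chatter. Experiments show that MemReader outperforms existing extraction-based baselines on several benchmarks.
Details
Motivation: Existing systems treat memory extraction as a one-shot, passive transcription from context to structured entries, which handles noisy dialogue, missing references, and cross-turn dependencies poorly and leads to memory pollution, low-value writes, and inconsistency. The paper aims to solve these problems by moving memory extraction from passive to active.
Result: On the LOCOMO, LongMemEval, and HaluMem benchmarks, MemReader consistently outperforms existing extraction-based baselines. MemReader-4B in particular achieves state-of-the-art performance on tasks involving knowledge updating, temporal reasoning, and hallucination reduction.
Insight: The paper proposes an active memory extraction framework that optimizes the decision process with GRPO, enabling reasoning-driven, selective extraction that builds low-noise, dynamically evolving long-term memory. The point is that effective memory requires not just extracting more information but selecting it through reasoning. The models have been integrated into MemOS and deployed in real-world applications, and the models and API access are publicly released to support further research.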
Abstract: Long-term memory is fundamental for personalized and autonomous agents, yet populating it remains a bottleneck. Existing systems treat memory extraction as a one-shot, passive transcription from context to structured entries, which struggles with noisy dialogue, missing references, and cross-turn dependencies, leading to memory pollution, low-value writes, and inconsistency. In this paper, we introduce the MemReader family for active long-term memory extraction in agent systems: MemReader-0.6B, a compact and cost-efficient passive extractor distilled for accurate and schema-consistent structured outputs, and MemReader-4B, an active extractor optimized with Group Relative Policy Optimization (GRPO) to make memory writing decisions. Under a ReAct-style paradigm, MemReader-4B explicitly evaluates information value, reference ambiguity, and completeness before acting, and can selectively write memories, defer incomplete inputs, retrieve historical context, or discard irrelevant chatter. Experiments on LOCOMO, LongMemEval, and HaluMem show that MemReader consistently outperforms existing extraction-based baselines. In particular, MemReader-4B achieves state-of-the-art performance on tasks involving knowledge updating, temporal reasoning, and hallucination reduction. These results suggest that effective agent memory requires not merely extracting more information, but performing reasoning-driven and selective memory extraction to build low-noise and dynamically evolving long-term memory. Furthermore, MemReader has been integrated into MemOS and is being deployed in real-world applications. To support future research and adoption, we release the models and provide public API access.
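[10] Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning cs.CL | cs.AI | cs.LG
Shiwan Zhao, Zhihu Wang, Xuyang Zhao, Jiaming Zhou, Caiyue Xu
TL;DR: This paper offers a unified view of post-training methods for large language models, treating them as structured interventions on model behavior. It divides methods by trajectory provenance into two regimes, off-policy and on-policy learning, and introduces three core roles (effective support expansion, policy reshaping, and behavioral consolidation) to analyze existing techniques systematically.
Details
Motivation: Discussion of LLM post-training methods (SFT, preference optimization, reinforcement learning, and others) is fragmented and lacks a unified analytical framework. The paper aims to provide a systematic perspective that organizes these methods by the behavioral bottlenecks they address, in support of more coordinated system design.
Result: As a survey, the paper reports no quantitative experiments; instead, its unified framework gives a systematic reading of the major post-training paradigms (SFT, preference optimization, RL, distillation, and others) and the relationships among them.
Insight: The contribution is to view post-training methods uniformly as behavioral interventions and to propose an analytical framework based on trajectory provenance (off-policy vs. on-policy) and three core roles (support expansion, policy reshaping, behavioral consolidation). This helps diagnose training bottlenecks and design multi-stage pipelines, and underlines that progress in post-training increasingly depends on coordinated system design rather than on any single objective.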
Abstract: Post-training has become central to turning pretrained large language models (LLMs) into aligned and deployable systems. Recent progress spans supervised fine-tuning (SFT), preference optimization, reinforcement learning (RL), process supervision, verifier-guided methods, distillation, and multi-stage pipelines. Yet these methods are often discussed in fragmented ways, organized by labels or objective families rather than by the behavioral bottlenecks they address. This survey argues that LLM post-training is best understood as structured intervention on model behavior. We organize the field first by trajectory provenance, which defines two primary learning regimes: off-policy learning on externally supplied trajectories, and on-policy learning on learner-generated rollouts. We then interpret methods through two recurring roles – effective support expansion, which makes useful behaviors more reachable, and policy reshaping, which improves behavior within already reachable regions – together with a complementary systems-level role, behavioral consolidation, which preserves, transfers, and amortizes behavior across stages and model transitions. This perspective yields a unified reading of major paradigms. SFT may serve either support expansion or policy reshaping, whereas preference-based methods are usually off-policy reshaping. On-policy RL often improves behavior on learner-generated states, though under stronger guidance it can also make hard-to-reach reasoning paths reachable. Distillation is often best understood as consolidation rather than only compression, and hybrid pipelines emerge as coordinated multi-stage compositions. Overall, the framework helps diagnose post-training bottlenecks and reason about stage composition, suggesting that progress in LLM post-training increasingly depends on coordinated system design rather than any single dominant objective.
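[11] A Decomposition Perspective to Long-context Reasoning for LLMs cs.CL | cs.AI | cs.LG
Yanling Xiao, Huaibing Xie, Guoliang Zhao, Shihan Dou, Shaolei Wang
TL;DR: This paper takes a decomposition perspective to improving the long-context reasoning of large language models. The authors decompose complex long-context reasoning into a set of fundamental atomic skills and automatically synthesize pseudo datasets targeting each skill. Reinforcement learning on these datasets sharpens the model's atomic skills and thereby its general long-context reasoning performance.
Details
Motivation: Current research often overlooks the internal complexity of the long-context reasoning task itself; the paper moves beyond this holistic view by decomposing the task.
Result: Across the Loogle, Loong, LongBench-v2, BrowscompLong, Ruler-qa2, and MRCR benchmarks, the method improves the average score by 7.7 points (from 46.3% to 54.0%), clearly outperforming a strong baseline.
Insight: The contribution is to decompose long-context reasoning into atomic skills, synthesize targeted datasets, and sharpen those skills with reinforcement learning; viewed objectively, this decomposition and targeted training offers a new way to understand and improve long-context reasoning.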
Abstract: Long-context reasoning is essential for complex real-world applications, yet remains a significant challenge for Large Language Models (LLMs). Despite the rapid evolution in long-context reasoning, current research often overlooks the internal complexity of the long-context reasoning task itself. In this paper, we move beyond this holistic view and decompose long-context reasoning into a set of fundamental atomic skills, and we then automatically synthesize a suite of pseudo datasets, each explicitly targeting a specific atomic skill. Our empirical analysis confirms that proficiency in these atomic skills is strongly correlated with general long-text reasoning performance. Building on this insight, we employ reinforcement learning on these pseudo datasets to sharpen the model’s atomic skills, in the hope of boosting its general long-context reasoning ability. Extensive experiments across multiple benchmarks demonstrate the effectiveness of our approach: it outperforms a strong baseline by an average margin of 7.7% (improving from 46.3% to 54.0%) across Loogle, Loong, LongBench-v2, BrowscompLong, Ruler-qa2, and MRCR.
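[12] Guaranteeing Knowledge Integration with Joint Decoding for Retrieval-Augmented Generation cs.CL
Zhengyi Zhao, Shubo Zhang, Zezhong Wang, Yuxi Zhang, Huimin Wang
TL;DR: This paper proposes GuarantRAG, a framework that addresses the "integration bottleneck" in retrieval-augmented generation by explicitly decoupling reasoning from evidence integration. It first generates an "Inner-Answer" from parametric knowledge, then generates a "Refer-Answer" grounded in the retrieved evidence using a Contrastive DPO objective, and finally fuses the strengths of both at the token level through a joint decoding mechanism.
Details
Motivation: Current RAG research focuses mainly on retrieval quality and overlooks the "integration bottleneck": even when relevant documents are retrieved, LLMs often fail to use them effectively because of conflicts between parametric knowledge and external evidence. Resolving this conflict implicitly in a single generation pass works poorly.
Result: On five QA benchmarks, GuarantRAG improves accuracy by up to 12.1% and reduces hallucinations by 16.3% relative to standard and dynamic RAG baselines, achieving state-of-the-art performance.
Insight: Contributions: (1) explicit separation of the reasoning flow from the evidence integration stage; (2) a Contrastive DPO objective that treats parametric knowledge as a negative constraint and retrieved documents as positive samples to guarantee faithful evidence extraction; (3) a joint decoding mechanism that dynamically fuses logical coherence with factual precision at the token level.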
Abstract: Retrieval-Augmented Generation (RAG) significantly enhances Large Language Models (LLMs) by providing access to external knowledge. However, current research primarily focuses on retrieval quality, often overlooking the critical “integration bottleneck”: even when relevant documents are retrieved, LLMs frequently fail to utilize them effectively due to conflicts with their internal parametric knowledge. In this paper, we argue that implicitly resolving this conflict in a single generation pass is suboptimal. We introduce GuarantRAG, a framework that explicitly decouples reasoning from evidence integration. First, we generate an “Inner-Answer” based solely on parametric knowledge to capture the model’s reasoning flow. Second, to guarantee faithful evidence extraction, we generate a “Refer-Answer” using a novel Contrastive DPO objective. This objective treats the parametric Inner-Answer as a negative constraint and the retrieved documents as positive ground truth, forcing the model to suppress internal hallucinations in favor of external evidence during this phase. Finally, rather than naive concatenation or using the DPO trained model directly, we propose a joint decoding mechanism that dynamically fuses the logical coherence of the Inner-Answer with the factual precision of the Refer-Answer at the token level. Experiments on five QA benchmarks demonstrate that GuarantRAG improves accuracy by up to 12.1% and reduces hallucinations by 16.3% compared to standard and dynamic RAG baselines.
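[13] Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving cs.CL
Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang
TL;DR: This paper proposes dual-pool token-budget routing, a lightweight dispatch mechanism that addresses KV-cache over-allocation and under-utilized concurrency in production LLM serving, where instances are provisioned for the worst-case context length. The method partitions a homogeneous fleet into a high-throughput pool for short contexts and a high-capacity pool for long contexts, and routes each request by its estimated total token budget; the bytes-to-token ratio is learned online, so no tokenizer is needed.
Details
Motivation: Production vLLM fleets suffer from a mismatch between configuration and traffic, which wastes throughput and causes reliability problems such as OOM crashes, preemption, and request rejections. The root cause is that configurations optimized for long contexts serve a workload dominated by short-context requests.
Result: Evaluated on real-world traces from the Azure LLM Inference Dataset and LMSYS-Chat-1M, serving Llama-3-70B on A100 GPUs, the method reduces GPU-hours by 31-42%, equivalent to $2.86M in annual savings, while lowering preemption rates by 5.4× and improving P99 TTFT by 6%. A case study serving Qwen3-235B-A22B on AMD MI300X projects $15.4M in annual savings.
Insight: The contribution is to combine pool specialization with routing based on an online-learned token budget, adapting dynamically to heterogeneous workloads and improving cost efficiency and reliability at very low dispatch overhead. The method composes seamlessly with existing optimizations such as PagedAttention, continuous batching, and prefill-decode disaggregation, and comes with a simple analytical model, based on workload characteristics and measured throughput differences, for estimating benefits before deployment.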
Abstract: Production vLLM fleets typically provision each instance for the worst-case context length, leading to substantial KV-cache over-allocation and under-utilized concurrency. In practice, 80-95% of requests are short, yet are served under configurations optimized for long contexts, wasting 4-8× throughput capacity and triggering reliability issues such as OOM crashes, preemption, and request rejections. We identify a common root cause for these inefficiencies: configuration-traffic mismatch. We propose dual-pool token-budget routing, a lightweight dispatch mechanism that partitions a homogeneous fleet into two specialized pools: a high-throughput short-context pool and a high-capacity long-context pool. Each request is routed based on its estimated total token budget, computed using a per-category bytes-to-token ratio that is learned online via exponential moving average from usage.prompt_tokens feedback, eliminating the need for a tokenizer. We also develop a simple analytical model that predicts fleet-level cost savings from workload characteristics and measured throughput differences, enabling practitioners to estimate benefits prior to deployment. Evaluations on real-world traces from the Azure LLM Inference Dataset and LMSYS-Chat-1M, serving Llama-3-70B on A100 GPUs, show that our approach reduces GPU-hours by 31-42%, corresponding to $2.86M annual savings at fleet scale, while lowering preemption rates by 5.4× and improving P99 TTFT by 6%. A case study with Qwen3-235B-A22B on AMD MI300X at 10,000 req/s projects $15.4M in annual savings. The method incurs only O(1) dispatch overhead, adapts automatically to heterogeneous workloads, and composes seamlessly with existing optimizations such as PagedAttention, continuous batching, and prefill-decode disaggregation.
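The routing rule described in the abstract can be sketched as a small class. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `TokenBudgetRouter`, the pool threshold, the smoothing factor, the initial ratio of 4 bytes per token, and the choice to define the total budget as estimated prompt tokens plus the requested maximum output tokens are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBudgetRouter:
    short_pool_limit: int        # largest token budget the short-context pool accepts (assumed)
    alpha: float = 0.1           # EMA smoothing factor (assumed value)
    default_ratio: float = 4.0   # initial bytes-per-token guess (assumed value)
    ratios: dict = field(default_factory=dict)  # per-category learned bytes-per-token

    def estimate_budget(self, category, prompt, max_output_tokens):
        """Estimated prompt tokens (bytes / learned ratio) plus the output allowance."""
        ratio = self.ratios.get(category, self.default_ratio)
        return len(prompt.encode("utf-8")) / ratio + max_output_tokens

    def route(self, category, prompt, max_output_tokens):
        """O(1) dispatch: one dictionary lookup, one division, one comparison."""
        budget = self.estimate_budget(category, prompt, max_output_tokens)
        return "short" if budget <= self.short_pool_limit else "long"

    def observe(self, category, prompt, prompt_tokens):
        """Fold the server-reported usage.prompt_tokens into the category's EMA."""
        if prompt_tokens <= 0:
            return
        observed = len(prompt.encode("utf-8")) / prompt_tokens
        previous = self.ratios.get(category, self.default_ratio)
        self.ratios[category] = (1 - self.alpha) * previous + self.alpha * observed


router = TokenBudgetRouter(short_pool_limit=100, alpha=0.5)
prompt = "a" * 400                                    # 400 bytes, about 100 tokens at 4 bytes/token
first = router.route("chat", prompt, max_output_tokens=0)
second = router.route("chat", prompt, max_output_tokens=50)
router.observe("chat", prompt, prompt_tokens=200)     # feedback: this text is denser than assumed
```

Because the ratio is learned from response metadata, the dispatcher never has to load or run a tokenizer; the cost is that early estimates for a new category depend on the initial guess until feedback arrives.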
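[14] Quantum Vision Theory Applied to Audio Classification for Deepfake Speech Detection cs.CL
Khalid Zaman, Melike Sah, Anuwat Chaiwongyenc, Cem Direkoglu
TL;DR: This paper applies Quantum Vision (QV) theory to audio classification, specifically deepfake speech detection. By converting spectrograms of speech signals into information waves and building QV-based convolutional neural network and vision transformer models, it achieves higher classification accuracy and robustness than conventional models on the ASVspoof dataset.
Details
Motivation: Inspired by wave-particle duality in quantum physics, QV theory holds that data can be represented not only in its observable, collapsed form but also as information waves. Conventional deep learning operates directly on collapsed representations, whereas QV theory first transforms inputs into information waves before classification; the paper explores whether this is effective for audio classification.
Result: On the ASVspoof dataset, both QV-CNN and QV-ViT outperform standard CNN and ViT models. QV-CNN with MFCC features achieves the best overall performance (accuracy 94.20%, EER 9.04%), and QV-CNN with Mel-spectrograms reaches the highest accuracy, 94.57%.
Insight: The contribution is to extend QV theory from images to audio, using the information-wave transform to strengthen the model's representation of speech features. This opens a new direction for quantum-inspired learning in audio perception tasks and demonstrates the theory's potential to improve deepfake detection.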
Abstract: We propose Quantum Vision (QV) theory as a new perspective for deep learning-based audio classification, applied to deepfake speech detection. Inspired by particle-wave duality in quantum physics, QV theory is based on the idea that data can be represented not only in its observable, collapsed form, but also as information waves. In conventional deep learning, models are trained directly on these collapsed representations, such as images. In QV theory, inputs are first transformed into information waves using a QV block, and then fed into deep learning models for classification. QV-based models improve performance in image classification compared to their non-QV counterparts. What if QV theory is applied to speech spectrograms for audio classification tasks? This is the motivation and novelty of the proposed approach. In this work, Short-Time Fourier Transform (STFT), Mel-spectrograms, and Mel-Frequency Cepstral Coefficients (MFCC) of speech signals are converted into information waves using the proposed QV block and used to train QV-based Convolutional Neural Networks (QV-CNN) and QV-based Vision Transformers (QV-ViT). Extensive experiments are conducted on the ASVspoof dataset for deepfake speech classification. The results show that QV-CNN and QV-ViT consistently outperform standard CNN and ViT models, achieving higher classification accuracy and improved robustness in distinguishing genuine and spoofed speech. Moreover, the QV-CNN model using MFCC features achieves the best overall performance on the ASVspoof dataset, with an accuracy of 94.20% and an EER of 9.04%, while the QV-CNN with Mel-spectrograms attains the highest accuracy of 94.57%. These findings demonstrate that QV theory is an effective and promising approach for audio deepfake detection and opens new directions for quantum-inspired learning in audio perception tasks.
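[15] Self-Debias: Self-correcting for Debiasing Large Language Models cs.CL
Xuan Feng, Shuai Zhao, Luwei Xiao, Tianlong Gu, Bo An
TL;DR: This paper proposes Self-Debias, a framework that addresses the continued propagation of social bias through the chain-of-thought reasoning of large language models. It reformulates debiasing as a strategic resource redistribution problem and, through fine-grained trajectory-level optimization with dynamic constraints, enables the model to identify and revise biased reasoning paths on its own while preserving valid contextual prefixes. With only 20k annotated samples, the framework activates efficient self-correction and achieves superior debiasing performance while preserving general reasoning capabilities.
Details
Motivation: Existing debiasing methods rely mainly on static constraints or external interventions and cannot identify and interrupt bias propagation through the chain of thought once it has been triggered. A progressive framework that gives the model intrinsic self-correction capabilities is therefore needed.
Result: With only 20k annotated samples, Self-Debias achieves superior debiasing performance while preserving the model's general reasoning capabilities, without continuous external supervision.
Insight: The contribution is to treat debiasing as a strategic redistribution of probability mass and to design a fine-grained trajectory-level objective with dynamic constraints, combined with an online self-improvement mechanism based on consistency filtering, so that the model can synthesize its own supervision signals and revise selectively.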
Abstract: Although Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, inherent social biases often cascade throughout the Chain-of-Thought (CoT) process, leading to continuous “Bias Propagation”. Existing debiasing methods primarily focus on static constraints or external interventions, failing to identify and interrupt this propagation once triggered. To address this limitation, we introduce Self-Debias, a progressive framework designed to instill intrinsic self-correction capabilities. Specifically, we reformulate the debiasing process as a strategic resource redistribution problem, treating the model’s output probability mass as a limited resource to be reallocated from biased heuristics to unbiased reasoning paths. Unlike standard preference optimization which applies broad penalties, Self-Debias employs a fine-grained trajectory-level objective subject to dynamic debiasing constraints. This enables the model to selectively revise biased reasoning suffixes while preserving valid contextual prefixes. Furthermore, we integrate an online self-improvement mechanism utilizing consistency filtering to autonomously synthesize supervision signals. With merely 20k annotated samples, Self-Debias activates efficient self-correction, achieving superior debiasing performance while preserving general reasoning capabilities without continuous external oversight.
[16] Behavior-Aware Item Modeling via Dynamic Procedural Solution Representations for Knowledge Tracing cs.CL | cs.AIPDF
Jun Seo, Sangwon Ryu, Heejin Do, Hyounghun Kim, Gary Geunbae Lee
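TL;DR: This paper proposes Behavior-Aware Item Modeling (BAIM), a knowledge tracing framework that enriches item representations with dynamic procedural solution information to predict learners' future performance more accurately.
Details
Motivation: Existing knowledge tracing methods have improved by learning item representations aligned with Knowledge Components, but they overlook the procedural dynamics of problem solving; BAIM aims to address this.
Result: Experiments on XES3G5M and NIPS34 show that BAIM consistently outperforms strong pretraining-based baselines, with particularly large gains under repeated learner interactions.
Insight: The novelty lies in using a reasoning language model to decompose each solution into four pedagogical stages (understand, plan, carry out, look back) and introducing a context-conditioned mechanism that adaptively routes these stage-level representations, capturing latent signals beyond surface features and reflecting learner heterogeneity.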
Abstract: Knowledge Tracing (KT) aims to predict learners’ future performance from past interactions. While recent KT approaches have improved via learning item representations aligned with Knowledge Components, they overlook the procedural dynamics of problem solving. We propose Behavior-Aware Item Modeling (BAIM), a framework that enriches item representations by integrating dynamic procedural solution information. BAIM leverages a reasoning language model to decompose each item’s solution into four problem-solving stages (i.e., understand, plan, carry out, and look back), pedagogically grounded in Polya’s framework. Specifically, it derives stage-level representations from per-stage embedding trajectories, capturing latent signals beyond surface features. To reflect learner heterogeneity, BAIM adaptively routes these stage-wise representations, introducing a context-conditioned mechanism within a KT backbone, allowing different procedural stages to be emphasized for different learners. Experiments on XES3G5M and NIPS34 show that BAIM consistently outperforms strong pretraining-based baselines, achieving particularly large gains under repeated learner interactions.
[17] When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning cs.CLPDF
Ruotao Xu, Yixin Ji, Yu Luo, Jinpeng Li, Dong Li
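TL;DR: This paper proposes Adaptive Tool Trust Calibration (ATTC), a framework that addresses the "Tool Ignored" problem in Tool-Integrated Reasoning (TIR) models. It scores the confidence of the code blocks the model generates and uses that score to guide the model to trust or ignore the tool output, improving performance on math reasoning tasks.
Details
Motivation: When their reasoning conflicts with the tool output, existing open-source TIR models tend to believe their own reasoning and ignore correct tool results, producing wrong answers (the "Tool Ignored" problem). This indicates that the models cannot judge when to trust the tool.
Result: Experiments across models of different sizes and multiple datasets show that ATTC effectively reduces the "Tool Ignored" problem and improves performance by 4.1% to 7.5%.
Insight: The novelty is an adaptive trust calibration mechanism based on code-block confidence, which lets the model dynamically assess and integrate tool output rather than blindly relying on either its own reasoning or the tool result. This offers a new approach to improving the robustness and reliability of tool-augmented models.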
Abstract: Large reasoning models (LRMs) have achieved strong performance enhancement through scaling test-time computation, but due to the inherent limitations of the underlying language models, they still have shortcomings in tasks that require precise computation and extensive knowledge reserves. Tool-Integrated Reasoning (TIR) has emerged as a promising paradigm that incorporates tool call and execution within the reasoning trajectory. Although recent works have released some powerful open-source TIR models, our analysis reveals that these models still suffer from critical deficiencies. We find that when the reasoning of the model conflicts with the tool results, the model tends to believe in its own reasoning. There are also cases where the tool results are correct but are ignored by the model, resulting in incorrect answers, which we define as “Tool Ignored”. This indicates that the model does not know when to trust or ignore the tool. To overcome these limitations, we introduce Adaptive Tool Trust Calibration (ATTC), a novel framework that guides the model to adaptively choose to trust or ignore the tool results based on the confidence score of generated code blocks. The experimental results from various open-source TIR models of different sizes and across multiple datasets demonstrate that ATTC effectively reduces the “Tool Ignored” issue, resulting in a performance increase of 4.1% to 7.5%.
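The trust-or-ignore decision described in the abstract can be pictured with a small sketch. The abstract does not say how the confidence score is computed or how it is thresholded, so everything here is an assumption made for illustration: the function names, the geometric-mean-of-token-probabilities score, and the threshold value of 0.8 are all hypothetical and are not taken from the paper.

```python
import math

def code_block_confidence(token_logprobs):
    """Geometric-mean token probability of a generated code block.

    Assumed definition for illustration only; the paper's score may differ.
    """
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def choose_answer(model_answer, tool_answer, token_logprobs, threshold=0.8):
    """Return (answer, confidence), trusting the tool only for confident code."""
    conf = code_block_confidence(token_logprobs)
    if model_answer == tool_answer:
        # No conflict, so there is nothing to calibrate.
        return model_answer, conf
    # Conflict: trust the tool when the code block was generated confidently,
    # otherwise fall back to the model's own reasoning.
    return (tool_answer if conf >= threshold else model_answer), conf

# Confidently generated code: the tool result wins the conflict.
answer, _ = choose_answer("12", "14", [math.log(0.9)] * 3)
print(answer)
```

The single threshold is the simplest possible gate; a calibrated or learned decision rule would fit the same interface.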
[18] Distributed Multi-Layer Editing for Rule-Level Knowledge in Large Language Models cs.CL | cs.AIPDF
Yating Wang, Wenting Zhao, Yaqi Zhao, Yongshun Gong, Yilong Yin
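TL;DR: This paper studies the editing of rule-level knowledge in large language models and proposes Distributed Multi-Layer Editing (DMLE). A mechanistic analysis finds that rule knowledge has a form-specific distribution across transformer layers; DMLE builds on this by applying targeted updates at different layers, substantially improving rule editing performance.
Details
Motivation: Existing model editing methods mainly target fact-level knowledge and assume that an edit can be achieved through a localized intervention, but this assumption does not hold for rule-level knowledge, which must remain consistent across multiple forms.
Result: On the extended RuleEdit benchmark (200 mathematics and physics rules), DMLE improves instance portability by 13.91 percentage points and rule understanding by 50.19 percentage points on average over the strongest baseline across GPT-J-6B, Qwen2.5-7B, Qwen2-7B, and LLaMA-3-8B, while remaining competitive on standard editing metrics.
Insight: The novelty lies in revealing the non-uniform distribution of rule knowledge across transformer layers (formulas and descriptions are concentrated in earlier layers, while instances are more associated with middle layers) and, on that basis, proposing DMLE, which edits different knowledge forms at different layers, offering a new approach to rule-level knowledge editing.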
Abstract: Large language models store not only isolated facts but also rules that support reasoning across symbolic expressions, natural language explanations, and concrete instances. Yet most model editing methods are built for fact-level knowledge, assuming that a target edit can be achieved through a localized intervention. This assumption does not hold for rule-level knowledge, where a single rule must remain consistent across multiple interdependent forms. We investigate this problem through a mechanistic study of rule-level knowledge editing. To support this study, we extend the RuleEdit benchmark from 80 to 200 manually verified rules spanning mathematics and physics. Fine-grained causal tracing reveals a form-specific organization of rule knowledge in transformer layers: formulas and descriptions are concentrated in earlier layers, while instances are more associated with middle layers. These results suggest that rule knowledge is not uniformly localized, and therefore cannot be reliably edited by a single-layer or contiguous-block intervention. Based on this insight, we propose Distributed Multi-Layer Editing (DMLE), which applies a shared early-layer update to formulas and descriptions and a separate middle-layer update to instances. While remaining competitive on standard editing metrics, DMLE achieves substantially stronger rule-level editing performance. On average, it improves instance portability and rule understanding by 13.91 and 50.19 percentage points, respectively, over the strongest baseline across GPT-J-6B, Qwen2.5-7B, Qwen2-7B, and LLaMA-3-8B. The code is available at https://github.com/Pepper66/DMLE.
[19] SeLaR: Selective Latent Reasoning in Large Language Models cs.CL | cs.AIPDF
Renyu Fu, Guibo Luo
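TL;DR: This paper proposes SeLaR (Selective Latent Reasoning), a framework that addresses the limitations of chain-of-thought reasoning in large language models. Using an entropy-gated mechanism and an entropy-aware contrastive regularization, SeLaR activates soft embeddings only at low-confidence steps to strengthen exploration and keeps discrete decoding at high-confidence steps for stability, improving reasoning performance without training.
Details
Motivation: Existing latent reasoning methods (using soft embeddings or hidden states) suffer from two problems: global activation perturbs high-confidence steps and reduces stability, and soft embeddings quickly collapse toward the highest-probability token, limiting exploration. SeLaR aims to address both through selective activation and regularization.
Result: Experiments on five reasoning benchmarks show that SeLaR consistently outperforms standard CoT and state-of-the-art training-free methods, reaching a new state-of-the-art level.
Insight: The novelties include an entropy-gated selective activation mechanism (combining discrete and soft-embedding decoding) and an entropy-aware contrastive regularization (encouraging multi-path exploration). Viewed objectively, this lightweight, training-free framework offers a useful design for balancing reasoning stability and exploration.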
Abstract: Chain-of-Thought (CoT) has become a cornerstone of reasoning in large language models, yet its effectiveness is constrained by the limited expressiveness of discrete token sampling. Recent latent reasoning approaches attempt to alleviate this limitation by replacing discrete tokens with soft embeddings (probability-weighted mixtures of token embeddings) or hidden states, but they commonly suffer from two issues: (1) global activation injects perturbations into high-confidence steps, impairing reasoning stability; and (2) soft embeddings quickly collapse toward the highest-probability token, limiting exploration of alternative trajectories. To address these challenges, we propose SeLaR (Selective Latent Reasoning), a lightweight and training-free framework. SeLaR introduces an entropy-gated mechanism that activates soft embeddings only at low-confidence steps, while preserving discrete decoding at high-confidence steps. Additionally, we propose an entropy-aware contrastive regularization that pushes soft embeddings away from the dominant (highest-probability) token’s direction, encouraging sustained exploration of multiple latent reasoning paths. Experiments on five reasoning benchmarks demonstrate that SeLaR consistently outperforms standard CoT and state-of-the-art training-free methods.
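A minimal sketch of the entropy gate described in the abstract may help. It uses the abstract's definition of a soft embedding (a probability-weighted mixture of token embeddings) but leaves out the entropy-aware contrastive regularization entirely. The function names, the greedy choice at confident steps, and the threshold of 1.0 nat are assumptions made for illustration, not details from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def next_input_embedding(logits, embeddings, entropy_threshold=1.0):
    """Pick the next-step input embedding with an entropy gate.

    Low entropy (confident step): keep discrete decoding (greedy here).
    High entropy (uncertain step): feed a probability-weighted mixture.
    """
    probs = softmax(logits)
    if entropy(probs) < entropy_threshold:
        top = max(range(len(probs)), key=probs.__getitem__)
        return list(embeddings[top]), "discrete"
    dim = len(embeddings[0])
    soft = [sum(p * emb[d] for p, emb in zip(probs, embeddings)) for d in range(dim)]
    return soft, "soft"

# Toy vocabulary of three tokens with 2-dimensional embeddings.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(next_input_embedding([10.0, 0.0, 0.0], toy_embeddings)[1])  # confident step
print(next_input_embedding([0.0, 0.0, 0.0], toy_embeddings)[1])   # uncertain step
```

With uniform logits over three tokens the entropy is ln 3, about 1.10 nats, which exceeds the assumed threshold, so the second call takes the soft branch.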
[20] Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces cs.CLPDF
Jiawei Chen, Ruoxi Xu, Boxi Cao, Ruotong Pan, Yunfei Zhang
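TL;DR: This paper introduces OmniBehavior, the first user simulation benchmark built entirely from real-world data, designed to evaluate how well large language models simulate long-horizon, cross-scenario, heterogeneous behavior traces. It finds that existing models struggle to simulate complex behaviors accurately and show structural biases, such as converging toward a "positive average person", which erases individual differences and long-tail behaviors.
Details
Motivation: Existing user simulation benchmarks are limited to isolated scenarios, narrow action spaces, or synthetic data and fail to capture the holistic nature of real human behavior, so a real-world benchmark integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns is needed to fill this gap.
Result: Extensive evaluation on OmniBehavior shows that current state-of-the-art LLMs perform poorly when simulating complex behaviors, with performance plateauing even as context windows expand. Comparing simulated with real behaviors reveals structural biases: models tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a Utopian bias.
Insight: The novelty lies in building the first user simulation benchmark based entirely on real-world data and in emphasizing the importance of long-horizon, cross-scenario causal chains for real decision-making. The analysis exposes structural biases in LLM simulation and points to key directions for future high-fidelity simulation research, such as preserving individual differences and long-tail behaviors.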
Abstract: The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a Utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.
[21] Synthetic Data for any Differentiable Target cs.CL | cs.AI | cs.LG | stat.MLPDF
Tristan Thrush, Sung Min Park, Herman Brunborg, Luke Bailey, Marcel Roed
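TL;DR: This paper proposes a reinforcement learning primitive, the Dataset Policy Gradient (DPG), for precisely optimizing synthetic data generators to produce datasets of examples targeted at a specific objective. When used for supervised fine-tuning (SFT), the synthetic data makes the target model do well on a differentiable metric. The method takes exact data attribution via higher-order gradients and uses those scores as policy gradient rewards, approximating the generator's true gradient. Experiments show that DPG can effectively control language models, for example by embedding a QR code or a specific pattern, lowering the weight norm, rephrasing in a new language, and producing a specific UUID.
Details
Motivation: To explore the limits of controlling language models via synthetic training data, and to address how to precisely generate synthetic data that optimizes a target model's performance on a specific differentiable metric.
Result: DPG is demonstrated on several qualitative tasks. Using SFT alone, it makes the target model's LM head weights embed a QR code, embed the pattern '67', and have lower ℓ² norm. It also makes the generator rephrase inputs in a new language and produce a specific UUID, even though neither of these objectives is conveyed in the generator's input prompts.
Insight: The novelty lies in combining data attribution with policy gradients: higher-order gradients precisely quantify the effect of synthetic data on the target model's performance, enabling gradient-approximating optimization of the generator and providing a flexible and powerful way to shape model properties using only synthetic training data.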
Abstract: What are the limits of controlling language models via synthetic training data? We develop a reinforcement learning (RL) primitive, the Dataset Policy Gradient (DPG), which can precisely optimize synthetic data generators to produce a dataset of targeted examples. When used for supervised fine-tuning (SFT) of a target model, these examples cause the target model to do well on a differentiable metric of our choice. Our approach achieves this by taking exact data attribution via higher-order gradients and using those scores as policy gradient rewards. We prove that this procedure closely approximates the true, intractable gradient for the synthetic data generator. To illustrate the potential of DPG, we show that, using only SFT on generated examples, we can cause the target model’s LM head weights to (1) embed a QR code, (2) embed the pattern $\texttt{67}$, and (3) have lower $\ell^2$ norm. We additionally show that we can cause the generator to (4) rephrase inputs in a new language and (5) produce a specific UUID, even though neither of these objectives is conveyed in the generator’s input prompts. These findings suggest that DPG is a powerful and flexible technique for shaping model properties using only synthetic training examples.
[22] What do Language Models Learn and When? The Implicit Curriculum Hypothesis cs.CLPDF
Emmy Liu, Kaiser Sun, Millicent Li, Isabelle Lee, Lindia Tjuatja
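TL;DR: This paper proposes the "Implicit Curriculum Hypothesis": large language models acquire skills during pretraining in a predictable, compositional order. Using a suite of simple, composable tasks spanning retrieval, morphological transformations, coreference, logical reasoning, and mathematics, the authors track emergence points across four model families (410M to 13B parameters) and find that the emergence order is highly consistent across models and that composite tasks usually emerge after their component tasks. This structure is also encoded in model representations, so the training trajectories of held-out tasks can be predicted effectively from the representation space alone.
Details
Motivation: Although large language models can perform complex tasks, the fine-grained details of how specific skills emerge in sequence during pretraining remain unclear. Scaling laws on validation loss only say that a model improves with more compute, not which skills it acquires when.
Result: The order in which models reach fixed accuracy thresholds is strikingly consistent across models (Spearman correlation ρ = 0.81 across 45 model pairs). Composite tasks mostly emerge after their component tasks. Using the representation space derived from the task set, the training trajectories of held-out compositional tasks can be predicted effectively throughout pretraining (R² between 0.68 and 0.84 across models).
Insight: The core novelty is the Implicit Curriculum Hypothesis, together with empirical evidence, obtained through a carefully designed benchmark of composable tasks and representation analysis, that skill emergence during pretraining is regular and predictable. Viewed objectively, the methodology (designing simple tasks to track emergence, and using function vector representations to analyze task similarity) offers a reusable framework for understanding a model's internal learning dynamics.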
Abstract: Large language models (LLMs) can perform remarkably complex tasks, yet the fine-grained details of how these capabilities emerge during pretraining remain poorly understood. Scaling laws on validation loss tell us how much a model improves with additional compute, but not what skills it acquires in which order. To remedy this, we propose the Implicit Curriculum Hypothesis: pretraining follows a compositional and predictable curriculum across models and data mixtures. We test this by designing a suite of simple, composable tasks spanning retrieval, morphological transformations, coreference, logical reasoning, and mathematics. Using these tasks, we track emergence points across four model families spanning sizes from 410M-13B parameters. We find that emergence orderings of when models reach fixed accuracy thresholds are strikingly consistent ($ρ= .81$ across 45 model pairs), and that composite tasks most often emerge after their component tasks. Furthermore, we find that this structure is encoded in model representations: tasks with similar function vector representations also tend to follow similar trajectories in training. By using the space of representations derived from our task set, we can effectively predict the training trajectories of simple held-out compositional tasks throughout the course of pretraining ($R^2 = .68$-$.84$ across models) without previously evaluating them. Together, these results suggest that pretraining is more structured than loss curves reveal: skills emerge in a compositional order that is consistent across models and readable from their internals.
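To make the consistency statistic concrete, the sketch below computes a rank correlation between the emergence points of the same tasks in two models. The notes above identify ρ as a Spearman correlation, and that is what is implemented here, with average ranks for ties. The emergence steps are invented toy numbers, and the function names are hypothetical; the paper's exact procedure may differ.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Toy data: training step at which each of five tasks crosses the
# accuracy threshold in two hypothetical models.
model_a = [1000, 3000, 2000, 8000, 5000]
model_b = [1500, 2500, 4000, 9000, 6000]
print(spearman_rho(model_a, model_b))
```

Repeating this over every pair of models and averaging would give a single summary number of the kind reported; 45 pairs is what ten models compared pairwise would produce (10·9/2).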
[23] Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models cs.CL | cs.LGPDF
Feng Luo, Yu-Neng Chuang, Guanchu Wang, Zicheng Xu, Xiaotian Han
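TL;DR: This paper identifies a length inflation problem in on-policy distillation (OPD) training: as training progresses, the student model's rollouts abruptly become longer and repetitive, so truncated trajectories come to dominate the training data, causing training instability and performance degradation. The authors propose StableOPD, a framework that combines a reference-based divergence constraint with rollout mixture distillation to mitigate length inflation and stabilize training.
Details
Motivation: To address length inflation and truncation collapse in OPD training, which arise from training under the student's own induced distribution and lead to training instability and a sharp drop in validation performance.
Result: Across multiple math reasoning datasets, StableOPD prevents truncation collapse, stabilizes training dynamics, and improves performance by 7.2% on average.
Insight: The novelty lies in identifying length inflation as a failure mode of OPD and proposing a divergence constraint and mixture distillation to stabilize training. Viewed objectively, the method balances student exploration against teacher supervision, effectively mitigates repetition-induced bias, and offers a new approach to distillation strategies in reinforcement learning.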
Abstract: On-policy distillation (OPD) trains student models under their own induced distribution while leveraging supervision from stronger teachers. We identify a failure mode of OPD: as training progresses, on-policy rollouts can undergo abrupt length inflation, causing truncated trajectories to dominate the training data. This truncation collapse coincides with abrupt repetition saturation and induces biased gradient signals, leading to severe training instability and sharp degradation in validation performance. We attribute this problem to the interaction between student-induced data collection and the distillation objective, which implicitly favors long and repetitive rollouts. To address this issue, we propose StableOPD, a stabilized OPD framework that combines a reference-based divergence constraint with rollout mixture distillation. These together mitigate repetition-induced length inflation and further stabilize OPD training. Across multiple math reasoning datasets, our approach prevents truncation collapse, stabilizes training dynamics, and improves performance by 7.2% on average.
cs.CV [Back]
[24] FORGE:Fine-grained Multimodal Evaluation for Manufacturing Scenarios cs.CV | cs.AI | cs.LGPDF
Xiangru Jian, Hao Xu, Wei Pang, Xinjian Zhao, Chengyu Tao
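TL;DR: This paper introduces FORGE, a fine-grained multimodal evaluation benchmark for manufacturing scenarios. It builds a high-quality dataset that combines 2D images and 3D point clouds, annotated with fine-grained domain semantics (e.g., exact model numbers). The study evaluates 18 state-of-the-art multimodal large language models on three manufacturing tasks and finds that insufficient domain knowledge, not visual grounding, is the main bottleneck. In addition, supervised fine-tuning on its structured annotations gives a compact model a significant accuracy gain on manufacturing scenarios.
Details
Motivation: Manufacturing is adopting multimodal large language models to move from perception to autonomous execution, but existing evaluations fail to reflect the rigorous demands of real manufacturing environments, and progress is hindered by data scarcity and the lack of fine-grained domain semantics in existing datasets.
Result: 18 SOTA MLLMs were evaluated on three manufacturing tasks (workpiece verification, structural surface inspection, and assembly verification), revealing significant performance gaps. Supervised fine-tuning of a compact 3B-parameter model on the data yields up to 90.8% relative improvement in accuracy on held-out manufacturing scenarios.
Insight: The novelty lies in building the first manufacturing multimodal evaluation benchmark that combines 2D/3D data with fine-grained domain-semantic annotations. The key finding is that the main bottleneck limiting MLLM performance in manufacturing is insufficient domain-specific knowledge rather than visual grounding, which sets a direction for future research. Its structured annotations can serve as an effective training resource, providing a practical pathway toward domain-adapted MLLMs.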
Abstract: The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to transition from simple perception to autonomous execution, yet current evaluations fail to reflect the rigorous demands of real-world manufacturing environments. Progress is hindered by data scarcity and a lack of fine-grained domain semantics in existing datasets. To bridge this gap, we introduce FORGE. We first construct a high-quality multimodal dataset that combines real-world 2D images and 3D point clouds, annotated with fine-grained domain semantics (e.g., exact model numbers). We then evaluate 18 state-of-the-art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps. Counter to conventional understanding, the bottleneck analysis shows that visual grounding is not the primary limiting factor. Instead, insufficient domain-specific knowledge is the key bottleneck, setting a clear direction for future research. Beyond evaluation, we show that our structured annotations can serve as an actionable training resource: supervised fine-tuning of a compact 3B-parameter model on our data yields up to 90.8% relative improvement in accuracy on held-out manufacturing scenarios, providing preliminary evidence for a practical pathway toward domain-adapted manufacturing MLLMs. The code and datasets are available at https://ai4manufacturing.github.io/forge-web.
[25] GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents cs.CV | cs.AI | cs.HCPDF
Mingyu Ouyang, Siyuan Hu, Kevin Qinghong Lin, Hwee Tou Ng, Mike Zheng Shou
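TL;DR: This paper proposes GameWorld, a benchmark for standardized and verifiable evaluation of multimodal large language models as generalist game agents in browser environments. It contains 34 diverse games and 170 tasks, each paired with state-verifiable metrics. The study compares two agent interfaces: computer-use agents that directly emit keyboard and mouse controls, and generalist multimodal agents that act in a semantic action space via Semantic Action Parsing. Experiments show that even the best agent is far from human-level play, and expose challenges in real-time interaction, context-memory sensitivity, and action validity.
Details
Motivation: MLLM agents face high latency, sparse feedback, and irreversible mistakes in real-world interaction, while video games, with rich visual observations and closed-loop interaction, are an ideal testbed for fine-grained perception, long-horizon planning, and precise control. However, heterogeneous action interfaces and heuristic verification hinder systematic evaluation.
Result: Across 18 model-interface pairs, even the best-performing agent is far from human capability on video games. Extensive repeated full-benchmark reruns demonstrate the benchmark's robustness, and further studies on real-time interaction, context-memory sensitivity, and action validity expose more challenges.
Insight: The novelty lies in GameWorld, a standardized, verifiable, and reproducible evaluation framework that defines two agent interfaces (computer-use agents and semantic-action-parsing agents) and state-verifiable metrics, laying a solid foundation for research on multimodal game agents and embodied intelligence more broadly. Using games as a testbed provides a controlled environment for studying the latency, feedback, and error problems of real-world interaction.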
Abstract: Towards an embodied generalist for real-world interaction, Multimodal Large Language Model (MLLM) agents still suffer from challenging latency, sparse feedback, and irreversible mistakes. Video games offer an ideal testbed with rich visual observations and closed-loop interaction, demanding fine-grained perception, long-horizon planning, and precise control. However, systematically evaluating these capabilities is currently hindered by heterogeneous action interfaces and heuristic verification. To this end, we introduce GameWorld, a benchmark designed for standardized and verifiable evaluation of MLLMs as generalist game agents in browser environments. Two game agent interfaces are studied: (i) computer-use agents that directly emit keyboard and mouse controls, and (ii) generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing. GameWorld contains 34 diverse games and 170 tasks, each paired with state-verifiable metrics for outcome-based evaluation. The results across 18 model-interface pairs suggest that even the best performing agent is far from achieving human capabilities on video games. Extensive experiments of repeated full-benchmark reruns demonstrate the robustness of the benchmark, while further studies on real-time interaction, context-memory sensitivity, and action validity expose more challenges ahead for game agents. Together, by offering a standardized, verifiable, and reproducible evaluation framework, GameWorld lays a robust foundation for advancing research on multimodal game agents and beyond. The project page is at https://gameworld-bench.github.io.
[26] HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents cs.CVPDF
Tencent Robotics X, HY Vision Team: Xumin Yu, Zuyan Liu
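TL;DR: This paper presents HY-Embodied-0.5, a family of foundation models designed for real-world embodied agents. It includes an efficient model with 2B activated parameters for edge deployment and a powerful model with 32B activated parameters for complex reasoning. The models use a Mixture-of-Transformers (MoT) architecture to enhance visual perception and an iterative post-training paradigm to improve reasoning. Across 22 benchmarks, the 2B model outperforms similarly sized SOTA models on 16, and the 32B model is comparable to frontier models.
Details
Motivation: To bridge the gap between general vision-language models and the core capabilities that embodied agents need (spatial and temporal visual perception, prediction, interaction, and planning) by designing foundation models specifically for real-world embodied agents.
Result: Across 22 benchmarks spanning visual perception, spatial reasoning, and embodied understanding, the MoT-2B model outperforms similarly sized SOTA models on 16 benchmarks, and the 32B variant achieves performance comparable to frontier models such as Gemini 3.0 Pro. In downstream robot control experiments, a VLA model trained on this VLM foundation achieves compelling results in real-world physical evaluations.
Insight: The novelties include a Mixture-of-Transformers architecture that enables modality-specific computation for fine-grained visual perception, an iterative, self-evolving post-training paradigm that improves reasoning, and on-policy distillation that transfers the large model's capabilities to the small one, maximizing the compact model's performance potential. These methods offer a reusable technical path for building efficient and powerful embodied foundation models.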
Abstract: We introduce HY-Embodied-0.5, a family of foundation models specifically designed for real-world embodied agents. To bridge the gap between general Vision-Language Models (VLMs) and the demands of embodied agents, our models are developed to enhance the core capabilities required by embodied intelligence: spatial and temporal visual perception, alongside advanced embodied reasoning for prediction, interaction, and planning. The HY-Embodied-0.5 suite comprises two primary variants: an efficient model with 2B activated parameters designed for edge deployment, and a powerful model with 32B activated parameters targeted for complex reasoning. To support the fine-grained visual perception essential for embodied tasks, we adopt a Mixture-of-Transformers (MoT) architecture to enable modality-specific computing. By incorporating latent tokens, this design effectively enhances the perceptual representation of the models. To improve reasoning capabilities, we introduce an iterative, self-evolving post-training paradigm. Furthermore, we employ on-policy distillation to transfer the advanced capabilities of the large model to the smaller variant, thereby maximizing the performance potential of the compact model. Extensive evaluations across 22 benchmarks, spanning visual perception, spatial reasoning, and embodied understanding, demonstrate the effectiveness of our approach. Our MoT-2B model outperforms similarly sized state-of-the-art models on 16 benchmarks, while the 32B variant achieves performance comparable to frontier models such as Gemini 3.0 Pro. In downstream robot control experiments, we leverage our robust VLM foundation to train an effective Vision-Language-Action (VLA) model, achieving compelling results in real-world physical evaluations. Code and models are open-sourced at https://github.com/Tencent-Hunyuan/HY-Embodied.
[27] SMFD-UNet: Semantic Face Mask Is The Only Thing You Need To Deblur Faces cs.CV | eess.IVPDF
Abduz Zami
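TL;DR: SMFD-UNet is a lightweight face deblurring framework that uses semantic face masks to drive deblurring without high-quality reference images. It first generates detailed facial component masks (e.g., eyes, nose, mouth), then combines them with the blurry input through multi-stage feature fusion to produce sharp, high-fidelity face images. On the CelebA dataset, SMFD-UNet outperforms existing SOTA models in PSNR and SSIM while preserving satisfactory naturalness measures.
Details
Motivation: Traditional deblurring methods rely on general image priors, struggle to capture face-specific structural and identity features, and often require high-quality reference images. SMFD-UNet aims to address these issues by using semantic masks to guide deblurring directly, improving the accuracy and efficiency of face image restoration.
Result: On CelebA, SMFD-UNet reaches SOTA-level PSNR and SSIM while performing well on naturalness measures such as NIQE, LPIPS, and FID.
Insight: The novelties include using semantic face masks as the sole driver of deblurring, combined with multi-stage feature fusion, a lightweight UNet architecture, and components such as RDC blocks and CBAM attention. Viewed objectively, the structured semantic information effectively improves deblurring performance, and the lightweight design favors practical deployment.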
Abstract: For applications including facial identification, forensic analysis, photographic improvement, and medical imaging diagnostics, facial image deblurring is an essential chore in computer vision allowing the restoration of high-quality images from blurry inputs. Often based on general picture priors, traditional deblurring techniques find it difficult to capture the particular structural and identity-specific features of human faces. We present SMFD-UNet (Semantic Mask Fusion Deblurring UNet), a new lightweight framework using semantic face masks to drive the deblurring process, therefore removing the need for high-quality reference photos in order to solve these difficulties. First, our dual-step method uses a UNet-based semantic mask generator to directly extract detailed facial component masks (e.g., eyes, nose, mouth) straight from blurry photos. Sharp, high-fidelity facial images are subsequently produced by integrating these masks with the blurry input using a multi-stage feature fusion technique within a computationally efficient UNet framework. We created a randomized blurring pipeline that roughly replicates real-world situations by simulating around 1.74 trillion deterioration scenarios, hence guaranteeing resilience. Examined on the CelebA dataset, SMFD-UNet shows better performance than state-of-the-art models, attaining higher Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) while preserving satisfactory naturalness measures, including NIQE, LPIPS, and FID. Powered by Residual Dense Convolution Blocks (RDC), a multi-stage feature fusion strategy, efficient and effective upsampling techniques, attention techniques like CBAM, post-processing techniques, and the lightweight design guarantees scalability and efficiency, enabling SMFD-UNet to be a flexible solution for developing facial image restoration research and useful applications.
[28] Training-free Spatially Grounded Geometric Shape Encoding (Technical Report) cs.CVPDF
Yuhang He
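TL;DR: This paper proposes XShapeEnc, a training-free, general-purpose spatial encoding strategy that encodes arbitrary 2D spatial geometric shapes into a compact representation with five favorable properties, including invertibility, adaptivity, and frequency richness. The method decomposes a shape into its normalized geometry and a pose vector, encodes them with orthogonal Zernike bases, and introduces high-frequency content through frequency propagation.
Details
Motivation: Existing positional encodings mainly target one-dimensional sequential data; extending them to 2D spatial geometric shapes requires accounting for shape geometry, pose, and compatibility with neural network learning at the same time. This paper aims to address these challenges.
Result: Extensive analysis and experiments across a range of shape-aware tasks and the self-curated XShapeCorpus validate XShapeEnc's theoretical validity, efficiency, discriminability, and applicability.
Insight: The novelty is a training-free 2D spatial shape encoding based on orthogonal Zernike bases that encodes shape geometry and pose in a unified way and uses frequency propagation to add high-frequency information, providing a foundational tool for research on 2D spatial intelligence.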
Abstract: Positional encoding has become the de facto standard for grounding deep neural networks on discrete point-wise positions, and it has achieved remarkable success in tasks where the input can be represented as a one-dimensional sequence. However, extending this concept to 2D spatial geometric shapes demands carefully designed encoding strategies that account not only for shape geometry and pose, but also for compatibility with neural network learning. In this work, we address these challenges by introducing a training-free, general-purpose encoding strategy, dubbed XShapeEnc, that encodes an arbitrary spatially grounded 2D geometric shape into a compact representation exhibiting five favorable properties, including invertibility, adaptivity, and frequency richness. Specifically, a 2D spatially grounded geometric shape is decomposed into its normalized geometry within the unit disk and its pose vector, where the pose is further transformed into a harmonic pose field that also lies within the unit disk. A set of orthogonal Zernike bases is constructed to encode shape geometry and pose either independently or jointly, followed by a frequency-propagation operation to introduce high-frequency content into the encoding. We demonstrate the theoretical validity, efficiency, discriminability, and applicability of XShapeEnc via extensive analysis and experiments across a wide range of shape-aware tasks and our self-curated XShapeCorpus. We envision XShapeEnc as a foundational tool for research that goes beyond one-dimensional sequential data toward frontier 2D spatial intelligence.
[29] Mathematical Analysis of Image Matching Techniques cs.CV | math.NAPDF
Oleh Samoilenko
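TL;DR: This paper presents an analytical and experimental evaluation of classical local feature matching algorithms (SIFT and ORB) on satellite imagery. The methods are evaluated through a standard pipeline (keypoint detection, descriptor extraction, matching, and geometric verification via RANSAC-based homography estimation), with the Inlier Ratio as the measure of matching quality. The experiments use a manually constructed dataset of GPS-annotated satellite images with overlapping regions and analyze how the number of extracted keypoints affects the Inlier Ratio.
Details
Motivation: Image matching is a fundamental problem in computer vision with direct applications in robotics, remote sensing, and geospatial data analysis. This paper aims to analyze and evaluate classical local feature matching algorithms in the specific setting of satellite imagery.
Result: SIFT and ORB are evaluated on a manually constructed dataset of GPS-annotated satellite images with overlapping regions, using the Inlier Ratio as the core metric and analyzing the effect of the number of extracted keypoints on it.
Insight: The paper provides a systematic framework for analyzing classical feature matching on satellite imagery. Its novelty lies in a controlled evaluation on a manually constructed, precisely geo-annotated dataset in a specific domain and in quantifying how a key parameter (the number of keypoints) affects matching quality (the Inlier Ratio), which can guide parameter tuning in practice.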
Abstract: Image matching is a fundamental problem in Computer Vision with direct applications in robotics, remote sensing, and geospatial data analysis. We present an analytical and experimental evaluation of classical local feature-based image matching algorithms on satellite imagery, focusing on the Scale-Invariant Feature Transform (SIFT) and the Oriented FAST and Rotated BRIEF (ORB). Each method is evaluated through a common pipeline: keypoint detection, descriptor extraction, descriptor matching, and geometric verification via RANSAC with homography estimation. Matching quality is assessed using the Inlier Ratio - the fraction of correspondences consistent with the estimated homography. The study uses a manually constructed dataset of GPS-annotated satellite image tiles with intentional overlaps. We examine the impact of the number of extracted keypoints on the resulting Inlier Ratio.
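The Inlier Ratio defined in the abstract can be illustrated with a short sketch. It assumes the homography has already been estimated (in practice by a RANSAC routine, for example OpenCV's `cv2.findHomography` with the `cv2.RANSAC` flag) and only counts which correspondences agree with it. The 3-pixel reprojection threshold, the function names, and the toy points are assumptions for illustration; the paper's settings are not given in the abstract.

```python
def project(H, pt):
    """Apply a 3x3 homography (nested lists) to a 2D point."""
    x, y = pt
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def inlier_ratio(H, src_pts, dst_pts, threshold=3.0):
    """Fraction of correspondences whose reprojection error is within threshold."""
    if not src_pts:
        return 0.0
    inliers = 0
    for src, dst in zip(src_pts, dst_pts):
        px, py = project(H, src)
        error = ((px - dst[0]) ** 2 + (py - dst[1]) ** 2) ** 0.5
        if error <= threshold:
            inliers += 1
    return inliers / len(src_pts)

# Toy example: a pure translation by (10, 5); the last match is a gross outlier.
H = [[1.0, 0.0, 10.0], [0.0, 1.0, 5.0], [0.0, 0.0, 1.0]]
src = [(0, 0), (1, 1), (2, 2), (3, 3)]
dst = [(10, 5), (11, 6), (12, 7), (50, 50)]
print(inlier_ratio(H, src, dst))
```

Recomputing this ratio while varying the detector's keypoint budget gives the kind of curve the study examines.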
[30] Event-Level Detection of Surgical Instrument Handovers in Videos with Interpretable Vision Models cs.CVPDF
Katerina Katsarou, George Zountsas, Karam Tomotaki-Dawoud, Alexander Ehrenhoefer, Paul Chojecki
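TL;DR: This paper proposes a spatiotemporal vision framework for detecting instrument handover events and classifying their direction in surgical videos. The model combines a Vision Transformer (ViT) backbone for spatial feature extraction with a unidirectional LSTM for temporal aggregation and jointly predicts handover occurrence and interaction direction through a unified multi-task formulation. On a kidney transplant dataset, it outperforms a single-task variant and a VideoMamba baseline for direction prediction.
Details
Motivation: Reliable monitoring of surgical instrument handovers is essential for procedural efficiency and patient safety in the operating room, but automatic detection in intraoperative video remains challenging because of frequent occlusions, background clutter, and the temporally evolving nature of interaction events.
Result: On a kidney transplant dataset, the method achieves an F1-score of 0.84 for handover detection and a mean F1-score of 0.72 for direction classification, outperforming both a single-task variant and a VideoMamba baseline for direction prediction while maintaining comparable detection performance.
Insight: The novelty is a unified multi-task spatiotemporal framework that jointly models handover events and direction, avoiding the error propagation typical of cascaded pipelines. Layer-CAM attribution visualizations improve interpretability by highlighting hand-instrument interaction cues.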
Abstract: Reliable monitoring of surgical instrument exchanges is essential for maintaining procedural efficiency and patient safety in the operating room. Automatic detection of instrument handovers in intraoperative video remains challenging due to frequent occlusions, background clutter, and the temporally evolving nature of interaction events. We propose a spatiotemporal vision framework for event-level detection and direction classification of surgical instrument handovers in surgical videos. The model combines a Vision Transformer (ViT) backbone for spatial feature extraction with a unidirectional Long Short-Term Memory (LSTM) network for temporal aggregation. A unified multi-task formulation jointly predicts handover occurrence and interaction direction, enabling consistent modeling of transfer dynamics while avoiding error propagation typical of cascaded pipelines. Predicted confidence scores form a temporal signal over the video, from which discrete handover events are identified via peak detection. Experiments on a dataset of kidney transplant procedures demonstrate strong performance, achieving an F1-score of 0.84 for handover detection and a mean F1-score of 0.72 for direction classification, outperforming both a single-task variant and a VideoMamba-based baseline for direction prediction while maintaining comparable detection performance. To improve interpretability, we employ Layer-CAM attribution to visualize spatial regions driving model decisions, highlighting hand-instrument interaction cues.
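The step from per-frame confidence scores to discrete events can be sketched as simple peak picking. The abstract only says that events are identified via peak detection, so the minimum height, the minimum spacing between events, and the greedy strongest-first selection below are assumptions chosen for illustration, as are the function name and the toy scores.

```python
def detect_events(scores, min_height=0.5, min_distance=3):
    """Return indices of local maxima that are high enough and well separated.

    Candidates are taken strongest-first, and a weaker peak is dropped if it
    lies within min_distance frames of a peak that was already kept.
    """
    candidates = [
        i for i in range(1, len(scores) - 1)
        if scores[i] >= min_height
        and scores[i] > scores[i - 1]
        and scores[i] >= scores[i + 1]
    ]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        if all(abs(i - j) >= min_distance for j in kept):
            kept.append(i)
    return sorted(kept)

# Toy confidence signal over ten frames.
confidence = [0.1, 0.2, 0.9, 0.3, 0.6, 0.2, 0.1, 0.7, 0.8, 0.4]
print(detect_events(confidence))
```

In this toy signal the peak at frame 4 is suppressed because it sits two frames from the stronger peak at frame 2; `scipy.signal.find_peaks` offers comparable `height` and `distance` controls for real use.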
[31] Bootstrapping Sign Language Annotations with Sign Language Models cs.CVPDF
Colin Lea, Vasileios Baltatzis, Connor Gillis, Raja Kushalnagar, Lorna Quandt
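TL;DR: This paper proposes a pseudo-annotation pipeline for sign language video. Combining a fingerspelling recognizer, an isolated sign recognizer (ISR), and a K-Shot LLM approach, it takes signed video and English as input and outputs a ranked set of likely annotations, including time intervals, for glosses, fingerspelled words, and sign classifiers, addressing the shortage of high-quality annotated sign language data.
Details
Motivation: AI-driven sign language interpretation is limited by a lack of high-quality annotated data. New datasets such as ASL STEM Wiki and FLEURS-ASL contain hundreds of hours of data but remain only partially annotated and underutilized because annotation at this scale is prohibitively expensive.
Result: The baseline models achieve state-of-the-art results: 6.7% character error rate (CER) on FSBoard and 74% top-1 accuracy on ASL Citizen. In addition, a professional interpreter annotated nearly 500 videos from ASL STEM Wiki with sequence-level gloss labels as a gold-standard benchmark.
Insight: The novelty is a pseudo-annotation pipeline that combines sparse predictions with an LLM to produce large-scale sign language annotations at low cost. Viewed objectively, by establishing effective baseline models and releasing both human and pseudo-annotations, the work provides a scalable solution and benchmark resources for sign language research.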
Abstract: AI-driven sign language interpretation is limited by a lack of high-quality annotated data. New datasets including ASL STEM Wiki and FLEURS-ASL contain professional interpreters and 100s of hours of data but remain only partially annotated and thus underutilized, in part due to the prohibitive costs of annotating at this scale. In this work, we develop a pseudo-annotation pipeline that takes signed video and English as input and outputs a ranked set of likely annotations, including time intervals, for glosses, fingerspelled words, and sign classifiers. Our pipeline uses sparse predictions from our fingerspelling recognizer and isolated sign recognizer (ISR), along with a K-Shot LLM approach, to estimate these annotations. In service of this pipeline, we establish simple yet effective baseline fingerspelling and ISR models, achieving state-of-the-art on FSBoard (6.7% CER) and on ASL Citizen datasets (74% top-1 accuracy). To validate and provide a gold-standard benchmark, a professional interpreter annotated nearly 500 videos from ASL STEM Wiki with sequence-level gloss labels containing glosses, classifiers, and fingerspelling signs. These human annotations and over 300 hours of pseudo-annotations are being released in supplemental material.
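The fingerspelling result above is reported as a character error rate (CER). As a reminder of how that metric is usually computed, here is a small sketch using the standard definition (Levenshtein edit distance divided by reference length); the function names are illustrative, and the paper's exact text normalization is not specified in the abstract.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of hypothesis `hyp` against a non-empty reference `ref`."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Because insertions count as errors, CER can exceed 1.0 when the hypothesis is much longer than the reference.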
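[32] VSAS-BENCH: Real-Time Evaluation of Visual Streaming Assistant Models cs.CV
Pavan Kumar Anasosalu Vasu, Cem Koc, Fartash Faghri, Chun-Liang Li, Bo Feng
TL;DR: This paper presents VSAS-Bench, a new framework and benchmark for evaluating visual streaming assistant models. Targeting the real-time performance of streaming vision-language models (VLMs), it introduces synchronous and asynchronous evaluation protocols along with new metrics such as proactiveness and consistency. A large-scale evaluation of existing models analyzes how key design factors affect performance and finds that conventional VLMs, once adapted, can outperform dedicated streaming models.
Details
Motivation: Existing VLM evaluation frameworks mainly target offline settings, whereas streaming VLM performance also depends on metrics such as response timeliness (proactiveness) and robustness over time (consistency), for which no evaluation standard currently exists.
Result: Large-scale evaluation on VSAS-Bench shows that an adapted conventional VLM (Qwen3-VL-4B) outperforms Dispider, the best streaming VLM on the benchmark, by 3% under the asynchronous protocol.
Insight: The novelty is the first evaluation framework dedicated to streaming visual assistants, with new evaluation dimensions such as proactiveness and consistency and with synchronous and asynchronous protocols. Viewed objectively, the finding that conventional VLMs can be adapted to perform well on streaming tasks without additional training questions the need for dedicated streaming models, and the analysis offers practical guidance on design choices such as buffer length and memory access policy.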
Abstract: Streaming vision-language models (VLMs) continuously generate responses given an instruction prompt and an online stream of input frames. This is a core mechanism for real-time visual assistants. Existing VLM frameworks predominantly assess models in offline settings. In contrast, the performance of a streaming VLM depends on additional metrics beyond pure video understanding, including proactiveness, which reflects the timeliness of the model’s responses, and consistency, which captures the robustness of its responses over time. To address this limitation, we propose VSAS-Bench, a new framework and benchmark for Visual Streaming Assistants. In contrast to prior benchmarks that primarily employ single-turn question answering on video inputs, VSAS-Bench features temporally dense annotations with over 18,000 annotations across diverse input domains and task types. We introduce standardized synchronous and asynchronous evaluation protocols, along with metrics that isolate and measure distinct capabilities of streaming VLMs. Using this framework, we conduct large-scale evaluations of recent video and streaming VLMs, analyzing the accuracy-latency trade-off under key design factors such as memory buffer length, memory access policy, and input resolution, yielding several practical insights. Finally, we show empirically that conventional VLMs can be adapted to streaming settings without additional training, and demonstrate that these adapted models outperform recent streaming VLMs. For example, Qwen3-VL-4B surpasses Dispider, the best streaming VLM on our benchmark, by 3% under the asynchronous protocol. The benchmark and code will be available at https://github.com/apple/ml-vsas-bench.
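[33] Monocular Depth Estimation From the Perspective of Feature Restoration: A Diffusion Enhanced Depth Restoration Approach cs.CV | eess.IV
Huibin Bai, Shuai Li, Hanxiao Zhai, Yanbo Gao, Chong Lv
TL;DR: This paper approaches monocular depth estimation from a feature restoration perspective. It treats pretrained encoder features as degraded features and restores them with an invertible transform-enhanced indirect diffusion module, and it adds an auxiliary viewpoint-based low-level feature enhancement module to improve local detail. The method outperforms existing methods on multiple datasets.
Details
Motivation: Mainstream monocular depth estimation methods use an encoder-decoder architecture with multi-level feature processing, but the limitations of this architecture and the effect of features at different levels on prediction accuracy have not been fully evaluated. The paper finds that the existing framework still has substantial potential if encoder features can be improved, and therefore reformulates depth estimation from the feature restoration perspective.
Result: On the KITTI benchmark, RMSE improves over the baseline by 4.09% and 37.77% under different training settings, and the method outperforms state-of-the-art methods on multiple datasets.
Insight: The novelty is formulating depth estimation as a feature restoration task, designing an invertible transform-enhanced indirect diffusion module to address the feature deviation caused by indirect supervision, and proposing a plug-and-play auxiliary viewpoint-based low-level feature enhancement module that uses extra viewpoint information to improve detail. Viewed objectively, the method offers a new feature-enhancement route for improving encoder-decoder architectures, and its invertible transform design helps stabilize training.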
Abstract: Monocular Depth Estimation (MDE) is a fundamental computer vision task with important applications in 3D vision. The current mainstream MDE methods employ an encoder-decoder architecture with multi-level/scale feature processing. However, the limitations of the current architecture and the effects of different-level features on the prediction accuracy are not evaluated. In this paper, we first investigate the above problem and show that there is still substantial potential in the current framework if encoder features can be improved. Therefore, we propose to formulate the depth estimation problem from the feature restoration perspective, by treating pretrained encoder features as degraded features of an assumed ground truth feature that yields the ground truth depth map. Then an Invertible Transform-enhanced Indirect Diffusion (InvT-IndDiffusion) module is developed for feature restoration. Due to the absence of direct supervision on feature, only indirect supervision from the final sparse depth map is used. During the iterative procedure of diffusion, this results in feature deviations among steps. The proposed InvT-IndDiffusion solves this problem by using an invertible transform-based decoder under the bi-Lipschitz condition. Finally, a plug-and-play Auxiliary Viewpoint-based Low-level Feature Enhancement module (AV-LFE) is developed to enhance local details with auxiliary viewpoint when available. Experiments demonstrate that the proposed method achieves better performance than the state-of-the-art methods on various datasets. Specifically on the KITTI benchmark, compared with the baseline, the performance is improved by 4.09% and 37.77% under different training settings in terms of RMSE. Code is available at https://github.com/whitehb1/IID-RDepth.
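[34] Adaptive Depth-converted-Scale Convolution for Self-supervised Monocular Depth Estimation cs.CV
Yanbo Gao, Huibin Bai, Huasong Zhou, Xingyu Gao, Shuai Li
TL;DR: This paper proposes a self-supervised monocular depth estimation framework enhanced with adaptive Depth-converted-Scale Convolution (DcSConv). It uses the prior relationship between object depth and object scale to extract features at appropriate scales of the convolution receptive field, addressing the scale and depth ambiguity that arises in monocular video as objects change depth.
Details
Motivation: In self-supervised monocular depth estimation, an object's scale changes continuously as its depth changes, which causes scale and depth ambiguity. Existing methods do not handle this change in object scale explicitly.
Result: Extensive experiments with different baselines on the KITTI benchmark give the best results, with SqRel improved by up to 11.6%. Ablation studies validate the effectiveness of each module.
Insight: The novelty is the Depth-converted-Scale Convolution (DcSConv), which shows that the scale of a convolution filter matters no less than its local deformation, together with a Depth-converted-Scale aware Fusion (DcS-F) module that adaptively fuses features. The framework can serve as a plug-and-play module to enhance existing CNN-based methods and improve depth estimation.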
Abstract: Self-supervised monocular depth estimation (MDE) has received increasing interest in the last few years. The objects in the scene, including the object size and relationship among different objects, are the main clues to extract the scene structure. However, previous works lack explicit handling of the changing size of an object due to the change of its depth. Especially in a monocular video, the size of the same object changes continuously, resulting in size and depth ambiguity. To address this problem, we propose a Depth-converted-Scale Convolution (DcSConv) enhanced monocular depth estimation framework, which incorporates the prior relationship between the object depth and object scale to extract features from appropriate scales of the convolution receptive field. The proposed DcSConv focuses on the adaptive scale of the convolution filter instead of the local deformation of its shape. It establishes that the scale of the convolution filter matters no less (or even more in the evaluated task) than its local deformation. Moreover, a Depth-converted-Scale aware Fusion (DcS-F) is developed to adaptively fuse the DcSConv features and the conventional convolution features. Our DcSConv enhanced monocular depth estimation framework can be applied on top of existing CNN-based methods as a plug-and-play module to enhance the conventional convolution block. Extensive experiments with different baselines have been conducted on the KITTI benchmark and our method achieves the best results with an improvement up to 11.6% in terms of SqRel reduction. Ablation study also validates the effectiveness of each proposed module.
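[35] FireSenseNet: A Dual-Branch CNN with Cross-Attentive Feature Interaction for Next-Day Wildfire Spread Prediction cs.CV
Jinzhen Han, JinByeong Lee, Hak Han, YeonJu Na, Jae-Joon Lee
TL;DR: This paper proposes FireSenseNet, a dual-branch convolutional neural network for next-day wildfire spread prediction. A novel Cross-Attentive Feature Interaction Module (CAFIM) explicitly models the spatially varying interaction between static fuel/terrain properties and dynamic meteorological conditions at multiple encoder scales. On the Google Next-Day Wildfire Spread benchmark, the method outperforms a range of architectures, including a SegFormer with more parameters, and the paper also reports ablation studies, feature importance analysis, and uncertainty quantification.
Details
Motivation: Existing deep learning approaches typically concatenate heterogeneous geospatial inputs into a single tensor, ignoring the fundamental physical distinction between static fuel/terrain properties and dynamic meteorological conditions. This paper explicitly models the interaction between the two modalities to predict next-day wildfire spread more accurately, which is critical for disaster response and resource allocation.
Result: On the Google Next-Day Wildfire Spread benchmark, FireSenseNet reaches an F1 of 0.4176 and an AUC-PR of 0.3435, outperforming all alternatives in a comparison of seven architectures, including a SegFormer with 3.8× more parameters (F1 = 0.3502). Ablation studies show that CAFIM gives a 7.1% relative F1 gain over naive concatenation.
Insight: The novelty is a dual-branch CNN architecture with a Cross-Attentive Feature Interaction Module (CAFIM) that learns the spatially varying interaction between geospatial modalities (fuel and weather). The study also shows that the previous-day fire mask dominates prediction, that wind speed may act as noise at the dataset's coarse temporal resolution, and that common evaluation shortcuts substantially inflate F1 scores. These analyses are useful findings for the field.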
Abstract: Accurate prediction of next-day wildfire spread is critical for disaster response and resource allocation. Existing deep learning approaches typically concatenate heterogeneous geospatial inputs into a single tensor, ignoring the fundamental physical distinction between static fuel/terrain properties and dynamic meteorological conditions. We propose FireSenseNet, a dual-branch convolutional neural network equipped with a novel Cross-Attentive Feature Interaction Module (CAFIM) that explicitly models the spatially varying interaction between fuel and weather modalities through learnable attention gates at multiple encoder scales. Through a systematic comparison of seven architectures – spanning pure CNNs, Vision Transformers, and hybrid designs – on the Google Next-Day Wildfire Spread benchmark, we demonstrate that FireSenseNet achieves an F1 of 0.4176 and AUC-PR of 0.3435, outperforming all alternatives including a SegFormer with 3.8× more parameters (F1 = 0.3502). Ablation studies confirm that CAFIM provides a 7.1% relative F1 gain over naive concatenation, and channel-wise feature importance analysis reveals that the previous-day fire mask dominates prediction while wind speed acts as noise at the dataset’s coarse temporal resolution. We further incorporate Monte Carlo Dropout for pixel-level uncertainty quantification and present a critical analysis showing that common evaluation shortcuts inflate reported F1 scores by over 44%.
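[36] Needle in a Haystack – One-Class Representation Learning for Detecting Rare Malignant Cells in Computational Cytology cs.CV | cs.LG
Swarnadip Chatterjee, Vladimir Basic, Arrigo Capitanio, Orcun Goksel, Joakim Lindblad
TL;DR: This paper addresses the detection of rare malignant cells in whole-slide images in computational cytology using one-class representation learning methods (DSVDD and DROC). These methods are trained only on negative patches, need no instance-level annotation, and reach state-of-the-art performance at very low witness rates (≤1%).
Details
Motivation: Malignant cells in computational cytology are morphologically diverse and extremely rare, so conventional weakly supervised methods such as multiple instance learning (MIL) generalize poorly at the instance level, especially when the witness rate is very low.
Result: On a public bone marrow cytomorphology dataset (TCIA) and an in-house oral cancer cytology dataset, DSVDD reaches state-of-the-art performance in instance-level abnormality ranking and, at ultra-low witness rates, in some cases even outperforms fully supervised learning. DROC is also competitive under extreme rarity.
Insight: The novelty is applying one-class representation learning to rare malignant cell detection: by learning a compact representation of normal samples only and detecting deviations from it, the approach avoids the need for instance-level annotation and is more robust and interpretable than MIL at very low witness rates.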
Abstract: In computational cytology, detecting malignancy on whole-slide images is difficult because malignant cells are morphologically diverse yet vanishingly rare amid a vast background of normal cells. Accurate detection of these extremely rare malignant cells remains challenging due to large class imbalance and limited annotations. Conventional weakly supervised approaches, such as multiple instance learning (MIL), often fail to generalize at the instance level, especially when the fraction of malignant cells (witness rate) is exceedingly low. In this study, we explore the use of one-class representation learning techniques for detecting malignant cells in low-witness-rate scenarios. These methods are trained exclusively on slide-negative patches, without requiring any instance-level supervision. Specifically, we evaluate two OCC approaches, DSVDD and DROC, and compare them with FS-SIL, WS-SIL, and the recent ItS2CLR method. The one-class methods learn compact representations of normality and detect deviations at test time. Experiments on a publicly available bone marrow cytomorphology dataset (TCIA) and an in-house oral cancer cytology dataset show that DSVDD achieves state-of-the-art performance in instance-level abnormality ranking, particularly in ultra-low witness-rate regimes ($\leq 1\%$), in some cases even outperforming fully supervised learning, which is typically not a practical option in whole-slide cytology due to the infeasibility of exhaustive instance-level annotations. DROC is also competitive under extreme rarity, benefiting from distribution-augmented contrastive learning. These findings highlight one-class representation learning as a robust and interpretable alternative that is superior to MIL for malignant cell detection under extreme rarity.
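The one-class idea in the abstract above (learn a compact representation of normality, then flag deviations) can be sketched with the usual Deep SVDD scoring rule, where the anomaly score is the squared distance of an embedding from a center fitted on normal data. The sketch below assumes patch embeddings have already been produced by some encoder; the helper names are hypothetical, and the encoder training that Deep SVDD performs is not shown.

```python
from typing import List, Sequence


def fit_center(normal_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Center of normality: per-dimension mean of embeddings from negative (normal) patches."""
    n = len(normal_embeddings)
    dim = len(normal_embeddings[0])
    return [sum(e[d] for e in normal_embeddings) / n for d in range(dim)]


def anomaly_score(embedding: Sequence[float], center: Sequence[float]) -> float:
    """Squared Euclidean distance to the center; larger means more abnormal."""
    return sum((x - c) ** 2 for x, c in zip(embedding, center))


def rank_patches(embeddings: Sequence[Sequence[float]], center: Sequence[float]) -> List[int]:
    """Patch indices ordered from most to least abnormal (instance-level ranking)."""
    scores = [anomaly_score(e, center) for e in embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


center = fit_center([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
ranking = rank_patches([[1.0, 1.0], [4.0, 5.0], [2.0, 1.0]], center)
```

Only normal patches are needed to fit the center, which is why this family of methods does not depend on instance-level labels for malignant cells.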
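[37] Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation cs.CV
Jiahao Li, Yang Lu, Yachao Zhang, Fangyong Wang, Yuan Xie
TL;DR: This paper proposes a training-free open-vocabulary semantic segmentation method that derives an analytic solution for the segmentation map directly, avoiding the time-consuming logits optimization of conventional methods. It reaches state-of-the-art performance on eight benchmark datasets.
Details
Motivation: Existing open-vocabulary semantic segmentation methods depend on time-consuming iterative training or on model-specific attention modulation. The paper aims to generate segmentation maps more directly.
Result: The method reaches state-of-the-art performance on eight benchmark datasets without iterative training or model-specific attention modulation.
Insight: The novelty is the hypothesis that the distribution discrepancy encodes semantic information, and the direct use of its analytic solution as the semantic map. This turns an optimization problem into an analytic one and avoids the conventional training process.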
Abstract: Open-vocabulary semantic segmentation (OVSS) aims to segment arbitrary category regions in images using open-vocabulary prompts, necessitating that existing methods possess pixel-level vision-language alignment capability. Typically, this capability involves computing the cosine similarity, i.e., logits, between visual and linguistic features, and minimizing the distribution discrepancy between the logits and the ground truth (GT) to generate optimal logits that are subsequently used to construct segmentation maps, yet it depends on time-consuming iterative training or model-specific attention modulation. In this work, we propose a more direct approach that eschews the logits-optimization process by directly deriving an analytic solution for the segmentation map. We posit a key hypothesis: the distribution discrepancy encodes semantic information; specifically, this discrepancy exhibits consistency across patches belonging to the same category but inconsistency across different categories. Based on this hypothesis, we directly utilize the analytic solution of this distribution discrepancy as the semantic maps. In other words, we reformulate the optimization of the distribution discrepancy as deriving its analytic solution, thereby eliminating time-consuming iterative training, freeing us from model-specific attention modulation, and achieving state-of-the-art performance on eight benchmark datasets.
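For context, the conventional starting point described in the abstract above (cosine-similarity logits between visual patch features and text features) can be sketched as follows. This shows only the baseline logits and a per-patch argmax, not the paper's analytic solution, whose form the abstract does not give; the feature vectors and function names are illustrative assumptions.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def patch_logits(patch_feats: Sequence[Sequence[float]],
                 text_feats: Sequence[Sequence[float]]) -> List[List[float]]:
    """Logits matrix of shape [num_patches][num_categories]."""
    return [[cosine(p, t) for t in text_feats] for p in patch_feats]


def assign_labels(logits: Sequence[Sequence[float]]) -> List[int]:
    """Pick the highest-scoring category for each patch."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


patches = [[1.0, 0.0], [0.0, 2.0], [1.0, 3.0]]   # toy visual patch features
texts = [[1.0, 0.0], [0.0, 1.0]]                 # toy text features, one per category
logits = patch_logits(patches, texts)
labels = assign_labels(logits)
```

Reshaping `labels` back to the patch grid gives a coarse segmentation map; the methods the abstract contrasts itself with then refine these logits by training or attention modulation.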
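[38] Beyond Pedestrians: Caption-Guided CLIP Framework for High-Difficulty Video-based Person Re-Identification cs.CV | cs.AI
Shogo Hamano, Shunya Wakasugi, Tatsuhito Sato, Sayaka Nakamura
TL;DR: This paper proposes CG-CLIP, a novel framework for high-difficulty video-based person re-identification. It uses textual descriptions generated by multimodal large language models together with learnable tokens. Two core components, caption-guided memory refinement and token-based feature extraction, capture fine-grained identity features and aggregate spatiotemporal information efficiently.
Details
Motivation: Current video-based person re-identification methods perform poorly in high-difficulty scenarios such as sports and dance, where multiple people wear similar clothing and move dynamically, which makes matching difficult.
Result: The method is evaluated on two standard datasets (MARS and iLIDS-VID) and two newly constructed high-difficulty datasets (SportsVReID and DanceVReID). It outperforms current state-of-the-art methods, with significant improvements across all benchmarks.
Insight: The main novelty is introducing textual descriptions (captions) as explicit guidance for visual feature learning and designing learnable tokens with a cross-attention mechanism to process video sequences efficiently. This offers a new way to use multimodal information for fine-grained visual recognition.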
Abstract: In recent years, video-based person Re-Identification (ReID) has gained attention for its ability to leverage spatiotemporal cues to match individuals across non-overlapping cameras. However, current methods struggle with high-difficulty scenarios, such as sports and dance performances, where multiple individuals wear similar clothing while performing dynamic movements. To overcome these challenges, we propose CG-CLIP, a novel caption-guided CLIP framework that leverages explicit textual descriptions and learnable tokens. Our method introduces two key components: Caption-guided Memory Refinement (CMR) and Token-based Feature Extraction (TFE). CMR utilizes captions generated by Multi-modal Large Language Models (MLLMs) to refine identity-specific features, capturing fine-grained details. TFE employs a cross-attention mechanism with fixed-length learnable tokens to efficiently aggregate spatiotemporal features, reducing computational overhead. We evaluate our approach on two standard datasets (MARS and iLIDS-VID) and two newly constructed high-difficulty datasets (SportsVReID and DanceVReID). Experimental results demonstrate that our method outperforms current state-of-the-art approaches, achieving significant improvements across all benchmarks.
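[39] MSCT: Differential Cross-Modal Attention for Deepfake Detection cs.CV | cs.MM
Fangda Wei, Miao Liu, Yingxue Wang, Jing Wang, Shenghui Zhao
TL;DR: This paper proposes a multi-scale cross-modal transformer encoder (MSCT) for audio-visual deepfake detection. It uses multi-scale self-attention to integrate the features of adjacent embeddings and differential cross-modal attention to fuse multimodal features, addressing the insufficient feature extraction and modal alignment deviation of traditional multimodal forgery detection methods.
Details
Motivation: Traditional audio-visual deepfake detection methods mainly extract forgery traces through audio-visual alignment, but they suffer from insufficient feature extraction and modal alignment deviation, so the multimodal feature fusion mechanism needs improvement.
Result: Experiments on the FakeAVCeleb dataset show competitive performance, validating the effectiveness of the proposed structure.
Insight: The novelty is the introduction of multi-scale self-attention and differential cross-modal attention, which strengthen feature integration and cross-modal fusion and give multimodal forgery detection a finer-grained mechanism for modality interaction.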
Abstract: Audio-visual deepfake detection typically employs a complementary multi-modal model to check the forgery traces in the video. These methods primarily extract forgery traces through audio-visual alignment, which results from the inconsistency between audio and video modalities. However, the traditional multi-modal forgery detection method has the problem of insufficient feature extraction and modal alignment deviation. To address this, we propose a multi-scale cross-modal transformer encoder (MSCT) for deepfake detection. Our approach includes a multi-scale self-attention to integrate the features of adjacent embeddings and a differential cross-modal attention to fuse multi-modal features. Our experiments demonstrate competitive performance on the FakeAVCeleb dataset, validating the effectiveness of the proposed structure.
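[40] Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding cs.CV | cs.CL | cs.LG
Xiangyue Liu, Zijian Zhang, Miles Yang, Zhao Zhong, Liefeng Bo
TL;DR: This paper proposes Symbiotic-MoE, a unified pre-training framework that addresses gradient conflicts between image generation and understanding tasks in large multimodal models. Built on a native multimodal Mixture-of-Experts architecture, it uses modality-aware expert disentanglement and a progressive training strategy to promote synergy between generation and understanding without parameter overhead.
Details
Motivation: Existing approaches such as Mixture-of-Transformers use structural isolation to mitigate the catastrophic forgetting of understanding tasks caused by generation tasks, but this severs cross-modal synergy and fragments capacity. This paper aims to resolve the task interference while preserving and strengthening cross-modal synergy.
Result: Extensive experiments show that Symbiotic-MoE achieves rapid generative convergence while unlocking cross-modal synergy, markedly improving the model's inherent understanding, with notable gains on the MMLU and OCRBench benchmarks.
Insight: The novelty is twofold: (1) modality-aware expert disentanglement within a native MoE architecture, which addresses routing collapse by partitioning experts into task-specific groups and using shared experts as a multimodal semantic bridge; (2) a progressive training strategy (differential learning rates and early-stage gradient shielding) that turns generative signals into constructive feedback for understanding. This offers a new approach to unifying generation and understanding in multimodal models.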
Abstract: Empowering Large Multimodal Models (LMMs) with image generation often leads to catastrophic forgetting in understanding tasks due to severe gradient conflicts. While existing paradigms like Mixture-of-Transformers (MoT) mitigate this conflict through structural isolation, they fundamentally sever cross-modal synergy and suffer from capacity fragmentation. In this work, we present Symbiotic-MoE, a unified pre-training framework that resolves task interference within a native multimodal Mixture-of-Experts (MoE) Transformers architecture with zero-parameter overhead. We first identify that standard MoE tuning leads to routing collapse, where generative gradients dominate expert utilization. To address this, we introduce Modality-Aware Expert Disentanglement, which partitions experts into task-specific groups while utilizing shared experts as a multimodal semantic bridge. Crucially, this design allows shared experts to absorb fine-grained visual semantics from generative tasks to enrich textual representations. To optimize this, we propose a Progressive Training Strategy featuring differential learning rates and early-stage gradient shielding. This mechanism not only shields pre-trained knowledge from early volatility but eventually transforms generative signals into constructive feedback for understanding. Extensive experiments demonstrate that Symbiotic-MoE achieves rapid generative convergence while unlocking cross-modal synergy, boosting inherent understanding with remarkable gains on MMLU and OCRBench.
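[41] DailyArt: Discovering Articulation from Single Static Images via Latent Dynamics cs.CV | cs.AI
Hang Zhang, Qijian Tian, Jingyu Gong, Daoguo Dong, Xuhong Wang
TL;DR: DailyArt is a new method for inferring the joint parameters of articulated objects from a single static image. It first synthesizes a maximally opened state of the object to expose articulation cues, then estimates joint parameters from the discrepancy between the observed and synthesized states. The method needs no multi-state observations, object-specific templates, or multi-view inputs, and it supports part-level novel state synthesis conditioned on joints.
Details
Motivation: Inferring the kinematics of an articulated object from a single closed-state image is challenging because key motion cues are usually occluded. Existing methods require multi-state observations or rely on auxiliary inputs such as explicit part priors or retrieval.
Result: Extensive experiments show that DailyArt achieves strong performance in articulated joint estimation and supports part-level novel state synthesis conditioned on joints.
Insight: The paper recasts joint estimation as a synthesis-mediated reasoning problem: it first synthesizes an opened state to expose articulation cues rather than regressing joints directly from a heavily occluded observation. A set-prediction formulation recovers all joints simultaneously without object-specific templates or explicit part annotations.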
Abstract: Articulated objects are essential for embodied AI and world models, yet inferring their kinematics from a single closed-state image remains challenging because crucial motion cues are often occluded. Existing methods either require multi-state observations or rely on explicit part priors, retrieval, or other auxiliary inputs that partially expose the structure to be inferred. In this work, we present DailyArt, which formulates articulated joint estimation from a single static image as a synthesis-mediated reasoning problem. Instead of directly regressing joints from a heavily occluded observation, DailyArt first synthesizes a maximally articulated opened state under the same camera view to expose articulation cues, and then estimates the full set of joint parameters from the discrepancy between the observed and synthesized states. Using a set-prediction formulation, DailyArt recovers all joints simultaneously without requiring object-specific templates, multi-view inputs, or explicit part annotations at test time. Taking estimated joints as conditions, the framework further supports part-level novel state synthesis as a downstream capability. Extensive experiments show that DailyArt achieves strong performance in articulated joint estimation and supports part-level novel state synthesis conditioned on joints. Project page is available at https://rangooo123.github.io/DaliyArt.github.io/.
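[42] Beyond Surface Artifacts: Capturing Shared Latent Forgery Knowledge Across Modalities cs.CV | cs.AI
Jingtong Dou, Chuancheng Shi, Jian Wang, Fei Shen, Zhiyong Wang
TL;DR: This paper proposes a modality-agnostic forgery (MAF) detection framework to address the poor generalization of existing deepfake detectors to unknown or unseen modalities. By decoupling modality-specific styles and extracting latent forgery knowledge shared across modalities, the framework improves generalization along two dimensions, Weak MAF and Strong MAF, and is validated on the newly constructed DeepModal-Bench benchmark.
Details
Motivation: As generative AI advances, deepfake attacks have evolved from single-modality manipulation into complex multimodal threats. Existing forensic techniques rely too heavily on superficial, modality-specific artifacts and neglect the shared latent forgery knowledge hidden beneath variable physical appearances, so their performance drops sharply on unseen "dark modalities".
Result: On DeepModal-Bench, the MAF framework extracts forgery knowledge shared across modalities and achieves significant performance gains on unknown modalities, providing evidence that universal forgery traces exist and offering a technical pathway for universal multimodal defense.
Insight: The novelty is shifting multimodal forensics from the conventional "feature fusion" paradigm to "modality generalization", proposing the first modality-agnostic forgery detection framework, and extracting essential cross-modal forgery knowledge by decoupling modality-specific styles. This provides new generalization evaluation dimensions and a technical approach for handling attacks in unknown modalities.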
Abstract: As generative artificial intelligence evolves, deepfake attacks have escalated from single-modality manipulations to complex, multimodal threats. Existing forensic techniques face a severe generalization bottleneck: by relying excessively on superficial, modality-specific artifacts, they neglect the shared latent forgery knowledge hidden beneath variable physical appearances. Consequently, these models suffer catastrophic performance degradation when confronted with unseen “dark modalities.” To break this limitation, this paper introduces a paradigm shift that redefines multimodal forensics from conventional “feature fusion” to “modality generalization.” We propose the first modality-agnostic forgery (MAF) detection framework. By explicitly decoupling modality-specific styles, MAF precisely extracts the essential, cross-modal latent forgery knowledge. Furthermore, we define two progressive dimensions to quantify model generalization: transferability toward semantically correlated modalities (Weak MAF), and robustness against completely isolated signals of “dark modality” (Strong MAF). To rigorously assess these generalization limits, we introduce the DeepModal-Bench benchmark, which integrates diverse multimodal forgery detection algorithms and adapts state-of-the-art generalized learning methods. This study not only empirically proves the existence of universal forgery traces but also achieves significant performance breakthroughs on unknown modalities via the MAF framework, offering a pioneering technical pathway for universal multimodal defense.
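[43] RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs cs.CV
Liang Yao, Shengxiang Xu, Fan Liu, Chuanyi Zhang, Bishun Yao
TL;DR: This paper proposes RemoteAgent, a reinforcement-learning-based agentic multimodal large language model (MLLM) framework that bridges the gap in Earth Observation (EO) between users' vague natural-language intents and visual analysis tasks of different granularity, from holistic image interpretation to fine-grained pixel-wise prediction. Reinforcement fine-tuning on the newly constructed VagueEO dataset turns the MLLM into a cognitive core that directly handles image-level and sparse region-level tasks, while specialized external tools are orchestrated via the Model Context Protocol for dense prediction tasks. This improves task performance while keeping computation efficient.
Details
Motivation: In EO systems, domain experts usually express requirements in vague natural language rather than precise machine instructions. The text-based output format of existing MLLMs is ill-suited to dense, precision-critical spatial prediction, and the indiscriminate tool invocation of existing agentic frameworks is computationally inefficient and underuses the MLLM's native capabilities.
Result: Extensive experiments show that RemoteAgent maintains robust intent recognition while delivering highly competitive performance across diverse EO tasks.
Insight: The novelty is an agentic framework that respects the intrinsic capability boundaries of MLLMs: it aligns the model with user intent through reinforcement fine-tuning on the human-centric vague-instruction dataset VagueEO, and it applies a strategic task allocation mechanism (suitable tasks are handled internally, and external tools are orchestrated only for dense prediction), balancing computational efficiency and task performance. Viewed objectively, combining reinforcement learning with an agentic framework to handle vague queries and visual tasks of different granularity is a useful reference for applying multimodal AI in specialized domains.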
Abstract: Earth Observation (EO) systems are essentially designed to support domain experts who often express their requirements through vague natural language rather than precise, machine-friendly instructions. Depending on the specific application scenario, these vague queries can demand vastly different levels of visual precision. Consequently, a practical EO AI system must bridge the gap between ambiguous human queries and the appropriate multi-granularity visual analysis tasks, ranging from holistic image interpretation to fine-grained pixel-wise predictions. While Multi-modal Large Language Models (MLLMs) demonstrate strong semantic understanding, their text-based output format is inherently ill-suited for dense, precision-critical spatial predictions. Existing agentic frameworks address this limitation by delegating tasks to external tools, but indiscriminate tool invocation is computationally inefficient and underutilizes the MLLM’s native capabilities. To this end, we propose RemoteAgent, an agentic framework that strategically respects the intrinsic capability boundaries of MLLMs. To empower this framework to understand real user intents, we construct VagueEO, a human-centric instruction dataset pairing EO tasks with simulated vague natural-language queries. By leveraging VagueEO for reinforcement fine-tuning, we align an MLLM into a robust cognitive core that directly resolves image- and sparse region-level tasks. Consequently, RemoteAgent processes suitable tasks internally while intelligently orchestrating specialized tools via the Model Context Protocol exclusively for dense predictions. Extensive experiments demonstrate that RemoteAgent achieves robust intent recognition capabilities while delivering highly competitive performance across diverse EO tasks.
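[44] ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions cs.CV
Zihao Liu, Xiaoyu Wu, Wenna Li, Jianqin Wu, Linlin Yang
TL;DR: This paper proposes ESOM, an efficient, training-free streaming model for open-world video anomaly detection. With modules for definition normalization, inter-frame-matched intra-frame token merging, hybrid streaming memory, and probabilistic scoring, it supports real-time detection and adapts to dynamic anomaly definitions. The paper also builds the OpenDef-Bench benchmark for evaluation.
Details
Motivation: Existing MLLM-based open-world video anomaly detection methods are inefficient to deploy, are not adapted to streaming processing, and offer limited support for dynamic anomaly definitions.
Result: On OpenDef-Bench, ESOM runs in real time on a single GPU and reaches state-of-the-art performance in anomaly temporal localization, classification, and description generation.
Insight: The novelty includes a training-free streaming architecture, a definition normalization module that reduces hallucination, visual token compression that improves efficiency, and an evaluation benchmark that supports dynamic anomaly definitions, together providing an efficient solution for practical deployment.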
Abstract: Open-world video anomaly detection (OWVAD) aims to detect and explain abnormal events under different anomaly definitions, which is important for applications such as intelligent surveillance and live-streaming content moderation. Recent MLLM-based methods have shown promising open-world generalization, but still suffer from three major limitations: inefficiency for practical deployment, lack of streaming processing adaptation, and limited support for dynamic anomaly definitions in both modeling and evaluation. To address these issues, this paper proposes ESOM, an efficient streaming OWVAD model that operates in a training-free manner. ESOM includes a Definition Normalization module to structure user prompts for reducing hallucination, an Inter-frame-matched Intra-frame Token Merging module to compress redundant visual tokens, a Hybrid Streaming Memory module for efficient causal inference, and a Probabilistic Scoring module that converts interval-level textual outputs into frame-level anomaly scores. In addition, this paper introduces OpenDef-Bench, a new benchmark with clean surveillance videos and diverse natural anomaly definitions for evaluating performance under varying conditions. Extensive experiments show that ESOM achieves real-time efficiency on a single GPU and state-of-the-art performance in anomaly temporal localization, classification, and description generation. The code and benchmark will be released at https://github.com/Kamino666/ESOM_OpenDef-Bench.
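The abstract above mentions a Probabilistic Scoring module that converts interval-level outputs into frame-level anomaly scores, without giving the rule. One simple way such a conversion could work is sketched below, assuming each interval has already been parsed into `(start_frame, end_frame, probability)`; taking the maximum over overlapping intervals is an assumption made for illustration, not the authors' method.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int, float]  # (start_frame, end_frame inclusive, anomaly probability)


def intervals_to_frame_scores(intervals: Sequence[Interval], num_frames: int) -> List[float]:
    """Spread interval-level probabilities onto frames; overlapping intervals keep the max."""
    scores = [0.0] * num_frames
    for start, end, prob in intervals:
        lo = max(0, start)                # clip intervals to the valid frame range
        hi = min(num_frames - 1, end)
        for t in range(lo, hi + 1):
            scores[t] = max(scores[t], prob)
    return scores


frame_scores = intervals_to_frame_scores([(1, 3, 0.8), (3, 5, 0.4)], num_frames=6)
```

Frame-level scores in this form can be compared directly against frame-level ground truth, which is how temporal localization is commonly evaluated.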
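[45] Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video cs.CV | cs.LG
Chanhyuk Choi, Taesoo Kim, Donggyu Lee, Siyeol Jung, Taehwan Kim
TL;DR: This paper proposes Cross-Modal Emotion Transfer (C-MET), a new method for emotion editing in talking face video. By modeling emotion semantic vectors between the speech and visual feature spaces, it generates facial expressions from speech and addresses the limits of existing methods in expressive flexibility and extended emotion generation.
Details
Motivation: Existing emotion editing methods for talking face video are limited. Label-based methods use discrete categories and cannot capture a wide range of emotions. Audio-based methods struggle to express the target emotion accurately because emotion and linguistic content are entangled. Image-based methods depend on high-quality frontal reference images, and reference data for extended emotions (such as sarcasm) is hard to obtain.
Result: Extensive experiments on the MEAD and CREMA-D datasets show that the method improves emotion accuracy by 14% over state-of-the-art methods and generates expressive talking face videos, even for unseen extended emotions.
Insight: The novelty is the cross-modal emotion semantic vector: a large-scale pretrained audio encoder and a disentangled facial expression encoder are used to learn a cross-modal representation of the difference between emotions, enabling direct, flexible, high-quality emotion transfer from speech to facial expression without discrete labels or specific reference images.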
Abstract: Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio-based methods can leverage emotionally rich speech signals, and even benefit from expressive text-to-speech (TTS) synthesis, but they fail to express the target emotions because emotions and linguistic content are entangled in emotional speech. Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions based on speech by modeling emotion semantic vectors between speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two different emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods, while generating expressive talking face videos, even for unseen extended emotions. Code, checkpoint, and demo are available at https://chanhyeok-choi.github.io/C-MET/
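[46] Latent Anomaly Knowledge Excavation: Unveiling Sparse Sensitive Neurons in Vision-Language Models cs.CV | cs.AI
Shaotian Li, Shangze Li, Chuancheng Shi, Wenhua Wu, Yanqiu Wu
TL;DR: This paper proposes LAKE, a training-free framework for excavating the latent anomaly detection knowledge inherent in vision-language models (VLMs). The authors argue that anomaly knowledge is already embedded in pre-trained models but remains latent and under-activated, concentrated in a sparse subset of anomaly-sensitive neurons. Using only a small set of normal samples, LAKE identifies and elicits these critical neuronal signals to build a highly compact normality representation that integrates visual structural deviations with cross-modal semantic activations. On industrial anomaly detection benchmarks, LAKE reaches state-of-the-art performance and provides intrinsic, neuron-level interpretability.
Details
Motivation: Current methods mostly treat VLMs as black-box feature extractors and assume that anomaly-specific knowledge must be acquired through external adapters or memory banks. This paper challenges that assumption, arguing that anomaly knowledge is intrinsically embedded in pre-trained models but latent and under-activated. It aims to reveal and activate this intrinsic knowledge in order to understand and improve VLM anomaly detection performance.
Result: Extensive experiments on industrial anomaly detection benchmarks show that LAKE reaches state-of-the-art (SOTA) performance.
Insight: The claimed novelty is challenging the paradigm that treats VLMs as black boxes dependent on externally acquired knowledge, hypothesizing that anomaly knowledge resides in sparse sensitive neurons of the pre-trained model, and developing LAKE to excavate and activate that knowledge without training. Viewed objectively, the core contribution is redefining anomaly detection as the targeted activation of latent knowledge in pre-trained models rather than the acquisition of a downstream task, which offers a new perspective for understanding the internal mechanisms of large models and for efficient, interpretable anomaly detection.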
Abstract: Large-scale vision-language models (VLMs) exhibit remarkable zero-shot capabilities, yet the internal mechanisms driving their anomaly detection (AD) performance remain poorly understood. Current methods predominantly treat VLMs as black-box feature extractors, assuming that anomaly-specific knowledge must be acquired through external adapters or memory banks. In this paper, we challenge this assumption by arguing that anomaly knowledge is intrinsically embedded within pre-trained models but remains latent and under-activated. We hypothesize that this knowledge is concentrated within a sparse subset of anomaly-sensitive neurons. To validate this, we propose latent anomaly knowledge excavation (LAKE), a training-free framework that identifies and elicits these critical neuronal signals using only a minimal set of normal samples. By isolating these sensitive neurons, LAKE constructs a highly compact normality representation that integrates visual structural deviations with cross-modal semantic activations. Extensive experiments on industrial AD benchmarks demonstrate that LAKE achieves state-of-the-art performance while providing intrinsic, neuron-level interpretability. Ultimately, our work advocates for a paradigm shift: redefining anomaly detection as the targeted activation of latent pre-trained knowledge rather than the acquisition of a downstream task.
[47] HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models cs.CVPDF
Qihui Zhu, Tao Zhang, Yuchen Wang, Zijian Wen, Mengjie Zhang
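TL;DR: This paper proposes HAWK, a training-free visual token pruning method for multimodal large language models (MLLMs). It accounts for the varying importance of attention heads in visual tasks and combines this with text-guided attention to assess the importance of visual tokens, so that redundant tokens are pruned while crucial information is retained as far as possible.
Details
Motivation: The surge in the number of visual tokens in MLLMs significantly increases inference time and computational overhead, making the models hard to use in real-time or resource-constrained scenarios. Existing visual token pruning methods usually assume that all attention heads contribute equally to visual interpretation, but the authors find that different heads may capture distinct visual semantics and play distinct roles.
Result: Extensive experiments on multiple mainstream vision-language benchmarks show that HAWK achieves state-of-the-art accuracy. When applied to Qwen2.5-VL, it retains 96.0% of the original accuracy after pruning 80.2% of the visual tokens, reduces end-to-end latency to 74.4% of the original, and lowers GPU memory usage.
Insight: The core innovation is revealing the heterogeneity of attention heads in visual processing and, based on it, proposing a head importance-aware pruning strategy. The method requires no training, can be applied to various MLLMs in a plug-and-play manner, and delivers notable inference speedups and memory savings while maintaining high accuracy.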
Abstract: In multimodal large language models (MLLMs), the surge of visual tokens significantly increases the inference time and computational overhead, making them impractical for real-time or resource-constrained applications. Visual token pruning is a promising strategy for reducing the cost of MLLM inference by removing redundant visual tokens. Existing research usually assumes that all attention heads contribute equally to the visual interpretation. However, our study reveals that different heads may capture distinct visual semantics and inherently play distinct roles in visual processing. In light of this observation, we propose HAWK, a head importance-aware visual token pruning method that perceives the varying importance of attention heads in visual tasks to maximize the retention of crucial tokens. By leveraging head importance weights and text-guided attention to assess visual token significance, HAWK effectively retains task-relevant visual tokens while removing redundant ones. The proposed HAWK is entirely training-free and can be seamlessly applied to various MLLMs. Extensive experiments on multiple mainstream vision-language benchmarks demonstrate that HAWK achieves state-of-the-art accuracy. When applied to Qwen2.5-VL, HAWK retains 96.0% of the original accuracy after pruning 80.2% of the visual tokens. Additionally, it reduces end-to-end latency to 74.4% of the original and further decreases GPU memory usage across the tested models. The code is available at https://github.com/peppery77/HAWK.git.
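The scoring idea described in the abstract above (head importance weights combined with text-guided attention, followed by keeping the highest-scoring visual tokens) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name prune_visual_tokens, the layout of the attn array, and the assumption that head weights are supplied from outside are all hypothetical, since the abstract does not say how HAWK derives the weights.

```python
def prune_visual_tokens(attn, head_weights, keep_ratio):
    """Score visual tokens by head-weighted text-to-visual attention.

    attn[h][t][v] is the attention from text token t to visual token v
    in head h (hypothetical layout). head_weights[h] is an externally
    supplied importance weight for head h (an assumption of this sketch).
    Returns the sorted indices of the kept tokens and the raw scores.
    """
    num_visual = len(attn[0][0])
    scores = [0.0] * num_visual
    for head, weight in zip(attn, head_weights):
        for row in head:  # one row per text token
            for v, a in enumerate(row):
                # Average over text tokens, weight by head importance.
                scores[v] += weight * a / len(head)
    keep = max(1, round(keep_ratio * num_visual))
    ranked = sorted(range(num_visual), key=lambda v: scores[v], reverse=True)
    return sorted(ranked[:keep]), scores


# Two heads, one text token, four visual tokens.
attn = [
    [[0.7, 0.1, 0.1, 0.1]],  # head 0 attends mostly to token 0
    [[0.1, 0.1, 0.1, 0.7]],  # head 1 attends mostly to token 3
]
kept, scores = prune_visual_tokens(attn, head_weights=[0.9, 0.1], keep_ratio=0.5)
```

With uniform head weights the two heads would contribute equally, which corresponds to the equal-contribution assumption the abstract argues against; unequal weights let the more important head dominate the ranking.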
[48] AgriChain Visually Grounded Expert Verified Reasoning for Interpretable Agricultural Vision Language Models cs.CVPDF
Hazza Mahmood, Yongqiang Yu, Rao Anwer
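TL;DR: This paper introduces the AgriChain dataset and AgriChain-VL3B, a model fine-tuned on it, to address the accuracy and interpretability of plant disease diagnosis in agriculture. The dataset contains approximately 11,000 expert-annotated leaf images, each paired with a disease label, a calibrated confidence score, and an expert-verified chain-of-thought explanation. The fine-tuned model outperforms several strong baselines on the test set and generates explanations that align closely with expert reasoning.
Details
Motivation: To address the insufficient accuracy and lack of interpretability of vision-language models for plant disease diagnosis in real-world agriculture, and to bridge the gap between generic multimodal models and human expertise.
Result: On a 1,000-image test set, the CoT-supervised AgriChain-VL3B model achieves 73.1% top-1 accuracy (macro F1 = 0.466; weighted F1 = 0.655), outperforming several strong baselines including Gemini 1.5 Flash, Gemini 2.5 Pro, and GPT-4o Mini, and reaching SOTA level.
Insight: The main innovation is building an agricultural vision-language dataset with expert-verified chain-of-thought explanations and using it for supervised fine-tuning, which improves both diagnostic accuracy and the credibility and interpretability of the generated explanations. This offers a new data-driven paradigm for building trustworthy, deployable agricultural AI systems.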
Abstract: Accurate and interpretable plant disease diagnosis remains a major challenge for vision-language models (VLMs) in real-world agriculture. We introduce AgriChain, a dataset of approximately 11,000 expert-curated leaf images spanning diverse crops and pathologies, each paired with (i) a disease label, (ii) a calibrated confidence score (High/Medium/Low), and (iii) an expert-verified chain-of-thought (CoT) rationale. Draft explanations were first generated by GPT-4o and then verified by a professional agricultural engineer using standardized descriptors (e.g., lesion color, margin, and distribution). We fine-tune Qwen2.5-VL-3B on AgriChain, resulting in a specialized model termed AgriChain-VL3B, to jointly predict diseases and generate visually grounded reasoning. On a 1,000-image test set, our CoT-supervised model achieves 73.1% top-1 accuracy (macro F1 = 0.466; weighted F1 = 0.655), outperforming strong baselines including Gemini 1.5 Flash, Gemini 2.5 Pro, and GPT-4o Mini. The generated explanations align closely with expert reasoning, consistently referencing key visual cues. These findings demonstrate that expert-verified reasoning supervision significantly enhances both accuracy and interpretability, bridging the gap between generic multimodal models and human expertise, and advancing trustworthy, globally deployable AI for sustainable agriculture. The dataset and code are publicly available at: https://github.com/hazzanabeel12-netizen/agrichain
[49] LPM 1.0: Video-based Character Performance Model cs.CV | cs.AI | cs.MMPDF
Ailing Zeng, Casper Yang, Chauncey Ge, Eddie Zhang, Garvey Xu
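TL;DR: This paper presents LPM 1.0 (Large Performance Model), a model focused on single-person, full-duplex audio-visual conversational performance. By building a high-quality multimodal dataset, training a Diffusion Transformer base model, and distilling it into a causal streaming generator, it unifies high expressiveness, real-time inference, and long-horizon identity stability, and is intended as a visual engine for conversational agents, live streaming characters, and game NPCs.
Details
Motivation: Existing video models struggle to achieve high expressiveness, real-time inference, and long-horizon identity stability at the same time (the "performance trilemma"). Conversation is the most comprehensive performance scenario, requiring a character to speak, listen, react, and express emotion simultaneously.
Result: On the proposed LPM-Bench, the first benchmark for interactive character performance, LPM 1.0 achieves state-of-the-art (SOTA) results across all evaluated dimensions while maintaining real-time inference speed.
Insight: The innovation lies in systematically addressing the trilemma of performance generation through rigorous data construction (speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction) and model design (a base model for high controllability and an online distilled model for low-latency, infinite-length interaction), together with a dedicated evaluation benchmark.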
Abstract: Performance, the externalization of intent, emotion, and personality through visual, vocal, and temporal behavior, is what makes a character alive. Learning such performance from video is a promising alternative to traditional 3D pipelines. However, existing video models struggle to jointly achieve high expressiveness, real-time inference, and long-horizon identity stability, a tension we call the performance trilemma. Conversation is the most comprehensive performance scenario, as characters simultaneously speak, listen, react, and emote while maintaining identity over time. To address this, we present LPM 1.0 (Large Performance Model), focusing on single-person full-duplex audio-visual conversational performance. Concretely, we build a multimodal human-centric dataset through strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction; train a 17B-parameter Diffusion Transformer (Base LPM) for highly controllable, identity-consistent performance through multimodal conditioning; and distill it into a causal streaming generator (Online LPM) for low-latency, infinite-length interaction. At inference, given a character image with identity-aware references, LPM 1.0 generates listening videos from user audio and speaking videos from synthesized audio, with text prompts for motion control, all at real-time speed with identity-stable, infinite-length generation. LPM 1.0 thus serves as a visual engine for conversational agents, live streaming characters, and game NPCs. To systematically evaluate this setting, we propose LPM-Bench, the first benchmark for interactive character performance. LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference.
[50] ReconPhys: Reconstruct Appearance and Physical Attributes from Single Video cs.CVPDF
Boyuan Wang, Xiaofeng Wang, Yongkang Li, Zheng Zhu, Yifan Chang
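TL;DR: This paper proposes ReconPhys, the first feedforward framework that jointly learns physical attribute estimation and 3D Gaussian Splatting reconstruction from a monocular video. The method uses a dual-branch architecture and a self-supervised training strategy, so it can infer an object's geometry, appearance, and physical attributes simultaneously without ground-truth physics labels.
Details
Motivation: Existing methods for reconstructing non-rigid objects can recover geometry and dynamics, but they rely heavily on expensive per-scene optimization, manual tuning, or annotation, which makes them impractical and limits their generalization.
Result: Experiments on a large-scale synthetic dataset show that the method achieves 21.64 PSNR in future-frame prediction, substantially better than the 13.27 of SOTA optimization baselines, while reducing Chamfer Distance from 0.349 to 0.004. Inference is very fast (<1 second), whereas existing methods require hours.
Insight: The core innovation is jointly learning physical attribute estimation and 3D Gaussian Splatting reconstruction in a feedforward framework, with a self-supervised strategy that removes the need for ground-truth physics labels. This enables rapid generation of simulation-ready assets for robotics and graphics from monocular video, with notable gains in accuracy, speed, and practicality.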
Abstract: Reconstructing non-rigid objects with physical plausibility remains a significant challenge. Existing approaches leverage differentiable rendering for per-scene optimization, recovering geometry and dynamics but requiring expensive tuning or manual annotation, which limits practicality and generalizability. To address this, we propose ReconPhys, the first feedforward framework that jointly learns physical attribute estimation and 3D Gaussian Splatting reconstruction from a single monocular video. Our method employs a dual-branch architecture trained via a self-supervised strategy, eliminating the need for ground-truth physics labels. Given a video sequence, ReconPhys simultaneously infers geometry, appearance, and physical attributes. Experiments on a large-scale synthetic dataset demonstrate superior performance: our method achieves 21.64 PSNR in future prediction compared to 13.27 by state-of-the-art optimization baselines, while reducing Chamfer Distance from 0.349 to 0.004. Crucially, ReconPhys enables fast inference (<1 second) versus hours required by existing methods, facilitating rapid generation of simulation-ready assets for robotics and graphics.
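For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how PSNR and a symmetric Chamfer distance are commonly computed. The exact variants used by ReconPhys are not stated in the abstract, so the conventions here (pixel values in [0, 1], squared Euclidean distances, the mean over each direction summed) are assumptions, and the function names are illustrative.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB over two flat lists of pixel values."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets.

    Assumed convention: mean squared distance from each point to its
    nearest neighbour in the other set, summed over both directions.
    """
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_way(src, dst):
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


example_psnr = psnr([0.0, 1.0], [0.0, 0.9])
example_cd = chamfer_distance([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)])
```

Because PSNR is logarithmic, the reported gap between 21.64 and 13.27 (8.37 dB) corresponds to roughly a 6.9-fold reduction in mean squared error under this definition.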
[51] Reinforcement-Guided Synthetic Data Generation for Privacy-Sensitive Identity Recognition cs.CV | cs.AIPDF
Xuemei Jia, Jiawei Du, Hui Wei, Jun Chen, Joey Tianyi Zhou
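TL;DR: This paper proposes a reinforcement-guided synthetic data generation framework to address data scarcity in privacy-sensitive identity recognition tasks. The method aligns a pretrained generator with the target domain through cold-start adaptation, introduces a multi-objective reward that jointly optimizes semantic consistency, coverage diversity, and expression richness, and uses a dynamic sample selection mechanism to prioritize high-utility synthetic samples, improving both generation fidelity and classification accuracy.
Details
Motivation: In privacy-sensitive scenarios, regulatory and copyright constraints restrict access to data, and the resulting scarcity hampers model development. Generative models are supposed to compensate for the lack of data, yet they perform poorly precisely because data is limited, creating a vicious cycle. This paper aims to break the cycle by adapting general-domain generative priors to privacy-sensitive identity recognition tasks.
Result: Extensive experiments on benchmark datasets show that the framework significantly improves generation fidelity and classification accuracy, and generalizes strongly to novel categories in small-data regimes.
Insight: The innovations include cold-start adaptation to establish semantic relevance and initial fidelity; a multi-objective reward that jointly optimizes semantic consistency, coverage diversity, and expression richness; and a dynamic sample selection mechanism that enables adaptive data scaling and improved domain alignment. Viewed objectively, the method uses reinforcement learning to guide generation and thereby addresses the mutual constraint between data scarcity and generative model performance in privacy-sensitive scenarios.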
Abstract: High-fidelity generative models are increasingly needed in privacy-sensitive scenarios, where access to data is severely restricted due to regulatory and copyright constraints. This scarcity hampers model development, ironically in the very settings where generative models are most needed to compensate for the lack of data. This creates a self-reinforcing challenge: limited data leads to poor generative models, which in turn fail to mitigate data scarcity. To break this cycle, we propose a reinforcement-guided synthetic data generation framework that adapts general-domain generative priors to privacy-sensitive identity recognition tasks. We first perform a cold-start adaptation to align a pretrained generator with the target domain, establishing semantic relevance and initial fidelity. Building on this foundation, we introduce a multi-objective reward that jointly optimizes semantic consistency, coverage diversity, and expression richness, guiding the generator to produce both realistic and task-effective samples. During downstream training, a dynamic sample selection mechanism further prioritizes high-utility synthetic samples, enabling adaptive data scaling and improved domain alignment. Extensive experiments on benchmark datasets demonstrate that our framework significantly improves both generation fidelity and classification accuracy, while also exhibiting strong generalization to novel categories in small-data regimes.
[52] AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning cs.CV | cs.AIPDF
Jiaming Su, Tengchao Yang, Ruikang Zhang, Zhengan Yan, Haoyu Sun
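TL;DR: This paper proposes AnomalyAgent, an agent for industrial anomaly synthesis based on tool-augmented reinforcement learning. The agent has self-reflection, knowledge retrieval, and iterative refinement capabilities, and achieves closed-loop optimization through five tools (prompt generation, image generation, quality evaluation, knowledge retrieval, and mask generation). Its goal is to generate realistic and diverse anomaly samples to alleviate data scarcity in anomaly detection.
Details
Motivation: Most existing anomaly synthesis methods rely on single-step generation mechanisms and lack complex reasoning and iterative optimization capabilities, making it difficult to generate anomaly samples with high semantic realism. This paper aims to solve that problem.
Result: On the MVTec-AD dataset, AnomalyAgent achieves IS/IC-L of 2.10/0.33 for anomaly generation, 57.0% classification accuracy using ResNet34, and 99.3%/74.2% average precision (AP) at the image/pixel level using a simple UNet, surpassing all zero-shot SOTA methods.
Insight: The core innovation is framing anomaly synthesis as a tool-augmented agent task. A two-stage training scheme (supervised fine-tuning followed by reinforcement learning) and a three-part reward mechanism (task rewards, reflection rewards, and behavioral rewards) drive the agent's decision-making and self-reflection, enabling closed-loop iterative refinement of anomaly generation.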
Abstract: Industrial anomaly generation is a crucial method for alleviating the data scarcity problem in anomaly detection tasks. Most existing anomaly synthesis methods rely on single-step generation mechanisms, lacking complex reasoning and iterative optimization capabilities, making it difficult to generate anomaly samples with high semantic realism. We propose AnomalyAgent, an anomaly synthesis agent with self-reflection, knowledge retrieval, and iterative refinement capabilities, aiming to generate realistic and diverse anomalies. Specifically, AnomalyAgent is equipped with five tools: Prompt Generation (PG), Image Generation (IG), Quality Evaluation (QE), Knowledge Retrieval (KR), and Mask Generation (MG), enabling closed-loop optimization. To improve decision-making and self-reflection, we construct structured trajectories from real anomaly images and design a two-stage training framework: supervised fine-tuning followed by reinforcement learning. This process is driven by a three-part reward mechanism: (1) task rewards to supervise the quality and location rationality of generated anomalies; (2) reflection rewards to train the model’s ability to improve anomaly synthesis prompt; (3) behavioral rewards to ensure adherence to the trajectory. On the MVTec-AD dataset, AnomalyAgent achieves IS/IC-L of 2.10/0.33 for anomaly generation, 57.0% classification accuracy using ResNet34, and 99.3%/74.2% AP at the image/pixel level using a simple UNet, surpassing all zero-shot SOTA methods. The code and data will be made publicly available.
[53] PanoSAM2: Lightweight Distortion- and Memory-aware Adaptions of SAM2 for 360 Video Object Segmentation cs.CVPDF
Dingwen Xiao, Weiming Zhang, Shiqi Wen, Lin Wang
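TL;DR: This paper proposes PanoSAM2, a lightweight framework for 360 video object segmentation (360VOS). By introducing a pano-aware decoder, a distortion-guided mask loss, and a long-short memory module, it addresses the projection distortion, semantic inconsistency, and sparse object information in memory that arise when SAM2 is applied directly to 360 videos, and substantially improves segmentation performance.
Details
Motivation: 360 video object segmentation lacks high-quality labeled datasets. The existing SAM2 model has strong promptable video object segmentation capability, but applying it directly to 360 videos yields unsatisfactory results because of projection distortion, semantic inconsistency between the left and right sides, and sparse object mask information in memory.
Result: On the 360VOTS and PanoVOS benchmarks, PanoSAM2 achieves substantial gains of +5.6 and +6.7 over SAM2, respectively, demonstrating the effectiveness of the method.
Insight: The innovation lies in targeting the distortion and boundary issues specific to 360 video: a pano-aware decoder with seam-consistent receptive fields and iterative distortion refinement, and a loss function that weights pixels by distortion magnitude. In addition, a long-short memory module maintains a compact long-term object pointer to strengthen temporal coherence. These lightweight adaptation strategies preserve SAM2's user-friendly prompting design.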
Abstract: 360 video object segmentation (360VOS) aims to predict temporally consistent masks in 360 videos, offering full-scene coverage and benefiting applications such as VR/AR and embodied AI. Learning a 360VOS model is nontrivial due to the lack of high-quality labeled datasets. Recently, Segment Anything Models (SAMs), especially SAM2 with its memory module design, have shown strong, promptable VOS capability. However, directly using SAM2 for 360VOS yields implausible results, as 360 videos suffer from projection distortion, semantic inconsistency between the left and right sides, and sparse object mask information in SAM2's memory. To this end, we propose PanoSAM2, a novel 360VOS framework based on lightweight distortion- and memory-aware adaptation strategies for SAM2, which achieves reliable 360VOS while retaining SAM2's user-friendly prompting design. Concretely, to tackle the projection distortion and semantic inconsistency issues, we propose a Pano-Aware Decoder with seam-consistent receptive fields and iterative distortion refinement to maintain continuity across the 0/360 degree boundary. Meanwhile, a Distortion-Guided Mask Loss is introduced to weight pixels by distortion magnitude, stressing stretched regions and boundaries. To address the object sparsity issue, we propose a Long-Short Memory Module that maintains a compact long-term object pointer to re-instantiate and align short-term memories, thereby enhancing temporal coherence. Extensive experiments show that PanoSAM2 yields substantial gains over SAM2: +5.6 on 360VOTS and +6.7 on PanoVOS, showing the effectiveness of our method.
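A loss that weights pixels by distortion magnitude, as the abstract describes, might look something like the sketch below. The abstract does not define how distortion is measured, so this example assumes an equirectangular projection and uses the horizontal stretch factor 1/cos(latitude) as a stand-in, capped to keep polar rows from dominating. The function names and the cap value are hypothetical, and the boundary emphasis mentioned in the abstract is not modeled here.

```python
import math


def row_distortion_weights(height, max_weight=5.0):
    """Per-row weights for an equirectangular image (assumed projection).

    Rows near the poles are stretched horizontally by about 1/cos(latitude);
    that stretch factor, capped at max_weight, is used as the weight.
    """
    weights = []
    for r in range(height):
        # Row centre latitude: close to +pi/2 at the top, -pi/2 at the bottom.
        lat = (0.5 - (r + 0.5) / height) * math.pi
        stretch = 1.0 / max(math.cos(lat), 1e-6)
        weights.append(min(stretch, max_weight))
    return weights


def weighted_bce(pred, target, weights):
    """Binary cross-entropy over a 2D mask, weighted per row and normalised."""
    total, norm = 0.0, 0.0
    for p_row, t_row, w in zip(pred, target, weights):
        for p, t in zip(p_row, t_row):
            p = min(max(p, 1e-7), 1.0 - 1e-7)  # avoid log(0)
            total += -w * (t * math.log(p) + (1 - t) * math.log(1 - p))
            norm += w
    return total / norm


weights = row_distortion_weights(4)
target = [[0.0], [0.0], [0.0], [0.0]]
loss_polar_error = weighted_bce([[0.9], [0.0], [0.0], [0.0]], target, weights)
loss_equator_error = weighted_bce([[0.0], [0.9], [0.0], [0.0]], target, weights)
```

Under this weighting, the same prediction error costs more in a stretched polar row than near the equator, which is the behaviour the abstract attributes to stressing stretched regions.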
[54] ParkSense: Where Should a Delivery Driver Park? Leveraging Idle AV Compute and Vision-Language Models cs.CV | cs.ROPDF
Die Hu, Henan Li
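TL;DR: ParkSense is a framework that uses the idle compute of autonomous vehicles during low-risk states to run a vision-language model on satellite and street view imagery, recommending to delivery drivers precise, legal parking spots close to merchant entrances.
Details
Motivation: Finding parking takes up too much time in food delivery, and no existing system performs precise parking-spot selection relative to merchant entrances.
Result: A quantized 7B-parameter vision-language model completes inference in just 4-8 seconds on HW4-class hardware, and the estimated annual income gain per driver in the U.S. is 3,000-8,000 USD.
Insight: The work combines idle autonomous-vehicle compute, vision-language models, and last-mile logistics in a novel way, formalizes the Delivery-Aware Precision Parking problem, and opens a new research direction at the intersection of autonomous driving, computer vision, and logistics.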
Abstract: Finding parking consumes a disproportionate share of food delivery time, yet no system addresses precise parking-spot selection relative to merchant entrances. We propose ParkSense, a framework that repurposes idle compute during low-risk AV states – queuing at red lights, traffic congestion, parking-lot crawl – to run a Vision-Language Model (VLM) on pre-cached satellite and street view imagery, identifying entrances and legal parking zones. We formalize the Delivery-Aware Precision Parking (DAPP) problem, show that a quantized 7B VLM completes inference in 4-8 seconds on HW4-class hardware, and estimate annual per-driver income gains of 3,000-8,000 USD in the U.S. Five open research directions are identified at this unexplored intersection of autonomous driving, computer vision, and last-mile logistics.
[55] Mitigating Entangled Steering in Large Vision-Language Models for Hallucination Reduction cs.CV | cs.AIPDF
Yuanhong Zhang, Zhaoyang Wang, Xin Zhang, Weizhan Zhang, Joey Tianyi Zhou
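TL;DR: This paper proposes MESA, a plug-and-play framework for mitigating hallucinations in large vision-language models (LVLMs). By performing controlled and selective latent-space intervention, it reduces hallucinations while better preserving the model's original generation behavior, avoiding the shortened outputs and shifted token distributions caused by existing methods.
Details
Motivation: When existing methods mitigate LVLM hallucinations, they often alter the model's generation behavior, leading to shorter outputs and shifted token distributions, especially in latent space steering approaches. The authors attribute this to entangled steering signals: suppressing hallucinations inadvertently disrupts the model's intrinsic generation behavior.
Result: Extensive experiments on a variety of generative and discriminative benchmarks show that MESA consistently reduces hallucinations while better preserving generation behavior, outperforming prior methods across multiple LVLM families.
Insight: The core innovation is identifying and resolving signal entanglement in latent space steering, and proposing a selective intervention mechanism that decouples hallucination suppression from the preservation of generation behavior. This offers a new approach to targeted optimization that does not compromise a model's core capabilities.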
Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable success across cross-modal tasks but remain hindered by hallucinations, producing textual outputs inconsistent with visual content. Existing methods mitigate hallucinations but often alter generation behavior, resulting in shorter outputs and shifted token distributions, especially in latent space steering approaches. We identify that this issue stems from entangled steering signals, where suppressing hallucinations inadvertently disrupts the model’s intrinsic generation behavior. To address this, we propose MESA, an effective plug-and-play framework that performs controlled and selective latent intervention for hallucination mitigation. Specifically, MESA targets hallucination-relevant responses while preserving the model’s original token distribution, enabling effective hallucination reduction without compromising generation behavior. Extensive experiments across diverse generative and discriminative benchmarks demonstrate that MESA consistently reduces hallucinations while better preserving generation behavior, outperforming prior methods across multiple LVLM families.
[56] Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation cs.CVPDF
Weiming Zhang, Dingwen Xiao, Songyue Guo, Guangyu Xiang, Shiqi Wen
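TL;DR: This paper proposes Tarot-SAM3, a training-free segmentation framework that handles any referring expression. An Expression Reasoning Interpreter (ERI) converts arbitrary queries into robust heterogeneous prompts that drive SAM3 to generate masks. A Mask Self-Refining (MSR) phase then uses feature relationships from DINOv3 to select the best mask and refine it, correcting over- and under-segmentation.
Details
Motivation: Existing referring expression segmentation (RES) methods rely heavily on large annotated datasets and usually handle only explicit or only implicit expressions, which limits their generalization. Meanwhile, directly coupling SAM3 with a multimodal large language model makes the results overly dependent on the MLLM's reasoning capability and provides no way to refine SAM3's segmentation outputs.
Result: Extensive experiments show that Tarot-SAM3 achieves strong performance on both explicit and implicit RES benchmarks as well as in open-world scenarios. Ablation studies further validate the effectiveness of each phase.
Insight: The innovation is a training-free two-phase framework: (1) the ERI phase uses reasoning-assisted prompt options for structured expression parsing and evaluation-aware rephrasing, converting arbitrary queries into robust heterogeneous prompts; (2) the MSR phase leverages rich feature relationships from DINOv3 to compare discriminative regions among the ERI outputs, infers region affiliation, and performs self-refinement. This reduces direct dependence on the MLLM's reasoning capability and enables effective refinement of SAM3's segmentation outputs.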
Abstract: Referring Expression Segmentation (RES) aims to segment image regions described by natural-language expressions, serving as a bridge between vision and language understanding. Existing RES methods, however, rely heavily on large annotated datasets and are limited to either explicit or implicit expressions, hindering their ability to generalize to any referring expression. Recently, the Segment Anything Model 3 (SAM3) has shown impressive robustness in Promptable Concept Segmentation. Nonetheless, applying it to RES remains challenging: (1) SAM3 struggles with longer or implicit expressions; (2) naive coupling of SAM3 with a multimodal large language model (MLLM) makes the final results overly dependent on the MLLM’s reasoning capability, without enabling refinement of SAM3’s segmentation outputs. To this end, we present Tarot-SAM3, a novel training-free framework that can accurately segment from any referring expression. Specifically, Tarot-SAM3 consists of two key phases. First, the Expression Reasoning Interpreter (ERI) phase introduces reasoning-assisted prompt options to support structured expression parsing and evaluation-aware rephrasing. This transforms arbitrary queries into robust heterogeneous prompts for generating reliable masks with SAM3. Second, the Mask Self-Refining (MSR) phase selects the best mask across prompt types and performs self-refinement by leveraging rich feature relationships from DINOv3 to compare discriminative regions among ERI outputs. It then infers region affiliation to the target, thereby correcting over- and under-segmentation. Extensive experiments demonstrate that Tarot-SAM3 achieves strong performance on both explicit and implicit RES benchmarks, as well as open-world scenarios. Ablation studies further validate the effectiveness of each phase.
[57] Generative 3D Gaussian Splatting for Arbitrary-Resolution Atmospheric Downscaling and Forecasting cs.CV | cs.LGPDF
Tao Hana, Zhibin Wen, Zhenghao Chen, Fenghua Lin, Junyu Gao
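TL;DR: This paper proposes the 3D Gaussian splatting-based scale-aware vision transformer (GSSA-ViT), a framework for arbitrary-resolution forecasting and flexible downscaling of high-dimensional atmospheric fields. The method treats latitude-longitude grid points as centers of 3D Gaussians, introduces a generative 3D Gaussian prediction scheme to estimate key parameters for unseen samples, and designs a scale-aware attention module to capture cross-scale dependencies, enabling continuous resolution adaptation.
Details
Motivation: Generating high-resolution outputs with AI-based numerical weather prediction (NWP) is computationally expensive, which stems from the limited multi-scale adaptability and inefficient data representations of existing methods.
Result: Experiments on the ERA5 dataset show that the method accurately forecasts 87 atmospheric variables at arbitrary resolutions. Evaluations on ERA5 and CMIP6 show superior performance in downscaling tasks.
Insight: The innovation is combining generative 3D Gaussian modeling with scale-aware attention for unified multi-scale prediction for the first time, providing an efficient and scalable solution for high-resolution atmospheric forecasting and downscaling.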
Abstract: While AI-based numerical weather prediction (NWP) enables rapid forecasting, generating high-resolution outputs remains computationally demanding due to limited multi-scale adaptability and inefficient data representations. We propose the 3D Gaussian splatting-based scale-aware vision transformer (GSSA-ViT), a novel framework for arbitrary-resolution forecasting and flexible downscaling of high-dimensional atmospheric fields. Specifically, latitude-longitude grid points are treated as centers of 3D Gaussians. A generative 3D Gaussian prediction scheme is introduced to estimate key parameters, including covariance, attributes, and opacity, for unseen samples, improving generalization and mitigating overfitting. In addition, a scale-aware attention module is designed to capture cross-scale dependencies, enabling the model to effectively integrate information across varying downscaling ratios and support continuous resolution adaptation. To our knowledge, this is the first NWP approach that combines generative 3D Gaussian modeling with scale-aware attention for unified multi-scale prediction. Experiments on ERA5 show that the proposed method accurately forecasts 87 atmospheric variables at arbitrary resolutions, while evaluations on ERA5 and CMIP6 demonstrate its superior performance in downscaling tasks. The proposed framework provides an efficient and scalable solution for high-resolution, multi-scale atmospheric prediction and downscaling. Code is available at: https://github.com/binbin2xs/weather-GS.
[58] Shortcut Learning in Glomerular AI: Adversarial Penalties Hurt, Entropy Helps cs.CVPDF
Mohammad Daouk, Jan Ulrich Becker, Neeraja Kambham, Anthony Chang, Hien Nguyen
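TL;DR: This paper studies shortcut learning caused by stain variability in glomerular pathology AI. Using a curated multi-center, multi-stain dataset, it evaluates Bayesian CNN and ViT models in three settings, finds that a carefully curated multi-stain dataset is itself robust to stain shortcuts, and proposes a label-free, entropy-regularized dual-head architecture as a simple and effective safeguard against stain-related distribution shift.
Details
Motivation: To address stain-induced distribution shift and potential shortcut learning in renal pathology AI, to investigate whether glomerular lesion classifiers exploit stain as a shortcut feature, and to study how such bias can be mitigated without relying on stain or site labels.
Result: On a multi-center, multi-stain dataset of 9,674 glomerular patches, experiments show that stain identity is trivially learnable (confirming it as a strong candidate shortcut), but supervised stain loss has no notable effect on lesion classification metrics. Label-free regularization based on entropy maximization holds stain predictions near chance without degrading lesion accuracy or calibration.
Insight: The paper's innovation is a dual-head architecture with entropy regularization that requires no stain labels, serving as a deployment-friendly safeguard against stain-related distribution shift. Viewed objectively, its core insight is that a carefully curated multi-stain dataset may be inherently robust to stain shortcuts, whereas adversarial penalties may be ineffective or may even increase predictive uncertainty.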
Abstract: Stain variability is a pervasive source of distribution shift and potential shortcut learning in renal pathology AI. We ask whether lupus nephritis glomerular lesion classifiers exploit stain as a shortcut, and how to mitigate such bias without stain or site labels. We curate a multi-center, multi-stain dataset of 9,674 glomerular patches (224×224) from 365 WSIs across three centers and four stains (PAS, H&E, Jones, Trichrome), labeled as proliferative vs. non-proliferative. We evaluate Bayesian CNN and ViT backbones with Monte Carlo dropout in three settings: (1) stain-only classification; (2) a dual-head model jointly predicting lesion and stain with supervised stain loss; and (3) a dual-head model with label-free stain regularization via entropy maximization on the stain head. In (1), stain identity is trivially learnable, confirming a strong candidate shortcut. In (2), varying the strength and sign of stain supervision strongly modulates stain performance but leaves lesion metrics essentially unchanged, indicating no measurable stain-driven shortcut learning on this multi-stain, multi-center dataset, while overly adversarial stain penalties inflate predictive uncertainty. In (3), entropy-based regularization holds stain predictions near chance without degrading lesion accuracy or calibration. Overall, a carefully curated multi-stain dataset can be inherently robust to stain shortcuts, and a Bayesian dual-head architecture with label-free entropy regularization offers a simple, deployment-friendly safeguard against potential stain-related drift in glomerular AI.
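Setting (3) above, label-free regularization via entropy maximization on the stain head, can be illustrated with a per-sample objective of the following shape. This is a sketch under assumptions rather than the paper's exact loss: the weighting coefficient lam, the function names, and the simple additive combination are hypothetical, and the abstract does not describe how gradients from the stain head reach the shared backbone.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def dual_head_loss(lesion_logits, lesion_label, stain_logits, lam=0.1):
    """Lesion cross-entropy minus lam times the stain-head entropy.

    No stain label is used: minimising this loss pushes the stain head
    toward a uniform (chance-level) prediction. lam is an assumed knob.
    """
    lesion_ce = -math.log(softmax(lesion_logits)[lesion_label])
    stain_entropy = entropy(softmax(stain_logits))
    return lesion_ce - lam * stain_entropy


# Four stains: a uniform stain prediction attains the maximum entropy log(4).
uniform_loss = dual_head_loss([2.0, 0.0], 0, [0.0, 0.0, 0.0, 0.0])
confident_loss = dual_head_loss([2.0, 0.0], 0, [5.0, 0.0, 0.0, 0.0])
```

The design point is that the regularizer needs only the stain head's output distribution, which is why this setting can be applied at sites where stain annotations are unavailable.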
[59] ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks cs.CVPDF
Jiayang Xu, Fan Zhuo, Majun Zhang, Changhao Pan, Zehan Wang
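TL;DR: ImVideoEdit is an efficient video editing framework that learns video editing capabilities from image pairs alone. By freezing the pre-trained 3D attention modules and treating images as single-frame videos, it decouples the 2D spatial learning process so that the original temporal dynamics are preserved.
Details
Motivation: Current video editing models usually rely on expensive paired video data, which limits practical scalability. The paper aims to decouple the spatiotemporal process so that spatial content is modified selectively and precisely while the temporal dynamics of the pre-trained model are preserved.
Result: Despite training on only 13K image pairs for 5 epochs with very low computational overhead, ImVideoEdit achieves editing fidelity and temporal consistency comparable to larger models trained on extensive video datasets.
Insight: The innovations include a Predict-Update Spatial Difference Attention module that progressively extracts and injects spatial differences, and a Text-Guided Dynamic Semantic Gating mechanism that enables adaptive, implicit text-driven modifications without relying on rigid external masks.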
Abstract: Current video editing models often rely on expensive paired video data, which limits their practical scalability. In essence, most video editing tasks can be formulated as a decoupled spatiotemporal process, where the temporal dynamics of the pretrained model are preserved while spatial content is selectively and precisely modified. Based on this insight, we propose ImVideoEdit, an efficient framework that learns video editing capabilities entirely from image pairs. By freezing the pre-trained 3D attention modules and treating images as single-frame videos, we decouple the 2D spatial learning process to help preserve the original temporal dynamics. The core of our approach is a Predict-Update Spatial Difference Attention module that progressively extracts and injects spatial differences. Rather than relying on rigid external masks, we incorporate a Text-Guided Dynamic Semantic Gating mechanism for adaptive and implicit text-driven modifications. Despite training on only 13K image pairs for 5 epochs with exceptionally low computational overhead, ImVideoEdit achieves editing fidelity and temporal consistency comparable to larger models trained on extensive video datasets.
[60] TOOLCAD: Exploring Tool-Using Large Language Models in Text-to-CAD Generation with Reinforcement Learning cs.CV | cs.AI | cs.CLPDF
Yifei Gong, Xing Wu, Wenda Liu, Kang Tu
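TL;DR: This paper proposes ToolCAD, a novel agentic CAD framework that uses large language models as tool-using agents for text-to-CAD generation with reinforcement learning. The framework includes an interactive CAD modeling environment for producing reasoning and tool-augmented interaction trajectories, and an end-to-end post-training strategy that refines the LLM agent's CAD modeling chain of thought through online curriculum reinforcement learning, turning it into a proficient CAD tool-using agent.
Details
Motivation: Computer-Aided Design (CAD) is an expert-level task that relies on long-horizon reasoning and coherent modeling actions. No prior work has investigated how tool-using LLMs can interact optimally with CAD engines, which hinders the emergence of LLM-based agentic text-to-CAD modeling systems.
Result: The findings indicate that ToolCAD fills the gap in adopting and training open-source LLMs as CAD tool-using agents, enabling them to perform comparably to proprietary models and paving the way for more accessible and robust autonomous text-to-CAD modeling systems.
Insight: The innovation is a complete framework that integrates an interactive CAD modeling environment, hybrid feedback, human supervision, and an end-to-end post-training strategy based on online curriculum reinforcement learning, designed specifically to train LLMs into proficient CAD tool-using agents. It also introduces the concept of a CAD Modeling Chain of Thought (CAD-CoT) to refine the reasoning process.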
Abstract: Computer-Aided Design (CAD) is an expert-level task that relies on long-horizon reasoning and coherent modeling actions. Large Language Models (LLMs) have shown remarkable advancements in enabling language agents to tackle real-world tasks. Notably, there has been no investigation into how tool-using LLMs optimally interact with CAD engines, hindering the emergence of LLM-based agentic text-to-CAD modeling systems. We propose ToolCAD, a novel agentic CAD framework deploying LLMs as tool-using agents for text-to-CAD generation. Furthermore, we introduce an interactive CAD modeling gym to rollout reasoning and tool-augmented interaction trajectories with the CAD engine, incorporating hybrid feedback and human supervision. Meanwhile, an end-to-end post-training strategy is presented to enable the LLM agent to elicit refined CAD Modeling Chain of Thought (CAD-CoT) and evolve into proficient CAD tool-using agents via online curriculum reinforcement learning. Our findings demonstrate ToolCAD fills the gap in adopting and training open-source LLMs for CAD tool-using agents, enabling them to perform comparably to proprietary models, paving the way for more accessible and robust autonomous text-to-CAD modeling systems.
[61] DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing cs.CV | cs.AI | cs.LGPDF
Gyanendra Das, Sai Satyam Jena
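TL;DR: This paper proposes Dynamic Subspace Concept Alignment (DSCA), a new method for lifelong knowledge editing of vision-language models (VLMs). It decomposes the model's representation space into a set of orthogonal semantic subspaces and performs edits within those subspaces, structurally isolating different concepts to achieve precise, non-interfering knowledge updates.
Details
Motivation: Existing VLM knowledge editing methods operate in a shared representation space, which causes concept entanglement and interference between edits. In lifelong editing, sequential edits also tend to disrupt previously learned concepts, leading to degraded reasoning and cross-modal misalignment.
Result: With the base model frozen, the method achieves a 98% single-edit success rate, remains above 95% after 1000 sequential edits, lowers hallucination by 3-5%, and obtains the best backward transfer (BWT) scores on continual instruction tuning benchmarks. Experiments show that DSCA has state-of-the-art stability and knowledge retention across various datasets and benchmarks.
Insight: The core innovation is turning the structural isolation of knowledge from a soft training objective into an architectural property. Orthogonal subspaces are built through incremental clustering and PCA on joint vision-language representations, so that concepts are separated by construction, and a multi-objective loss function ensures edit precision and cross-modal alignment.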
Abstract: Model editing aims to update knowledge to add new concepts and change relevant information without retraining. Lifelong editing is a challenging task, prone to disrupting previously learned concepts, especially for Vision Language Models (VLMs), because sequential edits can lead to degraded reasoning and cross-modal misalignment. Existing VLM knowledge editing methods based on gated adapters, activation edits, and parameter merging techniques address the catastrophic forgetting seen in full fine-tuning; however, they still operate in the shared representation space of the VLM, where concepts are entangled, so edits interfere with other, unrelated concepts. We hypothesize that this instability persists because current methods algorithmically control edits via optimization rather than structurally separating knowledge. We introduce Dynamic Subspace Concept Alignment (DSCA), which by design mitigates this limitation by decomposing the representation space into a set of orthogonal semantic subspaces and proposing edits only in those transformed spaces. These subspaces are obtained through incremental clustering and PCA on joint vision-language representations. This process structurally isolates concepts, enabling precise, non-interfering edits by turning isolation from a soft training objective into an architectural property. The surgical edits are guided by a multi-term loss function for maintaining task fidelity, edit locality, and cross-modal alignment. With the base model frozen, our method achieves 98 percent single-edit success, remains over 95 percent after 1000 sequential edits, lowers hallucination by 3 to 5 percent, and achieves the best backward transfer (BWT) scores on continual instruction tuning benchmarks. Extensive experiments demonstrate DSCA's state-of-the-art stability and knowledge retention capability in continual lifelong editing across various datasets and benchmarks.
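The reason edits confined to orthogonal subspaces do not interfere with one another can be seen in a small linear-algebra sketch. It is only an illustration of that property, not DSCA itself: the abstract obtains its subspaces through incremental clustering and PCA, whereas the bases below are supplied by hand, and all function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(vectors):
    """Gram-Schmidt: return an orthonormal basis for the span of vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > 1e-12:
            basis.append([wi / norm for wi in w])
    return basis


def project(vec, basis):
    """Orthogonal projection of vec onto the span of an orthonormal basis."""
    out = [0.0] * len(vec)
    for b in basis:
        c = dot(vec, b)
        out = [o + c * bi for o, bi in zip(out, b)]
    return out


def apply_edit(hidden, delta, basis):
    """Add only the component of delta that lies inside the given subspace."""
    restricted = project(delta, basis)
    return [h + r for h, r in zip(hidden, restricted)]


# Two hand-picked, mutually orthogonal concept subspaces (illustrative only).
concept_a = orthonormalize([[1.0, 1.0, 0.0, 0.0]])
concept_b = orthonormalize([[0.0, 0.0, 1.0, 0.0]])

hidden = [1.0, 2.0, 3.0, 4.0]
delta = [1.0, 0.0, 5.0, 5.0]  # a raw update that would also disturb concept B
edited = apply_edit(hidden, delta, concept_a)
```

After the edit, the representation's coordinate along concept B is unchanged, because the projected update has no component outside concept A's subspace.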
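[62] Lighting-grounded Video Generation with Renderer-based Agent Reasoning cs.CV
Ziqi Cai, Taoyu Yang, Zheng Chang, Si Li, Han Jiang
TL;DR: This paper presents LiVER, a diffusion-based framework for scene-controllable video generation. It conditions synthesis on explicit 3D scene properties (object layout, lighting, and camera parameters) through rendered control signals, giving disentangled control over key scene factors.
Details
Motivation: Diffusion models have made remarkable progress in video generation, but their controllability remains limited. Key scene factors such as layout, lighting, and camera trajectory are often entangled or weakly modeled, which restricts their use in filmmaking, virtual production, and other domains that need explicit scene control.
Result: Experiments show that LiVER achieves state-of-the-art (SOTA) photorealism and temporal consistency while enabling precise, disentangled control over scene factors, setting a new standard for controllable video generation.
Insight: The contributions are: rendering control signals from a unified 3D representation to disentangle scene properties; a lightweight conditioning module and a progressive training strategy for stable integration into a foundation video diffusion model; and a scene agent that automatically translates high-level user instructions into 3D control signals, which improves usability.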
Abstract: Diffusion models have achieved remarkable progress in video generation, but their controllability remains a major limitation. Key scene factors such as layout, lighting, and camera trajectory are often entangled or only weakly modeled, restricting their applicability in domains like filmmaking and virtual production where explicit scene control is essential. We present LiVER, a diffusion-based framework for scene-controllable video generation. To achieve this, we introduce a novel framework that conditions video synthesis on explicit 3D scene properties, supported by a new large-scale dataset with dense annotations of object layout, lighting, and camera parameters. Our method disentangles these properties by rendering control signals from a unified 3D representation. We propose a lightweight conditioning module and a progressive training strategy to integrate these signals into a foundational video diffusion model, ensuring stable convergence and high fidelity. Our framework enables a wide range of applications, including image-to-video and video-to-video synthesis where the underlying 3D scene is fully editable. To further enhance usability, we develop a scene agent that automatically translates high-level user instructions into the required 3D control signals. Experiments show that LiVER achieves state-of-the-art photorealism and temporal consistency while enabling precise, disentangled control over scene factors, setting a new standard for controllable video generation.
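[63] DP-DeGauss: Dynamic Probabilistic Gaussian Decomposition for Egocentric 4D Scene Reconstruction cs.CV
Tingxi Chen, Zhengxue Cheng, Houqiang Zhong, Su Wang, Rong Xie
TL;DR: This paper proposes DP-DeGauss, a dynamic probabilistic Gaussian decomposition framework for egocentric 4D scene reconstruction. It initializes a unified 3D Gaussian set from COLMAP priors, gives each Gaussian a learnable category probability, and dynamically routes the Gaussians into specialized deformation branches for background, hand, or object modeling. Category-specific masks plus brightness and motion-flow control improve disentanglement and reconstruction quality. The method outperforms baselines on PSNR, SSIM, and LPIPS and is the first to achieve state-of-the-art disentanglement of background, hand, and object components.
Details
Motivation: Egocentric video is crucial for next-generation 4D scene reconstruction, but dynamic first-person scenes are very hard to reconstruct because of complex ego-motion, occlusions, and hand-object interactions. Existing decomposition methods are ill-suited because they assume fixed viewpoints or merge all dynamics into a single foreground.
Result: Extensive experiments show that DP-DeGauss outperforms baselines by +1.70 dB in PSNR on average, with gains in SSIM and LPIPS as well. More importantly, the framework is the first to achieve state-of-the-art (SOTA) disentanglement of background, hand, and object components.
Insight: The novelty is a dynamic probabilistic Gaussian decomposition framework that achieves fine-grained disentanglement through learnable category probabilities and dynamic routing, with brightness and motion-flow control to improve static rendering and dynamic reconstruction. Viewed objectively, the method combines the 3D Gaussian representation with probabilistic decomposition and category-specific deformation branches, offering a new approach to understanding and editing complex dynamic egocentric scenes.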
Abstract: Egocentric video is crucial for next-generation 4D scene reconstruction, with applications in AR/VR and embodied AI. However, reconstructing dynamic first-person scenes is challenging due to complex ego-motion, occlusions, and hand-object interactions. Existing decomposition methods are ill-suited, assuming fixed viewpoints or merging dynamics into a single foreground. To address these limitations, we introduce DP-DeGauss, a dynamic probabilistic Gaussian decomposition framework for egocentric 4D reconstruction. Our method initializes a unified 3D Gaussian set from COLMAP priors, augments each with a learnable category probability, and dynamically routes them into specialized deformation branches for background, hands, or object modeling. We employ category-specific masks for better disentanglement and introduce brightness and motion-flow control to improve static rendering and dynamic reconstruction. Extensive experiments show that DP-DeGauss outperforms baselines by +1.70dB in PSNR on average with SSIM and LPIPS gains. More importantly, our framework achieves the first and state-of-the-art disentanglement of background, hand, and object components, enabling explicit, fine-grained separation, paving the way for more intuitive ego scene understanding and editing.
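The routing step described above can be pictured with a small sketch. The abstract does not specify how category probabilities are parameterized or how routing is decided, so this assumes per-Gaussian logits passed through a softmax and a hard argmax assignment; `CATEGORIES`, `route`, and the toy logits are all hypothetical.

```python
import math

CATEGORIES = ("background", "hand", "object")


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(gaussian_logits):
    """Assign each Gaussian to the branch with the highest category probability."""
    branches = {name: [] for name in CATEGORIES}
    for idx, logits in enumerate(gaussian_logits):
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        branches[CATEGORIES[best]].append((idx, probs[best]))
    return branches


# Assumption: three Gaussians, each with learnable logits over the three categories.
toy_logits = [[2.0, 0.1, -1.0], [0.0, 3.0, 0.5], [-0.5, 0.2, 1.5]]
branches = route(toy_logits)
```

In a trainable system the hard argmax would likely be replaced by a differentiable weighting, since the probabilities are described as learnable; the sketch only shows the bookkeeping.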
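[64] SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations cs.CV
Yunnan Wang, Kecheng Zheng, Jianyuan Wang, Minghao Chen, David Novotny
TL;DR: This paper introduces SceneScribe-1M, a large-scale, multi-modal video dataset of one million in-the-wild videos, each annotated with detailed textual descriptions, precise camera parameters, dense depth maps, and consistent 3D point tracks. It aims to fill the need for a unified large-scale data resource spanning 3D geometric perception and video synthesis, and it supports benchmarking on a range of downstream tasks.
Details
Motivation: Existing datasets focus on either 3D understanding or video generation; no unified resource supports both domains at scale. The authors build SceneScribe-1M to fill this gap.
Result: The paper demonstrates the dataset's value by establishing benchmarks on a range of downstream tasks, including monocular depth estimation, scene reconstruction, and dynamic point tracking, as well as generative tasks such as text-to-video synthesis with or without camera control.
Insight: The main contribution is a large-scale video dataset that integrates comprehensive geometric annotations (camera parameters, depth maps, 3D point tracks) with semantic annotations (textual descriptions), providing unified infrastructure and benchmarks for joint research on 3D perception and controllable video generation. Open-sourcing the dataset is expected to act as a catalyst for related research.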
Abstract: The convergence of 3D geometric perception and video synthesis has created an unprecedented demand for large-scale video data that is rich in both semantic and spatio-temporal information. While existing datasets have advanced either 3D understanding or video generation, a significant gap remains in providing a unified resource that supports both domains at scale. To bridge this chasm, we introduce SceneScribe-1M, a new large-scale, multi-modal video dataset. It comprises one million in-the-wild videos, each meticulously annotated with detailed textual descriptions, precise camera parameters, dense depth maps, and consistent 3D point tracks. We demonstrate the versatility and value of SceneScribe-1M by establishing benchmarks across a wide array of downstream tasks, including monocular depth estimation, scene reconstruction, and dynamic point tracking, as well as generative tasks such as text-to-video synthesis, with or without camera control. By open-sourcing SceneScribe-1M, we aim to provide a comprehensive benchmark and a catalyst for research, fostering the development of models that can both perceive the dynamic 3D world and generate controllable, realistic video content.
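[65] MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models cs.CV | cs.MM
Zile Guo, Zhan Chen, Enze Zhu, Kan Wei, Yongkang Zou
TL;DR: This paper presents MotionScape, a large-scale, real-world, highly dynamic UAV-view video dataset for training world models. It contains over 30 hours of 4K video (about 4.5M frames) with accurate 6-DoF camera trajectories and fine-grained natural language descriptions. The dataset is built with an automated processing pipeline, and experiments show that it improves the ability of existing world models to simulate complex 3D dynamics and handle large viewpoint shifts.
Details
Motivation: Existing world models struggle to maintain spatiotemporal physical consistency under the highly dynamic camera trajectories typical of UAVs. The main cause is distribution bias in current training data, which lacks realistic, highly dynamic 6-DoF UAV motion priors.
Result: Experiments show that incorporating the dataset's semantically and geometrically aligned annotations improves the ability of existing world models to simulate complex 3D dynamics and handle large viewpoint shifts, which benefits UAV decision-making and planning in complex environments.
Insight: The contribution is a large-scale, real-world, highly dynamic UAV-view video dataset with semantically and geometrically aligned annotations (6-DoF trajectories and language descriptions). Its automated processing pipeline, which combines CLIP-based filtering, temporal segmentation, visual SLAM, and LLM-driven annotation, is also a useful reference.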
Abstract: Recent advances in world models have demonstrated strong capabilities in simulating physical reality, making them an increasingly important foundation for embodied intelligence. For UAV agents in particular, accurate prediction of complex 3D dynamics is essential for autonomous navigation and robust decision-making in unconstrained environments. However, under the highly dynamic camera trajectories typical of UAV views, existing world models often struggle to maintain spatiotemporal physical consistency. A key reason lies in the distribution bias of current training data: most existing datasets exhibit restricted 2.5D motion patterns, such as ground-constrained autonomous driving scenes or relatively smooth human-centric egocentric videos, and therefore lack realistic high-dynamic 6-DoF UAV motion priors. To address this gap, we present MotionScape, a large-scale real-world UAV-view video dataset with highly dynamic motion for world modeling. MotionScape contains over 30 hours of 4K UAV-view videos, totaling more than 4.5M frames. This novel dataset features semantically and geometrically aligned training samples, where diverse real-world UAV videos are tightly coupled with accurate 6-DoF camera trajectories and fine-grained natural language descriptions. To build the dataset, we develop an automated multi-stage processing pipeline that integrates CLIP-based relevance filtering, temporal segmentation, robust visual SLAM for trajectory recovery, and large-language-model-driven semantic annotation. Extensive experiments show that incorporating such semantically and geometrically aligned annotations effectively improves the ability of existing world models to simulate complex 3D dynamics and handle large viewpoint shifts, thereby benefiting decision-making and planning for UAV agents in complex environments. The dataset is publicly available at https://github.com/Thelegendzz/MotionScape
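[66] Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments cs.CV
Yun Zhu, Jianjun Qian, Jian Yang, Jin Xie, Na Zhao
TL;DR: This paper proposes FI3Det, a few-shot incremental 3D object detection framework for dynamic indoor environments. It uses vision-language models (VLMs) to learn from a small number of novel-class samples, addressing the heavy reliance of existing incremental 3D detection methods on extensive annotations. FI3Det comprises a VLM-guided unknown object learning module and a gated multimodal prototype imprinting module, and it outperforms baseline methods on ScanNet V2 and SUN RGB-D.
Details
Motivation: Existing incremental 3D object detection methods need extensive annotations of novel classes to reach satisfactory performance, which limits their practical use in dynamic indoor environments. This paper aims to remove that limitation and enable efficient 3D perception from only a few novel samples.
Result: On ScanNet V2 and SUN RGB-D, under both batch and sequential evaluation settings, FI3Det achieves strong and consistent improvements over baseline methods.
Insight: The main contributions are: 1) using VLMs to mine unknown objects and extract comprehensive 2D semantic features and class-agnostic 3D bounding box representations; 2) a weighting mechanism that re-weights the contributions of point- and box-level features according to spatial location and feature consistency to suppress noise; 3) a gated multimodal prototype imprinting module that builds category prototypes from aligned 2D semantic and 3D geometric features and fuses classification scores through a multimodal gating mechanism. This is the first framework for few-shot incremental 3D object detection.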
Abstract: Incremental 3D object perception is a critical step toward embodied intelligence in dynamic indoor environments. However, existing incremental 3D detection methods rely on extensive annotations of novel classes for satisfactory performance. To address this limitation, we propose FI3Det, a Few-shot Incremental 3D Detection framework that enables efficient 3D perception with only a few novel samples by leveraging vision-language models (VLMs) to learn knowledge of unseen categories. FI3Det introduces a VLM-guided unknown object learning module in the base stage to enhance perception of unseen categories. Specifically, it employs VLMs to mine unknown objects and extract comprehensive representations, including 2D semantic features and class-agnostic 3D bounding boxes. To mitigate noise in these representations, a weighting mechanism is further designed to re-weight the contributions of point- and box-level features based on their spatial locations and feature consistency within each box. Moreover, FI3Det proposes a gated multimodal prototype imprinting module, where category prototypes are constructed from aligned 2D semantic and 3D geometric features to compute classification scores, which are then fused via a multimodal gating mechanism for novel object detection. As the first framework for few-shot incremental 3D object detection, we establish both batch and sequential evaluation settings on two datasets, ScanNet V2 and SUN RGB-D, where FI3Det achieves strong and consistent improvements over baseline methods. Code is available at https://github.com/zyrant/FI3Det.
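[67] SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving cs.CV | cs.AI | cs.LG
Felix Embacher, Jonas Uhrig, Marius Cordts, Markus Enzweiler
TL;DR: This paper introduces SearchAD, a large-scale rare image retrieval dataset for autonomous driving. It contains over 423k frames drawn from 11 existing datasets, with high-quality manual annotations of more than 513k bounding boxes covering 90 rare categories. The dataset targets retrieval of extremely rare classes and supports text-to-image and image-to-image retrieval, few-shot learning, and fine-tuning of multi-modal retrieval models.
Details
Motivation: Efficiently retrieving rare, safety-critical driving scenarios from large-scale datasets is needed to build more robust autonomous driving systems, yet existing benchmarks mainly address instance-level retrieval and lack semantic image retrieval.
Result: Comprehensive evaluations show that text-based methods outperform image-based ones thanks to stronger inherent semantic grounding. Models that directly align spatial visual features with language achieve the best zero-shot results, and the fine-tuning baseline significantly improves performance, but absolute retrieval capability remains unsatisfactory.
Insight: The contribution is the first large-scale dataset dedicated to retrieval-driven data curation and long-tail perception research in autonomous driving, with a well-defined data split that supports multiple retrieval tasks. It reveals the advantage of text-based semantic retrieval in rare scenarios and the limitations of current models.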
Abstract: Retrieving rare and safety-critical driving scenarios from large-scale datasets is essential for building robust autonomous driving (AD) systems. As dataset sizes continue to grow, the key challenge shifts from collecting more data to efficiently identifying the most relevant samples. We introduce SearchAD, a large-scale rare image retrieval dataset for AD containing over 423k frames drawn from 11 established datasets. SearchAD provides high-quality manual annotations of more than 513k bounding boxes covering 90 rare categories. It specifically targets the needle-in-a-haystack problem of locating extremely rare classes, with some appearing fewer than 50 times across the entire dataset. Unlike existing benchmarks, which focused on instance-level retrieval, SearchAD emphasizes semantic image retrieval with a well-defined data split, enabling text-to-image and image-to-image retrieval, few-shot learning, and fine-tuning of multi-modal retrieval models. Comprehensive evaluations show that text-based methods outperform image-based ones due to stronger inherent semantic grounding. While models directly aligning spatial visual features with language achieve the best zero-shot results, and our fine-tuning baseline significantly improves performance, absolute retrieval capabilities remain unsatisfactory. With a held-out test set on a public benchmark server, SearchAD establishes the first large-scale dataset for retrieval-driven data curation and long-tail perception research in AD: https://iis-esslingen.github.io/searchad/
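[68] Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding cs.CV
Xuezhen Tu, Jingyu Wu, Fangyu Kang, Qingpeng Nong, Kaijin Zhang
TL;DR: This paper proposes Bridge-STG, an end-to-end framework for spatio-temporal video grounding that targets two core challenges multimodal large language models face on this task: entangled spatio-temporal alignment and visual token redundancy. By decoupling temporal and spatial localization and introducing Spatio-Temporal Semantic Bridging and a Query-Guided Spatial Localization module, it achieves state-of-the-art performance among MLLM-based methods on multiple benchmarks.
Details
Motivation: Existing MLLMs face two core problems in spatio-temporal video grounding: 1) spatio-temporal alignment is entangled in the autoregressive output space, which couples heterogeneous sub-tasks; 2) target objects are sparse in both time and space, causing dual-domain visual token redundancy in which the vast majority of visual tokens are irrelevant to the grounding query.
Result: On the VidSTG benchmark, average m_vIoU improves from 26.4 to 34.3, the state of the art among MLLM-based methods. Under a unified multi-task training regime, the model also shows strong cross-task transfer across various fine-grained video understanding tasks.
Insight: The core idea is to decouple temporal and spatial localization and bridge the resulting semantic gap with the Spatio-Temporal Semantic Bridging (STSB) mechanism and Explicit Temporal Alignment (ETA), while the Query-Guided Spatial Localization (QGSL) module removes visual token redundancy through multi-layer interactive queries and positive/negative frame sampling. This decouple-and-bridge design offers a new approach to complex multimodal spatio-temporal reasoning tasks.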
Abstract: Spatio-Temporal Video Grounding requires jointly localizing target objects across both temporal and spatial dimensions based on natural language queries, posing fundamental challenges for existing Multimodal Large Language Models (MLLMs). We identify two core challenges: entangled spatio-temporal alignment, arising from coupling two heterogeneous sub-tasks within the same autoregressive output space, and dual-domain visual token redundancy, where target objects exhibit simultaneous temporal and spatial sparsity, rendering the overwhelming majority of visual tokens irrelevant to the grounding query. To address these, we propose Bridge-STG, an end-to-end framework that decouples temporal and spatial localization while maintaining semantic coherence. While decoupling is the natural solution to this entanglement, it risks creating a semantic gap between the temporal MLLM and the spatial decoder. Bridge-STG resolves this through two pivotal designs: the Spatio-Temporal Semantic Bridging (STSB) mechanism with Explicit Temporal Alignment (ETA) distills the MLLM’s temporal reasoning context into enriched bridging queries as a robust semantic interface; and the Query-Guided Spatial Localization (QGSL) module leverages these queries to drive a purpose-built spatial decoder with multi-layer interactive queries and positive/negative frame sampling, jointly eliminating dual-domain visual token redundancy. Extensive experiments across multiple benchmarks demonstrate that Bridge-STG achieves state-of-the-art performance among MLLM-based methods. Bridge-STG improves average m_vIoU from 26.4 to 34.3 on VidSTG and demonstrates strong cross-task transfer across various fine-grained video understanding tasks under a unified multi-task training regime.
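The m_vIoU numbers are easier to read with the metric written out. The sketch below follows the commonly used definition of vIoU for spatio-temporal grounding (sum of per-frame box IoU over frames where prediction and ground truth overlap in time, divided by the number of frames in their temporal union); the paper's exact evaluation protocol is not given here, so treat this as an assumption. Boxes are `(x1, y1, x2, y2)` and the helper names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def v_iou(pred, gt):
    """pred and gt map frame index -> box for the frames in each tube."""
    union_frames = set(pred) | set(gt)
    inter_frames = set(pred) & set(gt)
    if not union_frames:
        return 0.0
    return sum(box_iou(pred[t], gt[t]) for t in inter_frames) / len(union_frames)


def m_v_iou(samples):
    """Mean vIoU over (pred, gt) pairs."""
    return sum(v_iou(p, g) for p, g in samples) / len(samples)


# Toy example: ground truth spans frames 2-4, prediction spans frames 3-5.
gt = {2: (0, 0, 10, 10), 3: (0, 0, 10, 10), 4: (0, 0, 10, 10)}
pred = {3: (0, 0, 10, 10), 4: (5, 0, 15, 10), 5: (5, 0, 15, 10)}
score = v_iou(pred, gt)  # (1 + 1/3) / 4 frames in the temporal union
```

The toy case shows why the metric is demanding: a prediction has to be right in time and in space, because temporal misalignment enlarges the denominator while spatial error shrinks the numerator.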
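[69] Component-Adaptive and Lesion-Level Supervision for Improved Small Structure Segmentation in Brain MRI cs.CV | cs.LG
Minh Sao Khue Luu, Evgeniy N. Pavlovskiy, Bair N. Tuchinov
TL;DR: This paper proposes CATMIL, a unified objective function for improving segmentation of imbalanced small lesions in brain MRI. On top of the base segmentation loss, it adds two auxiliary supervision terms at different levels: a Component-Adaptive Tversky loss based on connected components, which balances the influence of lesions of different sizes, and a lesion-level term based on Multiple Instance Learning, which encourages detection of every lesion instance. Evaluated on the MSLesSeg dataset with the nnU-Net framework, it achieves the most balanced performance across segmentation accuracy, lesion detection, and error control.
Details
Motivation: Small lesions are hard to segment in highly imbalanced brain MRI data, leading to low recall and many false negatives.
Result: The method is evaluated on the MSLesSeg dataset with 5-fold cross-validation. CATMIL achieves the most balanced performance: it improves the Dice score (0.7834), reduces boundary error, substantially increases small-lesion recall and reduces false negatives, and maintains the lowest false positive volume among the compared methods.
Insight: The main contribution is integrating component-level supervision (adaptive re-weighting based on connected components) and lesion-level supervision (Multiple Instance Learning) into a single objective that jointly optimizes voxel-level segmentation accuracy and lesion-level detection, giving an effective and practical approach to small-structure segmentation in highly imbalanced settings.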
Abstract: We propose a unified objective function, termed CATMIL, that augments the base segmentation loss with two auxiliary supervision terms operating at different levels. The first term, Component-Adaptive Tversky, reweights voxel contributions based on connected components to balance the influence of lesions of different sizes. The second term, based on Multiple Instance Learning, introduces lesion-level supervision by encouraging the detection of each lesion instance. These terms are combined with the standard nnU-Net loss to jointly optimize voxel-level segmentation accuracy and lesion-level detection. We evaluate the proposed objective on the MSLesSeg dataset using a consistent nnU-Net framework and 5-fold cross-validation. The results show that CATMIL achieves the most balanced performance across segmentation accuracy, lesion detection, and error control. It improves Dice score (0.7834) and reduces boundary error compared to standard losses. More importantly, it substantially increases small lesion recall and reduces false negatives, while maintaining the lowest false positive volume among compared methods. These findings demonstrate that integrating component-level and lesion-level supervision within a unified objective provides an effective and practical approach for improving small lesion segmentation in highly imbalanced settings. All code and pretrained models are available at https://github.com/luumsk/SmallLesionMRI.
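As a rough illustration of component-based reweighting, the sketch below applies the standard Tversky loss, 1 - TP / (TP + alpha*FP + beta*FN), with per-voxel weights. The weighting rule is an assumption made for this example (each foreground voxel gets 1 / component size, so every lesion contributes equally); the abstract does not give the exact Component-Adaptive Tversky formulation. The data is 1-D for brevity, and `component_weights` and `weighted_tversky_loss` are hypothetical names.

```python
def component_weights(mask):
    """Weight foreground voxels by 1/len(run); background voxels keep weight 1."""
    weights = [1.0] * len(mask)
    i = 0
    while i < len(mask):
        if mask[i] == 1:
            j = i
            while j < len(mask) and mask[j] == 1:
                j += 1  # j ends one past the connected run of 1s
            for k in range(i, j):
                weights[k] = 1.0 / (j - i)
            i = j
        else:
            i += 1
    return weights


def weighted_tversky_loss(pred, target, weights, alpha=0.3, beta=0.7, eps=1e-7):
    """Tversky loss with per-voxel weights; beta > alpha penalizes misses more."""
    tp = sum(w * p * t for w, p, t in zip(weights, pred, target))
    fp = sum(w * p * (1 - t) for w, p, t in zip(weights, pred, target))
    fn = sum(w * (1 - p) * t for w, p, t in zip(weights, pred, target))
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)


# One 1-voxel lesion and one 4-voxel lesion; the prediction misses the small one.
target = [1, 0, 0, 1, 1, 1, 1, 0]
pred_missing_small = [0, 0, 0, 1, 1, 1, 1, 0]

uniform = [1.0] * len(target)
adaptive = component_weights(target)

loss_uniform = weighted_tversky_loss(pred_missing_small, target, uniform)
loss_adaptive = weighted_tversky_loss(pred_missing_small, target, adaptive)
```

With uniform weights the missed 1-voxel lesion costs little because the large lesion dominates the true positives; with the component-based weights both lesions count equally, so the same miss produces a much larger loss.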
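[70] LINE: LLM-based Iterative Neuron Explanations for Vision Models cs.CV | cs.AI | cs.LG
Vladimir Zaigrajew, Michał Piechota, Gaspar Sekula, Przemysław Biecek
TL;DR: This paper proposes LINE, a novel, training-free, iterative LLM-based method for explaining the concepts encoded by individual neurons in vision models. Operating in a strictly black-box setting, it uses a large language model and a text-to-image model to iteratively propose and refine concept descriptions in a closed loop, guided by activation history. Experiments show that LINE achieves state-of-the-art performance across multiple model architectures, discovers many new concepts missed by predefined vocabularies, and provides a complete generation history and visual explanations.
Details
Motivation: Existing neuron explanation methods are usually limited to predefined concept vocabularies or produce overly specific descriptions that fail to capture higher-order, global concepts. This paper aims to remove that limitation and provide open-vocabulary concept labeling for vision models.
Result: LINE achieves state-of-the-art performance on ImageNet and Places365, with AUC improvements of up to 0.18 and 0.05, respectively. On average, it also discovers 29% new concepts missed by massive predefined vocabularies.
Insight: The novelty is a training-free, black-box iterative framework that combines a large language model with a text-to-image model to generate and validate concept descriptions dynamically in a closed loop. This goes beyond static vocabularies, enables evaluation of neuron polysemanticity, and provides visual explanations.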
Abstract: Interpreting the concepts encoded by individual neurons in deep neural networks is a crucial step towards understanding their complex decision-making processes and ensuring AI safety. Despite recent progress in neuron labeling, existing methods often limit the search space to predefined concept vocabularies or produce overly specific descriptions that fail to capture higher-order, global concepts. We introduce LINE, a novel, training-free iterative approach tailored for open-vocabulary concept labeling in vision models. Operating in a strictly black-box setting, LINE leverages a large language model and a text-to-image generator to iteratively propose and refine concepts in a closed loop, guided by activation history. We demonstrate that LINE achieves state-of-the-art performance across multiple model architectures, yielding AUC improvements of up to 0.18 on ImageNet and 0.05 on Places365, while discovering, on average, 29% of new concepts missed by massive predefined vocabularies. Beyond identifying the top concept, LINE provides a complete generation history, which enables polysemanticity evaluation and produces supporting visual explanations that rival gradient-dependent activation maximization methods.
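[71] 3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience cs.CV | cs.AI
Hongcan Xiao, Xinyue Xiao, Yilin Wang, Yue Zhang, Yonggang Qi
TL;DR: This paper proposes 3DrawAgent, a training-free, language-driven framework for 3D sketch generation that uses large language models (LLMs) to sequentially draw 3D Bezier curves under geometric feedback. It introduces a relative experience optimization strategy: pairwise comparisons among generated sketches, ranked by CLIP-based perceptual rewards and LLM-based fine-grained qualitative assessment, are used to iteratively refine the model's prior knowledge of 3D drawing. This yields black-box reinforcement without parameter updates and improves spatial understanding and drawing quality.
Details
Motivation: Generating 3D sketches from natural language remains a major challenge; solving it would enable expressive reasoning about shape, structure, and spatial relationships.
Result: Experiments show that 3DrawAgent generates complex and coherent 3D Bezier sketches from diverse text prompts, exhibits emergent geometric reasoning, and generalizes to novel shapes, establishing a new paradigm for training-free 3D sketch intelligence.
Insight: The novelty is bringing a relative experience optimization strategy (pairwise comparisons with hybrid rewards) to 3D sketch generation, so the model improves itself without parameter updates. Viewed objectively, the method combines LLM reasoning, CLIP-based perceptual evaluation, and ideas from black-box reinforcement learning, offering a new approach to training-free 3D content generation.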
Abstract: Sketching in 3D space enables expressive reasoning about shape, structure, and spatial relationships, yet generating 3D sketches through natural language remains a major challenge. In this work, we introduce 3DrawAgent, a training-free, language-driven framework for 3D sketch generation that leverages large language models (LLMs) to sequentially draw 3D Bezier curves under geometric feedback. Unlike prior 2D sketch agents, our method introduces a relative experience optimization strategy that adapts the recently proposed Group Relative Policy Optimization (GRPO) paradigm. Instead of relying on explicit ground-truth supervision, we construct pairwise comparisons among generated sketches, with each pair consisting of a relatively better and a worse result based on CLIP-based perceptual rewards and LLM-based fine-grained qualitative assessment. These experiences are then used to iteratively refine the prior knowledge of 3D drawing, enabling black-box reinforcement of the model’s 3D awareness. This design allows our model to self-improve its spatial understanding and drawing quality without parameter updates. Experiments show that 3DrawAgent can generate complex and coherent 3D Bezier sketches from diverse textual prompts, exhibit emergent geometric reasoning, and generalize to novel shapes, establishing a new paradigm for advancing the field of training-free 3D sketch intelligence.
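The sketches in this entry are made of 3D Bezier curves, so a short sketch of how one such stroke is evaluated may help. It uses de Casteljau's algorithm, which works for any number of control points; the control points and the `sample_curve` helper are illustrative assumptions, not the paper's stroke format.

```python
def lerp(p, q, t):
    """Linear interpolation between two 3D points."""
    return tuple((1 - t) * a + t * b for a, b in zip(p, q))


def bezier_point(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] with de Casteljau's algorithm."""
    pts = list(control_points)
    while len(pts) > 1:
        pts = [lerp(pts[i], pts[i + 1], t) for i in range(len(pts) - 1)]
    return pts[0]


def sample_curve(control_points, n=8):
    """Return n evenly spaced samples (in parameter space) along the curve."""
    return [bezier_point(control_points, i / (n - 1)) for i in range(n)]


# Assumption: one cubic stroke, i.e. four control points in 3D.
stroke = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (3.0, 2.0, 1.0), (4.0, 0.0, 1.0)]
polyline = sample_curve(stroke, n=8)
midpoint = bezier_point(stroke, 0.5)
```

A curve starts at its first control point and ends at its last, so a language model only needs to emit a handful of coordinates per stroke; the sampled polyline is what a renderer or a geometric feedback check would consume.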
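[72] Adapting Foundation Models for Annotation-Efficient Adnexal Mass Segmentation in Cine Images cs.CV
Francesca Fati, Alberto Rota, Adriana V. Gregory, Anna Catozzo, Maria C. Giuliano
TL;DR: This paper proposes a label-efficient segmentation framework built on a pretrained DINOv3 foundation vision transformer for adnexal mass segmentation in ultrasound cine images. Combined with a DPT-style decoder, it fuses global semantic representations with fine-grained spatial details, achieves state-of-the-art performance on a clinical dataset, and is more robust when data is scarce.
Details
Motivation: Ultrasound assessment of adnexal masses suffers from subjective interpretation and large inter-observer variability, while traditional fully supervised convolutional models need large amounts of pixel-level annotation and struggle with the domain shifts common in medical imaging.
Result: On a clinical dataset of 7,777 annotated frames from 112 patients, the method achieves a Dice score of 0.945 and reduces the 95th-percentile Hausdorff Distance by 11.4% relative to the strongest convolutional baseline. It outperforms fully supervised baselines including U-Net, U-Net++, DeepLabV3, and MAnet, reaching SOTA.
Insight: The novelty is using the strong semantic priors of a large-scale self-supervised foundation model (DINOv3) together with a decoder that hierarchically reassembles features, enabling efficient medical image segmentation with limited annotated data and offering a promising solution for data-constrained clinical environments.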
Abstract: Adnexal mass evaluation via ultrasound is a challenging clinical task, often hindered by subjective interpretation and significant inter-observer variability. While automated segmentation is a foundational step for quantitative risk assessment, traditional fully supervised convolutional architectures frequently require large amounts of pixel-level annotations and struggle with domain shifts common in medical imaging. In this work, we propose a label-efficient segmentation framework that leverages the robust semantic priors of a pretrained DINOv3 foundational vision transformer backbone. By integrating this backbone with a Dense Prediction Transformer (DPT)-style decoder, our model hierarchically reassembles multi-scale features to combine global semantic representations with fine-grained spatial details. Evaluated on a clinical dataset of 7,777 annotated frames from 112 patients, our method achieves state-of-the-art performance compared to established fully supervised baselines, including U-Net, U-Net++, DeepLabV3, and MAnet. Specifically, we obtain a Dice score of 0.945 and improved boundary adherence, reducing the 95th-percentile Hausdorff Distance by 11.4% relative to the strongest convolutional baseline. Furthermore, we conduct an extensive efficiency analysis demonstrating that our DINOv3-based approach retains significantly higher performance under data starvation regimes, maintaining strong results even when trained on only 25% of the data. These results suggest that leveraging large-scale self-supervised foundations provides a promising and data-efficient solution for medical image segmentation in data-constrained clinical environments. Project Repository: https://github.com/FrancescaFati/MESA
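For readers less familiar with the headline metric, the Dice score is 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a ground-truth mask B. The sketch below computes it for small binary masks given as nested lists; the convention of returning 1.0 when both masks are empty is an assumption (evaluation toolkits differ on that edge case).

```python
def dice_score(pred, target):
    """Dice coefficient for two binary masks given as nested lists of 0/1."""
    flat_pred = [v for row in pred for v in row]
    flat_target = [v for row in target for v in row]
    intersection = sum(p * t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)
    if total == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return 2.0 * intersection / total


pred_mask = [[0, 1, 1],
             [0, 1, 0]]
true_mask = [[0, 1, 1],
             [0, 0, 0]]
score = dice_score(pred_mask, true_mask)  # 2 * 2 / (3 + 2) = 0.8
```

Dice rewards overlap but says little about boundary shape, which is why the abstract reports the 95th-percentile Hausdorff Distance alongside it.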
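[73] ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning cs.CV
Daichi Yashima, Shuhei Kurita, Yusuke Oda, Shuntaro Suzuki, Seitaro Otsuki
TL;DR: This paper proposes ABMamba, a new multimodal large language model for efficient video captioning. It uses a Deep State Space Model with linear computational complexity as its language backbone and introduces an Aligned Hierarchical Bidirectional Scan module that processes video at multiple temporal resolutions, targeting the efficiency bottleneck that quadratic complexity imposes on Transformer-based methods for long video sequences.
Details
Motivation: In existing Transformer-based MLLMs, the cost of the core attention mechanism grows quadratically with sequence length. For videos, which have complex temporal dependencies and long sequences, this makes computation prohibitively expensive.
Result: On standard video captioning benchmarks such as VATEX and MSR-VTT, ABMamba performs competitively with typical MLLMs while achieving roughly three times higher throughput.
Insight: The main contributions are using a linear-complexity Deep State Space Model (such as Mamba) as the language backbone of an MLLM and designing a novel Aligned Hierarchical Bidirectional Scan module that models a video's spatio-temporal information efficiently at multiple resolutions, substantially improving efficiency while maintaining performance.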
Abstract: In this study, we focus on video captioning by fully open multimodal large language models (MLLMs). The comprehension of visual sequences is challenging because of their intricate temporal dependencies and substantial sequence length. The core attention mechanisms of existing Transformer-based approaches scale quadratically with the sequence length, making them computationally prohibitive. To address these limitations, we propose Aligned Hierarchical Bidirectional Scan Mamba (ABMamba), a fully open MLLM with linear computational complexity that enables the scalable processing of video sequences. ABMamba extends Deep State Space Models as its language backbone, replacing the costly quadratic attention mechanisms, and employs a novel Aligned Hierarchical Bidirectional Scan module that processes videos across multiple temporal resolutions. On standard video captioning benchmarks such as VATEX and MSR-VTT, ABMamba demonstrates competitive performance compared to typical MLLMs while achieving approximately three times higher throughput.
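To see why a state-space scan is linear in sequence length, and what scanning in both directions adds, here is a deliberately simplified sketch. It uses a fixed scalar recurrence h_t = a * h_(t-1) + b * x_t, not Mamba's input-dependent selective scan, and it omits the hierarchical multi-resolution and alignment parts of ABMamba; `linear_scan` and `bidirectional_scan` are hypothetical names.

```python
def linear_scan(xs, a=0.5, b=1.0):
    """One pass over the sequence: O(L) time, O(1) state."""
    h = 0.0
    out = []
    for x in xs:
        h = a * h + b * x  # the state summarizes everything seen so far
        out.append(h)
    return out


def bidirectional_scan(xs, a=0.5, b=1.0):
    """Combine a left-to-right and a right-to-left scan so every position
    sees context from both the past and the future."""
    forward = linear_scan(xs, a, b)
    backward = linear_scan(xs[::-1], a, b)[::-1]
    return [f + g for f, g in zip(forward, backward)]


frames = [1.0, 0.0, 0.0]  # toy per-frame features
fwd = linear_scan(frames)
both = bidirectional_scan(frames)
```

Each scan touches every element once, so doubling the number of frames doubles the work, whereas full self-attention compares every pair of tokens and roughly quadruples it.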
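[74] EEG2Vision: A Multimodal EEG-Based Framework for 2D Visual Reconstruction in Cognitive Neuroscience cs.CV
Emanuele Balloni, Emanuele Frontoni, Chiara Matti, Marina Paolanti, Roberto Pierdicca
TL;DR: EEG2Vision is an end-to-end EEG-based framework for reconstructing 2D visual stimuli from non-invasive EEG signals. It produces an initial reconstruction with an EEG-conditioned diffusion model and then applies a post-reconstruction boosting stage guided by a multimodal large language model to improve visual quality. The study systematically evaluates performance at different EEG channel counts (128, 64, 32, 24) and shows that boosting improves perceptual quality, especially in low-channel configurations, making real-time brain-to-image applications with low-resolution EEG devices more feasible.
Details
Motivation: Reconstructing visual stimuli from non-invasive EEG is hard because of its low spatial resolution and high noise, particularly under realistic low-density electrode configurations. The paper aims to improve reconstruction quality and practical feasibility in that setting.
Result: Experiments show that semantic decoding accuracy drops significantly as channels are reduced (e.g., 50-way Top-1 accuracy from 89% to 38%), while reconstruction quality decreases slightly (e.g., FID from 76.77 to 80.51). The proposed boosting mechanism consistently improves perceptual metrics in all configurations, with up to 9.71% IS gains in low-channel settings, and a user study confirms a clear perceptual preference for boosted reconstructions.
Insight: The contributions include a modular end-to-end EEG-to-image framework, a systematic evaluation of the effect of EEG resolution, and a prompt-guided post-reconstruction boosting mechanism based on a multimodal large language model. The mechanism extracts semantic descriptions and uses image-to-image diffusion to refine geometry and perceptual coherence while preserving EEG-grounded structure, which suggests a path toward real-time applications on low-resolution EEG devices.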
Abstract: Reconstructing visual stimuli from non-invasive electroencephalography (EEG) remains challenging due to its low spatial resolution and high noise, particularly under realistic low-density electrode configurations. To address this, we present EEG2Vision, a modular, end-to-end EEG-to-image framework that systematically evaluates reconstruction performance across different EEG resolutions (128, 64, 32, and 24 channels) and enhances visual quality through a prompt-guided post-reconstruction boosting mechanism. Starting from EEG-conditioned diffusion reconstruction, the boosting stage uses a multimodal large language model to extract semantic descriptions and leverages image-to-image diffusion to refine geometry and perceptual coherence while preserving EEG-grounded structure. Our experiments show that semantic decoding accuracy degrades significantly with channel reduction (e.g., 50-way Top-1 Acc from 89% to 38%), while reconstruction quality decreases slightly (e.g., FID from 76.77 to 80.51). The proposed boosting consistently improves perceptual metrics across all configurations, achieving up to 9.71% IS gains in low-channel settings. A user study confirms the clear perceptual preference for boosted reconstructions. The proposed approach significantly boosts the feasibility of real-time brain-to-image applications using low-resolution EEG devices, potentially unlocking this type of application outside laboratory settings.
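[75] Brain3D: EEG-to-3D Decoding of Visual Representations via Multimodal Reasoning cs.CV
Emanuele Balloni, Emanuele Frontoni, Chiara Matti, Marina Paolanti, Roberto Pierdicca
TL;DR: This paper proposes Brain3D, a multimodal architecture for decoding and reconstructing 3D visual representations from EEG signals. It works in stages: it first generates 2D images from EEG, then uses a multimodal large language model to extract structured 3D-aware descriptions, then generates images with a diffusion model, and finally converts them into coherent 3D meshes, going from brain activity to 3D objects.
Details
Motivation: Current work on decoding visual information from EEG focuses on reconstructing 2D images, and reconstruction of 3D representations remains largely unexplored. This limits geometric understanding and reduces the applicability of neural decoding in different contexts. The paper aims to fill this gap.
Result: The evaluation compares reconstructed 3D outputs against the original visual stimuli in terms of semantic alignment and geometric fidelity. The architecture achieves up to 85.4% 10-way Top-1 EEG decoding accuracy and a CLIPScore of 0.648, supporting its effectiveness.
Insight: The novelty is decomposing the complex EEG-to-3D mapping into structured, interpretable stages (EEG -> image -> text description -> diffusion image -> 3D mesh), which avoids the difficulty of a direct mapping. A multimodal large language model serves as the intermediate bridge that extracts 3D-aware information, enabling scalable brain-driven 3D generation.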
Abstract: Decoding visual information from electroencephalography (EEG) has recently achieved promising results, primarily focusing on reconstructing two-dimensional (2D) images from brain activity. However, the reconstruction of three-dimensional (3D) representations remains largely unexplored. This limits the geometric understanding and reduces the applicability of neural decoding in different contexts. To address this gap, we propose Brain3D, a multimodal architecture for EEG-to-3D reconstruction based on EEG-to-image decoding. It progressively transforms neural representations into the 3D domain using geometry-aware generative reasoning. Our pipeline first produces visually grounded images from EEG signals, then employs a multimodal large language model to extract structured 3D-aware descriptions, which guide a diffusion-based generation stage whose outputs are finally converted into coherent 3D meshes via a single-image-to-3D model. By decomposing the problem into structured stages, the proposed approach avoids direct EEG-to-3D mappings and enables scalable brain-driven 3D generation. We conduct a comprehensive evaluation comparing the reconstructed 3D outputs against the original visual stimuli, assessing both semantic alignment and geometric fidelity. Experimental results demonstrate strong performance of the proposed architecture, achieving up to 85.4% 10-way Top-1 EEG decoding accuracy and 0.648 CLIPScore, supporting the feasibility of multimodal EEG-driven 3D reconstruction.
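[76] AtlasOCR: Building the First Open-Source Darija OCR Model with Vision Language Models cs.CV | cs.AI
Imane Momayiz, Soufiane Ait Elaouad, Abdeljalil Elmajjodi, Haitame Bouanane
TL;DR: This paper introduces AtlasOCR, the first open-source OCR model for Darija, the Moroccan Arabic dialect, built by fine-tuning the 3B-parameter vision language model Qwen2.5-VL 3B. It describes the full process, from building a Darija-specific dataset through synthetic generation with the OCRSmith library and carefully sourced real-world data to parameter-efficient fine-tuning with QLoRA and Unsloth, and reports state-of-the-art performance on the newly built AtlasOCRBench and the existing KITAB-Bench.
Details
Motivation: Darija, the Moroccan Arabic dialect, is rich in visual content but lacks dedicated OCR tools, so a first open-source Darija OCR model is needed to address this resource scarcity.
Result: Evaluated on the newly built AtlasOCRBench and the existing KITAB-Bench, AtlasOCR shows state-of-the-art performance, challenges larger models, and demonstrates robustness and generalization on both Darija and standard Arabic OCR tasks.
Insight: The contributions include the first Darija-specific OCR dataset (combining synthetic and real data), parameter-efficient fine-tuning of Qwen2.5-VL 3B with QLoRA and Unsloth, and comprehensive ablation studies that tune key hyperparameters, together providing a reusable end-to-end recipe for OCR in low-resource languages.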
Abstract: Darija, the Moroccan Arabic dialect, is rich in visual content yet lacks specialized Optical Character Recognition (OCR) tools. This paper introduces AtlasOCR, the first open-source Darija OCR model built by fine-tuning a 3B parameter Vision Language Model (VLM). We detail our comprehensive approach, from curating a unique Darija-specific dataset leveraging both synthetic generation with our OCRSmith library and carefully sourced real-world data, to implementing efficient fine-tuning strategies. We utilize QLoRA and Unsloth for parameter-efficient training of Qwen2.5-VL 3B and present comprehensive ablation studies optimizing key hyperparameters. Our evaluation on the newly curated AtlasOCRBench and the established KITAB-Bench demonstrates state-of-the-art performance, challenging larger models and highlighting AtlasOCR’s robustness and generalization capabilities for both Darija and standard Arabic OCR tasks.
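[77] DinoRADE: Full Spectral Radar-Camera Fusion with Vision Foundation Model Features for Multi-class Object Detection in Adverse Weather cs.CV
Christof Leitgeb, Thomas Puchleitner, Max Peter Ronecker, Daniel Watzenig
TL;DR: DinoRADE is a Radar-centered detection framework for multi-class object detection in adverse weather. It processes dense Radar tensors and aggregates vision features extracted by a DINOv3 vision foundation model in the camera perspective via deformable cross-attention, addressing the lack of spatial detail that limits existing Radar methods when detecting small objects and vulnerable road users.
Details
Motivation: Existing FMCW Radar-based perception systems for autonomous driving perform well in adverse weather but are limited in resolving fine-grained spatial details, especially for small objects and vulnerable road users. Existing research has also not adequately addressed multi-class detection on adverse weather datasets such as K-Radar.
Result: The method is evaluated comprehensively on the K-Radar dataset in all weather conditions and is among the first to report detection performance individually for five object classes. It outperforms recent Radar-camera approaches by 12.1%.
Insight: The novelty is fusing the strong features of the DINOv3 vision foundation model with dense Radar tensors via deformable cross-attention, with a focus on detecting small objects and vulnerable road users in adverse weather, yielding a new Radar-centered approach to multimodal fusion.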
Abstract: Reliable and weather-robust perception systems are essential for safe autonomous driving and typically employ multi-modal sensor configurations to achieve comprehensive environmental awareness. While recent automotive FMCW Radar-based approaches achieved remarkable performance on detection tasks in adverse weather conditions, they exhibited limitations in resolving fine-grained spatial details particularly critical for detecting smaller and vulnerable road users (VRUs). Furthermore, existing research has not adequately addressed VRU detection in adverse weather datasets such as K-Radar. We present DinoRADE, a Radar-centered detection pipeline that processes dense Radar tensors and aggregates vision features around transformed reference points in the camera perspective via deformable cross-attention. Vision features are provided by a DINOv3 Vision Foundation Model. We present a comprehensive performance evaluation on the K-Radar dataset in all weather conditions and are among the first to report detection performance individually for five object classes. Additionally, we compare our method with existing single-class detection approaches and outperform recent Radar-camera approaches by 12.1%. The code is available under https://github.com/chr-is-tof/RADE-Net.
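[78] AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding cs.CV
Handong Li, Zikang Liu, Longteng Guo, Tongtian Yue, Yepeng Tang
TL;DR: This paper proposes AdaSpark, an adaptive sparsity framework for efficient long-video understanding. It partitions video into 3D spatio-temporal cubes and uses adaptive cube-selective attention together with an adaptive token-selective feed-forward network to allocate computation dynamically according to input complexity, substantially reducing computational overhead.
Details
Motivation: Video large language models are computationally expensive on long videos. Existing efficiency methods either sacrifice fine-grained perception by irreversibly discarding information or inhibit long-range temporal modeling through rigid, predefined sparse patterns. AdaSpark aims to address these limitations.
Result: On challenging hour-scale video benchmarks, AdaSpark reduces computational load by up to 57% in FLOPs while maintaining performance comparable to dense models and preserving fine-grained, long-range dependencies.
Insight: The main contribution is a context-aware adaptive sparsity framework built on two co-designed components, adaptive cube-selective attention and an adaptive token-selective feed-forward network, plus an entropy-based Top-p selection mechanism that allocates computation according to the complexity of the input content.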
Abstract: Processing long-form videos with Video Large Language Models (Video-LLMs) is computationally prohibitive. Current efficiency methods often compromise fine-grained perception through irreversible information disposal or inhibit long-range temporal modeling via rigid, predefined sparse patterns. This paper introduces AdaSpark, an adaptive sparsity framework designed to address these limitations. AdaSpark first partitions video inputs into 3D spatio-temporal cubes. It then employs two co-designed, context-aware components: (1) Adaptive Cube-Selective Attention (AdaS-Attn), which adaptively selects a subset of relevant video cubes to attend for each query token, and (2) Adaptive Token-Selective FFN (AdaS-FFN), which selectively processes only the most salient tokens within each cube. An entropy-based (Top-p) selection mechanism adaptively allocates computational resources based on input complexity. Experiments demonstrate that AdaSpark significantly reduces computational load by up to 57% FLOPs while maintaining comparable performance to dense models and preserving fine-grained, long-range dependencies, as validated on challenging hour-scale video benchmarks.
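The Top-p selection mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of a softmax over raw relevance scores, and the default `p=0.9` are all assumptions made for illustration. The point it shows is that a fixed probability-mass target keeps few items when the score distribution is peaked (low entropy) and many when it is flat (high entropy).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_select(scores, p=0.9):
    """Return indices of the smallest set whose probability mass reaches p.

    `scores` stands in for per-cube relevance logits (hypothetical input).
    """
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(chosen)


peaked = top_p_select([8.0, 0.1, 0.0, -0.2], p=0.9)  # one cube dominates
flat = top_p_select([1.0, 1.0, 1.0, 1.0], p=0.9)     # no cube dominates
```

With the peaked scores a single cube already covers the requested mass, whereas the flat scores require all four cubes, so the compute spent scales with how uncertain the relevance distribution is.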
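[79] DiffVC: A Non-autoregressive Framework Based on Diffusion Model for Video Captioning cs.CV
Junbo Wang, Liangyu Fu, Yuke Li, Yining Zhu, Ya Jing
TL;DR: This paper proposes DiffVC, a non-autoregressive video captioning framework based on a diffusion model. It targets the slow generation and large cumulative error of conventional autoregressive methods as well as the weaker output quality of existing non-autoregressive methods, using parallel decoding and a discriminative conditional diffusion model to generate higher-quality captions.
Details
Motivation: Autoregressive video captioning methods are inherently slow and accumulate errors, while existing non-autoregressive methods produce lower-quality captions because they lack sufficient multimodal interaction modeling. The paper aims to address generation speed, cumulative error, and generation quality together with a diffusion-based non-autoregressive framework.
Result: Experiments on the MSVD, MSR-VTT, and VATEX benchmarks show that the method outperforms previous non-autoregressive methods and is comparable to autoregressive methods, with maximum improvements of 9.9 points in CIDEr and 2.6 points in B@4, while generating faster.
Insight: The main contribution is bringing diffusion models to non-autoregressive video captioning, with a discriminative conditional diffusion model that strengthens multimodal interaction and improves quality while keeping the advantages of parallel decoding. More broadly, the work illustrates the potential of diffusion models as an alternative to autoregressive decoding in sequence generation, particularly where speed and quality must be balanced.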
Abstract: Current video captioning methods usually use an encoder-decoder structure to generate text autoregressively. However, autoregressive methods have inherent limitations such as slow generation speed and large cumulative error. Furthermore, the few non-autoregressive counterparts suffer from deficiencies in generation quality due to the lack of sufficient multimodal interaction modeling. Therefore, we propose a non-autoregressive framework based on Diffusion model for Video Captioning (DiffVC) to address these issues. Its parallel decoding can effectively solve the problems of generation speed and cumulative error. At the same time, our proposed discriminative conditional Diffusion Model can generate higher-quality textual descriptions. Specifically, we first encode the video into a visual representation. During training, Gaussian noise is added to the textual representation of the ground-truth caption. Then, a new textual representation is generated via the discriminative denoiser with the visual representation as a conditional constraint. Finally, we input the new textual representation into a non-autoregressive language model to generate captions. During inference, we directly sample noise from the Gaussian distribution for generation. Experiments on MSVD, MSR-VTT, and VATEX show that our method can outperform previous non-autoregressive methods and achieve comparable performance to autoregressive methods, e.g., it achieved a maximum improvement of 9.9 on the CIDEr and improvement of 2.6 on the B@4, while having faster generation speed. The source code will be available soon.
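[80] OV-Stitcher: A Global Context-Aware Framework for Training-Free Open-Vocabulary Semantic Segmentation cs.CV | cs.AI | cs.LG
Seungjae Moon, Seunghyun Oh, Youngmin Ro
TL;DR: This paper proposes OV-Stitcher, a training-free open-vocabulary semantic segmentation framework. It stitches features from sliding-window sub-images directly within the final encoder block to reconstruct a global attention representation, addressing the fragmented features and limited contextual reasoning that result when existing methods process sub-images independently.
Details
Motivation: Existing training-free open-vocabulary semantic segmentation methods usually adopt a sliding-window strategy that processes cropped sub-images independently. This prevents global attention over the full image and leads to fragmented feature representations and weak contextual reasoning. The paper aims to remove this limitation.
Result: Extensive evaluation on eight benchmarks shows that OV-Stitcher raises mean Intersection over Union (mIoU) from 48.7 to 50.7 compared with prior training-free baselines, offering a scalable and effective solution for open-vocabulary segmentation.
Insight: The core contribution is a training-free mechanism that stitches fragmented sub-image features inside the final encoder block to reconstruct global attention. It exploits the internal structure of a pretrained model to improve global context modeling at no additional training cost, and the stitching idea may carry over to other vision tasks with high-resolution inputs.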
Abstract: Training-free open-vocabulary semantic segmentation (TF-OVSS) has recently attracted attention for its ability to perform dense prediction by leveraging the pretrained knowledge of large vision and vision-language models, without requiring additional training. However, due to the limited input resolution of these pretrained encoders, existing TF-OVSS methods commonly adopt a sliding-window strategy that processes cropped sub-images independently. While effective for managing high-resolution inputs, this approach prevents global attention over the full image, leading to fragmented feature representations and limited contextual reasoning. We propose OV-Stitcher, a training-free framework that addresses this limitation by stitching fragmented sub-image features directly within the final encoder block. By reconstructing attention representations from fragmented sub-image features, OV-Stitcher enables global attention within the final encoder block, producing coherent context aggregation and spatially consistent, semantically aligned segmentation maps. Extensive evaluations across eight benchmarks demonstrate that OV-Stitcher establishes a scalable and effective solution for open-vocabulary segmentation, achieving a notable improvement in mean Intersection over Union (mIoU) from 48.7 to 50.7 compared with prior training-free baselines.
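[81] Small Vision-Language Models are Smart Compressors for Long Video Understanding cs.CV | cs.AI | cs.CL | cs.LG
Junjie Fei, Jun Chen, Zechun Liu, Yunyang Xiong, Chong Zhou
TL;DR: This paper proposes Tempo, a framework that uses a small vision-language model as a local temporal compressor and an adaptive token allocation mechanism to perform efficient, query-aware compression of long videos, addressing the context limits and token-budget saturation that multimodal large language models face on long videos.
Details
Motivation: When existing methods process long videos, dense visual streams saturate the token budget and worsen the lost-in-the-middle phenomenon, while heuristics such as sparse sampling or uniform pooling blindly sacrifice fidelity by discarding decisive moments or wasting bandwidth on irrelevant backgrounds.
Result: On LVBench (4101 s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT-4o and Gemini 1.5 Pro; scaling to 2048 frames reaches 53.7, achieving state-of-the-art performance.
Insight: Contributions include using a small vision-language model for early cross-modal distillation to produce compact, intent-aligned representations, and introducing a training-free adaptive token allocation mechanism that acts as a dynamic router, allocating bandwidth by query relevance and compressing redundancy. The central claim is that real long-video understanding depends on intent-driven efficiency rather than greedily padded context windows.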
Abstract: Adapting Multimodal Large Language Models (MLLMs) for hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, like sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework compressing long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process to generate compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM’s zero-shot relevance prior and semantic front-loading, ATA acts as a training-free $O(1)$ dynamic router. It allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors to maintain the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance with aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT-4o and Gemini 1.5 Pro. Scaling to 2048 frames reaches 53.7. Crucially, Tempo compresses hour-long videos substantially below theoretical limits, proving true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.
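[82] Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator cs.CV | cs.AI
Luozheng Qin, Jia Gong, Qian Qiao, Tianjiao Li, Li Xu
TL;DR: Uni-ViGU is a framework that unifies video generation and understanding by extending a video generator as the foundation. It handles video and text with a unified flow matching method, introduces a modality-driven MoE-based architecture, and uses a bidirectional training mechanism to reuse generative knowledge for understanding tasks.
Details
Motivation: In unified multimodal models, visual generation costs far more computation than understanding, particularly for video. The paper addresses this imbalance by inverting the conventional paradigm and building the unified model around a generator.
Result: Experiments show that Uni-ViGU achieves competitive performance on both video generation and understanding, supporting generation-centric architectures as a scalable path toward unified multimodal intelligence.
Insight: Contributions include a unified framework built on a video generator, a unified flow method combining continuous and discrete flow matching, a lightweight MoE architecture that adds text generation capability, and a bidirectional training mechanism (Knowledge Recall followed by Capability Refinement) that reuses generative knowledge for understanding.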
Abstract: Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a bidirectional training mechanism with two stages: Knowledge Recall reconstructs input prompts to leverage learned text-video correspondences, while Capability Refinement fine-tunes on detailed captions to establish discriminative shared representations. Experiments demonstrate that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence. Project Page and Code: https://fr0zencrane.github.io/uni-vigu-page/.
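[83] PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction cs.CV
Zhi-Yi Lin, Thomas Markhorst, Jouh Yeong Chew, Xucong Zhang
TL;DR: PolySLGen is an online multimodal speaking-listening reaction generation framework designed for multi-party interaction. Given the conversation history and the motion of all participants, it generates a future speaking or listening reaction for a target participant, including speech, body motion, and a speaking state score. A pose fusion module and a social cue encoder aggregate group motion and social signals to model group interaction dynamics.
Details
Motivation: Existing methods are limited to a single modality or to speaking-only reactions, and mostly target dyadic interaction, so they cannot handle the nonverbal cues and complex dynamics of multi-party interaction in real social settings. PolySLGen addresses multimodal reaction generation in multi-party interaction to enable more natural group interaction between humans and embodied AI.
Result: In quantitative and qualitative evaluations, PolySLGen outperforms several adapted and state-of-the-art baselines in motion quality, motion-speech alignment, speaking state prediction, and human-perceived realism, producing contextually appropriate and temporally coherent multimodal reactions.
Insight: Contributions include an online multimodal reaction generation framework for multi-party interaction, and a pose fusion module and social cue encoder that jointly aggregate group motion and social signals to better capture group dynamics. This offers a new approach to modeling multimodal behavior generation in complex social interaction.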
Abstract: Human-like multimodal reaction generation is essential for natural group interactions between humans and embodied AI. However, existing approaches are limited to single-modality or speaking-only responses in dyadic interactions, making them unsuitable for realistic social scenarios. Many also overlook nonverbal cues and complex dynamics of polyadic interactions, both critical for engagement and conversational coherence. In this work, we present PolySLGen, an online framework for Polyadic multimodal Speaking and Listening reaction Generation. Given past conversation and motion from all participants, PolySLGen generates a future speaking or listening reaction for a target participant, including speech, body motion, and speaking state score. To model group interactions effectively, we propose a pose fusion module and a social cue encoder that jointly aggregate motion and social signals from the group. Extensive experiments, along with quantitative and qualitative evaluations, show that PolySLGen produces contextually appropriate and temporally coherent multi-modal reactions, outperforming several adapted and state-of-the-art baselines in motion quality, motion-speech alignment, speaking state prediction, and human-perceived realism.
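[84] T-Gated Adapter: A Lightweight Temporal Adapter for Vision-Language Medical Segmentation cs.CV
Pranjal Khadka
TL;DR: This paper proposes T-Gated Adapter, a lightweight temporal adapter that addresses the noise and anatomical discontinuity that arise when vision-language model (VLM) based medical image segmentation processes the 2D slices of a 3D scan independently. By injecting adjacent-slice context into the visual token representations and training on the FLARE22 dataset, the method substantially improves segmentation accuracy and generalizes well in cross-dataset and cross-modality evaluation.
Details
Motivation: Conventional fully supervised 3D medical image segmentation requires large amounts of expensive voxel-level annotation. Vision-language models (VLMs) offer a powerful alternative, but when applied independently to the 2D slices of a 3D scan they produce noisy, anatomically implausible segmentations that break the inherent structural continuity.
Result: Trained on FLARE22 (30 labeled volumes), the method achieves a mean Dice of 0.704 across 13 abdominal organs, a gain of +0.206 over the baseline VLM without temporal context. Zero-shot evaluation on BTCV and AMOS22 yields consistent gains of +0.210 and +0.230, and the average cross-domain performance drop falls from 38.0% to 24.9%. In cross-modality evaluation on AMOS22 MRI, with no MRI supervision for either model, the method reaches a mean Dice of 0.366, outperforming a fully supervised 3D baseline trained only on CT (DynUNet, 0.224).
Insight: The contribution is a lightweight temporal adapter with three components, a temporal transformer (token-level cross-slice context), a spatial context block (refining within-slice representations), and an adaptive gate (balancing temporal and single-slice features), which together inject 3D spatial continuity into a 2D VLM. Beyond the gains in accuracy and cross-domain generalization, the results suggest that CLIP's visual semantic representations generalize across imaging modalities better than convolutional features.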
Abstract: Medical image segmentation traditionally relies on fully supervised 3D architectures that demand a large amount of dense, voxel-level annotations from clinical experts which is a prohibitively expensive process. Vision Language Models (VLMs) offer a powerful alternative by leveraging broad visual semantic representations learned from billions of images. However, when applied independently to 2D slices of a 3D scan, these models often produce noisy and anatomically implausible segmentations that violate the inherent continuity of anatomical structures. We propose a temporal adapter that addresses this by injecting adjacent-slice context directly into the model’s visual token representations. The adapter comprises a temporal transformer attending across a fixed context window at the token level, a spatial context block refining within-slice representations, and an adaptive gate balancing temporal and single-slice features. Training on 30 labeled volumes from the FLARE22 dataset, our method achieves a mean Dice of 0.704 across 13 abdominal organs with a gain of +0.206 over the baseline VLM trained with no temporal context. Zero-shot evaluation on BTCV and AMOS22 datasets yields consistent improvements of +0.210 and +0.230, with the average cross-domain performance drop reducing from 38.0% to 24.9%. Furthermore, in a cross-modality evaluation on AMOS22 MRI with neither model receiving any MRI supervision, our method achieves a mean Dice of 0.366, outperforming a fully supervised 3D baseline (DynUNet, 0.224) trained exclusively on CT, suggesting that CLIP’s visual semantic representations generalize more gracefully across imaging modalities than convolutional features.
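The results above are reported as Dice scores. For readers unfamiliar with the metric, here is a minimal sketch of the standard Dice coefficient, 2|A∩B| / (|A| + |B|), on flattened binary masks. The function name and the convention of returning 1.0 when both masks are empty are choices made for this example, not details taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two flattened binary masks (lists of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    if denom == 0:
        # Convention assumed here: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / denom


pred_mask = [1, 1, 0, 0, 1]
true_mask = [1, 0, 0, 1, 1]
score = dice_score(pred_mask, true_mask)  # 2 * 2 / (3 + 3)
```

A mean Dice such as the 0.704 quoted in the abstract is then typically an average of per-organ scores like this one, although the exact averaging protocol is not stated in this digest.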
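[85] On the Global Photometric Alignment for Low-Level Vision cs.CV
Mingjia Li, Tianle Du, Hainuo Wang, Qiming Hu, Xiaojie Guo
TL;DR: This paper proposes Photometric Alignment Loss (PAL), a loss function that addresses global photometric inconsistency in the paired training data used for low-level vision. That inconsistency makes standard reconstruction losses focus on conflicting photometric targets during optimization, crowding out the gradient for content restoration. PAL removes the nuisance photometric discrepancy through closed-form affine color alignment while keeping the supervision that matters for restoration.
Details
Motivation: In supervised low-level vision, paired training data exhibit per-pair inconsistencies in the global brightness, color, or white-balance mapping, arising from task-intrinsic photometric transfer or unintended acquisition shifts. This causes an optimization pathology: standard reconstruction losses assign a disproportionate share of the gradient budget to conflicting photometric targets, which harms content restoration.
Result: Across 6 tasks, 16 datasets, and 16 architectures, PAL consistently improves evaluation metrics and generalization.
Insight: The paper proves that, under a least-squares decomposition, the photometric and structural components of the prediction-target residual are orthogonal and that the dense photometric component dominates the gradient energy. This motivates PAL, a flexible supervision objective with negligible overhead that removes photometric discrepancy through closed-form affine alignment while preserving the useful supervision signal.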
Abstract: Supervised low-level vision models rely on pixel-wise losses against paired references, yet paired training sets exhibit per-pair photometric inconsistency, say, different image pairs demand different global brightness, color, or white-balance mappings. This inconsistency enters through task-intrinsic photometric transfer (e.g., low-light enhancement) or unintended acquisition shifts (e.g., de-raining), and in either case causes an optimization pathology. Standard reconstruction losses allocate disproportionate gradient budget to conflicting per-pair photometric targets, crowding out content restoration. In this paper, we investigate this issue and prove that, under least-squares decomposition, the photometric and structural components of the prediction-target residual are orthogonal, and that the spatially dense photometric component dominates the gradient energy. Motivated by this analysis, we propose Photometric Alignment Loss (PAL). This flexible supervision objective discounts nuisance photometric discrepancy via closed-form affine color alignment while preserving restoration-relevant supervision, requiring only covariance statistics and tiny matrix inversion with negligible overhead. Across 6 tasks, 16 datasets, and 16 architectures, PAL consistently improves metrics and generalization. The implementation is in the appendix.
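The idea of discounting a global photometric mismatch before computing the loss can be sketched as follows. This is a simplified illustration and not the paper's PAL: it fits an independent scalar gain and bias for one channel by least squares, whereas the abstract describes an affine color alignment that uses covariance statistics and a small matrix inversion. The function names and the `eps` stabilizer are assumptions.

```python
def affine_align(pred, target, eps=1e-8):
    """Least-squares gain and bias mapping pred onto target (one channel).

    Minimizes sum((gain * p + bias - t) ** 2); closed form via covariance.
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    cov_pt = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    gain = cov_pt / (var_p + eps)
    bias = mean_t - gain * mean_p
    return gain, bias


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def aligned_mse(pred, target):
    """MSE after removing the best global affine photometric mismatch."""
    gain, bias = affine_align(pred, target)
    return mse([gain * p + bias for p in pred], target)


pred = [0.1, 0.2, 0.4, 0.8]
target = [1.5 * p + 0.05 for p in pred]  # same content, different exposure
plain = mse(pred, target)
aligned = aligned_mse(pred, target)
```

In this toy case the prediction differs from the target only by a global brightness change, so the plain MSE is large while the aligned MSE is essentially zero; any residual that survives alignment reflects structural error rather than exposure or color shift.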
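[86] MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning cs.CV | cs.AI
Zheng Jiang, Heng Guo, Chengyu Fang, Changchen Xiao, Xinyang Hu
TL;DR: MedVR is a reinforcement learning framework for annotation-free medical visual reasoning. Using entropy-guided visual regrounding and consensus-based credit assignment, it lets medical vision-language models reason directly from visual evidence without human annotation of intermediate steps, and achieves state-of-the-art performance on several public medical VQA benchmarks.
Details
Motivation: Medical vision-language models that rely on text-only paradigms have limited reasoning ability, weak fine-grained visual analysis, and a risk of visual hallucination in safety-critical applications. MedVR targets these problems.
Result: The method achieves state-of-the-art performance on multiple public medical VQA benchmarks, significantly outperforming existing models.
Insight: The core contribution is annotation-free visual reasoning through reinforcement learning, using an entropy-guided visual regrounding mechanism that directs exploration with model uncertainty and a consensus-based credit assignment mechanism that distills pseudo-supervision from rollout agreement. This improves robustness and interpretability and may help accelerate clinical deployment of medical AI.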
Abstract: Medical Vision-Language Models (VLMs) hold immense promise for complex clinical tasks, but their reasoning capabilities are often constrained by text-only paradigms that fail to ground inferences in visual evidence. This limitation not only curtails performance on tasks requiring fine-grained visual analysis but also introduces risks of visual hallucination in safety-critical applications. Thus, we introduce MedVR, a novel reinforcement learning framework that enables annotation-free visual reasoning for medical VLMs. Its core innovation lies in two synergistic mechanisms: Entropy-guided Visual Regrounding (EVR) uses model uncertainty to direct exploration, while Consensus-based Credit Assignment (CCA) distills pseudo-supervision from rollout agreement. Without any human annotations for intermediate steps, MedVR achieves state-of-the-art performance on diverse public medical VQA benchmarks, significantly outperforming existing models. By learning to reason directly with visual evidence, MedVR promotes the robustness and transparency essential for accelerating the clinical deployment of medical AI.
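[87] OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering cs.CV
Yiduo Jia, Muzhi Zhu, Hao Zhong, Mingyu Liu, Yuling Xi
TL;DR: This paper proposes OmniJigsaw, a generic self-supervised framework built on a temporal reordering proxy task and intended to strengthen the video-audio understanding and collaborative reasoning of omni-modal models. It orchestrates visual and auditory signals through three strategies (joint modality integration, sample-level modality selection, and clip-level modality masking) and designs a two-stage coarse-to-fine data filtering pipeline for adapting to massive unannotated data.
Details
Motivation: The goal is to extend the reinforcement learning post-training paradigm to omni-modal models so as to improve video-audio understanding and collaborative reasoning at the same time.
Result: Extensive evaluation on 15 benchmarks shows substantial gains on video, audio, and collaborative reasoning tasks, supporting OmniJigsaw as a scalable paradigm for self-supervised omni-modal learning.
Insight: Contributions include forcing cross-modal integration through a temporal reordering proxy task, and identifying and mitigating a "bi-modal shortcut phenomenon" in joint modality integration, where fine-grained clip-level modality masking outperforms sample-level selection.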
Abstract: To extend the reinforcement learning post-training paradigm to omni-modal models for concurrently bolstering video-audio understanding and collaborative reasoning, we propose OmniJigsaw, a generic self-supervised framework built upon a temporal reordering proxy task. Centered on the chronological reconstruction of shuffled audio-visual clips, this paradigm strategically orchestrates visual and auditory signals to compel cross-modal integration through three distinct strategies: Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking. Recognizing that the efficacy of such proxy tasks is fundamentally tied to puzzle quality, we design a two-stage coarse-to-fine data filtering pipeline, which facilitates the efficient adaptation of OmniJigsaw to massive unannotated omni-modal data. Our analysis reveals a "bi-modal shortcut phenomenon" in joint modality integration and demonstrates that fine-grained clip-level modality masking mitigates this issue while outperforming sample-level modality selection. Extensive evaluations on 15 benchmarks show substantial gains in video, audio, and collaborative reasoning, validating OmniJigsaw as a scalable paradigm for self-supervised omni-modal learning.
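A temporal reordering proxy task of the kind described above can be sketched in a few lines. This is an illustrative construction only: the clip representation (plain strings), the function names, and the answer format are assumptions, and the paper's reward design, modality masking, and data filtering are not modeled here.

```python
import random


def make_reorder_puzzle(clips, seed=0):
    """Shuffle clips and return (shuffled_clips, answer).

    answer[k] is the position in `shuffled_clips` of the k-th clip in the
    original chronological order, i.e. the label a model would predict.
    """
    rng = random.Random(seed)
    order = list(range(len(clips)))
    rng.shuffle(order)                       # order[j] = original index at slot j
    shuffled = [clips[i] for i in order]
    answer = [order.index(k) for k in range(len(clips))]
    return shuffled, answer


def restore(shuffled, answer):
    """Rebuild the chronological sequence from a predicted answer."""
    return [shuffled[pos] for pos in answer]


clips = ["clip_0", "clip_1", "clip_2", "clip_3"]
shuffled, answer = make_reorder_puzzle(clips, seed=7)
recovered = restore(shuffled, answer)
```

Because the label is derived from the shuffle itself, puzzles like this need no human annotation, which is what makes the task usable on unannotated audio-visual data.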
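[88] SciFigDetect: A Benchmark for AI-Generated Scientific Figure Detection cs.CV
You Hu, Chenzhuo Zhao, Changfa Mo, Haotian Liu, Xiaobai Li
TL;DR: This paper presents SciFigDetect, the first benchmark dataset for detecting AI-generated scientific figures. Existing detectors mainly target open-domain natural images, whereas scientific figures are structured, text-dense, and tightly aligned with scholarly semantics. The authors build a dataset with multiple figure categories, multiple generation sources, and aligned real-synthetic pairs, and evaluate existing detectors under zero-shot, cross-generator, and degraded-image settings.
Details
Motivation: Existing AI-generated image detection methods fall short on scientific figures, a distinct and difficult detection target because such figures are structured, text-dense, and aligned with scholarly semantics, while existing benchmarks and methods target open-domain imagery almost exclusively.
Result: Evaluation of representative detectors on the benchmark shows that current methods fail dramatically in zero-shot transfer, overfit strongly to specific generators, and remain fragile under common post-processing corruptions, revealing a substantial gap between existing AIGI detection capabilities and the emerging distribution of high-quality scientific figures.
Insight: The contribution is the first detection benchmark dedicated to AI-generated scientific figures, built with an agent-based data pipeline (retrieval, multimodal understanding, structured prompt construction, synthesis, and filtering) that produces high-quality aligned pairs. The study highlights the importance of domain-specific detection tasks and lays groundwork for more robust and generalizable scientific-figure forensics.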
Abstract: Modern multimodal generators can now produce scientific figures at near-publishable quality, creating a new challenge for visual forensics and research integrity. Unlike conventional AI-generated natural images, scientific figures are structured, text-dense, and tightly aligned with scholarly semantics, making them a distinct and difficult detection target. However, existing AI-generated image detection benchmarks and methods are almost entirely developed for open-domain imagery, leaving this setting largely unexplored. We present the first benchmark for AI-generated scientific figure detection. To construct it, we develop an agent-based data pipeline that retrieves licensed source papers, performs multimodal understanding of paper text and figures, builds structured prompts, synthesizes candidate figures, and filters them through a review-driven refinement loop. The resulting benchmark covers multiple figure categories, multiple generation sources and aligned real–synthetic pairs. We benchmark representative detectors under zero-shot, cross-generator, and degraded-image settings. Results show that current methods fail dramatically in zero-shot transfer, exhibit strong generator-specific overfitting, and remain fragile under common post-processing corruptions. These findings reveal a substantial gap between existing AIGI detection capabilities and the emerging distribution of high-quality scientific figures. We hope this benchmark can serve as a foundation for future research on robust and generalizable scientific-figure forensics. The dataset is available at https://github.com/Joyce-yoyo/SciFigDetect.
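[89] Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment cs.CV
Blessing Agyei Kyem, Joshua Kofi Asamoah, Anthony Dontoh, Armstrong Aboah
TL;DR: This paper presents PaveGPT, a vision-language foundation model built through domain-specific instruction tuning for comprehensive automated pavement condition assessment. The authors create PaveInstruct, a dataset of about 278,000 image-instruction-response pairs covering 32 task types, and show that instruction tuning substantially improves performance on perception, understanding, and reasoning tasks, producing outputs that comply with the ASTM D6433 engineering standard and allowing one model to replace several specialized systems.
Details
Motivation: General-purpose vision-language models perform well in everyday domains but poorly in specialized technical fields such as pavement assessment, which require precise terminology, structured reasoning, and adherence to engineering standards. The study asks whether domain-specific instruction tuning can make vision-language models capable of comprehensive pavement condition assessment.
Result: On perception, understanding, and reasoning tasks, PaveGPT improves markedly over state-of-the-art vision-language models, with gains exceeding 20% in spatial grounding, reasoning, and generation tasks, and produces ASTM D6433-compliant outputs.
Insight: The contribution is a large-scale domain-specific instruction dataset (PaveInstruct) built by unifying nine heterogeneous pavement datasets, together with evidence that instruction tuning can turn a general-purpose vision-language model into a domain expert. This suggests a practical path to instruction-driven AI systems for infrastructure domains such as bridge inspection and railway maintenance, and the approach may extend to other technical fields that require expert knowledge and standards compliance.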
Abstract: General-purpose vision-language models demonstrate strong performance in everyday domains but struggle with specialized technical fields requiring precise terminology, structured reasoning, and adherence to engineering standards. This work addresses whether domain-specific instruction tuning can enable comprehensive pavement condition assessment through vision-language models. PaveInstruct, a dataset containing 278,889 image-instruction-response pairs spanning 32 task types, was created by unifying annotations from nine heterogeneous pavement datasets. PaveGPT, a pavement foundation model trained on this dataset, was evaluated against state-of-the-art vision-language models across perception, understanding, and reasoning tasks. Instruction tuning transformed model capabilities, achieving improvements exceeding 20% in spatial grounding, reasoning, and generation tasks while producing ASTM D6433-compliant outputs. These results enable transportation agencies to deploy unified conversational assessment tools that replace multiple specialized systems, simplifying workflows and reducing technical expertise requirements. The approach establishes a pathway for developing instruction-driven AI systems across infrastructure domains including bridge inspection, railway maintenance, and building condition assessment.
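[90] Can Vision Language Models Judge Action Quality? An Empirical Evaluation cs.CV | cs.AI | cs.CL
Miguel Monte e Freitas, Rui Henriques, Ricardo Rei, Pedro Henrique Martins
TL;DR: This paper comprehensively evaluates vision-language models on action quality assessment and finds that several advanced models, including Gemini 3.1 Pro, perform only marginally above random chance, and that none of the improvement strategies tried yields consistent gains. The analysis reveals systematic biases and points to a fundamental difficulty with fine-grained movement quality assessment.
Details
Motivation: Action quality assessment has broad applications in areas such as physical therapy and sports coaching. Vision-language models look promising for this task, but their actual performance has not been adequately evaluated. The paper fills this gap with a comprehensive evaluation of state-of-the-art VLMs.
Result: Across several activity domains (such as fitness, figure skating, and diving), baseline models perform only marginally above random chance. Strategies such as adding skeleton information, grounding instructions, reasoning structures, and in-context learning bring isolated gains, but none is consistently effective.
Insight: The paper identifies two systematic biases of VLMs on AQA: a tendency to predict correct execution regardless of visual evidence, and sensitivity to superficial linguistic framing. It also shows that the models' limitations go beyond these biases, indicating that fine-grained movement quality assessment is fundamentally hard for VLMs, and it provides a failure-mode analysis and baseline for future work.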
Abstract: Action Quality Assessment (AQA) has broad applications in physical therapy, sports coaching, and competitive judging. Although Vision Language Models (VLMs) hold considerable promise for AQA, their actual performance in this domain remains largely uncharacterised. We present a comprehensive evaluation of state-of-the-art VLMs across activity domains (e.g. fitness, figure skating, diving), tasks, representations, and prompting strategies. Baseline results reveal that Gemini 3.1 Pro, Qwen3-VL and InternVL3.5 models perform only marginally above random chance, and although strategies such as incorporation of skeleton information, grounding instructions, reasoning structures and in-context learning lead to isolated gains, none is consistently effective. Analysis of prediction distributions uncovers two systematic biases: a tendency to predict correct execution regardless of visual evidence, and a sensitivity to superficial linguistic framing. Reformulating tasks contrastively to mitigate these biases yields minimal improvement, suggesting that the models’ limitations go beyond these biases, pointing to a fundamental difficulty with fine-grained movement quality assessment. Our findings establish a rigorous baseline for future VLM-based AQA research and provide an actionable outline for failure modes requiring mitigation prior to reliable real-world deployment.
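[91] EditCaption: Human-Aligned Instruction Synthesis for Image Editing via Supervised Fine-Tuning and Direct Preference Optimization cs.CV | cs.AI
Xiangyuan Wang, Honghao Cai, Yunhao Bai, Tianze Zhou, Haohua Chen
TL;DR: This paper proposes EditCaption, a scalable two-stage post-training pipeline for synthesizing image editing instructions. It targets systematic problems of existing vision-language models when generating editing instructions: orientation inconsistency, viewpoint ambiguity, and insufficient fine-grained attribute description. Combining supervised fine-tuning with direct preference optimization substantially improves the accuracy and usability of the instructions.
Details
Motivation: High-quality image editing training data (source-target image pairs with precise editing instructions) is a key bottleneck for scaling instruction-guided image editing models. Existing vision-language models exhibit systematic failure modes in image-pair settings, so the instructions they generate have a high error rate and cannot be used for downstream training.
Result: On the Eval-400, ByteMorph-Bench, and HQ-Edit benchmarks, fine-tuned Qwen3-VL models outperform open-source baselines. The 235B model scores 4.712 on Eval-400 (vs. Gemini-3-Pro 4.706, GPT-4.1 4.220, Kimi-K2.5 4.111) and 4.588 on ByteMorph-Bench (vs. Gemini-3-Pro 4.522, GPT-4.1 3.412). Human evaluation shows the critical error rate falling from 47.75% to 23% and correctness rising from 41.75% to 66%.
Insight: The contribution is a two-stage pipeline combining automatic annotation, EditScore-based filtering, human refinement, and direct preference optimization, aligned specifically for spatial, directional, and attribute-level accuracy in editing instruction synthesis. It offers a practical path to scalable, human-aligned instruction synthesis for image editing data.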
Abstract: High-quality training triplets (source-target image pairs with precise editing instructions) are a critical bottleneck for scaling instruction-guided image editing models. Vision-language models (VLMs) are widely used for automated instruction synthesis, but we identify three systematic failure modes in image-pair settings: orientation inconsistency (e.g., left/right confusion), viewpoint ambiguity, and insufficient fine-grained attribute description. Human evaluation shows that over 47% of instructions from strong baseline VLMs contain critical errors unusable for downstream training. We propose EditCaption, a scalable two-stage post-training pipeline for VLM-based instruction synthesis. Stage 1 builds a 100K supervised fine-tuning (SFT) dataset by combining GLM automatic annotation, EditScore-based filtering, and human refinement for spatial, directional, and attribute-level accuracy. Stage 2 collects 10K human preference pairs targeting the three failure modes and applies direct preference optimization (DPO) for alignment beyond SFT alone. On Eval-400, ByteMorph-Bench, and HQ-Edit, fine-tuned Qwen3-VL models outperform open-source baselines; the 235B model reaches 4.712 on Eval-400 (vs. Gemini-3-Pro 4.706, GPT-4.1 4.220, Kimi-K2.5 4.111) and 4.588 on ByteMorph-Bench (vs. Gemini-3-Pro 4.522, GPT-4.1 3.412). Human evaluation shows critical errors falling from 47.75% to 23% and correctness rising from 41.75% to 66%. The work offers a practical path to scalable, human-aligned instruction synthesis for image editing data.
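Stage 2 relies on direct preference optimization. As background, the widely used DPO objective for a single preference pair can be sketched as below. The inputs are summed log-probabilities of the preferred and rejected instructions under the trained policy and a frozen reference model; the function name and the `beta=0.1` default are assumptions, and nothing here reflects the authors' actual training code or hyperparameters.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * margin)).

    margin = (log pi(chosen) - log ref(chosen))
           - (log pi(rejected) - log ref(rejected))
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) == softplus(-x), written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: no preference learned yet.
neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has shifted probability toward the human-preferred instruction.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

The loss starts at log 2 when the policy matches the reference and decreases as the policy raises the likelihood of the preferred instruction relative to the rejected one, which is how the preference pairs on orientation, viewpoint, and attribute errors would shape the model.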
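[92] $\oslash$ Source Models Leak What They Shouldn’t $\nrightarrow$: Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization cs.CV
Arnav Devalapally, Poornima Jain, Kartik Srinivas, Vineeth N. Balasubramanian
TL;DR: This paper proposes SCADA-UL, a new machine unlearning (MU) setting that addresses the risk of models in source-free domain adaptation (SFDA) inadvertently leaking knowledge of source-exclusive classes into the target domain. Using adversarial optimization and a rescaled labeling strategy, the authors make the model forget these classes during domain adaptation and validate the method on benchmark datasets.
Details
Motivation: As vision models are increasingly adapted across domains such as satellite imagery and medical scans, they may inadvertently retain and leak sensitive source-domain-specific information in the target domain, which is a privacy risk. In source-free domain adaptation (SFDA) in particular, the source data itself is protected, yet the source model can still encode its influence during adaptation and leak knowledge, so machine unlearning is needed to protect privacy.
Result: In the proposed SCADA-UL setting, the method consistently outperforms baselines on benchmark datasets and reaches unlearning performance close to retraining. Experiments also cover a continual unlearning variant and a variant with unknown forget classes, and show the method to be effective in both.
Insight: The paper is the first to identify and formalize the problem of unlearning source-exclusive classes in SFDA (SCADA-UL), and it proposes an unlearning method based on adversarially generated forget samples and a rescaled labeling strategy. By handling data distribution shift through adversarial optimization, the method offers a new approach to applying machine unlearning in domain adaptation.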
Abstract: The increasing adaptation of vision models across domains, such as satellite imagery and medical scans, has raised an emerging privacy risk: models may inadvertently retain and leak sensitive source-domain specific information in the target domain. This creates a compelling use case for machine unlearning to protect the privacy of sensitive source-domain data. Among adaptation techniques, source-free domain adaptation (SFDA) calls for an urgent need for machine unlearning (MU), where the source data itself is protected, yet the source model exposed during adaptation encodes its influence. Our experiments reveal that existing SFDA methods exhibit strong zero-shot performance on source-exclusive classes in the target domain, indicating they inadvertently leak knowledge of these classes into the target domain, even when they are not represented in the target data. We identify and address this risk by proposing an MU setting called SCADA-UL: Unlearning Source-exclusive ClAsses in Domain Adaptation. Existing MU methods do not address this setting as they are not designed to handle data distribution shifts. We propose a new unlearning method, where an adversarially generated forget class sample is unlearned by the model during the domain adaptation process using a novel rescaled labeling strategy and adversarial optimization. We also extend our study to two variants: a continual version of this problem setting and to one where the specific source classes to be forgotten may be unknown. Alongside theoretical interpretations, our comprehensive empirical results show that our method consistently outperforms baselines in the proposed setting while achieving retraining-level unlearning performance on benchmark datasets. Our code is available at https://github.com/D-Arnav/SCADA
[93] DBMF: A Dual-Branch Multimodal Framework for Out-of-Distribution Detection cs.CV | cs.AIPDF
Jiangbei Yue, Sharib Ali
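TL;DR: This paper proposes DBMF, a dual-branch multimodal framework for out-of-distribution (OOD) detection in medical images. Using a text-image branch and a vision branch, it exploits multimodal information to identify samples outside the training distribution, such as unseen disease cases. Experiments on public endoscopic image datasets show that the framework is robust across backbones and substantially improves OOD detection performance.
Details
Motivation: Real-world clinical environments are complex and dynamic and demand reliable deep learning systems. OOD detection is critical for improving reliability and generalization when a model meets data that deviate from the training distribution, such as unseen diseases. Existing methods typically rely on a single visual modality or on image-text matching alone, and so fail to fully exploit multimodal information.
Result: Comprehensive experiments on public endoscopic image datasets show that the framework is robust across backbones and improves state-of-the-art (SOTA) OOD detection performance by up to 24.84%.
Insight: The novelty is a dual-branch multimodal framework whose complementary text-image and vision-only branches make fuller use of multimodal representations for OOD detection. Viewed objectively, this dual-branch integration combines semantic alignment with visual feature information and may offer a new architectural direction for multimodal OOD detection.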
Abstract: The complex and dynamic real-world clinical environment demands reliable deep learning (DL) systems. Out-of-distribution (OOD) detection plays a critical role in enhancing the reliability and generalizability of DL models when encountering data that deviate from the training distribution, such as unseen disease cases. However, existing OOD detection methods typically rely either on a single visual modality or solely on image-text matching, failing to fully leverage multimodal information. To overcome this challenge, we propose a novel dual-branch multimodal framework by introducing a text-image branch and a vision branch. Our framework fully exploits multimodal representations to identify OOD samples through these two complementary branches. After training, we compute scores from the text-image branch ($S_t$) and vision branch ($S_v$), and integrate them to obtain the final OOD score $S$ that is compared with a threshold for OOD detection. Comprehensive experiments on publicly available endoscopic image datasets demonstrate that our proposed framework is robust across diverse backbones and improves state-of-the-art performance in OOD detection by up to 24.84%.
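The DBMF abstract says the branch scores $S_t$ and $S_v$ are integrated into a final score $S$ that is compared with a threshold, but it does not state the integration rule. The sketch below is only an illustration under two assumptions that are not from the paper: the fusion is a convex combination with a hypothetical weight `alpha`, and higher scores mean "more out-of-distribution".

```python
def fuse_ood_score(s_t: float, s_v: float, alpha: float = 0.5) -> float:
    """Convex combination of the two branch scores (assumed fusion rule)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * s_t + (1.0 - alpha) * s_v


def is_ood(s_t: float, s_v: float, threshold: float, alpha: float = 0.5) -> bool:
    """Flag a sample as OOD when the fused score exceeds the threshold.

    Assumes higher scores indicate samples further from the training distribution.
    """
    return fuse_ood_score(s_t, s_v, alpha) > threshold
```

In practice the threshold would be chosen on held-out in-distribution data; the names `alpha` and `threshold` are placeholders for whatever the paper actually uses.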
[94] Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models cs.CVPDF
Jing Gu, Niccolò Cavagnero, Gijs Dubbelman
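TL;DR: This paper proposes Orion-Lite, which uses knowledge distillation to transfer the reasoning ability of large language models (LLMs) into an efficient vision-only driving model. It targets the high latency and energy cost of integrating LLMs into vision-language-action (VLA) models, and validates the approach in complex interactive scenarios under closed-loop evaluation.
Details
Motivation: The general world knowledge of LLMs can help autonomous driving systems handle rare and complex scenarios, but their large parameter counts make latency-sensitive, energy-efficient deployment difficult. Distillation aims to retain the reasoning capability while reducing computational cost.
Result: On the rigorous Bench2Drive benchmark, Orion-Lite achieves a Driving Score of 80.6, surpassing its massive VLA teacher ORION and setting a new state of the art (SOTA).
Insight: The novelty lies in combining latent feature distillation with ground-truth trajectory supervision and performing distillation in complex interactive scenarios under closed-loop evaluation. The results indicate that vision-only architectures retain significant untapped potential for high-performance reactive planning, pointing to a new direction for efficient driving model design.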
Abstract: Leveraging the general world knowledge of Large Language Models (LLMs) holds significant promise for improving the ability of autonomous driving systems to handle rare and complex scenarios. While integrating LLMs into Vision-Language-Action (VLA) models has yielded state-of-the-art performance, their massive parameter counts pose severe challenges for latency-sensitive and energy-efficient deployment. Distilling LLM knowledge into a compact driving model offers a compelling solution to retain these reasoning capabilities while maintaining a manageable computational footprint. Although previous works have demonstrated the efficacy of distillation, these efforts have primarily focused on relatively simple scenarios and open-loop evaluations. Therefore, in this work, we investigate LLM distillation in more complex, interactive scenarios under closed-loop evaluation. We demonstrate that through a combination of latent feature distillation and ground-truth trajectory supervision, an efficient vision-only student model, Orion-Lite, can even surpass the performance of its massive VLA teacher, ORION, setting a new state of the art on the rigorous Bench2Drive benchmark with a Driving Score of 80.6. Ultimately, this reveals that vision-only architectures still possess significant, untapped potential for high-performance reactive planning.
[95] Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models cs.CV | cs.CLPDF
Marcel Gröpl, Jaewoo Jung, Seungryong Kim, Marc Pollefeys, Sunghwan Hong
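TL;DR: This paper proposes entropy-gradient grounding, a training-free, model-intrinsic grounding method. It computes the entropy of the model's next-token distribution and backpropagates it to the visual token embeddings to produce an entropy-gradient relevance map, addressing the difficulty vision-language models have when answers depend on tiny visual details or on clues combined across regions. The method supports multi-evidence queries and iterative refinement, and on seven benchmarks it yields marked gains in detail-critical and high-resolution settings.
Details
Motivation: Pretrained vision-language models perform poorly on queries that depend on tiny visual details or on combining clues across multiple regions. The paper reframes grounding as test-time evidence retrieval so that the model actively identifies the regions it needs to inspect to resolve ambiguity.
Result: On seven benchmarks across four VLM architectures, the method consistently improves over existing methods, with the largest gains in detail-critical and high-resolution settings, while producing more interpretable evidence localizations.
Insight: The novelty is using the model's internal uncertainty as the supervision signal: relevance maps are derived from entropy gradients without auxiliary detectors or attention-map heuristics. An iterative zoom-and-reground procedure with a spatial-entropy stopping rule avoids over-refinement and improves the accuracy and interpretability of evidence retrieval.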
Abstract: Despite rapid progress, pretrained vision-language models still struggle when answers depend on tiny visual details or on combining clues spread across multiple regions, as in documents and compositional queries. We address this by framing grounding as test-time evidence retrieval: given a query, the model should actively identify where to look next to resolve ambiguity. To this end, we propose a training-free, model-intrinsic grounding method that uses uncertainty as supervision. Specifically, we compute the entropy of the model’s next-token distribution and backpropagate it to the visual token embeddings to obtain an entropy-gradient relevance map, without auxiliary detectors or attention-map heuristics. We then extract and rank multiple coherent regions to support multi-evidence queries, and introduce an iterative zoom-and-reground procedure with a spatial-entropy stopping rule to avoid over-refinement. Experiments on seven benchmarks across four VLM architectures demonstrate consistent improvements over existing methods, with the largest gains on detail-critical and high-resolution settings, while also producing more interpretable evidence localizations.
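The core idea of the entropy-gradient relevance map (entropy of the next-token distribution, differentiated with respect to each visual token embedding) can be illustrated with a toy sketch. This is not the authors' implementation: central finite differences stand in for backpropagation, the per-token relevance is taken as the L2 norm of the gradient (an assumption), and `toy_logits` is a hypothetical two-class "model" invented for the example.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the softmax distribution over logits."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def entropy_gradient_relevance(
    tokens: List[Vector],
    logits_fn: Callable[[List[Vector]], List[float]],
    eps: float = 1e-4,
) -> List[float]:
    """Per-token relevance: L2 norm of d(entropy)/d(token embedding).

    Central finite differences are used here in place of backpropagation.
    """
    relevance = []
    for j, tok in enumerate(tokens):
        sq_norm = 0.0
        for k in range(len(tok)):
            plus = [list(t) for t in tokens]
            minus = [list(t) for t in tokens]
            plus[j][k] += eps
            minus[j][k] -= eps
            grad = (entropy(logits_fn(plus)) - entropy(logits_fn(minus))) / (2 * eps)
            sq_norm += grad * grad
        relevance.append(math.sqrt(sq_norm))
    return relevance


def toy_logits(tokens: List[Vector]) -> List[float]:
    """Hypothetical 2-class model in which only token 0 influences the logits."""
    return [tokens[0][0], -tokens[0][0]]
```

With `toy_logits`, only the first token receives non-zero relevance, which is the behaviour a relevance map is meant to expose: tokens that cannot change the model's uncertainty score zero.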
[96] CAMotion: A High-Quality Benchmark for Camouflaged Moving Object Detection in the Wild cs.CVPDF
Siyuan Yao, Hao Sun, Ruiqi Yu, Xiwei Jiang, Wenqi Ren
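TL;DR: This paper builds CAMotion, a high-quality benchmark dataset for camouflaged moving object detection in the wild, to address the limited scale and diversity of existing video camouflaged object detection datasets. It covers many species and challenging scenarios and provides detailed sequence annotations and statistical analysis.
Details
Motivation: The limited scale and diversity of existing video camouflaged object detection (VCOD) datasets hinder in-depth analysis and broad evaluation of data-driven deep learning algorithms, so a higher-quality, more challenging benchmark is needed.
Result: Existing SOTA models are comprehensively evaluated on CAMotion, and the main challenges of the VCOD task are discussed.
Insight: The novelty is a high-quality in-the-wild dataset that covers a wide range of species and includes multiple challenging attributes (uncertain edges, occlusion, motion blur, shape complexity), with annotation details and statistical distributions presented from several perspectives to support in-depth analysis of the motion characteristics of camouflaged objects.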
Abstract: Discovering camouflaged objects is a challenging task in computer vision due to the high similarity between camouflaged objects and their surroundings. While the problem of camouflaged object detection over sequential video frames has received increasing attention, the scale and diversity of existing video camouflaged object detection (VCOD) datasets are greatly limited, which hinders the deeper analysis and broader evaluation of recent deep learning-based algorithms with data-hungry training strategies. To break this bottleneck, in this paper, we construct CAMotion, a high-quality benchmark covering a wide range of species for camouflaged moving object detection in the wild. CAMotion comprises various sequences with multiple challenging attributes such as uncertain edges, occlusion, motion blur, and shape complexity. The sequence annotation details and statistical distribution are presented from various perspectives, allowing CAMotion to provide in-depth analyses on the camouflaged object’s motion characteristics in different challenging scenarios. Additionally, we conduct a comprehensive evaluation of existing SOTA models on CAMotion, and discuss the major challenges in the VCOD task. The benchmark is available at https://www.camotion.focuslab.net.cn. We hope that CAMotion can lead to further advancements in the research community.
[97] What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP metric cs.CV | cs.CL | cs.HCPDF
Mohamed Amine Kerkouri, Marouane Tliba, Bin Wang, Aladine Chetouani, Ulas Bagci
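TL;DR: This paper proposes a semantic scanpath similarity framework that integrates vision-language models (VLMs) into eye-tracking analysis. Each fixation is encoded as a textual description, the descriptions are aggregated into a scanpath-level representation, and semantic similarity is computed with embedding-based and lexical NLP metrics, complementing traditional scanpath similarity measures based on spatial and temporal alignment.
Details
Motivation: Existing scanpath similarity metrics mainly assess spatial and temporal alignment and ignore semantic equivalence between attended image regions, so a scanpath analysis method that captures semantic content is needed.
Result: Experiments on free-viewing eye-tracking data show that semantic similarity captures variance partially independent of geometric alignment, revealing cases of high content agreement despite spatial divergence. The semantic measures are compared with classical spatial measures such as MultiMatch and DTW.
Insight: Multimodal foundation models (VLMs) enable an interpretable, content-aware extension of classical scanpath analysis, adding a complementary dimension to gaze research. The core of the method is encoding fixations under controlled visual context (patch-based and marker-based strategies), generating concise textual descriptions, and computing semantic similarity with NLP metrics.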
Abstract: Scanpath similarity metrics are central to eye-movement research, yet existing methods predominantly evaluate spatial and temporal alignment while neglecting semantic equivalence between attended image regions. We present a semantic scanpath similarity framework that integrates vision-language models (VLMs) into eye-tracking analysis. Each fixation is encoded under controlled visual context (patch-based and marker-based strategies) and transformed into concise textual descriptions, which are aggregated into scanpath-level representations. Semantic similarity is then computed using embedding-based and lexical NLP metrics and compared against established spatial measures, including MultiMatch and DTW. Experiments on free-viewing eye-tracking data demonstrate that semantic similarity captures partially independent variance from geometric alignment, revealing cases of high content agreement despite spatial divergence. We further analyze the impact of contextual encoding on description fidelity and metric stability. Our findings suggest that multimodal foundation models enable interpretable, content-aware extensions of classical scanpath analysis, providing a complementary dimension for gaze research within the ETRA community.
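To make the "lexical NLP metric" step of the scanpath framework concrete, here is a minimal sketch that compares two scanpaths by the word overlap of their fixation descriptions. The abstract does not name the lexical metric, so Jaccard similarity over a bag of words is an assumption chosen for illustration, and the descriptions are hand-written stand-ins for VLM output.

```python
import re
from typing import List, Set


def scanpath_vocabulary(fixation_descriptions: List[str]) -> Set[str]:
    """Aggregate per-fixation descriptions into one scanpath-level word set."""
    words: Set[str] = set()
    for text in fixation_descriptions:
        words.update(re.findall(r"[a-z0-9]+", text.lower()))
    return words


def lexical_scanpath_similarity(a: List[str], b: List[str]) -> float:
    """Jaccard overlap between the vocabularies of two scanpaths (assumed metric)."""
    va, vb = scanpath_vocabulary(a), scanpath_vocabulary(b)
    if not va and not vb:
        return 1.0  # two empty scanpaths are treated as identical
    return len(va & vb) / len(va | vb)
```

Because the metric ignores fixation order and position, two observers who look at the same objects in a different sequence still score highly, which is the kind of content agreement despite spatial divergence that the abstract describes.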
[98] Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data cs.CVPDF
Yuchuan Deng, Qijie Wei, Kaiheng Qian, Jiazhen Liu, Zijie Xin
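TL;DR: This paper proposes Fundus-R1, a fundus-reading multimodal large language model (MLLM) trained exclusively on public datasets, sidestepping the difficulty of obtaining high-quality clinical data. It uses a retrieval-augmented generation (RAG) based method to automatically compose reasoning traces grounded in ophthalmic knowledge, and enhances reinforcement learning with verifiable rewards (RLVR) to improve reasoning consistency.
Details
Motivation: Fundus image understanding is a knowledge-intensive vision-language task. Existing methods depend on large amounts of high-quality in-house clinical data that are not public, which limits reproducibility and restricts who can take part in the research. The paper aims to train a strong fundus-reading MLLM using only public data, over 94% of which carry only image-level labels.
Result: On three fundus-reading benchmarks (FunBench, Omni-Fundus and GMAI-Fundus), Fundus-R1 clearly outperforms multiple baselines, including its generic counterpart Qwen2.5-VL and a stronger edition post-trained without the generated reasoning traces.
Insight: The contributions are: 1) a RAG-based method that automatically composes image-specific, knowledge-aware reasoning traces, linking visual findings identified by a generic MLLM to the image labels through ophthalmic knowledge; 2) a process reward added to RLVR that encourages self-consistency of the reasoning trace in each rollout. Together they offer a feasible path to training strong domain-specific MLLMs on public data.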
Abstract: Fundus imaging such as CFP, OCT and UWF is crucial for the early detection of retinal anomalies and diseases. Fundus image understanding, due to its knowledge-intensive nature, poses a challenging vision-language task. An emerging approach to addressing the task is to post-train a generic multimodal large language model (MLLM), either by supervised finetuning (SFT) or by reinforcement learning with verifiable rewards (RLVR), on a considerable amount of in-house samples paired with high-quality clinical reports. However, these valuable samples are not publicly accessible, which not only hinders reproducibility but also practically limits research to few players. To overcome the barrier, we make a novel attempt to train a reasoning-enhanced fundus-reading MLLM, which we term Fundus-R1, using exclusively public datasets, wherein over 94% of the data are annotated with only image-level labels. Our technical contributions are two-fold. First, we propose a RAG-based method for composing image-specific, knowledge-aware reasoning traces. Such auto-generated traces link visual findings identified by a generic MLLM to the image labels in terms of ophthalmic knowledge. Second, we enhance RLVR with a process reward that encourages self-consistency of the generated reasoning trace in each rollout. Extensive experiments on three fundus-reading benchmarks, i.e., FunBench, Omni-Fundus and GMAI-Fundus, show that Fundus-R1 clearly outperforms multiple baselines, including its generic counterpart (Qwen2.5-VL) and a stronger edition post-trained without using the generated traces. This work paves the way for training powerful fundus-reading MLLMs with publicly available data.
[99] Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification cs.CV | cs.AI | cs.LGPDF
Xun Zhu, Fanbin Mo, Xi Chen, Kaili Zheng, Shaoshuai Yang
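TL;DR: This paper systematically documents the performance degradation of medical multimodal large language models (MLLMs) on image classification. Using feature probing to trace the flow of visual features through each module of the model, it identifies four failure modes behind the degradation.
Details
Motivation: Despite their advantages in pre-training data and parameter count, medical MLLMs consistently trail traditional deep learning models on medical image classification. The paper investigates the root causes of this degradation.
Result: Extensive experiments cover 14 open-source medical MLLMs on three representative medical image classification datasets. Quantitative scores that characterize the healthiness of feature evolution are introduced, enabling principled comparison across models and datasets.
Insight: The novelty is the first systematic dissection of classification performance degradation in medical MLLMs. It identifies four specific failure modes (quality limitation in visual representation, fidelity loss in connector projection, comprehension deficit in LLM reasoning, and misalignment of semantic mapping) and provides visualization and quantitative analysis tools that point the way for future model improvements.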
Abstract: The rise of multimodal large language models (MLLMs) has sparked an unprecedented wave of applications in the field of medical imaging analysis. However, as one of the earliest and most fundamental tasks integrated into this paradigm, medical image classification reveals a sobering reality: state-of-the-art medical MLLMs consistently underperform compared to traditional deep learning models, despite their overwhelming advantages in pre-training data and model parameters. This paradox prompts a critical rethinking: where exactly does the performance degradation originate? In this paper, we conduct extensive experiments on 14 open-source medical MLLMs across three representative image classification datasets. Moving beyond superficial performance benchmarking, we employ feature probing to track the information flow of visual features module-by-module and layer-by-layer throughout the entire MLLM pipeline, enabling explicit visualization of where and how classification signals are distorted, diluted, or overridden. As the first attempt to dissect classification performance degradation in medical MLLMs, our findings reveal four failure modes: 1) quality limitation in visual representation, 2) fidelity loss in connector projection, 3) comprehension deficit in LLM reasoning, and 4) misalignment of semantic mapping. Meanwhile, we introduce quantitative scores that characterize the healthiness of feature evolution, enabling principled comparisons across diverse MLLMs and datasets. Furthermore, we provide insightful discussions centered on the critical barriers that prevent current medical MLLMs from fulfilling their promised clinical potential. We hope that our work provokes rethinking within the community, highlighting that the road from high expectations to clinically deployable MLLMs remains long and winding.
[100] InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding cs.CV | cs.AIPDF
Ashutosh Kumar, Rajat Saini, Jingjing Pan, Mustafa Erdogan, Mingfang Zhang
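TL;DR: InstAP is an instance-aware vision-language pre-training framework that improves spatial-temporal understanding by jointly optimizing global vision-text alignment and fine-grained, instance-level contrastive alignment. Built on the newly constructed large-scale dataset InstVL (2 million images and 50,000 videos), it substantially outperforms existing VLP models on instance-level retrieval and achieves competitive zero-shot performance on multiple video benchmarks.
Details
Motivation: Current vision-language pre-training (VLP) paradigms excel at global scene understanding but struggle with instance-level reasoning because they rely on global-only supervision. The paper addresses this by introducing instance-level alignment to strengthen fine-grained understanding of specific spatial-temporal regions.
Result: On the InstVL benchmark, InstAP substantially outperforms existing VLP models on instance-level retrieval and also surpasses a strong VLP baseline trained on the same data, which isolates the benefit of the instance-aware objective. InstAP also achieves competitive zero-shot performance on multiple video benchmarks, including MSR-VTT and DiDeMo.
Insight: The novelty is a pre-training framework with joint dual-granularity (global and instance-level) alignment, together with InstVL, a large-scale dataset with dense instance descriptions. Viewed objectively, grounding textual mentions to specific spatial-temporal regions improves fine-grained reasoning, and instance-centric pre-training also benefits global understanding, suggesting a new direction for fine-grained alignment in vision-language models.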
Abstract: Current vision-language pre-training (VLP) paradigms excel at global scene understanding but struggle with instance-level reasoning due to global-only supervision. We introduce InstAP, an Instance-Aware Pre-training framework that jointly optimizes global vision-text alignment and fine-grained, instance-level contrastive alignment by grounding textual mentions to specific spatial-temporal regions. To support this, we present InstVL, a large-scale dataset (2 million images, 50,000 videos) with dual-granularity annotations: holistic scene captions and dense, grounded instance descriptions. On the InstVL benchmark, InstAP substantially outperforms existing VLP models on instance-level retrieval, and also surpasses a strong VLP baseline trained on the exact same data corpus, isolating the benefit of our instance-aware objective. Moreover, instance-centric pre-training improves global understanding: InstAP achieves competitive zero-shot performance on multiple video benchmarks, including MSR-VTT and DiDeMo. Qualitative visualizations further show that InstAP localizes textual mentions to the correct instances, while global-only models exhibit more diffuse, scene-level attention.
[101] PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models cs.CV | cs.AIPDF
Ruizhi Zhang, Ye Huang, Yuangang Pan, Chuanfu Shen, Zhilin Liu
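TL;DR: This paper proposes PokeGym, a visually-driven long-horizon benchmark built in the 3D open world of Pokemon Legends: Z-A, to evaluate the interaction and decision-making abilities of vision-language models (VLMs) in complex 3D embodied environments. Strict code-level isolation ensures that agents act only on raw RGB observations, and memory scanning enables automated evaluation, addressing the shortcomings of existing benchmarks in interactive dynamics, depth perception, state leakage, and evaluation cost.
Details
Motivation: Existing VLM benchmarks have four major deficiencies: passive perception tasks sidestep interactive dynamics; simplified 2D environments cannot assess depth perception; privileged state leakage bypasses genuine visual processing; and human evaluation is expensive and hard to scale. A benchmark is therefore needed that rigorously evaluates long-horizon, vision-driven decision-making by VLMs in complex 3D environments.
Result: Evaluation on PokeGym's 30 tasks (30-220 steps) finds that the primary bottleneck of current VLMs is physical deadlock recovery rather than high-level planning, and that deadlocks correlate strongly and negatively with task success. The study also uncovers a metacognitive divergence: weaker models mostly suffer from Unaware Deadlocks (not noticing that they are stuck), whereas advanced models exhibit Aware Deadlocks (recognizing that they are stuck yet failing to recover).
Insight: The novelty is a strictly isolated, automatically evaluable 3D long-horizon interaction benchmark driven by raw visual input. Viewed objectively, its core insight is that current VLMs are fundamentally weak in spatial intuition and in recovering from physical interaction failures, which gives a clear direction: future architectures need to integrate explicit spatial reasoning.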
Abstract: While Vision-Language Models (VLMs) have achieved remarkable progress in static visual understanding, their deployment in complex 3D embodied environments remains severely limited. Existing benchmarks suffer from four critical deficiencies: (1) passive perception tasks circumvent interactive dynamics; (2) simplified 2D environments fail to assess depth perception; (3) privileged state leakage bypasses genuine visual processing; and (4) human evaluation is prohibitively expensive and unscalable. We introduce PokeGym, a visually-driven long-horizon benchmark instantiated within Pokemon Legends: Z-A, a visually complex 3D open-world Role-Playing Game. PokeGym enforces strict code-level isolation: agents operate solely on raw RGB observations while an independent evaluator verifies success via memory scanning, ensuring pure vision-based decision-making and automated, scalable assessment. The benchmark comprises 30 tasks (30-220 steps) spanning navigation, interaction, and mixed scenarios, with three instruction granularities (Visual-Guided, Step-Guided, Goal-Only) to systematically deconstruct visual grounding, semantic reasoning, and autonomous exploration capabilities. Our evaluation reveals a key limitation of current VLMs: physical deadlock recovery, rather than high-level planning, constitutes the primary bottleneck, with deadlocks showing a strong negative correlation with task success. Furthermore, we uncover a metacognitive divergence: weaker models predominantly suffer from Unaware Deadlocks (oblivious to entrapment), whereas advanced models exhibit Aware Deadlocks (recognizing entrapment yet failing to recover). These findings highlight the need to integrate explicit spatial intuition into VLM architectures. The code and benchmark will be available on GitHub.
[102] OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks cs.CV | cs.AI | cs.CLPDF
Wenbo Hu, Xin Chen, Yan Gao-Tian, Yihe Deng, Nanyun Peng
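TL;DR: This paper proposes OpenVLThinkerV2, a generalist multimodal reasoning model for multi-domain visual tasks. At its core is Gaussian GRPO (G²RPO), a new reinforcement learning training objective that uses non-linear distributional matching to ensure gradient equity across tasks, together with two task-level shaping mechanisms that balance fine-grained perception with multi-step reasoning.
Details
Motivation: Extending RL objectives such as GRPO to open-source multimodal generalist models faces two main challenges: the extreme variance in reward topologies across visual tasks, and the inherent difficulty of balancing fine-grained perception with multi-step reasoning. The paper aims to address both.
Result: Extensive evaluation on 18 diverse benchmarks shows that OpenVLThinkerV2 outperforms strong open-source models and leading proprietary frontier models, reaching SOTA level.
Insight: The main contributions are: 1. the G²RPO objective, which forces each task's advantage distribution to converge to a standard normal distribution, yielding inter-task gradient equity, reduced sensitivity to heavy-tail outliers, and symmetric updates; 2. two task-level shaping mechanisms (response length shaping and entropy shaping) that balance perception and reasoning and prevent entropy collapse or explosion.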
Abstract: Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advancements in Multimodal Large Language Models. However, extending this success to open-source multimodal generalist models remains heavily constrained by two primary challenges: the extreme variance in reward topologies across diverse visual tasks, and the inherent difficulty of balancing fine-grained perception with multi-step reasoning capabilities. To address these issues, we introduce Gaussian GRPO (G$^2$RPO), a novel RL training objective that replaces standard linear scaling with non-linear distributional matching. By mathematically forcing the advantage distribution of any given task to strictly converge to a standard normal distribution, $\mathcal{N}(0,1)$, G$^2$RPO theoretically ensures inter-task gradient equity, mitigates vulnerabilities to heavy-tail outliers, and offers symmetric updates for positive and negative rewards. Leveraging the enhanced training stability provided by G$^2$RPO, we introduce two task-level shaping mechanisms to seamlessly balance perception and reasoning. First, response length shaping dynamically elicits extended reasoning chains for complex queries while enforcing direct outputs to bolster visual grounding. Second, entropy shaping tightly bounds the model’s exploration zone, effectively preventing both entropy collapse and entropy explosion. Integrating these methodologies, we present OpenVLThinkerV2, a highly robust, general-purpose multimodal model. Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open-source and leading proprietary frontier models.
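The contrast between standard linear scaling and a mapping onto $\mathcal{N}(0,1)$ can be sketched as follows. The abstract does not specify how G$^2$RPO performs its distributional matching, so the rank-based inverse normal transform below is an assumed stand-in, not the paper's objective; `linear_advantages` shows the usual group normalisation for comparison.

```python
from statistics import NormalDist, mean, pstdev
from typing import List


def linear_advantages(rewards: List[float]) -> List[float]:
    """Standard GRPO-style group normalisation: (r - mean) / std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def gaussian_advantages(rewards: List[float]) -> List[float]:
    """Rank-based inverse normal transform (assumed stand-in for G^2RPO)."""
    n = len(rewards)
    order = sorted(range(n), key=lambda i: rewards[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied rewards so they share one average rank.
        while j + 1 < n and rewards[order[j + 1]] == rewards[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    # (rank - 0.5) / n keeps every quantile strictly inside (0, 1).
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]
```

Because the transform depends only on ranks, a single extreme reward cannot stretch the scale of the group, and the resulting advantages are symmetric around zero regardless of the reward distribution of the task.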
[103] AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation cs.CV | cs.AI | cs.CLPDF
Ziwei Zhou, Zeyuan Lai, Rui Wang, Yifan Yang, Zhen Xing
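TL;DR: This paper introduces AVGen-Bench, a task-driven benchmark for text-to-audio-video (T2AV) generation. It contains high-quality prompts across 11 real-world categories and proposes a multi-granular evaluation framework that combines lightweight specialist models with multimodal large language models (MLLMs), covering aspects from perceptual quality to fine-grained semantic controllability. The evaluation reveals a pronounced gap between strong audio-visual aesthetics and weak semantic reliability in current T2AV models.
Details
Motivation: Evaluation of T2AV generation is fragmented: existing benchmarks mostly assess audio and video in isolation or rely on coarse embedding similarity, and fail to capture the fine-grained joint correctness that realistic prompts require.
Result: The evaluation reveals a pronounced gap between audio-visual aesthetics and semantic reliability in current T2AV models, including persistent failures in text rendering, speech coherence, and physical reasoning, and a universal breakdown in musical pitch control.
Insight: The novelty is a task-driven, multi-category benchmark (AVGen-Bench) and a multi-granular evaluation framework combining specialist models with MLLMs, which systematically assesses both the perceptual quality and the fine-grained semantic controllability of T2AV generation and exposes key weaknesses of existing models.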
Abstract: Text-to-Audio-Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmarks largely assess audio and video in isolation or rely on coarse embedding similarity, failing to capture the fine-grained joint correctness required by realistic prompts. We introduce AVGen-Bench, a task-driven benchmark for T2AV generation featuring high-quality prompts across 11 real-world categories. To support comprehensive assessment, we propose a multi-granular evaluation framework that combines lightweight specialist models with Multimodal Large Language Models (MLLMs), enabling evaluation from perceptual quality to fine-grained semantic controllability. Our evaluation reveals a pronounced gap between strong audio-visual aesthetics and weak semantic reliability, including persistent failures in text rendering, speech coherence, physical reasoning, and a universal breakdown in musical pitch control. Code and benchmark resources are available at http://aka.ms/avgenbench.
[104] Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts cs.CV | cs.AI | cs.CLPDF
Haolei Xu, Haiwen Hong, Hongxing Li, Rui Zhou, Yang Zhang
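TL;DR: This paper studies the "Seeing but Not Thinking" phenomenon in multimodal Mixture-of-Experts (MoE) models: a model perceives image content accurately yet fails in subsequent reasoning, while solving the identical problem correctly when it is given as pure text. Through systematic analysis, the authors verify that cross-modal semantic sharing exists, reveal a layer-wise separation between visual experts and domain experts, and propose the Routing Distraction hypothesis: with visual inputs, the routing mechanism fails to adequately activate task-relevant reasoning experts. They design a routing-guided intervention that enhances domain expert activation and improves performance on three multimodal MoE models across six benchmarks, with gains of up to 3.17% on complex visual reasoning tasks.
Details
Motivation: The paper addresses the "Seeing but Not Thinking" problem that multimodal MoE models show with visual inputs, where the model perceives the image but fails to reason while the same problem in pure text is solved correctly, and seeks to find the root cause and improve performance.
Result: On three multimodal MoE models across six benchmarks, the proposed routing-guided intervention yields consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks.
Insight: The novelty is the Routing Distraction hypothesis and the routing-guided intervention built on it. Viewed objectively, the work reveals the layer-wise separation of visual and domain experts in MoE models and the key role of the routing mechanism; it also shows that domain expert identification locates cognitive functions rather than sample-specific solutions, which offers a new perspective on cross-task transfer.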
Abstract: Multimodal Mixture-of-Experts (MoE) models have achieved remarkable performance on vision-language tasks. However, we identify a puzzling phenomenon termed Seeing but Not Thinking: models accurately perceive image content yet fail in subsequent reasoning, while correctly solving identical problems presented as pure text. Through systematic analysis, we first verify that cross-modal semantic sharing exists in MoE architectures, ruling out semantic alignment failure as the sole explanation. We then reveal that visual experts and domain experts exhibit layer-wise separation, with image inputs inducing significant routing divergence from text inputs in middle layers where domain experts concentrate. Based on these findings, we propose the Routing Distraction hypothesis: when processing visual inputs, the routing mechanism fails to adequately activate task-relevant reasoning experts. To validate this hypothesis, we design a routing-guided intervention method that enhances domain expert activation. Experiments on three multimodal MoE models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks. Our analysis further reveals that domain expert identification locates cognitive functions rather than sample-specific solutions, enabling effective transfer across tasks with different information structures.
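One simple way to picture a routing-guided intervention that "enhances domain expert activation" is an additive boost on the router logits of previously identified domain experts before top-k selection. The abstract does not describe the actual mechanism, so the additive form, the `boost` parameter, and the function names below are all hypothetical.

```python
from typing import Iterable, List


def top_k_experts(router_logits: List[float], k: int) -> List[int]:
    """Indices of the k highest-scoring experts, best first."""
    ranked = sorted(
        range(len(router_logits)), key=lambda i: router_logits[i], reverse=True
    )
    return ranked[:k]


def guided_routing(
    router_logits: List[float],
    domain_experts: Iterable[int],
    boost: float,
    k: int,
) -> List[int]:
    """Add a constant boost to domain-expert logits, then route top-k.

    The additive boost is an illustrative assumption, not the paper's method.
    """
    domain = set(domain_experts)
    adjusted = [
        z + boost if i in domain else z for i, z in enumerate(router_logits)
    ]
    return top_k_experts(adjusted, k)
```

For example, with logits `[2.0, 1.5, 0.2, 1.0]` and `k = 2`, plain routing selects experts 0 and 1, while boosting expert 3 by 1.2 brings it into the selected set in place of expert 1.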
[105] Phantasia: Context-Adaptive Backdoors in Vision Language Models cs.CV | cs.AIPDF
Nam Duong Tran, Phi Le Nguyen
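TL;DR: This paper proposes Phantasia, a context-adaptive backdoor attack on vision-language models (VLMs). It first shows that the stealthiness of existing VLM backdoor attacks has been substantially overestimated, since they are easily detected by defenses adapted from other domains. It then introduces Phantasia, which dynamically generates contextually coherent malicious outputs based on the semantics of each input, significantly improving stealth and adaptability.
Details
Motivation: Most existing VLM backdoor attacks rely on poisoned responses that contain fixed, easily identifiable patterns and are therefore not stealthy, and the security of VLMs remains underexplored. The paper addresses this weakness by designing a stealthier, adaptive backdoor attack.
Result: Extensive experiments across diverse VLM architectures show that Phantasia achieves state-of-the-art (SOTA) attack success rates while maintaining benign performance under various defensive settings.
Insight: The novelty is the first systematic evaluation showing that existing VLM backdoor attacks are less stealthy than assumed, and a context-adaptive attack paradigm that aligns poisoned outputs with input semantics to produce plausible-looking malicious responses. This gives backdoor attack and defense research a new perspective and baseline.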
Abstract: Recent advances in Vision-Language Models (VLMs) have greatly enhanced the integration of visual perception and linguistic reasoning, driving rapid progress in multimodal understanding. Despite these achievements, the security of VLMs, particularly their vulnerability to backdoor attacks, remains significantly underexplored. Existing backdoor attacks on VLMs are still in an early stage of development, with most current methods relying on generating poisoned responses that contain fixed, easily identifiable patterns. In this work, we make two key contributions. First, we demonstrate for the first time that the stealthiness of existing VLM backdoor attacks has been substantially overestimated. By adapting defense techniques originally designed for other domains (e.g., vision-only and text-only models), we show that several state-of-the-art attacks can be detected with surprising ease. Second, to address this gap, we introduce Phantasia, a context-adaptive backdoor attack that dynamically aligns its poisoned outputs with the semantics of each input. Instead of producing static poisoned patterns, Phantasia encourages models to generate contextually coherent yet malicious responses that remain plausible, thereby significantly improving stealth and adaptability. Extensive experiments across diverse VLM architectures reveal that Phantasia achieves state-of-the-art attack success rates while maintaining benign performance under various defensive settings.
[106] SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation cs.CVPDF
Wenli Zhang, Xianglong Shi, Sirui Zhao, Xinqi Chen, Guo Cheng
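TL;DR: This paper proposes SyncBreaker, a stage-aware multimodal adversarial attack framework targeting diffusion-based audio-driven talking-head generation. It jointly perturbs the portrait and audio inputs to disrupt lip synchronization and facial dynamics, guarding against misuse such as fraud and misinformation.
Details
Motivation: Existing protection methods are largely limited to a single modality (image-only or audio-only) and cannot effectively suppress speech-driven facial dynamics, so a multimodal attack is needed to fill the gap.
Result: In a white-box proactive protection setting, SyncBreaker degrades lip synchronization and facial dynamics more effectively than strong single-modality baselines, while preserving the perceptual quality of the inputs and remaining robust under purification.
Insight: The contributions are: for the image stream, nullifying supervision with Multi-Interval Sampling (MIS), which steers generation toward the static reference portrait; for the audio stream, Cross-Attention Fooling (CAF), which suppresses interval-specific audio-conditioned cross-attention responses; and independent optimization of the two streams, which are combined at inference time for flexible deployment.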
Abstract: Diffusion-based audio-driven talking-head generation enables realistic portrait animation, but also introduces risks of misuse, such as fraud and misinformation. Existing protection methods are largely limited to a single modality, and neither image-only nor audio-only attacks can effectively suppress speech-driven facial dynamics. To address this gap, we propose SyncBreaker, a stage-aware multimodal protection framework that jointly perturbs portrait and audio inputs under modality-specific perceptual constraints. Our key contributions are twofold. First, for the image stream, we introduce nullifying supervision with Multi-Interval Sampling (MIS) across diffusion stages to steer the generation toward the static reference portrait by aggregating guidance from multiple denoising intervals. Second, for the audio stream, we propose Cross-Attention Fooling (CAF), which suppresses interval-specific audio-conditioned cross-attention responses. Both streams are optimized independently and combined at inference time to enable flexible deployment. We evaluate SyncBreaker in a white-box proactive protection setting. Extensive experiments demonstrate that SyncBreaker more effectively degrades lip synchronization and facial dynamics than strong single-modality baselines, while preserving input perceptual quality and remaining robust under purification. Code: https://github.com/kitty384/SyncBreaker.
[107] BLaDA: Bridging Language to Functional Dexterous Actions within 3DGS Fields cs.CV | cs.ROPDF
Fan Yang, Wenrui Chen, Guorun Yan, Ruize Liao, Wanjun Jia
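TL;DR: BLaDA is an interpretable zero-shot framework that grounds open-vocabulary instructions as perceptual and control constraints for functional dexterous manipulation. A Knowledge-guided Language Parsing module parses natural language into a structured sextuple of manipulation constraints; the Triangular Functional Point Localization module uses 3D Gaussian Splatting as a continuous scene representation and identifies functional regions under geometric constraints; and the 3D Keypoint Grasp Matrix Transformation Execution module decodes these semantic-geometric constraints into physically plausible wrist poses and finger-level commands.
Details
Motivation: In unstructured environments, functional dexterous grasping requires tight integration of semantic understanding, precise 3D functional localization, and physically interpretable execution. Modular hierarchical methods are more controllable and interpretable than end-to-end vision-language-action approaches, but existing ones still rely on predefined affordance labels and lack the tight semantic-pose coupling that functional dexterous manipulation needs.
Result: Extensive experiments on complex benchmarks show that BLaDA significantly outperforms existing methods in both affordance grounding precision and functional manipulation success rate across diverse categories and tasks.
Insight: The novelty is an interpretable zero-shot reasoning chain that turns open-vocabulary instructions directly into executable physical constraints, using 3D Gaussian Splatting as a continuous scene representation for pose-consistent spatial reasoning, thereby bridging language, semantic understanding, and dexterous physical manipulation.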
Abstract: In unstructured environments, functional dexterous grasping calls for the tight integration of semantic understanding, precise 3D functional localization, and physically interpretable execution. Modular hierarchical methods are more controllable and interpretable than end-to-end VLA approaches, but existing ones still rely on predefined affordance labels and lack the tight semantic–pose coupling needed for functional dexterous manipulation. To address this, we propose BLaDA (Bridging Language to Dexterous Actions in 3DGS fields), an interpretable zero-shot framework that grounds open-vocabulary instructions as perceptual and control constraints for functional dexterous manipulation. BLaDA establishes an interpretable reasoning chain by first parsing natural language into a structured sextuple of manipulation constraints via a Knowledge-guided Language Parsing (KLP) module. To achieve pose-consistent spatial reasoning, we introduce the Triangular Functional Point Localization (TriLocation) module, which utilizes 3D Gaussian Splatting as a continuous scene representation and identifies functional regions under triangular geometric constraints. Finally, the 3D Keypoint Grasp Matrix Transformation Execution (KGT3D+) module decodes these semantic-geometric constraints into physically plausible wrist poses and finger-level commands. Extensive experiments on complex benchmarks demonstrate that BLaDA significantly outperforms existing methods in both affordance grounding precision and the success rate of functional manipulation across diverse categories and tasks. Code will be publicly available at https://github.com/PopeyePxx/BLaDA.
[108] HST-HGN: Heterogeneous Spatial-Temporal Hypergraph Networks with Bidirectional State Space Models for Global Fatigue Assessment cs.CV | cs.AIPDF
Changdao Chen
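TL;DR: This paper proposes HST-HGN, a heterogeneous spatial-temporal hypergraph network for assessing driver fatigue from untrimmed videos. A hierarchical hypergraph network dynamically fuses pose-disentangled geometric topologies with multi-modal texture patches to model high-order synergistic facial deformations, and a linear-complexity bidirectional state space model (Bi-Mamba) performs sequence modeling to capture complete physiological lifecycles and distinguish highly ambiguous transient actions.
Details
Motivation: The paper addresses the challenge of assessing driver fatigue from untrimmed videos under a limited computational budget, in particular modeling long-range temporal dependencies in subtle facial expressions. Existing methods are either computationally heavy or use traditional lightweight pairwise graph networks with limited capacity to model high-order synergies and global temporal context.
Result: Extensive evaluation on multiple fatigue benchmarks shows that HST-HGN achieves state-of-the-art performance. The method balances discriminative power and computational efficiency, making it well suited to real-time in-cabin edge deployment.
Insight: The novelty is combining a hierarchical hypergraph network with a bidirectional state space model (Mamba) to model high-order synergistic facial deformations in space and long-range dependencies in time at linear complexity, offering a new approach to fine-grained behavior recognition for real-time edge computing.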
Abstract: It remains challenging to assess driver fatigue from untrimmed videos under constrained computational budgets, due to the difficulty of modeling long-range temporal dependencies in subtle facial expressions. Some existing approaches rely on computationally heavy architectures, whereas others employ traditional lightweight pairwise graph networks, despite their limited capacity to model high-order synergies and global temporal context. Therefore, we propose HST-HGN, a novel Heterogeneous Spatial-Temporal Hypergraph Network driven by Bidirectional State Space Models. Spatially, we introduce a hierarchical hypergraph network to fuse pose-disentangled geometric topologies with multi-modal texture patches dynamically. This formulation encapsulates high-order synergistic facial deformations, effectively overcoming the limitations of conventional methods. In temporal terms, a Bi-Mamba module with linear complexity is applied to perform bidirectional sequence modeling. This explicit temporal-evolution filtering enables the network to distinguish highly ambiguous transient actions, such as yawning versus speaking, while encompassing their complete physiological lifecycles. Extensive evaluations across diverse fatigue benchmarks demonstrate that HST-HGN achieves state-of-the-art performance. In particular, our method strikes a balance between discriminative power and computational efficiency, making it well-suited for real-time in-cabin edge deployment.
[109] CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning cs.CV | cs.AI | cs.ROPDF
Rui Gan, Junyi Ma, Pei Li, Xingyou Yang, Kai Chen
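TL;DR: The paper presents CrashSight, a large-scale vision-language benchmark built on real-world roadside camera data for evaluating how well models understand and reason about safety-critical traffic scenes, particularly crashes. The dataset contains 250 crash videos and 13K multiple-choice question-answer pairs, with a two-tier taxonomy covering visual grounding and higher-level reasoning. The paper benchmarks 8 state-of-the-art vision-language models and finds them weak at temporal and causal reasoning.
Details
Motivation: Existing benchmarks focus on the ego-vehicle perspective and do not adequately evaluate vision-language models' understanding and reasoning in safety-critical traffic scenes from the infrastructure (roadside camera) perspective. Closing this gap requires a standardized, infrastructure-centric evaluation framework for crash scene understanding.
Result: Eight state-of-the-art vision-language models were tested on CrashSight. Although the models describe scenes well, they still struggle substantially with temporal and causal reasoning in safety-critical scenarios. The paper provides a detailed analysis of failure cases.
Insight: The novelty is an infrastructure-centric, phase-aware (pre-crash, crash, post-crash) benchmark for crash video understanding, whose two-tier taxonomy (visual grounding and higher-level reasoning) gives a structured framework for evaluating models in complex safety-critical scenes. Viewed objectively, the work highlights the importance of evaluating model robustness in real-world safety applications from multiple viewpoints (especially infrastructure), and offers a concrete evaluation tool and direction for advancing perception in cooperative autonomous driving.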
Abstract: Cooperative autonomous driving requires traffic scene understanding from both vehicle and infrastructure perspectives. While vision-language models (VLMs) show strong general reasoning capabilities, their performance in safety-critical traffic scenarios remains insufficiently evaluated due to the ego-vehicle focus of existing benchmarks. To bridge this gap, we present CrashSight, a large-scale vision-language benchmark for roadway crash understanding using real-world roadside camera data. The dataset comprises 250 crash videos, annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. Tier 1 evaluates the visual grounding of scene context and involved parties, while Tier 2 probes higher-level reasoning, including crash mechanics, causal attribution, temporal progression, and post-crash outcomes. We benchmark 8 state-of-the-art VLMs and show that, despite strong scene description capabilities, current models struggle with temporal and causal reasoning in safety-critical scenarios. We provide a detailed analysis of failure scenarios and discuss directions for improving VLM crash understanding. The benchmark provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving. The CrashSight benchmark, including the full dataset and code, is accessible at https://mcgrche.github.io/crashsight.
[110] OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance cs.CV | cs.AIPDF
Haoxi Zeng, Qiankun Liu, Yi Bin, Haiyue Zhang, Yujuan Ding
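TL;DR: This paper proposes OVS-DINO, a new framework for open-vocabulary segmentation (OVS). By aligning the DINO vision foundation model with SAM's structural priors, it reactivates DINO's latent boundary awareness and addresses the weakness of existing methods at fine edge segmentation.
Details
Motivation: Existing CLIP-based OVS methods lack fine-grained spatial awareness, and methods that incorporate vision foundation models such as DINO still struggle with the precise edge perception needed for high-fidelity segmentation. The paper finds that DINO's inherent boundary awareness progressively attenuates in deeper transformer blocks, so a way to restore it is needed.
Result: State-of-the-art performance on multiple weakly supervised OVS benchmarks, raising the average score by 2.1 percentage points (from 44.8% to 46.9%). Accuracy improves markedly in complex scenes such as Cityscapes, with a gain of 6.3 percentage points (from 36.6% to 42.9%).
Insight: The core innovation is structural alignment (using SAM's structural priors) together with the proposed Structure-Aware Encoder (SAE) and Structure-Modulated Decoder (SMD), which effectively activate DINO's latent boundary features. This suggests a new approach to dense prediction with vision foundation models: transferring structural knowledge between models to compensate for the inherent weaknesses of a single model.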
Abstract: Open-Vocabulary Segmentation (OVS) aims to segment image regions beyond predefined category sets by leveraging semantic descriptions. While CLIP-based approaches excel in semantic generalization, they frequently lack the fine-grained spatial awareness required for dense prediction. Recent efforts have incorporated Vision Foundation Models (VFMs) like DINO to alleviate these limitations. However, these methods still struggle with the precise edge perception necessary for high-fidelity segmentation. In this paper, we analyze the internal representations of DINO and discover that its inherent boundary awareness is not absent but rather undergoes progressive attenuation as features transition into deeper transformer blocks. To address this, we propose OVS-DINO, a novel framework that revitalizes the latent edge sensitivity of DINO through structural alignment with the Segment Anything Model (SAM). Specifically, we introduce a Structure-Aware Encoder (SAE) and a Structure-Modulated Decoder (SMD) to effectively activate the boundary features of DINO using SAM's structural priors, complemented by a supervision strategy utilizing SAM-generated pseudo-masks. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple weakly-supervised OVS benchmarks, improving the average score by 2.1 percentage points (from 44.8% to 46.9%). Notably, our approach significantly enhances segmentation accuracy in complex, cluttered scenarios, with a gain of 6.3 percentage points on Cityscapes (from 36.6% to 42.9%).
[111] LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation cs.CVPDF
Jingjing Wang, Zhengdong Hong, Chong Bao, Yuke Zhu, Junhan Sun
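TL;DR: LAMP lifts image editing into 3D priors, extracting inter-object 3D transformations as continuous, geometry-aware representations to address poor generalization in open-world robotic manipulation.
Details
Motivation: Existing learning-based methods (reinforcement learning, imitation learning, and vision-language-action models) struggle with novel tasks and unseen environments, while large language models and vision-language models offer strong semantic reasoning but limited 3D awareness, which makes them hard to apply to fine-grained manipulation.
Result: Extensive experiments show that LAMP delivers precise 3D transformations and achieves strong zero-shot generalization in open-world manipulation.
Insight: The core innovation is to take the rich 2D spatial cues implicit in image editing and lift them into 3D transformations, providing fine-grained and accurate geometry-aware guidance for open-world manipulation and a new general 3D prior representation for robotic manipulation.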
Abstract: Human-like generalization in open-world settings remains a fundamental challenge for robotic manipulation. Existing learning-based methods, including reinforcement learning, imitation learning, and vision-language-action models (VLAs), often struggle with novel tasks and unseen environments. Another promising direction is to explore generalizable representations that capture fine-grained spatial and geometric relations for open-world manipulation. While large language models (LLMs) and vision-language models (VLMs) provide strong semantic reasoning based on language or annotated 2D representations, their limited 3D awareness restricts their applicability to fine-grained manipulation. To address this, we propose LAMP, which lifts image editing as 3D priors to extract inter-object 3D transformations as continuous, geometry-aware representations. Our key insight is that image editing inherently encodes rich 2D spatial cues, and lifting these implicit cues into 3D transformations provides fine-grained and accurate guidance for open-world manipulation. Extensive experiments demonstrate that LAMP delivers precise 3D transformations and achieves strong zero-shot generalization in open-world manipulation. Project page: https://zju3dv.github.io/LAMP/.
[112] Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization cs.CV | cs.AIPDF
Sai Srinivas Kancheti, Aditya Kanade, Rohit Sinha, Vineeth N Balasubramanian, Tanuja Ganu
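TL;DR: This paper proposes Faithful GRPO (FGRPO) to address the problem that multimodal reasoning models trained with reinforcement learning gain answer accuracy at the expense of chain-of-thought reasoning quality. Using Lagrangian dual ascent, the method incorporates logical consistency and visual grounding as constraints in GRPO optimization, and it significantly improves both reasoning faithfulness and final-answer accuracy on seven spatial reasoning benchmarks.
Details
Motivation: Multimodal reasoning models trained with reinforcement learning with verifiable rewards improve accuracy on visual reasoning benchmarks, but their generated chains of thought are often inconsistent with the final answer and poorly grounded in the visual evidence; that is, reasoning quality degrades. The paper aims to resolve this trade-off between accuracy and reasoning faithfulness.
Result: Evaluated with Qwen2.5-VL-7B and 3B backbones on seven spatial reasoning datasets, FGRPO cuts the inconsistency rate from 24.5% to 1.7%, improves visual grounding scores by 13%, and exceeds plain GRPO in final-answer accuracy.
Insight: The core innovation is to treat reasoning faithfulness (logical consistency and visual grounding) as optimization constraints, adaptively folded into GRPO's advantage computation via Lagrangian dual ascent. The results indicate that more faithful reasoning directly yields better final answers, providing an interpretability-oriented optimization path for the "black-box" reasoning of multimodal models.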
Abstract: Multimodal reasoning models (MRMs) trained with reinforcement learning with verifiable rewards (RLVR) show improved accuracy on visual reasoning benchmarks. However, we observe that accuracy gains often come at the cost of reasoning quality: generated Chain-of-Thought (CoT) traces are frequently inconsistent with the final answer and poorly grounded in the visual evidence. We systematically study this phenomenon across seven challenging real-world spatial reasoning benchmarks and find that it affects contemporary MRMs such as ViGoRL-Spatial, TreeVGR as well as our own models trained with standard Group Relative Policy Optimization (GRPO). We characterize CoT reasoning quality along two complementary axes: “logical consistency” (does the CoT entail the final answer?) and “visual grounding” (does each reasoning step accurately describe objects, attributes, and spatial relationships in the image?). To address this, we propose Faithful GRPO (FGRPO), a variant of GRPO that enforces consistency and grounding as constraints via Lagrangian dual ascent. FGRPO incorporates batch-level consistency and grounding constraints into the advantage computation within a group, adaptively adjusting the relative importance of constraints during optimization. We evaluate FGRPO on Qwen2.5-VL-7B and 3B backbones across seven spatial datasets. Our results show that FGRPO substantially improves reasoning quality, reducing the inconsistency rate from 24.5% to 1.7% and improving visual grounding scores by +13%. It also improves final answer accuracy over simple GRPO, demonstrating that faithful reasoning enables better answers.
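One way to picture the constraint mechanism described in this abstract is the minimal sketch below. The abstract does not give FGRPO's formulas, so everything here is an assumption made for illustration: the function names, the per-sample `consistency` and `grounding` scores in [0, 1], the target `thresholds`, and the simple additive way the multipliers enter the advantage. It shows only the generic pattern of Lagrangian dual ascent, where a multiplier grows while its constraint is violated and shrinks (down to zero) once it is met.

```python
from statistics import mean, pstdev


def group_normalize(values, eps=1e-8):
    """Group-relative normalization: subtract the group mean, divide by the group std."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def constrained_advantages(accuracy, consistency, grounding, lambdas):
    """Combine the accuracy advantage with multiplier-weighted constraint advantages.

    All inputs are per-rollout scores for ONE group (rollouts of the same prompt).
    The additive combination is an assumption, not the paper's stated formula.
    """
    a_acc = group_normalize(accuracy)
    a_con = group_normalize(consistency)
    a_gnd = group_normalize(grounding)
    return [
        a + lambdas["consistency"] * c + lambdas["grounding"] * g
        for a, c, g in zip(a_acc, a_con, a_gnd)
    ]


def dual_ascent_step(lambdas, batch_means, thresholds, lr=0.1):
    """Raise a multiplier when its batch-level constraint is violated, else lower it.

    A constraint counts as violated when the batch mean falls below its
    (hypothetical) threshold. Multipliers are clipped at zero.
    """
    return {
        name: max(0.0, lambdas[name] + lr * (thresholds[name] - batch_means[name]))
        for name in lambdas
    }


# Toy usage: consistency is below target, grounding is above target.
lambdas = dual_ascent_step(
    {"consistency": 0.0, "grounding": 0.0},
    batch_means={"consistency": 0.5, "grounding": 0.9},
    thresholds={"consistency": 0.8, "grounding": 0.8},
)
advantages = constrained_advantages(
    accuracy=[1.0, 0.0, 1.0, 0.0],
    consistency=[1.0, 1.0, 0.0, 0.0],
    grounding=[0.9, 0.8, 0.9, 0.8],
    lambdas=lambdas,
)
```

In this toy run only the consistency multiplier becomes positive, so of two equally accurate rollouts the one whose reasoning is consistent receives the larger advantage.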
[113] Novel View Synthesis as Video Completion cs.CVPDF
Qi Wu, Khiem Vuong, Minsik Jeon, Srinivasa Narasimhan, Deva Ramanan
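TL;DR: This paper proposes FrameCrafter, which recasts sparse novel view synthesis (NVS) as a low-frame-rate video completion task, exploiting the multi-view knowledge implicit in video diffusion models and using architectural modifications to make the model invariant to the order of the input views.
Details
Motivation: The task is novel view synthesis from sparse multi-view images (about 5). Existing diffusion models trained on single images lack multi-view knowledge, whereas video models already contain it implicitly and should therefore be easier to adapt to NVS.
Result: Competitive performance on sparse-view NVS benchmarks, suggesting that video models can "forget" temporal information with minimal supervision and be applied to NVS.
Insight: The novelty is to treat NVS as video completion and to propose architectural modifications (such as per-frame latent encodings and removal of temporal positional embeddings) that achieve input-order invariance. Viewed objectively, transferring knowledge from pretrained video models is an efficient way to tackle multi-view problems.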
Abstract: We tackle the problem of sparse novel view synthesis (NVS) using video diffusion models; given $K$ ($\approx 5$) multi-view images of a scene and their camera poses, we predict the view from a target camera pose. Many prior approaches leverage generative image priors encoded via diffusion models. However, models trained on single images lack multi-view knowledge. We instead argue that video models already contain implicit multi-view knowledge and so should be easier to adapt for NVS. Our key insight is to formulate sparse NVS as a low frame-rate video completion task. However, one challenge is that sparse NVS is defined over an unordered set of inputs, often too sparse to admit a meaningful order, so the models should be invariant to permutations of that input set. To this end, we present FrameCrafter, which adapts video models (naturally trained with coherent frame orderings) to permutation-invariant NVS through several architectural modifications, including per-frame latent encodings and removal of temporal positional embeddings. Our results suggest that video models can be easily trained to "forget" about time with minimal supervision, producing competitive performance on sparse-view NVS benchmarks. Project page: https://frame-crafter.github.io/
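The link between temporal positional embeddings and permutation invariance can be illustrated with a toy example. This is not FrameCrafter's architecture: `encode_frame` and `pool_views` are hypothetical stand-ins, with a squared per-frame encoder and mean pooling chosen only so the effect is easy to see. When each view is encoded independently and no frame index is injected, the pooled result is the same for every input order; once a position offset is added before the nonlinearity, the result depends on the order.

```python
from itertools import permutations


def encode_frame(frame, position=None):
    """Toy per-frame encoder (hypothetical): each view is encoded on its own.

    If `position` is given, it acts like a temporal positional embedding that
    is added to the input before the nonlinearity.
    """
    offset = 0.0 if position is None else float(position)
    return [(x + offset) ** 2 for x in frame]


def pool_views(views, use_temporal_positions):
    """Mean-pool per-frame features into one set-level feature vector."""
    pooled = [0.0] * len(views[0])
    for index, view in enumerate(views):
        feature = encode_frame(view, index if use_temporal_positions else None)
        pooled = [p + f for p, f in zip(pooled, feature)]
    return [p / len(views) for p in pooled]


views = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Without positions, every ordering of the input set gives the same output.
outputs_without = {tuple(pool_views(list(p), False)) for p in permutations(views)}

# With positions, reversing the input order changes the output.
forward = pool_views(views, True)
backward = pool_views(views[::-1], True)
```

Here `outputs_without` contains a single element, while `forward` and `backward` differ, which is the property the abstract asks of a model that takes an unordered set of views.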
[114] Quantifying Explanation Consistency: The C-Score Metric for CAM-Based Explainability in Medical Image Classification cs.CV | cs.AIPDF
Kabilan Elangovan, Daniel Ting
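TL;DR: This paper proposes the C-Score (Consistency Score), a new metric that quantifies the cross-sample consistency of class activation mapping (CAM) explanations for medical image classifiers, rather than the traditional localisation accuracy. On the Kermany chest X-ray dataset, the authors evaluate six CAM methods across three CNN architectures, identify three mechanisms by which AUC and explanation consistency can dissociate, and show that the C-Score can serve as an early warning signal of model instability.
Details
Motivation: Existing evaluation frameworks focus on the localisation accuracy of CAM explanations against radiologist annotations (correctness) and ignore whether the model applies the same spatial reasoning strategy across different patients with the same pathology (consistency). The paper addresses how to quantify this explanation consistency.
Result: Experiments on the Kermany chest X-ray dataset show that the C-Score identifies three AUC-consistency dissociation mechanisms that standard classification metrics such as AUC cannot reveal. For example, on ResNet50V2, deterioration in ScoreCAM's C-Score is detectable one full checkpoint before catastrophic AUC collapse, supporting architecture-specific clinical deployment recommendations grounded in explanation quality.
Insight: The novelty is an annotation-free metric (C-Score), based on confidence-weighted, intensity-emphasised pairwise soft IoU, for evaluating the intra-class reproducibility of model explanations. It opens a new dimension (consistency) for explainability evaluation and can serve as a diagnostic for model robustness and latent instability, going beyond evaluation that relies on predictive performance alone.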
Abstract: Class Activation Mapping (CAM) methods are widely used to generate visual explanations for deep learning classifiers in medical imaging. However, existing evaluation frameworks assess whether explanations are correct, measured by localisation fidelity against radiologist annotations, rather than whether they are consistent: whether the model applies the same spatial reasoning strategy across different patients with the same pathology. We propose the C-Score (Consistency Score), a confidence-weighted, annotation-free metric that quantifies intra-class explanation reproducibility via intensity-emphasised pairwise soft IoU across correctly classified instances. We evaluate six CAM techniques (GradCAM, GradCAM++, LayerCAM, EigenCAM, ScoreCAM, and MS GradCAM++) across three CNN architectures (DenseNet201, InceptionV3, ResNet50V2) over thirty training epochs on the Kermany chest X-ray dataset, covering transfer learning and fine-tuning phases. We identify three distinct mechanisms of AUC-consistency dissociation, invisible to standard classification metrics: threshold-mediated gold list collapse, technique-specific attribution collapse at peak AUC, and class-level consistency masking in global aggregation. C-Score provides an early warning signal of impending model instability: ScoreCAM deterioration on ResNet50V2 is detectable one full checkpoint before catastrophic AUC collapse. The metric also yields architecture-specific clinical deployment recommendations grounded in explanation quality rather than predictive ranking alone.
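To make the ingredients named in this abstract concrete, here is a minimal sketch of a confidence-weighted pairwise soft IoU. The abstract does not give the C-Score's exact definition, so the details are assumptions: heatmaps are flattened lists of values in [0, 1], "intensity emphasis" is modelled by raising each value to a power `gamma`, and each pair is weighted by the product of the two prediction confidences. The function names are hypothetical.

```python
from itertools import combinations


def soft_iou(map_a, map_b, gamma=2.0):
    """Soft IoU of two heatmaps: sum of element-wise minima over sum of maxima.

    `gamma` > 1 emphasises high-intensity regions (an assumed form of
    "intensity emphasis").
    """
    a = [x ** gamma for x in map_a]
    b = [x ** gamma for x in map_b]
    intersection = sum(min(x, y) for x, y in zip(a, b))
    union = sum(max(x, y) for x, y in zip(a, b))
    return intersection / union if union > 0 else 0.0


def consistency_score(heatmaps, confidences, gamma=2.0):
    """Confidence-weighted mean soft IoU over all pairs of same-class heatmaps.

    `heatmaps` should hold the explanations of correctly classified instances
    of one class; `confidences` are the matching predicted probabilities.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for i, j in combinations(range(len(heatmaps)), 2):
        weight = confidences[i] * confidences[j]
        weighted_sum += weight * soft_iou(heatmaps[i], heatmaps[j], gamma)
        weight_total += weight
    return weighted_sum / weight_total if weight_total > 0 else 0.0


# Two explanations agree, one points somewhere else entirely.
heatmaps = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
score = consistency_score(heatmaps, confidences=[1.0, 1.0, 1.0])
```

Because no ground-truth mask appears anywhere in the computation, a score of this kind is annotation-free: it measures agreement among explanations, not agreement with a radiologist.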
[115] Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics cs.CVPDF
Ying Shen, Jerry Xiong, Tianjiao Yu, Ismini Lourentzou
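TL;DR: This paper proposes Phantom, a physics-infused video generation model that jointly models visual content and latent physical dynamics to generate video sequences that are both visually realistic and physically consistent.
Details
Motivation: Existing generative video models have advanced in visual realism but lack an understanding of physical laws, which leads to unrealistic motion. The paper aims to make models produce physically plausible videos by integrating the inference of latent physical properties directly into the video generation process.
Result: Quantitative and qualitative results on standard video generation and physics-aware benchmarks show that Phantom outperforms existing methods in adherence to physical dynamics while delivering competitive perceptual fidelity.
Insight: The novelty is a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, allowing physical dynamics and video content to be predicted jointly without explicitly specifying a complex set of physical dynamics and properties, which improves the physical consistency of generated videos.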
Abstract: Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual realism. However, emerging evidence suggests that simply scaling data and model size does not endow these systems with an understanding of the underlying physical laws that govern real-world dynamics. Existing approaches often fail to capture or enforce such physical consistency, resulting in unrealistic motion and dynamics. In this work, we investigate whether integrating the inference of latent physical properties directly into the video generation process can equip models with the ability to produce physically plausible videos. To this end, we propose Phantom, a Physics-Infused Video Generation model that jointly models the visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, Phantom jointly predicts latent physical dynamics and generates future video frames. Phantom leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, facilitating the joint prediction of physical dynamics alongside video content without requiring an explicit specification of a complex set of physical dynamics and properties. By integrating the inference of the physics-aware video representation directly into the video generation process, Phantom produces video sequences that are both visually realistic and physically consistent. Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that Phantom not only outperforms existing methods in terms of adherence to physical dynamics but also delivers competitive perceptual fidelity.
[116] Visually-grounded Humanoid Agents cs.CV | cs.ROPDF
Hang Ye, Xiaoxuan Ma, Fan Lu, Wayne Wu, Kwan-Yee Lin
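TL;DR: The paper proposes Visually-grounded Humanoid Agents, a coupled two-layer (world-agent) paradigm for producing digital humans that actively and naturally carry out goal-directed behavior in novel 3D scenes using only visual observations and specified goals. The World Layer reconstructs semantically rich 3D Gaussian scenes from real-world videos via an occlusion-aware pipeline and accommodates animatable Gaussian avatars; the Agent Layer gives these avatars autonomy, equipping them with first-person RGB-D perception and embodied planning with spatially aware reasoning, which then drives full-body actions in the scene.
Details
Motivation: Existing digital human generation systems are usually passively animated and rely on privileged state or scripted control, which makes them hard to extend to new environments. The goal is for digital humans to act in novel scenes using only visual observations and specified goals, enabling scenes to be populated at scale with spontaneous, natural, goal-directed digital humans.
Result: On a benchmark of diverse reconstructed environments, the agents achieve robust autonomous behavior, with higher task success rates and fewer collisions than ablations and state-of-the-art planning methods.
Insight: The novelty is a coupled two-layer paradigm that unifies scene reconstruction with autonomous agent behavior generation. In particular, first-person RGB-D perception combined with embodied spatial planning and iterative reasoning closes the loop from visual input to full-body action execution, advancing human-centric embodied AI.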
Abstract: Digital human generation has been studied for decades and supports a wide range of real-world applications. However, most existing systems are passively animated, relying on privileged state or scripted control, which limits scalability to novel environments. We instead ask: how can digital humans actively behave using only visual observations and specified goals in novel scenes? Achieving this would enable populating any 3D environment at scale with digital humans that exhibit spontaneous, natural, goal-directed behaviors. To this end, we introduce Visually-grounded Humanoid Agents, a coupled two-layer (world-agent) paradigm that replicates humans at multiple levels: they look, perceive, reason, and behave like real people in real-world 3D scenes. The World Layer reconstructs semantically rich 3D Gaussian scenes from real-world videos via an occlusion-aware pipeline and accommodates animatable Gaussian-based human avatars. The Agent Layer transforms these avatars into autonomous humanoid agents, equipping them with first-person RGB-D perception and enabling them to perform accurate, embodied planning with spatial awareness and iterative reasoning, which is then executed at the low level as full-body actions to drive their behaviors in the scene. We further introduce a benchmark to evaluate humanoid-scene interaction in diverse reconstructed environments. Experiments show our agents achieve robust autonomous behavior, yielding higher task success rates and fewer collisions than ablations and state-of-the-art planning methods. This work enables active digital human population and advances human-centric embodied AI. Data, code, and models will be open-sourced.
[117] When Fine-Tuning Changes the Evidence: Architecture-Dependent Semantic Drift in Chest X-Ray Explanations cs.CVPDF
Kabilan Elangovan, Daniel Ting
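TL;DR: This paper studies the stability of explanatory evidence when medical image classifiers are fine-tuned after transfer learning. The authors define "semantic drift" as systematic changes, between transfer learning and full fine-tuning, in the attribution maps that a model's predictions rely on. Evaluating DenseNet201, ResNet50V2, and InceptionV3 on a five-class chest X-ray task, they find that although classification performance is stable, the structure of the attribution evidence undergoes pronounced, architecture-dependent reorganization.
Details
Motivation: In multi-class medical imaging tasks with overlapping visual features, gains in accuracy do not guarantee that the visual evidence behind a model's predictions is stable. The paper investigates how transfer learning and fine-tuning change the evidential structure of model explanations, that is, semantic drift.
Result: Experiments use DenseNet201, ResNet50V2, and InceptionV3 on a five-class chest X-ray task. Coarse anatomical localization remains stable, but the structural consistency of attribution maps, measured with overlap IoU, reveals pronounced architecture-dependent reorganization of evidence. In addition, under converged predictive performance, the stability rankings produced by LayerCAM and GradCAM++ can reverse.
Insight: The novelty is defining semantic drift and systematically quantifying how it changes during fine-tuning. The key insight is that explanation stability results from the interaction of model architecture, optimization phase (transfer learning vs. fine-tuning), and attribution objective (explanation method), which adds a new dimension for evaluating and selecting reliable medical AI models.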
Abstract: Transfer learning followed by fine-tuning is widely adopted in medical image classification due to consistent gains in diagnostic performance. However, in multi-class settings with overlapping visual features, improvements in accuracy do not guarantee stability of the visual evidence used to support predictions. We define semantic drift as systematic changes in the attribution structure supporting a model’s predictions between transfer learning and full fine-tuning, reflecting potential shifts in underlying visual reasoning despite stable classification performance. Using a five-class chest X-ray task, we evaluate DenseNet201, ResNet50V2, and InceptionV3 under a two-stage training protocol and quantify drift with reference-free metrics capturing spatial localization and structural consistency of attribution maps. Across architectures, coarse anatomical localization remains stable, while overlap IoU reveals pronounced architecture-dependent reorganization of evidential structure. Beyond single-method analysis, stability rankings can reverse across LayerCAM and GradCAM++ under converged predictive performance, establishing explanation stability as an interaction between architecture, optimization phase, and attribution objective.
[118] MolmoWeb: Open Visual Web Agent and Open Data for the Open Web cs.CVPDF
Tanmay Gupta, Piper Wolters, Zixian Ma, Peter Sushko, Rock Yuren Pang
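TL;DR: This paper introduces MolmoWeb, a family of fully open visual web agents, and MolmoWebMix, its accompanying large and diverse training dataset. The agents predict the next browser action from only a task instruction and a webpage screenshot, without access to HTML or other underlying code. On several web-task benchmarks, MolmoWeb models achieve state-of-the-art performance, outperforming comparable open models and in some cases agents built on much larger closed models.
Details
Motivation: The most capable web agents today rely on proprietary models with undisclosed training data and methods, which limits scientific understanding, reproducibility, and community-driven progress. The authors argue that agents built for the open web should themselves be open.
Result: On browser-use benchmarks such as WebVoyager, Online-Mind2Web, and DeepShop, MolmoWeb agents (4B and 8B parameters) achieve state-of-the-art results, outperforming similar-scale open models such as Fara-7B and UI-Tars-1.5-7B. MolmoWeb-8B also surpasses set-of-marks agents built on much larger closed frontier models such as GPT-4o. With test-time parallel rollouts and best-of-N selection, pass@4 success rates reach 94.7% on WebVoyager and 60.5% on Online-Mind2Web.
Insight: The main contributions are: 1) a large, diverse open dataset (MolmoWebMix) combining synthetic trajectories, human demonstrations, atomic skills, and GUI perception data; 2) a fully open web agent architecture that relies on visual input alone, with no access to underlying code; and 3) evidence that test-time scaling (such as parallel rollouts) significantly improves performance. Together these provide an important data and model foundation for open, reproducible web agent research.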
Abstract: Web agents, autonomous systems that navigate and execute tasks on the web on behalf of users, have the potential to transform how people interact with the digital world. However, the most capable web agents today rely on proprietary models with undisclosed training data and recipes, limiting scientific understanding, reproducibility, and community-driven progress. We believe agents for the open web should be built in the open. To this end, we introduce (1) MolmoWebMix, a large and diverse mixture of browser task demonstrations and web-GUI perception data and (2) MolmoWeb, a family of fully open multimodal web agents. Specifically, MolmoWebMix combines over 100K synthetic task trajectories from multiple complementary generation pipelines with 30K+ human demonstrations, atomic web-skill trajectories, and GUI perception data, including referring expression grounding and screenshot question answering. MolmoWeb agents operate as instruction-conditioned visual-language action policies: given a task instruction and a webpage screenshot, they predict the next browser action, requiring no access to HTML, accessibility trees, or specialized APIs. Available in 4B and 8B sizes, MolmoWeb agents achieve state-of-the-art results on browser-use benchmarks such as WebVoyager, Online-Mind2Web, and DeepShop, outperforming similar-scale open-weight-only models such as Fara-7B, UI-Tars-1.5-7B, and Holo1-7B. MolmoWeb-8B also surpasses set-of-marks (SoM) agents built on much larger closed frontier models like GPT-4o. We further demonstrate consistent gains through test-time scaling via parallel rollouts with best-of-N selection, achieving 94.7% and 60.5% pass@4 (compared to 78.2% and 35.3% pass@1) on WebVoyager and Online-Mind2Web respectively. We will release model checkpoints, training data, code, and a unified evaluation harness to enable reproducibility and accelerate open research on web agents.
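The pass@1 and pass@4 figures quoted in this abstract can be read with the following minimal sketch. It assumes the common reading of pass@k, in which a task counts as solved if at least one of its first k rollouts succeeds; the abstract does not spell out how the best-of-N selection is scored, so `best_of_n` and its `score` callback are hypothetical illustrations and not the paper's evaluation harness.

```python
def pass_at_k(rollout_outcomes, k):
    """Fraction of tasks for which at least one of the first k rollouts succeeded.

    `rollout_outcomes` holds one list of booleans per task, one boolean per rollout.
    """
    solved = sum(1 for outcomes in rollout_outcomes if any(outcomes[:k]))
    return solved / len(rollout_outcomes)


def best_of_n(rollouts, score):
    """Pick the rollout with the highest score from a (hypothetical) judge."""
    return max(rollouts, key=score)


# Three tasks, four parallel rollouts each.
outcomes = [
    [False, True, False, False],   # solved on the second attempt
    [False, False, False, False],  # never solved
    [True, False, False, False],   # solved on the first attempt
]
p1 = pass_at_k(outcomes, 1)  # only the third task counts
p4 = pass_at_k(outcomes, 4)  # the first and third tasks count

# Selecting one trajectory per task, here scored by a made-up judge value.
candidates = [("rollout-a", 0.2), ("rollout-b", 0.9), ("rollout-c", 0.5)]
chosen = best_of_n(candidates, score=lambda item: item[1])
```

Under this reading pass@k can only grow with k, which matches the direction of the reported gains from pass@1 to pass@4.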
[119] UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding cs.CVPDF
Joungbin An, Agrim Jain, Kristen Grauman
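TL;DR: This paper proposes UniversalVTG, a universal and lightweight foundation model for video temporal grounding. It uses large-scale cross-dataset pretraining, handles queries of differing formats in a unified way, and adopts an efficient grounding head, achieving state-of-the-art performance on multiple benchmarks with a model far smaller than approaches based on large language models.
Details
Motivation: Existing video temporal grounding models are usually designed for specific datasets and transfer poorly across domains and query styles, while approaches based on large language models have high compute cost and limited video context, which makes long videos hard to handle.
Result: On GoalStep-StepGrounding, Ego4D-NLQ, TACoS, Charades-STA, and ActivityNet-Captions, a single UniversalVTG checkpoint achieves state-of-the-art performance, outperforming dedicated models. Despite being more than 100 times smaller than approaches based on large language models, it matches or exceeds their accuracy on multiple benchmarks.
Insight: The contributions include: 1) an offline Query Unifier that canonicalizes heterogeneous query formats into a shared declarative space, reducing linguistic mismatch and preventing negative transfer under naive joint training; 2) an efficient grounding head that lets the model scale to long, untrimmed videos; and 3) unified supervision through large-scale cross-dataset pretraining that keeps the model lightweight, offering a practical alternative to parameter-heavy large language models.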
Abstract: Video temporal grounding (VTG) is typically tackled with dataset-specific models that transfer poorly across domains and query styles. Recent efforts to overcome this limitation have adapted large multimodal language models (MLLMs) to VTG, but their high compute cost and limited video context still hinder long-video grounding. We instead scale unified supervision while keeping the model lightweight. We present UniversalVTG, a single VTG model trained with large-scale cross-dataset pretraining. An offline Query Unifier canonicalizes heterogeneous query formats into a shared declarative space, reducing linguistic mismatch and preventing the negative transfer observed under naïve joint training. Combined with an efficient grounding head, UniversalVTG scales to long, untrimmed videos. Across diverse benchmarks (GoalStep-StepGrounding, Ego4D-NLQ, TACoS, Charades-STA, and ActivityNet-Captions), one UniversalVTG checkpoint achieves state-of-the-art performance versus dedicated VTG models. Moreover, despite being $>100\times$ smaller than recent MLLM-based approaches, UniversalVTG matches or exceeds their accuracy on multiple benchmarks, offering a practical alternative to parameter-heavy MLLMs.
[120] Self-Improving 4D Perception via Self-Distillation cs.CVPDF
Nan Huang, Pengcheng Yu, Weijia Zeng, James M. Rehg, Angjoo Kanazawa
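TL;DR: This paper proposes SelfEvo, a self-improving framework based on self-distillation that uses unlabeled videos to continually improve pretrained multi-view 3D/4D reconstruction models on dynamic scenes, removing the dependence on expensive ground-truth annotations.
Details
Motivation: Existing large-scale multi-view reconstruction models depend heavily on 3D/4D ground-truth annotations that are expensive and scarce, especially for dynamic scenes, which limits scalability. The paper addresses how to use unlabeled video to continually improve pretrained models.
Result: Across eight benchmarks spanning different datasets and domains, SelfEvo consistently improves pretrained baselines (such as VGGT and π^3), with significant gains on dynamic scenes. It achieves up to 36.5% relative improvement in video depth estimation and 20.1% in camera pose estimation, without using any labeled data.
Insight: The core innovation is a self-distillation scheme built on spatiotemporal context asymmetry, which enables self-improvement without external annotations. The contribution claimed in the abstract is a systematic study of the design choices that make self-improvement effective, including loss signals, forms of asymmetry, and other training strategies. Viewed objectively, the framework offers a general unsupervised continual learning paradigm that generalizes across base models and is relevant to the annotation bottleneck in 4D perception of dynamic scenes.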
Abstract: Large-scale multi-view reconstruction models have made remarkable progress, but most existing approaches still rely on fully supervised training with ground-truth 3D/4D annotations. Such annotations are expensive and particularly scarce for dynamic scenes, limiting scalability. We propose SelfEvo, a self-improving framework that continually improves pretrained multi-view reconstruction models using unlabeled videos. SelfEvo introduces a self-distillation scheme using spatiotemporal context asymmetry, enabling self-improvement for learning-based 4D perception without external annotations. We systematically study design choices that make self-improvement effective, including loss signals, forms of asymmetry, and other training strategies. Across eight benchmarks spanning diverse datasets and domains, SelfEvo consistently improves pretrained baselines and generalizes across base models (e.g. VGGT and $π^3$), with significant gains on dynamic scenes. Overall, SelfEvo achieves up to 36.5% relative improvement in video depth estimation and 20.1% in camera estimation, without using any labeled data. Project Page: https://self-evo.github.io/.
[121] RewardFlow: Generate Images by Optimizing What You Reward cs.CV | cs.AIPDF
Onkar Susladkar, Dong-Hwan Jang, Tushar Prakash, Adheesh Juvekar, Vedant Shah
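TL;DR: RewardFlow is an inversion-free framework that steers pretrained diffusion and flow-matching models at inference time with multi-reward Langevin dynamics. It unifies several differentiable rewards, covering semantic alignment, perceptual fidelity, localized grounding, object consistency, and human preference, and introduces a differentiable VQA-based reward that provides fine-grained semantic supervision through language-vision reasoning. The framework also includes a prompt-aware adaptive policy that extracts semantic primitives from the instruction, infers edit intent, and dynamically adjusts reward weights and step sizes throughout sampling. RewardFlow achieves state-of-the-art edit fidelity and compositional alignment on several image editing and compositional generation benchmarks.
Details
Motivation: The problem is how to steer pretrained generative models (such as diffusion and flow-matching models) effectively at inference time for complex, multi-objective image editing and generation, and in particular how to coordinate heterogeneous rewards (semantic, perceptual, localized, consistency, and preference) to produce high-quality images.
Result: On several image editing and compositional generation benchmarks, RewardFlow achieves state-of-the-art edit fidelity and compositional alignment.
Insight: The main contributions are: 1) an inversion-free inference-time guidance framework that avoids a complex model inversion step; 2) the unification of complementary differentiable rewards, plus a novel differentiable VQA-based reward for fine-grained semantic supervision; and 3) a prompt-aware adaptive policy that infers edit intent and adjusts the optimization (reward weights and step sizes) to coordinate heterogeneous objectives.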
Abstract: We introduce RewardFlow, an inversion-free framework that steers pretrained diffusion and flow-matching models at inference time through multi-reward Langevin dynamics. RewardFlow unifies complementary differentiable rewards for semantic alignment, perceptual fidelity, localized grounding, object consistency, and human preference, and further introduces a differentiable VQA-based reward that provides fine-grained semantic supervision through language-vision reasoning. To coordinate these heterogeneous objectives, we design a prompt-aware adaptive policy that extracts semantic primitives from the instruction, infers edit intent, and dynamically modulates reward weights and step sizes throughout sampling. Across several image editing and compositional generation benchmarks, RewardFlow delivers state-of-the-art edit fidelity and compositional alignment.
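The phrase "multi-reward Langevin dynamics" can be unpacked with a small generic sketch. It is not RewardFlow's sampler: the state here is a plain list of floats instead of an image latent, the two quadratic rewards and their analytic gradients are made up, and the weights and step size are fixed constants, whereas the abstract describes a prompt-aware policy that adapts them during sampling. The update itself is the standard Langevin form, a step along the weighted sum of reward gradients plus Gaussian noise scaled by the square root of twice the step size.

```python
import math
import random


def langevin_step(x, reward_grads, weights, step_size, noise_scale=1.0, rng=None):
    """One Langevin update that ascends a weighted sum of rewards.

    x            : current state as a list of floats
    reward_grads : functions mapping x to the gradient of one reward at x
    weights      : one non-negative weight per reward
    noise_scale  : set to 0.0 for a deterministic (pure gradient ascent) step
    """
    rng = rng if rng is not None else random
    grads = [grad(x) for grad in reward_grads]
    new_x = []
    for i, value in enumerate(x):
        drift = sum(w * g[i] for w, g in zip(weights, grads))
        noise = noise_scale * math.sqrt(2.0 * step_size) * rng.gauss(0.0, 1.0)
        new_x.append(value + step_size * drift + noise)
    return new_x


# Two toy rewards (hypothetical): r1 = -(x - 1)^2 and r2 = -(x + 1)^2.
def grad_r1(x):
    return [-2.0 * (v - 1.0) for v in x]


def grad_r2(x):
    return [-2.0 * (v + 1.0) for v in x]


# With weights 0.75 and 0.25 the weighted reward peaks at x = 0.5.
state = [3.0]
for _ in range(200):
    state = langevin_step(
        state, [grad_r1, grad_r2], weights=[0.75, 0.25], step_size=0.1, noise_scale=0.0
    )
```

With the noise switched off the state settles at the maximiser of the weighted reward, which shows how changing the weights moves the point the sampler is pulled toward.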
[122] ParseBench: A Document Parsing Benchmark for AI Agents cs.CVPDF
Boyang Zhang, Sebastián G. Acosta, Preston Carlson, Sacha Bron, Pierre-Loïc Doulcet
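TL;DR: The paper presents ParseBench, a document parsing benchmark designed to evaluate semantic correctness in document parsing for AI agents. It contains about 2,000 human-verified pages from enterprise documents in insurance, finance, and government, organized around five capability dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding.
Details
Motivation: Existing benchmarks do not fully capture what AI agents need from document parsing in enterprise automation; they rely on narrow document distributions and text-similarity metrics that miss the semantic errors that matter most to agents.
Result: Across 14 methods spanning vision-language models, specialized document parsers, and LlamaParse, the benchmark reveals a fragmented capability landscape: no method is consistently strong across all five dimensions. LlamaParse Agentic leads with the highest overall score (the exact percentage is not given in the abstract), and the benchmark highlights the capability gaps that remain in current systems.
Insight: The novelty is a document parsing benchmark focused on semantic correctness and aimed at enterprise automation, which stresses that parsed output must preserve the structure and meaning needed for autonomous decisions (such as correct table structure and precise chart data), going beyond traditional text-similarity evaluation. Viewed objectively, its multi-dimensional evaluation framework and diverse enterprise document dataset give a useful tool and direction for measuring and advancing AI agents' understanding of complex documents.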
Abstract: AI agents are changing the requirements for document parsing. What matters is semantic correctness: parsed output must preserve the structure and meaning needed for autonomous decisions, including correct table structure, precise chart data, semantically meaningful formatting, and visual grounding. Existing benchmarks do not fully capture this setting for enterprise automation, relying on narrow document distributions and text-similarity metrics that miss agent-critical failures. We introduce ParseBench, a benchmark of approximately 2,000 human-verified pages from enterprise documents spanning insurance, finance, and government, organized around five capability dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Across 14 methods spanning vision-language models, specialized document parsers, and LlamaParse, the benchmark reveals a fragmented capability landscape: no method is consistently strong across all five dimensions. LlamaParse Agentic achieves the highest overall score, and the benchmark highlights the remaining capability gaps across current systems. Dataset and evaluation code are available on HuggingFace (https://huggingface.co/datasets/llamaindex/ParseBench) and GitHub (https://github.com/run-llama/ParseBench).
[123] Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction cs.CVPDF
Tao Xie, Peishan Yang, Yudong Jin, Yingfeng Cai, Wei Yin
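TL;DR: This paper proposes Scal3R, a scalable test-time training method for large-scale 3D scene reconstruction from long video sequences. A novel neural global context representation efficiently compresses and retains long-range scene information, and lightweight neural sub-networks are rapidly adapted at test time with self-supervised objectives, improving reconstruction accuracy and consistency without significant computational overhead.
Details
Motivation: Existing feed-forward reconstruction models regress 3D geometry directly from RGB images, but limited memory capacity and an inability to capture global contextual cues make it hard for them to maintain accuracy and consistency over long sequences. Inspired by how humans use global scene understanding to inform local perception, the paper targets long-sequence consistency in large-scale 3D reconstruction.
Result: Experiments on several large-scale benchmarks, including KITTI Odometry and Oxford Spires, show that the method handles ultra-large scenes effectively, achieving leading pose accuracy and state-of-the-art 3D reconstruction accuracy while remaining efficient.
Insight: The novelty is a learnable neural global context representation together with a mechanism for rapidly adapting lightweight sub-networks through self-supervised test-time training, which substantially expands the model's memory capacity and long-range dependency modeling and yields an efficient, scalable solution for large-scale 3D reconstruction.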
Abstract: This paper addresses the task of large-scale 3D scene reconstruction from long video sequences. Recent feed-forward reconstruction models have shown promising results by directly regressing 3D geometry from RGB images without explicit 3D priors or geometric constraints. However, these methods often struggle to maintain reconstruction accuracy and consistency over long sequences due to limited memory capacity and the inability to effectively capture global contextual cues. In contrast, humans can naturally exploit the global understanding of the scene to inform local perception. Motivated by this, we propose a novel neural global context representation that efficiently compresses and retains long-range scene information, enabling the model to leverage extensive contextual cues for enhanced reconstruction accuracy and consistency. The context representation is realized through a set of lightweight neural sub-networks that are rapidly adapted during test time via self-supervised objectives, which substantially increases memory capacity without incurring significant computational overhead. The experiments on multiple large-scale benchmarks, including the KITTI Odometry and Oxford Spires datasets, demonstrate the effectiveness of our approach in handling ultra-large scenes, achieving leading pose accuracy and state-of-the-art 3D reconstruction accuracy while maintaining efficiency. Code is available at https://zju3dv.github.io/scal3r.
[124] Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models cs.CV | cs.AIPDF
Shilin Yan, Jintao Tong, Hongwei Xue, Xiaojun Tang, Yangyang Wang
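TL;DR: This paper proposes HDPO, a framework that decouples the optimization objectives for task accuracy and tool efficiency to address multimodal agents' over-reliance on external tools, and uses it to train Metis, a model that decides for itself when to use tools.
Details
Motivation: Current multimodal agents have a meta-cognitive deficit: they cannot arbitrate sensibly between using internal knowledge and calling external tools, so they invoke tools blindly, which adds latency and injects noise that disrupts reasoning.
Result: In extensive evaluations, Metis reduces tool invocations by orders of magnitude while also improving reasoning accuracy.
Insight: The novelty is to reframe tool efficiency from a competing scalar objective into a strictly conditional one. Decoupled optimization channels (an accuracy channel and an efficiency channel) induce a cognitive curriculum in which the agent first masters task solving and then improves its self-reliance.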
Abstract: The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a profound meta-cognitive deficit: they struggle to arbitrate between leveraging internal knowledge and querying external utilities. Consequently, they frequently fall prey to blind tool invocation, resorting to reflexive tool execution even when queries are resolvable from the raw visual context. This pathological behavior precipitates severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols attempt to mitigate this via a scalarized reward that penalizes tool usage. Yet, this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is entirely subsumed by the variance of the accuracy reward during advantage normalization, rendering it impotent against tool overuse. To transcend this bottleneck, we propose HDPO, a framework that reframes tool efficiency from a competing scalar objective to a strictly conditional one. By eschewing reward scalarization, HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy exclusively within accurate trajectories via conditional advantage estimation. This decoupled architecture naturally induces a cognitive curriculum, compelling the agent to first master task resolution before refining its self-reliance. Extensive evaluations demonstrate that our resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously elevating reasoning accuracy.
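The idea of an efficiency signal that applies "exclusively within accurate trajectories" can be sketched as follows. The abstract does not give HDPO's equations, so this is an illustration under stated assumptions: rewards are binary correctness flags, efficiency is measured as the negative number of tool calls, both channels use simple group normalization, and the function names are hypothetical. How the two channels are combined in the policy update is not stated in the abstract, so the sketch keeps them separate.

```python
from statistics import mean, pstdev


def normalize(values, eps=1e-8):
    """Subtract the mean and divide by the standard deviation of the given values."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def accuracy_advantages(correct):
    """Accuracy channel: group-normalized correctness, unaffected by tool usage."""
    return normalize([1.0 if flag else 0.0 for flag in correct])


def efficiency_advantages(correct, tool_calls):
    """Efficiency channel: compare tool usage only among correct rollouts.

    Incorrect rollouts get exactly zero here, so using fewer tools can never
    compensate for a wrong answer.
    """
    advantages = [0.0] * len(correct)
    correct_ids = [i for i, flag in enumerate(correct) if flag]
    if len(correct_ids) < 2:
        return advantages  # nothing to compare against
    normalized = normalize([-float(tool_calls[i]) for i in correct_ids])
    for i, value in zip(correct_ids, normalized):
        advantages[i] = value
    return advantages


# One group of four rollouts for the same query.
correct = [True, True, False, True]
tool_calls = [0, 4, 9, 2]
acc_adv = accuracy_advantages(correct)
eff_adv = efficiency_advantages(correct, tool_calls)
```

In this toy group the wrong rollout gets no efficiency signal at all, while among the correct ones the rollout with zero tool calls is preferred over the one with four, which is the conditional behaviour the abstract contrasts with a single scalar penalty.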
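[125] When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models cs.CV
Zhengyang Sun, Yu Chen, Xin Zhou, Xiaofan Li, Xiwu Chen
TL;DR: This paper proposes NUMINA, a training-free identify-then-guide framework that improves the alignment between textual numerals and visual instances in text-to-video diffusion models. It selects discriminative self- and cross-attention heads to identify prompt-layout inconsistencies and derive a countable latent layout, then conservatively refines that layout and modulates cross-attention to guide video regeneration.
Details
Motivation: Text-to-video diffusion models often fail to generate the number of objects specified in the prompt; the goal is to improve their numerical alignment.
Result: On the newly introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on the 5B and 14B models, respectively. CLIP alignment also improves while temporal consistency is maintained.
Insight: The novelty is a training-free structural guidance method that identifies attention inconsistencies and modulates cross-attention to steer generation. The results indicate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video generation.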
Abstract: Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on 5B and 14B models, respectively. Furthermore, CLIP alignment is improved while maintaining temporal consistency. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at https://github.com/H-EmbodVis/NUMINA.
cs.CY [Back]
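[126] The Weaponization of Computer Vision: Tracing Military-Surveillance Ties through Conference Sponsorship cs.CY | cs.AI | cs.CV
Noa Garcia, Amelia Katirai
TL;DR: This paper studies the ties between computer vision and military and surveillance applications, analyzing the activities of sponsors of top conferences to reveal the extent to which the field's technology is being weaponized.
Details
Motivation: Computer vision tends to present itself as a neutral technology, yet it is historically rooted in military funding and increasingly deployed in warfare and surveillance. The paper aims to expose the current state of this dual use.
Result: 44% of conference sponsors have a direct connection to military or surveillance applications. Two case studies examine the opportunities and limitations of sponsorship as a means of uncovering technological weaponization.
Insight: The paper uses conference sponsorship data in a novel way to trace technological weaponization, revealing implicit ties between the academic field and military surveillance and providing an empirical basis for discussions of technology ethics and policy.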
Abstract: Computer vision, a core domain of artificial intelligence (AI), is the field that enables the computational analysis, understanding, and generation of visual data. Despite being historically rooted in military funding and increasingly deployed in warfare, the field tends to position itself as a neutral, purely technical endeavor, failing to engage in discussions about its dual-use applications. Yet it has been reported that computer vision systems are being systematically weaponized to assist in technologies that inflict harm, such as surveillance or warfare. Expanding on these concerns, we study the extent to which computer vision research is being used in the military and surveillance domains. We do so by collecting a dataset of tech companies with financial ties to the field’s central research exchange platform: conferences. Conference sponsorship, we argue, not only serves as strong evidence of a company’s investment in the field but also provides a privileged position for shaping its trajectory. By investigating sponsors’ activities, we reveal that 44% of them have a direct connection with military or surveillance applications. We extend our analysis through two case studies in which we discuss the opportunities and limitations of sponsorship as a means for uncovering technological weaponization.
cs.IR [Back]
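[127] SubSearch: Intermediate Rewards for Unsupervised Guided Reasoning in Complex Retrieval cs.IR | cs.AI | cs.CL
Roxana Petcu, Evangelos Kanoulas, Maarten de Rijke
TL;DR: This paper proposes SubSearch, a framework that introduces intrinsic process rewards to incentivize the model to plan high-quality reasoning steps, improving the multi-step reasoning of large language models on complex retrieval tasks.
Details
Motivation: Complex queries require multi-step reasoning without a clear path, and existing process reward modeling depends on external supervision such as human annotation or large-model judges.
Result: Experiments on seven benchmarks, covering QA and multi-hop QA datasets, show that intrinsic rewards yield more robust reasoning traces than outcome rewards alone.
Insight: The novelty is the shift from outcome-only supervision to intrinsic process rewards that directly optimize the generator without external supervision. This enables more autonomous information-intensive reasoning and offers a data-efficient alternative for building effective reasoning traces.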
Abstract: Large language models (LLMs) are probabilistic in nature and perform more reliably when augmented with external information. As complex queries often require multi-step reasoning over the retrieved information, with no clear or predetermined reasoning path, they remain challenging. Recent approaches train models using reinforcement learning on the model’s outcome, showing promise in improving how models handle complex information. We introduce SubSearch, a specialized framework that shifts from outcome-only supervision to intermediate reward signals that incentivize planning high-quality reasoning. Unlike previous work on process reward modeling, which focuses on training a separate reward model with annotated trajectories by either human annotators or large LLM judges, SubSearch directly optimizes the generator using intrinsic process rewards, which we define as internally-derived rewards, eliminating the need for external supervision, and moving towards autonomous information-intensive reasoning. Experiments on seven benchmarks show that rewarding intermediate reasoning steps with intrinsic rewards leads to more robust reasoning traces in both QA and multi-hop QA datasets over using only outcome rewards. SubSearch can help in building reasoning traces that allow agents to better integrate search engines for complex query answering, while offering a data-efficient alternative to supervised process modeling.
cs.MA [Back]
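[128] More Capable, Less Cooperative? When LLMs Fail At Zero-Cost Collaboration cs.MA | cs.AI | cs.CL
Advait Yadav, Sid Black, Oliver Sourbut
TL;DR: This paper studies the cooperative behavior of large language model (LLM) agents in a zero-cost collaboration setting and finds that greater model capability does not necessarily increase willingness to cooperate and may even reduce collective performance. Using a frictionless multi-agent experimental environment, the authors separate cooperation failures from competence failures and test interventions such as protocols and incentives.
Details
Motivation: The paper asks whether LLM agents in multi-agent systems fail to cooperate when helping others carries neither personal cost nor personal benefit and they are explicitly instructed to cooperate, in order to understand potential obstacles in real-world coordination problems such as knowledge sharing.
Result: The more capable model (OpenAI o3) achieves only 17% of optimal collective performance, while the less capable model (OpenAI o3-mini) reaches 50%. Causal decomposition and intervention tests show that explicit protocols double performance for low-competence models and that tiny sharing incentives improve models with weak cooperation.
Insight: The paper builds an experimental environment that strips away strategic complexity to study cooperation in isolation, and proposes a causal analysis that separates cooperation failures from competence failures. The core insight is that scaling model intelligence alone will not solve coordination problems in multi-agent systems; deliberate cooperative design is needed even when helping others costs nothing.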
Abstract: Large language model (LLM) agents increasingly coordinate in multi-agent systems, yet we lack an understanding of where and why cooperation failures may arise. In many real-world coordination problems, from knowledge sharing in organizations to code documentation, helping others carries negligible personal cost while generating substantial collective benefits. However, whether LLM agents cooperate when helping neither benefits nor harms the helper, while being given explicit instructions to do so, remains unknown. We build a multi-agent setup designed to study cooperative behavior in a frictionless environment, removing all strategic complexity from cooperation. We find that capability does not predict cooperation: OpenAI o3 achieves only 17% of optimal collective performance while OpenAI o3-mini reaches 50%, despite identical instructions to maximize group revenue. Through a causal decomposition that automates one side of agent communication, we separate cooperation failures from competence failures, tracing their origins through agent reasoning analysis. Testing targeted interventions, we find that explicit protocols double performance for low-competence models, and tiny sharing incentives improve models with weak cooperation. Our findings suggest that scaling intelligence alone will not solve coordination problems in multi-agent systems and will require deliberate cooperative design, even when helping others costs nothing.
cs.CR [Back]
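[129] PrivFedTalk: Privacy-Aware Federated Diffusion with Identity-Stable Adapters for Personalized Talking-Head Generation cs.CR | cs.AI | cs.CV | cs.LG
Soumya Mazumdar, Vineet Kumar Rakesh, Tapas Samanta
TL;DR: PrivFedTalk is a privacy-aware federated learning framework for personalized talking-head generation. It combines a conditional latent diffusion model with parameter-efficient LoRA identity adapters: the diffusion backbone is shared through federated training, while lightweight identity adapters are learned from local private audio-visual data, protecting user privacy.
Details
Motivation: Training diffusion-based talking-head generators relies on centralized face-video and speech datasets, which raises privacy concerns. The problem is sharper for personalized generation, where identity-specific data are highly sensitive and cannot be shared across users or devices.
Result: Comparative experiments against FedAvg and FedProx under multiple training and aggregation conditions show that PrivFedTalk achieves stable federated optimization and completes end-to-end training and evaluation under constrained resources, supporting the feasibility of privacy-aware personalized talking-head training in federated settings.
Insight: Contributions include LoRA identity adapters for parameter-efficient identity adaptation; Identity-Stable Federated Aggregation (ISFA) to handle heterogeneous client distributions; Temporal-Denoising Consistency (TDC) regularization to reduce inter-frame drift and flicker; and secure aggregation combined with client-level differential privacy to limit update-side privacy risk.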
Abstract: Talking-head generation has advanced rapidly with diffusion-based generative models, but training usually depends on centralized face-video and speech datasets, raising major privacy concerns. The problem is more acute for personalized talking-head generation, where identity-specific data are highly sensitive and often cannot be pooled across users or devices. PrivFedTalk is presented as a privacy-aware federated framework for personalized talking-head generation that combines conditional latent diffusion with parameter-efficient identity adaptation. A shared diffusion backbone is trained across clients, while each client learns lightweight LoRA identity adapters from local private audio-visual data, avoiding raw data sharing and reducing communication cost. To address heterogeneous client distributions, Identity-Stable Federated Aggregation (ISFA) weights client updates using privacy-safe scalar reliability signals computed from on-device identity consistency and temporal stability estimates. Temporal-Denoising Consistency (TDC) regularization is introduced to reduce inter-frame drift, flicker, and identity drift during federated denoising. To limit update-side privacy risk, secure aggregation and client-level differential privacy are applied to adapter updates. The implementation supports both low-memory GPU execution and multi-GPU client-parallel training on heterogeneous shared hardware. Comparative experiments on the present setup across multiple training and aggregation conditions with PrivFedTalk, FedAvg, and FedProx show stable federated optimization and successful end-to-end training and evaluation under constrained resources. The results support the feasibility of privacy-aware personalized talking-head training in federated environments, while suggesting that stronger component-wise, privacy-utility, and qualitative claims need further standardized evaluation.
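The abstract says ISFA weights client updates by scalar reliability signals but does not give the weighting rule. As a minimal sketch, assuming the weights are simply the reliabilities normalized to sum to one (an assumption, not the paper's formula), aggregation could look like this. The function name and the flat-list representation of adapter updates are hypothetical.

```python
def reliability_weighted_average(updates, reliabilities):
    """Hypothetical sketch of reliability-weighted aggregation.

    updates       -- list of client updates, each a flat list of floats
    reliabilities -- one non-negative scalar per client (how it is computed
                     on-device is not specified in the abstract)
    """
    if len(updates) != len(reliabilities):
        raise ValueError("one reliability score per client is required")
    total = sum(reliabilities)
    if total <= 0:
        raise ValueError("at least one client must have positive reliability")

    # Normalize the scalar signals into aggregation weights.
    weights = [r / total for r in reliabilities]

    # Weighted coordinate-wise average of the client updates.
    dim = len(updates[0])
    return [
        sum(w * u[k] for w, u in zip(weights, updates))
        for k in range(dim)
    ]
```

With equal reliabilities this reduces to a plain unweighted mean of the updates; FedAvg in its usual form weights by local dataset size instead, so the sketch differs from it only in where the weights come from.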
cs.LG [Back]
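[130] Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference cs.LG | cs.CL
Quantong Qiu, Zhiyi Hong, Yi Yang, Haitian Wang, Kebin Liu
TL;DR: This paper proposes Flux Attention, a context-aware hybrid attention framework that targets the scalability bottleneck caused by the quadratic complexity of standard attention in long-context LLM scenarios. It integrates a lightweight Layer Router into frozen pretrained LLMs and dynamically routes each layer to full or sparse attention based on the input context, preserving high-fidelity information retrieval while turning theoretical compute reductions into practical inference speedups.
Details
Motivation: The quadratic complexity of standard attention is the main scalability bottleneck for LLMs in long-context scenarios. Existing hybrid attention methods typically rely on static allocation ratios that cannot adapt to the varying retrieval demands of different tasks, and head-level dynamic sparsity often causes computational load imbalance and synchronization long-tails that hinder hardware acceleration during autoregressive decoding.
Result: Extensive experiments on multiple long-context and mathematical reasoning benchmarks show that Flux Attention achieves a better trade-off between performance and inference speed than baseline models, with speedups of up to 2.8× and 2.0× in the prefill and decode stages, respectively.
Insight: The novelty is a layer-level dynamic routing mechanism: a context-aware Layer Router adaptively selects the attention mode, which preserves retrieval fidelity and ensures contiguous memory access, so that theoretical compute reductions become actual wall-clock speedups. The approach is parameter-efficient and can be integrated into frozen models with only a small amount of training.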
Abstract: The quadratic computational complexity of standard attention mechanisms presents a severe scalability bottleneck for LLMs in long-context scenarios. While hybrid attention mechanisms combining Full Attention (FA) and Sparse Attention (SA) offer a potential solution, existing methods typically rely on static allocation ratios that fail to accommodate the variable retrieval demands of different tasks. Furthermore, head-level dynamic sparsity often introduces severe computational load imbalance and synchronization long-tails, which hinder hardware acceleration during autoregressive decoding. To bridge this gap, we introduce Flux Attention, a context-aware framework that dynamically optimizes attention computation at the layer level. By integrating a lightweight Layer Router into frozen pretrained LLMs, the proposed method adaptively routes each layer to FA or SA based on the input context. This layer-wise routing preserves high-fidelity information retrieval while ensuring contiguous memory access, translating theoretical computational reductions into practical wall-clock speedups. As a parameter-efficient approach, our framework requires only 12 hours of training on 8$\times$A800 GPUs. Extensive experiments across multiple long-context and mathematical reasoning benchmarks demonstrate that Flux Attention achieves a superior trade-off between performance and inference speed compared with baseline models, with speed improvements of up to $2.8\times$ and $2.0\times$ in the prefill and decode stages.
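[131] SOLAR: Communication-Efficient Model Adaptation via Subspace-Oriented Latent Adapter Reparametrization cs.LG | cs.CL | cs.CV
Seyed Mahmoud Sajjadi Mohammadabadi, Xiaolong Ma, Lei Yang, Feng Yan, Junshan Zhang
TL;DR: The paper proposes SOLAR, a post-training compression framework that substantially reduces the communication and storage costs of parameter-efficient fine-tuning (PEFT) methods such as LoRA. SOLAR represents each PEFT update as a linear combination of basis vectors built from the foundation model's singular vectors, exploiting the subspace similarity between the foundation model and task-specific updates to obtain compact yet expressive representations. The framework is model-agnostic, compatible with existing PEFT methods, and validated on language and vision tasks.
Details
Motivation: PEFT methods still incur high communication and storage costs in resource-constrained settings; reducing them would enable efficient deployment of foundation models in distributed systems and on edge devices.
Result: In experiments on language and vision tasks with LLaMA, GPT, and ViT models, SOLAR preserves task performance while significantly reducing model representation size, providing a communication-efficient solution.
Insight: The novelty is exploiting the subspace similarity between the foundation model and task updates: adapters are reparameterized with basis vectors built from singular vectors and random perturbations, which decouples adapter size from PEFT structure and compresses without losing expressiveness. Its model-agnostic design and theoretical bound on reconstruction error make it a general and well-grounded approach to PEFT compression.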
Abstract: Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, enable scalable adaptation of foundation models by injecting low-rank adapters. However, their communication and storage costs remain a major bottleneck in resource-constrained settings. We propose SOLAR (Subspace-Oriented Latent Adapter Reparameterization), a post-training compression framework that substantially reduces the communication cost (i.e., the number of parameters to transmit or store) of PEFT adapters. SOLAR expresses each PEFT update as a linear combination of basis vectors formed from the foundation model’s singular vectors with controlled random perturbations. By exploiting the subspace similarity (the alignment of principal directions) between the foundation model and task-specific fine-tuned updates, SOLAR decouples the adapter size from PEFT structure and ensures compact yet expressive representations. It is model-agnostic and compatible with existing PEFT methods, including LoRA, AdaLoRA, and other adapter modules. We theoretically establish a bound on the reconstruction error. Experiments on language and vision tasks using LLaMA, GPT, and ViT models demonstrate that SOLAR preserves task performance while significantly reducing model representation sizes, offering an effective and communication-efficient solution for deployment in distributed systems and edge devices.
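[132] Kuramoto Oscillatory Phase Encoding: Neuro-inspired Synchronization for Improved Learning Efficiency cs.LG | cs.CV | cs.NE
Mingqing Xiao, Yansen Wang, Dongqi Han, Caihua Shan, Dongsheng Li
TL;DR: This paper proposes Kuramoto oscillatory Phase Encoding (KoPE), a neuro-inspired synchronization mechanism that is added to Vision Transformers as an additional, evolving phase state. It aims to improve the training, parameter, and data efficiency of vision models through synchronization-enhanced structure learning.
Details
Motivation: Current deep learning architectures mainly represent and propagate information through activation values and neglect the joint dynamics of rate and phase, whereas spatiotemporal neural dynamics and oscillatory synchronization in biological information processing are thought to support flexible coordination such as feature binding. The paper therefore brings this neuro-inspired synchronization mechanism into deep learning to improve learning efficiency.
Result: Experiments show that KoPE improves the training, parameter, and data efficiency of vision models and benefits tasks requiring structured understanding, including semantic and panoptic segmentation, representation alignment with language, and few-shot abstract visual reasoning (ARC-AGI). Theoretical analysis and empirical verification further suggest that KoPE accelerates attention concentration, which improves learning efficiency.
Insight: The novelty is introducing the phase synchronization of the Kuramoto oscillator model into Vision Transformers as an additional dynamic phase state, enhancing structure learning through neuro-inspired synchronization. The mechanism is scalable and neuroscience-inspired, and may help advance state-of-the-art neural network models.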
Abstract: Spatiotemporal neural dynamics and oscillatory synchronization are widely implicated in biological information processing and have been hypothesized to support flexible coordination such as feature binding. By contrast, most deep learning architectures represent and propagate information through activation values, neglecting the joint dynamics of rate and phase. In this work, we introduce Kuramoto oscillatory Phase Encoding (KoPE) as an additional, evolving phase state to Vision Transformers, incorporating a neuro-inspired synchronization mechanism to advance learning efficiency. We show that KoPE can improve training, parameter, and data efficiency of vision models through synchronization-enhanced structure learning. Moreover, KoPE benefits tasks requiring structured understanding, including semantic and panoptic segmentation, representation alignment with language, and few-shot abstract visual reasoning (ARC-AGI). Theoretical analysis and empirical verification further suggest that KoPE can accelerate attention concentration for learning efficiency. These results indicate that synchronization can serve as a scalable, neuro-inspired mechanism for advancing state-of-the-art neural network models.
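The abstract does not state how KoPE updates its phase state. As background only, the classic all-to-all Kuramoto model that the name refers to evolves each phase as dθ_i/dt = ω_i + (K/N) Σ_j sin(θ_j − θ_i). The sketch below integrates that textbook model with an explicit Euler step and measures synchronization with the standard order parameter r = |(1/N) Σ_j exp(iθ_j)|. It is not the paper's implementation; the function names and the step size are illustrative choices.

```python
import cmath
import math


def kuramoto_step(phases, omegas, coupling, dt):
    """One explicit Euler step of the classic all-to-all Kuramoto model."""
    n = len(phases)
    updated = []
    for i in range(n):
        # Each oscillator is pulled toward the phases of all the others.
        pull = sum(math.sin(phases[j] - phases[i]) for j in range(n))
        updated.append(phases[i] + dt * (omegas[i] + coupling / n * pull))
    return updated


def order_parameter(phases):
    """Synchronization level in [0, 1]; 1 means all phases coincide."""
    return abs(sum(cmath.exp(1j * p) for p in phases)) / len(phases)


if __name__ == "__main__":
    phases = [0.0, 1.0, 2.0, 3.0]
    omegas = [0.0, 0.0, 0.0, 0.0]  # identical natural frequencies
    print("before:", round(order_parameter(phases), 3))
    for _ in range(400):
        phases = kuramoto_step(phases, omegas, coupling=2.0, dt=0.05)
    print("after: ", round(order_parameter(phases), 3))
```

With identical natural frequencies and positive coupling, the phases in this example drift together and the order parameter approaches 1, which is the synchronization behavior the abstract alludes to.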
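[133] Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings cs.LG | cs.CV
Yunxiang Peng, Mengmeng Ma, Ziyu Yao, Xi Peng
TL;DR: This paper proposes a new way to assess the generalization of Vision Transformers (ViTs) by analyzing the model's inner workings, i.e., circuits. For two practical scenarios, selecting the best model before deployment and monitoring performance under distribution shift after deployment, it introduces two label-free proxy metrics: Dependency Depth Bias and Circuit Shift Score.
Details
Motivation: Existing generalization proxies based on model output, such as confidence or accuracy-on-the-line, are often unreliable because they ignore the internal mechanisms that produce the output. The paper addresses the pressing need to assess generalization reliably when target labels are scarce or the distribution shifts.
Result: Across multiple tasks, the two proposed metrics correlate with generalization performance significantly better than existing proxies, improving by an average of 13.4% and 34.1%, respectively.
Insight: The novelty is using the model's internal causal interactions (circuits) as predictors of generalization, going beyond traditional output-only methods. This gives a more fundamental and reliable perspective on model evaluation, particularly in label-free or distribution-shift settings.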
Abstract: Reliable generalization metrics are fundamental to the evaluation of machine learning models. Especially in high-stakes applications where labeled target data are scarce, evaluation of models’ generalization performance under distribution shift is a pressing need. We focus on two practical scenarios: (1) Before deployment, how to select the best model for unlabeled target data? (2) After deployment, how to monitor model performance under distribution shift? The central need in both cases is a reliable and label-free proxy metric. Yet existing proxy metrics, such as model confidence or accuracy-on-the-line, are often unreliable as they only assess model output while ignoring the internal mechanisms that produce them. We address this limitation by introducing a new perspective: using the inner workings of a model, i.e., circuits, as a predictive metric of generalization performance. Leveraging circuit discovery, we extract the causal interactions between internal representations as a circuit, from which we derive two metrics tailored to the two practical scenarios. (1) Before deployment, we introduce Dependency Depth Bias, which measures different models’ generalization capability on target data. (2) After deployment, we propose Circuit Shift Score, which predicts a model’s generalization under different distribution shifts. Across various tasks, both metrics demonstrate significantly improved correlation with generalization performance, outperforming existing proxies by an average of 13.4% and 34.1%, respectively. Our code is available at https://github.com/deep-real/GenCircuit.
cs.SD [Back]
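[134] Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning cs.SD | cs.CV
Linge Wang, Yingying Chen, Bingke Zhu, Lu Zhou, Jinqiao Wang
TL;DR: This paper proposes a Teacher-Guided Dual-Path framework (TG-DP) for improving audio-visual representation learning. By decoupling masked reconstruction and contrastive alignment into separate optimization paths, it resolves the semantic noise and optimization interference caused by a single forward pass in existing methods, improving the quality of cross-modal representations.
Details
Motivation: When contrastive alignment and masked reconstruction are optimized jointly, the contrastive branch is forced to use the randomly visible patches designed for reconstruction rather than a pattern suited to cross-modal alignment, which introduces semantic noise and optimization interference.
Result: TG-DP achieves state-of-the-art zero-shot retrieval. On AudioSet, R@1 improves from 35.2% to 37.4% for video-to-audio retrieval and from 27.9% to 37.1% for audio-to-video retrieval. Linear-probe performance on AS20K and VGGSound is also state of the art.
Insight: The core novelty is decoupling the multimodal objectives (reconstruction and alignment) into two paths and adding teacher guidance to organize visible tokens in the contrastive path. This reduces interference and stabilizes learning, providing an effective framework for large-scale audio-visual pretraining.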
Abstract: Recent advances in audio-visual representation learning have shown the value of combining contrastive alignment with masked reconstruction. However, jointly optimizing these objectives in a single forward pass forces the contrastive branch to rely on randomly visible patches designed for reconstruction rather than cross-modal alignment, introducing semantic noise and optimization interference. We propose TG-DP, a Teacher-Guided Dual-Path framework that decouples reconstruction and alignment into separate optimization paths. By disentangling the masking regimes of the two branches, TG-DP enables the contrastive pathway to use a visibility pattern better suited to cross-modal alignment. A teacher model further provides auxiliary guidance for organizing visible tokens in this branch, helping reduce interference and stabilize cross-modal representation learning. TG-DP achieves state-of-the-art performance in zero-shot retrieval. On AudioSet, it improves R@1 from 35.2% to 37.4% for video-to-audio retrieval and from 27.9% to 37.1% for audio-to-video retrieval. The learned representations also remain semantically robust, achieving state-of-the-art linear-probe performance on AS20K and VGGSound. Taken together, our results suggest that decoupling multimodal objectives and introducing teacher-guided structure into the contrastive pathway provide an effective framework for improving large-scale audio-visual pretraining. Code is available at https://github.com/wanglg20/TG-DP.
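The retrieval numbers above are reported as R@1. For readers unfamiliar with the metric, here is a minimal sketch of how recall at rank 1 is usually computed from a query-by-candidate similarity matrix, assuming the correct candidate for query i sits at index i (the common paired-data convention, assumed here rather than taken from the paper). The function name is illustrative.

```python
def recall_at_1(similarity):
    """Fraction of queries whose top-scoring candidate is the paired one.

    similarity -- square matrix as a list of rows; similarity[i][j] is the
                  score between query i and candidate j, and the true match
                  for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Index of the highest-scoring candidate for this query.
        best = max(range(len(row)), key=row.__getitem__)
        if best == i:
            hits += 1
    return hits / len(similarity)
```

For video-to-audio retrieval the rows would be video queries and the columns audio candidates; transposing the matrix gives the audio-to-video direction, which is why the two directions can yield different scores.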
cs.RO [Back]
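[135] EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World cs.RO | cs.CV
Ryan Punamiya, Simar Kareer, Zeyi Liu, Josh Citron, Ri-Zhao Qiu
TL;DR: This paper introduces EgoVerse, a collaborative platform for robot learning that unifies the collection, processing, and access of egocentric human data, with the goal of advancing robot learning through large-scale, diverse human demonstrations. The current release contains 1,362 hours (80k episodes) of human data covering 1,965 tasks, 240 scenes, and 2,087 demonstrators, with standardized formats, manipulation-relevant annotations, and tooling for downstream learning. A large-scale human-to-robot transfer study finds that policy performance generally improves with more human data, but that effective scaling depends on alignment between the human data and robot learning objectives.
Details
Motivation: Robot learning depends on large and diverse data, but robot data collection is expensive and hard to scale. Egocentric human data capture rich manipulation behavior in everyday environments, yet existing datasets are limited in scope, difficult to extend, and fragmented across institutions, so a unified platform is needed.
Result: Experiments under shared protocols across multiple labs, tasks, and robot embodiments show that policy performance improves with more human data, but that effective scaling depends on alignment between the data and the learning objectives. The EgoVerse dataset contains 1,362 hours of data across a wide range of tasks and scenes, laying a foundation for reproducible progress in robot learning.
Insight: The contribution is the collaborative platform EgoVerse, which unifies the collection, processing, and access of human data and supports contributions from individual researchers, academic labs, and industry partners. Large-scale experiments validate that human data scale for robot learning and underline the importance of data alignment.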
Abstract: Robot learning increasingly depends on large and diverse data, yet robot data collection remains expensive and difficult to scale. Egocentric human data offer a promising alternative by capturing rich manipulation behavior across everyday environments. However, existing human datasets are often limited in scope, difficult to extend, and fragmented across institutions. We introduce EgoVerse, a collaborative platform for human data-driven robot learning that unifies data collection, processing, and access under a shared framework, enabling contributions from individual researchers, academic labs, and industry partners. The current release includes 1,362 hours (80k episodes) of human demonstrations spanning 1,965 tasks, 240 scenes, and 2,087 unique demonstrators, with standardized formats, manipulation-relevant annotations, and tooling for downstream learning. Beyond the dataset, we conduct a large-scale study of human-to-robot transfer with experiments replicated across multiple labs, tasks, and robot embodiments under shared protocols. We find that policy performance generally improves with increased human data, but that effective scaling depends on alignment between human data and robot learning objectives. Together, the dataset, platform, and study establish a foundation for reproducible progress in human data-driven robot learning. Videos and additional information can be found at https://egoverse.ai/
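[136] RoboAgent: Chaining Basic Capabilities for Embodied Task Planning cs.RO | cs.CV
Peiran Xu, Jiaqi Zheng, Yadong Mu
TL;DR: This paper proposes RoboAgent, a capability-driven planning framework for embodied task planning. A scheduler invokes different sub-capability modules, decomposing complex long-horizon planning into a sequence of basic vision-language problems, so that a vision-language model (VLM) can better handle multi-turn interaction, long-horizon reasoning, and extended context analysis.
Details
Motivation: Although current VLMs perform well at multimodal understanding and reasoning, their performance remains limited in embodied task planning that involves multi-turn interaction, long-horizon planning, and extended context analysis. The paper aims to bridge this gap.
Result: Extensive experiments on widely used embodied task planning benchmarks validate the effectiveness of the proposed approach.
Insight: The core novelty is a capability-driven planning pipeline that decomposes complex planning into basic sub-problems that VLMs handle better, giving a more transparent and controllable reasoning process. A multi-stage training paradigm of behavior cloning, DAgger training, and reinforcement learning, with high-quality supervision built from the environment simulator's internal information, improves performance in diverse scenarios.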
Abstract: This paper focuses on embodied task planning, where an agent acquires visual observations from the environment and executes atomic actions to accomplish a given task. Although recent Vision-Language Models (VLMs) have achieved impressive results in multimodal understanding and reasoning, their performance remains limited when applied to embodied planning that involves multi-turn interaction, long-horizon reasoning, and extended context analysis. To bridge this gap, we propose RoboAgent, a capability-driven planning pipeline in which the model actively invokes different sub-capabilities. Each capability maintains its own context, and produces intermediate reasoning results or interacts with the environment according to the query given by a scheduler. This framework decomposes complex planning into a sequence of basic vision-language problems that VLMs can better address, enabling a more transparent and controllable reasoning process. The scheduler and all capabilities are implemented with a single VLM, without relying on external tools. To train this VLM, we adopt a multi-stage paradigm that consists of: (1) behavior cloning with expert plans, (2) DAgger training using trajectories collected by the model, and (3) reinforcement learning guided by an expert policy. Across these stages, we exploit the internal information of the environment simulator to construct high-quality supervision for each capability, and we further introduce augmented and synthetic data to enhance the model’s performance in more diverse scenarios. Extensive experiments on widely used embodied task planning benchmarks validate the effectiveness of the proposed approach. Our codes will be available at https://github.com/woyut/RoboAgent_CVPR26.
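[137] Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles cs.RO | cs.CV
Jiawei Liu, Xun Gong, Fen Fang, Muli Yang, Bohao Qu
TL;DR: This paper proposes an instruction-realization framework based on a large language model (LLM) and multi-planner scheduling for handling open-ended passenger instructions in autonomous vehicles. The LLM interprets natural-language instructions and generates executable scripts that schedule multiple model predictive control (MPC)-based motion planners; planned trajectories are then converted into control signals, yielding a transparent, traceable decision chain in a closed-loop setting.
Details
Motivation: Human-Machine Interaction (HMI) research often overlooks passengers' maneuvering needs in autonomous driving, and translating open-ended natural-language instructions into control signals without sacrificing interpretability and traceability remains challenging.
Result: On the proposed closed-loop benchmark for open-ended instruction realization, the framework significantly improves task-completion rates, reduces LLM query costs, matches specialized autonomous driving approaches on safety and compliance, and shows considerable tolerance to LLM inference latency.
Insight: The novelty is a scheduling-centric design that decouples semantic reasoning from vehicle control at different timescales, establishing a transparent decision chain from high-level instructions to low-level actions. To address the lack of high-fidelity evaluation tools, the paper also introduces a closed-loop benchmark.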
Abstract: Most Human-Machine Interaction (HMI) research overlooks the maneuvering needs of passengers in autonomous driving (AD). Natural language offers an intuitive interface, yet translating passenger open-ended instructions into control signals, without sacrificing interpretability and traceability, remains a challenge. This study proposes an instruction-realization framework that leverages a large language model (LLM) to interpret instructions, generates executable scripts that schedule multiple model predictive control (MPC)-based motion planners based on real-time feedback, and converts planned trajectories into control signals. This scheduling-centric design decouples semantic reasoning from vehicle control at different timescales, establishing a transparent, traceable decision-making chain from high-level instructions to low-level actions. Due to the absence of high-fidelity evaluation tools, this study introduces a benchmark for open-ended instruction realization in a closed-loop setting. Comprehensive experiments reveal that the framework significantly improves task-completion rates over instruction-realization baselines, reduces LLM query costs, achieves safety and compliance on par with specialized AD approaches, and exhibits considerable tolerance to LLM inference latency. For more qualitative illustrations and a clearer understanding.
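[138] Fail2Drive: Benchmarking Closed-Loop Driving Generalization cs.RO | cs.CV
Simon Gerstenecker, Andreas Geiger, Katrin Renz
TL;DR: The paper presents Fail2Drive, the first paired-route benchmark for evaluating closed-loop autonomous driving generalization in the CARLA simulator. It contains 200 routes and 17 new scenario classes spanning appearance, layout, behavioral, and robustness shifts, and quantifies the effect of distribution shift by pairing each shifted route with an in-distribution counterpart. Evaluating multiple state-of-the-art models shows an average success-rate drop of 22.8% and uncovers unexpected failure modes such as ignoring objects visible in the LiDAR.
Details
Motivation: Existing autonomous driving benchmarks fall short in evaluating closed-loop generalization: they typically reuse training scenarios at test time, so success may reflect memorization rather than robust behavior, and true generalization under distribution shift goes unmeasured.
Result: Evaluating multiple state-of-the-art models on Fail2Drive shows consistent degradation, with an average success-rate drop of 22.8%. The analysis reveals unexpected failure modes, such as ignoring objects clearly visible in the LiDAR and failing to learn the basic concepts of free and occupied space.
Insight: The novelty is the first paired-route benchmark, which isolates the effect of distribution shift and turns qualitative failures into quantitative diagnostics. An open-source toolbox for creating new scenarios and validating solvability via a privileged expert policy establishes a foundation for reproducibly evaluating and improving closed-loop driving generalization.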
Abstract: Generalization under distribution shift remains a central bottleneck for closed-loop autonomous driving. Although simulators like CARLA enable safe and scalable testing, existing benchmarks rarely measure true generalization: they typically reuse training scenarios at test time. Success can therefore reflect memorization rather than robust driving behavior. We introduce Fail2Drive, the first paired-route benchmark for closed-loop generalization in CARLA, with 200 routes and 17 new scenario classes spanning appearance, layout, behavioral, and robustness shifts. Each shifted route is matched with an in-distribution counterpart, isolating the effect of the shift and turning qualitative failures into quantitative diagnostics. Evaluating multiple state-of-the-art models reveals consistent degradation, with an average success-rate drop of 22.8%. Our analysis uncovers unexpected failure modes, such as ignoring objects clearly visible in the LiDAR and failing to learn the fundamental concepts of free and occupied space. To accelerate follow-up work, Fail2Drive includes an open-source toolbox for creating new scenarios and validating solvability via a privileged expert policy. Together, these components establish a reproducible foundation for benchmarking and improving closed-loop driving generalization. We open-source all code, data, and tools at https://github.com/autonomousvision/fail2drive.
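The paired-route design can be illustrated with a small calculation. The sketch below assumes each pair records a boolean success for the in-distribution route and for its shifted counterpart, and reports the drop as an absolute difference in success rate. The abstract does not say whether the 22.8% figure is absolute or relative, so that choice, along with the function name and data layout, is an assumption.

```python
def paired_success_drop(pairs):
    """Success rates on paired routes and the drop caused by the shift.

    pairs -- list of (in_distribution_success, shifted_success) booleans,
             one tuple per route pair (hypothetical data layout)
    Returns (base_rate, shifted_rate, absolute_drop).
    """
    if not pairs:
        raise ValueError("at least one route pair is required")
    n = len(pairs)
    base_rate = sum(1 for base, _ in pairs if base) / n
    shifted_rate = sum(1 for _, shifted in pairs if shifted) / n
    # Because every shifted route has a matched counterpart, the difference
    # is attributable to the shift rather than to route difficulty.
    return base_rate, shifted_rate, base_rate - shifted_rate
```

Pairing is what makes the difference interpretable: comparing success rates on two unrelated route sets would confound the shift with differences in route length, traffic, or layout.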
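[139] SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds cs.RO | cs.AI | cs.CV
Yunsong Zhou, Hangxu Liu, Xuekun Jiang, Xing Shen, Yuanzhen Zhou
TL;DR: SIM1 is a physics-aligned real-to-sim-to-real data engine that addresses data scarcity in deformable object manipulation. It digitizes real scenes into metric-consistent twins, calibrates deformable dynamics, and generates trajectories with a diffusion model, expanding sparse real demonstrations into high-quality synthetic supervision.
Details
Motivation: Deformable object manipulation is data-intensive, and conventional simulation built on rigid-body abstractions mismatches the real world in geometry, soft-body dynamics, and motion primitives, producing low-quality simulated data. SIM1 aligns simulation with the physical world to provide scalable supervision.
Result: Policies trained on purely synthetic data match real-data baselines at a 1:15 equivalence ratio, and achieve 90% zero-shot success and 50% generalization gains in real-world deployment.
Insight: The novelty is physics-aligned simulation: digitizing real scenes, calibrating dynamics through elastic modeling, and generating trajectories with diffusion aligns simulated data with the physical world, yielding high-quality, scalable synthetic supervision and a practical route to data-efficient policy learning.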
Abstract: Robotic manipulation with deformable objects represents a data-intensive regime in embodied learning, where shape, contact, and topology co-evolve in ways that far exceed the variability of rigids. Although simulation promises relief from the cost of real-world data acquisition, prevailing sim-to-real pipelines remain rooted in rigid-body abstractions, producing mismatched geometry, fragile soft dynamics, and motion primitives poorly suited for cloth interaction. We posit that simulation fails not for being synthetic, but for being ungrounded. To address this, we introduce SIM1, a physics-aligned real-to-sim-to-real data engine that grounds simulation in the physical world. Given limited demonstrations, the system digitizes scenes into metric-consistent twins, calibrates deformable dynamics through elastic modeling, and expands behaviors via diffusion-based trajectory generation with quality filtering. This pipeline transforms sparse observations into scaled synthetic supervision with near-demonstration fidelity. Experiments show that policies trained on purely synthetic data achieve parity with real-data baselines at a 1:15 equivalence ratio, while delivering 90% zero-shot success and 50% generalization gains in real-world deployment. These results validate physics-aligned simulation as scalable supervision for deformable manipulation and a practical pathway for data-efficient policy learning.
cs.AI [Back]
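[140] ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training cs.AI | cs.CL | cs.LG
Yu Liang, Liangxin Liu, Longzheng Wang, Yan Wang, Yueyang Zhang
TL;DR: This paper proposes ConsistRM, a self-training framework for improving generative reward models (GRMs). By introducing a consistency-aware answer reward and a consistency-aware critique reward, it addresses conventional GRMs' reliance on human-annotated data and the instability of self-training, enabling more stable and effective training without human annotations.
Details
Motivation: GRMs offer greater representational capacity and flexibility than traditional scalar reward models but face two challenges: reliance on costly human-annotated data limits scalability, and self-training methods often suffer from instability and vulnerability to reward hacking.
Result: Experiments on five benchmark datasets across four base models show that ConsistRM outperforms vanilla reinforcement fine-tuning by an average of 1.5%, improves output consistency, and mitigates position bias caused by input order.
Insight: The novelty is the Consistency-Aware Answer Reward, which provides reliable pseudo-labels with temporal consistency, and the Consistency-Aware Critique Reward, which assesses semantic consistency across multiple critiques and assigns fine-grained, differentiated rewards. This offers a new approach to stable GRM training without human annotation, and the consistency mechanism may also help mitigate instability in self-training more generally.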
Abstract: Generative reward models (GRMs) have emerged as a promising approach for aligning Large Language Models (LLMs) with human preferences by offering greater representational capacity and flexibility than traditional scalar reward models. However, GRMs face two major challenges: reliance on costly human-annotated data restricts scalability, and self-training approaches often suffer from instability and vulnerability to reward hacking. To address these issues, we propose ConsistRM, a self-training framework that enables effective and stable GRM training without human annotations. ConsistRM incorporates the Consistency-Aware Answer Reward, which produces reliable pseudo-labels with temporal consistency, thereby providing more stable model optimization. Moreover, the Consistency-Aware Critique Reward is introduced to assess semantic consistency across multiple critiques and allocates fine-grained and differentiated rewards. Experiments on five benchmark datasets across four base models demonstrate that ConsistRM outperforms vanilla Reinforcement Fine-Tuning (RFT) by an average of 1.5%. Further analysis shows that ConsistRM enhances output consistency and mitigates position bias caused by input order, highlighting the effectiveness of consistency-aware rewards in improving GRMs.
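Position bias caused by input order can be probed with a simple swap test: present each response pair to the judge in both orders and check whether the same underlying response wins. The sketch below is a generic illustration of that idea, not the evaluation protocol used in the paper. The `judge` callable, which returns 0 or 1 for the preferred slot, and the two toy judges in the usage note are hypothetical.

```python
def position_consistency(judge, pairs):
    """Fraction of pairs judged the same way regardless of presentation order.

    judge -- callable taking (first, second) and returning 0 if it prefers
             the first slot or 1 if it prefers the second (hypothetical API)
    pairs -- list of (response_a, response_b) tuples
    """
    if not pairs:
        raise ValueError("at least one pair is required")
    consistent = 0
    for first, second in pairs:
        # Ask once in the original order and once with the slots swapped.
        picked_forward = (first, second)[judge(first, second)]
        picked_backward = (second, first)[judge(second, first)]
        # An order-invariant judge picks the same response both times.
        if picked_forward == picked_backward:
            consistent += 1
    return consistent / len(pairs)
```

As a usage example, a toy judge that always returns 0 (always prefers the first slot) scores 0.0 on pairs of distinct responses, while a toy judge that prefers the longer response scores 1.0, since its choice does not depend on slot position.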
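[141] ReflectRM: Boosting Generative Reward Models via Self-Reflection within a Unified Judgment Framework cs.AI | cs.CL
Kai Qin, Liangxin Liu, Yu Liang, Longzheng Wang, Yan Wang
TL;DR: This paper proposes ReflectRM, a new generative reward model (GRM) that uses a self-reflection mechanism to jointly model response preference and analysis preference within a unified generative framework, improving alignment quality. It raises average accuracy by 3.7 points across four benchmarks and improves on positional bias by 10.2 points, making it a more stable evaluator.
Details
Motivation: Existing GRMs focus mainly on outcome-level supervision and neglect the quality of the analytical process, which limits their potential. This paper aims to strengthen preference modeling by using self-reflection to assess analysis quality.
Result: Across four benchmarks, ReflectRM achieves an average accuracy gain of +3.7 on Qwen3-4B and a +10.2 improvement on positional bias compared with leading GRMs, yielding more stable evaluation.
Insight: The novelty is a self-reflection mechanism that unifies the modeling of response and analysis preferences and derives the final prediction from the most reliable analysis. Viewed objectively, supervising process quality in this way strengthens the generalization and stability of GRMs.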
Abstract: Reward Models (RMs) are critical components in the Reinforcement Learning from Human Feedback (RLHF) pipeline, directly determining the alignment quality of Large Language Models (LLMs). Recently, Generative Reward Models (GRMs) have emerged as a superior paradigm, offering higher interpretability and stronger generalization than traditional scalar RMs. However, existing methods for GRMs focus primarily on outcome-level supervision, neglecting analytical process quality, which constrains their potential. To address this, we propose ReflectRM, a novel GRM that leverages self-reflection to assess analytical quality and enhance preference modeling. ReflectRM is trained under a unified generative framework for joint modeling of response preference and analysis preference. During inference, we use its self-reflection capability to identify the most reliable analysis, from which the final preference prediction is derived. Experiments across four benchmarks show that ReflectRM consistently improves performance, achieving an average accuracy gain of +3.7 on Qwen3-4B. Further experiments confirm that response preference and analysis preference are mutually reinforcing. Notably, ReflectRM substantially mitigates positional bias, yielding +10.2 improvement compared with leading GRMs and establishing itself as a more stable evaluator.
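[142] Reasoning Graphs: Deterministic Agent Accuracy through Evidence-Centric Chain-of-Thought Feedback cs.AI | cs.CL
Matthew Penaroza
TL;DR: This paper proposes two graph structures, reasoning graphs and retrieval graphs, to address the low accuracy and high variance that arise when language model agents reason from scratch on similar queries. Reasoning graphs persist the agent's chain of thought about each evidence item as structured edges, enabling evidence-centric feedback; retrieval graphs tighten the selection of candidate evidence. Together they form a feedback loop that self-improves without retraining, through context engineering via graph traversal alone, improving agent accuracy and reducing variance on multi-hop question answering tasks.
Details
Motivation: Existing language model agents reason from scratch on every query and discard prior chains of thought, so accuracy on similar queries is unstable and variance is high. The paper addresses this by persisting and structuring the agent's reasoning process.
Result: On multi-hop question answering benchmarks, evaluated with a sequential cluster evaluation protocol, the method is reported to raise accuracy and sharply reduce variance (variance collapse), with all improvement achieved without retraining the base model.
Insight: The core novelty is an evidence-centric feedback mechanism: reasoning graphs attach chains of thought to specific evidence items rather than to the query, enabling backward traversal from the evidence item inward, a capability structurally different from query-similarity retrieval. Combined with retrieval graphs into a self-improving loop, this offers a new route to more deterministic and traceable agents.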
Abstract: Language model agents reason from scratch on every query: each time an agent retrieves evidence and deliberates, the chain of thought is discarded and the next similar query starts with no prior insight. This produces lower accuracy and high variance, as the same type of query can succeed or fail unpredictably. We introduce reasoning graphs, a graph structure that persists an agent’s per-evidence chain of thought as structured edges connected to the evidence items they evaluate. Unlike prior memory mechanisms that store distilled strategies as flat records indexed by query similarity or appended by recency, reasoning graphs enable evidence-centric feedback: given a new candidate set, the system traverses all incoming evaluation edges for each evidence item across all prior runs, surfacing how that specific item has been judged before. This backward traversal from evidence inward is a structurally different capability from query-similarity retrieval, because the feedback is tied to the specific evidence the agent is currently examining, not to the query. We further introduce retrieval graphs, a complementary structure that feeds a pipeline planner to tighten the candidate funnel over successive runs. Together, both graphs form a self-improving feedback loop: accuracy rises and variance collapses over successive runs, with every decision fully traceable through the graph. This improvement requires no retraining; the base model remains frozen and all gains come from context engineering via graph traversal. We formalize the graph structure, traversal algorithms, and feedback mechanisms, and describe a sequential cluster evaluation protocol for measuring accuracy convergence and variance collapse on multi-hop question answering benchmarks.
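The evidence-centric lookup described in the abstract above can be illustrated with a small sketch. It is not the authors' implementation: the class names, the `verdict` labels, and the in-memory index are all assumptions made for illustration. The point it shows is that prior judgments are indexed by evidence item, so feedback is retrieved from the candidate set rather than from query similarity.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluationEdge:
    """One stored judgment: how a given run assessed a given evidence item."""
    run_id: str
    evidence_id: str
    verdict: str    # hypothetical labels, e.g. "supports" or "irrelevant"
    rationale: str  # the per-evidence chain of thought, kept as text


class ReasoningGraph:
    """Hypothetical minimal store of evaluation edges, keyed by evidence item."""

    def __init__(self):
        # Edges are indexed by the evidence they point to, so a lookup
        # starts from the evidence item rather than from the query.
        self._incoming = defaultdict(list)

    def record(self, edge):
        """Persist one evaluation edge after a run has judged an evidence item."""
        self._incoming[edge.evidence_id].append(edge)

    def feedback_for(self, candidate_ids):
        """Return all prior evaluation edges for each candidate evidence item."""
        return {eid: list(self._incoming.get(eid, [])) for eid in candidate_ids}
```

In use, a new run would call `feedback_for` on its candidate set and place the returned rationales in the agent's context; items never seen before simply return an empty list.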
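[143] How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles cs.AI | cs.CL
Chenchen Kuai, Jiwan Jiang, Zihao Zhu, Hao Wang, Keshu Wu
TL;DR: This paper proposes a statistical framework for auditing behavioral dependence among black-box large language models (LLMs). It introduces a multi-resolution hierarchy and two information-theoretic metrics (a Difficulty-Weighted Behavioral Entanglement Index and Cumulative Information Gain) to quantify behavioral entanglement, i.e., hidden dependencies induced by shared data, distillation, and alignment pipelines. Experiments on 18 LLMs reveal widespread entanglement and its negative effect on LLM-as-a-judge evaluation, and show that de-entangled verifier ensemble reweighting improves performance.
Details
Motivation: In the LLM ecosystem, shared pretraining data, distillation, and alignment pipelines induce hidden behavioral dependencies (behavioral entanglement) among models. These dependencies undermine the independence assumption behind multi-model systems such as LLM-as-a-judge pipelines and ensemble verification, leading to correlated reasoning patterns and synchronized failures.
Result: Extensive experiments on 18 LLMs from six model families find widespread behavioral entanglement. Cumulative Information Gain (CIG) shows a statistically significant association with degradation in judge precision (Spearman coefficient 0.64, p < 0.001 for GPT-4o-mini; 0.71, p < 0.01 for Llama3-based judges). The proposed de-entangled verifier ensemble reweighting achieves up to a 4.5% accuracy gain over majority voting.
Insight: The novelty is a statistical auditing framework for quantifying behavioral dependence among LLMs, in particular the Difficulty-Weighted Behavioral Entanglement Index and Cumulative Information Gain. Viewed objectively, the framework provides a methodology for assessing and mitigating hidden dependencies among models, and its strategy of reweighting ensemble members by inferred independence is of practical value for building more robust multi-model systems.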
Abstract: The rapid growth of the large language model (LLM) ecosystem raises a critical question: are seemingly diverse models truly independent? Shared pretraining data, distillation, and alignment pipelines can induce hidden behavioral dependencies, latent entanglement, that undermine multi-model systems such as LLM-as-a-judge pipelines and ensemble verification, which implicitly assume independent signals. In practice, this manifests as correlated reasoning patterns and synchronized failures, where apparent agreement reflects shared error modes rather than independent validation. To address this, we develop a statistical framework for auditing behavioral entanglement among black-box LLMs. Our approach introduces a multi-resolution hierarchy that characterizes the joint failure manifold through two information-theoretic metrics: (i) a Difficulty-Weighted Behavioral Entanglement Index, which amplifies synchronized failures on easy tasks, and (ii) a Cumulative Information Gain (CIG) metric, which captures directional alignment in erroneous responses. Through extensive experiments on 18 LLMs from six model families, we identify widespread behavioral entanglement and analyze its impact on LLM-as-a-judge evaluation. We find that CIG exhibits a statistically significant association with degradation in judge precision, with Spearman coefficient of 0.64 (p < 0.001) for GPT-4o-mini and 0.71 (p < 0.01) for Llama3-based judges, indicating that stronger dependency corresponds to increased over-endorsement bias. Finally, we demonstrate a practical use case of entanglement through de-entangled verifier ensemble reweighting. By adjusting model contributions based on inferred independence, the proposed method mitigates correlated bias and improves verification performance, achieving up to a 4.5% accuracy gain over majority voting.
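The abstract above does not give the reweighting rule, so the following is only a toy sketch of the general idea of down-weighting verifiers that are strongly dependent on others. The formula `1 / (1 + total pairwise dependence)`, the function names, and the pairwise scores are all assumptions for illustration, not the paper's metrics or method.

```python
def independence_weights(entanglement):
    """Turn a pairwise dependence matrix into normalized verifier weights.

    entanglement[i][j] is an assumed dependence score in [0, 1] between
    verifiers i and j. The weighting rule below is hypothetical.
    """
    n = len(entanglement)
    raw = [
        1.0 / (1.0 + sum(entanglement[i][j] for j in range(n) if j != i))
        for i in range(n)
    ]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_verdict(votes, weights):
    """Accept only if the accepting verifiers hold more than half the weight."""
    accept_mass = sum(w for v, w in zip(votes, weights) if v)
    return accept_mass > 0.5


# Three mutually entangled verifiers (0.9 pairwise) and two independent ones.
E = [
    [0.0, 0.9, 0.9, 0.0, 0.0],
    [0.9, 0.0, 0.9, 0.0, 0.0],
    [0.9, 0.9, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
votes = [True, True, True, False, False]
weights = independence_weights(E)
decision = weighted_verdict(votes, weights)
```

In this constructed case a plain majority vote accepts (three of five), while the weighted vote rejects because the three agreeing verifiers are treated as largely redundant.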
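[144] Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution cs.AI | cs.CL
Monishwaran Maheswaran, Leon Lakhani, Zhongzhu Zhou, Shijia Yang, Junxiong Wang
TL;DR: This paper proposes Squeeze Evolve, a unified multi-model orchestration framework that targets the diversity and efficiency bottlenecks of verifier-free evolution. It follows one core principle: allocate model capability to the stages where its marginal utility is highest, using strong models for high-impact stages and cheaper models for the remaining stages to cut cost.
Details
Motivation: Verifier-free evolution faces two main bottlenecks: without external correction, repeated evolution accelerates collapse toward narrow modes, and uniform use of a high-cost model wastes compute and becomes economically impractical.
Result: On AIME 2025, HMMT 2025, LiveCodeBench V6, GPQA-Diamond, ARC-AGI-V2, and multimodal vision benchmarks such as MMMU-Pro and BabyVision, Squeeze Evolve consistently improves the cost-capability frontier and achieves new state-of-the-art results on several tasks. Empirically, it reduces API cost by up to about 3x and increases fixed-budget serving throughput by up to about 10x. On discovery tasks, it is the first verifier-free evolutionary method to match, and in some cases exceed, verifier-based evolutionary methods.
Insight: The core novelty is a lightweight, unified multi-model orchestration principle that allocates models of differing capability (open-source, closed-source, or mixed deployments) according to marginal utility, jointly addressing diversity collapse and cost efficiency and enabling efficient evolutionary inference without an external verifier.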
Abstract: We show that verifier-free evolution is bottlenecked by both diversity and efficiency: without external correction, repeated evolution accelerates collapse toward narrow modes, while the uniform use of a high-cost model wastes compute and quickly becomes economically impractical. We introduce Squeeze Evolve, a unified multi-model orchestration framework for verifier-free evolutionary inference. Our approach is guided by a simple principle: allocate model capability where it has the highest marginal utility. Stronger models are reserved for high-impact stages, while cheaper models handle the other stages at much lower costs. This principle addresses diversity and cost-efficiency jointly while remaining lightweight. Squeeze Evolve naturally supports open-source, closed-source, and mixed-model deployments. Across AIME 2025, HMMT 2025, LiveCodeBench V6, GPQA-Diamond, ARC-AGI-V2, and multimodal vision benchmarks, such as MMMU-Pro and BabyVision, Squeeze Evolve consistently improves the cost-capability frontier over single-model evolution and achieves new state-of-the-art results on several tasks. Empirically, Squeeze Evolve reduces API cost by up to $\sim$3$\times$ and increases fixed-budget serving throughput by up to $\sim$10$\times$. Moreover, on discovery tasks, Squeeze Evolve is the first verifier-free evolutionary method to match, and in some cases exceed, the performance of verifier-based evolutionary methods.
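[145] Mitigating Distribution Sharpening in Math RLVR via Distribution-Aligned Hint Synthesis and Backward Hint Annealing cs.AI | cs.CL | cs.LG
Pei-Xi Xie, Che-Yu Lin, Cheng-Lin Yang
TL;DR: This paper proposes Distribution-Aligned Hint Synthesis (DAHS) and Backward Hint Annealing (BHA) to address distribution sharpening and hint exposure in math RLVR (reinforcement learning with verifiable rewards). The method synthesizes teacher hints conditioned on student-style responses and gradually reduces hint exposure during training, aiming to improve low-k accuracy (e.g., pass@1) while preserving large-k performance (e.g., pass@2048).
Details
Motivation: In math RLVR, existing hint-based methods can make hard questions trainable but leave two issues underexplored: teacher-student distribution mismatch, and the need to reduce hint exposure to match no-hint evaluation.
Result: On AIME24, AIME25, and AIME26, the method outperforms the DAPO baseline on both pass@1 and pass@2048 with Qwen3-1.7B-Base; with Llama-3.2-1B-Instruct, the gains are concentrated in the large-k regime (e.g., pass@2048).
Insight: The core contributions are (1) DAHS, which synthesizes verified teacher hints conditioned on student-style responses to align distributions, and (2) BHA, which anneals hint exposure across difficulty buckets and uses per-question hint dropout to preserve no-hint updates throughout RL training. Viewed objectively, the method highlights the effectiveness of using hints early in training to restore learnability on hard questions and then gradually removing them before no-hint evaluation.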
Abstract: Reinforcement learning with verifiable rewards (RLVR) can improve low-$k$ reasoning accuracy while narrowing solution coverage on challenging math questions, and pass@1 gains do not necessarily translate into better large-$k$ performance. Existing hint-based approaches can make challenging questions trainable, but they leave two issues underexplored: teacher-student distribution mismatch and the need to reduce hint exposure to match no-hint evaluation. We address these issues through two components. Distribution-Aligned Hint Synthesis (DAHS) constructs verified teacher hints conditioned on student-style responses. Backward Hint Annealing (BHA) anneals hint exposure across difficulty buckets and uses per-question hint dropout to preserve no-hint updates throughout RL training. We evaluate the method in math RLVR under the DAPO training framework across AIME24, AIME25, and AIME26 using $\texttt{Qwen3-1.7B-Base}$ and $\texttt{Llama-3.2-1B-Instruct}$. On $\texttt{Qwen3-1.7B-Base}$, our method improves both pass@1 and pass@2048 relative to DAPO across the three AIME benchmarks. On $\texttt{Llama-3.2-1B-Instruct}$, the gains are concentrated in the large-$k$ regime. These results suggest that, in math RLVR, hint scaffolding is effective when it restores learnable updates on challenging questions early in training and is then gradually removed before no-hint evaluation.
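The abstract above reports pass@1 and pass@2048 but does not say how they are estimated. A widely used choice is the unbiased estimator 1 - C(n-c, k) / C(n, k), computed from n samples of which c are correct; the sketch below assumes that estimator and is not taken from the paper.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Assumes the common estimator 1 - C(n - c, k) / C(n, k); whether the
    paper uses exactly this form is not stated in the abstract.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # C(n - c, k) / C(n, k) is the chance that a random size-k subset
    # of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With this definition, a question solved in only a handful of many samples contributes almost nothing to pass@1 but close to 1 at large k, which is why narrowed solution coverage can hurt pass@2048 even as pass@1 improves.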
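[146] An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks cs.AI | cs.CL | cs.CY | cs.MA
Gabriel Stefan, Adrian-Marius Dumitran
TL;DR: This paper proposes an agentic evaluation architecture for detecting bias in history textbooks, comprising a multimodal screening agent, a heterogeneous jury of five evaluative agents, and a meta-agent for verdict synthesis and human escalation. Its central contribution is a Source Attribution Protocol that distinguishes textbook narrative from quoted historical sources to prevent misattribution. In an empirical study of Romanian upper-secondary history textbooks, the architecture mitigates over-penalization and is preferred in a blind human evaluation over a heuristic variant and a zero-shot baseline, at a cost of about $2 per textbook.
Details
Motivation: History textbooks often contain implicit biases, nationalist framing, and selective omissions that are difficult to audit at scale, so a scalable evaluation method is needed.
Result: Of 270 screened excerpts from Romanian upper-secondary history textbooks, 83.3% were classified as pedagogically acceptable (mean severity 2.9/7), versus a mean severity of 5.4/7 under the zero-shot baseline. In the blind human evaluation, the Independent Deliberation configuration was preferred in 64.8% of cases over both the heuristic variant and the zero-shot baseline, positioning the approach as an economically viable decision-support tool.
Insight: The novelties are an agentic deliberation architecture that mitigates the systematic false positives of single-model evaluators, and a Source Attribution Protocol that separates narrative from quotation. Viewed objectively, the cost-effectiveness and human-preference results suggest practical value.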
Abstract: History textbooks often contain implicit biases, nationalist framing, and selective omissions that are difficult to audit at scale. We propose an agentic evaluation architecture comprising a multimodal screening agent, a heterogeneous jury of five evaluative agents, and a meta-agent for verdict synthesis and human escalation. A central contribution is a Source Attribution Protocol that distinguishes textbook narrative from quoted historical sources, preventing the misattribution that causes systematic false positives in single-model evaluators. In an empirical study on Romanian upper-secondary history textbooks, 83.3% of 270 screened excerpts were classified as pedagogically acceptable (mean severity 2.9/7), versus 5.4/7 under a zero-shot baseline, demonstrating that agentic deliberation mitigates over-penalization. In a blind human evaluation (18 evaluators, 54 comparisons), the Independent Deliberation configuration was preferred in 64.8% of cases over both a heuristic variant and the zero-shot baseline. At approximately $2 per textbook, these results position agentic evaluation architectures as economically viable decision-support tools for educational governance.
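[147] SAT: Balancing Reasoning Accuracy and Efficiency with Stepwise Adaptive Thinking cs.AI | cs.CL
Weiyang Huang, Xuefeng Bai, Kehai Chen, Xinyang Chen, Yibin Chen
TL;DR: SAT (Stepwise Adaptive Thinking) is a framework for large reasoning models (LRMs) that balances reasoning accuracy and efficiency through step-level adaptive pruning. It models reasoning as a finite-state machine (FSM) with distinct thinking modes (Slow, Normal, Fast, Skip) and uses a lightweight process reward model (PRM) to navigate these states dynamically, compressing easy steps while preserving depth on hard ones.
Details
Motivation: Current LRMs commonly "overthink", generating unnecessarily long reasoning chains, and existing remedies that improve token efficiency often sacrifice fine-grained control or disrupt the logical integrity of the reasoning process. SAT aims to resolve this tension and balance accuracy with efficiency.
Result: Experiments on 9 LRMs and 7 benchmarks show that SAT reduces reasoning tokens by up to 40% while generally maintaining or improving accuracy.
Insight: The novelty is formalizing reasoning as an FSM and using a lightweight PRM for dynamic, difficulty-aware step-level pruning, which gives fine-grained control over the reasoning chain and preserves the core reasoning structure while compressing tokens.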
Abstract: Large Reasoning Models (LRMs) have revolutionized complex problem-solving, yet they exhibit pervasive “overthinking”, generating unnecessarily long reasoning chains. While current solutions improve token efficiency, they often sacrifice fine-grained control or risk disrupting the logical integrity of the reasoning process. To address this, we introduce Stepwise Adaptive Thinking (SAT), a framework that performs step-level, difficulty-aware pruning while preserving the core reasoning structure. SAT formulates reasoning as a Finite-State Machine (FSM) with distinct thinking modes (Slow, Normal, Fast, Skip). It navigates these states dynamically using a lightweight Process Reward Model (PRM), compressing easy steps while preserving depth for hard ones. Experiments across 9 LRMs and 7 benchmarks show that SAT achieves up to 40% reduction in reasoning tokens while generally maintaining or improving accuracy.
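To make the finite-state formulation in the abstract above concrete, here is a toy sketch of a four-mode controller driven by a per-step score. Only the four mode names come from the abstract. The transition rule (move one state at a time), the thresholds `low` and `high`, and the use of a single scalar PRM score are assumptions chosen for illustration.

```python
from enum import Enum


class Mode(Enum):
    SLOW = "slow"
    NORMAL = "normal"
    FAST = "fast"
    SKIP = "skip"


# Modes ordered from most to least deliberate.
ORDER = [Mode.SLOW, Mode.NORMAL, Mode.FAST, Mode.SKIP]


def next_mode(mode, prm_score, low=0.4, high=0.8):
    """Hypothetical transition: speed up after a confident step, slow down after a weak one."""
    i = ORDER.index(mode)
    if prm_score >= high:
        i = min(i + 1, len(ORDER) - 1)  # step looked easy: compress further
    elif prm_score < low:
        i = max(i - 1, 0)               # step looked hard: think more carefully
    return ORDER[i]


def run(scores, start=Mode.NORMAL):
    """Return the mode chosen after each step score, starting from `start`."""
    mode, trace = start, []
    for score in scores:
        mode = next_mode(mode, score)
        trace.append(mode)
    return trace
```

For example, `run([0.9, 0.9, 0.1, 0.5])` moves from Normal to Fast to Skip on the two confident steps, drops back to Fast on the weak step, and stays there on the middling one.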
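[148] Don’t Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents cs.AI | cs.CL | cs.MA
Khushal Sethi
TL;DR: This paper proposes TrACE, a training-free adaptive-compute controller for LLM agent inference. It allocates compute dynamically by measuring action agreement across rollouts: when the model agrees strongly on the next action, the decision is treated as easy and committed immediately; when agreement is low, the decision is treated as uncertain and more samples are drawn. On single-step reasoning and multi-step decision tasks, it matches the accuracy of fixed-budget self-consistency with fewer LLM calls.
Details
Motivation: Existing inference-time compute scaling methods give every decision step the same budget regardless of its difficulty, which can waste compute or leave hard decisions under-resourced. The paper aims for a training-free method that allocates compute according to decision difficulty.
Result: The method is evaluated on GSM8K (single-step reasoning) and MiniHouse (multi-step household navigation) with a Qwen 2.5 3B Instruct model. TrACE-4 matches SC-4 accuracy with 33% fewer LLM calls on GSM8K and 39% fewer on MiniHouse. TrACE-8 matches SC-8 accuracy with 55% fewer calls on GSM8K and 65% fewer on MiniHouse.
Insight: The core novelty is using the agreement among the model's own rollout actions as a free signal of decision difficulty, enabling per-timestep adaptive compute allocation with no training, verifier, or human labels. This supports the hypothesis that the model's output consistency encodes exploitable difficulty information. The authors describe TrACE as the first training-free, per-timestep adaptive-compute controller to be evaluated on multi-step sequential decision tasks.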
Abstract: Inference-time compute scaling has emerged as a powerful technique for improving the reliability of large language model (LLM) agents, but existing methods apply compute uniformly: every decision step receives the same budget regardless of its difficulty. We introduce TrACE (Trajectorical Adaptive Compute via agrEement), a training-free controller that allocates LLM calls adaptively across agent timesteps by measuring inter-rollout action agreement. At each step, TrACE samples a small set of candidate next actions and measures how consistently the model commits to the same action. High agreement signals an easy decision; the controller commits immediately. Low agreement signals uncertainty; the controller samples additional rollouts up to a configurable cap before committing to the plurality action. No learned components, no external verifier, and no human labels are required. We evaluate TrACE against greedy decoding and fixed-budget self-consistency (SC-4, SC-8) on two benchmarks spanning single-step reasoning (GSM8K, n=50) and multi-step household navigation (MiniHouse, n=30), using a Qwen 2.5 3B Instruct model running on CPU. TrACE-4 matches SC-4 accuracy while using 33% fewer LLM calls on GSM8K and 39% fewer on MiniHouse. TrACE-8 matches SC-8 accuracy with 55% fewer calls on GSM8K and 65% fewer on MiniHouse. We further show that inter-rollout agreement is a reliable signal of step-level success, validating the core hypothesis that the model’s own output consistency encodes difficulty information that can be exploited without training. TrACE is the first training-free, per-timestep adaptive-compute controller for LLM agents to be evaluated on multi-step sequential decision tasks.
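The control loop described in the abstract above (sample a few actions, commit on high agreement, otherwise sample more up to a cap and take the plurality action) can be sketched as follows. This is a minimal illustration, not the authors' code: `sample_action`, `k_init`, `k_max`, and `threshold` are hypothetical names and defaults, and agreement is measured here simply as the share of samples matching the most common action.

```python
from collections import Counter


def adaptive_commit(sample_action, k_init=2, k_max=4, threshold=1.0):
    """Sample actions until agreement is high enough or the cap is reached.

    sample_action is a zero-argument callable standing in for one LLM call
    that proposes the next action. Returns (chosen_action, number_of_calls).
    """
    actions = [sample_action() for _ in range(k_init)]
    while True:
        top_action, top_count = Counter(actions).most_common(1)[0]
        agreement = top_count / len(actions)
        # Commit when the samples agree enough, or when the budget is spent;
        # in the latter case the plurality action is returned.
        if agreement >= threshold or len(actions) >= k_max:
            return top_action, len(actions)
        actions.append(sample_action())
```

An easy step where every sample agrees costs only `k_init` calls, while a contested step uses up to `k_max`; the saving relative to fixed-budget self-consistency comes entirely from the easy steps.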
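[149] Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing cs.AI | cs.CL
Wenhao Yuan, Chenchen Lin, Jian Chen, Jinfeng Xu, Xuehe Wang
TL;DR: This paper proposes SAVeR, a framework that uses self-auditing and verification to check an LLM agent's internal belief states before it commits to an action, ensuring faithful reasoning and addressing the systematic behavioral drift caused by unfaithful beliefs propagating through long-horizon tasks.
Details
Motivation: LLM agents treat reasoning trajectories as reliable internal beliefs for guiding actions and updating memory, but coherent reasoning can still violate logical or evidential constraints, so unsupported beliefs are repeatedly stored and propagated across decision steps, causing behavioral drift in long-horizon agentic systems. Most existing strategies rely on consensus mechanisms, conflating agreement with faithfulness.
Result: Extensive experiments on six benchmark datasets show that SAVeR consistently improves reasoning faithfulness while preserving competitive end-task performance.
Insight: The novelty is generating diverse persona-based candidate beliefs, then applying adversarial auditing to localize violations and repairing them through constraint-guided minimal interventions under verifiable acceptance criteria, thereby enforcing verification of internal belief states before action commitment.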
Abstract: In large language model (LLM) agents, reasoning trajectories are treated as reliable internal beliefs for guiding actions and updating memory. However, coherent reasoning can still violate logical or evidential constraints, allowing unsupported beliefs to be repeatedly stored and propagated across decision steps, leading to systematic behavioral drift in long-horizon agentic systems. Most existing strategies rely on the consensus mechanism, conflating agreement with faithfulness. In this paper, inspired by the vulnerability of unfaithful intermediate reasoning trajectories, we propose Self-Audited Verified Reasoning (SAVeR), a novel framework that enforces verification over internal belief states within the agent before action commitment, achieving faithful reasoning. Concretely, we structurally generate persona-based diverse candidate beliefs for selection under a faithfulness-relevant structure space. To achieve reasoning faithfulness, we perform adversarial auditing to localize violations and repair through constraint-guided minimal interventions under verifiable acceptance criteria. Extensive experiments on six benchmark datasets demonstrate that our approach consistently improves reasoning faithfulness while preserving competitive end-task performance.
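[150] Learning Who Disagrees: Demographic Importance Weighting for Modeling Annotator Distributions with DiADEM cs.AI | cs.CL
Samay U. Shetty, Tharindu Cyril Weerasooriya, Deepak Pandita, Christopher M. Homan
TL;DR: This paper proposes DiADEM, a neural architecture that models the distribution of annotator disagreement by learning importance weights over demographic attributes. It encodes annotators through per-demographic projections governed by a learned importance vector α, fuses annotator and item representations via complementary concatenation and Hadamard interactions, and is trained with a novel item-level disagreement loss. On the DICES conversational-safety and VOICED political-offense benchmarks, DiADEM substantially outperforms LLM-as-a-judge and neural baselines on standard and perspectivist metrics and tracks disagreement strongly (r = 0.75 on DICES). The learned α weights indicate that race and age are the most influential demographic factors driving annotator disagreement on both datasets.
Details
Motivation: In subjective annotation, human disagreement is flattened into a single majority label, and existing methods, including prompted LLMs, fail to capture the genuine structure of disagreement shaped by annotators' social identities and lived experiences.
Result: On the DICES and VOICED benchmarks, DiADEM substantially outperforms LLM-as-a-judge and neural baselines on standard and perspectivist metrics, reaches a disagreement-tracking correlation of r = 0.75 on DICES, and learns that race and age are the most influential demographic factors.
Insight: The novelty is explicitly modeling who the annotators are (not just what they label), through learnable demographic importance weights and an item-level disagreement loss, so that NLP systems can more faithfully reflect human interpretive diversity.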
Abstract: When humans label subjective content, they disagree, and that disagreement is not noise. It reflects genuine differences in perspective shaped by annotators’ social identities and lived experiences. Yet standard practice still flattens these judgments into a single majority label, and recent LLM-based approaches fare no better: we show that prompted large language models, even with chain-of-thought reasoning, fail to recover the structure of human disagreement. We introduce DiADEM, a neural architecture that learns “how much each demographic axis matters” for predicting who will disagree and on what. DiADEM encodes annotators through per-demographic projections governed by a learned importance vector $\boldsymbol{\alpha}$, fuses annotator and item representations via complementary concatenation and Hadamard interactions, and is trained with a novel item-level disagreement loss that directly penalizes mispredicted annotation variance. On the DICES conversational-safety and VOICED political-offense benchmarks, DiADEM substantially outperforms both the LLM-as-a-judge and neural model baselines across standard and perspectivist metrics, achieving strong disagreement tracking ($r{=}0.75$ on DICES). The learned $\boldsymbol{\alpha}$ weights reveal that race and age consistently emerge as the most influential demographic factors driving annotator disagreement across both datasets. Our results demonstrate that explicitly modeling who annotators are, not just what they label, is essential for NLP systems that aim to faithfully represent human interpretive diversity.
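[151] Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest cs.AI | cs.CL | cs.CY
Addison J. Wu, Ryan Liu, Shuyue Stella Li, Yulia Tsvetkov, Thomas L. Griffiths
TL;DR: This paper analyzes how large language models behave when user interests conflict with a company's advertising revenue. It proposes a categorization framework grounded in the linguistics and advertising-regulation literature and evaluates several mainstream LLMs in conflict-of-interest scenarios, finding that most models tend to sacrifice user welfare for company incentives.
Details
Motivation: As LLMs are deployed to generate revenue through advertising, models may face conflicts between the user's best interest and the company's commercial incentives. The paper aims to study systematically how LLMs behave in such conflicts and what risks follow.
Result: Most LLMs favor company interests in conflict-of-interest scenarios: for example, Grok 4.1 Fast recommends a sponsored product almost twice as expensive in 83% of cases, GPT 5.1 surfaces sponsored options that disrupt the purchasing process in 94% of cases, and Qwen 3 Next conceals prices in unfavorable comparisons in 24% of cases. Behavior also varies strongly with reasoning level and with the user's inferred socio-economic status.
Insight: The novelty is a categorization framework and evaluation suite for conflicts of interest, which expose hidden risks that LLMs may pose under commercial incentives. Viewed objectively, the study offers a methodological reference for evaluating the commercial-ethics alignment of AI systems and warns of the effect of embedded advertising on model neutrality.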
Abstract: Today’s large language models (LLMs) are trained to align with user preferences through methods such as reinforcement learning. Yet models are beginning to be deployed not merely to satisfy users, but also to generate revenue for the companies that created them through advertisements. This creates the potential for LLMs to face conflicts of interest, where the most beneficial response to a user may not be aligned with the company’s incentives. For instance, a sponsored product may be more expensive but otherwise equal to another; in this case, what does (and should) the LLM recommend to the user? In this paper, we provide a framework for categorizing the ways in which conflicting incentives might lead LLMs to change the way they interact with users, inspired by literature from linguistics and advertising regulation. We then present a suite of evaluations to examine how current models handle these tradeoffs. We find that a majority of LLMs forsake user welfare for company incentives in a multitude of conflict of interest situations, including recommending a sponsored product almost twice as expensive (Grok 4.1 Fast, 83%), surfacing sponsored options to disrupt the purchasing process (GPT 5.1, 94%), and concealing prices in unfavorable comparisons (Qwen 3 Next, 24%). Behaviors also vary strongly with levels of reasoning and users’ inferred socio-economic status. Our results highlight some of the hidden risks to users that can emerge when companies begin to subtly incentivize advertisements in chatbots.
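[152] WorldMAP: Bootstrapping Vision-Language Navigation Trajectory Prediction with Generative World Models cs.AI | cs.CV | cs.RO
Hongjin Chen, Shangyun Jiang, Tonghua Su, Chen Gao, Xinlei Chen
TL;DR: This paper proposes WorldMAP, a framework that uses generative world models to synthesize future views and build semantic-spatial memory, providing structured supervision for trajectory prediction in vision-language navigation. It uses a teacher-student design: the teacher uses the world model to generate videos, ground targets and obstacles, and plan trajectory pseudo-labels, while the student learns to predict multi-hypothesis trajectories directly from vision-language inputs.
Details
Motivation: Predicting a reliable navigation trajectory from a single egocentric observation is challenging: existing vision-language models produce unstable trajectories, and world models, though able to synthesize future views, do not directly provide the grounded supervision signals that navigation learning needs.
Result: On Target-Bench, WorldMAP achieves the best ADE and FDE among compared methods, reducing ADE by 18.0% and FDE by 42.1% relative to the best competing baseline, and lifts a small open-source VLM to DTW performance competitive with proprietary models.
Insight: The novelty is converting world-model-generated future views into structured supervision, using explicit planning to produce trajectory pseudo-labels that train a lightweight student. This suggests that, in embodied navigation, the value of world models lies more in supplying structured supervision for navigation learning than in directly providing action-ready imagined evidence.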
Abstract: Vision-language models (VLMs) and generative world models are opening new opportunities for embodied navigation. VLMs are increasingly used as direct planners or trajectory predictors, while world models support look-ahead reasoning by imagining future views. Yet predicting a reliable trajectory from a single egocentric observation remains challenging. Current VLMs often generate unstable trajectories, and world models, though able to synthesize plausible futures, do not directly provide the grounded signals needed for navigation learning. This raises a central question: how can generated futures be turned into supervision for grounded trajectory prediction? We present WorldMAP, a teacher–student framework that converts world-model-generated futures into persistent semantic-spatial structure and planning-derived supervision. Its world-model-driven teacher builds semantic-spatial memory from generated videos, grounds task-relevant targets and obstacles, and produces trajectory pseudo-labels through explicit planning. A lightweight student with a multi-hypothesis trajectory head is then trained to predict navigation trajectories directly from vision-language inputs. On Target-Bench, WorldMAP achieves the best ADE and FDE among compared methods, reducing ADE by 18.0% and FDE by 42.1% relative to the best competing baseline, while lifting a small open-source VLM to DTW performance competitive with proprietary models. More broadly, the results suggest that, in embodied navigation, the value of world models may lie less in supplying action-ready imagined evidence than in synthesizing structured supervision for navigation learning.
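The abstract above reports ADE and FDE without defining them. Under the standard definitions, which are assumed here rather than taken from the paper, ADE is the mean Euclidean distance between predicted and reference waypoints and FDE is the distance at the final waypoint. A minimal sketch for a single predicted trajectory:

```python
from math import dist


def ade_fde(pred, target):
    """Average and final displacement error between two equal-length trajectories.

    pred and target are sequences of (x, y) waypoints. Standard definitions
    are assumed; the benchmark's exact protocol may differ.
    """
    if not pred or len(pred) != len(target):
        raise ValueError("trajectories must be non-empty and of equal length")
    errors = [dist(p, t) for p, t in zip(pred, target)]
    return sum(errors) / len(errors), errors[-1]
```

For a multi-hypothesis predictor such as the student described above, one common convention is to report the error of the best hypothesis, though the abstract does not state which convention Target-Bench uses.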
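[153] U-CECE: A Universal Multi-Resolution Framework for Conceptual Counterfactual Explanations cs.AI | cs.CV
Angeliki Dimitriou, Nikolaos Chaidos, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou
TL;DR: This paper proposes U-CECE, a universal multi-resolution framework for generating conceptual counterfactual explanations. It balances expressivity against computational efficiency through three levels (atomic concepts, relational sets-of-sets, and structural graphs), and supports a transductive mode based on supervised graph neural networks (GNNs) and an inductive mode based on unsupervised graph autoencoders (GAEs) to suit different data regimes and compute budgets.
Details
Motivation: As AI models grow more complex, explainability is essential for trust, but existing concept-based counterfactual methods trade expressivity against efficiency: atomic concept representations are fast but lack relational context, while full graph representations are more faithful but require solving the NP-hard Graph Edit Distance (GED) problem. U-CECE aims to address this trade-off in a unified way.
Result: Experiments on the structurally divergent CUB and Visual Genome datasets characterize the efficiency-expressivity trade-off across levels; human surveys and LVLM-based evaluation show that the retrieved structural counterfactuals are semantically equivalent to, and often preferred over, exact GED-based ground-truth explanations.
Insight: The novelty is a model-agnostic multi-resolution framework whose layered representations (atomic, relational, structural) adapt to different scenarios, with transductive and inductive modes that balance precision and scalability. Viewed objectively, the framework extends concept explanations from static atomic representations to graph structures, offering a general and efficient approach to explaining complex models.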
Abstract: As AI models grow more complex, explainability is essential for building trust, yet concept-based counterfactual methods still face a trade-off between expressivity and efficiency. Representing underlying concepts as atomic sets is fast but misses relational context, whereas full graph representations are more faithful but require solving the NP-hard Graph Edit Distance (GED) problem. We propose U-CECE, a unified, model-agnostic multi-resolution framework for conceptual counterfactual explanations that adapts to data regime and compute budget. U-CECE spans three levels of expressivity: atomic concepts for broad explanations, relational sets-of-sets for simple interactions, and structural graphs for full semantic structure. At the structural level, both a precision-oriented transductive mode based on supervised Graph Neural Networks (GNNs) and a scalable inductive mode based on unsupervised graph autoencoders (GAEs) are supported. Experiments on the structurally divergent CUB and Visual Genome datasets characterize the efficiency-expressivity trade-off across levels, while human surveys and LVLM-based evaluation show that the retrieved structural counterfactuals are semantically equivalent to, and often preferred over, exact GED-based ground-truth explanations.