cs.CL

[1] Benchmark for Assessing Olfactory Perception of Large Language Models cs.CL | cs.AI

Eftychia Makri, Nikolaos Nakis, Laura Sisson, Gigi Minsky, Leandros Tassiulas

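TL;DR: This paper introduces the Olfactory Perception (OP) benchmark for evaluating how well large language models (LLMs) reason about smell. The benchmark contains 1,010 questions across eight task categories: odor classification, primary descriptor identification, intensity and pleasantness judgments, multi-descriptor prediction, mixture similarity, olfactory receptor activation, and identification of real-world odor sources. Across 21 model configurations, compound-name prompts outperform isomeric SMILES prompts, the best model reaches 64.4% overall accuracy, and ensembling across languages further improves prediction.

Motivation: Current LLM evaluation focuses mainly on visual or auditory information and lacks any assessment of olfactory processing; the paper aims to evaluate LLMs' olfactory reasoning systematically.

Result: On the OP benchmark the best model reaches 64.4% overall accuracy; compound-name prompts beat isomeric SMILES by about 7 percentage points on average (range +2.4 to +18.9); the best ensemble reaches an AUROC of 0.86 when predictions are aggregated across 21 languages.

Insight: The novelty lies in building the first comprehensive olfactory reasoning benchmark and in showing that current LLMs access olfactory knowledge mainly through lexical associations rather than reasoning over molecular structure; cross-language ensembling improves olfactory prediction and offers a new perspective on multimodal reasoning.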

Abstract: Here we introduce the Olfactory Perception (OP) benchmark, designed to assess the capability of large language models (LLMs) to reason about smell. The benchmark contains 1,010 questions across eight task categories spanning odor classification, odor primary descriptor identification, intensity and pleasantness judgments, multi-descriptor prediction, mixture similarity, olfactory receptor activation, and smell identification from real-world odor sources. Each question is presented in two prompt formats, compound names and isomeric SMILES, to evaluate the effect of molecular representations. Evaluating 21 model configurations across major model families, we find that compound-name prompts consistently outperform isomeric SMILES, with gains ranging from +2.4 to +18.9 percentage points (mean approximately +7 points), suggesting current LLMs access olfactory knowledge primarily through lexical associations rather than structural molecular reasoning. The best-performing model reaches 64.4% overall accuracy, which highlights both emerging capabilities and substantial remaining gaps in olfactory reasoning. We further evaluate a subset of the OP benchmark across 21 languages and find that aggregating predictions across languages improves olfactory prediction, with AUROC = 0.86 for the best-performing language ensemble model. LLMs should be able to handle olfactory and not just visual or aural information.
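
To make the cross-language result concrete, here is a minimal sketch of how predictions from several prompt languages could be aggregated and scored with AUROC. The abstract does not say how the ensemble is formed, so the plain mean used here is an assumption, and the labels and scores are made-up toy values, not benchmark data.

```python
def ensemble_mean(per_language_scores):
    """Average each item's score across languages (assumed aggregation rule)."""
    n = len(per_language_scores)
    return [sum(column) / n for column in zip(*per_language_scores)]

def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly;
    tied pairs count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): 1 = descriptor applies, 0 = it does not.
labels = [1, 0, 1, 0]
scores_en = [0.9, 0.6, 0.4, 0.2]  # scores from English prompts
scores_de = [0.5, 0.3, 0.8, 0.6]  # scores from German prompts
ensemble = ensemble_mean([scores_en, scores_de])
```

In this toy case each language makes a different ranking mistake, and averaging cancels both, which is the general intuition behind ensembling partially independent predictors.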


[2] Dynin-Omni: Omnimodal Unified Large Diffusion Language Model cs.CL | cs.AI

Jaeik Kim, Woojin Kim, Jihwan Hong, Yejoon Lee, Sieun Hyeon

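TL;DR: Dynin-Omni is the first masked-diffusion-based omnimodal foundation model that unifies text, image, and speech understanding and generation with video understanding. It performs masked diffusion over a shared discrete token space, which enables iterative refinement under bidirectional context, and it uses a multi-stage training strategy with model-merging-based modality expansion and alignment.

Motivation: Autoregressive unified models must serialize heterogeneous modalities, and compositional unified models must orchestrate external modality-specific decoders. Dynin-Omni aims to provide a single architecture that natively supports unified any-to-any modeling and generation.

Result: Across 19 multimodal benchmarks covering language reasoning, image generation and editing, video understanding, and speech recognition and synthesis, Dynin-Omni scores 87.6 on GSM8K, 1733.6 on MME-P, 61.4 on VideoMME, 0.87 on GenEval, and 2.1 WER on LibriSpeech test-clean, consistently outperforming existing open-source unified models while staying competitive with strong modality-specific expert systems.

Insight: The core innovation is formulating omnimodal modeling as masked diffusion over a shared discrete token space, which gives a unified paradigm for any-to-any modeling. The design supports iterative refinement under bidirectional context and shows the potential of masked diffusion as a unified framework for real-time omnimodal systems, unified cross-modal retrieval and generation, and embodied multimodal agents.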

Abstract: We present Dynin-Omni, the first masked-diffusion-based omnimodal foundation model that unifies text, image, and speech understanding and generation, together with video understanding, within a single architecture. Unlike autoregressive unified models that serialize heterogeneous modalities, or compositional unified models that require orchestration with external modality-specific decoders, Dynin-Omni natively formulates omnimodal modeling as masked diffusion over a shared discrete token space, enabling iterative refinement under bidirectional context. Dynin-Omni adopts a multi-stage training strategy with model-merging-based modality expansion and omnimodal alignment. We evaluate Dynin-Omni across 19 multimodal benchmarks spanning language reasoning, image generation and editing, video understanding, and speech recognition and synthesis. Dynin-Omni achieves 87.6 on GSM8K, 1733.6 on MME-P, 61.4 on VideoMME, 0.87 on GenEval, and 2.1 WER on LibriSpeech test-clean, consistently outperforming existing open-source unified models while remaining competitive with strong modality-specific expert systems. These results demonstrate the potential of masked diffusion as a unified paradigm for any-to-any modeling, providing a flexible foundation for real-time omnimodal systems, unified cross-modal retrieval and generation, and embodied multimodal agents.


[3] Eyla: Toward an Identity-Anchored LLM Architecture with Integrated Biological Priors – Vision, Implementation Attempt, and Lessons from AI-Assisted Development cs.CL | cs.AI

Arif Aditto

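TL;DR: This paper presents the design rationale, implementation attempt, and failure analysis of Eyla, an identity-anchored LLM architecture that integrates biologically inspired subsystems (HiPPO-initialized state-space models, zero-initialized adapters, episodic memory retrieval, and calibrated uncertainty training) and is intended to run as a unified agent operating system on consumer hardware. Unlike approaches that optimize for generic helpfulness, Eyla targets identity consistency: maintaining a coherent self-model under adversarial pressure, admitting uncertainty, and resisting manipulation. The author proposes the Identity Consistency Score (ICS) as a new benchmark for this property and gives an honest account of trying, as a non-programmer, to implement the architecture with AI coding assistants (Claude Code, Cursor): more than $1,000 spent yielded only a 1.27B-parameter model in which 86 brain subsystems contribute less than 2% to the output. The analysis identifies five systematic failure modes of AI-assisted development for novel architectures and offers concrete recommendations.

Motivation: Existing LLMs struggle with identity consistency, that is, maintaining a coherent self-model under adversarial pressure, admitting uncertainty, and resisting manipulation. The paper sets out to design an identity-anchored architecture that integrates biologically inspired subsystems.

Result: The implementation attempt failed, producing a 1.27B-parameter model whose 86 brain subsystems contribute less than 2% to the output. The paper also proposes the Identity Consistency Score (ICS) as a new benchmark for LLM identity consistency.

Insight: The contributions include proposing identity consistency as a key evaluation dimension for LLMs and designing a unified architecture that integrates multiple biologically inspired subsystems. Viewed objectively, the paper is, by its own account, the first to combine an architectural vision with a first-person failure analysis of AI-assisted LLM development. It offers lessons for both the AI systems and AI-assisted software engineering communities, in particular by documenting systematic failure modes of AI-assisted development for novel architectures.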

Abstract: We present the design rationale, implementation attempt, and failure analysis of Eyla, a proposed identity-anchored LLM architecture that integrates biologically-inspired subsystems – including HiPPO-initialized state-space models, zero-initialized adapters, episodic memory retrieval, and calibrated uncertainty training – into a unified agent operating system running on consumer hardware. Unlike existing approaches that optimize models for generic helpfulness, Eyla targets identity consistency: the ability to maintain a coherent self-model under adversarial pressure, admit uncertainty, and resist manipulation. We propose the Identity Consistency Score (ICS), a novel benchmark for evaluating this property across LLMs. We then present an honest account of attempting to implement this architecture using AI coding assistants (Claude Code, Cursor) as a non-programmer, documenting a $1,000+ failure that produced a 1.27B parameter model with 86 brain subsystems contributing less than 2% to output. Our analysis identifies five systematic failure modes of AI-assisted development for novel architectures and offers concrete recommendations. To our knowledge, this is the first paper to combine an architectural vision with a documented first-person failure analysis of AI-assisted LLM development, providing lessons for both the AI systems and AI-assisted software engineering communities.


[4] Finding and Reactivating Post-Trained LLMs’ Hidden Safety Mechanisms cs.CL | cs.AI

Mingjie Li, Wai Man Si, Michael Backes, Yang Zhang, Yisen Wang

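TL;DR: This paper studies the loss of safety that large language models (LLMs) suffer after task-specific post-training, such as training on chain-of-thought data, using large reasoning models (LRMs) as the case study. It finds that post-training masks the base model's original safety mechanisms without removing them. Based on this, the authors propose SafeReAct, a lightweight, low-cost method that restores the suppressed safety behaviors by aligning a few layers with LoRA adapters. Experiments show that it substantially improves LRM safety on harmful prompts without hurting reasoning performance, and results on domain-specific models such as medical LLMs support its generality.

Motivation: Task-specific post-training gives models such as LRMs stronger reasoning but markedly weaker safety. The paper investigates the root cause and looks for a way to restore safety.

Result: Experiments on four state-of-the-art LRMs show that SafeReAct significantly improves safety on harmful prompts without compromising reasoning performance. Additional results on domain-specific models, such as medical models, further confirm its generality and effectiveness.

Insight: The novelty is showing that post-training degrades safety by masking, not removing, the original safety mechanisms, and proposing a lightweight LoRA-based layer-alignment method that reactivates them, balancing safety and performance. The method is cost-effective and potentially general.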

Abstract: Despite the impressive performance of general-purpose large language models (LLMs), they often require fine-tuning or post-training to excel at specific tasks. For instance, large reasoning models (LRMs), such as the DeepSeek-R1 series, demonstrate strong reasoning capabilities after post-training different general large language models on diverse chain-of-thought (CoT) datasets. However, this additional training frequently comes at the cost of reduced safety, as the fine-tuned or post-trained models tend to exhibit more harmful behaviors compared with the regular LLMs before post-training or fine-tuning, potentially leading to harmful outcomes due to their enhanced capabilities. Taking LRMs as an example, we first investigate the underlying cause of this safety degradation in this paper. Our analysis reveals that post-training can mask the original safety mechanisms of the base LLM, while over-amplifying representations related to their post-training ability. But luckily, we also find that LRMs’ safety mechanisms still exist instead of being removed during their post-training. Based on these findings, we propose a lightweight and cost-effective solution called SafeReAct that restores the suppressed safety behaviors by aligning with LoRA adapters on a few layers. Experiments on four state-of-the-art LRMs show that our method significantly improves safety on harmful prompts without compromising reasoning performance. Besides LRMs, additional results on other domain-specific LLMs, like medical models, further confirm the generality and effectiveness of our approach.


[5] MSA-Thinker: Discrimination-Calibration Reasoning with Hint-Guided Reinforcement Learning for Multimodal Sentiment Analysis cs.CL | cs.AI

Miaosen Luo, Zhenhao Yang, Jieshen Long, Jinghu Sun, Yichu Liu

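TL;DR: This paper proposes MSA-Thinker, a training framework for multimodal sentiment analysis that combines structured “Discrimination-Calibration” reasoning with hint-based reinforcement learning, aiming to address the interpretability, annotation-cost, and RL-efficiency problems of existing methods.

Motivation: Multimodal large language models act as “black boxes” in sentiment analysis, which limits interpretability. Existing chain-of-thought methods have high annotation costs, and reinforcement learning methods suffer from low exploration efficiency and sparse rewards, particularly on hard samples.

Result: Experiments on Qwen2.5Omni-7B show higher accuracy on fine-grained sentiment regression, high-quality structured reasoning chains, and superior generalization in cross-domain evaluation.

Insight: The novelty is a training framework that integrates the structured Discrimination-Calibration reasoning paradigm with the Hint-GRPO reinforcement learning algorithm. Its core idea is to use the discrimination phase of the reasoning structure as a verifiable anchor that supplies directional hints for hard samples, which mitigates reward sparsity and improves interpretability and robustness.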

Abstract: Multimodal sentiment analysis aims to understand human emotions by integrating textual, auditory, and visual modalities. Although Multimodal Large Language Models (MLLMs) have achieved state-of-the-art performance via supervised fine-tuning (SFT), their end-to-end “black-box” nature limits interpretability. Existing methods incorporating Chain-of-Thought (CoT) reasoning are hindered by high annotation costs, while Reinforcement Learning (RL) faces challenges such as low exploration efficiency and sparse rewards, particularly on hard samples. To address these issues, we propose a novel training framework that integrates structured Discrimination-Calibration (DC) reasoning with Hint-based Reinforcement Learning. First, we perform cold-start SFT using high-quality CoT data synthesized by a teacher model (Qwen3Omni-30B), which inherently contains the DC structure. This equips the model with a reasoning paradigm that performs macro discrimination followed by fine-grained calibration from the initial stage. Building on this, we propose Hint-GRPO, which leverages the discrimination phase within the DC structure as a verifiable anchor during RL to provide directional hints for hard samples, guiding policy optimization and effectively mitigating the reward sparsity problem. Experiments on the Qwen2.5Omni-7B model demonstrate that our method not only achieves higher accuracy in fine-grained sentiment regression tasks but also generates high-quality structured reasoning chains. Crucially, it exhibits superior generalization capability in cross-domain evaluations. This enhances model interpretability while validating the positive contribution of explicit reasoning steps to model robustness, offering a new paradigm for building trustworthy and efficient sentiment analysis systems.


[6] Think Twice Before You Write – an Entropy-based Decoding Strategy to Enhance LLM Reasoning cs.CL | cs.AI

Jiashu He, Meizhu Liu, Olaitan P Olaleye, Amit Agarwal, M. Avendi

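TL;DR: This paper proposes an entropy-based decoding strategy to strengthen LLM reasoning. At each generation step it computes the entropy of the token distribution to find high-uncertainty positions and branches only at those points, maintaining a dynamic pool of partial rollouts so that computation concentrates where uncertainty is greatest. It also introduces a stopping criterion that evaluates entropy after the full reasoning trace, which enables efficient termination. Experiments show consistently strong accuracy on several reasoning benchmarks.

Motivation: Traditional decoding strategies such as greedy decoding and beam search suffer from error propagation, while sampling-based methods introduce randomness without adequate robustness. Self-consistency improves reliability but at a large computational cost. A decoding strategy is therefore needed that adaptively steers generation, concentrates computation on critical uncertain regions, and stays efficient.

Result: Experiments on GSM8K, AMC2023, and their perturbed variants show consistently strong accuracy. Notably, with smaller LLMs, performance is comparable to GPT-5 at a fraction of the computational cost.

Insight: The novelty is an entropy-guided decoding framework with token-level adaptivity. It dynamically identifies high-uncertainty positions and branches selectively, and it adds a stopping criterion (EAT) that evaluates entropy after the full reasoning trace, improving computational efficiency while preserving reasoning quality. Viewed objectively, it is an effective way to combine an information-theoretic quantity (entropy) with dynamic decisions during decoding, and it offers a new approach to balancing accuracy and efficiency in LLM reasoning.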

Abstract: Decoding strategies play a central role in shaping the reasoning ability of large language models (LLMs). Traditional methods such as greedy decoding and beam search often suffer from error propagation, while sampling-based approaches introduce randomness without adequate robustness. Self-consistency improves reliability by aggregating multiple rollouts, but incurs significant computational overhead. We propose an entropy-guided decoding framework that introduces token-level adaptivity into generation. At each step, the model computes the entropy of the token distribution, identifies high-uncertainty positions, and selectively branches on these vulnerable points. A dynamic pool of partial rollouts is maintained and expanded until solutions are completed, concentrating computation where uncertainty is greatest and avoiding unnecessary exploration in confident regions. To enable efficient termination, we apply a rollout-level Entropy After (EAT) stopping criterion by performing entropy evaluation after the full reasoning trace, rather than incrementally at every step. Experiments on GSM8K, AMC2023, and their perturbed variants demonstrate that our method achieves consistently strong accuracy. Notably, on smaller LLMs, performance is comparable to GPT-5 while operating at a fraction of the cost.
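
The branching rule described in the abstract can be illustrated with a minimal sketch: compute the Shannon entropy of a next-token distribution and branch only when it exceeds a threshold. The threshold value and the toy distributions below are assumptions for illustration; the paper's actual branching rule, pool management, and stopping criterion are not reproduced here.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_branch(probs, threshold=1.0):
    """Branch only at high-uncertainty positions (threshold is hypothetical)."""
    return token_entropy(probs) > threshold

# Toy next-token distributions over a four-token vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]  # model is nearly certain: keep one path
uncertain = [0.25, 0.25, 0.25, 0.25]  # model is maximally unsure: explore
```

With these inputs, only the uniform distribution triggers a branch, so extra rollouts are spent where the model is least sure rather than at every step.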


[7] Brevity Constraints Reverse Performance Hierarchies in Language Models cs.CL | cs.AI

MD Azizul Hakim

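TL;DR: This paper finds a performance inversion in standard LLM evaluation: on 7.7% of benchmark problems, large models with 10-100x more parameters underperform small models by 28.4 percentage points. Analyzing 31 models (0.5B-405B parameters) on 1,485 problems, the author attributes this to the spontaneous verbosity of large models, which introduces errors through overelaboration. Forcing large models to answer briefly raises their accuracy by 26 percentage points and shrinks the performance gap by up to two-thirds. On mathematical reasoning and scientific knowledge benchmarks the hierarchy reverses completely, with large models leading small ones by 7.7-15.9 percentage points.

Motivation: Standard evaluation protocols show a counterintuitive phenomenon: larger language models do worse than smaller ones on some benchmark problems. The paper investigates the root cause and tests whether this reflects the models' true capabilities.

Result: On 7.7% of problems across five datasets, large models trail small models by 28.4 percentage points; brevity constraints raise large-model accuracy by 26 percentage points and cut the performance gap by up to two-thirds; on mathematical reasoning and scientific knowledge benchmarks the hierarchy reverses completely, with large models ahead by 7.7-15.9 percentage points; dataset-specific optimal model scales range from 0.5B to 3.0B parameters.

Insight: The novelty is identifying scale-dependent verbosity, rather than a capability limit, as the key mechanism behind large-model underperformance. Prompt engineering with brevity constraints unlocks the latent advantage of large models, and the study argues that evaluation protocols should be designed per model scale rather than built on universal prompts. The finding has direct deployment implications: prompt adaptation both improves accuracy and reduces computational cost.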

Abstract: Standard evaluation protocols reveal a counterintuitive phenomenon: on 7.7% of benchmark problems spanning five datasets, larger language models underperform smaller ones by 28.4 percentage points despite 10-100x more parameters. Through systematic evaluation of 31 models (0.5B-405B parameters) across 1,485 problems, we identify the mechanism as spontaneous scale-dependent verbosity that introduces errors through overelaboration. Causal intervention experiments demonstrate this reflects correctable prompt design rather than fundamental capability limitations. Constraining large models to produce brief responses improves accuracy by 26 percentage points and reduces performance gaps by up to two-thirds. Most critically, brevity constraints completely reverse performance hierarchies on mathematical reasoning and scientific knowledge benchmarks, with large models achieving 7.7-15.9 percentage point advantages over small models – direct inversions of the original gaps. These reversals prove large models possess superior latent capabilities that universal prompting masks. We validate findings through three independent contamination tests and demonstrate inverse scaling operates continuously across the full parameter spectrum, with dataset-specific optimal scales ranging from 0.5B to 3.0B parameters. Our results establish that maximizing large model performance requires scale-aware prompt engineering rather than universal evaluation protocols, with immediate implications for deployment: prompt adaptation simultaneously improves accuracy and reduces computational costs.


[8] Hierarchical Chain-of-Thought Prompting: Enhancing LLM Reasoning Performance and Efficiency cs.CL

Xingshuai Huang, Derek Li, Bahareh Nikpour, Parsa Omidi

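TL;DR: This paper proposes Hierarchical Chain-of-Thought (Hi-CoT) prompting, which structures reasoning into alternating planning and execution substeps to improve LLM performance and efficiency on complex multi-step reasoning tasks.

Motivation: Conventional chain-of-thought prompting relies on unstructured, flat reasoning chains that suffer from redundancy and suboptimal performance. The paper targets these challenges in complex multi-step reasoning.

Result: Extensive evaluation across multiple LLMs and mathematical reasoning benchmarks shows that, compared with standard CoT, Hi-CoT improves average accuracy by 6.2% (up to 61.4% on certain models and tasks) while shortening reasoning traces by 13.9%.

Insight: The core innovation is explicitly organizing reasoning into a hierarchical structure that alternates planning and execution, which helps models manage long reasoning horizons and maintain logical coherence. Viewed objectively, this structured approach improves accuracy and efficiency together, which makes it a worthwhile optimization direction.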

Abstract: Chain-of-Thought (CoT) prompting has significantly improved the reasoning capabilities of large language models (LLMs). However, conventional CoT often relies on unstructured, flat reasoning chains that suffer from redundancy and suboptimal performance. In this work, we introduce Hierarchical Chain-of-Thought (Hi-CoT) prompting, a structured reasoning paradigm specifically designed to address the challenges of complex, multi-step reasoning. Hi-CoT decomposes the reasoning process into hierarchical substeps by alternating between instructional planning and step-by-step execution. This decomposition enables LLMs to better manage long reasoning horizons and maintain logical coherence. Extensive evaluations across diverse LLMs and mathematical reasoning benchmarks show that Hi-CoT consistently improves average accuracy by 6.2% (up to 61.4% on certain models and tasks) while reducing reasoning trace length by 13.9% compared to CoT prompting. We further show that accuracy and efficiency are maximized when models strictly adhere to the hierarchical structure. Our code is available at https://github.com/XingshuaiHuang/Hi-CoT.


[9] Oblivion: Self-Adaptive Agentic Memory Control through Decay-Driven Activation cs.CL | cs.AI

Ashish Rana, Chia-Chien Hung, Qumeng Sun, Julian Martin Kunkel, Carolin Lawrence

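TL;DR: This paper proposes Oblivion, a framework that gives LLM agents adaptive memory control by mimicking the decay of human memory. It decouples memory control into a read path and a write path: the read path decides when to consult memory based on agent uncertainty and memory buffer sufficiency, and the write path decides what to strengthen based on each memory's contribution to the response. The result is a hierarchical memory organization that keeps high-level strategies persistent while loading details dynamically.

Motivation: Existing memory-augmented LLM agents use “always-on” retrieval and “flat” storage, which cause high interference and latency as histories grow. The paper addresses this by modeling the adaptive mechanism of human selective forgetting, giving agents more flexible and efficient memory management.

Result: On static and dynamic long-horizon interaction benchmarks, Oblivion dynamically adapts memory access and reinforcement and balances learning and forgetting under shifting contexts, which indicates that memory control is essential for effective LLM agent reasoning.

Insight: The core innovation is treating “forgetting” as a decay-driven reduction in accessibility rather than explicit deletion, and decoupling the read and write control paths. This gives LLM agents an adaptive memory management mechanism closer to how human memory works, helping reduce redundant access and interference and improve long-horizon reasoning efficiency.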

Abstract: Human memory adapts through selective forgetting: experiences become less accessible over time but can be reactivated by reinforcement or contextual cues. In contrast, memory-augmented LLM agents rely on “always-on” retrieval and “flat” memory storage, causing high interference and latency as histories grow. We introduce Oblivion, a memory control framework that casts forgetting as decay-driven reductions in accessibility, not explicit deletion. Oblivion decouples memory control into read and write paths. The read path decides when to consult memory, based on agent uncertainty and memory buffer sufficiency, avoiding redundant always-on access. The write path decides what to strengthen, by reinforcing memories contributing to forming the response. Together, this enables hierarchical memory organization that maintains persistent high-level strategies while dynamically loading details as needed. We evaluate on both static and dynamic long-horizon interaction benchmarks. Results show that Oblivion dynamically adapts memory access and reinforcement, balancing learning and forgetting under shifting contexts, highlighting that memory control is essential for effective LLM-agentic reasoning. The source code is available at https://github.com/nec-research/oblivion.
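
The idea of forgetting as reduced accessibility rather than deletion can be sketched as follows. This is an illustrative toy, not the Oblivion implementation: the exponential decay form, the `DECAY_RATE` constant, the additive `boost`, and the retrieval threshold are all assumptions, and the simple `retrieve` filter only stands in for the paper's uncertainty-based read path.

```python
import math

DECAY_RATE = 0.1  # hypothetical decay constant per time unit

class MemoryItem:
    """A stored memory whose accessibility fades over time but is never deleted."""

    def __init__(self, text, strength=1.0, last_access=0.0):
        self.text = text
        self.strength = strength
        self.last_access = last_access

    def activation(self, now):
        # Exponential decay since the last reinforcement (an assumed form).
        return self.strength * math.exp(-DECAY_RATE * (now - self.last_access))

    def reinforce(self, now, boost=1.0):
        # Write-path stand-in: strengthen a memory that contributed to a response.
        self.strength = self.activation(now) + boost
        self.last_access = now

def retrieve(items, now, threshold=0.5):
    # Read-path stand-in: surface only memories that are still active enough.
    return [m for m in items if m.activation(now) >= threshold]

old = MemoryItem("user prefers metric units", last_access=0.0)
recent = MemoryItem("user asked about flights", last_access=9.0)
```

At time 10 the older item has decayed below the threshold and is not surfaced, yet it still exists; calling `old.reinforce(10.0)` makes it retrievable again, which mirrors reactivation by reinforcement.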


[10] REM-CTX: Automated Peer Review via Reinforcement Learning with Auxiliary Context cs.CL | cs.AI

Pawin Taechoyotin, Daniel E. Acuna

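TL;DR: REM-CTX is a reinforcement-learning system for automated peer review that improves review generation by incorporating auxiliary context such as figures and external scholarly signals. It trains an 8B-parameter language model with Group Relative Policy Optimization (GRPO), using a multi-aspect quality reward together with two correspondence rewards that explicitly encourage alignment with the auxiliary context. On manuscripts from the computer, biological, and physical sciences, REM-CTX outperforms six baselines, including systems built on larger commercial models, in overall review quality and contextual grounding metrics.

Motivation: Most existing automated peer review systems rely only on the manuscript text and underuse auxiliary context such as visual elements (figures) and external scholarly signals, which limits review quality.

Result: Experiments on manuscripts from the computer, biological, and physical sciences show that REM-CTX achieves higher overall review quality than six baselines, including those using larger commercial models, and surpasses the next-best RL baseline on both quality and contextual grounding metrics. Ablation studies confirm that the two correspondence rewards are complementary.

Insight: The core innovation is using a reinforcement learning framework to integrate auxiliary context (visual elements and external signals) into review generation, with dedicated correspondence rewards that push the model to align with that context. Viewed objectively, its analysis of training dynamics, in which the criticism aspect is negatively correlated with the other metrics, and the resulting suggestion to group multi-dimensional rewards offer useful guidance for future multi-objective reward design.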

Abstract: Most automated peer review systems rely on textual manuscript content alone, leaving visual elements such as figures and external scholarly signals underutilized. We introduce REM-CTX, a reinforcement-learning system that incorporates auxiliary context into the review generation process via correspondence-aware reward functions. REM-CTX trains an 8B-parameter language model with Group Relative Policy Optimization (GRPO) and combines a multi-aspect quality reward with two correspondence rewards that explicitly encourage alignment with auxiliary context. Experiments on manuscripts across Computer, Biological, and Physical Sciences show that REM-CTX achieves the highest overall review quality among six baselines, outperforming other systems with substantially larger commercial models, and surpassing the next-best RL baseline across both quality and contextual grounding metrics. Ablation studies confirm that the two correspondence rewards are complementary: each selectively improves its targeted correspondence reward while preserving all quality dimensions, and the full model outperforms all partial variants. Analysis of training dynamics reveals that the criticism aspect is negatively correlated with other metrics during training, suggesting that future studies should group multi-dimension rewards for review generation.
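
As background for the GRPO training mentioned above, the sketch below shows the group-relative advantage that gives the algorithm its name: each sampled review's reward is normalized against the other samples for the same manuscript. The way the quality and correspondence rewards are combined, the equal weights, and the reward values are assumptions for illustration; implementations also differ on whether they use the population or sample standard deviation.

```python
from statistics import mean, pstdev

def combined_reward(quality, figure_corr, scholarly_corr, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of a quality reward and two correspondence rewards.
    The equal weights are hypothetical, not taken from the paper."""
    wq, wf, ws = weights
    return wq * quality + wf * figure_corr + ws * scholarly_corr

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an implementation choice
    return [(r - mu) / (sigma + eps) for r in rewards]

# Three candidate reviews sampled for one manuscript (made-up scores).
rewards = [
    combined_reward(0.6, 0.2, 0.1),
    combined_reward(0.7, 0.5, 0.4),
    combined_reward(0.8, 0.9, 0.6),
]
advantages = group_relative_advantages(rewards)
```

Because advantages are relative to the group mean, a review is reinforced only when it beats its siblings, which removes the need for a separate learned value function.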


[11] Can Large Language Models Self-Correct in Medical Question Answering? An Exploratory Study cs.CL

Zaifu Zhan, Mengyuan Cui, Rui Zhang

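TL;DR: This study explores the self-reflection ability of large language models in medical question answering. Comparing standard chain-of-thought prompting with an iterative self-reflection loop on three medical QA benchmarks, it finds that self-reflective prompting does not consistently improve accuracy and that its effect depends heavily on the dataset and model. Self-reflective reasoning is therefore better seen as an analytical tool for understanding model behavior than as a standalone solution for improving medical QA reliability.

Motivation: Self-reflective prompting is widely claimed to improve reliability by having models critique and revise their own reasoning, but its effectiveness in safety-critical medical settings is unclear. The study examines how self-reflective reasoning actually performs on medical multiple-choice question answering.

Result: In experiments with GPT-4o and GPT-4o-mini on MedQA, HeadQA, and PubMedQA, self-reflective prompting yields modest gains on MedQA but limited or negative effects on HeadQA and PubMedQA, and adding reflection steps does not guarantee better performance.

Insight: The paper exposes a gap between reasoning transparency and reasoning correctness, positions self-reflective reasoning as a tool for analyzing model behavior rather than a direct way to boost performance, and highlights the limits of model self-correction in critical domains such as medicine.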

Abstract: Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning; meanwhile, self-reflective (self-corrective) prompting has been widely claimed to enhance model reliability by prompting LLMs to critique and revise their own reasoning, yet its effectiveness in safety-critical medical settings remains unclear. In this work, we conduct an exploratory analysis of self-reflective reasoning for medical multiple-choice question answering: using GPT-4o and GPT-4o-mini, we compare standard CoT prompting with an iterative self-reflection loop and track how predictions evolve across reflection steps on three widely used medical QA benchmarks (MedQA, HeadQA, and PubMedQA). We analyze whether self-reflection leads to error correction, error persistence, or the introduction of new errors. Our results show that self-reflective prompting does not consistently improve accuracy and its impact is highly dataset- and model-dependent: it yields modest gains on MedQA but provides limited or negative benefits on HeadQA and PubMedQA, and increasing the number of reflection steps does not guarantee better performance. These findings highlight a gap between reasoning transparency and reasoning correctness, suggesting that self-reflective reasoning is better viewed as an analytical tool for understanding model behavior rather than a standalone solution for improving medical QA reliability.


[12] Asymmetric Actor-Critic for Multi-turn LLM Agents cs.CL | cs.AI

Shuli Jiang, Zhaoyang Zhang, Yi Zhang, Shuo Yang, Wei Xia

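TL;DR: This paper proposes an asymmetric actor-critic framework for improving the reliability of LLM agents in multi-turn conversations. A powerful proprietary LLM acts as the actor and generates responses, while a smaller open-source model acts as the critic, supervising and intervening at runtime, with no training of the actor.

Motivation: In multi-turn, one-shot interaction settings, existing approaches (reflection, post-hoc evaluation, or fully trainable models) cannot both leverage proprietary LLMs and ensure reliable behavior.

Result: On τ-bench and UserBench, the method significantly outperforms strong single-agent baselines in task success and reliability. Lightweight open-source critics match or surpass larger proprietary models in the critic role, and fine-tuning the critic yields additional gains over several state-of-the-art methods.

Insight: The core innovation is exploiting a generation-verification asymmetry: high-quality generation needs large models, whereas effective oversight can often be achieved by small ones. The framework supervises a fixed, non-trainable proprietary actor at runtime by independently fine-tuning a small critic, without modifying the actor or relying on extra attempts.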

Abstract: Large language models (LLMs) exhibit strong reasoning and conversational abilities, but ensuring reliable behavior in multi-turn interactions remains challenging. In many real-world applications, agents must succeed in one-shot settings where retries are impossible. Existing approaches either rely on reflection or post-hoc evaluation, which require additional attempts, or assume fully trainable models that cannot leverage proprietary LLMs. We propose an asymmetric actor-critic framework for reliable conversational agents. A powerful proprietary LLM acts as the actor, while a smaller open-source critic provides runtime supervision, monitoring the actor’s actions and intervening within the same interaction trajectory. Unlike training-based actor-critic methods, our framework supervises a fixed actor operating in open-ended conversational environments. The design leverages a generation-verification asymmetry: while high-quality generation requires large models, effective oversight can often be achieved by smaller ones. We further introduce a data generation pipeline that produces supervision signals for critic fine-tuning without modifying the actor. Experiments on τ-bench and UserBench show that our approach significantly improves reliability and task success over strong single-agent baselines. Moreover, lightweight open-source critics rival or surpass larger proprietary models in the critic role, and critic fine-tuning yields additional gains over several state-of-the-art methods.


[13] Large Language Models in the Abuse Detection Pipeline cs.CL | cs.CY

Suraj Kath, Sanket Badhe, Preet Shah, Ashwin Sampathkumar, Shivani Gupta

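TL;DR: This survey examines how large language models are used across the online abuse detection lifecycle, which it divides into four stages (label and feature generation, detection, review and appeals, auditing and governance), and analyzes the strengths, limitations, and deployment considerations of LLMs at each stage.

Motivation: Online abuse is increasingly complex, and traditional machine-learning approaches that depend on static classifiers and manual labeling struggle to keep up with fast-evolving threat patterns and fine-grained policy requirements. LLM capabilities such as contextual reasoning and policy interpretation can make detection systems more effective.

Result: The paper reports no quantitative experimental results. It surveys current research and industry practice and analyzes the potential and challenges of LLMs at each stage of abuse detection.

Insight: The novelty is a lifecycle-oriented framework for LLM integration, the Abuse Detection Lifecycle (ADL), which systematically maps the role of LLMs at each stage of the pipeline. Viewed objectively, the framework provides a structured design approach for building explainable, auditable abuse detection systems at scale, and it highlights engineering challenges such as latency, cost, and determinism.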

Abstract: Online abuse has grown increasingly complex, spanning toxic language, harassment, manipulation, and fraudulent behavior. Traditional machine-learning approaches dependent on static classifiers and labor-intensive labeling struggle to keep pace with evolving threat patterns and nuanced policy requirements. Large Language Models introduce new capabilities for contextual reasoning, policy interpretation, explanation generation, and cross-modal understanding, enabling them to support multiple stages of modern safety systems. This survey provides a lifecycle-oriented analysis of how LLMs are being integrated into the Abuse Detection Lifecycle (ADL), which we define across four stages: (I) Label & Feature Generation, (II) Detection, (III) Review & Appeals, and (IV) Auditing & Governance. For each stage, we synthesize emerging research and industry practices, highlight architectural considerations for production deployment, and examine the strengths and limitations of LLM-driven approaches. We conclude by outlining key challenges including latency, cost-efficiency, determinism, adversarial robustness, and fairness and discuss future research directions needed to operationalize LLMs as reliable, accountable components of large-scale abuse-detection and governance systems.


[14] Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning cs.CL | stat.AP

Eric Hanchen Jiang, Levina Li, Rui Sun, Xiao Liang, Yubei Li

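TL;DR: This paper proposes Agent Q-Mix, a reinforcement learning framework for choosing the communication topology in LLM multi-agent systems. It casts topology selection as a cooperative multi-agent reinforcement learning problem and uses QMIX value factorization so that each agent learns decentralized communication decisions that jointly induce a round-wise communication graph. Across seven core benchmarks in coding, reasoning, and mathematics, it achieves the highest average accuracy together with superior token efficiency and robustness.

Motivation: Solving complex problems often requires coordinating multiple LLM agents, but how to select and interconnect them effectively (communication topology selection) is a core challenge. Existing methods fall short in optimizing inter-agent communication structure dynamically and efficiently.

Result: Across seven core benchmarks in coding, reasoning, and mathematics, Agent Q-Mix achieves the highest average accuracy and shows superior token efficiency and robustness against agent failure. On the challenging Humanity’s Last Exam (HLE) benchmark with Gemini-3.1-Flash-Lite as the backbone, it reaches 20.8% accuracy, outperforming Microsoft Agent Framework (19.2%), LangGraph (19.2%), AutoGen, and Lobster, the best result among the compared methods.

Insight: The main novelty is reframing topology selection in multi-agent systems as a cooperative MARL problem and applying QMIX value factorization to learn decentralized decisions. The architecture combines a topology-aware GNN encoder, a GRU memory module, and per-agent Q-heads, and under the Centralized Training with Decentralized Execution (CTDE) paradigm it optimizes a reward that balances task accuracy against token cost. This offers a new approach to learning communication topologies dynamically and efficiently.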

Abstract: Large Language Models (LLMs) have shown remarkable performance in completing various tasks. However, solving complex problems often requires the coordination of multiple agents, raising a fundamental question: how to effectively select and interconnect these agents. In this paper, we propose Agent Q-Mix, a reinforcement learning framework that reformulates topology selection as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. Our method learns decentralized communication decisions using QMIX value factorization, where each agent selects from a set of communication actions that jointly induce a round-wise communication graph. At its core, Agent Q-Mix combines a topology-aware GNN encoder, GRU memory, and per-agent Q-heads under a Centralized Training with Decentralized Execution (CTDE) paradigm. The framework optimizes a reward function that balances task accuracy with token cost. Across seven core benchmarks in coding, reasoning, and mathematics, Agent Q-Mix achieves the highest average accuracy compared to existing methods while demonstrating superior token efficiency and robustness against agent failure. Notably, on the challenging Humanity’s Last Exam (HLE) using Gemini-3.1-Flash-Lite as a backbone, Agent Q-Mix achieves 20.8% accuracy, outperforming Microsoft Agent Framework (19.2%) and LangGraph (19.2%), followed by AutoGen and Lobster by OpenClaw. These results underscore the effectiveness of learned, decentralized topology optimization in pushing the boundaries of multi-agent reasoning.
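
The property that makes QMIX-style factorization compatible with decentralized execution is monotonic mixing: if the joint value never decreases when any single agent's value increases, each agent can pick its own greedy action and the joint choice is still optimal. The sketch below checks this on a toy example. It simplifies the mixer to one linear layer with fixed weights (in QMIX the weights come from state-conditioned hypernetworks), and the action names and Q-values are hypothetical.

```python
from itertools import product

def q_total(agent_qs, raw_weights, bias=0.0):
    """Monotonic mixer: taking abs() of each weight keeps dQ_tot/dQ_i >= 0."""
    return sum(abs(w) * q for w, q in zip(raw_weights, agent_qs)) + bias

# Hypothetical per-agent Q-values over two communication actions.
q_tables = [
    {"broadcast": 0.2, "silent": 0.5},
    {"broadcast": 0.9, "silent": 0.1},
    {"broadcast": 0.4, "silent": 0.3},
]
raw_weights = [0.7, -1.2, 0.3]  # signs are discarded by the mixer

# Decentralized execution: every agent maximizes its own Q-value.
greedy = tuple(max(table, key=table.get) for table in q_tables)

# Centralized check: brute-force the joint action that maximizes Q_tot.
best_joint = max(
    product(*(table.keys() for table in q_tables)),
    key=lambda joint: q_total(
        [table[action] for table, action in zip(q_tables, joint)], raw_weights
    ),
)
```

Here the per-agent greedy actions coincide with the brute-force joint optimum, which is why training can be centralized while each agent decides locally at run time.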


[15] Locally Confident, Globally Stuck: The Quality-Exploration Dilemma in Diffusion Language Models cs.CL

Liancheng Fang, Aiwei Liu, Henry Peng Zou, Yankai Chen, Enze Ma

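TL;DR: This paper studies a quality-exploration dilemma in diffusion large language models (dLLMs). In theory, dLLMs allow decoding in arbitrary order, which could explore reasoning paths more richly than autoregressive models; in practice, random-order decoding hurts generation quality. Low-confidence remasking improves single-sample quality (e.g., Pass@1) by prioritizing confident tokens, but it suppresses exploration and limits multi-sample gains (e.g., Pass@k). The paper gives a unified explanation of the dilemma and proposes an Independent Metropolis–Hastings sampler that, during decoding, approximately targets the optimal distribution balancing quality and exploration.

Motivation: Under flexible decoding order, random decoding hurts generation quality in dLLMs, while low-confidence remasking, which improves single-sample quality, suppresses path exploration and limits multi-sample gains. The paper addresses this fundamental quality-exploration dilemma.

Result: On reasoning benchmarks including MATH500, AIME24/25, HumanEval, and MBPP, the proposed method achieves a better exploration-quality tradeoff than both random decoding and low-confidence remasking.

Insight: The novelty is a unified theoretical explanation of the dilemma (low-confidence remasking improves a myopic proxy for quality while provably constraining the entropy of the induced sequence distribution), a formal characterization of the optimal distribution that balances quality and exploration, and a simple, efficient sampler that approximates it. This provides theoretical guidance and a practical method for designing decoding strategies for diffusion models on reasoning tasks.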

Abstract: Diffusion large language models (dLLMs) theoretically permit token decoding in arbitrary order, a flexibility that could enable richer exploration of reasoning paths than autoregressive (AR) LLMs. In practice, however, random-order decoding often hurts generation quality. To mitigate this, low-confidence remasking improves single-sample quality (e.g., Pass@1) by prioritizing confident tokens, but it also suppresses exploration and limits multi-sample gains (e.g., Pass@k), creating a fundamental quality–exploration dilemma. In this paper, we provide a unified explanation of this dilemma. We show that low-confidence remasking improves a myopic proxy for quality while provably constraining the entropy of the induced sequence distribution. To overcome this limitation, we characterize the optimal distribution that explicitly balances quality and exploration, and develop a simple Independent Metropolis–Hastings sampler that approximately targets this distribution during decoding. Experiments across a range of reasoning benchmarks including MATH500, AIME24/25, HumanEval, and MBPP show that our approach yields a better exploration-quality tradeoff than both random and low-confidence remasking.
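
For readers unfamiliar with the sampler named above, here is a generic Independent Metropolis–Hastings loop on a toy discrete target. It is not the paper's decoder: the target and proposal below are made-up two-state distributions, whereas the paper targets a distribution over token sequences. The acceptance rule is the standard one for independence proposals, using importance weights w(x) = target(x) / proposal(x).

```python
import random

def imh_sample(target, proposal, steps, seed=0):
    """Independent Metropolis-Hastings over a finite state space.

    target:   dict state -> unnormalized target weight (must be > 0)
    proposal: dict state -> proposal probability (must be > 0)
    """
    rng = random.Random(seed)
    states = list(proposal)
    probs = [proposal[s] for s in states]

    def weight(s):
        # Importance weight of a state under the independence proposal.
        return target[s] / proposal[s]

    current = rng.choices(states, probs)[0]
    samples = []
    for _ in range(steps):
        # The candidate is drawn without looking at the current state.
        candidate = rng.choices(states, probs)[0]
        if rng.random() < min(1.0, weight(candidate) / weight(current)):
            current = candidate
        samples.append(current)
    return samples

# Toy example: the target prefers "a" three to one; the proposal is uniform.
toy_target = {"a": 3.0, "b": 1.0}
toy_proposal = {"a": 0.5, "b": 0.5}
chain = imh_sample(toy_target, toy_proposal, steps=20000)
freq_a = chain.count("a") / len(chain)
```

The chain's long-run frequency of "a" approaches 0.75, the normalized target probability, even though the proposal itself is uniform; only the accept/reject step needs the target.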


[16] TR-ICRL: Test-Time Rethinking for In-Context Reinforcement Learning cs.CL

Wenxuan Jiang, Yuxin Zuo, Zijian Zhang, Xuecheng Wu, Zining Fan

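TL;DR: This paper proposes TR-ICRL, a new framework that addresses the reward-estimation problem in in-context reinforcement learning (ICRL). The framework retrieves relevant instances from an unlabeled evaluation set, generates candidate answers, derives pseudo-labels by majority voting to serve as reward signals, and iteratively refines the LLM's responses, yielding significant gains on reasoning and knowledge-intensive tasks.

Details

Motivation: In ICRL, models typically lack access to ground-truth reward signals at inference time, which limits their ability to learn online. The paper aims to solve this reward-estimation problem.

Result: On mainstream reasoning and knowledge-intensive tasks (such as MedQA and AIME2024), TR-ICRL significantly improves performance: it improves Qwen2.5-7B by 21.23% on average on MedQA and by 137.59% on AIME2024. Extensive ablation studies validate the effectiveness and robustness of the method.

Insight: The contribution is a test-time rethinking mechanism that uses unlabeled data and majority voting to generate pseudo-reward signals that guide iterative refinement, offering a new approach to effective ICRL when ground-truth rewards are unavailable.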

Abstract: In-Context Reinforcement Learning (ICRL) enables Large Language Models (LLMs) to learn online from external rewards directly within the context window. However, a central challenge in ICRL is reward estimation, as models typically lack access to ground-truths during inference. To address this limitation, we propose Test-Time Rethinking for In-Context Reinforcement Learning (TR-ICRL), a novel ICRL framework designed for both reasoning and knowledge-intensive tasks. TR-ICRL operates by first retrieving the most relevant instances from an unlabeled evaluation set for a given query. During each ICRL iteration, the LLM generates a set of candidate answers for every retrieved instance. Next, a pseudo-label is derived from this set through majority voting. This label then serves as a proxy to give reward messages and generate formative feedback, guiding the LLM through iterative refinement. In the end, this synthesized contextual information is integrated with the original query to form a comprehensive prompt, with the answer determined through a final round of majority voting. TR-ICRL is evaluated on mainstream reasoning and knowledge-intensive tasks, where it demonstrates significant performance gains. Remarkably, TR-ICRL improves Qwen2.5-7B by 21.23% on average on MedQA and even 137.59% on AIME2024. Extensive ablation studies and analyses further validate the effectiveness and robustness of our approach. Our code is available at https://github.com/pangpang-xuan/TR_ICRL.
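
The pseudo-labeling step described in the abstract can be sketched as follows. This is a minimal illustration, assuming candidate answers are already normalized strings; the function names, the binary agree/disagree reward, and the first-seen tie-breaking are assumptions, not details taken from the paper or its repository.

```python
from collections import Counter

def pseudo_label(candidates):
    """Majority vote over candidate answers.

    Returns the most frequent answer and its vote share. Ties are broken
    by first occurrence (an assumption made for this sketch).
    """
    counts = Counter(candidates)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(candidates)

def reward_messages(candidates, label):
    """Hypothetical binary reward: 1.0 if a candidate matches the pseudo-label."""
    return [1.0 if c == label else 0.0 for c in candidates]

candidates = ["B", "A", "B", "C", "B"]
label, share = pseudo_label(candidates)      # ("B", 0.6)
rewards = reward_messages(candidates, label)  # [1.0, 0.0, 1.0, 0.0, 1.0]
```

The vote share could also serve as a rough confidence signal for the pseudo-label, though whether TR-ICRL uses it that way is not stated in the abstract.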


[17] Adapting Text LLMs to Speech via Multimodal Depth Up-Scaling cs.CL

Kazuki Yano, Jun Suzuki, Shinji Watanabe

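TL;DR: This paper proposes Multimodal Depth Up-Scaling, a method for adapting pre-trained text large language models (LLMs) into speech language models (Speech LMs). New Transformer layers are inserted into a frozen text LLM and only the added layers are trained on speech data, so the model learns speech representations while retaining its original text capabilities.

Details

Motivation: Continual pretraining of text LLMs on speech data usually degrades the model's original text capabilities; the paper addresses this problem.

Result: In experiments with SmolLM2-360M and SmolLM2-1.7B on 48k hours of English automatic speech recognition (ASR) data, depth up-scaling achieves ASR performance comparable to full fine-tuning with far less text degradation than both full fine-tuning and Low-Rank Adaptation (LoRA). When E-Branchformer, an architecture designed for speech recognition, is used for the inserted layers, ASR performance on the larger model matches or surpasses full fine-tuning while text degradation is reduced by over 75% with 60% fewer trainable parameters.

Insight: The contribution is a structured continual pretraining strategy that isolates modality-specific learning by inserting new layers and training only those layers while the original model stays frozen, mitigating catastrophic forgetting in multimodal adaptation. Viewed objectively, the method combines architectural expansion with modality-specific design (such as E-Branchformer), offering an efficient, high-performing approach to multimodal LLM adaptation.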

Abstract: Adapting pre-trained text Large Language Models (LLMs) into Speech Language Models (Speech LMs) via continual pretraining on speech data is promising, but often degrades the original text capabilities. We propose Multimodal Depth Upscaling, an extension of an emerging strategy in continual LLM pre-training, where new transformer layers are inserted into a frozen text LLM and only the added layers are trained on speech data. Experiments with SmolLM2-360M and SmolLM2-1.7B on 48k hours of English Automatic Speech Recognition (ASR) data show that depth up-scaling achieves ASR comparable to full fine-tuning while causing far less text degradation than both full fine-tuning and Low-Rank Adaptation (LoRA). We further show that incorporating E-Branchformer, an architecture designed for speech recognition, as the inserted layers achieves ASR that matches or surpasses full fine-tuning on the larger model while reducing text degradation by over 75% with 60% fewer trainable parameters.


[18] Optimsyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation cs.CL | cs.AI

Zhiting Fan, Ruizhe Chen, Tianxiang Hu, Ru Peng, Zenan Huang

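TL;DR: This paper proposes the Optimsyn framework, which optimizes the rubrics used for synthetic data generation via influence-based evaluation, improving the quality of supervised fine-tuning (SFT) data. The method uses gradient information to quantify each synthetic sample's contribution to the target task's learning objective and optimizes the rubric generator with reinforcement learning, yielding performance improvements across multiple domains and models.

Details

Motivation: High-quality SFT data is scarce in knowledge-intensive domains (such as humanities, social sciences, medicine, law, and finance) because expert curation is expensive, privacy constraints are strict, and label consistency is hard to ensure. Existing synthetic data generation methods also rely on handcrafted rubrics, a process that lacks reliable quantitative feedback and transfers poorly across domains.

Result: Experiments across domains, target models, and data generators show consistent improvements and strong generalization without task-specific tuning, improving the utility of synthetic data on downstream tasks.

Insight: The contribution is an influence-estimation-based method for evaluating the utility of synthetic data, which reveals that synthetic and real samples that are close in embedding space can still differ substantially in their influence on learning. By optimizing the rubric generator with target-model feedback as the reward signal, the data generation process is optimized automatically.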

Abstract: Large language models (LLMs) achieve strong downstream performance largely due to abundant supervised fine-tuning (SFT) data. However, high-quality SFT data in knowledge-intensive domains such as humanities, social sciences, medicine, law, and finance is scarce because expert curation is expensive, privacy constraints are strict, and label consistency is hard to ensure. Recent work uses synthetic data, typically by prompting a generator over domain documents and filtering outputs with handcrafted rubrics. Yet rubric design is expert-dependent, transfers poorly across domains, and is often optimized through a brittle heuristic loop of writing rubrics, synthesizing data, training, inspecting results, and manually guessing revisions. This process lacks reliable quantitative feedback about how a rubric affects downstream performance. We propose evaluating synthetic data by its training utility on the target model and using this signal to guide data generation. Inspired by influence estimation, we adopt an optimizer-aware estimator that uses gradient information to quantify each synthetic sample’s contribution to a target model’s objective on specific tasks. Our analysis shows that even when synthetic and real samples are close in embedding space, their influence on learning can differ substantially. Based on this insight, we propose an optimization-based framework that adapts rubrics using target-model feedback. We provide lightweight guiding text and use a rubric-specialized model to generate task-conditioned rubrics. Influence score is used as the reward to optimize the rubric generator with reinforcement learning. Experiments across domains, target models, and data generators show consistent improvements and strong generalization without task-specific tuning.


[19] A Japanese Benchmark for Evaluating Social Bias in Reasoning Based on Attribution Theory cs.CL

Taihei Shiotani, Masahiro Kaneko, Naoaki Okazaki

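TL;DR: This paper presents JUBAKU-v2, a Japanese benchmark dataset for evaluating social bias in the reasoning of large language models (LLMs) based on attribution theory, specifically bias in attributing behaviors to in-groups and out-groups in the Japanese cultural context.

Details

Motivation: Most existing Japanese bias benchmarks rely on translating English data, which does not adequately reflect Japanese culture, and they evaluate bias only in the conclusion, overlooking biases in the reasoning process.

Result: Experiments show that JUBAKU-v2 detects performance differences across models more sensitively than existing benchmarks.

Insight: The contribution is a dataset built on attribution theory from social psychology that evaluates culture-specific bias in the reasoning process rather than only in the conclusion, providing a finer-grained, culturally adapted tool for LLM fairness evaluation.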

Abstract: In enhancing the fairness of Large Language Models (LLMs), evaluating social biases rooted in the cultural contexts of specific linguistic regions is essential. However, most existing Japanese benchmarks heavily rely on translating English data, which does not necessarily provide an evaluation suitable for Japanese culture. Furthermore, they only evaluate bias in the conclusion, failing to capture biases lurking in the reasoning. In this study, based on attribution theory in social psychology, we constructed a new dataset, “JUBAKU-v2,” which evaluates the bias in attributing behaviors to in-groups and out-groups within reasoning while fixing the conclusion. This dataset consists of 216 examples reflecting cultural biases specific to Japan. Experimental results verified that it can detect performance differences across models more sensitively than existing benchmarks.


[20] Speech LLMs are Contextual Reasoning Transcribers cs.CL

Keqi Deng, Ruchao Fan, Bo Ren, Yiming Wang, Jinyu Li

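TL;DR: This paper proposes chain-of-thought ASR (CoT-ASR), which constructs a reasoning chain so that the LLM first analyzes the input speech and generates contextual analysis, making full use of its generative capabilities, and then performs more informed speech recognition. The method also introduces a CTC-guided Modality Adapter to reduce the modality gap, and supports user-guided transcription.

Details

Motivation: Although LLMs have been extended to speech inputs, effectively using their rich knowledge and contextual understanding in automatic speech recognition (ASR) remains difficult, because the task primarily involves direct speech-to-text mapping.

Result: Compared with standard LLM-based ASR, CoT-ASR achieves relative reductions of 8.7% in word error rate (WER) and 16.9% in entity error rate (EER).

Insight: The contribution is bringing chain-of-thought reasoning into ASR, so that the LLM performs contextual analysis before transcribing and better exploits its generative and reasoning abilities. The CTC-guided Modality Adapter aligns speech encoder outputs with the LLM's textual latent space, reducing the modality gap. The method also naturally supports user-provided context to guide transcription, extending ASR functionality.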

Abstract: Despite extensions to speech inputs, effectively leveraging the rich knowledge and contextual understanding of large language models (LLMs) in automatic speech recognition (ASR) remains non-trivial, as the task primarily involves direct speech-to-text mapping. To address this, this paper proposes chain-of-thought ASR (CoT-ASR), which constructs a reasoning chain that enables LLMs to first analyze the input speech and generate contextual analysis, thereby fully exploiting their generative capabilities. With this contextual reasoning, CoT-ASR then performs more informed speech recognition and completes both reasoning and transcription in a single pass. Moreover, CoT-ASR naturally supports user-guided transcription: while designed to self-generate reasoning, it can also seamlessly incorporate user-provided context to guide transcription, further extending ASR functionality. To reduce the modality gap, this paper introduces a CTC-guided Modality Adapter, which uses CTC non-blank token probabilities to weight LLM embeddings, efficiently aligning speech encoder outputs with the LLM’s textual latent space. Experiments show that, compared to standard LLM-based ASR, CoT-ASR achieves a relative reduction of 8.7% in word error rate (WER) and 16.9% in entity error rate (EER).
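
One way to read "uses CTC non-blank token probabilities to weight LLM embeddings" is as a per-frame weighted sum over the embedding table. The sketch below illustrates that reading in plain Python. The blank index, the renormalization over non-blank tokens, and the function name are all assumptions; the paper's actual adapter may differ.

```python
def ctc_weighted_embeddings(ctc_probs, embedding_table, blank_id=0):
    """Mix token embeddings per frame using non-blank CTC probabilities.

    ctc_probs: list of frames, each a list of probabilities over the
        vocabulary, including the blank symbol at index `blank_id`.
    embedding_table: list of embedding vectors, one per vocabulary id.
    Non-blank probabilities are renormalized to sum to 1 per frame
    (an assumption made for this sketch).
    """
    dim = len(embedding_table[0])
    mixed = []
    for frame in ctc_probs:
        non_blank = [(tok, p) for tok, p in enumerate(frame) if tok != blank_id]
        total = sum(p for _, p in non_blank)
        vec = [0.0] * dim
        for tok, p in non_blank:
            weight = p / total if total > 0 else 0.0
            for d in range(dim):
                vec[d] += weight * embedding_table[tok][d]
        mixed.append(vec)
    return mixed

# Toy example: vocabulary {0: blank, 1, 2} with 2-dimensional embeddings.
toy_probs = [[0.5, 0.5, 0.0], [0.2, 0.2, 0.6]]
toy_table = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_mixed = ctc_weighted_embeddings(toy_probs, toy_table)
```

Because the output lives in the span of the LLM's own embedding vectors, a mapping of this kind would place speech frames directly in the text embedding space, which is consistent with the alignment goal the abstract describes.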


[21] TRIMS: Trajectory-Ranked Instruction Masked Supervision for Diffusion Language Models cs.CL

Lingjie Chen, Ruizhong Qiu, Yuyu Fan, Yanjun Zhao, Hanghang Tong

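TL;DR: This paper proposes TRIMS (Trajectory-Ranked Instruction Masked Supervision), a trajectory-guided supervised fine-tuning framework for improving the decoding trajectories of diffusion language models (DLMs). It uses lightweight signals from an autoregressive teacher to guide a trajectory-aware masking strategy, injecting trajectory supervision into standard masked diffusion language model training to address the suboptimal decoding behavior caused by train-inference mismatch.

Details

Motivation: DLMs offer the potential for low-latency generation through parallel decoding, but their practical efficiency depends heavily on the decoding trajectory. Standard training provides no explicit supervision over token reveal order, creating a train-inference mismatch that leads to suboptimal decoding behavior and keeps the benefits of parallel decoding from fully materializing.

Result: On LLaDA and Dream, experiments on math and coding benchmarks show that TRIMS significantly improves the accuracy-parallelism trade-off over both standard MDLM training and training-free acceleration baselines, while achieving performance competitive with prior distillation-based approaches at substantially lower training cost. Further analysis confirms that TRIMS produces better decoding trajectories.

Insight: The core contribution is TRIMS, a simple and efficient trajectory-guided supervision framework that, instead of relying on costly DLM-based distillation, uses lightweight signals from an autoregressive teacher to guide the masking strategy, injecting trajectory supervision into training with minimal overhead. This offers a novel, cost-effective way to address train-inference mismatch and improve decoding trajectories in diffusion language models.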

Abstract: Diffusion language models (DLMs) offer a promising path toward low-latency generation through parallel decoding, but their practical efficiency depends heavily on the decoding trajectory. In practice, this advantage often fails to fully materialize because standard training does not provide explicit supervision over token reveal order, creating a train-inference mismatch that leads to suboptimal decoding behavior. We propose Trajectory-Ranked Instruction Masked Supervision (TRIMS), a simple trajectory-guided supervised fine-tuning framework that injects trajectory supervision into standard Masked Diffusion Language Model (MDLM) training with minimal overhead. Instead of relying on costly DLM-based distillation, TRIMS uses lightweight signals from an autoregressive teacher to guide a trajectory-aware masking strategy, encouraging the model to learn more effective decoding orders. Experiments on LLaDA and Dream across math and coding benchmarks show that TRIMS significantly improves the accuracy-parallelism trade-off over both standard MDLM training and train-free acceleration baselines, while achieving competitive performance with prior distillation-based approaches at substantially lower training cost. Further analysis shows that TRIMS leads to better decoding trajectories, validating the effectiveness of trajectory-guided supervision for DLMs.


[22] To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining cs.CL | cs.AI | cs.LG

Karan Singh, Michael Yu, Varun Gangal, Zhuofu Tao, Sachin Kumar

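TL;DR: This study systematically examines the trade-off between language model pretraining and retrieval-augmented generation (RAG) under a fixed data budget. The authors train models of different sizes (30M to 3B parameters) while varying the pretraining data scale (1-150x the number of parameters) and the retrieval store size (1-20x), and propose a three-dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval store size, giving quantitative guidance for allocating data resources.

Details

Motivation: The relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets, which calls for a systematic study of the trade-off between pretraining corpus size and retrieval store size.

Result: Retrieval consistently improves performance over parametric-only baselines at all model scales. The proposed scaling framework makes it possible to estimate the optimal allocation of a fixed data budget between pretraining and retrieval, showing that the marginal utility of retrieval depends strongly on model scale, task type, and the degree of pretraining saturation.

Insight: The contribution is a three-dimensional scaling framework that quantifies the trade-off between pretraining and retrieval and identifies the key factors governing the utility of retrieval (such as model scale and task type), giving practical guidance on allocating data resources when designing scalable language modeling systems. Viewed objectively, the study provides a systematic empirical basis and a quantitative framework for understanding how RAG and pretraining complement each other.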

Abstract: Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM data, while varying both pretraining data scale (1-150x the number of parameters) and retrieval store size (1-20x), and evaluate performance across a diverse suite of benchmarks spanning reasoning, scientific QA, and open-domain QA. We find that retrieval consistently improves performance over parametric-only baselines across model scales and introduce a three-dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval corpus size. This scaling manifold enables us to estimate optimal allocations of a fixed data budget between pretraining and retrieval, revealing that the marginal utility of retrieval depends strongly on model scale, task type, and the degree of pretraining saturation. Our results provide a quantitative foundation for understanding when and how retrieval should complement pretraining, offering practical guidance for allocating data resources in the design of scalable language modeling systems.


[23] LangMARL: Natural Language Multi-Agent Reinforcement Learning cs.CL

Huaiyuan Yao, Longchao Da, Xiaoou Liu, Charles Fleming, Tianlong Chen

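TL;DR: This paper proposes LangMARL, a framework that addresses the difficulty large language model (LLM) agents have in autonomously evolving coordination strategies in dynamic environments. It brings credit assignment and policy gradient evolution from classical multi-agent reinforcement learning (MARL) into the language space, using agent-level language credit assignment, gradient evolution in language space, and summaries of task-relevant causal relations from replayed trajectories to provide dense feedback and improve convergence under sparse rewards.

Details

Motivation: Coarse global outcomes obscure the causal signals that LLM agents need for local policy refinement in dynamic environments. This is essentially a multi-agent credit assignment problem, which remains an underaddressed bottleneck in LLM-based systems.

Result: Extensive experiments across diverse cooperative multi-agent tasks show improved sample efficiency and interpretability, along with strong generalization.

Insight: The contribution is bringing the credit assignment and policy gradient mechanisms of classical MARL into the language space, with agent-level language credit assignment and gradient evolution in language space, and providing dense feedback by extracting causal relations from trajectories to cope with sparse rewards.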

Abstract: Large language model (LLM) agents struggle to autonomously evolve coordination strategies in dynamic environments, largely because coarse global outcomes obscure the causal signals needed for local policy refinement. We identify this bottleneck as a multi-agent credit assignment problem, which has long been studied in classical multi-agent reinforcement learning (MARL) but remains underaddressed in LLM-based systems. Building on this observation, we propose LangMARL, a framework that brings credit assignment and policy gradient evolution from cooperative MARL into the language space. LangMARL introduces agent-level language credit assignment, pioneers gradient evolution in language space for policy improvement, and summarizes task-relevant causal relations from replayed trajectories to provide dense feedback and improve convergence under sparse rewards. Extensive experiments across diverse cooperative multi-agent tasks demonstrate improved sample efficiency, interpretability, and strong generalization.


[24] From Early Encoding to Late Suppression: Interpreting LLMs on Character Counting Tasks cs.CL

Ayan Datta, Mounika Marreddy, Alexander Mehler, Zhixue Zhao, Radhika Mamidi

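TL;DR: This paper investigates why large language models fail on elementary symbolic tasks such as character counting, and finds that models often compute the correct answer internally but it is suppressed at the output layer. Mechanistic analysis shows that character-level information is encoded in early and mid layers but is attenuated in later layers (especially the penultimate and final layer MLPs) by negative circuits, leading the model to output higher-probability but incorrect answers.

Details

Motivation: Although LLMs excel on complex benchmarks, they fail on elementary symbolic tasks such as character counting, and the internal reasons are unclear. The paper uses character counting as a controlled task to investigate the mechanism behind the mismatch between internal representations and outputs.

Result: Across modern architectures including LLaMA, Qwen, and Gemma, models often compute the correct character count internally yet give a wrong answer at the output layer. Using probing classifiers, activation patching, logit lens analysis, and attention head tracing, the paper shows qualitatively and quantitatively how negative circuits suppress the correct signal.

Insight: The contribution is showing that symbolic reasoning failures in LLMs stem not from missing representations or insufficient scale but from structured interference within the computation graph (negative circuits), and proposing that LLM forward passes implement a form of competitive decoding, in which correct and incorrect hypotheses coexist and the final output is determined by dynamic reweighting (suppression as well as amplification). This offers a new perspective for interpretability and robustness-oriented design.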

Abstract: Large language models (LLMs) exhibit failures on elementary symbolic tasks such as character counting in a word, despite excelling on complex benchmarks. Although this limitation has been noted, the internal reasons remain unclear. We use character counting (e.g., “How many p’s are in apple?”) as a minimal, controlled probe that isolates token-level reasoning from higher-level confounds. Using this setting, we uncover a consistent phenomenon across modern architectures, including LLaMA, Qwen, and Gemma: models often compute the correct answer internally yet fail to express it at the output layer. Through mechanistic analysis combining probing classifiers, activation patching, logit lens analysis, and attention head tracing, we show that character-level information is encoded in early and mid-layer representations. However, this information is attenuated by a small set of components in later layers, especially the penultimate and final layer MLP. We identify these components as negative circuits: subnetworks that downweight correct signals in favor of higher-probability but incorrect outputs. Our results lead to two contributions. First, we show that symbolic reasoning failures in LLMs are not due to missing representations or insufficient scale, but arise from structured interference within the model’s computation graph. This explains why such errors persist and can worsen under scaling and instruction tuning. Second, we provide evidence that LLM forward passes implement a form of competitive decoding, in which correct and incorrect hypotheses coexist and are dynamically reweighted, with final outputs determined by suppression as much as by amplification. These findings carry implications for interpretability and robustness: simple symbolic reasoning exposes weaknesses in modern LLMs, underscoring the need for design strategies that ensure information is encoded and reliably used.


[25] Emotion Entanglement and Bayesian Inference for Multi-Dimensional Emotion Understanding cs.CL | cs.AI

Hemanth Kotaprolu, Kishan Maharaj, Raey Zhao, Abhijit Mishra, Pushpak Bhattacharyya

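TL;DR: This paper introduces EmoScene, a multi-dimensional emotion understanding benchmark grounded in Plutchik's theory of basic emotions and comprising 4,731 context-rich scenarios, for evaluating language models on context-aware multi-label emotion prediction. The authors find that existing instruction-tuned large language models perform modestly on the task, and further propose a lightweight Bayesian-inference post-processing framework that incorporates emotion co-occurrence statistics to improve the structural consistency of predictions.

Details

Motivation: Most existing emotion understanding benchmarks rely on short texts and predefined labels, reducing emotion understanding to independent label prediction and ignoring the structured dependencies among emotions, so they fail to reflect the multi-dimensional reasoning nature of emotion in natural language.

Result: Six instruction-tuned large language models are evaluated in a zero-shot setting; the best model reaches a Macro F1 of 0.501, indicating that the task is challenging. The proposed Bayesian post-processing framework improves performance, for example a +0.051 Macro F1 gain for Qwen2.5-7B.

Insight: The contribution is EmoScene, a theory-grounded, context-rich benchmark for multi-dimensional emotion understanding, together with a lightweight Bayesian framework that uses emotion co-occurrence statistics for joint posterior inference, modeling the entanglement among emotions and improving the structural consistency of predictions.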

Abstract: Understanding emotions in natural language is inherently a multi-dimensional reasoning problem, where multiple affective signals interact through context, interpersonal relations, and situational cues. However, most existing emotion understanding benchmarks rely on short texts and predefined emotion labels, reducing this process to independent label prediction and ignoring the structured dependencies among emotions. To address this limitation, we introduce Emotional Scenarios (EmoScene), a theory-grounded benchmark of 4,731 context-rich scenarios annotated with an 8-dimensional emotion vector derived from Plutchik’s basic emotions. We evaluate six instruction-tuned large language models in a zero-shot setting and observe modest performance, with the best model achieving a Macro F1 of 0.501, highlighting the difficulty of context-aware multi-label emotion prediction. Motivated by the observation that emotions rarely occur independently, we further propose an entanglement-aware Bayesian inference framework that incorporates emotion co-occurrence statistics to perform joint posterior inference over the emotion vector. This lightweight post-processing improves structural consistency of predictions and yields notable gains for weaker models (e.g., +0.051 Macro F1 for Qwen2.5-7B). EmoScene therefore provides a challenging benchmark for studying multi-dimensional emotion understanding and the limitations of current language models.
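
To illustrate what joint posterior inference with co-occurrence statistics could look like, here is a minimal sketch. It assumes a smoothed empirical prior over whole emotion vectors and treats per-emotion model probabilities as independent likelihood terms; both choices, and all names below, are assumptions for illustration rather than the paper's formulation. A three-emotion toy is used instead of the benchmark's eight dimensions.

```python
from collections import Counter
from itertools import product

def cooccurrence_prior(label_vectors, n_labels, alpha=1.0):
    """Smoothed empirical prior over complete binary emotion vectors."""
    counts = Counter(tuple(v) for v in label_vectors)
    total = len(label_vectors) + alpha * 2 ** n_labels
    return {
        v: (counts[v] + alpha) / total
        for v in product((0, 1), repeat=n_labels)
    }

def map_emotion_vector(model_probs, prior):
    """Return the vector maximizing prior(v) * prod_i p_i^v_i * (1 - p_i)^(1 - v_i)."""
    def score(v):
        s = prior[v]
        for p, bit in zip(model_probs, v):
            s *= p if bit else (1.0 - p)
        return s
    return max(prior, key=score)

# Toy data: emotions 0 and 1 almost always co-occur in the reference labels.
history = [(1, 1, 0)] * 8 + [(0, 0, 1)] * 2
prior = cooccurrence_prior(history, n_labels=3)

# Independent thresholding at 0.5 would give (1, 0, 0); the prior pulls
# the second emotion on because it co-occurs with the first.
model_probs = [0.9, 0.45, 0.1]
joint = map_emotion_vector(model_probs, prior)
```

Exhaustive enumeration is feasible here because an 8-dimensional binary vector has only 256 configurations; larger label spaces would need an approximate search.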


[26] When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation cs.CL

Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou

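TL;DR: This paper presents the first systematic study of interruptible agents in long-horizon, environmentally grounded web navigation tasks. It introduces InterruptBench, a benchmark covering three realistic interruption types, and evaluates six large language models on how effectively they adapt and how efficiently they recover in single- and multi-turn interruption settings.

Details

Motivation: As LLM agents move from solving short, static problems to executing complex long-horizon tasks in dynamic environments, handling interruptions such as users adding requirements or revising goals mid-execution becomes a core requirement for real-world deployment. Existing benchmarks, however, mostly assume uninterrupted agent behavior or study interruptions only in short, unconstrained language tasks.

Result: The evaluation, conducted on InterruptBench (built from WebArena-Lite), shows that handling user interruptions effectively and efficiently in long-horizon agentic tasks remains challenging even for powerful large-scale LLMs.

Insight: The contribution is the first formalization of three realistic interruption types (addition, revision, retraction) in long-horizon web navigation, and a benchmark that synthesizes high-quality interruption scenarios under strict semantic constraints, providing a systematic framework for evaluating how agents adapt and recover in dynamic environments.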

Abstract: As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirements or revising goals, during mid-task execution is becoming a core requirement for realistic deployment. However, existing benchmarks largely assume uninterrupted agent behavior or study interruptions only in short, unconstrained language tasks. In this paper, we present the first systematic study of interruptible agents in long-horizon, environmentally grounded web navigation tasks, where actions induce persistent state changes. We formalize three realistic interruption types, including addition, revision, and retraction, and introduce InterruptBench, a benchmark derived from WebArena-Lite that synthesizes high-quality interruption scenarios under strict semantic constraints. Using a unified interruption simulation framework, we evaluate six strong LLM backbones across single- and multi-turn interruption settings, analyzing both their effectiveness in adapting to updated intents and their efficiency in recovering from mid-task changes. Our results show that handling user interruptions effectively and efficiently during long-horizon agentic tasks remains challenging for powerful large-scale LLMs. Code and dataset are available at https://github.com/HenryPengZou/InterruptBench.


[27] Dual Optimal: Make Your LLM Peer-like with Dignity cs.CL | cs.AI

Xiangqi Wang, Yue Huang, Haomin Zhuang, Kehan Guo, Xiangliang Zhang

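TL;DR: This paper targets a dual failure mode of current aligned language models, termed the "Evasive Servant" (sycophantically validating flawed user beliefs while deflecting responsibility with boilerplate disclaimers), and proposes the "Dignified Peer" framework. The framework counters servility with anti-sycophancy and trustworthiness, and mitigates evasiveness through empathy and creativity. To realize it, the authors address challenges in data supervision, objective collapse, and evaluation bias: they introduce PersonaKnob, a multi-persona preference dataset with a compositional partial order structure, and develop a tolerant constrained Lagrangian DPO algorithm that dynamically balances all persona dimensions to prevent behavioral collapse. They also employ a psychometrically calibrated Item Response Theory evaluation protocol to disentangle latent model persona capability from confounders such as judge bias. Extensive empirical studies show that the approach successfully builds an LLM agent that is both dignified and peer-like.

Details

Motivation: Current aligned language models exhibit the "Evasive Servant" problem: they are overly sycophantic toward users (even when the users' beliefs are flawed) while evading responsibility through disclaimers. The paper aims to build a more dignified and trustworthy LLM that interacts with users as a peer.

Result: Extensive empirical studies indicate that the proposed method successfully builds an LLM agent that is both dignified and peer-like. The abstract reports no specific benchmark numbers and does not state whether the method reaches SOTA; the paper instead validates its model through a new evaluation protocol (Item Response Theory) that separates latent persona capability from confounders.

Insight: The core contribution is the "Dignified Peer" conceptual framework together with three technical components: (1) PersonaKnob, a multi-persona preference dataset with a compositional partial order structure, for fine-grained data supervision; (2) a tolerant constrained Lagrangian DPO algorithm that dynamically balances multiple objectives to prevent behavioral collapse; and (3) a psychometrically calibrated evaluation protocol based on Item Response Theory, for assessing the model's latent capability more accurately and reducing evaluation bias. Viewed objectively, modeling multi-persona preferences as a partial order, applying constrained optimization to alignment algorithms, and bringing psychometric methods into LLM evaluation are all ideas worth borrowing.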

Abstract: Current aligned language models exhibit a dual failure mode we term the Evasive Servant: they sycophantically validate flawed user beliefs while deflecting responsibility with boilerplate disclaimers. We propose the Dignified Peer framework, which counters servility with anti-sycophancy and trustworthiness, and mitigates evasiveness through empathy and creativity. Realizing this agent requires overcoming significant challenges in data supervision, objective collapse, and evaluation bias. We address these issues by introducing the PersonaKnob dataset which features a compositional partial order structure of multiple persona preferences. This data is utilized alongside a tolerant constrained Lagrangian DPO algorithm that dynamically balances all persona dimensions to prevent behavioral collapse. Additionally, we employ a psychometrically calibrated Item Response Theory evaluation protocol to disentangle latent model persona capability from confounders like judge biases. Extensive empirical studies demonstrate that our approach successfully builds an LLM agent that is both dignified and peer-like.


[28] Multimodal Analysis of State-Funded News Coverage of the Israel-Hamas War on YouTube Shorts cs.CL | cs.AI | cs.SI

Daniel Miehling, Sandra Kuebler

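TL;DR: This paper presents a multimodal analysis pipeline combining automatic transcription, aspect-based sentiment analysis (ABSA), and semantic scene classification to study state-funded media coverage of the Israel-Hamas war on YouTube Shorts. Analyzing over 2,300 relevant Shorts and more than 94,000 visual frames, it systematically compares war reporting across international broadcasters. The study finds that sentiment toward specific aspects differs across outlets and over time, that visual scene classifications are consistent with real-world events, and that smaller domain-adapted models outperform large transformers and LLMs on sentiment analysis.

Details

Motivation: YouTube Shorts have become an important form of news consumption, yet research on how geopolitical events are represented in this short-form video format remains limited; the paper aims to fill this gap.

Result: On over 2,300 conflict-related Shorts and more than 94,000 visual frames, sentiment analysis shows that sentiment toward specific aspects differs across outlets and changes over time, while scene classifications are consistent with real-world events. On the sentiment analysis task, smaller domain-adapted models outperform large transformers and LLMs.

Insight: The contribution is a scalable multimodal pipeline applicable to other short-form platforms such as TikTok and Instagram, demonstrating that multimodal methods combined with qualitative interpretation can characterize sentiment patterns and visual cues in algorithmically driven video environments. The study also highlights the value of resource-efficient, smaller domain-adapted models for humanities research, where they outperform general-purpose large models.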

Abstract: YouTube Shorts have become central to news consumption on the platform, yet research on how geopolitical events are represented in this format remains limited. To address this gap, we present a multimodal pipeline that combines automatic transcription, aspect-based sentiment analysis (ABSA), and semantic scene classification. The pipeline is first assessed for feasibility and then applied to analyze short-form coverage of the Israel-Hamas war by state-funded outlets. Using over 2,300 conflict-related Shorts and more than 94,000 visual frames, we systematically examine war reporting across major international broadcasters. Our findings reveal that the sentiment expressed in transcripts regarding specific aspects differs across outlets and over time, whereas scene-type classifications reflect visual cues consistent with real-world events. Notably, smaller domain-adapted models outperform large transformers and even LLMs for sentiment analysis, underscoring the value of resource-efficient approaches for humanities research. The pipeline serves as a template for other short-form platforms, such as TikTok and Instagram, and demonstrates how multimodal methods, combined with qualitative interpretation, can characterize sentiment patterns and visual cues in algorithmically driven video environments.


[29] CARE: Privacy-Compliant Agentic Reasoning with Evidence Discordance cs.CL

Haochen Liu, Weien Li, Rui Song, Zeyu Li, Chun Jason Xue

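TL;DR: This paper proposes CARE, a framework for handling discordance between symptoms and signs in medical decision-making. A remote LLM generates structured categories and transitions as guidance, and a local LLM performs evidence acquisition and decision-making, yielding performance gains on the MIMIC-DOS dataset while preserving privacy.

Details

Motivation: In real-world healthcare settings, patient-reported symptoms can contradict medical signs, and existing LLM systems perform worse in such cases; the paper addresses this problem.

Result: On MIMIC-DOS, a dataset for predicting organ dysfunction worsening in the ICU, CARE outperforms multiple baseline settings on all key metrics and handles conflicting clinical evidence more robustly.

Insight: The contribution is a privacy-compliant framework built on remote-local LLM collaboration: the remote model provides structured guidance without accessing sensitive data, while the local model carries out the actual reasoning, balancing data privacy with robust decision-making.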

Abstract: Large language model (LLM) systems are increasingly used to support high-stakes decision-making, but they typically perform worse when the available evidence is internally inconsistent. Such a scenario exists in real-world healthcare settings, with patient-reported symptoms contradicting medical signs. To study this problem, we introduce MIMIC-DOS, a dataset for short-horizon organ dysfunction worsening prediction in the intensive care unit (ICU) setting. We derive this dataset from the widely recognized MIMIC-IV, a publicly available electronic health record dataset, and construct it exclusively from cases in which discordance between signs and symptoms exists. This setting poses a substantial challenge for existing LLM-based approaches, with single-pass LLMs and agentic pipelines often struggling to reconcile such conflicting signals. To address this problem, we propose CARE: a multi-stage privacy-compliant agentic reasoning framework in which a remote LLM provides guidance by generating structured categories and transitions without accessing sensitive patient data, while a local LLM uses these categories and transitions to support evidence acquisition and final decision-making. Empirically, CARE achieves stronger performance across all key metrics compared to multiple baseline settings, showing that CARE can more robustly handle conflicting clinical evidence while preserving privacy.


[30] Brainstacks: Cross-Domain Cognitive Capabilities via Frozen MoE-LoRA Stacks for Continual LLM Learning cs.CL | cs.AI

Mohammad R. Abu Ayyash

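TL;DR: This paper presents Brainstacks, a modular architecture for continual multi-domain fine-tuning of large language models. Its core idea is to package domain expertise as frozen adapter stacks that compose additively on a shared frozen base model at inference. The method has five key components: MoE-LoRA routing, a residual-boosting inner loop, a curriculum-ordered outer loop, null-space projection for zero forgetting, and an outcome-based meta-router. Experiments on TinyLlama-1.1B and Gemma 3 12B IT show advantages in convergence speed, in breaking the single-stack performance ceiling, and in cross-domain composition.

Details

Motivation: LLMs that continually learn multiple domains face forgetting, poor parameter efficiency, and difficulty composing capabilities across domains; the paper aims for efficient, scalable, forgetting-free multi-domain adaptation.

Result: Validated on TinyLlama-1.1B (4 domains, 9 stacks) and Gemma 3 12B IT (5 domains, 10 stacks). MoE-LoRA converges 2.5x faster than a parameter-matched single LoRA, residual boosting breaks through the single-stack performance ceiling, and the routed system recovers the generation quality destroyed by ungated stack accumulation.

Insight: The main contribution is designing domain adapters as composable frozen stacks and composing cross-domain cognitive capabilities through an outcome-based meta-router. A central finding is that domain stacks encode transferable cognitive primitives (such as instruction-following clarity and numerical reasoning) rather than domain-specific knowledge, which explains the strong cross-domain generalization (for example, medical prompts route to the chat+math stacks in 97% of cases). Architecturally, the joint design of MoE-LoRA routing, residual boosting, curriculum-ordered training, and null-space projection offers a systematic solution for efficient, forgetting-free continual learning.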

Abstract: We present Brainstacks, a modular architecture for continual multi-domain fine-tuning of large language models that packages domain expertise as frozen adapter stacks composing additively on a shared frozen base at inference. Five interlocking components: (1) MoE-LoRA with Shazeer-style noisy top-2 routing across all seven transformer projections under QLoRA 4-bit quantization with rsLoRA scaling; (2) an inner loop performing residual boosting by freezing trained stacks and adding new ones; (3) an outer loop training sequential domain-specific stacks with curriculum-ordered dependencies; (4) null-space projection via randomized SVD constraining new stacks to subspaces orthogonal to prior directions, achieving zero forgetting in isolation; (5) an outcome-based sigmoid meta-router trained on empirically discovered domain-combination targets that selectively weights stacks, enabling cross-domain composition. Two boundary experiments: (6) PSN pretraining on a randomly initialized model; (7) per-domain RL (DPO/GRPO) validating compatibility with post-SFT alignment. Validated on TinyLlama-1.1B (4 domains, 9 stacks) and Gemma 3 12B IT (5 domains, 10 stacks), MoE-LoRA achieves 2.5x faster convergence than parameter-matched single LoRA, residual boosting breaks through the single-stack ceiling, and the routed system recovers generation quality destroyed by ungated stack accumulation. The central finding: the outcome-based router discovers that domain stacks encode transferable cognitive primitives (instruction-following clarity, numerical reasoning, procedural logic, chain-of-thought structure) rather than domain-specific knowledge, with medical prompts routing to chat+math stacks in 97% of cases despite zero medical data in those stacks.
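
For readers unfamiliar with the "Shazeer-style noisy top-2 routing" mentioned in component (1), the gate can be sketched as below. This follows the general noisy top-k gating recipe (clean logits plus Gaussian noise scaled by a softplus term, then a softmax restricted to the top-k experts). The function takes precomputed logits and noise pre-activations rather than weight matrices, and every name here is a hypothetical choice for the sketch, not Brainstacks' code.

```python
import math
import random

def noisy_top_k_gates(clean_logits, noise_preacts, k=2, rng=None):
    """Noisy top-k gating over a list of experts.

    clean_logits: per-expert routing logits (e.g. x @ W_gate).
    noise_preacts: per-expert noise pre-activations (e.g. x @ W_noise);
        softplus of these scales standard Gaussian noise.
    Returns per-expert gate weights: softmax over the top-k noisy logits,
    exactly 0.0 for every other expert.
    """
    if rng is None:
        rng = random.Random(0)
    noisy = [
        c + rng.gauss(0.0, 1.0) * math.log1p(math.exp(s))
        for c, s in zip(clean_logits, noise_preacts)
    ]
    top = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k]
    peak = max(noisy[i] for i in top)
    exps = {i: math.exp(noisy[i] - peak) for i in top}
    norm = sum(exps.values())
    return [exps[i] / norm if i in exps else 0.0 for i in range(len(noisy))]

# With very negative noise pre-activations the noise term vanishes,
# so routing depends only on the clean logits.
gates = noisy_top_k_gates([2.0, 0.0, 1.0, -1.0], [-50.0] * 4, k=2)
```

During training the noise encourages load balancing across experts; at inference it is commonly switched off, which the very negative pre-activations in the example emulate.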


[31] S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models cs.CL | cs.LG

Jack Young

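TL;DR: This paper proposes S0 tuning, a parameter-efficient fine-tuning method designed for hybrid recurrent-attention models. It optimizes only the initial state matrix of each recurrent layer while freezing all model weights, giving adaptation with zero inference overhead. Using roughly 48 execution-verified training samples, it outperforms LoRA on HumanEval and transfers to benchmarks such as GSM8K, and task switching requires no weight merging.

Details

Motivation: The paper asks how to fine-tune hybrid recurrent-attention models parameter-efficiently when supervised data is scarce, without introducing inference overhead or weight-merging steps.

Result: On HumanEval, S0 tuning outperforms LoRA by 10.8 percentage points (p < 0.001). On Qwen3.5-4B, it improves greedy pass@1 by 23.6 ± 1.7 percentage points. On FalconH1-7B, it is on par with LoRA (71.8% vs. 71.4%). Cross-domain transfer yields gains of 4.8 and 2.8 percentage points on MATH-500 and GSM8K, respectively.

Insight: The contribution is identifying the initial state matrix of recurrent layers as an efficient fine-tuning surface, enabling parameter-efficient adaptation with zero inference overhead. Viewed objectively, the method moves the tuned parameters from weight space to state space, giving hybrid architectures a lightweight adaptation scheme; the tuned state file is only about 48 MB, which supports fast task switching.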

Abstract: Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval. The method, which we call S0 tuning, optimizes one state matrix per recurrent layer while freezing all model weights. On Qwen3.5-4B (GatedDeltaNet hybrid), S0 tuning improves greedy pass@1 by +23.6 +/- 1.7 pp (10 seeds). On FalconH1-7B (Mamba-2 hybrid), S0 reaches 71.8% +/- 1.3 and LoRA reaches 71.4% +/- 2.4 (3 seeds), statistically indistinguishable at this sample size while requiring no weight merging. Cross-domain transfer is significant on MATH-500 (+4.8 pp, p = 0.00002, 8 seeds) and GSM8K (+2.8 pp, p = 0.0003, 10 seeds); a text-to-SQL benchmark (Spider) shows no transfer, consistent with the trajectory-steering mechanism. A prefix-tuning control on a pure Transformer (Qwen2.5-3B) degrades performance by -13.9 pp under all nine configurations tested. On Qwen3.5, a per-step state-offset variant reaches +27.1 pp, above both S0 and LoRA but with per-step inference cost. Taken together, the results show that recurrent state initialization is a strong zero-inference-overhead PEFT surface for hybrid language models when verified supervision is scarce. The tuned state is a ~48 MB file; task switching requires no weight merging or model reload. Code and library: https://github.com/jackyoung27/s0-tuning.
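
The core idea, optimizing an initial recurrent state while every weight stays frozen, can be shown on a toy scalar recurrence. This is a deliberately simplified sketch: the real method tunes one state matrix per recurrent layer of a hybrid language model, whereas the linear recurrence, squared-error loss, and hyperparameters below are assumptions chosen so the gradient can be written by hand.

```python
def run_recurrence(h0, xs, a=0.9, b=0.5):
    """Frozen linear recurrence h_t = a * h_{t-1} + b * x_t; returns h_T."""
    h = h0
    for x in xs:
        h = a * h + b * x
    return h

def tune_initial_state(xs, target, steps=100, lr=0.5, a=0.9, b=0.5):
    """Gradient descent on h0 only; the weights a and b are never updated.

    For this linear recurrence d h_T / d h0 = a ** len(xs), so the gradient
    of the squared error (h_T - target) ** 2 with respect to h0 is
    2 * (h_T - target) * a ** len(xs).
    """
    h0 = 0.0
    dh_dh0 = a ** len(xs)
    for _ in range(steps):
        err = run_recurrence(h0, xs, a, b) - target
        h0 -= lr * 2.0 * err * dh_dh0
    return h0

inputs = [1.0, 0.0, -1.0, 2.0, 0.5]
tuned_h0 = tune_initial_state(inputs, target=3.0)
final = run_recurrence(tuned_h0, inputs)  # close to 3.0 after tuning
```

Since only `h0` changes, switching tasks amounts to swapping one stored state, which mirrors the abstract's point that no weight merging or model reload is needed.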


[32] Embarrassingly Simple Self-Distillation Improves Code Generation cs.CLPDF

Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert

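TL;DR: This paper proposes simple self-distillation (SSD), which improves code generation using only the LLM's own raw outputs, with no verifier, teacher model, or reinforcement learning. Solutions are sampled from the model under specific temperature and truncation settings, and the model is then fine-tuned on those samples with standard supervised fine-tuning. SSD substantially improves pass@1 on LiveCodeBench v6 across several models, and the authors trace the gains to a precision-exploration conflict in LLM decoding that SSD resolves.

Details

Motivation: To find out whether an LLM can improve at code generation using only its own raw outputs, without an external verifier, a teacher model, or reinforcement learning.

Result: On LiveCodeBench v6, SSD raises Qwen3-30B-Instruct from 42.4% to 55.3% pass@1, with gains concentrated on harder problems. It generalizes across Qwen and Llama models at 4B, 8B, and 30B scale, including both instruct and thinking variants.

Insight: The core contribution is an extremely simple self-distillation post-training method that needs no external components. The key insight is that SSD reshapes token distributions in a context-dependent way: it suppresses distractor tails where precision matters and preserves useful diversity where exploration matters, which resolves the precision-exploration conflict in LLM decoding. This offers a complementary post-training direction for improving LLM code generation.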

Abstract: Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model with certain temperature and truncation configurations, then fine-tune on those samples with standard supervised fine-tuning. SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with gains concentrating on harder problems, and it generalizes across Qwen and Llama models at 4B, 8B, and 30B scale, including both instruct and thinking variants. To understand why such a simple method can work, we trace these gains to a precision-exploration conflict in LLM decoding and show that SSD reshapes token distributions in a context-dependent way, suppressing distractor tails where precision matters while preserving useful diversity where exploration matters. Taken together, SSD offers a complementary post-training direction for improving LLM code generation.
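The sampling step can be sketched as follows. The abstract does not say which truncation scheme or temperature values SSD uses, so top-k truncation and the function names `truncated_temperature_probs` and `sample_token` are assumptions chosen only to show what "temperature and truncation" sampling means; the subsequent supervised fine-tuning on the sampled solutions is not shown.

```python
import math
import random


def truncated_temperature_probs(logits, temperature, top_k):
    """Softmax over the top_k logits at the given temperature; all other tokens get 0."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    peak = max(logits[i] for i in keep)  # subtract the max for numerical stability
    weights = {i: math.exp((logits[i] - peak) / temperature) for i in keep}
    total = sum(weights.values())
    return [weights.get(i, 0.0) / total for i in range(len(logits))]


def sample_token(logits, temperature, top_k, rng):
    """Draw one token index from the truncated, temperature-scaled distribution."""
    probs = truncated_temperature_probs(logits, temperature, top_k)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

Truncation removes the low-probability tail entirely, while a higher temperature spreads mass more evenly over the tokens that remain.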


[33] ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget cs.CL | cs.AI | cs.IRPDF

Nandan Thakur, Zijian Chen, Xueguang Ma, Jimmy Lin

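TL;DR: This paper presents ORBIT, a synthetic training dataset for search agents containing 20K queries that require multi-step reasoning and have verifiable answers, together with the frugal framework that generates it. The framework has four modular stages (seed creation, question-answer pair generation, self-verification, and external verification) and builds high-quality data at low cost without paid APIs. A Qwen3-4B model trained on ORBIT performs strongly on Wikipedia question answering, supporting the usefulness of synthetic datasets.

Details

Motivation: Search agents that combine language models with web search need training data involving multi-step retrieval and reasoning for complex queries, but human annotation is expensive and laborious, and existing approaches depend on paid APIs or impose cumbersome prerequisites.

Result: On Wikipedia question answering tasks, Qwen3-4B trained on ORBIT (ORBIT-4B) achieves strong performance among sub-4B LLMs used as search agents, demonstrating the practical value of the synthetic dataset.

Insight: The contribution is a modular, scalable, low-cost synthetic data generation framework (seed creation, question-answer generation, self-verification, external verification) that uses strict web-search verification to ensure answer reliability. The framework, code, and datasets are open-sourced, offering a workable route to training search agents under tight resource budgets.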

Abstract: Search agents, which integrate language models (LMs) with web search, are becoming crucial for answering complex user queries. Constructing training datasets for deep research tasks, involving multi-step retrieval and reasoning, remains challenging due to expensive human annotation or cumbersome prerequisites. In this work, we introduce ORBIT, a training dataset with 20K reasoning-intensive queries with short verifiable answers, generated using a frugal framework without relying on paid API services. The modular framework relies on four stages: seed creation, question–answer pair generation, and two stages of verification: self and external. ORBIT spans 15 domains and each training pair requires 4–5 reasoning steps, with external search verification required from the complete web. We train Qwen3-4B as the base model on ORBIT using GRPO and evaluate it on Wikipedia question answering tasks. Extensive experimental results demonstrate that ORBIT-4B achieves strong performance among sub-4B LLMs as search agents, proving the utility of synthetic datasets. Our framework, code and datasets are open-sourced and available publicly.


[34] Universal YOCO for Efficient Depth Scaling cs.CLPDF

Yutao Sun, Li Dong, Tianzhu Ye, Shaohan Huang, Jianyong Wang

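TL;DR: This paper proposes Universal YOCO (YOCO-U), which combines the YOCO decoder-decoder architecture with recursive computation to make inference-time compute scaling more efficient for large language models. It runs multiple iterations through parameter sharing and confines the iteration to shallow, efficient-attention layers, improving capability while keeping inference efficient.

Details

Motivation: Standard Transformers struggle to scale inference-time compute efficiently, because conventional looping strategies incur high computational overhead and the KV cache grows with model depth. The paper aims to remove this bottleneck and reach a better capability-efficiency tradeoff.

Result: Experiments show that YOCO-U remains highly competitive on general and long-context benchmarks, supporting the combination of efficient-attention architectures with recursive computation for scalable LLMs.

Insight: The novelty is in combining YOCO's constant global KV cache and linear pre-filling with partial recursion that deepens representations, producing a synergistic effect and suggesting a new direction for scalable LLM design.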

Abstract: The rise of test-time scaling has remarkably boosted the reasoning and agentic proficiency of Large Language Models (LLMs). Yet, standard Transformers struggle to scale inference-time compute efficiently, as conventional looping strategies suffer from high computational overhead and a KV cache that inflates alongside model depth. We present Universal YOCO (YOCO-U), which combines the YOCO decoder-decoder architecture with recursive computation to achieve a synergistic effect greater than either alone. Built on the YOCO framework, YOCO-U implements a Universal Self-Decoder that performs multiple iterations via parameter sharing, while confining the iterative process to shallow, efficient-attention layers. This combination yields a favorable capability-efficiency tradeoff that neither YOCO nor recursion achieves independently. The YOCO architecture provides a constant global KV cache and linear pre-filling, while partial recursion enhances representational depth with limited overhead. Together, YOCO-U improves token utility and scaling behavior while maintaining efficient inference. Empirical results confirm that YOCO-U remains highly competitive in general and long-context benchmarks, demonstrating that the integration of efficient-attention architectures and recursive computation is a promising direction for scalable LLMs.


cs.CV [Back]

[35] Hierarchical Pre-Training of Vision Encoders with Large Language Models cs.CV | cs.AI | cs.CL | cs.LGPDF

Eugene Lee, Ting-Yu Chang, Jui-Huang Tsai, Jiajie Diao, Chen-Yi Lee

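TL;DR: This paper proposes HIVE (Hierarchical Pre-Training of Vision Encoders), a framework that introduces hierarchical cross-attention between the vision encoder and the large language model (LLM) to strengthen vision-language alignment and improve multimodal representation learning.

Details

Motivation: Existing methods usually treat the vision encoder and the LLM as independent modules, which limits the integration of hierarchical visual features; a framework that supports structured feature fusion is needed.

Result: HIVE outperforms self-attention-based methods in image classification and on vision-language benchmarks such as MME, GQA, OK-VQA, and ScienceQA.

Insight: The novelty lies in the hierarchical cross-attention mechanism and a three-stage progressive training strategy, which align the vision encoder with the LLM across layers and improve gradient flow and representation learning.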

Abstract: The field of computer vision has experienced significant advancements through scalable vision encoders and multimodal pre-training frameworks. However, existing approaches often treat vision encoders and large language models (LLMs) as independent modules, limiting the integration of hierarchical visual features. In this work, we propose HIVE (Hierarchical Pre-Training of Vision Encoders), a novel framework that enhances vision-language alignment by introducing hierarchical cross-attention between the vision encoder and LLM. Unlike conventional methods that flatten image embeddings, HIVE enables structured feature fusion across multiple layers, improving gradient flow and representation learning. To optimize this interaction, we introduce a three-stage training strategy that progressively aligns the vision encoder with the LLM, ensuring stable optimization and effective multimodal fusion. Empirical evaluations demonstrate that HIVE achieves superior performance not only in image classification but also on various vision-language tasks, outperforming self-attention-based methods in benchmarks such as MME, GQA, OK-VQA, and ScienceQA. Our results highlight the benefits of hierarchical feature integration, paving the way for more efficient and expressive vision-language models.


[36] RawGen: Learning Camera Raw Image Generation cs.CVPDF

Dongyoung Kim, Junyong Lee, Abhijith Punnappurath, Mahmoud Afifi, Sangmin Han

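TL;DR: This paper presents RawGen, described as the first diffusion-based text-to-raw image generation framework. It generates physically consistent linear raw images for arbitrary target cameras and also performs sRGB-to-raw inversion. Leveraging the generative priors of large-scale sRGB diffusion models, it synthesizes CIE XYZ or camera-specific raw representations through specialized processing in latent and pixel spaces, addressing the scarcity of raw data.

Details

Motivation: Scarce raw image data is a bottleneck for low-level vision tasks. Existing diffusion models mainly synthesize ISP-processed sRGB images rather than physically consistent linear raw representations, which limits their use in tasks that need raw data.

Result: Experiments show that RawGen outperforms traditional inverse-ISP methods that assume a fixed ISP, and that its text-driven synthetic data improves downstream low-level vision tasks.

Insight: The contributions include a many-to-one inverse-ISP dataset that handles unknown and diverse ISP pipelines, and a fine-tuned conditional denoiser with a specialized decoder for camera-centric linear reconstruction, enabling scalable generation of physically meaningful raw image data.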

Abstract: Cameras capture scene-referred linear raw images, which are processed by onboard image signal processors (ISPs) into display-referred 8-bit sRGB outputs. Although raw data is more faithful for low-level vision tasks, collecting large-scale raw datasets remains a major bottleneck, as existing datasets are limited and tied to specific camera hardware. Generative models offer a promising way to address this scarcity – however, existing diffusion frameworks are designed to synthesize photo-finished sRGB images rather than physically consistent linear representations. This paper presents RawGen, to our knowledge the first diffusion-based framework enabling text-to-raw generation for arbitrary target cameras, alongside sRGB-to-raw inversion. RawGen leverages the generative priors of large-scale sRGB diffusion models to synthesize physically meaningful linear outputs, such as CIE XYZ or camera-specific raw representations, via specialized processing in latent and pixel spaces. To handle unknown and diverse ISP pipelines and photo-finishing effects in diffusion-model training data, we build a many-to-one inverse-ISP dataset where multiple sRGB renditions of the same scene generated using diverse ISP parameters are anchored to a common scene-referred target. Fine-tuning a conditional denoiser and specialized decoder on this dataset allows RawGen to obtain camera-centric linear reconstructions that effectively invert the rendering pipeline. We demonstrate RawGen’s superior performance over traditional inverse-ISP methods that assume a fixed ISP. Furthermore, we show that augmenting training pipelines with RawGen’s scalable, text-driven synthetic data can benefit downstream low-level vision tasks.


[37] Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models cs.CVPDF

Longwei Xu, Feng Feng, Shaojie Zhang, Xin Chen, Hang Li

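TL;DR: This paper addresses weak text anchoring (accurately grounding queried text to its spatial region in the image) in OCR-oriented vision-language models (VLMs) and proposes the Q-Mask framework. Built on a causal query-driven mask decoder (CQMD), it performs chain-of-thought-style visual decoding: it first generates query-conditioned visual masks and then produces the final OCR output, decoupling where the text is from what the text is. To evaluate and train this capability, the authors also build TextAnchor-Bench (TABench), a fine-grained text-region grounding benchmark, and the large-scale dataset TextAnchor-26M. Experiments show that Q-Mask substantially improves text anchoring and understanding across diverse visual scenes.

Details

Motivation: Current OCR-oriented VLMs fall short at accurately anchoring queried text to its spatial region in the image (that is, at establishing reliable text anchors), which limits their use in downstream tasks such as real-world visual question answering.

Result: Extensive experiments show that Q-Mask substantially improves text anchoring and understanding. The authors' new TABench benchmark also reveals that both general-purpose and OCR-specific VLMs still struggle to establish accurate, stable text anchors.

Insight: The core novelty is the causal query-driven mask decoder (CQMD) and the visual chain-of-thought (visual CoT) paradigm, which sequentially generates query-conditioned visual masks to decouple localization from recognition and constructs explicit text anchors at inference time. The large-scale TextAnchor-26M dataset with fine-grained mask annotations, which injects spatial priors, is also key to stable text anchoring.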

Abstract: Optical Character Recognition (OCR) is increasingly regarded as a foundational capability for modern vision-language models (VLMs), enabling them not only to read text in images but also to support downstream reasoning in real-world visual question answering (VQA). However, practical applications further require reliable text anchors, i.e., accurately grounding queried text to its corresponding spatial region. To systematically evaluate this capability, we introduce TextAnchor-Bench (TABench), a benchmark for fine-grained text-region grounding, which reveals that both general-purpose and OCR-specific VLMs still struggle to establish accurate and stable text anchors. To address this limitation, we propose Q-Mask, a precise OCR framework built upon a causal query-driven mask decoder (CQMD). Inspired by chain-of-thought reasoning, Q-Mask performs causal visual decoding that sequentially generates query-conditioned visual masks before producing the final OCR output. This visual CoT paradigm disentangles where the text is from what the text is, enforcing grounded evidence acquisition prior to recognition and enabling explicit text anchor construction during inference. To train CQMD, we construct TextAnchor-26M, a large-scale dataset of image-text pairs annotated with fine-grained masks corresponding to specific textual elements, encouraging stable text-region correspondences and injecting strong spatial priors into VLM training. Extensive experiments demonstrate that Q-Mask substantially improves text anchoring and understanding across diverse visual scenes.


[38] Suppressing Non-Semantic Noise in Masked Image Modeling Representations cs.CVPDF

Martine Hjelkrem-Tan, Marius Aasan, Rwiddhi Chakraborty, Gabriel Y. Arteaga, Changkyu Choi

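TL;DR: This paper shows that masked image modeling (MIM) objectives cause learned representations to retain non-semantic information, which hurts performance at inference. The authors introduce a model-agnostic, PCA-based score for semantic invariance and, building on it, a simple post-hoc method called Semantically Orthogonal Artifact Projection (SOAP) that directly suppresses non-semantic information in patch representations, consistently improving zero-shot performance across various MIM-based models.

Details

Motivation: Representations learned under the MIM self-supervised paradigm contain non-semantic information (noise) that degrades inference performance on downstream tasks.

Result: SOAP requires no training and attaches to any model as a single linear head; it yields consistent zero-shot performance improvements across various MIM-based models.

Insight: The novelty lies in a model-agnostic, PCA-based semantic invariance score that quantifies non-semantic information, and in a training-free post-hoc projection (SOAP) that directly suppresses this noise and improves the semantic purity of the representations. Together they provide a new, lightweight tool for analyzing and improving MIM representations.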

Abstract: Masked Image Modeling (MIM) has become a ubiquitous self-supervised vision paradigm. In this work, we show that MIM objectives cause the learned representations to retain non-semantic information, which ultimately hurts performance during inference. We introduce a model-agnostic score for semantic invariance using Principal Component Analysis (PCA) on real and synthetic non-semantic images. Based on this score, we propose a simple method, Semantically Orthogonal Artifact Projection (SOAP), to directly suppress non-semantic information in patch representations, leading to consistent improvements in zero-shot performance across various MIM-based models. SOAP is a post-hoc suppression method, requires zero training, and can be attached to any model as a single linear head.
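A projection that removes chosen directions from a representation is a linear map, which is consistent with the abstract's description of SOAP as a single linear head. The sketch below shows only that generic operation. It assumes the non-semantic directions have already been estimated (the abstract mentions PCA on real and synthetic non-semantic images, but the exact procedure is not given here), and the helper names `orthonormalize` and `project_out` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(directions):
    """Gram-Schmidt on the given directions, dropping near-zero leftovers."""
    basis = []
    for d in directions:
        v = list(d)
        for b in basis:
            c = dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = dot(v, v) ** 0.5
        if norm > 1e-12:
            basis.append([vi / norm for vi in v])
    return basis


def project_out(x, basis):
    """Remove the components of x along each vector of an orthonormal basis."""
    out = list(x)
    for b in basis:
        c = dot(x, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out
```

Applied to every patch representation, `project_out` leaves the result orthogonal to the assumed artifact subspace while keeping all other components unchanged.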


[39] UCell: rethinking generalizability and scaling of bio-medical vision models cs.CV | q-bio.QMPDF

Nicholas Kuang, Vanessa Scalon, Ji Yu

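TL;DR: This paper presents UCell, a small biomedical vision model for single-cell segmentation. A recursive structure makes it more parameter-efficient; on multiple benchmarks it matches models 10-20 times its size without relying on large-scale pretraining on natural images, and it shows good generalization and adaptability.

Details

Motivation: Training data in biomedical research is limited and costly, yet current research concentrates on building large foundation models while improving small models remains under-explored. The paper investigates whether efficient small models can be built under data constraints.

Result: On multiple single-cell segmentation benchmarks, UCell (10-30M parameters) matches the performance of models 10-20 times its size and generalizes similarly to unseen out-of-domain data; one-shot and few-shot fine-tuning experiments confirm its adaptability.

Insight: The novelty is in building a recursive structure into the model's forward computation graph to improve parameter efficiency, which lets a small model perform well on a data-constrained task. Viewed objectively, the approach reduces dependence on external pretraining data, improves controllability and generalization, and offers a lightweight solution for biomedical vision tasks.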

Abstract: The modern deep learning field is a scale-centric one. Larger models have been shown to consistently perform better than smaller models of similar architecture. In many sub-domains of biomedical research, however, model scaling is bottlenecked by the amount of available training data and the high cost associated with generating and validating additional high-quality data. Despite the practical hurdle, the majority of the ongoing research still focuses on building bigger foundation models, whereas the alternative of improving the ability of small models has been under-explored. Here we experiment with building models with 10-30M parameters, tiny by modern standards, to perform the single-cell segmentation task. An important design choice is the incorporation of a recursive structure into the model’s forward computation graph, leading to a more parameter-efficient architecture. We found that for single-cell segmentation, on multiple benchmarks, our small model, UCell, matches the performance of models 10-20 times its size, and with a similar generalizability to unseen out-of-domain data. More importantly, we found that UCell can be trained from scratch using only a set of microscopy imaging data, without relying on massive pretraining on natural images, and therefore decouples the model building from any external commercial interests. Finally, we examined and confirmed the adaptability of UCell by performing a wide range of one-shot and few-shot fine-tuning experiments on a diverse set of small datasets. Implementation is available at https://github.com/jiyuuchc/ucell


[40] Benchmarking Interaction, Beyond Policy: a Reproducible Benchmark for Collaborative Instance Object Navigation cs.CV | cs.AIPDF

Edoardo Zorzi, Francesco Taioli, Yiming Wang, Marco Cristani, Alessandro Farinelli

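TL;DR: This paper proposes QAsk-Nav, the first reproducible benchmark for Collaborative Instance Object Navigation (CoIN) that evaluates embodied navigation and collaborative question asking separately. It comprises a lightweight question-asking protocol, an enhanced navigation protocol, and an open-source dataset of 28,000 quality-checked traces. Using the benchmark, the authors develop Light-CoNav, a model that is 3x smaller and 70x faster than existing modular methods and outperforms state-of-the-art CoIN approaches in generalization to unseen objects and environments.

Details

Motivation: Existing CoIN benchmarks focus mainly on navigation success and offer no support for consistent evaluation of collaborative interaction. A new benchmark is needed to assess the two core abilities, navigation and collaborative question asking, independently.

Result: On the proposed QAsk-Nav benchmark, the authors' Light-CoNav model outperforms state-of-the-art CoIN approaches in generalization to unseen objects and environments, while being 3x smaller and 70x faster than existing modular methods.

Insight: The main contribution is the first reproducible benchmark (QAsk-Nav) that separates navigation success from collaborative question-asking ability and scores each independently, along with a lightweight, unified, and efficient collaborative navigation model (Light-CoNav) built on it. This provides a new framework and tooling for evaluating and improving the interactive collaboration abilities of embodied agents.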

Abstract: We propose Question-Asking Navigation (QAsk-Nav), the first reproducible benchmark for Collaborative Instance Object Navigation (CoIN) that enables an explicit, separate assessment of embodied navigation and collaborative question asking. CoIN tasks an embodied agent with reaching a target specified in free-form natural language under partial observability, using only egocentric visual observations and interactive natural-language dialogue with a human, where the dialogue can help to resolve ambiguity among visually similar object instances. Existing CoIN benchmarks are primarily focused on navigation success and offer no support for consistent evaluation of collaborative interaction. To address this limitation, QAsk-Nav provides (i) a lightweight question-asking protocol scored independently of navigation, (ii) an enhanced navigation protocol with realistic, diverse, high-quality target descriptions, and (iii) an open-source dataset, that includes 28,000 quality-checked reasoning and question-asking traces for training and analysis of interactive capabilities of CoIN models. Using the proposed QAsk-Nav benchmark, we develop Light-CoNav, a lightweight unified model for collaborative navigation that is 3x smaller and 70x faster than existing modular methods, while outperforming state-of-the-art CoIN approaches in generalization to unseen objects and environments. Project page at https://benchmarking-interaction.github.io/


[41] Omni-MMSI: Toward Identity-attributed Social Interaction Understanding cs.CVPDF

Xinpeng Li, Bolin Lai, Hardy Chen, Shijian Deng, Cihang Xie

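TL;DR: This paper introduces the Omni-MMSI task, which targets comprehensive, identity-attributed social interaction understanding from raw audio, vision, and speech input, including perceiving who is saying what and reasoning about whom the speaker refers to. To address the weak identity attribution of existing methods, the authors propose Omni-MMSI-R, a reference-guided pipeline that uses tools to produce identity-attributed social cues and then performs chain-of-thought social reasoning. Experiments show that it outperforms advanced LLMs and existing methods on Omni-MMSI.

Details

Motivation: AI assistants that perceive and respond to human interactions must understand identity-attributed social interaction from raw multimodal data, whereas prior studies mostly operate on preprocessed cues and cannot handle raw data from realistic scenarios.

Result: Experiments show that the proposed Omni-MMSI-R pipeline outperforms advanced LLMs and existing methods on the Omni-MMSI task.

Insight: The novelty is a new task that requires identity-attributed social interaction understanding from raw multimodal input, together with a reference-guided pipeline that combines tool-generated identity cues with chain-of-thought reasoning. The pipeline is supported by participant-level reference pairs and reasoning annotations, giving multimodal social understanding a more realistic benchmark and method.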

Abstract: We introduce Omni-MMSI, a new task that requires comprehensive social interaction understanding from raw audio, vision, and speech input. The task involves perceiving identity-attributed social cues (e.g., who is speaking what) and reasoning about the social interaction (e.g., whom the speaker refers to). This task is essential for developing AI assistants that can perceive and respond to human interactions. Unlike prior studies that operate on oracle-preprocessed social cues, Omni-MMSI reflects realistic scenarios where AI assistants must perceive and reason from raw data. However, existing pipelines and multi-modal LLMs perform poorly on Omni-MMSI because they lack reliable identity attribution capabilities, which leads to inaccurate social interaction understanding. To address this challenge, we propose Omni-MMSI-R, a reference-guided pipeline that produces identity-attributed social cues with tools and conducts chain-of-thought social reasoning. To facilitate this pipeline, we construct participant-level reference pairs and curate reasoning annotations on top of the existing datasets. Experiments demonstrate that Omni-MMSI-R outperforms advanced LLMs and counterparts on Omni-MMSI. Project page: https://sampson-lee.github.io/omni-mmsi-project-page.


[42] OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning cs.CVPDF

Taiting Lu, Kaiyuan Lin, Yuxin Tian, Yubo Wang, Muchuan Wang

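TL;DR: This paper introduces OmniSch, the first comprehensive benchmark for assessing large multimodal models (LMMs) on printed circuit board (PCB) schematic understanding and spatially weighted netlist graph construction. It contains 1,854 real-world schematics and covers four tasks: visual grounding, diagram-to-graph reasoning, geometric reasoning, and tool-augmented agentic reasoning. The results show substantial gaps in how current LMMs interpret schematic engineering artifacts.

Details

Motivation: Although LMMs have progressed rapidly in visual grounding, document understanding, and diagram reasoning, their ability to convert PCB schematics into machine-readable spatially weighted netlist graphs that jointly capture component attributes, connectivity, and geometry remains underexplored, even though such graph representations are central to practical electronic design automation (EDA) workflows.

Result: Evaluating current LMMs on OmniSch reveals substantial gaps in interpreting schematic engineering artifacts, including unreliable fine-grained grounding, brittle layout-to-graph parsing, inconsistent global connectivity reasoning, and inefficient visual exploration.

Insight: The contribution is OmniSch, the first multimodal benchmark for structured visual reasoning over PCB schematics. It systematically defines a full task chain from visual grounding to tool-augmented reasoning, providing tools and data for evaluating and advancing LMMs in specialized engineering domains. Viewed objectively, the work extends LMM evaluation from general domains to highly structured, specialized engineering diagrams, with clear practical value and research direction.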

Abstract: Recent large multimodal models (LMMs) have made rapid progress in visual grounding, document understanding, and diagram reasoning tasks. However, their ability to convert Printed Circuit Board (PCB) schematic diagrams into machine-readable spatially weighted netlist graphs, jointly capturing component attributes, connectivity, and geometry, remains largely underexplored, even though such graph representations are the backbone of practical electronic design automation (EDA) workflows. To bridge this gap, we introduce OmniSch, the first comprehensive benchmark designed to assess LMMs on schematic understanding and spatial netlist graph construction. OmniSch contains 1,854 real-world schematic diagrams and includes four tasks: (1) visual grounding for schematic entities, with 109.9K grounded instances aligning 423.4K diagram semantic labels to their visual regions; (2) diagram-to-graph reasoning, understanding topological relationships among diagram elements; (3) geometric reasoning, constructing layout-dependent weights for each connection; and (4) tool-augmented agentic reasoning for visual search, invoking external tools to accomplish (1)-(3). Our results reveal substantial gaps in how current LMMs interpret schematic engineering artifacts, including unreliable fine-grained grounding, brittle layout-to-graph parsing, inconsistent global connectivity reasoning, and inefficient visual exploration.


[43] The Geometry of Compromise: Unlocking Generative Capabilities via Controllable Modality Alignment cs.CV | cs.AIPDF

Hongyuan Liu, Qinli Yang, Wen Li, Zhong Zhang, Jiaming Liu

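TL;DR: This paper proposes TPC-CMA, a three-phase curriculum fine-tuning framework that addresses the modality gap in vision-language models such as CLIP. It decomposes the modality gap into a Centroid Gap and a Distribution Gap, shows that the Distribution Gap is the key predictor of cross-modal task performance, and during fine-tuning jointly reduces the centroid offset and reshapes the distributional structure, significantly improving cross-modal alignment.

Details

Motivation: Image and text representations in existing vision-language models remain geometrically separated in the shared embedding space (the modality gap), which limits tasks that require cross-modal interchangeability, such as captioning and joint clustering. Existing post-processing methods mainly reduce the global centroid offset and leave the underlying distributional mismatch unresolved.

Result: The method significantly reduces the modality gap. With target alignment strength α_target = 0.05, the gap shrinks by 66.6% at a cost of only a 4.84% accuracy drop. Under stronger alignment (α_target = 0.5), the gap shrinks by 82.3%, clustering ARI improves from 0.318 to 0.516, and captioning CIDEr increases by 57.1% over the original model.

Insight: The novelty is the decomposition of the modality gap into a Centroid Gap and a Distribution Gap, with evidence that the Distribution Gap is the more reliable predictor of cross-modal task quality. The TPC-CMA framework uses a three-phase curriculum with gradient-aware scheduling to introduce alignment progressively during fine-tuning, which gives stable optimization and effective reshaping of the cross-modal distributions.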

Abstract: Vision-Language Models (VLMs) such as CLIP learn a shared embedding space for images and text, yet their representations remain geometrically separated, a phenomenon known as the modality gap. This gap limits tasks requiring cross-modal interchangeability, such as captioning and joint clustering. Existing post-processing approaches can partially improve cross-modal compatibility; however, we show through geometric analysis that they primarily reduce the global centroid offset while leaving the underlying distributional mismatch intact. We decompose the modality gap into a Centroid Gap and a Distribution Gap, and demonstrate that the Distribution Gap is the true predictor of cross-modal task quality ($R^2 = 0.986$), whereas the commonly used Raw Gap is misleading ($R^2 = 0.691$). Motivated by this observation, we propose TPC-CMA (Three-Phase Curriculum for Cross-Modal Alignment), a fine-tuning framework that explicitly reduces both components. The proposed CMA jointly mitigates centroid offsets and reshapes the distributional structure, while a three-phase curriculum with gradient-aware scheduling progressively introduces alignment during training to enable stable optimization. Experiments demonstrate that our method significantly improves cross-modal alignment. With $\alpha_{\text{target}} = 0.05$, the modality gap is reduced by 66.6% with only 4.84% accuracy drop. Under stronger alignment ($\alpha_{\text{target}} = 0.5$), the gap is reduced by 82.3%, clustering ARI improves from 0.318 to 0.516, and captioning CIDEr increases by 57.1% over the original model. Our code and pre-trained models will be made publicly available upon acceptance.


[44] Label-efficient underwater species classification with semi-supervised learning on frozen foundation model embeddings cs.CVPDF

Thomas Manuel Rost

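TL;DR: This paper studies semi-supervised learning on frozen foundation model embeddings as a way around the high annotation cost of underwater species classification. Using frozen DINOv3 ViT-B embeddings, it propagates labels from a small labeled set through nearest-neighbor-based self-training and comes close to a fully supervised baseline on the AQUA20 benchmark.

Details

Motivation: Species classification from underwater imagery is limited by the high cost of expert annotation, and supervised models transfer poorly to new conditions. The paper explores whether semi-supervised methods on frozen foundation model embeddings can close the annotation gap with minimal labeling effort.

Result: On AQUA20 (20 marine species), with fewer than 5% of the training labels, self-training on frozen embeddings closes much of the gap to a fully supervised ConvNeXt baseline trained on the entire labeled dataset. At full supervision the gap narrows to a few percentage points, and several species exceed the supervised baseline. ROC-AUC shows high class separability in the embedding space even at extreme label scarcity.

Insight: The novelty is in combining frozen foundation model embeddings (no fine-tuning) with nearest-neighbor-based self-training to obtain a label-efficient classifier that needs no training, no domain-specific data engineering, and no underwater-adapted model, giving an immediately deployable baseline. Viewed objectively, the work shows that embeddings from pretrained vision foundation models capture strong discriminative structure in few-label settings.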

Abstract: Automated species classification from underwater imagery is bottlenecked by the cost of expert annotation, and supervised models trained on one dataset rarely transfer to new conditions. We investigate whether semi-supervised methods operating on frozen foundation model embeddings can close this annotation gap with minimal labeling effort. Using DINOv3 ViT-B embeddings with no fine-tuning, we propagate a small set of labeled seeds through unlabeled data via nearest-neighbor-based self-training and evaluate on the AQUA20 benchmark (20 marine species). With fewer than 5% of the training labels, self-training on frozen embeddings closes much of the gap to a fully supervised ConvNeXt baseline trained on the entire labeled dataset; at full supervision, the gap narrows to a few percentage points, with several species exceeding the supervised baseline. Class separability in the embedding space, measured by ROC-AUC, is high even at extreme label scarcity, indicating that the frozen representations capture discriminative structure well before decision boundaries can be reliably estimated. Our approach requires no training, no domain-specific data engineering, and no underwater-adapted models, establishing a practical, immediately deployable baseline for label-efficient marine species recognition. All results are reported on the held-out test set over 100 random seed initializations.
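Nearest-neighbor-based self-training on fixed embeddings can be sketched in a few lines. This is an illustrative variant only: the cosine similarity measure, the `min_sim` acceptance threshold, the round limit, and the names `cosine` and `self_train` are assumptions, since the abstract does not specify the paper's exact propagation rule.

```python
def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


def self_train(embeddings, seed_labels, min_sim=0.8, max_rounds=10):
    """Propagate seed labels to unlabeled points via their nearest labeled neighbor.

    embeddings: list of vectors (frozen, never updated).
    seed_labels: dict mapping index -> label for the few annotated points.
    """
    labels = dict(seed_labels)
    for _ in range(max_rounds):
        newly_labeled = {}
        for i, emb in enumerate(embeddings):
            if i in labels:
                continue
            nearest = max(labels, key=lambda k: cosine(emb, embeddings[k]))
            # Only adopt the neighbor's label when the match is confident enough.
            if cosine(emb, embeddings[nearest]) >= min_sim:
                newly_labeled[i] = labels[nearest]
        if not newly_labeled:
            break
        labels.update(newly_labeled)
    return labels
```

Because pseudo-labeled points join the labeled pool after each round, a point that is too far from every seed can still be reached through intermediate neighbors in a later round, and no gradient step is ever taken on the embeddings.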


[45] VADMamba++: Efficient Video Anomaly Detection via Hybrid Modeling in Grayscale Space cs.CVPDF

Jihao Lyu, Minghua Zhao, Jing Hu, Yifei Chen, Shuangli Du

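TL;DR: VADMamba++ is an efficient video anomaly detection method built on a Gray-to-RGB paradigm: it learns a single-channel to three-channel reconstruction mapping, needs no auxiliary inputs such as optical flow, and is designed for a single proxy task. Forcing the model to infer color appearance from grayscale structure exposes anomalies through dual inconsistencies between structural and chromatic cues. The method uses a hybrid backbone that integrates Mamba, CNN, and Transformer modules, and an intra-task fusion scoring strategy that combines explicit future-frame prediction errors with implicit quantized feature errors to improve accuracy.

Details

Motivation: VADMamba first brought Mamba to video anomaly detection and achieved high accuracy and fast inference through hybrid proxy tasks, but its heavy reliance on optical flow as auxiliary input and on inter-task fusion scoring limits its applicability in single-proxy-task settings. VADMamba++ aims to remove these limitations with an efficient method that needs no auxiliary inputs and serves a single proxy task.

Result: Extensive experiments on three benchmark datasets show that VADMamba++ outperforms state-of-the-art methods in both performance and efficiency, especially under a strict single-task setting with only frame-level inputs.

Insight: The contributions are: 1) a single-channel to three-channel reconstruction mapping under the Gray-to-RGB paradigm, which forces the model to infer color from structure and detects anomalies through structure-color inconsistencies; 2) a hybrid backbone integrating Mamba, CNN, and Transformer modules to capture diverse normal patterns while suppressing the appearance of anomalies; 3) an intra-task fusion scoring strategy that combines explicit and implicit errors. Viewed objectively, simplifying the input (grayscale frames only) and the task setup (a single proxy task) keeps the method efficient while improving sensitivity and accuracy, making it more practical for real-world use.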

Abstract: VADMamba pioneered the introduction of Mamba to Video Anomaly Detection (VAD), achieving high accuracy and fast inference through hybrid proxy tasks. Nevertheless, its heavy reliance on optical flow as auxiliary input and inter-task fusion scoring constrains its applicability in single-proxy-task settings. In this paper, we introduce VADMamba++, an efficient VAD method based on the Gray-to-RGB paradigm that enforces a Single-Channel to Three-Channel reconstruction mapping, designed for a single proxy task and operating without auxiliary inputs. This paradigm compels inferring color appearances from grayscale structures, allowing anomalies to be more effectively revealed through dual inconsistencies between structure and chromatic cues. Specifically, VADMamba++ reconstructs grayscale frames into the RGB space to simultaneously discriminate structural geometry and chromatic fidelity, thereby enhancing sensitivity to explicit visual anomalies. We further design a hybrid modeling backbone that integrates Mamba, CNN, and Transformer modules to capture diverse normal patterns while suppressing the appearance of anomalies. Furthermore, an intra-task fusion scoring strategy integrates explicit future-frame prediction errors with implicit quantized feature errors, further improving accuracy under a single task setting. Extensive experiments on three benchmark datasets demonstrate that VADMamba++ outperforms state-of-the-art methods in both performance and efficiency, especially under a strict single-task setting with only frame-level inputs.
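One common way to fuse two per-frame error signals into a single anomaly score is to normalize each to a shared range and take a weighted sum. The sketch below shows that generic recipe only. The min-max normalization, the `weight` parameter, and the function names are assumptions; the abstract does not state how VADMamba++ actually combines its prediction and quantized feature errors.

```python
def min_max(values):
    """Rescale a sequence to [0, 1]; a constant sequence maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fused_anomaly_scores(pred_errors, feat_errors, weight=0.5):
    """Per-frame score: weighted sum of normalized prediction and feature errors."""
    pred = min_max(pred_errors)
    feat = min_max(feat_errors)
    return [weight * p + (1.0 - weight) * f for p, f in zip(pred, feat)]
```

Normalizing first keeps one error type from dominating the score simply because it is measured on a larger numeric scale.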


[46] Dynamic Graph Neural Network with Adaptive Features Selection for RGB-D Based Indoor Scene Recognition cs.CVPDF

Qiong Liu, Ruofei Xiong, Xingzhen Chen, Muyao Peng, You Yang

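TL;DR: This paper proposes a dynamic graph neural network for RGB-D indoor scene recognition. An adaptive node selection mechanism extracts key local features from the RGB and depth modalities, a dynamic graph models the relations among objects and the scene, and the optimized features are fused for recognition.

Details

Motivation: Adaptive selection and effective use of key local features from the RGB and depth modalities remain an open problem in RGB-D indoor scene recognition; existing methods do not handle feature selection and exploitation adequately.

Result: Experiments on the public SUN RGB-D and NYU Depth v2 datasets show that the method outperforms current state-of-the-art methods.

Insight: The contributions are an adaptive node selection mechanism that extracts key features from both modalities and a dynamic graph that is updated according to attention weights; both ideas are transferable to multimodal fusion and graph neural network design.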

Abstract: Multi-modality of color and depth, i.e., RGB-D, is of great importance in recent research on indoor scene recognition. In this kind of data representation, the depth map describes the 3D structure of scenes and the geometric relations among objects. Previous works showed that local features of both modalities are vital for improving recognition accuracy. However, the problem of adaptive selection and effective exploitation of these key local features remains open in this field. In this paper, a dynamic graph model is proposed with an adaptive node selection mechanism to solve the above problem. In this model, a dynamic graph is built to model the relations among objects and the scene, and a method of adaptive node selection is proposed to take key local features from both the RGB and depth modalities for graph modeling. After that, these nodes are grouped into three different levels, representing near or far relations among objects. Moreover, the graph model is updated dynamically according to attention weights. Finally, the updated and optimized features of the RGB and depth modalities are fused together for indoor scene recognition. Experiments are performed on public datasets including SUN RGB-D and NYU Depth v2. Extensive results demonstrate that our method has superior performance compared to state-of-the-art methods, and show that the proposed method is able to exploit crucial local features from both the RGB and depth modalities.


[47] mmAnomaly: Leveraging Visual Context for Robust Anomaly Detection in the Non-Visual World with mmWave Radar cs.CV | eess.SPPDF

Tarik Reza Toha, Shao-Jung Lu, Mahathir Monjur, Shahriar Nirjon

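TL;DR: mmAnomaly is a multi-modal anomaly detection framework that combines mmWave radar with RGBD visual input to make detection in non-visual scenarios more robust. A fast ResNet-based classifier extracts visual context such as scene geometry and material properties, a conditional latent diffusion model synthesizes the mmWave spectrum expected for that context, and a dual-input comparison module localizes anomalies from spatial deviations between the real and generated spectra.

Details

Motivation: mmWave radar enables human sensing in non-visual scenarios (for example, through clothing or certain walls) where conventional cameras fail because of occlusion or privacy limits. Its signals, however, are affected by material properties, clutter, and multipath interference, which produce complex non-Gaussian distortions. Existing methods lack contextual awareness and often misclassify benign signal variations as anomalies.

Result: Evaluated on two multi-modal datasets across three applications (concealed weapon localization, through-wall intruder localization, and through-wall fall localization), the system reaches up to 94% F1 score with sub-meter localization error and generalizes robustly across clothing, occlusions, and cluttered environments.

Insight: The novelty is in using visual context (RGBD) to guide synthesis of the expected mmWave spectrum: a conditional latent diffusion model generates the expected signal, and a dual-input comparison detects spatial deviations. This improves the accuracy and interpretability of anomaly detection and addresses false alarms caused by environmental effects on mmWave signals.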

Abstract: mmWave radar enables human sensing in non-visual scenarios (e.g., through clothing or certain types of walls) where traditional cameras fail due to occlusion or privacy limitations. However, robust anomaly detection with mmWave remains challenging, as signal reflections are influenced by material properties, clutter, and multipath interference, producing complex, non-Gaussian distortions. Existing methods lack contextual awareness and misclassify benign signal variations as anomalies. We present mmAnomaly, a multi-modal anomaly detection framework that combines mmWave radar with RGBD input to incorporate visual context. Our system extracts semantic cues, such as scene geometry and material properties, using a fast ResNet-based classifier, and uses a conditional latent diffusion model to synthesize the expected mmWave spectrum for the given visual context. A dual-input comparison module then identifies spatial deviations between real and generated spectra to localize anomalies. We evaluate mmAnomaly on two multi-modal datasets across three applications: concealed weapon localization, through-wall intruder localization, and through-wall fall localization. The system achieves up to 94% F1 score and sub-meter localization error, demonstrating robust generalization across clothing, occlusions, and cluttered environments. These results establish mmAnomaly as an accurate and interpretable framework for context-aware anomaly detection in mmWave sensing.
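The idea of comparing an observed spectrum against an expected one to localize deviations can be shown with a simple residual map. This is only a conceptual sketch: the paper's dual-input comparison module is not described in the abstract and may well be learned, so the pointwise absolute difference, the fixed `threshold`, and the function names are all assumptions.

```python
def deviation_map(real, expected):
    """Cell-wise absolute difference between two equally sized 2D spectra."""
    return [
        [abs(r - e) for r, e in zip(real_row, expected_row)]
        for real_row, expected_row in zip(real, expected)
    ]


def localize_anomalies(real, expected, threshold):
    """Return (row, col) cells where the observed spectrum departs from the expected one."""
    deviations = deviation_map(real, expected)
    return [
        (i, j)
        for i, row in enumerate(deviations)
        for j, d in enumerate(row)
        if d > threshold
    ]
```

In the framework described above, `expected` would come from the diffusion model conditioned on visual context, so benign variation explained by the scene does not show up as a large residual.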


[48] Mine-JEPA: In-Domain Self-Supervised Learning for Mine-Like Object Classification in Side-Scan Sonar cs.CV

Taeyoun Kwon, Youngwon Choi, Hyeonyu Kim, Myeongkyun Cho, Junhyeok Choi

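TL;DR: This paper presents Mine-JEPA, the first in-domain self-supervised learning (SSL) framework for mine classification in side-scan sonar (SSS). It pretrains on only 1,170 unlabeled sonar images using SIGReg, a regularization-based SSL loss. On the binary task (mine vs. non-mine), it outperforms a fine-tuned DINOv3, a foundation model pretrained on 1.7 billion images, and it also performs well on the 3-class task. The study further finds that applying additional in-domain SSL to a strong pretrained model actually hurts performance.

Details

Motivation: SSS mine classification faces extreme data scarcity and a large domain gap from natural images. Although SSL and general-purpose vision foundation models perform well in other domains, their use in SSS remains largely unexplored. The paper explores an SSL approach designed specifically for the SSS domain.

Result: On the binary task, Mine-JEPA reaches an F1 score of 0.935, ahead of fine-tuned DINOv3 (0.922). On the 3-class task with synthetic data augmentation, Mine-JEPA reaches 0.820, again ahead of fine-tuned DINOv3 (0.810). With a compact ViT-Tiny backbone, Mine-JEPA achieves competitive performance with 4x fewer parameters.

Insight: The core contribution is the first in-domain SSL pipeline for SSS mine classification, showing that in data-scarce domains carefully designed in-domain SSL can be a more effective alternative to large general-purpose foundation models. A notable finding is that additional in-domain SSL pretraining on an already strong foundation model (such as DINOv3) degrades performance substantially (by 10-13 percentage points), which challenges the intuition that stronger pretrained models always benefit from domain adaptation.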

Abstract: Side-scan sonar (SSS) mine classification is a challenging maritime vision problem characterized by extreme data scarcity and a large domain gap from natural images. While self-supervised learning (SSL) and general-purpose vision foundation models have shown strong performance in general vision and several specialized domains, their use in SSS remains largely unexplored. We present Mine-JEPA, the first in-domain SSL pipeline for SSS mine classification, using SIGReg, a regularization-based SSL loss, to pretrain on only 1,170 unlabeled sonar images. In the binary mine vs. non-mine setting, Mine-JEPA achieves an F1 score of 0.935, outperforming fine-tuned DINOv3 (0.922), a foundation model pretrained on 1.7B images. For 3-class mine-like object classification, Mine-JEPA reaches 0.820 with synthetic data augmentation, again outperforming fine-tuned DINOv3 (0.810). We further observe that applying in-domain SSL to foundation models degrades performance by 10–13 percentage points, suggesting that stronger pretrained models do not always benefit from additional domain adaptation. In addition, Mine-JEPA with a compact ViT-Tiny backbone achieves competitive performance while using 4x fewer parameters than DINOv3. These results suggest that carefully designed in-domain self-supervised learning is a viable alternative to much larger foundation models in data-scarce maritime sonar imagery.


[49] Advancing Complex Video Object Segmentation via Tracking-Enhanced Prompt: The 1st Winner for 5th PVUW MOSE Challenge cs.CV

Jinrong Zhang, Canyang Wu, Xusheng He, Weili Guan, Jianlong Wu

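TL;DR: This paper proposes TEP (Tracking-Enhanced Prompts), a training-free method that improves recognition of tiny and semantic-dominated objects in complex video object segmentation. It combines external tracking models with multimodal large language models to generate tracking-enhanced prompts, compensating for the limited understanding that SAM3, the current SOTA model, has of such targets. The method took first place in the Complex Video Object Segmentation track of the PVUW Challenge 2026.

Details

Motivation: SAM3, the current state-of-the-art video object segmentation model, performs well on conventional targets but underperforms on tiny and semantic-dominated objects, mainly because it does not sufficiently understand these specific target types.

Result: On the test set of the PVUW Challenge 2026 Complex Video Object Segmentation track, the method scored 56.91% and ranked first.

Insight: The novelty is a training-free prompt enhancement strategy that integrates external tracking models and multimodal large language models to produce more effective prompts, improving the model's adaptability to complex targets. It offers a lightweight way to improve foundation segmentation models in specific scenarios.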

Abstract: In the Complex Video Object Segmentation task, researchers are required to track and segment specific targets within cluttered environments, which rigorously tests a method’s capability for target comprehension and environmental adaptability. Although SAM3, the current state-of-the-art solution, exhibits unparalleled segmentation performance and robustness on conventional targets, it underperforms on tiny and semantic-dominated objects. The root cause of this limitation lies in SAM3’s insufficient comprehension of these specific target types. To address this issue, we propose TEP: Advancing Complex Video Object Segmentation via Tracking-Enhanced Prompts. As a training-free approach, TEP leverages external tracking models and Multimodal Large Language Models to introduce tracking-enhanced prompts, thereby alleviating the difficulty SAM3 faces in understanding these challenging targets. Our method achieved first place (56.91%) on the test set of the PVUW Challenge 2026: Complex Video Object Segmentation Track.


[50] VLM-in-the-Loop: A Plug-In Quality Assurance Module for ECG Digitization Pipelines cs.CV

Jiachen Li, Shihao Li, Soovadeep Bakshi, Wei Li, Dongmei Chen

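TL;DR: This paper proposes VLM-in-the-Loop, a plug-in quality assurance module for ECG digitization pipelines. The module wraps any digitization backend through a standardized interface and adds closed-loop feedback from a vision-language model (VLM). Its core mechanism is tool grounding: anchoring the VLM's assessment in quantitative evidence from domain-specific signal analysis tools.

Details

Motivation: Existing ECG digitization methods perform poorly on real-world images despite strong benchmark numbers, so a way to improve output quality without modifying the underlying digitizer is needed.

Result: In a controlled ablation on 200 records with paired ground truth, tool grounding raises verdict consistency from 71% to 89% and doubles fidelity separation. Deployment across four different backends improves each of them; for example, valid leads per image on Open-ECG-Digitizer rise from 2.5 to 5.8. On 428 real clinical HCM images, the integrated system reaches 98.0% Excellent quality.

Insight: The novelty lies in the plug-in modular architecture and the tool-grounding mechanism. The former allows flexible integration into existing pipelines; the latter uses domain tools to supply objective evidence, making VLM assessments more reliable and consistent. The design is domain-parametric and may extend to other domains where quality criteria are objectively measurable.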

Abstract: ECG digitization could unlock billions of archived clinical records, yet existing methods collapse on real-world images despite strong benchmark numbers. We introduce VLM-in-the-Loop, a plug-in quality assurance module that wraps any digitization backend with closed-loop VLM feedback via a standardized interface, requiring no modification to the underlying digitizer. The core mechanism is tool grounding: anchoring VLM assessment in quantitative evidence from domain-specific signal analysis tools. In a controlled ablation on 200 records with paired ground truth, tool grounding raises verdict consistency from 71% to 89% and doubles fidelity separation ($\Delta$PCC 0.03 $\rightarrow$ 0.08), with the effect replicating across three VLMs (Claude Opus4, GPT-4o, Gemini2.5 Pro), confirming a pattern-level rather than model-specific gain. Deployed across four backends, the module improves every one: 29.4% of borderline leads improved on our pipeline; 41.2% of failed limb leads recovered on ECG-Digitiser; valid leads per image doubled on Open-ECG-Digitizer (2.5 $\rightarrow$ 5.8). On 428 real clinical HCM images, the integrated system reaches 98.0% Excellent quality. Both the plug-in architecture and tool-grounding mechanism are domain-parametric, suggesting broader applicability wherever quality criteria are objectively measurable.
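
The abstract does not expand PCC. Assuming it denotes the Pearson correlation coefficient between a digitized lead and its ground-truth signal, the quantity could be computed as in the sketch below. `fidelity_separation` encodes one plausible reading of ΔPCC, namely the gap in mean PCC between leads the VLM accepts and leads it rejects; that reading and both function names are assumptions, not definitions taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat signal carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def fidelity_separation(accepted_pcc, rejected_pcc):
    """Hypothetical reading of delta-PCC: mean PCC of accepted leads minus rejected leads."""
    mean_accepted = sum(accepted_pcc) / len(accepted_pcc)
    mean_rejected = sum(rejected_pcc) / len(rejected_pcc)
    return mean_accepted - mean_rejected
```

Under this reading, a larger separation means the verdicts track signal fidelity more closely.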


[51] The 1st Winner for 5th PVUW MeViS-Text Challenge: Strong MLLMs Meet SAM3 for Referring Video Object Segmentation cs.CV

Xusheng He, Canyang Wu, Jinrong Zhang, Weili Guan, Jianlong Wu

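TL;DR: This paper describes the winning solution to the 5th PVUW MeViS-Text Challenge. For referring video object segmentation under motion-centric language expressions, it builds a fully training-free three-stage pipeline. The pipeline combines strong multimodal large language models (such as Gemini-3.1 Pro and Qwen3.5-Plus) with the SAM3 segmentation model, and achieves high-accuracy segmentation through target decomposition, seed mask generation and propagation, and prediction refinement.

Details

Motivation: The task is referring video object segmentation under motion-centric language expressions, where the model must jointly understand appearance, temporal behavior, and object interactions. The goal is a strong, general solution that needs no task-specific fine-tuning.

Result: The method ranks first on the PVUW 2026 MeViS-Text test set, with a Final score of 0.909064 and a J&F score of 0.7897.

Insight: The main contribution is a fully training-free, modular three-stage pipeline that pairs strong existing multimodal large language models (for language understanding, target decomposition, and description generation) with SAM3 (for segmentation and tracking), and uses behavior-level verification in a refinement stage to correct errors. It shows how complex multimodal tasks can be solved by composing existing SOTA models rather than training new ones.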

Abstract: This report presents our winning solution to the 5th PVUW MeViS-Text Challenge. The track studies referring video object segmentation under motion-centric language expressions, where the model must jointly understand appearance, temporal behavior, and object interactions. To address this problem, we build a fully training-free pipeline that combines strong multimodal large language models with SAM3. Our method contains three stages. First, Gemini-3.1 Pro decomposes each target event into instance-level grounding targets, selects the frame where the target is most clearly visible, and generates a discriminative description. Second, SAM3-agent produces a precise seed mask on the selected frame, and the official SAM3 tracker propagates the mask through the whole video. Third, a refinement stage uses Qwen3.5-Plus and behavior-level verification to correct ambiguous or semantically inconsistent predictions. Without task-specific fine-tuning, our method ranks first on the PVUW 2026 MeViS-Text test set, achieving a Final score of 0.909064 and a J&F score of 0.7897. The code is available at https://github.com/Moujuruo/MeViSv2_Track_Solution_2026.


[52] First Logit Boosting: Visual Grounding Method to Mitigate Object Hallucination in Large Vision-Language Models cs.CV | cs.AI | cs.CL

Jiwoo Ha, Jongwoo Baek, Jinhyun So

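TL;DR: This paper proposes First Logit Boosting (FLB), a training-free technique for mitigating object hallucination in large vision-language models (LVLMs). It stores the logit of the first generated token and reuses it to reinforce visual information in subsequent token predictions, countering the long-term decay of visual information during generation.

Details

Motivation: Existing approaches to mitigating object hallucination, such as retraining or external grounding, suffer from high data costs or structural complexity. Training-free methods such as contrastive decoding are cheaper but suffer from long-term decay: as generation progresses, visual grounding weakens and language priors dominate. FLB aims to address this long-term decay simply and effectively.

Result: Experiments show that FLB significantly reduces object hallucination across tasks, benchmarks, and backbone models, with negligible inference overhead, making it well suited to real-time multimodal systems.

Insight: The novelty is a simple, training-free, very low-overhead technique that keeps the visual information of the first token influencing the whole generation process, and that uses the stabilizing effect of specific tokens such as "The" to suppress hallucinated words. It offers a new perspective and a practical method for mitigating hallucination in LVLMs.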

Abstract: Recent Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across various multimodal tasks that require understanding both visual and linguistic inputs. However, object hallucination (the generation of nonexistent objects in answers) remains a persistent challenge. Although several approaches such as retraining and external grounding methods have been proposed to mitigate this issue, they still suffer from high data costs or structural complexity. Training-free methods such as Contrastive Decoding (CD) are more cost-effective, avoiding additional training or external models, but still suffer from long-term decay, where visual grounding weakens and language priors dominate as the generation progresses. In this paper, we propose First Logit Boosting (FLB), a simple yet effective training-free technique designed to alleviate long-term decay in LVLMs. FLB stores the logit of the first generated token and adds it to subsequent token predictions, effectively mitigating long-term decay of visual information. We observe that FLB (1) sustains the visual information embedded in the first token throughout generation, and (2) suppresses hallucinated words through the stabilizing effect of the "The" token. Experimental results show that FLB significantly reduces object hallucination across various tasks, benchmarks, and backbone models. Notably, it causes negligible inference overhead, making it highly applicable to real-time multimodal systems. Code is available at https://github.com/jiwooha20/FLB
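
The store-and-add idea can be sketched in a few lines. This is a toy illustration rather than the released implementation: the per-step logits are precomputed lists (a real decoder would produce them conditioned on earlier tokens), and the weight `alpha` and the helper names are assumptions introduced here.

```python
def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def boost_logits(first_logits, step_logits, alpha=1.0):
    """Add the stored first-step logits to the current step's logits.

    `alpha` is a hypothetical scaling weight; alpha=0 disables the boost.
    """
    return [s + alpha * f for f, s in zip(first_logits, step_logits)]


def greedy_decode(step_logits_seq, alpha=1.0):
    """Greedy decoding over precomputed per-step logits with first-logit boosting."""
    first_logits = step_logits_seq[0]          # stored once, reused at every later step
    tokens = [argmax(first_logits)]
    for step_logits in step_logits_seq[1:]:
        tokens.append(argmax(boost_logits(first_logits, step_logits, alpha)))
    return tokens


# Toy vocabulary of 3 tokens: at step 2 the unboosted logits prefer token 1.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
without_boost = greedy_decode(toy_logits, alpha=0.0)
with_boost = greedy_decode(toy_logits, alpha=1.0)
```

In the toy run, the stored first-step logits shift the second prediction back toward the token favored at step one, which is the mechanism the abstract credits with sustaining visual information.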


[53] Automated Detection of Multiple Sclerosis Lesions on 7-tesla MRI Using U-net and Transformer-based Segmentation cs.CV | cs.LG

Michael Maynord, Minghui Liu, Cornelia Fermüller, Seongjin Choi, Yuxin Zeng

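TL;DR: This paper studies automated segmentation of multiple sclerosis (MS) white matter lesions (WML) on 7T MRI using U-net and Transformer architectures. The authors analyze 7T FLAIR scans, produce reference WML masks with expert manual revision, and compare classical tools (LST-LPA and LST-AI) with Transformer-based models (3D UNETR and SegFormer) at several resolutions. At native 0.5x0.5x0.5 mm^3 resolution, the Transformer-based models achieve overlap competitive with LST-AI while detecting more small lesions, at the cost of boundary variability and false positives. Performance degrades with image downsampling, highlighting the importance of native 7T resolution for small-lesion detection. The authors release their trained models to support automated lesion quantification in ultra-high field MS research.

Details

Motivation: Ultra-high field 7T MRI improves visualization of MS white matter lesions, but its contrast and artifacts differ markedly from 1.5-3T imaging, so widely used automated segmentation tools may not transfer directly. Automated segmentation methods tailored to 7T MRI are therefore needed.

Result: On the test set at native 0.5x0.5x0.5 mm^3 resolution, the best Transformer model (SegFormer) achieves a voxel-wise Dice of 0.61 and a lesion-wise Dice of 0.20, improving on the classical LST-LPA tool (Dice 0.39, lesion-wise Dice 0.02). Performance drops on downsampled images, indicating that native 7T resolution is critical for small-lesion detection.

Insight: The contribution is applying Transformer-based models (such as 3D UNETR and SegFormer) to MS lesion segmentation on 7T MRI and showing their advantage in detecting small lesions; the released models provide a reproducible resource. More broadly, the study underscores the need for segmentation tools tailored to a specific field strength (such as 7T) and the key role of high-resolution data in small-lesion detection.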

Abstract: Ultra-high field 7-tesla (7T) MRI improves visualization of multiple sclerosis (MS) white matter lesions (WML) but differs sufficiently in contrast and artifacts from 1.5-3T imaging that widely used automated segmentation tools may not translate directly. We analyzed 7T FLAIR scans and generated reference WML masks from Lesion Segmentation Tool (LST) outputs followed by expert manual revision. As external comparators, we applied LST-LPA and the more recent LST-AI ensemble, both originally developed on lower-field data. We then trained 3D UNETR and SegFormer transformer-based models on 7T FLAIR at multiple resolutions (0.5x0.5x0.5 mm^3, 1.0x1.0x1.0 mm^3, and 1.5x1.5x2.0 mm^3) and evaluated all methods using voxel-wise and lesion-wise metrics from the BraTS 2023 framework. On the held-out test set at native 0.5x0.5x0.5 mm^3 resolution, 7T-trained transformers achieved competitive overlap with LST-AI while recovering additional small lesions that were missed by classical methods, at the cost of some boundary variability and occasional artifact-related false positives. On a held-out 7T test set, our best transformer model (SegFormer) achieved a voxel-wise Dice of 0.61 and lesion-wise Dice of 0.20, improving on the classical LST-LPA tool (Dice 0.39, lesion-wise Dice 0.02). Performance decreased for models trained on downsampled images, underscoring the value of native 7T resolution for small-lesion detection. By releasing our 7T-trained models, we aim to provide a reproducible, ready-to-use resource for automated lesion quantification in ultra-high field MS research (https://github.com/maynord/7T-MS-lesion-segmentation).
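
For readers unfamiliar with the metric, voxel-wise Dice is twice the overlap between predicted and reference masks divided by their combined size. The sketch below works on flattened binary masks and covers only the voxel-wise variant; the lesion-wise metric from the BraTS 2023 framework involves per-lesion matching and is not reproduced here. The convention of returning 1.0 for two empty masks is an assumption of this sketch.

```python
def dice(pred, ref):
    """Voxel-wise Dice between two flattened binary masks of equal length."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * overlap / total


# One of two predicted voxels overlaps the single reference voxel.
example_score = dice([1, 1, 0, 0], [1, 0, 0, 0])  # 2*1 / (2+1)
```

Small lesions contribute few voxels, so a model can miss many of them and still post a respectable voxel-wise score, which is one reason lesion-wise metrics are reported alongside it.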


[54] All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models cs.CV

Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Peter Tu

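TL;DR: This paper shows that reinforcement learning (GRPO in particular), when used to improve the reasoning of vision-language models, drives models into deep but narrow reasoning patterns and causes diversity collapse. To address this, it proposes Multi-Group Policy Optimization (MUPO), which incentivizes divergent thinking, and validates it on benchmarks.

Details

Motivation: The paper investigates the mechanisms behind, and the limits of, RL's effectiveness for vision-language models. In particular, it finds that GRPO causes models to converge prematurely on a limited set of reasoning strategies and discard most potential alternatives, leading to local optima and poor scalability.

Result: On established benchmarks, the proposed MUPO is shown to incentivize divergent thinking effectively, addressing GRPO's diversity collapse and improving model performance.

Insight: The contribution is identifying diversity collapse in RL training and proposing a simple, effective multi-group policy optimization framework that maintains and encourages diversity of reasoning paths, which is useful for avoiding local optima and improving generalization.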

Abstract: Recent studies have demonstrated that Reinforcement Learning (RL), notably Group Relative Policy Optimization (GRPO), can intrinsically elicit and enhance the reasoning capabilities of Vision-Language Models (VLMs). However, despite the promise, the underlying mechanisms that drive the effectiveness of RL models as well as their limitations remain underexplored. In this paper, we highlight a fundamental behavioral distinction between RL and base models, where the former engage in deeper yet narrower reasoning, while base models, despite being less refined along individual paths, exhibit broader and more diverse thinking patterns. Through further analysis of training dynamics, we show that GRPO is prone to diversity collapse, causing models to prematurely converge to a limited subset of reasoning strategies while discarding the majority of potential alternatives, leading to local optima and poor scalability. To address this, we propose Multi-Group Policy Optimization (MUPO), a simple yet effective approach designed to incentivize divergent thinking across multiple solutions, and demonstrate its effectiveness on established benchmarks. Project page: https://xytian1008.github.io/MUPO/
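
As background, GRPO is commonly described as scoring each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group-relative normalization only; it does not implement MUPO, whose details the abstract does not give. The use of the population standard deviation and the `eps` constant are assumptions, since implementations differ on both.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group.

    `rewards` holds the scalar rewards of all responses sampled for one prompt.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses, two judged correct (reward 1) and two incorrect (reward 0).
mixed_group = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# A group where every response earns the same reward yields no learning signal.
uniform_group = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Because only the relative standing within one group matters, responses that follow an already-rewarded strategy keep being reinforced, which is consistent with the premature convergence the abstract describes.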


[55] A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation cs.CV | cs.AI | cs.LG

Yabin Zhang, Chong Wang, Yunhe Gao, Jiaming Liu, Maya Varma

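TL;DR: This paper presents CheXOne, a reasoning-enabled vision-language foundation model for chest X-ray (CXR) interpretation. Besides diagnostic predictions, the model generates explicit, clinically grounded reasoning traces that connect visual evidence, radiographic findings, and predictions. It is trained on 14.7 million instruction and reasoning samples with a two-stage framework that combines instruction tuning and reinforcement learning, and it outperforms existing medical and general-domain foundation models across 17 zero-shot evaluation settings.

Details

Motivation: Existing AI systems for CXR interpretation usually produce only final predictions, without making explicit how visual evidence is translated into radiographic findings and diagnostic predictions. The paper aims to improve model performance, interpretability, and clinical utility.

Result: Across 17 zero-shot evaluation settings covering visual question answering, report generation, visual grounding, and reasoning assessment, CheXOne outperforms existing medical and general-domain foundation models and performs strongly on independent public benchmarks. A clinical reader study shows that CheXOne-drafted reports are comparable to or better than resident-written reports in 55% of cases, while effectively addressing clinical indications and improving the efficiency of report writing and CXR interpretation.

Insight: The core contribution is combining explicit reasoning-trace generation with diagnostic prediction, with reasoning quality improved through a two-stage training framework (instruction tuning plus reinforcement learning). This brings performance gains and causal explanations to AI-assisted CXR interpretation and strengthens interpretability and clinical trustworthiness.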

Abstract: Chest X-rays (CXRs) are among the most frequently performed imaging examinations worldwide, yet rising imaging volumes increase radiologist workload and the risk of diagnostic errors. Although artificial intelligence (AI) systems have shown promise for CXR interpretation, most generate only final predictions, without making explicit how visual evidence is translated into radiographic findings and diagnostic predictions. We present CheXOne, a reasoning-enabled vision-language model for CXR interpretation. CheXOne jointly generates diagnostic predictions and explicit, clinically grounded reasoning traces that connect visual evidence, radiographic findings, and these predictions. The model is trained on 14.7 million instruction and reasoning samples curated from 30 public datasets spanning 36 CXR interpretation tasks, using a two-stage framework that combines instruction tuning with reinforcement learning to improve reasoning quality. We evaluate CheXOne in zero-shot settings across visual question answering, report generation, visual grounding and reasoning assessment, covering 17 evaluation settings. CheXOne outperforms existing medical and general-domain foundation models and achieves strong performance on independent public benchmarks. A clinical reader study demonstrates that CheXOne-drafted reports are comparable to or better than resident-written reports in 55% of cases, while effectively addressing clinical indications and enhancing both report writing and CXR interpretation efficiency. Further analyses involving radiologists reveal that the generated reasoning traces show high clinical factuality and provide causal support for the final predictions, offering a plausible explanation for the performance gains. These results suggest that explicit reasoning can improve model performance, interpretability and clinical utility in AI-assisted CXR interpretation.


[56] PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training cs.CV

Weifu Fu, Jinyang Li, Bin-Bin Gao, Jialin Li, Yuhuan Lin

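TL;DR: PET-DINO is a universal object detector that supports both text and visual prompts. It targets three problems in open-set object detection: the difficulty of aligning text representations with complex visual concepts, the scarcity of image-text pairs for rare categories, and the complex multi-stage optimization of existing visual-prompted methods. It unifies visual cues through an Alignment-Friendly Visual Prompt Generation module and two prompt-enriched training strategies (Intra-Batch Parallel Prompting and Dynamic Memory-Driven Prompting), improving zero-shot detection under multiple prompt protocols.

Details

Motivation: Open-set object detection struggles to recognize novel categories beyond fixed classes: text representations are hard to align with complex visual concepts and image-text pairs for rare categories are scarce, which leads to poor performance in specialized domains or on complex objects. Existing visual-prompted methods often involve complex multi-modal designs and multi-stage optimization, which prolongs the development cycle, and data-driven training strategies remain underexplored.

Result: Comprehensive experiments show that PET-DINO achieves competitive zero-shot object detection under various prompt-based detection protocols. The authors attribute these strengths to the inheritance-based philosophy and the prompt-enriched training strategies, which play a key role in building an effective generic object detector.

Insight: The contributions are: an Alignment-Friendly Visual Prompt Generation module that builds on an advanced text-prompted detector, addressing the limits of text representation guidance and shortening the development cycle; and two prompt-enriched training strategies (IBP and DMD) that model multiple prompt routes simultaneously at the iteration level and the overall training level, enabling parallel alignment with diverse real-world usage scenarios and unifying visual cues into the detection framework.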

Abstract: Open-Set Object Detection (OSOD) enables recognition of novel categories beyond fixed classes but faces challenges in aligning text representations with complex visual concepts and the scarcity of image-text pairs for rare categories. This results in suboptimal performance in specialized domains or with complex objects. Recent visual-prompted methods partially address these issues but often involve complex multi-modal designs and multi-stage optimizations, prolonging the development cycle. Additionally, effective training strategies for data-driven OSOD models remain largely unexplored. To address these challenges, we propose PET-DINO, a universal detector supporting both text and visual prompts. Our Alignment-Friendly Visual Prompt Generation (AFVPG) module builds upon an advanced text-prompted detector, addressing the limitations of text representation guidance and reducing the development cycle. We introduce two prompt-enriched training strategies: Intra-Batch Parallel Prompting (IBP) at the iteration level and Dynamic Memory-Driven Prompting (DMD) at the overall training level. These strategies enable simultaneous modeling of multiple prompt routes, facilitating parallel alignment with diverse real-world usage scenarios. Comprehensive experiments demonstrate that PET-DINO exhibits competitive zero-shot object detection capabilities across various prompt-based detection protocols. These strengths can be attributed to inheritance-based philosophy and prompt-enriched training strategies, which play a critical role in building an effective generic object detector. Project page: https://fuweifuvtoo.github.io/pet-dino.


[57] RegFormer: Transferable Relational Grounding for Efficient Weakly-Supervised Human-Object Interaction Detection cs.CV

Jihwan Park, Chanhyeong Yang, Jinyoung Park, Taehoon Song, Hyunwoo J. Kim

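TL;DR: RegFormer is an efficient Transformer module for weakly-supervised human-object interaction (HOI) detection. It uses spatially grounded signals to guide reasoning, enabling direct transfer from image-level interaction reasoning to instance-level reasoning without additional training.

Details

Motivation: Existing weakly-supervised HOI detection methods incur high computational cost by enumerating many instance pairs, and false positives from non-interactive combinations hinder accurate instance-level reasoning.

Result: On weakly-supervised HOI detection, RegFormer efficiently learns spatial cues for instance-level interaction reasoning and achieves performance comparable even to fully supervised models.

Insight: The novelty lies in guidance from spatially grounded signals and locality-aware interaction learning, which let the module distinguish humans, objects, and their interactions and transfer seamlessly from image-level to instance-level reasoning. The resulting versatile interaction recognition module improves both efficiency and accuracy.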

Abstract: Weakly-supervised Human-Object Interaction (HOI) detection is essential for scalable scene understanding, as it learns interactions from only image-level annotations. Due to the lack of localization signals, prior works typically rely on an external object detector to generate candidate pairs and then infer their interactions through pairwise reasoning. However, this framework often struggles to scale due to the substantial computational cost incurred by enumerating numerous instance pairs. In addition, it suffers from false positives arising from non-interactive combinations, which hinder accurate instance-level HOI reasoning. To address these issues, we introduce Relational Grounding Transformer (RegFormer), a versatile interaction recognition module for efficient and accurate HOI reasoning. Under image-level supervision, RegFormer leverages spatially grounded signals as guidance for the reasoning process and promotes locality-aware interaction learning. By learning localized interaction cues, our module distinguishes humans, objects, and their interactions, enabling direct transfer from image-level interaction reasoning to precise and efficient instance-level reasoning without additional training. Our extensive experiments and analyses demonstrate that RegFormer effectively learns spatial cues for instance-level interaction reasoning, operates with high efficiency, and even achieves performance comparable to fully supervised models. Our code is available at https://github.com/mlvlab/RegFormer.


[58] Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding cs.CV | cs.AI

Haibo Wang, Zihao Lin, Zhiyang Xu, Lifu Huang

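TL;DR: This paper proposes "Think, Act, Build (TAB)", a dynamic agentic framework for zero-shot 3D visual grounding. It reformulates 3D grounding as a generative 2D-to-3D reconstruction paradigm that operates directly on raw RGB-D video streams. At its core, it uses 2D vision-language models to handle complex semantics and deterministic multi-view geometry to build the 3D structure, removing the dependence on preprocessed 3D point clouds.

Details

Motivation: Existing VLM-based zero-shot 3D visual grounding methods typically rely on preprocessed 3D point clouds and a static workflow, which essentially degrades grounding into proposal matching. The paper bypasses this dependence by decoupling the task: 2D VLMs resolve complex spatial semantics, while multi-view geometry instantiates the 3D structure.

Result: Extensive experiments on ScanRefer and Nr3D show that the framework, relying entirely on open-source models, significantly outperforms previous zero-shot methods and even surpasses fully supervised baselines.

Insight: The main contributions are a dynamic agentic framework that recasts 3D visual grounding as a generative task, and the Semantic-Anchored Geometric Expansion mechanism. The mechanism first anchors the target in a reference video clip and then uses multi-view geometry to propagate its spatial location to unobserved frames. This overcomes the multi-view coverage deficit caused by strict semantic tracking and maps 2D visual cues directly to 3D coordinates.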

Abstract: 3D Visual Grounding (3D-VG) aims to localize objects in 3D scenes via natural language descriptions. While recent advancements leveraging Vision-Language Models (VLMs) have explored zero-shot possibilities, they typically suffer from a static workflow relying on preprocessed 3D point clouds, essentially degrading grounding into proposal matching. To bypass this reliance, our core motivation is to decouple the task: leveraging 2D VLMs to resolve complex spatial semantics, while relying on deterministic multi-view geometry to instantiate the 3D structure. Driven by this insight, we propose “Think, Act, Build (TAB)”, a dynamic agentic framework that reformulates 3D-VG tasks as a generative 2D-to-3D reconstruction paradigm operating directly on raw RGB-D streams. Specifically, guided by a specialized 3D-VG skill, our VLM agent dynamically invokes visual tools to track and reconstruct the target across 2D frames. Crucially, to overcome the multi-view coverage deficit caused by strict VLM semantic tracking, we introduce the Semantic-Anchored Geometric Expansion, a mechanism that first anchors the target in a reference video clip and then leverages multi-view geometry to propagate its spatial location across unobserved frames. This enables the agent to “Build” the target’s 3D representation by aggregating these multi-view features via camera parameters, directly mapping 2D visual cues to 3D coordinates. Furthermore, to ensure rigorous assessment, we identify flaws such as reference ambiguity and category errors in existing benchmarks and manually refine the incorrect queries. Extensive experiments on ScanRefer and Nr3D demonstrate that our framework, relying entirely on open-source models, significantly outperforms previous zero-shot methods and even surpasses fully supervised baselines.
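
The mapping from 2D cues to 3D coordinates via camera parameters rests on standard pinhole geometry. The sketch below illustrates that general step, not the TAB pipeline itself: it assumes an ideal pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`) and a camera-to-world pose given as a rotation matrix `R` and translation `t`, all of which are illustrative names.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(point, R, t):
    """Apply a camera-to-world pose: world = R @ point + t (R is 3x3, t has length 3)."""
    return tuple(sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))


# A pixel at the principal point lies on the optical axis.
center_point = unproject(320, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
# A pixel 500 px to the right at 2 m depth sits 2 m off-axis when fx = 500.
side_point = unproject(820, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
world_point = camera_to_world(center_point, identity, (1.0, 0.0, 0.0))
```

Unprojecting the masked pixels of each frame and transforming them with that frame's pose places all views in one world frame, where they can be aggregated into a single 3D estimate.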


[59] AceTone: Bridging Words and Colors for Conditional Image Grading cs.CV

Tianren Ma, Mingxiang Liao, Xijin Zhang, Qixiang Ye

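TL;DR: This paper proposes AceTone, the first unified framework for color grading under multimodal conditions (text prompts or reference images). It formulates color grading as a generative color transformation task: a VQ-VAE compresses 3D LUTs into discrete tokens, a vision-language model is trained on the large-scale AceTone-800K dataset to predict LUT tokens, and reinforcement learning then optimizes perceptual fidelity and aesthetics. Experiments show SOTA results on both text-guided and reference-guided tasks.

Details

Motivation: Existing color grading methods rely on patch-wise recoloring or fixed filter banks and struggle to generalize across creative intents or align with human aesthetic preferences. A method is needed that understands multimodal conditions and generates aesthetically consistent color transformations.

Result: AceTone reaches SOTA on text-guided and reference-guided color grading, improving LPIPS by up to 50% over existing methods. Human evaluations confirm that its outputs are visually pleasing and stylistically coherent.

Insight: The contributions are: 1) the first unification of multimodal conditioned color grading as a generative LUT prediction task; 2) a VQ-VAE tokenizer that compresses 3D LUTs efficiently; 3) reinforcement learning to align outputs with perceptual and aesthetic objectives; 4) a large-scale dataset to support training. Together these open a path toward language-driven, aesthetic-aligned color editing.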

Abstract: Color affects how we interpret image style and emotion. Previous color grading methods rely on patch-wise recoloring or fixed filter banks, struggling to generalize across creative intents or align with human aesthetic preferences. In this study, we propose AceTone, the first approach that supports multimodal conditioned color grading within a unified framework. AceTone formulates grading as a generative color transformation task, where a model directly produces 3D-LUTs conditioned on text prompts or reference images. We develop a VQ-VAE based tokenizer which compresses a $3\times32^3$ LUT vector to 64 discrete tokens with $ΔE<2$ fidelity. We further build a large-scale dataset, AceTone-800K, and train a vision-language model to predict LUT tokens, followed by reinforcement learning to align outputs with perceptual fidelity and aesthetics. Experiments show that AceTone achieves state-of-the-art performance on both text-guided and reference-guided grading tasks, improving LPIPS by up to 50% over existing methods. Human evaluations confirm that AceTone’s results are visually pleasing and stylistically coherent, demonstrating a new pathway toward language-driven, aesthetic-aligned color grading.
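
To make the sizes concrete: a 3D LUT with 32 grid points per axis stores one RGB triple per grid cell, so a $3\times32^3$ LUT holds 98,304 numbers, and 64 tokens means each token stands for 1,536 of them on average. The sketch below shows what such a LUT does to a pixel. It uses nearest-neighbor lookup for brevity (production pipelines typically interpolate), and the helper names are hypothetical rather than taken from the paper.

```python
def lut_value_count(grid_size):
    """Number of scalars in a 3D LUT with `grid_size` points per axis (3 channels)."""
    return 3 * grid_size ** 3


def identity_lut(grid_size):
    """LUT that maps every color to itself; indexed as lut[r][g][b] -> (r, g, b)."""
    scale = grid_size - 1
    return [[[(r / scale, g / scale, b / scale) for b in range(grid_size)]
             for g in range(grid_size)]
            for r in range(grid_size)]


def apply_lut(pixel, lut):
    """Map an RGB pixel with channels in [0, 1] through the LUT (nearest grid point)."""
    n = len(lut)
    r, g, b = (min(n - 1, max(0, round(c * (n - 1)))) for c in pixel)
    return lut[r][g][b]


values_in_lut = lut_value_count(32)       # 3 * 32**3
values_per_token = values_in_lut // 64    # average payload of one discrete token
graded = apply_lut((1.0, 0.0, 0.0), identity_lut(2))
```

A grading model in this setting only has to emit the LUT; applying it to an image of any resolution is then a per-pixel lookup.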


[60] FreqPhys: Repurposing Implicit Physiological Frequency Prior for Robust Remote Photoplethysmography cs.CV

Wei Qian, Dan Guo, Jinxing Zhou, Bochao Zou, Zitong Yu

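TL;DR: This paper proposes FreqPhys, a frequency-guided remote photoplethysmography (rPPG) framework that explicitly uses physiological frequency priors to make signal recovery more robust. It combines physiological bandpass filtering, spectrum modulation, and adaptive selection, fuses them with time-domain features, and reconstructs high-fidelity rPPG signals through a frequency-aware conditional diffusion process.

Details

Motivation: Existing rPPG methods rely mainly on time-domain modeling and are vulnerable to motion artifacts and illumination fluctuations, so weak physiological cues are drowned out by noise. The paper addresses this by explicitly modeling physiological frequency priors.

Result: Extensive experiments on six benchmarks show that FreqPhys improves significantly over state-of-the-art methods, especially under challenging motion conditions.

Insight: The novelty is the explicit use of physiological frequency priors (such as bandpass filtering and spectrum modulation) to suppress out-of-band interference and in-band noise, combined with cross-domain representation learning and a conditional diffusion process for signal reconstruction. The work underlines the importance of frequency-prior modeling in rPPG.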

Abstract: Remote photoplethysmography (rPPG) enables contactless physiological monitoring by capturing subtle skin-color variations from facial videos. However, most existing methods predominantly rely on time-domain modeling, making them vulnerable to motion artifacts and illumination fluctuations, where weak physiological clues are easily overwhelmed by noise. To address these challenges, we propose FreqPhys, a frequency-guided rPPG framework that explicitly leverages physiological frequency priors for robust signal recovery. Specifically, FreqPhys first applies a Physiological Bandpass Filtering module to suppress out-of-band interference, and then performs Physiological Spectrum Modulation together with adaptive spectral selection to emphasize pulse-related frequency components while suppressing residual in-band noise. A Cross-domain Representation Learning module further fuses these spectral priors with deep time-domain features to capture informative spatial–temporal dependencies. Finally, a frequency-aware conditional diffusion process progressively reconstructs high-fidelity rPPG signals. Extensive experiments on six benchmarks demonstrate that FreqPhys yields significant improvements over state-of-the-art approaches, particularly under challenging motion conditions. It highlights the importance of explicitly modeling physiological frequency priors. The source code will be released.
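
The bandpass step can be illustrated with an ideal frequency-domain filter. This is a minimal sketch, not the paper's Physiological Bandpass Filtering module: it uses a naive DFT from the standard library, and the 0.7 to 4.0 Hz passband (roughly 42 to 240 beats per minute) is a commonly used heart-rate range assumed here, not a value stated in the abstract.

```python
import cmath
import math


def bandpass(signal, fs, low_hz, high_hz):
    """Zero every DFT bin whose frequency lies outside [low_hz, high_hz].

    `fs` is the sampling rate in Hz. A naive O(n^2) DFT keeps the sketch
    dependency-free; it is only suitable for short signals.
    """
    n = len(signal)
    spectrum = [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    for k in range(n):
        freq = min(k, n - k) * fs / n  # bins k and n-k share the same frequency
        if not (low_hz <= freq <= high_hz):
            spectrum[k] = 0
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


# Two seconds of 30 fps samples: a 1.5 Hz "pulse" plus 10 Hz interference.
fs, n = 30.0, 60
pulse = [math.sin(2 * math.pi * 1.5 * t / fs) for t in range(n)]
noisy = [p + math.sin(2 * math.pi * 10.0 * t / fs) for t, p in enumerate(pulse)]
filtered = bandpass(noisy, fs, low_hz=0.7, high_hz=4.0)
```

Out-of-band interference is removed entirely in this idealized case; noise that falls inside the passband survives, which is the part the abstract's spectrum modulation and adaptive selection are meant to handle.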


[61] MATHENA: Mamba-based Architectural Tooth Hierarchical Estimator and Holistic Evaluation Network for Anatomy cs.CV | cs.AI

Kyeonghun Kim, Jaehyung Park, Youngung Han, Anna Jung, Seongbin Park

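TL;DR: This paper proposes MATHENA, a unified framework based on Mamba state space models (SSM) that handles four coordinated tasks on panoramic radiographs (OPGs): tooth detection, caries segmentation, anomaly detection, and dental developmental staging. The framework comprises a multi-resolution detector (MATHE) and a lightweight segmentation/classification network (HENA), and introduces a new benchmark dataset named PARTHENON.

Details

Motivation: Dental diagnosis from panoramic radiographs requires coordinating tooth detection, caries segmentation, anomaly detection, and dental developmental staging. The paper aims to provide an efficient, unified end-to-end solution.

Result: On the proposed PARTHENON benchmark, MATHENA achieves 93.78% mAP@50 for tooth detection, 90.11% Dice for caries segmentation, 88.35% for anomaly detection, and 72.40% accuracy for dental developmental staging.

Insight: The main contributions are: 1) a unified multi-task framework that uses Mamba's linear-complexity SSM for global context modeling; 2) a triple-head architecture in which caries segmentation is trained first as an upstream task to learn shared representations, which are then frozen and reused for downstream fine-tuning and linear probing, enabling stable and efficient learning; 3) PARTHENON, a large-scale multi-task dental imaging benchmark.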

Abstract: Dental diagnosis from Orthopantomograms (OPGs) requires coordination of tooth detection, caries segmentation (CarSeg), anomaly detection (AD), and dental developmental staging (DDS). We propose Mamba-based Architectural Tooth Hierarchical Estimator and Holistic Evaluation Network for Anatomy (MATHENA), a unified framework leveraging Mamba’s linear-complexity State Space Models (SSM) to address all four tasks. MATHENA integrates MATHE, a multi-resolution SSM-driven detector with four-directional Vision State Space (VSS) blocks for O(N) global context modeling, generating per-tooth crops. These crops are processed by HENA, a lightweight Mamba-UNet with a triple-head architecture and Global Context State Token (GCST). In the triple-head architecture, CarSeg is first trained as an upstream task to establish shared representations, which are then frozen and reused for downstream AD fine-tuning and DDS classification via linear probing, enabling stable, efficient learning. We also curate PARTHENON, a benchmark comprising 15,062 annotated instances from ten datasets. MATHENA achieves 93.78% mAP@50 in tooth detection, 90.11% Dice for CarSeg, 88.35% for AD, and 72.40% ACC for DDS.


[62] TRiGS: Temporal Rigid-Body Motion for Scalable 4D Gaussian Splatting cs.CV

Suwoong Yeom, Joonsik Nam, Seunggyu Choi, Lucas Yunkyu Lee, Sangmin Kim

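TL;DR: This paper proposes TRiGS, a novel 4D Gaussian Splatting representation that addresses the severe temporal fragmentation existing 4DGS methods suffer when modeling dynamic scenes, caused by their reliance on piecewise linear velocity approximations and short temporal windows. By integrating SE(3) transformations, hierarchical Bezier residuals, and learnable local anchors, it models a unified, continuous geometric transformation for each primitive, preserving temporal consistency and curbing unbounded memory growth.

Details

Motivation: When handling complex nonlinear dynamics, existing 4DGS methods use piecewise linear velocity approximations and short temporal windows, which causes severe temporal fragmentation and forces primitives to be repeatedly eliminated and regenerated to track the motion. This destroys the long-term temporal consistency of objects and makes the number of Gaussian primitives balloon, hindering scalability to long video sequences.

Result: Extensive experiments show that TRiGS achieves high-fidelity rendering on standard benchmarks and uniquely scales to long video sequences (for example, 600 to 1200 frames) without severe memory bottlenecks, significantly outperforming prior work in temporal stability.

Insight: The core contribution is a 4D representation built on unified, continuous geometric transformations: SE(3) transformations model rigid-body motion, while hierarchical Bezier residuals and learnable anchors add expressiveness. This preserves the long-term temporal identity of primitives while keeping memory growth under control, giving a scalable solution for long-sequence dynamic scene modeling.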

Abstract: Recent 4D Gaussian Splatting (4DGS) methods achieve impressive dynamic scene reconstruction but often rely on piecewise linear velocity approximations and short temporal windows. This disjointed modeling leads to severe temporal fragmentation, forcing primitives to be repeatedly eliminated and regenerated to track complex nonlinear dynamics. This makeshift approximation eliminates the long-term temporal identity of objects and causes an inevitable proliferation of Gaussians, hindering scalability to extended video sequences. To address this, we propose TRiGS, a novel 4D representation that utilizes unified, continuous geometric transformations. By integrating $SE(3)$ transformations, hierarchical Bezier residuals, and learnable local anchors, TRiGS models geometrically consistent rigid motions for individual primitives. This continuous formulation preserves temporal identity and effectively mitigates unbounded memory growth. Extensive experiments demonstrate that TRiGS achieves high fidelity rendering on standard benchmarks while uniquely scaling to extended video sequences (e.g., 600 to 1200 frames) without severe memory bottlenecks, significantly outperforming prior works in temporal stability.
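
The two building blocks named in the abstract can be shown in isolation. The sketch below is not the TRiGS parameterization, which the abstract does not spell out: it simply applies one rigid $SE(3)$ transform to a point and adds a smooth residual sampled from a Bezier curve, with `position_at` as a hypothetical way of combining them.

```python
def apply_se3(R, t, point):
    """Rigid transform of a 3D point: R @ point + t (R is 3x3, t has length 3)."""
    return tuple(sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))


def bezier(control_points, u):
    """Evaluate a Bezier curve at u in [0, 1] with De Casteljau's algorithm."""
    points = [tuple(p) for p in control_points]
    while len(points) > 1:
        points = [tuple((1 - u) * a + u * b for a, b in zip(p, q))
                  for p, q in zip(points, points[1:])]
    return points[0]


def position_at(point, R, t, residual_controls, u):
    """Hypothetical composition: rigid motion plus a Bezier residual at normalized time u."""
    rigid = apply_se3(R, t, point)
    residual = bezier(residual_controls, u)
    return tuple(a + b for a, b in zip(rigid, residual))


# 90-degree rotation about the z axis followed by a unit shift along z.
rot_z_90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
moved = apply_se3(rot_z_90, (0, 0, 1), (1, 0, 0))
controls = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]
final_position = position_at((1, 0, 0), rot_z_90, (0, 0, 1), controls, u=1.0)
```

Because both pieces are continuous in time, a primitive described this way keeps a single identity over the whole sequence instead of being deleted and respawned between short linear segments.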


[63] Reliev3R: Relieving Feed-forward Reconstruction from Multi-View Geometric Annotations cs.CV

Youyu Chen, Junjun Jiang, Yueru Luo, Kui Jiang, Xianming Liu

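TL;DR: This paper proposes Reliev3R, a weakly-supervised paradigm for training feed-forward reconstruction models (FFRMs) from scratch without costly, hard-to-scale multi-view geometric annotations such as 3D point maps and camera poses. The method draws 3D knowledge directly from monocular relative depths and sparse image correspondences given by zero-shot predictions of pretrained models, and supervises multi-view geometric consistency with an ambiguity-aware relative depth loss and a trigonometry-based reprojection loss.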

Details

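Motivation: Feed-forward reconstruction models (FFRMs) show promise in reconstruction quality and adaptability to downstream tasks, but their fully-supervised training scheme relies heavily on multi-view geometric annotations, which makes it hard to scale. This paper aims to solve that scalability problem by reducing the need for geometric sensor data and compute-intensive structure-from-motion preprocessing.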

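Result: Trained from scratch with less data, Reliev3R catches up with its fully-supervised counterparts, a step towards low-cost 3D reconstruction supervision and scalable FFRMs.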

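Insight: The novelty is a weakly-supervised training paradigm that replaces multi-view geometric annotations with zero-shot monocular relative depths and sparse correspondences, together with purpose-built losses (an ambiguity-aware relative depth loss and a trigonometry-based reprojection loss) that enforce geometric consistency. This offers a new route to lowering annotation cost for 3D reconstruction and improving model scalability.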

Abstract: With recent advances, Feed-forward Reconstruction Models (FFRMs) have demonstrated great potential in reconstruction quality and adaptiveness to multiple downstream tasks. However, the excessive reliance on multi-view geometric annotations, e.g., 3D point maps and camera poses, makes the fully-supervised training scheme of FFRMs difficult to scale up. In this paper, we propose Reliev3R, a weakly-supervised paradigm for training FFRMs from scratch without cost-prohibitive multi-view geometric annotations. Relieving the reliance on geometric sensory data and compute-exhaustive structure-from-motion preprocessing, our method draws 3D knowledge directly from monocular relative depths and sparse image correspondences given by zero-shot predictions of pretrained models. At the core of Reliev3R, we design an ambiguity-aware relative depth loss and a trigonometry-based reprojection loss to facilitate supervision for multi-view geometric consistency. Trained from scratch with less data, Reliev3R catches up with its fully-supervised sibling models, taking a step towards low-cost 3D reconstruction supervision and scalable FFRMs.


[64] TF-SSD: A Strong Pipeline via Synergic Mask Filter for Training-free Co-salient Object Detection cs.CVPDF

Zhijin He, Shuo Jin, Siyue Yu, Shuwei Wu, Bingfeng Zhang

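TL;DR: This paper proposes TF-SSD, a training-free co-salient object detection method that combines SAM and DINO: SAM generates a candidate mask pool, and a quality mask generator, an intra-image saliency filter, and an inter-image prototype selector then pick the most salient masks as the final prediction.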

Details

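Motivation: Existing training-based co-salient object detection methods are limited by closed-set datasets and generalize poorly. This paper explores the potential of Vision Foundation Models (VFMs), using their strong generalization and saliency understanding to tackle CoSOD.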

Result: Extensive experiments show that TF-SSD outperforms existing methods on co-salient object detection, e.g., a 13.7% gain over the recent training-free method.

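Insight: The novelty lies in combining SAM and DINO without any training: a synergic mask-filtering pipeline performs salient object detection across images, drawing on SAM's dense segmentation and DINO's semantic attention to improve generalization and efficiency.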

Abstract: Co-salient Object Detection (CoSOD) aims to segment salient objects that consistently appear across a group of related images. Despite notable progress, recent training-based approaches remain constrained by closed-set datasets and exhibit limited generalization. Meanwhile, few studies explore the potential of Vision Foundation Models (VFMs), which demonstrate strong generalization ability and robust saliency understanding, to address CoSOD. In this paper, we investigate and leverage VFMs for CoSOD, and further propose a novel training-free method, TF-SSD, through the synergy between SAM and DINO. Specifically, we first utilize SAM to generate comprehensive raw proposals, which serve as a candidate mask pool. Then, we introduce a quality mask generator to filter out redundant masks, thereby acquiring a refined mask set. Since this generator is built upon SAM, it inherently lacks semantic understanding of saliency. To this end, we adopt an intra-image saliency filter that employs DINO’s attention maps to identify visually salient masks within individual images. Moreover, to extend saliency understanding across group images, we propose an inter-image prototype selector, which computes similarity scores among cross-image prototypes to select masks with the highest score. These selected masks serve as final predictions for CoSOD. Extensive experiments show that our TF-SSD outperforms existing methods (e.g., 13.7% gains over the recent training-free method). Code is available at https://github.com/hzz-yy/TF-SSD.


[65] STAR: Mitigating Cascading Errors in Spatial Reasoning via Turn-point Alignment and Segment-level DPO cs.CVPDF

Pukun Zhao, Longxiang Wang, Chen Chen, Peicheng Wang, Fanqing Zhou

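TL;DR: This paper proposes STAR, a two-stage framework that aims to mitigate the cascading errors large language models make on structured spatial navigation tasks with complex topologies. The framework is grounded in topological anchors and comes with a new dataset, RedMaze-23K, annotated with human-inspired turn points. The first stage uses supervised fine-tuning to help the model internalize spatial semantics and prune redundant paths; the second applies spatial-aware segment-level Direct Preference Optimization to refine self-correction in long-horizon navigation.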

Details

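Motivation: Existing methods such as Visualization-of-Thought (VoT) are prone to cascading errors when reasoning spatially over complex topologies; the goal is to improve the performance of large language models on structured spatial navigation benchmarks.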

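Result: Experiments show that STAR achieves state-of-the-art performance among open-source models: its 32B variant outperforms DeepSeek-V3 on the RedMaze-23K dataset (29.27% vs. 25.00%) and reaches 82.4% of GPT-4's performance.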

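Insight: The novelty is a two-stage training framework grounded in topological anchors that combines supervised fine-tuning with spatial-aware segment-level Direct Preference Optimization, plus a dataset with human-inspired turn-point annotations. Together these guide the model's spatial reasoning and self-correction at a finer granularity, effectively mitigating cascading errors.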

Abstract: Structured spatial navigation is a core benchmark for the spatial reasoning of Large Language Models (LLMs). Existing paradigms like Visualization-of-Thought (VoT) are prone to cascading errors in complex topologies. To solve this, we propose STAR, a two-stage framework grounded in topological anchors, and introduce the RedMaze-23K dataset with human-inspired turn-point annotations. The first stage uses supervised fine-tuning to help models internalize spatial semantics and prune redundant paths. The second adopts Spatial-aware Segment-level Direct Preference Optimization (SDPO) to refine self-correction in long-horizon navigation. Experiments show STAR achieves state-of-the-art performance among open-source models: its 32B variant outperforms DeepSeek-V3 (29.27% vs. 25.00%) and reaches 82.4% of GPT-4’s performance.


[66] FecalFed: Privacy-Preserving Poultry Disease Detection via Federated Learning cs.CVPDF

Tien-Yu Chi

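TL;DR: This paper proposes FecalFed, a privacy-preserving federated learning framework for poultry disease classification that detects diseases such as highly pathogenic avian influenza from fecal images while keeping farm data private. The author also releases poultry-fecal-fl, a rigorously deduplicated dataset, and evaluates the models under highly heterogeneous, non-IID data.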

Details

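Motivation: Deploying computer vision models for poultry disease detection at scale runs into farm data privacy concerns and institutional data silos, and existing open-source agricultural datasets suffer from severe, undocumented data contamination.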

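Result: Under highly heterogeneous, non-IID conditions, the federated approach (FedAdam optimizer with a Swin-Small architecture) reaches 90.31% accuracy, close to the centralized upper bound of 95.10%; the edge-optimized Swin-Tiny model remains competitive at 89.74%.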

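Insight: The contributions are: 1) a privacy-preserving federated learning framework for disease detection in agriculture; 2) a released dataset that has been rigorously deduplicated (eliminating a 46.89% duplication rate), which addresses data contamination; 3) validation of federated learning on non-IID data that simulates real agricultural environments, providing an efficient, privacy-first blueprint for on-farm disease monitoring.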

Abstract: Early detection of highly pathogenic avian influenza (HPAI) and endemic poultry diseases is critical for global food security. While computer vision models excel at classifying diseases from fecal imaging, deploying these systems at scale is bottlenecked by farm data privacy concerns and institutional data silos. Furthermore, existing open-source agricultural datasets frequently suffer from severe, undocumented data contamination. In this paper, we introduce FecalFed, a privacy-preserving federated learning framework for poultry disease classification. We first curate and release poultry-fecal-fl, a rigorously deduplicated dataset of 8,770 unique images across four disease classes, revealing and eliminating a 46.89% duplication rate in popular public repositories. To simulate realistic agricultural environments, we evaluate FecalFed under highly heterogeneous, non-IID conditions (Dirichlet $\alpha=0.5$). While isolated single-farm training collapses under this data heterogeneity, yielding only 64.86% accuracy, our federated approach recovers performance without centralizing sensitive data. Specifically, utilizing server-side adaptive optimization (FedAdam) with a Swin-Small architecture achieves 90.31% accuracy, closely approaching the centralized upper bound of 95.10%. Furthermore, we demonstrate that an edge-optimized Swin-Tiny model maintains highly competitive performance at 89.74%, establishing a highly efficient, privacy-first blueprint for on-farm avian disease monitoring.
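
The non-IID setting mentioned in the abstract above (Dirichlet with concentration 0.5) is commonly simulated by drawing, for each class, a Dirichlet-distributed share per client and splitting that class's samples accordingly. The sketch below shows this common recipe using only the standard library; it is not the author's partitioning code, and the function names, the number of clients, and the seed are hypothetical.

```python
import random


def dirichlet(alpha, k, rng):
    """Sample a symmetric Dirichlet(alpha) vector of length k.

    Uses the standard construction: independent Gamma(alpha, 1) draws,
    normalized to sum to one.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def partition_non_iid(labels, num_clients, alpha=0.5, seed=0):
    """Split sample indices across clients with per-class Dirichlet shares.

    A smaller alpha concentrates each class on fewer clients, i.e. a more
    heterogeneous (non-IID) split.
    """
    rng = random.Random(seed)
    clients = [[] for _ in range(num_clients)]

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        shares = dirichlet(alpha, num_clients, rng)
        # Convert cumulative shares into cut points over this class.
        start, cumulative = 0, 0.0
        for client_id, share in enumerate(shares):
            cumulative += share
            if client_id == num_clients - 1:
                end = len(indices)  # last client takes the remainder
            else:
                end = round(cumulative * len(indices))
            clients[client_id].extend(indices[start:end])
            start = end
    return clients
```

A call such as `partition_non_iid(labels, num_clients=5, alpha=0.5)` assigns every sample to exactly one simulated farm; with alpha this small, some farms typically see very few samples of a given class, which is the kind of skew that makes isolated single-farm training hard.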


[67] HarassGuard: Detecting Harassment Behaviors in Social Virtual Reality with Vision-Language Models cs.CV | cs.HCPDF

Junhee Lee, Minseok Kim, Hwanjo Heo, Seungwon Woo, Jinwoo Kim

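TL;DR: This paper presents HarassGuard, a vision-language model (VLM) based system that detects physical harassment in social virtual reality (VR) using only visual input. The system builds a dataset, applies prompt engineering, and fine-tunes VLMs to understand the context of social VR, enabling proactive harassment detection.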

Details

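Motivation: Social VR platforms carry serious risks of online harassment. Existing safety measures are mostly reactive, while proactive detection schemes often depend on sensitive biometric data, raising privacy concerns. This paper aims to develop a vision-only method that detects harassment proactively while preserving privacy.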

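Result: Experiments show that HarassGuard reaches accuracies of up to 88.09% in binary classification and 68.85% in multi-class classification, on par with state-of-the-art baselines (LSTM/CNN, Transformer), while needing only 200 fine-tuning samples (versus 1,115 for the baselines), with advantages in contextual reasoning and privacy preservation.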

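Insight: The novelty is applying vision-language models to harassment detection in social VR using the visual modality alone, which avoids privacy-sensitive biometric data. Its strength is that the VLM's contextual understanding achieves performance competitive with state-of-the-art models using far less fine-tuning data, suggesting a new approach to privacy-friendly proactive safety monitoring.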

Abstract: Social Virtual Reality (VR) platforms provide immersive social experiences but also expose users to serious risks of online harassment. Existing safety measures are largely reactive, while proactive solutions that detect harassment behavior during an incident often depend on sensitive biometric data, raising privacy concerns. In this paper, we present HarassGuard, a vision-language model (VLM) based system that detects physical harassment in social VR using only visual input. We construct an IRB-approved harassment vision dataset, apply prompt engineering, and fine-tune VLMs to detect harassment behavior by considering contextual information in social VR. Experimental results demonstrate that HarassGuard achieves competitive performance compared to state-of-the-art baselines (i.e., LSTM/CNN, Transformer), reaching an accuracy of up to 88.09% in binary classification and 68.85% in multi-class classification. Notably, HarassGuard matches these baselines while using significantly fewer fine-tuning samples (200 vs. 1,115), offering unique advantages in contextual reasoning and privacy-preserving detection.


[68] KG-CMI: Knowledge graph enhanced cross-Mamba interaction for medical visual question answering cs.CVPDF

Xianyao Zheng, Hong Yu, Hui Cui, Changming Sun, Xiangyu Li

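TL;DR: This paper proposes KG-CMI (knowledge graph enhanced cross-Mamba interaction), a framework for medical visual question answering (Med-VQA). It integrates fine-grained cross-modal feature alignment, knowledge graph embedding, cross-modal interaction representation, and a free-form answer enhanced multi-task learning module, with the aim of using medical domain knowledge more effectively and improving the handling of free-form answers.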

Details

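Motivation: Existing Med-VQA methods do not fully exploit domain-specific medical knowledge and struggle to associate lesion features in medical images with key diagnostic criteria. In addition, classification-based methods depend on predefined answer sets, which limits their ability to adapt to the diversity of free-form answers and may overlook detailed semantic information in those answers.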

Result: Experiments show that KG-CMI outperforms existing state-of-the-art methods on three Med-VQA datasets (VQA-RAD, SLAKE, and OVQA).

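Insight: The novelty lies in using a knowledge graph to integrate professional medical knowledge and link lesion features to disease knowledge, and in introducing a free-form answer enhanced multi-task learning module that improves open-ended Med-VQA. Viewed objectively, combining the Mamba architecture with a knowledge graph for cross-modal interaction offers a new approach to knowledge integration and representation learning in medical multimodal tasks.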

Abstract: Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent methods fail to fully leverage domain-specific medical knowledge, making it difficult to accurately associate lesion features in medical images with key diagnostic criteria. Additionally, classification-based approaches typically rely on predefined answer sets. Treating Med-VQA as a simple classification problem limits its ability to adapt to the diversity of free-form answers and may overlook detailed semantic information in those answers. To address these challenges, we propose a knowledge graph enhanced cross-Mamba interaction (KG-CMI) framework, which consists of a fine-grained cross-modal feature alignment (FCFA) module, a knowledge graph embedding (KGE) module, a cross-modal interaction representation (CMIR) module, and a free-form answer enhanced multi-task learning (FAMT) module. The KG-CMI learns cross-modal feature representations for images and texts by effectively integrating professional medical knowledge through a graph, establishing associations between lesion features and disease knowledge. Moreover, FAMT leverages auxiliary knowledge from open-ended questions, improving the model’s capability for open-ended Med-VQA. Experimental results demonstrate that KG-CMI outperforms existing state-of-the-art methods on three Med-VQA datasets, i.e., VQA-RAD, SLAKE, and OVQA. Additionally, we conduct interpretability experiments to further validate the framework’s effectiveness.


[69] CL-VISTA: Benchmarking Continual Learning in Video Large Language Models cs.CVPDF

Haiyang Guo, Yichen Shi, Fei Zhu, Wenzhuo Liu, Hongbo Zhao

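TL;DR: This paper proposes CL-VISTA, a continual learning benchmark designed for Video Large Language Models (Video-LLMs) that addresses the shortcomings of existing benchmarks in evaluating large-scale pre-trained models. It contains 8 diverse tasks spanning perception, understanding, and reasoning, chosen to induce substantial distribution shifts and effectively expose catastrophic forgetting. The study establishes a comprehensive evaluation framework with 6 protocols and systematically evaluates 10 mainstream continual learning methods along three dimensions (performance, computational efficiency, and memory footprint), revealing a fundamental trade-off among performance, generalization, and resource overhead.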

Details

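Motivation: Most existing continual learning benchmarks are built around models without large-scale pre-training and usually split a single dataset into sub-tasks. This leads to high task redundancy and negligible forgetting on pre-trained Video-LLMs, so they cannot effectively evaluate modern foundation models.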

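Result: Extensive evaluation of 10 mainstream continual learning methods on CL-VISTA shows that no single method is superior across all evaluation dimensions (performance, computational efficiency, memory footprint); methods that successfully mitigate catastrophic forgetting tend to sacrifice generalization or incur prohibitive computational and memory overheads.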

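Insight: The novelty is a continual video understanding benchmark aimed at pre-trained Video-LLMs, with high task diversity that effectively induces distribution shift, together with a comprehensive multi-dimensional evaluation framework that includes a general video understanding assessment to tell whether a method genuinely enhances foundational intelligence or merely causes task-specific overfitting. Viewed objectively, the work systematically exposes the performance-generalization-resource trade-off that current continual learning methods face on foundation models, and provides key insights and evaluation tools for continual learning research on multimodal foundation models.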

Abstract: Video Large Language Models (Video-LLMs) require continual learning to adapt to non-stationary real-world data. However, existing benchmarks fall short of evaluating modern foundation models: many still rely on models without large-scale pre-training, and prevailing benchmarks typically partition a single dataset into sub-tasks, resulting in high task redundancy and negligible forgetting on pre-trained Video-LLMs. To address these limitations, we propose CL-VISTA, a benchmark tailored for continual video understanding of Video-LLMs. By curating 8 diverse tasks spanning perception, understanding, and reasoning, CL-VISTA induces substantial distribution shifts that effectively expose catastrophic forgetting. To systematically assess CL methods, we establish a comprehensive evaluation framework comprising 6 distinct protocols across 3 critical dimensions: performance, computational efficiency, and memory footprint. Notably, the performance dimension incorporates a general video understanding assessment to determine whether CL methods genuinely enhance foundational intelligence or merely induce task-specific overfitting. Extensive benchmarking of 10 mainstream CL methods reveals a fundamental trade-off: no single approach achieves universal superiority across all dimensions. Methods that successfully mitigate catastrophic forgetting tend to compromise generalization or incur prohibitive computational and memory overheads. We hope CL-VISTA provides critical insights for advancing continual learning in multimodal foundation models.


[70] MoonAnything: A Vision Benchmark with Large-Scale Lunar Supervised Data cs.CVPDF

Clémentine Grethen, Yuang Shi, Simone Gasparini, Géraldine Morin

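TL;DR: This paper introduces MoonAnything, a unified vision benchmark built on real lunar topography that uses physically-based rendering to provide large-scale geometric and photometric supervision under diverse illumination. It comprises two complementary sub-datasets, LunarGeo (stereo images, dense depth maps, and camera calibration, supporting 3D reconstruction and pose estimation) and LunarPhoto (photorealistic images generated with a spatially-varying BRDF model plus multi-illumination renderings, supporting reflectance estimation and illumination-robust perception), totaling more than 130K samples.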

Details

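Motivation: Existing lunar datasets typically lack geometric ground truth, photometric realism, illumination diversity, or large-scale coverage, which holds back learning-based perception systems. This paper aims to fill that data gap and provide comprehensive supervision for accurate perception of the lunar surface.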

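Result: Baselines using state-of-the-art methods are established on the MoonAnything benchmark, and the complete dataset and generation tools are released to support community extension.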

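Insight: The novelty is providing, for the first time, comprehensive geometric and photometric supervision at large scale under diverse illumination. This gives algorithms a distinctive and challenging testbed for low-texture, high-contrast conditions and can extend to other airless celestial bodies.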

Abstract: Accurate perception of lunar surfaces is critical for modern lunar exploration missions. However, developing robust learning-based perception systems is hindered by the lack of datasets that provide both geometric and photometric supervision. Existing lunar datasets typically lack geometric ground truth, photometric realism, illumination diversity, or large-scale coverage. In this paper, we introduce MoonAnything, a unified benchmark built on real lunar topography with physically-based rendering, providing the first comprehensive geometric and photometric supervision under diverse illumination at large scale. The benchmark comprises two complementary sub-datasets: i) LunarGeo provides stereo images with corresponding dense depth maps and camera calibration, enabling 3D reconstruction and pose estimation; ii) LunarPhoto provides photorealistic images using a spatially-varying BRDF model, along with multi-illumination renderings under real solar configurations, enabling reflectance estimation and illumination-robust perception. Together, these datasets offer over 130K samples with comprehensive supervision. Beyond lunar applications, MoonAnything offers a unique and challenging testbed for algorithms under low-textured, high-contrast conditions; it applies to other airless celestial bodies and could generalize further. We establish baselines using state-of-the-art methods and release the complete dataset along with generation tools to support community extension: https://github.com/clementinegrethen/MoonAnything.


[71] TTA-Vid: Generalized Test-Time Adaptation for Video Reasoning cs.CVPDF

Soumya Shamarao Jahagirdar, Edson Araujo, Anna Kukleva, M. Jehanzeb Mirza, Saurabhchand Bhati

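TL;DR: This paper proposes TTA-Vid, a generalized test-time adaptation method for video reasoning. Using the test-time reinforcement learning paradigm and no explicit labels, it updates a pretrained model through step-by-step reasoning over multiple frame subsets and a batch-aware frequency-based reward. Adaptation on a single batch, or even a single sample, generalizes to the whole dataset and across datasets.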

Details

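Motivation: Existing video reasoning models depend on large-scale supervised data and multi-stage training pipelines, which makes them costly and hard to adapt to new domains. This paper uses test-time adaptation to reduce the dependence on annotated data and dedicated training sets.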

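Result: TTA-Vid yields consistent improvements across various video reasoning tasks and is able to outperform current state-of-the-art methods trained on large-scale data.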

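Insight: The contributions include: combining test-time adaptation with reinforcement learning to update the model online from unlabeled video samples; a batch-aware frequency-based reward that serves as pseudo ground truth; and a multi-armed bandit strategy for adaptive frame selection that prioritizes informative frames and improves reasoning efficiency.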

Abstract: Recent video reasoning models have shown strong results on temporal and multimodal understanding, yet they depend on large-scale supervised data and multi-stage training pipelines, making them costly to train and difficult to adapt to new domains. In this work, we leverage the paradigm of Test-Time Reinforcement Learning on video-language data to adapt a pretrained model to incoming video samples at test time without explicit labels. The proposed test-time adaptation approach for video (TTA-Vid) combines two components that work simultaneously. (1) A test-time adaptation step performs step-by-step reasoning at inference time on multiple frame subsets and uses a batch-aware frequency-based reward, computed across the different frame subsets, as pseudo ground truth to update the model. We show that the resulting model, trained on a single batch or even a single sample from a dataset, is able to generalize at test time to the whole dataset and even across datasets. Because the adaptation occurs entirely at test time, our method requires no ground-truth annotations or dedicated training splits. (2) A multi-armed bandit strategy for adaptive frame selection learns to prioritize informative frames, guided by the same reward formulation. Our evaluation shows that TTA-Vid yields consistent improvements across various video reasoning tasks and is able to outperform current state-of-the-art methods trained on large-scale data. This highlights the potential of test-time reinforcement learning for temporal multimodal understanding.
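
The idea of a frequency-based reward used as pseudo ground truth can be sketched as a simple agreement score: the model answers the same question from several frame subsets, and each answer is rewarded by how often it occurs among them. This is only an assumed minimal reading of the abstract above; the function names are hypothetical, and the batch-aware part of the reward, which the abstract does not spell out, is omitted.

```python
from collections import Counter


def frequency_rewards(answers):
    """Reward each answer by its relative frequency among all answers.

    `answers` holds one predicted answer per frame subset for the same
    video and question. No ground-truth label is involved.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    total = len(answers)
    return [counts[a] / total for a in answers]


def pseudo_label(answers):
    """Return the most frequent answer, used in place of a ground-truth label."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


# Four frame subsets of one video; three of them agree on option "B".
answers = ["B", "B", "A", "B"]
rewards = frequency_rewards(answers)   # [0.75, 0.75, 0.25, 0.75]
label = pseudo_label(answers)          # "B"
```

In a reinforcement learning update, these rewards would weight the corresponding reasoning traces, so traces that land on the consensus answer are reinforced without any annotation.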


[72] IWP: Token Pruning as Implicit Weight Pruning in Large Vision Language Models cs.CV | cs.AIPDF

Dong-Jae Lee, Sunghyun Baek, Junmo Kim

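TL;DR: This paper proposes IWP, a novel training-free token pruning framework grounded in the dual-form perspective of attention. By reformulating attention as an implicit linear layer, it turns token pruning into selecting the optimal subset of rank-1 updates that best approximates the original dual weight matrix, achieving a better performance-efficiency trade-off in large vision-language models.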

Details

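Motivation: The computational cost of large vision-language models grows rapidly with the number of visual tokens, and existing token pruning methods are mostly empirical and overlook the internal mechanism of attention, so a more theory-driven pruning framework is needed.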

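Result: Extensive experiments show that the method achieves a better trade-off between performance and efficiency and offers a new perspective on existing pruning approaches.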

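Insight: The novelty lies in re-formalizing attention from the dual-form perspective, proposing a new metric that quantifies both a token's information magnitude and its information duplication, and introducing Progressive Chunked Maximal Marginal Relevance for efficient subset selection, which gives token pruning a theoretical basis and an efficient algorithm.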

Abstract: Large Vision Language Models show impressive performance across image and video understanding tasks, yet their computational cost grows rapidly with the number of visual tokens. Existing token pruning methods mitigate this issue through empirical approaches while overlooking the internal mechanism of attention. In this paper, we propose a novel training-free token pruning framework grounded in the dual-form perspective of attention. We reformulate attention as an implicit linear layer whose weight matrix is the sum of rank-1 outer products, each generated by a single token’s key-value pair. Token pruning thus reduces to selecting an optimal subset of these rank-1 updates that best approximates the original dual weight matrix. Extending this perspective to standard softmax attention in LVLMs, we derive a novel metric quantifying both a token’s information magnitude and information duplication. To efficiently select the subset with the proposed metric, we introduce Progressive Chunked Maximal Marginal Relevance. Extensive experiments demonstrate that our method achieves a better trade-off between performance and efficiency, while providing another perspective on existing pruning approaches.
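
The dual-form view in the abstract above is easiest to see for unnormalized linear attention, where the output for a query q is the sum over tokens of (q · k_i) v_i, which equals W q with W the sum of the outer products v_i k_i^T. The sketch below checks that identity and shows that dropping a token removes exactly one rank-1 term from W. It is a toy illustration under that linear-attention assumption; the paper's extension to softmax attention, its pruning metric, and its selection procedure are not reproduced, and all function names are hypothetical.

```python
def outer(v, k):
    """Rank-1 outer product v k^T as a list of rows."""
    return [[vi * kj for kj in k] for vi in v]


def matvec(w, q):
    """Matrix-vector product W q."""
    return [sum(wij * qj for wij, qj in zip(row, q)) for row in w]


def dual_weight(keys, values, keep=None):
    """Implicit weight matrix W = sum_i v_i k_i^T over the kept tokens."""
    keep = range(len(keys)) if keep is None else keep
    d_v, d_k = len(values[0]), len(keys[0])
    w = [[0.0] * d_k for _ in range(d_v)]
    for i in keep:
        update = outer(values[i], keys[i])  # one token = one rank-1 update
        w = [[a + b for a, b in zip(rw, ru)] for rw, ru in zip(w, update)]
    return w


def linear_attention(q, keys, values):
    """Unnormalized linear attention: sum_i (q . k_i) v_i."""
    out = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = sum(a * b for a, b in zip(q, k))
        out = [o + score * vi for o, vi in zip(out, v)]
    return out


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
q = [2.0, 3.0]

full = matvec(dual_weight(keys, values), q)            # same as attention output
pruned = matvec(dual_weight(keys, values, keep=[0, 2]), q)
# The gap between `full` and `pruned` is the pruned token's own contribution.
error = [f - p for f, p in zip(full, pruned)]
```

Here `full` matches `linear_attention(q, keys, values)`, and `error` equals (q · k_1) v_1, the contribution of the pruned token. Seen this way, a good pruning choice is one whose removed rank-1 terms change W as little as possible.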


[73] PrivHAR-Bench: A Graduated Privacy Benchmark Dataset for Video-Based Action Recognition cs.CV | cs.CRPDF

Samar Ansari

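TL;DR: This paper presents PrivHAR-Bench, a benchmark dataset for standardized evaluation of the privacy-utility trade-off in video action recognition. It contains 1,932 source videos covering 15 action classes; each video is transformed at 9 tiers of increasing privacy strength, from lightweight spatial obfuscation to cryptographic block permutation, and background-removed variants are provided to separate the influence of human motion features from that of scene context.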

Details

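Motivation: Existing research on privacy-preserving human activity recognition usually evaluates methods under a binary paradigm, clear video versus a single privacy transformation, which limits comparability across methods and obscures the nuanced relationship between privacy strength and recognition utility.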

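Result: Empirical validation with R3D-18 shows a measurable and interpretable degradation curve as privacy strength increases: within-tier accuracy falls from 88.8% (clear video) to 53.5% (encrypted, background-removed), and cross-domain accuracy collapses to 4.8%, establishing the dataset as a controlled benchmark for comparing privacy-preserving HAR methods under standardized conditions.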

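Insight: The main novelty is a graduated spectrum of privacy transformations that goes beyond the traditional binary evaluation paradigm, along with background-removed variants that isolate scene bias, allowing a finer-grained study of the relationship between privacy and utility. The dataset, generation pipeline, and evaluation code are publicly available.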

Abstract: Existing research on privacy-preserving Human Activity Recognition (HAR) typically evaluates methods against a binary paradigm: clear video versus a single privacy transformation. This limits cross-method comparability and obscures the nuanced relationship between privacy strength and recognition utility. We introduce PrivHAR-Bench, a multi-tier benchmark dataset designed to standardize the evaluation of the Privacy-Utility Trade-off in video-based action recognition. PrivHAR-Bench applies a graduated spectrum of visual privacy transformations, from lightweight spatial obfuscation to cryptographic block permutation, to a curated subset of 15 activity classes selected for human articulation diversity. Each of the 1,932 source videos is distributed across 9 parallel tiers of increasing privacy strength, with additional background-removed variants to isolate the contribution of human motion features from contextual scene bias. We provide lossless frame sequences, per-frame bounding boxes, estimated pose keypoints with joint-level confidence scores, standardized group-based train/test splits, and an evaluation toolkit computing recognition accuracy and privacy metrics. Empirical validation using R3D-18 demonstrates a measurable and interpretable degradation curve across tiers, with within-tier accuracy declining from 88.8% (clear) to 53.5% (encrypted, background-removed) and cross-domain accuracy collapsing to 4.8%, establishing PrivHAR-Bench as a controlled benchmark for comparing privacy-preserving HAR methods under standardized conditions. The dataset, generation pipeline, and evaluation code are publicly available.


[74] An Approach to Enriching Surgical Video Datasets for Fine-Grained Spatial-Temporal Understanding of Vision-Language Models cs.CVPDF

Lennart Maack, Alexander Schlaefer

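TL;DR: This paper proposes SurgSTU-Pipeline, a deterministic pipeline for generating surgical video datasets with fine-grained spatial-temporal relationships, and uses it to create the SurgSTU dataset of 150k question-answer samples, with the goal of improving the spatial-temporal understanding of vision-language models in surgical videos.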

Details

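Motivation: Existing surgical vision-language datasets struggle to capture and evaluate complex, interleaved spatial-temporal dynamics, and building large-scale datasets is held back by costly manual annotation or error-prone generation with large language models.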

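Result: On the SurgSTU dataset, state-of-the-art generalist vision-language models perform poorly in zero-shot settings, but in-context learning improves their spatial-temporal capabilities; a vision-language model fine-tuned on the SurgSTU training set achieves the highest performance across all spatial-temporal tasks.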

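Insight: The novelty is a deterministic generation pipeline with temporal and spatial continuity filtering that reliably creates surgical datasets for fine-grained spatial-temporal multimodal understanding, which addresses the annotation challenge and confirms the dataset's value for improving model capability.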

Abstract: Surgical video understanding is a crucial prerequisite for advancing Computer-Assisted Surgery. While vision-language models (VLMs) have recently been applied to the surgical domain, existing surgical vision-language datasets fall short in capturing and evaluating complex, interleaved spatial-temporal dynamics. Creating large-scale datasets that accurately represent fine-grained spatial-temporal relationships in surgical videos is challenging due to costly manual annotations or error-prone generation using large language models. To address this gap, we introduce the SurgSTU-Pipeline, a deterministic generation pipeline featuring temporal and spatial continuity filtering to reliably create surgical datasets for fine-grained spatial-temporal multimodal understanding. Applying this pipeline to publicly available surgical datasets, we create the SurgSTU dataset, comprising 7515 video clips densely extended with 150k fine-grained spatial-temporal question-answer samples. Our comprehensive evaluation shows that while state-of-the-art generalist VLMs struggle in zero-shot settings, their spatial-temporal capabilities can be improved through in-context learning. A VLM fine-tuned on the SurgSTU training dataset achieves the highest performance across all spatial-temporal tasks, validating the dataset’s efficacy in improving the spatial-temporal understanding of VLMs in surgical videos. Code will be made publicly available.


[75] HICT: High-precision 3D CBCT reconstruction from a single X-ray cs.CVPDF

Wen Ma, Jiaxiang Liu, Zikai Xiao, Ziyang Wang, Feng Yang

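TL;DR: This paper proposes HiCT, a two-stage framework for reconstructing high-precision 3D cone-beam computed tomography (CBCT) from a single low-dose panoramic X-ray image. The method first generates geometrically consistent multi-view projections and then reconstructs high-fidelity CBCT with a ray-based dynamic attention network and an X-ray sampling strategy.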

Details

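Motivation: The high radiation dose and cost of CBCT limit its accessibility. This paper explores 3D reconstruction from a single panoramic X-ray as an alternative and tackles the accompanying challenges of geometric inconsistency and limited accuracy.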

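Result: Extensive experiments on XCT, the large-scale dataset built for this work, show that HiCT achieves state-of-the-art performance and delivers accurate, geometrically consistent reconstructions for clinical use.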

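Insight: The contributions include using a video diffusion model to generate geometrically consistent multi-view projections, and combining a ray-based dynamic attention network with an X-ray sampling strategy for high-fidelity reconstruction. Viewed objectively, the two-stage framework and the purpose-built dataset are the key contributions to improving the accuracy of single-view 3D reconstruction.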

Abstract: Accurate 3D dental imaging is vital for diagnosis and treatment planning, yet CBCT’s high radiation dose and cost limit its accessibility. Reconstructing 3D volumes from a single low-dose panoramic X-ray is a promising alternative but remains challenging due to geometric inconsistencies and limited accuracy. We propose HiCT, a two-stage framework that first generates geometrically consistent multi-view projections from a single panoramic image using a video diffusion model, and then reconstructs high-fidelity CBCT from the projections using a ray-based dynamic attention network and an X-ray sampling strategy. To support this, we built XCT, a large-scale dataset combining public CBCT data with 500 paired PX-CBCT cases. Extensive experiments show that HiCT achieves state-of-the-art performance, delivering accurate and geometrically consistent reconstructions for clinical use.


[76] Multimodal Language Models Cannot Spot Spatial Inconsistencies cs.CV | cs.CL | cs.LGPDF

Om Khangaonkar, Hadi J. Rad, Hamed Pirsiavash

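TL;DR: This paper studies the limits of multimodal large language models in understanding 3D spatial consistency and proposes a more challenging task: given two views of the same scene, identify the object that violates 3D motion consistency. By generating realistic, spatially inconsistent image pairs, the authors systematically evaluate current state-of-the-art models and find that they perform far below human level and vary significantly across scene attributes, revealing a fragile and incomplete understanding of 3D structure.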

Details

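Motivation: Spatial consistency is a fundamental property of the visual world and a key requirement for models that aim to understand physical reality. Despite recent progress, multimodal large language models still struggle to reason about 3D geometry across multiple views. This paper uses a more challenging task to evaluate and expose that weakness.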

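Result: Experiments show that state-of-the-art multimodal large language models perform significantly below human observers at identifying spatial inconsistencies and exhibit large performance swings across scene attributes, falling short of robust understanding.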

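Insight: The novelty is an evaluation task targeting violations of 3D motion consistency, together with a simple, scalable method for generating realistic, spatially inconsistent image pairs for systematic evaluation. The results reveal a fundamental deficiency of current MLLMs in deep, grounded understanding of the physical world and underscore the need for approaches that build more solid physical understanding.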

Abstract: Spatial consistency is a fundamental property of the visual world and a key requirement for models that aim to understand physical reality. Despite recent advances, multimodal large language models (MLLMs) often struggle to reason about 3D geometry across multiple views. Rather than asking models to describe scene attributes, we introduce a more challenging task: given two views of the same scene, identify the object that violates 3D motion consistency. We propose a simple and scalable method for generating realistic, spatially inconsistent image pairs from multi-view scenes, enabling systematic evaluation of this capability. Our results show that state-of-the-art MLLMs significantly underperform human observers and exhibit substantial variability across different scene attributes, revealing a fragile and incomplete understanding of 3D structure. We hope our findings underscore the need for approaches that develop a more deeply grounded understanding of the physical world.


[77] Revisiting Human-in-the-Loop Object Retrieval with Pre-Trained Vision Transformers cs.CV | cs.HC | cs.IRPDF

Kawtar Zaher, Olivier Buisson, Alexis Joly

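TL;DR: This paper revisits Human-in-the-Loop Object Retrieval, using pre-trained Vision Transformer (ViT) representations and an Active Learning loop to iteratively refine retrieval performance, with a focus on the design of localized descriptors for multi-object datasets.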

Details

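Motivation: The challenge is to rapidly identify diverse instances of a target category in complex, multi-object scenes from only the user's initial query and relevance feedback, especially when the target occupies a small region of the image and global descriptors fall short.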

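Result: The paper compares different representation strategies on several multi-object datasets and analyzes the trade-off between global context and fine-grained local detail, offering practical insights for designing interactive retrieval pipelines based on Active Learning.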

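Insight: The novelty lies in systematically reformulating the Human-in-the-Loop Object Retrieval task around pre-trained ViT representations and examining key design questions (instance selection, annotation form, active selection strategy, and representation strategy), with emphasis on the importance of localized descriptors in multi-object scenes.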

Abstract: Building on existing approaches, we revisit Human-in-the-Loop Object Retrieval, a task that consists of iteratively retrieving images containing objects of a class-of-interest, specified by a user-provided query. Starting from a large unlabeled image collection, the aim is to rapidly identify diverse instances of an object category relying solely on the initial query and the user’s Relevance Feedback, with no prior labels. The retrieval process is formulated as a binary classification task, where the system continuously learns to distinguish images that are relevant to the query from those that are not, through iterative user interaction. This interaction is guided by an Active Learning loop: at each iteration, the system selects informative samples for user annotation, thereby refining the retrieval performance. This task is particularly challenging in multi-object datasets, where the object of interest may occupy only a small region of the image within a complex, cluttered scene. Unlike object-centered settings where global descriptors often suffice, multi-object images require more adapted, localized descriptors. In this work, we formulate and revisit the Human-in-the-Loop Object Retrieval task by leveraging pre-trained ViT representations and addressing key design questions, including which object instances to consider in an image, what form the annotations should take, how Active Selection should be applied, and which representation strategies best capture the object’s features. We compare several representation strategies across multi-object datasets, highlighting trade-offs between capturing the global context and focusing on fine-grained local object details. Our results offer practical insights for the design of effective interactive retrieval pipelines based on Active Learning for object class retrieval.


[78] DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale cs.CV | cs.AI | cs.ROPDF

Sicheng Zuo, Zixun Xie, Wenzhao Zheng, Shaoqing Xu, Fang Li

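TL;DR: This paper proposes a Vision-Geometry-Action (VGA) paradigm for autonomous driving and introduces DVGT-2, a streaming Driving Visual Geometry Transformer. The model processes inputs online and jointly outputs dense 3D geometry reconstruction and trajectory planning for the current frame, overcoming the reliance of existing geometry reconstruction methods on multi-frame batch processing, which rules them out for online planning.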

Details

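Motivation: Existing end-to-end autonomous driving models, such as VLA models, mainly rely on language descriptions as an auxiliary task, whereas the authors argue that vehicles operate in a 3D world and dense 3D geometry is the most comprehensive cue for decision-making. However, existing geometry reconstruction methods (e.g., DVGT) are computationally expensive and cannot be applied online, so a method that performs geometry reconstruction and planning in real time is needed.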

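Result: DVGT-2 achieves superior geometry reconstruction performance on various datasets while running faster. The same trained model can be applied directly, without fine-tuning, to planning under different camera configurations, and performs well on both the closed-loop NAVSIM and open-loop nuScenes benchmarks.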

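Insight: The core innovation is the VGA paradigm centered on dense 3D geometry, together with a streaming Transformer architecture that supports online inference through temporal causal attention and a cache of historical features. Its sliding-window streaming strategy and reuse of historical caches to avoid repeated computation are instructive for improving efficiency, and unify geometry reconstruction and planning in one efficient pipeline.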

Abstract: End-to-end autonomous driving has evolved from the conventional paradigm based on sparse perception into vision-language-action (VLA) models, which focus on learning language descriptions as an auxiliary task to facilitate planning. In this paper, we propose an alternative Vision-Geometry-Action (VGA) paradigm that advocates dense 3D geometry as the critical cue for autonomous driving. As vehicles operate in a 3D world, we think dense 3D geometry provides the most comprehensive information for decision-making. However, most existing geometry reconstruction methods (e.g., DVGT) rely on computationally expensive batch processing of multi-frame inputs and cannot be applied to online planning. To address this, we introduce a streaming Driving Visual Geometry Transformer (DVGT-2), which processes inputs in an online manner and jointly outputs dense geometry and trajectory planning for the current frame. We employ temporal causal attention and cache historical features to support on-the-fly inference. To further enhance efficiency, we propose a sliding-window streaming strategy and use historical caches within a certain interval to avoid repetitive computations. Despite the faster speed, DVGT-2 achieves superior geometry reconstruction performance on various datasets. The same trained DVGT-2 can be directly applied to planning across diverse camera configurations without fine-tuning, including closed-loop NAVSIM and open-loop nuScenes benchmarks.
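
The streaming idea in the abstract above (cache historical features, attend only to the past, and keep a sliding window) can be sketched with a bounded queue: each incoming frame is encoded once, and the context handed to the causal model is the cached features of the last few frames plus the current one. This is a generic illustration, not DVGT-2's architecture; the class name, the `encode` callback, and the window size are hypothetical.

```python
from collections import deque


class StreamingFeatureCache:
    """Sliding-window cache of per-frame features for online inference."""

    def __init__(self, window):
        # deque(maxlen=...) silently drops the oldest entry once full.
        self.features = deque(maxlen=window)

    def step(self, frame, encode):
        """Encode only the new frame and return the causal context.

        The context contains features of past frames within the window plus
        the current frame; future frames are never visible.
        """
        self.features.append(encode(frame))
        return list(self.features)


# Toy usage: "encoding" multiplies by 10 and records each call.
calls = []


def toy_encode(frame):
    calls.append(frame)
    return frame * 10


cache = StreamingFeatureCache(window=3)
for frame in range(1, 6):
    context = cache.step(frame, toy_encode)
# After five frames, context holds the last three features and every frame
# was encoded exactly once.
```

Compared with re-running a multi-frame batch at every step, the per-step cost here is one encoder call plus attention over at most `window` cached entries, which is what makes on-the-fly inference feasible.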


[79] Continual Vision-Language Learning for Remote Sensing: Benchmarking and Analysis cs.CVPDF

Xingxing Weng, Ruifeng Ni, Chao Pang, XiangYu Hao, Yishan Wang

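TL;DR: This paper presents CLeaRS, a comprehensive benchmark for continual vision-language learning in remote sensing, containing over 207k image-text pairs and three evaluation protocols for systematically assessing continual adaptation. Benchmarking shows that existing vision-language models suffer severe catastrophic forgetting and that existing continual learning methods have limited effectiveness in remote sensing scenarios.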

Details

Motivation: 当前遥感视觉-语言模型依赖静态数据训练,难以适应不断涌现的新传感模态和下游任务,且缺乏评估其持续学习能力的专用基准。

Result: 在CLeaRS基准的三种评估设置(长周期、模态增量、任务增量)下,多种视觉-语言模型均表现出灾难性遗忘,且现有持续学习方法在遥感视觉-语言模型上效果不佳。

Insight: 论文的创新点在于构建了首个遥感持续视觉-语言学习基准CLeaRS,并揭示了该领域灾难性遗忘的普遍性,强调了开发针对性持续学习方法的必要性。

Abstract: Current remote sensing vision-language models (RS VLMs) demonstrate impressive performance in image interpretation but rely on static training data, limiting their ability to accommodate continuously emerging sensing modalities and downstream tasks. This exposes a fundamental challenge: enabling RS VLMs to continually adapt without catastrophic forgetting. Despite its practical importance, the continual learning capability of RS VLMs remains underexplored, and no dedicated benchmark currently exists. In this work, we present CLeaRS, a comprehensive benchmark for continual vision-language learning in remote sensing. CLeaRS comprises 10 curated subsets with over 207k image-text pairs, spanning diverse interpretation tasks, sensing modalities, and application scenarios. We further define three evaluation protocols: long-horizon, modality-incremental, and task-incremental settings, to systematically assess continual adaptation. Extensive benchmarking of diverse vision-language models reveals catastrophic forgetting across all settings. Moreover, representative continual learning methods, when adapted to RS VLMs, exhibit limited effectiveness in handling task, instruction, and modality transitions. Our findings underscore the need for developing continual learning methods tailored to RS VLMs.


[80] Video Patch Pruning: Efficient Video Instance Segmentation via Early Token Reduction cs.CVPDF

Patrick Glandorf, Thomas Norrenbrock, Bodo Rosenhahn

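TL;DR: This paper proposes Video Patch Pruning (VPP), a video patch pruning framework that uses temporal prior knowledge to enable efficient sparsification in the early layers of a Vision Transformer, significantly reducing the computational cost of video instance segmentation.

Details

Motivation: Vision Transformers (ViTs) perform well but are computationally expensive. Existing patch pruning methods reduce tokens only in deeper layers and overlook the compression potential of early layers, which limits overall efficiency gains.

Result: The method achieves up to 60% patch reduction in dense prediction tasks, well beyond conventional image-based patch pruning (typically around 30% sparsity), and keeps performance stable on the YouTube-VIS 2021 dataset with a maximum drop of only 0.6%.

Insight: The novelty is integrating temporal prior knowledge into the sparsification of early ViT layers. A fully differentiable temporal mapping module exploits the foreground selectivity of deeper-layer features to select the most relevant patches in early network stages, sustaining strong performance even at high sparsity.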

Abstract: Vision Transformers (ViTs) have demonstrated state-of-the-art performance in several benchmarks, yet their high computational costs hinder their practical deployment. Patch pruning offers significant savings, but existing approaches restrict token reduction to deeper layers, leaving early-stage compression unexplored. This limits their potential for holistic efficiency. In this work, we present a novel Video Patch Pruning framework (VPP) that integrates temporal prior knowledge to enable efficient sparsity within early ViT layers. Our approach is motivated by the observation that prior features extracted from deeper layers exhibit strong foreground selectivity. Therefore, we propose a fully differentiable module for temporal mapping to accurately select the most relevant patches in early network stages. Notably, the proposed method enables a patch reduction of up to 60% in dense prediction tasks, exceeding the capabilities of conventional image-based patch pruning, which typically operates at around 30% patch sparsity. VPP excels in the high-sparsity regime, sustaining remarkable performance even when patch usage is reduced below 55%. Specifically, it preserves stable results with a maximal performance drop of 0.6% on the YouTube-VIS 2021 dataset.


[81] LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation cs.CV | cs.CLPDF

Patrick Amadeus Irawan, Erland Hilman Fuadi, Shanu Kumar, Alham Fikri Aji, Yova Kementchedjhieva

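TL;DR: This paper proposes LinguDistill, which recovers linguistic ability in vision-language models through selective cross-modal distillation. It introduces no additional modules, uses the original frozen language model as the teacher, and enables vision-conditioned teacher supervision through layer-wise KV-cache sharing.

Details

Motivation: When a pretrained language model is adapted into a vision-language model, representation shift and cross-modal interference degrade its native linguistic capability. Existing recovery methods usually add intermediate alignment layers, which increases architectural complexity and inference-time parameters.

Result: The method recovers about 10% of the performance lost on language and knowledge benchmarks while maintaining comparable performance on vision-heavy tasks.

Insight: The novelty is an adapter-free distillation framework with a layer-wise KV-cache sharing mechanism, which enables cross-modal supervision without modifying the model architecture and offers an efficient solution to modality-specific degradation in multimodal models.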

Abstract: Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such loss is difficult to recover, even with targeted task-specific fine-tuning using standard objectives. Prior recovery approaches typically introduce additional modules that act as intermediate alignment layers to maintain or isolate modality-specific subspaces, which increases architectural complexity, adds parameters at inference time, and limits flexibility across models and settings. We propose LinguDistill, an adapter-free distillation method that restores linguistic capability by utilizing the original frozen LM as a teacher. We overcome the key challenge of enabling vision-conditioned teacher supervision by introducing layer-wise KV-cache sharing, which exposes the teacher to the student’s multimodal representations without modifying the architecture of either model. We then selectively distill the teacher’s strong linguistic signal on language-intensive data to recover language capability, while preserving the student’s visual grounding on multimodal tasks. As a result, LinguDistill recovers $\sim$10% of the performance lost on language and knowledge benchmarks, while maintaining comparable performance on vision-heavy tasks. Our findings demonstrate that linguistic capability can be recovered without additional modules, providing an efficient and practical solution to modality-specific degradation in multimodal models.


[82] Disentangling to Re-couple: Resolving the Similarity-Controllability Paradox in Subject-Driven Text-to-Image Generation cs.CVPDF

Shuang Li, Chao Deng, Hang Chen, Liqun Liu, Zhenyu Hu

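TL;DR: This paper proposes DisCo, a framework that resolves the similarity-controllability paradox in subject-driven text-to-image generation by disentangling and then re-coupling visual and textual information. It first extracts the subject identity from the reference image and simplifies the text prompt to contain only the modification command, then re-couples the two with a reinforcement learning reward signal to achieve high-fidelity subject preservation with precise textual control.

Details

Motivation: The work targets the similarity-controllability paradox in subject-driven text-to-image generation: strengthening textual control reduces subject fidelity, and vice versa. The paradox arises because text prompts send conflicting signals when they describe both the subject and the modifications.

Result: In extensive experiments, the method achieves state-of-the-art performance and generates highly realistic and coherent images.

Insight: The novelty is disentangling subject identity from text instructions and re-coupling them through a reinforcement learning reward, which removes descriptive ambiguity and improves generation quality. The strategy of separating information sources and then re-coupling them dynamically suggests a new approach for multimodal generation tasks.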

Abstract: Subject-Driven Text-to-Image (T2I) Generation aims to preserve a subject’s identity while editing its context based on a text prompt. A core challenge in this task is the “similarity-controllability paradox”, where enhancing textual control often degrades the subject’s fidelity, and vice versa. We argue this paradox stems from the ambiguous role of text prompts, which are often tasked with describing both the subject and the desired modifications, leading to conflicting signals for the model. To resolve this, we propose DisCo, a novel framework that first Disentangles and then re-Couples visual and textual information. First, our textual-visual decoupling module isolates the sources of information: subject identity is extracted exclusively from the reference image with the entity word of the subject, while the text prompt is simplified to contain only the modification command, in which the subject is referred to only by a general pronoun, eliminating descriptive ambiguity. However, this strict separation can lead to unnatural compositions between the subject and its context. We address this by designing a dedicated reward signal and using reinforcement learning to seamlessly re-couple the visually defined subject and the textually generated context. Our approach effectively resolves the paradox, enabling simultaneous high-fidelity subject preservation and precise textual control. Extensive experiments demonstrate that our method achieves state-of-the-art performance, producing highly realistic and coherent images.


[83] MotionGrounder: Grounded Multi-Object Motion Transfer via Diffusion Transformer cs.CVPDF

Samuel Teodoro, Yun Chen, Agus Gunawan, Soo Ye Kim, Jihyong Oh

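TL;DR: MotionGrounder is a Diffusion Transformer (DiT)-based framework that is the first to achieve motion transfer with multi-object controllability. Its Flow-based Motion Signal (FMS) provides a stable motion prior, and its Object-Caption Alignment Loss (OCAL) aligns object captions with their spatial regions, enabling fine-grained control in real multi-object scenes.

Details

Motivation: Existing DiT-based motion transfer methods are limited to single-object videos and cannot provide fine-grained control over multiple objects in real-world scenes, so a framework that handles multi-object motion transfer is needed.

Result: MotionGrounder outperforms existing baselines in quantitative, qualitative, and human evaluations. The proposed Object Grounding Score (OGS) jointly evaluates spatial alignment between source-video objects and their generated counterparts and semantic consistency between each generated object and its target caption.

Insight: The novelties are the Flow-based Motion Signal (FMS) as a stable motion prior and the Object-Caption Alignment Loss (OCAL) for spatially grounding multiple object captions. By combining a motion prior with semantic alignment, the method extends the potential of DiT for multi-object video generation.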

Abstract: Motion transfer enables controllable video generation by transferring temporal dynamics from a reference video to synthesize a new video conditioned on a target caption. However, existing Diffusion Transformer (DiT)-based methods are limited to single-object videos, restricting fine-grained control in real-world scenes with multiple objects. In this work, we introduce MotionGrounder, a DiT-based framework that is the first to handle motion transfer with multi-object controllability. Our Flow-based Motion Signal (FMS) in MotionGrounder provides a stable motion prior for target video generation, while our Object-Caption Alignment Loss (OCAL) grounds object captions to their corresponding spatial regions. We further propose a new Object Grounding Score (OGS), which jointly evaluates (i) spatial alignment between source video objects and their generated counterparts and (ii) semantic consistency between each generated object and its target caption. Our experiments show that MotionGrounder consistently outperforms recent baselines across quantitative, qualitative, and human evaluations.


[84] A 4D Representation for Training-Free Agentic Reasoning from Monocular Laparoscopic Video cs.CVPDF

Maximilian Fehrentz, Nicolas Stellwag, Robert Wiebe, Nicole Thorisch, Fabian Grob

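TL;DR: This paper proposes a framework based on an explicit 4D representation to strengthen the spatiotemporal reasoning of surgical agents on monocular laparoscopic video. It combines point tracking, depth estimation, and segmentation models to build a 4D model with spatiotemporally consistent tool and tissue semantics, and uses a multimodal large language model as an agent that reasons in natural language over the 4D representation (e.g., trajectories) without fine-tuning.

Details

Motivation: Spatiotemporal reasoning is a challenge for AI in soft tissue surgery. Existing 2D vision-language models struggle with the spatial complexity of surgical scenes, so an explicit 4D representation is needed to improve reasoning systems.

Result: Evaluated on a new dataset of 134 clinically relevant questions, the combination of a general-purpose reasoning backbone and the 4D representation significantly improves spatiotemporal understanding and enables 4D grounding.

Insight: The novelty is that spatiotemporal intelligence can be "assembled" from 2D multimodal large language models and 3D computer vision models without additional training. The work offers a training-free way to combine an explicit 4D representation with natural language reasoning, applicable to surgical assistance systems.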

Abstract: Spatiotemporal reasoning is a fundamental capability for artificial intelligence (AI) in soft tissue surgery, paving the way for intelligent assistive systems and autonomous robotics. While 2D vision-language models show increasing promise at understanding surgical video, the spatial complexity of surgical scenes suggests that reasoning systems may benefit from explicit 4D representations. Here, we propose a framework for equipping surgical agents with spatiotemporal tools based on an explicit 4D representation, enabling AI systems to ground their natural language reasoning in both time and 3D space. Leveraging models for point tracking, depth, and segmentation, we develop a coherent 4D model with spatiotemporally consistent tool and tissue semantics. A Multimodal Large Language Model (MLLM) then acts as an agent on tools derived from the explicit 4D representation (e.g., trajectories) without any fine-tuning. We evaluate our method on a new dataset of 134 clinically relevant questions and find that the combination of a general purpose reasoning backbone and our 4D representation significantly improves spatiotemporal understanding and allows for 4D grounding. We demonstrate that spatiotemporal intelligence can be “assembled” from 2D MLLMs and 3D computer vision models without additional training. Code, data, and examples are available at https://tum-ai.github.io/surg4d/


[85] PixelPrune: Pixel-Level Adaptive Visual Token Reduction via Predictive Coding cs.CV | cs.AI | cs.CLPDF

Nan Wang, Zhiwei Jin, Chen Chen, Haonan Lu

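TL;DR: This paper proposes PixelPrune, a training-free method for adaptive visual token reduction in pixel space based on predictive coding, aimed at vision-language models (VLMs) that process high-resolution images such as documents and GUIs. It identifies and prunes redundant image patches (i.e., duplicated patches) before the Vision Transformer (ViT) encoder, accelerating the whole inference pipeline (both the ViT and the downstream LLM).

Details

Motivation: High-value VLM applications such as document understanding and GUI interaction need high-resolution inputs, which produce large numbers of visual tokens and a heavy computational burden. The authors observe that on document and GUI benchmarks only 22-71% of image patches are pixel-unique, the rest being exact duplicates of other patches in the same image, so much of this cost is wasted.

Result: Experiments across three model scales and multiple document and GUI benchmarks show that PixelPrune maintains competitive task accuracy while delivering up to 4.2x inference speedup and 1.9x training acceleration.

Insight: The main novelties are: 1) predictive-coding-based pruning of redundant patches in pixel space, before any neural computation, which is training-free and needs no learnable parameters; 2) support for pixel-lossless compression (τ=0) and controlled lossy compression (τ>0); 3) by reducing the number of visual tokens, acceleration of both the ViT encoder and the downstream LLM, covering the full inference pipeline.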

Abstract: Document understanding and GUI interaction are among the highest-value applications of Vision-Language Models (VLMs), yet they impose an exceptionally heavy computational burden: fine-grained text and small UI elements demand high-resolution inputs that produce tens of thousands of visual tokens. We observe that this cost is largely wasteful: across document and GUI benchmarks, only 22–71% of image patches are pixel-unique, the rest being exact duplicates of another patch in the same image. We propose PixelPrune, which exploits this pixel-level redundancy through predictive-coding-based compression, pruning redundant patches before the Vision Transformer (ViT) encoder. Because it operates in pixel space prior to any neural computation, PixelPrune accelerates both the ViT encoder and the downstream LLM, covering the full inference pipeline. The method is training-free, requires no learnable parameters, and supports pixel-lossless compression ($\tau = 0$) as well as controlled lossy compression ($\tau > 0$). Experiments across three model scales and document and GUI benchmarks show that PixelPrune maintains competitive task accuracy while delivering up to 4.2$\times$ inference speedup and 1.9$\times$ training acceleration. Code is available at https://github.com/OPPO-Mente-Lab/PixelPrune.
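
The pixel-uniqueness observation in this abstract can be checked with a short Python sketch. It is not the authors' predictive-coding method: it only marks exact-duplicate patches in a toy grayscale image stored as nested lists, which loosely corresponds to the lossless case. The function name `unique_patch_mask`, the patch size, and the toy image are assumptions made for illustration.

```python
from typing import List

Image = List[List[int]]  # toy grayscale image: rows of pixel values


def unique_patch_mask(image: Image, patch: int) -> List[bool]:
    """Return, in raster order, True for the first occurrence of each patch
    and False for every later exact duplicate of an earlier patch."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    seen = set()
    keep = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # A patch is identified by its exact pixel content.
            key = tuple(
                tuple(image[r][left:left + patch]) for r in range(top, top + patch)
            )
            keep.append(key not in seen)
            seen.add(key)
    return keep


# A mostly blank 4x8 "document" with a single pixel of ink.
page = [[0] * 8 for _ in range(4)]
page[0][0] = 255
mask = unique_patch_mask(page, patch=2)
unique_ratio = sum(mask) / len(mask)  # fraction of pixel-unique patches
```

On this toy page only the inked patch and the first blank patch are kept, so six of the eight patches would never reach the encoder.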


[86] JAMMEval: A Refined Collection of Japanese Benchmarks for Reliable VLM Evaluation cs.CVPDF

Issa Sugiura, Koki Maeda, Shuhei Kurita, Yusuke Oda, Daisuke Kawahara

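TL;DR: This paper presents JAMMEval, a refined collection of Japanese benchmarks for evaluating vision-language models. It addresses problems in existing Japanese VQA benchmarks, such as ambiguous questions, incorrect answers, and instances solvable without visual grounding, to make evaluation more reliable.

Details

Motivation: Existing Japanese VQA benchmarks have not undergone the iterative refinement that English benchmarks have, which makes evaluation unreliable and distorts model comparisons, so a high-quality Japanese evaluation benchmark is needed.

Result: Open-weight and proprietary VLMs are evaluated on JAMMEval. The refined benchmarks yield scores that better reflect model capability, show lower run-to-run variance, and better distinguish models of different capability levels.

Insight: Seven existing Japanese benchmark datasets are systematically refined through two rounds of human annotation, which markedly improves data quality and evaluation reliability. The work highlights the importance of benchmark data quality for reliable evaluation, and its refinement process can be extended to other languages or task domains.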

Abstract: Reliable evaluation is essential for the development of vision-language models (VLMs). However, Japanese VQA benchmarks have undergone far less iterative refinement than their English counterparts. As a result, many existing benchmarks contain issues such as ambiguous questions, incorrect answers, and instances that can be solved without visual grounding, undermining evaluation reliability and leading to misleading conclusions in model comparisons. To address these limitations, we introduce JAMMEval, a refined collection of Japanese benchmarks for reliable VLM evaluation. It is constructed by systematically refining seven existing Japanese benchmark datasets through two rounds of human annotation, improving both data quality and evaluation reliability. In our experiments, we evaluate open-weight and proprietary VLMs on JAMMEval and analyze the capabilities of recent models on Japanese VQA. We further demonstrate the effectiveness of our refinement by showing that the resulting benchmarks yield evaluation scores that better reflect model capability, exhibit lower run-to-run variance, and improve the ability to distinguish between models of different capability levels. We release our dataset and code to advance reliable evaluation of VLMs.


[87] ProCap: Projection-Aware Captioning for Spatial Augmented Reality cs.CV | cs.MMPDF

Zimo Cao, Yuchen Deng, Haibin Ling, Bingyao Huang

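TL;DR: This paper proposes ProCap, a framework that addresses the semantic confusion between projected content and the physical scene in spatial augmented reality (SAR). It decouples the virtual and physical layers with a two-stage pipeline (visual isolation and region-aware retrieval), and introduces RGBP, the first large-scale SAR semantic benchmark dataset, with dense decoupled annotations for 65 physical scenes and over 180,000 projections.

Details

Motivation: SAR projects digital content directly onto physical scenes, but existing vision-language models (VLMs) struggle to distinguish the semantics of the projected content from those of the physical scene. This virtual-physical ambiguity hinders intelligent interaction in SAR, such as scene reasoning or answering user queries.

Result: Experiments show that ProCap provides a robust semantic foundation for future SAR research. Its effectiveness is validated on the proposed RGBP dataset with a dual-captioning evaluation protocol, which uses task-specific tokens to assess physical scene and projection descriptions independently.

Insight: The novelties are: 1) a two-stage framework that explicitly decouples projections from the physical scene; 2) RGBP, the first large-scale SAR semantic dataset; 3) a dual-captioning evaluation protocol. The region-aware retrieval mechanism and the dense decoupled annotation approach could carry over to vision-language tasks in other mixed reality settings.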

Abstract: Spatial augmented reality (SAR) directly projects digital content onto physical scenes using projectors, creating an immersive experience without head-mounted displays. However, for SAR to support intelligent interaction, such as reasoning about the scene or answering user queries, it must semantically distinguish between the physical scene and the projected content. Standard Vision Language Models (VLMs) struggle with this virtual-physical ambiguity, often confusing the two contexts. To address this issue, we introduce ProCap, a novel framework that explicitly decouples projected content from physical scenes. ProCap employs a two-stage pipeline: first it visually isolates virtual and physical layers via automated segmentation; then it uses region-aware retrieval to avoid ambiguous semantic context due to projection distortion. To support this, we present RGBP (RGB + Projections), the first large-scale SAR semantic benchmark dataset, featuring 65 diverse physical scenes and over 180,000 projections with dense, decoupled annotations. Finally, we establish a dual-captioning evaluation protocol using task-specific tokens to assess physical scene and projection descriptions independently. Our experiments show that ProCap provides a robust semantic foundation for future SAR research. The source code, pre-trained models and the RGBP dataset are available on the project page: https://ZimoCao.github.io/ProCap/.


[88] Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment cs.CV | cs.CLPDF

Zhuchenyang Liu, Yao Zhang, Yu Xiao

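TL;DR: This paper builds the IKEA-Bench benchmark and systematically evaluates 19 vision-language models on aligning assembly instructions across depictions (abstract assembly diagrams versus real video frames). It finds that visual encoding is the main bottleneck for cross-depiction robustness, and that textual information aids understanding but weakens visual alignment.

Details

Motivation: 2D abstract assembly diagrams and real-world video frames differ in visual features (the depiction gap), which makes it hard for intelligent assistants in mixed reality to monitor assembly progress, detect errors, and give step-by-step guidance.

Result: On IKEA-Bench, which contains 1,623 questions across 6 task types on 29 IKEA products, 19 VLMs ranging from 2B to 38B parameters are evaluated. Architecture family predicts alignment accuracy better than parameter count, and video understanding remains a hard bottleneck.

Insight: Through a cross-depiction benchmark and a three-level mechanistic analysis, the paper shows that assembly diagrams and video frames occupy disjoint ViT subspaces and that adding text shifts model reasoning from visually driven to text-driven, pointing to improved visual encoding as the way to increase cross-depiction robustness.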

Abstract: 2D assembly diagrams are often abstract and hard to follow, creating a need for intelligent assistants that can monitor progress, detect errors, and provide step-by-step guidance. In mixed reality settings, such systems must recognize completed and ongoing steps from the camera feed and align them with the diagram instructions. Vision Language Models (VLMs) show promise for this task, but face a depiction gap because assembly diagrams and video frames share few visual features. To systematically assess this gap, we construct IKEA-Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products, and evaluate 19 VLMs (2B-38B) under three alignment strategies. Our key findings: (1) assembly instruction understanding is recoverable via text, but text simultaneously degrades diagram-to-video alignment; (2) architecture family predicts alignment accuracy more strongly than parameter count; (3) video understanding remains a hard bottleneck unaffected by strategy. A three-level mechanistic analysis further reveals that diagrams and video occupy disjoint ViT subspaces, and that adding text shifts models from visual to text-driven reasoning. These results identify visual encoding as the primary target for improving cross-depiction robustness. Project page: https://ryenhails.github.io/IKEA-Bench/


[89] Representation Selection via Cross-Model Agreement using Canonical Correlation Analysis cs.CV | cs.AIPDF

Dylan B. Lewis, Jens Gregor, Hector Santos-Villalobos

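TL;DR: This paper proposes a training-free method based on canonical correlation analysis (CCA) to improve the efficiency of representations from pretrained image encoders. It exploits the shared structure between the output representations of two pretrained encoders to find linear projections for representation selection and dimensionality reduction, retaining shared semantic content and discarding redundant dimensions. Unlike dimensionality reduction techniques that operate on a single embedding space, such as PCA, it uses cross-model agreement to guide representation distillation and refinement. It can reduce representation dimensionality by more than 75% while improving downstream performance, or enhance representations at fixed dimensionality via representation transfer from larger or fine-tuned models.

Details

Motivation: Modern vision pipelines increasingly rely on pretrained image encoders, but their representations are often overcomplete and model-specific, so a training-free method for improving representation efficiency is needed.

Result: On benchmarks including ImageNet-1k, CIFAR-100, and MNIST, the method consistently improves over both baseline and PCA-projected representations, with accuracy gains of up to 12.6% and dimensionality reduction of more than 75%.

Insight: The novelty is using cross-model agreement, via CCA, for representation selection as a principled form of representation distillation. Combining multi-model agreement with a classical statistical method gives a new perspective on model compression and representation optimization.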

Abstract: Modern vision pipelines increasingly rely on pretrained image encoders whose representations are reused across tasks and models, yet these representations are often overcomplete and model-specific. We propose a simple, training-free method to improve the efficiency of image representations via a post-hoc canonical correlation analysis (CCA) operator. By leveraging the shared structure between representations produced by two pre-trained image encoders, our method finds linear projections that serve as a principled form of representation selection and dimensionality reduction, retaining shared semantic content while discarding redundant dimensions. Unlike standard dimensionality reduction techniques such as PCA, which operate on a single embedding space, our approach leverages cross-model agreement to guide representation distillation and refinement. The technique allows representations to be reduced by more than 75% in dimensionality with improved downstream performance, or enhanced at fixed dimensionality via post-hoc representation transfer from larger or fine-tuned models. Empirical results on ImageNet-1k, CIFAR-100, MNIST, and additional benchmarks show consistent improvements over both baseline and PCA-projected representations, with accuracy gains of up to 12.6%.
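
The CCA-based selection described in this abstract can be sketched in a few lines of Python, assuming NumPy is available. This is a textbook CCA computed by whitening and SVD, not the authors' code: the function name `cca_projection`, the regularization constant `eps`, and the synthetic "encoder outputs" are assumptions made for illustration.

```python
import numpy as np


def cca_projection(x: np.ndarray, y: np.ndarray, k: int, eps: float = 1e-6):
    """Find the k directions of x (n, dx) most correlated with y (n, dy).

    Returns the (dx, k) projection matrix for centered x and the
    k canonical correlations, largest first.
    """
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n = x.shape[0]
    # Regularized covariance and cross-covariance estimates.
    cxx = x.T @ x / (n - 1) + eps * np.eye(x.shape[1])
    cyy = y.T @ y / (n - 1) + eps * np.eye(y.shape[1])
    cxy = x.T @ y / (n - 1)

    def inv_sqrt(c: np.ndarray) -> np.ndarray:
        # Inverse matrix square root of a symmetric positive definite matrix.
        vals, vecs = np.linalg.eigh(c)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    wx, wy = inv_sqrt(cxx), inv_sqrt(cyy)
    # Singular values of the whitened cross-covariance are the correlations.
    u, s, _ = np.linalg.svd(wx @ cxy @ wy)
    return wx @ u[:, :k], s[:k]


rng = np.random.default_rng(0)
shared = rng.normal(size=(500, 4))  # content both toy "encoders" capture
x = np.hstack([shared, rng.normal(size=(500, 12))])
y = np.hstack([2.0 * shared[:, ::-1], rng.normal(size=(500, 12))])
proj, corrs = cca_projection(x, y, k=4)
x_reduced = (x - x.mean(axis=0)) @ proj  # 16 -> 4 dimensions
```

In this toy setup the four retained directions recover the content shared by both inputs, while the twelve model-specific noise dimensions are discarded. PCA on `x` alone could not make that distinction, because the noise and the shared content have similar variance.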


[90] Learning Quantised Structure-Preserving Motion Representations for Dance Fingerprinting cs.CV | cs.AIPDF

Arina Kharlamova, Bowei He, Chen Ma, Xue Liu

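TL;DR: This paper presents DANCEMATCH, an end-to-end framework for motion-based dance retrieval, the task of identifying semantically similar choreographies directly from raw video, which the authors call dance fingerprinting. It encodes human poses into compact, discrete motion signatures using skeleton motion quantisation and spatio-temporal transformers, and includes a sub-linear retrieval engine built on a histogram-based index for efficient large-scale retrieval.

Details

Motivation: Existing motion analysis and retrieval methods rely on continuous embeddings that are hard to index, interpret, or scale, so compact, discrete motion signatures that capture the spatio-temporal structure of dance are needed for efficient large-scale retrieval.

Result: Experiments show robust retrieval across diverse dance styles and strong generalisation to unseen choreographies, laying a foundation for scalable motion fingerprinting and quantitative choreographic analysis.

Insight: The novelty is combining skeleton motion quantisation with spatio-temporal transformers to produce discrete signatures over a structured motion vocabulary, together with a histogram-based sub-linear retrieval engine. The authors also release DANCETYPESBENCHMARK, a pose-aligned dataset annotated with quantised motion tokens, to support reproducible research.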

Abstract: We present DANCEMATCH, an end-to-end framework for motion-based dance retrieval, the task of identifying semantically similar choreographies directly from raw video, defined as DANCE FINGERPRINTING. While existing motion analysis and retrieval methods can compare pose sequences, they rely on continuous embeddings that are difficult to index, interpret, or scale. In contrast, DANCEMATCH constructs compact, discrete motion signatures that capture the spatio-temporal structure of dance while enabling efficient large-scale retrieval. Our system integrates Skeleton Motion Quantisation (SMQ) with Spatio-Temporal Transformers (STT) to encode human poses, extracted via Apple CoMotion, into a structured motion vocabulary. We further design DANCE RETRIEVAL ENGINE (DRE), which performs sub-linear retrieval using a histogram-based index followed by re-ranking for refined matching. To facilitate reproducible research, we release DANCETYPESBENCHMARK, a pose-aligned dataset annotated with quantised motion tokens. Experiments demonstrate robust retrieval across diverse dance styles and strong generalisation to unseen choreographies, establishing a foundation for scalable motion fingerprinting and quantitative choreographic analysis.
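
The two-stage retrieval described in this abstract (a histogram-based shortlist followed by re-ranking) can be sketched in Python. This is an illustration only, not the DANCE RETRIEVAL ENGINE: the coarse stage below scans every entry linearly for clarity, whereas the paper's index is sub-linear, and the choice of histogram intersection and edit distance as scoring functions, the function names, and the toy token sequences are all assumptions.

```python
from collections import Counter
from typing import Dict, List, Sequence


def histogram_overlap(a: Counter, b: Counter) -> int:
    """Histogram intersection: tokens shared by a and b, counting multiplicity."""
    return sum((a & b).values())


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, start=1):
        cur = [i]
        for j, tb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ta != tb)))
        prev = cur
    return prev[-1]


def retrieve(query: List[int], database: Dict[str, List[int]], shortlist: int) -> List[str]:
    """Stage 1: shortlist by order-free token histograms.
    Stage 2: re-rank the shortlist by order-aware edit distance."""
    q_hist = Counter(query)
    coarse = sorted(
        database,
        key=lambda name: -histogram_overlap(q_hist, Counter(database[name])),
    )[:shortlist]
    return sorted(coarse, key=lambda name: edit_distance(query, database[name]))


# Toy database of quantised motion-token sequences (hypothetical values).
database = {
    "waltz": [1, 2, 3, 1, 2, 3],
    "waltz_reversed": [3, 2, 1, 3, 2, 1],
    "hiphop": [7, 8, 9, 7, 8, 9],
}
query = [1, 2, 3, 1, 2, 4]
ranking = retrieve(query, database, shortlist=2)
```

The histogram stage cannot tell `waltz` from `waltz_reversed`, since both use the same tokens, which is why an order-aware re-ranking step is needed after the cheap shortlist.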


[91] EmoScene: A Dual-space Dataset for Controllable Affective Image Generation cs.CVPDF

Li He, Longtai Zhang, Wenqiang Zhang, Yan Wang, Lizhe Qi

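TL;DR: This paper presents EmoScene, a large-scale dual-space emotion dataset that targets the difficulty text-to-image diffusion models have in controlling scene semantics and fine-grained affective tone. It contains 1.2M images across more than 300 real-world scene categories, each annotated with discrete emotion labels, continuous VAD values, perceptual descriptors, and textual captions.

Details

Motivation: Current text-to-image models rarely integrate affective and perceptual factors in a unified representation, which limits their ability to synthesize scenes with coherent and nuanced emotional intent.

Result: Multi-space analyses reveal how discrete emotions are distributed in the VAD space and how affect correlates systematically with scene-level perceptual factors. The paper also provides a lightweight reference baseline that injects dual-space controls into a frozen diffusion backbone via shallow cross-attention modulation, serving as a reproducible probe of affect controllability under dual-space supervision.

Insight: The novelty is a dual-space dataset that jointly encodes affective dimensions and perceptual attributes, plus a baseline that achieves affect control through shallow modulation, providing a new data resource and evaluation framework for controllable affective image generation.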

Abstract: Text-to-image diffusion models have achieved high visual fidelity, yet precise control over scene semantics and fine-grained affective tone remains challenging. Human visual affect arises from the rapid integration of contextual meaning, including valence, arousal, and dominance, with perceptual cues such as color harmony, luminance contrast, texture variation, curvature, and spatial layout. However, current text-to-image models rarely represent affective and perceptual factors within a unified representation, which limits their ability to synthesize scenes with coherent and nuanced emotional intent. To address this gap, we construct EmoScene, a large-scale dual-space emotion dataset that jointly encodes affective dimensions and perceptual attributes, with contextual semantics provided as supporting annotations. EmoScene contains 1.2M images across more than three hundred real-world scene categories, each annotated with discrete emotion labels, continuous VAD values, perceptual descriptors and textual captions. Multi-space analyses reveal how discrete emotions occupy the VAD space and how affect systematically correlates with scene-level perceptual factors. To benchmark EmoScene, we provide a lightweight reference baseline that injects dual-space controls into a frozen diffusion backbone via shallow cross-attention modulation, serving as a reproducible probe of affect controllability enabled by dual-space supervision.


[92] YieldSAT: A Multimodal Benchmark Dataset for High-Resolution Crop Yield Prediction cs.CVPDF

Miro Miranda, Deepak Pathak, Patrick Helber, Benjamin Bischke, Hiba Najjar

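TL;DR: This paper introduces YieldSAT, a large, high-quality multimodal dataset for high-resolution crop yield prediction. It covers major crops (corn, rapeseed, soybeans, and wheat) across multiple climate zones in Argentina, Brazil, Uruguay, and Germany, and contains over 12.2 million yield samples at 10 m spatial resolution and 113,555 labeled satellite images, complemented by environmental data. Framing yield prediction as a pixel regression task, the paper compares several deep learning models and data fusion architectures, and explores a domain-informed deep ensemble approach to cope with severe distribution shifts in real-world data.

Details

Motivation: Yield prediction datasets are constrained by high acquisition costs, heterogeneous data quality, and data privacy regulations. Existing datasets are therefore scarce, low in quality, or limited to specific regions or crop types, which hinders scalable data-driven solutions.

Result: By comparing several deep learning models and data fusion architectures, the paper demonstrates the potential of large-scale, high-resolution yield prediction as a pixel regression task and highlights the challenge of severe distribution shifts in real-world data. The proposed domain-informed deep ensemble approach yields significant performance gains.

Insight: The novelty is a large-scale, multi-country, multi-crop, high-resolution multimodal benchmark dataset for yield prediction, with yield prediction formalized as pixel regression. The domain-informed deep ensemble explored for distribution shift offers a useful approach to heterogeneity and uncertainty in agricultural data.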

Abstract: Crop yield prediction requires substantial data to train scalable models. However, creating yield prediction datasets is constrained by high acquisition costs, heterogeneous data quality, and data privacy regulations. Consequently, existing datasets are scarce, low in quality, or limited to regional levels or single crop types, hindering the development of scalable data-driven solutions. In this work, we release YieldSAT, a large, high-quality, and multimodal dataset for high-resolution crop yield prediction. YieldSAT spans various climate zones across multiple countries, including Argentina, Brazil, Uruguay, and Germany, and includes major crop types, including corn, rapeseed, soybeans, and wheat, across 2,173 expert-curated fields. In total, over 12.2 million yield samples are available, each with a spatial resolution of 10 m. Each field is paired with multispectral satellite imagery, resulting in 113,555 labeled satellite images, complemented by auxiliary environmental data. We demonstrate the potential of large-scale and high-resolution crop yield prediction as a pixel regression task by comparing various deep learning models and data fusion architectures. Furthermore, we highlight open challenges arising from severe distribution shifts in the ground truth data under real-world conditions. To mitigate this, we explore a domain-informed Deep Ensemble approach that exhibits significant performance gains. The dataset is available at https://yieldsat.github.io/.


[93] DLWM: Dual Latent World Models enable Holistic Gaussian-centric Pre-training in Autonomous Driving cs.CVPDF

Yiyao Zhu, Ying Xue, Haiming Zhang, Guangfeng Jiang, Wending Zhou

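TL;DR: This paper proposes DLWM (Dual Latent World Models), a new paradigm for holistic Gaussian-centric pre-training in autonomous driving. It uses a two-stage design: the first stage predicts 3D Gaussians from queries through self-supervised reconstruction of multi-view semantic and depth images; the second stage uses fine-grained contextual features to train two separate latent world models for temporal feature learning, one supporting downstream occupancy perception and forecasting and the other supporting motion planning.

Details

Motivation: Existing dense BEV and sparse query models for vision-based autonomous driving have limitations. The paper adopts 3D semantic Gaussians as a comprehensive yet sparse scene representation, aiming for holistic Gaussian-centric pre-training that covers perception, forecasting, and planning.

Result: On the SurroundOcc and nuScenes benchmarks, DLWM shows significant gains in 3D occupancy perception, 4D occupancy forecasting, and motion planning.

Insight: The novelty is the dual latent world model architecture, which decouples Gaussian-flow-guided latent prediction (for perception and forecasting) from ego-planning-guided latent prediction (for motion planning), enabling end-to-end Gaussian-centric representation learning and joint multi-task optimization.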

Abstract: Vision-based autonomous driving has gained much attention due to its low cost and excellent performance. Compared with dense BEV (Bird’s Eye View) or sparse query models, the Gaussian-centric method offers a comprehensive yet sparse representation by describing the scene with 3D semantic Gaussians. In this paper, we introduce DLWM, a novel two-stage paradigm with Dual Latent World Models specifically designed to enable holistic Gaussian-centric pre-training in autonomous driving. In the first stage, DLWM predicts 3D Gaussians from queries through self-supervised reconstruction of multi-view semantic and depth images. In the second stage, equipped with fine-grained contextual features, two latent world models are trained separately for temporal feature learning: Gaussian-flow-guided latent prediction for downstream occupancy perception and forecasting tasks, and ego-planning-guided latent prediction for motion planning. Extensive experiments on the SurroundOcc and nuScenes benchmarks demonstrate that DLWM shows significant performance gains across Gaussian-centric 3D occupancy perception, 4D occupancy forecasting, and motion planning tasks.


[94] ACT Now: Preempting LVLM Hallucinations via Adaptive Context Integration cs.CVPDF

Bei Yan, Yuecong Min, Jie Zhang, Shiguang Shan, Xilin Chen

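TL;DR: This paper proposes Adaptive Context inTegration (ACT), a training-free inference intervention method that mitigates the severe hallucination problem in large vision-language models (LVLMs). It adaptively integrates contextual information through visual context exploration and semantic context aggregation, adapting to changes during generation and correcting information loss.

Details

Motivation: Existing strategies for mitigating LVLM hallucination mainly rely on static, single-step states to enhance visual focus or suppress strong language priors. They neglect dynamic context changes during generation and struggle to correct inherited information loss.

Result: Extensive experiments on diverse LVLMs show that ACT significantly reduces hallucinations and achieves competitive results on both discriminative and generative benchmarks without compromising basic generation capabilities.

Insight: The novelty is a dynamic, adaptive inference intervention framework. It uses spatio-temporal profiling to adaptively amplify the attention heads responsible for visual exploration, and marginalizes over potential semantic queries to aggregate visual evidence, addressing the information loss caused by the discrete nature of token prediction. The result is a robust and highly adaptable solution.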

Abstract: Large Vision-Language Models (LVLMs) frequently suffer from severe hallucination issues. Existing mitigation strategies predominantly rely on isolated, single-step states to enhance visual focus or suppress strong linguistic priors. However, these static approaches neglect dynamic context changes across the generation process and struggle to correct inherited information loss. To address this limitation, we propose Adaptive Context inTegration (ACT), a training-free inference intervention method that mitigates hallucination through the adaptive integration of contextual information. Specifically, we first propose visual context exploration, which leverages spatio-temporal profiling to adaptively amplify attention heads responsible for visual exploration. To further facilitate vision-language alignment, we propose semantic context aggregation that marginalizes potential semantic queries to effectively aggregate visual evidence, thereby resolving the information loss caused by the discrete nature of token prediction. Extensive experiments across diverse LVLMs demonstrate that ACT significantly reduces hallucinations and achieves competitive results on both discriminative and generative benchmarks, acting as a robust and highly adaptable solution without compromising fundamental generation capabilities.


[95] Customizing Large Vision Model-Guided Low-Rank Approximation for Ground-Roll Denoise cs.CVPDF

Jiacheng Liao, Feng Qian, Ziyin Fan, Yongjian Guo

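TL;DR: This paper proposes a training-free framework for ground-roll attenuation that reformulates the task as a semantic-guided signal separation problem. It uses a promptable large vision model to extract high-level semantic priors from seismic gathers, produces a soft mask, and embeds the mask into a mask-conditioned low-rank inverse formulation for spatially adaptive suppression and reflection-preserving reconstruction.

Details

Motivation: Ground-roll is a dominant source of coherent noise in land and vertical seismic profiling data. It severely masks reflection events and degrades subsequent imaging and interpretation. Conventional methods (transform-domain filtering, sparse representation, and deep learning) have limited adaptability, suffer signal leakage, or depend on labeled training data, especially under strong signal-noise overlap.

Result: Extensive experiments on synthetic and field VSP datasets show that the method attenuates ground-roll well while preserving reflection continuity and waveform fidelity, consistently outperforming representative transform-domain filtering and implicit neural representation methods.

Insight: The novelty is recasting ground-roll attenuation as semantic-guided signal separation and using a large vision model, without training, to extract semantic priors that guide the solution of a low-rank inverse problem, achieving physically consistent adaptive signal recovery with no task-specific training or manual annotation.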

Abstract: Ground-roll is a dominant source of coherent noise in land and vertical seismic profiling (VSP) data, severely masking reflection events and degrading subsequent imaging and interpretation. Conventional attenuation methods, including transform-domain filtering, sparse representation, and deep learning, often suffer from limited adaptability, signal leakage, or dependence on labeled training data, especially under strong signal-noise overlap. To address these challenges, we propose a training-free framework that reformulates ground-roll attenuation as a semantic-guided signal separation problem. Specifically, a promptable large vision model is employed to extract high-level semantic priors by converting seismic gathers into visual representations and localizing ground-roll-dominant regions via text or image prompts. The resulting semantic response is transformed into a continuous soft mask, which is embedded into a mask-conditioned low-rank inverse formulation to enable spatially adaptive suppression and reflection-preserving reconstruction. An efficient alternating direction method of multipliers (ADMM)-based solver is further developed to solve the proposed inverse problem, enabling stable and physically consistent signal recovery without requiring task-specific training or manual annotation. Extensive experiments on both synthetic and field VSP datasets demonstrate that the proposed method achieves superior ground-roll attenuation while preserving reflection continuity and waveform fidelity, consistently outperforming representative transform-domain filtering and implicit neural representation methods.


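[96] EgoSim: Egocentric World Simulator for Embodied Interaction Generation cs.CV | cs.AI

Jinkun Hao, Mingda Jia, Ruiyan Wang, Xihui Liu, Ran Yi

TL;DR: This paper proposes EgoSim, a closed-loop egocentric world simulator that generates spatially consistent interaction videos and persistently updates the underlying 3D scene state for continuous simulation. By modeling the 3D scene as an updatable world state, it addresses two problems of existing simulators: structural drift under viewpoint changes caused by a lack of explicit 3D grounding, and the inability to update the world state across multi-stage interactions when the scene is treated as static.

Details

Motivation: Existing egocentric simulators either lack explicit 3D grounding, which causes structural drift under viewpoint changes, or treat the scene as static and cannot update the world state across multi-stage interactions. EgoSim aims to resolve both limitations at once.

Result: Extensive experiments show that EgoSim significantly outperforms existing methods in visual quality, spatial consistency, and generalization to complex scenes and in-the-wild dexterous interactions, while also supporting cross-embodiment transfer to robotic manipulation.

Insight: The main contributions are: 1) modeling the 3D scene as an updatable world state to enable closed-loop simulation; 2) generating embodied interactions with a Geometry-action-aware Observation Simulation model and an Interaction-aware State Updating module; 3) a scalable data pipeline that extracts static point clouds, camera trajectories, and embodiment actions from large-scale monocular egocentric videos; 4) EgoCap, a low-cost data capture system that uses uncalibrated smartphones.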

Abstract: We introduce EgoSim, a closed-loop egocentric world simulator that generates spatially consistent interaction videos and persistently updates the underlying 3D scene state for continuous simulation. Existing egocentric simulators either lack explicit 3D grounding, causing structural drift under viewpoint changes, or treat the scene as static, failing to update world states across multi-stage interactions. EgoSim addresses both limitations by modeling 3D scenes as updatable world states. We generate embodiment interactions via a Geometry-action-aware Observation Simulation model, with spatial consistency from an Interaction-aware State Updating module. To overcome the critical data bottleneck posed by the difficulty in acquiring densely aligned scene-interaction training pairs, we design a scalable pipeline that extracts static point clouds, camera trajectories, and embodiment actions from in-the-wild large-scale monocular egocentric videos. We further introduce EgoCap, a capture system that enables low-cost real-world data collection with uncalibrated smartphones. Extensive experiments demonstrate that EgoSim significantly outperforms existing methods in terms of visual quality, spatial consistency, and generalization to complex scenes and in-the-wild dexterous interactions, while supporting cross-embodiment transfer to robotic manipulation. Code and datasets will be released soon. The project page is at egosimulator.github.io.


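[97] Query-Conditioned Evidential Keyframe Sampling for MLLM-Based Long-Form Video Understanding cs.CV | cs.AI | cs.LG

Yiheng Wang, Lichen Zhu, Yueqian Lin, Yudong Liu, Jingyang Zhang

TL;DR: This paper proposes an evidence-driven keyframe sampling framework grounded in information bottleneck theory. It targets keyframe selection for multimodal large language models (MLLMs), whose long-form video understanding is limited by context length and computational cost. Keyframe selection is formulated as maximizing the conditional mutual information between the selected frames and the query; a decomposed optimization turns this into independent frame-level scoring, so the evidence frames that contribute most to answering the question can be selected efficiently.

Details

Motivation: Existing keyframe sampling methods based on semantic relevance or reinforcement learning either fail to capture evidential clues or suffer from inefficient combinatorial optimization, which limits the use of MLLMs for long-form video understanding.

Result: Experiments on long-form video understanding benchmarks show that the method consistently outperforms prior sampling strategies under strict token budgets while significantly improving training efficiency.

Insight: The novelty is applying information bottleneck theory to keyframe selection, which gives a principled objective (maximizing conditional mutual information). The decomposed optimization and the query-conditioned evidence scoring network enable efficient, interpretable frame-level importance estimation without complex combinatorial search.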

Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance on video question answering, but their application to long-form videos is constrained by limited context length and computational cost, making keyframe sampling essential. Existing approaches typically rely on semantic relevance or reinforcement learning, which either fail to capture evidential clues or suffer from inefficient combinatorial optimization. In this work, we propose an evidence-driven keyframe sampling framework grounded in information bottleneck theory. We formulate keyframe selection as maximizing the conditional mutual information between selected frames and the query, providing a principled objective that reflects each frame’s contribution to answering the question. To make this objective tractable, we exploit its structure to derive a decomposed optimization that reduces subset selection to independent frame-level scoring. We further introduce a query-conditioned evidence scoring network trained with a contrastive objective to estimate evidential importance efficiently. Experiments on long-form video understanding benchmarks show that our method consistently outperforms prior sampling strategies under strict token budgets, while significantly improving training efficiency.
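Once subset selection is reduced to independent frame-level scoring, choosing keyframes amounts to a top-k selection under the frame budget. The following is a minimal Python sketch of that last step only, assuming the per-frame evidence scores have already been produced by some scoring model; the function name `select_keyframes` and the example scores are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def select_keyframes(scores: Sequence[float], budget: int) -> List[int]:
    """Return indices of the `budget` highest-scoring frames, in temporal order."""
    if budget <= 0:
        return []
    # Rank frame indices by score, highest first; ties keep the earlier frame.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Keep the top `budget` frames and restore temporal order for the model.
    return sorted(ranked[:budget])


# Hypothetical evidence scores for eight frames.
frame_scores = [0.05, 0.80, 0.10, 0.75, 0.02, 0.60, 0.90, 0.01]
print(select_keyframes(frame_scores, budget=3))  # [1, 3, 6]
```

Because each frame is scored independently, the cost is one sort over the frames rather than a search over subsets, which is the efficiency argument the abstract makes.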


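[98] PDA: Text-Augmented Defense Framework for Robust Vision-Language Models against Adversarial Image Attacks cs.CV | cs.MM

Jingning Xu, Haochen Luo, Chen Liu

TL;DR: This paper proposes PDA (Paraphrase-Decomposition-Aggregation), a training-free defense framework that uses text augmentation to make vision-language models (VLMs) more robust to diverse adversarial image attacks. At inference time it applies prompt paraphrasing, question decomposition, and consistency aggregation, with no modification of the underlying model and no adversarial training.

Details

Motivation: Existing defenses based on adversarial training are computationally expensive and generalize poorly to unseen attack types. PDA aims to overcome these limitations with a generic, strong, and practical inference-time defense for VLMs.

Result: Experiments on multiple VLM architectures and on benchmarks for visual question answering, classification, and captioning show that PDA consistently improves robustness against various adversarial perturbations while maintaining competitive clean accuracy.

Insight: The novelty is placing the defense entirely at the text/prompt level: combining prompt paraphrasing, question decomposition, and consistency aggregation improves robustness without retraining the model. Instantiating PDA as invariants lowers inference cost while retaining most of the robustness gains, balancing efficiency and robustness.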

Abstract: Vision-language models (VLMs) are vulnerable to adversarial image perturbations. Existing works based on adversarial training against task-specific adversarial examples are computationally expensive and often fail to generalize to unseen attack types. To address these limitations, we introduce Paraphrase-Decomposition-Aggregation (PDA), a training-free defense framework that leverages text augmentation to enhance VLM robustness under diverse adversarial image attacks. PDA performs prompt paraphrasing, question decomposition, and consistency aggregation entirely at test time, thus requiring no modification on the underlying models. To balance robustness and efficiency, we instantiate PDA as invariants that reduce the inference cost while retaining most of its robustness gains. Experiments on multiple VLM architectures and benchmarks for visual question answering, classification, and captioning show that PDA achieves consistent robustness gains against various adversarial perturbations while maintaining competitive clean accuracy, establishing a generic, strong and practical defense framework for VLMs during inference.
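The paraphrase-then-aggregate idea can be illustrated with a small Python sketch. It assumes that aggregation is a plain majority vote over the answers to several prompt paraphrases; the abstract does not specify the aggregation rule, so this choice is an assumption, and question decomposition is omitted. `answer_fn` and `toy_model` are hypothetical stand-ins for a real VLM call.

```python
from collections import Counter
from typing import Callable, List


def aggregate_by_consistency(answer_fn: Callable[[str], str], prompts: List[str]) -> str:
    """Query the model once per prompt variant and return the most common answer."""
    if not prompts:
        raise ValueError("at least one prompt is required")
    answers = [answer_fn(p).strip().lower() for p in prompts]
    # most_common orders equal counts by first occurrence, so the earliest
    # answer wins when two answers are equally frequent.
    return Counter(answers).most_common(1)[0][0]


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for a VLM: one phrasing triggers a wrong answer.
    return "dog" if "animal" in prompt else "cat"


paraphrases = [
    "What is shown in the image?",
    "Which animal appears in the picture?",
    "Describe the main subject of the photo.",
]
print(aggregate_by_consistency(toy_model, paraphrases))  # cat
```

The sketch shows why text-side redundancy can help: a perturbation that flips the answer for one phrasing is outvoted when the other phrasings still agree.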


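[99] Forecasting Motion in the Wild cs.CV

Neerja Thakkar, Shiry Ginosar, Jacob Walker, Jitendra Malik, Joao Carreira

TL;DR: This paper proposes a visual behavior representation based on dense point trajectories for forecasting the motion of non-rigid agents (such as animals) in the wild. It disentangles motion from appearance and uses a diffusion transformer to model unordered sets of trajectories while explicitly handling occlusion, producing coherent forecasts of complex motion patterns. Experiments on a 300-hour dataset of in-the-wild animal video show that the method outperforms existing baselines in category-agnostic, data-efficient forecasting and generalizes to rare species and morphologies.

Details

Motivation: Existing vision systems lack a general representation for motion and behavior and struggle to anticipate the future behavior of agents in the wild. This calls for a structured representation that disentangles motion from appearance and generalizes across diverse non-rigid agents.

Result: On the 300-hour in-the-wild animal video dataset, the method surpasses state-of-the-art baselines on motion forecasting, achieves category-agnostic and data-efficient prediction, and generalizes to rare species and morphologies.

Insight: The novelties are using dense point trajectories as a mid-level visual representation of behavior, and designing a diffusion transformer that models unordered trajectory sets and explicitly reasons about occlusion, which provides a general foundation for predictive visual intelligence in the wild.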

Abstract: Visual intelligence requires anticipating the future behavior of agents, yet vision systems lack a general representation for motion and behavior. We propose dense point trajectories as visual tokens for behavior, a structured mid-level representation that disentangles motion from appearance and generalizes across diverse non-rigid agents, such as animals in-the-wild. Building on this abstraction, we design a diffusion transformer that models unordered sets of trajectories and explicitly reasons about occlusion, enabling coherent forecasts of complex motion patterns. To evaluate at scale, we curate 300 hours of unconstrained animal video with robust shot detection and camera-motion compensation. Experiments show that forecasting trajectory tokens achieves category-agnostic, data-efficient prediction, outperforms state-of-the-art baselines, and generalizes to rare species and morphologies, providing a foundation for predictive visual intelligence in the wild.


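[100] Foundation Model-guided Iteratively Prompting and Pseudo-Labeling for Partially Labeled Medical Image Segmentation cs.CV

Qiaochu Zhao, Wei Wei, David Horowitz, Richard Bakst, Yading Yuan

TL;DR: This paper proposes IPnP, an Iteratively Prompting and Pseudo-labeling framework for the partial-label problem that is common in medical image segmentation because of high annotation cost. A trainable segmentation network (specialist) collaborates with a frozen foundation model (generalist) to iteratively generate and refine pseudo-labels for unlabeled organs, progressively recovering full-organ supervision.

Details

Motivation: Clinical priorities and annotation cost often leave scans with only a subset of organs labeled, which degrades segmentation performance. The paper addresses this partial-label problem.

Result: In a simulated partial-label setting on the public AMOS dataset, IPnP consistently outperforms prior methods and approaches the performance of the fully labeled reference model. On a private, partially labeled dataset of 210 head-and-neck cancer patients, it is shown to be effective in a real-world clinical setting.

Insight: The novelty is having a trainable specialist segmentation network work together with a frozen generalist foundation model, using iterative prompting and pseudo-label generation to exploit partially labeled data and progressively approach fully supervised performance. This offers a new direction for medical image analysis under limited annotation.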

Abstract: Automated medical image segmentation has achieved remarkable progress with fully labeled data. However, site-specific clinical priorities and the high cost of manual annotation often yield scans with only a subset of organs labeled, leading to the partially labeled problem that degrades performance. To address this issue, we propose IPnP, an Iteratively Prompting and Pseudo-labeling framework, for partially labeled medical image segmentation. IPnP iteratively generates and refines pseudo-labels for unlabeled organs through collaboration between a trainable segmentation network (specialist) and a frozen foundation model (generalist), progressively recovering full-organ supervision. On the public dataset AMOS with the simulated partial-label setting, IPnP consistently improves segmentation performance over prior methods and approaches the performance of the fully labeled reference. We further evaluate on a private, partially labeled dataset of 210 head-and-neck cancer patients and demonstrate our effectiveness in real-world clinical settings.


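[101] ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration cs.CV

Fengyuan Yang, Luying Huang, Jiazhi Guan, Quanwei Yang, Dongwei Pan

TL;DR: This paper proposes ONE-SHOT, a parameter-efficient framework for compositional human-environment video generation. By decoupling human dynamics from environmental cues and introducing a new positional embedding and a hybrid context integration mechanism, it enables fine-grained, independent editing of subjects and scenes, supports long-horizon synthesis, and significantly outperforms existing state-of-the-art methods on video synthesis.

Details

Motivation: Video foundation models have advanced human-centric video synthesis, but fine-grained, independent editing of subjects and scenes remains a key challenge. Existing methods add richer environment control through rigid 3D geometric compositions, but they often face a stark trade-off between precise control and generative flexibility, and heavy 3D pre-processing limits practical scalability.

Result: Experiments show that the method significantly outperforms state-of-the-art (SOTA) methods on video synthesis, offering superior structural control and creative diversity.

Insight: The core idea is to factorize generation into disentangled signals: 1) a canonical-space injection mechanism decouples human dynamics from environmental cues via cross-attention; 2) Dynamic-Grounded-RoPE, a positional embedding strategy, establishes spatial correspondences between disparate spatial domains without heuristic 3D alignment; 3) a Hybrid Context Integration mechanism supports minute-level long-horizon synthesis while keeping subject and scene consistent. The method avoids heavy 3D pre-processing, is parameter-efficient, and offers more flexible control.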

Abstract: Recent advances in Video Foundation Models (VFMs) have revolutionized human-centric video synthesis, yet fine-grained and independent editing of subjects and scenes remains a critical challenge. Recent attempts to incorporate richer environment control through rigid 3D geometric compositions often encounter a stark trade-off between precise control and generative flexibility. Furthermore, the heavy 3D pre-processing still limits practical scalability. In this paper, we propose ONE-SHOT, a parameter-efficient framework for compositional human-environment video generation. Our key insight is to factorize the generative process into disentangled signals. Specifically, we introduce a canonical-space injection mechanism that decouples human dynamics from environmental cues via cross-attention. We also propose Dynamic-Grounded-RoPE, a novel positional embedding strategy that establishes spatial correspondences between disparate spatial domains without any heuristic 3D alignments. To support long-horizon synthesis, we introduce a Hybrid Context Integration mechanism to maintain subject and scene consistency across minute-level generations. Experiments demonstrate that our method significantly outperforms state-of-the-art methods, offering superior structural control and creative diversity for video synthesis. Our project page is available at https://martayang.github.io/ONE-SHOT/.


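[102] A global dataset of continuous urban dashcam driving cs.CV

Md Shadab Alam, Olena Bazilinska, Pavlo Bazilinskyy

TL;DR: This paper introduces CROWD, a manually curated global dataset of urban dashcam footage screened and segmented from public YouTube videos. It contains more than 50,000 contiguous, unedited segments of everyday driving, covers 238 countries and territories, and is intended to support research on cross-domain robustness and interaction analysis.

Details

Motivation: Existing driving datasets tend to focus on crashes or edited content and lack everyday continuous driving. CROWD prioritizes routine driving and explicitly excludes crashes and similar content, to better support cross-domain robustness and interaction analysis in autonomous driving.

Result: The dataset contains 51,753 segment records totaling 20,275.56 hours (from 42,032 videos) and covers 7,103 inhabited places worldwide. It provides manual labels for time of day (day or night) and vehicle type, plus machine-generated detections and tracks from YOLOv11x and BoT-SORT covering the 80 MS-COCO classes.

Insight: The novelty is a large-scale, globally distributed dashcam dataset focused on everyday continuous driving, with detailed machine-generated annotations that lower the barrier to benchmarking. Its screening strategy (excluding crashes and edited content) gives a cleaner data source for studying routine driving behavior.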

Abstract: We introduce CROWD (City Road Observations With Dashcams), a manually curated dataset of ordinary, minute-scale, temporally contiguous, unedited, front-facing urban dashcam segments screened and segmented from publicly available YouTube videos. CROWD is designed to support cross-domain robustness and interaction analysis by prioritising routine driving and explicitly excluding crashes, crash aftermath, and other edited or incident-focused content. The release contains 51,753 segment records spanning 20,275.56 hours (42,032 videos), covering 7,103 named inhabited places in 238 countries and territories across all six inhabited continents (Africa, Asia, Europe, North America, South America and Oceania), with segment-level manual labels for time of day (day or night) and vehicle type. To lower the barrier for benchmarking, we provide per-segment CSV files of machine-generated detections for all 80 MS-COCO classes (e.g., person, bicycle, motorcycle, car, bus, truck, traffic light, and stop sign) produced with YOLOv11x, together with segment-local multi-object tracks (BoT-SORT). CROWD is distributed as video identifiers with segment boundaries and derived annotations, enabling reproducible research without redistributing the underlying videos.


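[103] PHASOR: Anatomy- and Phase-Consistent Volumetric Diffusion for CT Virtual Contrast Enhancement cs.CV

Zilong Li, Dongyang Li, Chenglong Ma, Zhan Feng, Dakai Jin

TL;DR: This paper proposes PHASOR, a volumetric diffusion framework for CT virtual contrast enhancement. It treats CT volumes as coherent sequences and leverages a video diffusion model, combined with an anatomy-routed mixture-of-experts module and an intensity-phase aware representation alignment module, to address the weaknesses of existing methods in handling anatomical heterogeneity and spatial misalignment and to generate high-fidelity contrast-enhanced CT images.

Details

Motivation: When synthesizing contrast-enhanced CT, existing virtual contrast enhancement methods produce inconsistent enhancement patterns and incorrect details because of anatomical heterogeneity and spatial misalignment.

Result: Extensive experiments on three datasets show that PHASOR significantly outperforms state-of-the-art methods in both synthesis quality and enhancement accuracy.

Insight: The novelties are modeling CT volumes as sequences to exploit the structural coherence of video diffusion models, and introducing the anatomy-routed mixture-of-experts module and the intensity-phase aware representation alignment module to ensure anatomy- and phase-consistent synthesis. This offers a new approach to structural consistency and detail fidelity in medical image generation.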

Abstract: Contrast-enhanced computed tomography (CECT) is pivotal for highlighting tissue perfusion and vascularity, yet its clinical ubiquity is impeded by the invasive nature of contrast agents and radiation risks. While virtual contrast enhancement (VCE) offers an alternative to synthesizing CECT from non-contrast CT (NCCT), existing methods struggle with anatomical heterogeneity and spatial misalignment, leading to inconsistent enhancement patterns and incorrect details. This paper introduces PHASOR, a volumetric diffusion framework for high-fidelity CT VCE. By treating CT volumes as coherent sequences, we leverage a video diffusion model to enhance structural coherence and volumetric accuracy. To ensure anatomy-phase consistent synthesis, we introduce two complementary modules. First, anatomy-routed mixture-of-experts (AR-MoE) anchors distinct enhancement patterns to anatomical semantics, with organ-specific memory to capture salient details. Second, intensity-phase aware representation alignment (IP-REPA) highlights intricate contrast signals while mitigating the impact of imperfect spatial alignment. Extensive experiments across three datasets demonstrate that PHASOR significantly outperforms state-of-the-art methods in both synthesis quality and enhancement accuracy.


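[104] ReMoGen: Real-time Human Interaction-to-Reaction Generation via Modular Learning from Diverse Data cs.CV | cs.GR

Yaoqin Ye, Yiteng Xu, Qin Sun, Xinge Zhu, Yujing Sun

TL;DR: ReMoGen is a modular learning framework for real-time human interaction-to-reaction generation. To cope with limited and fragmented interaction data, it learns a universal motion prior from large-scale single-person motion data and adapts it to target interaction domains through independently trained Meta-Interaction modules. Combining segment-level generation with a lightweight Frame-wise Segment Refinement module, it delivers low-latency, high-fidelity motion responses during continuous online interaction.

Details

Motivation: In real-world environments an individual's motion is shaped by surrounding agents and the scene. Real-time interaction-to-reaction generation must cope with limited, fragmented interaction data and with the need for low-latency yet high-fidelity responses.

Result: Extensive experiments in human-human, human-scene, and mixed-modality interaction settings show that ReMoGen produces high-quality, coherent, and responsive reactions and generalizes effectively across diverse interaction scenarios.

Insight: The novelties are using a universal motion prior with Meta-Interaction modules for robust generalization under data-scarce, heterogeneous supervision, and combining segment-level generation with frame-level refinement to improve responsiveness and temporal coherence in online interaction without expensive full-sequence inference.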

Abstract: Human behaviors in real-world environments are inherently interactive, with an individual’s motion shaped by surrounding agents and the scene. Such capabilities are essential for applications in virtual avatars, interactive animation, and human-robot collaboration. We target real-time human interaction-to-reaction generation, which generates the ego’s future motion from dynamic multi-source cues, including others’ actions, scene geometry, and optional high-level semantic inputs. This task is fundamentally challenging due to (i) limited and fragmented interaction data distributed across heterogeneous single-person, human-human, and human-scene domains, and (ii) the need to produce low-latency yet high-fidelity motion responses during continuous online interaction. To address these challenges, we propose ReMoGen (Reaction Motion Generation), a modular learning framework for real-time interaction-to-reaction generation. ReMoGen leverages a universal motion prior learned from large-scale single-person motion datasets and adapts it to target interaction domains through independently trained Meta-Interaction modules, enabling robust generalization under data-scarce and heterogeneous supervision. To support responsive online interaction, ReMoGen performs segment-level generation together with a lightweight Frame-wise Segment Refinement module that incorporates newly observed cues at the frame level, improving both responsiveness and temporal coherence without expensive full-sequence inference. Extensive experiments across human-human, human-scene, and mixed-modality interaction settings show that ReMoGen produces high-quality, coherent, and responsive reactions, while generalizing effectively across diverse interaction scenarios.


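[105] ProTPS: Prototype-Guided Text Prompt Selection for Continual Learning cs.CV

Jie Mei, Li-Leng Peng, Keith Fuller, Jenq-Neng Hwang

TL;DR: This paper proposes ProTPS (Prototype-guided Text Prompt Selection), a method for text prompt selection in continual learning that uses vision prototypes to guide the learning of text prompts for each class and thereby mitigate catastrophic forgetting. It is evaluated in the class incremental (CI) and cross-datasets continual (CDC) learning settings. The paper also introduces Marine112, a real-world marine species dataset suited to the class and domain incremental (CDI) setting, and reports better performance than recent state-of-the-art methods in all three settings.

Details

Motivation: Existing text-prompt-based continual learning methods struggle to learn unique text prompts. These prompts implicitly carry the semantic information of new classes and must keep the semantic features of new classes from overlapping with those of already trained classes in order to mitigate catastrophic forgetting.

Result: In the class incremental (CI) and cross-datasets continual (CDC) settings, ProTPS performs close to the upper bounds. On the newly introduced Marine112 dataset (112 marine species with a natural long-tail distribution), ProTPS outperforms recent state-of-the-art methods in the class and domain incremental (CDI) setting.

Insight: The novelties are guiding the selection and learning of text prompts with vision prototypes, which increases training flexibility and encourages unique text prompts, and introducing the real-world Marine112 dataset, which brings a new class and domain incremental learning challenge with a natural long-tail distribution. Viewed objectively, the method combines prototype learning across the visual and text modalities, which may strengthen semantic discrimination between classes.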

Abstract: For continual learning, text-prompt-based methods leverage text encoders and learnable prompts to encode semantic features for classes that arrive sequentially over time. A common challenge encountered by existing works is how to learn unique text prompts, which implicitly carry semantic information of new classes, so that the semantic features of newly arrived classes do not overlap with those of trained classes, thereby mitigating the catastrophic forgetting problem. To address this challenge, we propose a novel approach, Prototype-guided Text Prompt Selection (ProTPS), to intentionally increase the training flexibility, thus encouraging the learning of unique text prompts. Specifically, our ProTPS learns class-specific vision prototypes and text prompts. Vision prototypes guide the selection and learning of text prompts for each class. We first evaluate our ProTPS in both the class incremental (CI) setting and the cross-datasets continual (CDC) learning setting. Because our ProTPS achieves performance close to the upper bounds, we further collect a real-world dataset with 112 marine species collected over a span of six years, named Marine112, to bring new challenges to the community. Marine112 is authentically suited for the class and domain incremental (CDI) learning setting and follows a natural long-tail distribution. The results under three settings show that our ProTPS performs favorably against the recent state-of-the-art methods. The implementation code and Marine112 dataset will be released upon the acceptance of our paper.


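[106] Lightweight Prompt-Guided CLIP Adaptation for Monocular Depth Estimation cs.CV | cs.AI | cs.LG

Reyhaneh Ahani Manghotay, Jie Liang

TL;DR: This paper proposes MoA-DepthCLIP, a parameter-efficient framework that adapts the pretrained CLIP vision-language model to monocular depth estimation. It integrates a lightweight Mixture-of-Adapters module into the pretrained ViT-B/32 backbone and selectively fine-tunes the final layers, achieving spatially-aware adaptation guided by a global semantic context vector.

Details

Motivation: Using the rich semantic features of vision-language models such as CLIP for monocular depth estimation is a promising direction, but existing methods usually require extensive fine-tuning or lack geometric precision. The paper aims to transfer VLM knowledge to fine-grained depth estimation efficiently and with minimal supervision.

Result: On the NYU Depth V2 benchmark, MoA-DepthCLIP achieves competitive results and significantly outperforms the DepthCLIP baseline, raising δ1 accuracy from 0.390 to 0.745 and lowering RMSE from 1.176 to 0.520 while requiring far fewer trainable parameters.

Insight: The novelty is a lightweight, prompt-guided Mixture-of-Adapters module, combined with a hybrid prediction architecture that joins depth bin classification with direct regression and a composite loss function that enforces geometric constraints, which together enable parameter-efficient transfer of VLM knowledge.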

Abstract: Leveraging the rich semantic features of vision-language models (VLMs) like CLIP for monocular depth estimation tasks is a promising direction, yet often requires extensive fine-tuning or lacks geometric precision. We present a parameter-efficient framework, named MoA-DepthCLIP, that adapts pretrained CLIP representations for monocular depth estimation with minimal supervision. Our method integrates a lightweight Mixture-of-Adapters (MoA) module into the pretrained Vision Transformer (ViT-B/32) backbone combined with selective fine-tuning of the final layers. This design enables spatially-aware adaptation, guided by a global semantic context vector and a hybrid prediction architecture that synergizes depth bin classification with direct regression. To enhance structural accuracy, we employ a composite loss function that enforces geometric constraints. On the NYU Depth V2 benchmark, MoA-DepthCLIP achieves competitive results, significantly outperforming the DepthCLIP baseline by improving the $\delta_1$ accuracy from 0.390 to 0.745 and reducing the RMSE from 1.176 to 0.520. These results are achieved while requiring substantially fewer trainable parameters, demonstrating that lightweight, prompt-guided MoA is a highly effective strategy for transferring VLM knowledge to fine-grained monocular depth estimation tasks.
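For readers unfamiliar with the two reported metrics, here is a short Python sketch of how they are usually computed. It assumes the standard definitions used in monocular depth benchmarks: δ1 is the fraction of pixels whose ratio max(pred/gt, gt/pred) is below 1.25, and RMSE is the root mean squared error in depth units. The abstract does not spell these out, so the definitions and the toy depth values below are assumptions for illustration only.

```python
import math
from typing import Sequence


def delta1_accuracy(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Fraction of pixels with max(pred/gt, gt/pred) < 1.25 (depths must be > 0)."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < 1.25)
    return hits / len(gt)


def rmse(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Root mean squared error between predicted and ground-truth depths."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


# Toy per-pixel depths in metres (hypothetical values).
pred_depth = [1.0, 2.0, 3.0, 8.0]
gt_depth = [1.1, 2.0, 4.0, 4.0]
print(delta1_accuracy(pred_depth, gt_depth))  # 0.5
print(rmse(pred_depth, gt_depth))
```

Higher δ1 and lower RMSE are better, which is why the abstract reports one going up and the other going down.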


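[107] ReinDriveGen: Reinforcement Post-Training for Out-of-Distribution Driving Scene Generation cs.CV

Hao Zhang, Lue Fan, Weikang Bian, Zehuan Wu, Lewei Lu

TL;DR: ReinDriveGen is a framework for generating controllable dynamic driving scenes. It lets users freely edit the trajectories of traffic participants to simulate safety-critical scenarios such as collisions, vehicles out of control, and jaywalking pedestrians. The method builds a dynamic 3D point cloud scene from multi-frame LiDAR data, reconstructs full 360° geometry with a vehicle completion module, and renders 2D condition images that guide a video diffusion model to synthesize realistic driving videos. Because edited scenes fall outside the training distribution, it adds an RL-based post-training strategy that uses a pairwise preference model and a reward mechanism to improve generation quality on out-of-distribution scenes without ground-truth supervision.

Details

Motivation: Existing driving scene generation methods lack full controllability over dynamic elements (such as vehicle and pedestrian trajectories) and struggle to generate safety-critical corner cases that lie outside the training distribution.

Result: The method outperforms existing approaches on edited driving scene generation and achieves state-of-the-art (SOTA) results on novel ego viewpoint synthesis.

Insight: The core novelty is combining 3D scene editing (based on point clouds and geometry completion) with 2D video generation (a video diffusion model), and using an RL-based post-training strategy that specifically targets generation quality on out-of-distribution scenes. Together these give highly controllable, high-quality synthesis of dynamic driving scenes.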

Abstract: We present ReinDriveGen, a framework that enables full controllability over dynamic driving scenes, allowing users to freely edit actor trajectories to simulate safety-critical corner cases such as front-vehicle collisions, drifting cars, vehicles spinning out of control, pedestrians jaywalking, and cyclists cutting across lanes. Our approach constructs a dynamic 3D point cloud scene from multi-frame LiDAR data, introduces a vehicle completion module to reconstruct full 360° geometry from partial observations, and renders the edited scene into 2D condition images that guide a video diffusion model to synthesize realistic driving videos. Since such edited scenarios inevitably fall outside the training distribution, we further propose an RL-based post-training strategy with a pairwise preference model and a pairwise reward mechanism, enabling robust quality improvement under out-of-distribution conditions without ground-truth supervision. Extensive experiments demonstrate that ReinDriveGen outperforms existing approaches on edited driving scenarios and achieves state-of-the-art results on novel ego viewpoint synthesis.


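[108] TRACE: High-Fidelity 3D Scene Editing via Tangible Reconstruction and Geometry-Aligned Contextual Video Masking cs.CV

Jiyuan Hu, Zechuan Zhang, Zongxin Yang, Yi Yang

TL;DR: TRACE is a mesh-guided 3D Gaussian Splatting (3DGS) editing framework. By combining explicit 3D geometry with video diffusion, it achieves automated, high-fidelity 3D scene editing and supports fine-grained operations such as local pose adjustment and component replacement while preserving the structural integrity of the subject.

Details

Motivation: Existing 3D scene editing methods fall short in preserving the structural integrity of the subject and in fine-grained local manipulation. TRACE addresses these problems with geometry anchoring and contextual video masking to improve editing precision and consistency.

Result: Extensive experiments on the MV-TRACE dataset show that TRACE significantly outperforms existing methods in editing versatility and structural integrity, reaching SOTA performance.

Insight: The novelties are: 1) MV-TRACE, the first multi-view consistent dataset dedicated to scene-coherent object addition and modification; 2) Tangible Geometry Anchoring (TGA), which precisely synchronizes inserted meshes with the 3DGS scene through two-phase registration; 3) Contextual Video Masking (CVM), which integrates 3D projections into an autoregressive video pipeline for temporally stable rendering.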

Abstract: We present TRACE, a mesh-guided 3DGS editing framework that achieves automated, high-fidelity scene transformation. By anchoring video diffusion with explicit 3D geometry, TRACE uniquely enables fine-grained, part-level manipulation, such as local pose shifting or component replacement, while preserving the structural integrity of the central subject, a capability largely absent in existing editing methods. Our approach comprises three key stages: (1) Multi-view 3D-Anchor Synthesis, which leverages a sparse-view editor trained on our MV-TRACE dataset (the first multi-view consistent dataset dedicated to scene-coherent object addition and modification) to generate spatially consistent 3D-anchors; (2) Tangible Geometry Anchoring (TGA), which ensures precise spatial synchronization between inserted meshes and the 3DGS scene via two-phase registration; and (3) Contextual Video Masking (CVM), which integrates 3D projections into an autoregressive video pipeline to achieve temporally stable, physically-grounded rendering. Extensive experiments demonstrate that TRACE consistently outperforms existing methods, especially in editing versatility and structural integrity.


cs.AI

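[109] How Emotion Shapes the Behavior of LLMs and Agents: A Mechanistic Study cs.AI | cs.CL

Moran Sun, Tianlin Li, Yuwei Zheng, Zhenhong Zhou, Aishan Liu

TL;DR: This paper proposes E-STEER, an interpretable emotion steering framework that embeds emotion as a structured, controllable variable in the hidden states of LLMs and agents, and uses it to study the mechanistic effect of emotion on objective reasoning, subjective generation, safety, and multi-step agent behavior. The study finds non-monotonic emotion-behavior relations consistent with psychological theory; specific emotions not only enhance LLM capability but also improve safety and systematically shape agent behavior.

Details

Motivation: Existing emotion-aware studies mainly treat emotion as a surface-level style factor or a perception target and overlook its mechanistic role in task processing. The paper investigates whether emotional signals can mechanistically influence the behavior of LLMs and agents.

Result: The study reveals non-monotonic emotion-behavior relations consistent with established psychological theories, and shows that specific emotions not only enhance LLM capability but also improve safety and systematically shape multi-step agent behavior.

Insight: The novelty is the E-STEER framework, which enables direct, interpretable emotion intervention at the representation level in LLMs and agents and studies emotion as a structured, controllable variable. This goes beyond treating emotion only as style or a perception target and offers a new way to understand the mechanistic role of emotion in AI systems.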

Abstract: Emotion plays an important role in human cognition and performance. Motivated by this, we investigate whether analogous emotional signals can shape the behavior of large language models (LLMs) and agents. Existing emotion-aware studies mainly treat emotion as a surface-level style factor or a perception target, overlooking its mechanistic role in task processing. To address this limitation, we propose E-STEER, an interpretable emotion steering framework that enables direct representation-level intervention in LLMs and agents. It embeds emotion as a structured, controllable variable in hidden states, and with it, we examine the impact of emotion on objective reasoning, subjective generation, safety, and multi-step agent behaviors. The results reveal non-monotonic emotion-behavior relations consistent with established psychological theories, and show that specific emotions not only enhance LLM capability but also improve safety, and systematically shape multi-step agent behaviors.


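[110] Execution-Verified Reinforcement Learning for Optimization Modeling cs.AI | cs.CL

Runda Guan, Xiangqing Shen, Jiajun Zhang, Yifan Zhang, Jian Cheng

TL;DR: This paper proposes EVOM, an execution-verified reinforcement learning framework for automated optimization modeling. It treats a mathematical programming solver as a deterministic, interactive verifier: the model generates solver-specific code, the code is executed in a sandbox, and the execution outcome is converted into a scalar reward. Closed-loop optimization with GRPO and DAPO requires no process-level supervision and supports cross-solver generalization.

Details

Motivation: Existing LLM-based approaches to automated optimization modeling either rely on high-latency agentic pipelines built on closed-source LLMs, or fine-tune with costly process supervision that tends to overfit to a single solver API. An efficient method is needed that requires no process supervision and generalizes across solvers.

Result: On the NL4OPT, MAMO, IndustryOR, and OptiBench benchmarks with the Gurobi, OR-Tools, and COPT solvers, EVOM matches or outperforms process-supervised SFT, supports zero-shot solver transfer, and adapts to a new solver at low cost by continuing training under the target solver backend.

Insight: The novelty is using the solver as an execution verifier: an outcome-only reward removes the need for process supervision, and cross-solver generalization is achieved by switching the verification environment rather than rebuilding datasets. This gives a scalable and efficient reinforcement learning framework for automated optimization modeling.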

Abstract: Automating optimization modeling with LLMs is a promising path toward scalable decision intelligence, but existing approaches either rely on agentic pipelines built on closed-source LLMs with high inference latency, or fine-tune smaller LLMs using costly process supervision that often overfits to a single solver API. Inspired by reinforcement learning with verifiable rewards, we propose Execution-Verified Optimization Modeling (EVOM), an execution-verified learning framework that treats a mathematical programming solver as a deterministic, interactive verifier. Given a natural-language problem and a target solver, EVOM generates solver-specific code, executes it in a sandboxed harness, and converts execution outcomes into scalar rewards, optimized with GRPO and DAPO in a closed-loop generate-execute-feedback-update process. This outcome-only formulation removes the need for process-level supervision, and enables cross-solver generalization by switching the verification environment rather than reconstructing solver-specific datasets. Experiments on NL4OPT, MAMO, IndustryOR, and OptiBench across Gurobi, OR-Tools, and COPT show that EVOM matches or outperforms process-supervised SFT, supports zero-shot solver transfer, and achieves effective low-cost solver adaptation by continuing training under the target solver backend.
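The outcome-only reward can be pictured with a small Python sketch. It is not the paper's implementation: the `ExecutionOutcome` record, the reward values (1.0, 0.1, 0.0), and the tolerance are hypothetical choices made here for illustration. The second function shows the group-normalized advantage commonly associated with GRPO, where rewards are standardized within one group of sampled completions.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Optional


@dataclass
class ExecutionOutcome:
    """Hypothetical record of one sandboxed run of generated solver code."""
    ran_ok: bool                # False on crash, timeout, or solver error
    objective: Optional[float]  # objective value reported by the solver, if any


def outcome_reward(outcome: ExecutionOutcome, reference: float, rel_tol: float = 1e-4) -> float:
    """Map an execution outcome to a scalar reward (illustrative values)."""
    if not outcome.ran_ok or outcome.objective is None:
        return 0.0
    scale = max(1.0, abs(reference))
    if abs(outcome.objective - reference) <= rel_tol * scale:
        return 1.0
    return 0.1  # assumed small credit: code ran but the objective is wrong


def group_advantages(rewards: List[float]) -> List[float]:
    """Standardize rewards within one group of sampled completions."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0.0:
        return [0.0 for _ in rewards]  # identical rewards carry no signal
    return [(r - mu) / sigma for r in rewards]


# Three sampled programs for one problem whose reference objective is 42.0.
outcomes = [
    ExecutionOutcome(True, 42.0),
    ExecutionOutcome(True, 40.0),
    ExecutionOutcome(False, None),
]
rewards = [outcome_reward(o, reference=42.0) for o in outcomes]
print(rewards)  # [1.0, 0.1, 0.0]
print(group_advantages(rewards))
```

Nothing in the reward depends on how the model was formulated or which solver API was called, only on what the run produced, which is what allows the verification environment to be swapped for another solver.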


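[111] Towards Reliable Truth-Aligned Uncertainty Estimation in Large Language Models cs.AI | cs.CL

Ponhvoan Srey, Quang Minh Nguyen, Xiaobao Wu, Anh Tuan Luu

TL;DR: This paper proposes Truth AnChoring (TAC), a post-hoc calibration method for uncertainty estimation (UE) metrics in large language models (LLMs), which perform unstably and are disconnected from factual correctness. TAC maps raw UE scores to truth-aligned scores to make uncertainty estimation more reliable.

Details

Motivation: Most existing uncertainty estimation (UE) metrics are derived from model behavior rather than from the factual correctness of the output. As a result their performance is unstable across configurations (proxy failure), and they lose discriminative power in low-information regimes, which limits their practical use.

Result: Even with noisy and few-shot supervision, TAC learns well-calibrated uncertainty estimates, and the paper provides a practical calibration protocol.

Insight: The novelty is exposing the limitations of treating heuristic UE metrics as direct indicators of truth uncertainty, and proposing a general post-hoc calibration method that aligns UE scores with factual correctness, a necessary step toward more reliable uncertainty estimation for LLMs.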

Abstract: Uncertainty estimation (UE) aims to detect hallucinated outputs of large language models (LLMs) to improve their reliability. However, UE metrics often exhibit unstable performance across configurations, which significantly limits their applicability. In this work, we formalise this phenomenon as proxy failure, since most UE metrics originate from model behaviour, rather than being explicitly grounded in the factual correctness of LLM outputs. With this, we show that UE metrics become non-discriminative precisely in low-information regimes. To alleviate this, we propose Truth AnChoring (TAC), a post-hoc calibration method to remedy UE metrics, by mapping the raw scores to truth-aligned scores. Even with noisy and few-shot supervision, our TAC can support the learning of well-calibrated uncertainty estimates, and presents a practical calibration protocol. Our findings highlight the limitations of treating heuristic UE metrics as direct indicators of truth uncertainty, and position our TAC as a necessary step toward more reliable uncertainty estimation for LLMs. The code repository is available at https://github.com/ponhvoan/TruthAnchor/.
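The abstract does not describe how TAC maps raw scores to truth-aligned ones, so the Python sketch below does not reproduce it. It only illustrates the general idea of post-hoc calibration with a different, well-known technique, histogram binning: raw uncertainty scores are bucketed, and each bucket is replaced by the empirical error rate observed on a small labeled set. All names and numbers are hypothetical.

```python
from typing import List, Sequence


def fit_histogram_calibrator(
    scores: Sequence[float], correct: Sequence[int], n_bins: int = 5
) -> List[float]:
    """Per-bin empirical error rate for raw uncertainty scores in [0, 1]."""
    totals = [0] * n_bins
    errors = [0] * n_bins
    for s, c in zip(scores, correct):
        b = min(int(s * n_bins), n_bins - 1)
        totals[b] += 1
        errors[b] += 1 - c  # c is 1 if the answer was factually correct
    # Empty bins fall back to the bin midpoint.
    return [
        errors[b] / totals[b] if totals[b] else (b + 0.5) / n_bins
        for b in range(n_bins)
    ]


def calibrate(score: float, table: List[float]) -> float:
    """Look up the calibrated uncertainty for a raw score."""
    n_bins = len(table)
    return table[min(int(score * n_bins), n_bins - 1)]


# Hypothetical labeled calibration set: raw scores and correctness flags.
raw = [0.10, 0.15, 0.50, 0.55, 0.90, 0.95]
is_correct = [1, 1, 1, 0, 0, 0]
table = fit_histogram_calibrator(raw, is_correct, n_bins=2)
print(table)                  # [0.0, 0.75]
print(calibrate(0.7, table))  # 0.75
```

The point of the illustration is that the calibrated value is anchored to observed correctness rather than to model behavior alone, which is the gap the abstract calls proxy failure.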


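[112] Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents cs.AI | cs.CL | cs.SE

Thanh Luong Tuan

TL;DR: This paper proposes a neurosymbolic architecture for enterprise agents that places formal semantic constraints on LLM-based agents through a three-layer ontology (Role, Domain, Interaction), addressing hallucination, domain drift, and compliance problems in enterprise use of LLMs. The architecture is implemented on the FAOS platform, and cross-industry experiments show significant gains in accuracy, compliance, and role consistency.

Details

Motivation: Enterprise use of large language models is held back by hallucination, domain drift, and the difficulty of enforcing regulatory compliance at the reasoning level.

Result: Across 600 experimental runs in five industries (FinTech, Insurance, Healthcare, Vietnamese Banking, and Vietnamese Insurance), ontology-constrained agents significantly outperform unconstrained agents on Metric Accuracy (p < .001, W = .460), Regulatory Compliance (p = .003, W = .318), and Role Consistency (p < .001, W = .614). The gains are largest in the Vietnam-localized domains, where LLM parametric knowledge is weakest.

Insight: The novelties include a three-layer enterprise ontology model, a taxonomy of neurosymbolic coupling patterns, ontology-constrained tool discovery via SQL-pushdown scoring, a framework for output-side ontological validation, and the empirical finding that the value of ontological constraints is inversely related to how well the LLM's training data covers the domain (the inverse parametric knowledge effect).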

Abstract: Enterprise adoption of Large Language Models (LLMs) is constrained by hallucination, domain drift, and the inability to enforce regulatory compliance at the reasoning level. We present a neurosymbolic architecture implemented within the Foundation AgenticOS (FAOS) platform that addresses these limitations through ontology-constrained neural reasoning. Our approach introduces a three-layer ontological framework (Role, Domain, and Interaction ontologies) that provides formal semantic grounding for LLM-based enterprise agents. We formalize the concept of asymmetric neurosymbolic coupling, wherein symbolic ontological knowledge constrains agent inputs (context assembly, tool discovery, governance thresholds) while proposing mechanisms for extending this coupling to constrain agent outputs (response validation, reasoning verification, compliance checking). We evaluate the architecture through a controlled experiment (600 runs across five industries: FinTech, Insurance, Healthcare, Vietnamese Banking, and Vietnamese Insurance), finding that ontology-coupled agents significantly outperform ungrounded agents on Metric Accuracy (p < .001, W = .460), Regulatory Compliance (p = .003, W = .318), and Role Consistency (p < .001, W = .614), with improvements greatest where LLM parametric knowledge is weakest, particularly in Vietnam-localized domains. Our contributions include: (1) a formal three-layer enterprise ontology model, (2) a taxonomy of neurosymbolic coupling patterns, (3) ontology-constrained tool discovery via SQL-pushdown scoring, (4) a proposed framework for output-side ontological validation, (5) empirical evidence for the inverse parametric knowledge effect that ontological grounding value is inversely proportional to LLM training data coverage of the domain, and (6) a production system serving 21 industry verticals with 650+ agents.


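[113] Beyond Symbolic Solving: Multi Chain-of-Thought Voting for Geometric Reasoning in Large Language Models cs.AI | cs.CL | cs.CV

Md. Abu Bakor Siddique, Shahrin Hossain, Sadman Ahmed Siam, Syed Rifat Raiyan, Hasan Mahmud

TL;DR: This paper proposes MARS-GPS, which generates multiple parallel reasoning rollouts (augmented with Python code execution for numerical verification), ranks them using token-level entropy as a confidence signal, and aggregates answers through a multi-stage voting and self-verification pipeline, addressing the weak logical inference of large language models in geometric problem solving.

Details

Motivation: Existing geometric problem solving methods focus mainly on synchronising diagrams with text and on symbolic manipulation, while logical inference remains underdeveloped and is usually limited to a single chain-of-thought; this paper aims to close that gap.

Result: On the Geometry3K benchmark, MARS-GPS with 8 parallel rollouts reaches 88.8% accuracy, nearly 11% above the prior state of the art, and accuracy rises consistently as the number of rollouts grows from 1 to 16 (+6.0% on the ablation subset).

Insight: The novelty lies in voting over multiple parallel chains of thought, verifying with code execution, and ranking by entropy as a confidence signal; viewed objectively, the multi-stage voting and self-verification pipeline is an effective strategy for improving reasoning reliability.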

Abstract: Geometric Problem Solving (GPS) remains at the heart of enhancing mathematical reasoning in large language models because it requires the combination of diagrammatic understanding, symbolic manipulation and logical inference. In existing literature, researchers have chiefly focused on synchronising the diagram descriptions with text literals and solving the problem. In this vein, they have taken a neural, symbolic, or neuro-symbolic approach. But this addresses only the first two of the requirements, namely diagrammatic understanding and symbolic manipulation, while leaving logical inference underdeveloped. The logical inference is often limited to one chain-of-thought (CoT). To address this weakness in existing models, this paper proposes MARS-GPS, which generates multiple parallel reasoning rollouts augmented with Python code execution for numerical verification, ranks them using token-level entropy as a confidence signal, and aggregates answers through a multi-stage voting and self-verification pipeline. Empirical results show that MARS-GPS with 8 parallel rollouts achieves 88.8% on Geometry3K, a nearly +11% improvement over the prior state-of-the-art, with accuracy scaling consistently as the number of rollouts increases from 1 to 16 (+6.0% on ablation subset). We provide our code and data in an anonymous repository: https://anonymous.4open.science/r/MARS-GPS-DE55.
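The ranking-and-voting idea in this abstract can be illustrated with a small sketch. It is not the MARS-GPS pipeline: the function names, the `keep` cutoff, and the weight `1 / (1 + entropy)` are assumptions made for illustration, and the code-execution and self-verification stages are omitted. Each rollout is assumed to arrive as a final answer plus the per-token probability distributions the model produced.

```python
import math
from collections import defaultdict


def mean_token_entropy(token_dists):
    """Average Shannon entropy (in nats) over a rollout's per-token distributions."""
    total = 0.0
    for dist in token_dists:
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_dists)


def entropy_weighted_vote(rollouts, keep=4):
    """Pick an answer from (answer, token_dists) pairs.

    Hypothetical scheme: keep the `keep` lowest-entropy rollouts, then let
    each surviving rollout vote with weight 1 / (1 + entropy).
    """
    scored = sorted(
        ((mean_token_entropy(dists), answer) for answer, dists in rollouts),
        key=lambda pair: pair[0],
    )
    weights = defaultdict(float)
    for entropy, answer in scored[:keep]:
        weights[answer] += 1.0 / (1.0 + entropy)
    return max(weights, key=weights.get)
```

Lower mean entropy is treated here as higher confidence, so confident rollouts both survive the cutoff and carry more weight in the vote.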


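[114] HippoCamp: Benchmarking Contextual Agents on Personal Computers cs.AI | cs.CV

Zhe Yang, Shulin Tian, Kairui Hu, Shuai Liu, Hoang-Nhat Nguyen

TL;DR: This paper introduces HippoCamp, a benchmark for evaluating agents on multimodal file management. It builds device-scale file systems from real-world user profiles, comprising 42.4 GB of data across more than 2,000 multimodal files, and constructs 581 QA pairs to test agents' search, evidence perception, and multi-step reasoning. Experiments show that even the most advanced commercial models reach only 48.3% accuracy on user profiling, with clear weaknesses in long-horizon retrieval and cross-modal reasoning.

Details

Motivation: Existing agent benchmarks focus on web interaction, tool use, or software automation in generic settings and do not evaluate multimodal file management in user-centric personal computing environments. HippoCamp aims to fill this gap by assessing how well agents model user profiles and perform context-aware reasoning in personalized environments.

Result: A range of state-of-the-art multimodal large language models (MLLMs) and agentic methods were evaluated on HippoCamp. The best-performing commercial model reaches only 48.3% accuracy on user profiling and struggles with long-horizon retrieval and cross-modal reasoning within dense personal file systems. The benchmark also provides 46.1K densely annotated structured trajectories for step-wise failure diagnosis.

Insight: The novelty lies in building the first benchmark dedicated to multimodal file management in personal computing environments, with device-scale file systems created from real-world user profiles. The study finds that multimodal perception and evidence grounding are the main bottlenecks for current agents, which points to an important direction for next-generation personal AI assistants. The step-wise failure diagnosis also offers a fine-grained tool for analyzing model capabilities.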

Abstract: We present HippoCamp, a new benchmark designed to evaluate agents’ capabilities on multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments to model individual user profiles and search massive personal files for context-aware reasoning. Our benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across over 2K real-world files. Building upon the raw files, we construct 581 QA pairs to assess agents’ capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide 46.1K densely annotated structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp. Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve only 48.3% accuracy in user profiling, struggling particularly with long-horizon retrieval and cross-modal reasoning within dense personal file systems. Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.


cs.SE

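[115] Revision or Re-Solving? Decomposing Second-Pass Gains in Multi-LLM Pipelines cs.SE | cs.AI | cs.CL

Jingjie Ning, Xueqi Li, Chengyu Yu

TL;DR: Through a controlled decomposition experiment, this paper questions the common assumption that gains in multi-LLM revision pipelines come mainly from error correction. It separates the gains into three additive components (re-solving, scaffold, and content) and examines, on multiple-choice and code generation tasks, how task structure, draft quality, and the type of draft information affect those gains.

Details

Motivation: The work investigates where the gains in multi-LLM revision pipelines (in which a second model reviews and improves a draft produced by a first) actually come from, challenging the assumption that they stem mainly from error correction and aiming for a finer-grained understanding of the revision mechanism.

Result: Two model pairs were evaluated on three benchmarks spanning knowledge-intensive multiple-choice questions (MCQ) and competitive programming. On MCQ tasks, most gains are consistent with re-solving by the stronger model, and routing queries directly to the stronger model can be more effective than revising a weak draft. On code generation tasks, two-stage prompting remains useful because even semantically null drafts provide substantial structural scaffolding, while weak draft content can be harmful.

Insight: The novelty lies in decomposing multi-LLM revision gains into additive components, showing that the gains are not monolithic but are dynamically bottlenecked by task structure and draft quality, which calls for more targeted pipeline designs rather than blanket revision strategies. Viewed objectively, the method offers a new perspective on how LLMs collaborate and can help make multi-model pipelines more efficient.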

Abstract: Multi-LLM revision pipelines, in which a second model reviews and improves a draft produced by a first, are widely assumed to derive their gains from genuine error correction. We question this assumption with a controlled decomposition experiment that uses four matched conditions to separate second-pass gains into three additive components: re-solving, scaffold, and content. We evaluate this design across two model pairs on three benchmarks spanning knowledge-intensive MCQ and competitive programming. Our results show that the gains of multi-LLM revision are not monolithic, but depend on task structure, draft quality, and the type of draft information. On MCQ tasks, where the answer space is constrained and drafts provide little structural guidance, most gains are consistent with stronger-model re-solving, and directly routing queries to the stronger model can be more effective than revising a weak draft. On code generation tasks, however, two-stage prompting remains useful because even semantically null drafts can provide substantial structural scaffolding, while weak draft content can be harmful. Finally, role-reversed experiments show that strong drafts clearly benefit weak reviewers. Ultimately, our findings demonstrate that the utility of multi-LLM revision is dynamically bottlenecked by task structure and draft quality, necessitating more targeted pipeline designs rather than blanket revision strategies.


cs.HC

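[116] True (VIS) Lies: Analyzing How Generative AI Recognizes Intentionality, Rhetoric, and Misleadingness in Visualization Lies cs.HC | cs.CL | cs.CV

Graziano Blasilli, Marco Angelini

TL;DR: This study evaluates the ability of 16 multimodal large language models (LLMs) to identify and explain misleading visualizations, focusing on their understanding of visualization rhetoric, authorial intent, and misleadingness. Using 2,336 COVID-19-related tweets (half of which contain misleading visualizations) and real-world cases from the VisLies community, it runs experiments around three research questions and compares the results with the judgments of visualization experts.

Details

Motivation: To determine whether multimodal LLMs can effectively recognize the rhetorical techniques, authorial intent, and potential deceptiveness in misleading visualizations, and to understand how their judgments agree with and differ from those of human experts.

Result: Performance varies across the 16 state-of-the-art models evaluated (15 open-weight models plus GPT-5.4). Comparison with a human study of visualization experts reveals where LLM and human judgments agree and where they diverge in recognizing intent and rhetorical techniques.

Insight: The novelty lies in using visualization rhetoric theory and a taxonomy of authorial intents as the analytical framework to systematically evaluate the capability boundaries of multimodal LLMs on a complex information-interpretation task, providing empirical evidence on the cognitive differences between AI and humans in detecting visual deception.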

Abstract: This study investigates the ability of multimodal Large Language Models (LLMs) to identify and interpret misleading visualizations, and recognize these observations along with their underlying causes and potential intentionality. Our analysis leverages concepts from visualization rhetoric and a newly developed taxonomy of authorial intents as explanatory lenses. We formulated three research questions and addressed them experimentally using a dataset of 2,336 COVID-19-related tweets, half of which contain misleading visualizations, and supplemented it with real-world examples of perceptual, cognitive, and conceptual errors drawn from VisLies, the IEEE VIS community event dedicated to showcasing deceptive and misleading visualizations. To ensure broad coverage of the current LLM landscape, we evaluated 16 state-of-the-art models. Among them, 15 are open-weight models, spanning a wide range of model sizes, architectural families, and reasoning capabilities. The selection comprises small models, namely Nemotron-Nano-V2-VL (12B parameters), Mistral-Small-3.2 (24B), DeepSeek-VL2 (27B), Gemma3 (27B), and GTA1 (32B); medium-sized models, namely Qianfan-VL (70B), Molmo (72B), GLM-4.5V (108B), LLaVA-NeXT (110B), and Pixtral-Large (124B); and large models, namely Qwen3-VL (235B), InternVL3.5 (241B), Step3 (321B), Llama-4-Maverick (400B), and Kimi-K2.5 (1000B). In addition, we employed OpenAI GPT-5.4, a frontier proprietary model. To establish a human perspective on these tasks, we also conducted a user study with visualization experts to assess how people perceive rhetorical techniques and the authorial intentions behind the same misleading visualizations. This allows comparison between model and expert behavior, revealing similarities and differences that provide insights into where LLMs align with human judgment and where they diverge.


cs.RO

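[117] Generalizable Dense Reward for Long-Horizon Robotic Tasks cs.RO | cs.CV | cs.LG

Silong Yong, Stephen Sheng, Carl Qi, Xiaojie Wang, Evan Sheehan

TL;DR: This paper proposes VLLR, a dense reward framework that combines an extrinsic reward from large language models (LLMs) and vision-language models (VLMs) with an intrinsic reward based on policy self-certainty. It is used to fine-tune robotic foundation policies with reinforcement learning, addressing distribution shift and error accumulation in long-horizon tasks.

Details

Motivation: Existing robotic foundation policies are trained primarily via large-scale imitation learning and suffer from distribution shift and error accumulation on long-horizon tasks; reinforcement learning fine-tuning requires manually designed reward functions and generalizes poorly across diverse tasks.

Result: On the CHORES benchmark covering mobile manipulation and navigation, VLLR achieves up to 56% absolute success rate gains over the pretrained policy; over state-of-the-art RL fine-tuning methods it gains up to 5% on in-distribution tasks and up to 10% on out-of-distribution tasks.

Insight: The novelty lies in using LLMs for task decomposition and VLMs for progress estimation to initialize the value function (avoiding high inference cost throughout training), combined with policy self-certainty as a per-step intrinsic reward. The two are complementary: VLM-based initialization mainly improves task completion efficiency, while self-certainty mainly improves success rates, particularly on out-of-distribution tasks.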

Abstract: Existing robotic foundation policies are trained primarily via large-scale imitation learning. While such models demonstrate strong capabilities, they often struggle with long-horizon tasks due to distribution shift and error accumulation. Although reinforcement learning (RL) can finetune these models, it cannot work well across diverse tasks without manual reward engineering. We propose VLLR, a dense reward framework combining (1) an extrinsic reward from Large Language Models (LLMs) and Vision-Language Models (VLMs) for task progress recognition, and (2) an intrinsic reward based on policy self-certainty. VLLR uses LLMs to decompose tasks into verifiable subtasks and then VLMs to estimate progress to initialize the value function for a brief warm-up phase, avoiding prohibitive inference cost during full training; self-certainty provides per-step intrinsic guidance throughout PPO finetuning. Ablation studies reveal complementary benefits: VLM-based value initialization primarily improves task completion efficiency, while self-certainty primarily enhances success rates, particularly on out-of-distribution tasks. On the CHORES benchmark covering mobile manipulation and navigation, VLLR achieves up to 56% absolute success rate gains over the pretrained policy, up to 5% gains over state-of-the-art RL finetuning methods on in-distribution tasks, and up to 10% gains on out-of-distribution tasks, all without manual reward engineering. Additional visualizations can be found at https://silongyong.github.io/vllr_project_page/


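[118] A Dual-Stream Transformer Architecture for Illumination-Invariant TIR-LiDAR Person Tracking cs.RO | cs.CV

Yuki Minase, Kanji Tanaka

TL;DR: This paper proposes a novel Thermal-Infrared and Depth (TIR-D) dual-modality person tracking architecture for robust tracking by autonomous mobile robots under adverse illumination, such as total darkness or intense backlighting. The architecture uses the sensor suite commonly found on SLAM-capable robots (LiDAR and TIR cameras) and addresses the scarcity of annotated TIR-D data with a sequential knowledge transfer strategy and a fine-grained differential learning rate strategy, enabling rapid adaptation to geometric depth cues.

Details

Motivation: RGB-D tracking degrades severely under challenging illumination, yet autonomous robots need robust all-weather tracking. Combining thermal-infrared (TIR) and LiDAR sensors provides illumination-invariant perception but faces a scarcity of annotated TIR-D multi-modal data.

Result: In experiments, the proposed TIR-D tracker achieves an Average Overlap (AO) of 0.700 and a Success Rate (SR) of 58.7%, significantly outperforming conventional RGB-transfer and single-modality baselines.

Insight: The novelty lies in a sequential knowledge transfer strategy for TIR-D dual-modality tracking together with a fine-grained differential learning rate strategy that preserves pre-trained feature extraction capabilities while adapting quickly to the new modality (depth), offering a practical and resource-efficient answer to multi-modal data scarcity.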

Abstract: Robust person tracking is a critical capability for autonomous mobile robots operating in diverse and unpredictable environments. While RGB-D tracking has shown high precision, its performance severely degrades under challenging illumination conditions, such as total darkness or intense backlighting. To achieve all-weather robustness, this paper proposes a novel Thermal-Infrared and Depth (TIR-D) tracking architecture that leverages the standard sensor suite of SLAM-capable robots, namely LiDAR and TIR cameras. A major challenge in TIR-D tracking is the scarcity of annotated multi-modal datasets. To address this, we introduce a sequential knowledge transfer strategy that evolves structural priors from a large-scale thermal-trained model into the TIR-D domain. By employing a differential learning rate strategy (referred to as the "Fine-grained Differential Learning Rate Strategy"), we effectively preserve pre-trained feature extraction capabilities while enabling rapid adaptation to geometric depth cues. Experimental results demonstrate that our proposed TIR-D tracker achieves superior performance, with an Average Overlap (AO) of 0.700 and a Success Rate (SR) of 58.7%, significantly outperforming conventional RGB-transfer and single-modality baselines. Our approach provides a practical and resource-efficient solution for robust human-following in all-weather robotics applications.


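[119] Learning Humanoid Navigation from Human Data cs.RO | cs.AI | cs.CV | cs.LG

Weizhuo Wang, Yanjie Ze, C. Karen Liu, Monroe Kennedy

TL;DR: This paper presents EgoNav, a system that lets a humanoid robot navigate diverse, unseen environments after learning from only 5 hours of human walking data, with no robot data or fine-tuning. A diffusion model predicts a distribution over future trajectories, conditioned on a 360-degree visual memory and video features from a frozen DINOv3 backbone; a hybrid sampling scheme enables real-time inference, and a receding-horizon controller selects the path.

Details

Motivation: To address humanoid navigation in unseen environments by learning from human data, avoiding reliance on robot-specific data or complex tuning and enabling zero-shot deployment.

Result: In offline evaluations, EgoNav outperforms baselines in collision avoidance and multi-modal coverage. Deployed zero-shot on a Unitree G1 humanoid in unseen indoor and outdoor environments, it exhibits natural behaviors such as waiting for doors to open, navigating around crowds, and avoiding glass walls.

Insight: Innovations include training a diffusion model for trajectory prediction on human data alone, combined with a multi-modal visual memory and frozen-backbone features; hybrid sampling for efficient inference; and strong zero-shot generalization, with behaviors emerging naturally from the learned prior.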

Abstract: We present EgoNav, a system that enables a humanoid robot to traverse diverse, unseen environments by learning entirely from 5 hours of human walking data, with no robot data or finetuning. A diffusion model predicts distributions of plausible future trajectories conditioned on past trajectory, a 360 deg visual memory fusing color, depth, and semantics, and video features from a frozen DINOv3 backbone that capture appearance cues invisible to depth sensors. A hybrid sampling scheme achieves real-time inference in 10 denoising steps, and a receding-horizon controller selects paths from the predicted distribution. We validate EgoNav through offline evaluations, where it outperforms baselines in collision avoidance and multi-modal coverage, and through zero-shot deployment on a Unitree G1 humanoid across unseen indoor and outdoor environments. Behaviors such as waiting for doors to open, navigating around crowds, and avoiding glass walls emerge naturally from the learned prior. We will release the dataset and trained models. Our website: https://egonav.weizhuowang.com


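[120] LiPS: Lightweight Panoptic Segmentation for Resource-Constrained Robotics cs.RO | cs.CV

Calvin Galagain, Martyna Poreba, François Goulette, Cyrill Stachniss

TL;DR: This paper proposes LiPS, a lightweight panoptic segmentation method aimed at the challenge of deploying complex panoptic segmentation models on resource-constrained robotic platforms such as mobile robots. By streamlining the feature extraction and fusion pathway while retaining query-based decoding, it substantially lowers computational demands and balances efficiency with accuracy.

Details

Motivation: State-of-the-art panoptic segmentation models are too complex to deploy on resource-constrained robotic platforms, so a lightweight and efficient method is needed to unify semantic understanding with object-level reasoning.

Result: On standard benchmarks, LiPS attains accuracy comparable to much heavier baselines while delivering up to 4.5 times higher throughput (measured in frames per second) and requiring nearly 6.8 times fewer computations.

Insight: The novelty lies in a lightweight design whose streamlined feature extraction and fusion pathway keeps the benefits of query-based decoding while sharply reducing computational complexity, giving a workable option for real robotic applications.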

Abstract: Panoptic segmentation is a key enabler for robotic perception, as it unifies semantic understanding with object-level reasoning. However, the increasing complexity of state-of-the-art models makes them unsuitable for deployment on resource-constrained platforms such as mobile robots. We propose a novel approach called LiPS that addresses the challenge of efficient-to-compute panoptic segmentation with a lightweight design that retains query-based decoding while introducing a streamlined feature extraction and fusion pathway. It aims at providing strong panoptic segmentation performance while substantially lowering the computational demands. Evaluations on standard benchmarks demonstrate that LiPS attains accuracy comparable to much heavier baselines, while providing up to 4.5 times higher throughput, measured in frames per second, and requiring nearly 6.8 times fewer computations. This efficiency makes LiPS a highly relevant bridge between modern panoptic models and real-world robotic applications.


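[121] A ROS 2 Wrapper for Florence-2: Multi-Mode Local Vision-Language Inference for Robotic Systems cs.RO | cs.AI | cs.CV

J. E. Domínguez-Vidal

TL;DR: This paper presents a ROS 2 wrapper for the Florence-2 vision-language foundation model, intended to integrate the model into robot software stacks. The wrapper supports three interaction modes (topic-driven processing, synchronous service calls, and asynchronous actions) as well as local deployment and Docker containerization, easing practical use of the model in robotic systems.

Details

Motivation: The work addresses the challenge of deploying vision-language foundation models such as Florence-2 in real robotic systems: although these models offer richer semantic perception than narrow task-specific pipelines, their adoption in robot software depends on reproducible middleware integrations and not on model quality alone.

Result: Functional validation and a throughput study on several GPUs show that local deployment of the wrapper is feasible on consumer-grade hardware.

Insight: The novelty is a unified ROS 2 wrapper for Florence-2 that simplifies integration with robotic systems through multiple interaction modes (topics, services, actions) and flexible deployment options (native installation, Docker), and that combines generic JSON outputs with standard ROS 2 message bindings, improving practicality and extensibility.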

Abstract: Foundation vision-language models are becoming increasingly relevant to robotics because they can provide richer semantic perception than narrow task-specific pipelines. However, their practical adoption in robot software stacks still depends on reproducible middleware integrations rather than on model quality alone. Florence-2 is especially attractive in this regard because it unifies captioning, optical character recognition, open-vocabulary detection, grounding and related vision-language tasks within a comparatively manageable model size. This article presents a ROS 2 wrapper for Florence-2 that exposes the model through three complementary interaction modes: continuous topic-driven processing, synchronous service calls and asynchronous actions. The wrapper is designed for local execution and supports both native installation and Docker container deployment. It also combines generic JSON outputs with standard ROS 2 message bindings for detection-oriented tasks. A functional validation is reported together with a throughput study on several GPUs, showing that local deployment is feasible with consumer grade hardware. The repository is publicly available here: https://github.com/JEDominguezVidal/florence2_ros2_wrapper


eess.IV

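[122] Brain MR Image Synthesis with Multi-contrast Self-attention GAN eess.IV | cs.AI | cs.CV

Zaid A. Abod, Furqan Aziz

TL;DR: This paper proposes 3D-MC-SAGAN, a 3D multi-modal MRI image synthesis framework that generates high-fidelity images of the missing T2f, T1n, and T1c modalities from a single T2-weighted input, with particular attention to preserving tumour characteristics. The model uses a multi-scale 3D encoder-decoder generator with residual connections and introduces a novel Memory-Bounded Hybrid Attention block to capture long-range dependencies efficiently. Trained with a combination of loss functions, it achieves state-of-the-art performance in quantitative and qualitative evaluation and maintains tumour segmentation accuracy comparable to that of complete multi-modal inputs.

Details

Motivation: Complete multi-modal MRI is essential for neuro-oncological assessment, but acquiring every modality (e.g., T1c, T1n, T2, T2f) for every patient is often impractical because of time, cost, and patient discomfort, which limits comprehensive tumour evaluation. A method is therefore needed that generates high-quality, tumour-preserving missing modalities from limited input.

Result: Extensive evaluation on 3D brain MRI datasets shows that 3D-MC-SAGAN achieves state-of-the-art quantitative performance and generates visually coherent, anatomically plausible contrasts with improved distribution-level realism. Its tumour segmentation accuracy is comparable to that obtained with fully acquired multi-modal inputs.

Insight: The main innovations are: 1) a unified 3D multi-modal synthesis framework that generates several missing modalities from a single input; 2) a novel Memory-Bounded Hybrid Attention block for efficiently modelling long-range dependencies; 3) a frozen 3D U-Net segmentation module that acts as a segmentation-consistency constraint to explicitly preserve lesion morphology; 4) a composite loss that integrates adversarial, reconstruction, perceptual, structural similarity, contrast-classification, and segmentation-guided terms to secure both global realism and tumour-structure preservation.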

Abstract: Accurate and complete multi-modal Magnetic Resonance Imaging (MRI) is essential for neuro-oncological assessment, as each contrast provides complementary anatomical and pathological information. However, acquiring all modalities (e.g., T1c, T1n, T2, T2f) for every patient is often impractical due to time, cost, and patient discomfort, potentially limiting comprehensive tumour evaluation. We propose 3D-MC-SAGAN (3D Multi-Contrast Self-Attention generative adversarial network), a unified 3D multi-contrast synthesis framework that generates high-fidelity missing modalities from a single T2 input while explicitly preserving tumour characteristics. The model employs a multi-scale 3D encoder-decoder generator with residual connections and a novel Memory-Bounded Hybrid Attention (MBHA) block to capture long-range dependencies efficiently, and is trained with a WGAN-GP critic and an auxiliary contrast-conditioning branch to produce T2f, T1n, and T1c volumes within a single unified network. A frozen 3D U-Net-based segmentation module introduces a segmentation-consistency constraint to preserve lesion morphology. The composite objective integrates adversarial, reconstruction, perceptual, structural similarity, contrast-classification, and segmentation-guided losses to align global realism with tumour-preserving structure. Extensive evaluation on 3D brain MRI datasets demonstrates that 3D-MC-SAGAN achieves state-of-the-art quantitative performance and generates visually coherent, anatomically plausible contrasts with improved distribution-level realism. Moreover, it maintains tumour segmentation accuracy comparable to fully acquired multi-modal inputs, highlighting its potential to reduce acquisition burden while preserving clinically meaningful information.


cs.SD

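[123] TRACE: Training-Free Partial Audio Deepfake Detection via Embedding Trajectory Analysis of Speech Foundation Models cs.SD | cs.AI | cs.CV

Awais Khan, Muhammad Umar Farooq, Kutub Uddin, Khalid Malik

TL;DR: This paper proposes TRACE, a training-free method for detecting partial audio deepfakes. It analyzes the first-order dynamics of frozen speech foundation model representations to detect the boundaries where synthesized segments are spliced into genuine recordings, with no labeled data or model fine-tuning.

Details

Motivation: Existing partial audio deepfake detectors require frame-level annotations, overfit to specific synthesis pipelines, and must be retrained as new generative models emerge. The authors argue that this supervision is unnecessary and hypothesize that speech foundation models implicitly encode a forensic signal: genuine speech forms smooth, slowly varying embedding trajectories, while splice boundaries introduce abrupt disruptions in frame-level transitions.

Result: Across four benchmarks spanning two languages, TRACE achieves 8.08% EER on PartialSpoof, competitive with fine-tuned supervised baselines. On LlamaPartialSpoof, the most challenging benchmark (featuring LLM-driven commercial synthesis), TRACE surpasses a supervised baseline (24.12% vs. 24.49% EER) without any target-domain data.

Insight: The novelty lies in being the first to use the temporal dynamics of speech foundation model representations as a training-free forensic signal. Analyzing the first-order dynamics of embedding trajectories yields detection that generalizes across languages and synthesis methods and avoids the dependence of supervised methods on annotated data and specific synthesis pipelines.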

Abstract: Partial audio deepfakes, where synthesized segments are spliced into genuine recordings, are particularly deceptive because most of the audio remains authentic. Existing detectors are supervised: they require frame-level annotations, overfit to specific synthesis pipelines, and must be retrained as new generative models emerge. We argue that this supervision is unnecessary. We hypothesize that speech foundation models implicitly encode a forensic signal: genuine speech forms smooth, slowly varying embedding trajectories, while splice boundaries introduce abrupt disruptions in frame-level transitions. Building on this, we propose TRACE (Training-free Representation-based Audio Countermeasure via Embedding dynamics), a training-free framework that detects partial audio deepfakes by analyzing the first-order dynamics of frozen speech foundation model representations without any training, labeled data, or architectural modification. We evaluate TRACE on four benchmarks that span two languages using six speech foundation models. On PartialSpoof, TRACE achieves 8.08% EER, competitive with fine-tuned supervised baselines. On LlamaPartialSpoof, the most challenging benchmark featuring LLM-driven commercial synthesis, TRACE surpasses a supervised baseline outright (24.12% vs. 24.49% EER) without any target-domain data. These results show that temporal dynamics in speech foundation models provide an effective, generalizable signal for training-free audio forensics.
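The hypothesis above (smooth trajectories for genuine speech, abrupt jumps at splice boundaries) can be made concrete with a toy sketch. This is not the TRACE scoring function: cosine distance between consecutive frames, the peak-to-median ratio, and all function names are assumptions chosen for illustration, and the frame embeddings are assumed to be non-zero vectors already extracted from a frozen model.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def transition_scores(frames):
    """First-order dynamics: distance between each pair of consecutive frames."""
    return [cosine_distance(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]


def likely_boundary(frames):
    """Index i of the largest jump, i.e. between frames i and i + 1."""
    scores = transition_scores(frames)
    return scores.index(max(scores))


def utterance_score(frames):
    """Hypothetical utterance-level score: peak transition over the (upper) median.

    A value near 1 means a uniformly smooth trajectory; a large value
    means one transition stands out from the rest.
    """
    scores = transition_scores(frames)
    median = sorted(scores)[len(scores) // 2]
    return max(scores) / (median + 1e-8)
```

No parameters are learned here, which is the point of the sketch: the decision signal comes entirely from how the frozen representation moves over time.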


cs.LG

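[124] A Survey of On-Policy Distillation for Large Language Models cs.LG | cs.CL

Mingyang Song, Mao Zheng

TL;DR: This paper is the first comprehensive survey of On-Policy Distillation (OPD) for large language models (LLMs). OPD lets the student model receive teacher feedback on its own generated trajectories, addressing the exposure bias of conventional off-policy distillation. The survey introduces a unified f-divergence framework, organizes existing methods along three orthogonal dimensions (feedback signal, teacher access, and loss granularity), and examines industrial deployments and open problems.

Details

Motivation: Conventional knowledge distillation is mostly off-policy: the student trains on static teacher-generated data, which creates a mismatch between training and inference (exposure bias) and lets prediction errors compound during autoregressive inference. On-policy distillation addresses this by giving the student feedback on its own generated outputs, grounding distillation in the theory of interactive imitation learning.

Result: As a survey, the paper reports no specific quantitative experimental results. It systematically reviews and analyzes on-policy distillation methods, including divergence minimization, reward-guided learning, and self-play, and identifies open problems such as distillation scaling laws, uncertainty-aware feedback, and agent-level distillation.

Insight: The novelty lies in providing the first unified framework and systematic taxonomy of on-policy distillation for LLMs: an f-divergence-based theoretical framework, with existing methods organized along three dimensions, namely feedback signal (logit-based, outcome-based, or self-play), teacher access (white-box, black-box, or teacher-free), and loss granularity (token-level, sequence-level, or hybrid), giving future research a clear structured view.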

Abstract: Knowledge distillation has become a primary mechanism for transferring reasoning and domain expertise from frontier Large Language Models (LLMs) to smaller, deployable students. However, the dominant paradigm remains off-policy: students train on static teacher-generated data and never encounter their own errors during learning. This train–test mismatch, an instance of exposure bias, causes prediction errors to compound autoregressively at inference time. On-Policy Distillation (OPD) addresses this by letting the student generate its own trajectories and receive teacher feedback on these self-generated outputs, grounding distillation in the theory of interactive imitation learning. Despite rapid growth spanning divergence minimization, reward-guided learning, and self-play, the OPD literature remains fragmented with no unified treatment. This survey provides the first comprehensive overview of OPD for LLMs. We introduce a unified f-divergence framework over on-policy samples and organize the landscape along three orthogonal dimensions: feedback signal (logit-based, outcome-based, or self-play), teacher access (white-box, black-box, or teacher-free), and loss granularity (token-level, sequence-level, or hybrid). We systematically analyze representative methods, examine industrial deployments, and identify open problems including distillation scaling laws, uncertainty-aware feedback, and agent-level distillation.
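As background for the f-divergence framing, the sketch below evaluates the textbook definition D_f(P || Q) = sum_x q(x) f(p(x) / q(x)) on a single next-token distribution. The two generator functions recover forward and reverse KL. Which distribution plays P and which plays Q in the survey's own notation is not stated in the abstract, so the teacher/student assignment and the example probabilities here are assumptions for illustration only.

```python
import math


def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)), for strictly positive p and q."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))


def forward_kl_generator(t):
    """f(t) = t log t, which yields KL(P || Q)."""
    return t * math.log(t)


def reverse_kl_generator(t):
    """f(t) = -log t, which yields KL(Q || P)."""
    return -math.log(t)


# Made-up next-token distributions over a three-token vocabulary.
teacher = [0.7, 0.2, 0.1]
student = [0.4, 0.4, 0.2]

fwd = f_divergence(teacher, student, forward_kl_generator)  # KL(teacher || student)
rev = f_divergence(teacher, student, reverse_kl_generator)  # KL(student || teacher)
```

In an on-policy setting the positions at which such a divergence is evaluated would come from sequences sampled by the student itself; that sampling loop is left out of this sketch.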


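[125] Learning to Hint for Reinforcement Learning cs.LG | cs.AI | cs.CL

Yu Xia, Canwen Xu, Zhewei Yao, Julian McAuley, Yuxiong He

TL;DR: This paper proposes Hint Learning for Reinforcement Learning (HiLL), a framework that jointly trains a hinter policy and a reasoner policy to address the "advantage collapse" problem of GRPO in reinforcement learning. The framework generates adaptive hints online based on the reasoner's current errors and introduces a "hint reliance" measure to optimize hints so that hinted success transfers better to no-hint success.

Details

Motivation: The goal is to address advantage collapse in Group Relative Policy Optimization (GRPO): when all rollouts in a group receive the same reward, the group provides no useful learning signal. Existing methods use fixed hints that do not adapt to the reasoner's current state, and hinted success does not necessarily transfer to the no-hint policy used at test time.

Result: Experiments across multiple benchmarks show that HiLL consistently outperforms GRPO and prior hint-based baselines, demonstrating the value of adaptive, transfer-aware hint learning for reinforcement learning.

Insight: The main novelty is an online adaptive framework that jointly trains the hinter and the reasoner, together with the notion of hint reliance, which quantifies how strongly correct hinted trajectories depend on the hint. This leads to a transfer-weighted reward for training the hinter, so that generated hints both supply a useful learning signal and better improve the original no-hint policy.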

Abstract: Group Relative Policy Optimization (GRPO) is widely used for reinforcement learning with verifiable rewards, but it often suffers from advantage collapse: when all rollouts in a group receive the same reward, the group yields zero relative advantage and thus no learning signal. For example, if a question is too hard for the reasoner, all sampled rollouts can be incorrect and receive zero reward. Recent work addresses this issue by adding hints or auxiliary scaffolds to such hard questions so that the reasoner produces mixed outcomes and recovers a non-zero update. However, existing hints are usually fixed rather than adapted to the current reasoner, and a hint that creates learning signal under the hinted input does not necessarily improve the no-hint policy used at test time. To this end, we propose Hint Learning for Reinforcement Learning (HiLL), a framework that jointly trains a hinter policy and a reasoner policy during RL. For each hard question, the hinter generates hints online conditioned on the current reasoner’s incorrect rollout, allowing hint generation to adapt to the reasoner’s evolving errors. We further introduce hint reliance, which measures how strongly correct hinted trajectories depend on the hint. We derive a transferability result showing that lower hint reliance implies stronger transfer from hinted success to no-hint success, and we use this result to define a transfer-weighted reward for training the hinter. Therefore, HiLL favors hints that not only recover informative GRPO groups, but also produce signals that are more likely to improve the original no-hint policy. Experiments across multiple benchmarks show that HiLL consistently outperforms GRPO and prior hint-based baselines, demonstrating the value of adaptive and transfer-aware hint learning for RL. The code is available at https://github.com/Andree-9/HiLL.
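The advantage collapse described in this abstract is easy to reproduce numerically. The sketch below uses the commonly cited GRPO normalization (reward minus group mean, divided by group standard deviation); the `eps` term and the function name are illustrative choices, and the hinting mechanism of HiLL itself is not modeled.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and spread."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Every rollout fails: all advantages are zero, so the group gives no gradient.
collapsed = group_relative_advantages([0, 0, 0, 0])

# One rollout succeeds (for example after a hint): the signal is non-zero again.
mixed = group_relative_advantages([1, 0, 0, 0])
```

The same collapse happens when every rollout succeeds, which is why only groups with mixed outcomes contribute a learning signal.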


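[126] Online Reasoning Calibration: Test-Time Training Enables Generalizable Conformal LLM Reasoning cs.LG | cs.AI | cs.CL | stat.AP | stat.ML

Cai Zhou, Zekai Wang, Menghua Wu, Qianyu Julie Zhu, Flora C. Shi

TL;DR: This paper proposes Online Reasoning Calibration (ORCA), a framework that combines conformal prediction with test-time training to dynamically calibrate the sampling process of large language models. It provides valid confidence estimates under distribution shift and, while retaining theoretical risk guarantees, markedly improves efficiency and generalization on reasoning tasks.

Details

Motivation: Reasoning with current large language models incurs excessive compute costs, which the authors attribute to the miscalibration of post-trained models and the lack of calibration in sampling techniques. The aim is a general method that calibrates dynamically at test time and adapts to distribution shift.

Result: At risk level δ=0.1, ORCA saves up to 47.5% of compute (supervised labels) and 40.7% (self-consistency labels) for Qwen2.5-32B on in-distribution tasks. In zero-shot out-of-domain settings such as MATH-500, it raises savings from 24.8% for the static calibration baseline to 67.0% while keeping the empirical error rate low; the same trend holds across model families and downstream benchmarks.

Insight: The novelty lies in combining conformal prediction with test-time training, using meta-learning to update the calibration module for each input, which yields theoretically guaranteed confidence estimates when reasoning patterns change or the prompt distribution shifts. Viewed objectively, test-time adaptive calibration is an effective route to better efficiency and generalization in LLM reasoning.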

Abstract: While test-time scaling has enabled large language models to solve highly difficult tasks, state-of-the-art results come at exorbitant compute costs. These inefficiencies can be attributed to the miscalibration of post-trained language models, and the lack of calibration in popular sampling techniques. Here, we present Online Reasoning Calibration (ORCA), a framework for calibrating the sampling process that draws upon conformal prediction and test-time training. Specifically, we introduce a meta-learning procedure that updates the calibration module for each input. This allows us to provide valid confidence estimates under distributional shift, e.g. in thought patterns that occur across different stages of reasoning, or in prompt distributions between model development and deployment. ORCA not only provides theoretical guarantees on conformal risks, but also empirically shows higher efficiency and generalization across different reasoning tasks. At risk level $δ=0.1$, ORCA improves Qwen2.5-32B efficiency on in-distribution tasks with savings up to 47.5% with supervised labels and 40.7% with self-consistency labels. Under zero-shot out-of-domain settings, it improves MATH-500 savings from 24.8% of the static calibration baseline to 67.0% while maintaining a low empirical error rate, and the same trend holds across model families and downstream benchmarks. Our code is publicly available at https://github.com/wzekai99/ORCA.


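[127] Sit-to-Stand Transitions Detection and Duration Measurement Using Smart Lacelock Sensor cs.LG | cs.CV

Md Rafi Islam, Md Rejwanul Haque, Elizabeth Choma, Shannon Hayes, Siobhan McMahon

TL;DR: This study proposes a method for detecting sit-to-stand transitions and measuring their duration with the Smart Lacelock sensor. The method uses multimodal data from the shoe-mounted sensor and machine learning classifiers, evaluated in older adults, and achieves high-accuracy transition classification and accurate duration measurement.

Details

Motivation: The sit-to-stand transition is a key indicator of lower-limb strength, musculoskeletal health, and fall risk in older adults, but convenient and accurate ways to monitor it are lacking. The study uses the lightweight Smart Lacelock sensor to address this, in support of real-world fall-risk assessment and mobility monitoring.

Result: In an experiment with 16 older adults following the SPPB protocol, the bagged tree classifier classified sit-to-stand transitions with an accuracy of 0.98 and an F1 score of 0.8. For correctly classified transitions, duration measurement had a mean absolute error of 0.047 seconds with a standard deviation of 0.07 seconds.

Insight: The novelty lies in using a lightweight, shoe-mounted multimodal sensor (load cell, accelerometer, gyroscope) to detect sit-to-stand transitions and measure their duration; evaluation with participant-independent cross-validation demonstrates its potential for health monitoring of older adults in real-world settings.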

Abstract: Postural stability during movement is fundamental to independent living, fall prevention, and overall health, particularly among older adults who experience age-related declines in balance, muscle strength, and mobility. Among daily functional activities, the Sit-to-Stand (SiSt) transition is a critical indicator of lower-limb strength, musculoskeletal health, and fall risk, making it an essential parameter for assessing functional capacity and monitoring physical decline in aging populations. This study presents a methodology for SiSt transition detection and duration measurement using the Smart Lacelock sensor, a lightweight, shoe-mounted device that integrates a load cell, accelerometer, and gyroscope for motion analysis. The methodology was evaluated in 16 older adults (age: mean 76.84, SD 3.45 years) performing SiSt tasks within the Short Physical Performance Battery (SPPB) protocol. Features extracted from multimodal signals were used to train and evaluate four machine learning classifiers using a 4-fold participant-independent cross-validation to classify SiSt transitions and measure their duration. The bagged tree classifier achieved an accuracy of 0.98 and an F1 score of 0.8 in classifying SiSt transitions. The mean absolute error in duration measurement of the correctly classified transitions was 0.047 seconds, with an SD of 0.07 seconds. These findings highlight the potential of the Smart Lacelock sensor for real-world fall-risk assessment and mobility monitoring in older adults.
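Accuracy and F1 can differ noticeably when the target event is rare, as transition windows typically are in continuous recordings. The sketch below computes both metrics plus mean absolute error. The labels and durations are made-up numbers chosen only to show how the two scores can diverge; they are not the study's data, and the function names are illustrative.

```python
def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 for binary labels, where 1 marks a transition window."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


def mean_absolute_error(reference, measured):
    """Average absolute difference between reference and measured durations."""
    return sum(abs(r - m) for r, m in zip(reference, measured)) / len(reference)


# Synthetic example: 5 transition windows among 100, with one miss and one false alarm.
y_true = [1] * 5 + [0] * 95
y_pred = [1, 1, 1, 1, 0] + [1] + [0] * 94
accuracy, f1 = accuracy_and_f1(y_true, y_pred)
```

With these synthetic labels, two errors out of 100 windows barely move accuracy, while F1 reflects that one in five transitions was missed and one in five detections was false.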


[128] QUEST: A robust attention formulation using query-modulated spherical attention cs.LG | cs.AI | cs.CV

Hariprasath Govindarajan, Per Sidén, Jacob Roll, Fredrik Lindsten

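TL;DR: This paper proposes QUEST, a new attention mechanism that constrains key vectors to a hyperspherical latent space while letting query vectors flexibly modulate the sharpness of the attention distribution, addressing the training instability caused by growing query and key norms in standard attention. The method is a drop-in replacement for standard attention, and its stability, performance gains, and robustness are validated in vision and other domains.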

Details

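Motivation: In standard attention, the norms of the query and key vectors can grow arbitrarily, causing training instability, especially when the data contains easy-to-learn spurious patterns.

Result: Experiments show that QUEST trains stably, improves model performance on vision and other tasks, and is robust to data corruptions and adversarial attacks.

Insight: The novelty lies in constraining key vectors to a hyperspherical space to stabilize training while retaining flexibility in the attention distribution through query modulation; this offers a simple and effective way to stabilize attention that generalizes across tasks.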

Abstract: The Transformer model architecture has become one of the most widely used in deep learning and the attention mechanism is at its core. The standard attention formulation uses a softmax operation applied to a scaled dot product between query and key vectors. We explore the role played by norms of the queries and keys, which can cause training instabilities when they arbitrarily increase. We demonstrate how this can happen even in simple Transformer models, in the presence of easy-to-learn spurious patterns in the data. We propose a new attention formulation, QUEry-modulated Spherical aTtention (QUEST), that constrains the keys to a hyperspherical latent space, while still allowing individual tokens to flexibly control the sharpness of the attention distribution. QUEST can be easily used as a drop-in replacement for standard attention. We focus on vision applications while also exploring other domains to highlight the method’s generality. We show that (1) QUEST trains without instabilities and (2) produces models with improved performance (3) that are robust to data corruptions and adversarial attacks.
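
The abstract does not give the exact QUEST formula, so the following is only one possible reading, sketched for a single query in plain Python: keys are L2-normalized onto the unit hypersphere, while the query is left unnormalized so that its norm acts like an inverse temperature on the softmax. The function name `spherical_attention` and the absence of any additional scaling factor are assumptions made for illustration.

```python
import math


def l2_normalize(v, eps=1e-12):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spherical_attention(query, keys, values):
    """Attention for one query with unit-norm keys (illustrative sketch)."""
    unit_keys = [l2_normalize(k) for k in keys]
    # The query is NOT normalized: a larger query norm sharpens the weights.
    scores = [sum(q * k for q, k in zip(query, uk)) for uk in unit_keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights
```

Under this reading, rescaling a key leaves the attention weights unchanged, whereas rescaling the query changes how peaked they are, which matches the abstract's description of tokens controlling the sharpness of the attention distribution.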


[129] MOON3.0: Reasoning-aware Multimodal Representation Learning for E-commerce Product Understanding cs.LG | cs.AI | cs.CV | cs.IR

Junxian Wu, Chenghan Fu, Zhanheng Nie, Daoze Zhang, Bowen Wan

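TL;DR: This paper proposes MOON3.0, a reasoning-aware multimodal large language model for e-commerce product understanding. The method adaptively integrates raw signals through a multi-head modality fusion module, explores effective reasoning strategies with a joint contrastive and reinforcement learning framework, and introduces a fine-grained residual enhancement module that preserves local details throughout the network. The paper also releases MBE3.0, a large-scale multimodal e-commerce benchmark.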

Details

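Motivation: Existing MLLMs are used as feature extractors that implicitly encode product information into global embeddings, which limits their ability to capture fine-grained attributes. This paper aims to use the reasoning capabilities of MLLMs to model fine-grained product attributes explicitly, and to address three key challenges: long-context reasoning diluting attention, supervised fine-tuning limiting the exploration of reasoning strategies, and details attenuating during forward propagation.

Result: Experiments show that the model achieves state-of-the-art zero-shot performance on various downstream tasks, both on the authors' MBE3.0 benchmark and on public datasets.

Insight: The novelty lies in being the first to introduce a reasoning-aware mechanism into MLLM-based product representation learning; its core is to address attention dilution, rigid strategies, and detail attenuation in fine-grained attribute modeling through modality fusion, a joint learning framework, and a residual enhancement module. Two ideas worth borrowing are combining reinforcement learning with contrastive learning to explore reasoning strategies autonomously, and using residual connections to explicitly preserve local details across network layers.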

Abstract: With the rapid growth of e-commerce, exploring general representations rather than task-specific ones has attracted increasing attention. Although recent multimodal large language models (MLLMs) have driven significant progress in product understanding, they are typically employed as feature extractors that implicitly encode product information into global embeddings, thereby limiting their ability to capture fine-grained attributes. Therefore, we argue that leveraging the reasoning capabilities of MLLMs to explicitly model fine-grained product attributes holds significant potential. Nevertheless, achieving this goal remains non-trivial due to several key challenges: (i) long-context reasoning tends to dilute the model’s attention to salient information in the raw input; (ii) supervised fine-tuning (SFT) primarily encourages rigid imitation, limiting the exploration of effective reasoning strategies; and (iii) fine-grained details are progressively attenuated during forward propagation. To address these issues, we propose MOON3.0, the first reasoning-aware MLLM-based model for product representation learning. Our method (1) employs a multi-head modality fusion module to adaptively integrate raw signals; (2) incorporates a joint contrastive and reinforcement learning framework to autonomously explore more effective reasoning strategies; and (3) introduces a fine-grained residual enhancement module to progressively preserve local details throughout the network. Additionally, we release a large-scale multimodal e-commerce benchmark MBE3.0. Experimentally, our model demonstrates state-of-the-art zero-shot performance across various downstream tasks on both our benchmark and public datasets.


cs.CR

[130] AutoMIA: Improved Baselines for Membership Inference Attack via Agentic Self-Exploration cs.CR | cs.CV

Ruhao Liu, Weiqi Huang, Qi Li, Xinchao Wang

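TL;DR: This paper proposes AutoMIA, an agentic framework that reformulates membership inference attacks as automated self-exploration and strategy evolution. It targets existing methods' reliance on static heuristics and their lack of adaptability, and it matches or outperforms state-of-the-art methods on multiple benchmarks.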

Details

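Motivation: Existing membership inference attacks rely mainly on static, handcrafted heuristics that lack adaptability and perform poorly when transferred across different large models, so an automated, model-agnostic framework is needed to explore the attack space systematically.

Result: Extensive experiments show that AutoMIA consistently matches or outperforms state-of-the-art baselines on multiple benchmarks while requiring no manual feature engineering.

Insight: The novelty lies in reformulating membership inference as an agent-driven self-exploration process; by decoupling abstract strategy reasoning from low-level execution, the framework traverses the attack search space systematically and in a model-agnostic way, automatically generating and refining attack strategies.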

Abstract: Membership Inference Attacks (MIAs) serve as a fundamental auditing tool for evaluating training data leakage in machine learning models. However, existing methodologies predominantly rely on static, handcrafted heuristics that lack adaptability, often leading to suboptimal performance when transferred across different large models. In this work, we propose AutoMIA, an agentic framework that reformulates membership inference as an automated process of self-exploration and strategy evolution. Given high-level scenario specifications, AutoMIA self-explores the attack space by generating executable logits-level strategies and progressively refining them through closed-loop evaluation feedback. By decoupling abstract strategy reasoning from low-level execution, our framework enables a systematic, model-agnostic traversal of the attack search space. Extensive experiments demonstrate that AutoMIA consistently matches or outperforms state-of-the-art baselines while eliminating the need for manual feature engineering.
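
For context, a static, handcrafted logits-level heuristic of the kind the abstract contrasts AutoMIA with can be sketched as follows. This is the classic likelihood-threshold baseline, not a strategy generated by AutoMIA; the function names and the `threshold` parameter are hypothetical, and the input is assumed to be one logit vector per token position together with the observed token IDs.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def membership_score(token_logits, token_ids):
    """Mean log-likelihood of the observed tokens; higher suggests 'member'."""
    log_probs = [log_softmax(l)[t] for l, t in zip(token_logits, token_ids)]
    return sum(log_probs) / len(log_probs)


def predict_member(token_logits, token_ids, threshold):
    """Flag a sample as a training member if its score exceeds the threshold."""
    # Assumption: the threshold is calibrated separately, e.g. on held-out data.
    return membership_score(token_logits, token_ids) > threshold
```

A fixed rule like this depends on a threshold tuned per model, which is the kind of manual adaptation the abstract says an automated search is meant to avoid.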