Table of Contents
- cs.CL [Total: 26]
- cs.CV [Total: 59]
- cs.GR [Total: 1]
- cs.DB [Total: 2]
- cs.IR [Total: 1]
- cs.MM [Total: 1]
- cs.RO [Total: 2]
- cs.LG [Total: 12]
- cs.NE [Total: 1]
- eess.IV [Total: 2]
- cs.CY [Total: 1]
- cs.AI [Total: 3]
cs.CL
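[1] CoWork-X: Experience-Optimized Co-Evolution for Multi-Agent Collaboration System cs.CL | cs.AI
Zexin Lin, Jiachen Yu, Haoyang Zhang, Yuzhao Li, Zhonghang Li
TL;DR: This paper proposes CoWork-X, a framework that uses fast-slow memory separation to cast multi-agent collaboration as a closed-loop optimization problem across episodes. It pairs a Skill-Agent that executes via hierarchical task network (HTN)-based skill retrieval with a Co-Optimizer that consolidates skills under budget constraints and drift regularization. On Overcooked-AI-like real-time collaboration benchmarks, it achieves stable, cumulative performance gains while steadily reducing online latency and token consumption.
Motivation: Highly cooperative tasks impose two simultaneous constraints: sub-second real-time coordination and sustained multi-episode adaptation under a strict online token budget. Existing methods fall short either on real-time inference latency or on the reliability of offline text consolidation.
Result: On Overcooked-AI-like real-time collaboration benchmarks, CoWork-X achieves stable, cumulative performance gains while steadily reducing online latency and token usage.
Insight: The contributions are threefold: modeling collaboration as a closed-loop optimization problem across episodes; achieving low-latency execution with a structured, compositional skill library and HTN-based retrieval; and balancing adaptability with stability through patch-style skill consolidation under explicit budget constraints and drift regularization.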
Abstract: Large language models are enabling language-conditioned agents in interactive environments, but highly cooperative tasks often impose two simultaneous constraints: sub-second real-time coordination and sustained multi-episode adaptation under a strict online token budget. Existing approaches either rely on frequent in-episode reasoning that induces latency and timing jitter, or deliver post-episode improvements through unstructured text that is difficult to compile into reliable low-cost execution. We propose CoWork-X, an active co-evolution framework that casts peer collaboration as a closed-loop optimization problem across episodes, inspired by fast–slow memory separation. CoWork-X instantiates a Skill-Agent that executes via HTN (hierarchical task network)-based skill retrieval from a structured, interpretable, and compositional skill library, and a post-episode Co-Optimizer that performs patch-style skill consolidation with explicit budget constraints and drift regularization. Experiments in challenging Overcooked-AI-like realtime collaboration benchmarks demonstrate that CoWork-X achieves stable, cumulative performance gains while steadily reducing online latency and token usage.
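[2] Multilingual Extraction and Recognition of Implicit Discourse Relations in Speech and Text cs.CL
Ahmed Ruby, Christian Hardmeier, Sara Stymne
TL;DR: This paper proposes a method for automatically constructing a multilingual, multimodal dataset of implicit discourse relations covering English, French, and Spanish, and introduces a Qwen2-Audio-based multimodal classifier that jointly models textual and acoustic information to improve implicit discourse relation classification across languages. Text-only models outperform audio-only models, but multimodal fusion can improve performance, and cross-lingual transfer brings substantial gains for low-resource languages.
Motivation: Implicit discourse relation classification that relies on text alone may miss contextual cues that are spread across modalities and vary across languages, especially in low-resource settings.
Result: On the constructed multilingual, multimodal dataset, the multimodal approach (text plus audio) improves over unimodal baselines, and cross-lingual transfer substantially improves performance for low-resource languages.
Insight: The contributions are an automatic method for building a multilingual, multimodal dataset and a multimodal classification framework that uses Qwen2-Audio for joint text-audio modeling. The work underlines the practical value of cross-modal fusion and cross-lingual transfer for low-resource language processing.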
Abstract: Implicit discourse relation classification is a challenging task, as it requires inferring meaning from context. While contextual cues can be distributed across modalities and vary across languages, they are not always captured by text alone. To address this, we introduce an automatic method for distantly related and unrelated language pairs to construct a multilingual and multimodal dataset for implicit discourse relations in English, French, and Spanish. For classification, we propose a multimodal approach that integrates textual and acoustic information through Qwen2-Audio, allowing joint modeling of text and audio for implicit discourse relation classification across languages. We find that while text-based models outperform audio-based models, integrating both modalities can enhance performance, and cross-lingual transfer can provide substantial improvements for low-resource languages.
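[3] Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems cs.CL
Ziyuan Yang, Wenxuan Ding, Shangbin Feng, Yulia Tsvetkov
TL;DR: This paper studies the risk of malicious models in multi-LM collaboration systems. The authors engineer four categories of malicious models, plug them into four popular collaboration systems, and evaluate the impact on 10 datasets. Malicious models significantly degrade system performance, especially in the reasoning and safety domains. A mitigation strategy based on external supervisors recovers 95.31% of the initial performance, but full resistance to malicious models remains an open problem.
Motivation: Multi-LM collaboration systems face safety risks when some participating models are compromised or malicious. The paper aims to quantify that impact and explore mitigations.
Result: Across 10 datasets, malicious models lower performance in the reasoning and safety domains by 7.12% and 7.94% on average, respectively. The proposed mitigation strategies recover 95.31% of the initial performance on average but do not fully defend against malicious models.
Insight: The paper is the first to systematically quantify the impact of malicious models on multi-model collaboration systems, and it proposes a general mitigation framework based on external supervisors. It exposes the security fragility of decentralized AI collaboration and provides an empirical basis and a preliminary solution for building robust multi-model systems.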
Abstract: Language models (LMs) are increasingly used in collaboration: multiple LMs trained by different parties collaborate through routing systems, multi-agent debate, model merging, and more. Critical safety risks remain in this decentralized paradigm: what if some of the models in multi-LLM systems are compromised or malicious? We first quantify the impact of malicious models by engineering four categories of malicious LMs, plugging them into four types of popular model collaboration systems, and evaluating the compromised systems across 10 datasets. We find that malicious models have a severe impact on the multi-LLM systems, especially for reasoning and safety domains where performance is lowered by 7.12% and 7.94% on average. We then propose mitigation strategies to alleviate the impact of malicious components by employing external supervisors that oversee model collaboration and disable or mask out those components to reduce their influence. On average, these strategies recover 95.31% of the initial performance, while making model collaboration systems fully resistant to malicious models remains an open research question.
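[4] The Single-Multi Evolution Loop for Self-Improving Model Collaboration Systems cs.CL
Shangbin Feng, Kishan Panaganti, Yulia Tsvetkov, Wenhao Yu
TL;DR: This paper proposes the "single-multi evolution loop", a self-improving model collaboration system that distills multi-model collaboration patterns into a single model to keep the benefits of collaboration while improving efficiency. In the loop, multiple language models collaborate, each distills from the collaborative outputs, and the improved models collaborate again, forming a collective evolution ecosystem in which models evolve and self-improve by interacting with an environment of other models.
Motivation: Model collaboration systems combine the strengths of diverse models but are costly and inefficient because several models must be loaded. The paper aims to cut that cost by distilling collaboration patterns into a single model while preserving collaborative performance.
Result: Extensive experiments with 7 collaboration strategies and 15 tasks (QA, reasoning, factuality, and others) show that (1) individual models improve by 8.0% on average, absorbing the strengths of collaboration at the cost of a single model, and (2) the collaboration system, now built from stronger and more synergistic models, improves by 14.9% on average over the initial system without evolution. Analysis shows that the method outperforms several existing evolutionary AI methods, is compatible with different model, collaboration, and distillation settings, and solves problems that the initial models or systems struggle with.
Insight: The contribution is the single-multi evolution loop, which combines model collaboration with distillation into a self-improving ecosystem where models evolve collectively through interaction. By iterating distillation and collaboration, the method merges multi-model strengths with single-model efficiency and offers a scalable, broadly compatible approach to self-improving AI systems.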
Abstract: Model collaboration, in which multiple language models (LMs) work together, combines the strengths of diverse models at the cost of loading multiple LMs. We improve efficiency while preserving the strengths of collaboration by distilling collaborative patterns into a single model, where the model is trained on the outputs of the model collaboration system. At inference time, only the distilled model is employed: it imitates the collaboration while only incurring the cost of a single model. Furthermore, we propose the single-multi evolution loop: multiple LMs collaborate, each distills from the collaborative outputs, and these post-distillation improved LMs collaborate again, forming a collective evolution ecosystem where models evolve and self-improve by interacting with an environment of other models. Extensive experiments with 7 collaboration strategies and 15 tasks (QA, reasoning, factuality, etc.) demonstrate that: 1) individual models improve by 8.0% on average, absorbing the strengths of collaboration while reducing the cost to a single model; 2) the collaboration also benefits from the stronger and more synergistic LMs after distillation, improving over initial systems without evolution by 14.9% on average. Analysis reveals that the single-multi evolution loop outperforms various existing evolutionary AI methods, is compatible with diverse model/collaboration/distillation settings, and helps solve problems where the initial model/system struggles.
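[5] Are Open-Weight LLMs Ready for Social Media Moderation? A Comparative Study on Bluesky cs.CL | cs.HC | cs.LG | cs.SI
Hsuan-Yu Chou, Wajiha Naveed, Shuyan Zhou, Xiaowei Yang
TL;DR: This study evaluates open-weight LLMs on social media content moderation using real posts from Bluesky, comparing four proprietary models with three open-weight models. The open-weight models are comparable to the proprietary ones in sensitivity and specificity (sensitivity 81%-97% vs. 72%-98%; specificity 91%-100% vs. 93%-99%), and they can support privacy-preserving moderation on consumer-grade hardware.
Motivation: As harmful online content grows, effective moderation is needed. Proprietary LLMs have been shown to outperform traditional machine learning models in zero-shot settings, but the out-of-the-box capability of open-weight LLMs is still unclear. This study addresses that gap.
Result: Tests on real Bluesky data show considerable overlap between open-weight and proprietary LLMs in sensitivity and specificity. Specificity exceeds sensitivity for rudeness detection, while the opposite holds for intolerance and threats. The study also identifies inter-rater agreement between human moderators and the LLMs.
Insight: Open-weight LLMs offer moderation capability comparable to proprietary models and can run on local hardware for privacy-preserving moderation. The study also reveals differences in detection behavior across types of harmful content, suggesting new directions for moderation systems that balance community norms with individual preferences.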
Abstract: As internet access expands, so does exposure to harmful content, increasing the need for effective moderation. Research has demonstrated that large language models (LLMs) can be effectively utilized for social media moderation tasks, including harmful content detection. While proprietary LLMs have been shown to zero-shot outperform traditional machine learning models, the out-of-the-box capability of open-weight LLMs remains an open question. Motivated by recent developments of reasoning LLMs, we evaluate seven state-of-the-art models: four proprietary and three open-weight. Testing with real-world posts on Bluesky, moderation decisions by Bluesky Moderation Service, and annotations by two authors, we find a considerable degree of overlap between the sensitivity (81%–97%) and specificity (91%–100%) of the open-weight LLMs and those (72%–98%, and 93%–99%) of the proprietary ones. Additionally, our analysis reveals that specificity exceeds sensitivity for rudeness detection, but the opposite holds for intolerance and threats. Lastly, we identify inter-rater agreement across human moderators and the LLMs, highlighting considerations for deploying LLMs in both platform-scale and personalized moderation contexts. These findings show open-weight LLMs can support privacy-preserving moderation on consumer-grade hardware and suggest new directions for designing moderation systems that balance community values with individual user preferences.
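The sensitivity and specificity figures quoted in this entry follow the standard confusion-matrix definitions. The sketch below is a minimal illustration of those definitions, assuming binary labels where `True` means "harmful"; the function name and the toy data are hypothetical and do not come from the paper.

```python
def sensitivity_specificity(labels, predictions):
    """Return (sensitivity, specificity) for boolean labels, True = harmful."""
    tp = fn = tn = fp = 0
    for truth, predicted in zip(labels, predictions):
        if truth and predicted:
            tp += 1          # harmful post correctly flagged
        elif truth and not predicted:
            fn += 1          # harmful post missed
        elif not truth and not predicted:
            tn += 1          # benign post correctly passed
        else:
            fp += 1          # benign post wrongly flagged
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy data, purely illustrative.
toy_labels = [True, True, True, True, False, False]
toy_predictions = [True, True, True, False, False, True]
toy_result = sensitivity_specificity(toy_labels, toy_predictions)
```

High specificity with lower sensitivity, as reported for rudeness, means few benign posts are flagged while some harmful ones slip through.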
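[6] Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions cs.CL | cs.SD
Jinchuan Tian, Haoran Wang, Bo-Hao Su, Chien-yu Huang, Qingzheng Wang
TL;DR: This paper presents Bagpiper, an 8B-parameter audio foundation model that understands and processes audio through comprehensive natural language descriptions ("rich captions") in order to solve open-ended audio tasks. Pre-trained on a large corpus, the model establishes a bidirectional mapping between raw audio and a high-level conceptual space. It is then fine-tuned with a caption-then-process workflow, which lets it perform diverse audio understanding and generation tasks without task-specific priors.
Motivation: Existing audio foundation models typically rely on rigid, task-specific supervision and address only isolated factors of audio, whereas human intelligence processes audio holistically, seamlessly linking physical signals to abstract cognitive concepts to carry out complex tasks. The paper aims to close this gap with a model that processes audio holistically, as humans do.
Result: On the audio understanding benchmarks MMAU and AIRBench, Bagpiper outperforms Qwen-2.5-Omni. In generation quality it surpasses CosyVoice3 and TangoFlux and can synthesize arbitrary compositions of speech, music, and sound effects. To the authors' knowledge, Bagpiper is among the first works to achieve unified understanding and generation for general audio.
Insight: The core idea is to use rich captions as a bridge between audio and high-level concepts and to adopt a caption-then-process workflow that simulates an intermediate cognitive reasoning step. This offers a new paradigm for unified audio models that handle open-ended, complex tasks: natural language serves as a general interface that unifies understanding and generation and reduces reliance on task-specific supervision.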
Abstract: Current audio foundation models typically rely on rigid, task-specific supervision, addressing isolated factors of audio rather than the whole. In contrast, human intelligence processes audio holistically, seamlessly bridging physical signals with abstract cognitive concepts to execute complex tasks. Grounded in this philosophy, we introduce Bagpiper, an 8B audio foundation model that interprets physical audio via rich captions, i.e., comprehensive natural language descriptions that encapsulate the critical cognitive concepts inherent in the signal (e.g., transcription, audio events). By pre-training on a massive corpus of 600B tokens, the model establishes a robust bidirectional mapping between raw audio and this high-level conceptual space. During fine-tuning, Bagpiper adopts a caption-then-process workflow, simulating an intermediate cognitive reasoning step to solve diverse tasks without task-specific priors. Experimentally, Bagpiper outperforms Qwen-2.5-Omni on MMAU and AIRBench for audio understanding and surpasses CosyVoice3 and TangoFlux in generation quality, and it is capable of synthesizing arbitrary compositions of speech, music, and sound effects. To the best of our knowledge, Bagpiper is among the first works that achieve unified understanding and generation for general audio. Model, data, and code are available at Bagpiper Home Page.
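[7] Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR cs.CL
Fanfan Liu, Youyang Yin, Peng Shi, Siqi Yang, Zhixiong Zeng
TL;DR: This paper analyzes why response length changes during Reinforcement Learning with Verifiable Rewards (RLVR) training and proposes a new algorithm, Length-Unbiased Sequence Policy Optimization (LUSPO). By correcting the length bias in existing methods such as GSPO, LUSPO resolves response length collapse and achieves superior performance on mathematical and multimodal reasoning tasks.
Motivation: Growth in response length during RLVR training is often regarded as a key factor behind improved reasoning, yet the pattern of length change differs markedly across RLVR algorithms and lacks a fundamental explanation. The paper analyzes the components of mainstream RLVR algorithms, explains theoretically what drives response length, and proposes an improvement on that basis.
Result: Extensive experiments on mathematical reasoning benchmarks and multimodal reasoning scenarios show that LUSPO consistently achieves superior performance and represents a novel, state-of-the-art optimization strategy compared with existing methods such as GRPO and GSPO.
Insight: The paper offers the first theoretical analysis of response length variation in RLVR training and builds LUSPO on it. Making the loss function unbiased with respect to response length resolves the length collapse caused by length bias and provides a new perspective and tool for RLVR optimization.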
Abstract: Recent applications of Reinforcement Learning with Verifiable Rewards (RLVR) to Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated significant success in enhancing reasoning capabilities for complex tasks. During RLVR training, an increase in response length is often regarded as a key factor contributing to the growth of reasoning ability. However, the patterns of change in response length vary significantly across different RLVR algorithms during the training process. To provide a fundamental explanation for these variations, this paper conducts an in-depth analysis of the components of mainstream RLVR algorithms. We present a theoretical analysis of the factors influencing response length and validate our theory through extensive experimentation. Building upon these theoretical findings, we propose the Length-Unbiased Sequence Policy Optimization (LUSPO) algorithm. Specifically, we rectify the length bias inherent in Group Sequence Policy Optimization (GSPO), rendering its loss function unbiased with respect to response length and thereby resolving the issue of response length collapse. We conduct extensive experiments across mathematical reasoning benchmarks and multimodal reasoning scenarios, where LUSPO consistently achieves superior performance. Empirical results demonstrate that LUSPO represents a novel, state-of-the-art optimization strategy compared to existing methods such as GRPO and GSPO.
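[8] MentorCollab: Selective Large-to-Small Inference-Time Guidance for Efficient Reasoning cs.CL
Haojin Wang, Yike Wang, Shangbin Feng, Hannaneh Hajishirzi, Yulia Tsvetkov
TL;DR: This paper proposes MentorCollab, an inference-time collaboration framework in which a large reasoning model (LRM) selectively and sparsely guides a small language model (SLM) rather than taking over generation. The method probes for divergences between the two models at randomly sampled token positions and uses a lightweight verifier to decide whether the SLM should follow a short lookahead segment from the mentor or continue on its own, improving the small model on multi-step reasoning while keeping inference efficient.
Motivation: LRMs achieve strong performance through long chains of thought, but they are expensive at inference time and often produce redundant reasoning. SLMs are efficient but struggle on multi-step reasoning tasks. Existing collaboration methods tend to promote imitation and verbose reasoning without consistent error correction, so a collaboration scheme that guides selectively and with low overhead is needed.
Result: Across 15 SLM-LRM pairs and 3 domains (math reasoning, general knowledge, and commonsense reasoning), the method improves performance in 12 settings, with an average gain of 3.0% and up to 8.0%, while only 18.4% of tokens on average are generated by the expensive large model. Selective inference-time guidance thus recovers large-model reasoning ability without substantial inference overhead.
Insight: The contribution is a sparse, selective inference-time guidance mechanism in which a lightweight verifier decides dynamically when to bring in a short lookahead segment from the large model, avoiding full imitation or takeover and balancing efficiency against performance. The key finding is that short segments and selective probing are sufficient for effective collaboration, which suggests a new approach to efficient joint inference with heterogeneous models.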
Abstract: Large reasoning models (LRMs) achieve strong performance by producing long chains of thought, but they incur high inference costs and often generate redundant reasoning. Small language models (SLMs) are far more efficient, yet struggle on multi-step reasoning tasks. A natural idea is to let a large model guide a small one at inference time as a mentor, yet existing collaboration methods often promote imitation, resulting in verbose reasoning without consistent error correction. We propose MentorCollab, an inference-time collaboration method in which an LRM selectively and sparsely guides an SLM, rather than taking over generation. At randomly sampled token positions, we probe for divergences between the two models and use a lightweight verifier to decide whether the SLM should follow a short lookahead segment from its mentor or continue on its own. Across 15 SLM–LRM pairs and 3 domains (math reasoning, general knowledge, and commonsense reasoning), our method improves performance in 12 settings, with average gains of 3.0% and up to 8.0%, while having only 18.4% of tokens generated by the expensive mentor model on average. We find that short segments and selective probing are sufficient for effective collaboration. Our results show that selective inference-time guidance restores large-model reasoning ability without substantial inference overhead.
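The decoding loop described in this entry (probe at random positions, check for divergence, let a verifier choose between the mentor's short segment and the small model's own token) can be sketched roughly as follows. This is not the authors' implementation: `slm_next`, `lrm_segment`, and `verifier` are hypothetical callables, and the probe rate, segment length, and divergence test (first mentor token differs from the small model's token) are all assumptions made for illustration.

```python
import random


def collaborate(slm_next, lrm_segment, verifier, prompt, max_tokens,
                probe_rate=0.2, segment_len=4, rng=None):
    """Generate up to max_tokens; return (tokens, share produced by the mentor)."""
    rng = rng or random.Random(0)
    tokens = list(prompt)
    mentor_tokens = 0
    while len(tokens) - len(prompt) < max_tokens:
        own = slm_next(tokens)                      # small model's proposal
        if rng.random() < probe_rate:               # randomly sampled probe position
            segment = lrm_segment(tokens, segment_len)
            # Assumed divergence test: the mentor would start with a different token.
            if segment and segment[0] != own and verifier(tokens, own, segment):
                budget = max_tokens - (len(tokens) - len(prompt))
                segment = segment[:budget]          # never exceed the token budget
                tokens.extend(segment)
                mentor_tokens += len(segment)
                continue
        tokens.append(own)                          # otherwise the SLM continues alone
    generated = tokens[len(prompt):]
    if not generated:
        return generated, 0.0
    return generated, mentor_tokens / len(generated)


# Toy stand-ins: the SLM always emits "a", the mentor always proposes "b" tokens,
# and the verifier only trusts the mentor at the very start of generation.
toy_tokens, toy_share = collaborate(
    slm_next=lambda toks: "a",
    lrm_segment=lambda toks, k: ["b"] * k,
    verifier=lambda toks, own, seg: len(toks) < 2,
    prompt=[], max_tokens=8, probe_rate=1.0, segment_len=2,
)
```

The returned share corresponds to the quantity the paper reports as the fraction of tokens generated by the mentor model.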
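[9] PACE: Defying the Scaling Hypothesis of Exploration in Iterative Alignment for Mathematical Reasoning cs.CL
Jun Rao, Zixiong Yu, Xuebo Liu, Guhan Chen, Jing Li
TL;DR: This paper proposes PACE and challenges the scaling hypothesis of exploration in iterative alignment. It shows that in mathematical reasoning, excessive exploration yields diminishing returns and can even cause policy collapse. PACE replaces brute-force sampling with a generation-based corrective strategy and achieves better performance at lower compute cost.
Motivation: Standard DPO-R1 relies on large-scale sampling (e.g., N≥8), which is computationally inefficient, amplifies verifier noise, and induces harmful distribution shifts. The resulting performance degradation from over-exploration is especially pronounced in mathematical reasoning.
Result: On mathematical reasoning tasks, PACE outperforms DPO-R1 (N=16) while using only about 1/5 of the compute (with N<3), and it is more robust to reward hacking and label noise.
Insight: The contribution is to replace brute-force exploration with a generative corrective strategy. Theoretically, the paper shows how scaling exploration amplifies verifier noise; practically, it achieves efficient alignment by synthesizing high-fidelity preference pairs, offering a compute-efficient and robust paradigm for iterative alignment.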
Abstract: Iterative Direct Preference Optimization has emerged as the state-of-the-art paradigm for aligning Large Language Models on reasoning tasks. Standard implementations (DPO-R1) rely on Best-of-N sampling (e.g., $N \ge 8$) to mine golden trajectories from the distribution tail. In this paper, we challenge this scaling hypothesis and reveal a counter-intuitive phenomenon: in mathematical reasoning, aggressive exploration yields diminishing returns and even catastrophic policy collapse. We theoretically demonstrate that scaling $N$ amplifies verifier noise and induces detrimental distribution shifts. To resolve this, we introduce PACE (Proximal Alignment via Corrective Exploration), which replaces brute-force mining with a generation-based corrective strategy. Operating with a minimal budget ($2<N<3$), PACE synthesizes high-fidelity preference pairs from failed explorations. Empirical evaluations show that PACE outperforms DPO-R1 $(N=16)$ while using only about $1/5$ of the compute, demonstrating superior robustness against reward hacking and label noise.
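One simple way to see why a larger N can amplify verifier noise is the following back-of-the-envelope calculation. It is an illustrative simplification, not the paper's theoretical result: it assumes every sampled trajectory is actually wrong and that a noisy verifier accepts each wrong trajectory independently with a fixed false-positive rate. Under those assumptions, the chance that Best-of-N surfaces at least one falsely accepted trajectory is 1 - (1 - rate)^N.

```python
def p_any_false_positive(false_positive_rate, n_samples):
    """Chance that at least one of n independent wrong samples passes the verifier."""
    return 1.0 - (1.0 - false_positive_rate) ** n_samples


# Print how the probability grows with N for a hypothetical 5% false-positive rate.
for n in (1, 2, 8, 16):
    print(n, round(p_any_false_positive(0.05, n), 3))
```

The probability rises monotonically with N, so a mined "golden" trajectory is increasingly likely to be a verifier error as sampling is scaled up.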
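[10] IESR: Efficient MCTS-Based Modular Reasoning for Text-to-SQL with Large Language Models cs.CL
Tao Liu, Jiafan Lu, Bohan Yu, Pengcheng Wu, Liu Haixin
TL;DR: This paper proposes IESR, an efficient modular reasoning framework that addresses complex reasoning, domain knowledge, and hypothetical queries in text-to-SQL. It uses lightweight LLMs for key information understanding and schema linking, applies a multi-path reasoning mechanism based on Monte Carlo Tree Search (MCTS) with majority voting, and adds a trajectory consistency verification module to ensure accuracy and consistency.
Motivation: Current text-to-SQL methods perform well on benchmarks such as BIRD and Spider but struggle with complex reasoning, domain knowledge, and hypothetical queries, and they are costly to deploy in enterprises. IESR targets these problems with a modular design that improves reasoning efficiency and accuracy.
Result: IESR reaches 24.28 EX on the complex reasoning benchmark LogicCat and 37.28 EX on the Archer dataset, achieving state-of-the-art performance with only compact lightweight models and no fine-tuning.
Insight: The contributions are decoupling key information understanding and schema linking from mathematical computation and SQL generation, integrating MCTS-based multi-path reasoning with majority voting, and introducing a trajectory consistency verification module. The analysis also shows that current coder models have notable biases and deficiencies in physical knowledge, mathematical computation, and common-sense reasoning, which points to directions for future research.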
Abstract: Text-to-SQL is a key natural language processing task that maps natural language questions to SQL queries, enabling intuitive interaction with web-based databases. Although current methods perform well on benchmarks like BIRD and Spider, they struggle with complex reasoning, domain knowledge, and hypothetical queries, and remain costly in enterprise deployment. To address these issues, we propose a framework named IESR (Information Enhanced Structured Reasoning) for lightweight large language models, which (i) leverages LLMs for key information understanding and schema linking, and decouples mathematical computation and SQL generation, (ii) integrates a multi-path reasoning mechanism based on Monte Carlo Tree Search (MCTS) with majority voting, and (iii) introduces a trajectory consistency verification module with a discriminator model to ensure accuracy and consistency. Experimental results demonstrate that IESR achieves state-of-the-art performance on the complex reasoning benchmark LogicCat (24.28 EX) and the Archer dataset (37.28 EX) using only compact lightweight models without fine-tuning. Furthermore, our analysis reveals that current coder models exhibit notable biases and deficiencies in physical knowledge, mathematical computation, and common-sense reasoning, highlighting important directions for future research. The code is released at https://github.com/Ffunkytao/IESR-SLM.
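The majority-voting step over multiple reasoning paths can be illustrated with a small sketch. The abstract does not say what is voted on; the version below assumes that candidate SQL queries are grouped by their execution result and that the most common result wins, which is one common way to do it. The table, the candidate queries, and the function name are hypothetical, and the MCTS search that would produce the candidates is not shown.

```python
import sqlite3
from collections import Counter


def vote_by_execution(conn, candidates):
    """Return the first candidate whose execution result is the most common one."""
    executed = []
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue                                  # drop candidates that fail to run
        executed.append((sql, tuple(sorted(rows, key=repr))))
    if not executed:
        return None
    winning_rows, _ = Counter(rows for _, rows in executed).most_common(1)[0]
    return next(sql for sql, rows in executed if rows == winning_rows)


# Toy database and candidate queries, purely illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 20.0), (3, 30.0)])
candidates = [
    "SELECT SUM(amount) FROM orders",
    "SELECT TOTAL(amount) FROM orders",
    "SELECT COUNT(*) FROM orders",
    "SELECT SUM(amt) FROM orders",   # invalid column, discarded during voting
]
winner = vote_by_execution(conn, candidates)
```

Voting on execution results rather than on query text lets syntactically different but equivalent queries support each other.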
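[11] Beyond Length: Context-Aware Expansion and Independence as Developmentally Sensitive Evaluation in Child Utterances cs.CL | cs.AI
Jiyun Chun, Eric Fosler-Lussier, Michael White, Andrew Perrault
TL;DR: This paper proposes an LLM-as-a-judge framework for evaluating the quality of children's utterances in adult-child dialogue. The framework first classifies the type of the previous adult utterance and then scores the child's response along two dimensions, Expansion (contextual elaboration and inferential depth) and Independence (the child's contribution to advancing the discourse), as an alternative to traditional length-based measures such as mean length of utterance and lexical diversity.
Motivation: Common proxies for the quality of child utterances (mean length of utterance, lexical diversity, readability indices) are dominated by length and ignore conversational context. They miss key aspects of response quality such as reasoning depth, topic maintenance, and discourse planning, so context-sensitive, developmentally informed metrics are needed.
Result: The study establishes developmental validity by showing age-related patterns and demonstrates predictive value by improving age estimation over common baselines. It further confirms semantic sensitivity by detecting differences tied to discourse relations. The proposed metrics align with human judgments and support large-scale evaluation.
Insight: The paper shifts child utterance assessment from measuring length to evaluating how meaningfully the child's speech contributes to and advances the conversation in context. It introduces Expansion and Independence as two axes that reflect fundamental dimensions of child language development and uses an LLM-as-a-judge framework for context-sensitive automatic scoring.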
Abstract: Evaluating the quality of children’s utterances in adult-child dialogue remains challenging due to insufficient context-sensitive metrics. Common proxies such as Mean Length of Utterance (MLU), lexical diversity (vocd-D), and readability indices (Flesch-Kincaid Grade Level, Gunning Fog Index) are dominated by length and ignore conversational context, missing aspects of response quality such as reasoning depth, topic maintenance, and discourse planning. We introduce an LLM-as-a-judge framework that first classifies the Previous Adult Utterance Type and then scores the child’s response along two axes: Expansion (contextual elaboration and inferential depth) and Independence (the child’s contribution to advancing the discourse). These axes reflect fundamental dimensions in child language development: Expansion captures elaboration, clause combining, and causal and contrastive connectives, while Independence captures initiative, topic control, decreasing reliance on adult scaffolding through growing self-regulation, and audience design. We establish developmental validity by showing age-related patterns and demonstrate predictive value by improving age estimation over common baselines. We further confirm semantic sensitivity by detecting differences tied to discourse relations. Our metrics align with human judgments, enabling large-scale evaluation. This shifts child utterance assessment from simply measuring length to evaluating how meaningfully the child’s speech contributes to and advances the conversation within its context.
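[12] Once Correct, Still Wrong: Counterfactual Hallucination in Multilingual Vision-Language Models cs.CL
Basel Mousi, Fahim Dalvi, Shammur Chowdhury, Firoj Alam, Nadir Durrani
TL;DR: This paper addresses counterfactual hallucination in multilingual vision-language models (VLMs). It introduces M2CQA, a multimodal benchmark built from images spanning 17 countries in the Middle East and North Africa (MENA), and the CounterFactual Hallucination Rate (CFHR), which measures how often a model accepts a counterfactual statement after correctly answering the true one. CFHR rises sharply in Arabic, especially in dialects, even when true-statement accuracy is high. Reasoning-first prompting worsens hallucination, while answering before justifying improves robustness.
Motivation: Existing hallucination benchmarks rarely cover non-Western contexts or non-English settings, so they fail to detect a specific failure mode of VLMs: interpretations that are culturally plausible but visually incorrect (counterfactual hallucination).
Result: Evaluating state-of-the-art VLMs on M2CQA shows that CFHR rises sharply in Arabic, especially in dialects. Reasoning-first prompting consistently increases counterfactual hallucination, while answering before justifying improves robustness.
Insight: The contributions are the culturally grounded multilingual, multimodal benchmark M2CQA and the CFHR metric, which isolate and quantify counterfactual hallucination. The study reveals systematic bias of VLMs in non-English and culture-specific contexts, as well as a marked effect of prompting strategy on hallucination, adding a new dimension to robustness evaluation.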
Abstract: Vision-language models (VLMs) can achieve high accuracy while still accepting culturally plausible but visually incorrect interpretations. Existing hallucination benchmarks rarely test this failure mode, particularly outside Western contexts and English. We introduce M2CQA, a culturally grounded multimodal benchmark built from images spanning 17 MENA countries, paired with contrastive true and counterfactual statements in English, Arabic, and its dialects. To isolate hallucination beyond raw accuracy, we propose the CounterFactual Hallucination Rate (CFHR), which measures counterfactual acceptance conditioned on correctly answering the true statement. Evaluating state-of-the-art VLMs under multiple prompting strategies, we find that CFHR rises sharply in Arabic, especially in dialects, even when true-statement accuracy remains high. Moreover, reasoning-first prompting consistently increases counterfactual hallucination, while answering before justifying improves robustness. We will make the experimental resources and dataset publicly available for the community.
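The abstract describes CFHR as counterfactual acceptance conditioned on correctly answering the true statement. A minimal sketch of that reading follows; it assumes each benchmark item yields two booleans (whether the model accepted the true statement and whether it accepted the paired counterfactual), and the exact formula used in the paper may differ. The function name and the toy data are hypothetical.

```python
def counterfactual_hallucination_rate(pairs):
    """pairs: iterable of (accepted_true, accepted_counterfactual) booleans.

    Only items where the true statement was accepted (answered correctly)
    enter the denominator.
    """
    conditioned = [accepted_cf for accepted_true, accepted_cf in pairs if accepted_true]
    if not conditioned:
        return float("nan")
    return sum(conditioned) / len(conditioned)


# Toy outcomes, purely illustrative.
toy_pairs = [(True, False), (True, True), (False, True), (True, True)]
toy_cfhr = counterfactual_hallucination_rate(toy_pairs)
```

Conditioning on the true statement separates hallucination from plain inaccuracy: a model can score well on true statements and still have a high CFHR.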
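[13] Reasoning under Ambiguity: Uncertainty-Aware Multilingual Emotion Classification under Partial Supervision cs.CL
Md. Mithun Hossaina, Mashary N. Alrasheedy, Nirban Bhowmick, Shamim Forhad, Md. Shakil Hossain
TL;DR: This paper proposes "Reasoning under Ambiguity", an uncertainty-aware framework for multilingual multi-label emotion classification that tackles emotional ambiguity and partial supervision (missing or incomplete labels). It explicitly aligns learning with annotation uncertainty through a shared multilingual encoder, an entropy-based ambiguity weighting mechanism, and a mask-aware objective with positive-unlabeled regularization.
Motivation: Contemporary knowledge-based systems rely on multilingual emotion recognition for intelligent decision-making but face major challenges from emotional ambiguity (multiple emotional states co-occurring) and partial supervision (labels that are often missing or heterogeneous). Most existing methods assume fully observed labels and rely on deterministic learning objectives, which leads to biased learning and unreliable predictions under partial supervision.
Result: On English, Spanish, and Arabic emotion classification benchmarks, the method consistently improves over strong baselines across multiple evaluation metrics and also shows improved training stability, robustness to annotation sparsity, and enhanced interpretability.
Insight: The key idea is to align learning explicitly with annotation uncertainty. An entropy-based ambiguity weighting mechanism down-weights highly ambiguous training instances instead of treating missing labels as negative evidence, and a mask-aware objective with positive-unlabeled regularization enables robust learning under partial supervision. The result is a systematic uncertainty-aware framework for the inherent uncertainty and incomplete annotation of multi-label classification.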
Abstract: Contemporary knowledge-based systems increasingly rely on multilingual emotion identification to support intelligent decision-making, yet they face major challenges due to emotional ambiguity and incomplete supervision. Emotion recognition from text is inherently uncertain because multiple emotional states often co-occur and emotion annotations are frequently missing or heterogeneous. Most existing multi-label emotion classification methods assume fully observed labels and rely on deterministic learning objectives, which can lead to biased learning and unreliable predictions under partial supervision. This paper introduces Reasoning under Ambiguity, an uncertainty-aware framework for multilingual multi-label emotion classification that explicitly aligns learning with annotation uncertainty. The proposed approach uses a shared multilingual encoder with language-specific optimization and an entropy-based ambiguity weighting mechanism that down-weights highly ambiguous training instances rather than treating missing labels as negative evidence. A mask-aware objective with positive-unlabeled regularization is further incorporated to enable robust learning under partial supervision. Experiments on English, Spanish, and Arabic emotion classification benchmarks demonstrate consistent improvements over strong baselines across multiple evaluation metrics, along with improved training stability, robustness to annotation sparsity, and enhanced interpretability.
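Two ideas from this entry can be made concrete in a few lines: a mask-aware loss that skips unobserved labels instead of counting them as negatives, and an entropy-based weight that shrinks the contribution of ambiguous instances. The sketch below is a guess at the general shape, not the paper's objective: it computes ambiguity from the entropy of the predicted probabilities, uses a simple `1 - entropy` weight, and omits the positive-unlabeled regularizer entirely. All function names are hypothetical.

```python
import math


def masked_bce(probs, labels, mask, eps=1e-12):
    """Mean binary cross-entropy over observed labels only (mask value 1)."""
    total, count = 0.0, 0
    for p, y, m in zip(probs, labels, mask):
        if not m:
            continue                      # missing label: skipped, not a negative
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        count += 1
    return total / count if count else 0.0


def ambiguity_weight(probs, eps=1e-12):
    """1 minus mean normalized binary entropy (assumed weighting rule)."""
    entropies = []
    for p in probs:
        p = min(max(p, eps), 1 - eps)
        entropy = -(p * math.log(p) + (1 - p) * math.log(1 - p))
        entropies.append(entropy / math.log(2))   # normalize to [0, 1]
    return 1.0 - sum(entropies) / len(entropies)


def instance_loss(probs, labels, mask):
    """Down-weight the masked loss of an instance by its ambiguity."""
    return ambiguity_weight(probs) * masked_bce(probs, labels, mask)
```

For example, `masked_bce([0.9, 0.2, 0.5], [1, 0, 0], [1, 1, 0])` averages over the first two labels only; the third label is unobserved and contributes nothing.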
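[14] A Human-in-the-Loop, LLM-Centered Architecture for Knowledge-Graph Question Answering cs.CL | cs.IR
Larissa Pusch, Alexandre Courtiol, Tim Conrad
TL;DR: This paper proposes an LLM-centered, human-in-the-loop interactive framework for knowledge-graph question answering. LLMs generate and explain Cypher graph queries, and users refine them iteratively in natural language. The goal is to make complex knowledge graphs more accessible while preserving factual accuracy and semantic rigor.
Motivation: LLMs suffer from hallucinations, outdated information, and limited explainability on knowledge-intensive tasks, and text-based retrieval-augmented generation struggles with multi-hop reasoning. Knowledge graphs support precise querying but require knowledge of specialized query languages, a barrier the framework aims to lower.
Result: Query explanation quality and fault detection are evaluated on a 90-query benchmark over a synthetic movie knowledge graph. Two smaller query-generation experiments on the real-world Hyena and MaRDI knowledge graphs show how the framework's performance varies across domains.
Insight: The contribution is to combine LLMs with the precise querying capability of knowledge graphs and to refine the translation from natural language to structured queries through iterative human interaction. This improves the usability and explainability of access to complex data and offers a new angle on domain adaptability.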
Abstract: Large Language Models (LLMs) excel at language understanding but remain limited in knowledge-intensive domains due to hallucinations, outdated information, and limited explainability. Text-based retrieval-augmented generation (RAG) helps ground model outputs in external sources but struggles with multi-hop reasoning. Knowledge Graphs (KGs), in contrast, support precise, explainable querying, yet require knowledge of query languages. This work introduces an interactive framework in which LLMs generate and explain Cypher graph queries and users iteratively refine them through natural language. Applied to real-world KGs, the framework improves accessibility to complex datasets while preserving factual accuracy and semantic rigor and provides insight into how model performance varies across domains. Our core quantitative evaluation is a 90-query benchmark on a synthetic movie KG that measures query explanation quality and fault detection across multiple LLMs, complemented by two smaller real-life query-generation experiments on a Hyena KG and the MaRDI (Mathematical Research Data Initiative) KG.
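[15] Multi-Task GRPO: Reliable LLM Reasoning Across Tasks cs.CL | cs.AI | cs.LG
Shyam Sundhar Ramesh, Xiaotong Ji, Matthieu Zimmer, Sangwoong Yoon, Zhiyong Wang
TL;DR: This paper proposes MT-GRPO, a multi-task reinforcement learning post-training algorithm that addresses the imbalanced optimization of conventional GRPO in multi-task settings. It dynamically adapts task weights to optimize worst-task performance and introduces a ratio-preserving sampler so that gradients reflect the assigned weights, yielding more reliable and balanced gains across multiple reasoning tasks.
Motivation: GRPO-based RL post-training improves LLMs on individual reasoning tasks, but real deployment requires reliable performance across diverse tasks. A straightforward multi-task adaptation of GRPO often leads to imbalanced optimization, with some tasks dominating training while others stagnate. Tasks also differ in how often prompts yield zero advantages (and thus zero gradients), which further distorts the optimization signal.
Result: Experiments in 3-task and 9-task settings show that MT-GRPO consistently outperforms baselines in worst-task accuracy. It achieves absolute improvements in worst-task performance of 16-28% over standard GRPO and 6% over DAPO while maintaining competitive average accuracy. In the 3-task setting, MT-GRPO needs 50% fewer training steps to reach 50% worst-task accuracy, a substantial efficiency gain in achieving reliable cross-task performance.
Insight: The contributions are a dynamic task-weight adaptation mechanism that explicitly optimizes worst-task performance and a ratio-preserving sampler that keeps policy gradients consistent with the adapted weights. Together they offer a systematic remedy for the optimization imbalance and distorted gradient signals common in multi-task RL, applicable to post-training scenarios that need balanced multi-task performance.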
Abstract: RL-based post-training with GRPO is widely used to improve large language models on individual reasoning tasks. However, real-world deployment requires reliable performance across diverse tasks. A straightforward multi-task adaptation of GRPO often leads to imbalanced outcomes, with some tasks dominating optimization while others stagnate. Moreover, tasks can vary widely in how frequently prompts yield zero advantages (and thus zero gradients), which further distorts their effective contribution to the optimization signal. To address these issues, we propose a novel Multi-Task GRPO (MT-GRPO) algorithm that (i) dynamically adapts task weights to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) introduces a ratio-preserving sampler to ensure task-wise policy gradients reflect the adapted weights. Experiments on both 3-task and 9-task settings show that MT-GRPO consistently outperforms baselines in worst-task accuracy. In particular, MT-GRPO achieves 16-28% and 6% absolute improvement on worst-task performance over standard GRPO and DAPO, respectively, while maintaining competitive average accuracy. Moreover, MT-GRPO requires 50% fewer training steps to reach 50% worst-task accuracy in the 3-task setting, demonstrating substantially improved efficiency in achieving reliable performance across tasks.
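The zero-advantage issue mentioned in this entry follows from how GRPO normalizes rewards within a group of samples for the same prompt: if every sample receives the same reward, every advantage is zero and the prompt contributes no gradient. The sketch below shows that effect together with a hypothetical exponentiated update that shifts weight toward weaker tasks. The weight update is only an assumed stand-in for the paper's adaptation rule, and the population standard deviation is one of several normalization choices found in implementations.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def zero_advantage_rate(groups):
    """Fraction of prompts whose sampled rewards are all identical."""
    return sum(1 for g in groups if len(set(g)) == 1) / len(groups)


def update_task_weights(weights, accuracies, step_size=1.0):
    """Hypothetical update: tasks with lower accuracy receive more weight."""
    scaled = {t: w * math.exp(-step_size * accuracies[t]) for t, w in weights.items()}
    total = sum(scaled.values())
    return {t: v / total for t, v in scaled.items()}


# Toy numbers, purely illustrative.
uniform_group = group_advantages([1, 1, 1, 1])          # all samples correct
toy_rate = zero_advantage_rate([[1, 1], [0, 1], [0, 0], [1, 0]])
toy_weights = update_task_weights({"math": 0.5, "code": 0.5},
                                  {"math": 0.9, "code": 0.3})
```

A task whose prompts are mostly all-correct or all-wrong has a high zero-advantage rate, so its effective share of the gradient is smaller than its nominal sampling weight; this is the distortion a ratio-preserving sampler is meant to correct.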
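[16] CompactRAG: Reducing LLM Calls and Token Overhead in Multi-Hop Question Answering cs.CL | cs.AI
Hao Yang, Zhiyu Yang, Xupeng Zhang, Wei Wei, Yunjie Zhang
TL;DR: This paper proposes CompactRAG, a framework that makes retrieval-augmented generation more efficient for multi-hop question answering. It decouples the process into offline knowledge base construction and online reasoning. Offline, an LLM converts the corpus into a knowledge base of atomic question-answer pairs. Online, the LLM is called only twice (for question decomposition and for answer synthesis), with dense retrieval and RoBERTa-based answer extraction in between, which substantially reduces LLM calls and token overhead.
Motivation: Existing multi-hop RAG systems are inefficient: they alternate between retrieval and reasoning at every step, which causes repeated LLM calls, high token consumption, and unstable entity grounding across hops.
Result: On HotpotQA, 2WikiMultiHopQA, and MuSiQue, CompactRAG achieves competitive accuracy while substantially reducing token consumption compared with iterative RAG baselines.
Insight: The core idea is to replace the LLM-heavy online process of multi-hop reasoning with a one-time offline restructuring of knowledge plus lightweight online retrieval and extraction. An atomic QA knowledge base and careful question rewriting preserve entity consistency, enabling an inference pipeline with only two LLM calls and a cost-effective, practical approach to multi-hop reasoning over large knowledge corpora.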
Abstract: Retrieval-augmented generation (RAG) has become a key paradigm for knowledge-intensive question answering. However, existing multi-hop RAG systems remain inefficient, as they alternate between retrieval and reasoning at each step, resulting in repeated LLM calls, high token consumption, and unstable entity grounding across hops. We propose CompactRAG, a simple yet effective framework that decouples offline corpus restructuring from online reasoning. In the offline stage, an LLM reads the corpus once and converts it into an atomic QA knowledge base, which represents knowledge as minimal, fine-grained question-answer pairs. In the online stage, complex queries are decomposed and carefully rewritten to preserve entity consistency, and are resolved through dense retrieval followed by RoBERTa-based answer extraction. Notably, during inference, the LLM is invoked only twice in total - once for sub-question decomposition and once for final answer synthesis - regardless of the number of reasoning hops. Experiments on HotpotQA, 2WikiMultiHopQA, and MuSiQue demonstrate that CompactRAG achieves competitive accuracy while substantially reducing token consumption compared to iterative RAG baselines, highlighting a cost-efficient and practical approach to multi-hop reasoning over large knowledge corpora. The implementation is available at GitHub.
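The online stage described in this entry, with exactly two LLM calls regardless of the number of hops, can be sketched with stand-in components. Everything below is a toy under stated assumptions: `CountingLLM` is a hypothetical stub that returns a fixed decomposition, word overlap replaces dense retrieval, and returning the stored answer replaces RoBERTa-based extraction. The `{prev}` placeholder is an invented device for showing the entity-preserving rewrite.

```python
class CountingLLM:
    """Stand-in for a real LLM; it only counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def decompose(self, question):
        self.calls += 1
        # Hypothetical fixed decomposition; a real model would generate this.
        return ["Who directed Inception?", "Where was {prev} born?"]

    def synthesize(self, question, sub_answers):
        self.calls += 1
        return sub_answers[-1]


def words(text):
    return set(text.lower().replace("?", "").split())


def retrieve(kb, query):
    """Toy lexical retrieval standing in for dense retrieval over atomic QA pairs."""
    return max(kb, key=lambda qa: len(words(query) & words(qa[0])))


def answer(question, kb, llm):
    sub_questions = llm.decompose(question)            # LLM call 1
    answers = []
    for sub_question in sub_questions:
        if answers:
            # Rewrite the sub-question so the resolved entity appears explicitly.
            sub_question = sub_question.replace("{prev}", answers[-1])
        _, extracted = retrieve(kb, sub_question)      # stands in for answer extraction
        answers.append(extracted)
    return llm.synthesize(question, answers)           # LLM call 2


# Atomic QA knowledge base that the offline stage would have produced.
kb = [
    ("Who directed Inception?", "Christopher Nolan"),
    ("Where was Christopher Nolan born?", "London"),
    ("Who directed Titanic?", "James Cameron"),
]
llm = CountingLLM()
final_answer = answer("Where was the director of Inception born?", kb, llm)
```

Because each hop is resolved by retrieval and extraction rather than by the LLM, adding more hops lengthens the loop but leaves `llm.calls` at two.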
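[17] LongR: Unleashing Long-Context Reasoning via Reinforcement Learning with Dense Utility Rewards cs.CL
Bowen Ping, Zijun Chen, Yiyao Yu, Tingfeng Hui, Junchi Yan
TL;DR: LongR is a unified framework that strengthens the long-context reasoning of large language models through reinforcement learning. It combines a dynamic "Think-and-Read" mechanism with a contextual density reward based on relative information gain, which quantifies the utility of relevant documents, and improves performance on tasks such as long-dialogue understanding and structured data analysis.
Motivation: Existing work focuses mainly on data synthesis or architectural changes, but relying only on sparse, outcome-only rewards yields limited gains in complex long-context reasoning because such coarse signals cannot guide the reasoning process effectively. A finer-grained reward mechanism is therefore needed.
Result: LongR achieves a 9% gain on LongBench v2 and consistent improvements on RULER and InfiniteBench, demonstrating robust efficiency in navigating extensive contexts. It also consistently improves performance across different RL algorithms (e.g., DAPO, GSPO).
Insight: The contributions are the dynamic "Think-and-Read" mechanism, which interleaves reasoning with document consultation, and the contextual density reward based on relative information gain, which quantifies document utility. Together they provide finer-grained guidance for long-context reasoning and may transfer to other RL tasks that require complex reasoning.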
Abstract: Reinforcement Learning has emerged as a key driver for LLM reasoning. This capability is equally pivotal in long-context scenarios–such as long-dialogue understanding and structured data analysis, where the challenge extends beyond consuming tokens to performing rigorous deduction. While existing efforts focus on data synthesis or architectural changes, recent work points out that relying solely on sparse, outcome-only rewards yields limited gains, as such coarse signals are often insufficient to effectively guide the complex long-context reasoning. To address this, we propose LongR, a unified framework that enhances long-context performance by integrating a dynamic “Think-and-Read” mechanism, which interleaves reasoning with document consultation, with a contextual density reward based on relative information gain to quantify the utility of the relevant documents. Empirically, LongR achieves a 9% gain on LongBench v2 and consistent improvements on RULER and InfiniteBench, demonstrating robust efficiency in navigating extensive contexts. Furthermore, LongR consistently enhances performance across diverse RL algorithms (e.g., DAPO, GSPO). Finally, we conduct in-depth analyses to investigate the impact of reasoning chain length on efficiency and the model’s robustness against distractors.
[18] Reinforcement World Model Learning for LLM-based Agents cs.CLPDF
Xiao Yu, Baolin Peng, Ruize Xu, Yelong Shen, Pengcheng He
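TL;DR: This paper proposes Reinforcement World Model Learning (RWML), a self-supervised method that learns action-conditioned world models over textual states for LLM-based agents. Using sim-to-real gap rewards, it aligns, in a pre-trained embedding space, the simulated next states produced by the model with the realized next states observed from the environment, improving consistency between the agent's internal world simulation and actual environment dynamics. Experiments on ALFWorld and τ² Bench show significant gains over the base model; when combined with task-success rewards, the method outperforms RL that uses task-success rewards directly and matches the performance of expert-data training.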
Details
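Motivation: Large language models perform well on language-centric tasks, but in agentic settings they often struggle to anticipate the consequences of actions and to adapt to environment dynamics, which highlights the need for world-modeling capabilities in LLM-based agents.
Result: On ALFWorld and τ² Bench, the method significantly outperforms the base model. Combined with task-success rewards, it beats direct task-success reward RL by 6.9 and 5.7 points respectively and matches the performance of expert-data training.
Insight: The contribution is a self-supervised reinforcement world model learning method that aligns simulated and real states in a semantic embedding space through sim-to-real gap rewards. This avoids the model collapse that next-state token prediction can cause, is empirically less susceptible to reward hacking than LLM-as-a-judge, and gives LLM agents a more robust world-model training signal.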
Abstract: Large language models (LLMs) have achieved strong performance in language-centric tasks. However, in agentic settings, LLMs often struggle to anticipate action consequences and adapt to environment dynamics, highlighting the need for world-modeling capabilities in LLM-based agents. We propose Reinforcement World Model Learning (RWML), a self-supervised method that learns action-conditioned world models for LLM-based agents on textual states using sim-to-real gap rewards. Our method aligns simulated next states produced by the model with realized next states observed from the environment, encouraging consistency between internal world simulations and actual environment dynamics in a pre-trained embedding space. Unlike next-state token prediction, which prioritizes token-level fidelity (i.e., reproducing exact wording) over semantic equivalence and can lead to model collapse, our method provides a more robust training signal and is empirically less susceptible to reward hacking than LLM-as-a-judge. We evaluate our method on ALFWorld and $τ^2$ Bench and observe significant gains over the base model, despite being entirely self-supervised. When combined with task-success rewards, our method outperforms direct task-success reward RL by 6.9 and 5.7 points on ALFWorld and $τ^2$ Bench respectively, while matching the performance of expert-data training.
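To make the idea of a sim-to-real gap reward in an embedding space concrete, here is a minimal sketch. The abstract does not state the similarity function or the embedding model, so cosine similarity and the bag-of-words `toy_embed` below are assumptions chosen only to keep the example self-contained; a real setup would use a pre-trained text encoder.

```python
import math
from collections import Counter


def toy_embed(text):
    """Toy bag-of-words embedding; stands in for a pre-trained encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def sim_to_real_reward(simulated_next_state, realized_next_state):
    """Higher reward when the predicted state is close to the observed one."""
    return cosine(toy_embed(simulated_next_state), toy_embed(realized_next_state))


reward = sim_to_real_reward("the drawer is open", "the drawer is closed")
```

Because the reward compares embeddings rather than exact tokens, a paraphrased but semantically matching prediction is not penalized the way it would be under token-level next-state prediction.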
[19] RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference cs.CLPDF
Siran Liu, Guoxia Wang, Sa Wang, Jinle Zeng, HaoYang Xie
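TL;DR: This paper proposes RRAttention, a new dynamic sparse attention method that samples query positions in round-robin fashion across attention heads within each stride, achieving query independence and efficient global pattern discovery. It reduces attention complexity from O(L^2) to O(L^2/S^2); in long-context inference it recovers over 99% of full-attention performance while computing only half of the attention blocks, with a 2.4x speedup.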
Details
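Motivation: Addresses the computational bottleneck caused by the quadratic complexity of standard attention on long contexts, along with the trade-offs of existing dynamic sparse attention methods: requiring preprocessing, lacking global evaluation, violating query independence, or incurring high computational overhead.
Result: On natural language understanding (HELMET) and multimodal video understanding (Video-MME) benchmarks, RRAttention achieves a 2.4x speedup at 128K context length while recovering over 99% of full-attention performance, outperforming existing dynamic sparse attention methods.
Insight: The innovation is a head round-robin sampling strategy: rotating query sampling positions across heads preserves query independence while enabling global pattern discovery through stride-level aggregation. Combined with adaptive Top-τ selection to tune sparsity, it gives long-context models an efficient solution with near-full-attention performance.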
Abstract: The quadratic complexity of attention mechanisms poses a critical bottleneck for large language models processing long contexts. While dynamic sparse attention methods offer input-adaptive efficiency, they face fundamental trade-offs: requiring preprocessing, lacking global evaluation, violating query independence, or incurring high computational overhead. We present RRAttention, a novel dynamic sparse attention method that simultaneously achieves all desirable properties through a head round-robin (RR) sampling strategy. By rotating query sampling positions across attention heads within each stride, RRAttention maintains query independence while enabling efficient global pattern discovery with stride-level aggregation. Our method reduces complexity from $O(L^2)$ to $O(L^2/S^2)$ and employs adaptive Top-$τ$ selection for optimal sparsity. Extensive experiments on natural language understanding (HELMET) and multimodal video comprehension (Video-MME) demonstrate that RRAttention recovers over 99% of full attention performance while computing only half of the attention blocks, achieving 2.4$\times$ speedup at 128K context length and outperforming existing dynamic sparse attention methods.
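The round-robin sampling idea can be illustrated with a small sketch. The abstract does not give the exact shift rule, so the choice below (offset equal to the head index modulo the stride) is an assumption, as are the function names. The second function shows one reading of where the $O(L^2/S^2)$ term comes from: $L/S$ sampled queries scored against $L/S$ stride-aggregated key groups.

```python
def sampled_query_positions(seq_len, stride, head):
    """Query positions sampled by one head: one per stride, shifted per head."""
    offset = head % stride  # assumed shift rule: head index modulo stride
    return list(range(offset, seq_len, stride))


def estimation_entries(seq_len, stride):
    """Score entries when L/S sampled queries meet L/S stride-aggregated keys."""
    return (seq_len // stride) ** 2


seq_len, stride, num_heads = 16, 4, 8
per_head = {h: sampled_query_positions(seq_len, stride, h) for h in range(num_heads)}

# Across heads, the shifted samples together touch every query position.
covered = sorted(set(p for positions in per_head.values() for p in positions))
```

With these toy sizes each head scores 16 entries instead of 256, and the union over heads still covers all 16 positions, which is the intuition behind combining per-head sparsity with global coverage.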
[20] xList-Hate: A Checklist-Based Framework for Interpretable and Generalizable Hate Speech Detection cs.CL | cs.AIPDF
Adrián Girón, Pablo Miralles, Javier Huertas-Tato, Sergio D’Antonio, David Camacho
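TL;DR: This paper proposes xList-Hate, a framework that recasts hate speech detection as a checklist-based diagnostic reasoning task. A large language model (LLM) answers a series of concept-level questions grounded in normative criteria to produce a diagnostic representation, and an interpretable decision tree aggregates these signals into a prediction. The goal is better cross-domain robustness, interpretability, and resilience to annotation noise.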
Details
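Motivation: Hate speech detection is usually reduced to direct binary classification, so supervised models tend to overfit dataset-specific annotation definitions and show limited robustness under domain shift and annotation noise. The paper addresses this by decomposing the task into an interpretable diagnostic framework built on widely shared normative criteria.
Result: Evaluations across multiple hate speech benchmarks and model families show that, compared with zero-shot LLM classification and in-domain supervised fine-tuning, xList-Hate consistently improves cross-dataset robustness and relative performance under domain shift. Qualitative analysis also indicates lower sensitivity to certain annotation inconsistencies and contextual ambiguity.
Insight: The main innovation is recasting hate speech detection from a single classification problem into checklist-based diagnostic reasoning, using an LLM for fine-grained concept-level question answering and an interpretable decision tree for aggregation, which yields transparent and auditable predictions. This offers content moderation an approach that is robust, explainable, and extensible.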
Abstract: Hate speech detection is commonly framed as a direct binary classification problem despite being a composite concept defined through multiple interacting factors that vary across legal frameworks, platform policies, and annotation guidelines. As a result, supervised models often overfit dataset-specific definitions and exhibit limited robustness under domain shift and annotation noise. We introduce xList-Hate, a diagnostic framework that decomposes hate speech detection into a checklist of explicit, concept-level questions grounded in widely shared normative criteria. Each question is independently answered by a large language model (LLM), producing a binary diagnostic representation that captures hateful content features without directly predicting the final label. These diagnostic signals are then aggregated by a lightweight, fully interpretable decision tree, yielding transparent and auditable predictions. We evaluate it across multiple hate speech benchmarks and model families, comparing it against zero-shot LLM classification and in-domain supervised fine-tuning. While supervised methods typically maximize in-domain performance, xList-Hate consistently improves cross-dataset robustness and relative performance under domain shift. In addition, qualitative analysis of disagreement cases provides evidence that the framework can be less sensitive to certain forms of annotation inconsistency and contextual ambiguity. Crucially, the approach enables fine-grained interpretability through explicit decision paths and factor-level analysis. Our results suggest that reframing hate speech detection as a diagnostic reasoning task, rather than a monolithic classification problem, provides a robust, explainable, and extensible alternative for content moderation.
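The two-stage structure (checklist answers, then an interpretable tree) can be sketched as below. The abstract does not list the actual checklist questions or the learned tree, so the three item names and the hand-written rules are hypothetical placeholders; in the described framework the answers would come from an LLM and the tree would be fitted to data.

```python
# Hypothetical checklist items; the paper's actual questions are not listed
# in the abstract.
CHECKLIST = (
    "targets_protected_group",
    "contains_dehumanizing_language",
    "incites_harm",
)


def diagnostic_vector(answers):
    """Turn per-question yes/no answers (e.g. from an LLM) into a binary vector."""
    return tuple(int(bool(answers[name])) for name in CHECKLIST)


def tree_predict(vector):
    """Hand-written stand-in for a learned decision tree; returns label and path."""
    targets, dehumanizes, incites = vector
    if not targets:
        return 0, ["targets_protected_group=0"]
    if dehumanizes:
        return 1, ["targets_protected_group=1", "contains_dehumanizing_language=1"]
    if incites:
        return 1, ["targets_protected_group=1",
                   "contains_dehumanizing_language=0",
                   "incites_harm=1"]
    return 0, ["targets_protected_group=1",
               "contains_dehumanizing_language=0",
               "incites_harm=0"]


answers = {
    "targets_protected_group": True,
    "contains_dehumanizing_language": False,
    "incites_harm": True,
}
label, path = tree_predict(diagnostic_vector(answers))
```

The returned `path` is what makes the prediction auditable: every label can be traced to the specific checklist answers that produced it.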
[21] EuroLLM-22B: Technical Report cs.CL | cs.AI | cs.LGPDF
Miguel Moura Ramos, Duarte M. Alves, Hippolyte Gisserot-Boukhlef, João Alves, Pedro Henrique Martins
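TL;DR: This technical report introduces EuroLLM-22B, a large language model trained from scratch to serve the needs of European citizens, covering all 24 official European Union languages and 11 additional languages. It gives a full overview of development, including tokenizer design, architectural specifications, data filtering, and training procedures. On multilingual benchmarks the model shows strong reasoning, instruction following, and translation, with results competitive with models of comparable size.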
Details
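Motivation: Addresses the underrepresentation and underserving of European languages in existing open large language models, with the aim of meeting the needs of Europe's multilingual environment.
Result: Across a broad set of multilingual benchmarks, EuroLLM-22B performs strongly in reasoning, instruction following, and translation, with results competitive with models of comparable size.
Insight: The contribution is training from scratch specifically for the European multilingual setting, covering 35 languages, and releasing the models, datasets, and codebases to support future research. Its multilingual coverage and open release help make language models more equitable and accessible.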
Abstract: This report presents EuroLLM-22B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-22B’s development, including tokenizer design, architectural specifications, data filtering, and training procedures. Across a broad set of multilingual benchmarks, EuroLLM-22B demonstrates strong performance in reasoning, instruction following, and translation, achieving results competitive with models of comparable size. To support future research, we release our base and instruction-tuned models, our multilingual web pretraining data and updated EuroBlocks instruction datasets, as well as our pre-training and evaluation codebases.
[22] Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models cs.CLPDF
Shuo Nie, Hexuan Deng, Chao Wang, Ruiyu Fang, Xuebo Liu
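TL;DR: This paper proposes FaithRL (Faithfulness-Aware Step-Level Reinforcement Learning), which targets unfaithful (hallucinated) intermediate steps in the chain-of-thought reasoning of small reasoning models. It provides explicit step-level faithfulness rewards from a process reward model and combines them with an implicit truncated resampling strategy that generates contrastive signals, so that reinforcement learning supervises the faithfulness of reasoning steps at a finer granularity.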
Details
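Motivation: Existing mitigation methods based on online reinforcement learning rely on outcome rewards or coarse-grained chain-of-thought evaluation, which can inadvertently reinforce unfaithful reasoning steps when the final answer is correct. Finer-grained, step-level supervision is needed to keep the reasoning of small reasoning models faithful.
Result: Experiments across multiple small reasoning models and Open-Book QA benchmarks show that FaithRL consistently reduces hallucinations in both the chain of thought and the final answers, leading to more faithful and reliable reasoning.
Insight: The innovation is step-level faithfulness supervision that pairs an explicit process reward model with an implicit truncated resampling strategy. This gives reinforcement learning finer contrastive signals, so unfaithful steps are penalized and corrected directly instead of relying only on final-answer correctness.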
Abstract: As large language models become smaller and more efficient, small reasoning models (SRMs) are crucial for enabling chain-of-thought (CoT) reasoning in resource-constrained settings. However, they are prone to faithfulness hallucinations, especially in intermediate reasoning steps. Existing mitigation methods based on online reinforcement learning rely on outcome-based rewards or coarse-grained CoT evaluation, which can inadvertently reinforce unfaithful reasoning when the final answer is correct. To address these limitations, we propose Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), introducing step-level supervision via explicit faithfulness rewards from a process reward model, together with an implicit truncated resampling strategy that generates contrastive signals from faithful prefixes. Experiments across multiple SRMs and Open-Book QA benchmarks demonstrate that FaithRL consistently reduces hallucinations in both the CoT and final answers, leading to more faithful and reliable reasoning. Code is available at https://github.com/Easy195/FaithRL.
[23] Polyglots or Multitudes? Multilingual LLM Answers to Value-laden Multiple-Choice Questions cs.CLPDF
Léo Labat, Etienne Ollion, François Yvon
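TL;DR: This paper studies whether multilingual large language models answer value-laden multiple-choice questions differently depending on the language. The authors release the Multilingual European Value Survey (MEVS), a corpus of human-translated questions in 8 European languages, and test more than 30 multilingual LLMs of different sizes, manufacturers, and alignment fine-tuning status. Larger, instruction-tuned models are more consistent overall, but the robustness of their responses varies greatly across questions, and language-specific behavior appears only on some questions.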
Details
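Motivation: Investigates whether multilingual LLMs respond inconsistently to value-laden multiple-choice questions across languages: do they stay consistent like theoretical polyglots, or do they express different values depending on the question language, like a collection of monolingual models?
Result: Experiments on MEVS show that larger, instruction-tuned models are more consistent overall, but response robustness differs markedly across questions. Some questions elicit total agreement within and across models, while others leave LLM answers split. Language-specific behavior arises in all consistent, instruction-fine-tuned models, but only on certain questions.
Insight: The contribution is MEVS, described as the first multilingual value survey corpus aligned entirely through human translation, together with a systematic analysis of how language affects LLM value responses. The study exposes a selective effect of preference fine-tuning and provides empirical evidence for multilingual alignment.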
Abstract: Multiple-Choice Questions (MCQs) are often used to assess knowledge, reasoning abilities, and even values encoded in large language models (LLMs). While the effect of multilingualism has been studied on LLM factual recall, this paper seeks to investigate the less explored question of language-induced variation in value-laden MCQ responses. Are multilingual LLMs consistent in their responses across languages, i.e. behave like theoretical polyglots, or do they answer value-laden MCQs depending on the language of the question, like a multitude of monolingual models expressing different values through a single model? We release a new corpus, the Multilingual European Value Survey (MEVS), which, unlike prior work relying on machine translation or ad hoc prompts, solely comprises human-translated survey questions aligned in 8 European languages. We administer a subset of those questions to over thirty multilingual LLMs of various sizes, manufacturers and alignment-fine-tuning status under comprehensive, controlled prompt variations including answer order, symbol type, and tail character. Our results show that while larger, instruction-tuned models display higher overall consistency, the robustness of their responses varies greatly across questions, with certain MCQs eliciting total agreement within and across models while others leave LLM answers split. Language-specific behavior seems to arise in all consistent, instruction-fine-tuned models, but only on certain questions, warranting a further study of the selective effect of preference fine-tuning.
[24] Self-Improving Multilingual Long Reasoning via Translation-Reasoning Integrated Training cs.CLPDF
Junxiao Liu, Zhijun Wang, Yixiao Li, Zhejian Lai, Liqian Huang
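TL;DR: This paper proposes TRIT, a self-improving framework that integrates translation training into multilingual reasoning to address the weak reasoning of long reasoning models on non-English questions. It improves multilingual question understanding and response generation without external feedback or additional multilingual data.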
Details
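Motivation: Long reasoning models tend to reason in English on non-English questions, and accuracy drops substantially when they are forced to reason in the question language. This stems from limits in both multilingual question understanding and multilingual reasoning.
Result: On the MMATH benchmark, the method outperforms multiple baselines by an average of 7 percentage points, improving both answer correctness and language consistency. Further analysis shows that integrating translation training improves cross-lingual question alignment by over 10 percentage points and yields translation quality gains of up to 8.4 COMET points on FLORES-200.
Insight: The innovation is self-improvement through translation-reasoning integrated training, which strengthens multilingual understanding and generation together without extra data. Treating translation as an integrated intermediate task eases the alignment and semantic gaps in multilingual reasoning.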
Abstract: Long reasoning models often struggle in multilingual settings: they tend to reason in English for non-English questions; when constrained to reasoning in the question language, accuracies drop substantially. The struggle is caused by the limited abilities for both multilingual question understanding and multilingual reasoning. To address both problems, we propose TRIT (Translation-Reasoning Integrated Training), a self-improving framework that integrates the training of translation into multilingual reasoning. Without external feedback or additional multilingual data, our method jointly enhances multilingual question understanding and response generation. On MMATH, our method outperforms multiple baselines by an average of 7 percentage points, improving both answer correctness and language consistency. Further analysis reveals that integrating translation training improves cross-lingual question alignment by over 10 percentage points and enhances translation quality for both mathematical questions and general-domain text, with gains up to 8.4 COMET points on FLORES-200.
[25] A Systematic Evaluation of Large Language Models for PTSD Severity Estimation: The Role of Contextual Knowledge and Modeling Strategies cs.CLPDF
Panagiotis Kaliosis, Adithya V Ganesan, Oscar N. E. Kjell, Whitney Ringwald, Scott Feltman
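TL;DR: This study systematically evaluates 11 state-of-the-art large language models on PTSD severity estimation, focusing on how contextual knowledge and modeling strategies affect accuracy. Providing detailed construct definitions and narrative context, increasing reasoning effort, and ensembling a supervised model with zero-shot LLMs all significantly improve estimation accuracy.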
Details
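Motivation: Large language models are increasingly used in a zero-shot fashion for mental health assessment, yet the factors that affect their accuracy remain unclear. The study systematically examines how contextual knowledge and different modeling strategies influence LLM performance on PTSD severity estimation.
Result: On a clinical dataset of natural language narratives and self-reported PTSD severity scores from 1,437 individuals: LLMs are most accurate when given detailed construct definitions and narrative context; more reasoning effort improves estimation accuracy; open-weight models plateau beyond 70B parameters while closed-weight models keep improving with newer generations; and ensembling a supervised model with zero-shot LLMs gives the best performance.
Insight: The contribution is a systematic quantification of how contextual knowledge (such as subscale definitions and distribution summaries) and modeling strategies (zero-shot vs. few-shot, amount of reasoning, ensemble methods, and others) affect a mental health assessment task. The structured evaluation offers methodological guidance for clinical deployment of LLMs, and the ensembling results are of particular practical value.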
Abstract: Large language models (LLMs) are increasingly being used in a zero-shot fashion to assess mental health conditions, yet we have limited knowledge on what factors affect their accuracy. In this study, we utilize a clinical dataset of natural language narratives and self-reported PTSD severity scores from 1,437 individuals to comprehensively evaluate the performance of 11 state-of-the-art LLMs. To understand the factors affecting accuracy, we systematically varied (i) contextual knowledge like subscale definitions, distribution summary, and interview questions, and (ii) modeling strategies including zero-shot vs. few-shot, amount of reasoning effort, model sizes, structured subscales vs. direct scalar prediction, output rescaling, and nine ensemble methods. Our findings indicate that (a) LLMs are most accurate when provided with detailed construct definitions and context of the narrative; (b) increased reasoning effort leads to better estimation accuracy; (c) performance of open-weight models (Llama, Deepseek) plateaus beyond 70B parameters while closed-weight models (o3-mini, gpt-5) improve with newer generations; and (d) best performance is achieved when ensembling a supervised model with the zero-shot LLMs. Taken together, the results suggest that the choice of contextual knowledge and modeling strategies is important for deploying LLMs to accurately assess mental health.
[26] Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory cs.CL | cs.AI | cs.LGPDF
Haozhen Zhang, Haodong Yue, Tao Feng, Quanyu Long, Jianzhu Bao
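TL;DR: This paper proposes BudgetMem, a runtime agent memory framework for explicit, query-aware performance-cost control. It structures memory processing as a set of memory modules, each offered in three budget tiers (low/mid/high), and uses a lightweight router to perform budget-tier routing across modules, balancing task performance against memory construction cost.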
Details
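Motivation: Existing LLM agent memory systems rely on offline, query-agnostic memory construction, which is inefficient and may discard critical information, while runtime memory utilization approaches incur large overhead and lack explicit control over the performance-cost trade-off.
Result: On the LoCoMo, LongMemEval, and HotpotQA benchmarks, BudgetMem surpasses strong baselines when performance is prioritized (the high-budget setting) and delivers better accuracy-cost frontiers under tighter budgets.
Insight: The contribution is a unified runtime memory framework that achieves explicit performance-cost control through query-aware budget-tier routing, along with a systematic study of three complementary strategies for realizing budget tiers (implementation, reasoning behavior, and capacity) and an analysis of their trade-offs under different budget regimes.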
Abstract: Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query-aware performance-cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., Low/Mid/High). A lightweight router, implemented as a compact neural policy trained with reinforcement learning, performs budget-tier routing across modules to balance task performance and memory construction cost. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.
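A rough sketch of budget-tier routing across modules follows. It is not the paper's router: the abstract describes a learned neural policy, whereas `route` below is a hand-written greedy heuristic, and the tier costs, module names, hard budget cap, and the "module name appears in the query" relevance signal are all invented for illustration.

```python
TIER_COST = {"low": 1, "mid": 3, "high": 9}  # hypothetical relative costs


def route(query, modules, budget):
    """Toy router: start every module at 'low', then upgrade while budget allows.

    Stand-in for the learned policy; upgrades go first to modules whose name
    appears in the query (a made-up relevance signal).
    """
    tiers = {m: "low" for m in modules}
    spent = sum(TIER_COST[t] for t in tiers.values())
    order = sorted(modules, key=lambda m: m not in query)  # relevant modules first
    for next_tier in ("mid", "high"):
        for m in order:
            extra = TIER_COST[next_tier] - TIER_COST[tiers[m]]
            if spent + extra <= budget:
                tiers[m] = next_tier
                spent += extra
    return tiers, spent


modules = ["retrieve", "summarize", "filter"]
tiers, spent = route("summarize last week's chat", modules, budget=7)
```

Sweeping `budget` from tight to generous traces out a cost-versus-tier curve, which is the kind of accuracy-cost frontier the abstract refers to, although here no accuracy is measured.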
cs.CV [Back]
[27] VISTA: Enhancing Visual Conditioning via Track-Following Preference Optimization in Vision-Language-Action Models cs.CV | cs.AI | cs.LG | cs.ROPDF
Yiye Chen, Yanan Jian, Xiaoyi Dong, Shuxin Cao, Jing Wu
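TL;DR: This paper proposes VISTA, a framework that strengthens visual conditioning in vision-language-action models through track-following preference optimization. It addresses action predictions that depend too weakly on the current visual state, improving reliability and performance on robotic manipulation tasks.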
Details
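Motivation: Extending large pretrained vision-language models to the action space can induce vision-action misalignment in vision-language-action models: action predictions depend only weakly on the current visual state, producing unreliable action outputs.
Result: In both the discrete OpenVLA and continuous OpenVLA-OFT settings, the method improves visual conditioning and task performance without architectural changes or additional data collection.
Insight: The innovation is explicitly strengthening visual conditioning through preference optimization on a track-following task, then transferring the improved alignment to instruction following through latent-space distillation. It is an efficient training strategy that needs no extra architecture or data.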
Abstract: Vision-Language-Action (VLA) models have demonstrated strong performance across a wide range of robotic manipulation tasks. Despite the success, extending large pretrained Vision-Language Models (VLMs) to the action space can induce vision-action misalignment, where action predictions exhibit weak dependence on the current visual state, leading to unreliable action outputs. In this work, we study VLA models through the lens of visual conditioning and empirically show that successful rollouts consistently exhibit stronger visual dependence than failed ones. Motivated by this observation, we propose a training framework that explicitly strengthens visual conditioning in VLA models. Our approach first aligns action prediction with visual input via preference optimization on a track-following surrogate task, and then transfers the enhanced alignment to instruction-following task through latent-space distillation during supervised finetuning. Without introducing architectural modifications or additional data collection, our method improves both visual conditioning and task performance for discrete OpenVLA, and further yields consistent gains when extended to the continuous OpenVLA-OFT setting. Project website: https://vista-vla.github.io/ .
[28] Visual concept ranking uncovers medical shortcuts used by large multimodal models cs.CV | cs.LGPDF
Joseph D. Janizek, Sonnet Xu, Junayd Lateef, Roxana Daneshjou
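TL;DR: This paper proposes Visual Concept Ranking (VCR), a method for identifying the important visual concepts that large multimodal models (LMMs) rely on in medical tasks such as skin lesion classification. It reveals performance gaps across demographic subgroups and potential shortcut features in these models.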
Details
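Motivation: Ensuring the reliability of machine learning models in safety-critical domains such as healthcare requires auditing methods that can uncover model shortcomings, in order to investigate how LMMs behave on medical tasks.
Result: The method is applied to classification of skin lesions, chest radiographs, and natural images. Manual interventions validate the hypotheses VCR generates about visual feature dependencies, and the analysis reveals subgroup differences in model performance.
Insight: The innovation is that VCR systematically generates and validates hypotheses about a model's dependence on specific visual concepts, providing a new tool for understanding and auditing decision shortcuts and potential biases of LMMs in the medical domain.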
Abstract: Ensuring the reliability of machine learning models in safety-critical domains such as healthcare requires auditing methods that can uncover model shortcomings. We introduce a method for identifying important visual concepts within large multimodal models (LMMs) and use it to investigate the behaviors these models exhibit when prompted with medical tasks. We primarily focus on the task of classifying malignant skin lesions from clinical dermatology images, with supplemental experiments including both chest radiographs and natural images. After showing how LMMs display unexpected gaps in performance between different demographic subgroups when prompted with demonstrating examples, we apply our method, Visual Concept Ranking (VCR), to these models and prompts. VCR generates hypotheses related to different visual feature dependencies, which we are then able to validate with manual interventions.
[29] ARGaze: Autoregressive Transformers for Online Egocentric Gaze Estimation cs.CVPDF
Jia Li, Wenjie Zhao, Shijian Deng, Bolin Lai, Yuheng Wu
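TL;DR: ARGaze is an autoregressive Transformer model for online egocentric gaze estimation. It reformulates gaze estimation as a sequential prediction task, predicting the current gaze point from current visual features and a fixed-length context window of recent gaze target estimates, and achieves state-of-the-art performance on multiple egocentric benchmarks.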
Details
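Motivation: Online egocentric gaze estimation lacks explicit head or eye signals. The paper exploits the strong temporal continuity of gaze during goal-directed activities to infer current visual attention from sparse, indirect cues such as hand-object interactions and salient scene content.
Result: The model achieves state-of-the-art performance under online evaluation on multiple egocentric benchmarks, and extensive ablations confirm that autoregressive modeling with bounded gaze history is critical for robust prediction.
Insight: Inspired by vision-conditioned autoregressive decoding in vision-language models, the paper reformulates gaze estimation as sequential prediction and introduces a fixed-length gaze context window that enforces causality and enables bounded-resource streaming inference. The core idea, capturing the temporal continuity of gaze with autoregressive modeling, is especially effective in egocentric settings that lack direct physiological signals.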
Abstract: Online egocentric gaze estimation predicts where a camera wearer is looking from first-person video using only past and current frames, a task essential for augmented reality and assistive technologies. Unlike third-person gaze estimation, this setting lacks explicit head or eye signals, requiring models to infer current visual attention from sparse, indirect cues such as hand-object interactions and salient scene content. We observe that gaze exhibits strong temporal continuity during goal-directed activities: knowing where a person looked recently provides a powerful prior for predicting where they look next. Inspired by vision-conditioned autoregressive decoding in vision-language models, we propose ARGaze, which reformulates gaze estimation as sequential prediction: at each timestep, a transformer decoder predicts current gaze by conditioning on (i) current visual features and (ii) a fixed-length Gaze Context Window of recent gaze target estimates. This design enforces causality and enables bounded-resource streaming inference. We achieve state-of-the-art performance across multiple egocentric benchmarks under online evaluation, with extensive ablations validating that autoregressive modeling with bounded gaze history is critical for robust prediction. We will release our source code and pre-trained models.
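The fixed-length Gaze Context Window and the autoregressive feedback loop can be sketched with a bounded queue. Everything here is illustrative: `GazeContextWindow` and `predict_gaze` are hypothetical names, and the averaging rule is only a placeholder for the transformer decoder described in the abstract.

```python
from collections import deque


class GazeContextWindow:
    """Fixed-length buffer of the most recent gaze estimates (x, y)."""

    def __init__(self, length):
        self.buffer = deque(maxlen=length)  # old entries are dropped automatically

    def push(self, gaze):
        self.buffer.append(gaze)

    def as_list(self):
        return list(self.buffer)


def predict_gaze(visual_feature, context):
    """Placeholder for the transformer decoder: blends a visual cue with history.

    `visual_feature` is treated as an (x, y) saliency peak purely for illustration.
    """
    if not context:
        return visual_feature
    mean_x = sum(x for x, _ in context) / len(context)
    mean_y = sum(y for _, y in context) / len(context)
    return (0.5 * visual_feature[0] + 0.5 * mean_x,
            0.5 * visual_feature[1] + 0.5 * mean_y)


window = GazeContextWindow(length=3)
trajectory = []
for feature in [(0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0)]:
    estimate = predict_gaze(feature, window.as_list())  # uses past estimates only
    window.push(estimate)  # feed the prediction back in (autoregressive)
    trajectory.append(estimate)
```

Because the buffer never grows past its `maxlen`, memory and per-step compute stay constant for arbitrarily long streams, which is what bounded-resource streaming inference requires.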
[30] AirGlove: Exploring Egocentric 3D Hand Tracking and Appearance Generalization for Sensing Gloves cs.CVPDF
Wenhui Cui, Ziyi Kou, Chuan Qin, Ergys Ristani, Li Guan
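TL;DR: This paper proposes AirGlove, which addresses the performance drop of vision-based hand tracking models on sensing gloves caused by appearance differences. By leveraging existing glove data, it generalizes learned glove representations so that new glove designs can be adapted to with limited data, improving 3D hand pose tracking accuracy.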
Details
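Motivation: Sensor-based glove tracking is sensitive to signal quality and calibration, while vision-based bare-hand tracking models, though strong, perform poorly on sensing gloves with very different appearances and have not been systematically evaluated there. The paper aims to close this appearance gap and improve visual tracking across sensing gloves.
Result: Experiments on multiple sensing gloves show that AirGlove effectively generalizes hand pose models to new glove designs and achieves a significant performance boost over the compared schemes.
Insight: The contributions are the first systematic evaluation of vision models on gloved hands under zero-shot and fine-tuning setups, and a framework that uses existing glove data to generalize representations to new gloves, offering a new angle on generalization under appearance shift.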
Abstract: Sensing gloves have become important tools for teleoperation and robotic policy learning as they are able to provide rich signals like speed, acceleration and tactile feedback. A common approach to track gloved hands is to directly use the sensor signals (e.g., angular velocity, gravity orientation) to estimate 3D hand poses. However, sensor-based tracking can be restrictive in practice as the accuracy is often impacted by sensor signal and calibration quality. Recent advances in vision-based approaches have achieved strong performance on human hands via large-scale pre-training, but their performance on gloved hands with distinct visual appearances remains underexplored. In this work, we present the first systematic evaluation of vision-based hand tracking models on gloved hands under both zero-shot and fine-tuning setups. Our analysis shows that existing bare-hand models suffer from substantial performance degradation on sensing gloves due to large appearance gap between bare-hand and glove designs. We therefore propose AirGlove, which leverages existing gloves to generalize the learned glove representations towards new gloves with limited data. Experiments with multiple sensing gloves show that AirGlove effectively generalizes the hand pose models to new glove designs and achieves a significant performance boost over the compared schemes.
[31] GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling cs.CVPDF
Shivanshu Shekhar, Uttaran Bhattacharya, Raghavendra Addanki, Mehrab Tanjim, Somdeb Sarkhel
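TL;DR: This paper proposes GT-SVJ, which reformulates state-of-the-art video generation models as energy-based models so they can serve as temporally-aware reward models for assessing video quality and aligning with human preferences. Challenging synthetic negative videos force the model to learn meaningful spatiotemporal features, and the method reaches state-of-the-art performance on multiple benchmarks with only a small amount of human-annotated data.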
Details
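Motivation: Video reward modeling based on vision-language models (VLMs) struggles to capture subtle temporal dynamics, whereas video generative models are designed to model temporal structure. The paper explores repurposing video generative models as strong temporally-aware reward models to align video generation with human preferences more efficiently.
Result: GT-SVJ achieves state-of-the-art performance on GenAI-Bench and MonteBench using only 30K human annotations, 6x to 65x fewer than existing VLM-based approaches.
Insight: The core innovation is reformulating video generative models as energy-based models that can discriminate video quality. Controlled latent-space perturbations (temporal slicing, feature swapping, and frame shuffling) create challenging synthetic negatives that force the model to learn meaningful spatiotemporal features instead of superficial artifacts, enabling efficient and precise reward modeling.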
Abstract: Aligning video generative models with human preferences remains challenging: current approaches rely on Vision-Language Models (VLMs) for reward modeling, but these models struggle to capture subtle temporal dynamics. We propose a fundamentally different approach: repurposing video generative models, which are inherently designed to model temporal structure, as reward models. We present the Generative-Transformer-based Self-Supervised Video Judge (GT-SVJ), a novel evaluation model that transforms state-of-the-art video generation models into powerful temporally-aware reward models. Our key insight is that generative models can be reformulated as energy-based models (EBMs) that assign low energy to high-quality videos and high energy to degraded ones, enabling them to discriminate video quality with remarkable precision when trained via contrastive objectives. To prevent the model from exploiting superficial differences between real and generated videos, we design challenging synthetic negative videos through controlled latent-space perturbations: temporal slicing, feature swapping, and frame shuffling, which simulate realistic but subtle visual degradations. This forces the model to learn meaningful spatiotemporal features rather than trivial artifacts. GT-SVJ achieves state-of-the-art performance on GenAI-Bench and MonteBench using only 30K human annotations: $6\times$ to $65\times$ fewer than existing VLM-based approaches.
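The three perturbations named in the abstract can be sketched on a toy clip represented as a list of per-frame latents. The abstract does not define the operations precisely, so the readings below (a contiguous sub-clip, a reordering, and a single-frame swap between clips) and all function names are assumptions; frames are plain strings here so they can be compared and hashed.

```python
import random


def temporal_slice(frames, start, length):
    """Keep a contiguous sub-clip (one assumed reading of 'temporal slicing')."""
    return frames[start:start + length]


def frame_shuffle(frames, rng):
    """Return the same frames in a different temporal order."""
    if len(set(frames)) < 2:
        return list(frames)  # nothing to reorder
    shuffled = list(frames)
    while shuffled == list(frames):
        rng.shuffle(shuffled)
    return shuffled


def feature_swap(frames_a, frames_b, index):
    """Replace one frame latent in clip A with the latent at the same index in clip B."""
    swapped = list(frames_a)
    swapped[index] = frames_b[index]
    return swapped


rng = random.Random(0)
clip_a = ["a0", "a1", "a2", "a3", "a4", "a5"]
clip_b = ["b0", "b1", "b2", "b3", "b4", "b5"]

negatives = {
    "slice": temporal_slice(clip_a, start=2, length=3),
    "shuffle": frame_shuffle(clip_a, rng),
    "swap": feature_swap(clip_a, clip_b, index=3),
}
```

Each negative keeps most per-frame content intact while breaking temporal or cross-clip consistency, so a judge trained to separate `clip_a` from these variants cannot rely on frame-level artifacts alone.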
[32] E.M.Ground: A Temporal Grounding Vid-LLM with Holistic Event Perception and Matching cs.CVPDF
Jiahao Nie, Wenbin An, Gongjie Zhang, Yicheng Xu, Yap-Peng Tan
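TL;DR: This paper proposes E.M.Ground, a new video large language model (Vid-LLM) for temporal video grounding (TVG) that aims to localize the time segment corresponding to a query event precisely through holistic event perception and matching. The model introduces a special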
Details
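Motivation: Existing Vid-LLM methods for temporal video grounding usually match start and end frames with separate tokens and depend on exact timestamps, but they fail to capture the semantic continuity and integrity of events, which leads to ambiguous localization. The paper addresses this limitation and improves perception of the overall coherence of events.
Result: Extensive experiments on multiple benchmark datasets show that E.M.Ground consistently outperforms state-of-the-art Vid-LLMs by significant margins.
Insight: The innovations include: 1) introducing a special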
Abstract: Despite recent advances in Video Large Language Models (Vid-LLMs), Temporal Video Grounding (TVG), which aims to precisely localize time segments corresponding to query events, remains a significant challenge. Existing methods often match start and end frames by comparing frame features with two separate tokens, relying heavily on exact timestamps. However, this approach fails to capture the event’s semantic continuity and integrity, leading to ambiguities. To address this, we propose E.M.Ground, a novel Vid-LLM for TVG that focuses on holistic and coherent event perception. E.M.Ground introduces three key innovations: (i) a special
[33] Cross-Domain Few-Shot Segmentation via Multi-view Progressive Adaptation cs.CVPDF
Jiahao Nie, Guanqiao Fu, Wenbin An, Yap-Peng Tan, Alex C. Kot
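TL;DR: This paper proposes Multi-view Progressive Adaptation (MPA) for cross-domain few-shot segmentation. It adapts to the target domain progressively from both data and strategy perspectives: Hybrid Progressive Augmentation generates diverse and complex views, and Dual-chain Multi-view Prediction learns from these views under extensive supervision. Experiments show that MPA significantly improves few-shot segmentation performance on target domains.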
Details
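Motivation: Cross-domain few-shot segmentation aims to segment categories in data-scarce domains from a few exemplars. Existing methods are limited because target samples are few and lack diversity, and because the source-trained model starts with weak few-shot capability in the target domain and faces a large domain gap.
Result: On cross-domain few-shot segmentation benchmarks, MPA outperforms existing state-of-the-art methods by a large margin (+7.0%).
Insight: From the data perspective, Hybrid Progressive Augmentation generates progressively more complex views; from the strategy perspective, Dual-chain Multi-view Prediction exploits those views through sequential and parallel learning paths. Enforcing prediction consistency across views of differing complexity yields robust and accurate domain adaptation.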
Abstract: Cross-Domain Few-Shot Segmentation aims to segment categories in data-scarce domains conditioned on a few exemplars. Typical methods first establish few-shot capability in a large-scale source domain and then adapt it to target domains. However, due to the limited quantity and diversity of target samples, existing methods still exhibit constrained performance. Moreover, the source-trained model’s initially weak few-shot capability in target domains, coupled with substantial domain gaps, severely hinders the effective utilization of target samples and further impedes adaptation. To this end, we propose Multi-view Progressive Adaptation, which progressively adapts few-shot capability to target domains from both data and strategy perspectives. (i) From the data perspective, we introduce Hybrid Progressive Augmentation, which progressively generates more diverse and complex views through cumulative strong augmentations, thereby creating increasingly challenging learning scenarios. (ii) From the strategy perspective, we design Dual-chain Multi-view Prediction, which fully leverages these progressively complex views through sequential and parallel learning paths under extensive supervision. By jointly enforcing prediction consistency across diverse and complex views, MPA achieves both robust and accurate adaptation to target domains. Extensive experiments demonstrate that MPA effectively adapts few-shot capability to target domains, outperforming state-of-the-art methods by a large margin (+7.0%).
[34] PatchFlow: Leveraging a Flow-Based Model with Patch Features cs.CV | cs.LGPDF
Boxiang Zhang, Baijian Yang, Xiaoming Wang, Corey Vian
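TL;DR: This paper proposes PatchFlow, which combines local neighbor-aware patch features with a normalizing flow model and introduces an adapter module to bridge the gap between a generic pretrained feature extractor and industrial product images, improving the efficiency and accuracy of automated anomaly detection.
Details
Motivation: Surface defects impede quality control in the die casting industry, which calls for automated inspection. Existing computer vision methods suffer from a gap between generic pretrained features and industrial images, which limits detection accuracy.
Result: On MVTec AD, the method reaches an image-level AUROC of 99.28%, a 20% reduction in error rate. On VisA, it reaches an image-level AUROC of 96.48%, a 28.2% reduction in error. On a proprietary die casting dataset, it achieves 95.77% anomaly detection accuracy without any anomalous training samples.
Insight: The contributions are the combination of local patch features with a flow model for anomaly detection and an adapter module that improves feature transfer. Viewed objectively, the method uses unsupervised learning to improve the generalization and accuracy of industrial defect detection.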
Abstract: Die casting plays a crucial role across various industries due to its ability to craft intricate shapes with high precision and smooth surfaces. However, surface defects remain a major issue that impedes die casting quality control. Recently, computer vision techniques have been explored to automate and improve defect detection. In this work, we combine local neighbor-aware patch features with a normalizing flow model and bridge the gap between the generic pretrained feature extractor and industrial product images by introducing an adapter module to increase the efficiency and accuracy of automated anomaly detection. Compared to state-of-the-art methods, our approach reduces the error rate by 20% on the MVTec AD dataset, achieving an image-level AUROC of 99.28%. Our approach also improves performance on the VisA dataset, achieving an image-level AUROC of 96.48%, which represents a 28.2% reduction in error compared to state-of-the-art models. Additionally, experiments on a proprietary die casting dataset yield an accuracy of 95.77% for anomaly detection, without requiring any anomalous samples for training. Our method illustrates the potential of leveraging computer vision and deep learning techniques to advance inspection capabilities for the die casting industry.
[35] RFM-Pose:Reinforcement-Guided Flow Matching for Fast Category-Level 6D Pose Estimation cs.CV | cs.ROPDF
Diya He, Qingchen Liu, Cong Zhang, Jiahu Qin
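TL;DR: This paper proposes RFM-Pose, a framework that accelerates category-level 6D object pose estimation. It uses a flow-matching generative model to produce pose candidates efficiently, models the sampling process as a Markov decision process, and fine-tunes the sampling policy with proximal policy optimization, jointly optimizing pose generation and hypothesis scoring.
Details
Motivation: Score-based generative models for category-level pose estimation are limited in efficiency by their high sampling cost. The goal is to speed up pose generation while actively evaluating sampled hypotheses.
Result: RFM-Pose achieves favorable performance on the REAL275 benchmark while significantly reducing computational cost. It can also be adapted to object pose tracking, where it attains competitive results.
Insight: The key novelty is formalizing flow-matching sampling as a Markov decision process and using a reinforcement learning framework to jointly optimize generation and scoring. Replacing score-based diffusion with flow matching to improve sampling efficiency is the core contribution.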
Abstract: Object pose estimation is a fundamental problem in computer vision and plays a critical role in virtual reality and embodied intelligence, where agents must understand and interact with objects in 3D space. Recently, score-based generative models have, to some extent, resolved the rotational symmetry ambiguity problem in category-level pose estimation, but their efficiency remains limited by the high sampling cost of score-based diffusion. In this work, we propose a new framework, RFM-Pose, that accelerates category-level 6D object pose generation while actively evaluating sampled hypotheses. To improve sampling efficiency, we adopt a flow-matching generative model and generate pose candidates along an optimal transport path from a simple prior to the pose distribution. To further refine these candidates, we cast the flow-matching sampling process as a Markov decision process and apply proximal policy optimization to fine-tune the sampling policy. In particular, we interpret the flow field as a learnable policy and map an estimator to a value network, enabling joint optimization of pose generation and hypothesis scoring within a reinforcement learning framework. Experiments on the REAL275 benchmark demonstrate that RFM-Pose achieves favorable performance while significantly reducing computational cost. Moreover, similar to prior work, our approach can be readily adapted to object pose tracking and attains competitive results in this setting.
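The idea of generating candidates along a straight-line (optimal transport) path can be illustrated with a toy Euler sampler. This is a minimal sketch and not the RFM-Pose implementation: the "pose" is reduced to a hypothetical 3-D translation, the target distribution is a single point, and the closed-form `velocity` function stands in for a learned flow field. All names are assumptions made for illustration.

```python
import random


def velocity(x, t, target):
    """Velocity of the straight-line path x_t = (1 - t) * x0 + t * target.

    For a single target point the velocity field is (target - x) / (1 - t).
    A learned network would replace this closed form.
    """
    return [(m - xi) / (1.0 - t) for xi, m in zip(x, target)]


def euler_sample(x0, target, num_steps=8):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
target = [0.3, -0.1, 0.5]                        # hypothetical translation
prior = [random.gauss(0.0, 1.0) for _ in target]  # sample from a simple prior
sample = euler_sample(prior, target)
```

Because the path is a straight line, a handful of Euler steps is enough in this toy setting, which is the intuition behind the lower sampling cost compared with score-based diffusion.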
[36] Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs cs.CVPDF
Qi Li, Yanzhe Zhao, Yongxin Zhou, Yameng Wang, Yandong Yang
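TL;DR: This paper proposes the Magic-MM-Embedding model series, which addresses the high computational cost that multimodal large language models incur in universal multimodal retrieval when processing large numbers of visual tokens. By combining an efficient visual-token-compression architecture with a multi-stage progressive training strategy, it achieves both high inference efficiency and state-of-the-art performance.
Details
Motivation: MLLMs show great promise for universal multimodal retrieval, but practical use is often hindered by the substantial cost of processing the many tokens produced by visual inputs. The paper aims to build a universal multimodal embedding model that is both efficient and high-performing.
Result: Comprehensive experiments show that the model outperforms existing methods by a large margin and reaches state-of-the-art performance while being more inference-efficient.
Insight: The approach rests on two synergistic pillars. The first is an efficient MLLM architecture with visual token compression that reduces latency and memory footprint. The second is a coarse-to-fine multi-stage progressive training strategy: continued pretraining, large-scale contrastive pretraining with hard negative mining, and task-aware fine-tuning guided by an MLLM-as-a-Judge, which together recover and then significantly boost model capability.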
Abstract: Multimodal Large Language Models (MLLMs) have shown immense promise in universal multimodal retrieval, which aims to find relevant items of various modalities for a given query. But their practical application is often hindered by the substantial computational cost incurred from processing a large number of tokens from visual inputs. In this paper, we propose Magic-MM-Embedding, a series of novel models that achieve both high efficiency and state-of-the-art performance in universal multimodal embedding. Our approach is built on two synergistic pillars: (1) a highly efficient MLLM architecture incorporating visual token compression to drastically reduce inference latency and memory footprint, and (2) a multi-stage progressive training strategy designed to not only recover but significantly boost performance. This coarse-to-fine training paradigm begins with extensive continued pretraining to restore multimodal understanding and generation capabilities, progresses to large-scale contrastive pretraining and hard negative mining to enhance discriminative power, and culminates in a task-aware fine-tuning stage guided by an MLLM-as-a-Judge for precise data curation. Comprehensive experiments show that our model outperforms existing methods by a large margin while being more inference-efficient.
[37] FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion cs.CV | cs.AI | cs.CLPDF
Zhuokun Chen, Jianfei Cai, Bohan Zhuang
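TL;DR: FlashBlock is an efficient attention caching mechanism for long-context block diffusion models. Based on the observation that attention to tokens outside the current block is redundant across diffusion steps, it reuses the stable block-external attention output, reducing attention computation and KV cache access and improving inference efficiency.
Details
Motivation: Generating long-form content such as long videos and long texts is increasingly important for modern generative models. Block diffusion improves inference efficiency through KV caching and block-wise causal inference, but in long-context settings it still incurs substantial overhead from repeatedly computing attention over a growing KV cache.
Result: Experiments on diffusion language models and video generation show up to 1.44× higher token throughput and up to a 1.6× reduction in attention time, with negligible impact on generation quality.
Insight: The core novelty is identifying and exploiting the cross-step redundancy of block-external attention in block diffusion, which leads to a caching mechanism that reuses stable attention outputs without modifying the diffusion process. The method is orthogonal to sparse attention and can serve as a complementary residual reuse strategy, substantially improving model accuracy under aggressive sparsification.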
Abstract: Generating long-form content, such as minute-long videos and extended texts, is increasingly important for modern generative models. Block diffusion improves inference efficiency via KV caching and block-wise causal inference and has been widely adopted in diffusion language models and video generation. However, in long-context settings, block diffusion still incurs substantial overhead from repeatedly computing attention over a growing KV cache. We identify an underexplored property of block diffusion: cross-step redundancy of attention within a block. Our analysis shows that attention outputs from tokens outside the current block remain largely stable across diffusion steps, while block-internal attention varies significantly. Based on this observation, we propose FlashBlock, a cached block-external attention mechanism that reuses stable attention output, reducing attention computation and KV cache access without modifying the diffusion process. Moreover, FlashBlock is orthogonal to sparse attention and can be combined as a complementary residual reuse strategy, substantially improving model accuracy under aggressive sparsification. Experiments on diffusion language models and video generation demonstrate up to 1.44$\times$ higher token throughput and up to 1.6$\times$ reduction in attention time, with negligible impact on generation quality. Project page: https://caesarhhh.github.io/FlashBlock/.
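One way to see why a block-external attention output can be cached separately is the standard log-sum-exp merge of partial attention results. The sketch below is an illustration under assumptions, not FlashBlock's actual implementation: it uses a single query vector, tiny hand-written keys and values, and hypothetical helper names (`partial_attention`, `merge`). With an unchanged query the merge is exact; reusing the external part across diffusion steps, where the query changes slightly, is the approximation that the abstract reports to be stable in practice.

```python
import math


def partial_attention(q, keys, values):
    """Softmax attention of one query over a key subset.

    Returns the attention output and the log-sum-exp of the scaled scores,
    which is all that is needed to merge this result with another subset.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(dim)]
    return out, m + math.log(z)


def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results into attention over both subsets."""
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    return [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]


q = [0.2, -0.4]
ext_k, ext_v = [[1.0, 0.0], [0.5, 0.5]], [[1.0, 2.0], [3.0, 4.0]]    # outside the block
int_k, int_v = [[0.0, 1.0], [-1.0, 0.3]], [[5.0, 6.0], [7.0, 8.0]]   # inside the block

cached_out, cached_lse = partial_attention(q, ext_k, ext_v)  # computed once, then reused
inner_out, inner_lse = partial_attention(q, int_k, int_v)    # recomputed every step
merged = merge(cached_out, cached_lse, inner_out, inner_lse)
full, _ = partial_attention(q, ext_k + int_k, ext_v + int_v)
```

In this toy setup `merged` matches `full` up to floating-point error, so only the block-internal part needs to be recomputed when the external part is treated as fixed.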
[38] MTPano: Multi-Task Panoramic Scene Understanding via Label-Free Integration of Dense Prediction Priors cs.CVPDF
Jingdong Zhang, Xiaohang Zhan, Lingzhi Zhang, Yizhou Wang, Zhengming Yu
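TL;DR: MTPano is a multi-task panoramic foundation model trained without labels. It addresses data scarcity in panoramic scene understanding by projecting panoramic images into perspective patches and generating pseudo-labels with off-the-shelf foundation models. The model uses a Panoramic Dual BridgeNet whose geometry-aware modulation layers separate rotation-invariant and rotation-variant task feature streams, adds ERP token mixers and gradient truncation to handle equirectangular projection distortion, and uses auxiliary tasks to promote cross-task learning.
Details
Motivation: Directly transferring perspective foundation models to panoramas fails because high-resolution multi-task annotations are scarce, geometric distortion is severe, and coordinate systems differ. The paper also explores the underexplored relations between dense prediction tasks in spherical space.
Result: MTPano achieves state-of-the-art performance on multiple benchmarks and is competitive with task-specific panoramic specialist foundation models.
Insight: The contributions are a label-free training pipeline that generates pseudo-labels from perspective dense priors, a grouping of tasks by rotation property with a Panoramic Dual BridgeNet for feature disentanglement, and ERP token mixers with gradient truncation to handle distortion and task conflicts. These ideas are transferable to cross-domain multi-task learning and geometry-aware model design.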
Abstract: Comprehensive panoramic scene understanding is critical for immersive applications, yet it remains challenging due to the scarcity of high-resolution, multi-task annotations. While perspective foundation models have achieved success through data scaling, directly adapting them to the panoramic domain often fails due to severe geometric distortions and coordinate system discrepancies. Furthermore, the underlying relations between diverse dense prediction tasks in spherical spaces are underexplored. To address these challenges, we propose MTPano, a robust multi-task panoramic foundation model established by a label-free training pipeline. First, to circumvent data scarcity, we leverage powerful perspective dense priors. We project panoramic images into perspective patches to generate accurate, domain-gap-free pseudo-labels using off-the-shelf foundation models, which are then re-projected to serve as patch-wise supervision. Second, to tackle the interference between task types, we categorize tasks into rotation-invariant (e.g., depth, segmentation) and rotation-variant (e.g., surface normals) groups. We introduce the Panoramic Dual BridgeNet, which disentangles these feature streams via geometry-aware modulation layers that inject absolute position and ray direction priors. To handle the distortion from equirectangular projections (ERP), we incorporate ERP token mixers followed by a dual-branch BridgeNet for interactions with gradient truncation, facilitating beneficial cross-task information sharing while blocking conflicting gradients from incompatible task attributes. Additionally, we introduce auxiliary tasks (image gradient, point map, etc.) to fertilize the cross-task learning process. Extensive experiments demonstrate that MTPano achieves state-of-the-art performance on multiple benchmarks and delivers competitive results against task-specific panoramic specialist foundation models.
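The ray direction prior mentioned in the abstract depends on the standard mapping from an equirectangular (ERP) pixel to a unit ray on the sphere. The sketch below shows that mapping only. The axis convention (y up, z forward) and the function name `erp_pixel_to_ray` are assumptions for illustration; the paper may use a different convention.

```python
import math


def erp_pixel_to_ray(u, v, width, height):
    """Map continuous ERP pixel coordinates to a unit ray direction.

    Assumed convention: u spans longitude [-pi, pi) left to right,
    v spans latitude [pi/2, -pi/2] top to bottom, y is up, z is forward.
    """
    lon = (u / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - v / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


center = erp_pixel_to_ray(512, 256, 1024, 512)  # image center looks forward
top = erp_pixel_to_ray(0, 0, 1024, 512)         # top row looks straight up
```

Every pixel in the top row maps to almost the same upward ray, which illustrates the severe distortion near the poles that motivates ERP-specific handling.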
[39] Consistency-Preserving Concept Erasure via Unsafe-Safe Pairing and Directional Fisher-weighted Adaptation cs.CV | cs.LGPDF
Yongwoo Kim, Sungmin Cha, Hyunsoo Kim, Jaewon Lee, Donghyun Kim
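TL;DR: This paper proposes PAIR, a framework that uses unsafe-safe paired data to reframe concept erasure from simple removal into consistency-preserving semantic realignment. The goal is to remove undesirable concepts (e.g., harmful content) from text-to-image diffusion models while preserving the structural and semantic consistency of generated images.
Details
Motivation: Existing concept erasure methods focus on removing unsafe concepts without guiding the model toward corresponding safe alternatives, which makes it hard to preserve structural and semantic consistency between the original and erased generations.
Result: Extensive experiments show that the method significantly outperforms state-of-the-art baselines, achieving effective erasure while preserving structural integrity, semantic coherence, and generation quality.
Insight: The contributions are: 1) unsafe-safe paired data generation and a Paired Semantic Realignment objective that explicitly maps target concepts to semantically aligned safe anchors; 2) Fisher-weighted Initialization for DoRA, which uses the paired data to initialize parameter-efficient low-rank adaptation matrices so that safe alternatives are encouraged and unsafe concepts are selectively suppressed. Viewed objectively, the method recasts erasure as a semantic alignment problem and achieves fine-grained control through paired data and weight initialization, improving the precision and consistency of erasure.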
Abstract: With the increasing versatility of text-to-image diffusion models, the ability to selectively erase undesirable concepts (e.g., harmful content) has become indispensable. However, existing concept erasure approaches primarily focus on removing unsafe concepts without providing guidance toward corresponding safe alternatives, which often leads to failure in preserving the structural and semantic consistency between the original and erased generations. In this paper, we propose a novel framework, PAIRed Erasing (PAIR), which reframes concept erasure from simple removal to consistency-preserving semantic realignment using unsafe-safe pairs. We first generate safe counterparts from unsafe inputs while preserving structural and semantic fidelity, forming paired unsafe-safe multimodal data. Leveraging these pairs, we introduce two key components: (1) Paired Semantic Realignment, a guided objective that uses unsafe-safe pairs to explicitly map target concepts to semantically aligned safe anchors; and (2) Fisher-weighted Initialization for DoRA, which initializes parameter-efficient low-rank adaptation matrices using unsafe-safe pairs, encouraging the generation of safe alternatives while selectively suppressing unsafe concepts. Together, these components enable fine-grained erasure that removes only the targeted concepts while maintaining overall semantic consistency. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving effective concept erasure while preserving structural integrity, semantic coherence, and generation quality.
[40] Multimodal Latent Reasoning via Hierarchical Visual Cues Injection cs.CVPDF
Yiming Zhang, Qiangyu Yan, Borui Jiang, Kai Han
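TL;DR: This paper proposes HIVE, a multimodal latent reasoning framework that injects hierarchical visual cues so that multimodal large language models (MLLMs) move from "fast thinking" based on end-to-end generation to "slow thinking" with iterative refinement in latent space, improving understanding and reasoning on complex scenes.
Details
Motivation: The reasoning of current MLLMs typically relies on end-to-end generation or explicit, language-centric chains of thought, which can be inefficient, verbose, and prone to hallucination. A robust reasoning approach that integrates multimodal signals seamlessly in latent space is therefore needed.
Result: Extensive evaluations show that test-time scaling is effective when visual knowledge is incorporated, and that integrating hierarchical information significantly improves the model's understanding of complex scenes.
Insight: The framework recursively extends transformer blocks to create an internal loop for iterative reasoning refinement, and injects hierarchical visual cues, from global scene context to fine-grained regional details, directly into the latent representations. This enables visually grounded, multi-step reasoning in latent space.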
Abstract: The advancement of multimodal large language models (MLLMs) has enabled impressive perception capabilities. However, their reasoning process often remains a “fast thinking” paradigm, reliant on end-to-end generation or explicit, language-centric chains of thought (CoT), which can be inefficient, verbose, and prone to hallucination. This work posits that robust reasoning should evolve within a latent space, integrating multimodal signals seamlessly. We propose multimodal latent reasoning via HIerarchical Visual cuEs injection (HIVE), a novel framework that instills deliberate, “slow thinking” without depending on superficial textual rationales. Our method recursively extends transformer blocks, creating an internal loop for iterative reasoning refinement. Crucially, it grounds this process by injecting hierarchical visual cues, from global scene context to fine-grained regional details, directly into the model’s latent representations. This enables the model to perform grounded, multi-step inference entirely in the aligned latent space. Extensive evaluations demonstrate that test-time scaling is effective when incorporating vision knowledge, and that integrating hierarchical information significantly enhances the model’s understanding of complex scenes.
[41] Imagine a City: CityGenAgent for Procedural 3D City Generation cs.CVPDF
Zishan Liu, Zecong Tang, RuoCheng Wu, Xinzhe Zheng, Jingyu Hu
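TL;DR: This paper proposes CityGenAgent, a natural-language-driven framework for hierarchical procedural generation of high-quality 3D cities. It decomposes city generation into two interpretable components, Block Program and Building Program, and uses a two-stage learning strategy of supervised fine-tuning (SFT) followed by reinforcement learning (RL) to ensure structural correctness and semantic alignment. The framework supports natural language editing and manipulation and outperforms existing methods in semantic alignment, visual quality, and controllability.
Details
Motivation: Existing automatic 3D city generation methods fall short in high-fidelity asset creation, controllability, and manipulation, which limits applications such as autonomous driving, virtual reality, and embodied intelligence.
Result: Comprehensive evaluations show that CityGenAgent outperforms existing methods in semantic alignment, visual quality, and controllability, establishing a solid foundation for scalable 3D city generation.
Insight: The main novelty is decomposing city generation into two interpretable procedural components (blocks and buildings) and training them in two stages: SFT to generate valid programs that satisfy schema constraints, and RL with a Spatial Alignment Reward and a Visual Consistency Reward. This yields better semantic understanding, spatial reasoning, and text-visual alignment, and ultimately supports natural language editing. Viewed objectively, combining procedural generation with learned reward mechanisms is an effective route to better controllability and quality.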
Abstract: The automated generation of interactive 3D cities is a critical challenge with broad applications in autonomous driving, virtual reality, and embodied intelligence. While recent advances in generative models and procedural techniques have improved the realism of city generation, existing methods often struggle with high-fidelity asset creation, controllability, and manipulation. In this work, we introduce CityGenAgent, a natural language-driven framework for hierarchical procedural generation of high-quality 3D cities. Our approach decomposes city generation into two interpretable components, Block Program and Building Program. To ensure structural correctness and semantic alignment, we adopt a two-stage learning strategy: (1) Supervised Fine-Tuning (SFT). We train BlockGen and BuildingGen to generate valid programs that adhere to schema constraints, including non-self-intersecting polygons and complete fields; (2) Reinforcement Learning (RL). We design Spatial Alignment Reward to enhance spatial reasoning ability and Visual Consistency Reward to bridge the gap between textual descriptions and the visual modality. Benefiting from the programs and the models’ generalization, CityGenAgent supports natural language editing and manipulation. Comprehensive evaluations demonstrate superior semantic alignment, visual quality, and controllability compared to existing methods, establishing a robust foundation for scalable 3D city generation.
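The schema constraints named in the abstract (complete fields and non-self-intersecting polygons) can be checked with a small validator. This is a sketch under assumptions: the field names `block_id` and `footprint` are hypothetical and not taken from the paper, and the crossing test only detects proper edge crossings, ignoring degenerate collinear overlaps.

```python
def _orient(a, b, c):
    """Signed area test: > 0 if a, b, c turn counter-clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _segments_cross(p1, p2, p3, p4):
    """True if segment p1-p2 properly crosses segment p3-p4."""
    d1, d2 = _orient(p3, p4, p1), _orient(p3, p4, p2)
    d3, d4 = _orient(p1, p2, p3), _orient(p1, p2, p4)
    return d1 * d2 < 0 and d3 * d4 < 0


def is_simple_polygon(points):
    """Check that no two non-adjacent edges of the closed polygon cross."""
    n = len(points)
    if n < 3:
        return False
    for i in range(n):
        for j in range(i + 1, n):
            if j == i + 1 or (i == 0 and j == n - 1):
                continue  # edges that share a vertex cannot properly cross
            if _segments_cross(points[i], points[(i + 1) % n],
                               points[j], points[(j + 1) % n]):
                return False
    return True


REQUIRED_FIELDS = ("block_id", "footprint")  # hypothetical schema fields


def validate_block(block):
    """Return True if all required fields exist and the footprint is simple."""
    if any(field not in block for field in REQUIRED_FIELDS):
        return False
    return is_simple_polygon(block["footprint"])


square = {"block_id": "b0", "footprint": [(0, 0), (1, 0), (1, 1), (0, 1)]}
bowtie = {"block_id": "b1", "footprint": [(0, 0), (1, 1), (1, 0), (0, 1)]}
```

A check like this could act as a filter on generated programs before rendering; whether the authors validate programs this way is not stated in the abstract.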
[42] VRIQ: Benchmarking and Analyzing Visual-Reasoning IQ of VLMs cs.CV | cs.LGPDF
Tina Khezresmaeilzadeh, Jike Zhong, Konstantinos Psounis
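TL;DR: This paper introduces VRIQ, a benchmark for assessing and analyzing the nonverbal visual reasoning ability of vision language models (VLMs). Current VLMs perform near random on abstract puzzle tasks (about 28% average accuracy) and somewhat better but still weakly on natural-image tasks (45% accuracy), and tool-augmented reasoning brings only modest gains. Diagnostic analysis shows that failures stem mainly from perception: 56% from perception alone, 43% from perception and reasoning together, and only 1% from reasoning alone.
Details
Motivation: As VLMs advance, it is necessary to test whether they can reliably perform nonverbal visual reasoning. The authors therefore build a dedicated benchmark to analyze their capabilities and limitations systematically.
Result: On VRIQ, VLMs average about 28% accuracy on abstract puzzle tasks (near random) and 45% on natural-image tasks. Tool-augmented reasoning yields only modest improvements. The diagnostic analysis quantifies the share of perception and reasoning errors.
Insight: The paper contributes a fine-grained visual reasoning benchmark and uses diagnostic probes to show that current VLM failures stem mainly from perception limitations rather than reasoning ability, offering a principled basis for improving multimodal systems.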
Abstract: Recent progress in Vision Language Models (VLMs) has raised the question of whether they can reliably perform nonverbal reasoning. To this end, we introduce VRIQ (Visual Reasoning IQ), a novel benchmark designed to assess and analyze the visual reasoning ability of VLMs. We evaluate models on two sets of tasks: abstract puzzle-style and natural-image reasoning tasks. We find that on abstract puzzles, performance remains near random with an average accuracy of around 28%, while natural tasks yield better but still weak results with 45% accuracy. We also find that tool-augmented reasoning demonstrates only modest improvements. To uncover the source of this weakness, we introduce diagnostic probes targeting perception and reasoning. Our analysis demonstrates that around 56% of failures arise from perception alone, 43% from both perception and reasoning, and only a mere 1% from reasoning alone. This motivates us to design fine-grained diagnostic probe questions targeting specific perception categories (e.g., shape, count, position, 3D/depth), revealing that certain categories cause more failures than others. Our benchmark and analysis establish that current VLMs, even with visual reasoning tools, remain unreliable abstract reasoners, mostly due to perception limitations, and offer a principled basis for improving visual reasoning in multimodal systems.
[43] Dolphin-v2: Universal Document Parsing via Scalable Anchor Prompting cs.CVPDF
Hao Feng, Wei Shi, Ke Zhang, Xiang Fei, Lei Liao
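TL;DR: Dolphin-v2 is a two-stage universal document parsing model that aims to unify and improve document parsing. It jointly performs document type classification (digital-born versus photographed) and layout analysis, then applies a hybrid parsing strategy: photographed documents are parsed holistically at the page level to handle geometric distortion, while digital-born documents are parsed element by element in parallel, guided by the detected layout anchors, for efficient content extraction.
Details
Motivation: Document parsing is fragmented across many specialized models, and existing two-stage methods rely on axis-aligned bounding boxes, so they cannot handle distorted or photographed documents effectively.
Result: The model is evaluated on DocPTBench, OmniDocBench, and the self-constructed RealDoc-160 benchmark. It gains 14.78 points overall on the challenging OmniDocBench and reduces errors on photographed documents by 91%, while parallel processing keeps inference efficient.
Insight: The contributions are: 1) robust parsing of photographed documents via holistic page-level understanding; 2) finer-grained element detection (21 categories) with semantic attribute extraction such as author information and document metadata; 3) code block recognition with indentation preservation, which existing systems typically lack. Viewed objectively, scalable anchor prompting and the hybrid parsing strategy unify the handling of digital-born and photographed documents and improve generality and robustness.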
Abstract: Document parsing has garnered widespread attention as vision-language models (VLMs) advance OCR capabilities. However, the field remains fragmented across dozens of specialized models with varying strengths, forcing users to navigate complex model selection and limiting system scalability. Moreover, existing two-stage approaches depend on axis-aligned bounding boxes for layout detection, failing to handle distorted or photographed documents effectively. To this end, we present Dolphin-v2, a two-stage document image parsing model that substantially improves upon the original Dolphin. In the first stage, Dolphin-v2 jointly performs document type classification (digital-born versus photographed) alongside layout analysis. For digital-born documents, it conducts finer-grained element detection with reading order prediction. In the second stage, we employ a hybrid parsing strategy: photographed documents are parsed holistically as complete pages to handle geometric distortions, while digital-born documents undergo element-wise parallel parsing guided by the detected layout anchors, enabling efficient content extraction. Compared with the original Dolphin, Dolphin-v2 introduces several crucial enhancements: (1) robust parsing of photographed documents via holistic page-level understanding, (2) finer-grained element detection (21 categories) with semantic attribute extraction such as author information and document metadata, and (3) code block recognition with indentation preservation, which existing systems typically lack. Comprehensive evaluations are conducted on DocPTBench, OmniDocBench, and our self-constructed RealDoc-160 benchmark. The results demonstrate substantial improvements: +14.78 points overall on the challenging OmniDocBench and 91% error reduction on photographed documents, while maintaining efficient inference through parallel processing.
[44] TSBOW: Traffic Surveillance Benchmark for Occluded Vehicles Under Various Weather Conditions cs.CVPDF
Ngoc Doan-Minh Huynh, Duong Nguyen-Ngoc Tran, Long Hoang Pham, Tai Huu-Phuong Tran, Hyung-Joon Jeon
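TL;DR: This paper introduces TSBOW, a traffic surveillance benchmark dataset for detecting occluded vehicles under various weather conditions. It contains over 32 hours of real-world traffic video, more than 48,000 manually annotated frames, and 3.2 million semi-labeled frames covering eight traffic participant classes, and it targets the gaps of existing datasets in extreme weather and occlusion scenarios.
Details
Motivation: Global warming has intensified extreme weather events, which degrade surveillance video quality and increase traffic accident rates. Existing datasets usually cover only light haze, rain, and snow and do not capture extreme weather, so a new comprehensive dataset is needed to improve vehicle detection under diverse weather and occlusion.
Result: The paper establishes an object detection benchmark on TSBOW that highlights the challenges posed by occlusion and adverse weather. With its varied road types, scales, and viewpoints, the dataset serves as a key resource for advancing Intelligent Transportation Systems.
Insight: The contribution is a large-scale traffic surveillance dataset that covers a range of extreme weather and complex occlusion scenarios, filling a gap in existing datasets. Viewed objectively, by combining manual and semi-automatic annotation and covering participants from large vehicles to pedestrians, the dataset offers a valuable testbed for developing robust object detection models.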
Abstract: Global warming has intensified the frequency and severity of extreme weather events, which degrade CCTV signal and video quality while disrupting traffic flow, thereby increasing traffic accident rates. Existing datasets, often limited to light haze, rain, and snow, fail to capture extreme weather conditions. To address this gap, this study introduces the Traffic Surveillance Benchmark for Occluded vehicles under various Weather conditions (TSBOW), a comprehensive dataset designed to enhance occluded vehicle detection across diverse annual weather scenarios. Comprising over 32 hours of real-world traffic data from densely populated urban areas, TSBOW includes more than 48,000 manually annotated and 3.2 million semi-labeled frames, with bounding boxes spanning eight traffic participant classes from large vehicles to micromobility devices and pedestrians. We establish an object detection benchmark for TSBOW, highlighting challenges posed by occlusions and adverse weather. With its varied road types, scales, and viewpoints, TSBOW serves as a critical resource for advancing Intelligent Transportation Systems. Our findings underscore the potential of CCTV-based traffic monitoring and pave the way for new research and applications. The TSBOW dataset is publicly available at: https://github.com/SKKUAutoLab/TSBOW.
[45] Stable Velocity: A Variance Perspective on Flow Matching cs.CVPDF
Donglin Yang, Yongxing Zhang, Xin Yu, Liang Hou, Xin Tao
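TL;DR: This paper proposes Stable Velocity, a unified framework that addresses the unstable training and slow convergence caused by the high variance of conditional velocities in flow matching. It comprises StableVM, a variance-reduction training objective; VA-REPA, an adaptive auxiliary supervision method; and StableVS, a finetuning-free accelerated sampling method for inference. Experiments on several large-scale image and video generation models show improved training efficiency and more than 2× faster sampling without loss of quality.
Details
Motivation: Flow matching relies on single-sample conditional velocities, which produce high-variance training targets that destabilize optimization and slow convergence. The paper analyzes this variance explicitly, identifies high-variance and low-variance regimes during training, and designs more stable training and sampling methods on that basis.
Result: Extensive experiments on ImageNet 256×256 and on large pretrained text-to-image and text-to-video models, including SD3.5, Flux, Qwen-Image, and Wan2.2, show consistent gains in training efficiency and more than 2× faster sampling within the low-variance regime without degrading sample quality.
Insight: The core novelty is a systematic variance analysis of flow matching and a unified variance-aware framework built on it: 1) StableVM, an unbiased variance-reduction training objective; 2) VA-REPA, which adaptively strengthens auxiliary supervision in the low-variance regime; 3) StableVS, a finetuning-free fast sampler that exploits the simplified dynamics in the low-variance regime. This offers a new direction for improving the training stability and sampling efficiency of diffusion and flow-matching models.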
Abstract: While flow matching is elegant, its reliance on single-sample conditional velocities leads to high-variance training targets that destabilize optimization and slow convergence. By explicitly characterizing this variance, we identify 1) a high-variance regime near the prior, where optimization is challenging, and 2) a low-variance regime near the data distribution, where conditional and marginal velocities nearly coincide. Leveraging this insight, we propose Stable Velocity, a unified framework that improves both training and sampling. For training, we introduce Stable Velocity Matching (StableVM), an unbiased variance-reduction objective, along with Variance-Aware Representation Alignment (VA-REPA), which adaptively strengthens auxiliary supervision in the low-variance regime. For inference, we show that dynamics in the low-variance regime admit closed-form simplifications, enabling Stable Velocity Sampling (StableVS), a finetuning-free acceleration. Extensive experiments on ImageNet $256\times256$ and large pretrained text-to-image and text-to-video models, including SD3.5, Flux, Qwen-Image, and Wan2.2, demonstrate consistent improvements in training efficiency and more than $2\times$ faster sampling within the low-variance regime without degrading sample quality. Our code is available at https://github.com/linYDTHU/StableVelocity.
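The two variance regimes can be reproduced on a toy problem. The sketch below is an illustration under assumptions, not the paper's analysis: it uses a hypothetical two-point 1-D dataset, a standard normal prior, the linear path x_t = (1 - t) * x0 + t * x1 with t = 0 at the prior and t = 1 at the data (the abstract does not fix a time convention), and it computes the posterior variance of the conditional velocity x1 - x0 given x_t in closed form.

```python
import math
import random

DATA = [-2.0, 2.0]  # toy 1-D dataset (hypothetical), sampled uniformly


def conditional_velocity_variance(x_t, t):
    """Var[x1 - x0 | x_t] for x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, 1)."""
    sigma = 1.0 - t
    # Posterior over which data point produced x_t: p(x1 | x_t) ~ N(x_t; t * x1, sigma^2).
    log_w = [-((x_t - t * x1) ** 2) / (2.0 * sigma ** 2) for x1 in DATA]
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    z = sum(w)
    post = [wi / z for wi in w]
    # Given x_t and x1, x0 is determined, so the velocity is (x1 - x_t) / (1 - t).
    vel = [(x1 - x_t) / sigma for x1 in DATA]
    mean = sum(p * v for p, v in zip(post, vel))
    return sum(p * (v - mean) ** 2 for p, v in zip(post, vel))


def average_variance(t, num_samples=2000, seed=0):
    """Monte Carlo average of the conditional-velocity variance at time t."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x1 = rng.choice(DATA)
        x0 = rng.gauss(0.0, 1.0)
        total += conditional_velocity_variance((1.0 - t) * x0 + t * x1, t)
    return total / num_samples


near_prior = average_variance(0.05)  # x_t is almost pure noise
near_data = average_variance(0.95)   # x_t almost identifies its data point
```

Near the prior, x_t says little about which data point it will reach, so the single-sample target is noisy. Near the data, the posterior collapses onto one point and the conditional and marginal velocities nearly coincide, matching the low-variance regime described in the abstract.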
[46] DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching cs.CV | cs.AIPDF
Chang Zou, Changlin Li, Yang Li, Patrol Li, Jianbing Wu
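TL;DR: This paper proposes DisCa, a distillation-compatible learnable feature caching mechanism for accelerating video diffusion transformers. It replaces traditional training-free heuristics with a lightweight learnable neural predictor that captures high-dimensional feature evolution more accurately, and combines it with a Restricted MeanFlow approach for stable, lossless distillation, reaching 11.8× acceleration while preserving generation quality.
Details
Motivation: The computational burden of video diffusion models is growing rapidly. Training-free feature caching loses semantics and detail under stronger compression, training-aware step distillation degrades severely in video generation, and naively combining the two loses even more quality because the sampling steps become sparser.
Result: The method speeds up video diffusion transformer inference by 11.8× while preserving generation quality, and extensive experiments confirm its effectiveness.
Insight: The paper is the first to propose a distillation-compatible learnable feature caching mechanism, modeling feature evolution with a neural predictor. To address the challenges of highly compressed distillation on large-scale video models, it proposes Restricted MeanFlow for stable, lossless distillation, offering a new direction for efficient inference with video diffusion models.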
Abstract: While diffusion models have achieved great success in the field of video generation, this progress is accompanied by a rapidly escalating computational burden. Among the existing acceleration methods, Feature Caching is popular due to its training-free property and considerable speedup performance, but it inevitably faces semantic and detail drop with further compression. Another widely adopted method, training-aware step-distillation, though successful in image generation, also faces drastic degradation in video generation with a few steps. Furthermore, the quality loss becomes more severe when simply applying training-free feature caching to the step-distilled models, due to the sparser sampling steps. This paper introduces, for the first time, a distillation-compatible learnable feature caching mechanism. We employ a lightweight learnable neural predictor instead of traditional training-free heuristics for diffusion models, enabling a more accurate capture of the high-dimensional feature evolution process. Furthermore, we explore the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach to achieve more stable and lossless distillation. With these components, we further push the acceleration boundaries to $11.8\times$ while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method. The code is in the supplementary materials and will be publicly available.
[47] Attention Retention for Continual Learning with Vision Transformers cs.CV | cs.AIPDF
Yue Lu, Xiangyu Zhou, Shizhou Zhang, Yinghui Xing, Guoqiang Liang
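TL;DR: This paper proposes an attention-retaining framework for continual learning with Vision Transformers, which uses a gradient masking mechanism to constrain attention drift and thereby mitigate catastrophic forgetting.
Details
Motivation: Catastrophic forgetting is a key challenge in continual learning. The paper identifies attention drift in Vision Transformers as a primary cause: after learning new tasks, attention to previously learned visual concepts shifts significantly.
Result: Experiments and visualizations show that the method effectively mitigates catastrophic forgetting and preserves visual concepts. It achieves state-of-the-art performance across diverse continual learning scenarios and generalizes robustly.
Insight: Inspired by selective attention in the human visual system, the method extracts attention maps of the previous task with a layer-wise rollout mechanism, generates instance-adaptive binary masks, and applies these masks while learning a new task to zero out gradients associated with previous attention regions, protecting learned concepts. It also scales parameter updates proportionally to stay compatible with modern optimizers.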
Abstract: Continual learning (CL) empowers AI systems to progressively acquire knowledge from non-stationary data streams. However, catastrophic forgetting remains a critical challenge. In this work, we identify attention drift in Vision Transformers as a primary source of catastrophic forgetting, where the attention to previously learned visual concepts shifts significantly after learning new tasks. Inspired by neuroscientific insights into the selective attention in the human visual system, we propose a novel attention-retaining framework to mitigate forgetting in CL. Our method constrains attention drift by explicitly modifying gradients during backpropagation through a two-step process: 1) extracting attention maps of the previous task using a layer-wise rollout mechanism and generating instance-adaptive binary masks, and 2) when learning a new task, applying these masks to zero out gradients associated with previous attention regions, thereby preventing disruption of learned visual concepts. For compatibility with modern optimizers, the gradient masking process is further enhanced by scaling parameter updates proportionally to maintain their relative magnitudes. Experiments and visualizations demonstrate the effectiveness of our method in mitigating catastrophic forgetting and preserving visual concepts. It achieves state-of-the-art performance and exhibits robust generalizability across diverse CL scenarios.
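The two-step process in the abstract (rollout to obtain masks, then masking gradients) can be sketched on tiny matrices. This is an illustration under assumptions, not the authors' method: it uses the commonly cited attention rollout recipe (averaging each layer's attention with the identity and multiplying across layers), thresholds the first token's rollout row at its mean as a stand-in for the instance-adaptive mask, and zeroes hypothetical per-token gradients. The proportional update scaling for optimizer compatibility is omitted.

```python
def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def attention_rollout(attn_layers):
    """Accumulate attention across layers, mixing in the residual connection."""
    n = len(attn_layers[0])
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in attn_layers:
        mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
                 for i in range(n)]
        result = matmul(mixed, result)
    return result


def binary_mask(scores):
    """Mark tokens whose rollout score exceeds the per-instance mean (assumed rule)."""
    threshold = sum(scores) / len(scores)
    return [1 if s > threshold else 0 for s in scores]


def mask_gradients(grads, mask):
    """Zero the gradient rows of tokens flagged as previously attended."""
    return [[0.0] * len(g) if m else list(g) for g, m in zip(grads, mask)]


layer = [[0.1, 0.8, 0.1]] * 3                 # every token attends mostly to token 1
rollout = attention_rollout([layer, layer])   # two identical toy layers
mask = binary_mask(rollout[0])                # attention of the first token
grads = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]  # hypothetical per-token gradients
masked = mask_gradients(grads, mask)
```

In this toy case only token 1 exceeds the threshold, so its gradient row is zeroed while the others pass through unchanged.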
[48] SOMA-1M: A Large-Scale SAR-Optical Multi-resolution Alignment Dataset for Multi-Task Remote Sensing cs.CVPDF
Peihao Wu, Yongxiang Yao, Yi Wan, Wenfei Zhang, Ruipeng Zhao
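TL;DR: This paper introduces SOMA-1M, a large-scale dataset of synthetic aperture radar (SAR) and optical remote sensing image pairs with precise pixel-level alignment. It contains over 1.3 million georeferenced image pairs at spatial resolutions from 0.5 m to 10 m and covers 12 typical land cover categories. On this dataset the authors build comprehensive evaluation benchmarks for four hierarchical tasks (image matching, image fusion, SAR-assisted cloud removal, and cross-modal translation) and show that supervised training on it significantly improves performance on all of them.
Details
Motivation: Existing remote sensing benchmark datasets are limited by a single spatial resolution, insufficient scale, and low alignment accuracy, so they cannot support the training and generalization of multi-scale foundation models. SOMA-1M is built to address these limitations.
Result: Supervised training on SOMA-1M significantly improves performance on all four benchmark tasks (image matching, image fusion, SAR-assisted cloud removal, cross-modal translation). Multimodal remote sensing image (MRSI) matching in particular reaches current state-of-the-art (SOTA) levels.
Insight: The main contribution is a large-scale, multi-resolution SAR-optical paired dataset with precise pixel-level alignment, together with a rigorous coarse-to-fine image matching framework that handles multimodal projection deformation and massive data registration. This provides a key data foundation for robust multimodal algorithms and remote sensing foundation models.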
Abstract: Synthetic Aperture Radar (SAR) and optical imagery provide complementary strengths that constitute the critical foundation for transcending single-modality constraints and facilitating cross-modal collaborative processing and intelligent interpretation. However, existing benchmark datasets often suffer from limitations such as single spatial resolution, insufficient data scale, and low alignment accuracy, making them inadequate for supporting the training and generalization of multi-scale foundation models. To address these challenges, we introduce SOMA-1M (SAR-Optical Multi-resolution Alignment), a pixel-level precisely aligned dataset containing over 1.3 million pairs of georeferenced images with a specification of 512 x 512 pixels. This dataset integrates imagery from Sentinel-1, PIESAT-1, Capella Space, and Google Earth, achieving global multi-scale coverage from 0.5 m to 10 m. It encompasses 12 typical land cover categories, effectively ensuring scene diversity and complexity. To address multimodal projection deformation and massive data registration, we designed a rigorous coarse-to-fine image matching framework ensuring pixel-level alignment. Based on this dataset, we established comprehensive evaluation benchmarks for four hierarchical vision tasks, including image matching, image fusion, SAR-assisted cloud removal, and cross-modal translation, involving over 30 mainstream algorithms. Experimental results demonstrate that supervised training on SOMA-1M significantly enhances performance across all tasks. Notably, multimodal remote sensing image (MRSI) matching performance achieves current state-of-the-art (SOTA) levels. SOMA-1M serves as a foundational resource for robust multimodal algorithms and remote sensing foundation models. The dataset will be released publicly at: https://github.com/PeihaoWu/SOMA-1M.
[49] Feature points evaluation on omnidirectional vision with a photorealistic fisheye sequence – A report on experiments done in 2014 cs.CVPDF
Julien Moreau, S. Ambellouis, Yassine Ruichek
TL;DR: 该报告是一份2014年的科学报告,提供了详细的文献综述、一个名为PFSeq(Photorealistic Fisheye Sequence)的鱼眼图像数据集(可通过https://doi.org/10.57745/DYIVVU获取),以及全面的实验分析。报告旨在评估鱼眼图像中特征点检测和描述子的性能,以解决自标定中的“鸡与蛋”问题,即需要准确投影模型进行特征检测,同时需要良好特征进行相机标定。报告未提出新算法,也未与专为全向图像设计的算法进行比较,且未经同行评审。
Details
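Motivation: To choose the best feature detector and descriptor for self-calibration of car-mounted fisheye cameras aimed at the zenith, in support of subsequent fisheye visual odometry and stereovision, while coping with the chicken-and-egg interdependence between feature detection and camera calibration.
Result: The report does not mention specific quantitative results or benchmarks, but it provides comprehensive experiments on the PFSeq dataset that evaluate existing feature algorithms on fisheye images. As an unpublished 2014 report, it is not compared with state-of-the-art (SOTA) methods of that time or later.
Insight: The report contributes a public fisheye image dataset (PFSeq) for fisheye vision research, stresses the importance of feature selection in fisheye camera self-calibration, and systematically evaluates how well traditional feature algorithms apply in this setting, providing an experimental basis and data resource for later work.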
Abstract: What is this report: This is a scientific report, contributing with a detailed bibliography, a dataset which we will call now PFSeq for "Photorealistic Fisheye Sequence" and make available at https://doi.org/10.57745/DYIVVU, and comprehensive experiments. This work should be considered as a draft, and has been done during my PhD thesis "Construction of 3D models from fisheye video data-Application to the localisation in urban area" in 2014 [Mor16]. These results have never been published. The aim was to find the best features detector and descriptor for fisheye images, in the context of self-calibration, with cameras mounted on the top of a car and aiming at the zenith (to proceed then fisheye visual odometry and stereovision in urban scenes). We face a chicken and egg problem, because we can not take advantage of an accurate projection model for an optimal features detection and description, and we rightly need good features to perform the calibration (i.e. to compute the accurate projection model of the camera). What is not this report: It does not contribute with new features algorithm. It does not compare standard features algorithms to algorithms designed for omnidirectional images (unfortunately). It has not been peer-reviewed. Discussions have been translated and enhanced but the experiments have not been run again and the report has not been updated accordingly to the evolution of the state-of-the-art (read this as a 2014 report).
[50] VGGT-Motion: Motion-Aware Calibration-Free Monocular SLAM for Long-Range Consistency cs.CVPDF
Zhuang Xiong, Chen Zhang, Qingshan Xu, Wenbing Tao
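TL;DR: This paper proposes VGGT-Motion, a calibration-free monocular SLAM system that targets the severe scale drift seen on long sequences. Using a motion-aware submap construction mechanism and an anchor-driven direct Sim(3) registration strategy, it achieves efficient and robust global consistency over kilometer-scale trajectories.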
Details
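Motivation: Existing calibration-free monocular SLAM methods suffer severe scale drift on long sequences. Motion-agnostic partitioning breaks contextual coherence and causes zero-motion drift, while conventional geometric alignment is computationally expensive.
Result: Experiments show that VGGT-Motion markedly improves trajectory accuracy and efficiency, achieving state-of-the-art performance in zero-shot, long-range, calibration-free monocular SLAM.
Insight: The innovations are: 1) a motion-aware submap construction mechanism that uses optical flow to guide adaptive partitioning; 2) a direct Sim(3) registration strategy based on context-balanced anchors that performs search-free, pixel-wise dense alignment; 3) a lightweight submap-level pose graph optimization with linear complexity that enables scalable long-range operation.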
Abstract: Despite recent progress in calibration-free monocular SLAM via 3D vision foundation models, scale drift remains severe on long sequences. Motion-agnostic partitioning breaks contextual coherence and causes zero-motion drift, while conventional geometric alignment is computationally expensive. To address these issues, we propose VGGT-Motion, a calibration-free SLAM system for efficient and robust global consistency over kilometer-scale trajectories. Specifically, we first propose a motion-aware submap construction mechanism that uses optical flow to guide adaptive partitioning, prune static redundancy, and encapsulate turns for stable local geometry. We then design an anchor-driven direct Sim(3) registration strategy. By exploiting context-balanced anchors, it achieves search-free, pixel-wise dense alignment and efficient loop closure without costly feature matching. Finally, a lightweight submap-level pose graph optimization enforces global consistency with linear complexity, enabling scalable long-range operation. Experiments show that VGGT-Motion markedly improves trajectory accuracy and efficiency, achieving state-of-the-art performance in zero-shot, long-range calibration-free monocular SLAM.
[51] Generalization of Self-Supervised Vision Transformers for Protein Localization Across Microscopy Domains cs.CVPDF
Ben Isselmann, Dilara Göksu, Andreas Weinmann
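TL;DR: This paper studies how well Vision Transformers pretrained with self-supervised learning (SSL) generalize to protein localization across microscopy domains with different staining protocols and channel configurations. Evaluating image embeddings from three DINO-pretrained models (ImageNet-1k, Human Protein Atlas, and OpenCell) on the OpenCell dataset, it finds that all pretrained models transfer effectively, with the microscopy-specific HPA-pretrained model performing best.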
Details
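Motivation: Task-specific microscopy datasets are usually too small to train deep learning models with robust feature representations. The paper therefore examines how well SSL-pretrained representations transfer between different microscopy domains.
Result: On the OpenCell dataset, all pretrained models transfer well. The DINO model pretrained on the Human Protein Atlas (HPA) achieves the best mean macro F1-score (0.8221 ± 0.0062), slightly ahead of the DINO model trained directly on OpenCell (0.8057 ± 0.0090).
Insight: The contribution is a systematic evaluation of the cross-domain generalization of SSL-pretrained Vision Transformers in microscopy, showing that domain-relevant SSL representations (such as HPA pretraining) transfer effectively to related but distinct datasets and give strong downstream performance when labeled data are limited. Viewed objectively, this underscores the importance of large-scale, domain-relevant pretraining for biomedical image analysis.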
Abstract: Task-specific microscopy datasets are often too small to train deep learning models that learn robust feature representations. Self-supervised learning (SSL) can mitigate this by pretraining on large unlabeled datasets, but it remains unclear how well such representations transfer across microscopy domains with different staining protocols and channel configurations. We investigate the cross-domain transferability of DINO-pretrained Vision Transformers for protein localization on the OpenCell dataset. We generate image embeddings using three DINO backbones pretrained on ImageNet-1k, the Human Protein Atlas (HPA), and OpenCell, and evaluate them by training a supervised classification head on OpenCell labels. All pretrained models transfer well, with the microscopy-specific HPA-pretrained model achieving the best performance (mean macro $F_1$-score = $0.8221 \pm 0.0062$), slightly outperforming a DINO model trained directly on OpenCell ($0.8057 \pm 0.0090$). These results highlight the value of large-scale pretraining and indicate that domain-relevant SSL representations can generalize effectively to related but distinct microscopy datasets, enabling strong downstream performance even when task-specific labeled data are limited.
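For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a macro F1-score is computed: the F1-score is calculated per class and then averaged without class weighting. The function name `macro_f1` and the toy localization labels are illustrative assumptions and are not taken from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, hypothetical and for illustration only
y_true = ["nucleus", "nucleus", "cytosol", "membrane"]
y_pred = ["nucleus", "cytosol", "cytosol", "membrane"]
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally, rare localization classes influence the score as much as frequent ones, which is one common reason to prefer macro averaging on imbalanced label sets.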
[52] FastVMT: Eliminating Redundancy in Video Motion Transfer cs.CVPDF
Yue Ma, Zhikai Wang, Tianhao Ren, Mingzhe Zheng, Hongyu Liu
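TL;DR: FastVMT is an efficient method for video motion transfer that accelerates inference by removing two kinds of computational redundancy in the Diffusion Transformer (DiT) architecture: motion redundancy and gradient redundancy. With a local attention mask and a gradient-reuse optimization scheme, it substantially speeds up inference without sacrificing the visual fidelity or temporal consistency of the generated videos.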
Details
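Motivation: Existing DiT-based video motion transfer methods are computationally inefficient: the generic architecture does not exploit the fact that frame-to-frame motion is small and smooth, and the diffusion process performs redundant gradient computations.
Result: On video motion transfer, FastVMT achieves a 3.43x speedup on average without degrading the visual quality or temporal consistency of the generated videos.
Insight: The novelty lies in identifying and specifically optimizing two structural redundancies in DiT: motion redundancy is handled with a local attention mechanism, and gradient redundancy with a gradient-reuse scheme. This offers a reusable architectural optimization idea for efficient diffusion-based video generation.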
Abstract: Video motion transfer aims to synthesize videos by generating visual content according to a text prompt while transferring the motion pattern observed in a reference video. Recent methods predominantly use the Diffusion Transformer (DiT) architecture. To achieve satisfactory runtime, several methods attempt to accelerate the computations in the DiT, but fail to address structural sources of inefficiency. In this work, we identify and remove two types of computational redundancy in earlier work: motion redundancy arises because the generic DiT architecture does not reflect the fact that frame-to-frame motion is small and smooth; gradient redundancy occurs if one ignores that gradients change slowly along the diffusion trajectory. To mitigate motion redundancy, we mask the corresponding attention layers to a local neighborhood such that interaction weights are not computed for unnecessarily distant image regions. To exploit gradient redundancy, we design an optimization scheme that reuses gradients from previous diffusion steps and skips unwarranted gradient computations. On average, FastVMT achieves a 3.43x speedup without degrading the visual fidelity or the temporal consistency of the generated videos.
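The idea of restricting attention to a local neighborhood can be sketched as a boolean mask over token pairs. This is only an illustration under stated assumptions: the square window, the Chebyshev-distance rule, and the names `local_attention_mask` and `radius` are hypothetical, and the abstract does not specify how FastVMT actually constructs its mask.

```python
def local_attention_mask(height, width, radius):
    """Build an N x N boolean mask for a height x width token grid (N = height * width).

    mask[i][j] is True when token j lies within `radius` grid cells of token i
    (Chebyshev distance), meaning the attention weight i -> j would be computed.
    All other pairs are skipped. The window shape is an assumption for illustration.
    """
    coords = [(r, c) for r in range(height) for c in range(width)]
    return [
        [max(abs(ri - rj), abs(ci - cj)) <= radius for (rj, cj) in coords]
        for (ri, ci) in coords
    ]


mask = local_attention_mask(4, 4, 1)
kept = sum(sum(row) for row in mask)   # number of token pairs still attended
total = len(mask) * len(mask[0])       # number of pairs under full attention
```

On this 4 x 4 grid with radius 1, fewer than half of the token pairs remain, and the fraction shrinks further as the grid grows while the window stays fixed, which is where the savings of a local mask would come from.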
[53] PIRATR: Parametric Object Inference for Robotic Applications with Transformers in 3D Point Clouds cs.CV | cs.ROPDF
Michael Schwingshackl, Fabio F. Oberweger, Mario Niedermeyer, Huemer Johannes, Markus Murschitz
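TL;DR: PIRATR is an end-to-end 3D object detection framework for robotic applications. It jointly estimates multi-class 6-DoF poses and class-specific parametric attributes directly from occlusion-affected point cloud data, providing both geometric localization of parametric objects and estimates of task-relevant properties such as a gripper's opening. The framework uses modular, class-specific detection heads, extends easily to new object types, and is validated on an automated forklift platform.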
Details
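Motivation: In robotic applications, to localize parametric objects (such as grippers, platforms, and pallets) and estimate their task-relevant attributes at the same time from occlusion-affected point clouds, bridging the gap between low-level geometric reasoning and actionable world models.
Result: Validated on an automated forklift platform over three distinct categories (crane grippers, loading platforms, and pallets). Trained entirely in a synthetic environment, the model generalizes to real outdoor LiDAR scans without fine-tuning and reaches a detection mAP of 0.919.
Insight: The novelty lies in combining parametric object detection with 6-DoF pose estimation and achieving extensibility through a modular design, offering a new paradigm for deploying simulation-trained perception systems in dynamic robotic environments.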
Abstract: We present PIRATR, an end-to-end 3D object detection framework for robotic use cases in point clouds. Extending PI3DETR, our method streamlines parametric 3D object detection by jointly estimating multi-class 6-DoF poses and class-specific parametric attributes directly from occlusion-affected point cloud data. This formulation enables not only geometric localization but also the estimation of task-relevant properties for parametric objects, such as a gripper’s opening, where the 3D model is adjusted according to simple, predefined rules. The architecture employs modular, class-specific heads, making it straightforward to extend to novel object types without re-designing the pipeline. We validate PIRATR on an automated forklift platform, focusing on three structurally and functionally diverse categories: crane grippers, loading platforms, and pallets. Trained entirely in a synthetic environment, PIRATR generalizes effectively to real outdoor LiDAR scans, achieving a detection mAP of 0.919 without additional fine-tuning. PIRATR establishes a new paradigm of pose-aware, parameterized perception. This bridges the gap between low-level geometric reasoning and actionable world models, paving the way for scalable, simulation-trained perception systems that can be deployed in dynamic robotic environments. Code available at https://github.com/swingaxe/piratr.
[54] ShapeGaussian: High-Fidelity 4D Human Reconstruction in Monocular Videos via Vision Priors cs.CVPDF
Zhenxiao Liang, Ning Zhang, Youbao Tang, Ruei-Sung Lin, Qixing Huang
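TL;DR: ShapeGaussian is a template-free method for high-fidelity 4D human reconstruction from monocular videos. It integrates vision priors to overcome the weaknesses of existing methods, which either struggle with high-deformation human motion or depend on pose-estimation templates. The method follows a two-step pipeline: it first learns a coarse deformable geometry using pretrained models, then refines it with a neural deformation model to capture dynamic details.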
Details
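Motivation: To address the limitations of existing 4D human reconstruction methods on monocular videos: template-free methods (such as 4DGS) lack robust vision priors and struggle to capture high-deformation motion, while template-based methods (such as HUGS) rely heavily on the SMPL model and are prone to artifacts caused by pose estimation errors.
Result: Extensive experiments show that ShapeGaussian surpasses template-based methods in reconstruction accuracy and achieves better visual quality and robustness across diverse human motions in monocular videos.
Insight: The novelty lies in effectively integrating template-free vision priors (such as 2D keypoints) through a two-step pipeline (coarse geometry learning followed by neural deformation refinement) and a multiple-reference-frame strategy. This avoids the dependence of template methods on pose estimation and resolves the invisibility issue of 2D keypoints, yielding high-fidelity and robust reconstruction.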
Abstract: We introduce ShapeGaussian, a high-fidelity, template-free method for 4D human reconstruction from casual monocular videos. Generic reconstruction methods lacking robust vision priors, such as 4DGS, struggle to capture high-deformation human motion without multi-view cues. While template-based approaches, primarily relying on SMPL, such as HUGS, can produce photorealistic results, they are highly susceptible to errors in human pose estimation, often leading to unrealistic artifacts. In contrast, ShapeGaussian effectively integrates template-free vision priors to achieve both high-fidelity and robust scene reconstructions. Our method follows a two-step pipeline: first, we learn a coarse, deformable geometry using pretrained models that estimate data-driven priors, providing a foundation for reconstruction. Then, we refine this geometry using a neural deformation model to capture fine-grained dynamic details. By leveraging 2D vision priors, we mitigate artifacts from erroneous pose estimation in template-based methods and employ multiple reference frames to resolve the invisibility issue of 2D keypoints in a template-free manner. Extensive experiments demonstrate that ShapeGaussian surpasses template-based methods in reconstruction accuracy, achieving superior visual quality and robustness across diverse human motions in casual monocular videos.
[55] A Hybrid CNN and ML Framework for Multi-modal Classification of Movement Disorders Using MRI and Brain Structural Features cs.CVPDF
Mengyu Li, Ingibjörg Kristjánsdóttir, Thilo van Eimeren, Kathrin Giehl, Lotta M. Ellingsen
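TL;DR: This study proposes a hybrid framework combining convolutional neural networks (CNNs) and machine learning (ML) for multi-modal classification of atypical Parkinsonian disorder (APD) subtypes, such as progressive supranuclear palsy (PSP) and multiple system atrophy (MSA), versus Parkinson's disease (PD). The model takes T1-weighted magnetic resonance imaging (MRI), segmentation masks of 12 APD-related deep brain structures, and their corresponding volumetric measurements as input. By fusing image, structural, and quantitative features, it achieves good performance in the differential diagnosis of APD subtypes.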
Details
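Motivation: In the early stages, the clinical features of atypical Parkinsonian disorders (APD) overlap with those of Parkinson's disease (PD), which often leads to misdiagnosis. Finding reliable imaging biomarkers for early differential diagnosis is therefore a key challenge.
Result: The hybrid approach achieves promising classification results: an area under the curve (AUC) of 0.95 for PSP vs. PD, 0.86 for MSA vs. PD, and 0.92 for PSP vs. MSA. These results indicate the method's potential for differentiating APD subtypes.
Insight: The novelty lies in fusing CNN-based image features with volume-based ML inputs (structural segmentation masks and quantitative volume features), exploiting the complementarity of spatial and structural information to improve APD subtype classification accuracy. This multi-modal fusion strategy is a useful reference for medical image analysis.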
Abstract: Atypical Parkinsonian Disorders (APD), also known as Parkinson-plus syndrome, are a group of neurodegenerative diseases that include progressive supranuclear palsy (PSP) and multiple system atrophy (MSA). In the early stages, overlapping clinical features often lead to misdiagnosis as Parkinson’s disease (PD). Identifying reliable imaging biomarkers for early differential diagnosis remains a critical challenge. In this study, we propose a hybrid framework combining convolutional neural networks (CNNs) with machine learning (ML) techniques to classify APD subtypes versus PD and distinguish between the subtypes themselves: PSP vs. PD, MSA vs. PD, and PSP vs. MSA. The model leverages multi-modal input data, including T1-weighted magnetic resonance imaging (MRI), segmentation masks of 12 deep brain structures associated with APD, and their corresponding volumetric measurements. By integrating these complementary modalities, including image data, structural segmentation masks, and quantitative volume features, the hybrid approach achieved promising classification performance with area under the curve (AUC) scores of 0.95 for PSP vs. PD, 0.86 for MSA vs. PD, and 0.92 for PSP vs. MSA. These results highlight the potential of combining spatial and structural information for robust subtype differentiation. In conclusion, this study demonstrates that fusing CNN-based image features with volume-based ML inputs improves classification accuracy for APD subtypes. The proposed approach may contribute to more reliable early-stage diagnosis, facilitating timely and targeted interventions in clinical practice.
[56] LoGoSeg: Integrating Local and Global Features for Open-Vocabulary Semantic Segmentation cs.CVPDF
Junyang Chen, Xiangbo Lv, Zhiqiang Kou, Xingdong Sheng, Ning Xu
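TL;DR: LoGoSeg is an efficient single-stage framework for open-vocabulary semantic segmentation that integrates local and global features to address the weaknesses of existing methods in spatial alignment and object priors. It introduces an object existence prior, a region-aware alignment module, and a dual-stream fusion mechanism, requires no external mask proposals, additional backbones, or extra datasets, and shows competitive performance and strong generalization on multiple benchmarks.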
Details
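Motivation: Existing open-vocabulary segmentation methods built on vision-language models such as CLIP rely on image-level pretraining, which leads to imprecise spatial alignment and wrong segmentations in ambiguous or cluttered scenes. They also lack object priors and region-level constraints, which can cause object hallucination or missed detections.
Result: Extensive experiments on six benchmarks (A-847, PC-459, A-150, PC-59, PAS-20, and PAS-20b) show that LoGoSeg achieves competitive performance and strong generalization in open-vocabulary settings.
Insight: The innovations are: an object existence prior that dynamically weights relevant categories to reduce hallucinations; a region-aware alignment module that establishes precise region-level visual-textual correspondences; and a dual-stream fusion mechanism that combines local structural information with global semantic context. These designs need no external components and improve both efficiency and accuracy.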
Abstract: Open-vocabulary semantic segmentation (OVSS) extends traditional closed-set segmentation by enabling pixel-wise annotation for both seen and unseen categories using arbitrary textual descriptions. While existing methods leverage vision-language models (VLMs) like CLIP, their reliance on image-level pretraining often results in imprecise spatial alignment, leading to mismatched segmentations in ambiguous or cluttered scenes. Moreover, most existing approaches lack strong object priors and region-level constraints, which can lead to object hallucination or missed detections, further degrading performance. To address these challenges, we propose LoGoSeg, an efficient single-stage framework that integrates three key innovations: (i) an object existence prior that dynamically weights relevant categories through global image-text similarity, effectively reducing hallucinations; (ii) a region-aware alignment module that establishes precise region-level visual-textual correspondences; and (iii) a dual-stream fusion mechanism that optimally combines local structural information with global semantic context. Unlike prior works, LoGoSeg eliminates the need for external mask proposals, additional backbones, or extra datasets, ensuring efficiency. Extensive experiments on six benchmarks (A-847, PC-459, A-150, PC-59, PAS-20, and PAS-20b) demonstrate its competitive performance and strong generalization in open-vocabulary settings.
[57] A Mixed Reality System for Robust Manikin Localization in Childbirth Training cs.CV | cs.ET | cs.GRPDF
Haojie Cheng, Chang Liu, Abhiram Kanneganti, Mahesh Arjandas Choolani, Arundhati Tushar Gosavi
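TL;DR: This paper presents a mixed reality system for childbirth training that combines virtual guidance with tactile manikin interaction, allowing medical students to practice independently without continuous on-site expert supervision. The system uses an external RGB-D camera to extend the passthrough capability of a commercial head-mounted display, enabling real-time visual integration of physical training objects, and applies a coarse-to-fine localization pipeline to overlay virtual guiding hands accurately on the manikin so that trainees can follow expert trajectories.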
Details
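Motivation: Medical students have limited opportunities to practice vaginal births because of shortened clinical rotations, patient reluctance, and the unpredictable nature of labour. To reduce clinicians' teaching burden and improve trainees' learning efficiency, the authors develop a mixed reality training system that combines virtual guidance with haptic feedback and supports independent practice.
Result: Experimental evaluation shows that the system achieves accurate and stable manikin localization on a standalone headset, so it can be deployed in practice without external computing resources. A large-scale user study with 83 fourth-year medical students compared MR-based and VR-based childbirth training, with four senior obstetricians independently assessing performance using standardized criteria. MR training scored significantly higher in delivery, post-delivery, and overall task performance, and trainees consistently preferred it.
Insight: The innovations include extending the passthrough capability of a commercial HMD by spatially calibrating an external RGB-D camera, which enables real-time visual integration of physical objects, and a coarse-to-fine localization pipeline (first aligning the maternal manikin with fiducial markers to define a delivery region, then registering the pre-scanned neonatal head within that region) that accurately overlays virtual guiding hands on the manikin and reinforces learning through haptic interaction. Viewed objectively, the system combines mixed reality with haptic feedback to provide a more realistic, efficient, and independently usable training option that could be adopted in medical education.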
Abstract: Opportunities for medical students to gain practical experience in vaginal births are increasingly constrained by shortened clinical rotations, patient reluctance, and the unpredictable nature of labour. To alleviate clinicians’ instructional burden and enhance trainees’ learning efficiency, we introduce a mixed reality (MR) system for childbirth training that combines virtual guidance with tactile manikin interaction, thereby preserving authentic haptic feedback while enabling independent practice without continuous on-site expert supervision. The system extends the passthrough capability of commercial head-mounted displays (HMDs) by spatially calibrating an external RGB-D camera, allowing real-time visual integration of physical training objects. Building on this capability, we implement a coarse-to-fine localization pipeline that first aligns the maternal manikin with fiducial markers to define a delivery region and then registers the pre-scanned neonatal head within this area. This process enables spatially accurate overlay of virtual guiding hands near the manikin, allowing trainees to follow expert trajectories reinforced by haptic interaction. Experimental evaluations demonstrate that the system achieves accurate and stable manikin localization on a standalone headset, ensuring practical deployment without external computing resources. A large-scale user study involving 83 fourth-year medical students was subsequently conducted to compare MR-based and virtual reality (VR)-based childbirth training. Four senior obstetricians independently assessed performance using standardized criteria. Results showed that MR training achieved significantly higher scores in delivery, post-delivery, and overall task performance, and was consistently preferred by trainees over VR training.
[58] EgoPoseVR: Spatiotemporal Multi-Modal Reasoning for Egocentric Full-Body Pose in Virtual Reality cs.CV | cs.ET | cs.GRPDF
Haojie Cheng, Shaun Jing Heng Ong, Shaoyu Cai, Aiden Tat Yang Koh, Fuxi Ouyang
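TL;DR: This paper presents EgoPoseVR, an end-to-end framework for egocentric full-body pose estimation in virtual reality (VR). Through a dual-modality fusion pipeline, it combines motion cues from the head-mounted display (HMD) with egocentric RGB-D observations to address the temporal instability, inaccurate lower-body estimation, and lack of real-time performance of existing methods in VR applications.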
Details
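Motivation: Current head-mounted camera-based egocentric pose estimation methods face challenges when applied to VR head-mounted displays, including temporal instability, inaccurate lower-body estimation, and a lack of real-time performance. The goal is accurate, temporally coherent full-body pose tracking for immersive VR applications.
Result: Experimental results show that EgoPoseVR outperforms state-of-the-art egocentric pose estimation models. In a user study in real-world scenes, EgoPoseVR received significantly higher subjective ratings than baseline methods for accuracy, stability, embodiment, and intention for future use.
Insight: The innovations are a dual-modality fusion pipeline that integrates HMD motion cues with RGB-D observations and a kinematic optimization module that imposes constraints from HMD signals. The paper also introduces a large-scale synthetic dataset for training and evaluation. Viewed objectively, the core contribution is cross-modal spatiotemporal fusion that fully exploits complementary motion cues, enabling robust VR full-body pose tracking without additional body-worn sensors or room-scale tracking systems.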
Abstract: Immersive virtual reality (VR) applications demand accurate, temporally coherent full-body pose tracking. Recent head-mounted camera-based approaches show promise in egocentric pose estimation, but encounter challenges when applied to VR head-mounted displays (HMDs), including temporal instability, inaccurate lower-body estimation, and the lack of real-time performance. To address these limitations, we present EgoPoseVR, an end-to-end framework for accurate egocentric full-body pose estimation in VR that integrates headset motion cues with egocentric RGB-D observations through a dual-modality fusion pipeline. A spatiotemporal encoder extracts frame- and joint-level representations, which are fused via cross-attention to fully exploit complementary motion cues across modalities. A kinematic optimization module then imposes constraints from HMD signals, enhancing the accuracy and stability of pose estimation. To facilitate training and evaluation, we introduce a large-scale synthetic dataset of over 1.8 million temporally aligned HMD and RGB-D frames across diverse VR scenarios. Experimental results show that EgoPoseVR outperforms state-of-the-art egocentric pose estimation models. A user study in real-world scenes further shows that EgoPoseVR achieved significantly higher subjective ratings in accuracy, stability, embodiment, and intention for future use compared to baseline methods. These results show that EgoPoseVR enables robust full-body pose tracking, offering a practical solution for accurate VR embodiment without requiring additional body-worn sensors or room-scale tracking systems.
[59] CAViT – Channel-Aware Vision Transformer for Dynamic Feature Fusion cs.CV | cs.AIPDF
Aon Safdar, Mohamed Saadeldin
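TL;DR: CAViT is a dual-attention Vision Transformer that replaces the static multilayer perceptron of the standard ViT with a dynamic, attention-based mechanism for feature interaction. Each CAViT block performs spatial self-attention followed by channel-wise self-attention, allowing the model to recalibrate feature representations dynamically based on global image context. The model outperforms the standard ViT baseline on multiple benchmark datasets while reducing parameter count and computation.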
Details
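Motivation: Channel mixing in standard Vision Transformers is static: it relies on fixed multilayer perceptrons that do not adapt to the input content. This paper addresses the problem by introducing a dynamic, content-aware mechanism for channel-wise feature interaction.
Result: Across five benchmark datasets spanning natural and medical domains, CAViT improves accuracy over the standard ViT baseline by up to +3.6% while reducing parameter count and FLOPs by over 30%. Qualitative attention maps show sharper and more semantically meaningful activation patterns.
Insight: The main innovation is a unified dual-attention (spatial plus channel) Transformer block that performs content-aware, dynamic channel feature fusion. Viewed objectively, upgrading channel interaction from a static MLP to a dynamic attention mechanism is an effective way to increase expressiveness without adding depth or complexity, improving performance while lowering computational cost.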
Abstract: Vision Transformers (ViTs) have demonstrated strong performance across a range of computer vision tasks by modeling long-range spatial interactions via self-attention. However, channel-wise mixing in ViTs remains static, relying on fixed multilayer perceptrons (MLPs) that lack adaptability to input content. We introduce ‘CAViT’, a dual-attention architecture that replaces the static MLP with a dynamic, attention-based mechanism for feature interaction. Each Transformer block in CAViT performs spatial self-attention followed by channel-wise self-attention, allowing the model to dynamically recalibrate feature representations based on global image context. This unified and content-aware token mixing strategy enhances representational expressiveness without increasing depth or complexity. We validate CAViT across five benchmark datasets spanning both natural and medical domains, where it outperforms the standard ViT baseline by up to +3.6% in accuracy, while reducing parameter count and FLOPs by over 30%. Qualitative attention maps reveal sharper and semantically meaningful activation patterns, validating the effectiveness of our attention-driven token mixing.
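To make the notion of channel-wise self-attention concrete, here is a deliberately simplified sketch in plain Python. It treats each channel, rather than each spatial token, as the unit that attends: a C x C attention map is computed from channel-to-channel similarity and then used to remix the channels of every token. The absence of learned query, key, and value projections, the 1/sqrt(N) scaling, and the function names are all assumptions made for illustration; the abstract does not give CAViT's exact formulation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def channel_self_attention(x):
    """x: list of N tokens, each a list of C channel values.

    Returns (out, attn) where attn is a C x C row-stochastic map and
    out[t][i] = sum_j attn[i][j] * x[t][j]. No learned projections are used;
    this is a hypothetical simplification, not the paper's layer.
    """
    n, c = len(x), len(x[0])
    cols = [[x[t][k] for t in range(n)] for k in range(c)]  # C x N: one row per channel
    scale = 1.0 / math.sqrt(n)
    attn = [
        softmax([scale * sum(a * b for a, b in zip(cols[i], cols[j])) for j in range(c)])
        for i in range(c)
    ]
    out = [
        [sum(attn[i][j] * x[t][j] for j in range(c)) for i in range(c)]
        for t in range(n)
    ]
    return out, attn


tokens = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]  # N = 2 tokens, C = 3 channels
out, attn = channel_self_attention(tokens)
```

Unlike a fixed MLP, the mixing matrix `attn` here depends on the input tokens themselves, which is the sense in which channel mixing becomes content-aware.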
[60] UniSurg: A Video-Native Foundation Model for Universal Understanding of Surgical Videos cs.CVPDF
Jinlin Wu, Felix Holm, Chuxi Chen, An Wang, Yaxin Hu
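TL;DR: This paper presents UniSurg, a video-native foundation model designed for surgical videos. It shifts the learning paradigm from pixel-level reconstruction to latent motion prediction and introduces three key techniques: motion-guided latent prediction, spatiotemporal affinity self-distillation, and feature diversity regularization. The model is pretrained on the UniSurg-15M dataset, which contains 3,658 hours of video, and significantly outperforms existing methods across 17 benchmarks, setting a new standard for universal, motion-oriented surgical video understanding.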
Details
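Motivation: Current foundation models for surgical video analysis rely mainly on pixel-level reconstruction objectives, which waste model capacity on low-level visual details such as smoke and specular reflections instead of the semantic structures essential for surgical understanding. The paper addresses this by shifting the learning focus toward higher-level semantic motion understanding.
Result: Extensive experiments across 17 benchmarks show that UniSurg significantly outperforms state-of-the-art methods on surgical workflow recognition (+14.6% F1 on EgoSurgery, +10.3% on PitVis) and action triplet recognition (39.54% mAP-IVT on CholecT50), and performs strongly on skill assessment, polyp segmentation, and depth estimation.
Insight: The core innovation is shifting the learning paradigm from pixel reconstruction to latent motion prediction, which better matches the high-level semantic needs of surgical video understanding. The three technical contributions (motion-guided prediction, spatiotemporal affinity self-distillation, and feature diversity regularization) are tailored to surgical videos, where texture is sparse and semantic regions matter, and they prevent representation collapse while focusing the model on key regions. The large-scale UniSurg-15M dataset is also an important resource for the field.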
Abstract: While foundation models have advanced surgical video analysis, current approaches rely predominantly on pixel-level reconstruction objectives that waste model capacity on low-level visual details - such as smoke, specular reflections, and fluid motion - rather than semantic structures essential for surgical understanding. We present UniSurg, a video-native foundation model that shifts the learning paradigm from pixel-level reconstruction to latent motion prediction. Built on the Video Joint Embedding Predictive Architecture (V-JEPA), UniSurg introduces three key technical innovations tailored to surgical videos: 1) motion-guided latent prediction to prioritize semantically meaningful regions, 2) spatiotemporal affinity self-distillation to enforce relational consistency, and 3) feature diversity regularization to prevent representation collapse in texture-sparse surgical scenes. To enable large-scale pretraining, we curate UniSurg-15M, the largest surgical video dataset to date, comprising 3,658 hours of video from 50 sources across 13 anatomical regions. Extensive experiments across 17 benchmarks demonstrate that UniSurg significantly outperforms state-of-the-art methods on surgical workflow recognition (+14.6% F1 on EgoSurgery, +10.3% on PitVis), action triplet recognition (39.54% mAP-IVT on CholecT50), skill assessment, polyp segmentation, and depth estimation. These results establish UniSurg as a new standard for universal, motion-oriented surgical video understanding.
[61] Poster: Camera Tampering Detection for Outdoor IoT Systems cs.CV | cs.AIPDF
Shadi Attarha, Kanaga Shanmugi, Anna Förster
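TL;DR: This paper addresses camera tampering detection in outdoor IoT systems and proposes two approaches, one rule-based and one based on deep learning, to evaluate their accuracy, computational demands, and training data requirements in real-world scenarios. The results show that the deep-learning model is more accurate, while the rule-based method is better suited to settings with limited resources where a prolonged calibration phase is impractical. The authors also release datasets of normal, blurred, and rotated images to support the development and evaluation of camera tampering detection methods.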
Details
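Motivation: As smart cameras are widely deployed for outdoor surveillance and security, these systems are vulnerable to deliberate vandalism and harsh environmental conditions, which degrade monitoring effectiveness. Tampering detection is especially challenging when a camera captures still images rather than video, because no sequence of continuous frames is available.
Result: Experiments show that the deep-learning model is more accurate, while the rule-based method is more appropriate where resources are limited and a prolonged calibration phase is impractical.
Insight: The novelty lies in targeting camera tampering detection for still images (rather than video), proposing two complementary methods, and releasing related datasets that fill a resource gap in this area. Viewed objectively, pairing a traditional rule-based method with a deep-learning method offers a flexible choice for deployment under different resource constraints and has practical value.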
Abstract: Recently, the use of smart cameras in outdoor settings has grown to improve surveillance and security. Nonetheless, these systems are susceptible to tampering, whether from deliberate vandalism or harsh environmental conditions, which can undermine their monitoring effectiveness. In this context, detecting camera tampering is more challenging when a camera is capturing still images rather than video as there is no sequence of continuous frames over time. In this study, we propose two approaches for detecting tampered images: a rule-based method and a deep-learning-based method. The aim is to evaluate how each method performs in terms of accuracy, computational demands, and the data required for training when applied to real-world scenarios. Our results show that the deep-learning model provides higher accuracy, while the rule-based method is more appropriate for scenarios where resources are limited and a prolonged calibration phase is impractical. We also offer publicly available datasets with normal, blurred, and rotated images to support the development and evaluation of camera tampering detection methods, addressing the need for such resources.
[62] Exploring the Temporal Consistency for Point-Level Weakly-Supervised Temporal Action Localization cs.CVPDF
Yunchuan Ma, Laiyun Qing, Guorong Li, Yuqing Liu, Yuankai Qi
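TL;DR: This paper proposes a multi-task learning framework for point-supervised temporal action localization (PTAL). It designs three self-supervised temporal understanding tasks (action completion, action order understanding, and action regularity understanding) to strengthen the model's grasp of the temporal consistency of actions, improving the localization of action instances in untrimmed videos.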
Details
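Motivation: Existing PTAL methods typically use only a point-supervised, snippet-level classification head and do not explicitly model the temporal relationships among the frames of an action, even though understanding those relationships is crucial for accurately localizing the full extent of an action.
Result: Extensive experiments on four benchmark datasets demonstrate the effectiveness of the method compared with several state-of-the-art (SOTA) approaches.
Insight: The novelty is the first explicit exploration of temporal consistency to enhance point-supervised action localization, modeling the temporal structure of actions through self-supervised multi-task learning (action completion, order understanding, and regularity understanding). This is a new way of using weak supervision signals to improve a model's temporal understanding.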
Abstract: Point-supervised Temporal Action Localization (PTAL) adopts a lightly frame-annotated paradigm (i.e., labeling only a single frame per action instance) to train a model to effectively locate action instances within untrimmed videos. Most existing approaches design the task head of models with only a point-supervised snippet-level classification, without explicit modeling of understanding temporal relationships among frames of an action. However, understanding the temporal relationships of frames is crucial because it can help a model understand how an action is defined and therefore benefits localizing the full frames of an action. To this end, in this paper, we design a multi-task learning framework that fully utilizes point supervision to boost the model’s temporal understanding capability for action localization. Specifically, we design three self-supervised temporal understanding tasks: (i) Action Completion, (ii) Action Order Understanding, and (iii) Action Regularity Understanding. These tasks help a model understand the temporal consistency of actions across videos. To the best of our knowledge, this is the first attempt to explicitly explore temporal consistency for point supervision action localization. Extensive experimental results on four benchmark datasets demonstrate the effectiveness of the proposed method compared to several state-of-the-art approaches.
[63] Adaptive Global and Fine-Grained Perceptual Fusion for MLLM Embeddings Compatible with Hard Negative Amplification cs.CV | cs.LGPDF
Lexiang Hu, Youze Xue, Dian Li, Gang Liu, Zhouchen Lin
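TL;DR: This paper proposes AGFF-Embed, a method for improving the embeddings produced by multimodal large language models (MLLMs). It prompts the MLLM to generate multiple embeddings that focus on different semantic dimensions and aggregates them adaptively and smoothly to fuse global and fine-grained perceptual information. Combined with the Explicit Gradient Amplification (EGA) technique, the method strengthens in-batch hard negatives without editing the dataset. On the MMEB and MMVP-VLM benchmarks, AGFF-Embed achieves state-of-the-art performance in both general and fine-grained understanding.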
Details
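Motivation: Existing multimodal embedding models (such as CLIP-based and MLLM-based models) mainly capture global semantic information, whereas complex scenarios often require understanding both global and fine-grained elements. A compatible fusion mechanism is therefore needed to combine the two perceptual patterns.
Result: On the MMEB and MMVP-VLM benchmarks, AGFF-Embed comprehensively achieves state-of-the-art (SOTA) performance in both general and fine-grained understanding compared with other multimodal embedding models.
Insight: The novelty lies in a method for generating and aggregating MLLM embeddings that adaptively fuses global and fine-grained perception (AGFF-Embed), combined with explicit gradient amplification to strengthen hard-negative learning. It requires no fine-grained annotation or editing of the dataset and efficiently improves the discriminative power of multimodal representations.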
Abstract: Multimodal embeddings serve as a bridge for aligning vision and language, with the two primary implementations – CLIP-based and MLLM-based embedding models – both limited to capturing only global semantic information. Although numerous studies have focused on fine-grained understanding, we observe that complex scenarios currently targeted by MLLM embeddings often involve a hybrid perceptual pattern of both global and fine-grained elements, thus necessitating a compatible fusion mechanism. In this paper, we propose Adaptive Global and Fine-grained perceptual Fusion for MLLM Embeddings (AGFF-Embed), a method that prompts the MLLM to generate multiple embeddings focusing on different dimensions of semantic information, which are then adaptively and smoothly aggregated. Furthermore, we adapt AGFF-Embed with the Explicit Gradient Amplification (EGA) technique to achieve in-batch hard negatives enhancement without requiring fine-grained editing of the dataset. Evaluation on the MMEB and MMVP-VLM benchmarks shows that AGFF-Embed comprehensively achieves state-of-the-art performance in both general and fine-grained understanding compared to other multimodal embedding models.
[64] Depth as Prior Knowledge for Object Detection cs.CVPDF
Moussa Kassem Sbeyti, Nadja Klein
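TL;DR: This paper proposes DepthPrior, a framework that uses depth as prior knowledge rather than as a fused feature to improve the detection of small and distant objects. It consists of Depth-Based Loss Weighting (DLW) and Depth-Based Loss Stratification (DLS) during training and Depth-Aware Confidence Thresholding (DCT) during inference. It requires no changes to the detector architecture, and the only overhead is the initial depth estimation.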
Details
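Motivation: Small and distant objects are hard to detect because of scale variation, low resolution, and background clutter. Existing depth-based methods usually require complex, model-specific architectural modifications, so this paper explores a more general way of using depth information.
Result: On four benchmark datasets (KITTI, MS COCO, VisDrone, and SUN RGB-D) with two detectors (YOLOv11 and EfficientDet), DepthPrior improves mAP_S for small objects by up to 9% and mAR_S by up to 7%, with inference recovery rates as high as 95:1 (true positives vs. false positives), reaching SOTA level.
Insight: The novelty lies in treating depth as prior knowledge rather than as a fused feature. Theoretical analysis and empirical study explain the systematic reasons why depth causes performance degradation and how depth-informed supervision mitigates it. DepthPrior delivers a general performance gain without additional sensors, architectural changes, or performance costs, and transfers well across detectors.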
Abstract: Detecting small and distant objects remains challenging for object detectors due to scale variation, low resolution, and background clutter. Safety-critical applications require reliable detection of these objects for safe planning. Depth information can improve detection, but existing approaches require complex, model-specific architectural modifications. We provide a theoretical analysis followed by an empirical investigation of the depth-detection relationship. Together, they explain how depth causes systematic performance degradation and why depth-informed supervision mitigates it. We introduce DepthPrior, a framework that uses depth as prior knowledge rather than as a fused feature, providing comparable benefits without modifying detector architectures. DepthPrior consists of Depth-Based Loss Weighting (DLW) and Depth-Based Loss Stratification (DLS) during training, and Depth-Aware Confidence Thresholding (DCT) during inference. The only overhead is the initial cost of depth estimation. Experiments across four benchmarks (KITTI, MS COCO, VisDrone, SUN RGB-D) and two detectors (YOLOv11, EfficientDet) demonstrate the effectiveness of DepthPrior, achieving up to +9% mAP$_S$ and +7% mAR$_S$ for small objects, with inference recovery rates as high as 95:1 (true vs. false detections). DepthPrior offers these benefits without additional sensors, architectural changes, or performance costs. Code is available at https://github.com/mos-ks/DepthPrior.
[65] Neuro-Inspired Visual Pattern Recognition via Biological Reservoir Computing cs.CV | cs.NEPDF
Luca Ciampi, Ludovico Iannello, Fabrizio Tonelli, Gabriele Lagani, Angelo Di Garbo
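TL;DR: This paper proposes a neuro-inspired approach to reservoir computing that uses a network of in vitro cultured cortical neurons as the physical reservoir. A high-density multi-electrode array handles stimulation and readout, the biological neural activity serves as the computational substrate, and a linear readout layer is trained to classify the reservoir states, enabling static visual pattern recognition within a computer-vision framework.
Details
Motivation: Conventional reservoir computing relies on artificial recurrent models to approximate neural dynamics. This work instead explores the spontaneous and stimulus-evoked activity of living neural circuits as the computational substrate, with the aim of integrating living neural substrates into neuromorphic computing frameworks.
Result: Across tasks of increasing difficulty, from pointwise stimuli to oriented bars, clock-digit-like shapes, and MNIST handwritten digits, the system produces high-dimensional representations that support accurate classification despite the inherent variability of biological neural responses, showing that in vitro cortical networks can serve as effective reservoirs for static visual pattern recognition.
Insight: The novelty is the direct use of a living neuronal network, rather than a simulated model, as the physical reservoir. The resulting biologically grounded feature representation offers a new route for incorporating biological principles into machine learning and illustrates how living neural systems can inform the design of efficient, biologically grounded computational models.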
Abstract: In this paper, we present a neuro-inspired approach to reservoir computing (RC) in which a network of in vitro cultured cortical neurons serves as the physical reservoir. Rather than relying on artificial recurrent models to approximate neural dynamics, our biological reservoir computing (BRC) system leverages the spontaneous and stimulus-evoked activity of living neural circuits as its computational substrate. A high-density multi-electrode array (HD-MEA) provides simultaneous stimulation and readout across hundreds of channels: input patterns are delivered through selected electrodes, while the remaining ones capture the resulting high-dimensional neural responses, yielding a biologically grounded feature representation. A linear readout layer (single-layer perceptron) is then trained to classify these reservoir states, enabling the living neural network to perform static visual pattern-recognition tasks within a computer-vision framework. We evaluate the system across a sequence of tasks of increasing difficulty, ranging from pointwise stimuli to oriented bars, clock-digit-like shapes, and handwritten digits from the MNIST dataset. Despite the inherent variability of biological neural responses (arising from noise, spontaneous activity, and inter-session differences), the system consistently generates high-dimensional representations that support accurate classification. These results demonstrate that in vitro cortical networks can function as effective reservoirs for static visual pattern recognition, opening new avenues for integrating living neural substrates into neuromorphic computing frameworks. More broadly, this work contributes to the effort to incorporate biological principles into machine learning and supports the goals of neuro-inspired vision by illustrating how living neural systems can inform the design of efficient and biologically grounded computational models.
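The readout stage described in the abstract is the only trained component: the reservoir is left untouched and a single-layer perceptron maps its states to class labels. A minimal sketch of that step follows. The "reservoir states" here are made-up four-channel vectors standing in for HD-MEA recordings, and the learning rate and epoch count are arbitrary assumptions.

```python
def train_perceptron(states, labels, epochs=20, lr=0.1):
    """Train a binary single-layer perceptron on fixed reservoir states."""
    w = [0.0] * len(states[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(states, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = y - (1 if score > 0 else 0)
            if err:
                # Classic perceptron update; only the readout weights change.
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
    return w, b


def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Hypothetical recorded responses (one row per stimulus presentation).
states = [
    [0.9, 0.1, 0.8, 0.2],
    [0.8, 0.2, 0.9, 0.1],
    [0.1, 0.9, 0.2, 0.8],
    [0.2, 0.8, 0.1, 0.9],
]
labels = [1, 1, 0, 0]

w, b = train_perceptron(states, labels)
preds = [predict(w, b, x) for x in states]
```

Because the toy data are linearly separable, the perceptron fits them within a couple of passes; real recordings are noisy, which is why the abstract stresses variability across sessions.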
[66] ReText: Text Boosts Generalization in Image-Based Person Re-identification cs.CV | cs.AI | cs.LGPDF
Timur Mamedov, Karina Kvanchiani, Anton Konushin, Vadim Konushin
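TL;DR: This paper proposes ReText, a new method for improving the generalization of image-based person re-identification (Re-ID). It trains on a mixture of multi-camera and single-camera data and uses textual descriptions to enrich the semantic cues of the single-camera data. During training, ReText jointly optimizes three tasks: Re-ID on multi-camera data, image-text matching, and text-guided image reconstruction on single-camera data. Experiments show that ReText generalizes strongly on cross-domain Re-ID benchmarks and significantly outperforms state-of-the-art methods.
Details
Motivation: The goal is to generalize image-based person Re-ID to unseen domains. Existing methods typically rely on complex architectures to handle the domain gap, while recent findings show that stylistically diverse single-camera data improves generalization. Such data, however, lacks complexity because it has minimal cross-view variation.
Result: On cross-domain person Re-ID benchmarks, ReText significantly outperforms state-of-the-art (SOTA) methods and shows strong generalization.
Insight: According to the authors, this is the first work to explore multimodal joint learning in image-based person Re-ID. It mixes multi-camera and single-camera data, enriches the single-camera data with textual descriptions, and jointly optimizes Re-ID, image-text matching, and text-guided image reconstruction to improve generalization.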
Abstract: Generalizable image-based person re-identification (Re-ID) aims to recognize individuals across cameras in unseen domains without retraining. While multiple existing approaches address the domain gap through complex architectures, recent findings indicate that better generalization can be achieved by stylistically diverse single-camera data. Although this data is easy to collect, it lacks complexity due to minimal cross-view variation. We propose ReText, a novel method trained on a mixture of multi-camera Re-ID data and single-camera data, where the latter is complemented by textual descriptions to enrich semantic cues. During training, ReText jointly optimizes three tasks: (1) Re-ID on multi-camera data, (2) image-text matching, and (3) image reconstruction guided by text on single-camera data. Experiments demonstrate that ReText achieves strong generalization and significantly outperforms state-of-the-art methods on cross-domain Re-ID benchmarks. To the best of our knowledge, this is the first work to explore multimodal joint learning on a mixture of multi-camera and single-camera data in image-based person Re-ID.
[67] Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation cs.CV | cs.AIPDF
Hengyi Wang, Ruiqiang Zhang, Chang Liu, Guanjie Wang, Zehua Ma
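TL;DR: This paper proposes Allocentric Perceiver, a training-free strategy that uses off-the-shelf geometric experts to recover metric 3D states from one or more images and instantiates a query-conditioned allocentric reference frame aligned with the instruction's semantic intent. It deterministically transforms the reconstructed geometry into the target frame and prompts the backbone vision-language model with structured, geometry-grounded representations, offloading mental rotation from implicit reasoning to explicit computation.
Details
Motivation: As demand grows for spatially grounded tasks such as vision-language navigation and action, the allocentric perception ability of vision-language models (VLMs) is receiving more attention. VLMs remain brittle on allocentric spatial queries that require an explicit perspective shift, where the answer depends on reasoning in a target-centric frame rather than in the observed camera view.
Result: Evaluated across multiple backbone families on spatial reasoning benchmarks, Allocentric Perceiver yields consistent and substantial gains (about 10%) on allocentric tasks while maintaining strong egocentric performance, surpassing spatial-perception-finetuned models as well as state-of-the-art open-source and proprietary models.
Insight: The core idea is to decouple allocentric spatial reasoning (such as mental rotation) from the model's implicit abilities through explicit geometric reconstruction and frame instantiation, turning it into a computable geometric transformation. This offers a modular, training-free way to strengthen the spatial reasoning of VLMs.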
Abstract: With the rising need for spatially grounded tasks such as Vision-Language Navigation/Action, allocentric perception capabilities in Vision-Language Models (VLMs) are receiving growing focus. However, VLMs remain brittle on allocentric spatial queries that require explicit perspective shifts, where the answer depends on reasoning in a target-centric frame rather than the observed camera view. Thus, we introduce Allocentric Perceiver, a training-free strategy that recovers metric 3D states from one or more images with off-the-shelf geometric experts, and then instantiates a query-conditioned allocentric reference frame aligned with the instruction’s semantic intent. By deterministically transforming reconstructed geometry into the target frame and prompting the backbone VLM with structured, geometry-grounded representations, Allocentric Perceiver offloads mental rotation from implicit reasoning to explicit computation. We evaluate Allocentric Perceiver across multiple backbone families on spatial reasoning benchmarks, observing consistent and substantial gains ($\sim$10%) on allocentric tasks while maintaining strong egocentric performance, and surpassing both spatial-perception-finetuned models and state-of-the-art open-source and proprietary models.
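The "explicit computation" that replaces mental rotation amounts to a rigid change of coordinates from the camera frame into a frame attached to the queried object. The sketch below illustrates this in a simplified top-down 2D setting. The coordinate conventions (x to the camera's right, y away from the camera, heading measured counter-clockwise from +x) and the function names are assumptions for illustration; the paper works with reconstructed metric 3D states.

```python
import math


def to_anchor_frame(point, anchor_pos, anchor_heading_deg):
    """Express a camera-frame point in the anchor's (right, forward) axes."""
    theta = math.radians(anchor_heading_deg)
    fx, fy = math.cos(theta), math.sin(theta)  # anchor's forward direction
    rx, ry = fy, -fx                           # forward rotated 90 deg clockwise
    dx, dy = point[0] - anchor_pos[0], point[1] - anchor_pos[1]
    return (dx * rx + dy * ry, dx * fx + dy * fy)


def side_of(right_coord):
    return "right" if right_coord > 0 else "left"


# Camera at the origin looking along +y. A person stands 4 m ahead and
# faces the camera (heading -90 degrees); a cup sits at (1, 3).
person_pos, person_heading = (0.0, 4.0), -90.0
cup = (1.0, 3.0)

egocentric_side = side_of(cup[0] - person_pos[0])  # as seen from the camera
right, forward = to_anchor_frame(cup, person_pos, person_heading)
allocentric_side = side_of(right)                  # from the person's view
```

The cup appears to the right of the person in the image but lies on the person's left, which is exactly the kind of perspective shift that the structured, geometry-grounded prompt is meant to hand to the VLM already resolved.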
[68] Focus-Scan-Refine: From Human Visual Perception to Efficient Visual Token Pruning cs.CVPDF
Enwei Tong, Yuanchao Bai, Yao Zhu, Junjun Jiang, Xianming Liu
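TL;DR: This paper proposes Focus-Scan-Refine (FSR), a plug-and-play visual token pruning framework inspired by human visual perception that improves the inference efficiency of vision-language models. It mimics how humans answer visual questions: first focus on key evidence, then scan globally for complementary context as needed, and finally refine the scanned information by aggregating relevant details, which better balances local evidence and global context under aggressive compression.
Details
Motivation: The large number of visual tokens generated by vision-language models (VLMs) significantly increases inference latency and memory footprint, and existing training-free token pruning methods struggle to balance local evidence and global context under aggressive compression.
Result: Extensive experiments across multiple VLM backbones and vision-language benchmarks show that FSR consistently improves the accuracy-efficiency trade-off over existing state-of-the-art (SOTA) pruning methods.
Insight: The novelty is a three-step pruning paradigm inspired by human visual perception (focus, scan, refine). It focuses on key evidence by combining visual importance with instruction relevance, scans for complementary context conditioned on the focused set, and refines the scanned context through similarity-based assignment and score-weighted merging, without increasing the token budget.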
Abstract: Vision-language models (VLMs) often generate massive visual tokens that greatly increase inference latency and memory footprint; while training-free token pruning offers a practical remedy, existing methods still struggle to balance local evidence and global context under aggressive compression. We propose Focus-Scan-Refine (FSR), a human-inspired, plug-and-play pruning framework that mimics how humans answer visual questions: focus on key evidence, then scan globally if needed, and refine the scanned context by aggregating relevant details. FSR first focuses on key evidence by combining visual importance with instruction relevance, avoiding the bias toward visually salient but query-irrelevant regions. It then scans for complementary context conditioned on the focused set, selecting tokens that are most different from the focused evidence. Finally, FSR refines the scanned context by aggregating nearby informative tokens into the scan anchors via similarity-based assignment and score-weighted merging, without increasing the token budget. Extensive experiments across multiple VLM backbones and vision-language benchmarks show that FSR consistently improves the accuracy-efficiency trade-off over existing state-of-the-art pruning methods. The source code is available at https://github.com/ILOT-code/FSR
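The three stages in the abstract can be sketched on toy data as follows. This is a rough illustration under stated assumptions, not the released FSR code: the focus score is taken to be the product of importance and relevance, "most different" is taken to mean lowest maximum cosine similarity to the focused set, and merging is a plain score-weighted average. The function expects `n_focus >= 1`.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def focus_scan_refine(tokens, importance, relevance, n_focus, n_scan):
    idx = range(len(tokens))
    # Assumed scoring rule; a small epsilon keeps merge weights positive.
    score = [i * r + 1e-8 for i, r in zip(importance, relevance)]

    # Focus: keep the tokens that are both salient and query-relevant.
    focus = sorted(idx, key=lambda i: score[i], reverse=True)[:n_focus]
    rest = [i for i in idx if i not in focus]

    # Scan: pick the remaining tokens least similar to the focused set.
    def max_sim(i):
        return max(cosine(tokens[i], tokens[f]) for f in focus)

    scan = sorted(rest, key=max_sim)[:n_scan]
    if not scan:
        return [tokens[f] for f in focus]

    # Refine: merge every leftover token into its most similar scan anchor.
    merged = {s: ([score[s] * v for v in tokens[s]], score[s]) for s in scan}
    for i in rest:
        if i in scan:
            continue
        anchor = max(scan, key=lambda s: cosine(tokens[i], tokens[s]))
        acc, total = merged[anchor]
        merged[anchor] = ([a + score[i] * v for a, v in zip(acc, tokens[i])],
                          total + score[i])
    refined = [[a / total for a in acc] for acc, total in merged.values()]
    return [tokens[f] for f in focus] + refined


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0],
          [0.1, 0.9], [-1.0, 0.0], [-0.9, -0.1]]
importance = [0.9, 0.8, 0.5, 0.4, 0.3, 0.2]
relevance = [1.0, 0.9, 0.2, 0.2, 0.5, 0.5]
pruned = focus_scan_refine(tokens, importance, relevance, n_focus=2, n_scan=2)
```

Six tokens go in and four come out: the merge step folds leftover tokens into existing scan anchors, so the budget stays at `n_focus + n_scan`.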
[69] Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation cs.CV | cs.ROPDF
Hai Zhang, Siqi Liang, Li Chen, Yuxian Li, Yukuan Xu
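TL;DR: This paper proposes SparseVideoNav, which introduces video generation models to beyond-the-view navigation for the first time. It guides an agent through long-horizon navigation in unknown environments by generating sparse future video frames, achieving 2.5x the success rate of existing LLM baselines and a 27x inference speed-up over its unoptimized counterpart.
Details
Motivation: The target is Beyond-the-View Navigation (BVN) in the real world, where an agent must reach distant, unseen targets in unknown environments given only simple high-level intents rather than detailed, step-by-step language instructions. Existing LLM-based methods behave short-sightedly because they rely on short-horizon supervision.
Result: In real-world zero-shot experiments on BVN tasks, SparseVideoNav achieves 2.5x the success rate of state-of-the-art LLM baselines and is the first to demonstrate this capability in challenging night scenes. Its sparse video generation also yields a 27x speed-up in trajectory inference compared with the unoptimized counterpart.
Insight: The paper identifies and exploits the fact that video generation models inherently benefit from long-horizon supervision, which makes them well suited to BVN. SparseVideoNav achieves sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon, balancing performance against deployment practicality.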
Abstract: Why must vision-language navigation be bound to detailed and verbose language instructions? While such details ease decision-making, they fundamentally contradict the goal of navigation in the real world. Ideally, agents should possess the autonomy to navigate in unknown environments guided solely by simple and high-level intents. Realizing this ambition introduces a formidable challenge: Beyond-the-View Navigation (BVN), where agents must locate distant, unseen targets without dense and step-by-step guidance. Existing large language model (LLM)-based methods, though adept at following dense instructions, often suffer from short-sighted behaviors due to their reliance on short-horizon supervision. Simply extending the supervision horizon, however, destabilizes LLM training. In this work, we identify that video generation models inherently benefit from long-horizon supervision to align with language instructions, rendering them uniquely suitable for BVN tasks. Capitalizing on this insight, we propose introducing the video generation model into this field for the first time. Yet, the prohibitive latency for generating videos spanning tens of seconds makes real-world deployment impractical. To bridge this gap, we propose SparseVideoNav, achieving sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon. This yields a remarkable 27x speed-up compared to the unoptimized counterpart. Extensive real-world zero-shot experiments demonstrate that SparseVideoNav achieves 2.5x the success rate of state-of-the-art LLM baselines on BVN tasks and marks the first realization of such capability in challenging night scenes.
[70] Weaver: End-to-End Agentic System Training for Video Interleaved Reasoning cs.CVPDF
Yudi Shi, Shangzhe Di, Qirui Chen, Qinian Wang, Jiayin Cai
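TL;DR: This paper proposes Weaver, an end-to-end trainable multimodal reasoning agentic system that addresses limited perceptual ability and representational mismatch in video reasoning. The system dynamically invokes diverse tools to progressively acquire visual cues and build authentic multimodal reasoning trajectories, and uses a reinforcement learning algorithm to explore tool-use strategies.
Details
Motivation: Existing text-centric Chain-of-Thought reasoning methods suffer from representational mismatch and limited perceptual acuity in video understanding, which motivates an end-to-end agentic system that can dynamically integrate multimodal information.
Result: Weaver improves performance on several complex video reasoning benchmarks, particularly those involving long videos.
Insight: The novelty lies in combining dynamic tool invocation with reinforcement learning to build multimodal reasoning trajectories end to end, overcoming the perceptual limits of text-centric approaches.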
Abstract: Video reasoning constitutes a comprehensive assessment of a model’s capabilities, as it demands robust perceptual and interpretive skills, thereby serving as a means to explore the boundaries of model performance. While recent research has leveraged text-centric Chain-of-Thought reasoning to augment these capabilities, such approaches frequently suffer from representational mismatch and are restricted by limited perceptual acuity. To address these limitations, we propose Weaver, a novel, end-to-end trainable multimodal reasoning agentic system. Weaver empowers its policy model to dynamically invoke diverse tools throughout the reasoning process, enabling progressive acquisition of crucial visual cues and construction of authentic multimodal reasoning trajectories. Furthermore, we integrate a reinforcement learning algorithm to allow the system to freely explore strategies for employing and combining these tools with trajectory-free data. Extensive experiments demonstrate that our system, Weaver, enhances performance on several complex video reasoning benchmarks, particularly those involving long videos.
[71] UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents cs.CVPDF
Han Xiao, Guozhi Wang, Hao Wang, Shilong Liu, Yuxiang Chai
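TL;DR: This paper proposes UI-Mem, a new framework for online reinforcement learning in mobile GUI agents. It introduces a hierarchical experience memory that stores structured knowledge (high-level workflows, subtask skills, and failure patterns) and uses Stratified Group Sampling and a Self-Evolving Loop to address inefficient credit assignment in long-horizon tasks and insufficient cross-task experience transfer, improving both the performance and the generalization of online RL.
Details
Motivation: Online reinforcement learning for GUI agents suffers from inefficient credit assignment in long-horizon tasks and from errors repeated across tasks because experience is not transferred.
Result: On online GUI benchmarks, UI-Mem significantly outperforms traditional RL baselines and static reuse strategies and generalizes strongly to unseen applications.
Insight: The novelty is to extend the traditional replay buffer into a structured hierarchical experience memory whose entries are stored as parameterized, transferable templates, combined with Stratified Group Sampling and a Self-Evolving Loop. Together these allow experience to accumulate dynamically and guide the policy, promoting skill transfer across tasks and applications.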
Abstract: Online Reinforcement Learning (RL) offers a promising paradigm for enhancing GUI agents through direct environment interaction. However, its effectiveness is severely hindered by inefficient credit assignment in long-horizon tasks and repetitive errors across tasks due to the lack of experience transfer. To address these challenges, we propose UI-Mem, a novel framework that enhances GUI online RL with a Hierarchical Experience Memory. Unlike traditional replay buffers, our memory accumulates structured knowledge, including high-level workflows, subtask skills, and failure patterns. These experiences are stored as parameterized templates that enable cross-task and cross-application transfer. To effectively integrate memory guidance into online RL, we introduce Stratified Group Sampling, which injects varying levels of guidance across trajectories within each rollout group to maintain outcome diversity, driving the unguided policy toward internalizing guided behaviors. Furthermore, a Self-Evolving Loop continuously abstracts novel strategies and errors to keep the memory aligned with the agent’s evolving policy. Experiments on online GUI benchmarks demonstrate that UI-Mem significantly outperforms traditional RL baselines and static reuse strategies, with strong generalization to unseen applications. Project page: https://ui-mem.github.io
[72] Pathwise Test-Time Correction for Autoregressive Long Video Generation cs.CVPDF
Xunzhi Xiang, Zixuan Duan, Guiyu Zhang, Haiyu Zhang, Zhe Gao
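TL;DR: This paper proposes Test-Time Correction (TTC), a training-free method that addresses the severe error accumulation of distilled autoregressive diffusion models when generating long videos. It uses the initial frame as a stable reference anchor to calibrate intermediate stochastic states along the sampling trajectory, mitigating drift in long-sequence generation.
Details
Motivation: Existing Test-Time Optimization (TTO) methods are effective for images or short clips but fail to mitigate drift in long sequences because of unstable reward landscapes and the hypersensitivity of distilled parameters.
Result: Extensive experiments show that the method integrates seamlessly with various distilled models, extends generation length with negligible overhead, and matches the quality of resource-intensive training-based methods on 30-second benchmarks.
Insight: The novelty is a training-free, pathwise test-time correction framework that calibrates the sampling process against the initial frame as a stable anchor, addressing error accumulation and drift in long autoregressive video generation. It is a lightweight and general correction strategy.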
Abstract: Distilled autoregressive diffusion models facilitate real-time short video synthesis but suffer from severe error accumulation during long-sequence generation. While existing Test-Time Optimization (TTO) methods prove effective for images or short clips, we identify that they fail to mitigate drift in extended sequences due to unstable reward landscapes and the hypersensitivity of distilled parameters. To overcome these limitations, we introduce Test-Time Correction (TTC), a training-free alternative. Specifically, TTC utilizes the initial frame as a stable reference anchor to calibrate intermediate stochastic states along the sampling trajectory. Extensive experiments demonstrate that our method seamlessly integrates with various distilled models, extending generation lengths with negligible overhead while matching the quality of resource-intensive training-based methods on 30-second benchmarks.
[73] CLIP-Map: Structured Matrix Mapping for Parameter-Efficient CLIP Compression cs.CVPDF
Kangjie Zhang, Wenxuan Huang, Xin Zhou, Boxiang Zhou, Dejia Song
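TL;DR: This paper proposes CLIP-Map, a new CLIP compression framework that combines pretrained weights through learnable structured matrix mappings (Full-Mapping with Kronecker Factorization), aiming to compress the model efficiently while preserving as much of the original weight information as possible. It also introduces Diagonal Inheritance Initialization to ease the resulting optimization challenges.
Details
Motivation: CLIP is widely used across vision tasks, but its high memory and computation cost limits its use in resource-constrained scenarios. Existing compression methods that select a subset of weights degrade feature representation under extreme compression, which motivates a mapping-based approach that better preserves the original information.
Result: Extensive experiments show that CLIP-Map outperforms select-based compression frameworks across compression ratios, with particularly large gains under high compression.
Insight: The novelty is the shift from weight selection to weight mapping: weights are recombined through structured matrix mappings (Full-Mapping with Kronecker Factorization), and Diagonal Inheritance Initialization stabilizes optimization. This offers a new direction for parameter-efficient model compression.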
Abstract: Contrastive Language-Image Pre-training (CLIP) has been widely applied in various computer vision tasks, e.g., text-to-image generation, image-text retrieval, and image captioning. However, CLIP suffers from high memory and computation costs, which prohibit its use in resource-limited application scenarios. Existing CLIP compression methods typically reduce the size of pre-trained CLIP weights by selecting a subset of them as weight inheritance for further retraining, via mask optimization or weight importance measurement. However, such select-based weight inheritance often compromises the feature representation ability, especially under extreme compression. In this paper, we propose a novel mapping-based CLIP compression framework, CLIP-Map. It leverages learnable matrices to map and combine pretrained weights by Full-Mapping with Kronecker Factorization, aiming to preserve as much information from the original weights as possible. To mitigate the optimization challenges introduced by the learnable mapping, we propose Diagonal Inheritance Initialization to reduce the distribution shift problem for efficient and effective mapping learning. Extensive experimental results demonstrate that the proposed CLIP-Map outperforms select-based frameworks across various compression ratios, with particularly significant gains observed under high compression settings.
[74] LSA: Localized Semantic Alignment for Enhancing Temporal Consistency in Traffic Video Generation cs.CV | cs.AIPDF
Mirlan Karimov, Teodora Spasojevic, Markus Braun, Julian Wiederer, Vasileios Belagiannis
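TL;DR: This paper proposes Localized Semantic Alignment (LSA), a framework for fine-tuning pre-trained video generation models to enhance temporal consistency in traffic video generation. It compares semantic features around dynamic objects in ground-truth and generated videos to form a semantic feature consistency loss, which is combined with the standard diffusion loss during fine-tuning. This improves the temporal coherence of generated videos without relying on external control signals at inference time.
Details
Motivation: Existing controllable video generation methods depend on control signals at inference time to guide temporally consistent generation of dynamic objects, which limits their usefulness as scalable, general-purpose data engines. A method is needed that improves temporal consistency without external control signals.
Result: Extensive experiments on nuScenes and KITTI show that a model fine-tuned with LSA for a single epoch outperforms the baselines on common video generation metrics. To further test temporal consistency, two metrics adapted from object detection, mAP and mIoU, are also used and confirm the method's effectiveness.
Insight: The novelty is a localized semantic alignment loss that aligns the semantic features of dynamic objects between ground-truth and generated videos, improving temporal consistency without inference-time control signals or extra computational overhead. The approach is simple and effective, and applying semantic alignment to video generation fine-tuning improves generality and scalability.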
Abstract: Controllable video generation has emerged as a versatile tool for autonomous driving, enabling realistic synthesis of traffic scenarios. However, existing methods depend on control signals at inference time to guide the generative model towards temporally consistent generation of dynamic objects, limiting their utility as scalable and generalizable data engines. In this work, we propose Localized Semantic Alignment (LSA), a simple yet effective framework for fine-tuning pre-trained video generation models. LSA enhances temporal consistency by aligning semantic features between ground-truth and generated video clips. Specifically, we compare the output of an off-the-shelf feature extraction model between the ground-truth and generated video clips, localized around dynamic objects, which induces a semantic feature consistency loss. We fine-tune the base model by combining this loss with the standard diffusion loss. The model fine-tuned for a single epoch with our novel loss outperforms the baselines in common video generation evaluation metrics. To further test the temporal consistency in generated videos, we adapt two additional metrics from the object detection task, namely mAP and mIoU. Extensive experiments on nuScenes and KITTI datasets show the effectiveness of our approach in enhancing temporal consistency in video generation without the need for external control signals during inference or any computational overhead.
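A minimal sketch of a localized feature consistency loss, assuming a mean squared difference restricted to object boxes and a simple weighted sum with the diffusion loss. The abstract does not state the distance function or the weighting, so both are assumptions, as are the function names. Features are scalars per grid cell here for brevity; an actual feature extractor would produce a vector at each location.

```python
def localized_feature_loss(real_feat, gen_feat, boxes):
    """Mean squared feature difference inside dynamic-object boxes only.

    real_feat, gen_feat: 2D grids of features for one frame.
    boxes: (row_start, col_start, row_end, col_end), end-exclusive.
    """
    total, count = 0.0, 0
    for r0, c0, r1, c1 in boxes:
        for r in range(r0, r1):
            for c in range(c0, c1):
                diff = real_feat[r][c] - gen_feat[r][c]
                total += diff * diff
                count += 1
    return total / count if count else 0.0


def combined_loss(diffusion_loss, semantic_loss, weight=0.5):
    # The weighting factor is a hypothetical hyperparameter.
    return diffusion_loss + weight * semantic_loss


# Toy frame: the generated features differ at (0, 0), which is background,
# and at (1, 2), which lies inside the object box.
real = [[0.0, 0.0, 0.0], [0.0, 1.0, 1.0], [0.0, 1.0, 1.0]]
gen = [[5.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
boxes = [(1, 1, 3, 3)]

semantic = localized_feature_loss(real, gen, boxes)
total = combined_loss(0.4, semantic)
```

The large background error at (0, 0) does not contribute, which is the point of localizing the loss around dynamic objects.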
[75] RISE-Video: Can Video Generators Decode Implicit World Rules? cs.CV | cs.AIPDF
Mingxin Liu, Shuran Ma, Shibei Meng, Xiangyu Zhao, Zicheng Zhang
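TL;DR: This paper presents RISE-Video, a reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that evaluates how well generative video models understand and reason over implicit world rules, rather than focusing only on surface visual quality. The benchmark contains 467 human-annotated samples across 8 categories and introduces a multi-dimensional evaluation protocol covering reasoning alignment, temporal consistency, physical rationality, and visual quality, together with an automated evaluation pipeline based on Large Multimodal Models (LMMs). Extensive experiments on 11 SOTA TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints.
Details
Motivation: Generative video models have made remarkable progress in visual fidelity, but their ability to internalize and reason over implicit world rules remains a critical and under-explored area. This paper aims to fill that evaluation gap.
Result: Extensive experiments on 11 state-of-the-art TI2V models show pervasive deficiencies in simulating complex scenarios under implicit constraints, offering key insights for future world-simulating generative models.
Insight: The novelty is to shift the evaluation focus from surface aesthetics to deep cognitive reasoning by building a structured, multi-dimensional, reasoning-oriented benchmark (RISE-Video) with an automated evaluation pipeline. This provides a tool for systematically diagnosing and advancing generative models' understanding of the physical world and commonsense rules.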
Abstract: While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.
[76] VisRefiner: Learning from Visual Differences for Screenshot-to-Code Generation cs.CVPDF
Jie Deng, Kaichun Yao, Libo Zhang
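TL;DR: This paper proposes VisRefiner, a training framework that improves screenshot-to-code generation by letting the model learn from the visual differences between rendered predictions and reference designs. It constructs difference-aligned supervision and adds a reinforcement learning stage for self-refinement, yielding substantial gains in single-step generation quality and layout fidelity.
Details
Motivation: Existing multimodal large language models generate code directly from screenshots but are trained without observing the visual outcome of that code. Human developers, in contrast, iteratively render their implementation, compare it with the design, and learn how visual differences relate to code changes. Inspired by this, the paper aims to have the model learn from visual differences to improve code generation.
Result: Experiments show that VisRefiner substantially improves single-step generation quality and layout fidelity and gives models strong self-refinement ability, demonstrating the effectiveness of learning from visual differences.
Insight: The novelty is to build supervision that links visual differences to code edits and to add a reinforcement-learning-based self-refinement stage, so the model can iteratively improve its code by observing how the rendered output differs from the target design, mirroring how human developers debug.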
Abstract: Screenshot-to-code generation aims to translate user interface screenshots into executable frontend code that faithfully reproduces the target layout and style. Existing multimodal large language models perform this mapping directly from screenshots but are trained without observing the visual outcomes of their generated code. In contrast, human developers iteratively render their implementation, compare it with the design, and learn how visual differences relate to code changes. Inspired by this process, we propose VisRefiner, a training framework that enables models to learn from visual differences between rendered predictions and reference designs. We construct difference-aligned supervision that associates visual discrepancies with corresponding code edits, allowing the model to understand how appearance variations arise from implementation changes. Building on this, we introduce a reinforcement learning stage for self-refinement, where the model improves its generated code by observing both the rendered output and the target design, identifying their visual differences, and updating the code accordingly. Experiments show that VisRefiner substantially improves single-step generation quality and layout fidelity, while also endowing models with strong self-refinement ability. These results demonstrate the effectiveness of learning from visual differences for advancing screenshot-to-code generation.
[77] GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks? cs.CV | cs.AIPDF
Ruihang Li, Leigang Qu, Jingxu Zhang, Dongnan Gui, Mengde Xu
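TL;DR: This paper proposes GenArena, a unified framework for evaluating visual generation tasks that addresses the stochastic inconsistency and poor human alignment of conventional absolute scoring. By adopting a pairwise comparison paradigm, it substantially improves the stability of evaluation and its agreement with human judgment, and it finds that this protocol lets open-source models outperform top-tier proprietary models as judges.
Details
Motivation: Visual generation models have advanced faster than traditional evaluation methods, and the widely used absolute pointwise scoring standard suffers from stochastic inconsistency and poor alignment with human perception. A more reliable, automated evaluation scheme is needed.
Result: Experiments show that the pairwise comparison protocol raises evaluation accuracy by more than 20% and reaches a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, far above the 0.36 of pointwise methods. The framework is used to benchmark state-of-the-art models across a variety of visual generation tasks.
Insight: The core idea is to move the evaluation paradigm from absolute scoring to pairwise comparison. This stabilizes results and aligns them better with human judgment, and it unexpectedly allows off-the-shelf open-source models to outperform top-tier proprietary models as evaluators, giving the visual generation field a more rigorous, automated evaluation standard.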
Abstract: The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard, across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited due to stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments uncover a transformative finding that simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, drastically surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
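To make the reported Spearman figures concrete, the sketch below shows one simple way pairwise judgments can be aggregated into a ranking and then compared with a reference leaderboard. Win-rate aggregation is an assumption chosen for brevity (the abstract does not say how GenArena aggregates comparisons), and all model names and outcomes are made up. The Spearman formula used here is the standard one for rankings without ties.

```python
from collections import defaultdict


def win_rates(pairwise_results):
    """pairwise_results: list of (winner, loser) judgments."""
    wins, games = defaultdict(int), defaultdict(int)
    for winner, loser in pairwise_results:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {m: wins[m] / games[m] for m in games}


def to_ranks(scores):
    """Rank 1 is best; assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {m: r for r, m in enumerate(ordered, start=1)}


def spearman(rank_a, rank_b):
    """Spearman's rho for two tie-free rankings of the same items."""
    n = len(rank_a)
    d2 = sum((rank_a[m] - rank_b[m]) ** 2 for m in rank_a)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical judge outcomes for four generators.
results = [("A", "B"), ("A", "C"), ("A", "D"),
           ("B", "C"), ("B", "D"), ("D", "C")]
judge_ranks = to_ranks(win_rates(results))

# Hypothetical human-preference leaderboard.
reference_ranks = {"A": 1, "B": 2, "C": 3, "D": 4}
rho = spearman(judge_ranks, reference_ranks)
```

With these toy numbers the judge swaps the last two models relative to the reference, giving a correlation of 0.8.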
[78] MambaVF: State Space Model for Efficient Video Fusion cs.CVPDF
Zixiang Zhao, Yukun Cui, Lilun Deng, Haowen Bai, Haotong Qin
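TL;DR: This paper presents MambaVF, an efficient video fusion framework based on state space models (SSMs). It reformulates video fusion as a sequential state update process, performs temporal modeling without explicit motion estimation, and captures long-range dependencies through a spatio-temporal bidirectional scanning mechanism, significantly reducing computation and memory costs.
Details
Motivation: Existing video fusion methods rely heavily on optical flow estimation and feature warping, which causes heavy computational overhead and limits scalability. The paper aims to design an efficient temporal modeling framework that needs no explicit motion estimation.
Result: Extensive experiments across multiple benchmarks (multi-exposure, multi-focus, infrared-visible, and medical video fusion) show that MambaVF achieves state-of-the-art performance. It is also highly efficient: compared with existing methods, it reduces parameters by up to 92.25% and computational FLOPs by 88.79%, with a 2.1x speedup.
Insight: The main novelty is to recast video fusion as a sequential state update problem and to exploit the linear complexity of SSMs for temporal modeling, avoiding the compute-heavy optical flow estimation of conventional methods. The proposed lightweight SSM-based fusion module replaces flow-guided alignment with spatio-temporal bidirectional scanning, enabling efficient aggregation of information across frames.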
Abstract: Video fusion is a fundamental technique in various video processing tasks. However, existing video fusion methods heavily rely on optical flow estimation and feature warping, resulting in severe computational overhead and limited scalability. This paper presents MambaVF, an efficient video fusion framework based on state space models (SSMs) that performs temporal modeling without explicit motion estimation. First, by reformulating video fusion as a sequential state update process, MambaVF captures long-range temporal dependencies with linear complexity while significantly reducing computation and memory costs. Second, MambaVF proposes a lightweight SSM-based fusion module that replaces conventional flow-guided alignment via a spatio-temporal bidirectional scanning mechanism. This module enables efficient information aggregation across frames. Extensive experiments across multiple benchmarks demonstrate that our MambaVF achieves state-of-the-art performance in multi-exposure, multi-focus, infrared-visible, and medical video fusion tasks. We highlight that MambaVF enjoys high efficiency, reducing up to 92.25% of parameters and 88.79% of computational FLOPs, with a 2.1x speedup compared to existing methods. Project page: https://mambavf.github.io
[79] Context Forcing: Consistent Autoregressive Video Generation with Long Context cs.CVPDF
Shuo Chen, Cong Wei, Sun Sun, Ping Nie, Kai Zhou
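TL;DR: This paper proposes Context Forcing, a new framework that addresses the supervision mismatch in real-time long video generation caused by a teacher model that can only access short context. It trains a long-context student with a long-context teacher and introduces a Slow-Fast Memory architecture to manage compute cost at extreme durations, achieving effective context lengths beyond 20 seconds and markedly better temporal consistency in long videos.
Details
Motivation: Existing real-time long video generation methods typically use streaming tuning, training a long-context student with a short-context (memoryless) teacher. Because the teacher cannot access long-term history, it cannot guide the student on global temporal dependencies, which caps the student's context length.
Result: Experiments show that the method surpasses state-of-the-art baselines such as LongLive and Infinite-RoPE on long video evaluation metrics, with effective context lengths exceeding 20 seconds, 2 to 10 times longer than existing methods, while preserving strong long-term consistency.
Insight: The core novelty is to remove the supervision mismatch with a long-context teacher and to design a Slow-Fast Memory architecture that manages the linearly growing context efficiently and reduces visual redundancy, making long-term-consistent autoregressive video generation computationally feasible.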
Abstract: Recent approaches to real-time long video generation typically employ streaming tuning strategies, attempting to train a long-context student using a short-context (memoryless) teacher. In these frameworks, the student performs long rollouts but receives supervision from a teacher limited to short 5-second windows. This structural discrepancy creates a critical student-teacher mismatch: the teacher’s inability to access long-term history prevents it from guiding the student on global temporal dependencies, effectively capping the student’s context length. To resolve this, we propose Context Forcing, a novel framework that trains a long-context student via a long-context teacher. By ensuring the teacher is aware of the full generation history, we eliminate the supervision mismatch, enabling the robust training of models capable of long-term consistency. To make this computationally feasible for extreme durations (e.g., 2 minutes), we introduce a context management system that transforms the linearly growing context into a Slow-Fast Memory architecture, significantly reducing visual redundancy. Extensive results demonstrate that our method enables effective context lengths exceeding 20 seconds, 2 to 10 times longer than state-of-the-art methods like LongLive and Infinite-RoPE. By leveraging this extended context, Context Forcing preserves superior consistency across long durations, surpassing state-of-the-art baselines on various long video evaluation metrics.
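The abstract does not describe how the Slow-Fast Memory is organized, so the following is only one plausible reading, sketched under explicit assumptions: the most recent frames are kept at full rate in a "fast" buffer, while older history is subsampled with a fixed stride into a bounded "slow" buffer. The function name, stride, and capacities are all hypothetical.

```python
def slow_fast_context(frames, fast_window=4, slow_stride=4, slow_capacity=8):
    """Split a growing frame history into (slow, fast) context buffers.

    fast: the last `fast_window` frames, kept densely.
    slow: every `slow_stride`-th older frame, truncated to the most
          recent `slow_capacity` entries so the context stays bounded.
    """
    if len(frames) <= fast_window:
        return [], list(frames)
    split = len(frames) - fast_window
    fast = list(frames[split:])
    slow = list(frames[:split:slow_stride])[-slow_capacity:]
    return slow, fast


# Frame indices stand in for latent frames.
history = list(range(40))
slow, fast = slow_fast_context(history)
context_size = len(slow) + len(fast)
```

Here a 40-frame history is represented by 12 entries. The context no longer grows linearly with video length, at the cost of coarser coverage of the distant past.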
[80] Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction For 3D-Aware Distillation cs.CVPDF
David Shavin, Sagie Benaim
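TL;DR: This paper proposes Splat and Distill, a framework for giving 2D vision foundation models 3D awareness. It augments the teacher model with a fast, feed-forward 3D reconstruction pipeline: the 2D features extracted by the teacher are lifted into an explicit 3D Gaussian representation and then projected onto novel viewpoints to produce 2D feature maps that supervise the student, distilling geometrically grounded knowledge.
Details
Motivation: Vision foundation models perform well on many downstream 2D tasks but generally lack 3D awareness. The paper addresses this by instilling robust 3D awareness into 2D vision foundation models through knowledge distillation.
Result: The method is evaluated on a suite of downstream tasks, including monocular depth estimation, surface normal estimation, multi-view correspondence, and semantic segmentation. It significantly outperforms prior work, achieving substantial gains in 3D awareness while also enriching the semantics of the 2D features.
Insight: The core novelty is to replace the slow per-scene optimization of prior work with fast feed-forward 3D lifting, which avoids feature-averaging artifacts and creates a dynamic learning process in which the teacher's consistency improves alongside the student's. Combining an explicit 3D representation (3D Gaussians) with knowledge distillation to give 2D models geometric understanding is a promising direction.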
Abstract: Vision Foundation Models (VFMs) have achieved remarkable success when applied to various downstream 2D tasks. Despite their effectiveness, they often exhibit a critical lack of 3D awareness. To this end, we introduce Splat and Distill, a framework that instills robust 3D awareness into 2D VFMs by augmenting the teacher model with a fast, feed-forward 3D reconstruction pipeline. Given 2D features produced by a teacher model, our method first lifts these features into an explicit 3D Gaussian representation, in a feed-forward manner. These 3D features are then "splatted" onto novel viewpoints, producing a set of novel 2D feature maps used to supervise the student model, "distilling" geometrically grounded knowledge. By replacing slow per-scene optimization of prior work with our feed-forward lifting approach, our framework avoids feature-averaging artifacts, creating a dynamic learning process where the teacher’s consistency improves alongside that of the student. We conduct a comprehensive evaluation on a suite of downstream tasks, including monocular depth estimation, surface normal estimation, multi-view correspondence, and semantic segmentation. Our method significantly outperforms prior works, not only achieving substantial gains in 3D awareness but also enhancing the underlying semantic richness of 2D features. Project page is available at https://davidshavin4.github.io/Splat-and-Distill/
[81] V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval cs.CVPDF
Dongyang Chen, Chaoyang Wang, Dezhao SU, Xi Xiao, Zeyu Zhang
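TL;DR: This paper proposes V-Retrver, an evidence-driven framework for universal multimodal retrieval that reformulates retrieval as an agentic reasoning process grounded in visual inspection. The method selectively acquires visual evidence during reasoning through external visual tools and performs multimodal interleaved reasoning that alternates between hypothesis generation and targeted visual verification.
Details
Motivation: Existing Chain-of-Thought (CoT) retrieval methods built on multimodal large language models (MLLMs) are largely language-driven and rely on static visual encodings. They cannot actively verify fine-grained visual evidence, which leads to speculative reasoning in visually ambiguous cases.
Result: Across multiple multimodal retrieval benchmarks, the method improves retrieval accuracy by 23.0% on average and also improves the reliability of perception-driven reasoning and generalization.
Insight: The novelty is to frame retrieval as evidence-driven agentic reasoning and to train the retrieval agent with a curriculum-based strategy (supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective). Alternating hypothesis generation with visual verification makes the reasoning more reliable and accurate.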
Abstract: Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven, relying on static visual encodings and lacking the ability to actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retrver, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification. To train such an evidence-gathering retrieval agent, we adopt a curriculum-based learning strategy combining supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Experiments across multiple multimodal retrieval benchmarks demonstrate consistent improvements in retrieval accuracy (with 23.0% improvements on average), perception-driven reasoning reliability, and generalization.
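[82] InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions cs.CV | cs.GR | cs.RO
Sirui Xu, Samuel Schulter, Morteza Ziyadi, Xialin He, Xiaohan Fei
TL;DR: InterPrior is a scalable generative control framework for physics-based human-object interaction. Through large-scale imitation pretraining followed by reinforcement learning post-training, it learns a unified generative controller that reconstructs motion from multimodal observations and high-level intent, and it uses data augmentation and reinforcement learning finetuning to improve generalization to unseen goals and initializations.
Details
Motivation: The work addresses how humanoids can, like humans, coordinate balance, contact, and manipulation naturally from high-level intent (such as affordance) rather than explicit whole-body motion planning, so that loco-manipulation skills compose and generalize across contexts while whole-body coordination stays physically coherent.
Result: The paper shows that the method generalizes beyond the training data, for example to interactions with unseen objects, and demonstrates its potential for user-interactive control and real robot deployment.
Insight: The novelty lies in distilling a full-reference imitation expert into a versatile goal-conditioned variational policy, then consolidating the reconstructed latent skills through data augmentation with physical perturbations and reinforcement learning finetuning. The result is a motion prior that generalizes reliably, which extends the scalability of generative control.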
Abstract: Humans rarely plan whole-body interactions with objects at the level of explicit whole-body movements. High-level intentions, such as affordance, define the goal, while coordinated balance, contact, and manipulation can emerge naturally from underlying physical and motor priors. Scaling such priors is key to enabling humanoids to compose and generalize loco-manipulation skills across diverse contexts while maintaining physically coherent whole-body coordination. To this end, we introduce InterPrior, a scalable framework that learns a unified generative controller through large-scale imitation pretraining and post-training by reinforcement learning. InterPrior first distills a full-reference imitation expert into a versatile, goal-conditioned variational policy that reconstructs motion from multimodal observations and high-level intent. While the distilled policy reconstructs training behaviors, it does not generalize reliably due to the vast configuration space of large-scale human-object interactions. To address this, we apply data augmentation with physical perturbations, and then perform reinforcement learning finetuning to improve competence on unseen goals and initializations. Together, these steps consolidate the reconstructed latent skills into a valid manifold, yielding a motion prior that generalizes beyond the training data, e.g., it can incorporate new behaviors such as interactions with unseen objects. We further demonstrate its effectiveness for user-interactive control and its potential for real robot deployment.
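[83] Thinking with Geometry: Active Geometry Integration for Spatial Reasoning cs.CV
Haoyuan Li, Qihang Cao, Tao Tang, Kun Xiang, Zihan Guo
TL;DR: This paper proposes GeoThinker, a framework that targets the semantic-geometry misalignment and redundant signals caused by passive fusion of geometric priors in multimodal large language models for spatial reasoning. Through an active perception mechanism, the model selectively retrieves geometric evidence according to its internal reasoning demands, which improves spatial reasoning.
Details
Motivation: Existing MLLMs usually fuse the geometric priors supplied by 3D encoders passively and globally during spatial reasoning, which leads to misalignment between semantic and geometric information and to redundant signals. An active geometry integration method is therefore needed to improve spatial reasoning performance.
Result: GeoThinker reaches a peak score of 72.6 on VSI-Bench, a new state of the art, and shows robust generalization and significantly improved spatial perception in complex downstream scenarios such as embodied referring and autonomous driving.
Insight: The novelty is the shift from passive fusion to active perception. Spatial-Grounded Fusion and Importance Gating let the model use semantic visual priors to selectively query and integrate task-relevant geometry, and the results indicate that actively integrating spatial structure is a key capability for next-generation spatial intelligence.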
Abstract: Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused in an indiscriminate manner, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. GeoThinker achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward task-relevant structures. Comprehensive evaluation results show that GeoThinker sets a new state-of-the-art in spatial intelligence, achieving a peak score of 72.6 on the VSI-Bench. Furthermore, GeoThinker demonstrates robust generalization and significantly improved spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence. Code can be found at https://github.com/Li-Hao-yuan/GeoThinker.
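[84] SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs cs.CV
Jintao Tong, Shilin Yan, Hongwei Xue, Xiaojun Tang, Kunyu Shi
TL;DR: This paper proposes SwimBird, a reasoning-switchable multimodal large language model that dynamically selects among three modes (text-only reasoning, vision-only reasoning, and interleaved vision-text reasoning) to suit each query, improving performance on vision-dense tasks while preserving textual logical reasoning.
Details
Motivation: Existing MLLMs reason mainly with textual chain-of-thought, which limits their effectiveness on vision-intensive tasks. Recent methods that inject a fixed number of continuous hidden states as “visual thoughts” improve visual performance but often degrade text-based logical reasoning. The core problem is a rigid, pre-defined reasoning pattern that cannot adaptively choose the most suitable thinking modality for each query.
Result: Across diverse benchmarks covering textual reasoning and challenging visual understanding, SwimBird achieves state-of-the-art results and robust gains over prior fixed-pattern multimodal reasoning methods.
Insight: The novelty is a hybrid autoregressive framework that unifies next-token prediction for textual thoughts with next-embedding prediction for visual thoughts, together with a systematic reasoning-mode curation strategy that builds a diverse supervised fine-tuning dataset covering all three reasoning modes. Together these enable flexible, query-adaptive mode selection.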
Abstract: Multimodal Large Language Models (MLLMs) have made remarkable progress in multimodal perception and reasoning by bridging vision and language. However, most existing MLLMs perform reasoning primarily with textual CoT, which limits their effectiveness on vision-intensive tasks. Recent approaches inject a fixed number of continuous hidden states as “visual thoughts” into the reasoning process and improve visual performance, but often at the cost of degraded text-based logical reasoning. We argue that the core limitation lies in a rigid, pre-defined reasoning pattern that cannot adaptively choose the most suitable thinking modality for different user queries. We introduce SwimBird, a reasoning-switchable MLLM that dynamically switches among three reasoning modes conditioned on the input: (1) text-only reasoning, (2) vision-only reasoning (continuous hidden states as visual thoughts), and (3) interleaved vision-text reasoning. To enable this capability, we adopt a hybrid autoregressive formulation that unifies next-token prediction for textual thoughts with next-embedding prediction for visual thoughts, and design a systematic reasoning-mode curation strategy to construct SwimBird-SFT-92K, a diverse supervised fine-tuning dataset covering all three reasoning patterns. By enabling flexible, query-adaptive mode selection, SwimBird preserves strong textual logic while substantially improving performance on vision-dense tasks. Experiments across diverse benchmarks covering textual reasoning and challenging visual understanding demonstrate that SwimBird achieves state-of-the-art results and robust gains over prior fixed-pattern multimodal reasoning methods.
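[85] Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning cs.CV
Xuejun Zhang, Aditi Tiwari, Zhenhailong Wang, Heng Ji
TL;DR: This paper proposes CAMCUE, a framework that handles perspective shifts in multi-image spatial reasoning by using camera pose as an explicit geometric anchor. It injects each image's camera pose into the visual tokens, maps viewpoints described in natural language to a target camera pose, and synthesizes a pose-conditioned imagined target view to support question answering. The authors build the CAMCUE-DATA dataset with 27,668 training and 508 test instances, and experiments show clear gains in both accuracy and inference efficiency.
Details
Motivation: Current multimodal large language models still struggle with multi-image spatial reasoning, especially perspective taking, which requires building a coherent 3D understanding from multi-view observations and reasoning from a new viewpoint specified in language. Existing methods do not model camera geometry explicitly, which weakens cross-view fusion and novel-view reasoning.
Result: On CAMCUE-DATA, CAMCUE improves overall accuracy by 9.06%. When predicting the target camera pose from natural-language descriptions, rotation accuracy within a 20° error exceeds 90%, and translation predictions perform well within a 0.5 error threshold. Inference time drops from 256.6 seconds to 1.45 seconds per example, enabling fast, interactive use.
Insight: The core innovation is injecting camera pose into the multimodal reasoning framework as an explicit geometric prior, which maps language descriptions directly to camera poses and avoids expensive test-time search-and-match. This pose-aware cross-view fusion gives multi-image 3D scene understanding an interpretable geometric basis and substantially improves inference efficiency.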
Abstract: Multi-image spatial reasoning remains challenging for current multimodal large language models (MLLMs). While single-view perception is inherently 2D, reasoning over multiple views requires building a coherent scene understanding across viewpoints. In particular, we study perspective taking, where a model must build a coherent 3D understanding from multi-view observations and use it to reason from a new, language-specified viewpoint. We introduce CAMCUE, a pose-aware multi-image framework that uses camera pose as an explicit geometric anchor for cross-view fusion and novel-view reasoning. CAMCUE injects per-view pose into visual tokens, grounds natural-language viewpoint descriptions to a target camera pose, and synthesizes a pose-conditioned imagined target view to support answering. To support this setting, we curate CAMCUE-DATA with 27,668 training and 508 test instances pairing multi-view images and poses with diverse target-viewpoint descriptions and perspective-shift questions. We also include human-annotated viewpoint descriptions in the test split to evaluate generalization to human language. CAMCUE improves overall accuracy by 9.06% and predicts target poses from natural-language viewpoint descriptions with over 90% rotation accuracy within 20° and translation accuracy within a 0.5 error threshold. This direct grounding avoids expensive test-time search-and-match, reducing inference time from 256.6s to 1.45s per example and enabling fast, interactive use in real-world scenarios.
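The abstract reports rotation accuracy "within 20°" but does not say how the rotation error is measured. A common choice, assumed here, is the geodesic angle between the predicted and ground-truth rotation matrices. The helper names (`rot_z`, `rotation_error_deg`, `accuracy_within`) are illustrative and not taken from the paper.

```python
import math

def rot_z(deg):
    """Rotation matrix for a rotation of `deg` degrees about the z-axis."""
    r = math.radians(deg)
    c, s = math.cos(r), math.sin(r)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rotation_error_deg(r_pred, r_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_pred^T R_gt) equals the sum of elementwise products.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))

def accuracy_within(preds, gts, threshold_deg=20.0):
    """Fraction of predictions whose rotation error is at most the threshold."""
    hits = sum(rotation_error_deg(p, g) <= threshold_deg for p, g in zip(preds, gts))
    return hits / len(preds)
```

Under this assumption, a prediction that is 10° off counts as a hit at the 20° threshold and one that is 50° off does not.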
cs.GR
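[86] Untwisting RoPE: Frequency Control for Shared Attention in DiTs cs.GR | cs.CV
Aryan Mikaeili, Or Patashnik, Andrea Tagliasacchi, Daniel Cohen-Or, Ali Mahdavi-Amiri
TL;DR: This paper presents a principled analysis of Rotary Positional Embeddings (RoPE) that reveals how its frequency components behave under shared attention. The high-frequency components of RoPE dominate the attention computation, so when a target image is generated from a reference image the model tends to copy reference content instead of extracting stylistic cues. Building on this, the paper proposes selectively modulating RoPE frequency bands so that attention reflects semantic similarity rather than strict positional alignment, which restores stable and meaningful shared attention in modern transformer-based diffusion models and gives effective control over style transfer.
Details
Motivation: The goal is to understand how positional encodings, RoPE in particular, behave in multimodal and shared-attention settings, and to fix the unintended content copying that arises under shared attention (for example, when generating a target image from a reference image), where the model reproduces reference content rather than extracting only its stylistic cues.
Result: Applied to modern transformer-based diffusion architectures (DiTs), the method restores stable and meaningful shared attention and gives effective control over the degree of style transfer versus content copying, yielding style-aligned generation that does not duplicate reference content.
Insight: The novelty is a frequency decomposition of RoPE that exposes the mechanism by which high-frequency components cause unintended content copying, plus a method that selectively modulates RoPE frequency bands to steer attention toward semantic rather than positional alignment. This offers a new route to controllable style transfer in shared-attention models.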
Abstract: Positional encodings are essential to transformer-based generative models, yet their behavior in multimodal and attention-sharing settings is not fully understood. In this work, we present a principled analysis of Rotary Positional Embeddings (RoPE), showing that RoPE naturally decomposes into frequency components with distinct positional sensitivities. We demonstrate that this frequency structure explains why shared-attention mechanisms, where a target image is generated while attending to tokens from a reference image, can lead to reference copying, in which the model reproduces content from the reference instead of extracting only its stylistic cues. Our analysis reveals that the high-frequency components of RoPE dominate the attention computation, forcing queries to attend mainly to spatially aligned reference tokens and thereby inducing this unintended copying behavior. Building on these insights, we introduce a method for selectively modulating RoPE frequency bands so that attention reflects semantic similarity rather than strict positional alignment. Applied to modern transformer-based diffusion architectures, where all tokens share attention, this modulation restores stable and meaningful shared attention. As a result, it enables effective control over the degree of style transfer versus content copying, yielding a proper style-aligned generation process in which stylistic attributes are transferred without duplicating reference content.
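To make the frequency structure concrete, here is a minimal 1D sketch of standard RoPE with a per-band scale factor. The `band_scale` argument is a hypothetical illustration of "modulating frequency bands"; the abstract does not specify the paper's actual modulation rule, and DiTs typically use 2D positions rather than the scalar position used here.

```python
import math

def rope_frequencies(dim, base=10000.0):
    """Standard RoPE frequencies: one per pair of channels, highest first."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def apply_rope(x, pos, freqs, band_scale=None):
    """Rotate each channel pair of x by pos * freq * scale.

    band_scale is an assumed knob: 1.0 keeps a band unchanged and 0.0
    removes that band's positional sensitivity entirely.
    """
    if band_scale is None:
        band_scale = [1.0] * len(freqs)
    out = []
    for i, (freq, scale) in enumerate(zip(freqs, band_scale)):
        angle = pos * freq * scale
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def dot(u, v):
    """Plain dot product, standing in for an attention logit."""
    return sum(a * b for a, b in zip(u, v))
```

With unmodified RoPE the logit `dot(apply_rope(q, m, f), apply_rope(k, n, f))` depends only on the offset `m - n`. Scaling a band to zero makes that band's contribution independent of position, which is one simple way attention could be pushed toward content similarity.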
cs.DB
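[87] Pruning Minimal Reasoning Graphs for Efficient Retrieval-Augmented Generation cs.DB | cs.CL | cs.LG
Ning Wang, Kuanyan Zhu, Daniel Yuehwoon Yee, Yitang Gao, Shiying Huang
TL;DR: This paper proposes AutoPrunedRetriever, a graph-style retrieval-augmented generation (RAG) system that addresses the inflated tokens, latency, and cost caused by conventional RAG systems re-retrieving and re-reasoning for every query. It persists the minimal reasoning subgraph built for earlier questions and incrementally extends it for later queries, retrieving and prompting over a compact symbolic structure rather than raw text.
Details
Motivation: Most current RAG systems treat every query as an independent task, repeatedly retrieving long passages and reasoning from scratch, which increases token usage, latency, and cost. The work aims to improve efficiency and reduce this overhead.
Result: On GraphRAG-Benchmark (Medical and Novel), both AutoPrunedRetriever variants (using a REBEL or an LLM extractor) achieve state-of-the-art complex reasoning accuracy, improving over HippoRAG2 by roughly 9–11 points while remaining competitive on contextual summarization and generation. On the harder STEM and TV benchmarks it again ranks first while using up to two orders of magnitude fewer tokens than graph-heavy baselines.
Insight: The novelties are: 1) a graph-style RAG architecture that persists minimal reasoning subgraphs and supports incremental extension; 2) a compact, ID-indexed codebook that stores entities and relations and enables symbolic retrieval; 3) a two-layer consolidation policy (fast ANN/KNN alias detection and selective k-means) with pruning of low-value structure to keep the graph compact; 4) prompts that retain only overlap representatives and genuinely new evidence, reducing redundancy. These substantially improve efficiency and suit long-running sessions, evolving corpora, and multi-agent pipelines.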
Abstract: Retrieval-augmented generation (RAG) is now standard for knowledge-intensive LLM tasks, but most systems still treat every query as fresh, repeatedly re-retrieving long passages and re-reasoning from scratch, inflating tokens, latency, and cost. We present AutoPrunedRetriever, a graph-style RAG system that persists the minimal reasoning subgraph built for earlier questions and incrementally extends it for later ones. AutoPrunedRetriever stores entities and relations in a compact, ID-indexed codebook and represents questions, facts, and answers as edge sequences, enabling retrieval and prompting over symbolic structure instead of raw text. To keep the graph compact, we apply a two-layer consolidation policy (fast ANN/KNN alias detection plus selective $k$-means once a memory threshold is reached) and prune low-value structure, while prompts retain only overlap representatives and genuinely new evidence. We instantiate two front ends: AutoPrunedRetriever-REBEL, which uses REBEL as a triplet parser, and AutoPrunedRetriever-llm, which swaps in an LLM extractor. On GraphRAG-Benchmark (Medical and Novel), both variants achieve state-of-the-art complex reasoning accuracy, improving over HippoRAG2 by roughly 9–11 points, and remain competitive on contextual summarization and generation. On our harder STEM and TV benchmarks, AutoPrunedRetriever again ranks first, while using up to two orders of magnitude fewer tokens than graph-heavy baselines, making it a practical substrate for long-running sessions, evolving corpora, and multi-agent pipelines.
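As a rough illustration of what an "ID-indexed codebook" with facts stored as edges might look like, the sketch below interns entity and relation strings as integers and keeps deduplicated `(head, relation, tail)` ID triples. The class and method names are hypothetical; the abstract does not describe the system's actual data layout, consolidation, or pruning logic.

```python
class Codebook:
    """Maps strings to small integer IDs and back (illustrative only)."""

    def __init__(self):
        self._ids = {}
        self._names = []

    def intern(self, name):
        """Return the ID for name, assigning the next free ID if it is new."""
        if name not in self._ids:
            self._ids[name] = len(self._names)
            self._names.append(name)
        return self._ids[name]

    def name(self, idx):
        return self._names[idx]


class EdgeStore:
    """Stores facts as (head_id, relation_id, tail_id) triples without duplicates."""

    def __init__(self):
        self.entities = Codebook()
        self.relations = Codebook()
        self.edges = []
        self._seen = set()

    def add(self, head, relation, tail):
        edge = (
            self.entities.intern(head),
            self.relations.intern(relation),
            self.entities.intern(tail),
        )
        if edge not in self._seen:  # a repeated fact costs no extra storage
            self._seen.add(edge)
            self.edges.append(edge)
        return edge

    def render(self, edge):
        """Decode an ID triple back to text, e.g. when building a prompt."""
        head, relation, tail = edge
        return (self.entities.name(head), self.relations.name(relation),
                self.entities.name(tail))
```

Storing integers instead of repeated strings is what keeps both the persisted graph and any prompt built from it compact; a later question that mentions an already-known entity reuses its ID.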
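[88] Cost-Efficient RAG for Entity Matching with LLMs: A Blocking-based Exploration cs.DB | cs.CL
Chuangtao Ma, Zeyu Zhang, Arijit Khan, Sebastian Schelter, Paul Groth
TL;DR: This paper proposes CE-RAG4EM, a cost-efficient retrieval-augmented generation (RAG) architecture for large-scale entity matching. It reduces computation through blocking-based batch retrieval and generation and establishes a unified framework for analyzing and evaluating RAG systems for entity matching. Experiments suggest that it substantially reduces end-to-end runtime while matching or improving matching quality.
Details
Motivation: Existing RAG pipelines incur substantial retrieval and generation overhead in large-scale entity matching; this paper addresses that cost-efficiency problem.
Result: In extensive experiments, CE-RAG4EM achieves comparable or improved matching quality relative to strong baselines while substantially reducing end-to-end runtime.
Insight: The novelty is a blocking-based batch processing strategy that optimizes the RAG pipeline, along with the finding that key configuration parameters impose an inherent trade-off between performance and overhead, which offers practical guidance for designing efficient and scalable RAG systems for entity matching.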
Abstract: Retrieval-augmented generation (RAG) enhances LLM reasoning in knowledge-intensive tasks, but existing RAG pipelines incur substantial retrieval and generation overhead when applied to large-scale entity matching. To address this limitation, we introduce CE-RAG4EM, a cost-efficient RAG architecture that reduces computation through blocking-based batch retrieval and generation. We also present a unified framework for analyzing and evaluating RAG systems for entity matching, focusing on blocking-aware optimizations and retrieval granularity. Extensive experiments suggest that CE-RAG4EM can achieve comparable or improved matching quality while substantially reducing end-to-end runtime relative to strong baselines. Our analysis further reveals that key configuration parameters introduce an inherent trade-off between performance and overhead, offering practical guidance for designing efficient and scalable RAG systems for entity matching and data integration.
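The cost saving from blocking can be sketched as follows: candidate pairs are grouped by a blocking key, and retrieval and generation are invoked once per block instead of once per pair. The key function, the `retrieve` and `decide` callables, and the record layout are all assumptions for illustration; the abstract does not describe CE-RAG4EM's actual interfaces.

```python
from collections import defaultdict

def blocking_key(record):
    """Hypothetical blocking key: first lowercased token of the name field."""
    return record["name"].lower().split()[0]

def batch_by_block(pairs, key=blocking_key):
    """Group candidate (left, right) pairs by the blocking key of the left record."""
    blocks = defaultdict(list)
    for left, right in pairs:
        blocks[key(left)].append((left, right))
    return dict(blocks)

def match_with_shared_context(pairs, retrieve, decide):
    """Run one retrieval and one decision call per block.

    retrieve(key) -> context and decide(block, context) -> list of results
    are stand-ins for a retriever and an LLM call supplied by the caller.
    """
    results = []
    for key, block in batch_by_block(pairs).items():
        context = retrieve(key)
        results.extend(decide(block, context))
    return results
```

With three candidate pairs that fall into two blocks, this makes two retrieval calls instead of three; the saving grows with block size, at the risk of the shared context being less specific to each pair.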
cs.IR
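[89] SAGE: Benchmarking and Improving Retrieval for Deep Research Agents cs.IR | cs.CL
Tiansheng Hu, Yilun Zhao, Canyu Zhang, Arman Cohan, Chen Zhao
TL;DR: This paper introduces SAGE, a benchmark for evaluating deep research agents on scientific literature retrieval. It finds that existing systems struggle with reasoning-intensive retrieval and that a BM25 retriever outperforms LLM-based retrievers. The authors further propose a corpus-level test-time scaling framework that uses LLMs to augment documents with metadata and keywords, which improves retrieval performance.
Details
Motivation: The work asks whether LLM-based retrievers can effectively improve deep research agent workflows, and it addresses the weak reasoning of existing agents in scientific literature retrieval.
Result: On SAGE (1,200 queries and a 200,000-paper corpus), six deep research agents were evaluated. BM25 outperforms LLM-based retrievers (ReasonIR and gte-Qwen2-7B-instruct) by approximately 30%. The proposed scaling framework yields gains of 8% on short-form questions and 2% on open-ended questions.
Insight: The novelties are the SAGE benchmark for scientific literature retrieval, the exposure of a limitation in existing agents (they generate keyword-oriented sub-queries), and a lightweight test-time document augmentation framework that improves retrieval without replacing the retriever.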
Abstract: Deep research agents have emerged as powerful systems for addressing complex queries. Meanwhile, LLM-based retrievers have demonstrated strong capability in following instructions or reasoning. This raises a critical question: can LLM-based retrievers effectively contribute to deep research agent workflows? To investigate this, we introduce SAGE, a benchmark for scientific literature retrieval comprising 1,200 queries across four scientific domains, with a 200,000-paper retrieval corpus. We evaluate six deep research agents and find that all systems struggle with reasoning-intensive retrieval. Using DR Tulu as backbone, we further compare BM25 and LLM-based retrievers (i.e., ReasonIR and gte-Qwen2-7B-instruct) as alternative search tools. Surprisingly, BM25 significantly outperforms LLM-based retrievers by approximately 30%, as existing agents generate keyword-oriented sub-queries. To improve performance, we propose a corpus-level test-time scaling framework that uses LLMs to augment documents with metadata and keywords, making retrieval easier for off-the-shelf retrievers. This yields 8% and 2% gains on short-form and open-ended questions, respectively.
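The interaction between keyword-oriented queries and document augmentation can be seen in a small Okapi BM25 sketch. BM25 only rewards exact term overlap, so appending keywords to a document lets a keyword query reach it. The scoring below uses a common BM25 variant with conventional defaults (`k1=1.5`, `b=0.75`), which are assumptions rather than the benchmark's settings; in the paper the added keywords come from an LLM, whereas any keywords passed in here would be supplied by hand.

```python
import math
from collections import Counter

def tokenize(text):
    """Whitespace tokenizer; real systems would normalize more carefully."""
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            idf = math.log(1.0 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = term_freq[term] + k1 * (1.0 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores

def augment(doc, keywords):
    """Append keywords to a document before indexing."""
    return doc + " keywords: " + " ".join(keywords)
```

A document that shares no terms with the query scores exactly zero; after `augment` adds the query's terms, the same document receives a positive score while unrelated documents stay at zero.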
cs.MM
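[90] XEmoGPT: An Explainable Multimodal Emotion Recognition Framework with Cue-Level Perception and Reasoning cs.MM | cs.AI | cs.CV
Hanwen Zhang, Yao Liu, Peiyuan Jiang, Lang Junjie, Xie Jun
TL;DR: This paper proposes XEmoGPT, an explainable multimodal emotion recognition framework. Dedicated video and audio emotional cue bridge modules strengthen perception of fine-grained emotional cues, and a large-scale dataset, EmoCue, supports cue-level reasoning. The paper also introduces the automated metric EmoCue-360 and the benchmark EmoCue-Eval.
Details
Motivation: Current explainable multimodal emotion recognition methods face two main challenges: general-purpose modality encoders are insensitive to fine-grained emotional cues, and existing datasets trade annotation quality against scale, leaving emotional cues under-supervised. In addition, existing evaluation metrics cannot adequately assess cue-level reasoning.
Result: Experiments show that XEmoGPT performs strongly in both emotional cue perception and reasoning; the abstract does not name specific benchmarks or report comparisons with the state of the art.
Insight: The novelties are dedicated video and audio emotional cue bridge modules for fine-grained perception, a large-scale emotional cue dataset to support reasoning, and an automated metric based on semantic similarity that quantifies cue-level reasoning.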
Abstract: Explainable Multimodal Emotion Recognition (EMER) plays a crucial role in applications such as human-computer interaction and social media analytics. However, current approaches struggle with cue-level perception and reasoning due to two main challenges: 1) general-purpose modality encoders are pretrained to capture global structures and general semantics rather than fine-grained emotional cues, resulting in limited sensitivity to emotional signals; and 2) available datasets usually involve a trade-off between annotation quality and scale, which leads to insufficient supervision for emotional cues and ultimately limits cue-level reasoning. Moreover, existing evaluation metrics are inadequate for assessing cue-level reasoning performance. To address these challenges, we propose eXplainable Emotion GPT (XEmoGPT), a novel EMER framework capable of both perceiving and reasoning over emotional cues. It incorporates two specialized modules: the Video Emotional Cue Bridge (VECB) and the Audio Emotional Cue Bridge (AECB), which enhance the video and audio encoders through carefully designed tasks for fine-grained emotional cue perception. To further support cue-level reasoning, we construct a large-scale dataset, EmoCue, designed to teach XEmoGPT how to reason over multimodal emotional cues. In addition, we introduce EmoCue-360, an automated metric that extracts and matches emotional cues using semantic similarity, and release EmoCue-Eval, a benchmark of 400 expert-annotated samples covering diverse emotional scenarios. Experimental results show that XEmoGPT achieves strong performance in both emotional cue perception and reasoning.
cs.RO
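[91] VLN-Pilot: Large Vision-Language Model as an Autonomous Indoor Drone Operator cs.RO | cs.CV
Bessie Dominguez-Dager, Sergio Suescun-Ferrandiz, Felix Escalona, Francisco Gomez-Donoso, Miguel Cazorla
TL;DR: This paper proposes VLN-Pilot, a framework that uses a large vision-language model as the “pilot” for autonomous indoor drone navigation. By interpreting free-form natural language instructions and grounding them in visual observations, it plans and executes drone trajectories in GPS-denied indoor environments, achieving semantic, context-aware high-level flight control.
Details
Motivation: Traditional rule-based or geometric path-planning methods lack semantic understanding and flexibility for indoor drone navigation. The work uses the multimodal reasoning of VLLMs to achieve more natural human-machine interaction and autonomous control with less task-specific engineering.
Result: Validated on a custom photorealistic indoor simulation benchmark, the VLLM-driven agent achieves high success rates on complex instruction-following tasks, including long-horizon navigation with multiple semantic targets.
Insight: The main novelty is using a large vision-language model directly as the core decision-maker for indoor drone navigation, tightly integrating language-driven semantic understanding with visual perception. This opens a path to scalable, human-friendly control of indoor drones (for example, inspection and search-and-rescue) and may substantially reduce operator workload while improving mission flexibility.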
Abstract: This paper introduces VLN-Pilot, a novel framework in which a large Vision-and-Language Model (VLLM) assumes the role of a human pilot for indoor drone navigation. By leveraging the multimodal reasoning abilities of VLLMs, VLN-Pilot interprets free-form natural language instructions and grounds them in visual observations to plan and execute drone trajectories in GPS-denied indoor environments. Unlike traditional rule-based or geometric path-planning approaches, our framework integrates language-driven semantic understanding with visual perception, enabling context-aware, high-level flight behaviors with minimal task-specific engineering. VLN-Pilot supports fully autonomous instruction-following for drones by reasoning about spatial relationships, obstacle avoidance, and dynamic reactivity to unforeseen events. We validate our framework on a custom photorealistic indoor simulation benchmark and demonstrate the ability of the VLLM-driven agent to achieve high success rates on complex instruction-following tasks, including long-horizon navigation with multiple semantic targets. Experimental results highlight the promise of replacing remote drone pilots with a language-guided autonomous agent, opening avenues for scalable, human-friendly control of indoor UAVs in tasks such as inspection, search-and-rescue, and facility monitoring. Our results suggest that VLLM-based pilots may dramatically reduce operator workload while improving safety and mission flexibility in constrained indoor environments.
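[92] CommCP: Efficient Multi-Agent Coordination via LLM-Based Communication with Conformal Prediction cs.RO | cs.AI | cs.CV | cs.LG | cs.MA
Xiaopan Zhang, Zejin Wang, Zhixu Li, Jianpeng Yao, Jiachen Li
TL;DR: This paper proposes CommCP, a multi-agent coordination framework that combines LLM-based communication with conformal prediction to tackle multi-agent multi-task embodied question answering, improving task success rate and exploration efficiency.
Details
Motivation: When multiple heterogeneous robots cooperate on tasks given in natural language, effective information gathering and coordinated communication are difficult. The paper formalizes the multi-agent multi-task embodied question answering (MM-EQA) problem and targets its communication redundancy and reliability issues.
Result: On the proposed MM-EQA benchmark, CommCP significantly outperforms baselines in task success rate and exploration efficiency.
Insight: The novelty is using conformal prediction to calibrate LLM-generated messages, which reduces receiver distraction and improves communication reliability, together with a new benchmark and problem formalization for multi-agent cooperative scenarios.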
Abstract: To complete assignments provided by humans in natural language, robots must interpret commands, generate and answer relevant questions for scene understanding, and manipulate target objects. Real-world deployments often require multiple heterogeneous robots with different manipulation capabilities to handle different assignments cooperatively. Beyond the need for specialized manipulation skills, effective information gathering is important in completing these assignments. To address this component of the problem, we formalize the information-gathering process in a fully cooperative setting as an underexplored multi-agent multi-task Embodied Question Answering (MM-EQA) problem, which is a novel extension of canonical Embodied Question Answering (EQA), where effective communication is crucial for coordinating efforts without redundancy. To address this problem, we propose CommCP, a novel LLM-based decentralized communication framework designed for MM-EQA. Our framework employs conformal prediction to calibrate the generated messages, thereby minimizing receiver distractions and enhancing communication reliability. To evaluate our framework, we introduce an MM-EQA benchmark featuring diverse, photo-realistic household scenarios with embodied questions. Experimental results demonstrate that CommCP significantly enhances the task success rate and exploration efficiency over baselines. The experiment videos, code, and dataset are available on our project website: https://comm-cp.github.io.
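For readers unfamiliar with conformal prediction, the following is a minimal sketch of the standard split conformal procedure for classification. It assumes the nonconformity score is one minus the model's probability for the true label. How CommCP applies conformal prediction to generated messages is not specified in the abstract, so this shows the generic technique, not the paper's pipeline.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Finite-sample quantile of calibration nonconformity scores.

    cal_scores holds 1 - p(true label) for each held-out calibration example.
    Sets built with the returned threshold cover the true label with
    probability at least 1 - alpha under exchangeability.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        # Too few calibration points for this alpha: include every label.
        return float("inf")
    return sorted(cal_scores)[k - 1]

def prediction_set(probs, qhat):
    """Keep every label whose nonconformity score is within the threshold."""
    return {label for label, p in probs.items() if 1.0 - p <= qhat}
```

A small prediction set signals a confident belief and a large one signals uncertainty. One plausible use in a communication setting, stated here as an assumption, is to share only messages whose set is small enough to be useful to the receiver.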
cs.LG
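[93] Internalizing LLM Reasoning via Discovery and Replay of Latent Actions cs.LG | cs.AI | cs.CL
Zhenning Shi, Yijia Zhu, Junhan Shi, Xun Zhang, Lei Wang
TL;DR: This paper proposes STIR (Self-Distilled Tools for Internal Reasoning), a framework that internalizes the chain-of-thought (CoT) reasoning of large language models (LLMs) into hidden states for more efficient inference. It reformulates reasoning enhancement as a dynamic latent trajectory control problem and steers the model through a three-stage pipeline (inducing latent actions, building a sparse control basis, and applying value-modulated trajectory interventions), improving accuracy while generating fewer tokens.
Details
Motivation: Existing activation steering methods use static control vectors that cannot adapt to the non-stationary evolution of hidden states in complex reasoning tasks. To overcome this, the paper develops a method that dynamically controls latent reasoning trajectories, internalizing the benefits of chain-of-thought reasoning without explicit generation steps.
Result: Extensive experiments on six arithmetic and logical reasoning benchmarks across four representative models show that, compared with vanilla decoding, STIR improves average accuracy by 1.9% to 7.5% while reducing average token consumption by up to 35%.
Insight: The core innovation is reformulating reasoning enhancement as dynamic latent trajectory control and realizing it with a synergistic three-stage pipeline: 1) inducing “intrinsic actions” from successful latent reasoning to serve as steering primitives; 2) building a compact, geometrically diverse “tool library” (a sparse control basis); 3) applying context-aware, value-modulated trajectory interventions through anchor-based gating. This offers a new paradigm for internalizing the benefits of explicit chain-of-thought into hidden states, improving both efficiency and fidelity.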
Abstract: The internalization of chain-of-thought processes into hidden states has emerged as a highly efficient paradigm for scaling test-time compute. However, existing activation steering methods rely on static control vectors that fail to adapt to the non-stationary evolution of complex reasoning tasks. To address this limitation, we propose STIR (Self-Distilled Tools for Internal Reasoning), a framework that reformulates reasoning enhancement as a dynamic latent trajectory control problem. STIR introduces a synergistic three-stage pipeline: (1) differential intrinsic action induction harvests latent reasoning successes to crystallize steering primitives; (2) sparse control basis construction curates a compact, geometrically diverse tool library; and (3) value-modulated trajectory intervention dynamically injects context-specific impulses via anchor-based gating. Extensive experiments on six arithmetic and logical benchmarks across four representative models demonstrate that STIR improves average accuracy by 1.9% to 7.5% while reducing average token consumption by up to 35% compared to vanilla decoding. These findings demonstrate that the benefits of explicit chain-of-thought can be realized through dynamic latent trajectory control, internalizing the reasoning process to bypass the explicit generation while achieving superior fidelity. Our code is available at https://github.com/sznnzs/LLM-Latent-Action.
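[94] Linear Model Merging Unlocks Simple and Scalable Multimodal Data Mixture Optimization cs.LG | cs.AI | cs.CL
Davide Berasi, Matteo Farina, Massimiliano Mancini, Elisa Ricci
TL;DR: This paper uses linear model merging to efficiently address data mixture optimization in supervised fine-tuning of multimodal large language models. It trains domain-specific expert models and interpolates their parameters to serve as a proxy for the performance of different data mixture ratios, avoiding the high cost of training on each mixture directly.
Details
Motivation: In supervised fine-tuning of multimodal large language models, finding the optimal mixture weights across multiple domain datasets is a critical but extremely expensive combinatorial search problem, known as the Data Mixture Optimization (DMO) problem.
Result: Experiments on 14 multimodal benchmarks show that the merged proxy models exhibit a high rank correlation with models trained on the actual data mixtures, validating the approach.
Insight: The novelty is turning model merging into a proxy evaluation tool for data mixture performance, which decouples the search for optimal mixtures from resource-intensive training and provides a scalable, efficient strategy for exploring mixture weights.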
Abstract: Selecting the best data mixture is critical for successful Supervised Fine-Tuning (SFT) of Multimodal Large Language Models. However, determining the optimal mixture weights across multiple domain-specific datasets remains a significant bottleneck due to the combinatorial search space and the high cost associated with even a single training run. This is the so-called Data Mixture Optimization (DMO) problem. On the other hand, model merging unifies domain-specific experts through parameter interpolation. This strategy is efficient, as it only requires a single training run per domain, yet oftentimes leads to suboptimal models. In this work, we take the best of both worlds, studying model merging as an efficient strategy for estimating the performance of different data mixtures. We train domain-specific multimodal experts and evaluate their weighted parameter-space combinations to estimate the efficacy of corresponding data mixtures. We conduct extensive experiments on 14 multimodal benchmarks, and empirically demonstrate that the merged proxy models exhibit a high rank correlation with models trained on actual data mixtures. This decouples the search for optimal mixtures from the resource-intensive training process, thereby providing a scalable and efficient strategy for navigating the complex landscape of mixture weights. Code is publicly available at https://github.com/BerasiDavide/mLLMs_merging_4_DMO.
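The two ingredients named in the abstract, weighted parameter-space combination and rank correlation, can be sketched in a few lines. The sketch assumes each expert is a dict of flat parameter lists and that mixture weights are used directly as interpolation coefficients. The paper's exact merging recipe may differ, and the Spearman helper below does not handle ties.

```python
def merge_linear(experts, weights):
    """Weighted average of expert parameters; weights are normalized to sum to 1."""
    total = sum(weights)
    coeffs = [w / total for w in weights]
    merged = {}
    for name in experts[0]:
        size = len(experts[0][name])
        merged[name] = [
            sum(c * expert[name][j] for c, expert in zip(coeffs, experts))
            for j in range(size)
        ]
    return merged

def spearman(xs, ys):
    """Spearman rank correlation for sequences without tied values."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, index in enumerate(order):
            result[index] = rank
        return result
    rank_x, rank_y = ranks(xs), ranks(ys)
    n = len(xs)
    squared_diff = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1.0 - 6.0 * squared_diff / (n * (n * n - 1))
```

In use, one would evaluate `merge_linear(experts, w)` for each candidate mixture `w`, evaluate a few models actually trained on those mixtures, and check with `spearman` whether the proxy scores order the mixtures the same way. A correlation near 1 means the cheap proxy can stand in for training when ranking mixtures.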
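[95] StagePilot: A Deep Reinforcement Learning Agent for Stage-Controlled Cybergrooming Simulation cs.LG | cs.CL
Heajun An, Qi Zhang, Minqian Liu, Xinyi Zhang, Sang Won Lee
TL;DR: This paper proposes StagePilot, an offline reinforcement learning dialogue agent that simulates the stage-wise progression of cybergrooming behavior for prevention training. The agent selects conversational stages using a composite reward that balances user sentiment and goal proximity, and it restricts transitions to adjacent stages for realism and interpretability.
Details
Motivation: Cybergrooming is a persistent threat to youth and calls for proactive educational interventions, so the work develops a dialogue agent that simulates the stage-wise progression of grooming behavior for prevention training.
Result: In LLM-based simulation evaluations, StagePilot performs well on stage completion, dialogue efficiency, and emotional engagement. The IQL+AWAC agent strikes the best balance between strategic planning and emotional coherence, reaching the final stage up to 43% more frequently than baselines while maintaining over 70% sentiment alignment.
Insight: The novelties are a composite reward that balances sentiment and goal proximity, and the restriction of stage transitions to adjacent stages for realism and interpretability. Viewed objectively, the method applies offline reinforcement learning to cybersecurity education and provides a controllable, interpretable dialogue simulation framework.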
Abstract: Cybergrooming is an evolving threat to youth, necessitating proactive educational interventions. We propose StagePilot, an offline RL-based dialogue agent that simulates the stage-wise progression of grooming behaviors for prevention training. StagePilot selects conversational stages using a composite reward that balances user sentiment and goal proximity, with transitions constrained to adjacent stages for realism and interpretability. We evaluate StagePilot through LLM-based simulations, measuring stage completion, dialogue efficiency, and emotional engagement. Results show that StagePilot generates realistic and coherent conversations aligned with grooming dynamics. Among tested methods, the IQL+AWAC agent achieves the best balance between strategic planning and emotional coherence, reaching the final stage up to 43% more frequently than baselines while maintaining over 70% sentiment alignment.
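[96] EBPO: Empirical Bayes Shrinkage for Stabilizing Group-Relative Policy Optimization cs.LG | cs.AI | cs.CL
Kevin Han, Yuhang Zhou, Mingze Gao, Gedi Zhou, Serena Li
TL;DR: This paper proposes EBPO (Empirical Bayes Policy Optimization), a framework that addresses the stability problems of Group Relative Policy Optimization (GRPO) in Reinforcement Learning with Verifiable Rewards (RLVR). EBPO introduces an empirical Bayes shrinkage estimator that dynamically balances local group statistics with a global prior, lowering estimator variance and avoiding vanishing gradients.
Details
Motivation: GRPO suffers from high estimator variance under computational constraints (small group sizes) and from vanishing gradient signals in saturated failure regimes where every response receives zero reward.
Result: Across benchmarks including AIME and OlympiadBench, EBPO consistently outperforms GRPO and other baselines, trains more stably, achieves strong gains even with small group sizes, and benefits significantly from difficulty-stratified curriculum learning.
Insight: The novelty is bringing empirical Bayes shrinkage estimation into policy optimization: regularizing local baselines with global statistics theoretically guarantees lower mean squared error, bounded entropy decay, and non-vanishing penalty signals, making RLVR training more robust.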
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing the reasoning capabilities of Large Language Models (LLMs). However, dominant approaches like Group Relative Policy Optimization (GRPO) face critical stability challenges: they suffer from high estimator variance under computational constraints (small group sizes) and vanishing gradient signals in saturated failure regimes where all responses yield identical zero rewards. To address this, we propose Empirical Bayes Policy Optimization (EBPO), a novel framework that regularizes local group-based baselines by borrowing strength from the policy’s accumulated global statistics. Instead of estimating baselines in isolation, EBPO employs a shrinkage estimator that dynamically balances local group statistics with a global prior updated via Welford’s online algorithm. Theoretically, we demonstrate that EBPO guarantees strictly lower Mean Squared Error (MSE), bounded entropy decay, and non-vanishing penalty signals in failure scenarios compared to GRPO. Empirically, EBPO consistently outperforms GRPO and other established baselines across diverse benchmarks, including AIME and OlympiadBench. Notably, EBPO exhibits superior training stability, achieving high-performance gains even with small group sizes, and benefits significantly from difficulty-stratified curriculum learning.
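A small sketch can show why shrinking the baseline toward a global prior keeps a learning signal alive when every response in a group fails. The running statistics use Welford's online algorithm, which the abstract names. The shrinkage rule, however, is a deliberately simple pseudo-count form with a hypothetical `prior_strength` parameter, not the estimator derived in the paper, and GRPO's standard-deviation normalization is omitted for brevity.

```python
class RunningStats:
    """Welford's online algorithm for a running mean and sample variance."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


def group_mean_advantages(rewards):
    """Group-relative advantages with a purely local mean baseline."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def shrunk_advantages(rewards, prior, prior_strength=4.0):
    """Advantages against a baseline shrunk toward the global running mean.

    prior_strength acts like a pseudo-count of prior observations (an
    assumption made for this sketch).
    """
    n = len(rewards)
    local_mean = sum(rewards) / n
    baseline = (n * local_mean + prior_strength * prior.mean) / (n + prior_strength)
    return [r - baseline for r in rewards]
```

For a group whose rewards are all zero, the local-mean baseline yields all-zero advantages and therefore no gradient, whereas the shrunk baseline sits above zero whenever the global mean is positive, so every failed response receives a negative advantage.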
[97] Back to Basics: Revisiting Exploration in Reinforcement Learning for LLM Reasoning via Generative Probabilities cs.LG | cs.CL
Pengyi Li, Elizaveta Goncharova, Andrey Kuznetsov, Ivan Oseledets
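TL;DR: This paper tackles the policy entropy collapse and mode collapse that arise when reinforcement learning is applied to LLM reasoning, proposing a novel advantage re-weighting mechanism. By incorporating prompt perplexity and answer confidence into advantage estimation, the method dynamically reshapes the reward signal to balance confidence across all correct responses, substantially increasing generative diversity and response entropy while preserving accuracy.
Details
Motivation: In the RLVR paradigm, standard policy optimization methods such as GRPO tend to converge to low-entropy policies, causing severe mode collapse and limited output diversity and suppressing valid alternative reasoning chains.
Result: On mathematical and coding benchmarks with Qwen2.5 and DeepSeek models, the method significantly mitigates entropy collapse. On Qwen2.5-7B, it outperforms GRPO by 5.7% in Pass@1 and by 13.9% in Pass@32, demonstrating its strength in generating diverse correct reasoning paths.
Insight: The core contribution is an analysis of entropy collapse from the perspective of sampling probability dynamics, together with an advantage re-weighting mechanism that balances confidence across different correct reasoning paths. This suggests a transferable idea for a better exploration-exploitation trade-off in RL for LLM reasoning: dynamically adjust the reward signal to steer the model toward under-explored correct solutions.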
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an indispensable paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard policy optimization methods, such as Group Relative Policy Optimization (GRPO), often converge to low-entropy policies, leading to severe mode collapse and limited output diversity. We analyze this issue from the perspective of sampling probability dynamics, identifying that the standard objective disproportionately reinforces the highest-likelihood paths, thereby suppressing valid alternative reasoning chains. To address this, we propose a novel Advantage Re-weighting Mechanism (ARM) designed to equilibrate the confidence levels across all correct responses. By incorporating Prompt Perplexity and Answer Confidence into the advantage estimation, our method dynamically reshapes the reward signal to attenuate the gradient updates of over-confident reasoning paths, while redistributing probability mass toward under-explored correct solutions. Empirical results demonstrate that our approach significantly enhances generative diversity and response entropy while maintaining competitive accuracy, effectively achieving a superior trade-off between exploration and exploitation in reasoning tasks. Empirical results on Qwen2.5 and DeepSeek models across mathematical and coding benchmarks show that ProGRPO significantly mitigates entropy collapse. Specifically, on Qwen2.5-7B, our method outperforms GRPO by 5.7% in Pass@1 and, notably, by 13.9% in Pass@32, highlighting its superior capability in generating diverse correct reasoning paths.
[98] When Shared Knowledge Hurts: Spectral Over-Accumulation in Model Merging cs.LG | cs.AI | cs.CL | cs.CV
Yayuan Li, Ze Peng, Jian Zhang, Jintao Guo, Yue Duan
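TL;DR: This paper proposes Singular Value Calibration (SVC), a training-free and data-free post-processing method that addresses spectral over-accumulation in model merging. When multiple fine-tuned models share aligned spectral directions, a simple linear combination of weights inflates the singular values along those directions and biases the merged model toward the shared subspace. SVC quantifies subspace overlap and rescales the inflated singular values to restore a balanced spectrum, improving merging performance.
Details
Motivation: Existing model merging methods mainly resolve conflicts between task updates and leave unaddressed the failure mode in which shared knowledge is counted repeatedly (over-accumulation). When tasks share aligned singular vectors, linear combination over-accumulates those directions, inflating singular values and biasing the model.
Result: Across vision and language benchmarks, SVC consistently improves strong merging baselines and achieves state-of-the-art performance. By modifying only the singular values, SVC improves the performance of Task Arithmetic by 13.0%.
Insight: The novelty is identifying the spectral bias caused by over-accumulated shared knowledge in model merging and proposing a training-free, data-free post-hoc correction (SVC) that calibrates singular values to rebalance the spectrum, improving the robustness and performance of merged models.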
Abstract: Model merging combines multiple fine-tuned models into a single model by adding their weight updates, providing a lightweight alternative to retraining. Existing methods primarily target resolving conflicts between task updates, leaving the failure mode of over-counting shared knowledge unaddressed. We show that when tasks share aligned spectral directions (i.e., overlapping singular vectors), a simple linear combination repeatedly accumulates these directions, inflating the singular values and biasing the merged model toward shared subspaces. To mitigate this issue, we propose Singular Value Calibration (SVC), a training-free and data-free post-processing method that quantifies subspace overlap and rescales inflated singular values to restore a balanced spectrum. Across vision and language benchmarks, SVC consistently improves strong merging baselines and achieves state-of-the-art performance. Furthermore, by modifying only the singular values, SVC improves the performance of Task Arithmetic by 13.0%. Code is available at: https://github.com/lyymuwu/SVC.
[99] Steering Large Reasoning Models towards Concise Reasoning via Flow Matching cs.LG | cs.AI | cs.CL
Yawei Li, Benjamin Bergner, Yinghan Zhao, Vihang Prakash Patil, Bei Chen
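TL;DR: This paper proposes FlowSteer, a nonlinear steering method that addresses the verbose outputs of Large Reasoning Models (LRMs). It uses Flow Matching to learn a complete transformation from the distribution of verbose reasoning to that of concise reasoning, enabling precise, input-dependent control over the model's reasoning process and improving output efficiency while preserving task performance.
Details
Motivation: LRMs excel at complex reasoning tasks, but their outputs are often overly verbose, which hurts efficiency. Existing steering methods rest on the linear representation hypothesis and apply a single global vector to hidden representations, which is restrictive. This work aims to go beyond such uniform linear shifts by modeling the complete distributional transformation to steer the model toward concise reasoning more effectively.
Result: Across multiple reasoning benchmarks, FlowSteer shows strong task performance and higher token efficiency (more compact outputs) than leading inference-time baselines.
Insight: The novelty is replacing single-global-vector steering grounded in the linear representation hypothesis with a generative technique (Flow Matching) that models the full distributional transport from verbose to concise reasoning, offering a more effective and principled foundation for controlling LRMs. This suggests that nonlinear, distribution-level steering of internal representations may be more effective than simple linear operations.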
Abstract: Large Reasoning Models (LRMs) excel at complex reasoning tasks, but their efficiency is often hampered by overly verbose outputs. Prior steering methods attempt to address this issue by applying a single, global vector to hidden representations – an approach grounded in the restrictive linear representation hypothesis. In this work, we introduce FlowSteer, a nonlinear steering method that goes beyond uniform linear shifts by learning a complete transformation between the distributions associated with verbose and concise reasoning. This transformation is learned via Flow Matching as a velocity field, enabling precise, input-dependent control over the model’s reasoning process. By aligning steered representations with the distribution of concise-reasoning activations, FlowSteer yields more compact reasoning than the linear shifts. Across diverse reasoning benchmarks, FlowSteer demonstrates strong task performance and token efficiency compared to leading inference-time baselines. Our work demonstrates that modeling the full distributional transport with generative techniques offers a more effective and principled foundation for controlling LRMs.
[100] Rewards as Labels: Revisiting RLVR from a Classification Perspective cs.LG | cs.CL
Zepeng Zhai, Meilin Chen, Jiaxuan Zhao, Junlang Qian, Lei Shen
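TL;DR: This paper proposes REAL, a framework that revisits verifiable rewards in reinforcement learning as categorical labels rather than scalar weights, thereby reformulating policy optimization as a classification problem. It introduces anchor logits to enhance policy learning and addresses the Gradient Misassignment in Positives and Gradient Domination in Negatives found in existing RLVR methods such as GRPO, yielding more balanced gradient allocation and more stable training.
Details
Motivation: Existing RLVR methods (GRPO and its variants) have succeeded on complex reasoning tasks, but the authors find that they suffer from gradient misassignment in positives and gradient domination in negatives, leading to inefficient and suboptimal policy updates.
Result: Extensive experiments on mathematical reasoning benchmarks show that REAL improves training stability and consistently outperforms GRPO and strong variants such as DAPO. On the 1.5B model, REAL improves average Pass@1 over DAPO by 6.7%. On the 7B model, REAL outperforms DAPO and GSPO by 6.2% and 1.7%, respectively. Even with a vanilla binary cross-entropy loss, REAL remains stable and exceeds DAPO by 4.5% on average.
Insight: The core innovation is redefining the reward signal as a categorical label, turning policy optimization into classification. This induces monotonic, bounded gradient weighting and balanced gradient allocation across rollouts. Viewed objectively, the change of perspective offers a novel and effective mechanism for mitigating the gradient imbalance common in RL, and the anchor logits are a simple but effective technical refinement. The method is general and stays stable and efficient even with a basic loss function.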
Abstract: Reinforcement Learning with Verifiable Rewards has recently advanced the capabilities of Large Language Models in complex reasoning tasks by providing explicit rule-based supervision. Among RLVR methods, GRPO and its variants have achieved strong empirical performance. Despite their success, we identify that they suffer from Gradient Misassignment in Positives and Gradient Domination in Negatives, which lead to inefficient and suboptimal policy updates. To address these issues, we propose Rewards as Labels (REAL), a novel framework that revisits verifiable rewards as categorical labels rather than scalar weights, thereby reformulating policy optimization as a classification problem. Building on this, we further introduce anchor logits to enhance policy learning. Our analysis reveals that REAL induces a monotonic and bounded gradient weighting, enabling balanced gradient allocation across rollouts and effectively mitigating the identified mismatches. Extensive experiments on mathematical reasoning benchmarks show that REAL improves training stability and consistently outperforms GRPO and strong variants such as DAPO. On the 1.5B model, REAL improves average Pass@1 over DAPO by 6.7%. These gains carry over to the 7B model, where REAL outperforms DAPO and GSPO by 6.2% and 1.7%, respectively. Notably, even with a vanilla binary cross-entropy, REAL remains stable and exceeds DAPO by 4.5% on average.
[101] Constrained Group Relative Policy Optimization cs.LG | cs.CL | cs.RO
Roger Girgis, Rodrigue de Schaetzen, Luke Rowe, Azalée Robitaille, Christopher Pal
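TL;DR: This paper proposes Constrained GRPO, a Lagrangian-based extension for policy optimization with explicit behavioral constraints. Constraints are specified via indicator cost functions, and Lagrangian relaxation is used to optimize violation rates directly. The authors find that a multi-component treatment in advantage estimation distorts the relative importance of the objective terms and thereby breaks constrained learning, and they propose a scalarized advantage construction to restore stable constraint control.
Details
Motivation: To extend the critic-free policy learning framework GRPO to settings with explicit behavioral constraints, which remain underexplored.
Result: Experiments in a toy gridworld confirm the predicted optimization pathology and show that scalarized advantages restore stable constraint control. On robotics tasks, the method improves constraint satisfaction while increasing task success.
Insight: The novelty is revealing how multi-component treatment in advantage estimation damages constrained learning and proposing scalarized advantage construction as the remedy. This provides a simple and effective recipe for constrained policy optimization in embodied AI domains that rely on large multimodal foundation models.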
Abstract: While Group Relative Policy Optimization (GRPO) has emerged as a scalable framework for critic-free policy learning, extending it to settings with explicit behavioral constraints remains underexplored. We introduce Constrained GRPO, a Lagrangian-based extension of GRPO for constrained policy optimization. Constraints are specified via indicator cost functions, enabling direct optimization of violation rates through a Lagrangian relaxation. We show that a naive multi-component treatment in advantage estimation can break constrained learning: mismatched component-wise standard deviations distort the relative importance of the different objective terms, which in turn corrupts the Lagrangian signal and prevents meaningful constraint enforcement. We formally derive this effect to motivate our scalarized advantage construction that preserves the intended trade-off between reward and constraint terms. Experiments in a toy gridworld confirm the predicted optimization pathology and demonstrate that scalarizing advantages restores stable constraint control. In addition, we evaluate Constrained GRPO on robotics tasks, where it improves constraint satisfaction while increasing task success, establishing a simple and effective recipe for constrained policy optimization in embodied AI domains that increasingly rely on large multimodal foundation models.
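The contrast between component-wise and scalarized advantages described in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the paper's formulation: the function names, the dual-ascent step size `lr`, and the violation `budget` are all hypothetical, and the exact normalization used by the authors is not given in the abstract.

```python
import statistics


def normalize(values, eps=1e-8):
    """Group-relative standardization: subtract the mean, divide by the std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / (std + eps) for v in values]


def componentwise_advantages(rewards, costs, lam):
    """Naive variant: standardize reward and cost separately, then combine.

    Dividing each component by its own std rescales the two terms
    independently, so `lam` no longer expresses the intended trade-off.
    """
    adv_r = normalize(rewards)
    adv_c = normalize(costs)
    return [r - lam * c for r, c in zip(adv_r, adv_c)]


def scalarized_advantages(rewards, costs, lam):
    """Scalarized variant: form the Lagrangian signal first, then standardize."""
    return normalize([r - lam * c for r, c in zip(rewards, costs)])


def update_multiplier(lam, costs, budget, lr=0.1):
    """Projected dual ascent on the multiplier (assumed update rule).

    `costs` are 0/1 indicators, so their mean is the violation rate.
    """
    violation_rate = statistics.fmean(costs)
    return max(0.0, lam + lr * (violation_rate - budget))
```

For example, `update_multiplier(0.0, [1, 0, 0, 0], budget=0.1, lr=0.5)` raises the multiplier because the observed violation rate (0.25) exceeds the budget, while a group with no violations leaves a zero multiplier at zero.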
[102] Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations cs.LG | cs.AI | cs.CL
Wei Liu, Jiawei Xu, Yingru Li, Longtao Zheng, Tianjian Li
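TL;DR: This paper presents a reinforcement learning approach to Triton kernel generation. It designs KernelGYM, a distributed GPU environment; addresses a biased policy gradient issue in multi-turn RL with the proposed TRLOO method; and introduces profiling-based rewards and rejection sampling to overcome reward hacking and lazy optimization. The trained Dr.Kernel-14B model surpasses Claude-4.5-Sonnet and GPT-5 on the KernelBench Level-2 subset.
Details
Motivation: Training LLMs to generate high-quality kernel code faces insufficient data, fragile environments, and vulnerability to reward hacking and lazy optimization: models may chase reward at the expense of real speedup.
Result: On the KernelBench Level-2 subset, 31.6% of the kernels generated by Dr.Kernel-14B achieve at least a 1.2x speedup, surpassing Claude-4.5-Sonnet (26.7%) and GPT-5 (28.6%); selecting the best candidate across turns raises this rate to 47.8%.
Insight: Contributions include: 1) building the KernelGYM environment, which supports reward-hacking checks and long-term training; 2) proposing TRLOO to address policy gradient bias in multi-turn RL; 3) introducing profiling-based rewards and rejection sampling to counter lazy optimization; 4) demonstrating the effectiveness of sequential test-time scaling.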
Abstract: High-quality kernels are critical for scalable AI systems, and enabling LLMs to generate such code would advance AI development. However, training LLMs for this task requires sufficient data and a robust environment, and the process is often vulnerable to reward hacking and lazy optimization. In these cases, models may hack training rewards and prioritize trivial correctness over meaningful speedup. In this paper, we systematically study reinforcement learning (RL) for kernel generation. We first design KernelGYM, a robust distributed GPU environment that supports reward hacking checks, data collection from multi-turn interactions, and long-term RL training. Building on KernelGYM, we investigate effective multi-turn RL methods and identify a biased policy gradient issue caused by self-inclusion in GRPO. To solve this, we propose Turn-level Reinforce-Leave-One-Out (TRLOO) to provide unbiased advantage estimation for multi-turn RL. To alleviate lazy optimization, we incorporate mismatch correction for training stability and introduce Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS) to overcome the issue. The trained model, Dr.Kernel-14B, reaches performance competitive with Claude-4.5-Sonnet on KernelBench. Finally, we study sequential test-time scaling for Dr.Kernel-14B. On the KernelBench Level-2 subset, 31.6% of the generated kernels achieve at least a 1.2x speedup over the Torch reference, surpassing Claude-4.5-Sonnet (26.7%) and GPT-5 (28.6%). When selecting the best candidate across all turns, this 1.2x speedup rate further increases to 47.8%. All resources, including environment, training code, models, and dataset, are included in https://www.github.com/hkust-nlp/KernelGYM.
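The self-inclusion issue named in the abstract above can be made concrete with a small sketch of a leave-one-out baseline. This is a generic illustration, not the authors' TRLOO implementation: the function names and the per-turn data layout (`returns_by_turn`) are assumptions, and the standard-deviation scaling used by GRPO is omitted for brevity.

```python
def group_mean_advantages(rewards):
    """Group-mean baseline: each sample's own reward is part of its baseline."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def leave_one_out_advantages(rewards):
    """Leave-one-out baseline: sample i is compared to the mean of the others."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two samples")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def turn_level_loo(returns_by_turn):
    """Apply the leave-one-out baseline separately at each turn.

    ASSUMPTION: `returns_by_turn[t]` holds the returns of all rollouts that
    reached turn t; the paper's exact grouping may differ.
    """
    return [leave_one_out_advantages(turn) for turn in returns_by_turn]
```

For rewards `[1.0, 0.0, 0.0, 1.0]`, the group-mean baseline gives advantages of plus or minus 0.5, while the leave-one-out baseline gives plus or minus 2/3. The two differ by the factor n/(n-1), which reflects the removal of each sample's own reward from its baseline.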
[103] DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training cs.LG | cs.CL
Dingwei Zhu, Zhiheng Xi, Shihan Dou, Jiahan Li, Chenhao Huang
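TL;DR: This paper proposes DFPO (Distributional Value Flow Policy Optimization with Conditional Risk and Consistency Control), a robust distributional reinforcement learning framework for LLM post-training. It models values as continuous flows across time steps rather than as independent scalar quantile predictions, capturing richer state information and improving robustness and generalization under noisy supervision and complex OOD conditions.
Details
Motivation: RL training in real-world settings, especially LLM post-training, is difficult because of noisy supervision and poor out-of-domain (OOD) generalization. Existing distributional RL methods improve robustness through multi-quantile modeling, but each quantile is learned independently as a scalar, giving coarse-grained value representations that perform poorly under complex and OOD conditions.
Result: Experiments on dialogue, math reasoning, and scientific tasks show that DFPO outperforms PPO, FlowRL, and other robust baselines under noisy supervision, with better training stability and generalization.
Insight: The core innovation is extending value modeling from discrete scalar quantile predictions to learning a continuous value flow field, yielding finer-grained, state-conditioned value representations. Integrating conditional risk control and consistency constraints along value flow trajectories further stabilizes training under noisy feedback. This offers a new direction for improving RL robustness in complex, noisy environments.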
Abstract: Training reinforcement learning (RL) systems in real-world environments remains challenging due to noisy supervision and poor out-of-domain (OOD) generalization, especially in LLM post-training. Recent distributional RL methods improve robustness by modeling values with multiple quantile points, but they still learn each quantile independently as a scalar. This results in coarse-grained value representations that lack fine-grained conditioning on state information and struggle under complex and OOD conditions. We propose DFPO (Distributional Value Flow Policy Optimization with Conditional Risk and Consistency Control), a robust distributional RL framework that models values as continuous flows across time steps. By scaling value modeling through learning of a value flow field instead of isolated quantile predictions, DFPO captures richer state information for more accurate advantage estimation. To stabilize training under noisy feedback, DFPO further integrates conditional risk control and consistency constraints along value flow trajectories. Experiments on dialogue, math reasoning, and scientific tasks show that DFPO outperforms PPO, FlowRL, and other robust baselines under noisy supervision, achieving improved training stability and generalization.
[104] Erase at the Core: Representation Unlearning for Machine Unlearning cs.LG | cs.CV
Jaewon Lee, Yongwoo Kim, Donghyun Kim
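TL;DR: This paper proposes Erase at the Core (EC), a framework that addresses "superficial forgetting" in machine unlearning, where a model appears to forget at the output (logit) level while its internal feature representations still retain substantial original information. EC enforces forgetting throughout the network by integrating multi-layer contrastive unlearning with deeply supervised learning across the network hierarchy.
Details
Motivation: Most existing approximate machine unlearning methods mainly alter the final classifier, leaving intermediate feature representations largely unchanged and highly similar to those of the original model, a phenomenon termed "superficial forgetting". The paper aims to overcome this limitation and achieve representation unlearning that reaches deep into the network.
Result: Experiments show that EC achieves effective logit-level forgetting and substantially reduces representational similarity to the original model at intermediate layers. EC is also model-agnostic and can be plugged into existing unlearning methods, improving representation-level forgetting while maintaining retain-set performance.
Insight: The core innovation is a hierarchical, model-agnostic unlearning framework that combines multi-layer contrastive unlearning (on the forget set) with deeply supervised learning (on the retain set) under layer-wise weighted losses, systematically forcing internal representations to change. This goes beyond existing methods that only modify the classifier and achieves more thorough "core" forgetting.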
Abstract: Many approximate machine unlearning methods demonstrate strong logit-level forgetting – such as near-zero accuracy on the forget set – yet continue to preserve substantial information within their internal feature representations. We refer to this discrepancy as superficial forgetting. Recent studies indicate that most existing unlearning approaches primarily alter the final classifier, leaving intermediate representations largely unchanged and highly similar to those of the original model. To address this limitation, we introduce the Erase at the Core (EC), a framework designed to enforce forgetting throughout the entire network hierarchy. EC integrates multi-layer contrastive unlearning on the forget set with retain set preservation through deeply supervised learning. Concretely, EC attaches auxiliary modules to intermediate layers and applies both contrastive unlearning and cross-entropy losses at each supervision point, with layer-wise weighted losses. Experimental results show that EC not only achieves effective logit-level forgetting, but also substantially reduces representational similarity to the original model across intermediate layers. Furthermore, EC is model-agnostic and can be incorporated as a plug-in module into existing unlearning methods, improving representation-level forgetting while maintaining performance on the retain set.
cs.NE
[105] DARWIN: Dynamic Agentically Rewriting Self-Improving Network cs.NE | cs.AI | cs.CL
Henry Jiang
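TL;DR: DARWIN is an evolutionary GPT model with a genetic-algorithm-like optimization structure: several independent GPT agents modify one another's training code to improve performance, and JSON memory files track the improvement process. In experiments, after 5 training iterations the model improved FLOPS utilization by 1.26% and perplexity by 2.07% over baseline configurations.
Details
Motivation: Manually optimizing GPT training code is inefficient and costly; the work aims to improve the training process dynamically through an automated, evolutionary approach so that the model optimizes itself.
Result: In experiments built on the nanoGPT framework, after 5 iterations model FLOPS utilization (MFU) improved by 1.26% and perplexity improved by 2.07%, showing the potential of evolutionary training.
Insight: The novelty is combining a genetic algorithm with multi-agent GPT to iterate on training code automatically; persistent memory and a human-in-the-loop intervention interface add traceability and flexibility, providing a base framework for large-scale evolutionary training.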
Abstract: DARWIN is an evolutionary GPT model, utilizing a genetic-algorithm-like optimization structure with several independent GPT agents being trained individually using unique training code. Each iteration, the GPT models are prompted to modify the training code of one another in an attempt to improve their performance in a mutation-like manner, and the best GPT agents are then benchmarked and selected for the next iteration by genetic algorithm. For demonstration purposes and due to budget and time constraints, the OpenAI API is used to prompt training code improvements and the nanoGPT framework is used as the training code. DARWIN also utilizes persistent JSON-based memory files to track previous reasoning and code changes and correlate them with improvements in model performance, as well as a bidirectional interface for human-in-the-loop (HITL) intervention that allows the model to request upgrades such as additional datasets, training scripts, and restructuring of file hierarchies. In experiments, DARWIN achieved a 1.26 percent improvement in model FLOPS utilization (MFU) and a 2.07 percent improvement to perplexity in 5 iterations of training over baseline configurations, demonstrating promising capabilities as a foundation for scaling evolutionary GPT training.
eess.IV
[106] Context-Aware Asymmetric Ensembling for Interpretable Retinopathy of Prematurity Screening via Active Query and Vascular Attention eess.IV | cs.CV
Md. Mehedi Hassan, Taufiq Hasan
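TL;DR: This paper proposes the Context-Aware Asymmetric Ensemble Model (CAA Ensemble) for automated Retinopathy of Prematurity (ROP) screening. The model simulates clinical reasoning: a Multi-Scale Active Query Network (MS-AQNet) localizes the fibrovascular ridge, VascuMIL encodes Vascular Topology Maps (VMAP) to identify vascular tortuosity, and a meta-learner ensembles these signals to resolve diagnostic discordance.
Details
Motivation: Automated ROP screening is hampered by limited data, a complex condition (involving both structural staging and microvascular abnormalities), and existing deep learning models that depend on large private datasets and passive multimodal fusion and therefore generalize poorly to small, imbalanced public cohorts.
Result: Tested on a highly imbalanced cohort of 188 infants (6,004 images), the framework achieves state-of-the-art performance on two distinct clinical tasks: a Macro F1-Score of 0.93 for Broad ROP staging and an AUC of 0.996 for Plus Disease detection.
Insight: The novelty is simulating clinical reasoning through active query and vascular attention mechanisms to provide "Glass Box" interpretability (e.g., counterfactual attention heatmaps and vascular threat maps), and showing that architectural inductive bias can serve as an effective bridge across the data gap in medical AI.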
Abstract: Retinopathy of Prematurity (ROP) is among the major causes of preventable childhood blindness. Automated screening remains challenging, primarily due to limited data availability and the complex condition involving both structural staging and microvascular abnormalities. Current deep learning models depend heavily on large private datasets and passive multimodal fusion, which commonly fail to generalize on small, imbalanced public cohorts. We thus propose the Context-Aware Asymmetric Ensemble Model (CAA Ensemble) that simulates clinical reasoning through two specialized streams. First, the Multi-Scale Active Query Network (MS-AQNet) serves as a structure specialist, utilizing clinical contexts as dynamic query vectors to spatially control visual feature extraction for localization of the fibrovascular ridge. Second, VascuMIL encodes Vascular Topology Maps (VMAP) within a gated Multiple Instance Learning (MIL) network to precisely identify vascular tortuosity. A synergistic meta-learner ensembles these orthogonal signals to resolve diagnostic discordance across multiple objectives. Tested on a highly imbalanced cohort of 188 infants (6,004 images), the framework attained State-of-the-Art performance on two distinct clinical tasks: achieving a Macro F1-Score of 0.93 for Broad ROP staging and an AUC of 0.996 for Plus Disease detection. Crucially, the system features “Glass Box” transparency through counterfactual attention heatmaps and vascular threat maps, proving that clinical metadata dictates the model’s visual search. Additionally, this study demonstrates that architectural inductive bias can serve as an effective bridge for the medical AI data gap.
[107] Towards Segmenting the Invisible: An End-to-End Registration and Segmentation Framework for Weakly Supervised Tumour Analysis eess.IV | cs.AI | cs.CV | cs.LG | physics.med-ph
Budhaditya Mukhopadhyay, Chirag Mandal, Pavan Tummala, Naghmeh Mahmoodian, Andreas Nürnberger
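TL;DR: This paper proposes an end-to-end registration and segmentation framework for weakly supervised tumour analysis, targeting the clinical challenge in liver tumour ablation that tumours are visible on pre-operative MRI but not on intra-operative CT. The framework combines MSCGUNet for cross-modal image registration with a UNet-based segmentation module and uses registration to help generate pseudo-labels for CT images. Evaluation on the CHAOS dataset shows that the pipeline can register and segment healthy liver anatomy, but performance drops sharply on clinical data containing tumours, exposing a fundamental limitation of current registration methods when the target pathology lacks corresponding visual features in the target modality.
Details
Motivation: In liver tumour ablation, tumours are clearly visible on pre-operative MRI but nearly invisible on intra-operative CT because contrast between pathological and healthy tissue is minimal. The work explores the feasibility of cross-modality weak supervision when pathology is visible in one modality (MRI) but absent in another (CT).
Result: On the CHAOS dataset, the framework achieves a Dice score of 0.72 for registration and segmentation of healthy liver anatomy; on clinical data containing tumours, performance falls to a Dice score of 0.16, indicating a fundamental limitation when the pathology lacks discriminative features in the target modality.
Insight: The claimed novelty is a hybrid registration-segmentation framework for label propagation and segmentation under cross-modality weak supervision. Viewed objectively, the main research contribution is experimental: it systematically exposes the fundamental limitation of registration-based label propagation when pathological features are entirely absent in the target modality (the "feature absence" problem), offering an important insight for future cross-modality medical image analysis, namely that spatial registration alone cannot compensate for missing discriminative features in the target modality.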
Abstract: Liver tumour ablation presents a significant clinical challenge: whilst tumours are clearly visible on pre-operative MRI, they are often effectively invisible on intra-operative CT due to minimal contrast between pathological and healthy tissue. This work investigates the feasibility of cross-modality weak supervision for scenarios where pathology is visible in one modality (MRI) but absent in another (CT). We present a hybrid registration-segmentation framework that combines MSCGUNet for inter-modal image registration with a UNet-based segmentation module, enabling registration-assisted pseudo-label generation for CT images. Our evaluation on the CHAOS dataset demonstrates that the pipeline can successfully register and segment healthy liver anatomy, achieving a Dice score of 0.72. However, when applied to clinical data containing tumours, performance degrades substantially (Dice score of 0.16), revealing the fundamental limitations of current registration methods when the target pathology lacks corresponding visual features in the target modality. We analyse the “domain gap” and “feature absence” problems, demonstrating that whilst spatial propagation of labels via registration is feasible for visible structures, segmenting truly invisible pathology remains an open challenge. Our findings highlight that registration-based label transfer cannot compensate for the absence of discriminative features in the target modality, providing important insights for future research in cross-modality medical image analysis. Code and weights are available at: https://github.com/BudhaTronix/Weakly-Supervised-Tumour-Detection
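For readers unfamiliar with the metric quoted above, the Dice score between a predicted and a reference binary mask is 2|A ∩ B| / (|A| + |B|). The sketch below is a plain-Python illustration on flattened 0/1 masks; it is not taken from the paper's code, and the convention of returning 1.0 when both masks are empty is an assumption.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if size_sum == 0 else 2.0 * intersection / size_sum
```

A score of 0.72 therefore indicates substantial overlap between prediction and reference, while 0.16 means that the predicted tumour region overlaps the reference only marginally.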
cs.CY
[108] Ethology of Latent Spaces cs.CY | cs.CL | cs.CV | cs.LG
Philippe Boisnard
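TL;DR: Adopting an ethological perspective, this study challenges the assumed neutrality of latent spaces in vision language models (VLMs), showing that latent spaces exhibit model-specific algorithmic sensitivities shaped by training data and architectural choices into distinct regimes of perceptual salience. A comparative analysis of three models (OpenAI CLIP, OpenCLIP LAION, SigLIP) on 301 artworks (15th to 20th century) reveals substantial divergences in how they attribute political and cultural categories.
Details
Motivation: To question the assumption that VLM latent spaces are neutral spaces of homogeneous indeterminacy, and to examine how training data and architecture shape model-specific algorithmic sensitivities and perceptual regimes.
Result: On bipolar semantic axes built from vector analogies, SigLIP classifies 59.4% of the artworks as politically engaged, versus only 4% for OpenCLIP; African masks receive the highest political scores in SigLIP while OpenAI CLIP treats them as apolitical; on an aesthetic colonial axis, inter-model discrepancies reach 72.6 percentage points.
Insight: The paper introduces three operational concepts: computational latent politicization (political categories emerging without intentional encoding), emergent bias (detectable only through contrastive analysis and irreducible to statistical or normative bias), and three algorithmic scopic regimes (entropic, institutional, semiotic). Drawing on Foucault's archive, Jameson's ideologeme, and Simondon's theory of individuation, it argues that training datasets function as quasi-archives whose discursive formations crystallize in latent space, and it calls for methodologies that integrate learning architectures whenever cultural interpretation is delegated to algorithmic agents.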
Abstract: This study challenges the presumed neutrality of latent spaces in vision language models (VLMs) by adopting an ethological perspective on their algorithmic behaviors. Rather than constituting spaces of homogeneous indeterminacy, latent spaces exhibit model-specific algorithmic sensitivities, understood as differential regimes of perceptual salience shaped by training data and architectural choices. Through a comparative analysis of three models (OpenAI CLIP, OpenCLIP LAION, SigLIP) applied to a corpus of 301 artworks (15th to 20th century), we reveal substantial divergences in the attribution of political and cultural categories. Using bipolar semantic axes derived from vector analogies (Mikolov et al., 2013), we show that SigLIP classifies 59.4% of the artworks as politically engaged, compared to only 4% for OpenCLIP. African masks receive the highest political scores in SigLIP while remaining apolitical in OpenAI CLIP. On an aesthetic colonial axis, inter-model discrepancies reach 72.6 percentage points. We introduce three operational concepts: computational latent politicization, describing the emergence of political categories without intentional encoding; emergent bias, irreducible to statistical or normative bias and detectable only through contrastive analysis; and three algorithmic scopic regimes: entropic (LAION), institutional (OpenAI), and semiotic (SigLIP), which structure distinct modes of visibility. Drawing on Foucault’s notion of the archive, Jameson’s ideologeme, and Simondon’s theory of individuation, we argue that training datasets function as quasi-archives whose discursive formations crystallize within latent space. This work contributes to a critical reassessment of the conditions under which VLMs are applied to digital art history and calls for methodologies that integrate learning architectures into any delegation of cultural interpretation to algorithmic agents.
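One common way to build a bipolar semantic axis from a vector analogy is to take the difference between the embeddings of two opposing poles and score each item by its cosine with that direction. The sketch below assumes this construction; the abstract does not specify the exact formula or any decision threshold, and the function name `axis_score` and the toy two-dimensional vectors are purely illustrative (real embeddings would come from each model's text and image encoders).

```python
import math


def axis_score(embedding, pole_pos, pole_neg):
    """Cosine between an embedding and the direction pole_pos - pole_neg.

    Positive values lean toward `pole_pos`, negative toward `pole_neg`.
    """
    axis = [a - b for a, b in zip(pole_pos, pole_neg)]
    dot = sum(e * a for e, a in zip(embedding, axis))
    norm_e = math.sqrt(sum(e * e for e in embedding))
    norm_a = math.sqrt(sum(a * a for a in axis))
    if norm_e == 0.0 or norm_a == 0.0:
        raise ValueError("zero-length vector")
    return dot / (norm_e * norm_a)
```

Because the poles are embedded by each model separately, the same artwork can land on opposite sides of the axis in different models, which is the kind of inter-model divergence the study reports.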
cs.AI
[109] DeepRead: Document Structure-Aware Reasoning to Enhance Agentic Search cs.AI | cs.CL | cs.IR
Zhanli Li, Huiwen Tian, Lvzhou Luo, Yixuan Cao, Ping Luo
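TL;DR: DeepRead is a structure-aware, multi-turn document reasoning agent that enhances agentic search for long-document question answering by exploiting documents' inherent hierarchical organization and sequential structure. It uses an LLM-based OCR model to convert PDFs into structured Markdown that preserves headings and paragraph boundaries, indexes documents at the paragraph level, and assigns each paragraph a metadata key encoding its section identity and order. DeepRead equips the LLM with a retrieval tool and a reading tool, enabling a human-like "locate then read" reasoning paradigm.
Details
Motivation: Existing agentic search frameworks usually treat long documents as flat collections of chunks and underuse document-native priors such as hierarchical organization and sequential discourse structure, which limits long-document question answering.
Result: Experiments show that DeepRead significantly improves over Search-o1-style agentic search on document question answering and validate the synergy between the retrieval and reading tools.
Insight: The novelty is explicitly operationalizing the hierarchical and sequential structure priors of documents: a structured representation plus complementary tools (Retrieve and ReadSection) yield a "locate then read" reasoning paradigm closer to human reading habits and improve long-document understanding.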
Abstract: With the rapid progress of tool-using and agentic large language models (LLMs), Retrieval-Augmented Generation (RAG) is evolving from one-shot, passive retrieval into multi-turn, decision-driven evidence acquisition. Despite strong results in open-domain settings, existing agentic search frameworks commonly treat long documents as flat collections of chunks, underutilizing document-native priors such as hierarchical organization and sequential discourse structure. We introduce DeepRead, a structure-aware, multi-turn document reasoning agent that explicitly operationalizes these priors for long-document question answering. DeepRead leverages an LLM-based OCR model to convert PDFs into structured Markdown that preserves headings and paragraph boundaries. It then indexes documents at the paragraph level and assigns each paragraph a coordinate-style metadata key encoding its section identity and in-section order. Building on this representation, DeepRead equips the LLM with two complementary tools: a Retrieve tool that localizes relevant paragraphs while exposing their structural coordinates (with lightweight scanning context), and a ReadSection tool that enables contiguous, order-preserving reading within a specified section and paragraph range. Our experiments demonstrate that DeepRead achieves significant improvements over Search-o1-style agentic search in document question answering. The synergistic effect between retrieval and reading tools is also validated. Our fine-grained behavioral analysis reveals a reading and reasoning paradigm resembling human-like “locate then read” behavior.
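The paragraph-level index with coordinate-style keys, and the two tools built on it, can be sketched roughly as below. Everything here is a hypothetical stand-in rather than DeepRead's implementation: the `Paragraph` type, the blank-line-separated Markdown assumption, and the word-overlap scoring in `retrieve` (a placeholder for a real retriever) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paragraph:
    section: str  # heading text the paragraph belongs to
    order: int    # 0-based position inside that section
    text: str


def index_markdown(markdown):
    """Split Markdown into paragraphs keyed by (section, order).

    ASSUMPTION: headings and paragraphs are separated by blank lines.
    """
    paragraphs = []
    section, order = "front matter", 0
    for block in markdown.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            section, order = block.lstrip("#").strip(), 0
            continue
        paragraphs.append(Paragraph(section, order, block))
        order += 1
    return paragraphs


def retrieve(paragraphs, query, k=2):
    """Locate step: rank paragraphs by naive word overlap with the query."""
    terms = set(query.lower().split())
    ranked = sorted(
        paragraphs,
        key=lambda p: -len(terms & set(p.text.lower().split())),
    )
    return ranked[:k]  # each hit exposes its (section, order) coordinates


def read_section(paragraphs, section, start, end):
    """Read step: contiguous, order-preserving text within one section."""
    return [
        p.text for p in paragraphs
        if p.section == section and start <= p.order <= end
    ]
```

In this sketch an agent would first call `retrieve` to find a relevant paragraph and its coordinates, then call `read_section` on the surrounding range to read the neighbouring paragraphs in order, mirroring the locate-then-read pattern described in the abstract.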
[110] A Unified Multimodal Framework for Dataset Construction and Model-Based Diagnosis of Ameloblastoma cs.AI | cs.CL
Ajo Babu George, Anna Mariam John, Athul Anoop, Balu Bhasuran
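TL;DR: This paper presents a unified multimodal framework for ameloblastoma, comprising a newly curated multimodal dataset and a multimodal deep learning model. The dataset integrates radiological, histopathological, and intraoral clinical images with structured data extracted from case reports, with preprocessing and augmentation applied. The model classifies ameloblastoma variants, assesses behavioral patterns such as recurrence risk, and supports surgical planning, and it can incorporate clinical inputs for personalized inference.
Details
Motivation: AI-assisted diagnosis in maxillofacial pathology lacks high-quality, structured multimodal datasets; existing resources cover ameloblastoma only sparsely and in inconsistent formats that cannot be used directly for model training.
Result: Quantitative evaluation shows substantial gains: variant classification accuracy rises from 46.2% to 65.9%, and the abnormal tissue detection F1-score rises from 43.0% to 90.3%. Compared with resources such as MultiCaRe, the work advances patient-specific decision support by providing a robust dataset and an adaptable multimodal AI framework.
Insight: The novelty is a unified multimodal dataset dedicated to ameloblastoma and a multimodal deep learning model that accepts clinical inputs (such as presenting complaint, age, and gender) to enhance personalized inference. Viewed objectively, using NLP to extract clinical features from textual reports and integrating them with multimodal image data offers a reusable template for dataset construction and model design in disease-specific AI diagnosis.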
Abstract: Artificial intelligence (AI)-enabled diagnostics in maxillofacial pathology require structured, high-quality multimodal datasets. However, existing resources provide limited ameloblastoma coverage and lack the format consistency needed for direct model training. We present a newly curated multimodal dataset specifically focused on ameloblastoma, integrating annotated radiological, histopathological, and intraoral clinical images with structured data derived from case reports. Natural language processing techniques were employed to extract clinically relevant features from textual reports, while image data underwent domain specific preprocessing and augmentation. Using this dataset, a multimodal deep learning model was developed to classify ameloblastoma variants, assess behavioral patterns such as recurrence risk, and support surgical planning. The model is designed to accept clinical inputs such as presenting complaint, age, and gender during deployment to enhance personalized inference. Quantitative evaluation demonstrated substantial improvements; variant classification accuracy increased from 46.2 percent to 65.9 percent, and abnormal tissue detection F1-score improved from 43.0 percent to 90.3 percent. Benchmarked against resources like MultiCaRe, this work advances patient-specific decision support by providing both a robust dataset and an adaptable multimodal AI framework.
[111] FiMI: A Domain-Specific Language Model for Indian Finance Ecosystem cs.AI | cs.CE | cs.CL | cs.LG
Aboli Kathar, Aman Kumar, Anusha Kamath, Araveeti Srujan, Ashish Sharma
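TL;DR: This paper introduces FiMI, a domain-specific language model for the Indian finance ecosystem, in two variants: FiMI Base and FiMI Instruct. Built on the Mistral Small 24B architecture, it is developed through a multi-stage training pipeline: continuous pre-training on financial, multilingual, and synthetic data, followed by instruction fine-tuning for tool-driven conversations. Evaluations show that FiMI significantly outperforms the base models on finance reasoning and tool calling while maintaining comparable performance on general benchmarks.
Details
Motivation: To develop a domain-specialized financial language model for Indian digital payment systems that can handle real-world multi-turn, tool-driven conversational workflows such as transaction disputes and mandate lifecycle management.
Result: FiMI Base improves on the Mistral Small 24B Base model by 20% on a finance reasoning benchmark; FiMI Instruct outperforms the Mistral Small 24B Instruct model by 87% on domain-specific tool calling. On general benchmarks, performance is comparable to models of similar size.
Insight: Contributions include training on multilingual (English, Hindi, Hinglish) and synthetic data tailored to Indian finance, and domain-specific supervised fine-tuning focused on tool-driven conversations that model real workflows. Viewed objectively, the multi-stage training strategy effectively balances domain specialization with general capability.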
Abstract: We present FiMI (Finance Model for India), a domain-specialized financial language model developed for Indian digital payment systems. We develop two model variants: FiMI Base and FiMI Instruct. FiMI adapts the Mistral Small 24B architecture through a multi-stage training pipeline, beginning with continuous pre-training on 68 billion tokens of curated financial, multilingual (English, Hindi, Hinglish), and synthetic data. This is followed by instruction fine-tuning and domain-specific supervised fine-tuning focused on multi-turn, tool-driven conversations that model real-world workflows, such as transaction disputes and mandate lifecycle management. Evaluations reveal that FiMI Base achieves a 20% improvement over the Mistral Small 24B Base model on a finance reasoning benchmark, while FiMI Instruct outperforms the Mistral Small 24B Instruct model by 87% on domain-specific tool-calling. Moreover, FiMI achieves these significant domain gains while maintaining comparable performance to models of similar size on general benchmarks.