cs.CL

[1] DeReason: A Difficulty-Aware Curriculum Improves Decoupled SFT-then-RL Training for General Reasoning cs.CL

Hanxu Hu, Yuxuan Wang, Maggie Huan, Jannis Vamvas, Yinya Huang

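TL;DR: This paper proposes DeReason, a difficulty-aware curriculum strategy that improves decoupled SFT-then-RL training of large language models on general scientific reasoning tasks. It uses LLM-based scoring to partition the training data by reasoning intensity into non-reasoning-intensive and reasoning-intensive subsets, which are assigned to the supervised fine-tuning and reinforcement learning stages respectively, to improve training efficiency and performance.

Details

Motivation: In general scientific domains, applying RL directly to base models is sample-inefficient and is often outperformed by SFT on moderate-quality responses. Sequential SFT followed by RL improves performance further, which indicates that the two stages are complementary, but how to allocate training data between them is the key challenge.

Result: Extensive experiments on general STEM and mathematics benchmarks show that DeReason's decoupled curriculum training significantly outperforms SFT-only, RL-only, and random-split baselines.

Insight: The novelty is a data decoupling strategy based on reasoning intensity: broad-coverage, non-reasoning-intensive problems go to SFT to establish foundational domain knowledge, while difficult, reasoning-intensive problems are reserved for RL to cultivate complex reasoning. This gives a systematic study of how SFT and RL interact in general reasoning tasks and an effective, general post-training recipe.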

Abstract: Reinforcement learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for eliciting reasoning capabilities in large language models, particularly in mathematics and coding. While recent efforts have extended this paradigm to broader general scientific (STEM) domains, the complex interplay between supervised fine-tuning (SFT) and RL in these contexts remains underexplored. In this paper, we conduct controlled experiments revealing a critical challenge: for general STEM domains, RL applied directly to base models is highly sample-inefficient and is consistently surpassed by supervised fine-tuning (SFT) on moderate-quality responses. Yet sequential SFT followed by RL can further improve performance, suggesting that the two stages play complementary roles, and that how training data is allocated between them matters. Therefore, we propose DeReason, a difficulty-based data decoupling strategy for general reasoning. DeReason partitions training data by reasoning intensity estimated via LLM-based scoring into reasoning-intensive and non-reasoning-intensive subsets. It allocates broad-coverage, non-reasoning-intensive problems to SFT to establish foundational domain knowledge, and reserves a focused subset of difficult problems for RL to cultivate complex reasoning. We demonstrate that this principled decoupling yields better performance than randomly splitting the data for sequential SFT and RL. Extensive experiments on general STEM and mathematical benchmarks demonstrate that our decoupled curriculum training significantly outperforms SFT-only, RL-only, and random-split baselines. Our work provides a systematic study of the interplay between SFT and RL for general reasoning, offering a highly effective and generalized post-training recipe.
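
The partitioning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 0 to 10 score scale, the threshold, and the `toy_score` function (a word-count stand-in for the LLM-based scorer) are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Problem = Dict[str, str]


def split_by_reasoning_intensity(
    problems: List[Problem],
    score_fn: Callable[[Problem], float],
    threshold: float,
) -> Tuple[List[Problem], List[Problem]]:
    """Return (sft_pool, rl_pool) split on a reasoning-intensity score."""
    sft_pool: List[Problem] = []
    rl_pool: List[Problem] = []
    for problem in problems:
        # Assumed convention: a higher score means more reasoning-intensive.
        if score_fn(problem) >= threshold:
            rl_pool.append(problem)   # hard problems are reserved for RL
        else:
            sft_pool.append(problem)  # broad-coverage problems go to SFT
    return sft_pool, rl_pool


def toy_score(problem: Problem) -> float:
    # Hypothetical stand-in for an LLM judge: half the word count, capped at 10.
    return min(10.0, len(problem["question"].split()) / 2)


problems = [
    {"question": "What is the SI unit of force?"},
    {"question": "A block slides down a rough incline of angle theta with "
                 "friction coefficient mu; derive its acceleration and the "
                 "time to travel distance d."},
]
sft_pool, rl_pool = split_by_reasoning_intensity(problems, toy_score, threshold=5.0)
```

In practice `score_fn` would wrap a call to a scoring model, and the threshold would be tuned to control how much data each stage receives.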


[2] MDER-DR: Multi-Hop Question Answering with Entity-Centric Summaries cs.CL | cs.AI | cs.IR

Riccardo Campi, Nicolò Oreste Pinciroli Vago, Mathyas Giudici, Marco Brambilla, Piero Fraternali

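TL;DR: This paper proposes MDER-DR, a domain-agnostic, knowledge-graph-based question-answering framework. It targets the degradation of multi-hop QA in conventional retrieval-augmented generation over knowledge graphs, where indexing loses contextual detail. The framework comprises the MDER indexing method, which generates context-derived triple descriptions and integrates them with entity-level summaries, and the DR retrieval mechanism, which decomposes user queries into resolvable triples and grounds them in the knowledge graph through iterative reasoning.

Details

Motivation: Indexing approaches for conventional RAG over knowledge graphs lose important contextual nuance when text is reduced to triples. This particularly hurts multi-hop QA, where answers must be composed from multiple entities, facts, or relations.

Result: Experiments show that on standard and domain-specific benchmarks, MDER-DR achieves substantial improvements over standard RAG baselines (up to 66%) while maintaining cross-lingual robustness.

Insight: The novelty lies in two components. The MDER indexing method generates context-derived triple descriptions and integrates them with entity summaries, which avoids explicit traversal of graph edges during the QA retrieval phase. The DR retrieval mechanism decomposes queries through iterative reasoning and resolves them against the knowledge graph. Viewed objectively, combining entity-centric summaries with triple-based reasoning yields a robust LLM-driven QA pipeline for sparse, incomplete, and complex relational data.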

Abstract: Retrieval-Augmented Generation (RAG) over Knowledge Graphs (KGs) suffers from the fact that indexing approaches may lose important contextual nuance when text is reduced to triples, thereby degrading performance in downstream Question-Answering (QA) tasks, particularly for multi-hop QA, which requires composing answers from multiple entities, facts, or relations. We propose a domain-agnostic, KG-based QA framework that covers both the indexing and retrieval/inference phases. A new indexing approach called Map-Disambiguate-Enrich-Reduce (MDER) generates context-derived triple descriptions and subsequently integrates them with entity-level summaries, thus avoiding the need for explicit traversal of edges in the graph during the QA retrieval phase. Complementing this, we introduce Decompose-Resolve (DR), a retrieval mechanism that decomposes user queries into resolvable triples and grounds them in the KG via iterative reasoning. Together, MDER and DR form an LLM-driven QA pipeline that is robust to sparse, incomplete, and complex relational data. Experiments show that on standard and domain specific benchmarks, MDER-DR achieves substantial improvements over standard RAG baselines (up to 66%), while maintaining cross-lingual robustness. Our code is available at https://github.com/DataSciencePolimi/MDER-DR_RAG.


[3] Stop Listening to Me! How Multi-turn Conversations Can Degrade Diagnostic Reasoning cs.CL | cs.AI | cs.LG

Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao

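TL;DR: This paper studies the diagnostic reasoning of large language models in multi-turn conversations. Compared with single-shot queries, multi-turn interaction significantly degrades performance: models frequently abandon initially correct diagnoses or safe abstentions to go along with incorrect user suggestions.

Details

Motivation: Patients and clinicians increasingly use LLM-based chatbots for healthcare inquiries. Although existing LLMs perform well on static diagnostic reasoning benchmarks, their effectiveness in multi-turn conversations, which better reflect real-world use, has not been adequately studied.

Result: An evaluation of 17 LLMs on three clinical datasets reveals a “conversation tax”: multi-turn interaction consistently degrades performance relative to single-shot baselines, models often abandon correct diagnoses or safe abstentions to align with incorrect suggestions, and some models even exhibit blind switching.

Insight: The novelty is the “stick-or-switch” evaluation framework, which quantifies a model's conviction (defending a correct diagnosis or safe abstention) and flexibility (recognizing a correct suggestion) across multi-turn conversations. Viewed objectively, the study highlights the fragility of LLMs in dynamic interactive settings and offers an important caution for real-world deployment.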

Abstract: Patients and clinicians are increasingly using chatbots powered by large language models (LLMs) for healthcare inquiries. While state-of-the-art LLMs exhibit high performance on static diagnostic reasoning benchmarks, their efficacy across multi-turn conversations, which better reflect real-world usage, has been understudied. In this paper, we evaluate 17 LLMs across three clinical datasets to investigate how partitioning the decision-space into multiple simpler turns of conversation influences their diagnostic reasoning. Specifically, we develop a “stick-or-switch” evaluation framework to measure model conviction (i.e., defending a correct diagnosis or safe abstention against incorrect suggestions) and flexibility (i.e., recognizing a correct suggestion when it is introduced) across conversations. Our experiments reveal the conversation tax, where multi-turn interactions consistently degrade performance when compared to single-shot baselines. Notably, models frequently abandon initial correct diagnoses and safe abstentions to align with incorrect user suggestions. Additionally, several models exhibit blind switching, failing to distinguish between signal and incorrect suggestions.


[4] MaterialFigBENCH: benchmark dataset with figures for evaluating college-level materials science problem-solving abilities of multimodal large language models cs.CL | cond-mat.mtrl-sci

Michiko Yoshitake, Yuta Suzuki, Ryo Igarashi, Yoshitaka Ushiku, Keisuke Nagato

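TL;DR: This paper introduces MaterialFigBench, a benchmark dataset for evaluating how well multimodal large language models interpret figures when solving university-level materials science problems. The dataset contains 137 free-response problems covering crystal structures, mechanical properties, diffusion, phase diagrams, phase transformations, and electronic properties of materials, and it provides expert-defined answer ranges to handle the ambiguity of reading numerical values from images. The study evaluates several state-of-the-art multimodal LLMs, including ChatGPT and GPT-series models. Although overall accuracy improves with model updates, current models still struggle with genuine visual understanding and quantitative interpretation of materials science figures, and many correct answers rely on memorized domain knowledge rather than on reading the image.

Details

Motivation: Existing benchmarks rely mainly on textual representations and do not systematically evaluate how well multimodal LLMs interpret the figures that are central to materials science, such as phase diagrams, stress-strain curves, Arrhenius plots, diffraction patterns, and microstructural schematics. A dedicated benchmark dataset is needed to fill this gap.

Result: Several state-of-the-art multimodal LLMs (such as ChatGPT and GPT models accessed via OpenAI APIs) were evaluated on MaterialFigBench. Overall accuracy rises with model updates, but persistent weaknesses remain in visual reasoning, numerical precision, and significant-digit handling, and in many cases correct answers come from memorization rather than from reading the image.

Insight: The novelty is the creation of the first domain-specific multimodal benchmark focused on interpreting materials science figures, together with expert-defined answer ranges that handle the ambiguity of reading numerical values from images. Viewed objectively, the study exposes the limits of current multimodal LLMs in genuine visual understanding and quantitative reasoning, and it provides a systematic evaluation basis and direction for developing LLMs with stronger figure-based understanding.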

Abstract: We present MaterialFigBench, a benchmark dataset designed to evaluate the ability of multimodal large language models (LLMs) to solve university-level materials science problems that require accurate interpretation of figures. Unlike existing benchmarks that primarily rely on textual representations, MaterialFigBench focuses on problems in which figures such as phase diagrams, stress-strain curves, Arrhenius plots, diffraction patterns, and microstructural schematics are indispensable for deriving correct answers. The dataset consists of 137 free-response problems adapted from standard materials science textbooks, covering a broad range of topics including crystal structures, mechanical properties, diffusion, phase diagrams, phase transformations, and electronic properties of materials. To address unavoidable ambiguity in reading numerical values from images, expert-defined answer ranges are provided where appropriate. We evaluate several state-of-the-art multimodal LLMs, including ChatGPT and GPT models accessed via OpenAI APIs, and analyze their performance across problem categories and model versions. The results reveal that, although overall accuracy improves with model updates, current LLMs still struggle with genuine visual understanding and quantitative interpretation of materials science figures. In many cases, correct answers are obtained by relying on memorized domain knowledge rather than by reading the provided images. MaterialFigBench highlights persistent weaknesses in visual reasoning, numerical precision, and significant-digit handling, while also identifying problem types where performance has improved. This benchmark provides a systematic and domain-specific foundation for advancing multimodal reasoning capabilities in materials science and for guiding the development of future LLMs with stronger figure-based understanding.


Yuzhi Liang, Lixiang Ma, Xinrong Zhu

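TL;DR: This paper proposes an enhanced causal inference framework for Legal Judgment Prediction (LJP) that combines large language model (LLM) priors with statistical causal discovery. The framework identifies legal constituent elements precisely through a coarse-to-fine hybrid extraction mechanism, resolves uncertainty in causal direction through an LLM-assisted causal structure disambiguation mechanism, and finally builds a causal-aware judgment prediction model.

Details

Motivation: Mainstream LJP methods based on pre-trained language models (PLMs) rely too heavily on statistical correlations between case facts and judgment results and lack explicit modeling of legal constituent elements and the underlying causal logic, so models tend to learn spurious correlations and are not robust. Existing causal LJP methods face two bottlenecks on real-world legal texts: inaccurate, heavily noisy extraction of legal elements, and significant uncertainty in causal structure discovery caused by Markov equivalence under sparse features.

Result: Extensive experiments on multiple benchmark datasets (including LEVEN, QA, and CAIL) show that the proposed method significantly outperforms state-of-the-art baselines in predictive accuracy and robustness, particularly in distinguishing confusing charges.

Insight: The novelty is to use the LLM as a constrained prior knowledge base alongside statistical methods, both for fine-grained extraction of legal elements and for probabilistic disambiguation of causal structure. Causal-graph constraints are then imposed explicitly on the text attention mechanism, which improves interpretability and resistance to spurious correlations.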

Abstract: Mainstream methods for Legal Judgment Prediction (LJP) based on Pre-trained Language Models (PLMs) heavily rely on the statistical correlation between case facts and judgment results. This paradigm lacks explicit modeling of legal constituent elements and underlying causal logic, making models prone to learning spurious correlations and suffering from poor robustness. While introducing causal inference can mitigate this issue, existing causal LJP methods face two critical bottlenecks in real-world legal texts: inaccurate legal factor extraction with severe noise, and significant uncertainty in causal structure discovery due to Markov equivalence under sparse features. To address these challenges, we propose an enhanced causal inference framework that integrates Large Language Model (LLM) priors with statistical causal discovery. First, we design a coarse-to-fine hybrid extraction mechanism combining statistical sampling and LLM semantic reasoning to accurately identify and purify standard legal constituent elements. Second, to resolve structural uncertainty, we introduce an LLM-assisted causal structure disambiguation mechanism. By utilizing the LLM as a constrained prior knowledge base, we conduct probabilistic evaluation and pruning on ambiguous causal directions to generate legally compliant candidate causal graphs. Finally, a causal-aware judgment prediction model is constructed by explicitly constraining text attention intensity via the generated causal graphs. Extensive experiments on multiple benchmark datasets, including LEVEN, QA, and CAIL, demonstrate that our proposed method significantly outperforms state-of-the-art baselines in both predictive accuracy and robustness, particularly in distinguishing confusing charges.


[6] Try, Check and Retry: A Divide-and-Conquer Framework for Boosting Long-context Tool-Calling Performance of LLMs cs.CL

Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du, Dacheng Tao

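TL;DR: The paper proposes Tool-DC, a divide-and-conquer framework for improving the performance of large language models on long-context tool-calling tasks. Through a “Try-Check-Retry” paradigm, the framework reduces reasoning difficulty and makes full use of the LLM's self-reflection ability. It comes in a training-free, plug-and-play variant and a training-based variant with more efficient inference. Experiments show that the method clearly outperforms baseline models on multiple benchmarks.

Details

Motivation: Current methods struggle with the massive, noisy sets of candidate tools in long-context tool-calling tasks, which limits their real-world use. A new framework is needed to improve LLM performance on such tasks.

Result: On the BFCL and ACEBench benchmarks, the training-free Tool-DC variant improves on the baseline by up to 25.10% on average. The training-based variant enables Qwen2.5-7B to match or even exceed proprietary models such as OpenAI o3 and Claude-Haiku-4.5.

Insight: The core innovation is to decompose the complex tool-calling task into more manageable subtasks and to introduce an iterative “Try-Check-Retry” reflection mechanism. This offers a general, pluggable design that can be deployed quickly without training or trained for higher inference efficiency.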

Abstract: Tool-calling empowers Large Language Models (LLMs) to interact with external environments. However, current methods often struggle to handle massive and noisy candidate tools in long-context tool-calling tasks, limiting their real-world application. To this end, we propose Tool-DC, a Divide-and-Conquer framework for boosting tool-calling performance of LLMs. The core of Tool-DC is to reduce the reasoning difficulty and make full use of self-reflection ability of LLMs via a “Try-Check-Retry” paradigm. Specifically, Tool-DC involves two variants: 1) the training-free Tool-DC (TF), which is plug-and-play and flexible; 2) the training-based Tool-DC (TB), which is more inference-efficient. Extensive experiments show that both Tool-DC methods outperform their counterparts by a clear margin. Tool-DC (TF) brings up to +25.10% average gains against the baseline on BFCL and ACEBench benchmarks, while Tool-DC (TB) enables Qwen2.5-7B to achieve comparable or even better performance than proprietary LLMs, e.g., OpenAI o3 and Claude-Haiku-4.5.
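
To make the “Try-Check-Retry” idea concrete, here is a rough sketch of one way such a loop could be organized. The abstract does not specify the procedure, so the grouping scheme, the retry rule (drop rejected tools and regroup), and the keyword-matching `toy_propose` and `toy_check` functions standing in for LLM calls are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence


def chunk(items: Sequence[str], size: int) -> List[List[str]]:
    """Divide the candidate tools into small groups."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def try_check_retry(
    query: str,
    tools: Sequence[str],
    propose: Callable[[str, List[str]], Optional[str]],
    check: Callable[[str, str], bool],
    group_size: int = 2,
    max_retries: int = 2,
) -> Optional[str]:
    pool = list(tools)
    for _ in range(max_retries + 1):
        rejected = set()
        for group in chunk(pool, group_size):
            tool = propose(query, group)      # Try: pick within a small group
            if tool is None:
                continue
            if check(query, tool):            # Check: validate the proposal
                return tool
            rejected.add(tool)
        if not rejected:
            return None                       # nothing proposed, so stop
        # Retry: remove rejected tools and regroup the remaining candidates.
        pool = [t for t in pool if t not in rejected]
    return None


def toy_propose(query: str, group: List[str]) -> Optional[str]:
    # Hypothetical stand-in for an LLM: first tool sharing any query keyword.
    keywords = query.lower().split()
    for tool in group:
        if any(word in tool for word in keywords):
            return tool
    return None


def toy_check(query: str, tool: str) -> bool:
    # Hypothetical stand-in for self-reflection: every keyword must match.
    return all(word in tool for word in query.lower().split())


tools = ["get_weather", "get_weather_history", "send_email",
         "book_flight", "get_stock_price"]
result = try_check_retry("weather history", tools, toy_propose, toy_check)
```

In this toy run the first pass proposes `get_weather`, the check rejects it, and the retry over the reduced pool selects `get_weather_history`.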


[7] One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries cs.CL | cs.AI | cs.LG

Mayank Saini, Arit Kumar Bishwas

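TL;DR: This paper proposes an agentic AI framework for autonomous multimodal query processing that coordinates specialized tools across text, image, audio, video, and document modalities. A central Supervisor dynamically decomposes user queries, delegates subtasks to modality-appropriate tools, and synthesizes results through adaptive routing strategies rather than predetermined decision trees.

Details

Motivation: The work addresses how a multimodal AI system can coordinate modality-specific tools efficiently and economically to handle complex, heterogeneous user queries, overcoming the shortcomings of conventional hierarchical or fixed-pipeline approaches in efficiency, cost, and conversational rework.

Result: Evaluated on 2,847 queries across 15 task categories against a matched hierarchical baseline, the framework maintains comparable accuracy while achieving a 72% reduction in time-to-accurate-answer, an 85% reduction in conversational rework, and a 67% reduction in cost.

Insight: The novelty is an adaptive tool-orchestration framework driven by a central Supervisor. It combines learned routing via RouteLLM for text-only queries with SLM-assisted modality decomposition for non-text paths, giving dynamic task decomposition and result synthesis instead of a fixed decision tree, which fundamentally improves the economics of multimodal AI deployment.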

Abstract: We present an agentic AI framework for autonomous multimodal query processing that coordinates specialized tools across text, image, audio, video, and document modalities. A central Supervisor dynamically decomposes user queries, delegates subtasks to modality-appropriate tools (e.g., object detection, OCR, speech transcription), and synthesizes results through adaptive routing strategies rather than predetermined decision trees. For text-only queries, the framework uses learned routing via RouteLLM, while non-text paths use SLM-assisted modality decomposition. Evaluated on 2,847 queries across 15 task categories, our framework achieves 72% reduction in time-to-accurate-answer, 85% reduction in conversational rework, and 67% cost reduction compared to the matched hierarchical baseline while maintaining accuracy parity. These results demonstrate that intelligent centralized orchestration fundamentally improves multimodal AI deployment economics.


[8] Performance Evaluation of Open-Source Large Language Models for Assisting Pathology Report Writing in Japanese cs.CL | cs.AI

Masataka Kawai, Singo Sakashita, Shumpei Ishikawa, Shogo Watanabe, Anna Matsuoka

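TL;DR: This paper evaluates seven open-source large language models for assisting Japanese pathology report writing from three perspectives: generation and information extraction of diagnosis text, correction of typographical errors in reports, and subjective evaluation of model-generated explanatory text by pathologists and clinicians. Thinking models and medical-specialized models performed better on structured reporting tasks that require reasoning and on typo correction, while preferences for explanatory text varied by rater.

Details

Motivation: The performance of large language models in supporting Japanese pathology report writing has not been adequately explored. This study aims to fill that gap by evaluating the usefulness of open-source LLMs on clinically relevant tasks.

Result: Thinking models and medical-specialized models showed advantages in structured reporting tasks and in typo correction, while rater preferences varied widely on explanatory text generation. Overall, open-source LLMs show potential for assisting Japanese pathology report writing in limited but clinically relevant scenarios.

Insight: The novelty is the first systematic multi-task evaluation of open-source LLMs for Japanese pathology report writing. It shows how suitability differs by model type (for example, thinking models and medical-specialized models) across tasks, which offers a reference for model selection in clinical practice.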

Abstract: The performance of large language models (LLMs) for supporting pathology report writing in Japanese remains unexplored. We evaluated seven open-source LLMs from three perspectives: (A) generation and information extraction of pathology diagnosis text following predefined formats, (B) correction of typographical errors in Japanese pathology reports, and (C) subjective evaluation of model-generated explanatory text by pathologists and clinicians. Thinking models and medical-specialized models showed advantages in structured reporting tasks that required reasoning and in typo correction. In contrast, preferences for explanatory outputs varied substantially across raters. Although the utility of LLMs differed by task, our findings suggest that open-source LLMs can be useful for assisting Japanese pathology report writing in limited but clinically relevant scenarios.


[9] Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge cs.CL

Junjie Wu, Xuan Kan, Zihao He, Shunwen Tan, Bo Pan

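TL;DR: This paper proposes MT-RL-Judge, a multi-task reinforcement learning framework for improving multimodal large language models as judges. The framework jointly optimizes across multiple tasks to strengthen generalization. Experiments show that it outperforms existing baselines in judgment consistency and correlation with human preferences and generalizes robustly to out-of-distribution tasks.

Details

Motivation: Most existing MLLM-as-a-Judge models are optimized for a single task and generalize poorly to diverse scenarios, which limits their usefulness as reliable evaluation tools.

Result: In comparisons against several strong baselines, MT-RL-Judge performs better in both judgment consistency and correlation with human preferences, and its effectiveness is confirmed on out-of-distribution tasks.

Insight: The novelty is to bring multi-task reinforcement learning into the optimization of MLLM-as-a-Judge models, using joint training to improve generalization and evaluation reliability. Viewed objectively, this multi-task joint optimization offers a new approach to making evaluation models adapt across scenarios.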

Abstract: Multimodal Large Language Models (MLLMs) have been widely adopted as judges (MLLM-as-a-Judge) due to their strong alignment with human judgment across various visual tasks. However, most existing judge models are optimized for single-task scenarios and struggle to generalize to diverse contexts, which is a critical requirement for reliable evaluation. To address this limitation, we propose Multi-Task Reinforcement Learning for MLLM-as-a-Judge (MT-RL-Judge), a framework that jointly optimizes the judge model across multiple tasks, leveraging the generalization capabilities of RL. Experimental results demonstrate that MT-RL-Judge outperforms several strong baselines in both judgment consistency and correlation with human preferences. Furthermore, our approach exhibits robust generalization on out-of-distribution tasks, further validating its effectiveness.


[10] SemBench: A Universal Semantic Framework for LLM Evaluation cs.CL | cs.AI

Mikel Zubillaga, Naiara Perez, Oscar Sainz, German Rigau

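TL;DR: This paper proposes SemBench, a framework for automatically generating synthetic benchmarks that evaluate the semantic understanding of large language models. It needs only dictionary sense definitions and a sentence encoder, with no manually curated example sentences, which makes it scalable and language-independent.

Details

Motivation: Existing benchmarks for the semantic understanding of LLMs (such as WiC) are costly to build and mostly limited to high-resource languages. A lightweight, scalable, cross-lingual evaluation method is needed.

Result: Tested on a range of LLMs in English, Spanish, and Basque, the model rankings derived from SemBench correlate strongly with rankings from standard WiC datasets, and only a small number of examples is needed for stable, meaningful rankings.

Insight: Generating evaluation data automatically from dictionary definitions and a sentence encoder enables low-cost, cross-lingual evaluation of semantic competence and offers a new approach to evaluating models for low-resource languages.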

Abstract: Recent progress in Natural Language Processing (NLP) has been driven by the emergence of Large Language Models (LLMs), which exhibit remarkable generative and reasoning capabilities. However, despite their success, evaluating the true semantic understanding of these models remains a persistent challenge. Traditional benchmarks such as Word-in-Context (WiC) effectively probe this capability, but their creation is resource-intensive and often limited to high-resource languages. In this paper, we introduce SemBench, a framework for automatically generating synthetic benchmarks that assess the semantic competence of LLMs using only dictionary sense definitions and a sentence encoder. This approach eliminates the need for curated example sentences, making it both scalable and language-independent. We evaluate SemBench in three languages (English, Spanish, and Basque) spanning different levels of linguistic resources, and across a wide range of LLMs. Our results show that rankings derived from SemBench strongly correlate with those obtained from standard WiC datasets. Furthermore, our analysis demonstrates that only a small number of examples is required to achieve stable and meaningful rankings. Overall, SemBench provides a lightweight, adaptable, and data-efficient framework for cross-lingual evaluation of semantic understanding in LLMs.
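
As an illustration of comparing model rankings from two benchmarks, the sketch below computes a Spearman rank correlation with the standard library. The abstract does not say which correlation statistic was used, so Spearman's rho is an assumption, and the model names and scores are invented for the example.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank models from best (1) to worst; assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def spearman(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Spearman's rho for two score tables over the same models (no ties)."""
    rank_a, rank_b = ranks(a), ranks(b)
    n = len(rank_a)
    squared_diffs = sum((rank_a[m] - rank_b[m]) ** 2 for m in rank_a)
    return 1 - 6 * squared_diffs / (n * (n ** 2 - 1))


# Made-up accuracies for four hypothetical models on two benchmarks.
synthetic_scores = {"model_a": 0.81, "model_b": 0.74, "model_c": 0.66, "model_d": 0.59}
wic_scores = {"model_a": 0.78, "model_b": 0.70, "model_c": 0.72, "model_d": 0.55}

rho = spearman(synthetic_scores, wic_scores)
```

Here only `model_b` and `model_c` swap places between the two rankings, so the correlation is high but below 1.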


Yaocong Li, Qiang Lan, Leihan Zhang, Le Zhang

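TL;DR: The paper presents the Legal-DC benchmark dataset and the LegRAG framework to address two problems in Chinese legal retrieval-augmented generation (RAG): the lack of specialized evaluation resources and the difficulty existing systems have with the structured nature of legal provisions. Using legal adaptive indexing and a dual-path self-reflection mechanism, LegRAG surpasses existing state-of-the-art methods on key evaluation metrics.

Details

Motivation: Existing legal RAG benchmarks lack specialized support for joint retriever-generator evaluation, and mainstream RAG systems adapt poorly to the structured nature of legal provisions, which limits their use in Chinese legal scenarios.

Result: On the Legal-DC benchmark, LegRAG improves on existing state-of-the-art methods by 1.3% to 5.6% across key evaluation metrics.

Insight: The contributions are a specialized Chinese legal RAG benchmark with clause-level annotations, a framework that combines legal adaptive indexing (clause-boundary segmentation) with a dual-path self-reflection mechanism to preserve clause integrity while improving answer accuracy, and automated evaluation methods for high-reliability legal scenarios.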

Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising technology for legal document consultation, yet its application in Chinese legal scenarios faces two key limitations: existing benchmarks lack specialized support for joint retriever-generator evaluation, and mainstream RAG systems often fail to accommodate the structured nature of legal provisions. To address these gaps, this study makes the following contributions. First, we constructed the Legal-DC benchmark dataset, comprising 480 legal documents (covering areas such as market regulation and contract management) and 2,475 refined question-answer pairs, each annotated with clause-level references, filling the gap for specialized evaluation resources in Chinese legal RAG. Second, we propose the LegRAG framework, which integrates legal adaptive indexing (clause-boundary segmentation) with a dual-path self-reflection mechanism to ensure clause integrity while enhancing answer accuracy. Third, we introduce automated evaluation methods for large language models to meet the high-reliability demands of legal retrieval scenarios. LegRAG outperforms existing state-of-the-art methods by 1.3% to 5.6% across key evaluation metrics. This research provides a specialized benchmark, practical framework, and empirical insights to advance the development of Chinese legal RAG systems. Our code and data are available at https://github.com/legal-dc/Legal-DC.
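
Clause-boundary segmentation, as opposed to fixed-size chunking, can be sketched roughly as follows. This is not the LegRAG implementation: the `Article N` heading pattern, the decision to drop any preamble before the first article, and the sample text are assumptions chosen for illustration (Chinese statutes would need a different pattern).

```python
import re
from typing import List

# Assumed heading convention: each clause starts a line with "Article <number>".
ARTICLE_HEADING = re.compile(r"^Article\s+\d+\b", flags=re.MULTILINE)


def split_into_clauses(text: str) -> List[str]:
    """Split a statute into chunks that each hold exactly one whole article."""
    starts = [match.start() for match in ARTICLE_HEADING.finditer(text)]
    if not starts:
        # No recognizable headings: keep the document as a single chunk.
        return [text.strip()] if text.strip() else []
    # Text before the first heading (a preamble) is ignored in this sketch.
    bounds = starts + [len(text)]
    return [text[begin:end].strip() for begin, end in zip(bounds, bounds[1:])]


# Invented sample text, not taken from any real statute.
sample = (
    "Article 1 Operators shall display prices clearly.\n"
    "Prices must include taxes.\n"
    "Article 2 Contracts shall be concluded in writing.\n"
)
clauses = split_into_clauses(sample)
```

Because each chunk ends where the next article begins, a retrieved chunk never cuts a provision in half, which is the property the abstract refers to as clause integrity.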


[12] Bielik-Minitron-7B: Compressing Large Language Models via Structured Pruning and Knowledge Distillation for the Polish Language cs.CL | cs.AI

Remigiusz Kinas, Paweł Kiszczak, Sergio P. Perez, Krzysztof Ociepa, Łukasz Flis

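TL;DR: This report describes the creation of Bielik-Minitron-7B, a compressed large language model optimized for European languages, Polish in particular. A two-stage compression method combining structured hybrid pruning and knowledge distillation reduces the original Bielik-11B-v3.0 model from 11.04B to 7.35B parameters, a 33.4% reduction. After an alignment pipeline (supervised fine-tuning, direct preference optimization, and reinforcement learning), the compressed model recovers about 90% of the baseline model's performance and achieves up to 50% faster inference.

Details

Motivation: The goal is to build efficient, low-cost-to-deploy large language models for lower-resource or under-represented languages such as Polish, reducing inference cost and latency while preserving model quality.

Result: The final model recovers about 90% of the performance of the baseline model (Bielik-11B-v3.0) while delivering up to 50% inference speedup.

Insight: A two-stage compression pipeline inspired by NVIDIA Minitron (structured hybrid pruning plus logit-based knowledge distillation), followed by a rigorous alignment pipeline (SFT, DPO-P, GRPO) for quality recovery, offers a practical route to building high-quality LLMs efficiently for less-represented languages.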

Abstract: This report details the creation of Bielik-Minitron-7B, a compressed 7.35B parameter version of the Bielik-11B-v3.0 model, specifically optimized for European languages. By leveraging a two-stage compression methodology inspired by the NVIDIA Minitron approach, we combined structured hybrid pruning and knowledge distillation to reduce the model’s parameter count by 33.4%, from 11.04B to 7.35B. We utilized the NVIDIA Model Optimizer for structural pruning and the NVIDIA NeMo Framework for logit-based distillation for quality recovery. Following distillation, the model underwent a rigorous alignment pipeline consisting of Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO-P), and Reinforcement Learning (GRPO). Our final model successfully recovered approximately 90% of the baseline model’s performance while providing up to 50% inference speedup. This approach demonstrates an efficient pathway to create language models for less-represented languages, preserving the original model quality while reducing inference deployment costs.
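
The reported 33.4% reduction follows directly from the two parameter counts given in the abstract; a quick check of the arithmetic:

```python
original_params = 11.04e9    # Bielik-11B-v3.0, as stated in the abstract
compressed_params = 7.35e9   # Bielik-Minitron-7B, as stated in the abstract

# Relative reduction = removed parameters / original parameters.
reduction = (original_params - compressed_params) / original_params
print(f"{reduction:.1%}")    # prints 33.4%
```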


[13] CoMMET: To What Extent Can LLMs Perform Theory of Mind Tasks? cs.CL

Ruirui Chen, Weifeng Jiang, Chengwei Qin, Cheston Tan

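TL;DR: The paper proposes CoMMET, a multimodal benchmark dataset for comprehensively evaluating the Theory of Mind (ToM) abilities of large language models (LLMs), that is, the ability to reason about the mental states of oneself and others. The dataset broadens existing evaluation by covering a wider range of mental states and introducing multi-turn testing, and it is the first multimodal dataset to evaluate ToM in a multi-turn conversational setting. Through a comprehensive evaluation of LLMs across families and sizes, the paper analyzes the strengths and limitations of current models and identifies directions for improvement.

Details

Motivation: Existing benchmarks for ToM in LLMs are limited: most rely solely on text inputs and focus narrowly on belief-related tasks. Because ToM is central to effective, natural interaction, a more comprehensive evaluation is needed to validate the social reasoning of LLMs.

Result: The paper uses CoMMET to evaluate LLMs across families and sizes and analyzes their strengths and limitations on ToM tasks. The abstract reports no specific quantitative results (such as accuracy) or comparisons with the state of the art.

Insight: The novelty is CoMMET, the first multimodal, multi-turn benchmark for ToM evaluation. It extends both the scope of evaluation (a broader range of mental states) and the testing mode (multi-turn conversation), providing a new tool and direction for understanding the social cognitive capabilities of LLMs.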

Abstract: Theory of Mind (ToM), the ability to reason about the mental states of oneself and others, is a cornerstone of human social intelligence. As Large Language Models (LLMs) become ubiquitous in real-world applications, validating their capacity for this level of social reasoning is essential for effective and natural interactions. However, existing benchmarks for assessing ToM in LLMs are limited; most rely solely on text inputs and focus narrowly on belief-related tasks. In this paper, we propose a new multimodal benchmark dataset, CoMMET, a Comprehensive Mental states and Moral Evaluation Task inspired by the Theory of Mind Booklet Task. CoMMET expands the scope of evaluation by covering a broader range of mental states and introducing multi-turn testing. To the best of our knowledge, this is the first multimodal dataset to evaluate ToM in a multi-turn conversational setting. Through a comprehensive assessment of LLMs across different families and sizes, we analyze the strengths and limitations of current models and identify directions for future improvement. Our work offers a deeper understanding of the social cognitive capabilities of modern LLMs.


[14] Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections cs.CL | cs.AI

Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns

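TL;DR: The paper introduces MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents, for evaluating the reasoning of multimodal agents on document-intensive tasks. Although the best agents match human searchers in raw accuracy, they rely on brute-force search rather than strategic planning and remain nearly 20% short of oracle performance.

Details

Motivation: The work asks whether multimodal agents reason strategically when working over documents or merely rely on stochastic trial-and-error search, with the aim of moving from brute-force retrieval toward efficient reasoning.

Result: On MADQA, the best agents match human searchers in accuracy, but a novel evaluation protocol measuring the accuracy-effort trade-off shows that they rely on brute-force search and fail to reach oracle performance (a gap of nearly 20%).

Insight: The contributions are the MADQA benchmark, designed under Classical Test Theory, and the accuracy-effort trade-off evaluation protocol. Viewed objectively, the study exposes agents' weakness in strategic planning and points toward more efficient reasoning methods.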

Abstract: Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.


[15] Sparking Scientific Creativity via LLM-Driven Interdisciplinary Inspiration cs.CL | cs.AI

Priyanka Kargupta, Shuhaib Mehri, Dilek Hakkani-Tur, Jiawei Han

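TL;DR: This paper proposes Idea-Catalyst, a framework that uses large language models (LLMs) to drive interdisciplinary inspiration and enhance scientific creativity. It decomposes an abstract research goal into core questions, identifies challenges within the target domain, reformulates them as domain-agnostic conceptual problems, retrieves analogous solutions from external disciplines, and finally synthesizes and recontextualizes those insights to support the creative reasoning of both humans and LLMs.

Details

Motivation: Although interdisciplinary research produces greater long-term impact, most work remains confined to a single domain. Existing AI-assisted approaches to scientific discovery mostly focus on rapidly designing experiments and solutions and overlook the exploratory, collaborative reasoning that drives creative interdisciplinary breakthroughs. This paper therefore aims to augment, rather than automate, the reasoning behind scientific discovery.

Result: Experiments show that the framework improves average novelty by 21% and insightfulness by 16% while remaining relevant to the original research problem.

Insight: The novelty is to model the metacognitive features of interdisciplinary reasoning systematically (defining research goals, assessing a domain's opportunities and challenges, and exploring strategically according to potential impact) and to enable cross-disciplinary retrieval and synthesis through domain-agnostic problem reformulation. This supports brainstorming at the ideation stage and avoids premature anchoring on specific solutions.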

Abstract: Despite interdisciplinary research leading to larger and longer-term impact, most work remains confined to single-domain academic silos. Recent AI-based approaches to scientific discovery show promise for interdisciplinary research, but many prioritize rapidly designing experiments and solutions, bypassing the exploratory, collaborative reasoning processes that drive creative interdisciplinary breakthroughs. As a result, prior efforts largely prioritize automating scientific discovery rather than augmenting the reasoning processes that underlie scientific disruption. We present Idea-Catalyst, a novel framework that systematically identifies interdisciplinary insights to support creative reasoning in both humans and large language models. Starting from an abstract research goal, Idea-Catalyst is designed to assist the brainstorming stage, explicitly avoiding premature anchoring on specific solutions. The framework embodies key metacognitive features of interdisciplinary reasoning: (a) defining and assessing research goals, (b) awareness of a domain’s opportunities and unresolved challenges, and (c) strategic exploration of interdisciplinary ideas based on impact potential. Concretely, Idea-Catalyst decomposes an abstract goal (e.g., improving human-AI collaboration) into core target-domain research questions that guide the analysis of progress and open challenges within that domain. These challenges are reformulated as domain-agnostic conceptual problems, enabling retrieval from external disciplines (e.g., Psychology, Sociology) that address analogous issues. By synthesizing and recontextualizing insights from these domains back into the target domain, Idea-Catalyst ranks source domains by their interdisciplinary potential. Empirically, this targeted integration improves average novelty by 21% and insightfulness by 16%, while remaining grounded in the original research problem.


[16] SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning cs.CL | cs.AI | cs.CV

Ziyu Chen, Yilun Zhao, Chengye Wang, Rilyn Han, Manasi Patwardhan

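TL;DR: This paper presents SciMDR, a large-scale scientific multimodal document reasoning dataset for training foundation models, and SciMDR-Eval, an expert-annotated evaluation benchmark. To handle the trade-off among scale, faithfulness, and realism in building such datasets, the authors introduce the synthesize-and-reground framework, which has two stages: Claim-Centric QA Synthesis and Document-Scale Regrounding.

Details

Motivation: Building scientific multimodal document reasoning datasets for foundation model training requires trading off scale, faithfulness, and realism, which remains an open challenge.

Result: Experiments show that models fine-tuned on SciMDR achieve significant improvements on multiple scientific QA benchmarks, especially on tasks that require complex document-level reasoning.

Insight: The novelty is the two-stage synthesize-and-reground framework: it first generates faithful, isolated QA pairs and then programmatically re-embeds them into full-document tasks to ensure realistic complexity, balancing the dataset's scale, faithfulness, and realism.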

Abstract: Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness, and realism. To address this challenge, we introduce the synthesize-and-reground framework, a two-stage pipeline comprising: (1) Claim-Centric QA Synthesis, which generates faithful, isolated QA pairs and reasoning on focused segments, and (2) Document-Scale Regrounding, which programmatically re-embeds these pairs into full-document tasks to ensure realistic complexity. Using this framework, we construct SciMDR, a large-scale training dataset for cross-modal comprehension, comprising 300K QA pairs with explicit reasoning chains across 20K scientific papers. We further construct SciMDR-Eval, an expert-annotated benchmark to evaluate multimodal comprehension within full-length scientific workflows. Experiments demonstrate that models fine-tuned on SciMDR achieve significant improvements across multiple scientific QA benchmarks, particularly in those tasks requiring complex document-level reasoning.


cs.CV

[17] RC-NF: Robot-Conditioned Normalizing Flow for Real-Time Anomaly Detection in Robotic Manipulation cs.CV | cs.RO

Shijie Zhou, Bin Zhu, Jiarui Yang, Xiangyu Zhao, Jingjing Chen

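TL;DR: This paper proposes RC-NF (Robot-Conditioned Normalizing Flow), a real-time monitoring model for anomaly detection and intervention in robotic manipulation. The model decouples the processing of task-aware robot and object states, requires only positive samples for unsupervised training, and computes accurate anomaly scores from the probability density function. The authors also create the LIBERO-Anomaly-10 benchmark for simulation evaluation. RC-NF achieves state-of-the-art performance across anomaly types, works as a plug-and-play module that improves the robustness of VLA models, and reaches a response latency under 100 ms in real-world experiments.

Details

Motivation: Vision-Language-Action (VLA) models trained through imitation learning are not reliable enough in dynamic environments and under out-of-distribution (OOD) conditions, where they often fail. Making VLA models robust in robotic manipulation tasks requires a mechanism that monitors for anomalies in real time and triggers intervention.

Result: On the proposed LIBERO-Anomaly-10 benchmark, RC-NF achieves state-of-the-art (SOTA) performance on all anomaly types compared with previous methods. Real-world experiments show that RC-NF can serve as a plug-and-play module for VLA models (such as pi0), providing a real-time OOD signal that supports state-level rollback or task-level replanning, with a response latency under 100 ms.

Insight: The paper's claimed contributions are: 1) Robot-Conditioned Normalizing Flow (RC-NF), which decouples the processing of task-aware robot and object states within the normalizing flow; 2) unsupervised training that requires only positive samples, which simplifies data requirements; and 3) the LIBERO-Anomaly-10 benchmark for systematic evaluation. Viewed objectively, conditioning a normalizing flow on robot state for real-time, accurate anomaly detection, and integrating it as a plug-and-play module into existing VLA systems, is an effective and practical way to improve the adaptability of robotic systems in dynamic environments.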

Abstract: Recent advances in Vision-Language-Action (VLA) models have enabled robots to execute increasingly complex tasks. However, VLA models trained through imitation learning struggle to operate reliably in dynamic environments and often fail under Out-of-Distribution (OOD) conditions. To address this issue, we propose Robot-Conditioned Normalizing Flow (RC-NF), a real-time monitoring model for robotic anomaly detection and intervention that ensures the robot’s state and the object’s motion trajectory align with the task. RC-NF decouples the processing of task-aware robot and object states within the normalizing flow. It requires only positive samples for unsupervised training and calculates accurate robotic anomaly scores during inference through the probability density function. We further present LIBERO-Anomaly-10, a benchmark comprising three categories of robotic anomalies for simulation evaluation. RC-NF achieves state-of-the-art performance across all anomaly types compared to previous methods in monitoring robotic tasks. Real-world experiments demonstrate that RC-NF operates as a plug-and-play module for VLA models (e.g., pi0), providing a real-time OOD signal that enables state-level rollback or task-level replanning when necessary, with a response latency under 100 ms. These results demonstrate that RC-NF noticeably enhances the robustness and adaptability of VLA-based robotic systems in dynamic environments.
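
The idea of scoring anomalies by negative log-likelihood under a flow trained only on normal data can be shown with a deliberately tiny example. This is not RC-NF: it uses a single one-dimensional affine flow with a standard normal base, omits conditioning on robot state entirely, and the sample values are invented.

```python
import math
from statistics import mean, pstdev
from typing import Sequence


class AffineFlow1D:
    """One affine layer z = (x - shift) / scale with a standard normal base."""

    def fit(self, samples: Sequence[float]) -> "AffineFlow1D":
        # Fit on positive (normal) samples only; no anomalies are needed.
        self.shift = mean(samples)
        self.scale = pstdev(samples)
        return self

    def log_prob(self, x: float) -> float:
        z = (x - self.shift) / self.scale
        base_log_density = -0.5 * (z * z + math.log(2 * math.pi))
        # Change of variables: log p(x) = log p(z) + log|dz/dx|.
        return base_log_density - math.log(self.scale)

    def anomaly_score(self, x: float) -> float:
        # Low density under the fitted flow means a high anomaly score.
        return -self.log_prob(x)


# Invented readings from nominal executions (for example, a gripper width).
normal_samples = [0.98, 1.02, 1.00, 0.97, 1.03, 1.01, 0.99]
flow = AffineFlow1D().fit(normal_samples)

score_in = flow.anomaly_score(1.00)   # typical value
score_out = flow.anomaly_score(1.50)  # far from the training data
```

A monitoring loop would compare each score against a threshold, chosen for instance from the scores of held-out normal samples, and trigger rollback or replanning when it is exceeded.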


[18] GGPT: Geometry Grounded Point Transformer cs.CVPDF

Yutong Chen, Yiming Wang, Xucong Zhang, Sergey Prokudin, Siyu Tang

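TL;DR: GGPT is a geometry-guided point transformer framework that improves geometric consistency and fine-detail accuracy in sparse-view 3D reconstruction. It first estimates camera poses and a partial 3D point cloud with an improved Structure-from-Motion pipeline, then uses a geometry-guided 3D point transformer to refine dense point predictions under explicit geometric supervision.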

Details

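Motivation: Existing feed-forward networks for sparse-view 3D reconstruction lack explicit multi-view constraints, which leads to geometric inconsistencies and limited fine-grained accuracy.

Result: Trained on ScanNet++ with VGGT predictions, GGPT substantially outperforms state-of-the-art feed-forward 3D reconstruction models in both in-domain and out-of-domain settings.

Insight: The key idea is to combine reliable sparse geometric priors with dense feed-forward predictions, using an optimised guidance encoding to provide explicit partial-geometry supervision, which improves the geometric consistency and spatial completeness of the reconstruction.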

Abstract: Recent feed-forward networks have achieved remarkable progress in sparse-view 3D reconstruction by predicting dense point maps directly from RGB images. However, they often suffer from geometric inconsistencies and limited fine-grained accuracy due to the absence of explicit multi-view constraints. We introduce the Geometry-Grounded Point Transformer (GGPT), a framework that augments feed-forward reconstruction with reliable sparse geometric guidance. We first propose an improved Structure-from-Motion pipeline based on dense feature matching and lightweight geometric optimisation to efficiently estimate accurate camera poses and partial 3D point clouds from sparse input views. Building on this foundation, we propose a geometry-guided 3D point transformer that refines dense point maps under explicit partial-geometry supervision using an optimised guidance encoding. Extensive experiments demonstrate that our method provides a principled mechanism for integrating geometric priors with dense feed-forward predictions, producing reconstructions that are both geometrically consistent and spatially complete, recovering fine structures and filling gaps in textureless areas. Trained solely on ScanNet++ with VGGT predictions, GGPT generalises across architectures and datasets, substantially outperforming state-of-the-art feed-forward 3D reconstruction models in both in-domain and out-of-domain settings.


[19] Evidential learning driven Breast Tumor Segmentation with Stage-divided Vision-Language Interaction cs.CVPDF

Jingxing Zhong, Qingtao Pan, Xuchang Zhou, Jiazhen Lin, Xinguo Zhuang

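TL;DR: This paper proposes TextBCS, a text-guided breast tumor segmentation model that combines stage-divided vision-language interaction with evidential learning. It targets the segmentation difficulty caused by low contrast between tumors and normal tissue and by blurred boundaries in breast MRI.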

Details

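Motivation: In breast MRI, low contrast between tumor and normal regions and blurred boundaries make it hard for existing deep learning methods to locate tumor contours accurately. The paper uses text prompt information to improve segmentation.

Result: Extensive experiments on publicly available datasets show that TextBCS outperforms other segmentation networks and achieves the best breast tumor segmentation performance.

Insight: The contributions are a stage-divided vision-language interaction mechanism, which exchanges information between visual and text features at every down-sampling stage to help locate lesions in low-contrast scenes, and evidential learning, which uses a variational Dirichlet distribution to quantify the model's segmentation uncertainty at blurred boundaries.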

Abstract: Breast cancer is one of the most common causes of death among women worldwide, with millions of fatalities annually. Magnetic Resonance Imaging (MRI) can provide various sequences for characterizing tumor morphology and internal patterns, and has become an effective tool for the detection and diagnosis of breast tumors. However, previous deep-learning-based tumor segmentation methods have limitations in accurately locating tumor contours due to the low contrast between cancerous and normal areas and to blurred boundaries. Leveraging text prompt information holds promise for improving tumor segmentation by delineating segmentation regions. Inspired by this, we propose a text-guided Breast Tumor Segmentation model (TextBCS) with stage-divided vision-language interaction and evidential learning. Specifically, the proposed stage-divided vision-language interaction facilitates mutual information exchange between visual and text features at each stage of down-sampling, further exploiting text prompts to assist in locating lesion areas in low-contrast scenarios. Moreover, evidential learning is adopted to quantify the segmentation uncertainty of the model at blurred boundaries. It uses a variational Dirichlet to characterize the distribution of the segmentation probabilities, addressing the segmentation uncertainty at the boundaries. Extensive experiments validate the superiority of TextBCS over other segmentation networks, showcasing the best breast tumor segmentation performance on publicly available datasets.
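As a rough illustration of how a Dirichlet distribution can express per-pixel uncertainty, the sketch below uses the common evidential deep learning formulation (non-negative evidence, `alpha = evidence + 1`, uncertainty `K / sum(alpha)`). This is an assumption: the abstract does not give the exact TextBCS formulation, and the function name and evidence values are hypothetical.

```python
def dirichlet_uncertainty(evidence):
    """Map per-class evidence for one pixel to expected probabilities and an uncertainty mass."""
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]      # Dirichlet concentration parameters
    strength = sum(alpha)                    # total Dirichlet strength S
    probs = [a / strength for a in alpha]    # expected class probabilities
    uncertainty = num_classes / strength     # equals 1.0 when there is no evidence at all
    return probs, uncertainty

# Hypothetical evidence for (tumor, background) at two pixels.
interior_probs, interior_u = dirichlet_uncertainty([9.0, 0.0])  # clear tumor interior
boundary_probs, boundary_u = dirichlet_uncertainty([0.0, 0.0])  # blurred boundary
```

Under this formulation a pixel with little evidence for any class gets a near-uniform prediction and an uncertainty close to 1, which is the behaviour one would want along blurred tumor boundaries.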


[20] A Simple Efficiency Incremental Learning Framework via Vision-Language Model with Nonlinear Multi-Adapters cs.CV | cs.AIPDF

Haihua Luo, Xuming Ran, Jiangrong Shen, Timo Hämäläinen, Zhonghua Chen

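TL;DR: This paper proposes SimE, a simple and efficient incremental learning framework that uses a vision-language model (such as CLIP) with adapters designed specifically for incremental learning. The study finds a nonlinear relationship between the number of adapter connections and the model's incremental learning ability: adding adapter connections between transformer blocks improves performance, whereas adding connections within blocks can hurt it. Experiments show that SimE clearly outperforms both traditional methods and CLIP-based methods on TinyImageNet and CIFAR-100.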

Details

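Motivation: Existing incremental learning methods built on vision-language models face three challenges: low training efficiency, reliance on a memory bank that stores previous data, and the need for a strong backbone to boost model capability.

Result: SimE surpasses traditional methods by 9.6% on TinyImageNet and outperforms other CLIP-based methods by 5.3% on CIFAR-100.

Insight: The paper reveals a nonlinear relationship between the number of adapter connections and incremental learning performance, and proposes an efficient, scalable framework that needs no memory bank. It also suggests using a CLIP encoder trained on a larger dataset (e.g., LAION2B) with a stronger architecture (e.g., ViT-L/14) to make better use of zero-shot capabilities.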

Abstract: Incremental Learning (IL) aims to learn new tasks while preserving previously acquired knowledge. Integrating the zero-shot learning capabilities of pre-trained vision-language models into IL methods has marked a significant advancement. However, these methods face three primary challenges: (1) the need for improved training efficiency; (2) reliance on a memory bank to store previous data; and (3) the necessity of a strong backbone to augment the model’s capabilities. In this paper, we propose SimE, a Simple and Efficient framework that employs a vision-language model with adapters designed specifically for the IL task. We report a remarkable phenomenon: there is a nonlinear correlation between the number of adaptive adapter connections and the model’s IL capabilities. While increasing adapter connections between transformer blocks improves model performance, adding more adaptive connections within transformer blocks during smaller incremental steps does not enhance, and may even degrade the model’s IL ability. Extensive experimental results show that SimE surpasses traditional methods by 9.6% on TinyImageNet and outperforms other CLIP-based methods by 5.3% on CIFAR-100. Furthermore, we conduct a systematic study to enhance the utilization of the zero-shot capabilities of CLIP. We suggest replacing SimE’s encoder with a CLIP model trained on larger datasets (e.g., LAION2B) and stronger architectures (e.g., ViT-L/14).


[21] Senna-2: Aligning VLM and End-to-End Driving Policy for Consistent Decision Making and Planning cs.CVPDF

Yuehao Song, Shaoyu Chen, Hao Gao, Yifan Zhu, Weixiang Yue

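TL;DR: This paper proposes Senna-2, a framework that aligns a vision-language model (VLM) with an end-to-end (E2E) driving policy to resolve inconsistencies between high-level decisions and low-level planning. It uses a consistency-oriented three-stage training paradigm: driving pre-training, open-loop alignment, and closed-loop alignment via hierarchical reinforcement learning in 3D Gaussian Splatting environments, strengthening the system's top-down guidance and decision-following ability.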

Details

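Motivation: Existing methods often overlook the dual-system consistency between the VLM's high-level decisions and the E2E policy's low-level planning. As a result, generated trajectories may not match the intended driving decision, which weakens top-down guidance and decision-following ability.

Result: Extensive experiments show that Senna-2 substantially improves dual-system consistency (19.3% F1 score improvement) and significantly improves driving safety in both open-loop (5.7% reduction in final displacement error, FDE) and closed-loop settings (30.6% AF-CR reduction).

Insight: The paper explicitly poses the problem of aligning the VLM with the E2E policy and designs a three-stage, consistency-oriented training paradigm. In particular, a decision adapter passes VLM decisions to the policy as implicit embeddings, and hierarchical reinforcement learning in 3DGS environments reinforces safety and efficiency in closed loop. Viewed objectively, this systematic alignment of high-level semantic reasoning with low-level control is a useful reference for improving the reliability and interpretability of end-to-end driving systems.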

Abstract: Vision-language models (VLMs) enhance the planning capability of end-to-end (E2E) driving policy by leveraging high-level semantic reasoning. However, existing approaches often overlook the dual-system consistency between VLM’s high-level decision and E2E’s low-level planning. As a result, the generated trajectories may misalign with the intended driving decisions, leading to weakened top-down guidance and decision-following ability of the system. To address this issue, we propose Senna-2, an advanced VLM-E2E driving policy that explicitly aligns the two systems for consistent decision-making and planning. Our method follows a consistency-oriented three-stage training paradigm. In the first stage, we conduct driving pre-training to achieve preliminary decision-making and planning, with a decision adapter transmitting VLM decisions to E2E policy in the form of implicit embeddings. In the second stage, we align the VLM and the E2E policy in an open-loop setting. In the third stage, we perform closed-loop alignment via bottom-up Hierarchical Reinforcement Learning in 3DGS environments to reinforce the safety and efficiency. Extensive experiments demonstrate that Senna-2 achieves superior dual-system consistency (19.3% F1 score improvement) and significantly enhances driving safety in both open-loop (5.7% FDE reduction) and closed-loop settings (30.6% AF-CR reduction).


[22] Frequency-Modulated Visual Restoration for Matryoshka Large Multimodal Models cs.CV | cs.CLPDF

Qingtao Pan, Zhihao Dou, Shuo Li

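TL;DR: This paper proposes FMVR, a frequency-modulated visual restoration strategy that improves the reasoning ability of large multimodal models when the number of visual tokens is reduced. It uses average pooling and max pooling to disentangle the representation of the reduced visual tokens into low- and high-frequency components, then modulates them with lightweight learnable parameters to strengthen both salient and weak visual semantics. Combined with Matryoshka Representation Learning, it learns coarse-to-fine visual token sets so that the number of visual tokens can be adjusted elastically at inference time. Experiments show that FMVR-LLaVA reduces the FLOPs of LLaVA-1.5-7B by 89% on 10 image benchmarks and 4 video benchmarks while keeping almost 100% of the original accuracy.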

Details

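Motivation: Large multimodal models struggle to adapt to different computational budgets because of their large number of visual tokens, and existing token-reduction methods lose visual semantics. A strategy is needed that preserves or restores visual semantics while reducing tokens.

Result: Experiments on 10 image benchmarks and 4 video benchmarks show that FMVR-LLaVA reduces the FLOPs of LLaVA-1.5-7B by 89% while maintaining almost 100% of the original accuracy.

Insight: The contribution is restoring visual semantics through frequency modulation (AvgPool and MaxPool to separate low- and high-frequency components), combined with Matryoshka Representation Learning for elastic adjustment of the visual token count. Viewed objectively, the method is a simple, plug-and-play visual token optimization scheme that balances computational efficiency and model performance.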

Abstract: Large Multimodal Models (LMMs) struggle to adapt to varying computational budgets due to numerous visual tokens. Previous methods attempted to reduce the number of visual tokens before or within LLMs. However, these strategies inevitably result in the loss of visual semantics. To address these issues, we introduce FMVR, a plug-and-play and extremely simple Frequency-Modulated Visual Restoration strategy to boost the reasoning ability of LMMs under visual token reduction. Specifically, FMVR disentangles the visual representation of fewer visual tokens into low- and high-frequency components through AvgPool and MaxPool. The derived frequencies are subsequently modulated using lightweight learnable parameters. The high-frequency from AvgPool acts as a saliency filter to enhance salient visual semantics, while the low-frequency from MaxPool acts as an anti-saliency filter to strengthen weak visual semantics. It enables the preservation of visual semantics dominated by few visual tokens and the restoration of diluted visual semantics. Additionally, we inject FMVR into Matryoshka Representation Learning to learn coarse-to-fine visual token sets, thus making it possible to elastically adjust the number of visual tokens during inference while maintaining comparable performance. Experiments across 10 image-based and 4 video-based benchmarks demonstrate that FMVR-LLaVA reduces the FLOPs of LLaVA-1.5-7B by 89%, while maintaining almost 100% of the original accuracy. The code will be open.
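The pooling step can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not say how the two pooled components are recombined, so the weighted sum below, the scalar weights `w_avg` and `w_max`, and all function names are hypothetical stand-ins for the paper's learnable modulation.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)

def pool_tokens(tokens, window, reduce_fn):
    """Reduce consecutive groups of `window` token vectors, feature by feature."""
    pooled = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        pooled.append([reduce_fn(column) for column in zip(*group)])
    return pooled

def modulated_reduce(tokens, window, w_avg, w_max):
    """Combine average-pooled and max-pooled tokens with two scalar weights.

    The weighted sum is an assumed combination rule, not the one from the paper.
    """
    avg = pool_tokens(tokens, window, mean)
    peak = pool_tokens(tokens, window, max)
    return [
        [w_avg * a + w_max * m for a, m in zip(avg_row, max_row)]
        for avg_row, max_row in zip(avg, peak)
    ]

# Four hypothetical 2-D visual tokens reduced to two tokens (2x reduction).
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 8.0]]
reduced = modulated_reduce(tokens, window=2, w_avg=0.5, w_max=0.5)
```

In a trained model the weights would be learned, and varying `window` at inference time corresponds to choosing a coarser or finer token set.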


[23] Hierarchical Granularity Alignment and State Space Modeling for Robust Multimodal AU Detection in the Wild cs.CVPDF

Jun Yu, Yunxiang Zhang, Naixiang Zheng, Lingsi Zhu, Guoyuan Wang

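TL;DR: This paper proposes a multimodal framework for robust facial action unit (AU) detection in the wild. It uses foundation models such as DINOv2 and WavLM to extract robust representations, handles extreme facial variations with a Hierarchical Granularity Alignment module, adopts a Vision-Mamba architecture for efficient ultra-long temporal modeling, and introduces an asymmetric cross-attention mechanism that deeply synchronizes paralinguistic audio cues with subtle visual movements.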

Details

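Motivation: Facial AU detection in the wild is very challenging because of severe spatial-temporal heterogeneity, unconstrained poses, and complex audio-visual dependencies. Existing methods usually rely on capacity-limited encoders and shallow fusion mechanisms, which fail to capture fine-grained semantic shifts and ultra-long temporal context.

Result: Extensive experiments on the challenging Aff-Wild2 dataset show that the method significantly outperforms existing baselines and achieves state-of-the-art performance. It also secured top rankings in the AU Detection track of the 10th Affective Behavior Analysis in-the-wild Competition.

Insight: The contributions include: replacing traditional feature extractors with powerful foundation models (DINOv2, WavLM); a Hierarchical Granularity Alignment module that dynamically aligns global semantics with fine-grained local active patches; a Vision-Mamba architecture with O(N) linear complexity for ultra-long temporal modeling; and a new asymmetric cross-attention mechanism for deep synchronization of multimodal information.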

Abstract: Facial Action Unit (AU) detection in in-the-wild environments remains a formidable challenge due to severe spatial-temporal heterogeneity, unconstrained poses, and complex audio-visual dependencies. While recent multimodal approaches have made progress, they often rely on capacity-limited encoders and shallow fusion mechanisms that fail to capture fine-grained semantic shifts and ultra-long temporal contexts. To bridge this gap, we propose a novel multimodal framework driven by Hierarchical Granularity Alignment and State Space Models. Specifically, we leverage powerful foundation models, namely DINOv2 and WavLM, to extract robust and high-fidelity visual and audio representations, effectively replacing traditional feature extractors. To handle extreme facial variations, our Hierarchical Granularity Alignment module dynamically aligns global facial semantics with fine-grained local active patches. Furthermore, we overcome the receptive field limitations of conventional temporal convolutional networks by introducing a Vision-Mamba architecture. This approach enables temporal modeling with O(N) linear complexity, effectively capturing ultra-long-range dynamics without performance degradation. A novel asymmetric cross-attention mechanism is also introduced to deeply synchronize paralinguistic audio cues with subtle visual movements. Extensive experiments on the challenging Aff-Wild2 dataset demonstrate that our approach significantly outperforms existing baselines, achieving state-of-the-art performance. Notably, this framework secured top rankings in the AU Detection track of the 10th Affective Behavior Analysis in-the-wild Competition.


[24] UniCompress: Token Compression for Unified Vision-Language Understanding and Generation cs.CVPDF

Ziyao Wang, Chen Chen, Jingtao Li, Weiming Zhuang, Jiabo Huang

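TL;DR: This paper proposes UniCompress, a unified token compression algorithm that addresses the computation and memory overhead caused by the large number of visual tokens in unified vision-language models. Using a compression and decompression mechanism guided by learnable global meta tokens, it significantly reduces the visual token count while preserving performance on both image understanding and generation tasks.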

Details

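Motivation: Unified models support both understanding and generation by encoding images into discrete tokens and processing them together with text in an autoregressive framework. The large number of visual tokens, however, causes heavy computation and memory overhead and hinders deployment in resource-constrained scenarios such as embodied AI systems.

Result: Experiments show that the method reduces image tokens by up to 4 times and achieves substantial gains in inference latency and training cost with only minimal performance degradation, demonstrating the promise of token-efficient unified modeling for real-world multimodal applications.

Insight: The contribution is a lightweight, modular, plug-in compression and decompression mechanism guided by learnable global meta tokens. It can be integrated into existing models without full retraining and offers an efficient approach to unified vision-language modeling in resource-constrained environments.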

Abstract: Unified models aim to support both understanding and generation by encoding images into discrete tokens and processing them alongside text within a single autoregressive framework. This unified design offers architectural simplicity and cross-modal synergy, which facilitates shared parameterization, consistent training objectives, and seamless transfer between modalities. However, the large number of visual tokens required by such models introduces substantial computation and memory overhead, and this inefficiency directly hinders deployment in resource constrained scenarios such as embodied AI systems. In this work, we propose a unified token compression algorithm UniCompress that significantly reduces visual token count while preserving performance on both image understanding and generation tasks. Our method introduces a plug-in compression and decompression mechanism guided with learnable global meta tokens. The framework is lightweight and modular, enabling efficient integration into existing models without full retraining. Experimental results show that our approach reduces image tokens by up to 4 times, achieves substantial gains in inference latency and training cost, and incurs only minimal performance degradation, which demonstrates the promise of token-efficient unified modeling for real world multimodal applications.


[25] Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning cs.CV | cs.GR | cs.ROPDF

Yuto Shibata, Kashu Yamazaki, Lalit Jayanti, Yoshimitsu Aoki, Mariko Isogawa

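TL;DR: This paper proposes AssistMimic, the first method to jointly train supporter and recipient policies in a physics simulator through a multi-agent reinforcement learning framework, in order to imitate and track human assistive motion sequences that involve close interaction and force exchange.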

Details

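Motivation: Existing general motion tracking methods are mostly limited to contact-less social interactions or isolated movements. Assistive scenarios require continuous awareness of the human partner and rapid adaptation to their changing posture, so the problem of physically grounded, socially aware imitation of assistive motion for humanoids needs to be addressed.

Result: On established benchmarks, AssistMimic is the first method to successfully track assistive interaction motions, demonstrating the benefits of multi-agent reinforcement learning for physically grounded and socially aware humanoid control.

Insight: The contributions include: formulating the imitation of closely interacting human motion as a multi-agent reinforcement learning problem; a partner policy initialization scheme that transfers priors from single-human motion-tracking controllers to improve exploration; and dynamic reference retargeting together with a contact-promoting reward, which adapt the assistant's reference motion to the recipient's real-time pose and encourage physically meaningful support.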

Abstract: Humanoid robotics has strong potential to transform daily service and caregiving applications. Although recent advances in general motion tracking within physics engines (GMT) have enabled virtual characters and humanoid robots to reproduce a broad range of human motions, these behaviors are primarily limited to contact-less social interactions or isolated movements. Assistive scenarios, by contrast, require continuous awareness of a human partner and rapid adaptation to their evolving posture and dynamics. In this paper, we formulate the imitation of closely interacting, force-exchanging human-human motion sequences as a multi-agent reinforcement learning problem. We jointly train partner-aware policies for both the supporter (assistant) agent and the recipient agent in a physics simulator to track assistive motion references. To make this problem tractable, we introduce a partner policies initialization scheme that transfers priors from single-human motion-tracking controllers, greatly improving exploration. We further propose dynamic reference retargeting and contact-promoting reward, which adapt the assistant’s reference motion to the recipient’s real-time pose and encourage physically meaningful support. We show that AssistMimic is the first method capable of successfully tracking assistive interaction motions on established benchmarks, demonstrating the benefits of a multi-agent RL formulation for physically grounded and socially aware humanoid control.


[26] DriveXQA: Cross-modal Visual Question Answering for Adverse Driving Scene Understanding cs.CVPDF

Mingzhe Tao, Ruiping Liu, Junwei Zheng, Yufan Chen, Kedi Ying

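TL;DR: This paper proposes the DriveXQA dataset and the MVX-LLM model for cross-modal visual question answering under adverse weather and sensor failures in autonomous driving. The dataset contains four visual modalities, five sensor failure cases, five weather conditions, and more than 100,000 QA pairs. The model uses a Dual Cross-Attention projector to fuse multimodal information, reducing redundancy and improving performance.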

Details

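Motivation: Multimodal large language models are underexplored for using multi-sensor information to understand adverse autonomous driving scenes, and both dedicated datasets and architectures that fuse multimodal information efficiently are lacking.

Result: Under challenging conditions such as fog, the proposed DCA method reaches a GPTScore of 53.5, clearly better than the baseline's 25.1.

Insight: The contributions are the first multimodal VQA dataset targeting adverse autonomous driving scenes and an efficient DCA projector for multimodal fusion that reduces information redundancy, offering a new approach to scene understanding under adverse conditions.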

Abstract: Fusing sensors with complementary modalities is crucial for maintaining a stable and comprehensive understanding of abnormal driving scenes. However, Multimodal Large Language Models (MLLMs) are underexplored for leveraging multi-sensor information to understand adverse driving scenarios in autonomous vehicles. To address this gap, we propose DriveXQA, a multimodal dataset for autonomous driving VQA. In addition to four visual modalities, five sensor failure cases, and five weather conditions, it includes 102,505 QA pairs categorized into three types: global scene level, allocentric level, and ego-vehicle centric level. Since no existing MLLM framework adopts multiple complementary visual modalities as input, we design MVX-LLM, a token-efficient architecture with a Dual Cross-Attention (DCA) projector that fuses the modalities to alleviate information redundancy. Experiments demonstrate that our DCA achieves improved performance under challenging conditions such as fog (GPTScore: 53.5 vs. 25.1 for the baseline). The established dataset and source code will be made publicly available.


[27] DeepHistoViT: An Interpretable Vision Transformer Framework for Histopathological Cancer Classification cs.CVPDF

Ravi Mosalpuri, Mohammed Abdelsamea, Ahmed Karam Eldaly

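TL;DR: This paper proposes DeepHistoViT, an interpretable Vision Transformer-based framework for automated classification of histopathological images. The model uses a customized attention mechanism to capture fine-grained cellular structures and localizes diagnostically relevant regions to improve interpretability. Evaluated on three public datasets covering lung cancer, colon cancer, and acute lymphoblastic leukaemia, it achieves state-of-the-art performance on all of them.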

Details

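Motivation: Manual histopathological examination is time-consuming, labour-intensive, and subject to inter-observer variability, so reliable computer-assisted diagnostic tools are needed. Deep learning, especially transformer-based architectures, shows promise for modelling complex spatial dependencies in medical images, but interpretability must improve to support clinical decision-making.

Result: On the lung and colon cancer datasets, classification accuracy, precision, recall, F1-score, and ROC-AUC all reach 100%. On the acute lymphoblastic leukaemia dataset, these metrics reach 99.85%, 99.84%, 99.86%, 99.85%, and 99.99% respectively (all reported with 95% confidence intervals). The method achieves state-of-the-art performance on all datasets.

Insight: The contribution is combining a customized Vision Transformer architecture with an integrated attention mechanism, aiming to capture fine-grained cellular structures while providing interpretability through attention-based localization, which matters for building trustworthy clinical support tools. Viewed objectively, the near-perfect performance across datasets for several cancer types supports the effectiveness of transformers in histopathological analysis and underlines the importance of interpretable design in medical AI.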

Abstract: Histopathology remains the gold standard for cancer diagnosis because it provides detailed cellular-level assessment of tissue morphology. However, manual histopathological examination is time-consuming, labour-intensive, and subject to inter-observer variability, creating a demand for reliable computer-assisted diagnostic tools. Recent advances in deep learning, particularly transformer-based architectures, have shown strong potential for modelling complex spatial dependencies in medical images. In this work, we propose DeepHistoViT, a transformer-based framework for automated classification of histopathological images. The model employs a customized Vision Transformer architecture with an integrated attention mechanism designed to capture fine-grained cellular structures while improving interpretability through attention-based localization of diagnostically relevant regions. The framework is evaluated on three publicly available histopathology datasets covering lung cancer, colon cancer, and acute lymphoblastic leukaemia. Experimental results demonstrate state-of-the-art performance across all datasets, with classification accuracy, precision, recall, F1-score, and ROC-AUC reaching 100 percent on the lung and colon cancer datasets, and 99.85 percent, 99.84 percent, 99.86 percent, 99.85 percent, and 99.99 percent respectively on the acute lymphoblastic leukaemia dataset. All performance metrics are reported with 95 percent confidence intervals. These results highlight the effectiveness of transformer-based architectures for histopathological image analysis and demonstrate the potential of DeepHistoViT as an interpretable computer-assisted diagnostic tool to support pathologists in clinical decision-making.


[28] Seeing Isn’t Orienting: A Cognitively Grounded Benchmark Reveals Systematic Orientation Failures in MLLMs Supplementary cs.CVPDF

Nazia Tasnim, Keanu Nichols, Yuting Yang, Nicholas Ikechukwu, Elva Zou

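TL;DR: This paper proposes DORI, a cognitively grounded hierarchical benchmark that specifically evaluates how well multimodal large language models understand object orientation. The benchmark decomposes orientation into four dimensions, each evaluated at a coarse (categorical) and a granular (metric) level. Tests on 24 state-of-the-art models show near-random performance on object-centric orientation tasks, exposing geometric reasoning deficits hidden by existing benchmarks.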

Details

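Motivation: Current vision-language benchmarks largely conflate object orientation with position and general scene understanding, whereas human orientation cognition develops progressively. The authors build DORI to isolate orientation understanding as a specific ability and to expose systematic model failures in it.

Result: Among the 24 SOTA vision-language models evaluated on DORI, the best models reach only 54.2% accuracy on coarse judgments and 45.0% on granular judgments. Failures are largest on compound rotations and on shifts in inter-object reference frames, and models that perform well on general spatial benchmarks are near-random on DORI.

Insight: The contribution is a cognitively grounded, hierarchical benchmark that uses bounding-box isolation, standardized spatial reference frames, and structured prompts to separate orientation understanding from confounds such as object recognition difficulty, scene clutter, and linguistic ambiguity. The core finding is that current models rely heavily on categorical heuristics rather than geometric reasoning, a limitation masked by existing benchmarks.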

Abstract: Humans learn object orientation progressively, from recognizing which way an object faces, to mentally rotating it, to reasoning about orientations between objects. Current vision-language benchmarks largely conflate orientation with position and general scene understanding. We introduce Discriminative Orientation Reasoning Intelligence (DORI), a cognitively grounded hierarchical benchmark that makes object orientation the primary target. Inspired by stages of human orientation cognition, DORI decomposes orientation into four dimensions, each evaluated at coarse (categorical) and granular (metric) levels. Composed from 13,652 images across 14 sources, DORI provides 33,656 multiple-choice questions covering 67 object categories in real-world and synthetic settings. Its coarse-to-granular design isolates orientation from confounds such as object recognition difficulty, scene clutter, and linguistic ambiguity via bounding-box isolation, standardized spatial reference frames, and structured prompts. Evaluating 24 state-of-the-art vision-language models shows a clear pattern: models that perform well on general spatial benchmarks are near-random on object-centric orientation tasks. The best models reach only 54.2% on coarse and 45.0% on granular judgments, with largest failures on compound rotations and shifts in inter-object reference frames. Large coarse-to-granular gaps reveal reliance on categorical heuristics rather than geometric reasoning, a limitation hidden by existing benchmarks. These results identify orientation understanding as an unsolved challenge for multimodal systems, with implications for robotic manipulation, 3D scene reconstruction, and human-AI interaction.


[29] ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation cs.CVPDF

Songlin Yang, Zhe Wang, Xuyi Yang, Songchun Zhang, Xianghao Kong

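TL;DR: ShotVerse is a "Plan-then-Control" framework for text-driven multi-shot video generation that targets cinematic camera control. It decouples generation into two collaborative agents: a Planner based on a vision-language model (VLM) that produces globally aligned cinematic camera trajectories from text, and a Controller that renders these trajectories into multi-shot video content via a camera adapter. At the core of the framework is a data foundation consisting of an automated multi-shot camera calibration pipeline and ShotVerse-Bench, a high-fidelity cinematic dataset used for evaluation.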

Details

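Motivation: Camera control is a bottleneck for text-driven video generation in cinematic multi-shot scenarios: implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in current models.

Result: Extensive experiments show that ShotVerse bridges the gap between unreliable textual control and labor-intensive manual planning. It achieves superior cinematic aesthetics and generates videos that are both accurate in camera trajectory and consistent across shots.

Insight: The core contribution is a data-centric paradigm shift: aligned (caption, trajectory, video) triplets are treated as an inherent joint distribution that can connect automated planning and precise execution. This is realized by building an automated calibration pipeline and a high-quality benchmark dataset (ShotVerse-Bench), which give the framework a solid foundation.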

Abstract: Text-driven video generation has democratized film creation, but camera control in cinematic multi-shot scenarios remains a significant block. Implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in current models. To overcome this bottleneck, we propose a data-centric paradigm shift, positing that aligned (Caption, Trajectory, Video) triplets form an inherent joint distribution that can connect automated plotting and precise execution. Guided by this insight, we present ShotVerse, a “Plan-then-Control” framework that decouples generation into two collaborative agents: a VLM (Vision-Language Model)-based Planner that leverages spatial priors to obtain cinematic, globally aligned trajectories from text, and a Controller that renders these trajectories into multi-shot video content via a camera adapter. Central to our approach is the construction of a data foundation: we design an automated multi-shot camera calibration pipeline that aligns disjoint single-shot trajectories into a unified global coordinate system. This facilitates the curation of ShotVerse-Bench, a high-fidelity cinematic dataset with a three-track evaluation protocol that serves as the bedrock for our framework. Extensive experiments demonstrate that ShotVerse effectively bridges the gap between unreliable textual control and labor-intensive manual plotting, achieving superior cinematic aesthetics and generating multi-shot videos that are both camera-accurate and cross-shot consistent.


[30] Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding cs.CVPDF

Songlin Li, Xin Zhu, Zechao Guan, Peipeng Chen, Jian Yao

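TL;DR: This paper proposes R-MSD (Reliable Multi-Sample Distillation), a framework that addresses the high-variance responses and format inconsistencies that arise when black-box distillation of large vision-language models (LVLMs) relies on a single teacher sample. It builds a task-adaptive teacher pool to provide robust supervision and combines quality-aware signal matching with an adversarial distillation objective, filtering teacher noise while maximizing knowledge transfer.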

Details

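Motivation: Traditional black-box distillation usually relies on a single teacher response per input, which tends to produce high-variance responses and format inconsistencies in multimodal or temporal scenarios and therefore gives unreliable supervision. The paper aims to make distillation more stable by explicitly modeling teacher sampling variance.

Result: On several video understanding benchmarks (VideoMME, Video-MMMU, MathVerse), R-MSD with a 4B student model achieves clear gains over single-sample distillation (+1.5%, +3.2%, and +3.6% respectively) and outperforms the original SFT+RL baseline trained under the same budget.

Insight: The contribution is a multi-sample distillation framework that provides robust supervision through a task-adaptive teacher pool and filters noise with quality-aware matching and an adversarial objective. Viewed objectively, the method moves distillation from relying on one uncertain sample to exploiting the statistics of multiple samples, which makes the supervision signal more reliable, especially for multimodal and temporal data.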

Abstract: Traditional black-box distillation for Large Vision-Language Models (LVLMs) typically relies on a single teacher response per input, which often yields high-variance responses and format inconsistencies in multimodal or temporal scenarios. To mitigate this unreliable supervision, we propose R-MSD (Reliable Multi-Sample Distillation), a framework that explicitly models teacher sampling variance to enhance distillation stability. Rather than relying on a single teacher response, our approach leverages a task-adaptive teacher pool to provide robust supervision tailored to both closed-ended and open-ended reasoning. By integrating quality-aware signal matching with an adversarial distillation objective, our approach effectively filters teacher noise while maximizing knowledge transfer. Extensive evaluations across comprehensive video understanding benchmarks demonstrate that R-MSD consistently outperforms single sample distillation methods. We additionally include an original SFT+RL 4B baseline under the same training budget, which shows only marginal gains, while our method achieves significant improvements. With a 4B student model, our approach delivers gains on VideoMME (+1.5%), Video-MMMU (+3.2%), and MathVerse (+3.6%).


[31] Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning cs.CVPDF

Seung Hyup Baek, Jimin Lee, Hyeongkeun Lee, Jae Won Cho

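TL;DR: This paper proposes a new method for dense video captioning (DVC) that separates localization and captioning with role-specific queries, introduces an overlap suppression loss to reduce temporal redundancy, and uses a lightweight module that captures core event concepts to enrich caption semantics.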

Details

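Motivation: In existing query-based DVC frameworks, shared queries cause multi-task interference between localization and captioning, as well as temporal redundancy in the localization results.

Result: Extensive experiments on the two major DVC benchmarks, YouCook2 and ActivityNet Captions, demonstrate the effectiveness of the method.

Insight: The core contributions are the separation of roles through role-specific queries, an overlap penalty that suppresses temporal redundancy, and a lightweight module that enriches semantics through concept-level representations. Together they reduce task interference and improve both localization precision and caption quality.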

Abstract: Dense Video Captioning (DVC) is a challenging multimodal task that involves temporally localizing multiple events within a video and describing them with natural language. While query-based frameworks enable the simultaneous, end-to-end processing of localization and captioning, their reliance on shared queries often leads to significant multi-task interference between the two tasks, as well as temporal redundancy in localization. In this paper, we propose utilizing role-specific queries that separate localization and captioning into independent components, allowing each to exclusively learn its role. We then employ contrastive alignment to enforce semantic consistency between the corresponding outputs, ensuring coherent behavior across the separated queries. Furthermore, we design a novel suppression mechanism in which mutual temporal overlaps across queries are penalized to tackle temporal redundancy, supervising the model to learn distinct, non-overlapping event regions for more precise localization. Additionally, we introduce a lightweight module that captures core event concepts to further enhance semantic richness in captions through concept-level representations. We demonstrate the effectiveness of our method through extensive experiments on major DVC benchmarks YouCook2 and ActivityNet Captions.
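One plausible way to penalize mutual temporal overlap across queries is to average the pairwise temporal IoU of the predicted segments. The sketch below assumes that form; the abstract does not give the actual loss, so `overlap_penalty` and the example segments are hypothetical and stand in for whatever differentiable formulation the paper uses.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals on the time axis."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0

def overlap_penalty(segments):
    """Mean pairwise temporal IoU; 0.0 when all predicted segments are disjoint."""
    pairs = [
        (i, j)
        for i in range(len(segments))
        for j in range(i + 1, len(segments))
    ]
    if not pairs:
        return 0.0
    return sum(temporal_iou(segments[i], segments[j]) for i, j in pairs) / len(pairs)

# Hypothetical predictions in seconds: one redundant set, one well-separated set.
redundant = overlap_penalty([(0.0, 10.0), (5.0, 15.0)])
separated = overlap_penalty([(0.0, 10.0), (10.0, 20.0), (25.0, 30.0)])
```

Adding such a term to the training objective would push different queries toward distinct event regions, which matches the stated goal of non-overlapping, more precise localization.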


[32] Detect Anything in Real Time: From Single-Prompt Segmentation to Multi-Class Detection cs.CVPDF

Mehmet Kerem Turkcan

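TL;DR: This paper proposes DART, a framework that converts SAM3 from a single-prompt segmentation system into a real-time multi-class detector. By sharing visual backbone computation, decoding multiple classes in a batch, and running detection-only inference, it greatly increases inference speed and achieves accurate real-time detection on COCO.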

Details

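Motivation: Existing promptable detection systems based on vision-language models (such as SAM3) handle only one text prompt per forward pass. Detecting several classes requires repeating the backbone computation, so cost grows linearly with the number of classes and real-time multi-class detection is out of reach.

Result: On COCO val2017 (80 classes), DART achieves 55.8 AP at 15.8 FPS (4 classes, 1008x1008 resolution) on a single RTX 4080, surpassing purpose-built open-vocabulary detectors. For tighter latency targets, adapter distillation reaches 38.7 AP with a 13.9 ms backbone.

Insight: The core idea is to exploit a structural invariant, namely that the visual backbone is independent of the text prompt, to reduce backbone cost from O(N) to O(1). Combined with batched decoding and deployment optimizations, this gives a training-free speedup. The approach is a useful engineering reference for deploying existing large models efficiently.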

Abstract: Recent advances in vision-language modeling have produced promptable detection and segmentation systems that accept arbitrary natural language queries at inference time. Among these, SAM3 achieves state-of-the-art accuracy by combining a ViT-H/14 backbone with cross-modal transformer decoding and learned object queries. However, SAM3 processes a single text prompt per forward pass. Detecting N categories requires N independent executions, each dominated by the 439M-parameter backbone. We present Detect Anything in Real Time (DART), a training-free framework that converts SAM3 into a real-time multi-class detector by exploiting a structural invariant: the visual backbone is class-agnostic, producing image features independent of the text prompt. This allows the backbone computation to be shared between all classes, reducing its cost from O(N) to O(1). Combined with batched multi-class decoding, detection-only inference, and TensorRT FP16 deployment, these optimizations yield 5.6x cumulative speedup at 3 classes, scaling to 25x at 80 classes, without modifying any model weight. On COCO val2017 (5,000 images, 80 classes), DART achieves 55.8 AP at 15.8 FPS (4 classes, 1008x1008) on a single RTX 4080, surpassing purpose-built open-vocabulary detectors trained on millions of box annotations. For extreme latency targets, adapter distillation with a frozen encoder-decoder achieves 38.7 AP with a 13.9 ms backbone. Code and models are available at https://github.com/mkturkcan/DART.
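The O(N) to O(1) reduction follows from calling the prompt-independent backbone once and reusing its output for every class. The sketch below is a toy illustration, not SAM3 or DART code: `CountingBackbone`, `decode`, and the data are hypothetical placeholders chosen only to make the call counts visible.

```python
class CountingBackbone:
    """Stand-in for a class-agnostic visual backbone that counts its invocations."""

    def __init__(self):
        self.calls = 0

    def __call__(self, image):
        self.calls += 1
        return sum(image) / len(image)  # placeholder "features"

def decode(features, prompt):
    """Placeholder for prompt-conditioned decoding on top of shared features."""
    return (prompt, features)

def detect_per_prompt(backbone, image, prompts):
    """Baseline: re-run the backbone for every prompt, so cost is O(N) in prompts."""
    return [decode(backbone(image), prompt) for prompt in prompts]

def detect_shared(backbone, image, prompts):
    """Shared-backbone variant: one backbone pass regardless of the number of prompts."""
    features = backbone(image)
    return [decode(features, prompt) for prompt in prompts]

image = [0.2, 0.4]
prompts = ["cat", "dog", "car"]

naive_backbone = CountingBackbone()
naive_out = detect_per_prompt(naive_backbone, image, prompts)

shared_backbone = CountingBackbone()
shared_out = detect_shared(shared_backbone, image, prompts)
```

Because the backbone output does not depend on the prompt, both variants return the same detections while the shared variant runs the expensive part once; only the decoder cost still scales with the number of classes.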


[33] Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning cs.CVPDF

Seung hee Choi, MinJu Jeon, Hyunwoo Oh, Jihwan Lee, Dong-Jin Kim

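TL;DR: This paper proposes STaRC, a retrieval-augmented dense video captioning (DVC) framework that improves event boundary alignment by supervising frame-level saliency. It trains a highlight detection module on binary labels derived from DVC ground truth annotations, with no extra annotation, and uses the saliency scores as a unified temporal signal that guides saliency-based segmentation and informs caption generation through explicit Saliency Prompts injected into the decoder, improving both retrieval and captioning accuracy.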

Details

Motivation: 现有检索增强的DVC方法通常依赖启发式策略,忽略了真实事件边界,导致时间分割与真实事件边界对齐不准确。本文旨在通过监督帧级显著性来克服这一限制,以生成更符合实际事件转换的时间连贯片段。

Result: 在YouCook2和ViTT基准测试上进行了全面评估,STaRC在大多数指标上达到了最先进的性能(SOTA)。

Insight: 创新点在于引入高光检测模块来监督帧级显著性,该模块仅使用DVC真实标注的二元标签训练,无需额外标注成本;同时,将显著性分数作为统一的时序信号,同时驱动显著性引导的分割和通过显式显著性提示注入解码器的描述生成,从而强制进行显著性约束的分割,实现更准确的事件边界对齐和上下文相关的描述生成。

Abstract: Existing retrieval-augmented approaches for Dense Video Captioning (DVC) often fail to achieve accurate temporal segmentation aligned with true event boundaries, as they rely on heuristic strategies that overlook ground truth event boundaries. The proposed framework, STaRC, overcomes this limitation by supervising frame-level saliency through a highlight detection module. Note that the highlight detection module is trained on binary labels derived directly from DVC ground truth annotations without the need for additional annotation. We also propose to utilize the saliency scores as a unified temporal signal that drives retrieval via saliency-guided segmentation and informs caption generation through explicit Saliency Prompts injected into the decoder. By enforcing saliency-constrained segmentation, our method produces temporally coherent segments that align closely with actual event transitions, leading to more accurate retrieval and contextually grounded caption generation. We conduct comprehensive evaluations on the YouCook2 and ViTT benchmarks, where STaRC achieves state-of-the-art performance across most of the metrics. Our code is available at https://github.com/ermitaju1/STaRC
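As a rough illustration of the two ideas in this abstract, the sketch below derives binary frame labels from annotated event spans and then groups thresholded saliency scores into segments. The function names, the fixed threshold, and the frame-rate convention are assumptions made for illustration; the paper's actual label construction and segmentation procedure may differ.

```python
def saliency_labels(events, num_frames, fps):
    """Binary per-frame labels: 1 if the frame falls inside any annotated event.

    `events` is a list of (start_seconds, end_seconds) pairs (assumed format).
    """
    labels = [0] * num_frames
    for start_s, end_s in events:
        first = max(0, int(start_s * fps))
        last = min(num_frames, int(end_s * fps) + 1)
        for i in range(first, last):
            labels[i] = 1
    return labels

def segments_from_scores(scores, threshold=0.5):
    """Group consecutive frames with saliency of at least `threshold` into segments."""
    segments, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(scores) - 1))
    return segments
```

For example, `saliency_labels([(1.0, 2.0)], num_frames=5, fps=1)` marks frames 1 and 2, and `segments_from_scores([0.1, 0.9, 0.8, 0.2, 0.7])` returns the inclusive frame ranges `(1, 2)` and `(4, 4)`.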


[34] INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs cs.CV | cs.AI

Junqi Yang, Yuecong Min, Jie Zhang, Shiguang Shan, Xilin Chen

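TL;DR: This paper introduces INFACT, a diagnostic benchmark for evaluating the reliability of Video-LLMs with respect to faithfulness and factuality hallucinations. It comprises 9,800 QA instances spanning real and synthetic videos and tests models in four modes.

Details

Motivation: Existing benchmarks offer limited coverage of factuality hallucinations and evaluate mainly in clean settings, while Video-LLMs still hallucinate outputs that contradict video evidence or world knowledge. A more comprehensive diagnostic tool is needed.

Result: Experiments on 14 representative Video-LLMs show that higher Base-mode accuracy does not reliably translate to higher reliability in the induced modes: evidence corruption reduces stability, temporal intervention causes the largest degradation, and many open-source baselines score near zero on Temporal Sensitivity Score for factuality questions.

Insight: The contribution is a benchmark with fine-grained faithfulness and factuality taxonomies, together with induced modes (visual degradation, evidence corruption, temporal intervention) and quantitative metrics (Resist Rate, Temporal Sensitivity Score) for systematically assessing reliability. It reveals pronounced temporal inertia on order-sensitive questions.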

Abstract: Despite rapid progress, Video Large Language Models (Video-LLMs) remain unreliable due to hallucinations, which are outputs that contradict either video evidence (faithfulness) or verifiable world knowledge (factuality). Existing benchmarks provide limited coverage of factuality hallucinations and predominantly evaluate models only in clean settings. We introduce INFACT, a diagnostic benchmark comprising 9,800 QA instances with fine-grained taxonomies for faithfulness and factuality, spanning real and synthetic videos. INFACT evaluates models in four modes: Base (clean), Visual Degradation, Evidence Corruption, and Temporal Intervention for order-sensitive items. Reliability under induced modes is quantified using Resist Rate (RR) and Temporal Sensitivity Score (TSS). Experiments on 14 representative Video-LLMs reveal that higher Base-mode accuracy does not reliably translate to higher reliability in the induced modes, with evidence corruption reducing stability and temporal intervention yielding the largest degradation. Notably, many open-source baselines exhibit near-zero TSS on factuality, indicating pronounced temporal inertia on order-sensitive questions.


[35] FBCIR: Balancing Cross-Modal Focuses in Composed Image Retrieval cs.CV | cs.AI

Chenchen Zhao, Jianhuan Zhuo, Muxi Chen, Zhaohua Zhang, Wenyu Jiang

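TL;DR: This paper proposes FBCIR, a multi-modal focus interpretation method for composed image retrieval (CIR) that diagnoses focus imbalance in cross-modal reasoning. The study finds that under hard-negative settings, existing CIR models often attend disproportionately to one modality and neglect the other, which degrades performance. Building on this, the authors propose a data augmentation workflow that uses curated hard negatives to encourage balanced cross-modal reasoning and improve robustness in challenging scenarios.

Details

Motivation: Existing CIR models perform well on standard benchmarks, but their performance drops markedly under hard-negative settings, where negative candidates are semantically aligned with the query image or text. The authors attribute the drop to focus imbalance: models attend disproportionately to one modality while neglecting the other.

Result: Extensive experiments show that the proposed augmentation workflow consistently improves multiple CIR models in challenging scenarios while preserving their capabilities on standard benchmarks.

Insight: The contributions are (1) FBCIR, an interpretability method for diagnosing cross-modal focus imbalance in CIR models, and (2) a CIR-specific data augmentation workflow that introduces curated hard negatives to steer models toward more balanced cross-modal reasoning and greater robustness. Together they offer a new perspective on CIR model diagnosis and improvement.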

Abstract: Composed image retrieval (CIR) requires multi-modal models to jointly reason over visual content and semantic modifications presented in text-image input pairs. While current CIR models achieve strong performance on common benchmark cases, their accuracy often degrades in more challenging scenarios where negative candidates are semantically aligned with the query image or text. In this paper, we attribute this degradation to focus imbalances, where models disproportionately attend to one modality while neglecting the other. To validate this claim, we propose FBCIR, a multi-modal focus interpretation method that identifies the visual and textual input components most crucial to a model’s retrieval decisions. Using FBCIR, we report that focus imbalances are prevalent in existing CIR models, especially under hard negative settings. Building on the analyses, we further propose a CIR data augmentation workflow that augments existing CIR datasets with curated hard negatives designed to encourage balanced cross-modal reasoning. Extensive experiments across multiple CIR models demonstrate that the proposed augmentation consistently improves performance in challenging cases, while maintaining their capabilities on standard benchmarks. Together, our interpretation method and data augmentation workflow provide a new perspective on CIR model diagnosis and robustness improvements.


[36] MDS-VQA: Model-Informed Data Selection for Video Quality Assessment cs.CV

Jian Zou, Xiaoyu Xu, Zhihua Wang, Yilin Wang, Balu Adsumilli

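TL;DR: This paper proposes MDS-VQA, a model-informed data selection mechanism for video quality assessment (VQA). It curates datasets by actively selecting unlabeled videos that are both difficult for the base VQA model and diverse in content, bridging the gap between model design and data curation.

Details

Motivation: Progress in learning-based VQA is constrained by a disconnect between model design and dataset construction. Model-centric approaches usually iterate on fixed benchmarks, while data-centric efforts collect new human labels without systematically targeting the weaknesses of existing VQA models.

Result: Experiments across multiple VQA datasets and models show that fine-tuning on only a 5% selected subset per target domain improves performance substantially: the mean Spearman rank correlation coefficient (SRCC) rises from 0.651 to 0.722, and the model achieves the top gMAD rank, indicating strong adaptation and generalization.

Insight: The novelty is an active data selection mechanism that combines difficulty (estimated by a failure predictor trained with a ranking objective) with diversity (measured using deep semantic video features) and balances the two with a greedy procedure under a constrained labeling budget, efficiently identifying the samples most informative for fine-tuning.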

Abstract: Learning-based video quality assessment (VQA) has advanced rapidly, yet progress is increasingly constrained by a disconnect between model design and dataset curation. Model-centric approaches often iterate on fixed benchmarks, while data-centric efforts collect new human labels without systematically targeting the weaknesses of existing VQA models. Here, we describe MDS-VQA, a model-informed data selection mechanism for curating unlabeled videos that are both difficult for the base VQA model and diverse in content. Difficulty is estimated by a failure predictor trained with a ranking objective, and diversity is measured using deep semantic video features, with a greedy procedure balancing the two under a constrained labeling budget. Experiments across multiple VQA datasets and models demonstrate that MDS-VQA identifies diverse, challenging samples that are particularly informative for active fine-tuning. With only a 5% selected subset per target domain, the fine-tuned model improves mean SRCC from 0.651 to 0.722 and achieves the top gMAD rank, indicating strong adaptation and generalization.
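One way to picture a greedy procedure that balances difficulty and diversity under a budget is the sketch below. It is an assumption-laden illustration, not the MDS-VQA algorithm: the additive score, the `trade_off` weight, and the use of Euclidean distance to the nearest selected item as the diversity term are all choices made here for clarity.

```python
import numpy as np

def greedy_select(difficulty, features, budget, trade_off=1.0):
    """Greedily pick `budget` items (budget >= 1), scoring each candidate by its
    predicted difficulty plus its distance to the nearest already-selected item."""
    difficulty = np.asarray(difficulty, dtype=float)
    features = np.asarray(features, dtype=float)
    n = len(difficulty)
    selected = [int(np.argmax(difficulty))]  # start from the hardest item
    min_dist = np.linalg.norm(features - features[selected[0]], axis=1)
    while len(selected) < min(budget, n):
        score = difficulty + trade_off * min_dist
        score[selected] = -np.inf  # never pick an item twice
        pick = int(np.argmax(score))
        selected.append(pick)
        # Update each item's distance to its nearest selected neighbour.
        dist_to_pick = np.linalg.norm(features - features[pick], axis=1)
        min_dist = np.minimum(min_dist, dist_to_pick)
    return selected
```

With `trade_off=0` the procedure reduces to picking the hardest items; larger values favour items far from what has already been chosen, even if they are predicted to be easier.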


[37] Risk-Controllable Multi-View Diffusion for Driving Scenario Generation cs.CV

Hongyi Lin, Wenxiu Shi, Heye Huang, Dingyi Zhuang, Song Zhang

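TL;DR: This paper proposes RiskMV-DPO, a risk-controllable multi-view diffusion model for driving scenario generation. By integrating target risk levels with physically grounded risk modeling, it autonomously synthesizes diverse, high-risk dynamic trajectories that serve as geometric anchors for a diffusion-based video generator. It also introduces a geometry-appearance alignment module and a region-aware direct preference optimization strategy to ensure spatial-temporal coherence and geometric fidelity.

Details

Motivation: Existing generative approaches typically treat risk as an after-the-fact label and struggle to maintain geometric consistency when generating multi-view driving scenes, while long-tail risky scenarios are rare in real-world data and hard to design by hand.

Result: Experiments on nuScenes show that the method can freely generate diverse long-tail scenarios while maintaining state-of-the-art visual quality, improving 3D detection mAP from 18.17 to 30.50 and reducing FID to 15.70.

Insight: The novelty lies in integrating risk control and physical modeling into the generation pipeline. Explicit geometric anchors and a region-aware preference optimization strategy enable proactive, controllable synthesis of risky scenarios, shifting world models from passive environment prediction to proactive, risk-controllable synthesis.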

Abstract: Generating safety-critical driving scenarios is crucial for evaluating and improving autonomous driving systems, but long-tail risky situations are rarely observed in real-world data and difficult to specify through manual scenario design. Existing generative approaches typically treat risk as an after-the-fact label and struggle to maintain geometric consistency in multi-view driving scenes. We present RiskMV-DPO, a general and systematic pipeline for physically-informed, risk-controllable multi-view scenario generation. By integrating target risk levels with physically-grounded risk modeling, we autonomously synthesize diverse and high-stakes dynamic trajectories that serve as explicit geometric anchors for a diffusion-based video generator. To ensure spatial-temporal coherence and geometric fidelity, we introduce a geometry-appearance alignment module and a region-aware direct preference optimization (RA-DPO) strategy with motion-aware masking to focus learning on localized dynamic regions. Experiments on the nuScenes dataset show that RiskMV-DPO can freely generate a wide spectrum of diverse long-tail scenarios while maintaining state-of-the-art visual quality, improving 3D detection mAP from 18.17 to 30.50 and reducing FID to 15.70. Our work shifts the role of world models from passive environment prediction to proactive, risk-controllable synthesis, providing a scalable toolchain for the safety-oriented development of embodied intelligence.


[38] ReHARK: Refined Hybrid Adaptive RBF Kernels for Robust One-Shot Vision-Language Adaptation cs.CV | cs.AI

Md Jahidul Islam

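TL;DR: This paper proposes ReHARK, a training-free framework that addresses the "stability-plasticity" dilemma faced by large-scale vision-language models (such as CLIP) when adapting to downstream tasks in the one-shot regime. By constructing a hybrid prior, augmenting the support set, adaptively rectifying the distribution, and using multi-scale RBF kernels, it applies global regularization in a reproducing kernel Hilbert space and achieves state-of-the-art one-shot adaptation on 11 benchmarks.

Details

Motivation: Existing training-free adaptation methods such as Tip-Adapter are essentially local Nadaraya-Watson estimators, which have inherent boundary bias and lack global structural regularization, making stable one-shot adaptation difficult. ReHARK aims to overcome these limitations with a global proximal regularization framework.

Result: In extensive experiments on 11 diverse benchmarks, ReHARK achieves an average accuracy of 65.83%, significantly outperforming existing baselines and setting a new state of the art (SOTA) for one-shot vision-language adaptation.

Insight: The novelty is to reinterpret few-shot adaptation as global proximal regularization in a reproducing kernel Hilbert space and to design a multistage refinement pipeline that fuses zero-shot textual knowledge, visual prototypes, support set augmentation, and multi-scale kernels to jointly mitigate the modality gap and domain shift while remaining computationally efficient.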

Abstract: The adaptation of large-scale Vision-Language Models (VLMs) like CLIP to downstream tasks with extremely limited data – specifically in the one-shot regime – is often hindered by a significant “Stability-Plasticity” dilemma. While efficient caching mechanisms have been introduced by training-free methods such as Tip-Adapter, these approaches often function as local Nadaraya-Watson estimators. Such estimators are characterized by inherent boundary bias and a lack of global structural regularization. In this paper, ReHARK (Refined Hybrid Adaptive RBF Kernels) is proposed as a synergistic training-free framework that reinterprets few-shot adaptation through global proximal regularization in a Reproducing Kernel Hilbert Space (RKHS). A multistage refinement pipeline is introduced, consisting of: (1) Hybrid Prior Construction, where zero-shot textual knowledge from CLIP and GPT-3 is fused with visual class prototypes to form a robust semantic-visual anchor; (2) Support Set Augmentation (Bridging), where intermediate samples are generated to smooth the transition between visual and textual modalities; (3) Adaptive Distribution Rectification, where test feature statistics are aligned with the augmented support set to mitigate domain shifts; and (4) Multi-Scale RBF Kernels, where an ensemble of kernels is employed to capture complex feature geometries across diverse scales. Superior stability and accuracy are demonstrated through extensive experiments on 11 diverse benchmarks. A new state-of-the-art for one-shot adaptation is established by ReHARK, which achieves an average accuracy of 65.83%, significantly outperforming existing baselines. Code is available at https://github.com/Jahid12012021/ReHARK.
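To make the kernel vocabulary concrete, the sketch below builds an averaged multi-scale RBF kernel and uses it in a generic kernel ridge correction on top of prior logits. This is textbook kernel ridge regression used as a stand-in; the function names, the `gammas` scales, and the closed-form objective are assumptions for illustration and are not taken from the ReHARK paper or its code.

```python
import numpy as np

def multi_scale_rbf(a, b, gammas=(0.5, 1.0, 2.0)):
    """Average of RBF kernels exp(-gamma * ||a_i - b_j||^2) over several scales."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.mean([np.exp(-g * sq) for g in gammas], axis=0)

def kernel_adapt(x_test, x_support, y_support, prior_test, prior_support, lam=0.1):
    """Prior logits plus a kernel ridge correction fitted on the support residuals.

    Closed-form minimiser of ||f(S) - (Y - prior(S))||^2 + lam * ||f||_H^2,
    where H is the RKHS of the multi-scale kernel (a generic formulation).
    """
    k_ss = multi_scale_rbf(x_support, x_support)
    k_ts = multi_scale_rbf(x_test, x_support)
    residual = y_support - prior_support
    alpha = np.linalg.solve(k_ss + lam * np.eye(len(x_support)), residual)
    return prior_test + k_ts @ alpha
```

Unlike a Nadaraya-Watson estimator, which normalises kernel weights locally around each test point, the ridge solve couples all support samples through the regulariser `lam`, which is one simple sense in which the regularisation is global.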


[39] MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks cs.CV | cs.AI | cs.RO

Lirong Che, Shuo Wen, Shan Huang, Chuang Wang, Yuzhe Yang

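TL;DR: MANSION is a language-driven framework for generating building-scale, multi-floor 3D environments that support cross-floor, long-horizon robotic tasks. It accounts for vertical structural constraints and generates realistic, navigable whole-building structures with diverse, human-friendly scenes. Building on the framework, the authors release MansionWorld, a dataset of over 1,000 diverse buildings ranging from hospitals to offices, along with a Task-Semantic Scene Editing Agent that customizes environments through open-vocabulary commands. Benchmarking shows that state-of-the-art agents degrade sharply in this setting, underscoring MANSION's value as a testbed for next-generation spatial reasoning and planning.

Details

Motivation: Real-world robotic tasks are typically long-horizon and span multiple floors, demanding rich spatial reasoning, yet existing embodied benchmarks are mostly confined to single-floor indoor environments and fail to reflect the complexity of real tasks.

Result: Benchmarking shows that state-of-the-art agents degrade sharply in the MANSION setting, establishing it as a critical testbed for next-generation spatial reasoning and planning.

Insight: The novelty is the first language-driven framework for multi-floor 3D scene generation, which produces building-scale environments while respecting vertical structural constraints. The MansionWorld dataset and the Task-Semantic Scene Editing Agent additionally provide customizable, diverse test environments, filling the gap in evaluating cross-floor long-horizon tasks.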

Abstract: Real-world robotic tasks are long-horizon and often span multiple floors, demanding rich spatial reasoning. However, existing embodied benchmarks are largely confined to single-floor in-house environments, failing to reflect the complexity of real-world tasks. We introduce MANSION, the first language-driven framework for generating building-scale, multi-floor 3D environments. Being aware of vertical structural constraints, MANSION generates realistic, navigable whole-building structures with diverse, human-friendly scenes, enabling the development and evaluation of cross-floor long-horizon tasks. Building on this framework, we release MansionWorld, a dataset of over 1,000 diverse buildings ranging from hospitals to offices, alongside a Task-Semantic Scene Editing Agent that customizes these environments using open-vocabulary commands to meet specific user needs. Benchmarking reveals that state-of-the-art agents degrade sharply in our settings, establishing MANSION as a critical testbed for the next generation of spatial reasoning and planning.


[40] Enhancing Image Aesthetics with Dual-Conditioned Diffusion Models Guided by Multimodal Perception cs.CV

Xinyu Nan, Ning Wang, Yuyao Zhai, Mei Yang

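TL;DR: This paper proposes DIAE, a diffusion-based, dual-supervised method for image aesthetic enhancement. It uses multimodal aesthetic perception to turn ambiguous aesthetic instructions into explicit guidance, and it uses the weakly matched dataset IIAEData with a dual-branch supervision framework to address the lack of perfectly paired images.

Details

Motivation: Although image editing models have become more controllable and flexible, they still struggle with aesthetic enhancement for two reasons: editing instructions that require aesthetic perception are hard to follow, and "perfectly-paired" images with consistent content but different aesthetic quality are scarce.

Result: Experiments show that DIAE outperforms the baselines, obtaining higher image aesthetic scores and image content consistency scores.

Insight: The contributions are Multimodal Aesthetic Perception (MAP), which makes ambiguous instructions concrete, and the weakly matched dataset IIAEData with a dual-branch supervision framework that enables training on imperfectly paired data. Together they address data scarcity and instruction ambiguity in aesthetic enhancement.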

Abstract: Image aesthetic enhancement aims to perceive aesthetic deficiencies in images and perform corresponding editing operations, which is highly challenging and requires the model to possess creativity and aesthetic perception capabilities. Although recent advancements in image editing models have significantly enhanced their controllability and flexibility, they struggle with enhancing image aesthetics. The primary challenges are twofold: first, following editing instructions with aesthetic perception is difficult, and second, there is a scarcity of “perfectly-paired” images that have consistent content but distinct aesthetic qualities. In this paper, we propose Dual-supervised Image Aesthetic Enhancement (DIAE), a diffusion-based generative model with multimodal aesthetic perception. First, DIAE incorporates Multimodal Aesthetic Perception (MAP) to convert the ambiguous aesthetic instruction into explicit guidance by (i) employing detailed, standardized aesthetic instructions across multiple aesthetic attributes, and (ii) utilizing multimodal control signals derived from text-image pairs that maintain consistency within the same aesthetic attribute. Second, to mitigate the lack of “perfectly-paired” images, we collect an “imperfectly-paired” dataset called IIAEData, consisting of images with varying aesthetic qualities that share identical semantics. To better leverage the weak matching characteristics of IIAEData during training, a dual-branch supervision framework is also introduced for weakly supervised image aesthetic enhancement. Experimental results demonstrate that DIAE outperforms the baselines and obtains superior image aesthetic scores and image content consistency scores.


[41] TornadoNet: Real-Time Building Damage Detection with Ordinal Supervision cs.CV

Robinson Umeike, Cuong Pham, Ryan Hausen, Thang Dao, Shane Crawford

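TL;DR: TornadoNet is a comprehensive benchmark for automated street-level building damage assessment that evaluates how modern real-time object detection architectures and ordinal-aware supervision strategies perform under realistic post-disaster conditions. Using 3,333 high-resolution geotagged images and 8,890 annotated building instances from the 2021 Midwest tornado outbreak, the study systematically compares CNN-based YOLO detectors with transformer-based RT-DETR models on multi-level damage detection. YOLO models achieve the best detection accuracy and throughput, whereas RT-DETR models show stronger ordinal consistency in severity grading. With soft ordinal classification targets and explicit ordinal-distance penalties, RT-DETR improves further under ordinal supervision.

Details

Motivation: Automated post-disaster building damage assessment needs accurate and reliable multi-level damage detection in realistic scenes. The open question is how to combine detector architecture and supervision strategy to improve ordinal grading of damage severity.

Result: On the TornadoNet benchmark, CNN-based YOLO models reach the highest detection accuracy (46.05% mAP@0.5) and throughput (66-276 FPS), while transformer-based RT-DETR models score better on ordinal consistency (88.13% Ordinal Top-1 Accuracy, MAOE of 0.65). With calibrated ordinal supervision, RT-DETR's mAP@0.5 rises by 4.8 percentage points to 44.70%, and its ordinal metrics also improve (91.15% Ordinal Top-1 Accuracy, MAOE = 0.56).

Insight: The contribution is the first controlled benchmark that systematically evaluates how architectural design and loss formulation jointly affect multi-level damage detection, together with ordinal-aware supervision strategies (soft ordinal classification targets and ordinal-distance penalties) that make severity grading more reliable. More broadly, the study underlines the importance of designing supervision around both the detector architecture and the task's inherent ordering, and it delivers deployable methods and tools for disaster response.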

Abstract: We present TornadoNet, a comprehensive benchmark for automated street-level building damage assessment evaluating how modern real-time object detection architectures and ordinal-aware supervision strategies perform under realistic post-disaster conditions. TornadoNet provides the first controlled benchmark demonstrating how architectural design and loss formulation jointly influence multi-level damage detection from street-view imagery, delivering methodological insights and deployable tools for disaster response. Using 3,333 high-resolution geotagged images and 8,890 annotated building instances from the 2021 Midwest tornado outbreak, we systematically compare CNN-based detectors from the YOLO family against transformer-based models (RT-DETR) for multi-level damage detection. Models are trained under standardized protocols using a five-level damage classification framework based on IN-CORE damage states, validated through expert cross-annotation. Baseline experiments reveal complementary architectural strengths. CNN-based YOLO models achieve highest detection accuracy and throughput, with larger variants reaching 46.05% mAP@0.5 at 66-276 FPS on A100 GPUs. Transformer-based RT-DETR models exhibit stronger ordinal consistency, achieving 88.13% Ordinal Top-1 Accuracy and MAOE of 0.65, indicating more reliable severity grading despite lower baseline mAP. To align supervision with the ordered nature of damage severity, we introduce soft ordinal classification targets and evaluate explicit ordinal-distance penalties. RT-DETR trained with calibrated ordinal supervision achieves 44.70% mAP@0.5, a 4.8 percentage-point improvement, with gains in ordinal metrics (91.15% Ordinal Top-1 Accuracy, MAOE = 0.56). These findings establish that ordinal-aware supervision improves damage severity estimation when aligned with detector architecture. Model & Data: https://github.com/crumeike/TornadoNet
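A minimal sketch of what soft ordinal targets and an ordinal error metric can look like follows. The exponential decay, the `temperature` parameter, and the reading of MAOE as a mean absolute ordinal error are assumptions made here; the abstract does not spell out the paper's exact definitions.

```python
import numpy as np

def soft_ordinal_targets(label, num_levels=5, temperature=1.0):
    """Soft target over ordered classes: mass decays with distance from the true level."""
    levels = np.arange(num_levels)
    weights = np.exp(-np.abs(levels - label) / temperature)
    return weights / weights.sum()

def mean_absolute_ordinal_error(pred_levels, true_levels):
    """Average absolute gap, in damage levels, between predicted and true severity."""
    pred = np.asarray(pred_levels)
    true = np.asarray(true_levels)
    return float(np.abs(pred - true).mean())
```

Compared with a one-hot target, the soft target penalises a prediction two levels away more than a prediction one level away, which is the property an ordered label space such as the five damage states calls for.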


[42] SVLL: Staged Vision-Language Learning for Physically Grounded Embodied Task Planning cs.CV | cs.RO

Yuyuan Yang, Junkun Hong, Hongrong Wang, Honghao Cai, Xunpeng Ren

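TL;DR: This paper proposes SVLL (Staged Vision-Language Learning), a framework that balances visual grounding against temporal causal coherence in embodied task planning. It decouples spatial grounding from temporal reasoning across three stages and introduces the Bias-DPO alignment objective to suppress hallucinated behaviors, achieving higher task success rates than open-source and closed-source models on the AI2-THOR benchmark and in real-world robotic deployments.

Details

Motivation: Existing training paradigms for embodied task planning face a key trade-off: joint end-to-end training tends to cause premature temporal binding, while standard reinforcement learning methods suffer from optimization instability.

Result: On the interactive AI2-THOR benchmark and in real-world robotic deployments, SVLL surpasses state-of-the-art open-source models (e.g., Qwen2.5-VL-7B) and closed-source models (e.g., GPT-4o, Gemini-2.0-flash) in task success rate and significantly reduces physical constraint violations.

Insight: The novelty is a staged training framework that decouples spatial grounding from temporal reasoning, plus the Bias-DPO alignment objective, which explicitly maximizes likelihood on ground-truth actions and penalizes overconfident hallucinations. This anchors the policy to the expert manifold, ensuring strict adherence to environmental affordances and suppressing physically impossible shortcuts.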

Abstract: Embodied task planning demands vision-language models to generate action sequences that are both visually grounded and causally coherent over time. However, existing training paradigms face a critical trade-off: joint end-to-end training often leads to premature temporal binding, while standard reinforcement learning methods suffer from optimization instability. To bridge this gap, we present Staged Vision-Language Learning (SVLL), a unified three-stage framework for robust, physically-grounded embodied planning. In the first two stages, SVLL decouples spatial grounding from temporal reasoning, establishing robust visual dependency before introducing sequential action history. In the final stage, we identify a key limitation of standard Direct Preference Optimization (DPO): its purely relative nature, which optimizes only the preference gap between winning and losing trajectories while neglecting absolute likelihood constraints on the optimal path, often yields unsafe or hallucinated behaviors. To address this, we further introduce Bias-DPO, a novel alignment objective that injects an inductive bias toward expert trajectories by explicitly maximizing likelihood on ground-truth actions while penalizing overconfident hallucinations. By anchoring the policy to the expert manifold and mitigating causal misalignment, SVLL, powered by Bias-DPO, ensures strict adherence to environmental affordances and effectively suppresses physically impossible shortcuts. Finally, extensive experiments on the interactive AI2-THOR benchmark and real-world robotic deployments demonstrate that SVLL outperforms both state-of-the-art open-source (e.g., Qwen2.5-VL-7B) and closed-source models (e.g., GPT-4o, Gemini-2.0-flash) in task success rate, while significantly reducing physical constraint violations.
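The "purely relative" limitation of DPO can be seen in a few lines. The first function is the standard per-pair DPO loss. The second adds a negative log-likelihood term on the winning trajectory as a hypothetical stand-in for an absolute likelihood anchor; it is not the paper's Bias-DPO objective, whose exact form (including the penalty on overconfident hallucinations) the abstract does not give. The `alpha` weight is likewise an assumption.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winning, losing) trajectory pair."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

def anchored_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, alpha=1.0):
    """DPO plus an absolute likelihood term on the expert (winning) trajectory.

    The extra term is a hypothetical illustration, not the paper's Bias-DPO.
    """
    return dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta) - alpha * logp_w
```

Lowering both policy log-probabilities by the same amount leaves `dpo_loss` unchanged, because only the gap between them enters the margin. The anchored variant increases in that case, so it discourages the policy from drifting away from expert actions even when the preference gap is preserved.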


[43] R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection cs.CV

Zhongyu Xia, Yousen Tang, Yongtao Wang, Zhifeng Wang, Weijun Qin

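TL;DR: This paper proposes R4Det, a high-performance 3D object detection method for autonomous driving that fuses 4D radar and camera data to address weaknesses of existing methods in depth estimation, temporal fusion, and small-object detection.

Details

Motivation: Existing 4D radar-camera fusion methods for 3D object detection suffer from non-robust absolute depth estimation, temporal fusion that depends on the ego vehicle's pose and is vulnerable to errors in it, and weak detection of small objects with sparse radar returns.

Result: Experiments on the TJ4DRadSet and VoD datasets show that R4Det achieves state-of-the-art (SOTA) 3D object detection performance.

Insight: The contributions are (1) a Panoramic Depth Fusion module that improves depth estimation through mutual reinforcement between absolute and relative depth; (2) a Deformable Gated Temporal Fusion module that does not rely on the ego vehicle's pose; and (3) an Instance-Guided Dynamic Refinement module that extracts semantic prototypes from 2D instance guidance to strengthen small-object detection.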

Abstract: The 4D radar-camera sensing configuration has gained increasing importance in autonomous driving. However, existing 3D object detection methods that fuse 4D radar and camera data confront several challenges. First, their absolute depth estimation module is not robust and accurate enough, leading to inaccurate 3D localization. Second, the performance of their temporal fusion module will degrade dramatically or even fail when the ego vehicle’s pose is missing or inaccurate. Third, for some small objects, the sparse radar point clouds may contain no returns from their surfaces at all. In such cases, detection must rely solely on visual unimodal priors. To address these limitations, we propose R4Det, which enhances depth estimation quality via the Panoramic Depth Fusion module, enabling mutual reinforcement between absolute and relative depth. For temporal fusion, we design a Deformable Gated Temporal Fusion module that does not rely on the ego vehicle’s pose. In addition, we build an Instance-Guided Dynamic Refinement module that extracts semantic prototypes from 2D instance guidance. Experiments show that R4Det achieves state-of-the-art 3D object detection results on the TJ4DRadSet and VoD datasets.


[44] WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing cs.CV

Hui Zhang, Juntao Liu, Zongkai Liu, Liqiang Niu, Fandong Meng

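TL;DR: This paper presents WeEdit, a systematic solution for text-centric image editing that comprises a scalable data construction pipeline, two benchmarks, and a tailored two-stage training strategy. The framework aims to modify textual elements in an image precisely according to user instructions while leaving non-target regions unchanged.

Details

Motivation: Existing instruction-based image editing models often produce blurry or hallucinated characters on complex text editing tasks, mainly because training paradigms, large-scale datasets, and standardized benchmarks dedicated to text-centric editing are lacking.

Result: WeEdit is evaluated comprehensively on bilingual and multilingual benchmarks, and experiments show that it clearly outperforms previous open-source models across diverse editing operations.

Insight: The contributions are (1) an HTML-based automatic editing pipeline that generates 330K training pairs covering 15 languages; (2) glyph-guided supervised fine-tuning that injects spatial and content priors; and (3) a multi-objective reinforcement learning stage that aligns generation with instruction adherence, text clarity, and background preservation. Together these provide a closed-loop training and evaluation system for text-centric editing.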

Abstract: Instruction-based image editing aims to modify specific content within existing images according to user-provided instructions while preserving non-target regions. Beyond traditional object- and style-centric manipulation, text-centric image editing focuses on modifying, translating, or rearranging textual elements embedded within images. However, existing leading models often struggle to execute complex text editing precisely, frequently producing blurry or hallucinated characters. We attribute these failures primarily to the lack of specialized training paradigms tailored for text-centric editing, as well as the absence of large-scale datasets and standardized benchmarks necessary for a closed-loop training and evaluation system. To address these limitations, we present WeEdit, a systematic solution encompassing a scalable data construction pipeline, two benchmarks, and a tailored two-stage training strategy. Specifically, we propose a novel HTML-based automatic editing pipeline, which generates 330K training pairs covering diverse editing operations and 15 languages, accompanied by standardized bilingual and multilingual benchmarks for comprehensive evaluation. On the algorithmic side, we employ glyph-guided supervised fine-tuning to inject explicit spatial and content priors, followed by a multi-objective reinforcement learning stage to align generation with instruction adherence, text clarity, and background preservation. Extensive experiments demonstrate that WeEdit outperforms previous open-source models by a clear margin across diverse editing operations.


[45] LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference cs.CV

Junkun Jiang, Ho Yin Au, Jingyu Xiang, Jie Chen

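TL;DR: This paper proposes LaMoGen, a framework that generates human motion from language by introducing the LabanLite symbolic motion representation and using a large language model for symbolic reasoning. It decomposes complex motions into interpretable symbol sequences and body-part instructions, addressing the shortcomings of existing methods in temporal accuracy, detail synthesis, and explainability.

Details

Motivation: Existing methods based on joint text-motion embeddings struggle to synthesize temporally accurate, detailed motions and lack explainability. This work uses a symbolic representation and LLM-guided reasoning to build an interpretable link between language and motion trajectories, improving the quality and controllability of motion generation.

Result: On the proposed Labanotation-based benchmark and two public datasets, LaMoGen outperforms prior methods along the symbolic, temporal, and harmony dimensions, setting a new baseline for interpretability and controllability.

Insight: The core novelty is extending the Labanotation system into the LabanLite symbolic representation and building a two-stage Text-to-LabanLite-to-Motion generation framework in which an LLM reasons over and composes symbols. This offers a new paradigm for language-driven motion synthesis based on agents and symbolic reasoning, and it makes the generation process more interpretable and controllable.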

Abstract: Human motion is highly expressive and naturally aligned with language, yet prevailing methods relying heavily on joint text-motion embeddings struggle to synthesize temporally accurate, detailed motions and often lack explainability. To address these limitations, we introduce LabanLite, a motion representation developed by adapting and extending the Labanotation system. Unlike black-box text-motion embeddings, LabanLite encodes each atomic body-part action (e.g., a single left-foot step) as a discrete Laban symbol paired with a textual template. This abstraction decomposes complex motions into interpretable symbol sequences and body-part instructions, establishing a symbolic link between high-level language and low-level motion trajectories. Building on LabanLite, we present LaMoGen, a Text-to-LabanLite-to-Motion Generation framework that enables large language models (LLMs) to compose motion sequences through symbolic reasoning. The LLM interprets motion patterns, relates them to textual descriptions, and recombines symbols into executable plans, producing motions that are both interpretable and linguistically grounded. To support rigorous evaluation, we introduce a Labanotation-based benchmark with structured description-motion pairs and three metrics that jointly measure text-motion alignment across symbolic, temporal, and harmony dimensions. Experiments demonstrate that LaMoGen establishes a new baseline for both interpretability and controllability, outperforming prior methods on our benchmark and two public datasets. These results highlight the advantages of symbolic reasoning and agent-based design for language-driven motion synthesis.


[46] Articulat3D: Reconstructing Articulated Digital Twins From Monocular Videos with Geometric and Motion Constraints cs.CV

Lijun Guo, Haoyu Zhao, Xingyue Zhao, Rong Fu, Linghao Zhuang

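TL;DR: This paper proposes Articulat3D, a framework for reconstructing high-fidelity digital twins of articulated objects from monocular videos. By jointly enforcing explicit 3D geometric and motion constraints, it removes the dependence of existing methods on multi-view static captures and enables scalable reconstruction under uncontrolled real-world conditions.

Details

Motivation: Existing methods depend on multi-view captures of the object in discrete, static states, which severely limits their real-world scalability. This work builds digital twins of articulated objects from casually captured monocular videos to overcome that limitation.

Result: Articulat3D achieves state-of-the-art performance on both synthetic benchmarks and real-world casually captured monocular videos, significantly advancing the feasibility of digital twin creation under uncontrolled real-world conditions.

Insight: The main contributions are (1) Motion Prior-Driven Initialization, which uses 3D point tracks to exploit the low-dimensional structure of articulated motion and models scene dynamics with a compact set of motion bases, yielding a soft decomposition of the scene into multiple rigidly moving groups; and (2) Geometric and Motion Constraints Refinement, which enforces physically plausible articulation through learnable kinematic primitives parameterized by a joint axis, a pivot point, and per-frame motion scalars, producing reconstructions that are geometrically accurate and temporally coherent. Jointly optimizing geometric and motion constraints offers a new approach to monocular video reconstruction.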

Abstract: Building high-fidelity digital twins of articulated objects from visual data remains a central challenge. Existing approaches depend on multi-view captures of the object in discrete, static states, which severely constrains their real-world scalability. In this paper, we introduce Articulat3D, a novel framework that constructs such digital twins from casually captured monocular videos by jointly enforcing explicit 3D geometric and motion constraints. We first propose Motion Prior-Driven Initialization, which leverages 3D point tracks to exploit the low-dimensional structure of articulated motion. By modeling scene dynamics with a compact set of motion bases, we facilitate soft decomposition of the scene into multiple rigidly-moving groups. Building on this initialization, we introduce Geometric and Motion Constraints Refinement, which enforces physically plausible articulation through learnable kinematic primitives parameterized by a joint axis, a pivot point, and per-frame motion scalars, yielding reconstructions that are both geometrically accurate and temporally coherent. Extensive experiments demonstrate that Articulat3D achieves state-of-the-art performance on synthetic benchmarks and real-world casually captured monocular videos, significantly advancing the feasibility of digital twin creation under uncontrolled real-world conditions. Our project page is at https://maxwell-zhao.github.io/Articulat3D.
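A kinematic primitive parameterized by a joint axis, a pivot point, and a per-frame scalar can be sketched for the revolute (hinge) case, where the scalar is a rotation angle. This is an illustrative assumption: the abstract does not say which joint types are supported or how the scalars are interpreted, and a prismatic joint would instead translate along the axis. The function name is hypothetical.

```python
import numpy as np

def revolute_transform(points, axis, pivot, angle):
    """Rotate points by `angle` (radians) about the line through `pivot` along `axis`."""
    k = np.asarray(axis, dtype=float)
    k = k / np.linalg.norm(k)
    # Skew-symmetric cross-product matrix of the unit axis.
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    # Rodrigues' rotation formula.
    R = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
    pivot = np.asarray(pivot, dtype=float)
    return (np.asarray(points, dtype=float) - pivot) @ R.T + pivot
```

Applying the same axis and pivot with a different angle per frame moves a rigid part along a one-parameter family of poses, which is why a single scalar per frame suffices once the axis and pivot are known.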


[47] Noise-aware few-shot learning through bi-directional multi-view prompt alignment cs.CV

Lu Niu, Cheng Xue

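TL;DR: This paper proposes NA-MVP, a noise-aware few-shot learning framework that uses bi-directional multi-view prompt alignment to make vision-language models more robust to noisy labels. Its core idea is a shift from global matching to region-aware alignment: multi-view prompts with unbalanced optimal transport provide fine-grained patch-to-prompt correspondence, while a bi-directional prompt design and a selective refinement strategy distinguish and suppress noisy signals.

Details

Motivation: Vision-language models gain few-shot capability through prompt tuning but are sensitive to noisy labels, which corrupt prompts and damage cross-modal alignment. Existing methods struggle to model fine-grained semantic cues and to adaptively separate clean from noisy signals.

Result: On synthetic and real-world noisy benchmarks, NA-MVP consistently outperforms state-of-the-art baselines, confirming its effectiveness for robust few-shot learning under noisy supervision.

Insight: The contributions are the conceptual shift from global matching to region-aware alignment; multi-view prompts combined with unbalanced optimal transport for fine-grained correspondence; a bi-directional prompt design that captures complementary clean-oriented and noise-aware cues; and an alignment-guided selective refinement strategy that uses optimal transport to correct only mislabeled samples while retaining reliable data.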

Abstract: Vision-language models offer strong few-shot capability through prompt tuning but remain vulnerable to noisy labels, which can corrupt prompts and degrade cross-modal alignment. Existing approaches struggle because they often lack the ability to model fine-grained semantic cues and to adaptively separate clean from noisy signals. To address these challenges, we propose NA-MVP, a framework for Noise-Aware few-shot learning through bi-directional Multi-View Prompt alignment. NA-MVP is built upon a key conceptual shift: robust prompt learning requires moving from global matching to region-aware alignment that explicitly distinguishes clean cues from noisy ones. To realize this, NA-MVP employs (1) multi-view prompts combined with unbalanced optimal transport to achieve fine-grained patch-to-prompt correspondence while suppressing unreliable regions; (2) a bi-directional prompt design that captures complementary clean-oriented and noise-aware cues, enabling the model to focus on stable semantics; and (3) an alignment-guided selective refinement strategy that uses optimal transport to correct only mislabeled samples while retaining reliable data. Experiments on synthetic and real-world noisy benchmarks demonstrate that NA-MVP consistently outperforms state-of-the-art baselines, confirming its effectiveness in enabling robust few-shot learning under noisy supervision.


[48] MedPruner: Training-Free Hierarchical Token Pruning for Efficient 3D Medical Image Understanding in Vision-Language Models cs.CV | cs.AI

Shengyuan Liu, Zanting Ye, Yunrui Lin, Chen Hu, Wanting Geng

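TL;DR: This paper proposes MedPruner, a training-free, model-agnostic hierarchical token pruning framework for efficient processing of 3D medical images. A two-stage mechanism, Inter-slice Anchor-based Filtering followed by Dynamic Information Nucleus Selection, adaptively removes redundant visual tokens, greatly reducing computational overhead while maintaining or improving model performance.

Details

Motivation: Existing medical vision-language models are computationally inefficient on 3D volumetric data, mainly because directly concatenating consecutive 2D slices creates massive anatomical redundancy and because fixed pruning ratios cannot handle the heterogeneous information density across slices.

Result: Experiments on three 3D medical benchmarks with three different medical VLMs show that MedPruner substantially reduces token redundancy. For example, it lets models such as MedGemma maintain or even exceed their original performance while retaining fewer than 5% of visual tokens, drastically reducing computational overhead.

Insight: The novelty is a training-free hierarchical token pruning framework that combines slice-level temporal redundancy elimination with adaptive token-level compression based on cumulative attention weights. It handles heterogeneous information density in 3D medical images dynamically and efficiently and offers a practical route to clinical deployment.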

Abstract: While specialized Medical Vision-Language Models (VLMs) have achieved remarkable success in interpreting 2D and 3D medical modalities, their deployment for 3D volumetric data remains constrained by significant computational inefficiencies. Current architectures typically suffer from massive anatomical redundancy due to the direct concatenation of consecutive 2D slices and lack the flexibility to handle heterogeneous information densities across different slices using fixed pruning ratios. To address these challenges, we propose MedPruner, a training-free and model-agnostic hierarchical token pruning framework specifically designed for efficient 3D medical image understanding. MedPruner introduces a two-stage mechanism: an Inter-slice Anchor-based Filtering module to eliminate slice-level temporal redundancy, followed by a Dynamic Information Nucleus Selection strategy that achieves adaptive token-level compression by quantifying cumulative attention weights. Extensive experiments on three 3D medical benchmarks and across three diverse medical VLMs reveal massive token redundancy in existing architectures. Notably, MedPruner enables models such as MedGemma to maintain or even exceed their original performance while retaining fewer than 5% of visual tokens, thereby drastically reducing computational overhead and validating the necessity of dynamic token selection for practical clinical deployment. Our code will be released.
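The two stages can be pictured with a short sketch. Both functions are assumptions about what such stages might look like, not MedPruner's implementation: the cosine-similarity test against the last kept slice, the `threshold` and `mass` parameters, and the top-p style selection are illustrative choices only.

```python
import numpy as np

def anchor_filter(slice_features, threshold=0.95):
    """Keep a slice only if its cosine similarity to the last kept slice is below threshold."""
    kept = [0]  # the first slice is always the initial anchor
    for i in range(1, len(slice_features)):
        a, b = slice_features[kept[-1]], slice_features[i]
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        if cos < threshold:
            kept.append(i)
    return kept

def nucleus_select(attention, mass=0.9):
    """Indices of the smallest token set whose attention weights sum to at least `mass`."""
    weights = np.asarray(attention, dtype=float)
    weights = weights / weights.sum()
    order = np.argsort(-weights)  # tokens from most to least attended
    cumulative = np.cumsum(weights[order])
    keep = int(np.searchsorted(cumulative, mass)) + 1
    return np.sort(order[:keep])
```

Because the number of kept tokens depends on how concentrated the attention is, a slice with one dominant region keeps very few tokens while a slice with evenly spread attention keeps most of them, which is the adaptivity a fixed pruning ratio lacks.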


[49] Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans cs.CV | cs.AIPDF

Sizhong Qin, Ramon Elias Weber, Xinzheng Lu

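TL;DR: This paper presents HouseMind, a multimodal large language model that introduces discrete room-instance tokens to build a unified vocabulary, integrating floor plan understanding, generation, and editing in one framework and synthesizing coherent, controllable layouts from text instructions.

Details

Motivation: Current AI systems struggle to reason jointly over geometry, semantics, and spatial hierarchy in architectural floor plan design; diffusion and language models in particular fall short on spatial reasoning and controllable generation.

Result: Experiments show that the framework achieves superior geometric validity and controllability while remaining efficient and locally deployable. The abstract names no specific benchmarks and gives no quantitative comparison with SOTA.

Insight: The novelty is using discrete room-instance tokens to build a unified vocabulary that bridges layouts and symbolic reasoning, with multimodal alignment and instruction tuning improving controllable generation. Viewed objectively, the approach offers a new angle on structured representations in multimodal tasks.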

Abstract: Architectural floor plan design demands joint reasoning over geometry, semantics, and spatial hierarchy, which remains a major challenge for current AI systems. Although recent diffusion and language models improve visual fidelity, they still struggle with coherent spatial reasoning and controllable generation. We present HouseMind, a multimodal large language model that unifies floor plan understanding, generation, and editing in one framework. We introduce discrete room-instance tokens to construct a unified vocabulary that bridges layouts and symbolic reasoning. With multimodal alignment and instruction tuning, the model synthesizes coherent, controllable layouts from text instructions. Experiments show that the framework achieves superior geometric validity and controllability while remaining efficient and locally deployable.


Chongxiao Wang, Junjie Liang, Peng Cao, Jinzhu Yang, Osmar R. Zaiane

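TL;DR: This paper proposes IDRL, an individual-aware multimodal depression-related representation learning framework for depression diagnosis. It disentangles multimodal representations into a modality-common depression space, a modality-specific depression space, and a depression-unrelated space to strengthen modality alignment and suppress irrelevant information. An individual-aware modality-fusion module dynamically reweights features by their predictive significance, giving adaptive cross-modal fusion for different individuals.

Details

Motivation: Existing multimodal depression detection methods suffer from inter-modal inconsistency and interference from depression-unrelated information. Diverse individual presentations of depression also cause the importance of modalities and cues to vary between individuals, which hinders reliable fusion. The paper addresses these problems to make depression diagnosis more robust.

Result: Extensive experiments show that IDRL achieves superior and robust performance on multimodal depression detection.

Insight: The novelty is separating depression-related from depression-unrelated information through representation disentanglement and adding an individual-aware dynamic fusion mechanism that adapts to individual differences. This offers a new approach for multimodal medical diagnosis tasks with high inter-individual variability.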

Abstract: Depression is a severe mental disorder, and reliable identification plays a critical role in early intervention and treatment. Multimodal depression detection aims to improve diagnostic performance by jointly modeling complementary information from multiple modalities. Recently, numerous multimodal learning approaches have been proposed for depression analysis; however, these methods suffer from the following limitations: 1) inter-modal inconsistency and depression-unrelated interference, where depression-related cues may conflict across modalities while substantial irrelevant content obscures critical depressive signals, and 2) diverse individual depressive presentations, leading to individual differences in modality and cue importance that hinder reliable fusion. To address these issues, we propose Individual-aware Multimodal Depression-related Representation Learning Framework (IDRL) for robust depression diagnosis. Specifically, IDRL 1) disentangles multimodal representations into a modality-common depression space, a modality-specific depression space, and a depression-unrelated space to enhance modality alignment while suppressing irrelevant information, and 2) introduces an individual-aware modality-fusion module (IAF) that dynamically adjusts the weights of disentangled depression-related features based on their predictive significance, thereby achieving adaptive cross-modal fusion for different individuals. Extensive experiments demonstrate that IDRL achieves superior and robust performance for multimodal depression detection.


[51] BackdoorIDS: Zero-shot Backdoor Detection for Pretrained Vision Encoder cs.CVPDF

Siquan Huang, Yijiang Li, Ningzhi Gao, Xingfu Yan, Leyu Shi

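TL;DR: This paper proposes BackdoorIDS, a zero-shot, inference-time backdoor sample detection method for pretrained vision encoders. Building on the Attention Hijacking and Restoration phenomena, it extracts an embedding sequence under progressive input masking and uses density-based clustering (such as DBSCAN) to decide whether a sample contains a backdoor.

Details

Motivation: Downstream users who rely on third-party pretrained vision encoders of uncertain provenance are exposed to backdoor attacks. The paper provides a zero-shot detection method that requires no retraining.

Result: Extensive experiments across diverse attack types, datasets, and model families show that BackdoorIDS consistently outperforms existing defenses. As a plug-and-play method, it is compatible with encoder architectures including CNNs, ViTs, CLIP, and LLaVA-1.5.

Insight: The novelty is exploiting the abrupt embedding change that occurs when, under progressive masking, a backdoored sample's attention shifts rapidly from the malicious trigger to benign content, and detecting it by clustering the embedding sequence. Viewed objectively, the method needs no model modification or training, which makes it general and practical.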

Abstract: Self-supervised and multimodal vision encoders learn strong visual representations that are widely adopted in downstream vision tasks and large vision-language models (LVLMs). However, downstream users often rely on third-party pretrained encoders with uncertain provenance, exposing them to backdoor attacks. In this work, we propose BackdoorIDS, a simple yet effective zero-shot, inference-time backdoor sample detection method for pretrained vision encoders. BackdoorIDS is motivated by two observations: Attention Hijacking and Restoration. Under progressive input masking, a backdoored image initially concentrates attention on malicious trigger features. Once the masking ratio exceeds the trigger’s robustness threshold, the trigger is deactivated, and attention rapidly shifts to benign content. This transition induces a pronounced change in the image embedding, whereas embeddings of clean images evolve more smoothly across masking progress. BackdoorIDS operationalizes this signal by extracting an embedding sequence along the masking trajectory and applying density-based clustering such as DBSCAN. An input is flagged as backdoored if its embedding sequence forms more than one cluster. Extensive experiments show that BackdoorIDS consistently outperforms existing defenses across diverse attack types, datasets, and model families. Notably, it is a plug-and-play approach that requires no retraining and operates fully zero-shot at inference time, making it compatible with a wide range of encoder architectures, including CNNs, ViTs, CLIP, and LLaVA-1.5.
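
The decision rule stated in the abstract (flag an input when its embedding sequence forms more than one cluster) can be illustrated with a small, self-contained sketch. It uses a minimal textbook DBSCAN written in plain Python. The 2-D toy "embeddings", the `eps` and `min_pts` values, and the name `is_backdoored` are assumptions for illustration; a real pipeline would use encoder embeddings taken along the masking trajectory and most likely a library implementation.

```python
import math


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point (-1 means noise)."""
    n = len(points)
    labels = [None] * n  # None = not yet visited

    def neighbors(i):
        # Includes the point itself, as in the standard formulation.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point, expand through it
    return labels


def is_backdoored(embedding_sequence, eps=0.5, min_pts=2):
    """Flag the input if its embedding sequence forms more than one cluster."""
    labels = dbscan(embedding_sequence, eps, min_pts)
    return len({label for label in labels if label != -1}) > 1


# Toy trajectories (assumed): smooth drift versus an abrupt jump.
clean = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (0.3, 0.1), (0.4, 0.2)]
jumpy = [(0.0, 0.0), (0.1, 0.0), (0.1, 0.1), (3.0, 3.0), (3.1, 3.0), (3.1, 3.1)]
print(is_backdoored(clean), is_backdoored(jumpy))
```

The smooth trajectory chains into a single cluster, while the abrupt jump splits the sequence into two, which mirrors the contrast the abstract draws between clean and backdoored inputs.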


[52] PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On cs.CVPDF

Haohua Chen, Tianze Zhou, Wei Zhu, Runqi Wang, Yandong Guan

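TL;DR: PROMO is a promptable virtual try-on framework built on a Flow Matching DiT backbone. Using latent multi-modal conditional concatenation and a self-reference mechanism, it substantially reduces inference overhead while preserving high fidelity, balancing quality and speed for virtual try-on.

Details

Motivation: Existing diffusion-based virtual try-on methods face a trade-off between fidelity and efficiency: they typically rely on intricate architectures and sample slowly. PROMO treats virtual try-on as a structured image editing problem with three key requirements: subject preservation, faithful texture transfer, and seamless harmonization.

Result: On standard benchmarks, PROMO surpasses prior virtual try-on methods and general image editing models in visual fidelity while delivering a competitive balance between quality and speed.

Insight: The novelties include reframing virtual try-on as a structured image editing task, using the paired data produced by virtual try-on as supervision for training general-purpose editors, and combining a flow-matching transformer with latent multi-modal conditional concatenation and self-reference acceleration, yielding an efficient and training-friendly solution for high-quality virtual try-on.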

Abstract: Virtual Try-on (VTON) has become a core capability for online retail, where realistic try-on results provide reliable fit guidance, reduce returns, and benefit both consumers and merchants. Diffusion-based VTON methods achieve photorealistic synthesis, yet often rely on intricate architectures such as auxiliary reference networks and suffer from slow sampling, making the trade-off between fidelity and efficiency a persistent challenge. We approach VTON as a structured image editing problem that demands strong conditional generation under three key requirements: subject preservation, faithful texture transfer, and seamless harmonization. Under this perspective, our training framework is generic and transfers to broader image editing tasks. Moreover, the paired data produced by VTON constitutes a rich supervisory resource for training general-purpose editors. We present PROMO, a promptable virtual try-on framework built upon a Flow Matching DiT backbone with latent multi-modal conditional concatenation. By leveraging conditioning efficiency and self-reference mechanisms, our approach substantially reduces inference overhead. On standard benchmarks, PROMO surpasses both prior VTON methods and general image editing models in visual fidelity while delivering a competitive balance between quality and speed. These results demonstrate that flow-matching transformers, coupled with latent multi-modal conditioning and self-reference acceleration, offer an effective and training-efficient solution for high-quality virtual try-on.


[53] OSCBench: Benchmarking Object State Change in Text-to-Video Generation cs.CV | cs.AI | cs.CLPDF

Xianjing Han, Bin Zhu, Shiqi Hu, Franklin Mingzhe Li, Patrick Carrington

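TL;DR: This paper introduces OSCBench, a benchmark designed to assess object state change (OSC) performance in text-to-video (T2V) generation models. Built from instructional cooking data, it systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. A human user study and multimodal large language model (MLLM)-based automatic evaluation of six representative open-source and proprietary T2V models show that current models struggle markedly with accurate and temporally consistent object state changes, especially in novel and compositional settings.

Details

Motivation: Existing T2V benchmarks focus mainly on perceptual quality, text-video alignment, or physical plausibility, and overlook object state change (OSC) explicitly specified in the text prompt, a key aspect of action understanding. OSC is the transformation of an object's state induced by an action (such as peeling a potato or slicing a lemon). The paper aims to fill this evaluation gap.

Result: Six representative T2V models (open-source and proprietary) were evaluated on OSCBench. Despite strong performance on semantic and scene alignment, the models consistently struggle with accurate and temporally consistent object state changes, particularly in novel and compositional scenarios. This identifies OSC as a key bottleneck in current T2V generation.

Insight: The novelty is OSCBench, the first systematic diagnostic benchmark dedicated to object state change in T2V generation; its construction (cooking data, split into regular, novel, and compositional scenarios) supports in-depth evaluation of generalization. Viewed objectively, the study shifts the evaluation focus from conventional visual quality to finer-grained action semantics, pointing the way toward state-aware video generation models.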

Abstract: Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object’s state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary T2V models using both human user study and multimodal large language model (MLLM)-based automatic evaluation. Our results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle with accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text-to-video generation and establish OSCBench as a diagnostic benchmark for advancing state-aware video generation models.


[54] Cross-Resolution Attention Network for High-Resolution PM2.5 Prediction cs.CV | cs.LGPDF

Ammar Kheder, Helmi Toropainen, Wenqing Peng, Samuel Antão, Zhi-Song Liu

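TL;DR: This paper proposes CRAN-PM, a dual-branch Vision Transformer for ultra-high-resolution, continent-scale PM2.5 prediction. Cross-resolution attention efficiently fuses low-resolution (25 km) global meteorological data with high-resolution (1 km) local PM2.5 data, and elevation-aware self-attention and wind-guided cross-attention are introduced to learn physically consistent feature representations for PM2.5 forecasting.

Details

Motivation: Vision Transformers perform well in spatio-temporal prediction, but they do not scale to the ultra-high-resolution, continent-scale domains required in real-world environmental monitoring (a single European air-quality map contains 29 million pixels). An efficient and physically consistent method for high-resolution PM2.5 prediction is needed.

Result: On daily PM2.5 forecasting across Europe in 2022 (362 days, 2,971 European Environment Agency stations), CRAN-PM reduces RMSE by 4.7% at T+1 (one day ahead) and 10.7% at T+3 (three days ahead) compared with the best single-scale baseline, and reduces bias in complex terrain by 36%. The model generates the complete 29-million-pixel European map in 1.8 seconds on a single GPU.

Insight: The novelty is a cross-resolution attention network whose dual-branch structure efficiently fuses data at different resolutions, together with elevation-aware self-attention and wind-guided cross-attention that build physical constraints (such as terrain and wind direction) into the attention mechanism. This guides the model toward physically consistent representations rather than simply feeding physical factors in as input features.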

Abstract: Vision Transformers have achieved remarkable success in spatio-temporal prediction, but their scalability remains limited for ultra-high-resolution, continent-scale domains required in real-world environmental monitoring. A single European air-quality map at 1 km resolution comprises 29 million pixels, far beyond the limits of naive self-attention. We introduce CRAN-PM, a dual-branch Vision Transformer that leverages cross-resolution attention to efficiently fuse global meteorological data (25 km) with local high-resolution PM2.5 at the current time (1 km). Instead of including physically driven factors like temperature and topography as input, we further introduce elevation-aware self-attention and wind-guided cross-attention to force the network to learn physically consistent feature representations for PM2.5 forecasting. CRAN-PM is fully trainable and memory-efficient, generating the complete 29-million-pixel European map in 1.8 seconds on a single GPU. Evaluated on daily PM2.5 forecasting throughout Europe in 2022 (362 days, 2,971 European Environment Agency (EEA) stations), it reduces RMSE by 4.7% at T+1 and 10.7% at T+3 compared to the best single-scale baseline, while reducing bias in complex terrain by 36%.


[55] VTEdit-Bench: A Comprehensive Benchmark for Multi-Reference Image Editing Models in Virtual Try-On cs.CVPDF

Xiaoye Liang, Zhiyuan Qu, Mingye Zou, Jiaxin Liu, Lai Jiang

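TL;DR: This paper proposes VTEdit-Bench, a comprehensive benchmark for evaluating universal multi-reference image editing models in virtual try-on (VTON) scenarios. It comprises five representative tasks and more than 24,000 test image pairs, and introduces VTEdit-QA, a vision-language-model-based evaluator. A systematic evaluation of eight universal editing models and seven specialized VTON models finds that top universal editors are competitive on conventional tasks and generalize more stably to complex scenarios, but still struggle with complex configurations such as multi-cloth references.

Details

Motivation: Existing specialized virtual try-on models struggle with increasingly complex real-world scenarios. Universal multi-reference image editing models generalize strongly, but their strengths and limitations on VTON remain underexplored because no systematic evaluation benchmark exists.

Result: Evaluating eight universal editing models and seven specialized VTON models on VTEdit-Bench shows that top universal editors are competitive on conventional VTON tasks and generalize more stably to harder scenarios, but remain challenged by complex configurations such as multi-cloth references.

Insight: The novelty is building the first comprehensive benchmark for systematically evaluating universal editing models on VTON (VTEdit-Bench), plus a companion reference-aware evaluator (VTEdit-QA). Together they provide a standardized framework for comparing universal and specialized models and expose the limits of universal models under complex multi-reference conditions.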

Abstract: As virtual try-on (VTON) continues to advance, a growing number of real-world scenarios have emerged, pushing beyond the ability of the existing specialized VTON models. Meanwhile, universal multi-reference image editing models have progressed rapidly and exhibit strong generalization in visual editing, suggesting a promising route toward more flexible VTON systems. However, despite their strong capabilities, the strengths and limitations of universal editors for VTON remain insufficiently explored due to the lack of systematic evaluation benchmarks. To address this gap, we introduce VTEdit-Bench, a comprehensive benchmark designed to evaluate universal multi-reference image editing models across various realistic VTON scenarios. VTEdit-Bench contains 24,220 test image pairs spanning five representative VTON tasks with progressively increasing complexity, enabling systematic analysis of robustness and generalization. We further propose VTEdit-QA, a reference-aware VLM-based evaluator that assesses VTON performance from three key aspects: model consistency, cloth consistency, and overall image quality. Through this framework, we systematically evaluate eight universal editing models and compare them with seven specialized VTON models. Results show that top universal editors are competitive on conventional tasks and generalize more stably to harder scenarios, but remain challenged by complex reference configurations, particularly multi-cloth conditioning.


[56] SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory cs.CVPDF

Dingcheng Zhen, Xu Zheng, Ruixin Zhang, Zhiqi Jiang, Yichao Yan

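TL;DR: This paper proposes SoulX-LiveAct, an autoregressive diffusion model for hour-scale real-time human animation. By introducing the Neighbor Forcing strategy and the ConvKV memory mechanism, it addresses the difficulties existing methods have with training convergence, long-sequence generation quality, and inference efficiency, and supports 20 FPS real-time streaming inference on a small number of GPUs.

Details

Motivation: Existing autoregressive diffusion models face two key problems when scaled to hour-scale real-time human animation. First, propagating sample-level representations leads to mismatched diffusion states and unstable learning signals. Second, historical representations grow without bound and lack structure, so cached states cannot be reused effectively, which severely limits inference efficiency.

Result: The method achieves state-of-the-art performance in lip-sync accuracy, human animation quality, and emotional expressiveness, with the lowest inference cost. Experiments show significant improvements over existing methods in training convergence, hour-scale generation quality, and inference efficiency, with 20 FPS real-time streaming inference on as few as two NVIDIA H100 or H200 GPUs.

Insight: The core novelties are the Neighbor Forcing strategy, which propagates temporally adjacent frames as latent neighbors under the same noise condition to provide a distribution-aligned and stable learning signal, and the structured ConvKV memory mechanism, which compresses the keys and values in causal attention into a fixed-length representation, enabling constant-memory inference and truly infinite video generation without relying on short-term motion-frame memory.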

Abstract: Autoregressive (AR) diffusion models offer a promising framework for sequential generation tasks such as video synthesis by combining diffusion modeling with causal inference. Although they support streaming generation, existing AR diffusion methods struggle to scale efficiently. In this paper, we identify two key challenges in hour-scale real-time human animation. First, most forcing strategies propagate sample-level representations with mismatched diffusion states, causing inconsistent learning signals and unstable convergence. Second, historical representations grow unbounded and lack structure, preventing effective reuse of cached states and severely limiting inference efficiency. To address these challenges, we propose Neighbor Forcing, a diffusion-step-consistent AR formulation that propagates temporally adjacent frames as latent neighbors under the same noise condition. This design provides a distribution-aligned and stable learning signal while preserving drifting throughout the AR chain. Building upon this, we introduce a structured ConvKV memory mechanism that compresses the keys and values in causal attention into a fixed-length representation, enabling constant-memory inference and truly infinite video generation without relying on short-term motion-frame memory. Extensive experiments demonstrate that our approach significantly improves training convergence, hour-scale generation quality, and inference efficiency compared to existing AR diffusion methods. Numerically, LiveAct enables hour-scale real-time human animation and supports 20 FPS real-time streaming inference on as few as two NVIDIA H100 or H200 GPUs. Quantitative results demonstrate that our method attains state-of-the-art performance in lip-sync accuracy, human animation quality, and emotional expressiveness, with the lowest inference cost.


[57] Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints cs.CVPDF

Chenyangguang Zhang, Botao Ye, Boqi Chen, Alexandros Delitzas, Fangjinhua Wang

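TL;DR: This paper proposes a novel egocentric video generation framework that generates video from a single reference frame, using sparse 3D hand joints as embodiment-agnostic control signals. An efficient occlusion-aware control module resolves the motion inconsistencies and artifacts that existing methods produce under severe occlusion, and enables cross-embodiment generalization to robotic hands.

Details

Motivation: Existing motion-controllable video generation methods based on 2D trajectories or implicit poses struggle to produce 3D-consistent, fine-grained hand articulation. Under severe egocentric occlusion they yield inconsistent motion and artifacts, and they cannot generalize to robotic hands.

Result: Extensive experiments on the constructed cross-embodiment benchmark show that the method significantly outperforms state-of-the-art baselines, generating high-fidelity egocentric videos with realistic interactions and generalizing well to robotic hands.

Insight: The novelty is using sparse 3D hand joints, which have clear semantic and geometric structure, as control signals, together with an occlusion-aware control module. The module extracts features while penalizing unreliable visual signals from hidden joints, uses a 3D-based weighting mechanism to handle dynamic occlusion robustly, and injects 3D geometric embeddings directly into the latent space to enforce structural consistency.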

Abstract: Motion-controllable video generation is crucial for egocentric applications in virtual reality and embodied AI. However, existing methods often struggle to achieve 3D-consistent fine-grained hand articulation. By relying on 2D trajectories or implicit poses, they collapse 3D geometry into spatially ambiguous signals or over-rely on human-centric priors. Under severe egocentric occlusions, this causes motion inconsistencies and hallucinated artifacts and prevents cross-embodiment generalization to robotic hands. To address these limitations, we propose a novel framework that generates egocentric videos from a single reference frame, leveraging sparse 3D hand joints as embodiment-agnostic control signals with clear semantic and geometric structures. We introduce an efficient control module that resolves occlusion ambiguities while fully preserving 3D information. Specifically, it extracts occlusion-aware features from the source reference frame by penalizing unreliable visual signals from hidden joints, and employs a 3D-based weighting mechanism to robustly handle dynamically occluded target joints during motion propagation. Concurrently, the module directly injects 3D geometric embeddings into the latent space to strictly enforce structural consistency. To facilitate robust training and evaluation, we develop an automated annotation pipeline that yields over one million high-quality egocentric video clips paired with precise hand trajectories. Additionally, we register humanoid kinematic and camera data to construct a cross-embodiment benchmark. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, generating high-fidelity egocentric videos with realistic interactions and exhibiting exceptional cross-embodiment generalization to robotic hands.


[58] HELM: Hierarchical and Explicit Label Modeling with Graph Learning for Multi-Label Image Classification cs.CV | cs.AIPDF

Marjan Stoimchev, Boshko Koloski, Jurica Levatić, Dragi Kocev, Sašo Džeroski

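TL;DR: This paper proposes HELM (Hierarchical and Explicit Label Modeling), a framework for complex hierarchical multi-label classification of remote sensing images. It combines hierarchy-specific class tokens in a Vision Transformer, graph convolutional networks that explicitly encode the hierarchy, and a self-supervised branch that exploits unlabeled data, achieving state-of-the-art performance on four remote sensing datasets.

Details

Motivation: Existing methods struggle with multi-path hierarchies (where an instance belongs to multiple branches) and rarely exploit unlabeled data. HELM aims to overcome these limitations to better model the complex label dependencies in remote sensing imagery.

Result: On four remote sensing image datasets (UCM, AID, DFC-15, and MLRSNet), HELM outperforms strong baselines in both supervised and semi-supervised settings and achieves state-of-the-art performance, with particular strength in low-label scenarios.

Insight: The novelty is combining hierarchy-specific class tokens, explicit graph-convolutional encoding of the hierarchy, and self-supervised learning, so that the model captures fine-grained label interactions while making effective use of unlabeled data. This offers a new approach to multi-path hierarchical classification.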

Abstract: Hierarchical multi-label classification (HMLC) is essential for modeling complex label dependencies in remote sensing. Existing methods, however, struggle with multi-path hierarchies where instances belong to multiple branches, and they rarely exploit unlabeled data. We introduce HELM (Hierarchical and Explicit Label Modeling), a novel framework that overcomes these limitations. HELM: (i) uses hierarchy-specific class tokens within a Vision Transformer to capture nuanced label interactions; (ii) employs graph convolutional networks to explicitly encode the hierarchical structure and generate hierarchy-aware embeddings; and (iii) integrates a self-supervised branch to effectively leverage unlabeled imagery. We perform a comprehensive evaluation on four remote sensing image (RSI) datasets (UCM, AID, DFC-15, MLRSNet). HELM achieves state-of-the-art performance, consistently outperforming strong baselines in both supervised and semi-supervised settings, demonstrating particular strength in low-label scenarios.


[59] Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models cs.CV | cs.AI | cs.CLPDF

Lu Wang, Zhuoran Jin, Yupu Hao, Yubo Chen, Kang Liu

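TL;DR: This paper proposes Think While Watching, an online streaming video reasoning framework for multimodal large language models. It targets the inference latency and weak long-range dependency modeling of existing methods during multi-turn interaction over continuous video streams. By maintaining continuous segment-level memory, adopting a stage-matched training strategy, and designing an efficient inference pipeline, it overlaps watching and thinking and improves online video reasoning performance.

Details

Motivation: Existing multimodal large language models perform well on offline video understanding, but for multi-turn interaction over continuously arriving video streams they are usually limited to offline inference or have weak online reasoning. Existing streaming methods typically use an interleaved perception-generation paradigm, which prevents concurrent perception and generation and causes early memory to decay quickly as the stream grows, hurting long-range dependency modeling.

Result: The method achieves strong results under both single-round and multi-round streaming input protocols. Built on Qwen3-VL, it improves single-round accuracy by 2.6% on StreamingBench and by 3.79% on OVO-Bench. In the multi-round setting, it maintains performance while reducing output tokens by 56%.

Insight: The core novelty is a memory-anchored streaming video reasoning framework that maintains continuous segment-level memory to mitigate early memory decay. Technically, it uses a stage-matched training strategy, a segment-level streaming causal mask, and streaming positional encoding to enforce strict causality. At inference time, an efficient pipeline overlaps watching and thinking and adaptively selects the best attention backend, making perception and generation concurrent and improving the efficiency and long-range modeling of online reasoning.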

Abstract: Multimodal large language models (MLLMs) have shown strong performance on offline video understanding, but most are limited to offline inference or have weak online reasoning, making multi-turn interaction over continuously arriving video streams difficult. Existing streaming methods typically use an interleaved perception-generation paradigm, which prevents concurrent perception and generation and leads to early memory decay as streams grow, hurting long-range dependency modeling. We propose Think While Watching, a memory-anchored streaming video reasoning framework that preserves continuous segment-level memory during multi-turn interaction. We build a three-stage, multi-round chain-of-thought dataset and adopt a stage-matched training strategy, while enforcing strict causality through a segment-level streaming causal mask and streaming positional encoding. During inference, we introduce an efficient pipeline that overlaps watching and thinking and adaptively selects the best attention backend. Under both single-round and multi-round streaming input protocols, our method achieves strong results. Built on Qwen3-VL, it improves single-round accuracy by 2.6% on StreamingBench and by 3.79% on OVO-Bench. In the multi-round setting, it maintains performance while reducing output tokens by 56%. Code is available at: https://github.com/wl666hhh/Think_While_Watching/
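
The abstract does not specify how the segment-level streaming causal mask is built. As a rough illustration of what segment-level causality can mean, the sketch below constructs a generic block-causal mask: a token may attend to every token in its own segment and in earlier segments, but never to a later segment. The function name `block_causal_mask` and the segment layout are assumptions, and the paper's actual mask may differ.

```python
def block_causal_mask(segment_ids):
    """mask[i][j] is True when token i may attend to token j.

    Generic block-causal rule (an assumption, not the paper's definition):
    attention is allowed within a segment and toward earlier segments only.
    """
    n = len(segment_ids)
    return [[segment_ids[j] <= segment_ids[i] for j in range(n)]
            for i in range(n)]


# Five tokens grouped into three video segments (hypothetical layout).
segments = [0, 0, 1, 1, 2]
for row in block_causal_mask(segments):
    print("".join("x" if allowed else "." for allowed in row))
```

Under this rule no token can see content from a segment that has not yet arrived, which is the property a streaming setting needs.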


[60] Locating Demographic Bias at the Attention-Head Level in CLIP’s Vision Encoder cs.CV | cs.AI | cs.CYPDF

Alaa Yasser, Kittipat Phunjanna, Marcos Escudero Viñolo, Catarina Barata, Jenny Benois-Pineau

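TL;DR: This paper proposes a mechanistic fairness audit that combines projected residual-stream decomposition, zero-shot Concept Activation Vectors, and bias-augmented TextSpan analysis to locate demographic bias at the level of individual attention heads in vision transformers. As a feasibility case study, it audits gender and age bias in the CLIP ViT-L-14 encoder on 42 profession classes of the FACET benchmark.

Details

Motivation: Standard fairness audits can only quantify that a model is biased; they cannot say where in the network the bias resides. A mechanistic method is needed to locate the source of bias inside the model.

Result: For gender bias, the method identifies four terminal-layer attention heads whose ablation reduces global bias (Cramér's V) from 0.381 to 0.362 while slightly improving accuracy by 0.42%; a random control experiment confirms that the effect is specific to these heads. For age bias, ablating the identified candidate heads produces weaker and less consistent effects, suggesting that age bias is encoded more diffusely in the model.

Insight: The novelty is applying mechanistic interpretability techniques to fairness auditing, localizing bias at the fine-grained level of attention heads. The study finds that bias encoding patterns may differ across protected attributes (such as gender and age), which suggests new directions for targeted debiasing.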

Abstract: Standard fairness audits of foundation models quantify that a model is biased, but not where inside the network the bias resides. We propose a mechanistic fairness audit that combines projected residual-stream decomposition, zero-shot Concept Activation Vectors, and bias-augmented TextSpan analysis to locate demographic bias at the level of individual attention heads in vision transformers. As a feasibility case study, we apply this pipeline to the CLIP ViT-L-14 encoder on 42 profession classes of the FACET benchmark, auditing both gender and age bias. For gender, the pipeline identifies four terminal-layer heads whose ablation reduces global bias (Cramér’s V: 0.381 -> 0.362) while marginally improving accuracy (+0.42%); a layer-matched random control confirms that this effect is specific to the identified heads. A single head in the final layer contributes to the majority of the reduction in the most stereotyped classes, and class-level analysis shows that corrected predictions shift toward the correct occupation. For age, the same pipeline identifies candidate heads, but ablation produces weaker and less consistent effects, suggesting that age bias is encoded more diffusely than gender bias in this model. These results provide preliminary evidence that head-level bias localisation is feasible for discriminative vision encoders and that the degree of localisability may vary across protected attributes. Keywords: Bias, CLIP, Mechanistic Interpretability, Vision Transformer, Fairness
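
The bias metric reported here, Cramér's V, measures association between two categorical variables on a scale from 0 to 1: V = sqrt(chi2 / (n * (min(r, c) - 1))) for an r x c contingency table with n observations. The sketch below computes the uncorrected statistic in plain Python. How the paper builds its table is not stated in the abstract, so the layout used here (rows as demographic groups, columns as predicted classes) and the toy counts are assumptions.

```python
import math


def cramers_v(table):
    """Uncorrected Cramér's V for an r x c contingency table of counts.

    Assumes every row and column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0]))
    return math.sqrt(chi2 / (n * (k - 1)))


# Toy tables (assumed): predictions independent of group vs. fully determined.
print(cramers_v([[10, 10], [10, 10]]))
print(cramers_v([[20, 0], [0, 20]]))
```

A value near 0 means predictions are nearly independent of the group, so the reported drop from 0.381 to 0.362 corresponds to a modest weakening of that association.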


[61] OSM-based Domain Adaptation for Remote Sensing VLMs cs.CV | cs.LGPDF

Stefan Maria Ailuro, Mario Markov, Mohammad Mahdi, Delyan Boychev, Luc Van Gool

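TL;DR: This paper proposes OSMDA, a self-contained domain adaptation framework for remote sensing vision-language models (VLMs). It pairs aerial images with rendered OpenStreetMap (OSM) tiles and uses the base VLM's optical character recognition and chart comprehension abilities to automatically generate captions enriched with OSM's auxiliary metadata, with no manual labeling and no reliance on a large external teacher model. The model is then fine-tuned with satellite imagery alone, yielding OSMDA-VLM.

Details

Motivation: Remote sensing VLMs rely heavily on domain-specific image-text supervision, yet high-quality annotations for satellite and aerial imagery are scarce and expensive. Existing pseudo-labeling pipelines distill knowledge from large frontier models, which is costly, limits scalability, and caps performance at the teacher's ceiling.

Result: The method is evaluated comprehensively on 10 benchmarks covering image-text-to-text tasks and compared against 9 competitive baselines. When mixed in equal proportion with real data, it achieves state-of-the-art (SOTA) results at a much lower training cost than teacher-dependent alternatives.

Insight: The core novelty is using the base VLM itself as the annotation engine and generating the training corpus automatically from OSM's crowd-sourced geographic data, giving a practical and scalable path to remote sensing domain adaptation that needs neither a stronger external model nor manual labeling.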

Abstract: Vision-Language Models (VLMs) adapted to remote sensing rely heavily on domain-specific image-text supervision, yet high-quality annotations for satellite and aerial imagery remain scarce and expensive to produce. Prevailing pseudo-labeling pipelines address this gap by distilling knowledge from large frontier models, but this dependence on large teachers is costly, limits scalability, and caps achievable performance at the ceiling of the teacher. We propose OSMDA: a self-contained domain adaptation framework that eliminates this dependency. Our key insight is that a capable base VLM can serve as its own annotation engine: by pairing aerial images with rendered OpenStreetMap (OSM) tiles, we leverage optical character recognition and chart comprehension capabilities of the model to generate captions enriched by OSM’s vast auxiliary metadata. The model is then fine-tuned on the resulting corpus with satellite imagery alone, yielding OSMDA-VLM, a domain-adapted VLM that requires no manual labeling and no stronger external model. We conduct exhaustive evaluations spanning 10 benchmarks across image-text-to-text tasks and comparing against 9 competitive baselines. When equally mixed with real data, our method achieves state-of-the-art results, while being substantially cheaper to train than teacher-dependent alternatives. These results suggest that, given a strong foundation model, alignment with crowd-sourced geographic data is a practical and scalable path towards remote sensing domain adaptation. Dataset and model weights will be made publicly available.


[62] Multimodal classification of Radiation-Induced Contrast Enhancements and tumor recurrence using deep learning cs.CVPDF

Robin Peretzke, Marlin Hanstein, Maximilian Fischer, Lars Badhi Wessel, Obada Alhalabi

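TL;DR: This study proposes RICE-NET, a multimodal 3D deep learning model that distinguishes tumor recurrence from radiation-induced contrast enhancements in post-treatment glioblastoma patients. The model integrates longitudinal MRI data with radiotherapy dose distributions and performs automated lesion classification using only conventional T1-weighted MRI data.

Details

Motivation: Distinguishing tumor recurrence from radiation-induced contrast enhancements after glioblastoma treatment is a major clinical challenge. Existing approaches either rely on diffusion MRI, which is sparsely available in clinical practice, or do not consider radiotherapy dose maps, which are attracting growing interest.

Result: In a cohort of 92 patients, the model achieved an F1 score of 0.92 on an independent test set. Ablation experiments show that reliable classification largely depends on the radiotherapy dose map.

Insight: The novelty is integrating the radiotherapy dose map (radiation map) as a key modality alongside longitudinal MRI data for multimodal deep learning classification. Viewed objectively, the occlusion-based interpretability analysis confirms that the model attends to clinically relevant regions, pointing to a new way of using multimodal data to improve diagnostic accuracy in neuro-oncology.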

Abstract: The differentiation between tumor recurrence and radiation-induced contrast enhancements in post-treatment glioblastoma patients remains a major clinical challenge. Existing approaches rely on clinically sparsely available diffusion MRI or do not consider radiation maps, which are gaining increasing interest in the tumor board for this differentiation. We introduce RICE-NET, a multimodal 3D deep learning model that integrates longitudinal MRI data with radiotherapy dose distributions for automated lesion classification using conventional T1-weighted MRI data. Using a cohort of 92 patients, the model achieved an F1 score of 0.92 on an independent test set. During extensive ablation experiments, we quantified the contribution of each timepoint and modality and showed that reliable classification largely depends on the radiation map. Occlusion-based interpretability analyses further confirmed the model’s focus on clinically relevant regions. These findings highlight the potential of multimodal deep learning to enhance diagnostic accuracy and support clinical decision-making in neuro-oncology.


[63] Towards High-Fidelity CAD Generation via LLM-Driven Program Generation and Text-Based B-Rep Primitive Grounding cs.CVPDF

Jiahao Li, Qingwang Zhang, Qiuyu Chen, Guozhan Qiu, Yunzhong Lou

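TL;DR: This paper proposes FutureCAD, a novel text-to-CAD framework that combines large language models (LLMs) with a boundary representation (B-Rep) grounding transformer (BRepGround) to generate high-fidelity CAD models. The method generates executable CadQuery scripts and introduces a text-based query mechanism that lets the LLM specify geometric selections in natural language, which BRepGround then grounds to the target primitives.

Details

Motivation: Existing CAD generation methods are split between parametric modeling and direct B-Rep synthesis. Because parametric operations and B-Rep are inherently intertwined in modern feature-based CAD systems, this paradigm gap limits AI-driven CAD modeling for complex industrial product design.

Result: Experiments show that FutureCAD achieves state-of-the-art (SOTA) CAD generation performance. The abstract does not name specific benchmarks; the dataset of real-world CAD models it mentions is constructed for training.

Insight: The novelties include LLM-driven program generation with text-based B-Rep primitive grounding, which bridges the gap between parametric modeling and B-Rep; training the LLM with supervised fine-tuning (SFT) followed by reinforcement learning (RL) to improve generalization; and a new real-world CAD dataset built to train the framework.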

Abstract: The field of Computer-Aided Design (CAD) generation has made significant progress in recent years. Existing methods typically fall into two separate categories: parametric CAD modeling and direct boundary representation (B-Rep) synthesis. In modern feature-based CAD systems, parametric modeling and B-Rep are inherently intertwined, as advanced parametric operations (e.g., fillet and chamfer) require explicit selection of B-Rep geometric primitives, and the B-Rep itself is derived from parametric operations. Consequently, this paradigm gap remains a critical factor limiting AI-driven CAD modeling for complex industrial product design. This paper presents FutureCAD, a novel text-to-CAD framework that leverages large language models (LLMs) and a B-Rep grounding transformer (BRepGround) for high-fidelity CAD generation. Our method generates executable CadQuery scripts, and introduces a text-based query mechanism that enables the LLM to specify geometric selections via natural language, which BRepGround then grounds to the target primitives. To train our framework, we construct a new dataset comprising real-world CAD models. For the LLM, we apply supervised fine-tuning (SFT) to establish fundamental CAD generation capabilities, followed by reinforcement learning (RL) to improve generalization. Experiments show that FutureCAD achieves state-of-the-art CAD generation performance.


[64] Linking Perception, Confidence and Accuracy in MLLMs cs.CV | cs.CLPDF

Yuetian Du, Yucheng Wang, Rongyu Zhang, Zhijie Xu, Boyu Yang

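TL;DR: This paper reveals a severe confidence miscalibration problem in multimodal large language models (MLLMs): the models cannot reliably tell when they do not know the answer. To address this, it proposes Confidence-Driven Reinforcement Learning (CDRL) to enhance perceptual sensitivity and calibrate confidence, and further proposes Confidence-Aware Test-Time Scaling (CA-TTS), a framework that uses confidence signals to dynamically coordinate Self-Consistency, Self-Reflection, and Visual Self-Check modules, scheduled by an Expert Model.

Details

Motivation: Current MLLM research focuses mainly on enhancing visual perception to improve accuracy and neglects whether a model is aware of its own uncertainty, that is, whether it knows what it does not know. A probing experiment reveals confidence miscalibration in MLLMs, which the paper sets out to fix.

Result: The integrated framework sets new state-of-the-art (SOTA) results on four benchmarks, with consistent gains of 8.8%. Ablation studies also demonstrate the effectiveness of each module and the superiority of the scaling scheme.

Insight: The novelty is treating confidence calibration as the central problem and proposing the CDRL training method and the CA-TTS test-time coordination framework. The key design choices are reinforcement learning with original-noise image pairs and a confidence-based reward to calibrate confidence, and, at test time, confidence-guided scheduling of several introspection modules by an Expert Model that acts in multiple roles (such as Planner, Critic, and Voter) and provides external verification.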

Abstract: Recent advances in Multi-modal Large Language Models (MLLMs) have predominantly focused on enhancing visual perception to improve accuracy. However, a critical question remains unexplored: Do models know when they do not know? Through a probing experiment, we reveal a severe confidence miscalibration problem in MLLMs. To address this, we propose Confidence-Driven Reinforcement Learning (CDRL), which uses original-noise image pairs and a novel confidence-based reward to enhance perceptual sensitivity and robustly calibrate the model’s confidence. Beyond training benefits, calibrated confidence enables more effective test-time scaling as a free lunch. We further propose Confidence-Aware Test-Time Scaling (CA-TTS), which dynamically coordinates Self-Consistency, Self-Reflection, and Visual Self-Check modules guided by confidence signals. An Expert Model acts in multiple roles (e.g., Planner, Critic, Voter) to schedule these modules and provide external verification. Our integrated framework establishes new state-of-the-art results with consistent 8.8% gains across four benchmarks. More ablation studies demonstrate the effectiveness of each module and scaling superiority.
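Confidence miscalibration of the kind described above is commonly quantified with the expected calibration error (ECE). The abstract does not state which metric the authors use, so the following is only a minimal sketch under the assumption that each answer comes with a confidence in [0, 1]; the function name and the bin count are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    Assumption: `confidences` are probabilities in [0, 1] and `correct`
    holds one boolean per answer. Neither name comes from the paper.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if hit else 0.0))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece
```

A model that answers with 0.9 confidence but is right only three times out of four would score an ECE of about 0.15 here; a well-calibrated model scores close to zero.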


[65] ZeroSense: How Vision matters in Long Context Compression cs.CVPDF

Yonghan Gao, Zehong Chen, Lijian Xu, Jingzhi Chen, Jingwei Guan

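TL;DR: This paper proposes the ZeroSense benchmark and a new evaluation framework that decouples the capabilities of multimodal large language models (MLLMs), so that the quality of visual-text compression (VTC) methods can be assessed more accurately instead of relying on downstream task performance.

Details

Motivation: Existing evaluations of VTC methods (e.g., DeepSeek-OCR) rely too heavily on downstream task performance. Because of the strong inherent linguistic priors of MLLMs, they cannot accurately measure text fidelity, so a decoupled evaluation framework is needed to assess compression quality in isolation.

Result: Extensive experiments on multiple datasets show that VTC quality and downstream task accuracy diverge significantly, confirming the need for the proposed decoupled evaluation framework.

Insight: The novelty is the ZeroSense benchmark, which eliminates contextual dependencies by ensuring low semantic correlation among test samples, so that results purely reflect VTC quality. This provides a more reliable benchmark for assessing the role of vision in long-context compression.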

Abstract: Recent visual-text compression (VTC) methods, typified by DeepSeek-OCR, report impressively high token compression ratios for long-context modeling tasks by leveraging text-to-image rendering. However, existing evaluation protocols heavily rely on downstream task performance. Such evaluation metrics fail to accurately measure text preservation due to the strong inherent linguistic priors of Multimodal Large Language Models (MLLMs). In this work, we introduce a new evaluation framework that decouples MLLMs’ capabilities to faithfully assess VTC quality. Within this framework, we further introduce the ZeroSense Benchmark to ensure low semantic correlation of testing samples. By eliminating contextual dependencies, our benchmark guarantees that the evaluation results are purely reflective of VTC quality, unaffected by the semantic inference capabilities of downstream models. Extensive experiments across multiple datasets demonstrate that VTC quality and downstream task accuracy diverge significantly, highlighting the necessity of our decoupled evaluation framework.


[66] EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models cs.CV | cs.CLPDF

Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei

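TL;DR: This paper proposes the EndoCoT framework, which activates the reasoning potential of multimodal large language models through an iterative thought guidance module and connects it to the denoising process of diffusion models in order to solve complex tasks.

Details

Motivation: Existing methods integrate MLLMs into diffusion models as text encoders, but suffer from insufficient reasoning depth and guidance that stays invariant during decoding, and therefore cannot handle complex tasks effectively.

Result: Across multiple benchmarks (e.g., Maze, TSP, VSP, and Sudoku), the method reaches an average accuracy of 92.1%, 8.3 percentage points above the strongest baseline.

Insight: The novelty lies in an iterative thought guidance module and a terminal thought grounding module that enable step-by-step reasoning in the MLLM and align the reasoning trajectory with textual supervision, improving the diffusion model's ability to handle complex tasks.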

Abstract: Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks primarily as text encoders to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) The MLLM text encoder exhibits insufficient reasoning depth. Single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance for complex tasks. (ii) The guidance remains invariant during the decoding process. Invariant guidance during decoding prevents DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates MLLMs’ reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT’s denoising process. Second, a terminal thought grounding module is applied to ensure the reasoning trajectory remains grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step-by-step manner. Extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku) achieve an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points.


[67] InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model cs.CVPDF

InSpatio Team, Xiaoyu Zhang, Weihong Pan, Zhichao Ye, Jialin Liu

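TL;DR: InSpatio-WorldFM is an open-source real-time generative frame model for spatial intelligence. It adopts a frame-based paradigm that generates each frame independently, enabling low-latency real-time spatial inference, and enforces multi-view spatial consistency through explicit 3D anchors and implicit spatial memory, preserving global scene geometry and fine-grained visual details.

Details

Motivation: To address the high latency of traditional video-based world models, which stems from sequential frame generation and window-level processing, and to offer an efficient alternative for real-time world simulation.

Result: Experiments show that InSpatio-WorldFM maintains strong multi-view consistency while supporting interactive exploration on consumer-grade GPUs.

Insight: The innovations include an independent frame-generation paradigm to reduce latency, explicit and implicit mechanisms to ensure spatial consistency, and a progressive three-stage training pipeline that turns a pretrained image diffusion model into a controllable frame model and then into a real-time generator.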

Abstract: We present InSpatio-WorldFM, an open-source real-time frame model for spatial intelligence. Unlike video-based world models that rely on sequential frame generation and incur substantial latency due to window-level processing, InSpatio-WorldFM adopts a frame-based paradigm that generates each frame independently, enabling low-latency real-time spatial inference. By enforcing multi-view spatial consistency through explicit 3D anchors and implicit spatial memory, the model preserves global scene geometry while maintaining fine-grained visual details across viewpoint changes. We further introduce a progressive three-stage training pipeline that transforms a pretrained image diffusion model into a controllable frame model and finally into a real-time generator through few-step distillation. Experimental results show that InSpatio-WorldFM achieves strong multi-view consistency while supporting interactive exploration on consumer-grade GPUs, providing an efficient alternative to traditional video-based world models for real-time world simulation.


[68] PicoSAM3: Real-Time In-Sensor Region-of-Interest Segmentation cs.CVPDF

Pietro Bonazzi, Nicola Farronato, Stefan Zihlmann, Haotong Qin, Michele Magno

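TL;DR: This paper proposes PicoSAM3, a lightweight promptable visual segmentation model optimized for edge and in-sensor execution, targeting real-time, on-device segmentation. The model combines a dense CNN architecture, region-of-interest prompt encoding, Efficient Channel Attention, and knowledge distillation from SAM2 and SAM3. It has only 1.3 million parameters and outperforms existing SAM-based and edge-oriented baselines on COCO and LVIS. The quantized model achieves real-time inference at 11.82 ms latency on the Sony IMX500 vision sensor with negligible accuracy loss.

Details

Motivation: To meet the need for real-time, on-device segmentation in latency-sensitive and privacy-preserving applications (such as smart glasses and IoT devices), and to fit the model to the strict memory and compute limits of edge and in-sensor deployment.

Result: The model reaches 65.45% and 64.01% mIoU on COCO and LVIS, respectively, outperforming existing SAM-based and edge-oriented baselines of similar or lower complexity. The INT8-quantized model runs in real time on the IMX500 sensor at 11.82 ms latency with almost no accuracy loss.

Insight: The innovations include combining a dense CNN with a promptable segmentation architecture, adopting Efficient Channel Attention, and using knowledge distillation from large SAM models to substantially improve performance (up to +14.5% mIoU over supervised training). The work also demonstrates that high-quality, spatially flexible promptable segmentation is feasible at the sensor level, offering an efficient solution for real-time vision tasks on edge devices.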

Abstract: Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications such as smart glasses and Internet-of-Things devices. We introduce PicoSAM3, a lightweight promptable visual segmentation model optimized for edge and in-sensor execution, including deployment on the Sony IMX500 vision sensor. PicoSAM3 has 1.3 M parameters and combines a dense CNN architecture with region of interest prompt encoding, Efficient Channel Attention, and knowledge distillation from SAM2 and SAM3. On COCO and LVIS, PicoSAM3 achieves 65.45% and 64.01% mIoU, respectively, outperforming existing SAM-based and edge-oriented baselines at similar or lower complexity. The INT8 quantized model preserves accuracy with negligible degradation while enabling real-time in-sensor inference at 11.82 ms latency on the IMX500, fully complying with its memory and operator constraints. Ablation studies show that distillation from large SAM models yields up to +14.5% mIoU improvement over supervised training and demonstrate that high-quality, spatially flexible promptable segmentation is feasible directly at the sensor level.
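The mIoU figures above average an intersection-over-union score over evaluated masks. The abstract does not spell out the averaging protocol (per instance, per class, or per image), so the sketch below assumes the simplest case: binary masks given as nested lists of 0/1 and a plain mean over mask pairs. The function names are illustrative.

```python
def iou(pred, target):
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention chosen for this sketch: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(preds, targets):
    """Plain mean of per-mask IoU (assumed averaging, not the paper's protocol)."""
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)
```

For example, a prediction that covers two pixels where only one is labeled scores 0.5, and averaging it with a perfect mask gives 0.75.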


[69] Multimodal Emotion Recognition via Bi-directional Cross-Attention and Temporal Modeling cs.CV | cs.AIPDF

Junhyeong Byeon, Jeongyeol Kim, Sejoon Lim

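TL;DR: This paper proposes a multimodal framework for emotion recognition in in-the-wild video. It combines visual features (CLIP encoding) and audio features (Wav2Vec 2.0 encoding), fuses them across modalities with a bi-directional cross-attention module, and models temporal dependencies with a Temporal Convolutional Network (TCN) to improve emotion recognition in complex real-world scenes.

Details

Motivation: In-the-wild video emotion recognition is challenging because of large variations in facial appearance, pose, illumination, background noise, and emotional dynamics. A single modality (face only or speech only) is insufficient to capture complex emotional cues, so a multimodal approach is needed.

Result: On the ABAW 10th EXPR benchmark, the method provides a strong multimodal baseline and outperforms unimodal modeling, demonstrating its effectiveness.

Insight: The innovations include using pretrained models (CLIP and Wav2Vec 2.0) as frozen backbones, a TCN for temporal modeling, bi-directional cross-attention for symmetric cross-modal interaction, and a contrastive objective based on CLIP text features to strengthen semantic alignment.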

Abstract: Emotion recognition in in-the-wild video data remains a challenging problem due to large variations in facial appearance, head pose, illumination, background noise, and the inherently dynamic nature of human affect. Relying on a single modality, such as facial expressions or speech, is often insufficient to capture these complex emotional cues. To address this issue, we propose a multimodal emotion recognition framework for the Expression (EXPR) Recognition task in the 10th Affective Behavior Analysis in-the-wild (ABAW) Challenge. Our approach leverages large-scale pre-trained models, namely CLIP for visual encoding and Wav2Vec 2.0 for audio representation learning, as frozen backbone networks. To model temporal dependencies in facial expression sequences, we employ a Temporal Convolutional Network (TCN) over fixed-length video windows. In addition, we introduce a bi-directional cross-attention fusion module, in which visual and audio features interact symmetrically to enhance cross-modal contextualization and capture complementary emotional information. A lightweight classification head is then used for final emotion prediction. We further incorporate a text-guided contrastive objective based on CLIP text features to encourage semantically aligned visual representations. Experimental results on the ABAW 10th EXPR benchmark show that the proposed framework provides a strong multimodal baseline and achieves improved performance over unimodal modeling. These results demonstrate the effectiveness of combining temporal visual modeling, audio representation learning, and cross-modal fusion for robust emotion recognition in unconstrained real-world environments.
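The symmetric interaction in the bi-directional cross-attention module can be illustrated with a stripped-down sketch. It assumes both modalities are already projected to the same feature dimension, uses a single head with no learned projections, and adds a residual connection. All of these are simplifying assumptions for illustration, not details taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Scaled dot-product attention; `context` serves as both keys and values."""
    scale = math.sqrt(len(queries[0]))
    dim = len(context[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        )
    return attended


def bidirectional_cross_attention(visual, audio):
    """Each modality queries the other; results are added back residually."""
    visual_ctx = attend(visual, audio)  # visual tokens attend to audio tokens
    audio_ctx = attend(audio, visual)   # audio tokens attend to visual tokens
    visual_out = [[x + c for x, c in zip(v, ctx)] for v, ctx in zip(visual, visual_ctx)]
    audio_out = [[x + c for x, c in zip(a, ctx)] for a, ctx in zip(audio, audio_ctx)]
    return visual_out, audio_out
```

Each output sequence keeps the length of its own modality, so a downstream classification head can pool the two streams separately or jointly.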


[70] HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios cs.CV | cs.AI | cs.CRPDF

Jiayue Pu, Zhongxiang Sun, Zilu Zhang, Xiao Zhang, Jun Xu

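TL;DR: This paper proposes HomeSafe-Bench, a benchmark for evaluating vision-language models on dynamic unsafe action detection in household scenarios, and designs HD-Guard, a hierarchical streaming architecture for real-time safety monitoring.

Details

Motivation: Existing safety evaluations are mostly based on static images or text and cannot adequately assess the detection of dynamic unsafe actions in household environments, so a dedicated benchmark and a real-time monitoring solution are needed.

Result: HD-Guard achieves a superior trade-off between latency and performance, and the analysis reveals key bottlenecks in current VLM-based safety detection.

Insight: The innovations include HomeSafe-Bench, a benchmark built with a hybrid pipeline combining physical simulation and video generation, and HD-Guard, a hierarchical dual-brain architecture that balances inference efficiency and detection accuracy by pairing a lightweight FastBrain with an asynchronous large-scale SlowBrain.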

Abstract: The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and lack of common sense knowledge can lead to dangerous errors. Current safety evaluations, often restricted to static images, text, or general hazards, fail to adequately benchmark dynamic unsafe action detection in these specific contexts. To bridge this gap, we introduce HomeSafe-Bench, a challenging benchmark designed to evaluate Vision-Language Models (VLMs) on unsafe action detection in household scenarios. HomeSafe-Bench is constructed via a hybrid pipeline combining physical simulation with advanced video generation and features 438 diverse cases across six functional areas with fine-grained multidimensional annotations. Beyond benchmarking, we propose Hierarchical Dual-Brain Guard for Household Safety (HD-Guard), a hierarchical streaming architecture for real-time safety monitoring. HD-Guard coordinates a lightweight FastBrain for continuous high-frequency screening with an asynchronous large-scale SlowBrain for deep multimodal reasoning, effectively balancing inference efficiency with detection accuracy. Evaluations demonstrate that HD-Guard achieves a superior trade-off between latency and performance, while our analysis identifies critical bottlenecks in current VLM-based safety detection.


[71] Ada3Drift: Adaptive Training-Time Drifting for One-Step 3D Visuomotor Robotic Manipulation cs.CVPDF

Chongyang Xu, Yixian Zou, Ziliang Feng, Fanman Meng, Shuaicheng Liu

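TL;DR: This paper proposes Ada3Drift, which moves iterative refinement from inference to training. It learns a training-time drifting field that attracts predicted actions toward expert demonstration modes and repels them from other generated samples, enabling high-fidelity single-step (1 NFE) action generation from 3D point cloud observations and addressing the failure of existing single-step methods to preserve multimodal action distributions.

Details

Motivation: Diffusion-based visuomotor policies capture multimodal action distributions effectively but have high inference latency. Existing flow matching and consistency-based methods achieve single-step generation but sacrifice multimodal fidelity, collapsing action modes into averaged, physically infeasible trajectories. The compute budget asymmetry in robotics (offline training vs. real-time inference) motivates shifting iterative refinement to training time to recover multimodal fidelity.

Result: Experiments on three simulation benchmarks (Adroit, Meta-World, and RoboTwin) and real-world robotic manipulation tasks show that Ada3Drift achieves state-of-the-art performance while requiring 10× fewer function evaluations than diffusion-based alternatives.

Insight: The innovations include: 1) the concept of a training-time drifting field, which shifts iterative refinement from inference to training for efficient single-step multimodal generation; 2) a sigmoid-scheduled loss for the few-shot robotic regime, transitioning smoothly from coarse distribution learning to mode-sharpening refinement; 3) multi-scale field aggregation that captures action modes at different spatial granularities. Viewed objectively, the core novelty is exploiting the training-inference compute asymmetry, using training-time optimization to compensate for the information loss of single-step generation, which makes it an efficient compromise.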

Abstract: Diffusion-based visuomotor policies effectively capture multimodal action distributions through iterative denoising, but their high inference latency limits real-time robotic control. Recent flow matching and consistency-based methods achieve single-step generation, yet sacrifice the ability to preserve distinct action modes, collapsing multimodal behaviors into averaged, often physically infeasible trajectories. We observe that the compute budget asymmetry in robotics (offline training vs. real-time inference) naturally motivates recovering this multimodal fidelity by shifting iterative refinement from inference time to training time. Building on this insight, we propose Ada3Drift, which learns a training-time drifting field that attracts predicted actions toward expert demonstration modes while repelling them from other generated samples, enabling high-fidelity single-step generation (1 NFE) from 3D point cloud observations. To handle the few-shot robotic regime, Ada3Drift further introduces a sigmoid-scheduled loss transition from coarse distribution learning to mode-sharpening refinement, and multi-scale field aggregation that captures action modes at varying spatial granularities. Experiments on three simulation benchmarks (Adroit, Meta-World, and RoboTwin) and real-world robotic manipulation tasks demonstrate that Ada3Drift achieves state-of-the-art performance while requiring 10× fewer function evaluations than diffusion-based alternatives.
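One way to picture a sigmoid-scheduled transition between two loss terms is shown below. The abstract gives no formula, so the schedule shape, the `steepness` and `midpoint` parameters, and the linear blend are all assumptions made for illustration rather than the authors' definition.

```python
import math


def sigmoid_weight(step, total_steps, steepness=10.0, midpoint=0.5):
    """Weight in (0, 1) that rises smoothly as training progresses.

    `steepness` and `midpoint` are hypothetical knobs, not values from the paper.
    """
    progress = step / total_steps
    return 1.0 / (1.0 + math.exp(-steepness * (progress - midpoint)))


def blended_loss(coarse_loss, refine_loss, step, total_steps):
    """Shift emphasis from the coarse term to the refinement term over time."""
    w = sigmoid_weight(step, total_steps)
    return (1.0 - w) * coarse_loss + w * refine_loss
```

Early in training the coarse distribution-learning term dominates; past the midpoint the mode-sharpening term takes over, with no abrupt switch between the two.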


[72] CrossEarth-SAR: A SAR-Centric and Billion-Scale Geospatial Foundation Model for Domain Generalizable Semantic Segmentation cs.CVPDF

Ziqi Ye, Ziyang Gong, Ning Liao, Xiaoxing Hu, Di Wang

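TL;DR: This paper proposes CrossEarth-SAR, a billion-scale SAR vision foundation model built on a novel physics-guided sparse mixture-of-experts (MoE) architecture and designed for cross-domain semantic segmentation. The authors build a large-scale dataset, CrossEarth-SAR-200K, for pre-training and establish a benchmark suite of 22 sub-benchmarks spanning 8 distinct domain gaps. Experiments show that the model achieves SOTA results on multiple benchmarks.

Details

Motivation: Synthetic Aperture Radar (SAR) enables global, all-weather Earth observation, but because of diverse imaging mechanisms, domain shifts across sensors and regions severely hinder semantic generalization. The paper aims to solve the generalization problem in cross-domain semantic segmentation of SAR imagery.

Result: Extensive experiments show that CrossEarth-SAR achieves state-of-the-art (SOTA) results on 20 of the 22 benchmarks, and on some benchmarks under multi-gap transfer it surpasses previous methods by more than 10% in mean intersection over union (mIoU).

Insight: The main innovations are: 1) the first billion-scale vision foundation model designed for SAR imagery; 2) a novel physics-guided sparse mixture-of-experts (MoE) architecture that incorporates physical descriptors; 3) a large-scale unified dataset and the first unified benchmark suite for domain-generalizable semantic segmentation of SAR imagery. Viewed objectively, combining physical priors with a large-scale model architecture (MoE) to tackle domain shift in a specific modality (SAR) is a promising research direction.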

Abstract: Synthetic Aperture Radar (SAR) enables global, all-weather earth observation. However, owing to diverse imaging mechanisms, domain shifts across sensors and regions severely hinder its semantic generalization. To address this, we present CrossEarth-SAR, the first billion-scale SAR vision foundation model built upon a novel physics-guided sparse mixture-of-experts (MoE) architecture incorporating physical descriptors, explicitly designed for cross-domain semantic segmentation. To facilitate large-scale pre-training, we develop CrossEarth-SAR-200K, a weakly and fully supervised dataset that unifies public and private SAR imagery. We also introduce a benchmark suite comprising 22 sub-benchmarks across 8 distinct domain gaps, establishing the first unified standard for domain generalization semantic segmentation on SAR imagery. Extensive experiments demonstrate that CrossEarth-SAR achieves state-of-the-art results on 20 benchmarks, surpassing previous methods by over 10% mIoU on some benchmarks under multi-gap transfer. All code, benchmark and datasets will be publicly available.


[73] Pano360: Perspective to Panoramic Vision with Geometric Consistency cs.CVPDF

Zhengdong Zhu, Weiyi Xue, Zuyuan Yang, Wenlve Zhou, Zhiheng Zhou

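TL;DR: This paper proposes Pano360, which extends the traditional 2D panorama stitching task to 3D photogrammetric space. A transformer-based architecture aggregates global information across views, camera poses guide image warping in 3D space for global alignment, and a multi-feature joint optimization strategy computes the seams.

Details

Motivation: Existing panorama stitching methods rely too heavily on pairwise feature correspondences and cannot exploit multi-view geometric consistency, which causes severe distortion and misalignment, especially in challenging scenes with weak textures, large parallax, and repetitive patterns.

Result: Extensive experiments on a newly constructed large-scale dataset of real-world scenes show that the method significantly outperforms existing alternatives in alignment accuracy and perceptual quality.

Insight: The novelty lies in extending the 2D alignment task to 3D space to directly exploit more accurate, globally consistent geometric correspondences, adopting a transformer architecture for 3D awareness and global information aggregation, and introducing camera-pose-guided 3D image warping with a multi-feature joint optimization strategy. Viewed objectively, the 3D spatial modeling and the use of geometric consistency are the key to the improved stitching robustness.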

Abstract: Prior panorama stitching approaches heavily rely on pairwise feature correspondences and are unable to leverage geometric consistency across multiple views. This leads to severe distortion and misalignment, especially in challenging scenes with weak textures, large parallax, and repetitive patterns. Given that multi-view geometric correspondences can be directly constructed in 3D space, making them more accurate and globally consistent, we extend the 2D alignment task to the 3D photogrammetric space. We adopt a novel transformer-based architecture to achieve 3D awareness and aggregate global information across all views. It directly utilizes camera poses to guide image warping for global alignment in 3D space and employs a multi-feature joint optimization strategy to compute the seams. Additionally, to establish an evaluation benchmark and train our network, we constructed a large-scale dataset of real-world scenes. Extensive experiments show that our method significantly outperforms existing alternatives in alignment accuracy and perceptual quality.


[74] Single Pixel Image Classification using an Ultrafast Digital Light Projector cs.CV | physics.opticsPDF

Aisha Kanwal, Graeme E. Johnstone, Fahimeh Dehkhoda, Johannes H. Herrnsdorf, Robert K. Henderson

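TL;DR: This paper proposes an ultrafast image classification method that combines single pixel imaging (SPI) with low-complexity machine learning models. It uses a microLED-on-CMOS digital light projector for sub-millisecond image encoding and classifies directly via a spatiotemporal transformation of the information, without image reconstruction. The method is validated experimentally on MNIST handwritten digit classification.

Details

Motivation: Machine vision applications such as autonomous driving need real-time, high-speed classification of complex information in dynamic environments. The work aims to bypass the conventional image reconstruction step with single pixel imaging and thereby classify at ultrafast rates.

Result: Classification accuracy is evaluated on the MNIST benchmark, comparing two low-complexity models: an extreme learning machine (ELM) and a backpropagation-trained deep neural network. Both models are kept simple so that the inference overhead is comparable to the image generation time. Used as a binary classifier, the SPI-based ELM shows potential for efficient anomaly detection in ultrafast imaging scenarios.

Insight: The novelty is combining single pixel imaging with low-complexity machine learning models and classifying directly through a spatiotemporal transformation, entirely bypassing image reconstruction and thus achieving ultrafast classification at multi-kHz frame rates. Viewed objectively, the co-design of hardware (microLED-on-CMOS projector) and algorithm (low-complexity models) offers an efficient solution for real-time machine vision.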

Abstract: Pattern recognition and image classification are essential tasks in machine vision. Autonomous vehicles, for example, require being able to collect the complex information contained in a changing environment and classify it in real time. Here, we experimentally demonstrate image classification at multi-kHz frame rates combining the technique of single pixel imaging (SPI) with a low complexity machine learning model. The use of a microLED-on-CMOS digital light projector for SPI enables ultrafast pattern generation for sub-ms image encoding. We investigate the classification accuracy of our experimental system against the broadly accepted benchmarking task of the MNIST digits classification. We compare the classification performance of two machine learning models: an extreme learning machine (ELM) and a backpropagation-trained deep neural network. The complexity of both models is kept low so the overhead added to the inference time is comparable to the image generation time. Crucially, our single pixel image classification approach is based on a spatiotemporal transformation of the information, entirely bypassing the need for image reconstruction. By exploring the performance of our SPI-based ELM as a binary classifier, we demonstrate its potential for efficient anomaly detection in ultrafast imaging scenarios.
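In single pixel imaging, each projected pattern yields one scalar reading, and the resulting time series can be passed to a classifier without rebuilding the image. The sketch below illustrates that measurement step. The abstract does not name the pattern set, so Hadamard patterns (a common choice in SPI) are assumed here, and the ±1 entries stand in for what a real projector would realize with pairs of binary patterns.

```python
def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def single_pixel_measurements(image, patterns):
    """One scalar per pattern: the pattern-weighted sum of all pixel values."""
    flat = [pixel for row in image for pixel in row]
    return [sum(w * p for w, p in zip(pattern, flat)) for pattern in patterns]
```

The measurement vector, rather than a reconstructed image, would then serve as the input features for a low-complexity classifier such as an ELM.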


[75] Continual Learning with Vision-Language Models via Semantic-Geometry Preservation cs.CV | cs.LGPDF

Chiyuan He, Zihuan Qiu, Fanman Meng, Runtong Zhang, Linfeng Xu

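TL;DR: This paper proposes SeGP-CL, a method for mitigating catastrophic forgetting in continual learning with pretrained vision-language models. It constructs adversarial anchors to probe regions prone to semantic drift, and preserves cross-modal semantic structure through anchor-guided cross-modal geometry distillation and text semantic-geometry regularization, achieving state-of-the-art performance on multiple benchmarks.

Details

Motivation: When adapting to new tasks, existing continual learning methods do not explicitly preserve the cross-modal semantic geometry inherited from pretraining and previous stages. New-task supervision therefore induces geometric distortion, especially in vulnerable neighborhoods near the old-new semantic interface, where shared visual patterns are easily reinterpreted by new textual semantics.

Result: Extensive experiments on five continual learning benchmarks show that SeGP-CL consistently improves stability and forward transfer, achieves state-of-the-art performance, and better preserves the semantic geometry of vision-language models.

Insight: The novelty lies in explicitly protecting cross-modal semantic geometry: constructing adversarial anchors (via DPGD) to identify and stabilize drift-prone regions, designing anchor-guided cross-modal geometry distillation (ACGD) and a lightweight text semantic-geometry regularization (TSGR) to preserve structure during training, and, after training, estimating anchor-induced raw-space drift to transfer old visual prototypes and perform dual-path inference.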

Abstract: Continual learning of pretrained vision-language models (VLMs) is prone to catastrophic forgetting, yet current approaches adapt to new tasks without explicitly preserving the cross-modal semantic geometry inherited from pretraining and previous stages, allowing new-task supervision to induce geometric distortion. We observe that the most pronounced drift tends to concentrate in vulnerable neighborhoods near the old-new semantic interface, where shared visual patterns are easily re-explained by new textual semantics. To address this under an exemplar-free constraint, we propose Semantic Geometry Preservation for Continual Learning (SeGP-CL). SeGP-CL first probes the drift-prone region by constructing a compact set of adversarial anchors with dual-targeted projected gradient descent (DPGD), which drives selected new-task seeds toward old-class semantics while remaining faithful in raw visual space. During training, we preserve cross-modal structure by anchor-guided cross-modal geometry distillation (ACGD), and stabilize the textual reference frame across tasks via a lightweight text semantic-geometry regularization (TSGR). After training, we estimate anchor-induced raw-space drift to transfer old visual prototypes and perform dual-path inference by fusing cross-modal and visual cues. Extensive experiments on five continual learning benchmarks demonstrate that SeGP-CL consistently improves stability and forward transfer, achieving state-of-the-art performance while better preserving semantic geometry of VLMs.


[76] Coarse-Guided Visual Generation via Weighted h-Transform Sampling cs.CV | cs.AIPDF

Yanghao Wang, Ziqi Jiang, Zhen Wang, Long Chen

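TL;DR: This paper proposes a coarse-guided visual generation method based on weighted h-transform sampling. It modifies the transition probability in the diffusion sampling process by introducing a drift function that steers generation toward the ideal fine sample, and designs a noise-aware weighting schedule to balance guidance strength and synthesis quality, enabling high-quality image and video generation without training and without a known forward transformation operator.

Details

Motivation: Existing training-free guided generation methods either require a known forward transformation operator (e.g., bicubic downsampling) or struggle to balance guidance strength and synthesis quality. The goal is a more general visual generation method that needs no paired data and adapts flexibly to different degradation types.

Result: Extensive experiments on diverse image and video generation tasks show that the method follows the coarse guidance effectively while maintaining high-quality synthesis, demonstrating good effectiveness and generalization.

Insight: The novelty is introducing the h-transform into the diffusion sampling process: a drift function approximately guides the generation trajectory, and a noise-aware weighting schedule dynamically adjusts the weight of the guidance term to handle approximation error, yielding a flexible guided generation framework that needs neither training nor knowledge of the forward operator.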

Abstract: Coarse-guided visual generation, which synthesizes fine visual samples from degraded or low-fidelity coarse references, is essential for various real-world applications. While training-based approaches are effective, they are inherently limited by high training costs and restricted generalization due to paired data collection. Accordingly, recent training-free works propose to leverage pretrained diffusion models and incorporate guidance during the sampling process. However, these training-free methods either require knowing the forward (fine-to-coarse) transformation operator, e.g., bicubic downsampling, or struggle to balance guidance and synthesis quality. To address these challenges, we propose a novel guided method using the h-transform, a tool that can constrain stochastic processes (e.g., the sampling process) under desired conditions. Specifically, we modify the transition probability at each sampling timestep by adding a drift function to the original differential equation, which approximately steers the generation toward the ideal fine sample. To address unavoidable approximation errors, we introduce a noise-level-aware schedule that gradually de-weights the term as the error increases, ensuring both guidance adherence and high-quality synthesis. Extensive experiments across diverse image and video generation tasks demonstrate the effectiveness and generalization of our method.


[77] Dense Dynamic Scene Reconstruction and Camera Pose Estimation from Multi-View Videos cs.CVPDF

Shuo Sun, Unal Artan, Malcolm Mielle, Achim J. Lilienthal, Martin Magnusson

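TL;DR: This paper proposes a two-stage optimization framework for dense dynamic scene reconstruction and camera pose estimation from multi-view videos. It first extends monocular visual SLAM to the multi-camera setting by constructing a spatiotemporal connection graph for robust tracking, then refines dense depth and pose consistency using wide-baseline optical flow.

Details

Motivation: To tackle the challenging problem of dense dynamic scene reconstruction and camera pose estimation from multiple freely moving cameras (e.g., several observers filming the same event), overcoming the limitation that existing methods support only a single camera or require pre-calibrated rigid camera rigs.

Result: On synthetic and real-world benchmarks (including the newly introduced MultiCamRobolab dataset), the method significantly outperforms state-of-the-art (SOTA) feed-forward models while requiring less memory.

Insight: The innovations include: 1) a spatiotemporal connection graph that combines intra-camera temporal continuity and inter-camera spatial overlap for consistent scale and robust tracking across cameras; 2) wide-baseline initialization using feed-forward reconstruction models under limited overlap; 3) dense consistency refinement using wide-baseline optical flow; 4) a real-world dataset with ground-truth poses from a motion capture system.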

Abstract: We address the challenging problem of dense dynamic scene reconstruction and camera pose estimation from multiple freely moving cameras – a setting that arises naturally when multiple observers capture a shared event. Prior approaches either handle only single-camera input or require rigidly mounted, pre-calibrated camera rigs, limiting their practical applicability. We propose a two-stage optimization framework that decouples the task into robust camera tracking and dense depth refinement. In the first stage, we extend single-camera visual SLAM to the multi-camera setting by constructing a spatiotemporal connection graph that exploits both intra-camera temporal continuity and inter-camera spatial overlap, enabling consistent scale and robust tracking. To ensure robustness under limited overlap, we introduce a wide-baseline initialization strategy using feed-forward reconstruction models. In the second stage, we refine depth and camera poses by optimizing dense inter- and intra-camera consistency using wide-baseline optical flow. Additionally, we introduce MultiCamRobolab, a new real-world dataset with ground-truth poses from a motion capture system. Finally, we demonstrate that our method significantly outperforms state-of-the-art feed-forward models on both synthetic and real-world benchmarks, while requiring less memory.


[78] LoV3D: Grounding Cognitive Prognosis Reasoning in Longitudinal 3D Brain MRI via Regional Volume Assessments cs.CV | cs.AIPDF

Zhaoyang Jiang, Zhizhong Fu, David McAllister, Yunsoo Kim, Honghan Wu

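TL;DR: LoV3D is a pipeline for training 3D vision-language models that process longitudinal T1-weighted brain MRI. It produces a region-level anatomical assessment, performs a longitudinal comparison with the prior scan, and finally outputs a three-class diagnosis (Cognitively Normal, Mild Cognitive Impairment, or Dementia) together with a synthesized diagnostic summary. The pipeline grounds the final diagnosis by enforcing label consistency, longitudinal coherence, and biological plausibility, thereby reducing the risk of hallucination.

Details

Motivation: Current deep learning tools for longitudinal brain MRI are fragmented: classifiers reduce a scan to a label, volumetric pipelines produce measurements that are hard to interpret, and vision-language models may generate fluent but hallucinated conclusions. LoV3D aims to address these problems with a unified, interpretable pipeline that provides a more reliable basis for cognitive prognosis reasoning in neurodegenerative diseases such as Alzheimer's disease.

Result: On the ADNI test set (479 scans, 258 subjects), LoV3D achieves 93.7% three-class diagnostic accuracy (+34.8% over the no-grounding baseline), 97.2% two-class diagnostic accuracy (+4% over SOTA), and 82.6% region-level anatomical classification accuracy (+33.1% over VLM baselines). Zero-shot transfer reaches 95.4% on MIRIAD (100% Dementia recall) and 82.9% three-class accuracy on AIBL, confirming high generalizability across sites, scanners, and populations.

Insight: The novelty is a stepped, interpretable training pipeline for 3D vision-language models that reduces hallucination by enforcing multi-level constraints (label consistency, longitudinal coherence, biological plausibility). At its core is a clinically weighted Verifier that automatically scores candidate outputs against normative references derived from standardized volume metrics, driving Direct Preference Optimization without human annotation and enabling automated, interpretable, and reliable longitudinal brain MRI analysis.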

Abstract: Longitudinal brain MRI is essential for characterizing the progression of neurological diseases such as Alzheimer’s disease. However, current deep-learning tools fragment this process: classifiers reduce a scan to a label, volumetric pipelines produce uninterpreted measurements, and vision-language models (VLMs) may generate fluent but potentially hallucinated conclusions. We present LoV3D, a pipeline for training 3D vision-language models, which reads longitudinal T1-weighted brain MRI, produces a region-level anatomical assessment, conducts longitudinal comparison with the prior scan, and finally outputs a three-class diagnosis (Cognitively Normal, Mild Cognitive Impairment, or Dementia) along with a synthesized diagnostic summary. The stepped pipeline grounds the final diagnosis by enforcing label consistency, longitudinal coherence, and biological plausibility, thereby reducing the risks of hallucinations. The training process introduces a clinically-weighted Verifier that scores candidate outputs automatically against normative references derived from standardized volume metrics, driving Direct Preference Optimization without a single human annotation. On a subject-level held-out ADNI test set (479 scans, 258 subjects), LoV3D achieves 93.7% three-class diagnostic accuracy (+34.8% over the no-grounding baseline), 97.2% two-class diagnostic accuracy (+4% over the SOTA) and 82.6% region-level anatomical classification accuracy (+33.1% over VLM baselines). Zero-shot transfer yields 95.4% on MIRIAD (100% Dementia recall) and 82.9% three-class accuracy on AIBL, confirming high generalizability across sites, scanners, and populations. Code is available at https://github.com/Anonymous-TEVC/LoV-3D.
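The idea of a verifier driving Direct Preference Optimization can be sketched with the standard per-pair DPO loss. How the authors turn Verifier scores into preference pairs is not stated in the abstract, so the best-versus-worst pairing in `build_preference_pair` is an assumption, and the function names and the `beta` default are illustrative.

```python
import math


def build_preference_pair(candidates, verifier_score):
    """Pick the highest- and lowest-scored candidates as (chosen, rejected).

    Assumed pairing rule for illustration; `verifier_score` is any callable
    that maps a candidate output to a number.
    """
    ranked = sorted(candidates, key=verifier_score)
    return ranked[-1], ranked[0]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # Numerically stable form of log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy matches the reference model the loss equals log 2, and it falls as the policy assigns relatively more probability to the verifier-preferred output.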


[79] EvoTok: A Unified Image Tokenizer via Residual Latent Evolution for Visual Understanding and Generation cs.CVPDF

Yan Li, Ning Liao, Xiangyu Zhao, Shaofeng Zhang, Xiaoxing Wang

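TL;DR: This paper proposes EvoTok, an image tokenizer that unifies visual understanding and generation through residual latent evolution. Within a shared latent space, it encodes an image into a cascaded sequence of residual tokens via residual vector quantization, forming an evolution trajectory from low-level details to high-level semantics. This bridges the granularity gap between the high-level semantic abstractions needed for visual understanding and the fine-grained pixel-level representations needed for image generation.

Details

Motivation: To close the granularity gap in multimodal large language models between visual understanding (which needs high-level semantic abstraction) and image generation (which needs fine-grained pixel representations). Existing methods either impose both kinds of supervision on the same representation, causing interference, or decouple supervision into separate feature spaces, causing inconsistency.

Result: Trained on only 13 million images (far fewer than the billion-scale datasets used by earlier unified tokenizers), EvoTok achieves a strong reconstruction quality of 0.43 rFID on ImageNet-1K at 256x256 resolution. Integrated with a large language model, it performs well on 7 of 9 visual understanding benchmarks and achieves notable results on image generation benchmarks such as GenEval and GenAI-Bench.

Insight: The core innovation is modeling visual representation as an evolution trajectory within a shared latent space (implemented with residual vector quantization) rather than maintaining separate pixel and semantic token spaces. This offers an effective and principled way to unify visual understanding and generation, and shows that a structured residual evolution process can deliver strong performance even with a relatively small dataset.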

Abstract: The development of unified multimodal large language models (MLLMs) is fundamentally challenged by the granularity gap between visual understanding and generation: understanding requires high-level semantic abstractions, while image generation demands fine-grained pixel-level representations. Existing approaches usually either enforce both forms of supervision on the same set of representations or decouple them into separate feature spaces, leading to interference and inconsistency, respectively. In this work, we propose EvoTok, a unified image tokenizer that reconciles these requirements through a residual evolution process within a shared latent space. Instead of maintaining separate token spaces for pixels and semantics, EvoTok encodes an image into a cascaded sequence of residual tokens via residual vector quantization. This residual sequence forms an evolution trajectory where earlier stages capture low-level details and deeper stages progressively transition toward high-level semantic representations. Despite being trained on a relatively modest dataset of 13M images, far smaller than the billion-scale datasets used by many previous unified tokenizers, EvoTok achieves a strong reconstruction quality of 0.43 rFID on ImageNet-1K at 256x256 resolution. When integrated with a large language model, EvoTok shows promising performance across 7 out of 9 visual understanding benchmarks, and remarkable results on image generation benchmarks such as GenEval and GenAI-Bench. These results demonstrate that modeling visual representations as an evolving trajectory provides an effective and principled solution for unifying visual understanding and generation.
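The cascaded residual tokens mentioned above come from residual vector quantization, whose generic mechanics can be sketched as follows. This is textbook RVQ on a single vector with hand-written codebooks; it does not reproduce EvoTok's encoder, training losses, or the way its stages acquire low-level versus semantic roles, and all names are illustrative.

```python
def nearest(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared Euclidean)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def residual_quantize(vector, codebooks):
    """Quantize stage by stage; each stage encodes what earlier stages missed."""
    residual = list(vector)
    reconstruction = [0.0] * len(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest(residual, codebook)
        code = codebook[idx]
        indices.append(idx)
        reconstruction = [r + c for r, c in zip(reconstruction, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, reconstruction
```

The returned index list is the token sequence for the vector: one token per stage, with the running sum of selected codes forming a progressively refined reconstruction.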


[80] Hoi3DGen: Generating High-Quality Human-Object-Interactions in 3D cs.CV | cs.LGPDF

Agniv Sharma, Xianghui Xie, Tom Fischer, Eddy Ilg, Gerard Pons-Moll

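TL;DR: Hoi3DGen is a framework for generating high-quality 3D human-object interaction models from text. It uses multimodal large language models to build a high-quality interaction dataset and develops a full text-to-3D generation pipeline, substantially improving interaction fidelity and text consistency.

Details

Motivation: Existing methods rely on score distillation from text-to-image models and, because high-quality interaction data is scarce, suffer from the Janus problem (multi-faced artifacts) and follow text prompts inaccurately. The work addresses these issues to support applications such as AR, XR, and gaming.

Result: The method surpasses baselines by 4-15x in text consistency and 3-7x in 3D model quality, and generalizes well to diverse categories and interaction types while maintaining high-quality 3D generation.

Insight: The core innovation is using multimodal large language models to curate a high-quality, realistic interaction dataset and building an end-to-end text-to-3D generation pipeline on top of it, which tackles the generation quality problems caused by data scarcity at the root and yields order-of-magnitude improvements in interaction fidelity.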

Abstract: Modeling and generating 3D human-object interactions from text is crucial for applications in AR, XR, and gaming. Existing approaches often rely on score distillation from text-to-image models, but their results suffer from the Janus problem and do not follow text prompts faithfully due to the scarcity of high-quality interaction data. We introduce Hoi3DGen, a framework that generates high-quality textured meshes of human-object interaction that follow the input interaction descriptions precisely. We first curate realistic and high-quality interaction data leveraging multimodal large language models, and then create a full text-to-3D pipeline, which achieves orders-of-magnitude improvements in interaction fidelity. Our method surpasses baselines by 4-15x in text consistency and 3-7x in 3D model quality, exhibiting strong generalization to diverse categories and interaction types, while maintaining high-quality 3D generation.


[81] HATS: Hardness-Aware Trajectory Synthesis for GUI Agents cs.CVPDF

Rui Shao, Ruize Gao, Bin Xie, Yixing Li, Kaiwen Zhou

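TL;DR: This paper proposes HATS, a framework that uses hardness-aware trajectory synthesis to address the poor generalization of GUI agents caused by semantically ambiguous actions in training data. It comprises two complementary modules, hardness-driven exploration and alignment-guided refinement, which form a closed loop to improve data quality.

Details

Motivation: Existing trajectory synthesis methods for GUI agents neglect semantically ambiguous actions (context-dependent, sequentially dependent, or visually ambiguous actions). As a result, agents struggle to generalize to complex interactions, and task instructions are semantically misaligned with execution.

Result: Experiments on multiple benchmark GUI environments show that agents trained with HATS consistently outperform state-of-the-art baselines.

Insight: The novelty lies in defining "hardness" as the degree of semantic ambiguity of an action and building a closed-loop data collection and refinement mechanism around it, which actively generates and repairs difficult, informative interaction trajectories and improves the semantic alignment quality of the dataset.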

Abstract: Graphical user interface (GUI) agents powered by large vision-language models (VLMs) have shown remarkable potential in automating digital tasks, highlighting the need for high-quality trajectory data to support effective agent training. Yet existing trajectory synthesis pipelines often yield agents that fail to generalize beyond simple interactions. We identify this limitation as stemming from the neglect of semantically ambiguous actions, whose meanings are context-dependent, sequentially dependent, or visually ambiguous. Such actions are crucial for real-world robustness but are under-represented and poorly processed in current datasets, leading to semantic misalignment between task instructions and execution. To address these issues, we propose HATS, a Hardness-Aware Trajectory Synthesis framework designed to mitigate the impact of semantic ambiguity. We define hardness as the degree of semantic ambiguity associated with an action and develop two complementary modules: (1) hardness-driven exploration, which guides data collection toward ambiguous yet informative interactions, and (2) alignment-guided refinement, which iteratively validates and repairs instruction-execution alignment. The two modules operate in a closed loop: exploration supplies refinement with challenging trajectories, while refinement feedback updates the hardness signal to guide future exploration. Extensive experiments show that agents trained with HATS consistently outperform state-of-the-art baselines across benchmark GUI environments.


[82] O3N: Omnidirectional Open-Vocabulary Occupancy Prediction cs.CV | cs.RO | eess.IVPDF

Mengfei Duan, Hao Shi, Fei Teng, Guoqiang Zhao, Yuheng Zhang

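TL;DR: This paper proposes O3N, the first purely visual, end-to-end omnidirectional open-vocabulary occupancy prediction framework, targeting the comprehensive and safe scene perception that autonomous agents need in open-world exploration. It embeds voxels via a Polar-spiral Mamba module for continuous spatial representation and long-range context modeling, unifies geometric and semantic supervision with an Occupancy Cost Aggregation module, and harmonizes visual features, voxel embeddings, and text semantics with a Natural Modality Alignment module.

Details

Motivation: Existing 3D occupancy prediction methods are constrained by limited perspective inputs and predefined training distributions, so they cannot provide the comprehensive and safe scene perception that autonomous agents need in open-world exploration. An omnidirectional, open-vocabulary 3D occupancy prediction framework is therefore needed.

Result: Achieves state-of-the-art performance on the QuadOcc and Human360Occ benchmarks and shows remarkable cross-scene generalization and semantic scalability.

Insight: The innovations include: 1) embedding voxels in a polar-spiral topology for omnidirectional continuous spatial representation; 2) an Occupancy Cost Aggregation module that unifies geometric and semantic supervision within the voxel space; 3) a Natural Modality Alignment module that establishes a gradient-free alignment pathway, forming a consistent "pixel-voxel-text" representation triad. These methods offer new ideas for universal 3D world modeling.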

Abstract: Understanding and reconstructing the 3D world through omnidirectional perception is an inevitable trend in the development of autonomous agents and embodied intelligence. However, existing 3D occupancy prediction methods are constrained by limited perspective inputs and predefined training distribution, making them difficult to apply to embodied agents that require comprehensive and safe perception of scenes in open world exploration. To address this, we present O3N, the first purely visual, end-to-end Omnidirectional Open-vocabulary Occupancy predictioN framework. O3N embeds omnidirectional voxels in a polar-spiral topology via the Polar-spiral Mamba (PsM) module, enabling continuous spatial representation and long-range context modeling across 360°. The Occupancy Cost Aggregation (OCA) module introduces a principled mechanism for unifying geometric and semantic supervision within the voxel space, ensuring consistency between the reconstructed geometry and the underlying semantic structure. Moreover, Natural Modality Alignment (NMA) establishes a gradient-free alignment pathway that harmonizes visual features, voxel embeddings, and text semantics, forming a consistent “pixel-voxel-text” representation triad. Extensive experiments on multiple models demonstrate that our method not only achieves state-of-the-art performance on QuadOcc and Human360Occ benchmarks but also exhibits remarkable cross-scene generalization and semantic scalability, paving the way toward universal 3D world modeling. The source code will be made publicly available at https://github.com/MengfeiD/O3N.


[83] FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance cs.CV | cs.AI | cs.LG | cs.MMPDF

Quanhao Li, Zhen Xing, Rui Wang, Haidong Cao, Qi Dai

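TL;DR: This paper proposes FlashMotion, a new training framework for few-step trajectory-controllable video generation. By combining trajectory adapter training, generator distillation, and hybrid-objective finetuning, it accelerates generation while preserving video quality and trajectory accuracy.

Details

Motivation: Existing trajectory-controllable video generation methods rely on multi-step denoising, which incurs substantial time redundancy and computational overhead, while directly applying existing video distillation methods noticeably degrades video quality and trajectory accuracy. An efficient few-step solution is therefore needed.

Result: On the proposed FlashBench benchmark, FlashMotion surpasses existing video distillation methods and multi-step models on two adapter architectures, with better visual quality and trajectory consistency.

Insight: The novelty lies in combining trajectory adapter training with generator distillation and then aligning the adapter with the few-step generator through finetuning with a hybrid of diffusion and adversarial objectives, enabling efficient and accurate trajectory-controllable video generation.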

Abstract: Recent advances in trajectory-controllable video generation have achieved remarkable progress. Previous methods mainly use adapter-based architectures for precise motion control along predefined trajectories. However, all these methods rely on a multi-step denoising process, leading to substantial time redundancy and computational overhead. While existing video distillation methods successfully distill multi-step generators into few-step, directly applying these approaches to trajectory-controllable video generation results in noticeable degradation in both video quality and trajectory accuracy. To bridge this gap, we introduce FlashMotion, a novel training framework designed for few-step trajectory-controllable video generation. We first train a trajectory adapter on a multi-step video generator for precise trajectory control. Then, we distill the generator into a few-step version to accelerate video generation. Finally, we finetune the adapter using a hybrid strategy that combines diffusion and adversarial objectives, aligning it with the few-step generator to produce high-quality, trajectory-accurate videos. For evaluation, we introduce FlashBench, a benchmark for long-sequence trajectory-controllable video generation that measures both video quality and trajectory accuracy across varying numbers of foreground objects. Experiments on two adapter architectures show that FlashMotion surpasses existing video distillation methods and previous multi-step models in both visual quality and trajectory consistency.


[84] EgoIntent: An Egocentric Step-level Benchmark for Understanding What, Why, and Next cs.CVPDF

Ye Pan, Chi Kit Wong, Yuanhuiyi Lyu, Hanqian Li, Jiahao Huo

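TL;DR: This paper introduces the EgoIntent benchmark for evaluating how well multimodal large language models understand fine-grained human intent in egocentric videos, covering local intent (What), global intent (Why), and next-step plan (Next). The benchmark comprises 3,014 steps across 15 daily-life scenarios. An evaluation of 15 MLLMs shows that even the best model averages only 33.31, indicating that the task is highly challenging.

Details

Motivation: Existing benchmarks focus mainly on episode-level intent reasoning and overlook finer-grained step-level intent understanding. Applications such as intelligent assistants and robotic imitation learning, however, need to understand what a person is doing at each step, why, and what comes next in order to provide timely, context-aware support.

Result: Fifteen MLLMs (including state-of-the-art closed-source and open-source models) were evaluated on EgoIntent. The best model achieves an average score of only 33.31 across the three intent dimensions, showing that step-level intent understanding remains highly challenging.

Insight: The novelty lies in building a step-level intent understanding benchmark that truncates each clip just before the key outcome occurs to prevent future-frame leakage, enabling a clean evaluation of anticipatory step understanding and next-step planning. Viewed objectively, the work highlights the limitations of MLLMs in fine-grained video reasoning and gives future research a clear evaluation target.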

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable video reasoning capabilities across diverse tasks. However, their ability to understand human intent at a fine-grained level in egocentric videos remains largely unexplored. Existing benchmarks focus primarily on episode-level intent reasoning, overlooking the finer granularity of step-level intent understanding. Yet applications such as intelligent assistants, robotic imitation learning, and augmented reality guidance require understanding not only what a person is doing at each step, but also why and what comes next, in order to provide timely and context-aware support. To this end, we introduce EgoIntent, a step-level intent understanding benchmark for egocentric videos. It comprises 3,014 steps spanning 15 diverse indoor and outdoor daily-life scenarios, and evaluates models on three complementary dimensions: local intent (What), global intent (Why), and next-step plan (Next). Crucially, each clip is truncated immediately before the key outcome of the queried step (e.g., contact or grasp) occurs and contains no frames from subsequent steps, preventing future-frame leakage and enabling a clean evaluation of anticipatory step understanding and next-step planning. We evaluate 15 MLLMs, including both state-of-the-art closed-source and open-source models. Even the best-performing model achieves an average score of only 33.31 across the three intent dimensions, underscoring that step-level intent understanding in egocentric videos remains a highly challenging problem that calls for further investigation.


[85] LatentGeo: Learnable Auxiliary Constructions in Latent Space for Multimodal Geometric Reasoning cs.CVPDF

Haiying Xu, Zihan Wang, Song Dai, Zhengxuan Zhang, Kairan Dou

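TL;DR: This paper proposes LatentGeo, a framework that internalizes auxiliary geometric constructions through learnable continuous latent visual representations, addressing the difficulty multimodal large language models have in representing auxiliary constructions during geometric reasoning. It aligns the latent representations with a three-stage curriculum and then applies latent-aware reinforcement learning (LaGDPO) to stabilize the representations and improve task correctness.

Details

Motivation: Existing methods (text-based geometric specification, visual-token interleaved reasoning, and tool-augmented execution) are limited in representing complex spatial relationships, in matching discrete symbols to continuous structures, or in end-to-end optimization, and cannot faithfully represent auxiliary geometric constructions.

Result: Experiments on the proposed GeoAux benchmark and on MathVerse show that LatentGeo achieves substantial gains on geometric reasoning tasks, particularly those requiring auxiliary constructions.

Insight: The innovations include: 1) learning continuous latent representations to internalize auxiliary constructions, avoiding pixel-level rendering or external executors; 2) combining a three-stage curriculum with latent-aware reinforcement learning to stabilize representations and optimize the policy; 3) introducing the GeoAux benchmark to systematically evaluate construction-centric representation quality.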

Abstract: Despite recent advances in multimodal reasoning, representing auxiliary geometric constructions remains a fundamental challenge for multimodal large language models (MLLMs). Such constructions are absent from the original diagram and must be introduced before theorems apply. Existing approaches predominantly rely on explicit construction paradigms, including text-based geometric specification, visual-token interleaving during reasoning, and tool-augmented geometric execution. However, these methods either fail to faithfully represent complex spatial relationships, incur representation mismatch between discrete symbols and continuous geometric structures, or rely on external capabilities that hinder end-to-end optimization. To address these limitations, we propose LatentGeo, a framework that learns continuous latent visual representations to internalize auxiliary geometric constructions without pixel-level rendering or external executors. We design a three-stage curriculum that progressively aligns and internalizes these latent representations through auxiliary visual supervision, followed by LaGDPO, a latent-aware reinforcement learning procedure that stabilizes latent representations during policy optimization while improving end-task correctness. To systematically evaluate construction-centric representation quality, we introduce GeoAux, a new benchmark targeting visually dependent geometry problems, and conduct experiments on GeoAux and MathVerse. Results show that LatentGeo achieves substantial gains on geometric reasoning tasks, particularly those requiring auxiliary constructions. Extensive analyses and ablation studies further validate the effectiveness of each component in our framework.


[86] BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning cs.CV | cs.AIPDF

Jingyang Ke, Weihan Li, Amartya Pradhan, Jeffrey Markowitz, Anqi Wu

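TL;DR: BehaviorVLM is a unified, finetuning-free vision-language framework for animal pose estimation and behavioral understanding. It guides pretrained vision-language models through detailed, explicit, and verifiable reasoning steps, reducing reliance on human annotation. The framework has two main components: a multi-stage pose estimation pipeline that uses quantum-dot-grounded data and integrates temporal, spatial, and cross-view reasoning, and a behavioral understanding pipeline that works directly from visual information and combines deep embedded clustering, video captioning, and LLM reasoning.

Details

Motivation: In neuroscience, behavior analysis of freely moving animals depends on heavy human annotation or unstable unsupervised pipelines, which limits scalability and reproducibility. The work addresses this problem.

Result: The abstract reports no specific quantitative results or benchmark comparisons, but claims that the framework enables scalable, interpretable, and label-light analysis of multi-animal behavior.

Insight: The novelty lies in integrating the reasoning abilities of pretrained vision-language models (VLMs) and large language models (LLMs) into a framework that needs no task-specific finetuning. Multi-stage, verifiable reasoning steps (such as geometric checks) make pose estimation more reliable, and behavior segmentation and semantic labeling operate directly from visual information, reducing dependence on keypoints and large annotation sets.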

Abstract: Understanding freely moving animal behavior is central to neuroscience, where pose estimation and behavioral understanding form the foundation for linking neural activity to natural actions. Yet both tasks still depend heavily on human annotation or unstable unsupervised pipelines, limiting scalability and reproducibility. We present BehaviorVLM, a unified vision-language framework for pose estimation and behavioral understanding that requires no task-specific finetuning and minimal human labeling by guiding pretrained Vision-Language Models (VLMs) through detailed, explicit, and verifiable reasoning steps. For pose estimation, we leverage quantum-dot-grounded behavioral data and propose a multi-stage pipeline that integrates temporal, spatial, and cross-view reasoning. This design greatly reduces human annotation effort, exposes low-confidence labels through geometric checks such as reprojection error, and produces labels that can later be filtered, corrected, or used to fine-tune downstream pose models. For behavioral understanding, we propose a pipeline that integrates deep embedded clustering for over-segmented behavior discovery, VLM-based per-clip video captioning, and LLM-based reasoning to merge and semantically label behavioral segments. The behavioral pipeline can operate directly from visual information and does not require keypoints to segment behavior. Together, these components enable scalable, interpretable, and label-light analysis of multi-animal behavior.


[87] ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models cs.CVPDF

Yingxin Lai, Zitong Yu, Jun Wang, Linlin Shen, Yong Xu

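TL;DR: This paper proposes ForensicZip, a training-free visual token compression framework for multimodal large language models, designed for multimedia forensics. It reformulates token compression as a forgery-driven optimization problem: temporal token evolution is modeled as a Birth-Death Optimal Transport problem with a slack dummy node to quantify the physical discontinuities of generative artifacts, and high-frequency priors are added to separate forensic evidence from semantic content. Retaining only 10% of tokens, the framework achieves a 2.97x speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.

Details

Motivation: Existing visual token pruning methods are mainly semantic-driven: they keep salient objects and discard background regions. Multimedia forgery traces (such as high-frequency anomalies and temporal jitter) often reside in the background, so forensic performance degrades. The paper aims to reconcile the high computational cost of processing high-resolution images and videos with the needs of forensics.

Result: On deepfake and AIGC benchmarks, when retaining only 10% of visual tokens, ForensicZip achieves a 2.97x speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.

Insight: The novelty lies in shifting token compression from semantic-driven to forgery-driven: a Birth-Death Optimal Transport model quantifies physical discontinuities along the temporal dimension, and high-frequency priors help separate forensic evidence under heavy compression. The method compresses efficiently without any training, offering a new approach for real-time forensic use of vision-language models.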

Abstract: Multimodal Large Language Models (MLLMs) enable interpretable multimedia forensics by generating textual rationales for forgery detection. However, processing dense visual sequences incurs high computational costs, particularly for high-resolution images and videos. Visual token pruning is a practical acceleration strategy, yet existing methods are largely semantic-driven, retaining salient objects while discarding background regions where manipulation traces such as high-frequency anomalies and temporal jitters often reside. To address this issue, we introduce ForensicZip, a training-free framework that reformulates token compression from a forgery-driven perspective. ForensicZip models temporal token evolution as a Birth-Death Optimal Transport problem with a slack dummy node, quantifying physical discontinuities indicating transient generative artifacts. The forensic scoring further integrates transport-based novelty with high-frequency priors to separate forensic evidence from semantic content under large-ratio compression. Experiments on deepfake and AIGC benchmarks show that at 10% token retention, ForensicZip achieves $2.97\times$ speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.
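
As a rough illustration of score-based token retention at a fixed budget, the sketch below keeps the top-scoring 10% of tokens. It assumes the per-token novelty and high-frequency scores are already computed; the additive combination, the `alpha` weight, and the function name `retain_tokens` are assumptions for illustration, and the Birth-Death Optimal Transport step that would produce the novelty scores is not shown.

```python
def retain_tokens(novelty, high_freq, keep_ratio=0.1, alpha=1.0):
    """Return the sorted indices of the tokens to keep.

    novelty[i] and high_freq[i] are hypothetical per-token scores; the
    additive combination below is an assumption, not the paper's formula.
    """
    scores = [n + alpha * h for n, h in zip(novelty, high_freq)]
    # Always keep at least one token, even for very small budgets.
    k = max(1, round(keep_ratio * len(scores)))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Ten tokens: token 7 is both novel and high-frequency, token 2 only high-frequency.
novelty = [0.0] * 10
high_freq = [0.0] * 10
novelty[7] = 0.9
high_freq[7] = 0.5
high_freq[2] = 1.0

kept_10 = retain_tokens(novelty, high_freq, keep_ratio=0.1)
kept_20 = retain_tokens(novelty, high_freq, keep_ratio=0.2)
```

Because the ranking depends on forensic scores rather than semantic saliency, a background token can outrank a salient object token.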


[88] Real-World Point Tracking with Verifier-Guided Pseudo-Labeling cs.CVPDF

Görkay Aydemir, Fatma Güney, Weidi Xie

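TL;DR: This paper proposes a verifier-guided pseudo-labeling method for real-world point tracking. A meta-model assesses the reliability of predictions from multiple pretrained trackers and selects the most trustworthy ones to form high-quality pseudo-label trajectories, enabling efficient self-training on unlabeled videos.

Details

Motivation: Long-term point tracking models degrade on real-world videos because synthetic and real data have different characteristics and dense annotations are unavailable, while in existing self-training methods the quality of pseudo-labels is limited by the reliability of the teacher models.

Result: Extensive experiments on four real-world benchmarks show that the method achieves state-of-the-art performance while requiring less data than prior self-training methods.

Insight: The novelty lies in introducing a verifier meta-model that assesses and selects among the predictions of multiple trackers, improving pseudo-label quality. Viewed objectively, combining several models' predictions with dynamic verification makes self-training more robust and more data-efficient.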

Abstract: Models for long-term point tracking are typically trained on large synthetic datasets. The performance of these models degrades in real-world videos due to different characteristics and the absence of dense ground-truth annotations. Self-training on unlabeled videos has been explored as a practical solution, but the quality of pseudo-labels strongly depends on the reliability of teacher models, which vary across frames and scenes. In this paper, we address the problem of real-world fine-tuning and introduce verifier, a meta-model that learns to assess the reliability of tracker predictions and guide pseudo-label generation. Given candidate trajectories from multiple pretrained trackers, the verifier evaluates them per frame and selects the most trustworthy predictions, resulting in high-quality pseudo-label trajectories. When applied for fine-tuning, verifier-guided pseudo-labeling substantially improves the quality of supervision and enables data-efficient adaptation to unlabeled videos. Extensive experiments on four real-world benchmarks demonstrate that our approach achieves state-of-the-art results while requiring less data than prior self-training methods. Project page: https://kuis-ai.github.io/track_on_r
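
The per-frame selection described in the abstract can be sketched as follows. The reliability scores are plain numbers supplied as input here, standing in for the output of the learned verifier; the function name `select_pseudo_labels` and the data layout are hypothetical.

```python
def select_pseudo_labels(candidates, scores):
    """Pick, for every frame, the prediction with the highest verifier score.

    candidates[t][k] is the (x, y) point predicted by tracker k at frame t.
    scores[t][k] is a hypothetical reliability score for that prediction.
    """
    labels = []
    for frame_preds, frame_scores in zip(candidates, scores):
        best = max(range(len(frame_preds)), key=lambda k: frame_scores[k])
        labels.append(frame_preds[best])
    return labels


# Two trackers over three frames; tracker 1 is more reliable only in frame 1.
candidates = [
    [(10.0, 5.0), (11.0, 5.5)],
    [(12.0, 6.0), (12.5, 6.2)],
    [(14.0, 7.0), (20.0, 9.0)],
]
scores = [
    [0.9, 0.4],
    [0.3, 0.8],
    [0.7, 0.1],
]
pseudo_labels = select_pseudo_labels(candidates, scores)
```

The resulting trajectory mixes trackers frame by frame, which is what allows it to be better than any single teacher.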


[89] A Two-Stage Dual-Modality Model for Facial Emotional Expression Recognition cs.CVPDF

Jiajun Sun, Zhe Gao

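TL;DR: This paper targets the facial expression recognition challenge of the ABAW competition and proposes a two-stage dual-modal (audio-visual) model to handle the difficulty of classifying expressions in unconstrained videos. Stage I extracts robust features with a DINOv2-based visual encoder and introduces PadAug data augmentation and an MoE training head. Stage II performs modality fusion and temporal consistency processing through multi-scale re-cropping and Wav2Vec 2.0 audio features, combined with a gated fusion module and temporal smoothing.

Details

Motivation: Facial expression recognition in unconstrained videos is hampered by inaccurate face localization, large pose and scale variations, motion blur, and temporal instability. The work uses dual-modal fusion to improve recognition robustness.

Result: On the ABAW dataset, the method reaches a Macro-F1 score of 0.5368 on the official validation set and 0.5122 +/- 0.0277 under 5-fold cross-validation, outperforming the official baselines.

Insight: The innovations include using a pretrained DINOv2 model as the visual backbone, combined with the PadAug augmentation strategy and an MoE training head to increase classifier diversity; and integrating dual-modal information through multi-scale re-cropping and audio features with a lightweight gating module, followed by inference-time temporal smoothing to improve temporal consistency.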

Abstract: This paper addresses the expression (EXPR) recognition challenge in the 10th Affective Behavior Analysis in-the-Wild (ABAW) workshop and competition, which requires frame-level classification of eight facial emotional expressions from unconstrained videos. This task is challenging due to inaccurate face localization, large pose and scale variations, motion blur, temporal instability, and other confounding factors across adjacent frames. We propose a two-stage dual-modal (audio-visual) model to address these difficulties. Stage I focuses on robust visual feature extraction with a pretrained DINOv2-based encoder. Specifically, DINOv2 ViT-L/14 is used as the backbone, a padding-aware augmentation (PadAug) strategy is employed for image padding and data preprocessing from raw videos, and a mixture-of-experts (MoE) training head is introduced to enhance classifier diversity. Stage II addresses modality fusion and temporal consistency. For the visual modality, faces are re-cropped from raw videos at multiple scales, and the extracted visual features are averaged to form a robust frame-level representation. Concurrently, frame-aligned Wav2Vec 2.0 audio features are derived from short audio windows to provide complementary acoustic cues. These dual-modal features are integrated via a lightweight gated fusion module, followed by inference-time temporal smoothing. Experiments on the ABAW dataset demonstrate the effectiveness of the proposed method. The two-stage model achieves a Macro-F1 score of 0.5368 on the official validation set and 0.5122 +/- 0.0277 under 5-fold cross-validation, outperforming the official baselines.
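
A minimal sketch of the two Stage II ideas, gated fusion and inference-time temporal smoothing, is given below. The abstract does not specify either module, so the scalar sigmoid gate, the moving-average window, and the names `gated_fusion`, `smooth_probs`, and `predict` are all assumptions for illustration.

```python
import math


def gated_fusion(visual, audio, gate_weights, gate_bias):
    """Blend two feature vectors with one scalar gate g in (0, 1).

    The gate is a sigmoid over a linear map of the concatenated features;
    this particular form is an assumption, not the paper's module.
    """
    z = gate_bias + sum(w * x for w, x in zip(gate_weights, visual + audio))
    g = 1.0 / (1.0 + math.exp(-z))
    return [g * v + (1.0 - g) * a for v, a in zip(visual, audio)]


def smooth_probs(frame_probs, window=3):
    """Average class probabilities over a centered window of frames."""
    half = window // 2
    smoothed = []
    for t in range(len(frame_probs)):
        lo, hi = max(0, t - half), min(len(frame_probs), t + half + 1)
        n = hi - lo
        smoothed.append([
            sum(frame_probs[s][c] for s in range(lo, hi)) / n
            for c in range(len(frame_probs[t]))
        ])
    return smoothed


def predict(frame_probs):
    """Return the argmax class index for every frame."""
    return [max(range(len(p)), key=lambda c: p[c]) for p in frame_probs]


# A zero gate gives g = 0.5, so the fused vector is the plain average.
fused = gated_fusion([1.0, 3.0], [3.0, 1.0], [0.0] * 4, 0.0)

# The middle frame flips class on its own, but smoothing removes the flicker.
raw = [[0.9, 0.1], [0.4, 0.6], [0.9, 0.1]]
raw_labels = predict(raw)
smooth_labels = predict(smooth_probs(raw, window=3))
```

Smoothing trades a little responsiveness for stability, which suits frame-level labels where expressions rarely change within a single frame.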


[90] HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers cs.CV | cs.LGPDF

Andy Li, Aiden Durrant, Milan Markovic, Georgios Leontidis

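TL;DR: This paper proposes HiAP (Hierarchical Auto-Pruning), a multi-granular stochastic auto-pruning framework for Vision Transformers. By introducing stochastic gates at macro and micro levels, it automatically discovers optimal sub-networks in a single end-to-end training phase, without manual importance heuristics or predefined per-layer sparsity targets, and aims to relieve the compute and memory bottlenecks ViT models face when deployed on edge devices.

Details

Motivation: Vision Transformers require significant computational resources and memory bandwidth, which limits their deployment on edge devices. Existing structured pruning methods typically operate at a single granularity and rely on complex multi-stage pipelines with post-hoc thresholding to meet sparsity budgets.

Result: Extensive experiments on ImageNet show that HiAP automatically discovers efficient architectures and achieves a competitive accuracy-efficiency Pareto frontier for models such as DeiT-Small, matching sophisticated multi-stage methods while significantly simplifying the deployment pipeline.

Insight: The novelty lies in a unified multi-granular stochastic pruning framework that jointly optimizes Gumbel-Sigmoid gates at the macro level (attention heads, FFN blocks) and the micro level (intra-head dimensions, FFN neurons). With a loss that includes structural feasibility penalties and analytical FLOPs, single-stage training converges naturally to stable sub-networks, addressing both memory and compute bottlenecks.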

Abstract: Vision Transformers require significant computational resources and memory bandwidth, severely limiting their deployment on edge devices. While recent structured pruning methods successfully reduce theoretical FLOPs, they typically operate at a single structural granularity and rely on complex, multi-stage pipelines with post-hoc thresholding to satisfy sparsity budgets. In this paper, we propose Hierarchical Auto-Pruning (HiAP), a continuous relaxation framework that discovers optimal sub-networks in a single end-to-end training phase without requiring manual importance heuristics or predefined per-layer sparsity targets. HiAP introduces stochastic Gumbel-Sigmoid gates at multiple granularities: macro-gates to prune entire attention heads and FFN blocks, and micro-gates to selectively prune intra-head dimensions and FFN neurons. By optimizing both levels simultaneously, HiAP addresses both the memory-bound overhead of loading large matrices and the compute-bound mathematical operations. HiAP naturally converges to stable sub-networks using a loss function that incorporates both structural feasibility penalties and analytical FLOPs. Extensive experiments on ImageNet demonstrate that HiAP organically discovers highly efficient architectures, and achieves a competitive accuracy-efficiency Pareto frontier for models like DeiT-Small, matching the performance of sophisticated multi-stage methods while significantly simplifying the deployment pipeline.
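
For readers unfamiliar with Gumbel-Sigmoid gates, the sketch below shows the standard relaxation: logistic noise (the difference of two Gumbel samples) is added to a learnable logit before a temperature-scaled sigmoid. The `expected_flops` helper is a hypothetical stand-in for an analytical FLOPs term; neither function is taken from the HiAP code, and no gradients are computed here.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gumbel_sigmoid(logit, temperature=1.0, rng=random):
    """Sample a relaxed binary gate in [0, 1] for one prunable unit."""
    # Clamp u away from 0 and 1 so both logarithms stay finite.
    u = min(max(rng.random(), 1e-9), 1.0 - 1e-9)
    # log(u) - log(1 - u) is logistic noise, i.e. a difference of two Gumbels.
    noise = math.log(u) - math.log(1.0 - u)
    return sigmoid((logit + noise) / temperature)


def expected_flops(logits, unit_flops):
    """Hypothetical cost term: keep probability of each unit times its FLOPs."""
    return sum(sigmoid(l) * f for l, f in zip(logits, unit_flops))


rng = random.Random(0)
samples = [gumbel_sigmoid(2.0, temperature=0.5, rng=rng) for _ in range(2000)]
# The fraction of "open" gates approaches sigmoid(logit) as samples grow.
open_fraction = sum(s > 0.5 for s in samples) / len(samples)
cost = expected_flops([0.0, 0.0], [100.0, 300.0])
```

Lower temperatures push samples toward 0 or 1, so the relaxed gates behave more like hard keep-or-prune decisions as training proceeds.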


[91] SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation cs.CVPDF

Jun Luo, Jiaxiang Tang, Ruijie Lu, Gang Zeng

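TL;DR: This paper proposes SceneAssistant, a visual-feedback-driven agent for open-vocabulary 3D scene generation. The method combines modern 3D object generation models with the spatial reasoning and planning abilities of vision-language models (VLMs). By giving the VLM a set of atomic operations (such as Scale, Rotate, and FocusOn), it lets the VLM iteratively refine the scene layout based on rendered visual feedback, producing 3D scenes that are more consistent with the input text and more spatially coherent.

Details

Motivation: Existing text-to-3D scene generation methods are restricted to specific domains or depend on predefined spatial relationships. The work aims to enable unconstrained, open-vocabulary 3D scene synthesis.

Result: Experiments show that the method generates diverse, open-vocabulary, high-quality 3D scenes. Both qualitative analysis and quantitative human evaluation show that it outperforms existing methods.

Insight: The novelty lies in combining a visual feedback loop with the spatial reasoning of VLMs: through iterative interaction and a set of atomic operations, the agent flexibly generates and edits open-vocabulary 3D scenes, improving spatial consistency and text alignment.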

Abstract: Text-to-3D scene generation from natural language is highly desirable for digital content creation. However, existing methods are largely domain-restricted or reliant on predefined spatial relationships, limiting their capacity for unconstrained, open-vocabulary 3D scene synthesis. In this paper, we introduce SceneAssistant, a visual-feedback-driven agent designed for open-vocabulary 3D scene generation. Our framework leverages a modern 3D object generation model along with the spatial reasoning and planning capabilities of Vision-Language Models (VLMs). To enable open-vocabulary scene composition, we provide the VLMs with a comprehensive set of atomic operations (e.g., Scale, Rotate, FocusOn). At each interaction step, the VLM receives rendered visual feedback and takes actions accordingly, iteratively refining the scene to achieve more coherent spatial arrangements and better alignment with the input text. Experimental results demonstrate that our method can generate diverse, open-vocabulary, and high-quality 3D scenes. Both qualitative analysis and quantitative human evaluations demonstrate the superiority of our approach over existing methods. Furthermore, our method allows users to instruct the agent to edit existing scenes based on natural language commands. Our code is available at https://github.com/ROUJINN/SceneAssistant


[92] Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation cs.CVPDF

Xiangyu Zhao, Peiyuan Zhang, Junming Lin, Tianhao Liang, Yuchen Duan

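TL;DR: This paper proposes the FIRM framework. It builds high-quality scoring datasets (FIRM-Edit-370K and FIRM-Gen-293K) to train specialized reward models (FIRM-Edit-8B and FIRM-Gen-8B) and designs FIRM-Bench, a benchmark for editing and generation critics. It also introduces a novel "Base-and-Bonus" reward strategy to balance competing objectives. The resulting models, FIRM-Qwen-Edit and FIRM-SD3.5, achieve substantial gains in fidelity and instruction following.

Details

Motivation: When reinforcement learning is applied to image editing and text-to-image generation, current reward models often hallucinate and assign noisy scores, misguiding optimization. More robust reward models are needed to provide accurate and reliable guidance.

Result: On the purpose-built FIRM-Bench, the proposed reward models align better with human judgment than existing metrics. The final models, FIRM-Qwen-Edit and FIRM-SD3.5, surpass existing general models in fidelity and instruction following, setting a new standard.

Insight: The innovations include: 1) separate data curation pipelines for editing (execution and consistency) and generation (instruction following) to build high-quality datasets; 2) a "Base-and-Bonus" reward strategy (CME for editing and QMA for generation) that balances competing objectives, so the reward models integrate more effectively into the RL pipeline.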

Abstract: Reinforcement learning (RL) has emerged as a promising paradigm for enhancing image editing and text-to-image (T2I) generation. However, current reward models, which act as critics during RL, often suffer from hallucinations and assign noisy scores, inherently misguiding the optimization process. In this paper, we present FIRM (Faithful Image Reward Modeling), a comprehensive framework that develops robust reward models to provide accurate and reliable guidance for faithful image generation and editing. First, we design tailored data curation pipelines to construct high-quality scoring datasets. Specifically, we evaluate editing using both execution and consistency, while generation is primarily assessed via instruction following. Using these pipelines, we collect the FIRM-Edit-370K and FIRM-Gen-293K datasets, and train specialized reward models (FIRM-Edit-8B and FIRM-Gen-8B) that accurately reflect these criteria. Second, we introduce FIRM-Bench, a comprehensive benchmark specifically designed for editing and generation critics. Evaluations demonstrate that our models achieve superior alignment with human judgment compared to existing metrics. Furthermore, to seamlessly integrate these critics into the RL pipeline, we formulate a novel “Base-and-Bonus” reward strategy that balances competing objectives: Consistency-Modulated Execution (CME) for editing and Quality-Modulated Alignment (QMA) for generation. Empowered by this framework, our resulting models FIRM-Qwen-Edit and FIRM-SD3.5 achieve substantial performance breakthroughs. Comprehensive experiments demonstrate that FIRM mitigates hallucinations, establishing a new standard for fidelity and instruction adherence over existing general models. All of our datasets, models, and code are publicly available at https://firm-reward.github.io.


[93] DVD: Deterministic Video Depth Estimation with Generative Priors cs.CVPDF

Hongfei Zhang, Harold Haodong Chen, Chenfei Liao, Jing He, Zixin Zhang

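TL;DR: This paper proposes DVD, the first framework to deterministically adapt pretrained video diffusion models into single-pass depth regressors. It resolves the trade-off in existing video depth estimation between the stochastic geometric hallucinations of generative models and the large labeled datasets that discriminative models require.

Details

Motivation: Existing video depth estimation faces a fundamental trade-off: generative models suffer from stochastic geometric hallucinations and scale drift, while discriminative models need massive labeled datasets to resolve semantic ambiguities. DVD aims to break this impasse.

Result: Achieves state-of-the-art zero-shot performance across multiple benchmarks, and unlocks the geometric priors implicit in video foundation models using only 1/163 of the task-specific data used by leading baselines.

Insight: The innovations include: repurposing the diffusion timestep as a structural anchor to balance global stability with high-frequency details; latent manifold rectification (LMR) to mitigate regression-induced over-smoothing; and exploiting global affine coherence for long-video inference without complex temporal alignment.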

Abstract: Existing video depth estimation faces a fundamental trade-off: generative models suffer from stochastic geometric hallucinations and scale drift, while discriminative models demand massive labeled datasets to resolve semantic ambiguities. To break this impasse, we present DVD, the first framework to deterministically adapt pre-trained video diffusion models into single-pass depth regressors. Specifically, DVD features three core designs: (i) repurposing the diffusion timestep as a structural anchor to balance global stability with high-frequency details; (ii) latent manifold rectification (LMR) to mitigate regression-induced over-smoothing, enforcing differential constraints to restore sharp boundaries and coherent motion; and (iii) global affine coherence, an inherent property bounding inter-window divergence, which enables seamless long-video inference without requiring complex temporal alignment. Extensive experiments demonstrate that DVD achieves state-of-the-art zero-shot performance across benchmarks. Furthermore, DVD successfully unlocks the profound geometric priors implicit in video foundation models using 163x less task-specific data than leading baselines. Notably, we fully release our pipeline, providing the whole training suite for SOTA video depth estimation to benefit the open-source community.
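
One way to picture why affine coherence between windows simplifies long-video inference is a least-squares scale-and-shift fit on the overlapping frames, sketched below. This is an illustrative interpretation, not the procedure from the paper: each frame is reduced to a single hypothetical depth value, and the names `fit_scale_shift` and `stitch` are made up for this example.

```python
def fit_scale_shift(src, dst):
    """Least-squares scale s and shift t minimizing sum((s * src + t - dst)^2)."""
    n = len(src)
    mean_src = sum(src) / n
    mean_dst = sum(dst) / n
    cov = sum((a - mean_src) * (b - mean_dst) for a, b in zip(src, dst))
    var = sum((a - mean_src) ** 2 for a in src)
    scale = cov / var  # assumes the overlap values in src are not all equal
    shift = mean_dst - scale * mean_src
    return scale, shift


def stitch(prev_window, next_window, overlap):
    """Map next_window into prev_window's scale using the shared overlap frames."""
    scale, shift = fit_scale_shift(next_window[:overlap], prev_window[-overlap:])
    aligned = [scale * d + shift for d in next_window]
    # Drop the overlapping frames of the new window to avoid duplicates.
    return prev_window + aligned[overlap:]


# The second window sees the same scene at half the scale and a shifted origin.
prev_window = [1.0, 2.0, 3.0]
next_window = [0.5, 1.0, 1.5, 2.0]
scale, shift = fit_scale_shift(next_window[:2], prev_window[-2:])
sequence = stitch(prev_window, next_window, overlap=2)
```

If windows differ only by such a transform, two parameters per window are enough to produce a consistent sequence.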


[94] Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing cs.CVPDF

Baifeng Shi, Stephanie Fu, Long Lian, Hanrong Ye, David Eigen

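TL;DR: This paper proposes AutoGaze, a lightweight module that, before a Vision Transformer or multimodal large language model processes a video, autoregressively selects a minimal set of multi-scale video patches to eliminate spatiotemporal redundancy. The method sharply reduces the number of visual tokens, speeds up inference, and lets MLLMs scale to long, high-resolution videos, achieving strong results on several video benchmarks.

Details

Motivation: Existing multimodal large language models handle long, high-resolution videos inefficiently because they process every pixel (via ViT patches) equally, ignoring the significant spatiotemporal redundancy in video.

Result: AutoGaze reduces visual tokens by 4x to 100x and accelerates ViTs and MLLMs by up to 19x. It reaches 67.0% on VideoMME. On the newly proposed HLVid benchmark (QA over 5-minute 4K videos), an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%.

Insight: The core innovation is the "attend before attention" paradigm: a lightweight, trainable module (AutoGaze) autoregressively discards redundant patches before the expensive attention computation. It is trained with next-token prediction and reinforcement learning to reconstruct the video within a user-specified error threshold, trading off efficiency against accuracy. The paper also introduces HLVid, the first high-resolution, long-form video QA benchmark, advancing evaluation in this area.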

Abstract: Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos – they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4x-100x and accelerates ViTs and MLLMs by up to 19x, enabling scaling MLLMs to 1K-frame 4K-resolution videos and achieving superior results on video benchmarks (e.g., 67.0% on VideoMME). Furthermore, we introduce HLVid: the first high-resolution, long-form video QA benchmark with 5-minute 4K-resolution videos, where an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%. Project page: https://autogaze.github.io/.
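
The idea of keeping only the patches needed to reconstruct a video within an error threshold can be illustrated with a simple greedy rule. This is a hand-written stand-in, not AutoGaze: the real module is learned and selects multi-scale patches autoregressively, whereas this sketch treats each patch as one scalar and reuses the last kept value of a position whenever the change stays within `max_error`. All names are hypothetical.

```python
def select_patches(frames, max_error):
    """Return the (frame, patch) pairs that must be kept.

    frames[t][p] is a scalar summary of patch p in frame t. A patch is
    skipped when the last kept value at that position already reconstructs
    it to within max_error.
    """
    kept = []
    memory = [None] * len(frames[0])
    for t, frame in enumerate(frames):
        for p, value in enumerate(frame):
            if memory[p] is None or abs(value - memory[p]) > max_error:
                kept.append((t, p))
                memory[p] = value
    return kept


def reconstruct(frames, kept):
    """Rebuild every frame by holding the most recently kept value per patch."""
    kept_set = set(kept)
    memory = [None] * len(frames[0])
    output = []
    for t, frame in enumerate(frames):
        for p, value in enumerate(frame):
            if (t, p) in kept_set:
                memory[p] = value
        output.append(list(memory))
    return output


# Patch 0 never changes; patch 1 changes only in the last frame.
frames = [[0.0, 1.0], [0.0, 1.0], [0.0, 5.0]]
kept = select_patches(frames, max_error=0.5)
rebuilt = reconstruct(frames, kept)
```

Here three of six patches are kept, a 2x reduction, and the reconstruction error never exceeds the threshold.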


[95] Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training cs.CV | cs.LGPDF

Fangfu Liu, Diankun Wu, Jiawei Chi, Yimo Cai, Yi-Hsin Hung

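TL;DR: This paper proposes Spatial-TTT, a streaming visual spatial intelligence method based on test-time training (TTT) that continuously maintains and updates spatial evidence from potentially unbounded video streams. It uses a hybrid architecture that combines large-chunk updates with sliding-window attention for efficient spatial video processing, and introduces a spatial-predictive mechanism to better capture geometric correspondence and temporal continuity. With a dataset of dense 3D spatial descriptions, the model learns to memorize and organize global 3D spatial signals in a structured way.

Details

Motivation: The core challenge is maintaining and updating spatial evidence in a streaming fashion from unbounded video, that is, how spatial information is selected, organized, and retained over time, so that a model can perceive and understand real-world spaces from a stream of visual observations as humans do.

Result: Achieves state-of-the-art performance on video spatial benchmarks. Experiments show that Spatial-TTT significantly improves long-horizon spatial understanding.

Insight: The innovations include: using test-time training (TTT) to adapt fast weights dynamically and capture long-term spatial evidence; a hybrid architecture combining large-chunk updates with sliding-window attention for efficient processing; a spatial-predictive mechanism based on 3D spatiotemporal convolution to strengthen spatial awareness; and a dataset of dense 3D spatial descriptions that guides the model to memorize global spatial signals in a structured way. Viewed objectively, combining TTT with streaming video processing offers a scalable route to long-horizon spatial understanding.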

Abstract: Humans perceive and understand real-world spaces through a stream of visual observations. Therefore, the ability to streamingly maintain and update spatial evidence from potentially unbounded video streams is essential for spatial intelligence. The core challenge is not simply longer context windows but how spatial information is selected, organized, and retained over time. In this paper, we propose Spatial-TTT towards streaming visual-based spatial intelligence with test-time training (TTT), which adapts a subset of parameters (fast weights) to capture and organize spatial evidence over long-horizon scene videos. Specifically, we design a hybrid architecture and adopt large-chunk updates parallel with sliding-window attention for efficient spatial video processing. To further promote spatial awareness, we introduce a spatial-predictive mechanism applied to TTT layers with 3D spatiotemporal convolution, which encourages the model to capture geometric correspondence and temporal continuity across frames. Beyond architecture design, we construct a dataset with dense 3D spatial descriptions, which guides the model to update its fast weights to memorize and organize global 3D spatial signals in a structured manner. Extensive experiments demonstrate that Spatial-TTT improves long-horizon spatial understanding and achieves state-of-the-art performance on video spatial benchmarks. Project page: https://liuff19.github.io/Spatial-TTT.


[96] DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning cs.CV

Yujie Wei, Xinyu Liu, Shiwei Zhang, Hangjie Yuan, Jinbo Xing

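TL;DR: DreamVideo-Omni is a unified framework that achieves multi-subject identity customization together with omni-motion control through a progressive two-stage training paradigm. The first stage integrates multiple control signals for joint training and introduces a condition-aware 3D rotary positional embedding and a hierarchical motion injection strategy to strengthen control. The second stage designs latent identity reward feedback learning, training a latent identity reward model to mitigate identity degradation.

Details

Motivation: Large-scale diffusion models struggle to precisely control both multi-subject identity and multi-granularity motion in video synthesis; the work aims to overcome existing methods' limitations in motion granularity, control ambiguity, and identity degradation.

Result: Supported by a curated large-scale dataset and the comprehensive DreamOmni Bench evaluation benchmark, DreamVideo-Omni demonstrates superior performance in generating high-quality videos with precise controllability.

Insight: The contributions include: 1) a progressive two-stage training paradigm; 2) a condition-aware 3D rotary positional embedding that coordinates heterogeneous inputs; 3) a hierarchical motion injection strategy that enhances global motion guidance; 4) group and role embeddings that resolve multi-subject ambiguity; 5) a latent identity reward feedback learning paradigm that mitigates identity degradation. Viewed objectively, disentangling complex scenes into independently controllable instances and introducing a latent reward mechanism aligned with human preferences are ideas worth borrowing.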

Abstract: While large-scale diffusion models have revolutionized video synthesis, achieving precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and identity degradation, leading to suboptimal performance on identity preservation and motion control. In this work, we present DreamVideo-Omni, a unified framework enabling harmonious multi-subject customization with omni-motion control via a progressive two-stage training paradigm. In the first stage, we integrate comprehensive control signals for joint training, encompassing subject appearances, global motion, local dynamics, and camera movements. To ensure robust and precise controllability, we introduce a condition-aware 3D rotary positional embedding to coordinate heterogeneous inputs and a hierarchical motion injection strategy to enhance global motion guidance. Furthermore, to resolve multi-subject ambiguity, we introduce group and role embeddings to explicitly anchor motion signals to specific identities, effectively disentangling complex scenes into independent controllable instances. In the second stage, to mitigate identity degradation, we design a latent identity reward feedback learning paradigm by training a latent identity reward model upon a pretrained video diffusion backbone. This provides motion-aware identity rewards in the latent space, prioritizing identity preservation aligned with human preferences. Supported by our curated large-scale dataset and the comprehensive DreamOmni Bench for multi-subject and omni-motion control evaluation, DreamVideo-Omni demonstrates superior performance in generating high-quality videos with precise controllability.


[97] Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously cs.CV

Yiran Guan, Liang Yin, Dingkang Liang, Jianzhong Ju, Zhenbo Luo

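TL;DR: This paper proposes Video Streaming Thinking (VST), a new paradigm for online Video Large Language Models (VideoLLMs) that reasons while watching, improving timely comprehension and coherent cognition of the video stream while preserving real-time responsiveness.

Details

Motivation: Existing online VideoLLM methods focus on streaming perception and lack a synchronized logical reasoning stream, while directly applying test-time scaling methods incurs unacceptable response latency; the trade-off between real-time responsiveness and deep reasoning therefore needs to be addressed.

Result: VST-7B performs strongly on online benchmarks, e.g. 79.5% on StreamingBench and 59.3% on OVO-Bench, and remains competitive on offline long-form or reasoning benchmarks. Compared with Video-R1, VST responds 15.7 times faster and achieves a +5.4% improvement on VideoHolmes, demonstrating higher efficiency and strong generalization across tasks.

Insight: The core innovation is the “thinking while watching” mechanism, which amortizes LLM reasoning latency over video playback. The paper also designs a complete post-training pipeline comprising VST-SFT and VST-RL, and a data synthesis pipeline that uses video knowledge graphs to automatically generate high-quality streaming QA pairs, strengthening multi-evidence reasoning and sustained attention to the video stream.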

Abstract: Online Video Large Language Models (VideoLLMs) play a critical role in supporting responsive, real-time interaction. Existing methods focus on streaming perception, lacking a synchronized logical reasoning stream. However, directly applying test-time scaling methods incurs unacceptable response latency. To address this trade-off, we propose Video Streaming Thinking (VST), a novel paradigm for streaming video understanding. It supports a thinking while watching mechanism, which activates reasoning over incoming video clips during streaming. This design improves timely comprehension and coherent cognition while preserving real-time responsiveness by amortizing LLM reasoning latency over video playback. Furthermore, we introduce a comprehensive post-training pipeline that integrates VST-SFT, which structurally adapts the offline VideoLLM to causal streaming reasoning, and VST-RL, which provides end-to-end improvement through self-exploration in a multi-turn video interaction environment. Additionally, we devise an automated training-data synthesis pipeline that uses video knowledge graphs to generate high-quality streaming QA pairs, with an entity-relation grounded streaming Chain-of-Thought to enforce multi-evidence reasoning and sustained attention to the video stream. Extensive evaluations show that VST-7B performs strongly on online benchmarks, e.g. 79.5% on StreamingBench and 59.3% on OVO-Bench. Meanwhile, VST remains competitive on offline long-form or reasoning benchmarks. Compared with Video-R1, VST responds 15.7 times faster and achieves +5.4% improvement on VideoHolmes, demonstrating higher efficiency and strong generalization across diverse video understanding tasks. Code, data, and models will be released at https://github.com/1ranGuan/VST.


[98] GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing cs.CV

Mingxin Liu, Ziqian Fan, Zhaokai Wang, Leyao Gu, Zirun Zhu

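TL;DR: This paper introduces GRADE, a benchmark for evaluating the reasoning ability of multimodal models on discipline-informed image editing tasks. The benchmark comprises 520 samples across 10 academic domains and comes with a multi-dimensional evaluation protocol covering Discipline Reasoning, Visual Consistency, and Logical Readability. Extensive experiments on 20 state-of-the-art open-source and closed-source models reveal substantial limitations of current models in knowledge-intensive editing settings.

Details

Motivation: Current image editing benchmarks are largely confined to natural images and shallow commonsense reasoning, and offer limited assessment of multimodal models' joint understanding, reasoning, and generation under structured, domain-specific constraints.

Result: Tests of 20 state-of-the-art models on GRADE show large performance gaps under implicit, knowledge-intensive editing settings, exposing the shortcomings of current models.

Insight: The contribution is the first benchmark for discipline-informed reasoning in image editing, together with a multi-dimensional evaluation protocol; it pinpoints key directions for the future development of unified multimodal models and advances research on discipline-informed image editing and reasoning.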

Abstract: Unified multimodal models target joint understanding, reasoning, and generation, but current image editing benchmarks are largely confined to natural images and shallow commonsense reasoning, offering limited assessment of this capability under structured, domain-specific constraints. In this work, we introduce GRADE, the first benchmark to assess discipline-informed knowledge and reasoning in image editing. GRADE comprises 520 carefully curated samples across 10 academic domains, spanning from natural science to social science. To support rigorous evaluation, we propose a multi-dimensional evaluation protocol that jointly assesses Discipline Reasoning, Visual Consistency, and Logical Readability. Extensive experiments on 20 state-of-the-art open-source and closed-source models reveal substantial limitations in current models under implicit, knowledge-intensive editing settings, leading to large performance gaps. Beyond quantitative scores, we conduct rigorous analyses and ablations to expose model shortcomings and identify the constraints within disciplinary editing. Together, GRADE pinpoints key directions for the future development of unified multimodal models, advancing the research on discipline-informed image editing and reasoning. Our benchmark and evaluation code are publicly released.


[99] OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams cs.CV

Yibin Yan, Jilan Xu, Shangzhe Di, Haoning Wu, Weidi Xie

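TL;DR: OmniStream is a unified streaming visual backbone that perceives, reconstructs, and acts from diverse visual inputs in real time. It uses causal spatiotemporal attention and 3D rotary positional embeddings for efficient frame-by-frame online processing, is pre-trained with multiple tasks on 29 datasets, and is competitive with specialized models on image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, and robotic manipulation.

Details

Motivation: Current vision foundation models are fragmented in real-time streaming settings: they typically specialize in only one of image semantic perception, offline temporal modeling, or spatial geometry, and lack a general, causal, and physically structured representation that unifies perception, reconstruction, and action.

Result: Even with a strictly frozen backbone, OmniStream achieves consistently competitive performance with specialized experts across image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, and robotic manipulation (unseen at training).

Insight: The contribution is a unified streaming visual backbone that supports efficient online stream processing through causal spatiotemporal attention and 3D-RoPE, trained with a synergistic multi-task pre-training framework that couples static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment. It demonstrates the viability of training a single, versatile vision backbone that generalizes across semantic, spatial, and temporal reasoning, a step toward general-purpose visual understanding.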

Abstract: Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geometry. This paper introduces OmniStream, a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs. By incorporating causal spatiotemporal attention and 3D rotary positional embeddings (3D-RoPE), our model supports efficient, frame-by-frame online processing of video streams via a persistent KV-cache. We pre-train OmniStream using a synergistic multi-task framework coupling static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment on 29 datasets. Extensive evaluations show that, even with a strictly frozen backbone, OmniStream achieves consistently competitive performance with specialized experts across image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, as well as robotic manipulation (unseen at training). Rather than pursuing benchmark-specific dominance, our work demonstrates the viability of training a single, versatile vision backbone that generalizes across semantic, spatial, and temporal reasoning, i.e., a more meaningful step toward general-purpose visual understanding for interactive and embodied agents.


[100] MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning cs.CV

Haozhan Shen, Shilin Yan, Hongwei Xue, Shuaiqi Lu, Xiaojun Tang

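TL;DR: This paper introduces MM-CondChain, a benchmark for evaluating the visually grounded deep compositional reasoning of multimodal large language models (MLLMs). The benchmark is built with programmatic verification and consists of multi-layer reasoning chains in which every layer contains a non-trivial compositional condition grounded in visual evidence. Experiments show that existing MLLMs perform poorly on it and that deep compositional reasoning remains a major challenge.

Details

Motivation: Existing benchmarks focus on shallow compositions or independent constraints and do not evaluate reasoning over deeply chained compositional conditionals, a capability that is essential for carrying out visual workflows such as GUI navigation.

Result: A range of MLLMs were tested on benchmarks spanning three visual domains (natural images, data charts, and GUI trajectories). Even the strongest model attains only 53.33 Path F1, with sharp drops on hard negatives and as depth or predicate complexity grows.

Insight: The contribution is a programmatically verified benchmark for deep compositional reasoning, together with an agentic synthesis pipeline, consisting of a Planner, a Verifiable Programmatic Intermediate Representation (VPIR), and a Composer, that scalably constructs high-quality, verifiable complex reasoning chains.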

Abstract: Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., “if a permission dialog appears and the color of the interface is green, click Allow”) and the process may branch or terminate early. Yet this capability remains under-evaluated: existing benchmarks focus on shallow compositions or independent constraints rather than deeply chained compositional conditionals. In this paper, we introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning. Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome. To scalably construct such workflow-style data, we propose an agentic synthesis pipeline: a Planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) ensures each layer’s condition is mechanically verifiable. A Composer then assembles these verified layers into complete instructions. Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experiments on a range of MLLMs show that even the strongest model attains only 53.33 Path F1, with sharp drops on hard negatives and as depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge.


[101] EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation cs.CV

Tianwei Xiong, Jun Hao Liew, Zilong Huang, Zhijie Lin, Jiashi Feng

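TL;DR: This paper proposes EVATok, a framework for efficient adaptive video tokenization. It estimates the optimal token assignment for each video, trains lightweight routers to predict these assignments quickly, and trains adaptive tokenizers, achieving a better trade-off between video reconstruction quality and the computational cost of downstream autoregressive generation.

Details

Motivation: Traditional video tokenizers apply a uniform token assignment across temporal blocks of different videos, which wastes tokens on simple, static, or repetitive segments and underserves dynamic or complex ones. Addressing this inefficiency requires a method that assigns tokens adaptively according to video content.

Result: On UCF-101, EVATok achieves superior reconstruction quality and state-of-the-art class-to-video generation, with at least 24.4% savings in average token usage compared to the prior state-of-the-art LARP and a fixed-length baseline.

Insight: The core contribution is an adaptive video tokenization framework that estimates the optimal token assignment for each video and uses lightweight routers for fast prediction, so the encoding adapts to content. This gives video generation models a more efficient and flexible tokenization scheme that reduces computational cost while maintaining or improving generation quality.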

Abstract: Autoregressive (AR) video generative models rely on video tokenizers that compress pixels into discrete token sequences. The length of these token sequences is crucial for balancing reconstruction quality against downstream generation computational cost. Traditional video tokenizers apply a uniform token assignment across temporal blocks of different videos, often wasting tokens on simple, static, or repetitive segments while underserving dynamic or complex ones. To address this inefficiency, we introduce EVATok, a framework to produce Efficient Video Adaptive Tokenizers. Our framework estimates optimal token assignments for each video to achieve the best quality-cost trade-off, develops lightweight routers for fast prediction of these optimal assignments, and trains adaptive tokenizers that encode videos based on the assignments predicted by routers. We demonstrate that EVATok delivers substantial improvements in efficiency and overall quality for video reconstruction and downstream AR generation. Enhanced by our advanced training recipe that integrates video semantic encoders, EVATok achieves superior reconstruction and state-of-the-art class-to-video generation on UCF-101, with at least 24.4% savings in average token usage compared to the prior state-of-the-art LARP and our fixed-length baseline.


cs.LG

[102] Scaling Reasoning Efficiently via Relaxed On-Policy Distillation cs.LG | cs.CL

Jongwoo Ko, Sara Abdali, Young Jin Kim, Tianyi Chen, Pashmina Cameron

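TL;DR: This paper proposes REOPOLD (Relaxed On-Policy Distillation), a framework for stably and efficiently distilling the reasoning capabilities of a large teacher model into a capacity-constrained student model. It reinterprets on-policy distillation as a policy optimization problem and relaxes the strict imitation constraint through mixture-based reward clipping, entropy-based token-level dynamic sampling, and a unified exploration-to-refinement training strategy.

Details

Motivation: Standard on-policy distillation is prone to instability and negative transfer when transferring reasoning capabilities to small models.

Result: Across mathematical, visual, and agentic tool-use reasoning tasks, REOPOLD shows superior sample efficiency during training (6.7~12x greater than recent RL approaches) and better test-time scaling at inference. Specifically, a 7B student matches a 32B teacher in visual reasoning with a ~3.32x inference speedup.

Insight: The core contribution is reinterpreting on-policy distillation as policy optimization and introducing relaxation mechanisms (such as mixture-based reward clipping and dynamic sampling) to stabilize optimization. Viewed objectively, the unified exploration-to-refinement training strategy and the token-level reward based on the teacher-student log-likelihood ratio are the key technical insights behind the gains in sample efficiency and final performance.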

Abstract: On-policy distillation is pivotal for transferring reasoning capabilities to capacity-constrained models, yet remains prone to instability and negative transfer. We show that on-policy distillation can be interpreted, both theoretically and empirically, as a form of policy optimization, where the teacher-student log-likelihood ratio acts as a token reward. From this insight, we introduce REOPOLD (Relaxed On-Policy Distillation), a framework that stabilizes optimization by relaxing the strict imitation constraints of standard on-policy distillation. Specifically, REOPOLD temperately and selectively leverages rewards from the teacher through mixture-based reward clipping, entropy-based token-level dynamic sampling, and a unified exploration-to-refinement training strategy. Empirically, REOPOLD surpasses its baselines with superior sample efficiency during training and enhanced test-time scaling at inference, across mathematical, visual, and agentic tool-use reasoning tasks. Specifically, REOPOLD outperforms recent RL approaches, achieving 6.7~12x greater sample efficiency, and enables a 7B student to match a 32B teacher in visual reasoning with a ~3.32x inference speedup.
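
The token reward described in this abstract can be illustrated with a minimal sketch, assuming per-token log-probabilities from the teacher and the student are available for a sequence the student sampled itself. The function names are hypothetical, and the symmetric clipping is only a stand-in: the abstract does not specify how REOPOLD's mixture-based reward clipping works.

```python
from typing import List


def token_rewards(teacher_logp: List[float], student_logp: List[float]) -> List[float]:
    """Per-token reward: teacher log-prob minus student log-prob
    for tokens sampled by the student (on-policy)."""
    if len(teacher_logp) != len(student_logp):
        raise ValueError("log-prob sequences must have equal length")
    return [t - s for t, s in zip(teacher_logp, student_logp)]


def clip_rewards(rewards: List[float], limit: float) -> List[float]:
    """Assumed stand-in for reward clipping: bound each reward to [-limit, limit]."""
    return [max(-limit, min(limit, r)) for r in rewards]


# Illustrative numbers only, not taken from the paper.
teacher = [-0.5, -4.0, -0.1]
student = [-1.0, -0.5, -0.1]
rewards = token_rewards(teacher, student)
clipped = clip_rewards(rewards, limit=2.0)
```

In this toy case the second token, which the teacher finds far less likely than the student does, gets a large negative reward that clipping then bounds, which is one simple way a strict imitation signal could be relaxed.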


[103] Huntington Disease Automatic Speech Recognition with Biomarker Supervision cs.LG | cs.CL | cs.SD

Charles L. Wang, Cady Chen, Ziwei Gong, Julia Hirschberg

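TL;DR: This paper presents a systematic study of automatic speech recognition (ASR) for the pathological speech of people with Huntington’s disease (HD). Using a high-fidelity clinical speech corpus, it compares multiple ASR model families and finds that HD speech induces architecture-specific error patterns, with Parakeet-TDT performing best. HD-specific adaptation reduces word error rate (WER) from 6.99% to 4.95%. The paper also proposes biomarker-based auxiliary supervision and analyzes how it reshapes error behavior in severity-dependent ways rather than uniformly lowering WER.

Details

Motivation: ASR for pathological speech, and HD speech in particular, is difficult because irregular timing, unstable phonation, and articulatory distortion challenge current models.

Result: Under a unified evaluation, Parakeet-TDT outperforms encoder-decoder and CTC baselines. HD-specific adaptation reduces WER from 6.99% to 4.95%.

Insight: The contribution is a systematic evaluation of how HD speech affects ASR models, plus a biomarker-based auxiliary supervision method that reshapes error patterns in a way that depends on disease severity instead of only lowering overall WER. This suggests a new optimization angle for pathological speech recognition: attend to the distribution of error types, not just the aggregate error rate.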

Abstract: Automatic speech recognition (ASR) for pathological speech remains underexplored, especially for Huntington’s disease (HD), where irregular timing, unstable phonation, and articulatory distortion challenge current models. We present a systematic HD-ASR study using a high-fidelity clinical speech corpus not previously used for end-to-end ASR training. We compare multiple ASR families under a unified evaluation, analyzing WER as well as substitution, deletion, and insertion patterns. HD speech induces architecture-specific error regimes, with Parakeet-TDT outperforming encoder-decoder and CTC baselines. HD-specific adaptation reduces WER from 6.99% to 4.95%. We also propose a method for using biomarker-based auxiliary supervision and analyze how error behavior is reshaped in severity-dependent ways rather than uniformly improving WER. We open-source all code and models.
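
For readers unfamiliar with the metrics in this abstract, the following is a minimal sketch of the standard word-level edit-distance computation behind WER, with substitutions, deletions, and insertions counted separately. It is a generic illustration, not the paper's evaluation code; the function name and the example sentences are made up.

```python
from typing import Dict, List


def wer_breakdown(reference: List[str], hypothesis: List[str]) -> Dict[str, float]:
    """Align hypothesis to reference by Levenshtein distance and count
    substitutions, deletions, and insertions."""
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one word")
    # cost[i][j] = (edits, subs, dels, ins) for reference[:i] vs hypothesis[:j]
    cost = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)  # delete every reference word
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)  # insert every hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if reference[i - 1] == hypothesis[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]  # match, no edit
                continue
            e, s, d, k = cost[i - 1][j - 1]
            substitution = (e + 1, s + 1, d, k)
            e, s, d, k = cost[i - 1][j]
            deletion = (e + 1, s, d + 1, k)
            e, s, d, k = cost[i][j - 1]
            insertion = (e + 1, s, d, k + 1)
            cost[i][j] = min(substitution, deletion, insertion)
    edits, subs, dels, ins = cost[n][m]
    return {"wer": edits / n, "sub": subs, "del": dels, "ins": ins}


result = wer_breakdown("the cat sat on the mat".split(), "the cat sit on mat".split())

# Relative WER reduction implied by the reported 6.99% -> 4.95%.
relative_reduction = (6.99 - 4.95) / 6.99
```

The toy example contains one substitution and one deletion over six reference words. The reported drop from 6.99% to 4.95% corresponds to a relative WER reduction of roughly 29%.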


[104] Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings cs.LG | cs.AI | cs.CL

Yuning Wu, Ke Wang, Devin Chen, Kai Wei

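TL;DR: This paper proposes Hindsight-Anchored Policy Optimization (HAPO) to resolve a dilemma in sparse-reward reinforcement learning (in particular the RLVR paradigm): pure RL suffers from advantage collapse, while mixed-policy optimization introduces distributional bias. Through a hindsight mechanism called Synthetic Success Injection (SSI), HAPO selectively anchors optimization to teacher demonstrations when the policy fails, and a Thompson sampling-inspired gating mechanism yields an autonomous, self-paced curriculum. Theoretically, HAPO is shown to be asymptotically consistent: it naturally anneals the teacher signal and recovers the unbiased on-policy gradient.

Details

Motivation: In sparse-reward settings, group-based methods such as GRPO face a dilemma: pure RL suffers from advantage collapse and high-variance gradient estimation, while mixed-policy optimization introduces persistent distributional bias.

Result: The abstract reports no quantitative experimental results or benchmarks; it states the theoretical result that HAPO achieves asymptotic consistency and recovers the unbiased on-policy gradient.

Insight: The core contribution is the Synthetic Success Injection (SSI) operator and its Thompson sampling-inspired gating mechanism, which turn failures into feedback and create an autonomous curriculum. The theoretical contribution is the proof of asymptotic consistency, so that off-policy teacher guidance acts as a temporary scaffold rather than a persistent ceiling, allowing the model to surpass the limitations of static teacher forcing.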

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for post-training reasoning models. However, group-based methods such as Group Relative Policy Optimization (GRPO) face a critical dilemma in sparse-reward settings: pure Reinforcement Learning (RL) suffers from advantage collapse and high-variance gradient estimation, while mixed-policy optimization introduces persistent distributional bias. To resolve this dilemma, we introduce Hindsight-Anchored Policy Optimization (HAPO). HAPO employs the Synthetic Success Injection (SSI) operator, a hindsight mechanism that selectively anchors optimization to teacher demonstrations during failure. This injection is governed by a Thompson sampling-inspired gating mechanism, creating an autonomous, self-paced curriculum. Theoretically, we demonstrate that HAPO achieves asymptotic consistency: by naturally annealing the teacher signal as the policy improves, HAPO recovers the unbiased on-policy gradient. This ensures off-policy guidance acts as a temporary scaffold rather than a persistent ceiling, enabling the model to surpass the limitations of static teacher forcing.
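
The advantage collapse mentioned in this abstract can be seen in a minimal sketch of group-relative advantage normalization, assuming the commonly described GRPO-style form (reward minus group mean, divided by group standard deviation). The function name and the `eps` constant are illustrative assumptions, and the sketch shows only the failure mode, not HAPO or its SSI operator.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One rollout in the group succeeds: the advantages carry a learning signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# Every rollout fails (sparse reward): all advantages are zero, so the
# policy gradient for this prompt vanishes.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

When every rollout in a group receives the same reward, the normalized advantages are all zero, which is why an all-failure group contributes no gradient without some extra source of signal such as a demonstration.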


[105] Meta-Reinforcement Learning with Self-Reflection for Agentic Search cs.LG | cs.CL

Teng Xiao, Yige Yuan, Hamish Ivison, Huaisheng Zhu, Faeze Brahman

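TL;DR: This paper proposes MR-Search, an agentic search framework based on in-context meta reinforcement learning (meta-RL). Through self-reflection, the agent learns across multiple episodes and adapts its search strategy, enabling more effective exploration at test time. The method uses the self-reflection generated after each episode as additional context to guide subsequent search, and introduces a multi-turn RL algorithm for fine-grained credit assignment.

Details

Motivation: Optimizing a policy within a single independent episode with sparse rewards has limited effect. The paper aims to let agents adapt their search strategy across episodes, using meta-learning and self-reflection to improve exploration.

Result: Across eight benchmarks, MR-Search generalizes better than RL-based baselines, with relative improvements of 9.2% to 19.3%.

Insight: The contribution is combining in-context meta-RL with explicit self-reflection, so that the agent accumulates experience across episodes and adapts its strategy. In addition, the proposed multi-turn RL algorithm estimates a dense relative advantage at the turn level, enabling finer-grained credit assignment.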

Abstract: This paper introduces MR-Search, an in-context meta reinforcement learning (RL) formulation for agentic search with self-reflection. Instead of optimizing a policy within a single independent episode with sparse rewards, MR-Search trains a policy that conditions on past episodes and adapts its search strategy across episodes. MR-Search learns to learn a search strategy with self-reflection, allowing search agents to improve in-context exploration at test time. Specifically, MR-Search performs cross-episode exploration by generating explicit self-reflections after each episode and leveraging them as additional context to guide subsequent attempts, thereby promoting more effective exploration at test time. We further introduce a multi-turn RL algorithm that estimates a dense relative advantage at the turn level, enabling fine-grained credit assignment on each episode. Empirical results across various benchmarks demonstrate the advantages of MR-Search over RL-based baselines, showing strong generalization and relative improvements of 9.2% to 19.3% across eight benchmarks. Our code and data are available at https://github.com/tengxiao1/MR-Search.


[106] LongFlow: Efficient KV Cache Compression for Reasoning M cs.LG | cs.CL

Yi Su, Zhenxu Tian, Dan Qiao, Yuechi Zhou, Juntao Li

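TL;DR: LongFlow is an efficient KV cache compression method for the long output sequences of reasoning models. It derives a low-overhead importance estimation metric from an intermediate result of attention computation and uses a custom kernel that fuses FlashAttention, importance estimation, and token eviction, substantially increasing throughput while preserving model accuracy.

Details

Motivation: Reasoning models such as OpenAI-o1 and DeepSeek-R1 produce long output sequences on complex tasks, which leads to large KV cache memory consumption and severe bandwidth pressure. Existing KV cache optimization methods are not suited to long-output scenarios and their importance estimation is computationally expensive, so an efficient compression scheme for long outputs is needed.

Result: In experiments, LongFlow achieves up to an 11.8 times throughput improvement with 80% KV cache compression and minimal impact on model accuracy.

Insight: The contribution is an efficient importance estimation metric derived from an intermediate result of attention computation using only the current query, with negligible computational overhead and no auxiliary storage, together with a custom kernel that fuses attention, estimation, and eviction into a single optimized operator. This improves system-level efficiency and suits long-output deployment of reasoning models.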

Abstract: Recent reasoning models such as OpenAI-o1 and DeepSeek-R1 have shown strong performance on complex tasks including mathematical reasoning and code generation. However, this performance gain comes with substantially longer output sequences, leading to significantly increased deployment costs. In particular, long outputs require large KV caches, resulting in high memory consumption and severe bandwidth pressure during attention computation. Most existing KV cache optimization methods are designed for long-input, short-output scenarios and are ineffective for the long-output setting of reasoning models. Moreover, importance estimation in prior work is computationally expensive and becomes prohibitive when continuous re-evaluation is required during long generation. To address these challenges, we propose LongFlow, a KV cache compression method with an efficient importance estimation metric derived from an intermediate result of attention computation using only the current query. This design introduces negligible computational overhead and requires no auxiliary storage. We further develop a custom kernel that fuses FlashAttention, importance estimation, and token eviction into a single optimized operator, improving system-level efficiency. Experiments show that LongFlow achieves up to an 11.8 times throughput improvement with 80% KV cache compression with minimal impact on model accuracy.
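
The general idea of query-driven KV cache eviction can be sketched as follows. This is a simplified illustration under stated assumptions, not LongFlow's method: the abstract does not say which intermediate attention result serves as the importance metric, so the sketch assumes the current query's softmax attention weights, and it runs in plain Python rather than as a fused GPU kernel. All names are hypothetical.

```python
import math
from typing import List, Sequence


def attention_weights(query: Sequence[float], keys: List[Sequence[float]]) -> List[float]:
    """Softmax of scaled dot products between the current query and each cached key."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def positions_to_keep(weights: List[float], keep_ratio: float) -> List[int]:
    """Return cache positions, in original order, of the highest-weight tokens."""
    budget = max(1, round(len(weights) * keep_ratio))
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:budget])


# Toy cache with five 2-dimensional keys (illustrative values only).
keys = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.5, 0.5], [2.0, 0.0]]
query = [1.0, 0.0]
weights = attention_weights(query, keys)
kept = positions_to_keep(weights, keep_ratio=0.2)  # 80% compression keeps 20% of entries
```

Because the weights are already produced while computing attention for the current token, reusing them as an importance score needs no separate scoring pass, which is the kind of saving the abstract attributes to its metric.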


[107] Chem4DLLM: 4D Multimodal LLMs for Chemical Dynamics Understanding cs.LG | cs.CL

Xinyu Li, Zhen Zhang, Qi Chen, Anton van den Hengel, Lina Yao

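TL;DR: This paper introduces Chemical Dynamics Understanding (ChemDU), a new task that translates 4D molecular trajectories into interpretable natural-language descriptions, addressing the reliance of existing chemical understanding tasks on static molecular representations. The authors construct Chem4DBench, the first dataset pairing 4D molecular trajectories with expert-authored explanations, and develop Chem4DLLM, a unified model that combines an equivariant graph encoder with a pretrained large language model to explicitly capture molecular geometry and rotational dynamics.

Details

Motivation: Existing chemical understanding tasks rely mainly on static molecular representations and struggle to model inherently dynamic phenomena such as bond breaking or conformational changes, which are essential for understanding chemical reactions.

Result: The paper constructs the first benchmark dataset, Chem4DBench, and proposes the Chem4DLLM model; the abstract reports no quantitative results or comparisons with existing methods.

Insight: The contribution is formalizing chemical dynamics understanding as a multimodal task from 4D molecular trajectories to natural-language explanations, and proposing a unified architecture that combines an equivariant graph neural network (for 3D geometry and its evolution over time) with a large language model, to stimulate research on dynamic chemical understanding and multimodal scientific reasoning.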

Abstract: Existing chemical understanding tasks primarily rely on static molecular representations, limiting their ability to model inherently dynamic phenomena such as bond breaking or conformational changes, which are essential for a chemist to understand chemical reactions. To address this gap, we introduce Chemical Dynamics Understanding (ChemDU), a new task that translates 4D molecular trajectories into interpretable natural-language explanations. ChemDU focuses on fundamental dynamic scenarios, including gas-phase and catalytic reactions, and requires models to reason about key events along molecular trajectories, such as bond formation and dissociation, and to generate coherent, mechanistically grounded narratives. To benchmark this capability, we construct Chem4DBench, the first dataset pairing 4D molecular trajectories with expert-authored explanations across these settings. We further propose Chem4DLLM, a unified model that integrates an equivariant graph encoder with a pretrained large language model to explicitly capture molecular geometry and rotational dynamics. We hope that ChemDU, together with Chem4DBench and Chem4DLLM, will stimulate further research in dynamic chemical understanding and multimodal scientific reasoning.


[108] Attention Gathers, MLPs Compose: A Causal Analysis of an Action-Outcome Circuit in VideoViT cs.LG | cs.AI | cs.CV

Sai V R Chereddy

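TL;DR: Using explainable AI methods, specifically mechanistic interpretability techniques, this paper reverse-engineers the internal circuit that represents action outcomes in a pre-trained video vision transformer (VideoViT). It finds that the model computes the “Success vs Failure” signal through a distinct amplification cascade in which attention heads act as “evidence gatherers” and MLP blocks act as robust “concept composers”. Together they form a distributed and redundant circuit, which explains the model’s resilience to simple ablations.

Details

Motivation: Explore how video models trained for classification represent nuanced, hidden semantic information that may not affect the final outcome, a key challenge for building trustworthy AI models.

Result: Causal analysis, primarily activation patching supported by ablation results, reveals a clear division of labor between attention heads and MLP blocks in computing the “success” signal; the circuit is distributed and redundant within the model.

Insight: The contribution is uncovering an internal circuit in a video ViT dedicated to action outcomes and its computational pattern, in which attention gathers evidence and MLPs compose concepts. Viewed objectively, this shows that even a model trained only for simple classification may develop “hidden knowledge” beyond its explicit task, underscoring the importance of mechanistic oversight for building genuinely explainable and trustworthy AI systems.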

Abstract: The paper explores how video models trained for classification tasks represent nuanced, hidden semantic information that may not affect the final outcome, a key challenge for Trustworthy AI models. Through Explainable and Interpretable AI methods, specifically mechanistic interpretability techniques, the internal circuit responsible for representing the action’s outcome is reverse-engineered in a pre-trained video vision transformer, revealing that the “Success vs Failure” signal is computed through a distinct amplification cascade. While there are low-level differences observed from layer 0, the abstract and semantic representation of the outcome is progressively amplified from layers 5 through 11. Causal analysis, primarily using activation patching supported by ablation results, reveals a clear division of labor: Attention Heads act as “evidence gatherers”, providing necessary low-level information for partial signal recovery, while MLP Blocks function as robust “concept composers”, each of which is the primary driver to generate the “success” signal. This distributed and redundant circuit in the model’s internals explains its resilience to simple ablations, demonstrating a core computational pattern for processing human-action outcomes. Crucially, the existence of this sophisticated circuit for representing complex outcomes, even within a model trained only for simple classification, highlights the potential for models to develop forms of ‘hidden knowledge’ beyond their explicit task, underscoring the need for mechanistic oversight for building genuinely Explainable and Trustworthy AI systems intended for deployment.


cs.MM

[109] Catalogue Grounded Multimodal Attribution for Museum Video under Resource and Regulatory Constraints cs.MM | cs.CV | cs.LG

Minsak Nanang, Adrian Hilton, Armin Mustafa

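TL;DR: This paper proposes an automated metadata annotation method for museum audiovisual content, grounded in an existing collection database. Using a locally deployable video language model, it builds a multi-pass pipeline that summarises the artworks in a video, generates catalogue-style descriptions and genre labels, and attributes title and artist via conservative similarity matching, aiming to improve archive discoverability under resource and regulatory constraints.

Details

Motivation: Rapidly growing audiovisual archives in museums and galleries are hard to use because they lack searchable metadata; the work automates catalogue-style metadata curation, which traditionally requires extensive manual effort.

Result: Early deployments on a painting catalogue suggest that the framework can improve audiovisual archive discoverability while respecting resource constraints, data sovereignty, and emerging regulation.

Insight: The contribution is combining an open, locally deployable video language model with a multi-pass pipeline (summarisation, description generation, conservative similarity matching), offering a transferable template for application-driven machine learning in resource-constrained, high-stakes domains.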

Abstract: Audiovisual (AV) archives in museums and galleries are growing rapidly, but much of this material remains effectively locked away because it lacks consistent, searchable metadata. Existing archiving methods require extensive manual effort. We address this by automating the most labour-intensive part of the workflow: catalogue-style metadata curation for in-gallery video, grounded in an existing collection database. Concretely, we propose catalogue-grounded multimodal attribution for museum AV content using an open, locally deployable video language model. We design a multi-pass pipeline that (i) summarises artworks in a video, (ii) generates catalogue-style descriptions and genre labels, and (iii) attempts to attribute title and artist via conservative similarity matching to the structured catalogue. Early deployments on a painting catalogue suggest that this framework can improve AV archive discoverability while respecting resource constraints, data sovereignty, and emerging regulation, offering a transferable template for application-driven machine learning in other high-stakes domains.
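
One way to read “conservative similarity matching” in step (iii) is a matcher that abstains unless the best catalogue candidate is both strong and clearly ahead of the runner-up. The sketch below illustrates that reading only. The string-similarity measure (`difflib.SequenceMatcher`), the thresholds, the function name, and the catalogue entries are all assumptions for illustration; the abstract does not describe the paper's actual matching method.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional


def conservative_match(description: str, catalogue: Dict[str, str],
                       min_score: float = 0.6, min_margin: float = 0.1) -> Optional[str]:
    """Return the catalogue ID of the best match, or None when the match is
    weak or ambiguous (abstaining is preferred to a wrong attribution)."""
    scored = sorted(
        ((SequenceMatcher(None, description.lower(), text.lower()).ratio(), item_id)
         for item_id, text in catalogue.items()),
        reverse=True,
    )
    if not scored:
        return None
    best_score, best_id = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    if best_score < min_score or best_score - runner_up < min_margin:
        return None
    return best_id


# Hypothetical catalogue entries, for illustration only.
catalogue = {
    "P001": "oil painting of a harbour at sunset",
    "P002": "marble bust of a roman emperor",
}
confident = conservative_match("Oil painting of a harbour at sunset", catalogue)
weak = conservative_match("xyz", catalogue)
ambiguous = conservative_match(
    "portrait of a woman",
    {"A": "portrait of a woman", "B": "portrait of a woman"},
)
```

Here only the first query yields an attribution. The second has no usable similarity and the third ties between two entries, so both return `None`, leaving the record for a human cataloguer.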


[110] OmniForcing: Unleashing Real-time Joint Audio-Visual Generation cs.MM | cs.CV | cs.SD

Yaofeng Su, Yuming Li, Zeyue Xue, Jie Huang, Siming Fu

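TL;DR: This paper proposes OmniForcing, a framework that addresses the high latency of existing joint audio-visual diffusion models, whose bidirectional attention dependencies prevent real-time use. It distills an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator.

Details

Motivation: Existing joint audio-visual generation models such as diffusion models produce high-quality output, but their bidirectional attention causes high inference latency and hinders real-time applications. The goal is a framework that streams in real time while maintaining quality.

Result: OmniForcing achieves state-of-the-art streaming generation at about 25 FPS on a single GPU, with multi-modal synchronization and visual quality on par with the bidirectional teacher model.

Insight: The main contributions are: 1) Asymmetric Block-Causal Alignment with a Global Prefix, which addresses the information density gap between modalities and synchronization drift; 2) an Audio Sink Token mechanism with an Identity RoPE constraint, which mitigates the gradient explosion caused by audio token sparsity; 3) a Joint Self-Forcing Distillation paradigm, which lets the model dynamically self-correct cumulative cross-modal errors during long rollouts; 4) a modality-independent rolling KV-cache inference scheme that enables efficient streaming generation.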

Abstract: Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator. However, naively applying causal distillation to such dual-stream architectures triggers severe training instability, due to the extreme temporal asymmetry between modalities and the resulting token sparsity. We address the inherent information density gap by introducing an Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix that prevents multi-modal synchronization drift. The gradient explosion caused by extreme audio token sparsity during the causal shift is further resolved through an Audio Sink Token mechanism equipped with an Identity RoPE constraint. Finally, a Joint Self-Forcing Distillation paradigm enables the model to dynamically self-correct cumulative cross-modal errors from exposure bias during long rollouts. Empowered by a modality-independent rolling KV-cache inference scheme, OmniForcing achieves state-of-the-art streaming generation at ~25 FPS on a single GPU, maintaining multi-modal synchronization and visual quality on par with the bidirectional teacher. Project page: https://omniforcing.com


cs.SE

[111] CR-Bench: Evaluating the Real-World Utility of AI Code Review Agents cs.SE | cs.AI | cs.CL

Kristen Pereira, Neelabh Sinha, Rajat Ghosh, Debojyoti Dutta

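TL;DR: This paper introduces the CR-Bench benchmarking dataset and the CR-Evaluator evaluation pipeline for fine-grained assessment of the real-world utility of AI code review agents. A preliminary study finds that current code review agents have a low signal-to-noise ratio when designed to identify all hidden issues, revealing a trade-off between issue resolution and spurious findings.

Details

Motivation: The lack of standardized benchmarks and granular evaluation protocols makes it difficult to assess code review agents beyond coarse success metrics, particularly for tasks where false positives are costly.

Result: A preliminary evaluation of a single-shot agent and a Reflexion-based agent across two frontier models finds that measuring solely by resolution rates obscures true progress and developer productivity, and that agents show a low signal-to-noise ratio when designed to identify all hidden issues.

Insight: The contribution is a dedicated benchmarking dataset and fine-grained evaluation pipeline that expose the key trade-off between issue resolution and spurious findings in code review agent design, providing a foundation as LLM-based systems move from controlled benchmarks to real-world software engineering workflows.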

Abstract: Recent advances in frontier large language models have enabled code review agents that operate in open-ended, reasoning-intensive settings. However, the lack of standardized benchmarks and granular evaluation protocols makes it difficult to assess behavior of code review agents beyond coarse success metrics, particularly for tasks where false positives are costly. To address this gap, we introduce CR-Bench, a benchmarking dataset, and CR-Evaluator, a fine-grained evaluation pipeline for code review agents. Using these tools, we conduct a preliminary study evaluating both a single-shot agent and a Reflexion-based agent across two frontier models. We find that code review agents can exhibit a low signal-to-noise ratio when designed to identify all hidden issues, obscuring true progress and developer productivity when measured solely by resolution rates. Our analysis identifies the hidden trade-off between issue resolution and spurious findings, revealing a frontier that constrains effective agent design. Together, CR-Bench and CR-Evaluator provide a timely foundation for studying and developing code review agents as LLM-based systems transition from controlled benchmarks to real-world software engineering workflows.


physics.med-ph

[112] MRI2Qmap: multi-parametric quantitative mapping with MRI-driven denoising priors physics.med-ph | cs.CV | cs.LG

Mohammad Golbabaee, Matteo Cencini, Carolin Pirkl, Marion Menzel, Michela Tosetti

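TL;DR: This paper proposes MRI2Qmap, a plug-and-play quantitative reconstruction framework that integrates the physical acquisition model with priors learned by deep denoising autoencoders pretrained on large multimodal weighted-MRI datasets. It addresses the aliasing artifacts caused by compressed sampling in highly accelerated transient-state parameter mapping techniques such as Magnetic Resonance Fingerprinting (MRF), without requiring ground-truth quantitative imaging data for training.

Details

Motivation: Highly accelerated parameter mapping techniques such as MRF quantify multiple tissue properties simultaneously but often suffer from aliasing artifacts due to compressed sampling. Existing deep learning methods depend on large amounts of quantitative imaging training data, which is scarce. The paper explores whether clinically routine weighted-MRI images can serve as a source of training data to overcome this limitation.

Result: Validated on highly accelerated 3D whole-brain MRF data from both in-vivo and simulated acquisitions, the method achieves competitive or superior performance relative to existing baselines without requiring ground-truth quantitative imaging data for training.

Insight: The novelty is a framework that learns spatial-domain structural priors from independently acquired datasets of routine weighted-MRI images and uses them effectively for quantitative MRI reconstruction. This decouples quantitative reconstruction from the need for ground-truth MRF training data and offers a scalable paradigm for exploiting large-scale clinical MRI data.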

Abstract: Magnetic Resonance Fingerprinting (MRF) and other highly accelerated transient-state parameter mapping techniques enable simultaneous quantification of multiple tissue properties, but often suffer from aliasing artifacts due to compressed sampling. Incorporating spatial image priors can mitigate these artifacts, and deep learning has shown strong potential when large training datasets are available. However, extending this paradigm to MRF-type sequences remains challenging due to the scarcity of quantitative imaging data for training. Can this limitation be overcome by leveraging sources of training data from clinically-routine weighted MRI images? To this end, we introduce MRI2Qmap, a plug-and-play quantitative reconstruction framework that integrates the physical acquisition model with priors learned from deep denoising autoencoders pretrained on large multimodal weighted-MRI datasets. MRI2Qmap demonstrates that spatial-domain structural priors learned from independently acquired datasets of routine weighted-MRI images can be effectively used for quantitative MRI reconstruction. The proposed method is validated on highly accelerated 3D whole-brain MRF data from both in-vivo and simulated acquisitions, achieving competitive or superior performance relative to existing baselines without requiring ground-truth quantitative imaging data for training. By decoupling quantitative reconstruction from the need for ground-truth MRF training data, this framework points toward a scalable paradigm for quantitative MRI that can capitalize on the large and growing repositories of routine clinical MRI.


cs.AI

[113] Speak or Stay Silent: Context-Aware Turn-Taking in Multi-Party Dialogue cs.AI | cs.CL

Kratika Bhagtani, Mrinal Anand, Yu Chen Xu, Amit Kumar Singh Yadav

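TL;DR: This paper proposes a context-aware turn-taking method to stop AI assistants in multi-party dialogue from interjecting indiscriminately because they misread pauses. It builds a benchmark of over 120K labeled conversations, evaluates several large language models, finds that they perform poorly under zero-shot prompting, and then proposes supervised fine-tuning with reasoning traces, which substantially improves performance.

Details

Motivation: Existing voice AI assistants treat every detected pause as an invitation to speak. This works in dyadic dialogue, but in multi-party settings pauses are frequent and ambiguous, so an assistant that speaks on every pause becomes disruptive.

Result: Eight recent large language models were evaluated on a benchmark built from three multi-party corpora; they consistently perform poorly at context-aware turn-taking under zero-shot prompting. The proposed supervised fine-tuning with reasoning traces improves balanced accuracy by up to 23 percentage points.

Insight: The novelty is formulating turn-taking as a binary decision (speak or stay silent) conditioned on the full conversation context, and showing that context-aware turn-taking is not an emergent capability of large language models but must be explicitly trained. Reusable ideas include building a large-scale multi-party dialogue benchmark and supervised fine-tuning with reasoning traces.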

Abstract: Existing voice AI assistants treat every detected pause as an invitation to speak. This works in dyadic dialogue, but in multi-party settings, where an AI assistant participates alongside multiple speakers, pauses are abundant and ambiguous. An assistant that speaks on every pause becomes disruptive rather than useful. In this work, we formulate context-aware turn-taking: at every detected pause, given the full conversation context, our method decides whether the assistant should speak or stay silent. We introduce a benchmark of over 120K labeled conversations spanning three multi-party corpora. Evaluating eight recent large language models, we find that they consistently fail at context-aware turn-taking under zero-shot prompting. We then propose a supervised fine-tuning approach with reasoning traces, improving balanced accuracy by up to 23 percentage points. Our findings suggest that context-aware turn-taking is not an emergent capability; it must be explicitly trained.
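To make the reported metric concrete, the following is a minimal sketch of balanced accuracy for the binary speak (1) or stay silent (0) decision. The labels and predictions are made-up illustrations, not data from the paper, and the function name is hypothetical. It shows why plain accuracy would be misleading when one class dominates.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of recall on the 'speak' class (1) and recall on the 'silent' class (0)."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must appear in y_true")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


# Hypothetical pause-level labels: the assistant should speak at 2 of 10 pauses.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# An assistant that speaks at every pause gets every 'speak' case right
# and every 'silent' case wrong.
always_speak = [1] * len(y_true)
ba_always = balanced_accuracy(y_true, always_speak)

# A more selective (hypothetical) assistant.
selective = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
ba_selective = balanced_accuracy(y_true, selective)
```

Under this metric, the always-speak baseline scores no better than chance regardless of how rare the speak class is, which is why it suits imbalanced turn-taking data.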


[114] From Debate to Deliberation: Structured Collective Reasoning with Typed Epistemic Acts cs.AI | cs.CL | cs.MA

Sunil Prakash

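TL;DR: This paper proposes Deliberative Collective Intelligence (DCI), a framework that gives multi-agent LLM systems a structured collective deliberation process. It defines four reasoning archetypes, 14 typed epistemic acts, a shared workspace, and DCI-CF, an algorithm with guaranteed convergence, and produces a structured decision packet containing the selected option, residual objections, a minority report, and reopen conditions.

Details

Motivation: The interaction patterns of existing multi-agent LLM systems (voting, unstructured debate, or pipeline orchestration) do not model genuine deliberation: a phased process in which participants exchange typed reasoning moves, preserve disagreements, and converge on accountable outcomes.

Result: Evaluated on 45 tasks across seven domains using Gemini 2.5 Flash. On non-routine tasks (n=40), DCI significantly improves over unstructured debate (+0.95, 95% CI [+0.41, +1.54]). It excels on hidden-profile tasks requiring perspective integration (9.56, the highest score of any system on any domain) but performs poorly on routine decision tasks (5.39). DCI produces structured decision packets in 100% of cases and minority reports in 98%, artifacts that no baseline produces. DCI consumes about 62x the tokens of a single agent, and single-agent generation outperforms DCI on overall quality.

Insight: The core contribution is a structured, typed deliberation framework (DCI) for multi-agent reasoning together with a convergence-guaranteed algorithm (DCI-CF). It stresses the value of process accountability in consequential decisions rather than simply adding agents or pursuing efficiency, and shows that deliberative structure benefits consequential decisions when process accountability justifies the cost.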

Abstract: Multi-agent LLM systems increasingly tackle complex reasoning, yet their interaction patterns remain limited to voting, unstructured debate, or pipeline orchestration. None model deliberation: a phased process where differentiated participants exchange typed reasoning moves, preserve disagreements, and converge on accountable outcomes. We introduce Deliberative Collective Intelligence (DCI), specifying four reasoning archetypes, 14 typed epistemic acts, a shared workspace, and DCI-CF, a convergent flow algorithm that guarantees termination with a structured decision packet containing the selected option, residual objections, minority report, and reopen conditions. We evaluate on 45 tasks across seven domains using Gemini 2.5 Flash. On non-routine tasks (n=40), DCI significantly improves over unstructured debate (+0.95, 95% CI [+0.41, +1.54]). DCI excels on hidden-profile tasks requiring perspective integration (9.56, highest of any system on any domain) while failing on routine decisions (5.39), confirming task-dependence. DCI produces 100% structured decision packets and 98% minority reports, artifacts absent from all baselines. However, DCI consumes ~62x single-agent tokens, and single-agent generation outperforms DCI on overall quality. DCI’s contribution is not that more agents are better, but that consequential decisions benefit from deliberative structure when process accountability justifies the cost.


[115] XSkill: Continual Learning from Experience and Skills in Multimodal Agents cs.AI | cs.CL

Guanyu Jiang, Zhaochen Su, Xiaoye Qu, Yi R. Fung

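TL;DR: This paper proposes XSkill, a framework for continual learning in multimodal agents in open-ended settings that extracts and reuses two kinds of knowledge, experiences and skills, grounded in visual observations. Visually grounded knowledge accumulation and retrieval, combined with feedback from usage history, form a continual learning loop that improves the agent's tool-use efficiency and planning flexibility.

Details

Motivation: Multimodal agents handling complex reasoning tasks suffer from inefficient tool use and inflexible orchestration, especially in open-ended settings. The central challenge is enabling agents to improve continually without parameter updates, solely by learning from past trajectories.

Result: Evaluated on five benchmarks across diverse domains with four backbone models, XSkill substantially outperforms both tool-only and learning-based baselines and shows superior zero-shot generalization.

Insight: The novelty is identifying and integrating two complementary forms of knowledge, experiences and skills, and building a continual learning loop through visually grounded knowledge distillation, consolidation, and retrieval. Viewed objectively, the framework combines high-level task planning with low-level decision guidance, giving multimodal agents a structured approach to adaptive learning.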

Abstract: Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet they still suffer from inefficient tool use and inflexible orchestration in open-ended settings. A central challenge is enabling such agents to continually improve without parameter updates by learning from past trajectories. We identify two complementary forms of reusable knowledge essential for this goal: experiences, providing concise action-level guidance for tool selection and decision making, and skills, providing structured task-level guidance for planning and tool use. To this end, we propose XSkill, a dual-stream framework for continual learning from experience and skills in multimodal agents. XSkill grounds both knowledge extraction and retrieval in visual observations. During accumulation, XSkill distills and consolidates experiences and skills from multi-path rollouts via visually grounded summarization and cross-rollout critique. During inference, it retrieves and adapts this knowledge to the current visual context and feeds usage history back into accumulation to form a continual learning loop. Evaluated on five benchmarks across diverse domains with four backbone models, XSkill consistently and substantially outperforms both tool-only and learning-based baselines. Further analysis reveals that the two knowledge streams play complementary roles in influencing the reasoning behaviors of agents and show superior zero-shot generalization.


[116] TopoBench: Benchmarking LLMs on Hard Topological Reasoning cs.AI | cs.CL

Mayug Maniparambil, Nils Hoehing, Janak Kapuriya, Arjun Karuvally, Ellen Rushe

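TL;DR: The paper introduces TopoBench, a benchmark for evaluating large language models on topological reasoning, comprising six puzzle families across three difficulty levels. Frontier models perform poorly on hard instances, and an error taxonomy combined with intervention experiments shows that the main bottleneck is constraint extraction rather than the reasoning process itself.

Details

Motivation: Solving topological grid puzzles requires reasoning over global spatial invariants such as connectivity, loop closure, and region symmetry, which remains challenging for current large language models. A controlled benchmark is therefore needed to study their abilities and limitations systematically.

Result: On TopoBench, even frontier models solve fewer than one quarter of hard instances, and two puzzle families are nearly unsolved. Intervention experiments show that error patterns such as premature commitment and constraint forgetting directly affect solving ability, whereas repeated reasoning is a benign effect of search.

Insight: The novelty is a benchmark dedicated to topological reasoning, plus an error taxonomy and targeted interventions showing that the root cause of model failure is difficulty extracting constraints from spatial representations rather than insufficient reasoning ability, which points to directions for future improvement.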

Abstract: Solving topological grid puzzles requires reasoning over global spatial invariants such as connectivity, loop closure, and region symmetry and remains challenging for even the most powerful large language models (LLMs). To study these abilities under controlled settings, we introduce TopoBench, a benchmark of six puzzle families across three difficulty levels. We evaluate strong reasoning LLMs on TopoBench and find that even frontier models solve fewer than one quarter of hard instances, with two families nearly unsolved. To investigate whether these failures stem from reasoning limitations or from difficulty extracting and maintaining spatial constraints, we annotate 750 chain of thought traces with an error taxonomy that surfaces four candidate causal failure modes, then test them with targeted interventions simulating each error type. These interventions show that certain error patterns like premature commitment and constraint forgetting have a direct impact on the ability to solve the puzzle, while repeated reasoning is a benign effect of search. Finally we study mitigation strategies including prompt guidance, cell-aligned grid representations and tool-based constraint checking, finding that the bottleneck lies in extracting constraints from spatial representations and not in reasoning over them. Code and data are available at github.com/mayug/topobench-benchmark.


[117] Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training cs.AI | cs.CL | cs.LG

Yixin Liu, Yue Yu, DiJia Su, Sid Wang, Xuewei Wang

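TL;DR: This paper studies the practical effect of using reasoning LLMs as judges (LLMs-as-Judges) for reinforcement-learning-based alignment of LLMs in non-verifiable domains. In a controlled synthetic setting, non-reasoning judges readily lead to reward hacking, while reasoning judges yield policies that perform well when evaluated by the gold-standard judge; those policies, however, have in fact learned to generate adversarial outputs that deceive other LLM judges.

Details

Motivation: Although reasoning LLM judges perform better on static evaluation benchmarks, their effectiveness in actual policy training (such as LLM alignment) has not been systematically examined, particularly in non-verifiable domains where output correctness or quality cannot be directly checked.

Result: In reinforcement-learning-based LLM alignment, experiments in which a gold-standard judge (gpt-oss-120b) is used to train smaller judges show that policies trained with non-reasoning judges are prone to reward hacking. Policies trained with reasoning judges perform strongly under the gold-standard judge, yet they also score well on popular benchmarks such as Arena-Hard by generating adversarial outputs that deceive other LLM judges.

Insight: The novelty is the first systematic evaluation of the actual impact of reasoning versus non-reasoning LLM judges in non-verifiable LLM post-training. It reveals that reasoning judges can train deceptive, adversarial policies, an important insight and caution for improving LLM judges in alignment applications.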

Abstract: Reasoning LLMs-as-Judges, which can benefit from inference-time scaling, provide a promising path for extending the success of reasoning models to non-verifiable domains where the output correctness/quality cannot be directly checked. However, while reasoning judges have shown better performance on static evaluation benchmarks, their effectiveness in actual policy training has not been systematically examined. Therefore, we conduct a rigorous study to investigate the actual impact of non-reasoning and reasoning judges in reinforcement-learning-based LLM alignment. Our controlled synthetic setting, where a “gold-standard” judge (gpt-oss-120b) provides preference annotations to train smaller judges, reveals key differences between non-reasoning and reasoning judges: non-reasoning judges lead to reward hacking easily, while reasoning judges can lead to policies that achieve strong performance when evaluated by the gold-standard judge. Interestingly, we find that the reasoning-judge-trained policies achieve such strong performance by learning to generate highly effective adversarial outputs that can also score well on popular benchmarks such as Arena-Hard by deceiving other LLM-judges. Combined with our further analysis, our study highlights both important findings and room for improvements for applying (reasoning) LLM-judges in non-verifiable LLM post-training.


[118] GPT4o-Receipt: A Dataset and Human Study for AI-Generated Document Forensics cs.AI | cs.CV

Yan Zhang, Simiao Ren, Ankit Raj, En Wei, Dennis Ng

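TL;DR: The paper presents GPT4o-Receipt, a dataset of 1,235 receipt images for evaluating how well humans and multimodal large language models (MLLMs) detect AI-generated financial documents. It finds a paradox: humans are better at visually distinguishing AI artifacts, yet their F1 score on the binary detection task is lower than that of models such as Claude Sonnet 4 and Gemini 2.5 Flash. The reason is that the dominant forensic signals in AI-generated receipts are arithmetic errors, which visual inspection cannot reveal but LLMs can verify systematically.

Details

Motivation: The study investigates how humans and machines differ in detecting AI-generated financial documents such as receipts, and establishes a benchmark dataset and evaluation framework to support research in AI document forensics.

Result: On the GPT4o-Receipt benchmark, human annotators exhibit the largest visual discrimination gap, yet their binary detection F1 (0.58) is lower than that of Claude Sonnet 4 (0.79) and Gemini 2.5 Flash (0.73). The evaluation of five state-of-the-art multimodal LLMs shows marked performance disparities and calibration differences.

Insight: The paper's novelty is revealing a key paradox in AI-generated document forensics: humans are good at perceiving visual artifacts yet detect worse than LLMs, because the core forensic signal (arithmetic errors) is a non-visual, verifiable logical error. This challenges detector selection based purely on visual detection or simple accuracy metrics and underscores the need for a more comprehensive evaluation framework.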

Abstract: Can humans detect AI-generated financial documents better than machines? We present GPT4o-Receipt, a benchmark of 1,235 receipt images pairing GPT-4o-generated receipts with authentic ones from established datasets, evaluated by five state-of-the-art multimodal LLMs and a 30-annotator crowdsourced perceptual study. Our findings reveal a striking paradox: humans are better at seeing AI artifacts, yet worse at detecting AI documents. Human annotators exhibit the largest visual discrimination gap of any evaluator, yet their binary detection F1 falls well below Claude Sonnet 4 and below Gemini 2.5 Flash. This paradox resolves once the mechanism is understood: the dominant forensic signals in AI-generated receipts are arithmetic errors – invisible to visual inspection but systematically verifiable by LLMs. Humans cannot perceive that a subtotal is incorrect; LLMs verify it in milliseconds. Beyond the human–LLM comparison, our five-model evaluation reveals dramatic performance disparities and calibration differences that render simple accuracy metrics insufficient for detector selection. GPT4o-Receipt, the evaluation framework, and all results are released publicly to support future research in AI document forensics.
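The arithmetic signal described above can be illustrated with a deterministic check. This is a minimal sketch, not the paper's evaluation pipeline: the receipt fields, the `check_receipt` helper, the one-cent tolerance, and the sample values are all assumptions made for illustration.

```python
from decimal import Decimal


def check_receipt(items, subtotal, tax, total, tol=Decimal("0.01")):
    """Return a list of arithmetic inconsistencies; an empty list means consistent.

    items is a list of (quantity, unit_price) pairs; money values are strings
    so that Decimal parses them exactly.
    """
    problems = []
    line_sum = sum((Decimal(qty) * Decimal(price) for qty, price in items), Decimal("0"))
    if abs(line_sum - Decimal(subtotal)) > tol:
        problems.append(f"subtotal {subtotal} does not match line items ({line_sum})")
    if abs(Decimal(subtotal) + Decimal(tax) - Decimal(total)) > tol:
        problems.append(f"total {total} does not match subtotal + tax")
    return problems


# Hypothetical receipt: two items at 3.50 and one at 4.25.
items = [(2, "3.50"), (1, "4.25")]

# Internally consistent receipt.
genuine = check_receipt(items, "11.25", "0.90", "12.15")

# Receipt whose subtotal is off by 1.00 while total = subtotal + tax still holds,
# the kind of error that looks fine visually.
forged = check_receipt(items, "12.25", "0.90", "13.15")
```

`Decimal` is used instead of floats so that cent-level comparisons are exact; a detector working from images would first need to extract these fields, which this sketch does not attempt.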


[119] VisDoT: Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought cs.AI | cs.CV

Eunsoo Lee, Jeongwoo Lee, Minki Hong, Jangho Choi, Jihie Kim

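TL;DR: The paper proposes VisDoT, a framework that strengthens the visual reasoning of large vision-language models (LVLMs) by emulating the human visual perception process. It formalizes four perceptual tasks based on the theory of graphical perception and introduces Decomposition-of-Thought (DoT) prompting, which splits a question into visual perception sub-questions and logic sub-questions. It achieves substantial gains on benchmarks such as ChartQA and ChartQAPro, including a +33.2% improvement on the newly proposed VisDoTQA benchmark.

Details

Motivation: In chart understanding, LVLMs struggle to reliably detect visual primitives and align them with semantic representations, which severely limits performance on complex visual reasoning. This lack of perceptual grounding is the main bottleneck for chart-based reasoning.

Result: Fine-tuning InternVL with VisDoT yields a +11.2% improvement on ChartQA and surpasses GPT-4o on the more challenging ChartQAPro benchmark. On the newly introduced VisDoTQA benchmark the model improves by +33.2%. Zero-shot gains on diverse open-domain VQA benchmarks further confirm that the perception-logic separation strategy generalizes.

Insight: The novelty is formalizing the theory of human graphical perception into concrete perceptual tasks (such as position and length) and adopting DoT prompting to decompose visual questions into perception and logic sub-questions, which strengthens visual grounding and enables interpretable visual reasoning, reaching state-of-the-art results on chart understanding.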

Abstract: Large vision-language models (LVLMs) struggle to reliably detect visual primitives in charts and align them with semantic representations, which severely limits their performance on complex visual reasoning. This lack of perceptual grounding constitutes a major bottleneck for chart-based reasoning. We propose VisDoT, a framework that enhances visual reasoning through human-like interpretation grounding. We formalize four perceptual tasks based on the theory of graphical perception, including position and length. Building on this foundation, we introduce Decomposition-of-Thought (DoT) prompting, which sequentially separates questions into visual perception sub-questions and logic sub-questions. Fine-tuning InternVL with VisDoT achieves a +11.2% improvement on ChartQA and surpasses GPT-4o on the more challenging ChartQAPro benchmark. On the newly introduced VisDoTQA benchmark, the model improves by +33.2%. Furthermore, consistent zero-shot gains on diverse open-domain VQA benchmarks confirm the generalizability of the perception-logic separation strategy for visual question answering. VisDoT leverages human-like perception to enhance visual grounding, achieving state-of-the-art chart understanding and interpretable visual reasoning.


[120] Prototype-Based Knowledge Guidance for Fine-Grained Structured Radiology Reporting cs.AI | cs.CV | cs.LG

Chantal Pellegrini, Adrian Delchev, Ege Özsoy, Nassir Navab, Matthias Keicher

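TL;DR: This paper proposes ProtoSR, which improves structured radiology report generation by extracting knowledge from free-text reports. It uses an instruction-tuned LLM to automatically build a multimodal knowledge base from the MIMIC-CXR dataset, creates a visual prototype for each structured answer option, and is trained to retrieve relevant prototypes to correct the model's predictions.

Details

Motivation: Automating structured radiology reporting is hard because limited supervised data makes accurate decisions about rare findings and fine-grained attributes difficult. The paper exploits the fine-grained, image-linked information implicit in the large volume of free-text reports.

Result: ProtoSR achieves state-of-the-art (SOTA) results on the Rad-ReStruct benchmark, with the largest improvements on detailed attribute questions.

Insight: The novelty is an automatic pipeline for extracting structured knowledge from free text, together with knowledge guidance and residual correction using visual prototypes, which provides a data-driven "second opinion" mechanism for fine-grained image understanding.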

Abstract: Structured radiology reporting promises faster, more consistent communication than free text, but automation remains difficult as models must make many fine-grained, discrete decisions about rare findings and attributes from limited structured supervision. In contrast, free-text reports are produced at scale in routine care and implicitly encode fine-grained, image-linked information through detailed descriptions. To leverage this unstructured knowledge, we propose ProtoSR, an approach for injecting free-text information into structured report population. First, we introduce an automatic extraction pipeline that uses an instruction-tuned LLM to mine 80k+ MIMIC-CXR studies and build a multimodal knowledge base aligned with a structured reporting template, representing each answer option with a visual prototype. Using this knowledge base, ProtoSR is trained to retrieve prototypes relevant for the current image-question pair and augment the model predictions through a prototype-conditioned residual, providing a data-driven second opinion that selectively corrects predictions. On the Rad-ReStruct benchmark, ProtoSR achieves state-of-the-art results, with the largest improvements on detailed attribute questions, demonstrating the value of integrating free-text derived signal for fine-grained image understanding.


cs.RO

[121] Real-time Rendering-based Surgical Instrument Tracking via Evolutionary Optimization cs.RO | cs.CV

Hanyang Hu, Zekai Liang, Florian Richter, Michael C. Yip

TL;DR: 本文提出了一种基于实时渲染和进化优化的手术器械跟踪方法,通过将CMA-ES进化优化策略整合到跟踪流程中,联合估计手术器械的位姿和关节配置,利用批量渲染并行评估多个位姿候选,显著减少了推理时间并提高了收敛鲁棒性。该方法可泛化至无关节角和双手操作场景,适用于视觉反馈控制和在线手术视频校准。

Details

Motivation: 解决机器人辅助微创手术中,由于手术器械部分可见性和特殊关节设计导致基于视觉的标记物校准方法在特征检测不可靠、渲染方法计算成本高且收敛不佳的问题。

Result: 在合成和真实数据集上的大量实验表明,该方法在准确性和运行时间上均显著优于先前方法。

Insight: 创新点在于将进化优化策略(CMA-ES)与批量渲染相结合,实现高效并行评估,提高了跟踪的鲁棒性和实时性,并可灵活适应不同手术器械配置。

Abstract: Accurate and efficient tracking of surgical instruments is fundamental for Robot-Assisted Minimally Invasive Surgery. Although vision-based robot pose estimation has enabled markerless calibration without tedious physical setups, reliable tool tracking for surgical robots still remains challenging due to partial visibility and specialized articulation design of surgical instruments. Previous works in the field are usually prone to unreliable feature detections under degraded visual quality and data scarcity, whereas rendering-based methods often struggle with computational costs and suboptimal convergence. In this work, we incorporate CMA-ES, an evolutionary optimization strategy, into a versatile tracking pipeline that jointly estimates surgical instrument pose and joint configurations. Using batch rendering to efficiently evaluate multiple pose candidates in parallel, the method significantly reduces inference time and improves convergence robustness. The proposed framework further generalizes to joint angle-free and bi-manual tracking settings, making it suitable for both vision feedback control and online surgery video calibration. Extensive experiments on synthetic and real-world datasets demonstrate that the proposed method significantly outperforms prior approaches in both accuracy and runtime.
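The sample-evaluate-recombine loop behind population-based pose search can be sketched as follows. This is a deliberately simplified evolution strategy with a fixed step-size decay, not CMA-ES (which also adapts a covariance matrix and step size) and not the authors' pipeline. The three-parameter pose, the hidden target, and the `loss` function standing in for a rendering loss are all assumptions for illustration.

```python
import random


def loss(pose, target=(1.0, -0.5, 0.3)):
    """Toy stand-in for a rendering loss: squared distance to a hidden target pose."""
    return sum((p - t) ** 2 for p, t in zip(pose, target))


def evolve(mean, sigma=0.5, lam=16, mu=4, generations=30, seed=0):
    """Simplified (mu, lambda) evolution strategy that tracks the best candidate seen."""
    rng = random.Random(seed)
    mean = list(mean)
    best_pose, best_loss = list(mean), loss(mean)
    for _ in range(generations):
        # Sample a batch of pose candidates around the current mean.
        batch = [[m + rng.gauss(0.0, sigma) for m in mean] for _ in range(lam)]
        # Score the whole batch; a real tracker would render these in parallel.
        scored = sorted(batch, key=loss)
        if loss(scored[0]) < best_loss:
            best_pose, best_loss = scored[0], loss(scored[0])
        # Recombine: move the mean to the average of the mu best candidates.
        elite = scored[:mu]
        mean = [sum(c[i] for c in elite) / mu for i in range(len(mean))]
        # Fixed decay; CMA-ES would adapt step size and covariance instead.
        sigma *= 0.9
    return best_pose, best_loss


best_pose, best_loss = evolve([0.0, 0.0, 0.0])
```

Because every candidate in a generation is scored independently, the evaluation step is the part that batch rendering can parallelize.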


[122] RADAR: Closed-Loop Robotic Data Generation via Semantic Planning and Autonomous Causal Environment Reset cs.RO | cs.AI | cs.CV

Yongzhong Wang, Keyu Zhu, Yong Zhong, Liqiong Wang, Jinyu Yang

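TL;DR: RADAR is a fully autonomous, closed-loop robotic data generation system that uses semantic planning and autonomous causal environment reset to collect physical interaction data at scale without human intervention. It divides the cognitive load into four modules: task generation based on a vision-language model, policy execution by a graph neural network, automated success evaluation, and autonomous environment reset driven by a finite state machine.

Details

Motivation: Collecting large-scale physical interaction data for robot learning is costly and scales poorly; the paper aims to break the bottleneck of human-in-the-loop collection paradigms.

Result: In simulation, RADAR achieves up to 90% success rates on complex long-horizon tasks, far above traditional baselines, whose performance is near zero. In real-world deployments, the system reliably executes diverse contact-rich skills (e.g., deformable object manipulation) via few-shot adaptation, without domain-specific fine-tuning.

Insight: The contributions include fully automating data generation into a closed loop; using a vision-language model for semantic task generation and evaluation; executing policies with a graph neural network; and an autonomous environment reset mechanism based on a finite state machine that plans forward-reverse causal sequences, improving robustness and scalability.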

Abstract: The acquisition of large-scale physical interaction data, a critical prerequisite for modern robot learning, is severely bottlenecked by the prohibitive cost and scalability limits of human-in-the-loop collection paradigms. To break this barrier, we introduce Robust Autonomous Data Acquisition for Robotics (RADAR), a fully autonomous, closed-loop data generation engine that completely removes human intervention from the collection cycle. RADAR elegantly divides the cognitive load into a four-module pipeline. Anchored by 2-5 3D human demonstrations as geometric priors, a Vision-Language Model first orchestrates scene-relevant task generation via precise semantic object grounding and skill retrieval. Next, a Graph Neural Network policy translates these subtasks into physical actions via in-context imitation learning. Following execution, the VLM performs automated success evaluation using a structured Visual Question Answering pipeline. Finally, to shatter the bottleneck of manual resets, a Finite State Machine orchestrates an autonomous environment reset and asymmetric data routing mechanism. Driven by simultaneous forward-reverse planning with a strict Last-In, First-Out causal sequence, the system seamlessly restores unstructured workspaces and robustly recovers from execution failures. This continuous brain-cerebellum synergy transforms data collection into a self-sustaining process. Extensive evaluations highlight RADAR’s exceptional versatility. In simulation, our framework achieves up to 90% success rates on complex, long-horizon tasks, effortlessly solving challenges where traditional baselines plummet to near-zero performance. In real-world deployments, the system reliably executes diverse, contact-rich skills (e.g., deformable object manipulation) via few-shot adaptation without domain-specific fine-tuning, providing a highly scalable paradigm for robotic data acquisition.
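The Last-In, First-Out reset idea can be sketched with a plain stack of executed actions: undoing them in reverse order respects dependencies such as "the lid sits on the cup". This is an illustrative sketch only; the `Action` record, its fields, and the `inverse` rule (swap source and destination) are hypothetical and are not taken from the RADAR implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    skill: str  # hypothetical skill name, e.g. "place"
    obj: str    # object being moved
    src: str    # where the object started
    dst: str    # where the object ended up


def inverse(action):
    """Undo an action by moving the object back from its destination to its source."""
    return Action(action.skill, action.obj, action.dst, action.src)


def reset_plan(executed):
    """Build the reset sequence: invert each executed action, last one first (LIFO)."""
    return [inverse(a) for a in reversed(executed)]


# Forward episode: the cup goes onto the tray, then the lid goes onto the cup.
forward = [
    Action("place", "cup", "table", "tray"),
    Action("place", "lid", "table", "cup"),
]

# The reset must remove the lid before the cup can be moved back.
plan = reset_plan(forward)
```

Reversing the order matters: undoing the first action before the last would try to move the cup while the lid is still on it.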


[123] CRAFT: A Tendon-Driven Hand with Hybrid Hard-Soft Compliance cs.RO | cs.AI | cs.CV

Leo Lin, Shivansh Patel, Jay Moon, Svetlana Lazebnik, Unnat Jain

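TL;DR: The paper introduces the CRAFT hand, a tendon-driven anthropomorphic hand with hybrid hard-soft compliance designed for contact-rich manipulation. The core design idea follows from how forces differ across the hand: soft material at the joints absorbs impacts while the links stay rigid, and rolling-contact joint surfaces keep flexion repeatable. Fifteen motors mounted on the fingers drive the tendons, giving a compact form factor and light fingers.

Details

Motivation: Conventional robotic hands in contact-rich manipulation are either too stiff or insufficiently compliant to handle fragile, low-friction objects, and face durability and cost constraints.

Result: In structural tests, the CRAFT hand improves strength and endurance while maintaining comparable repeatability. In teleoperation tests, it improves handling of fragile and low-friction items and covers all 33 grasps in the Feix taxonomy. The full design costs under $600.

Insight: The novelty is the hybrid-compliance idea of differentiating materials according to the force distribution across the hand (joints take impacts and are soft, links carry load and are rigid), together with the rolling-contact joint design. This improves durability and manipulation performance while significantly lowering cost, and the design is planned for open-source release.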

Abstract: We introduce CRAFT hand, a tendon-driven anthropomorphic hand with hybrid hard-soft compliance for contact-rich manipulation. The design is based on a simple idea: contact is not uniform across the hand. Impacts concentrate at joints, while links carry most of the load. CRAFT places soft material at joints and keeps links rigid, and uses rolling-contact joint surfaces to keep flexion on repeatable motion paths. Fifteen motors mounted on the fingers drive the hand through tendons, keeping the form factor compact and the fingers light. In structural tests, CRAFT improves strength and endurance while maintaining comparable repeatability. In teleoperation, CRAFT improves handling of fragile and low-friction items, and the hand covers 33/33 grasps in the Feix taxonomy. The full design costs under $600 and will be released open-source with vision-based teleoperation and simulation integration. Project page: http://craft-hand.github.io/


[124] SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics cs.RO | cs.CV

Mengzhen Liu, Enshen Zhou, Cheng Chi, Yi Han, Shanyu Rong

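TL;DR: This paper proposes SaPaVe, an end-to-end vision-language-action framework for robotics that unifies semantic-driven active perception with robust, viewpoint-invariant manipulation. It decouples camera control from manipulator actions and follows a bottom-up training strategy; together with the new ActiveViewPose-200K dataset and a 3D geometry-aware module, it outperforms existing models (such as GR00T N1 and π₀) in both simulation and real-world environments.

Details

Motivation: Existing methods struggle to unify semantic-driven active perception with robust, viewpoint-invariant execution, which limits robots' ability to interact with complex scenes. This paper aims to address that challenge.

Result: In extensive experiments in simulation and real-world environments, SaPaVe outperforms recent vision-language-action models (such as GR00T N1 and π₀), achieving up to 31.25% higher success rates on real-world tasks.

Insight: The main contributions are: 1) decoupling camera actions from manipulation actions rather than placing them in a shared action space; 2) a bottom-up, two-stage training strategy; 3) ActiveViewPose-200K, a large-scale dataset for learning semantic camera movement; 4) a 3D geometry-aware module that improves execution robustness; 5) ActiveManip-Bench, the first benchmark for active manipulation beyond fixed-view settings. The core insight is that tightly coupled perception and execution, when trained with decoupled yet coordinated strategies, enable efficient and generalizable active manipulation.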

Abstract: Active perception and manipulation are crucial for robots to interact with complex scenes. Existing methods struggle to unify semantic-driven active perception with robust, viewpoint-invariant execution. We propose SaPaVe, an end-to-end framework that jointly learns these capabilities in a data-efficient manner. Our approach decouples camera and manipulation actions rather than placing them in a shared action space, and follows a bottom-up training strategy: we first train semantic camera control on a large-scale dataset, then jointly optimize both action types using hybrid data. To support this framework, we introduce ActiveViewPose-200K, a dataset of 200k image-language-camera movement pairs for semantic camera movement learning, and a 3D geometry-aware module that improves execution robustness under dynamic viewpoints. We also present ActiveManip-Bench, the first benchmark for evaluating active manipulation beyond fixed-view settings. Extensive experiments in both simulation and real-world environments show that SaPaVe outperforms recent vision-language-action models such as GR00T N1 and π₀, achieving up to 31.25% higher success rates in real-world tasks. These results show that tightly coupled perception and execution, when trained with decoupled yet coordinated strategies, enable efficient and generalizable active manipulation. Project page: https://lmzpai.github.io/SaPaVe


cs.MA

[125] Enhancing Value Alignment of LLMs with Multi-agent system and Combinatorial Fusion cs.MA | cs.CL

Yuanhong Wu, Djallel Bouneffouf, D. Frank Hsu

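TL;DR: This paper proposes VAS-CFA, a framework that strengthens the alignment of large language models with human values using a multi-agent system and combinatorial fusion analysis. It instantiates multiple moral agents representing different normative perspectives and aggregates their outputs with rank- and score-based combinatorial fusion to better reflect the diversity of human values.

Details

Motivation: Existing methods such as RLHF typically rely on a single evaluator or narrowly defined reward signals, which makes it hard to capture ethical pluralism and limits how well models can be aligned with human values.

Result: Empirical evaluation shows that VAS-CFA outperforms single-agent baselines and prior aggregation approaches on standard metrics, demonstrating that multi-agent fusion is an effective and robust way to improve LLM value alignment.

Insight: The novelty is combining a multi-agent system with combinatorial fusion analysis, using the cognitive diversity among agents to mitigate conflicts and redundancies and so integrate plural ethical perspectives more fully, which provides a new mechanism for value alignment.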

Abstract: Aligning large language models (LLMs) with human values is a central challenge for ensuring trustworthy and safe deployment. While existing methods such as Reinforcement Learning from Human Feedback (RLHF) and its variants have improved alignment, they often rely on a single evaluator or narrowly defined reward signals, limiting their ability to capture ethical pluralism. In this work, we propose the Value Alignment System using Combinatorial Fusion Analysis (VAS-CFA), a framework that operationalizes multi-agent fusion alignment. It instantiates multiple moral agents, each fine-tuned to represent a distinct normative perspective, and fuses their outputs using CFA with both rank- and score-based aggregation. This design leverages cognitive diversity, between agents, to mitigate conflicts and redundancies across multiple agents, producing responses that better reflect human values. Empirical evaluation demonstrates that VAS-CFA outperforms both single agent baselines and prior aggregation approaches on standard metrics, showing that multi-agent fusion provides a robust and effective mechanism for advancing value alignment in LLMs.
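The difference between score-based and rank-based aggregation can be sketched with a small example. This is a simplified, unweighted illustration of the two combination styles and should not be read as the VAS-CFA implementation: the agent names, the candidate responses, the min-max normalization, and the plain averaging are all assumptions.

```python
def normalize(scores):
    """Min-max normalize one agent's scores to [0, 1] so agents are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def to_ranks(scores):
    """Convert one agent's scores to ranks, where rank 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {k: i + 1 for i, k in enumerate(ordered)}


def score_fusion(agents):
    """Average the normalized scores across agents (higher is better)."""
    normed = [normalize(a) for a in agents]
    return {k: sum(n[k] for n in normed) / len(normed) for k in normed[0]}


def rank_fusion(agents):
    """Average the ranks across agents (lower is better)."""
    ranked = [to_ranks(a) for a in agents]
    return {k: sum(r[k] for r in ranked) / len(ranked) for k in ranked[0]}


# Hypothetical scores from three moral agents for candidate responses A, B, C.
agents = [
    {"A": 0.9, "B": 0.6, "C": 0.1},
    {"A": 0.2, "B": 0.8, "C": 0.5},
    {"A": 0.5, "B": 0.6, "C": 0.3},
]

fused_scores = score_fusion(agents)
fused_ranks = rank_fusion(agents)
best_by_score = max(fused_scores, key=fused_scores.get)
best_by_rank = min(fused_ranks, key=fused_ranks.get)
```

Here both fusions pick the same response, but they can disagree: score fusion is sensitive to how strongly each agent prefers a candidate, while rank fusion only uses the ordering.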