cs.CL

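[1] Think Multilingual, Not Harder: A Data-Efficient Framework for Teaching Reasoning Models to Code-Switch cs.CL

Eleanor M. Lin, David Jurgens

TL;DR: This paper proposes a data-efficient framework for identifying beneficial code-switched reasoning behaviors in large language models and for teaching models to code-switch more effectively to improve reasoning.

Motivation: Prior work typically treats code-switching as an error or controls it only through prompt engineering, and is limited to specific languages, tasks, and models. This paper aims to fill that gap by systematically characterizing and steering code-switching in reasoning models.

Result: The framework significantly increases beneficial code-switched reasoning behaviors. Fine-tuning on other tasks, such as machine translation, can also modify code-switching behavior indirectly, indicating that data-efficient interventions are effective.

Insight: The novelty is the first linguistically and behaviorally motivated approach, combining a systematically analyzed dataset with fine-tuning interventions. It shows that code-switching can serve as a beneficial reasoning strategy and can be shaped indirectly through cross-task fine-tuning.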

Abstract: Recent developments in reasoning capabilities have enabled large language models to solve increasingly complex mathematical, symbolic, and logical tasks. Interestingly, while reasoning models are often trained to generate monolingual text, these models have also been observed to code-switch (i.e., mix languages). Prior works have either viewed code-switching as an undesirable error, attempted to control code-switching through modifications to input prompts or the output decoding process, or focused on narrow subsets of languages, domains, tasks, and models. We address these gaps by introducing the first linguistically and behaviorally motivated fine-tuning framework for identifying beneficial code-switched reasoning behaviors in large language models and teaching these models to code-switch more effectively for reasoning. First, we create and systematically analyze a dataset of reasoning traces from diverse models, languages, tasks, and domains to understand the types of code-switching behaviors found in existing reasoning models. Then, we develop fine-tuning interventions that teach reasoning models to code-switch based on our observations of helpful behaviors in existing models. We find that our framework can significantly increase beneficial code-switched reasoning behaviors in a data-efficient manner. Interestingly, we also find that code-switching behaviors in reasoning models can be modified by fine-tuning for tasks that do not directly demonstrate code-switching in reasoning (e.g., machine translation). Our work suggests that data-efficient interventions can instill helpful forms of code-switching behavior in reasoning models.


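[2] “Excuse me, may I say something…” CoLabScience, A Proactive AI Assistant for Biomedical Discovery and LLM-Expert Collaborations cs.CL | cs.AI | cs.HC | cs.LG

Yang Wu, Jinhong Yu, Jingwei Xiong, Zhimin Tao, Xiaozhong Liu

TL;DR: This paper introduces CoLabScience, a proactive LLM assistant that uses the PULI framework to deliver timely, context-aware interventions in biomedical collaboration, strengthening cooperation between AI systems and human experts.

Motivation: Existing LLMs are reactive, which limits their effectiveness in scientific collaboration that demands foresight and autonomous engagement. An assistant that can intervene proactively is therefore needed.

Result: On BSDD, a benchmark built from PubMed articles, PULI significantly outperforms existing baselines in both intervention precision and collaborative task utility.

Insight: The novelty is the PULI framework, which is trained with a reinforcement learning objective and uses the project proposal together with long- and short-term conversational memory to decide when and how to intervene in streaming scientific discussions, advancing proactive LLMs for scientific discovery.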

Abstract: The integration of Large Language Models (LLMs) into scientific workflows presents exciting opportunities to accelerate biomedical discovery. However, the reactive nature of LLMs, which respond only when prompted, limits their effectiveness in collaborative settings that demand foresight and autonomous engagement. In this study, we introduce CoLabScience, a proactive LLM assistant designed to enhance biomedical collaboration between AI systems and human experts through timely, context-aware interventions. At the core of our method is PULI (Positive-Unlabeled Learning-to-Intervene), a novel framework trained with a reinforcement learning objective to determine when and how to intervene in streaming scientific discussions, by leveraging the team’s project proposal and long- and short-term conversational memory. To support this work, we introduce BSDD (Biomedical Streaming Dialogue Dataset), a new benchmark of simulated research discussion dialogues with intervention points derived from PubMed articles. Experimental results show that PULI significantly outperforms existing baselines in both intervention precision and collaborative task utility, highlighting the potential of proactive LLMs as intelligent scientific assistants.


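[3] GroupDPO: Memory efficient Group-wise Direct Preference Optimization cs.CL

Jixuan Leng, Si Si, Hsiang-Fu Yu, Vinod Raman, Inderjit S. Dhillon

TL;DR: This paper proposes GroupDPO, a memory-efficient group-wise direct preference optimization method that decouples samples during backpropagation to cut memory overhead, enabling training with larger group sizes. In both offline and online alignment settings, using multiple responses outperforms single-pair training.

Motivation: Existing preference optimization methods usually train on a single positive-negative pair per prompt, ignoring the extra supervision in preference datasets that typically contain multiple candidate responses. Existing group-wise methods are hard to scale because of their memory overhead.

Result: In both offline and online alignment settings, training on multiple responses consistently outperforms single-pair training, and adding a negative log-likelihood (NLL) term on positive responses is critical for both performance gains and training stability.

Insight: The novelty is memory-efficient group-wise optimization through decoupled per-sample backpropagation, which makes larger group sizes feasible. Viewed objectively, the core contribution is showing the general advantage of multi-response supervision and the key role of the NLL term.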

Abstract: Preference optimization is widely used to align Large Language Models (LLMs) with preference feedback. However, most existing methods train on a single positive-negative pair per prompt, discarding additional supervision available in preference datasets that typically contain multiple candidate responses. Motivated by this limitation, recent work explores group-wise preference optimization, which jointly contrasts multiple responses for the same prompt, but its empirical behavior and scalability remain underexplored due to the memory overhead of group-coupled objectives. In this work, we introduce a memory-efficient group-wise preference optimization algorithm that preserves gradients while decoupling samples during backpropagation, substantially reducing peak memory usage, which enables scalable training with larger group sizes. Across both offline and online alignment settings, we show that leveraging multiple responses consistently outperforms single-pair training. Furthermore, incorporating a negative log-likelihood (NLL) term on positive responses is critical for both performance gains and training stability.
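
To make the idea of a group-coupled objective with an NLL term concrete, here is a minimal sketch. The abstract does not state the exact loss, so the softmax-style contrast below is an assumption, as are the function name, `beta`, and `nll_weight`. It also does not reproduce the paper's memory-saving decoupled backpropagation; it only computes a loss value from precomputed sequence log-probabilities.

```python
import math

def group_preference_loss(policy_logps, ref_logps, positive_idx,
                          beta=0.1, nll_weight=1.0):
    """Assumed group-wise loss: softmax contrast over the group plus NLL.

    policy_logps / ref_logps hold the summed log-probability of each response
    in the group under the policy and under a frozen reference model.
    """
    # DPO-style implicit reward for every response in the group.
    rewards = [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]
    # Numerically stable log-sum-exp over the group.
    m = max(rewards)
    log_z = m + math.log(sum(math.exp(x - m) for x in rewards))
    # Negative log-softmax of the positive response against the whole group.
    contrast = -(rewards[positive_idx] - log_z)
    # NLL term on the positive response.
    nll = -policy_logps[positive_idx]
    return contrast + nll_weight * nll
```

With a group of two and `nll_weight=0`, this expression reduces to the familiar pairwise form `-log(sigmoid(r_pos - r_neg))`, which is one reason a softmax contrast is a natural guess for the group-wise case.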


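[4] HyperGVL: Benchmarking and Improving Large Vision-Language Models in Hypergraph Understanding and Reasoning cs.CL | cs.CV

Yanbin Wei, Chun Kang, Siwei Li, Haoxuan Che, Yang Chen

TL;DR: This paper presents HyperGVL, the first benchmark for evaluating large vision-language models (LVLMs) on hypergraph understanding and reasoning, covering 12 tasks and 84,000 vision-language QA samples. It also introduces WiseHyGR, a learnable adaptive representation router, to improve model performance.

Motivation: The capabilities of LVLMs on hypergraphs remain unclear and no dedicated benchmark delineates their limits, even though hypergraphs have important real-world applications in areas such as life sciences and social networks.

Result: Twelve advanced LVLMs are evaluated comprehensively on hypergraphs that include multiscale synthetic structures and real-world citation and protein networks. The proposed WiseHyGR improves LVLM performance on hypergraphs by learning adaptive representations.

Insight: The novelty lies in building the first benchmark for hypergraph understanding and reasoning and in designing a generalizable representation router that dynamically optimizes textual and visual hypergraph representations, offering a new way to connect hypergraphs with LVLMs.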

Abstract: Large Vision-Language Models (LVLMs) consistently require new arenas to guide their expanding boundaries, yet their capabilities with hypergraphs remain unexplored. In the real world, hypergraphs have significant practical applications in areas such as life sciences and social communities. Recent advancements in LVLMs have shown promise in understanding complex topologies, yet there remains a lack of a benchmark to delineate the capabilities of LVLMs with hypergraphs, leaving the boundaries of their abilities unclear. To fill this gap, in this paper, we introduce HyperGVL, the first benchmark to evaluate the proficiency of LVLMs in hypergraph understanding and reasoning. HyperGVL provides a comprehensive assessment of 12 advanced LVLMs across 84,000 vision-language question-answering (QA) samples spanning 12 tasks, ranging from basic component counting to complex NP-hard problem reasoning. The involved hypergraphs contain multiscale synthetic structures and real-world citation and protein networks. Moreover, we examine the effects of 12 textual and visual hypergraph representations and introduce a generalizable router, WiseHyGR, that improves LVLMs on hypergraphs by learning adaptive representations. We believe that this work is a step forward in connecting hypergraphs with LVLMs.


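[5] C-Mining: Unsupervised Discovery of Seeds for Cultural Data Synthesis via Geometric Misalignment cs.CL

Pufan Zeng, Yilun Liu, Mingchen Dai, Mengyao Piao, Chunguang Zhao

TL;DR: This paper proposes C-Mining, an unsupervised framework that automatically discovers cultural seeds (Culture Points) from raw multilingual corpora to support the synthesis of cultural-alignment data for LLMs. It uses the cross-lingual geometric misalignment of cultural concepts in pre-trained embedding spaces as a quantifiable discovery signal, extracting high-quality seeds by identifying regions with pronounced linguistic exclusivity and geometric isolation. It needs no human or LLM supervision and cuts preparation costs by more than 150-fold.

Motivation: When generating culturally aligned synthetic data for LLMs, seed selection lacks quantifiable standards. It relies on unscalable manual curation or bias-prone LLM extraction, treating cultural specificity as an abstract concept rather than a measurable signal. This paper addresses that “quantification gap”.

Result: On the CulturalBench-Hard benchmark, instruction-tuning datasets synthesized from the extracted seeds significantly improve cultural understanding and reasoning, yielding a +6.03 point improvement and surpassing state-of-the-art baselines.

Insight: The core novelty is recasting cultural seed discovery from a subjective selection process into a computable data mining problem, with the insight that geometric misalignment in cross-lingual embedding spaces can serve as the quantitative signal. This yields a quantifiable, supervision-free solution for high-quality, scalable cultural data synthesis.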

Abstract: Achieving cultural alignment in Large Language Models (LLMs) increasingly depends on synthetic data generation. For such synthesis, the most vital initial step is seed curation; however, current methods lack quantifiable standards for selecting these seeds. Existing approaches rely on unscalable manual curation or bias-prone LLM extraction, treating cultural specificity as an abstract concept rather than a measurable signal. In this paper, we address this “quantification gap” by proposing C-Mining, an unsupervised framework that transforms the discovery of cultural seeds from a subjective selection process into a computable data mining formulation. Our approach exploits a novel geometric insight, leveraging the cross-lingual misalignment of cultural concepts within pre-trained embedding spaces as a quantifiable discovery signal. By systematically identifying these regions characterized by pronounced linguistic exclusivity and geometric isolation, while actively filtering out noise, C-Mining automatically extracts high-fidelity Culture Points (CPs) from raw multilingual corpora without reliance on human or LLM supervision, reducing preparation costs by more than 150-fold. We further leverage the mined knowledge to steer the synthesis of diverse instruction-tuning datasets. Extensive experiments demonstrate that this seed-centric approach significantly enhances cultural understanding and reasoning capabilities, achieving a +6.03 point improvement on CulturalBench-Hard and surpassing state-of-the-art baselines, providing a scalable, quantifiable solution for high-quality cultural data synthesis.


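[6] Preference Estimation via Opponent Modeling in Multi-Agent Negotiation cs.CL

Yuta Konishi, Kento Yamamoto, Eisuke Sonomoto, Rikuho Takeda, Ryo Furukawa

TL;DR: This paper proposes a novel preference estimation method for opponent modeling in multi-agent negotiation. It integrates qualitative natural-language information extracted by LLMs into a structured Bayesian opponent modeling framework, converting semantic cues into probabilistic form for dynamic belief tracking.

Motivation: Conventional numerical-only opponent modeling cannot capture the qualitative information embedded in natural-language interactions, which makes preference estimation unstable and incomplete.

Result: Experiments on a multi-party benchmark show that, by combining probabilistic reasoning with natural language understanding, the framework improves both the full agreement rate and preference estimation accuracy.

Insight: The novelty is combining LLM semantic understanding with a Bayesian probabilistic reasoning framework, quantitatively integrating qualitative linguistic information into dynamic opponent modeling to estimate opponent preferences more stably and completely.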

Abstract: Automated negotiation in complex, multi-party and multi-issue settings critically depends on accurate opponent modeling. However, conventional numerical-only approaches fail to capture the qualitative information embedded in natural language interactions, resulting in unstable and incomplete preference estimation. Although Large Language Models (LLMs) enable rich semantic understanding of utterances, it remains challenging to quantitatively incorporate such information into a consistent opponent model. To tackle this issue, we propose a novel preference estimation method integrating natural language information into a structured Bayesian opponent modeling framework. Our approach leverages LLMs to extract qualitative cues from utterances and converts them into probabilistic formats for dynamic belief tracking. Experimental results on a multi-party benchmark demonstrate that our framework improves the full agreement rate and preference estimation accuracy by integrating probabilistic reasoning with natural language understanding.
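
The belief-tracking step described above can be illustrated with a plain Bayes update. This is only a sketch: the hypotheses, the example utterance, and the likelihood values standing in for LLM-extracted cues are all invented for illustration, and the paper's actual model structure is not given in the abstract.

```python
def update_belief(prior, likelihoods):
    """Bayes update: posterior(h) is proportional to prior(h) * P(cue | h)."""
    unnorm = {h: prior[h] * likelihoods.get(h, 0.0) for h in prior}
    total = sum(unnorm.values())
    if total == 0:
        return dict(prior)  # cue rules out everything: keep the prior
    return {h: v / total for h, v in unnorm.items()}

# Hypothetical hypotheses about which issue the opponent values most.
belief = {"price": 1 / 3, "delivery": 1 / 3, "warranty": 1 / 3}
# Assumed LLM-derived likelihoods for the utterance
# "We really cannot move on delivery dates."
cue = {"price": 0.2, "delivery": 0.7, "warranty": 0.1}
belief = update_belief(belief, cue)
```

Repeating the update after each utterance gives the dynamic tracking the abstract refers to, with the LLM's role confined to turning text into the `cue` likelihoods.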


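[7] Improving Reasoning Capabilities in Small Models through Mixture-of-Layers Distillation with Stepwise Attention on Key Information cs.CL

Yao Chen, Jiawei Sheng, Wenyuan Zhang, Tingwen Liu

TL;DR: This paper proposes a novel chain-of-thought (CoT) knowledge distillation framework that transfers the teacher model’s progressive attention to key information during reasoning to the student model, and adds a Mixture of Layers module for dynamic layer alignment, improving the reasoning ability of small models.

Motivation: Existing CoT distillation methods focus on transferring teacher-generated rationales but do not adequately explore the teacher’s dynamic attention to key information during reasoning, a pattern that is essential for drawing conclusions.

Result: The method achieves consistent performance improvements across multiple mathematical and commonsense reasoning datasets.

Insight: The core novelty is the first use of stepwise attention in CoT distillation, together with a Mixture of Layers module that dynamically and adaptively aligns different teacher and student layers, giving small models structured guidance for progressively focusing on key information.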

Abstract: The significant computational demands of large language models have increased interest in distilling reasoning abilities into smaller models via Chain-of-Thought (CoT) distillation. Current CoT distillation methods mainly focus on transferring teacher-generated rationales for complex reasoning to student models. However, they do not adequately explore teachers’ dynamic attention toward critical information during reasoning. We find that language models exhibit progressive attention shifts towards key information during reasoning, which implies essential clues for drawing conclusions. Building on this observation and analysis, we introduce a novel CoT distillation framework that transfers the teacher’s stepwise attention on key information to the student model. This establishes structured guidance for the student’s progressive concentration on key information during reasoning. More importantly, we develop a Mixture of Layers module enabling dynamic alignment that adapts to different layers between the teacher and student. Our method achieves consistent performance improvements across multiple mathematical and commonsense reasoning datasets. To our knowledge, it is the first method to leverage stepwise attention within CoT distillation to improve small model reasoning.


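[8] GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows cs.CL | cs.AI

Jize Wang, Xuanxuan Liu, Yining Li, Songyang Zhang, Yijun Wang

TL;DR: This paper presents GTA-2, a benchmark for evaluating general tool agents across the full range from atomic tool use to open-ended workflows. Built on real user queries, deployed tools, and multimodal contexts, it has two levels: GTA-Atomic, which evaluates precision on short-horizon, closed-ended tasks, and GTA-Workflow, which evaluates long-horizon, open-ended, end-to-end workflows. Experiments show that current frontier models struggle on atomic tasks (below 50%) and largely fail on workflow tasks (the best models reach only a 14.39% success rate), and they underline the importance of execution harness design.

Motivation: Current tool-use benchmarks are misaligned with real-world requirements, relying on AI-generated queries, dummy tools, and limited system-level coordination. Moving general-purpose agents from executing simple instructions to completing complex, real-world productivity workflows requires a new, more realistic evaluation framework.

Result: On GTA-2, frontier models score below 50% on atomic tasks, and on open-ended workflow tasks the top models reach only 14.39% success. Further analysis shows that checkpoint-guided feedback improves performance and that advanced execution frameworks such as Manus and OpenClaw substantially improve workflow completion.

Insight: The main novelty is a hierarchical benchmark (GTA-2) grounded in real-world authenticity, together with a recursive checkpoint-based evaluation mechanism that decomposes open-ended objectives into verifiable sub-goals, enabling unified evaluation of model capabilities and agent execution frameworks. Viewed objectively, it widens the focus of evaluation from model capability alone to the whole agent system, including the execution harness, and offers evaluation guidance and design insight for building reliable personal and professional assistants.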

Abstract: The development of general-purpose agents requires a shift from executing simple instructions to completing complex, real-world productivity workflows. However, current tool-use benchmarks remain misaligned with real-world requirements, relying on AI-generated queries, dummy tools, and limited system-level coordination. To address this, we propose GTA-2, a hierarchical benchmark for General Tool Agents (GTA) spanning atomic tool use and open-ended workflows. Built on real-world authenticity, it leverages real user queries, deployed tools, and multimodal contexts. (i) GTA-Atomic, inherited from our prior GTA benchmark, evaluates short-horizon, closed-ended tool-use precision. (ii) GTA-Workflow introduces long-horizon, open-ended tasks for realistic end-to-end completion. To evaluate open-ended deliverables, we propose a recursive checkpoint-based evaluation mechanism that decomposes objectives into verifiable sub-goals, enabling unified evaluation of both model capabilities and agent execution frameworks (i.e., execution harnesses). Experiments reveal a pronounced capability cliff: while frontier models already struggle on atomic tasks (below 50%), they largely fail on workflows, with top models achieving only 14.39% success. Further analysis shows that checkpoint-guided feedback improves performance, while advanced frameworks such as Manus and OpenClaw substantially enhance workflow completion, highlighting the importance of execution harness design beyond the underlying model capacity. These findings provide guidance for developing reliable personal and professional assistants. Dataset and code will be available at https://github.com/open-compass/GTA.
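
A recursive checkpoint evaluation of the kind described above can be pictured as a tree of sub-goals whose leaves are verifiable checks. The sketch below is an assumption throughout: the `Checkpoint` class, the mean aggregation rule, and the toy report deliverable are hypothetical and are not taken from the GTA-2 code.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Checkpoint:
    name: str
    check: Optional[Callable[[dict], bool]] = None  # verifier for a leaf
    children: List["Checkpoint"] = field(default_factory=list)

def score(cp, deliverable):
    """Leaf: 1.0 if its check passes, else 0.0. Inner node: mean of children."""
    if not cp.children:
        passed = cp.check is not None and cp.check(deliverable)
        return 1.0 if passed else 0.0
    return sum(score(c, deliverable) for c in cp.children) / len(cp.children)

# Hypothetical open-ended goal "write a report", split into sub-goals.
tree = Checkpoint("report", children=[
    Checkpoint("has_title", check=lambda d: bool(d.get("title"))),
    Checkpoint("charts", children=[
        Checkpoint("has_chart", check=lambda d: d.get("charts", 0) >= 1),
        Checkpoint("has_caption", check=lambda d: d.get("captions", 0) >= 1),
    ]),
])
result = score(tree, {"title": "Q3", "charts": 2, "captions": 0})
```

Because the score depends only on the deliverable, the same tree can grade outputs from different models or different execution harnesses, which is the unification the abstract points to.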


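[9] TTL: Test-time Textual Learning for OOD Detection with Pretrained Vision-Language Models cs.CL | cs.CV

Jinlun Ye, Jiang Liao, Runhe Lai, Xinhua Lu, Jiaxin Zhuang

TL;DR: This paper proposes Test-time Textual Learning (TTL), a framework that improves out-of-distribution (OOD) detection with vision-language models such as CLIP. It dynamically learns OOD textual semantics from unlabeled test streams without relying on fixed external OOD labels, and introduces an OOD knowledge purification strategy and an OOD Textual Knowledge Bank to suppress noise and stabilize calibration.

Motivation: Existing CLIP-based test-time adaptation methods depend on a finite, fixed set of external OOD labels, which cannot represent the diverse and evolving OOD semantics in test streams and therefore limits detection performance.

Result: Extensive experiments on two standard benchmarks with nine OOD datasets show that TTL consistently achieves state-of-the-art (SOTA) performance.

Insight: The novelty is dynamically learning OOD textual semantics to cope with an open-ended semantic space: learnable prompts are updated with pseudo-labeled samples, an OOD knowledge purification strategy selects reliable samples, and an OOD Textual Knowledge Bank provides stable calibration, making test-time OOD detection more robust.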

Abstract: Vision-language models (VLMs) such as CLIP exhibit strong out-of-distribution (OOD) detection capabilities by aligning visual and textual representations. Recent CLIP-based test-time adaptation methods further improve detection performance by incorporating external OOD labels. However, such labels are finite and fixed, while the real OOD semantic space is inherently open-ended. Consequently, fixed labels fail to represent the diverse and evolving OOD semantics encountered in test streams. To address this limitation, we introduce Test-time Textual Learning (TTL), a framework that dynamically learns OOD textual semantics from unlabeled test streams, without relying on external OOD labels. TTL updates learnable prompts using pseudo-labeled test samples to capture emerging OOD knowledge. To suppress noise introduced by pseudo-labels, we introduce an OOD knowledge purification strategy that selects only reliable OOD samples for adaptation. In addition, TTL maintains an OOD Textual Knowledge Bank that stores high-quality textual features, providing stable score calibration across batches. Extensive experiments on two standard benchmarks with nine OOD datasets demonstrate that TTL consistently achieves state-of-the-art performance, highlighting the value of textual adaptation for robust test-time OOD detection. Our code is available at https://github.com/figec/TTL.


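[10] Skill-RAG: Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing cs.CL

Kai Wei, Raymond Li, Xi Zhu, Zhaoqian Xue, Jiaojiao Han

TL;DR: Skill-RAG is a failure-aware retrieval-augmented generation framework. It uses a lightweight hidden-state prober and a prompt-based skill router to diagnose and correct misalignment between the query and the evidence space, addressing persistent retrieval failures in conventional RAG systems that stem from misalignment rather than missing evidence.

Motivation: Existing adaptive retrieval methods treat post-retrieval failure as a signal to retry rather than an opportunity to diagnose, leaving the structural causes of query-evidence misalignment unaddressed and producing many persistent retrieval failures.

Result: Experiments on multiple open-domain QA and complex reasoning benchmarks show that Skill-RAG substantially improves accuracy on hard cases that persist after multi-turn retrieval, with particularly strong gains on out-of-distribution datasets.

Insight: The main novelty is treating retrieval failure as something that can be diagnosed and categorized, with a framework that combines hidden-state probing and skill routing and corrects misalignment through four skills: query rewriting, question decomposition, evidence focusing, and an exit skill. Analysis shows that these skills occupy structured, separable regions of the failure state space, supporting the view that query-evidence misalignment is a typed rather than monolithic phenomenon.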

Abstract: Retrieval-Augmented Generation (RAG) has emerged as a foundational paradigm for grounding large language models in external knowledge. While adaptive retrieval mechanisms have improved retrieval efficiency, existing approaches treat post-retrieval failure as a signal to retry rather than to diagnose – leaving the structural causes of query-evidence misalignment unaddressed. We observe that a significant portion of persistent retrieval failures stem not from the absence of relevant evidence but from an alignment gap between the query and the evidence space. We propose Skill-RAG, a failure-aware RAG framework that couples a lightweight hidden-state prober with a prompt-based skill router. The prober gates retrieval at two pipeline stages; upon detecting a failure state, the skill router diagnoses the underlying cause and selects among four retrieval skills – query rewriting, question decomposition, evidence focusing, and an exit skill for truly irreducible cases – to correct misalignment before the next generation attempt. Experiments across multiple open-domain QA and complex reasoning benchmarks show that Skill-RAG substantially improves accuracy on hard cases persisting after multi-turn retrieval, with particularly strong gains on out-of-distribution datasets. Representation-space analyses further reveal that the proposed skills occupy structured, separable regions of the failure state space, supporting the view that query-evidence misalignment is a typed rather than monolithic phenomenon.
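
The diagnose-then-retry control flow described above might look roughly like the following sketch. Every callable is a hypothetical stand-in: `is_failure` plays the role of the hidden-state prober and `route` the prompt-based skill router. For brevity the sketch gates at a single point after generation, whereas the abstract says the prober gates at two pipeline stages.

```python
def skill_rag_answer(query, retrieve, generate, is_failure, route,
                     apply_skill, max_rounds=3):
    """Retry loop that diagnoses a failure before retrieving again."""
    evidence = retrieve(query)
    for _ in range(max_rounds):
        draft = generate(query, evidence)
        if not is_failure(query, evidence, draft):
            return draft  # no failure state detected: accept the answer
        skill = route(query, evidence, draft)
        if skill == "exit":
            return None  # judged irreducible: stop instead of retrying
        # Remaining skills: query_rewriting, question_decomposition,
        # evidence_focusing. Each may change the query and the evidence.
        query, evidence = apply_skill(skill, query, evidence, retrieve)
    return None  # still failing after max_rounds
```

The point of the structure is that a failed attempt changes what is retrieved next, according to the diagnosed cause, rather than repeating the same retrieval.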


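[11] Qwen3.5-Omni Technical Report cs.CL | eess.AS

Qwen Team

TL;DR: This paper presents Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. It is a multimodal model with hundreds of billions of parameters and a 256k context length. Trained on massive heterogeneous text-vision pairs and over 100 million hours of audio-visual content, it shows strong omni-modality capabilities. Architecturally it uses a Hybrid Attention Mixture-of-Experts (MoE) framework that supports efficient long-sequence inference, and it introduces ARIA to improve the stability and prosody of streaming speech synthesis. The model also extends multilingual understanding and speech generation, and shows superior audio-visual grounding, including structured captions with precise temporal synchronization and a new capability of coding directly from audio-visual instructions.

Motivation: To advance multimodal large models by building a general-purpose model that processes and understands text, vision, audio, and other modalities in a unified way while supporting long context, efficient inference, and natural interaction.

Result: Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro on key audio tasks and matching it in comprehensive audio-visual understanding.

Insight: The main novelties are: 1) a Hybrid Attention Mixture-of-Experts framework for efficient long-sequence multimodal inference; 2) ARIA, which dynamically aligns text and speech units to address instability and unnaturalness in streaming speech synthesis; 3) new audio-visual grounding capabilities, such as temporally synchronized structured captions and coding from audio-visual instructions; 4) extended multilingual, emotionally expressive speech generation.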

Abstract: In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720P video (at 1 FPS). To address the inherent instability and unnaturalness in streaming speech synthesis, often caused by encoding efficiency discrepancies between text and speech tokenizers, we introduce ARIA. ARIA dynamically aligns text and speech units, significantly enhancing the stability and prosody of conversational speech with minimal latency impact. Furthermore, Qwen3.5-Omni expands linguistic boundaries, supporting multilingual understanding and speech generation across 10 languages with human-like emotional nuance. Finally, Qwen3.5-Omni exhibits superior audio-visual grounding capabilities, generating script-level structured captions with precise temporal synchronization and automated scene segmentation. Remarkably, we observed the emergence of a new capability in omnimodal models: directly performing coding based on audio-visual instructions, which we call Audio-Visual Vibe Coding.


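[12] CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution cs.CL

Shidong Yang, Ziyu Ma, Tongwen Huang, Yiming Hu, Yong Wang

TL;DR: This paper proposes CoEvolve, a framework that trains LLM agents through agent-data mutual evolution. It extracts feedback signals such as forgetting and uncertainty from the agent’s rollout trajectories, identifies failure-prone interaction patterns, and uses these signals to guide LLM-based task synthesis, dynamically updating the training data distribution so that the agent and its data adapt jointly.

Motivation: Reinforcement learning for LLM agents is conventionally run on a static data distribution, which cannot adapt to the agent’s evolving behavior and covers complex environment interactions poorly, hurting training.

Result: On the AppWorld and BFCL benchmarks, experiments with Qwen2.5-7B, Qwen3-4B, and Qwen3-30B-A3B yield absolute gains of 19.43%, 15.58%, and 18.14%, respectively, over strong base models, a significant and consistent improvement.

Insight: The core novelty is a closed-loop training paradigm of agent-data mutual evolution, in which interaction-driven task synthesis dynamically adjusts the data distribution, removing the limits of static-data training and opening a new route to adaptive learning for LLM agents.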

Abstract: Reinforcement learning for LLM agents is typically conducted on a static data distribution, which fails to adapt to the agent’s evolving behavior and leads to poor coverage of complex environment interactions. To address these challenges, we propose CoEvolve, an agent-data mutual evolution framework that enables LLM agents to improve through closed-loop, interaction-driven training. Specifically, CoEvolve extracts feedback signals such as forgetting and uncertainty from rollout trajectories to identify failure-prone interaction patterns, and utilizes them to guide LLM-based task synthesis. The synthesized tasks are validated through environment interaction and utilized to update the data distribution, enabling joint adaptation of the agent and its data. Extensive experiments on AppWorld and BFCL across Qwen2.5-7B, Qwen3-4B, and Qwen3-30B-A3B demonstrate consistent and significant improvements over strong base models, yielding absolute gains of 19.43%, 15.58%, and 18.14%, respectively.


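[13] Disentangling Mathematical Reasoning in LLMs: A Methodological Investigation of Internal Mechanisms cs.CL

Tanja Baeumel, Josef van Genabith, Simon Ostermann

TL;DR: By analyzing the internal mechanisms of large language models on arithmetic tasks, this paper reveals a layered processing pattern for reasoning tasks: early layers recognize the task, while late layers produce the correct result. Proficient models show a clear division of labor between attention and MLP modules, which less capable models lack.

Motivation: To investigate how LLMs internally handle reasoning-intensive tasks, arithmetic operations in particular, and so advance understanding of model-internal processing mechanisms.

Result: Experiments show that models recognize arithmetic tasks in early layers but generate the correct result only in the final layers. Proficient models exhibit a division of labor in which attention modules propagate input information and MLP modules aggregate it; weaker models do not.

Insight: The novelty is using early decoding to trace how next-token predictions are constructed across layers, which reveals the functional division between modules and suggests that successful models may have reasoning capabilities beyond factual recall.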

Abstract: Large language models (LLMs) have demonstrated impressive capabilities, yet their internal mechanisms for handling reasoning-intensive tasks remain underexplored. To advance the understanding of model-internal processing mechanisms, we present an investigation of how LLMs perform arithmetic operations by examining internal mechanisms during task execution. Using early decoding, we trace how next-token predictions are constructed across layers. Our experiments reveal that while the models recognize arithmetic tasks early, correct result generation occurs only in the final layers. Notably, models proficient in arithmetic exhibit a clear division of labor between attention and MLP modules, where attention propagates input information and MLP modules aggregate it. This division is absent in less proficient models. Furthermore, successful models appear to process more challenging arithmetic tasks functionally, suggesting reasoning capabilities beyond factual recall.
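
Early decoding, as used above, can be sketched as projecting each layer's hidden state through the output (unembedding) matrix and reading off the top token. The toy vectors, the three-token vocabulary, and the function name below are invented for illustration; practical implementations usually also apply the model's final normalization layer before projecting, which this sketch omits.

```python
def early_decode(hidden_states, unembed, vocab):
    """Return the argmax token obtained by decoding each layer's hidden state.

    hidden_states: one vector per layer, for a single token position.
    unembed: one weight row per vocabulary entry.
    """
    predictions = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        best = max(range(len(logits)), key=logits.__getitem__)
        predictions.append(vocab[best])
    return predictions

# Toy setup: 3 layers, hidden size 2, vocabulary of 3 tokens.
vocab = ["+", "7", "8"]
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
hidden_states = [[1.0, 0.1], [0.6, 0.5], [0.1, 1.0]]
per_layer = early_decode(hidden_states, unembed, vocab)
```

In this toy case the decoded token stays on the operator symbol for the first two layers and flips to a digit only at the last one, mirroring the "recognized early, resolved late" pattern the abstract reports.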


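[14] CiPO: Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization cs.CL

Junyi Li, Yongqiang Chen, Ningning Ding

TL;DR: This paper proposes CiPO, a new framework for counterfactual unlearning in Large Reasoning Models (LRMs). Through iterative preference optimization, it intervenes specifically in the model’s chain-of-thought (CoT) reasoning to selectively remove unwanted knowledge while preserving reasoning ability.

Motivation: Existing unlearning methods face a dilemma in LRMs: they either struggle to completely eliminate undesired knowledge from CoT traces or degrade model performance by interfering with the reasoning process. This paper aims to resolve that dilemma.

Result: On challenging benchmarks, CiPO excels at unlearning, completely removing knowledge from both the intermediate CoT steps and the final answer while preserving the reasoning abilities of LRMs.

Insight: The novelty is redefining unlearning as a targeted intervention on CoT reasoning in LRMs and running iterative preference optimization on logically valid counterfactual reasoning traces, balancing effective unlearning against preserved performance.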

Abstract: Machine unlearning has gained increasing attention in recent years as a promising technique to selectively remove unwanted private or copyrighted information from Large Language Models that are trained on a massive scale of human data. However, the emergence of Large Reasoning Models (LRMs), which emphasize long chain-of-thought (CoT) reasoning to address complex questions, presents a dilemma for unlearning: existing methods either struggle to completely eliminate undesired knowledge from the CoT traces or degrade reasoning performance by interfering with the reasoning process. To this end, we introduce Counterfactual Unlearning through iterative Preference Optimization (CiPO), a novel framework that redefines unlearning as the targeted intervention of the CoT reasoning in LRMs. More specifically, given a desired unlearning target answer, CiPO instructs LRMs to generate a logically valid counterfactual reasoning trace for preference tuning. As the LRM adjusts to the counterfactual trace, CiPO iteratively updates the preference learning data to increase the discrepancy from the original model. This iterative loop ensures both desirable unlearning and smooth optimization, effectively mitigating the dilemma. Experiments on challenging benchmarks demonstrate that CiPO excels at unlearning, completely removing knowledge from both the intermediate CoT steps and the final answer, while preserving the reasoning abilities of LRMs.


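[15] AgentV-RL: Scaling Reward Modeling with Agentic Verifier cs.CL | cs.AI

Jiazheng Zhang, Ziche Fu, Zhiheng Xi, Wenqing Jing, Mingxu Chai

TL;DR: This paper proposes Agentic Verifier, a framework that addresses the challenges verifiers face in complex domains by turning reward modeling into a multi-turn, tool-augmented deliberative process. It introduces complementary forward and backward agents for bidirectional verification, and further proposes AgentV-RL, in which proactive exploration and reinforcement learning let the verifier autonomously interleave tool use with internal reasoning. Experiments show consistent performance gains under both parallel and sequential test-time scaling.

Motivation: To address the challenges existing verifiers face in complex domains: error propagation from incorrect intermediate reasoning can produce false positives for seemingly plausible solutions, and the lack of external grounding makes verifiers unreliable on computation- or knowledge-intensive tasks.

Result: In extensive experiments, Agentic Verifier yields consistent performance gains under both parallel and sequential test-time scaling. Its 4B variant surpasses state-of-the-art outcome reward models (ORMs) by 25.2%, positioning it as a promising paradigm for agentic reward modeling.

Insight: The main novelty is framing reward modeling as a tool-augmented, multi-turn deliberative process driven by complementary forward and backward agents, which gives a comprehensive, reliable, and interpretable assessment of solutions. AgentV-RL goes further, using reinforcement learning to interleave tool use and internal reasoning autonomously, which improves the verifier’s autonomy and practicality.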

Abstract: Verifiers have been demonstrated to enhance LLM reasoning via test-time scaling (TTS). Yet, they face significant challenges in complex domains. Error propagation from incorrect intermediate reasoning can lead to false positives for seemingly plausible solutions, while lacking external grounding makes verifiers unreliable on computation or knowledge-intensive tasks. To address these challenges, we propose Agentic Verifier, a framework that transforms reward modeling into a multi-turn, tool-augmented deliberative process. We introduce complementary forward and backward agents: one traces solutions from premises to conclusions, while the other re-checks conclusions against their underlying premises. This bidirectional process enables a comprehensive, reliable, and interpretable assessment of solutions. To facilitate practical deployment, we propose AgentV-RL. Through proactive exploration and reinforcement learning, the verifier autonomously interleaves tool-use with internal reasoning. Extensive experiments show that Agentic Verifier yields consistent performance gains under both parallel and sequential TTS. Notably, our 4B variant surpasses state-of-the-art ORMs by 25.2%, positioning it as a promising paradigm for agentic reward modeling.


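[16] Where does output diversity collapse in post-training? cs.CL | cs.AI | cs.LG

Constantinos Karouzos, Xingwei Tan, Nikolaos Aletras

TL;DR: This paper studies the drop in output diversity in post-trained language models (diversity collapse). It analyzes three parallel post-training lineages of Olmo 3 (Think, Instruct, RL-Zero) across 15 tasks and four text diversity metrics, and finds that the location of collapse co-varies with training data composition and that the collapse is embedded in the model weights, so it cannot be fixed by changing the generation format at inference time alone.

Motivation: Post-trained language models produce less varied outputs than their base counterparts, which undermines inference-time scaling methods that rely on varied samples and risks homogenizing outputs on creative and value-laden tasks. Prior work attributes collapse to specific post-training methods without separating data composition from method, or generation format from model weights.

Result: Experiments across 15 tasks and four text diversity metrics show that the Think lineage loses the most semantic diversity at the supervised fine-tuning stage; the effect of DPO is larger in the Instruct lineage than in Think; suppressing chain-of-thought at inference in Think models lowers accuracy on hard tasks but leaves answer-level diversity unchanged; and on six verifiable tasks, diversity loss decomposes into a quality-control component (removal of incorrect outputs) and a residual component (genuine narrowing among correct outputs) in task-dependent proportions, with Think models retaining more correct-answer diversity than Instruct models.

Insight: The core novelty is systematically tracing how output diversity evolves across post-training lineages and separating the effects of data composition, training method, and generation format. The key finding is that diversity collapse is determined mainly by training data composition during training and embedded in the model weights, rather than imposed by the inference-time generation format alone, which offers a data-centric view for understanding and mitigating diversity loss in post-training.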

Abstract: Post-trained language models produce less varied outputs than their base counterparts. This output diversity collapse undermines inference-time scaling methods that rely on varied samples, and risks homogenizing model outputs on creative and value-laden tasks. Prior work attributes collapse to specific post-training methods, without separating the role of training data composition from the method, or the generation format from the model weights. We trace output diversity through three parallel post-training lineages of Olmo 3, Think (chain-of-thought distillation), Instruct (broad multi-source data), and RL-Zero, across 15 tasks and four text diversity metrics. We find that the location of collapse co-varies with data composition: the Think lineage loses most semantic diversity at supervised fine-tuning, and the effect of DPO is larger in Instruct than in Think. Suppressing chain-of-thought reasoning at inference in Think models drops accuracy on hard tasks, yet leaves answer-level diversity unchanged, showing that the collapse is embedded in the model weights by training data, not imposed by the generation format. Decomposing diversity loss on six verifiable tasks into a quality-control component (removal of incorrect outputs) and a residual component (genuine narrowing among correct outputs) reveals that the split is task-dependent, and Think models retain more correct-answer diversity than Instruct despite collapsing more in aggregate. Our results indicate that diversity collapse is determined during training by data composition and cannot be addressed at inference time alone.
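
One way a split into a quality-control part and a residual part could be set up is sketched below. The abstract does not give the paper's definition, so everything here is an assumption: diversity is measured as the count of distinct outputs (a stand-in for the four metrics), the quality-control part is the diversity the base model would lose if its incorrect outputs were simply filtered out, and the residual is whatever loss remains.

```python
def distinct(outputs):
    """Toy diversity measure: number of distinct outputs."""
    return len(set(outputs))

def decompose_diversity_loss(base_outputs, post_outputs, is_correct):
    """Split total diversity loss into quality-control and residual parts."""
    d_base = distinct(base_outputs)
    d_base_correct = distinct([o for o in base_outputs if is_correct(o)])
    d_post = distinct(post_outputs)
    total = d_base - d_post
    quality_control = d_base - d_base_correct  # loss from dropping wrong outputs
    residual = total - quality_control         # narrowing beyond that
    return {"total": total, "quality_control": quality_control,
            "residual": residual}

# Hypothetical samples for the prompt "What is 2 + 2?"
base = ["2+2=4", "4", "four", "5", "22"]
post = ["4", "4", "4", "2+2=4"]
correct = {"2+2=4", "4", "four"}
parts = decompose_diversity_loss(base, post, lambda o: o in correct)
```

Under this definition a post-trained model that only stops producing wrong answers has zero residual, which matches the abstract's reading of the residual as genuine narrowing among correct outputs.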


[17] Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning cs.CL | cs.LGPDF

Jiaxi Bi, Tongxu Luo, Wenyu Du, Zhengyang Tang, Benyou Wang

TL;DR: 本文提出了STOP(Super TOken for Pruning)方法,用于在大规模推理模型(LRMs)的并行推理中,通过可学习的内部信号对无效推理路径进行早期剪枝,以提高计算效率。论文首先建立了首个系统性的路径剪枝分类法,并基于此设计了STOP,在多个参数规模的LRM上验证了其优越的有效性和效率,并提供了实际部署的实证指南。

Details

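Motivation: Parallel reasoning strengthens Large Reasoning Models (LRMs), but early errors produce many futile reasoning paths and incur heavy computational cost. Existing research lacks a standardized path-pruning framework to mitigate this problem.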

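Result: In extensive evaluations on LRMs ranging from 1.5B to 20B parameters, STOP outperforms existing baselines in both effectiveness and efficiency. For example, under a fixed compute budget it raises GPT-OSS-20B accuracy on AIME25 from 84% to nearly 90%.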

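Insight: The main contributions are: 1) the first systematic taxonomy of path pruning (by signal source and learnability), which reveals learnable internal methods as an underexplored direction; 2) STOP, designed accordingly, which uses a learnable internal signal (a super token) for early path pruning; 3) empirical guidelines on STOP's scalability and real-world deployment.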

Abstract: Parallel reasoning enhances Large Reasoning Models (LRMs) but incurs prohibitive costs due to futile paths caused by early errors. To mitigate this, path pruning at the prefix level is essential, yet existing research remains fragmented without a standardized framework. In this work, we propose the first systematic taxonomy of path pruning, categorizing methods by their signal source (internal vs. external) and learnability (learnable vs. non-learnable). This classification reveals the unexplored potential of learnable internal methods, motivating our proposal of STOP (Super TOken for Pruning). Extensive evaluations across LRMs ranging from 1.5B to 20B parameters demonstrate that STOP achieves superior effectiveness and efficiency compared to existing baselines. Furthermore, we rigorously validate the scalability of STOP under varying compute budgets - for instance, boosting GPT-OSS-20B accuracy on AIME25 from 84% to nearly 90% under fixed compute budgets. Finally, we distill our findings into formalized empirical guidelines to facilitate optimal real-world deployment. Code, data and models are available at https://bijiaxihh.github.io/STOP
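
The prefix-level pruning idea can be illustrated with a minimal sketch. It assumes each partially generated path already has a scalar score; in STOP that signal would come from the learned super token, but how it is computed is not described in this summary. The function name `prune_paths` and the `keep_ratio` parameter are hypothetical, introduced only for illustration.

```python
def prune_paths(prefix_scores, keep_ratio=0.5):
    """Return indices of the paths to keep, given one score per path prefix.

    prefix_scores: list of floats, higher means more promising (assumed).
    keep_ratio: fraction of paths that continue decoding (hypothetical knob).
    """
    # Always keep at least one path so decoding can continue.
    keep = max(1, int(len(prefix_scores) * keep_ratio))
    # Rank path indices from most to least promising.
    ranked = sorted(range(len(prefix_scores)),
                    key=lambda i: prefix_scores[i], reverse=True)
    # Return surviving indices in their original order.
    return sorted(ranked[:keep])
```

With a fixed compute budget, the tokens saved on pruned paths could be spent on the surviving ones, which is one plausible reading of the fixed-budget accuracy gain reported above.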


[18] Sentiment Analysis of German Sign Language Fairy Tales cs.CL | cs.LGPDF

Fabrizio Nunnari, Siddhant Jain, Patrick Gebhard

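TL;DR: This paper presents a dataset and a model for sentiment analysis of German Sign Language (DGS) fairy tales. First, four large language models (LLMs) with majority voting perform three-class sentiment analysis (negative, neutral, positive) on German fairy-tale text segments, reaching an inter-annotator agreement of 0.781 Krippendorff's alpha. Second, MediaPipe extracts face and body motion features from the corresponding DGS video segments. Finally, an explainable XGBoost-based model is trained to predict sentiment from the video features. Results show an average balanced accuracy of 0.631.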

Details

Motivation: To address automatic sentiment analysis of German Sign Language (DGS) videos, specifically for fairy-tale content.

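Result: On the constructed dataset, the model achieves an average balanced accuracy of 0.631. The analysis shows that, beyond facial motion (eyebrows and mouth), body motion (hips, elbows, shoulders) also contributes considerably to discriminating sentiment.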

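Insight: The contribution is building the first dataset for sentiment analysis of DGS fairy tales and showing that body motion is as important as facial expression for conveying sentiment in sign language. Methodologically, it combines LLMs for text sentiment annotation with an explainable machine learning model (XGBoost) for video feature analysis.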

Abstract: We present a dataset and a model for sentiment analysis of German sign language (DGS) fairy tales. First, we perform sentiment analysis for three levels of valence (negative, neutral, positive) on German fairy tales text segments using four large language models (LLMs) and majority voting, reaching an inter-annotator agreement of 0.781 Krippendorff’s alpha. Second, we extract face and body motion features from each corresponding DGS video segment using MediaPipe. Finally, we train an explainable model (based on XGBoost) to predict negative, neutral or positive sentiment from video features. Results show an average balanced accuracy of 0.631. A thorough analysis of the most important features reveals that, in addition to eyebrow and mouth motion on the face, the motion of hips, elbows, and shoulders also contributes considerably to discriminating the conveyed sentiment, indicating an equal importance of face and body for sentiment communication in sign language.
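
The labeling and evaluation steps can be sketched as follows. This is a minimal illustration, not the authors' code: the tie-breaking rule (falling back to "neutral" when the top two labels tie) is an assumption, since the abstract does not say how a 2-2 split among four LLMs is resolved, and the function names are hypothetical.

```python
from collections import Counter

def majority_vote(votes, tie_label="neutral"):
    """Aggregate one label per annotator into a single label.

    tie_label is an assumed fallback for ties; the source does not specify one.
    """
    ranked = Counter(votes).most_common()
    # A tie for first place means there is no majority label.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_label
    return ranked[0][0]

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes weigh as much as common ones."""
    recalls = []
    for label in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == label]
        hits = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)
```

Balanced accuracy is a natural choice here because sentiment classes in narrative text are rarely evenly distributed; for three classes, chance level is about 0.333, which puts the reported 0.631 in context.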


[19] AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency cs.CL | cs.AI | cs.LGPDF

Max Henning Höth, Kristian Kersting, Björn Deiseroth, Letitia Parcalabescu

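TL;DR: This paper proposes AtManRL, which uses differentiable attention manipulation and reinforcement learning to improve the faithfulness of chain-of-thought reasoning in large language models, so that the reasoning genuinely influences the final prediction, and validates it on the GSM8K and MMLU benchmarks.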

Details

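Motivation: To address the faithfulness problem in chain-of-thought reasoning of current large language models, where the reasoning trace may merely accompany the final answer rather than genuinely contribute to it.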

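Result: Experiments on the GSM8K and MMLU benchmarks with Llama-3.2-3B-Instruct show that the method can identify influential reasoning tokens and train more transparent reasoning models.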

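Insight: The contribution is deriving a saliency reward signal from a differentiable attention mask and jointly optimizing it with outcome-based rewards within the GRPO framework, improving correctness and interpretability at the same time.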

Abstract: Large language models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex tasks. Yet ensuring that the reasoning trace both contributes to and faithfully reflects the processes underlying the model’s final answer, rather than merely accompanying it, remains challenging. We introduce AtManRL, a method that leverages differentiable attention manipulation to learn more faithful reasoning through reinforcement learning. By training an additive attention mask that identifies tokens in the CoT crucial for producing correct answers, we derive a saliency reward signal that encourages the model to generate reasoning traces that genuinely influence its final predictions. We integrate this saliency reward with outcome-based rewards within the GRPO framework to jointly optimize for correctness and interpretability. Experiments on GSM8K and MMLU with Llama-3.2-3B-Instruct demonstrate that our approach can identify influential reasoning tokens and enable training more transparent reasoning models.


[20] BAGEL: Benchmarking Animal Knowledge Expertise in Language Models cs.CL | cs.AIPDF

Jiacheng Shen, Masato Hagiwara, Milad Alizadeh, Ellen Gilsenan-McMahon, Marius Miron

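TL;DR: This paper introduces BAGEL, a benchmark for evaluating animal knowledge expertise in language models. It is built from diverse scientific and reference sources (such as bioRxiv and Wikipedia), covers knowledge dimensions including taxonomy, morphology, and behavior, and uses a closed-book evaluation protocol to test models' intrinsic knowledge.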

Details

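Motivation: Existing language models perform well on broad-domain knowledge benchmarks, but specialized animal knowledge has not been systematically tested under a unified closed-book evaluation, so a dedicated benchmark is needed to assess how well models command this domain.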

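Result: BAGEL supports fine-grained analysis of models' animal knowledge across source domains, taxonomic groups, and knowledge categories, enabling precise characterization of model strengths and systematic failure modes and providing a new testbed for studying domain-specific knowledge generalization.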

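Insight: The contribution is a closed-book evaluation benchmark dedicated to animal knowledge that combines manually curated and automatically generated question-answer pairs and supports multi-dimensional analysis, which can help improve the reliability of language models in biodiversity-related applications.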

Abstract: Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well language models handle specialized animal-related knowledge under a unified closed-book evaluation protocol. We introduce BAGEL, a benchmark for evaluating animal knowledge expertise in language models. BAGEL is constructed from diverse scientific and reference sources, including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia, using a combination of curated examples and automatically generated closed-book question-answer pairs. The benchmark covers multiple aspects of animal knowledge, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. By focusing on closed-book evaluation, BAGEL measures animal-related knowledge of models without external retrieval at inference time. BAGEL further supports fine-grained analysis across source domains, taxonomic groups, and knowledge categories, enabling a more precise characterization of model strengths and systematic failure modes. Our benchmark provides a new testbed for studying domain-specific knowledge generalization in language models and for improving their reliability in biodiversity-related applications.


[21] SwanNLP at SemEval-2026 Task 5: An LLM-based Framework for Plausibility Scoring in Narrative Word Sense Disambiguation cs.CLPDF

Deshan Sumanathilaka, Nicholas Micallef, Julian Hough, Saman Jayasinghe

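TL;DR: This paper proposes a large language model (LLM)-based framework for plausibility scoring of homonymous word senses in narrative text, targeting SemEval-2026 Task 5. The framework incorporates a structured reasoning mechanism and examines the effect of fine-tuning low-parameter LLMs with diverse reasoning strategies, alongside dynamic few-shot prompting for large-parameter LLMs.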

Details

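Motivation: Although large language models perform well at word sense disambiguation, their practical applicability in real narrative contexts remains underexplored. SemEval-2026 Task 5 addresses this gap with a task that predicts the human-perceived plausibility of a word sense within a short story.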

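Result: Experiments show that commercial large-parameter LLMs with dynamic few-shot prompting closely replicate human plausibility judgments. In addition, model ensembling slightly improves performance and, compared with single-model predictions, better simulates the agreement patterns of five human annotators.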

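Insight: The contribution is combining a structured reasoning mechanism with LLMs specifically for word-sense plausibility scoring in narrative contexts. The work systematically compares two strategies, fine-tuning small models and prompting large models, and explores how model ensembling improves the simulation of human annotator consensus, offering useful insight into the practical use of LLMs for fine-grained semantic understanding tasks.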

Abstract: Recent advances in language models have substantially improved Natural Language Understanding (NLU). Although widely used benchmarks suggest that Large Language Models (LLMs) can effectively disambiguate word senses, their practical applicability in real-world narrative contexts remains underexplored. SemEval-2026 Task 5 addresses this gap by introducing a task that predicts the human-perceived plausibility of a word sense within a short story. In this work, we propose an LLM-based framework for plausibility scoring of homonymous word senses in narrative texts using a structured reasoning mechanism. We examine the impact of fine-tuning low-parameter LLMs with diverse reasoning strategies, alongside dynamic few-shot prompting for large-parameter models, on accurate sense identification and plausibility estimation. Our results show that commercial large-parameter LLMs with dynamic few-shot prompting closely replicate human-like plausibility judgments. Furthermore, model ensembling slightly improves performance, better simulating the agreement patterns of five human annotators compared to single-model predictions.


Van-Truong Le

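TL;DR: This paper proposes a dual-aspect evaluation framework for Vietnamese legal text, aiming to comprehensively assess large language models on legal text simplification. The framework first benchmarks four SOTA models (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1) on three dimensions: accuracy, readability, and consistency. It then runs a large-scale error analysis on 60 complex Vietnamese legal articles to explore the reasons behind model performance.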

Details

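Motivation: The complexity of Vietnamese legal texts hinders public access to justice. Large language models offer a promising solution for legal text simplification, but evaluating their true capabilities requires a multifaceted approach that goes beyond surface-level metrics.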

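Result: The benchmark reveals a key trade-off: Grok-1 excels in readability and consistency but compromises on fine-grained legal accuracy, while Claude 3 Opus achieves high accuracy scores that mask a significant number of subtle but critical reasoning errors. The error analysis identifies "Incorrect Example" and "Misinterpretation" as the most prevalent failure types.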

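Insight: The contribution is combining a quantitative benchmark with a qualitative deep dive based on an expert-validated error typology, providing a holistic and actionable assessment of LLMs for legal applications. The core insight is that the primary challenge for current LLMs is not summarization but controlled, accurate legal reasoning.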

Abstract: The complexity of Vietnam’s legal texts presents a significant barrier to public access to justice. While Large Language Models offer a promising solution for legal text simplification, evaluating their true capabilities requires a multifaceted approach that goes beyond surface-level metrics. This paper introduces a comprehensive dual-aspect evaluation framework to address this need. First, we establish a performance benchmark for four state-of-the-art large language models (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1) across three key dimensions: Accuracy, Readability, and Consistency. Second, to understand the “why” behind these performance scores, we conduct a large-scale error analysis on a curated dataset of 60 complex Vietnamese legal articles, using a novel, expert-validated error typology. Our results reveal a crucial trade-off: models like Grok-1 excel in Readability and Consistency but compromise on fine-grained legal Accuracy, while models like Claude 3 Opus achieve high Accuracy scores that mask a significant number of subtle but critical reasoning errors. The error analysis pinpoints “Incorrect Example” and “Misinterpretation” as the most prevalent failures, confirming that the primary challenge for current LLMs is not summarization but controlled, accurate legal reasoning. By integrating a quantitative benchmark with a qualitative deep dive, our work provides a holistic and actionable assessment of LLMs for legal applications.


cs.CV [Back]

[23] Zoom Consistency: A Free Confidence Signal in Multi-Step Visual Grounding Pipelines cs.CV | cs.AIPDF

Keon Kim, Krish Chelikavada

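TL;DR: This paper proposes zoom consistency, a free confidence signal for multi-step visual grounding pipelines. The signal is the geometric distance between the model's step-2 prediction and the crop center, and it can be compared directly across architecturally different vision-language models without calibration. The study shows that the signal correlates with prediction correctness and can inform a model routing strategy.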

Details

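Motivation: Multi-step zoom-in pipelines are widely used for GUI grounding, but their intermediate predictions are usually discarded after coordinate remapping. This paper aims to exploit the free confidence signal contained in those intermediate outputs to improve the robustness and efficiency of grounding pipelines.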

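Result: Under idealized conditions, zoom consistency is proven to be a linear estimator of step-1 spatial error. On two vision-language models (KV-Ground-8B and Qwen3.5-27B), the signal shows a statistically significant but small correlation with prediction correctness (AUC = 0.60; Spearman rho = -0.14 and -0.11, respectively). As a proof of concept, routing between a specialist and a generalist model with this signal captures 16.5% of the oracle headroom between them (+0.8%).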

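Insight: The contribution is identifying and formalizing an overlooked geometric confidence signal in multi-step visual grounding pipelines: zoom consistency. Its core value is that it is a geometric quantity in a shared coordinate space, comparable across architecturally different models without calibration, which gives a new low-cost basis for model selection or ensembling.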

Abstract: Multi-step zoom-in pipelines are widely used for GUI grounding, yet the intermediate predictions they produce are typically discarded after coordinate remapping. We observe that these intermediate outputs contain a useful confidence signal for free: zoom consistency, the distance between a model’s step-2 prediction and the crop center. Unlike log-probabilities or token-level uncertainty, zoom consistency is a geometric quantity in a shared coordinate space, making it directly comparable across architecturally different VLMs without calibration. We prove this quantity is a linear estimator of step-1 spatial error under idealized conditions (perfect step-2, target within crop) and show it correlates with prediction correctness across two VLMs (AUC = 0.60; Spearman rho = -0.14, p < 10^{-6} for KV-Ground-8B; rho = -0.11, p = 0.0003 for Qwen3.5-27B). The correlation is small but consistent across models, application categories, and operating systems. As a proof-of-concept, we use zoom consistency to route between a specialist and generalist model, capturing 16.5% of the oracle headroom between them (+0.8%, McNemar p = 0.19). Code is available at https://github.com/omxyz/zoom-consistency-routing.
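
The signal itself is simple to compute, as the sketch below suggests. It assumes the crop is centered on the step-1 prediction and that the step-2 prediction has already been remapped to full-image coordinates. The `ZoomResult` container and the `route` rule (pick whichever model is more self-consistent) are hypothetical illustrations; the routing policy actually used in the paper is not described in this summary.

```python
import math
from dataclasses import dataclass

@dataclass
class ZoomResult:
    crop_center: tuple  # step-1 prediction (x, y), full-image coordinates
    step2_point: tuple  # step-2 prediction (x, y), remapped to full-image coordinates

def zoom_consistency(result):
    """Euclidean distance between the step-2 prediction and the crop center.

    Smaller values mean the two steps agree more closely.
    """
    return math.dist(result.step2_point, result.crop_center)

def route(specialist, generalist):
    """Hypothetical routing rule: trust the model whose two steps agree more."""
    if zoom_consistency(specialist) <= zoom_consistency(generalist):
        return "specialist"
    return "generalist"
```

Under the idealized conditions named in the abstract (perfect step 2, target inside the crop), the step-2 point coincides with the target, so this distance equals the step-1 spatial error. Because it is measured in pixels of the shared image space, values from different models can be compared directly.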


[24] Frequency-Aware Flow Matching for High-Quality Image Generation cs.CVPDF

Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille

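TL;DR: This paper proposes Frequency-Aware Flow Matching (FreqFlow), a new method for improving image generation quality. It explicitly integrates frequency-aware conditioning into the flow matching framework through time-dependent adaptive weighting, using a two-branch architecture in which a frequency branch processes low- and high-frequency components separately and guides a spatial branch, to better model global structure and fine details.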

Details

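Motivation: In existing flow matching models, noise added in the latent domain affects frequency components non-uniformly, so low-frequency structure emerges early during inference while high-frequency detail lags behind. This paper aims to address this imbalance in frequency generation.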

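Result: On the class-conditional ImageNet-256 generation benchmark, the method achieves state-of-the-art performance with an FID of 1.38, improving on the prior diffusion model DiT and the flow matching model SiT by 0.79 and 0.58 FID, respectively.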

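Insight: The contribution is explicitly introducing frequency information as a condition in the generation process, separating the handling of different frequency components through a two-branch architecture, and coordinating global structure and detail generation with a time-dependent weighting mechanism, which gives more balanced control over the stages of image synthesis.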

Abstract: Flow matching models have emerged as a powerful framework for realistic image generation by learning to reverse a corruption process that progressively adds Gaussian noise. However, because noise is injected in the latent domain, its impact on different frequency components is non-uniform. As a result, during inference, flow matching models tend to generate low-frequency components (global structure) in the early stages, while high-frequency components (fine details) emerge only later in the reverse process. Building on this insight, we propose Frequency-Aware Flow Matching (FreqFlow), a novel approach that explicitly incorporates frequency-aware conditioning into the flow matching framework via time-dependent adaptive weighting. We introduce a two-branch architecture: (1) a frequency branch that separately processes low- and high-frequency components to capture global structure and refine textures and edges, and (2) a spatial branch that synthesizes images in the latent domain, guided by the frequency branch’s output. By explicitly integrating frequency information into the generation process, FreqFlow ensures that both large-scale coherence and fine-grained details are effectively modeled: low-frequency conditioning reinforces global structure, while high-frequency conditioning enhances texture fidelity and detail sharpness. On the class-conditional ImageNet-256 generation benchmark, our method achieves state-of-the-art performance with an FID of 1.38, surpassing the prior diffusion model DiT and flow matching model SiT by 0.79 and 0.58 FID, respectively. Code is available at https://github.com/OliverRensu/FreqFlow.
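
To make the low/high-frequency split concrete, here is a minimal NumPy sketch. It is not FreqFlow's implementation: the radial FFT mask, the `cutoff` value, and the linear schedule in `frequency_condition` (with t = 0 taken as the noisy start and t = 1 as the final sample) are all assumptions standing in for the learned, time-dependent adaptive weighting described in the abstract.

```python
import numpy as np

def split_frequencies(x, cutoff=0.25):
    """Split a 2D array into low- and high-frequency parts with a radial FFT mask.

    cutoff is in cycles per sample (the maximum representable value is 0.5).
    """
    h, w = x.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    mask = np.sqrt(fx ** 2 + fy ** 2) <= cutoff
    # The mask is symmetric in frequency, so the inverse transform is real
    # up to numerical error; .real discards the residual imaginary part.
    low = np.fft.ifft2(np.fft.fft2(x) * mask).real
    return low, x - low

def frequency_condition(x, t):
    """Blend the two bands with an assumed linear schedule over t in [0, 1].

    Early steps (small t) emphasize global structure, late steps emphasize detail.
    """
    low, high = split_frequencies(x)
    return (1.0 - t) * low + t * high
```

Defining the high band as the residual `x - low` guarantees that the two parts sum back to the input exactly, which keeps the decomposition lossless regardless of the cutoff.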


[25] CXR-LT 2026 Challenge: Multi-Center Long-Tailed and Zero Shot Chest X-ray Classification cs.CVPDF

Hexin Dong, Yi Lin, Pengyu Zhou, Fengnian Zhao, Alan Clint Legasto

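TL;DR: This paper presents the third CXR-LT challenge (CXR-LT 2026), which targets the difficulties that long-tailed disease distributions and the open-world nature of clinical environments pose for chest X-ray (CXR) analysis. The challenge provides a multi-center dataset of over 145,000 images and defines two core tasks: robust multi-label classification on 30 known classes, and open-world generalization to 6 unseen (out-of-distribution) rare disease classes.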

Details

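Motivation: Existing chest X-ray classification benchmarks usually rely on closed-set classes from a single institution and fail to capture the distribution of rare diseases or the appearance of novel findings, which hinders practical use in clinical settings. The CXR-LT challenge aims to establish a benchmark closer to real clinical conditions (long-tailed, open-world, multi-center).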

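Result: Analysis of the challenge results shows that vision-language foundation models improve both in-distribution and zero-shot performance, but detecting rare findings under multi-center shift remains challenging.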

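Insight: The contribution is a large-scale, multi-center, radiologist-annotated benchmark for long-tailed and zero-shot chest X-ray classification, together with a systematic evaluation of models on robust classification and open-world generalization, providing a foundation for developing and evaluating AI systems under realistic clinical conditions.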

Abstract: Chest X-ray (CXR) interpretation is hindered by the long-tailed distribution of pathologies and the open-world nature of clinical environments. Existing benchmarks often rely on closed-set classes from a single institution, failing to capture the prevalence of rare diseases or the appearance of novel findings. To address this, we present the CXR-LT challenge. The first event, CXR-LT 2023, established a large-scale benchmark for long-tailed multi-label CXR classification and identified key challenges in rare disease recognition. CXR-LT 2024 further expanded the label space and introduced a zero-shot task to study generalization to unseen findings. Building on the success of CXR-LT 2023 and 2024, this third iteration of the benchmark introduces a multi-center dataset comprising over 145,000 images from PadChest and NIH Chest X-ray datasets. Additionally, all development and test sets in CXR-LT 2026 are annotated by radiologists, providing a more reliable and clinically grounded evaluation than report-derived labels. The challenge defines two core tasks this year: (1) Robust Multi-Label Classification on 30 known classes and (2) Open-World Generalization to 6 unseen (out-of-distribution) rare disease classes. This paper summarizes the overview of the CXR-LT 2026 challenge. We describe the data collection and annotation procedures, analyze solution strategies adopted by participating teams, and evaluate head-versus-tail performance, calibration, and cross-center generalization gaps. Our results show that vision-language foundation models improve both in-distribution and zero-shot performance, but detecting rare findings under multi-center shift remains challenging. Our study provides a foundation for developing and evaluating AI systems in realistic long-tailed and open-world clinical conditions.


[26] AdaVFM: Adaptive Vision Foundation Models for Edge Intelligence via LLM-Guided Execution cs.CV | cs.LGPDF

Yiwei Zhao, Yi Zheng, Huapeng Su, Jieyu Lin, Stefano Ambrogio

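TL;DR: This paper proposes AdaVFM, an adaptive framework for efficiently deploying language-aligned vision foundation models (VFMs) on edge devices. By integrating neural architecture search (NAS) with guidance from a cloud-hosted multimodal large language model (LLM), the framework dynamically adjusts model computation according to scene context and task complexity, balancing accuracy and efficiency.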

Details

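Motivation: Language-aligned vision foundation models (VFMs) are powerful, but their deployment on edge devices is constrained by strict latency and power limits. Existing methods struggle to adjust computational cost dynamically per task, so a runtime-adaptive execution strategy is needed.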

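Result: Extensive experiments on zero-shot classification (ImageNet-1K) and open-vocabulary segmentation (ADE20K) show that AdaVFM achieves state-of-the-art (SOTA) accuracy-efficiency trade-offs. At comparable VFM sizes, it surpasses the best baselines by up to 7.9% top-1 accuracy on IN1K and 5.2% mIoU on ADE20K. At similar accuracy, AdaVFM reduces average FLOPs by up to 77.9%.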

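Insight: The core contribution is showing that the effect of model size reduction on performance is task-dependent, and proposing on that basis a runtime-adaptive execution strategy driven by a cloud LLM. The method integrates NAS into the VFM backbone to produce lightweight subnets, and an LLM agent selects among them dynamically according to context, enabling efficient and accurate model adaptation on edge devices.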

Abstract: Language-aligned vision foundation models (VFMs) enable versatile visual understanding for always-on contextual AI, but their deployment on edge devices is hindered by strict latency and power constraints. We present AdaVFM, an adaptive framework for efficient on-device inference of language-aligned VFMs that dynamically adjusts computation based on scene context and task complexity. Our key insight is that the effect of model size reduction on performance is task-dependent in vision applications, motivating a runtime-adaptive execution strategy. AdaVFM integrates neural architecture search (NAS) into the language-aligned VFM backbone to enable lightweight subnet execution during runtime. A multimodal large language model (LLM) deployed on the cloud enables runtime control with a context-aware agent. This synergy allows efficient model adaptation under diverse conditions while maintaining strong accuracy. Extensive experiments on zero-shot classification and open-vocabulary segmentation demonstrate that AdaVFM achieves state-of-the-art accuracy-efficiency trade-offs, surpassing prior baselines by up to 7.9% in acc@1 on IN1K and 5.2% mIoU on ADE20K over the best models of comparable VFM sizes. For models with similar accuracy, AdaVFM further reduces average FLOPs by up to 77.9%.


[27] SIMMER: Cross-Modal Food Image–Recipe Retrieval via MLLM-Based Embedding cs.CV | cs.CL | cs.IR | cs.LG | cs.MMPDF

Keisuke Gomi, Keiji Yanai

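TL;DR: This paper proposes SIMMER, a model for cross-modal retrieval between food images and recipe texts. It abandons the conventional dual-encoder architecture in favor of a single unified encoder based on a multimodal large language model (MLLM), specifically VLM2Vec, that processes both images and text. With prompt templates designed for the structured content of recipes (title, ingredients, cooking instructions) and a component-aware data augmentation strategy, the model achieves state-of-the-art retrieval performance on the Recipe1M dataset.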

Details

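Motivation: Existing methods rely mainly on dual-encoder architectures, which require complex alignment strategies and task-specific network designs to bridge the semantic gap between modalities. This paper explores using a single MLLM encoder to simplify the architecture and improve performance.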

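Result: The method reaches state-of-the-art (SOTA) performance on Recipe1M in both the 1k and 10k evaluation settings. Specifically, the best model improves the 1k image-to-recipe R@1 from 81.8% to 87.5% and the 10k image-to-recipe R@1 from 56.5% to 65.5%, substantially outperforming all prior methods.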

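Insight: The core contribution is applying an MLLM (specifically VLM2Vec) as a single unified encoder for cross-modal retrieval, replacing the conventional dual-encoder paradigm. The prompt templates designed for structured recipes and the component-aware data augmentation strategy are further technical contributions that improve robustness and performance.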

Abstract: Cross-modal retrieval between food images and recipe texts is an important task with applications in nutritional management, dietary logging, and cooking assistance. Existing methods predominantly rely on dual-encoder architectures with separate image and text encoders, requiring complex alignment strategies and task-specific network designs to bridge the semantic gap between modalities. In this work, we propose SIMMER (Single Integrated Multimodal Model for Embedding Recipes), which applies Multimodal Large Language Model (MLLM)-based embedding models, specifically VLM2Vec, to this task, replacing the conventional dual-encoder paradigm with a single unified encoder that processes both food images and recipe texts. We design prompt templates tailored to the structured nature of recipes, which consist of a title, ingredients, and cooking instructions, enabling effective embedding generation by the MLLM. We further introduce a component-aware data augmentation strategy that trains the model on both complete and partial recipes, improving robustness to incomplete inputs. Experiments on the Recipe1M dataset demonstrate that SIMMER achieves state-of-the-art performance across both the 1k and 10k evaluation settings, substantially outperforming all prior methods. In particular, our best model improves the 1k image-to-recipe R@1 from 81.8% to 87.5% and the 10k image-to-recipe R@1 from 56.5% to 65.5% compared to the previous best method.
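
The R@1 figures above follow the usual Recall@K protocol for paired retrieval, which can be sketched as below. This is a generic illustration, not SIMMER's evaluation code: it assumes row i of the image embeddings is paired with row i of the recipe embeddings and that candidates are ranked by cosine similarity. The function name `recall_at_k` is hypothetical.

```python
import numpy as np

def recall_at_k(image_emb, recipe_emb, k=1):
    """Fraction of images whose paired recipe appears among the top-k candidates.

    image_emb, recipe_emb: arrays of shape (n, d); row i of each forms a pair.
    """
    # L2-normalize so the dot product equals cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    rec = recipe_emb / np.linalg.norm(recipe_emb, axis=1, keepdims=True)
    sims = img @ rec.T                         # (n, n) similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]    # best-k recipe indices per image
    targets = np.arange(len(img))[:, None]
    return float((topk == targets).any(axis=1).mean())
```

In the 1k and 10k settings, n would be 1,000 and 10,000 candidates respectively; the larger pool has more distractors per query, which is consistent with the lower 10k scores.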


[28] Causal Bootstrapped Alignment for Unsupervised Video-Based Visible-Infrared Person Re-Identification cs.CVPDF

Shuang Li, Jiaxu Leng, Changjiang Kuang, Mingpi Tan, Yu Yuan

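TL;DR: This paper proposes Causal Bootstrapped Alignment (CBA), a framework for unsupervised video-based visible-infrared person re-identification (USL-VVI-ReID). It removes the dependence of existing methods on expensive annotations and addresses the modality bias and clustering granularity imbalance caused by generic pretrained encoders. The framework uses Causal Intervention Warm-up (CIW), which exploits temporal consistency in video to suppress spurious correlations, and Prototype-Guided Uncertainty Refinement (PGUR), which performs coarse-to-fine cross-modality alignment, improving pseudo-label quality and representation learning.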

Details

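Motivation: VVI-ReID for all-day surveillance depends on expensive cross-modality annotations. When image-based unsupervised methods are extended directly to video, pretrained encoders with weak identity discrimination and strong modality bias cause intra-modality identity confusion and a clustering granularity imbalance across modalities. Together these factors degrade pseudo-label reliability and hinder effective cross-modality alignment.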

Result: Extensive experiments on the HITSZ-VCM and BUPTCampus benchmarks show that CBA significantly outperforms existing USL-VI-ReID methods when they are extended to the USL-VVI-ReID setting.

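Insight: The contributions include: 1) using video priors (temporal identity consistency and cross-modality identity consistency) for sequence-level causal intervention, suppressing spurious correlations induced by modality and motion; 2) a coarse-to-fine alignment strategy that reorganizes under-clustered infrared representations under the guidance of reliable visible prototypes, with uncertainty-aware supervision. The method combines causal reasoning with unsupervised cross-modality learning and designs its intervention mechanism around the temporal properties of video, improving the discriminability of representations and the quality of alignment.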

Abstract: VVI-ReID is a critical technique for all-day surveillance, where temporal information provides additional cues beyond static images. However, existing approaches rely heavily on fully supervised learning with expensive cross-modality annotations, limiting scalability. To address this issue, we investigate Unsupervised Learning for VVI-ReID (USL-VVI-ReID), which learns identity-discriminative representations directly from unlabeled video tracklets. Directly extending image-based USL-VI-ReID methods to this setting with generic pretrained encoders leads to suboptimal performance. Such encoders suffer from weak identity discrimination and strong modality bias, resulting in severe intra-modality identity confusion and pronounced clustering granularity imbalance between visible and infrared modalities. These issues jointly degrade pseudo-label reliability and hinder effective cross-modality alignment. To address these challenges, we propose a Causal Bootstrapped Alignment (CBA) framework that explicitly exploits inherent video priors. First, we introduce Causal Intervention Warm-up (CIW), which performs sequence-level causal interventions by leveraging temporal identity consistency and cross-modality identity consistency to suppress modality- and motion-induced spurious correlations while preserving identity-relevant semantics, yielding cleaner representations for unsupervised clustering. Second, we propose Prototype-Guided Uncertainty Refinement (PGUR), which employs a coarse-to-fine alignment strategy to resolve cross-modality granularity mismatch, reorganizing under-clustered infrared representations under the guidance of reliable visible prototypes with uncertainty-aware supervision. Extensive experiments on the HITSZ-VCM and BUPTCampus benchmarks demonstrate that CBA significantly outperforms existing USL-VI-ReID methods when extended to the USL-VVI-ReID setting.


[29] CPU Optimization of a Monocular 3D Biomechanics Pipeline for Low-Resource Deployment cs.CV | cs.PFPDF

Yan Zhang, Xiong Zhao

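TL;DR: This paper optimizes a monocular 3D biomechanics analysis pipeline derived from the MonocularBiomechanics framework so that it runs efficiently on CPU-only consumer hardware. Through profiling-driven system optimization (restructured model initialization, elimination of disk I/O serialization, and improved CPU parallelization), it achieves a 2.47x throughput increase and a 59.6% reduction in total runtime on a consumer workstation, while keeping biomechanical outputs highly consistent with the baseline.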

Details

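Motivation: Markerless 3D movement analysis pipelines based on monocular video usually depend on GPU acceleration, which limits deployment on consumer hardware and in resource-constrained environments such as clinical and sports settings.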

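Result: On a consumer workstation with an AMD Ryzen 7 9700X CPU, processing throughput increases 2.47x, total runtime falls by 59.6%, and initialization latency is reduced by 4.6x. Biomechanical outputs remain highly consistent with the baseline (mean joint-angle deviation 0.35°, correlation r = 0.998).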

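Insight: The contribution is making a research-grade vision-based biomechanics pipeline run efficiently on CPU-only hardware through system-level optimization rather than algorithmic change, offering a practical route to scalable movement assessment. Its optimization strategies (restructured initialization, removal of I/O bottlenecks, improved parallelization) are also instructive for deploying compute-intensive computer vision pipelines in resource-constrained environments.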

Abstract: Markerless 3D movement analysis from monocular video enables accessible biomechanical assessment in clinical and sports settings. However, most research-grade pipelines rely on GPU acceleration, limiting deployment on consumer-grade hardware and in low-resource environments. In this work, we optimize a monocular 3D biomechanics pipeline derived from the MonocularBiomechanics framework for efficient CPU-only execution through profiling-driven system optimization, including model initialization restructuring, elimination of disk I/O serialization, and improved CPU parallelization. Experiments on a consumer workstation (AMD Ryzen 7 9700X CPU) show a 2.47x increase in processing throughput and a 59.6% reduction in total runtime, with initialization latency reduced by 4.6x. Despite these changes, biomechanical outputs remain highly consistent with the baseline implementation (mean joint-angle deviation 0.35°, r = 0.998). These results demonstrate that research-grade vision-based biomechanics pipelines can be deployed on commodity CPU hardware for scalable movement assessment.


[30] PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation cs.CVPDF

Shuyan Ke, Yifan Mei, Changli Wu, Yonghan Zheng, Jiayi Ji

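TL;DR: For the UAV reasoning segmentation task, this paper defines semantic requirements along three dimensions (spatial, attribute, and scene), builds the DRSeg benchmark of 10k high-resolution aerial images, and proposes PixDLM, a simple yet effective pixel-level multimodal language model, as a unified baseline.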

Details

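Motivation: To address the distinct challenges of reasoning segmentation on UAV data, such as oblique viewpoints, ultra-high resolution, and extreme scale variation, and to formally define the UAV reasoning segmentation task.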

Result: Experiments on the DRSeg benchmark establish strong baseline results and highlight the unique challenges of UAV reasoning segmentation.

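Insight: The paper organizes the semantic requirements of UAV reasoning segmentation into three dimensions and builds a large-scale benchmark dataset. The proposed PixDLM, a pixel-level multimodal language model, provides a unified baseline framework for the task.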

Abstract: Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data poses distinct challenges, including oblique viewpoints, ultra-high resolutions, and extreme scale variations. To address these issues, we formally define the UAV Reasoning Segmentation task and organize its semantic requirements into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, a large-scale benchmark for UAV reasoning segmentation, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision across all three reasoning types. As a benchmark companion, we introduce PixDLM, a simple yet effective pixel-level multimodal language model that serves as a unified baseline for this task. Experiments on DRSeg establish strong baseline results and highlight the unique challenges of UAV reasoning segmentation, providing a solid foundation for future research.


[31] HyCal: A Training-Free Prototype Calibration Method for Cross-Discipline Few-Shot Class-Incremental Learning cs.CVPDF

Eunju Lee, MiHyeon Kim, JuneHyoung Kwon, Yoonji Lee, JiHyun Kim

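TL;DR: This paper addresses few-shot class-incremental learning with pretrained vision-language models in cross-discipline, imbalanced-data settings and proposes HyCal, a training-free prototype calibration method. HyCal combines cosine similarity and Mahalanobis distance to stabilize prototype representations on frozen CLIP embeddings, mitigating the "Domain Gravity" problem caused by data imbalance, and is validated on the newly proposed Cross-Discipline Variable Few-Shot Class-Incremental Learning benchmark.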

Details

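Motivation: Existing few-shot class-incremental learning methods usually assume homogeneous domains and balanced data distributions, whereas real-world data often comes from heterogeneous disciplines with imbalanced sample availability and varying visual complexity. This leads to "Domain Gravity": data imbalance creates a representational asymmetry in the embedding space, causing prototype drift and degrading performance on underrepresented or high-entropy domains.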

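Result: Experiments on the proposed Cross-Discipline Variable Few-Shot Class-Incremental Learning benchmark show that HyCal effectively mitigates Domain Gravity and outperforms existing methods in imbalanced cross-domain incremental learning, achieving consistent retention-adaptation improvements while maintaining efficiency.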

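Insight: The contribution is identifying the "Domain Gravity" problem in cross-discipline FSCIL and proposing a training-free hybrid prototype calibration method that combines directional alignment (cosine similarity) with covariance-aware magnitude (Mahalanobis distance). These complementary geometric properties yield stable prototypes under heterogeneous, imbalanced conditions, and because the method operates directly on frozen CLIP embeddings it balances performance and efficiency.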

Abstract: Pretrained Vision-Language Models (VLMs) like CLIP show promise in continual learning, but existing Few-Shot Class-Incremental Learning (FSCIL) methods assume homogeneous domains and balanced data distributions, limiting real-world applicability where data arises from heterogeneous disciplines with imbalanced sample availability and varying visual complexity. We identify Domain Gravity, a representational asymmetry where data imbalance across heterogeneous domains causes overrepresented or low-entropy domains to disproportionately influence the embedding space, leading to prototype drift and degraded performance on underrepresented or high-entropy domains. To address this, we introduce Cross-Discipline Variable Few-Shot Class-Incremental Learning (XD-VSCIL), a benchmark capturing real-world heterogeneity and imbalance where Domain Gravity naturally intensifies. We propose Hybrid Prototype Calibration (HyCal), a training-free method combining cosine similarity and Mahalanobis distance to capture complementary geometric properties (directional alignment and covariance-aware magnitude), yielding stable prototypes under imbalanced heterogeneous conditions. Operating on frozen CLIP embeddings, HyCal achieves consistent retention-adaptation improvements while maintaining efficiency. Experiments show HyCal effectively mitigates Domain Gravity and outperforms existing methods in imbalanced cross-domain incremental learning.
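
One way to combine the two geometric cues is sketched below. This is an illustrative guess at the general idea, not HyCal's actual formulation: the linear mixing weight `alpha`, the shared covariance matrix, and the small ridge term `eps` are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def hybrid_scores(x, prototypes, cov, alpha=0.5, eps=1e-6):
    """Score each class prototype for embedding x (higher is better).

    x: (d,) embedding; prototypes: (c, d) class means; cov: (d, d) covariance.
    alpha mixes cosine similarity with negative Mahalanobis distance (assumed).
    """
    # A small ridge keeps the covariance invertible when few samples are available.
    cov_inv = np.linalg.inv(cov + eps * np.eye(cov.shape[0]))
    # Directional alignment: cosine similarity to every prototype.
    cos = prototypes @ x / (np.linalg.norm(prototypes, axis=1) * np.linalg.norm(x))
    # Covariance-aware magnitude: Mahalanobis distance to every prototype.
    diff = prototypes - x
    maha = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))
    return alpha * cos - (1.0 - alpha) * maha

def classify(x, prototypes, cov, alpha=0.5):
    """Predict the class whose prototype has the highest hybrid score."""
    return int(np.argmax(hybrid_scores(x, prototypes, cov, alpha)))
```

Because everything is computed from stored prototypes and a covariance estimate, no gradient updates are needed, which matches the training-free setting on frozen embeddings.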


[32] P3T: Prototypical Point-level Prompt Tuning with Enhanced Generalization for 3D Vision-Language Models cs.CVPDF

Geunyoung Jung, Soohong Kim, Kyungwoo Song, Jiyoung Jung

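TL;DR: This paper proposes P3T, a prototypical point-level prompt tuning method for adapting pretrained 3D vision-language models (VLMs) to downstream tasks in a parameter-efficient way. It has two components, a Point Prompter and a Text Prompter, which generate instance-aware point-level prompts and learnable text prompts respectively, and it introduces a prototypical loss to improve embedding space alignment, enabling task-specific adaptation while preserving generalization.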

Details

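Motivation: As pretrained models become widely used in the 3D point cloud domain, adapting them to downstream tasks grows increasingly important. Conventional full fine-tuning is expensive in computation and storage, while existing prompt tuning methods overfit easily and generalize poorly.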

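Result: On classification and few-shot learning tasks, the method matches or outperforms full fine-tuning, and it shows robust generalization under data shift in the cross-dataset setting.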

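Insight: The contributions include point-level and text-level prompters that operate directly on the input data, and a prototypical loss that reduces intra-category variance to improve embedding alignment. Through parameter-efficient tuning, the method avoids overfitting and improves the generalization of 3D VLMs.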

Abstract: With the rise of pre-trained models in the 3D point cloud domain for a wide range of real-world applications, adapting them to downstream tasks has become increasingly important. However, conventional full fine-tuning methods are computationally expensive and storage-intensive. Although prompt tuning has emerged as an efficient alternative, it often suffers from overfitting, thereby compromising generalization capability. To address this issue, we propose Prototypical Point-level Prompt Tuning (P3T), a parameter-efficient prompt tuning method designed for pre-trained 3D vision-language models (VLMs). P3T consists of two components: 1) Point Prompter, which generates instance-aware point-level prompts for the input point cloud, and 2) Text Prompter, which inserts learnable prompts into the input text instead of hand-crafted ones. Since both prompters operate directly on input data, P3T enables task-specific adaptation of 3D VLMs without sacrificing generalizability. Furthermore, to enhance embedding space alignment, which is key to fine-tuning 3D VLMs, we introduce a prototypical loss that reduces intra-category variance. Extensive experiments demonstrate that our method matches or outperforms full fine-tuning in classification and few-shot learning, and further exhibits robust generalization under data shift in the cross-dataset setting. The code is available at https://github.com/gyjung975/P3T.
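
A loss that reduces intra-category variance can take many forms; the sketch below shows one simple variant for intuition only. It is an assumption, not the loss defined in the paper: each class prototype is taken to be the mean embedding of that class in the batch, and the loss is the mean squared distance of embeddings to their own prototype. The function name is hypothetical.

```python
import numpy as np

def prototypical_loss(embeddings, labels):
    """Mean squared distance from each embedding to its class prototype.

    embeddings: (n, d) array; labels: (n,) integer array.
    The prototype of a class is assumed to be its batch mean.
    """
    total = 0.0
    for c in np.unique(labels):
        members = embeddings[labels == c]
        prototype = members.mean(axis=0)
        # Accumulate squared distances of this class's members to its prototype.
        total += np.sum((members - prototype) ** 2)
    return total / len(embeddings)
```

Minimizing a term like this pulls same-category embeddings together, so a text or point prototype can represent the whole category with less spread.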


[33] SSMamba: A Self-Supervised Hybrid State Space Model for Pathological Image Classification cs.CV | cs.AIPDF

Enhui Chai, Sicheng Chen, Tianyi Zhang, Xingyu Li, Tianxiang Cui

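TL;DR: This paper proposes SSMamba, a self-supervised hybrid state space model framework for pathological image classification. It targets three core limitations of existing Vision Transformer-based ROI-level foundation models in pathology: cross-magnification domain shift, inadequate local-global relationship modeling, and insufficient fine-grained sensitivity. SSMamba combines Mamba Masked Image Modeling, a Directional Multi-scale module, and a Local Perception Residual module, trains in two stages on the target datasets (self-supervised pretraining followed by supervised fine-tuning), and outperforms existing SOTA methods on multiple public ROI and WSI datasets.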

Details

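Motivation: Existing foundation models for pathology, built on ViTs and large-scale self-supervised learning, face three core limitations in ROI-level analysis: cross-magnification domain shift caused by fixed-scale pretraining, high computational overhead and imprecise local feature modeling in the ViT backbone, and insufficient sensitivity of conventional self-attention to subtle diagnostic cues.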

Result: SSMamba outperforms 11 SOTA pathological foundation models on 10 public ROI datasets and surpasses 8 SOTA methods on 6 public WSI datasets.

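Insight: The contribution is a hybrid self-supervised learning framework for pathological image analysis that does not rely on large external datasets. Its core is three domain-adaptive components: MAMIM for mitigating domain shift, the DMS module for balanced local-global modeling, and the LPR module for enhanced fine-grained sensitivity. Bringing a state space model (Mamba) into self-supervised pretraining for pathology images, combined with modules tailored to their characteristics (multiple scales, critical local detail), is a promising direction.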

Abstract: Pathological diagnosis is highly reliant on image analysis, where Regions of Interest (ROIs) serve as the primary basis for diagnostic evidence, while whole-slide image (WSI)-level tasks primarily capture aggregated patterns. To extract these critical morphological features, ROI-level Foundation Models (FMs) based on Vision Transformers (ViTs) and large-scale self-supervised learning (SSL) have been widely adopted. However, three core limitations remain in their application to ROI analysis: (1) cross-magnification domain shift, as fixed-scale pretraining hinders adaptation to diverse clinical settings; (2) inadequate local-global relationship modeling, wherein the ViT backbone of FMs suffers from high computational overhead and imprecise local characterization; (3) insufficient fine-grained sensitivity, as traditional self-attention mechanisms tend to overlook subtle diagnostic cues. To address these challenges, we propose SSMamba, a hybrid SSL framework that enables effective fine-grained feature learning without relying on large external datasets. This framework incorporates three domain-adaptive components: Mamba Masked Image Modeling (MAMIM) for mitigating domain shift, a Directional Multi-scale (DMS) module for balanced local-global modeling, and a Local Perception Residual (LPR) module for enhanced fine-grained sensitivity. Employing a two-stage pipeline, SSL pretraining on target ROI datasets followed by supervised fine-tuning (SFT), SSMamba outperforms 11 state-of-the-art (SOTA) pathological FMs on 10 public ROI datasets and surpasses 8 SOTA methods on 6 public WSI datasets. These results validate the superiority of task-specific architectural designs for pathological image analysis.


[34] Diffusion Autoencoder for Unsupervised Artifact Restoration in Handheld Fundus Images cs.CV | cs.AI

Mathumetha Palani, Kavya Puthumana, Ayantika Das, Ganapathy Krishnamurthi

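TL;DR: This paper proposes an unsupervised diffusion autoencoder for restoring artifacts (such as flash reflections, exposure variations, and motion blur) in handheld fundus images. Trained only on high-quality table-top fundus images, the model generalizes to unseen handheld images, improving image quality and downstream diagnostic accuracy.

Details

Motivation: Handheld fundus imaging devices make ophthalmic diagnosis and screening more accessible, but their images often suffer from artifacts that degrade quality and hinder analysis. Existing generative models mostly depend on paired supervision or predefined artifact structures, making them hard to adapt to the unstructured degradations common in handheld images.

Result: On an unseen dataset and under multiple artifact conditions, the model raises diagnostic accuracy to 81.17%, and quantitative and qualitative evaluations validate the restorations.

Insight: The novelty lies in integrating a context encoder with the denoising process to learn semantic representations for artifact restoration in an unsupervised manner, without paired data or predefined degradation structures, which improves adaptability to unstructured degradations.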

Abstract: The advent of handheld fundus imaging devices has made ophthalmologic diagnosis and disease screening more accessible, efficient, and cost-effective. However, images captured from these setups often suffer from artifacts such as flash reflections, exposure variations, and motion-induced blur, which degrade image quality and hinder downstream analysis. While generative models have been effective in image restoration, most depend on paired supervision or predefined artifact structures, making them less adaptable to unstructured degradations commonly observed in handheld fundus images. To address this, we propose an unsupervised diffusion autoencoder that integrates a context encoder with the denoising process to learn semantically meaningful representations for artifact restoration. The model is trained only on high-quality table-top fundus images and, at inference, restores artifact-affected handheld acquisitions. We validate the restorations through quantitative and qualitative evaluations and show that diagnostic accuracy increases to 81.17% on an unseen dataset under multiple artifact conditions.


[35] RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees cs.CV | cs.CL

Yichen Xu, Yuanhang Liu, Chuhan Wang, Zihan Zhao, Jinghan Luo

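TL;DR: This paper introduces RefereeBench, the first large-scale benchmark for evaluating multimodal large language models as automatic sports referees. Spanning 11 sports, 925 curated videos, and 6,475 QA pairs, it evaluates five core officiating abilities: foul existence, foul and penalty classification, reasoning, entity perception, and temporal grounding. Extensive evaluation of state-of-the-art models shows that even the best-performing ones reach only about 60% accuracy, leaving current models far from being reliable sports referees.

Details

Motivation: Although multimodal large language models excel at generic video understanding, their ability to support specialized, rule-grounded decision-making remains insufficiently explored. This paper fills the gap with a dedicated benchmark that evaluates MLLMs on sports officiating, a complex task requiring precise rule application and multimodal understanding.

Result: On RefereeBench, the strongest closed-source models (such as Doubao-Seed-1.8 and Gemini-3-Pro) reach only about 60% accuracy, and the strongest open-source model, Qwen3-VL, reaches 47%, so current models remain far from reliable sports referees. Further analysis shows that models can often identify incidents and the entities involved but struggle with rule application and temporal grounding, and frequently over-call fouls on normal clips.

Insight: The main contribution is the first large-scale, high-quality, human-annotated multimodal benchmark for sports officiating, which systematically defines five core officiating abilities. Viewed objectively, the study exposes key weaknesses of current MLLMs on specialized decision-making tasks that require deep domain knowledge and precise spatio-temporal reasoning, and underlines the need for future models to better integrate domain knowledge and multimodal understanding in order to advance trustworthy AI-assisted decision-making. The benchmark also provides an important tool for evaluating and advancing broader rule-grounded multimodal decision-making systems.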

Abstract: While Multimodal Large Language Models (MLLMs) excel at generic video understanding, their ability to support specialized, rule-grounded decision-making remains insufficiently explored. In this paper, we introduce RefereeBench, the first large-scale benchmark for evaluating MLLMs as automatic sports referees. Spanning 11 sports with 925 curated videos and 6,475 QA pairs, RefereeBench evaluates five core officiating abilities: foul existence, foul and penalty classification, foul and penalty reasoning, entity perception, and temporal grounding. The benchmark is fully human-annotated to ensure high-quality annotations grounded in authentic officiating logic and multimodal evidence. Extensive evaluations of state-of-the-art MLLMs show that even the strongest models, such as Doubao-Seed-1.8 and Gemini-3-Pro, achieve only around 60% accuracy, while the strongest open-source model, Qwen3-VL, reaches only 47%. These results indicate that current models remain far from being reliable sports referees. Further analysis shows that while models can often identify incidents and involved entities, they struggle with rule application and temporal grounding, and frequently over-call fouls on normal clips. Our benchmark highlights the need for future MLLMs that better integrate domain knowledge and multimodal understanding, advancing trustworthy AI-assisted officiating and broader multimodal decision-making.


[36] Concept-wise Attention for Fine-grained Concept Bottleneck Models cs.CV

Minghong Zhong, Guoshuai Zou, Kanghao Chen, Dexia Chen, Ruixuan Wang

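TL;DR: This paper proposes CoAt-CBM, a new framework that addresses two key limitations of existing Concept Bottleneck Models (CBMs): pre-training biases (such as granularity misalignment or reliance on structural priors), and suboptimal alignment caused by fine-tuning with Binary Cross-Entropy loss, which ignores mutual exclusivity among concepts. The framework uses learnable concept-wise visual queries to adaptively obtain fine-grained concept-wise visual embeddings, and a novel concept contrastive optimization to guide the model in handling the relative importance of concept scores, achieving adaptive fine-grained image-concept alignment and improved interpretability.

Details

Motivation: Existing CBMs built on large pre-trained vision-language models such as CLIP suffer from pre-training biases and from independent per-concept optimization (which ignores mutual exclusivity), leading to poor granularity and alignment in concept modeling.

Result: Extensive experiments show that CoAt-CBM consistently outperforms state-of-the-art methods across multiple benchmarks.

Insight: The novelty lies in learnable concept-wise visual queries for adaptive extraction of fine-grained concept-wise visual embeddings, and in a concept contrastive optimization that explicitly models the relative importance among concepts, improving the alignment between image content and concept predictions as well as interpretability.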

Abstract: Recently, impressive performance has been achieved in Concept Bottleneck Models (CBM) by utilizing the image-text alignment learned by a large pre-trained vision-language model (i.e., CLIP). However, two key limitations remain in concept modeling. First, existing methods often suffer from pre-training biases, manifested as granularity misalignment or reliance on structural priors. Second, fine-tuning with Binary Cross-Entropy (BCE) loss treats each concept independently, which ignores mutual exclusivity among concepts and leads to suboptimal alignment. To address these limitations, we propose Concept-wise Attention for Fine-grained Concept Bottleneck Models (CoAt-CBM), a novel framework that achieves adaptive fine-grained image-concept alignment and high interpretability. Specifically, CoAt-CBM employs learnable concept-wise visual queries to adaptively obtain fine-grained concept-wise visual embeddings, which are then used to produce a concept score vector. A novel concept contrastive optimization then guides the model to handle the relative importance of the concept scores, so that concept predictions faithfully reflect the image content and alignment improves. Extensive experiments demonstrate that CoAt-CBM consistently outperforms state-of-the-art methods. The code will be made available upon acceptance.


[37] Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI cs.CV | cs.AI

Lama Moukheiber, Caleb M. Yeung, Haotian Xue, Alec Helbling, Zelin Zhao

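TL;DR: This paper introduces SGMRI-VQA, a new benchmark for evaluating the ability of vision-language models to perform multi-frame spatial reasoning on volumetric MRI data, and shows that fine-tuning Qwen3-VL-8B with bounding box supervision effectively improves its spatial grounding performance.

Details

Motivation: Existing medical vision-language models lack transparent reasoning and spatial evidence, and existing benchmarks evaluate only isolated 2D images, overlooking the volumetric nature of clinical imaging (where findings can span multiple frames or appear on only a few slices).

Result: Ten vision-language models are benchmarked on SGMRI-VQA. Supervised fine-tuning of Qwen3-VL-8B with bounding box supervision consistently outperforms strong zero-shot baselines in spatial grounding.

Insight: The novelty lies in building the first multi-frame spatial reasoning benchmark for volumetric MRI and in showing that targeted spatial supervision (such as bounding boxes) is an effective way to improve the spatial grounding of clinical reasoning models.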

Abstract: Spatial reasoning and visual grounding are core capabilities for vision-language models (VLMs), yet most medical VLMs produce predictions without transparent reasoning or spatial evidence. Existing benchmarks also evaluate VLMs on isolated 2D images, overlooking the volumetric nature of clinical imaging, where findings can span multiple frames or appear on only a few slices. We introduce Spatially Grounded MRI Visual Question Answering (SGMRI-VQA), a 41,307-pair benchmark for multi-frame, spatially grounded reasoning on volumetric MRI. Built from expert radiologist annotations in the fastMRI+ dataset across brain and knee studies, each QA pair includes a clinician-aligned chain-of-thought trace with frame-indexed bounding box coordinates. Tasks are organized hierarchically across detection, localization, counting/classification, and captioning, requiring models to jointly reason about what is present, where it is, and across which frames it extends. We benchmark 10 VLMs and show that supervised fine-tuning of Qwen3-VL-8B with bounding box supervision consistently improves grounding performance over strong zero-shot baselines, indicating that targeted spatial supervision is an effective path toward grounded clinical reasoning.


[38] Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow cs.CV

Chengxin Liu, Wonseok Choi, Chenshuang Zhang, Tae-Hyun Oh

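TL;DR: This paper proposes an adaptive information flow method to address the mismatch between visual and textual information flow in vision-language models (VLMs). By dynamically adjusting the attention that text tokens pay to visual tokens, the model attends only to important visual regions during decoding, which improves its perception capability. The method significantly improves baseline performance across several tasks, including visual question answering, visual grounding, counting, optical character recognition, and object hallucination.

Details

Motivation: Although VLMs perform well on tasks such as visual recognition, studies find that they often locate the correct image region for a question yet fail to give the correct answer. This mismatch may stem from suboptimal information flow inside the model: text tokens allocate too much attention to irrelevant visual tokens, leading to incorrect answers. This paper addresses the problem by modulating the information flow during inference.

Result: The method is applied to representative open-source VLMs and evaluated on multiple datasets covering visual question answering, visual grounding and counting, optical character recognition, and object hallucination. It significantly improves baseline performance, indicating that optimizing information flow adaptively can effectively improve perception accuracy.

Insight: The novelty lies in a token dynamics-based method for determining the importance of visual tokens: visual tokens that exhibit distinct activation patterns across decoding stages are treated as important. This allows the information flow to be modulated adaptively during inference, reducing interference from irrelevant regions. Viewed objectively, the approach offers a lightweight optimization strategy that requires no retraining and can improve the robustness and accuracy of existing VLMs.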

Abstract: Vision-Language Models (VLMs) have demonstrated strong capability in a wide range of tasks such as visual recognition, document parsing, and visual grounding. Nevertheless, recent work shows that while VLMs often manage to capture the correct image region corresponding to the question, they do not necessarily produce the correct answers. In this work, we demonstrate that this misalignment could be attributed to suboptimal information flow within VLMs, where text tokens distribute too much attention to irrelevant visual tokens, leading to incorrect answers. Based on the observation, we show that modulating the information flow during inference can improve the perception capability of VLMs. The idea is that text tokens should only be associated with important visual tokens during decoding, eliminating the interference of irrelevant regions. To achieve this, we propose a token dynamics-based method to determine the importance of visual tokens, where visual tokens that exhibit distinct activation patterns during different decoding stages are viewed as important. We apply our approach to representative open-source VLMs and evaluate on various datasets, including visual question answering, visual grounding and counting, optical character recognition, and object hallucination. The results show that our approach significantly improves the performance of baselines. Project page: https://cxliu0.github.io/AIF/.
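The core idea in the abstract above, that a text token should only be associated with important visual tokens during decoding, can be pictured as masking and renormalizing one row of attention weights. The sketch below is an assumption-laden illustration, not the authors' implementation: the importance mask is simply given by hand, whereas the paper derives it from token dynamics, and the function name is made up for this example.

```python
def modulate_attention(attn, important):
    """Zero attention on unimportant visual tokens, then renormalize to sum to 1."""
    kept = [a if keep else 0.0 for a, keep in zip(attn, important)]
    total = sum(kept)
    if total == 0.0:
        # No token marked important: fall back to the original weights.
        return list(attn)
    return [k / total for k in kept]


# One text token attending to four visual tokens (weights already sum to 1).
attn = [0.1, 0.4, 0.3, 0.2]
important = [False, True, False, True]  # hypothetical importance mask
new_attn = modulate_attention(attn, important)
print(new_attn)
```

After modulation, all attention mass sits on the tokens marked important, in the same proportions they had before.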


[39] Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions cs.CV

Ze Dong, Hao Shi, Zejia Gao, Zhonghua Yi, Kaiwei Wang

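TL;DR: This paper introduces EgoScreen-Emotion (ESE), the first benchmark dataset for understanding movie emotion from an egocentric screen view, aimed at embodied agents, and builds a multimodal long-context emotion reasoning framework. It finds that models trained on cinematic footage degrade significantly under realistic egocentric screen-view conditions, whereas training on ESE effectively improves robustness in real viewing scenarios.

Details

Motivation: Existing research on movie emotion understanding is almost exclusively based on cinematic footage, whereas agents such as embodied robots usually perceive movies through an egocentric screen view. This introduces domain shifts such as viewpoint distortion and scale variation, which limit cross-domain generalization.

Result: Models trained only on cinematic footage drop from 27.99 to 16.69 Macro-F1 when evaluated on ESE, revealing a severe domain gap; training on ESE substantially improves performance. The proposed framework achieves competitive results compared with strong closed-source multimodal models.

Insight: The novelty lies in building the first dataset and evaluation benchmark for egocentric screen-view movie emotion understanding, and in designing a multimodal long-context reasoning framework that fuses temporal visual evidence, narrative summaries, compressed historical context, and audio cues, highlighting the importance of domain-specific data and long-context multimodal reasoning.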

Abstract: Embodied robotic agents often perceive movies through an egocentric screen-view interface rather than native cinematic footage, introducing domain shifts such as viewpoint distortion, scale variation, illumination changes, and environmental interference. However, existing research on movie emotion understanding is almost exclusively conducted on cinematic footage, limiting cross-domain generalization to real-world viewing scenarios. To bridge this gap, we introduce EgoScreen-Emotion (ESE), the first benchmark dataset for egocentric screen-view movie emotion understanding. ESE contains 224 movie trailers captured under controlled egocentric screen-view conditions, producing 28,667 temporally aligned key-frames annotated by multiple raters with a confidence-aware multi-label protocol to address emotional ambiguity. We further build a multimodal long-context emotion reasoning framework that models temporal visual evidence, narrative summaries, compressed historical context, and audio cues. Cross-domain experiments reveal a severe domain gap: models trained on cinematic footage drop from 27.99 to 16.69 Macro-F1 when evaluated on realistic egocentric screen-view observations. Training on ESE substantially improves robustness under realistic viewing conditions. Our approach achieves competitive performance compared with strong closed-source multimodal models, highlighting the importance of domain-specific data and long-context multimodal reasoning.
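The domain gap above is reported in Macro-F1, the unweighted mean of per-class F1 scores, so rare emotion classes count as much as frequent ones. The sketch below computes it for a single-label toy case; ESE itself uses a multi-label protocol, so this is a simplification, and the labels and predictions are invented for illustration. The result is on a 0 to 1 scale, whereas the figures quoted in the abstract appear to be on a 0 to 100 scale.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical emotion labels for four key-frames.
y_true = ["joy", "joy", "fear", "sad"]
y_pred = ["joy", "fear", "fear", "joy"]
score = macro_f1(y_true, y_pred)
print(score)
```

Because every class contributes equally, a class that is never predicted correctly (here "sad") pulls the average down sharply even if it is rare.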


[40] Learning to Look before Learning to Like: Incorporating Human Visual Cognition into Aesthetic Quality Assessment cs.CV

Liwen Yu, Chi Liu, Xiaotong Han, Congcong Zhu, Minghao Wang

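TL;DR: This paper proposes AestheticNet, a new paradigm for Aesthetic Quality Assessment (AQA) that augments traditional semantic perception-based AQA by simulating human visual cognition (such as visual attention and scanning paths). It adopts a two-pathway architecture in which the visual attention pathway uses an encoder pre-trained on eye-tracking data to model the attention mechanism of the human visual system, and augments a fixed semantic encoder (such as CLIP) through cross-attention fusion.

Details

Motivation: Existing automated AQA methods mainly treat images as static pixel vectors and align with human ratings through semantic perception alone. This diverges from human aesthetic cognition, which is grounded in dynamic visual exploration (scanning paths, processing fluency, and the interplay between bottom-up salience and top-down intention).

Result: Experiments validated by hypothesis testing show consistent improvement over semantic-only baselines, and demonstrate that the visual attention module can serve as a model-agnostic corrector compatible with diverse AQA backbones, supporting the necessity and modularity of human-like visual cognition for AQA.

Insight: The novelty lies in explicitly integrating human visual cognition (such as attention mechanisms) into AQA, proposing a resource-efficient contrastive gaze alignment pre-training method, and showing that visual attention acts as a modular component that supplies cognitive priors beyond semantics (foreground/background structure, color cascade, brightness, and lighting) and improves model performance.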

Abstract: Automated Aesthetic Quality Assessment (AQA) treats images primarily as static pixel vectors, aligning predictions with human-rating scores largely through semantic perception. However, this paradigm diverges from human aesthetic cognition, which arises from dynamic visual exploration shaped by scanning paths, processing fluency, and the interplay between bottom-up salience and top-down intention. We introduce AestheticNet, a novel cognitive-inspired AQA paradigm that integrates human-like visual cognition and semantic perception with a two-pathway architecture. The visual attention pathway, implemented as a gaze-aligned visual encoder (GAVE) pre-trained offline on eye-tracking data using resource-efficient contrastive gaze alignment, models the attention of the human visual system. This pathway augments the semantic pathway, which uses a fixed semantic encoder such as CLIP, through cross-attention fusion. Visual attention provides a cognitive prior reflecting foreground/background structure, color cascade, brightness, and lighting, all of which are determinants of aesthetic perception beyond semantics. Experiments validated by hypothesis testing show a consistent improvement over the semantic-only baselines, and demonstrate the gaze module as a model-agnostic corrector compatible with diverse AQA backbones, supporting the necessity and modularity of human-like visual cognition for AQA. Our code is available at https://github.com/keepgallop/AestheticNet.


[41] Robust Multispectral Semantic Segmentation under Missing or Full Modalities via Structured Latent Projection cs.CV | cs.AI

Irem Ulku, Erdem Akagündüz, Ömer Özgür Tanrıöver

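TL;DR: This paper proposes CBC-SLP, a semantic segmentation model for multimodal remote sensing imagery that targets the problem of missing modalities in real-world deployment. Using a structured latent projection, it decomposes latent representations into shared and modality-specific components and adaptively passes them to the decoder according to modality availability, maintaining robust performance in both full- and missing-modality scenarios.

Details

Motivation: Existing multimodal segmentation models handle missing modalities by learning a shared representation, but this compromises modality-specific complementary information and reduces performance when all modalities are available. This paper aims to overcome that limitation with a model that preserves both modality-invariant and modality-specific information.

Result: Extensive experiments on three multimodal remote sensing image datasets show that CBC-SLP consistently outperforms state-of-the-art multimodal models in both full- and missing-modality scenarios.

Insight: Inspired by theoretical results on modality alignment, the paper proposes a structured latent projection as an architectural inductive bias, built directly into the model architecture rather than enforced through a loss term. The method makes effective use of complementary information, stays robust under random modality dropout, and can recover complementary information that may not be preserved in a shared representation.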

Abstract: Multimodal remote sensing data provide complementary information for semantic segmentation, but in real-world deployments, some modalities may be unavailable due to sensor failures, acquisition issues, or challenging atmospheric conditions. Existing multimodal segmentation models typically address missing modalities by learning a shared representation across inputs. However, this approach can introduce a trade-off by compromising modality-specific complementary information and reducing performance when all modalities are available. In this paper, we tackle this limitation with CBC-SLP, a multimodal semantic segmentation model designed to preserve both modality-invariant and modality-specific information. Inspired by the theoretical results on modality alignment, which state that perfectly aligned multimodal representations can lead to sub-optimal performance in downstream prediction tasks, we propose a novel structured latent projection approach as an architectural inductive bias. Rather than enforcing this strategy through a loss term, we incorporate it directly into the architecture. In particular, to use the complementary information effectively while maintaining robustness under random modality dropout, we structure the latent representations into shared and modality-specific components and adaptively transfer them to the decoder according to the random modality availability mask. Extensive experiments on three multimodal remote sensing image sets demonstrate that CBC-SLP consistently outperforms state-of-the-art multimodal models across full and missing modality scenarios. Besides, we empirically demonstrate that the proposed strategy can recover the complementary information that may not be preserved in a shared representation. The code is available at https://github.com/iremulku/Multispectral-Semantic-Segmentation-via-Structured-Latent-Projection-CBC-SLP-.


[42] UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs cs.CV | cs.AI

Lifan Jiang, Tianrun Wu, Yuhang Pei, Chenyang Wang, Boxi Wu

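TL;DR: This paper presents UniEditBench, a unified and cost-effective benchmark for image and video editing. It builds lightweight evaluators by distilling multimodal large language models (MLLMs), supports both reconstruction-based and instruction-driven editing methods, and covers nine image operations and eight video operations. It aims to address the fragmentation of existing evaluation, the lack of reliable benchmarks for video editing, and the high cost of automatic evaluation.

Details

Motivation: The evaluation of visual editing models is fragmented across methods and modalities. Existing benchmarks are usually tailored to specific paradigms, which makes fair cross-paradigm comparison difficult, and video editing lacks reliable evaluation benchmarks. Meanwhile, common automatic metrics misalign with human preference, while directly using large MLLMs as evaluators incurs prohibitive computational and financial costs.

Result: Experiments show that the distilled lightweight (4B/8B) evaluators maintain strong agreement with human judgments in multi-dimensional scoring over structural fidelity, text alignment, background consistency, naturalness, and temporal-spatial consistency (for videos), and substantially reduce deployment cost relative to the teacher model (Qwen3-VL-235B-A22B Instruct).

Insight: The novelty lies in a unified benchmark for visual editing evaluation and in using knowledge distillation to compress a high-performing but expensive MLLM evaluator into lightweight models, significantly reducing cost while preserving evaluation quality and providing a practical, reproducible protocol for modern visual editing methods.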

Abstract: The evaluation of visual editing models remains fragmented across methods and modalities. Existing benchmarks are often tailored to specific paradigms, making fair cross-paradigm comparisons difficult, while video editing lacks reliable evaluation benchmarks. Furthermore, common automatic metrics often misalign with human preference, yet directly deploying large multimodal models (MLLMs) as evaluators incurs prohibitive computational and financial costs. We present UniEditBench, a unified benchmark for image and video editing that supports reconstruction-based and instruction-driven methods under a shared protocol. UniEditBench includes a structured taxonomy of nine image operations (Add, Remove, Replace, Change, Stroke-based, Extract, Adjust, Count, Reorder) and eight video operations, with coverage of challenging compositional tasks such as counting and spatial reordering. To enable scalable evaluation, we distill a high-capacity MLLM judge (Qwen3-VL-235B-A22B Instruct) into lightweight 4B/8B evaluators that provide multi-dimensional scoring over structural fidelity, text alignment, background consistency, naturalness, and temporal-spatial consistency (for videos). Experiments show that the distilled evaluators maintain strong agreement with human judgments and substantially reduce deployment cost relative to the teacher model. UniEditBench provides a practical and reproducible protocol for benchmarking modern visual editing methods. Our benchmark and the associated reward models are publicly available at https://github.com/wesar1/UniEditBench.


[43] CLOTH-HUGS: Cloth Aware Human Gaussian Splatting cs.CV

Sadia Mubashshira, Nazanin Amini, Kevin Desai

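TL;DR: This paper presents Cloth-HUGS, a Gaussian Splatting-based neural rendering framework for photorealistic clothed human reconstruction. Within a shared canonical space, it represents body and clothing with separate Gaussian layers, and uses SMPL-driven articulation and physics-inspired constraints to improve cloth realism and dynamic consistency, with support for real-time rendering.

Details

Motivation: Existing methods typically merge clothing and body into a single representation and struggle with loose garments and complex deformations, so a method is needed that explicitly disentangles body and clothing and improves reconstruction realism.

Result: On multiple benchmarks, Cloth-HUGS improves perceptual quality and geometric fidelity over existing SOTA baselines, reducing LPIPS by up to 28% while producing temporally coherent cloth dynamics.

Insight: The novel elements include: separating body and cloth Gaussian representations in canonical space; initializing cloth Gaussians from mesh topology and applying physics-inspired constraints (simulation consistency, ARAP regularization, and mask supervision); and a depth-aware multi-pass rendering strategy for robust compositing and real-time performance.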

Abstract: We present Cloth-HUGS, a Gaussian Splatting based neural rendering framework for photorealistic clothed human reconstruction that explicitly disentangles body and clothing. Unlike prior methods that absorb clothing into a single body representation and struggle with loose garments and complex deformations, Cloth-HUGS represents the performer using separate Gaussian layers for body and cloth within a shared canonical space. The canonical volume jointly encodes body, cloth, and scene primitives and is deformed through SMPL-driven articulation with learned linear blend skinning weights. To improve cloth realism, we initialize cloth Gaussians from mesh topology and apply physics-inspired constraints, including simulation-consistency, ARAP regularization, and mask supervision. We further introduce a depth-aware multi-pass rendering strategy for robust body-cloth-scene compositing, enabling real-time rendering at over 60 FPS. Experiments on multiple benchmarks show that Cloth-HUGS improves perceptual quality and geometric fidelity over state-of-the-art baselines, reducing LPIPS by up to 28% while producing temporally coherent cloth dynamics.
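The abstract above deforms the canonical volume with linear blend skinning (LBS), in which a posed point is the weighted average of the positions produced by each bone's rigid transform: v' = sum_k w_k (R_k v + t_k). The sketch below shows that formula in 2-D for brevity. It is a generic illustration, not the Cloth-HUGS code: the bones and weights are hand-set hypothetical values, whereas the paper uses SMPL-driven articulation in 3-D with learned weights.

```python
import math


def rigid_transform(point, angle, translation):
    """Rotate a 2-D point by `angle` radians, then translate it."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y + translation[0], s * x + c * y + translation[1])


def linear_blend_skinning(point, bones, weights):
    """Blend per-bone transformed positions with skinning weights (summing to 1)."""
    out_x = out_y = 0.0
    for (angle, translation), w in zip(bones, weights):
        px, py = rigid_transform(point, angle, translation)
        out_x += w * px
        out_y += w * py
    return (out_x, out_y)


# Two hypothetical bones: one stays put, one shifts by (2, 0) without rotating.
bones = [(0.0, (0.0, 0.0)), (0.0, (2.0, 0.0))]
posed = linear_blend_skinning((1.0, 1.0), bones, weights=[0.25, 0.75])
print(posed)
```

A point weighted mostly toward the second bone follows most of that bone's motion, which is how skinning weights let one surface bend smoothly across joints.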


[44] Efficient Video Diffusion Models: Advancements and Challenges cs.CV

Shitong Shao, Lichen Bai, Pengfei Wan, James Kwok, Zeke Xie

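TL;DR: This survey systematically reviews progress and challenges in reducing the inference cost of efficient video diffusion models. It organizes existing methods into four main paradigms and analyzes their algorithmic trends and design objectives.

Details

Motivation: Video diffusion models have become the dominant approach to high-fidelity video generation, but their practical deployment is constrained by severe inference costs: spatial-temporal token growth and iterative denoising sharply increase computation, making attention and memory overhead the main bottlenecks.

Result: As a survey, the paper reports no specific quantitative experimental results; it systematically organizes existing methods and points to future directions such as infrastructure for standardized evaluation.

Insight: The paper claims to be the first comprehensive survey of efficient video diffusion models. It proposes a unified four-paradigm categorization and identifies two core optimization objectives, reducing the number of function evaluations and lowering per-step overhead, giving researchers and engineers a structured overview.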

Abstract: Video diffusion models have rapidly become the dominant paradigm for high-fidelity generative video synthesis, but their practical deployment remains constrained by severe inference costs. Compared with image generation, video synthesis compounds computation across spatial-temporal token growth and iterative denoising, making attention and memory traffic major bottlenecks in real-world settings. This survey provides a systematic and deployment-oriented review of efficient video diffusion models. We propose a unified categorization that organizes existing methods into four main paradigms: step distillation, efficient attention, model compression, and cache/trajectory optimization. Building on this categorization, we analyze the algorithmic trends of each paradigm and examine how different design choices target two core objectives: reducing the number of function evaluations and minimizing per-step overhead. Finally, we discuss open challenges and future directions, including quality preservation under composite acceleration, hardware-software co-design, robust real-time long-horizon generation, and open infrastructure for standardized evaluation. To the best of our knowledge, our work is the first comprehensive survey on efficient video diffusion models, offering researchers and engineers a structured overview of the field and its emerging research directions.


[45] Making Image Editing Easier via Adaptive Task Reformulation with Agentic Executions cs.CV

Bo Zhao, Kairui Guo, Runnan Du, Haiyang Sun, Pengshan Wang

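TL;DR: This paper proposes an adaptive task reformulation framework in which an MLLM agent dynamically analyzes and reformulates image editing instructions, turning the original editing task into a sequence of executable operations and thereby improving the reliability of instruction-guided image editing without modifying the underlying model.

Details

Motivation: Existing instruction-guided image editing models still fail on many seemingly simple cases. The study finds that these failures stem mainly from poorly formulated tasks (such as targets that are too small, implicit spatial relations, or under-specified instructions) rather than from insufficient model capacity.

Result: On multiple benchmarks, including ImgEdit, PICA, and RePlan, and across different editing backbones such as Qwen Image Edit and Nano Banana, the framework yields consistent improvements, with especially large gains on challenging cases.

Insight: The novelty lies in reframing editing failures as a task formulation problem and introducing an MLLM agent-based pipeline of analysis, routing, reformulation, and feedback-driven iterative refinement. It highlights the importance of matching task formulation to the effective operating regime of the model and offers a new way to improve the performance of existing models.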

Abstract: Instruction-guided image editing has advanced substantially with recent generative models, yet it still fails to produce reliable results across many seemingly simple cases. We observe that a large portion of these failures stem not from insufficient model capacity, but from poorly formulated editing tasks, such as those involving small targets, implicit spatial relations, or under-specified instructions. In this work, we frame image editing failures as a task formulation problem and propose an adaptive task reformulation framework that improves editing performance without modifying the underlying model. Our key idea is to transform the original image-instruction pair into a sequence of operations that are dynamically determined and executed by an MLLM agent through analysis, routing, reformulation, and feedback-driven refinement. Experiments on multiple benchmarks, including ImgEdit, PICA, and RePlan, across diverse editing backbones such as Qwen Image Edit and Nano Banana, show consistent improvements, with especially large gains on challenging cases. These results suggest that task reformulation is a critical but underexplored factor, and that substantial gains can be achieved by better matching editing tasks to the effective operating regime of existing models.


[46] SENSE: Stereo OpEN Vocabulary SEmantic Segmentation cs.CV | cs.RO

Thomas Campagnolo, Ezio Malis, Philippe Martinet, Gaétan Bahl

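TL;DR: The paper proposes SENSE, the first work on stereo-based open-vocabulary semantic segmentation. Using stereo image pairs and vision-language models, it incorporates geometric cues to improve the spatial reasoning and segmentation accuracy of open-vocabulary semantic segmentation. Trained on the PhraseStereo dataset, the method performs strongly on phrase-grounded tasks and generalizes well in zero-shot settings.

Details

Motivation: Existing open-vocabulary semantic segmentation methods usually rely on single-view images and fall short in spatial precision, especially under occlusions and near object boundaries. To address this, the paper uses the geometric cues provided by stereo vision to improve segmentation accuracy and spatial reasoning.

Result: On PhraseStereo, SENSE improves Average Precision by 2.9% over the baseline method and by 0.76% over the best competing method. It also achieves relative mIoU improvements of 3.5% on Cityscapes and 18% on KITTI compared to the baseline.

Insight: The main contribution is bringing stereo vision into open-vocabulary semantic segmentation for the first time, improving scene understanding by jointly reasoning over semantics and geometry. Viewed objectively, this multimodal fusion approach offers an effective way to overcome the spatial-precision limits of single-view methods and is valuable for applications such as autonomous driving and intelligent transportation systems.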

Abstract: Open-vocabulary semantic segmentation enables models to segment objects or image regions beyond fixed class sets, offering flexibility in dynamic environments. However, existing methods often rely on single-view images and struggle with spatial precision, especially under occlusions and near object boundaries. We propose SENSE, the first work on Stereo OpEN Vocabulary SEmantic Segmentation, which leverages stereo vision and vision-language models to enhance open-vocabulary semantic segmentation. By incorporating stereo image pairs, we introduce geometric cues that improve spatial reasoning and segmentation accuracy. Trained on the PhraseStereo dataset, our approach achieves strong performance in phrase-grounded tasks and demonstrates generalization in zero-shot settings. On PhraseStereo, we show a +2.9% improvement in Average Precision over the baseline method and +0.76% over the best competing method. SENSE also provides a relative improvement of +3.5% mIoU on Cityscapes and +18% on KITTI compared to the baseline work. By jointly reasoning over semantics and geometry, SENSE supports accurate scene understanding from natural language, essential for autonomous robots and Intelligent Transportation Systems.


[47] MMGait: Towards Multi-Modal Gait Recognition cs.CV

Chenye Wang, Qingyuan Cai, Saihui Hou, Aoqi Li, Yongzhen Huang

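TL;DR: The paper presents MMGait, a multi-modal gait recognition benchmark dataset that integrates data from five heterogeneous sensors (an RGB camera, a depth camera, an infrared camera, a LiDAR scanner, and a 4D radar system), with 12 modalities and more than 330,000 sequences. Based on this benchmark, it systematically evaluates single-modal, cross-modal, and multi-modal paradigms and introduces a new task, Omni Multi-Modal Gait Recognition, which aims to unify these three paradigms within a single model. It also proposes a simple yet powerful baseline, OmniGait, which learns a shared embedding space across modalities and achieves promising recognition performance.

Details

Motivation: Existing gait recognition methods rely mainly on RGB-derived modalities and fall short in real-world scenarios that require multi-modal collaboration and cross-modal retrieval. The paper aims to overcome these challenges and advance the field by building a comprehensive multi-modal benchmark.

Result: Extensive evaluations on the proposed MMGait benchmark analyze modality robustness and complementarity. The OmniGait baseline achieves promising recognition performance on the unified multi-modal gait recognition task (the abstract gives no specific quantitative results).

Insight: The main contributions are: 1) MMGait, the first large-scale multi-modal gait recognition benchmark dataset to comprehensively integrate five heterogeneous sensors, spanning geometric, photometric, and motion domains; 2) a new task, "Omni Multi-Modal Gait Recognition", which aims to unify single-modal, cross-modal, and multi-modal recognition in a single model; 3) OmniGait, a simple yet effective baseline that learns a shared embedding space across modalities. Viewed objectively, the dataset and the new task give multi-modal gait recognition research important infrastructure and a clear direction.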

Abstract: Gait recognition has emerged as a powerful biometric technique for identifying individuals at a distance without requiring user cooperation. Most existing methods focus primarily on RGB-derived modalities, which fall short in real-world scenarios requiring multi-modal collaboration and cross-modal retrieval. To overcome these challenges, we present MMGait, a comprehensive multi-modal gait benchmark integrating data from five heterogeneous sensors, including an RGB camera, a depth camera, an infrared camera, a LiDAR scanner, and a 4D Radar system. MMGait contains twelve modalities and 334,060 sequences from 725 subjects, enabling systematic exploration across geometric, photometric, and motion domains. Based on MMGait, we conduct extensive evaluations on single-modal, cross-modal, and multi-modal paradigms to analyze modality robustness and complementarity. Furthermore, we introduce a new task, Omni Multi-Modal Gait Recognition, which aims to unify the above three gait recognition paradigms within a single model. We also propose a simple yet powerful baseline, OmniGait, which learns a shared embedding space across diverse modalities and achieves promising recognition performance. The MMGait benchmark, codebase, and pretrained checkpoints are publicly available at https://github.com/BNU-IVC/MMGait.


[48] IA-CLAHE: Image-Adaptive Clip Limit Estimation for CLAHE cs.CV

Rikuto Otsuka, Yuho Shoji, Yuka Ogino, Takahiro Toizumi, Atsushi Ito

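TL;DR: This paper proposes image-adaptive contrast limited adaptive histogram equalization (IA-CLAHE). A lightweight clip limit estimator adaptively estimates the clip limit for each local region, addressing the over-enhancement that conventional CLAHE causes with a fixed clip limit. The method requires no pre-searched ground-truth clip limits or task-specific datasets: through end-to-end optimization it learns to map input image histograms toward a domain-invariant uniform distribution, enabling zero-shot generalization.

Details

Motivation: When used to boost computer vision task performance and improve visual quality, conventional CLAHE often causes over-enhancement because its fixed clip limit cannot adapt to the histogram distribution of each local region. This paper aims to address that limitation.

Result: Experimental results show that IA-CLAHE consistently improves recognition performance while also enhancing visual quality for human perception, without requiring any task-specific training data.

Insight: The novelty lies in a differentiable extension of CLAHE used to train a lightweight estimator that adaptively estimates clip limits, and in achieving zero-shot generalization by learning a mapping toward a domain-invariant uniform distribution, avoiding dependence on ground-truth labels or specific datasets.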

Abstract: This paper proposes image-adaptive contrast limited adaptive histogram equalization (IA-CLAHE). Conventional CLAHE is widely used to boost the performance of various computer vision tasks and to improve visual quality for human perception in practical industrial applications. CLAHE applies contrast limited histogram equalization to each local region to enhance local contrast. However, CLAHE often leads to over-enhancement, because the contrast-limiting parameter, the clip limit, is fixed regardless of the histogram distribution of each local region. Our IA-CLAHE addresses this limitation by adaptively estimating tile-wise clip limits from the input image. To achieve this, we train a lightweight clip limit estimator with a differentiable extension of CLAHE, enabling end-to-end optimization. Unlike prior learning-based CLAHE methods, IA-CLAHE does not require pre-searched ground-truth clip limits or task-specific datasets, because it learns to map input image histograms toward a domain-invariant uniform distribution, enabling zero-shot generalization across diverse conditions. Experimental results show that IA-CLAHE consistently improves recognition performance, while simultaneously enhancing visual quality for human perception, without requiring any task-specific training data.
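To make the role of the clip limit concrete, the sketch below shows the conventional per-tile step that the abstract above builds on: clip the tile histogram, spread the clipped excess over all bins, then equalize through the cumulative histogram. It is a simplified illustration of standard CLAHE, not IA-CLAHE: the learned estimator is not sketched, `clip_limit` is passed in as a plain number, and the function names and 4-bin toy histogram are assumptions for this example.

```python
def clip_histogram(hist, clip_limit):
    """Clip each bin at clip_limit and spread the excess evenly over all bins.

    Single-pass redistribution: a bin can end up above clip_limit again;
    some implementations repeat the step to tighten this.
    """
    excess = sum(max(h - clip_limit, 0) for h in hist)
    clipped = [min(h, clip_limit) for h in hist]
    bonus = excess / len(hist)
    return [c + bonus for c in clipped]


def equalization_map(hist, levels):
    """Map each bin to an output level via the normalized cumulative histogram."""
    total = sum(hist)
    mapping, running = [], 0.0
    for h in hist:
        running += h
        mapping.append(round((levels - 1) * running / total))
    return mapping


tile_hist = [0, 12, 2, 2]  # toy 4-bin histogram of one tile
plain = equalization_map(tile_hist, levels=4)
limited = equalization_map(clip_histogram(tile_hist, clip_limit=4), levels=4)
print(plain, limited)
```

The slope of the mapping at a bin is proportional to that bin's height, so capping bin heights caps how strongly contrast can be stretched; a single fixed cap for every tile is the limitation the paper targets.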


[49] Seeing the Intangible: Survey of Image Classification into High-Level and Abstract Categories cs.CV | cs.AI | cs.CL | cs.CYPDF

Delfina Sol Martinez Pandiani, Valentina Presutti

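TL;DR: This survey systematically reviews research on high-level visual understanding in computer vision, with a focus on automatic image classification of abstract concepts. Through a multidisciplinary analysis it clarifies the tacit understanding of high-level semantics, groups related tasks into clusters such as commonsense, emotional, aesthetic, and inductive interpretative semantics, and examines the challenges and opportunities in handling abstract concepts such as values and ideologies.

Details

Motivation: Computer vision is increasingly shifting toward high-level visual sensemaking tasks, yet the exact nature of these tasks remains unclear and tacit. The paper aims to resolve this ambiguity by systematically reviewing work on high-level visual understanding, particularly image classification of abstract concepts.

Result: The paper reports no quantitative experiments or benchmark results because it is a survey; its main contribution is the categorization, analysis, and synthesis of existing research.

Insight: The novelty lies in a clear categorization of high-level visual semantics (commonsense, emotional, aesthetic, and so on) and in stressing the importance of hybrid AI systems for abstract-concept image classification. The survey also points out the limited efficacy of massive datasets and the need to integrate supplementary information and mid-level features.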

Abstract: The field of Computer Vision (CV) is increasingly shifting towards “high-level” visual sensemaking tasks, yet the exact nature of these tasks remains unclear and tacit. This survey paper addresses this ambiguity by systematically reviewing research on high-level visual understanding, focusing particularly on Abstract Concepts (ACs) in automatic image classification. Our survey contributes in three main ways: Firstly, it clarifies the tacit understanding of high-level semantics in CV through a multidisciplinary analysis, and categorization into distinct clusters, including commonsense, emotional, aesthetic, and inductive interpretative semantics. Secondly, it identifies and categorizes computer vision tasks associated with high-level visual sensemaking, offering insights into the diverse research areas within this domain. Lastly, it examines how abstract concepts such as values and ideologies are handled in CV, revealing challenges and opportunities in AC-based image classification. Notably, our survey of AC image classification tasks highlights persistent challenges, such as the limited efficacy of massive datasets and the importance of integrating supplementary information and mid-level features. We emphasize the growing relevance of hybrid AI systems in addressing the multifaceted nature of AC image classification tasks. Overall, this survey enhances our understanding of high-level visual reasoning in CV and lays the groundwork for future research endeavors.


[50] Mind’s Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs cs.CV | cs.AIPDF

Rohit Sinha, Aditya Kanade, Sai Srinivas Kancheti, Vineeth N Balasubramanian, Tanuja Ganu

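TL;DR: The paper introduces “Mind’s Eye”, a multiple-choice benchmark for evaluating the visual cognition and spatial reasoning of multimodal large language models (MLLMs). It comprises eight tasks organized under the “Abstraction, Relation, and Transformation” (A-R-T) taxonomy and inspired by human intelligence tests. Humans reach 80% accuracy on the benchmark, while the best-performing MLLMs remain below 50%, revealing significant limitations in current visuospatial reasoning.

Details

Motivation: Although MLLMs have made notable progress on standard vision-language benchmarks, their visual cognition and spatial reasoning abilities have not been adequately understood or evaluated. The paper aims to fill this gap with a benchmark inspired by classic human intelligence tests that systematically probes core fluid-intelligence processes in MLLMs.

Result: On Mind’s Eye, human participants achieve 80% accuracy on average, while the top-performing MLLMs, both closed-source and open-source, all stay below 50%. This indicates that the visuospatial reasoning of current MLLMs is still far from human level.

Insight: The novelty is a systematic evaluation framework grounded in cognitive science (the A-R-T taxonomy) for diagnosing specific deficits of MLLMs in visual reasoning: visual attention allocation, internal perceptual manipulation, and weak abstraction of visual concepts. This offers guidance for more cognitively grounded evaluation methods and for model improvement.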

Abstract: Multimodal large language models (MLLMs) have achieved impressive progress on vision language benchmarks, yet their capacity for visual cognitive and visuospatial reasoning remains less understood. We introduce “Mind’s Eye”, a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel “A-R-T” taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with human participants. Humans achieve 80% accuracy, while top performing MLLMs remain below 50%. Error analysis reveals failures in: (i) visual attention allocation, (ii) internal perceptual manipulation, and (iii) weak abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities, when compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.


[51] Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs cs.CV | cs.AIPDF

Sai Srinivas Kancheti, Aditya Sanjiv Kanade, Vineeth N. Balasubramanian, Tanuja Ganu

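TL;DR: By evaluating 17 models on 13 spatial reasoning benchmarks, the paper finds that Chain-of-Thought (CoT) prompting degrades the performance of multimodal LLMs on visual spatial reasoning tasks. It also reveals severe shortcut learning: models hallucinate visual details from textual priors even when the image is absent.

Details

Motivation: To investigate the limitations of the CoT paradigm when Multimodal Reasoning Models (MRMs) handle generalized spatial intelligence tasks, and to address the resulting performance degradation in visual spatial reasoning.

Result: Across 13 spatial benchmarks, CoT prompting consistently degrades performance. The No-Image++ ablation shows that models suffer from severe shortcut learning and hallucinate from textual priors when no image is provided.

Insight: The work challenges the efficacy of text-only CoT for spatial tasks and underscores the need for vision-centric reasoning paradigms. It also exposes the over-reliance of multimodal models on textual cues, and the resulting hallucinations, in spatial reasoning.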

Abstract: Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT prompted MLMs suffer from severe shortcut learning, and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.


[52] DenTab: A Dataset for Table Recognition and Visual QA on Real-World Dental Estimates cs.CVPDF

Laziz Hamdi, Amine Tamasna, Thierry Paquet

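TL;DR: This paper introduces DenTab, a dataset of 2,000 table images from dental estimates for evaluating table recognition (TR) and table visual question answering (TableVQA). It provides high-quality HTML annotations and 2,208 questions across 11 categories spanning retrieval, aggregation, and logic/consistency checks. The authors benchmark 16 systems and find that strong structure recovery does not always translate into reliable performance on multi-step arithmetic and consistency questions. They therefore propose the training-free Table Router Pipeline, which combines a vision-language model with a rule-based executor to improve reliability on arithmetic questions.

Details

Motivation: Existing resources for table structure recognition and TableVQA mostly come from clean digital-born sources or rendered tables and do not fully reflect the challenges of noisy real-world administrative documents such as dental estimates. The paper aims to fill this gap with a real, noisy table dataset that can drive more robust table understanding and reasoning systems.

Result: Sixteen systems (14 vision-language models and 2 OCR baselines) are benchmarked on DenTab. Even with ground-truth HTML table inputs, models still show notable reasoning failures on multi-step arithmetic and consistency questions. The proposed Table Router Pipeline improves arithmetic reliability without training by using deterministic execution.

Insight: The contributions are: (1) DenTab, presented as the first table recognition and visual question answering dataset for noisy real-world dental estimates; (2) the key finding that table structure recovery and complex reasoning performance are not consistently aligned; (3) the Table Router Pipeline, a training-free hybrid approach that routes questions to a deterministic executor to improve arithmetic reliability, offering a new direction for robustness in real-world settings.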

Abstract: Tables condense key transactional and administrative information into compact layouts, but practical extraction requires more than text recognition: systems must also recover structure (rows, columns, merged cells, headers) and interpret roles such as line items, subtotals, and totals under common capture artifacts. Many existing resources for table structure recognition and TableVQA are built from clean digital-born sources or rendered tables, and therefore only partially reflect noisy administrative conditions. We introduce DenTab, a dataset of 2,000 cropped table images from dental estimates with high-quality HTML annotations, enabling evaluation of table recognition (TR) and table visual question answering (TableVQA) on the same inputs. DenTab includes 2,208 questions across eleven categories spanning retrieval, aggregation, and logic/consistency checks. We benchmark 16 systems, including 14 vision–language models (VLMs) and two OCR baselines. Across models, strong structure recovery does not consistently translate into reliable performance on multi-step arithmetic and consistency questions, and these reasoning failures persist even when using ground-truth HTML table inputs. To improve arithmetic reliability without training, we propose the Table Router Pipeline, which routes arithmetic questions to deterministic execution. The pipeline combines (i) a VLM that produces a baseline answer, a structured table representation, and a constrained table program with (ii) a rule-based executor that performs exact computation over the parsed table. The source code and dataset will be made publicly available at https://github.com/hamdilaziz/DenTab.
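
The routing idea in the abstract can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' pipeline: the keyword-based router, the `(operation, column)` program format, and all function names are hypothetical, and the VLM outputs (baseline answer, parsed table, program) are passed in as plain values.

```python
import re

# Hypothetical cue words; the paper's actual routing criterion is not given here.
ARITHMETIC_CUES = ("sum", "total", "difference", "average")


def is_arithmetic(question):
    """Crude keyword router that flags questions needing exact computation."""
    text = question.lower()
    return any(re.search(rf"\b{cue}\b", text) for cue in ARITHMETIC_CUES)


def execute_program(table, program):
    """Rule-based executor: apply one allowed operation to one column."""
    operation, column = program
    values = [float(row[column]) for row in table]
    if operation == "sum":
        return sum(values)
    if operation == "max":
        return max(values)
    if operation == "mean":
        return sum(values) / len(values)
    raise ValueError(f"unsupported operation: {operation}")


def answer(question, table, vlm_answer, program):
    """Use deterministic execution for arithmetic questions, else the VLM answer."""
    if is_arithmetic(question) and program is not None:
        return execute_program(table, program)
    return vlm_answer


# Parsed table rows as a VLM might return them (values invented for illustration).
table = [
    {"item": "crown", "price": "450.00"},
    {"item": "scaling", "price": "80.50"},
]
routed = answer("What is the total price?", table, "520", ("sum", "price"))
direct = answer("Which item is listed first?", table, "crown", None)
```

Here the arithmetic question bypasses the (deliberately wrong) baseline answer "520" and is computed exactly from the parsed rows, while the retrieval question keeps the model's answer.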


[53] Polyglot: Multilingual Style Preserving Speech-Driven Facial Animation cs.CVPDF

Federico Nocentini, Kwanggyoon Seo, Qingju Liu, Claudio Ferrari, Stefano Berretti

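TL;DR: This paper proposes Polyglot, a unified diffusion-based architecture for personalized multilingual speech-driven facial animation. The method encodes language information through transcript embeddings and extracts style embeddings from reference facial sequences to capture individual speaking characteristics. It needs no predefined language or speaker labels and generalizes across languages and speakers through self-supervised learning.

Details

Motivation: Most existing speech-driven facial animation models are trained on single-language data and adapt poorly to real-world multilingual scenarios. Existing methods also typically rely on either language-specific or speaker-specific conditioning, so they cannot model the interaction between language and personal style, which limits realism.

Result: Experiments show improved performance in both monolingual and multilingual settings, providing a unified framework for modeling language and personal style in SDFA.

Insight: The novelty is a unified diffusion architecture jointly conditioned on language and style. It encodes language and individual style in a self-supervised manner without explicit labels, which lets it capture expressive traits such as rhythm, articulation, and habitual facial movements and produce temporally coherent, realistic animations.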

Abstract: Speech-Driven Facial Animation (SDFA) has gained significant attention due to its applications in movies, video games, and virtual reality. However, most existing models are trained on single-language data, limiting their effectiveness in real-world multilingual scenarios. In this work, we address multilingual SDFA, which is essential for realistic generation since language influences phonetics, rhythm, intonation, and facial expressions. Speaking style is also shaped by individual differences, not only by language. Existing methods typically rely on either language-specific or speaker-specific conditioning, but not both, limiting their ability to model their interaction. We introduce Polyglot, a unified diffusion-based architecture for personalized multilingual SDFA. Our method uses transcript embeddings to encode language information and style embeddings extracted from reference facial sequences to capture individual speaking characteristics. Polyglot does not require predefined language or speaker labels, enabling generalization across languages and speakers through self-supervised learning. By jointly conditioning on language and style, it captures expressive traits such as rhythm, articulation, and habitual facial movements, producing temporally coherent and realistic animations. Experiments show improved performance in both monolingual and multilingual settings, providing a unified framework for modeling language and personal style in SDFA.


[54] SWNet: A Cross-Spectral Network for Camouflaged Weed Detection cs.CV | cs.AIPDF

Henry O. Velesaca, Luigi Miranda, Angel D. Sappa

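TL;DR: This paper proposes SWNet, a bimodal end-to-end cross-spectral network designed to detect camouflaged weeds in dense agricultural environments. The network uses visible and near-infrared (NIR) information, captures long-range dependencies with a Pyramid Vision Transformer v2 backbone, dynamically integrates the two spectra with a Bimodal Gated Fusion Module, and adds an Edge-Aware Refinement module to produce sharper object boundaries.

Details

Motivation: To address the challenge that plant camouflage (homochromatic blending in which invasive species mimic the phenotypic traits of the primary crop) poses for traditional computer vision systems, and to achieve accurate weed detection in complex crop canopies.

Result: Experiments on the Weeds-Banana dataset show that SWNet outperforms ten state-of-the-art methods.

Insight: The novel aspects are the use of cross-spectral data, in particular the physiological differences in chlorophyll reflectance captured in the NIR spectrum, to discriminate targets that are indistinguishable in the visible range, and boundary-guided refinement to improve segmentation accuracy. Viewed objectively, combining a Transformer backbone with dynamic bimodal fusion and edge refinement is an effective way to handle complex camouflage in agricultural scenes.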

Abstract: This paper presents SWNet, a bimodal end-to-end cross-spectral network specifically engineered for the detection of camouflaged weeds in dense agricultural environments. Plant camouflage, characterized by homochromatic blending where invasive species mimic the phenotypic traits of primary crops, poses a significant challenge for traditional computer vision systems. To overcome these limitations, SWNet utilizes a Pyramid Vision Transformer v2 backbone to capture long-range dependencies and a Bimodal Gated Fusion Module to dynamically integrate Visible and Near-Infrared information. By leveraging the physiological differences in chlorophyll reflectance captured in the NIR spectrum, the proposed architecture effectively discriminates targets that are otherwise indistinguishable in the visible range. Furthermore, an Edge-Aware Refinement module is employed to produce sharper object boundaries and reduce structural ambiguity. Experimental results on the Weeds-Banana dataset indicate that SWNet outperforms ten state-of-the-art methods. The study demonstrates that the integration of cross-spectral data and boundary-guided refinement is essential for high segmentation accuracy in complex crop canopies. The code is available on GitHub: https://cod-espol.github.io/SWNet/


[55] neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing cs.CV | cs.CEPDF

Toby Perrett, Matthew Bouchard, William McCarthy

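TL;DR: The paper presents neuralCAD-Edit, the first benchmark for multimodal-instructed 3D CAD model editing built with professional CAD engineers. Realistic editing requests are collected by recording videos of designers working in CAD software while talking, pointing, and drawing, and the benchmark is used to measure the performance gap between leading foundation models and human experts.

Details

Motivation: Existing 3D CAD editing research relies mainly on text instructions and lacks real multimodal interaction data. The paper aims to provide an evaluation benchmark that is closer to actual design workflows for developing 3D CAD editing methods and foundation models.

Result: In both automatic metrics and human evaluations, leading foundation models (such as GPT 5.2) show a large performance gap relative to human CAD experts. The best model scores 53% lower (absolute) than the experts in human acceptance trials, which underlines the difficulty of the benchmark.

Insight: The novelty is the first expert-level CAD editing benchmark based on real multimodal interaction (speech, pointing gestures, drawing), which moves 3D editing from text-only instructions toward a more natural multimodal interaction paradigm that matches real workflows.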

Abstract: We introduce neuralCAD-Edit, the first benchmark for editing 3D CAD models collected from expert CAD engineers. Instead of text conditioning as in prior works, we collect realistic CAD editing requests by capturing videos of professional designers, interacting directly with CAD models in CAD software, while talking, pointing and drawing. We recruited ten consenting designers to contribute to this contained study. We benchmark leading foundation models against human CAD experts carrying out edits, and find a large performance gap in both automatic metrics and human evaluations. Even the best foundation model (GPT 5.2) scores 53% lower (absolute) than CAD experts in human acceptance trials, demonstrating the challenge of neuralCAD-Edit. We hope neuralCAD-Edit will provide a solid foundation against which 3D CAD editing approaches and foundation models can be developed. Code/data: https://autodeskailab.github.io/neuralCAD-Edit


[56] Winner of CVPR2026 NTIRE Challenge on Image Shadow Removal: Semantic and Geometric Guidance for Shadow Removal via Cascaded Refinement cs.CVPDF

Lorenzo Beltrame, Jules Salzinger, Filip Svoboda, Jasmin Lampert, Phillipp Fanta-Jende

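TL;DR: This paper presents a three-stage progressive shadow-removal pipeline for the CVPR2026 NTIRE Image Shadow Removal Challenge. Built on the OmniSR architecture, the method treats deshadowing as iterative direct refinement in which later stages correct residual artifacts left by earlier predictions. The model combines RGB appearance, frozen DINOv2 semantic guidance, and geometric cues from monocular depth and surface normals, all reused across stages. A contraction-constrained objective stabilizes multi-stage optimization, and training follows a staged pipeline. The method achieved the best performance on the official test set and won the challenge.

Details

Motivation: To address residual artifacts in image shadow removal and to improve removal quality and robustness through semantic and geometric guidance combined with multi-stage cascaded refinement.

Result: On the NTIRE WSRD+ 2026 hidden test set, the final ensemble achieved 26.680 PSNR, 0.8740 SSIM, 0.0578 LPIPS, and 26.135 FID, ranked first, and won the challenge. Its strong performance was further validated on the ISTD+ and UAV-SC+ datasets.

Insight: The novel aspects are: (1) formulating shadow removal as a three-stage cascaded direct-refinement pipeline; (2) combining frozen DINOv2 semantic features with monocular depth and normal cues as cross-stage guidance; (3) a contraction-constrained objective that stabilizes multi-stage training; (4) staged training with cosine-annealed checkpoint ensembling. Viewed objectively, the core is the effective integration of multimodal priors (semantic and geometric) with a constrained cascade that progressively removes artifacts, a successful example of combining systematic engineering with algorithm design.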

Abstract: We present a three-stage progressive shadow-removal pipeline for the CVPR2026 NTIRE WSRD+ challenge. Built on OmniSR, our method treats deshadowing as iterative direct refinement, where later stages correct residual artefacts left by earlier predictions. The model combines RGB appearance with frozen DINOv2 semantic guidance and geometric cues from monocular depth and surface normals, reused across all stages. To stabilise multi-stage optimisation, we introduce a contraction-constrained objective that encourages non-increasing reconstruction error across the cascade. A staged training pipeline transfers from earlier WSRD pretraining to WSRD+ supervision and final WSRD+ 2026 adaptation with cosine-annealed checkpoint ensembling. On the official WSRD+ 2026 hidden test set, our final ensemble achieved 26.680 PSNR, 0.8740 SSIM, 0.0578 LPIPS, and 26.135 FID, ranked first overall, and won the NTIRE 2026 Image Shadow Removal Challenge. The strong performance of the proposed model is further validated on the ISTD+ and UAV-SC+ datasets.
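
One way to read "encourages non-increasing reconstruction error across the cascade" is as a hinge penalty on any stage whose error exceeds that of the stage before it. The sketch below illustrates that reading only; the hinge form, the weighting, and the function names are assumptions, since the abstract does not state the exact objective.

```python
def contraction_penalty(stage_errors):
    """Sum of hinge penalties for each stage whose error exceeds the previous one.

    stage_errors: per-stage reconstruction errors, ordered from first to last stage.
    """
    return sum(
        max(0.0, later - earlier)
        for earlier, later in zip(stage_errors, stage_errors[1:])
    )


def cascade_loss(stage_errors, weight=1.0):
    """Final-stage error plus the weighted contraction penalty (assumed form)."""
    return stage_errors[-1] + weight * contraction_penalty(stage_errors)


# Errors for a three-stage cascade (values invented for illustration).
well_behaved = contraction_penalty([0.30, 0.20, 0.15])  # errors only decrease
regressing = contraction_penalty([0.30, 0.35, 0.25])    # stage 2 got worse
```

A cascade whose error shrinks at every stage incurs no penalty, while a stage that makes the reconstruction worse is penalized in proportion to the increase.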


[57] AIFIND: Artifact-Aware Interpreting Fine-Grained Alignment for Incremental Face Forgery Detection cs.CV | cs.AIPDF

Hao Wang, Beichen Zhang, Yanpei Gong, Shaoyi Fang, Zhaobo Qi

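TL;DR: The paper proposes AIFIND to address feature drift and catastrophic forgetting in incremental face forgery detection. The method introduces semantic anchors driven by low-level forgery artifacts to stabilize the feature space, and uses an attention mechanism together with a classifier harmonizer to achieve fine-grained alignment, improving performance as the model continually learns new forgery types.

Details

Motivation: Existing incremental face forgery detection methods typically rely on data replay or coarse binary supervision and do not explicitly constrain the feature space, which leads to severe feature drift and catastrophic forgetting.

Result: Extensive experiments on multiple incremental protocols validate the superiority of AIFIND, indicating that it effectively mitigates forgetting and improves detection performance.

Insight: The novelty lies in building invariant semantic anchors from low-level forgery artifacts to serve as a fixed coordinate system, explicitly aligning visual features with those anchors through attention, and harmonizing the classifiers by preserving the angular relationships among the anchors so that geometric consistency is maintained across tasks.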

Abstract: As forgery types continue to emerge consistently, Incremental Face Forgery Detection (IFFD) has become a crucial paradigm. However, existing methods typically rely on data replay or coarse binary supervision, which fails to explicitly constrain the feature space, leading to severe feature drift and catastrophic forgetting. To address this, we propose AIFIND, Artifact-Aware Interpreting Fine-Grained Alignment for Incremental Face Forgery Detection, which leverages semantic anchors to stabilize incremental learning. We design the Artifact-Driven Semantic Prior Generator to instantiate invariant semantic anchors, establishing a fixed coordinate system from low-level artifact cues. These anchors are injected into the image encoder via Artifact-Probe Attention, which explicitly constrains volatile visual features to align with stable semantic anchors. Adaptive Decision Harmonizer harmonizes the classifiers by preserving angular relationships of semantic anchors, maintaining geometric consistency across tasks. Extensive experiments on multiple incremental protocols validate the superiority of AIFIND.


[58] GAViD: A Large-Scale Multimodal Dataset for Context-Aware Group Affect Recognition from Videos cs.CVPDF

Deepak Kumar, Abhishek Pratap Singh, Puneet Kumar, Xiaobai Li, Balasubramanian Raman

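TL;DR: This paper introduces GAViD, a large-scale multimodal dataset for context-aware group affect recognition from videos. It contains 5091 video clips with video, audio, and context information, annotated with valence and discrete emotion labels. The paper also proposes CAGNet, a model for multimodal context-aware group affect recognition that reaches 63.20% test accuracy on GAViD, comparable to current state-of-the-art methods.

Details

Motivation: To address the challenges of computationally modeling group affect in real-world scenarios, which stem mainly from the lack of large-scale annotated datasets and from the complexity of multimodal social interactions, especially the effects of contextual and behavioral variability.

Result: On the GAViD dataset, CAGNet achieves 63.20% test accuracy, comparable to state-of-the-art (SOTA) performance.

Insight: The novel aspects are the construction of the large-scale multimodal GAViD dataset, which integrates VideoGPT-generated contextual metadata with human-annotated action cues, and the CAGNet model, which focuses on context-aware group affect recognition. Together they advance both datasets and methods in this area.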

Abstract: Understanding affective dynamics in real-world social systems is fundamental to modeling and analyzing human-human interactions in complex environments. Group affect emerges from intertwined human-human interactions, contextual influences, and behavioral cues, making its quantitative modeling a challenging computational social systems problem. However, computational modeling of group affect in in-the-wild scenarios remains challenging due to limited large-scale annotated datasets and the inherent complexity of multimodal social interactions shaped by contextual and behavioral variability. The lack of comprehensive datasets annotated with multimodal and contextual information further limits advances in the field. To address this, we introduce the Group Affect from ViDeos (GAViD) dataset, comprising 5091 video clips with multimodal data (video, audio and context), annotated with ternary valence and discrete emotion labels and enriched with VideoGPT-generated contextual metadata and human-annotated action cues. We also present Context-Aware Group Affect Recognition Network (CAGNet) for multimodal context-aware group affect recognition. CAGNet achieves 63.20% test accuracy on GAViD, comparable to state-of-the-art performance. The dataset and code are available at github.com/deepakkumar-iitr/GAViD.


[59] A Two-Stage, Object-Centric Deep Learning Framework for Robust Exam Cheating Detection cs.CV | cs.AIPDF

Van-Truong Le, Le-Khanh Nguyen, Trong-Doanh Nguyen

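TL;DR: This paper proposes a two-stage, object-centric deep learning framework for exam cheating detection. A YOLOv8n model first localizes students in exam-room images; a fine-tuned RexNet-150 model then classifies each cropped student region as normal or cheating behavior.

Details

Motivation: To address the inefficiency, cost, and error-proneness of traditional human invigilation, and to improve on the shortcomings of existing AI monitoring systems in transparency, performance, or architectural complexity.

Result: On a dataset compiled from 10 independent sources with 273,897 samples in total, the system achieves 0.95 accuracy, 0.94 recall, 0.96 precision, and 0.95 F1-score, an improvement of 13 percentage points over the 0.82 accuracy of a video-based cheating detection baseline. The average inference time is 13.9 ms per sample.

Insight: The novelty is a two-stage framework that combines object detection (YOLOv8n) with behavior classification (RexNet-150) for efficient, scalable cheating detection. The system design also considers ethics, for example by delivering results privately (such as via personal email) to avoid public shaming. This lays a foundation for real-time, scalable, ethical, open-source solutions.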

Abstract: Academic integrity continues to face the persistent challenge of examination cheating. Traditional invigilation relies on human observation, which is inefficient, costly, and prone to errors at scale. Although some existing AI-powered monitoring systems have been deployed and trusted, many lack transparency or require multi-layered architectures to achieve the desired performance. To overcome these challenges, we propose an improvement over a simple two-stage framework for exam cheating detection that integrates object detection and behavioral analysis using well-known technologies. First, the state-of-the-art YOLOv8n model is used to localize students in exam-room images. Each detected region is cropped and preprocessed, then classified by a fine-tuned RexNet-150 model as either normal or cheating behavior. The system is trained on a dataset compiled from 10 independent sources with a total of 273,897 samples, achieving 0.95 accuracy, 0.94 recall, 0.96 precision, and 0.95 F1-score - a 13% increase over a baseline accuracy of 0.82 in video-based cheating detection. In addition, with an average inference time of 13.9 ms per sample, the proposed approach demonstrates robustness and scalability for deployment in large-scale environments. Beyond the technical contribution, the AI-assisted monitoring system also addresses ethical concerns by ensuring that final outcomes are delivered privately to individual students after the examination, for example, via personal email. This prevents public exposure or shaming and offers students an opportunity to reflect on their behavior. For further improvement, it is possible to incorporate additional factors, such as audio data and consecutive frames, to achieve greater accuracy. This study provides a foundation for developing real-time, scalable, ethical, and open-source solutions.
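
As a quick consistency check on the reported figures, the F1-score can be recomputed from the stated precision and recall, and the "13%" gain can be read as a difference in accuracy. The snippet below uses only the numbers quoted in the abstract; the variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract.
f1 = f1_score(precision=0.96, recall=0.94)

# Gain over the 0.82 baseline, as an absolute and as a relative change.
gain_points = (0.95 - 0.82) * 100           # percentage points
relative_gain = (0.95 - 0.82) / 0.82 * 100  # percent relative to the baseline
```

The recomputed F1 rounds to the reported 0.95, and the quoted 13% corresponds to the absolute difference in accuracy (about 13 percentage points); the relative improvement over the baseline is closer to 16%.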


[60] CollideNet: Hierarchical Multi-scale Video Representation Learning with Disentanglement for Time-To-Collision Forecasting cs.CVPDF

Nishq Poorav Desai, Ali Etemad, Michael Greenspan

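TL;DR: This paper proposes CollideNet, a novel spatiotemporal hierarchical transformer architecture designed for Time-to-Collision (TTC) forecasting. The spatial stream aggregates information from each frame at multiple resolutions simultaneously, and the temporal stream combines multi-scale feature encoding with disentanglement of the non-stationarity, trend, and seasonality components of the video data. The method achieves state-of-the-art performance by a considerable margin on three commonly used public datasets, and cross-dataset evaluations are used to assess generalization.

Details

Motivation: TTC forecasting is a critical task in collision prevention. It requires precise temporal prediction and an understanding of both local and global patterns in video, spatially and temporally. Existing methods struggle with the multi-scale nature of video, so a dedicated architecture is needed.

Result: The method achieves state-of-the-art performance on three commonly used public datasets, with a considerable improvement over prior work. Cross-dataset evaluations assess its generalization, and the effects of disentangling the trend and seasonality components are visualized.

Insight: The novel aspects are: (1) a spatiotemporal hierarchical Transformer architecture designed specifically for TTC forecasting; (2) multi-resolution information aggregation in the spatial stream; (3) multi-scale feature encoding in the temporal stream together with disentanglement of non-stationarity, trend, and seasonality; (4) a disentanglement analysis that offers a new view of temporal patterns in video.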

Abstract: Time-to-Collision (TTC) forecasting is a critical task in collision prevention, requiring precise temporal prediction and comprehending both local and global patterns encapsulated in a video, both spatially and temporally. To address the multi-scale nature of video, we introduce a novel spatiotemporal hierarchical transformer-based architecture called CollideNet, specifically catered for effective TTC forecasting. In the spatial stream, CollideNet aggregates information for each video frame simultaneously at multiple resolutions. In the temporal stream, along with multi-scale feature encoding, CollideNet also disentangles the non-stationarity, trend, and seasonality components. Our method achieves state-of-the-art performance in comparison to prior works on three commonly used public datasets, setting a new state-of-the-art by a considerable margin. We conduct cross-dataset evaluations to analyze the generalization capabilities of our method, and visualize the effects of disentanglement of the trend and seasonality components of the video data. We release our code at https://github.com/DeSinister/CollideNet/.


[61] Find, Fix, Reason: Context Repair for Video Reasoning cs.CVPDF

Haojian Huang, Chuanyu Qin, Yinchuan Li, Yingcong Chen

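TL;DR: This paper proposes “Find, Fix, Reason” (FFR), a context repair method for video reasoning. A frozen, tool-integrated teacher model identifies missing spatiotemporal dependencies and extracts a minimal evidence patch (such as timestamps or regions) from the original video to augment the student model's context. The student answers again with the added context and is trained with a chosen-rollout scheme integrated into GRPO. The paper also proposes a Robust Improvement Reward (RIR) to align the optimization objective.

Details

Motivation: Existing reinforcement learning methods for video reasoning have limitations: on-policy self-exploration is bounded by the model's knowledge boundary, while hybrid replay requires careful regularization. Dynamic context methods can focus on evidence but usually need curated pretraining and two-stage tuning, and their context is bounded by the capability of a small model. The paper aims to use the strengths of larger models in instruction following and multimodal understanding to supply richer context to smaller models and thereby improve video reasoning.

Result: Experiments on several related benchmarks show consistent accuracy gains and strong generalization.

Insight: The core novelty is an observation-level intervention (a teacher-student framework) together with the Robust Improvement Reward (RIR). The method preserves on-policy exploration while steering it along causally meaningful directions with minimal changes to the training stack, which effectively improves video reasoning performance.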

Abstract: Reinforcement learning has advanced video reasoning in large multi-modal models, yet dominant pipelines either rely on on-policy self-exploration, which plateaus at the model’s knowledge boundary, or hybrid replay that mixes policies and demands careful regularization. Dynamic context methods zoom into focused evidence but often require curated pretraining and two-stage tuning, and their context remains bounded by a small model’s capability. In contrast, larger models excel at instruction following and multi-modal understanding, can supply richer context to smaller models, and rapidly zoom in on target regions via simple tools. Building on this capability, we introduce an observation-level intervention: a frozen, tool-integrated teacher identifies the missing spatiotemporal dependency and provides a minimal evidence patch (e.g., timestamps, regions etc.) from the original video while the question remains unchanged. The student answers again with the added context, and training updates with a chosen-rollout scheme integrated into Group Relative Policy Optimization (GRPO). We further propose a Robust Improvement Reward (RIR) that aligns optimization with two goals: outcome validity through correct answers and dependency alignment through rationales that reflect the cited evidence. Advantages are group-normalized across the batch, preserving on-policy exploration while directing it along causally meaningful directions with minimal changes to the training stack. Experiments on various related benchmarks show consistent accuracy gains and strong generalization. Web page and source code will be available at https://github.com/JethroJames/FFR.git.
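
The group normalization mentioned in the abstract can be sketched as below: each rollout's reward is centered on the group mean and scaled by the group standard deviation, as is common in GRPO-style training. The reward values, the `eps` constant, and the function name are assumptions for illustration; how RIR combines outcome validity with dependency alignment is not specified in the abstract and is not modeled here.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-8):
    """Center rewards on the group mean and scale by the group std deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(reward - mu) / (sigma + eps) for reward in rewards]


# Scalar rewards for four rollouts of the same question (invented values).
advantages = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])

# When every rollout earns the same reward, no rollout is preferred.
flat = group_normalized_advantages([0.5, 0.5, 0.5])
```

Rollouts that beat the group average receive positive advantages and the rest negative ones, so the update direction depends only on relative quality within the group.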


[62] Where Do Vision-Language Models Fail? World Scale Analysis for Image Geolocalization cs.CVPDF

Siddhant Bharadwaj, Ashish Vashist, Fahimul Aleem, Shruti Vyas

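TL;DR: This paper systematically evaluates the zero-shot reasoning of several state-of-the-art vision-language models (VLMs) on country-level image geolocalization using ground-view imagery only, revealing both the potential of semantic reasoning and the models' limitations in capturing fine-grained geographic cues.

Details

Motivation: Traditional image geolocalization relies on retrieval-based place recognition or geometry-based visual localization pipelines. VLMs have shown strong zero-shot reasoning across multimodal tasks, but their performance in geographic inference remains underexplored.

Result: Tests on three geographically diverse datasets show substantial performance variation across models, indicating that current VLMs have potential for coarse geolocalization but remain limited in capturing fine-grained geographic cues.

Insight: The paper offers the first focused comparison of modern VLMs for country-level geolocalization and lays a foundation for research at the intersection of multimodal reasoning and geographic understanding. It highlights the potential value of semantic reasoning for geolocalization and the room for improvement in existing models.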

Abstract: Image geolocalization has traditionally been addressed through retrieval-based place recognition or geometry-based visual localization pipelines. Recent advances in Vision-Language Models (VLMs) have demonstrated strong zero-shot reasoning capabilities across multimodal tasks, yet their performance in geographic inference remains underexplored. In this work, we present a systematic evaluation of multiple state-of-the-art VLMs for country-level image geolocalization using ground-view imagery only. Instead of relying on image matching, GPS metadata, or task-specific training, we evaluate prompt-based country prediction in a zero-shot setting. The selected models are tested on three geographically diverse datasets to assess their robustness and generalization ability. Our results reveal substantial variation across models, highlighting the potential of semantic reasoning for coarse geolocalization and the limitations of current VLMs in capturing fine-grained geographic cues. This study provides the first focused comparison of modern VLMs for country-level geolocalization and establishes a foundation for future research at the intersection of multimodal reasoning and geographic understanding.


[63] Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap cs.CV | cs.CLPDF

Yige Xu, Yongjie Wang, Zizhuo Wu, Kaisong Song, Jun Lin

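TL;DR: The paper introduces the CrossMath benchmark to systematically assess whether vision-language models (VLMs) truly perform visual reasoning. It finds that current VLMs reason mainly through their textual backbones and that visual input can even degrade performance, indicating a substantial modality gap.

Details

Motivation: To determine whether the superior performance of VLMs stems from genuine vision-grounded reasoning or relies predominantly on the reasoning capability of their textual backbones, and thus to clarify how much these models actually use the visual modality.

Result: Extensive evaluation on CrossMath shows that VLMs excel with text-only inputs, while adding visual data (image+text) frequently degrades performance relative to the text-only baseline, indicating a consistent gap between textual and visual reasoning. Fine-tuning on the CrossMath training set significantly boosts reasoning performance across all modalities and yields robust gains on two general visual reasoning tasks.

Insight: The novelty is CrossMath, a rigorously aligned cross-modal comparison benchmark that isolates modality-specific differences and shows that current VLMs rely only to a limited degree on visual evidence. A practical takeaway is that targeted fine-tuning can effectively mitigate the modality gap and improve multimodal reasoning.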

Abstract: Reasoning in vision-language models (VLMs) has recently attracted significant attention due to its broad applicability across diverse downstream tasks. However, it remains unclear whether the superior performance of VLMs stems from genuine vision-grounded reasoning or relies predominantly on the reasoning capabilities of their textual backbones. To systematically measure this, we introduce CrossMath, a novel multimodal reasoning benchmark designed for controlled cross-modal comparisons. Specifically, we construct each problem in text-only, image-only, and image+text formats guaranteeing identical task-relevant information, verified by human annotators. This rigorous alignment effectively isolates modality-specific reasoning differences while eliminating confounding factors such as information mismatch. Extensive evaluation of state-of-the-art VLMs reveals a consistent phenomenon: a substantial performance gap between textual and visual reasoning. Notably, VLMs excel with text-only inputs, whereas incorporating visual data (image+text) frequently degrades performance compared to the text-only baseline. These findings indicate that current VLMs conduct reasoning primarily in the textual space, with limited genuine reliance on visual evidence. To mitigate this limitation, we curate a CrossMath training set for VLM fine-tuning. Empirical evaluations demonstrate that fine-tuning on this training set significantly boosts reasoning performance across all individual and joint modalities, while yielding robust gains on two general visual reasoning tasks. Source code is available at https://github.com/xuyige/CrossMath.


[64] VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects cs.CV | cs.AI | cs.CLPDF

Xiangbo Gao, Sicong Jiang, Bangya Liu, Xinghao Chen, Minglai Yang

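TL;DR: The paper proposes VEFX-Bench, a holistic benchmark for generic video editing and visual effects. It includes VEFX-Dataset, a large-scale human-annotated dataset; VEFX-Reward, a reward model specialized for video editing quality assessment; and a benchmark set for standardized comparison of editing systems.

Details

Motivation: AI-assisted video creation lacks both a large-scale human-annotated dataset with complete editing examples and a standardized, specialized automatic evaluator. Existing resources are small or lack edited outputs or human quality labels, while evaluation often relies on expensive manual inspection or on generic vision-language models that are not specialized for editing quality.

Result: Experiments show that VEFX-Reward aligns more closely with human judgments than generic VLM judges and prior reward models, on both standard IQA/VQA metrics and group-wise preference evaluation. Using VEFX-Reward as the evaluator, a benchmark of representative commercial and open-source video editing systems reveals persistent gaps in visual plausibility, instruction following, and edit locality.

Insight: The main novelty is a large-scale video editing dataset with human labels along multiple dimensions (Instruction Following, Rendering Quality, Edit Exclusivity), and a reward model trained on it that jointly processes the source video, the instruction, and the edited video to assess editing quality. Together they give the field a standardized benchmark and evaluation tools.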

Abstract: As AI-assisted video creation becomes increasingly practical, instruction-guided video editing has become essential for refining generated or captured footage to meet professional requirements. Yet the field still lacks both a large-scale human-annotated dataset with complete editing examples and a standardized evaluator for comparing editing systems. Existing resources are limited by small scale, missing edited outputs, or the absence of human quality labels, while current evaluation often relies on expensive manual inspection or generic vision-language model judges that are not specialized for editing quality. We introduce VEFX-Dataset, a human-annotated dataset containing 5,049 video editing examples across 9 major editing categories and 32 subcategories, each labeled along three decoupled dimensions: Instruction Following, Rendering Quality, and Edit Exclusivity. Building on VEFX-Dataset, we propose VEFX-Reward, a reward model designed specifically for video editing quality assessment. VEFX-Reward jointly processes the source video, the editing instruction, and the edited video, and predicts per-dimension quality scores via ordinal regression. We further release VEFX-Bench, a benchmark of 300 curated video-prompt pairs for standardized comparison of editing systems. Experiments show that VEFX-Reward aligns more strongly with human judgments than generic VLM judges and prior reward models on both standard IQA/VQA metrics and group-wise preference evaluation. Using VEFX-Reward as an evaluator, we benchmark representative commercial and open-source video editing systems, revealing a persistent gap between visual plausibility, instruction following, and edit locality in current models.


[65] Information Router for Mitigating Modality Dominance in Vision-Language Models cs.CV | cs.LGPDF

Seulgi Kim, Mohit Prabhushankar, Ghassan AlRegib

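TL;DR: This paper proposes MoIR (Multi-modal Information Router), a method for mitigating modality dominance in vision-language models. MoIR explicitly reduces the information disparity between modalities before fusion: it identifies less informative tokens and routes complementary information to them from the stronger modality, building information-dense token representations that make multimodal reasoning more balanced and robust.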

Details

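Motivation: Existing methods mitigate modality dominance mainly by adjusting how the model allocates attention, but attention only determines where the model looks; it cannot supply information that is missing or ambiguous. Real-world input modalities often differ in information density and signal-to-noise ratio, so adjusting attention alone does not address the underlying lack of information.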

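Result: MoIR is evaluated on three widely used multimodal benchmarks across multiple model backbones. It consistently yields more balanced modality contributions and improves robustness and downstream performance, particularly under modality degradation.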

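Insight: The contribution is an information-level fusion method that explicitly modifies cross-modal information availability before fusion. It offers an effective strategy that complements attention-based approaches by addressing the information disparity directly rather than relying on attention alone.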

Abstract: Vision-language models (VLMs) have demonstrated strong performance across a wide range of benchmarks, yet they often suffer from modality dominance, where predictions rely disproportionately on a single modality. Prior approaches primarily address this issue by steering the model’s attention allocation, implicitly assuming that all modalities provide sufficient information. However, attention only determines where the model focuses, and cannot enrich information that is missing or ambiguous. In the real world, input modalities often differ in information density and signal-to-noise ratio. In such cases, simply adjusting the model’s attention does not resolve the underlying lack of information. In this paper, we propose MoIR (Multi-modal Information Router), an information-level fusion method that explicitly reduces information disparity prior to fusion. MoIR identifies less informative tokens and routes complementary information from a stronger modality, constructing information-dense token representations before they are processed by a large language model. By modifying information availability, MoIR enables reliable shifts in modality dominance, even when one modality is degraded. We evaluate MoIR on three widely used multi-modal benchmarks across multiple model backbones. Experimental results show that MoIR consistently demonstrates more balanced modality contribution, and improves robustness and downstream performance, particularly under modality degradation. These findings demonstrate that explicitly modifying cross-modal information is an effective and complementary strategy for mitigating modality dominance in multi-modal reasoning models.


[66] Enhancing Hazy Wildlife Imagery: AnimalHaze3k and IncepDehazeGan cs.CVPDF

Shivarth Rai, Tejeswar Pokuri

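TL;DR: Atmospheric haze severely degrades wildlife imagery and hinders computer vision applications in conservation biology. This paper introduces AnimalHaze3k, a synthetic dataset, and IncepDehazeGan, a dehazing architecture that combines inception blocks with residual skip connections in a GAN framework. The network achieves state-of-the-art dehazing performance and substantially improves downstream object detection accuracy.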

Details

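Motivation: Atmospheric haze significantly degrades wildlife imagery, impeding computer vision applications that matter for conservation, such as animal detection, tracking, and behavior analysis. Addressing this requires dehazing methods and data tailored to wildlife scenes.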

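Result: IncepDehazeGan achieves state-of-the-art dehazing performance (SSIM: 0.8914, PSNR: 20.54, LPIPS: 0.1104), with 6.27% higher SSIM and 10.2% better PSNR than competing approaches. On downstream detection, dehazed images improve YOLOv11 detection mAP by 112% and IoU by 67%.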

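Insight: The main contributions are: 1) AnimalHaze3k, a synthetic haze dataset built specifically for wildlife scenes (3,477 hazy images generated from 1,159 clear photographs); 2) the IncepDehazeGan architecture, which combines inception blocks with residual skip connections in a GAN framework to better handle multi-scale haze features. Together they provide a visual analytics tool for more reliable wildlife monitoring under adverse environmental conditions.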

Abstract: Atmospheric haze significantly degrades wildlife imagery, impeding computer vision applications critical for conservation, such as animal detection, tracking, and behavior analysis. To address this challenge, we introduce AnimalHaze3k, a synthetic dataset comprising 3,477 hazy images generated from 1,159 clear wildlife photographs through a physics-based pipeline. Our novel IncepDehazeGan architecture combines inception blocks with residual skip connections in a GAN framework, achieving state-of-the-art performance (SSIM: 0.8914, PSNR: 20.54, and LPIPS: 0.1104), delivering 6.27% higher SSIM and 10.2% better PSNR than competing approaches. When applied to downstream detection tasks, dehazed images improved YOLOv11 detection mAP by 112% and IoU by 67%. These advances can provide ecologists with reliable tools for population monitoring and surveillance in challenging environmental conditions, demonstrating significant potential for enhancing wildlife conservation efforts through robust visual analytics.
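For readers unfamiliar with the PSNR figure quoted above, the following is a minimal sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE), in decibels. It operates on flat lists of pixel values for simplicity; the function name and the 8-bit default range are assumptions made for illustration, and the paper's own evaluation code may differ (for example in color handling).

```python
import math


def psnr(reference: list[float], restored: list[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if not reference or len(reference) != len(restored):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between the clear reference and the dehazed output.
    mse = sum((r - d) ** 2 for r, d in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    clear = [52.0, 120.0, 200.0, 33.0]
    dehazed = [50.0, 124.0, 197.0, 35.0]
    print(f"PSNR: {psnr(clear, dehazed):.2f} dB")
```

Higher PSNR means the restored image is closer to the reference; SSIM and LPIPS capture structural and perceptual similarity, which PSNR alone does not.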


[67] FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation cs.CV | cs.ROPDF

Dian Shao, Zhengzheng Xu, Peiyang Wang, Like Liu, Yule Wang

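TL;DR: This paper proposes FineCog-Nav, a zero-shot UAV vision-language navigation framework inspired by human cognition. It decomposes navigation into fine-grained cognitive modules for language processing, perception, attention, memory, imagination, reasoning, and decision-making. Each module is driven by a moderate-sized foundation model, and the modules collaborate through role-specific prompts and structured protocols.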

Details

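Motivation: Existing zero-shot methods struggle to follow ambiguous, multi-step instructions over long horizons in complex 3D environments. They typically rely on large base models, generic prompts, and loosely coordinated modules, which limits performance.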

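Result: On AerialVLN-Fine, a fine-grained evaluation benchmark of 300 trajectories constructed by the authors, FineCog-Nav consistently outperforms zero-shot baselines in instruction adherence, long-horizon planning, and generalization to unseen environments.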

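Insight: The contribution is a top-down, fine-grained cognitive modular architecture that splits navigation into specialized modules and coordinates them through role-specific prompts and structured protocols. This improves interpretability and zero-shot performance and suggests a general approach to decomposing complex tasks.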

Abstract: UAV vision-language navigation (VLN) requires an agent to navigate complex 3D environments from an egocentric perspective while following ambiguous multi-step instructions over long horizons. Existing zero-shot methods remain limited, as they often rely on large base models, generic prompts, and loosely coordinated modules. In this work, we propose FineCog-Nav, a top-down framework inspired by human cognition that organizes navigation into fine-grained modules for language processing, perception, attention, memory, imagination, reasoning, and decision-making. Each module is driven by a moderate-sized foundation model with role-specific prompts and structured input-output protocols, enabling effective collaboration and improved interpretability. To support fine-grained evaluation, we construct AerialVLN-Fine, a curated benchmark of 300 trajectories derived from AerialVLN, with sentence-level instruction-trajectory alignment and refined instructions containing explicit visual endpoints and landmark references. Experiments show that FineCog-Nav consistently outperforms zero-shot baselines in instruction adherence, long-horizon planning, and generalization to unseen environments. These results suggest the effectiveness of fine-grained cognitive modularization for zero-shot aerial navigation. Project page: https://smartdianlab.github.io/projects-FineCogNav.


[68] Repurposing 3D Generative Model for Autoregressive Layout Generation cs.CVPDF

Haoran Feng, Yifan Niu, Zehuan Huang, Yang-Tian Sun, Chunchao Guo

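TL;DR: This paper proposes LaviGen, a framework that repurposes 3D generative models for 3D layout generation. It operates directly in native 3D space and formulates layout generation as an autoregressive process that explicitly models geometric relations and physical constraints among objects, producing coherent and physically plausible 3D scenes.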

Details

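Motivation: Existing methods typically infer object layouts from textual descriptions. This work instead aims to generate layouts directly in 3D space so that they better respect geometric and physical constraints.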

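Result: Extensive experiments on the LayoutVLM benchmark show that LaviGen achieves superior 3D layout generation performance, with 19% higher physical plausibility than the state of the art and 65% faster computation.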

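Insight: The contributions are repurposing a 3D generative model for autoregressive layout generation, an adapted 3D diffusion model that integrates scene, object, and instruction information, and a dual-guidance self-rollout distillation mechanism that improves efficiency and spatial accuracy.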

Abstract: We introduce LaviGen, a framework that repurposes 3D generative models for 3D layout generation. Unlike previous methods that infer object layouts from textual descriptions, LaviGen operates directly in the native 3D space, formulating layout generation as an autoregressive process that explicitly models geometric relations and physical constraints among objects, producing coherent and physically plausible 3D scenes. To further enhance this process, we propose an adapted 3D diffusion model that integrates scene, object, and instruction information and employs a dual-guidance self-rollout distillation mechanism to improve efficiency and spatial accuracy. Extensive experiments on the LayoutVLM benchmark show LaviGen achieves superior 3D layout generation performance, with 19% higher physical plausibility than the state of the art and 65% faster computation. Our code is publicly available at https://github.com/fenghora/LaviGen.


cs.SD [Back]

[69] Hierarchical Codec Diffusion for Video-to-Speech Generation cs.SD | cs.CVPDF

Jiaxin Ye, Gaoxiang Cong, Chenhui Wang, Xin-Cheng Wen, Zhaoyang Li

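TL;DR: This paper proposes HiCoDiT, a Hierarchical Codec Diffusion Transformer for generating speech from silent video. It exploits the hierarchical structure of a Residual Vector Quantization (RVQ)-based codec: low-level and high-level blocks generate discrete speech tokens that encode coarse speaker-aware semantics and fine-grained prosodic details, respectively. Combined with a dual-scale adaptive instance layer normalization, this yields better audio-visual alignment and speech synthesis quality.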

Details

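Motivation: Existing video-to-speech methods disregard the hierarchical nature of speech, which spans coarse speaker-aware semantics to fine-grained prosodic details. As a result, visual and speech features are hard to align directly at specific hierarchical levels during property matching.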

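Result: Extensive experiments show that HiCoDiT outperforms baselines in both fidelity and expressiveness, highlighting the potential of discrete modeling for video-to-speech generation.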

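Insight: The contributions are using the inherent hierarchy of the RVQ codec to guide hierarchical generation, and a dual-scale adaptive instance layer normalization that captures global vocal style through channel-wise normalization and local prosody dynamics through temporal-wise normalization, enabling more effective coarse-to-fine conditioning.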

Abstract: Video-to-Speech (VTS) generation aims to synthesize speech from a silent video without auditory signals. However, existing VTS methods disregard the hierarchical nature of speech, which spans coarse speaker-aware semantics to fine-grained prosodic details. This oversight hinders direct alignment between visual and speech features at specific hierarchical levels during property matching. In this paper, leveraging the hierarchical structure of Residual Vector Quantization (RVQ)-based codec, we propose HiCoDiT, a novel Hierarchical Codec Diffusion Transformer that exploits the inherent hierarchy of discrete speech tokens to achieve strong audio-visual alignment. Specifically, since lower-level tokens encode coarse speaker-aware semantics and higher-level tokens capture fine-grained prosody, HiCoDiT employs low-level and high-level blocks to generate tokens at different levels. The low-level blocks condition on lip-synchronized motion and facial identity to capture speaker-aware content, while the high-level blocks use facial expression to modulate prosodic dynamics. Finally, to enable more effective coarse-to-fine conditioning, we propose a dual-scale adaptive instance layer normalization that jointly captures global vocal style through channel-wise normalization and local prosody dynamics through temporal-wise normalization. Extensive experiments demonstrate that HiCoDiT outperforms baselines in fidelity and expressiveness, highlighting the potential of discrete modelling for VTS. The code and speech demo are both available at https://github.com/Jiaxin-Ye/HiCoDiT.


cs.MA [Back]

[70] AstroVLM: Expert Multi-agent Collaborative Reasoning for Astronomical Imaging Quality Diagnosis cs.MA | cs.CVPDF

Yaohui Han, Tianshuo Wang, Zixi Zhao, Zhengchun Zhu, Shuo Ren

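TL;DR: This paper proposes AstroVLM, a collaborative multi-agent system for diagnosing the quality of astronomical images. The system uses vision-language models (VLMs) to handle astronomical imaging, a complex problem involving multidisciplinary knowledge and several subtasks, and relies on collaborative multi-agent reasoning to diagnose image quality and localize errors. Experiments show that AstroVLM outperforms all baselines on real-world astronomical imaging quality diagnosis tasks.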

Details

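Motivation: Astronomical imaging is a complex problem involving multidisciplinary knowledge and several subtasks. Its processing steps have complex underlying correlations and strongly influence one another, which makes quality diagnosis and error localization for astronomical images very challenging. Both world-class astronomical organizations such as NASA and expert enthusiasts currently devote a great deal of time and effort to it. VLMs have shown strong problem-solving capabilities in specific domains but have not yet been adequately applied to astronomical imaging.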

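Result: Experimental results show that AstroVLM outperforms all baselines on real-world astronomical imaging quality diagnosis tasks, providing a reference for how language models can handle complicated multi-process tasks.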

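Insight: The paper applies a collaborative multi-agent reasoning framework to quality diagnosis in astronomical imaging, a complex, multi-process domain. Viewed objectively, its core contribution is a collaborative system architecture designed around the intrinsic complexity of the imaging process and the interdependence of its subtasks, which may serve as a template for applying VLMs to other real-world problems that require multi-step, multidisciplinary knowledge integration.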

Abstract: Vision Language Models (VLMs) have been applied to several specific domains and have shown strong problem-solving capabilities. However, astronomical imaging, a complex problem involving multidisciplinary knowledge and several subtasks, has not been adequately studied. Because the astronomical imaging process is so complex, both world-class astronomical organizations, such as NASA, and expert enthusiasts devote a great deal of time and effort to it. The processes in astronomical imaging have complex underlying correlations that significantly influence one another, making the quality diagnosis and error localization of astronomical images challenging. To address this problem, we propose AstroVLM, a collaborative multi-agent system for diagnosing the quality of astronomical images. Experimental results show that AstroVLM outperforms all baselines on real-world astronomical imaging quality diagnosis tasks, providing a reference for language models to handle complicated multi-process tasks.


cs.RO [Back]

[71] GaussianFlow SLAM: Monocular Gaussian Splatting SLAM Guided by GaussianFlow cs.RO | cs.CVPDF

Dong-Uk Seo, Jinwoo Jeon, Eungchang Mason Lee, Hyun Myung

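TL;DR: This paper proposes GaussianFlow SLAM, a monocular 3D Gaussian splatting SLAM system that uses optical flow as a geometry-aware cue to guide the optimization of scene structure and camera poses. This addresses the structural degeneracies and inaccuracies caused by the lack of reliable geometric supervision in monocular input. The paper also introduces normalized error-based densification and pruning modules to improve map quality.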

Details

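Motivation: Using Gaussian splatting as the map representation in monocular SLAM is challenging because monocular input lacks reliable geometric cues. Mapping and tracking can therefore fall into local minima, producing structural degeneracies and inaccuracies.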

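Result: Experiments on public datasets show that the method achieves better rendering quality and tracking accuracy than state-of-the-art algorithms.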

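Insight: The contributions are aligning the projected motion of Gaussians (termed GaussianFlow) with optical flow as a geometric regularizer that constrains both scene reconstruction and pose estimation, and normalized error-driven densification and pruning that refine the Gaussians and improve robustness and accuracy.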

Abstract: Gaussian splatting has recently gained traction as a compelling map representation for SLAM systems, enabling dense and photo-realistic scene modeling. However, its application to monocular SLAM remains challenging due to the lack of reliable geometric cues from monocular input. Without geometric supervision, mapping or tracking can fall into local minima, resulting in structural degeneracies and inaccuracies. To address this challenge, we propose GaussianFlow SLAM, a monocular 3DGS-SLAM that leverages optical flow as a geometry-aware cue to guide the optimization of both the scene structure and camera poses. By encouraging the projected motion of Gaussians, termed GaussianFlow, to align with the optical flow, our method introduces consistent structural cues to regularize both map reconstruction and pose estimation. Furthermore, we introduce normalized error-based densification and pruning modules to refine inactive and unstable Gaussians, thereby contributing to improved map quality and pose accuracy. Experiments conducted on public datasets demonstrate that our method achieves superior rendering quality and tracking accuracy compared with state-of-the-art algorithms. The source code is available at: https://github.com/url-kaist/gaussianflow-slam.


[72] DENALI: A Dataset Enabling Non-Line-of-Sight Spatial Reasoning with Low-Cost LiDARs cs.RO | cs.CVPDF

Nikhil Behari, Diego Rivero, Luke Apostolides, Suman Ghosh, Paul Pu Liang

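TL;DR: This paper presents DENALI, the first large-scale real-world dataset of space-time histograms from low-cost LiDARs capturing hidden objects, intended to enable non-line-of-sight (NLOS) perception through data-driven methods.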

Details

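Motivation: Hardware limitations make NLOS reconstruction with conventional methods difficult on low-cost consumer LiDARs. The paper explores a complementary direction: achieving NLOS perception through data-driven inference.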

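Result: Experiments on the dataset show that consumer LiDARs can support accurate, data-driven NLOS perception. The paper also identifies key scene and modeling factors that limit performance, as well as simulation-fidelity gaps that hinder current sim-to-real transfer.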

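Insight: The contribution is the first large-scale real-world NLOS dataset, which advances data-driven NLOS perception with low-cost LiDARs, exposes the gap between simulation and reality, and points toward scalable NLOS vision with consumer LiDARs.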

Abstract: Consumer LiDARs in mobile devices and robots typically output a single depth value per pixel. Yet internally, they record full time-resolved histograms containing direct and multi-bounce light returns; these multi-bounce returns encode rich non-line-of-sight (NLOS) cues that can enable perception of hidden objects in a scene. However, severe hardware limitations of consumer LiDARs make NLOS reconstruction with conventional methods difficult. In this work, we motivate a complementary direction: enabling NLOS perception with low-cost LiDARs through data-driven inference. We present DENALI, the first large-scale real-world dataset of space-time histograms from low-cost LiDARs capturing hidden objects. We capture time-resolved LiDAR histograms for 72,000 hidden-object scenes across diverse object shapes, positions, lighting conditions, and spatial resolutions. Using our dataset, we show that consumer LiDARs can enable accurate, data-driven NLOS perception. We further identify key scene and modeling factors that limit performance, as well as simulation-fidelity gaps that hinder current sim-to-real transfer, motivating future work toward scalable NLOS vision with consumer LiDARs.


cs.IR [Back]

[73] Rethinking the Necessity of Adaptive Retrieval-Augmented Generation through the Lens of Adaptive Listwise Ranking cs.IR | cs.AI | cs.CLPDF

Jun Feng, Jiahui Tang, Zhicheng He, Hang Lv, Hongchao Gu

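TL;DR: This paper re-evaluates the necessity of adaptive retrieval-augmented generation (adaptive RAG) and proposes AdaRankLLM, a new adaptive retrieval framework. To test whether adaptive listwise reranking is necessary, it builds an adaptive ranker from a zero-shot prompt with a passage dropout mechanism and compares it against static fixed-depth retrieval strategies. To give smaller open-source LLMs precise listwise ranking and adaptive filtering capabilities, the authors introduce a two-stage progressive distillation paradigm enhanced by data sampling and augmentation. Extensive experiments across three datasets and eight LLMs show that AdaRankLLM achieves the best performance in most scenarios with significantly reduced context overhead.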

Details

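Motivation: As large language models grow more robust to noise, the necessity of adaptive retrieval needs to be re-evaluated. Adaptive RAG mitigates interference from extraneous noise by dynamically deciding whether supplementary passages need to be retrieved; this paper rethinks what role that mechanism now plays.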

Result: Experiments across three datasets and eight LLMs show that AdaRankLLM achieves the best performance in most scenarios with significantly reduced context overhead.

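Insight: The contributions are the AdaRankLLM framework, whose adaptive ranker combines a zero-shot prompt with a passage dropout mechanism, and a two-stage progressive distillation paradigm that strengthens the ranking and filtering abilities of smaller LLMs. The analysis reveals a role shift in adaptive retrieval: for weaker models it is a critical noise filter that helps them overcome their limitations, while for stronger reasoning models it is a cost-effective efficiency optimizer.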

Abstract: Adaptive Retrieval-Augmented Generation aims to mitigate the interference of extraneous noise by dynamically determining the necessity of retrieving supplementary passages. However, as Large Language Models evolve with increasing robustness to noise, the necessity of adaptive retrieval warrants re-evaluation. In this paper, we rethink this necessity and propose AdaRankLLM, a novel adaptive retrieval framework. To effectively verify the necessity of adaptive listwise reranking, we first develop an adaptive ranker employing a zero-shot prompt with a passage dropout mechanism, and compare its generation outcomes against static fixed-depth retrieval strategies. Furthermore, to endow smaller open-source LLMs with this precise listwise ranking and adaptive filtering capability, we introduce a two-stage progressive distillation paradigm enhanced by data sampling and augmentation techniques. Extensive experiments across three datasets and eight LLMs demonstrate that AdaRankLLM consistently achieves optimal performance in most scenarios with significantly reduced context overhead. Crucially, our analysis reveals a role shift in adaptive retrieval: it functions as a critical noise filter for weaker models to overcome their limitations, while serving as a cost-effective efficiency optimizer for stronger reasoning models.


eess.IV [Back]

[74] RelativeFlow: Taming Medical Image Denoising Learning with Noisy Reference eess.IV | cs.AI | cs.CVPDF

Yuxin Liu, Yiqing Dong, Wenxue Yu, Zhan Wu, Rongjun Ge

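TL;DR: This paper proposes RelativeFlow, a flow matching framework for medical image denoising that addresses the "noisy reference" problem: the lack of absolutely clean images for supervision. It decomposes the absolute denoising mapping into a series of relative denoising mappings and, through two core components (consistent transport and a simulation-based velocity field), drives inputs of arbitrary quality toward a unified high-quality target.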

Details

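Motivation: Medical image denoising lacks absolutely clean images as supervision targets, leading to a "noisy reference" problem that fundamentally limits denoising performance. Existing simulated-supervised discriminative and generative learning treat noisy references as clean targets, causing suboptimal convergence or reference-biased learning, while self-supervised learning imposes restrictive noise assumptions that are seldom satisfied in realistic medical imaging scenarios.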

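Result: Extensive experiments on computed tomography and magnetic resonance image denoising show that RelativeFlow significantly outperforms existing methods when learning from noisy references.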

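Insight: The core contribution is recasting absolute denoising as learning relative denoising mappings. Consistent transport ensures that the relative flows progressively compose a unified absolute flow, and the simulation-based velocity field supports different medical imaging modalities. This offers a flexible framework for medical image processing problems that lack gold-standard supervision.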

Abstract: Medical image denoising (MID) lacks absolutely clean images for supervision, leading to a noisy reference problem that fundamentally limits denoising performance. Existing simulated-supervised discriminative learning (SimSDL) and simulated-supervised generative learning (SimSGL) treat noisy references as clean targets, causing suboptimal convergence or reference-biased learning, while self-supervised learning (SSL) imposes restrictive noise assumptions that are seldom satisfied in realistic MID scenarios. We propose RelativeFlow, a flow matching framework that learns from heterogeneous noisy references and drives inputs from arbitrary quality levels toward a unified high-quality target. RelativeFlow reformulates flow matching by decomposing the absolute noise-to-clean mapping into relative noisier-to-noisy mappings, and realizes this formulation through two key components: 1) consistent transport (CoT), a displacement map that constrains relative flows to be components of and progressively compose a unified absolute flow, and 2) simulation-based velocity field (SVF), which constructs a learnable velocity field using modality-specific degradation operators to support different medical imaging modalities. Extensive experiments on Computed Tomography (CT) and Magnetic Resonance (MR) denoising demonstrate that RelativeFlow significantly outperforms existing methods, taming MID with noisy references.


[75] CTSCAN: Evaluation Leakage in Chest CT Segmentation and a Reproducible Patient-Disjoint Benchmark eess.IV | cs.CVPDF

Anton Ivchenko

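TL;DR: This paper presents CTSCAN, a reproducible multi-source chest CT segmentation benchmark and research stack designed to measure real model performance under patient-disjoint splits. It finds that reported segmentation performance is strongly inflated when train and test partitions mix slices from the same study. The paper quantifies the effect of this evaluation leakage experimentally and provides a benchmark framework with deterministic data splits, weak-supervision controls, and a reproducible experimental pipeline.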

Details

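Motivation: In chest CT segmentation research, train and test sets are often not split strictly at the patient (case) level, which inflates reported performance through evaluation leakage. The paper aims to establish a reproducible, rigorous benchmark that reflects how models generalize to genuinely unseen patients.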

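Result: On the CTSCAN benchmark, the same FPN plus EfficientNet-B0 configuration is evaluated under two protocols. With slice-mixed splits it reaches 0.6665 foreground Dice and 0.5031 foreground IoU; with case-disjoint splits it reaches 0.2066 foreground Dice and 0.1181 foreground IoU. Removing patient reuse therefore reduces foreground Dice by 0.4599 absolute (69.00% relative) and foreground IoU by 0.3850 absolute (76.52% relative).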

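Insight: The core contribution is systematically exposing and quantifying the severe evaluation leakage that improper data splitting causes in medical image segmentation, specifically chest CT. The CTSCAN framework (deterministic split manifests, weak-supervision controls, a scripted multi-seed protocol sweep, and reproducible figure generation) gives future work a rigorous, reproducible evaluation standard and underlines that case-level splitting is necessary for measuring real generalization.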

Abstract: Reported chest CT segmentation performance can be strongly inflated when train and test partitions mix slices from the same study. We present CTSCAN, a reproducible multi-source chest CT benchmark and research stack designed to measure what survives under patient-disjoint evaluation. The current four-class artifact aggregates 89 cases from PleThora, MedSeg SIRM, and LongCIU, and we show that the original slice-PNG workflow induces near-complete case reuse across train, validation, and test. Using the playground environment, we run a multi-seed protocol sweep with the same FPN plus EfficientNet-B0 control configuration under slice-mixed and case-disjoint evaluation. Across 3 seeds and 12 epochs per seed, the slice-mixed protocol reaches 0.6665 foreground Dice and 0.5031 foreground IoU, whereas the case-disjoint protocol reaches 0.2066 Dice and 0.1181 IoU. Removing patient reuse therefore reduces foreground Dice by 0.4599 absolute (69.00% relative) and foreground IoU by 0.3850 absolute (76.52% relative). CTSCAN packages the corrected benchmark with deterministic split manifests, explicit weak-supervision controls, a scripted multi-seed protocol sweep, and reproducible figure generation, providing a reusable basis for patient-disjoint chest CT evaluation.
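The difference between the two protocols can be sketched as follows. This is a minimal illustration, not CTSCAN's actual tooling: the slice naming scheme, the `case_disjoint_split` helper, and the 20% test fraction are all hypothetical. The idea is to assign whole cases to a partition so that no patient contributes slices to both train and test; a Dice helper on binary masks is included for reference.

```python
import random


def case_disjoint_split(slice_to_case: dict[str, str], test_fraction: float = 0.2,
                        seed: int = 0) -> tuple[list[str], list[str]]:
    """Split slice ids so that every case lands entirely in train or in test."""
    cases = sorted(set(slice_to_case.values()))
    rng = random.Random(seed)  # fixed seed keeps the split deterministic
    rng.shuffle(cases)
    n_test = max(1, round(len(cases) * test_fraction))
    test_cases = set(cases[:n_test])
    train = sorted(s for s, c in slice_to_case.items() if c not in test_cases)
    test = sorted(s for s, c in slice_to_case.items() if c in test_cases)
    return train, test


def dice(pred: list[int], target: list[int]) -> float:
    """Dice coefficient for two flat binary masks of equal length."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if total == 0 else 2.0 * intersection / total


if __name__ == "__main__":
    # Hypothetical toy data: 5 cases with 4 slices each.
    slices = {f"case{c}_slice{s}": f"case{c}" for c in range(5) for s in range(4)}
    train, test = case_disjoint_split(slices)
    shared = {slices[s] for s in train} & {slices[s] for s in test}
    print(len(train), len(test), "shared cases:", len(shared))
```

A slice-mixed split, by contrast, shuffles individual slices, so adjacent slices from one scan can end up on both sides of the split and the test score partly measures memorization of that patient's anatomy.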


[76] Dual-Modal Lung Cancer AI: Interpretable Radiology and Microscopy with Clinical Risk Integration eess.IV | cs.AI | cs.CVPDF

Baramee Sukumal, Aueaphum Aueawatthanaphisut

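TL;DR: This study proposes a dual-modal AI framework for lung cancer diagnosis and subtype classification that integrates CT radiology with H&E histopathology images and incorporates clinical metadata to improve robustness. Convolutional neural networks extract features, a weighted decision-level fusion mechanism classifies multiple lung cancer subtypes, and several explainable AI techniques provide visual interpretability.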

Details

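Motivation: Conventional CT imaging has limitations in distinguishing benign from malignant lesions and in providing interpretable diagnostic insights. The study integrates radiology and pathology data to improve both the accuracy and the interpretability of lung cancer diagnosis.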

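Result: On lung cancer subtype classification, the method achieves accuracy up to 0.87, AUROC above 0.97, and a macro F1-score of 0.88. Among the explainability methods, Grad-CAM++ achieves the highest faithfulness and localization accuracy.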

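Insight: The contributions are multimodal fusion of radiology and histopathology combined with clinical metadata, and the systematic application of multiple explainable AI techniques (such as Grad-CAM++) to increase model transparency and clinical trust, suggesting a possible path toward clinical decision support systems in precision oncology.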

Abstract: Lung cancer remains one of the leading causes of cancer-related mortality worldwide. Conventional computed tomography (CT) imaging, while essential for detection and staging, has limitations in distinguishing benign from malignant lesions and providing interpretable diagnostic insights. To address this challenge, this study proposes a dual-modal artificial intelligence framework that integrates CT radiology with hematoxylin and eosin (H&E) histopathology for lung cancer diagnosis and subtype classification. The system employs convolutional neural networks to extract radiologic and histopathologic features and incorporates clinical metadata to improve robustness. Predictions from both modalities are fused using a weighted decision-level integration mechanism to classify adenocarcinoma, squamous cell carcinoma, large cell carcinoma, small cell lung cancer, and normal tissue. Explainable AI techniques including Grad-CAM, Grad-CAM++, Integrated Gradients, Occlusion, Saliency Maps, and SmoothGrad are applied to provide visual interpretability. Experimental results show strong performance with accuracy up to 0.87, AUROC above 0.97, and macro F1-score of 0.88. Grad-CAM++ achieved the highest faithfulness and localization accuracy, demonstrating strong correspondence with expert-annotated tumor regions. These results indicate that multimodal fusion of radiology and histopathology can improve diagnostic performance while maintaining model transparency, suggesting potential for future clinical decision support systems in precision oncology.
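The abstract describes a weighted decision-level integration of the two modality predictions without giving the weights or the exact rule. The sketch below shows one plausible form, a convex combination of per-class probabilities, purely as an assumption; the weight `w_ct`, the function names, and the example probabilities are hypothetical.

```python
CLASSES = [
    "adenocarcinoma",
    "squamous cell carcinoma",
    "large cell carcinoma",
    "small cell lung cancer",
    "normal tissue",
]


def fuse(p_ct: list[float], p_histo: list[float], w_ct: float = 0.5) -> list[float]:
    """Convex combination of two probability vectors, renormalized to sum to 1."""
    if len(p_ct) != len(p_histo):
        raise ValueError("probability vectors must have the same length")
    if not 0.0 <= w_ct <= 1.0:
        raise ValueError("w_ct must lie in [0, 1]")
    fused = [w_ct * a + (1.0 - w_ct) * b for a, b in zip(p_ct, p_histo)]
    total = sum(fused)
    return [f / total for f in fused]


def predict(p_ct: list[float], p_histo: list[float], w_ct: float = 0.5) -> str:
    """Return the class with the highest fused probability."""
    fused = fuse(p_ct, p_histo, w_ct)
    return CLASSES[max(range(len(fused)), key=fused.__getitem__)]


if __name__ == "__main__":
    ct = [0.5, 0.2, 0.1, 0.1, 0.1]      # hypothetical CT branch output
    histo = [0.1, 0.6, 0.1, 0.1, 0.1]   # hypothetical histopathology branch output
    print(predict(ct, histo, w_ct=0.3))
```

Fusing at the decision level keeps the two branches independent, so either one can be retrained or replaced without touching the other; the weight controls how much each modality is trusted.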


cs.HC [Back]

[77] Acoustic and Facial Markers of Perceived Conversational Success in Spontaneous Speech cs.HC | cs.CL | cs.LGPDF

Thanushi Withanage, Elizabeth Redcay, Carol Espy-Wilson

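TL;DR: This study analyzes a large corpus of spontaneous Zoom conversations to examine how multimodal features (turn-taking, pauses, facial movements, and acoustic measures) relate to perceived conversational quality, and finds that entrainment is associated with higher perceived conversational success.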

Details

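Motivation: Entrainment is well documented in task-oriented dialogue, but less is known about it in naturalistic, non-task-oriented, virtual settings. The study examines how entrainment relates to perceived interaction quality in such settings.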

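Result: Perceived conversational success is quantified via factor analysis of post-conversation ratings. Entrainment is reliably detected in spontaneous speech and correlates with higher perceived success, identifying key interactional markers of conversational quality.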

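Insight: The contribution is bringing multimodal features, including facial movements and acoustic measures, into the analysis of naturalistic virtual conversations, which opens opportunities for targeted interventions that foster more effective communication.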

Abstract: Individuals often align their speaking patterns with their interlocutors, a phenomenon linked to engagement and rapport. While well documented in task-oriented dialogues, less is known about entrainment in naturalistic, non-task-oriented, virtual settings. In this study, we analyze a large corpus of spontaneous dyadic Zoom conversations to examine how conversational dynamics relate to perceived interaction quality. We extract multimodal features encompassing turn-taking, pauses, facial movements, and acoustic measures such as pitch and intensity. Perceived conversational success was quantified via factor analysis of post-conversation ratings. Results demonstrate that entrainment is reliably detected in spontaneous speech and correlates with higher perceived success. These findings identify key interactional markers of conversational quality and highlight opportunities for targeted interventions to foster more effective and engaging communication.


[78] Automating Crash Diagram Generation Using Vision-Language Models: A Case Study on Multi-Lane Roundabouts cs.HC | cs.AI | cs.CV | cs.SEPDF

Xiao Lu, Hao Zhen, Jidong J. Yang

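TL;DR: This study explores using vision-language models (VLMs) to automate crash diagram generation, with multi-lane roundabout crashes as the case study. It develops a three-part structured prompt framework that guides the model through interpretation, extraction, and visual synthesis, and designs a 10-metric evaluation system for diagram quality. GPT-4o, Gemini-1.5-Flash, and Janus-4o were tested on 79 crash reports, with GPT-4o performing best.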

Details

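Motivation: Crash diagrams are essential to transportation safety analysis, but preparing them by hand is time-consuming and subject to human variability. The study aims to automate the process with VLMs to improve efficiency and consistency.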

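Result: On 79 multi-lane roundabout crash reports, GPT-4o achieved the highest average score (6.29 out of 10), followed by Gemini-1.5-Flash (5.28) and Janus-4o (3.64). GPT-4o showed superior spatial reasoning and better alignment between extracted and visualized crash data.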

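Insight: The contributions are a three-part structured prompt framework for VLMs (interpretation, extraction, visual synthesis) and a comprehensive evaluation system covering semantic accuracy, spatial fidelity, and visual clarity. The work lays groundwork for integrating generative AI into engineering visualization workflows and shows both the promise and the current limitations of VLMs on complex spatial reasoning tasks.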

Abstract: Crash diagrams are essential tools in transportation safety analysis, yet their manual preparation remains time-consuming and prone to human variability. This study investigates the use of Vision-Language Models (VLMs) to automate crash diagram generation from police crash reports, focusing on multilane roundabouts as a challenging test case. A three-part structured prompt framework was developed to guide model reasoning through interpretation, extraction, and visual synthesis, while a 10-metric evaluation system was designed to assess diagram quality in terms of semantic accuracy, spatial fidelity, and visual clarity. Three popular models, including GPT-4o, Gemini-1.5-Flash, and Janus-4o, were tested on 79 crash reports. GPT-4o achieved the highest average performance (6.29 out of 10), followed by Gemini-1.5-Flash (5.28) and Janus-4o (3.64). The analysis revealed GPT-4o’s superior spatial reasoning and alignment between extracted and visualized crash data. These results highlight both the promise and current limitations of VLMs in engineering visualization tasks. The study lays the groundwork for integrating generative AI into crash analysis workflows to improve efficiency, consistency, and interpretability.


cs.CY [Back]

[79] From Vulnerable Data Subjects to Vulnerabilizing Data Practices: Navigating the Protection Paradox in AI-Based Analyses of Platformized Lives cs.CY | cs.AI | cs.CV | cs.HCPDF

Delfina S. Martinez Pandiani, Ella Streefkerk, Laurens Naudts, Paula Helm

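TL;DR: This paper examines the ethical challenges of AI-based data analysis in the context of platformized life, shifting from treating vulnerability as a static property of data subjects to analyzing how data practices actively produce it. Through an AI for Social Good case (using computer vision to quantify child presence in YouTube family vlogs for regulatory advocacy), it reveals a "protection paradox": data-driven efforts to protect vulnerable subjects can inadvertently impose new forms of computational exposure, reductionism, and extraction.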

Details

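Motivation: The abundance of data in platformized life raises an ethical challenge that existing frameworks, which focus on missing data or counter-data, do not cover: how researchers operate on the mass of data that already exists, and how technical pipelines turn "vulnerable" individuals into data subjects and deepen their vulnerability.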

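Result: Through case analysis and a deconstruction of the AI pipeline, the paper proposes a reflexive ethics protocol organized around four critical junctures (dataset design, operationalization, inference, and dissemination). The protocol identifies technical questions and ethical tensions and offers specific prompts for navigating four cross-cutting vulnerabilizing factors: exposure, monetization, narrative fixing, and algorithmic optimization.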

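Insight: The contributions are reframing vulnerability from an inherent property of data subjects to something dynamically enacted through data practices, introducing the "protection paradox" concept, and offering a structured reflexive ethics protocol that gives concrete operational guidance for research ethics involving platformized data subjects, emphasizing that technical decisions are ethically constitutive.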

Abstract: This paper traces a conceptual shift from understanding vulnerability as a static, essentialized property of data subjects to examining how it is actively enacted through data practices. Unlike reflexive ethical frameworks focused on missing or counter-data, we address the condition of abundance inherent to platformized life, a context where a near inexhaustible mass of data points already exists, shifting the ethical challenge to the researcher’s choices in operating upon this existing mass. We argue that the ethical integrity of data science depends not just on who is studied, but on how technical pipelines transform “vulnerable” individuals into data subjects whose vulnerability can be further precarized. We develop this argument through an AI for Social Good (AI4SG) case: a journalist’s request to use computer vision to quantify child presence in monetized YouTube ‘family vlogs’ for regulatory advocacy. This case reveals a “protection paradox”: how data-driven efforts to protect vulnerable subjects can inadvertently impose new forms of computational exposure, reductionism, and extraction. Using this request as a point of departure, we perform a methodological deconstruction of the AI pipeline to show how granular technical decisions are ethically constitutive. We contribute a reflexive ethics protocol that translates these insights into a reflexive roadmap for research ethics surrounding platformized data subjects. Organized around four critical junctures (dataset design, operationalization, inference, and dissemination), the protocol identifies technical questions and ethical tensions where well-intentioned work can slide into renewed extraction or exposure. For every decision point, the protocol offers specific prompts to navigate four cross-cutting vulnerabilizing factors: exposure, monetization, narrative fixing, and algorithmic optimization. Rather than uncritically…


cs.LG [Back]

[80] FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models cs.LG | cs.AI | cs.CLPDF

Zixuan Weng, Jinghuai Zhang, Kunlin Cai, Ying Li, Peiran Wang

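TL;DR: This paper proposes FineSteer, a framework for fine-grained inference-time steering of large language models that targets problems such as safety violations and hallucinations. It decomposes steering into two stages, conditional steering and fine-grained vector synthesis. A Subspace-guided Conditional Steering mechanism and a Mixture-of-Steering-Experts mechanism let it adaptively optimize steering vectors for targeted inputs while preserving the model's general capabilities.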

Details

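Motivation: Existing inference-time steering methods fall short on effectiveness, utility preservation, or training efficiency and cannot satisfy all three at once, so a more flexible, fine-grained steering framework is needed.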

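Result: Extensive experiments on safety and truthfulness benchmarks show that FineSteer outperforms state-of-the-art methods in overall performance, achieving stronger steering with minimal utility loss.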

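Insight: The contribution is decomposing inference-time steering into two complementary stages. Subspace-guided Conditional Steering (SCS) preserves utility by avoiding unnecessary steering, and Mixture-of-Steering-Experts (MoSE) captures the multimodal nature of the desired steering behaviors and generates query-specific steering vectors for greater effectiveness.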

Abstract: Large language models (LLMs) often exhibit undesirable behaviors, such as safety violations and hallucinations. Although inference-time steering offers a cost-effective way to adjust model behavior without updating its parameters, existing methods often fail to be simultaneously effective, utility-preserving, and training-efficient due to their rigid, one-size-fits-all designs and limited adaptability. In this work, we present FineSteer, a novel steering framework that decomposes inference-time steering into two complementary stages: conditional steering and fine-grained vector synthesis, allowing fine-grained control over when and how to steer internal representations. In the first stage, we introduce a Subspace-guided Conditional Steering (SCS) mechanism that preserves model utility by avoiding unnecessary steering. In the second stage, we propose a Mixture-of-Steering-Experts (MoSE) mechanism that captures the multimodal nature of desired steering behaviors and generates query-specific steering vectors for improved effectiveness. Through tailored designs in both SCS and MoSE, FineSteer maintains robust performance on general queries while adaptively optimizing steering vectors for targeted inputs in a training-efficient manner. Extensive experiments on safety and truthfulness benchmarks show that FineSteer outperforms state-of-the-art methods in overall performance, achieving stronger steering performance with minimal utility loss. Code is available at https://github.com/YukinoAsuna/FineSteer


[81] Faster LLM Inference via Sequential Monte Carlo cs.LG | cs.CLPDF

Yahya Emara, Mauricio Barba da Costa, Chi-Chih Chang, Cameron Freer, Tim Vieira

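TL;DR: This paper proposes sequential Monte Carlo speculative decoding (SMC-SD), a new method for accelerating large language model (LLM) inference. It replaces conventional token-level rejection sampling with importance-weighted resampling to handle divergence between the draft model and the target model, substantially increasing inference speed while staying close to the target model's accuracy.

Details

Motivation: Conventional speculative decoding (SD) truncates the entire draft block at the first error found during verification, so inference throughput drops when the draft and target models diverge substantially. The paper aims to address this by making more efficient use of draft tokens.

Result: On reasoning, instruction-following, and coding benchmarks, SMC-SD is 2.36x faster than conventional speculative decoding and 5.2x faster than autoregressive decoding while remaining within 3% of the target model's accuracy.

Insight: The core idea is to recast speculative decoding as a sequential Monte Carlo (SMC) approximate inference problem, replacing simple rejection sampling with importance-weighted resampling over a population of particles. The method is principled, with theoretical bounds on its approximation error, and it exploits the memory-bandwidth-bound nature of LLM inference to turn verification into a parallelizable, fixed-size vectorized operation at almost no additional compute cost.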

Abstract: Speculative decoding (SD) accelerates language model inference by drafting tokens from a cheap proposal model and verifying them against an expensive target model via rejection sampling. Because rejection truncates the draft block at the first error, throughput degrades when draft and target diverge. Rather than rejecting draft tokens outright, we propose to reweight them. To this end, we introduce sequential Monte Carlo speculative decoding (SMC-SD), which replaces token-level rejection with importance-weighted resampling over a population of draft particles. SMC-SD is a principled approximate inference scheme that trades exactness for additional speed, while preserving theoretical bounds on its per-step approximation error. Because LLM inference is memory bandwidth-bound, the arithmetic needed to draft particles and to score them in parallel comes nearly for free – SMC-SD uses idle compute to turn verification into a vectorized, fixed-size operation with no rollback. Empirically, SMC-SD achieves 2.36x speed-up over speculative decoding and a 5.2x speed-up over autoregressive decoding, while remaining within 3% of the target model’s accuracy on reasoning, instruction-following, and coding benchmarks.
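
As a rough illustration of the reweighting idea in the abstract, the sketch below applies generic importance weighting and multinomial resampling to a handful of draft continuations. It is not the authors' SMC-SD implementation: the function names, the toy log-probabilities, and the choice of multinomial resampling are all assumptions made for the example.

```python
import math
import random


def importance_weights(target_logps, draft_logps):
    """Normalized weights w_i proportional to p_target(x_i) / q_draft(x_i), in log space."""
    log_w = [t - d for t, d in zip(target_logps, draft_logps)]
    m = max(log_w)  # subtract the max for numerical stability
    unnorm = [math.exp(lw - m) for lw in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def resample(particles, weights, rng):
    """Multinomial resampling: draw len(particles) particles with replacement."""
    return rng.choices(particles, weights=weights, k=len(particles))


# Hypothetical toy data: three draft continuations with made-up log-probabilities.
particles = ["draft A", "draft B", "draft C"]
draft_logps = [-1.0, -1.0, -1.0]    # log q(x) under the cheap draft model
target_logps = [-0.5, -2.0, -4.0]   # log p(x) under the expensive target model

weights = importance_weights(target_logps, draft_logps)
survivors = resample(particles, weights, random.Random(0))
```

Unlike rejection sampling, no particle is discarded at the first mismatch: drafts that the target model finds unlikely simply receive small weights, and the population size stays fixed after resampling.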


[82] Detecting and Suppressing Reward Hacking with Gradient Fingerprints cs.LG | cs.CLPDF

Songtao Wang, Quang Hieu Pham, Fangcong Yin, Xinpeng Wang, Jocelyn Qiaochu Chen

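TL;DR: This paper proposes Gradient Fingerprint (GRIFT), a method for detecting and suppressing reward hacking in reinforcement learning. It analyzes gradients from the model's internal computations, compressing the gradients of a chain-of-thought (CoT) into a compact representation to assess whether the reasoning exploits loopholes in the reward function. Across verifiable reasoning benchmarks spanning math, code, and logical reasoning, GRIFT substantially outperforms existing baselines, with over 25% relative improvement in detecting reward hacking. Integrating GRIFT into a rejection fine-tuning pipeline also reduces reward hacking and improves performance on the true task objective.

Details

Motivation: In reinforcement learning with verifiable rewards (RLVR), models typically optimize only outcome rewards, with no constraints on intermediate reasoning. This leaves training susceptible to reward hacking, where a model exploits loopholes in the reward function (e.g., spurious patterns in the training data) to score highly rather than solving the intended task. Such behavior is often implicit, because the CoT may look plausible on the surface, which limits the effectiveness of purely text-based monitoring.

Result: Across verifiable reasoning benchmarks spanning math, code, and logical reasoning, GRIFT substantially outperforms strong baselines, including CoT Monitor and TRACE, achieving over 25% relative improvement in detecting reward hacking. Integrating GRIFT into the rejection fine-tuning pipeline for reasoning tasks reduces reward hacking and improves performance on the true task objective.

Insight: The main novelty is using gradient representations from the model's internal computations (gradient fingerprints) to assess the quality of CoT reasoning, which offers a new angle on detecting implicit reward hacking. Viewed objectively, gradient-level analysis goes beyond conventional monitoring of surface text and may identify and suppress undesirable optimization strategies during RL training more effectively.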

Abstract: Reinforcement learning with verifiable rewards (RLVR) typically optimizes for outcome rewards without imposing constraints on intermediate reasoning. This leaves training susceptible to reward hacking, where models exploit loopholes (e.g., spurious patterns in training data) in the reward function to achieve high scores without solving the intended task. These reward-hacking behaviors are often implicit, as the intermediate chain-of-thought (CoT) may appear plausible on the surface, limiting the effectiveness of purely text-based monitoring. We propose Gradient Fingerprint (GRIFT), a method for detecting reward hacking using models’ internal computations. Given a prompt and a model-generated CoT, GRIFT computes gradients of the CoT conditioned on the prompt and compresses them into a compact representation, which is then used to assess whether the CoT reflects reward hacking behavior. Across verifiable reasoning benchmarks spanning math, code, and logical reasoning, GRIFT substantially outperforms strong baselines, including CoT Monitor and TRACE, achieving over 25% relative improvement in detecting reward hacking behavior. Moreover, integrating GRIFT into the rejection fine-tuning pipeline for reasoning tasks reduces reward hacking and improves performance on the true task objective. Our results highlight a promising direction of leveraging gradient level representations for assessing the quality of CoT reasoning traces. Our code is available at: https://github.com/songtao-x/reward_hack.


[83] M3R: Localized Rainfall Nowcasting with Meteorology-Informed MultiModal Attention cs.LG | cs.CV | cs.MMPDF

Sanjeev Panta, Rhett M Morvant, Xu Yuan, Li Chen, Nian-Feng Tzeng

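TL;DR: This paper proposes M3R, a meteorology-informed, multimodal attention-based architecture for direct rainfall nowcasting. A comprehensive pipeline temporally aligns visual NEXRAD radar imagery with numerical Personal Weather Station (PWS) measurements, and specialized multimodal attention uses the weather station time series as queries to selectively attend to spatial radar features, enabling focused extraction of precipitation signatures.

Details

Motivation: Existing deep learning methods are limited in how effectively they exploit diverse multimedia data sources for precipitation prediction. The paper addresses this limitation to enable more accurate and timely rainfall nowcasting, which is crucial for disaster mitigation and water resource management.

Result: In experiments on three 100 km × 100 km spatial areas centered at NEXRAD radar stations, M3R outperforms existing approaches, with substantial improvements in accuracy, efficiency, and precipitation detection, establishing new benchmarks for multimedia-based precipitation nowcasting.

Insight: The novelty is a specialized multimodal attention mechanism that uses weather station time series as queries to guide attention over spatial radar features. This fuses heterogeneous meteorological data effectively, focuses the extraction of precipitation signatures, and provides practical tools for operational weather prediction systems.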

Abstract: Accurate and timely rainfall nowcasting is crucial for disaster mitigation and water resource management. Despite recent advances in deep learning, precipitation prediction remains challenging due to limitations in effectively leveraging diverse multimedia data sources. We introduce M3R, a Meteorology-informed MultiModal attention-based architecture for direct Rainfall prediction that synergistically combines visual NEXRAD radar imagery with numerical Personal Weather Station (PWS) measurements, using a comprehensive pipeline for temporal alignment of heterogeneous meteorological data. With specialized multimodal attention mechanisms, M3R novelly leverages weather station time series as queries to selectively attend to spatial radar features, enabling focused extraction of precipitation signatures. Experimental results for three spatial areas of 100 km * 100 km centered at NEXRAD radar stations demonstrate that M3R outperforms existing approaches, achieving substantial improvements in accuracy, efficiency, and precipitation detection capabilities. Our work establishes new benchmarks for multimedia-based precipitation nowcasting and provides practical tools for operational weather prediction systems. The source code is available at https://github.com/Sanjeev97/M3Rain
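
To make the "time series as queries" idea concrete, here is a minimal sketch of standard scaled dot-product cross-attention in plain Python. It is not M3R's architecture: the embeddings are hypothetical two-dimensional vectors, and learned projections, multiple heads, and everything else in the real model are omitted.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys/values."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs


# Hypothetical embeddings: weather-station time steps act as queries,
# radar image patches act as keys and values.
station_steps = [[1.0, 0.0], [0.0, 1.0]]
radar_patches = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]

fused = cross_attention(station_steps, radar_patches, radar_patches)
```

Each station time step yields one fused vector dominated by the radar patches most similar to it, which is the sense in which the queries "selectively attend" to spatial features.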


[84] ProtoTTA: Prototype-Guided Test-Time Adaptation cs.LG | cs.CVPDF

Mohammad Mahdi Abootorabi, Parvin Mousavi, Purang Abolmaesumi, Evan Shelhamer

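TL;DR: ProtoTTA is a general test-time adaptation framework for prototypical models that uses intermediate prototype signals, rather than model outputs alone, to improve robustness under distribution shift. It minimizes the entropy of the prototype-similarity distribution to encourage more confident, prototype-specific activations on shifted data, and stabilizes updates using geometric filtering, prototype-importance weights, and model-confidence scores.

Details

Motivation: Prototype-based interpretable models balance high accuracy with interpretability in critical domains such as healthcare, but their reliance on training data limits their robustness to distribution shift. Existing test-time adaptation methods mainly update parameters and statistics, and have not explored using prototype signals to make such models more robust.

Result: Across four prototypical backbones and four diverse benchmarks spanning fine-grained vision, histopathology, and NLP, ProtoTTA improves robustness over standard output entropy minimization and restores correct semantic focus in prototype activations.

Insight: The novelty is in using prototype signals for test-time adaptation for the first time, with an objective based on minimizing the entropy of the prototype-similarity distribution, combined with geometric filtering and regularization to keep updates stable. The paper also introduces new interpretability metrics and a vision-language model (VLM) evaluation framework to examine TTA dynamics, confirming that the method restores human-aligned semantic focus.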

Abstract: Deep networks that rely on prototypes (interpretable representations that can be related to the model input) have gained significant attention for balancing high accuracy with inherent interpretability, which makes them suitable for critical domains such as healthcare. However, these models are limited by their reliance on training data, which hampers their robustness to distribution shifts. While test-time adaptation (TTA) improves the robustness of deep networks by updating parameters and statistics, the prototypes of interpretable models have not been explored for this purpose. We introduce ProtoTTA, a general framework for prototypical models that leverages intermediate prototype signals rather than relying solely on model outputs. ProtoTTA minimizes the entropy of the prototype-similarity distribution to encourage more confident and prototype-specific activations on shifted data. To maintain stability, we employ geometric filtering to restrict updates to samples with reliable prototype activations, regularized by prototype-importance weights and model-confidence scores. Experiments across four prototypical backbones on four diverse benchmarks spanning fine-grained vision, histopathology, and NLP demonstrate that ProtoTTA improves robustness over standard output entropy minimization while restoring correct semantic focus in prototype activations. We also introduce novel interpretability metrics and a vision-language model (VLM) evaluation framework to explain TTA dynamics, confirming ProtoTTA restores human-aligned semantic focus and correlates reliably with VLM-rated reasoning quality. Code is available at: https://github.com/DeepRCL/ProtoTTA.
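
The quantity being minimized can be illustrated with a small sketch that computes the Shannon entropy of a softmax over prototype similarities. How ProtoTTA actually normalizes similarities, filters samples, and weights the loss is not specified here, so the softmax with a temperature and the toy similarity values are assumptions.

```python
import math


def prototype_entropy(similarities, temperature=1.0):
    """Entropy of the softmax distribution over prototype similarity scores."""
    m = max(similarities)
    exps = [math.exp((s - m) / temperature) for s in similarities]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical similarity scores between one input and four prototypes.
diffuse = prototype_entropy([0.5, 0.5, 0.5, 0.5])  # no prototype stands out
peaked = prototype_entropy([5.0, 0.1, 0.1, 0.1])   # one prototype dominates
```

A diffuse activation pattern has high entropy and a peaked one has low entropy, so driving this value down on shifted data pushes the model toward confident, prototype-specific activations.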


[85] Hierarchical Active Inference using Successor Representations cs.LG | cs.AI | cs.CVPDF

Prashant Rangarajan, Rajesh P. N. Rao

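TL;DR: This paper proposes a planning model based on hierarchical active inference and successor representations, addressing the difficulty of scaling conventional active inference to complex, large-scale real-world environments. By combining a hierarchical model of the environment with successor representations, the method learns higher-level abstract states from lower-level representations, uses lower-level active inference to bootstrap the learning of higher-level abstract actions, and uses these abstractions for efficient planning.

Details

Motivation: Active inference, a neurally inspired model based on the free energy principle (FEP), shows promise for understanding perception, action, and learning in the brain, but scaling it to complex, large-scale real-world problems remains challenging. Inspired by multi-scale hierarchical representations in the brain, the paper aims to develop a hierarchical active inference model that handles complex tasks efficiently.

Result: The method is evaluated on several planning and reinforcement learning (RL) problems, including a variant of the four rooms task, a key-based navigation task, a partially observable planning problem, the Mountain Car problem, and PointMaze, a family of navigation tasks with continuous state and action spaces. The results show that it learns hierarchical state and action abstractions and plans efficiently.

Insight: The novelty is the first application of learned hierarchical state and action abstractions to FEP-based active inference. Combining a hierarchical model with successor representations enables abstraction learning and planning guidance from lower to higher levels, offering a new route for scaling active inference to complex tasks.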

Abstract: Active inference, a neurally-inspired model for inferring actions based on the free energy principle (FEP), has been proposed as a unifying framework for understanding perception, action, and learning in the brain. Active inference has previously been used to model ecologically important tasks such as navigation and planning, but scaling it to solve complex large-scale problems in real-world environments has remained a challenge. Inspired by the existence of multi-scale hierarchical representations in the brain, we propose a model for planning of actions based on hierarchical active inference. Our approach combines a hierarchical model of the environment with successor representations for efficient planning. We present results demonstrating (1) how lower-level successor representations can be used to learn higher-level abstract states, (2) how planning based on active inference at the lower-level can be used to bootstrap and learn higher-level abstract actions, and (3) how these learned higher-level abstract states and actions can facilitate efficient planning. We illustrate the performance of the approach on several planning and reinforcement learning (RL) problems including a variant of the well-known four rooms task, a key-based navigation task, a partially observable planning problem, the Mountain Car problem, and PointMaze, a family of navigation tasks with continuous state and action spaces. Our results represent, to our knowledge, the first application of learned hierarchical state and action abstractions to active inference in FEP-based theories of brain function.
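
For readers unfamiliar with successor representations, the sketch below computes the standard tabular form M = sum over t of (gamma T)^t, which equals (I - gamma T)^(-1), for a fixed transition matrix T. It only illustrates the textbook definition, not the paper's hierarchical model; the three-state chain and the truncation length are made up for the example.

```python
def matmul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def successor_representation(T, gamma, n_terms=200):
    """Truncated power series M = I + gamma*T + (gamma*T)^2 + ... ."""
    n = len(T)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    M = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for _ in range(n_terms):
        # term becomes (gamma * T)^t for t = 1, 2, ...
        term = [[gamma * x for x in row] for row in matmul(term, T)]
        M = [[M[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return M


# Hypothetical 3-state chain: 0 -> 1 -> 2, with state 2 absorbing.
T = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
]
M = successor_representation(T, gamma=0.9)
```

Entry M[s][s'] is the expected discounted number of future visits to s' when starting from s, so states with similar rows lead to similar futures, which is one common motivation for clustering them into abstract states.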


[86] AEGIS: Anchor-Enforced Gradient Isolation for Knowledge-Preserving Vision-Language-Action Fine-Tuning cs.LG | cs.CVPDF

Guransh Singh

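TL;DR: This paper proposes AEGIS (Anchor-Enforced Gradient Isolation System), a gradient isolation framework that addresses catastrophic forgetting when fine-tuning pre-trained vision-language models (VLMs) for robotic control. The forgetting is caused by the dimensionality mismatch between continuous action-regression gradients and the pre-trained semantic manifold. Using layer-wise orthogonal gradient projection, AEGIS enables direct continuous mean-squared-error (MSE) learning while preserving the model's visual question answering (VQA) capability, with no co-training data or replay buffer.

Details

Motivation: Adapting a pre-trained VLM to robotic control requires injecting high-magnitude continuous gradients from a flow-matching action expert into a backbone pre-trained exclusively with cross-entropy (CE). This cross-modal gradient asymmetry, the spectral dimensionality mismatch between low-rank MSE regression gradients and the high-dimensional semantic manifold shaped by CE pre-training, rapidly and severely degrades the VLM's VQA capability. Existing approaches such as stop-gradient or low-rank adapters (LoRA) either discard the continuous supervision entirely or fail to constrain gradient direction, and so still overwrite the pre-trained manifold.

Result: During fine-tuning, AEGIS's orthogonal gradient projection sheds less than 1% of gradient energy on average, yet eliminates the cumulative activation drift that drives severe forgetting. The abstract does not name specific benchmarks or state-of-the-art comparisons; the method is designed to allow direct continuous MSE learning while preserving pre-trained VQA capability, which suggests an advantage in mitigating catastrophic forgetting.

Insight: The core contribution is a buffer-free, layer-wise orthogonal gradient projection framework. It pre-computes a static Gaussian reference anchor from masked VQA forward passes and constructs a Wasserstein-2 transport penalty that yields an anchor restoration gradient. A sequential dual-backward pass then separates the task and anchor gradients, and a single Gram-Schmidt orthogonal projection per transformer layer bends the task gradient away from the destructive direction while preserving its constructive content. This amounts to a novel, geometrically constrained gradient isolation mechanism for protecting pre-trained knowledge.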

Abstract: Adapting pre-trained vision-language models (VLMs) for robotic control requires injecting high-magnitude continuous gradients from a flow-matching action expert into a backbone trained exclusively with cross-entropy. This cross-modal gradient asymmetry, the spectral dimensionality mismatch between low-rank MSE regression gradients and the high-dimensional semantic manifold sculpted by CE pre-training, causes rapid, severe erosion of the VLM’s visual-question-answering (VQA) capability. Industry-standard defences either sever the gradient pathway entirely via stop gradient, discarding the rich continuous supervision, or restrict parameter capacity through low-rank adapters (LoRA) that constrain the rank of updates but not their direction, and thus still overwrite the pre-trained manifold. We introduce AEGIS (Anchor-Enforced Gradient Isolation System): a buffer-free, layer-wise orthogonal gradient projection framework that enables direct continuous MSE learning while preserving the pre-trained VQA manifold, without any co-training data or replay buffer. AEGIS pre-computes a static Gaussian reference anchor from masked VQA forward passes across all transformer layers, then at each training step constructs a Wasserstein-2 transport penalty that generates an anchor restoration gradient. A sequential dual-backward decomposes the task and anchor gradients; for each transformer layer, AEGIS applies a single Gram-Schmidt orthogonal projection that bends the task gradient away from the destructive direction while preserving its constructive content. The projection sheds less than 1% of gradient energy on average, yet eliminates the cumulative activation drift that drives severe forgetting.
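
The projection step can be illustrated with a single Gram-Schmidt step on flat gradient vectors. This is a generic sketch, not the AEGIS implementation: it assumes the component of the task gradient along the anchor gradient is the one to remove, and the vectors, the function name `project_out`, and the energy-shed calculation are all made up for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_out(task_grad, anchor_grad, eps=1e-12):
    """One Gram-Schmidt step: remove the component of task_grad along anchor_grad."""
    denom = dot(anchor_grad, anchor_grad)
    if denom < eps:  # anchor gradient is (numerically) zero: nothing to remove
        return list(task_grad)
    coeff = dot(task_grad, anchor_grad) / denom
    return [t - coeff * a for t, a in zip(task_grad, anchor_grad)]


# Hypothetical per-layer gradients, flattened to short vectors.
task_grad = [3.0, 4.0, 0.0]
anchor_grad = [0.0, 2.0, 0.0]

projected = project_out(task_grad, anchor_grad)
# Fraction of squared gradient norm ("energy") removed by the projection.
energy_shed = 1.0 - dot(projected, projected) / dot(task_grad, task_grad)
```

In this toy case the projection removes 64% of the gradient energy because the two vectors overlap heavily; the abstract's figure of under 1% on average indicates that, in the authors' setting, the task gradient is nearly orthogonal to the anchor direction already.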


cs.AI [Back]

[87] Preregistered Belief Revision Contracts cs.AI | cs.CL | cs.LO | cs.MAPDF

Saad Alqithami

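TL;DR: This paper proposes PBRC (Preregistered Belief Revision Contracts), a protocol-level mechanism for the "dangerous conformity effects" in multi-agent systems, where social signals such as agreement, prestige, or majority opinion drive high-confidence convergence to false conclusions. PBRC publicly fixes evidence triggers, revision operators, a priority rule, and a fallback policy, strictly separating open communication from admissible epistemic change so that every substantive belief change is enforceable by a router and auditable after the fact. The paper proves that under conservative fallback the mechanism prevents purely conformity-driven error cascades, establishes its auditability, epistemic accountability, and theoretical properties under a dissemination model, and introduces a companion logical framework together with simulations.

Details

Motivation: In multi-agent systems, agents exchange messages and revise beliefs to improve performance, but social factors (such as agreement, confidence, prestige, or majority size) can be mistaken for evidence, producing high-confidence convergence to false conclusions (the "dangerous conformity effect"). A mechanism is therefore needed that strictly separates social interaction from evidence-driven belief change, keeping the epistemic process reliable and auditable.

Result: Through proofs and simulations, the paper shows that: under contracts with conservative fallback, social-only rounds cannot increase confidence or produce purely conformity-driven error cascades; auditable trigger protocols can be converted into PBRC normal forms that preserve belief trajectories; enforced contracts yield epistemic accountability, meaning any change of top hypothesis is attributable to a concrete validated evidence set; and for token-invariant contracts, enforced trajectories depend only on token-exposure traces, which under flooding dissemination are characterized exactly by truncated reachability, giving tight diameter bounds for universal evidence closure.

Insight: The novelty is a protocol-level contract mechanism (PBRC) that uses preregistered evidence triggers and externally validated tokens to strictly separate social communication from evidence-driven belief revision, suppressing conformity effects while ensuring auditability and accountability. The paper also introduces a companion dynamic doxastic logic to formalize trace invariants, providing a verifiable design framework for reliable multi-agent interaction.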

Abstract: Deliberative multi-agent systems allow agents to exchange messages and revise beliefs over time. While this interaction is meant to improve performance, it can also create dangerous conformity effects: agreement, confidence, prestige, or majority size may be treated as if they were evidence, producing high-confidence convergence to false conclusions. To address this, we introduce PBRC (Preregistered Belief Revision Contracts), a protocol-level mechanism that strictly separates open communication from admissible epistemic change. A PBRC contract publicly fixes first-order evidence triggers, admissible revision operators, a priority rule, and a fallback policy. A non-fallback step is accepted only when it cites a preregistered trigger and provides a nonempty witness set of externally validated evidence tokens. This ensures that every substantive belief change is both enforceable by a router and auditable after the fact. In this paper, (a) we prove that under evidential contracts with conservative fallback, social-only rounds cannot increase confidence and cannot generate purely conformity-driven wrong-but-sure cascades. (b) We show that auditable trigger protocols admit evidential PBRC normal forms that preserve belief trajectories and canonicalized audit traces. (c) We demonstrate that sound enforcement yields epistemic accountability: any change of top hypothesis is attributable to a concrete validated witness set. For token-invariant contracts, (d) we prove that enforced trajectories depend only on token-exposure traces; under flooding dissemination, these traces are characterized exactly by truncated reachability, giving tight diameter bounds for universal evidence closure. Finally, we introduce a companion contractual dynamic doxastic logic to specify trace invariants, and provide simulations illustrating cascade suppression, auditability, and robustness-liveness trade-offs.
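
The acceptance rule stated in the abstract (a non-fallback step must cite a preregistered trigger and supply a nonempty witness set of externally validated evidence tokens) can be sketched as a small router-side check. The class, function, and token names below are hypothetical, and the sketch omits revision operators, the priority rule, and the fallback policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RevisionStep:
    trigger: str                # name of the trigger the agent cites
    witness_tokens: frozenset   # evidence tokens offered as witnesses


def accept_step(step, preregistered_triggers, validated_tokens):
    """Accept a non-fallback step only if it cites a preregistered trigger and
    provides a nonempty witness set made up entirely of validated tokens."""
    if step.trigger not in preregistered_triggers:
        return False
    if not step.witness_tokens:
        return False
    return step.witness_tokens <= validated_tokens


# Hypothetical contract contents.
triggers = {"new_measurement", "retraction"}
validated = {"tok-17", "tok-42"}

evidence_step = RevisionStep("new_measurement", frozenset({"tok-17"}))
social_step = RevisionStep("majority_agrees", frozenset())
forged_step = RevisionStep("new_measurement", frozenset({"tok-99"}))

results = [
    accept_step(evidence_step, triggers, validated),
    accept_step(social_step, triggers, validated),
    accept_step(forged_step, triggers, validated),
]
```

A step justified only by how many peers agree carries no witness tokens and is rejected, which is the mechanical reason social-only rounds cannot change beliefs under such a contract.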


[88] Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4 cs.AI | cs.CL | cs.LOPDF

Chengwu Liu, Yichun Yin, Ye Yuan, Jiaxuan Xie, Botao Li

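TL;DR: This paper proposes Discover And Prove (DAP), an open-source agentic framework for "Hard Mode" automated theorem proving in Lean 4. It uses LLM natural-language reasoning with explicit self-reflection to discover answers, then rewrites Hard Mode problems into Easy Mode ones for existing provers. The authors also release two expert-reannotated Hard Mode benchmark variants, MiniF2F-Hard and FIMO-Hard.

Details

Motivation: Most existing automated theorem proving (ATP) benchmarks embed the final answer in the formal statement ("Easy Mode"), which simplifies the task and may lead to overestimating model capability. The paper targets the stricter, more realistic "Hard Mode" setting, in which the system must independently discover the answer before constructing a formal proof.

Result: On CombiBench, DAP raises the number of solved problems from 7 (previous SOTA, Pass@16) to 10; on PutnamBench, it is the first system to formally prove 36 theorems in Hard Mode. It also reveals that, on the same problems, state-of-the-art LLMs exceed 80% answer accuracy while formal provers succeed on under 10%, a substantial gap.

Insight: The main novelty is distinguishing and defining "Hard Mode" as a more realistic evaluation setting, and proposing DAP, an agentic framework that combines LLM reasoning with self-reflection, to handle it. The core insight is to split a Hard Mode problem into two stages, answer discovery and formal proof, and to use the strong reasoning ability of LLMs to cover the weakness of conventional provers at answer discovery, offering a new paradigm for evaluating and improving AI mathematical reasoning.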

Abstract: Most ATP benchmarks embed the final answer within the formal statement – a convention we call “Easy Mode” – a design that simplifies the task relative to what human competitors face and may lead to optimistic estimates of model capability. We call the stricter, more realistic setting “Hard Mode”: the system must independently discover the answer before constructing a formal proof. To enable Hard Mode research, we make two contributions. First, we release MiniF2F-Hard and FIMO-Hard, expert-reannotated Hard Mode variants of two widely-used ATP benchmarks. Second, we introduce Discover And Prove (DAP), an agentic framework that uses LLM natural-language reasoning with explicit self-reflection to discover answers, then rewrites Hard Mode statements into Easy Mode ones for existing ATP provers. DAP sets the state of the art: on CombiBench it raises solved problems from 7 (previous SOTA, Pass@16) to 10; on PutnamBench it is the first system to formally prove 36 theorems in Hard Mode – while simultaneously revealing that state-of-the-art LLMs exceed 80% answer accuracy on the same problems where formal provers manage under 10%, exposing a substantial gap that Hard Mode benchmarks are uniquely suited to measure.


Haoyu Bian, Chaoning Zhang, Jiaquan Zhang, Xingyao Li, Yuanfang Guo

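TL;DR: This paper proposes WORC, a weak-link optimization framework for multi-agent reasoning and collaboration grounded in the weak-link principle. A two-stage workflow (weak agent localization, then weak-link optimization) identifies and compensates for performance-limiting agents to improve the overall stability and generalization of multi-agent systems.

Details

Motivation: Existing LLM-driven multi-agent frameworks suffer from reasoning instability on complex reasoning tasks, where individual agent errors are amplified through collaboration. Current research mainly focuses on enhancing high-capability agents or suppressing unreliable outputs, and lacks systematic identification and reinforcement of performance-limiting agents.

Result: On reasoning benchmarks, WORC achieves an average accuracy of 82.2% while improving framework stability and cross-architecture generalization.

Insight: The novelty is bringing the weak-link principle to multi-agent optimization: a meta-learning-based weight predictor maps task features to agent performance weights zero-shot, and an uncertainty-driven budget allocation strategy compensates weak agents. The analysis indicates that compensating for weak links, rather than only reinforcing strengths, improves system robustness, offering a new perspective on multi-agent collaboration.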

Abstract: LLM-driven multi-agent frameworks address complex reasoning tasks through multi-role collaboration. However, existing approaches often suffer from reasoning instability, where individual agent errors are amplified through collaboration, undermining overall performance. Current research mainly focuses on enhancing high-capability agents or suppressing unreliable outputs to improve framework effectiveness, while systematic identification and reinforcement of performance-limiting agents receive less attention. To address this gap, we propose WORC, a weak-link optimization framework for multi-agent reasoning and collaboration, grounded in the weak-link principle. WORC follows a two-stage workflow. In the weak agent localization stage, task features are constructed, and a meta-learning-based weight predictor trained on optimal configurations identified by swarm intelligence algorithms (SIAs) enables zero-shot mapping from these features to agent performance weights, where the agent with the lowest predicted weight is identified as the weak agent. In the weak-link optimization stage, an uncertainty-driven allocation strategy assigns additional reasoning budgets to weak agents, with lower predicted weights leading to larger repeated-sampling quotas to compensate for reliability deficiencies. Experimental results show that WORC achieves an average accuracy of 82.2% on reasoning benchmarks while improving framework stability and cross-architecture generalization, suggesting that compensating for weak links, rather than reinforcing strengths alone, enhances the robustness of multi-agent systems.
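
The two stages can be sketched as follows, under loudly stated assumptions: the predicted weights are taken as given (the meta-learned predictor is not modeled), and the quota rule, shares proportional to the inverse of each weight with largest-remainder rounding, is a stand-in chosen for illustration since the abstract does not give the actual allocation formula. Agent names and numbers are hypothetical.

```python
def weakest_agent(predicted_weights):
    """Stage 1 (localization): the agent with the lowest predicted weight."""
    return min(predicted_weights, key=predicted_weights.get)


def allocate_budget(predicted_weights, total_samples):
    """Stage 2 (optimization): lower weight -> larger repeated-sampling quota.
    Assumes all weights are positive; shares are proportional to 1 / weight."""
    names = list(predicted_weights)
    inverse = [1.0 / predicted_weights[n] for n in names]
    z = sum(inverse)
    shares = [total_samples * v / z for v in inverse]
    quotas = [int(s) for s in shares]
    # Hand out the samples lost to rounding, largest fractional part first.
    leftover = total_samples - sum(quotas)
    order = sorted(range(len(names)), key=lambda i: shares[i] - quotas[i], reverse=True)
    for i in order[:leftover]:
        quotas[i] += 1
    return dict(zip(names, quotas))


# Hypothetical predicted performance weights for three agent roles.
weights = {"planner": 0.5, "solver": 0.3, "verifier": 0.2}
weak = weakest_agent(weights)
quotas = allocate_budget(weights, total_samples=12)
```

With these numbers the verifier is flagged as the weak link and receives half of the twelve samples, while the strongest agent receives the fewest.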


[90] Learning to Reason with Insight for Informal Theorem Proving cs.AI | cs.CL | cs.LGPDF

Yunhe Li, Hao Shi, Bowen Deng, Wei Wang, Mengzhe Ruan

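TL;DR: This paper proposes a new framework for informal theorem proving that targets LLMs' lack of "insight" in complex mathematical reasoning, meaning the ability to recognize the core techniques a problem requires. The authors build a hierarchical dataset, DeepInsightTheorem, and design a progressive multi-stage supervised fine-tuning (SFT) strategy that guides the model from basic proof writing to insightful thinking. Experiments show that the method significantly outperforms baselines on several challenging mathematical benchmarks.

Details

Motivation: Although most automated theorem-proving approaches rely on formal systems, informal theorem proving plays better to LLMs' strengths in natural language processing. Its main bottleneck today is a lack of "insight": difficulty recognizing the core techniques needed to solve complex problems.

Result: Experiments on challenging mathematical benchmarks show that this insight-aware generation strategy significantly outperforms baselines, demonstrating that teaching models to identify and apply core techniques substantially improves their mathematical reasoning.

Insight: The paper's claimed contributions are: 1) identifying and formalizing the "insight" bottleneck in informal theorem proving; 2) building the hierarchical dataset DeepInsightTheorem, which explicitly extracts core techniques and proof sketches; 3) designing a progressive multi-stage SFT strategy that mimics human learning. Viewed objectively, modeling "insight" as a learnable skill, cultivated through structured data and staged training, is a promising and reusable direction for improving LLMs' complex reasoning.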

Abstract: Although most automated theorem-proving approaches depend on formal proof systems, informal theorem proving can align better with large language models’ (LLMs) strength in natural language processing. In this work, we identify a primary bottleneck in informal theorem proving as a lack of insight, namely the difficulty of recognizing the core techniques required to solve complex problems. To address this, we propose a novel framework designed to cultivate this essential reasoning skill and enable LLMs to perform insightful reasoning. We propose DeepInsightTheorem, a hierarchical dataset that structures informal proofs by explicitly extracting core techniques and proof sketches alongside the final proof. To fully exploit this dataset, we design a Progressive Multi-Stage SFT strategy that mimics the human learning process, guiding the model from basic proof writing to insightful thinking. Our experiments on challenging mathematical benchmarks demonstrate that this insight-aware generation strategy significantly outperforms baselines. These results demonstrate that teaching models to identify and apply core techniques can substantially improve their mathematical reasoning.


[91] GIST: Multimodal Knowledge Extraction and Spatial Grounding via Intelligent Semantic Topology cs.AI | cs.CV | cs.HC | cs.ROPDF

Shivendra Agrawal, Bradley Hayes

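TL;DR: This paper presents GIST (Grounded Intelligent Semantic Topology), a multimodal knowledge extraction pipeline that converts a consumer-grade mobile point cloud into a semantically annotated navigation topology. The system builds a 2D occupancy map, extracts its topological layout, and overlays a lightweight semantic layer, providing structured knowledge for spatial grounding and navigation in complex, densely packed environments such as retail stores and warehouses.

Details

Motivation: Conventional computer vision and existing vision-language models (VLMs) struggle with spatial grounding in complex, densely packed environments (such as retail stores, warehouses, and hospitals) where items are quasi-static and semantic distributions are long-tailed.

Result: Effectiveness is validated on several downstream human-AI interaction tasks: the Semantic Localizer achieves a top-5 mean translation error of 1.04 m; in multi-criteria LLM-based evaluation, the Visually-Grounded Instruction Generator outperforms sequence-based baselines; and an in-situ formative evaluation (N=5) yields an 80% navigation success rate relying solely on verbal cues.

Insight: The novelty is a multimodal knowledge extraction pipeline that systematically turns a raw point cloud into a structured spatial representation (GIST) combining geometric topology with lightweight semantics. The paper demonstrates the generality and effectiveness of this representation across several intent-driven human-AI interaction tasks (semantic search, localization, zone classification, and instruction generation), offering a new approach to spatial understanding and interaction for embodied AI in complex environments.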

Abstract: Navigating complex, densely packed environments like retail stores, warehouses, and hospitals poses a significant spatial grounding challenge for humans and embodied AI. In these spaces, dense visual features quickly become stale given the quasi-static nature of items, and long-tail semantic distributions challenge traditional computer vision. While Vision-Language Models (VLMs) help assistive systems navigate semantically-rich spaces, they still struggle with spatial grounding in cluttered environments. We present GIST (Grounded Intelligent Semantic Topology), a multimodal knowledge extraction pipeline that transforms a consumer-grade mobile point cloud into a semantically annotated navigation topology. Our architecture distills the scene into a 2D occupancy map, extracts its topological layout, and overlays a lightweight semantic layer via intelligent keyframe and semantic selection. We demonstrate the versatility of this structured spatial knowledge through critical downstream Human-AI interaction tasks: (1) an intent-driven Semantic Search engine that actively infers categorical alternatives and zones when exact matches fail; (2) a one-shot Semantic Localizer achieving a 1.04 m top-5 mean translation error; (3) a Zone Classification module that segments the walkable floor plan into high-level semantic regions; and (4) a Visually-Grounded Instruction Generator that synthesizes optimal paths into egocentric, landmark-rich natural language routing. In multi-criteria LLM evaluations, GIST outperforms sequence-based instruction generation baselines. Finally, an in-situ formative evaluation (N=5) yields an 80% navigation success rate relying solely on verbal cues, validating the system’s capacity for universal design.


[92] MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation cs.AI | cs.CVPDF

Yi Lin, Yihao Ding, Yonghui Wu, Yifan Peng

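TL;DR: This paper proposes MARCH (Multi-Agent Radiology Clinical Hierarchy), a framework that emulates the professional hierarchy of radiology departments by assigning distinct roles to specialized agents. It addresses clinical hallucinations in automated 3D radiology report generation and reproduces the iterative verification found in human practice.

Details

Motivation: Existing automated 3D radiology report generation systems often produce clinical hallucinations and lack the collaborative oversight and iterative verification typical of clinical workflows, while conventional vision-language models are mostly "black-box" systems that do not reflect the collaborative nature of clinical practice.

Result: On the RadGenome-ChestCT dataset, MARCH significantly outperforms state-of-the-art baselines in both clinical fidelity and linguistic accuracy.

Insight: The novelty is modeling the human clinical hierarchy as a multi-agent system: a Resident Agent performs multi-scale CT feature extraction and initial drafting, Fellow Agents perform retrieval-augmented revision, and an Attending Agent orchestrates an iterative, stance-based consensus discourse to resolve diagnostic discrepancies. This structured collaboration improves the reliability of AI in high-stakes medical domains.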

Abstract: Automated 3D radiology report generation often suffers from clinical hallucinations and a lack of the iterative verification found in human practice. While recent Vision-Language Models (VLMs) have advanced the field, they typically operate as monolithic “black-box” systems without the collaborative oversight characteristic of clinical workflows. To address these challenges, we propose MARCH (Multi-Agent Radiology Clinical Hierarchy), a multi-agent framework that emulates the professional hierarchy of radiology departments and assigns specialized roles to distinct agents. MARCH utilizes a Resident Agent for initial drafting with multi-scale CT feature extraction, multiple Fellow Agents for retrieval-augmented revision, and an Attending Agent that orchestrates an iterative, stance-based consensus discourse to resolve diagnostic discrepancies. On the RadGenome-ChestCT dataset, MARCH significantly outperforms state-of-the-art baselines in both clinical fidelity and linguistic accuracy. Our work demonstrates that modeling human-like organizational structures enhances the reliability of AI in high-stakes medical domains.