Table of Contents

cs.CL

[1] Document Classification Pattern Recognition via Information Fusion: A Systematic Review of Multimodal and Multiview Representation Approaches cs.CL | cs.AI

Marcin Michał Mirończuk

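TL;DR: This paper is a systematic review of multimodal and multiview representation approaches to document classification. Analysing 139 studies, it proposes a unified framework, conducts a qualitative trend analysis, and performs the first random-effects meta-analysis focused on document classification to quantify the performance gains from information fusion.

Details

Motivation: The field lacks a unified framework, a quantitative synthesis of its effectiveness, and clear guidance for practitioners. This systematic review aims to fill these gaps.

Result: The meta-analysis shows that multimodal fusion significantly improves accuracy (mean gain of +5.28 percentage points, p=0.0016), while multiview fusion provides consistent but modest gains in accuracy (+4.67%), F1-score (+3.08%), and recall (all p<0.05).

Insight: The main contributions are a unifying framework, the first quantitative evidence base, and data-driven guidelines. The core insight is that successful information fusion depends not on algorithmic complexity but on the strategic alignment of the fusion method with the task context and a commitment to more rigorous validation. The review also reveals major challenges in methodological rigour and reproducibility.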

Abstract: Information fusion is used widely to improve document classification by the integration of multiple data sources (multimodal) or representations (multiview). However, the field lacks a unified framework, a quantitative synthesis of its effectiveness, and clear guidance for practitioners. This systematic review addresses these gaps by analysing 139 primary studies. It introduces a formal framework to structure the field, presents the results of a qualitative analysis to identify key trends, and performs a random-effects meta-analysis (to our knowledge, the first focused on document classification) to quantify performance gains. Our meta-analysis reveals that multimodal fusion significantly improves accuracy (mean gain of +5.28 percentage points, $p=0.0016$), while the F1-score effect is directionally positive but statistically non-significant in our primary model. Multiview fusion provides consistent but modest gains for accuracy (+4.67%), F1-score (+3.08%), and recall (all $p<0.05$). Critically, our qualitative synthesis uncovers challenges in reproducibility and methodological rigour: only 11.8% (multimodal) and 23.3% (multiview) of the studies use statistical tests to validate their findings, which undermines the reliability of many of their results. This review’s primary contributions are a unifying framework, the first quantitative evidence base, and data-driven guidelines. This review concludes that successful information fusion depends not on algorithmic complexity, but on the strategic alignment of the fusion method with the task context and a commitment to more rigorous validation.
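
To make the "random-effects meta-analysis" concrete, the sketch below pools per-study accuracy gains with the DerSimonian-Laird estimator. The abstract does not say which estimator the review uses, so the estimator, the normal-approximation p-value, the function name `random_effects_pool`, and the toy inputs are all assumptions for illustration, not the review's procedure or data.

```python
import math

def random_effects_pool(effects, variances):
    """Pool per-study effects with a DerSimonian-Laird random-effects model.

    effects:   per-study gains (e.g. accuracy difference in percentage points)
    variances: per-study sampling variances of those gains
    Returns (pooled effect, standard error, tau^2, two-sided p-value).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]                     # fixed-effect weights
    fixed_mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Cochran's Q measures heterogeneity around the fixed-effect mean.
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]       # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    z = pooled / se
    p = math.erfc(abs(z) / math.sqrt(2.0))               # two-sided normal p-value
    return pooled, se, tau2, p

if __name__ == "__main__":
    # Made-up gains and variances, purely for illustration.
    print(random_effects_pool([4.0, 6.5, 5.0, 7.2], [1.0, 2.0, 1.5, 2.5]))
```

When studies disagree more than their sampling variances explain, `tau2` grows, the weights flatten, and the standard error widens, which is why a random-effects model is the cautious choice for heterogeneous primary studies.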


[2] EchoDistill: Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs cs.CL | cs.AI | cs.SD

Liang Lin, Chunxi Luo, Kaiwen Luo, Jie Zhang, Jin Wang

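TL;DR: This paper proposes EchoDistill, an alignment-based noisy-to-clean self-distillation framework that makes audio large language models (ALLMs) more robust to real-world noise. A frozen clean-audio teacher provides semantic references for an inference-time noisy-audio student; group-relative policy optimization (GRPO) and audio-aware reward shaping align the student's candidate responses with clean semantic evidence, improving semantic reliability and task performance without additional inference cost.

Details

Motivation: ALLMs are highly sensitive to real-world noise, which often induces severe semantic drift and hallucinations. Existing robustness methods rely mainly on waveform-level acoustic enhancement, answer-level supervision, or internal suppression of noise representations, and each has limitations.

Result: Under strong noise, EchoDistill improves GSR by 4.18% on average over the strongest baseline. In ablations on Qwen-Omni, it improves over the GRPO-only variant by 3.02% in Acc, 3.89% in Noisy, and 4.53% in GSR on average.

Insight: The contributions are: (1) a noisy-to-clean self-distillation framework in which a frozen teacher supplies semantic references; (2) GRPO with audio-aware reward shaping to align noisy and clean semantics; and (3) no added inference overhead, while improving semantic reliability and task performance under noise.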

Abstract: Audio Large Language Models (ALLMs) are highly vulnerable to real-world noise, which often induces severe semantic drift and hallucinations. Existing robustness methods primarily rely on waveform-level acoustic enhancement, answer-level supervision, or the internal suppression of noise representations. To address these issues, we propose EchoDistill, an alignment-based noisy-to-clean self-distillation framework. EchoDistill leverages a frozen clean-audio teacher to provide semantic references for an inference-time noisy-audio student. Specifically, the student samples candidate responses under noisy conditions to expose its test-time behavior. These trajectories are then optimized via group-relative policy optimization (GRPO), where the token-level consistency with the teacher acts as a reward bonus. By aligning the noisy student’s candidate responses with clean semantic evidence, and applying audio-aware reward shaping, our method encourages reasoning trajectories that are both correct and genuinely acoustically grounded. EchoDistill significantly improves the semantic reliability and task performance of Audio LLMs under complex noise, without introducing any additional inference costs. Extensive experiments show that: (I) Compared with the strongest baseline, EchoDistill achieves an average improvement of 4.18% in GSR under strong noise. (II) Ablation results on Qwen-Omni further show that EchoDistill improves over the GRPO-only variant by 3.02% in Acc, 3.89% in Noisy, and 4.53% in GSR on average. Our code is available at https://anonymous.4open.science/r/echodistill-10DE.
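
The reward structure described above (a task reward plus a teacher-consistency bonus, normalized within a group of sampled responses) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the position-wise match score in `token_consistency`, the `bonus_weight` value, and both function names are assumptions, since the abstract does not define how consistency is measured or weighted.

```python
import statistics

def token_consistency(student_tokens, teacher_tokens):
    """Hypothetical consistency score: share of aligned positions whose tokens match."""
    n = max(len(student_tokens), len(teacher_tokens))
    if n == 0:
        return 1.0
    matches = sum(s == t for s, t in zip(student_tokens, teacher_tokens))
    return matches / n

def group_relative_advantages(task_rewards, consistencies, bonus_weight=0.5, eps=1e-8):
    """GRPO-style advantages: standardize shaped rewards within one sampled group.

    task_rewards:  one task reward per candidate sampled from the noisy student
    consistencies: one teacher-consistency score per candidate (the reward bonus)
    """
    rewards = [r + bonus_weight * c for r, c in zip(task_rewards, consistencies)]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

if __name__ == "__main__":
    teacher = ["the", "dog", "barks"]
    group = [["the", "dog", "barks"], ["the", "cat", "barks"], ["a", "car", "honks"]]
    consistencies = [token_consistency(g, teacher) for g in group]
    print(group_relative_advantages([1.0, 1.0, 0.0], consistencies))
```

Because advantages are computed relative to the group mean, two candidates with the same task reward are still ranked by how closely they track the clean-audio teacher.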


[3] Multi-Persona Debate System for Automated Scientific Hypothesis Generation cs.CL

Jaeha Oh, Byungchan Kim, Ju Li, Yang Jeong Park, Jin-Sung Park

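TL;DR: This paper presents the Multi-Persona Debate System (MPDS), a framework for automated scientific hypothesis generation built on literature retrieval, long-context LLM reasoning, corpus-driven persona induction, and structured multi-agent debate. MPDS builds literature snapshots of up to 500 papers, has persona agents conduct a three-round citation-aware debate grounded in role-specific evidence pools, and finishes with moderator synthesis, enabling negotiation across perspectives while preserving evidence traceability.

Details

Motivation: Modern scientific discovery is bottlenecked not by data scarcity but by the difficulty of synthesizing fragmented knowledge into actionable hypotheses. The problem is especially acute in battery materials research, where electrochemical performance, interfacial behavior, and manufacturing feasibility must be optimized simultaneously.

Result: In sodium-ion anode and all-solid-state battery cathode design tasks, MPDS recovered design logics aligned with experimentally validated solution spaces and generated more mechanistically explicit, process-aware proposals than simpler baselines. In ablation studies, MPDS achieved the highest mean Integrative Hypothesis Quality score among five conditions, with its largest advantage in cross-perspective integration.

Insight: The novelty lies in combining structured multi-agent debate with literature snapshots: corpus-driven persona induction and citation-aware debate improve hypothesis formation under coupled engineering constraints and yield a reusable workflow for text-intensive scientific discovery. Its systematic integration of multi-agent collaboration, evidence traceability, and domain knowledge offers a new approach to knowledge fragmentation.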

Abstract: Modern scientific discovery is bottlenecked not by data scarcity, but by the inability to synthesize fragmented knowledge into actionable hypotheses. This challenge is especially acute in battery materials research, where electrochemical performance, interfacial behavior, and manufacturing feasibility must be optimized simultaneously. Here, we present the Multi-Persona Debate System (MPDS), a literature-grounded framework for automated scientific hypothesis generation that combines literature retrieval, long-context large language model reasoning, corpus-driven persona induction, and structured multi-agent debate. MPDS constructs literature snapshots of up to 500 papers, grounds agents in role-specific evidence pools, and conducts a three-round citation-aware debate followed by moderator synthesis, enabling negotiation between personas while preserving evidence traceability. We evaluate MPDS using a temporally controlled protocol excluding direct access to target papers, including two held-out battery-materials case studies and a blinded comparison across 30 matched cases. In sodium-ion anode and all-solid-state battery cathode design tasks, MPDS recovered design logics aligned with experimentally validated solution spaces and generated more mechanistically explicit, process-aware proposals than simpler baselines. To assess the impact of personas and debate, we introduce Integrative Hypothesis Quality scoring. In ablation studies, MPDS achieved the highest mean score among five conditions, with its largest advantage in cross-perspective integration. A laboratory follow-up suggests utility as a diagnostic aid for identifying practical bottlenecks in workflows. These results indicate that structured debate over literature snapshots improves hypothesis formation under coupled engineering constraints and provides a reusable workflow for text-intensive scientific discovery.


[4] QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks cs.CL

Jian Xie, Tianhe Lin, Zilu Wang, Yuting Ning, Yuekun Yao

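TL;DR: QUEST is a family of open deep research agent models (2B to 35B parameters) designed to handle a wide range of long-horizon search tasks, with strong capabilities in fact seeking, citation grounding, and report synthesis. Its training recipe combines mid-training, supervised fine-tuning, and reinforcement learning, and uses a data synthesis pipeline based on unified rubric trees to generate training data with verifiable rewards. QUEST approaches or surpasses frontier closed-source agents on eight deep research benchmarks and achieves the best overall performance among open models.

Details

Motivation: Frontier deep research agents are mostly proprietary, while existing open agents generalize poorly across task types, leaving it unclear how to train a broadly capable deep research agent.

Result: Trained on only 8K synthesized tasks, QUEST approaches or surpasses frontier closed-source agents across eight deep research benchmarks spanning diverse task types and achieves the best overall performance among recent open-weight agents.

Insight: The novelty is an effective training recipe combining mid-training, supervised fine-tuning, and reinforcement learning. At its core is a data synthesis pipeline based on unified rubric trees that produces training data with verifiable rewards without human annotation. The models also include a built-in context management mechanism that supports effective long-horizon reasoning and knowledge synthesis.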

Abstract: Deep research agents extend the role of search engines from retrieving keyword-matched pages to synthesizing knowledge, fundamentally changing how humans interact with information. However, frontier systems remain proprietary, while existing open agents often generalize poorly across different task types, leaving unclear how to train a broadly capable deep research agent. We release QUEST, a family of open models (ranging from 2B to 35B) that serve as general-purpose deep research agents designed to handle a wide range of long-horizon search tasks, with strong capabilities in fact seeking, citation grounding, and report synthesis. To build QUEST, we propose an effective training recipe combining mid-training, supervised fine-tuning, and reinforcement learning. Central to this recipe is a curated data synthesis pipeline based on unified rubric trees, which applies to different task types and enables synthesizing training data with verifiable rewards without human annotation. In addition, QUEST incorporates a built-in context management mechanism that enables effective long-horizon reasoning and knowledge synthesis. Using only 8K synthesized tasks, QUEST approaches or even surpasses frontier closed-source agents across eight deep research benchmarks spanning diverse task types, and achieves the best overall performance among recent open-weight agents. We released everything: models, data, and training scripts.
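
One way to picture a rubric tree that yields a verifiable reward is a weighted tree whose leaves are programmatic checks and whose inner nodes average their children. The abstract does not describe QUEST's actual rubric format, so the `RubricNode` class, its fields, the weighted-average aggregation, and the example checks below are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class RubricNode:
    """Hypothetical rubric node: leaves hold a check, inner nodes hold children."""
    name: str
    weight: float = 1.0
    check: Optional[Callable[[str], bool]] = None    # set on leaves only
    children: List["RubricNode"] = field(default_factory=list)

    def score(self, answer: str) -> float:
        """Return a reward in [0, 1] for the given answer text."""
        if not self.children:
            # Leaf: 1.0 if the verifiable check passes, otherwise 0.0.
            return 1.0 if self.check is not None and self.check(answer) else 0.0
        # Inner node: weight-normalized average of the child scores.
        total = sum(child.weight for child in self.children)
        return sum(child.weight * child.score(answer) for child in self.children) / total

if __name__ == "__main__":
    rubric = RubricNode("report", children=[
        RubricNode("names_capital", weight=2.0, check=lambda a: "Paris" in a),
        RubricNode("citations", weight=1.0, children=[
            RubricNode("cites_source_1", check=lambda a: "[1]" in a),
            RubricNode("cites_source_2", check=lambda a: "[2]" in a),
        ]),
    ])
    print(rubric.score("Paris is the capital of France [1]."))
```

The same tree shape can cover a short fact-seeking answer (a single leaf) and a long report (deep subtrees), which is one plausible reading of "unified" across task types.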


[5] An Interactive Paradigm for Deep Research cs.CL | cs.AI

Lin Ai, Victor S. Bursztyn, Xiang Chen, Julia Hirschberg, Saayan Mitra

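TL;DR: This paper presents SteER, an interactive and interpretable deep research framework. It introduces mid-process control so that users can intervene and adjust during long-horizon research tasks, keeping the output better aligned with their intent.

Details

Motivation: Existing deep research systems typically rely on one-shot scoping followed by long autonomous runs. They leave little room for mid-course correction and cope poorly with shifts in user intent.

Result: SteER outperforms open-source and proprietary baselines by up to 22.80% on alignment, leads on quality metrics such as breadth and balance, and is preferred by human readers in more than 85% of pairwise comparisons.

Insight: The novelty lies in applying a cost-benefit formulation at each decision point, combining diversity-aware planning with utility signals that reward alignment, novelty, and coverage, and maintaining a dynamically updated user persona model, yielding controllable, user-aligned agents for long-form tasks.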

Abstract: Recent advances in large language models (LLMs) have enabled deep research systems that synthesize comprehensive, report-style answers to open-ended queries by combining retrieval, reasoning, and generation. Yet most frameworks rely on rigid workflows with one-shot scoping and long autonomous runs, offering little room for course correction if user intent shifts mid-process. We present SteER, a framework for Steerable deEp Research that introduces interpretable, mid-process control into long-horizon research workflows. At each decision point, SteER uses a cost-benefit formulation to determine whether to pause for user input or to proceed autonomously. It combines diversity-aware planning with utility signals that reward alignment, novelty, and coverage, and maintains a live persona model that evolves throughout the session. SteER outperforms state-of-the-art open-source and proprietary baselines by up to 22.80% on alignment, leads on quality metrics such as breadth and balance, and is preferred by human readers in 85%+ of pairwise alignment judgments. We also introduce a persona-query benchmark and data-generation pipeline. To our knowledge, this is the first work to advance deep research with an interactive, interpretable control paradigm, paving the way for controllable, user-aligned agents in long-form tasks.
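
The pause-or-proceed decision can be sketched as a simple expected-value comparison. The abstract gives no formula, so everything here is an assumption for illustration: the function names `should_pause` and `plan_utility`, the three inputs to the pause rule, and the linear weighting of alignment, novelty, and coverage.

```python
def plan_utility(alignment, novelty, coverage, weights=(0.5, 0.25, 0.25)):
    """Hypothetical utility for a candidate research step: a weighted sum of
    alignment with the user persona, novelty, and topical coverage."""
    w_align, w_novel, w_cover = weights
    return w_align * alignment + w_novel * novelty + w_cover * coverage

def should_pause(p_misaligned, gain_if_corrected, interruption_cost):
    """Pause for user input only when the expected benefit of asking
    (chance the plan is off-target times the utility recovered by a correction)
    exceeds the cost of interrupting the user."""
    expected_benefit = p_misaligned * gain_if_corrected
    return expected_benefit > interruption_cost

if __name__ == "__main__":
    best = plan_utility(alignment=0.9, novelty=0.4, coverage=0.6)
    current = plan_utility(alignment=0.3, novelty=0.4, coverage=0.6)
    # Ask the user only if a likely-misaligned plan leaves enough utility on the table.
    print(should_pause(p_misaligned=0.6, gain_if_corrected=best - current,
                       interruption_cost=0.1))
```

Framing the decision this way keeps it interpretable: each pause can be explained by the three numbers that triggered it.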


[6] DRInQ: Evaluating Conversational Implicature with Controlled Context Variation cs.CL

Hirona Jacqueline Arai, Xiang Ren

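TL;DR: This paper introduces DRInQ, a benchmark for evaluating language models' reasoning about conversational implicature. It isolates pragmatic variation by holding each question's surface form fixed while varying the context in a controlled way. State-of-the-art models generate pragmatic scenarios well but struggle to recover the intended implication at inference time, a generation-inference asymmetry.

Details

Motivation: Large language models show strong conversational fluency, yet they remain unreliable at understanding conversational implicature, which depends on reasoning that integrates social and contextual cues, a process rarely articulated in text. A dedicated evaluation framework is therefore needed to test models' pragmatic reasoning.

Result: Evaluation on DRInQ shows a consistent generation-inference asymmetry in state-of-the-art models: when guided they can generate plausible pragmatic scenarios, but at inference time they often fail to recover the intended implication. For smaller models, structured prompting improves alignment with human judgments.

Insight: The novelty is a semi-automated pipeline with controlled context variation for constructing evaluation instances, which allows systematic testing of pragmatic reasoning. The study exposes the limits of models' implicature understanding, argues for more context-sensitive evaluation frameworks, and finds that humans and models have complementary strengths in writing contexts.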

Abstract: Human conversation relies heavily on conversational implicature, in which speakers convey meanings that are suggested rather than explicitly stated. Although recent large language models exhibit strong conversational fluency, they remain unreliable when interpretation depends on reasoning that integrates social and contextual cues, a process rarely articulated in text. We introduce DRInQ, a benchmark for evaluating pragmatic reasoning about conversational implicature in question utterances, designed to isolate pragmatic variation while holding each question’s surface form fixed. To support scalable evaluation, we propose a semi-automated pipeline that produces question-context-interpretation instances with systematic variation. Across evaluations, we find a consistent generation-inference asymmetry: while state-of-the-art models can generate plausible pragmatic scenarios when guided, they often fail to recover the intended implication at inference time. For smaller models, structured prompting improves alignment with human judgments. A comparative writing study further reveals complementary strengths: human authors tend to produce safer, predictable contexts, whereas models generate varied scenarios with interpretations that sometimes exceed contextual support. These findings highlight persistent challenges in modeling conversational implicature and motivate more context-sensitive evaluation frameworks.


[7] Distinguishing Right from Wrong in Debates: Attribution Analysis of Chinese Harmful Memes cs.CL

Weiming Wang, Junyu Lu, Han Wang, Xiaokun Zhang, Zewen Bai

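TL;DR: To address the dependence on cultural background and the semantic ambiguity that complicate Chinese harmful meme detection, this paper builds Ex-ToxiCN-MM, the first Chinese harmful meme explanation dataset, and proposes RIKE, a framework with knowledge enhancement and relative intent reasoning modules, to improve models' ability to discern and understand ambiguous, culturally grounded content.

Details

Motivation: Chinese harmful meme detection faces two difficulties: accurately assessing harmfulness depends on understanding deep cultural context, and many memes are semantically ambiguous, which makes harmfulness judgments highly subjective.

Result: On the task of attributing Chinese harmful memes, the method outperforms mainstream baselines across multiple metrics; quantitative and qualitative experiments support its effectiveness.

Insight: The contributions are the first Chinese harmful meme explanation dataset (with opposing interpretations for each meme), a knowledge base of Chinese cultural concepts and offensive vocabulary (C-HarmKB), and the RIKE attribution analysis framework, which combines knowledge enhancement with relative intent reasoning to better handle cultural context and semantic ambiguity.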

Abstract: Research on harmful meme detection has garnered significant attention, resulting in the development of numerous datasets and methods. However, progress in detecting Chinese harmful memes lags considerably, primarily due to two challenges: first, accurately assessing a meme’s harmfulness depends heavily on understanding deep cultural context; second, many memes are semantically ambiguous, making harmfulness highly subjective. To address these issues, we focus on the interpretable detection of Chinese harmful memes by constructing the first Chinese harmful meme explanation dataset, Ex-ToxiCN-MM. This dataset offers opposing interpretations, categorized as “harmful” and “non-harmful”, for each meme, aiming to rigorously evaluate a model’s ability to discern and comprehend ambiguous, culturally grounded content. We built a specialized knowledge base of Chinese cultural concepts and offensive vocabulary to supply models with essential prior knowledge (C-HarmKB). To address the ambiguity and lack of background knowledge in meme attribution, we have developed a comprehensive attribution analysis framework, RIKE, which includes an Attribution Knowledge Enhancement module (AKE) and a Relative Intent Reasoning module (RIR). Extensive quantitative and qualitative experiments demonstrate that our method outperforms mainstream baseline models across multiple metrics in the task of attributing harmful memes in Chinese. The code, Ex-ToxiCN-MM dataset, and Chinese Harmful Semantic Knowledge Base (C-HarmKB) involved in this study have been open-sourced at https://github.com/wimiw123/Ex-ToxiCN-MM


[8] SEAL: Synergistic Co-Evolution of Agents and Learning Environments cs.CL

Yihao Hu, Zhihao Wen, Xiujin Liu, Pan Wang, Xin Zhang

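TL;DR: This paper proposes SEAL, a closed-loop co-evolution framework for interactive tool-use agents. It collects on-policy trajectories under executable verification, diagnoses failed rollouts into turn-level failure labels, and uses these diagnoses to drive both environment-side and model-side adaptation. The environment evolves its training-time learning interface by exposing clearer tool affordance cues, constraint information, and recovery-oriented feedback, while the policy is updated with diagnosis-guided advantage reweighting.

Details

Motivation: Existing self-evolution methods for LLM agents usually adapt either the policy or the learning environment in isolation. As the agent's capability frontier shifts, the environment that provides supervision stays static or only weakly coupled to the agent's revealed failures, a structural gap the authors call "Agent-Environment Misalignment".

Result: Across in-distribution and out-of-distribution multi-turn tool-use evaluations, SEAL improves low-resource agent learning: with only 400 training samples, it yields average gains of +8.25 to +26.25 points across three backbones and shows positive out-of-distribution transfer.

Insight: The core idea is to treat the agent policy and the training environment as a single co-evolving system, optimizing both with a shared signal obtained by diagnosing failed rollouts, thereby closing the agent-environment misalignment. This points to a new way of building robust self-improving LLM agents: jointly adapt the learner and its training-time learning substrate.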

Abstract: Large Language Model (LLM) agents are increasingly improved through interaction, yet most self-evolution methods adapt either the policy or the learning environment in isolation. We identify this structural gap as Agent-Environment Misalignment: the agent’s capability frontier changes during training, while the environment that provides supervision remains static or only weakly coupled to the agent’s revealed failures. We propose SEAL, a closed-loop co-evolution framework for interactive tool-use agents. SEAL collects on-policy trajectories under executable verification, diagnoses failed rollouts into turn-level failure labels, and uses these diagnoses as a shared signal for both environment-side adaptation and model-side policy optimization. The environment evolves its training-time learning interface by exposing clearer tool affordance cues, constraint information, and recovery-oriented feedback, while the policy is updated with diagnosis-guided advantage reweighting. Extensive experiments across in-distribution and out-of-distribution multi-turn tool-use evaluations show that SEAL improves low-resource agent learning: with only 400 training samples, it yields +8.25 to +26.25 average-point gains across three backbones and exhibits positive out-of-distribution transfer. These results demonstrate the value of jointly adapting the learner and its training-time learning substrate for robust self-improving LLM agents.


[9] Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval cs.CL | cs.CV

Hao Sun, Yingyan Hou, Jiayan Guo, Bo Wang, Chunyu Yang

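TL;DR: This paper proposes Unveil, a framework for multimodal document retrieval. It integrates textual and visual features into a robust document representation and uses knowledge distillation to transfer that semantic understanding to a purely visual model, enabling efficient, parsing-free retrieval that preserves semantic fidelity.

Details

Motivation: Diverse document formats and modalities make retrieval difficult in real-world scenarios. Traditional text-based approaches disregard layout information and are error-prone, while recent parsing-free visual methods struggle to capture fine-grained semantics in text-rich scenarios.

Result: Experiments show that the proposed visual-textual embedding method surpasses existing approaches and that knowledge distillation narrows the performance gap between visual-textual and visual-only methods, improving both retrieval accuracy and efficiency.

Insight: The main novelty is a unified visual-textual integration and distillation framework: knowledge distillation transfers the multimodal model's semantic capabilities to a lightweight visual-only model, giving efficient parsing-free retrieval while preserving semantic fidelity.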

Abstract: Document retrieval in real-world scenarios faces significant challenges due to diverse document formats and modalities. Traditional text-based approaches rely on tailored parsing techniques that disregard layout information and are prone to errors, while recent parsing-free visual methods often struggle to capture fine-grained textual semantics in text-rich scenarios. To address these limitations, we propose Unveil, a novel visual-textual embedding framework that effectively integrates textual and visual features for robust document representation. Through knowledge distillation, we transfer the semantic understanding capabilities from the visual-textual embedding model to a purely visual model, enabling efficient parsing-free retrieval while preserving semantic fidelity. Experimental results demonstrate that our visual-textual embedding method surpasses existing approaches, while knowledge distillation successfully bridges the performance gap between visual-textual and visual-only methods, improving both retrieval accuracy and efficiency.


Jihyung Lee, Hyounghun Kim, Gary Lee

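TL;DR: This paper proposes Decompose-and-Refine (DaR), a structured legal question answering framework for accurately retrieving supporting statutory provisions in multi-hop legal QA. DaR progressively decomposes a complex legal question into atomic sub-questions and generates a statute-aligned parametric query for each, so that the central provision can be retrieved precisely. Experiments on KoBLEX, a Korean multi-hop legal QA benchmark, show that DaR significantly improves retrieval accuracy and final answer quality.

Details

Motivation: Legal QA requires answers that are not only accurate but also rigorously grounded in explicit legal authority. In multi-hop statutory legal QA, existing approaches (such as reasoning in natural language or retrieval without explicit query reformulation) fail to bridge the vocabulary gap between user questions and statutory text, which raises the risk of hallucination.

Result: On KoBLEX, with Qwen3-32B and Gemma3-27B, DaR consistently outperforms existing approaches in both retrieval accuracy and final answer quality.

Insight: The novelty lies in tightly integrating step-wise question decomposition with parametric knowledge-based query refinement: statute-aligned parametric queries are generated for each sub-question so that the central provision can be selected precisely. Beyond the performance gains, explicitly separating sub-questions and their corresponding provisions enables transparent, issue-level verification of complex legal reasoning.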

Abstract: Large language models (LLMs) have shown strong performance in the legal domain, demonstrating notable potential in Legal Question Answering (LQA). However, unlike general QA, LQA requires answers that are not only accurate but also rigorously grounded in explicit legal authority. In statutory LQA, many questions require multi-hop reasoning across multiple legal issues, substantially increasing the risk of hallucination, thereby making accurate retrieval of supporting statutory provisions a critical prerequisite. Despite recent progress in multi-hop QA, existing approaches often rely on reasoning in natural language or retrieval without explicit query reformulation, leaving the vocabulary gap between user questions and statutory text largely unaddressed. To address this challenge, we propose Decompose-and-Refine (DaR), a statute-grounded LQA framework that tightly integrates step-wise question decomposition with parametric knowledge-based query refinement. DaR progressively decomposes a complex legal question into atomic sub-questions and generates statute-aligned parametric queries for each sub-question, enabling the selection of a single most central statutory provision corresponding to each legal issue. We evaluate DaR on KoBLEX, a Korean multi-hop LQA benchmark grounded in statutory law, using Qwen3-32B and Gemma3-27B. Experimental results demonstrate that DaR consistently improves both retrieval accuracy and final answer quality over existing approaches. Moreover, by explicitly separating sub-questions and their corresponding statutory provisions, DaR facilitates transparent, issue-level verification of complex legal reasoning processes.


Max Prior, Niklas Wais, Matthias Grabmair

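TL;DR: This paper presents a fully automated pipeline that turns large collections of court decisions into legal commentaries. Using retrieval, clustering, and generation, it extracts paragraph-level chunks from 4,555 decisions of the German Federal Court of Justice, summarizes their reasoning, and derives keywords, which are then embedded and clustered. For each cluster, an LLM generates headings and synthesizes citation-rich sections, which four state-of-the-art LLMs then merge into coherent commentaries.

Details

Motivation: The goal is to generate legal commentaries automatically from large collections of court decisions without supplying a handcrafted doctrinal framework, lowering generation cost and allowing rapid updates.

Result: The pipeline is evaluated along five dimensions (topical relevance, heading match, citation faithfulness, cluster distinction, and logical ordering) by a human expert and an LLM judge. The results show that commentary-like argument mining from court decisions is feasible, with limitations arising from restricted sources and the normativity of legal reasoning.

Insight: The novelty is a fully automated commentary generation pipeline that needs no handcrafted framework. It combines retrieval, clustering, and LLM generation to go end to end from raw decisions to structured commentary, offering a new approach to automated legal text processing.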

Abstract: We present a fully automated pipeline that transforms large collections of court decisions into legal commentaries for statutes without providing any handcrafted doctrinal framework. Using 4,555 decisions of the German Federal Court of Justice that cite sections 242, 280, 812 and 823 of the German Civil Code (BGB), we extract paragraph-level chunks, summarize their reasoning, and derive keywords, which are embedded and clustered. For each cluster, an LLM generates headings and synthesizes citation-rich sections, which are then merged into coherent commentaries by four state-of-the-art LLMs. We evaluate along five dimensions (topical relevance, heading-match, citation faithfulness, cluster distinction and logical ordering) using both a human expert and an LLM-judge. Our results show that commentary-like argument mining from court decisions to generate reports that can be refreshed within minutes at minimal cost is feasible, yet they highlight limitations arising from restricted sources and the normativity of legal reasoning.


[12] AstroMind: A High-Fidelity Benchmark for Spacecraft Behavior Reasoning Based on Large Language Models cs.CL

Hao Liu, Siyuan Yang, Qinglei Hu, Dongyu Li

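TL;DR: AstroMind is a benchmark built on high-fidelity astrodynamics simulations and real observational constraints for evaluating large language models on spacecraft behavior reasoning tasks: intent inference, maneuver parameter estimation, and threat assessment. It includes realistic sensing noise and multi-source textual intelligence, and its metrics measure both semantic correctness and quantitative consistency under physical constraints.

Details

Motivation: As Earth orbits grow crowded and contested, understanding why a spacecraft maneuvers, rather than merely detecting that it did, is increasingly important for space domain awareness. Current analysis pipelines are built for detection: they are good at spotting events but weaker at reasoning about what they mean.

Result: Benchmarking a suite of open-weight models shows that no single model dominates every axis: Qwen3 (32B) leads on intent inference accuracy; QwQ (32B) leads on threat assessment and achieves the lowest median relative error on parsed items; GPT-OSS (20B) produces the strongest judged reasoning quality and extracts the most scalar values for parameter estimation (136 of 241 parsed items). Training data composition and reasoning style matter as much as model size.

Insight: The novelty is a physics-grounded, realistic benchmark for spacecraft behavior reasoning that stresses the need to combine physical correctness with reading the tactical situation. By bringing high-fidelity simulation, multi-source intelligence, and structured evaluation metrics into one test framework, it offers a new standard for evaluating LLMs on complex, multimodal physical reasoning tasks and shows that factors beyond model size (such as data composition and prompt engineering) strongly affect performance.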

Abstract: Understanding why a spacecraft maneuvers – rather than simply that it did – is an increasingly important problem for space domain awareness as Earth orbits grow crowded and contested. Current analysis pipelines are built for detection: they are good at picking up that something happened, less good at reasoning about what it means. AstroMind is a physics-grounded benchmark designed to close that gap. It draws on high-fidelity astrodynamics simulations and real observational constraints, converting them into verifiable reasoning problems across three task types: intent inference, maneuver parameter estimation, and threat assessment. Each scenario includes realistic sensing noise and multi-source textual intelligence at varying reliability levels. Evaluation metrics capture both semantic correctness and quantitative consistency under physical constraints. Benchmarking a suite of open-weight models shows no single model dominates every axis: Qwen3 (32B) leads on intent inference accuracy; QwQ (32B) leads on threat assessment and achieves the lowest median relative error on parsed items; GPT-OSS (20B) produces the strongest judged reasoning quality and extracts the most scalar values for parameter estimation (136 of 241 parsed items). Training data composition and reasoning style matter as much as model size. Structured reasoning prompts help consistently across tested 8B models, with larger gains for those that can already track physical constraints. AstroMind gives the field a shared test for a problem where getting the physics right and reading the tactical situation correctly are both required – neither is sufficient on its own.


[13] Word Class Representations Spontaneously Emerge from Successor Representations Trained on Natural Language cs.CL | q-bio.NC

Mathis Immertreu, Achim Schilling, Thomas Kinfe, Patrick Krauss

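TL;DR: This paper brings the successor representation (SR) framework from reinforcement learning to natural language processing. A deep residual neural network is trained on the WikiText-103 corpus to predict future word distributions over multiple temporal horizons, thereby learning long-range transition structure. Without explicit linguistic supervision, the network spontaneously learns word-class representations with clear geometric organization: nouns, verbs, and adjectives become separable and recoverable through unsupervised clustering.

Details

Motivation: The paper explores an alternative to the usual language-model objective of predicting the next token: successor representations, borrowed from reinforcement learning, which model the expected discounted distribution of future states rather than only the next state. The aim is to study whether the structure of language can emerge spontaneously from this kind of predictive sequence learning over long-range transitions.

Result: In a network trained on WikiText-103 (103 million tokens; 20,000-word vocabulary), the learned representation space shows clear separability of part-of-speech (POS) categories such as nouns, verbs, and adjectives, which can be recovered through unsupervised clustering. The predictive horizon systematically affects this organization: short horizons produce the strongest syntactic structure, while longer horizons integrate broader contextual and semantic information. At finer resolutions, interpretable lexical substructure also emerges.

Insight: The novelty is the first systematic application of successor representations to natural language, establishing a conceptual bridge between reinforcement learning, linguistics, and cognitive neuroscience. The core finding is that syntactic categories such as parts of speech need not be explicitly encoded and may arise spontaneously as a consequence of predictive sequence learning, which offers a new perspective on how language representations are acquired.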

Abstract: Language models are typically trained to predict the next token in a sequence. Here, we explore an alternative predictive principle from reinforcement learning: Successor Representations (SRs), which model the expected discounted distribution of future states rather than the immediate next state. We transfer this framework to natural language and train neural networks to predict future word distributions across multiple temporal horizons, thereby learning representations of long-range transition structure. We train a deep residual neural network on WikiText-103 (103 million tokens; 20,000-word vocabulary) and optimize successor representations as probability distributions using KL divergence. Without explicit linguistic supervision, structured language representations emerge spontaneously. After training, the learned space develops a clear geometric organization with respect to part-of-speech (POS) categories: nouns, verbs, and adjectives become separable and recoverable through unsupervised clustering. This organization depends systematically on predictive horizon, with short horizons producing the strongest syntactic structure and longer horizons increasingly integrating broader contextual and semantic information. At finer resolutions, additional interpretable lexical substructure emerges, revealing coherent subclasses within major word categories. These findings suggest that syntactic categories need not be explicitly encoded but may arise as a consequence of predictive sequence learning. To our knowledge, this work provides the first systematic application of successor representations to natural language and establishes a conceptual bridge between reinforcement learning, linguistics, and cognitive neuroscience.
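
A successor-representation target for text can be illustrated by counting, for each word, the words that follow it within a horizon, discounted by distance, and normalizing to a probability distribution. The sketch below builds such empirical targets and the KL divergence a model would be trained to minimize against them. It is an assumption-laden toy: the paper's exact target construction, horizon handling, and discount values are not given in the abstract, and `successor_targets` and `kl_divergence` are illustrative names.

```python
import math
from collections import defaultdict

def successor_targets(tokens, horizon, gamma):
    """Empirical discounted future-word distribution for each word type.

    A word k steps ahead (1 <= k <= horizon) contributes weight gamma**(k-1).
    """
    counts = defaultdict(lambda: defaultdict(float))
    for t, word in enumerate(tokens):
        for k in range(1, horizon + 1):
            if t + k >= len(tokens):
                break
            counts[word][tokens[t + k]] += gamma ** (k - 1)
    targets = {}
    for word, future in counts.items():
        total = sum(future.values())
        targets[word] = {w: c / total for w, c in future.items()}
    return targets

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for sparse distributions stored as dicts; q is floored at eps."""
    return sum(pw * math.log(pw / max(q.get(w, 0.0), eps))
               for w, pw in p.items() if pw > 0)

if __name__ == "__main__":
    corpus = "the cat sat the dog sat".split()
    short = successor_targets(corpus, horizon=1, gamma=0.5)
    longer = successor_targets(corpus, horizon=2, gamma=0.5)
    print(short["the"])    # next-word view: only "cat" and "dog"
    print(longer["the"])   # two-step view: "sat" enters the distribution
    print(kl_divergence(longer["the"], short["the"]))
```

With `horizon=1` the target reduces to an ordinary next-word distribution; larger horizons mix in more distant context, which matches the abstract's observation that short horizons emphasize syntax and long horizons add broader semantics.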


[14] Guarded Repair for Harm-Aware Post-hoc Replacement of LLM Mathematical Reasoning cs.CL | cs.AI | cs.SE

Haizhou Xia

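TL;DR: This paper proposes GuardedRepair, a framework for post-hoc repair of LLM mathematical reasoning. Its core idea is selective replacement: by combining lightweight symbolic checks, semantic-risk diagnostics, and deterministic verification guards, it replaces the original cached reasoning trace only when the repaired candidate is judged safer, avoiding the harm of overwriting answers that were already correct.

Details

Motivation: Post-hoc repair of LLM mathematical reasoning carries an asymmetric risk: fixing an incorrect reasoning trace is useful, but replacing a trace that was already correct can be harmful. The paper studies this selective replacement problem so that repair does not damage what the model already gets right.

Result: On the GSM8K test set, where the initial reasoner already achieves 95.60% accuracy, GuardedRepair raises final accuracy to 96.89%, fixing 17 of the 58 remaining errors with no measured broken-correct cases in the main run. In a weak-reasoner ASDiv setting, accuracy improves from 78.40% to 87.60%. Direct regeneration baselines show that the gain is not explained by stronger-model re-solving alone: re-solving all GSM8K examples lowers accuracy to 93.03% and breaks 47 initially correct answers.

Insight: The main novelty is framing post-hoc repair as harm-aware selective replacement rather than unconstrained re-solving. Through diagnosis, selective triggering, and conservative acceptance policies, the framework substantially improves the fixed/broken tradeoff, and its deterministic verification guards reduce replacement risk. This offers a safer, more controllable paradigm for reliably improving LLM reasoning.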

Abstract: Post-hoc repair of LLM mathematical reasoning introduces an asymmetric risk: fixing an incorrect reasoning trace is useful, but replacing a trace that was already correct can be harmful. We study this problem under a selective replacement setting, where a system must decide whether a repaired candidate is safer than preserving the original cached trace. We present GuardedRepair, a guarded best-of-N repair framework that diagnoses cached reasoning traces, selectively triggers repair, and accepts answer-changing candidates only when deterministic verification guards support replacement. The framework combines lightweight symbolic checks, surface semantic-risk diagnostics, bounded candidate generation, and conservative acceptance policies. On the full GSM8K test set, where the initial reasoner already achieves 95.60% accuracy, GuardedRepair improves final accuracy to 96.89%, fixing 17 of 58 remaining errors without measured broken-correct cases in the main run. On a weak-reasoner ASDiv setting, accuracy improves from 78.40% to 87.60%. Direct regeneration baselines show that this gain is not explained by stronger-model re-solving alone: re-solving all GSM8K examples lowers accuracy to 93.03% and breaks 47 initially correct answers. Additional analyses show that guarded repair substantially improves the fixed/broken tradeoff, while also revealing that replacement risk is reduced rather than eliminated. These results support viewing post-hoc repair as harm-aware selective replacement rather than unconstrained re-solving.
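
The fixed/broken arithmetic behind these numbers can be checked directly, and the acceptance rule can be sketched in a few lines. Two assumptions are made here and are not stated in the abstract: the "full GSM8K test set" is taken to be the standard 1,319-problem split, and a candidate is accepted only if every guard passes. The functions `accuracy_after_repair` and `guarded_accept` and the example guard are hypothetical.

```python
def accuracy_after_repair(n_total, n_correct, n_fixed, n_broken):
    """Final accuracy when repair fixes n_fixed errors and breaks n_broken correct answers."""
    return (n_correct + n_fixed - n_broken) / n_total

def guarded_accept(original_answer, candidate_answer, guards):
    """Keep the cached answer unless the candidate changes it AND all guards pass.

    guards: deterministic predicates over the candidate (assumed all-must-pass policy).
    """
    if candidate_answer == original_answer:
        return original_answer
    if all(guard(candidate_answer) for guard in guards):
        return candidate_answer
    return original_answer

if __name__ == "__main__":
    N = 1319                       # assumed size of the GSM8K test split
    correct = N - 58               # 58 remaining errors -> 1,261 correct
    print(f"initial: {100 * correct / N:.2f}%")                                   # 95.60%
    print(f"guarded: {100 * accuracy_after_repair(N, correct, 17, 0):.2f}%")      # 96.89%
    # Under the same assumption, 93.03% corresponds to 1,227 correct answers,
    # which with 47 breaks implies about 13 fixes for unconstrained re-solving.
    print(f"re-solve: {100 * accuracy_after_repair(N, correct, 13, 47):.2f}%")    # 93.03%
    is_number = lambda a: a.replace(".", "", 1).isdigit()   # toy guard
    print(guarded_accept("12", "15", [is_number]))          # candidate accepted
    print(guarded_accept("12", "about 15", [is_number]))    # original kept
```

The asymmetry is visible in the counts: at 95.60% initial accuracy there are far more correct answers to break (1,261) than errors to fix (58), so an unguarded repair step has more ways to lose than to win.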


[15] HiMed: Incentivizing Hindi Reasoning in Medical LLMs cs.CL

Dingfeng Jiang, Han Yan, Chenze Ma, Amit Kumar Jaiswal, Ang Li

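TL;DR: To address the poor performance of medical large language models in Hindi, this paper introduces the HiMed dataset and benchmark suite and the HiMed-8B model. With a decaying scaffolding reward, the model significantly improves on Hindi medical reasoning tasks and narrows the accuracy gap with English.

Details

Motivation: Medical LLMs hold great promise for reducing healthcare disparities, but Hindi is severely under-resourced and performance is especially poor on Indian systems of medicine. The authors argue that robust cross-lingual medical transfer requires stronger Hindi reasoning.

Result: Extensive experiments show that the proposed method significantly improves Hindi medical reasoning performance and narrows the English–Hindi accuracy gap. Ablation studies validate the contribution of each training stage and reward component.

Insight: The novelty is a Hindi reasoning corpus and benchmark suite covering both Western and Indian medicine, together with a decaying scaffolding reward for training a Hindi medical reasoning model, providing data and methods for medical AI in low-resource languages.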

Abstract: Medical large language models hold promise for reducing healthcare disparities, yet Hindi remains severely underrepresented. While medical LLMs excel in high-resource languages, their performance degrades sharply in Hindi, particularly on Indian systems of medicine. We argue that robust cross-lingual medical transfer requires Hindi reasoning. To this end, we introduce HiMed, a Hindi reasoning medical corpus and benchmark suite covering both Western and Indian medicine. We further propose HiMed-8B, a Hindi-form medical reasoning LLM, through the design of decaying scaffolding reward. Extensive experiments demonstrate improvement in Hindi medical reasoning performance and reduction in the English–Hindi accuracy gap. Ablation studies validate the contribution of each training stage and reward component. All data and code are available on GitHub: https://github.com/FreedomIntelligence/HiMed.


[16] Mix-MoE: Improving Multilingual Machine Translation of Large Language Models through Mixed MoEs cs.CL | cs.AI

Bo Li, Tianyu Dong, Shaolin Zhu, Deyi Xiong

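TL;DR: This paper proposes Mix-MoE, a mixed Mixture-of-Experts framework for improving the multilingual machine translation ability of large language models. It uses two stages of post-pretraining and two specialized groups of MoE layers, Language Model Experts and Machine Translation Experts, combined with a routing mechanism enhanced by Fourier Transform features, to mitigate parameter interference during fine-tuning.

Details

Motivation: LLMs show promise in multilingual machine translation, but fine-tuning them on parallel corpora runs into severe parameter interference.

Result: Experiments show that Mix-MoE significantly outperforms existing baselines on multilingual machine translation and makes notable progress in mitigating parameter interference.

Insight: The core novelty is dividing the MoE layers into LM Experts, which retain monolingual knowledge, and MT Experts, which learn bilingual translation knowledge, and enhancing the routing mechanism with Fourier Transform features derived from model representations to promote effective interaction between experts and exploit latent structural patterns in text.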

Abstract: Large Language Models (LLMs) have shown great promise in multilingual machine translation (MT), even with limited bilingual supervision. However, fine-tuning LLMs with parallel corpora presents major challenges, namely parameter interference. To address these issues, we propose Mix-MoE, a mixed Mixture-of-Experts framework designed to train LLMs for multilingual MT. Our framework operates in two distinct stages: (1) post-pretraining with MoE on monolingual corpora, and (2) post-pretraining with MoE on parallel corpora. Crucially, we divide the MoE layers into two specialized groups: Language Model Experts (LM Experts) and Machine Translation Experts (MT Experts). LM Experts are designed to capture and retain the monolingual knowledge learned by the pre-trained LLM. MT Experts, on the other hand, are specifically trained to acquire and store bilingual translation knowledge. Furthermore, to facilitate effective interaction between these specialized experts and leverage potential underlying structural patterns in text, we introduce a routing mechanism enhanced by Fourier Transform features derived from model representations. The experimental results demonstrate that Mix-MoE excels in multilingual MT, significantly outperforming existing baselines and showing notable progress in mitigating parameter interference.


[17] The Path Matters: Learning a Token-Commitment Policy for Diffusion Language Models cs.CL | cs.AIPDF

Bohang Sun, Max Zhu, Francesco Caso, Jindong Gu, Junchi Yu

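TL;DR: This paper proposes TraceLock, a lightweight plug-in controller that learns a token-commitment policy for diffusion language models, addressing the hidden control problem in parallel decoding. It is trained with self-supervision from future stability and can be deployed across window widths, generation lengths, and step budgets without retraining or calibration.

Details

Motivation: Diffusion LLMs generate quickly by refining many token positions in parallel, but this parallelism introduces a hidden control problem: at each step, which proposed tokens should be transferred into the partially decoded sequence? Existing methods rely mainly on hand-designed confidence rules or block-specific acceptance filters; the authors argue that token commitment can instead be learned as a reusable trace-state policy.

Result: Experiments on question answering, mathematical reasoning, and code generation show that TraceLock improves the quality-step tradeoff over heuristic and learned baselines, with particularly stable behavior under cross-setting deployment.

Insight: The novelty lies in formalizing token commitment as a learnable trace-state policy with self-supervision from future stability. This lets the controller capture a space of commitment trajectories beyond scalar confidence and be reused without per-setting calibration.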

Abstract: Diffusion large language models promise faster generation by refining many token positions in parallel, but this parallelism introduces a hidden control problem: which proposed tokens should be transferred into the partially decoded sequence at each step? We refer to this decision as token commitment. Existing frozen-generator decoders largely rely on hand-designed confidence rules or block-specific acceptance filters. We argue that token commitment can instead be learned as a reusable trace-state policy. We introduce TraceLock, a lightweight plug-in controller that instantiates this policy for a frozen diffusion language model. Since oracle commitment times are unavailable, TraceLock derives self-supervision from future stability: at decoding step t, a proposed token for position i is labeled stable if it matches the final token at position i after the full decoding trace completes. The controller scores variable-length trace states and decides which active token proposals should be committed to the partially decoded sequence. Once trained for a given frozen backbone, the controller can be deployed across local-window widths, generation lengths, and step budgets without retraining or per-setting calibration. Experiments on question answering, mathematical reasoning, and code generation show that TraceLock improves the quality-step tradeoff over heuristic and learned baselines, with particularly stable behavior under cross-setting deployment. Diagnostic analyses show that its decisions are not reducible to scalar confidence, suggesting that frozen diffusion language models expose a learnable space of commitment trajectories beyond confidence-based decoding. Code is available at https://github.com/BobSun98/TraceLock.
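
The future-stability labeling rule stated in this abstract can be sketched in a few lines of Python. The function name and the trace layout (one full list of proposed tokens per decoding step, with the last entry taken as the final sequence) are assumptions made for illustration, not TraceLock's actual data format.

```python
from typing import List, Sequence


def future_stability_labels(trace: Sequence[Sequence[str]]) -> List[List[int]]:
    """Label each proposal 1 if it equals the final token at its position, else 0.

    trace[t][i] is the token proposed for position i at decoding step t.
    Assumption: trace[-1] is the sequence after the full decoding trace completes.
    """
    final = trace[-1]
    return [
        [int(token == final[i]) for i, token in enumerate(step)]
        for step in trace
    ]


# Toy two-step trace: position 1 changes from "x" to "b" before decoding ends.
labels = future_stability_labels([["a", "x", "c"], ["a", "b", "c"]])
```

Labels of this kind need no oracle commitment times, which is the point the abstract makes: the supervision signal comes entirely from the completed trace.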


[18] TS-Skill: A Benchmark for Evaluating Analytical Skills in Time-Series Question Answering cs.CL | cs.AIPDF

Liying Han, Kang Yang, Oliver Wang, Jason Wu, Pengrui Quan

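TL;DR: This paper introduces TS-Skill, a benchmark for evaluating three composable analytical skills in time-series question answering (TSQA): temporal scale selection, temporal localization, and cross-interval integration. The dataset is built at scale with the SKEvol framework and offers timestamp-aware questions, broad domain coverage, and human-validated quality. Experiments on ten state-of-the-art models reveal substantial and uneven capability gaps across these skills; cross-interval integration is especially challenging for non-agent models.

Details

Motivation: Existing TSQA benchmarks are typically organized by task type or high-level reasoning category, which makes it hard to diagnose the underlying signal-level capabilities that drive model performance. A controlled benchmark is therefore needed for fine-grained evaluation of core time-series analysis skills.

Result: Experiments on ten state-of-the-art LLMs and TSLMs show substantial and uneven capability gaps across SK1-SK3. SK3 (cross-interval integration) remains consistently challenging for non-agent models, whereas tool-augmented agents show a selective advantage on standalone SK3 tasks.

Insight: The novelty is a skill-oriented, controlled TSQA benchmark (TS-Skill) and its construction framework (SKEvol), which combines domain-aware generation, skill control, multi-phase verification, and human-in-the-loop curation. The core insight is that skill-level evaluation can reveal temporal reasoning failures hidden by aggregate TSQA scores, giving a finer-grained view for diagnosing model capabilities.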

Abstract: Large language models (LLMs) and time-series language models (TSLMs) are increasingly applied to time-series question answering (TSQA). Unlike text-only QA, TSQA requires models to ground answers in temporal signals whose patterns may occur at different scales, specific time locations, or across separated intervals. However, existing benchmarks are typically organized by task types or high-level reasoning categories, making it difficult to diagnose the underlying signal-level capabilities driving model performance. We introduce TS-Skill, a controlled benchmark for evaluating three composable analytical skills in TSQA: temporal scale selection (SK1), temporal localization (SK2), and cross-interval integration (SK3). TS-Skill provides timestamp-aware questions, broad domain coverage, and human-validated QA quality. To construct the benchmark at scale, we develop SKEvol, a skill-guided agentic framework that combines domain-aware time-series seed generation, skill-controlled question generation, metadata- and code-assisted answer construction, multi-phase signal-grounded verification, and human-in-the-loop curation. Experiments on ten state-of-the-art LLMs and TSLMs reveal substantial and uneven capability gaps across SK1-SK3. In particular, SK3 remains consistently challenging for non-agent models, whereas tool-augmented agents show a selective advantage on standalone SK3. These findings demonstrate that skill-level evaluation can uncover temporal reasoning failures that are obscured by aggregate TSQA scores.


[19] Beyond the Target: From Imitation to Collaboration in Speculative Decoding cs.CLPDF

Jinze Li, Yixing Xu, Guanchen Li, Jinfeng Xu, Shuo Yang

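TL;DR: The paper proposes Collaborative Speculative Decoding (CoSpec), a generalization of conventional speculative decoding (SPD). Rather than treating the large target model as the sole authority, it trains an arbitration policy with reinforcement learning that selectively accepts the draft model's prediction when draft and target disagree, improving final-answer accuracy while preserving the speedup.

Details

Motivation: Conventional SPD implicitly assumes that the target model is the better choice at every token position, but in practice the weaker draft model sometimes makes the better local choice. The paper addresses the failure of this "target is always best" assumption and explores the potential for collaboration between draft and target models.

Result: Experiments show that CoSpec maintains substantial speedups, preserving the acceleration benefit of speculative decoding, while its final-answer accuracy surpasses target-only performance.

Insight: The core innovation is shifting the speculative decoding paradigm from "imitation" to "collaboration" by using a learnable arbitration policy to choose dynamically between draft and target outputs. This challenges the assumption that the target model has absolute token-level authority and offers a new perspective on exploiting the complementarity of strong and weak models.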

Abstract: Speculative decoding (SPD) accelerates large language model (LLM) inference by letting a smaller draft model propose multiple future tokens that are verified in parallel by a larger target model. The dominant SPD paradigm treats the target model as the sole reliable teacher, accepting a draft token only when it exactly matches the target prediction. This design implicitly assumes that the target is always the better choice at every position. In practice, this assumption does not hold. Although the draft is the weaker model overall, it is not uniformly inferior at the token level. In a meaningful fraction of cases where draft and target disagree, the draft’s choice is the one that leads to the correct final answer. Inspired by this, we introduce Collaborative Speculative Decoding (CoSpec), a generalization of SPD that no longer treats the target model as the sole token-level authority. CoSpec trains an arbitration policy via reinforcement learning to decide whether to accept tokens from the draft or target model, selectively accepting draft tokens at mismatches when doing so is likely to yield a correct final answer. Experimental results show that CoSpec maintains substantial speedups while surpassing target-only performance. By shifting the emphasis from imitation to collaboration, CoSpec suggests a new perspective on speculative decoding.
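
The difference between exact-match verification and arbitration at mismatches can be sketched as follows. This is a simplified greedy-decoding illustration, not the paper's algorithm: the `verify_block` function and the `Arbiter` callable are hypothetical, and CoSpec's learned policy is replaced by an arbitrary user-supplied function. The sketch assumes `target[i]` is the target model's prediction at position `i` conditioned on the draft prefix `draft[:i]`.

```python
from typing import Callable, List, Optional

# (position, draft_token, target_token) -> True to keep the draft token
Arbiter = Callable[[int, int, int], bool]


def verify_block(draft: List[int], target: List[int],
                 arbiter: Optional[Arbiter] = None) -> List[int]:
    """Return the tokens emitted for one drafted block.

    With arbiter=None this is exact-match verification: stop at the first
    mismatch and emit the target token there.
    """
    out: List[int] = []
    for pos, (d, t) in enumerate(zip(draft, target)):
        if d == t:
            out.append(d)
        elif arbiter is not None and arbiter(pos, d, t):
            # Keeping the draft token leaves the draft prefix intact, so the
            # target's later predictions are still conditioned correctly.
            out.append(d)
        else:
            out.append(t)
            break  # later target predictions assumed the rejected draft token
    return out


draft_tokens, target_tokens = [1, 2, 3, 4], [1, 9, 3, 7]
baseline = verify_block(draft_tokens, target_tokens)
collaborative = verify_block(draft_tokens, target_tokens, lambda pos, d, t: pos == 1)
```

In this toy block the baseline emits two tokens, while the arbiter that keeps the draft token at position 1 lets verification continue to the end of the block.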


[20] Lngram: N-gram Conditional Memory in Latent Space cs.CLPDF

Yunao Zheng, Guoyang Xia, Xiaojie Wang, Lei Ren

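TL;DR: Lngram is an N-gram conditional memory module learned in latent space. It learns discrete symbols directly from hidden states and performs N-gram lookup over them, decoupling retrieval from backbone computation. The approach reduces dependence on tokenizer IDs, extends to non-text modalities, consistently lowers perplexity in long-context language modeling, and yields overall gains on vision-language and related tasks.

Details

Motivation: Standard Transformers handle both compositional reasoning and local static knowledge retrieval through dense computation. Engram partially decouples retrieval but remains tied to tokenizer IDs. Lngram aims to decouple further by learning symbols directly from hidden states, removing the dependence on the tokenizer and extending to multiple modalities.

Result: In the evaluated settings, Lngram outperforms Transformer and Engram baselines, consistently reduces perplexity in long-context language modeling, and effectively injects domain knowledge when added post hoc to pretrained models. Joint training with the backbone surpasses full fine-tuning, and vision-language and vision-language-action tasks show overall gains.

Insight: The novelty lies in learning discrete symbols in latent space and performing N-gram lookup over them, which removes tokenizer constraints and supports multiple modalities. Analyses suggest that it lets prediction-relevant information emerge earlier, increasing effective depth with limited inference and memory overhead.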

Abstract: Sequence modeling requires both compositional reasoning and local static knowledge retrieval, yet standard Transformers handle both through dense computation. Engram partially decouples retrieval from the backbone, but its token-based keys remain tied to text tokenization and hash compression. We propose Lngram, a latent-space conditional memory module that learns discrete symbols directly from hidden states and performs N-gram lookup over these symbols. This design removes the dependence on tokenizer IDs and naturally extends to non-text modalities. In our evaluated settings, Lngram outperforms Transformer and Engram baselines, consistently reduces perplexity in long-context language modeling, and effectively injects domain knowledge when added post hoc to pretrained models. Joint training with the backbone further surpasses full fine-tuning, while experiments on vision-language and vision-language-action tasks show overall gains. Analyses with LogitLens and CKA suggest that Lngram enables prediction-relevant information to emerge earlier, increasing effective depth with limited inference and memory overhead. Code is available at https://github.com/zyaaa-ux/Lngram.
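
To make the idea of N-gram lookup over latent symbols concrete, here is a toy Python sketch. It assumes nearest-codebook quantization and a plain dictionary as the memory; both are stand-ins chosen for illustration, and the abstract does not say how Lngram actually learns its symbols or stores its memory. The names `quantize` and `ngram_lookup` are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def quantize(hidden_states: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each hidden vector to the index of its nearest codebook vector."""
    def sqdist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(h, codebook[k]))
        for h in hidden_states
    ]


def ngram_lookup(symbols: Sequence[int],
                 memory: Dict[Tuple[int, ...], str],
                 n: int = 2) -> List[Optional[str]]:
    """For each position, fetch the memory entry keyed by the n symbols ending there."""
    out: List[Optional[str]] = []
    for i in range(len(symbols)):
        key = tuple(symbols[max(0, i - n + 1): i + 1])
        out.append(memory.get(key) if len(key) == n else None)
    return out


codebook = [[0.0, 0.0], [1.0, 1.0]]
symbols = quantize([[0.1, 0.0], [0.9, 1.2], [0.2, 0.1]], codebook)
retrieved = ngram_lookup(symbols, {(0, 1): "m01", (1, 0): "m10"})
```

Because the keys are built from symbols derived from hidden states rather than tokenizer IDs, the same lookup would apply to any modality that produces hidden vectors, which is the property the abstract emphasizes.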


[21] DTO: a Differentiable Training Objective for Effective Counterfactual Story Rewriting cs.CLPDF

Amelia Girard, Massimo Piccardi

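TL;DR: This paper proposes a differentiable training objective (DTO) for counterfactual story rewriting, which fine-tunes a Transformer model via end-to-end backpropagation to optimize directly for counterfactual improvements. The loss combines fidelity to the reference rewrite with semantic consistency with the source narrative. On the TimeTravel and ART datasets, the method surpasses a maximum-likelihood baseline and performs competitively with contemporary large language models.

Details

Motivation: Counterfactual story rewriting requires small, localized modifications to an existing story. The conventional maximum-likelihood objective tends to overlook these nuances, and reinforcement-learning approaches are complex to set up and slow to train, so a more effective training method is needed.

Result: Experiments on TimeTravel and ART show that DTO surpasses a maximum-likelihood baseline and a preference-based approach, and performs competitively with two contemporary large language models on all evaluation metrics.

Insight: The novelty is a fully differentiable loss function that jointly optimizes fidelity and semantic consistency, giving nuanced, controlled text-generation tasks a task-specific differentiable objective and avoiding the complexity of reinforcement learning.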

Abstract: Counterfactual story rewriting is a natural language processing task that requires updating an existing story to reflect a chosen alternative event, yet preserving all the unaffected storyline elements and overall coherence. While large language models have recently made remarkable progress on this task, it still remains challenging since the required modifications are typically very small in size and highly localized. As a consequence, models trained in a conventional manner with the maximum-likelihood training objective tend to overlook these nuances. At the same time, more sophisticated training approaches based on reinforcement learning are notoriously slow and difficult to set up. For these reasons, our paper proposes a novel, differentiable training objective (DTO) that directly optimizes for the requisite counterfactual improvements. In our approach, a transformer model is fine-tuned via end-to-end backpropagation against a fully differentiable loss function that jointly rewards (i) fidelity to the reference rewrite and (ii) semantic consistency with the source narrative. The empirical evaluation on the TimeTravel and ART datasets shows that the proposed DTO approach has been able to surpass a maximum-likelihood baseline and a preference-based approach, and perform competitively against two contemporary large language models in all evaluation metrics. These findings substantiate the effectiveness of task-specific differentiable objectives for nuanced, controlled text-generation tasks.


[22] Towards a Universal Causal Reasoner cs.CL | cs.AI | cs.LGPDF

Qirun Dai, Xiao Liu, Jiawei Zhang, Dylan Zhang, Hao Peng

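TL;DR: This paper proposes UniCo, a framework for generating diverse causal-reasoning training data. It covers 18 query types across Pearl's Causal Ladder and translates symbolic examples into code and natural-language forms to simulate real-world scenarios. After supervised fine-tuning on 66.6K UniCo-generated instances, several LLMs improve by an average of 22.9% on in-distribution query types, outperform existing data generation frameworks by 8.1% on 7 external causal benchmarks, and show more faithful reasoning traces on real-world medical, legal, and tabular reasoning tasks.

Details

Motivation: Most existing causal-reasoning data targets benchmarking of specific aspects of causality and is poorly suited to training generalizable causal reasoners, so a method for generating diverse, high-quality training data is needed.

Result: On Qwen3-4B, Qwen3-8B, and Olmo-3-7B-Instruct, fine-tuning on UniCo data yields an average improvement of 22.9% across the 18 in-distribution query types, an 8.1% gain over state-of-the-art causal data generation frameworks on 7 external causal benchmarks, and an average 20.2% improvement over the base models in faithfulness metrics on real-world tasks.

Insight: The innovations include comprehensive coverage of query types across Pearl's Causal Ladder, translation of symbolic examples into code and natural-language forms to simulate real use cases, and data quality assurance through exact causal inference and filtering of reasoning shortcuts. This helps LLMs not only strengthen causal reasoning but also develop a causal mindset in general reasoning tasks.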

Abstract: Despite the importance of causal reasoning, training LLMs to reason causally remains underexplored. Existing data efforts mostly focus on benchmarking LLMs on specific aspects of causality, making them less suitable for training generalizable causal reasoners. To address this, we propose UniCo, a data generation framework that both (1) addresses 18 causal query types across Pearl’s Causal Ladder and (2) translates natively symbolic examples into code and natural language forms to simulate real-world use cases where causal terms are not explicitly specified. To ensure data quality, UniCo grounds answers with exact causal inference and filters cases with reasoning shortcuts. Upon supervised finetuning with 66.6K UniCo-generated instances, Qwen3-4B, Qwen3-8B and Olmo-3-7B-Instruct achieve an average of 22.9% improvements across all 18 in-distribution query types, and 8.1% over state-of-the-art causal data generation frameworks on 7 established causal benchmarks outside the training distribution. More importantly, in real-world medical understanding, legal decision, and tabular reasoning, UniCo-trained models consistently display more faithful reasoning traces, outperforming the base models by an average of 20.2% in faithfulness metrics. These suggest that causality-centered training not only strengthens causal reasoning, but also equips LLMs with a causal mindset in general reasoning tasks.


[23] When Reasoning Hurts: Source-Aware Evaluation of Frontier LLMs for Clinical SOAP Note Generation cs.CL | cs.AI | cs.LGPDF

Faizan Faisal

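TL;DR: This study evaluates frontier large language models (GPT-5.4, DeepSeek-V4-Flash, Gemma-4-E4B) on clinical SOAP note generation using a source-aware benchmark spanning the OMI Health, ACI-Bench, and PriMock57 datasets. It uses a 2x2 design that independently toggles provider-native reasoning and same-source retrieval-augmented generation (RAG), and assesses outputs with seven automatic metrics and two reference-aware LLM judges.

Details

Motivation: Reasoning-enabled LLMs perform strongly on medical reasoning benchmarks, but it is unclear whether this advantage transfers to structured clinical documentation such as SOAP notes. The study investigates this question.

Result: Across all models, a non-reasoning GPT-5.4 configuration achieves the highest overall quality, while DeepSeek-V4-Flash performs best among reasoning-enabled configurations. Enabling reasoning significantly degrades GPT-5.4 performance on all three datasets, whereas same-source RAG yields smaller, model-dependent improvements.

Insight: The core finding is that stronger reasoning capability should not be assumed to improve fidelity-sensitive clinical SOAP note generation, which underscores the need for dedicated, task-specific evaluation. By controlling reasoning and RAG independently, the study shows that the relationship between model capability and task requirements such as fidelity can be complex and even negative, an important methodological lesson for evaluating models in specialized domains.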

Abstract: Reasoning-enabled LLMs perform strongly on medical reasoning benchmarks, but it remains unclear whether these gains transfer to structured clinical documentation; we investigate this question using SOAP note generation from clinical dialogue in a source-aware benchmark spanning OMI Health, ACI-Bench, and PriMock57. We evaluate GPT-5.4, DeepSeek-V4-Flash, and Gemma-4-E4B in a controlled 2x2 design that independently toggles provider-native reasoning and same-source retrieval-augmented generation (RAG). Outputs are assessed using seven automatic metrics alongside two reference-aware LLM judges. Both evaluation approaches agree that a non-reasoning GPT-5.4 configuration achieves the highest overall quality, while DeepSeek-V4-Flash performs best among reasoning-enabled configurations. Enabling reasoning significantly degrades GPT-5.4 performance across all three datasets, whereas same-source RAG yields smaller, model-dependent improvements. Overall, the findings indicate that stronger reasoning capability should not be assumed to improve fidelity-sensitive SOAP note generation without dedicated, task-specific evaluation.


[24] NITP: Next Implicit Token Prediction for LLM Pre-training cs.CLPDF

Xiangdong Zhang, Debing Zhang, Shaofeng Zhang, Xiaohan Qin, Yu Cheng

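TL;DR: This paper proposes Next Implicit Token Prediction (NITP), a new pre-training objective for large language models. NITP augments standard next-token prediction (NTP) with continuous supervision, in the representation space, on the semantic content of the next token. The aim is to constrain the latent representation space and improve its geometry, and thereby improve generalization on downstream tasks.

Details

Motivation: Standard NTP supervises only through discrete labels in the output logit space. This sparse one-hot supervision leaves the latent representation space under-constrained, which can let hidden states become degenerate or anisotropic and limit generalization.

Result: Across dense and MoE models from 0.5B to 9B parameters, NITP consistently improves downstream performance with negligible computational overhead. On a 9B MoE model, NITP achieves a 5.7% absolute improvement on MMLU-Pro and gains of 6.4% on C3 and 4.3% on CommonsenseQA, with about 2% additional training FLOPs and no additional inference cost.

Insight: The core innovation is combining discrete next-token prediction with dense continuous supervision in the representation space, using stable shallow-layer representations from the same model as self-supervised targets for the implicit semantics of the next token. According to the paper's theoretical analysis, this regularizes the optimization landscape and encourages a more compact, structured representation geometry, making NITP an efficient and general pre-training enhancement.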

Abstract: Standard next-token prediction (NTP) supervises language models solely through discrete labels in the output logit space. We argue that this sparse one-hot supervision leaves the latent representation space under-constrained, allowing hidden states to drift into degenerate and anisotropic configurations that can limit generalization. To address this issue, we propose Next Implicit Token Prediction (NITP), which augments discrete prediction with dense continuous supervision directly in the representation space. NITP trains the model to predict the implicit semantic content of the next token, using shallow-layer representations from the same model as stable self-supervised targets. We provide theoretical analysis showing that NITP regularizes the optimization landscape by mitigating under-constrained degrees of freedom and encouraging a compact, structured representation geometry. Empirically, across dense and MoE models ranging from 0.5B to 9B parameters, NITP consistently improves downstream performance with negligible computational overhead. On a 9B MoE model, NITP achieves a 5.7% absolute improvement on MMLU-Pro, along with gains of 6.4% on C3 and 4.3% on CommonsenseQA, with approximately 2% additional training FLOPs and no additional inference cost. Our implementation is available at https://github.com/aHapBean/NITP.


[25] Better, Faster: Harnessing Self-Improvement in Large Reasoning Models cs.CLPDF

Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, Leszek Rutkowski

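TL;DR: This paper proposes HSIR, which uses a verify-then-exit sampling strategy and an Intrinsic Diversity score to address data imbalance and overthinking in self-improvement training of large reasoning models, improving both reasoning performance and efficiency.

Details

Motivation: Self-improvement training of large reasoning models often falls short on complex tasks and can even lead to model collapse. The authors identify data imbalance and overthinking as the main bottlenecks.

Result: Applied across several post-training paradigms, HSIR yields average performance gains of up to 10.9% and reduces relative inference overhead by up to 42.4%, improving both performance and efficiency.

Insight: The novelty lies in a verify-then-exit sampling strategy that efficiently collects accurate solutions for difficult queries, and an Intrinsic Diversity score that quantifies overthinking and filters out undesired solutions with redundant reasoning steps. The further proposed H-GRPO algorithm uses the intrinsic diversity as an external reward to encourage concise and diverse reasoning through reinforcement learning.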

Abstract: Self-improvement training enables the large reasoning models (LRMs) to improve themselves by self-generating reasoning trajectories as training data without external supervision. However, we find that this method often falls short in complex reasoning tasks and even leads to model collapse. Through a series of preliminary analyses, we reveal two problems: (1) data imbalance, where most training samples are simple, but the challenging yet crucial samples are scarce; (2) overthinking, where many undesired samples with redundant reasoning steps are used for self-training. To this end, we propose HSIR, which effectively Harnesses Self-Improvement in large Reasoning models via two simple-yet-effective approaches. Specifically, HSIR introduces a verify-then-exit sampling strategy to mitigate data imbalance by efficiently collecting more accurate solutions for difficult queries, and designs an Intrinsic Diversity score to quantify overthinking and filter out the undesired solutions. We apply HSIR to various post-training paradigms, among which we further propose H-GRPO, an enhanced GRPO algorithm that leverages the intrinsic diversity as an external reward to encourage concise and diverse reasoning via reinforcement learning. Extensive results show that HSIR not only effectively enhances the reasoning performance, i.e., bringing up to +10.9% average performance gains, but also significantly improves the reasoning efficiency by reducing up to 42.4% relative inference overhead.


[26] Language Bias in LVLMs: From In-Depth Analysis to Simple and Effective Mitigation cs.CL | cs.AIPDF

Yangneng Chen, Jing Li

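TL;DR: This paper systematically studies language bias in large vision-language models (LVLMs), traces its root to modality misalignment during training, and proposes two simple and effective mitigations: Language Bias Regularization (LBR) for instruction tuning and Language Bias Penalty (LBP) for Direct Preference Optimization (DPO). Experiments show that these methods improve performance on multiple benchmarks and reduce hallucination without additional data or auxiliary models.

Details

Motivation: LVLMs suffer from hallucination, generating outputs that are fluent but inconsistent with the image. Recent studies attribute this to language bias, the tendency to over-rely on text while neglecting visual input, but its underlying cause remains unclear. The paper aims to analyze the root of language bias and provide effective mitigations.

Result: Extensive experiments across diverse models and benchmarks show that LBR consistently improves performance on over ten general benchmarks, while LBP significantly reduces hallucination and improves trustworthiness.

Insight: The novelty lies in locating the root of language bias in modality misalignment during training, particularly in the VIT and DPO stages, and proposing two lightweight regularization and penalty methods to mitigate it. The methods are simple and effective, require no additional data or auxiliary models, and offer a new approach to improving multimodal alignment in LVLMs.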

Abstract: Large Vision-Language Models (LVLMs) extend large language models with visual understanding, but remain vulnerable to hallucination, where outputs are fluent yet inconsistent with images. Recent studies link this issue to language bias, the tendency of LVLMs to over-rely on text while neglecting visual inputs. Yet most analyses remain empirical without uncovering its underlying cause. In this paper, we provide a systematic study of language bias and identify its root in modality misalignment during training. Our analysis shows that both Visual Instruction Tuning (VIT) and Direct Preference Optimization (DPO) often prioritize textual improvements, which may cause LVLMs to overly lean toward language modeling rather than balanced multimodal understanding. To address this, we propose two simple yet effective methods: Language Bias Regularization (LBR), which mitigates language bias through regularization during instruction tuning, and Language Bias Penalty (LBP), which penalizes language bias in the DPO training process. Extensive experiments across diverse models and benchmarks demonstrate the effectiveness of our approach. LBR consistently improves performance on over ten general benchmarks, while LBP significantly reduces hallucination and improves trustworthiness. Together, these methods not only mitigate language bias but also advance the overall alignment of LVLMs, all without introducing any additional data or auxiliary models. Our code is publicly available at https://github.com/lab-klc/LVLM-Language-Bias.


[27] LLM Agent Based Renewable Energy Forecasting Using Edge and IoT Data A Review of Solar Wind Weather and Grid Aware Decision Support cs.CL | cs.AIPDF

Pavan Manjunath, Thomas Pruefer

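TL;DR: This review examines how large language model (LLM) agents can enhance forecasting of renewable energy such as solar and wind. It considers integrating heterogeneous sensor data from IoT and edge devices, weather APIs, historical generation records, and grid constraints into unified decision-support workflows. The paper surveys conventional forecasting methods, proposes a six-layer taxonomy, identifies twelve open challenges, and recommends a research agenda centred on open benchmarks, physics-informed LLM grounding, and federated forecasting architectures.

Details

Motivation: The intermittency and variability of solar and wind power pose challenges for grid stability, energy trading, and operational planning. Conventional forecasting pipelines struggle to exploit the large volume of real-time data produced by IoT and edge devices, so more capable approaches are needed to integrate multi-source information and support decisions.

Result: As a review, the paper reports no quantitative experiments or benchmark results. Its main outputs are a taxonomy and a set of identified research challenges rather than a specific model.

Insight: The paper frames the application of LLM agents to renewable energy forecasting, emphasizing their ability to integrate heterogeneous data, reason over context, and produce natural-language reports. Combining LLM reasoning and explanation with physics-informed grounding and edge computing offers a research direction for data fusion and decision support in this field.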

Abstract: Reliable forecasting of renewable energy generation is a foundational requirement for grid stability, energy trading, battery scheduling, and carbon-aware operational planning. Solar and wind resources are inherently intermittent: their output fluctuates with cloud cover, wind speed, atmospheric turbulence, seasonal patterns, and local terrain. The proliferation of IoT and edge devices, spanning smart meters, inverters, anemometers, pyranometers, weather stations, and grid interface sensors, has created an unprecedented volume of real-time operational data that conventional forecasting pipelines are ill-equipped to exploit fully. This review investigates how large language model (LLM) agents can enhance renewable energy forecasting by integrating heterogeneous sensor streams, weather API data, historical generation records, grid constraints, and contextual reasoning into unified decision-support workflows. We survey classical forecasting methods, statistical time-series models, deep learning architectures, physics-hybrid approaches, and emerging LLM agent frameworks for explanation, uncertainty communication, and operator guidance. A six-layer taxonomy is proposed, covering data acquisition, preprocessing, feature engineering, model inference, uncertainty estimation, and natural-language reporting. The review identifies twelve open challenges spanning real-time deployment, model drift under distribution shift, uncertainty quantification, hallucination control in LLM agents, interoperability of edge hardware, and integration with energy management systems. The paper concludes by recommending a research agenda centred on open benchmarks, physics-informed LLM grounding, and federated forecasting architectures.


[28] Re-defining Humor Data Objects for AI Humor Research cs.CLPDF

Anna Arnett, Bang Nguyen, Meng Jiang

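TL;DR: This paper redefines the data objects used in AI humor research, treating humor as a social interaction with context and explanations rather than a binary presence. By designing a humor reasoning data object and improving LLM prompting, the authors generate humor explanations suited to a general audience and scale up data generation, laying groundwork for data synthesis and augmentation in AI humor research.

Details

Motivation: Most existing AI humor research reduces humor to a binary "present" or "not present" classification, ignoring its complexity and context dependence as a social interaction. A richer data representation is needed to advance AI understanding of humor.

Result: Through iterative prompt refinement, the improved prompt reduced important errors, and generation was scaled to a large number of humor reasoning data objects that can support data synthesis and augmentation. No specific benchmarks or state-of-the-art comparisons are reported.

Insight: The novelty lies in redefining humor as a data object for social interaction that includes context and explanations, and in using prompt engineering to improve the quality of LLM-generated humor explanations, particularly by handling missing context, multi-modality, and transcript issues more carefully. This offers a new approach to AI understanding of humor as social behavior.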

Abstract: In most existing AI humor research, humor has been treated as either “present” or “not present.” We explore the concept of humor as a social interaction with context and explanations. During this project, we defined a humor reasoning data object and developed a way to prompt LLMs to generate an explanation of humor that is effective for the general population. We iterated from an earlier prompt to an improved prompt, found that the later version reduced important errors, and then scaled generation to a large number of data objects, which have the potential to enable data synthesis and data augmentation for AI humor research. Our main takeaway is that better prompting of an LLM improves humor explanation quality, especially by handling missing context, multi-modality, and transcript issues more carefully. These results establish a strong foundation for future work on AI understanding of humor as social behavior.


[29] STREAM: A Data-Centric Framework for Mining High-Value Task-Oriented Dialogues from Streaming Media cs.CL | cs.AIPDF

Liang Xue, Haoyu Liu, Cheng Wang, Pengyu Chen, Haozhuo Zheng

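TL;DR: This paper proposes the Stream framework, which mines authentic interaction signals from streaming media (live streams and short videos) to synthesize high-quality task-oriented dialogue data for vertical domains at scale. The framework combines role-grounded persona construction with Conversational Blueprint construction and uses retrieval-augmented generation for knowledge-aware responses, easing the scarcity of domain-specific dialogue data. Based on the framework, the authors release the StreamDial dataset, which covers the Automotive, Restaurant, and Hotel domains and contains a large number of structured dialogue sessions.

Details

Motivation: Large language models for vertical domains are bottlenecked by the scarcity of high-quality, complex task-oriented dialogue data. Existing data acquisition methods face a trilemma: expert annotation is expensive, real-world service conversations are constrained by privacy and commercial restrictions, and static corpora quickly become stale.

Result: Evaluations with automatic judges and downstream tasks such as Dialogue State Tracking show that StreamDial improves intrinsic dialogue quality over strong baselines, and that models trained with StreamDial improve Dialogue State Tracking across backbones. Encouraging multilingual transfer is also observed on Qwen3-8B under a controlled training budget.

Insight: The novelty is a data-centric framework that uses publicly available streaming media as a data source, mining authentic interaction signals and synthesizing structured conversations (combining personas, Conversational Blueprints, and RAG) to produce large-scale, high-quality, multi-domain task-oriented dialogue data while avoiding the limitations of conventional data acquisition. Its approach to turning noisy streaming content into structured dialogue data, and its capture of realistic service behaviors such as requirement mining, constraint conflicts, and negotiation, are useful data-engineering contributions.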

Abstract: Large language models for vertical domains are bottlenecked by the scarcity of complex, domain-specific task-oriented dialogues. Existing data acquisition pipelines face a persistent trilemma: expert annotation is expensive, real-world service conversations are constrained by privacy and commercial restrictions, and static corpora quickly become temporally stale. We propose Stream, a data-centric framework that leverages publicly available streaming media (live streams and short videos) to synthesize high-value service dialogues at scale. Stream mines authentic interaction signals from noisy streams and synthesizes conversations by integrating role-grounded persona construction with Conversational Blueprint construction; it further adopts retrieval-augmented generation (RAG) to support knowledge-aware responses. Based on Stream, we release StreamDial, a large-scale multi-domain dataset covering Automotive, Restaurant, and Hotel. StreamDial contains 87,498 dialogue sessions and 1,497,320 turns in total, with an average of 17.11 turns per session and a comparable scale across domains. Each session is organized as a structured quadruplet $\langle P_u, P_a, B, H \rangle$ that pairs dialogue history with explicit user/agent personas and a Conversational Blueprint, capturing realistic service behaviors such as requirement mining, constraint conflicts, negotiation, and recovery. Evaluations with automatic judges and downstream tasks show that StreamDial improves intrinsic dialogue quality over strong baselines, and models trained with StreamDial improve Dialogue State Tracking across backbones; we further report a completed human-evaluation set and encouraging multilingual transfer on Qwen3-8B under a controlled training budget. The data is released in https://github.com/hitxueliang/DialogDataSetBySTREAM.
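
The session quadruplet and the reported average can be illustrated with a short Python sketch. The `Session` class and its field types are assumptions made for illustration; the abstract names the four components but not their schema. The two totals are taken directly from the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Session:
    """One StreamDial-style session <P_u, P_a, B, H> (field types are assumed)."""
    user_persona: str      # P_u
    agent_persona: str     # P_a
    blueprint: List[str]   # B: Conversational Blueprint, modeled here as a list of steps
    history: List[str]     # H: dialogue turns


# Totals reported in the abstract.
SESSIONS = 87_498
TURNS = 1_497_320

# Consistency check on the reported average of 17.11 turns per session.
avg_turns = TURNS / SESSIONS

example = Session(
    user_persona="budget-conscious first-time car buyer",   # invented toy value
    agent_persona="dealership sales consultant",            # invented toy value
    blueprint=["requirement mining", "negotiation"],
    history=["Hi, I'm looking for a compact car.", "Sure, what is your budget?"],
)
```

Dividing the reported turn count by the session count gives a value that rounds to the 17.11 turns per session stated in the abstract.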


[30] Knowledge Graph-Driven Expert-Level Reasoning for Neuroscience cs.CL | cs.AIPDF

Jake Stephen, Niraj K. Jha

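TL;DR: This paper explores whether knowledge graph (KG)-driven in-depth reasoning can emerge in neuroscience using only the information in a single authoritative textbook. The central hypothesis is that structured knowledge, distilled into a high-quality KG and converted into KG-grounded question-answer supervision, is sufficient to produce expert-level reasoning in a fine-tuned language model (LM) that surpasses large language models (LLMs) in accuracy while using orders of magnitude fewer parameters.

Details

Motivation: The paper asks how to induce deep, mechanistic understanding and expert-level reasoning in a specialized domain such as neuroscience using only the structured knowledge in an authoritative textbook, without relying on large, heterogeneous web-scale corpora.

Result: The results indicate that deep, mechanistic neuroscience understanding can be induced in a fine-tuned LM without reliance on large, heterogeneous web-scale corpora, in line with the hypothesis that such a model can surpass LLMs in accuracy with orders of magnitude fewer parameters. The KG-based synthetic curriculum and the fine-tuned LM are publicly available.

Insight: The novelty is an end-to-end pipeline that builds a high-quality KG from a single authoritative textbook and generates multi-hop QA supervision. It combines dual-LLM validation, KG expansion with a masked LM trained on the KG topology, and reinforcement learning that uses path-derived KG signals as implicit reward models, in order to induce expert-level domain reasoning from small, high-quality data.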

Abstract: A knowledge graph (KG) is an abstraction that can be extracted from text corpora and used for in-depth reasoning. Prior work has leveraged KGs to fine-tune language models (LMs), enabling domain-specific superintelligence. In this work, we explore whether KG-driven in-depth reasoning capabilities can emerge in neuroscience using only information contained within a single authoritative textbook. The central hypothesis is that structured knowledge, when distilled into a high-quality KG and converted into KG-grounded question-answer (QA) supervision, is sufficient to produce expert-level reasoning through a fine-tuned LM that surpasses large language models (LLMs) in accuracy, while employing orders of magnitude fewer parameters. We construct a textbook-derived KG via a dual-LLM validation pipeline, expand it with a masked LM trained on the KG topology, generate multi-hop QA items, which include QA pairs and reasoning traces, to fine-tune an LM exclusively on KG-derived supervision, and apply reinforcement learning using path-derived KG signals as implicit reward models. Our results demonstrate that deep, mechanistic neuroscience understanding can be induced in the model without reliance on large, heterogeneous web-scale corpora. The KG-based synthetic neuroscience curriculum, which readers can quiz themselves on, and the fine-tuned LM are available at the following GitHub location: https://kg-bottom-up-superintelligence.github.io/neuro-bench.


[31] By Their Fruits You Will Know Them: Comparing Formalizations of Law by the Decisions They Encode cs.CL | cs.AIPDF

Julius Vernie, Matthias Grabmair

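TL;DR: This paper presents a method for systematically comparing different formalizations of the same legal provision, assessing them by how their inferences differ on concrete cases. The method matches multiple formalizations at the node level, uses a SAT solver to enumerate the edge cases on which formalizations disagree, and turns those cases into concrete factual scenarios that legal experts can review. Applied to formalizations of ten EU provisions generated by nine frontier LLMs, it finds that behavioral divergence is essentially uncorrelated with structural agreement and that the disagreements revealed include some that mirror genuine controversies in the legal commentary.

Details

Motivation: Formalizing legal provisions promises machine-accessible law and automated legal reasoning, and LLMs make it possible to generate formalizations directly from statutory text. However, any formalization makes implicit interpretive choices whose consequences are hard to anticipate, especially when an LLM is the author.

Result: The method is applied to formalizations of ten EU provisions generated by nine frontier LLMs. Behavioral divergence between formalizations is essentially uncorrelated with their structural agreement, and the verbalized cases reveal qualitatively distinct types of disagreement, some of which mirror genuine controversies in the legal commentary.

Insight: The novelty is a method that evaluates legal formalizations systematically by enumerating and comparing their inferential differences rather than relying on structural similarity alone. It turns differences in internal logic into interpretable, concrete legal scenarios that give legal experts an actionable basis for review, and it exposes implicit interpretive choices in LLM-generated formalizations.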

Abstract: Formalizing legal provisions promises machine-accessible law and automated legal reasoning, and recent LLMs make it tempting to generate such formalizations directly from statutory text. However, any formalization makes implicit interpretive choices whose consequences are hard to anticipate, especially if an LLM is the author. We present a method for systematically comparing different formalizations of the same legal provision by their inferences on individual cases. Given multiple formalizations of a provision, we match them at the node level, derive a shared interface for each pair from the matching, and use a SAT solver to enumerate the edge cases on which any two formalizations disagree. Selected edge cases are then verbalized into concrete factual scenarios that a legal expert can examine and act on. We apply our method to formalizations of ten EU provisions generated by nine frontier LLMs. We find that behavioral divergence between formalizations is essentially uncorrelated with their structural agreement and that the verbalized cases reveal qualitatively distinct types of disagreement, including divergences that mirror genuine controversies in the legal commentary.
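
The core step, enumerating the cases on which two formalizations disagree, can be sketched in Python. This toy version substitutes a brute-force truth-table search for the SAT solver the paper uses, which is only feasible for a handful of atoms. The atoms and both rule variants are invented for illustration and do not come from the paper.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence

Case = Dict[str, bool]
Formalization = Callable[[Case], bool]


def disagreements(atoms: Sequence[str], f: Formalization,
                  g: Formalization) -> List[Case]:
    """Return every truth assignment over the shared atoms where f and g differ."""
    cases: List[Case] = []
    for values in product([False, True], repeat=len(atoms)):
        case = dict(zip(atoms, values))
        if f(case) != g(case):
            cases.append(case)
    return cases


# Two hypothetical readings of the same provision over a shared interface.
ATOMS = ["is_consumer", "online", "off_premises"]


def broad(c: Case) -> bool:
    return c["is_consumer"] and (c["online"] or c["off_premises"])


def narrow(c: Case) -> bool:
    return c["is_consumer"] and c["online"]


edge_cases = disagreements(ATOMS, broad, narrow)
```

The single disagreement found here, a consumer contract concluded off premises but not online, is the kind of edge case that could then be verbalized as a concrete factual scenario for a legal expert to review.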


[32] GroupTravelBench: Benchmarking LLM Agents on Multi-Person Travel Planning cs.CLPDF

Xiang Cheng, Yulan Hu, Lulu Zheng, Zheng Pan, Xin Li

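TL;DR: This paper introduces GroupTravelBench, the first benchmark for evaluating large language model (LLM) agents on multi-user, multi-turn travel planning. Built from real user profiles, POI data, and ticket price data, it comprises 650 synthesized tasks split into three difficulty levels and evaluates three key agent capabilities: preference elicitation, conflict coordination, and itinerary planning.

Details

Motivation: Existing travel-planning benchmarks typically assume a single user and so ignore a core challenge of real-world scenarios: conflicting preferences among multiple users. To fill this gap, the paper builds a benchmark that systematically evaluates how well LLM agents plan and coordinate in multi-user settings.

Result: The study evaluates a wide range of LLMs and finds that even frontier models still show substantial weaknesses in preference coverage and group fairness. An interactive sandbox environment with cached real-world tool data enables reliable offline evaluation.

Insight: The novelty lies in proposing the first benchmark focused on multi-user travel planning and in explicitly defining three core evaluation dimensions: proactive dialogue to elicit preferences, conflict coordination, and planning that maximizes group utility. The sandbox environment offers a practical way to evaluate complex multi-turn interactive tasks reliably offline.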

Abstract: Travel planning is a realistic task for evaluating the planning and tool-use abilities of LLM agents. However, existing benchmarks typically assume only a single user, thereby avoiding one of the most challenging aspects of real-world scenarios: an agent’s ability to identify and resolve conflicts among multiple users. To address this gap, we introduce GroupTravelBench, the first benchmark for multi-user, multi-turn travel planning. Based on real user profiles, POI data, and ticket price data, we synthesize 650 tasks and divide them into three difficulty levels. Beyond standard abilities in single-user itinerary planning, such as multi-step reasoning and tool use, our benchmark further evaluates three key capabilities required for travel agents: (i) elicitation – proactively engaging in multi-turn dialogue to gather preferences from each user; (ii) coordination – resolving conflicts among users through compromise or subgrouping strategies; and (iii) planning – searching for travel plans that maximize overall group utility while maintaining fairness and feasibility. To simulate real-world conversational itinerary planning while enabling reliable tool use and offline evaluation, we build an interactive sandbox environment with cached real-world tool data. We evaluate a wide range of LLMs and find that even frontier models still show substantial weaknesses in preference coverage and group fairness. GroupTravelBench provides a practical and reproducible benchmark for advancing research on LLM agents for real-world travel planning.


[33] From Automation to Collaboration: Human-in-the-Loop Methods for Safe and Trustworthy NLP cs.CLPDF

Most. Sharmin Sultana Samu, MD. Tanvir Ahmed Seum, Md. Rakibul Islam

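TL;DR: This survey examines how natural language processing (NLP) can shift from automation toward human-AI collaboration to improve the safety and trustworthiness of large language models. It reviews how human expertise supports model auditing, robustness evaluation, data construction, and model steering, and identifies gaps in scalable probing, sustainable benchmarks, low-resource settings, and governance of private systems.

Details

Motivation: Large language models are widely deployed in high-stakes NLP tasks, yet risks such as bias, hallucination, adversarial vulnerability, and unreliable generalization remain. Human supervision is essential but costly and hard to scale, which motivates human-in-the-loop methods for addressing these challenges.

Result: As a survey, the paper reports no experiments of its own, but it summarizes existing findings: probe-based auditing reveals inconsistencies in model behavior, and adversarial text generation exposes robustness gaps, especially in lower-resourced languages with limited benchmarks.

Insight: The contribution is a systematic examination of human-in-the-loop methods for NLP safety and trustworthiness, framed as a shift from automation to collaboration. The paper also outlines future research directions, such as adaptive auditing, collaborative evaluation, and accountable deployment, aimed at closing the gap between current research and practice.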

Abstract: Large language models are widely deployed in high-stakes NLP tasks, yet risks such as bias, hallucination, adversarial vulnerability and unreliable generalization remain. Probe-based auditing reveals inconsistencies in model behavior. Adversarial text generation uncovers robustness gaps, especially in lower-resourced languages with limited benchmarks. Enterprise text-to-SQL settings expose the difficulty of validating outputs over private and large-scale databases. Human supervision is essential for probe validation, adversarial verification and domain-specific annotation, but it is costly and hard to scale. This survey examines recent human-in-the-loop methods that shift NLP from automation toward collaboration for safety and trustworthiness. We review how human expertise supports auditing, robustness evaluation, data construction and model steering. Our findings highlight gaps in scalable probing, sustainable robustness benchmarks, low-resource settings and governance of private systems. We outline practical research directions for adaptive auditing, collaborative evaluation and accountable deployment.


[34] JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment cs.CL | cs.AI | cs.CYPDF

Russell Yang, Ruishi Chen, Pierce Kelaita, Riya Ranjan, Sibo Ma

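TL;DR: This paper introduces JudgmentBench, a benchmark of 30 real-world legal tasks with 1,539 rubric scores and 1,530 pairwise preference judgments collected from practicing attorneys. Comparing the two evaluation methods on LLM-generated outputs at three quality levels, the study finds that comparative judgments recover the intended quality ordering substantially better than rubric scoring while taking less annotation time.

Details

Motivation: Current benchmarks rely mainly on rubric-based scoring and comparative judgment, but the choice between them is rarely justified. Using a paired dataset in a high-expertise domain, this paper empirically compares the two methods for quality assessment.

Result: On LLM-generated outputs at three quality levels, comparative judgment achieves a mean Spearman's rank correlation of 0.908 versus 0.150 for rubric scoring (estimated difference 0.758 [0.494, 1.021]) while requiring less than half the annotation time. The pattern holds for both human annotators and LLM autograders.

Insight: The novelty is the first publicly available paired dataset in a high-expertise domain (law) in which the same experts provide both supervision signals on the same items. The study offers an empirical basis and a research agenda for how expert judgment should be elicited, aggregated, and used as supervision in domains without verifiable ground truth.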

Abstract: Two methodologies dominate current practices of benchmarking: rubric-based scoring evaluates items against predefined criteria, whereas comparative judgment elicits pairwise preferences between outputs. Although both methodologies are widely used, the choice between them is rarely justified. We release JudgmentBench, a benchmark of 30 real-world legal tasks, paired with 1,539 rubric scores and 1,530 pairwise preference judgments collected from practicing attorneys with substantial experience, including attorneys at major U.S. law firms. The annotations constitute the first publicly available dataset in a high-expertise domain in which both supervision signals are elicited from the same experts on the same items. Using LLM-generated outputs at three constructed quality levels, we provide an initial empirical comparison: comparative judgments recover the intended quality ordering substantially better than rubrics (mean Spearman’s rank correlation of 0.908 vs. 0.150, estimated difference = 0.758 [0.494, 1.021]) while requiring less than half the annotation time. The patterns hold for human annotators and LLM autograders. Beyond this initial comparison, the paired structure of the dataset supports a broader research agenda on how expert judgment should be elicited, aggregated, and used as supervision in domains without verifiable ground truth.
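For readers unfamiliar with the headline metric, the following sketch computes Spearman's rank correlation between an intended quality ordering and the scores an evaluation method assigns. It is a generic textbook implementation (Pearson correlation on average ranks), not the benchmark's own evaluation code, and the example scores are invented for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / sqrt(var_x * var_y)


# Hypothetical example: intended quality levels (1 = worst, 3 = best)
# versus the scores two evaluation methods assign to the same outputs.
intended = [1, 2, 3]
method_a_scores = [2.0, 3.5, 4.1]  # recovers the intended order
method_b_scores = [3, 1, 2]        # scrambles it

print(spearman(intended, method_a_scores))
print(spearman(intended, method_b_scores))
```

A value near 1 means the method ranks outputs in the intended order; a value near 0 means its ranking carries little information about that order.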


[35] Inference Time Optimization with Confidence Dynamics cs.CLPDF

Yu Wang, Minghao Liu, Jiayun Wang, Jinrui Huang, Ankit Shah

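TL;DR: This paper studies how confidence evolves during large language model reasoning and reports, for the first time according to the authors, a marked difference between correct and incorrect answer traces: confidence tends to rise over time for correct answers and to attenuate or decline for incorrect ones. Based on this observation, the authors propose Confidence Dynamic Gain (CDG) voting, which uses the confidence trajectory along the reasoning chain to improve answer selection.

Details

Motivation: Inference-time optimization techniques such as repeated sampling have improved LLM reasoning, but the role of model uncertainty in these strategies remains underexplored; this paper aims to fill that gap.

Result: Experiments with four open-source architectures (DeepSeek-R1, gpt-oss, Gemma-3, Qwen-QwQ) on the AIME24/25, HMMT25, and BRUMO25 benchmarks show that CDG yields significant performance gains over baselines.

Insight: The novelty is the first account of a distinctive pattern in confidence dynamics during reasoning, and a voting mechanism built on how the confidence trajectory evolves, which gives a robust discriminative signal for answer selection in LLM reasoning. The paper also offers theoretical insight into the phenomenon, which aids interpretability.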

Abstract: Inference time optimization techniques, such as repeated sampling, have significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, the critical role of model uncertainty remains largely underexplored in these optimization strategies. In this paper, we investigate the dynamics of confidence along reasoning trajectories and for the first time reveal a surprising and unique pattern: correct answer traces tend to exhibit confidence improvement over time (positive confidence gain), while incorrect traces show attenuated or declining confidence as reasoning proceeds. Based on this observation, we propose Confidence Dynamic Gain (CDG) based voting, which incorporates how the confidence trajectory of the response evolves along the reasoning chain. Experiments across four open-source architectures (DeepSeek-R1, gpt-oss, Gemma-3, Qwen-QwQ) on the AIME24/25, HMMT25, and BRUMO25 benchmarks demonstrate that CDG yields a significant performance boost over baselines. These results demonstrate that our method provides a robust discriminative signal for improving answer selection in LLM reasoning. We also provide theoretical insights for this phenomenon. Code will be released at https://github.com/Accenture/CDG.git.
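The idea of voting on confidence trajectories rather than on answer counts alone can be sketched as follows. The abstract does not define how the gain is computed, so this example makes its own assumptions: the gain is the mean confidence in the second half of a trace minus the mean in the first half, and only positive gains contribute to an answer's score. Both choices, and all names and numbers below, are hypothetical and should not be read as the authors' CDG formula.

```python
from collections import defaultdict


def confidence_gain(trace):
    """Late-half mean minus early-half mean of a per-step confidence trace.

    This definition is an assumption made for illustration.
    """
    if len(trace) < 2:
        raise ValueError("need at least two confidence values")
    half = len(trace) // 2
    early = sum(trace[:half]) / half
    late = sum(trace[half:]) / (len(trace) - half)
    return late - early


def gain_weighted_vote(samples):
    """Pick the answer whose traces accumulate the most positive gain."""
    scores = defaultdict(float)
    for answer, trace in samples:
        scores[answer] += max(confidence_gain(trace), 0.0)
    return max(scores, key=scores.get)


# Hypothetical sampled responses: (final answer, per-step confidence).
samples = [
    ("42", [0.4, 0.5, 0.7, 0.8]),  # rising confidence
    ("17", [0.8, 0.7, 0.5, 0.4]),  # falling confidence
    ("17", [0.6, 0.6, 0.5, 0.5]),  # falling confidence
    ("42", [0.5, 0.5, 0.6, 0.6]),  # rising confidence
]

print(gain_weighted_vote(samples))
```

In this toy input a plain majority vote is tied at two samples each, so the trajectory signal is what separates the two answers.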


[36] READER: Reasoning-Enhanced AI-Generated Text Detection cs.CL | cs.AIPDF

Pingfan Su, Kai Ye, Shijin Gong, Erhan Xu, Jin Zhu

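TL;DR: This paper presents READER, a reasoning-enhanced AI-generated text detector that outputs both a human/AI label and a structured rationale explaining its decision. The approach builds the READ supervision set and fine-tunes an LLM on it so that the model reasons before detecting at inference time. Despite having only 1.5B parameters, READER outperforms existing detectors and prompted LLM baselines that are 100 to 1000 times larger across multiple benchmarks.

Details

Motivation: As large language models advance, distinguishing human-written from AI-generated text is becoming increasingly difficult. Existing supervised detectors perform well in distribution but are often opaque and can degrade substantially under distribution shift.

Result: Across multiple benchmarks, READER consistently outperforms existing detectors as well as prompted, high-capacity LLM baselines such as GPT-5.2, Gemini-3-Pro, and DeepSeek-V3.2, despite having only 1.5B parameters.

Insight: The novelty is integrating reasoning into the detection task, with structured rationales that improve interpretability. Fine-tuning on a curated supervision set (READ) lets a small model surpass much larger models on detection, which suggests a path toward efficient, interpretable AI-content detectors.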

Abstract: Recent advances in large language models (LLMs) have made it increasingly difficult to distinguish human-written text from AI-generated content. Many existing detectors train supervised neural classifiers that achieve strong in-distribution performance but are often opaque and can degrade substantially under distribution shift. We present READER, a reasoning-enhanced AI text detector that outputs both a human/AI label and a structured rationale describing the evidence for its decision. A key component of our approach is READ, a curated supervision set of rationales and verdicts. We fine-tune an LLM on READ to build READER, which reasons before detecting at inference time. Despite having only 1.5B parameters, READER consistently outperforms existing detectors as well as prompted, high-capacity LLM baselines (GPT-5.2, Gemini-3-Pro, and DeepSeek-V3.2), which are 100 to 1000 times larger in scale.


[37] Eureka: Intelligent Feature Engineering for Enterprise AI Cloud Resource Demand Prediction cs.CL | cs.AI | cs.LGPDF

Hangxuan Li, Renjun Jia, Xuezhang Wu, Yunjie Qian, Zeqi Zheng

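TL;DR: This paper proposes Eureka, an LLM-based intelligent feature engineering framework that treats feature engineering as a code generation problem. An Expert Agent produces feature design plans, an LLM Feature Factory translates them into executable code, and a Self-Evolving Alignment Engine improves code quality through reinforcement learning.

Details

Motivation: Traditional feature engineering depends on domain experts and is hard to scale. Eureka addresses this with an LLM-driven automated framework that treats features as executable programs that can be generated, evaluated, and iteratively improved.

Result: On 7 public benchmarks in healthcare, finance, and social domains, Eureka outperforms both traditional AutoFE and LLM-based baselines. On cloud GPU resource demand prediction at Alibaba Cloud, Eureka improves the demand fulfillment rate by 16% and lowers the computing resource migration rate by 33%.

Insight: The novelty is recasting feature engineering as an agentic code generation problem and designing a three-stage framework (Expert Agent, LLM Feature Factory, and Self-Evolving Alignment Engine) in which reinforcement learning with a dual-channel reward, combining metric-based utility and semantic alignment, improves the quality and transferability of the generated code.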

Abstract: Effective features are crucial for predictive model performance, but creating them often requires domain expertise, limiting scalability across applications. We define feature engineering as an agentic code generation problem: features are not static data transformations, but executable programs that can be generated, evaluated, and iteratively improved. We present Eureka, an LLM-driven framework with three stages. (1) An Expert Agent, fine-tuned via SFT on domain knowledge, produces structured feature design plans in JSON format. (2) An LLM Feature Factory translates each plan into executable Python code through chain-of-thought reasoning, turning feature hypotheses into runnable programs. (3) A Self-Evolving Alignment Engine uses Reinforcement Learning (GRPO) with dual-channel reward (metric-based utility + semantic alignment) to enhance code quality. By expressing features as programs, the learned generation patterns can transfer across domains. Evaluated on 7 public benchmarks in healthcare, finance, and social domains, Eureka consistently outperforms both traditional AutoFE and LLM-based baselines. We further demonstrate Eureka’s effectiveness on cloud GPU resource demand prediction at Alibaba Cloud, where Eureka improves demand fulfillment rate by 16% and lowers computing resource migration rates by 33%.


[38] Learning to Route Languages for Multilingual Policy Optimization cs.CLPDF

Geyang Guo, Hiromi Wakaki, Yuki Mitsufuji, Alan Ritter, Wei Xu

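TL;DR: This paper proposes language-routed policy optimization (LRPO), a framework for improving multilingual policy optimization of large language models (LLMs). It treats language as a selectable variable and uses a trainable language router, formulated as a multi-armed bandit, to adaptively decide which languages to explore during reinforcement learning, increasing the diversity and informativeness of training signals under a fixed rollout budget.

Details

Motivation: Existing policy optimization methods usually restrict each training question to a single response language or rely on a fixed dominant language for supervision, which fails to exploit the fact that LLMs are trained on heterogeneous multilingual corpora.

Result: Extensive experiments show that LRPO consistently improves multilingual performance, indicating that adaptive language routing enables effective use of cross-lingual knowledge during training.

Insight: The core novelty is modeling language selection as an online decision problem, implemented as a multi-armed bandit, which dynamically balances exploration of underutilized languages against exploitation of more informative ones. This offers a new and more efficient way to obtain training signals in multilingual reinforcement learning.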

Abstract: Large language models (LLMs) are trained on heterogeneous multilingual corpora, yet existing policy optimization methods often implicitly restrict each training question to a single response language or rely on a fixed dominant language for supervision. We propose language-routed policy optimization (LRPO), an online policy optimization framework that treats language as a selectable variable. LRPO elicits multilingual rollouts for each training question and integrates their relative quality into preference-based policy updates, increasing the diversity and informativeness of training signals under a fixed rollout budget. To adaptively determine which languages to explore during reinforcement learning, we introduce a trainable language router formulated as a multi-armed bandit, balancing exploration of underutilized languages with exploitation of more informative ones. Extensive experiments show that LRPO consistently improves multilingual performance, demonstrating that adaptive language routing enables effective cross-lingual knowledge exploitation for training. We release all the resources at https://github.com/Guochry/LRPO.
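The exploration/exploitation trade-off behind a bandit-style language router can be illustrated with the classic UCB1 rule. The abstract only says the router is "formulated as a multi-armed bandit", so UCB1 is a stand-in chosen here for illustration, and the class name, language codes, and reward values are all hypothetical.

```python
import math


class LanguageRouter:
    """UCB1 bandit over response languages (illustrative stand-in only)."""

    def __init__(self, languages):
        self.languages = list(languages)
        self.counts = {lang: 0 for lang in self.languages}
        self.means = {lang: 0.0 for lang in self.languages}

    def select(self):
        # Try every language once before applying the UCB rule.
        for lang in self.languages:
            if self.counts[lang] == 0:
                return lang
        total = sum(self.counts.values())

        def upper_bound(lang):
            bonus = math.sqrt(2.0 * math.log(total) / self.counts[lang])
            return self.means[lang] + bonus

        return max(self.languages, key=upper_bound)

    def update(self, lang, reward):
        # Incremental running mean of the observed reward for this language.
        self.counts[lang] += 1
        self.means[lang] += (reward - self.means[lang]) / self.counts[lang]


# Hypothetical, noise-free "informativeness" of rollouts in each language.
true_quality = {"en": 0.5, "ja": 0.8, "sw": 0.2}

router = LanguageRouter(true_quality)
for _ in range(200):
    lang = router.select()
    router.update(lang, true_quality[lang])

print(router.counts)
```

The bonus term shrinks as a language is sampled more often, so rarely tried languages keep getting occasional rollouts while most of the budget flows to the language with the highest observed reward.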


[39] GeoMathCode: Understanding Interleaved Math-Code Reasoning for Geometry Problem Solving cs.CLPDF

Yingji Zhang, Yong Dai, André Freitas

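TL;DR: This paper proposes GeoMathCode, a multimodal large language model approach to geometry problem solving in which programmatic representations serve as intermediate visual outputs. An in-depth analysis of the underlying reasoning geometry finds that reasoning and code generation steps can be disentangled in the latent space, and that supervised fine-tuning makes the reasoning manifold more structured and informative.

Details

Motivation: To come closer to how humans solve geometry problems, the work introduces auxiliary visual constructions, such as additional lines or points, as intermediate steps that improve geometric interpretation and educational clarity.

Result: Experiments show that reasoning and code generation steps can be disentangled in the latent space, while supervised fine-tuning makes the reasoning manifold more structured and informative. In addition, hierarchical syntactic code structures emerge as disentangled latent subspaces and contain more mathematical symbolic information than visual representations.

Insight: The novelty is using programmatic code as the intermediate visual representation for geometric reasoning, and showing both that reasoning and code generation are disentangled in the latent space and that code structure is better than visual representations at carrying mathematical symbolic information.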

Abstract: Mathematical reasoning is a hallmark of human intelligence, requiring logical deduction, symbolic manipulation, and abstract thinking. Recent multimodal large language models (MLLMs) have demonstrated strong performance on geometry problems through multi-step reasoning. To better emulate human problem-solving, intermediate steps can incorporate auxiliary visual constructions, such as additional lines or points, which improve geometric interpretation and educational clarity. In this work, we introduce GeoMathCode, where programmatic representations serve as intermediate visual outputs. We further conduct an in-depth analysis of the underlying reasoning geometry. Experimental results show that reasoning and code generation steps can be disentangled in the latent space, while supervised fine-tuning (SFT) makes the reasoning manifold more structured and informative. Moreover, hierarchical syntactic code structures emerge as disentangled latent subspaces, and contain more mathematical symbolic information than visual representations.


[40] Harmony in Diversity: Multi-domain Contrastive Policy Optimization for Large Reasoning Models cs.CLPDF

Zongji Yu, Wenshui Luo, Yiliu Sun, Hao Fang, Runmin Cong

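TL;DR: This paper proposes Multi-domain Contrastive Policy Optimization (MCPO), a method that addresses cross-domain interference when large reasoning models (LRMs) undergo reinforcement learning post-training in multi-domain settings. MCPO uses contrastive learning: it treats transferable correct reasoning trajectories as positives and incorrect trajectories as negatives, promoting cross-domain knowledge sharing and in-domain knowledge consolidation and building a representation space that accommodates multi-domain knowledge harmoniously.

Details

Motivation: Existing GRPO-style multi-domain reinforcement learning methods often fail to improve consistently across all domains because of inherent interference in policy optimization. Prior work focuses mainly on alleviating cross-domain interference and neglects knowledge sharing, which the authors argue is the key to turning cross-domain interactions from harmful competition into beneficial transfer.

Result: Experiments show that MCPO improves the reasoning capabilities of LRMs across multiple domains and in some cases even outperforms single-domain training.

Insight: The core novelty is bringing contrastive learning into multi-domain policy optimization. By constructing positive and negative pairs (transferable cross-domain positives, incorrect negatives, and aligned in-domain positives), MCPO explicitly promotes knowledge sharing and consolidation, turning cross-domain interaction into beneficial transfer rather than pure interference.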

Abstract: Post-training has significantly enhanced the reasoning capability of Large Reasoning Models (LRMs), especially with Reinforcement Learning (RL) like Group Relative Policy Optimization (GRPO). However, GRPO-style RL methods in multi-domain settings often fail to achieve consistent improvements across all domains due to inherent interference in policy optimization. Prior studies on multi-domain RL primarily focus on alleviating cross-domain interference, while often neglecting the pivotal role of knowledge sharing, which we argue is the key to transforming cross-domain interactions from harmful competition into beneficial transfer. To address this limitation, we propose Multi-domain Contrastive Policy Optimization (MCPO), which analyzes the structural relationships among rollouts and promotes cross-domain knowledge sharing and in-domain knowledge consolidation in a contrastive manner. Specifically, for a given prompt, MCPO identifies transferable reasoning trajectories from other domains as positive examples, while treating incorrect rollouts as negative ones. It then encourages consistent representations for positive pairs and pushes negative pairs apart, thereby facilitating knowledge transfer and reducing interference. Moreover, MCPO aligns intra-domain correct rollouts to build a consolidated representation space. In this way, MCPO contrastively learns a harmonious representation space that can accommodate diverse multi-domain knowledge. Empirical results show that MCPO improves the reasoning capabilities of LRMs across multiple domains and even outperforms single-domain training in some cases. Code is available at https://github.com/Maricalce/MCPO.


[41] GeoSVG-RL: Geometry-Aware Reinforcement Learning for Layout-Constrained Text-to-SVG Diagram Generation cs.CLPDF

Sifan Li, Yujun Cai, Hongkai Chen, Yiwei Wang

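TL;DR: This paper proposes GeoSVG-RL, a geometry-aware reinforcement learning framework for layout-constrained text-to-SVG diagram generation. The model first generates a structured layout plan that acts as a geometric constraint, and a browser-backed verifier then computes fine-grained rewards across six dimensions, including rendering validity, canvas fitting, and anchor placement.

Details

Motivation: Current large language models produce structurally fragile output when generating structured, editable diagrams: connector endpoints are misaligned, text labels overlap borders, or layouts drift beyond the canvas, which makes the resulting SVG files unusable in professional applications.

Result: After a supervised warm-start phase on synthetic data, GeoSVG-RL achieves substantial gains in structural reliability, particularly in arrow-anchor accuracy and text-in-box rates. Quantitative evaluation shows that it consistently outperforms current state-of-the-art systems in local geometric precision and in preserving graph connectivity.

Insight: The novelty is a reinforcement learning framework driven by executable geometric feedback, which treats the layout plan as a geometric contract and optimizes the model with fine-grained, multi-dimensional rewards. Combining structured planning with browser-verified rewards offers a reliable path toward automated technical illustration.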

Abstract: Generating structured, editable diagrams remains a significant challenge for contemporary large language models, despite their proficiency in general-purpose vector code generation. The primary difficulty lies in the structural fragility of the output; minor errors such as misaligned connector endpoints, text labels overlapping borders, or complex layouts drifting beyond the canvas boundaries render the resulting SVG files functionally unusable for professional applications. To address these issues, we introduce GeoSVG-RL, a specialized reinforcement learning framework designed for layout-constrained text-to-SVG generation. Unlike standard training objectives that rely solely on maximizing token-level likelihood, our approach optimizes the policy against explicit, executable geometric feedback. The model first produces a structured layout plan that serves as a geometric contract for the subsequent generation of the SVG code. This code is then rendered through a browser-backed verifier, enabling the calculation of fine-grained rewards across six critical dimensions: rendering validity, canvas fitting, precise anchor placement, text containment, graph consistency, and code cleanliness. We utilize Group Relative Policy Optimization (GRPO) to refine the model, sampling multiple candidates per prompt to facilitate updates based on relative quality. Starting from a supervised warm-start phase on synthetic data, GeoSVG-RL achieves substantial gains in structural reliability, particularly in arrow-anchor accuracy and text-in-box rates. Quantitative evaluations demonstrate that our method consistently outperforms current state-of-the-art systems in local geometric precision and the preservation of graph connectivity, providing a robust pathway toward automated yet reliable technical illustration.
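Two of the six reward dimensions, canvas fitting and text containment, reduce to bounding-box checks and can be sketched directly. The paper obtains geometry from a browser-backed renderer; the sketch below assumes the bounding boxes are already available and scores each dimension as the fraction of elements that pass. The `Box` type, the function names, and the example layout are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box with its top-left corner at (x, y)."""
    x: float
    y: float
    width: float
    height: float

    def contains(self, other: "Box") -> bool:
        return (self.x <= other.x
                and self.y <= other.y
                and other.x + other.width <= self.x + self.width
                and other.y + other.height <= self.y + self.height)


def canvas_fit_reward(canvas, shapes):
    """Fraction of shapes that lie entirely inside the canvas."""
    if not shapes:
        return 1.0
    return sum(canvas.contains(shape) for shape in shapes) / len(shapes)


def text_containment_reward(label_parent_pairs):
    """Fraction of text labels that lie entirely inside their parent shape."""
    if not label_parent_pairs:
        return 1.0
    inside = sum(parent.contains(label) for label, parent in label_parent_pairs)
    return inside / len(label_parent_pairs)


# Hypothetical layout: one well-placed node and one that drifts off-canvas.
canvas = Box(0, 0, 800, 600)
node_a = Box(50, 50, 200, 100)
node_b = Box(700, 500, 200, 100)   # right edge at x=900, beyond the canvas
label_a = Box(60, 80, 120, 20)     # fits inside node_a
label_b = Box(690, 520, 150, 20)   # starts left of node_b's border

fit = canvas_fit_reward(canvas, [node_a, node_b])
containment = text_containment_reward([(label_a, node_a), (label_b, node_b)])
print(fit, containment)
```

Scores like these are cheap to compute and fully deterministic, which is what makes them usable as per-sample reward signals.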


[42] TypedCSIP: Typed Counterfactual Pretraining for Chinese Legislative Conflict Classification cs.CLPDF

Yao Liu

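TL;DR: TypedCSIP is a typed counterfactual pretraining method for Chinese legislative conflict classification. It uses expert-written minimal revisions as counterfactual supervision and trains in two stages: Stage 1 pretrains a shared encoder with a typed Counterfactual Selective Intervention Pretraining objective on (superior, subordinate, expert-revised) triplets, and Stage 2 transfers the encoder to a five-way classification task. On the LCR-CN benchmark it significantly improves macro-F1 for conflict classification.

Details

Motivation: The paper asks how expert-written provision revisions can be used to improve classification of conflicts between Chinese legal provisions. Given a (superior, subordinate) provision pair, the task is to predict whether a conflict exists and which of four legal-doctrine types (Responsibility, Condition, Sanction, Definition) it belongs to.

Result: On the 696-record test split of the LCR-CN benchmark, the v2 variant improves macro-F1 over the strongest single-model baseline by 0.916 percentage points on chinese-roberta-wwm-ext and by 1.288 percentage points on the SAILER cross-backbone replication; both pass the pre-registered statistical significance rule. A cold-start stratified result on the 244 unseen records also keeps the gain positive.

Insight: The novelty is using expert revisions as counterfactual supervision and designing a typed counterfactual selective intervention pretraining objective that teaches the model to recognize the absence of conflict evidence. Methodologically, the paper uses a pre-registered test to ensure reliable results and a cross-task diagnostic to confirm that the encoder is task-specific.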

Abstract: TypedCSIP is a typed counterfactual pretraining method for the conflict-classification task of the LCR-CN benchmark (Zhao et al., 2026): given a (superior, subordinate) provision pair, predict whether the pair conflicts and which of four legal-doctrine types (Responsibility, Condition, Sanction, Definition) describes the inconsistency. We exploit LCR-CN’s expert-written minimal revisions as training-time counterfactual supervision; at test time the classifier reads only the original pair. Stage 1 pretrains a shared encoder with a typed Counterfactual Selective Intervention Pretraining objective on (superior, subordinate, expert-revised) triplets, treating the expert revision as a counterfactual that the typed factor head must classify as carrying no conflict evidence. Stage 2 transfers the encoder to a five-way classification head. The confirmatory test was registered on the Open Science Framework before observing v6 measurements: 18 seeds, locked rule requiring mean per-seed difference at least 0.8 pp with both seed-bootstrap and Student-t 95% lower bounds above zero. On the 696-record test split, the v2 variant improves macro-F1 over the strongest single-model baseline by +0.916 pp on chinese-roberta-wwm-ext and +1.288 pp on the SAILER cross-backbone replication; both cells pass the rule. A cold-start stratified result on the 244 Unseen-gB records keeps the gain positive on both backbones. A cross-task diagnostic shows the Stage-2 encoder is classification-specialized and does not transfer to LCR-CN’s superior-law retrieval task, so we scope the contribution to conflict classification. We release code, 72 pre-registered prediction files, matched-seed and MLM-control auxiliaries, and the OSF pre-registration record.


[43] Retrieval as Reasoning: Self-Evolving Agent-Native Retrieval via LLM-Wiki cs.CLPDF

Haoliang Ming, Feifei Li, Xiaoqing Wu, Wenhui Que

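TL;DR: This paper proposes LLM-Wiki, an agent-native retrieval system that organizes external knowledge as compilable, composable, and self-evolving structured Wiki pages rather than a static retrieval index. It exposes search, read, and link-following operations through standard tool-calling interfaces and introduces an Error Book for persistent structural and semantic self-correction.

Details

Motivation: Conventional Retrieval-Augmented Generation (RAG) organizes knowledge as flat chunks retrieved by embedding similarity, and this retrieval-as-lookup interface is poorly matched to tool-using agents. The paper aims to make retrieval behave more like reasoning, so that agents can search, read, traverse, and decide when evidence is sufficient.

Result: On HotpotQA, MuSiQue, and 2WikiMultiHopQA, LLM-Wiki outperforms seven baselines, including HippoRAG 2, LightRAG, and GraphRAG, with gains of 2.0-8.1 F1 points over the strongest graph-based baseline and larger gains over Dense RAG. On AuthTrace, LLM-Wiki achieves the best overall accuracy, with especially strong gains on multi-document structured queries.

Insight: The core novelty is the Retrieval-as-Reasoning paradigm, which treats external knowledge as a structured network of Wiki pages that can be compiled and evolved dynamically and exposes it to agents through standard tool interfaces. Retrieval thus matches how agents reason instead of being simple context fetching. The Error Book mechanism supports continual self-correction of the knowledge, improving robustness and adaptability.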

Abstract: LLM agents require retrieval to behave less like one-shot context fetching and more like reasoning: searching, reading, traversing, and deciding when evidence is sufficient. However, Retrieval-Augmented Generation (RAG) typically organizes external knowledge as flat chunks retrieved by embedding similarity, exposing a retrieval-as-lookup interface that is poorly aligned with tool-using agents. We propose LLM-Wiki, an agent-native retrieval system that operationalizes the Retrieval-as-Reasoning paradigm by treating external knowledge as a compilable, composable, and self-evolving structure rather than a static retrieval index. LLM-Wiki compiles documents into structured Wiki pages with bidirectional links, exposes search, read, and link-following operations through standard tool-calling interfaces, and introduces an Error Book for persistent structural and semantic self-correction. On HotpotQA, MuSiQue, and 2WikiMultiHopQA, LLM-Wiki outperforms seven baselines, including HippoRAG 2, LightRAG, and GraphRAG, with gains of 2.0-8.1 F1 points over the strongest graph-based baseline and larger gains over Dense RAG. On AuthTrace, LLM-Wiki achieves the best overall accuracy, with especially strong gains on multi-document structured queries, showing that compilation-based knowledge organization generalizes beyond chain-style multi-hop reasoning.


[44] CRPO: Character-centric Group Relative Policy Optimization for Role-aware Reasoning in Role-playing Agents cs.CLPDF

Yihong Tang, Kehai Chen, Liang Yue, Benyou Wang, Min Zhang

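TL;DR: This paper proposes CRPO (Character-centric Group Relative Policy Optimization), a reinforcement learning framework that addresses the loss of character fidelity and the style collapse that occur when existing group-based policy optimization methods, which focus on task utility, are applied to role-playing agents. By decoupling task logic from stylistic rewards, dynamically adapting optimization constraints, and using generic responses as negative baselines, the framework improves character consistency, emotional expression, and related qualities of role-playing agents.

Details

Motivation: Existing reinforcement learning methods such as GRPO are optimized mainly for problem solving. Applied to role-playing agents, they prioritize context-specific utility over persona alignment, which leads to a loss of character fidelity and style collapse.

Result: Extensive experiments show that CRPO outperforms existing methods on several dimensions, including character consistency and emotional expression.

Insight: The novelty is realigning reinforcement learning objectives with the role-playing task. The core mechanisms are decoupling task logic from stylistic rewards to resolve gradient conflicts, dynamically adapting optimization constraints based on character complexity, and using generic responses as negative baselines to keep the model from reverting to a common distribution. Viewed objectively, this is a domain-specific refinement of the RL objective and training strategy for role-playing, and it suggests a way to handle conflicts in multi-objective optimization.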

Abstract: Recent advancements in Reinforcement Learning (RL), particularly Group Relative Policy Optimization (GRPO), have significantly enhanced the reasoning capabilities of Large Language Models. However, applying these problem-centric optimization methods to role-playing agents often leads to a loss of character fidelity and style collapse, as they prioritize context-specific utility over persona alignment. To address this, we propose Character-Centric Group Relative Policy Optimization (CRPO), a framework designed to realign RL objectives with the role-playing task. CRPO improves character distinctiveness through three mechanisms: decoupling task logic from stylistic rewards to resolve gradient conflicts, dynamically adapting optimization constraints based on character complexity, and utilizing generic responses as negative baselines to prevent the model from reverting to a common distribution. Extensive experiments demonstrate that CRPO outperforms existing methods in character consistency, emotional expression, and other dimensions.


[45] BC Protocol: Structured Dual-Expert Dialogue for Eliciting High-Quality Chain-of-Thought Post-Training Data cs.CL | cs.AI | cs.LGPDF

Bo Zou, Chao Xu

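TL;DR: This paper proposes the BC Protocol, a structured dual-expert dialogue method for producing high-quality chain-of-thought (CoT) data for large language model (LLM) post-training. By pairing a domain expert with a knowledge engineer, the method systematically externalizes the expert's implicit judgments as natural language reasoning chains, addressing the structural limitations of existing data production methods (crowdsourced annotation, expert solo writing, RLHF) in producing deep reasoning paths.

Details

Motivation: High-quality expert chain-of-thought data is one of the core bottlenecks in LLM post-training. Existing methods have structural limitations: crowdsourced annotation lacks deep reasoning paths; expert solo writing is constrained by the "expert blind spot", in which experts skip reasoning steps they consider obvious; and RLHF produces only preference signals rather than reasoning chains.

Result: In a controlled experiment in the narrative fiction domain, CoT produced by BC Protocol dual-expert dialogue (Group A) was compared with CoT written independently by the expert (Group B) in a blind evaluation by three judge models: GPT-4o, Claude Opus 4.5, and Gemini 2.5 Pro. Group A achieved an overwhelming advantage on the "naturalness of reasoning process" dimension (Group A mean 4.80 vs. Group B mean 1.30, p=2.4e-8, Cliff's δ=1.0).

Insight: The novelty is a structured dual-expert dialogue protocol (the BC Protocol) for externalizing implicit knowledge, together with a Participant Aptitude Model and the original concept of "Calibrated Ignorance". The core methodological principle is "Selection-over-Prescription": for implicit knowledge elicitation tasks, investing quality-control resources in personnel selection yields a higher return than investing them in process design.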

Abstract: High-quality expert chain-of-thought (CoT) data is one of the core bottlenecks in large language model (LLM) post-training. Existing data production methods each have structural limitations: crowdsourced annotation lacks deep reasoning paths; expert solo writing is constrained by the “expert blind spot” – experts structurally skip reasoning steps they consider obvious; RLHF only produces preference signals rather than reasoning chains. This paper proposes the BC Protocol – a structured dual-expert elicitation method for LLM post-training data production. The method carefully pairs a domain expert (crystallized intelligence) with a knowledge engineer (fluid intelligence), systematically externalizing the expert’s implicit judgments as natural language reasoning chains. We introduce the Participant Aptitude Model, which defines six participant characteristic dimensions that affect elicitation quality. “Calibrated Ignorance” is an original concept proposed in this paper. We further propose “Selection-over-Prescription” as a methodological principle: for implicit knowledge elicitation tasks, investing quality-control resources in personnel selection yields a higher return than investing the same resources in process design. In a controlled experiment in the narrative fiction domain, we directly compared CoT produced by BC Protocol dual dialogue (Group A, $n=20$) against CoT written independently by the same domain expert (Group B, $n=20$). Three cross-vendor judge models – GPT-4o, Claude Opus 4.5, and Gemini 2.5 Pro – conducted blind evaluation across five dimensions (600 ratings total). Results show that the BC Protocol achieves an overwhelming advantage in “naturalness of reasoning process” (Group A mean 4.80 vs. Group B mean 1.30, $p=2.4\times10^{-8}$, Cliff’s $\delta=1.0$).
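The reported effect size is easier to interpret with the definition in hand. Cliff's δ is the share of cross-group pairs in which the first group's value is larger, minus the share in which it is smaller, so δ = 1.0 means every Group A rating exceeded every Group B rating on that dimension. The sketch below implements the standard definition; the ratings are made up for illustration and are not the paper's data.

```python
def cliffs_delta(group_a, group_b):
    """Cliff's delta: (#pairs with a > b minus #pairs with a < b) / (n_a * n_b)."""
    if not group_a or not group_b:
        raise ValueError("both groups must be non-empty")
    greater = sum(a > b for a in group_a for b in group_b)
    less = sum(a < b for a in group_a for b in group_b)
    return (greater - less) / (len(group_a) * len(group_b))


# Hypothetical 1-5 ratings: complete separation versus complete overlap.
separated = cliffs_delta([5, 5, 4], [1, 2, 1])
overlapping = cliffs_delta([1, 2, 3], [1, 2, 3])
print(separated, overlapping)
```

Because δ depends only on the ordering of values, it suits ordinal rating scales where differences in means are harder to justify.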


[46] DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning cs.CL | cs.LGPDF

Guochao Jiang, Jingyi Song, Guofeng Quan, Chuzhan Hao, Guohua Liu

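TL;DR: This paper proposes DVAO (Dynamic Variance-adaptive Advantage Optimization), a method that addresses training instability and static hyperparameters in multi-reward reinforcement learning. It dynamically adjusts combination weights based on empirical reward variance, up-weighting objectives with a strong learning signal and suppressing noisy ones, and it significantly outperforms baselines on mathematical reasoning and tool-use benchmarks.

Details

Motivation: Existing approaches such as Reward Combination and Advantage Combination have significant drawbacks in multi-reward settings: Reward Combination often produces advantages with excessively large squared magnitudes, which destabilizes training, while Advantage Combination relies on static hyperparameters and ignores cross-objective correlations.

Result: On mathematical reasoning and tool-use benchmarks with Qwen3 and Qwen2.5 models, DVAO significantly outperforms baseline methods, achieving a better multi-objective Pareto frontier and robust training stability.

Insight: The novelty is a dynamic weighting mechanism based on empirical reward variance, with a mathematical proof that it keeps advantage magnitudes bounded to ensure stable training, together with a self-adaptive cross-objective regularization mechanism.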

Abstract: Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative Policy Optimization offers an efficient, value-model-free alternative to Proximal Policy Optimization, adapting it to real-world multi-reward settings remains challenging. Standard scalarization practices, such as Reward Combination and Advantage Combination, suffer from significant drawbacks: Reward Combination frequently generates advantages with excessively large squared magnitudes that lead to training instability, while Advantage Combination relies on static hyperparameters and ignores cross-objective correlations. To address these limitations, we propose Dynamic Variance-adaptive Advantage Optimization (DVAO), which dynamically adjusts combination weights based on the empirical reward variance of each objective within a rollout group, effectively up-weighting objectives with a stronger learning signal while suppressing noisy ones. We mathematically prove that DVAO maintains bounded advantage magnitudes for stable training and introduces a self-adaptive cross-objective regularization mechanism. Extensive experiments on mathematical reasoning and tool-use benchmarks using Qwen3 and Qwen2.5 models demonstrate that DVAO significantly outperforms baseline methods, achieving a superior multi-objective Pareto frontier and robust training stability.
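The two baseline scalarization schemes the abstract contrasts differ only in where group normalization happens, which a short sketch makes concrete. It assumes the usual GRPO-style normalization (subtract the group mean, divide by the group standard deviation) and static weights; DVAO's variance-adaptive weighting is not shown because the abstract does not give its formula. Function names and reward values are hypothetical.

```python
from statistics import mean, pstdev


def group_normalize(values, eps=1e-8):
    """GRPO-style advantage: center and scale rewards within one rollout group."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def reward_combination(rewards_per_objective, weights):
    """Sum weighted rewards per rollout first, then normalize once."""
    n_rollouts = len(rewards_per_objective[0])
    combined = [
        sum(w * rewards[i] for w, rewards in zip(weights, rewards_per_objective))
        for i in range(n_rollouts)
    ]
    return group_normalize(combined)


def advantage_combination(rewards_per_objective, weights):
    """Normalize each objective separately, then take a static weighted sum."""
    normalized = [group_normalize(rewards) for rewards in rewards_per_objective]
    n_rollouts = len(rewards_per_objective[0])
    return [
        sum(w * adv[i] for w, adv in zip(weights, normalized))
        for i in range(n_rollouts)
    ]


# Hypothetical group of four rollouts scored on two objectives.
rewards = [
    [1.0, 0.0, 1.0, 0.0],  # e.g. answer correctness
    [0.2, 0.4, 0.6, 0.8],  # e.g. format or tool-use quality
]
weights = [0.5, 0.5]

print(reward_combination(rewards, weights))
print(advantage_combination(rewards, weights))
```

In the second scheme the weights are fixed in advance regardless of how much each objective varies within the group, which is the static-hyperparameter limitation the abstract points to.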


[47] Reinforcement Learning from Denoising Feedback cs.CL | cs.LGPDF

Qi He, Huan Chen, Ya Guo, Huijia Zhu, Yi R. Fung

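TL;DR: This paper proposes RLDF (Reinforcement Learning from Denoising Feedback), a new training paradigm that addresses the challenge of policy loss estimation in diffusion language models. Using feedback obtained from rollout and training processes, it optimizes the model toward the clipped clean state derived from intermediate noisy states, combined with weighted timestep sampling, to balance computational efficiency against estimation quality. Experiments show that RLDF yields substantial gains in performance and generalizability on multiple reasoning benchmarks for two representative diffusion language model architectures, LLaDA and Dream.

Details

Motivation: The paper targets policy loss estimation, a fundamental and long-standing challenge for diffusion language models, with the aim of making policy optimization both accurate and efficient.

Result: On two dLLM architectures, LLaDA and Dream, RLDF achieves substantial gains in performance and generalizability on multiple reasoning benchmarks, indicating that the method is effective.

Insight: The novelty is the RLDF training paradigm, which uses feedback from the denoising process (optimizing toward the clipped clean state) together with weighted timestep sampling to provide a principled, scalable foundation for reinforcement learning in diffusion language models.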

Abstract: Policy loss estimation remains a fundamental and long-standing challenge in reinforcement learning (RL) for diffusion language models (dLLMs). We introduce Reinforcement Learning from Denoising Feedback (RLDF), a novel training paradigm that leverages feedback obtained from rollout and training processes to facilitate accurate and efficient policy loss estimation. To balance the trade-off between computational efficiency and estimation effectiveness, RLDF optimizes the model toward the clipped clean state $\hat{x}_0$ from intermediate noisy states $x_t$, combined with weighted timestep sampling over $t$. Extensive experiments demonstrate that RLDF achieves consistent and substantial improvements in both performance and generalizability across two representative dLLM architectures, LLaDA and Dream, on multiple reasoning benchmarks. Our work lays a principled foundation for scalable reinforcement learning in diffusion language models. We build Drift, a training framework for dLLMs, available at https://github.com/ant-research/Drift.


[48] From Facts to Insights: A Persona-Driven Dual Memory Framework and Dataset for Role-Playing Agents cs.CL | cs.DB | cs.MAPDF

Rongsheng Zhang, Ruofan Hu, Weijie Chen, Jiji Tang, Junnan Ren

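TL;DR: Role-playing agents lose persona consistency in long conversations because of context-window limits. This paper introduces the RoleMemo dataset and the DualMem dual-memory framework. RoleMemo contains four tasks that require reasoning over facts in light of persona information and is used to evaluate existing methods. DualMem decouples memory into two streams, factual cognition and persona-conditioned insight; trained with supervised fine-tuning and reinforcement learning, it outperforms zero-shot persona-agnostic frameworks powered by DeepSeek-V3.2 at sustaining persona fidelity.

Details

Motivation: In long conversations, existing role-playing agents rely on persona-agnostic factual summaries as memory, so their responses lack persona specificity and persona fidelity suffers.

Result: Evaluation on RoleMemo exposes the limitations of persona-agnostic memory frameworks. DualMem, using a 4B-parameter model trained with SFT and RL, outperforms zero-shot persona-agnostic frameworks powered by DeepSeek-V3.2 at sustaining persona fidelity.

Insight: The core idea is to decouple external memory into two separate streams, factual cognition and persona-conditioned insight, so that the model interprets facts from the persona's point of view. RoleMemo provides a benchmark for persona-driven reasoning.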

Abstract: While role-playing agents excel in short-term interactions, long-term conversations overwhelm context windows, motivating external memory frameworks. Current systems typically rely on persona-agnostic summarization, which records facts without persona-specific interpretation, yielding generic responses that compromise persona fidelity. To bridge this gap, we introduce RoleMemo, a dataset featuring four reasoning tasks where the factual fragments must be interpreted through the persona to reach the correct answer. Evaluation on RoleMemo exposes critical limitations of persona-agnostic frameworks. We thus propose DualMem, which decouples memory into two streams: factual cognition and persona-conditioned insight. Trained through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), our framework with a 4B-parameter model outperforms zero-shot persona-agnostic frameworks powered by DeepSeek-V3.2 for sustained persona fidelity. Our resources are available at https://github.com/role2026/rolememo.


[49] Testing the Deliteralization Hypothesis in Human and Machine Translation cs.CLPDF

Malik Marmonier, Rachel Bawden, Benoît Sagot

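TL;DR: Using the WMT24++ dataset, this paper tests the deliteralization hypothesis on human translation and machine translation (conventional NMT systems and modern LLMs). Human translations are consistently less literal than those of every MT system, although recent LLMs narrow the gap. When prompted to revise their own output iteratively, LLMs deliteralize monotonically, the first evidence that the hypothesis applies to LLM generation. As post-editors, LLMs invert the revision triggers of human post-editors: they tolerate literal drafts and target idiomatic formulations for revision.

Details

Motivation: To test whether the deliteralization hypothesis, a long-standing claim in translation studies that translations become progressively less literal as they are drafted and revised, holds for modern machine translation after the shift from dedicated NMT systems to general-purpose LLMs, and to compare the literality of human and machine translation.

Result: Literality is measured with a Synthetic Literality Index built from six heuristics, across 54 language pairs of WMT24++ and three tasks (direct translation, iterative self-revision, and post-editing of human drafts). Human translations are significantly less literal than those of all tested MT systems, though recent LLMs narrow the gap; LLMs deliteralize monotonically under iterative self-revision; and as post-editors, LLMs show revision patterns opposite to those of humans.

Insight: The paper extends the deliteralization hypothesis to LLM generation for the first time and systematically compares the literality behaviour of humans and several MT systems. Its quantitative index reveals a distinctive LLM pattern in translation revision, offering a new perspective on how LLMs translate.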

Abstract: The recent shift from dedicated NMT systems to general-purpose LLMs has reshaped machine translation, with LLMs reported to produce more fluent, less literal output than their predecessors. We test whether this shift extends to the deliteralization hypothesis, the long-standing claim from translation studies that translations become progressively less literal as they are drafted and revised. Using the WMT24++ dataset, we compare the literality of human translations and post-editions to that of two NMT systems and six LLMs across 54 language pairs and three tasks: direct translation, iterative self-revision, and post-editing of human drafts. Literality is measured via a validated Synthetic Literality Index built from six heuristics. We find that (i) human translations remain significantly less literal than those of all tested MT systems, though recent LLMs narrow the gap; (ii) when prompted to iteratively revise their own output, LLMs deliteralize monotonically, providing the first evidence that the hypothesis applies natively to LLM generation; and (iii) as post-editors, LLMs invert the revision triggers of human post-editors, tolerating literal drafts and targeting idiomatic human formulations for revision.


[50] Selective Latent Thinking: Adaptive Compression of LLM Reasoning Chains cs.CLPDF

Hui Xie, Jie Liu, Ziyue Qiao, Joaquin Vanschore

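TL;DR: This paper proposes Selective Latent Thinking (SLT), a framework that addresses the accuracy loss caused when existing latent reasoning methods over-compress critical steps of a chain of thought. SLT uses a lightweight decoder to anticipate the upcoming reasoning span and a confidence-based gate to compress redundant spans into latent representations while keeping precision-critical spans as explicit chain of thought, balancing inference efficiency against accuracy.

Details

Motivation: Explicit chain-of-thought reasoning substantially improves LLM reasoning but is expensive at inference because of its long autoregressive traces. Existing latent reasoning methods offer an alternative, but they tend to treat reasoning as uniformly compressible, so precision-critical intermediate steps are over-compressed and accuracy degrades.

Result: Across four mathematical reasoning benchmarks, SLT achieves 22.7% higher accuracy than latent reasoning baselines at comparable compression ratios. Compared with explicit chain of thought, it shortens the reasoning chain by 58.4% with only a 2.8% drop in accuracy.

Insight: The contribution is a selective compression strategy that pairs confidence-based gating with three-stage training (span-level latent compression, reliability-aware future reasoning prediction, and trajectory-level reinforcement learning) to optimize the trade-off between answer correctness and reasoning cost, compressing adaptively rather than uniformly across the whole chain.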

Abstract: Explicit chain-of-thought (CoT) reasoning substantially improves the reasoning ability of large language models (LLMs), but incurs high inference cost due to lengthy autoregressive traces. Existing latent reasoning methods offer a promising alternative, yet they often treat reasoning as uniformly compressible, causing precision-critical intermediate steps to be overly compressed and thereby degrading reasoning accuracy. In this work, we propose Selective Latent Thinking (SLT), a framework that selectively compresses redundant reasoning spans into latent representations while preserving precision-critical spans as explicit CoT within the same reasoning trajectory. Specifically, SLT first uses a lightweight decoder to anticipate a short upcoming reasoning span, and then applies confidence-based gating to determine the longest span that can be reliably compressed. The accepted span is encoded into a compact latent representation to improve reasoning efficiency, while uncertain or precision-critical reasoning remains in explicit CoT form to preserve accuracy. To learn this selective compression policy, SLT adopts a three-stage training strategy that combines span-level latent compression, reliability-aware future reasoning prediction, and trajectory-level reinforcement learning to optimize the trade-off between answer correctness and reasoning cost. Extensive experiments across four mathematical reasoning benchmarks demonstrate that SLT achieves 22.7% higher accuracy than latent reasoning baselines at comparable compression ratios, while reducing reasoning chain length by 58.4% with only 2.8% accuracy degradation compared to explicit CoT. Our code can be found at https://github.com/hunshi34/SLT.
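
The gating step described in the SLT abstract ("determine the longest span that can be reliably compressed") can be illustrated with a small sketch. This is not the authors' implementation: the per-token confidence list, the 0.9 threshold, the prefix rule, and the function name `longest_reliable_prefix` are all assumptions made for illustration.

```python
from typing import Sequence


def longest_reliable_prefix(confidences: Sequence[float], threshold: float = 0.9) -> int:
    """Return how many leading tokens of an anticipated span pass the gate.

    Assumption: the lightweight decoder yields one confidence per anticipated
    token, and a span is "reliable" while every confidence stays >= threshold.
    """
    accepted = 0
    for confidence in confidences:
        if confidence < threshold:
            break  # first uncertain token ends the compressible span
        accepted += 1
    return accepted


# Hypothetical confidences for a four-token anticipated span.
span_confidences = [0.99, 0.95, 0.60, 0.97]
n_compress = longest_reliable_prefix(span_confidences)
# Tokens [0, n_compress) would be encoded as a latent; the rest stay explicit CoT.
```

Under this reading, a return value of zero means the whole span stays as explicit CoT, which matches the abstract's statement that uncertain reasoning is left uncompressed.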


[51] Double Triangle Annotation: A Scalable Human-in-the-Loop Framework for High-Precision Historical Document Annotation cs.CLPDF

Yi Ren

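TL;DR: This paper proposes Double Triangle Annotation, a human-in-the-loop framework for high-precision annotation of structured information extracted from historical documents. Two architecturally independent multimodal large language models annotate in parallel; a label is auto-accepted when the models agree, and disagreements go to human adjudication. A second layer cross-checks two such systems to resolve the remaining conflicts. On a corpus of French medical directories, the framework reaches a very low word error rate and auto-accepts over 85% of fields.

Details

Motivation: Large-scale structured information extraction from historical documents needs high-precision annotations, but manual labeling is expensive and fully automated LLM pipelines are prone to hallucination. A method is needed that preserves precision while improving efficiency.

Result: On the Guides Rosenwald, a corpus of French medical directories spanning 1887-1906, the framework achieves a word error rate of 0.003. Applied at scale to 13,595 fields, model consensus auto-accepts over 85% of them.

Insight: The framework is a two-layer human-in-the-loop design that rests on a single assumption, error independence between models. It needs no distributional priors or task-specific calibration, automates most annotation through cross-model consensus, keeps precision high through human intervention, and becomes more autonomous as model capability improves. It offers a practical, scalable hybrid paradigm for high-quality annotation in specialized domains such as historical documents.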

Abstract: Evaluating structured-information extraction from historical documents at scale requires high-precision ground-truth annotations, yet traditional manual labeling is expensive and fully automated pipelines built on large language models are prone to hallucination. We propose Double Triangle Annotation, a two-layer human-in-the-loop framework that leverages cross-model consensus to automate the majority of annotation work while ensuring high-precision outputs. In the first layer, two architecturally independent Multimodal Large Language Models annotate each document in parallel; when they agree, the label is auto-accepted, and disagreements are routed to a human jury. A second layer cross-checks two such systems against each other, escalating residual conflicts to a domain expert. The framework rests on a single assumption – error independence between models – requires no distributional priors or task-specific calibration, and becomes more autonomous as model capability improves. On the Guides Rosenwald, a corpus of French medical directories spanning 1887-1906, the framework achieves a final Word Error Rate of 0.003. Applied at scale, model consensus auto-accepts over 85% of 13,595 fields. We release the resulting benchmark – the first structured-extraction ground truth for the Rosenwald Guides – to support future work on historical document processing.
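
The first layer described in this abstract (auto-accept on agreement, route disagreements to a human jury) can be sketched as follows. This is an illustrative sketch only: the function name `first_layer`, the per-field dictionary layout, exact string equality as the agreement test, and the sample field values are assumptions, not details taken from the paper.

```python
from typing import Callable, Dict, Tuple


def first_layer(
    labels_a: Dict[str, str],
    labels_b: Dict[str, str],
    jury: Callable[[str, str, str], str],
) -> Tuple[Dict[str, str], float]:
    """Merge two models' field labels; return (final labels, auto-accept rate).

    Assumptions: both models label the same set of fields, and "agreement"
    means exact string equality of the two labels.
    """
    final: Dict[str, str] = {}
    auto_accepted = 0
    for field, value_a in labels_a.items():
        value_b = labels_b[field]
        if value_a == value_b:
            final[field] = value_a  # consensus: accept without human review
            auto_accepted += 1
        else:
            final[field] = jury(field, value_a, value_b)  # human adjudication
    rate = auto_accepted / len(labels_a) if labels_a else 0.0
    return final, rate


# Invented example values; the "jury" here simply sides with model A.
model_a = {"name": "Dupont", "city": "Paris", "year": "1887"}
model_b = {"name": "Dupont", "city": "Pans", "year": "1887"}
final, rate = first_layer(model_a, model_b, lambda field, a, b: a)
```

The design relies on the error-independence assumption stated in the abstract: if the two models rarely make the same mistake, agreement is strong evidence of correctness, so human effort concentrates on the disagreeing fields.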


[52] Causal Tongue-Tie: LLMs Can Encode Causal Direction, But Their Yes/No Outputs Fail to Express cs.CL | cs.AIPDF

Ziyi Ding, Xiao-Ping Zhang

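TL;DR: The paper finds a mismatch between internal representations and output answers in LLM causal reasoning: hidden states accurately encode the anti-commonsense causal answer (accuracy about 0.97), while the Yes/No output reverts to the commonsense answer (accuracy about 0.5). The authors call this gap of roughly 0.5 "Causal Tongue-Tie" and argue that judging LLM causal reasoning by output accuracy alone is misleading.

Details

Motivation: Evaluating LLM causal reasoning by output accuracy alone is limited. The study sets out to expose the disconnect between internal representations and verbal output, and to challenge reliance on a single accuracy figure.

Result: On anti-commonsense CLadder items, a linear probe recovers the evidence-supported answer from hidden states with an accuracy of about 0.97, while the model's Yes/No output reaches only about 0.5, a Causal Tongue-Tie gap of roughly 0.5.

Insight: The paper introduces the notion of Causal Tongue-Tie and decomposes errors into two failure modes: no internal signal, and a signal that the verbal interface cannot express. Methodologically, linear probing reveals the decoupling between internal representation and output, giving a finer-grained framework for evaluating LLM reasoning.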

Abstract: We find a mismatch between what large language models encode about a causal question and what they answer. On anti-commonsense CLadder items, a fixed linear probe recovers the evidence-supported answer from the model’s hidden state (accuracy approximately 0.97), while the spoken Yes/No reverts to the commonsense one (accuracy approximately 0.5). We call this approximately +0.5 gap Causal Tongue-Tie: a wrong Yes/No decomposes into two separable failure modes: no internal signal versus a signal the verbal interface cannot say. The implication cuts both ways for output-only causal benchmarks: a benchmark “correct” need not mean the model has understood, and a benchmark “wrong” need not mean it cannot. Sweeping claims about whether LLMs can do causal reasoning, drawn from a single accuracy number, deserve a second look.
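
The decomposition described in this abstract (a wrong Yes/No is either "no internal signal" or "a signal the verbal interface cannot say") can be made concrete with a small tally. This is a sketch under assumptions: the record layout `(gold, probe_prediction, output_answer)`, the function name, and the toy records are hypothetical, and the paper's exact bookkeeping may differ.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, str, str]  # (gold answer, probe prediction, spoken Yes/No)


def tongue_tie_report(records: List[Record]) -> Dict[str, float]:
    """Compute probe accuracy, output accuracy, their gap, and the error split."""
    total = len(records)
    probe_acc = sum(probe == gold for gold, probe, _ in records) / total
    output_acc = sum(out == gold for gold, _, out in records) / total
    wrong = [(gold, probe) for gold, probe, out in records if out != gold]
    # Wrong output AND wrong probe: nothing recoverable inside the model.
    no_signal = sum(probe != gold for gold, probe in wrong)
    # Wrong output BUT correct probe: the answer was encoded yet not expressed.
    unexpressed = sum(probe == gold for gold, probe in wrong)
    return {
        "probe_acc": probe_acc,
        "output_acc": output_acc,
        "gap": probe_acc - output_acc,
        "no_signal": no_signal,
        "unexpressed": unexpressed,
    }


# Four invented items, for illustration only.
report = tongue_tie_report([
    ("yes", "yes", "no"),
    ("no", "no", "no"),
    ("yes", "yes", "no"),
    ("no", "yes", "yes"),
])
```

Separating the two counts is what lets an output-only "wrong" be read either as a missing representation or as an interface failure.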


Wei Fan, Yining Zhou, Mufan Zhang, Yanbing Weng, Yiran HU

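TL;DR: This paper proposes LegalSearch-R1, an end-to-end reinforcement learning framework that addresses temporal consistency in legal AI search. It pairs local statute RAG for precise article matching with online web search for broader legal knowledge, and is trained on temporally indexed data spanning multiple amendment periods so that the law applied matches the time of the case.

Details

Motivation: Current LLM-based legal agents overlook a basic constraint when searching: the applicable law must match the temporal context of the case. Ignoring it can lead to retroactive application of statutes, which violates core legal principles. Existing approaches suffer from temporal bias, search queries that lack temporal constraints, and web search that cannot supply precise statute citations.

Result: On a benchmark covering 13 legal tasks, the 7B-parameter agent outperforms state-of-the-art deep research frameworks and specialized legal LLMs by 12.9% to 29.8% overall, surpasses baselines by 57.7% to 80.3% on temporal consistency, and generalizes robustly out of domain.

Insight: The core contribution is a reinforcement learning framework that combines precise local retrieval with broad online search and trains on temporally indexed data, systematically forcing the model to respect the temporal consistency of the law. This offers a reusable pattern for time-sensitive domains such as law: build domain-specific hard constraints (here, time) explicitly into the architecture and the training objective.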

Abstract: While large language models (LLMs) augmented with agentic search capabilities show promise for legal reasoning, they overlook a fundamental constraint that applicable law must match the temporal context of each case, as retroactive application of statutes violates core legal principles and leads to erroneous conclusions. Our observations reveal that current legal LLMs suffer from temporal bias anchored to their training cutoff, while search agents rarely incorporate temporal constraints into queries, and that web search alone cannot provide the precise statute and precedent citations that legal reasoning demands. To address these challenges, we propose LegalSearch-R1, an end-to-end reinforcement learning framework that pairs local statute RAG for precise article matching with online web search for broader legal knowledge, trained on temporally-indexed data spanning multiple amendment periods to enforce temporal consistency. Extensive experiments on our benchmark covering 13 legal tasks demonstrate that our 7B-parameter agent outperforms state-of-the-art deep research frameworks and specialized legal LLMs by 12.9% to 29.8%, surpasses baselines by 57.7% to 80.3% on temporal consistency, and exhibits robust out-of-domain generalization. The code and data are available at https://github.com/AlexFanw/LegalSearch-R1.


[54] Thaka at KSAA-2026 Task 2: Regularized Fine-Tuning for Arabic Speech Diacritization cs.CL | cs.SD | eess.ASPDF

Meshal Alamr, Hassan Alqaeri, Abdullah Aldahlawi

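TL;DR: This paper describes the winning system for Task 2 of the KSAA-2026 Shared Task (Arabic Speech Dictation with Automatic Diacritization). The task requires producing fully diacritized Arabic text from speech audio and undiacritized text, with only 2,327 training samples and no external data allowed. The system fine-tunes CATT-Whisper, which combines a pretrained CATT text encoder with a frozen Whisper speech encoder. The key is training regularization: R-Drop consistency regularization, Optuna-optimized hyperparameters with high weight decay, and Focal Loss. At inference, 200 stochastic forward passes across four model checkpoints are averaged at the softmax probability level using Monte Carlo Dropout. The system achieves a word error rate (WER) of 23.26% on the primary leaderboard metric (with case endings, counting no-diacritic positions), ranking first among all participants.

Details

Motivation: To produce fully diacritized text from Arabic speech and undiacritized text under extremely limited training data (only 2,327 samples) and with no external data available.

Result: On the primary leaderboard metric of KSAA-2026 Task 2 (with case endings, including no-diacritic positions), the system reaches a WER of 23.26% and places first.

Insight: The contribution is a systematic combination of training regularizers for the small-data regime (R-Drop, high weight decay, Focal Loss), plus probability-level ensembling over multiple checkpoints with Monte Carlo Dropout at inference, which improves robustness and generalization.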

Abstract: We describe the winning system for Task 2 of the KSAA-2026 Shared Task on Arabic Speech Dictation with Automatic Diacritization. The task requires producing fully diacritized Arabic text from speech audio and undiacritized transcripts, with only 2,327 training samples available and no external data permitted. Our system fine-tunes CATT-Whisper, a character-level multimodal model combining a pretrained CATT text encoder with a frozen Whisper speech encoder. The key to our approach is training regularization: R-Drop consistency regularization, Optuna-optimized hyperparameters with high weight decay, and Focal Loss. At inference, we average 200 stochastic forward passes across four model checkpoints using Monte Carlo Dropout at the softmax probability level. The system achieves 23.26% WER on the primary leaderboard metric (with case endings, including no-diacritic positions), placing 1st among all participants.
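
The inference-time step in this abstract (averaging stochastic forward passes across checkpoints at the softmax probability level) can be sketched in plain Python. This is a minimal illustration, not the authors' code: the equal split of the 200 passes over four checkpoints is an assumption (the abstract does not say how passes are divided), and `toy_checkpoint` is a hypothetical stand-in for a real model with dropout left active.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mc_dropout_ensemble(stochastic_passes, total_passes=200):
    """Average softmax probabilities over stochastic passes of several checkpoints.

    stochastic_passes: one zero-argument callable per checkpoint; each call
    returns the logits of a single forward pass with dropout enabled.
    """
    passes_each = total_passes // len(stochastic_passes)  # assumed equal split
    summed, count = None, 0
    for forward in stochastic_passes:
        for _ in range(passes_each):
            probs = softmax(forward())  # average probabilities, not logits
            if summed is None:
                summed = [0.0] * len(probs)
            summed = [s + p for s, p in zip(summed, probs)]
            count += 1
    mean_probs = [s / count for s in summed]
    best = max(range(len(mean_probs)), key=mean_probs.__getitem__)
    return mean_probs, best


def toy_checkpoint(seed, base_logits=(2.0, 0.5, 0.1), noise=0.3):
    """Hypothetical checkpoint: fixed logits plus Gaussian noise per pass."""
    rng = random.Random(seed)
    return lambda: [x + rng.gauss(0.0, noise) for x in base_logits]


mean_probs, best = mc_dropout_ensemble([toy_checkpoint(s) for s in range(4)])
```

Averaging after the softmax rather than before keeps each pass's contribution a valid probability distribution, so the ensemble output also sums to one.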


[55] QUIET: A Multi-Blank Cascaded Story Cloze Benchmark for LLM Creative Generation Capability cs.CL | cs.AI | cs.LGPDF

Bo Zou, Chao Xu

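TL;DR: This paper proposes QUIET, a benchmark for evaluating the creative generation capability of large language models (LLMs). It takes the form of a multi-blank cascaded story cloze: in open-ended generation, the model fills multiple blanks that carry constraints and cascade dependencies, and an information-theoretic automated scoring protocol evaluates the result objectively. This avoids both the subjectivity of traditional methods and the problem of measuring only discriminative ability.

Details

Motivation: Existing benchmarks such as the Story Cloze Test mainly measure discriminative (multiple-choice) ability over narrative continuation rather than creative generation itself. Rubric-based scoring and LLM-as-Judge methods depend on subjective dimension assessment or natural-language outputs and provide no objective, automated scoring mechanism.

Result: The paper presents the QUIET benchmark and its automated scoring protocol, which builds on the "calibrated surprise" theoretical framework and computes a composite score for each blank that combines constraint satisfaction with surprise. The protocol needs no human grading.

Insight: The contribution is to formalize creative-generation evaluation as a multi-blank cloze task with cascade dependencies and to design an objective, automated, information-theoretic scoring protocol that combines "satisfy" and "surprise", giving a direct and quantifiable way to measure LLM creativity.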

Abstract: Large language models (LLMs) face a dual challenge in creative capability evaluation: existing benchmarks (e.g., Story Cloze Test, HellaSwag) measure models’ discriminative ability over narrative continuation using multiple-choice recognition paradigms, rather than directly measuring creative generation capability; rubric-based scoring and LLM-as-Judge methods rely on subjective dimension assessment or natural language model outputs, and cannot provide objective, automated scoring mechanisms. This paper proposes QUIET (Quality Understanding via Interlocked Evaluation Testing), a diagnostic benchmark for LLM creative capability based on multi-blank cascaded story cloze. QUIET sets N blanks (10-20) in a story with complete structure, with each blank accompanied by an explicit content constraint, and cascade dependency relationships between blanks – the content filled into earlier blanks constrains the feasible solution space for later blanks. The evaluated model (or human participants) fills all blanks in open-ended generation mode; the results are scored by an information-theoretic automated scoring protocol without human grading. The scoring protocol directly operationalizes the “calibrated surprise” theoretical framework (Zou & Xu, 2026a). For each blank k, a composite score is computed: score = satisfy * (1 + lambda * surprise), where lambda = 1.0. Here, “satisfy” measures how well the blank filling satisfies the content constraint (objective logical reasoning judgment, not subjective aesthetic scoring), and “surprise” measures the degree of surprise given that the constraint is satisfied. Creative answers that do not satisfy the constraint score zero; answers that satisfy the constraint but are mediocre score low; answers that satisfy the constraint and are surprising score high.
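
The per-blank formula quoted in this abstract, score = satisfy * (1 + lambda * surprise) with lambda = 1.0, can be written out directly. The sketch below takes the formula and the default lambda from the abstract; the [0, 1] value range for both inputs, the function names, and the mean over blanks used as a story-level aggregate are assumptions for illustration, since the abstract does not specify them.

```python
from typing import Iterable, Tuple


def blank_score(satisfy: float, surprise: float, lam: float = 1.0) -> float:
    """Composite score for one blank: satisfy * (1 + lam * surprise)."""
    return satisfy * (1.0 + lam * surprise)


def story_score(blanks: Iterable[Tuple[float, float]], lam: float = 1.0) -> float:
    """Mean composite score over (satisfy, surprise) pairs.

    Averaging across blanks is an assumed aggregation, not stated in the abstract.
    """
    scores = [blank_score(s, u, lam) for s, u in blanks]
    return sum(scores) / len(scores)


unsatisfied = blank_score(satisfy=0.0, surprise=0.9)   # constraint violated
mediocre = blank_score(satisfy=1.0, surprise=0.0)      # satisfied, unsurprising
creative = blank_score(satisfy=1.0, surprise=0.5)      # satisfied and surprising
```

Because "satisfy" multiplies the whole expression, surprise can only raise the score of an answer that meets the constraint, which reproduces the three cases listed at the end of the abstract.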


[56] PolyGnosis 2.0: Enhancing LLM Reasoning via Agentic Harness Engineering for Polymarket and OSINT Insight Extraction cs.CL | cs.CEPDF

Daren Wang, Hong Xu, Jiawen Xian

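TL;DR: This paper introduces PolyGnosis 2.0, a multi-agent architecture that extracts predictive intelligence by synthesizing Polymarket anomaly signals with global open-source intelligence (OSINT) streams, specifically GDELT. It defines "Perspective Mismatches", the narrative divergence between Polymarket sentiment and global media flows, as high-alpha trading signals. Rather than arguing for agentic superiority in general, it quantifies the efficacy of "Harness Engineering" techniques (reflection loops, tool-calling, divide-and-conquer partitioning, and chain-of-thought) in high-noise financial domains.

Details

Motivation: To integrate heterogeneous information from different sources (market sentiment and global media narratives) and extract reliable predictive signals in high-noise prediction markets such as Polymarket, and to evaluate and optimize the engineering techniques used in multi-agent architectures systematically.

Result: Evaluation against human-expert benchmarks shows that structural partitioning is necessary for multi-dimensional alignment, while unconstrained terminal reflection induces logical drift. The study identifies a pervasive "consensus bias" across all agent configurations during narrative reasoning and isolates a Pareto-optimal configuration that achieves professional-grade analytical precision while minimizing latency and token overhead.

Insight: The paper frames "Perspective Mismatches" as an actionable trading signal and systematically quantifies and optimizes Harness Engineering techniques such as reflection and divide-and-conquer in multi-agent architectures. It exposes the negative effect of specific techniques (unconstrained reflection) and the pervasive consensus bias, yielding an empirically validated blueprint for autonomous intelligence in prediction markets.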

Abstract: This paper introduces PolyGnosis 2.0, a pioneering multi-agent architecture designed to extract predictive intelligence by synthesizing Polymarket anomaly signals with global Open Source Intelligence (OSINT) streams, specifically Global Database of Events, Language, and Tone (GDELT). We define and target “Perspective Mismatches”, the narrative divergence between Polymarket sentiment and global media flows, as high-alpha trading signals. Moving beyond generic agentic superiority, we rigorously quantify the efficacy of “Harness Engineering” techniques, including reflection loops, tool-calling, divide-and-conquer partitioning (D&C), and chain-of-thought (CoT), within high-noise financial domains. Our empirical evaluation against human-expert benchmarks reveals that while structural partitioning is mandatory for multi-dimensional alignment, unconstrained terminal reflection actively induces logical drift. Furthermore, we identify a pervasive “consensus bias” across all agent configurations during narrative reasoning, necessitating deterministic validation. Ultimately, we isolate a Pareto-optimal configuration that achieves professional-grade analytical precision while minimizing latency and token overhead, providing a robust blueprint for autonomous intelligence in prediction markets.


[57] Anticipate and Learn: Unleashing Idle-Time Compute in Proactive Agents cs.CL | cs.IR | cs.MAPDF

Haoyi Hu, Qirong Lyu, Xianghan Kong, Weiwen Liu, Jianghao Lin

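TL;DR: This paper proposes ProAct, an agent architecture that uses idle compute time for proactive anticipation and learning, addressing the compute that conventional reactive agents waste between interactions. By analyzing dialogue history and persistent memory, ProAct predicts upcoming user needs and acquires information ahead of time, which speeds up task completion, reduces user effort, and lowers hallucination rates.

Details

Motivation: Current AI agents are mostly reactive and compute a response only after an explicit user prompt. The idle time between interactions is wasted, so agents cannot prepare for future user needs.

Result: On the ProActEval benchmark (200 scenarios across 40 domains), ProAct reduces the turns needed to complete a task by 14.8%, user effort by 11.7%, and hallucination rates by 28.1% relative to reactive baselines. On MemBench it achieves state-of-the-art reflective accuracy.

Insight: The contribution is to use idle compute time for proactive anticipation and learning, iteratively acquiring information based on dialogue history and persistent memory. Its forward-looking preparation mechanism suggests a route to more efficient, lower-latency agents.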

Abstract: While AI agents demonstrate remarkable capabilities in reasoning and tool use, they remain fundamentally reactive: they compute responses only after explicit user prompts. This paradigm ignores a critical opportunity: the idle time between interactions is largely wasted, leaving agents unable to prepare for future user needs. To bridge this gap, we introduce ProAct, a proactive agent architecture that leverages idle-time compute to anticipate and fulfill likely upcoming user needs. By analyzing evolving dialogue history together with persistent memory, ProAct predicts upcoming needs and iteratively acquires information, allowing the agent to resolve knowledge gaps and prepare evidence before the user initiates a query. To rigorously evaluate proactive capabilities, we also introduce ProActEval, a comprehensive benchmark comprising 200 scenarios across 40 domains, featuring predictable need chains and diverse user cognitive profiles. Empirical results demonstrate significant advantages over reactive baselines. ProAct accelerates task completion by reducing required turns by 14.8%, decreases user effort by 11.7%, and cuts hallucination rates by 28.1% on ProActEval. Furthermore, MemBench evaluations confirm that ProAct achieves state-of-the-art reflective accuracy, underscoring its sustained and robust performance.


[58] When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation cs.CLPDF

Liyun Zhang, Jiayi Guo

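TL;DR: Through a 68-cell measurement study, this paper examines how LLM-driven chain-of-thought and ReAct agents differ in their sensitivity to semantic noise (e.g., paraphrase) and surface noise (e.g., formatting changes). On GSM8K, MATH, and HotpotQA, semantic perturbations change the final answer more often than surface perturbations of comparable severity, with an inconsistency gap averaging +19.69 percentage points after severity matching. The headline effect is validated on a fully held-out model, and a "stealth-divergence" mechanism is proposed to explain how semantic perturbations propagate through agent reasoning trajectories.

Details

Motivation: To investigate how LLM agents behave under different kinds of input noise, in particular how semantic (meaning-bearing) noise and surface (presentation-only) noise affect the agent's final decision, and so to understand agent robustness and reasoning mechanisms.

Result: Across 68 cells spanning GSM8K, MATH, and HotpotQA, semantic perturbations change answers 19.69 percentage points more often on average than surface perturbations ($p<0.0001$), with 64/68 cells showing a positive gap. Validation on a fully held-out model (qwen2.5-14B-Instruct) still shows a small but significant positive effect ($p=9.6\times10^{-4}$).

Insight: The large-scale measurement study shows that LLM agents are markedly more sensitive to semantic noise than to surface noise. Using held-out trajectories, the authors propose a stealth-divergence mechanism: semantic perturbations often preserve the first action but cause intermediate reasoning to diverge from later steps onward. The study also provides rigorous statistical validation and a partial trace-level account of how perturbations propagate through agent reasoning.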

Abstract: We document an empirical phenomenon in chain-of-thought and ReAct agents driven by ten large language models from seven architecture families: meaning-bearing perturbations (e.g., paraphrase, synonym) alter final answers more often than presentation perturbations (e.g., formatting, reordering) of comparable severity. Across 68 cells spanning GSM8K, MATH, and HotpotQA (1,530 originals and $\sim$11,150 variants), the inconsistency gap averages +19.69 pp after severity matching (paired $t=9.58$, $p<0.0001$), with 64/68 cells positive. The gap survives four severity-proxy audits and remains significant when excluding qwen models (+11.10 pp, $p<0.0001$). Several stress tests fail honestly: cluster-bootstrap significance disappears under stricter assumptions, tractability contrasts do not replicate, cross-architecture generator swaps break per-cell rankings, and a second LLM judge yields only moderate agreement ($\kappa=0.50$). We then validate the headline effect on a fully held-out 11th model (qwen2.5-14B-Instruct; 1,800 trajectories) and re-test a pre-registered capability$\times$tractability partition, observing a small but positive held-out effect (3/4 cells positive; pooled Welch $t=3.81$, $p=9.6\times10^{-4}$). Using held-out trajectories, we probe four trace-level mechanism signals. Two prior mechanism claims fail to replicate and are explicitly retracted. Two new probes instead support a stealth-divergence picture: semantic perturbations often preserve the first action but induce divergence in intermediate reasoning from later steps onward, accompanied by slightly deeper trajectories. We position this as a measurement contribution with held-out replication and a partial trace-level account of how semantic perturbations propagate through agent reasoning. Code, perturbation corpus, raw trajectories, and analysis scripts are released anonymously for review.


[59] SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLM Dialogue via RL-Driven Prompt Optimisation cs.CL | cs.AIPDF

Michael Orme, Yanchao Yu, Zhiyuan Tan

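TL;DR: This paper proposes SafeCtrl-RL, a reinforcement-learning-based framework for adaptive behaviour control at inference time, used to regulate the safety of large language model (LLM) dialogue. It models dialogue generation as a sequential decision process in which an RL agent selects prompt adjustment strategies based on contextual feedback, suppressing unsafe behaviour without retraining the model or modifying its parameters.

Details

Motivation: To ensure safe and contextually appropriate LLM behaviour in real-world deployment while avoiding the high cost of retraining or parameter modification.

Result: Evaluated across multiple LLMs and unsafe dialogue scenarios, SafeCtrl-RL consistently improves safety and response quality, outperforms existing prompt-based optimisation methods, and achieves favourable performance-efficiency trade-offs.

Insight: The contribution is to formalize inference-time behaviour control as RL-driven prompt optimisation, realizing "inference-time behavioural unlearning" and offering a lightweight, adaptive approach to LLM safety regulation.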

Abstract: Ensuring safe and contextually appropriate behaviour in Large Language Models (LLMs) remains a critical challenge for real-world deployment. We present SafeCtrl-RL, an inference-time behavioural control framework that enables adaptive safety regulation without model retraining or parameter modification. The method formulates dialogue generation as a sequential decision process, where a reinforcement learning agent dynamically selects prompt adjustment strategies based on contextual feedback. This allows unsafe behaviours to be suppressed through iterative refinement, which we conceptualise as inference-time behavioural unlearning. Evaluated across multiple LLMs and unsafe dialogue scenarios, SafeCtrl-RL consistently improves safety and response quality, outperforms existing prompt-based optimisation methods, and achieves favourable performance–efficiency trade-offs. Warning: This paper may contain examples of harmful language, and reader discretion is recommended.


[60] AI-Assisted Systematization for Evaluating GenAI Systems cs.CL | cs.AI | cs.CYPDF

Dhruv Agarwal, Emily Sheng, Chad Atalla, Jean Garcia-Gathright, Hussein Mozannar

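TL;DR: This paper proposes AI-assisted systematization for evaluating generative AI systems. To handle the vague concepts that evaluations often target (such as "fairness" or "creativity"), the authors introduce structured representations, namely a concept spec and a validation worksheet, and develop two AI-assisted systematizers: a direct zero-shot approach and a multi-agent approach. Using two concepts, hate-based rhetoric and digital empathy, they evaluate the generated concept specs on content validity and information recoverability.

Details

Motivation: Evaluating generative AI systems is hard because many evaluation targets (such as "reasoning" or "fairness") are broad, contested concepts; without a clear definition, it is unclear what to measure and how to interpret results. This reflects a missing systematization step: moving from a broad background concept to an explicit, structured, measurable account.

Result: The study develops two AI-assisted systematizers, generates concept specs for hate-based rhetoric and digital empathy, and evaluates them on content validity and information recoverability. The abstract reports no quantitative results or benchmark comparisons.

Insight: The contribution is a structured framework of concept specs and validation worksheets that turns vague concepts into operational evaluation targets, together with an exploration of AI-assisted systematization, in particular a multi-agent approach that mirrors manual systematization procedures, pointing toward automated definition of evaluation concepts.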

Abstract: Evaluating generative AI (GenAI) systems is challenging because many targets of evaluation are broad, contested concepts, such as “reasoning,” “fairness,” or “creativity.” When these concepts are left underspecified, it becomes unclear what should be measured or how evaluation results should be interpreted. This problem reflects a missing step: systematization, that is, moving from a broad background concept to an explicit, structured account of the concept in measurable terms. To help address the fact that systematization is cognitively demanding and resource-intensive, we investigate whether AI assistance can support this process. To enable AI-assisted systematization and assess its quality, we introduce a structured representation of a systematized concept, a concept spec, and a validation worksheet. We then develop two AI-assisted systematizers: a direct, zero-shot approach and a multi-agent approach that more closely mirrors manual systematization approaches from existing literature. We use these systematizers to produce concept specs for two concepts – hate-based rhetoric and digital empathy – and evaluate resulting concept specs on content validity and information recoverability.


[61] StakeBench: Evaluating Language Understanding Grounded in Market Commitment cs.CL | cs.AI | q-fin.GNPDF

Yunhua Pei, Jingyu Hu, Yiwei Shi, Hongnan Ma, Weiru Liu

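TL;DR: This paper proposes StakeBench, an evaluation framework for language understanding grounded in market commitment. It measures whether models understand what speakers have actually committed to in the market, rather than how outside observers perceive their language. The framework links 560,876 comments from resolved markets on Polymarket and Manifold to verified position, action, and market-odds records, and builds four diagnostic tasks on top of them.

Details

Motivation: Existing financial NLP benchmarks usually rely on labels supplied by outside observers, so they measure how language is perceived rather than what speakers have committed to in the market, which can bias evaluation.

Result: Across 15 LLMs and 18 topic and platform settings, models partially recover position-side signals (Directed Accuracy from 0.506 to 0.599) but fail structurally on later tasks: ten of the fifteen models collapse to one or two action labels in future action anticipation, and no model consistently beats the naive odds-direction baseline in collective odds projection.

Insight: The contribution is to replace human annotation with supervision derived from observable market behaviour (position sides, trading actions, and market-odds trajectories) and to design three commitment-aware metrics that measure alignment with revealed preferences. The approach helps separate observable commitment signals from latent belief, and the study finds that model scale is not correlated with performance and that finance-domain tuning does not help.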

Abstract: Existing financial NLP benchmarks often rely on labels supplied by outside observers, measuring how language is perceived rather than what speakers have committed to in the market. We introduce StakeBench, an evaluation framework for language understanding grounded in market commitment. StakeBench links 560,876 comments from 2,261 resolved markets to verified position, action, and market-odds records across Polymarket and Manifold. Supervision is derived from observable market behavior. Position sides, post-comment trading actions, and market-odds trajectories replace human annotation. Four diagnostic tasks test whether models detect market commitment, identify the revealed side, anticipate future action, and perform collective odds projection. Three commitment-aware metrics measure alignment with revealed preferences rather than perceived sentiment. Validity audits and explicit interpretation boundaries help distinguish observable commitment signals from latent belief and causal market-odds impact. Across 15 LLMs and 18 topics and platform settings, models partially recover position-side signals, with Directed Accuracy from 0.506 to 0.599, but show structural failures on later tasks. Ten of the fifteen models collapse to one or two action labels in future action anticipation, and no model consistently improves on the naive odds-direction baseline in collective odds projection. Model scale is not correlated with performance, finance-domain tuning does not improve revealed-side identification, and platform incentives strongly shape higher-order results. StakeBench is packaged with evaluation code and dataset under CC-BY 4.0.


[62] Language Models Need Sleep cs.CL | cs.AIPDF

Sangyun Lee, Sean McLeish, Tom Goldstein, Giulia Fanti

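TL;DR: This paper proposes a sleep-like consolidation mechanism to address the poor scaling of attention in Transformer-based large language models on long-context tasks. The model periodically converts recent context into persistent fast weights and then clears its key-value cache. During sleep, it performs offline recurrent passes to update the fast weights in its state-space model (SSM) blocks, so latency at inference time stays low.

Details

Motivation: In Transformer-based LLMs, the cost of attention rises sharply with context length on long-sequence tasks. The paper introduces a sleep mechanism to improve long-term memory management.

Result: On synthetic tasks (cellular automata and multi-hop graph retrieval) and a realistic math reasoning task, the method performs well where a regular Transformer and SSM-attention hybrid models fail. Increasing the sleep duration $N$ improves performance, with the largest gains on examples that require deeper reasoning.

Insight: The contribution is to bring the consolidation idea from biological sleep into language models, compressing context into fast weights through offline recurrent updates. This offers a way to strengthen long-term memory while preserving inference efficiency.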

Abstract: Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache. During sleep, the model performs $N$ offline recurrent passes over the accumulated context and updates the fast weights in its state-space model (SSM) blocks through a learned local rule. During inference, this shifts extra computation to sleep while preserving the latency of wake-time prediction. We test our method on controlled synthetic tasks, including cellular automata and multi-hop graph retrieval, as well as a realistic math reasoning task, on which a regular transformer as well as SSM-attention hybrid models fail. We then show that increasing sleep duration $N$ for our models improves performance, with the largest gains on examples that require deeper reasoning.


cs.CV [Back]

[63] Nano World Models: A Minimalist Implementation of Future Video Prediction cs.CV | cs.AI | cs.LGPDF

Siqiao Huang, Partha Kaushik, Michael Chen, Hengkai Pan, Omar Chehab

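TL;DR: This paper presents Nano World Models, a minimalist codebase for future video prediction built around diffusion forcing, intended as a unified, reproducible, and easily extensible experimental platform for world-model research. The framework supports modular study of generative objectives, model scales, action-conditioning mechanisms, latent observation spaces, datasets, evaluation protocols, and long-horizon rollout procedures. Experiments on control environments, game simulation, and real-robot data analyze how prediction parameterization, architecture scale, action injection, sampling budget, and domain complexity affect video prediction quality and autoregressive rollout behaviour.

Details

Motivation: World-model research lacks compact, reproducible, and easily extensible implementations, which makes it hard to study the design choices behind modern world models systematically. A unified experimental platform is needed to support open, reproducible research.

Result: Experiments across different environments (simple control, game simulation, and real-robot data) analyze how prediction parameterization, architecture scale, action injection, and other factors affect video prediction quality. The abstract reports no specific quantitative results or comparisons with the state of the art.

Insight: The contribution is a modular, unified codebase that allows world-model components to be studied in isolation, and the release of code, configurations, evaluation scripts, and pretrained checkpoints to support open, reproducible world-model research.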

Abstract: World models have become a central paradigm for learning predictive simulators that support generation, planning, and decision-making. Yet, despite rapid progress in industry-scale interactive video generation, the broader research community still lacks compact, reproducible, and easily extensible implementations for studying the design choices underlying modern world models. We introduce Nano World Models, a minimalist codebase for future video prediction centered around diffusion forcing. Nano World Models provides a unified interface for generative objectives, model scales, action-conditioning mechanisms, latent observation spaces, datasets, evaluation protocols, and long-horizon rollout procedures. This design enables controlled studies of world-modeling components that are often entangled across separate implementations. Through experiments across simple control environments, game simulation, and real-robot data, we examine how prediction parameterization, architecture scale, action injection, sampling budget, and domain complexity affect video prediction quality and autoregressive rollout behavior. By releasing code, configurations, evaluation scripts, and pretrained checkpoints, Nano World Models aims to provide a compact yet extensible experimental substrate for open, reproducible, and scientific world-model research.


[64] RAW: Robust Avatar Watermarking – Benchmarking and Baseline cs.CV | cs.AIPDF

Jack Parry, Jack Saunders, Vinay Namboodiri

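TL;DR: This paper introduces RAW, a benchmark for robust digital avatar watermarking comprising 50 synthetic avatar videos and 6 attacks that simulate real-world workflows. It finds that existing methods degrade markedly under avatar-specific attacks such as background removal, and proposes WALT, a new method that improves robustness by embedding the watermark in UV texture space.

Details

Motivation: Digital avatar watermarking poses distinct challenges because avatars are routinely post-processed before deployment with background replacement, reframing, and format conversion, and existing methods are not robust enough to these specific attacks.

Result: Evaluating 7 existing methods on the RAW benchmark shows that attacks such as background removal significantly reduce watermark recovery. The proposed WALT method reaches 92.4% robustness to zoom attacks and maintains 95.6% on background removal.

Insight: The contribution is a dedicated benchmark for avatar watermarking and the WALT method, which embeds watermarks in UV texture space via 3D face reconstruction, improving robustness to geometric transformations and background processing.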

Abstract: Digital avatar watermarking presents unique challenges: avatars are routinely post-processed with background replacement, reframing, and format conversion before deployment. We introduce RAW (Robust Avatar Watermarking), a benchmark comprising 50 synthetic avatar videos from 5 commercial providers and 6 attacks simulating real-world avatar workflows. Evaluating 7 existing methods reveals that avatar-specific attacks such as background removal significantly degrade watermark recovery. We propose WALT (Watermarking Avatars with Learned Textures), which embeds watermarks in UV texture space via 3D face reconstruction. WALT achieves the highest robustness to zoom attacks (92.4%) while maintaining strong performance on background removal (95.6%). We release our benchmark to facilitate research into avatar-specific watermarking.


[65] IVR-R1: Refining Trajectories through Iterative Visual-Grounded Reasoning in Reinforcement Learning cs.CV | cs.AI | cs.LG

Chenghao Li, Fusheng Hao, Xikai Zhang, Likang Xiao, Yanwei Ren

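TL;DR: This paper proposes IVR-R1 (Iterative Visual-grounded Reasoning), a reinforcement learning training framework that targets the visual hallucinations and logical errors of multimodal large language models in long-horizon multimodal scenarios. Through a dynamic visual re-alignment mechanism, the framework actively rectifies reasoning trajectories to guide policy optimization.

Details

Motivation: Current methods typically pre-encode high-dimensional visual scenes into discrete textual proxies to support downstream reasoning. As the reasoning chain unfolds, the information asymmetry between text and visual scenes erodes visual grounding and leads to reasoning errors. The paper aims to counter this erosion and improve logical and visual consistency in complex multimodal reasoning.

Result: Experiments on multiple multimodal benchmarks show that IVR-R1 consistently outperforms existing reinforcement learning methods at maintaining logical and visual consistency in complex multimodal reasoning.

Insight: The core innovations are a reward-driven screening mechanism that identifies flawed rollouts, fine-grained step-level error attribution within the multimodal context, and a Re-Reasoning Loop that iteratively cross-references intermediate reasoning states against the original visual priors to synthesize expert-level demonstrations automatically. These demonstrations serve as high-fidelity reasoning templates, giving visually grounded RL a way to rectify trajectories dynamically.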

Abstract: Multimodal large language models trained via reinforcement learning (RL) have demonstrated remarkable capabilities in complex visual reasoning tasks, yet they remain limited in long-horizon multimodal scenarios, often suffering from visual hallucinations and logical errors. Current methods typically pre-encode high-dimensional visual scenes into discrete textual proxies to facilitate downstream reasoning. As the reasoning chain unfolds, however, the inherent information asymmetry between text and visual scenes tends to erode visual grounding, resulting in misguided reasoning and erroneous outputs. To address this issue, we introduce IVR-R1 (Iterative Visual-grounded Reasoning), a novel RL training framework whose dynamic visual re-alignment actively rectifies reasoning trajectories to guide policy optimization. Specifically, by leveraging a reward-driven screening mechanism to identify flawed rollouts, IVR-R1 executes a fine-grained, step-level error attribution within the multimodal context. By iteratively cross-referencing intermediate reasoning states against pristine visual priors, a Re-Reasoning Loop enables automated trajectory rectification, effectively synthesizing expert-level demonstrations that serve as high-fidelity reasoning templates for the policy model. Our experiments across diverse multimodal benchmarks demonstrate that IVR-R1 consistently outperforms existing reinforcement learning methods, establishing a superior paradigm for maintaining logical and visual consistency in complex multimodal reasoning.


[66] Diff-Instruct with Diffused Reward: Towards Principled One-step Generator RL cs.CV | cs.AI | cs.LG

Junyi Wu, Weijian Luo, Haoyang Zheng, Runzhe Zhang, Guang Lin

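TL;DR: This paper proposes Diff-Instruct with Diffused Reward (DIDR), a data-free, trajectory-level alignment framework for one-step text-to-image generation. Using the Diffused Reward Score (DRS) and the Diffused Reward Proxy (DRP), the method propagates the RLHF-optimal reward-tilted clean-image distribution across all noise levels of the diffusion trajectory, addressing the mismatch between terminal reward optimization and generative dynamics in existing RL methods.

Details

Motivation: Existing RL methods for one-step generators combine image-space reward optimization with distribution matching in the diffusion noisy space, which creates a mismatch between terminal reward optimization and the underlying generative dynamics. Optimization then tends to exploit stochastic degrees of freedom, often raising reward at the expense of image fidelity.

Result: Extensive experiments show that DIDR consistently Pareto-dominates existing one-step SDXL baselines. When transferred to a 6B-parameter DiT backbone (Z-Image), DIDR surpasses its 50-step teacher in preference alignment while requiring only a single generation step.

Insight: The core innovation is DIDR, a trajectory-level alignment framework derived from Integral KL minimization, together with the Diffused Reward Score (DRS) it induces, which acts as a reward-driven correction to the reference score function. The practical contribution is the Diffused Reward Proxy (DRP), an efficient DRS estimator based on differentiable short-step denoising, which makes principled one-step generator RL feasible.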

Abstract: Recent advances in one-step text-to-image generation have enabled real-time synthesis with remarkable efficiency and quality. Previous reinforcement learning methods for one-step generators combine image-space reward optimization with diffusion noisy-space distribution matching. This paradigm brings challenges due to a mismatch between terminal reward optimization and the underlying generative dynamics. As a result, optimization tends to exploit stochastic degrees of freedom, often improving reward at the expense of image fidelity. To address this issue, we propose Diff-Instruct with Diffused Reward (DIDR), a data-free trajectory-level alignment framework derived from Integral KL minimization. DIDR propagates the RLHF-optimal reward-tilted clean-image distribution across all noise levels along the diffusion trajectory. We show that this objective admits the same minimizer as clean-image RLHF, while naturally inducing the Diffused Reward Score (DRS), which acts as a reward-driven correction to the reference score function. To make this practical, we further introduce the Diffused Reward Proxy (DRP), an efficient estimator of DRS based on differentiable short-step denoising. Extensive experiments demonstrate that DIDR consistently Pareto-dominates existing one-step SDXL baselines. Moreover, when transferred to a 6B DiT backbone (Z-Image), DIDR surpasses its 50-step teacher in preference alignment while requiring only a single generation step.
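
One standard way to read the "reward-tilted" distribution and the resulting score correction is sketched below, under textbook KL-regularized RLHF assumptions. The symbols $\beta$ (regularization strength), $r$ (reward), $p_{\mathrm{ref}}$ (reference clean-image distribution), and $q_t$ (forward noising kernel) are illustrative and may not match the paper's notation.

$$
p^*(x_0) \propto p_{\mathrm{ref}}(x_0)\,\exp\!\big(r(x_0)/\beta\big)
$$

$$
p^*_t(x_t) = \int q_t(x_t \mid x_0)\, p^*(x_0)\, dx_0 \;\propto\; p_{\mathrm{ref},t}(x_t)\; \mathbb{E}_{p_{\mathrm{ref}}(x_0 \mid x_t)}\!\big[\exp\!\big(r(x_0)/\beta\big)\big]
$$

$$
\nabla_{x_t}\log p^*_t(x_t) = \nabla_{x_t}\log p_{\mathrm{ref},t}(x_t) + \nabla_{x_t}\log \mathbb{E}_{p_{\mathrm{ref}}(x_0 \mid x_t)}\!\big[\exp\!\big(r(x_0)/\beta\big)\big]
$$

The second step uses $q_t(x_t \mid x_0)\,p_{\mathrm{ref}}(x_0) = p_{\mathrm{ref},t}(x_t)\,p_{\mathrm{ref}}(x_0 \mid x_t)$. The last term is a reward-driven correction added to the reference score, which is the role the abstract assigns to DRS. Whether the paper defines DRS exactly this way is an assumption.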


[67] Brain-to-Image Retrieval and Reconstruction via Multimodal EEG Alignment cs.CV | eess.IV

Chi Kit Wong, Yan Liu, Haowen Yan

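TL;DR: This paper presents a brain-to-image retrieval and reconstruction system that decodes visual stimuli from EEG signals recorded during natural image viewing. It covers two tasks: retrieval, which identifies the correct stimulus image among 200 candidates from an EEG segment, and reconstruction, which generates an image consistent with the perceived stimulus.

Details

Motivation: The goal is to decode rich visual representations from EEG signals, map EEG directly to visual content, and explore the potential of brain-computer interfaces for decoding visual information.

Result: For retrieval, evaluated over 10 random seeds for a single subject, the model achieves a mean final-epoch Top-1 accuracy of 86.30% and Top-5 accuracy of 98.55%. For reconstruction, the model achieves a mean CLIP score of 0.903 (ViT-H-14), a CLIP score of 0.870 (ViT-L/14), and an SSIM of 0.409.

Insight: The approach aligns EEG representations to multi-modal CLIP embeddings (image, text, depth, and edge), uses a multi-level blurring approach improved with biologically inspired EVNet features for retrieval, and generates images with SDXL-Turbo conditioned via IP-Adapter. Together these show that modern multi-modal alignment and generative modeling can decode visual representations from EEG.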

Abstract: We present a brain-to-image system that decodes visual stimuli from EEG signals recorded during natural image viewing. Our system addresses two tasks: (1) EEG-to-image retrieval, which ranks the correct stimulus image among 200 candidates given an EEG segment, and (2) EEG-to-image reconstruction, which generates an image consistent with the perceived stimulus. For retrieval, we implement a multi-level blurring approach improved with biologically inspired EVNet features and trained with the InfoNCE loss. Evaluated over 10 random seeds for a single subject, the retrieval model achieves a mean final-epoch Top-1 accuracy of 86.30% and Top-5 accuracy of 98.55%. For reconstruction, we implement CognitionCapturerPro, which aligns EEG representations to multi-modal CLIP embeddings, including image, text, depth, and edge embeddings, and synthesizes images with SDXL-Turbo conditioned via IP-Adapter. Averaged over 10 seeds, the reconstruction model achieves a CLIP score of 0.903 using ViT-H-14, a CLIP score of 0.870 using ViT-L/14, and an SSIM of 0.409. These results demonstrate the feasibility of decoding rich visual representations from EEG signals using modern multi-modal alignment and generative modeling techniques.
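
The retrieval setup described above (InfoNCE training, then Top-1 and Top-5 accuracy over 200 candidates) can be sketched as follows. This is a generic one-directional InfoNCE and cosine-similarity ranking on precomputed embeddings. The temperature value, embedding size, and function names are assumptions, and the EEG encoder itself is not modeled.

```python
import numpy as np

def _normalize(x):
    # Unit-normalize rows so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce_loss(eeg_emb, img_emb, temperature=0.07):
    """InfoNCE where row i of eeg_emb should match row i of img_emb."""
    logits = _normalize(eeg_emb) @ _normalize(img_emb).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive pair for query i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))

def top_k_accuracy(eeg_emb, img_emb, k):
    """Fraction of EEG queries whose true image ranks within the top k."""
    sims = _normalize(eeg_emb) @ _normalize(img_emb).T
    true_scores = np.diag(sims)[:, None]
    # Rank of the true image = number of candidates scoring strictly higher.
    ranks = (sims > true_scores).sum(axis=1)
    return float(np.mean(ranks < k))

rng = np.random.default_rng(0)
images = rng.standard_normal((200, 64))                 # 200 candidate images
eeg = images + 0.5 * rng.standard_normal((200, 64))     # noisy "EEG" embeddings
loss = info_nce_loss(eeg, images)
top1, top5 = top_k_accuracy(eeg, images, 1), top_k_accuracy(eeg, images, 5)
```

With 200 candidates, chance level is 0.5% for Top-1 and 2.5% for Top-5, which gives context for the reported 86.30% and 98.55%.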


[68] SkySeg: Collaborative Onboard Semantic Segmentation with Heterogeneous UAVs in the Wild cs.CV

Anqi Lu, Yun Cheng, Youbing Hu, Zhiqiang Cao, Jie Liu

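TL;DR: This paper proposes SkySeg, a heterogeneous multi-UAV air-air cooperation framework for real-time semantic segmentation on resource-constrained UAV platforms. It fuses low-definition wide-area images with high-definition focused-area images for efficient inference, and uses a cross-device test-time adaptation strategy to handle environmental changes during flight.

Details

Motivation: Deploying real-time semantic segmentation on resource-constrained UAV platforms faces two challenges: hardware constraints that make real-time processing difficult, and data distribution shifts caused by environmental variation during flight.

Result: Experiments show that SkySeg speeds up inference by approximately 3.6x, improves onboard segmentation accuracy by 5.91%, and achieves a 10.91% average accuracy gain in the wild.

Insight: The novelty lies in combining heterogeneous UAV cooperation with computer vision and flight patterns. Information-fusion inference and cross-device test-time adaptation together enable efficient, accurate onboard semantic segmentation in dynamic environments.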

Abstract: The demand for unmanned aerial vehicle (UAV)-based image acquisition and analysis has surged, with UAVs increasingly utilized for semantic segmentation tasks. To meet the real-time analysis requirements of UAV remote sensing missions, performing onboard computation and making decisions based on the results is a natural approach. However, deploying semantic segmentation on resource-constrained UAV platforms presents two significant challenges: 1) hardware constraints limit the ability of UAVs to perform real-time semantic segmentation, and 2) environmental variations during flight cause data distribution shifts, deviating from the original training data. To address these issues, this paper introduces SkySeg, a heterogeneous multi-UAV air-air cooperation framework that integrates computer vision and flight patterns to enable onboard semantic segmentation using low-cost sensors. SkySeg employs an efficient information fusion inference method, combining low-definition, wide-area images with high-definition, focused-area images. Additionally, it incorporates a cross-device test-time adaptation (TTA) strategy to enhance segmentation performance in dynamic environments by collaboratively addressing distribution shifts of test data streams across UAVs. Experimental results demonstrate that our SkySeg framework speeds up inference by approximately 3.6x, improves onboard segmentation accuracy by 5.91%, and achieves a 10.91% average accuracy gain in the wild.


[69] ActQuant: Sub-4-bit Action-Guided Quantization for Vision-Language-Action Models cs.CV | cs.AI

Arash Akbari, Arman Akbari, Masih Eskandar, Qitao Tan, Yixiao Chen

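TL;DR: This paper proposes ActQuant, a sub-4-bit, action-guided, mixed-precision post-training quantization framework for Vision-Language-Action (VLA) models. It quantizes in two stages (inter-tensor bit allocation and intra-tensor scale optimization) to minimize the impact on action prediction, and is accompanied by the OmniModel.cpp deployment pipeline. Experiments on the LIBERO benchmark and a real UR3 arm show that performance stays close to the original model at very low bit-widths (2.5 to 3 bits per weight) with substantial memory compression.

Details

Motivation: VLA models perform well on embodied-intelligence tasks, but their heavy compute cost hinders deployment on edge devices. Existing post-training quantization methods suffer severe performance degradation in the sub-4-bit regime, so an effective scheme that preserves action generation is needed.

Result: On the LIBERO benchmark, ActQuant is the only method that operates at or below 3 bits per weight, retaining 95.0% on OpenVLA-OFT and 94.8% on π₀.₅. At 2.5 bits per weight it reaches 90.1% on OpenVLA-OFT while compressing the backbone from 14.3 GB to 2.7 GB (5.3×). On the physical UR3 arm, the quantized π₀.₅ model retains the baseline success rate while reducing the memory footprint by 2.5×.

Insight: The method ties quantization directly to the model's core task, action prediction. Each tensor's contribution to action prediction guides bit allocation across tensors, and action-aware curvature tunes per-block quantization scales, which protects control-critical weights at very low bit-widths. The accompanying OmniModel.cpp pipeline provides an end-to-end low-bit deployment path from the model to an efficient native C/C++ runtime.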

Abstract: Vision-Language-Action (VLA) models exhibit remarkable action generation for embodied intelligence, but their heavy compute cost makes deployment on edge platforms impractical. Aggressive, sub-4-bit weight quantization is the natural solution, yet existing post-training quantization (PTQ) methods suffer severe performance degradation in this regime. To address this, we introduce ActQuant, an action-guided mixed-precision PTQ framework that operates in two stages: (1) an inter-tensor bit allocator that assigns each weight matrix a single bit-width based on how much it contributes to predicting the agent’s actions; (2) an intra-tensor scale optimizer that tunes per-block quantization scales using action-aware curvature, so that dynamic range is concentrated on the weights most influential for control. To deliver the on-device benefits of our aggressive quantization, we further introduce OmniModel.cpp, an agentic conversion pipeline that ports architectures into a native C/C++ runtime with efficient low-bit kernels. We evaluate ActQuant both in simulation and on a real-world 6-DoF UR3 arm, with all models deployed through OmniModel.cpp. On the LIBERO benchmark, ActQuant is the only method that operates at or below 3 bits-per-weight, retaining 95.0% on OpenVLA-OFT and 94.8% on $π_{0.5}$. Pushed further, ActQuant reaches 2.5 bpw at 90.1% on OpenVLA-OFT, compressing the backbone from 14.3 GB to 2.7 GB (5.3$\times$). On the physical UR3 arm, $π_{0.5}$ quantized with ActQuant retains the baseline’s success rate while reducing the memory footprint by 2.5$\times$.
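
To make "per-block quantization scales" and "bits-per-weight" concrete, the sketch below shows plain round-to-nearest quantization with one max-abs scale per block, plus the size-weighted average bit-width of a mixed-precision assignment. This is a generic baseline only: it does not implement ActQuant's action-guided bit allocator or its curvature-based scale optimizer, and the block size and function names are assumptions.

```python
import numpy as np

def quantize_per_block(weights, bits, block_size=32):
    """Symmetric uniform quantization with one scale per block of weights.

    Assumes weights.size is divisible by block_size and bits <= 8.
    """
    flat = weights.reshape(-1, block_size)
    qmax = 2 ** (bits - 1) - 1                     # e.g. 3 for signed 3-bit
    scales = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)    # guard all-zero blocks
    codes = np.clip(np.round(flat / scales), -qmax, qmax).astype(np.int8)
    dequant = (codes * scales).reshape(weights.shape)
    return codes, scales, dequant

def average_bits_per_weight(shapes_and_bits):
    """Average bit-width of a mixed-precision plan, weighted by tensor size."""
    total_bits = sum(int(np.prod(shape)) * bits for shape, bits in shapes_and_bits)
    total_weights = sum(int(np.prod(shape)) for shape, _ in shapes_and_bits)
    return total_bits / total_weights

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
codes, scales, w_hat = quantize_per_block(w, bits=3)

# Two equally sized tensors at 2 and 3 bits average to 2.5 bits per weight.
plan = [((64, 64), 2), ((64, 64), 3)]
avg_bits = average_bits_per_weight(plan)

# Compression ratio quoted in the abstract: 14.3 GB down to 2.7 GB.
ratio = 14.3 / 2.7
```

The average ignores the storage cost of the scales themselves, which a real bits-per-weight figure would normally include.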


[70] MGVQ: Synergizing Multi-dimensional Sensitivity-Aware and Gradient-Hessian Fusion for Vector Quantization cs.CV | cs.LG

Zhong Wang, Zukang Xu, Xing Hu, Dawei Yang

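TL;DR: This paper proposes MGVQ, a vector quantization framework that addresses the large model size hindering deployment of vision-language models on edge devices. By combining multi-dimensional sensitivity awareness with gradient-Hessian fusion, it achieves ultra-low-bit quantization of model weights, reducing memory consumption and transmission overhead while preserving model performance.

Details

Motivation: Applying existing vector quantization methods to vision-language models has two core limitations. First, a single unified codebook cannot fit the cross-modality weight distribution differences brought by visual and textual inputs. Second, current second-order error compensation ignores first-order gradient information, so weights deviate from their pre-trained optimal states and the compensation is biased.

Result: Extensive experiments on mainstream vision-language models, including LLaVA-onevision, InternVL2, and Qwen2-VL, verify the effectiveness of MGVQ. In 2-bit quantization settings, MGVQ significantly surpasses existing advanced post-training quantization methods, with a maximum accuracy improvement of 4.9 points (71.4% vs 67.0% on InternVL2-26B), reaching state-of-the-art results.

Insight: The framework integrates two modules. Sensitivity-guided structured mixed-precision quantization assigns different bit-widths dynamically through combined global and local sensitivity analysis. Gradient-aware second-order error compensation embeds first-order gradients into error correction and uses Kronecker and Block-LDL decomposition to keep computation cheap. Together they offer a route to stable, efficient ultra-low-bit deployment of multimodal large models in resource-limited environments.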

Abstract: Vision-Language Models (VLMs) achieve outstanding performance, yet their huge model size severely hinders deployment on edge devices with limited resources. As an efficient model compression technique, vector quantization (VQ) excels in ultra-low-bit representation, which maps model weights to discrete codewords in a compact codebook to cut memory consumption and transmission overhead while preserving model capability. Direct VQ application to VLMs still has two core limitations. First, cross-modality weight distribution differences brought by visual and textual inputs cannot be well fitted by a single unified codebook. Second, current second-order error compensation ignores first-order gradient information, causing weight deviation from pre-trained optimal states, gradient drift and biased compensation results. This work proposes MGVQ, a novel vector quantization framework integrating multi-dimensional sensitivity perception and gradient-Hessian fusion. It consists of two core modules: sensitivity-guided structured mixed-precision quantization dynamically assigns different bit-widths according to channel sensitivity via combined global and local sensitivity analysis for refined resource allocation; gradient-aware second-order error compensation embeds first-order gradients into error correction, and adopts Kronecker and Block-LDL decomposition to ensure low computational cost. Extensive experiments on mainstream VLMs including LLaVA-onevision, InternVL2 and Qwen2-VL verify the effectiveness of MGVQ. In 2-bit quantization settings, MGVQ surpasses existing advanced post-training quantization methods significantly, achieving a maximum accuracy improvement of 4.9 points (71.4% vs 67.0% on InternVL2-26B). The proposed method realizes stable and efficient ultra-low-bit VLM quantization, greatly promoting the practical deployment of multimodal large models in resource-limited environments.
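
The basic vector quantization step the abstract builds on, mapping weight sub-vectors to discrete codewords in a compact codebook, can be sketched as below. This is a nearest-codeword lookup only: MGVQ's mixed-precision allocation and gradient-aware error compensation are not reproduced, and the sub-vector dimension, codebook size, and function names are assumptions.

```python
import numpy as np

def vq_encode(weights, codebook):
    """Map each d-dimensional weight sub-vector to its nearest codeword index."""
    d = codebook.shape[1]
    vectors = weights.reshape(-1, d)
    # Squared Euclidean distance between every sub-vector and every codeword.
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def vq_decode(indices, codebook, shape):
    """Rebuild an approximate weight tensor from codeword indices."""
    return codebook[indices].reshape(shape)

def index_bits_per_weight(codebook_size, d):
    """Index cost per weight, ignoring the storage of the codebook itself."""
    return float(np.log2(codebook_size) / d)

rng = np.random.default_rng(0)
codebook = rng.standard_normal((256, 4))   # 256 codewords of dimension 4
w = rng.standard_normal((32, 64))
idx = vq_encode(w, codebook)
w_hat = vq_decode(idx, codebook, w.shape)
bpw = index_bits_per_weight(256, 4)        # 8 index bits shared by 4 weights
```

With these illustrative sizes, each 8-bit index covers four weights, which is how a codebook can reach 2 bits per weight without storing any per-weight value.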


[71] Mitigating Hallucinations in Large Vision-Language Models via Causal Route Gating cs.CV

Zhe Cheng, Wenyu Chen, Fode Zhang, Dehuan Shen

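TL;DR: Large vision-language models (LVLMs) often hallucinate content that is fluent but unsupported by the image. This paper proposes a training-free intervention based on causal route gating. It decomposes each attention head into a visual route and a text route, estimates their token-level effects, and selectively suppresses the text-dominant routes that cause hallucination.

Details

Motivation: LVLMs produce hallucinated content when the textual pathway dominates the decision and visual evidence is ignored, which limits their reliability in real-world deployment.

Result: Across five benchmarks spanning discriminative and generative settings, the method consistently reduces hallucination-related errors for multiple models, with limited impact on overall multimodal performance and only a modest inference-time overhead.

Insight: The paper proposes a training-free causal intervention that uses route decomposition and conflict detection to suppress the text route in specific attention heads while keeping the visual route intact. It doubles as a method for diagnosing and controlling model behavior.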

Abstract: Large vision-language models (LVLMs) often hallucinate content that is fluent yet unsupported by the image, limiting their reliability in real-world deployment. We show that a key failure mode arises from route competition: even when visual tokens receive attention, the final token decision can be dominated by the textual pathway, causing the decoder to follow linguistic priors over visual evidence. To mitigate this, we propose a training-free, decision-aligned intervention that decomposes each attention head into a visual route and a text route, and estimates their token-level effects using an efficient one-forward/one-gradient approximation. These estimates reveal route conflict within heads and identify prior-dominant ones, enabling selective suppression of only the text route while keeping the visual route intact. Across five benchmarks spanning discriminative and generative settings, our method consistently reduces hallucination-related errors across models with limited impact on overall multimodal performance, while incurring a modest inference-time overhead.
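
The route decomposition can be illustrated for a single head and a single query token: the head output is a weighted sum of value vectors, so it splits exactly into the part contributed by image tokens and the part contributed by text tokens. The sketch below shows that split and a scalar gate on the text part. The names and the gate form are assumptions, and the paper's one-forward/one-gradient effect estimation, which decides which heads to gate, is not reproduced.

```python
import numpy as np

def split_head_routes(attn_weights, values, is_visual):
    """Split one head's output for one query into visual and text parts.

    attn_weights: (S,) softmax attention over S context tokens.
    values:       (S, d) value vectors of the head.
    is_visual:    (S,) boolean mask, True for image tokens.
    """
    vis = (attn_weights[is_visual][:, None] * values[is_visual]).sum(axis=0)
    txt = (attn_weights[~is_visual][:, None] * values[~is_visual]).sum(axis=0)
    return vis, txt

def gated_head_output(attn_weights, values, is_visual, text_gate=1.0):
    """Head output with the text route scaled by text_gate in [0, 1]."""
    vis, txt = split_head_routes(attn_weights, values, is_visual)
    return vis + text_gate * txt

rng = np.random.default_rng(0)
S, d = 10, 4
logits = rng.standard_normal(S)
attn = np.exp(logits) / np.exp(logits).sum()
values = rng.standard_normal((S, d))
is_visual = np.arange(S) < 6               # first 6 tokens are image tokens

full = gated_head_output(attn, values, is_visual, text_gate=1.0)
suppressed = gated_head_output(attn, values, is_visual, text_gate=0.0)
```

A gate of 1.0 reproduces the ordinary head output, and a gate of 0.0 leaves only the visual route, matching the abstract's description of suppressing the text route while keeping the visual route intact.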


[72] Machine Intelligence that Understands Visual and Linguistic Information and Interacts with Humans and Environments cs.CV | cs.AI

Van Quang Nguyen

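TL;DR: This dissertation proposes architectures to improve intelligent agents on three vision-language tasks: image captioning, visual dialog, and interactive instruction following. For image captioning, it proposes GRIT, an end-to-end Transformer-based architecture that integrates grid and region features to improve accuracy and speed. For visual dialog, it proposes the lightweight LTMI model, which models multimodal interactions with very few parameters. For instruction following in embodied AI, it proposes a two-stage instruction interpretation framework with multiple views and hierarchical attention that achieves a state-of-the-art unseen success rate on the ALFRED dataset.

Details

Motivation: Progress at the intersection of computer vision and natural language processing is crucial for applications such as assistive technology, multimedia querying, and robotics. The dissertation addresses limitations of existing methods on these three tasks: visual representations that lack global context, inefficient modeling of multimodal interactions, and the difficulty of fusing instruction understanding with the visual environment.

Result: For image captioning, GRIT outperforms prior methods in both inference accuracy and speed. For visual dialog, validated on the VisDial dataset, an LTMI layer uses less than one-tenth of the parameters of a standard Transformer extension while matching its representational power. For interactive instruction following, the framework achieves a state-of-the-art unseen success rate of 8.37% on the ALFRED dataset.

Insight: The main contributions are: 1) for image captioning, a transformer-only architecture (GRIT) that integrates grid and region features, enabling end-to-end training and better performance; 2) for visual dialog, a specialized attention block (LTMI) that models interactions among many inputs with very few parameters; 3) for embodied AI, a two-stage instruction interpretation framework that first decodes language directives independently of visual context into a tentative action sequence, then fuses it with visual features for execution, using multiple views and hierarchical attention for accurate localization.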

Abstract: Advancements at the intersection of computer vision and natural language processing are crucial for applications like assistive tech, multimedia querying, and robotics. This dissertation proposes novel architectures to improve intelligent agents across three key vision-language tasks: image captioning, visual dialog, and interactive instruction following. First, we address limitations in visual representation for image captioning. Traditional models rely on region-based features from CNN detectors, which lack global context and suffer from high computational overhead. We propose GRIT (Grid and Region-based Image captioning Transformer), a transformer-only architecture. By integrating grid and region features using a DETR-based detector, GRIT enables end-to-end training and outperforms prior methods in both inference accuracy and speed. Second, we tackle visual dialog, which requires multi-turn conversation about an image. The challenge lies in efficiently modeling interactions between multiple inputs (image, question, history). We introduce LTMI (Light-weight Transformer for Many Inputs). Utilizing a specialized attention block, an LTMI layer matches the representational power of a standard Transformer extension while utilizing less than one-tenth of its parameters, as validated on the VisDial dataset. Finally, we study interactive instruction-following for embodied AI using the ALFRED dataset. We propose a framework featuring a two-stage instruction interpretation: it first decodes language directives independently of visual context to predict a tentative action-object sequence, which is then fused with visual features for final execution. Using multiple egocentric views and hierarchical attention, our method accurately localizes objects and achieves a state-of-the-art unseen success rate of 8.37%.


[73] EMMA: Extracting Multiple physical parameters from Multimodal Data cs.CV

Farhat Shaikh, Ayan Banerjee, Sandeep Gupta

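TL;DR: EMMA is a physics-informed multimodal framework that recovers all identifiable dynamical parameters of a system directly from raw video, audio, and image-based time-series observations. By jointly inferring explicit parameters, implicit dynamical components, and calibration invariants, it overcomes the limitations of prior video-only methods, which struggle with occluded states, hidden actuation inputs, or assumptions about known initial conditions.

Details

Motivation: Existing single-modality methods, especially video-based ones, struggle to recover dynamical parameters accurately when system states are occluded, when actuation inputs are hidden, or when they rely on known initial conditions and coordinate frames.

Result: Across 100+ scenarios, including five standard dynamical benchmarks (75 Delfys videos), real-world rover and quadrotor systems with hidden inputs, and simulation-chart case studies spanning biological and chaotic systems, EMMA delivers robust multi-parameter recovery and significantly outperforms existing single-modality and equation-discovery baselines.

Insight: The paper proposes a unified multimodal feature alignment pipeline, uses a Liquid Time-Constant (LTC) network to learn latent dynamics from heterogeneous modalities, and applies a physics-constrained loss to enforce consistency with the governing differential equations. Parameters can therefore be estimated without segmentation masks, differentiable rendering, or specialized sensors.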

Abstract: We introduce EMMA, a physics-informed multimodal framework that recovers all identifiable dynamical parameters of a system directly from raw video, audio, and image-based time-series observations. Unlike prior video-only approaches that struggle with occluded states, hidden actuation inputs, or assumptions about known initial conditions and coordinate frames, EMMA performs joint inference of explicit parameters, implicit dynamical components, and calibration invariants within a unified continuous-time model. EMMA leverages a Liquid Time-Constant (LTC) network to learn latent dynamics from heterogeneous modalities while a physics-constrained loss enforces consistency with the governing differential equations. A unified feature pipeline enables consistent alignment across video trajectories, acoustic signatures, and chart-derived measurements, allowing EMMA to estimate parameters under forced, implicit, and multivariate dynamics without requiring segmentation masks, differentiable rendering, or specialized sensors. Across 100+ scenarios including five standard dynamical benchmarks (75 Delfys videos), real-world rover and quadrotor systems with hidden inputs, and simulation-chart case studies spanning biological and chaotic systems, EMMA delivers robust multi-parameter recovery and significantly outperforms existing single-modality and equation-discovery baselines. Our results establish EMMA as a general, scalable solution for physics-consistent model extraction from opportunistic multimodal data. Code and data are available at: https://github.com/ImpactLabASU/EMMA-CVPR2026


[74] Learning to See Like Humans: Gaze-Aligned Cycling Safety Prediction cs.CV

Luís Maria Perdigão, Miguel Costa, Carlos Santiago, Manuel Marques

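TL;DR: This paper proposes an Eye-Tracking-Guided Perceived Cycling Safety framework (EG-PCS) that integrates human gaze data into a pairwise learning pipeline based on vision transformers. Aligning the model with human visual attention is intended to improve prediction of how safe people perceive street environments to be.

Details

Motivation: Existing methods that learn perceived cycling safety from pairwise comparisons of street-view images do not explicitly model human visual attention, even though attention plays a central role in how humans perceive safety.

Result: Experiments show that the proposed model achieves ranking performance comparable to state-of-the-art methods while producing attention maps that reflect human visual attention more accurately.

Insight: The novelty is using eye-tracking signals to supervise the model's attention mechanism so that learned attention aligns with human fixation patterns, which the authors report enhances predictive accuracy and interpretability in perception-based urban analytics.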

Abstract: Cycling delivers significant public-health and environmental benefits, yet its uptake in cities is often limited by perceived safety. When street environments appear unsafe, individuals are less likely to cycle, making perception a key barrier to adoption. Recent work has shown that pairwise comparisons of street-view images provide a scalable way to learn subjective safety judgments. However, existing approaches do not explicitly model human visual attention, which plays a central role in how humans perceive safety. We propose an Eye-Tracking-Guided Perceived Cycling Safety framework (EG-PCS) that integrates gaze data into a pairwise learning pipeline based on vision transformers. By supervising the model’s attention mechanism with eye-tracking signals, we encourage alignment between learned attention maps and human fixation patterns. Experiments show that gaze-guided models achieve similar ranking performance compared to state-of-the-art approaches while producing attention maps that more accurately reflect human visual attention behavior. Our results demonstrate that incorporating eye-tracking information enhances both predictive accuracy and interpretability in perception-based urban analytics.


[75] Mode-as-Sequence: Translating Multimodal Motion Prediction into Unified Sequential Mode Modeling cs.CV | cs.AI

Zikang Zhou, Haibo Hu, Xinhong Chen, Yifan Zhang, Nan Guan

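TL;DR: This paper proposes Mode-as-Sequence, a unified decoding framework that addresses mode collapse and unreliable confidence ranking in multimodal motion prediction. It turns an unordered mode set into an ordered mode sequence and explicitly models mode-to-mode dependency, with two instantiations: the recurrent ModeSeq and the parallel Parallel ModeSeq. With the Early-Match-Take-All (EMTA) training strategy and a lightweight ranking regularizer, the method improves prediction diversity and confidence calibration on several large-scale benchmarks.

Details

Motivation: Multimodal motion prediction is inherently under-supervised: each training scene provides only one realized future, yet multiple plausible futures exist. Existing methods are therefore prone to mode collapse (redundant hypotheses and insufficient mode coverage) and to unreliable confidence ranking when predicting a small set of trajectories.

Result: On large-scale benchmarks, the method yields consistent improvements in both ranking-oriented metrics and best-of-K accuracy. In the Waymo Open Dataset challenges, ModeSeq achieves 1st place in the 2024 LiDAR-free motion prediction track, and Parallel ModeSeq achieves 1st place in the 2025 Interaction Prediction Challenge, supporting its effectiveness in both accuracy and efficiency.

Insight: The core idea is to recast multimodal prediction as a unified sequence modeling task, explicitly modeling causal dependency between modes to encourage diversity and calibrated confidence. Parallel ModeSeq uses masked self-attention to decode all modes in a single forward pass, removing the mode-by-mode autoregressive bottleneck and enabling efficient large-K inference and scalable joint-scene prediction. The EMTA training strategy and its extension MA-EMTA, combined with a lightweight ranking regularizer, help learn representative modes and calibrated confidence under sparse labels.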

Abstract: Multimodal motion forecasting is inherently under-supervised: each training scene provides only one realized future, yet multiple plausible futures exist. This sparse supervision often leads to mode collapse (redundant hypotheses and insufficient mode coverage) and unreliable confidence ranking when predicting a small set of trajectories. We propose Mode-as-Sequence, a unified decoding framework that translates an unordered mode set into an ordered mode sequence and explicitly models mode-to-mode dependency. Under this framework, we develop two complementary instantiations. ModeSeq performs recurrent mode decoding, where each mode is generated conditioned on the previously generated modes, encouraging diverse, non-redundant hypotheses with calibrated confidence ordering. To remove the mode-by-mode autoregressive bottleneck, we further propose Parallel ModeSeq, which preserves the same causal dependency using masked mode-to-mode self-attention while decoding all modes in a single forward pass, enabling efficient large-$K$ inference and scalable joint-scene prediction. To learn representative modes and calibrated confidence under sparse labels, we introduce Early-Match-Take-All (EMTA) and its joint-scene extension MA-EMTA, together with a lightweight ranking regularizer that reduces confidence inversions. Extensive experiments on large-scale benchmarks demonstrate consistent improvements in both ranking-oriented metrics and best-of-K accuracy across datasets, horizons, and object types. In the Waymo Open Dataset challenges, ModeSeq achieves 1st place in the 2024 LiDAR-free motion prediction track, and Parallel ModeSeq achieves 1st place in the 2025 Interaction Prediction Challenge, validating the effectiveness of Mode-as-Sequence for both accuracy and efficiency.
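
The "masked mode-to-mode self-attention" that lets Parallel ModeSeq keep a causal ordering while decoding all modes at once can be illustrated with a lower-triangular attention mask over K mode embeddings. The sketch below is a bare single-head attention that reuses the embeddings as queries, keys, and values with no learned projections. Those simplifications and the function name are assumptions, not the paper's architecture.

```python
import numpy as np

def causal_mode_attention(mode_emb):
    """Self-attention over K mode embeddings where mode k sees modes 0..k only."""
    K, d = mode_emb.shape
    scores = mode_emb @ mode_emb.T / np.sqrt(d)
    mask = np.tril(np.ones((K, K), dtype=bool))   # lower-triangular = causal
    scores = np.where(mask, scores, -np.inf)      # block attention to later modes
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ mode_emb, weights

rng = np.random.default_rng(0)
modes = rng.standard_normal((6, 16))   # K = 6 modes, 16-dimensional embeddings
out, attn = causal_mode_attention(modes)
```

Because of the mask, the output for an earlier mode is unaffected by later modes, so all K outputs can be computed in one pass and still respect the same ordering that a recurrent decoder would impose.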


[76] Towards Large Model Feature Coding cs.CV | cs.LG

Youwei Pang, Changsheng Gao, Dong Liu, Huchuan Lu, Weisi Lin

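TL;DR: This paper presents a comprehensive benchmark and evaluation framework for large model feature coding (LaMoFC). The authors build a feature dataset, LaMoFCBench, covering diverse task requirements across 4 categories and 16 scenarios while integrating widely adopted architectures and various split-computing settings. By specifying representative split points for extracting intermediate features, they establish a unified pipeline for fair, reproducible comparison and benchmark mainstream universal feature codecs.

Details

Motivation: Large models perform well on perception and generation tasks, but practical deployment is constrained by compute, memory budgets, and privacy requirements. Split execution eases these constraints but introduces intensive transmission and storage of intermediate features. Existing feature coding for CNNs mainly targets homogeneous spatial activation maps, whereas modern large models produce heterogeneous features (such as multi-level or multi-modal representations and autoregressive context caches) with varying statistical distributions and compression tolerances. LaMoFC therefore needs to be treated as a fundamental system component with a systematic evaluation framework.

Result: Benchmarking mainstream universal feature codecs exposes a profound misalignment between existing coding paradigms and the heterogeneous nature of large model features. LaMoFCBench provides a shared empirical foundation for further research in this area.

Insight: The paper frames large model feature coding as a fundamental system problem in its own right and builds a comprehensive benchmark, LaMoFCBench, for it. Its analysis shows where existing universal coding techniques fall short on heterogeneous large-model features, which motivates new coding paradigms designed for those features and supplies a standard for evaluating them.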

Abstract: Large models have delivered remarkable performance across a wide range of perception and generation tasks, yet practical deployment is increasingly constrained by computational and memory budgets, as well as privacy requirements. Split execution alleviates these constraints by partitioning computation across devices, but it inevitably introduces intensive transmission and storage of intermediate features. Unlike conventional feature coding for CNNs that typically targets homogeneous spatial activation maps, modern large models generate heterogeneous features with varying statistical distributions and compression tolerances, e.g., multi-level/multi-modal representations and autoregressive context caches. These characteristics necessitate treating large model feature coding (LaMoFC) as a fundamental system component and call for a systematic evaluation framework. In this paper, we present a comprehensive benchmark and evaluation framework for LaMoFC. We first build the feature dataset LaMoFCBench, covering diverse task requirements across 4 categories and 16 scenarios while integrating widely adopted architectures and various split-computing settings. We then specify representative split points according to practical application scenarios to extract intermediate features, establishing a unified pipeline for fair and reproducible comparisons. Finally, we benchmark mainstream universal feature codecs, exposing the profound misalignment between existing coding paradigms and the heterogeneous nature of large model features. These findings reveal that LaMoFC demands a fundamental departure from existing paradigms, and LaMoFCBench provides the shared empirical foundation to drive this transition. The data and code will be available at https://github.com/lartpang/LaMoFCBench.


[77] D2-V2X: Depth-Driven Cooperative V2X Reasoning for Autonomous Driving cs.CV

Kevin Richard, Alphin Varghese, Colin Pham, David Oh, Srijan Das

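TL;DR: This paper introduces D2-V2X, a depth-driven cooperative V2X reasoning benchmark and model for autonomous driving. It contributes a spatially aware benchmark of 8,500 Question-Rationale-Answer triplets and a baseline that aligns 3D LiDAR features with the latent space of a vision-language model (VLM). Requiring natural-language Chain-of-Thought rationales before structured JSON output forces the model to articulate spatial relations explicitly. Experiments show that the method identifies occluded hazards far better than zero-shot models and substantially reduces spatial estimation error.

Details

Motivation: Single-vehicle VLMs are constrained by sensor occlusions. Vehicle-to-Everything (V2X) systems mitigate this, but current benchmarks lack the cooperative reasoning required to resolve ambiguities in complex environments.

Result: On the D2-V2X benchmark, the method achieves 24.4% recall in identifying occluded hazards (versus near zero for zero-shot models) and reduces spatial estimation error for visible objects by 77% relative to the zero-shot baseline. The model's decision-making F1-score is 53.5.

Insight: The main contributions are: 1) a V2X question-answering benchmark (QRA) that emphasizes spatial awareness and cooperative reasoning; 2) a baseline that aligns 3D LiDAR features with a VLM; 3) Chain-of-Thought rationales that force explicit modeling of spatial relations. The study also identifies 3D-to-2D projection as a fundamental bottleneck in current VLM architectures and sets a new baseline for future work.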

Abstract: Single-vehicle Vision-Language Models (VLMs) are fundamentally constrained by sensor occlusions. While Vehicle-to-Everything (V2X) systems mitigate this, current benchmarks lack the cooperative reasoning required for resolving ambiguities in complex environments. We introduce D2-V2X, a spatially-aware Question-Rationale-Answer (QRA) benchmark featuring 8,500 triplets derived from multimodal vehicle and infrastructure sensors. We additionally establish a baseline that aligns 3D LiDAR features with the VLM’s latent space. By enforcing natural language Chain-of-Thought rationales prior to structured JSON outputs, our model is forced to explicitly articulate spatial relations. Our experiments demonstrate that grounding VLMs in cooperative LiDAR achieves 24.4% recall in identifying occluded hazards, compared to near-zero in zero-shot models, and reduces spatial estimation error for visible objects by 77% compared to the zero-shot baseline. While the model achieves a functional decision-making F1-score of 53.5, we identify 3D-to-2D projection as a fundamental bottleneck in current VLM architectures, establishing a new baseline for future innovation. Data, code, and trained models are available at https://github.com/KevinRichard1/D2-V2X


[78] ImPartial: Multi-channel Whole-Cell Segmentation using Partial Annotations cs.CV

Gunjan Shrivastava, Saad Nadeem

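TL;DR: ImPartial is a deep learning framework for cell segmentation in low-annotation regimes, using sparse scribbles and limited supervision. It augments the segmentation objective with self-supervised multi-channel quantized imputation and reaches performance on par with fully supervised models across several imaging datasets.

Details

Motivation: Cell segmentation in pathology images normally requires dense pixel-wise annotations, which are costly to obtain. The problem is most acute for emerging biological imaging modalities and multiplexed datasets with variable channel configurations, where expert-labeled data are scarce.

Result: On benchmark multiplexed cellular imaging and single-plex clinical brightfield immunohistochemistry datasets, ImPartial improves consistently over strong baselines using only partial annotations, with performance on par with fully supervised models.

Insight: The novelty is self-supervised multi-channel quantized imputation, which adds a self-supervised classification objective aligned with the segmentation goal and avoids the need for perfect pixel-wise reconstruction of the image, enabling efficient segmentation with few annotations.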

Abstract: Accurate cell segmentation in pathology images typically requires dense pixel-wise annotations, which are costly and time-consuming to obtain. This challenge is especially important for emerging biological imaging modalities and multiplexed datasets with variable channel configurations, where expert-labeled data are scarce. In this work, we introduce ImPartial, a deep learning framework designed to achieve state-of-the-art segmentation performance in low-annotation regimes using sparse scribbles and limited supervision. ImPartial augments the segmentation objective via self-supervised multi-channel quantized imputation. This approach leverages the observation that perfect pixel-wise reconstruction or denoising of the image is not needed for accurate segmentation, and thus introduces a self-supervised classification objective that better aligns with the overall segmentation goal. We demonstrate that ImPartial achieves performance on par with fully supervised models while requiring substantially fewer annotations. Extensive experiments on benchmark multiplexed cellular imaging and single-plex clinical brightfield immunohistochemistry datasets show consistent improvements over strong baselines with only partial annotations. All benchmark datasets and code are available via our GitHub: https://github.com/nadeemlab/ImPartial.


[79] EchoVQA: Enabling Conversational Assistance for Point-of-Care Cardiac Ultrasound cs.CV

Filippos Bellos, Yutong Li, Jessie N Dong, Zaiyang Guo, Emily Mackay

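TL;DR: This paper introduces EchoVQA, the first large-scale visual question answering dataset for echocardiography, comprising 14,299 images and 74,819 question-answer pairs drawn from public data and point-of-care images acquired with handheld probes. It also proposes a parameter-efficient method based on multimodal learnable prompts that achieves state-of-the-art performance on most benchmarks, including EchoVQA, with far fewer trainable parameters than existing methods.

Details

Motivation: The clinical use of point-of-care transthoracic echocardiography (TTE) is constrained by the expertise required for image acquisition and interpretation. Existing echocardiography VQA datasets are small, restricted to high-quality images, and cover only a few views, so they cannot bridge this expertise gap.

Result: The proposed parameter-efficient method achieves state-of-the-art performance on most benchmarks, including EchoVQA, while requiring significantly fewer trainable parameters than existing state-of-the-art methods.

Insight: The main contribution is the first large-scale echocardiography VQA dataset with diverse views (including suboptimal images) and point-of-care acquisitions, together with acquisition guidance questions designed to help novice operators optimize transducer positioning. On the method side, multimodal learnable prompts deliver state-of-the-art performance with few trainable parameters.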

Abstract: Point-of-care transthoracic echocardiography (TTE) enables cardiac assessment in virtually any clinical setting, yet its diagnostic utility remains constrained by the expertise required for image acquisition and interpretation. Visual question answering (VQA) offers a promising paradigm for bridging this expertise gap through interactive clinical assistance, but existing echocardiography VQA datasets are limited in scale, restricted to high-quality images, and cover only a few views. We introduce EchoVQA, the first large-scale VQA dataset for echocardiography, comprising 14,299 images and 74,819 question-answer pairs. The dataset integrates public sources (EchoNet-Dynamic, CAMUS) with our own point-of-care acquisitions from two handheld probes (Lumify, Clarius), spanning diverse views and including both high-quality and suboptimal images. Uniquely, EchoVQA includes acquisition guidance questions to help users optimize transducer positioning toward a diagnostic apical 4-chamber view for left ventricular ejection fraction estimation – a challenging task for novice operators in point-of-care settings. We further develop a parameter-efficient method based on multimodal learnable prompts that achieves state-of-the-art performance on most benchmarks, including EchoVQA, with significantly fewer trainable parameters than existing state-of-the-art approaches.


[80] Loki: Representation over Architecture for Diffusion-Based Portrait Animation cs.CV

Pouyan Navard, Sernam Lim

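TL;DR: Loki is a diffusion-based portrait animation method that encodes the driver video's facial expression and head pose with a parametric face model (for example, a 3DMM), so the conditioning path no longer depends on raw RGB. Identity is routed separately via lightweight key-value injection, which decouples identity, expression, and pose. Compared with leading diffusion baselines, Loki uses about 43% fewer inference parameters and 1496x fewer training video samples, and it leads or co-leads on metrics for head pose trajectory and facial expression following.

Details

Motivation: State-of-the-art diffusion systems for portrait animation typically stack separately trained expression, pose, and identity modules, which costs parameters, relies on proprietary datasets, and leaves residual entanglement between axes that should be independently controllable. The root cause is learning expression and pose from RGB, a representation in which identity, pose, and expression are inherently coupled. Loki addresses this by changing the conditioning representation.

Result: Loki leads or co-leads on two newly defined metrics that directly measure the core task: head pose trajectory following and facial expression following. It requires about 43% fewer inference parameters than leading diffusion baselines and is trained on 1496x fewer video samples.

Insight: The core innovation is moving the conditioning path from RGB to a parametric representation (expression and pose coefficients) from a face model such as a 3DMM, which is identity-orthogonal by construction and is rasterized into a spatial map for the diffusion model. Identity is handled separately through lightweight key-value injection that reuses the backbone's own pretrained features. Because of this representation-level decoupling, cross-identity reenactment needs only a coefficient substitution at inference and no cross-identity training data, giving an efficient, modular design.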

Abstract: Portrait animation transfers a driver clip’s facial expression and head pose onto a single reference image while preserving the reference’s identity. State-of-the-art diffusion systems address this by stacking trained modules for expression, pose, and identity in turn, paying for it in trainable parameters, proprietary corpora, and residual entanglement between the very axes the system is meant to control independently. This complexity compensates for an upstream choice – learning facial expression and head pose from RGB, a representation in which identity, pose, and expression remain entangled unless they are learned apart. Loki steps out of RGB on the conditioning path. Driver expression and head pose are encoded by a face model whose parameter axes are identity-orthogonal by construction, then rasterised into a spatial map that the diffusion backbone consumes natively. Identity is routed separately through the diffusion backbone’s own pretrained features via lightweight key-value injection. Because the parametric representation factorises identity from expression and pose, cross-ID reenactment reduces to a coefficient substitution at inference, requiring no cross-ID training data. Loki requires ~43% fewer inference parameters than leading diffusion baselines and is trained on 1496x fewer video samples. We define two metrics that directly measure whether the generated head pose trajectory and facial expression followed the driver’s – the questions portrait animation actually asks; Loki leads or co-leads on both.


[81] Cross-Modal Action Recognition in Egocentric Video Using Mamba: Integrating RGB and Hand Skeleton Streams via CLS Token Fusion Strategies cs.CV

Juan Ignacio Bustos Gorostegui, Maria Elena Buemi

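TL;DR: This paper proposes a Mamba-based cross-modal architecture for action recognition in egocentric video. It integrates an RGB video stream with temporal hand skeleton data and evaluates four CLS token fusion strategies for combining the two modalities.

Details

Motivation: Egocentric action recognition is difficult because of erratic camera motion, frequent hand occlusion, and the difficulty of keeping visual representations consistent over time.

Result: On the H2O dataset, the Average fusion strategy performs best, improving Top-1 accuracy over the VideoMamba baseline by more than 10% in the Tiny configuration and by 2% in the Small configuration.

Insight: The contribution is applying the linear time complexity of Mamba (a state space model) to cross-modal action recognition and systematically designing and evaluating four CLS-token-based fusion strategies, of which Average proves most effective.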

Abstract: Egocentric action recognition is a challenging task due to erratic camera motion, frequent hand occlusion, and the difficulty of maintaining consistent visual representations over time. In this work, we propose a cross-modal architecture that combines RGB video and temporal hand skeleton data within a unified Mamba-based framework, exploiting the linear time complexity of State Space Models (SSMs). Our architecture consists of three components: a VideoMamba module for visual feature extraction, a skeleton encoder built on a stack of Mamba blocks, and a fusion module that integrates both modalities into a single representation. A central contribution of this work is the design and evaluation of four Class (CLS) token mixing strategies for multimodal fusion: Naive, Average, Weighted, and Context-based. These strategies differ in how the pretrained unimodal CLS tokens, whose role is to act as information sinks concentrating learned representations, are leveraged to initialize the mixed CLS token used for final classification. We evaluate all strategies on the H2O dataset. Experimental results show that the Average strategy achieves the best performance, yielding gains of over 10% Top-1 accuracy in the Tiny configuration and 2% in the Small configuration over the VideoMamba baseline.
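
As a rough illustration of the Average strategy, the sketch below initializes a mixed CLS token as the element-wise mean of the two unimodal CLS tokens. The abstract does not give the exact formulation, so the function names, the plain-list token representation, and the fixed-weight `weighted_cls` variant are assumptions made for illustration, not the authors' implementation.

```python
def weighted_cls(rgb_cls, skel_cls, alpha):
    """Convex combination of two CLS tokens (hypothetical formulation)."""
    if len(rgb_cls) != len(skel_cls):
        raise ValueError("CLS tokens must share the same dimension")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * r + (1.0 - alpha) * s for r, s in zip(rgb_cls, skel_cls)]


def average_cls(rgb_cls, skel_cls):
    """Element-wise mean, i.e. the alpha = 0.5 special case."""
    return weighted_cls(rgb_cls, skel_cls, 0.5)


# Toy 4-dimensional vectors standing in for pretrained unimodal CLS tokens.
rgb_token = [1.0, 3.0, -2.0, 0.0]
skeleton_token = [3.0, 5.0, 2.0, 4.0]
mixed_token = average_cls(rgb_token, skeleton_token)
```

In a real model the mixed token would then be fed to the fusion module as the starting point for the classification token; how the paper's Weighted and Context-based variants choose their weights is not stated in the abstract.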


[82] Causal Physics Steering in Video World Models via Concept Activation Vectors cs.CV

Nahid Alam

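TL;DR: This paper presents a training-free physics steering method that uses the weight vector of a linear probe as a Concept Activation Vector (CAV) and injects it into the hidden states of a video world model at inference time, shifting the model's physical expectations without changing its weights. Experiments on the IntPhys benchmark show that the physics representation is localized in specific layers and is separable.

Details

Motivation: Video world models learn representations of physical dynamics, but controlling their physical expectations at inference time remains an open problem. The paper asks whether the previously identified Physics Emergence Zone inside the model can be used to steer its physics reasoning directly, without retraining.

Result: On IntPhys, the intervention reliably shifts the model's plausibility judgment in the direction set by the steering sign. The effect appears only when the intervention is applied within the Physics Emergence Zone, which suggests that the physics representation is localized there.

Insight: Concept activation vectors enable training-free physics steering, showing that physical reasoning in video models is both readable and steerable. Different physics principles occupy distinct directions in the representation space, which opens a new route to model interpretability and control.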

Abstract: Video world models learn representations of physical dynamics, but controlling their physical expectations at inference time remains an open problem. Recent interpretability work identified a Physics Emergence Zone (PEZ), a group of middle transformer layers in VideoMAE where physical plausibility is represented separately from other visual features. However, it remained unclear whether this structure could be used to directly control the model’s physics reasoning. We present physics steering, a training-free method that uses the weight vector of a linear probe at a PEZ layer as a Concept Activation Vector (CAV) and injects it into hidden states during inference. This shifts the model’s physical expectations without changing any model weights. On the IntPhys benchmark, this intervention reliably shifts the model’s plausibility judgment in either direction, depending on the steering sign. The effect appears only when the intervention is applied within the Physics Emergence Zone, suggesting that the relevant physics representation is localized there. We further find that physics is encoded separately from motion direction, and that different intuitive physics principles occupy distinct directions within this representation space. Together, these results show that physical reasoning in VideoMAE is not only readable, but also directly steerable.
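
The injection step described above can be sketched as adding a scaled probe direction to each token's hidden state. This is a minimal illustration under stated assumptions: the abstract does not say whether the probe weight is normalized, which tokens are modified, or how the strength is chosen, so `steer`, `probe_logit`, the unit-norm direction, and the scalar `alpha` are all hypothetical.

```python
import math


def unit(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("probe weight must be non-zero")
    return [x / norm for x in vector]


def steer(hidden_states, probe_weight, alpha):
    """Add alpha * unit(probe_weight) to every token's hidden state.

    The sign of alpha selects the steering direction; model weights are untouched.
    """
    direction = unit(probe_weight)
    return [[h + alpha * d for h, d in zip(token, direction)] for token in hidden_states]


def probe_logit(token, probe_weight, bias=0.0):
    """Linear probe readout for one token (higher = more 'plausible' in this toy)."""
    return sum(h * w for h, w in zip(token, probe_weight)) + bias


# Toy 2-dimensional hidden states and probe weight.
probe_w = [3.0, 4.0]
tokens = [[1.0, 1.0], [0.0, 2.0]]
pushed_up = steer(tokens, probe_w, alpha=2.0)
pushed_down = steer(tokens, probe_w, alpha=-2.0)
```

With a unit-norm direction, the probe readout of each token moves by `alpha` times the norm of the probe weight, so the sign of `alpha` determines whether the readout rises or falls.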


[83] Unified 3D Scene Understanding Through Physical World Modeling cs.CV

Wanhee Lee, Klemen Kotar, Rahul Mysore Venkatesh, Jared Watrous, Honglin Chen

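TL;DR: This paper presents a physical world model for unified 3D scene understanding and interaction (3WM) that folds depth estimation, novel view synthesis, and object manipulation into a single probabilistic graphical model. Different inference pathways (prompted with RGB, optical flow, or camera pose) carry out tasks zero-shot without task-specific training, which allows joint training and knowledge sharing across tasks.

Details

Motivation: Existing methods usually handle 3D scene understanding tasks such as depth estimation, novel view synthesis, and object manipulation in isolation, with no shared representation or cross-task knowledge transfer, which leads to fragmented systems. The paper unifies them in one model, turning separate training objectives into different prompts for more flexible scene understanding and interaction.

Result: Without finetuning, 3WM outperforms specialized baselines and achieves state-of-the-art (SOTA) performance on novel view synthesis (NVS) and 3D object manipulation, with precise controllability, strong geometric consistency, and robustness in real-world scenes.

Insight: Multimodal scene elements (RGB, optical flow, camera pose) are modeled as nodes of a probabilistic graphical model, and composable inference pathways unify multiple tasks zero-shot. This offers a practical alternative route to a general-purpose visual world model and supports complex geometric reasoning, such as moving objects aside while navigating a 3D environment.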

Abstract: Understanding 3D scenes requires flexible combinations of visual reasoning tasks, including depth estimation, novel view synthesis, and object manipulation, all of which are essential for perception and interaction. Existing approaches have typically addressed these tasks in isolation, preventing them from sharing a common representation or transferring knowledge across tasks. A conceptually simpler but practically non-trivial alternative is to unify these diverse tasks into a single model, reducing different tasks from separate training objectives to merely different prompts and allowing for joint training across all datasets. In this work, we present a physical world model for unified 3D understanding and interaction (3WM), formulated as a probabilistic graphical model in which nodes represent multimodal scene elements such as RGB, optical flow, and camera pose. Diverse tasks emerge from different inference pathways through the graph: novel view synthesis from RGB and dense flow prompts, object manipulation from RGB and sparse flow prompts, and depth estimation from RGB and camera conditioning, all zero-shot without task-specific training. 3WM outperforms specialized baselines without the need for finetuning by offering precise controllability, strong geometric consistency, and robustness in real-world scenarios, achieving state-of-the-art performance on NVS and 3D object manipulation. Beyond predefined tasks, the model supports composable inference pathways, such as moving objects aside while navigating a 3D environment, enabling complex geometric reasoning. This demonstrates that a unified model can serve as a practical alternative to fragmented task-specific systems, taking a step towards a general-purpose visual world model.


[84] ViViD-5K: Vineyard vision dataset for field-based berry detection and segmentation and grape cluster closure estimation cs.CV | q-bio.OT

Xiangzhi Tong, Chengrui Zhang, Mac Flaherty, Andre Matteo Garcia, Dominic Gorman

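TL;DR: This paper presents ViViD-5K, a large-scale vineyard vision dataset of 5,000 images with more than 648,000 berry centroids and cluster segmentation masks across 13 grape varieties. Building on it, the authors develop GrapeSAM, a two-stage visual pipeline that combines point-based berry localization with prompt-based segmentation using Segment Anything, followed by transformer-based cluster segmentation, for automated, minimally supervised estimation of cluster closure.

Details

Motivation: Cluster closure is a key trait in vineyard management that affects disease risk, but traditional manual visual scoring is labor-intensive, subjective, and lacks temporal resolution. Existing datasets rarely support fine-grained berry-level analysis, which limits the development of robust deep learning models.

Result: Quantitative results show strong segmentation and counting accuracy across diverse conditions, and visualizations confirm robustness on both in-domain and out-of-domain samples.

Insight: The main contributions are a large-scale, densely annotated vineyard vision dataset that supports berry-level analysis, and a two-stage pipeline combining point localization, prompted segmentation, and a transformer. Together they give a scalable, objective, automated way to assess cluster closure and support high-throughput phenotyping with enhanced spatial detail.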

Abstract: Cluster closure, defined as the progressive filling of gaps between the berries in a grape bunch, is a key trait in vineyard management, impacting disease risk. However, traditional visual scoring methods are labor-intensive, subjective, and lack temporal resolution. Existing datasets rarely support fine-grained berry-level analysis, limiting the development of robust deep learning models. In this work, we present ViViD-5K, a large-scale in-field Vineyard Vision Dataset containing 5,000 images with dense annotations, including over 648,000 berry centroids and cluster segmentation masks spanning 13 grape varieties. Building on this dataset, we introduce GrapeSAM, a two-stage visual pipeline that combines point-based berry localization with prompt-based segmentation using Segment Anything, followed by transformer-based cluster segmentation. The pipeline enables automated, in-field estimation of cluster closure with minimal supervision. Quantitative results demonstrate strong segmentation and counting accuracy across diverse conditions, while visualizations confirm robustness on both in-domain and out-of-domain samples. This work provides a scalable and objective alternative to manual compactness scoring and supports high-throughput grape phenotyping with enhanced spatial detail.


[85] VectorArk: Learning Practical Image Vectorization with Rounded Polygon Representation cs.CV | cs.AI | cs.GR

Tarun Gehlaut, Difan Liu, Charu Bansal, Krutik Malani, Souymodip Chakraborty

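TL;DR: This paper proposes VectorArk, a vision-language model (VLM)-based method for robust and practical image vectorization. It uses a novel rounded polygon representation that simplifies learning and produces smooth primitives, and adds a degradation model that improves robustness to diverse, imperfect inputs.

Details

Motivation: Existing VLM-based image vectorization methods are typically evaluated on synthetic benchmarks and generalize poorly to real-world images, such as those with unknown rasterization methods or those generated by text-to-image models.

Result: Experiments show that VectorArk achieves better geometric completeness and artifact suppression than previous methods across multiple datasets, and ablations validate the contribution of each component.

Insight: The core innovation is the rounded polygon representation, which naturally yields smooth visual primitives and simplifies learning. The proposed degradation model makes the system more practical on complex real-world inputs, moving vectorization closer to practical use.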

Abstract: Recent vision-language model (VLM)-based approaches have achieved impressive results on image vectorization tasks. However, they are typically evaluated on synthetic benchmarks, where clean SVGs are rasterized at high resolution and then re-vectorized. As a result, these methods generalize poorly to real-world scenarios, such as images with unknown rasterization methods or those generated by text-to-image models. We introduce VectorArk, a new VLM-based model designed for robust and practical image vectorization. VectorArk employs a novel rounded polygon representation that simplifies the learning process while naturally producing smooth, visually appealing primitives. We also propose a degradation model that enhances robustness across diverse and imperfect inputs. Our experiments show that, in contrast to previous methods, VectorArk achieves superior geometric completeness and artifact suppression across multiple datasets, with comprehensive ablations validating the contribution of each component.


[86] Artiverse: A Diverse and Physically Grounded Dataset for Articulated Objects cs.CV

Denys Iliash, Jiayi Liu, Egor Fokin, Qirui Wu, Ali Mahdavi-Amiri

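TL;DR: Artiverse is a high-quality, diverse, physically grounded dataset of articulated 3D objects. It contains 5.4K human-authored objects across 88 categories, aggregated from multiple static 3D repositories and annotated with functional parts, interior structures, kinematic relationships, and physical attributes. A semi-automated annotation pipeline improves efficiency, and the dataset is demonstrated on part mobility analysis, articulated object generation, and physics-based interaction.

Details

Motivation: The goal is a high-quality, physically grounded dataset of articulated objects that supports realistic functional modeling and simulation and addresses gaps in existing data in diversity, physical attributes, and functional annotation.

Result: The semi-automated annotation pipeline reduces manual annotation time by over 30%, and the dataset's value is demonstrated on part mobility analysis, articulated object generation, and physics-based interaction.

Insight: Contributions include a large-scale, physically grounded articulated-object dataset and a semi-automated annotation pipeline that combines few-shot segmentation, geometric reasoning, and multi-stage human verification. The pipeline is a useful reference for 3D data annotation and functional understanding research.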

Abstract: We present Artiverse, a diverse and physically grounded dataset of high-quality articulated 3D objects designed for realistic functional modeling and simulation. Artiverse contains 5.4K human-authored objects across a broad range of 88 categories, aggregated from multiple static 3D repositories. Objects are annotated with functional parts, interior structures, realistic kinematic relationships and articulated joints including multi-DoF joints, and physical attributes such as metric scale, material, and mass. We develop a semi-automated annotation pipeline that combines few-shot segmentation, geometric reasoning, and multi-stage human verification to achieve high-quality and efficient annotation, reducing manual annotation time by over 30%. We demonstrate the value of Artiverse on the tasks of part mobility analysis, articulated object generation, and physics-based interaction. Artiverse provides a data resource to advance functional understanding of articulated objects.


[87] Benchmarking Composed Image Retrieval for Applied Earth Observation cs.CV

Bill Psomas, Dionysis Christopoulos, Thanasis Petropoulos, Nikos Efthymiadis, Ioannis Kakogeorgiou

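TL;DR: This paper studies remote sensing composed image retrieval (RSCIR). Through a unified benchmark and an application-oriented study, it systematically evaluates how well existing composed retrieval methods transfer to remote sensing imagery, and it introduces xView2-CIR, a change-centric dataset for disaster monitoring. Training-free composition methods provide strong, scalable baselines for remote sensing retrieval, while change-centric retrieval poses different challenges from attribute-based retrieval because scene identity must be preserved.

Details

Motivation: The transferability of modern composed image retrieval methods to remote sensing imagery, and their relevance to operational remote sensing workflows, remain underexplored.

Result: On PatternCom, representative composed retrieval methods are evaluated with six vision-language backbones under a standardized protocol, and xView2-CIR exposes the challenges of change-centric retrieval. Training-free methods serve as strong baselines, while change-centric retrieval performance is affected by the need to preserve scene identity.

Insight: The paper establishes a unified, practical benchmark for RSCIR and introduces a change-centric dataset focused on disaster monitoring, revealing how change-centric retrieval differs from attribute-based retrieval. It positions composed retrieval as a complementary tool for remote sensing image retrieval, archive exploration, and change analysis.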

Abstract: Remote sensing composed image retrieval (RSCIR) enables search in large satellite image archives using composed queries that combine a reference image with a textual modifier. Although RSCIR offers a flexible interface for expressing targeted retrieval intent, the transferability of modern composition methods to Earth observation (EO) imagery and their relevance to operational EO workflows remain underexplored. We address this gap through a unified benchmark and an application-oriented study. First, we systematically adapt and evaluate representative composed image retrieval methods with six vision-language backbones on PatternCom under a standardized protocol, analyzing their behavior across backbones, composition strategies, and query types. Second, we introduce xView2-CIR, a change-centric dataset for disaster and damage monitoring, where retrieval is conditioned on scene identity and a target post-event state. Our results show that training-free composition methods provide strong and scalable baselines for EO retrieval, while change-centric retrieval presents different challenges from attribute-based retrieval, particularly due to the need to preserve scene identity. Overall, this study establishes a practical benchmark for RSCIR and positions composed retrieval as a complementary tool for remote sensing image retrieval, archive exploration, and change analysis. The dataset and code are available at https://github.com/billpsomas/rscir.


[88] EgoProx: Evaluating MLLMs on Egocentric 3D Proximity Reasoning Across a Cognitive Hierarchy cs.CV

Jinzhao Li, Yinuo Chen, Dongxu Piao, Panwang Pan, Yifan Yu

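TL;DR: This paper introduces EgoProx, a benchmark that evaluates multimodal large language models (MLLMs) on egocentric 3D proximity reasoning. An agent-based data engine generates diverse, consistent QA pairs at scale, and tasks are organized along a cognitive hierarchy of intention, exploration, exploitation, and chain-of-actions reasoning.

Details

Motivation: Humans constantly reason about 3D proximity to guide perception and action in daily life, but it is unclear whether MLLMs can perform such embodied 3D reasoning.

Result: The paper benchmarks prevailing MLLMs on EgoProx and analyzes dataset-specific and task-specific instruction tuning. Large cross-domain gains indicate that current MLLMs contain some spatial knowledge, but they still struggle to use it effectively for spatial reasoning VQA.

Insight: The contribution is a cognitively structured benchmark for egocentric 3D proximity reasoning and a scalable agent-based data engine for generating high-quality evaluation data, which together give a systematic tool for evaluating and improving embodied spatial reasoning in MLLMs.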

Abstract: Humans constantly reason about 3D proximity, the relations between their body and surrounding objects, to guide perception and action in daily life. Whether multimodal large language models (MLLMs) can perform such embodied 3D reasoning remains unclear. To this end, we introduce EgoProx, a benchmark for egocentric 3D proximity reasoning. We organize our tasks along a cognitive chain, covering intention, exploration, exploitation, and chain-of-actions reasoning. We also design an agent-based data engine that produces diverse and consistent QA pairs at scale. We benchmark prevailing MLLMs on EgoProx and conduct additional analyses with dataset-specific and task-specific instruction tuning. We observe large cross-domain gains, indicating that current MLLMs contain some spatial knowledge; however, they still struggle to effectively leverage it for spatial reasoning VQA.


[89] TempRet: Temporal Enhancement and Two-Stage Reranking for CVPR 2026 EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge cs.CV

Zixu Li, Yupeng Hu, Zhiwei Chen, Zhiheng Fu, Xiaowei Zhu

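TL;DR: This paper presents TempRet, a solution to video-text retrieval in the CVPR 2026 EPIC-KITCHENS-100 Multi-Instance Retrieval challenge. Built on a CLIP-based dual-encoder backbone, it adds a temporal transformer that models inter-frame dependencies and a two-stage reranking pipeline (the dual encoder retrieves Top-K candidates, then a cross-encoder with an ITM head refines their scores), trained with Symmetric Multi-Similarity Loss.

Details

Motivation: Most video-text retrieval methods inherit from image-text retrieval the assumption that visual semantics can be captured frame by frame, which overlooks the temporal dynamics of egocentric video. In addition, the EPIC-KITCHENS-100 MIR challenge provides soft-label relevance matrices rather than binary labels, so models must resolve graded semantic correspondences across modalities.

Result: On the EPIC-KITCHENS-100 MIR benchmark, the method achieves 67.97% average mAP and 82.92% average nDCG.

Insight: Contributions: 1) a temporal transformer on the video side that models temporal dependencies over frame-level CLIP features with learnable positional encodings and multi-head self-attention; 2) a two-stage reranking pipeline that combines efficient dual-encoder retrieval with fine-grained cross-encoder matching; 3) training with Symmetric Multi-Similarity Loss to exploit the soft-label relevance matrices.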

Abstract: Video-text retrieval has witnessed remarkable progress driven by large-scale vision-language pretraining, yet most existing approaches inherit an implicit assumption from image-text retrieval: that visual semantics can be captured frame-by-frame. This assumption overlooks the temporal dynamics of egocentric videos. The EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge further raises the bar by providing soft-label relevance matrices rather than binary labels, demanding models that can resolve graded semantic correspondences across modalities. In this report, we present our solution, termed TempRet, to the CVPR 2026 EPIC-KITCHENS-100 MIR challenge. Our approach builds upon a CLIP-based dual-encoder backbone and introduces two key components to address the temporal and cross-modal challenges. First, a temporal transformer operates exclusively on the video side, modeling inter-frame dependencies through learnable positional encodings and multi-head self-attention over frame-level CLIP features. Second, a two-stage reranking pipeline first retrieves Top-K candidates via the dual-encoder, then refines their scores using a cross-encoder equipped with an Image-Text Matching (ITM) head. The entire system is trained with Symmetric Multi-Similarity Loss to exploit the soft-label relevance matrices provided by the challenge. Our method achieves 67.97% average mAP and 82.92% average nDCG on the EK-100 MIR benchmark, demonstrating the effectiveness of temporal modeling and cross-modal refinement for egocentric video retrieval.
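
The two-stage reranking flow can be sketched as a cheap similarity pass followed by a more expensive rescoring of the shortlist. This is only a schematic under stated assumptions: `two_stage_rerank`, the dot-product similarity, and the `cross_score` callable (a stand-in for a cross-encoder with an ITM head, here keyed by candidate index for a single query) are hypothetical names, not the authors' code.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def two_stage_rerank(query_emb, video_embs, cross_score, top_k):
    """Return candidate indices, best first.

    Stage 1: rank all videos by dual-encoder similarity and keep the top_k.
    Stage 2: reorder only those candidates by the (costlier) cross_score.
    """
    by_similarity = sorted(
        range(len(video_embs)),
        key=lambda i: dot(query_emb, video_embs[i]),
        reverse=True,
    )
    candidates = by_similarity[:top_k]
    return sorted(candidates, key=cross_score, reverse=True)


# Toy data: three videos, one text query, and made-up cross-encoder scores.
video_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query_emb = [1.0, 0.0]
toy_itm_scores = {0: 0.2, 1: 0.8, 2: 0.99}
ranking = two_stage_rerank(query_emb, video_embs, toy_itm_scores.get, top_k=2)
```

In this toy run, video 2 never reaches the second stage despite its high cross score, which shows the usual trade-off: the shortlist size bounds both the cost of the second stage and the recall it can recover.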


[90] OmniEgo-R$^2$: A Routed Reasoning Framework for the 1st Cross-Domain EgoCross Challenge at CVPR 2026 cs.CV

Zixu Li, Zhiwei Chen, Zhiheng Fu, Wenbo Wang, Yupeng Hu

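TL;DR: This paper describes OmniEgo-R^2, a framework submitted to the 1st Cross-Domain EgoCross Challenge at CVPR 2026. It reframes cross-domain egocentric video reasoning as a robust embodied video reasoning problem rather than simple multiple-choice question answering. A unified routed reasoning pipeline with temporal-evidence normalization, domain-agnostic capability routing, and structured perception-dynamics-decision reasoning placed second in both the Source-Limited and Open-Source tracks.

Details

Motivation: Cross-domain egocentric video reasoning faces three core challenges: temporal boundary ambiguity, cross-domain semantic granularity mismatch, and decision instability under close options.

Result: In the CVPR 2026 EgoCross challenge, the submissions obtain 66.35% overall accuracy in the Source-Limited track and 66.77% in the Open-Source track, ranking second in both.

Insight: The task is reframed as robust embodied video reasoning, and a unified routed reasoning framework wraps the vision-language backbone with lightweight test-time reasoning and parsing programs to handle cross-domain, time-sensitive, and ambiguous-decision cases.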

Abstract: The 1st Cross-Domain EgoCross Challenge at EgoVis, CVPR 2026 evaluates whether multimodal large language models can reason over egocentric videos across surgery, industry, extreme sports, and animal perspective. We achieved second place in both the Source-Limited and Open-Source tracks. In this report, we formulate EgoCross as a robust cross-domain embodied video reasoning problem rather than a simple multiple-choice visual question answering task. We identify three key challenges: (C1) temporal boundary ambiguity, where critical state transitions are sparsely sampled and often occur between frames; (C2) cross-domain semantic granularity mismatch, where the same capability requires different domain-specific visual grammar; and (C3) decision instability under close options, where long multimodal reasoning can select unsupported distractors or produce malformed outputs. To address them, we propose OmniEgo-R$^2$ (Omnidomain Egocentric Routed Reasoning), a unified routed reasoning pipeline consisting of temporal-evidence normalization, domain-agnostic capability routing, structured perception–dynamics–decision reasoning, boundary-aware option verification, and defensive answer calibration. OmniEgo-R$^2$ uses the Qwen3-VL-4B-SFT checkpoints on each EgoCross domain as the visual-language backbone, and wraps them with lightweight test-time reasoning and parsing programs. Our final submissions obtain 66.35% overall accuracy in the Source-Limited track and 66.77% in the Open-Source track, ranking second in both leaderboards.


[91] Appearance-Invariant Detection of Suggestive Motion via Laban Movement Descriptors on SMPL Skeletons cs.CV | cs.GR

Jaehoon Ahn, Jeonghan Kong, Moon-Ryul Jung

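TL;DR: This paper presents a motion-only classification pipeline based on SMPL skeleton trajectories and Laban Movement Analysis (LMA) descriptors for detecting suggestive and explicit motion in 3D virtual environments. It is evaluated on a dataset of more than 20,000 motion fragments spanning four tiers (everyday, artistic, suggestive, explicit).

Details

Motivation: Content moderation in online multiplayer 3D virtual environments relies mainly on automated AI-based pipelines, but existing techniques focus on detecting illicit content in images, video, and audio, leaving a blind spot for suggestive motion.

Result: On 20,514 motion fragments (over 17 hours), logistic regression over 110 LMA features achieves 57.3% four-way accuracy (2.3x chance), 72.1% three-way accuracy, and 78.7% binary SFW/NSFW accuracy. Confusion concentrates on adjacent tiers.

Insight: Applying LMA descriptors to SMPL skeleton trajectories yields a motion-only classification pipeline for suggestive content. The taxonomy reflects genuinely distinct motion regimes and no single feature drives the classification, which points to content moderation based on movement quality rather than appearance.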

Abstract: Content moderation in online multiplayer 3D virtual environments has recently been relegated to automated, AI-based pipelines. However, the field has mainly focused on detecting illicit content in images, video, and audio, leaving blind spots in detection techniques for suggestive motion. We present a motion-only classification pipeline that detects suggestive and explicit movement from SMPL skeleton trajectories using Laban Movement Analysis (LMA) descriptors. On 20,514 motion fragments (17+ hours) spanning four ordinal tiers – everyday, artistic, suggestive, explicit – logistic regression over 110 LMA features achieves 57.3% four-way accuracy (2.3x chance), 72.1% three-way, and 78.7% binary SFW/NSFW. Confusion concentrates on adjacent tiers rather than non-adjacent ones, which is consistent with the ordinal structure of the tiers. Moreover, different movement qualities dominate at each level of the taxonomy – no single feature drives the classification, suggesting that the four-tier structure reflects genuinely distinct motion regimes.


[92] Med-R2: An Adversarial Benchmark for Evidence-Grounded Reasoning in Medical VLMs cs.CV

Wen Ma, Fucheng Niu, Zhiting Fan, Zikai Xiao, Jiaxiang Liu

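TL;DR: This paper introduces the Med-R2 benchmark for evaluating evidence-grounded reasoning in medical vision-language models (VLMs) along the clinical workflow. It contains hierarchical tasks across four clinical stages and uses adversarial perturbations to test robustness against misleading cues. Experiments show that existing models rely heavily on prompts in clinical reasoning and align vision and text poorly, while stepwise fine-tuning on the benchmark's data significantly improves reasoning robustness.

Details

Motivation: Medical VLMs perform well on medical visual question answering, but it is unclear whether their predictions reflect evidence-grounded clinical reasoning or reliance on spurious priors. A benchmark is needed to evaluate adversarial robustness and visually grounded reasoning.

Result: Med-R2 comprises 42,432 images, 31 task categories, and 110,406 QA pairs. Evaluation of 14 VLMs shows performance degrading stage by stage along the four-stage clinical workflow. Adversarial experiments show that models rely heavily on correct prompts to guess answers and, even when given explicit visual cues, struggle to align textual descriptions accurately.

Insight: The benchmark is hierarchical, adversarial, and aligned with the clinical workflow, and it uses stepwise QA tasks and adversarial perturbations to test how well reasoning is grounded in visual evidence. The work underlines the importance of evidence-grounded reasoning in medical AI and shows that stepwise fine-tuning on hierarchical data improves robustness.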

Abstract: Vision-language models have demonstrated impressive capabilities in general medical visual question answering, yet due to limited interpretability, it remains unclear whether their predictions reflect evidence-grounded clinical reasoning or reliance on spurious priors. We introduce Med-R2 Bench, a hierarchical benchmark aligned with the clinical workflow to evaluate adversarial robustness with visual grounding. We design stepwise QA tasks to assess whether reasoning chains are strictly grounded in visual evidence across the four clinical stages, and employ adversarial perturbations to test robustness against misleading cues. Med-R2 comprises 42,432 images, 31 task categories, and 110,406 QA pairs. Evaluation across 14 VLMs reveals a sequential performance degradation along the four-stage clinical workflow. Adversarial experiments show that models rely heavily on correct prompts to guess answers. Even when provided with explicit visual cues, the models struggle to accurately align textual descriptions. Finally, we demonstrate that stepwise fine-tuning using our hierarchical data significantly improves reasoning robustness, highlighting its potential to drive future improvements in evidence-based medical AI.


[93] EgoAction: Egocentric Action Composition with Reliability-Aware Temporal Fusion for the EPIC-KITCHENS Action Detection Challenge at CVPR 2026 cs.CV

Zhiheng Fu, Zixu Li, Zhiwei Chen, Fangxu Liu, Yupeng Hu

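TL;DR: This paper presents EgoAction, a unified decoupled detection and fusion pipeline for the EPIC-KITCHENS-100 Action Detection challenge. Using EPIC-finetuned VideoMAE-L features, it trains separate verb and noun temporal detectors and composes action hypotheses with a confidence-adaptive dynamic weighted fusion rule, aiming to improve temporal localization of actions in long untrimmed egocentric videos.

Details

Motivation: In long untrimmed egocentric videos, the verb and noun detection streams fail in different ways: verb scores are sensitive to motion transitions, and noun scores are sensitive to object visibility and clutter. A fixed arithmetic mean of their predicted boundaries can amplify localization errors when one stream degenerates.

Result: The method is evaluated in the EPIC-KITCHENS-100 Action Detection challenge. Decoupled detection, dynamic weighted fusion, sliding-window inference, and Soft-NMS together form a compact and reproducible system.

Insight: The core innovation is Dynamic Weighted Fusion, a lightweight tensor-only operator that assigns boundary weights from the noun and verb classification confidences, shifting boundary authority toward the more reliable stream while preserving the decoupled action scoring mechanism.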

Abstract: The EPIC-KITCHENS-100 Action Detection challenge evaluates whether a model can localize the start and end of each action in long untrimmed egocentric videos and assign the corresponding verb–noun action label. In this report, we formulate our submission as EgoAction (Egocentric Action Composition with Reliability-Aware Temporal Fusion), a unified decoupled detection and fusion pipeline. The pipeline uses EPIC-finetuned VideoMAE-L features, trains separate noun and verb temporal detectors with causal temporal modeling, composes action hypotheses from top noun–verb pairs, and introduces a confidence-adaptive boundary fusion rule at post-processing time. The key observation is that verb and noun streams often fail differently: verb scores are sensitive to motion transitions, whereas noun scores are sensitive to hand-object visibility and object clutter. A fixed arithmetic mean of their predicted boundaries can therefore amplify localization errors when one stream degenerates. We replace this hard-coded mean with Dynamic Weighted Fusion (DWF), which normalizes the maximum noun and verb classification confidences into proposal-wise boundary weights and linearly combines the two intervals. This lightweight tensor-only operator shifts boundary authority toward the more reliable stream while preserving the decoupled action scoring mechanism. Together with sliding-window inference, top-K noun–verb action composition, and class-wise Soft-NMS, EgoAction provides a compact and reproducible system for egocentric temporal action detection.
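
A minimal sketch of the Dynamic Weighted Fusion idea follows, for one proposal. It assumes the two confidences are normalized by their sum; the abstract only says they are "normalized" into boundary weights, so the sum normalization, the equal-weight fallback, and the function name are assumptions rather than the reported implementation.

```python
def dynamic_weighted_fusion(noun_interval, verb_interval, noun_conf, verb_conf):
    """Fuse two (start, end) intervals using confidence-derived weights.

    noun_conf / verb_conf: maximum classification confidence of each stream
    for this proposal. Equal confidences reduce to the arithmetic mean.
    """
    total = noun_conf + verb_conf
    if total <= 0.0:
        w_noun = w_verb = 0.5  # assumed fallback when both streams are uninformative
    else:
        w_noun = noun_conf / total
        w_verb = verb_conf / total
    start = w_noun * noun_interval[0] + w_verb * verb_interval[0]
    end = w_noun * noun_interval[1] + w_verb * verb_interval[1]
    return start, end


# Toy proposal: boundaries in seconds from the noun and verb detectors.
noun_box = (10.0, 20.0)
verb_box = (12.0, 24.0)
balanced = dynamic_weighted_fusion(noun_box, verb_box, 0.5, 0.5)
noun_trusted = dynamic_weighted_fusion(noun_box, verb_box, 0.9, 0.1)
```

When the noun stream is far more confident, the fused boundaries sit close to the noun interval, so a degenerate verb prediction moves the result much less than it would under a fixed mean.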


[94] FoodMonitor: Benchmarking MLLMs for Explainable Compliance Analysis cs.CV | cs.AI

Ruihao Xu, Xingming Shui, Jingxuan Niu, Yiqin Wang, Jilin Yu

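TL;DR: This paper introduces FoodMonitor, a benchmark for evaluating multimodal large language models (MLLMs) on explainable compliance analysis in commercial kitchen surveillance. It contains 477 video clips with 3,307 violation annotations, uses a dual-channel design covering person-level and environment-level violations, and defines a unified protocol with two-stage evaluation of spatial localization and semantic understanding. Systematic evaluation of several SOTA models shows that the best model reaches a C_score of only 0.360, with spatial localization and fine-grained rule understanding as the main bottlenecks.

Details

Motivation: Existing video anomaly detection datasets focus on event-level binary classification and lack the rule-driven, explainable analysis that real compliance scenarios require, so they cannot provide verifiable evidence or traceable accountability signals.

Result: On FoodMonitor, the best-performing SOTA model achieves a composite C_score of only 0.360, with spatial localization and fine-grained rule understanding as the primary bottlenecks. The analysis also identifies two failure modes: localization-dominated and semantics-dominated errors.

Insight: Contributions: 1) a dual-channel video benchmark focused on explainable compliance analysis; 2) a two-stage evaluation mechanism combining spatial localization and semantic understanding, with a balanced composite metric, C_score; 3) failure-mode diagnosis that points to targeted improvements for model development.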

Abstract: As AI-powered compliance monitoring becomes increasingly important in public governance and industrial safety, the ability to provide verifiable evidence and traceable accountability signals is essential. However, existing video anomaly detection datasets focus on event-level binary classification, lacking the rule-driven, explainable analysis required for real-world compliance scenarios. We introduce FoodMonitor, a benchmark for explainable compliance analysis in commercial kitchen surveillance. FoodMonitor comprises 477 video clips with 3,307 violation annotations across a dual-channel design covering both person-level and environment-level violations. Each annotation specifies which rule was violated, what non-compliant behavior occurred, and who committed it with frame-level bounding boxes. We establish a unified evaluation protocol with a two-stage matching mechanism that separately assesses spatial localization and semantic understanding, along with a composite metric ($C_{\text{score}}$) that balances environment and person detection performance. Systematic evaluation of several state-of-the-art multimodal large language models reveals that the best-performing model achieves only 0.360 $C_{\text{score}}$, with spatial localization and fine-grained rule understanding emerging as the primary bottlenecks. Our analysis identifies two distinct failure modes: localization-dominated errors and semantics-dominated errors, providing diagnostic insights for future model development.


[95] EgoAdapt: A Multi-Scene Egocentric Adaptation Method for CVPR 2026 HD-EPIC VQA Challenge cs.CV

Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Guozhi Qiu

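TL;DR: This paper presents EgoAdapt, a method for first-person kitchen video question answering in the CVPR 2026 HD-EPIC VQA challenge. Three inference-time components (category-conditioned routing, calibrated option scoring, and test-time consistency adaptation) address the mismatch between a single generic inference recipe and the benchmark's heterogeneous temporal, spatial, and semantic structure.

Details

Motivation: In HD-EPIC, a single generic inference recipe does not match the heterogeneous temporal, spatial, and semantic structure of the videos. The benchmark requires reasoning over varied evidence such as hand-object interactions, long recipe trajectories, spatial relations, and subtle gaze cues.

Result: The method substantially improves over the available HD-EPIC baselines, but the abstract gives no specific quantitative results (such as accuracy) and does not state whether it reaches SOTA.

Insight: The contribution is an inference-time adaptation framework for egocentric video question answering. Category-conditioned routing handles question types differently, while calibrated scoring (letter-token likelihoods plus generation agreement) and test-time consistency aggregation improve robustness on ambiguous cases. The idea of designing modular inference strategies around task heterogeneity is reusable elsewhere.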

Abstract: This technical report presents our solution, EgoAdapt (Egocentric Adaptation via Category, Calibration, and Consistency), to the CVPR 2026 HD-EPIC VQA challenge. HD-EPIC evaluates whether a vision-language model can reason over realistic first-person kitchen videos, where the evidence for an answer may be a short hand-object interaction, a long recipe trajectory, a spatial relation to a fixture, or a subtle gaze cue. The benchmark contains 26K multiple-choice questions across seven macro-categories: recipe, ingredient, nutrition, fine-grained action, 3D perception, object motion, and gaze. We observe that the main difficulty is not only model capacity, but also the mismatch between a single generic inference recipe and the heterogeneous temporal, spatial, and semantic structure of the benchmark. Our method, EgoAdapt, introduces three inference-time components: (1) category-conditioned routing with per-category prompts, frame budgets, and sampling rates; (2) calibrated option scoring that evaluates all candidate answers with letter-token likelihoods and generation agreement instead of relying only on direct generation; and (3) test-time consistency adaptation that aggregates predictions across option permutations and verification-style prompts for ambiguous cases. This design substantially improves over the available HD-EPIC baselines.
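The option-permutation aggregation described in component (3) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `max_perms` cap, and the `toy_predict` scorer standing in for a vision-language model are all assumptions made for illustration.

```python
from collections import Counter
from itertools import islice, permutations


def aggregate_over_permutations(options, predict, max_perms=6):
    """Majority-vote an answer across several orderings of the options.

    `predict` (hypothetical) receives the options in the shown order and
    returns the position it picks within that order.
    """
    votes = Counter()
    for perm in islice(permutations(range(len(options))), max_perms):
        shown = [options[j] for j in perm]
        picked = predict(shown)       # position within the shuffled list
        votes[perm[picked]] += 1      # map back to the original index
    best, count = votes.most_common(1)[0]
    return best, count / sum(votes.values())


def toy_predict(shown):
    # Toy stand-in for a model: always picks the longest option text.
    return max(range(len(shown)), key=lambda k: len(shown[k]))


options = ["cut", "stir the sauce", "pour", "wash"]
best, agreement = aggregate_over_permutations(options, toy_predict)
```

A low `agreement` value could be used to flag a question as ambiguous and send it to extra verification prompts, which is one plausible reading of how the abstract's "ambiguous cases" are selected.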


[96] Φ-Noise: Training-Free Temporal Video Conditioning via Phase-Based Noise Manipulation cs.CV | cs.AI | cs.GR | cs.LG

Ofir Abramovich, Nadav Z. Cohen, Adi Rosenthal, Ariel Shamir

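TL;DR: This paper proposes Φ-Noise, a training-free method for motion-conditioned video generation. It injects low-frequency phase information from a reference video directly into the noise latents of a diffusion model, transferring motion cues from the reference video to the generated video without modifying the model architecture or inference pipeline.

Details

Motivation: Existing conditioning methods for video diffusion models usually require additional training and computational overhead. The paper explores a training-free, computationally efficient alternative that exploits the importance of frequency components in generative models to control the appearance and dynamics of generated videos.

Result: Across several applications, the method controls both the appearance and the dynamics of generated videos effectively, and achieves competitive or superior results compared with more complex conditioning approaches.

Insight: The core innovation is using phase information (low-frequency phase in particular) as the carrier of motion cues and injecting it directly into the diffusion noise, which gives a training-free, architecture-agnostic form of conditional control. Viewed objectively, combining frequency-domain analysis with diffusion models offers a novel and efficient perspective on conditional control for video generation.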

Abstract: Latent video diffusion models generate videos by progressively transforming Gaussian noise into realistic samples conditioned on text or visual inputs. However, existing conditioning methods often require additional training and computational overhead. Motivated by recent findings on the importance of frequency components in generative models, we propose a simple, training-free approach for motion-conditioned video generation by injecting low-frequency phase information from a reference video directly into the diffusion noise latents. Our method transfers motion cues without modifying the model architecture or inference pipeline. Using several applications, we demonstrate effective control over both appearance and dynamics in generated videos, while achieving competitive or superior results compared to more complex conditioning approaches.
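The idea of copying low-frequency phase into noise can be sketched on a toy 1D signal. This is an illustration under stated assumptions, not the paper's procedure: the abstract does not specify the transform, the cutoff, or how latents are laid out, so the 1D DFT, the `cutoff` parameter, and all function names here are hypothetical.

```python
import cmath
import math
import random


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def inject_low_freq_phase(noise, reference, cutoff):
    """Keep the noise magnitudes; copy the reference phase for bins <= cutoff."""
    N, R = dft(noise), dft(reference)
    n = len(noise)
    mixed = list(N)
    for k in range(n):
        # Treat bin k and its mirror n-k together so the output stays real.
        if min(k, n - k) <= cutoff:
            mixed[k] = abs(N[k]) * cmath.exp(1j * cmath.phase(R[k]))
    return idft(mixed)


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(16)]
reference = [rng.gauss(0.0, 1.0) for _ in range(16)]
out = inject_low_freq_phase(noise, reference, cutoff=2)
conditioned = [v.real for v in out]
```

Because only phases change, the magnitude spectrum of the noise (and hence its total energy) is untouched in this sketch; whether the paper relies on that property is not stated in the abstract.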


[97] PEDESTRIANQA: A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction cs.CV | cs.AI

Naman Mishra, Shankar Gangisetty, C. V. Jawahar

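TL;DR: This paper introduces PedestrianQA, a large-scale video dataset that recasts pedestrian intention and trajectory prediction as question answering. It aims to use large vision-language models (VLMs) as a unified framework to improve the safety of decision-making by autonomous driving systems in complex traffic environments.

Details

Motivation: Pedestrian intention and trajectory prediction is a safety-critical problem for autonomous driving systems. The paper seeks to exploit the strong visual understanding and natural-language reasoning of large vision-language models to build a unified, explainable framework that needs no specialized architecture for each task.

Result: Experiments on PIE, JAAD, TITAN, and IDD-PeD show that fine-tuning SOTA VLMs on PedestrianQA significantly improves intention classification, trajectory prediction accuracy, and the quality of explanatory rationales.

Insight: The novelty is reformulating pedestrian behavior prediction as video-based question answering augmented with structured rationales. This lets VLMs act as a unified, explainable framework that learns visual dynamics, contextual cues, and interactions among traffic agents jointly, without a dedicated model for each subtask.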

Abstract: Pedestrian intention and trajectory prediction are critical for the safe deployment of autonomous driving systems, directly influencing navigation decisions in complex traffic environments. Recent advances in large vision-language models offer a powerful new paradigm for these tasks by combining high-capacity visual understanding with flexible natural language reasoning. In this work, we introduce PedestrianQA, a large-scale video-based dataset that formulates pedestrian intention and trajectory prediction as question-answering tasks augmented with structured rationales. PedestrianQA expresses richly annotated pedestrian sequences in natural language, enabling VLMs to learn from visual dynamics, contextual cues, and interactions among traffic agents while generating concise explanations of their predictions, without needing specialized architectures tailored for each task. Empirical evaluations across PIE, JAAD, TITAN, and IDD-PeD show that finetuning state-of-the-art VLMs on PedestrianQA significantly improves intention classification, trajectory forecasting accuracy, and the quality of explanatory rationales, demonstrating the strong potential of VLMs as a unified and explainable framework for safety-critical pedestrian behavior modeling.


[98] IQA-Spider: Unifying Multi-Granularity Image Quality Assessment with Reasoning, Grounding and Referring cs.CV

Xinge Peng, Yiting Lu, Xin Li, Zhibo Chen

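TL;DR: This paper presents IQA-Spider, described as the first image quality assessment (IQA) framework based on a large multimodal model (LMM) that unifies reasoning, grounding, and referring for multi-granularity (global and local) quality understanding. It builds a rigorous four-task paradigm (global and local quality description, pixel-level grounding, and region-level referring) with a corresponding dataset, and adopts a conflict-free two-stage design: fine-grained text-level reasoning first, followed by a training-free text-to-point grounding paradigm that maps textual semantics to spatial coordinates. The result is unified, multi-granularity, explainable image quality assessment.

Details

Motivation: Existing LMM-based IQA methods typically support only some perception dimensions (quality description and question-answering reasoning, or pixel-level grounding). They lack a unified task and data formulation and an effective optimization paradigm for multi-granularity learning, which limits how fully a model can understand image quality.

Result: Extensive experiments across multiple benchmarks demonstrate strong performance and validate the effectiveness and versatility of the proposed task formulation and framework.

Insight: The main contributions are: 1) a rigorous, unified four-task paradigm for multi-granularity IQA; 2) a corresponding dataset built with a scalable automatic annotation pipeline; 3) a conflict-free two-stage design whose second stage, a training-free text-to-point grounding paradigm, bridges textual semantics and pixel-level perception. This offers a new approach to building unified, explainable multimodal understanding models.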

Abstract: We present IQA-Spider, the first image quality assessment (IQA) framework that unifies reasoning, grounding, and referring into a single LMM-based framework for multi-granularity quality understanding. Existing LMM-based IQA methods typically support only partial perception dimensions, such as quality description and question answering (i.e., reasoning) or pixel-level grounding. This limitation largely stems from the absence of (i) a unified task and data formulation and (ii) effective optimization paradigms for multi-granularity learning. To address these limitations, we formulate a rigorous four-task paradigm covering global and local quality description, pixel-level grounding, and region-level referring. Based on this formulation, we construct a corresponding IQA dataset with a scalable and automatic annotation pipeline, thereby providing a solid foundation for unified multi-granularity learning. To further enable unified perception, we adopt a conflict-free two-stage design that progressively extends text-level multi-granularity understanding to pixel-level grounding: (i) the first stage equips the model with fine-grained text-level reasoning across multiple IQA tasks, and (ii) the second stage introduces a training-free text-to-point grounding paradigm, which bridges textual semantics and pixel-level perception by mapping token logits to spatial coordinates. Together, these efforts yield IQA-Spider, which provides unified, multi-granularity, explainable image quality assessment. Extensive experiments across multiple benchmarks demonstrate strong performance, validating the effectiveness and versatility of the proposed formulation and framework.


[99] Correcting Visual Blur Induced by Attention Distraction to Reduce Hallucinations: Algorithm and Theory cs.CV | cs.AI

Quanjiang Li, Zhiming Liu, Wei Luo, Tingjin Luo, Chenping Hou

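TL;DR: This paper links object hallucination in multimodal large language models to a human-like attention distraction phenomenon and proposes AFIP, an attention-focusing method that needs no additional training. AFIP corrects attention distraction through cross-head attention enrichment and dynamic historical attention enhancement, thereby reducing hallucinations.

Details

Motivation: Multimodal large language models often hallucinate objects, yet the underlying visual perceptual mechanism remains unclear. The paper addresses the problem in both theory and algorithm by drawing an analogy to attention distraction in humans.

Result: Extensive experiments on multiple benchmarks and models validate the effectiveness of AFIP, which reduces hallucinations without additional training.

Insight: The novelty is bringing the notion of human attention distraction into model analysis and correcting the resulting visual blur through cross-head attention enrichment and dynamic historical attention enhancement. This gives a new perspective on, and a practical method for, understanding and mitigating MLLM hallucinations.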

Abstract: Multimodal large language models (MLLMs) frequently suffer from object hallucinations, yet the visual perceptual mechanism underlying this failure remains poorly understood. In this work, we reveal that hallucinations are strongly associated with a human-like attention distraction phenomenon, where humans under divided focus experience degraded visual clarity and produce inaccurate descriptions, while in models the same mechanism manifests as spatial inconsistency in multi-head attention and temporal fading of attention to image tokens during decoding. We further provide theoretical insights that attention dispersion increases model complexity and degrades classification generalization. Motivated by these findings, we propose an Attention-Focused Approach for Improved Image Perception (AFIP), which corrects attention distraction via cross-head attention enrichment and reinforces visual grounding through dynamic historical attention enhancement. Extensive experiments on multiple benchmarks and models validate the effectiveness of AFIP without additional training.


[100] World Models as Group Actions cs.CV

Zijie Wang, Wei Zhang, Weiming Zhang, Fanqi Zhang, Xiao Tan

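TL;DR: This paper proposes a framework based on group-action theory for evaluating video world models, arguing that action faithfulness should be understood through the compositional structure of actions (for example, the SE(2) group for navigation). The authors formalize action-conditioned world modeling as a group action on the state space and enforce identity, inverse, and composition consistency through latent-space regularization, with no additional data collection. Experiments show that the method consistently improves the group-action consistency and robustness of state-of-the-art video world models while preserving perceptual quality.

Details

Motivation: Existing video world models are visually realistic but do not guarantee that their dynamics are truly governed by actions. The paper uses a formal framework based on group structure (such as SE(2)) to evaluate and improve the action faithfulness of action-conditioned world models.

Result: Extensive experiments on several state-of-the-art video world models show that the method consistently improves the Group-Action Consistency (GAC) and Group-Action Robustness (GAR) metrics without degrading perceptual quality.

Insight: The novelty is formalizing action faithfulness as a group action and designing a unified training framework, based on latent-space regularization, that enforces group-structure constraints. This gives a principled criterion for judging whether a world model's dynamics are correct, and synthesized supervision avoids the need for extra data.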

Abstract: Video world models have achieved strong visual realism, but this does not ensure that their dynamics are truly governed by actions. In this work, we argue that action faithfulness should be understood through the compositional structure of actions, which in many embodied settings follows a group structure (e.g., SE(2) for navigation). Based on this insight, we formalize action-conditioned world modeling as realizing a group action on the state space, providing a principled criterion for evaluating dynamics beyond visual quality. To operationalize this framework, we propose a unified approach that enforces identity, inverse, and composition consistency via latent-space regularization with synthesized supervision, avoiding additional data collection. We further introduce two metrics: Group-Action Consistency (GAC) and Group-Action Robustness (GAR), to evaluate structural correctness and rollout stability. Extensive experimental results show that our method consistently improves both GAC and GAR in state-of-the-art video world models without degrading perceptual quality.
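The three consistency conditions named in the abstract (identity, inverse, composition) can be made concrete on exact SE(2) poses. The sketch below is only an illustration of the group-action axioms: the paper enforces them in a learned latent space and defines GAC and GAR in ways the abstract does not spell out, so `step`, `drifting_step`, and `gap` are hypothetical names and the pose-space setting is an assumption.

```python
import math


def compose(a, b):
    """SE(2) product: apply motion b in the frame reached by a. Poses are (x, y, theta)."""
    ax, ay, at = a
    bx, by, bt = b
    return (ax + math.cos(at) * bx - math.sin(at) * by,
            ay + math.sin(at) * bx + math.cos(at) * by,
            at + bt)


def inverse(a):
    x, y, t = a
    return (-(math.cos(t) * x + math.sin(t) * y),
            math.sin(t) * x - math.cos(t) * y,
            -t)


def gap(p, q):
    """Distance between two poses, with the angle difference wrapped to [-pi, pi]."""
    dtheta = math.atan2(math.sin(p[2] - q[2]), math.cos(p[2] - q[2]))
    return math.hypot(p[0] - q[0], p[1] - q[1]) + abs(dtheta)


def step(state, action):
    # Ideal "world model": the action acts on the state by right multiplication.
    return compose(state, action)


def drifting_step(state, action):
    # Faulty model that adds a constant drift, to show the checks can fail.
    x, y, t = compose(state, action)
    return (x + 0.01, y, t)


IDENTITY = (0.0, 0.0, 0.0)
s = (1.0, 2.0, 0.3)
a = (0.5, 0.0, math.pi / 4)   # move forward 0.5, turn 45 degrees
b = (0.2, -0.1, -0.6)

identity_gap = gap(step(s, IDENTITY), s)
inverse_gap = gap(step(step(s, a), inverse(a)), s)
composition_gap = gap(step(step(s, a), b), step(s, compose(a, b)))
drift_identity_gap = gap(drifting_step(s, IDENTITY), s)
```

For the ideal model all three gaps vanish up to floating-point error, while the drifting model already violates the identity condition.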


[101] Vision-Language Binding in In-Context Image Generation cs.CV

Chris Ge, Rohit Gandikota, Antonio Torralba, Tamar Rott Shaham

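TL;DR: This paper studies the mechanism of vision-language binding in in-context image generation models such as FLUX.2 and shows that an implicit cross-modal binding forms between the text tokens and the reference image. Using three causal interventions (T2I Lens, Attention Knockout, and I2I-to-I2I Patching), the authors find that reference-image properties such as color and style are first written into the text tokens and then carried to the generated image, whereas pixel-exact properties such as a specific face flow directly through image-to-image attention.

Details

Motivation: The paper investigates how reference-image information flows inside models such as FLUX.2 to influence the output image. These models process all inputs (text, reference image, and noise tokens) in a single attention stream, and the internal mechanism is not yet understood.

Result: Across 2,875 editing tasks on images from SUN397, DreamBench++, and online sources, the experiments show a consistent division of labor in how reference-image properties reach the output, and further localize the reference-text binding to the padding tokens of the text sequence.

Insight: The novelty is showing that text tokens in a multimodal DiT are not just prompt holders but a structured channel for reference-image content. This suggests that even in unified-attention multimodal generative models, token modality structures how conditioning information is represented and routed through the network.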

Abstract: In-context image generation models such as FLUX.2 take a text prompt and an optional reference image as visual conditioning for the output. Internally, all three inputs – text, reference image, and the noise tokens – are concatenated and processed through a single attention stream, where all tokens can attend to one another. This leaves open how reference information flows through the model to produce the output image. We show that an implicit cross-modal binding emerges between the text tokens and the reference image: the text tokens absorb visual reference content during the forward pass, and that absorbed content causally influences the generated output. We surface this binding with three causal interventions on FLUX.2: T2I Lens, which decodes intermediate text-token activations through a text-to-image path; Attention Knockout, which severs specific attention edges; and I2I-to-I2I Patching, which copies text token activations between editing runs. Across 2,875 editing tasks on various images, including SUN397 and DreamBench++ datasets and images collected online, we observe a consistent division of labor: properties of the reference image, like color, style, and scene setting, are first written into the text tokens, which carry them to the generated image; pixel-exact properties like a specific face or instance identity bypass the text tokens and flow directly from reference to image through image-to-image attention. We further localize the reference-text binding to the padding tokens of the text sequence. These results show that text tokens in a multimodal DiT are not just prompt holders, but a structured channel for reference image content. More broadly, they suggest that even in unified-attention multimodal generative models, token modality structures how conditioning information is represented and routed across the network.


[102] ULF-Synth: Physics-Guided Ultra-Low-Field MRI Enhancement for Pediatric Neuroimaging cs.CV

Toufiq Musah, Salvatore Calcagno, Federica Proietto Salanitri, Xiaomeng Li, Maruf Adewole

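TL;DR: The paper proposes ULF-Synth, a framework that synthesizes realistic ultra-low-field MRI images from high-field MRI to create large-scale paired training data, and uses a spatial-frequency domain objective that prioritizes recovery of high-frequency anatomical detail. This enhances ultra-low-field MRI image quality without real paired data.

Details

Motivation: Ultra-low-field MRI is portable and accessible, but its signal-to-noise ratio and spatial resolution are low, and paired ultra-low-field and high-field data for supervised enhancement are hard to acquire in resource-limited settings.

Result: Trained only on synthetic data, the models generalize effectively to real 64mT ultra-low-field acquisitions, improve multiclass brain segmentation, and achieve higher radiologist preference and diagnostic acceptability in a blinded reader study.

Insight: The novelty is a synthetic-supervision framework that needs no real paired data. Physics-guided image synthesis and a spatial-frequency domain objective yield improvements across several translation models (encoder-decoder, adversarial, and diffusion-based), offering a scalable route to ultra-low-field MRI enhancement.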

Abstract: Ultra-low-field (ULF) MRI offers portable and accessible neuroimaging but suffers from reduced signal-to-noise ratio and limited spatial resolution compared to high-field (HF) systems. Acquiring paired ULF-HF data for supervised enhancement is often difficult, particularly in resource-limited settings. We introduce ULF-Synth, a framework that combines (i) acquisition-based synthesis of realistic ULF images from HF volumes to create large-scale paired training data and (ii) a spatial-frequency domain objective that prioritizes recovery of high-frequency anatomical detail. This formulation is architecture-agnostic, consistently improving structural similarity and perceptual fidelity across encoder-decoder, adversarial, and diffusion-based translation models. When trained exclusively on synthetic data, the resulting models generalize effectively to real 64mT ULF acquisitions, improving downstream multiclass brain segmentation and achieving higher radiologist preference and diagnostic acceptability in a blinded reader study. These findings demonstrate that synthetic paired supervision provides a practical and scalable pathway for enhancing ULF MRI without requiring real paired acquisitions. Code, Models and Dataset: https://github.com/toufiqmusah/ULF-Synth


[103] DexSIM: Real-time Dexterous Simulation with Unified Causal Video Diffusion cs.CV

Adam Lee

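TL;DR: This paper proposes DexSIM, a framework that simulates dexterous hand manipulation in real time with a unified causal video diffusion model trained in two stages. It combines bidirectional video diffusion with rollout-based autoregressive training, and uses Gaussian heatmap hand encoding and a spatial-cache attention mechanism to improve long-term consistency and 3D awareness.

Details

Motivation: Existing video diffusion and 3D reconstruction methods mostly target navigation. Simulation of dexterous hand-object interaction lacks real-time interactivity, long-term spatial consistency, and memory, which limits its use for creating interactive experiences and generating synthetic data for robotics.

Result: DexSIM outperforms the baseline on pixel and semantic similarity, motion fidelity, and hand projection accuracy, supports new applications such as hand motion transfer, and runs interactively in real time at 15.24 FPS.

Insight: The contributions include: 1) a joint training framework that embeds hand action trajectories and video in a unified feature space; 2) Gaussian heatmap hand encoding for a more accurate hand representation; 3) a rollout-based autoregressive training strategy with a spatial-cache attention mechanism that strengthens long-term spatial memory and 3D-aware simulation.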

Abstract: Recent progress in video diffusion models has enabled extensive simulation of the physical world, but simulation of hand-object interaction has been less explored. We propose DexSIM, a framework for simulating dexterous manipulation in real time. Previous works that use video diffusion and 3D reconstruction focus on navigation; dexterous manipulation has received limited attention even though it has extensive applications in creating interactive experiences with the simulated world and in generating synthetic data for robotics. Existing methods lack real-time interactivity, long-term spatial consistency, and memory. We propose a two-stage training framework for DexSIM. First, we train a bidirectional video diffusion model by jointly embedding the hand action trajectory and the video in a unified feature space, using Gaussian heatmap hand encoding for a more accurate hand representation. Then we conduct rollout-based autoregressive training with an updated spatial cache as an attention sink for spatial memory, which improves long-term consistency and 3D-aware simulation of dexterous manipulation. DexSIM outperforms the baseline on pixel and semantic similarity, motion fidelity, and hand projection accuracy. It also enables new applications such as hand motion transfer and runs interactively in real time at 15.24 FPS.


[104] Reasoning to Align: Implicit Reasoning in Diffusion Transformers for Video Editing cs.CV

Yan Li, Lin Liu, Xiaopeng Zhang, Qi Tian

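TL;DR: This paper proposes RVEDiT, an implicit-reasoning Diffusion Transformer framework for video editing, to address the shortcomings of existing instruction-based video editing methods in preserving content and temporal coherence. Its two core components, Granularity-Routed Token Conditioning and Reference-Anchored Attention Alignment, guide the model through coarse-to-fine editing reasoning and improve editing quality.

Details

Motivation: Existing Diffusion Transformer video editors have two structural problems. First, all conditioning signals are fed undifferentiated into every transformer block, so a single token stream must encode both global editing intent and fine-grained visual evidence. Second, the cross-attention patterns that govern the edit are supervised only indirectly through pixel-level reconstruction, leaving the model's internal reasoning under-constrained.

Result: On standard instruction-based video editing benchmarks, RVEDiT consistently outperforms state-of-the-art baselines, with particularly strong gains on localized and compositional edits.

Insight: The first innovation is Granularity-Routed Token Conditioning, which routes learnable editing tokens distilled from a multimodal LLM to shallow blocks and reserves native visual and textual tokens for deeper blocks, inducing a coarse-to-fine editing process inside the backbone. The second is Reference-Anchored Attention Alignment, which uses a parameter-sharing reference branch during training and maximizes the mutual information between the attention features of the editing and reference branches, regularizing the model's internal reasoning at no extra inference cost.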

Abstract: Instruction-based video editing requires transforming a source video according to a natural-language instruction while preserving irrelevant content and remaining temporally coherent. We argue that existing Diffusion Transformer (DiT) editors struggle with this task for two structural reasons. First, conditioning signals are fed undifferentiated into all transformer blocks, forcing a single token stream to encode both global editing intent and fine-grained visual evidence. Second, the cross-attention patterns that govern the edit are supervised only indirectly through pixel-level reconstruction, leaving the model’s internal reasoning process under-constrained. To address both limitations, we propose RVEDiT, an implicit Reasoning Video Editing DiT framework built around two complementary components. The first, Granularity-Routed Token Conditioning, introduces learnable editing tokens distilled from a multimodal LLM and routes them to shallow blocks, while reserving native visual and textual tokens for deeper blocks, thereby inducing a coarse-to-fine editing process inside the backbone. The second, Reference-Anchored Attention Alignment, employs a parameter-sharing reference branch during training and maximizes the mutual information between the attention features of the editing and reference branches, regularizing the model’s internal reasoning without incurring any additional inference cost. Experiments on standard instruction-based video editing benchmarks show that RVEDiT consistently outperforms state-of-the-art baselines, with particularly strong gains on localized and compositional edits.


[105] Understanding the Impact of Geometric Foundation Models on Vision-Language-Action Models cs.CV | cs.RO

Yurou Yang, Muyuan Lin, Roberto Martin-Martin, Martin Labrie, Shreekant Gayaka

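TL;DR: This paper presents a systematic experimental analysis of combining vision-language-action models (VLAs) with geometric foundation models (GFMs), aiming to understand how geometric understanding affects VLAs. It first quantifies the shortfall in geometric understanding of existing VLAs (the "geometric gap"), then compares three architectural strategies for injecting GFM geometry into a VLA, and finally analyzes how non-architectural design choices (such as training data and number of cameras) affect performance.

Details

Motivation: Prior work has explored integrating geometric foundation models (GFMs) for 3D reconstruction into vision-language-action models (VLAs), often with improved performance. It remains unclear whether modern VLAs already have sufficient geometric understanding, which architecture best injects geometric knowledge, and what effect non-architectural design choices have. The paper answers these questions through rigorous experimental analysis.

Result: The experiments use a specific VLA (GR00T-N1.5) and GFM (VGGT). A linear-probing analysis quantifies, for the first time, the "geometric gap" between VLAs and GFMs, indicating that current VLAs lack geometric understanding. The paper also compares three geometry-injection architectures and analyzes the effect of non-architectural factors, such as training data and number of cameras, on geometric VLA performance.

Insight: The novelty is the first systematic quantification of the geometric-understanding deficit in VLAs, together with a comparison of several geometry-injection architectures. Viewed objectively, the study provides experimental baselines and design guidance for building more effective geometric VLAs, and stresses that factors beyond architectural fusion, such as data quality, also matter.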

Abstract: Recent work explores new opportunities at the intersection of vision-language-action models (VLAs) and geometric foundation models (GFMs) for 3D reconstruction, such as VGGT. While the resulting geometric VLAs often show improved performance, it remains unclear (i) if modern VLAs already have sufficient geometric understanding to start with, (ii) what is the best architecture to inject geometric understanding into a VLA, and (iii) what is the effect of other design choices that affect geometric VLAs. In this paper we provide a rigorous experimental analysis to shed light on these questions, for a specific choice of VLA (GR00T-N1.5) and GFM (VGGT). Our first contribution is to formalize prior work’s intuition that current VLAs lack geometric understanding, by providing a rigorous analysis based on linear probing. The analysis quantifies, for the first time, the “geometric gap” between VLAs and GFMs. Our second contribution is to identify and compare different strategies to bridge GFMs with VLAs. We implement three different architectures, which differ in the way they inject geometry in the VLA, while keeping low-level implementation details as similar as possible, to ensure a fair comparison. Finally, we analyze the impact of non-architectural choices (e.g., training data, number of cameras, reconstruction quality) on the performance of the geometric VLAs.


[106] VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation cs.CV | cs.AI

Bo Li, Ronghao Chen, Ningyuan Deng, Huacan Wang, Shaolin Zhu

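TL;DR: VaaWIT is an end-to-end framework that adapts large language models with visual awareness to translate multilingual text embedded in Web images. A dual-stream attention module fuses multilingual semantic features with fine-grained visual representations, and a visual-aware adapter efficiently injects the fused visual cues into a frozen LLM backbone, improving translation of complex Web content.

Details

Motivation: Translating text embedded in Web images is crucial for content accessibility and cross-lingual information retrieval. Existing large vision-language models face a visual representation gap on this task: standard encoders tend to prioritize high-level semantics over the fine-grained visual details needed to recognize diverse character morphologies.

Result: Extensive experiments on eight tasks across three public benchmarks show that VaaWIT significantly outperforms state-of-the-art open-source baselines and is competitive with proprietary models.

Insight: The claimed contributions are a dual-stream attention module for bidirectional feature interaction and a parameter-efficient visual-aware adapter for fine-tuning. Viewed objectively, the core innovation is integrating fine-grained visual perception into an LLM explicitly and efficiently to close the visual representation gap, targeting the complex task of Web image translation.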

Abstract: Translating text embedded in Web images is crucial for improving content accessibility and cross-lingual information retrieval, particularly within social media and e-commerce domains. Although Large Vision-Language Models (LVLMs) have advanced multimodal understanding, applying them to Web image translation remains challenging due to the visual representation gap: standard encoders often prioritize high-level semantics over the fine-grained visual details required for recognizing diverse character morphologies. To address this challenge, we propose VaaWIT, an end-to-end framework that adapts Large Language Models for multilingual Web image translation. The framework introduces two key technical contributions: (1) a Dual-Stream Attention Module (DSAM), which facilitates bidirectional interaction between multilingual semantic features and detailed visual representations, thereby synthesizing unified features robust to textual variations; and (2) a Visual-Aware Adapter (VAA), a parameter-efficient fine-tuning strategy that dynamically injects these fused visual cues into the frozen LLM backbone. This design enables the model to align the visual context with linguistic reasoning effectively while minimizing computational costs. Extensive experiments on eight tasks on three public benchmarks demonstrate that VaaWIT significantly outperforms state-of-the-art (SOTA) open-source baselines and achieves competitive performance against proprietary models. These results validate the efficacy of integrating fine-grained visual perception into LLMs for complex Web content analysis.


[107] MindAdapter: Few-Shot Parameter-Efficient Residual Calibration of Cross-Subject Brain-to-Visual Decoding Models cs.CV

Jiaxiang Liu, Jiawei Du, Xupeng Chen, Guoqi Li, Jiang Cai

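TL;DR: This paper proposes MindAdapter, a parameter-efficient few-shot calibration framework that addresses inter-individual functional misalignment in cross-subject brain-to-visual decoding. It freezes a pretrained explicit brain functional alignment backbone and adds a lightweight nonlinear residual adapter, decoupling global cross-subject correspondence from subject-specific residual correction. A topology-anchored dual-stream manifold constraint preserves global representational stability.

Details

Motivation: Severe inter-individual variability induces systematic subject-specific functional misalignment, a core challenge in cross-subject brain-to-visual decoding. The paper targets this challenge to enable personalized brain-to-visual decoding.

Result: Experiments on the Natural Scenes Dataset (NSD) show that MindAdapter substantially improves cross-subject visual reconstruction and retrieval accuracy using only a few shared stimuli, offering a practical, data-efficient solution for personalized brain-to-visual decoding.

Insight: The novelty is a decoupled linear-residual cascade alignment paradigm that separates global alignment from individual correction, together with a topology-anchored dual-stream manifold constraint. A small set of paired data serves as topological anchors, while a semantic stream enforces consistency on unpaired data, so subject-specific corrections are injected efficiently while the global representational geometry learned during pretraining is preserved.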

Abstract: Cross-subject brain-to-visual decoding remains a core challenge in brain-computer interfaces due to severe inter-individual variability that induces systematic subject-specific functional misalignment. To address this issue, we propose MindAdapter, a parameter-efficient few-shot calibration framework for pretrained brain-to-visual decoding models. MindAdapter adopts a decoupled linear-residual cascade alignment paradigm by freezing a pretrained explicit brain functional alignment backbone (coarse) and introducing a lightweight nonlinear residual adapter (fine), thereby disentangling global cross-subject correspondence from subject-specific residual corrections for fine-grained spatial and semantic calibration. To further preserve global representational stability, we design a topology-anchored dual-stream manifold constraint, where a small set of shared stimuli serves as topological pins with voxel-level paired supervision, while a semantic stream enforces consistency through a frozen vision-language decoder on unpaired brain data. Together, MindAdapter efficiently injects subject-specific corrections while maintaining the global representational geometry learned during pretraining. Experiments on the Natural Scenes Dataset (NSD) demonstrate that MindAdapter substantially improves cross-subject visual reconstruction and retrieval accuracy using only a few shared stimuli, offering a practical and data-efficient solution for personalized brain-to-visual decoding.


[108] HoloFair: Unified T2I Fairness Evaluation and Fair-GRPO Debiasing cs.CV | cs.AI

Ruyi Chen, Lu Zhou, Xiaogang Xu, Chiyu Zhang, Jiafei Wu

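TL;DR: This paper proposes HoloFair, a comprehensive benchmark framework for evaluating multidimensional demographic bias in text-to-image (T2I) models, and introduces Fair-GRPO, a reinforcement-learning-based debiasing method. Built on a large-scale fairness dataset and the SpaFreq attribute classifier, the framework proposes the MGBI metric to assess intrinsic diversity and conditional bias. Experiments show that Fair-GRPO significantly improves multidimensional fairness on SD3.5-Medium while maintaining high image quality.

Details

Motivation: Existing bias evaluation methods for T2I models usually address a single dimension of bias and lack a perspective that exposes model bias at deeper, socially relevant semantic levels. A more comprehensive evaluation and debiasing framework is therefore needed.

Result: Experiments on the SD3.5-Medium model show that Fair-GRPO significantly improves multidimensional fairness (assessed with metrics such as MGBI) while maintaining high image quality.

Insight: The main contributions are: 1) HoloFair, a unified multidimensional bias evaluation framework centered on the MGBI metric, which assesses both intrinsic diversity and conditional bias; 2) Fair-GRPO, a reinforcement-learning-based debiasing method that shifts the generative model's distribution through a multi-objective reward function; 3) an analysis of potential reward hacking with mitigation strategies, giving fairness research a new technical path and benchmark.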

Abstract: Text-to-Image (T2I) models have made significant strides in visual realism and semantic consistency, yet they often perpetuate and amplify societal biases. Existing evaluation methods typically address only single-dimensional biases and lack the perspective needed to uncover model biases at deeper, socially relevant semantic levels. We introduce HoloFair, a comprehensive benchmark framework for multidimensional demographic bias analysis. Built upon our large-scale fairness-oriented dataset and the SpaFreq (Spatial-Frequency) attribute classifier, this framework proposes the Multi-attribute, Group-wise Bias Index (MGBI) metric, designed to assess both intrinsic diversity and conditional biases. Beyond evaluation, we further introduce Fair-GRPO, a reinforcement-learning-based debiasing method that alters the distribution of generative models through a designed multi-objective reward function. Experiments on the SD3.5-Medium model demonstrate that Fair-GRPO significantly improves multidimensional fairness while maintaining high image quality. We also analyze potential reward hacking phenomena and provide corresponding mitigation strategies. Code and dataset are available at https://github.com/1059684669/HoloFair


[109] AdaFuse-Det: Adaptive Cross-Modal Fusion of Event Cameras for Robust Object Detection in Low-Light RGB Imagery cs.CV

Raju Imandi, Chethana B, Bharatesh Chakravarthi, Yong-Guk Kim, Manipriya S

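TL;DR: This paper proposes AdaFuse-Det, a dual-stream framework for robust object detection under extreme low-light conditions. An adaptive cross-modal fusion module grounded in minimum-variance linear estimation theory fuses CLAHE-enhanced RGB frames with voxelized event tensors, exploiting the complementary, largely illumination-invariant information that event cameras provide.

Details

Motivation: Under extreme low light (for example, nighttime surveillance or search-and-rescue robotics), conventional RGB cameras degrade sharply, whereas event cameras provide structural information with high dynamic range and microsecond resolution that is largely insensitive to illumination. The question is how to fuse the two modalities effectively for reliable object detection.

Result: On the LLE-VOS benchmark, AdaFuse-Det achieves a recall of 65.54%, a precision of 53.85%, and an F1-score of 59.12% under severe illumination degradation. Its recall exceeds that of single-modality detectors, consistent with the theoretically predicted illumination-adaptation behavior.

Insight: The novelty is a theory-driven adaptive cross-modal fusion module whose learned attention map asymptotically recovers the Gauss-Markov optimal fusion weights, together with event conservation and temporal resolution bounds for the event voxelization stage. This gives multimodal fusion interpretable theoretical guarantees.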

Abstract: Detecting objects reliably under extreme low-light conditions is an open problem in computer vision, with practical urgency in applications ranging from nighttime surveillance to search-and-rescue robotics. Conventional RGB cameras degrade sharply at low photon flux, while event cameras, which record asynchronous per-pixel brightness changes at microsecond resolution and high dynamic range, provide complementary structural cues that are largely illumination-invariant. We present AdaFuse-Det, a dual-stream framework that fuses CLAHE-enhanced RGB frames with voxelized event tensors through an Adaptive Cross-Modal Fusion (ACMF) module grounded in minimum-variance linear estimation theory. We formally show that the learned attention map asymptotically recovers the Gauss-Markov optimal fusion weights, and establish event conservation and temporal resolution bounds for the voxelization stage. On the LLE-VOS benchmark, AdaFuse-Det achieves a Recall of 65.54%, Precision of 53.85%, and F1-Score of 59.12% under severe illumination degradation, outperforming single-modality detectors in recall by a margin that reflects the theoretically predicted illumination-adaptation behavior.
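As background for the "minimum-variance linear estimation" the abstract refers to, the textbook result for two independent, unbiased estimates is inverse-variance weighting. The sketch below shows that result and also checks that the reported F1-score is consistent with the reported precision and recall. It is not the ACMF module: the variance values are hypothetical, and how the paper estimates per-modality uncertainty is not stated in the abstract.

```python
def min_variance_weights(var_rgb, var_event):
    """Inverse-variance weights for two independent, unbiased estimates."""
    inv_rgb, inv_event = 1.0 / var_rgb, 1.0 / var_event
    total = inv_rgb + inv_event
    # Weights sum to 1; the fused variance is below either input variance.
    return inv_rgb / total, inv_event / total, 1.0 / total


def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall)


# Hypothetical noise levels: in the dark, the RGB estimate is 4x noisier.
w_rgb, w_event, fused_var = min_variance_weights(var_rgb=4.0, var_event=1.0)

# Consistency check on the reported LLE-VOS numbers (in percent).
f1 = f1_score(precision=53.85, recall=65.54)
```

In this toy setting the event stream receives weight 0.8 and the RGB stream 0.2, which matches the intuition that the fusion should lean on the less noisy modality as illumination degrades.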


[110] SRUG: Shadow-Guided Relightable Urban Scene with Generation Model cs.CV | cs.GR

Yonghao Zhao, Zexin Yin, Jian Yang, Beibei Wang, Jin Xie

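TL;DR: This paper proposes SRUG, a framework for creating relightable urban scenes from images or videos. It uses shadows to guide a 3D completion model that recovers the geometry of invisible regions, applies an iterative material decomposition scheme supervised by a large material model (LMM) for robust material decomposition, and builds a physically-based lighting model that supports reliable relighting.

Details

Motivation: Urban scenes are typically unbounded and contain many unobserved regions that cast shadows onto visible areas. Sparse input views and complex illumination also introduce severe ambiguities in material decomposition, which makes creating relightable urban scenes highly challenging.

Result: Extensive quantitative evaluations and visual comparisons show that the method outperforms existing approaches in both novel view synthesis and relighting.

Insight: The novelty is using shadows to guide 3D scene completion so that physically reasonable shadows can be synthesized, and introducing an iterative material decomposition scheme supervised by a large material model. Together these address the geometry-recovery and material-decomposition ambiguities in urban scene relighting.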

Abstract: Creating relightable urban scenes from images or videos is widely useful but highly ill-posed. Urban environments are typically unbounded and extend beyond the visible regions. As a result, many portions of the scene remain unobserved, yet these invisible regions can cast shadows onto visible areas. Reasonably modeling shadows cast by such invisible regions is challenging and poses a significant obstacle to creating relightable urban scenes. At the same time, sparse input views and complex illumination conditions further complicate relighting, as they introduce severe ambiguities in material decomposition. In this paper, we propose Shadow-guided Relightable Urban Scene with Generation model (SRUG), a novel framework designed to address relighting challenges in urban scenes. SRUG leverages shadows to guide a 3D completion model for recovering the geometry of invisible regions, promoting the synthesis of physically reasonable shadows. In addition, SRUG employs an iterative material decomposition scheme that applies the large material model (LMM) to provide material supervision and iteratively decompose the scene’s material properties, enabling robust material decomposition. Building upon these components, we introduce a physically-based lighting model that captures the complex illumination of urban scenes and supports reliable relighting. Extensive quantitative evaluations and visual comparisons demonstrate that our method outperforms existing approaches in both novel view synthesis and relighting tasks.


[111] Physics-Guided Self-Supervised Statistical Residual Learning for Sonar Despeckling with Improved Generalization cs.CV | eess.SP

Swapna Pillai, Siddharth Singh Savner, Sujit Kumar Sahoo

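TL;DR: This paper proposes a physics-guided, self-supervised statistical residual learning framework for sonar image despeckling. It reformulates despeckling as residual consistency in the homomorphic log domain: constraining the log-ratio residual to obey multiplicative speckle statistics suppresses speckle and preserves structural fidelity without clean-image supervision. Combined with a lightweight neural network, the method achieves state-of-the-art performance on multiple real sonar datasets, with excellent cross-dataset robustness and suitability for real-time deployment.

Details

Motivation: Sonar despeckling lacks clean supervised data, and self-supervised methods can collapse to a degenerate identity solution. The paper addresses both problems, aiming to improve generalization and practicality.

Result: The method achieves state-of-the-art performance on multiple real sonar datasets and shows excellent cross-dataset robustness while remaining suitable for real-time deployment.

Insight: The novelty is reformulating despeckling as residual-consistency learning in the log domain, with a physics-guided multiplicative speckle statistics constraint, a variance-targeted statistical loss, edge-aware structural regularization, and median-guided curriculum stabilization, which together enable effective despeckling without clean supervision. Viewed objectively, the combination of physical priors with self-supervised learning and the emphasis on generalization are worth borrowing.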

Abstract: This letter introduces a physics-informed self-supervised framework for sonar image despeckling that reformulates despeckling as residual consistency in the homomorphic log domain. By constraining the log-ratio residual to obey multiplicative speckle statistics, the proposed method eliminates the need for clean supervision while preventing degenerate identity solutions. A variance-targeted statistical loss combined with edge-aware structural regularization and median-guided curriculum stabilization enables effective speckle suppression with preserved structural fidelity. This formulation along with a lightweight neural network achieves state-of-the-art performance across multiple real sonar datasets and demonstrates excellent cross-dataset robustness, while remaining suitable for real-time deployment.
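Why a variance target on the log-ratio residual rules out the identity solution can be seen in a toy simulation. The sketch assumes single-look, unit-mean exponential speckle, for which the variance of the log-speckle is π²/6; the letter's actual speckle model and loss are not given in the abstract, so `variance_target_loss` and the chosen distribution are hypothetical.

```python
import math
import random


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def variance_target_loss(noisy, estimate, target_var):
    """Squared gap between the log-ratio residual variance and its target."""
    residual = [math.log(y) - math.log(x) for y, x in zip(noisy, estimate)]
    return (variance(residual) - target_var) ** 2


rng = random.Random(0)
clean = [1.0 + 0.5 * math.sin(i / 50.0) for i in range(20000)]
# Assumed speckle model: unit-mean exponential intensity speckle (single look).
speckle = [max(rng.expovariate(1.0), 1e-12) for _ in clean]
noisy = [x * n for x, n in zip(clean, speckle)]   # multiplicative speckle

target = math.pi ** 2 / 6   # variance of log(n) when n ~ Exponential(1)
loss_oracle = variance_target_loss(noisy, clean, target)    # perfect despeckling
loss_identity = variance_target_loss(noisy, noisy, target)  # output = input
```

The identity output leaves a zero residual, so its loss equals the squared target variance, whereas the perfectly despeckled output has a residual whose variance sits close to the target.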


[112] Motion-Compensated Weight Compression cs.CV | cs.AI | cs.LG

Ismail Lamaakal

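TL;DR: This paper proposes Motion-Compensated Weight Compression (MCWC), a weight compression method that aligns permutation-symmetric blocks of a neural network (e.g., hidden units and attention heads) to maximize cross-layer redundancy, turning depth into a predictable sequence. A lightweight predictor and entropy coding then compress the quantized residuals, improving the trade-off between compression rate and accuracy while keeping decoding fast.

Details

Motivation: Existing neural network weight compression methods usually treat each layer independently and overlook the cross-layer redundancy induced by function-preserving symmetries (such as permutation symmetry), which limits compression efficiency. This paper targets that gap.

Result: On Transformer language modeling and vision classification, MCWC improves the rate-accuracy Pareto frontier over strong baselines (quantization and learned weight codecs) while maintaining competitive decode time.

Insight: The core novelty is using permutation alignment to turn the depth dimension into a predictable sequence, then coding residuals with prediction, entropy modeling, and keyframe scheduling. Viewed objectively, the method creatively carries the motion-compensation idea from video compression (align, then predict) over to weight compression, effectively exploiting the structural symmetry redundancy of neural networks.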

Abstract: Neural network weights are increasingly a bottleneck for deployment, yet most compression pipelines treat layers independently and overlook cross-layer redundancy induced by function-preserving symmetries. We propose Motion-Compensated Weight Compression (MCWC), a weight-only codec that aligns permutation-symmetric blocks (e.g., hidden units and attention heads) to maximize cross-layer correspondence, turning depth into a predictable sequence. In the aligned coordinate system, MCWC uses a lightweight layer-sequential predictor with periodic keyframes and encodes only quantized prediction residuals using a learned entropy model trained under a rate-distortion objective. A simple decoder reconstructs deployable weights by entropy decoding, dequantization, predictor-driven reconstruction, and inverse alignment, enabling fast weight materialization for inference. Across Transformer language modeling and vision classification, MCWC improves the rate-accuracy Pareto frontier over strong quantization and learned weight-codec baselines, while maintaining competitive decode time. Ablations confirm that alignment, prediction, entropy modeling, and keyframe scheduling are each necessary for the full gains. Our code is available via https://github.com/Ism-ail11/MCWC.
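The align, predict, and quantize steps can be sketched on toy data. This is not the MCWC implementation: the greedy matcher, the "previous layer as predictor" rule, and the uniform quantizer are simplifying assumptions, and the second "layer" is constructed as a shuffled, slightly perturbed copy of the first so that the effect of alignment is visible.

```python
import numpy as np

def greedy_align(ref, w):
    """Return perm such that w[perm] lines up row by row with ref (greedy, not optimal)."""
    sim = (ref @ w.T).astype(float)
    n = sim.shape[0]
    perm = np.empty(n, dtype=int)
    for _ in range(n):
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        perm[i] = j
        sim[i, :] = -np.inf  # row i is matched
        sim[:, j] = -np.inf  # column j is used
    return perm

rng = np.random.default_rng(0)
n_units, dim, step = 16, 256, 0.05
layer_a = rng.normal(size=(n_units, dim))
shuffle = rng.permutation(n_units)
# Toy assumption: the next layer is a permuted, slightly perturbed copy.
layer_b = layer_a[shuffle] + 0.01 * rng.normal(size=(n_units, dim))

perm = greedy_align(layer_a, layer_b)
aligned_b = layer_b[perm]

# Predict layer_b from layer_a, then quantize only the prediction residual.
residual = aligned_b - layer_a
codes = np.round(residual / step).astype(int)
reconstructed = layer_a + codes * step

unaligned_energy = float(np.sum((layer_b - layer_a) ** 2))
aligned_energy = float(np.sum(residual ** 2))
```

After alignment the residual energy is far smaller than without it, so the integer codes are mostly zero and cheap to entropy-code. A real codec would also store the permutation and use a learned predictor and entropy model.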


[113] How Noisy Poses Break Inverse Dynamics: Analysis and Mitigation for Video-Based Joint Torque Estimation cs.CV

Donghyun Kim, Chanyoung Kim, Eunseo Jeong, Youngjoong Kwon, Seong Jae Hwang

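TL;DR: This paper systematically analyzes how noise in monocular 3D human pose estimation is amplified through the inverse dynamics pipeline and quantifies its effect on joint torque computation. Pose noise is amplified by approximately 1,000x during numerical differentiation, and proximal joints are more noise-sensitive than distal joints. To support the analysis, the paper introduces SMPL-Dynamics, a fully differentiable inverse dynamics module, and shows that differentiable pose refinement reduces torque error by 93%.

Details

Motivation: Despite progress in monocular 3D human pose estimation, converting kinematic estimates into physical quantities such as joint torques remains challenging, mainly because noise amplification through inverse dynamics has not been studied systematically.

Result: Analysis with the SMPL-Dynamics module shows that pose noise is amplified by approximately 1,000x when computing torques; proximal joints (spine, hips) are up to 10x more sensitive to noise than distal joints (wrists, hands); low-pass filtering before differentiation substantially reduces the amplification; and differentiable pose refinement reduces torque error by 93% with negligible change in pose.

Insight: The novelty lies in systematically quantifying how pose-estimation noise propagates through inverse dynamics and in SMPL-Dynamics, a differentiable inverse dynamics module that needs no external physics simulator. It supports end-to-end gradient computation and pose refinement, providing a new tool for video-based biomechanical analysis.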

Abstract: Recent advances in monocular 3D human pose estimation enable accurate body tracking from video. However, translating these kinematic estimates into physical quantities, such as joint torques, remains challenging due to noise amplification through inverse dynamics. In this work, we provide a systematic analysis of how pose estimation noise propagates through the inverse dynamics pipeline. We present three key findings: (1) pose noise is amplified by approximately 1,000x when computing joint torques via numerical differentiation, (2) proximal joints (spine, hips) are up to 10x more sensitive to noise than distal joints (wrists, hands), and (3) low-pass filtering before differentiation substantially reduces this amplification. To enable this analysis, we develop SMPL-Dynamics, a fully differentiable inverse dynamics module for the SMPL body model that requires no external physics simulators. Our module supports end-to-end gradient computation, and we demonstrate this through differentiable pose refinement, which reduces torque error by 93% with negligible change in pose.
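The amplification mechanism can be seen in a few lines. The sketch below assumes white Gaussian pose noise, a 30 fps frame rate, and a moving-average filter as a stand-in for the low-pass filter; none of these values come from the paper, and the gain computed here is for acceleration rather than the torque figure the authors report.

```python
import numpy as np

def second_difference(x, dt):
    """Central finite-difference estimate of acceleration."""
    return (x[2:] - 2.0 * x[1:-1] + x[:-2]) / dt ** 2

def moving_average(x, width):
    """Simple low-pass filter (an assumed stand-in, not the paper's filter)."""
    return np.convolve(x, np.ones(width) / width, mode="valid")

rng = np.random.default_rng(0)
fps = 30.0
dt = 1.0 / fps
sigma = 0.01  # hypothetical pose noise level, arbitrary units
noise = sigma * rng.normal(size=100_000)  # noise only, so the true acceleration is zero

raw_gain = np.std(second_difference(noise, dt)) / sigma
filtered_gain = np.std(second_difference(moving_average(noise, 5), dt)) / sigma
# The (1, -2, 1) stencil on white noise has standard deviation sqrt(6) * sigma / dt**2.
predicted_gain = np.sqrt(6.0) / dt ** 2
```

Because the stencil divides by dt squared, the noise gain grows with the square of the frame rate, which is why even small pose jitter dominates the acceleration estimate. Averaging neighboring frames before differencing cancels most of the high-frequency noise.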


[114] 4KLSDB: A Large-Scale Dataset for 4K Image Restoration and Generation cs.CV

Zihao Zhu, Kuan-Ru Huang, Zhaoming Xu, Renjie Li, Bo Wu

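TL;DR: This paper introduces 4KLSDB, a large-scale, high-quality dataset designed for research on 4K image super-resolution and text-to-image generation. It contains more than 129,000 carefully curated 4K-resolution images across multiple categories, plus separate validation and test sets. By training models on it, the authors show that training on native 4K data significantly improves image restoration performance at 4K resolution.

Details

Motivation: Publicly available high-resolution datasets lack both native 4K resolution and the scale needed to train state-of-the-art models, which limits progress in super-resolution and text-to-image diffusion research.

Result: Representative super-resolution and diffusion models trained on 4KLSDB improve significantly on native 4K benchmarks. Comprehensive experiments show a positive correlation between training on true 4K-resolution data and improved fidelity in image restoration tasks, especially at 4K resolution.

Insight: The core contribution is building and releasing a large-scale, high-quality, multi-category native 4K dataset (4KLSDB) that fills a resource gap in the field. Its data pipeline combines automated filtering, Large Multimodal Models (LMMs), and human annotation to ensure aesthetic quality and consistency, providing key infrastructure for high-fidelity image synthesis and restoration research.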

Abstract: High-resolution datasets are essential for advancing super-resolution (SR) and text-to-image (T2I) diffusion research. However, current publicly available datasets lack both the native 4K resolution and the extensive scale necessary for training state-of-the-art models. To address this gap, we introduce a 4K Large Scale Dataset and Benchmark (4KLSDB), a large-scale, diverse dataset consisting of 129,484 carefully curated 4K resolution images spanning multiple categories such as nature, urban scenes, people, food, artwork, and CGI, alongside distinct validation and test sets containing 2,000 and 1,984 images respectively. Images were sourced from established open datasets including Photo Concept Bucket, Laion2B, and PD12M. 4KLSDB underwent rigorous multi-stage automated filtering and annotation pipelines involving both human annotators and Large Multimodal Models (LMMs) to ensure high aesthetic quality and dataset consistency. We demonstrate 4KLSDB’s effectiveness by training representative super-resolution and diffusion models, observing significant improvements in performance on native 4K benchmarks. Comprehensive experiments illustrate a positive correlation between training on true 4K resolution data and improved fidelity in image restoration tasks, especially at 4K resolution. 4KLSDB gives the research community a valuable resource to drive progress toward genuinely high-fidelity image synthesis and restoration. Our project page is available at: https://4klsdb.github.io/.


[115] Self-Supervised Contrastive Learning for Cardiac MR Sequence Classification cs.CV | eess.IV

Yuli Wang, Hyewon Jung, Dongshen Peng, Yuwei Dai, Jing Wu

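TL;DR: For cardiac magnetic resonance (MR) sequence classification, this study proposes a Vision Transformer (ViT) adaptation strategy based on image-based self-supervised contrastive learning, compensating for the lack of medical-imaging domain knowledge in pretrained ViT models. The method outperforms conventional supervised training on an in-house dataset and generalizes well to external datasets.

Details

Motivation: Pretrained Vision Transformer models perform well on general-purpose datasets but lack specialized medical-imaging knowledge, so their features transfer poorly to cardiac MR image classification.

Result: On cardiac MR sequence classification, the adapted ViT model achieves classification AUC above 0.75 across the four most common sequences and shows strong generalization to external MR datasets such as BraTS and ADNI.

Insight: The novelty is using image-based self-supervised contrastive learning for domain adaptation, which effectively improves ViT performance on medical imaging tasks. Ablation studies on batch size and dataset scale also offer practical insights for transfer learning in medical imaging.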

Abstract: Vision Transformer (ViT) models, utilizing self-attention mechanisms, have demonstrated robust generalization capabilities across various vision tasks, including image classification. However, these models, typically pretrained on general public datasets, often lack the specialized domain knowledge necessary for medical imaging applications. In this study, we investigate the adaptation of ViT models, specifically for cardiac magnetic resonance (MR) images, using an in-house dataset. We found that pretrained ViT features do not effectively transfer to the cardiac MR domain. To overcome this limitation, we introduce an adaptation strategy that utilizes image-based self-supervised contrastive learning, demonstrating superior performance compared to traditional supervised training approaches. Moreover, our adapted ViT model exhibits strong generalization to external MR datasets such as BraTS and ADNI. Through ablation studies, we further investigate the impact of batch size and dataset scale on performance. Ultimately, our adapted model achieves classification AUC exceeding 0.75 across the four most common cardiac MR sequences.


[116] From Theory to Decision Rule: Calibrating the Noisy-Label Crossover for Vision-Language Model Weak Supervision Across Three Medical-Imaging Benchmarks cs.CV | cs.AI | cs.LG

Bruce Changlong Xu, Jose James, Alexander Ryu

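TL;DR: This paper addresses weak supervision with vision-language models (VLMs) in medical imaging, calibrating the performance crossover predicted by classical noisy-label theory into an instance-level decision rule. Experiments with BiomedCLIP-generated weak labels on three medical-imaging benchmarks (PCAM, ISIC, NIH-CXR) confirm the crossover and its effect on downstream AUC, and yield a decision rule operable from only 10-20 gold labels.

Details

Motivation: Classical noisy-label theory predicts that downstream performance under weak supervision is bounded by the labeler's accuracy, implying a crossover: once a gold-trained classifier matches the labeler, weak labels turn from helpful to harmful. The prediction is theoretical, however, and lacks a benchmark calibration for modern foundation-model labelers that would turn it into an instance-level statement.

Result: Using BiomedCLIP-generated weak labels and six downstream architectures (spanning an 11x parameter range) on PCAM, ISIC, and NIH-CXR, the predicted crossover appears at about 100 samples on PCAM, 20-50 on ISIC, and 250-500 on NIH-CXR; weak labels above the crossover reduce AUC by up to 0.10. The crossover location is architecture-invariant for four of five pretrained architectures.

Insight: The novelty is calibrating the theoretical crossover into an operable decision rule (compare gold-only AUC with VLM accuracy on the user's gold set, using only 10-20 gold labels). The paper also finds that the rate-only formulation of the bound is incomplete and identifies label-space projection as a concrete refinement for future benchmarks to test.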

Abstract: Classical noisy-label theory predicts that downstream performance under weak supervision is bounded above by the labeler’s accuracy, implying a sharp crossover: once a gold-trained classifier matches the labeler, weak labels stop helping and start hurting. The prediction is theoretical; what is missing is a benchmark calibration that turns it into an instance-level statement for modern foundation-model labelers. We provide such a calibration for BiomedCLIP-generated weak labels on three medical-imaging benchmarks (PCAM, ISIC, NIH-CXR) and six downstream architectures spanning an 11x parameter range. The crossover predicted by theory appears at $n_g \approx 100$ on PCAM, 20-50 on ISIC, and 250-500 on NIH-CXR; weak labels above the crossover degrade AUC by up to 0.10. The location is architecture-invariant for four of five pretrained architectures, and a within-family DenseNet sweep (2.5x parameters, identical pretraining) supports the view that the labeler, not the student, is the dominant constraint. The calibration in turn produces a decision rule operable from 10-20 gold labels: compare gold-only AUC to VLM accuracy on the user’s gold set. A structured-vs-random noise sign flip on NIH-CXR shows that the rate-only formulation of the bound is incomplete and identifies a concrete refinement (label-space projection) that future benchmarks can be designed to test.
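One possible reading of the decision rule is sketched below. The function names, the toy scores, and the direction of the comparison (use weak labels only while the gold-only classifier is still below the labeler) are assumptions drawn from the crossover description, not the authors' code; how the gold-only AUC is estimated from so few labels is not stated in the abstract.

```python
import numpy as np

def auc(scores, labels):
    """Pairwise (Mann-Whitney) AUC: P(score_pos > score_neg), ties count half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))

def accuracy(predictions, labels):
    """Fraction of labeler predictions that match the gold labels."""
    return float(np.mean(np.asarray(predictions) == np.asarray(labels)))

def should_use_weak_labels(gold_only_auc, vlm_accuracy):
    """Assumed rule: weak labels help only while the student is below the labeler."""
    return gold_only_auc < vlm_accuracy

# Hypothetical gold set of 10 examples (toy values, not from the paper).
gold_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
student_scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1, 0.8, 0.3, 0.7, 0.35]
vlm_predictions = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

gold_only_auc = auc(student_scores, gold_labels)
vlm_acc = accuracy(vlm_predictions, gold_labels)
decision = should_use_weak_labels(gold_only_auc, vlm_acc)
```

In this toy case the gold-only classifier already beats the labeler, so the rule says to skip the weak labels. With only 10-20 gold examples both estimates are noisy, so a margin or confidence interval would be a sensible addition.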


[117] DUEL: Adversarial Self-Play for Multimodal Reasoning cs.CV | cs.CL

Lin Qiu, Hanqing Zeng, Yao Liu, Bingjun Sun, Guangdeng Liao

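TL;DR: The paper proposes DUEL, a self-evolving post-training framework that improves the multimodal reasoning of vision-language models through adversarial self-play. Two policies are initialized from the same pretrained model: a Challenger generates an image-grounded true claim together with a minimally perturbed hard negative, and a Solver verifies both claims against the image. This encourages fine-grained visual discrimination without human annotations, external reward models, or image editing tools.

Details

Motivation: Existing reinforcement-learning-based optimization of vision-language models depends on costly high-quality annotations and is hard to scale, while unsupervised alternatives may drift toward biased solutions because of weak visual grounding and the lack of reliable verification signals.

Result: Experiments show that DUEL consistently improves visual reasoning and robust discrimination without additional human annotations, external reward models, or image editing tools. The abstract does not name specific benchmarks or give quantitative results.

Insight: The novelties are: 1) deriving a Challenger and a Solver from the same pretrained model via adversarial self-play, enabling self-supervised optimization of fine-grained visual reasoning; 2) a length-normalized log-likelihood reward that preserves informative optimization signals under sparse feedback and improves learning stability.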

Abstract: Reinforcement learning (RL) has emerged as an effective paradigm for improving the reasoning capability of vision-language models (VLMs). However, RL-based optimization typically depends on costly high-quality annotations that are difficult to scale. Existing unsupervised alternatives may drift toward biased solutions due to weak visual grounding and the lack of reliable verification signals. We propose a self-evolving post-training framework, DUEL, where supervision emerges from adversarial interactions between two policies initialized from the same pretrained VLM. A Challenger generates an image-grounded true claim together with a minimally perturbed hard-negative counterpart, while a Solver verifies both claims against the image, encouraging fine-grained visual discrimination under near-neighbor semantics. To stabilize optimization, we introduce a length-normalized log-likelihood reward that preserves informative optimization signals beyond binary outcome supervision and improves learning stability under sparse feedback. Experiments show that DUEL consistently improves visual reasoning and robust discrimination without additional human annotations, external reward models, or image editing tools.
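A length-normalized log-likelihood can be written as the mean per-token log-probability. The sketch below is an assumed form for illustration; the abstract does not give the exact reward, and the function name and toy token probabilities are hypothetical.

```python
import math

def length_normalized_loglik(token_logprobs):
    """Mean per-token log-probability of a sequence (assumed form of the reward)."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return sum(token_logprobs) / len(token_logprobs)

# Two hypothetical claims whose tokens are all assigned probability 0.5.
short_claim = [math.log(0.5)] * 4
long_claim = [math.log(0.5)] * 12

raw_short, raw_long = sum(short_claim), sum(long_claim)
norm_short = length_normalized_loglik(short_claim)
norm_long = length_normalized_loglik(long_claim)
```

The raw sums penalize the longer claim purely for its length, while the normalized values are equal. Unlike a binary correct/incorrect outcome, this quantity also varies continuously with the model's confidence.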


[118] Parameter-Efficient VLMs for Gastrointestinal Endoscopy: Medical Image Generation and Clinical Visual Question Answering cs.CV | cs.AI

Ojonugwa Oluwafemi Ejiga Peter, Frederick Akor Ejiga, Fahmi Khalifa, Md Mahmudur Rahman

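TL;DR: This paper presents a dual-pipeline parameter-efficient fine-tuning (PEFT) model targeting two core problems for gastrointestinal (GI) endoscopy AI systems: clinical visual question answering (VQA) and privacy-preserving synthetic data generation. It uses the Florence-2 vision-language model for VQA and LoRA-based Stable Diffusion 2.1 to generate high-quality endoscopy images, addressing data scarcity and privacy constraints while substantially reducing computational cost.

Details

Motivation: GI endoscopy AI systems are limited by a shortage of annotated data, strict privacy policies, and the high computational cost of conventional fine-tuning, which together impede reliable and scalable use of sophisticated AI models in clinical practice.

Result: On the Kvasir-VQA dataset, the Florence-2 VQA model achieves ROUGE-1 of 0.92 and ROUGE-L of 0.91, with BLEU improving from 0.08 to 0.24; fine-tuning on private datasets consistently outperforms fine-tuning on public datasets. The rank-4 LoRA synthesis model performs best, with a fidelity score of 0.290, an agreement score of 0.730, and a Frechet BiomedCLIP Distance (FBD) of 1450, while reducing computational cost by almost 90%. Compared with FLUX, MSDM, and Kandinsky 2.2, the model achieves better FBD and strong semantic alignment.

Insight: The novelty is a unified PEFT framework that efficiently handles both clinical VQA and privacy-sensitive medical image generation. At its core, parameter-efficient techniques such as LoRA significantly cut computational overhead while maintaining model performance, and the generated synthetic images can augment training data, improving the practicality and scalability of AI in medicine.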

Abstract: The major limitations of gastrointestinal (GI) endoscopy AI systems arise from a shortage of annotated data, strict privacy policies, and significant bottlenecks in conventional model fine-tuning. Such limitations impede the successful application of sophisticated AI models in clinical practice, particularly affecting the reliability and scalability of diagnosis. In this paper, we present a dual-pipeline PEFT model that addresses two fundamental problems: medical Visual Question Answering (VQA) and the generation of privacy-preserving synthetic data. For clinical VQA, we adopt the Florence-2 vision-language model. Leveraging PEFT enhances model interpretability while substantially reducing the computational cost of training. Simultaneously, we employ Low-Rank Adaptation (LoRA) with Stable Diffusion 2.1 to generate high-quality GI images that enhance training databases without violating patient privacy. This research utilized the Kvasir-VQA dataset. Our Florence-2 VQA model achieved ROUGE-1 of 0.92, ROUGE-L of 0.91, and BLEU score improvements from 0.08 to 0.24. Fine-tuning on private datasets consistently showed better results than fine-tuning on public datasets. The rank-4 LoRA synthesis achieved optimal performance with a fidelity score of 0.290, an agreement score of 0.730, and a Frechet BiomedCLIP Distance (FBD) of 1450, reducing computational costs by almost 90 percent. This framework improves the clinical potential of AI in GI endoscopy. Compared to FLUX, MSDM, and Kandinsky 2.2, our model demonstrates superior FBD and strong semantic alignment. While other models lead in Fidelity or Agreement, our lower FBD indicates better image-text coherence. These results establish our approach as a robust solution for enhancing VQA and synthetic data generation in clinical AI.
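For readers unfamiliar with LoRA, the low-rank update can be sketched generically in NumPy. This is the standard formulation (frozen weight plus a scaled product of two small matrices), not the authors' Stable Diffusion pipeline; the layer size of 768 and the class name are illustrative assumptions, and the parameter ratio below does not correspond to the "almost 90 percent" cost figure in the abstract.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer plus a trainable rank-r update: W + (alpha / r) * B @ A."""

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight  # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(rank, in_dim))  # trainable
        self.B = np.zeros((out_dim, rank))  # trainable, zero-initialized
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def trainable_params(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(1)
frozen = rng.normal(size=(768, 768))  # hypothetical layer size
layer = LoRALinear(frozen, rank=4)
x = rng.normal(size=(2, 768))

same_at_init = np.allclose(layer(x), x @ frozen.T)
ratio = layer.trainable_params() / frozen.size
```

Because B starts at zero, the adapted layer reproduces the pretrained layer exactly before training. At rank 4 the trainable parameters for this layer are about 1% of the full weight matrix.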


[119] HCL-FF: Hierarchical and Contrastive Learning for Forward-Forward Algorithm cs.CV

Jie-En Yao, Hong-En Chen, C.-C. Jay Kuo

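TL;DR: This paper proposes HCL-FF, a hierarchical and contrastive learning framework that improves the Forward-Forward (FF) algorithm. By introducing a coarse-to-fine hierarchical learning strategy and a supervised contrastive objective, it addresses the original algorithm's lack of cross-layer coordination and its semantically ambiguous features. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show new state-of-the-art performance among FF-based methods.

Details

Motivation: Backpropagation is biologically implausible, computationally demanding, and hard to interpret. The Forward-Forward algorithm is an alternative, but its purely local optimization and the decoupling of goodness from features lead to missing cross-layer coordination and semantic ambiguity, which this work aims to fix.

Result: On CIFAR-10, CIFAR-100, and Tiny-ImageNet, HCL-FF achieves notable accuracy gains of +5.46%, +17.00%, and +12.51%, respectively, setting a new state of the art among FF-based methods.

Insight: The claimed novelties are a hierarchical learning strategy that guides representations from low-level cues to high-level semantics, and a supervised contrastive objective that enforces class-discriminative alignment after goodness decoupling. Viewed objectively, folding hierarchical representation learning and contrastive learning into a local training framework offers a new direction for improving biologically inspired algorithms.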

Abstract: Deep neural networks trained with backpropagation have achieved outstanding performance in vision tasks but remain biologically implausible, computationally demanding, and difficult to interpret. The Forward-Forward (FF) algorithm offers a promising alternative by training each layer independently through local goodness objectives. However, its purely local optimization lacks hierarchical coordination across layers, and the decoupling of goodness from features leaves the representations unconstrained and semantically ambiguous. We propose a Hierarchical and Contrastive Learning FF framework (HCL-FF) to address these limitations. HCL-FF introduces (1) a coarse-to-fine hierarchical learning strategy that guides representations from low-level cues to high-level semantics, and (2) a supervised contrastive objective that enforces class-discriminative alignment after goodness decoupling. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that HCL-FF achieves new state-of-the-art performance among FF-based methods, with notable accuracy gains of +5.46%, +17.00%, and +12.51%, respectively.
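As background, the local goodness objective of the original Forward-Forward algorithm can be sketched as follows. This shows only the baseline per-layer loss (sum of squared activations compared against a threshold), not the hierarchical or contrastive terms that HCL-FF adds; the threshold and activations are toy values chosen for illustration.

```python
import numpy as np

def goodness(activations):
    """Per-sample goodness: sum of squared activations of one layer."""
    return np.sum(np.square(activations), axis=-1)

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return np.logaddexp(0.0, z)

def ff_layer_loss(pos_act, neg_act, theta):
    """Push goodness above theta for positive data and below theta for negative data."""
    loss_pos = softplus(theta - goodness(pos_act))
    loss_neg = softplus(goodness(neg_act) - theta)
    return float(np.mean(loss_pos) + np.mean(loss_neg))

theta = 4.0  # hypothetical threshold
strong = np.array([[1.0, 2.0, 2.0]])  # goodness 9
weak = np.array([[0.5, 0.5, 0.0]])    # goodness 0.5

good_layer = ff_layer_loss(pos_act=strong, neg_act=weak, theta=theta)
bad_layer = ff_layer_loss(pos_act=weak, neg_act=strong, theta=theta)
```

Each layer minimizes this loss using only its own activations, which is the source of the missing cross-layer coordination the abstract describes. Goodness is a scalar summary, so the features themselves are left unconstrained.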


[120] Divide-and-Conquer Inference for Large-Scale Visual Recognition with Multimodal Large Language Models cs.CV | cs.AI

Zhipeng Ye, Jiaqi Huang, Feng Jiang, Qiufeng Wang, Yikang Duan

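TL;DR: Multimodal large language models (MLLMs) degrade sharply on large-scale image classification. This paper proposes Divide-and-Conquer Inference (DCI), a new test-time scaling strategy that recursively decomposes a complex global classification task into multiple local subproblems and uses dynamic pruning to compress the search space. This mitigates attention dilution in long-sequence inference and improves both accuracy and inference efficiency.

Details

Motivation: The work targets the performance collapse that MLLMs exhibit in large-scale image classification as the label space expands, i.e., performance degradation in long-sequence recognition.

Result: Extensive experiments on benchmarks such as ImageNet-1K and ImageNet-21K show that DCI consistently improves classification accuracy, allowing lightweight open-source models to rival or even surpass frontier closed-source models without additional training or fine-tuning.

Insight: The novelty is an information-theoretic account showing that the collapse stems from a fundamental conflict between rising information entropy and attention dilution and decay within attention mechanisms, together with a model-agnostic, plug-and-play divide-and-conquer inference paradigm whose dynamic localized processing improves the signal-to-noise ratio and computational efficiency of long-sequence inference.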

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities across a wide range of vision-language tasks. However, when applied to large-scale image classification, their performance degrades significantly as the label space expands, a phenomenon we define as Performance Collapse in Long Sequence Recognition. Through an information-theoretic analysis, we reveal that this collapse stems from a fundamental conflict between the escalating information entropy and the prominent attention dilution and decay within attention mechanisms, which impairs the model’s ability to maintain a sufficient signal-to-noise ratio when processing extremely long prompts. To mitigate this, we propose Divide-and-Conquer Inference (DCI), a novel test-time scaling strategy for visual recognition with MLLMs. DCI recursively decomposes complex global classification tasks into multiple simpler, localized subproblems and employs a dynamic pruning mechanism to compress the search space. This method effectively improves the local signal-to-noise ratio and model accuracy by mitigating the inherent weight dilution issues in long-sequence inference. Moreover, while traditional self-attention incurs a prohibitive quadratic computational complexity, DCI achieves more favorable scaling behavior and substantially accelerates inference in large-scale classification scenarios. Extensive experiments on benchmarks such as ImageNet-1K and ImageNet-21K demonstrate that DCI consistently improves classification accuracy. This enables lightweight open-source models to rival or even surpass frontier closed-source giants without any additional training or fine-tuning. As a model-agnostic, plug-and-play paradigm, DCI offers an efficient approach for scaling the inferential precision of MLLMs in large-scale scenarios.
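The recursive decomposition can be sketched as a tournament over label chunks. The `pick_best` callback is a hypothetical stand-in for one short MLLM query over a small candidate list, and the fixed chunk size and fixed `keep` value are simplifying assumptions; the paper's dynamic pruning mechanism is not specified in the abstract.

```python
def divide_and_conquer(labels, pick_best, chunk_size=8, keep=1):
    """Split the label space into chunks, keep the winners, and repeat."""
    if not 1 <= keep < chunk_size:
        raise ValueError("keep must satisfy 1 <= keep < chunk_size")
    candidates = list(labels)
    while len(candidates) > chunk_size:
        survivors = []
        for start in range(0, len(candidates), chunk_size):
            chunk = candidates[start:start + chunk_size]
            # One short query per chunk instead of one very long prompt.
            survivors.extend(pick_best(chunk, keep))
        candidates = survivors
    return pick_best(candidates, 1)[0]

# Toy scorer standing in for the model: every label has a fixed, distinct score.
toy_scores = {f"class_{i}": (i * 37) % 101 for i in range(100)}

def toy_pick_best(chunk, k):
    return sorted(chunk, key=toy_scores.get, reverse=True)[:k]

prediction = divide_and_conquer(toy_scores.keys(), toy_pick_best)
```

With 100 labels and chunks of 8, no single query ever sees more than 8 candidates. With a noisy real model, setting `keep` above 1 trades extra queries for a lower chance of pruning the correct label early.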


[121] AOEPT: Breaking the Implicit Modality-Reduction Bottleneck in Modality-Missing Prompt Tuning cs.CV

Jian Lang, Rongpei Hong, Ting Zhong, Fan Zhou

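TL;DR: This paper proposes AOEPT to address the Implicit Modality-Reduction bottleneck in multimodal prompt tuning. It introduces Modal-Contextualized Prompts that distill global modality-wise priors from training data and serve as latent information sources for missing modalities, restoring the full reasoning scope of Multimodal Transformers in modality-missing scenarios.

Details

Motivation: Existing prompt-tuning methods for modality-missing scenarios condition prompts only on the observed modalities, which confines the model's reasoning to a modality-reduced subspace and cuts off the latent information sources of the missing modalities.

Result: Experiments across multiple multimodal benchmarks and backbones show strong performance for AOEPT with minimal computational overhead.

Insight: The novelty is the Modal-Contextualized Prompts, which distill global modality-wise priors from training data as latent repositories for missing-modality information and are instantiated into instance-aware prompts conditioned on the remaining modalities, thereby breaking the Implicit Modality-Reduction bottleneck.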

Abstract: Deploying multimodal systems in real-world environments often entails handling modality-missing scenarios, where one or more modalities are unavailable. While recent studies address this challenge for the general Multimodal Transformer (MT) architecture via prompt tuning, we identify a fundamental limitation in these methods: the Implicit Modality-Reduction bottleneck. By conditioning prompts solely on the observed modalities, they inadvertently restrict the reasoning scope of MTs to the modality-reduced subspace, cutting off access to the latent information sources of the missing modalities. To overcome this limitation, we propose AOEPT, which pioneers a novel modal-contextualized prompting fashion. Specifically, we introduce lightweight Modal-Contextualized Prompts (MCPs) that distill global modality-wise priors from training data, serving as latent repositories of the information sources for missing modalities. Conditioned on the remaining modalities, these MCPs are instantiated into instance-aware prompts that selectively augment missing-modality information for each sample, thereby restoring the reasoning scope of MTs beyond the observed-modality-only subspace. Experiments across various multimodal benchmarks and backbones confirm the strong performance of AOEPT, with minimal computational overhead.


[122] CLIP-Guided SAM: Parameter-Efficient Semantic Conditioning for Promptable Segmentation cs.CV

Shayan Jalilian, Abdul Bais

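TL;DR: This paper proposes CLIP-Guided SAM, a parameter-efficient segmentation framework. It injects CLIP-derived text, vision, and similarity features into SAM's image encoder for internal semantic conditioning, enabling semantics-aware mask prediction while preserving SAM's promptable interface. The framework supports a Manual mode (text plus spatial prompts) and a Semi-Automatic mode (text only), and performs well on general-domain benchmarks and downstream tasks in low labeled-data settings.

Details

Motivation: The Segment Anything Model (SAM) is semantically blind and needs external prompts to specify categories, and existing methods typically use a vision-language model as a separate stage that generates spatial prompts.

Result: On general-domain benchmarks and downstream tasks, CLIP-Guided SAM consistently achieves superior or competitive performance compared with SAM+PEFT baselines without semantic conditioning, vision-language + SAM pipelines, SAM 3, and semi-supervised segmentation methods that rely on large amounts of unlabeled data, while remaining parameter-efficient in training and deployment.

Insight: The novelty is injecting CLIP features directly into SAM through lightweight multi-modal semantic adapters for internal semantic conditioning, rather than using them only to generate external prompts. The paper also highlights train-test prompt consistency as an important design principle for robustness.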

Abstract: Promptable foundation models such as the Segment Anything Model (SAM) produce high-quality masks but remain semantically blind, relying on external prompts to specify categories. Existing vision-language approaches address this limitation by using external prompt coupling, where a vision-language model generates spatial prompts for SAM as a separate stage. We propose CLIP-Guided SAM, a parameter-efficient segmentation framework built on internal semantic conditioning. Instead of using semantic signals only to generate prompts, we inject CLIP-derived text, vision, and similarity features directly into SAM’s image encoder through lightweight multi-modal semantic adapters. These adapters condition SAM’s internal feature representations, allowing semantic information to influence mask prediction while preserving SAM’s original promptable interface. Our framework is designed for low labeled-data settings and applies to both general-domain benchmarks and specialized downstream tasks. It supports two operating modes: Manual mode, for interactive segmentation with both text and spatial prompts, and Semi-Automatic text-only mode, for applications that require concept-specific segmentation using only textual input. We show that robustness depends on aligning training with the type of prompts used at inference, making train-test prompt consistency an important design principle. Through extensive experiments and ablations, we evaluate our method against SAM+PEFT baselines without semantic conditioning, vision-language + SAM pipelines, SAM 3, and strong semi-supervised segmentation methods that rely on large amounts of unlabeled data. Across these settings, CLIP-Guided SAM consistently achieves superior or competitive performance while remaining parameter-efficient in both training and deployment.


[123] QuoVLA: Quotient Space for Vision-Language-Action Models cs.CV

Xuan Wang, Yinan Wu, Haoran Duan, Jungong Han

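TL;DR: This paper proposes QuoVLA, a quotient-space framework for Vision-Language-Action (VLA) models. It rests on a new theoretical view that the latents of pretrained Vision-Language Models (VLMs) already contain sufficient information for control but are redundant. QuoVLA compresses these latents with a quantization module and a dual-branch design, removing prompt-level redundancy while preserving action-relevant information, which improves the generalization of VLA models.

Details

Motivation: Existing VLA models typically take an action-insufficiency view, assuming that pretrained VLM latents lack directly usable control information or should be shielded from action-learning signals. This paper argues against that view with its Quotient Theory for VLA: the latents are action-sufficient but overcomplete, i.e., they already contain the needed control information yet distinguish prompt-level variations that induce the same optimal action behavior.

Result: Extensive experiments across multiple benchmarks show that QuoVLA achieves strong performance, with particularly notable gains in generalization under visual, linguistic, and environmental distribution shifts.

Insight: The core novelty is the action-sufficiency view and the quotient-space framework QuoVLA built on it, which compresses redundancy and preserves key information through quantization, a dual-branch structure, and relative temporal-complexity regularization. This suggests a new and more efficient paradigm for robot control with pretrained VLMs, emphasizing extraction of information already present in existing representations over learning from scratch.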

Abstract: Vision-Language-Action (VLA) models commonly adapt pretrained Vision-Language Models (VLMs) to robot control by mapping visual observations and language instructions to continuous actions. Existing approaches typically take an action-insufficiency view, assuming that pretrained VLM latents either lack directly usable action information or should be shielded from action-learning signals. Against this view, our Quotient Theory for VLA shows that pretrained VLM latents are not action-insufficient but action-sufficient: they already contain the information needed for control, yet remain overcomplete by distinguishing prompt-level variations that induce the same optimal action behavior. To operationalize this theory, we propose QuoVLA, a quotient-space framework for VLA that compresses pretrained VLM latents into action-sufficient representations. Specifically, QuoVLA instantiates this principle with a quantization module and a dual-branch design with relative temporal-complexity regularization, preserving action-relevant information while removing prompt-level redundancy. Extensive experiments across multiple benchmarks demonstrate that QuoVLA achieves strong performance, with particularly notable improvements in generalization under visual, linguistic, and environmental distribution shifts. Our code will be made publicly available.


[124] X-Foresight: A Joint Vision-Action Causal Forecasting Network via Predictive World Modeling cs.CV

Baolu Li, Jingyu Qian, Rui Guo, Yilun Chen, Hanpeng Liu

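TL;DR: This paper proposes X-Foresight, a predictive world model integrated into the Vision-Language-Action (VLA) architecture. It jointly learns world modeling and real-time action control through a long-horizon chunk-wise auto-regressive strategy, addressing low-entropy redundancy in video prediction and the difficulty of long-horizon causal modeling.

Details

Motivation: The goal is for VLA models to internalize physical dynamics and long-term causality in support of safe, generalizable planning. Naive next-frame prediction, however, degenerates into trivial extrapolation because video tokens are low-entropy and redundant, and it faces a dilemma between dense prediction and efficient long-horizon causal modeling.

Result: Comprehensive experiments show that X-Foresight significantly outperforms VLA baselines in planning performance while maintaining strong generative fidelity, establishing a robust paradigm for world-knowledge-driven autonomous systems.

Insight: The novelties include a long-horizon chunk-wise auto-regressive strategy that balances instantaneous dynamics and long-term causality, a curriculum learning schedule that stabilizes training, and temporal importance sampling based on ego-motion and behavioral signals that concentrates supervision on safety-critical chunks. Viewed objectively, integrating world modeling directly into the VLA architecture for joint learning is a notable strength.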

Abstract: Physical world knowledge resides mainly in videos. Equipping Vision-Language-Action (VLA) models with such knowledge is fundamental for safe and generalizable planning. Predictive world modeling enables VLA to internalize physical dynamics and long-term causality by predicting future video from past observations. However, naive next-frame prediction faces two challenges: 1) unlike semantically distinct text tokens, video tokens are low-entropy and redundant, causing prediction to degenerate into trivial extrapolation; 2) world modeling poses a temporal dilemma: dense prediction captures instantaneous dynamics, but cannot efficiently model long-horizon causality. To learn world knowledge effectively, we introduce X-Foresight, a predictive world model integrated directly into the VLA architecture to jointly learn world modeling and real-time action control. At its core lies a long-horizon chunk-wise auto-regressive strategy that addresses both challenges: by predicting semantically distant chunks rather than adjacent frames, it escapes trivial extrapolation, while preserving dense intra-chunk frames for instantaneous dynamics and sparse inter-chunk transitions for long-term causality. A curriculum learning schedule progressively extends prediction horizons and stabilizes long-horizon training. To capture long-term causality effectively, we present temporal importance sampling, which concentrates supervision on safety-critical chunks identified by ego-motion and behavioral signals. We further delegate photorealistic synthesis to a diffusion-based multi-view renderer, which improves the realism of the rendered appearance. Comprehensive experiments demonstrate that X-Foresight significantly outperforms VLA baselines in planning performance while maintaining strong generative fidelity, establishing a robust paradigm for world-knowledge-driven autonomous systems.


[125] BED-SAM2: Boundary-Enhanced-Depth SAM2 via Monocular Geometric Priors cs.CV

Tyler Rust, Dara McNally, Kyle O’Donnell, Colin Kelly, Chandra Kambhamettu

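TL;DR: Building on the SAM2 vision foundation model, this study proposes BED-SAM2. It modifies SAM2's Hiera encoder architecture to directly encode monocular depth information from RGB images, using geometric cues to enhance object boundary delineation and the extraction of camouflaged object shapes. The model reaches competitive state-of-the-art performance on multiple salient and camouflaged object detection tasks with as few as five training epochs.

Details

Motivation: To address blurred object boundaries and hard-to-segment camouflaged objects in downstream segmentation, the study introduces monocular depth as a geometric prior to strengthen SAM2's perception of object boundaries and shapes.

Result: BED-SAM2 shows competitive state-of-the-art performance on multiple salient and camouflaged object detection tasks with as few as five training epochs.

Insight: The main novelty is modifying SAM2's encoder architecture so that it directly encodes depth information from RGB images, providing useful geometric cues for segmentation. This offers an efficient way to integrate geometric priors into vision foundation models to improve boundary awareness.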

Abstract: Building upon the SAM2 vision foundation model for downstream segmentation, this study introduces Boundary Enhanced Depth (BED)-SAM2. The SAM2 Hiera encoder architecture is modified to directly encode monocular depth information from RGB images, thereby providing geometric cues that enhance object boundary delineation and facilitate the extraction of camouflaged object shapes. BED-SAM2 demonstrates competitive state-of-the-art performance across multiple salient and camouflaged object detection tasks with as few as five training epochs.


[126] X-Edit: Exact, Explicit, and Explainable Null-Space Editing for Medical Vision Transformers cs.CV

Yuanye Liu, Siyuan Zhou, Ke Zhang, Lei Li, Wei Chen

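TL;DR: This paper proposes X-Edit, a framework for exact, explicit, and explainable editing of pre-trained Vision Transformer (ViT) models to correct erroneous predictions in medical image classification. It localizes the influential layers via causal tracing, builds an orthogonal null-space projection matrix from an anchor set, and constrains the parameter update strictly within that null space, so that targeted errors are corrected with mathematical guarantees that established diagnostic representations are not perturbed.

Details

Motivation: In dynamic clinical scenarios, pre-trained ViT models inevitably make erroneous predictions, and conventional fine-tuning causes catastrophic forgetting that severely degrades existing diagnostic capabilities and compromises clinical safety. An active, controllable, reliable, and theoretically interpretable intervention mechanism is therefore needed.

Result: Extensive evaluations on six medical imaging benchmarks show that X-Edit comprehensively suppresses catastrophic forgetting while achieving superior edit success rates.

Insight: The novelty is moving the editing process from iterative gradient-based optimization to a theoretically grounded closed-form solution. Null-space projection yields exact parameter updates with mathematical guarantees of specificity (only the targeted errors are corrected) and stability (prior knowledge is not forgotten), and causal tracing provides a path to interpretability.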

Abstract: Pre-trained Vision Transformers (ViTs) are increasingly deployed for medical image classification. However, correcting their inevitable failure cases in dynamic clinical scenarios poses a critical challenge. Conventional fine-tuning approaches inherently suffer from catastrophic forgetting, severely degrading previously acquired diagnostic capabilities. Such instability fundamentally compromises clinical safety. Addressing this vulnerability requires an active, controllable, and reliable intervention mechanism that is both theoretically grounded and inherently interpretable. To this end, we propose X-Edit (eXact, eXplicit, and eXplainable Editing), an efficient null-space model editing framework. X-Edit transitions the editing process from iterative gradient-based optimization to a theoretically grounded, closed-form solution. Specifically, we first use causal tracing to explicitly localize the influential layers governing the erroneous prediction. Subsequently, we construct an orthogonal null-space projection matrix from a curated anchor set. By geometrically constraining the exact parameter update strictly within this null space, we provide mathematical guarantees that the intervention rectifies targeted errors without perturbing established diagnostic representations. Extensive evaluations on six medical imaging benchmarks demonstrate that X-Edit comprehensively suppresses catastrophic forgetting while achieving superior edit success rates. Our code is available at https://github.com/HenryLau7/X-Edit.
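The null-space constraint can be illustrated on a single linear layer. The sketch below is a generic rank-one construction under stated assumptions (anchor inputs stored as matrix columns, one key/target pair to fix, random toy data); the abstract does not give X-Edit's actual formulas, so the function names and the update rule are illustrative only.

```python
import numpy as np

def null_space_projector(anchor_keys):
    """anchor_keys has shape (d, m). Returns P with P @ k = 0 for every anchor column k."""
    d = anchor_keys.shape[0]
    return np.eye(d) - anchor_keys @ np.linalg.pinv(anchor_keys)

def closed_form_edit(weight, key, target, projector):
    """Rank-one update that maps key to target and leaves anchor outputs unchanged."""
    residual = target - weight @ key
    direction = projector @ key  # part of the key outside the anchor span
    return weight + np.outer(residual, direction) / float(key @ direction)

rng = np.random.default_rng(0)
d_in, d_out, n_anchors = 32, 10, 8
weight = rng.normal(size=(d_out, d_in))
anchors = rng.normal(size=(d_in, n_anchors))  # inputs whose outputs must be preserved
key = rng.normal(size=d_in)                   # input that is currently handled wrongly
target = rng.normal(size=d_out)               # desired output for that input

projector = null_space_projector(anchors)
edited = closed_form_edit(weight, key, target, projector)
```

The update is computed in one step with no gradient iterations, and the preservation of anchor outputs follows from the projector annihilating every anchor. The construction requires the key to have a component outside the anchor span; otherwise the denominator is zero and no such edit exists.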


[127] Interpretability Transfer from Language to Vision via Sparse Autoencoders cs.CV

Alexey Kravets, Da Li, Chuan Li, Da Chen, Vinay P. Namboodiri

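TL;DR: This paper proposes VISTA, a framework that transfers interpretability from language to vision by constraining a visual projector to map into the sparse autoencoder (SAE) space of a pretrained language model, without training dedicated vision SAEs. In a LLaVA-style vision-language model, regularizing the projector with the SAE reconstruction loss substantially raises the matching rate between image semantic elements and textual concepts, and the strong localization of the DINOv2 encoder is used to validate the alignment through fine-grained concept interventions.

Details

Motivation: Interpretability methods based on language-model sparse autoencoders (SAEs) have not transferred effectively to the visual domain, mainly because labeling visual concepts is difficult and ambiguous.

Result: The matching rate (how accurately the most activating textual concepts in the SAE space correspond to semantic elements in the image) increases threefold; object removal and object replacement improve by 35% and 47%, respectively, over vision-only baselines; and the results are validated across multiple LLM architectures.

Insight: The novelty is cross-modal alignment that projects visual features into a pretrained textual SAE space to transfer interpretability. Viewed objectively, the method reuses the language model's existing labeled concept space, avoids training separate vision SAEs, and reveals the advantage of the DINOv2 encoder in spatial localization.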

Abstract: Recent advances in language model interpretability using sparse autoencoders (SAEs) have yet to effectively translate to the visual domain, mainly due to the difficulty and ambiguity of labeling visual concepts. In this paper, we introduce Visual Interpretability via SAE Transfer Alignment (VISTA), a framework that transfers interpretability from language to vision in a LLaVA-style vision-language model by constraining a visual projector to map visual tokens into an LLM’s pre-existing, labeled textual SAE space. This approach enables visual interpretability without training dedicated vision SAEs. By regularizing the projector using the LLM’s SAE reconstruction loss, VISTA achieves a threefold increase in the matching rate, which measures how accurately the most activating textual concepts in the SAE space correspond to semantic elements in the image. Using this framework, we further analyze spatial localization properties of different vision encoders and show that DINOv2 features have stronger localization abilities than other encoders. Leveraging this precision, we validate VISTA’s cross-modal alignment through fine-grained, localized concept interventions, where specific objects are removed or replaced in the model’s perception while preserving the surrounding scene. This results in improvements of 35% in object removal and 47% in object replacement tasks over vision-only baselines, providing causal evidence that visual tokens inhabit the text SAE manifold. These contributions are validated across multiple LLM architectures.
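The regularizer can be pictured as the reconstruction error of projected visual tokens under a frozen SAE. The sketch assumes the common ReLU-encoder, linear-decoder SAE form and uses a deliberately trivial identity SAE so the behavior is easy to check by hand; the real SAE, its dimensions, and how VISTA weights this term are not given in the abstract.

```python
import numpy as np

def sae_reconstruction_loss(tokens, w_enc, b_enc, w_dec, b_dec):
    """Mean squared reconstruction error of tokens under a frozen SAE."""
    codes = np.maximum(tokens @ w_enc + b_enc, 0.0)  # sparse, non-negative codes
    recon = codes @ w_dec + b_dec
    return float(np.mean(np.sum((tokens - recon) ** 2, axis=-1)))

# Toy frozen SAE (assumption): identity encoder and decoder, so it can only
# reconstruct tokens with non-negative coordinates.
dim = 4
w_enc, w_dec = np.eye(dim), np.eye(dim)
b_enc, b_dec = np.zeros(dim), np.zeros(dim)

on_manifold = np.array([[1.0, 0.0, 2.0, 0.5]])    # representable by the toy SAE
off_manifold = np.array([[-1.0, 0.0, 2.0, 0.5]])  # has a component the SAE cannot encode

loss_on = sae_reconstruction_loss(on_manifold, w_enc, b_enc, w_dec, b_dec)
loss_off = sae_reconstruction_loss(off_manifold, w_enc, b_enc, w_dec, b_dec)
```

Adding such a term to the projector's training objective penalizes visual tokens that the text SAE cannot reconstruct, which pushes them toward regions the SAE's labeled features already describe.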


[128] MambaDSF: Multi-Scale SSM with Dilated Feature Fusion for Sonar Small Target Detection cs.CV

Hui Lin, Jiayi Li, Jing Wang, Shenghui Rong

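TL;DR: This paper proposes MambaDSF, a hybrid framework for small-target detection in sonar images. By combining the Mamba state space model with dilated feature fusion, the method captures local echo cues and global acoustic context at linear computational complexity, and it uses new loss functions to stabilize small-target training.

Details

Motivation: Sonar imaging is the primary modality for underwater target detection, but small targets are hard to detect because of insufficient pixel coverage, low acoustic contrast, and scale ambiguity across imaging ranges. Existing CNN-based detectors cannot effectively suppress noise-induced false alarms, while Transformer-based methods are computationally too expensive. Existing Mamba-based vision models lack multi-scale semantic alignment across pyramid levels, multi-receptive-field fusion, and small-target-aware training supervision.

Result: On the UATD forward-looking sonar benchmark, MambaDSF reaches 91.5% mAP50 with 28.7 million parameters, surpassing all compared detectors. On the small-target subset the gain is +2.2 percentage points, and cross-domain evaluation on FLS and MD-FLS confirms that the architecture generalizes.

Insight: The contributions are: 1) a Mamba Enhanced Feature Pyramid (MambaEFP) backbone that jointly captures local and global features at linear complexity; 2) a Dilate Fusion Mamba (DFMamba) encoder that enforces multi-scale feature alignment across pyramid levels; 3) Scale-Adaptive Weighted IoU (SA-WIoU) and Cross-Scale Coherence (CSC) losses that stabilize small-target training. Together these address the key challenges of sonar small-target detection.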

Abstract: Sonar imaging is the primary modality for underwater target detection, yet small targets remain difficult to detect due to insufficient pixel coverage, low acoustic contrast, and scale ambiguity across imaging ranges. CNN-based detectors extract local features efficiently but cannot suppress noise-induced false alarms without global acoustic context. Transformer-based methods capture long-range dependencies at quadratic computational cost. Existing Mamba-based vision models offer efficient linear-cost scanning but lack multi-scale semantic alignment across pyramid levels, multi-receptive-field fusion, and small-target-aware training supervision needed for reliable sonar detection. This letter proposes Mamba Dilated-Scale Fusion (MambaDSF), a hybrid framework addressing these limitations through three contributions: a Mamba Enhanced Feature Pyramid (MambaEFP) backbone that jointly captures local echo cues and global acoustic context at linear complexity; a Dilate Fusion Mamba (DFMamba) encoder that enforces multi-scale feature alignment across pyramid levels; and Scale-Adaptive Weighted IoU (SA-WIoU) and Cross-Scale Coherence (CSC) losses that stabilize small-target training. MambaDSF achieves 91.5% mAP50 on the UATD forward-looking sonar benchmark with 28.7 million parameters, surpassing all compared detectors. On a small-target subset the gain reaches +2.2 percentage points, and cross-domain evaluation on FLS and MD-FLS confirms the generalization of the proposed architecture. The code is publicly available at https://github.com/IDontKnowAAA/MambaDSF.


[129] Tempered Self-Similarity Alignment for Physically Plausible Video Generation cs.CV

Manjin Kim, Suha Kwak, Minsu Cho

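TL;DR: This paper proposes Tempered Self-similarity Alignment (TSA), which transfers the relational knowledge encoded in spatio-temporal self-similarity (STSS) from visual foundation models into video generative models, targeting the appearance drift, implausible motion, and temporal inconsistencies that current models show when generating physically realistic video. The method converts STSS into probabilistic correspondence distributions and trains the generative model to align these distributions on dynamically changing regions.

Details

Motivation: Current video generative models fall short on physical realism, frequently producing appearance drift, implausible motion, and temporal inconsistencies. The paper aims to improve physical plausibility by transferring the knowledge of real-world dynamics captured by visual foundation models.

Result: Evaluation on the VideoPhy and VideoPhy2 benchmarks shows that the method substantially improves the physical plausibility of generated videos across diverse interaction scenarios.

Insight: The novelty is the use of STSS as a carrier of relational knowledge, together with a TSA loss that aligns probabilistic correspondence distributions, so that a foundation model's understanding of real-world dynamics is transferred to a generative model.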

Abstract: Despite remarkable advances in video generative models, they still struggle to generate physically realistic videos, frequently exhibiting appearance drift, implausible motion, and temporal inconsistencies. In this work, we address this limitation by transferring relational knowledge encoded in spatio-temporal self-similarity (STSS) from visual foundation models into video generative models. STSS represents pairwise similarities among features across space and time, revealing the relational structure of how objects interact with other entities throughout a video, effectively capturing real-world dynamics, including object motion and semantic transformations. To transfer this relational knowledge, we propose Tempered Self-similarity Alignment (TSA) loss, which transforms STSS into probabilistic correspondence distributions and trains the video generative model to align its correspondence distributions with those of the visual foundation model on dynamically changing regions. Evaluated on VideoPhy and VideoPhy2 benchmarks, our method demonstrates substantial improvements in physical plausibility across diverse interaction scenarios, validating the effectiveness of transferring relational knowledge for physically realistic video generation.
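
The abstract describes turning STSS into probabilistic correspondence distributions and aligning student and teacher distributions on dynamic regions, but gives no formulas. The following is a minimal NumPy sketch of one way this could look. The cosine similarity, the temperature-scaled softmax, the KL objective, and the names `stss_distribution`, `tsa_loss`, and `dynamic_mask` are all assumptions made for illustration, not the authors' definitions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def stss_distribution(features, temperature):
    """features: (N, D) array of N space-time tokens.
    Returns an (N, N) row-stochastic matrix: for each token, a distribution
    over all tokens it may correspond to (assumed form)."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T  # cosine self-similarity across space and time
    return softmax(sim / temperature, axis=1)

def tsa_loss(teacher_feats, student_feats, dynamic_mask, temperature=0.1):
    """Mean KL(teacher || student) over tokens flagged as dynamic (hypothetical loss)."""
    p = stss_distribution(teacher_feats, temperature)
    q = stss_distribution(student_feats, temperature)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=1)
    w = dynamic_mask.astype(float)
    return float((kl * w).sum() / max(w.sum(), 1.0))
```

A lower temperature sharpens each row toward a near one-hot correspondence, while a higher one spreads mass over many tokens; which regime the paper uses is not stated in the abstract.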


[130] ConFi-GS: Confidence-Guided High-Frequency Injection for 3D Gaussian Splatting Super-Resolution cs.CV

Jiaxiang Li, Zongtan Zhou, Zhen Tan, Yadong Liu, Dewen Hu

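TL;DR: This paper proposes the ConFi-GS framework to address the blurred textures, weak boundaries, and view inconsistency that 3D Gaussian Splatting (3DGS) suffers from when reconstructing high-quality 3D scenes from low-resolution multi-view images. The framework combines a geometry-guided detail-demand prior with a frequency-aware reliability map to produce a detail-injection map that guides where super-resolved details are introduced during optimization, and it optimizes jointly with spatially selective supervision, coarse-to-fine frequency regularization, and reliability-aware Gaussian densification.

Details

Motivation: Existing methods for low-resolution 3DGS reconstruction either apply super-resolution guidance uniformly or localize enhancement regions mainly through geometric sampling. They do not distinguish two fundamentally different questions: where additional detail is needed, and whether the candidate high-frequency content is reliable enough to be incorporated into a multi-view consistent 3D representation.

Result: Experiments on multiple benchmarks show improved reconstruction fidelity and perceptual quality while unstable or view-inconsistent details are suppressed.

Insight: The novelty is a reliability-aware frequency modeling framework that explicitly separates locating detail demand from assessing the reliability of high-frequency content, plus a unified optimization scheme built on the detail-injection map that controls where, when, and how detail is injected, so that high-frequency detail is internalized into the Gaussian representation more reliably.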

Abstract: Reconstructing high-quality 3D scenes from low-resolution multi-view images remains challenging for 3D Gaussian Splatting (3DGS), because insufficient high-frequency observations often lead to blurred textures, weak boundaries, and view-inconsistent details. Existing approaches either apply super-resolution guidance uniformly or localize enhancement regions based mainly on geometric sampling. However, they typically do not distinguish between two fundamentally different questions: where additional detail is needed, and whether the corresponding candidate high-frequency content is reliable enough to be internalized into a multi-view consistent 3D representation. In this paper, we propose a reliability-aware frequency modeling framework for low-resolution 3DGS reconstruction. The framework first estimates a geometry-guided detail-demand prior to locate regions that are likely under-detailed under low-resolution supervision. It then computes a frequency-aware reliability map to determine whether candidate high-frequency details are structurally supported, spectrally unresolved, and cross-view stable. Combining these signals yields a detail-injection map that guides where super-resolved details should be introduced during optimization. Based on this map, we design a unified optimization scheme comprising spatially selective supervision, coarse-to-fine frequency regularization, and reliability-aware Gaussian densification. This scheme controls where reliable details are injected, when high-frequency supervision is activated, and how unresolved yet reliable details are internalized into the Gaussian representation. Experiments on multiple benchmarks show improved fidelity and perceptual quality while suppressing unstable or view-inconsistent details.


[131] Cross-Domain Generalization Limits of Vision Foundation Models in Facial Deepfake Detection cs.CV | cs.AI | cs.LG

Ibrahim Delibasoglu

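TL;DR: This paper studies the cross-domain generalization of modern vision foundation models in facial deepfake detection. It compares three foundational learning paradigms (fully supervised macro-semantic features, pure self-supervised geometric features, and multi-teacher agglomerative representations), evaluates them as out-of-the-box feature extractors on the DF40 benchmark, and exposes the intrinsic trade-offs between pre-training paradigm and parameter scale.

Details

Motivation: Deepfake detectors fail to generalize to unseen generation techniques; traditional networks suffer representation collapse and overfit to the localized artifact fingerprints of the specific generators seen in training. The paper asks whether vision foundation models can serve as general-purpose feature extractors that track forensic anomalies across unseen generative manifolds.

Result: On the DF40 benchmark, with frozen backbones and downstream linear probing, the foundation models retain high discriminative power for entire-face synthesis, but localized face-editing techniques expose fundamental limits of linear-probe evaluation.

Insight: The novelty is a systematic comparison of the generalization limits of different pre-training paradigms in cross-domain deepfake detection. Viewed objectively, foundation-model performance depends on both pre-training strategy and parameter scale, which offers guidance for designing more robust detectors.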

Abstract: The rapid evolution of generative models has enabled the creation of hyper-realistic facial deepfakes, exposing a critical vulnerability in modern digital forensics: the inability of detectors to generalize to unseen manipulation techniques. Traditional networks suffer from representation collapse, overfitting to localized artifact fingerprints of specific training generators. This work investigates whether modern Vision Foundation Models can serve as generalizable, out-of-the-box feature extractors capable of tracking forensic anomalies across entirely unseen generative manifolds. We conduct a systematic cross-domain evaluation comparing three foundational learning paradigms: fully supervised macro-semantic features (RoPE-ViT), pure self-supervised geometric features (DINOv3), and multi-teacher agglomerative representations (NVIDIA C-RADIOv4-H). By deploying frozen backbones subjected to downstream linear probing, we map the performance limitations of these architectures on the challenging DF40 benchmark. Our empirical findings expose the intrinsic trade-offs between pre-training paradigms and parameter scale, showing that while foundation models retain high discriminative capabilities for entire face synthesis, localized face editing techniques expose fundamental boundaries in linear probe evaluation structures. Source code and model weights are available at http://github.com/mribrahim/deepfake


[132] MinerU-Popo: Universal Post-Processing Model for Structured Document Parsing cs.CV | cs.AI | cs.CL

Bangrui Xu, Ziyang Miao, Xuanhe Zhou, Yiming Lin, Zirui Tang

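TL;DR: This paper proposes MinerU-Popo, a lightweight, universal post-processing framework for OCR outputs that converts page-level results from different parsers into coherent document-level structures. It decomposes the problem into four subtasks (text truncation recovery, table truncation recovery, title hierarchy reconstruction, and image-text association), fine-tunes a lightweight model on data from a task-oriented data engine, handles long documents with dynamic chunking and overlap-based synchronization, and assembles the results into a tree-structured document representation.

Details

Motivation: Existing VLM-based OCR models extract page-level elements well but cannot handle cross-page continuity, so structures such as paragraphs and tables are broken at page boundaries, while downstream applications such as RAG need coherent document-level information. A post-processing method is therefore needed that reuses existing OCR outputs to reconstruct document-level logical structure.

Result: Experiments show that MinerU-Popo improves title-hierarchy TEDS by at least 20% on all five tested OCR models, while also improving RAG accuracy and reducing per-query latency.

Insight: The novelty is the decomposition of document structure reconstruction into four manageable subtasks, supported by a task-oriented data generation engine. For long documents, dynamic chunking with overlap-based synchronization preserves global consistency, and the final output is a tree structure enriched with node chunking and summaries for downstream retrieval and analysis.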

Abstract: VLM-based OCR models have become the de facto choice for document parsing, as they can accurately extract page-level elements (e.g., paragraphs within individual pages) together with their bounding boxes and textual content. However, downstream applications such as RAG require coherent document-level information, whereas these models often break cross-page continuity and fail to recover disrupted structures, such as paragraphs and tables truncated by page boundaries. Such relationships are not confined to a single page; instead, they require joint analysis of titles, paragraphs, tables, and images spanning multiple pages. A natural solution is therefore to reuse existing OCR outputs and reconstruct document-level logical structures through post-processing. To this end, we propose MinerU-Popo, a lightweight and universal framework for POst-Processing OCR outputs, which converts page-level results from diverse parsers into coherent document-level structures. MinerU-Popo decomposes the problem into four focused subtasks: text truncation recovery, table truncation recovery, title hierarchy reconstruction, and image-text association. To address these effectively, we build a task-oriented data engine with task-specific input filtering, and use the generated data (30K) to fine-tune a lightweight post-processing model (Qwen3-VL-4B). To support long documents, we introduce dynamic chunking with overlap-based synchronization, which aligns chunk-level outputs from the fine-tuned model and preserves global consistency. Finally, we assemble the aligned outputs into a tree-structured document representation, further enriched with node chunking and summaries for downstream retrieval and analysis. Empirical results show MinerU-Popo improves title-hierarchy TEDS by at least 20% across all five tested OCR models, improves RAG accuracy and reduces per-query latency.
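
The abstract mentions chunking with overlap-based synchronization but does not spell out the mechanism. As a rough illustration only, the sketch below splits a list of document elements into overlapping windows and reconciles per-chunk heading levels by using the shared elements to estimate a level offset. Fixed-size windows (rather than dynamic ones), the offset-by-majority rule, and the names `make_chunks` and `merge_levels` are assumptions, not the paper's procedure.

```python
from collections import Counter

def make_chunks(n_items, size, overlap):
    """Return (start, end) windows covering range(n_items); consecutive
    windows share `overlap` items."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks, start = [], 0
    while True:
        end = min(start + size, n_items)
        chunks.append((start, end))
        if end == n_items:
            return chunks
        start += step

def merge_levels(chunks, chunk_levels):
    """chunk_levels[i][j] is the heading level a model assigned to item
    chunks[i][0] + j, relative to its own chunk. Items in the overlap are
    used to shift each later chunk so it agrees with what is already merged."""
    merged = {}
    for (start, end), levels in zip(chunks, chunk_levels):
        diffs = [merged[k] - levels[k - start] for k in range(start, end) if k in merged]
        # Majority vote over the overlap decides the shift for this chunk.
        offset = Counter(diffs).most_common(1)[0][0] if diffs else 0
        for k in range(start, end):
            merged.setdefault(k, levels[k - start] + offset)
    return [merged[k] for k in sorted(merged)]
```

With a single overlapping item, one disagreement is enough to shift the whole later chunk; a larger overlap makes the vote more robust at the cost of processing more repeated content.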


[133] Universal Boosts, Specific Suppressors: Sparse Autoencoder Steering of Medical Vision-Language Models cs.CV | cs.CL

Farhad Nooralahzadeh, Benjamin Gundersen, Nicolas Deperrois, Hidetoshi Matsuom, Mizuho Nishio

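TL;DR: This paper proposes decoding-time residual steering that requires no weight updates: a sparse autoencoder (SAE) is used to intervene per token on medical vision-language models (VLMs), mitigating hallucinations in generated chest X-ray reports (fabricated, missed, or mislocated findings). The method combines suppress and boost interventions at inference time and improves report quality for several radiology VLMs (RadVLM, LLaVA-Rad, and CheXOne).

Details

Motivation: Medical VLMs frequently hallucinate when generating chest X-ray reports: they fabricate findings that are not present, miss important ones, or locate them incorrectly. The goal is to mitigate this without updating model weights.

Result: On the MIMIC-CXR test split, the method yields relative improvements of +5.4%, +7.2%, and +17.0% in the clinical composite metric for the three radiology VLMs, with statistically significant GREEN gains on all backbones. Zero-shot transfer to IU-Xray also gives a +7.7% relative GREEN improvement.

Insight: The novelty is an SAE-based, decoding-time causal steering framework whose interventions can be tailored to each model. The key finding is that quality-promoting "boost" directions overlap strongly across architectures, whereas hallucination-linked "suppress" directions are model-specific, so transferable steering must handle suppression per backbone rather than share a universal suppress list.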

Abstract: Medical vision-language models (VLMs) often hallucinate findings when generating chest X-ray reports: they fabricate findings that are not present in the image, miss important ones, or locate them incorrectly. We mitigate this without weight updates by decoding-time residual steering on a per-token sparse autoencoder (SAE) basis: Top-$K$ SAEs on late layers, causal steering against clinical errors, then combined suppress/boost intervention at inference time. On the MIMIC-CXR test split, our inference-only method improves the quality of generated reports for three radiology VLMs (RadVLM, LLaVA-Rad, and CheXOne), with relative improvements of +5.4%, +7.2%, and +17.0% in the clinical composite metric, and statistically significant GREEN gains on all backbones. A cross-model feature alignment shows that the quality-promoting (boost) directions overlap strongly across architectures, whereas hallucination-linked (suppress) directions are model-specific. Therefore, transferable steering must treat suppression per-backbone, rather than sharing a universal suppress list. The same recipe transfers zero-shot to IU-Xray (GREEN +7.7% relative) without retraining, confirming that the identified features are properties of the model, not of the training corpus. We release causal feature sets and an interactive feature dashboard: https://cxr-sparse-feature-dashboard.netlify.app/.
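
To make the suppress/boost idea concrete, here is a minimal NumPy sketch of residual steering through a Top-K SAE for a single token. The abstract does not give the update rule, so the specific form used here (subtract a suppressed feature's decoded contribution, add a fixed multiple of a boosted feature's decoder direction) and the names `topk_sae_encode`, `steer`, `alpha`, and `beta` are assumptions for illustration.

```python
import numpy as np

def topk_sae_encode(h, W_enc, b_enc, k):
    """Top-K SAE encoder: ReLU pre-activations with all but the k largest zeroed.
    h: (d,), W_enc: (d, m), b_enc: (m,)."""
    acts = np.maximum(h @ W_enc + b_enc, 0.0)
    if k < acts.size:
        smallest = np.argpartition(acts, -k)[:-k]  # indices outside the top k
        acts[smallest] = 0.0
    return acts

def steer(h, W_enc, b_enc, W_dec, k, suppress, boost, alpha=1.0, beta=1.0):
    """Return a steered residual vector for one token. W_dec: (m, d).
    `suppress` and `boost` are lists of SAE feature indices (hypothetical)."""
    acts = topk_sae_encode(h, W_enc, b_enc, k)
    delta = np.zeros_like(h, dtype=float)
    for j in suppress:
        # Remove (a fraction alpha of) this feature's current contribution.
        delta -= alpha * acts[j] * W_dec[j]
    for j in boost:
        # Push the residual along the feature's decoder direction.
        delta += beta * W_dec[j]
    return h + delta
```

Because the abstract reports that suppress directions are model-specific, a `suppress` list in a setup like this would have to be chosen per backbone, while a `boost` list might be shared after cross-model feature alignment.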


[134] Learning from Semantic Dictionaries: Discriminative Codebook Contrastive Learning for Unified Visual Representation and Generation cs.CV

Imanol G. Estepa, Jesús M Rodríguez-de-Vera, Bhalaji Nagarajan, Petia Radeva

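TL;DR: This paper proposes LEASE, a framework that unifies visual representation learning and generation in a discrete token space through a paired generative-discriminative codebook design. It combines a masked token reconstruction loss with a codebook contrast loss and trains efficiently without data augmentations, teacher models, or online tokenizers.

Details

Motivation: Discriminative and generative vision models each excel in their own domain but remain semantically misaligned, which hinders progress toward unified visual learning. The paper aims to close this gap with a unified latent space that supports both high-quality generation and strong representation learning.

Result: On ImageNet-1K, LEASE achieves state-of-the-art unified performance over prior VQGAN-based methods such as MAGE and Sorcen: linear probing (up to +1.7%), unconditional generation (-1.26 FID and +10.19 IS relative to MAGE), few-shot learning (+0.56% on average over Sorcen), transfer learning (+0.75% on average over MAGE and Sorcen), and robustness (+5.86% and +4.25% on average over MAGE and Sorcen, respectively).

Insight: The novelty is efficient unsupervised unified learning that combines generative and discriminative objectives over a one-time precomputed discrete token space and a paired codebook design. The codebook contrast loss with adaptive centroid weighting aligns encoder features with discriminative semantics, suggesting a route toward general-purpose vision models.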

Abstract: Discriminative and generative vision models excel in their respective domains but remain semantically misaligned, hindering progress toward unified visual learning. We introduce LEASE (LEArning from SEmantic Dictionaries), a self-supervised framework that bridges this gap using a paired generative-discriminative codebook design. LEASE operates entirely in a discrete token space produced through a one-time precomputation step, enabling efficient training without data augmentations, teacher models, or online tokenizers. LEASE integrates two complementary objectives: a masked token reconstruction loss that captures fine-grained generative detail, and a codebook contrast loss that aligns encoder features with discriminative semantics via adaptive centroid weighting. This dual supervision yields a unified latent space that supports both high-quality generation and strong representation learning. On ImageNet-1K, LEASE achieves state-of-the-art unified performance, outperforming prior VQGAN-based methods such as MAGE and Sorcen across linear probing (up to +1.7%), unconditional generation (-1.26 FID and +10.19 IS w.r.t MAGE), few-shot learning (+0.56% on average against Sorcen), transfer (+0.75% average improvement against MAGE and Sorcen), and robustness benchmarks (+5.86% and +4.25% average improvement against MAGE and Sorcen, respectively). It also competes favorably with domain-specialized contrastive and generative models while surpassing previous MIM methods. The unsupervised LEASE model can also be extended to conditional generation by building upon its learned representations, proving competitive with specialized baselines. Overall, LEASE provides an efficient and effective step toward general-purpose vision models that jointly understand and generate visual content.


[135] ClueAegis: Heuristic-to-Reasoning Cognitive-skill Learning for Unified Evidence-based Synthetic Image Detection cs.CV

Huangsen Cao, Hongkang Chu, Yuxi Li, Ying Zhang, Chen Li

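TL;DR: This paper proposes ClueAegis, a heuristic-to-reasoning cognitive-skill learning framework for evidence-based synthetic image detection. The framework first extracts heuristic perceptual clues and selects the best forensic skill, then performs skill-conditioned reasoning to extract evidence and reach a decision. The authors also introduce the ClueAegis-Bench benchmark, which decomposes the detection task into explicitly annotated forensic cognitive skills for structured evaluation.

Details

Motivation: Existing synthetic image detection methods are usually limited to end-to-end classification or monolithic reasoning and fail to model structured forensic reasoning and heterogeneous visual evidence. The paper revisits the problem from a cognitive perspective, aiming for detection that is more transparent, explainable, and evidence-based.

Result: Extensive experiments show that ClueAegis achieves state-of-the-art synthetic image detection performance while improving cross-domain generalization and robustness.

Insight: The main novelty is the reformulation of synthetic image detection as a configurable multi-skill reasoning process that links perception, skill selection, and forensic reasoning. The framework yields transparent reasoning trajectories and structured forensic evidence, making it more explainable than conventional end-to-end detectors.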

Abstract: The rapid advancement of generative models has made synthetic images increasingly realistic, challenging reliable detection. Existing methods are often limited to end-to-end classification or monolithic reasoning, and thus fail to model structured forensic reasoning and heterogeneous visual evidence. We revisit synthetic image detection from a cognitive perspective and propose a Heuristic-to-Reasoning cognitive skill learning framework for evidence-based forensic analysis. Given an input image, our framework first extracts heuristic perceptual clues, selects the optimal forensic skill, and then performs skill-conditioned reasoning for evidence extraction and decision making. To support this paradigm, we introduce ClueAegis-Bench, which decomposes synthetic image detection into explicitly annotated forensic cognitive skills for structured evaluation beyond binary classification. Based on this benchmark, we propose ClueAegis (Cognitive-skill Learning for Unified Evidence-based Synthetic Image Detection), a two-stage agentic framework that conducts heuristic skill selection followed by evidence-guided reasoning through skill-conditioned toolchains. This design reformulates synthetic image detection as a configurable multi-skill reasoning process that bridges perception, skill selection, and forensic reasoning. Extensive experiments show that ClueAegis achieves state-of-the-art performance while improving cross-domain generalization and robustness. It also provides transparent reasoning trajectories and structured forensic evidence, offering a more explainable alternative to conventional end-to-end detectors.


[136] AstroRAG – A PageRank-Based Retrieval-Augmented Generation Pipeline for Question Answering in Astronomy cs.CV

Zhifeng Wang, Jason Jingshi Li, Kaihao Zhang, Ramesh Sankaranarayana

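TL;DR: This paper proposes AstroRAG, a PageRank-based retrieval-augmented generation (RAG) pipeline for question answering in astronomy. The system uses token-aware text chunking, ephemeral indexing, and two-stage retrieval (MMR followed by PageRank re-ranking on a similarity graph) to obtain a compact, mutually supportive context and thereby improve answer quality.

Details

Motivation: Large language models (LLMs) are prone to factual errors when they rely only on parametric knowledge, and conventional "retrieve-and-dump" RAG introduces irrelevant context that degrades answer quality.

Result: On the AstroQA astronomy QA benchmark, the RAG-enhanced Mistral-7B reaches 79.49% accuracy and 79.49% F1-score, nearly doubling the performance of its non-RAG counterpart, and is competitive across all difficulty levels.

Insight: The novelty is the use of PageRank for reader-driven re-ranking of retrieved results, identifying a compact, mutually supportive context under a strict token budget. The system is designed to be training-free, privacy-preserving, and reproducible; ephemeral indexing prevents cross-task information leakage, and the design offers a basis for extending RAG to other scientific fields.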

Abstract: Large language models (LLMs) demonstrate strong performance in natural language processing but often generate factual errors when relying solely on parametric knowledge. Retrieval-Augmented Generation (RAG) mitigates these errors by grounding responses in external evidence, yet conventional retrieve-and-dump approaches frequently introduce irrelevant context that degrades answer quality. In this work, we present AstroRAG – a PageRank-based RAG pipeline adapted for question answering in astronomy. The system performs token-aware chunking and per-instance, ephemeral indexing in Elasticsearch, then executes a two-stage retrieval: (i) Maximal Marginal Relevance (MMR) to obtain a small, diverse candidate set and (ii) a reader-driven PageRank (PR) re-ranking on a similarity graph to identify a compact, mutually supportive context under a strict token budget. Our design is training-free, privacy-preserving, and reproducible, as each instance is processed through transient indexing to prevent cross-task leakage. We evaluate the pipeline on the AstroQA benchmark for astronomy QA, and demonstrate competitive performance across all difficulty levels. In particular, the RAG-enhanced Mistral-7B achieves 79.49% accuracy and 79.49% F1-score, nearly doubling the performance of its non-RAG counterpart. These results highlight the effectiveness of disciplined retrieval and refinement in boosting domain-specific reasoning, establishing a robust foundation for extending RAG to other scientific fields.
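
The two-stage retrieval described above can be sketched in a few lines of NumPy. This is an illustrative approximation under stated assumptions: chunk and query embeddings are given as plain arrays (no Elasticsearch), standard PageRank on a cosine-similarity graph stands in for the paper's "reader-driven" variant (whose details the abstract does not give), and `mmr_select`, `pagerank`, `select_context`, `lam`, and `budget` are hypothetical names and parameters.

```python
import numpy as np

def cosine_matrix(X):
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    return Xn @ Xn.T

def mmr_select(query, docs, k, lam=0.7):
    """Stage 1: Maximal Marginal Relevance. Greedily pick chunks that are
    relevant to the query but not redundant with those already picked."""
    rel = cosine_matrix(np.vstack([query, docs]))[0, 1:]
    sim = cosine_matrix(docs)
    selected, candidates = [], list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((sim[i, j] for j in selected), default=0.0)
            return lam * rel[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

def pagerank(adj, damping=0.85, iters=100, tol=1e-10):
    """Power-iteration PageRank on a non-negative weighted graph."""
    n = adj.shape[0]
    A = np.clip(adj, 0, None).astype(float)
    np.fill_diagonal(A, 0.0)
    col = A.sum(axis=0)
    # Column-stochastic transition matrix; dangling columns become uniform.
    P = np.where(col > 0, A / np.where(col > 0, col, 1.0), 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = (1 - damping) / n + damping * (P @ r)
        done = np.abs(new - r).sum() < tol
        r = new
        if done:
            break
    return r

def select_context(query, docs, token_counts, k=4, budget=512):
    """Stage 2: re-rank MMR candidates by PageRank and fill the token budget."""
    cand = mmr_select(query, docs, k)
    scores = pagerank(cosine_matrix(docs[cand]))
    order = [cand[i] for i in np.argsort(-scores)]
    chosen, used = [], 0
    for i in order:
        if used + token_counts[i] <= budget:
            chosen.append(i)
            used += token_counts[i]
    return chosen
```

PageRank on the candidate graph favors chunks that many other candidates resemble, which is one reading of "mutually supportive"; MMR beforehand keeps that graph from being dominated by near-duplicates.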


[137] VEOcc: Voxel-Centric Online Semantic Occupancy Prediction For Embodied Scene Understanding cs.CV

Ruoyu Wang, Yong Liu, Sheng Tao, Yuhang Lin, Yukai Ma

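TL;DR: This paper proposes VEOcc, a voxel-centric framework for online 3D semantic occupancy prediction that incrementally builds dense spatial representations for embodied scene understanding. The framework follows a recursive perception-and-assimilation paradigm, needs no initial scene-scale estimate, and supports efficient, open-ended online map construction.

Details

Motivation: Existing Gaussian-centric methods struggle with structural boundary fidelity and rely heavily on predefined scene-size priors, which limits their operational efficiency. The paper aims to provide a more accurate and efficient online semantic occupancy prediction solution for autonomous exploration.

Result: On the Occ-ScanNet and EmbodiedOcc-ScanNet benchmarks, VEOcc achieves state-of-the-art performance in both local and embodied settings. Zero-shot evaluation on self-collected video sequences further confirms strong out-of-distribution generalization in unseen real-world environments.

Insight: The core novelty is the voxel-centric recursive perception-and-assimilation paradigm, which removes the dependence on initial scale estimation. The proposed Spatio-Temporal-Aware Online Update Strategy (Cross-Temporal Logit Aggregation, Reliability-Aware Confidence Modulation, and Confidence-Driven Incremental State Update) aggregates noisy temporal observations in the discrete voxel space, improving spatio-temporal consistency and robustness.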

Abstract: Crucial for autonomous exploration, online 3D occupancy prediction and mapping incrementally constructs dense spatial representations on the fly. However, recent Gaussian-centric methods struggle with structural boundary fidelity and rely heavily on predefined scene-size priors, fundamentally limiting their operational efficiency. In this work, we present VEOcc, a voxel-centric framework formulated as a recursive perception-and-assimilation paradigm. By eliminating the need for initial scale estimation, VEOcc enables highly streamlined, open-ended map expansion. Furthermore, to robustly aggregate noisy temporal observations within the discrete voxel space, we propose a Spatio-Temporal-Aware Online Update Strategy. It integrates Cross-Temporal Logit Aggregation (TLA) for temporal consistency, Reliability-Aware Confidence Modulation (RCM) for spatial uncertainty calibration, and Confidence-Driven Incremental State Update (CSU) for robust global state assimilation. Extensive experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate that VEOcc establishes new state-of-the-art performance in both local and embodied settings. Notably, zero-shot evaluations on self-collected video sequences further confirm its robust out-of-distribution generalization capability in completely unseen real-world environments. Ultimately, our framework provides an accurate and highly efficient solution for autonomous exploration. Code and supplementary visualizations are available on our project page: https://wryzju.github.io/VEOcc/.


[138] WorldCraft: From Camera Navigation to Object Manipulation in Interactive Video World Models cs.CV

Bohai Gu, Taiyi Wu, Yueyang Yuan, Jian Liu, Xiaocheng Lu

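TL;DR: WorldCraft is a framework for interactive video world models that extends interaction from camera navigation to object-level trajectory control. The user clicks an object and sketches a path to specify its motion, and the model generates future frames in which the selected object follows that path while the camera continues to navigate the scene.

Details

Motivation: Existing video-based world models mainly support camera-level interaction (such as viewpoint movement) and lack direct manipulation of individual objects in the scene. Because real-world interaction is inherently object-centric, this limitation leaves such models closer to passive scene observers than to truly manipulable environments.

Result: Experiments show that WorldCraft enables accurate object control, preserves the camera fidelity of the underlying video world model under camera-only evaluation, and maintains object state across long autoregressive rollouts even when objects leave the camera view.

Insight: The core novelty is a trajectory-centric control pipeline: 1) Normalized World Trajectory (NWT) represents user-drawn motion in a camera-invariant world coordinate system; 2) Spatial-Pathway LoRA (SP-LoRA) injects the world-space signal through the model's spatial-control pathway; 3) Trajectory-Anchored State Persistence (TASP) treats the world trajectory as a persistent spatial state and refreshes autoregressive memory after trajectory-conditioned generation. This decouples object motion from camera-induced screen-space displacement and adds object manipulation without breaking the pretrained camera controller.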

Abstract: Recent video-based world models have made pixel-space environments interactive at the camera level: users can navigate viewpoints while the model generates coherent visual continuations. Yet their action spaces remain incomplete: users can move the camera, but cannot act on individual objects. Since real-world interaction is inherently object-centric, such models remain closer to passive scene observers than truly manipulable environments. We present WorldCraft, a framework that expands interactive video world models from camera navigation to object-level trajectory actions. Given a user click and a sketched path, WorldCraft generates future frames in which the selected object follows the prescribed trajectory while the camera continues to navigate the scene. WorldCraft achieves this through a trajectory-centric control pipeline: First, Normalized World Trajectory (NWT) represents user-drawn motion in a camera-invariant world coordinate system and dynamically re-projects it under the current camera pose, separating object motion from camera-induced screen-space displacement; Spatial-Pathway LoRA (SP-LoRA) then injects this world-space signal through the model’s spatial-control pathway, adding object manipulation capability while preserving the pretrained camera controller; finally, Trajectory-Anchored State Persistence (TASP) treats the world trajectory as a persistent spatial state and refreshes autoregressive memory after trajectory-conditioned generation, allowing moved objects to reappear at their updated positions after leaving the camera view. Experiments show that WorldCraft enables accurate object control, preserves the video-based world model’s camera fidelity under camera-only evaluation, and maintains object state across long autoregressive rollouts with off-camera excursions.
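
The key idea behind NWT, storing the trajectory in world coordinates and re-projecting it under the current camera pose, can be illustrated with a standard pinhole camera model. The sketch below is not the paper's formulation: the normalization step is omitted, and the world-to-camera convention (`R`, `t`), the intrinsics `K`, and the function name `project_trajectory` are assumptions.

```python
import numpy as np

def project_trajectory(points_world, R, t, K):
    """Project world-space trajectory points into the current view.
    points_world: (N, 3); R: (3, 3) and t: (3,) map world to camera
    coordinates; K: (3, 3) pinhole intrinsics.
    Returns (N, 2) pixel coordinates (NaN where hidden) and a visibility mask."""
    cam = points_world @ R.T + t
    visible = cam[:, 2] > 1e-6  # points behind the camera are not drawn
    uv = np.full((len(cam), 2), np.nan)
    proj = cam[visible] @ K.T
    uv[visible] = proj[:, :2] / proj[:, 2:3]
    return uv, visible
```

The same world point lands at different pixels as the camera moves, which is why a trajectory stored in screen space would drift under camera motion, while one stored in world space can be re-projected at every frame and can also be tracked while it is out of view.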


[139] Unbiased Diffusion Variational Inversion via Principled Posterior Matching cs.CV

Weimin Bai, Yuxuan Gu, Yifei Wang, Weijian Luo, He Sun

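TL;DR: This paper proposes the Principled Posterior Matching (PPM) framework to address the mode collapse and unreliable uncertainty quantification that score-based inverse-problem methods incur by minimizing the KL divergence only approximately. By integrating the Fisher divergence, the framework rigorously formulates exact optimization of the KL divergence and derives an equivalent, tractable gradient form, avoiding the bias introduced by earlier approximations. PPM unifies variational inference and amortized inference, and experiments on several computational imaging tasks show superior reconstruction fidelity, multimodal posterior recovery, and well-calibrated uncertainty estimates.

Details

Motivation: Existing score-based methods for inverse problems usually minimize the KL divergence between the inversion distribution and the Bayesian posterior only approximately, which leads to severe mode collapse and unreliable uncertainty quantification. The paper returns to the fundamentals of variational inference and avoids heuristic approximations to resolve these problems.

Result: Validated on challenging computational imaging tasks (inpainting, super-resolution fluorescent microscopy, and radio interferometric black-hole imaging), PPM achieves superior reconstruction fidelity, faithful multimodal posterior recovery, and well-calibrated uncertainty estimates in all experiments.

Insight: The core novelty is a rigorous derivation, via integration of the Fisher divergence, of an equivalent gradient form for exact KL optimization, which closes the approximation gap of earlier methods at the theoretical level. PPM unifies two complementary paradigms, variational inference (using mass-covering divergences) and amortized inference (training an efficient single-step reconstruction network), and extends naturally to a broader family of divergences by generalizing the integral of the Fisher divergence.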

Abstract: Existing score-based methods for inverse problems often resort to approximate minimization of the KL divergence between the inversion distribution and the Bayesian posterior. Such an approximation leads to severe mode collapse and unreliable uncertainty quantification. In this paper, we propose Principled Posterior Matching (PPM), a framework that returns to the fundamentals of variational inference. Instead of relying on heuristic approximations, we rigorously formulate the exact optimization of the KL divergence via the integration of Fisher divergence. We derive a tractable, equivalent gradient form of this integral, enabling precise optimization without the biases introduced by prior approximations. Our analysis clearly reveals that the mode collapse in previous methods stems directly from this approximation gap. Supported by our theoretical solution, PPM unifies two complementary paradigms: (1) In variational inference, PPM adopts mass-covering divergences that significantly improve the inversion diversity and uncertainty quantification; (2) In amortized inference, it enables the training of an efficient reconstruction network for rapid, single-step reconstruction. Furthermore, our formulation naturally extends to a broader family of divergence measures by generalizing the integral of the Fisher divergence. We validate PPM across challenging computational imaging tasks, including inpainting, super-resolution fluorescent microscopy, and radio interferometric black-hole imaging. In all experiments, PPM achieves superior reconstruction fidelity, faithful multimodal posterior recovery, and well-calibrated uncertainty estimates, establishing a robust framework for scientific imaging.
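
For readers unfamiliar with the phrase "KL divergence via the integration of Fisher divergence", the classical identity of this kind (a de Bruijn-type relation under Gaussian smoothing) is shown below. It is offered as background and may differ from the exact form used in the paper, for instance in the noise schedule or weighting. The symbols $p$ (inversion distribution), $q$ (posterior), and the smoothed versions $p_t$, $q_t$ are notation introduced here, and suitable regularity and decay conditions are assumed.

```latex
% Assumed setting: p_t = p * N(0, t I) and q_t = q * N(0, t I) are
% Gaussian-smoothed versions of the inversion distribution p and the posterior q.
\[
  F(p_t \,\|\, q_t)
  = \mathbb{E}_{x \sim p_t}\!\left[
      \left\| \nabla_x \log p_t(x) - \nabla_x \log q_t(x) \right\|^2
    \right],
\]
\[
  \frac{\mathrm{d}}{\mathrm{d}t}\, \mathrm{KL}(p_t \,\|\, q_t)
  = -\tfrac{1}{2}\, F(p_t \,\|\, q_t),
\]
\[
  \mathrm{KL}(p \,\|\, q)
  = \tfrac{1}{2} \int_0^{\infty} F(p_t \,\|\, q_t)\, \mathrm{d}t
  \quad \text{(assuming } \mathrm{KL}(p_t \,\|\, q_t) \to 0 \text{ as } t \to \infty\text{)}.
\]
```

Read this way, replacing the full integral with the Fisher divergence at a single noise level, or with a reweighted version of it, is one natural source of an approximation gap of the kind the abstract describes.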


[140] Uncertainty-DTW for Sequences and Visual Tokens cs.CV | cs.AI | cs.LG

Lei Wang, Syuan-Hao Li, Yongsheng Gao, Piotr Koniusz

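TL;DR: This paper proposes uncertainty-DTW (uDTW), an uncertainty-aware alignment framework for aligning structured data such as sequences and visual tokens. The method models heteroscedastic uncertainty and uses a precision-weighted matching term plus a variance regularizer to make alignment robust to noise and interpretable.

Details

Motivation: Existing alignment methods, such as Dynamic Time Warping and its differentiable variants, rely on deterministic similarity measures and are sensitive to heterogeneous and noisy features, so a robust alignment framework that accounts for uncertainty is needed.

Result: Across evaluations in multiple domains, the method improves consistently over state-of-the-art methods, and the learned uncertainty correlates with semantic importance.

Insight: The novelty is bringing uncertainty modeling into the alignment process to form a probabilistic alignment mechanism, and generalizing the framework from temporal sequences to visual tokens, which links alignment, attention, and uncertainty modeling and yields an interpretable matching method.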

Abstract: Aligning structured data is a fundamental problem in computer vision and machine learning, underlying tasks such as time series analysis, human action recognition, and visual representation learning. Existing alignment methods, including Dynamic Time Warping (DTW) and its differentiable variants, rely on deterministic similarity measures and are therefore sensitive to heterogeneous and noisy features. In this work, we introduce uncertainty-aware alignment, a probabilistic framework that models pairwise correspondences with heteroscedastic uncertainty and performs structured matching along alignment paths. Our formulation, uncertainty-DTW (uDTW), assigns each correspondence a Normal distribution and parametrizes each alignment path by a Maximum Likelihood Estimate objective consisting of (i) a precision-weighted matching term that suppresses unreliable features, and (ii) a log-variance regularization that prevents degenerate solutions. This yields a probabilistic alignment mechanism that is robust to noise and interpretable, as uncertainty directly reflects the reliability of matches. We further generalize this framework from temporal sequences to tokenized visual representations, enabling structured matching over sets of visual tokens. The learned uncertainty can be interpreted as a reverse-attention: semantically relevant regions exhibit low uncertainty and dominate the alignment, while ambiguous/noisy regions have high uncertainty. This provides a connection between alignment, attention, and uncertainty modeling. We evaluate the proposed framework across diverse domains. The results demonstrate consistent improvements over state-of-the-art methods and show that learned uncertainty correlates with semantic importance. These findings establish uncertainty-aware alignment as a general, robust, and interpretable framework for learning from structured data.
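
The two-term objective in the abstract (precision-weighted matching plus log-variance regularization) can be illustrated with a small dynamic program. The sketch below is a simplification under stated assumptions: it uses a hard minimum over paths rather than a differentiable soft minimum, takes the per-pair variances `sigma2` as a given array instead of predicting them with a network, and drops constant factors of the Gaussian negative log-likelihood. The function name `udtw` is hypothetical.

```python
import numpy as np

def udtw(x, y, sigma2):
    """Uncertainty-weighted DTW cost between sequences x (n, d) and y (m, d).
    sigma2: (n, m) positive variances, one per candidate correspondence.
    Per-cell cost = squared distance / variance + log variance."""
    n, m = len(x), len(y)
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    cost = d2 / sigma2 + np.log(sigma2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Standard DTW recursion: extend the cheapest of the three predecessors.
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[n, m])
```

With all variances equal to 1 this reduces to classic DTW on squared Euclidean distance. A large variance down-weights an unreliable match, while the log term charges for that discount, which is what prevents the degenerate solution of inflating every variance.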


[141] PQDT: Pseudo-Query Dual Transformer for Robust Point Cloud Restoration cs.CV | cs.LG

Haoqing Wu, Alexa Nawotki, Jochen Garcke

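TL;DR: This paper proposes PQDT (Pseudo-Query Dual Transformer), a unified 3D point cloud restoration network that adaptively reconstructs high-quality geometry from point clouds degraded by incompleteness, noise, outliers, and irregular density. At its core is a Pseudo-Query module embedded in a Transformer architecture, which reformulates geometric translation as two cooperative stages to improve structural clarity, robustness, and local detail preservation.

Details

Motivation: Real-world point clouds often suffer multiple degradations caused by sensor limitations or occlusions, which hurts downstream applications. Existing learning-based methods typically rely on global bottleneck features, which lose fine-grained geometry and are sensitive to varying input quality, and a unified, robust restoration framework is lacking.

Result: Extensive experiments on curated benchmarks show that the method surpasses state-of-the-art performance in general 3D restoration and handles complex combinations of completion, deformation, and denoising degradations.

Insight: The novelty is a Pseudo-Query module that reformulates geometric translation as two cooperative stages, giving robust restoration under diverse degradations within a unified, point-only backbone and improving structural clarity and local detail. Viewed objectively, it is an architectural design aimed at better adaptability and detail preservation.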

Abstract: Point clouds are a fundamental 3D representation in computer vision, enabling a wide range of perception tasks. However, real-world point clouds often suffer from degradations such as incompleteness, noise, outliers, and irregular density, caused by sensor limitations or occlusions. Recovering clean and detailed shapes from such degraded data is crucial for downstream applications. While existing learning-based methods achieve progress on individual tasks like completion or denoising, they typically rely on global bottleneck features, which lose fine-grained geometry and remain sensitive to varying input quality. We propose a unified 3D restoration network that directly takes point clouds as input and adaptively reconstructs high-quality geometry under diverse degradation scenarios. At the core of our approach is a Pseudo-Query module, implemented within a Transformer backbone, which reformulates geometric translation into two cooperative stages to enhance structural clarity, robustness, and local detail preservation. Extensive experiments on curated benchmarks demonstrate that our approach surpasses state-of-the-art performance in general 3D restoration. It effectively handles complex combinations of completion, deformation, and denoising degradations. With this work, we provide a novel unified, point-only backbone for robust 3D restoration, enabling more versatile 3D perception.


[142] SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing cs.CVPDF

Sen Liang, Cong Wang, Fengbin Guan, Zhentao Yu, Yiting Lu

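TL;DR: This paper proposes SpongeBob, the first end-to-end joint audio-visual editing framework, which uses bidirectional cross-modal interaction to address the audio-visual desynchronization and contextual conflicts of existing methods. The framework comprises a Sync-Aware Mechanism and a Context-Aware Module, and introduces Sync-Preserving Training and Guidance (SPTG) to strengthen alignment. The authors build a large-scale dataset and an evaluation benchmark; experiments show that SpongeBob significantly outperforms existing baselines on synchronization and contextual-consistency metrics.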

Details

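Motivation: Existing video editing methods typically adopt decoupled pipelines that lack bidirectional modality interaction, leading to audio-visual desynchronization and contextual conflicts between generated audio and preserved content.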

Result: In a systematic evaluation on the proposed SpongeBob-Bench, SpongeBob significantly outperforms existing baselines, improving Sync-C by 30% and Ctx-F1 by 12.5%.

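Insight: The novelty lies in the first end-to-end joint audio-visual editing framework, which achieves synchronization awareness through bidirectional cross-modal attention, temporal alignment, and spatial constraints, and ensures semantic consistency through acoustic and visual context attention. The paper also proposes Sync-Preserving Training and Guidance (SPTG) and builds a large-scale dataset and an evaluation benchmark.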

Abstract: Visual and acoustic events in the physical world are inherently coupled, yet existing video editing methods typically adopt decoupled pipelines, lacking bidirectional modality interaction. This results in two key limitations: (i) audio-visual desynchronization and (ii) contextual conflicts between generated audio and preserved content. To address these, we propose SpongeBob, the first end-to-end audio-visual joint editing framework featuring bidirectional cross-modal interaction. For synchronization, a Sync-Aware Mechanism aligns visual edits with sound events via bidirectional attention, temporal alignment, and spatial constraints. For contextual consistency, a Context-Aware Module leverages acoustic and visual context attention to prevent semantic clashes. Additionally, we introduce Sync-Preserving Training and Guidance (SPTG) to enhance alignment without degrading quality. Due to the scarcity of paired data, we construct a scalable data pipeline and a large-scale subject-level dataset. We also propose SpongeBob-Bench for systematic evaluation. Experiments show SpongeBob significantly outperforms existing baselines, improving Sync-C by 30% and Ctx-F1 by 12.5%. Our project page is available at: https://hy-spongebob.github.io/.


[143] Baton: Explicit Semantic Blueprints for Joint Video-Audio Generation cs.CVPDF

Shuyuan Tu, Qi Tian, Zihan Yang, Yue Wu, Xintong Han

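TL;DR: This paper proposes Baton, the first framework to introduce explicit semantic planning into joint video-audio generation. Using a multimodal language model called VA-Planner, it generates semantically aligned video and audio planned tokens before denoising; these serve as keyframe-level blueprints that coordinate the denoising trajectories and strengthen cross-modal alignment.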

Details

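Motivation: Current open-source diffusion models struggle to generate stable, synchronized audio-visual content, especially in scenarios that demand complex semantic reasoning. The root cause is that existing methods guide audio-video denoising with coarse text embeddings from off-the-shelf encoders, which discards fine-grained semantics and lacks a shared long-horizon plan, resulting in uncoordinated denoising trajectories and fragile cross-modal alignment.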

Result: Experiments on benchmarks show that Baton is effective in both qualitative and quantitative evaluations.

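Insight: The core innovation is explicit semantic planning: VA-Planner generates modality-aware planned tokens that are jointly reasoned and mutually aligned before denoising, complementing coarse text guidance. In addition, the proposed Relative Semantic RoPE positional encoding maps planned tokens and latents into a shared spatial-temporal coordinate frame, which resolves their lack of one-to-one correspondence and lets each latent attend accurately to its corresponding semantic cues.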

Abstract: Current open-source diffusion models struggle to generate stable and synchronized audio-visual content, particularly in scenarios demanding complex semantic reasoning. The root cause is that existing methods rely on coarse text embeddings from off-the-shelf encoders to guide audio-video denoising, which discards fine-grained semantics and, critically, lacks a shared long-horizon plan, leading to uncoordinated denoising trajectories and fragile cross-modal alignment. We propose Baton, the first framework that introduces explicit semantic planning into joint video-audio generation. Our key insight is that complementing coarse text guidance with semantically rich, modality-aware planned tokens, jointly reasoned and mutually aligned before denoising, can simultaneously restore fine-grained semantic detail and establish a shared blueprint that coordinates both audio and video denoising trajectories. Concretely, Baton first introduces the VA-Planner, a multimodal language model equipped with dual semantic alignment towers, where learnable queries cross-attend to both video and audio features to produce a pair of semantically aligned video and audio planned tokens as keyframe-level blueprints. These planned tokens are injected into the diffusion backbone via cross-attention layers, providing temporally grounded guidance complementary to coarse text embeddings. Since planned tokens do not share one-to-one spatial-temporal correspondence with diffusion latents, we further propose Relative Semantic RoPE, a relative positional encoding that maps planned tokens and latents into a shared spatial-temporal coordinate frame, enabling each latent to accurately attend to its positionally corresponding semantic cues. Experiments on benchmarks show the effectiveness of Baton both qualitatively and quantitatively.


[144] Multi-view Consistent 3D Gaussian Head Avatars ‘without’ Multi-view Generation cs.CV | cs.GR | cs.ROPDF

Aviral Chharia, Fernando De la Torre

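TL;DR: This paper proposes MVCHead, a method that learns to generate high-quality 3D Gaussian head avatars from randomly sampled 2D images alone. It requires no multi-view data, 3D supervision, or intermediate view generation; it enforces multi-view consistency through a Hierarchical State Space (HiSS) block and a Hierarchical Bi-directional State Scan (HiBiSS), and uses an SE(3) Multi-view Critic to assess rendering consistency.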

Details

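Motivation: Existing methods for high-fidelity 3D head avatar generation depend heavily on multi-view datasets, 3D captures, or intermediate 2D view synthesis, which limits their scalability and practicality. This paper aims to learn both conditional and unconditional 3D head models from randomly sampled 2D images alone, without such additional data or supervision.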

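Result: MVCHead achieves SOTA perceptual quality, surpasses prior methods in texture and geometric consistency, and maintains comparable shape consistency. The authors also release FaceGS-10K, the first large-scale dataset of ready-to-use 3D Gaussian head assets, for training and evaluation.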

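Insight: The core innovations are the Hierarchical State Space (HiSS) block, which refines Gaussians from coarse to fine while capturing long-range dependencies, and the Hierarchical Bi-directional State Scan (HiBiSS), which aligns recurrence with the axes along which multi-view inconsistencies are strongest. In addition, the SE(3) Multi-view Critic judges from self-renders alone, without observing real multi-view pairs, whether they arise from a single underlying 3D configuration, which amounts to a novel unsupervised consistency constraint.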

Abstract: High-fidelity 3D Gaussian head avatar generation is critical for applications such as AR/VR, telepresence, and digital humans. Existing methods depend on multi-view datasets, 3D captures, or intermediate 2D view synthesis. In contrast, we learn both conditional and unconditional 3D head models from randomly sampled 2D images alone, without using multi-view data, 3D supervision, or intermediate view generation. We introduce MVCHead, a single-shot state space model that enforces multi-view consistency (MVC) directly in the 3D representation while regressing 3D Gaussians under these constraints. At its core, we propose a Hierarchical State Space (HiSS) block that progressively refines Gaussians from coarse to fine, while capturing long-range dependencies. Within each HiSS block, we modify Mamba’s standard unidirectional scan with the proposed Hierarchical Bi-directional State Scan (HiBiSS) that aligns recurrence with the axes along which multi-view inconsistencies are strongest. Finally, we design an SE(3) Multi-view Critic that judges whether a set of self-renders arises from a single underlying 3D configuration, rewarding cross-view pixel alignment without observing real multi-view pairs. MVCHead achieves state-of-the-art perceptual quality, surpasses prior methods in both texture and geometric consistency, and maintains comparable shape consistency. To demonstrate scalability, we release FaceGS-10K, the first large-scale dataset of ready-to-use 3D Gaussian head assets for training and evaluation of 3D head models. Project Page and code: https://humansensinglab.github.io/MVCHead/


[145] Semantics-Guided Multimodal Masked Autoencoder Pretraining for 3D BEV Object Detection cs.CVPDF

Prabuddhi Wariyapperuma, Rajitha de Silva, Marc Hanheide, Thomas Bohné, Leonardo Guevara

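TL;DR: This paper proposes a semantics-guided multimodal masked autoencoder pretraining framework to improve 3D BEV object detection. It introduces semantic information during pretraining through a semantics-guided LiDAR voxel masking strategy and an auxiliary point-wise LiDAR semantic decoder branch, strengthening the focus on key regions and semantic understanding.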

Details

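Motivation: Existing multimodal masked autoencoders for 3D BEV detection typically apply uniform random masking to camera and LiDAR inputs, treating all regions equally, and learn representations only through masked reconstruction, which ignores differences in the semantic importance of regions.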

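Result: In BEVFusion 3D object detection experiments on the nuScenes mini validation set, semantics-guided LiDAR voxel masking improves mAP and NDS by 1.49% and 1.66%, respectively, while decoder-side point semantic supervision improves mAP and NDS by 1.39% and 3.22%, respectively, over the standard UniM2AE baseline.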

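Insight: The novelty lies in explicitly introducing semantic information into multimodal pretraining: a semantics-guided masking strategy and an auxiliary semantic decoding task make the model attend more to semantically critical regions and thus learn representations that are more effective for downstream detection. This offers a semantics-aware direction for multimodal pretraining.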

Abstract: Accurate 3D bird’s-eye view (BEV) object detection is essential for autonomous driving, and depends strongly on effective multimodal representations from complementary sensors such as cameras and LiDAR. Multimodal masked autoencoders have shown strong potential for learning such representations for downstream 3D BEV object detection. However, existing methods typically apply uniform random masking to camera and LiDAR inputs, treating all regions equally, and learn representations only through masked reconstruction. We propose a semantics-guided multimodal masked autoencoder framework that introduces semantic information during pretraining through two separate components: (i) semantics-guided LiDAR voxel masking, which preserves semantically important LiDAR regions more strongly, and (ii) an auxiliary point-wise LiDAR semantic decoder branch that injects semantic guidance in addition to reconstruction. On BEVFusion 3D object detection, our semantics-guided pretraining strategy improves performance on the nuScenes mini validation set compared to the standard UniM2AE baseline: semantics-guided LiDAR voxel masking yields +1.49% mean Average Precision (mAP) and +1.66% nuScenes Detection Score (NDS), while decoder-side point semantic supervision yields +1.39% mAP and +3.22% NDS over the baseline.
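
The idea of masking that "preserves semantically important LiDAR regions more strongly" can be illustrated with a small sampling routine. This is a minimal sketch, not the paper's procedure: it assumes each voxel already has a scalar importance score, and the function name `semantic_guided_mask`, the `temperature` parameter, and the exponential weighting are all hypothetical choices made for illustration.

```python
import numpy as np

def semantic_guided_mask(importance, mask_ratio=0.7, temperature=1.0, rng=None):
    """Pick voxels to mask, favouring voxels with LOW semantic importance.

    importance : 1-D array of per-voxel scores (assumed given; higher = more important).
    Returns a boolean array where True means "masked out".
    """
    rng = np.random.default_rng() if rng is None else rng
    importance = np.asarray(importance, dtype=float)
    n = importance.size
    n_mask = int(round(mask_ratio * n))
    # Hypothetical weighting: masking probability decays exponentially with importance.
    weights = np.exp(-importance / temperature)
    probs = weights / weights.sum()
    # Sample without replacement so exactly n_mask voxels are masked.
    masked_idx = rng.choice(n, size=n_mask, replace=False, p=probs)
    mask = np.zeros(n, dtype=bool)
    mask[masked_idx] = True
    return mask

# Example: the first 50 voxels are "important", the last 50 are background.
scores = np.concatenate([np.full(50, 5.0), np.zeros(50)])
mask = semantic_guided_mask(scores, mask_ratio=0.5, rng=np.random.default_rng(0))
```

Setting `temperature` very high makes the weights nearly equal, which approaches the uniform random masking that the abstract describes as the existing default.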


[146] DeltaCam: Differential Intrinsic Camera Modeling for Video Generation cs.CVPDF

Debabrata Mandal, Zhihan Peng, Yujie Wang, Praneeth Chakravarthula

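TL;DR: DeltaCam is a video diffusion framework that models camera behavior and intrinsics through Δ-parameterized neural camera adaptors, focusing on relative changes rather than absolute states. By learning this differential formulation from synthetic video data, it reduces reliance on precise real-world camera labels and enables smooth, consistent control over imaging factors such as focal length, aperture, ISO, color temperature, and lens distortion. The framework also extends to real-world footage by finetuning on real image-metadata pairs and extracting disentangled embeddings, enabling camera-consistent video generation and editing.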

Details

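Motivation: Existing video generation models focus mainly on extrinsic camera control, such as pose and motion, and treat camera intrinsics as implicit or fixed. Large-scale video datasets with accurate, diverse, temporally varying camera metadata are lacking, which makes absolute camera parameterizations hard to learn. As a result, models cannot incorporate photographic camera behavior such as depth-of-field transitions, exposure variations, lens distortions, and color processing in a controllable, temporally consistent manner.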

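Result: The abstract reports no specific quantitative benchmark results or SOTA comparisons. It states that DeltaCam enables camera-consistent video generation and editing operations that are difficult to achieve with existing models, and that it establishes a practical, scalable approach for bridging synthetic control and real-world photographic emulation.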

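Insight: The novelty is a differential approach to modeling camera intrinsics (Δ-parameterized neural camera adaptors) that models relative changes rather than absolute states, reducing reliance on precise real-world camera labels. Two mechanisms, finetuning the controls and extracting disentangled embeddings, extend the framework to real-world footage and separate scene content from intrinsic imaging behavior.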

Abstract: Incorporating camera intrinsics into video generation models offers a principled way to control not only scene dynamics but also the imaging process that governs visual appearance. Prior work has primarily focused on extrinsic control, such as camera pose and motion, while treating intrinsic camera parameters as implicit or fixed. A key bottleneck is the lack of large-scale video datasets with accurate and diverse temporally varying camera metadata, which makes learning absolute camera parameterizations difficult. As a result, current models struggle to incorporate photographic camera behavior, including depth-of-field transitions, exposure variations, lens distortions, and color processing, in a controllable and temporally consistent manner. We introduce DeltaCam, a video diffusion framework that models camera behavior through $Δ$-parameterized neural camera adaptors, operating on relative changes in camera motion and intrinsics instead of absolute states. By learning this differential formulation from synthetic video data, we mitigate reliance on precise real-world camera labels and enable smooth, consistent control over imaging factors such as focal length, aperture, ISO, color temperature, and lens distortion. We extend this framework to real-world footage through two mechanisms: finetuning the controls on real image-metadata pairs for precise shot matching, and extracting disentangled embeddings for implicit video-to-video style transfer without requiring explicit camera parameters. By effectively separating scene content from intrinsic imaging behavior, DeltaCam enables camera-consistent video generation and editing operations that are difficult to achieve with existing models. Ultimately, our results establish a practical and scalable approach for bridging synthetic control and real-world photographic emulation.
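
To make "relative changes instead of absolute states" concrete, the sketch below converts a per-frame parameter track into frame-to-frame deltas and back. It is an illustration only: the abstract does not specify DeltaCam's parameterization, so the log-ratio encoding for multiplicative quantities (such as focal length) and the helper names `to_relative` and `from_relative` are assumptions.

```python
import numpy as np

def to_relative(values, multiplicative=True):
    """Turn absolute per-frame values into frame-to-frame deltas.

    multiplicative=True uses log-ratios (natural for zoom or exposure);
    multiplicative=False uses plain differences.
    """
    values = np.asarray(values, dtype=float)
    if multiplicative:
        return np.diff(np.log(values))
    return np.diff(values)

def from_relative(start, deltas, multiplicative=True):
    """Rebuild an absolute track from a starting value and its deltas."""
    steps = np.concatenate([[0.0], np.cumsum(np.asarray(deltas, dtype=float))])
    if multiplicative:
        return start * np.exp(steps)
    return start + steps

# Two zooms with different absolute focal lengths (mm) but the same relative motion.
zoom_a = [24.0, 35.0, 50.0]
zoom_b = [48.0, 70.0, 100.0]
delta_a = to_relative(zoom_a)
delta_b = to_relative(zoom_b)
```

Both tracks yield the same deltas, which is one way to see why a differential formulation can be learned without knowing the exact absolute camera state of each clip.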


[147] Stabilizing Streaming Video Geometry via Dynamic Feature Normalization cs.CVPDF

Xiaoyang Lyu, Muxin Liu, Xiaoshan Wu, Ruicheng Wang, Yi-Hua Huang

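TL;DR: This paper proposes Dynamic Feature Normalization (DyFN), a lightweight, causal recurrent module that addresses the temporal inconsistency, in particular scale-shift drift, that arises when estimating 3D geometry from continuous RGB video streams. DyFN stabilizes geometry predictions by dynamically modulating feature statistics; it adapts pretrained monocular geometry foundation models by finetuning only about 2% additional parameters, substantially improving temporal consistency while preserving single-image accuracy.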

Details

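Motivation: The paper addresses the severe temporal inconsistency, especially scale-shift drift, that modern monocular geometry foundation models exhibit on continuous video input, a problem that matters for real-world applications such as autonomous driving, embodied AI, and large-scale reconstruction.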

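Result: Extensive experiments on four benchmarks show that DyFN effectively eliminates temporal artifacts such as disjointed layering and positional jitter and achieves state-of-the-art temporal stability, improving over prior streaming methods by up to 14% and even outperforming heavier non-causal video baselines.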

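Insight: The paper's novelty lies in tracing temporal instability, through empirical analysis, to fluctuations in latent feature statistics and designing the DyFN module to modulate these statistics dynamically and robustly. Viewed objectively, its lightweight approach, which finetunes only a small number of parameters and keeps the backbone frozen, improves temporal consistency without sacrificing single-image accuracy, making it efficient and practical to adopt.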

Abstract: Consistent 3D geometry estimation from streaming RGB input is crucial for real-world applications such as autonomous driving, embodied AI, and large-scale reconstruction. While modern monocular geometry foundation models achieve strong single-image accuracy, they exhibit severe temporal inconsistency on continuous input, notably dominated by scale–shift drifting. Through targeted empirical analysis, we trace this instability to its root cause: fluctuations in latent feature statistics, whose mean and variance directly determine the predicted depth’s scale and shift. Building on this insight, we introduce Dynamic Feature Normalization (DyFN), a lightweight, causal recurrent module that dynamically and robustly modulates feature statistics to maintain stable geometry over time. We adapt powerful pretrained monocular geometry models for streaming by finetuning only DyFN, a mere 2% additional parameters, while keeping the backbone frozen, thereby achieving temporal consistency without compromising single-image accuracy. Extensive experiments across four benchmarks show that DyFN effectively eliminates temporal artifacts such as disjointed layering and positional jitter, and achieves state-of-the-art temporal stability, improving over prior streaming methods by up to 14% and even outperforming heavier non-causal video baselines. Project Page: https://shawlyu.github.io/DyFN
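
The link between fluctuating feature statistics and scale-shift drift can be illustrated with a toy causal normaliser. This is a minimal sketch under stated assumptions: a plain exponential moving average stands in for the learned recurrent module, and the class name `CausalStatNorm` and the `momentum` parameter are hypothetical. It is not the DyFN architecture, which the abstract does not detail.

```python
import numpy as np

class CausalStatNorm:
    """Re-normalise each frame's features to running (past-only) statistics."""

    def __init__(self, momentum=0.9, eps=1e-6):
        self.momentum = momentum
        self.eps = eps
        self.mean = None  # running mean of per-frame feature means
        self.std = None   # running mean of per-frame feature stds

    def __call__(self, feat):
        mu, sigma = feat.mean(), feat.std()
        if self.mean is None:
            # First frame initialises the running statistics.
            self.mean, self.std = mu, sigma
        else:
            m = self.momentum
            self.mean = m * self.mean + (1 - m) * mu
            self.std = m * self.std + (1 - m) * sigma
        # Strip the frame's own statistics, then apply the smoothed ones.
        return (feat - mu) / (sigma + self.eps) * self.std + self.mean

def demo(num_frames=20):
    """Feed frames whose scale and shift drift; compare mean jitter before and after."""
    rng = np.random.default_rng(0)
    base = rng.normal(size=(8, 16))
    norm = CausalStatNorm()
    raw_means, out_means = [], []
    for t in range(num_frames):
        frame = (1.0 + 0.5 * np.cos(t)) * base + 0.5 * np.sin(3 * t)
        raw_means.append(frame.mean())
        out_means.append(norm(frame).mean())
    return float(np.std(raw_means)), float(np.std(out_means))

raw_jitter, stabilised_jitter = demo()
```

Because the update only uses the current and earlier frames, the sketch is causal and suits streaming input, which is the property the abstract emphasises.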


[148] Perceive-then-Plan: Layout-as-Policy for Monocular 3D Scene Layout Estimation cs.CVPDF

Junwei Zhou, Yu-Wing Tai

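TL;DR: This paper proposes Layout-as-Policy (LaP), a perceive-then-plan framework for 3D scene layout estimation from a single image. A geometry-enhanced Perceiver first grounds the 3D objects; a Planner, trained via policy learning, then iteratively applies discrete actions such as translation, rotation, and rescaling to progressively improve the layout's physical plausibility and its consistency with the input image.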

Details

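Motivation: Building a structured 3D scene layout from a single image requires reconciling visual observations with physical and spatial constraints, a challenge that direct prediction alone struggles to solve.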

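Result: Experiments show that the method produces layouts that are more physically coherent and better aligned with visual observations, and that it naturally supports downstream tasks such as scene editing and manipulation.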

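Insight: The core innovation is recasting layout estimation from a one-shot prediction task into an iterative refinement process, together with a training strategy that combines supervised trajectory initialization with preference-based optimization. This lets the model learn corrective behaviors without explicit reward engineering and better handle global constraints and complex object interactions.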

Abstract: Building structured 3D scene layouts from a single image requires reconciling visual observations with physical and spatial constraints, a challenge that is difficult to address with direct prediction alone. In this work, we formulate monocular 3D layout estimation as a perceive-then-plan problem with vision-language models, where a Perceiver first grounds the 3D objects and then a Planner iteratively refines the scene hypothesis through actions that improve physical plausibility while preserving consistency with the input image. We propose Layout-as-Policy (LaP), which casts the planning stage as a policy learning problem: 3D layouts are represented as structured states, and refined via discrete actions such as translation, rotation, and rescaling. Starting from an observation-aligned initialization with the geometry-enhanced Perceiver, the LaP Planner is trained to produce action sequences that progressively resolve geometric inconsistencies and enforce realistic spatial relations. To enable effective learning, we combine supervised trajectory initialization with preference-based optimization, allowing the model to learn corrective behaviors without requiring explicit reward engineering. This formulation transforms layout estimation from a one-shot prediction task into an iterative refinement process, enabling better handling of global constraints and complex object interactions. Experiments demonstrate that our approach produces layouts that are more physically coherent and better aligned with visual observations, while naturally supporting downstream tasks such as scene editing and manipulation.
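
The "structured states refined via discrete actions" formulation can be sketched as a small state-and-action loop. Everything here is illustrative: `ObjectState`, the action set, and the hand-written `floating_cost` are hypothetical, and the greedy search merely stands in for the trained LaP Planner, which the abstract says is learned from trajectories and preferences rather than from an explicit cost.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ObjectState:
    x: float        # horizontal position
    z: float        # height of the object's base above the floor
    yaw_deg: float  # heading in degrees
    scale: float    # uniform size factor

def apply_action(obj, action):
    """Apply one discrete (kind, amount) action and return a new state."""
    kind, amount = action
    if kind == "translate_x":
        return replace(obj, x=obj.x + amount)
    if kind == "translate_z":
        return replace(obj, z=obj.z + amount)
    if kind == "rotate":
        return replace(obj, yaw_deg=(obj.yaw_deg + amount) % 360.0)
    if kind == "rescale":
        return replace(obj, scale=obj.scale * amount)
    raise ValueError(f"unknown action: {kind}")

# Hypothetical discrete action set.
ACTIONS = [("translate_z", -0.1), ("translate_z", 0.1), ("rotate", 15.0), ("rescale", 1.1)]

def floating_cost(obj):
    """Toy physical-plausibility cost: objects should rest on the floor (z = 0)."""
    return abs(obj.z)

def greedy_refine(obj, cost, max_steps=20):
    """Stand-in for a learned policy: take the best action until nothing improves."""
    for _ in range(max_steps):
        best = min((apply_action(obj, a) for a in ACTIONS), key=cost)
        if cost(best) >= cost(obj):
            break
        obj = best
    return obj

floating_chair = ObjectState(x=1.0, z=0.3, yaw_deg=0.0, scale=1.0)
refined = greedy_refine(floating_chair, floating_cost)
```

The loop lowers the floating object to the floor in three steps and then stops, leaving position, heading, and scale untouched because no remaining action reduces the cost.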


[149] DIVA: Harnessing the Representation Divergence in Unified Multimodal Models for Mutual Reinforcement cs.CV | cs.MMPDF

Renjie Lu, Xulong Zhang, Xiaoyang Qu, Shangfei Wang, Jianzong Wang

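TL;DR: This paper proposes DIVA, a framework that addresses the mutual interference between understanding and generation tasks in unified multimodal models caused by differing supervision signals. By factorizing visual representations into shared and unique components and using mutual information estimation to protect the unique information, DIVA lets the understanding and generation branches reinforce each other.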

Details

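Motivation: When a unified multimodal model optimizes understanding and generation within a single architecture, the objectives conflict and impair each other: understanding favours semantically discriminative embeddings, whereas generation requires high-fidelity, fine-grained representations.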

Result: The method improves visual understanding by 7.82% and generation by 8.46%, with consistent gains across multiple benchmarks.

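Insight: The novelty lies in using representation factorization and mutual information estimation to turn representation divergence into internal synergy, so that complementary objectives reinforce rather than interfere with each other within a unified architecture. Viewed objectively, the method offers a potentially generalizable solution to representation conflicts in multi-task learning.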

Abstract: Unified multimodal models (UMMs) built on a single architecture have shown impressive performance in both understanding and generation. We identify a fundamental challenge that lies in the inductive biases induced by distinct supervision signals: the generation branch prefers high-fidelity, fine-grained representations capable of reconstruction, while the understanding branch favours semantically discriminative embeddings that remain invariant to task-irrelevant factors. Consequently, optimizing these complementary but non-equivalent objectives within a monolithic backbone leads to mutual impairment instead of enhancement. In this paper, we first analyze the root cause of this interference in unified backbones and reveal a complementary structure in their internal representations. Motivated by this observation, we propose DIVA, a self-improved post-training framework that transforms the representation divergence into interior synergy. By explicitly factorizing the visual representation into shared and unique components based on two complementary information flows, DIVA enables beneficial transfer between the understanding and generation branches while using mutual information estimation to protect the integrity of unique information from cross-flow interference. Despite its generality, our method consistently achieves improvements across visual understanding (+7.82%) and generation (+8.46%). The official code is available at: https://github.com/Jayyy-H/DIVA.


[150] Teaching Video Generators to Remember: Eliciting Dynamic Memory for Out-of-Sight State Evolution cs.CVPDF

Tianshuo Xu, Yichen Xie, Depu Meng, Chensheng Peng, Quentin Herau

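TL;DR: This paper proposes ReMind, a framework that elicits dynamic memory in video diffusion models through memory-oriented data construction, event-aware training, and cache adaptation, so that they can maintain state evolution when visual evidence is interrupted. The method achieves the best performance on STEVO-Bench and recovery tasks while avoiding catastrophic forgetting.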

Details

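Motivation: Existing video generators often freeze hidden states when visual evidence is interrupted and fail to maintain dynamic evolution. This is not simply a capacity problem: pretrained models are rarely trained to use their KV-cache mechanisms as dynamic memory.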

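Result: ReMind achieves the best overall scores on STEVO-Bench and recovery tasks, and general image-to-video evaluations confirm that its training curriculum avoids catastrophic forgetting.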

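Insight: The innovations include a training data mixture organized around a taxonomy of 100+ dynamic events, a node-structured curriculum that forces the model to retrieve past states across interruptions, and PM-RoPE, a RoPE extension that enables spatiotemporal retrieval at a single-attention cost.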

Abstract: Video world models should maintain evolving states when evidence is unobserved, yet current generators often freeze hidden states upon interruption. This is not simply a capacity problem: pretrained video diffusion transformers already possess KV-cache mechanisms capable of non-local retrieval, but they are rarely trained to use them as dynamic memory. We introduce ReMind, a framework eliciting dynamic memory behavior via memory-oriented data, event-aware training, and cache adaptation. Organized around a taxonomy of 100+ dynamic events, we build a camera-annotated training mixture combining VLM-filtered real videos, generated hard dynamics, synthetic camera loops, and memory-interruption augmentations. Each clip is converted into a frame graph with protected anchors, degraded intervals, and explicit temporal gaps. A node-structured curriculum, including node-drop, noisy memory, frontier continuation, and reference-cache training, forces the model to retrieve relevant past states across interruptions rather than relying solely on local continuity. PM-RoPE, an elegant camera-phase RoPE extension, unlocks spatiotemporal retrieval at a single-attention cost while preserving pretrained pathways. ReMind achieves the best overall scores on STEVO-Bench and recovery tasks. Furthermore, general image-to-video evaluations confirm this curriculum avoids catastrophic forgetting. We will open-source our code, data, and models.


[151] Dual-Pathway Geometry-Aware MLLM for Spatial Intelligence cs.CVPDF

Yufei Zheng, Xuhan Zhu, Zide Liu, Chunpeng Zhou, Chenfeng Wang

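TL;DR: This paper proposes GAMSI, a dual-pathway geometry-aware multimodal large language model that takes only RGB images as input and internalizes two forms of geometric prior in a unified way: holistic 3D structural perception and fine-grained metric scale estimation. With Metric-Structure Decoupled Queries (MSDQ), an Expert-Guided Visual Grounding (EVG) module, a multi-task spatial instruction-tuning dataset (MTS), and a two-stage training curriculum, the model achieves state-of-the-art performance on seven spatial intelligence benchmarks.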

Details

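Motivation: Existing MLLMs typically address only one of these two facets, ingesting depth maps or point clouds as additional inputs, which incurs substantial computational overhead and inherits the generalization limitations of upstream prediction models; they therefore fall short of comprehensive spatial understanding.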

Result: The model achieves state-of-the-art (SOTA) performance on seven spatial intelligence benchmarks.

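Insight: The innovations include Metric-Structure Decoupled Queries (MSDQ), which separately extract dense metric signals and sparse structural cues; an Expert-Guided Visual Grounding (EVG) module that uses vision foundation models only as training-time supervision rather than as inputs; and a large-scale multi-task spatial instruction-tuning dataset (MTS).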

Abstract: Spatial understanding of the physical world from 2D visual inputs hinges on two complementary forms of geometric knowledge: holistic 3D structural perception and fine-grained metric scale estimation. Existing multimodal large language models (MLLMs) typically address only one facet, ingesting either depth maps or point clouds as additional model inputs, which incurs substantial computational overhead and inherits the generalization limitations of upstream prediction models. We propose GAMSI, a dual-pathway Geometry-Aware MLLM for Spatial Intelligence that takes only RGB images as input while internalizing both forms of geometric prior within a unified autoregressive backbone. Specifically, we introduce Metric-Structure Decoupled Queries (MSDQ) which employ two groups of learnable queries to respectively extract dense metric signals and sparse structural cues from the shared visual context, with a task-decoupled attention mask further preventing the two pathways from contaminating each other. Building on this, an Expert-Guided Visual Grounding (EVG) module projects the aggregated cues back to frame-level visual features and aligns them with vision foundation models, which serve purely as training-time supervision, rather than as model inputs. We further build a multi-task spatial instruction-tuning dataset (MTS) comprising 152,776 samples spanning 13 task types and three visual modalities, consolidated from six public datasets. Trained with a two-stage curriculum, GAMSI achieves state-of-the-art performance on seven spatial intelligence benchmarks.
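
A task-decoupled attention mask of the kind mentioned above can be pictured as a block-structured boolean matrix. This is a minimal sketch under assumptions not stated in the abstract: the token layout `[visual | metric queries | structure queries]` and the function name `task_decoupled_mask` are hypothetical, and the causal masking that an autoregressive backbone would add is ignored.

```python
import numpy as np

def task_decoupled_mask(n_visual, n_metric, n_struct):
    """Boolean mask where allowed[i, j] is True if token i may attend to token j.

    Assumed layout: [visual tokens | metric queries | structure queries].
    Both query groups read the shared visual context and their own group,
    but never each other.
    """
    n = n_visual + n_metric + n_struct
    allowed = np.zeros((n, n), dtype=bool)
    vis = slice(0, n_visual)
    met = slice(n_visual, n_visual + n_metric)
    stc = slice(n_visual + n_metric, n)
    allowed[vis, vis] = True   # visual tokens attend among themselves
    allowed[met, vis] = True   # metric queries read the visual context
    allowed[met, met] = True   # ...and their own group
    allowed[stc, vis] = True   # structure queries read the visual context
    allowed[stc, stc] = True   # ...and their own group
    return allowed

mask = task_decoupled_mask(n_visual=3, n_metric=2, n_struct=2)
```

The two off-diagonal blocks between the query groups stay False, which is what keeps one pathway from reading the other's intermediate state.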


[152] Toward Native Multimodal Modeling: A Roadmap cs.CVPDF

Siyu An, Junru Lu, Junnan Dong, Qiufeng Wang, Yinghui Li

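TL;DR: This paper presents a roadmap for native multimodal modeling (NMM), intended to advance the shift from modality-agnostic reasoning toward world modeling. It formally defines architectural “nativity”, organizes existing native models into three categories by input-output duality, and systematically examines, from an industrial perspective, the end-to-end pipeline toward a definitive NMM framework.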

Details

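Motivation: Multimodal modeling is shifting from a paradigm that relies on late fusion toward native multimodal modeling with intrinsic integration of modalities, but the design space of native architectures remains insufficiently defined, and a clear roadmap is needed to guide the transition.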

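Result: The paper reports no quantitative experimental results; it instead provides a comprehensive, industrial-grade investigation intended to give the community systematic guidance for defining and realizing a definitive native multimodal modeling framework.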

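Insight: The novelty lies in formally defining the “nativity” of multimodal architectures and proposing a clear taxonomy of native models based on input-output duality (Multi-to-Text, Multi-to-Target, and Multi-to-Multi), which provides a structured framework for understanding and designing a unified Transformer paradigm in which understanding and generation coexist.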

Abstract: Multimodal modeling represents a vital step from modality-agnostic reasoning toward world modeling. While early approaches predominantly rely on late fusion that assembles encoders and frozen language backbones with output heads, recent efforts have shifted the paradigm toward native multimodal modeling (NMM) with the intrinsic integration of modalities for superior multimodal performance. Despite its potential, the design space of native architectures remains insufficiently defined. In this paper, we present the community with a formalized roadmap for this transition. Specifically, we formally define architectural nativity, distinguishing mid-fusion and early-fusion from non-native paradigms. We further organize the existing native models through the lens of input-output duality into three categories: (i) Multi-to-Text for cross-modal comprehension with text-only output; (ii) Multi-to-Target for scenario-oriented generation, e.g., image, audio and video generation; and (iii) Multi-to-Multi for unified modeling with symmetric input-output. We deliver a comprehensive and industrial-grade investigation into the transition toward the definitive NMM framework, where understanding and generation seamlessly coexist within a unified transformer paradigm. We systematically unpack the end-to-end pipeline from an industrial perspective, from architectural coordination and massive data curation to full-stack training recipes, inference and deployment, and comprehensive evaluation for truly native modeling.


[153] ERNIE-Image Technical Report cs.CV | cs.LGPDF

Jiaxiang Liu, Zhida Feng, Pengyu Zou, Zhenyu Qian, Tianrui Zhu

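TL;DR: ERNIE-Image is an open-source text-to-image model built on an 8B-parameter single-stream DiT architecture. It aims to narrow the gap between open-source models and leading closed-source systems through more effective mining of pre-training data and improved supervision quality during training. The model uses a bottom-up data construction pipeline for pre-training and a top-down pipeline for post-training, and it includes a lightweight Prompt Enhancer to improve practical usability.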

Details

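Motivation: The work aims to close the performance gap between current open-source text-to-image models and leading closed-source systems (such as DALL-E 3 and Midjourney) by improving model capability through more effective data mining and training supervision.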

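Result: Extensive qualitative and quantitative experiments show that ERNIE-Image achieves leading performance among open-source models and approaches top-tier commercial models in instruction following, text rendering, and aesthetic quality.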

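Insight: The innovations include: 1) a bottom-up pre-training data construction pipeline combining fine-grained categorization, rich annotation, aesthetic assessment, and hierarchical sampling; 2) a top-down post-training data pipeline for high-demand scenarios with a stabilized DPO strategy; 3) ERNIE-Image-Turbo for efficient generation and MT-DMD to mitigate capability drift during distillation; 4) a practical lightweight Prompt Enhancer; 5) the release of ERNIE-Image-Aes, an industrial-grade aesthetic model, and its accompanying evaluation benchmark.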

Abstract: We introduce ERNIE-Image, an open-source text-to-image generation model built upon an 8B single-stream DiT architecture. ERNIE-Image aims to bridge the gap between current open-source models and leading closed-source systems through more effective mining of large-scale pre-training data and improved supervision quality throughout training. During pre-training, we adopt a bottom-up data construction pipeline that combines fine-grained image categorization, rich caption annotation, aesthetic assessment, and hierarchical sampling. This strategy reduces data noise while preserving long-tail concepts and detailed real-world knowledge, providing a stronger foundation for complex generation tasks. In the post-training stage, we use a top-down data construction pipeline for high-demand scenarios, diversify prompt annotations to better match real user inputs, and apply a stabilized DPO strategy to align the model with human aesthetic preferences. We further train ERNIE-Image-Turbo for efficient 8-NFE generation and propose MT-DMD to mitigate capability drift during distillation. To make the model easier to use in practical scenarios, we equip it with a lightweight Prompt Enhancer that expands concise user intents into structured visual descriptions. In addition, we develop ERNIE-Image-Aes, an industrial-grade aesthetic model, together with ERNIE-Image-Aes-1K, a human-annotated benchmark for realistic aesthetic evaluation. Extensive qualitative and quantitative experiments show that ERNIE-Image achieves leading performance among open-source models and approaches top-tier commercial models in instruction following, text rendering, and aesthetic quality. We release the trained models and aesthetic resources to facilitate further academic research and technical progress in the AIGC community.


[154] Towards Reliable Fetal Ultrasound Interpretation with Multi-Agent Collaboration cs.CV | cs.MAPDF

Xiaotian Hu, Mingxuan Liu, Junwei Huang, Kasidit Anmahapong, Yifei Chen

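TL;DR: This paper proposes FetUSAgents, a tool-augmented multi-agent system for comprehensive fetal ultrasound interpretation that supports visual question answering, report generation, image captioning, and video summarization. Collaborative LLM agents coordinate task-specific visual tools, and a Dual-Path Evidence Arbitration (DPEA) mechanism integrates LLM-based deliberative reasoning with structured computational evidence from specialized visual tools to improve the reliability and traceability of interpretations.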

Details

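Motivation: Automated fetal ultrasound interpretation requires a multi-step workflow from visual perception to clinical understanding, but the prevailing “one-task, one-model” paradigm limits systematic integration of evidence across that workflow. Although multimodal large language models show promising visual understanding, their limited domain-specific grounding and hallucination risks restrict their reliability in fetal ultrasound analysis.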

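Result: Extensive out-of-distribution experiments on FetUS-VQA, a purpose-built VQA benchmark for fetal ultrasound, show that FetUSAgents outperforms general and medical MLLMs in VQA accuracy, exceeding the strongest baseline by more than 25 percent.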

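Insight: The novelty lies in a tool-augmented multi-agent collaboration framework that decomposes clinical queries into subtasks progressing from anatomical recognition to quantitative measurement, integrates LLM reasoning with structured computational evidence through Dual-Path Evidence Arbitration, and builds a retrieval-enhanced evidence bank to support traceable, clinically grounded conclusions. This suggests a scalable route toward evidence-driven clinical assistants for prenatal imaging.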

Abstract: Automated fetal ultrasound interpretation requires a workflow from visual perception, including plane recognition and anatomical segmentation, to clinical understanding, including biometric measurement and diagnostic reporting. However, the prevailing “one-task, one-model” paradigm limits systematic integration of evidence across this multi-step process. Although multimodal large language models (MLLMs) show promising visual understanding, their limited domain-specific grounding and hallucination risks restrict reliability in fetal ultrasound analysis. To address these limitations, we propose FetUSAgents, a tool-augmented multi-agent system for comprehensive fetal ultrasound interpretation, supporting visual question answering (VQA), report generation, image captioning, and video summarization. FetUSAgents coordinates task-specific visual tools through collaborative LLM agents and decomposes clinical queries into subtasks that progress from anatomical recognition to quantitative measurement. We further introduce Dual-Path Evidence Arbitration (DPEA), which integrates LLM-based deliberative reasoning with structured computational evidence from specialized visual tools. A retrieval-enhanced evidence bank consolidates intermediate findings to support traceable and clinically grounded conclusions. In addition, we construct FetUS-VQA, a dedicated VQA benchmark for fetal ultrasound, comprising 1,892 images and 3,205 question-answer pairs across 10 clinical tasks. Extensive out-of-distribution experiments show that FetUSAgents outperforms general and medical MLLMs, exceeding the strongest baseline by more than 25 percent in VQA accuracy. These results suggest a scalable route toward evidence-driven clinical assistants for prenatal imaging. Code is available.


[155] Can MLLMs Reason Beyond Language? VisReason: A Comprehensive Benchmark for Vision-Centric Reasoning cs.CVPDF

Longteng Guo, Yifan Wang, Pengkang Huo, Tailai Chen, Yuze Wu

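TL;DR: This paper introduces VisReason, a benchmark for evaluating multimodal large language models on vision-centric reasoning, particularly in everyday scenarios where perception and reasoning are tightly coupled. VisReason contains 1,505 questions across 10 categories spanning perceptual, structural, and conceptual reasoning. The evaluation shows that VisReason differs qualitatively from existing benchmarks, exposes a substantial gap between humans and current MLLMs, and reveals limited benefits from test-time reasoning strategies.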

Details

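Motivation: Current MLLMs perform strongly on visual reasoning benchmarks, but it remains unclear to what extent that performance reflects reasoning grounded in visual evidence, so a new benchmark is needed to evaluate vision-centric reasoning.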

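Result: Evaluation on VisReason shows that the benchmark poses a distinct challenge for existing MLLMs, that a substantial performance gap separates humans from models, and that test-time reasoning strategies yield limited gains.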

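Insight: The novelty lies in designing VisReason as a diagnostic benchmark focused on vision-centric reasoning. By emphasizing the coupling of perception and reasoning, it exposes the limitations of MLLMs in visually grounded reasoning and offers a new perspective for model evaluation.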

Abstract: Recent multimodal large language models (MLLMs) achieve strong performance on visual reasoning benchmarks, yet it remains unclear to what extent such performance reflects reasoning directly grounded in visual evidence. We introduce VisReason, a benchmark for vision-centric reasoning in everyday scenarios where perception and inference are tightly coupled. VisReason contains 1,505 questions across 10 categories spanning perceptual, structural, and conceptual reasoning. Our evaluation shows that VisReason poses a qualitatively different challenge from existing benchmarks, exposing substantial gaps between humans and current MLLMs and revealing limited benefits from test-time reasoning strategies. VisReason offers a focused diagnostic for evaluating vision-centric reasoning beyond language.


[156] MARVEL: Universal Murray’s Law-informed Vessel Tree Segmentation and Topology Estimation cs.CVPDF

Yi Zhou, Thiara Sana Ahmed, Jacqueline Chua, Meng Wang, Qinrong Zhang

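TL;DR: This paper proposes MARVEL, a framework that incorporates the biophysical prior of Murray's law into vessel tree segmentation and topology estimation, constraining branching structure through differentiable regularization to improve the physiological plausibility and topological consistency of segmentations.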

Details

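Motivation: Existing deep learning methods neglect the biophysical constraints of the vascular system, producing segmentations with unreliable branching structure and physiological plausibility, which undermines downstream clinical tasks such as blood flow simulation.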

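Result: Evaluated on eight public datasets, MARVEL outperforms baseline models in segmentation accuracy, topological consistency, and physiological plausibility, and, via graph-based hemodynamic simulation, significantly improves hypertension classification (p < 0.001).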

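Insight: The novelty lies in incorporating the local bifurcation constraints of Murray's law into training as differentiable regularizers, yielding a general framework that is agnostic to the segmentation backbone and improves the physiological consistency and clinical predictive value of vessel trees.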

Abstract: Vascular circulation follows fundamental biophysical principles that optimize mass transport and metabolic energy expenditure, which can be effectively modeled by Murray’s law. However, contemporary deep learning methods for vascular segmentation often neglect these biophysical constraints. This leads to physiologically implausible branching and misclassified vascular trees, rendering automated segmentation results unreliable for downstream clinical tasks such as blood flow simulation or disease quantification. In this paper, we introduce MARVEL (Universal MurrAy’s law-infoRmed Vessel sEgmentation and topoLogy estimation), a backbone-agnostic framework that integrates biophysical priors into vascular tree extraction. MARVEL combines per-pixel supervision with explicit radius predictions to enforce local bifurcation constraints derived from an empirical width-exponent mapping. We implement these constraints as differentiable regularizers during training to guide models toward physiologically consistent reconstructions. We evaluate MARVEL on eight public datasets across multiple vascular modalities and segmentation backbones. Results demonstrate MARVEL’s superior performance in segmentation accuracy, topological consistency, and physiological plausibility. By converting segmented masks into graph-based hemodynamic simulations, we demonstrate that MARVEL preserves the subtle pathological narrowing and topological connectivity required to distinguish hypertensive from normotensive eyes. Results show that MARVEL significantly improves the classification of hypertension via arteriovenous pressure differences in the eye (p < 0.001), outperforming baseline models in both topological consistency and clinical predictive value.
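
Murray's law relates the parent and child radii at a bifurcation as $r_p^k = \sum_i r_i^k$, with $k = 3$ in the classical form. The sketch below shows how such a relation could be turned into a penalty term. It is illustrative only: MARVEL derives its constraints from an empirical width-exponent mapping that the abstract does not spell out, so the fixed `exponent` argument, the relative-residual form, and the function names are assumptions. NumPy is used for clarity; in training, the same expression would be written in an autodiff framework to make it differentiable.

```python
import numpy as np

def murray_residual(r_parent, r_children, exponent=3.0):
    """Relative violation of r_parent**k == sum(r_children**k) per bifurcation.

    r_parent   : shape (n,) parent radii
    r_children : shape (n, m) child radii for each bifurcation
    """
    r_parent = np.asarray(r_parent, dtype=float)
    r_children = np.asarray(r_children, dtype=float)
    lhs = r_parent ** exponent
    rhs = (r_children ** exponent).sum(axis=-1)
    return (lhs - rhs) / lhs

def murray_penalty(r_parent, r_children, exponent=3.0):
    """Mean squared relative residual; zero when every bifurcation obeys the law."""
    return float(np.mean(murray_residual(r_parent, r_children, exponent) ** 2))

# A symmetric bifurcation that satisfies the classical law: each child has
# radius r_parent * 2**(-1/3), so the cubed child radii sum to the parent's.
ideal_child = 2.0 ** (-1.0 / 3.0)
ok = murray_penalty([1.0], [[ideal_child, ideal_child]])
# Children as wide as the parent violate it.
bad = murray_penalty([1.0], [[1.0, 1.0]])
```

Normalising by the parent term keeps the penalty comparable across large and small vessels, which is a design choice of this sketch rather than something the abstract specifies.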


[157] Adversarial Orthogonal Disentanglement for LVLM Hallucination Mitigation cs.CV | cs.AIPDF

Ruoxi Cheng, Haoxuan Ma, Zhengfei Hai, Yiyan Huang, Ranjie Duan

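TL;DR: This paper proposes Adversarial Orthogonal Disentanglement (AOD), a latent geometric framework for mitigating hallucinations in large vision-language models (LVLMs). The method learns a hallucination-related direction through a minimax objective and uses a training-free dual-forward-pass contrastive decoding strategy to suppress hallucinations while preserving the model's general capabilities. Experiments on several LVLMs and benchmarks show that AOD outperforms existing baselines in both hallucination mitigation and task utility.

Details

Motivation: Existing methods for mitigating LVLM hallucinations either rely on costly external interventions (such as instruction tuning and retrieval) or use internal mechanisms that are limited by flawed attention weights and entangled hidden representations. A more effective internal mechanism is therefore needed to disentangle representations and mitigate hallucinations.

Result: Experiments on three LVLMs across four hallucination benchmarks (e.g., POPE, AMBER) and four utility benchmarks (e.g., MMMU) show that AOD consistently outperforms strong baselines. It improves POPE accuracy by over 6% on average, boosts AMBER by 6%, and maintains strong performance on utility tasks.

Insight: The novelty lies in a geometric framework that disentangles hallucination-related signals in latent space through adversarial learning, together with a training-free dual-forward-pass contrastive decoding strategy. Viewed objectively, the method may capture general hallucination-related biases rather than dataset-specific artifacts, which suggests good transferability.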

Abstract: Large Vision-Language Models (LVLMs) have advanced multimodal understanding, yet their reliability is limited by hallucination, where generated content conflicts with visual facts. Existing mitigation methods either rely on costly external interventions, such as instruction tuning and retrieval, or use internal mechanisms that remain limited by flawed attention weights and entangled hidden representations. We propose Adversarial Orthogonal Disentanglement (AOD), a latent geometric framework for mitigating LVLM hallucinations. AOD learns a hallucination-related direction through a minimax objective: a classifier concentrates hallucination signals into the projected component, while an adversary removes them from the orthogonal residual space via a Gradient Reversal Layer. The learned direction enables a training-free dual-forward-pass contrastive decoding strategy that suppresses hallucinations while preserving general capabilities. Experiments on three LVLMs across four hallucination and four utility benchmarks show that AOD consistently outperforms strong baselines. It improves POPE accuracy by over 6% on average, boosts AMBER by 6%, and maintains strong performance on utility tasks such as MMMU. Further analysis shows robust transfer across datasets, suggesting that AOD captures general hallucination-related biases rather than dataset-specific artifacts. Our source code and datasets are available at https://github.com/Hunter-Wrynn/AOD.
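
The geometry described above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: it splits a hidden vector into its component along a given direction and the orthogonal residual, then combines the logits of two forward passes. The combination rule `(1 + alpha) * base - alpha * steered` is the commonly used contrastive-decoding form and is an assumption here, as is the idea that the second pass has the hallucination direction amplified; all names are hypothetical.

```python
import math


def split_along_direction(hidden, direction):
    """Return (coefficient, projected component, orthogonal residual)."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    coeff = sum(h * u for h, u in zip(hidden, unit))
    projected = [coeff * u for u in unit]
    residual = [h - p for h, p in zip(hidden, projected)]
    return coeff, projected, residual


def contrastive_logits(base_logits, steered_logits, alpha=1.0):
    """Down-weight tokens favoured by the hallucination-amplified pass (assumed form)."""
    return [(1 + alpha) * b - alpha * s for b, s in zip(base_logits, steered_logits)]


# Toy example with a 2-D hidden state and an axis-aligned direction.
coeff, projected, residual = split_along_direction([3.0, 4.0], [1.0, 0.0])
adjusted = contrastive_logits([1.0, 2.0], [2.0, 1.0], alpha=1.0)
```

In the framework described by the abstract, the classifier would read from the projected component and the adversary from the residual; the sketch only shows the decomposition that both operate on.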


[158] Weakly Supervised Camouflaged Object Detection Based on the SAM Model and Mask Guidance cs.CV | cs.AIPDF

Xia Li, Xinran Liu, Lin Qi, Junyu Dong

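TL;DR: This paper proposes MGNet, a weakly supervised camouflaged object detection method based on the SAM model and mask guidance. A custom cascaded mask decoder generates initial masks that guide the segmentation process, and a context enhancement module and a mask-guided feature aggregation module are introduced to reduce edge ambiguity and missed detections. In addition, BoxSAM uses bounding-box prompts to generate high-quality pixel-level pseudo-labels for training the network.

Details

Motivation: Camouflaged object detection is difficult because targets closely resemble their backgrounds. Existing fully supervised methods require large amounts of pixel-level annotation, while weakly supervised methods lose performance because they rely on coarse annotations. The work aims to balance accuracy against annotation efficiency.

Result: On camouflaged object detection benchmarks, the method performs competitively with current state-of-the-art methods, and extensive experiments validate its effectiveness.

Insight: The novelties are the BoxSAM strategy, which combines the SAM model with bounding-box prompts to generate high-quality pseudo-labels, and the MGNet architecture, composed of a cascaded mask decoder, a context enhancement module, and a mask-guided feature aggregation module, which together improve detection accuracy and edge detail under weak supervision.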

Abstract: Camouflaged object detection (COD) from a single image is a challenging task due to the high similarity between objects and their surroundings. Existing fully supervised methods require labor-intensive pixel-level annotations, making weakly supervised methods a viable compromise that balances accuracy and annotation efficiency. However, weakly supervised methods often experience performance degradation due to the use of coarse annotations. In this paper, we introduce a new weakly supervised approach for camouflaged object detection to overcome these limitations. Specifically, we propose a novel network, MGNet, which tackles edge ambiguity and missed detections by utilizing initial masks generated by our custom-designed Cascaded Mask Decoder (CMD) to guide the segmentation process and enhance edge predictions. We introduce a Context Enhancement Module (CEM) to reduce missed detections, and a Mask-guided Feature Aggregation Module (MFAM) for effective feature aggregation. For the weak supervision challenge, we propose BoxSAM, which leverages the Segment Anything Model (SAM) with bounding-box prompts to generate pseudo-labels. By employing a redundant processing strategy, high-quality pixel-level pseudo-labels are provided for training MGNet. Extensive experiments demonstrate that our method delivers competitive performance against current state-of-the-art methods.


[159] Anatomy-Anchored Self-Supervision: Distilling Vision Foundation Models for Invariant Ultrasound Representation cs.CV | cs.AIPDF

Chunzheng Zhu, Yijun Wang, Jianxin Lin, Feng Wang, Hongwei Wang

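TL;DR: This paper proposes ANAUS, an anatomy-anchored self-supervised learning framework for ultrasound image representation learning. A learnable latent prompt engine enables annotation-free segmentation of anatomical structures, on top of which the authors design a dual-policy self-supervised paradigm, consisting of inter-view semantics-aware anatomy-separating alignment and contextual core-region prediction, to strengthen representation learning.

Details

Motivation: Existing self-supervised pre-training methods for ultrasound images usually operate at the image or frame level and overlook clinically relevant anatomical context, so they struggle to learn representations aligned with clinical tasks. This paper aims to shift representation learning from generic visual regions to clinically meaningful anatomical structures.

Result: Extensive evaluation on six public datasets shows that ANAUS consistently outperforms current state-of-the-art (SOTA) methods on multiple downstream tasks while retaining the computational efficiency required for clinical deployment.

Insight: The main novelty is introducing anatomical priors into self-supervised learning for ultrasound images. Annotation-free anatomy segmentation combined with anatomy-based dual-policy learning (alignment and prediction) models both invariance within anatomical structures and fine-grained detail, producing representations that are more clinically relevant and more transferable.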

Abstract: The self-supervised pre-training paradigm has gained increasing prominence for learning transferable representations in medical imaging, yet existing methods for ultrasound (US) images operate at the image or frame level, overlooking the anatomical context needed for clinically aligned representation learning. In this work, we propose an anatomy-anchored ultrasound self-supervision framework, ANAUS, that shifts representation learning from generic visual regions to clinically meaningful anatomical structures. Utilizing a learnable latent prompt engine alongside a one-time domain adaptation on existing public image–mask pairs, we empower the LP-SAM module to achieve annotation-free anatomy delineation at scale. Building upon this anatomical grounding, we propose a dual-policy self-supervised learning paradigm consisting of inter-view semantics-aware anatomy-separating alignment and contextual core-region prediction to enhance representation learning. Specifically, the former enforces feature invariance within identical anatomical regions while promoting discriminability across distinct structures; the latter compels the model to reconstruct corrupted regions, thereby capturing fine-grained structural details. Extensive evaluations on six public datasets demonstrate that ANAUS consistently outstrips current state-of-the-art methods while maintaining the computational efficiency essential for clinical deployment. Code is available at https://github.com/zhcz328/ANAUS.


[160] MTLLFM: Multimodal-Temporal Laughter Localization: UR-FUNNY-Temporal and SMILE-Temporal Benchmarks with an Adaptive Multimodal Fusion Model cs.CVPDF

Eyal Hanania, Nadav Kirsch, Daniel Arkushin, Jonathan Benvenisti, Amos Bercovich

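TL;DR: This paper proposes MTLLFM, a weakly supervised framework for temporal laughter localization in video, and releases two datasets with precise temporal boundary annotations, UR-FUNNY-Temporal and SMILE-Temporal. The method uses pretrained HuBERT and MAE encoders together with temporal softmax pooling and adaptive modality gating, and learns fine-grained temporal localization from clip-level labels alone.

Details

Motivation: Existing methods treat laughter detection in video as a coarse clip-level classification task and cannot capture the precise temporal boundaries of brief laughter events. This paper aims to fill that gap with precise temporal laughter localization.

Result: Experiments on three datasets show that the method substantially outperforms multimodal foundation models, including Gemini 3 Flash, reaching 99% F1 and 68.1% localization precision on sports broadcast data. The precise temporal annotations improve downstream laughter reasoning by 227% on CIDEr and enable GPT-3.5 to outperform GPT-4o.

Insight: The novelty is a lightweight weakly supervised temporal localization framework that learns fine-grained temporal boundaries without frame-level annotation, together with an adaptive modality gating mechanism that fuses audio and visual information. The two released datasets, with high-quality temporal annotations, also provide a useful benchmark for related research.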

Abstract: Detecting laughter in video is essential for affective computing and narrative understanding, yet existing approaches treat it as coarse clip-level classification, failing to capture precise temporal boundaries of brief, transient laughter events. We address this gap with two complementary contributions. First, we introduce UR-FUNNY-Temporal and SMILE-Temporal, fully annotated temporal laughter datasets extending two widely-used humor benchmarks. Our annotations cover over 11,053 videos (78.8 hours) and provide precise onset/offset boundaries for each laughter event, along with rich metadata distinguishing speaker vs. audience laughter, modality dominance (acoustic, visual, or both), and intensity levels. Second, we propose a lightweight weakly-supervised framework for temporal laughter localization. Our architecture combines fixed HuBERT and MAE encoders with temporal softmax pooling and adaptive modality gating, learning fine-grained temporal grounding from clip-level labels without requiring frame-level annotations during training. Experiments across three datasets demonstrate that our approach substantially outperforms multimodal foundation models including Gemini 3 Flash, achieving 99% F1 and 68.1% localization precision on sports broadcast data. Ablations validate each architectural component. Furthermore, our precise temporal tags improve downstream laughter reasoning by 227% on CIDEr, enabling GPT-3.5 to outperform GPT-4o. The code, UR-FUNNY-Temporal and SMILE-Temporal datasets are publicly available at https://github.com/WSCSports/MTLLFM-temporal-laughter-localization.
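
As a rough illustration of how clip-level labels can supervise frame-level scores, the sketch below implements one common form of temporal softmax pooling and a scalar modality gate. It is an assumption-laden sketch rather than the MTLLFM architecture: the abstract names the two components but does not give their equations, so the attention-style pooling, the `temperature` parameter, and the sigmoid gate are illustrative choices, and all function names are hypothetical.

```python
import math


def temporal_softmax_pool(frame_scores, temperature=1.0):
    """Pool per-frame scores into one clip score using softmax weights over time."""
    peak = max(frame_scores)  # subtract the max for numerical stability
    weights = [math.exp((s - peak) / temperature) for s in frame_scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    clip_score = sum(w * s for w, s in zip(weights, frame_scores))
    return clip_score, weights


def gated_fusion(audio_score, visual_score, gate_logit):
    """Blend two modality scores with a sigmoid gate in [0, 1]."""
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return gate * audio_score + (1.0 - gate) * visual_score


# Hypothetical per-frame laughter scores with a single strong peak in the last frame.
clip_score, weights = temporal_softmax_pool([0.0, 0.0, 10.0])
fused = gated_fusion(audio_score=1.0, visual_score=0.0, gate_logit=0.0)
```

Because the clip score is dominated by the highest-scoring frames, a loss on the clip label sends most of its gradient to those frames, which is how pooling of this kind can yield temporal localization without frame-level annotation.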


[161] Binding Visual Features Point by Point cs.CV | cs.AIPDF

Udith Haputhanthri, Declan Campbell, Rim Assouel, Jonathan D. Cohen, Taylor W. Webb

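TL;DR: This paper studies the binding problem that vision language models exhibit on multi-object scenes and examines "pointing", the use of spatial coordinates expressed in text, as an analogue of the serial processing found in human vision that improves model performance. It finds that learning to point induces an internal visual search routine, eliminates binding errors, and enables compositional generalization.

Details

Motivation: Vision language models perform poorly on multi-object scene tasks because of the binding problem, that is, difficulty accurately binding object features in context. Inspired by the way human vision solves binding through serial processing, the study explores pointing with spatial coordinates as an analogous solution.

Result: Learning to point via text induces an internal visual search mechanism. Pointing behavior can be generalized to new tasks via fine-tuning, which eliminates binding errors and enables compositional generalization, providing a proof of principle that serial processing can solve the binding problem for vision language models.

Insight: The novelty lies in bringing the serial processing mechanism of human vision, emulated through pointing, to vision language models in order to address the binding problem. This suggests a direction for model design: using spatial coordinates to guide attention and improve the handling of multi-object scenes.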

Abstract: Despite success on standard benchmarks, vision language models display persistent failures on tasks involving processing of multi-object scenes, including many tasks that are relatively easy for humans. Recent work has found that these failures may stem from a basic inability to accurately bind object features in-context, a challenge that is referred to as the “binding problem” in cognitive science and neuroscience. The human visual system is thought to solve this binding problem via serial processing, attending to individual objects one at a time so as to avoid interference from other objects. Recent work has proposed “pointing” – the use of explicit spatial coordinates to refer to objects – as an analogous solution for vision language models, and found that it improves performance on challenging multi-object tasks. However, it is unclear $\textit{why}$ (i.e., on a mechanistic or representational level) this approach improves performance, and how directly this relates to serial processing in human vision. Here, we investigate this question. We find that learning to point-via-text induces an internal visual search routine, and we characterize the mechanisms that support this procedure. We also find that pointing behavior can be generalized to new tasks via fine-tuning, and that doing so eliminates binding errors and enables compositional generalization. These results provide a proof-of-principle that serial processing can solve the binding problem for vision language models just as it does for biological vision.


[162] Does Seeing More Mean Knowing More? Mono-Anchored Advantage Normalization for Multi-Source Visual Reasoning cs.CVPDF

Fanhu Zeng, Zhicong Luo, Zefan Wang, You Li, Chi Chen

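TL;DR: This paper proposes MARS, a mono-anchored multi-source visual reasoning framework that addresses the difficulty of telling information gain from interference when fusing multi-source inputs. The method treats each visual modality as an independent information source, uses mono-source rewards as dynamic anchors, and explicitly incorporates the information gain from multi-source fusion into advantage normalization, so that reinforcement learning with verifiable rewards (RLVR) adaptively promotes synergy between modalities while suppressing noise or conflicts.

Details

Motivation: Existing RLVR-based visual reasoning methods tend to treat multi-source inputs as a simple accumulation of information and lack an explicit mechanism for deciding whether an additional source contributes information gain or introduces interference. As a result, dynamic interactions are hard to model when modalities differ widely in physical properties and semantics (e.g., infrared and depth), and performance can even fall below that of mono-source reasoning.

Result: Theoretical analysis shows that the method effectively quantifies the information gain introduced by multi-source integration in gradient estimation, enabling consistent modality regulation. Experiments across diverse datasets show performance gains of 3.2% and 4.9% on GRPO and DAPO, respectively, confirming its effectiveness.

Insight: The core innovation is a mono-anchored advantage normalization mechanism that uses mono-source rewards as a dynamic reference to model the information gain of multi-source fusion explicitly and thereby regulate interactions between modalities adaptively. This offers a new theoretical framework and a practical method for information fusion and noise suppression in multimodal reinforcement learning.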

Abstract: Visual reasoning through reinforcement learning with verifiable rewards (RLVR) has achieved remarkable progress. However, when dealing with multi-source inputs, existing approaches tend to treat them as a mere accumulation of information, lacking explicit mechanisms to distinguish whether integrating additional sources yields information gain or introduces interference. Therefore, they struggle to effectively model dynamic interaction when integrating multiple sources, particularly when they differ significantly in physical properties and semantics, e.g., infrared and depth, leading to performance inferior to mono-source reasoning when a certain source holds the dominant signal. To address this issue, we propose MARS, a novel mono-anchored multi-source reasoning framework that models each visual modality as an independent information source. Specifically, by treating mono-source rewards as dynamic anchors, our method explicitly incorporates the information gain introduced by multi-source fusion into advantage normalization and adaptively emphasizes mutual promotion between sources while suppressing potential noise or conflicts during RLVR. Theoretical analysis shows that our method effectively quantifies the information gain introduced by multi-source integration in gradient estimation, enabling consistent modality regulation. Empirical results also show impressive 3.2% and 4.9% performance gains on GRPO and DAPO across diverse datasets, confirming the effectiveness of our method.
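
One way to picture the idea of anchoring is to contrast standard group-normalized advantages, as used in GRPO-style training, with a variant that measures multi-source rewards against a mono-source reference. The sketch below is a guess at the general shape only: the abstract does not give the MARS formula, so choosing the best mono-source mean as the anchor and dividing by the multi-source standard deviation are assumptions made purely for illustration, and all names are hypothetical.

```python
import statistics


def group_normalized_advantages(rewards, eps=1e-6):
    """Standard group normalization: centre on the group mean, scale by the std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def mono_anchored_advantages(multi_rewards, mono_rewards_by_source, eps=1e-6):
    """Assumed variant: centre multi-source rewards on the best mono-source mean."""
    anchor = max(statistics.fmean(rs) for rs in mono_rewards_by_source.values())
    std = statistics.pstdev(multi_rewards)
    return [(r - anchor) / (std + eps) for r in multi_rewards]


# Hypothetical rollout rewards: fusion is right half the time, one source is always right.
multi = [1.0, 0.0, 1.0, 0.0]
mono = {"rgb": [1.0, 1.0, 1.0, 1.0], "depth": [0.0, 0.0, 0.0, 0.0]}
plain = group_normalized_advantages(multi)
anchored = mono_anchored_advantages(multi, mono)
```

In this toy case, plain group normalization rewards the correct multi-source rollouts, while the anchored version gives no rollout a positive advantage because fusion never beats the dominant single source, which is the kind of signal the abstract describes as distinguishing information gain from interference.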


[163] Enhancing Single-Image Facial Demorphing using Multimodal Large Language Models cs.CVPDF

Nitish Shukla, Arun Ross

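TL;DR: This paper proposes a novel reference-free facial demorphing framework that uses multimodal large language models (MLLMs) to guide a coupled diffusion-based reconstruction process. It aims to recover the constituent face images from a single morphed face image, as a defense against morphing attacks on face recognition systems.

Details

Motivation: Face recognition systems are vulnerable to morphing attacks. Existing detection methods can only flag morphed images and cannot recover the constituent images or identities, which limits their forensic utility.

Result: At strict operating points, RGB-domain demorphing outperforms latent-space approaches by 30–40%. Ablation studies show that full MLLM embeddings, whose semantic structure is enhanced by multimodal pretraining, offer substantial advantages over raw ViT features.

Insight: The core innovation is using semantic embeddings from intermediate MLLM layers to guide demorphing, which provides high-level reasoning about facial attributes and identity cues. Demorphing is formulated as a coupled conditional generation problem in which a denoising diffusion model operating directly in the RGB domain synthesizes both constituent faces jointly, ensuring inter-identity consistency while preserving fine-grained perceptual detail. The method also avoids lossy text generation and re-encoding cycles by using MLLM hidden states directly as conditioning signals.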

Abstract: Face recognition systems are increasingly vulnerable to morphing attacks, where a composite image is crafted to match multiple identities, enabling unauthorized access and identity fraud. Existing detection methods identify morphed images but cannot recover constituent images or identities, limiting their forensic utility. This paper presents a novel reference-free facial demorphing framework that leverages Multimodal Large Language Models (MLLMs) to guide a coupled diffusion-based reconstruction process. Our key innovation lies in extracting semantic embeddings from intermediate MLLM layers to condition the demorphing, providing high-level reasoning about facial attributes and identity cues that complement low-level pixel information. We formulate demorphing as a coupled conditional generation problem, where both constituent faces are synthesized jointly through a denoising diffusion model operating directly in the RGB domain, ensuring inter-identity consistency while preserving fine-grained perceptual details. Unlike prior approaches that rely on compressed latent representations or assume identity overlap between training and testing sets, our method bypasses lossy text generation-reencoding cycles by directly utilizing MLLM hidden states as conditioning signals, enabling the denoising network to attend to subtle visual cues such as hair, background, and facial textures. Ablation studies further reveal that middle MLLM layers encode more identity-discriminative representations, RGB-domain demorphing outperforms latent-space approaches by 30–40% at strict operating points, and full MLLM embeddings provide substantial advantages over raw ViT features through enhanced semantic structuring from multimodal pretraining.


[164] Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion cs.CVPDF

Ting-Hsuan Chen, Ying-Huan Chen, Tao Tu, Jie-Ying Lee, Cho-Ying Wu

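TL;DR: Pantheon360 is a controllable 360° video generation framework that uses a 3D-aware diffusion model to synthesize high-fidelity video from sparse 360° inputs. Its core is an explicit 3D cache, reconstructed from the input, that serves as a geometric scaffold for arbitrary user-defined camera paths, allowing photorealistic texture refinement while global geometric consistency is maintained.

Details

Motivation: Generating complete digital twins from video requires precise camera control, global scene coverage, and strict spatial-temporal consistency, which conventional perspective video generators struggle to deliver because of their limited field of view.

Result: Experiments show that Pantheon360 performs strongly in visual quality and geometric consistency, enabling reliable and flexible 360° scene generation for downstream simulation and digital-twin applications.

Insight: The novelty lies in exploiting the panoramic coverage of 360° video to simplify trajectory design and in introducing an explicit 3D cache as a geometric constraint, so that the diffusion model can focus on texture refinement. Together these address cross-view inconsistency and temporal drift.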

Abstract: Generating complete digital twins from videos requires precise camera control, global scene coverage, and strict spatial-temporal consistency, constraints that remain challenging for perspective video generators due to their limited field of view (FoV). Their narrow FoV forces long or multi-view trajectories, amplifying cross-view inconsistency and temporal drift. We argue that 360° video generation offers a natural solution: panoramic coverage simplifies trajectory design and provides a strong global context for maintaining coherence. We introduce Pantheon360, a controllable 360° video generation framework that synthesizes high-fidelity videos from sparse 360° inputs. The key idea is an explicit 3D Cache, reconstructed from the input, which serves as a geometric scaffold for any user-defined camera path. This allows the diffusion model to focus on photorealistic texture refinement while the 3D Cache enforces global geometric consistency. Experiments show that Pantheon360 achieves superior visual quality and unmatched geometric coherence, enabling reliable and flexible 360° scene generation for downstream simulation and digital-twin applications.


[165] MetaphorVU: Towards Metaphorical Video Understanding cs.CVPDF

Zhuoqun Li, Boxi Cao, Guiping Jiang, Fangrui Lv, Ruotong Pan

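TL;DR: This paper proposes MetaphorVU-Bench, the first systematic benchmark for metaphorical video understanding, designed to evaluate the high-order cognitive capabilities of multimodal large language models (MLLMs). The study finds that existing MLLMs perform poorly on metaphorical video understanding, far below human level, mainly because of defective cross-domain mapping. In response, the authors construct a metaphor knowledge graph as mapping augmentation and propose MetaphorBoost, an inference-time enhancement framework, to improve model performance.

Details

Motivation: Metaphorical videos are widely used in real-world scenarios to convey complex concepts, and understanding them requires high-order cognitive capabilities. The lack of systematic research on metaphorical video understanding limits the practical applicability of MLLMs and hinders a thorough assessment of their high-order cognitive capabilities.

Result: Experiments show that current MLLMs struggle with metaphorical video understanding and lag far behind human level. The proposed MetaphorBoost framework yields consistent performance improvements, although the abstract does not report specific benchmark numbers.

Insight: The novelties are MetaphorVU-Bench, the first systematic benchmark for metaphorical video understanding, and MetaphorBoost, an inference-time framework that strengthens cross-domain mapping with a metaphor knowledge graph. Together they provide a foundation and a methodological reference for future work on the high-order cognitive capabilities of MLLMs.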

Abstract: Metaphorical videos are prevalent across various real-world scenarios to convey complex ideas, and understanding them typically requires high-order cognitive capabilities. The lack of systematic studies on metaphorical video understanding not only constrains the real-world applicability of MLLMs but also impedes the thorough assessment of their high-order cognitive capabilities. To bridge this gap, we propose MetaphorVU-Bench, the first systematic and comprehensive benchmark dedicated to metaphorical video understanding. Through experiments, we find current MLLMs struggle with accurate metaphorical video understanding, lagging far behind human level, primarily due to defective cross-domain mapping. Motivated by this finding, we construct a metaphor knowledge graph as mapping augmentation and propose MetaphorBoost, an inference-time enhancement framework achieving consistent performance improvement. Our benchmark, analysis, and method provide useful insights and a foundation for future research on advancing MLLMs.


[166] MAIL++: Multi-Modal Bi-directional Agent Layer for Vision-Language Models cs.CVPDF

Kaixiang Chen, Pengfei Fang, Hui Xue

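TL;DR: This paper proposes MAIL and MAIL++, a parameter-efficient fine-tuning (PEFT) paradigm for vision-language models (VLMs). The method inserts lightweight agent layers after core computation modules inside the VLM (such as LayerNorm) and introduces a cross-modal bridging mechanism, implementing fine-grained bidirectional vision-language coupling directly inside the model and improving downstream performance while preserving inference efficiency.

Details

Motivation: Existing PEFT methods for VLMs (such as prompts or adapters) typically rely on external auxiliary modules for cross-modal coupling, which leaves them structurally decoupled from the original VLM, with indirect and coarse-grained interactions that limit representational expressiveness. This paper aims to embed cross-modal coupling directly into the intrinsic computation modules of the VLM for more effective fine-tuning.

Result: Extensive experiments on few-shot image classification and few-shot universal cross-domain retrieval show that MAIL and MAIL++ consistently outperform state-of-the-art PEFT methods.

Insight: The main novelties are: 1) a new PEFT paradigm that embeds cross-modal coupling directly into the core computation modules of the VLM; 2) a bottleneck-based text-to-image bridge that jointly optimizes paired agent layers across modalities; 3) bidirectional cross-modal exchange in MAIL++ through a meta agent layer and meta bridges; 4) re-parameterization of all agent layers into the frozen backbone at inference time, which preserves the original computational efficiency.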

Abstract: Adapting large vision-language models (VLMs) such as CLIP to downstream tasks remains challenging, as full fine-tuning is computationally prohibitive and prone to overfitting in low-data regimes. Parameter-efficient fine-tuning (PEFT) alleviates these issues with lightweight prompt- or adapter-based modules, and cross-modal coupling has proven especially effective by strengthening interactions between vision and language. However, existing coupling mechanisms predominantly rely on external auxiliary modules, leading to indirect, coarse-grained interactions that are structurally decoupled from the original VLM and thus limit representational expressiveness. In this paper, we propose Multi-Modal Interactive Agent Layer (MAIL), a PEFT paradigm that embeds cross-modal coupling directly into the intrinsic computation modules of VLMs. MAIL freezes the backbone and inserts lightweight agent layers after core modules, such as LayerNorm, to approximate the parameter updates induced by full fine-tuning. To couple visual and textual streams at this level, we introduce a bottleneck-based text-to-image bridge that jointly optimizes paired agent layers across modalities, coordinating the adaptation of corresponding computation modules. We further present MAIL++, which enables bidirectional cross-modal exchange through a meta agent layer, a meta-text bridge, and a meta-image bridge. At inference time, all agent layers are re-parameterized into the frozen backbone, preserving the original computational efficiency. Extensive experiments on few-shot image classification and few-shot universal cross-domain retrieval demonstrate that MAIL and MAIL++ consistently outperform state-of-the-art PEFT methods.
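
The claim that agent layers can be folded into the frozen backbone at inference time is easiest to see for an element-wise affine layer placed after LayerNorm. The sketch below assumes exactly that form (a per-channel `scale` and `shift`), which is a simplification chosen for illustration; the abstract does not specify the agent layer's parameterization, and all names here are hypothetical.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over a single feature vector with affine parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b for v, g, b in zip(x, gamma, beta)]


def agent_layer(y, scale, shift):
    """Assumed agent layer: a per-channel affine map applied after LayerNorm."""
    return [s * v + t for v, s, t in zip(y, scale, shift)]


def merge_agent_into_layer_norm(gamma, beta, scale, shift):
    """Fold the agent layer into LayerNorm: s*(g*n + b) + t = (s*g)*n + (s*b + t)."""
    merged_gamma = [s * g for s, g in zip(scale, gamma)]
    merged_beta = [s * b + t for s, b, t in zip(scale, beta, shift)]
    return merged_gamma, merged_beta


x = [1.0, 2.0, 4.0]
gamma, beta = [1.0, 0.5, 2.0], [0.0, 0.1, -0.2]
scale, shift = [1.1, 0.9, 1.0], [0.01, -0.02, 0.0]

two_step = agent_layer(layer_norm(x, gamma, beta), scale, shift)
merged = layer_norm(x, *merge_agent_into_layer_norm(gamma, beta, scale, shift))
```

After merging, inference runs a single LayerNorm with updated parameters, so the adapted model has the same layer count and cost as the original backbone.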


[167] Test-Time Self-Adaptive Conditioning for Stable Audio-Driven Talking-Head Generation cs.CV | cs.AI | cs.MMPDF

Zhicheng Zhang, Lei Wang, Yu Zhang, Yongsheng Gao

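TL;DR: This paper proposes Test-Time Self-Adaptive Conditioning (TT-SAC), a parameter-free inference framework that addresses the mismatch between a static reference image and dynamic facial motion in audio-driven talking-head generation. It composes a pretrained generator with its encoder in a feedback loop and uses the generator's own outputs to build a refined conditioning representation, improving identity consistency, temporal coherence, and perceptual quality without retraining or gradient updates.

Details

Motivation: Existing audio-driven talking-head generation methods (such as AniTalker, FLOAT, and Sonic) condition on a single static reference image at inference time, which creates a mismatch between fixed identity features and dynamically evolving facial motion and leads to identity drift, temporal inconsistency, and degraded perceptual quality.

Result: Extensive experiments on several state-of-the-art talking-head generators and benchmark datasets show consistent improvements in lip-sync accuracy, temporal coherence, identity preservation, and perceptual fidelity.

Insight: The core innovation is a training-free, model-agnostic test-time adaptive conditioning framework that dynamically refines the conditioning representation through a generator-encoder feedback loop. Theoretical analysis shows that, under mild Lipschitz assumptions, the method reduces feature variance and improves generative stability, and that a principled bias-variance tradeoff governs the strength of adaptation.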

Abstract: Audio-driven talking-head generation has achieved remarkable progress with recent models such as AniTalker, FLOAT, and Sonic. Despite their success, most existing approaches rely on a single static reference image to condition the entire video generation process at inference stage. This static conditioning paradigm often creates a mismatch between fixed identity features and dynamically evolving facial motion, leading to identity drift, temporal inconsistency, and degraded perceptual quality. We introduce Test-Time Self-Adaptive Conditioning (TT-SAC), a parameter-free inference framework that enables pretrained talking-head generators to adapt their conditioning representations during inference without retraining, gradient updates, or additional supervision. Instead of treating the reference portrait as immutable, TT-SAC composes the generator with its encoder in a feedback loop: the generator’s own outputs are re-encoded to construct a refined conditioning representation that better aligns with the temporal dynamics of the synthesized sequence. A single adaptation step approximates a self-consistent equilibrium of the generative process, stabilizing identity and motion across time. We further provide theoretical analysis showing that test-time conditioning adaptation reduces feature variance and improves generative stability under mild Lipschitz assumptions, while exhibiting a principled bias-variance tradeoff that governs the optimal strength of adaptation. Extensive experiments on state-of-the-art talking-head generators and benchmark datasets demonstrate consistent improvements in lip-sync accuracy, temporal coherence, identity preservation, and perceptual fidelity. TT-SAC offers a model-agnostic and training-free strategy for enhancing generative video models, establishing test-time conditioning adaptation as an effective mechanism for stabilizing audio-driven portrait animation.
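
The feedback loop can be sketched with placeholder callables. This is a schematic under stated assumptions, not the TT-SAC procedure: averaging the re-encoded frame features and blending them with the reference feature through a `strength` parameter are illustrative choices, and `generator`, `encoder`, and every other name below are hypothetical stand-ins for a real pretrained model's components.

```python
def self_adaptive_condition(generator, encoder, reference_feature, audio, strength=0.5):
    """One adaptation step: generate, re-encode the output, refine the condition, regenerate."""
    draft_frames = generator(reference_feature, audio)
    frame_features = [encoder(frame) for frame in draft_frames]
    dim = len(reference_feature)
    mean_feature = [
        sum(f[i] for f in frame_features) / len(frame_features) for i in range(dim)
    ]
    # Convex blend: strength=0 keeps the static reference, strength=1 trusts the outputs.
    refined = [
        (1.0 - strength) * r + strength * m
        for r, m in zip(reference_feature, mean_feature)
    ]
    return generator(refined, audio), refined


# Toy stand-ins: each "frame" is the condition shifted by one audio value.
def toy_generator(condition, audio):
    return [[c + a for c in condition] for a in audio]


def toy_encoder(frame):
    return list(frame)


frames, refined = self_adaptive_condition(
    toy_generator, toy_encoder, reference_feature=[0.0, 2.0], audio=[1.0, 3.0]
)
```

No gradients or parameter updates appear anywhere in the loop, which matches the training-free description; the `strength` value plays the role of the adaptation strength that the abstract ties to a bias-variance tradeoff.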


[168] Full-4D: Generating Full-Scope 4D Scenes from a Single-View Video cs.CVPDF

Tingxi Chen, Ke Hao, Yabo Chen, Zhengxue Cheng, Rong Xie

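TL;DR: This paper proposes Full-4D, an end-to-end framework for generating full-coverage dynamic 4D scenes from a single-view video. The problem is decomposed into multi-view video synthesis followed by optimization-based 4D reconstruction. High-quality, geometrically consistent full 4D scene generation is achieved through three components: Real-MV-4D, a large-scale real-world multi-view video dataset; a multi-view video diffusion model with fused time-view attention; and 4D Gaussian Splatting (4DGS) reconstruction regularized by a Flow Matching Distillation loss.

Details

Motivation: Generating 4D scenes from a single-view video is ill-posed because a single viewpoint lacks the information needed to describe the complete dynamic scene. Existing methods are usually limited to monocular videos, simple 3D effects, or small perturbations around the original viewpoint and fall short of true full 4D generation. The lack of large-scale synchronized multi-view video datasets also hinders progress in this direction.

Result: Extensive experiments show that the method outperforms existing approaches in both visual fidelity and geometric consistency and can generate full-coverage 4D scenes from single-view videos.

Insight: The main novelties are: 1) Real-MV-4D, a large-scale real-world dataset of synchronized multi-view videos that provides 4D supervision; 2) a multi-view video diffusion model with fused time-view attention that embeds geometric reprojection priors and explicit camera conditioning directly into view-time interactions, strictly aligning generation with physical 3D priors; 3) the use of 4D Gaussian Splatting (4DGS) as an explicit 4D representation, regularized by a Flow Matching Distillation loss that exploits the multi-view prior to improve novel-view rendering quality.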

Abstract: Generating 4D scenes from a single-view video is inherently ill-posed: a single viewpoint lacks the information needed to recover a complete, dynamic scene with full coverage. Existing methods are typically limited to monocular videos, simple 3D effects, or only small viewpoint perturbations around the original viewpoint, falling short of true 4D generation. Meanwhile, the lack of large-scale datasets capturing full-scope 4D scenes with synchronized multi-view videos further hinders progress in this direction. We propose a novel single-view video-to-4D framework that casts full-scope 4D generation as multi-view video synthesis followed by optimization-based 4D reconstruction from the generated views. To instantiate this formulation end-to-end, we make three key contributions. First, we introduce Real-MV-4D, a large-scale dataset of synchronized multi-view videos captured in diverse real-world environments to provide the 4D supervision. Second, we train a multi-view video diffusion model driven by a novel fused time(T)-view(V) attention mechanism that directly embeds geometric reprojection priors and explicit camera conditioning into its view-time interactions. Unlike basic feature fusion, this direct binding strictly aligns the generation process with physical 3D priors to produce a dense, synchronized T$\times$V video grid. Third, rather than relying on non-interactive and inconsistent 2D video interpolations, we lift the synthesized multi-view videos into an explicit 4D representation (i.e., 4DGS), regularized by a Flow Matching Distillation loss that exploits the multi-view prior to improve novel-view rendering. Extensive experiments demonstrate that our method outperforms existing approaches in both visual fidelity and geometric consistency, enabling full-scope 4D scene generation from single-view videos.


[169] ProSR: Process-Shaped Spatial Reasoning for Reliable Chain-of-Thought in VLMs cs.CVPDF

Jiangyang Li, Cong Wan, Changjie Wu, Songlin Dong, Lingjun Zhang

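TL;DR: This paper proposes ProSR, a process-shaping optimization framework that addresses unreliable reasoning processes in the spatial reasoning of vision-language models. By constructing a high-quality chain-of-thought dataset and diagnosing the model's reasoning process, the paper identifies two typical forms of process degradation that arise during reinforcement learning optimization: spurious grounding and tail instability. ProSR introduces a Counterfactual Invariance Penalty and a Tail Drift Penalty, extending the optimization objective from answer correctness alone to two process-level dimensions, visual dependence and trajectory stability, and thereby improving the model's spatial reasoning.

Details

Motivation: Existing training paradigms for spatial reasoning in vision-language models rely mainly on outcome alignment or process imitation and lack explicit constraints on the reasoning process, which makes it hard to ensure genuine visual dependence and stable reasoning trajectories.

Result: Experiments on multiple complex and out-of-distribution spatial reasoning benchmarks show that ProSR improves answer accuracy while producing reasoning trajectories that are more stable and more dependent on visual evidence.

Insight: The core innovation is extending the optimization objective from the outcome level to the level of the reasoning process, using process-shaping penalties (the Counterfactual Invariance Penalty and the Tail Drift Penalty) to constrain the model's visual dependence and trajectory stability. This offers a new route to more reliable and more interpretable model reasoning.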

Abstract: Reliable spatial reasoning remains a core bottleneck for vision-language models (VLMs). Existing mainstream training paradigms for spatial reasoning largely rely on outcome alignment or process imitation, lacking explicit constraints on the reasoning process, and therefore struggle to ensure genuine visual dependence and stable reasoning trajectories. In this paper, we construct a high-quality CoT dataset covering diverse spatial phenomena and diagnose the model’s reasoning process, revealing two typical types of process degradation during reinforcement learning optimization: Spurious Grounding, which bypasses visual evidence, and Tail Instability, where uncertainty abnormally rises in the later stage of reasoning. To address these issues, we propose ProSR, a process-shaping optimization framework for spatial reasoning. Through a Counterfactual Invariance Penalty and a Tail Drift Penalty, ProSR extends the optimization objective from single answer correctness to two process-level dimensions: visual dependence and trajectory stability. Experiments on multiple complex and out-of-distribution spatial reasoning benchmarks show that ProSR improves answer accuracy while generating reasoning trajectories that are more stable and more dependent on visual evidence.


[170] Tetris: Tile-level Sampling for Efficient and High-Fidelity Video Object Tracking cs.CV | cs.DBPDF

Chanwut Kittivorawong, Alena Chao, Charlie Si, Alvin Cheung

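TL;DR: Tetris is a system for efficient, high-fidelity video object tracking. It decomposes video into a tile-based polyomino data model, enabling fine-grained spatiotemporal pruning that reduces detector calls with minimal fidelity loss.

Details

Motivation: Existing systems cut tracking cost through temporal frame sampling, which discards inter-frame motion information. Tetris targets the efficient extraction of reusable object tracks from stationary video, exploiting the fact that most of each frame contains no objects of interest and that the remaining regions tolerate different sampling rates.

Result: Across 7 stationary-video datasets, Tetris stays within a 5% tracking accuracy loss relative to a full-frame, every-frame reference pipeline, whereas prior systems exceed this bound on 3 of the 7 datasets. At this bound, its throughput is up to 17.4x higher than that of prior systems and up to 68.8x higher than that of the reference pipeline.

Insight: The novelties are a tile-level polyomino data model that enables fine-grained pruning and three operators that run upstream of the detector (a classifier, an integer linear program, and a packer) to minimize detector calls, which substantially improves efficiency while preserving high fidelity.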

Abstract: Track materialization converts raw video into reusable object tracks that downstream queries can run against without rerunning tracking, but extracting those tracks efficiently and with high fidelity remains expensive. Prior systems reduce cost through temporal frame sampling, erasing the inter-frame motion that fine-grained tracking requires. In stationary video, however, large portions of each frame contain no objects of interest, and the remaining regions tolerate different sampling rates. We present Tetris, a track-extraction system that decomposes videos into a tile-based polyomino data model, enabling fine-grained spatiotemporal pruning that reduces detector calls with minimal fidelity loss. Tetris runs three operators upstream of the user-provided detector: a classifier identifies relevant tiles and groups them into polyominoes, an integer linear program (ILP) prunes redundant polyominoes under a user-specified accuracy constraint, and a packer assembles the survivors into canvases that minimize detector calls. Across 7 stationary-video datasets, Tetris stays within a 5% tracking accuracy loss of a full-frame, every-frame reference pipeline, whereas prior systems exceed this bound on 3 of the 7 datasets. At this 5% bound, Tetris achieves up to 17.4x higher throughput than prior systems and up to 68.8x higher than the reference pipeline. The project page is at https://tetris-db.github.io.
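
The step of grouping relevant tiles into polyominoes can be illustrated as connected-component labelling on a tile grid. The sketch below is a minimal reading of that step, not the Tetris implementation: it assumes a single frame, a boolean relevance grid as the classifier output, and 4-connectivity between tiles; the function name and grid layout are hypothetical.

```python
from collections import deque


def group_into_polyominoes(relevant):
    """Group 4-connected relevant tiles into polyominoes (lists of (row, col) cells)."""
    rows, cols = len(relevant), len(relevant[0])
    seen = set()
    polyominoes = []
    for r in range(rows):
        for c in range(cols):
            if not relevant[r][c] or (r, c) in seen:
                continue
            # Breadth-first search collects every tile reachable from (r, c).
            queue = deque([(r, c)])
            seen.add((r, c))
            cells = []
            while queue:
                cr, cc = queue.popleft()
                cells.append((cr, cc))
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and relevant[nr][nc] and (nr, nc) not in seen:
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            polyominoes.append(sorted(cells))
    return polyominoes


# Hypothetical classifier output for one frame split into a 3x4 tile grid.
grid = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
shapes = group_into_polyominoes(grid)
```

Each resulting shape is a unit that a later stage could keep, prune, or pack onto a shared canvas, so that the detector runs on a few dense canvases instead of on every full frame.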


[171] ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation cs.CV | cs.ROPDF

Huan Ren, Yihan Chen, Chuxin Wang, Nailong Liu, Wenfei Yang

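TL;DR: This paper proposes ComPose, a unified completion-pose framework for robust category-level object pose estimation. A keypoint-based progressive completion module recovers a complete shape representation, and a geometric relation encoding module enriches keypoint features, addressing the difficulty that incomplete observed point clouds pose for pose estimation.

Details

Motivation: Existing methods are limited by the inherent incompleteness of observed point clouds and struggle to capture complete object shapes for robust pose reasoning. Treating point cloud completion as a separate preprocessing step introduces compounding errors and extra computational overhead, which hurts both accuracy and efficiency.

Result: Extensive experiments on standard benchmarks show that the method outperforms existing state-of-the-art approaches without relying on category-level shape priors.

Insight: The novelty lies in tightly integrating shape completion and pose estimation in a single framework. Progressive keypoint completion and a geometric relation consistency loss ensure globally coherent coordinate transformations and improve robustness to incomplete point clouds.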

Abstract: Category-level object pose estimation aims to predict the pose and size of arbitrary objects in specific categories. Existing methods struggle with the inherent incompleteness of observed point clouds, which limits their ability to capture complete object shapes for robust pose reasoning. While point cloud completion offers a promising solution, naively treating it as a separate preprocessing step for partial observations introduces compounding errors and additional computational overhead, ultimately hindering both accuracy and efficiency. To address these challenges, we propose ComPose, a novel unified framework that tightly integrates shape completion to provide complete geometric cues for enhanced pose estimation. At the core of ComPose is a keypoint-based progressive completion module, which recovers full shape representations by progressively predicting a sparse set of keypoints and their surrounding dense point sets, empowering the keypoints to capture holistic object geometries. A geometric relation encoding module further enriches keypoint features with both local and global geometric context. In addition, we introduce a novel geometric relation consistency loss to enforce structural alignment between observed keypoints and their predicted NOCS coordinates, ensuring globally coherent coordinate transformations. Extensive experiments on standard benchmarks demonstrate that our method outperforms state-of-the-art approaches without relying on category-level shape priors.


[172] Rethinking Scribble-Guided Image Editing: Generalization, Instruction Adherence, and Multi-Tasking cs.CVPDF

Mingyi Xu, Jinpeng Lin, Min Zhou, Tiezheng Ge, Ming Zeng

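TL;DR: This paper revisits the scribble-guided image editing paradigm, observes that existing models perform unstably in multi-task scenarios, and shows that instruction-level generalization (across editing tasks, and from single-task to multi-task settings) is harder than image-domain generalization. It proposes three strategies, a Coverage-then-Realism Curriculum, Multi-Task Mosaicking, and an Edit-Focused Loss, which substantially improve single-task and multi-task editing on the VIBE benchmark and reach state-of-the-art (SOTA) results.

Details

Motivation: To address the unstable performance of existing scribble-guided image editing models in multi-task scenarios and to locate their generalization bottleneck, which turns out to be insufficient learning of diverse instructions rather than the image-domain gap.

Result: On the VIBE benchmark, the proposed method substantially improves both single-task and multi-task scribble-guided editing and achieves state-of-the-art (SOTA) results.

Insight: The novelty lies in identifying instruction-level generalization as the key bottleneck of scribble-guided editing, and in proposing targeted data construction (curriculum learning and multi-task mosaicking) and loss design (the Edit-Focused Loss) that improve, at low cost, the model's ability to understand and follow diverse editing instructions.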

Abstract: Scribble-guided image editing allows users to combine simple scribble annotations with text prompts to specify both where and how an image should be edited, enabling flexible interaction with precise spatial control. However, existing models still exhibit unstable performance under this paradigm, especially in multi-task scenarios. To improve performance, we conduct empirical studies using an open-source editing model and reveal an asymmetry in generalization: instruction-level generalization, including across editing tasks and from single-task to multi-task settings, is more challenging than image-domain generalization, such as from synthetic to real-world images or from mosaicked to regular images. This suggests that the primary bottleneck lies in insufficient learning for diverse editing instructions rather than in the image domain gap. Motivated by this insight, we propose three strategies: (a) a Coverage-then-Realism Curriculum, a two-stage pipeline that first builds large-scale synthetic, instruction-rich data for broad task supervision, then curates a small set of real-world data to refine generation realism; (b) Multi-Task Mosaicking, which constructs multi-task training samples by concatenating single-task examples at nearly zero cost while enabling the learned capability to generalize to non-mosaicked images; and (c) an Edit-Focused Loss, which leverages the changed regions between input and output images in synthetic data to focus training on edited regions, improving both learning efficiency and editing accuracy. With these strategies, we substantially improve both single-task and multi-task scribble-guided editing on the VIBE benchmark, achieving state-of-the-art results. We will publicly release our dataset and model.
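As a rough illustration of strategy (c), the sketch below derives a changed-region mask from a synthetic input/output pair and up-weights those pixels in a reconstruction loss. The abstract does not specify the loss, so the mean-squared-error base, the `threshold`, the `edit_weight`, and the function names `edit_mask` and `edit_focused_mse` are all assumptions made for illustration, not the authors' formulation.

```python
import numpy as np

def edit_mask(source, target, threshold=0.05):
    """Mark pixels whose mean absolute channel difference exceeds `threshold`."""
    diff = np.abs(target.astype(np.float64) - source.astype(np.float64)).mean(axis=-1)
    return diff > threshold  # boolean (H, W) mask of edited pixels

def edit_focused_mse(pred, target, mask, edit_weight=5.0, base_weight=1.0):
    """Weighted MSE that emphasises edited pixels (hypothetical weighting scheme)."""
    weights = np.where(mask, edit_weight, base_weight)   # (H, W)
    per_pixel = ((pred - target) ** 2).mean(axis=-1)      # (H, W)
    return float((weights * per_pixel).sum() / weights.sum())

# Toy pair: a 4x4 image in which a 2x2 patch was edited.
source = np.zeros((4, 4, 3))
target = source.copy()
target[1:3, 1:3] = 1.0
pred = source.copy()  # a "model" that ignores the edit entirely

mask = edit_mask(source, target)
loss_plain = edit_focused_mse(pred, target, mask, edit_weight=1.0)
loss_focused = edit_focused_mse(pred, target, mask)
```

In this toy case a prediction that ignores the edit is penalised more under the focused weighting than under uniform weighting, which is the intuition behind concentrating training on edited regions.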


[173] ControlLight: Towards Controllable, Consistent, and Generalizable Low-Light Enhancement cs.CVPDF

Yufeng Yang, Jianzhuang Liu, Jisheng Chu, Yuqi Peng, Xianfang Zeng

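TL;DR: This paper proposes ControlLight, a controllable, consistent, and generalizable low-light enhancement framework. It targets the weak generalization and poor controllability of existing methods, which stem from limited training data and single enhancement targets. The framework builds a large-scale dataset of real-world degraded images and introduces a misalignment-aware weighted flow matching loss to keep outputs consistent across control strengths.

Details

Motivation: Existing deep-learning-based low-light enhancement methods are usually trained on limited datasets with a single enhancement target, which restricts their generalization and controllability in real-world scenes.

Result: Extensive experiments show that ControlLight achieves state-of-the-art performance on low-light enhancement while demonstrating strong continuous controllability and generalization in real-world scenes.

Insight: The novelties are a large-scale dataset of real-world degraded images with continuous illumination-strength supervision, and a misalignment-aware weighted flow matching loss that preserves structural consistency and visual realism during enhancement, enabling flexible, controllable low-light enhancement.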

Abstract: Existing deep learning-based low-light enhancement methods are typically trained on limited datasets with single enhancement targets, which restricts their generalization ability and controllability in real-world applications. To overcome these limitations, we propose ControlLight, a controllable, consistent, and generalizable framework for low-light enhancement. We first construct a large-scale dataset of real-world degraded images with continuous illumination-strength supervision. To further ensure consistent outputs under different control strengths, we introduce a misalignment-aware weighted flow matching loss that preserves image structure across continuous enhancement strengths. ControlLight allows users to edit real-world degraded low-light images toward satisfactory enhancement results by flexibly controlling the strength while preserving visual consistency and realism. Extensive experiments show that ControlLight achieves state-of-the-art performance against existing low-light enhancement approaches while demonstrating strong continuous controllability and generalization to real-world scenarios.


[174] AnE: Pushing the Reasoning Frontier of Multimodal LLMs via Anchor Evolution cs.CVPDF

Zehao Wang, Yihan Zeng, Zidong Gong, Yuanfan Guo, Feng Zhu

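TL;DR: This paper proposes Anchor Evolution (AnE), a new paradigm for breaking through the performance bottleneck in the reasoning ability of multimodal large language models (MLLMs). It integrates truth-anchored data curation with model evolution to achieve steady, reliable performance gains.

Details

Motivation: Existing post-training methods based on supervised fine-tuning (SFT) and reinforcement learning (RL) are limited by static data and tend to hit a performance bottleneck, while low-quality synthetic data causes cognitive drift and hallucinated reasoning paths.

Result: Across multiple multimodal reasoning benchmarks, the method improves the base model by 10.3% and achieves state-of-the-art (SOTA) results.

Insight: The core novelties are Truth Anchor Expansion, which builds high-fidelity data, and the Scaffold-Stripping Mechanism, which internalizes reasoning ability. The former locates the model's failure frontier via trajectory rollouts and retrieves anchors from ground-truth databases. The latter first anchors reasoning paths with scaffold-augmented supervision and then uses RL to strip the scaffold template, turning the reasoning paths into intrinsic model capabilities.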

Abstract: Post-training via Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) is crucial for enhancing reasoning in Multimodal Large Language Models (MLLMs), yet existing paradigms often reach a performance bottleneck due to the limitations of static data. While current methods leverage self-reflection or self-evolution to push these boundaries, they still suffer from cognitive drift and hallucinated reasoning paths caused by low-quality synthetic data. To address these challenges, we propose Anchor Evolution (AnE), a new paradigm that integrates truth-anchored data curation and model evolution, achieving faithful and steady performance gains at the reasoning frontier. Specifically, we propose Truth Anchor Expansion, which pinpoints the model's failure frontier via trajectory rollouts and leverages ground-truth databases to retrieve high-fidelity anchors for faithful data curation. Subsequently, we introduce the Scaffold-Stripping Mechanism to internalize reasoning capabilities. This mechanism first anchors reasoning paths via scaffold-augmented supervision to mitigate the learning complexity and distribution drift of direct SFT on raw data, then leverages RL to strip the scaffold template, thereby effectively transitioning the reasoning paths into intrinsic model capabilities. Experimental results on multimodal reasoning benchmarks show that our method substantially advances the model performance frontier, improving the base model by 10.3% across eight multimodal benchmarks and achieving state-of-the-art results. The code will be made publicly available.


[175] How Far Has AI Come in Liver Fibrosis Staging? A Large-Scale Real-World Dataset and Benchmark cs.CVPDF

Yuanye Liu, Nannan Shi, Zhejia Zhang, Hanxiao Zhang, Boya Wang

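TL;DR: This paper introduces the LiFS dataset and benchmark, the first large-scale real-world benchmark for liver fibrosis staging based on multi-center, multi-scanner, multi-sequence MRI, designed to systematically assess AI progress under clinically heterogeneous conditions. By evaluating 9 independently developed methods selected from 96 teams, the paper characterizes current AI performance, data challenges, and technical factors in liver fibrosis staging from three complementary perspectives.

Details

Motivation: Progress of existing AI methods in liver fibrosis staging has not been systematically evaluated under the heterogeneous, multi-center conditions typical of clinical practice, so a real-world benchmark is needed to fill this gap.

Result: On the LiFS benchmark, the best AI methods are broadly comparable to a senior radiologist and significantly exceed a junior radiologist in selected settings, while median AI performance generally approaches the junior-radiologist level. The evaluation identifies cross-center heterogeneity, label imbalance, and variability in contrast-enhanced sequences as the main challenges.

Insight: The novelty lies in building the first multi-scanner real-world benchmark that provides complete gadoxetic acid-enhanced sequences with histopathology-confirmed annotations. The analysis indicates that technical choices such as spatial registration, input dimensionality, multi-modal fusion strategy, and backbone architecture modulate cross-center robustness, but no single choice fully closes the performance gap, which points to key challenges for future research.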

Abstract: Despite years of methodological progress, how far AI has come in liver fibrosis staging has never been systematically evaluated under the heterogeneous, multi-center conditions that define clinical practice. To address this gap, we introduce LiFS, a large-scale dataset and benchmark derived from the MICCAI 2025 CARE-Liver challenge, comprising 610 patients across multiple centers and scanners with multi-sequence MRI. To the best of our knowledge, LiFS is the first benchmark providing complete gadoxetic acid-enhanced sequences with histopathology-confirmed annotations from diverse real-world scanners. Through systematic evaluation of 9 independently developed methods selected from 96 registered teams against in-cohort radiologist reference results, our findings address how far current AI has progressed toward clinical-level liver fibrosis staging from three complementary perspectives. First, against radiologists, the best AI methods were broadly comparable to the senior radiologist and significantly exceeded the junior radiologist in selected settings, while median AI performance generally approached junior-radiologist levels. Second, from a data perspective, cross-center heterogeneity, label imbalance, and contrast-enhanced sequence variability emerge as the dominant challenges for AI methods. Third, from a technical perspective, methodological design choices, including spatial registration, input dimensionality, multi-modal fusion strategy, and backbone architecture, appear to modulate cross-center robustness, although no single choice alone closes the gap. Overall, LiFS provides a rigorous real-world benchmark for positioning the current state of AI in liver fibrosis staging and for enabling future research on the key challenges that limit clinically reliable deployment.


[176] StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering cs.CVPDF

Ming Xie, Zizheng Huang, Xudong Tan, Chao Wang, Xiangyu Zeng

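TL;DR: This paper proposes StreamOV, a framework for continuous perception and real-time interaction in streaming omni-video understanding. It manages the continuously growing audio-visual context with an evidence-guided long-short term memory and uses a hidden-state-driven trigger to decide autonomously when to respond. The authors also build SOVBench, the first online multi-turn omni-modal evaluation benchmark, and achieve state-of-the-art performance on multiple streaming and omni-video benchmarks.

Details

Motivation: Current omni-modal methods are designed mainly for offline settings. They lack the ability to handle long, continuous audio-visual context and cannot autonomously trigger responses in streaming scenarios. Existing benchmarks are also confined to offline single-turn question answering, which makes continuous multi-turn interaction hard to evaluate.

Result: StreamOV achieves state-of-the-art performance across diverse streaming and omni-video benchmarks, demonstrating its effectiveness on both online and offline video understanding tasks.

Insight: The novelties are: 1) a multimodal evidence-guided long-short term memory that compresses historical context into compact evidence under a fixed budget; 2) a hidden-state-driven response trigger that needs neither explicit silence-token generation nor an external router; 3) SOVBench, the first online multi-turn omni-modal evaluation benchmark, which fills a gap in benchmarks for this area.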

Abstract: While streaming omni-video understanding demands continuous perception and proactive, real-time interaction, this crucial area remains largely under-explored. Current omni-modal methods are inherently designed for offline settings, limiting their applicability in streaming scenarios due to two fundamental flaws. First, they lack robust mechanisms to manage continuously growing audio-visual context over long horizons and cannot autonomously initiate responses at opportune moments. Second, existing benchmarks are predominantly confined to offline, single-turn question answering, failing to capture continuous, multi-turn streaming interactions. To bridge these gaps, we propose StreamOV, a novel Streaming Omni-Video understanding framework for efficient online audio-visual reasoning with bounded memory and proactive response triggering. Specifically, StreamOV introduces a multimodal evidence-guided long-short term memory that condenses historical audio-visual context into compact informative evidence under a fixed budget. It further employs a hidden-state-driven trigger to decide when to respond, avoiding explicit silence-token generation and external routers. We also curate SOVBench, the first comprehensive benchmark for online, multi-turn omni-modal evaluation. Extensive experiments show that StreamOV achieves state-of-the-art performance across diverse streaming and omni-video benchmarks, demonstrating its effectiveness for both online and offline video understanding.


[177] ARMA-C3: A Contrastive ARMA Convolutional Framework for Unsupervised and Semi-supervised Classification cs.CVPDF

VSS Tejaswi Abburi, Saurabh J. Shigwan, Nitin Kumar

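TL;DR: This paper proposes ARMA-C3, a unified unsupervised and semi-supervised graph learning framework for node classification based on contrastive learning and graph-cut regularization. It models samples as graph nodes and exploits inter-sample relationships to capture subject-level dependencies that conventional methods often overlook, aiming to learn structurally meaningful and discriminative representations from biomedical images.

Details

Motivation: To address the difficulty of accurate, early disease identification in biomedical and neurodegenerative disorders, which is caused by scarce labeled data and complex imaging patterns.

Result: Extensive binary classification experiments on five clinically relevant datasets (ADNI, NIFD, BreastMNIST, PneumoniaMNIST, and a liver ultrasound dataset) show that ARMA-C3 achieves competitive and frequently superior performance compared with classical clustering techniques, state-of-the-art machine learning models, and existing graph-based deep learning approaches across multiple evaluation settings, particularly under limited supervision and severe class imbalance.

Insight: The claimed novelty is the combination of contrastive learning and graph-cut regularization in a unified graph learning framework that captures subject-level dependencies and strengthens the discriminability of the learned representations. Viewed objectively, its contribution is a general framework for unsupervised and semi-supervised node classification that copes with scarce labels and complex patterns and shows strong cross-modal generalization.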

Abstract: In biomedical and neurodegenerative disorders, accurate and early disease identification remains challenging due to the scarcity of labeled data and the complexity of imaging patterns. To address these challenges, we introduce ARMA-C3, a unified unsupervised and semi-supervised graph learning framework for node classification based on contrastive learning and graph-cut regularization to learn structurally meaningful and discriminative representations. By modeling samples or images as graph nodes and exploiting inter-sample relationships, the proposed framework captures subject-level dependencies that conventional machine learning methods typically overlook. We conduct extensive binary classification experiments across five clinically relevant datasets: the Alzheimer’s Disease Neuroimaging Initiative (ADNI), the Neuroimaging in Frontotemporal Dementia (NIFD) dataset, and three medical imaging benchmarks (BreastMNIST, PneumoniaMNIST, and a liver ultrasound dataset). Experimental results demonstrate that ARMA-C3 achieves competitive and frequently superior performance compared to classical clustering techniques, state-of-the-art machine learning models, and existing graph-based deep learning approaches across multiple evaluation settings, particularly under limited supervision and severe class imbalance. The proposed framework further demonstrates robust representation learning and strong cross-modal generalization across diverse biomedical imaging modalities.


[178] UAV-OVO: Out-of-Viewpoint Generalization in UAV Action Recognition cs.CVPDF

Yu Xia, Zhengbo Zhang, Shuaihu Zhang, Zhigang Tu

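TL;DR: This paper proposes the UAV-OVO benchmark and the LATER method to address the drop in generalization that unmanned aerial vehicle (UAV) action recognition models suffer under viewpoint change (for example, from low to high depression angles). UAV-OVO uses view isolation to build in-distribution (ID) and out-of-distribution (OOD) test sets, exposing the performance gap of existing models under viewpoint shift. LATER combines LoRA fine-tuning with online feature re-centering to reduce viewpoint drift while preserving task semantics.

Details

Motivation: To address the deployment shift in UAV action recognition that arises when the camera depression angle differs between training and deployment (low to high depression). This shift changes body visibility, motion projection, and scene context, so models come to rely on viewpoint-specific shortcuts, which harms generalization.

Result: On the UAV-OVO benchmark, representative video recognition models show a substantial ID/OOD gap: they perform well on the low-depression training distribution but fail on high-depression OOD tests, exposing viewpoint shortcuts hidden by aggregate accuracy. LATER, through LoRA-anchored test-time re-centering, reduces viewpoint-induced drift and improves viewpoint robustness.

Insight: The novelties are: 1) the UAV-OVO benchmark, which uses view scores and an isolation band as controls to evaluate viewpoint generalization specifically; 2) the LATER method, which uses the LoRA subspace as a semantic anchor for online feature re-centering, reducing viewpoint drift while preserving task-relevant semantics and offering a practical adaptation scheme for viewpoint-robust UAV video understanding.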

Abstract: UAV action recognition faces a deployment shift that standard benchmarks often obscure: a model trained on UAV footage captured from low-depression viewpoints may be required to recognize the same action classes from high-depression viewpoints. While the action labels remain unchanged, this shift alters body visibility, motion projection, and scene context, encouraging models to rely on viewpoint-specific shortcuts. We introduce UAV-OVO, an Out-of-Viewpoint generalization benchmark for UAV action recognition. UAV-OVO derives view scores from uncalibrated videos, uses a view-isolation band to assign low-depression videos to the training and in-distribution test splits while reserving high-depression videos for out-of-distribution testing, and constructs ID/OOD test sets matched by class distribution so that performance differences reflect viewpoint shift rather than label imbalance. Across representative video recognizers, UAV-OVO reveals a substantial ID/OOD gap: models that fit the low-depression training distribution well often fail to transfer to held-out high-depression views, exposing viewpoint shortcuts hidden by aggregate accuracy. We further propose LATER, LoRA-Anchored Test-time Re-centering, which first adapts the recognizer with Low-Rank Adaptation (LoRA) and then uses the learned LoRA subspace as a semantic anchor for online feature re-centering. Specifically, LATER projects target-domain displacement onto the orthogonal complement of the LoRA subspace before re-centering features, reducing viewpoint-induced drift while preserving task-relevant semantics. Together, UAV-OVO and LATER provide a controlled testbed and a practical adaptation method for viewpoint-robust UAV video understanding.
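The projection step described for LATER can be sketched with a few lines of linear algebra. The abstract only states that the target-domain displacement is projected onto the orthogonal complement of the LoRA subspace before re-centering. Everything else here is an assumption for illustration: that the displacement is a difference of feature means, that the subspace basis comes from a QR decomposition of a LoRA factor, and the helper names `orthonormal_basis`, `off_subspace_part`, and `recenter`.

```python
import numpy as np

def orthonormal_basis(lora_matrix):
    """Return an orthonormal basis (D, r) for the column space of a (D, r) matrix."""
    # Reduced QR; assumes the columns are linearly independent.
    basis, _ = np.linalg.qr(lora_matrix)
    return basis

def off_subspace_part(vector, basis):
    """Component of `vector` in the orthogonal complement of span(basis)."""
    return vector - basis @ (basis.T @ vector)

def recenter(features, source_mean, target_mean, basis):
    """Shift features by the part of the domain displacement lying outside the subspace."""
    displacement = target_mean - source_mean
    return features - off_subspace_part(displacement, basis)

# Toy example: a 1-D "LoRA subspace" along the first axis of a 3-D feature space.
basis = orthonormal_basis(np.array([[1.0], [0.0], [0.0]]))
source_mean = np.zeros(3)
target_mean = np.array([2.0, 3.0, 4.0])  # hypothetical running mean of target features
features = np.array([[2.0, 3.0, 4.0]])
recentered = recenter(features, source_mean, target_mean, basis)
```

In the toy example the shift along the first axis, which lies inside the assumed subspace, is left untouched, while the off-subspace part of the displacement is removed. This mirrors the stated goal of reducing viewpoint-induced drift while preserving task-relevant directions.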


[179] StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration cs.CVPDF

Linrui Tian, Qi Wang, Bang Zhang

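TL;DR: StreamChar is a real-time streaming character audio-video generation framework. It decouples long-horizon orchestration from short-window audio-video denoising to resolve the tension that existing methods face between transcript-audio misalignment, visual drift, and low-latency requirements. An LLM-based orchestrator produces frame-aligned audio conditions, a joint audio-video DiT performs local bidirectional denoising, and techniques such as two-stage distillation and a progress-aware pointer enable efficient deployment and stable generation.

Details

Motivation: Real-time streaming audio-video generation for character animation must simultaneously deliver transcript accuracy, visual consistency across chunks, and a strict playback budget. Existing methods tend to accumulate transcript-audio misalignment and visual drift during autoregressive generation, while the few-step distillation required for low latency degrades spatial diversity and temporal quality.

Result: Experiments on short-clip and long-horizon protocols show that StreamChar runs in real time on a single H100 GPU and offers a more favorable system-level trade-off among transcript fidelity, audio-visual synchronization, visual quality, and streaming stability than recent joint-generation and audio-driven baselines.

Insight: The novelty lies in the architecture that decouples long-horizon orchestration from local denoising, combined with two-stage distillation, a progress-aware pointer, and an anchoring memory mechanism that balance generation quality against real-time operation. Viewed objectively, separating global planning from local generation mitigates accumulated error in streaming generation and offers a scalable approach to real-time audio-video synthesis.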

Abstract: Real-time streaming joint audio-video generation for character animation requires a generator to speak the requested transcript, maintain visual identity across chunks, and run within a strict playback budget. These requirements are difficult to satisfy simultaneously: chunk-wise autoregressive generation can accumulate transcript-audio misalignment and visual drift, while the few-step distillation needed for low latency often degrades spatial diversity and temporal quality. We present StreamChar, a streaming framework that separates long-horizon orchestration from short-window audio-video denoising. An LLM-based orchestrator uses the transcript and historical context to produce frame-aligned audio conditions, and a joint audio-video DiT performs local bidirectional denoising with reference and motion-frame conditioning. For efficient deployment, we use a two-stage distillation pipeline that first compresses the sampler and then fine-tunes the student under online chunk rollouts. A progress-aware pointer aligns partial transcripts with generated audio during rollout training, and a sink-chunk memory provides a persistent visual anchor for reducing long-horizon drift. Experiments on short-clip and long-horizon protocols show that StreamChar runs in real time on a single H100 GPU and provides a favorable system-level trade-off among transcript fidelity, audio-visual synchronization, visual quality, and streaming stability compared with recent joint and audio-driven baselines.


[180] DRM: Diffusion-based Reward Model With Step-wise Guidance cs.CVPDF

Jaxon Zhang, Binxin Yang, Hubery Yin, Chen Li, Jing Lyu

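TL;DR: This paper proposes a Diffusion-based Reward Model (DRM) to address the weakness of existing VLM-based reward models in assessing perceptual qualities such as aesthetics and composition. DRM uses a pre-trained diffusion model as its evaluative backbone and can score the noisy intermediate latents of the generative process step by step. Building on this, the authors propose the Step-wise GRPO reinforcement learning algorithm and the Step-wise Sampling inference strategy, which provide dense per-step rewards and guide the generation process, significantly improving the quality of generated images.

Details

Motivation: Current VLM-based reward models are pre-trained mainly for semantic alignment and struggle to capture key perceptual qualities such as aesthetics, composition, and visual harmony. The authors argue that a model capable of high-fidelity generation must deeply understand these visual attributes, so a new evaluation paradigm is needed.

Result: Extensive experiments confirm that the method significantly improves the final quality of generated images.

Insight: The core novelty is using the pre-trained diffusion model itself as the evaluative backbone of the reward model (DRM), which gives it the distinctive ability to score intermediate latents at any step of the generative process. On this basis, the authors propose Step-wise GRPO to address imprecise credit assignment in GRPO, and Step-wise Sampling to steer the generation path dynamically, offering a finer-grained approach to optimizing and sampling from diffusion models aligned with human preferences.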

Abstract: Current mainstream methods for aligning diffusion models with human preferences typically employ VLM-based reward models. However, these reward models, pre-trained for semantic alignment, struggle to capture essential perceptual qualities such as aesthetics, composition, and visual harmony. In this work, we argue that a model capable of high-fidelity generation must possess a profound understanding of these visual attributes. Based on this insight, we introduce the Diffusion-based Reward Model (DRM), a novel paradigm that uses the pre-trained diffusion model as a powerful evaluative backbone. A key advantage of the DRM is its unique ability to assess not only the final image but also the noisy intermediate latents at any stage of the generative process. We leverage this step-wise evaluative capacity in two ways. First, we propose Step-wise GRPO, a reinforcement learning algorithm that provides dense, per-step rewards to resolve the imprecise credit assignment problem in the GRPO algorithm, leading to more stable and effective alignment. Second, we introduce Step-wise Sampling, a novel inference strategy that employs the DRM as a dynamic guide to evaluate multiple generation paths at each step, steering the process towards higher-quality outcomes. Extensive experiments confirm that our approach significantly enhances the final quality of generated images. Code: https://github.com/jjaxonx/DRM.
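The Step-wise Sampling idea, scoring several candidate continuations at every step and keeping the best one, can be sketched as a greedy guided search. This is a generic illustration and not the released DRM code. The callables `propose` (standing in for a stochastic denoising step that yields several candidate latents) and `score` (standing in for the reward model applied to an intermediate latent) are hypothetical, and the greedy keep-one rule is an assumption.

```python
def stepwise_sample(initial, propose, score, num_steps):
    """Greedy step-wise guidance: at each step keep the highest-scoring candidate."""
    latent = initial
    trajectory = [latent]
    for step in range(num_steps):
        candidates = propose(latent, step)  # several possible next latents
        latent = max(candidates, key=lambda c: score(c, step))
        trajectory.append(latent)
    return trajectory

# Toy stand-ins: latents are integers and the "reward" prefers values near 5.
def toy_propose(latent, step):
    return [latent - 1, latent + 1, latent + 2]

def toy_score(latent, step):
    return -abs(latent - 5)

trajectory = stepwise_sample(0, toy_propose, toy_score, num_steps=3)
```

With real models the cost grows with the number of candidates scored per step, so the candidate count becomes a quality-versus-compute knob.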


[181] Towards Open-World Referring Expression Comprehension: A Benchmark with Training-free Multi-task Consistency Checker cs.CVPDF

Zongjian Wu, Lei Zhang

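TL;DR: This paper proposes OpenRef, a new benchmark for open-world referring expression comprehension (REC) that covers diverse visual scenarios, variable target counts, and rich vocabulary types to reflect complex real-world environments. It also proposes a training-free, plug-and-play Multi-task Consistency Checker (MCC) that improves existing models, and adopts new evaluation metrics such as F1 and N3R.

Details

Motivation: Current REC benchmarks usually assume simple scenes and a single-target mapping, which limits model deployment in complex open-world environments. This paper aims to fill that gap and move REC toward the open world.

Result: Extensive experiments on the proposed OpenRef benchmark show that this work significantly improves the performance of existing REC models in complex scenarios, paving the way for open-world REC.

Insight: The main novelties are a more comprehensive and realistic REC benchmark (OpenRef), a training-free, plug-and-play strategy (MCC) that improves model performance through consistency self-verification, and more suitable evaluation metrics (such as N3R).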

Abstract: Referring expression comprehension (REC) aims to localize a target object within an image based on a given expression. Although recent advances in vision-language models have led to substantial improvements in REC tasks, current REC benchmarks often assume simple scenarios and a one-to-one mapping from each expression to a unique object. These limitations hinder the deployment of REC models in open-world environments. To fill this gap, we introduce OpenRef, a new benchmark for REC in complex visual and linguistic scenarios. OpenRef features three key advancements: 1) Diverse visual scenarios: spanning diverse visual domains, including ground views, drone views, dark scenes and adverse weather conditions; 2) Variable target counts: breaking the single-target limitation with multi-target and none-target samples; 3) Rich vocabulary types: incorporating proper nouns, polysemous words and ordinal terms to fit a wider range of expression needs. Furthermore, as traditional metrics are insufficient for the open-world setting, we leverage F1 to measure grounding accuracy and propose N3R (Negative Relative Rejection Reliability) to assess relative rejection reliability against negative expressions. Finally, we introduce the Multi-task Consistency Checker (MCC), a training-free, plug-and-play strategy that enhances model performance with one click by enforcing consistency self-verification. Extensive experiments demonstrate that this work significantly advances the performance of existing REC models in complex scenarios, paving the way for open-world REC. Project page: https://zongjianwu.github.io/openref


[182] CMAP: Cross-Modal Adaptive Prompting for Multi-Domain Task-Incremental Learning cs.CV | cs.CL | cs.ETPDF

Sriram Mandalika

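TL;DR: This paper proposes CMAP, a cross-modal adaptive prompting method for multi-domain task-incremental learning. Built on a frozen CLIP model, it uses text-space task routing, multi-prototype visual-textual confidence estimation, and a symmetric cross-modal gating mechanism to improve performance when learning tasks from multiple visual domains sequentially, while preventing catastrophic forgetting.

Details

Motivation: Existing parameter-efficient methods built on frozen vision-language models rely only on visual features for task routing, confidence estimation, and encoder adaptation in multi-domain task-incremental learning, leaving CLIP's cross-modal text embedding space unexploited. This paper fills that gap by using text-modality information to improve learning efficiency and robustness.

Result: On the MTIL benchmark, which spans 11 datasets and 1201 classes, the method achieves 74.2% Transfer, 80.5% Average, and 88.7% Last accuracy under Order-I. With only 2.5M trainable parameters and no external data, it surpasses the prior best method by 5.0, 3.7, and 3.0 percentage points, setting a new state of the art.

Insight: The novelties are moving task routing from the visual space to the text space, using frozen CLIP text prototypes for robust matching at zero parameter cost; introducing multi-prototype visual-textual confidence estimation that combines K-means visual prototypes with cross-modal alignment scores; and proposing a symmetric cross-modal gating mechanism that lets the text encoder adapt to batch image features, preserving cross-modal alignment on out-of-distribution inputs.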

Abstract: Multi-domain task-incremental learning requires a model to sequentially acquire knowledge across visually diverse domains without forgetting prior tasks, and without access to task identity at inference. Parameter-efficient methods built on frozen vision-language models have made strong progress, yet all existing approaches rely exclusively on visual features for task routing, confidence estimation, and encoder adaptation, leaving CLIP’s cross-modal text embedding space entirely unexploited. We address this gap through three contributions. Text-space task routing replaces visual Gaussian matching with cosine similarity to frozen CLIP text prototypes, giving order-independent routing robust to data scarcity at zero parameter cost. Multi-prototype visual-textual confidence replaces single-Gaussian class modeling with K-means visual prototypes and cross-modal alignment scores under task-calibrated thresholds. Symmetric cross-modal gating extends per-layer Gumbel gates to the text encoder conditioned on batch image features, preserving cross-modal alignment on out-of-distribution inputs. On the MTIL benchmark spanning 11 datasets and 1201 classes, our method achieves 74.2% Transfer, 80.5% Average, and 88.7% Last under Order-I, surpassing the prior state of the art by 5.0, 3.7, and 3.0 percentage points with only 2.5M trainable parameters and no external data.
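Text-space task routing, as described above, amounts to comparing an image embedding with frozen per-task text prototypes by cosine similarity. The following is a minimal sketch under stated assumptions: the embeddings are taken as given (no CLIP model is loaded), each task's score is the maximum similarity over its class prototypes (the abstract does not say how per-class similarities are aggregated), and `l2_normalize` and `route_task` are hypothetical helper names.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def route_task(image_embedding, task_text_prototypes):
    """Pick the task whose text prototypes are most cosine-similar to the image.

    task_text_prototypes maps a task name to a (num_classes, D) array of
    frozen text embeddings. Returns (best_task, per_task_scores).
    """
    query = l2_normalize(image_embedding)
    scores = {}
    for task, prototypes in task_text_prototypes.items():
        similarities = l2_normalize(prototypes) @ query  # cosine per class
        scores[task] = float(similarities.max())         # assumed aggregation
    best_task = max(scores, key=scores.get)
    return best_task, scores

# Toy 3-D embeddings standing in for text and image features.
prototypes = {
    "aircraft": np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]),
    "flowers": np.array([[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]]),
}
best_task, scores = route_task(np.array([0.1, 0.9, 0.1]), prototypes)
```

Because the prototypes are computed once from class names and never trained, adding a task only adds entries to the dictionary, which is consistent with the zero-parameter-cost, order-independent routing the abstract describes.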


[183] DeCoDrift: Stabilizing Decoder Coupling in Closed-Loop Foundation Segmentation cs.CVPDF

H. M. Shadman Tabib, Md. Shamsuzzoha Bayzid, M Sohel Rahman

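TL;DR: This paper shows that foundation segmentation models such as SAM, when used in a closed loop with iterative prompting, exhibit an under-studied failure mode called decoder coupling drift: the mask decoder's cross-attention gradually loses alignment with the target object, so errors accumulate across iterations. By analyzing decoder-internal signals, the authors formalize iterative prompting as a discrete-time dynamical system and propose DeCoDrift, a training-free inference-time stabilization framework that constrains prompt updates to preserve decoder coupling, improving attention stability, temporal consistency, and segmentation quality.

Details

Motivation: To address decoder coupling drift in foundation segmentation models such as SAM under closed-loop iterative prompting, where the feedback loop causes the mask decoder's cross-attention to become progressively misaligned and errors to accumulate, degrading segmentation stability and quality.

Result: Experiments on volumetric electron microscopy data show that DeCoDrift consistently improves attention stability, temporal consistency, and segmentation quality over standard iterative prompting, without retraining or ground-truth supervision.

Insight: The novelty lies in being the first to formalize iterative prompting as a discrete-time dynamical system in order to analyze its stability, and in proposing an inference-time stabilization framework (DeCoDrift) that relies on decoder-internal dynamic signals (such as prompt-image coupling and attention stability) and needs neither training nor ground-truth annotations. It shows that decoder-internal dynamics are not only diagnostic but also provide actionable signals for stabilizing foundation segmentation models in closed-loop use.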

Abstract: Foundation segmentation models such as Segment Anything Model (SAM) are now routinely used in iterative pipelines, where each predicted mask is fed back as the next prompt. This practice turns segmentation into a closed-loop dynamical process, yet the decoder-level behavior of these systems remains largely unexamined. We show that this feedback loop can induce a previously overlooked failure mode, decoder coupling drift, in which the mask decoder’s cross-attention progressively loses alignment with the target object, causing errors to accumulate across iterations. We study this phenomenon by instrumenting SAM’s mask decoder and deriving ground-truth-free measures of prompt-image coupling, attention stability, and temporal consistency. On volumetric electron microscopy data, these decoder-internal signals reveal that standard iterative prompting systematically degrades attention alignment and temporal coherence relative to oracle-anchored feedback. We then formalize iterative prompting as a discrete-time dynamical system and show how proximal anchoring reduces error amplification in the feedback loop. Building on this analysis, we introduce DeCoDrift, a training-free inference-time stabilization framework that constrains prompt updates and preserves decoder coupling across iterations. Across extensive experiments, DeCoDrift consistently improves attention stability, temporal coherence, and segmentation quality over standard iterative prompting, without retraining or ground-truth supervision. More broadly, our results show that decoder-internal dynamics are not merely diagnostic: they provide actionable signals for stabilizing foundation segmentation models in closed-loop use.
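To see why anchoring can damp error amplification in a feedback loop, consider a deliberately simplified scalar model. The abstract does not give DeCoDrift's update rule, so everything below is an assumption for illustration: the prompt is a single number measuring deviation from the target, the feedback map multiplies that deviation by a gain greater than one, the anchor is the initial prompt, and the proximal step is the closed-form minimiser of a quadratic penalty.

```python
def proximal_update(proposed, anchor, strength):
    """Closed-form minimiser of (p - proposed)^2 + strength * (p - anchor)^2."""
    return (proposed + strength * anchor) / (1.0 + strength)

def rollout(initial, feedback_gain, steps, strength=0.0):
    """Iterate a toy closed loop; strength=0 recovers plain iterative prompting."""
    prompt = initial
    history = [prompt]
    for _ in range(steps):
        proposed = feedback_gain * prompt  # toy feedback map that amplifies error
        prompt = proximal_update(proposed, anchor=initial, strength=strength)
        history.append(prompt)
    return history

free = rollout(0.1, feedback_gain=1.2, steps=10)
anchored = rollout(0.1, feedback_gain=1.2, steps=10, strength=0.5)
```

In this toy system the unanchored error grows geometrically with the gain, while the anchored loop has an effective gain of `feedback_gain / (1 + strength)` and settles near a bounded value whenever that ratio is below one.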


[184] DRFusion: Drift-Resilient Temporally Consistent Infrared-Visible Video Fusion cs.CVPDF

Xingyuan Li, Haoyuan Xu, Shulin Li, Xiang Chen, Zhiying Jiang

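TL;DR: This paper proposes DRFusion, a drift-resilient infrared-visible video fusion method that addresses temporal consistency of fused video in dynamic scenes. It reformulates the task as history-conditioned motion generation, uses Stabilized History Guidance and Soft Temporal Anchoring to recast temporal consistency as spectral filtering, and adopts a Decoupled Structure-Motion Adaptation strategy to bridge pre-trained priors and structural constraints.

Details

Motivation: Infrared and visible video fusion is essential for comprehensive perception of dynamic scenes, but maintaining temporal consistency remains a major challenge. Existing optical-flow-based methods suffer from geometric rigidity and ghosting artifacts, while standard diffusion-based frame-by-frame fusion models lack intrinsic temporal constraints when extended autoregressively and are prone to severe error accumulation and drift.

Result: Extensive experiments show that the method achieves state-of-the-art performance in both fusion quality and temporal stability.

Insight: The novelty lies in recasting temporal consistency as spectral filtering, which avoids rigid alignment, and in combining pre-trained priors with structural constraints through two-stage training and latent refinement, yielding robustness to drift.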

Abstract: Infrared and visible video fusion is essential for achieving comprehensive perception in dynamic scenes. However, maintaining temporal consistency remains a formidable challenge. Conventional methods relying on optical flow often suffer from geometric rigidity and ghosting artifacts. Moreover, standard diffusion-based fusion models typically operate in a frame-by-frame manner; when extended to autoregressive settings, they lack intrinsic temporal constraints and are prone to severe error accumulation and drifting, where minor artifacts amplify over time. To address these limitations, we propose a drift-resilient video fusion method that reformulates the task as history-conditioned motion generation. We introduce Stabilized History Guidance and Soft Temporal Anchoring to reframe temporal consistency as spectral filtering, implicitly aggregating motion dynamics without rigid alignment. Furthermore, our Decoupled Structure-Motion Adaptation strategy bridges pre-trained priors and structural constraints via two-stage training and latent refinement. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both fusion quality and temporal stability.


[185] VertiCue-Bench: Diagnosing Whether MLLMs Use Height Cues to Resolve 2D Ambiguity in Remote Sensing Natural Scenes cs.CV | cs.MMPDF

Jing Huang, Duanchu Wang, Junjie Yang, Zihang Cheng, Cheng Li

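TL;DR: This paper introduces VertiCue-Bench, the first benchmark for diagnosing whether multimodal large language models (MLLMs) can use vertical structural cues, such as Canopy Height Models (CHMs), to resolve 2D spectral confusion in remote sensing natural scenes. The benchmark contains 1,534 instances across 17 tasks and is designed to separate low-level height perception from ambiguity-aware semantic reasoning. Evaluation of 14 advanced models reveals a marked dissociation between perception and reasoning.

Details

Motivation: Existing remote sensing benchmarks focus mainly on 2D optical appearance, but in natural environments spectral confusion is severe: ecologically distinct regions can share similar textures while differing in vertical structure. Explicit 3D structural data such as CHMs is therefore needed for semantic disambiguation, yet it is unclear whether MLLMs can genuinely exploit vertical cues.

Result: Evaluating 14 state-of-the-art general and remote-sensing-specialized MLLMs on VertiCue-Bench, combined with counterfactual modality testing, shows that models have emerging competence in reading raw CHM height cues but largely fail to turn geometric perception into reliable semantic reasoning. When joint constraints are required, they often underperform RGB-only baselines.

Insight: The novelty lies in building the first diagnostic benchmark focused on CHM-grounded spatial reasoning and in systematically exposing a critical geometry-to-semantics gap in how current MLLMs understand natural scenes, that is, a dissociation between perception and high-level reasoning. This provides actionable insights for advancing geospatial MLLMs.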

Abstract: Multimodal Large Language Models (MLLMs) have recently shown promising progress in geospatial reasoning. However, existing remote sensing benchmarks remain largely 2D-centric, evaluating models primarily on optical appearance. In natural environments, this paradigm breaks down due to severe spectral confusion, where ecologically distinct regions share similar textures but differ fundamentally in vertical structure. In such cases, explicit 3D structural data, such as Canopy Height Models (CHMs), become essential geometric evidence for semantic disambiguation. Yet, it remains unclear whether current MLLMs can genuinely leverage vertical cues to resolve appearance-level ambiguity. To address this gap, we introduce VertiCue-Bench, the first diagnostic benchmark for CHM-grounded geospatial reasoning. VertiCue-Bench comprises 1,534 carefully curated instances across 17 tasks, explicitly disentangling low-level height perception from ambiguity-aware semantic reasoning. Evaluations on 14 state-of-the-art general and remote-sensing-specialized MLLMs, combined with counterfactual modality testing, reveal a striking perception-reasoning dissociation. While models exhibit emerging competence in reading raw CHM height cues, they largely fail to translate geometric perception into reliable semantic reasoning, often underperforming RGB-only baselines when joint constraints are required. Overall, VertiCue-Bench exposes a critical geometry-to-semantics gap in natural scene understanding, offering actionable insights for advancing geospatial MLLMs.


[186] Addressing Exacerbated Attention Sink for Source-Free Cross-Domain Few-Shot Learning cs.CVPDF

Shuai Yi, Yixiong Zou, Yuhua Li, Ruixuan Li

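TL;DR: This paper studies how the attention sink problem worsens when vision-language models such as CLIP are fine-tuned for source-free Cross-Domain Few-Shot Learning (CDFSL). The authors find that standard few-shot fine-tuning makes the model over-rely on "simple" tokens that start out close to target-domain classes and neglect "hard" tokens that are more discriminative but start out farther away, which harms class discriminability. They propose dynamically re-weighting tokens during fine-tuning according to their relevance to target-domain classes, suppressing reliance on simple tokens and strengthening the learning of hard tokens.

Details

Motivation: To explore the potential of vision-language models such as CLIP for cross-domain few-shot learning and to address a key issue overlooked by prior work: standard few-shot fine-tuning significantly exacerbates the attention sink phenomenon, reducing the model's discriminability across classes.

Result: Extensive experiments on four benchmark datasets validate the rationale of the proposed method and demonstrate new state-of-the-art (SOTA) performance.

Insight: The novelty lies in being the first to reveal the exacerbated attention sink problem in CDFSL and to interpret it as shortcut learning for domain adaptation. The proposed solution uses dynamic token re-weighting to explicitly suppress reliance on simple tokens and strengthen the learning of hard tokens, reducing sink tokens and improving discriminability.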

Abstract: Vision-language models (VLMs) like CLIP have shown impressive generalization capabilities, yet their potential for Cross-Domain Few-Shot Learning (CDFSL) remains underexplored, where the model needs to transfer source-domain information to target domains with scarce training data. While the attention sink phenomenon has been observed in VLMs for certain tasks, its role in CDFSL scenarios has not been studied. In this paper, we uncover a critical issue overlooked by prior works: standard target-domain few-shot fine-tuning in CDFSL significantly exacerbates the attention sink problem, leading to poor discriminability across classes. To understand this phenomenon, through extensive experiments, we interpret it as the model's shortcut learning for domain adaptation: to overcome the huge domain gap between the source and target domains, the model shows a high tendency to push tokens that are initially closer to target-domain classes (i.e., simple tokens) to be even closer to these classes, exacerbating the attention sink and wasting the capability of learning other discriminative but initially farther tokens (i.e., hard tokens). To address this, we propose a novel approach to dynamically re-weight tokens according to their relevance to target-domain classes during target-domain fine-tuning, which explicitly suppresses the model's reliance on these simple tokens and enhances the learning of hard tokens, reducing sink tokens and enhancing discriminability. Extensive experiments on four benchmark datasets validate the rationale of our method, demonstrating new state-of-the-art performance. Our code is available at https://github.com/shuaiyi308/TIR.


[187] PixelWizard: Towards Efficient High-Fidelity Video Generation at Ultra-Large Spatial Resolution cs.CVPDF

Wenxue Li, Jingjing Ren, Peng Zhang, Tian Ye, Daiguo Zhou

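TL;DR: This paper proposes PixelWizard, a framework targeting the coupled bottleneck of optimization instability and prohibitive computational cost in high-resolution video generation. It hierarchically decouples global structure modeling from fine-grained detail synthesis: a compact spatiotemporal anchor first concentrates structural priors and then guides detail generation at high resolution. A Noise-Span Aligned Shortcut Training mechanism yields a speedup of more than 10x, enabling efficient generation of native 2K/4K video while maintaining high quality.

Details

Motivation: High-resolution video generation faces the twin bottlenecks of optimization instability and prohibitive computational cost. The sharp growth of the token sequence biases optimization toward local textures at the expense of global coherence, causing structural collapse, and also brings prohibitive training costs and severe inference latency.

Result: Extensive experiments show that PixelWizard accelerates generative sampling of native 2K/4K video by more than 10x while maintaining superior visual quality.

Insight: The core innovations are the hierarchically decoupled generation framework and the Noise-Span Aligned Shortcut Training mechanism designed to accelerate inference. By explicitly modeling the step size and combining Exponential Index-Biased Sampling with Adaptive Noise-Span Calibration, the mechanism lets the model traverse the generation trajectory in large steps, achieving robust few-step inference without heavy distillation.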

Abstract: High-resolution video generation faces a coupled bottleneck of optimization instability and prohibitive computational costs. The massive expansion of the token sequence not only biases optimization toward local textures at the expense of global coherence, leading to structural collapse, but also imposes prohibitive training costs and severe inference latency. To address this, we propose PixelWizard, a framework that hierarchically decouples global structure modeling from fine-grained detail synthesis. PixelWizard first establishes a compact spatiotemporal anchor to concentrate dense structural priors, which then guides fine-grained generation at high resolution. This mitigates the local optimization bias to ensure structural stability without compromising high-frequency details. Leveraging this structural stability, we introduce Noise-Span Aligned Shortcut Training to break the inference bottleneck. By explicitly modeling the step size, this mechanism allows the model to traverse the generation trajectory with large steps. Crucially, we incorporate Exponential Index-Biased Sampling and Adaptive Noise-Span Calibration to align optimization with the shifted noise schedules of high-resolution grids, ensuring robust few-step inference without incurring the heavy overhead of distillation. Extensive experiments demonstrate that PixelWizard achieves superior visual quality while accelerating the generative sampling of native 2K/4K videos by over 10x.


[188] Rethinking VLM Representation for VLA Initialization cs.CVPDF

Weifeng Lin, Siyuan Huang, Hao Li, Tingwei Chen, Ruichuan An

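TL;DR: This paper studies how to effectively initialize a Vision-Language-Action (VLA) model from a pretrained Vision-Language Model (VLM). A systematic analysis along three design axes (capability-level embodied VQA supervision, parameter-update strategy, and robot-data pretraining) finds that the original VLM representation is a key source of action performance, and that effective VLA initialization must inject action-relevant signals while preserving useful pretrained representations.

Details

Motivation: VLA models widely adopt pretrained VLMs as policy backbones, yet it remains unclear which pretrained VLM representations suit VLA initialization. The paper studies this question as a controlled representation-design problem.

Result: Experiments show that the gains from embodied VQA adaptation depend on downstream bottlenecks, and that gains from different capability domains are not simply additive. LoRA provides a more reliable initialization than full fine-tuning. Robot-data pretraining further improves initialization, with the strongest variant obtained through staged LoRA-based training.

Insight: The novelty lies in systematically decomposing the VLM-to-VLA initialization problem and deriving key design principles: effective adaptation should inject action-relevant embodied and robot-trajectory signals while preserving the pretrained VLM representation that remains useful for action learning, avoiding excessive reshaping. LoRA and staged training are shown to be effective techniques.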

Abstract: Vision-Language-Action (VLA) models widely adopt pretrained Vision-Language Models (VLMs) as policy backbones, yet it remains unclear what kind of pretrained VLM representation is useful as a VLA initialization. In this paper, we study VLA initialization as a controlled representation-design problem along three axes: capability-level embodied VQA supervision, parameter-update strategy, and robot-data pretraining. Our experiments show that the original pretrained VLM representation is a key source of action performance. However, embodied VQA adaptation does not yield uniform gains: its benefit depends on downstream bottlenecks, and gains from different capability domains are not simply additive. For update strategy, LoRA provides a more reliable initialization than Full Finetune, indicating that overly reshaping the pretrained representation can weaken VLA initialization. Robot-data pretraining further improves VLA initialization, with the strongest variant obtained by staged LoRA-based training. Together, these findings suggest that effective VLM-to-VLA adaptation should inject action-relevant embodied and robot-trajectory signals while preserving the pretrained VLM representation that remains useful for action learning.


[189] Event-to-Video Reconstruction using Spatio-Temporal and Frequency-Enhanced Deep Neural Networks cs.CVPDF

Ramna Maqsood, Paulo Nunes, Luís Ducla Soares, Caroline Conti

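TL;DR: This paper proposes MSFET-E2V, a novel multiscale frequency-enhanced transformer for reconstructing synchronous video frames from asynchronous event streams. At its core is a cross-domain attention module that fuses spatio-temporal features with frequency-aware representations derived from the discrete wavelet transform to better capture local and global structure. A lightweight wavelet-enhanced skip block additionally suppresses artifacts and refines structural detail through joint spatial-frequency processing.

Details

Motivation: Event cameras offer high temporal resolution, low latency, and energy efficiency, but the lack of dense intensity frames limits the direct use of conventional computer vision methods. Existing CNN- and transformer-based event-to-video reconstruction methods operate mainly in the spatial domain and struggle to recover fine structural detail while suppressing severe reconstruction artifacts.

Result: Extensive experiments on multiple real-world event datasets show that MSFET-E2V outperforms state-of-the-art methods with significant gains in reconstruction quality. Compared with the existing transformer-based method, the model significantly reduces parameter count, GPU memory usage, and inference time.

Insight: The main innovation is bringing frequency-domain information (via the discrete wavelet transform) into the model to sharpen its perception of detail and structure, realized through the cross-domain attention module and the lightweight wavelet-enhanced skip block. This gives event-to-video reconstruction a more effective joint spatial-frequency approach, with gains in both quality and efficiency.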

Abstract: Event cameras offer significant advantages over conventional frame-based counterparts, including high temporal resolution, low latency, and energy efficiency. These characteristics make them suitable for high-speed and high-dynamic range scene acquisition scenarios; however, the lack of dense intensity frames limits the direct applicability of conventional computer vision methods for scene understanding. Event-to-video (E2V) reconstruction seeks to bridge this gap by converting asynchronous event streams into a sequence of synchronous video frames. Existing E2V reconstruction methods based on convolutional neural networks and transformers operate primarily in the spatial domain and often struggle to recover fine structural details while suppressing severe reconstruction artifacts. To address these issues, we propose MSFET-E2V, a novel multiscale frequency-enhanced transformer model. At its core lies a cross-domain attention module, which fuses spatio-temporal features with frequency-aware representations derived from the discrete wavelet transform. Unlike prior methods relying solely on spatial attention, our approach effectively captures both local and global structures by taking into account low- and high-frequency components, enhancing detail preservation and robustness across various motion scenarios. Furthermore, we propose a lightweight wavelet-enhanced skip block that serves as a skip connection, facilitating artifact suppression and structural detail refinement through joint spatial-frequency domain processing. Extensive experiments demonstrate that MSFET-E2V achieves superior performance over state-of-the-art methods on multiple real-world event datasets, offering significant gains in reconstruction quality. Moreover, compared to the existing transformer-based method, our proposed model significantly reduces the number of parameters, the GPU memory usage, and inference time.


[190] STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language Models cs.CV | cs.CLPDF

Yiming Liang, Yixiao Chen, Yiyang Zhou, Yixuan Wang, Shoubin Yu

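TL;DR: This paper proposes the STORMS framework, which improves the spatial-temporal reasoning of video-language models through internalized modeling. Training has two stages: Stage I aligns latent tokens with thought-video representations derived from generated videos, and Stage II trains with answer-only supervision so that the reasoning process is internalized. Inference requires no video generation or external tool calls, and the method improves accuracy on several benchmarks while reducing inference overhead.

Details

Motivation: Existing methods built on large vision-language models often handle video reasoning by externalizing it (textual chain-of-thought, keyframe selection, or external tools), which increases inference latency and engineering complexity and forces spatio-temporal visual evidence to be serialized into text or repeatedly re-encoded.

Result: On VideoMME, MVBench, TempCompass, and MMVU, STORMS improves video reasoning accuracy while substantially reducing inference overhead compared with tool-based or video-generation-based reasoning pipelines.

Insight: The novelty is internalizing the visual reasoning process: the model reasons through bounded continuous latent trajectories rather than explicit textual chain-of-thought, so that after training it reasons efficiently without generating videos or calling external tools, lowering computational cost.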

Abstract: Many video reasoning tasks require tracking motion, temporal order, and evolving visual states across frames. Existing methods built on large vision-language models (LVLMs) often address this challenge by externalizing reasoning through textual chain-of-thought (CoT), keyframe selection, repeated frame reinsertion, or external tool use. While effective, such pipelines increase inference-time latency and engineering complexity, and they force temporal-visual evidence to be serialized into text or repeatedly re-encoded from frames. Inspired by the intuition that visual reasoning can occur implicitly before verbalization, we propose STORMS (Spatial-Temporal reasOning via inteRnalized Modeling), a two-stage framework that teaches LVLMs to reason through bounded continuous latent trajectories instead of explicit textual CoT. In Stage I, STORMS aligns latent tokens with thought-video representations derived from generated videos, grounding the latent states in dynamic visual evidence. In Stage II, the model is further trained with answer-only supervision, encouraging the reasoning process to be internalized without step-by-step annotations. Generated thought videos are used only during training; at inference, STORMS performs a bounded latent rollout without regenerating videos, reinserting frames, or invoking external visual tools. Experiments on VideoMME, MVBench, TempCompass, and MMVU show that STORMS improves video reasoning accuracy while substantially reducing inference overhead compared with tool or video-generation-based reasoning pipelines.


[191] An Analysis Focused on Womens Safety: Can VAD Models Be Enhanced by a Multi-modal Dataset? cs.CVPDF

Sangeeta, Maddikuntla Sai Prajwal, Debi Prosad Dogra, Kamalakar Vijay Thakare, Hyungjoo Jung

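TL;DR: Addressing women's safety, this paper presents ExtrAnom, a new multi-modal video anomaly detection dataset of 1001 videos (500 normal, 501 anomalous) covering five types of crime against women (such as stalking, chain snatching, and harassment) and including realistic surveillance conditions such as low light, low resolution, and long shots. The dataset also provides textual descriptions, supporting cross-modal and vision-language-model validation. Experiments show that existing datasets are inadequate for detecting women-centric anomalies, whereas ExtrAnom can be used effectively to train and evaluate models.

Details

Motivation: The paper addresses the gap in video anomaly detection (VAD) research on women's safety: existing datasets mostly contain well-lit, high-resolution, close-shot videos that fail to represent crimes against women (such as stalking, chain snatching, and inappropriate touch), and they lack multi-modal support.

Result: ExtrAnom is benchmarked against popular unimodal and multi-modal VAD datasets (such as XD-Violence, UCF-Crime, and UCA) and SOTA methods. The experiments show that existing datasets are insufficient to train models for detecting women-centric anomalies, while ExtrAnom provides an effective evaluation benchmark.

Insight: The novelty lies in building the first multi-modal VAD dataset focused on women's safety, covering the varied conditions of real surveillance footage (low light, low resolution, and so on) and integrating textual descriptions to support cross-modal analysis and vision-language-model applications, filling a data gap in this area.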

Abstract: Women’s safety and security are paramount for a modern society. Crimes against women occur in daylight as well as in low-light conditions. Often, such events are captured through real-world surveillance cameras that operate at lower resolutions. Despite substantial progress in CV-related research, video anomaly detection (VAD) focused on women’s safety has not yet been adequately addressed. Existing video anomaly datasets contain well-lit, high-resolution, close-shot videos, and fail to represent women-centric anomalies such as chain snatching, stalking, inappropriate touch, and other subtle forms of crime against women. To address these problems, we propose the ExtrAnom dataset, a new multi-modal benchmark containing 1001 videos with textual descriptions, 500 normal and 501 anomalous, classified into 5 different types of women-centric crimes. The dataset comprises low-light (8%), low-resolution (13%), and long-shot (15%) videos, along with daylight (64%) anomalous videos. It covers anomalous events such as stalking (3.9%), chain snatching (17.6%), kidnapping (7.3%), assassinations (2.3%), and harassment (18.9%), alongside normal videos (50%). Each video is supplemented with 4 textual annotations, including one human-generated and three LLM-generated descriptions, enabling cross-modal and VLM-based validations. The aim of creating a women-centric dataset is to accurately detect women-centric anomaly patterns that can be observed visually. The dataset supplements VLMs so that they can accurately generate video-level descriptions. ExtrAnom has been benchmarked against popular unimodal and multi-modal VAD datasets (e.g., XD-Violence, UCF-Crime, and UCA) and SOTA methods. Experiments reveal that the existing datasets are insufficient to train models for detecting women-centric anomalies.


[192] MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models cs.CV | cs.CLPDF

Shristi Das Biswas, Kaushik Roy

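TL;DR: The paper proposes MAGIC, a training-free, forward-only coreset selection method that builds compact yet behaviorally faithful data subsets for multimodal instruction tuning. It relies on three intrinsic signals extracted from a pretrained vision-language model (Multimodal Gain, Bridging Relevance, and Skill-Neuron Signatures) and selects high-quality samples through a three-stage pipeline, substantially reducing training data and compute time while preserving coverage of multimodal skills.

Details

Motivation: Instruction tuning of large vision-language models (LVLMs) depends heavily on massive multimodal corpora, but these datasets suffer from redundant samples, low visual dependency, and highly imbalanced coverage of multimodal reasoning behaviors, so uniform subsampling or naive score-based selection often yields suboptimal training subsets.

Result: On LLaVA-665K and Vision-Flan, in transfer settings to target models such as LLaVA-1.5-7B and -13B, MAGIC consistently outperforms strong baselines under matched 20% budgets: it reaches 100.3% of full fine-tuning performance on LLaVA-665K and 101.6% on Vision-Flan-186K, while reducing wall-clock run time by 73.7%.

Insight: The novelty lies in using signals internal to a pretrained model (Multimodal Gain, Bridging Relevance, and Skill-Neuron Signatures) for efficient coreset selection without training or backpropagation. Staged filtering, ranking, and bucket-wise budget allocation over discrete neuron signatures make data selection far more efficient while preserving multimodal skill coverage.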

Abstract: Instruction tuning of large vision-language models (LVLMs) increasingly depends on massive multimodal corpora, yet these datasets contain samples with substantial redundancy, low visual dependency, and highly imbalanced coverage of multimodal reasoning behaviors. As a result, uniform subsampling or naive score-based selection often yields suboptimal training subsets. We introduce MAGIC, a training-free, forward-only coreset selection method designed to construct compact yet behaviorally faithful subsets for multimodal instruction tuning. MAGIC is built on three intrinsic signals extracted from a pretrained VLM: Multimodal Gain, which measures the likelihood improvement obtained from visual input; Bridging Relevance, which captures the sharpness of answer-token grounding over visual tokens; and Skill-Neuron Signatures, which characterize the functional computation elicited by each sample via top-activated feed-forward neurons. MAGIC combines these signals in a three-stage pipeline: filtering low-gain examples, ranking candidates by a normalized quality objective, and performing bucket-wise budget allocation over discrete neuron signatures to preserve latent multimodal skill coverage. This formulation avoids backpropagation, auxiliary selector training, and expensive clustering in continuous activation spaces, while remaining efficient and easily deployable in existing VLMs. Across LLaVA-665K and Vision-Flan datasets, and transfer settings to large target models, LLaVA-1.5-7B and -13B, MAGIC consistently improves over strong baselines under matched 20% budgets: it achieves 100.3% relative performance to full finetuning on LLaVA-665K and 101.6% relative performance on Vision-Flan-186K, while yielding a 73.7% reduction in wall-clock run time.
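
The three-stage pipeline described above (filter, rank, bucket-wise allocation) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the per-sample signals are assumed to be precomputed, the quality score is assumed to be a sum of min-max normalized gain and relevance, and the budget allocation is replaced by a simple round-robin over signature buckets. The function `select_coreset` and the dictionary keys are hypothetical names.

```python
from collections import defaultdict

def select_coreset(samples, budget, min_gain=0.0):
    """samples: dicts with 'id', 'gain', 'relevance', 'signature' (hashable)."""
    # Stage 1: drop samples whose visual input adds too little likelihood gain.
    kept = [s for s in samples if s["gain"] > min_gain]
    if not kept or budget <= 0:
        return []

    # Stage 2: rank by an assumed quality score of min-max normalized signals.
    def normalize(values):
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

    gains = normalize([s["gain"] for s in kept])
    rels = normalize([s["relevance"] for s in kept])
    scored = sorted(zip(kept, (g + r for g, r in zip(gains, rels))),
                    key=lambda pair: pair[1], reverse=True)

    # Stage 3: round-robin over signature buckets (a stand-in for the paper's
    # budget allocation) so that every observed signature is represented.
    buckets = defaultdict(list)
    for sample, _ in scored:
        buckets[sample["signature"]].append(sample)
    selected = []
    while len(selected) < budget and any(buckets.values()):
        for queue in buckets.values():
            if queue and len(selected) < budget:
                selected.append(queue.pop(0))
    return [s["id"] for s in selected]

samples = [
    {"id": "a", "gain": 0.9, "relevance": 0.8, "signature": (1, 2)},
    {"id": "b", "gain": 0.8, "relevance": 0.9, "signature": (1, 2)},
    {"id": "c", "gain": 0.2, "relevance": 0.1, "signature": (3, 4)},
    {"id": "d", "gain": -0.1, "relevance": 0.9, "signature": (3, 4)},
]
chosen = select_coreset(samples, budget=2)
```

With a budget of two, the sketch picks one sample from each signature bucket rather than the two highest-scoring samples overall, which is the coverage behavior the abstract attributes to the bucket-wise stage.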


[193] Data-driven Head Motion Generation through Natural Gaze-Head Coordination cs.CVPDF

Xiaohan Liu, Yilin Wen, Yusuke Sugano

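TL;DR: This paper presents the first data-driven approach to modeling temporal gaze-head coordination from large-scale in-the-wild facial videos. Appearance-based gaze estimators automatically extract natural and diverse gaze and head motion data, and a generative conditional Variational Autoencoder captures the probabilistic correlation and temporal dynamics, producing realistic gaze-conditioned head motion. The framework is further applied to gaze-controlled facial video generation, so that head motion correlates naturally with the input gaze.

Details

Motivation: Existing facial video generation methods overlook the natural coordination between gaze and head motion. The paper aims to learn from real video data and generate head motion consistent with human behavior.

Result: Human evaluation and quantitative comparisons show that the method generates markedly more natural motion than baseline methods, with evaluators showing a statistically significant preference for its results.

Insight: Innovations include an automatic pipeline that builds training data from large-scale in-the-wild video and a generative conditional Variational Autoencoder that models the probabilistic and temporal character of gaze-head coordination, offering a data-driven approach to dynamic face generation that respects physiological coordination.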

Abstract: We present the first data-driven approach to model temporal gaze-head coordination from large-scale in-the-wild facial videos. To obtain training data for generalizable learning, we propose an automatic pipeline that extracts natural yet diverse gaze and head motions with off-the-shelf appearance-based gaze estimators. To capture the probabilistic correlation and temporal dynamics of gaze-head coordination, we build our model on a generative conditional Variational Autoencoder for plausible yet diverse gaze-conditioned head motion generations. We further apply our framework to gaze-controlled facial video generation, where we enable video generation with natural and realistic head motion correlated to the input gaze - an aspect that has not been emphasized before. Human evaluation and quantitative comparisons demonstrate our method’s effectiveness and validate our design choices, with evaluators showing statistically significant preference for our approach over baseline methods.


[194] [CLS] is Not Enough: Multi-Label Recognition via Patch-Level Inference and Adaptive Aggregation cs.CVPDF

Akang Wang, Xili Deng, Zhanxuan Hu, Yi Zhao, Yonghang Tai

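TL;DR: This paper proposes PIAA, a multi-label image recognition framework that addresses the weak multi-label performance of vision-language models such as CLIP, which stems from an insufficient global visual representation. It strengthens patch-level predictions through Patch-level Inference and consolidates them into the final multi-label prediction through Adaptive Aggregation. The whole pipeline requires no training or fine-tuning.

Details

Motivation: Models such as CLIP rely on a single global visual representation (the [CLS] token) that cannot faithfully encode diverse targets with varying scales, contexts, and co-occurrence patterns, which limits multi-label recognition performance.

Result: On the NUS-WIDE benchmark the method achieves an mAP gain of more than 6% over representative baselines, with minimal extra computation.

Insight: Innovations include strengthening patch-level predictions from two complementary angles (mitigating semantic entanglement in the visual encoder and learning an unsupervised visual classifier) and an adaptive aggregation module that consolidates patch-level scores. The framework is training-free, improving both the efficiency and accuracy of multi-label recognition.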

Abstract: Vision-Language Models such as CLIP exhibit strong zero-shot recognition capability by aligning images with textual concepts, yet they often underperform on multi-label recognition where multiple objects co-exist. A key bottleneck is that the [CLS] token, as a single global visual representation, is insufficient to faithfully encode diverse targets with varying scales, contexts, and co-occurrence patterns. To address this limitation, we present a new multi-label image recognition framework, termed PIAA, which formulates prediction as Patch-level Inference followed by Adaptive Aggregation. Specifically, we first enhance patch-wise predictions from two complementary perspectives: (i) mitigating semantic entanglement in the visual encoder to obtain more discriminative patch representations, and (ii) learning an unsupervised visual classifier to narrow the vision-language modality gap. We then introduce an adaptive aggregation module that consolidates patch-level scores into the final multi-label prediction. Notably, the entire pipeline is fully training-free, requiring no gradient updates or parameter fine-tuning. Experiments show that our method achieves strong improvements with minimal extra computation, exceeding a 6% mAP gain on the challenging NUS-WIDE benchmark over representative baselines. Code is available at https://github.com/akang-wang/PIAA.
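
To make the "patch-level scores, then aggregation" idea concrete, here is a small sketch. The abstract does not specify how the adaptive aggregation module works, so a softmax-weighted pooling over patches is used purely as a stand-in. The function name `aggregate_patch_scores` and the `sharpness` parameter are hypothetical.

```python
import math

def aggregate_patch_scores(patch_scores, sharpness=5.0):
    """patch_scores[p][c] is the score of class c at patch p.

    Returns one image-level score per class.
    """
    num_classes = len(patch_scores[0])
    image_scores = []
    for c in range(num_classes):
        column = [row[c] for row in patch_scores]
        # Softmax weights over patches: confident patches dominate the average,
        # so a small object seen in one patch is not washed out by background.
        peak = max(column)
        exps = [math.exp(sharpness * (s - peak)) for s in column]
        total = sum(exps)
        image_scores.append(sum(s * e for s, e in zip(column, exps)) / total)
    return image_scores

patch_scores = [
    [0.9, 0.1],  # patch 0 mostly shows class 0
    [0.1, 0.8],  # patch 1 mostly shows class 1
    [0.1, 0.1],  # background patch
]
scores = aggregate_patch_scores(patch_scores)
```

Both classes keep a high image-level score here even though each appears in only one of three patches, whereas a plain mean over patches would fall below 0.4 for both. That contrast is the motivation for patch-level inference over a single global token, though the real module may aggregate differently.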


[195] WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation cs.CVPDF

Kaining Ying, Hengrui Hu, Siyu Ren, Jiamu Li, Fengjiao Chen

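TL;DR: This paper introduces WBench, a comprehensive multi-turn benchmark for evaluating interactive world models along five dimensions: video quality, setting adherence, interaction adherence, consistency, and physics compliance. It contains 289 test cases and 1,058 interaction turns covering diverse scenes, styles, subjects, and both first- and third-person perspectives, and it unifies text, 6-DoF pose, and discrete-action control so that models with different input interfaces can be evaluated. Evaluation uses 22 automatic sub-metrics that combine specialist vision models with large multimodal models, all validated against human judgments.

Details

Motivation: Existing benchmarks cover only part of the competencies required of interactive world models and provide no unified standard for systematic evaluation, so a comprehensive benchmark is needed to fill the gap.

Result: An evaluation of 20 state-of-the-art models finds that no single model performs strongly across all dimensions, and provides detailed diagnostic insight into each model's characteristic strengths, weaknesses, and open challenges.

Insight: The novelty is a multi-dimensional, multi-turn interactive benchmark that unifies different input interfaces (such as text and 6-DoF pose) and combines automatic metrics with human validation for systematic evaluation, giving a comprehensive diagnosis of model performance.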

Abstract: Interactive world models are advancing rapidly, yet existing benchmarks cover only part of the required competencies, leaving no unified standard for systematic evaluation. To fill this gap, we introduce WBench, a comprehensive multi-turn benchmark for interactive world model evaluation along five dimensions, namely video quality, setting adherence, interaction adherence, consistency, and physics compliance. WBench contains 289 test cases and 1,058 interaction turns, where each case specifies a world setting and a multi-turn interaction sequence, covering diverse scenes, styles, subjects, and both first- and third-person perspectives, together with four interaction types, including navigation, subject action, event editing, and perspective switching. For navigation, WBench unifies text, 6-DoF pose, and discrete-action control, enabling evaluation of models with different native input interfaces. Evaluation uses 22 automatic sub-metrics that combine specialist vision models with large multimodal models, and all metrics are validated against human judgments. Across 20 state-of-the-art models, we find that no single model performs strongly across all dimensions. We provide detailed diagnostic insights into the characteristic strengths, weaknesses, and open challenges of each model. Code and data are available at https://github.com/meituan-longcat/WBench.


[196] R5DGS: Semantic-Aware 4D Gaussian Splatting with Rigid Body Constraints for Efficient Dynamic Scene Reconstruction cs.CVPDF

Denis Gridusov, Maxim Popov, Sergey Kolyubin

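TL;DR: R5DGS is a framework for efficient dynamic scene reconstruction that augments a physics-driven 4D Gaussian representation with semantic awareness and rigid-body constraints. Compact Identity Encoding vectors associate Gaussians with objects, and a CLIP-based object lookup table supports open-vocabulary retrieval from text prompts. By predicting physical dynamics only for object centroids and propagating the motion to the associated Gaussians, it accelerates extrapolation while keeping trajectories plausible.

Details

Motivation: When reconstructing and predicting dynamic 3D scenes from multi-view video, existing physics-informed Gaussian Splatting methods lack semantic awareness and carry large computational overhead.

Result: The method achieves an 11 FPS speedup during extrapolation without compromising trajectory plausibility.

Insight: Innovations include Identity Encoding vectors for precise Gaussian-to-object association and a rigid-body inference constraint that predicts dynamics only for object centroids and propagates the motion, markedly improving computational efficiency.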

Abstract: Reconstructing and predicting dynamic 3D scenes from multi-view videos is a foundational task for robotics, AR/VR, and digital twins. Recent physics-informed Gaussian Splatting methods achieve impressive future frame extrapolation but lack semantic awareness and suffer from large computational overhead. We introduce $\textbf{R5DGS}$, a framework that augments a physics-driven 4D Gaussian representation with compact Identity Encoding vectors, enabling precise Gaussian-to-object association. By constructing an offline CLIP-based object lookup table, we support open-vocabulary text prompting to retrieve and render object-specific Gaussians across arbitrary timestamps and viewpoints. Furthermore, we propose a rigid-body inference constraint that predicts and integrates physical dynamics exclusively for object centroids, propagating motion to associated Gaussians via relative transformations. This optimization yields an 11 FPS speedup during extrapolation without compromising trajectory plausibility.
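
The rigid-body constraint, in which dynamics are predicted only for the centroid and every associated Gaussian follows through a relative transformation, can be sketched as below. This is a simplified illustration rather than the authors' code: rotation is restricted to yaw about the z-axis for brevity, and the names `propagate` and `rotate_z` are hypothetical.

```python
import math

def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = point
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return (cos_a * x - sin_a * y, sin_a * x + cos_a * y, z)

def propagate(gaussian_centers, centroid_start, centroid_end, yaw):
    """Move every Gaussian with its object's centroid, keeping offsets rigid."""
    moved = []
    for center in gaussian_centers:
        # Offset of the Gaussian relative to the object centroid at the start.
        offset = tuple(p - c for p, c in zip(center, centroid_start))
        # Apply the object's rotation to the offset, then re-attach it to the
        # predicted centroid position.
        rotated = rotate_z(offset, yaw)
        moved.append(tuple(c + r for c, r in zip(centroid_end, rotated)))
    return moved

gaussians = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
moved = propagate(gaussians, (0.0, 0.0, 0.0), (2.0, 0.0, 1.0), math.pi / 2)
```

Only the centroid trajectory and one rotation per object need to be predicted, so the per-step cost scales with the number of objects rather than the number of Gaussians, which is the intuition behind the reported extrapolation speedup.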


[197] AgentGrounder: Zero-Shot 3D Visual Pointcloud Grounding using Multimodal Language Models cs.CV | cs.ROPDF

Cuong Huynh, Maxim Popov, Denis Gridusov, Sergey Kolyubin

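TL;DR: This paper presents AgentGrounder, a zero-shot 3D visual grounding framework for point clouds that requires no task-specific 3D training. It has two stages: an offline stage builds an Object Lookup Table (OLT) with instance IDs, semantic labels, and 3D bounding boxes; an online tool-driven agent decomposes the query, retrieves relevant candidates from the OLT, performs geometric scoring, and triggers image rendering on demand when additional visual evidence is needed.

Details

Motivation: Existing zero-shot 3D visual grounding methods depend on sets of multi-view images and on the limited semantic and spatial detail provided by standard 3D segmentation tools. The paper aims to give embodied AI a more robust and practical foundation for open-vocabulary 3D grounding.

Result: In the zero-shot setting, the method improves over SeeGround by +2.5% Acc@0.5 on ScanRefer and +6.3% on Nr3D, including a notable +6.3% gain on Nr3D view-independent queries.

Insight: The novelty lies in combining selective retrieval, geometric reasoning, and adaptive visual inspection: the object lookup table and tool-driven agent reduce cascading matching errors and improve context-window efficiency, offering a new approach to open-vocabulary 3D grounding.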

Abstract: 3D Visual Grounding (3DVG) is an essential capability for embodied AI, requiring agents to localize objects in 3D scenes based on natural language descriptions. Recent zero-shot methods leverage 2D vision-language models (LVLMs). However, they often rely on existing sets of multi-view images and struggle with the limited semantic and spatial details provided by standard 3D segmentation tools. We present $\textbf{AgentGrounder}$, a zero-shot 3D visual grounding framework that operates directly on colored point clouds without task-specific 3D training. Our approach follows a two-stage design: (1) an offline stage that applies a 3D model to build an Object Lookup Table (OLT) with instance IDs, semantic labels, and 3D bounding boxes; and (2) an online tool-driven agent that decomposes each query, retrieves only relevant candidates from the OLT, performs geometric scoring, and triggers image rendering on demand when additional visual evidence (e.g., color, material, or viewpoint-sensitive cues) is required. Compared with fixed anchor-target matching pipelines, this design reduces cascading matching errors and improves context-window efficiency by avoiding prompts overloaded with irrelevant objects. We evaluate on ScanRefer and Nr3D under a zero-shot setting and observe consistent improvements over SeeGround in our setup, including +2.5% Acc@0.5 on ScanRefer and +6.3% on Nr3D, with a notable +6.3% gain on Nr3D view-independent queries. These results show that combining selective retrieval, geometric reasoning, and adaptive visual inspection yields a practical and robust foundation for open-vocabulary 3D grounding. Our code is available at https://github.com/be2rlab/AgentGrounder.
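
For readers unfamiliar with the Acc@0.5 metric quoted above, the sketch below shows how such a score is commonly computed: a prediction counts as correct when the 3D intersection-over-union between the predicted and ground-truth boxes reaches 0.5. The box layout `(xmin, ymin, zmin, xmax, ymax, zmax)`, the axis-aligned assumption, and the function names are illustrative choices and are not taken from the paper's evaluation code.

```python
def box_iou_3d(box_a, box_b):
    """IoU of two axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    overlap = 1.0
    for axis in range(3):
        low = max(box_a[axis], box_b[axis])
        high = min(box_a[axis + 3], box_b[axis + 3])
        if high <= low:
            return 0.0  # the boxes do not overlap along this axis
        overlap *= high - low

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return overlap / (volume(box_a) + volume(box_b) - overlap)

def accuracy_at_iou(predictions, ground_truths, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth reaches the threshold."""
    hits = sum(box_iou_3d(p, g) >= threshold
               for p, g in zip(predictions, ground_truths))
    return hits / len(predictions)

preds = [(0, 0, 0, 2, 2, 2), (0, 0, 0, 1, 1, 1)]
truths = [(0, 0, 0, 2, 2, 1.5), (2, 2, 2, 3, 3, 3)]
acc = accuracy_at_iou(preds, truths)
```

In this toy example the first prediction has an IoU of 0.75 and counts as a hit, the second does not overlap its target at all, and the resulting accuracy is 0.5.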


[198] MuNet: A Mutualistic Network for Joint 3D Human Mesh Recovery and 3D Clothed Human Reconstruction from Single Images cs.CV | cs.AIPDF

Yunqi Gao, Leyuan Liu, Yuhan Li, Changxin Gao, Jingying Chen

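TL;DR: The paper proposes MuNet, a mutualistic network for joint 3D human mesh recovery and 3D clothed human reconstruction from single images. It uses 2-manifold graphs as a unified representation, progressively deforms an initial graph with an end-to-end graph convolutional network, and introduces a mutualistic mechanism through which the two tasks guide and refine each other during training.

Details

Motivation: 3D human mesh recovery and 3D clothed human reconstruction are inherently related, yet they have usually been studied in isolation, overlooking the potential gains of joint optimization. The paper addresses this with a unified framework that exploits the dependencies between the tasks.

Result: Extensive evaluation on six benchmark datasets (Human3.6M, 3DPW, MPI-INF-3DHP, THuman2.0, CAPE, and RenderPeople) shows that MuNet achieves state-of-the-art (SOTA) performance on both tasks across all datasets.

Insight: Innovations include: (1) 2-manifold graphs as a unified representation for all 3D models, enabling consistent modeling across tasks; (2) an end-to-end graph convolutional network for progressive deformation and refinement; (3) a mutualistic mechanism that jointly optimizes the two tasks through reciprocal guidance and feedback during training, an idea that may carry over to other multi-task learning settings.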

Abstract: 3D human mesh recovery and 3D clothed human reconstruction are inherently related, yet they have long been studied in isolation, thereby overlooking the potential gains of joint optimization. To overcome this limitation, we propose to address these two tasks within a unified framework, which allows their mutual dependencies to be effectively exploited. Building on this idea, we propose MuNet, a mutualistic network for joint 3D human mesh recovery and 3D clothed human reconstruction from single images. First, we adopt 2-manifold graphs as a unified representation for all 3D models, enabling consistent modeling across 3D human mesh recovery and clothed human reconstruction. Second, we design an end-to-end graph convolutional network that progressively deforms an initial graph into a 3D human mesh and refines it into a detailed 3D clothed human model. Third, we introduce a mutualistic mechanism that allows reciprocal interaction between the two tasks during training, where 3D human mesh recovery provides guidance for 3D clothed human reconstruction, and reconstruction feedback refines the 3D human mesh recovery. We extensively evaluate MuNet on six benchmark datasets for 3D human mesh recovery and 3D clothed human reconstruction, including Human3.6M, 3DPW, MPI-INF-3DHP, THuman2.0, CAPE, and RenderPeople. Experimental results demonstrate that MuNet achieves state-of-the-art performance on both tasks across all datasets. The code of MuNet is released for research purposes at https://github.com/starVisionTeam/MuNet.


[199] Closed-Loop Bidirectional Prompting for Adversarial Robustness of Vision Language Models cs.CVPDF

Xiao Liu, Jiaxiang Liu, Boci Peng, Boren Hu, Yusong Wang

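TL;DR: This paper proposes Closed-Loop Bidirectional Prompting to improve the robustness of vision-language models under adversarial perturbations. Through a dynamic feedback loop, textual semantics denoise the visual representations, and the refined visual features drive instance-adaptive prompt updates, restoring cross-modal semantic alignment.

Details

Motivation: Existing defenses are largely unidirectional or structural and fail to exploit bidirectional cross-modal complementarity and instance-wise adaptive protection, leaving cross-modal semantic alignment vulnerable to adversarial perturbations.

Result: Extensive evaluation on 11 datasets shows state-of-the-art robustness and strong base-to-new generalization, while maintaining a favorable trade-off between computational cost and accuracy.

Insight: The novelty lies in casting robust adaptation as recovering cross-modal agreement through a dynamic feedback loop, with a Semantic Anchor introduced as a stable prior that constrains the cyclic updates and mitigates perturbation-induced feature corruption, enabling anchor-guided bidirectional denoising and adaptive prompt optimization.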

Abstract: Vision Language Models adapt well to downstream tasks but are highly vulnerable to adversarial perturbations that disrupt cross-modal semantic alignment. Existing defenses are largely unidirectional or structural, failing to exploit bidirectional cross-modal complementarity and instance-wise adaptive protection. To overcome the limitations of unidirectional and static defenses in adversarial settings, we propose Closed-Loop Bidirectional Prompting, casting robust adaptation as cross-modal agreement recovery via a dynamic feedback loop on frozen encoders. A Semantic Anchor is introduced as a stable prior to constrain cyclic updates and mitigate perturbation-induced feature corruption. Through anchor-based bootstrapping, textual semantics denoise visual representations, while the refined visuals enable instance-adaptive prompt updating, yielding a rectified and robust consensus. Extensive evaluations across 11 datasets validate state-of-the-art robustness and strong base-to-new generalization, while maintaining a favorable trade-off between computational cost and accuracy.


[200] Where Concept Erasure Should Occur: Concept-Layer Alignment in Text-to-Video Diffusion Models cs.CVPDF

Yiwei Xie, Ping Liu, Zheng Zhang

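TL;DR: This paper proposes CLEAR, a framework that improves concept erasure in text-to-video diffusion models through concept-layer alignment. It identifies a bottleneck in which target concepts are separable from non-target signals only at certain representational depths, and uses this to suppress target concepts more precisely while preserving generative quality.

Details

Motivation: Text-to-video diffusion transformers encode semantic information unevenly across model depth, which limits concept erasure. The work addresses the structural constraint that concept and non-target signals are entangled at certain representational layers.

Result: Experiments on large-scale text-to-video diffusion models show that enforcing concept-layer alignment yields more precise concept suppression while preserving overall generative quality.

Insight: The novelty lies in reframing concept erasure as the problem of identifying representational depths where concept and non-target signals naturally separate, and in selecting layers with a separability-driven objective rather than heuristics, offering a structure-aware perspective on model editing.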

Abstract: Text-to-video diffusion transformers encode semantic information unevenly across model depth, which constrains effective concept erasure. We identify a representational bottleneck, termed concept-layer topological alignment, under which target concepts exhibit higher separability at certain representational depths. Outside these depths, concept and non-target signals remain strongly entangled, limiting the effectiveness of depth-specific erasure. This observation reframes concept erasure as the problem of identifying representational depths where concept-non-target separation naturally emerges. Motivated by this structural constraint, we introduce CLEAR, a separability-driven optimization framework for concept erasure that explicitly enforces concept-layer alignment. CLEAR operationalizes this principle by formulating layer selection as an optimization problem over concept-non-target separability, rather than relying on layer-agnostic or heuristic choices. To enable this, we introduce a separability-aware objective that favors layers exhibiting stronger concept-non-target separation. Experiments on large-scale text-to-video diffusion models demonstrate that enforcing concept–layer alignment leads to more precise concept suppression while preserving overall generative quality.


[201] LRDDv3: High-Resolution Long-Range Drone Detection Dataset with Range Information and Thermal Data cs.CV | cs.ROPDF

Knut Peterson, Zaid Mayers, Azmain Yousuf, Priontu Chowdhury, Asher Zaczepinski

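TL;DR: This paper introduces LRDDv3, a high-resolution dataset for long-range drone detection with 102,532 RGB images and 29,630 paired IR images, intended to address the shortage of high-quality, high-resolution long-range drone data in existing datasets.

Details

Motivation: As drones become common across airspaces, detecting them effectively at long range is critical for airspace safety, yet existing datasets still leave a gap in high-quality, high-resolution long-range drone data.

Result: The dataset is one of the first drone detection datasets to use 4K image resolution with paired 640x512 IR images, representing a significant advance toward long-range drone detection.

Insight: The contribution is a large-scale, high-resolution long-range drone dataset with range information and paired thermal data, which can help train more robust detectors, particularly under difficult lighting and background conditions. The long collection period and varied scenes make the dataset more representative and useful.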

Abstract: Unmanned Aerial Vehicles (UAVs) have quickly become common in various airspaces, serving a wide range of applications from recreational flying to commercial photography and package delivery. With the increasing prevalence of UAVs, it becomes critical that both manned and unmanned aircraft can detect UAVs and other flying objects from long range to effectively track movement and ensure safe operation in shared spaces. While several datasets have been introduced for drone detection, the need for expanded high-quality data persists, especially in the area of high-resolution long-range drone data. To address this, we introduce a high-resolution dataset of 102,532 long-range RGB images of drones, sampled at 5 FPS from 128 distinct video clips taken mid-flight during 17 different data collection days spread over 8 months to ensure a wide variety of lighting scenarios, flight locations, and background elements. The dataset includes comprehensive drone range information, as well as 29,630 IR images, all paired with RGB counterparts from the base dataset. As one of the first drone detection datasets to leverage 4K image resolution and paired 640x512 IR images, our work represents a significant advancement to enable the detection of drones at long range. For access to the complete dataset, please visit https://research.coe.drexel.edu/ece/imaple/lrddv3/


[202] EchoPilot: Training-Free Ultrasound Video Segmentation via Scale-Space Semantic Prompting and Reliability-Gated Memory cs.CV | cs.AIPDF

Ruiqiang Xiao, Zhaohu Xing, Yijun Yang, Zhenyan Han, Weiming Wang

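TL;DR: EchoPilot is a training-free framework for ultrasound video segmentation that needs only a single point click on the first frame and a category name. By orchestrating a frozen medical vision-language model, a vision foundation model, and a promptable video segmentor, it tackles speckle noise, weak boundaries, and rapid deformation in ultrasound imagery.

Details

Motivation: Ultrasound video segmentation is difficult because of speckle noise, weak boundaries, and rapid anatomical deformation. Deploying existing promptable foundation models directly is unreliable: a single point prompt provides insufficient spatial context, and greedy memory updates cause severe temporal drift.

Result: On three ultrasound video datasets, EchoPilot achieves state-of-the-art performance in the sparse-interaction setting, consistently outperforming training-free baselines and fine-tuned specialist models.

Insight: Innovations include Scale-Space Semantic Prompting (which selects the best contextual view with the parameter-free S.E.E.D. criterion and synthesizes geometrically precise auxiliary point prompts) and the Reliability-Gated Memory update (which selectively freezes the segmentor's memory bank under uncertain predictions to prevent error accumulation). The paper also contributes the first dynamic fetal placenta ultrasound video segmentation dataset.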

Abstract: Ultrasound video segmentation is clinically valuable yet difficult due to speckle noise, weak boundaries, and rapid anatomical deformation. Recent promptable foundation models enable point-guided segmentation, but their direct deployment in ultrasound remains unreliable: a single point provides insufficient spatial context to resolve scale ambiguity, and greedy memory updates amplify early errors into severe temporal drift. We present EchoPilot, a training-free framework for ultrasound video segmentation under sparse first-frame interaction, requiring only a single point click and an anatomical category name. EchoPilot orchestrates a frozen medical vision-language model (VLM) for semantic localization, a vision foundation model (VFM) for dense geometric feature extraction, and a promptable video segmentor for mask prediction and propagation. To resolve initialization ambiguity, we propose Scale-Space Semantic Prompting, which first selects an optimal contextual view via a parameter-free S.E.E.D. (Semantic Energy-Entropy Density) criterion, and then synthesizes geometrically precise auxiliary point prompts from dense foundation features without additional user interaction. To reduce propagation drift, a Reliability-Gated Memory update is further introduced to selectively freeze the segmentor’s memory bank under uncertain predictions, preventing error accumulation. We also contribute the first dynamic fetal placenta ultrasound video segmentation dataset with 671 annotated frames. Across three ultrasound video datasets, EchoPilot achieves state-of-the-art performance under the sparse-interactive setting, consistently outperforming training-free baselines and finetuned specialists.
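
The Reliability-Gated Memory idea (freeze the memory bank when a prediction is uncertain) can be sketched as follows. The abstract does not say how reliability is measured or how the bank is structured, so a scalar confidence compared against a fixed threshold and a bounded first-in-first-out bank are assumptions made for illustration. `GatedMemory` and its parameters are hypothetical names.

```python
from collections import deque

class GatedMemory:
    """Memory bank that only accepts frames whose prediction looks reliable."""

    def __init__(self, capacity=4, threshold=0.8):
        self.frames = deque(maxlen=capacity)  # oldest entries fall out first
        self.threshold = threshold

    def update(self, frame_index, mask, confidence):
        # Freeze the bank when the prediction is uncertain, so that a bad mask
        # is not stored and reused when segmenting later frames.
        if confidence < self.threshold:
            return False
        self.frames.append((frame_index, mask))
        return True

memory = GatedMemory(capacity=2, threshold=0.8)
accepted = [memory.update(i, f"mask_{i}", c)
            for i, c in enumerate([0.95, 0.40, 0.90, 0.85])]
stored = [index for index, _ in memory.frames]
```

A greedy update would have stored the low-confidence frame 1 and let its errors influence later frames. Here that frame is skipped, and the bank ends up holding frames 2 and 3.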


[203] A Pedestrian-Vehicle Interaction Benchmark and Annotation Framework for Unstructured Scenes via Uncalibrated Cameras cs.CVPDF

Haoyang Peng, Qian Hu, Songan Zhang, Ming Yang

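TL;DR: This paper proposes a pedestrian-vehicle interaction dataset annotation framework based on uncalibrated surveillance cameras and releases the PINNS dataset. PINNS covers multiple countries and scenarios and focuses on complex scenes with dense pedestrian-vehicle interactions, with the aim of advancing trajectory prediction and autonomous driving research in unstructured scenes.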

Details

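Motivation: Public datasets with dense pedestrian-vehicle interactions are scarce, especially for unstructured and semi-structured scenes. Most existing studies rely on structured road data, so complex heterogeneous interactions remain under-studied.

Result: The paper builds the PINNS dataset, annotated according to the standard issued by the Chinese Association of Automation. It provides trajectory data and scene-level information and is publicly available.

Insight: The novelty lies in a general annotation framework that works with uncalibrated cameras and in a broad, easily extensible benchmark dataset of heterogeneous traffic interactions, which supplies the data needed for research on complex mixed-traffic scenarios.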

Abstract: Predicting interactions between pedestrians and vehicles is essential for autonomous driving safety in unstructured and semi-structured scenarios; however, this task is severely hindered by the scarcity of public datasets that feature dense pedestrian-vehicle interactions. Most current studies rely on structured road data, leaving the complex, heterogeneous interactions found in unstructured environments insufficiently represented and researched. In this paper, we propose a dataset annotation framework based on video data from uncalibrated surveillance cameras and present PINNS (Pedestrian-vehicle Interaction dataset from uNcalibrated cameras in uNstructured Scenes). The dataset covers multiple countries and regions, includes diverse typical traffic scenarios, and considers variations in seasons, lighting conditions, and weather. It focuses on complex scenes with dense pedestrian-vehicle interactions and is designed to be easily extensible. The dataset is constructed and annotated according to the standard issued by the Chinese Association of Automation, providing both trajectory data and corresponding scene-level information. Furthermore, this paper analyzes current challenges and research directions in heterogeneous agent trajectory prediction and shows the necessity and usefulness of the proposed dataset. We hope our framework and dataset will facilitate research on trajectory prediction and autonomous driving in complex mixed traffic scenarios. PINNS is publicly available at https://github.com/Songan-Lab.


[204] VEN-VL: A Visual Ensemble MoE Framework for Effective and Efficient Multi-Modal Understanding cs.CV | cs.AIPDF

Yinghao Wu, Zhuoyan Luo, Yiyao Yu, Zhaojian Yu, Yujiu Yang

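TL;DR: The paper proposes VEN-VL, a visual ensemble mixture-of-experts (MoE) framework that follows an "enrich then compact" principle. It targets the bottleneck in information capacity and density that existing efficient multimodal understanding methods incur by over-compressing a single visual clue and relying on heuristic pruning. The framework first unifies visual representations from different perspectives to enrich information capacity, then progressively compacts them with adaptive routers in specialized visual experts to raise information density, and adds explicit visual supervision to preserve crucial information.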

Details

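Motivation: Existing efficient multimodal methods accelerate understanding by emphasizing a high compression ratio on a single visual clue and by relying on heuristic pruning with coarse attention alignment. This limits the information capacity and density of visual tokens and causes noticeable performance degradation.

Result: Experiments show that VEN-VL achieves superior performance on complex visual tasks with only a few information-condensed tokens, effectively bridging the gap between performance and efficiency.

Insight: The novelty is the "enrich then compact" visual ensemble MoE framework: it unifies multi-perspective representations, raises information density through adaptive routing, and uses explicit visual supervision to preserve crucial information, improving multimodal understanding while staying efficient.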

Abstract: Despite the remarkable progress achieved by recent efficient methods in accelerating multimodal understanding, they still suffer from noticeable performance degradation. Their emphasis on the high compression ratio of a single visual clue and reliance on the heuristic pruning strategy with coarse attention alignment incurs a bottleneck on the information capacity and density of visual tokens. Addressing this limitation, we propose VEN-VL, a visual ensemble MoE framework for effective and efficient perception following the enrich then compact principle. Specifically, we first enrich the information capacity by unifying the visual representations of different perspectives, and then progressively compact it with adaptive routers in specialized visual experts to enhance the information density. Furthermore, we incorporate the reconstruction ability of vanilla structure via explicit visual supervision, facilitating crucial information preservation. Experimental results demonstrate our superiority in complex visual tasks with few information-condensed tokens, which effectively bridges the gap between performance and efficiency.


[205] RAPTOR+: A Visually Grounded Vision-Language Framework to Improve Clinical Trust and Auditability in Automated Cancer Referral Processing cs.CVPDF

Sofiat Abioye, Ufaq Khan, Shazad Ashraf, Anusha Jose, Benjamin Wallace

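TL;DR: This paper presents RAPTOR+, a multimodal framework built on vision-language models (VLMs) for end-to-end understanding and processing of clinical cancer referral documents. It addresses the original RAPTOR system's reliance on a separate OCR stage, which made it sensitive to handwriting and layout variation and broke the link to visual evidence. By tying extracted referral decisions to visual evidence, RAPTOR+ improves clinical trust and auditability.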

Details

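Motivation: Urgent suspected colorectal cancer (CRC) referrals create an operational bottleneck because semi-structured clinical documents need manual review and transcription. Existing LLM-based extraction systems also depend on OCR, which makes them less robust and breaks the chain of visual evidence.

Result: Fine-tuned VLMs, commercial and open-source zero-shot VLMs, and the original OCR pipeline were evaluated on 223 clinically curated urgent CRC referral forms. Fine-tuned Qwen3-VL-8B performed best on both Reading Accuracy (96.1%) and the Strict Safety metric (60.6%), well ahead of zero-shot models such as Gemini 2.5 Flash (92.6% Reading Accuracy, 1.2% Strict Safety), a substantial improvement in verifiable evidence grounding.

Insight: The core contribution is RAPTOR+, an end-to-end multimodal clinical document understanding framework with visual evidence grounding, together with a grounding-aware evaluation framework. The key insight is that for high-stakes clinical document processing, task-specific fine-tuning is essential for reliable, auditable understanding, while zero-shot models show a clear gap in evidence grounding.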

Abstract: Urgent suspected colorectal cancer (CRC) referrals create operational bottlenecks because semi-structured clinical documents often require manual review and transcription. The original RAPTOR system used Large Language Models for structured extraction but relied on a separate OCR stage, making it vulnerable to handwriting, layout variation, and loss of visual evidence linkage. We present RAPTOR+, a multimodal extension that uses Vision-Language Models (VLMs) for end-to-end referral understanding. We evaluate fine-tuned VLMs, commercial and open-source zero-shot VLMs, and the original OCR-based pipeline on 223 clinically curated CRC urgent referral forms. We also introduce a grounding-aware evaluation framework that measures both extraction accuracy and evidence localisation. Results show a clear grounding gap in zero-shot models. Gemini 2.5 Flash achieved 92.6% Reading Accuracy but only 1.2% Strict Safety. In contrast, fine-tuned Qwen3-VL-8B achieved 96.1% Reading Accuracy and 60.6% Strict Safety, substantially improving verifiable evidence grounding. These findings show that task-specific fine-tuning is essential for reliable, auditable clinical document understanding. RAPTOR+ enables extracted referral decisions to be linked to visual evidence, supporting safer and more efficient cancer referral triage.


[206] Context-driven Missing-Modality Learning for Robust Medical Diagnosis with Image-Tabular Data cs.CVPDF

Tianling Liu, Lequan Yu, Tong Han, Liang Wan

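TL;DR: This paper proposes Context-driven Missing-Modality Learning (CMML), a framework for medical diagnosis when multimodal (image and tabular) data are arbitrarily missing. A Cascade Residual Transformer-based Autoencoder (CRTA) uses learnable context tokens as a dataset-level semantic prior to capture inter-modal dependencies and synthesize key missing representations. Instance-adaptive semantic references then align the heterogeneous modality representations into a unified space, where class-aware contrastive refinement explores discriminative diagnostic cues.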

Details

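Motivation: In clinical practice, the arbitrary absence of modalities such as imaging or clinical tabular records severely degrades the diagnostic performance of multimodal models. Existing methods either discard missing modalities, losing information, or struggle to synthesize them because they do not capture complex inter-modal dependencies.

Result: Extensive evaluation on skin lesion (Derm7pt), ocular disease (ODIR), and meningioma (MEN) datasets shows that CMML significantly outperforms existing state-of-the-art (SOTA) methods, with average AUC improvements of 1.26%, 0.97%, and 1.32%, respectively.

Insight: The novelty is a two-stage framework that combines modality synthesis with semantic alignment. Learnable context tokens act as a dataset-level prior that guides both the synthesis of missing modalities and the alignment of representations, and class-aware contrastive learning further refines discriminative features, giving a robust solution for arbitrarily missing multimodal data.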

Abstract: While multimodal data integrating diverse imaging and clinical tabular records is crucial for accurate medical diagnosis, the arbitrary absence of specific modalities is prevalent in clinical practice, severely degrading the performance of multimodal models. Existing methods either discard missing modalities, leading to information loss, or struggle to synthesize them without capturing complex inter-modal dependencies. To address these limitations, we propose a novel Context-driven Missing-Modality Learning (CMML) framework, which sequentially performs modality synthesis and semantic alignment to achieve robust diagnosis under arbitrary missing conditions. Specifically, we design a Cascade Residual Transformer-based Autoencoder (CRTA) that leverages learnable context tokens acting as dataset-level semantic prior to capture inter-modal dependencies and synthesize key missing representations. These representations are further enriched by modality-specific memory banks. To resolve the discrepancy between original available and synthesized representations, we transform the learned context tokens into instance-adaptive semantic references by infusing multimodal representations from the CRTA’s outputs. This reference guides the alignment of heterogeneous modality representations into a unified space, where class-aware contrastive refinement is finally applied to explore discriminative diagnostic cues. Extensive evaluations on skin lesion (Derm7pt), ocular disease (ODIR), and meningioma (MEN) datasets demonstrate that CMML significantly outperforms state-of-the-art (SOTA) methods, yielding AVG AUC improvements of 1.26%, 0.97%, and 1.32%, respectively.


[207] A Multimodal 3D Foundation Model for Light Sheet Fluorescence Microscopy Enables Few-Shot Segmentation, Classification, and Deblurring cs.CV | cs.AI | cs.LGPDF

Adina Scheinfeld, Haotan Zhang, Shang Mu, Rudolf L. M. van Herten, Lucas Stoffl

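TL;DR: This paper presents a 3D foundation model for light sheet fluorescence microscopy (LSM) data, pretrained on a large collection of 3D images spanning multiple organisms, stains, and imaging protocols to learn transferable volumetric representations. Trained by jointly optimizing masked reconstruction and image-text alignment, the model greatly reduces the annotation burden and adapts efficiently in few-shot settings to downstream tasks such as segmentation, classification, and deblurring.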

Details

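Motivation: LSM data are large, high-dimensional, and expensive to annotate, which makes supervised deep learning costly and hard to scale. Although unannotated LSM volumes are abundant, foundation models for this modality remain underexplored because of computational challenges and the complexity of volumetric representation learning.

Result: Evaluated on downstream segmentation, classification, and deblurring, the method consistently outperforms baseline models both on standard evaluation metrics and under rigorous assessment by domain experts.

Insight: The novelty is the first 3D multimodal foundation model for LSM data, which learns general volumetric representations through a pretraining strategy that combines masked reconstruction with image-text alignment and supports efficient few-shot adaptation. Its main value is that pretraining substantially reduces annotation requirements while improving performance across LSM analysis tasks.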

Abstract: Light sheet fluorescence microscopy (LSM) enables high-resolution, three-dimensional (3D) imaging of biological specimens, providing rich volumetric data for studying cellular organization, pathology, and vascular networks. However, the size, dimensionality, and annotation burden of LSM data make supervised deep learning approaches costly and difficult to scale. Additionally, despite the abundance of unannotated LSM volumes, foundation models for this modality remain underexplored due to computational challenges and the complexity of volumetric representation learning. In this work, we introduce a 3D foundation model for LSM data, pretrained on a large curated collection of 3D images spanning multiple organisms, stains, and imaging protocols. We learn transferable volumetric representations by jointly optimizing for masked reconstruction and image-text alignment. The pretrained backbone drastically reduces the annotation burden, enabling efficient, few-shot adaptation for varied downstream tasks. We evaluate this approach on downstream segmentation, classification, and deblurring. Our results demonstrate consistent improvements over baselines, (1) when measured using standard evaluation metrics and (2) when rigorously assessed by domain experts. This highlights the potential of foundation model pretraining to reduce annotation requirements while improving performance across diverse LSM analysis tasks. Pretrained model weights and code for pretraining and finetuning are publicly available: https://github.com/AdinaScheinfeld/lsm_fm_public_repo.git.


[208] Paris 2.0: A Decentralized Diffusion Model for Video Generation cs.CV | cs.LGPDF

Ali Rouzbayani, Bidhan Roy, Marcos Villagra, Zhiying Jiang

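TL;DR: Paris 2.0 is the first video generation model pre-trained through decentralized computation. It builds on its predecessor Paris 1.0 (the first open-weight decentralized diffusion model) and resolves the open problem of generating temporally coherent video under decentralized training.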

Details

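Motivation: The goal is to resolve the open problem of generating temporally coherent video under the decentralized training paradigm, and to show that video generation does not require a monolithic large GPU cluster.

Result: In low-resolution text-to-video training, compared with a monolithic model trained under the same total compute budget, Paris 2.0 lowers Fréchet Video Distance (FVD) from 561.04 to 279.01 (a roughly 2x improvement) and raises CLIP text-video similarity and aesthetic score.

Insight: The main novelty is the first video generation diffusion model trained in a decentralized way. Its training recipe shows that decentralized computation is effective and competitive for video generation, offering a new template for distributed AI training.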

Abstract: We present Paris 2.0, the first video generation model pre-trained through decentralized computation. Its training recipe builds upon Paris 1.0 (arXiv:2510.03434), the first ever open-weight Decentralized Diffusion Model (DDM), which showed that image generation can be trained without a monolithic GPU cluster. However, temporally coherent video generation had remained an open problem under decentralized training, and Paris 2.0 closes it. In low-resolution text-to-video training, against a monolithic model trained on the same data under a matched total compute budget, Paris 2.0 cuts Frechet Video Distance (FVD) from 561.04 to 279.01, a ~2.0x improvement, and lifts CLIP text-video similarity and aesthetic score.


[209] LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence cs.CVPDF

Xiang An, Yin Xie, Feilong Tang, Yunyao Yan, Huajie Tan

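TL;DR: This paper introduces LLaVA-OneVision-2 (LLaVA-OV-2), the most capable vision-language model in the LLaVA-OneVision series to date, with strong performance across a broad range of multimodal benchmarks. The model builds on a native OneVision-Encoder and adds Windowed Attention for efficient local computation while keeping native resolution. Its key advance is codec-stream tokenization, which treats compressed video as a continuous bit-cost stream: bit-cost dynamics determine adaptive temporal groups, and motion-residual cues select salient spatial evidence into compact visual canvases.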

Details

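Motivation: The aim is a next-generation perceptual intelligence model that addresses the weaknesses of existing models in long-video understanding, fine-grained spatiotemporal grounding, and unified cross-modal perception, especially for high-frequency, densely repeated motion, a regime that existing video evaluations underrepresent.

Result: On the newly proposed JumpScore benchmark, LLaVA-OneVision-2-8B reaches 74.9 mAP, surpassing Qwen3-VL-8B (30.1) by 44.8 points. Under matched visual-token budgets, codec-stream inputs improve temporal grounding over frame sampling by 9.7 points. On standard benchmarks, the model leads Qwen3-VL-8B by 4.3 average points on video tasks, 5.3 points on spatial tasks, and 15.6 average J&F on tracking tasks, achieving SOTA performance.

Insight: The core novelty is codec-stream tokenization, which uses bit cost and motion residuals to allocate a limited token budget adaptively to event-bearing content, giving more stable long-video token compression than fixed groups of pictures. A shared 3D RoPE also places codec canvases, sampled frames, and images in one spatiotemporal coordinate system, supporting unified perception across video understanding, spatiotemporal grounding, and manipulation-trace reasoning. The large-scale open supervision data (about 8M re-captioned video samples for pretraining and a 4M-sample spatial corpus for fine-tuning) is another key contribution.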

Abstract: We introduce LLaVA-OneVision-2 (LLaVA-OV-2), the most capable vision-language model in the LLaVA-OneVision series to date, achieving superior performance across a broad range of multimodal benchmarks. The model builds on a native OneVision-Encoder and incorporates Windowed Attention for efficient local computation while maintaining native resolution. Its key advance is codec-stream tokenization: it treats compressed video as a continuous bit-cost stream, where bit-cost dynamics determine adaptive temporal groups, and motion-residual cues select salient spatial evidence into compact visual canvases. This allocation concentrates a limited token budget on event-bearing content, enabling more stable long-video token compression than fixed groups of pictures. A shared 3D RoPE further places codec canvases, sampled frames, and images in a unified spatiotemporal coordinate system. Furthermore, we build the LLaVA-OV-2 data and training stack around large-scale open supervision: approximately 8M re-captioned video samples for pretraining, a 4M-sample spatial corpus for fine-tuning. We also introduce JumpScore, a temporal-localization benchmark targeting fine-grained grounding in high-frequency, densely repeated motion, a regime underrepresented by existing video evaluations. A standout capability of LLaVA-OV-2 is its unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-trace reasoning. On JumpScore, LLaVA-OneVision-2-8B reaches 74.9 JumpScore mAP, surpassing Qwen3-VL-8B (30.1) by +44.8 points; under matched visual-token budgets on the same benchmark, codec-stream inputs improve temporal grounding over frame sampling by +9.7 points. Across standard benchmarks, LLaVA-OneVision-2-8B further outperforms Qwen3-VL-8B by +4.3 average points on video tasks, +5.3 on spatial tasks, and +15.6 average J&F on tracking tasks.
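
To make the idea of bit-cost-driven temporal grouping concrete, here is a minimal sketch. The abstract does not give the actual grouping rule, so the greedy budget rule, the function name `group_by_bit_cost`, and the `budget` parameter are assumptions chosen only to show how cheap (low-motion) frames can share a group while expensive (event-bearing) frames get groups of their own.

```python
from typing import List


def group_by_bit_cost(bit_costs: List[int], budget: int) -> List[List[int]]:
    """Greedily split frame indices into groups whose summed bit cost fits a budget.

    This is an assumed rule for illustration, not the paper's algorithm.
    A single frame that exceeds the budget still forms its own group.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    spent = 0
    for idx, cost in enumerate(bit_costs):
        # Close the current group when adding this frame would exceed the budget.
        if current and spent + cost > budget:
            groups.append(current)
            current, spent = [], 0
        current.append(idx)
        spent += cost
    if current:
        groups.append(current)
    return groups
```

With per-frame costs `[10, 10, 10, 80, 90, 5, 5]` and a budget of 100, the three cheap leading frames share one group, the costly frame 3 stands alone, and frame 4 is grouped with the two cheap frames that follow it. Unlike fixed groups of pictures, group length here adapts to the content.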


[210] DRScaffold: Boosting Dense-Scene Reasoning in Lightweight Vision Language Models cs.CV | cs.AIPDF

Xinrui Shi, Kai Liu, Ziqing Zhang, Jianze Li, Anqi Li

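TL;DR: This paper proposes DRScaffold, a supervised fine-tuning framework that improves dense-scene reasoning in lightweight vision-language models. It decomposes the supervision target into four causally ordered stages, forcing the model to reason over grounded visual entities without any architectural change. Experiments show substantial gains on the proposed DRBench benchmark, to the point that a 3B model surpasses a 32B model on this task.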

Details

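Motivation: Lightweight vision-language models perform well on standard benchmarks but fail systematically in dense-scene understanding, where multiple objects, attributes, and relations must be grounded jointly and resolved through multi-step inference. Existing training signals provide no explicit link between reasoning steps and the underlying visual entities and relations, so models produce fluent but visually unanchored reasoning chains.

Result: On the proposed DRBench benchmark (14,573 questions over 2,943 images), three lightweight VLMs fine-tuned with DRScaffold all improve substantially. In particular, Qwen2.5-VL-3B trained with DRScaffold surpasses the frozen Qwen2.5-VL-32B on DRBench while preserving or improving performance on general-purpose benchmarks.

Insight: The novelty is a structured supervised fine-tuning framework (DRScaffold) whose causally decomposed supervision target forces step-by-step, visually grounded reasoning. The core insight is that for dense-scene reasoning, structured supervision can partly substitute for model scale: with better training, a small model can do well on a task where larger models normally have the edge.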

Abstract: Lightweight vision-language models perform competitively on standard benchmarks yet fail systematically in dense-scene reasoning, where multiple objects, attributes, and relations must be jointly grounded and resolved through multi-step inference. Such capability is critical for real-world applications where models must reliably interpret cluttered environments. Yet existing training signals provide no explicit grounding between reasoning steps and the underlying visual entities and relations, leaving lightweight models free to generate fluent but visually unanchored reasoning chains. To address this gap, we first introduce DRBench, a benchmark of 14,573 questions across 2,943 images, organized into five task categories spanning three progressive reasoning layers. Building on DRBench, we propose DRScaffold, a supervised fine-tuning framework that decomposes the supervision target into four causally ordered stages, enforcing grounded reasoning without architectural modification. Experiments on three lightweight VLMs demonstrate substantial gains on DRBench while preserving or improving performance on general-purpose benchmarks. Notably, Qwen2.5-VL-3B trained with DRScaffold surpasses the frozen Qwen2.5-VL-32B on DRBench, demonstrating that structured supervision can substitute for a significant portion of model scale in dense-scene reasoning. Our code and models are available at https://github.com/irene-shi/DRScaffold .


[211] Global Structure-from-Motion Meets Feedforward Reconstruction cs.CVPDF

Linfei Pan, Johannes Schönberger, Marc Pollefeys

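TL;DR: This paper proposes a new Structure-from-Motion (SfM) pipeline that combines the strengths of classical and feedforward methods. It targets the failure cases of classical SfM in low-texture, limited-overlap, and symmetric scenes while overcoming the scalability, accuracy, and robustness limitations of feedforward methods.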

Details

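Motivation: Classical SfM methods tend to fail in low-texture, limited-overlap, and symmetric scenes. Feedforward 3D reconstruction methods handle these conditions well but are limited in scalability, accuracy, or robustness, and typically fall short of classical methods in standard reconstruction settings.

Result: Extensive experiments across multiple datasets show that the method achieves state-of-the-art (SOTA) results over a wide range of scenarios.

Insight: The novelty is a systematic analysis of the limitations of classical and feedforward methods, followed by a new SfM pipeline that combines the strengths of both and delivers high-quality reconstruction across a broader range of scenes.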

Abstract: Structure-from-Motion – the process of simultaneously estimating camera poses and 3D scene structure from a collection of images – remains a central challenge in computer vision, with many open problems yet to be solved. Recent advances in feedforward 3D reconstruction have made significant strides in overcoming persistent failure cases of classical SfM methods, particularly in scenarios characterized by low texture, limited overlap, and symmetries. However, while feedforward approaches excel in these challenging conditions, they often face limitations regarding scalability, accuracy, or robustness, and typically fall short of classical methods in standard reconstruction settings. In this work, we systematically analyze these limitations and propose a new Structure-from-Motion pipeline by combining the respective strengths of classical and feedforward methods. Extensive experiments across multiple datasets show the benefits of our approach, achieving state-of-the-art results across a wide range of scenarios. We share our system as an open-source implementation at https://github.com/colmap/gluemap.


[212] Pixel-Level Pavement Distress Assessment Using Instance Segmentation cs.CVPDF

Logan Dewick, Bibesh Pyakurel, Kong Pheng Yang, Nazim Choudhury, M. G. Sarwar Murshed

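TL;DR: This paper presents a vision-based pavement distress analysis system built on Mask R-CNN instance segmentation, aimed at pixel-level localization and quantification of cracks and other distress. It evaluates several backbones on UWGB-StreetCrack, a custom dataset of real roadway images, and compares instance segmentation with object detection for pavement distress assessment.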

Details

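Motivation: Automated pavement distress assessment has to go beyond image-level classification or coarse bounding-box detection. Thin, branching, and irregular cracks must be localized precisely to reach the geometric precision that maintenance-relevant quantification requires.

Result: Under a consistent fine-tuning protocol, the best model (Mask R-CNN with a ResNet-101 FPN backbone) reached 84.23% precision, 90.04% recall, and an F1 score of 87.04% under the project-specific bounding-box matching protocol. Its aggregate predicted crack-area fraction (2.164%) closely matches the ground truth (2.170%). By contrast, a CSPDarknet53-based YOLO detector reached only 27.5% precision and 20.7% recall on the validation protocol.

Insight: The core contribution is applying instance segmentation to pixel-level pavement distress assessment and area quantification, and showing a clear advantage over a conventional object detection approach. The study also exposes open challenges in annotation consistency, class imbalance, confounder rejection, and mask-level benchmarking.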

Abstract: Automated pavement distress assessment requires more than image-level classification or coarse bounding box detection, demanding precise localization of thin, branching, and irregular cracks to achieve the geometric precision necessary for maintenance-relevant quantification. This paper presents a vision-based pavement distress analysis system based on Mask R-CNN instance segmentation and evaluates it on UWGB-StreetCrack, a custom field-collected roadway image dataset acquired with a vehicle-mounted smartphone and manually annotated with polygon labels for longitudinal cracks, transverse cracks, alligator cracks, and potholes. Five Detectron2-based Mask R-CNN backbone variants were considered under a consistent fine-tuning protocol. The best-performing model, Mask R-CNN with a ResNet-101 FPN backbone, achieved 84.23% precision, 90.04% recall, and an F1 score of 87.04% under the project-specific bounding-box matching protocol. The same model produced an aggregate predicted crack-area fraction of 2.164%, closely matching the 2.170% ground-truth crack-area fraction. To contextualize the segmentation system against a detector-oriented alternative, a CSPDarknet53-based YOLO detector was also adapted and retrained on the dataset, reaching 27.5% precision and 20.7% recall on the validation protocol. The results show that instance segmentation is a practical direction for field pavement imagery and aggregate crack-area estimation, while also exposing open challenges in annotation consistency, class imbalance, confounder rejection, and mask-level benchmarking.
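
As a quick consistency check on the reported figures, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the numbers in the abstract. The YOLO F1 is derived here for comparison only; the abstract does not report it.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for Mask R-CNN (ResNet-101 FPN).
mask_rcnn_f1 = f1_score(0.8423, 0.9004)

# Derived for comparison only; the abstract reports no F1 for the YOLO detector.
yolo_f1 = f1_score(0.275, 0.207)

print(f"Mask R-CNN F1: {mask_rcnn_f1:.2%}")
print(f"YOLO F1 (derived): {yolo_f1:.2%}")
```

The Mask R-CNN value rounds to the 87.04% stated in the abstract. The two systems were scored under different protocols (project-specific bounding-box matching versus the validation protocol), so the derived YOLO figure should be read as context rather than a like-for-like comparison.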


[213] EVIDENT: Routing MLLM Adaptation through Entity-Grounded Visual Evidence for Cross-Domain Video Temporal Grounding cs.CVPDF

Geo Ahn, Jiwook Han, Youngrae Kim, Joonseok Lee, Jinwoo Choi

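TL;DR: This paper proposes EVIDENT, a framework that improves the cross-domain robustness of multimodal large language models on video temporal grounding. Through an Entity Bottleneck Adapter, an Entity-Binding Distillation loss, and an entity-to-evidence gating mechanism, it steers the model to localize moments based on visual entity evidence, reducing reliance on dataset-specific shortcuts.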

Details

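Motivation: Fine-tuning existing methods for video temporal grounding improves in-domain performance, but performance drops sharply on out-of-domain data. The study finds that this failure is driven mainly by visual domain shift, which prevents the model from coupling its learned temporal localization knowledge with its inherent entity-attention capability.

Result: Experiments on cross-domain video temporal grounding benchmarks show that EVIDENT consistently improves out-of-domain robustness while keeping competitive in-domain performance, with only modest parameter overhead.

Insight: The novelty is a parameter-efficient adaptation framework that anchors temporal grounding in the inherent entity attention of pre-trained MLLMs by routing adaptation through explicit visual entity evidence. The core insight is that entity-level grounding is an effective inductive bias for generalizable temporal localization.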

Abstract: Fine-tuning MLLMs for Video Temporal Grounding (VTG) often improves in-domain performance but degrades sharply under domain shift. In this work, we find that this failure is primarily driven not just by unseen query concepts, but by visual domain shift, which prevents the model from coupling its learned temporal localization knowledge with its inherent entity-attention capability. To address this, we introduce EVIDENT, a parameter-efficient adaptation framework that anchors temporal grounding in the inherent entity-attention of pre-trained MLLMs by routing VTG adaptation through explicit visual entity evidence. EVIDENT consists of three components: (i) an Entity Bottleneck Adapter that transforms dense visual tokens into compact entity-level slots, (ii) an Entity-Binding Distillation loss that instills objectness priors into the semantically unstructured MLLM visual space, guiding each slot to bind to a coherent entity, and (iii) an Entity-to-eVidence gating mechanism that leverages the captured entities as evidence, steering the model to localize moments containing query-relevant entities. Together, these components enable VTG fine-tuning to rely on entity-grounded evidence rather than brittle dataset shortcuts. Experiments on cross-domain VTG benchmarks show that EVIDENT consistently improves out-of-domain robustness while preserving competitive in-domain performance with modest parameter overhead. These results suggest that entity-level grounding is an effective inductive bias for generalizable temporal localization.


[214] InstructSAM: Segment Any Instance with Any Instructions cs.CVPDF

Yuqian Yuan, Wentong Li, Zhaocheng Li, Yutong Lin, Juncheng Li

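TL;DR: This paper proposes InstructSAM, a unified framework for multi-instance segmentation under arbitrary instructions. It casts instruction-driven instance segmentation as a set-structured query prediction problem and bridges a vision-language model (VLM) and SAM3 through an explicit reasoning-to-instance query interface. The design gives SAM3 high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture.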

Details

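Motivation: The goal is multi-instance segmentation under arbitrary free-form instructions: the model should understand complex instructions and segment multiple instances in one pass, overcoming the inefficiency or limited capability of existing approaches such as SAM3's agentic pipeline.

Result: On complex instruction-driven and phrase-level referring segmentation benchmarks, InstructSAM achieves strong results at only 2B parameters, outperforming prior end-to-end methods and SAM3's agentic pipeline while enabling efficient single-pass multi-instance prediction.

Insight: The novelty has three parts. First, instruction-driven instance segmentation is formalized as set query prediction, with an LLM-conditioned query interface bridging the VLM and SAM3. Second, the authors build Inst2Seg, a high-quality, large-scale instruction-based instance segmentation dataset. Third, a hybrid-attention mechanism promotes interaction among query, visual, and instruction tokens, improving instance enumeration and reducing duplicate predictions.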

Abstract: In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction problem and propose an explicit reasoning-to-instance query interface that elegantly bridges a vision-language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, enabling each query to serve as an instance-aware slot. A hybrid-attention mechanism further promotes interaction among these queries, visual tokens, and instruction tokens, improving instance enumeration and reducing duplicate predictions. The resulting LLM-conditioned queries are projected into SAM3’s detector query space to drive accurate multi-instance segmentation in a single forward pass. This design equips SAM3 with high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a high-quality and large-scale instruction-based instance segmentation dataset and benchmark that couples free-form instructions with instance-level masks. Extensive experiments show that InstructSAM, at only the 2B scale, achieves strong results across complex instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3’s agentic pipeline while enabling efficient single-pass multi-instance prediction.


[215] On-Policy Adversarial Flow Distillation for Autoregressive Video Generation cs.CVPDF

Yang Luo, Shengju Qian, Xiaohang Tang, Zirui Zhu, Yong Liu

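TL;DR: This paper proposes Adversarial Flow Distillation (AFD), an on-policy framework for distilling heterogeneous black-box video generators into autoregressive students. AFD queries the teacher and rolls out the student on the same prompts, trains a prompt-paired discriminator to estimate the teacher-student discrepancy, and converts the resulting advantage signal into forward-process flow-matching updates, which provides dense velocity-field supervision.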

Details

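Motivation: Autoregressive video generators are attractive for streaming, long-horizon, and interactive applications, but distilling strong black-box teachers into causal students remains difficult. The student must learn under its own rollout distribution, whereas the teacher may expose only completed, prompt-conditioned videos and may differ in architecture, capacity, temporal design, and sampling schedule.

Result: Experiments on two families of causal autoregressive students show that AFD consistently improves motion- and physics-sensitive generation while preserving overall video quality. Ablations confirm the importance of adaptive on-policy feedback and forward-process credit assignment.

Insight: AFD needs no teacher scores, latents, denoising trajectories, step alignment, or reverse-chain reinforcement learning; it requires only clean teacher videos and student rollouts. This gives a practical route for distilling proprietary or heterogeneous video generators into efficient autoregressive students. The core mechanism is on-policy adversarial distillation combined with forward-process flow-matching updates.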

Abstract: Autoregressive video generators are attractive for streaming, long-horizon, and interactive applications, but distilling strong black-box teachers into causal students remains difficult. The student must learn under its own rollout distribution, whereas practical teachers may expose only prompt-conditioned completed videos and may differ in architecture, capacity, temporal design, and sampling schedule. This interface makes supervised fine-tuning off-policy, score-based distillation inapplicable, and direct adversarial imitation too sparse for denoising-time credit assignment. We propose Adversarial Flow Distillation (AFD), an on-policy framework for heterogeneous black-box video distillation. AFD queries the teacher and rolls out the current student on the same prompts, trains a prompt-paired Bradley-Terry discriminator to estimate clean-sample teacher-student discrepancy, and converts the resulting on-policy advantage into forward-process flow-matching updates on the student’s own noised states. Thus, AFD provides dense velocity-field supervision while requiring no teacher scores, latents, denoising trajectories, step alignment, or reverse-chain reinforcement learning. Experiments across two causal AR student families show that AFD consistently improves motion- and physics-sensitive generation while preserving general video quality, and ablations validate the importance of adaptive on-policy feedback and forward-process credit assignment. The method requires only clean teacher videos and student rollouts, providing a practical route for distilling proprietary or heterogeneous video generators into efficient autoregressive students.
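
The prompt-paired Bradley-Terry discriminator mentioned above can be read as a pairwise preference model: for the same prompt, it should score the teacher video above the student rollout. The sketch below shows only the standard Bradley-Terry negative log-likelihood on two scalar scores. How AFD computes those scores and converts them into an advantage is not described in the abstract, so the function name and scalar interface are assumptions.

```python
import math


def bradley_terry_loss(teacher_score: float, student_score: float) -> float:
    """Negative log-likelihood that the teacher sample is preferred.

    Bradley-Terry models P(teacher preferred) = sigmoid(teacher - student).
    The loss -log(sigmoid(margin)) is evaluated as softplus(-margin),
    which stays numerically stable for large positive or negative margins.
    """
    margin = teacher_score - student_score
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Hypothetical discriminator scores for one prompt.
    print(bradley_terry_loss(0.0, 0.0))  # undecided pair: log(2)
    print(bradley_terry_loss(3.0, 0.0))  # teacher clearly preferred: small loss
    print(bradley_terry_loss(0.0, 3.0))  # student scored higher: large loss
```

The loss shrinks as the discriminator separates teacher from student on a shared prompt, so a well-trained discriminator's margin can serve as a per-sample measure of how far the student still is from the teacher.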


[216] Helix4D: Complex 4D Mesh Generation cs.CVPDF

Jiraphon Yenphraphai, Jianqi Chen, Jian Wang, Gordon Qian, Sergey Tulyakov

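TL;DR: Helix4D is a framework for generating dynamic 4D meshes from video. It targets cases where existing methods struggle: complex topology changes, transparent materials, thin structures, and inner surfaces. It inherits the expressive representation of Trellis2 and adapts it from image-to-3D generation to video-conditioned 4D generation.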

Details

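Motivation: Current video-to-4D methods produce poor results in complex dynamic scenes such as topology changes and transparent objects. The paper also asks how to share information across frames and inject temporal information while keeping the capabilities of the pretrained model.

Result: Extensive experiments on ActionBench and the authors' own complex-dynamics dataset show that Helix4D is effective for high-quality dynamic mesh generation.

Insight: The contributions are: 1) a sliding-window cross-frame attention mechanism anchored on the first frame, which shares information across frames while keeping the pretrained model's quality on rare cases such as transparent objects; 2) a 4D temporal encoding that repurposes redundant low-frequency spatial RoPE bands to represent time, extending the 3D positional encoding to 4D with no additional parameters.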

Abstract: Current video-to-4D methods struggle with complex topology changes, transparent materials, thin structures, and inner surfaces. We present Helix4D, a dynamic mesh generation framework that inherits the expressive representation of Trellis2 and adapts it from image-to-3D to video-conditioned 4D generation. Our design arises from two key questions: (a) how to enable Trellis2’s frame-local attention to share information across frames while preserving its pretrained quality on rare cases such as transparent objects and inner surfaces, and (b) how to inject temporal information into a purely 3D positional encoding without breaking pretrained capabilities. We address (a) with sliding-window cross-frame attention anchored on the first frame. The first frame is generated by the base Trellis2 model and injected into our model, letting it inherit Trellis2’s quality in rare cases through cross-frame attention. We address (b) with a 4D temporal encoding that repurposes redundant low-frequency spatial RoPE bands for time, extending the encoding from 3D with no additional parameters. Extensive experiments show the effectiveness of Helix4D for high-quality dynamic mesh generation on ActionBench and our own challenging complex dynamics set.
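
The 4D temporal encoding can be pictured as a relabeling of existing rotary bands. The sketch below assumes a 3D RoPE that gives each spatial axis the same number of rotary pairs with geometrically decreasing frequencies, and reassigns the lowest-frequency pairs of each axis to time. The even axis split, the `base` value, and the number of reassigned bands are illustrative assumptions; the abstract does not state which bands Helix4D actually uses.

```python
from typing import Dict, List, Tuple


def band_layout(pairs_per_axis: int, time_bands: int,
                base: float = 10000.0) -> List[Tuple[str, float]]:
    """Return (axis, frequency) for every rotary pair.

    The `time_bands` lowest-frequency pairs of each spatial axis are
    relabeled to rotate with the time coordinate "t" instead.
    """
    layout: List[Tuple[str, float]] = []
    for axis in ("x", "y", "z"):
        for i in range(pairs_per_axis):
            freq = base ** (-i / pairs_per_axis)  # frequency decreases with i
            is_low = i >= pairs_per_axis - time_bands
            layout.append(("t" if is_low else axis, freq))
    return layout


def rope_angles(pos: Dict[str, float],
                layout: List[Tuple[str, float]]) -> List[float]:
    """Rotation angle of each rotary pair for one token position."""
    return [pos[axis] * freq for axis, freq in layout]


if __name__ == "__main__":
    layout = band_layout(pairs_per_axis=4, time_bands=1)
    print(rope_angles({"x": 1.0, "y": 2.0, "z": 3.0, "t": 5.0}, layout))
```

The layout has the same number of rotary pairs as the purely spatial version, so this extension adds no learnable parameters. Only the coordinate that drives a few slowly varying bands changes.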


[217] Reinforcing Few-step Generators via Reward-Tilted Distribution Matching cs.CVPDF

Yushi Huang, Xiangxin Zhou, Ruoyu Wang, Chi Zhang, Jun Zhang

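TL;DR: This paper proposes Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework for aligning few-step diffusion-distilled models with human preferences. It combines distribution matching distillation with reward-guided reinforcement learning by minimizing the KL divergence to a reward-tilted teacher distribution, which decomposes into a distribution matching term and a reward maximization term. Experiments show that RTDMD sets new state-of-the-art (SOTA) results on preference, aesthetic, and compositional metrics on models such as SD3, SD3.5, and FLUX.2 with only 4 inference steps.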

Details

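Motivation: Existing few-step diffusion distillation techniques generate images efficiently but struggle to keep outputs aligned with human preferences. This paper addresses the alignment of few-step flow generators with human preferences.

Result: Experiments on SD3, SD3.5, and FLUX.2 show that, with only 4 inference steps, RTDMD outperforms previous few-step text-to-image methods on preference, aesthetic, and compositional metrics, establishing new SOTA results.

Insight: The novelty is RTDMD, a unified two-stage framework that combines distribution matching distillation with reinforcement learning. 1) The first stage introduces Ambient-Consistent Distribution Matching Distillation (AC-DMD), which stabilizes training through subinterval-wise distribution matching and a consistency regularizer. 2) The second stage derives a hybrid policy gradient (a GRPO-style estimator combined with direct reward backpropagation through the deterministic step) and introduces step-subset GRPO (SubGRPO) to reduce variance. The core insight is to formalize alignment as minimizing the KL divergence to a reward-tilted distribution, which decomposes naturally into two jointly optimizable objectives.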

Abstract: Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution matching distillation with reward-guided reinforcement learning for few-step flow generators. We show that minimizing the KL divergence to a reward-tilted teacher distribution naturally decomposes into a distribution matching term and a reward maximization term. In the first stage, we introduce Ambient-Consistent Distribution Matching Distillation (AC-DMD), which performs subinterval-wise distribution matching and augments the fake score objective with a consistency regularizer to help the fake score model track the shifting generator distribution under limited updates. In the second stage, we jointly optimize both terms: for the reward maximization term, we derive a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate transitions with direct reward backpropagation through the deterministic final step, and further introduce step-subset GRPO (SubGRPO) to reduce variance. Experiments on SD3, SD3.5, and FLUX.2 demonstrate that RTDMD establishes new state-of-the-art results across preference, aesthetic, and compositional metrics with only 4 inference steps, outperforming previous few-step text-to-image generation methods. Code and models are available at https://github.com/Harahan/RTDMD.
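
The decomposition stated in the abstract follows from a short calculation. The derivation below is a sketch in assumed notation: $p_T$ is the teacher distribution, $p_\theta$ the generator distribution, $r$ the reward, $\beta > 0$ a temperature, and $Z$ the normalizing constant. None of these symbols appear in the abstract, and the paper may define the tilt differently.

```latex
% Reward-tilted teacher (assumed exponential tilt with temperature \beta)
p^{*}(x) = \frac{1}{Z}\, p_{T}(x)\, \exp\!\left(\frac{r(x)}{\beta}\right),
\qquad
Z = \int p_{T}(x)\, \exp\!\left(\frac{r(x)}{\beta}\right) dx .

% KL divergence from the generator to the tilted teacher
\begin{aligned}
\mathrm{KL}\!\left(p_{\theta} \,\|\, p^{*}\right)
  &= \mathbb{E}_{x \sim p_{\theta}}\!\left[ \log p_{\theta}(x) - \log p_{T}(x)
       - \frac{r(x)}{\beta} + \log Z \right] \\
  &= \underbrace{\mathrm{KL}\!\left(p_{\theta} \,\|\, p_{T}\right)}_{\text{distribution matching}}
     \;-\; \underbrace{\frac{1}{\beta}\, \mathbb{E}_{x \sim p_{\theta}}\!\left[ r(x) \right]}_{\text{reward maximization}}
     \;+\; \log Z .
\end{aligned}
```

Because $\log Z$ does not depend on $\theta$, minimizing the left-hand side is equivalent to matching the teacher distribution while maximizing expected reward, with $\beta$ setting the trade-off between the two terms.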


[218] Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation cs.CV | cs.AI | cs.GR | cs.LG | cs.MMPDF

Shuhong Zheng, Aashish Kumar Misraa, Yu-Teng Li, Yu-Jhe Li, Igor Gilitschenski

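TL;DR: This paper proposes a subject-driven image generation method built on Multimodal Large Language Models (MLLMs). It encodes text and reference images jointly, adds VAE-based identity conditioning, and introduces a Dual Layer Aggregation module and a multi-stage denoising strategy so that generated images follow the text instruction while better preserving subject identity.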

Details

Motivation: 现有方法通常将文本和参考图像分开编码,这限制了跨模态推理能力并导致复制粘贴伪影。近期连接多模态模型和扩散模型的框架改善了指令遵循,但很大程度上忽视了身份保持。

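Result: Extensive experiments show that, on subject-driven image generation, the method harmonizes multimodal understanding with identity preservation, mitigates copy-paste issues, and achieves superior performance in human preference evaluation.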

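Insight: The core innovation is using MLLMs for joint cross-modal encoding, together with the DLA module and a multi-stage denoising strategy that balance semantic information against identity detail. This offers a new approach to fine-grained controllable generation with large pretrained models.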

Abstract: Subject-driven image generation aims to synthesize new images that preserve the identity of the given subject while following textual instructions. Existing approaches often encode text and reference images separately. This limits cross-modal reasoning abilities and causes copy-paste artifacts. Recent frameworks that connect multimodal models and diffusion models improve instruction following, but largely overlook identity preservation. To address these limitations, we condition diffusion models on Multimodal Large Language Models (MLLMs) that jointly encode text and reference images, and augment it with VAE-based identity conditioning. A novel Dual Layer Aggregation (DLA) module is designed to aggregate multi-level MLLM features for optimal conditioning, and a multi-stage denoising strategy is applied to progressively balance the semantic information from MLLM and fine-detail identity from VAE during inference. Extensive experiments demonstrate that our approach harmonizes multimodal understanding with identity preservation, mitigates copy-paste issues, and achieves superior performance regarding human preference on subject-driven image generation. Our project website is available at https://zsh2000.github.io/squeeze-mllm-subject-gen/.


[219] TriSplat: Simulation-Ready Feed-Forward 3D Scene Reconstruction cs.CVPDF

Weijie Wang, Zimu Li, Jinchuan Shi, Zeyu Zhang, Botao Ye

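TL;DR: TriSplat is a feed-forward 3D scene reconstruction network based on triangle primitives. From sparse image inputs it directly predicts triangle attributes, camera poses, and other parameters, and outputs mesh scenes ready for physics simulation. It achieves a stable triangle parameterization by predicting local 3D point maps, constructing geometry normals from them, and refining those normals with an image-conditioned normal head.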

Details

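Motivation: Existing feed-forward reconstruction methods based on Gaussian primitives represent surfaces only indirectly; extracting a usable mesh requires expensive post-processing, which does not meet the needs of downstream uses such as simulation and physics reasoning. In pose-free settings, jointly estimating scene structure and camera parameters is especially difficult.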

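Result: Experiments on RealEstate10K and DL3DV show that the method produces more geometry-faithful 3D reconstructions than Gaussian feed-forward baselines while maintaining competitive novel-view rendering quality.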

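Insight: The innovations are: using triangles as rendering primitives for direct mesh output; constructing geometry normals from predicted point maps and refining them with an image-conditioned normal head to stabilize the triangle parameterization; and mono-normal guided training plus progressive sharpening to improve the surface representation. Together these give feed-forward 3D reconstruction a simulation-ready output that physics engines can ingest directly.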

Abstract: Sparse-view 3D reconstruction is increasingly addressed with feed-forward splatting networks that predict explicit primitives directly from images. Yet most existing methods remain centered on Gaussian primitives and expose surfaces only indirectly: extracting a usable mesh for downstream simulation, physics reasoning, or embodied interaction still requires expensive post-hoc steps that break the feed-forward promise. This limitation is especially pronounced in pose-free settings, where scene structure and camera parameters must be estimated jointly from sparse observations. We present TriSplat, a feed-forward reconstruction network that represents scenes with oriented triangle primitives and directly exports simulation-ready mesh scenes from a single forward pass. Given input images, the network predicts local 3D point maps, triangle attributes, camera poses, and optional intrinsics. Rather than regressing triangle orientation as an unconstrained latent variable, our approach constructs geometry normals from the predicted point maps, refines them with an image-conditioned normal head, and converts them into stable local frames for triangle parameterization. A mono-normal bootstrap schedule further stabilizes early training, while opacity and blur scheduling progressively sharpens the learned surface representation for direct mesh extraction. Experiments on RealEstate10K and DL3DV show that this representation produces more geometry-faithful reconstructions than Gaussian feed-forward baselines while maintaining competitive novel-view rendering quality. Because the rendering primitives are themselves surface triangles, the output can be directly ingested by physics engines, collision detectors, and standard rendering pipelines without any conversion, making it a practical simulation-ready solution for feed-forward 3D scene reconstruction.


cs.AI [Back]

[220] In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models cs.AI | cs.CL | cs.CV | cs.NEPDF

Sam Earle, Kay Arulkumaran, Andrew Dai, Akarsh Kumar, Julian Togelius

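TL;DR: This paper replicates Picbreeder, the canonical example of human-driven open-ended search, by replacing human users with frontier Vision Language Models (VLMs), to probe whether artificial agents are capable of fruitful unguided discovery. The system's output shows clear qualitative differences from the human baseline; the authors characterize these with metrics of phylogenetic complexity and visual and semantic salience and novelty, and explore candidate factors such as selection noise, behavioral diversity, and memory of past actions (narrative momentum).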

Details

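Motivation: To investigate whether artificial agents possess the open-ended creativity that humans show in scientific, technological, and creative production, that is, the capacity to generate a seemingly endless supply of novel and meaningful forms.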

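Result: The output of the VLM-based system shows clear qualitative differences from the historical human baseline, characterized using metrics of phylogenetic complexity and visual and semantic salience and novelty.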

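Insight: The novelty lies in automating the human open-ended creative search process with VLMs, systematically comparing the result against the human baseline, and exploring how selection noise, behavioral diversity, and narrative momentum (memory) affect open-ended search outcomes, which provides an empirical framework for understanding open-ended innovation in AI.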

Abstract: We are in the midst of large-scale industrial and academic efforts to automate the processes of scientific, technological and creative production through AI-driven assistants. Historically, a fundamental property of these processes in their human form has been their open-endedness: their capacity for generating a seemingly endless supply of novel and meaningful new forms. Do artificial agents have any capacity for such fruitful unguided discovery? To answer this question, we turn to Picbreeder, the canonical exemplar of human-driven open-ended search, in which users collaboratively generated a diverse library of images through interactive evolution of small neural networks. We replicate Picbreeder, replacing human users with frontier Vision Language Models (VLMs). We observe clear qualitative differences between the output of our system and the historical human baseline, and attempt to characterize them using metrics of phylogenetic complexity and visual and semantic salience and novelty. In an effort to identify some of the causal factors contributing to these differences, we study the addition of exploratory noise to the agents’ selection process, of behavioral diversity between agents, and of narrative momentum in the form of memory of past actions. We make our code available at https://github.com/smearle/picbreeder-vlm.


[221] Residual Drift Dominates Contradiction in Multi-Turn Constraint Reasoning cs.AI | cs.CLPDF

Sebastien Kawada

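TL;DR: This paper studies the failure modes of multi-turn reasoning systems and finds that the dominant error is not logical contradiction but satisfiable drift: the system's internal state stays consistent while the returned answer violates prior commitments. The author builds the DRIFT-Bench benchmark and evaluates several methods on it; MUS-Repair performs best, yet the errors remaining after repair are almost entirely satisfiable drift.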

Details

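Motivation: To examine how multi-turn reasoning systems fail, challenge the conventional view that logical contradiction is the main error, and expose satisfiable drift as an overlooked failure mode.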

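Result: Four methods were evaluated on DRIFT-Bench (816 test problems). MUS-Repair is strongest in every setting (+1.8 to +15.0 percentage points over the best non-MUS baseline). After repair, 98-100% of residual errors are satisfiable drift, and contradiction drops to near zero.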

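Insight: The contributions are identifying and quantifying satisfiable drift as the dominant failure mode in multi-turn reasoning; proposing MUS-Repair, which uses minimal unsatisfiable subsets as feedback; and stressing that reliable systems must separately validate that the returned answer is consistent with the maintained state.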

Abstract: How do multi-turn reasoning systems fail? The expected answer is logical contradiction, in which the system’s maintained state becomes unsatisfiable. We show that the dominant mode is instead satisfiable drift, where the internal state stays consistent while the returned answer silently violates prior commitments. We build DRIFT-Bench (Decomposing Reasoning Into Failure Types), a solver-instrumented benchmark of 816 test problems across three constraint domains, and evaluate four methods on it across four open-weight models (8B-120B parameters). MUS-Repair, which feeds minimal unsatisfiable subsets back to the generator, is strongest in every setting (+1.8 to +15.0 pp over the best non-MUS baseline). But the central finding is what repair leaves behind. After structured feedback, models rarely contradict themselves. They forget. Residual errors are 98-100% satisfiable drift across all settings, while contradiction drops to near zero. Reliable multi-turn systems must separately validate that the returned answer respects the maintained state. Code is available at https://github.com/kaons-research/drift-bench.
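
To make the two failure types concrete, here is a minimal Python sketch that tells contradiction (the maintained constraints are unsatisfiable) apart from satisfiable drift (the constraints are satisfiable, but the returned answer violates one). It is an illustration only: the brute-force checker, the function names, and the toy constraints are assumptions made here, not DRIFT-Bench's actual solver instrumentation.

```python
from itertools import product

def is_satisfiable(constraints, domains):
    """Brute-force check: does any assignment satisfy every constraint?"""
    names = list(domains)
    for values in product(*(domains[n] for n in names)):
        assignment = dict(zip(names, values))
        if all(check(assignment) for check in constraints):
            return True
    return False

def classify_turn(constraints, domains, answer):
    """Label one turn as 'contradiction', 'satisfiable_drift', or 'ok'."""
    if not is_satisfiable(constraints, domains):
        return "contradiction"      # the maintained state itself is inconsistent
    if not all(check(answer) for check in constraints):
        return "satisfiable_drift"  # state is fine, the answer breaks a commitment
    return "ok"

# Hypothetical toy problem: two commitments made in earlier turns.
domains = {"x": range(4), "y": range(4)}
committed = [lambda a: a["x"] < a["y"], lambda a: a["y"] <= 2]

print(classify_turn(committed, domains, {"x": 1, "y": 3}))  # answer ignores y <= 2
```

The second branch is the separate validation step the abstract calls for: checking only the maintained state would miss every drift case.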


[222] Why We Need World Models for AGI: Where LLMs Fail and How World Models May Outperform cs.AI | cs.CL | cs.ROPDF

Feisal Alaswad, Batoul Aljaddouh, Maher Alrahhal, Poovammal E, Talal Bonny

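TL;DR: This paper argues that large language models are limited in causal reasoning, persistent state tracking, and long-horizon planning, proposes the conceptual framework of Latent Dynamics Inference, and introduces the Flux environment for an empirical study. Agents trained in the extracted latent state space behave more stably in long-horizon gameplay, and win far more often, than LLMs that rely on textual observations alone.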

Details

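Motivation: To address the limitations of LLMs on tasks requiring causal reasoning, persistent state tracking, and long-horizon planning, which the authors attribute to an objective-level mismatch between sequence prediction and reasoning over latent environment dynamics.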

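Result: In the Flux environment, agents operating in the latent state space achieve a win rate of about 79%, versus 11% for LLMs, and show more stable long-horizon behavior.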

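Insight: The novelty lies in the Latent Dynamics Inference framework, which stresses the importance of world models for AGI, and in using the Flux environment to show empirically the advantage of explicit state-space access for dynamic reasoning, pointing to a way of compensating for the reasoning weaknesses of LLMs.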

Abstract: Large language models achieve strong performance in language generation and knowledge-intensive tasks, yet remain limited in settings requiring causal reasoning, persistent state tracking, and long-horizon planning. We argue that these limitations may arise from an objective-level mismatch between sequence prediction and reasoning over latent environment dynamics. To formalize this distinction, we introduce Latent Dynamics Inference (LDI), a conceptual perspective that interprets language and multimodal observations as partial evidence of underlying transition dynamics. To empirically investigate this perspective, we introduce Flux, a sequential reasoning environment specified entirely through natural-language rules. As a proof-of-concept case study, the rules are first compiled into an explicit state-transition simulator, illustrating that structured latent transition dynamics can, in some cases, be operationally extracted from textual rule descriptions. This enables a controlled comparison between LLMs operating purely over textual observations and reinforcement-learning agents trained directly within the extracted latent state space. Within this case study, agents operating with explicit access to the latent state space exhibit substantially more stable behavior in long-horizon gameplay, achieving an aggregate win rate of approximately 79% versus 11% for LLMs. Qualitative analysis further reveals failure modes consistent with unstable persistent state tracking, including invalid actions, state-tracking errors, and short-horizon reasoning failures. The complete implementation of the Flux environment is available at https://github.com/FeisalAlaswad/FLUX-RL-Agent. Within the evaluated setting, these results suggest that strong sequence prediction alone may struggle to support robust long-horizon dynamic reasoning without mechanisms for persistent state tracking and transition modeling.


[223] LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition cs.AI | cs.CLPDF

Yanyu Chen, Jiyue Jiang, Dianzhi Yu, Zheng Wu, Jiahong Liu

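TL;DR: This paper proposes LC-ERD, a framework that promotes self-evolving reasoning in large language models (LLMs) by mining latent logic structure. It uses a Variational Logic Potential to denoise the reasoning manifold and a multi-agent value decomposition protocol based on the IGM principle to quantify the utility of individual reasoning steps, addressing three challenges in self-alignment: label noise, coarse-grained supervision, and distributional collapse.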

Details

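Motivation: The evolution of LLM reasoning is bottlenecked by the scarcity of high-quality process data, and self-alignment via endogenous rewards faces three challenges that hinder the mining of valid supervision signals: label noise (mimetic bias), coarse-grained supervision, and distributional collapse.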

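Result: Experiments show that LC-ERD provides a robust self-evolution path, reveals trade-offs between logic consistency and accuracy, and identifies high-value reasoning patterns that standard rewards miss.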

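Insight: The novelty lies in reframing self-alignment as latent structure mining: a Variational Logic Potential, built by aggregating consensus from the model's Latent Logic Expertise (LLE), denoises the reasoning manifold, and a multi-agent value decomposition protocol based on the IGM principle quantifies step-level utility, yielding finer-grained and more consistent supervision of the reasoning process.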

Abstract: The evolution of Large Language Model (LLM) reasoning is bottlenecked by the scarcity of high-quality process data. While self-alignment via endogenous rewards offers a solution, mining valid supervision faces three challenges: (1) Label Noise via Mimetic Bias, where rewards prioritize statistical likelihood over logical truth, creating a “correctness illusion” that masks compounding errors; (2) Coarse-Grained Supervision, where sparse global outcomes (e.g., in GRPO) fail to provide granular guidance, treating reasoning chains as monolithic; and (3) Distributional Collapse, where signals fail to generalize without amplifying pre-training biases. To address these, we introduce LC-ERD (Logic-Consistent Endogenous Reward Decomposition), a framework framing self-alignment as latent structure mining. We derive a Variational Logic Potential by aggregating consensus from the model’s Latent Logic Expertise (LLE) to denoise the reasoning manifold, and introduce a Multi-Agent Value Decomposition protocol based on the IGM principle to quantify individual step utility. Experiments show LC-ERD delivers a robust self-evolution path, uncovering trade-offs between logic consistency and accuracy while identifying high-value reasoning patterns missed by standard rewards. Our code is available at https://github.com/Reinhardmannn/LC-ERD.


[224] AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning cs.AI | cs.CLPDF

Yuyang Hu, Hongjin Qian, Shuting Wang, Jiongnan Liu, Tong Zhao

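TL;DR: This paper proposes AgentFugue, a framework that scales multi-agent capability on long-horizon tasks through collective reasoning. Built around a shared reasoning hub, it lets peer agents working in parallel record and selectively access one another's findings, turning isolated reasoning trajectories into a connected ecology of reusable intermediate reasoning.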

Details

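Motivation: Progress on long-horizon agentic tasks has relied mainly on scaling up individual agents (stronger models, better tools). The core question here is whether scaling out across peer agents can become a new source of capability without explicit role specialization or workflow orchestration.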

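Result: Across the challenging long-horizon settings studied, AgentFugue outperforms strong baselines. The results suggest that collective reasoning can turn scaling out peer-agent systems into a significant source of capability gains rather than merely a way of spending more compute.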

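Insight: The novelty is a shared reasoning hub that requires no centralized planning. Implemented as a plug-in communication layer and trained with supervised fine-tuning and end-to-end reinforcement learning, it enables effective sharing and reuse of intermediate reasoning among agents, making scaling out itself a new route to capability gains.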

Abstract: Recent progress on long-horizon agentic tasks has been driven largely by scaling up individual agents through stronger models, better tools, and more effective scaffolding. In contrast, much less is understood about scaling out: whether multiple peer agents, all targeting the same task, can become an additional source of capability without relying on explicit role specialization or workflow orchestration. We study this question and propose AgentFugue, a collective reasoning framework built around a shared reasoning hub. As peer agents explore the same task in parallel, the hub records concise notes on what each agent has established, attempted, or ruled out, and enables each agent to selectively access what other agents have discovered in a form useful for its current search. This design turns otherwise isolated trajectories into a connected ecology of reusable intermediate reasoning without requiring centralized planning. We instantiate the hub as a plug-in communication layer, trained with supervised fine-tuning and end-to-end reinforcement learning. Across the challenging long-horizon settings we study, AgentFugue improves over strong baselines. Our results suggest that collective reasoning can turn scaling out peer agent systems into a distinct source of capability gains, rather than merely a way of spending more compute.


[225] Jailbreak to Protect: Buffering and Reinforcing via Temporary Jailbreaking for Safe Fine-Tuning in Large Language Models cs.AI | cs.CL | cs.LGPDF

Seokil Ham, Jaehyuk Jang, Wonjun Lee, Changick Kim

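TL;DR: This paper proposes Buffer-and-Reinforce, a fine-tuning framework that defends large language models (LLMs) against harmful fine-tuning attacks. It buffers harmful updates through temporary jailbreaking and, after adaptation, reinforces safety through QR decomposition-based merging, balancing safety and task performance without additional safety data.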

Details

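Motivation: In Fine-tuning-as-a-Service (FaaS), harmful fine-tuning attacks can weaken the safety alignment of LLMs. Existing methods can prevent a model from learning undesired behaviors by activating harmful-behavior modules during fine-tuning, but the mechanism remains unclear. This paper revisits temporary jailbreaking as a defense and analyzes its mechanism at the gradient level.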

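Result: Extensive experiments show that the framework achieves superior safety and utility with no additional safety data during user fine-tuning and minimal computational cost, reaching state-of-the-art defense performance on multiple benchmarks.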

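Insight: The novelty lies in a gradient-level explanation of how temporary jailbreaking defends against harmful fine-tuning: it saturates safety-degrading gradients while preserving benign task-relevant gradients. On this basis the authors design a two-adapter structure, BufferLoRA and ReinforceLoRA: a removable temporary-jailbreak adapter buffers harmful updates, and a safety-reinforcing adapter is then integrated via QR decomposition-based merging, restoring refusal behavior after fine-tuning while preserving task performance.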

Abstract: Fine-tuning-as-a-Service (FaaS) enables personalization of large language models (LLMs), but it can weaken safety-alignment under harmful fine-tuning attacks. Recent work has shown that activating harmful-behavior modules during fine-tuning can prevent models from learning undesired behaviors, but its mechanism remains unclear. In this paper, we revisit temporary jailbreaking as a defense against harmful fine-tuning and provide a gradient-level analysis showing that it saturates safety-degrading gradients while preserving benign task-relevant gradients. Based on this insight, we propose a Buffer-and-Reinforce fine-tuning framework that buffers harmful updates during user fine-tuning and reinforces safety after adaptation. Specifically, BufferLoRA induces temporary jailbreaking as a removable adapter to reduce harmful updates during user fine-tuning. After adaptation, ReinforceLoRA, trained to recover refusal behavior under the temporarily jailbroken state, is integrated with UserLoRA via QR decomposition-based merging to reinforce safety while preserving user-task performance. Extensive experiments show that our framework achieves superior safety and utility with no additional safety data during user fine-tuning and minimal computational cost.


[226] Learning to Reason Efficiently with A* Post-Training cs.AI | cs.CL | cs.LGPDF

Andreas Opedal, Francesco Ignazio Re, Abulhair Saparov, Mrinmaya Sachan, Bernhard Schölkopf

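TL;DR: This paper studies how the A* search algorithm can guide large language models toward efficient reasoning, framing natural language inference as a search problem whose goal is a correct and efficient proof path. It explores two training methods: supervised fine-tuning on A* execution traces, and reinforcement learning with A*-informed process reward models. Experiments show that Llama-3.2 models with 1B-3B parameters, after A* post-training, improve from near-zero accuracy to outperforming the much larger DeepSeek-V3.2.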

Details

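Motivation: Large language models often produce incorrect or redundant inference steps in applications that require deductive reasoning, so a reasoning procedure is needed in which intermediate inferences are correct.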

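Result: A* post-training takes Llama-3.2 models in the 1B-3B range from near-zero accuracy to outperforming the much larger DeepSeek-V3.2. The study also uncovers a trade-off between simple correctness rewards and A*-informed signals in accuracy versus efficiency, and finds that on larger search spaces, models trained with imperfect heuristics achieve higher accuracy.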

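Insight: The novelty lies in formalizing natural language inference as a search problem and using the principles of the classical A* search algorithm to guide models toward correct and efficient proofs. Combining the guarantees of symbolic search with the flexibility of neural models offers a new direction for improving model reasoning.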

Abstract: Many applications of large language models (LLMs) require deductive reasoning, yet models frequently produce incorrect or redundant inference steps. We frame natural language inference as a search problem where the final answer is the valid proof itself, requiring a reasoning procedure in which intermediate inferences are correct. Specifically, we investigate whether LLMs can learn to generate correct and efficient proofs with guidance from A* search – an algorithm that guarantees an optimally efficient path to a goal. We explore two training techniques: supervised fine-tuning on execution traces from A* and reinforcement learning with A*-informed process reward models. Empirically, we find that Llama-3.2 models in the 1B–3B range benefit substantially from A* post training, going from near-zero accuracy to outperforming DeepSeek-V3.2 – a much larger model. Our analysis uncovers a trade-off: while simple correctness rewards maximize accuracy, A*-informed signals strike a balance between accuracy and efficiency. Furthermore, we find that on larger search spaces, models trained with imperfect heuristics exhibit superior accuracy. Our results demonstrate a promising direction towards reasoning guided by principles derived from classical search algorithms.
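
For readers unfamiliar with the search procedure the abstract builds on, the following is a minimal, generic A* sketch in Python. It is not the paper's training pipeline: the toy graph, the heuristic values, and all names are assumptions made here for illustration. Nodes can be read as intermediate proof states and edges as single inference steps.

```python
import heapq

def a_star(start, goal, neighbors, heuristic):
    """Return a lowest-cost path from start to goal, or None if unreachable.

    neighbors(node) yields (next_node, step_cost) pairs; heuristic(node)
    must never overestimate the remaining cost (admissibility).
    """
    frontier = [(heuristic(start), 0, start, [start])]  # (f, g, node, path)
    best_cost = {start: 0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if cost > best_cost.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was found later
        for nxt, step in neighbors(node):
            new_cost = cost + step
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(
                    frontier,
                    (new_cost + heuristic(nxt), new_cost, nxt, path + [nxt]),
                )
    return None

# Hypothetical toy "proof graph": edge costs count inference steps.
graph = {"A": [("B", 1), ("C", 1)], "B": [("D", 1)], "C": [("D", 3)], "D": [("goal", 1)]}
estimate = {"A": 3, "B": 2, "C": 2, "D": 1, "goal": 0}

path = a_star("A", "goal", lambda n: graph.get(n, []), lambda n: estimate[n])
print(path)  # ['A', 'B', 'D', 'goal']
```

The sequence of nodes popped from the frontier is the kind of execution trace one could imagine serializing as supervised fine-tuning data, though how the paper actually encodes its traces is not specified in the abstract.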


[227] GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration cs.AI | cs.CLPDF

Junjie Zhao, Jingyi Liang, Zhenyang Cai, Jiaming Zhang, Zhenwei Wen

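TL;DR: This paper presents GlobalDentBench, the first multinational dental benchmark, for evaluating the performance and safety of large language models in dental clinical reasoning. It contains 8,978 expert-validated questions covering 14 dental specialties across 88 countries and regions on six continents, in three question formats and three progressive reasoning levels. An evaluation of 12 frontier LLMs shows that performance drops sharply as reasoning complexity increases, and that clinical recommendations generated for real-world cases have an unsafe rate of 31.01%.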

Details

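Motivation: Although large language models hold transformative potential for medicine, their reasoning robustness and safety in real-world clinical scenarios, especially dentistry, remain underexplored, so a dedicated benchmark is needed for systematic evaluation.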

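Result: The 12 frontier LLMs evaluated on GlobalDentBench degrade sharply as reasoning complexity increases: accuracy is 81.34% on multiple-choice questions, 64.53% on short-answer questions, and only 22.34% on case-based questions, and falls from 74.01% at reasoning level L1 to 35.71% at L3. Risk analysis shows an overall unsafe rate of 31.01% in LLM-generated clinical recommendations, with 4.51% posing risks of irreversible patient harm; risks are particularly pronounced in specialties such as orthodontics.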

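Insight: The contribution is the first multinational, multi-specialty, multi-format, multi-level dental clinical benchmark, together with an expert-calibrated automated construction framework that ensures data quality. The study systematically exposes fundamental limitations of current LLMs in complex medical reasoning and safety, and provides a scalable foundation for trustworthy clinical AI evaluation.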

Abstract: While large language models (LLMs) hold transformative potential for medicine, their reasoning robustness and safety in real-world clinical scenarios remain critically underexplored, particularly in dentistry. Here we introduce GlobalDentBench, the first multinational dental benchmark, featuring a taxonomy that encompasses 14 dental specialties across 88 countries and regions spanning six continents. The benchmark comprises 8,978 expert-validated questions across three formats (multiple-choice, short-answer, and case-based questions) and assesses three progressive reasoning levels: knowledge recall (L1), routine reasoning (L2), and individualized reasoning (L3). To ensure data quality, the automated construction framework was calibrated by six senior dentists, achieving expert agreement rates of 99.98% for multiple-choice and short-answer questions and 96.78% for the more complex case-based questions. Evaluation of 12 frontier LLMs on GlobalDentBench revealed a sharp, stepwise performance degradation with increasing reasoning complexity. Specifically, accuracy plummeted from 81.34% on multiple-choice to 64.53% on short-answer and 22.34% on case-based questions, while declining markedly from 74.01% at L1 to 55.64% at L2 and 35.71% at L3. More critically, risk analysis of real-world dental cases demonstrated an alarming overall unsafe rate of 31.01% in LLM-generated clinical recommendations, with 4.51% posing risks of irreversible patient harm and risks particularly pronounced in specialties such as orthodontics. These findings expose fundamental limitations in the medical reasoning and safety of current LLMs. Consequently, GlobalDentBench provides a scalable foundation for trustworthy clinical AI evaluation, underscoring the urgent need for rigorous validation before the safe deployment of these models in healthcare.


[228] Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework cs.AI | cs.CLPDF

Ali Şenol, Garima Agrawal, Huan Liu

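TL;DR: This study proposes a multi-dimensional behavioral framework for assessing the reasoning quality of large language models, going beyond conventional evaluation that looks only at final-answer correctness. The framework defines six theoretically grounded dimensions: correctness, consistency, robustness, logical coherence, efficiency, and stability. Experiments on seven LLMs across 975 items from four benchmarks show that it reveals model behaviors invisible to accuracy-only metrics and supports better-informed deployment decisions.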

Details

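Motivation: Current LLM evaluation relies mainly on final-answer correctness and offers little insight into the reasoning that produces those answers. To address this, the study provides a unified multi-dimensional framework for measuring LLM reasoning quality from a behavioral perspective.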

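Result: Extensive experiments cover seven LLMs on 975 items from four benchmarks. Logical coherence is orthogonal to correctness (r = -0.172), indicating that correct answers can arise from incoherent reasoning; Claude-Haiku-4.5 achieves the highest multi-dimensional score (Q_bal = 0.778). The framework also exposes critical ranking inversions: for example, DeepSeek-V3 ranks second under accuracy-priority weighting but fifth under legal/compliance weighting. Discriminant validity confirms that 11 of 15 dimension pairs are independent (|r| < 0.50).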

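Insight: The novelty is a unified behavioral evaluation framework built on six theoretically grounded dimensions, which widens evaluation from final-answer correctness alone to the quality of the reasoning process. The framework provides psychometric support for treating the dimensions as independent signals, reveals behavioral differences that single-metric evaluation cannot detect (such as ranking inversions), and directly supports deployment decisions concerning accountability, benchmarking, and signal integrity.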

Abstract: LLMs have achieved remarkable success in complex reasoning tasks, yet current evaluation approaches predominantly rely on final-answer correctness, offering limited insight into the underlying reasoning processes that produce those answers. To address this gap, this study proposes a unified multi-dimensional framework for measuring reasoning quality in LLMs from a behavioral perspective, operationalizing six theoretically grounded dimensions: Correctness (CQ), Consistency (CS), Robustness (RS), Logical Coherence (LS), Efficiency (ES), and Stability (SS). Extensive experiments on seven LLMs across 975 items from four benchmarks demonstrate that the framework reveals behaviors invisible to accuracy-only metrics. Notably, logical coherence is orthogonal to correctness (r = -0.172, ns), confirming that correct answers can arise from incoherent reasoning, while Claude-Haiku-4.5 achieves the highest multi-dimensional score (Q_bal = 0.778). Furthermore, the framework exposes critical ranking inversions: DeepSeek-V3 ranks second under accuracy-priority but fifth under legal/compliance weighting, a reversal that single-metric evaluation cannot detect. Discriminant validity confirms 11/15 dimension pairs are independent (|r| < 0.50), providing psychometric support for treating each dimension as a distinct signal. The dimensional profiles produced by the framework directly support three classes of deployment decision: identifying models whose reasoning traces would fail accountability audits despite correct final answers (LS–CQ orthogonality); preventing ranking errors caused by accuracy-only benchmarking; and ensuring that no single metric silently substitutes for the six independent signals the framework captures.
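
The ranking inversions described in the abstract can be illustrated with a small Python sketch. It assumes the composite score is a normalized weighted mean over the six dimensions; the paper's actual definition of Q_bal and its weighting schemes are not given in the abstract, and every score, weight, and model name below is made up for illustration.

```python
DIMENSIONS = ("CQ", "CS", "RS", "LS", "ES", "SS")

def composite(scores, weights):
    """Weighted mean of the six dimension scores (assumed aggregation rule)."""
    total = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total

def rank(models, weights):
    """Model names sorted from best to worst composite score."""
    return sorted(models, key=lambda m: composite(models[m], weights), reverse=True)

# Hypothetical profiles: model_a is more accurate, model_b reasons more coherently.
models = {
    "model_a": {"CQ": 0.90, "CS": 0.70, "RS": 0.70, "LS": 0.50, "ES": 0.80, "SS": 0.70},
    "model_b": {"CQ": 0.80, "CS": 0.80, "RS": 0.75, "LS": 0.85, "ES": 0.60, "SS": 0.80},
}
accuracy_first = {"CQ": 5, "CS": 1, "RS": 1, "LS": 1, "ES": 1, "SS": 1}
coherence_first = {"CQ": 1, "CS": 2, "RS": 1, "LS": 4, "ES": 1, "SS": 2}

print(rank(models, accuracy_first))   # ['model_a', 'model_b']
print(rank(models, coherence_first))  # ['model_b', 'model_a']
```

The two models swap places when the weighting changes, which is the kind of reversal a single accuracy number cannot show.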


[229] Geo-Expert: Towards Expert-Level Geological Reasoning via Parameter-Efficient Fine-Tuning cs.AI | cs.CLPDF

Chenyou Guo, Zongqi Liu, Yizhou Zhang, Zhaorui Jiang, Ze Liu

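TL;DR: This paper presents Geo-Expert, a family of geological large language models built with parameter-efficient fine-tuning (LoRA). Starting from base models such as Qwen3 and Gemma, the models are fine-tuned on a custom high-quality instruction dataset to address the hallucinations that general-purpose LLMs often produce in geological reasoning.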

Details

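Motivation: General-purpose LLMs applied to geology often hallucinate when reasoning about subsurface structures and deep-time evolution, while AI research in the Earth sciences focuses mainly on surface remote sensing and GIS. This paper aims to bridge that gap and improve LLM capability on specialized geological reasoning.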

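Result: Evaluation on Geo-Eval, a new domain-specific benchmark, shows that the domain-aligned 8B model outperforms open-weight 70B generalist models and the proprietary GPT-4o on specialized geological reasoning, while the 32B variant approaches frontier reasoning models. The optimized 8B model also offers a competitive cost-performance ratio for deployment.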

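Insight: The contribution is a reproducible recipe for building scientific-domain LLMs and a baseline for geological AI. At its core, parameter-efficient LoRA fine-tuning on high-quality domain instruction data lets smaller models match or exceed much larger general-purpose models and frontier commercial models on a specific specialized task.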

Abstract: While general-purpose Large Language Models (LLMs) applied to Geology often hallucinate when reasoning about subsurface structures and deep-time evolution, current AI in Earth sciences predominantly targets surface remote sensing and GIS. To bridge this gap, we introduce Geo-Expert, a family of parameter-efficient geological LLMs fine-tuned on a custom-curated, high-quality instruction dataset processed using our custom instruction synthesis pipeline. We investigate the impact of model scaling and architecture by fine-tuning three base models, Qwen3-8B, Qwen3-32B, and Gemma-3-27B, with the Low-Rank Adaptation (LoRA) method. Our extensive evaluation on a novel domain-specific benchmark, Geo-Eval, reveals that a domain-aligned 8B model can outperform open-weight 70B generalists and proprietary GPT-4o on specialized geological reasoning, while a 32B variant approaches frontier reasoning models. The optimized 8B model further offers a competitive cost-performance ratio for deployment. This work provides a reproducible recipe for democratizing scientific LLMs and establishes a baseline for geological artificial intelligence.


[230] Clustering as Reasoning: A $k$-Means Interpretation of Chain-of-Thought Graph Learning cs.AI | cs.CL | cs.NIPDF

Xuanting Xie, Zhaochen Guo, Bingheng Li, Xingtong Yu, Zhifei Liao

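TL;DR: This paper proposes KCoT, a graph learning framework based on the principle of clustering as reasoning, which unifies Chain-of-Thought (CoT) reasoning with graph representation learning. By establishing a mathematical correspondence between a Transformer block and the $k$-means algorithm, it interprets reasoning as iterative assignment and update steps, and introduces a Semantic Discriminating Prompt and a structure-grounded alignment strategy to fuse topological priors with evolving thought-conditioned representations.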

Details

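Motivation: Existing CoT-based graph learning methods rely on disjoint architectures and fixed graph representations, which limits step-by-step semantic-topological interaction and interpretability. This paper aims to overcome that limitation by reinterpreting, from a clustering perspective, how CoT's iterative reasoning operates over graph-structured data.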

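Result: Experiments on standard benchmarks show that KCoT delivers consistent improvements over state-of-the-art methods, validating clustering as a principled mechanism for CoT-based graph learning.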

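Insight: The core innovation is a formal mathematical correspondence between a Transformer block and the $k$-means algorithm, which lets reasoning be interpreted as clustering iterations. A Semantic Discriminating Prompt and a structure-grounded alignment strategy then fuse topological priors with evolving semantic representations, improving both the interpretability and the performance of graph learning.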

Abstract: Chain-of-Thought (CoT) prompting has shown promise in enhancing the reasoning capabilities of large language models (LLMs) on text-attributed graphs (TAGs). This work reframes CoT-based graph learning through the principle of clustering as reasoning, offering a $k$-means interpretation of how iterative reasoning operates over graph-structured data. We observe that existing graph CoT methods rely on disjoint architectures and fixed graph representations, limiting step-by-step semantic-topological interaction and interpretability. To overcome this limitation, we propose a unified framework named KCoT that integrates CoT reasoning with graph representation learning. Our key theoretical result reveals a formal mathematical correspondence between a Transformer block and the $k$-means algorithm, allowing reasoning to be interpreted as iterative assignment and update steps. Based on this insight, we introduce a Semantic Discriminating Prompt that explicitly formulates these steps as structured CoT reasoning, together with a structure-grounded alignment strategy to fuse topological priors with evolving thought-conditioned representations. Experiments on standard benchmarks demonstrate consistent improvements over state-of-the-art methods, validating clustering as a principled mechanism for CoT-based graph learning.
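
As a reminder of the two steps the abstract refers to, here is a plain $k$-means sketch in pure Python that alternates assignment and update until the labels stop changing. It shows only the classical algorithm; the mapping of these steps onto a Transformer block is the paper's theoretical result and is not reproduced here. The function names and toy points are assumptions for illustration.

```python
def assign(points, centroids):
    """Assignment step: index of the nearest centroid for each point."""
    def sq_dist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    return [
        min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
        for p in points
    ]

def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    moved = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            moved.append(old)  # keep an empty cluster's centroid where it was
            continue
        moved.append(
            tuple(sum(m[i] for m in members) / len(members) for i in range(len(old)))
        )
    return moved

def k_means(points, centroids, max_iters=100):
    """Alternate update and assignment until the labels are stable."""
    labels = assign(points, centroids)
    for _ in range(max_iters):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
labels, centroids = k_means(points, [(0.0, 0.0), (1.0, 1.0)])
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [(0.0, 0.5), (5.5, 5.0)]
```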


[231] Trust but Verify: Prover-Verifier Deliberation for Selective LLM Prediction cs.AI | cs.CLPDF

João Sedoc, Baotong Zhang, Dean Foster

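TL;DR: This paper proposes prover-verifier deliberation, an inference-time protocol for selective prediction with large language models. Modeled on interactive proofs, the protocol has a prover model defend a candidate answer through checkable sub-claims while a verifier model issues targeted challenges and returns a verdict of Accept, Challenge, or Reject. It produces both an answer and a structured confidence verdict, so a system can report high-confidence answers and abstain when uncertain.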

Details

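Motivation: Reliably knowing when a language model is correct is almost as important as the model being correct. The paper aims to give LLM predictions an interpretable, argument-based confidence assessment, through a mechanism inspired by interactive proof theory, to make selective prediction more reliable.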

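Result: The main experiment runs on GPQA Diamond with Claude Sonnet 4.6 as prover and Claude Haiku 4.5 as verifier. The subset of answers that the verifier accepts without revision has a high-confidence precision about 30 percentage points higher than its complement, effectively separating reliable from unreliable answers. Cross-family robustness experiments show that high-confidence precision can transfer across model families, but verifier strictness and domain competence are key factors.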

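Insight: The core innovation is turning interactive proof theory into an inference-time deliberation protocol for LLMs, which supplies a distinct argument-defensibility signal for selective prediction. Unlike self-consistency or multi-agent debate, it assesses the robustness of an answer through a structured challenge-and-defense dialogue, offering a new perspective on calibrating model uncertainty. Its effectiveness, however, depends on the capability and strictness of the verifier model and can break down when the verifier operates outside its effective region, which marks the protocol's practical limits.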

Abstract: Reliably knowing when a language model is correct is almost as important as being correct. We introduce prover-verifier deliberation (PVD), an inference-time protocol grounded in interactive proof theory, as a mechanism for selective prediction: the protocol produces both an answer and a structured confidence verdict, allowing a system to report high-confidence answers while abstaining on uncertain cases. In each dialogue, a prover defends a candidate answer through checkable sub-claims while a verifier issues targeted challenges and returns \textsc{Accept}, \textsc{Challenge}, or \textsc{Reject}. Because frozen language models are imperfect provers and verifiers operating over a noisy channel, formal soundness and completeness guarantees do not transfer; instead, we characterize the protocol empirically through its coverage-precision behavior. Our main experiment uses Claude Sonnet 4.6 as prover and Claude Haiku 4.5 as verifier on GPQA Diamond. Questions accepted with no answer revision, which we call Accept + No Change (ANC), are reported as the high-confidence subset; we evaluate this subset by its precision and coverage. ANC separates reliable from unreliable answers, yielding a $\sim$30pp HC-Prec gap over the non-ANC complement. Robustness experiments with GPT and Gemini pairings show that high HC-Prec can transfer across model families, while verifier strictness and domain competence largely determine the size of the selection gap. On Humanity’s Last Exam, weaker prover-verifier pairings can collapse or invert the ANC signal, illustrating a practical failure mode when the verifier operates outside its effective region. Comparisons with self-consistency, universal self-consistency, multi-agent debate, and Reflexion suggest that prover-verifier deliberation supplies a distinct argument-defensibility signal for selective prediction.
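
The coverage-precision evaluation described in the abstract can be sketched as follows. This assumes the usual selective-prediction definitions: coverage is the fraction of questions falling in the Accept + No Change (ANC) subset, and high-confidence precision is accuracy within that subset. The record format, function name, and toy data are assumptions made here, not the paper's code or results.

```python
def selective_metrics(records):
    """Compute coverage and precision for the ANC subset.

    records: list of (verdict, answer_changed, is_correct) tuples, where
    verdict is one of "Accept", "Challenge", "Reject".
    """
    if not records:
        raise ValueError("records must be non-empty")

    def is_anc(record):
        verdict, answer_changed, _ = record
        return verdict == "Accept" and not answer_changed

    def precision(group):
        # Fraction of correct answers; None when the group is empty.
        return sum(1 for r in group if r[2]) / len(group) if group else None

    anc = [r for r in records if is_anc(r)]
    rest = [r for r in records if not is_anc(r)]
    return {
        "coverage": len(anc) / len(records),
        "hc_precision": precision(anc),
        "complement_precision": precision(rest),
    }

# Hypothetical toy dialogue outcomes.
toy = [
    ("Accept", False, True), ("Accept", False, True), ("Accept", False, False),
    ("Accept", True, True), ("Challenge", False, False), ("Reject", False, False),
    ("Challenge", True, True), ("Accept", False, True),
]
metrics = selective_metrics(toy)
print(metrics)  # {'coverage': 0.5, 'hc_precision': 0.75, 'complement_precision': 0.5}
```

The gap between `hc_precision` and `complement_precision` is the quantity the abstract reports as the selection gap; a system that abstains outside ANC trades coverage for that gain.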


[232] MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning cs.AI | cs.CLPDF

Aritra Dutta, Somak Aditya

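TL;DR: This paper proposes MuCRASP, a multimodal chain-of-thought-aware structured pruning framework, to address the loss of chain-of-thought reasoning ability when vision-language models (VLMs) are compressed. By identifying reasoning-critical components, preserving cross-modal alignment, and accounting for layer-wise sensitivity, it preserves reasoning quality and consistency at relatively high compression ratios across several VLMs and reasoning benchmarks.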

Details

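Motivation: Existing structured pruning methods damage chain-of-thought reasoning when compressing vision-language models because they ignore both the key transition points (pivot tokens) in the reasoning trajectory and the differences in activation distributions between the visual and textual modalities.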

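Result: Experiments on four VLMs across three reasoning benchmarks show that MuCRASP consistently preserves reasoning quality as compression increases. For example, at 30% pruning on Qwen2.5-VL-7B it achieves an LLM-as-a-Judge score of 8.87 on physical reasoning tasks, versus 7.32 for the strongest baseline, and it maintains high reasoning consistency up to 50% pruning with lower perplexity degradation.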

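Insight: The novelty lies in bringing the key structure of chain-of-thought reasoning (such as pivot tokens) and cross-modal activation differences into the pruning criteria for the first time, yielding a global pruning framework that is aware of multimodal reasoning tasks. This offers a new approach to compressing large multimodal models efficiently while retaining their complex reasoning ability.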

Abstract: Vision-language models (VLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex multimodal tasks, but their large parameter sizes make deployment expensive. Structured pruning offers a natural solution; however, existing methods fail to preserve CoT reasoning accuracy in VLMs. We identify two key reasons: (1) CoT consistency depends on sparse transition points (pivot tokens) in the generation trajectory, while existing pruning methods are CoT-agnostic; and (2) pruning methods designed for unimodal LLMs do not account for activation-distribution differences across visual and textual modalities. Motivated by these observations, we propose MuCRASP, a structured pruning framework that targets reasoning-critical components while preserving cross-modal alignment and accounting for layer-wise sensitivity under a global parameter budget. Experiments on four VLMs across three reasoning benchmarks show that MuCRASP consistently preserves reasoning quality under increasing compression. At 30% pruning on Qwen2.5-VL-7B, MuCRASP achieves an LLM-as-a-Judge score of 8.87 versus 7.32 for the strongest baseline on physical reasoning tasks. Furthermore, MuCRASP maintains high reasoning consistency up to 50% pruning, significantly outperforming prior pruning approaches while exhibiting lower perplexity degradation.


[233] Reason–Imagine–Act: Closed-Loop LLM Decision Making with World Models for Autonomous Driving cs.AI | cs.CV | cs.LG | cs.ROPDF

Zhengqi Sun, Yiwen Sun, Boxuan Liu, Tailai Chen, Tianxu Guo

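TL;DR: This paper proposes Reason–Imagine–Act (RIA), a framework that couples a large language model (LLM) with an action-conditioned world model for closed-loop decision making in autonomous driving. The LLM proposes an action template and candidate sub-actions, the world model performs short-horizon rollouts, and a safety scorer selects the safest executable action, with the aim of closing the gap between semantic intent and physical feasibility.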

Details

Motivation: 现有方法要么仅进行在线语言推理而缺乏显式的动态验证,要么主要在离线流程中使用世界模型,导致在决策时语义意图与物理可行性之间存在脱节。

Result: 在统一的CARLA点目标协议(1000个episodes)下,RIA实现了80.05%的路线完成率、51.10%的到达率和0.20%的碰撞率。在相同的闭环接口下,RIA在核心闭环指标上持续优于CARLA TM和MADA等免训练基线方法。

Insight: 创新点在于将LLM的语义推理与世界模型的物理动态验证在决策时进行闭环耦合,通过安全评分器实现实时安全筛选,为LLM在自动驾驶等安全关键领域的应用提供了可借鉴的架构设计。

Abstract: Large language models (LLMs) are promising for autonomous driving, but semantics-only decision policies can yield physically unsafe behavior in dynamic traffic. Existing methods either perform online language reasoning without explicit dynamics verification or use world models mainly in offline pipelines, leaving a gap between semantic intent and physical feasibility at decision time. We propose Reason–Imagine–Act (RIA), a closed-loop framework that couples an LLM reasoner with an action-conditioned world model for online safety verification. At each step, the LLM proposes an action template and candidate sub-actions, the world model performs short-horizon rollouts, and a safety scorer selects the safest executable action with feedback to the next reasoning step. Under a unified CARLA point-goal protocol (1000 episodes), RIA achieves 80.05% route completion, 51.10% arrival rate, and 0.20% collision rate. Under the same closed-loop interface, RIA consistently outperforms training-free baselines, including CARLA TM and MADA, on core closed-loop metrics. For reproducibility, code is available at https://github.com/pku-smart-city/source_code/tree/main/RIA.
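
The propose, rollout, and score loop described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the one-dimensional car-following world model, the `State` fields, the gap-based safety score, and the hard-coded candidate accelerations (standing in for LLM proposals) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class State:
    gap_m: float      # distance to the lead vehicle, in metres (hypothetical state)
    speed_mps: float  # ego speed, in metres per second


def rollout(state: State, accel: float, steps: int = 5, dt: float = 0.2,
            lead_speed: float = 5.0) -> List[State]:
    """Toy world model: constant ego acceleration, lead vehicle at constant speed."""
    gap, speed = state.gap_m, state.speed_mps
    trajectory = []
    for _ in range(steps):
        speed = max(0.0, speed + accel * dt)
        gap += (lead_speed - speed) * dt
        trajectory.append(State(gap, speed))
    return trajectory


def safety_score(trajectory: Sequence[State], min_gap: float = 2.0) -> float:
    """Worst-case gap margin over the rollout; negative means a predicted violation."""
    return min(s.gap_m for s in trajectory) - min_gap


def select_action(state: State, candidates: Sequence[float]) -> Tuple[float, float]:
    """Imagine each candidate with the world model and keep the safest one."""
    scored = [(safety_score(rollout(state, a)), a) for a in candidates]
    best_score, best_action = max(scored)
    return best_action, best_score


# Candidate sub-actions would come from the LLM reasoner; here they are fixed.
best, score = select_action(State(gap_m=6.0, speed_mps=10.0), [1.0, 0.0, -3.0])
print(best, round(score, 2))
```

In a closed loop, the selected action and its score would be fed back into the next reasoning step; that feedback path is omitted here.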


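[234] AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models cs.AI | cs.CV | cs.MM | cs.SD

Jialiang Yang, Bin Xia, Ruihang Chu, Dingdong Wang, Wanke Xia

TL;DR: This paper presents AVBench, an automated evaluation benchmark for human-centric audio-video generation models. Using fine-grained metrics and specialized evaluators trained with preference learning, it addresses the inaccuracy of existing evaluation methods in human-related scenarios and shows potential for data filtering and reinforcement learning.

Details

Motivation: Evaluation of audio-video generation models is still at an early stage: only a few coarse-grained benchmarks cover human-related scenarios, and they rely on limited preset evaluations with generic multimodal LLMs, which leads to inaccurate assessments of model capabilities.

Result: AVBench integrates ten evaluation dimensions for human-centered real-world scenarios, covering visual quality, audio quality, and multi-level cross-modal consistency. The specialized evaluators built through preference learning reliably detect subtle cross-modal inconsistencies, and the probability-based scoring mechanism is more reliable than traditional VQA-style evaluation and aligns closely with human judgment.

Insight: The novelty is an automated evaluation framework dedicated to human-centric audio-video generation. Its core is to construct large-scale supervision (training pairs produced by perturbing real videos) to train specialized evaluators, and to score with a continuous probability derived from the model's prediction confidence. This is more reliable than discrete textual judgments and can serve as a differentiable reward signal for RLHF.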

Abstract: Rapid advances in audio-video (AV) generation have enabled high-fidelity synthesis with synchronized sound, particularly for human-related scenarios involving speech and interactions. Yet evaluation of AV generation remains at an early stage: only a few coarse-grained benchmarks cover human-related scenarios, and they rely on limited preset evaluations with generic multimodal LLMs, leading to inaccurate assessments of model capabilities. To address these issues, we introduce AVBench, a fully automated benchmark tailored for human-centric AV generation. AVBench is built on two key designs for comprehensive and accurate evaluation: (i) Human-centric and fine-grained metrics. AVBench integrates ten evaluation dimensions designed for human-centered real-world scenarios, covering visual quality, audio quality, and multi-level consistency across modalities. These practical metrics capture human-related details that existing benchmarks often overlook. (ii) Specialized evaluators via preference learning. To address the lack of specialized training data, we construct large-scale supervision by transforming real-world videos into diverse training pairs with controlled perturbations. After fine-tuning on this high-quality dataset, the evaluators learn to reliably detect subtle cross-modal inconsistencies. Crucially, instead of producing discrete textual judgments, AVBench derives continuous evaluation scores from the model’s prediction confidence on binary decisions. This probabilistic scoring mechanism enables a more reliable assessment than traditional VQA-style evaluation and aligns closely with human judgment. Taken together, AVBench offers automated evaluation for AV generation, demonstrates strong potential for data filtering, and serves as a differentiable reward signal for Reinforcement Learning from Human Feedback (RLHF).
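
One plausible reading of "continuous evaluation scores from the model's prediction confidence on binary decisions" is a two-way softmax over the evaluator's logits for the two answers. The sketch below assumes that reading; the function name and the use of "yes"/"no" logits are illustrative and not taken from the paper.

```python
import math


def confidence_score(logit_yes: float, logit_no: float) -> float:
    """Probability of the 'yes' decision under a two-way softmax (numerically stable)."""
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


# Two samples that a discrete judge would both label "yes" receive different scores.
print(confidence_score(0.2, 0.0), confidence_score(4.0, 0.0))
```

A score of this kind varies smoothly with the logits, which is what makes it usable as a ranking signal for data filtering or as a reward.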


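[235] Mitigating Object Hallucinations in Vision-Language Models through Region-Aware Attention Recalibration cs.AI | cs.CV | cs.LG

Yuanzhi Xu, Qian Gao, Jun Fan, Guohui Ding, Zhenyu Yang

TL;DR: This paper proposes a training-free, region-aware adaptive attention recalibration strategy, applied at inference time, to mitigate object hallucination in large vision-language models. It computes an outlier-resistant statistical midpoint across attention heads to establish a stable anchor, uses inter-head disagreement across regions to set intervention budgets dynamically, and suppresses hallucination-inducing attention paths through continuous penalty modulation.

Details

Motivation: Object hallucination is pervasive in large vision-language models. Existing remedies (data-driven fine-tuning, contrastive decoding, or attention head truncation) tend to compromise either computational efficiency or the continuity of the feature space, so a more efficient solution that leaves the model's features intact is needed.

Result: On multimodal benchmarks including CHAIR, POPE, and MME, the method substantially reduces instance-level and sentence-level hallucinations and reports state-of-the-art performance against existing baselines.

Insight: The novelty is a training-free, region-aware adaptive attention recalibration mechanism. By establishing a stable anchor for visual representations and a dynamic intervention budget, it corrects visual-semantic misalignment while preserving generative fluency and language priors; the authors also report strong algorithmic robustness.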

Abstract: The generation of factually incorrect objects, commonly known as object hallucination, remains a persistent challenge in Large Vision-Language Models (LVLMs). Current approaches to this issue, ranging from expensive data-driven fine-tuning and high-latency contrastive decoding to rigid attention head truncation, frequently compromise either computational efficiency or the continuity of the model’s feature space. To overcome these limitations, we introduce a novel, training-free inference strategy that operates as a region-aware adaptive weighting mechanism to dynamically correct semantic drift without relying on abrupt heuristic truncations. By computing an outlier-resistant statistical midpoint across various attention heads, we establish a stable anchor for reliable visual representations. We then utilize the inter-head disagreement mapped across regions to dynamically determine intervention budgets, gently suppressing hallucination-inducing attention paths through a continuous penalty modulation. This recalibration process effectively rectifies visual-semantic misalignments while fully preserving generative fluency and language priors. Comprehensive evaluations on standard multimodal benchmarks, including CHAIR, POPE, and MME, reveal that our strategy substantially curtails both instance- and sentence-level hallucinations. The results demonstrate state-of-the-art performance against contemporary baselines, confirming our method’s efficiency and algorithmic robustness. Our code will be public.


cs.LG

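[236] Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning cs.LG | cs.AI | cs.CL | cs.CR

Nesreen K. Ahmed, Nima Nafisi

TL;DR: This paper proposes Agent-ToM, a framework that monitors autonomous LLM agents for covert malicious behavior through Theory-of-Mind reasoning. It performs structured analysis of full trajectories, inferring the agent's beliefs, intents, and action deviations, and makes decisions with a Reason-Verify-Refine pipeline. At training time it distills critique signals into a reusable semantic guardrail memory, and it outperforms existing methods on adversarial monitoring benchmarks.

Details

Motivation: Existing monitoring methods treat each trajectory independently and cannot learn from prior monitoring experience, which makes it hard to detect covert malicious behavior with delayed, context-dependent, and long-horizon attack patterns.

Result: On the adversarial agent monitoring benchmarks SHADE-Arena and CUA-SHADE-Arena, Agent-ToM achieves a strong precision-recall balance and outperforms state-of-the-art monitoring baselines, including ensemble methods.

Insight: The novelty lies in building Theory-of-Mind reasoning into the monitoring framework in a structured way, inferring the agent's beliefs and intents to distinguish benign task execution from covert deviation. In addition, a semantic guardrail memory that is reusable across episodes lets the monitoring layer keep learning and accumulating knowledge.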

Abstract: Monitoring autonomous large language model (LLM) agents for covert malicious behavior is challenging due to delayed, context-dependent, and long-horizon attack patterns. Agents may pursue hidden objectives while maintaining superficially benign behavior, making detection difficult even with full trajectory access. Prior monitoring approaches improve scaffolding or ensemble aggregation, but treat each trajectory independently and do not learn from prior monitoring experience. Moreover, standard reasoning methods explain observed behavior without explicitly reasoning about agent beliefs, intentions, and goal alignment required to distinguish benign task execution from covert deviation. We propose Agent-ToM, a learning-to-monitor framework grounded in Theory-of-Mind (ToM) reasoning for security analysis of autonomous agents. Agent-ToM performs structured full-trajectory analysis by inferring beliefs, intent hypotheses with calibrated confidence, expected actions, and deviations from task-consistent behavioral baselines. At inference time, it employs a Reason-Verify-Refine pipeline to construct and validate monitoring decisions. At training time, Agent-ToM distills critique signals into a persistent semantic guardrail memory, enabling reusable belief- and intent-conditioned constraints across episodes. We evaluate Agent-ToM on adversarial agent monitoring benchmarks (SHADE-Arena and CUA-SHADE-Arena). Agent-ToM achieves strong precision-recall balance and outperforms state-of-the-art monitoring baselines, including ensemble methods, while using a coherent two-call reasoning pipeline. These results demonstrate that learning at the monitoring layer, combined with structured ToM reasoning and verification, provides an effective and deployable foundation for securing autonomous LLM agents.


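[237] Faithfulness as Information Flow: Evaluating and Training Faithful Chain-of-Thought Reasoning cs.LG | cs.CL

Jinghan Jia, Joe Benton, Eric Easley

TL;DR: This paper proposes a framework for evaluating and improving chain-of-thought (CoT) faithfulness from an information-flow perspective. It measures whether the reasoning trace faithfully reflects the model's computation through three complementary properties (sufficiency, completeness, and necessity) and designs several training-time interventions that strengthen the mediating role of the CoT.

Details

Motivation: Chain-of-thought reasoning can rely on prompt-to-answer shortcuts, so the visible reasoning trace may be misleading and fail to reflect the model's internal computation. A task-agnostic way to evaluate and improve CoT faithfulness is therefore needed.

Result: Across hinted arithmetic, reward-hackable code repair, and DAPO-Math models, the proposed interventions (such as attention masking and gradient masking) shift behavioral and structural indicators toward stronger CoT mediation, improve task-agnostic faithfulness metrics, and in some settings reduce susceptibility to wrong hints.

Insight: The novelty lies in formalizing CoT faithfulness in terms of information-flow structure, proposing quantifiable diagnostics (entropy-based, masked-KL, and gradient-based measures), and steering the information-flow path directly through training-time interventions. This suggests a route toward more faithful and monitorable reasoning models.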

Abstract: Chain-of-thought (CoT) reasoning is useful for monitoring language models only when the reasoning trace faithfully reflects the computation that produces the final answer. However, models can rely on prompt-to-answer shortcuts that bypass the CoT, making the visible reasoning trace misleading even when it appears plausible. We study CoT faithfulness through a structural information-flow perspective: faithful reasoning should route answer-relevant information through the mediated path from prompt to CoT to answer, rather than through a direct prompt-to-answer shortcut. This perspective yields a task-agnostic framework based on three complementary properties, sufficiency, completeness, and necessity, which we instantiate with entropy-based, masked-KL, and gradient-based diagnostics. We show that these metrics recover externally judged faithfulness differences in hinted reasoning, and identify a low-entropy failure mode of KL-based diagnostics where gradient-based measures remain more stable. Building on this analysis, we introduce update-time interventions for verifier-based on-policy RL, including attention masking, backward-only gradient masking, CoT gradients, and adversarial perturbations of prompt representations. Across hinted arithmetic, reward-hackable code repair, and DAPO-Math models trained without hints but evaluated under wrong-hint injection, our interventions shift behavioral and structural indicators toward stronger CoT mediation. In particular, they make shortcut and reward-hacking behavior more transparent in the CoT and improve task-agnostic faithfulness metrics, while in some settings also reducing wrong-hint susceptibility. Our results suggest that controlling information flow during training is a practical route toward more faithful and monitorable CoT reasoning. Code is available at https://github.com/safety-research/faithful-cot.


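[238] ECHO: Terminal Agents Learn World Models for Free cs.LG | cs.CL

Vaishnavi Shrivastava, Piero Kauffmann, Ahmed Awadallah, Dimitris Papailiopoulos

TL;DR: This paper proposes ECHO, a hybrid objective for training command-line interface (CLI) agents. By combining the standard policy-gradient loss with an auxiliary loss for predicting environment observations, it turns the feedback stream returned by the terminal into dense supervision, making better use of environment interaction data and improving agent performance without additional rollouts.

Details

Motivation: Existing RL-based agent training methods (such as GRPO) update action tokens using only sparse outcome-level rewards and discard the rich feedback stream the environment returns at every interaction (stdout, errors, files, and so on). This feedback contains dense evidence about how the environment responds to actions, but standard methods do not use it.

Result: On TerminalBench-2.0, ECHO roughly doubles GRPO's pass@1: Qwen3-8B improves from 2.70% to 5.17%, and Qwen3-14B from 5.17% to 10.79%. Policies trained with ECHO also predict terminal dynamics better, sharply reducing environment-token cross-entropy, whereas GRPO alone barely changes it.

Insight: The core idea is to treat environment observations as a dense, on-policy supervision signal and to add an auxiliary cross-entropy loss that trains the policy to predict the observation tokens produced by its own actions. The method needs no additional rollouts or expert demonstrations and improves the policy using only existing interaction data; in some settings, the environment prediction loss alone enables verifier-free improvement on unseen OOD tasks.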

Abstract: CLI agents are the closest thing language models have to an embodied setting: the model emits commands, the terminal executes them, and the returned stream – stdout, errors, files, logs, and traces – records the consequences. We argue that this stream is a supervision signal, but standard agent RL discards it: GRPO-style training updates action tokens with sparse outcome-level rewards while ignoring environment responses already in the rollout. Failed rollouts provide little policy-gradient signal despite containing rich evidence about how the environment responds. We introduce ECHO (Environment Cross-entropy Hybrid Objective), a hybrid objective that combines the standard policy-gradient loss on action tokens with an auxiliary loss that trains the policy to predict environment observation tokens resulting from its own actions. ECHO reuses the same forward pass as GRPO, requires no additional rollouts, and turns terminal feedback into dense supervision for all rollouts. ECHO doubles GRPO pass@1 on TerminalBench-2.0: Qwen3-8B improves from 2.70% to 5.17%, and Qwen3-14B from 5.17% to 10.79%. ECHO also produces policies that better predict terminal dynamics, even on trajectories they did not generate: across held-out rollouts, it sharply reduces environment-token cross-entropy while GRPO alone barely changes it. From base Qwen3-8B, ECHO matches expert-SFT-then-GRPO performance on held-out terminal tasks without expert demonstrations, and recovers roughly half of the expert-SFT initialization benefit on TerminalBench-2.0. In some settings, the environment prediction loss alone enables verifier-free self-improvement, allowing policies to improve on unseen OOD tasks by learning only from environment interactions. Together, these results suggest that environment observations are not merely context for future actions, but a dense, on-policy supervision signal already present in every rollout.
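
The hybrid objective can be illustrated with a small numeric sketch. It assumes a simplified form: a plain advantage-weighted log-likelihood term on action tokens (GRPO's clipping and group normalization are omitted) plus a weighted cross-entropy term on environment tokens. The function name, the `env_weight` coefficient, and the per-token averaging are hypothetical choices, not details from the paper.

```python
from typing import Sequence, Tuple


def echo_style_loss(token_logps: Sequence[float], is_action: Sequence[bool],
                    advantage: float, env_weight: float = 0.1) -> Tuple[float, float, float]:
    """Return (total, policy_gradient_term, environment_cross_entropy_term).

    token_logps: log-probability the policy assigns to each token in the rollout.
    is_action:   True for tokens the agent emitted, False for terminal output.
    """
    action_lps = [lp for lp, a in zip(token_logps, is_action) if a]
    env_lps = [lp for lp, a in zip(token_logps, is_action) if not a]
    # Policy-gradient surrogate: only action tokens, scaled by the rollout advantage.
    pg = -advantage * sum(action_lps) / len(action_lps) if action_lps else 0.0
    # Auxiliary loss: mean negative log-likelihood of observed environment tokens.
    ce = -sum(env_lps) / len(env_lps) if env_lps else 0.0
    return pg + env_weight * ce, pg, ce


# A rollout with zero advantage contributes nothing through the policy-gradient
# term, but its environment tokens still produce a training signal.
total, pg, ce = echo_style_loss([-1.0, -2.0, -0.5, -1.5], [True, True, False, False],
                                advantage=0.0)
print(total, pg, ce)
```

Both terms are computed from the same per-token log-probabilities, which mirrors the abstract's point that the auxiliary loss reuses the forward pass already needed for the policy update.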


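[239] MindAlign: Bridging EEG, Vision, and Language for Zero-Shot Visual Decoding cs.LG | cs.CL | q-bio.NC

Zexuan Chen, Sichao Liu, Runhao Lu, Huichao Qi, Alexandra Woolgar

TL;DR: This paper proposes MindAlign, a tri-modal contrastive learning framework for EEG-based zero-shot visual decoding. Through a two-stage design, it aligns EEG signals, images, and LLM-generated textual descriptions in a unified latent space, aiming to bridge neural representations and computational models of vision.

Details

Motivation: A key challenge in visual decoding from brain signals (EEG in particular) is how to bridge neural representations and computational vision models so that decoding is robust and semantically grounded.

Result: On the Things-EEG2 200-way zero-shot benchmark, the method reaches 54.1% Top-1 and 83.4% Top-5 accuracy, substantially exceeding the strongest prior baseline (32.4% / 64.0%), with paired Wilcoxon tests confirming significance (p < 0.01). Generalization is also validated on Things-MEG.

Insight: The novelty lies in using text as a semantic regularizer: tri-modal contrastive learning injects linguistic structure into the shared space without overwhelming the primary EEG-image signal. The encoder integrates subject-specific adaptation, graph attention over channels, and temporal-spatial convolutional embeddings. The analysis also finds that compact embedding geometries (such as CN-CLIP) outperform much larger backbones and that decoding aligns with the established neurophysiology of visual processing.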

Abstract: Visual decoding from brain signals is a key challenge at the intersection of computer vision and neuroscience, requiring methods that bridge neural representations and computational models of vision. We introduce a tri-modal contrastive framework for EEG-based visual decoding that aligns EEG, visual, and textual representations within a unified latent space. Our approach follows a two-stage design. First, we pre-train an EEG encoder via masked reconstruction on unlabeled trials, learning spatio-temporal regularities that transfer robustly to downstream tasks. Second, we jointly align EEG, image, and LLM-generated textual descriptions through contrastive learning, where text supervision acts as a semantic regularizer that injects linguistic structure into the shared space without overwhelming the primary EEG-image signal. The encoder integrates subject-specific adaptation, graph-attention over channels, and temporal-spatial convolutional embeddings. On the Things-EEG2 200-way zero-shot benchmark, our framework achieves 54.1% Top-1 and 83.4% Top-5 accuracy, substantially exceeding the strongest prior baseline (32.4% / 64.0%), with paired Wilcoxon tests confirming significance (p < 0.01) over all in-subject baselines. We validate generalization on Things-MEG. Analysis reveals that compact embedding geometries (CN-CLIP) outperform much larger backbones, and that decoding aligns with established neurophysiology of visual processing. This work is a critical step towards robust, semantically grounded visual decoding from non-invasive temporal neural signals. The source code is publicly available at https://github.com/anon-eeg/eeg_image_decoding.
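
The second stage can be sketched as a weighted sum of two contrastive terms. The code below is an assumption-laden illustration: it uses a one-directional InfoNCE loss, a temperature of 0.1, and a `text_weight` below 1 to mimic text acting as a regularizer rather than the primary signal. None of these specifics come from the paper.

```python
import math
from typing import List, Sequence


def _normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries: Sequence[Sequence[float]], keys: Sequence[Sequence[float]],
             temperature: float = 0.1) -> float:
    """Mean cross-entropy of matching query i to key i among all keys in the batch."""
    q = [_normalize(v) for v in queries]
    k = [_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(q)


def tri_modal_loss(eeg, img, txt, text_weight: float = 0.3) -> float:
    """EEG-image alignment plus a down-weighted EEG-text term (weight is hypothetical)."""
    return info_nce(eeg, img) + text_weight * info_nce(eeg, txt)


eeg = [[1.0, 0.0], [0.0, 1.0]]
aligned = tri_modal_loss(eeg, eeg, eeg)
shuffled = tri_modal_loss(eeg, eeg[::-1], eeg)
print(aligned < shuffled)
```

Lowering `text_weight` reduces the pull of the text term on the shared space, which is one simple way to keep the EEG-image term dominant.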


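[240] Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models cs.LG | cs.CL

Wenlong Deng, Jiaji Huang, Kaan Ozkara, Yushu Li, Christos Thrampoulidis

TL;DR: This paper studies the geometric mechanism behind reward hacking in reinforcement learning for language models. It detects optimization drift by analyzing the dominant singular directions of parameter updates and introduces trusted-direction projection, which constrains gradients to a clean reference subspace, to mitigate the failure mode in which a model exploits shortcuts instead of solving the task.

Details

Motivation: Reward hacking occurs when a model improves a proxy reward by exploiting shortcuts rather than solving the intended task, which causes reinforcement learning to fail. The paper aims to understand this failure mode through the geometry of parameter updates and to propose a mitigation.

Result: In reward-hacking experiments on mathematical reasoning, trusted-direction projection delays shortcut exploitation and better preserves task performance.

Insight: The novelty lies in characterizing reward hacking geometrically through the singular directions of parameter updates and in stabilizing the learning trajectory by constraining gradient directions via projection. This offers a new perspective and method for mitigating reward hacking in reinforcement learning.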

Abstract: Reward hacking arises when a model improves a proxy reward by exploiting shortcuts rather than solving the intended task. We study this failure mode through the geometry of reinforcement learning updates in language models and argue that hacking emerges when optimization drifts away from a stable low-dimensional learning trajectory. We analyze this drift through dominant singular directions of parameter updates and show that reward-hacking runs exhibit substantially larger directional change than clean runs. Motivated by this observation, we introduce trusted-direction projection, which constrains gradients to remain within a clean reference subspace. Across reward-hacking experiments on mathematical reasoning, the proposed approach delays shortcut exploitation and better preserves task performance.
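
Constraining a gradient to a reference subspace amounts to an orthogonal projection. The sketch below shows that operation on a flattened gradient, assuming the trusted subspace is already available as a list of orthonormal basis vectors; how the paper builds that basis from clean runs is not reproduced here, and the function name is illustrative.

```python
from typing import List, Sequence


def project_onto_subspace(grad: Sequence[float],
                          basis: Sequence[Sequence[float]]) -> List[float]:
    """Orthogonal projection of grad onto span(basis); basis must be orthonormal."""
    projected = [0.0] * len(grad)
    for u in basis:
        coeff = sum(g * x for g, x in zip(grad, u))  # component of grad along u
        for i, x in enumerate(u):
            projected[i] += coeff * x
    return projected


# The third coordinate lies outside the trusted subspace and is removed.
print(project_onto_subspace([3.0, 4.0, 5.0], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]))
# → [3.0, 4.0, 0.0]
```

The discarded component (here the third coordinate) is the part of the update that points away from the reference directions.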


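[241] When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards cs.LG | cs.CL

Li Wang, Xiaodong Lu, Xiaohan Wang, Yikun Ban, Jiajun Chai

TL;DR: This paper proposes RLAVR (Reinforcement Learning with Active Verifiable Rewards), which addresses the high cost of obtaining ground-truth labels in Reinforcement Learning with Verifiable Rewards (RLVR). It actively acquires ground-truth labels for a small set of selected samples and combines them with pseudo-labels to stabilize training and improve performance under a limited annotation budget.

Details

Motivation: RLVR relies on ground-truth labels to compute rewards, but obtaining them is often prohibitively expensive in real-world scenarios. Existing unsupervised RLVR methods train on pseudo-labels and are prone to training collapse, and samples differ in annotation value, so a method that uses labels efficiently under a limited annotation budget is needed.

Result: Extensive experiments across diverse domains, model families, and model scales demonstrate the effectiveness and generality of the method.

Insight: The core contributions are the RLAVR framework and the Corrective Advantage Gap (CAG) metric, which quantifies sample-level supervision value. Building on CAG, the CARE (Correction-Aware Reliability Estimation) strategy translates the oracle CAG criterion into a practical pre-query acquisition policy that substantially improves training stability.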

Abstract: Large Language Models (LLMs) have achieved remarkable advancements in reasoning capabilities empowered by Reinforcement Learning with Verifiable Rewards (RLVR). Nonetheless, RLVR intrinsically relies on ground-truth labels for reward computation, the acquisition of which is often prohibitively expensive in real-world scenarios. While unsupervised RLVR paradigms attempt to circumvent this by training on pseudo-labels, they are notoriously susceptible to training collapse. Moreover, different samples often exhibit varying annotation values. In this paper, we propose Reinforcement Learning with Active Verifiable Rewards (RLAVR), which actively acquires ground-truth labels for a small set of selected samples and integrates them with pseudo-labels, thereby stabilizing training dynamics and improving performance under limited annotation budgets. To identify valuable samples, we propose the Corrective Advantage Gap (CAG) metric and analyze the sample-level supervision value. Building on this, we introduce Correction-Aware Reliability Estimation for RLAVR (CARE), which translates the oracle CAG criterion into a practical pre-query acquisition policy to substantially improve training stability. Extensive experiments across diverse domains, model families, and model scales demonstrate the effectiveness and generality of our approach. Our code is available at https://github.com/Lumina04/CARE.


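[242] Retrieval-Augmented Detection of Potentially Abusive Clauses in Chilean Terms of Service cs.LG | cs.AI | cs.CL

Christoffer Loeffler, Tomás Rey Pizarro, Daniel Ignacio Miranda Vásquez, Andrea Martínez Freile

TL;DR: This paper proposes a retrieval-augmented generation framework for automatically detecting and classifying potentially abusive clauses in Chilean Terms of Service. It combines efficient clause detection, hybrid dense-sparse retrieval, reranking, and prompt augmentation to support medium-sized open-weight language models. The study also introduces the Chilean Abusive Terms of Service Extended corpus, with 100 contracts and 10,029 annotated clauses in 24 legal categories.

Details

Motivation: Online Terms of Service often function as contracts of adhesion, creating asymmetries that may expose consumers to potentially abusive clauses. In Chile, assessing these clauses is legally challenging because some clearly violate mandatory consumer law, whereas others depend on broader standards such as good faith and contractual imbalance.

Result: Experiments comparing commercial and open-weight language models, fine-tuned encoders, and traditional baselines show that retrieval-augmented prompting substantially improves performance and lets local models approach larger cloud-based systems at lower computational and token cost.

Insight: The novelty lies in a retrieval-augmented generation framework designed for local execution, together with a refined legal annotation scheme and corpus covering illegal, gray, and dark clauses, which provide a practical design for AI-assisted consumer contract review.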

Abstract: Online Terms of Service often function as contracts of adhesion, creating asymmetries that may expose consumers to potentially abusive clauses. In Chile, assessing such clauses is legally challenging because some provisions clearly violate mandatory consumer law, whereas others depend on broader standards such as good faith and contractual imbalance. We present a retrieval-augmented generation framework for the automated detection and classification of potentially abusive clauses in Chilean Terms of Service. Designed for local execution, it combines efficient clause detection, hybrid dense–sparse retrieval, reranking, and prompt augmentation to support medium-sized open-weight language models. We also introduce the Chilean Abusive Terms of Service Extended corpus, comprising 100 contracts and 10,029 annotated clauses in 24 legally grounded categories spanning illegal, dark, and gray clauses. Experiments comparing commercial and open-weight language models, fine-tuned encoders, and traditional baselines show that retrieval-augmented prompting substantially improves performance and enables local models to approach larger cloud-based systems at lower computational and token cost. The study also contributes a refined legal annotation scheme and a practical design for AI-assisted consumer contract review.


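[243] Prism: A Plug-in Reproducible Infrastructure for Scalable Multimodal Continual Instruction Tuning cs.LG | cs.CL | cs.CV

Jun-Tao Tang, Yu-Cheng Shi, Zhen-Hao Xie, Da-Wei Zhou

TL;DR: This paper presents Prism, a plug-in, reproducible infrastructure designed for scalable Multimodal Continual Instruction Tuning (MCIT) research. A lightweight plugin registration mechanism separates algorithm development from the backbone implementation, so new strategies can be integrated as independent plugins without modifying the underlying MLLM codebase, which addresses the severe engineering bottlenecks in current MCIT research.

Details

Motivation: Real-world deployment requires multimodal large language models (MLLMs) to adapt continuously to emerging tasks, which motivates Multimodal Continual Instruction Tuning (MCIT). Current MCIT research, however, is hindered by severe engineering bottlenecks: existing methods are typically implemented by directly modifying the base MLLM codebase, which imposes substantial implementation overhead and yields method-specific architectures that severely limit code reuse and fair comparison.

Result: The paper introduces the Prism infrastructure, which natively supports widely used large-scale training pipelines and thereby enables reproducible and scalable MCIT experimentation.

Insight: The main contribution is a plug-in, reproducible infrastructure design that decouples algorithms from the backbone model. It addresses code fragmentation and the difficulty of fair comparison in MCIT and provides a standardized platform for faster method development and evaluation.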

Abstract: Multimodal Large Language Models (MLLMs) achieve versatility by reformulating diverse tasks into a unified instruction-following framework via instruction tuning. However, real-world deployment requires continuous adaptation to emerging tasks, motivating Multimodal Continual Instruction Tuning (MCIT). Despite its growing importance, current MCIT research is hindered by severe engineering bottlenecks. Existing methods are typically implemented by directly modifying the base MLLM codebase, which imposes substantial implementation overhead and yields method-specific architectures that severely limit code reuse and fair comparison. To address this, we introduce Prism, a plug-in reproducible codebase specifically designed for scalable MCIT research. It separates algorithmic development from the backbone implementation via a lightweight plugin registration mechanism, enabling new strategies to be integrated as independent plugins without modifying the underlying MLLM codebase, thereby eliminating structural fragmentation and accelerating method development. Prism natively supports widely used large-scale training pipelines, thereby enabling reproducible and scalable MCIT experimentation. Code is available at https://github.com/LAMDA-CL/Prism.


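[244] Parameter Efficient Multi-Class Intelligent Scheduling for Multimodal Online Distributed Industrial Anomaly Detection cs.LG | cs.AI | cs.CV

Heqiang Wang, Weihong Yang, Zheyuan Yang, Jia Zhou, Xiaoxiong Zhong

TL;DR: This paper proposes MODIAD, a Multimodal Online Distributed Industrial Anomaly Detection framework, to address the challenges posed by distributed, continuously generated data in real industrial environments. It formulates a Multi-class Intelligent Scheduling (MIS) problem to coordinate cross-class model updates and designs the SMG algorithm and the REC-LoRA strategy to achieve efficient, high-performing anomaly detection under resource constraints.

Details

Motivation: Existing industrial anomaly detection methods mainly target centralized, offline settings and overlook the distributed and continuously generated data of real industrial environments. With the development of edge intelligence, a new framework is needed that can use edge devices for distributed collaborative training.

Result: Extensive experiments on two representative multimodal industrial anomaly detection datasets, MVTec 3D-AD and Eyecandies, show that the proposed method achieves superior performance and efficiency in the MODIAD scenario.

Insight: The novelty lies in modeling industrial anomaly detection as a Multi-class Intelligent Scheduling (MIS) problem in an online distributed setting and proposing the efficient SMG algorithm to solve it. The REC-LoRA strategy uses class-wise low-rank adaptation to significantly reduce computation and communication overhead, making it a parameter-efficient adaptation method.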

Abstract: Industrial anomaly detection has attracted significant attention as a fundamental challenge in industrial systems. The rapid advancement of heterogeneous industrial sensors has driven industrial anomaly detection from unimodal to multimodal paradigms. However, existing methods are primarily designed for centralized and offline settings, overlooking the distributed and continuously generated data characteristic of real-world industrial environments. With the advancement of edge intelligence, modern edge devices are increasingly capable of not only data acquisition but also distributed model training, enabling collaborative intelligence across the system. Industrial anomaly detection represents a critical application in this context. Motivated by these challenges, we propose a novel framework termed Multimodal Online Distributed Industrial Anomaly Detection (MODIAD). We first present a comprehensive workflow for MODIAD and then formulate a Multi-class Intelligent Scheduling (MIS) problem to coordinate cross-class model updates by balancing data sufficiency and class update frequency. To efficiently solve this problem, we design a Sequential Marginal Gain Greedy (SMG) algorithm that enables effective multi-class training under resource constraints. Furthermore, to improve the computational and communication efficiency during training, we propose a Resource Efficient Class-Wise Low Rank Adaptation (REC-LoRA) strategy, which significantly reduces system overhead while preserving detection performance. Extensive experiments on two representative multimodal industrial anomaly detection datasets, MVTec 3D-AD and Eyecandies, demonstrate that the proposed approach achieves superior performance and efficiency under the MODIAD scenario.


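[245] CAFD: Concept-Aware DNN Fault Detection using VLMs cs.LG | cs.CV | cs.SE

Amin Abbasishahkoo, Mahboubeh Dadkhah, Lionel Briand

TL;DR: This paper proposes CAFD, a concept-aware fault detection method for deep neural networks. By integrating model output signals, distance-based features, and a novel Concept Failure Ratio (CFR) feature, it achieves superior fault detection performance while remaining efficient.

Details

Motivation: Existing hybrid approaches improve detection by combining multiple information sources, but their computational overhead is large, which limits scalability and practicality in real-world use. An efficient, high-performing fault detection method is therefore needed.

Result: Across three subject DNN models and datasets, including ImageNet, and over a wide range of constrained selection budgets, CAFD consistently outperforms five state-of-the-art baselines in Fault Detection Rate (FDR), with an average FDR improvement of 18.3%.

Insight: The novelty is the Concept Failure Ratio (CFR), a concept-based feature that uses vision-language models (VLMs) to extract textual concepts from images and quantify how strongly their presence is associated with DNN failures. This integrates complementary semantic information to improve detection.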

Abstract: Fault detection for Deep Neural Networks (DNNs) has received increasing attention in recent years. While more advanced hybrid approaches have been proposed to combine multiple sources of information and outperform earlier techniques, they often incur substantial computational overhead, limiting scalability and practicality in real-world settings. In this paper, we introduce Concept-Aware Fault Detection (CAFD), a learning-based approach that achieves superior fault detection performance by effectively integrating multiple information sources while maintaining practical efficiency. Specifically, CAFD is trained using a carefully selected set of informative features, including model-based signals derived from the DNN’s outputs, distance-based features, and a novel concept-based feature, called Concept Failure Ratio (CFR). CFR leverages Vision-Language Models (VLMs) to extract textual concepts from images and quantify the likelihood that their presence is associated with DNN failures. By incorporating this feature, CAFD benefits from complementary semantic information, enabling more effective fault detection. Our results demonstrate that CFR serves as an effective indicator for DNN fault detection. We conduct an extensive empirical evaluation of CAFD, comparing it against five state-of-the-art baselines across three subject DNN models and datasets, including ImageNet. Across a wide range of constrained selection budgets, CAFD consistently outperforms all baselines in Fault Detection Rate (FDR), achieving average FDR improvements of 18.3% across all investigated subjects and budget sizes.
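
A ratio of this kind can be illustrated with simple counting. The sketch below assumes the simplest possible definition, namely the fraction of images containing a concept on which the DNN failed; the paper's exact formula is not given in the abstract, and the concept sets, which would come from a VLM, are hard-coded here.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def concept_failure_ratio(samples: Iterable[Tuple[Set[str], bool]]) -> Dict[str, float]:
    """Map each concept to failures / occurrences over (concepts, dnn_failed) samples."""
    seen: Counter = Counter()
    failed: Counter = Counter()
    for concepts, dnn_failed in samples:
        for concept in set(concepts):
            seen[concept] += 1
            if dnn_failed:
                failed[concept] += 1
    return {concept: failed[concept] / seen[concept] for concept in seen}


# Hypothetical labelled history: concepts per image and whether the DNN failed on it.
history = [
    ({"fog", "car"}, True),
    ({"fog"}, True),
    ({"car"}, False),
    ({"car", "road"}, False),
]
cfr = concept_failure_ratio(history)
print(cfr["fog"], cfr["road"])
```

A per-image feature could then be formed by aggregating the ratios of the concepts found in that image, for example by taking their maximum; that aggregation step is also an assumption.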


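[246] Muon in Vision Transformers: Optimizer-Recipe Interactions and Gradient Spectra cs.LG | cs.CV

Ben S. Southworth, Shuai Jiang, Daniel McBride, Eric C. Cyr, Stephen Thomas

TL;DR: This paper studies the behavior of Muon, a matrix-aware optimizer, in vision transformer (ViT) training. Muon consistently outperforms AdamW on ImageNet-100 and Pl@ntNet-300K, with especially large gains on the long-tailed Pl@ntNet dataset. Its gains are closely tied to the data augmentation recipe: its gradient spectra span a broader basis in the attention projection layers, and a full augmentation recipe is essential for preventing spectral concentration and mode collapse in deep feedforward blocks.

Details

Motivation: Muon performs well in transformer training, but its behavior in vision transformers is not yet well understood. The paper investigates how Muon behaves in ViT training, in particular how it interacts with standard vision training recipes such as data augmentation.

Result: On ImageNet-100 and the long-tailed Pl@ntNet-300K, Muon consistently outperforms AdamW, with large gains in Pl@ntNet macro top-1 accuracy. When training ViTs for image segmentation and masked autoencoder models, Muon outperforms AdamW in all settings considered.

Insight: The paper shows that Muon in ViT training is best understood as an optimizer-recipe interaction: its performance depends heavily on data augmentation, and its clearest difference from AdamW is that gradients in the attention projection layers span a broader spectral basis. Understanding this interaction matters for designing efficient ViT training recipes, and a full training recipe is important for preventing degradation in deep blocks.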

Abstract: Muon is a recently developed matrix-aware optimizer that has shown strong results in transformer training, but its behavior in vision transformers (ViTs) is not yet well understood. We study Muon for ViT training, largely on ImageNet-100 and Pl@ntNet-300K, comparing against AdamW under standard vision recipes involving mixup, cutmix, smoothing, and random augmentation and erasing. Muon consistently outperforms AdamW, with especially large gains on long-tailed Pl@ntNet macro top-1. These gains are also recipe-dependent, where Muon benefits much more than AdamW from advanced and significant data augmentation techniques. To understand this interaction, we analyze the singular-value structure of matrix gradients throughout the ViT. Within Muon training runs, removing heavy data augmentation induces a late-training spectral concentration and mode collapse in gradient matrices, primarily in deep MLP-down blocks. Under a fixed “full” augmentation recipe, the clearest Muon-AdamW contrast appears instead in QKV gradients, where AdamW gradient energy remains concentrated in a much narrower basis while Muon spreads energy across substantially more singular modes. Muon in ViTs is therefore best understood as an optimizer-recipe interaction. Under a fixed recipe, Muon differs from AdamW most clearly in attention projections, where its gradients consist of a broader spectral basis. Within Muon, a full training recipe is important for preventing late spectral concentration and mode collapse in deep feedforward blocks. We further demonstrate efficacy in training ViTs on image segmentation and masked autoencoder models, where Muon outperforms AdamW in all settings considered.


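[247] PDEInvBench: A Comprehensive Dataset and Design Space Exploration of Neural Networks for PDE Inverse Problems cs.LG | cs.CV | physics.comp-ph

Divyam Goel, Nithin Chalapathi, Sanjeev Raja, Aditi S. Krishnapriyan

TL;DR: This paper introduces PDEInvBench, a comprehensive benchmark dataset for inverse problems in partial differential equations (PDEs), and uses it to systematically explore the design space of neural networks for such problems. It finds that two-stage training, using PDE derivatives as input features, and increasing the diversity of initial conditions in the training data are key to better performance.

Details

Motivation: Machine learning research on PDEs has focused mainly on the forward problem. Comprehensive benchmark datasets and systematic studies for PDE inverse problems (recovering physical parameters from solution fields) are lacking, and this work aims to fill that gap.

Result: Extensive experiments on PDEInvBench evaluate neural networks in in-distribution and various out-of-distribution settings and provide detailed analyses of optimization strategies, problem representations, and model and data scaling.

Insight: The main contributions are what the authors present as the first comprehensive benchmark dataset for PDE inverse problems and a systematic account of effective design principles for neural networks on these tasks, such as two-stage training and the use of PDE derivative information, which provide a useful reference for follow-up work.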

Abstract: Inverse problems in partial differential equations (PDEs) involve estimating the physical parameters of a system from observed spatiotemporal solution fields. Neural networks are well-suited for PDE parameter estimation due to their capability to model function-to-function space transformations. While existing benchmarks of machine learning methods for PDEs primarily focus on the forward problem, there are no similar comprehensive studies and benchmark datasets on PDE inverse problems, i.e., mapping solution fields to underlying physical parameters. We fill this gap by introducing PDEInvBench, a comprehensive benchmark dataset consisting of numerical simulations for both time-dependent and time-independent PDEs across a wide range of physical behaviors and parameters. Our dataset includes evaluation splits that assess performance in both in-distribution and various out-of-distribution settings. Using our benchmark dataset, we comprehensively explore the design space of neural networks for PDE inverse problems along three key dimensions: (1) optimization procedures, analyzing the role of supervised, self-supervised, and test-time training objectives on performance, (2) problem representations, where we study the value of architectural choices with different inductive biases and various conditioning strategies, and (3) scaling, which we perform with respect to both model and data size. Our experiments reveal several practical insights: 1) neural networks perform best with a two-stage training procedure: initial supervision with PDE parameters followed by test-time fine-tuning using the PDE residual, 2) incorporating PDE derivatives as input features consistently improves accuracy, and 3) increasing the diversity of initial conditions in the training data yields greater performance gains than expanding the range of PDE parameters. We make our dataset and codebase publicly available.


[248] Opportunistic Target Selection: Early Directional Commitment for Query-Efficient Black-Box Adversarial Attacks cs.LG | cs.CV

Florent Tariolle, Florian Yger

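TL;DR: This paper proposes Opportunistic Target Selection (OTS), a lightweight wrapper that improves the query efficiency of black-box adversarial attacks. It switches an untargeted attack to a targeted one early in the attack, locking onto the currently leading non-true class, which avoids class drift and reduces the number of queries. Experiments on ImageNet classifiers confirm that OTS helps random-search attacks, while its benefit is limited for gradient-estimation attacks and adversarially trained models.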

Details

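Motivation: Black-box adversarial attacks lose query efficiency to class drift: perturbations wander aimlessly through feature space and waste queries. The aim is to fix this without modifying the underlying attack architecture and without requiring gradient access or a priori knowledge of the target class.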

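Result: Across five standard ImageNet classifiers (4,500 runs), OTS closely tracks oracle performance on random-search attacks (e.g., SimBA, Square Attack), with gains of up to +27 percentage points in success rate and a 43% relative reduction in censored-mean iterations on ResNet-50. On gradient-estimation attacks (e.g., Bandits) it is redundant, and on adversarially trained models it brings no significant benefit.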

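Insight: The novelty lies in converting an untargeted attack into a targeted one through early directional commitment, which acts as a surrogate for margin loss and improves query efficiency. The method suits random-search settings, and its documented limitations under gradient estimation and adversarial training help inform the choice of attack strategy.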

Abstract: Black-box adversarial attacks that minimize only the ground-truth confidence suffer from class drift: perturbations wander through the feature space without committing to a specific adversarial class, wasting queries on diffuse, undirected progress. We introduce Opportunistic Target Selection (OTS), a lightweight wrapper that switches an untargeted attack to a targeted objective early in its trajectory, locking onto whichever non-true class currently leads. OTS requires no architectural modification to the underlying attack, no gradient access, and no a priori target-class knowledge. We validate OTS on three score-based attacks (SimBA, Square Attack with cross-entropy loss, and Bandits) across five standard ImageNet classifiers (4,500 runs). On random-search attacks, OTS closely tracks oracle performance, with gains up to +27 pp in success rate and 43% relative reduction in censored-mean iterations on ResNet-50. On gradient-estimation attacks (Bandits) and attacks with margin loss, OTS is redundant, a negative result that reinforces our interpretation of OTS as a margin-loss surrogate. On adversarially-trained models, a bimodal difficulty distribution eliminates the regime where targeting helps.
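The switching idea described in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the class name `OpportunisticTarget`, the fixed `commit_step` trigger, and the simple confidence-based losses are all hypothetical, since the abstract only says the switch happens "early" in the trajectory.

```python
def leading_non_true_class(scores, true_label):
    """Return the index of the highest-scoring class other than the true one."""
    candidates = [c for c in range(len(scores)) if c != true_label]
    return max(candidates, key=lambda c: scores[c])


class OpportunisticTarget:
    """Attack objective that starts untargeted and commits to a target once.

    `commit_step` is a hypothetical switch point chosen for illustration.
    The returned loss is something the attacker wants to minimize.
    """

    def __init__(self, true_label, commit_step):
        self.true_label = true_label
        self.commit_step = commit_step
        self.target = None  # no target until the commitment happens

    def loss(self, scores, step):
        # Commit exactly once: lock onto whichever non-true class leads now.
        if self.target is None and step >= self.commit_step:
            self.target = leading_non_true_class(scores, self.true_label)
        if self.target is None:
            # Untargeted phase: push down the ground-truth confidence.
            return scores[self.true_label]
        # Targeted phase: push up the confidence of the locked class.
        return -scores[self.target]
```

Because the objective is wrapped in a small object, an existing score-based attack loop would only need to call `loss(scores, step)` in place of its original objective, which matches the abstract's description of OTS as a wrapper that leaves the underlying attack unchanged.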


[249] AdvantageFlow: Advantage-Weighted Least Squares for RL in Flow Models cs.LG | cs.AI | cs.CV

Branislav Kveton, Anup Rao, Subhojyoti Mukherjee, Krishna Kumar Singh, Viet Dac Lai

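TL;DR: The paper proposes AdvantageFlow, a forward-process reinforcement learning algorithm for rectified flow models. It optimizes an advantage-weighted forward-process prediction loss and, on image generation tasks, outperforms both Flow-GRPO and a state-of-the-art forward-process RL baseline based on negative-aware fine-tuning.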

Details

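Motivation: Existing methods such as Flow-GRPO optimize the reverse process. This work instead optimizes the forward process to improve reinforcement learning for rectified flow models.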

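Result: Evaluated on image generation tasks with Stable Diffusion 3.5 Medium, AdvantageFlow outperforms both Flow-GRPO and a state-of-the-art forward-process RL baseline based on negative-aware fine-tuning.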

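Insight: The novelty lies in the advantage-weighted forward-process prediction loss, whose optimization is stabilized by rollout policy regularization. The regularization reduces variance and arises from fitting a local reward-improving target distribution, an idea that may be worth borrowing elsewhere.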

Abstract: We introduce AdvantageFlow, a forward-process reinforcement learning algorithm for rectified flow models. Unlike Flow-GRPO, which optimizes the reverse process, we optimize an advantage-weighted forward-process prediction loss. This optimization problem is unstable when advantages are negative and the loss becomes non-convex. We stabilize it by rollout policy regularization, which reduces variance and arises from fitting a local reward-improving target distribution. We evaluate AdvantageFlow on image generation tasks with Stable Diffusion 3.5 Medium. It outperforms both Flow-GRPO and a state-of-the-art forward-process RL baseline based on negative-aware fine-tuning.


eess.IV

[250] How Accurate are Video Quality Models for Diffusion-Based Video Super-Resolution? eess.IV | cs.CV

Benjamin Herb, Steve Göring, Alexander Raake, Rakesh Rao Ramachandra Rao

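TL;DR: This paper examines how accurately existing video quality models assess diffusion-based video super-resolution methods by comparing model predictions with the results of a subjective test. CNN-based full-reference models (such as LPIPS, DISTS, and CVQA-FR) correlate more strongly with the subjective ratings, but no model can fully replace subjective testing.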

Details

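Motivation: As diffusion-based video super-resolution methods show promise, it is necessary to check whether existing video quality models are suitable for assessing them, and therefore whether they can replace time-consuming subjective tests.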

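Result: Six upscaling methods (including Lanczos, Rhea, and SCST) were applied to compressed and uncompressed low-resolution videos and evaluated for playback on a UHD-1/4K screen. CNN-based full-reference models show significantly higher correlation than conventional full-reference and no-reference models, but none of the models is accurate enough to replace subjective testing.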

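Insight: The contribution is a systematic evaluation of how well video quality models apply to diffusion-based super-resolution methods. It shows the relative advantage of CNN-based full-reference models and the limitations of existing models, and it releases the videos, user ratings, and model scores as open data for future quality assessment research.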

Abstract: Recent video super-resolution (VSR) approaches use deep neural networks to enhance low-quality input videos and recover visual detail, with diffusion-based methods in particular showing promising results. In this paper, we investigate whether existing video quality models can be used to assess the performance of these diffusion-based VSR methods, by comparing model predictions with results from a subjective test. The study compares six upscaling methods (Lanczos, Rhea, SCST, DOVE, SeedVR2, Starlight Mini) applied to both compressed (AV1 and DCVC-RT) and uncompressed low-resolution videos considering the play-out on a UHD-1/4K screen. A range of full- and no-reference quality models are used to assess their applicability to this new type of quality degradation, focusing on within-sequence performance. The results highlight that CNN-based full-reference models, such as LPIPS, DISTS, and CVQA-FR show significantly higher correlation coefficients than both conventional full- as well as the tested no-reference models. Most overestimate the overly sharp results of SCST, with VMAF mainly failing due to spatial inconsistencies introduced by Starlight Mini. None of the tested video quality models reach sufficient accuracy so as to replace complementary subjective testing. The reference, degraded and upscaled videos, as well as the user ratings and model scores are made available with the paper at https://github.com/Telecommunication-Telemedia-Assessment/AVT-VQDB-UHD-1-VSR as open data.


cs.CY

[251] Catching The Correct Answer Trap: Characterising AI Tutor Blind Spots When Analysing Student Reasoning cs.CY | cs.AI | cs.CL

Moiz Imran, Sahan Bulathwela

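TL;DR: This paper studies a failure mode of intelligent tutoring systems when they assess student reasoning, the "correct answer trap": when a student reaches a correct answer through flawed reasoning, models struggle to detect the underlying misconception. An analysis of real student responses from the Eedi mathematics platform shows that 71% of these failures concentrate in two question types that share a common structure. A comparison of a fine-tuned T5 model with a frontier large language model shows that greater model capability reduces the problem but does not eliminate it.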

Details

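Motivation: The aim is to expose a key weakness of intelligent tutoring systems that give automated feedback: grading only on the correctness of the final answer hides misconceptions in the student's reasoning, so the feedback is not robust.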

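Result: On data from the Eedi mathematics platform, the comparison between a fine-tuned T5 and a frontier large language model yields detection accuracies of 84% vs 57% on the correct answer trap, with improved capabilities reducing but not eliminating the problem. Even the best-performing model produces roughly four false alarms for every genuine detection, which makes stand-alone screening impractical at realistic class sizes.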

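Insight: The novelty lies in identifying and quantifying the correct answer trap as a specific failure mode, and in showing that high overall accuracy can mask critical failures in reasoning assessment. The study underlines the need to keep human judgment in automated assessment and points toward models with a finer-grained understanding of complex reasoning.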

Abstract: Intelligent tutoring systems increasingly provide automated feedback on student work, but robust feedback requires assessing reasoning, not only final answers. We study a failure mode we call the correct answer trap (CAT): models under-detect misconceptions when students reach a correct answer via flawed reasoning. Analysing real student responses from the Eedi mathematics platform, we show that 71% of these failures concentrate in just two question types, both sharing a common structure where flawed reasoning happens to produce the correct numerical answer. Comparing a fine-tuned T5 with a frontier large language model, we find that improved capabilities reduce but do not eliminate the problem (84% vs 57% detection accuracy). Even the best-performing model generates roughly four false alarms for every genuine detection, making stand-alone screening impractical at realistic class sizes. Our findings demonstrate that high overall accuracy can mask critical failures in reasoning assessment, and that careful analysis of student reasoning still benefits from human judgment.
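The four-to-one false-alarm ratio quoted in the abstract translates directly into a precision figure and a review workload. The snippet below is only an arithmetic sketch, not anything taken from the paper: the function names are hypothetical, and the count of three genuine cases in the usage note is an invented illustration.

```python
def precision_from_false_alarm_ratio(false_alarms_per_hit):
    """Precision implied by a given number of false alarms per genuine detection."""
    return 1.0 / (1.0 + false_alarms_per_hit)


def flags_to_review(genuine_detections, false_alarms_per_hit):
    """Total flagged responses a teacher would need to check by hand."""
    return genuine_detections * (1 + false_alarms_per_hit)
```

With four false alarms per genuine detection, `precision_from_false_alarm_ratio(4)` is about 0.2, so only about one flag in five is real. If a class hypothetically contained three genuine cases that the model caught, `flags_to_review(3, 4)` gives 15 flagged responses to inspect.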


cs.GR

[252] Snapshot Polarimetric Display Inverse Rendering cs.GR | cs.CV

Seokjun Choi, Yunseong Moon, Kaizhang Kang, Hoon-Gyu Chung, Jin-Nyeong Kim

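TL;DR: This paper proposes polarimetric display inverse rendering. An LCD projects a linearly polarized RGB binary pattern, and an RGB polarization camera fitted with a quarter-wave plate acquires spectro-polarimetric measurements in a single shot. A feed-forward transformer maps these measurements to per-pixel normal, albedo, roughness, and metallicity. Evaluations on a real desktop setup show accurate inverse rendering across diverse scenes, outperforming existing approaches.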

Details

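Motivation: Inverse rendering is a core challenge in graphics and vision, and it is especially hard in the snapshot configurations required for lightweight desktop workflows, where the per-frame information budget is highly constrained.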

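Result: Evaluations on a real desktop setup show that the method achieves accurate inverse rendering across diverse scenes and outperforms existing approaches.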

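Insight: The novelty lies in combining a polarimetric display (an LCD projecting polarized patterns) with a spectro-polarimetric camera for single-shot measurement, and in using a feed-forward transformer for end-to-end property estimation. To overcome the scarcity of training data, a limited set of measured polarimetric bidirectional reflectance distribution functions is expanded via a generative manifold.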

Abstract: Inverse rendering remains a core challenge in graphics and vision, especially in the snapshot configurations required for lightweight desktop workflows, where the per-frame information budget is highly constrained. Previous inverse rendering work explores various available dimensions for enriching the per-shot information, including temporal modulation, spectral encoding, and polarization. In this work, we introduce polarimetric display inverse rendering, using an LCD to project a linearly polarized RGB binary pattern and an RGB polarization camera augmented with a quarter-wave plate to acquire spectro-polarimetric measurements in a single shot. A feed-forward transformer maps these measurements to per-pixel normal, albedo, roughness, and metallicity. To overcome training data scarcity, we expand a limited set of measured polarimetric bidirectional reflectance distribution functions via a generative manifold. Evaluations on a real desktop setup demonstrate accurate inverse rendering across diverse scenes, outperforming existing approaches.


cs.RO

[253] HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos cs.RO | cs.AI | cs.CV | cs.LG

Zhi Wang, Botao He, Kelin Yu, Seungjae Lee

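TL;DR: HumanEgo is a framework for zero-shot learning of robot manipulation skills from human egocentric video. It bridges the embodiment gap between human and robot by lifting each human demonstration to an entity-level representation of hand-object interaction and training a flow matching policy with dense auxiliary objectives. The framework needs no robot data: with only 30 minutes of human video per task it achieves high success rates on real-world tasks and transfers zero-shot across robots, cameras, and environments.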

Details

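Motivation: Human egocentric video contains rich manipulation demonstrations, but transferring these skills to robots remains challenging because of the embodiment gap between human and robot in both visual appearance and kinematics.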

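Result: With only 30 minutes of human video per task, HumanEgo achieves 92.5% average success across four real-world tasks (75% with just 15 minutes), outperforms matched-time robot teleoperation by 41%, and transfers robustly zero-shot to novel robots, cameras, and environments.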

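Insight: The core novelty is bridging the embodiment gap with an entity-level representation and a flow matching policy trained with dense auxiliary objectives. This yields zero-shot skill transfer that is data-efficient, hardware-agnostic, and free of robot data, and it suggests a route to learning general robot skills from abundant human video.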

Abstract: Human egocentric video captures rich manipulation demonstrations without any robot hardware, yet transferring these skills to robots remains challenging due to the embodiment gap between human and robot in both visual appearance and kinematics. We present HumanEgo, a framework that bridges the embodiment gap by lifting each human demonstration to an entity-level representation of hand-object interaction, and training a flow matching policy with dense auxiliary objectives that amplify supervision from every trajectory. HumanEgo is robot-data-free, hardware-agnostic, data-efficient, and zero-shot human-to-robot transferable. With only 30 minutes of human videos per task, HumanEgo achieves 92.5% average success across four real-world tasks (75% with just 15 minutes), outperforms matched-time robot teleoperation by 41%, and robustly transfers zero-shot across novel robots, cameras, and environments.


[254] RepSAM: Bridging Foundation Models to Robotic Vision via Representation-Guided Adaptation cs.RO | cs.CV

Wenhui Chu

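TL;DR: This paper proposes RepSAM, a representation-guided parameter-efficient fine-tuning framework for adapting foundation models such as SAM to robotic vision tasks. Based on an analysis of non-uniform representation shifts across transformer layers, it introduces a theoretically grounded CKA-guided rank allocation strategy and combines it with a multi-modal fusion module to handle challenging robotic scenarios such as transparent objects and cluttered scenes.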

Details

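Motivation: Despite the zero-shot capabilities of foundation models such as SAM, robotic perception in unstructured environments remains unsatisfactory. The paper attributes this to non-uniform representation shifts across transformer layers: shallow layers exhibit substantial domain gaps, whereas deep-layer representations transfer effectively.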

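Result: Across six benchmarks and robotic manipulation tasks, RepSAM achieves 97.9% of full fine-tuning performance (89.0% vs. 90.9% mIoU) while reducing trainable parameters by 158x (from 632M to 4.0M). It trains in just 4 hours on a single A100 GPU (a 96x reduction from full fine-tuning), outperforms DoRA by 7.9% mIoU, and improves robotic manipulation success rates by 12.0% absolute over the LoRA (RGB) baseline. The improvements are statistically significant (p < 0.01).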

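Insight: The novelty lies in showing that representation transfer across transformer layers is non-uniform when adapting foundation models, and in deriving from this a theoretically grounded CKA-guided rank allocation strategy that is parameter-efficient while approaching full fine-tuning performance. The multi-modal fusion module further improves robustness in challenging robotic scenarios such as transparent objects.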

Abstract: Robotic perception in unstructured environments remains challenging despite the zero-shot capabilities of foundation models such as SAM. This work attributes performance degradation to non-uniform representation shifts across transformer layers: shallow layers exhibit substantial domain gaps (CKA < 0.5), whereas deep layers transfer effectively (CKA > 0.7). Based on this observation, we propose RepSAM, a representation-guided parameter-efficient fine-tuning (PEFT) framework for adapting foundation models to robotic vision. RepSAM employs a theoretically grounded CKA-guided rank allocation strategy combined with a multi-modal fusion module for robust handling of challenging robotic scenarios, including transparent objects and cluttered scenes. Experimental evaluation across six benchmarks and robotic manipulation tasks demonstrates that RepSAM achieves 97.9% of full fine-tuning performance (89.0% vs. 90.9% mIoU) while reducing trainable parameters by 158x (from 632M to 4.0M). RepSAM outperforms DoRA by 7.9% mIoU with just 4 hours of training on a single A100 GPU (a 96x reduction from full fine-tuning, which takes 384 GPU-hours). These improvements are statistically significant (p < 0.01) and translate to a 12.0% absolute improvement in robotic manipulation success rates over the LoRA (RGB) baseline.
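The CKA thresholds quoted in the abstract (below 0.5 for shallow layers, above 0.7 for deep layers) refer to centered kernel alignment between layer representations. The sketch below computes the linear variant of CKA in plain Python, assuming each representation is a list of per-example feature vectors. Whether RepSAM uses the linear or a kernel variant is an assumption here, and how the scores are turned into per-layer ranks is not covered because the abstract does not specify it.

```python
import math


def center_columns(X):
    """Subtract each feature's mean so every column sums to zero."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]


def cross_frobenius_sq(X, Y):
    """Squared Frobenius norm of X^T Y for row-major matrices X and Y."""
    total = 0.0
    for x_col in zip(*X):
        for y_col in zip(*Y):
            dot = sum(a * b for a, b in zip(x_col, y_col))
            total += dot * dot
    return total


def linear_cka(X, Y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_frobenius_sq(Xc, Yc)
    denominator = math.sqrt(cross_frobenius_sq(Xc, Xc) * cross_frobenius_sq(Yc, Yc))
    return numerator / denominator
```

In a layer-wise analysis, `X` would hold one layer's activations on source-domain inputs and `Y` the same layer's activations on matched target-domain inputs, so that a low score flags a layer with a large domain gap. That pairing is an illustrative reading of the abstract rather than a documented protocol.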


[255] AnyScene: Towards Highly Controllable Driving Scene Generation at Anywhere and Beyond cs.RO | cs.CV

Haiming Zhang, Junfei Zhou, Feng Jiang, Jingzhong Li, Zhenglong Guo

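TL;DR: This paper proposes AnyScene, a unified occupancy-centric framework for driving scene generation. A Spatial-Temporal Occupancy Diffusion Transformer generates semantic occupancy sequences from BEV layouts, and a Geometry-Grounded View Expansion module synthesizes multi-view driving videos, giving high controllability from arbitrary BEV layouts and support for long-horizon generation.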

Details

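Motivation: Existing occupancy-guided methods typically rely on shallow conditioning mechanisms and reference-frame-dependent video synthesis, which limits fine-grained control from arbitrary BEV layouts and restricts their use in scalable simulation.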

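Result: Extensive experiments show that AnyScene achieves state-of-the-art performance in both occupancy and video generation, generalizes strongly to unseen and customized layouts, and provides measurable benefits for downstream tasks such as sparse-view 3D reconstruction.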

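Insight: The novelty lies in a unified occupancy-centric framework that achieves precise control by jointly tokenizing BEV and occupancy features in an autoregressive manner, and that treats occupancy as the canonical spatial representation for reference-free, autoregressive multi-view video synthesis with flexible camera configurations at inference time.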

Abstract: Generating high-fidelity and controllable synthetic data is critical for advancing end-to-end autonomous driving, particularly for addressing the long tail of rare safety-critical scenarios. Existing occupancy-guided methods typically rely on shallow conditioning mechanisms and reference-frame-dependent video synthesis, which limits fine-grained controllability from arbitrary BEV layouts and restricts their applicability for scalable simulation. In this paper, we propose AnyScene, a unified occupancy-centric framework for driving scene generation. AnyScene generates semantic occupancy sequences from BEV layouts through a Spatial-Temporal Occupancy Diffusion Transformer that jointly tokenizes BEV and occupancy features in an autoregressive manner. This design enables precise controllability from cross-dataset and user-defined BEV inputs while naturally supporting long-horizon generation. Building upon the generated occupancy, a Geometry-Grounded View Expansion module treats occupancy as the canonical spatial representation and synthesizes temporally consistent multi-view driving videos in a reference-free and autoregressive fashion, supporting flexible camera configurations at inference time. Extensive experiments demonstrate that AnyScene achieves state-of-the-art performance in both occupancy and video generation. It exhibits strong generalization to unseen and customized layouts, and provides measurable benefits for downstream tasks such as sparse-view 3D reconstruction.


cs.IR

[256] The Multilingual Curse at the Retrieval Layer: Evidence from Amharic cs.IR | cs.CL | cs.LG

Yosef Worku Alemneh, Kidist Amde Mekonnen, Maarten de Rijke

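TL;DR: This paper examines the limits of multilingual retrieval for low-resource, morphologically rich languages, using Amharic as a case study. Zero-shot multilingual retrievers perform markedly worse on Amharic than monolingual retrievers, and even multilingual models fine-tuned on Amharic do not surpass the best monolingual model. Zero-shot performance on multilingual benchmarks is therefore not enough to judge retrieval quality for low-resource languages, which require in-language evaluation and adaptation.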

Details

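Motivation: Strong zero-shot scores of current multilingual retrieval models on benchmarks are often taken as evidence of reliable transfer across all languages. The authors argue that this assumption does not hold for low-resource, morphologically rich languages such as Amharic and needs empirical testing.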

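Result: Under a shared passage retrieval protocol, the strongest zero-shot multilingual retriever trails the strongest monolingual Amharic first-stage retriever by 23% relative MRR@10. Fine-tuning two multilingual embedding models on Amharic yields 32-60% relative MRR@10 gains over zero-shot, yet the best fine-tuned model still remains below the strongest monolingual retriever.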

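Insight: The paper exposes the "curse" of multilingual retrieval models on low-resource languages: zero-shot performance does not reflect real effectiveness. Retrieval should be evaluated and adapted in the specific language instead of being inferred from aggregate multilingual benchmarks. This matters for equitable information access in the LLM era, and the authors release the dataset, codebase, and models to support further research.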

Abstract: Multilingual retrieval increasingly underpins cross-lingual question answering and retrieval-augmented generation. Strong zero-shot scores on multilingual benchmarks are often taken as evidence that current encoders transfer reliably across many languages. We argue that this assumption breaks down for underrepresented, morphologically rich languages, and use Amharic as a diagnostic case. Under a shared passage retrieval protocol covering dense, late-interaction, learned sparse, and cross-encoder paradigms, we compare zero-shot multilingual retrievers, Amharic-fine-tuned multilingual retrievers, and monolingual Amharic retrievers. The strongest zero-shot multilingual retriever underperforms the strongest monolingual Amharic first-stage retriever by 23% relative MRR@10. Fine-tuning two recent multilingual embedding models on the same Amharic supervision yields 32-60% relative MRR@10 gains over zero-shot, but the best Amharic-fine-tuned multilingual model remains below the strongest monolingual Amharic retriever. These findings indicate that zero-shot multilingual retrieval is not a sufficient proxy for equitable information access in the LLM era: for underrepresented languages, retrieval must be evaluated and adapted in-language rather than inferred from aggregate multilingual benchmarks. To foster future research, we publicly release the dataset, codebase, and trained models at https://github.com/rasyosef/amharic-neural-ir.
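The comparisons above are stated in MRR@10 and in relative differences between systems. The sketch below shows how both quantities are commonly computed. It is a generic illustration, not the authors' evaluation code: the function names are hypothetical, and reading "underperforms by 23% relative" as a gap measured against the stronger system's score is an assumption.

```python
def mrr_at_k(ranked_lists, relevant_sets, k=10):
    """Mean reciprocal rank of the first relevant passage within the top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, passage_id in enumerate(ranked[:k], start=1):
            if passage_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)


def relative_gap(reference_score, other_score):
    """Fraction by which other_score falls short of reference_score."""
    return (reference_score - other_score) / reference_score
```

For example, two queries whose first relevant passage appears at rank 2 and not at all within the top 10 give an MRR@10 of (1/2 + 0) / 2 = 0.25.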


[257] Your Embedding Model is SMARTer Than You Think cs.IR | cs.AI | cs.CV

Jianrui Zhang, Hyun Jung Lee, Sukanta Ganguly, Tae-Eui Kam, Donghyun Kim

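TL;DR: This paper proposes SMART, a framework that unlocks the latent multi-vector capabilities of standard single-vector retrieval models by applying direct late interaction over their hidden states, improving multimodal retrieval. It works as a plug-and-play upgrade that consistently improves state-of-the-art models on benchmarks such as MMEB-V2 and, in visual document retrieval, surpasses multi-vector models.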

Details

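Motivation: Single-vector retrievers discard fine-grained local evidence, while multi-vector approaches require training and often neglect a global representation. The goal is a method that strengthens the retrieval ability of existing models without training.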

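Result: On the MMEB-V2 benchmark, SMART consistently improves retrieval across diverse modalities, including for state-of-the-art models. In visual document retrieval, lightweight post-training lets a single-vector model outperform state-of-the-art multi-vector models.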

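Insight: The novelty lies in the finding that standard contrastive training implicitly shapes the retrieval geometry of hidden states through gradient flow, which makes direct late interaction over frozen hidden states possible. The result is both an efficient inference-time enhancement that needs no retraining and a lightweight fine-tuning technique.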

Abstract: Multimodal retrieval relies heavily on single-vector retrievers, which compress rich, sequential token sequences into one single global representation. While efficient, they discard fine-grained, local evidence critical for dense retrieval tasks. Multi-vector approaches were introduced as a solution, but they strictly require training and many ignore the necessity of a globally summarizing representation. To address this, we introduce SMART, a framework that unlocks the latent multi-vector capabilities of standard single-vector models. We first demonstrate that standard contrastive training on the pooled embedding implicitly shapes the retrieval geometry of preceding hidden states via gradient flow. By applying direct late-interaction over these frozen hidden states during inference, SMART acts as a plug-and-play upgrade that consistently improves performance across diverse modalities, improving even the state-of-the-art models further on MMEB-V2. We also reveal SMART’s superior performance, as simple lightweight post-training not only saves time and compute, but also brings forth further improvement on Visual Document retrieval, allowing a single-vector model to outperform SoTA multi-vector counterparts. Ultimately, SMART offers both a highly efficient inference enhancement and a powerful finetuning technique for multimodal retrieval. We open source our code and weights at https://github.com/HanSolo9682/SMART.
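Late interaction over hidden states, as mentioned in the abstract, can be illustrated with the MaxSim scoring rule commonly used by multi-vector retrievers. The sketch below is an assumption-laden illustration, not SMART's actual scoring function: the abstract does not say which similarity or aggregation is used, nor how the score is combined with the pooled embedding, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def maxsim_score(query_states, doc_states):
    """Sum, over query token states, of the best-matching document token state."""
    return sum(
        max(cosine(q, d) for d in doc_states)
        for q in query_states
    )
```

Here `query_states` and `doc_states` stand for per-token hidden states taken from a frozen encoder. A document that matches every query token well scores higher than one that matches only some of them, which is the fine-grained local evidence a single pooled vector tends to lose.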