Table of Contents
- cs.CL [Total: 58]
- cs.CV [Total: 179]
- cs.RO [Total: 7]
- eess.IV [Total: 1]
- cs.GR [Total: 1]
- cs.AI [Total: 19]
- cs.HC [Total: 2]
- cs.SD [Total: 2]
- cs.LG [Total: 14]
- cs.SE [Total: 1]
- cs.NI [Total: 1]
- cs.IR [Total: 2]
- cs.NE [Total: 1]
cs.CL
[1] Self-Calibrating Language Models via Test-Time Discriminative Distillation cs.CL | cs.LG
Mohamed Rissal Hedna, Jan Strich, Martin Semmann, Chris Biemann
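TL;DR: This paper proposes SECL (Self-Calibrating Language Models), a test-time training method that targets the systematic overconfidence of large language models (LLMs). It uses the model's own probability of the "True" token on the discriminative question "Is this answer correct?" as a label-free self-supervision signal and adapts at test time as the input distribution shifts, calibrating the model at low cost without labeled data or human supervision.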
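Motivation: LLMs are systematically overconfident: they express high confidence even on questions they often answer incorrectly. Existing calibration methods require labeled validation data, degrade under distribution shift, or are expensive at inference time. The model's internal "True" probability for judging whether an answer is correct is a more accurate signal than its verbalized confidence, which makes unsupervised calibration possible.
Result: Across four small language models from three model families and four domains, SECL reduces Expected Calibration Error (ECE) by 56% to 78%, outperforming its own supervision signal and matching or exceeding recent inference-time methods. It adapts only when the input distribution shifts, training on just 6-26% of the question stream, at lower cost than the baseline it distills from.
Insight: The core novelty is applying the test-time training (TTT) paradigm to language model calibration for the first time, using the theoretically grounded gap between the model's internal discriminative signal (P(True)) and its generative confidence as free self-supervision. The result is online calibration that needs no external labels, is robust to distribution shift, and is cheap. Seven ablations confirm that each component is crucial and robust across configurations.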
Abstract: Large language models (LLMs) are systematically overconfident: they routinely express high certainty on questions they often answer incorrectly. Existing calibration methods either require labeled validation data, degrade under distribution shifts, or incur substantial inference costs. Recent work has shown that LLMs already contain a better-calibrated signal than the one they verbalize: the token probability of “True” when the model is asked “Is this answer correct?” ($P(\text{True})$) consistently outperforms their stated confidence, a gap that is theoretically grounded as generative error is lower-bounded by roughly twice the corresponding discriminative error. We introduce $\textbf{SECL}$ ($\textbf{SE}$lf-$\textbf{C}$alibrating $\textbf{L}$anguage Models), a test-time training (TTT) pipeline that exploits this gap as label-free self-supervision, requiring no labeled data or human supervision. SECL adapts only when the input distribution shifts, training on just 6–26% of the question stream at lower cost than the baseline it distills from. Across four small language models from three model families and four diverse domains, SECL reduces Expected Calibration Error (ECE) by 56–78%, outperforming its own supervision signal and matching or outperforming recent inference-time methods. SECL is the first method to apply TTT to calibration; seven ablations covering signal quality, gating strategy, weight accumulation, loss design, domain ordering, hyperparameter sensitivity, and layer selection confirm that each component is crucial and robust across configurations. Code: https://anonymous.4open.science/r/secl-emnlp26-submission-C890
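The ECE metric reported above can be illustrated with a minimal sketch. It assumes the common equal-width binning variant with 10 bins; the abstract does not state which binning SECL uses, and the function name `expected_calibration_error` is chosen here for illustration only.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: the sample-weighted mean of
    |accuracy - mean confidence| over non-empty confidence bins.
    The binning scheme is an assumption, not taken from the paper."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# An overconfident model: 90% stated confidence, 25% actual accuracy.
overconfident = expected_calibration_error([0.9] * 4, [True, False, False, False])
```

In a P(True)-style setup, `confidences` would hold the probability the model assigns to the "True" token for each answer, and `correct` the ground-truth correctness used only for evaluation.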
[2] GIANTS: Generative Insight Anticipation from Scientific Literature cs.CL | cs.AI
Joy He-Yueya, Anikait Singh, Ge Gao, Michael Y. Li, Sherry Yang
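TL;DR: The paper introduces the task of insight anticipation, predicting a downstream paper's core insight from its upstream "parent papers", and builds GiantsBench, a cross-disciplinary benchmark of 17k examples, to evaluate it. The authors train GIANTS-4B with reinforcement learning to optimize for this task. It beats larger proprietary models on similarity score and generalizes to unseen domains; human evaluators rate its insights as conceptually clearer, and a third-party judge model predicts that they have higher citation potential.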
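Motivation: Scientific breakthroughs often come from synthesizing prior ideas, yet the ability of language models to perform this targeted, literature-grounded synthesis remains underexplored. The paper uses the insight anticipation task to evaluate and improve how well models predict new scientific insights from existing literature.
Result: On GiantsBench, GIANTS-4B achieves a 34% relative improvement in similarity score over gemini-3-pro and outperforms the other proprietary baselines. Human evaluation finds its insights conceptually clearer than those of the base model. The third-party model SciJudge-30B predicts that GIANTS-4B's insights are more likely to lead to higher citations, preferring them over the base model in 68% of pairwise comparisons.
Insight: The core contribution is defining insight anticipation as a new generation task with its own benchmark, and showing that reinforcement learning with similarity scores as a proxy reward can train an efficient open-source model that outperforms larger proprietary models at predicting scientific insights. The work highlights the importance of task definition and targeted training for improving models' scientific reasoning.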
Abstract: Scientific breakthroughs often emerge from synthesizing prior ideas into novel contributions. While language models (LMs) show promise in scientific discovery, their ability to perform this targeted, literature-grounded synthesis remains underexplored. We introduce insight anticipation, a generation task in which a model predicts a downstream paper’s core insight from its foundational parent papers. To evaluate this capability, we develop GiantsBench, a benchmark of 17k examples across eight scientific domains, where each example consists of a set of parent papers paired with the core insight of a downstream paper. We evaluate models using an LM judge that scores similarity between generated and ground-truth insights, and show that these similarity scores correlate with expert human ratings. Finally, we present GIANTS-4B, an LM trained via reinforcement learning (RL) to optimize insight anticipation using these similarity scores as a proxy reward. Despite its smaller open-source architecture, GIANTS-4B outperforms proprietary baselines and generalizes to unseen domains, achieving a 34% relative improvement in similarity score over gemini-3-pro. Human evaluations further show that GIANTS-4B produces insights that are more conceptually clear than those of the base model. In addition, SciJudge-30B, a third-party model trained to compare research abstracts by likely citation impact, predicts that insights generated by GIANTS-4B are more likely to lead to higher citations, preferring them over the base model in 68% of pairwise comparisons. We release our code, benchmark, and model to support future research in automated scientific discovery.
[3] Should We be Pedantic About Reasoning Errors in Machine Translation? cs.CL | cs.AI
Calvin Bao, Marine Carpuat
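TL;DR: The paper studies reasoning errors in machine translation. It uses an automated annotation protocol to identify three categories of reasoning error across several language pairs and corrects the reasoning traces with weak-to-strong interventions. Small corrections have little effect on translation quality; stronger interventions yield higher resolution rates but mixed quality gains. Removing reasoning errors does little to resolve the initial errors, which points to limited reasoning faithfulness in machine translation.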
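Motivation: The goal is to quantify how often reasoning errors occur in machine translation and to test whether correcting them improves translation quality, thereby assessing how faithful reasoning is in machine translation.
Result: In experiments translating from English into Spanish, French, German, Mandarin, Japanese, Urdu, and Cantonese, reasoning errors can be identified with high precision in Urdu but with lower precision in Spanish. In the intervention experiments, small corrections have little impact on translation quality, while strong interventions raise resolution rates but produce mixed quality gains.
Insight: The contributions are an automated annotation protocol for classifying and quantifying reasoning errors, and a set of weak-to-strong interventions for evaluating the effect of correcting them. The study shows that reasoning errors occur across language pairs in machine translation and that correcting them has limited effect, which offers a new perspective on improving the reasoning faithfulness of translation models.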
Abstract: Across multiple language pairings (English $\to$ {Spanish, French, German, Mandarin, Japanese, Urdu, Cantonese}), we find reasoning errors in translation. To quantify how often these reasoning errors occur, we leverage an automated annotation protocol for reasoning evaluation wherein the goal is to detect if a reasoning step is any of three error categories: (1) source sentence-misaligned, (2) model hypothesis-misaligned, or (3) reasoning trace-misaligned. We probe the reasoning model with perturbed traces correcting for these identified reasoning errors using an array of weak-to-strong interventions: hedging, removal, re-reasoning after removal, hindsight, and oracle interventions. Experimenting with interventions on the reasoning traces suggests that small corrections to the reasoning have little impact on translation quality, but stronger interventions yield the highest resolution rates, despite translation quality gains being mixed. We find ultimately that reasoning errors in MT can be identified with high precision in Urdu but lower precision in Spanish, but that removing these reasoning errors does not resolve the initial errors significantly, suggesting limited reasoning faithfulness for machine translation.
[4] CoSToM: Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models cs.CL | cs.AI
Mengfan Li, Xuanhua Shi, Yang Deng
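TL;DR: The paper proposes CoSToM, a framework that uses causal tracing to locate the internal layers of a large language model that encode Theory-of-Mind features and then intervenes with lightweight activation steering, aiming to improve the model's intrinsic Theory-of-Mind cognition and its generalization to complex social reasoning tasks.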
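Motivation: LLMs perform well on standard Theory-of-Mind benchmarks but generalize poorly to complex task scenarios and rely heavily on prompt scaffolding. The paper asks whether models truly possess intrinsic cognition and can externalize it as stable, high-quality behavior.
Result: Experiments show that CoSToM significantly enhances human-like social reasoning and downstream dialogue quality. The abstract does not name specific benchmarks or state whether the results are state of the art.
Insight: The novelty lies in moving from mechanistic interpretation to active intervention: causal tracing localizes the internal features, and lightweight activation steering on the critical layers aligns the model's internal cognition with its external behavior.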
Abstract: Theory of Mind (ToM), the ability to attribute mental states to others, is a hallmark of social intelligence. While large language models (LLMs) demonstrate promising performance on standard ToM benchmarks, we observe that they often fail to generalize to complex task-specific scenarios, relying heavily on prompt scaffolding to mimic reasoning. The critical misalignment between the internal knowledge and external behavior raises a fundamental question: Do LLMs truly possess intrinsic cognition, and can they externalize this internal knowledge into stable, high-quality behaviors? To answer this, we introduce CoSToM (Causal-oriented Steering for ToM alignment), a framework that transitions from mechanistic interpretation to active intervention. First, we employ causal tracing to map the internal distribution of ToM features, empirically uncovering the internal layers’ characteristics in encoding fundamental ToM semantics. Building on this insight, we implement a lightweight alignment framework via targeted activation steering within these ToM-critical layers. Experiments demonstrate that CoSToM significantly enhances human-like social reasoning capabilities and downstream dialogue quality.
[5] ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models cs.CL | cs.AI | cs.SD | eess.AS
Chi-Yuan Hsiao, Ke-Han Lu, Yu-Kuan Fu, Guan-Ting Lin, Hsiao-Tsung Hung
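TL;DR: The paper proposes ASPIRin, a framework that uses Action Space Projection to map the text vocabulary to a binary state (speaking vs. silent), decoupling when to speak from what to say. It optimizes the interactivity of full-duplex speech language models with rule-based rewards and GRPO, substantially reducing repetitive generation while preserving semantic quality.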
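Motivation: Optimizing the temporal dynamics of full-duplex speech language models with standard raw-token reinforcement learning degrades semantic quality and causes generative collapse and repetition, so the conflict between optimizing interaction and preserving semantics needs to be resolved.
Result: Empirical evaluations show that ASPIRin improves interactivity in turn-taking, backchanneling, and pause handling, and reduces the proportion of duplicate n-grams by over 50% compared to standard GRPO, effectively eliminating degenerative repetition.
Insight: The novelty is explicitly decoupling speech timing from content through Action Space Projection and combining it with rule-based reward design, which balances interactivity metrics against semantic coherence during reinforcement learning and avoids degraded generation quality.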
Abstract: End-to-end full-duplex Speech Language Models (SLMs) require precise turn-taking for natural interaction. However, optimizing temporal dynamics via standard raw-token reinforcement learning (RL) degrades semantic quality, causing severe generative collapse and repetition. We propose ASPIRin, an interactivity-optimized RL framework that explicitly decouples when to speak from what to say. Using Action Space Projection, ASPIRin maps the text vocabulary into a coarse-grained binary state (active speech vs. inactive silence). By applying Group Relative Policy Optimization (GRPO) with rule-based rewards, it balances user interruption and response latency. Empirical evaluations show ASPIRin optimizes interactivity across turn-taking, backchanneling, and pause handling. Crucially, isolating timing from token selection preserves semantic coherence and reduces the portion of duplicate n-grams by over 50% compared to standard GRPO, effectively eliminating degenerative repetition.
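One plausible reading of the Action Space Projection step is sketched below: the next-token distribution over the text vocabulary is collapsed into a two-way distribution by summing probability mass over a set of "silence" token ids. The abstract only says the vocabulary is mapped to a binary state, so the summation rule, the function name `project_to_binary`, and the token ids are all assumptions made for illustration.

```python
from typing import Dict, Mapping, Set


def project_to_binary(
    token_probs: Mapping[int, float],
    silence_ids: Set[int],
) -> Dict[str, float]:
    """Collapse a token-level distribution into P(silent) and P(active).

    token_probs maps token id -> probability (need not be normalized).
    silence_ids is a hypothetical set of ids that mean "stay silent".
    """
    total = sum(token_probs.values())
    if total <= 0:
        raise ValueError("token_probs must carry positive probability mass")
    silent_mass = sum(p for tok, p in token_probs.items() if tok in silence_ids)
    p_silent = silent_mass / total
    return {"silent": p_silent, "active": 1.0 - p_silent}


# Token 0 plays the role of a silence/pad token in this toy vocabulary.
state = project_to_binary({0: 0.5, 1: 0.25, 2: 0.25}, silence_ids={0})
```

Under this reading, a policy-gradient objective would act on the two-way `state` (when to speak) and leave the choice among speech tokens (what to say) untouched.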
[6] Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty cs.CL
Chao Xue, Yao Wang, Mengqiao Liu, Di Liang, Xingsheng Han
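TL;DR: The paper proposes E-GRM, an efficient generative reward modeling framework that uses model-internal uncertainty to trigger chain-of-thought reasoning selectively, lowering computational cost, and introduces a lightweight discriminative scorer for fine-grained evaluation of reasoning paths. It reduces cost and improves accuracy on multiple reasoning benchmarks.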
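Motivation: Existing generative reward models apply chain-of-thought prompting to every input indiscriminately, which incurs unnecessary computation, and their voting-based evaluation lacks fine-grained assessment of reasoning quality.
Result: Experiments on multiple reasoning benchmarks show that E-GRM substantially reduces inference cost while consistently improving answer accuracy.
Insight: The core novelty is estimating uncertainty from the convergence behavior of parallel model generations so that chain-of-thought reasoning is triggered only when needed, an efficient mechanism that requires no handcrafted features or task-dependent signals. In addition, a lightweight discriminative scorer trained with a hybrid regression-ranking objective improves the fidelity of reward evaluation.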
Abstract: Recent advancements in the Generative Reward Model (GRM) have demonstrated its potential to enhance the reasoning abilities of LLMs through Chain-of-Thought (CoT) prompting. Despite these gains, existing implementations of GRM suffer from two critical limitations. First, CoT prompting is applied indiscriminately to all inputs regardless of their inherent complexity. This introduces unnecessary computational costs for tasks amenable to fast, direct inference. Second, existing approaches primarily rely on voting-based mechanisms to evaluate CoT outputs, which often lack granularity and precision in assessing reasoning quality. In this paper, we propose E-GRM, an efficient generative reward modeling framework grounded in model-internal uncertainty. E-GRM leverages the convergence behavior of parallel model generations to estimate uncertainty and selectively trigger CoT reasoning only when needed, without relying on handcrafted features or task-dependent signals. To improve reward fidelity, we introduce a lightweight discriminative scorer trained with a hybrid regression–ranking objective to provide fine-grained evaluation of reasoning paths. Experiments on multiple reasoning benchmarks show that E-GRM substantially reduces inference cost while consistently improving answer accuracy, demonstrating that model-internal uncertainty is an effective and general signal for efficient reasoning-aware reward modeling.
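The idea of triggering CoT only when parallel generations fail to converge can be sketched as a simple agreement gate. This is an illustrative stand-in, not E-GRM's actual estimator: the abstract does not define how convergence is measured, so majority agreement, the threshold of 0.75, and the names `agreement` and `needs_cot` are assumptions.

```python
from collections import Counter
from typing import Sequence


def agreement(answers: Sequence[str]) -> float:
    """Fraction of parallel generations that match the most common answer."""
    if not answers:
        raise ValueError("need at least one generation")
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


def needs_cot(answers: Sequence[str], threshold: float = 0.75) -> bool:
    """Trigger chain-of-thought only when the direct answers disagree.

    Low agreement is treated as high model-internal uncertainty;
    the threshold is a hypothetical hyperparameter.
    """
    return agreement(answers) < threshold


easy = needs_cot(["A", "A", "A", "A"])   # generations converge: answer directly
hard = needs_cot(["A", "B", "A", "C"])   # generations diverge: spend CoT compute
```

The design trade-off is that the cheap parallel pass is paid on every input, so the gate only saves compute when a large share of inputs turn out to be easy.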
[7] Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models cs.CL
Chao Xue, Yao Wang, Mengqiao Liu, Di Liang, Xingsheng Han
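TL;DR: The paper presents the first systematic study of incomplete learning in supervised fine-tuning of large language models: even after convergence, models fail to correctly reproduce part of their training data. Experiments confirm that the phenomenon is widespread; the paper identifies five main causes and proposes a diagnostic framework and targeted mitigation strategies.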
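Motivation: After supervised fine-tuning, models exhibit a persistent failure mode in which they cannot internalize some supervised instances. The paper sets out to study its causes and effects systematically.
Result: Experiments on multiple model families, including Qwen, LLaMA, and OLMo2, show that incomplete learning is widespread and highly heterogeneous, and that improvements in aggregate metrics can mask persistent unlearned subsets.
Insight: The novelty is formalizing the incomplete learning phenomenon and systematically attributing it to five diagnosable sources. Its diagnostic-first framework provides a fine-grained tool for analyzing SFT failure modes and underlines the need for sample-level diagnosis beyond aggregate metrics.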
Abstract: Supervised Fine-Tuning (SFT) is the standard approach for adapting large language models (LLMs) to downstream tasks. However, we observe a persistent failure mode: even after convergence, models often fail to correctly reproduce a subset of their own supervised training data. We refer to this behavior as the Incomplete Learning Phenomenon (ILP). This paper presents the first systematic study of ILP in LLM fine-tuning. We formalize ILP as post-training failure to internalize supervised instances and demonstrate its prevalence across multiple model families, domains, and datasets. Through controlled analyses, we identify five recurrent sources of incomplete learning: (1) missing prerequisite knowledge in the pre-trained model, (2) conflicts between SFT supervision and pre-training knowledge, (3) internal inconsistencies within SFT data, (4) left-side forgetting during sequential fine-tuning, and (5) insufficient optimization for rare or complex patterns. We introduce a diagnostic-first framework that maps unlearned samples to these causes using observable training and inference signals, and study several targeted mitigation strategies as causal interventions. Experiments on Qwen, LLaMA, and OLMo2 show that incomplete learning is widespread and heterogeneous, and that improvements in aggregate metrics can mask persistent unlearned subsets. The findings highlight the need for fine-grained diagnosis of what supervised fine-tuning fails to learn, and why.
[8] CircuitSynth: Reliable Synthetic Data Generation cs.CL | cs.AI
Zehua Cheng, Wei Dai, Jiahao Sun, Thomas Lukasiewicz
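TL;DR: The paper proposes CircuitSynth, a novel neuro-symbolic framework for generating high-fidelity, structurally reliable synthetic data. It distills the reasoning ability of a teacher LLM into a Probabilistic Sentential Decision Diagram (PSDD) to enforce hard logical constraints and uses convex optimization to satisfy soft distributional goals, preserving linguistic expressivity while ensuring the validity and coverage of the generated data.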
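Motivation: LLMs suffer from hallucinations, logical inconsistencies, and mode collapse in structured data generation, and existing methods struggle to balance linguistic expressivity against data validity and coverage.
Result: Across diverse benchmarks, CircuitSynth achieves 100% Schema Validity on complex logic puzzles (versus 12.4% for unconstrained baselines) and significantly outperforms state-of-the-art methods in rare-combination coverage.
Insight: The core novelty is decoupling semantic reasoning from surface realization: LLM reasoning is distilled into a tractable symbolic probabilistic model (a PSDD) that enforces hard constraints, and convex optimization handles the soft goals, yielding a framework for reliable synthetic data generation that combines expressivity with formal guarantees.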
Abstract: The generation of high-fidelity synthetic data is a cornerstone of modern machine learning, yet Large Language Models (LLMs) frequently suffer from hallucinations, logical inconsistencies, and mode collapse when tasked with structured generation. Existing approaches, such as prompting or retrieval-augmented generation, lack the mechanisms to balance linguistic expressivity with formal guarantees regarding validity and coverage. To address this, we propose CircuitSynth, a novel neuro-symbolic framework that decouples semantic reasoning from surface realization. By distilling the reasoning capabilities of a Teacher LLM into a Probabilistic Sentential Decision Diagram (PSDD), CircuitSynth creates a tractable semantic prior that structurally enforces hard logical constraints. Furthermore, we introduce a convex optimization mechanism to rigorously satisfy soft distributional goals. Empirical evaluations across diverse benchmarks demonstrate that CircuitSynth achieves 100% Schema Validity even in complex logic puzzles where unconstrained baselines fail (12.4%) while significantly outperforming state-of-the-art methods in rare-combination coverage.
[9] Think in Sentences: Explicit Sentence Boundaries Enhance Language Model’s Capabilities cs.CL | cs.AI
Zhichen Liu, Yongyuan Li, Yang Xu
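TL;DR: The paper proposes enhancing large language models (LLMs) by inserting sentence-boundary delimiters into their inputs. The method exploits the inherent sentence-level structure of natural language to steer the model toward sentence-by-sentence reasoning and is implemented in two ways: in-context learning (ICL) and supervised fine-tuning (SFT). Experiments show consistent improvements across multiple tasks.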
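Motivation: Existing work that enhances LLMs by inserting dummy tokens does not exploit the inherent sentence-level structure of natural language. Because LLMs acquire linguistic ability from human text that is structured at the sentence level, this is a critical oversight, and the paper aims to fill the gap.
Result: Experiments on models from 7B parameters up to the 600B Deepseek-V3 show consistent improvements across tasks, with gains of up to 7.7% on GSM8k and 12.5% on DROP. The internal representations of the fine-tuned LLMs also show awareness of sentence structure.
Insight: The core novelty is explicitly integrating sentence-boundary information into LLM inputs through delimiters, steering the model toward sentence-by-sentence processing closer to human cognitive habits. This offers a simple and effective direction for cognitively inspired LLM enhancement: exploit the structural priors of language itself rather than merely adding meaningless dummy tokens.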
Abstract: Researchers have explored different ways to improve the capabilities of large language models (LLMs) via dummy token insertion in contexts. However, existing works focus solely on the dummy tokens themselves and fail to leverage the inherent sentence-level structure of natural language. This is a critical oversight, as LLMs acquire linguistic capabilities through exposure to human-generated texts, which are inherently structured at the sentence level. Motivated by this gap, we propose an approach that inserts delimiters at sentence boundaries in LLM inputs, which not only integrates dummy tokens into the context but also encourages sentence-by-sentence processing during reasoning. We experiment with two concrete methods, (1) in-context learning and (2) supervised fine-tuning, on models ranging from 7B parameters to the 600B Deepseek-V3. Our results demonstrate consistent improvements across various tasks, with notable gains of up to 7.7% on GSM8k and 12.5% on DROP. Furthermore, the fine-tuned LLMs acquire sentence awareness, as evidenced by their internal representations. Our work establishes a simple yet effective technique for enhancing LLMs' capabilities, offering promising directions for the cognitive-inspired LLM enhancement paradigm.
[10] Adaptive Multi-Expert Reasoning via Difficulty-Aware Routing and Uncertainty-Guided Aggregation cs.CL | cs.LG
Mohamed Ehab, Ali Hamdi
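TL;DR: The paper proposes Adaptive Multi-Expert Reasoning (AMR), a framework that dynamically adapts its reasoning strategy to math problems of differing difficulty. An agile routing system predicts each problem's difficulty and uncertainty and guides a reconfigurable sampling mechanism that controls the breadth of generation. Three specialized experts generate candidate responses, which pass through multiple correction and finalization phases; a neural verifier then assesses their correctness, and a clustering-based aggregation technique selects the final answer from a combination of consensus and answer quality.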
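Motivation: Large language models (LLMs) perform strongly on math reasoning benchmarks, but their performance is inconsistent across problems of differing difficulty. The work aims to improve robustness and accuracy on math reasoning by attending to problem complexity and using dynamically adapted reasoning strategies.
Result: On GSM8K, AMR reaches 75.28% accuracy using only the original training data, outperforming most comparable 7B models trained on synthetic data. This indicates that difficulty-based routing and uncertainty-driven aggregation are efficient and effective for improving the robustness of math reasoning models.
Insight: The novelty is a dynamic multi-expert reasoning framework that combines difficulty-aware routing with uncertainty-guided aggregation: it predicts problem difficulty and uncertainty to tune the generation strategy and uses a neural verifier and clustering-based aggregation to improve answer quality. The approach emphasizes adapting to problem complexity and offers a transferable idea for making LLMs more robust at math reasoning.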
Abstract: Large language models (LLMs) demonstrate strong performance in math reasoning benchmarks, but their performance varies inconsistently across problems with varying levels of difficulty. This paper describes Adaptive Multi-Expert Reasoning (AMR), a framework that focuses on problem complexity by reasoning with dynamically adapted strategies. An agile routing system that focuses on problem text predicts problems’ difficulty and uncertainty and guides a reconfigurable sampling mechanism to manage the breadth of generation. Three specialized experts create candidate responses, which are modified during multiple correction and finalization phases. A neural verifier assesses the correctness of responses, while a clustering-based aggregation technique identifies the final candidate answer based on a combination of consensus and answer quality. When evaluated on the GSM8K dataset, AMR achieved 75.28% accuracy while only using the original training data. This result outperformed the majority of comparable 7B models that were trained on synthetic data. This showcases that models using difficulty-based routing and uncertainty-driven aggregation are efficient and effective in improving math reasoning models’ robustness.
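The final aggregation step, which picks an answer from "a combination of consensus and answer quality", can be sketched as follows. This is a simplified illustration rather than AMR's method: candidates are clustered by exact answer string, consensus is the cluster's share of all candidates, quality is the mean verifier score, and the linear mix with `consensus_weight` is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def aggregate(
    candidates: Sequence[Tuple[str, float]],
    consensus_weight: float = 0.5,
) -> str:
    """Pick the final answer from (answer, verifier_score) pairs.

    Each distinct answer forms one cluster. A cluster's score mixes
    consensus (its share of candidates) with quality (its mean verifier
    score in [0, 1]). The weighting scheme is hypothetical.
    """
    if not candidates:
        raise ValueError("need at least one candidate")

    clusters: Dict[str, List[float]] = defaultdict(list)
    for answer, score in candidates:
        clusters[answer].append(score)

    total = len(candidates)

    def cluster_score(scores: List[float]) -> float:
        consensus = len(scores) / total
        quality = sum(scores) / len(scores)
        return consensus_weight * consensus + (1.0 - consensus_weight) * quality

    return max(clusters, key=lambda ans: cluster_score(clusters[ans]))


votes = [("42", 0.5), ("42", 0.5), ("41", 0.75)]
balanced = aggregate(votes)                            # consensus tips it to "42"
quality_only = aggregate(votes, consensus_weight=0.0)  # verifier alone picks "41"
```

The example shows why the mix matters: the same candidates yield different final answers depending on how much weight agreement receives relative to the verifier.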
[11] BLUEmed: Retrieval-Augmented Multi-Agent Debate for Clinical Error Detection cs.CL
Saukun Thika You, Nguyen Anh Khoa Tran, Wesley K. Marizane, Hanshu Rao, Qiunan Zhang
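TL;DR: BLUEmed is a framework that combines hybrid retrieval-augmented generation (RAG) with multi-agent debate to detect terminology substitution errors in clinical notes. It decomposes each note, retrieves source-partitioned evidence, assigns expert agents to analyze it independently, adjudicates disagreements through structured debate, and filters the output with a safety layer, substantially improving detection performance under both zero-shot and few-shot prompting.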
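Motivation: Terminology substitution errors in clinical notes, where one medical term is replaced by a linguistically valid but clinically different term, are a persistent challenge for automated error detection in healthcare and call for more accurate and reliable detection methods.
Result: On a clinical terminology substitution detection benchmark, BLUEmed achieves the best accuracy (69.13%), ROC-AUC (74.45%), and PR-AUC (72.44%) under few-shot prompting, outperforming single-agent RAG and debate-only baselines. Its effectiveness is confirmed across multiple backbone models and prompting strategies.
Insight: The novelty is combining hybrid RAG (dense, sparse, and online retrieval) with structured multi-agent debate, improving detection reliability through evidence-grounded reasoning and multi-perspective verification. Its design elements, such as source-partitioned evidence retrieval, a mechanism for resolving expert disagreement, and a cascading safety layer, could carry over to other domain-specific error detection tasks that demand high precision and interpretability.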
Abstract: Terminology substitution errors in clinical notes, where one medical term is replaced by a linguistically valid but clinically different term, pose a persistent challenge for automated error detection in healthcare. We introduce BLUEmed, a multi-agent debate framework augmented with hybrid Retrieval-Augmented Generation (RAG) that combines evidence-grounded reasoning with multi-perspective verification for clinical error detection. BLUEmed decomposes each clinical note into focused sub-queries, retrieves source-partitioned evidence through dense, sparse, and online retrieval, and assigns two domain expert agents distinct knowledge bases to produce independent analyses; when the experts disagree, a structured counter-argumentation round and cross-source adjudication resolve the conflict, followed by a cascading safety layer that filters common false-positive patterns. We evaluate BLUEmed on a clinical terminology substitution detection benchmark under both zero-shot and few-shot prompting with multiple backbone models spanning proprietary and open-source families. Experimental results show that BLUEmed achieves the best accuracy (69.13%), ROC-AUC (74.45%), and PR-AUC (72.44%) under few-shot prompting, outperforming both single-agent RAG and debate-only baselines. Further analyses across six backbone models and two prompting strategies confirm that retrieval augmentation and structured debate are complementary, and that the framework benefits most from models with sufficient instruction-following and clinical language understanding.
[12] CodaRAG: Connecting the Dots with Associativity Inspired by Complementary Learning cs.CL | cs.AI
Cheng-Yen Li, Xuanjun Chen, Claire Lin, Wei-Yu Chen, Wenhua Nie
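TL;DR: The paper proposes CodaRAG, a framework that addresses the reasoning difficulties large language models face on knowledge-intensive tasks because of hallucinations and fragmented information. Inspired by Complementary Learning Systems, it turns retrieval from passive lookup into active associative discovery and systematically reconstructs the logical chains between dispersed pieces of evidence through three stages (knowledge consolidation, associative navigation, and interference elimination), improving retrieval and generation accuracy.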
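Motivation: Existing retrieval-augmented generation methods usually treat evidence as isolated units and cannot reconstruct the logical chains connecting dispersed information, which leads to hallucinations and fragmented reasoning on knowledge-intensive tasks.
Result: On GraphRAG-Bench, CodaRAG achieves absolute gains of 7-10% in retrieval recall and 3-11% in generation accuracy, demonstrating its ability to systematically strengthen associative evidence retrieval for factual, reasoning, and creative tasks.
Insight: The core novelty is drawing on Complementary Learning Systems theory to recast retrieval as active associative discovery: a stable memory substrate, multi-dimensional associative navigation pathways, and an interference elimination mechanism explicitly recover dispersed evidence chains, going beyond conventional RAG's treatment of evidence as isolated units.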
Abstract: Large Language Models (LLMs) struggle with knowledge-intensive tasks due to hallucinations and fragmented reasoning over dispersed information. While Retrieval-Augmented Generation (RAG) grounds generation in external sources, existing methods often treat evidence as isolated units, failing to reconstruct the logical chains that connect these dots. Inspired by Complementary Learning Systems (CLS), we propose CodaRAG, a framework that evolves retrieval from passive lookup into active associative discovery. CodaRAG operates via a three-stage pipeline: (1) Knowledge Consolidation to unify fragmented extractions into a stable memory substrate; (2) Associative Navigation to traverse the graph via multi-dimensional pathways (semantic, contextualized, and functional), explicitly recovering dispersed evidence chains; and (3) Interference Elimination to prune hyper-associative noise, ensuring a coherent, high-precision reasoning context. On GraphRAG-Bench, CodaRAG achieves absolute gains of 7-10% in retrieval recall and 3-11% in generation accuracy. These results demonstrate CodaRAG’s superior ability to systematically robustify associative evidence retrieval for factual, reasoning, and creative tasks.
[13] Instruction Data Selection via Answer Divergence cs.CL
Bo Li, Mingda Wang, Shikun Zhang, Wei Ye
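TL;DR: The paper proposes ADG, an instruction data selection method based on answer divergence. It selects high-quality instruction data by analyzing the geometric structure (dispersion magnitude and shape anisotropy) of multi-sample outputs; fine-tuning on only 10K selected examples outperforms existing selection methods on six benchmarks spanning reasoning, knowledge, and coding.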
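Motivation: The performance of instruction tuning depends heavily on the quality and composition of the instruction-response corpus. Existing data selection methods lack an effective measure of output diversity and multi-modality, which motivates a selection strategy that accounts for both answer dispersion and shape anisotropy.
Result: Across two backbone models and three public instruction pools, fine-tuning on only 10K ADG-selected examples consistently outperforms strong existing selectors on six benchmarks spanning reasoning, knowledge, and coding.
Insight: The novelty is quantifying answer divergence as geometric features (dispersion magnitude and shape anisotropy) and showing that the two together form an effective signal for instruction data selection. A transferable idea is to use a generative model's high-temperature samples for unsupervised data filtering, making instruction tuning more efficient.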
Abstract: Instruction tuning relies on large instruction-response corpora whose quality and composition strongly affect downstream performance. We propose Answer Divergence-Guided Selection (ADG), which selects instruction data based on the geometric structure of multi-sample outputs. ADG draws several high-temperature generations per instruction, maps responses into an embedding space, and computes an output divergence score that jointly encodes dispersion magnitude and shape anisotropy. High scores correspond to instructions whose answers are both far apart and multi-modal, rather than clustered paraphrases along a single direction. Across two backbones and three public instruction pools, fine-tuning on only 10K ADG-selected examples consistently outperforms strong selectors on six benchmarks spanning reasoning, knowledge, and coding. Analyses further show that both dispersion magnitude and shape anisotropy are necessary, supporting answer divergence as a practical signal for instruction data selection. Code and appendix are included in the supplementary materials.
[14] EviCare: Enhancing Diagnosis Prediction with Deep Model-Guided Evidence for In-Context Reasoning cs.CL
Hengyu Zhang, Xuyun Zhang, Pengxiang Zhan, Linhao Luo, Hang Lv
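TL;DR: The paper proposes EviCare, a framework that combines deep model guidance with the in-context reasoning of large language models (LLMs) to improve diagnosis prediction from electronic health records (EHRs), particularly for novel and clinically important conditions.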
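Motivation: Existing LLM-based diagnosis prediction methods tend to overfit to historically observed diagnoses and overlook novel conditions that are critical for early intervention. EviCare aims to address this.
Result: On two real-world EHR benchmarks, MIMIC-III and MIMIC-IV, EviCare outperforms LLM-only and deep model-only baselines by an average of 20.65% across precision and accuracy metrics, with an average improvement of 30.97% on novel diagnosis prediction.
Insight: The novelty is integrating deep model inference, evidential prioritization for set-based EHRs, and relational evidence construction into an adaptive in-context prompt that guides LLM reasoning in an interpretable way, improving prediction accuracy and the recognition of novel conditions.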
Abstract: Recent advances in large language models (LLMs) have enabled promising progress in diagnosis prediction from electronic health records (EHRs). However, existing LLM-based approaches tend to overfit to historically observed diagnoses, often overlooking novel yet clinically important conditions that are critical for early intervention. To address this, we propose EviCare, an in-context reasoning framework that integrates deep model guidance into LLM-based diagnosis prediction. Rather than prompting LLMs directly with raw EHR inputs, EviCare performs (1) deep model inference for candidate selection, (2) evidential prioritization for set-based EHRs, and (3) relational evidence construction for novel diagnosis prediction. These signals are then composed into an adaptive in-context prompt to guide LLM reasoning in an accurate and interpretable manner. Extensive experiments on two real-world EHR benchmarks (MIMIC-III and MIMIC-IV) demonstrate that EviCare achieves significant performance gains, consistently outperforming both LLM-only and deep model-only baselines by an average of 20.65% across precision and accuracy metrics. The improvements are particularly notable in challenging novel diagnosis prediction, with average improvements of 30.97%.
[15] From Query to Counsel: Structured Reasoning with a Multi-Agent Framework and Dataset for Legal Consultation cs.CL | cs.AI
Mingfei Lu, Yi Zhang, Mengjia Wu, Yue Feng
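TL;DR: For legal consultation question answering, the paper builds JurisCQAD, a dataset of over 43,000 real-world Chinese legal queries, and proposes JurisMA, a multi-agent framework. A structured task decomposition converts each query into a legal element graph, and modules for dynamic routing, statutory grounding, and stylistic optimization together enable strong context-aware reasoning.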
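Motivation: The work addresses the distinctive challenges of legal consultation QA: scarce high-quality training data, complex task composition, and strong contextual dependencies.
Result: Evaluated on a refined LawBench, the system significantly outperforms both general-purpose and legal-domain LLMs across multiple lexical and semantic metrics.
Insight: The novelty is decomposing legal queries into an element graph of entities, events, intents, and legal issues, and using a modular multi-agent framework for interpretable decomposition and collaboration, which improves reasoning in legal consultation.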
Abstract: Legal consultation question answering (Legal CQA) presents unique challenges compared to traditional legal QA tasks, including the scarcity of high-quality training data, complex task composition, and strong contextual dependencies. To address these, we construct JurisCQAD, a large-scale dataset of over 43,000 real-world Chinese legal queries annotated with expert-validated positive and negative responses, and design a structured task decomposition that converts each query into a legal element graph integrating entities, events, intents, and legal issues. We further propose JurisMA, a modular multi-agent framework supporting dynamic routing, statutory grounding, and stylistic optimization. Combined with the element graph, the framework enables strong context-aware reasoning, effectively capturing dependencies across legal facts, norms, and procedural logic. Trained on JurisCQAD and evaluated on a refined LawBench, our system significantly outperforms both general-purpose and legal-domain LLMs across multiple lexical and semantic metrics, demonstrating the benefits of interpretable decomposition and modular collaboration in Legal CQA.
[16] Structure-Grounded Knowledge Retrieval via Code Dependencies for Multi-Step Data Reasoning cs.CL
Xinyi Huang, Mingzhe Lu, Haoyu Dong
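TL;DR: The paper proposes SGKR (Structure-Grounded Knowledge Retrieval), a retrieval framework that improves knowledge retrieval for large language models solving multi-step data analysis tasks. It builds a knowledge graph from function-call dependencies, extracts semantic input and output tags from the question, and identifies the dependency paths between them, assembling a task-relevant structured context for LLM-based code generation.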
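Motivation: Current retrieval-augmented methods rely mainly on lexical or embedding similarity, which often fails to capture the knowledge critical for multi-step reasoning. In data analysis tasks especially, the relevant knowledge is tied to executable code and its dependency structure rather than merely to the query text.
Result: Experiments on multi-step data analysis benchmarks show that SGKR consistently improves solution correctness over no-retrieval and similarity-based retrieval baselines for both vanilla LLMs and coding agents.
Insight: The novelty is organizing knowledge as a graph induced by function-call dependencies and retrieving knowledge that matches the task structure by extracting semantic tags and dependency paths. This gives code generation a more precise, structured context than conventional text-similarity retrieval.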
Abstract: Selecting the right knowledge is critical when using large language models (LLMs) to solve domain-specific data analysis tasks. However, most retrieval-augmented approaches rely primarily on lexical or embedding similarity, which is often a weak proxy for the task-critical knowledge needed for multi-step reasoning. In many such tasks, the relevant knowledge is not merely textually related to the query, but is instead grounded in executable code and the dependency structure through which computations are carried out. To address this mismatch, we propose SGKR (Structure-Grounded Knowledge Retrieval), a retrieval framework that organizes domain knowledge with a graph induced by function-call dependencies. Given a question, SGKR extracts semantic input and output tags, identifies dependency paths connecting them, and constructs a task-relevant subgraph. The associated knowledge and corresponding function implementations are then assembled as a structured context for LLM-based code generation. Experiments on multi-step data analysis benchmarks show that SGKR consistently improves solution correctness over no-retrieval and similarity-based retrieval baselines for both vanilla LLMs and coding agents.
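The path-finding step described in the abstract (connect input tags to output tags through the dependency graph, then keep the nodes on the path as the task-relevant subgraph) can be sketched with a breadth-first search. Everything concrete here is hypothetical: the function names in `graph`, the tag-to-function mapping, and the choice of a single shortest path are assumptions, since the abstract does not specify how SGKR selects paths.

```python
from collections import deque
from typing import Dict, List, Optional, Sequence, Set

# Hypothetical dependency graph: an edge u -> v means v consumes u's output.
graph: Dict[str, List[str]] = {
    "load_sales": ["clean_sales"],
    "clean_sales": ["monthly_revenue", "top_products"],
    "monthly_revenue": ["growth_rate"],
}

# Hypothetical mapping from semantic tags to the functions that carry them.
tag_to_functions: Dict[str, List[str]] = {
    "raw_sales": ["load_sales"],
    "growth": ["growth_rate"],
}


def dependency_path(
    graph: Dict[str, List[str]],
    sources: Sequence[str],
    targets: Set[str],
) -> Optional[List[str]]:
    """Breadth-first search for a shortest path from any source to any target."""
    queue = deque([s] for s in sources)
    seen = set(sources)
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in targets:
            return path
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# A question tagged with input "raw_sales" and output "growth".
path = dependency_path(
    graph,
    sources=tag_to_functions["raw_sales"],
    targets=set(tag_to_functions["growth"]),
)
```

The functions on `path` (and their associated documentation) would then be assembled into the prompt context, while unrelated branches such as `top_products` are left out even if they are textually similar to the query.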
[17] Early Decisions Matter: Proximity Bias and Initial Trajectory Shaping in Non-Autoregressive Diffusion Language Models cs.CL | cs.AI
Jiyeon Kim, Sungik Choi, Yongrae Jo, Moontae Lee, Minjoon Seo
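TL;DR: The paper studies decoding dynamics in non-autoregressive diffusion language models and shows that confidence-based non-autoregressive generation has a proximity bias: denoising tends to concentrate on spatially adjacent tokens, which causes spatial error propagation. The authors propose a lightweight intervention that guides early token selection and anneals the end-of-sequence temperature to improve performance.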
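Motivation: Diffusion language models are an alternative to autoregressive models with the potential for parallel generation and bidirectional context modeling, but how to exploit this flexibility for fully non-autoregressive decoding, particularly on reasoning and planning tasks, remains an open question.
Result: Evaluated on a range of reasoning and planning tasks, the method yields substantial overall improvement over existing heuristic baselines without significant computational overhead.
Insight: The novelty is uncovering the proximity bias in non-autoregressive diffusion decoding and the resulting dependence on the initial trajectory, and proposing a lightweight planner with a temperature annealing strategy that guides early decisions and mitigates error propagation.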
Abstract: Diffusion-based language models (dLLMs) have emerged as a promising alternative to autoregressive language models, offering the potential for parallel token generation and bidirectional context modeling. However, harnessing this flexibility for fully non-autoregressive decoding remains an open question, particularly for reasoning and planning tasks. In this work, we investigate non-autoregressive decoding in dLLMs by systematically analyzing its inference dynamics along the temporal axis. Specifically, we uncover an inherent failure mode in confidence-based non-autoregressive generation stemming from a strong proximity bias: the tendency for the denoising order to concentrate on spatially adjacent tokens. This local dependency leads to spatial error propagation, rendering the entire trajectory critically contingent on the initial unmasking position. Leveraging this insight, we present a minimal-intervention approach that guides early token selection, employing a lightweight planner and end-of-sequence temperature annealing. We thoroughly evaluate our method on various reasoning and planning tasks and observe substantial overall improvement over existing heuristic baselines without significant computational overhead.
[18] Efficient Process Reward Modeling via Contrastive Mutual Information cs.CL | cs.AI | cs.LGPDF
Nakyung Lee, Sangwoo Hong, Jungwoo Lee
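TL;DR: This paper proposes contrastive pointwise mutual information (CPMI), a method for automatically labeling training data for process reward models (PRMs). It addresses the high cost of manual annotation and the heavy compute demands of existing automatic methods such as Monte Carlo estimation. CPMI uses the model's internal probabilities and contrasts the mutual information between a reasoning step and the correct answer to infer step-level supervision, substantially reducing the time and compute needed to build a dataset.
Details
Motivation: Training a PRM currently requires human annotators to assign a reward score to every reasoning step, which is costly and slow; existing automatic methods such as Monte Carlo estimation are also compute-intensive because they require many LLM rollouts. The paper aims to develop an efficient, automatic reward-labeling method that overcomes these limitations.
Result: Compared with Monte Carlo estimation, CPMI-based labeling reduces dataset construction time by 84% and token generation by 98%, while achieving higher accuracy on process-level evaluations and mathematical reasoning benchmarks.
Insight: The novelty is CPMI itself, an automatic reward-labeling technique that quantifies a step's contribution to the final solution by contrasting the mutual information between the reasoning step and the correct answer. It provides a reliable reward signal without manual annotation or heavy computation, and could be reused for efficiently training PRMs or other verifier models.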
Abstract: Recent research has devoted considerable effort to verifying the intermediate reasoning steps of chain-of-thought (CoT) trajectories using process reward models (PRMs) and other verifier models. However, training a PRM typically requires human annotators to assign reward scores to each reasoning step, which is both costly and time-consuming. Existing automated approaches, such as Monte Carlo (MC) estimation, also demand substantial computational resources due to repeated LLM rollouts. To overcome these limitations, we propose contrastive pointwise mutual information (CPMI), a novel automatic reward labeling method that leverages the model’s internal probability to infer step-level supervision while significantly reducing the computational burden of annotating datasets. CPMI quantifies how much a reasoning step increases the mutual information between the step and the correct target answer relative to hard-negative alternatives. This contrastive signal serves as a proxy for the step’s contribution to the final solution and yields a reliable reward. The experimental results show that CPMI-based labeling reduces dataset construction time by 84% and token generation by 98% compared to MC estimation, while achieving higher accuracy on process-level evaluations and mathematical reasoning benchmarks.
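The abstract does not give the CPMI formula, so the following is only one plausible reading, offered as a sketch: the pointwise mutual information of a step with an answer is taken as the change in the answer's log-probability when the step is added to the context, and the contrastive score subtracts the mean of that quantity over hard-negative answers. The function names and the averaging over negatives are assumptions, not the authors' definition.

```python
def pmi(logp_with_step, logp_without_step):
    """Change in an answer's log-probability when the reasoning step is added to the context."""
    return logp_with_step - logp_without_step

def cpmi(correct, negatives):
    """Assumed contrastive score.

    correct:   (logp_with_step, logp_without_step) for the correct answer
    negatives: list of the same pairs for hard-negative answers
    """
    pos = pmi(*correct)
    neg = sum(pmi(w, wo) for w, wo in negatives) / len(negatives)
    return pos - neg

# Hypothetical log-probabilities: the step makes the correct answer more likely
# and both hard negatives less likely.
score = cpmi(correct=(-1.0, -2.0), negatives=[(-3.0, -2.5), (-4.0, -3.5)])
```

Under this reading, a positive score marks a step that favors the correct answer over the alternatives, and all inputs come from forward passes rather than repeated rollouts.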
[19] Learning and Enforcing Context-Sensitive Control for LLMs cs.CL | cs.AI | cs.LGPDF
Mohammad Albinhassan, Pranava Madhyastha, Mark Law, Alessandra Russo
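TL;DR: This paper proposes a framework that automatically learns context-sensitive constraints. Through a two-phase process of syntactic exploration and constraint exploitation, an LLM learns the constraints and then enforces them during generation, overcoming the limits of context-free grammars in guaranteeing valid output.
Details
Motivation: Context-sensitive constraints usually have to be specified by hand, which requires specialized expertise and is a significant barrier. The paper aims to learn these constraints automatically and so improve the validity of LLM generation.
Result: Experiments show that the method lets a small LLM (1B parameters) generate with perfect constraint adherence, outperforming larger models and state-of-the-art reasoning models on constraint adherence.
Insight: The novelty is the first integration of context-sensitive grammar learning with LLM generation: generation validity is maintained without manual constraint specification, and the automatic learning process improves the efficiency and scalability of generation control.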
Abstract: Controlling the output of Large Language Models (LLMs) through context-sensitive constraints has emerged as a promising approach to overcome the limitations of Context-Free Grammars (CFGs) in guaranteeing generation validity. However, such constraints typically require manual specification – a significant barrier demanding specialized expertise. We introduce a framework that automatically learns context-sensitive constraints from LLM interactions through a two-phase process: syntactic exploration to gather diverse outputs for constraint learning, followed by constraint exploitation to enforce these learned rules during generation. Experiments demonstrate that our method enables even small LLMs (1B parameters) to learn and generate with perfect constraint adherence, outperforming larger counterparts and state-of-the-art reasoning models. This work represents the first integration of context-sensitive grammar learning with LLM generation, eliminating manual specification while maintaining generation validity.
[20] QFS-Composer: Query-focused summarization pipeline for less resourced languages cs.CLPDF
Vuk Đuranović, Marko Robnik Šikonja
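TL;DR: This paper proposes QFS-Composer, a query-focused summarization framework designed for less-resourced languages such as Slovenian. It integrates query decomposition, question generation, question answering, and abstractive summarization to improve the factual alignment of a summary with the user's query intent.
Details
Motivation: Query-focused summarization in less-resourced languages faces limited labeled datasets and evaluation tools, and large language models perform markedly worse in these languages.
Result: Empirical evaluation on Slovenian shows that the QA-guided summarization pipeline outperforms baseline LLMs on consistency and relevance; the abstract does not state whether this is state of the art.
Insight: The novelty is an extensible framework that integrates a question-answering module into the summarization pipeline to strengthen factual alignment, together with dedicated QA and QG models and a reference-free summary evaluation approach developed for a low-resource language.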
Abstract: Large language models (LLMs) demonstrate strong performance in text summarization, yet their effectiveness drops significantly across languages with restricted training resources. This work addresses the challenge of query-focused summarization (QFS) in less-resourced languages, where labeled datasets and evaluation tools are limited. We present a novel QFS framework, QFS-Composer, that integrates query decomposition, question generation (QG), question answering (QA), and abstractive summarization to improve the factual alignment of a summary with user intent. We test our approach on the Slovenian language. To enable high-quality supervision and evaluation, we develop the Slovenian QA and QG models based on a Slovene LLM and adapt evaluation approaches for reference-free summary evaluation. Empirical evaluation shows that the QA-guided summarization pipeline yields improved consistency and relevance over baseline LLMs. Our work establishes an extensible methodology for advancing QFS in less-resourced languages.
[21] Self-Correcting RAG: Enhancing Faithfulness via MMKP Context Selection and NLI-Guided MCTS cs.CLPDF
Shijia Xu, Zhou Wu, Xiaolong Jia, Yu Wang, Kai Liu
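TL;DR: This paper proposes Self-Correcting RAG, a unified framework that targets two major problems retrieval-augmented generation faces on complex reasoning tasks: low context utilization and frequent hallucinations. The framework recasts retrieval and generation as constrained optimization and path planning. On the input side, it formalizes context selection as a multi-dimensional multiple-choice knapsack problem to maximize information density; on the output side, it introduces a Monte Carlo Tree Search mechanism guided by natural language inference to explore reasoning paths dynamically and verify the faithfulness of answers.
Details
Motivation: Current retrieval-augmented generation suffers from low context utilization and frequent hallucinations on complex reasoning tasks.
Result: Experiments on six multi-hop question answering and fact-checking datasets show that the method significantly improves reasoning accuracy on complex queries while effectively reducing hallucinations, outperforming strong existing baselines.
Insight: The novelty is formalizing retrieval as an MMKP to optimize context selection and, for the first time, using NLI-guided MCTS inside a RAG framework to verify answer faithfulness dynamically at inference time, which improves the trustworthiness of generation.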
Abstract: Retrieval-augmented generation (RAG) substantially extends the knowledge boundary of large language models. However, it still faces two major challenges when handling complex reasoning tasks: low context utilization and frequent hallucinations. To address these issues, we propose Self-Correcting RAG, a unified framework that reformulates retrieval and generation as constrained optimization and path planning. On the input side, we move beyond traditional greedy retrieval and, for the first time, formalize context selection as a multi-dimensional multiple-choice knapsack problem (MMKP), thereby maximizing information density and removing redundancy under a strict token budget. On the output side, we introduce a natural language inference (NLI)-guided Monte Carlo Tree Search (MCTS) mechanism, which leverages test-time compute to dynamically explore reasoning trajectories and validate the faithfulness of generated answers. Experiments on six multi-hop question answering and fact-checking datasets demonstrate that our method significantly improves reasoning accuracy on complex queries while effectively reducing hallucinations, outperforming strong existing baselines. Our code is available at https://github.com/xjiacs/Self-Correcting-RAG.
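To make the MMKP framing concrete, here is a minimal brute-force sketch, assuming passages are grouped by topic, exactly one option is chosen per group (including a zero-cost "skip" option), and each option has an integer relevance value and a cost vector. The grouping, the two cost dimensions (tokens, passage count), and all numbers are hypothetical; the paper's actual formulation and solver are not described in the abstract.

```python
from itertools import product

def solve_mmkp(groups, budgets):
    """Exhaustive multi-dimensional multiple-choice knapsack for tiny instances.

    groups:  list of groups; each group is a list of (name, value, costs) options,
             and exactly one option is picked per group.
    budgets: capacity for each cost dimension.
    """
    best_value, best_pick = float("-inf"), None
    for pick in product(*groups):
        totals = [sum(item[2][d] for item in pick) for d in range(len(budgets))]
        if all(t <= b for t, b in zip(totals, budgets)):
            value = sum(item[1] for item in pick)
            if value > best_value:
                best_value, best_pick = value, [item[0] for item in pick]
    return best_value, best_pick

# Hypothetical candidates: costs are (tokens, passages used).
groups = [
    [("skip_A", 0, (0, 0)), ("A_short", 6, (120, 1)), ("A_long", 9, (300, 1))],
    [("skip_B", 0, (0, 0)), ("B_short", 5, (100, 1)), ("B_long", 7, (260, 1))],
]
best_value, best_pick = solve_mmkp(groups, budgets=(400, 2))
# best_value == 14, best_pick == ["A_long", "B_short"]
```

Picking both long passages would exceed the 400-token budget, so the best feasible combination trades one long passage for a short one. Exhaustive search is exponential in the number of groups; a practical system would need a heuristic or approximate solver.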
[22] RCBSF: A Multi-Agent Framework for Automated Contract Revision via Stackelberg Game cs.CLPDF
Shijia Xu, Yu Wang, Xiaolong Jia, Zhou Wu, Kai Liu
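TL;DR: This paper proposes RCBSF (Risk-Constrained Bilevel Stackelberg Framework), a multi-agent framework for automated contract revision. It models revision as a non-cooperative Stackelberg game in which a Global Prescriptive Agent (GPA) imposes risk budgets on a follower system made up of a Constrained Revision Agent (CRA) and a Local Verification Agent (LVA), which then optimize the output iteratively.
Details
Motivation: Although large language models (LLMs) are widely adopted in legal AI, their use for automated contract revision is still held back by hallucinated safety and a lack of rigorous behavioral constraints. The paper aims to address these issues.
Result: Empirical validation on a unified benchmark shows that RCBSF achieves state-of-the-art performance, with an average Risk Resolution Rate (RRR) of 84.21%, surpassing iterative baselines while improving token efficiency.
Insight: The main novelty is formalizing contract revision as a bilevel Stackelberg game and introducing risk-budget constraints to guide and bound the revision process. Viewed objectively, this game-theoretic framework offers a new paradigm, with theoretical guarantees, for applying LLMs in high-stakes domains such as legal contracts that require strict constraints and verifiable output.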
Abstract: Despite the widespread adoption of Large Language Models (LLMs) in Legal AI, their utility for automated contract revision remains impeded by hallucinated safety and a lack of rigorous behavioral constraints. To address these limitations, we propose the Risk-Constrained Bilevel Stackelberg Framework (RCBSF), which formulates revision as a non-cooperative Stackelberg game. RCBSF establishes a hierarchical leader-follower structure where a Global Prescriptive Agent (GPA) imposes risk budgets upon a follower system constituted by a Constrained Revision Agent (CRA) and a Local Verification Agent (LVA) to iteratively optimize output. We provide theoretical guarantees that this bilevel formulation converges to an equilibrium yielding strictly superior utility over unguided configurations. Empirical validation on a unified benchmark demonstrates that RCBSF achieves state-of-the-art performance, surpassing iterative baselines with an average Risk Resolution Rate (RRR) of 84.21% while enhancing token efficiency. Our code is available at https://github.com/xjiacs/RCBSF.
[23] Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation cs.CL | cs.AI | cs.IRPDF
Fangda Ye, Zhifei Xie, Yuxin Hu, Yihang Yin, Shurui Huang
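TL;DR: This paper proposes Deep-Reporter, a unified agentic framework for grounded multimodal long-form generation. It tackles the task through agentic multimodal search and filtering, checklist-guided incremental synthesis, and recurrent context management.
Details
Motivation: Existing agentic search frameworks focus mainly on text and overlook the multimodal evidence common in real-world expert reports, so the pressing task of multimodal long-form generation needs to be addressed.
Result: Extensive experiments on the M2LongBench testbed (247 research tasks across 9 domains) show that multimodal long-form generation is challenging, especially in multimodal selection and integration, and that effective post-training can narrow the performance gap.
Insight: The contributions include a unified agentic framework that integrates multimodal evidence, checklist-guided synthesis that ensures coherent image-text integration and good citation placement, and the construction of high-quality training traces and a stable multimodal testbed to advance the task.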
Abstract: Recent agentic search frameworks enable deep research via iterative planning and retrieval, reducing hallucinations and enhancing factual grounding. However, they remain text-centric, overlooking the multimodal evidence that characterizes real-world expert reports. We introduce a pressing task: multimodal long-form generation. Accordingly, we propose Deep-Reporter, a unified agentic framework for grounded multimodal long-form generation. It orchestrates: (i) Agentic Multimodal Search and Filtering to retrieve and filter textual passages and information-dense visuals; (ii) Checklist-Guided Incremental Synthesis to ensure coherent image-text integration and optimal citation placement; and (iii) Recurrent Context Management to balance long-range coherence with local fluency. We develop a rigorous curation pipeline producing 8K high-quality agentic traces for model optimization. We further introduce M2LongBench, a comprehensive testbed comprising 247 research tasks across 9 domains and a stable multimodal sandbox. Extensive experiments demonstrate that long-form multimodal generation is a challenging task, especially in multimodal selection and integration, and effective post-training can bridge the gap.
[24] When Meaning Isn’t Literal: Exploring Idiomatic Meaning Across Languages and Modalities cs.CLPDF
Sarmistha Das, Shreyas Guha, Suvrayan Bandyopadhyay, Salisa Phosit, Kitsuchart Pasupa
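TL;DR: Targeting the blind spot of language models in idiom understanding, this paper presents Mediom, a multilingual, multimodal idiom corpus of 3,533 Hindi, Bengali, and Thai idioms, each paired with an explanation, cross-lingual translations, and aligned text-image representations. It also proposes HIDE, a hint-based idiom explanation framework that refines reasoning iteratively using error-feedback retrieval and diagnostic cues. Together they aim to provide a test bed and methodology for culturally grounded, multimodal idiom understanding in next-generation AI systems.
Details
Motivation: Contemporary language models are weak at idiomatic reasoning and tend toward literal readings that miss culturally grounded metaphorical meaning, for example the denial-driven rationalization encoded in the Bengali idiom "grapes are sour".
Result: Benchmarking large language models and vision-language models on the Mediom corpus exposes systematic failures in metaphor comprehension.
Insight: The novelty is the construction of Mediom, described as the first multilingual, multimodal idiom corpus, and the HIDE framework, which strengthens iterative reasoning through error feedback and diagnostic cues, providing a new test bed and method for culturally grounded, multimodal idiom understanding.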
Abstract: Idiomatic reasoning, deeply intertwined with metaphor and culture, remains a blind spot for contemporary language models, whose progress skews toward surface-level lexical and semantic cues. For instance, the Bengali idiom আঙ্গুর ফল টক (angur fol tok, “grapes are sour”) encodes denial-driven rationalization, yet naive models latch onto the literal fox-and-grape imagery. Addressing this oversight, we present “Mediom,” a multilingual, multimodal idiom corpus of 3,533 Hindi, Bengali, and Thai idioms, each paired with gold-standard explanations, cross-lingual translations, and carefully aligned text–image representations. We benchmark both large language models (textual reasoning) and vision-language models (figurative disambiguation) on Mediom, exposing systematic failures in metaphor comprehension. To mitigate these gaps, we propose “HIDE,” a Hinting-based Idiom Explanation framework that leverages error-feedback retrieval and targeted diagnostic cues for iterative reasoning refinement. Collectively, Mediom and HIDE establish a rigorous test bed and methodology for culturally grounded, multimodal idiom understanding embedded with reasoning hints in next-generation AI systems.
[25] TInR: Exploring Tool-Internalized Reasoning in Large Language Models cs.CL | cs.AIPDF
Qiancheng Xu, Yongqi Li, Fan Liu, Hongru Wang, Min Yang
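TL;DR: This paper proposes Tool-Internalized Reasoning (TInR), a new approach that internalizes knowledge of external tools into large language models (LLMs) to address the tool mastery difficulty, tool size constraints, and inference inefficiency of existing tool-integrated reasoning (TIR) methods. The authors present the TInR-U framework, which achieves tool internalization and tool-reasoning coordination through a three-phase training pipeline: bidirectional knowledge alignment, a supervised fine-tuning warm-up, and reinforcement learning. Experiments show that TInR-U achieves superior performance in both in-domain and out-of-domain settings.
Details
Motivation: Existing tool-integrated reasoning methods depend on external tool documentation, which leads to tool mastery difficulty, limits on the number of tools, and inefficient inference. A new paradigm that internalizes tool knowledge into LLMs is therefore worth exploring.
Result: In comprehensive evaluation across in-domain and out-of-domain settings, TInR-U achieves superior performance in both, demonstrating its effectiveness and efficiency.
Insight: The core novelty is the concept of tool-internalized reasoning and the TInR-U framework, which internalizes tool knowledge into the model through three-phase training (notably the bidirectional knowledge alignment strategy and TInR-specific reinforcement learning rewards) and achieves more unified, efficient coordination of reasoning and tool use. This suggests a way to reduce dependence on external documentation and improve inference efficiency.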
Abstract: Tool-Integrated Reasoning (TIR) has emerged as a promising direction by extending Large Language Models’ (LLMs) capabilities with external tools during reasoning. Existing TIR methods typically rely on external tool documentation during reasoning. However, this leads to tool mastery difficulty, tool size constraints, and inference inefficiency. To mitigate these issues, we explore Tool-Internalized Reasoning (TInR), aiming at facilitating reasoning with tool knowledge internalized into LLMs. Achieving this goal presents notable requirements, including tool internalization and tool-reasoning coordination. To address them, we propose TInR-U, a tool-internalized reasoning framework for unified reasoning and tool usage. TInR-U is trained through a three-phase pipeline: 1) tool internalization with a bidirectional knowledge alignment strategy; 2) supervised fine-tuning warm-up using high-quality reasoning annotations, and 3) reinforcement learning with TInR-specific rewards. We comprehensively evaluate our method across in-domain and out-of-domain settings. Experiment results show that TInR-U achieves superior performance in both settings, highlighting its effectiveness and efficiency.
[26] Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series cs.CL | cs.AIPDF
Krzysztof Ociepa, Łukasz Flis, Remigiusz Kinas, Krzysztof Wróbel, Adrian Gwoździej
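TL;DR: This paper describes the development of the Bielik v3 PL series (7B and 11B parameters), language models optimized for Polish. Its core is tokenizer optimization to fix the inefficiency of general-purpose models on morphologically complex Polish, covering the move from a universal tokenizer to a dedicated Polish vocabulary, FOCUS initialization, multi-stage pretraining, and post-training alignment with SFT, DPO, and GRPO.
Details
Motivation: The universal tokenizers used by general-purpose LLMs fail to capture the linguistic nuances of morphologically complex languages such as Polish, leading to high fertility, increased inference cost, and a restricted effective context window. The paper aims to improve the efficiency and performance of Polish language modeling by optimizing the tokenizer.
Result: The report describes model construction and training in detail, but the abstract gives no specific quantitative results (such as benchmark scores) or comparison with state-of-the-art models.
Insight: The main novelty is a dedicated tokenizer for a specific language (Polish) that addresses an architectural inefficiency of general-purpose models, combined with FOCUS initialization, a multi-stage pretraining curriculum, and a post-training alignment strategy that combines SFT, DPO, and GRPO, giving a systematic recipe for language-specific LLM optimization.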
Abstract: The development of the Bielik v3 PL series, encompassing both the 7B and 11B parameter variants, represents a significant milestone in the field of language-specific large language model (LLM) optimization. While general-purpose models often demonstrate impressive multilingual capabilities, they frequently suffer from a fundamental architectural inefficiency: the use of universal tokenizers. These tokenizers, typically designed to cover a broad spectrum of languages, often fail to capture the morphological nuances of specific languages like Polish, leading to higher fertility ratios, increased inference costs, and restricted effective context windows. This report details the transition from the universal Mistral-based tokenization to a dedicated Polish-optimized vocabulary for the Bielik v3 models, exploring the FOCUS-based embedding initialization, the multi-stage pretraining curriculum, and the subsequent post-training alignment involving Supervised Fine-Tuning, Direct Preference Optimization, and Reinforcement Learning through Group Relative Policy Optimization with verifiable rewards.
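The "fertility" mentioned in this abstract is commonly measured as the average number of tokens produced per word. The sketch below computes it with two toy stand-in tokenizers, assuming fixed-size character chunks as a crude proxy for a coarse versus a fine vocabulary; `chunk_tokenizer` is hypothetical and is not the Mistral or Bielik tokenizer.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-separated word."""
    words = [w for t in texts for w in t.split()]
    tokens = sum(len(tokenize(w)) for w in words)
    return tokens / len(words)

def chunk_tokenizer(size):
    """Toy tokenizer that splits a word into fixed-size character chunks (illustration only)."""
    def tokenize(word):
        return [word[i:i + size] for i in range(0, len(word), size)]
    return tokenize

sample = ["dobry wieczór", "niebezpieczeństwo"]
fine_grained = fertility(sample, chunk_tokenizer(2))   # stands in for a poorly matched vocabulary
coarse = fertility(sample, chunk_tokenizer(4))         # stands in for a better matched vocabulary
```

Lower fertility means fewer tokens for the same text, which is why it translates into lower inference cost and more text fitting in a fixed context window.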
[27] OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models cs.CLPDF
Xiaomeng Hu, Yinger Zhang, Fei Huang, Jianhong Tu, Yang Su
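TL;DR: This paper introduces OccuBench, a benchmark that evaluates AI agents on 100 real-world professional task scenarios spanning 10 industry categories and 65 specialized domains. It simulates domain-specific environments with Language World Models (LWMs) and uses a multi-agent synthesis pipeline to generate evaluation instances automatically with guaranteed solvability, calibrated difficulty, and document-grounded diversity. Evaluation covers two dimensions, task completion and environmental robustness. Across 15 frontier models, no single model dominates all industries, and implicit faults are more challenging than explicit errors.
Details
Motivation: Existing benchmarks can evaluate AI agents only in the few domains where public environments exist, yet agents are expected to perform professional work across hundreds of occupational domains, so a systematic cross-industry evaluation method is needed.
Result: On OccuBench, 15 frontier models from 8 model families were evaluated. GPT-5.2 improves by 27.5 points from minimal to maximum reasoning effort, and implicit faults (such as truncated data and missing fields) are harder to handle than explicit errors (such as timeouts and 500 errors) and mixed faults.
Insight: The contributions include the first systematic cross-industry evaluation of AI agents on professional occupational tasks; an environment simulation method based on Language World Models, which simulates domain-specific environments through LLM-driven tool response generation; a multi-agent synthesis pipeline that automatically generates evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity; and an environmental robustness dimension that tests agents under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). Viewed objectively, the work highlights how critical simulator quality is to the reliability of LWM-based evaluation, and shows that agent ability and simulator ability can diverge.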
Abstract: AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can only evaluate agents in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language World Models (LWMs) that simulate domain-specific environments through LLM-driven tool response generation. Our multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries, as each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than both explicit errors (timeouts, 500s) and mixed faults, because they lack overt error signals and require the agent to independently detect data degradation; (3) larger models, newer generations, and higher reasoning effort consistently improve performance. GPT-5.2 improves by 27.5 points from minimal to maximum reasoning effort; and (4) strong agents are not necessarily strong environment simulators. Simulator quality is critical for LWM-based evaluation reliability. OccuBench provides the first systematic cross-industry evaluation of AI agents on professional occupational tasks.
[28] AOP-Smart: A RAG-Enhanced Large Language Model Framework for Adverse Outcome Pathway Analysis cs.CL | cs.AIPDF
Qinjiang Niu, Lu Yan
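TL;DR: This paper proposes AOP-Smart, a retrieval-augmented generation (RAG) framework that uses the official AOP-Wiki XML data (including Key Events and Key Event Relationships) to retrieve relevant knowledge for large language models (LLMs). It targets the hallucination problem LLMs show on Adverse Outcome Pathway (AOP) analysis tasks in toxicology, with the goal of improving answer accuracy and reliability.
Details
Motivation: LLMs hallucinate on AOP-related question answering and mechanistic reasoning tasks, producing content that is inconsistent with facts or lacks evidence, which limits their reliability. The study aims to address this by augmenting LLMs with a structured knowledge source through a RAG framework.
Result: Three mainstream LLMs (GPT, DeepSeek, and Gemini) were tested on a set of 20 AOP-related question answering tasks. Without RAG, their accuracies were 15.0%, 35.0%, and 20.0%; with AOP-Smart's RAG, accuracies rose to 95.0%, 100.0%, and 95.0%, a marked improvement in answer accuracy.
Insight: The novelty is a structured RAG framework built for a specific domain (AOP) that retrieves precisely using the Key Events and relationships in an official knowledge base (AOP-Wiki), effectively reducing LLM hallucination. It offers a reusable pattern for other professional question answering tasks that demand high reliability: pair the LLM with an authoritative domain data source to improve generation quality.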
Abstract: Adverse Outcome Pathways (AOPs) are an important knowledge framework in toxicological research and risk assessment. In recent years, large language models (LLMs) have gradually been applied to AOP-related question answering and mechanistic reasoning tasks. However, due to the existence of the hallucination problem, that is, the model may generate content that is inconsistent with facts or lacks evidence, their reliability is still limited. To address this issue, this study proposes an AOP-oriented Retrieval-Augmented Generation (RAG) framework, AOP-Smart. Based on the official XML data from AOP-Wiki, this method uses Key Events (KEs), Key Event Relationships (KERs), and specific AOP information to retrieve relevant knowledge for user questions, thereby improving the reliability of the generated results of large language models. To evaluate the effectiveness of the proposed method, this study constructed a test set containing 20 AOP-related question answering tasks, covering KE identification, upstream and downstream KE retrieval, and complex AOP retrieval tasks. Experiments were conducted on three mainstream large language models, Gemini, DeepSeek, and ChatGPT, and comparative tests were performed under two settings: without RAG and with RAG. The experimental results show that, without using RAG, the accuracies of GPT, DeepSeek, and Gemini were 15.0%, 35.0%, and 20.0%, respectively; after using RAG, their accuracies increased to 95.0%, 100.0%, and 95.0%, respectively. The results indicate that AOP-Smart can significantly alleviate the hallucination problem of large language models in AOP knowledge tasks, and greatly improve the accuracy and consistency of their answers.
[29] When Verification Fails: How Compositionally Infeasible Claims Escape Rejection cs.CL | cs.AIPDF
Muxin Liu, Delip Rao, Grace Kim, Chris Callison-Burch
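TL;DR: This paper shows that existing scientific claim verification benchmarks cannot tell rigorous verifiers apart from shortcut models that check only the most salient constraint. It constructs compositionally infeasible claims to expose over-acceptance, and uses context interventions to show that differences between models stem mainly from verification thresholds rather than reasoning ability.
Details
Motivation: Existing scientific claim verification benchmarks cannot distinguish strict verification (under the Closed-World Assumption) from shortcut models that rely only on checking salient constraints. The paper addresses this gap to assess models' actual reasoning ability more accurately.
Result: In experiments across model families and modalities, models that saturate existing benchmarks consistently over-accept compositionally infeasible claims, confirming that shortcut reasoning is widespread. Context interventions show that different models and prompting strategies sit at different points on a shared ROC curve, indicating that the differences reflect verification thresholds rather than underlying reasoning ability.
Insight: The novelty is the construction of compositionally infeasible claims as a new evaluation tool. It reveals a structural bottleneck in current verification behavior (a limit on compositional inference) and shows that strategy guidance alone cannot overcome it, pointing to new directions for improving verification models.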
Abstract: Scientific claim verification, the task of determining whether claims are entailed by scientific evidence, is fundamental to establishing discoveries in evidence while preventing misinformation. This process involves evaluating each asserted constraint against validated evidence. Under the Closed-World Assumption (CWA), a claim is accepted if and only if all asserted constraints are positively supported. We show that existing verification benchmarks cannot distinguish models enforcing this standard from models applying a simpler shortcut called salient-constraint checking, which applies CWA’s rejection criterion only to the most salient constraint and accepts when that constraint is supported. Because existing benchmarks construct infeasible claims by perturbing a single salient element, they are insufficient at distinguishing between rigorous claim verification and simple salient-constraint reliance. To separate the two, we construct compositionally infeasible claims where the salient constraint is supported but a non-salient constraint is contradicted. Across model families and modalities, models that otherwise saturate existing benchmarks consistently over-accept these claims, confirming the prevalence of such shortcut reasoning. Via model context interventions, we show that different models and prompting strategies occupy distinct positions on a shared ROC curve, indicating that the gap between model families reflects differences in verification threshold rather than underlying reasoning ability, and that the compositional inference bottleneck is a structural property of current verification behavior that strategy guidance alone cannot overcome.
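The difference between the two acceptance rules in this abstract can be sketched as two small functions. The claim, its constraints, the support labels, and the salience scores below are all hypothetical; the sketch only illustrates why a compositionally infeasible claim separates the rules, not how any model actually computes salience.

```python
def cwa_verify(constraints, evidence):
    """Closed-World Assumption: accept only if every constraint is positively supported."""
    return all(evidence.get(c) is True for c in constraints)

def salient_verify(constraints, evidence, salience):
    """Shortcut: check only the most salient constraint and accept if it is supported."""
    top = max(constraints, key=lambda c: salience[c])
    return evidence.get(top) is True

# Hypothetical claim: "Compound X lowers blood pressure in adults at 10 mg."
constraints = ["effect", "population", "dose"]
salience = {"effect": 0.9, "population": 0.4, "dose": 0.2}
# Compositionally infeasible: the salient constraint is supported, a non-salient one is contradicted.
evidence = {"effect": True, "population": True, "dose": False}

strict = cwa_verify(constraints, evidence)                   # False: the dose is contradicted
shortcut = salient_verify(constraints, evidence, salience)   # True: only "effect" is checked
```

If the perturbed element were the salient one instead, both rules would reject, which is why benchmarks that perturb a single salient element cannot tell them apart.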
[30] When Valid Signals Fail: Regime Boundaries Between LLM Features and RL Trading Policies cs.CL | cs.AI | cs.CEPDF
Zhengzhe Yang
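TL;DR: This paper asks whether large language models (LLMs) can generate continuous numerical features for reinforcement learning (RL) trading agents. The author builds a modular pipeline in which a frozen LLM acts as a stateless feature extractor, turning unstructured daily news and filings into a fixed-dimensional vector consumed by a downstream PPO agent. An automated prompt-optimization loop treats the extraction prompt as a discrete hyperparameter and tunes it directly against the Information Coefficient (the Spearman rank correlation between predicted and realized returns) rather than an NLP loss. The optimized prompt finds predictive features (IC above 0.15 on held-out data). However, these valid intermediate representations do not automatically translate into downstream performance: during a distribution shift caused by a macroeconomic shock, LLM-derived features add noise and the augmented agent underperforms a price-only baseline. In a calmer test regime the agent recovers, but macroeconomic state variables remain the most robust driver of policy improvement. The findings highlight a gap between feature-level validity and policy-level robustness that echoes known challenges in transfer learning under distribution shift.
Details
Motivation: To explore whether continuous numerical features generated by an LLM can improve RL trading agents, and to study how distribution shift affects otherwise valid feature extraction in financial time-series prediction.
Result: The optimized prompt produces predictive features on held-out data (IC > 0.15), but under a distribution shift caused by a macroeconomic shock the LLM features add noise and the augmented agent falls below the price-only baseline. In a calmer regime the agent recovers, yet macroeconomic variables remain the most robust driver of policy improvement.
Insight: The contributions are integrating an LLM as a stateless feature extractor into an RL trading pipeline and an automated loop that optimizes the prompt directly against a financial metric (the Information Coefficient). Viewed objectively, the study exposes a disconnect under distribution shift between feature validity and downstream robustness, which matters for applying LLMs in dynamic environments.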
Abstract: Can large language models (LLMs) generate continuous numerical features that improve reinforcement learning (RL) trading agents? We build a modular pipeline where a frozen LLM serves as a stateless feature extractor, transforming unstructured daily news and filings into a fixed-dimensional vector consumed by a downstream PPO agent. We introduce an automated prompt-optimization loop that treats the extraction prompt as a discrete hyperparameter and tunes it directly against the Information Coefficient (the Spearman rank correlation between predicted and realized returns) rather than NLP losses. The optimized prompt discovers genuinely predictive features (IC above 0.15 on held-out data). However, these valid intermediate representations do not automatically translate into downstream task performance: during a distribution shift caused by a macroeconomic shock, LLM-derived features add noise, and the augmented agent underperforms a price-only baseline. In a calmer test regime the agent recovers, yet macroeconomic state variables remain the most robust driver of policy improvement. Our findings highlight a gap between feature-level validity and policy-level robustness that parallels known challenges in transfer learning under distribution shift.
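Since the Information Coefficient is defined here as a Spearman rank correlation, a small dependency-free sketch may help: rank both series (averaging ranks over ties) and take the Pearson correlation of the ranks. The function names and the sample numbers are illustrative assumptions; this is not the paper's code.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block order[i..j]
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def information_coefficient(predicted, realized):
    """Spearman rank correlation between predicted and realized returns."""
    return pearson(average_ranks(predicted), average_ranks(realized))

# Hypothetical feature scores and next-day returns for four assets.
ic = information_coefficient([0.1, 0.2, 0.3, 0.4], [0.01, -0.02, 0.05, 0.03])
# ic is 0.6 (up to floating-point rounding): each rank is off by exactly one position
```

Because only ranks matter, the metric rewards getting the ordering of returns right and ignores their scale, which is what makes it usable as a direct tuning target for a prompt.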
[31] Uncertainty-Aware Web-Conditioned Scientific Fact-Checking cs.CL | cs.AIPDF
Ashwin Vinod, Katrin Erk
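TL;DR: This paper proposes a scientific fact-checking framework built on atomic predicate-argument decomposition and calibrated, uncertainty-gated corroboration. Complex claims are decomposed into atomic facts, aligned to local evidence snippets via embeddings, and verified by a compact evidence-grounded checker; only facts with uncertain support trigger a domain-restricted web search. The system supports both binary and three-way classification and is evaluated in context-only and context-plus-web modes. When retrieved evidence conflicts with the provided context, it abstains rather than overriding the context.
Details
Motivation: Existing fact-checking systems in scientific domains such as biomedicine and materials science often hallucinate or reason inconsistently, particularly when verifying technical, compositional claims against an evidence snippet under source and cost/latency constraints.
Result: On multiple benchmarks the framework surpasses the strongest baselines. Web corroboration was triggered for only a minority of atomic facts on average, indicating that external evidence is consulted selectively under calibrated uncertainty rather than routinely.
Insight: The novelty is combining atomic-granularity decomposition with calibrated, uncertainty-gated corroboration to obtain more interpretable, context-conditioned verification. The approach suits high-stakes, single-document settings that require traceable rationales, predictable cost/latency, and conservative behavior.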
Abstract: Scientific fact-checking is vital for assessing claims in specialized domains such as biomedicine and materials science, yet existing systems often hallucinate or apply inconsistent reasoning, especially when verifying technical, compositional claims against an evidence snippet under source and cost/latency constraints. We present a pipeline centered on atomic predicate-argument decomposition and calibrated, uncertainty-gated corroboration: atomic facts are aligned to local snippets via embeddings, verified by a compact evidence-grounded checker, and only facts with uncertain support trigger domain-restricted web search over authoritative sources. The system supports both binary and tri-valued classification, predicting labels from Supported, Refuted, and NEI for three-way tasks. We evaluate under two regimes, Context-Only (no web) and Context+Web (uncertainty-gated web corroboration); when retrieved evidence conflicts with the provided context, we abstain with NEI rather than overriding the context. On multiple benchmarks, our framework surpasses the strongest baselines. In our experiments, web corroboration was invoked for only a minority of atomic facts on average, indicating that external evidence is consulted selectively under calibrated uncertainty rather than routinely. Overall, coupling atomic granularity with calibrated, uncertainty-gated corroboration yields more interpretable and context-conditioned verification, making the approach well-suited to high-stakes, single-document settings that demand traceable rationales, predictable cost/latency, and conservative behavior.
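The gating and abstention behavior described in this abstract could look roughly like the sketch below. The thresholds, the use of a single calibrated support probability, and the function names `needs_web` and `reconcile` are assumptions made for illustration; the abstract does not specify how uncertainty is measured or how labels are combined beyond abstaining on conflict.

```python
def needs_web(p_supported, low=0.35, high=0.65):
    """Trigger web corroboration only when calibrated support is uncertain (thresholds assumed)."""
    return low < p_supported < high

def reconcile(context_label, web_label=None):
    """Combine a context-based label with an optional web-based label.

    If the web evidence contradicts the context, abstain with NEI instead of
    overriding the context.
    """
    if web_label is None or web_label == "NEI":
        return context_label
    if context_label == "NEI":
        return web_label
    if web_label != context_label:
        return "NEI"
    return context_label

# Hypothetical atomic facts with calibrated support probabilities from the checker.
facts = {"fact_a": 0.92, "fact_b": 0.50, "fact_c": 0.08}
to_search = [name for name, p in facts.items() if needs_web(p)]
# to_search == ["fact_b"]: only the uncertain fact is sent to web search
```

Gating this way keeps cost and latency predictable, since confident facts never leave the provided context.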
[32] A Systematic Analysis of the Impact of Persona Steering on LLM Capabilities cs.CL | cs.AIPDF
Jiaqi Chen, Ming Wang, Tingna Xie, Shi Feng, Yongkang Liu
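TL;DR: This paper systematically studies how inducing Big Five personality traits in LLMs with the NPTI framework affects performance on six cognitive benchmarks. Persona induction changes more than interaction style: it shifts cognitive task performance in a stable way, and the effect depends on the task. Building on this regularity, the authors propose Dynamic Persona Routing (DPR), a lightweight strategy that needs no additional training and outperforms the best static persona.
Details
Motivation: Giving LLMs a specific persona to tailor interaction style is common practice, but its effect on the model's underlying cognitive capabilities has not been explored. The paper aims to analyze systematically how persona steering affects core LLM capabilities.
Result: Evaluation on six cognitive benchmarks shows that persona induction produces stable, reproducible shifts in cognitive task performance. Effect size varies by trait dimension, with Openness and Extraversion having the strongest influence. The proposed DPR strategy outperforms the best static persona without additional training.
Insight: The novelty is the first systematic quantification of the substantive, predictable effect of personality trait induction on LLM cognitive capabilities (not only style), together with the finding of 73.68% directional consistency with human personality-cognition relationships. The resulting DPR strategy shows the potential of exploiting this regularity for query-adaptive optimization.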
Abstract: Imbuing Large Language Models (LLMs) with specific personas is prevalent for tailoring interaction styles, yet the impact on underlying cognitive capabilities remains unexplored. We employ the Neuron-based Personality Trait Induction (NPTI) framework to induce Big Five personality traits in LLMs and evaluate performance across six cognitive benchmarks. Our findings reveal that persona induction produces stable, reproducible shifts in cognitive task performance beyond surface-level stylistic changes. These effects exhibit strong task dependence: certain personalities yield consistent gains on instruction-following, while others impair complex reasoning. Effect magnitude varies systematically by trait dimension, with Openness and Extraversion exerting the most robust influence. Furthermore, LLM effects show 73.68% directional consistency with human personality-cognition relationships. Capitalizing on these regularities, we propose Dynamic Persona Routing (DPR), a lightweight query-adaptive strategy that outperforms the best static persona without additional training.
[33] How Robust Are Large Language Models for Clinical Numeracy? An Empirical Study on Numerical Reasoning Abilities in Clinical Contexts cs.CLPDF
Minh-Vuong Nguyen, Fatemeh Shiri, Zhuang Li, Karin Verspoor
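TL;DR: This paper introduces ClinicNumRobBench, a benchmark of 1,624 instances for evaluating the numerical reasoning of large language models in clinical settings across four types: value retrieval, arithmetic computation, relational comparison, and aggregation. Testing 14 LLMs shows that value retrieval is strong, while relational comparison and aggregation remain challenging and performance is sensitive to changes in clinical note format.
Details
Motivation: LLMs are increasingly used for clinical question answering and decision support, but safe deployment requires reliable handling of patient measurements in heterogeneous clinical notes. Existing evaluations have limited operation-level coverage, are confined mostly to arithmetic computation, and rarely assess robustness of numerical understanding across clinical note formats.
Result: On ClinicNumRobBench, most models exceed 85% accuracy on value retrieval but do worse on relational comparison and aggregation, with some models scoring below 15%. Fine-tuning on medical data can reduce numeracy by more than 30% relative to base models, and performance drops when note style varies.
Insight: The novelty is a benchmark that comprehensively evaluates the robustness of clinical numerical reasoning. It reveals marked weaknesses of LLMs on complex clinical numeracy tasks (such as relational comparison and aggregation) and their sensitivity to input format, which are key insights for building more reliable clinical AI systems.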
Abstract: Large Language Models (LLMs) are increasingly being explored for clinical question answering and decision support, yet safe deployment critically requires reliable handling of patient measurements in heterogeneous clinical notes. Existing evaluations of LLMs for clinical numerical reasoning provide limited operation-level coverage, restricted primarily to arithmetic computation, and rarely assess the robustness of numerical understanding across clinical note formats. We introduce ClinicNumRobBench, a benchmark of 1,624 context-question instances with ground-truth answers that evaluates four main types of clinical numeracy: value retrieval, arithmetic computation, relational comparison, and aggregation. To stress-test robustness, ClinicNumRobBench presents longitudinal MIMIC-IV vital-sign records in three semantically equivalent representations, including a real-world note-style variant derived from the Open Patients dataset, and instantiates queries using 42 question templates. Experiments on 14 LLMs show that value retrieval is generally strong, with most models exceeding 85% accuracy, while relational comparison and aggregation remain challenging, with some models scoring below 15%. Fine-tuning on medical data can reduce numeracy relative to base models by over 30%, and performance drops under note-style variation indicate LLM sensitivity to format. ClinicNumRobBench offers a rigorous testbed for clinically reliable numerical reasoning. Code and data are available at https://github.com/MinhVuong2000/ClinicNumRobBench.
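The four numeracy types named in this abstract can be illustrated on a toy vital-sign record. The readings below are invented for the example and are not taken from MIMIC-IV or the benchmark; the question wording in the comments is likewise hypothetical.

```python
# Hypothetical heart-rate readings (time, beats per minute) for one patient.
heart_rate = [("08:00", 82), ("12:00", 95), ("16:00", 110), ("20:00", 88)]
by_time = dict(heart_rate)

# Value retrieval: "What was the heart rate at 12:00?"
retrieval = by_time["12:00"]                      # 95

# Arithmetic computation: "How much did it rise between 08:00 and 16:00?"
arithmetic = by_time["16:00"] - by_time["08:00"]  # 28

# Relational comparison: "Was the 20:00 reading higher than the 08:00 reading?"
comparison = by_time["20:00"] > by_time["08:00"]  # True

# Aggregation: "What was the highest heart rate recorded?"
aggregation = max(value for _, value in heart_rate)  # 110
```

Each answer is trivial to compute from structured data; the benchmark's point is that a model must recover the same answers when the record is presented in different textual formats.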
[34] Evaluating Memory Capability in Continuous Lifelog Scenario cs.CLPDF
Jianjie Zheng, Zhichen Liu, Zhanyu Shen, Jingxiang Qu, Guanhua Chen
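TL;DR: For the scenario in which wearable devices continuously record everyday conversations, this paper proposes LifeDialBench, a new benchmark with two subsets: EgoMem, built on real first-person videos, and LifeMem, built from a simulated virtual community. It also introduces a strict online evaluation protocol that prevents temporal leakage. Experiments find that current sophisticated memory systems do not even match a simple RAG baseline in realistic streaming settings, exposing the harm done by over-design and lossy compression.
Details
Motivation: Existing benchmarks focus on online one-on-one chat or human-AI interaction and overlook the distinct demands of real-world continuous lifelogging, and public lifelogging audio datasets are scarce.
Result: Under online evaluation on LifeDialBench, current advanced memory systems fail to outperform a simple retrieval-augmented generation (RAG) baseline.
Insight: The novelty is a hierarchically synthesized benchmark built specifically for continuous lifelogging, plus a strict online evaluation protocol. Viewed objectively, the core insight is that in lifelog scenarios, preserving context with high fidelity matters more than elaborate system structure; over-design and lossy compression are harmful.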
Abstract: Nowadays, wearable devices can continuously lifelog ambient conversations, creating substantial opportunities for memory systems. However, existing benchmarks primarily focus on online one-on-one chatting or human-AI interactions, thus neglecting the unique demands of real-world scenarios. Given the scarcity of public lifelogging audio datasets, we propose a hierarchical synthesis framework to curate LifeDialBench, a novel benchmark comprising two complementary subsets: EgoMem, built on real-world egocentric videos, and LifeMem, constructed using a simulated virtual community. Crucially, to address the issue of temporal leakage in traditional offline settings, we propose an Online Evaluation protocol that strictly adheres to temporal causality, ensuring systems are evaluated in a realistic streaming fashion. Our experimental results reveal a counterintuitive finding: current sophisticated memory systems fail to outperform a simple RAG-based baseline. This highlights the detrimental impact of over-designed structures and lossy compression in current approaches, emphasizing the necessity of high-fidelity context preservation for lifelog scenarios. We release our code and data at https://github.com/qys77714/LifeDialBench.
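The temporal-causality rule behind the Online Evaluation protocol can be sketched as a filter on what a memory system is allowed to see when a question arrives. The `Utterance` type, the integer timestamps, and the sample stream are hypothetical; the benchmark's real data format and protocol details are not given in the abstract.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    t: int      # timestamp of the recorded utterance (arbitrary units)
    text: str

def online_view(stream, query_time):
    """Online setting: only utterances recorded up to the query time are visible."""
    return [u for u in stream if u.t <= query_time]

def offline_view(stream):
    """Offline setting: the whole log is visible, including utterances after the query."""
    return list(stream)

stream = [
    Utterance(1, "Let's meet at the cafe on Friday."),
    Utterance(5, "Actually, Friday is off, let's do Saturday."),
    Utterance(9, "Saturday at ten works."),
]

# A question asked at t=3 ("When are we meeting?") may only use the first utterance.
visible = online_view(stream, query_time=3)
leaked = [u for u in offline_view(stream) if u.t > 3]
```

In the offline view, the two later utterances would leak the corrected plan into an answer that should still have been "Friday" at the time of asking.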
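[35] MathAgent: Adversarial Evolution of Constraint Graphs for Mathematical Reasoning Data Synthesis cs.CL | cs.AI
Zixiong Yu, Jun Rao, Guhan Chen, Songtao Tian, Bohan Li
TL;DR: This paper proposes MathAgent, a hierarchical synthesis framework that treats mathematical reasoning data synthesis as an unsupervised optimization problem over a constraint graph. It follows a Legislator-Executor paradigm: the Legislator adversarially evolves structured generation blueprints that encode problem constraints, and the Executor instantiates these specifications as diverse natural-language scenarios. Experiments on 10 models show that models fine-tuned on only 1K synthesized samples outperform those trained on widely used datasets of comparable scale across eight mathematical benchmarks.
Details
Motivation: Synthesizing high-quality mathematical reasoning data without human priors remains difficult; existing methods usually rely on seed-data mutation or simple prompt engineering and suffer from mode collapse and limited logical complexity.
Result: In experiments on 10 models from the Qwen, Llama, Mistral, and Gemma series, models fine-tuned on 1K synthesized samples outperform those trained on widely used datasets of comparable scale, such as LIMO and s1K, across eight mathematical benchmarks and show superior out-of-distribution generalization.
Insight: The novelty lies in treating data synthesis as constraint-graph optimization rather than direct text generation. The Legislator-Executor paradigm decouples skeleton design from linguistic realization and prioritizes building complex, diverse logical structures to guide high-quality synthesis. Viewed objectively, this hierarchical adversarial-evolution framework effectively raises the logical complexity and diversity of the synthesized data.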
Abstract: Synthesizing high-quality mathematical reasoning data without human priors remains a significant challenge. Current approaches typically rely on seed data mutation or simple prompt engineering, often suffering from mode collapse and limited logical complexity. This paper proposes a hierarchical synthesis framework that formulates data synthesis as an unsupervised optimization problem over a constraint graph followed by semantic instantiation, rather than treating it as a direct text generation task. We introduce a Legislator-Executor paradigm: The Legislator adversarially evolves structured generation blueprints encoding the constraints of the problem, while the Executor instantiates these specifications into diverse natural language scenarios. This decoupling of skeleton design from linguistic realization enables a prioritized focus on constructing complex and diverse logical structures, thereby guiding high-quality data synthesis. Experiments conducted on a total of 10 models across the Qwen, Llama, Mistral, and Gemma series demonstrate that our method achieves notable results: models fine-tuned on 1K synthesized samples outperform widely-used datasets of comparable scale (LIMO, s1K) across eight mathematical benchmarks, exhibiting superior out-of-distribution generalization.
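[36] TRACE: An Experiential Framework for Coherent Multi-hop Knowledge Graph Question Answering cs.CL
Yingxu Wang, Jiaxin Huang, Mengzhu Wang, Nan Yin
TL;DR: This paper proposes TRACE, an experiential framework for improving the coherence and robustness of multi-hop knowledge graph question answering (KGQA). It combines LLM-driven contextual reasoning with exploration priors: evolving reasoning paths are dynamically translated into natural-language narratives to maintain semantic continuity, and prior exploration trajectories are abstracted into reusable experiential priors that capture recurring exploration patterns.
Details
Motivation: Existing multi-hop KGQA methods usually treat each reasoning step independently and fail to exploit experience from earlier explorations, which leads to fragmented reasoning and redundant exploration. This paper aims to fix these problems and improve the coherence and efficiency of reasoning.
Result: Extensive experiments on multiple KGQA benchmarks show that TRACE consistently outperforms state-of-the-art baselines.
Insight: The main novelty is a unified experiential framework that maintains semantic continuity by dynamically generating natural-language narratives, abstracts historical exploration trajectories into reusable experiential priors, and uses a dual-feedback re-ranking mechanism to guide relation selection, thereby strengthening the coherence and robustness of multi-hop reasoning.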
Abstract: Multi-hop Knowledge Graph Question Answering (KGQA) requires coherent reasoning across relational paths, yet existing methods often treat each reasoning step independently and fail to effectively leverage experience from prior explorations, leading to fragmented reasoning and redundant exploration. To address these challenges, we propose Trajectory-aware Reasoning with Adaptive Context and Exploration priors (TRACE), an experiential framework that unifies LLM-driven contextual reasoning with exploration prior integration to enhance the coherence and robustness of multi-hop KGQA. Specifically, TRACE dynamically translates evolving reasoning paths into natural language narratives to maintain semantic continuity, while abstracting prior exploration trajectories into reusable experiential priors that capture recurring exploration patterns. A dual-feedback re-ranking mechanism further integrates contextual narratives with exploration priors to guide relation selection during reasoning. Extensive experiments on multiple KGQA benchmarks demonstrate that TRACE consistently outperforms state-of-the-art baselines.
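[37] CocoaBench: Evaluating Unified Digital Agents in the Wild cs.CL | cs.AI
CocoaBench Team, Shibo Hao, Zhining Zhang, Zhiqi Liang, Tianyang Liu
TL;DR: This paper introduces CocoaBench, a benchmark for evaluating unified digital agents that combine vision, search, and coding. It also presents CocoaAgent, a lightweight shared scaffold that enables controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench.
Details
Motivation: LLM agents now perform well in many domains, and agent scaffolds and models are converging toward integrating multiple capabilities, yet existing evaluations usually test these capabilities in isolation and do not cover the diverse real-world use cases that require combining them.
Result: On CocoaBench, the best evaluated system achieves a success rate of only 45.1%, leaving substantial room for improvement.
Insight: The novelty is a benchmark built from human-designed, long-horizon tasks, each specified only by an instruction and an automatic evaluation function over the final output, which makes evaluation reliable and scalable. The analysis points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.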
Abstract: LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet, most evaluations still test these capabilities in isolation, which leaves a gap for more diverse use cases that require agents to combine different capabilities. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving only 45.1% success rate. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.
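[38] Exploring Knowledge Conflicts for Faithful LLM Reasoning: Benchmark and Method cs.CL | cs.AI
Tianzhe Zhao, Jiaoyan Chen, Shuxiu Zhang, Haiping Zhu, Qika Lin
TL;DR: This paper addresses the difficulty large language models (LLMs) have in reasoning faithfully when the external knowledge retrieved by a retrieval-augmented generation (RAG) system conflicts, particularly between unstructured text and semi-structured knowledge graphs. It introduces ConflictQA, a new benchmark that systematically instantiates such cross-source conflicts, and finds that LLMs perform poorly in this setting. Building on this finding, the authors propose XoT, a two-stage explanation-based thinking framework designed for reasoning over heterogeneous conflicting evidence, and validate its effectiveness experimentally.
Details
Motivation: Existing work focuses mainly on conflicts between external knowledge and the internal parametric knowledge of LLMs and neglects conflicts between different external sources, such as text and knowledge graphs. As modern RAG systems increasingly integrate heterogeneous sources to improve knowledge completeness and reasoning faithfulness, resolving cross-source conflicts becomes a key and underexplored problem.
Result: Extensive evaluation on ConflictQA shows that, when facing cross-source knowledge conflicts, representative LLMs often fail to identify reliable evidence for correct reasoning, become more sensitive to prompting choices, and tend to rely exclusively on a single evidence type (KG or text), which produces incorrect answers. Experiments confirm the effectiveness of the proposed XoT framework.
Insight: The contributions are: (1) ConflictQA, the first benchmark to systematically target cross-source (text vs. knowledge graph) knowledge conflicts; (2) evidence of how fragile LLMs are under such conflicts and how sensitive they are to prompts; and (3) XoT, a novel two-stage explanation-based thinking framework designed for reasoning over heterogeneous conflicting evidence, which offers a new route to more faithful RAG reasoning.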
Abstract: Large language models (LLMs) have achieved remarkable success across a wide range of applications especially when augmented by external knowledge through retrieval-augmented generation (RAG). Despite their widespread adoption, recent studies have shown that LLMs often struggle to perform faithful reasoning when conflicting knowledge is retrieved. However, existing work primarily focuses on conflicts between external knowledge and the parametric knowledge of LLMs, leaving conflicts across external knowledge largely unexplored. Meanwhile, modern RAG systems increasingly emphasize the integration of unstructured text and (semi-)structured data like knowledge graphs (KGs) to improve knowledge completeness and reasoning faithfulness. To address this gap, we introduce ConflictQA, a novel benchmark that systematically instantiates conflicts between textual evidence and KG evidence. Extensive evaluations across representative LLMs reveal that, facing such cross-source conflicts, LLMs often fail to identify reliable evidence for correct reasoning. Instead, LLMs become more sensitive to prompting choices and tend to rely exclusively on either KG or textual evidence, resulting in incorrect responses. Based on these findings, we further propose XoT, a two-stage explanation-based thinking framework tailored for reasoning over heterogeneous conflicting evidence, and verify its effectiveness with extensive experiments.
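[39] HiEdit: Lifelong Model Editing with Hierarchical Reinforcement Learning cs.CL
Yangfan Wang, Tianyang Sun, Chen Tang, Jie Liu, Wei Cai
TL;DR: This paper proposes HiEdit, a lifelong model editing framework based on hierarchical reinforcement learning. By adaptively identifying the most knowledge-relevant layers for each editing instance, it makes precise, localized model updates that correct outdated or inaccurate knowledge in LLMs while minimizing side effects on unrelated inputs.
Details
Motivation: Existing lifelong model editing methods usually apply parameter perturbations to a static, dense set of layers for every editing instance, ignoring that different knowledge may be stored in different layers. This limits adaptability when integrating new knowledge and can cause catastrophic forgetting of both general and previously edited knowledge.
Result: Experiments on various LLMs show that HiEdit improves on the competitive RLEdit method by an average of 8.48% while perturbing only half of the layers per edit.
Insight: The novelty lies in a hierarchical reinforcement learning framework that performs dynamic, instance-aware layer selection, combined with an intrinsic sparsity reward, so that the parameters tied to a specific piece of knowledge are located and updated more precisely. This gives model editing a finer-grained, more adaptive control mechanism.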
Abstract: Lifelong model editing (LME) aims to sequentially rectify outdated or inaccurate knowledge in deployed LLMs while minimizing side effects on unrelated inputs. However, existing approaches typically apply parameter perturbations to a static and dense set of LLM layers for all editing instances. This practice is counter-intuitive, as we hypothesize that different pieces of knowledge are stored in distinct layers of the model. Neglecting this layer-wise specificity can impede adaptability in integrating new knowledge and result in catastrophic forgetting for both general and previously edited knowledge. To address this, we propose HiEdit, a hierarchical reinforcement learning framework that adaptively identifies the most knowledge-relevant layers for each editing instance. By enabling dynamic, instance-aware layer selection and incorporating an intrinsic reward for sparsity, HiEdit achieves precise, localized updates. Experiments on various LLMs show that HiEdit boosts the performance of the competitive RLEdit by an average of 8.48% while perturbing only half of the layers per edit. Our code is available at: https://github.com/yangfanww/hiedit.
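[40] Dialectic-Med: Mitigating Diagnostic Hallucinations via Counterfactual Adversarial Multi-Agent Debate cs.CL
Zhixiang Lu, Jionglong Su
TL;DR: This paper proposes Dialectic-Med, a multi-agent adversarial debate framework that targets diagnostic hallucinations caused by confirmation bias in medical multimodal large language models (MLLMs). Three role-specialized agents (a proponent, an opponent, and a mediator) debate dynamically; the opponent uses a novel visual falsification module to retrieve contradictory visual evidence, so that diagnostic reasoning stays grounded in verified visual regions.
Details
Motivation: Medical MLLMs suffer from severe confirmation bias and often hallucinate visual details to support an initial, possibly erroneous diagnostic hypothesis. Existing chain-of-thought methods lack an intrinsic correction mechanism and are prone to error propagation.
Result: On the MIMIC-CXR-VQA, VQA-RAD, and PathVQA benchmarks, Dialectic-Med achieves state-of-the-art performance, substantially improves the trustworthiness of the reasoning process and the faithfulness of explanations, mitigates hallucinations, and surpasses single-agent baselines.
Insight: The novelty lies in adversarial dialectics: multi-agent debate, and the visual falsification module in particular, explicitly models the cognitive process of falsification and keeps diagnostic reasoning grounded in visual evidence. Viewed objectively, combining debate with visual verification offers a new way to make MLLMs more reliable and interpretable.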
Abstract: Multimodal Large Language Models (MLLMs) in healthcare suffer from severe confirmation bias, often hallucinating visual details to support initial, potentially erroneous diagnostic hypotheses. Existing Chain-of-Thought (CoT) approaches lack intrinsic correction mechanisms, rendering them vulnerable to error propagation. To bridge this gap, we propose Dialectic-Med, a multi-agent framework that enforces diagnostic rigor through adversarial dialectics. Unlike static consensus models, Dialectic-Med orchestrates a dynamic interplay between three role-specialized agents: a proponent that formulates diagnostic hypotheses; an opponent equipped with a novel visual falsification module that actively retrieves contradictory visual evidence to challenge the Proponent; and a mediator that resolves conflicts via a weighted consensus graph. By explicitly modeling the cognitive process of falsification, our framework guarantees that diagnostic reasoning is tightly grounded in verified visual regions. Empirical evaluations on MIMIC-CXR-VQA, VQA-RAD, and PathVQA demonstrate that Dialectic-Med not only achieves state-of-the-art performance but also fundamentally enhances the trustworthiness of the reasoning process. Beyond accuracy, our approach significantly enhances explanation faithfulness and decisively mitigates hallucinations, establishing a new standard over single-agent baselines.
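[41] Enhancing Multimodal Large Language Models for Ancient Chinese Character Evolution Analysis via Glyph-Driven Fine-Tuning cs.CL | cs.AI
Rui Song, Lida Shi, Ruihua Qi, Yingji Li, Hao Xu
TL;DR: This paper proposes GEVO, a glyph-driven fine-tuning framework that strengthens multimodal large language models for analyzing the evolution of ancient Chinese characters. The authors build a comprehensive benchmark of 11 tasks and more than 130,000 instances to evaluate model performance and find that existing models are limited in glyph comparison and on core tasks. After fine-tuning with GEVO, even 2B-scale models improve consistently across all tasks.
Details
Motivation: Multimodal large language models cannot yet systematically support the analysis of ancient Chinese character evolution, a problem that remains largely unexplored.
Result: On the constructed benchmark, models fine-tuned with GEVO, including those at the 2B scale, achieve consistent and comprehensive improvements across all 11 evaluation tasks.
Insight: The novelty is a glyph-driven fine-tuning framework that explicitly guides models to capture evolutionary consistency in glyph transformations and thereby improves their understanding of text evolution. The paper also builds the first large-scale multi-task benchmark for analyzing the evolution of ancient Chinese characters, a resource for future research.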
Abstract: In recent years, rapid advances in Multimodal Large Language Models (MLLMs) have increasingly stimulated research on ancient Chinese scripts. As the evolution of written characters constitutes a fundamental pathway for understanding cultural transformation and historical continuity, how MLLMs can be systematically leveraged to support and advance text evolution analysis remains an open and largely underexplored problem. To bridge this gap, we construct a comprehensive benchmark comprising 11 tasks and over 130,000 instances, specifically designed to evaluate the capability of MLLMs in analyzing the evolution of ancient Chinese scripts. We conduct extensive evaluations across multiple widely used MLLMs and observe that, while existing models demonstrate a limited ability in glyph-level comparison, their performance on core tasks, such as character recognition and evolutionary reasoning, remains substantially constrained. Motivated by these findings, we propose a glyph-driven fine-tuning framework (GEVO) that explicitly encourages models to capture evolutionary consistency in glyph transformations and enhances their understanding of text evolution. Experimental results show that even models at the 2B scale achieve consistent and comprehensive performance improvements across all evaluated tasks. To facilitate future research, we publicly release both the benchmark and the trained models at https://github.com/songruiecho/GEVO.
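[42] Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning cs.CL | cs.AI
Bo Li, Mingda Wang, Gexiang Fang, Shikun Zhang, Wei Ye
TL;DR: This paper proposes GRIP, a unified framework that embeds the retrieval decisions of retrieval-augmented generation (RAG) directly into the generation process. Through self-triggered information planning, the model decides within a single autoregressive decoding pass when to retrieve, how to reformulate queries, and when to stop, which tightly couples retrieval and reasoning.
Details
Motivation: The paper revisits the RAG paradigm. Conventional methods treat retrieval as an external intervention that requires an additional controller or classifier; integrating retrieval control directly into generation enables end-to-end coordination.
Result: Experiments on five QA benchmarks show that GRIP surpasses strong RAG baselines and is competitive with GPT-4o while using substantially fewer parameters.
Insight: The novelty is the "Retrieval as Generation" paradigm: self-triggered information planning is realized through control-token emission, which folds retrieval decisions into generation and supports dynamic multi-step reasoning with on-the-fly evidence integration. This provides a new unified framework for end-to-end retrieval-augmented generation.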
Abstract: We revisit retrieval-augmented generation (RAG) by embedding retrieval control directly into generation. Instead of treating retrieval as an external intervention, we express retrieval decisions within token-level decoding, enabling end-to-end coordination without additional controllers or classifiers. Under the paradigm of Retrieval as Generation, we propose \textbf{GRIP} (\textbf{G}eneration-guided \textbf{R}etrieval with \textbf{I}nformation \textbf{P}lanning), a unified framework in which the model regulates retrieval behavior through control-token emission. Central to GRIP is \textit{Self-Triggered Information Planning}, which allows the model to decide when to retrieve, how to reformulate queries, and when to terminate, all within a single autoregressive trajectory. This design tightly couples retrieval and reasoning and supports dynamic multi-step inference with on-the-fly evidence integration. To supervise these behaviors, we construct a structured training set covering answerable, partially answerable, and multi-hop queries, each aligned with specific token patterns. Experiments on five QA benchmarks show that GRIP surpasses strong RAG baselines and is competitive with GPT-4o while using substantially fewer parameters.
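[43] Think Before you Write: QA-Guided Reasoning for Character Descriptions in Books cs.CL | cs.AI | cs.IR | cs.LG
Argyrios Papoudakis, Mirella Lapata, Frank Keller
TL;DR: This paper proposes a training framework for generating character descriptions from long-form narratives such as novels. It decouples reasoning from generation and uses QA-guided reasoning traces to improve the faithfulness, informativeness, and grounding of the generated descriptions.
Details
Motivation: Generating accurate character descriptions from long-form narratives is hard: models must track evolving attributes, integrate scattered evidence, and infer implicit details. Because existing reasoning-enabled LLMs underperform on this task, the paper proposes decoupling reasoning from generation.
Result: Experiments on two datasets, BookWorm and CroSS, show that the method improves faithfulness, informativeness, and grounding over strong long-context baselines.
Insight: The novelty is to structure the reasoning process as a QA trace and separate it from generation. The idea may carry over to other text generation tasks that require complex reasoning, where it could improve interpretability and output quality.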
Abstract: Character description generation is an important capability for narrative-focused applications such as summarization, story analysis, and character-driven simulations. However, generating accurate character descriptions from long-form narratives (e.g., novels) is challenging: models must track evolving attributes (e.g., relationships and events), integrate evidence scattered across the text, and infer implicit details. Despite the success of reasoning-enabled LLMs on many benchmarks, we find that for character description generation their performance improves when built-in reasoning is disabled (i.e., an empty reasoning trace). Motivated by this, we propose a training framework that decouples reasoning from generation. Our approach, which can be applied on top of long-context LLMs or chunk-based methods, consists of a reasoning model that produces a structured QA reasoning trace and a generation model that conditions on this trace to produce the final character description. Experiments on two datasets (BookWorm and CroSS) show that QA-guided reasoning improves faithfulness, informativeness, and grounding over strong long-context baselines.
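[44] METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models cs.CL | cs.AI
Pengfeng Li, Chen Huang, Chaoqun Hao, Hongyao Chen, Xiao-Yong Wei
TL;DR: This paper introduces METER, a benchmark for systematically evaluating multi-level contextual causal reasoning in large language models (LLMs). The benchmark covers all three levels of the causal ladder under a unified context setting. The evaluation finds that LLM proficiency drops significantly as tasks move up the causal hierarchy. A mechanistic analysis based on error pattern identification and internal information flow tracing reveals two main failure modes in LLM causal reasoning.
Details
Motivation: Existing benchmarks usually evaluate contextual causal reasoning in fragmented settings and cannot guarantee context consistency or cover the full causal hierarchy. METER is proposed to close this gap.
Result: Extensive evaluation of various LLMs shows that performance drops significantly as tasks move to higher levels of the causal ladder, such as intervention and counterfactual reasoning. The benchmark provides a new, systematic testbed for evaluating the causal reasoning of LLMs.
Insight: The novelty is a unified contextual causal reasoning benchmark, METER, that covers the full causal hierarchy. Viewed objectively, the failure modes exposed by the mechanistic analysis (distraction by causally irrelevant but factually correct information, and declining faithfulness to the context as the task level rises) offer useful diagnostic insight into how LLMs reason causally and lay groundwork for future research.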
Abstract: Contextual causal reasoning is a critical yet challenging capability for Large Language Models (LLMs). Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy. To address this, we pioneer METER to systematically benchmark LLMs across all three levels of the causal ladder under a unified context setting. Our extensive evaluation of various LLMs reveals a significant decline in proficiency as tasks ascend the causal hierarchy. To diagnose this degradation, we conduct a deep mechanistic analysis via both error pattern identification and internal information flow tracing. Our analysis reveals two primary failure modes: (1) LLMs are susceptible to distraction by causally irrelevant but factually correct information at lower levels of causality; and (2) as tasks ascend the causal hierarchy, faithfulness to the provided context degrades, leading to reduced performance. We believe our work advances our understanding of the mechanisms behind LLM contextual causal reasoning and establishes a critical foundation for future research. Our code and dataset are available at https://github.com/SCUNLP/METER.
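[45] Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization cs.CL | cs.AI | cs.LG
Jiashu Yao, Heyan Huang, Chuwei Luo, Daiqing Wu, Zeming Liu
TL;DR: This paper proposes Policy Split, which divides an LLM reinforcement learning policy into a normal mode and a high-entropy mode and applies dual-mode entropy regularization to encourage diverse exploration while preserving task accuracy.
Details
Motivation: Encourage diverse exploration in LLM reinforcement learning without compromising accuracy.
Result: Across various model sizes and on both general and creative tasks, the method consistently outperforms existing entropy-guided RL baselines.
Insight: The novelty is collaborative learning through policy bifurcation and dual-mode entropy regularization: the high-entropy mode produces behavioral patterns distinct from those of the normal mode and so supplies unique learning signals.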
Abstract: To encourage diverse exploration in reinforcement learning (RL) for large language models (LLMs) without compromising accuracy, we propose Policy Split, a novel paradigm that bifurcates the policy into normal and high-entropy modes with a high-entropy prompt. While sharing model parameters, the two modes undergo collaborative dual-mode entropy regularization tailored to distinct objectives. Specifically, the normal mode optimizes for task correctness, while the high-entropy mode incorporates a preference for exploration, and the two modes learn collaboratively. Extensive experiments demonstrate that our approach consistently outperforms established entropy-guided RL baselines across various model sizes in general and creative tasks. Further analysis reveals that Policy Split facilitates dual-mode exploration, where the high-entropy mode generates behavioral patterns distinct from those of the normal mode, providing unique learning signals.
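[46] Triviality Corrected Endogenous Reward cs.CL
Xinda Wang, Zhengxu Hou, Yangshijie Zhang, Bingren Yan, Jialin Liu
TL;DR: This paper proposes TCER (Triviality Corrected Endogenous Reward), an unsupervised reinforcement learning method for open-ended text generation, where verifiable rewards are unavailable. It corrects the "triviality bias" induced by confidence rewards and uses the relative information gain between a specialist policy and a generalist reference policy to improve generation quality. TCER yields consistent improvements across multiple writing benchmarks and model architectures and also transfers effectively to mathematical reasoning.
Details
Motivation: Reinforcement learning for open-ended text generation is limited by the lack of verifiable rewards and usually depends on judge models that need annotated data or powerful closed-source models. Inspired by the recent success of confidence-based unsupervised RL on mathematical reasoning, the paper investigates whether the same principle applies to open-ended writing.
Result: Across multiple writing benchmarks and model architectures, TCER achieves consistent improvements without external supervision. It also transfers effectively to mathematical reasoning, supporting the generality of the approach across generation tasks.
Insight: The core contribution is to identify and correct the "triviality bias" that arises when confidence rewards are applied directly, by building an endogenous reward from the relative information gain between a specialist policy and a generalist reference policy, modulated by a probability-dependent correction mechanism. This gives a general and effective approach to unsupervised RL for open-ended generation.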
Abstract: Reinforcement learning for open-ended text generation is constrained by the lack of verifiable rewards, necessitating reliance on judge models that require either annotated data or powerful closed-source models. Inspired by recent work on unsupervised reinforcement learning for mathematical reasoning using confidence-based endogenous rewards, we investigate whether this principle can be adapted to open-ended writing tasks. We find that directly applying confidence rewards leads to Triviality Bias: the policy collapses toward high-probability outputs, reducing diversity and meaningful content. We propose TCER (Triviality Corrected Endogenous Reward), which addresses this bias by rewarding the relative information gain between a specialist policy and a generalist reference policy, modulated by a probability-dependent correction mechanism. Across multiple writing benchmarks and model architectures, TCER achieves consistent improvements without external supervision. Furthermore, TCER also transfers effectively to mathematical reasoning, validating the generality of our approach across different generation tasks.
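The contrast between a plain confidence reward and a reward based on relative information gain can be sketched numerically. This is an illustration under stated assumptions, not the paper's formula: the abstract does not give the form of the probability-dependent correction, so the `(1 - p)` weight below is a hypothetical stand-in, and the token probabilities are made up.

```python
import math

def confidence_reward(p_spec):
    """Baseline: mean token log-probability under the policy itself."""
    return sum(math.log(p) for p in p_spec) / len(p_spec)

def corrected_reward(p_spec, p_ref):
    """Mean per-token log-ratio between specialist and reference policies.

    The (1 - p) factor down-weights near-certain tokens; it is an assumed
    placeholder for the paper's correction mechanism.
    """
    total = 0.0
    for ps, pr in zip(p_spec, p_ref):
        gain = math.log(ps) - math.log(pr)   # relative information gain
        total += (1.0 - ps) * gain
    return total / len(p_spec)

# (specialist probabilities, reference probabilities) for two toy outputs
trivial = ([0.99, 0.98], [0.99, 0.98])      # both policies find it obvious
informative = ([0.6, 0.5], [0.2, 0.1])      # specialist knows more than reference

prefers_trivial = confidence_reward(trivial[0]) > confidence_reward(informative[0])
prefers_informative = corrected_reward(*informative) > corrected_reward(*trivial)
```

With these numbers the confidence reward ranks the trivial output higher, which is the collapse toward high-probability text the abstract calls Triviality Bias, while the log-ratio reward gives the trivial output zero credit.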
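[47] Time is Not a Label: Continuous Phase Rotation for Temporal Knowledge Graphs and Agentic Memory cs.CL | cs.AI
Weixian Waylon Li, Jiaxin Zhang, Xianan Jim Yang, Tiejun Ma, Yiwen Guo
TL;DR: This paper proposes RoMem, a drop-in temporal knowledge graph module for structured memory systems. A pretrained Semantic Speed Gate maps each relation's text embedding to a volatility score; combined with continuous phase rotation, this enables geometric shadowing, in which obsolete facts are rotated out of phase in complex vector space so that temporally correct facts naturally outrank contradictory ones without deletion.
Details
Motivation: Existing methods model time as discrete metadata: they sort by recency (burying old but permanent knowledge), simply overwrite outdated facts, or require an expensive LLM call at every ingestion step, and so cannot distinguish persistent facts from evolving ones.
Result: On temporal knowledge graph completion, RoMem achieves a state-of-the-art 72.6 MRR on ICEWS05-15. Applied to agentic memory, it delivers 2-3x MRR and answer accuracy on temporal reasoning (MultiTQ), performs strongly on the hybrid benchmark LoCoMo, preserves static memory with zero degradation (DMR-MSC), and generalizes zero-shot to an unseen financial domain (FinTMMBench).
Insight: The core novelty is to model time as continuous phase rotation rather than a discrete label. Geometric shadowing handles knowledge evolution without explicit deletion or overwriting, and the pretrained Semantic Speed Gate adaptively learns how fast each relation evolves (its volatility).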
Abstract: Structured memory representations such as knowledge graphs are central to autonomous agents and other long-lived systems. However, most existing approaches model time as discrete metadata, either sorting by recency (burying old-yet-permanent knowledge), simply overwriting outdated facts, or requiring an expensive LLM call at every ingestion step, leaving them unable to distinguish persistent facts from evolving ones. To address this, we introduce RoMem, a drop-in temporal knowledge graph module for structured memory systems, applicable to agentic memory and beyond. A pretrained Semantic Speed Gate maps each relation’s text embedding to a volatility score, learning from data that evolving relations (e.g., “president of”) should rotate fast while persistent ones (e.g., “born in”) should remain stable. Combined with continuous phase rotation, this enables geometric shadowing: obsolete facts are rotated out of phase in complex vector space, so temporally correct facts naturally outrank contradictions without deletion. On temporal knowledge graph completion, RoMem achieves state-of-the-art results on ICEWS05-15 (72.6 MRR). Applied to agentic memory, it delivers 2-3x MRR and answer accuracy on temporal reasoning (MultiTQ), dominates hybrid benchmark (LoCoMo), preserves static memory with zero degradation (DMR-MSC), and generalises zero-shot to unseen financial domains (FinTMMBench).
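The idea of rotating obsolete facts out of phase can be sketched with plain complex arithmetic. The scoring rule below (real part of a Hermitian inner product, with a rotation angle of volatility times timestamp) is an assumption chosen for illustration, not RoMem's actual parameterization, and the volatility values are hand-picked rather than produced by a learned gate.

```python
import cmath

def rotate(vec, volatility, t):
    """Rotate every complex component by the angle volatility * t."""
    phase = cmath.exp(1j * volatility * t)
    return [z * phase for z in vec]

def score(query, fact):
    """Real part of the Hermitian inner product; in-phase vectors score highest."""
    return sum((q * f.conjugate()).real for q, f in zip(query, fact))

def fact_score(base, volatility, t_fact, t_query):
    """Score a fact stored at t_fact against a query issued at t_query."""
    stored = rotate(base, volatility, t_fact)
    probe = rotate(base, volatility, t_query)
    return score(probe, stored)

base = [1 + 0j, 0 + 1j]   # toy embedding shared by two contradictory facts

# Volatile relation (e.g. "president of"): the older fact drifts out of phase.
old_volatile = fact_score(base, 0.5, t_fact=0.0, t_query=4.0)
new_volatile = fact_score(base, 0.5, t_fact=4.0, t_query=4.0)

# Persistent relation (e.g. "born in"): zero volatility, so age does not matter.
old_persistent = fact_score(base, 0.0, t_fact=0.0, t_query=4.0)
```

Under this rule the score equals the squared norm of the embedding times the cosine of volatility times elapsed time, so nothing is deleted: the stale fact simply ranks lower when the relation is volatile.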
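[48] Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale cs.CL
Liujie Zhang, Benzhe Ning, Rui Yang, Xiaoyan Yu, Jiaxing Li
TL;DR: This paper presents Relax, an asynchronous reinforcement learning engine for large-scale omni-modal post-training. It targets three challenges: heterogeneous data flows, operational robustness at scale, and the staleness-throughput tradeoff. Relax combines a full-stack omni-native architecture, independently scalable fault-isolated services, and the TransferQueue data bus for asynchronous training. It achieves significant end-to-end speedups on Qwen3-series models, supports MoE models, and reaches stable omni-modal RL convergence.
Details
Motivation: As large models extend to omni-modal inputs and agentic multi-turn workflows, RL training systems face three interdependent challenges (heterogeneous data flows, operational robustness at scale, and the staleness-throughput tradeoff) that call for a new system design.
Result: On Qwen3-4B, Relax achieves a 1.20x end-to-end speedup over veRL. Its fully asynchronous mode delivers a 1.76x speedup over colocate on Qwen3-4B and a 2.00x speedup on Qwen3-Omni-30B, with all modes converging to the same reward level. For MoE models, Relax supports R3 with only 1.9% overhead, compared with 32% degradation in veRL under the same configuration. Relax also shows stable omni-modal RL convergence on Qwen3-Omni across image, text, and audio, and sustains over 2,000 steps on video without degradation.
Insight: The novelty is threefold. Omni-modal support is built into the whole training stack (a full-stack native architecture) rather than patched onto a text-centric pipeline. Service-level decoupling turns each RL role into an independent, scalable, fault-isolated service, which improves robustness. The TransferQueue data bus and a single staleness parameter interpolate smoothly among on-policy, near-on-policy, and fully asynchronous execution, tuning the throughput-staleness tradeoff. Together these offer an architectural template for large-scale multimodal RL training systems.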
Abstract: Reinforcement learning (RL) post-training has proven effective at unlocking reasoning, self-reflection, and tool-use capabilities in large language models. As models extend to omni-modal inputs and agentic multi-turn workflows, RL training systems face three interdependent challenges: heterogeneous data flows, operational robustness at scale, and the staleness – throughput tradeoff. We present \textbf{Relax} (Reinforcement Engine Leveraging Agentic X-modality), an open-source RL training engine that addresses these challenges through three co-designed architectural layers. First, an \emph{omni-native architecture} builds multimodal support into the full stack – from data preprocessing and modality-aware parallelism to inference generation – rather than retrofitting it onto a text-centric pipeline. Second, each RL role runs as an independent, fault-isolated service that can be scaled, recovered, and upgraded without global coordination. Third, service-level decoupling enables asynchronous training via the TransferQueue data bus, where a single staleness parameter smoothly interpolates among on-policy, near-on-policy, and fully asynchronous execution. Relax achieves a 1.20$\times$ end-to-end speedup over veRL on Qwen3-4B on-policy training. Its fully async mode delivers a 1.76$\times$ speedup over colocate on Qwen3-4B and a 2.00$\times$ speedup on Qwen3-Omni-30B, while all modes converge to the same reward level. Relax supports R3 (Rollout Routing Replay)~\cite{ma2025r3} for MoE models with only 1.9% overhead, compared to 32% degradation in veRL under the same configuration. It further demonstrates stable omni-modal RL convergence on Qwen3-Omni across image, text, and audio, sustaining over 2{,}000 steps on video without degradation. Relax is available at https://github.com/rednote-ai/Relax.
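How a single staleness parameter can span on-policy to fully asynchronous training is easy to sketch. The snippet below is a hypothetical illustration, not Relax's or TransferQueue's API: `is_usable`, `select_batch`, and the rollout records are invented names, and the rule assumed here is that a rollout is usable when the trainer is at most `max_staleness` policy versions ahead of the policy that generated it.

```python
def is_usable(rollout_version, trainer_version, max_staleness):
    """A rollout is usable if it lags the trainer by at most max_staleness versions."""
    return trainer_version - rollout_version <= max_staleness

def select_batch(rollouts, trainer_version, max_staleness):
    """Filter queued rollouts down to those fresh enough to train on."""
    return [r for r in rollouts
            if is_usable(r["version"], trainer_version, max_staleness)]

rollouts = [
    {"id": "a", "version": 7},
    {"id": "b", "version": 9},
    {"id": "c", "version": 10},
]
trainer_version = 10

on_policy = select_batch(rollouts, trainer_version, 0)               # current policy only
near_on_policy = select_batch(rollouts, trainer_version, 1)          # one version of lag
fully_async = select_batch(rollouts, trainer_version, float("inf"))  # anything goes
```

Raising the bound lets the trainer consume more of what the rollout workers have already produced, which is the throughput side of the tradeoff; lowering it keeps the data closer to the current policy.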
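[49] A Triadic Suffix Tokenization Scheme for Numerical Reasoning cs.CL | cs.AI | cs.LG
Olga Chetverina
TL;DR: This paper proposes Triadic Suffix Tokenization (TST), a number tokenization scheme that addresses the loss of positional and decimal structure caused by standard subword tokenization. It partitions digits into groups of three and attaches an explicit magnitude marker to each group, preserving exact digits and making order-of-magnitude relationships transparent at the token level.
Details
Motivation: Standard subword tokenization fragments numbers inconsistently, so large language models lose positional and decimal structure, a primary source of errors in arithmetic and scientific reasoning.
Result: The abstract reports no experimental results or benchmarks and states that experimental validation is deferred to future work.
Insight: The novelty is a deterministic, triad-based tokenization scheme whose explicit magnitude markers give the model a consistent gradient signal. The scheme is architecture-agnostic, can be integrated as a drop-in preprocessing step, and scales by linear vocabulary expansion to arbitrary precision and range.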
Abstract: Standard subword tokenization methods fragment numbers inconsistently, causing large language models (LLMs) to lose positional and decimal structure - a primary driver of errors in arithmetic and scientific reasoning. We introduce Triadic Suffix Tokenization (TST), a deterministic scheme that partitions digits into three-digit triads and annotates each triad with an explicit magnitude marker. Critically, the scheme defines a fixed, one-to-one mapping between suffixes and orders of magnitude for the integer part (thousands, millions, billions, etc.) and a parallel system of replicated markers for fractional depth (tenths, thousandths, millionths, etc.). Unlike approaches that rely on positional inference, this method provides a consistent gradient signal, which should ensure stable convergence. Two implementation variants are proposed: (1) a vocabulary-based approach that adds at most 10,000 fixed tokens to an existing vocabulary, covering 33 orders of magnitude ($10^{-15}$ to $10^{18}$); and (2) a suffix-marker approach that uses a small set of special tokens to denote magnitude dynamically. Both variants preserve exact digits while making order-of-magnitude relationships transparent at the token level. The framework is inherently scalable, allowing for linear vocabulary expansion to accommodate arbitrary precision and range. TST is architecture-agnostic and can be integrated as a drop-in preprocessing step. Experimental validation is deferred to future work.
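A minimal sketch of the triad-plus-magnitude idea, restricted to the integer part of a number. The suffix letters in `MARKERS` and the token format are hypothetical choices made for this example; the abstract does not specify the marker vocabulary, and the fractional-depth markers it mentions are not modeled here.

```python
# Hypothetical suffixes for units, thousands, millions, billions, trillions.
MARKERS = ["", "K", "M", "B", "T"]

def triadic_tokens(number: str) -> list[str]:
    """Split a non-negative integer string into magnitude-tagged triads,
    most significant triad first."""
    if not number.isdigit():
        raise ValueError("expected a non-negative integer string")
    digits = number.lstrip("0") or "0"
    triads = []
    # Group from the right so every triad except the leading one has 3 digits.
    while digits:
        triads.append(digits[-3:])
        digits = digits[:-3]
    if len(triads) > len(MARKERS):
        raise ValueError("number exceeds the magnitudes covered by MARKERS")
    tokens = [triad + MARKERS[i] for i, triad in enumerate(triads)]
    return tokens[::-1]

def detokenize(tokens: list[str]) -> str:
    """Drop the magnitude suffixes and rejoin the digits."""
    return "".join(tok.rstrip("KMBT") for tok in tokens)

tokens = triadic_tokens("1234567")
```

Here `tokens` is `["1M", "234K", "567"]`, and `detokenize(tokens)` returns the original digit string, which illustrates the claim that exact digits are preserved while each token states its own order of magnitude.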
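[50] Utilizing and Calibrating Hindsight Process Rewards via Reinforcement with Mutual Information Self-Evaluation cs.CL | cs.LG
Jiashu Yao, Heyan Huang, Zeming Liu, Yuhang Guo
TL;DR: This paper proposes MISE (Mutual Information Self-Evaluation), a reinforcement learning paradigm that addresses the sparse-reward problem faced by LLM agents. It uses hindsight generative self-evaluation as a dense internal reward signal and calibrates that signal against external environmental feedback.
Details
Motivation: Sparse rewards make training LLM-based RL agents inefficient; the paper introduces dense internal reward signals to assist learning.
Result: Extensive experiments show that MISE outperforms several strong baselines and enables open-source LLMs of about 7B parameters to reach performance comparable to GPT-4o on validation without expert supervision.
Insight: The novelty is the first formal theoretical foundation for generative self-rewarding: the paper proves that using hindsight self-evaluation rewards is equivalent to minimizing an objective that combines mutual information with a KL divergence between the policy and a proxy reward policy, and uses this result to design a calibration step that actively aligns the rewards with the optimal policy.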
Abstract: To overcome the sparse reward challenge in reinforcement learning (RL) for agents based on large language models (LLMs), we propose Mutual Information Self-Evaluation (MISE), an RL paradigm that utilizes hindsight generative self-evaluation as dense reward signals while simultaneously calibrating them against environmental feedback. Empirically, MISE enables an agent to learn autonomously from dense internal rewards supplementing sparse extrinsic signals. Theoretically, our work provides the first formal foundation for the paradigm of generative self-rewarding. We prove that utilizing hindsight self-evaluation rewards is equivalent to minimizing an objective that combines mutual information with a KL divergence term between the policy and a proxy reward policy. This theoretical insight then informs and justifies our calibration step, which actively aligns these rewards with the optimal policy. Extensive experiments show that MISE outperforms strong baselines, enabling open-source LLMs of about 7B parameters to achieve performance comparable to GPT-4o on validation without expert supervision.
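[51] Back to Basics: Let Conversational Agents Remember with Just Retrieval and Generation cs.CL
Yuqian Wu, Wei Chen, Zhengjun Huang, Junle Chen, Qingxiang Liu
TL;DR: This paper proposes \method, a minimalist conversational memory framework that manages long-term dialogue history using only retrieval and generation, via Turn Isolation Retrieval (TIR) and Query-Driven Pruning (QDP). It targets the context dilution that the Signal Sparsity Effect causes in existing complex memory systems and shows robust performance and high efficiency on multiple benchmarks.
Details
Motivation: Existing conversational memory systems rely on complex hierarchical summarization or reinforcement learning yet suffer from context dilution as conversations grow. The paper argues that the bottleneck is the Signal Sparsity Effect within the latent knowledge manifold, which shows up as Decisive Evidence Sparsity and Dual-Level Redundancy, and therefore proposes a back-to-basics minimalist approach.
Result: Extensive experiments on multiple benchmarks show that \method performs robustly across settings and consistently outperforms strong baselines while remaining efficient in token usage and latency, establishing a new minimalist baseline for conversational memory.
Insight: The novelty is to identify the Signal Sparsity Effect (Decisive Evidence Sparsity and Dual-Level Redundancy) and to propose two minimalist techniques: TIR, which captures turn-level signals with a max-activation strategy, and QDP, which removes redundant sessions and conversational filler to build a compact evidence set. The approach avoids complex architectures and emphasizes the central role of retrieval and generation.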
Abstract: Existing conversational memory systems rely on complex hierarchical summarization or reinforcement learning to manage long-term dialogue history, yet remain vulnerable to context dilution as conversations grow. In this work, we offer a different perspective: the primary bottleneck may lie not in memory architecture, but in the \textit{Signal Sparsity Effect} within the latent knowledge manifold. Through controlled experiments, we identify two key phenomena: \textit{Decisive Evidence Sparsity}, where relevant signals become increasingly isolated with longer sessions, leading to sharp degradation in aggregation-based methods; and \textit{Dual-Level Redundancy}, where both inter-session interference and intra-session conversational filler introduce large amounts of non-informative content, hindering effective generation. Motivated by these insights, we propose \method, a minimalist framework that brings conversational memory back to basics, relying solely on retrieval and generation via Turn Isolation Retrieval (TIR) and Query-Driven Pruning (QDP). TIR replaces global aggregation with a max-activation strategy to capture turn-level signals, while QDP removes redundant sessions and conversational filler to construct a compact, high-density evidence set. Extensive experiments on multiple benchmarks demonstrate that \method achieves robust performance across diverse settings, consistently outperforming strong baselines while maintaining high efficiency in tokens and latency, establishing a new minimalist baseline for conversational memory.
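The difference between global aggregation and a max-activation strategy can be shown with a toy ranking. This is a sketch with made-up similarity scores, not the paper's implementation: `rank_sessions` and the session data are hypothetical, and the per-turn scores stand in for query-to-turn similarities from any retriever.

```python
from statistics import mean

def rank_sessions(turn_scores, aggregate):
    """Order session ids by an aggregate of their per-turn similarity scores."""
    return sorted(turn_scores,
                  key=lambda sid: aggregate(turn_scores[sid]),
                  reverse=True)

# Per-turn similarity to the query (hypothetical values).
turn_scores = {
    "A": [0.9, 0.1, 0.1, 0.1, 0.1],   # long session, one decisive turn
    "B": [0.4, 0.4, 0.4],             # uniformly lukewarm filler
}

by_mean = rank_sessions(turn_scores, mean)   # aggregation dilutes the decisive turn
by_max = rank_sessions(turn_scores, max)     # max-activation keeps it on top
```

Averaging ranks session B first because the one relevant turn in A is outvoted by its filler, which matches the Decisive Evidence Sparsity described in the abstract; taking the maximum ranks A first.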
[52] CArtBench: Evaluating Vision-Language Models on Chinese Art Understanding, Interpretation, and Authenticity cs.CLPDF
Xuefeng Wei, Zhixuan Wang, Xuan Zhou, Zhi Qu, Hongyao Li
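TL;DR: This paper presents CArtBench, a museum-grounded benchmark for evaluating vision-language models on understanding, interpreting, and authenticating Chinese artworks. It comprises four subtasks: evidence-grounded recognition and reasoning, structured expert-style appreciation, defensible reinterpretation, and authenticity discrimination under visually similar confounds. An evaluation of nine representative VLMs shows that current models still face significant challenges in evidence linking, style-to-period inference, long-form appreciation, and authenticity discrimination.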
Details
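Motivation: Existing evaluations of VLMs on Chinese art are mostly limited to short-form recognition and question answering and lack a systematic assessment of deeper understanding, interpretation, and authenticity discrimination, so a more comprehensive benchmark is needed to drive progress in the art domain.
Result: Across nine representative VLMs, overall accuracy on CURATORQA is high but drops sharply on hard evidence linking and style-to-period inference; long-form appreciation falls far short of expert references; and authenticity discrimination stays near chance, underscoring how difficult connoisseur-level reasoning is for current models.
Insight: The contribution is a multi-dimensional benchmark for Chinese artworks built on authoritative museum data. Its subtasks (evidence linking, structured appreciation, reinterpretation, and authenticity discrimination) systematically probe deep art understanding in VLMs, expose their limits on complex art reasoning, and give future research a clear direction for improvement.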
Abstract: We introduce CARTBENCH, a museum-grounded benchmark for evaluating vision-language models (VLMs) on Chinese artworks beyond short-form recognition and QA. CARTBENCH comprises four subtasks: CURATORQA for evidence-grounded recognition and reasoning, CATALOGCAPTION for structured four-section expert-style appreciation, REINTERPRET for defensible reinterpretation with expert ratings, and CONNOISSEURPAIRS for diagnostic authenticity discrimination under visually similar confounds. CARTBENCH is built by aligning image-bearing Palace Museum objects from Wikidata with authoritative catalog pages, spanning five art categories across multiple dynasties. Across nine representative VLMs, we find that high overall CURATORQA accuracy can mask sharp drops on hard evidence linking and style-to-period inference; long-form appreciation remains far from expert references; and authenticity-oriented diagnostic discrimination stays near chance, underscoring the difficulty of connoisseur-level reasoning for current models.
[53] RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents cs.CL | cs.AI | cs.MAPDF
Riccardo Rosati, Edoardo Colucci, Massimiliano Bolognini, Adriano Mancini, Paolo Sernani
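TL;DR: This paper presents RPA-Check, a multi-stage automated framework for evaluating dynamic role-playing agents (RPAs) built on large language models (LLMs). By defining behavioral dimensions, generating fine-grained checklists, applying semantic filtering, and using an LLM-as-a-Judge with chain-of-thought verification, the framework objectively assesses an agent's role adherence, logical consistency, and narrative stability in complex, constraint-heavy environments.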
Details
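Motivation: With the rapid adoption of LLMs in interactive systems, dynamic, open-ended role-playing agents are now widely built, but standard NLP metrics fail to capture nuances such as role adherence, logical consistency, and long-term narrative stability, so a new evaluation framework is needed.
Result: Experiments across five distinct legal scenarios in LLM Court, a serious game for forensic training involving several quantized local models, validate the framework. It identifies subtle trade-offs between model size, reasoning depth, and operational stability, and reveals an inverse relationship between parameter scale and procedural consistency: smaller, adequately instruction-tuned models (8-9B) can outperform larger architectures that are prone to user-alignment bias or sycophancy.
Insight: The contribution is a standardized, reproducible multi-stage automated evaluation framework that turns high-level qualitative behavioral criteria into objective, non-redundant, fine-grained boolean checklists and scores them with an LLM-as-a-Judge using chain-of-thought verification. This offers a systematic methodology for evaluating generative agents in specialized domains.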
Abstract: The rapid adoption of Large Language Models (LLMs) in interactive systems has enabled the creation of dynamic, open-ended Role-Playing Agents (RPAs). However, evaluating these agents remains a significant challenge, as standard NLP metrics fail to capture the nuances of role adherence, logical consistency, and long-term narrative stability. This paper introduces RPA-Check, a multi-stage automated evaluation framework designed to objectively assess the performance of LLM-based RPAs in complex, constraints-heavy environments. Our methodology is based on a four-step pipeline: (1) Dimension Definition, establishing high-level qualitative behavioral criteria; (2) Augmentation, where these requirements are expanded into granular boolean checklist indicators; (3) Semantic Filtering, to ensure indicator objectivity, no redundancy and agent isolation; and (4) LLM-as-a-Judge Evaluation, which employs chain-of-thought verification to score agent fidelity. We validate this framework by applying it to LLM Court, a serious game for forensic training involving several quantized local models. Experimental results across five distinct legal scenarios demonstrate the framework’s ability to identify subtle trade-offs between model size, reasoning depth, and operational stability. Notably, the findings reveal an inverse relationship between parametric scale and procedural consistency, showing that smaller, adequately instruction-tuned models (8-9B) can outperform larger architectures prone to user-alignment bias or sycophancy. RPA-Check thus provides a standardized and reproducible metric for future research in generative agent evaluation within specialized domains.
[54] Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind cs.CL | cs.AI | cs.LGPDF
Hanqi Xiao, Vaidehi Patil, Zaid Khan, Hyunji Lee, Elias Stengel-Eskin
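TL;DR: This paper proposes a novel privacy-themed theory-of-mind challenge, ToM for Steering Beliefs (ToM-SB), in which a defender must act as a double agent and steer the beliefs of an attacker with partial prior knowledge within a shared information environment. Frontier LLMs perform poorly on the task, whereas AI Double Agent models trained with reinforcement learning on combined ToM and fooling rewards improve markedly and outperform GPT-5.4 and Gemini3-Pro on hard scenarios.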
Details
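Motivation: As LLMs become the core of conversational systems, their ability to understand the intentions and states of dialogue partners is critical for safe interaction with potentially adversarial partners. The paper addresses the shortcomings of models in adversarial scenarios that demand deep theory-of-mind reasoning, in particular how to steer an attacker's beliefs to protect sensitive information.
Result: In experiments covering four attackers of different strengths, six defender methods, and both in-distribution and out-of-distribution evaluation, AI Double Agent models trained with combined ToM and fooling rewards achieve the strongest fooling and ToM performance on hard scenarios, outperforming Gemini3-Pro and GPT-5.4 with ToM prompting. The study also finds a bidirectionally emergent relationship between fooling success and ToM ability, with gains in the two strongly correlated.
Insight: The core contribution is ToM-SB, a concrete and evaluable theory-of-mind challenge, together with the demonstration that reinforcement learning with fooling success and ToM modeling as joint reward signals improves both capabilities together. This provides a new methodology and evaluation benchmark for training AI systems with more advanced social reasoning and adversarial interaction skills.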
Abstract: As large language models (LLMs) become the engine behind conversational systems, their ability to reason about the intentions and states of their dialogue partners (i.e., form and use a theory-of-mind, or ToM) becomes increasingly critical for safe interaction with potentially adversarial partners. We propose a novel privacy-themed ToM challenge, ToM for Steering Beliefs (ToM-SB), in which a defender must act as a Double Agent to steer the beliefs of an attacker with partial prior knowledge within a shared universe. To succeed on ToM-SB, the defender must engage with and form a ToM of the attacker, with a goal of fooling the attacker into believing they have succeeded in extracting sensitive information. We find that strong frontier models like Gemini3-Pro and GPT-5.4 struggle on ToM-SB, often failing to fool attackers in hard scenarios with partial attacker prior knowledge, even when prompted to reason about the attacker’s beliefs (ToM prompting). To close this gap, we train models on ToM-SB to act as AI Double Agents using reinforcement learning, testing both fooling and ToM rewards. Notably, we find a bidirectionally emergent relationship between ToM and attacker-fooling: rewarding fooling success alone improves ToM, and rewarding ToM alone improves fooling. Across four attackers with different strengths, six defender methods, and both in-distribution and out-of-distribution (OOD) evaluation, we find that gains in ToM and attacker-fooling are well-correlated, highlighting belief modeling as a key driver of success on ToM-SB. AI Double Agents that combine both ToM and fooling rewards yield the strongest fooling and ToM performance, outperforming Gemini3-Pro and GPT-5.4 with ToM prompting on hard scenarios. We also show that ToM-SB and AI Double Agents can be extended to stronger attackers, demonstrating generalization to OOD settings and the upgradability of our task.
[55] Legal2LogicICL: Improving Generalization in Transforming Legal Cases to Logical Formulas via Diverse Few-Shot Learning cs.CL | cs.AI | cs.LGPDF
Jieying Xue, Phuong Minh Nguyen, Ha Thanh Nguyen, May Myo Zin, Ken Satoh
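TL;DR: This paper proposes Legal2LogicICL, a framework that converts natural-language legal cases into logical formulas through retrieval-augmented generation and diverse few-shot learning, improving the generalization of logic-based legal reasoning systems.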
Details
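Motivation: Existing logic-based legal reasoning methods rely on fine-tuned models and are constrained by the scarcity of high-quality annotated data. The paper aims to overcome this and improve generalization when data is limited.
Result: Experiments on both open-source and proprietary LLMs show that the method significantly improves accuracy, stability, and generalization when transforming legal cases into logical representations; the newly constructed Legal2Proleg dataset supports the evaluation.
Insight: The contribution lies in balancing exemplar diversity and similarity at both the latent semantic level and the legal text structure level, and in mitigating entity-induced retrieval bias in legal texts. This yields informative, robust few-shot demonstrations that produce accurate logical rules without additional training.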
Abstract: This work aims to improve the generalization of logic-based legal reasoning systems by integrating recent advances in NLP with legal-domain adaptive few-shot learning techniques using LLMs. Existing logic-based legal reasoning pipelines typically rely on fine-tuned models to map natural-language legal cases into logical formulas before forwarding them to a symbolic reasoner. However, such approaches are heavily constrained by the scarcity of high-quality annotated training data. To address this limitation, we propose a novel LLM-based legal reasoning framework that enables effective in-context learning through retrieval-augmented generation. Specifically, we introduce Legal2LogicICL, a few-shot retrieval framework that balances diversity and similarity of exemplars at both the latent semantic representation level and the legal text structure level. In addition, our method explicitly accounts for legal structure by mitigating entity-induced retrieval bias in legal texts, where lengthy and highly specific entity mentions often dominate semantic representations and obscure legally meaningful reasoning patterns. Our Legal2LogicICL constructs informative and robust few-shot demonstrations, leading to accurate and stable logical rule generation without requiring additional training. In addition, we construct a new dataset, named Legal2Proleg, which is annotated with alignments between legal cases and PROLEG logical formulas to support the evaluation of legal semantic parsing. Experimental results on both open-source and proprietary LLMs demonstrate that our approach significantly improves accuracy, stability, and generalization in transforming natural-language legal case descriptions into logical representations, highlighting its effectiveness for interpretable and reliable legal reasoning. Our code is available at https://github.com/yingjie7/Legal2LogicICL.
[56] Discourse Diversity in Multi-Turn Empathic Dialogue cs.CL | cs.AIPDF
Hongli Zhan, Emma S. Gueorguieva, Javier Hernandez, Jina Suh, Desmond C. Ong
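TL;DR: This paper studies the diversity of discourse moves produced by large language models in multi-turn empathic dialogue and finds that LLMs reuse the same discourse moves far more often than humans. To address this, it proposes MINT, the first reinforcement learning framework for optimizing discourse move diversity across multi-turn dialogue, which substantially reduces repetition while improving empathy quality.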
Details
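Motivation: Although LLMs are rated as highly empathic in single-turn settings, they are also known to be formulaic generators that reuse the same lexical patterns, syntactic templates, and discourse structures across tasks. Whether this formulaicity extends to the level of discourse moves, especially in multi-turn empathic dialogue, has received little attention. Effective empathic support requires not just a kind response in one turn but varied strategies as the conversation unfolds.
Result: In real emotional support conversations, once a tactic appears in a supporter turn, LLMs reuse it in the next turn at a rate (0.50-0.56) nearly double that of humans (0.27). MINT combines an empathy quality reward with a cross-turn tactic novelty signal; across 1.7B and 4B models it improves aggregate empathy by 25.3% over the vanilla baseline, and on the 4B model it reduces cross-turn discourse move repetition by 26.3%, surpassing all baselines, including quality-only and token-level diversity methods, on both measures.
Insight: The core contribution is to extend the focus from single-turn settings to discourse move diversity in multi-turn empathic dialogue and to expose a repetition pattern in LLMs that standard similarity metrics cannot detect. MINT is the first reinforcement learning method to optimize discourse move diversity across multi-turn dialogue. Its key insight is that what current models lack is not empathy itself but the ability to vary their discourse moves over the course of a conversation, which points to a new optimization direction for more natural, adaptive dialogue systems.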
Abstract: Large language models (LLMs) produce responses rated as highly empathic in single-turn settings (Ayers et al., 2023; Lee et al., 2024), yet they are also known to be formulaic generators that reuse the same lexical patterns, syntactic templates, and discourse structures across tasks (Jiang et al., 2025; Shaib et al., 2024; Namuduri et al., 2025). Less attention has been paid to whether this formulaicity extends to the level of discourse moves, i.e., what a response does for the person it is addressing. This question is especially consequential for empathic dialogue, where effective support demands not just a kind response at one moment but varied strategies as a conversation unfolds (Stiles et al., 1998). Indeed, prior work shows that LLMs reuse the same tactic sequences more than human supporters in single-turn settings (Gueorguieva et al., 2026). We extend this analysis to multi-turn conversations and find that the rigidity compounds: once a tactic appears in a supporter turn, LLMs reuse it in the next at nearly double the rate of humans (0.50-0.56 vs. 0.27). This pattern holds across LLMs serving as supporters in real emotional support conversations, and is invisible to standard similarity metrics. To address this gap, we introduce MINT (Multi-turn Inter-tactic Novelty Training), the first reinforcement learning framework to optimize discourse move diversity across multi-turn empathic dialogue. The best MINT variant combines an empathy quality reward with a cross-turn tactic novelty signal, improving aggregate empathy by 25.3% over vanilla across 1.7B and 4B models while reducing cross-turn discourse move repetition by 26.3% on the 4B model, surpassing all baselines including quality-only and token-level diversity methods on both measures. These results suggest that what current models lack is not empathy itself, but the ability to vary their discourse moves across a conversation.
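The next-turn reuse rate quoted above (0.50-0.56 for LLMs vs. 0.27 for humans) can be made concrete with a small sketch. It assumes each supporter turn has already been annotated with a set of tactic labels, and it counts how often a tactic that appears in one turn recurs in the following turn. This is one plausible reading of the statistic, not the paper's confirmed definition; the function name and the tactic labels are hypothetical.

```python
def next_turn_reuse_rate(turn_tactics):
    """Fraction of tactic occurrences that recur in the following supporter turn."""
    appeared = 0
    reused = 0
    # Pair every supporter turn with the supporter turn that follows it.
    for current, following in zip(turn_tactics, turn_tactics[1:]):
        for tactic in current:
            appeared += 1
            if tactic in following:
                reused += 1
    return reused / appeared if appeared else 0.0

# Toy conversation (hypothetical labels): four consecutive supporter turns.
conversation = [
    {"validation", "question"},
    {"validation", "advice"},
    {"validation"},
    {"reframing"},
]

rate = next_turn_reuse_rate(conversation)
```

Here five tactic occurrences have a following turn and two of them recur, so the rate is 0.4. A reward that penalizes this kind of recurrence is one way a cross-turn novelty signal could be built, although the abstract does not specify how MINT computes it.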
[57] Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks cs.CLPDF
Yoonsang Lee, Howard Yen, Xi Ye, Danqi Chen
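TL;DR: This paper studies parallel test-time scaling for long-horizon agentic tasks such as agentic search and deep research and proposes AggAgent, an aggregation agent. AggAgent treats the multiple trajectories generated in parallel as an environment and is equipped with lightweight tools to inspect and search them, so it can navigate and synthesize information on demand to produce the final response.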
Details
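Motivation: Long-horizon agentic tasks pose unique challenges for parallel scaling: trajectories are long, multi-turn, and involve tool calls, and outputs are usually open-ended. Aggregating only the final answers discards the rich information in the trajectories, while concatenating all trajectories exceeds the model's context window.
Result: Across six benchmarks and three model families (GLM-4.7, Qwen3.5, MiniMax-M2.5), AggAgent outperforms all existing aggregation methods, by up to 5.3% absolute on average and 10.3% on two deep research tasks, while adding minimal overhead, since the aggregation cost remains bounded by a single agentic rollout.
Insight: The contribution is to cast aggregation itself as an agent (agentic aggregation) that actively explores and synthesizes information across parallel trajectories rather than processing them passively. This offers an efficient, cost-controlled approach to parallel scaling of long-horizon, open-ended tasks by turning multi-trajectory information integration into a navigable agentic decision problem.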
Abstract: We study parallel test-time scaling for long-horizon agentic tasks such as agentic search and deep research, where multiple rollouts are generated in parallel and aggregated into a final response. While such scaling has proven effective for chain-of-thought reasoning, agentic tasks pose unique challenges: trajectories are long, multi-turn, and tool-augmented, and outputs are often open-ended. Aggregating only final answers discards rich information from trajectories, while concatenating all trajectories exceeds the model’s context window. To address this, we propose AggAgent, an aggregation agent that treats parallel trajectories as an environment. We equip it with lightweight tools to inspect candidate solutions and search across trajectories, enabling it to navigate and synthesize information on demand. Across six benchmarks and three model families (GLM-4.7, Qwen3.5, MiniMax-M2.5), AggAgent outperforms all existing aggregation methods, by up to 5.3% absolute on average and 10.3% on two deep research tasks, while adding minimal overhead, as the aggregation cost remains bounded by a single agentic rollout. Our findings establish agentic aggregation as an effective and cost-efficient approach to parallel test-time scaling.
[58] General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks cs.CL | cs.AIPDF
Junlin Liu, Shengnan An, Shuang Zhou, Dan Ma, Shixiong Luo
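TL;DR: This paper introduces General365, a benchmark for evaluating general reasoning in large language models. It contains 365 seed problems and 1,095 variant problems across eight categories and restricts background knowledge to a K-12 level to separate reasoning from specialized expertise. An evaluation of 26 leading LLMs shows that even the best model reaches only 62.8% accuracy, far below performance in specialized domains such as math and physics, indicating that current LLM reasoning is heavily domain-dependent and leaves substantial room for improvement in general settings.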
Details
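Motivation: Current LLMs show strong reasoning in specialized domains such as mathematics and physics, but their reasoning in broader, general contexts (general reasoning) remains under-explored and involves challenges such as complex constraints, nested logical branches, and semantic interference. A dedicated benchmark is therefore needed to evaluate and advance this ability.
Result: On General365, the best of 26 leading LLMs achieves only 62.8% accuracy, in stark contrast to the near-perfect performance of LLMs on math and physics benchmarks, highlighting the limits of current models on general reasoning.
Insight: The contribution is a benchmark that restricts background knowledge to a K-12 level, explicitly decoupling reasoning from specialized expertise so that general reasoning can be evaluated more cleanly. It gives research on generalizable LLM reasoning a standardized, diverse testbed and supports progress toward more robust, general-purpose real-world applications.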
Abstract: Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains like mathematics and physics. However, their ability to generalize these reasoning skills to more general and broader contexts, often termed general reasoning, remains under-explored. Unlike domain-specific reasoning, general reasoning relies less on expert knowledge but still presents formidable reasoning challenges, such as complex constraints, nested logical branches, and semantic interference. To address this gap, we introduce General365, a benchmark specifically designed to assess general reasoning in LLMs. By restricting background knowledge to a K-12 level, General365 explicitly decouples reasoning from specialized expertise. The benchmark comprises 365 seed problems and 1,095 variant problems across eight categories, ensuring both high difficulty and diversity. Evaluations across 26 leading LLMs reveal that even the top-performing model achieves only 62.8% accuracy, in stark contrast to the near-perfect performances of LLMs in math and physics benchmarks. These results suggest that the reasoning abilities of current LLMs are heavily domain-dependent, leaving significant room for improvement in broader applications. We envision General365 as a catalyst for advancing LLM reasoning beyond domain-specific tasks toward robust, general-purpose real-world scenarios. Code, Dataset, and Leaderboard: https://general365.github.io
cs.CV [Back]
[59] 3D Multi-View Stylization with Pose-Free Correspondences Matching for Robust 3D Geometry Preservation cs.CVPDF
Shirsha Bose
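TL;DR: This paper presents a feed-forward stylization network for multi-view 3D scenes, trained with per-scene test-time optimization under a composite objective that couples appearance transfer with geometry preservation. It uses a correspondence-based consistency loss built on SuperPoint and SuperGlue, a MiDaS/DPT depth-preservation loss, and global color alignment to reduce the negative impact of stylization on downstream 3D tasks such as SLAM and reconstruction, and is validated on Tanks and Temples and Mip-NeRF 360.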
Details
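Motivation: Independent per-view artistic stylization of multi-view 3D scenes causes texture drift, warped edges, and inconsistent shading, which disrupt the correspondences that geometry-aware pipelines need and degrade downstream 3D tasks such as SLAM, depth prediction, and multi-view reconstruction. The work addresses this without relying on camera poses or an explicit 3D representation during training.
Result: The method is evaluated on Tanks and Temples and Mip-NeRF 360 with image and reconstruction metrics: Color Histogram Distance and Structure Distance measure style adherence and structure retention, and monocular DROID-SLAM trajectories and symmetric Chamfer distance measure 3D consistency. Compared with the MuVieCAST baselines, it yields stronger trajectory and point-cloud consistency while maintaining competitive stylization; correspondence and depth regularization reduce structural distortion and improve SLAM stability and reconstructed geometry.
Insight: The contributions are: 1) a correspondence-based consistency loss using SuperPoint and SuperGlue that stabilizes structure across viewpoints by constraining descriptors from a stylized anchor view to stay consistent with matched descriptors from the original multi-view set; 2) a depth-preservation loss (MiDaS/DPT) combined with global color alignment to reduce depth-model domain shift; 3) a staged weight schedule that introduces the geometry and depth constraints. Overall, the method couples style transfer tightly with geometry preservation and, through test-time optimization, achieves robust 3D geometry preservation without pose assumptions, making multi-view stylization practical for downstream use.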
Abstract: Artistic style transfer is well studied for images and videos, but extending it to multi-view 3D scenes remains difficult because stylization can disrupt correspondences needed by geometry-aware pipelines. Independent per-view stylization often causes texture drift, warped edges, and inconsistent shading, degrading SLAM, depth prediction, and multi-view reconstruction. This thesis addresses multi-view stylization that remains usable for downstream 3D tasks without assuming camera poses or an explicit 3D representation during training. We introduce a feed-forward stylization network trained with per-scene test-time optimization under a composite objective coupling appearance transfer with geometry preservation. Stylization is driven by an AdaIN-inspired loss from a frozen VGG-19 encoder, matching channel-wise moments to a style image. To stabilize structure across viewpoints, we propose a correspondence-based consistency loss using SuperPoint and SuperGlue, constraining descriptors from a stylized anchor view to remain consistent with matched descriptors from the original multi-view set. We also impose a depth-preservation loss using MiDaS/DPT and use global color alignment to reduce depth-model domain shift. A staged weight schedule introduces geometry and depth constraints. We evaluate on Tanks and Temples and Mip-NeRF 360 using image and reconstruction metrics. Style adherence and structure retention are measured by Color Histogram Distance (CHD) and Structure Distance (DSD). For 3D consistency, we use monocular DROID-SLAM trajectories and symmetric Chamfer distance on back-projected point clouds. Across ablations, correspondence and depth regularization reduce structural distortion and improve SLAM stability and reconstructed geometry; on scenes with MuVieCAST baselines, our method yields stronger trajectory and point-cloud consistency while maintaining competitive stylization.
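The phrase "matching channel-wise moments to a style image" can be illustrated with a minimal sketch of a moment-matching loss. It assumes feature maps are given as a list of channels, each flattened to a list of activations, and it penalizes squared differences in per-channel mean and standard deviation. The function names are hypothetical, the frozen VGG-19 encoder is replaced by toy numbers, and the exact form of the thesis's AdaIN-inspired loss is not given in the abstract.

```python
from statistics import mean, pstdev

def channel_moments(features):
    """Per-channel (mean, standard deviation) of a list of flattened channels."""
    return [(mean(channel), pstdev(channel)) for channel in features]

def moment_matching_loss(stylized, style):
    """Mean over channels of squared differences in mean and standard deviation."""
    stylized_moments = channel_moments(stylized)
    style_moments = channel_moments(style)
    total = sum(
        (m1 - m2) ** 2 + (s1 - s2) ** 2
        for (m1, s1), (m2, s2) in zip(stylized_moments, style_moments)
    )
    return total / len(stylized_moments)

# Toy single-channel features (hypothetical): same spread, shifted mean.
stylized_features = [[0.0, 2.0]]
style_features = [[1.0, 3.0]]

loss = moment_matching_loss(stylized_features, style_features)
```

In the toy case both channels have standard deviation 1 and the means differ by 1, so the loss is 1.0; it falls to zero once the stylized features share the style image's channel statistics.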
[60] TRACE: Thermal Recognition Attentive-Framework for CO2 Emissions from Livestock cs.CVPDF
Taminul Islam, Abdellah Lakhssassi, Toqi Tahamid Sarker, Mohamed Embaby, Khaled R Ahmed
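TL;DR: TRACE is a unified framework for quantifying CO2 emissions from thermal video of livestock and the first to jointly address per-frame CO2 plume segmentation and clip-level emission flux classification. With a Thermal Gas-Aware Attention encoder, an Attention-based Temporal Fusion module, and a four-stage progressive training curriculum, it achieves state-of-the-art performance on the CO2 Farm Thermal Gas Dataset.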
Details
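Motivation: No existing system can deliver continuous, spatially resolved measurements of exhaled CO2 from free-roaming cattle, which are key to assessing rumen metabolic state and to farm-scale carbon accounting. The paper aims to build a non-invasive monitoring framework that requires no physical confinement or contact.
Result: Benchmarked against fifteen state-of-the-art models on the CO2 Farm Thermal Gas Dataset, TRACE achieves an mIoU of 0.998 and the best result on every segmentation and classification metric, outperforming domain-specific gas segmenters with several times more parameters and all flux classification baselines.
Insight: The contributions are: 1) a Thermal Gas-Aware Attention encoder that uses per-pixel gas intensity as a spatial supervisory signal to guide self-attention; 2) an Attention-based Temporal Fusion module that captures breath-cycle dynamics through structured cross-frame attention; 3) a four-stage progressive training curriculum that couples the two tasks while preventing gradient interference. Using gas intensity as a prior supervisory signal and modeling time in a structured way are effective for this domain-specific problem.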
Abstract: Quantifying exhaled CO2 from free-roaming cattle is both a direct indicator of rumen metabolic state and a prerequisite for farm-scale carbon accounting, yet no existing system can deliver continuous, spatially resolved measurements without physical confinement or contact. We present TRACE (Thermal Recognition Attentive-Framework for CO2 Emissions from Livestock), the first unified framework to jointly address per-frame CO2 plume segmentation and clip-level emission flux classification from mid-wave infrared (MWIR) thermal video. TRACE contributes three domain-specific advances: a Thermal Gas-Aware Attention (TGAA) encoder that incorporates per-pixel gas intensity as a spatial supervisory signal to direct self-attention toward high-emission regions at each encoder stage; an Attention-based Temporal Fusion (ATF) module that captures breath-cycle dynamics through structured cross-frame attention for sequence-level flux classification; and a four-stage progressive training curriculum that couples both objectives while preventing gradient interference. Benchmarked against fifteen state-of-the-art models on the CO2 Farm Thermal Gas Dataset, TRACE achieves an mIoU of 0.998 and the best result on every segmentation and classification metric simultaneously, outperforming domain-specific gas segmenters with several times more parameters and surpassing all baselines in flux classification. Ablation studies confirm that each component is individually essential: gas-conditioned attention alone determines precise plume boundary localization, and temporal reasoning is indispensable for flux-level discrimination. TRACE establishes a practical path toward non-invasive, continuous, per-animal CO2 monitoring from overhead thermal cameras at commercial scale. Codes are available at https://github.com/taminulislam/trace.
[61] FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models cs.CV | cs.LG | cs.ROPDF
Xinyuan An, Tao Luo, Gengyun Peng, Yaobing Wang, Kui Ren
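TL;DR: This paper presents FlowHijack, the first backdoor attack framework to target the vector-field dynamics of flow-matching vision-language-action models. Using a novel τ-conditioned injection strategy and a dynamics mimicry regularizer, it achieves high attack success rates with stealthy, context-aware triggers while preserving benign task performance, and generates malicious actions that are behaviorally indistinguishable from normal ones.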
Details
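Motivation: Flow-matching VLA models such as π₀ are becoming a cornerstone of robotics because they generate smooth, continuous actions, but their distinctive action generation mechanism, the vector-field dynamics, is a critical and unexplored security vulnerability, particularly to backdoors. Existing backdoor attacks designed for autoregressive, discretized VLAs cannot be applied directly to these continuous dynamics.
Result: Experiments show that FlowHijack achieves high attack success rates with stealthy, context-aware triggers where prior work fails. Crucially, it preserves benign task performance and, by enforcing kinematic similarity, generates malicious actions that are behaviorally indistinguishable from normal actions.
Insight: The contribution is the first systematic backdoor attack on the underlying vector-field dynamics of flow-matching VLAs, built on a τ-conditioned injection strategy and a dynamics mimicry regularizer. The core insight is to shift the attack target from discrete outputs to the internal dynamics that govern continuous action generation, which exposes a significant vulnerability in continuous embodied models and underlines the urgent need for defenses aimed at a model's internal generative dynamics.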
Abstract: Vision-Language-Action (VLA) models are emerging as a cornerstone for robotics, with flow-matching policies like $π_0$ showing great promise in generating smooth, continuous actions. As these models advance, their unique action generation mechanism - the vector field dynamics - presents a critical yet unexplored security vulnerability, particularly backdoor vulnerabilities. Existing backdoor attacks designed for autoregressive discretization VLAs cannot be directly applied to this new continuous dynamics. We introduce FlowHijack, the first backdoor attack framework to systematically target the underlying vector-field dynamics of flow-matching VLAs. Our method combines a novel $τ$-conditioned injection strategy, which manipulates the initial phase of the action generation, with a dynamics mimicry regularizer. Experiments demonstrate that FlowHijack achieves high attack success rates using stealthy, context-aware triggers where prior works failed. Crucially, it preserves benign task performance and, by enforcing kinematic similarity, generates malicious actions that are behaviorally indistinguishable from normal actions. Our findings reveal a significant vulnerability in continuous embodied models, highlighting the urgent need for defenses targeting the model’s internal generative dynamics.
[62] A Modular Zero-Shot Pipeline for Accident Detection, Localization, and Classification in Traffic Surveillance Video cs.CV | cs.LGPDF
Amey Thakur, Sarvesh Talele
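TL;DR: This paper presents a modular zero-shot pipeline for accident detection, localization, and classification in traffic surveillance video. It splits the problem into three independent modules: peak detection on frame-difference signals to locate the accident in time, the weighted centroid of dense optical flow maps to find the impact location, and cosine similarity between CLIP image and text embeddings to classify the accident type. The pipeline involves no domain-specific fine-tuning and uses only pre-trained model weights.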
Details
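Motivation: The goal is to predict when, where, and what type of traffic accident occurs in surveillance video without labeled real-world training data.
Result: The method was developed for the ACCIDENT @ CVPR 2026 challenge and is designed to perform detection, localization, and classification under zero-shot conditions; the abstract reports no quantitative results.
Insight: The contribution is the decomposition of a complex task into independent temporal, spatial, and semantic modules that combine classical signal processing (peak detection), computer vision (optical flow), and a pre-trained multimodal model (CLIP) for zero-shot inference, which avoids domain fine-tuning and keeps the pipeline modular and interpretable.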
Abstract: We describe a zero-shot pipeline developed for the ACCIDENT @ CVPR 2026 challenge. The challenge requires predicting when, where, and what type of traffic accident occurs in surveillance video, without labeled real-world training data. Our method separates the problem into three independent modules. The first module localizes the collision in time by running peak detection on z-score normalized frame-difference signals. The second module finds the impact location by computing the weighted centroid of cumulative dense optical flow magnitude maps using the Farneback algorithm. The third module classifies collision type by measuring cosine similarity between CLIP image embeddings of frames near the detected peak and text embeddings built from multi-prompt natural language descriptions of each collision category. No domain-specific fine-tuning is involved; the pipeline processes each video using only pre-trained model weights. Our implementation is publicly available as a Kaggle notebook.
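The three modules can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' implementation: the temporal module is reduced to taking the single largest z-scored value rather than running a full peak detector, the Farneback flow magnitudes and CLIP embeddings are replaced by small hand-written arrays, and all function names and the two collision labels are hypothetical.

```python
import math
from statistics import mean, pstdev

def zscore(signal):
    """Standardize a 1D signal; a constant signal maps to all zeros."""
    mu, sigma = mean(signal), pstdev(signal)
    return [(x - mu) / sigma if sigma else 0.0 for x in signal]

def peak_frame(frame_diffs):
    """Temporal module (simplified): index of the largest z-scored frame difference."""
    z = zscore(frame_diffs)
    return max(range(len(z)), key=lambda i: z[i])

def weighted_centroid(magnitude):
    """Spatial module: (x, y) centroid of a 2D map, weighted by its values."""
    total = sum(sum(row) for row in magnitude)
    if total == 0:
        return None  # no motion anywhere, so no location estimate
    cx = sum(x * v for row in magnitude for x, v in enumerate(row)) / total
    cy = sum(y * v for y, row in enumerate(magnitude) for v in row) / total
    return cx, cy

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def classify(image_embedding, text_embeddings):
    """Semantic module: label whose text embedding is closest to the image embedding."""
    return max(text_embeddings, key=lambda k: cosine(image_embedding, text_embeddings[k]))

# Toy stand-ins (hypothetical) for frame differences, cumulative flow
# magnitude, and image/text embeddings.
frame_diffs = [1.0, 1.2, 0.9, 6.0, 1.1]
flow_magnitude = [[0, 0, 0], [0, 1, 3], [0, 0, 0]]
image_embedding = [1.0, 0.0]
text_embeddings = {"rear-end": [0.9, 0.1], "head-on": [0.1, 0.9]}

when = peak_frame(frame_diffs)
where = weighted_centroid(flow_magnitude)
what = classify(image_embedding, text_embeddings)
```

Because the modules share no state, each can be swapped or tuned on its own, which matches the modular framing in the abstract. In a real pipeline the toy arrays would come from a frame-difference pass, a dense optical flow routine, and a CLIP encoder respectively.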
[63] Grid2Matrix: Revealing Digital Agnosia in Vision-Language Models cs.CV | cs.AIPDF
Yunkai Zhang, Linda Li, Yingxin Cui, Xiyuan Ruan, Zeyu Zheng
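TL;DR: This paper introduces Grid2Matrix (G2M), a benchmark for evaluating vision-language models (VLMs) on tasks that require an exhaustive readout of image details, here a color grid and a color-to-number mapping. VLMs show a sharp early collapse in zero-shot end-to-end evaluation, which points to a gap the authors call 'Digital Agnosia': information recoverable from the visual encoding is lost before it reaches the language output, and the error patterns depend strongly on visual patch boundaries.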
Details
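Motivation: Existing VLM evaluations focus mostly on high-level reasoning and can mask failures to capture all visual details faithfully. The paper uses the controlled G2M benchmark to expose a fundamental weakness of VLMs in handling fine-grained visual information such as tables and charts.
Result: On G2M, VLMs collapse on surprisingly small grids rather than degrading gradually as the task becomes denser. Probing the visual encoders of VLMs from two representative families shows that the visual features preserve substantially more information than the end-to-end outputs, yet common strategies such as model scaling and multimodal alignment do not fully eliminate the failure mode.
Insight: The contribution is a simple, controlled benchmark that quantifies 'Digital Agnosia' (the loss of visual detail) in VLMs and shows that failures are structured around visual patch boundaries. This gives a new view of VLM limitations on fine-grained visual tasks such as GUI and table understanding.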
Abstract: Vision-Language Models (VLMs) excel on many multimodal reasoning benchmarks, but these evaluations often do not require an exhaustive readout of the image and can therefore obscure failures in faithfully capturing all visual details. We introduce Grid2Matrix (G2M), a controlled benchmark in which a model is shown a color grid and a color-to-number mapping, and must output the corresponding matrix. By varying grid size and the number of colors, G2M provides a simple way to increase visual complexity while minimizing semantic confounds. We find that VLMs exhibit a sharp early collapse in zero-shot end-to-end evaluation, failing on surprisingly small grids rather than degrading gradually as the task becomes denser. We probe the visual encoders of VLMs from two representative families and find that they preserve substantially more of the grid information than the corresponding end-to-end outputs. This suggests that the failure is not explained by visual encoding alone, but also reflects a gap between what remains recoverable from visual features and what is ultimately expressed in language. We term this gap \textit{Digital Agnosia}. Further analyses show that these errors are highly structured and depend strongly on how grid cells overlap with visual patch boundaries. We also find that common strategies such as model scaling and multimodal alignment do not fully eliminate this failure mode. We expect G2M to serve as a useful testbed for understanding where and how VLMs lose fine visual details, and for evaluating tasks where missing even small visual details can matter, such as tables, charts, forms, and GUIs.
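The G2M task is simple enough to state as code, which also shows why it has few semantic confounds: the expected answer is a pure lookup. The sketch below builds the target matrix from a color grid and a color-to-number mapping and scores a prediction by per-cell accuracy. The function names, the toy grid, and the choice of per-cell accuracy as the metric are assumptions for illustration; the abstract does not specify the benchmark's scoring rule.

```python
def grid_to_matrix(grid, mapping):
    """Ground truth for the task: replace every color with its mapped number."""
    return [[mapping[color] for color in row] for row in grid]

def cell_accuracy(predicted, target):
    """Fraction of cells where the predicted matrix agrees with the target."""
    cells = [
        p == t
        for predicted_row, target_row in zip(predicted, target)
        for p, t in zip(predicted_row, target_row)
    ]
    return sum(cells) / len(cells)

# Toy 2x2 instance (hypothetical colors and mapping).
grid = [["red", "blue"], ["blue", "green"]]
mapping = {"red": 1, "blue": 2, "green": 3}

target = grid_to_matrix(grid, mapping)
model_output = [[1, 2], [2, 2]]  # one wrong cell in the bottom-right corner
accuracy = cell_accuracy(model_output, target)
```

Difficulty is controlled only by the grid size and the number of colors, so any drop in accuracy reflects lost visual detail rather than missing world knowledge.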
[64] EDFNet: Early Fusion of Edge and Depth for Thin-Obstacle Segmentation in UAV Navigation cs.CVPDF
Negar Fathi
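TL;DR: This paper presents EDFNet, an early-fusion segmentation framework for thin obstacles in UAV navigation. It integrates RGB, depth, and edge information to detect thin obstacles such as wires, poles, and branches, which are hard to perceive because they occupy few pixels, have weak visual contrast, and suffer from class imbalance.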
Details
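Motivation: Existing segmentation methods primarily target coarser obstacles and do not fully exploit the complementary multimodal cues needed to perceive thin structures, so a method that fuses several sources of information effectively is needed.
Result: Sixteen modality-backbone configurations are evaluated on the Drone Depth and Obstacle Segmentation (DDOS) dataset. Early RGB-Depth-Edge fusion provides a competitive, well-balanced baseline, with the most consistent gains on boundary-sensitive and recall-oriented metrics. The pretrained RGBDE U-Net achieves the best overall performance, with the highest Thin-Structure Evaluation Score (0.244), mean IoU (0.219), and boundary IoU (0.234), while maintaining competitive runtime (19.62 FPS) on the evaluation hardware.
Insight: The contribution is a modular early-fusion framework that combines RGB, depth, and edge information at an early stage to strengthen thin-obstacle perception. The results highlight the complementary value of multimodal cues, edges in particular, for segmenting thin, low-contrast objects and provide a practical, extensible baseline for thin-obstacle segmentation in UAV navigation. Performance on the rarest ultra-thin categories nevertheless remains low, so reliable ultra-thin segmentation is still an open challenge.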
Abstract: Autonomous Unmanned Aerial Vehicles (UAVs) must reliably detect thin obstacles such as wires, poles, and branches to navigate safely in real-world environments. These structures remain difficult to perceive because they occupy few pixels, often exhibit weak visual contrast, and are strongly affected by class imbalance. Existing segmentation methods primarily target coarser obstacles and do not fully exploit the complementary multimodal cues needed for thin-structure perception. We present EDFNet, a modular early-fusion segmentation framework that integrates RGB, depth, and edge information for thin-obstacle perception in cluttered aerial scenes. We evaluate EDFNet on the Drone Depth and Obstacle Segmentation (DDOS) dataset across sixteen modality-backbone configurations using U-Net and DeepLabV3 in pretrained and non-pretrained settings. The results show that early RGB-Depth-Edge fusion provides a competitive and well-balanced baseline, with the most consistent gains appearing in boundary-sensitive and recall-oriented metrics. The pretrained RGBDE U-Net achieves the best overall performance, with the highest Thin-Structure Evaluation Score (0.244), mean IoU (0.219), and boundary IoU (0.234), while maintaining competitive runtime performance (19.62 FPS) on our evaluation hardware. However, performance on the rarest ultra-thin categories remains low across all models, indicating that reliable ultra-thin segmentation is still an open challenge. Overall, these findings position early RGB-Depth-Edge fusion as a practical and modular baseline for thin-obstacle segmentation in UAV navigation.
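Early fusion, as opposed to merging features from separate branches later in the network, typically means concatenating the modalities along the channel axis before the first layer. A minimal sketch, assuming images are nested lists indexed as `image[y][x]` and that the depth and edge maps are already aligned with the RGB frame; the function name is hypothetical, and how EDFNet computes its edge map is not stated in the abstract.

```python
def early_fuse(rgb, depth, edge):
    """Concatenate per-pixel RGB, depth, and edge values into one 5-channel image."""
    height, width = len(rgb), len(rgb[0])
    return [
        [(*rgb[y][x], depth[y][x], edge[y][x]) for x in range(width)]
        for y in range(height)
    ]

# Toy 1x2 frame (hypothetical values): a dark pixel next to a bright wire pixel.
rgb = [[(10, 10, 10), (200, 200, 200)]]
depth = [[0.8, 0.3]]   # normalized depth per pixel
edge = [[0.0, 1.0]]    # edge response per pixel

fused = early_fuse(rgb, depth, edge)
```

With this layout a segmentation backbone such as U-Net only needs its first convolution widened from three to five input channels, which is what keeps the approach modular across backbones.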
[65] Assessing Privacy Preservation and Utility in Online Vision-Language Models cs.CV | cs.AIPDF
Karmesh Siddharam Chaudhari, Youxiang Zhu, Amy Feng, Xiaohui Liang, Honggang Zhang
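TL;DR: This paper examines the risk of leaking personally identifiable information (PII) when online vision-language models process images, analyzes how contextual relationships in images lead to direct or indirect PII exposure, and proposes methods that protect privacy while preserving the utility of the images.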
Details
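Motivation: As online vision-language models become widely used, users may unknowingly leak sensitive personal information when uploading images; even seemingly harmless details can indirectly reveal PII through contextual clues. This calls for a study of the balance between privacy protection and utility.
Result: The evaluation shows the proposed privacy-preserving methods to be effective, striking a delicate balance between privacy protection and utility in online image processing environments.
Insight: The contribution is a systematic analysis of the two paths, direct and indirect, by which contextual relationships in images leak PII, together with privacy-preserving techniques for VLM applications, offering guidance for the design of privacy-sensitive vision-language systems.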
Abstract: The increasing use of Online Vision Language Models (OVLMs) for processing images has introduced significant privacy risks, as individuals frequently upload images for various utilities, unaware of the potential for privacy violations. Images contain relationships that relate to Personally Identifiable Information (PII), where even seemingly harmless details can indirectly reveal sensitive information through surrounding clues. This paper explores the critical issue of PII disclosure in images uploaded to OVLMs and its implications for user privacy. We investigate how the extraction of contextual relationships from images can lead to direct (explicit) or indirect (implicit) exposure of PII, significantly compromising personal privacy. Furthermore, we propose methods to protect privacy while preserving the intended utility of the images in Vision Language Model (VLM)-based applications. Our evaluation demonstrates the efficacy of these techniques, highlighting the delicate balance between maintaining utility and protecting privacy in online image processing environments. Index Terms: Personally Identifiable Information (PII), Privacy, Utility, privacy concerns, sensitive information
[66] Attention-Guided Flow-Matching for Sparse 3D Geological Generation cs.CV | cs.AIPDF
Zhixiang Lu, Mengqi Han, Peixin Guo, Tianming Bai, Jionglong Su
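TL;DR: This paper proposes 3D-GeoFlow, an attention-guided continuous flow matching framework for generating high-resolution 3D geological models from sparse 1D borehole and 2D surface data, a highly ill-posed inverse problem. It reformulates discrete categorical generation as a simulation-free continuous vector field regression optimized with mean squared error and integrates 3D Attention Gates that dynamically propagate localized borehole features to ensure macroscopic structural coherence.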
Details
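Motivation: Traditional heuristic and implicit modeling methods fail to capture non-linear topological discontinuities under extreme sparsity and often produce unrealistic artifacts, while deep generative architectures such as diffusion models suffer severe representation collapse when conditioned on sparse categorical grids. A method tailored to sparse multimodal geological modeling is therefore needed.
Result: Extensive out-of-distribution evaluations on a large-scale multimodal dataset of 2,200 procedurally generated 3D geological cases show that 3D-GeoFlow significantly outperforms heuristic interpolation and standard diffusion baselines, which the authors describe as a paradigm shift.
Insight: The contribution is to recast discrete categorical generation as continuous vector field regression and to introduce 3D attention gates that dynamically propagate sparse input features, a new angle on sparse conditional generation. The simulation-free, deterministic optimal transport paths also make training more stable.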
Abstract: Constructing high-resolution 3D geological models from sparse 1D borehole and 2D surface data is a highly ill-posed inverse problem. Traditional heuristic and implicit modeling methods fundamentally fail to capture non-linear topological discontinuities under extreme sparsity, often yielding unrealistic artifacts. Furthermore, while deep generative architectures like Diffusion Models have revolutionized continuous domains, they suffer from severe representation collapse when conditioned on sparse categorical grids. To bridge this gap, we propose 3D-GeoFlow, the first Attention-Guided Continuous Flow Matching framework tailored for sparse multimodal geological modeling. By reformulating discrete categorical generation as a simulation-free, continuous vector field regression optimized via Mean Squared Error, our model establishes stable, deterministic optimal transport paths. Crucially, we integrate 3D Attention Gates to dynamically propagate localized borehole features across the volumetric latent space, ensuring macroscopic structural coherence. To validate our framework, we curated a large-scale multimodal dataset comprising 2,200 procedurally generated 3D geological cases. Extensive out-of-distribution (OOD) evaluations demonstrate that 3D-GeoFlow achieves a paradigm shift, significantly outperforming heuristic interpolations and standard diffusion baselines.
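The "simulation-free, continuous vector field regression optimized via Mean Squared Error" can be illustrated with a minimal sketch. It assumes the common straight-line conditional path between a noise sample `x0` and a data sample `x1` (for example, a one-hot encoded rock class per voxel); the abstract does not state the exact path or encoding used by 3D-GeoFlow, so the function names and the one-hot reading are illustrative assumptions.

```python
def flow_matching_pair(x0, x1, t):
    """Build one training pair for flow matching on a straight-line path.

    x0: noise sample, x1: data sample (assumed here to be a continuous
    encoding of categorical labels), t: time in [0, 1].
    Returns the interpolated point x_t and the regression target x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


# A network v_theta(x_t, t) would be trained to minimise mse(v_theta, v_target);
# no ODE has to be simulated during training, only at sampling time.
x_t, v_target = flow_matching_pair([0.0, 0.0], [2.0, 4.0], t=0.25)
loss_for_zero_prediction = mse([0.0, 0.0], v_target)
```

Because the target velocity is constant along each straight path, the regression target is available in closed form for any sampled `t`, which is what makes the objective simulation-free.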
[67] PASTA: Vision Transformer Patch Aggregation for Weakly Supervised Target and Anomaly Segmentation cs.CV | cs.LGPDF
Melanie Neubauer, Elmar Rueckert, Christian Rauch
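TL;DR: This paper proposes PASTA, a weakly supervised method for target and anomaly segmentation. It compares an observed scene with a reference scene, performs distribution analysis in the feature space of a self-supervised Vision Transformer to identify target and anomaly objects, and uses semantic text prompts with Segment Anything Model 3 to guide zero-shot segmentation.
Details
Motivation: Industrial and agricultural applications such as material recycling and weeding must detect unseen anomalies in unstructured environments. Existing perception systems rely on exhaustively annotated data and therefore struggle to meet strict operational requirements: real-time processing, pixel-level segmentation precision, and robustness.
Result: Evaluated on a custom steel scrap recycling dataset and a plant dataset, the method cuts training time by 75.8% relative to domain-specific baselines and reaches up to 88.3% IoU for target segmentation and up to 63.5% IoU for anomaly segmentation across the industrial and agricultural domains.
Insight: The novelty is the combination of distribution analysis in a self-supervised ViT feature space with SAM 3 text prompts, which yields a domain-agnostic weakly supervised segmentation pipeline that detects targets and anomalies without pixel-level annotation while improving segmentation accuracy and reducing training time.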
Abstract: Detecting unseen anomalies in unstructured environments presents a critical challenge for industrial and agricultural applications such as material recycling and weeding. Existing perception systems frequently fail to satisfy the strict operational requirements of these domains, specifically real-time processing, pixel-level segmentation precision, and robust accuracy, due to their reliance on exhaustively annotated datasets. To address these limitations, we propose ‘Patch Aggregation for Segmentation of Targets and Anomalies’ (PASTA), a weakly supervised pipeline for object segmentation and classification that uses only image-level supervision. By comparing an observed scene with a nominal reference, PASTA identifies Target and Anomaly objects through distribution analysis in self-supervised Vision Transformer (ViT) feature spaces. Our pipeline utilizes semantic text-prompts via the Segment Anything Model 3 to guide zero-shot object segmentation. Evaluations on a custom steel scrap recycling dataset and a plant dataset demonstrate a 75.8% training time reduction of our approach compared to domain-specific baselines. While being domain-agnostic, our method achieves superior Target (up to 88.3% IoU) and Anomaly (up to 63.5% IoU) segmentation performance in the industrial and agricultural domains.
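The segmentation results above are reported as IoU (intersection over union). For readers unfamiliar with the metric, the following is a minimal sketch of how it is computed for a pair of binary masks; the nested-list mask format and the convention for two empty masks are assumptions of this example, not details taken from the paper.

```python
def iou(pred_mask, true_mask):
    """Intersection over union of two binary masks given as nested lists."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Convention assumed here: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


predicted = [[1, 1],
             [0, 0]]
reference = [[1, 0],
             [1, 0]]
score = iou(predicted, reference)  # one shared pixel out of three covered pixels
```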
[68] Multi-Granularity Reasoning for Image Quality Assessment via Attribute-Aware Reinforcement Learning to Rank cs.CVPDF
Xiangyong Chen, Xiaochuan Lin, Haoran Liu, Xuan Li, Yichen Su
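TL;DR: This paper proposes MG-IQA, a multi-granularity image quality assessment framework. Using attribute-aware reinforcement learning to rank, it lets a vision-language model jointly assess overall image quality and fine-grained quality attributes (such as sharpness and color fidelity) in a single inference pass.
Details
Motivation: Existing reasoning-based IQA methods usually predict only an overall quality score and overlook the multi-dimensional attributes of human perception, so a method that assesses both overall quality and fine-grained attributes is needed.
Result: Experiments on eight IQA benchmarks show that MG-IQA outperforms state-of-the-art methods in both overall quality prediction (average SRCC improvement of 2.1%) and attribute-level assessment, and that it generates interpretable quality descriptions.
Insight: The innovations are an attribute-aware prompting strategy, a multi-dimensional Thurstone reward model, and a cross-domain alignment mechanism, which together enable stable joint training across multiple datasets and interpretable multi-attribute assessment.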
Abstract: Recent advances in reasoning-induced image quality assessment (IQA) have demonstrated the power of reinforcement learning to rank (RL2R) for training vision-language models (VLMs) to assess perceptual quality. However, existing approaches operate at a single granularity, predicting only an overall quality score, while overlooking the multi-dimensional nature of human quality perception, which encompasses attributes such as sharpness, color fidelity, noise level, and compositional aesthetics. In this paper, we propose MG-IQA (Multi-Granularity IQA), a multi-granularity reasoning framework that extends RL2R to jointly assess overall image quality and fine-grained quality attributes within a single inference pass. Our approach introduces three key innovations: (1) an attribute-aware prompting strategy that elicits structured multi-attribute reasoning from VLMs; (2) a multi-dimensional Thurstone reward model that computes attribute-specific fidelity rewards for group relative policy optimization; and (3) a cross-domain alignment mechanism that enables stable joint training across synthetic distortion, authentic distortion, and AI-generated image datasets without perceptual scale re-alignment. Extensive experiments on eight IQA benchmarks demonstrate that MG-IQA consistently outperforms state-of-the-art methods in both overall quality prediction (average SRCC improvement of 2.1%) and attribute-level assessment, while generating interpretable, human-aligned quality descriptions.
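A rough sketch of what an attribute-specific Thurstone fidelity reward could look like is given below. It assumes the standard Thurstone comparison, in which each image's score for an attribute is Gaussian with a mean and variance, and the fidelity measure often used in learning-to-rank IQA. The abstract does not give MG-IQA's exact formulation, so the function names, the dictionary layout, and the attribute key are hypothetical.

```python
import math


def thurstone_prob(mu_a, var_a, mu_b, var_b, eps=1e-8):
    """Probability that image A is rated above image B under a Thurstone model."""
    z = (mu_a - mu_b) / math.sqrt(var_a + var_b + eps)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF


def fidelity(p_pred, p_true):
    """Fidelity between predicted and reference preference probabilities (1.0 = match)."""
    return math.sqrt(p_pred * p_true) + math.sqrt((1.0 - p_pred) * (1.0 - p_true))


def attribute_rewards(pred_a, pred_b, true_pref):
    """One reward per attribute.

    pred_a, pred_b: {attribute: (mean, variance)} for the two images (assumed layout).
    true_pref: {attribute: reference probability that A beats B}.
    """
    return {
        name: fidelity(thurstone_prob(*pred_a[name], *pred_b[name]), true_pref[name])
        for name in true_pref
    }


rewards = attribute_rewards(
    {"sharpness": (4.0, 1.0)},   # hypothetical predicted score statistics for image A
    {"sharpness": (3.0, 1.0)},   # and for image B
    {"sharpness": 1.0},          # reference says A is sharper
)
```

In a group relative policy optimization setup, such per-attribute rewards would be computed for each sampled response and compared within the group; that step is omitted here.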
[69] Orthogonal Quadratic Complements for Vision Transformer Feed-Forward Networks cs.CV | cs.AIPDF
Wang Zixian
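TL;DR: This paper proposes Orthogonal Quadratic Complements (OQC) for the feed-forward networks of vision transformers. OQC builds a low-rank quadratic auxiliary branch and explicitly projects it onto the orthogonal complement of the main branch before injection, so that it supplies information the main branch has not captured and thereby improves model performance.
Details
Motivation: Existing bilinear feed-forward replacements conflate two effects: stronger quadratic interactions and increased redundancy. The paper explores a complementary design principle in which auxiliary quadratic features contribute only information not already captured by the main branch.
Result: Under a parameter-matched Deep-ViT and CIFAR-100 protocol, full OQC improves the AFBO baseline from 64.25 to 65.59, while OQC-LR reaches 65.52 with a better speed-accuracy tradeoff. On TinyImageNet, the gated extension OQC-dynamic reaches 51.88, which is 1.43 points above the baseline and better than all ungated variants.
Insight: The core innovation is an orthogonal projection that keeps the information contributed by the auxiliary branch from overlapping with the main branch, so parameters are used more efficiently. The low-rank realization and the gated extensions further improve the efficiency-performance tradeoff, and mechanism analyses indicate improved representation geometry and class separation.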
Abstract: Recent bilinear feed-forward replacements for vision transformers can substantially improve accuracy, but they often conflate two effects: stronger second-order interactions and increased redundancy relative to the main branch. We study a complementary design principle in which auxiliary quadratic features contribute only information not already captured by the dominant hidden representation. To this end, we propose Orthogonal Quadratic Complements (OQC), which construct a low-rank quadratic auxiliary branch and explicitly project it onto the orthogonal complement of the main branch before injection. We further study an efficient low-rank realization (OQC-LR) and gated extensions (OQC-static and OQC-dynamic). Under a parameter-matched Deep-ViT and CIFAR-100 protocol with a fixed penultimate residual readout, full OQC improves an AFBO baseline from 64.25 +/- 0.22 to 65.59 +/- 0.22, while OQC-LR reaches 65.52 +/- 0.25 with a substantially better speed-accuracy tradeoff. On TinyImageNet, the gated extension OQC-dynamic achieves 51.88 +/- 0.32, improving the baseline (50.45 +/- 0.21) by 1.43 points and outperforming all ungated variants. Mechanism analyses show near-zero post-projection auxiliary-main overlap together with improved representation geometry and class separation. The full family, including both ungated and gated variants, generalizes consistently across both datasets.
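The projection step can be sketched as follows. This is a minimal illustration that assumes the main branch is represented by a single hidden vector per token and removes the auxiliary component along it; the paper may project against a larger subspace or operate batch-wise, and the function name is hypothetical.

```python
def project_to_orthogonal_complement(aux, main, eps=1e-12):
    """Remove from `aux` its component along `main`.

    The result is (numerically) orthogonal to `main`, so adding it back to the
    main branch contributes only directions the main vector does not span.
    """
    dot_aux_main = sum(a * m for a, m in zip(aux, main))
    dot_main_main = sum(m * m for m in main)
    coefficient = dot_aux_main / (dot_main_main + eps)  # eps guards a zero main vector
    return [a - coefficient * m for a, m in zip(aux, main)]


main_features = [1.0, 0.0, 2.0]
aux_features = [0.5, 3.0, 1.0]
complement = project_to_orthogonal_complement(aux_features, main_features)
overlap = sum(c * m for c, m in zip(complement, main_features))  # close to zero
```

The near-zero `overlap` corresponds to the "near-zero post-projection auxiliary-main overlap" that the abstract reports in its mechanism analysis.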
[70] Head-wise Modality Specialization within MLLMs for Robust Fake News Detection under Missing Modality cs.CV | cs.CLPDF
Kai Qian, Weijie Shi, Jiaqi Wang, Mengze Li, Hao Chen
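TL;DR: This paper proposes a head-wise modality specialization method built on multimodal large language models (MLLMs) to make multimodal fake news detection (MFND) robust to missing modalities. By analyzing how attention heads in MLLMs relate to per-modality performance, it introduces a head specialization mechanism and a unimodal knowledge retention strategy so that strong verification ability is preserved for each modality when the image or the text is missing.
Details
Motivation: Real-world multimodal fake news detection often faces missing modalities caused by deleted images, corrupted screenshots, and similar issues. Because the low-contribution modality is insufficiently learned and unimodal annotations are scarce, existing methods struggle to preserve strong verification ability for each modality when one is missing.
Result: Experiments show that the method improves detection robustness under missing modality while preserving performance with full multimodal input. The abstract names no specific benchmarks and reports no quantitative results.
Insight: The novelty is a systematic study of the modality specialization of attention heads in MLLMs, which motivates a head-wise specialization mechanism and a unimodal knowledge retention strategy that explicitly allocate and preserve the modality verification ability of critical heads, making the model more adaptable to missing modalities.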
Abstract: Multimodal fake news detection (MFND) aims to verify news credibility by jointly exploiting textual and visual evidence. However, real-world news dissemination frequently suffers from missing modality due to deleted images, corrupted screenshots, and similar issues. Thus, robust detection in this scenario requires preserving strong verification ability for each modality, which is challenging in MFND due to insufficient learning of the low-contribution modality and scarce unimodal annotations. To address this issue, we propose Head-wise Modality Specialization within Multimodal Large Language Models (MLLMs) for robust MFND under missing modality. Specifically, we first systematically study attention heads in MLLMs and their relationship with performance under missing modality, showing that modality-critical heads serve as key carriers of unimodal verification ability through their modality specialization. Based on this observation, to better preserve verification ability for the low-contribution modality, we introduce a head-wise specialization mechanism that explicitly allocates these heads to different modalities and preserves their specialization through lower-bound attention constraints. Furthermore, to better exploit scarce unimodal annotations, we propose a Unimodal Knowledge Retention strategy that prevents these heads from drifting away from the unimodal knowledge learned from limited supervision. Experiments show that our method improves robustness under missing modality while preserving performance with full multimodal input.
[71] LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models cs.CV | cs.AIPDF
Shi-Yu Tian, Zhi Zhou, Kun-Yang Yu, Ming Yang, Yang Chen
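TL;DR: This paper proposes LAST, a framework that abstracts heterogeneous vision tools into atomic instructions and reusable spatial skills and returns multimodal hints (such as annotated images and textual descriptions) that LLMs can consume directly, in order to strengthen the spatial reasoning of multimodal large language models.
Details
Motivation: Multimodal large language models frequently hallucinate and lose precision when parsing complex geometric layouts, and data-driven scaling struggles to internalize structured geometric priors and spatial constraints. Integrating mature, specialized vision models is a promising alternative, but it is hindered by the difficulty of invoking heterogeneous tools and of understanding their low-level outputs.
Result: Experiments on four datasets show that LAST-7B achieves around 20% performance gains over its backbone and outperforms strong proprietary closed-source LLMs, substantially enhancing reasoning on complex spatial tasks.
Insight: The innovations are LAST-Box, an extensible interactive sandbox that abstracts tool invocations into atomic instructions and spatial skills, and a three-stage progressive training strategy that guides the model from understanding tool outputs to proficient, adaptive tool invocation.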
Abstract: Spatial reasoning is a cornerstone capability for intelligent systems to perceive and interact with the physical world. However, multimodal large language models (MLLMs) frequently suffer from hallucinations and imprecision when parsing complex geometric layouts. As data-driven scaling struggles to internalize structured geometric priors and spatial constraints, integrating mature, specialized vision models presents a compelling alternative. Despite its promise, applying this paradigm to spatial reasoning is hindered by two key challenges: The difficulty of invoking heterogeneous, parameter-rich tools, as well as the challenge of understanding and effectively leveraging their diverse low-level outputs (e.g., segmentation masks, depth maps) in high-level reasoning. To address these challenges, we propose LAST, a unified framework for tool-augmented spatial reasoning. LAST features an extensible interactive sandbox, termed LAST-Box, which abstracts heterogeneous tool invocations into atomic instructions and reusable spatial skills, returning multimodal hints (e.g., annotated images and textual descriptions) that can be directly consumed by LLMs. We further design a three-stage progressive training strategy that guides models from understanding tool outputs to proficient and adaptive tool invocation. Experiments on four datasets show that LAST-7B achieves around 20% performance gains over its backbone and outperforms strong proprietary closed-source LLMs, substantially enhancing reasoning on complex spatial tasks.
[72] Training Deep Visual Networks Beyond Loss and Accuracy Through a Dynamical Systems Approach cs.CV | cs.AIPDF
Hai La Quang, Hassan Ugail, Newton Howard, Cong Tran Tien, Nam Vu Hoai
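TL;DR: This paper proposes a dynamical-systems view of how the internal representations of deep visual models change during training. It defines three measures extracted from layer activations (an integration score, a metastability score, and a dynamical stability index) and uses them to reveal differing patterns of training behaviour across several model architectures and datasets.
Details
Motivation: Conventional training metrics such as loss and accuracy do not reveal how a model's internal representations change during training, so a complementary method is needed to understand the training dynamics of deep visual networks.
Result: Experiments on nine combinations of model architecture and dataset, covering ResNet variants, DenseNet-121, MobileNetV2, VGG-16, and a pretrained Vision Transformer on CIFAR-10 and CIFAR-100, suggest three patterns: the integration score distinguishes datasets of different difficulty; changes in the volatility of the stability index may signal convergence early; and the relationship between integration and metastability reflects different styles of training behaviour.
Insight: The novelty is to bring dynamical systems theory into the analysis of deep visual network training and to define new measures that quantify the coordination and stability of internal representations. This offers a view of training dynamics beyond conventional metrics; the study is exploratory but promising.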
Abstract: Deep visual recognition models are usually trained and evaluated using metrics such as loss and accuracy. While these measures show whether a model is improving, they reveal very little about how its internal representations change during training. This paper introduces a complementary way to study that process by examining training through the lens of dynamical systems. Drawing on ideas from signal analysis originally used to study biological neural activity, we define three measures from layer activations collected across training epochs: an integration score that reflects long-range coordination across layers, a metastability score that captures how flexibly the network shifts between more and less synchronised states, and a combined dynamical stability index. We apply this framework to nine combinations of model architecture and dataset, including several ResNet variants, DenseNet-121, MobileNetV2, VGG-16, and a pretrained Vision Transformer on CIFAR-10 and CIFAR-100. The results suggest three main patterns. First, the integration measure consistently distinguishes the easier CIFAR-10 setting from the more difficult CIFAR-100 setting. Second, changes in the volatility of the stability index may provide an early sign of convergence before accuracy fully plateaus. Third, the relationship between integration and metastability appears to reflect different styles of training behaviour. Overall, this study offers an exploratory but promising new way to understand deep visual training beyond loss and accuracy.
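The abstract does not define its measures, so the following is only a sketch of one convention from the neural signal analysis literature that the authors say they draw on: the Kuramoto order parameter as a synchrony measure and its standard deviation over time as metastability. Treating each layer as an oscillator with a phase per epoch, and the function names, are assumptions of this example and may differ from the paper's integration and metastability scores.

```python
import cmath
import statistics


def order_parameter(phases):
    """Kuramoto order parameter: 1.0 when all phases agree, near 0.0 when they cancel."""
    return abs(sum(cmath.exp(1j * p) for p in phases) / len(phases))


def synchrony_and_metastability(phase_series):
    """Mean and standard deviation of the order parameter over time.

    phase_series: one list of per-layer phases for each epoch (assumed input;
    extracting phases from layer activations is outside this sketch).
    """
    r = [order_parameter(phases) for phases in phase_series]
    return statistics.fmean(r), statistics.pstdev(r)


# Two layers: fully aligned at the first epoch, in anti-phase at the second.
mean_sync, metastability = synchrony_and_metastability([[0.0, 0.0], [0.0, cmath.pi]])
```

Under this convention, a network that stays locked in one synchronisation level has low metastability, while one that moves between more and less synchronised states has high metastability.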
[73] Multi-Head Attention based interaction-aware architecture for Bangla Handwritten Character Recognition: Introducing a Primary Dataset cs.CVPDF
Mirza Raquib, Asif Pervez Polok, Kedar Nath Biswas, Farida Siddiqi Prity, Saydul Akbar Murad
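TL;DR: For Bangla handwritten character recognition, this paper builds a new balanced dataset and proposes an interaction-aware hybrid deep learning architecture based on multi-head cross-attention fusion, which integrates EfficientNetB3, Vision Transformer, and Conformer modules in parallel.
Details
Motivation: Bangla handwritten characters are hard to recognize because of diverse writing styles, inconsistent stroke patterns, and high visual similarity between characters, and because existing datasets have limited intra-class samples and imbalanced class distributions.
Result: The model reaches 98.84% accuracy on the new dataset and 96.49% on the external CHBCR benchmark, showing strong generalization, and Grad-CAM visualizations provide interpretability.
Insight: The main contributions are a new, balanced, and diverse dataset of Bangla handwritten characters and a novel interaction-aware hybrid architecture that uses multi-head cross-attention to fuse CNN, Transformer, and Conformer features, improving recognition performance and generalization.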
Abstract: Character recognition is the fundamental part of an optical character recognition (OCR) system. Word recognition, sentence transcription, document digitization, and language processing are some of the higher-order tasks that depend on accurate character recognition. Nonetheless, recognizing handwritten Bangla characters is not an easy task because they are written in different styles with inconsistent stroke patterns and a high degree of visual resemblance between characters. The available datasets usually have limited intra-class samples and imbalanced class distributions. We have constructed a new balanced dataset of Bangla handwritten characters to overcome those problems. It consists of 78 classes, and each class has approximately 650 samples. It contains the basic characters, composite (Juktobarno) characters, and numerals. The samples come from a diverse group of writers spanning a wide range of ages and socioeconomic groups. The contributors include elementary and high school students, university students, and professionals, and both right-handed and left-handed writers. We have further proposed an interaction-aware hybrid deep learning architecture that integrates EfficientNetB3, Vision Transformer, and Conformer modules in parallel. A multi-head cross-attention fusion mechanism enables effective feature interaction across these components. The proposed model achieves 98.84% accuracy on the constructed dataset and 96.49% on the external CHBCR benchmark, demonstrating strong generalization capability. Grad-CAM visualizations further provide interpretability by highlighting discriminative regions. The dataset and source code of this research are publicly available at: https://huggingface.co/MIRZARAQUIB/Bangla_Handwritten_Character_Recognition.
[74] Data-Driven Automated Identification of Optimal Feature-Representative Images in Infrared Thermography Using Statistical and Morphological Metrics cs.CV | physics.app-ph | physics.data-anPDF
Harutyun Yagdjian, Martin Gurka
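TL;DR: This paper proposes a data-driven method that automatically identifies, within an infrared thermography (IRT) image sequence, the images that best represent structural features such as defects. The method rests on three complementary metrics, the Homogeneity Index of Mixture (HI), the Representative Elementary Area (REA), and the Total Variation Energy (TVE) index, and quantifies the statistical heterogeneity and geometrical-topological features of images without prior spatial information.
Details
Motivation: In the image sequences produced by IRT post-processing, defect visibility varies strongly across time, frequency, or coefficient domains. Conventional evaluation metrics such as the signal-to-noise ratio or the Tanimoto criterion usually require prior knowledge of defect locations or of defect-free reference regions, which limits their use in automated, unsupervised analysis.
Result: The method is validated experimentally on pulse-heated IRT data from a carbon fiber-reinforced polymer (CFRP) plate with artificial defects and is supported by one-dimensional N-layer thermal model simulations. The results show robust, unbiased ranking of image sequences and provide a reliable basis for automated defect-oriented image selection in IRT.
Insight: The novelty is a set of complementary metrics (HI, REA, TVE) that need no prior spatial information and that combine the quantification of statistical heterogeneity with geometrical-topological analysis based on Minkowski functionals, enabling fully data-driven, unsupervised identification of defect-representative images.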
Abstract: Infrared thermography (IRT) is a widely used non-destructive testing technique for detecting structural features such as subsurface defects. However, most IRT post-processing methods generate image sequences in which defect visibility varies strongly across time, frequency, or coefficient/index domains, making the identification of defect-representative images a critical challenge. Conventional evaluation metrics, such as the signal-to-noise ratio (SNR) or the Tanimoto criterion, often require prior knowledge of defect locations or defect-free reference regions, limiting their suitability for automated and unsupervised analysis. In this work, a data-driven methodology is proposed to identify images within IRT datasets that are most likely to contain and represent structural features, particularly anomalies and defects, without requiring prior spatial information. The approach is based on three complementary metrics: the Homogeneity Index of Mixture (HI), which quantifies statistical heterogeneity via deviations of local intensity distributions from a global reference distribution; a Representative Elementary Area (REA), derived from a Minkowski-functional adaptation of the Representative Elementary Volume concept to two-dimensional images; and a geometrical-topological Total Variation Energy (TVE) index, also based on two-dimensional Minkowski functionals, designed to improve sensitivity to localized anomalies. The framework is validated experimentally using pulse-heated IRT data from a carbon fiber-reinforced polymer (CFRP) plate containing six artificial defects at depths between 0.135 mm and 0.810 mm, and is further supported by one-dimensional N-layer thermal model simulations. The results demonstrate robust and unbiased ranking of image sequences and provide a reliable basis for automated defect-oriented image selection in IRT.
[75] LOLGORITHM: Funny Comment Generation Agent For Short Videos cs.CV | cs.AIPDF
Xuan Ouyang, Senan Wang, Bouzhou Wang, Siyuan Xiahou, Jinrong Zhou
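TL;DR: This paper presents LOLGORITHM, a novel modular multi-agent framework that generates stylized short-video comments conforming to platform-specific cultural and linguistic norms. The framework supports six controllable comment styles and comprises three core modules: video content summarization, video classification, and comment generation with semantic retrieval and hot meme augmentation. The authors build a bilingual dataset of 3,267 videos and 16,335 comments, and automatic scoring together with human preference evaluation shows that LOLGORITHM clearly outperforms baseline methods on both YouTube and Douyin.
Details
Motivation: Short-video platforms have become central to multimedia information dissemination, and comments play a key role in driving engagement, propagation, and algorithmic feedback. However, existing approaches such as video summarization and live-streaming danmaku generation cannot produce authentic comments that conform to platform-specific cultural and linguistic norms.
Result: In an evaluation combining automatic scoring with large-scale human preference analysis across 107 respondents, LOLGORITHM achieves human preference selection rates of 80.46% on YouTube and 84.29% on Douyin, consistently outperforming baseline methods. Ablation studies confirm that the gains come from the framework architecture rather than from the choice of backbone LLM.
Insight: The novelty is a modular multi-agent framework that integrates video content summarization, classification, and a generation module augmented with semantic retrieval and hot memes, giving fine-grained control over six comment styles and producing comments that fit platform-specific cultural context. The design is robust and general and does not depend on a particular LLM backbone.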
Abstract: Short-form video platforms have become central to multimedia information dissemination, where comments play a critical role in driving engagement, propagation, and algorithmic feedback. However, existing approaches – including video summarization and live-streaming danmaku generation – fail to produce authentic comments that conform to platform-specific cultural and linguistic norms. In this paper, we present LOLGORITHM, a novel modular multi-agent framework for stylized short-form video comment generation. LOLGORITHM supports six controllable comment styles and comprises three core modules: video content summarization, video classification, and comment generation with semantic retrieval and hot meme augmentation. We further construct a bilingual dataset of 3,267 videos and 16,335 comments spanning five high-engagement categories across YouTube and Douyin. Evaluation combining automatic scoring and large-scale human preference analysis demonstrates that LOLGORITHM consistently outperforms baseline methods, achieving human preference selection rates of 80.46% on YouTube and 84.29% on Douyin across 107 respondents. Ablation studies confirm that these gains are attributable to the framework architecture rather than the choice of backbone LLM, underscoring the robustness and generalizability of our approach.
[76] See Fair, Speak Truth: Equitable Attention Improves Grounding and Reduces Hallucination in Vision-Language Alignment cs.CVPDF
Mohammad Anas Azeez, Ankan Deria, Zohaib Hasan Siddiqui, Adinath Madhavrao Dukre, Rafiq Ali
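TL;DR: This paper proposes DOP-OBC, a training-free, architecture-agnostic decoding strategy that reduces object hallucination in multimodal large language models (MLLMs) through equitable attention allocation. It uses two complementary object-aware signals, a Dominant Object Penalty (DOP) and an Outlier Boost Coefficient (OBC), which adjust logits within the causal attention mask to suppress over-concentration on visually dominant regions and to amplify attention toward rare but confidently detected objects.
Details
Motivation: During decoding, the attention of MLLMs is disproportionately drawn to visually dominant or frequently occurring content, which leads the model to generate objects that are absent from the visual input (object hallucination). The paper argues that inequitable attention allocation is a root cause of object hallucination and that every object in an image, regardless of size, frequency, or visual salience, should receive equal representational opportunity during decoding.
Result: Extensive experiments on image and video MLLMs show consistent reductions in object hallucination on the CHAIR and POPE benchmarks, along with improved captioning quality, as assessed by GPT-4o, across correctness, consistency, detail, context, and temporal dimensions.
Insight: The novelty is to turn equitable attention allocation into a practical decoding strategy, with DOP and OBC applied as logit modulations that adjust attention without weight updates. The approach is a lightweight, generalizable intervention that targets bias in the attention mechanism directly and helps make multimodal generation more faithful.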
Abstract: Multimodal large language models (MLLMs) frequently hallucinate objects that are absent from the visual input, often because attention during decoding is disproportionately drawn to visually dominant or frequently occurring content. We observe that this inequity in attention allocation is a root cause of object hallucination: when rare, small, or contextually peripheral objects receive insufficient attention, the model fails to ground its generation in the full visual scene. We argue that every object in an image, regardless of its size, frequency or visual salience, deserves equal representational opportunity during decoding. To this end, we propose DOP-OBC, a training-free and architecture-agnostic decoding strategy built on the principle of equitable attention. Two complementary object-aware signals work in tandem: a Dominant Object Penalty (DOP) that softly suppresses attention over-concentration on visually dominant regions, and an Outlier Boost Coefficient (OBC) that amplifies attention toward rare yet confidently detected objects. These signals are injected as per-row logit modulations within the causal attention mask, requiring no weight updates and preserving autoregressive decoding properties. Extensive experiments across image and video MLLMs demonstrate consistent reductions in object hallucination on CHAIR and POPE benchmarks, alongside improvements in GPT-4o assessed captioning quality across correctness, consistency, detail, context and temporal dimensions. DOP-OBC establishes that fairness in attention allocation is not merely a design principle but a practical and effective path toward more faithful multimodal generation.
[77] MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering cs.CV | cs.AIPDF
Suyang Xi, Songtao Hu, Yuxiang Lai, Wangyun Dan, Yaqi Liu
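TL;DR: This paper proposes MedLVR, a latent visual reasoning framework for medical visual question answering (VQA). It introduces an explicit visual evidence state into the autoregressive decoder and interleaves a short latent reasoning segment, so that query-relevant visual evidence is iteratively preserved and refined before the answer is generated. This addresses the limitation of conventional vision-language models, in which the image serves only as static context and reasoning is text-centric.
Details
Motivation: Reasoning in existing medical vision-language models (VLMs) is largely text-based, with images encoded once as static context. This is a fundamental limitation in clinical scenarios that depend on subtle, localized visual evidence. The paper asks how such critical visual information can be reliably retained and used to improve the accuracy and reliability of medical VQA.
Result: Experiments on OmniMedVQA and five external medical VQA benchmarks show that MedLVR consistently outperforms recent reasoning baselines and raises the average score of the Qwen2.5-VL-7B backbone from 48.3% to 53.4%.
Insight: The core innovation is to introduce an explicit visual evidence state as a latent variable in decoding and to preserve and refine visual information dynamically through iterative latent reasoning steps. The two-stage training strategy (ROI-supervised fine-tuning followed by Visual-Latent Policy Optimization, VLPO) also offers a reusable recipe for supervising and optimizing such latent reasoning.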
Abstract: Medical vision–language models (VLMs) have shown strong potential for medical visual question answering (VQA), yet their reasoning remains largely text-centric: images are encoded once as static context, and subsequent inference is dominated by language. This paradigm is fundamentally limited in clinical scenarios, where accurate answers often depend on subtle, localized visual evidence that cannot be reliably preserved in static embeddings. We propose MedLVR, a latent visual reasoning framework that introduces an explicit visual evidence state into autoregressive decoding. Instead of relying solely on text-based intermediate reasoning, MedLVR interleaves a short latent reasoning segment within the decoder by reusing hidden states as continuous latent steps, enabling iterative preservation and refinement of query-relevant visual evidence before answer generation. To support effective visual supervision, we adopt a two-stage training strategy: region of interest (ROI)-supervised fine-tuning aligns latent states with clinically relevant image evidence, and Visual-Latent Policy Optimization (VLPO) further optimizes latent reasoning and answer generation under outcome-level rewards. Experiments on OmniMedVQA and five external medical VQA benchmarks show that MedLVR consistently outperforms recent reasoning baselines and improves the average score over the Qwen2.5-VL-7B backbone from 48.3% to 53.4%. These results show that latent visual reasoning provides an effective mechanism for preserving diagnostically relevant visual evidence and improving the reliability of medical VQA.
[78] Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents cs.CVPDF
Sangwon Baik, Gunhee Kim, Mingi Choi, Hanbyul Joo
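TL;DR: This paper uses a vision-language model as a closed-loop agent that, through iterative reasoning and three inference-time techniques, performs text-guided 6D object pose rearrangement. Without additional fine-tuning, the method surpasses prior work at predicting a goal 6D pose for the target object that is consistent with the text instruction, and it can be combined with robot motion planning to raise manipulation success rates.
Details
Motivation: Vision-language models are weak at 3D understanding and often fail to infer, from a text instruction, a consistent goal 6D pose for a target object in a 3D scene.
Result: The method surpasses prior methods at predicting the text-guided goal 6D pose and works with both closed-source and open-source VLMs. Combined with simple robot motion planning, it also achieves higher manipulation success than existing methods.
Insight: The novelty is to use the VLM as a closed-loop agent for iterative reasoning and to introduce three key inference-time techniques: multi-view reasoning with supporting view selection, object-centered coordinate system visualization, and single-axis rotation prediction. These techniques markedly improve pose reasoning in 3D scenes.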
Abstract: Vision-Language Models (VLMs) exhibit strong visual reasoning capabilities, yet they still struggle with 3D understanding. In particular, VLMs often fail to infer a text-consistent goal 6D pose of a target object in a 3D scene. However, we find that with some inference-time techniques and iterative reasoning, VLMs can achieve dramatic performance gains. Concretely, given a 3D scene represented by an RGB-D image (or a compositional scene of 3D meshes) and a text instruction specifying a desired state change, we repeat the following loop: observe the current scene; evaluate whether it is faithful to the instruction; propose a pose update for the target object; apply the update; and render the updated scene. Through this closed-loop interaction, the VLM effectively acts as an agent. We further introduce three inference-time techniques that are essential to this closed-loop process: (i) multi-view reasoning with supporting view selection, (ii) object-centered coordinate system visualization, and (iii) single-axis rotation prediction. Without any additional fine-tuning or new modules, our approach surpasses prior methods at predicting the text-guided goal 6D pose of the target object. It works consistently across both closed-source and open-source VLMs. Moreover, when combining our 6D pose prediction with simple robot motion planning, it enables more successful robot manipulation than existing methods. Finally, we conduct an ablation study to demonstrate the necessity of each proposed technique.
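The observe, evaluate, propose, apply, render loop described in the abstract can be sketched as a small control loop. All callables here (`render`, `is_faithful`, `propose_update`) are hypothetical stand-ins for the renderer and the VLM queries, and adding the update to an Euler-angle pose component-wise is a simplification chosen for brevity, not the paper's pose parameterisation.

```python
def closed_loop_rearrange(pose, render, is_faithful, propose_update, max_steps=10):
    """Iteratively refine a 6D pose (x, y, z, roll, pitch, yaw) until accepted.

    render(pose) -> observation of the current scene
    is_faithful(observation) -> True if the scene matches the instruction
    propose_update(observation) -> 6-tuple pose delta (e.g. a single-axis rotation)
    Returns the final pose and the number of updates applied.
    """
    for step in range(max_steps):
        observation = render(pose)            # observe the current scene
        if is_faithful(observation):          # evaluate against the instruction
            return pose, step
        delta = propose_update(observation)   # propose a pose update
        pose = tuple(p + d for p, d in zip(pose, delta))  # apply it, then loop to re-render
    return pose, max_steps


# Toy stand-ins: the "instruction" is satisfied once x reaches 0.3.
final_pose, updates = closed_loop_rearrange(
    pose=(0.0, 0.0, 0.0, 0.0, 0.0, 0.0),
    render=lambda pose: pose,
    is_faithful=lambda obs: abs(obs[0] - 0.3) < 1e-9,
    propose_update=lambda obs: (0.1, 0.0, 0.0, 0.0, 0.0, 0.0),
)
```

The `max_steps` bound is an assumption of this sketch; it simply keeps the loop from running indefinitely if the evaluator never accepts the scene.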
[79] ACCIDENT: A Benchmark Dataset for Vehicle Accident Detection from Traffic Surveillance Videos cs.CV | cs.AIPDF
Lukas Picek, Michal Čermák, Marek Hanzl, Vojtěch Čermák
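TL;DR: This paper introduces ACCIDENT, a benchmark dataset for vehicle accident detection in traffic surveillance video. It is designed to evaluate models in supervised (IID and OOD) and zero-shot settings, reflecting both data-rich and data-scarce scenarios. The benchmark contains 2,027 real and 2,211 synthetic clips annotated with accident time, spatial location, and high-level collision type. It defines three core tasks (temporal localization of the accident, spatial localization, and collision type classification), each evaluated with custom metrics that account for the uncertainty and ambiguity inherent in surveillance footage. The authors also provide a range of baselines, including heuristic, motion-aware, and vision-language approaches, and show that ACCIDENT is challenging.
Details
Motivation: Benchmark datasets for automatic vehicle accident detection in traffic surveillance video are lacking, especially ones that cover both supervised and zero-shot settings and thus both data-rich and data-scarce real-world scenarios.
Result: Several baselines, including heuristic, motion-aware, and vision-language models, are evaluated on ACCIDENT, and the results show that the dataset is challenging. The abstract gives no quantitative results and makes no state-of-the-art claim.
Insight: The novelty is a comprehensive accident detection benchmark that combines real and synthetic video, provides detailed annotations (time, location, type), supports several evaluation settings (IID, OOD, zero-shot), and defines custom metrics that account for the uncertainty of surveillance footage, giving a new tool for assessing model robustness in complex real-world scenes.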
Abstract: We introduce ACCIDENT, a benchmark dataset for traffic accident detection in CCTV footage, designed to evaluate models in supervised (IID and OOD) and zero-shot settings, reflecting both data-rich and data-scarce scenarios. The benchmark consists of a curated set of 2,027 real and 2,211 synthetic clips annotated with the accident time, spatial location, and high-level collision type. We define three core tasks: (i) temporal localization of the accident, (ii) its spatial localization, and (iii) collision type classification. Each task is evaluated using custom metrics that account for the uncertainty and ambiguity inherent in CCTV footage. In addition to the benchmark, we provide a diverse set of baselines, including heuristic, motion-aware, and vision-language approaches, and show that ACCIDENT is challenging. You can access the ACCIDENT at: https://accidentbench.github.io
[80] F3G-Avatar : Face Focused Full-body Gaussian Avatar cs.CV | cs.AIPDF
Willem Menu, Erkut Akdag, Pedro Quesado, Yasaman Kashefbahrami, Egor Bondarev
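TL;DR: This paper proposes F3G-Avatar, a full-body Gaussian avatar synthesis method that focuses on facial detail and reconstructs animatable human representations from multi-view RGB video and estimated pose/shape parameters. A two-branch architecture (a body branch and a face refinement branch) captures pose-dependent non-rigid deformations and refines head geometry and appearance, respectively, and training combines reconstruction, perceptual, and face-specific adversarial losses to produce high-quality full-body avatars.
Details
Motivation: Existing full-body Gaussian avatar methods mainly optimize global reconstruction quality and often fail to preserve fine-grained facial geometry and expression detail, because limited facial representational capacity makes high-frequency pose-dependent deformations hard to model.
Result: On the AvatarReX dataset, face-view rendering reaches PSNR 26.243, SSIM 0.964, and LPIPS 0.084, indicating strong rendering quality. Ablations further confirm the contributions of the MHR template and the face-focused deformation branch.
Insight: The novelty is a dedicated face-focused deformation branch that refines the head region, combined with an MHR template for initialization and a face-specific adversarial loss for realism, which markedly improves the fidelity of facial detail while maintaining full-body reconstruction quality.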
Abstract: Existing full-body Gaussian avatar methods primarily optimize global reconstruction quality and often fail to preserve fine-grained facial geometry and expression details. This challenge arises from limited facial representational capacity that causes difficulties in modeling high-frequency pose-dependent deformations. To address this, we propose F3G-Avatar, a full-body, face-aware avatar synthesis method that reconstructs animatable human representations from multi-view RGB video and regressed pose/shape parameters. Starting from a clothed Momentum Human Rig (MHR) template, front/back positional maps are rendered and decoded into 3D Gaussians through a two-branch architecture: a body branch that captures pose-dependent non-rigid deformations and a face-focused deformation branch that refines head geometry and appearance. The predicted Gaussians are fused, posed with linear blend skinning (LBS), and rendered with differentiable Gaussian splatting. Training combines reconstruction and perceptual objectives with a face-specific adversarial loss to enhance realism in close-up views. Experiments demonstrate strong rendering quality, with face-view performance reaching PSNR/SSIM/LPIPS of 26.243/0.964/0.084 on the AvatarReX dataset. Ablations further highlight contributions of the MHR template and the face-focused deformation. F3G-Avatar provides a practical, high-quality pipeline for realistic, animatable full-body avatar synthesis.
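Linear blend skinning (LBS), which the abstract uses to pose the fused Gaussians, moves a point by a weighted sum of per-bone rigid transforms. The sketch below applies it to a single 3D point such as a Gaussian center; how F3G-Avatar handles Gaussian rotations and covariances under LBS is not stated in the abstract, so this covers positions only, and the function names are illustrative.

```python
def transform(point, rotation, translation):
    """Apply a rigid transform (3x3 rotation as nested lists, 3-vector translation)."""
    return [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]


def linear_blend_skinning(point, weights, rotations, translations):
    """Blend the point's position under each bone transform by its skinning weight.

    The weights are expected to sum to 1 for each point.
    """
    posed = [0.0, 0.0, 0.0]
    for weight, rotation, translation in zip(weights, rotations, translations):
        moved = transform(point, rotation, translation)
        posed = [p + weight * m for p, m in zip(posed, moved)]
    return posed


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
quarter_turn_z = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
posed_center = linear_blend_skinning(
    point=[1.0, 0.0, 0.0],
    weights=[0.5, 0.5],                      # equally influenced by two bones
    rotations=[identity, quarter_turn_z],
    translations=[[0.0, 0.0, 0.0], [0.0, 0.0, 2.0]],
)
```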
[81] Is There Knowledge Left to Extract? Evidence of Fragility in Medically Fine-Tuned Vision-Language Models cs.CV | cs.AIPDF
Oliver McLaughlin, Daniel Shubin, Carsten Eickhoff, Ritambhara Singh, William Rudman
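TL;DR: This paper evaluates medically fine-tuned vision-language models (VLMs) on medical imaging tasks. Performance falls toward near-random levels as task difficulty increases, medical fine-tuning brings no consistent advantage, and the models are highly sensitive to prompt wording. A description-based diagnostic pipeline recovers only a limited additional signal. Overall, medical VLM performance is fragile, prompt-dependent, and not reliably improved by domain-specific fine-tuning.
Details
Motivation: The study asks whether medical fine-tuning of vision-language models actually improves clinical reasoning beyond superficial visual cues, particularly in high-stakes medical imaging tasks.
Result: Four pairs of open-source VLMs (LLaVA vs. LLaVA-Med; Gemma vs. MedGemma) are evaluated on four medical image classification tasks of increasing difficulty (brain tumor, pneumonia, skin cancer, and histopathology). Performance degrades toward near-random levels as difficulty increases, medical fine-tuning gives no consistent advantage, and minor prompt changes cause large swings in accuracy and refusal rates.
Insight: The contributions are to expose the fragility and prompt dependence of medical VLM performance and to introduce a description-based diagnostic pipeline that tests whether latent knowledge can be extracted. The analysis indicates that failures stem from both weak visual representations and weak downstream reasoning, and that domain-specific fine-tuning does not resolve these underlying problems.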
Abstract: Vision-language models (VLMs) are increasingly adapted through domain-specific fine-tuning, yet it remains unclear whether this improves reasoning beyond superficial visual cues, particularly in high-stakes domains like medicine. We evaluate four paired open-source VLMs (LLaVA vs. LLaVA-Med; Gemma vs. MedGemma) across four medical imaging tasks of increasing difficulty: brain tumor, pneumonia, skin cancer, and histopathology classification. We find that performance degrades toward near-random levels as task difficulty increases, indicating limited clinical reasoning. Medical fine-tuning provides no consistent advantage, and models are highly sensitive to prompt formulation, with minor changes causing large swings in accuracy and refusal rates. To test whether closed-form VQA suppresses latent knowledge, we introduce a description-based pipeline where models generate image descriptions that a text-only model (GPT-5.1) uses for diagnosis. This recovers a limited additional signal but remains bounded by task difficulty. Analysis of vision encoder embeddings further shows that failures stem from both weak visual representations and downstream reasoning. Overall, medical VLM performance is fragile, prompt-dependent, and not reliably improved by domain-specific fine-tuning.
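[82] Do vision models perceive illusory motion in static images like humans? cs.CV
Isabella Elaine Rosario, Fan L. Cheng, Zitang Sun, Nikolaus Kriegeskorte
TL;DR: This paper examines whether computer vision models perceive illusory motion in static images as humans do, focusing on the Rotating Snakes illusion. Of the optical flow models evaluated, most fail to produce flow fields consistent with human perception, whereas the human-inspired Dual-Channel model shows the expected rotational motion under simulated saccade conditions.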
Details
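Motivation: The aim is to understand human motion processing in order to build more reliable, human-centered computer vision systems. Deep neural networks perform well at optical flow estimation but remain less robust than humans and rely on different computational strategies, so visual motion illusions are used to probe where human and machine vision align or diverge.
Result: Several representative optical flow models were evaluated on the Rotating Snakes illusion, and most failed to generate flow fields consistent with human perception. Under simulated saccade conditions, only the human-inspired Dual-Channel model exhibited the expected rotational motion, with the closest correspondence to human perception emerging during the saccade simulation. Ablations further show that luminance-based and higher-order color-feature-based motion signals both contribute to this behavior and that a recurrent attention mechanism is critical.
Insight: The novelty is the use of illusory motion in static images as a probe, which reveals a substantial gap between current optical flow models and human visual motion processing. The study suggests that incorporating human visual mechanisms, such as dual-channel processing and attention-based integration, is a useful direction for motion estimation systems that better match human perception and for human-centric AI.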
Abstract: Understanding human motion processing is essential for building reliable, human-centered computer vision systems. Although deep neural networks (DNNs) achieve strong performance in optical flow estimation, they remain less robust than humans and rely on fundamentally different computational strategies. Visual motion illusions provide a powerful probe into these mechanisms, revealing how human and machine vision align or diverge. While recent DNN-based motion models can reproduce dynamic illusions such as reverse-phi, it remains unclear whether they can perceive illusory motion in static images, exemplified by the Rotating Snakes illusion. We evaluate several representative optical flow models on Rotating Snakes and show that most fail to generate flow fields consistent with human perception. Under simulated conditions mimicking saccadic eye movements, only the human-inspired Dual-Channel model exhibits the expected rotational motion, with the closest correspondence emerging during the saccade simulation. Ablation analyses further reveal that both luminance-based and higher-order color–feature–based motion signals contribute to this behavior and that a recurrent attention mechanism is critical for integrating local cues. Our results highlight a substantial gap between current optical-flow models and human visual motion processing, and offer insights for developing future motion-estimation systems with improved correspondence to human perception and human-centric AI.
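[83] FF3R: Feedforward Feature 3D Reconstruction from Unconstrained views cs.CV
Chaoyi Zhou, Run Wang, Feng Luo, Mert D. Pesé, Zhiwen Fan
TL;DR: FF3R is an annotation-free feed-forward framework that unifies geometric and semantic 3D reconstruction from unconstrained multi-view image sequences. It needs no camera poses, depth maps, or semantic labels and is trained solely with rendering supervision on RGB and feature maps, addressing the redundancy and compounded errors that arise when existing methods handle geometry and semantics in isolation.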
Details
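Motivation: Vision foundation models have advanced both geometric reconstruction and semantic understanding, but existing approaches usually treat the two capabilities in isolation, which leads to redundant pipelines and compounded errors. This paper aims to build a unified, annotation-free feed-forward framework that jointly reasons about geometry and semantics directly from unconstrained images.
Result: Extensive experiments on the ScanNet and DL3DV-10K benchmarks show that FF3R achieves superior performance in novel-view synthesis, open-vocabulary semantic segmentation, and depth estimation, with strong generalization to in-the-wild scenes.
Insight: The contributions are (1) a Token-wise Fusion Module that enriches geometry tokens with semantic context via cross-attention, and (2) a Semantic-Geometry Mutual Boosting mechanism that combines geometry-guided feature warping for global consistency with semantic-aware voxelization for local consistency. Together they offer a scalable paradigm for unified 3D reasoning in embodied intelligence systems that need both spatial and semantic understanding.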
Abstract: Recent advances in vision foundation models have revolutionized geometry reconstruction and semantic understanding. Yet, most of the existing approaches treat these capabilities in isolation, leading to redundant pipelines and compounded errors. This paper introduces FF3R, a fully annotation-free feed-forward framework that unifies geometric and semantic reasoning from unconstrained multi-view image sequences. Unlike previous methods, FF3R does not require camera poses, depth maps, or semantic labels, relying solely on rendering supervision for RGB and feature maps, establishing a scalable paradigm for unified 3D reasoning. In addition, we address two critical challenges in feedforward feature reconstruction pipelines, namely global semantic inconsistency and local structural inconsistency, through two key innovations: (i) a Token-wise Fusion Module that enriches geometry tokens with semantic context via cross-attention, and (ii) a Semantic-Geometry Mutual Boosting mechanism combining geometry-guided feature warping for global consistency with semantic-aware voxelization for local coherence. Extensive experiments on ScanNet and DL3DV-10K demonstrate FF3R’s superior performance in novel-view synthesis, open-vocabulary semantic segmentation, and depth estimation, with strong generalization to in-the-wild scenarios, paving the way for embodied intelligence systems that demand both spatial and semantic understanding.
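[84] DINO_4D: Semantic-Aware 4D Reconstruction cs.CV | cs.AI | cs.RO
Yiru Yang, Zhuojie Wu, Quentin Marguet, Nishant Kumar Singh, Max Schulthess
TL;DR: DINO_4D proposes a semantic-aware method for 4D reconstruction of dynamic scenes. It introduces frozen DINOv3 features as structural priors, injecting semantic information into the reconstruction process to suppress semantic drift during dynamic tracking.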
Details
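Motivation: The work addresses how to combine low-level geometric sensing with high-level semantic understanding in 4D dynamic scene reconstruction, and how to suppress semantic drift during tracking.
Result: On the Point Odyssey and TUM-Dynamics benchmarks, the method keeps linear O(T) time complexity while significantly improving tracking accuracy (APD) and reconstruction completeness, reaching a new state of the art.
Insight: The novelty is the use of features from a pretrained, frozen vision foundation model (DINOv3) as structural priors in the 4D reconstruction pipeline, which offers a new paradigm for building 4D world models with both geometric precision and semantic understanding.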
Abstract: At the intersection of computer vision and robotic perception, 4D reconstruction of dynamic scenes serves as the critical bridge connecting low-level geometric sensing with high-level semantic understanding. We present DINO_4D, introducing frozen DINOv3 features as structural priors, injecting semantic awareness into the reconstruction process to effectively suppress semantic drift during dynamic tracking. Experiments on the Point Odyssey and TUM-Dynamics benchmarks demonstrate that our method maintains the linear time complexity $O(T)$ of its predecessors while significantly improving Tracking Accuracy (APD) and Reconstruction Completeness. DINO_4D establishes a new paradigm for constructing 4D World Models that possess both geometric precision and semantic understanding.
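[85] Not Your Stereo-Typical Estimator: Combining Vision and Language for Volume Perception cs.CV | cs.AI | cs.LG | cs.MM | eess.IV
Gautham Vinod, Bruce Coburn, Siddeshwar Raghavan, Fengqing Zhu
TL;DR: This paper proposes a method that combines stereo vision with natural language text to estimate object volume accurately from visual data. It extracts deep features from a stereo image pair and from a text description containing the object's class and approximate volume, fuses them through a projection layer into a unified multimodal representation for regression, and significantly outperforms vision-only baselines.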
Details
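Motivation: Existing methods either rely on complex 3D reconstruction pipelines or struggle with the ambiguity inherent in single-view images. The paper aims to improve the accuracy and robustness of volume estimation by fusing visual cues with language priors.
Result: Experiments on public datasets show that the text-guided method significantly outperforms vision-only baselines, confirming that simple textual priors can effectively guide volume estimation.
Insight: The novelty is the combination of implicit 3D cues from stereo vision with explicit prior knowledge from natural language text. Multimodal fusion simplifies the volume estimation pipeline and points toward context-aware visual measurement systems.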
Abstract: Accurate volume estimation of objects from visual data is a long-standing challenge in computer vision with significant applications in robotics, logistics, and smart health. Existing methods often rely on complex 3D reconstruction pipelines or struggle with the ambiguity inherent in single-view images. To address these limitations, we introduce a new method that fuses implicit 3D cues from stereo vision with explicit prior knowledge from natural language text. Our approach extracts deep features from a stereo image pair and a descriptive text prompt that contains the object’s class and an approximate volume, then integrates them using a simple yet effective projection layer into a unified, multi-modal representation for regression. We conduct extensive experiments on public datasets demonstrating that our text-guided approach significantly outperforms vision-only baselines. Our findings show that leveraging even simple textual priors can effectively guide the volume estimation task, paving the way for more context-aware visual measurement systems. Code: https://gitlab.com/viper-purdue/stereo-typical-estimator.
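[86] From UAV Imagery to Agronomic Reasoning: A Multimodal LLM Benchmark for Plant Phenotyping cs.CV | cs.AI | cs.CL
Yu Wu, Guangzeng Han, Ibra Niang Niang, Francia Ravelombola, Maiara Oliveira
TL;DR: This paper presents PlantXpert, a multimodal benchmark for evaluating vision-language models on plant phenotyping, specifically for soybean and cotton. The benchmark contains 385 digital images and more than 3,000 samples covering disease, pest control, weed management, and yield, and is designed to test visual expertise, quantitative reasoning, and multi-step agronomic reasoning.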
Details
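Motivation: Plant science demands domain-specific knowledge, fine-grained visual interpretation, and complex biological and agronomic reasoning, all of which challenge foundation models. Because no dedicated evaluation benchmark exists for this domain, a structured and reproducible framework is needed to evaluate and adapt vision-language models.
Result: Eleven state-of-the-art vision-language models were evaluated. Task-specific fine-tuning substantially improves accuracy, with models such as Qwen3-VL-4B and Qwen3-VL-30B reaching up to 78%. However, gains from model scaling diminish beyond a certain capacity, generalization between soybean and cotton is uneven, and quantitative and biologically grounded reasoning remain major challenges.
Insight: The novelty is an evidence-grounded multimodal reasoning benchmark that gives plant science a standardized framework for model evaluation and adaptation. The benchmark highlights the importance of domain adaptation and exposes the limits of current models on complex agronomic reasoning, pointing to directions for future multimodal models in specialized domains.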
Abstract: To improve crop genetics, high-throughput, effective and comprehensive phenotyping is a critical prerequisite. While such tasks were traditionally performed manually, recent advances in multimodal foundation models, especially in vision-language models (VLMs), have enabled more automated and robust phenotypic analysis. However, plant science remains a particularly challenging domain for foundation models because it requires domain-specific knowledge, fine-grained visual interpretation, and complex biological and agronomic reasoning. To address this gap, we develop PlantXpert, an evidence-grounded multimodal reasoning benchmark for soybean and cotton phenotyping. Our benchmark provides a structured and reproducible framework for agronomic adaptation of VLMs, and enables controlled comparison between base models and their domain-adapted counterparts. We constructed a dataset comprising 385 digital images and more than 3,000 benchmark samples spanning key plant science domains including disease, pest control, weed management, and yield. The benchmark can assess diverse capabilities including visual expertise, quantitative reasoning, and multi-step agronomic reasoning. A total of 11 state-of-the-art VLMs were evaluated. The results indicate that task-specific fine-tuning leads to substantial improvement in accuracy, with models such as Qwen3-VL-4B and Qwen3-VL-30B achieving up to 78%. At the same time, gains from model scaling diminish beyond a certain capacity, generalization across soybean and cotton remains uneven, and quantitative as well as biologically grounded reasoning continue to pose substantial challenges. These findings suggest that PlantXpert can serve as a foundation for assessing evidence-grounded agronomic reasoning and for advancing multimodal model development in plant science.
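[87] Does Your VFM Speak Plant? The Botanical Grammar of Vision Foundation Models for Object Detection cs.CV
Lars Lundqvist, Earl Ranario, Hamid Kamangir, Heesup Yun, Christine Diepenbrock
TL;DR: This paper presents a systematic prompt optimization framework for evaluating the zero-shot performance of four open-vocabulary detectors (YOLO World, SAM3, Grounding DINO, and OWLv2) in complex agricultural scenes such as cowpea flower and pod detection. Prompts are decomposed into eight axes for one-factor-at-a-time analysis followed by combinatorial optimization. The study finds that the optimal prompt structure is model-specific and that prompts optimized on synthetic data transfer effectively to real field imagery, substantially improving detection performance.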
Details
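Motivation: The zero-shot object detection performance of vision foundation models in complex agricultural scenes is highly sensitive to how the text prompt is constructed. Existing approaches lack a systematic prompt optimization strategy, which leads to unstable performance and a gap relative to supervised methods.
Result: With model-specific combinatorial prompt optimization on synthetic cowpea flower data, mAP@0.5 improves by 0.357 for YOLO World and by 0.362 for OWLv2. When prompt structures optimized on synthetic data are transferred to real field data, they match or exceed prompts discovered on labeled real data for the majority of model-object combinations (for example, 0.374 vs. 0.353 for YOLO World on flower detection).
Insight: The novelty is a systematic framework for prompt decomposition and optimization, which shows that optimal prompt structures are model-specific yet transferable across domains (synthetic to real, and across morphologically different targets). The method offers an engineering route to better zero-shot VFM performance in specialized domains without manual annotation, and underlines the role of prompt engineering in narrowing the gap between zero-shot and supervised detection.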
Abstract: Vision foundation models (VFMs) offer the promise of zero-shot object detection without task-specific training data, yet their performance in complex agricultural scenes remains highly sensitive to text prompt construction. We present a systematic prompt optimization framework evaluating four open-vocabulary detectors – YOLO World, SAM3, Grounding DINO, and OWLv2 – for cowpea flower and pod detection across synthetic and real field imagery. We decompose prompts into eight axes and conduct one-factor-at-a-time analysis followed by combinatorial optimization, revealing that models respond divergently to prompt structure: conditions that optimize one architecture can collapse another. Applying model-specific combinatorial prompts yields substantial gains over a naive species-name baseline, including +0.357 mAP@0.5 for YOLO World and +0.362 mAP@0.5 for OWLv2 on synthetic cowpea flower data. To evaluate cross-task generalization, we use an LLM to translate the discovered axis structure to a morphologically distinct target – cowpea pods – and compare against prompting using the discovered optimal structures from synthetic flower data. Crucially, prompt structures optimized exclusively on synthetic data transfer effectively to real-world fields: synthetic-pipeline prompts match or exceed those discovered on labeled real data for the majority of model-object combinations (flower: 0.374 vs. 0.353 for YOLO World; pod: 0.429 vs. 0.371 for SAM3). Our findings demonstrate that prompt engineering can substantially close the gap between zero-shot VFMs and supervised detectors without requiring manual annotation, and that optimal prompts are model-specific, non-obvious, and transferable across domains.
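The two-stage search described in the abstract (one-factor-at-a-time analysis followed by combinatorial optimization) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the three toy axes (the paper uses eight, which the abstract does not list), the `toy_score` function standing in for a detector's mAP@0.5, and the rule of shortlisting only options that beat the baseline.

```python
from itertools import product

# Hypothetical prompt axes and baseline; not taken from the paper.
AXES = {
    "article": ["", "a"],
    "noun": ["cowpea", "flower"],
    "color": ["", "yellow", "white"],
}
BASELINE = {"article": "", "noun": "cowpea", "color": ""}

def toy_score(prompt):
    """Stand-in for evaluating a detector with this prompt (e.g. mAP@0.5)."""
    bonus = {"a": 0.1, "flower": 0.2, "yellow": 0.3, "white": 0.05}
    return sum(bonus.get(value, 0.0) for value in prompt.values())

def ofat_shortlist(axes, baseline, score):
    """Stage 1: vary one axis at a time and keep options that beat the baseline."""
    base = score(baseline)
    shortlist = {}
    for axis, options in axes.items():
        better = [o for o in options if score({**baseline, axis: o}) > base]
        shortlist[axis] = better or [baseline[axis]]
    return shortlist

def best_combination(shortlist, score):
    """Stage 2: exhaustively score every combination of shortlisted options."""
    names = list(shortlist)
    combos = (dict(zip(names, values))
              for values in product(*(shortlist[n] for n in names)))
    return max(combos, key=score)

shortlist = ofat_shortlist(AXES, BASELINE, toy_score)
best = best_combination(shortlist, toy_score)
```

Because the abstract reports that conditions helping one architecture can collapse another, a search of this kind would presumably be run separately per model rather than once for all detectors.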
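[88] BLPR: Robust License Plate Recognition under Viewpoint and Illumination Variations via Confidence-Driven VLM Fallback cs.CV
Guillermo Auza Banegas, Diego Calvimontes Vera, Sergio Castro Sandoval, Natalia Condori Peredo, Edwin Salcedo
TL;DR: This paper presents BLPR, a robust license plate recognition framework aimed at regions such as Bolivia where data are scarce and plates have distinctive visual characteristics. The framework is a two-stage pipeline: a YOLO detector, pretrained on synthetic data generated in Blender and fine-tuned on real street-level data collected in La Paz, Bolivia, locates the plates, which are then geometrically rectified and passed to a character recognition model. To improve robustness in ambiguous scenarios, a confidence-based fallback selectively triggers the lightweight vision-language model Gemma3 4B.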
Details
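Motivation: The work addresses the drop in license plate recognition accuracy caused by illumination changes and viewpoint distortion in unconstrained environments, particularly in data-scarce regions with distinctive visual characteristics such as Bolivia.
Result: The system achieves a character-level recognition accuracy of 89.6% on real-world data, and the authors release the first publicly available Bolivian license plate detection and recognition dataset.
Insight: The main contributions are (1) a region-specific design for Bolivia with a domain adaptation strategy that pretrains on synthetic data simulating extreme viewpoints and lighting and then fine-tunes on real data; (2) a confidence-based fallback to a lightweight VLM (Gemma3 4B) that handles ambiguous cases and improves robustness; and (3) the first public Bolivian license plate dataset, which fills a gap in the field.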
Abstract: Robust license plate recognition in unconstrained environments remains a significant challenge, particularly in underrepresented regions with limited data availability and unique visual characteristics, such as Bolivia. Recognition accuracy in real-world conditions is often degraded by factors such as illumination changes and viewpoint distortion. To address these challenges, we introduce BLPR, a novel deep learning-based License Plate Detection and Recognition (LPDR) framework specifically designed for Bolivian license plates. The proposed system follows a two-stage pipeline where a YOLO-based detector is pretrained on synthetic data generated in Blender to simulate extreme perspectives and lighting conditions, and subsequently fine-tuned on street-level data collected in La Paz, Bolivia. Detected plates are geometrically rectified and passed to a character recognition model. To improve robustness under ambiguous scenarios, a lightweight vision-language model (Gemma3 4B) is selectively triggered as a confidence-based fallback mechanism. The proposed framework further leverages synthetic-to-real domain adaptation to improve robustness under diverse real-world conditions. We also introduce the first publicly available Bolivian LPDR dataset, enabling evaluation under diverse viewpoint and illumination conditions. The system achieves a character-level recognition accuracy of 89.6% on real-world data, demonstrating its effectiveness for deployment in challenging urban environments. Our project is publicly available at https://github.com/EdwinTSalcedo/BLPR.
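The confidence-based fallback described above can be sketched as a simple gate around two recognizers. This is a minimal illustration and not the authors' implementation: the abstract does not say how confidence is aggregated or what threshold is used, so the minimum-over-characters rule, the `threshold` value, and the callables `primary` and `fallback` are all assumptions.

```python
def plate_confidence(char_confidences):
    """Aggregate per-character confidences; the weakest character decides (assumed rule)."""
    return min(char_confidences) if char_confidences else 0.0

def read_plate(image, primary, fallback, threshold=0.85):
    """Return (text, source). `primary(image)` -> (text, per-char confidences);
    `fallback(image)` -> text, e.g. a call to a vision-language model."""
    text, confidences = primary(image)
    if plate_confidence(confidences) >= threshold:
        return text, "primary"
    # Low confidence: pay the extra cost of the slower fallback model.
    return fallback(image), "fallback"

# Stub recognizers for demonstration only.
def confident_ocr(image):
    return "1234ABC", [0.99, 0.97, 0.95, 0.98, 0.96, 0.99, 0.97]

def unsure_ocr(image):
    return "1Z34ABC", [0.99, 0.41, 0.95, 0.98, 0.96, 0.99, 0.97]

def stub_vlm(image):
    return "1234ABC"

clear_result = read_plate(None, confident_ocr, stub_vlm)
blurry_result = read_plate(None, unsure_ocr, stub_vlm)
```

Gating on confidence keeps the expensive model off the common path, which is presumably why the abstract describes the VLM as "selectively triggered".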
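[89] I Walk the Line: Examining the Role of Gestalt Continuity in Object Binding for Vision Transformers cs.CV | cs.AI
Alexa R. Tartaglini, Michael A. Lepori
TL;DR: This paper investigates whether vision transformers (ViTs) rely on the Gestalt principle of continuity when binding objects. Experiments on synthetic datasets show that pretrained ViTs are sensitive to continuity, identify specific attention heads that track it, show that these heads generalize across datasets, and use ablations to demonstrate that they contribute to representations that encode object binding.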
Details
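Motivation: The paper examines how vision models achieve object binding, and in particular whether they rely on the Gestalt principle of continuity, in order to understand how binding mechanisms form in neural networks.
Result: On synthetic datasets, binding probes on pretrained vision transformers are sensitive to continuity. The identified attention heads generalize across datasets, and ablating them weakens the representations that encode object binding.
Insight: The results suggest that vision transformers may use specific attention heads to exploit Gestalt continuity for object binding. This offers a new perspective on visual cognition mechanisms in neural networks and highlights the role of the continuity principle in artificial vision models.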
Abstract: Object binding is a foundational process in visual cognition, during which low-level perceptual features are joined into object representations. Binding has been considered a fundamental challenge for neural networks, and a major milestone on the way to artificial models with flexible visual intelligence. Recently, several investigations have demonstrated evidence that binding mechanisms emerge in pretrained vision models, enabling them to associate portions of an image that contain an object. The question remains: how are these models binding objects together? In this work, we investigate whether vision models rely on the principle of Gestalt continuity to perform object binding, over and above other principles like similarity and proximity. Using synthetic datasets, we demonstrate that binding probes are sensitive to continuity across a wide range of pretrained vision transformers. Next, we uncover particular attention heads that track continuity, and show that these heads generalize across datasets. Finally, we ablate these attention heads, and show that they often contribute to producing representations that encode object binding.
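[90] Cross-Cultural Value Awareness in Large Vision-Language Models cs.CV | cs.AI | cs.CL
Phillip Howard, Xin Su, Kathleen C. Fraser
TL;DR: This paper studies stereotyping in large vision-language models (LVLMs) when they judge a person's moral, ethical, and political values under the influence of the cultural context shown in an image, such as religion, nationality, and socioeconomic status. It builds counterfactual image sets and a comprehensive evaluation framework and runs a multi-dimensional analysis of five popular LVLMs.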
Details
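Motivation: Prior work has focused mainly on fairness with respect to social biases in LVLMs, and has paid less attention to stereotypes tied to cultural contexts such as religion and nationality. This paper aims to narrow that gap by evaluating how the cultural context of an image affects an LVLM's judgments about a person's values.
Result: Five popular LVLMs are evaluated diagnostically using counterfactual image sets, Moral Foundations Theory, lexical analyses, and the sensitivity of generated values to cultural context, revealing how aware the models are of cultural value differences. The abstract reports no specific quantitative results or benchmark comparisons.
Insight: The novelty is the extension of cultural-context stereotype research to LVLMs, together with a multi-dimensional evaluation framework that combines counterfactual images with moral theory and provides a systematic way to analyze a model's sensitivity to cultural values.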
Abstract: The rapid adoption of large vision-language models (LVLMs) in recent years has been accompanied by growing fairness concerns due to their propensity to reinforce harmful societal stereotypes. While significant attention has been paid to such fairness concerns in the context of social biases, relatively little prior work has examined the presence of stereotypes in LVLMs related to cultural contexts such as religion, nationality, and socioeconomic status. In this work, we aim to narrow this gap by investigating how cultural contexts depicted in images influence the judgments LVLMs make about a person’s moral, ethical, and political values. We conduct a multi-dimensional analysis of such value judgments in five popular LVLMs using counterfactual image sets, which depict the same person across different cultural contexts. Our evaluation framework diagnoses LVLM awareness of cultural value differences through the use of Moral Foundations Theory, lexical analyses, and the sensitivity of generated values to depicted cultural contexts.
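[91] Unmixing-Guided Spatial-Spectral Mamba with Clustering Tokens for Hyperspectral Image Classification cs.CV
Yimin Zhu, Lincoln Linlin Xu
TL;DR: This paper proposes an unmixing-guided spatial-spectral Mamba model with clustering tokens for hyperspectral image classification. The method designs a spectral unmixing network to disentangle spectral mixing, uses a Top-K token selection strategy to generate Mamba token sequences, and builds an unmixing-guided spatial-spectral Mamba module to strengthen feature learning and detail preservation. A multi-task supervision scheme learns endmember-abundance patterns and classification labels simultaneously, yielding a new unmixing-classification framework.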
Details
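Motivation: Hyperspectral image classification is challenging because of the spectral mixture effect, spatial-spectral heterogeneity, and the difficulty of preserving class boundaries and details. The paper aims to address these problems to improve classification performance.
Result: Comparative experiments on four hyperspectral image datasets show that the model significantly outperforms other state-of-the-art methods.
Insight: The contributions are (1) a spectral unmixing network that automatically learns endmembers and abundance maps while accounting for endmember variability; (2) an efficient Top-K token selection strategy based on clusters defined by the abundance maps; (3) an unmixing-guided spatial-spectral Mamba module that improves token learning and sequencing; and (4) a multi-task unmixing-classification framework that outputs classification maps, a spectral library, and abundance maps together.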
Abstract: Although hyperspectral image (HSI) classification is critical for supporting various environmental applications, it is a challenging task due to the spectral-mixture effect, the spatial-spectral heterogeneity and the difficulty of preserving class boundaries and details. This letter presents a novel unmixing-guided spatial-spectral Mamba with clustering tokens for improved HSI classification, with the following contributions. First, to disentangle the spectral mixture effect in HSI for improved pattern discovery, we design a novel spectral unmixing network that not only automatically learns endmembers and abundance maps from HSI but also accounts for endmember variabilities. Second, to generate Mamba token sequences, based on the clusters defined by abundance maps, we design an efficient Top-K token selection strategy to adaptively sequence the tokens for improved representational capability. Third, to improve spatial-spectral feature learning and detail preservation, based on the Top-K token sequences, we design a novel unmixing-guided spatial-spectral Mamba module that greatly improves traditional Mamba models in terms of token learning and sequencing. Fourth, to learn simultaneously the endmember-abundance patterns and classification labels, a multi-task scheme is designed for model supervision, leading to a new unmixing-classification framework that outputs not only accurate classification maps but also a comprehensive spectral-library and abundance maps. Comparative experiments on four HSI datasets demonstrate that our model can greatly outperform the other state-of-the-art approaches. Code is available at https://github.com/GSIL-UCalgary/Unmixing_guided_Mamba.git
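[92] Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation cs.CV
Tzu Ling Liu, Ian Stavness, Mrigank Rochan
TL;DR: This paper proposes Learnable Motion-Focused Tokenization (LMFT) for video unsupervised domain adaptation (VUDA). The method splits video frames into patch tokens and learns to discard redundant tokens from low-motion background regions while keeping motion-rich, action-relevant tokens, which counters the domain shift caused by static backgrounds and significantly improves computational efficiency.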
Details
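Motivation: Existing VUDA methods often fall short of fully supervised performance because static, uninformative backgrounds exacerbate domain shift, and they largely overlook computational efficiency, which limits practical use. This paper aims to address both the effectiveness and the efficiency of VUDA.
Result: Extensive experiments on three standard VUDA benchmarks across 21 domain adaptation settings show that the VUDA framework with LMFT achieves state-of-the-art performance while significantly reducing computational overhead.
Insight: The novelty is a learnable, motion-based token selection mechanism that dynamically focuses on action-relevant regions to reduce background interference, offering an approach to video domain adaptation that balances performance and efficiency.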
Abstract: Video Unsupervised Domain Adaptation (VUDA) poses a significant challenge in action recognition, requiring the adaptation of a model from a labeled source domain to an unlabeled target domain. Despite recent advances, existing VUDA methods often fall short of fully supervised performance, a key reason being the prevalence of static and uninformative backgrounds that exacerbate domain shifts. Additionally, prior approaches largely overlook computational efficiency, limiting real-world adoption. To address these issues, we propose Learnable Motion-Focused Tokenization (LMFT) for VUDA. LMFT tokenizes video frames into patch tokens and learns to discard low-motion, redundant tokens, primarily corresponding to background regions, while retaining motion-rich, action-relevant tokens for adaptation. Extensive experiments on three standard VUDA benchmarks across 21 domain adaptation settings show that our VUDA framework with LMFT achieves state-of-the-art performance while significantly reducing computational overhead. LMFT thus enables VUDA that is both effective and computationally efficient.
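The idea of dropping low-motion tokens can be illustrated with a deliberately simplified Python sketch. Note the substitution: LMFT learns which tokens to discard, whereas this example uses plain frame differencing as a fixed, non-learned stand-in. The per-patch scalar representation, the `keep_ratio` parameter, and the function names are all hypothetical.

```python
def motion_scores(prev_patches, curr_patches):
    """Absolute change of each patch's mean intensity between two frames."""
    return [abs(c - p) for p, c in zip(prev_patches, curr_patches)]

def keep_motion_tokens(prev_patches, curr_patches, keep_ratio=0.5):
    """Return the indices of the highest-motion patches, in spatial order."""
    scores = motion_scores(prev_patches, curr_patches)
    k = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Four patches per frame; patches 1 and 3 change, patches 0 and 2 are static.
previous = [0.2, 0.2, 0.2, 0.2]
current = [0.2, 0.9, 0.2, 0.5]
kept = keep_motion_tokens(previous, current, keep_ratio=0.5)
```

Halving the token count in this way shrinks the sequence a transformer has to process, which is the source of the efficiency gain the abstract claims; the actual savings depend on the backbone and are not reproduced here.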
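[93] YUV20K: A Complexity-Driven Benchmark and Trajectory-Aware Alignment Model for Video Camouflaged Object Detection cs.CV | cs.DB
Yiyu Liu, Shuo Ye, Chao Hao, Zitong Yu
TL;DR: This paper presents YUV20K, a complexity-driven benchmark dataset for video camouflaged object detection (VCOD), and designs a new framework with Motion Feature Stabilization (MFS) and Trajectory-Aware Alignment (TAA) modules to address the motion-induced appearance instability and temporal feature misalignment that existing methods face in complex motion scenarios.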
Details
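Motivation: VCOD currently faces two main challenges: a shortage of challenging benchmark datasets, and the limited robustness of existing models to complex, erratic motion dynamics, which leads to motion-induced appearance instability and temporal feature misalignment.
Result: Extensive experiments show that the method significantly outperforms state-of-the-art competitors on existing datasets and establishes a new baseline on the challenging YUV20K benchmark. The framework also shows superior cross-domain generalization and robustness in complex spatiotemporal scenarios.
Insight: The contributions are (1) YUV20K, a large-scale, pixel-level annotated, complexity-driven VCOD benchmark that targets challenging scenarios such as large-displacement motion and camera motion; and (2) a new framework with MFS and TAA modules, in which MFS uses frame-agnostic semantic basis primitives to stabilize features and TAA uses trajectory-guided deformable sampling to ensure precise temporal alignment under complex motion.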
Abstract: Video Camouflaged Object Detection (VCOD) is currently constrained by the scarcity of challenging benchmarks and the limited robustness of models against erratic motion dynamics. Existing methods often struggle with Motion-Induced Appearance Instability and Temporal Feature Misalignment caused by complex motion scenarios. To address the data bottleneck, we present YUV20K, a pixel-level annotated, complexity-driven VCOD benchmark. Comprising 24,295 annotated frames across 91 scenes and 47 species, it specifically targets challenging scenarios such as large-displacement motion, camera motion, and four other scenario types. On the methodological front, we propose a novel framework featuring two key modules: Motion Feature Stabilization (MFS) and Trajectory-Aware Alignment (TAA). The MFS module utilizes frame-agnostic Semantic Basis Primitives to stabilize features, while the TAA module leverages trajectory-guided deformable sampling to ensure precise temporal alignment. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art competitors on existing datasets and establishes a new baseline on the challenging YUV20K. Notably, our framework exhibits superior cross-domain generalization and robustness when confronting complex spatiotemporal scenarios. Our code and dataset will be available at https://github.com/K1NSA/YUV20K
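[94] GIF: A Conditional Multimodal Generative Framework for IR Drop Imaging in Chip Layouts cs.CV
Kiran Thorat, Nicole Meng, Mostafa Karami, Caiwen Ding, Yingjie Lao
TL;DR: This paper proposes GIF, a conditional multimodal generative framework for IR drop imaging in chip layouts. The framework fuses image and graph features and generates high-quality IR drop images through a conditional diffusion process, addressing the fact that traditional EDA tools become slow and expensive as transistor density increases.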
Details
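Motivation: Traditional EDA tools become slow and expensive as transistor density increases, while existing machine-learning methods fail to capture both local and long-range dependencies and ignore crucial geometrical and topological information from physical layouts and logical connectivity.
Result: On the CircuitNet-N28 dataset, GIF achieves 0.78 SSIM, 0.95 Pearson correlation, 21.77 PSNR, and 0.026 NMAE, outperforming prior methods and demonstrating its ability to generate high-quality IR drop images.
Insight: The novelty is the combination of geometry-aware spatial features with logical graph representations through diffusion-based multimodal conditional generation, which lets IR drop analysis draw on recent advances in generative modeling and offers a new approach to structured image generation.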
Abstract: IR drop analysis is essential in physical chip design to ensure the power integrity of on-chip power delivery networks. Traditional Electronic Design Automation (EDA) tools have become slow and expensive as transistor density scales. Recent works have introduced machine learning (ML)-based methods that formulate IR drop analysis as an image prediction problem. These existing ML approaches fail to capture both local and long-range dependencies and ignore crucial geometrical and topological information from physical layouts and logical connectivity. To address these limitations, we propose GIF, a Generative IR drop Framework that uses both geometrical and topological information to generate IR drop images. GIF fuses image and graph features to guide a conditional diffusion process, producing high-quality IR drop images. For instance, on the CircuitNet-N28 dataset, GIF achieves 0.78 SSIM, 0.95 Pearson correlation, 21.77 PSNR, and 0.026 NMAE, outperforming prior methods. These results demonstrate that our framework, using diffusion-based multimodal conditioning, reliably generates high-quality IR drop images. This shows that IR drop analysis can effectively leverage recent advances in generative modeling when geometric layout features and logical circuit topology are jointly modeled. By combining geometry-aware spatial features with logical graph representations, GIF enables IR drop analysis to benefit from recent advances in generative modeling for structured image generation.
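Three of the four reported metrics are simple enough to sketch in plain Python, which may help when reading the numbers above. The sketch treats an IR drop map as a flat list of floats. Two assumptions are worth flagging: PSNR is computed against a `peak` value of 1.0, and NMAE is normalized by the range of the ground truth, which is one common convention among several (the abstract does not state which one the paper uses). SSIM is omitted because it requires windowed local statistics.

```python
import math

def pearson(pred, target):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(target)
    mean_p, mean_t = sum(pred) / n, sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)

def psnr(pred, target, peak=1.0):
    """Peak signal-to-noise ratio in dB for signals with maximum value `peak`."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    return float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse)

def nmae(pred, target):
    """Mean absolute error divided by the ground-truth range (assumed convention)."""
    mae = sum(abs(p - t) for p, t in zip(pred, target)) / len(target)
    return mae / (max(target) - min(target))

# Toy three-pixel "maps" for demonstration only.
truth = [0.0, 0.5, 1.0]
guess = [0.1, 0.5, 0.9]
r, db, err = pearson(guess, truth), psnr(guess, truth), nmae(guess, truth)
```

Higher is better for Pearson correlation and PSNR, while lower is better for NMAE, so the four figures in the abstract do not all point in the same direction.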
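[95] SwinTextUNet: Integrating CLIP-Based Text Guidance into Swin Transformer U-Nets for Medical Image Segmentation cs.CV
Ashfak Yeafi, Parthaw Goswami, Md Khairul Islam, Ashifa Islam Shamme
TL;DR: This paper proposes SwinTextUNet, a multimodal medical image segmentation framework that integrates CLIP-derived text embeddings into a Swin Transformer U-Net backbone. Cross-attention and convolutional fusion align semantic text guidance with hierarchical visual representations, improving segmentation robustness and accuracy on ambiguous or low-contrast patterns.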
Details
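Motivation: Traditional models that rely only on visual features segment poorly when medical images contain ambiguous or low-contrast patterns. The paper aims to make segmentation more robust by integrating the visual and language modalities.
Result: On the QaTaCOV19 dataset, the proposed four-stage variant gives the best balance between performance and complexity, with a Dice score of 86.47% and an IoU of 78.2%. Ablation studies further confirm the importance of text guidance and multimodal fusion.
Insight: The novelty is the combination of CLIP's textual semantic guidance with the Swin Transformer U-Net visual architecture, using cross-modal alignment to strengthen medical image segmentation. This is a promising direction for vision-language integration in medical image analysis, and the multimodal fusion strategy may carry over to other settings that require understanding of complex scenes.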
Abstract: Precise medical image segmentation is fundamental for enabling computer-aided diagnosis and effective treatment planning. Traditional models that rely solely on visual features often struggle when confronted with ambiguous or low-contrast patterns. To overcome these limitations, we introduce SwinTextUNet, a multimodal segmentation framework that incorporates textual embeddings derived from Contrastive Language-Image Pretraining (CLIP) into a Swin Transformer U-Net backbone. By integrating cross-attention and convolutional fusion, the model effectively aligns semantic text guidance with hierarchical visual representations, enhancing robustness and accuracy. We evaluate our approach on the QaTaCOV19 dataset, where the proposed four-stage variant achieves an optimal balance between performance and complexity, yielding Dice and IoU scores of 86.47% and 78.2%, respectively. Ablation studies further validate the importance of text guidance and multimodal fusion. These findings underscore the promise of vision-language integration in advancing medical image segmentation and supporting clinically meaningful diagnostic tools.
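For readers unfamiliar with the two reported scores, the following minimal Python sketch computes Dice and IoU for one pair of binary masks stored as flat lists of 0s and 1s. The function name and the choice to score two empty masks as a perfect match are assumptions made for this illustration, not details from the paper.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two equal-length binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area, target_area = sum(pred), sum(target)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty: treated here as a perfect match.
        return 1.0, 1.0
    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou

# The prediction covers two pixels, one of which is correct.
predicted = [1, 1, 0, 0]
reference = [1, 0, 0, 0]
dice, iou = dice_and_iou(predicted, reference)
```

For a single mask pair the two scores are tied by IoU = Dice / (2 - Dice), so Dice is never lower than IoU. Scores averaged over a dataset need not satisfy this identity exactly.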
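[96] Demographic and Linguistic Bias Evaluation in Omnimodal Language Models cs.CV | cs.AI | cs.CL
Alaa Elobaid
TL;DR: This paper comprehensively evaluates omnimodal language models that process text, images, audio, and video, focusing on demographic bias (age, gender, skin tone) and linguistic bias (language, country of origin). Image and video understanding tasks perform well with small bias, whereas audio understanding tasks show significantly lower performance, larger bias, and prediction collapse toward narrow categories.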
Details
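Motivation: Although omnimodal language models are widely deployed, how their performance varies across demographic groups and modalities has not been well studied. Their fairness therefore needs to be evaluated to address potential risks in real-world use.
Result: Four omnimodal models were evaluated on demographic attribute estimation, identity verification, activity recognition, multilingual speech transcription, and language identification. Demographic disparities are small on image and video tasks, while audio tasks show substantial accuracy differences across age, gender, and language, along with prediction collapse.
Insight: The novelty is a cross-modal fairness evaluation of omnimodal models, which the summary describes as the first of its kind. It identifies the audio modality as the main fairness weakness of current multimodal models and stresses that future model development should evaluate fairness across all supported modalities.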
Abstract: This paper provides a comprehensive evaluation of demographic and linguistic biases in omnimodal language models that process text, images, audio, and video within a single framework. Although these models are being widely deployed, their performance across different demographic groups and modalities is not well studied. Four omnimodal models are evaluated on tasks that include demographic attribute estimation, identity verification, activity recognition, multilingual speech transcription, and language identification. Accuracy differences are measured across age, gender, skin tone, language, and country of origin. The results show that image and video understanding tasks generally exhibit better performance with smaller demographic disparities. In contrast, audio understanding tasks exhibit significantly lower performance and substantial bias, including large accuracy differences across age groups, genders, and languages, and frequent prediction collapse toward narrow categories. These findings highlight the importance of evaluating fairness across all supported modalities as omnimodal language models are increasingly used in real-world applications.
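The phrase "accuracy differences are measured across age, gender, skin tone, language, and country of origin" can be made concrete with a small Python sketch. The gap statistic used here (best-group accuracy minus worst-group accuracy) is an assumption chosen for simplicity; the abstract does not specify the paper's exact disparity measure, and the records below are invented toy data.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group_label, is_correct). Returns {group: accuracy}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        hits[group] += int(is_correct)
    return {group: hits[group] / totals[group] for group in totals}

def max_accuracy_gap(per_group):
    """Difference between the best-served and worst-served group."""
    return max(per_group.values()) - min(per_group.values())

# Toy language-identification outcomes, grouped by spoken language.
outcomes = [("en", True), ("en", True), ("sw", True), ("sw", False)]
per_language = accuracy_by_group(outcomes)
gap = max_accuracy_gap(per_language)
```

Reporting per-group accuracy alongside the gap matters because an aggregate accuracy (0.75 in this toy case) can hide the fact that one group is served much worse than another.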
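[97] What and Where to Adapt: Structure-Semantics Co-Tuning for Machine Vision Compression via Synergistic Adapters cs.CV
Shaobo Liu, Haobo Xiong, Kai Liu, Yuna Lin
TL;DR: This paper proposes Structure-Semantics Co-Tuning (S2-CoT), a framework for parameter-efficient fine-tuning of pretrained codecs in image compression for machine vision. Two specialized, synergistic adapters, the Structural Fidelity Adapter (SFA) and the Semantic Context Adapter (SCA), adapt the feature structure in the encoder-decoder and the statistical semantics in the entropy model, respectively, so that the two are optimized in coordination. The framework achieves state-of-the-art performance on several base codecs while fine-tuning only a small number of parameters.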
Details
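Motivation: Existing work focuses mainly on tuning the feature structure in the encoder-decoder backbone and neglects the adaptation of the statistical semantics in the entropy model, which predicts the probability distribution of latent features. The analysis shows that naively inserting adapters into the entropy model gives suboptimal results, and that the effectiveness of adapter tuning hinges on coordinating adapter type with placement across the compression pipeline.
Result: S2-CoT achieves state-of-the-art results on four different base codecs using only a small fraction of trainable parameters, closely matching the performance of full fine-tuning.
Insight: The main novelty is the S2-CoT framework, whose two synergistic adapters, SFA and SCA, specialize in structural fidelity and semantic context, and whose joint optimization yields synergistic gains. The core insight is to identify and systematically address the coordination between structural adaptation and semantic adaptation in the compression pipeline, rather than optimizing individual components in isolation.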
Abstract: Parameter-efficient fine-tuning of pre-trained codecs is a promising direction in image compression for human and machine vision. While most existing works have primarily focused on tuning the feature structure within the encoder-decoder backbones, the adaptation of the statistical semantics within the entropy model has received limited attention despite its function of predicting the probability distribution of latent features. Our analysis reveals that naive adapter insertion into the entropy model can lead to suboptimal outcomes, underscoring that the effectiveness of adapter-based tuning depends critically on the coordination between adapter type and placement across the compression pipeline. Therefore, we introduce Structure-Semantics Co-Tuning (S2-CoT), a novel framework that achieves this coordination via two specialized, synergistic adapters: the Structural Fidelity Adapter (SFA) and the Semantic Context Adapter (SCA). SFA is integrated into the encoder-decoder to preserve high-fidelity representations by dynamically fusing spatial and frequency information; meanwhile, the SCA adapts the entropy model to align with SFA-tuned features by refining the channel context for more efficient statistical coding. Through joint optimization, S2-CoT turns potential performance degradation into synergistic gains, achieving state-of-the-art results across four diverse base codecs with only a small fraction of trainable parameters, closely matching full fine-tuning performance. Code is available at https://github.com/Brock-bit4/S2-CoT.
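[98] LVSum: A Benchmark for Timestamp-Aware Long Video Summarization cs.CV | cs.AI | cs.LG [PDF]
Alkesh Patel, Melis Ozyildirim, Ying-Chang Cheng, Ganesh Nagarajan
TL;DR: This paper presents LVSum, a benchmark for evaluating long video summarization with fine-grained temporal alignment. It comprises long videos from 13 domains paired with human-annotated summaries that carry precise temporal references. A comprehensive evaluation of proprietary and open-source multimodal large language models reveals systematic gaps in temporal understanding.
Details
Motivation: In long video summarization, current multimodal large language models struggle to maintain temporal fidelity over extended durations and to produce summaries that are both semantically and temporally grounded. A dedicated benchmark is needed to evaluate and improve them.
Result: Proprietary and open-source MLLMs are evaluated on LVSum using newly proposed LLM-based metrics for content relevance and modality coherence alongside standard evaluation metrics. Existing models show systematic gaps in temporal understanding.
Insight: The novelty lies in building LVSum, the first benchmark focused on timestamp-aware long video summarization with fine-grained temporal alignment annotations, and in introducing LLM-based metrics that quantify content relevance and modality coherence, laying a foundation for research on temporal reasoning in long video summarization.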
Abstract: Long video summarization presents significant challenges for current multimodal large language models (MLLMs), particularly in maintaining temporal fidelity over extended durations and producing summaries that are both semantically and temporally grounded. In this work, we present LVSum, a human-annotated benchmark designed specifically for evaluating long video summarization with fine-grained temporal alignment. LVSum comprises diverse long-form videos across 13 domains, each paired with human-generated summaries containing precise temporal references. We conduct a comprehensive evaluation of both proprietary and open-source MLLMs on LVSum, assessing performance using newly introduced LLM-based metrics for content relevance and modality coherence, alongside standard evaluation metrics. Our experiments reveal systematic gaps in temporal understanding among existing MLLMs and offer insights that establish a new foundation for advancing temporal reasoning in long video summarization.
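[99] SinkTrack: Attention Sink based Context Anchoring for Large Language Models cs.CV [PDF]
Xu Liu, Guikun Chen, Wenguan Wang
TL;DR: This paper proposes SinkTrack, a training-free, plug-and-play method that exploits the intrinsic "attention sink" property of large language models (LLMs) to inject key context features into the sequence-initial token (
Details
Motivation: Large language models suffer from hallucination and context forgetting, and prior studies suggest that attention drift (the model's focus shifting from the initial input toward newly generated tokens) is a primary cause. This paper aims to exploit the LLM's intrinsic "attention sink" (the tendency to consistently allocate high attention to the first token of the sequence
Result: Experiments show that SinkTrack mitigates hallucination and context forgetting on both textual tasks (e.g., +21.6% on SQuAD2.0 with Llama3.1-8B-Instruct) and multimodal tasks (e.g., +22.8% on M3CoT with Qwen2.5-VL-7B-Instruct), with consistent gains across models of different architectures and scales, demonstrating robustness and generalizability.
Insight: The paper's novelty lies in using the LLM's intrinsic "attention sink" (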
Abstract: Large language models (LLMs) suffer from hallucination and context forgetting. Prior studies suggest that attention drift is a primary cause of these problems, where LLMs’ focus shifts towards newly generated tokens and away from the initial input context. To counteract this, we make use of a related, intrinsic characteristic of LLMs: attention sink – the tendency to consistently allocate high attention to the very first token (i.e.,
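[100] Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation cs.CV [PDF]
Gordon Chen, Ziqi Huang, Ziwei Liu
TL;DR: This paper proposes Prompt Relay, an inference-time, plug-and-play method that addresses the lack of fine-grained temporal control in video diffusion models for multi-event video generation. It introduces a penalty into the cross-attention mechanism so that each temporal segment attends only to its assigned prompt, which improves temporal prompt alignment, reduces semantic interference, and enhances visual quality.
Details
Motivation: Existing video diffusion models struggle to represent the temporal succession of multiple events in real-world videos and lack explicit mechanisms to control when semantic concepts appear, how long they persist, and the order in which events occur. As a result, paragraph-style prompts describing complex event sequences lead to semantic entanglement and poor text-video alignment.
Result: The paper reports that Prompt Relay achieves fine-grained temporal control in multi-event video generation without architectural modifications or additional computational overhead, improving temporal prompt alignment and visual quality.
Insight: The novelty is an inference-time method that achieves temporal control by modifying the cross-attention mechanism. The core idea is to penalize attention so that the model focuses on a different prompt in each temporal segment, which disentangles semantics and enables precise orchestration of event timing.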
Abstract: Video diffusion models have achieved remarkable progress in generating high-quality videos. However, these models struggle to represent the temporal succession of multiple events in real-world videos and lack explicit mechanisms to control when semantic concepts appear, how long they persist, and the order in which multiple events occur. Such control is especially important for movie-grade video synthesis, where coherent storytelling depends on precise timing, duration, and transitions between events. When using a single paragraph-style prompt to describe a sequence of complex events, models often exhibit semantic entanglement, where concepts intended for different moments in the video bleed into one another, resulting in poor text-video alignment. To address these limitations, we propose Prompt Relay, an inference-time, plug-and-play method to enable fine-grained temporal control in multi-event video generation, requiring no architectural modifications and no additional computational overhead. Prompt Relay introduces a penalty into the cross-attention mechanism, so that each temporal segment attends only to its assigned prompt, allowing the model to represent one semantic concept at a time and thereby improving temporal prompt alignment, reducing semantic interference, and enhancing visual quality.
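The segment-restricted cross-attention described above can be illustrated with a minimal sketch. The abstract does not give the form of the penalty, so the additive logit penalty, the function name `segment_penalized_attention`, and the `frame_segment` / `token_segment` index lists are all assumptions made for illustration, not the authors' implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def segment_penalized_attention(logits, frame_segment, token_segment, penalty=1e4):
    """Hypothetical sketch: logits[f][k] is the raw cross-attention score between
    video frame f and prompt token k. Scores for tokens belonging to a different
    temporal segment than the frame are reduced by `penalty` before the softmax."""
    weights = []
    for f, row in enumerate(logits):
        penalized = [
            score - (penalty if token_segment[k] != frame_segment[f] else 0.0)
            for k, score in enumerate(row)
        ]
        weights.append(softmax(penalized))
    return weights

# Two frames and four prompt tokens: tokens 0-1 describe event 0, tokens 2-3 event 1.
logits = [[1.0, 2.0, 3.0, 0.5], [0.2, 0.1, 1.5, 1.5]]
weights = segment_penalized_attention(
    logits, frame_segment=[0, 1], token_segment=[0, 0, 1, 1]
)
```

With a large penalty each frame's attention mass falls almost entirely on its own segment's tokens; a smaller penalty would give a softer restriction.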
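[101] Counting to Four is still a Chore for VLMs cs.CV [PDF]
Duy Le Dinh Anh, Patrick Amadeus Irawan, Tuan Van Vo
TL;DR: Through behavioral and mechanistic analysis, this paper empirically studies why vision-language models (VLMs) fail at simple counting tasks. It finds that counting failures stem not only from limits of visual perception but also from the underuse of visual evidence during language-stage reasoning. The authors introduce the COUNTINGTRICKS evaluation suite to expose model vulnerabilities under different patchification layouts and adversarial prompts, and evaluate a lightweight intervention, Modality Attention Share (MAS), which enforces a minimum budget of visual attention during answer generation.
Details
Motivation: Although VLMs perform well on complex multimodal reasoning tasks, they still fail at basic tasks such as simple object counting. Most existing evaluations look only at final outputs and give little insight into where failures arise inside the model, so VLM counting behavior needs to be analyzed at both the behavioral and mechanistic levels.
Result: Attention analysis and component-wise probing show that count-relevant visual evidence is strongest at the modality projection stage but degrades substantially in later language layers, where models become more susceptible to text priors. The proposed MAS intervention shows potential for improvement on simple counting tasks, but the abstract does not state specific quantitative results (such as SOTA comparisons).
Insight: The novelty lies in showing that the core of VLM counting failure is not only a visual perception deficit but also the underuse of visual information during language-stage reasoning. The COUNTINGTRICKS suite and the MAS intervention offer new mechanistic analysis tools and a lightweight remedy for understanding and improving vision-language alignment in VLMs.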
Abstract: Vision–language models (VLMs) have achieved impressive performance on complex multimodal reasoning tasks, yet they still fail on simple grounding skills such as object counting. Existing evaluations mostly assess only final outputs, offering limited insight into where these failures arise inside the model. In this work, we present an empirical study of VLM counting behavior through both behavioral and mechanistic analysis. We introduce COUNTINGTRICKS, a controlled evaluation suite of simple shape-based counting cases designed to expose vulnerabilities under different patchification layouts and adversarial prompting conditions. Using attention analysis and component-wise probing, we show that count-relevant visual evidence is strongest in the modality projection stage but degrades substantially in later language layers, where models become more susceptible to text priors. Motivated by this finding, we further evaluate Modality Attention Share (MAS), a lightweight intervention that encourages a minimum budget of visual attention during answer generation. Our results suggest that counting failures in VLMs stem not only from visual perception limits, but also from the underuse of visual evidence during language-stage reasoning. Code and dataset will be released at https://github.com/leduy99/-CVPRW26-Modality-Attention-Share.
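[102] U$^{2}$Flow: Uncertainty-Aware Unsupervised Optical Flow Estimation cs.CV [PDF]
Xunpei Sun, Wenwei Lin, Yi Chang, Gang Chen
TL;DR: This paper proposes U$^{2}$Flow, the first recurrent unsupervised framework that jointly estimates optical flow and per-pixel uncertainty. Its core innovation is a decoupled learning strategy that derives uncertainty supervision from augmentation consistency via a Laplace-based maximum likelihood objective, enabling stable training without ground truth. The predicted uncertainty is further integrated into the network to guide adaptive flow refinement and to dynamically modulate the regional smoothness loss. An uncertainty-guided bidirectional flow fusion mechanism is also introduced to improve robustness in challenging regions.
Details
Motivation: Existing unsupervised optical flow methods typically lack reliable uncertainty estimation, which limits their robustness and interpretability.
Result: Extensive experiments on KITTI and Sintel show that U$^{2}$Flow achieves state-of-the-art performance among unsupervised methods while producing highly reliable uncertainty maps, validating the joint estimation paradigm.
Insight: The main contributions are: 1. the first unsupervised recurrent framework that jointly estimates optical flow and uncertainty; 2. decoupled learning of uncertainty from augmentation consistency via a Laplace-based maximum likelihood objective, enabling stable training without ground truth; 3. use of the predicted uncertainty to guide adaptive flow refinement, modulate the smoothness loss, and steer bidirectional flow fusion, improving robustness in difficult regions. The work integrates uncertainty estimation deeply into the unsupervised learning pipeline.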
Abstract: Unsupervised optical flow methods typically lack reliable uncertainty estimation, limiting their robustness and interpretability. We propose U$^{2}$Flow, the first recurrent unsupervised framework that jointly estimates optical flow and per-pixel uncertainty. The core innovation is a decoupled learning strategy that derives uncertainty supervision from augmentation consistency via a Laplace-based maximum likelihood objective, enabling stable training without ground truth. The predicted uncertainty is further integrated into the network to guide adaptive flow refinement and dynamically modulate the regional smoothness loss. Furthermore, we introduce an uncertainty-guided bidirectional flow fusion mechanism that enhances robustness in challenging regions. Extensive experiments on KITTI and Sintel demonstrate that U$^{2}$Flow achieves state-of-the-art performance among unsupervised methods while producing highly reliable uncertainty maps, validating the effectiveness of our joint estimation paradigm. The code is available at https://github.com/sunzunyi/U2FLOW.
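As a rough illustration of a Laplace-based maximum likelihood objective, the sketch below writes out the per-pixel negative log-likelihood of a Laplace distribution with predicted scale b. How U$^{2}$Flow forms the residual from augmentation consistency is not specified in the abstract; treating it as the discrepancy between flow predicted on an original and an augmented view, and parameterizing the scale as `log_b`, are assumptions made here.

```python
import math

def laplace_nll(residual, log_b):
    """Negative log-likelihood of `residual` under a zero-mean Laplace distribution
    with scale b = exp(log_b):  |r| / b + log(2 b).
    Predicting log_b (an assumed parameterization) keeps b positive."""
    b = math.exp(log_b)
    return abs(residual) / b + log_b + math.log(2.0)

# Hypothetical residual: flow disagreement between two augmented views of one pixel.
r = 2.0
loss_at_optimum = laplace_nll(r, math.log(abs(r)))   # scale equal to |r|
loss_overconfident = laplace_nll(r, math.log(0.5))   # scale too small
loss_underconfident = laplace_nll(r, math.log(8.0))  # scale too large
```

For a fixed residual the loss is minimized when the predicted scale equals the absolute residual, which is why the predicted scale can be read as an uncertainty estimate: both over- and under-confident scales are penalized.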
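[103] On The Application of Linear Attention in Multimodal Transformers cs.CV [PDF]
Armin Gerami, Seyedehanita Madani, Ramani Duraiswami
TL;DR: This paper studies the application of linear attention in multimodal Transformers. Replacing standard quadratic-complexity attention with linear attention substantially reduces computational overhead while keeping performance comparable to the original model.
Details
Motivation: State-of-the-art multimodal vision-language models rely on Transformers, whose quadratic attention complexity is a key bottleneck for scalability, so more efficient attention mechanisms need to be explored.
Result: With ViT-S/16, ViT-B/16, and ViT-L/16 architectures trained on LAION-400M and validated on ImageNet-21K zero-shot accuracy, linear attention yields significant computational savings while remaining competitive, and follows the same scaling laws as standard softmax attention.
Insight: The paper's contribution is the systematic integration of linear attention into a multimodal Transformer framework, demonstrating its viability as an efficient, scalable alternative for next-generation models that process large and complex datasets.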
Abstract: Multimodal Transformers serve as the backbone for state-of-the-art vision-language models, yet their quadratic attention complexity remains a critical barrier to scalability. In this work, we investigate the viability of Linear Attention (LA) as a high-efficiency alternative within multimodal frameworks. By integrating LA, we reduce the computational overhead from quadratic to linear relative to sequence length while preserving competitive performance. We evaluate our approach across ViT-S/16, ViT-B/16, and ViT-L/16 architectures trained on the LAION-400M dataset, with validation focused on ImageNet-21K zero-shot accuracy. Our systematic evaluation demonstrates that Linear Attention not only yields significant computational savings but also adheres to the same scaling laws as standard softmax attention. These findings position Linear Attention as a robust, scalable solution for next-generation multimodal Transformers tasked with processing increasingly large and complex datasets.
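The quadratic-to-linear reduction can be sketched with a kernel feature map. The abstract does not say which linear attention variant is used; the `elu(x) + 1` feature map below is one common choice and is an assumption here, as are the function names. The key step is that the sums over keys are accumulated once and reused for every query, so cost grows linearly with sequence length.

```python
import math

def phi(x):
    """Positive feature map elu(x) + 1 (assumed choice, not taken from the paper)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(Q, K, V):
    """O(n) in sequence length: accumulate S = sum_j phi(k_j) v_j^T and
    z = sum_j phi(k_j) once, then reuse both for every query."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]
    z = [0.0] * d
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        norm = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / norm for b in range(dv)])
    return out

def quadratic_reference(Q, K, V):
    """Same kernel attention computed pairwise (O(n^2)), for comparison only."""
    out = []
    for q in Q:
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        total = sum(w)
        out.append([sum(wi * v[b] for wi, v in zip(w, V)) / total
                    for b in range(len(V[0]))])
    return out

Q = [[0.5, -1.0], [1.5, 0.2]]
K = [[0.1, 0.3], [-0.7, 2.0], [1.0, -0.4]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
fast, slow = linear_attention(Q, K, V), quadratic_reference(Q, K, V)
```

Both functions compute the same result; only the order of operations differs, which is where the saving comes from.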
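[104] Spotlight and Shadow: Attention-Guided Dual-Anchor Introspective Decoding for MLLM Hallucination Mitigation cs.CV [PDF]
Yebo Wu, Han Jin, Zhijiang Guo, Li Li
TL;DR: This paper proposes Dual-Anchor Introspective Decoding (DaID), a new contrastive decoding framework for mitigating hallucination in multimodal large language models (MLLMs). It dynamically calibrates the generation of each token by mining the model's internal perceptual discrepancies: a "Spotlight" layer is identified to amplify visual factual signals and a "Shadow" layer to suppress textual inertia, with visual attention distributions guiding this dual-anchor selection.
Details
Motivation: Although multimodal large language models show strong reasoning capabilities, they still commonly hallucinate, generating text that contradicts the visual content. This paper aims to mitigate the problem with a new decoding framework.
Result: Experiments across multiple benchmarks and different MLLMs show that DaID significantly mitigates hallucination while also enhancing general reasoning capabilities.
Insight: The novelty is a contrastive decoding framework (DaID) based on the model's internal perceptual discrepancies, which dynamically identifies "Spotlight" and "Shadow" layers to amplify visual signals and suppress textual inertia respectively, giving fine-grained calibration of each generated token. Viewed objectively, the method couples visual attention closely with the decoding process and offers a new way of exploiting internal representations to mitigate MLLM hallucination.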
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable reasoning capabilities yet continue to suffer from hallucination, where generated text contradicts visual content. In this paper, we introduce Dual-Anchor Introspective Decoding (DaID), a novel contrastive decoding framework that dynamically calibrates each token generation by mining the model’s internal perceptual discrepancies. Specifically, DaID identifies a Spotlight layer to amplify visual factual signals and a Shadow layer to suppress textual inertia. By leveraging visual attention distributions to guide this dual-anchor selection process, our method ensures precise, token-specific adaptation. Experimental results across multiple benchmarks and MLLMs demonstrate that DaID significantly mitigates hallucination while enhancing general reasoning capabilities.
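[105] Attention-Guided Dual-Stream Learning for Group Engagement Recognition: Fusing Transformer-Encoded Motion Dynamics with Scene Context via Adaptive Gating cs.CV | cs.LG [PDF]
Saniah Kayenat Chowdhury, Muhammad E. H. Chowdhury
TL;DR: This paper proposes DualEngage, a two-stream framework for recognizing group engagement from classroom videos. The primary stream models individual student motion dynamics, encoding temporal motion patterns with a Transformer and aggregating per-student representations through attention pooling; the secondary stream captures scene-level spatiotemporal information from the full video. The two representations are combined with dynamic weights through a softmax-based gated fusion mechanism to learn a joint representation of individual behavior and overall group dynamics.
Details
Motivation: Most existing automated engagement recognition methods target online classrooms or the individual level and do not model engagement effectively at the group level. This paper aims to fill that gap with a two-stream framework designed for recognizing group engagement from classroom videos.
Result: Evaluated with five-fold cross-validation on the Classroom Group Engagement Dataset developed by Ocean University of China, the model achieves an average classification accuracy of 0.9621±0.0161 and a macro-averaged F1 of 0.9530±0.0204.
Insight: The novelty lies in a dual-stream design that explicitly uses motion cues as an engagement estimator and dynamically integrates individual motion dynamics with scene context through attention-guided gated fusion, jointly modeling individual and group-level behavior. It is among the first works in classroom engagement recognition to adopt this kind of design.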
Abstract: Student engagement is crucial for improving learning outcomes in group activities. Highly engaged students perform better both individually and contribute to overall group success. However, most existing automated engagement recognition methods are designed for online classrooms or estimate engagement at the individual level. Addressing this gap, we propose DualEngage, a novel two-stream framework for group-level engagement recognition from in-classroom videos. It models engagement as a joint function of both individual and group-level behaviors. The primary stream models person-level motion dynamics by detecting and tracking students, extracting dense optical flow with the Recurrent All-Pairs Field Transforms network, encoding temporal motion patterns using a transformer encoder, and finally aggregating per-student representations through attention pooling into a unified representation. The secondary stream captures scene-level spatiotemporal information from the full video clip, leveraging a pretrained three-dimensional Residual Network. The two-stream representations are combined via softmax-gated fusion, which dynamically weights each stream’s contribution based on the joint context of both features. DualEngage learns a joint representation of individual actions with overarching group dynamics. We evaluate the proposed approach using fivefold cross-validation on the Classroom Group Engagement Dataset developed by Ocean University of China, achieving an average classification accuracy of 0.9621+/-0.0161 with a macro-averaged F1 of 0.9530+/-0.0204. To understand the contribution of each branch, we further conduct an ablation study comparing single-stream variants against the two-stream model. This work is among the first in classroom engagement recognition to adopt a dual-stream design that explicitly leverages motion cues as an estimator.
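The softmax-gated fusion step can be sketched as follows. The abstract states only that the gate weights each stream based on the joint context of both features; the single linear gating layer, the parameter names `gate_weights` and `gate_bias`, and the assumption that both streams share one feature dimension are hypothetical choices for illustration.

```python
import math

def softmax_gated_fusion(motion, scene, gate_weights, gate_bias):
    """Hypothetical sketch: compute two gate logits from the concatenated
    features, normalize them with a softmax, and return the gated sum."""
    joint = motion + scene  # list concatenation: joint context of both streams
    logits = [
        sum(w * x for w, x in zip(row, joint)) + b
        for row, b in zip(gate_weights, gate_bias)
    ]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    g_motion, g_scene = exps[0] / total, exps[1] / total
    fused = [g_motion * a + g_scene * b for a, b in zip(motion, scene)]
    return fused, (g_motion, g_scene)

motion_feat = [1.0, 3.0]   # stand-in for the pooled per-student motion representation
scene_feat = [3.0, -1.0]   # stand-in for the scene-level representation
zero_gate = [[0.0] * 4, [0.0] * 4]  # untrained gate: both logits are zero
fused, gates = softmax_gated_fusion(motion_feat, scene_feat, zero_gate, [0.0, 0.0])
```

Because the two gate values sum to one, the fused vector is a convex combination of the two stream representations, and the weighting can change from clip to clip.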
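[106] MatRes: Zero-Shot Test-Time Model Adaptation for Simultaneous Matching and Restoration cs.CV | cs.AI [PDF]
Kanggeon Lee, Soochahn Lee, Kyoung Mu Lee
TL;DR: This paper proposes MatRes, a zero-shot test-time adaptation framework that jointly improves image restoration quality and geometric matching accuracy. It needs only a single pair of low-quality and high-quality images: by enforcing conditional similarity at corresponding locations, it updates only lightweight modules while keeping pretrained components frozen, with no offline training or additional supervision.
Details
Motivation: Real-world image pairs often exhibit severe degradation together with large viewpoint changes, which makes image restoration and geometric matching interfere with each other when handled independently.
Result: In extensive experiments across diverse combinations, MatRes yields significant gains in both restoration and geometric alignment compared with using a restoration or matching model alone, offering a practical solution for real-world scenarios.
Insight: The novelty lies in jointly optimizing restoration and matching through zero-shot test-time adaptation, using a conditional similarity constraint and fine-tuning only lightweight modules. This addresses the mutual interference between the two tasks and is widely applicable.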
Abstract: Real-world image pairs often exhibit both severe degradations and large viewpoint changes, making image restoration and geometric matching mutually interfering tasks when treated independently. In this work, we propose MatRes, a zero-shot test-time adaptation framework that jointly improves restoration quality and correspondence estimation using only a single low-quality and high-quality image pair. By enforcing conditional similarity at corresponding locations, MatRes updates only lightweight modules while keeping all pretrained components frozen, requiring no offline training or additional supervision. Extensive experiments across diverse combinations show that MatRes yields significant gains in both restoration and geometric alignment compared to using either restoration or matching models alone. MatRes offers a practical and widely applicable solution for real-world scenarios where users commonly capture multiple images of a scene with varying viewpoints and quality, effectively addressing the often-overlooked mutual interference between matching and restoration.
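[107] Particle Diffusion Matching: Random Walk Correspondence Search for the Alignment of Standard and Ultra-Widefield Fundus Images cs.CV [PDF]
Kanggeon Lee, Soochahn Lee, Kyoung Mu Lee
TL;DR: This paper proposes Particle Diffusion Matching (PDM), a robust alignment technique for registering Standard Fundus Images (SFIs) and Ultra-Widefield Fundus Images (UWFIs). It performs an iterative Random Walk Correspondence Search (RWCS) guided by a diffusion model: at each iteration, displacement vectors are estimated from local appearance, the structural distribution of particles, and an estimated global transformation, so that correspondences are refined progressively even under difficult conditions.
Details
Motivation: SFIs and UWFIs are hard to align because of differences in scale and appearance and the scarcity of distinctive features. Solving this facilitates the integration of complementary retinal image modalities.
Result: PDM achieves state-of-the-art performance on multiple retinal image alignment benchmarks, shows substantial improvement on a primary dataset of SFI-UWFI pairs, and demonstrates its effectiveness in real-world clinical scenarios.
Insight: The novelty lies in combining a diffusion model with random walk search, using local appearance, particle structure, and the global transformation for progressive correspondence estimation. This offers a new direction for improving downstream supervised learning, disease diagnosis, and multi-modal image analysis in ophthalmology.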
Abstract: We propose a robust alignment technique for Standard Fundus Images (SFIs) and Ultra-Widefield Fundus Images (UWFIs), which are challenging to align due to differences in scale, appearance, and the scarcity of distinctive features. Our method, termed Particle Diffusion Matching (PDM), performs alignment through an iterative Random Walk Correspondence Search (RWCS) guided by a diffusion model. At each iteration, the model estimates displacement vectors for particle points by considering local appearance, the structural distribution of particles, and an estimated global transformation, enabling progressive refinement of correspondences even under difficult conditions. PDM achieves state-of-the-art performance across multiple retinal image alignment benchmarks, showing substantial improvement on a primary dataset of SFI-UWFI pairs and demonstrating its effectiveness in real-world clinical scenarios. By providing accurate and scalable correspondence estimation, PDM overcomes the limitations of existing methods and facilitates the integration of complementary retinal image modalities. This diffusion-guided search strategy offers a new direction for improving downstream supervised learning, disease diagnosis, and multi-modal image analysis in ophthalmology.
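[108] Global monitoring of methane point sources using deep learning on hyperspectral radiance measurements from EMIT cs.CV | cs.LG | physics.ao-ph [PDF]
Vishal V. Batchu, Michelangelo Conserva, Alex Wilson, Anna M. Michalak, Varun Gulshan
TL;DR: This paper presents MAPL-EMIT, an end-to-end vision transformer framework that uses hyperspectral radiance data from the EMIT instrument to jointly retrieve methane enhancements across all pixels in a scene, enabling global monitoring of methane point sources with enhancement quantification, plume delineation, and source localization.
Details
Motivation: Anthropogenic methane point sources are a major driver of near-term climate forcing, safety hazards, and system inefficiencies. Existing space-based imaging spectroscopy approaches rely largely on manual plume identification, which is inefficient and hard to scale.
Result: On a test set of 1084 EMIT granules, the model captures 79% of known hand-annotated NASA L2B plume complexes and twice as many plausible plumes as human analysts identified. Compared with existing matched-filter approaches it captures weaker plumes, and synthetic evaluation confirms high recall and precision.
Insight: The novelty lies in combining spectral and spatial context to significantly lower detection limits, and in an end-to-end vision transformer framework that processes the full radiance spectrum jointly and handles multiple overlapping plumes at once. Model-generated metrics such as spectral fit scores and estimated noise levels further limit false-positive rates, shifting methane monitoring from labor-intensive workflows to a rapid, scalable paradigm.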
Abstract: Anthropogenic methane (CH4) point sources drive near-term climate forcing, safety hazards, and system inefficiencies. Space-based imaging spectroscopy is emerging as a tool for identifying emissions globally, but existing approaches largely rely on manual plume identification. Here we present the Methane Analysis and Plume Localization with EMIT (MAPL-EMIT) model, an end-to-end vision transformer framework that leverages the complete radiance spectrum from the Earth Surface Mineral Dust Source Investigation (EMIT) instrument to jointly retrieve methane enhancements across all pixels within a scene. This approach brings together spectral and spatial context to significantly lower detection limits. MAPL-EMIT simultaneously supports enhancement quantification, plume delineation, and source localization, even for multiple overlapping plumes. The model was trained on 3.6 million physics-based synthetic plumes injected into global EMIT radiance data. Synthetic evaluation confirms the model’s ability to identify plumes with high recall and precision and to capture weaker plumes relative to existing matched-filter approaches. On real-world benchmarks, MAPL-EMIT captures 79% of known hand-annotated NASA L2B plume complexes across a test set of 1084 EMIT granules, while capturing twice as many plausible plumes than identified by human analysts. Further validation against coincident airborne data, top-emitting landfills, and controlled release experiments confirms the model’s ability to identify previously uncaptured sources. By incorporating model-generated metrics such as spectral fit scores and estimated noise levels, the framework can further limit false-positive rates. Overall, MAPL-EMIT enables high-throughput implementation on the full EMIT catalog, shifting methane monitoring from labor-intensive workflows to a rapid, scalable paradigm for global plume mapping at the facility scale.
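[109] ABot-Claw: A Foundation for Persistent, Cooperative, and Self-Evolving Robotic Agents cs.CV [PDF]
Dongjie Huo, Haoyun Liu, Guoqing Liu, Dekang Qi, Zhiming Sun
TL;DR: This paper proposes ABot-Claw, an embodied extension framework designed for persistent, cooperative, and self-evolving robotic agents. By integrating a unified embodiment interface, a visual-centric multimodal memory, and a critic-based closed-loop feedback mechanism, it aims to bridge the gap between high-level reasoning and low-level physical execution and to support long-duration, multi-robot task execution in open, dynamic environments.
Details
Motivation: Current embodied intelligent systems face a substantial gap between high-level reasoning and low-level physical execution in open-world environments. Existing VLA models are limited by their open-loop nature, while agents incorporating System 2 cognitive mechanisms are usually confined to closed sandboxes. OpenClaw provides a localized runtime but lacks the embodied control architecture required for long-duration, multi-robot execution.
Result: The abstract provides no specific quantitative results or benchmark comparisons.
Insight: The main contributions are: 1) a capability-driven unified embodiment interface for coordinating heterogeneous robots; 2) a visual-centric cross-embodiment multimodal memory for persistent context retention and grounded retrieval; 3) a critic-based closed-loop feedback mechanism with a generalist reward model for online progress evaluation, local correction, and replanning. The decoupled architecture (OpenClaw layer, shared service layer, robot embodiment layer) closes the loop from natural language intent to physical action and supports progressively self-evolving agents in open environments.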
Abstract: Current embodied intelligent systems still face a substantial gap between high-level reasoning and low-level physical execution in open-world environments. Although Vision-Language-Action (VLA) models provide strong perception and intuitive responses, their open-loop nature limits long-horizon performance. Agents incorporating System 2 cognitive mechanisms improve planning, but usually operate in closed sandboxes with predefined toolkits and limited real-system control. OpenClaw provides a localized runtime with full system privileges, but lacks the embodied control architecture required for long-duration, multi-robot execution. We therefore propose ABot-Claw, an embodied extension of OpenClaw that integrates: 1) a unified embodiment interface with capability-driven scheduling for heterogeneous robot coordination; 2) a visual-centric cross-embodiment multimodal memory for persistent context retention and grounded retrieval; and 3) a critic-based closed-loop feedback mechanism with a generalist reward model for online progress evaluation, local correction, and replanning. With a decoupled architecture spanning the OpenClaw layer, shared service layer, and robot embodiment layer, ABot-Claw enables real-world interaction, closes the loop from natural language intent to physical action, and supports progressively self-evolving robotic agents in open, dynamic environments.
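[110] Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation cs.CV [PDF]
Ruibin Li, Tao Yang, Fangzhou Ai, Tianhe Wu, Shilei Wen
TL;DR: This paper proposes Hybrid Forcing, a streaming video generation method that uses a hybrid attention mechanism and a decoupled distillation strategy to address two problems of sliding window attention in long video generation: loss of distant history and high computational overhead. It combines lightweight linear temporal attention, which preserves long-range dependencies, with block-sparse attention, which makes computation inside the local window more efficient, and achieves real-time, unbounded 832x480 video generation on a single H100 GPU.
Details
Motivation: Existing streaming video generation methods are based on sliding window attention, which inevitably loses distant history when generating long videos and whose computational overhead is a critical challenge for real-time deployment.
Result: Extensive experiments on short- and long-form video generation benchmarks show that Hybrid Forcing consistently achieves state-of-the-art performance. In particular, the model reaches real-time, unbounded 832x480 video generation at 29.5 FPS on a single NVIDIA H100 GPU without quantization or model compression.
Insight: The novelty lies in a hybrid attention design (linear temporal attention plus block-sparse sliding window attention) that jointly optimizes temporal information retention and computational efficiency, together with a matching decoupled distillation strategy that ensures stable optimization. Viewed objectively, maintaining a compact KV state that incrementally absorbs tokens evicted from the window, and reallocating computation toward more critical dependencies, are ideas of general value for efficient long-sequence modeling.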
Abstract: Streaming video generation (SVG) distills a pretrained bidirectional video diffusion model into an autoregressive model equipped with sliding window attention (SWA). However, SWA inevitably loses distant history during long video generation, and its computational overhead remains a critical challenge to real-time deployment. In this work, we propose Hybrid Forcing, which jointly optimizes temporal information retention and computational efficiency through a hybrid attention design. First, we introduce lightweight linear temporal attention to preserve long-range dependencies beyond the sliding window. In particular, we maintain a compact key-value state to incrementally absorb evicted tokens, retaining temporal context with negligible memory and computational overhead. Second, we incorporate block-sparse attention into the local sliding window to reduce redundant computation within short-range modeling, reallocating computational capacity toward more critical dependencies. Finally, we introduce a decoupled distillation strategy tailored to the hybrid attention design. A few-step initial distillation is performed under dense attention, then the distillation of our proposed linear temporal and block-sparse attention is activated for streaming modeling, ensuring stable optimization. Extensive experiments on both short- and long-form video generation benchmarks demonstrate that Hybrid Forcing consistently achieves state-of-the-art performance. Notably, our model achieves real-time, unbounded 832x480 video generation at 29.5 FPS on a single NVIDIA H100 GPU without quantization or model compression. The source code and trained models are available at https://github.com/leeruibin/hybrid-forcing.
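The idea of a compact key-value state that absorbs evicted tokens can be sketched as a sliding window paired with linear-attention running sums. This is only an illustration under assumptions: the class name `HybridCache`, the `elu(x) + 1` feature map, and the omission of how the window and the state are combined at read time are all choices made here, not details taken from the paper.

```python
import math
from collections import deque

def phi(x):
    """Positive feature map elu(x) + 1 (assumed, not taken from the paper)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

class HybridCache:
    """Hypothetical sketch: exact (key, value) pairs stay in a fixed-size window;
    a pair evicted from the window is folded into the running sums
    S += phi(k) v^T and z += phi(k), whose size does not grow with video length."""

    def __init__(self, window, d, dv):
        self.window = deque()
        self.size = window
        self.S = [[0.0] * dv for _ in range(d)]
        self.z = [0.0] * d

    def push(self, k, v):
        self.window.append((k, v))
        if len(self.window) > self.size:
            old_k, old_v = self.window.popleft()  # token leaves the local window
            for a, fa in enumerate(phi(old_k)):
                self.z[a] += fa
                for b, vb in enumerate(old_v):
                    self.S[a][b] += fa * vb

cache = HybridCache(window=2, d=2, dv=1)
for t in range(5):
    cache.push([float(t), -float(t)], [1.0])
```

After five pushes with a window of two, the three oldest tokens live only in `S` and `z`, so memory stays constant however long the stream runs.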
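[111] A Dual Cross-Attention Graph Learning Framework For Multimodal MRI-Based Major Depressive Disorder Detection cs.CV | cs.AI [PDF]
Nojod M. Alotaibi, Areej M. Alhothali
TL;DR: This paper proposes a dual cross-attention-based multimodal fusion framework that integrates structural MRI and resting-state functional MRI data to improve detection of major depressive disorder (MDD). Tested on the large-scale REST-meta-MDD dataset, the framework explicitly models bidirectional interactions between the modalities and achieves robust, competitive classification results across several brain atlas configurations.
Details
Motivation: The neurobiological changes associated with MDD are complex and cannot be fully captured by a single imaging modality. Multimodal MRI combines structural and functional data and gives a more comprehensive picture of brain changes, but integrating the modalities effectively remains challenging.
Result: Under 10-fold stratified cross-validation, the proposed method shows robust, competitive performance across all atlas types. It consistently outperforms conventional feature-level concatenation for functional atlases and maintains comparable performance for structural atlases. The best model achieves 84.71% accuracy, 86.42% sensitivity, 82.89% specificity, 84.34% precision, and 85.37% F1-score on the REST-meta-MDD dataset.
Insight: The novelty is a dual cross-attention fusion mechanism that explicitly models bidirectional interactions between structural MRI and functional MRI representations. This underlines the importance of explicitly modeling cross-modal interactions in multimodal neuroimaging classification and offers an effective attention design for multimodal fusion.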
Abstract: Major depressive disorder (MDD) is a prevalent mental disorder associated with complex neurobiological changes that cannot be fully captured using a single imaging modality. The use of multimodal magnetic resonance imaging (MRI) provides a more comprehensive understanding of brain changes by combining structural and functional data. Despite this, the effective integration of these modalities remains challenging. In this study, we propose a dual cross-attention-based multimodal fusion framework that explicitly models bidirectional interactions between structural MRI (sMRI) and resting-state functional MRI (rs-fMRI) representations. The proposed approach is tested on the large-scale REST-meta-MDD dataset using both structural and functional brain atlas configurations. Numerous experiments conducted under a 10-fold stratified cross-validation demonstrated that the proposed fusion algorithm achieves robust and competitive performance across all atlas types. The proposed method consistently outperforms conventional feature-level concatenation for functional atlases, while maintaining comparable performance for structural atlases. The most effective dual cross-attention multimodal model obtained 84.71% accuracy, 86.42% sensitivity, 82.89% specificity, 84.34% precision, and 85.37% F1-score. These findings emphasize the importance of explicitly modeling cross-modal interactions for multimodal neuroimaging-based MDD classification.
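For readers less familiar with the five reported metrics, the sketch below computes them from binary confusion-matrix counts. The function names and the example counts are hypothetical; only the final consistency check uses figures from the abstract.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics from confusion-matrix counts
    (positive class = MDD; the counts used below are made up)."""
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall (sensitivity)."""
    return 2 * precision * recall / (precision + recall)

example = classification_metrics(tp=8, fp=3, tn=7, fn=2)
reported_f1 = f1_from_precision_recall(0.8434, 0.8642)
```

The harmonic mean of the reported precision (84.34%) and sensitivity (86.42%) comes to about 85.37%, which agrees with the reported F1-score.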
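[112] VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation cs.CV | cs.AI [PDF]
Longteng Jiang, DanDan Zheng, Qianqian Qiao, Heng Huang, Huaye Wang
TL;DR: This paper introduces VGA-Bench, a unified benchmark for jointly evaluating video generation quality and aesthetic quality. It is built on a three-tier taxonomy covering Aesthetic Quality, Aesthetic Tagging, and Generation Quality, each decomposed into multiple sub-dimensions. The authors create a large-scale dataset of over 60,000 videos and develop three multi-task neural assessors (VAQA-Net, VTag-Net, VGQA-Net) for automated evaluation. Experiments show that these models align well with human judgments.
Details
Motivation: AIGC-based video generation is advancing rapidly, but existing benchmarks focus mainly on technical fidelity and lack comprehensive evaluation of perceptual and artistic qualities such as aesthetic appeal, leaving an evaluation gap.
Result: Extensive experiments show that the three assessors (VAQA-Net, VTag-Net, VGQA-Net) achieve reliable alignment with human judgments in automated evaluation, offering both accuracy and efficiency.
Insight: The novelty lies in a unified three-tier taxonomy for systematic evaluation of video generation, together with a large-scale dataset and matching automated multi-task assessors, providing a new benchmark and tools for comprehensive evaluation of AIGC video.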
Abstract: The rapid advancement of AIGC-based video generation has underscored the critical need for comprehensive evaluation frameworks that go beyond traditional generation quality metrics to encompass aesthetic appeal. However, existing benchmarks remain largely focused on technical fidelity, leaving a significant gap in holistic assessment-particularly with respect to perceptual and artistic qualities. To address this limitation, we introduce VGA-Bench, a unified benchmark for joint evaluation of video generation quality and aesthetic quality. VGA-Bench is built upon a principled three-tier taxonomy: Aesthetic Quality, Aesthetic Tagging, and Generation Quality, each decomposed into multiple fine-grained sub-dimensions to enable systematic assessment. Guided by this taxonomy, we design 1,016 diverse prompts and generate a large-scale dataset of over 60,000 videos using 12 video generation models, ensuring broad coverage across content, style, and artifacts. To enable scalable and automated evaluation, we annotate a subset of the dataset via human labeling and develop three dedicated multi-task neural assessors: VAQA-Net for aesthetic quality prediction, VTag-Net for automatic aesthetic tagging, and VGQA-Net for generation and basic quality attributes. Extensive experiments demonstrate that our models achieve reliable alignment with human judgments, offering both accuracy and efficiency. We release VGA-Bench as a public benchmark to foster research in AIGC evaluation, with applications in content moderation, model debugging, and generative model optimization.
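[113] Semantic Manipulation Localization cs.CV | cs.AI [PDF]
Zhenshan Tan, Chenhan Lu, Yuxiang Huang, Ziwen He, Xiang Zhang
TL;DR: Conventional image manipulation localization methods perform poorly on semantic manipulations produced by modern image editing and generative models, which leave no obvious low-level artifacts. This paper introduces the new task of Semantic Manipulation Localization (SML) and builds a corresponding fine-grained benchmark. The authors further propose the TRACE framework, which models semantic sensitivity through three progressively coupled components (semantic anchoring, semantic perturbation sensing, and semantic-constrained reasoning) to localize subtle semantic edits that significantly change how an image is interpreted.
Details
Motivation: Manipulations produced by modern image editing and generative models often leave no obvious low-level artifacts. Instead they make subtle but meaningful edits to an object's attributes, state, or relationships while staying highly consistent with the surrounding content. This defeats conventional image manipulation localization methods, which rely mainly on artifact detection, so semantically sensitive localization is needed.
Result: On the dedicated fine-grained SML benchmark built by the authors, TRACE consistently outperforms existing IML methods and produces more complete, compact, and semantically coherent localization results.
Insight: The core contribution is shifting the focus of manipulation localization from low-level artifacts to semantic inconsistency, defining the new SML task and building its benchmark. The TRACE framework models semantic sensitivity systematically by combining semantic understanding, frequency perturbation sensing, and joint reasoning, opening a new direction for image forensics in complex semantic editing scenarios.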
Abstract: Image Manipulation Localization (IML) aims to identify edited regions in an image. However, with the increasing use of modern image editing and generative models, many manipulations no longer exhibit obvious low-level artifacts. Instead, they often involve subtle but meaning-altering edits to an object’s attributes, state, or relationships while remaining highly consistent with the surrounding content. This makes conventional IML methods less effective because they mainly rely on artifact detection rather than semantic sensitivity. To address this issue, we introduce Semantic Manipulation Localization (SML), a new task that focuses on localizing subtle semantic edits that significantly change image interpretation. We further construct a dedicated fine-grained benchmark for SML using a semantics-driven manipulation pipeline with pixel-level annotations. Based on this task, we propose TRACE (Targeted Reasoning of Attributed Cognitive Edits), an end-to-end framework that models semantic sensitivity through three progressively coupled components: semantic anchoring, semantic perturbation sensing, and semantic-constrained reasoning. Specifically, TRACE first identifies semantically meaningful regions that support image understanding, then injects perturbation-sensitive frequency cues to capture subtle edits under strong visual consistency, and finally verifies candidate regions through joint reasoning over semantic content and semantic scope. Extensive experiments show that TRACE consistently outperforms existing IML methods on our benchmark and produces more complete, compact, and semantically coherent localization results. These results demonstrate the necessity of moving beyond artifact-based localization and provide a new direction for image forensics in complex semantic editing scenarios.
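[114] Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval cs.CV | cs.CL | cs.IR [PDF]
Yibo Yan, Mingdong Ou, Yi Cao, Jiahao Huo, Xin Zou
TL;DR: This paper proposes ColChunk, a framework that introduces multimodal late chunking to build efficient, context-aware multi-vector representations for visual document retrieval. It uses hierarchical clustering fused with a 2D position prior to group patch-level embeddings adaptively, sharply reducing the number of vectors while preserving global context. Evaluation on 24 VDR datasets shows that ColChunk cuts storage requirements by more than 90% and delivers a 9-point average improvement in nDCG@5 across representative single-vector models.
Details
Motivation: Multi-vector models offer fine-grained matching in visual document retrieval but incur high storage and computational costs. The goal is to reduce these costs enough to make practical deployment feasible.
Result: Evaluated on 24 VDR datasets, ColChunk reduces storage requirements by over 90% and delivers a 9-point average improvement in nDCG@5 across representative single-vector models, balancing retrieval accuracy and efficiency.
Insight: The novelty is a multimodal late chunking framework that fuses hierarchical clustering with a 2D position prior for adaptive, content-aware vector grouping. Unlike existing pruning or fixed-token approaches, it preserves spatial-semantic coherence and offers a practical solution for efficient visual document retrieval.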
Abstract: Multi-vector models dominate Visual Document Retrieval (VDR) due to their fine-grained matching capabilities, but their high storage and computational costs present a major barrier to practical deployment. In this paper, we propose ColChunk, a plug-and-play framework that introduces multimodal late chunking to construct efficient, contextualized multi-vectors. Unlike existing pruning or fixed-token approaches, ColChunk employs hierarchical clustering on patch-level embeddings, fused with a 2D position prior to ensure spatial-semantic coherence. This adaptive grouping allows for a content-aware representation that preserves global context while drastically reducing the vector count. Evaluations across 24 VDR datasets demonstrate ColChunk achieves over a 90% reduction in storage requirements while simultaneously delivering a 9-point average improvement in nDCG@5 across representative single-vector models. ColChunk provides a practical solution for balancing retrieval accuracy and efficiency in visual document systems.
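Since the headline result is stated in nDCG@5, a short sketch of that metric may help. It uses the common formulation with linear gain and a log2 rank discount, and it assumes the relevance list covers every relevant document for the query; the benchmark's exact variant is not stated in the abstract.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: the item at 1-based rank i contributes
    rel_i / log2(i + 1); enumerate is 0-based, hence rank + 2."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=5):
    """DCG of the returned order divided by DCG of the ideal order.
    Assumes `relevances` lists the relevance of every retrieved document
    and that no relevant document is missing from the list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([1, 0, 0, 0, 0])      # relevant page ranked first
second_place = ndcg_at_k([0, 1, 0, 0, 0])  # relevant page ranked second
```

A "9-point" improvement then means an increase of 0.09 on this 0-to-1 scale, averaged over datasets.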
[115] Radiology Report Generation for Low-Quality X-Ray Images cs.CV
Hongze Zhu, Chen Hu, Jiaxuan Jiang, Hong Liu, Yawen Huang
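TL;DR: This paper proposes a robust radiology report generation framework for low-quality X-ray images. An automated quality assessment agent identifies low-quality samples and is used to establish the LRRG benchmark, and a dual-loop training strategy based on bi-level optimization and gradient consistency learns quality-agnostic diagnostic features, mitigating the performance degradation caused by poor image quality.
Details
Motivation: Existing vision-language models for radiology report generation implicitly assume high-quality inputs and overlook the noise and artifacts common in real clinical environments, so performance drops severely on suboptimal images. Robust methods that cope with varying image quality are therefore needed.
Result: Extensive experiments on the newly established LRRG benchmark show that the method effectively mitigates the performance degradation caused by deteriorating image quality.
Insight: The contributions are an automated quality assessment agent used to build a low-quality report generation benchmark, and a dual-loop training strategy that aligns gradients across quality regimes to learn robust diagnostic features, offering a new approach to handling noisy real-world images.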
Abstract: Vision-Language Models (VLMs) have significantly advanced automated Radiology Report Generation (RRG). However, existing methods implicitly assume high-quality inputs, overlooking the noise and artifacts prevalent in real-world clinical environments. Consequently, current models exhibit severe performance degradation when processing suboptimal images. To bridge this gap, we propose a robust report generation framework explicitly designed for image quality variations. We first introduce an Automated Quality Assessment Agent (AQAA) to identify low-quality samples within the MIMIC-CXR dataset and establish the Low-quality Radiology Report Generation (LRRG) benchmark. To tackle degradation-induced shifts, we propose a novel Dual-loop Training Strategy leveraging bi-level optimization and gradient consistency. This approach ensures the model learns quality-agnostic diagnostic features by aligning gradient directions across varying quality regimes. Extensive experiments demonstrate that our approach effectively mitigates model performance degradation caused by image quality deterioration. The code and data will be released upon acceptance.
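The abstract does not say how gradient directions are compared across quality regimes. One common way to express such agreement is cosine similarity between the two gradient vectors; the sketch below assumes that choice, and the names `gradient_cosine` and `consistency_penalty` are hypothetical.

```python
import math

def gradient_cosine(grad_a, grad_b):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    norm_a = math.sqrt(sum(a * a for a in grad_a))
    norm_b = math.sqrt(sum(b * b for b in grad_b))
    return dot / (norm_a * norm_b)

def consistency_penalty(grad_high_quality, grad_low_quality):
    """Assumed penalty: 0 when the gradients agree, up to 2 when they oppose."""
    return 1.0 - gradient_cosine(grad_high_quality, grad_low_quality)
```

A penalty of this kind could be added to the training objective so that updates helping high-quality images do not hurt low-quality ones; whether the paper does exactly this is not stated in the abstract.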
[116] Are Pretrained Image Matchers Good Enough for SAR-Optical Satellite Registration? cs.CV
Isaac Corley, Alex Stoken, Gabriele Berton
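TL;DR: This paper systematically evaluates the zero-shot performance of 24 pretrained image matchers on cross-modal (optical-SAR) satellite image registration, focusing on SpaceNet9 and other benchmarks. Matchers with explicit cross-modal training do not always outperform those without it, and deployment protocol choices (such as geometry model and tile size) strongly affect accuracy, sometimes more than swapping the matcher itself.
Details
Motivation: Cross-modal optical-SAR satellite registration is a key bottleneck for disaster response via remote sensing, yet existing image matchers are developed and evaluated mainly on natural images and lack systematic analysis on satellite and SAR data. This work uses zero-shot evaluation to examine how pretrained matchers actually perform on the task and which factors matter.
Result: On the SpaceNet9 training scenes, XoFTR (trained for visible-thermal matching) and RoMa (no cross-modal training) achieve the lowest mean error of 3.0 px, and MatchAnything-ELoFTR (trained on synthetic cross-modal pairs) reaches 3.4 px. Deployment protocol choices shift accuracy by up to 33x for a single matcher; as one example, switching to affine geometry alone reduces mean error from 12.34 px to 9.74 px.
Insight: The novelty is the first large-scale zero-shot evaluation of pretrained matchers for cross-modal satellite registration. It suggests that foundation-model features (such as DINOv2) may contribute to modality invariance and partially substitute for explicit cross-modal supervision. A key finding is that the often-overlooked deployment protocol has a large effect on accuracy, which informs both future matcher design and practical deployment.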
Abstract: Cross-modal optical-SAR (Synthetic Aperture Radar) registration is a bottleneck for disaster response via remote sensing, yet modern image matchers are developed and benchmarked almost exclusively on natural-image domains. We evaluate twenty-four pretrained matcher families, in a zero-shot setting with no fine-tuning or domain adaptation on satellite or SAR data, on SpaceNet9 and two additional cross-modal benchmarks under a deterministic protocol with tiled large-image inference, robust geometric filtering, and tie-point-grounded metrics. Our results reveal asymmetric transfer: matchers with explicit cross-modal training do not uniformly outperform those without it. While XoFTR (trained for visible-thermal matching) and RoMa achieve the lowest reported mean error at $3.0$ px on the labeled SpaceNet9 training scenes, RoMa achieves this without any cross-modal training, and MatchAnything-ELoFTR ($3.4$ px), trained on synthetic cross-modal pairs, matches closely, suggesting (as a working hypothesis) that foundation-model features (DINOv2) may contribute to modality invariance that partially substitutes for explicit cross-modal supervision. 3D-reconstruction matchers (MASt3R, DUSt3R), which are not designed for traditional 2D image matching, are highly protocol-sensitive and remain fragile under default settings. Deployment protocol choices (geometry model, tile size, inlier gating) shift accuracy by up to $33\times$ for a single matcher, sometimes exceeding the effect of swapping matchers entirely within the evaluated sweep; for example, affine geometry alone reduces mean error from $12.34$ to $9.74$ px. These findings inform both practical deployment of existing matchers and future matcher design for cross-modal satellite registration.
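To make the "tie-point-grounded" error in pixels concrete, the following sketch applies an estimated affine transform to source tie points and averages the Euclidean distance to their reference locations. The function names and the 2x3 matrix layout are assumptions for illustration; the paper's exact metric definition is not given in the abstract.

```python
import math

def apply_affine(A, pt):
    """Apply a 2x3 affine matrix ((a, b, tx), (c, d, ty)) to a 2D point."""
    (a, b, tx), (c, d, ty) = A
    x, y = pt
    return (a * x + b * y + tx, c * x + d * y + ty)

def mean_tie_point_error(A, src_pts, dst_pts):
    """Mean Euclidean distance in pixels between warped and reference tie points."""
    errors = [math.dist(apply_affine(A, s), d) for s, d in zip(src_pts, dst_pts)]
    return sum(errors) / len(errors)
```

Under this reading, a protocol change such as replacing one geometry model with an affine one alters `A` and therefore the reported mean error, even when the underlying matcher is unchanged.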
[117] SMFormer: Empowering Self-supervised Stereo Matching via Foundation Models and Data Augmentation cs.CV
Yun Wang, Zhengjie Yang, Jiahao Zheng, Zhanjie Zhang, Dapeng Oliver Wu
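TL;DR: This paper proposes SMFormer, a framework that combines a vision foundation model (VFM) with data augmentation to improve the robustness and accuracy of self-supervised stereo matching. It pairs the VFM with a Feature Pyramid Network (FPN) to extract disturbance-resistant features and designs a data augmentation mechanism that enforces consistency of both features and disparity predictions.
Details
Motivation: Existing self-supervised stereo matching methods rely on the photometric consistency assumption, which real-world disturbances can violate, yielding invalid supervisory signals and an accuracy gap relative to supervised methods.
Result: On multiple mainstream benchmarks, SMFormer achieves SOTA performance among self-supervised methods and is on par with supervised ones; on the Booster benchmark it outperforms some SOTA supervised methods such as CFNet.
Insight: The novelty is using a vision foundation model to make features more robust, combined with a data augmentation mechanism that explicitly constrains the consistency of features and disparity predictions, which helps handle real-world disturbances.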
Abstract: Recent self-supervised stereo matching methods have made significant progress. They typically rely on the photometric consistency assumption, which presumes corresponding points across views share the same appearance. However, this assumption could be compromised by real-world disturbances, resulting in invalid supervisory signals and a significant accuracy gap compared to supervised methods. To address this issue, we propose SMFormer, a framework integrating more reliable self-supervision guided by the Vision Foundation Model (VFM) and data augmentation. We first incorporate the VFM with the Feature Pyramid Network (FPN), providing a discriminative and robust feature representation against disturbance in various scenarios. We then devise an effective data augmentation mechanism that ensures robustness to various transformations. The data augmentation mechanism explicitly enforces consistency between learned features and those influenced by illumination variations. Additionally, it regularizes the output consistency between disparity predictions of strongly augmented samples and those generated from standard samples. Experiments on multiple mainstream benchmarks demonstrate that our SMFormer achieves state-of-the-art (SOTA) performance among self-supervised methods and even performs on par with supervised ones. Remarkably, in the challenging Booster benchmark, SMFormer even outperforms some SOTA supervised methods, such as CFNet.
[118] Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis cs.CV | cs.AI
Yang Yu, Dunyuan Xu, Yaoqian Li, Xiaomeng Li, Jinpeng Li
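TL;DR: This paper proposes a method for adapting a 2D multimodal large language model (MLLM) to 3D CT image analysis. It first extends an MLLM trained on 2D natural images to accept 3D medical volumes by reusing its parameters. It then designs a Text-Guided Hierarchical MoE (TGH-MoE) framework that extracts task-tailored image features based on the text prompt for tasks such as medical report generation and medical visual question answering, with a two-stage training strategy for learning task-shared and task-specific features. Experiments show that the method outperforms existing 3D medical MLLMs on both tasks.
Details
Motivation: Because 3D medical image data are scarce, existing 3D medical MLLMs have insufficiently pretrained vision encoders and cannot extract tailored image features for different tasks.
Result: On medical report generation (MRG) and medical visual question answering (MVQA), the method outperforms existing 3D medical MLLMs.
Insight: The novelty is transferring a mature 2D MLLM to the 3D medical domain through parameter reuse, together with a text-guided hierarchical MoE framework that adapts dynamically to different downstream tasks for task-aware feature extraction. This makes efficient use of pretrained knowledge and addresses data scarcity.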
Abstract: 3D medical image analysis is of great importance in disease diagnosis and treatment. Recently, multimodal large language models (MLLMs) have exhibited robust perceptual capacity, strong cross-modal alignment, and promising generalizability. Therefore, they have great potential to improve the performance of medical report generation (MRG) and medical visual question answering (MVQA), which serve as two important tasks in clinical scenarios. However, due to the scarcity of 3D medical images, existing 3D medical MLLMs suffer from insufficiently pretrained vision encoders and an inability to extract customized image features for different kinds of tasks. In this paper, we propose to first transfer a 2D MLLM, which is well trained with 2D natural images, to support 3D medical volumetric inputs while reusing all of its pre-trained parameters. To enable the vision encoder to extract tailored image features for various tasks, we then design a Text-Guided Hierarchical MoE (TGH-MoE) framework, which can distinguish tasks under the guidance of the text prompt. Furthermore, we propose a two-stage training strategy to learn both task-shared and task-specific image features. As demonstrated empirically, our method outperforms existing 3D medical MLLMs in both MRG and MVQA tasks. Our code will be released once this paper is accepted.
[119] MedVeriSeg: Teaching MLLM-Based Medical Segmentation Models to Verify Query Validity Without Extra Training cs.CV
Ziqian Lu, Qinyue Tong, Jun Liu, Yunlong Yu
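TL;DR: This paper proposes MedVeriSeg, a training-free verification framework for MLLM-based medical image segmentation models (such as LISA), which cannot reliably reject false queries that refer to non-existent targets and instead produce hallucinated segmentation masks. The framework analyzes how the distribution of the similarity map between the [SEG] token feature and the MLLM image features differs between true and false queries, and combines a Similarity Response Quality Scoring Module with visual evidence assessment by GPT-4o to verify query validity automatically.
Details
Motivation: Existing LISA-like medical segmentation models cannot reliably reject false queries and often hallucinate segmentations for absent targets, which reduces their practical reliability in medical education and clinical practice.
Result: On a small-scale benchmark built from SA-Med2D-20M, MedVeriSeg effectively rejects segmentation requests from false queries while still reliably recognizing true queries.
Insight: The key observation is that similarity maps for true and false queries follow markedly different distribution patterns. Based on this, the authors design a training-free quality scoring module that quantifies the similarity map in terms of strength, compactness, and purity, then combine it with the visual reasoning of a large language model (GPT-4o) for final verification, improving safety and reliability in practical use.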
Abstract: Despite recent advances in MLLM-based medical image segmentation, existing LISA-like methods cannot reliably reject false queries and often produce hallucinated segmentation masks for absent targets. This limitation reduces practical reliability in both medical education and clinical use. In this work, we propose MedVeriSeg, a training-free verification framework that equips LISA-like medical segmentation models with the ability to identify and reject false queries that contain non-existent targets. Our key observation is that the similarity map between the [SEG] token feature and MLLM image features exhibits markedly different distribution patterns for true and false queries. Based on this, we introduce a Similarity Response Quality Scoring Module that characterizes the similarity map from three aspects: strength, compactness, and purity, producing an initial target-existence prediction. We further incorporate qualitative visual evidence by using GPT-4o to jointly assess the similarity heatmap and the results of the Similarity Response Quality Scoring Module for final verification. Experiments on a small-scale benchmark constructed from SA-Med2D-20M show that MedVeriSeg effectively rejects false-query segmentation requests while maintaining reliable recognition of true queries.
[120] Warm-Started Reinforcement Learning for Iterative 3D/2D Liver Registration cs.CV | physics.med-ph
Hanyuan Zhang, Lucas He, Zijie Cheng, Abdolrahim Kadkhodamohammadi, Danail Stoyanov
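TL;DR: This paper proposes a discrete-action reinforcement learning framework for 3D/2D registration between preoperative CT and intraoperative laparoscopic video. Registration is modeled as a sequential decision problem: a shared feature encoder, warm-started from a supervised pose estimation network, extracts features, and an RL policy head selects rigid transformations and decides when to stop iterating. Experiments on a public dataset show registration accuracy comparable to supervised methods that require optimization, with faster convergence.
Details
Motivation: Learning-based methods for registering preoperative CT to intraoperative laparoscopic video often produce coarse alignments that depend on an additional optimization-based refinement step, which increases inference time.
Result: On a public laparoscopic dataset, the method achieves an average target registration error (TRE) of 15.70 mm, comparable to supervised methods with optimization, while converging faster.
Insight: The main novelty is formulating registration as a sequential-decision RL task and warm-starting the feature encoder from a supervised network to provide stable geometric features and faster convergence. The discrete framework offers a practical foundation for future continuous-action and deformable registration models in surgical AR, and enables automated, efficient iterative registration without manually tuned step sizes or stopping criteria.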
Abstract: Registration between preoperative CT and intraoperative laparoscopic video plays a crucial role in augmented reality (AR) guidance for minimally invasive surgery. Learning-based methods have recently achieved registration errors comparable to optimization-based approaches while offering faster inference. However, many supervised methods produce coarse alignments that rely on additional optimization-based refinement, thereby increasing inference time. We present a discrete-action reinforcement learning (RL) framework that formulates CT-to-video registration as a sequential decision-making process. A shared feature encoder, warm-started from a supervised pose estimation network to provide stable geometric features and faster convergence, extracts representations from CT renderings and laparoscopic frames, while an RL policy head learns to choose rigid transformations along six degrees of freedom and to decide when to stop the iteration. Experiments on a public laparoscopic dataset demonstrated that our method achieved an average target registration error (TRE) of 15.70 mm, comparable to supervised approaches with optimization, while achieving faster convergence. The proposed RL-based formulation enables automated, efficient iterative registration without manually tuned step sizes or stopping criteria. This discrete framework provides a practical foundation for future continuous-action and deformable registration models in surgical AR applications.
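The iterative loop with a learned stop decision can be sketched as below. This is an illustration under assumptions: the action encoding (two actions per degree of freedom plus one stop action), the fixed step size, and the toy greedy policy that stands in for the learned policy head are all hypothetical, and the pose is treated as a plain 6-vector rather than a proper rigid transform.

```python
STOP = 12  # hypothetical index of the stop action

def apply_action(pose, action, step=1.0):
    """Actions 0..11 move one of six degrees of freedom by +step or -step."""
    new_pose = list(pose)
    dof, sign = divmod(action, 2)
    new_pose[dof] += step if sign == 0 else -step
    return new_pose

def register(policy, pose, max_iters=100):
    """Apply discrete actions until the policy chooses to stop."""
    for _ in range(max_iters):
        action = policy(pose)
        if action == STOP:
            break
        pose = apply_action(pose, action)
    return pose

def toy_policy_factory(target, tol=0.5):
    """Stand-in for the learned policy head: greedily shrink the largest residual."""
    def policy(pose):
        residuals = [t - p for t, p in zip(target, pose)]
        dof = max(range(6), key=lambda k: abs(residuals[k]))
        if abs(residuals[dof]) <= tol:
            return STOP
        return 2 * dof + (0 if residuals[dof] > 0 else 1)
    return policy
```

In the paper the policy would see encoder features from CT renderings and laparoscopic frames rather than the true residual, which is unavailable at test time; the toy policy only shows how the loop and the stop action interact.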
[121] Real-Time Human Reconstruction and Animation using Feed-Forward Gaussian Splatting cs.CV | cs.GR
Devdoot Chatterjee, Zakaria Laskar, C. V. Jawahar
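TL;DR: This paper proposes a generalizable feed-forward Gaussian splatting framework for 3D human reconstruction and real-time animation directly from multi-view RGB images and SMPL-X poses. In a canonical pose, the method predicts a set of 3D Gaussian primitives associated with each SMPL-X vertex: one Gaussian is constrained to stay close to the SMPL-X surface to provide a geometric prior, while additional unconstrained Gaussians capture geometry that deviates from the parametric surface, such as clothing and hair. Unlike methods that need repeated network inference to synthesize new poses, it produces an animatable human representation in a single forward pass and animates it efficiently via linear blend skinning with no further network evaluation.
Details
Motivation: Existing methods rely on depth supervision, fixed input views, or UV maps, or require repeated feed-forward inference for each target view or pose. The goal is efficient, high-quality human reconstruction directly from multi-view RGB images that also supports real-time animation.
Result: Evaluated on the THuman 2.1, AvatarReX, and THuman 4.0 datasets, the method achieves reconstruction quality comparable to state-of-the-art methods while uniquely supporting real-time animation and interactive applications.
Insight: The novelty is explicitly associating 3D Gaussian primitives with SMPL-X vertices, combining the strong geometric prior of a parametric body model (one constrained Gaussian) with the flexibility to capture non-rigid detail (several unconstrained Gaussians). This yields an efficient framework that reconstructs in a single forward pass and supports real-time animation via linear blend skinning.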
Abstract: We present a generalizable feed-forward Gaussian splatting framework for human 3D reconstruction and real-time animation that operates directly on multi-view RGB images and their associated SMPL-X poses. Unlike prior methods that rely on depth supervision, fixed input views, UV maps, or repeated feed-forward inference for each target view or pose, our approach predicts, in a canonical pose, a set of 3D Gaussian primitives associated with each SMPL-X vertex. One Gaussian is regularized to remain close to the SMPL-X surface, providing a strong geometric prior and stable correspondence to the parametric body model, while an additional small set of unconstrained Gaussians per vertex allows the representation to capture geometric structures that deviate from the parametric surface, such as clothing and hair. In contrast to recent approaches such as HumanRAM, which require repeated network inference to synthesize novel poses, our method produces an animatable human representation from a single forward pass; by explicitly associating Gaussian primitives with SMPL-X vertices, the reconstructed model can be efficiently animated via linear blend skinning without further network evaluation. We evaluate our method on the THuman 2.1, AvatarReX and THuman 4.0 datasets, where it achieves reconstruction quality comparable to state-of-the-art methods while uniquely supporting real-time animation and interactive applications. Code and pre-trained models are available at https://github.com/Devdoot57/HumanGS.
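Linear blend skinning, which the abstract uses for animation, moves each canonical-pose point by a weighted average of per-bone rigid transforms. The sketch below shows only that step for Gaussian centers; the function names and data layout are assumptions, and a full implementation would also have to rotate each Gaussian's covariance, which is omitted here.

```python
def transform_point(T, p):
    """Apply a 3x4 rigid transform (rotation | translation) to a 3D point."""
    return tuple(sum(T[r][c] * p[c] for c in range(3)) + T[r][3] for r in range(3))

def skin_gaussian_centers(centers, skin_weights, bone_transforms):
    """Pose canonical Gaussian centers via linear blend skinning.

    centers:         list of (x, y, z) points in the canonical pose
    skin_weights:    per-center dict {bone_index: weight}, weights summing to 1
                     (assumed inherited from the associated SMPL-X vertex)
    bone_transforms: list of 3x4 matrices, one per bone
    """
    posed = []
    for p, weights in zip(centers, skin_weights):
        blended = [0.0, 0.0, 0.0]
        for bone, w in weights.items():
            q = transform_point(bone_transforms[bone], p)
            for k in range(3):
                blended[k] += w * q[k]
        posed.append(tuple(blended))
    return posed
```

Because this is a closed-form operation on stored primitives, changing the pose requires no further network evaluation, which is the property the abstract highlights.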
[122] FashionMV: Product-Level Composed Image Retrieval with Multi-View Fashion Data cs.CV | cs.AI
Peng Yuan, Bingyin Mei, Hui Zhang
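TL;DR: This paper introduces a new dataset, FashionMV, and a modeling framework, ProCIR, for product-level multi-view composed image retrieval. The task generalizes conventional single-image composed retrieval to retrieval over a product's multi-view images, which better matches real e-commerce scenarios.
Details
Motivation: Existing composed image retrieval methods all use a single reference image, whereas real e-commerce users see products displayed from multiple viewpoints; this view incompleteness limits applicability. The paper addresses the mismatch by lifting CIR from the image level to the product level.
Result: On three fashion benchmarks, a systematic ablation over 16 configurations shows that the best 0.8B-parameter model outperforms all baselines, including general-purpose embedding models 10x its size.
Insight: The main contributions are: (1) a formal definition of the new Multi-View CIR task and FashionMV, the first large-scale dataset for it; (2) the ProCIR framework, which integrates complementary mechanisms including two-stage dialogue, caption-based alignment, and chain-of-thought guidance; and (3) the finding that alignment is the most critical mechanism and that the two-stage dialogue architecture is a prerequisite for effective alignment.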
Abstract: Composed Image Retrieval (CIR) retrieves target images using a reference image paired with modification text. Despite rapid advances, all existing methods and datasets operate at the image level – a single reference image plus modification text in, a single target image out – while real e-commerce users reason about products shown from multiple viewpoints. We term this mismatch View Incompleteness and formally define a new Multi-View CIR task that generalizes standard CIR from image-level to product-level retrieval. To support this task, we construct FashionMV, the first large-scale multi-view fashion dataset for product-level CIR, comprising 127K products, 472K multi-view images, and over 220K CIR triplets, built through a fully automated pipeline leveraging large multimodal models. We further propose ProCIR (Product-level Composed Image Retrieval), a modeling framework built upon a multimodal large language model that employs three complementary mechanisms – two-stage dialogue, caption-based alignment, and chain-of-thought guidance – together with an optional supervised fine-tuning (SFT) stage that injects structured product knowledge prior to contrastive training. Systematic ablation across 16 configurations on three fashion benchmarks reveals that: (1) alignment is the single most critical mechanism; (2) the two-stage dialogue architecture is a prerequisite for effective alignment; and (3) SFT and chain-of-thought serve as partially redundant knowledge injection paths. Our best 0.8B-parameter model outperforms all baselines, including general-purpose embedding models 10x its size. The dataset, model, and code are publicly available at https://github.com/yuandaxia2001/FashionMV.
[123] Seeing No Evil: Blinding Large Vision-Language Models to Safety Instructions via Adversarial Attention Hijacking cs.CV | cs.CL
Jingru Li, Wei Ren, Tianqing Zhu
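TL;DR: This paper proposes Attention-Guided Visual Jailbreaking, a visual attack that bypasses the safety alignment of large vision-language models (LVLMs) by directly manipulating their attention patterns rather than overpowering the alignment. By suppressing attention to safety-related prefix tokens and anchoring generation on adversarial image features, the method reduces gradient conflict, substantially raises attack success rate, and lowers the number of iterations.
Details
Motivation: Existing attacks typically optimize image perturbations to maximize the likelihood of harmful output, but converge slowly because of gradient conflict between the adversarial objective and the model's safety-retrieval mechanism. This work addresses the problem by manipulating attention patterns to circumvent, rather than suppress, safety alignment.
Result: On Qwen-VL, the method achieves a 94.4% attack success rate (68.8% for the baseline), reduces gradient conflict by 45%, and needs 40% fewer iterations. Under a tighter perturbation budget (ε = 8/255), attack success rate remains 59.0%, versus 45.7% for standard methods.
Insight: The novelty is a visual jailbreak that uses attention guidance to directly manipulate the model's internal attention distribution. It reveals a failure mode under attack, termed safety blindness, in which the model generates harmful content because it fails to retrieve its safety instructions rather than because it overrides them. The analysis offers a new perspective on LVLM safety mechanisms and adversarial attacks.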
Abstract: Large Vision-Language Models (LVLMs) rely on attention-based retrieval of safety instructions to maintain alignment during generation. Existing attacks typically optimize image perturbations to maximize harmful output likelihood, but suffer from slow convergence due to gradient conflict between adversarial objectives and the model’s safety-retrieval mechanism. We propose Attention-Guided Visual Jailbreaking, which circumvents rather than overpowers safety alignment by directly manipulating attention patterns. Our method introduces two simple auxiliary objectives: (1) suppressing attention to alignment-relevant prefix tokens and (2) anchoring generation on adversarial image features. This simple yet effective push-pull formulation reduces gradient conflict by 45% and achieves 94.4% attack success rate on Qwen-VL (vs. 68.8% baseline) with 40% fewer iterations. At tighter perturbation budgets ($\epsilon = 8/255$), we maintain 59.0% ASR compared to 45.7% for standard methods. Mechanistic analysis reveals a failure mode we term safety blindness: successful attacks suppress system-prompt attention by 80%, causing models to generate harmful content not by overriding safety rules, but by failing to retrieve them.
[124] AC-MIL: Weakly Supervised Atrial LGE-MRI Quality Assessment via Adversarial Concept Disentanglement cs.CV
K M Arefeen Sultan, Kaysen Hansen, Benjamin Orkild, Alan Morris, Eugene Kholmovski
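TL;DR: This paper proposes AC-MIL, a weakly supervised framework for quality assessment of atrial LGE-MRI. Through adversarial concept disentanglement, it decomposes overall image quality into clinically defined radiological concepts, maintaining strong classification performance while providing highly localized spatial concept maps that explain why a scan is of poor quality.
Details
Motivation: Existing automated quality assessment methods based on Multiple Instance Learning (MIL) map localized visual evidence to a single, opaque global feature vector. This black-box approach cannot give actionable feedback on specific failure modes such as motion blur or inadequate contrast.
Result: Extensive experiments on a clinical atrial LGE-MRI dataset show that AC-MIL keeps ordinal grading performance highly competitive with existing baselines (such as standard MIL methods) while opening the MIL black box with highly localized spatial concept maps.
Insight: The novelty is an unsupervised residual branch guided by an adversarial erasure mechanism, which strictly prevents information leakage between predefined concepts, together with a spatial diversity constraint that penalizes overlap between the attention maps of different concepts, ensuring localized and interpretable feature extraction.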
Abstract: High-quality Late Gadolinium Enhancement (LGE) MRI can be helpful for atrial fibrillation management, yet scan quality is frequently compromised by patient motion, irregular breathing, and suboptimal image acquisition timing. While Multiple Instance Learning (MIL) has emerged as a powerful tool for automated quality assessment under weak supervision, current state-of-the-art methods map localized visual evidence to a single, opaque global feature vector. This black box approach fails to provide actionable feedback on specific failure modes, obscuring whether a scan degrades due to motion blur, inadequate contrast, or a lack of anatomical context. In this paper, we propose Adversarial Concept-MIL (AC-MIL), a weakly supervised framework that decomposes global image quality into clinically defined radiological concepts using only volume-level supervision. To capture latent quality variations without entangling predefined concepts, our framework incorporates an unsupervised residual branch guided by an adversarial erasure mechanism to strictly prevent information leakage. Furthermore, we introduce a spatial diversity constraint that penalizes overlap between distinct concept attention maps, ensuring localized and interpretable feature extraction. Extensive experiments on a clinical dataset of atrial LGE-MRI volumes demonstrate that AC-MIL successfully opens the MIL black box, providing highly localized spatial concept maps that allow clinicians to pinpoint the specific causes of non-diagnostic scans. Crucially, our framework achieves this deep clinical transparency while maintaining highly competitive ordinal grading performance against existing baselines. Code to be released on acceptance.
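The spatial diversity constraint penalizes overlap between concept attention maps. The abstract does not give the formula; the sketch below assumes one simple choice, the sum of pairwise dot products between flattened maps, and the name `overlap_penalty` is hypothetical.

```python
def overlap_penalty(attention_maps):
    """Sum of pairwise dot products between flattened concept attention maps.

    attention_maps: list of equal-length lists of non-negative weights,
    one per concept. Returns 0.0 when no two concepts attend to the
    same location.
    """
    penalty = 0.0
    for i in range(len(attention_maps)):
        for j in range(i + 1, len(attention_maps)):
            penalty += sum(a * b for a, b in zip(attention_maps[i], attention_maps[j]))
    return penalty
```

Minimizing a term like this alongside the grading loss pushes each concept toward its own region, which is what makes the resulting maps usable as per-concept explanations.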
[125] Class-Adaptive Cooperative Perception for Multi-Class LiDAR-based 3D Object Detection in V2X Systems cs.CV | cs.AI | cs.ET
Blessing Agyei Kyem, Joshua Kofi Asamoah, Armstrong Aboah
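TL;DR: This paper proposes a class-adaptive cooperative perception architecture for multi-class LiDAR-based 3D object detection in V2X systems. The model combines four components (multi-scale window attention, a class-specific fusion module, bird's-eye-view enhancement, and class-balanced objective weighting) to adapt feature extraction and fusion to the differing geometric structures and point-sampling patterns of object classes, improving the balance and robustness of multi-class detection.
Details
Motivation: Existing cooperative 3D object detectors usually apply one uniform fusion strategy to all object classes, which handles the differing geometry and point-sampling patterns of small and large objects poorly. Existing evaluation protocols also tend to focus on a single dominant class or a few cooperation settings, leaving robust multi-class detection across diverse V2X interactions underexplored.
Result: Experiments on the V2X-Real benchmark cover vehicle-centric, infrastructure-centric, vehicle-to-vehicle, infrastructure-to-infrastructure, and vehicle-to-infrastructure settings under identical backbone and training configurations. The method consistently improves mean detection performance over strong intermediate-fusion baselines, with the largest gains on trucks, clear improvements on pedestrians, and competitive results on cars.
Insight: The novelty is aligning feature extraction and fusion with class-dependent geometry and point density; class-adaptive fusion pathways and class-balanced objective weighting yield more balanced cooperative perception. This offers a reusable architectural pattern for handling class imbalance in multi-class V2X detection.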
Abstract: Cooperative perception allows connected vehicles and roadside infrastructure to share sensor observations, creating a fused scene representation beyond the capability of any single platform. However, most cooperative 3D object detectors use a uniform fusion strategy for all object classes, which limits their ability to handle the different geometric structures and point-sampling patterns of small and large objects. This problem is further reinforced by narrow evaluation protocols that often emphasize a single dominant class or only a few cooperation settings, leaving robust multi-class detection across diverse vehicle-to-everything interactions insufficiently explored. To address this gap, we propose a class-adaptive cooperative perception architecture for multi-class 3D object detection from LiDAR data. The model integrates four components: multi-scale window attention with learned scale routing for spatially adaptive feature extraction, a class-specific fusion module that separates small and large objects into attentive fusion pathways, bird’s-eye-view enhancement through parallel dilated convolution and channel recalibration for richer contextual representation, and class-balanced objective weighting to reduce bias toward frequent categories. Experiments on the V2X-Real benchmark cover vehicle-centric, infrastructure-centric, vehicle-to-vehicle, infrastructure-to-infrastructure, and vehicle-to-infrastructure settings under identical backbone and training configurations. The proposed method consistently improves mean detection performance over strong intermediate-fusion baselines, with the largest gains on trucks, clear improvements on pedestrians, and competitive results on cars. These results show that aligning feature extraction and fusion with class-dependent geometry and point density leads to more balanced cooperative perception in realistic vehicle-to-everything deployments.
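Class-balanced objective weighting can take several forms, and the abstract does not say which one is used. A minimal sketch, assuming plain inverse-frequency weights normalized to a mean of 1 (the function name and the example counts are hypothetical):

```python
def class_balanced_weights(counts):
    """Inverse-frequency loss weights, normalized so their mean is 1.

    counts: dict mapping class name to number of training instances.
    """
    total = sum(counts.values())
    raw = {cls: total / n for cls, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {cls: w / mean for cls, w in raw.items()}

# Illustrative counts only; not taken from V2X-Real.
weights = class_balanced_weights({"car": 800, "pedestrian": 150, "truck": 50})
```

Rare classes receive larger weights, so their errors contribute more to the loss, which reduces the bias toward frequent categories that the abstract mentions.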
[126] Context Matters: Vision-Based Depression Detection Comparing Classical and Deep Approaches cs.CV
Maneesh Bilalpur, Saurabh Hinduja, Sonish Sivarajkumar, Nicholas Allen, Yanshan Wang
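TL;DR: This paper compares classical and deep learning approaches to vision-based depression detection in terms of accuracy, fairness, and generalizability. Experiments cover two contexts: mother-child interactions in the TPOT database and patient-clinician interviews in the Pitt database. The classical approach (handcrafted features + SVM) achieves higher accuracy in both contexts and is significantly fairer in the patient-clinician context; cross-context generalization is limited for both approaches, suggesting that depression detection may be context-specific.
Details
Motivation: Classical approaches (which emphasize interpretable features such as facial expression) and deep approaches (which learn features with general-purpose vision models) have not been systematically compared for depression detection, particularly regarding accuracy, fairness, and cross-context generalizability.
Result: On both the TPOT and Pitt databases, the classical approach (handcrafted features + SVM) is more accurate than the deep approach (FMAE-IAT embeddings + multi-layer perceptron). In the patient-clinician context the classical approach is significantly fairer. Cross-context generalization is weak for both approaches.
Insight: The novelty is the first systematic comparison of classical and deep approaches to depression detection, with an emphasis on the role of context. The study indicates that handcrafted features may still outperform deep features in some medical vision tasks, and the difficulty of cross-context generalization highlights how context-dependent the task is, pointing to directions for future work.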
Abstract: The classical approach to detecting depression from vision emphasizes interpretable features, such as facial expression, and classifiers such as the Support Vector Machine (SVM). With the advent of deep learning, there has been a shift in feature representations and classification approaches. Contemporary approaches use learnt features from general-purpose vision models such as VGGNet to train machine learning models. Little is known about how classical and deep approaches compare in depression detection with respect to accuracy, fairness, and generalizability, especially across contexts. To address these questions, we compared classical and deep approaches to the detection of depression in the visual modality in two different contexts: Mother-child interactions in the TPOT database and patient-clinician interviews in the Pitt database. In the former, depression was operationalized as a history of depression per the DSM and current or recent clinically significant symptoms. In the latter, all participants met initial criteria for depression per DSM, and depression was reassessed over the course of treatment. The classical approach included handcrafted features with SVM classifiers. Learnt features were turn-level embeddings from the FMAE-IAT that were combined with Multi-Layer Perceptron classifiers. The classical approach achieved higher accuracy in both contexts. It was also significantly fairer than the deep approach in the patient-clinician context. Cross-context generalizability was modest at best for both approaches, which suggests that depression may be context-specific.
[127] Multi-modal, multi-scale representation learning for satellite imagery analysis just needs a good ALiBi cs.CV
Patrick Kage, Pavlos Andreadis
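TL;DR: This paper proposes Scale-ALiBi, a linear-bias Transformer attention mechanism that adds a spatial encoding bias to handle relationships between satellite image patches at different ground sample distance (GSD) scales. It is implemented with a triple-contrastive and reconstructive architecture on a dataset of aligned high-resolution optical, low-resolution optical, and low-resolution SAR satellite imagery, shows improved performance on the GEO-Bench benchmark, and is accompanied by the public release of the newly curated dataset.
Details
Motivation: Current vision foundation models struggle to handle satellite imagery that spans multiple spatial resolutions and modalities (such as optical and SAR); this paper targets that challenge.
Result: The paper reports an improvement on the GEO-Bench benchmark.
Insight: The novelty is extending ALiBi attention to multi-scale satellite image analysis: a spatial encoding bias directly models relationships between patches at different resolutions, and a triple-contrastive and reconstructive learning strategy integrates multimodal information.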
Abstract: Vision foundation models have been shown to be effective at processing satellite imagery into representations fit for downstream tasks; however, creating models that operate over multiple spatial resolutions and modes is challenging. This paper presents Scale-ALiBi, a linear-bias transformer attention mechanism with a spatial encoding bias to model relationships between image patches at different ground sample distance scales. We provide an implementation of Scale-ALiBi over a dataset of aligned high- and low-resolution optical and low-resolution SAR satellite imagery data using a triple-contrastive and reconstructive architecture, show an improvement on the GEO-Bench benchmark, and release the newly curated dataset publicly.
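Standard ALiBi adds a penalty to each attention logit that grows linearly with the distance between query and key positions. One plausible reading of a scale-aware variant, shown below purely as an assumption, is to measure that distance on the ground by multiplying patch indices by each image's ground sample distance. The function name, the shared-origin assumption, and the Euclidean distance are all illustrative choices, not the paper's definition.

```python
import math

def scale_alibi_bias(pos_q, pos_k, gsd_q, gsd_k, slope):
    """Attention bias proportional to ground distance between two patches.

    pos_q, pos_k: (row, col) patch indices in their own images
    gsd_q, gsd_k: ground size of one patch in each image, in meters
                  (images are assumed aligned with a shared origin)
    slope:        per-head ALiBi slope (positive)
    """
    dy = pos_q[0] * gsd_q - pos_k[0] * gsd_k
    dx = pos_q[1] * gsd_q - pos_k[1] * gsd_k
    return -slope * math.hypot(dx, dy)
```

Under this reading, a high-resolution patch and a low-resolution patch covering the same ground location get zero penalty even though their pixel indices differ.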
[128] Agentic Video Generation: From Text to Executable Event Graphs via Tool-Constrained LLM Planning cs.CV
Nicolae Cudlenco, Mihai Masala, Marius Leordeanu
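TL;DR: This paper proposes an agentic video generation system that inverts the existing paradigm in which LLMs orchestrate neural video generators. The LLM first plans and builds, from a text description, a formal structured specification called a Graph of Events in Space and Time (GEST), which is then executed deterministically in a 3D game engine to produce the video. The system uses a hierarchical two-agent architecture and a programmatic state backend that guarantees the specification is executable.
Details
Motivation: Existing LLM-based multi-agent video generation systems produce visually impressive videos, but their outputs are semantically unreliable and lack ground-truth annotations. This paper addresses the problem by constructing an executable structured event specification, making generated videos reliable in physical validity and semantic alignment.
Result: In the autonomous generation evaluation, agentic narratives beat procedural baselines in 79% of text comparisons and 74% of video comparisons. In the seeded generation evaluation against VEO 3.1 and WAN 2.2, the system's videos substantially outperform the neural generators on physical validity (58% vs 25% and 20%) and semantic alignment (3.75/5 vs 2.33 and 1.50).
Insight: The core novelty is reversing the video generation paradigm from directly generating pixels to first planning a structured event graph and then executing it deterministically. This rests on a separation of concerns: the LLM handles natural-language narrative planning, while a programmatic state backend enforces all simulator constraints through validated tool calls, guaranteeing executability. In addition, the hierarchical two-agent architecture and dedicated Relation Subagents are the first to exercise the full expressive capacity of the GEST formalism.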
Abstract: Existing multi-agent video generation systems use LLM agents to orchestrate neural video generators, producing visually impressive but semantically unreliable outputs with no ground truth annotations. We present an agentic system that inverts this paradigm: instead of generating pixels, the LLM constructs a formal Graph of Events in Space and Time (GEST) – a structured specification of actors, actions, objects, and temporal constraints – which is then executed deterministically in a 3D game engine. A staged LLM refinement pipeline fails entirely at this task (0 of 50 attempts produce an executable specification), motivating a fundamentally different architecture based on a separation of concerns: the LLM handles narrative planning through natural language reasoning, while a programmatic state backend enforces all simulator constraints through validated tool calls, guaranteeing that every generated specification is executable by construction. The system uses a hierarchical two-agent architecture – a Director that plans the story and a Scene Builder that constructs individual scenes through a round-based state machine – with dedicated Relation Subagents that populate the logical and semantic edge types of the GEST formalism that procedural generation leaves empty, making this the first approach to exercise the full expressive capacity of the representation. We evaluate in two stages: autonomous generation against procedural baselines via a 3-model LLM jury, where agentic narratives win 79% of text and 74% of video comparisons; and seeded generation where the same text is given to our system, VEO 3.1, and WAN 2.2, with human annotations showing engine-generated videos substantially outperform neural generators on physical validity (58% vs 25% and 20%) and semantic alignment (3.75/5 vs 2.33 and 1.50).
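The idea of a state backend that accepts only validated tool calls can be illustrated with a small sketch. Everything here is hypothetical: the class name `SceneState`, the `add_event` tool, and the three checks are stand-ins chosen for the example, not the paper's API or its actual constraint set.

```python
class SceneState:
    """Toy state backend: events enter the specification only if they validate."""

    def __init__(self, actors, objects, allowed_actions):
        self.actors = set(actors)
        self.objects = set(objects)
        self.allowed_actions = set(allowed_actions)
        self.events = []

    def add_event(self, actor, action, target=None):
        """Tool call exposed to the LLM; returns (ok, message)."""
        if actor not in self.actors:
            return False, f"unknown actor: {actor}"
        if action not in self.allowed_actions:
            return False, f"unsupported action: {action}"
        if target is not None and target not in self.objects:
            return False, f"unknown object: {target}"
        self.events.append({"id": len(self.events), "actor": actor,
                            "action": action, "target": target})
        return True, "ok"
```

Because rejected calls never modify `events`, whatever the planner has accumulated at any point is valid by construction, and the error message gives the LLM something concrete to correct on its next call.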
[129] GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models cs.CV
Nicolae Cudlenco, Mihai Masala, Marius Leordeanu
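TL;DR: This paper introduces the GTASA dataset and the GEST-Engine system for generating multi-actor videos with exact spatiotemporal annotations, addressing the lack of ground truth for evaluating and training on the physical plausibility and semantic faithfulness of neural video generators.
Details
Motivation: State-of-the-art neural generators struggle to produce complex multi-actor scenario videos, and evaluating them is hard because there is no ground truth for physical plausibility or semantic faithfulness.
Result: The method outperforms both open- and closed-source neural generators in human evaluation of physical validity and semantic alignment, and in a quantitative evaluation based on training video captioning models. Probing video encoders on 11 spatiotemporal reasoning tasks with GTASA's exact 3D ground truth shows that self-supervised encoders encode spatial structure better than the visual encoders of vision-language models.
Insight: The novelty is GTASA, a multi-actor video corpus with per-frame spatial relation graphs and event-level temporal mappings, and GEST-Engine, the system that generates it, which together provide an exact spatiotemporal ground-truth benchmark for evaluating and training video models. The exact 3D ground truth is also a valuable resource for analyzing the spatiotemporal reasoning ability of video encoders.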
Abstract: Generating complex multi-actor scenario videos remains difficult even for state-of-the-art neural generators, while evaluating them is hard due to the lack of ground truth for physical plausibility and semantic faithfulness. We introduce GTASA, a corpus of multi-actor videos with per-frame spatial relation graphs and event-level temporal mappings, and the system that produced it based on Graphs of Events in Space and Time (GEST): GEST-Engine. We compare our method with both open and closed source neural generators and prove both qualitatively (human evaluation of physical validity and semantic alignment) and quantitatively (via training video captioning models) the clear advantages of our method. Probing four frozen video encoders across 11 spatiotemporal reasoning tasks enabled by GTASA’s exact 3D ground truth reveals that self-supervised encoders encode spatial structure significantly better than VLM visual encoders.
[130] FishRoPE: Projective Rotary Position Embeddings for Omnidirectional Visual Perception cs.CV | cs.AI
Rahul Ahuja, Mudit Jain, Bala Murali Manoghar Sai Sudhakar, Venkatraman Narayanan, Pratik Likhar
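TL;DR: This paper proposes the FishRoPE framework, which uses Low-Rank Adaptation (LoRA) and Fisheye Rotary Position Embedding (FishRoPE) to adapt pretrained vision foundation models (such as DINOv2) to fisheye camera geometry without retraining on large-scale fisheye annotations, enabling efficient visual perception on fisheye images.
Details
Motivation: Existing vision foundation models and bird's-eye-view representations assume pinhole camera geometry, whereas fisheye cameras, widely used in autonomous driving, have severe radial distortion that makes these representations geometrically inconsistent. The scarcity of large-scale fisheye annotations also makes training foundation models from scratch impractical.
Result: The method reaches 54.3 mAP on WoodScape 2D detection and 65.1 mIoU on SynWoodScapes BEV segmentation, state-of-the-art (SOTA) results on both benchmarks.
Insight: The contributions are: (1) a frozen DINOv2 backbone combined with LoRA, which transfers self-supervised features without task-specific pretraining; and (2) FishRoPE, which reparameterizes attention in the spherical coordinates of the fisheye projection so that self-attention and cross-attention operate on angular separation rather than pixel distance. FishRoPE is architecture-agnostic, adds negligible computational overhead, and reduces to the standard formulation under pinhole geometry.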
Abstract: Vision foundation models (VFMs) and Bird’s Eye View (BEV) representation have advanced visual perception substantially, yet their internal spatial representations assume the rectilinear geometry of pinhole cameras. Fisheye cameras, widely deployed on production autonomous vehicles for their surround-view coverage, exhibit severe radial distortion that renders these representations geometrically inconsistent. At the same time, the scarcity of large-scale fisheye annotations makes retraining foundation models from scratch impractical. We present a lightweight framework that adapts frozen VFMs to fisheye geometry through two components: a frozen DINOv2 backbone with Low-Rank Adaptation (LoRA) that transfers rich self-supervised features to fisheye without task-specific pretraining, and Fisheye Rotary Position Embedding (FishRoPE), which reparameterizes the attention mechanism in the spherical coordinates of the fisheye projection so that both self-attention and cross-attention operate on angular separation rather than pixel distance. FishRoPE is architecture-agnostic, introduces negligible computational overhead, and naturally reduces to the standard formulation under pinhole geometry. We evaluate the framework on WoodScape 2D detection (54.3 mAP) and SynWoodScapes BEV segmentation (65.1 mIoU), where it achieves state-of-the-art results on both benchmarks.
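The idea of attending over angular separation rather than pixel distance can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an equidistant fisheye model (r = f·θ), which the abstract does not specify, and the names `pixel_to_angles`, `angular_rope`, and `score` are hypothetical. The sketch shows only the defining rotary property, namely that the attention score depends on the difference of angles and not on their absolute values.

```python
import math

def pixel_to_angles(u, v, cx, cy, f):
    """Map a pixel to (theta, phi) under an ASSUMED equidistant model r = f * theta."""
    dx, dy = u - cx, v - cy
    theta = math.hypot(dx, dy) / f   # polar angle measured from the optical axis
    phi = math.atan2(dy, dx)         # azimuth around the optical axis
    return theta, phi

def rotate(pair, angle):
    """Rotate a 2-D channel pair by the given angle (the basic rotary operation)."""
    c, s = math.cos(angle), math.sin(angle)
    x, y = pair
    return (x * c - y * s, x * s + y * c)

def angular_rope(vec4, theta, phi):
    """Hypothetical layout: channels (0, 1) are rotated by theta, channels (2, 3) by phi."""
    return rotate(vec4[0:2], theta) + rotate(vec4[2:4], phi)

def score(q, k, ang_q, ang_k):
    """Dot-product attention score between a rotated query and a rotated key."""
    qr = angular_rope(q, *ang_q)
    kr = angular_rope(k, *ang_k)
    return sum(a * b for a, b in zip(qr, kr))

q = (1.0, 2.0, 0.5, -1.0)
k = (0.3, -0.7, 1.2, 0.4)
s1 = score(q, k, (0.2, 0.5), (0.6, 1.1))
s2 = score(q, k, (0.5, 0.1), (0.9, 0.7))  # both angles shifted by the same offsets
print(abs(s1 - s2) < 1e-9)                # the score depends only on angular differences
```

A real model would use many channel pairs with different frequencies per pair; the single-frequency layout here is chosen only to keep the property easy to check.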
[131] Rethinking Video Human-Object Interaction: Set Prediction over Time for Unified Detection and Anticipation cs.CV | cs.AIPDF
Yuanhao Luo, Di Wen, Kunyu Peng, Ruiping Liu, Junwei Zheng
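TL;DR: This paper proposes HOI-DA, a new framework for video human-object interaction (HOI) understanding that jointly performs subject-object localization, present HOI detection, and future anticipation by modeling future interactions as residual transitions from current pair states. It also builds DETAnt-HOI, a temporally corrected benchmark, to make evaluation more reliable.
Details
Motivation: Existing methods usually treat anticipation as a downstream forecasting task built on externally constructed human-object pairs, which limits joint reasoning between detection and anticipation. In addition, sparse keyframe annotations in current benchmarks can temporally misalign future labels with the actual dynamics, reducing the reliability of anticipation evaluation.
Result: Experiments show consistent improvements in both detection and anticipation, with larger gains at longer horizons.
Insight: The novelty is a pair-centric joint learning framework in which anticipation is learned as a structural constraint on pair-level video representation learning, together with a temporally corrected benchmark that supports more reliable multi-horizon evaluation.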
Abstract: Video-based human-object interaction (HOI) understanding requires both detecting ongoing interactions and anticipating their future evolution. However, existing methods usually treat anticipation as a downstream forecasting task built on externally constructed human-object pairs, limiting joint reasoning between detection and prediction. In addition, sparse keyframe annotations in current benchmarks can temporally misalign nominal future labels from actual future dynamics, reducing the reliability of anticipation evaluation. To address these issues, we introduce DETAnt-HOI, a temporally corrected benchmark derived from VidHOI and Action Genome for more faithful multi-horizon evaluation, and HOI-DA, a pair-centric framework that jointly performs subject-object localization, present HOI detection, and future anticipation by modeling future interactions as residual transitions from current pair states. Experiments show consistent improvements in both detection and anticipation, with larger gains at longer horizons. Our results highlight that anticipation is most effective when learned jointly with detection as a structural constraint on pair-level video representation learning. Benchmark and code will be publicly available.
[132] IMPACT: A Dataset for Multi-Granularity Human Procedural Action Understanding in Industrial Assembly cs.CV | cs.AIPDF
Di Wen, Zeyun Zhong, David Schneider, Manuel Zaremski, Linus Kunzmann
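TL;DR: IMPACT is a multi-view RGB-D dataset for industrial procedural understanding, built around real assembly and disassembly of a commercial angle grinder with professional tools. According to the authors, it is the first dataset to jointly provide synchronized ego-exo RGB-D capture, decoupled bimanual annotation, compliance-aware state tracking, and explicit anomaly-recovery supervision within a single real industrial workflow.
Details
Motivation: Existing benchmarks fall short for multi-granularity human procedural action understanding under real industrial deployment conditions such as incomplete observations, flexible execution paths, and corrective behavior. In particular, single-task benchmarks cannot expose systematic weaknesses.
Result: Systematic baselines on the dataset, which comprises 112 trials from 13 participants totaling 39.5 hours, reveal fundamental limitations of conventional methods under realistic deployment conditions involving partial observation, multi-route execution, and anomaly recovery.
Insight: By combining synchronized multi-view RGB-D data with a hierarchical annotation scheme (from atomic actions to procedural steps, component states, and compliance phases), the dataset decouples perceptual limitations from algorithmic failure in industrial assembly and provides a new benchmark for evaluating robustness in real, complex workflows.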
Abstract: We introduce IMPACT, a synchronized five-view RGB-D dataset for deployment-oriented industrial procedural understanding, built around real assembly and disassembly of a commercial angle grinder with professional-grade tools. To our knowledge, IMPACT is the first real industrial assembly benchmark that jointly provides synchronized ego-exo RGB-D capture, decoupled bimanual annotation, compliance-aware state tracking, and explicit anomaly–recovery supervision within a single real industrial workflow. It comprises 112 trials from 13 participants totaling 39.5 hours, with multi-route execution governed by a partial-order prerequisite graph, a six-category anomaly taxonomy, and operator cognitive load measured via NASA-TLX. The annotation hierarchy links hand-specific atomic actions to coarse procedural steps, component assembly states, and per-hand compliance phases, with synchronized null spans across views to decouple perceptual limitations from algorithmic failure. Systematic baselines reveal fundamental limitations that remain invisible to single-task benchmarks, particularly under realistic deployment conditions that involve incomplete observations, flexible execution paths, and corrective behavior. The full dataset, annotations, and evaluation code are available at https://github.com/Kratos-Wen/IMPACT.
[133] Point2Pose: Occlusion-Recovering 6D Pose Tracking and 3D Reconstruction for Multiple Unknown Objects Via 2D Point Trackers cs.CV | cs.ROPDF
Tzu-Yuan Lin, Ho Jae Lee, Kevin Doherty, Yonghyeon Lee, Sangbae Kim
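TL;DR: Point2Pose is a model-free method for causal 6D pose tracking of multiple unknown rigid objects from monocular RGB-D video. It is initialized only from sparse image points on the objects and needs no object CAD models or category priors. A 2D point tracker supplies long-range correspondences, enabling instant recovery after complete occlusion, while the system incrementally reconstructs an online TSDF representation of the targets.
Details
Motivation: Existing model-free trackers cannot handle multi-object tracking or recover after complete occlusion. The goal is a more robust and general system for 6D pose tracking and 3D reconstruction.
Result: On a severe-occlusion benchmark, performance is comparable to state-of-the-art methods, while the method additionally supports multi-object tracking and recovery from complete occlusion, capabilities that previous model-free trackers lack.
Insight: The novelty is combining the long-range correspondence of 2D point trackers with online TSDF reconstruction to obtain robust tracking and occlusion recovery for multiple unknown objects. Viewed objectively, the core idea is to use temporal information from 2D tracking to support 3D pose estimation and reconstruction, yielding a data-driven, model-agnostic solution.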
Abstract: We present Point2Pose, a model-free method for causal 6D pose tracking of multiple rigid objects from monocular RGB-D video. Initialized only from sparse image points on the objects to be tracked, our approach tracks multiple unseen objects without requiring object CAD models or category priors. Point2Pose leverages a 2D point tracker to obtain long-range correspondences, enabling instant recovery after complete occlusion. Simultaneously, the system incrementally reconstructs an online Truncated Signed Distance Function (TSDF) representation of the tracked targets. Alongside the method, we introduce a new multi-object tracking dataset comprising both simulation and real-world sequences, with motion-capture ground truth for evaluation. Experiments show that Point2Pose achieves performance comparable to the state-of-the-art methods on a severe-occlusion benchmark, while additionally supporting multi-object tracking and recovery from complete occlusion, capabilities that are not supported by previous model-free tracking approaches.
[134] DiningBench: A Hierarchical Multi-view Benchmark for Perception and Reasoning in the Dietary Domain cs.CVPDF
Song Jin, Juntian Zhang, Xun Zhang, Zeying Tian, Fei Jiang
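TL;DR: This paper presents DiningBench, a hierarchical multi-view benchmark for evaluating the perception and reasoning of vision-language models (VLMs) in the dietary domain. It contains 3,021 distinct dishes with an average of 5.27 images each and defines three tasks of increasing cognitive complexity: fine-grained classification, nutrition estimation, and visual question answering. An extensive evaluation of 29 advanced open-source and proprietary models shows clear weaknesses in fine-grained visual discrimination and precise nutritional reasoning.
Details
Motivation: VLMs have advanced general visual understanding, but their application in the food domain is constrained by benchmarks that rely on coarse-grained categories, single-view images, and inaccurate metadata. Closing this gap requires a more comprehensive and rigorous benchmark.
Result: Twenty-nine SOTA models were evaluated on DiningBench. Current models perform well on general reasoning but lag significantly in fine-grained visual discrimination and precise nutritional reasoning. The study also systematically analyzes the effect of multi-view inputs and chain-of-thought reasoning and identifies five primary failure modes.
Insight: The novelty is a comprehensive benchmark with hierarchical tasks, multi-view images, fine-grained hard negatives, and verified nutritional data. Viewed objectively, the work underlines the importance of domain-specific evaluation and offers a challenging testbed for food-centric VLM research.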
Abstract: Recent advancements in Vision-Language Models (VLMs) have revolutionized general visual understanding. However, their application in the food domain remains constrained by benchmarks that rely on coarse-grained categories, single-view imagery, and inaccurate metadata. To bridge this gap, we introduce DiningBench, a hierarchical, multi-view benchmark designed to evaluate VLMs across three levels of cognitive complexity: Fine-Grained Classification, Nutrition Estimation, and Visual Question Answering. Unlike previous datasets, DiningBench comprises 3,021 distinct dishes with an average of 5.27 images per entry, incorporating fine-grained “hard” negatives from identical menus and rigorous, verification-based nutritional data. We conduct an extensive evaluation of 29 state-of-the-art open-source and proprietary models. Our experiments reveal that while current VLMs excel at general reasoning, they struggle significantly with fine-grained visual discrimination and precise nutritional reasoning. Furthermore, we systematically investigate the impact of multi-view inputs and Chain-of-Thought reasoning, identifying five primary failure modes. DiningBench serves as a challenging testbed to drive the next generation of food-centric VLM research. All codes are released in https://github.com/meituan/DiningBench.
[135] SignReasoner: Compositional Reasoning for Complex Traffic Sign Understanding via Functional Structure Units cs.CVPDF
Ruibin Wang, Zhenyu Lin, Xinhai Zhao
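TL;DR: This paper proposes SignReasoner, a new paradigm that turns general vision-language models (VLMs) into specialized traffic sign reasoners, targeting semantic understanding of complex traffic signs with intricate layouts, multilingual text, and composite symbols. Its core innovation is the Functional Structure Unit (FSU): complex signs are decomposed into minimal core functional blocks (e.g., Direction, Notice, Lane), so the model learns the underlying structural grammar and generalizes robustly to unseen compositions.
Details
Motivation: Current models, both specialized small models and large VLMs, are bottlenecked by weak compositional generalization on complex traffic signs and fail on novel sign configurations, which threatens autonomous driving safety.
Result: On TrafficSignEval, the newly proposed benchmark for the FSU-Reasoning task, SignReasoner sets a new SOTA with notable data efficiency and no architectural modification, significantly improving traffic sign understanding across various VLMs.
Insight: The novelty is the shift from common instance-based modeling to flexible function-based decomposition (FSU), plus a two-stage post-training pipeline that combines Iterative Caption-FSU Distillation with a GRPO algorithm whose rewards are based on Tree Edit Distance, strengthening compositional reasoning.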
Abstract: Accurate semantic understanding of complex traffic signs, including those with intricate layouts, multi-lingual text, and composite symbols, is critical for autonomous driving safety. Current models, both specialized small ones and large Vision Language Models (VLMs), suffer from a significant bottleneck: a lack of compositional generalization, leading to failure when encountering novel sign configurations. To overcome this, we propose SignReasoner, a novel paradigm that transforms general VLMs into expert traffic sign reasoners. Our core innovation is the Functional Structure Unit (FSU), which shifts from common instance-based modeling to flexible function-based decomposition. By breaking down complex signs into minimal, core functional blocks (e.g., Direction, Notice, Lane), our model learns the underlying structural grammar, enabling robust generalization to unseen compositions. We define this decomposition as the FSU-Reasoning task and introduce a two-stage VLM post-training pipeline to maximize performance: (1) Iterative Caption-FSU Distillation, which enhances the model’s accuracy in both FSU-reasoning and caption generation; and (2) FSU-GRPO, which uses Tree Edit Distance (TED) to compute FSU differences as the rewards in the GRPO algorithm, boosting reasoning abilities. Experiments on the newly proposed FSU-Reasoning benchmark, TrafficSignEval, show that SignReasoner achieves new SOTA with remarkable data efficiency and no architectural modification, significantly improving traffic sign understanding in various VLMs.
[136] Enhancing Fine-Grained Spatial Grounding in 3D CT Report Generation via Discriminative Guidance cs.CVPDF
Chenyu Wang, Weicheng Dai, Han Liu, Wenchao Li, Kayhan Batmanghelich
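TL;DR: This paper proposes DCP-PD, a plug-and-play framework that strengthens fine-grained spatial grounding in 3D CT report generation. It extracts fine-grained cues from free-text reports and uses them, together with prompt dropout, to guide report generation. The method achieves substantial gains on CT-RATE and Rad-ChestCT and introduces a hierarchical, location-aware evaluation protocol that directly assesses pathology localization.
Details
Motivation: Existing radiology report generation methods have two key limitations. First, training supervision is coarse and lacks explicit alignment for fine-grained attributes or pathology locations. Second, evaluation usually relies on holistic metrics that cannot diagnose spatial grounding. This paper targets both problems.
Result: On CT-RATE, DCP-PD raises macro F1 from 0.501 to 0.603 (20% relative), reaching SOTA. On Rad-ChestCT, F1 rises from 0.266 to 0.503 (89% relative), a substantial gain in out-of-distribution performance.
Insight: The contributions are: (1) discriminative prompting with prompt dropout, which distills fine-grained cues from free text to guide generation while reducing reliance on shortcuts; and (2) a hierarchical, location-aware evaluation protocol (presence → laterality → lobe) that directly assesses pathology localization and exposes the limits of current benchmarks.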
Abstract: Vision–language models (VLMs) for radiology report generation (RRG) can produce long-form chest CT reports from volumetric scans and show strong potential to improve radiology workflow efficiency and consistency. However, existing methods face two key limitations: (i) training supervision is often coarse, aligning a whole CT volume with a full free-text report without explicit alignment for fine-grained attributes or pathology locations; and (ii) evaluation is typically holistic (lexical overlap, entity matching, or LLM-as-a-judge scores) and not diagnostic for spatial grounding. We propose Discriminative Cue-Prompting with Prompt Dropout (DCP-PD), a plug-and-play framework that distills fine-grained cues from free-text reports and uses them to guide report generation while mitigating shortcut reliance via prompt dropout. DCP-PD achieves state-of-the-art performance on CT-RATE, improving macro F1 from $0.501$ to $0.603$ (20% relative), and substantially boosts out-of-distribution performance on Rad-ChestCT, improving F1 from $0.266$ to $0.503$ (89% relative). Finally, we introduce a hierarchical, location-aware question-set protocol (presence $\rightarrow$ laterality $\rightarrow$ lobe) to directly assess pathology-location grounding, showing that fine-grained spatial localization remains challenging even for models that score highly on current benchmarks.
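The relative improvements quoted in the abstract can be checked with a few lines of arithmetic. The helper name `relative_gain` is hypothetical; the four F1 values are the ones reported above.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0

ct_rate = relative_gain(0.501, 0.603)      # macro F1 on CT-RATE
rad_chestct = relative_gain(0.266, 0.503)  # F1 on Rad-ChestCT (out-of-distribution)

print(f"CT-RATE: {ct_rate:.1f}%")          # about 20.4%, reported as 20%
print(f"Rad-ChestCT: {rad_chestct:.1f}%")  # about 89.1%, reported as 89%
```

Both rounded values agree with the percentages stated in the abstract.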
[137] PERCEPT-Net: A Perceptual Loss Driven Framework for Reducing MRI Artifact Tissue Confusion cs.CVPDF
Ziheng Guo, Danqun Zheng, Chengwei Chen, Boyang Pan, Shuai Li
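TL;DR: This paper proposes PERCEPT-Net, a deep learning framework that addresses the poor clinical generalization of motion artifact correction in MRI. Its core is the Motion Perceptual Loss (MPL), which uses perceptual supervision to distinguish artifacts from anatomy, suppressing artifacts while preserving anatomical fidelity.
Details
Motivation: Existing deep learning models for MRI artifact correction generalize poorly in clinical use because of inherent artifact-tissue confusion: they cannot tell artifacts from anatomical structures. The paper aims to resolve this and achieve structure-preserving artifact suppression.
Result: PERCEPT-Net outperforms existing state-of-the-art (SOTA) methods on clinical data. Ablations establish a direct causal link between MPL and performance: removing it significantly degrades structural consistency and tissue contrast. Radiologist assessments also confirm its advantage in global image quality and in preserving critical diagnostic structures.
Insight: The main novelty is a task-specific, artifact-aware perceptual learning framework centered on the Motion Perceptual Loss (MPL). It provides a verifiable mechanism for mitigating over-smoothing and structural degradation in medical image reconstruction and improves clinical robustness.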
Abstract: Purpose: Existing deep learning-based MRI artifact correction models exhibit poor clinical generalization due to inherent artifact-tissue confusion, failing to discriminate artifacts from anatomical structures. To resolve this, we introduce PERCEPT-Net, a framework leveraging dedicated perceptual supervision for structure-preserving artifact suppression. Method: PERCEPT-Net utilizes a residual U-Net backbone integrated with a multi-scale recovery module and dual attention mechanisms to preserve anatomical context and salient features. The core mechanism, Motion Perceptual Loss (MPL), provides artifact-aware supervision by learning generalizable motion artifact representations. This logic directly guides the network to suppress artifacts while maintaining anatomical fidelity. Training utilized a hybrid dataset of real and simulated sequences, followed by prospective validation via objective metrics and expert radiologist assessments. Result: PERCEPT-Net outperformed state-of-the-art methods on clinical data. Ablation analysis established a direct causal link between MPL and performance; its omission caused a significant deterioration in structural consistency (p < 0.001) and tissue contrast (p < 0.001). Radiologist evaluations corroborated these objective metrics, scoring PERCEPT-Net significantly higher in global image quality (median 3 vs. 2, p < 0.001) and verifying the preservation of critical diagnostic structures. Conclusion: By integrating task-specific, artifact-aware perceptual learning, PERCEPT-Net suppresses motion artifacts in clinical MRI without compromising anatomical integrity. This framework improves clinical robustness and provides a verifiable mechanism to mitigate over-smoothing and structural degradation in medical image reconstruction.
[138] A Benchmark and Multi-Agent System for Instruction-driven Cinematic Video Compilation cs.CVPDF
Peixuan Zhang, Chang Zhou, Ziyuan Zhang, Hualuo Liu, Chunjie Zhang
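TL;DR: This paper presents the CineBench benchmark and the CineAgents multi-agent system for instruction-driven cinematic video compilation. CineBench is the first benchmark to include diverse user instructions and ground-truth compilations annotated by professional editors. CineAgents follows a “design-and-compose” paradigm: it builds a hierarchical narrative memory through script reverse-engineering and produces the final compiled script through iterative narrative planning, overcoming contextual collapse and temporal fragmentation.
Details
Motivation: Demand for adapting long-form cinematic content into short videos is surging, but existing compilation methods are limited to predefined tasks, and no comprehensive benchmark exists for evaluating cinematic compilation quality.
Result: Extensive experiments show that CineAgents significantly outperforms existing methods on CineBench, producing compilations with better narrative coherence and logical consistency.
Insight: The novelty is reformulating cinematic video compilation as a “design-and-compose” paradigm and introducing a multi-agent system for script reverse-engineering and iterative narrative planning, which improves contextual understanding and temporal coherence.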
Abstract: The surging demand for adapting long-form cinematic content into short videos has motivated the need for versatile automatic video compilation systems. However, existing compilation methods are limited to predefined tasks, and the community lacks a comprehensive benchmark to evaluate cinematic compilation. To address this, we introduce CineBench, the first benchmark for instruction-driven cinematic video compilation, featuring diverse user instructions and high-quality ground-truth compilations annotated by professional editors. To overcome contextual collapse and temporal fragmentation, we present CineAgents, a multi-agent system that reformulates cinematic video compilation into a “design-and-compose” paradigm. CineAgents performs script reverse-engineering to construct a hierarchical narrative memory to provide multi-level context and employs an iterative narrative planning process that refines a creative blueprint into a final compiled script. Extensive experiments demonstrate that CineAgents significantly outperforms existing methods, generating compilations with superior narrative and logical coherence.
[139] Toward Accountable AI-Generated Content on Social Platforms: Steganographic Attribution and Multimodal Harm Detection cs.CV | cs.AI | cs.CR | cs.ETPDF
Xinlei Guan, David Arosemena, Tejaswi Dhandu, Kuan Huang, Meng Xu
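TL;DR: This paper proposes an end-to-end forensic framework that combines steganographic attribution with multimodal harmful-content detection, targeting misuse in which AI-generated images are paired with harmful text. The system embeds cryptographically signed identifiers at image generation time and triggers attribution verification through multimodal detection. Experiments show that wavelet-domain spread-spectrum watermarking is highly robust and that the multimodal fusion detector reaches an AUC-ROC of 0.99.
Details
Motivation: The rapid growth of generative AI poses new challenges for content moderation and digital forensics. Contextual misuse, in which benign AI-generated images are paired with harmful text, is especially hard to detect, and synthetic images lack persistent metadata or device signatures, which complicates attribution and accountability.
Result: Five watermarking methods were evaluated across the spatial, frequency, and wavelet domains; wavelet-domain spread-spectrum watermarking proved strongly robust under blur distortions. The CLIP-based fusion model for multimodal harmful-content detection reached an AUC-ROC of 0.99, enabling reliable cross-modal attribution verification.
Insight: The novelty is combining steganographic attribution with multimodal detection in an end-to-end forensic pipeline, where signed-identifier embedding and multimodal triggering improve the traceability and accountability of AI-generated images on social platforms. Viewed objectively, pairing wavelet-domain watermarking with multimodal fusion offers a practical response to synthetic media misuse.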
Abstract: The rapid growth of generative AI has introduced new challenges in content moderation and digital forensics. In particular, benign AI-generated images can be paired with harmful or misleading text, creating difficult-to-detect misuse. This contextual misuse undermines the traditional moderation framework and complicates attribution, as synthetic images typically lack persistent metadata or device signatures. We introduce a steganography enabled attribution framework that embeds cryptographically signed identifiers into images at creation time and uses multimodal harmful content detection as a trigger for attribution verification. Our system evaluates five watermarking methods across spatial, frequency, and wavelet domains. It also integrates a CLIP-based fusion model for multimodal harmful-content detection. Experiments demonstrate that spread-spectrum watermarking, especially in the wavelet domain, provides strong robustness under blur distortions, and our multimodal fusion detector achieves an AUC-ROC of 0.99, enabling reliable cross-modal attribution verification. These components form an end-to-end forensic pipeline that enables reliable tracing of harmful deployments of AI-generated imagery, supporting accountability in modern synthetic media environments. Our code is available at GitHub: https://github.com/bli1/steganography
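Spread-spectrum watermarking, the family the abstract finds most robust, can be sketched in a few lines. This is an illustrative toy and not the paper's pipeline: it operates on a plain list of numbers standing in for wavelet sub-band coefficients, it omits the wavelet transform and the cryptographic signing step, and the names `pn_sequence`, `embed`, and `extract` as well as the strength `alpha` are assumptions made for the example.

```python
import random

def pn_sequence(seed, n):
    """Pseudo-random +/-1 chips; the seed plays the role of a shared secret key."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]

def embed(coeffs, bits, seed, alpha=2.0):
    """Spread each payload bit over a block of coefficients by adding +/- alpha * chip."""
    chips_per_bit = len(coeffs) // len(bits)
    pn = pn_sequence(seed, chips_per_bit * len(bits))
    out = list(coeffs)
    for i, bit in enumerate(bits):
        sign = 1.0 if bit else -1.0
        for j in range(i * chips_per_bit, (i + 1) * chips_per_bit):
            out[j] += alpha * sign * pn[j]
    return out

def extract(coeffs, n_bits, seed):
    """Recover each bit from the sign of the correlation with the same chip sequence."""
    chips_per_bit = len(coeffs) // n_bits
    pn = pn_sequence(seed, chips_per_bit * n_bits)
    bits = []
    for i in range(n_bits):
        span = range(i * chips_per_bit, (i + 1) * chips_per_bit)
        corr = sum(coeffs[j] * pn[j] for j in span)
        bits.append(1 if corr > 0 else 0)
    return bits

# Stand-in host coefficients bounded in [-1, 1]; a real system would use wavelet coefficients.
host = [((i * 37) % 21 - 10) / 10.0 for i in range(256)]
payload = [1, 0, 1, 1, 0, 0, 1, 0]  # e.g. bits of a signed identifier
marked = embed(host, payload, seed=42)
print(extract(marked, len(payload), seed=42) == payload)
```

Because each bit is spread over many coefficients, detection relies on an aggregate correlation rather than on any single value, which is the usual explanation for the robustness of this family under mild distortions such as blur.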
[140] ExpertEdit: Learning Skill-Aware Motion Editing from Expert Videos cs.CVPDF
Arjun Somayazulu, Kristen Grauman
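TL;DR: This paper proposes ExpertEdit, a framework that learns skill-aware motion editing from unpaired expert video demonstrations alone. It automatically raises novice motion to a higher skill level, enabling personalized visual feedback in sports and rehabilitation training.
Details
Motivation: Existing motion editing methods depend on paired input-output data and explicit edit guidance. The aim is to generate personalized skill-improvement feedback automatically from expert videos and thereby accelerate motor skill learning.
Result: Across eight techniques and three sports from the Ego-Exo4D and Karate Kyokushin datasets, ExpertEdit outperforms current state-of-the-art supervised motion editing methods on multiple metrics of motion realism and expert quality.
Insight: The method learns an expert motion prior with a masked language modeling objective. At inference, novice motion is masked at skill-critical moments and projected onto the expert manifold, producing localized skill improvements without paired supervision or manual edit guidance.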
Abstract: Visual feedback is critical for motor skill acquisition in sports and rehabilitation, and psychological studies show that observing near-perfect versions of one’s own performance accelerates learning more effectively than watching expert demonstrations alone. We propose to enable such personalized feedback by automatically editing a person’s motion to reflect higher skill. Existing motion editing approaches are poorly suited for this setting because they assume paired input-output data – rare and expensive to curate for skill-driven tasks – and explicit edit guidance at inference. We introduce ExpertEdit, a framework for skill-driven motion editing trained exclusively on unpaired expert video demonstrations. ExpertEdit learns an expert motion prior with a masked language modeling objective that infills masked motion spans with expert-level refinements. At inference, novice motion is masked at skill-critical moments and projected into the learned expert manifold, producing localized skill improvements without paired supervision or manual edit guidance. Across eight diverse techniques and three sports from Ego-Exo4D and Karate Kyokushin, ExpertEdit outperforms state-of-the-art supervised motion editing methods on multiple metrics of motion realism and expert quality. Project page: https://vision.cs.utexas.edu/projects/expert_edit/ .
[141] UDAPose: Unsupervised Domain Adaptation for Low-Light Human Pose Estimation cs.CV | cs.AIPDF
Haopeng Chen, Yihao Ai, Kabeen Kim, Robby T. Tan, Yixin Chen
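TL;DR: UDAPose is an unsupervised domain adaptation framework for human pose estimation in low light. It improves performance by synthesizing low-light images and dynamically fusing visual cues with pose priors.
Details
Motivation: Low-light human pose estimation is hampered by scarce annotated data and the loss of visual information. Existing methods synthesize unrealistic low-light images, and the visual cues that pose estimators rely on become unreliable in low light.
Result: AP improves by 10.1 (56.4%) on the ExLPose-test hard set (LL-H) and by 7.4 (31.4%) in cross-dataset validation on EHPT-XC, outperforming existing state-of-the-art methods.
Insight: The contributions are a Direct-Current-based High-Pass Filter and a Low-light Characteristics Injection Module for synthesizing more realistic low-light images, and a Dynamic Control of Attention module that adaptively balances image cues against pose priors in the Transformer.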
Abstract: Low-visibility scenarios, such as low-light conditions, pose significant challenges to human pose estimation due to the scarcity of annotated low-light datasets and the loss of visual information under poor illumination. Recent domain adaptation techniques attempt to utilize well-lit labels by augmenting well-lit images to mimic low-light conditions. But handcrafted augmentations oversimplify noise patterns, while learning-based methods often fail to preserve high-frequency low-light characteristics, producing unrealistic images that lead pose models to generalize poorly to real low-light scenes. Moreover, recent pose estimators rely on image cues through image-to-keypoint cross-attention, but these cues become unreliable under low-light conditions. To address these issues, we propose Unsupervised Domain Adaptation for Pose Estimation (UDAPose), a novel framework that synthesizes low-light images and dynamically fuses visual cues with pose priors for improved pose estimation. Specifically, our synthesis method incorporates a Direct-Current-based High-Pass Filter (DHF) and a Low-light Characteristics Injection Module (LCIM) to inject high-frequency details from input low-light images, overcoming rigidity or the detail loss in existing approaches. Furthermore, we introduce a Dynamic Control of Attention (DCA) module that adaptively balances image cues with learned pose priors in the Transformer architecture. Experiments show that UDAPose outperforms state-of-the-art methods, with notable AP gains of 10.1 (56.4%) on the ExLPose-test hard set (LL-H) and 7.4 (31.4%) in cross-dataset validation on EHPT-XC. Code: https://github.com/Vision-and-Multimodal-Intelligence-Lab/UDAPose
[142] Visual Enhanced Depth Scaling for Multimodal Latent Reasoning cs.CVPDF
Yudong Han, Yong Wang, Zaiquan Yang, Zhen Qu, Liyuan Pan
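TL;DR: This paper proposes a visual-enhanced depth scaling method for multimodal latent reasoning. Analyzing gradient dynamics during latent training reveals that visual tokens are under-optimized because of language bias and that complex tokens suffer unstable gradients under a fixed depth. To address this, the paper introduces a visual replay module, which strengthens visual perception via causal self-attention, and a routing depth scaling mechanism, which adaptively allocates more reasoning steps to complex tokens. Guided by a curriculum learning strategy, the method achieves state-of-the-art performance on multiple benchmarks with substantially faster inference.
Details
Motivation: Multimodal latent reasoning replaces explicit Chain-of-Thought decoding with implicit feature propagation, aiming to make representations more informative and to cut inference latency. Existing approaches are limited because language bias leaves visual tokens under-optimized and a fixed architectural depth restricts the gradient stability of complex tokens.
Result: The method achieves state-of-the-art performance on multiple benchmarks while delivering substantial inference speedups over explicit Chain-of-Thought baselines.
Insight: The contributions are: a gradient-dynamics analysis that exposes visual under-optimization; a visual replay module that uses causal self-attention to strengthen visual perception; a routing depth scaling mechanism that adaptively allocates reasoning depth; and a curriculum learning strategy that progressively internalizes explicit Chain-of-Thought into compact latent representations. Together these improve both the efficiency and the quality of multimodal reasoning.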
Abstract: Multimodal latent reasoning has emerged as a promising paradigm that replaces explicit Chain-of-Thought (CoT) decoding with implicit feature propagation, simultaneously enhancing representation informativeness and reducing inference latency. By analyzing token-level gradient dynamics during latent training, we reveal two critical observations: (1) visual tokens exhibit significantly higher and more volatile gradient norms than their textual counterparts due to inherent language bias, resulting in systematic visual under-optimization; and (2) semantically simple tokens converge rapidly, whereas complex tokens exhibit persistent gradient instability constrained by fixed architectural depths. To address these limitations, we propose a visual replay module and routing depth scaling to collaboratively enhance visual perception and refine complicated latents for deeper contextual reasoning. The former module leverages causal self-attention to estimate token saliency, reinforcing fine-grained grounding through spatially-coherent constraints. Complementarily, the latter mechanism adaptively allocates additional reasoning steps to complex tokens, enabling deeper contextual refinement. Guided by a curriculum strategy that progressively internalizes explicit CoT into compact latent representations, our framework achieves state-of-the-art performance across diverse benchmarks while delivering substantial inference speedups over explicit CoT baselines.
[143] FreeScale: Scaling 3D Scenes via Certainty-Aware Free-View Generation cs.CVPDF
Chenhan Jiang, Yu Chen, Qingwen Zhang, Jifei Song, Songcen Xu
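TL;DR: FreeScale is a framework that uses scene reconstruction to turn limited real-world image sequences into a scalable source of high-quality training data. It addresses the scarcity of large-scale, diverse data with precise camera trajectories, which limits generalizable novel view synthesis models. Its core is a certainty-aware free-view sampling strategy that identifies novel viewpoints in the reconstructed scene that are both semantically meaningful and minimally affected by reconstruction errors.
Details
Motivation: Progress on generalizable novel view synthesis models is limited by the lack of large-scale, diverse training data with precise camera trajectories. Real-world captures are photorealistic but typically sparse and discrete; synthetic data scales but suffers from a domain gap and often lacks realistic semantics.
Result: Scaling up the training of feedforward NVS models with FreeScale yields a notable gain of 2.7 dB in PSNR on challenging out-of-distribution benchmarks. The generated data also improves per-scene 3D Gaussian Splatting optimization, with consistent gains across multiple datasets.
Insight: The novelty is treating an imperfect reconstructed scene as a rich geometric proxy and proposing a certainty-aware free-view sampling strategy that selects novel viewpoints least affected by reconstruction artifacts to generate high-quality data. This provides a practical and powerful data generation engine for the data bottleneck in 3D vision.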
Abstract: The development of generalizable Novel View Synthesis (NVS) models is critically limited by the scarcity of large-scale training data featuring diverse and precise camera trajectories. While real-world captures are photorealistic, they are typically sparse and discrete. Conversely, synthetic data scales but suffers from a domain gap and often lacks realistic semantics. We introduce FreeScale, a novel framework that leverages the power of scene reconstruction to transform limited real-world image sequences into a scalable source of high-quality training data. Our key insight is that an imperfect reconstructed scene serves as a rich geometric proxy, but naively sampling from it amplifies artifacts. To this end, we propose a certainty-aware free-view sampling strategy identifying novel viewpoints that are both semantically meaningful and minimally affected by reconstruction errors. We demonstrate FreeScale’s effectiveness by scaling up the training of feedforward NVS models, achieving a notable gain of 2.7 dB in PSNR on challenging out-of-distribution benchmarks. Furthermore, we show that the generated data can actively enhance per-scene 3D Gaussian Splatting optimization, leading to consistent improvements across multiple datasets. Our work provides a practical and powerful data generation engine to overcome a fundamental bottleneck in 3D vision. Project page: https://mvp-ai-lab.github.io/FreeScale.
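To put the reported 2.7 dB PSNR gain in perspective, the standard PSNR definition can be inverted to obtain the corresponding reduction in mean squared error. This is a minimal sketch using the textbook formula PSNR = 10·log10(MAX² / MSE); the function names are hypothetical and the MSE value in the usage lines is an arbitrary example, not a number from the paper.

```python
import math

def psnr(mse, max_val=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_val ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE must shrink to raise PSNR by `gain_db` decibels."""
    return 10.0 ** (gain_db / 10.0)

ratio = mse_ratio_for_gain(2.7)
print(f"MSE shrinks by a factor of about {ratio:.2f}")  # about 1.86

baseline_mse = 0.01                                      # arbitrary example value
print(psnr(baseline_mse / ratio) - psnr(baseline_mse))   # approximately 2.7
```

Under this definition, a 2.7 dB gain corresponds to cutting the mean squared error almost in half, independent of the baseline error level.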
[144] Data-Efficient Surgical Phase Segmentation in Small-Incision Cataract Surgery: A Controlled Study of Vision Foundation Models cs.CV | cs.AIPDF
Lincoln Spencer, Song Wang, Chen Chen
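TL;DR: This paper studies data-efficient surgical phase segmentation in small-incision cataract surgery (SICS) videos, where labeled data is scarce. Different visual encoders (the supervised models ResNet-50 and I3D and the self-supervised foundation models DINOv3 and V-JEPA2) are paired with the same temporal model (MS-TCN++) under identical settings. Foundation-model features improve segmentation performance, with DINOv3 ViT-7B performing best. The study also examines domain transfer using unlabeled videos and lightweight adaptation.
Details
Motivation: Surgical phase segmentation is central to computer-assisted surgery, but robust models remain difficult to develop when labeled surgical videos are scarce. The paper investigates how a controlled comparison of visual representations can improve phase segmentation for small-incision cataract surgery under limited data.
Result: On the SICS-155 dataset (19 phases), foundation-model features improve segmentation performance. DINOv3 ViT-7B achieves the best overall results, with 83.4% accuracy and an edit score of 87.0.
Insight: The novelty is a controlled comparison (fixed temporal model and fixed training and evaluation settings) of supervised models against self-supervised vision foundation models on a label-scarce medical video task. The study indicates that modern vision foundation models such as DINOv3 transfer well to surgical workflow understanding and offers practical guidance for low-label medical video settings, for example a cached-feature pipeline and a domain-transfer analysis.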
Abstract: Surgical phase segmentation is central to computer-assisted surgery, yet robust models remain difficult to develop when labeled surgical videos are scarce. We study data-efficient phase segmentation for manual small-incision cataract surgery (SICS) through a controlled comparison of visual representations. To isolate representation quality, we pair each visual encoder with the same temporal model (MS-TCN++) under identical training and evaluation settings on SICS-155 (19 phases). We compare supervised encoders (ResNet-50, I3D) against large self-supervised foundation models (DINOv3, V-JEPA2), and use a cached-feature pipeline that decouples expensive visual encoding from lightweight temporal learning. Foundation-model features improve segmentation performance in this setup, with DINOv3 ViT-7B achieving the best overall results (83.4% accuracy, 87.0 edit score). We further examine cataract-domain transfer using unlabeled videos and lightweight adaptation, and analyze when it helps or hurts. Overall, the study indicates strong transferability of modern vision foundation models to surgical workflow understanding and provides practical guidance for low-label medical video settings. The project website is available at: https://sl2005.github.io/DataEfficient-sics-phase-seg/
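The abstract reports an edit score alongside frame accuracy but does not define it. The sketch below assumes the conventional segmental edit score used in temporal action segmentation: frame labels are collapsed into an ordered list of segments, and the Levenshtein distance between predicted and ground-truth segment lists is normalized by the longer list. The function names are hypothetical, and the paper's exact variant may differ.

```python
def segments(frame_labels):
    """Collapse per-frame labels into the ordered list of segment labels."""
    out = []
    for label in frame_labels:
        if not out or out[-1] != label:
            out.append(label)
    return out

def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def edit_score(pred_frames, gt_frames):
    """Segmental edit score in [0, 100]; higher means better phase ordering."""
    p, g = segments(pred_frames), segments(gt_frames)
    longest = max(len(p), len(g))
    if longest == 0:
        return 100.0
    return (1.0 - levenshtein(p, g) / longest) * 100.0

gt = [0, 0, 1, 1, 2, 2]
print(edit_score([0, 0, 0, 1, 2, 2], gt))  # boundary shifted, same phase order
print(edit_score([0, 1, 0, 1, 2, 2], gt))  # over-segmentation is penalized
```

Unlike frame accuracy, this score ignores small boundary shifts and instead penalizes spurious or missing phases, which is why the two metrics are typically reported together.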
[145] STORM: End-to-End Referring Multi-Object Tracking in Videos cs.CV | cs.AIPDF
Zijia Lu, Jingru Yi, Jue Wang, Yuxiao Chen, Junwen Chen
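TL;DR: This paper proposes STORM, an end-to-end multimodal large language model for referring multi-object tracking (RMOT) in videos. The model jointly performs grounding and tracking within a unified framework, needs no external detector, and reasons coherently over appearance, motion, and language. To improve data efficiency, the authors propose a task-composition learning strategy that decomposes RMOT into image grounding and object tracking, and they build STORM-Bench, a new high-quality dataset. Experiments show that STORM reaches state-of-the-art performance on multiple benchmarks.
Details
Motivation: Existing RMOT methods split grounding and tracking into separate modules, and their performance is limited by scarce training videos, ambiguous annotations, and restricted domains.
Result: STORM achieves state-of-the-art performance on image grounding, single-object tracking, and RMOT benchmarks, showing strong generalization and robust spatial-temporal grounding in complex real-world scenarios.
Insight: The contributions are: (1) an end-to-end unified framework that jointly performs grounding and tracking with coherent cross-modal reasoning; (2) a task-composition learning strategy that exploits data-rich sub-tasks to improve data efficiency; and (3) STORM-Bench, a new dataset with high-quality, unambiguous, and diverse annotations that addresses the data bottleneck.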
Abstract: Referring multi-object tracking (RMOT) is a task of associating all the objects in a video that semantically match with given textual queries or referring expressions. Existing RMOT approaches decompose object grounding and tracking into separated modules and exhibit limited performance due to the scarcity of training videos, ambiguous annotations, and restricted domains. In this work, we introduce STORM, an end-to-end MLLM that jointly performs grounding and tracking within a unified framework, eliminating external detectors and enabling coherent reasoning over appearance, motion, and language. To improve data efficiency, we propose a task-composition learning (TCL) strategy that decomposes RMOT into image grounding and object tracking, allowing STORM to leverage data-rich sub-tasks and learn structured spatial–temporal reasoning. We further construct STORM-Bench, a new RMOT dataset with accurate trajectories and diverse, unambiguous referring expressions generated through a bottom-up annotation pipeline. Extensive experiments show that STORM achieves state-of-the-art performance on image grounding, single-object tracking, and RMOT benchmarks, demonstrating strong generalization and robust spatial–temporal grounding in complex real-world scenarios. STORM-Bench is released at https://github.com/amazon-science/storm-referring-multi-object-grounding.
[146] BareBones: Benchmarking Zero-Shot Geometric Comprehension in VLMs cs.CVPDF
Aaditya Baranwal, Vishal Yadav, Abhishek Rajora
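TL;DR: This paper presents BareBones, a zero-shot benchmark that evaluates how well vision-language models (VLMs) understand pure geometric shape, without relying on RGB texture or contextual priors. The benchmark combines pixel-level silhouettes from six datasets, including the newly built WTP-Bench, to test whether models can recognize geometric concepts from boundary contours alone. An evaluation of 26 state-of-the-art proprietary and open-weight VLMs (e.g., GPT-4.1 and Gemini) shows severe performance drops across the board once RGB information is removed, revealing widespread structural blind spots.
Details
Motivation: Existing evaluations do not isolate whether VLMs truly understand geometric structure: they conflate semantic reasoning with texture mapping and rely on imprecise annotations that can leak environmental cues. A zero-shot benchmark dedicated to pure geometric shape comprehension is needed to fill this gap.
Result: All 26 state-of-the-art VLMs evaluated on BareBones show a consistent, severe performance collapse when deprived of RGB information, a phenomenon the authors call the “Texture Bias Cliff”. This indicates a general deficiency in geometric understanding: existing models fall short of genuine geometric grounding.
Insight: The novelty is BareBones, a noise-free geometric classification benchmark, and in particular its flagship dataset WTP-Bench, an extreme fine-grained visual puzzle that forces models to identify geometric concepts from boundary contours alone. It provides a rigorous yardstick for evaluating and improving geometric understanding in VLMs and exposes their over-reliance on texture as a key limitation.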
Abstract: While Vision-Language Models (VLMs) demonstrate remarkable zero-shot recognition capabilities across a diverse spectrum of multimodal tasks, it remains an open question whether these architectures genuinely comprehend geometric structure or merely exploit RGB textures and contextual priors as statistical shortcuts. Existing evaluations fail to isolate this mechanism, conflating semantic reasoning with texture mapping and relying on imprecise annotations that inadvertently leak environmental cues. To address this gap, we introduce BareBones, a zero-shot benchmark designed to stress-test pure geometric shape comprehension. We curate pixel-level silhouettes of geometrically distinct classes across six datasets: five established segmentation sources (ImageNet-S, DIS5K, ThinObject5K, PASCAL VOC, CUB-200) and our novel flagship collection, WTP-Bench, establishing a noise-free geometric taxonomy. WTP-Bench is an extreme, fine-grained visual puzzle that forces models to identify inter-class geometric concepts from boundary contours alone. Our evaluation of 26 state-of-the-art proprietary and open-weight VLMs (e.g., GPT-4.1, Gemini, Claude Sonnet 4.5, LLaVA) reveals a consistent, severe performance collapse under RGB deprivation, a phenomenon we term the Texture Bias Cliff. By documenting universal structural blindspots, BareBones establishes a rigorous yardstick for genuine geometric grounding.
[147] Bidirectional Learning of Facial Action Units and Expressions via Structured Semantic Mapping across Heterogeneous Datasets cs.CVPDF
Jia Li, Yu Zhang, Yin Chen, Zhenzhen Hu, Yong Li
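TL;DR: This paper proposes a Structured Semantic Mapping (SSM) framework for bidirectional joint learning of facial action unit (AU) detection and facial expression (FE) recognition across heterogeneous datasets. Through a shared visual backbone, a Textual Semantic Prototype module, and a Dynamic Prior Mapping module, the framework handles the heterogeneity of AU and FE data in annotation paradigm, label granularity, and data diversity, and promotes cross-task semantic alignment and knowledge transfer.
Details
Motivation: Existing studies focus mainly on one-way knowledge transfer from AUs to FEs; although AUs and FEs are inherently semantically correlated, bidirectional learning remains underexplored. AU and FE datasets also differ in annotation paradigm (frame-level vs. clip-level), label granularity, and data availability, which hinders effective joint learning.
Result: Extensive experiments on popular AU detection and FE recognition benchmarks show that SSM achieves state-of-the-art (SOTA) performance on both tasks simultaneously, and that holistic expression semantics can in turn enhance fine-grained AU learning, even across heterogeneous datasets.
Insight: The innovation is a unified framework (SSM) that explicitly handles the heterogeneity between the AU and FE tasks and enables bidirectional knowledge transfer. The Textual Semantic Prototype module builds structured semantic prototypes that serve as cross-task alignment anchors, and the Dynamic Prior Mapping module combines prior knowledge from the Facial Action Coding System with a learned, data-driven association matrix, achieving explicit bidirectional knowledge transfer in a shared semantic space and a high-level feature space.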
Abstract: Facial action unit (AU) detection and facial expression (FE) recognition can be jointly viewed as affective facial behavior tasks, representing fine-grained muscular activations and coarse-grained holistic affective states, respectively. Despite their inherent semantic correlation, existing studies predominantly focus on knowledge transfer from AUs to FEs, while bidirectional learning remains insufficiently explored. In practice, this challenge is further compounded by heterogeneous data conditions, where AU and FE datasets differ in annotation paradigms (frame-level vs. clip-level), label granularity, and data availability and diversity, hindering effective joint learning. To address these issues, we propose a Structured Semantic Mapping (SSM) framework for bidirectional AU–FE learning under different data domains and heterogeneous supervision. SSM consists of three key components: (1) a shared visual backbone that learns unified facial representations from dynamic AU and FE videos; (2) semantic mediation via a Textual Semantic Prototype (TSP) module, which constructs structured semantic prototypes from fixed textual descriptions augmented with learnable context prompts, serving as supervision signals and cross-task alignment anchors in a shared semantic space; and (3) a Dynamic Prior Mapping (DPM) module that incorporates prior knowledge derived from the Facial Action Coding System and learns a data-driven association matrix in a high-level feature space, enabling explicit and bidirectional knowledge transfer. Extensive experiments on popular AU detection and FE recognition benchmarks show that SSM achieves state-of-the-art performance on both tasks simultaneously, and demonstrate that holistic expression semantics can in turn enhance fine-grained AU learning even across heterogeneous datasets.
[148] NTIRE 2026 Challenge on Short-form UGC Video Restoration in the Wild with Generative Models: Datasets, Methods and Results cs.CVPDF
Xin Li, Jiachao Gong, Xijun Wang, Shiyao Xiong, Bingchen Li
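TL;DR: This paper gives an overview of the NTIRE 2026 challenge, which focuses on restoring real-world short-form user-generated content (UGC) videos with generative models. The challenge is built on a new benchmark dataset, KwaiVIR, containing synthetically distorted videos and real-world short videos. It has a subjective track and an objective track; 95 teams registered, 12 submitted final solutions, and the submissions achieved strong performance on the benchmark.
Details
Motivation: To establish a strong and practical benchmark for restoring short-form UGC videos under complex real-world degradations, especially in the emerging paradigm of generative-model-based restoration.
Result: The submitted methods achieved strong performance on the KwaiVIR benchmark, demonstrating encouraging progress in the field.
Insight: The innovation is the introduction of KwaiVIR, a new benchmark dataset containing both synthetically and naturally degraded videos, together with a two-track evaluation framework that combines a subjective user study with objective metrics to assess generative models comprehensively on real-world short-video restoration.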
Abstract: This paper presents an overview of the NTIRE 2026 Challenge on Short-form UGC Video Restoration in the Wild with Generative Models. This challenge utilizes a new short-form UGC (S-UGC) video restoration benchmark, termed KwaiVIR, which is contributed by USTC and Kuaishou Technology. It contains both synthetically distorted videos and real-world short-form UGC videos in the wild. For this edition, the released data include 200 synthetic training videos, 48 wild training videos, 11 validation videos, and 20 testing videos. The primary goal of this challenge is to establish a strong and practical benchmark for restoring short-form UGC videos under complex real-world degradations, especially in the emerging paradigm of generative-model-based S-UGC video restoration. This challenge has two tracks: (i) the primary track is a subjective track, where the evaluation is based on a user study; (ii) the second track is an objective track. These two tracks enable a comprehensive assessment of restoration quality. In total, 95 teams registered for this competition, and 12 teams submitted valid final solutions and fact sheets for the testing phase. The submitted methods achieved strong performance on the KwaiVIR benchmark, demonstrating encouraging progress in short-form UGC video restoration in the wild.
[149] Spatio-Temporal Difference Guided Motion Deblurring with the Complementary Vision Sensor cs.CVPDF
Yapeng Meng, Lin Yang, Yuguo Chen, Xiangru Chen, Taoyi Wang
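TL;DR: This paper proposes STGDNet, a spatio-temporal difference guided deblurring network for motion blur in extreme dynamic scenes. It uses RGB frames and high-frame-rate spatial difference (SD) and temporal difference (TD) data captured synchronously by the new complementary vision sensor Tianmouc, and a recurrent multi-branch architecture iteratively encodes and fuses the SD and TD sequences to restore the structure and color details lost in blurry RGB inputs.
Details
Motivation: Motion blur arises when the scene changes rapidly during exposure. Deblurring from RGB images alone is highly ill-posed and often fails under extreme motion. Brain-inspired vision sensors such as event cameras provide temporally dense information, but they still suffer from event-rate saturation, and the event modality entangles edge features with motion cues. The complementary vision sensor (CVS) Tianmouc captures synchronized high-frame-rate, multi-bit spatial difference data (encoding structural edges) and temporal difference data (encoding motion cues) within a single RGB exposure, opening a new route to RGB deblurring in extreme dynamic scenes.
Result: The method outperforms current RGB-based and event-based approaches on both a synthetic CVS dataset and real-world evaluations. STGDNet also generalizes well across more than 100 extreme real-world scenarios.
Insight: The core innovation is the first full use of the synchronized, high-frame-rate, decoupled SD and TD streams provided by the CVS, together with STGDNet, a recurrent multi-branch network that iteratively fuses these complementary modalities to explicitly recover structural detail and motion information separately, thereby addressing deblurring under extreme motion. Viewed objectively, fusing SD (structure) and TD (motion) as decoupled guidance signals is a novel and effective paradigm for using multimodal information, and may carry over to other dynamic vision tasks.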
Abstract: Motion blur arises when rapid scene changes occur during the exposure period, collapsing rich intra-exposure motion into a single RGB frame. Without explicit structural or temporal cues, RGB-only deblurring is highly ill-posed and often fails under extreme motion. Inspired by the human visual system, brain-inspired vision sensors introduce temporally dense information to alleviate this problem. However, event cameras still suffer from event rate saturation under rapid motion, while the event modality entangles edge features and motion cues, which limits their effectiveness. As a recent breakthrough, the complementary vision sensor (CVS), Tianmouc, captures synchronized RGB frames together with high-frame-rate, multi-bit spatial difference (SD, encoding structural edges) and temporal difference (TD, encoding motion cues) data within a single RGB exposure, offering a promising solution for RGB deblurring under extreme dynamic scenes. To fully leverage these complementary modalities, we propose Spatio-Temporal Difference Guided Deblur Net (STGDNet), which adopts a recurrent multi-branch architecture that iteratively encodes and fuses SD and TD sequences to restore structure and color details lost in blurry RGB inputs. Our method outperforms current RGB-based and event-based approaches on both a synthetic CVS dataset and real-world evaluations. Moreover, STGDNet exhibits strong generalization capability across over 100 extreme real-world scenarios. Project page: https://tmcDeblur.github.io/
[150] Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models cs.CVPDF
Dehui Wang, Congsheng Xu, Rong Wei, Yue Shi, Shoufa Chen
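TL;DR: This paper proposes Rein3D, a framework for generating high-quality, globally consistent 3D indoor scenes from sparse inputs. It combines an explicit 3D Gaussian Splatting (3DGS) representation with temporally coherent priors from video diffusion models and follows a "restore-and-refine" paradigm: a radial exploration strategy renders imperfect panoramic video sequences from an initial 3DGS, a panoramic video-to-video diffusion model restores them and video super-resolution enhances them, and the refined videos then serve as pseudo-ground truth to update the global 3D Gaussian field.
Details
Motivation: Existing methods that synthesize 3D indoor scenes from sparse inputs struggle to infer the geometry of large unseen areas while maintaining global consistency, and tend to produce reconstructions that are locally plausible but globally inconsistent.
Result: Experiments show that Rein3D generates photorealistic, globally consistent 3D scenes and significantly outperforms existing baselines on long-range camera exploration.
Insight: The innovations are coupling 3DGS with the temporally coherent priors of video diffusion models, the "restore-and-refine" paradigm, and the radial exploration strategy; the authors also construct the PanoV2V-15K dataset for diffusion-based scene restoration. Viewed objectively, strengthening the global consistency of 3D reconstruction by generating video sequences and iteratively updating the scene is a promising direction.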
Abstract: The growing demand for Embodied AI and VR applications has highlighted the need for synthesizing high-quality 3D indoor scenes from sparse inputs. However, existing approaches struggle to infer massive amounts of missing geometry in large unseen areas while maintaining global consistency, often producing locally plausible but globally inconsistent reconstructions. We present Rein3D, a framework that reconstructs full 360-degree indoor environments by coupling explicit 3D Gaussian Splatting (3DGS) with temporally coherent priors from video diffusion models. Our approach follows a “restore-and-refine” paradigm: we employ a radial exploration strategy to render imperfect panoramic videos along trajectories starting from the origin, effectively uncovering occluded regions from a coarse 3DGS initialization. These sequences are restored by a panoramic video-to-video diffusion model and further enhanced via video super-resolution to synthesize high-fidelity geometry and textures. Finally, these refined videos serve as pseudo-ground truths to update the global 3D Gaussian field. To support this task, we construct PanoV2V-15K, a dataset of over 15K paired clean and degraded panoramic videos for diffusion-based scene restoration. Experiments demonstrate that Rein3D produces photorealistic and globally consistent 3D scenes and significantly improves long-range camera exploration compared with existing baselines.
[151] TAPNext++: What’s Next for Tracking Any Point (TAP)? cs.CVPDF
Sebastian Jung, Artem Zholus, Martin Sundermeyer, Carl Doersch, Ross Goroshin
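TL;DR: This paper proposes TAPNext++, which improves on the existing TAPNext method to handle point tracking in long video sequences and to re-detect query points that reappear after being occluded or leaving the frame.
Details
Motivation: TAPNext performs poorly on long video sequences and frequently fails to re-detect points that reappear after occlusion or after leaving the frame, so a model is needed that handles longer sequences and improves re-detection.
Result: TAPNext++ sets a new state of the art (SOTA) on multiple benchmarks, performs particularly well on the newly introduced Re-Detection Average Jaccard (AJ_RD) metric, and retains a low memory and compute footprint.
Insight: The innovations include: training on sequences of up to 1024 frames using sequence parallelism to improve long-sequence handling; a new re-detection metric, AJ_RD; and geometric augmentations such as periodic roll, together with supervision of occluded points, aimed specifically at re-detection.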
Abstract: Tracking-Any-Point (TAP) models aim to track any point through a video, which is a crucial task in AR/XR and robotics applications. The recently introduced TAPNext approach proposes an end-to-end, recurrent transformer architecture to track points frame-by-frame in a purely online fashion, demonstrating competitive performance at minimal latency. However, we show that TAPNext struggles with longer video sequences and also frequently fails to re-detect query points that reappear after being occluded or leaving the frame. In this work, we present TAPNext++, a model that tracks points in sequences that are orders of magnitude longer while preserving the low memory and compute footprint of the architecture. We train the recurrent video transformer using several data-driven solutions, including training on long 1024-frame sequences enabled by sequence parallelism techniques. We highlight that re-detection performance is a blind spot in the current literature and introduce a new metric, Re-Detection Average Jaccard ($AJ_{RD}$), to explicitly evaluate tracking on re-appearing points. To improve re-detection of points, we introduce tailored geometric augmentations, such as periodic roll that simulates point re-entries, and supervising occluded points. We demonstrate that recurrent transformers can be substantially improved for point tracking and set a new state-of-the-art on multiple benchmarks. Model and code can be found at https://tap-next-plus-plus.github.io.
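The abstract does not define periodic roll precisely. As a minimal sketch, assuming it means cyclically shifting content along one image axis over time so that points leaving one border re-enter at the opposite one, the effect on point tracks could look like the following. The function name `periodic_roll_tracks`, the constant per-frame shift, and the re-entry flag are illustrative assumptions, not the TAPNext++ implementation.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def periodic_roll_tracks(
    tracks: List[List[Point]], width: int, shift_per_frame: int
) -> Tuple[List[List[Point]], List[List[bool]]]:
    """Cyclically shift x-coordinates over time (hypothetical augmentation).

    tracks[t][i] is the (x, y) position of point i in frame t; every frame
    is assumed to list the same points in the same order.
    Returns the rolled tracks and, per frame and point, whether the point
    wrapped around the border since the previous frame (a simulated re-entry).
    """
    rolled: List[List[Point]] = []
    reentered: List[List[bool]] = []
    prev_wraps: List[int] = []
    for t, frame in enumerate(tracks):
        offset = t * shift_per_frame  # the shift grows linearly with time
        new_frame, flags, wraps = [], [], []
        for i, (x, y) in enumerate(frame):
            shifted = x + offset
            wrap_count = int(shifted // width)  # number of border crossings so far
            new_frame.append((shifted % width, y))
            flags.append(t > 0 and wrap_count != prev_wraps[i])
            wraps.append(wrap_count)
        rolled.append(new_frame)
        reentered.append(flags)
        prev_wraps = wraps
    return rolled, reentered
```

With a stationary point near the right border, the rolled track leaves the frame on one side and reappears on the other, which is the kind of re-entry event a re-detection metric would need to score.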
[152] GeoMeld: Toward Semantically Grounded Foundation Models for Remote Sensing cs.CV | cs.AIPDF
Maram Hasan, Md Aminur Hossain, Savitra Roy, Souparna Bhowmik, Ayush V. Patel
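TL;DR: This paper presents GeoMeld, a large-scale multimodal remote sensing dataset with approximately 2.5 million spatially aligned samples, and GeoMeld-FM, a pretraining framework. The dataset provides semantically grounded language supervision through an agentic captioning framework; the pretraining framework combines masked autoencoding, JEPA representation learning, and caption-vision contrastive alignment to learn representations that capture cross-sensor physical consistency and grounded semantics. Experiments show consistent gains in downstream transfer and cross-sensor robustness.
Details
Motivation: Remote sensing currently lacks large-scale, spatially aligned multimodal data with semantically grounded supervision, which limits the development of effective foundation models. This paper aims to address that resource gap.
Result: Experiments show that the proposed method yields consistent gains in downstream transfer and cross-sensor robustness.
Insight: The main innovations are: 1) a large-scale, spatially aligned multimodal remote sensing dataset with semantically grounded language supervision from an agentic captioning framework; 2) a joint pretraining framework combining multi-pretext masked autoencoding, JEPA representation learning, and caption-vision contrastive alignment to learn physically consistent and semantically grounded representations. Together they provide a scalable reference framework for semantically grounded multimodal foundation modeling in remote sensing.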
Abstract: Effective foundation modeling in remote sensing requires spatially aligned heterogeneous modalities coupled with semantically grounded supervision, yet such resources remain limited at scale. We present GeoMeld, a large-scale multimodal dataset with approximately 2.5 million spatially aligned samples. The dataset spans diverse modalities and resolutions and is constructed under a unified alignment protocol for modality-aware representation learning. GeoMeld provides semantically grounded language supervision through an agentic captioning framework that synthesizes and verifies annotations from spectral signals, terrain statistics, and structured geographic metadata, encoding measurable cross-modality relationships within textual descriptions. To leverage this dataset, we introduce GeoMeld-FM, a pretraining framework that combines multi-pretext masked autoencoding over aligned modalities, JEPA representation learning, and caption-vision contrastive alignment. This joint objective enables the learned representation space to capture both reliable cross-sensor physical consistency and grounded semantics. Experiments demonstrate consistent gains in downstream transfer and cross-sensor robustness. Together, GeoMeld and GeoMeld-FM establish a scalable reference framework for semantically grounded multi-modal foundation modeling in remote sensing.
[153] How to Design a Compact High-Throughput Video Camera? cs.CVPDF
Chenxi Qiu, Tao Yue, Xuemei Hu
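TL;DR: To address the system complexity and readout/transmission bottlenecks of high-throughput video acquisition, this paper proposes a low-bit gradient camera scheme based on existing technologies. It exploits the strengths of gradient cameras in fast readout and efficient representation and uses a multi-scale reconstruction CNN to reconstruct high-resolution images, yielding a design for a compact high-throughput video camera.
Details
Motivation: Existing high-throughput imaging systems stitch together hundreds of sub-images/videos, which makes them extremely complex. As pixel sizes shrink to the sub-micrometer level, integrating ultra-high throughput on a single chip is becoming feasible, but readout and output transmission speeds cannot keep pace with the growing pixel count, so these bottlenecks must be resolved.
Result: Extensive experiments on simulated and real data demonstrate the promising quality and feasibility of the proposed method.
Insight: The innovation lies in analyzing the strengths of gradient cameras in fast readout and efficient representation, proposing a low-bit gradient camera scheme to resolve the readout and transmission bottlenecks, and designing a multi-scale reconstruction CNN for image reconstruction, offering a new approach to compact high-throughput video camera design.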
Abstract: High throughput video acquisition is a challenging problem and has been drawing increasing attention. Existing high throughput imaging systems splice hundreds of sub-images/videos into high throughput videos, suffering from extremely high system complexity. Alternatively, with pixel sizes reducing to sub-micrometer levels, integrating ultra-high throughput on a single chip is becoming feasible. Nevertheless, the readout and output transmission speed cannot keep pace with the increasing pixel numbers. To this end, this paper analyzes the strength of gradient cameras in fast readout and efficient representation, and proposes a low-bit gradient camera scheme based on existing technologies that can resolve the readout and transmission bottlenecks for high throughput video imaging. A multi-scale reconstruction CNN is proposed to reconstruct high-resolution images. Extensive experiments on both simulated and real data are conducted to demonstrate the promising quality and feasibility of the proposed method.
[154] LogitDynamics: Reliable ViT Error Detection from Layerwise Logit Trajectories cs.CVPDF
Ido Beigelman, Moti Freiman
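TL;DR: This paper proposes LogitDynamics, which reliably detects classification errors by analyzing how logits at intermediate layers of a Vision Transformer (ViT) evolve. The method improves or matches AUCPR for error prediction on several datasets and shows stronger cross-dataset generalization.
Details
Motivation: Inspired by hallucination detection from internal signals in large language models, the paper investigates whether similar depth-wise signals exist in ViTs, in order to predict image classifier errors using only signals from a single forward pass.
Result: Across several datasets, the method improves on or matches baselines in AUCPR and generalizes better across datasets, while requiring minimal additional computation.
Insight: The innovation is to model how class evidence evolves using the layerwise logit trajectories of the ViT (the logits of the predicted class and its top-K competitors, plus statistics on the instability of the top-ranked class across depth). Lightweight linear heads extract the features, and a linear probe is trained on them to predict errors, a simple and effective way to exploit internal signals.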
Abstract: Reliable confidence estimation is critical when deploying vision models. We study error prediction: determining whether an image classifier’s output is correct using only signals from a single forward pass. Motivated by internal-signal hallucination detection in large language models, we investigate whether similar depth-wise signals exist in Vision Transformers (ViTs). We propose a simple method that models how class evidence evolves across layers. By attaching lightweight linear heads to intermediate layers, we extract features from the last L layers that capture both the logits of the predicted class and its top-K competitors, as well as statistics describing instability of top-ranked classes across depth. A linear probe trained on these features predicts the error indicator. Across datasets, our method improves or matches AUCPR over baselines and shows stronger cross-dataset generalization while requiring minimal additional computation.
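To make the feature construction concrete, here is a minimal sketch of how a layerwise logit trajectory could be turned into a feature vector for a linear probe. It assumes the per-layer logits (from the lightweight linear heads) are already available as plain lists; the function name `trajectory_features` and the two instability statistics chosen here (top-1 flip count and agreement with the final prediction) are illustrative assumptions, since the abstract does not list the exact statistics used.

```python
from typing import List


def trajectory_features(layer_logits: List[List[float]], k: int = 2) -> List[float]:
    """Build probe features from the logits of the last L layers.

    layer_logits[l][c] is the logit of class c read out at layer l,
    ordered from shallowest to deepest; the last entry is the final output.
    """
    final = layer_logits[-1]
    order = sorted(range(len(final)), key=lambda c: final[c], reverse=True)
    pred, competitors = order[0], order[1:1 + k]

    feats: List[float] = []
    for logits in layer_logits:
        # Logit of the finally predicted class and of its top-K competitors.
        feats.append(logits[pred])
        feats.extend(logits[c] for c in competitors)

    # Instability of the top-ranked class across depth (assumed statistics).
    top1 = [max(range(len(l)), key=lambda c: l[c]) for l in layer_logits]
    flips = sum(a != b for a, b in zip(top1, top1[1:]))
    agreement = sum(c == pred for c in top1) / len(top1)
    feats.extend([float(flips), agreement])
    return feats
```

A linear probe (for example, logistic regression) would then be trained on these vectors with the error indicator as the label.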
[155] LoViF 2026 The First Challenge on Weather Removal in Videos cs.CV | cs.AI | cs.MMPDF
Chenghao Qian
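TL;DR: This paper reviews the LoViF 2026 Challenge on Weather Removal in Videos, which aims to advance methods for restoring clean videos from inputs degraded by adverse weather such as rain and snow, emphasizing visual plausibility and temporal consistency while preserving scene structure and motion dynamics. To support the task, it introduces a new short-form WRV dataset of 18 videos, with 1,216 synthesized frames paired with real-world ground-truth frames at 832 x 480 resolution, split into training, validation, and test sets at a 1:1:1 ratio. The challenge attracted 37 participants and received 5 valid final submissions, advancing video weather removal.
Details
Motivation: To restore clean videos under real-world adverse weather, with an emphasis on visual plausibility and temporal consistency, and so improve the robustness and realism of video restoration.
Result: The challenge attracted 37 participants and received 5 valid final submissions. The evaluation protocol jointly considers fidelity and perceptual quality, but the abstract reports no specific quantitative results or benchmark comparisons.
Insight: The innovations include a short-form WRV dataset dedicated to video weather removal and an evaluation framework that emphasizes temporal consistency and preservation of scene structure; both can inform dataset construction and evaluation design for video restoration tasks.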
Abstract: This paper presents a review of the LoViF 2026 Challenge on Weather Removal in Videos. The challenge encourages the development of methods for restoring clean videos from inputs degraded by adverse weather conditions such as rain and snow, with an emphasis on achieving visually plausible and temporally consistent results while preserving scene structure and motion dynamics. To support this task, we introduce a new short-form WRV dataset tailored for video weather removal. It consists of 18 videos with 1,216 synthesized frames paired with 1,216 real-world ground-truth frames at a resolution of 832 x 480, and is split into training, validation, and test sets with a ratio of 1:1:1. The goal of this challenge is to advance robust and realistic video restoration under real-world weather conditions, with evaluation protocols that jointly consider fidelity and perceptual quality. The challenge attracted 37 participants and received 5 valid final submissions with corresponding fact sheets, contributing to progress in weather removal for videos. The project is publicly available at https://www.codabench.org/competitions/13462/.
[156] HiddenObjects: Scalable Diffusion-Distilled Spatial Priors for Object Placement cs.CVPDF
Marco Schouten, Ioannis Siglidis, Serge Belongie, Dim P. Papadopoulos
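TL;DR: This paper proposes a method that learns explicit, class-conditioned spatial priors for object placement in natural scenes by distilling the placement knowledge implicit in text-conditioned diffusion models. It constructs HiddenObjects, a large-scale dataset with 27 million placement annotations, and outperforms sparse human annotations and existing baselines on a downstream image editing task.
Details
Motivation: Existing methods rely either on manually annotated data of limited scale or on inpainting-based object-removal pipelines, whose artifacts encourage shortcut learning. A fully automated, scalable framework is therefore needed to learn high-quality spatial priors.
Result: Experiments show that the method outperforms sparse human annotations on an image editing task (VLM-Judge score 3.90 vs. 2.68) and significantly surpasses existing placement baselines and zero-shot vision-language models; the distilled lightweight model runs inference 230,000x faster.
Insight: The innovations include using a diffusion model to automatically evaluate dense object placements and so build a large-scale dataset, and distilling implicit knowledge into explicit spatial priors for efficient inference. Viewed objectively, both the data generation pipeline and the distillation strategy are worth borrowing.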
Abstract: We propose a method to learn explicit, class-conditioned spatial priors for object placement in natural scenes by distilling the implicit placement knowledge encoded in text-conditioned diffusion models. Prior work relies either on manually annotated data, which is inherently limited in scale, or on inpainting-based object-removal pipelines, whose artifacts promote shortcut learning. To address these limitations, we introduce a fully automated and scalable framework that evaluates dense object placements on high-quality real backgrounds using a diffusion-based inpainting pipeline. With this pipeline, we construct HiddenObjects, a large-scale dataset comprising 27M placement annotations, evaluated across 27k distinct scenes, with ranked bounding box insertions for different images and object categories. Experimental results show that our spatial priors outperform sparse human annotations on a downstream image editing task (3.90 vs. 2.68 VLM-Judge), and significantly surpass existing placement baselines and zero-shot Vision-Language Models for object placement. Furthermore, we distill these priors into a lightweight model for fast practical inference (230,000x faster).
[157] Retrieving to Recover: Towards Incomplete Audio-Visual Question Answering via Semantic-consistent Purification cs.CVPDF
Jiayu Zhang, Shuo Ye, Qilang Ye, Zihan Song, Jiajian Huang
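TL;DR: This paper proposes R²ScP, a new framework for audio-visual question answering with incomplete modalities. It shifts missing-modality handling from traditional generative imputation to retrieval-based recovery: cross-modal retrieval acquires the missing domain-specific knowledge, and a context-aware adaptive purification mechanism removes latent semantic noise from the retrieved data.
Details
Motivation: Existing audio-visual question answering methods lack effective mechanisms for missing modalities, and their performance degrades severely in real-world scenarios with data interruptions. Prevailing approaches rely on generative imputation to synthesize the missing features, but these tend to capture inter-modal commonalities and struggle to acquire the unique, modality-specific knowledge in the missing data, leading to hallucinations and reduced reasoning accuracy.
Result: Extensive experiments show that R²ScP significantly improves audio-visual question answering performance and enhances robustness in modality-incomplete scenarios.
Insight: The innovation is the shift from generative imputation to retrieval-based recovery for missing modalities, using unified semantic embeddings for cross-modal retrieval to acquire specific knowledge, and an adaptive purification mechanism to remove noise. Viewed objectively, retrieving external knowledge rather than generating it internally may better preserve modality specificity, reduce hallucination, and improve reasoning accuracy.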
Abstract: Recent Audio-Visual Question Answering (AVQA) methods have advanced significantly. However, most AVQA methods lack effective mechanisms for handling missing modalities, suffering from severe performance degradation in real-world scenarios with data interruptions. Furthermore, prevailing methods for handling missing modalities predominantly rely on generative imputation to synthesize missing features. While partially effective, these methods tend to capture inter-modal commonalities but struggle to acquire unique, modality-specific knowledge within the missing data, leading to hallucinations and compromised reasoning accuracy. To tackle these challenges, we propose R$^{2}$ScP, a novel framework that shifts the paradigm of missing modality handling from traditional generative imputation to retrieval-based recovery. Specifically, we leverage cross-modal retrieval via unified semantic embeddings to acquire missing domain-specific knowledge. To maximize semantic restoration, we introduce a context-aware adaptive purification mechanism that eliminates latent semantic noise within the retrieved data. Additionally, we employ a two-stage training strategy to explicitly model the semantic relationships between knowledge from different sources. Extensive experiments demonstrate that R$^{2}$ScP significantly improves AVQA and enhances robustness in modal-incomplete scenarios.
[158] Architecture-Agnostic Modality-Isolated Gated Fusion for Robust Multi-Modal Prostate MRI Segmentation cs.CV | cs.AIPDF
Yongbo Shu, Wenzhao Xie, Shanhu Yao, Zirui Xin, Luo Lei
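TL;DR: This paper proposes an architecture-agnostic Modality-Isolated Gated Fusion (MIGF) module to make multi-modal prostate MRI segmentation more robust to missing or corrupted modalities. It keeps separate modality-specific encoding streams and combines them with modality dropout training so the model can compensate under incomplete inputs. Experiments on the PI-CAI dataset show that MIGF significantly improves several backbones across a range of missing-modality and artifact scenarios.
Details
Motivation: In clinical practice, multi-parametric prostate MRI often has missing or degraded modalities because of motion artifacts, abbreviated protocols, and similar causes. Existing multi-modal fusion methods typically assume complete inputs and fuse modality information at early layers, which limits robustness to corrupted or missing modalities.
Result: Six backbones and seven missing-modality/artifact scenarios were evaluated on the PI-CAI dataset (1,500 studies, fold-0 split, five random seeds). MIGF improved the ideal-scenario Ranking Score of UNet, nnUNet, and Mamba by 2.8%, 4.6%, and 13.4%, respectively. The best model, MIGFNet-nnUNet (gating plus modality dropout, no deep supervision), achieved 0.7304 +/- 0.056.
Insight: The innovation is an architecture-agnostic MIGF module that achieves robustness through strict modality isolation and modality dropout training rather than adaptive per-sample quality routing. The study finds that the gate converges to a stable modality prior and that deep supervision helps only the largest backbone. This supports a simpler design principle for robust multi-modal segmentation: structurally contain corrupted inputs first, then train explicitly for incomplete-input compensation.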
Abstract: Multi-parametric prostate MRI – combining T2-weighted, apparent diffusion coefficient, and high b-value diffusion-weighted sequences – is central to non-invasive detection of clinically significant prostate cancer, yet in routine practice individual sequences may be missing or degraded by motion, artifacts, or abbreviated protocols. Existing multi-modal fusion strategies typically assume complete inputs and entangle modality-specific information at early layers, offering limited resilience when one channel is corrupted or absent. We propose Modality-Isolated Gated Fusion (MIGF), an architecture-agnostic module that maintains separate modality-specific encoding streams before a learned gating stage, combined with modality dropout training to enforce compensation behavior under incomplete inputs. We benchmark six bare backbones and assess MIGF-equipped models under seven missing-modality and artifact scenarios on the PI-CAI dataset (1,500 studies, fold-0 split, five random seeds). Among bare backbones, nnUNet provided the strongest balance of performance and stability. MIGF improved ideal-scenario Ranking Score for UNet, nnUNet, and Mamba by 2.8%, 4.6%, and 13.4%, respectively; the best model, MIGFNet-nnUNet (gating + ModDrop, no deep supervision), achieved 0.7304 +/- 0.056. Mechanistic analysis reveals that robustness gains arise from strict modality isolation and dropout-driven compensation rather than adaptive per-sample quality routing: the gate converged to a stable modality prior, and deep supervision was beneficial only for the largest backbone while degrading lighter models. These findings support a simpler design principle for robust multi-modal segmentation: structurally contain corrupted inputs first, then train explicitly for incomplete-input compensation.
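The two ingredients named in the abstract, gated fusion of isolated modality streams and modality dropout, can be sketched on plain feature vectors. This is a simplified illustration under stated assumptions, not the MIGF module itself: the gate is modeled as one learned logit per modality (in line with the reported finding that the gate converges to a stable modality prior), absent modalities are handled by masking them out of the softmax, and the names `modality_dropout` and `gated_fusion` are hypothetical.

```python
import math
import random
from typing import List


def modality_dropout(n_modalities: int, p_drop: float, rng: random.Random) -> List[bool]:
    """Randomly drop modalities during training, always keeping at least one."""
    keep = [rng.random() >= p_drop for _ in range(n_modalities)]
    if not any(keep):
        keep[rng.randrange(n_modalities)] = True
    return keep


def gated_fusion(
    features: List[List[float]], gate_logits: List[float], present: List[bool]
) -> List[float]:
    """Fuse per-modality feature vectors with a softmax gate over present modalities.

    features[m] is the feature vector produced by the encoder of modality m
    (each modality is encoded separately before this step).
    """
    if not any(present):
        raise ValueError("at least one modality must be present")
    top = max(g for g, ok in zip(gate_logits, present) if ok)
    # Masked, numerically stable softmax: absent modalities get zero weight.
    weights = [math.exp(g - top) if ok else 0.0 for g, ok in zip(gate_logits, present)]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
```

During training, `modality_dropout` would supply the `present` mask so the fused representation learns to compensate for missing inputs; at test time the mask reflects which sequences (for example T2-weighted, ADC, high b-value DWI) are actually available.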
[159] Turning Generators into Retrievers: Unlocking MLLMs for Natural Language-Guided Geo-Localization cs.CV | cs.AIPDF
Yuqi Chen, Xiaohan Zhang, Ahmad Arrabi, Waqas Sultani, Chen Chen
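TL;DR: This paper presents a simple yet effective framework that adapts multimodal large language models (MLLMs) to natural-language guided cross-view geo-localization via parameter-efficient fine-tuning. It optimizes latent representations while preserving pretrained multimodal knowledge, achieving strong cross-modal alignment without redesigning the model architecture.
Details
Motivation: Existing natural-language guided cross-view geo-localization methods, usually built on CLIP-style dual-encoder architectures, suffer from weak cross-modal generalization and complex architectural designs. The paper aims to address these problems while exploiting the strong semantic reasoning of MLLMs by optimizing them for retrieval.
Result: The method achieves SOTA on GeoText-1652, with a 12.2% improvement in Text-to-Image Recall@1, and obtains the best performance on 5 of 12 subtasks of CVG-Text, using far fewer trainable parameters than the baselines.
Insight: The innovation is turning generative MLLMs into strong retrieval models through parameter-efficient fine-tuning, unlocking their potential for semantic cross-view retrieval and offering NGCG a scalable, powerful alternative to traditional dual-encoder designs. Optimizing alignment while preserving pretrained knowledge is an idea with general applicability.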
Abstract: Natural-language Guided Cross-view Geo-localization (NGCG) aims to retrieve geo-tagged satellite imagery using textual descriptions of ground scenes. While recent NGCG methods commonly rely on CLIP-style dual-encoder architectures, they often suffer from weak cross-modal generalization and require complex architectural designs. In contrast, Multimodal Large Language Models (MLLMs) offer powerful semantic reasoning capabilities but are not directly optimized for retrieval tasks. In this work, we present a simple yet effective framework to adapt MLLMs for NGCG via parameter-efficient finetuning. Our approach optimizes latent representations within the MLLM while preserving its pretrained multimodal knowledge, enabling strong cross-modal alignment without redesigning model architectures. Through systematic analysis of diverse variables, from model backbone to feature aggregation, we provide practical and generalizable insights for leveraging MLLMs in NGCG. Our method achieves SOTA on GeoText-1652 with a 12.2% improvement in Text-to-Image Recall@1 and secures top performance in 5 out of 12 subtasks on CVG-Text, all while surpassing baselines with far fewer trainable parameters. These results position MLLMs as a robust foundation for semantic cross-view retrieval and pave the way for MLLM-based NGCG to be adopted as a scalable, powerful alternative to traditional dual-encoder designs. Project page and code are available at https://yuqichen888.github.io/NGCG-MLLMs-web/.
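For readers unfamiliar with the retrieval metric quoted above, the following is a generic sketch of Recall@K for text-to-image retrieval. It assumes a precomputed query-by-gallery similarity matrix and exactly one correct gallery item per query; the function name `recall_at_k` is illustrative, and the benchmark's official evaluation script may differ in details such as tie handling or multiple ground-truth matches.

```python
from typing import List


def recall_at_k(similarity: List[List[float]], targets: List[int], k: int = 1) -> float:
    """Fraction of text queries whose correct image is among the top-k results.

    similarity[q][g] is the score between text query q and gallery image g;
    targets[q] is the index of the correct gallery image for query q.
    """
    hits = 0
    for row, target in zip(similarity, targets):
        ranked = sorted(range(len(row)), key=lambda g: row[g], reverse=True)
        hits += target in ranked[:k]
    return hits / len(targets)
```

Recall@1 is the strictest setting: a query counts as a hit only if the correct satellite image is ranked first.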
[160] MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark cs.CVPDF
Junzhi Ning, Jiashi Lin, Yingying Fang, Wei Li, Jiyao Liu
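TL;DR: This paper presents MMRareBench, described as the first multimodal, multi-image medical benchmark for rare diseases, for evaluating multimodal large language models (MLLMs) on rare-disease clinical tasks. It contains 1,756 question-answer pairs and 7,958 medical images across four workflow-aligned tracks: diagnosis, treatment planning, cross-image evidence alignment, and examination suggestion. A systematic evaluation of 23 MLLMs finds universally poor treatment-planning performance and shows medical-domain models trailing general-purpose MLLMs on multi-image tracks, suggesting that medical fine-tuning may weaken multi-image compositional capability.
Details
Motivation: Existing benchmarks mainly evaluate single-image settings for common conditions and do not systematically assess the integration of multimodal, multi-image evidence under rare-disease data scarcity. Yet in rare-disease scenarios clinicians often lack prior knowledge and must rely on case-level evidence.
Result: An evaluation of 23 MLLMs on MMRareBench shows fragmented capability profiles and universally low treatment-planning performance; medical-domain models, despite competitive diagnostic scores, trail general-purpose MLLMs substantially on multi-image tracks.
Insight: The innovation is the first rare-disease multimodal, multi-image medical benchmark, along with the proposed capacity dilution effect: medical fine-tuning can narrow the diagnostic gap but may erode the compositional multi-image capability that rare-disease evidence integration requires. The benchmark ensures rigorous evaluation through Orphanet-anchored ontology alignment, track-specific leakage control, and evidence-grounded annotations.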
Abstract: Multimodal large language models (MLLMs) have advanced clinical tasks for common conditions, but their performance on rare diseases remains largely untested. In rare-disease scenarios, clinicians often lack prior clinical knowledge, forcing them to rely strictly on case-level evidence for clinical judgments. Existing benchmarks predominantly evaluate common-condition, single-image settings, leaving multimodal and multi-image evidence integration under rare-disease data scarcity systematically unevaluated. We introduce MMRareBench, to our knowledge the first rare-disease benchmark jointly evaluating multimodal and multi-image clinical capability across four workflow-aligned tracks: diagnosis, treatment planning, cross-image evidence alignment, and examination suggestion. The benchmark comprises 1,756 question-answer pairs with 7,958 associated medical images curated from PMC case reports, with Orphanet-anchored ontology alignment, track-specific leakage control, evidence-grounded annotations, and a two-level evaluation protocol. A systematic evaluation of 23 MLLMs reveals fragmented capability profiles and universally low treatment-planning performance, with medical-domain models trailing general-purpose MLLMs substantially on multi-image tracks despite competitive diagnostic scores. These patterns are consistent with a capacity dilution effect: medical fine-tuning can narrow the diagnostic gap but may erode the compositional multi-image capability that rare-disease evidence integration demands.
[161] HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models cs.CVPDF
Haiyan Jiang, Deyu Zhang, Dongdong Weng, Weitao Song, Henry Been-Lirn Duh
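TL;DR: HOG-Layout is a system that uses large language models and vision-language models for text-driven hierarchical 3D scene generation, optimization, and real-time editing. It improves the semantic consistency of scenes through retrieval-augmented generation, adds an optimization module for physical consistency, and adopts a hierarchical representation for efficient inference and real-time editing.
Details
Motivation: To address the tedium of manual 3D layout creation, the lack of diversity in data-driven methods, and the shortcomings of existing methods in semantic and physical consistency.
Result: Experiments show that, compared with existing baselines, HOG-Layout generates more reasonable 3D environments and supports fast, intuitive scene editing.
Insight: The innovation is combining retrieval-augmented generation, a physical optimization module, and a hierarchical scene representation, using large models to raise the semantic and physical quality of generated 3D scenes and to enable real-time interactive editing.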
Abstract: 3D layout generation and editing play a crucial role in Embodied AI and immersive VR interaction. However, manual creation requires tedious labor, while data-driven generation often lacks diversity. The emergence of large models introduces new possibilities for 3D scene synthesis. We present HOG-Layout, which enables text-driven hierarchical scene generation, optimization, and real-time scene editing with large language models (LLMs) and vision-language models (VLMs). HOG-Layout improves scene semantic consistency and plausibility through retrieval-augmented generation (RAG) technology, incorporates an optimization module to enhance physical consistency, and adopts a hierarchical representation to enhance inference and optimization, achieving real-time editing. Experimental results demonstrate that HOG-Layout produces more reasonable environments compared with existing baselines, while supporting fast and intuitive scene editing.
[162] Uncertainty-quantified Pulse Signal Recovery from Facial Video using Regularized Stochastic Interpolants cs.CVPDF
Vineet R. Shenoy, Cheng Peng, Rama Chellappa, Yu Sun
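TL;DR: This paper proposes RIS-iPPG, a new method for recovering pulse signals from facial video with uncertainty quantification. It models imaging photoplethysmography (iPPG) as an inverse problem and builds probability paths that evolve the camera pixel distribution into the ground-truth signal distribution by predicting the instantaneous flow and score vectors of a time-dependent stochastic process. At test time it samples the posterior distribution of the correct BVP waveform given the pixel intensity measurements by solving a stochastic differential equation, and adds a regularizer that exploits the slowly varying nature of physiological changes to improve recovery.
Details
Motivation: Current iPPG algorithms perform well on benchmark datasets but cannot sample the solution space at test time, which rules out the uncertainty analysis that clinical applications require. This paper aims to close that gap.
Result: Experiments on three datasets show that RIS-iPPG outperforms existing methods in both reconstruction quality and reconstruction uncertainty estimation.
Insight: The main innovation is modeling iPPG recovery as an inverse problem and introducing a new paradigm, RIS-iPPG, that combines stochastic interpolants with regularization to sample the posterior distribution of the recovered signal and quantify its uncertainty. Its regularization strategy (maximizing the correlation between the residual flow vector predictions of adjacent time windows) makes good use of the slowly varying nature of physiological signals and is a useful example of building domain knowledge into a model.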
Abstract: Imaging Photoplethysmography (iPPG), an optical procedure which recovers a human’s blood volume pulse (BVP) waveform using pixel readout from a camera, is an exciting research field with many researchers performing clinical studies of iPPG algorithms. While current algorithms to solve the iPPG task have shown outstanding performance on benchmark datasets, no state-of-the-art algorithm, to the best of our knowledge, performs test-time sampling of the solution space, precluding an uncertainty analysis that is critical for clinical applications. We address this deficiency through a new paradigm named Regularized Interpolants with Stochasticity for iPPG (RIS-iPPG). Modeling iPPG recovery as an inverse problem, we build probability paths that evolve the camera pixel distribution to the ground-truth signal distribution by predicting the instantaneous flow and score vectors of a time-dependent stochastic process; and at test time, we sample the posterior distribution of the correct BVP waveform given the camera pixel intensity measurements by solving a stochastic differential equation. Given that physiological changes are slowly varying, we show that iPPG recovery can be improved through regularization that maximizes the correlation between the residual flow vector predictions of two adjacent time windows. Experimental results on three datasets show that RIS-iPPG provides superior reconstruction quality and uncertainty estimates of the reconstruction, a critical tool for the widespread adoption of iPPG algorithms in clinical and consumer settings.
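The correlation-based regularizer described in the abstract above can be illustrated with a minimal sketch. It assumes the residual flow predictions for two adjacent time windows are available as equal-length lists of floats; the function names, the `1 - correlation` form of the penalty, and the handling of constant signals are illustrative assumptions, not the authors' implementation.

```python
import math


def pearson_correlation(a, b):
    """Pearson correlation of two equal-length sequences of floats."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    if denom == 0.0:
        # Assumption: treat a constant signal as uncorrelated.
        return 0.0
    return cov / denom


def adjacent_window_regularizer(flow_prev, flow_next):
    """Hypothetical penalty: small when the two predictions are highly correlated.

    Minimizing 1 - correlation is one way to maximize the correlation.
    """
    return 1.0 - pearson_correlation(flow_prev, flow_next)
```

Under these assumptions the penalty ranges from 0 (perfectly correlated windows) to 2 (perfectly anti-correlated windows), so adding it to a training loss pushes adjacent predictions toward agreement.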
[163] ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment cs.CV
Mingyu Dong, Chong Xia, Mingyuan Jia, Weichen Lyu, Long Xu
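TL;DR: This paper proposes ReplicateAnyScene, a framework that converts casually captured videos into compositional 3D scenes in a zero-shot manner. Through a five-stage cascade, it extracts generic priors across textual, visual, and spatial dimensions from vision foundation models, aligns them structurally, and grounds them in structured 3D representations, ensuring the semantic coherence and physical plausibility of the constructed scenes.
Details
Motivation: Existing compositional 3D reconstruction methods integrate cross-modal information insufficiently, so they depend on manual object prompts or auxiliary visual inputs, or are restricted by training biases to overly simple scenes, which makes practical deployment difficult. This paper aims to remove these limitations and achieve fully automated, zero-shot video-to-3D scene conversion.
Result: Extensive experiments show that the method outperforms existing baselines in generating high-quality compositional 3D scenes. The authors also introduce the C3DR benchmark, which assesses reconstruction quality from multiple aspects to support more comprehensive evaluation of the task.
Insight: The novelty is a five-stage cascade framework that structurally aligns and fuses generic priors across textual, visual, and spatial dimensions, enabling zero-shot, automated reconstruction of compositional 3D scenes from video without manual prompts or extra visual inputs. The method underscores the importance of deep cross-modal integration for scene semantic coherence and physical plausibility.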
Abstract: Humans exhibit an innate capacity to rapidly perceive and segment objects from video observations, and even mentally assemble them into structured 3D scenes. Replicating such capability, termed compositional 3D reconstruction, is pivotal for the advancement of Spatial Intelligence and Embodied AI. However, existing methods struggle to achieve practical deployment due to the insufficient integration of cross-modal information, leaving them dependent on manual object prompting, reliant on auxiliary visual inputs, and restricted to overly simplistic scenes by training biases. To address these limitations, we propose ReplicateAnyScene, a framework capable of fully automated and zero-shot transformation of casually captured videos into compositional 3D scenes. Specifically, our pipeline incorporates a five-stage cascade to extract and structurally align generic priors from vision foundation models across textual, visual, and spatial dimensions, grounding them into structured 3D representations and ensuring semantic coherence and physical plausibility of the constructed scenes. To facilitate a more comprehensive evaluation of this task, we further introduce the C3DR benchmark to assess reconstruction quality from diverse aspects. Extensive experiments demonstrate the superiority of our method over existing baselines in generating high-quality compositional 3D scenes.
[164] Uncertainty-Guided Attention and Entropy-Weighted Loss for Precise Plant Seedling Segmentation cs.CV | cs.LG
Mohamed Ehab, Ali Hamdi
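TL;DR: This paper proposes UGDA-Net, a model for precise plant seedling segmentation that addresses complex backgrounds and fine leaf structures through three new components: an uncertainty-guided dual attention mechanism, an entropy-weighted hybrid loss function, and deep supervision.
Details
Motivation: Standard segmentation models lose accuracy on complex backgrounds and fine leaf structures; the goal is to improve this in support of automated phenotyping in precision agriculture.
Result: On a dataset of 432 high-resolution plant seedling images, UGDA-Net significantly improves segmentation over the baseline models, raising the Dice coefficient by 9.3% (U-Net) and 13.2% (LinkNet). Qualitative analysis shows fewer false positives at leaf boundaries, and the uncertainty heatmaps are consistent with complex morphology.
Insight: The novelty is combining uncertainty estimation (implemented via channel variance) with the attention mechanism and the loss design, showing that uncertainty-guided attention and uncertainty-weighted loss are two complementary systems and providing a high-definition solution for fine-structure segmentation.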
Abstract: Plant seedling segmentation supports automated phenotyping in precision agriculture. Standard segmentation models face difficulties due to intricate background images and fine structures in leaves. We introduce UGDA-Net (Uncertainty-Guided Dual Attention Network with Entropy-Weighted Loss and Deep Supervision). Three novel components make up UGDA-Net. The first component is Uncertainty-Guided Dual Attention (UGDA). UGDA uses channel variance to modulate feature maps. The second component is an entropy-weighted hybrid loss function. This loss function focuses on high-uncertainty boundary pixels. The third component employs deep supervision for intermediate encoder layers. We performed a comprehensive systematic ablation study. This study focuses on two widely-used architectures, U-Net and LinkNet. It analyzes five incremental configurations: Baseline, Loss-only, Attention-only, Deep Supervision, and UGDA-Net. We trained UGDA-Net using a high-resolution plant seedling image dataset containing 432 images. We demonstrate improved segmentation performance and accuracy, with an increase in Dice coefficient of 9.3% above baseline. LinkNet’s variance is 13.2% above baseline. Qualitative overlays show reduced false positives at the leaf boundary. Uncertainty heatmaps are consistent with the complex morphology. UGDA-Net aids in the segmentation of delicate structures in plants and provides a high-def solution. The results showed that uncertainty-guided attention and uncertainty-weighted loss are two complementary systems.
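The idea of a loss that focuses on high-uncertainty pixels can be sketched as follows. This is a minimal illustration that weights per-pixel binary cross-entropy by the entropy of the predicted foreground probability; the weighting formula `1 + lam * H(p) / ln 2`, the parameter `lam`, and the function names are assumptions for illustration, and the Dice term that a hybrid loss would typically include is omitted.

```python
import math

EPS = 1e-7  # clamp probabilities away from 0 and 1 to keep log() finite


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli distribution with parameter p."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def entropy_weighted_bce(probs, targets, lam=1.0):
    """Hypothetical entropy-weighted binary cross-entropy over a flat pixel list.

    Each pixel's loss is scaled by 1 + lam * H(p) / ln(2), so uncertain
    pixels (p near 0.5, typically at boundaries) count up to (1 + lam) times
    as much as confident ones.
    """
    total = 0.0
    for p, t in zip(probs, targets):
        p_c = min(max(p, EPS), 1.0 - EPS)
        bce = -(t * math.log(p_c) + (1 - t) * math.log(1.0 - p_c))
        weight = 1.0 + lam * binary_entropy(p) / math.log(2.0)
        total += weight * bce
    return total / len(probs)
```

With `lam=0` this reduces to plain binary cross-entropy, which makes it easy to compare the weighted and unweighted variants in an ablation.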
[165] HO-Flow: Generalizable Hand-Object Interaction Generation with Latent Flow Matching cs.CV | cs.RO
Zerui Chen, Rolandos Alexandros Potamias, Shizhe Chen, Jiankang Deng, Cordelia Schmid
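TL;DR: This paper proposes HO-Flow, a framework for generating realistic hand-object interaction motion sequences from text and canonical 3D objects. It first uses an interaction-aware variational autoencoder to encode hand and object motion sequences into a unified latent manifold, then uses a masked flow matching model that combines autoregressive temporal reasoning with continuous latent generation to improve temporal coherence. Experiments show state-of-the-art performance on multiple benchmarks.
Details
Motivation: Existing methods are limited in learning expressive motion representations for generating 3D hand-object interactions that are temporally coherent and physically plausible with high fidelity; HO-Flow aims to address this.
Result: On the GRAB, OakInk, and DexYCB benchmarks, HO-Flow achieves state-of-the-art performance in both physical plausibility and motion diversity for interaction motion synthesis.
Insight: The novelties are: an interaction-aware variational autoencoder that jointly encodes hand and object motion to capture interaction dynamics; masked flow matching combined with autoregressive reasoning to strengthen temporal coherence; and predicting object motion relative to the initial frame, which improves generalization from pre-training on large-scale synthetic data.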
Abstract: Generating realistic 3D hand-object interactions (HOI) is a fundamental challenge in computer vision and robotics, requiring both temporal coherence and high-fidelity physical plausibility. Existing methods remain limited in their ability to learn expressive motion representations for generation and perform temporal reasoning. In this paper, we present HO-Flow, a framework for synthesizing realistic hand-object motion sequences from texts and canonical 3D objects. HO-Flow first employs an interaction-aware variational autoencoder to encode sequences of hand and object motions into a unified latent manifold by incorporating hand and object kinematics, enabling the representation to capture rich interaction dynamics. It then leverages a masked flow matching model that combines auto-regressive temporal reasoning with continuous latent generation, improving temporal coherence. To further enhance generalization, HO-Flow predicts object motions relative to the initial frame, enabling effective pre-training on large-scale synthetic data. Experiments on the GRAB, OakInk, and DexYCB benchmarks demonstrate that HO-Flow achieves state-of-the-art performance in both physical plausibility and motion diversity for interaction motion synthesis.
[166] Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation cs.CV
Zeqian Long, Ozgur Kara, Haotian Xue, Yongxin Chen, James M. Rehg
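TL;DR: This paper proposes Immune2V, a framework for defending against deepfake attacks based on dual-stream image-to-video generation models. By analyzing why existing image-level adversarial attacks fail on I2V models, it designs methods that prevent the noise from being diluted along the temporal dimension and counteract the text-guidance override, achieving persistent degradation of the generated video quality.
Details
Motivation: Image-to-video generation can be misused to create unauthorized deepfake videos, while existing defenses mainly target static image manipulation and offer no effective protection against I2V generation, so a dedicated immunization mechanism is needed.
Result: Under the same imperceptibility budget, Immune2V produces stronger and more persistent video quality degradation than adapted image-level baselines; experiments confirm its effectiveness.
Insight: The novelty is a systematic analysis of why I2V models are robust to image-level adversarial noise, together with two targeted strategies: temporally balanced latent divergence and alignment with a precomputed collapse-inducing trajectory. This offers a new direction for defending against dynamic generative models.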
Abstract: Image-to-video (I2V) generation has the potential for societal harm because it enables the unauthorized animation of static images to create realistic deepfakes. While existing defenses effectively protect against static image manipulation, extending these to I2V generation remains underexplored and non-trivial. In this paper, we systematically analyze why modern I2V models are highly robust against naive image-level adversarial attacks (i.e., immunization). We observe that the video encoding process rapidly dilutes the adversarial noise across future frames, and the continuous text-conditioned guidance actively overrides the intended disruptive effect of the immunization. Building on these findings, we propose the Immune2V framework which enforces temporally balanced latent divergence at the encoder level to prevent signal dilution, and aligns intermediate generative representations with a precomputed collapse-inducing trajectory to counteract the text-guidance override. Extensive experiments demonstrate that Immune2V produces substantially stronger and more persistent degradation than adapted image-level baselines under the same imperceptibility budget.
[167] STGV: Spatio-Temporal Hash Encoding for Gaussian-based Video Representation cs.CV
Jierun Lin, Jiacong Chen, Qingyu Mao, Shuai Liu, Xiandong Meng
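TL;DR: This paper proposes STGV, a new framework for Gaussian-based video representation. By decomposing video features into learnable 2D spatial and 3D spatio-temporal hash encodings, it separates static and dynamic components effectively and, combined with a key-frame initialization strategy, improves video representation quality and downstream task performance.
Details
Motivation: Existing video representation methods based on 2D Gaussian Splatting predict the deformations of canonical Gaussian primitives from embeddings that are content-agnostic or have overlapping spatio-temporal features. This entangles static and dynamic components and fails to model their distinct properties, leading to inaccurate spatio-temporal deformation predictions and poor representation quality.
Result: Experiments show that the method achieves better video representation quality than other Gaussian-based methods (+0.98 PSNR) and competitive performance on downstream video tasks.
Insight: The main novelty is decomposing video features into separate learnable spatial and spatio-temporal hash encodings to disentangle static and dynamic components, combined with a key-frame canonical initialization strategy that builds a more stable and consistent initial Gaussian representation and avoids feature overlap and structurally incoherent geometry.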
Abstract: 2D Gaussian Splatting (2DGS) has recently become a promising paradigm for high-quality video representation. However, existing methods employ content-agnostic or spatio-temporal feature overlapping embeddings to predict canonical Gaussian primitive deformations, which entangles static and dynamic components in videos and prevents modeling their distinct properties effectively. This results in inaccurate predictions for spatio-temporal deformations and unsatisfactory representation quality. To address these problems, this paper proposes a Spatio-Temporal hash encoding framework for Gaussian-based Video representation (STGV). By decomposing video features into learnable 2D spatial and 3D temporal hash encodings, STGV effectively facilitates the learning of motion patterns for dynamic components while maintaining background details for static elements. In addition, we construct a more stable and consistent initial canonical Gaussian representation through a key frame canonical initialization strategy, preventing feature overlap and a structurally incoherent geometry representation. Experimental results demonstrate that our method attains better video representation quality (+0.98 PSNR) against other Gaussian-based methods and achieves competitive performance in downstream video tasks.
[168] TAMISeg: Text-Aligned Multi-scale Medical Image Segmentation with Semantic Encoder Distillation cs.CV
Qiang Gao, Yi Wang, Yong Zhang, Yong Li, Yongbing Deng
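TL;DR: TAMISeg is a text-guided medical image segmentation framework that integrates clinical language prompts and semantic distillation as auxiliary semantic cues to enhance visual understanding and reduce reliance on fine-grained pixel-level annotations. It has three core components: a consistency-aware encoder, a semantic encoder distillation module, and a scale-adaptive decoder. Experiments on multiple datasets show that TAMISeg outperforms existing methods in both qualitative and quantitative evaluations.
Details
Motivation: Medical image segmentation is hampered by limited fine-grained annotations, complex anatomical structures, and degradation from image noise, low contrast, or illumination variation; the work aims to improve segmentation performance using text guidance and semantic distillation.
Result: Experiments on the Kvasir-SEG, MosMedData+, and QaTa-COV19 datasets show that TAMISeg consistently outperforms existing uni-modal and multi-modal methods in both qualitative and quantitative evaluations, reaching state-of-the-art results.
Insight: The novelties include: a text-aligned multi-scale segmentation framework that incorporates clinical language prompts; semantic encoder distillation from a frozen DINOv3 teacher to enhance semantic discriminability; and a consistency-aware encoder and scale-adaptive decoder that improve robustness and cross-scale segmentation. Viewed objectively, the method integrates multimodal information effectively, reduces annotation dependence, and has practical potential.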
Abstract: Medical image segmentation remains challenging due to limited fine-grained annotations, complex anatomical structures, and image degradation from noise, low contrast, or illumination variation. We propose TAMISeg, a text-guided segmentation framework that incorporates clinical language prompts and semantic distillation as auxiliary semantic cues to enhance visual understanding and reduce reliance on pixel-level fine-grained annotations. TAMISeg integrates three core components: a consistency-aware encoder pretrained with strong perturbations for robust feature extraction, a semantic encoder distillation module with supervision from a frozen DINOv3 teacher to enhance semantic discriminability, and a scale-adaptive decoder that segments anatomical structures across different spatial scales. Experiments on the Kvasir-SEG, MosMedData+, and QaTa-COV19 datasets demonstrate that TAMISeg consistently outperforms existing uni-modal and multi-modal methods in both qualitative and quantitative evaluations. Code will be made publicly available at https://github.com/qczggaoqiang/TAMISeg.
[169] ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding cs.CV | cs.AI
Xucheng Wang, Xiaoman Zhang, Sung Eun Kim, Ankit Pal, Pranav Rajpurkar
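TL;DR: This paper presents ReXSonoVQA, a video question-answering benchmark for ultrasound procedures with 514 video clips and 514 questions, designed to evaluate how well models understand dynamic procedural operations. Zero-shot evaluation of several advanced vision-language models (VLMs) shows that they can extract some procedural information but perform poorly on troubleshooting questions, revealing limitations in causal reasoning.
Details
Motivation: Existing benchmarks mainly evaluate static image understanding, whereas ultrasound examination requires dynamic probe manipulation and real-time adjustment, so there is no benchmark for evaluating models' understanding of dynamic procedures. This paper aims to fill that gap and advance perception systems for ultrasound training, guidance, and robotic automation.
Result: Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro were evaluated zero-shot on ReXSonoVQA. The VLMs can extract some procedural information, but troubleshooting questions remain challenging, with limited gains over text-only baselines.
Insight: The novelty is a dynamic video QA benchmark focused on a medical procedure (ultrasound) that defines three core competencies (Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning) and exposes the weaknesses of current VLMs on complex causal reasoning tasks, pointing the way toward autonomous ultrasound systems for practical use.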
Abstract: Ultrasound acquisition requires skilled probe manipulation and real-time adjustments. Vision-language models (VLMs) could enable autonomous ultrasound systems, but existing benchmarks evaluate only static images, not dynamic procedural understanding. We introduce ReXSonoVQA, a video QA benchmark with 514 video clips and 514 questions (249 MCQ, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. Zero-shot evaluation of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro shows VLMs can extract some procedural information, but troubleshooting questions remain challenging with minimal gains over text-only baselines, exposing limitations in causal reasoning. ReXSonoVQA enables developing perception systems for ultrasound training, guidance, and robotic automation.
[170] AmodalSVG: Amodal Image Vectorization via Semantic Layer Peeling cs.CV
Juncheng Hu, Ziteng Xue, Guotao Liang, Anran Qi, Buyu Li
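TL;DR: AmodalSVG is a new framework for amodal image vectorization that produces semantically organized and geometrically complete SVG representations from natural images. It works in two stages: it first performs semantic decoupling and completion in the raster domain to produce amodally complete semantic layers, then vectorizes those layers independently.
Details
Motivation: Existing vectorization methods trace only visible pixels and ignore occlusion, so the resulting SVGs are semantically entangled and geometrically incomplete, which limits structural editability. AmodalSVG aims to reconstruct full object geometry, including occluded regions, as independently editable vector layers.
Result: Extensive experiments show that AmodalSVG significantly outperforms prior methods in visual fidelity, and that the resulting amodal layers support object-level editing directly in the vector domain, which existing vectorization methods cannot do.
Insight: The novelties include Semantic Layer Peeling (SLP), a VLM-guided strategy that progressively decomposes an image into semantically coherent layers and uses hybrid inpainting to recover complete object appearance under occlusion, and Adaptive Layered Vectorization (ALV), which dynamically adjusts the primitive budget through an error-budget-driven mechanism for efficient vectorization.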
Abstract: We introduce AmodalSVG, a new framework for amodal image vectorization that produces semantically organized and geometrically complete SVG representations from natural images. Existing vectorization methods operate under a modal paradigm: tracing only visible pixels and disregarding occlusion. Consequently, the resulting SVGs are semantically entangled and geometrically incomplete, limiting SVG’s structural editability. In contrast, AmodalSVG reconstructs full object geometries, including occluded regions, into independent, editable vector layers. To achieve this, AmodalSVG reformulates image vectorization as a two-stage framework, performing semantic decoupling and completion in the raster domain to produce amodally complete semantic layers, which are then independently vectorized. In the first stage, we introduce Semantic Layer Peeling (SLP), a VLM-guided strategy that progressively decomposes an image into semantically coherent layers. By hybrid inpainting, SLP recovers complete object appearances under occlusions, enabling explicit semantic decoupling. To vectorize these layers efficiently, we propose Adaptive Layered Vectorization (ALV), which dynamically modulates the primitive budget via an error-budget-driven adjustment mechanism. Extensive experiments demonstrate that AmodalSVG significantly outperforms prior methods in visual fidelity. Moreover, the resulting amodal layers enable object-level editing directly in the vector domain, capabilities not supported by existing vectorization approaches. Code will be released upon acceptance.
[171] Hierarchical Textual Knowledge for Enhanced Image Clustering cs.CV | cs.CL | cs.MM
Yijie Zhong, Yunfan Gao, Weipeng Jiang, Haofen Wang
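TL;DR: This paper proposes knowledge-enhanced clustering (KEC), which uses large language models to build hierarchical concept-attribute structured knowledge that guides image clustering, addressing the difficulty traditional methods have in distinguishing visually similar but semantically different classes when they rely on knowledge from visual space alone. The method condenses redundant textual labels into abstract concepts, automatically extracts discriminative attributes for each concept and for pairs of similar concepts, and produces knowledge-enhanced features that, together with the original visual features, are adapted to various downstream clustering algorithms.
Details
Motivation: Traditional image clustering methods rely mainly on knowledge from visual space and struggle to separate visually similar but semantically different classes; existing methods that use textual knowledge mostly depend on coarse class labels or simple nouns and overlook the rich concept- and attribute-level semantics in textual space.
Result: Evaluation on 20 diverse datasets shows consistent improvements over existing methods that use additional textual knowledge; training-free KEC outperforms zero-shot CLIP on 14 of the 20 datasets; and KEC provides both accuracy and robustness, whereas naive use of textual knowledge may harm clustering performance.
Insight: The novelty is using large language models to build hierarchical concept-attribute structured knowledge that enhances image clustering, refining textual knowledge from coarse labels down to the concept and attribute level and extracting discriminative attributes automatically through structured prompts, so that textual semantics are used more effectively to improve clustering performance and robustness.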
Abstract: Image clustering aims to group images in an unsupervised fashion. Traditional methods focus on knowledge from visual space, making it difficult to distinguish between visually similar but semantically different classes. Recent advances in vision-language models enable the use of textual knowledge to enhance image clustering. However, most existing methods rely on coarse class labels or simple nouns, overlooking the rich conceptual and attribute-level semantics embedded in textual space. In this paper, we propose a knowledge-enhanced clustering (KEC) method that constructs a hierarchical concept-attribute structured knowledge with the help of large language models (LLMs) to guide clustering. Specifically, we first condense redundant textual labels into abstract concepts and then automatically extract discriminative attributes for each single concept and similar concept pairs, via structured prompts to LLMs. This knowledge is instantiated for each input image to achieve the knowledge-enhanced features. The knowledge-enhanced features with original visual features are adapted to various downstream clustering algorithms. We evaluate KEC on 20 diverse datasets, showing consistent improvements across existing methods using additional textual knowledge. KEC without training outperforms zero-shot CLIP on 14 out of 20 datasets. Furthermore, the naive use of textual knowledge may harm clustering performance, while KEC provides both accuracy and robustness.
[172] Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models cs.CV | cs.AI
Songlin Yang, Xianghao Kong, Anyi Rao
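TL;DR: This paper introduces the concept of pseudo-unification, observing that unified multimodal models (UMMs) fail to combine the reasoning ability of large language models with the generation capability of vision models, and designs an information-theoretic probing framework to uncover the internal causes. Applied to ten representative UMMs, the framework shows that pseudo-unification stems from a dual divergence, modality-asymmetric encoding and pattern-split response, and that only models with consistent information flow achieve genuine multimodal synergy.
Details
Motivation: In practice, existing UMMs fail to achieve synergy between LLM reasoning and vision-model generation, a phenomenon termed pseudo-unification. Existing probing methods either lack model-internal insight or ignore prompt-response dependencies, so a new diagnostic method is needed.
Result: Applied to ten representative UMMs, the framework shows that pseudo-unification stems from a dual divergence: modality-asymmetric encoding (vision and language follow different entropy trajectories) and pattern-split response (text generation exhibits high-entropy creativity while image synthesis enforces low-entropy fidelity). Only models that unify both sides, for example via contextual prediction, achieve more genuine unification, enabling stronger reasoning-based text-to-image generation even with fewer parameters.
Insight: The novelty is an information-theoretic probing framework that jointly analyzes how UMMs encode inputs and generate outputs, the first to probe unification from inside the model. The core insight is that genuine multimodal synergy requires consistency in information flow, not just shared parameters, which suggests a new direction for model design.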
Abstract: Unified multimodal models (UMMs) were designed to combine the reasoning ability of large language models (LLMs) with the generation capability of vision models. In practice, however, this synergy remains elusive: UMMs fail to transfer LLM-like reasoning to image synthesis and exhibit divergent response behaviors. We term this phenomenon pseudo-unification. Diagnosing its internal causes is important, but existing probing methods either lack model-internal insight or ignore prompt-response dependencies. To address these limitations, we propose an information-theoretic probing framework that jointly analyzes how UMMs encode inputs and generate outputs. Applied to ten representative UMMs, our framework reveals that pseudo-unification stems from a dual divergence: (i) Modality-Asymmetric Encoding, where vision and language follow different entropy trajectories, and (ii) Pattern-Split Response, where text generation exhibits high-entropy creativity while image synthesis enforces low-entropy fidelity. Only models that unify both sides (e.g., via contextual prediction) achieve more genuine unification, enabling stronger reasoning-based text-to-image generation even with fewer parameters. Our work provides the first model-internal probing of unification, demonstrating that real multimodal synergy requires consistency in information flow, not just shared parameters.
[173] Sign Language Recognition in the Age of LLMs cs.CV | cs.CL
Vaclav Javorek, Jakub Honzik, Ivan Gruber, Tomas Zelezny, Marek Hruz
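TL;DR: This paper studies how well modern vision-language models (VLMs) handle isolated sign language recognition (ISLR) in a zero-shot setting. Evaluating several open-source and proprietary models on the WLASL300 benchmark, it finds that open-source VLMs under prompt-only zero-shot inference fall far behind traditional supervised classifiers, but that the models show some capacity for visual-semantic alignment, and that larger proprietary models achieve substantially higher accuracy.
Details
Motivation: To investigate whether general-purpose vision-language models can solve specialized visual recognition problems such as isolated sign language recognition without task-specific training.
Result: On the WLASL300 benchmark, the zero-shot performance of open-source VLMs is far below that of classic supervised ISLR classifiers, while larger proprietary models (such as GPT-4V) reach higher accuracy, highlighting the importance of model scale and training data diversity.
Insight: The paper's novelty is a systematic evaluation of the zero-shot capability of VLMs for sign language recognition, which also reveals their potential for visual-semantic alignment. Viewed objectively, model scale and data diversity are key factors for performance on specialized tasks, which provides a reference for applying general-purpose models in niche domains.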
Abstract: Recent Vision Language Models (VLMs) have demonstrated strong performance across a wide range of multimodal reasoning tasks. This raises the question of whether such general-purpose models can also address specialized visual recognition problems such as isolated sign language recognition (ISLR) without task-specific training. In this work, we investigate the capability of modern VLMs to perform ISLR in a zero-shot setting. We evaluate several open-source and proprietary VLMs on the WLASL300 benchmark. Our experiments show that, under prompt-only zero-shot inference, current open-source VLMs remain far behind classic supervised ISLR classifiers. However, follow-up experiments reveal that these models capture partial visual-semantic alignment between signs and text descriptions. Larger proprietary models achieve substantially higher accuracy, highlighting the importance of model scale and training data diversity. All our code is publicly available on GitHub.
[174] Bootstrapping Video Semantic Segmentation Model via Distillation-assisted Test-Time Adaptation cs.CV
Jihun Kim, Hoyong Kwon, Hyeokjun Kweon, Kuk-Jin Yoon
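TL;DR: This paper proposes DiTTA (Distillation-assisted Test-Time Adaptation), a new framework that efficiently converts an image semantic segmentation (ISS) model into a temporally aware video semantic segmentation (VSS) model without annotated video data. During a brief initialization phase, it distills the temporal segmentation knowledge of a foundation model such as SAM2 into the ISS model and adds a lightweight temporal fusion module to aggregate cross-frame context, achieving robust generalization even when adapting on only a small portion of the video (for example, the initial 10%).
Details
Motivation: Fully supervised VSS depends on densely annotated video data, and applying a pre-trained ISS model frame by frame ignores temporal coherence, while foundation models such as SAM2 cannot be used directly for VSS because of limited semantic understanding and high computational overhead.
Result: Extensive experiments on the VSPW and Cityscapes datasets show that DiTTA achieves performance competitive with or superior to fully supervised VSS methods and significantly outperforms zero-shot refinement approaches that repeatedly invoke SAM2 during inference.
Insight: The novelty is a distillation-assisted test-time adaptation (TTA) framework that efficiently transfers the temporal knowledge of a foundation model to an ISS model and combines it with lightweight temporal fusion, yielding an annotation-free, computationally efficient, and strong video semantic segmentation solution.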
Abstract: Fully supervised Video Semantic Segmentation (VSS) relies heavily on densely annotated video data, limiting practical applicability. Alternatively, applying pre-trained Image Semantic Segmentation (ISS) models frame-by-frame avoids annotation costs but ignores crucial temporal coherence. Recent foundation models such as SAM2 enable high-quality mask propagation yet remain impractical for direct VSS due to limited semantic understanding and computational overhead. In this paper, we propose DiTTA (Distillation-assisted Test-Time Adaptation), a novel framework that converts an ISS model into a temporally-aware VSS model through efficient test-time adaptation (TTA), without annotated videos. DiTTA distills SAM2’s temporal segmentation knowledge into the ISS model during a brief, single-pass initialization phase, complemented by a lightweight temporal fusion module to aggregate cross-frame context. Crucially, DiTTA achieves robust generalization even when adapting with highly limited partial video snippets (e.g., initial 10%), significantly outperforming zero-shot refinement approaches that repeatedly invoke SAM2 during inference. Extensive experiments on VSPW and Cityscapes demonstrate DiTTA’s effectiveness, achieving competitive or superior performance relative to fully-supervised VSS methods, thus providing a practical and annotation-free solution for real-world VSS tasks.
[175] You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass cs.CV | cs.AI
Yinuo Yang, Zixian Ma, Manasi Ganti, Jieyu Zhang, Ranjay Krishna
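TL;DR: This paper proposes You Only Judge Once, a discriminative multimodal reward model that evaluates all candidate responses in a single forward pass, substantially improving efficiency. It concatenates multiple responses with separator tokens and applies a cross-entropy loss for direct comparative learning, yielding up to N-fold speedup and compute reduction. To support N-way evaluation, the authors build two new benchmarks, MR²Bench-Image and MR²Bench-Video. Built on a 4B vision-language backbone with LoRA fine-tuning and a lightweight MLP value head, the model achieves state-of-the-art performance on six multimodal reward benchmarks and, when used in reinforcement learning, markedly improves the policy model's training stability and open-ended generation quality.
Details
Motivation: Conventional discriminative reward models need a separate forward pass for each candidate response, which is inefficient; the goal is efficient comparative evaluation of multiple responses.
Result: The model reaches SOTA on six multimodal reward benchmarks (including the newly built MR²Bench-Image and MR²Bench-Video), surpassing existing larger generative and discriminative reward models. In reinforcement learning, policy models trained with GRPO maintain performance on standard benchmarks while substantially outperforming a single-response reward model baseline in open-ended generation quality and training stability.
Insight: The novelty is an architecture that evaluates multiple responses in a single forward pass, using response concatenation and a cross-entropy loss for direct comparative learning, which significantly improves efficiency. The authors also build large-scale N-way evaluation benchmarks for more comprehensive model evaluation, and the method combines LoRA fine-tuning with a lightweight design for efficient training and use while maintaining performance.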
Abstract: We present a discriminative multimodal reward model that scores all candidate responses in a single forward pass. Conventional discriminative reward models evaluate each response independently, requiring multiple forward passes, one for each potential response. Our approach concatenates multiple responses with separator tokens and applies cross-entropy over their scalar scores, enabling direct comparative reasoning and efficient $N$-way preference learning. The multi-response design also yields up to $N\times$ wall-clock speedup and FLOPs reduction over conventional single-response scoring. To enable $N$-way reward evaluation beyond existing pairwise benchmarks, we construct two new benchmarks: (1) MR$^2$Bench-Image contains human-annotated rankings over responses from 8 diverse models; (2) MR$^2$Bench-Video is a large-scale video-based reward benchmark derived from 94K crowdsourced pairwise human judgments over video question-answering spanning 19 models, denoised via preference graph ensemble. Both benchmarks provide 4-response evaluation variants sampled from the full rankings. Built on a 4B vision-language backbone with LoRA fine-tuning and a lightweight MLP value head, our model achieves state-of-the-art results on six multimodal reward benchmarks, including MR$^2$Bench-Image, MR$^2$Bench-Video, and four other existing benchmarks. Our model outperforms existing larger generative and discriminative reward models. We further demonstrate that our reward model, when used in reinforcement learning with GRPO, produces improved policy models that maintain performance across standard multimodal benchmarks while substantially improving open-ended generation quality, outperforming a single-response discriminative reward model (RM) baseline by a large margin in both training stability and open-ended generation quality.
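The cross-entropy over scalar scores mentioned in the abstract above can be sketched as a softmax over the N per-response scores followed by a negative log-likelihood on the preferred response. This is a minimal illustration that assumes a single preferred index per example; the function names are hypothetical, and the paper's actual target construction (for instance from full rankings) may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def n_way_preference_loss(scores, best_index):
    """Cross-entropy over N scalar scores produced by one forward pass.

    scores: one scalar per candidate response (hypothetical value-head outputs).
    best_index: index of the preferred response (an assumed label format).
    """
    probs = softmax(scores)
    return -math.log(probs[best_index])
```

Because all N scores come from the same forward pass, the loss compares candidates directly against each other, whereas a single-response reward model would need N separate passes to produce the same scores.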
[176] MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models cs.CV | cs.AI
Xincheng Yao, Zefeng Qian, Chao Shi, Jiayang Song, Chongyang Zhang
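TL;DR: This paper presents MMR-AD, a large-scale multimodal dataset for benchmarking multimodal large language models (MLLMs) on general anomaly detection. It also proposes Anomaly-R1, a reasoning-based baseline model trained on the chain-of-thought data in MMR-AD and further enhanced with reinforcement learning, which achieves significant improvements over generalist MLLMs in anomaly detection and localization.
Details
Motivation: General anomaly detection is the ultimate goal of industrial anomaly detection, and MLLMs show great promise for it thanks to their strong visual understanding and language reasoning. However, their general anomaly detection ability remains underexplored, mainly because their pretraining data differs from anomaly detection scenarios and mainstream anomaly detection datasets are not suitable for post-training MLLMs.
Result: Experiments show that the anomaly detection performance of current state-of-the-art generalist MLLMs still falls far short of industrial requirements, whereas the Anomaly-R1 baseline built on MMR-AD achieves significant improvements over generalist MLLMs in anomaly detection and localization.
Insight: The core novelty is MMR-AD, a comprehensive benchmark dataset built specifically for training and evaluating MLLM-based anomaly detection models, which fills a gap in the field. In addition, Anomaly-R1 combines chain-of-thought learning with reinforcement learning, offering a novel, reusable reasoning-enhanced framework for applying MLLMs to anomaly detection.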
Abstract: In industrial anomaly detection, general anomaly detection (GAD) is an emerging trend and also the ultimate goal. Unlike the conventional single- and multi-class AD, general AD aims to train a general AD model that can directly detect anomalies in diverse novel classes without any retraining or fine-tuning on the target data. Recently, Multimodal Large Language Models (MLLMs) have shown great promise in achieving general anomaly detection due to their revolutionary visual understanding and language reasoning capabilities. However, the general AD ability of MLLMs remains underexplored for two reasons: (1) MLLMs are pretrained on large amounts of data sourced from the Web, and these data still differ significantly from the data in AD scenarios. Moreover, the image-text pairs used during pretraining are not designed specifically for AD tasks. (2) The current mainstream AD datasets are image-based and not yet suitable for post-training MLLMs. To facilitate MLLM-based general AD research, we present MMR-AD, which is a comprehensive benchmark for both training and evaluating MLLM-based AD models. With MMR-AD, we reveal that the AD performance of current SOTA generalist MLLMs still falls far behind the industrial requirements. Based on MMR-AD, we also propose a baseline model, Anomaly-R1, which is a reasoning-based AD model that learns from the CoT data in MMR-AD and is further enhanced by reinforcement learning. Extensive experiments show that our Anomaly-R1 achieves remarkable improvements over generalist MLLMs in both anomaly detection and localization.
[177] What Do Vision-Language Models Encode for Personalized Image Aesthetics Assessment? cs.CV | cs.CL
Koki Ryu, Hitomi Yanaka
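TL;DR: This paper studies the potential of vision-language models (VLMs) for personalized image aesthetics assessment (PIAA). By analyzing the internal representations of VLMs, it finds that they encode rich, multi-level aesthetic attributes, and it uses these representations for lightweight personalized assessment without fine-tuning.
Details
Motivation: To determine whether VLMs used for PIAA internally encode aesthetic attributes rich enough to support effective personalization, and to explore how these representations can be used for lightweight personalization.
Result: The analysis shows that VLMs encode diverse aesthetic attributes and that simple linear models can perform PIAA effectively; the paper also analyzes how aesthetic information is transferred across layers in different VLM architectures and image domains.
Insight: The novelty is revealing the presence and distribution of aesthetic attributes inside VLMs and using them for personalized assessment without fine-tuning. Viewed objectively, the approach offers a new perspective on modeling subjective aesthetic preferences with VLMs and has the advantages of being lightweight and interpretable.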
Abstract: Personalized image aesthetics assessment (PIAA) is an important research problem with practical real-world applications. While methods based on vision-language models (VLMs) are promising candidates for PIAA, it remains unclear whether they internally encode rich, multi-level aesthetic attributes required for effective personalization. In this paper, we first analyze the internal representations of VLMs to examine the presence and distribution of such aesthetic attributes, and then leverage them for lightweight, individual-level personalization without model fine-tuning. Our analysis reveals that VLMs encode diverse aesthetic attributes that propagate into the language decoder layers. Building on these representations, we demonstrate that simple linear models can perform PIAA effectively. We further analyze how aesthetic information is transferred across layers in different VLM architectures and across image domains. Our findings provide insights into how VLMs can be utilized for modeling subjective, individual aesthetic preferences. Our code is available at https://github.com/ynklab/vlm-latent-piaa.
[178] Reasoning Resides in Layers: Restoring Temporal Reasoning in Video-Language Models with Layer-Selective Merging cs.CV | cs.CL
Zihang Fu, Haonan Wang, Jian Kang, Kenji Kawaguchi, Jiaying Wu
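TL;DR: This paper proposes MERIT, a training-free, task-driven model merging framework for restoring temporal reasoning in video-language models (VLMs). It selectively merges the self-attention layers of a VLM with those of its text-only backbone, improving temporal reasoning without harming the model's temporal perception.
Details
Motivation: Multimodal adaptation, particularly for video-language models, involves a common trade-off: strengthening perception often weakens the reasoning ability inherited from language-only pretraining, especially temporal reasoning over sequential events.
Result: Across three representative VLMs and multiple challenging video benchmarks, MERIT consistently improves temporal reasoning, preserves or improves temporal perception, and generalizes beyond the search set to four distinct benchmarks. It outperforms uniform full-model merging and random layer selection.
Insight: The claimed novelty is a retraining-free framework that restores temporal reasoning in a targeted way through layer-selective merging. Viewed objectively, the core insight is that reasoning ability resides mainly in specific layers of the model, and that selectively merging those layers restores temporal reasoning without harming perception, which suggests a new direction for model editing and optimization.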
Abstract: Multimodal adaptation equips large language models (LLMs) with perceptual capabilities, but often weakens the reasoning ability inherited from language-only pretraining. This trade-off is especially pronounced in video-language models (VLMs), where visual alignment can impair temporal reasoning (TR) over sequential events. We propose MERIT, a training-free, task-driven model merging framework for restoring TR in VLMs. MERIT searches over layer-wise self-attention merging recipes between a VLM and its paired text-only backbone using an objective that improves TR while penalizing degradation in temporal perception (TP). Across three representative VLMs and multiple challenging video benchmarks, MERIT consistently improves TR, preserves or improves TP, and generalizes beyond the search set to four distinct benchmarks. It also outperforms uniform full-model merging and random layer selection, showing that effective recovery depends on selecting the right layers. Interventional masking and frame-level attribution further show that the selected layers are disproportionately important for reasoning and shift model decisions toward temporally and causally relevant evidence. These results show that targeted, perception-aware model merging can effectively restore TR in VLMs without retraining.
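Layer-selective merging, as described in the abstract above, can be illustrated with a minimal sketch. It assumes each layer's self-attention weights are flattened into a list of floats keyed by layer index and that merging is a linear interpolation with a single coefficient `alpha`; the function name, the data layout, and the interpolation rule are illustrative assumptions, and the search over which layers to merge is not shown.

```python
def merge_selected_layers(vlm_layers, text_layers, selected, alpha=0.5):
    """Interpolate self-attention weights only for the selected layers.

    vlm_layers, text_layers: dicts mapping layer index -> list of floats
        (hypothetical flattened self-attention weights).
    selected: set of layer indices to merge; all others keep the VLM weights.
    alpha: assumed interpolation weight toward the text-only backbone.
    """
    merged = {}
    for idx, vlm_w in vlm_layers.items():
        if idx in selected:
            text_w = text_layers[idx]
            merged[idx] = [(1.0 - alpha) * v + alpha * t
                           for v, t in zip(vlm_w, text_w)]
        else:
            # Unselected layers are copied unchanged from the VLM.
            merged[idx] = list(vlm_w)
    return merged


# Usage: merge only layer 1, leaving layer 0 as in the VLM.
vlm = {0: [1.0, 1.0], 1: [2.0, 2.0]}
text = {0: [3.0, 3.0], 1: [4.0, 4.0]}
result = merge_selected_layers(vlm, text, selected={1}, alpha=0.5)
```

Uniform full-model merging corresponds to passing every layer index in `selected`, which is the baseline the abstract reports as weaker than choosing the right layers.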
[179] ArtiCAD: Articulated CAD Assembly Design via Multi-Agent Code Generation cs.CVPDF
Yuan Shui, Yandong Guan, Zhanwei Zhang, Juncheng Hu, Jing Zhang
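TL;DR: This paper proposes ArtiCAD, the first training-free multi-agent system that generates editable, articulated CAD assemblies directly from text or images. The system splits the task among four specialized agents (Design, Generation, Assembly, and Review). Its key insight is to predict assembly relationships during the initial design stage rather than the assembly stage, using an explicit Connector that defines attachment points and joint parameters to work around the limited spatial reasoning of current LLMs and VLMs. Validation steps, a cross-stage rollback mechanism, and a self-evolving experience store are added to ensure output quality.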
Details
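Motivation: Generating articulated multi-part CAD assemblies from high-level descriptions (text or images) is an unexplored problem that is essential for product development.
Result: Extensive evaluations on three datasets (ArtiCAD-Bench, CADPrompt, and ACD) validate the approach and demonstrate its applicability to requirement-driven conceptual design, physical prototyping, and the generation of embodied AI training assets through URDF export.
Insight: The contributions are: 1) the first training-free multi-agent framework for CAD assembly generation; 2) predicting assembly relationships at the design stage, before geometry generation, with an explicit Connector that sidesteps the spatial reasoning limits of LLMs and VLMs; 3) validation, cross-stage error rollback, and a self-evolving experience store for quality and continual improvement.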
Abstract: Parametric Computer-Aided Design (CAD) of articulated assemblies is essential for product development, yet generating these multi-part, movable models from high-level descriptions remains unexplored. To address this, we propose ArtiCAD, the first training-free multi-agent system capable of generating editable, articulated CAD assemblies directly from text or images. Our system divides this complex task among four specialized agents: Design, Generation, Assembly, and Review. One of our key insights is to predict assembly relationships during the initial design stage rather than the assembly stage. By utilizing a Connector that explicitly defines attachment points and joint parameters, ArtiCAD determines these relationships before geometry generation, effectively bypassing the limited spatial reasoning capabilities of current LLMs and VLMs. To further ensure high-quality outputs, we introduce validation steps in the generation and assembly stages, accompanied by a cross-stage rollback mechanism that accurately isolates and corrects design- and code-level errors. Additionally, a self-evolving experience store accumulates design knowledge to continuously improve performance on future tasks. Extensive evaluations on three datasets (ArtiCAD-Bench, CADPrompt, and ACD) validate the effectiveness of our approach. We further demonstrate the applicability of ArtiCAD in requirement-driven conceptual design, physical prototyping, and the generation of embodied AI training assets through URDF export.
[180] TraversalBench: Challenging Paths to Follow for Vision Language Models cs.CVPDF
Clara Petrova, Zhuo Chen, Marin Soljačić
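TL;DR: This paper introduces TraversalBench, a benchmark for evaluating vision-language models on exact visual path traversal. By controlling key structural factors (self-intersection count, tortuosity, vertex count, and nearby confounding lines), it tests whether a model can recover the ordered sequence of path vertices from start to finish. Self-intersections are the dominant source of difficulty and their errors are sharply localized, whereas confounding lines cause a persistent degradation.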
Details
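Motivation: Existing vision-language models perform well on multimodal benchmarks, but their ability to follow complex visual paths, which humans usually find easy, remains under-tested. The paper fills this gap with a controlled, diagnostic benchmark of path-faithful reasoning under sustained visual processing.
Result: Analysis on TraversalBench shows that self-intersections are the dominant source of difficulty, with performance dropping steeply at the first crossing, while nearby confounding lines cause a weaker but persistent, compounding degradation. An auxiliary reading-order benchmark further reveals a consistent preference for layouts compatible with left-to-right serialization.
Insight: The novelty is a diagnostic benchmark focused on sustained visual grounding. Controlling key path-structural factors isolates and quantifies specific failure modes in spatial reasoning (localized collapse at self-intersections versus persistent decay under confounders), providing a testbed for studying multimodal spatial reasoning under ambiguity, clutter, and distractor structure.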
Abstract: Vision-language models (VLMs) perform strongly on many multimodal benchmarks. However, the ability to follow complex visual paths – a task that human observers typically find straightforward – remains under-tested. We introduce TraversalBench, a controlled benchmark for exact visual path traversal. Each instance contains a single continuous polyline, a unique start marker, and markers placed at path vertices; the task is to recover the exact ordered sequence encountered when traversing the path from start to finish. The benchmark explicitly balances key path-structural factors including self-intersection count, tortuosity, vertex count, and nearby confounding lines, while minimizing reliance on OCR, world knowledge, and open-ended planning. We find that self-intersections are the dominant source of difficulty. A first-crossing analysis shows that errors are sharply localized: performance is relatively stable immediately before the first crossing, then drops steeply when the model must resolve the correct continuation. By contrast, nearby confounding lines produce a weaker persistent degradation that compounds with repeated exposure. These analyses make TraversalBench a useful diagnostic for identifying whether models suffer from human-like failures or other breakdowns in sustained visual processing. An auxiliary reading-order benchmark further reveals a consistent preference for layouts compatible with left-to-right serialization, while not explaining away the main effects of path complexity. Together, these results position TraversalBench as a controlled diagnostic of path-faithful visual reasoning and as a useful testbed for studying multimodal spatial reasoning under ambiguity, clutter, and distractor structure. More broadly, we position TraversalBench as a contribution to the still-limited area of sustained visual grounding benchmarks for VLMs.
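To make the "self-intersection count" factor concrete, the sketch below counts proper crossings between non-adjacent segments of a polyline with the standard orientation test. It is only an illustration of the quantity; how TraversalBench actually generates or measures its paths is not described in the abstract, and the function names here are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def _orient(a: Point, b: Point, c: Point) -> float:
    """Signed area of triangle (a, b, c); the sign gives the turn direction."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _properly_cross(p1: Point, p2: Point, p3: Point, p4: Point) -> bool:
    """True when segments p1-p2 and p3-p4 cross at an interior point."""
    d1 = _orient(p3, p4, p1)
    d2 = _orient(p3, p4, p2)
    d3 = _orient(p1, p2, p3)
    d4 = _orient(p1, p2, p4)
    return d1 * d2 < 0 and d3 * d4 < 0


def count_self_intersections(path: List[Point]) -> int:
    """Count crossings between non-adjacent segments of a polyline."""
    n_segments = len(path) - 1
    count = 0
    for i in range(n_segments):
        # Adjacent segments share a vertex, so start at i + 2.
        for j in range(i + 2, n_segments):
            if _properly_cross(path[i], path[i + 1], path[j], path[j + 1]):
                count += 1
    return count
```

Touching endpoints and collinear overlaps are deliberately not counted in this simplified version.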
[181] Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference cs.CV | cs.CL | cs.LGPDF
Imanol Miranda, Ander Salaberria, Eneko Agirre, Gorka Azkune
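TL;DR: This paper revisits the compositionality of dual-encoder vision-language models such as CLIP and argues that their poor compositional performance may stem from the standard inference protocol based on global cosine similarity rather than from deficient representations. Diagnostic experiments show that explicitly enforcing fine-grained region-segment alignment substantially improves compositional performance. The authors then introduce a lightweight transformer that learns this alignment directly from frozen patch and token embeddings. Compared with full fine-tuning and prior end-to-end compositional training methods, the approach matches full fine-tuning on in-domain retrieval and yields substantial gains on controlled out-of-domain compositional benchmarks.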
Details
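Motivation: Dual-encoder vision-language models such as CLIP perform poorly on compositional benchmarks; the paper investigates whether the root cause lies in the inference protocol (global cosine-similarity matching) rather than in the representations themselves.
Result: Learning localized alignment over frozen representations matches full fine-tuning on in-domain retrieval and yields substantial improvements on controlled out-of-domain compositional benchmarks, generalizing compositionally more robustly.
Insight: The novelty lies in locating the compositionality bottleneck of dual-encoder VLMs in global embedding matching at inference time, and in improving compositional generalization with a lightweight module that learns localized alignment without updating the pretrained encoders. This offers an efficient way to improve the capabilities of frozen models.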
Abstract: Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks. We argue that this limitation may stem less from deficient representations than from the standard inference protocol based on global cosine similarity. First, through controlled diagnostic experiments, we show that explicitly enforcing fine-grained region-segment alignment at inference dramatically improves compositional performance without updating pretrained encoders. We then introduce a lightweight transformer that learns such alignments directly from frozen patch and token embeddings. Comparing against full fine-tuning and prior end-to-end compositional training methods, we find that although these approaches improve in-domain retrieval, their gains do not consistently transfer under distribution shift. In contrast, learning localized alignment over frozen representations matches full fine-tuning on in-domain retrieval while yielding substantial improvements on controlled out-of-domain compositional benchmarks. These results identify global embedding matching as a key bottleneck in dual-encoder VLMs and highlight the importance of alignment mechanisms for robust compositional generalization.
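The contrast between global matching and localized alignment can be sketched with toy vectors. The example below compares a single global cosine score with a score that matches each text token to its best image patch. Mean pooling as the global embedding and max-over-patches as the alignment rule are assumptions chosen for illustration; they are not the paper's diagnostic protocol or its learned transformer.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_pool(vectors: List[Vector]) -> Vector:
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def global_score(patches: List[Vector], tokens: List[Vector]) -> float:
    """One cosine similarity between pooled image and pooled text vectors."""
    return cosine(mean_pool(patches), mean_pool(tokens))


def local_alignment_score(patches: List[Vector], tokens: List[Vector]) -> float:
    """Average, over text tokens, of the best-matching patch similarity."""
    return sum(max(cosine(t, p) for p in patches) for t in tokens) / len(tokens)
```

With two orthogonal patches and a caption token that matches only one of them, the local score stays at its maximum while the pooled global score is diluted by the unrelated patch.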
[182] Panoptic Pairwise Distortion Graph cs.CV | cs.AI | cs.LGPDF
Muhammad Kamran Janjua, Abdul Wahab, Bahador Rashidi
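TL;DR: This paper offers a new perspective on comparative image assessment that represents an image pair as a structured composition of regions, and introduces the Distortion Graph (DG) task. The authors contribute a region-level dataset (PandaSet), a benchmark suite (PandaBench), and an efficient architecture (Panda) for generating distortion graphs. State-of-the-art multimodal large language models (MLLMs) struggle to understand region-level degradations, while training on PandaSet or prompting with DG elicits region-level distortion understanding.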
Details
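Motivation: Existing image quality assessment methods focus on whole-image analysis while implicitly relying on region-level understanding. The paper extends the scene graph concept from a single image to an image pair, proposing a structured, interpretable representation that captures degradation information between the two images (distortion type, severity, comparison, and quality score) at a finer granularity.
Result: On PandaBench, current state-of-the-art MLLMs fail at region-level degradation tasks even when given explicit region cues, which underlines the difficulty of the benchmark. Training on PandaSet or prompting with DG improves region-level distortion understanding.
Insight: The novelty lies in extending the scene graph idea to image pairs and proposing the Distortion Graph as a structured representation task that encodes dense degradation information in a compact, interpretable graph. This opens a direction for fine-grained, structured pairwise image assessment and exposes the limits of current MLLMs in region-level visual understanding.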
Abstract: In this work, we introduce a new perspective on comparative image assessment by representing an image pair as a structured composition of its regions. In contrast, existing methods focus on whole image analysis, while implicitly relying on region-level understanding. We extend the intra-image notion of a scene graph to inter-image, and propose a novel task of Distortion Graph (DG). DG treats paired images as a structured topology grounded in regions, and represents dense degradation information such as distortion type, severity, comparison and quality score in a compact interpretable graph structure. To realize the task of learning a distortion graph, we contribute (i) a region-level dataset, PandaSet, (ii) a benchmark suite, PandaBench, with varying region-level difficulty, and (iii) an efficient architecture, Panda, to generate distortion graphs. We demonstrate that PandaBench poses a significant challenge for state-of-the-art multimodal large language models (MLLMs) as they fail to understand region-level degradations even when fed with explicit region cues. We show that training on PandaSet or prompting with DG elicits region-wise distortion understanding, opening a new direction for fine-grained, structured pairwise image assessment.
[183] UHD-GPGNet: UHD Video Denoising via Gaussian-Process-Guided Local Spatio-Temporal Modeling cs.CVPDF
Weiyuan He, Chen Wu, Pengwen Dai, Wei Wang, Dianjie Lu
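TL;DR: This paper proposes UHD-GPGNet, a Gaussian-process-guided local spatio-temporal modeling framework for ultra-high-definition (UHD) video denoising. The method estimates sparse Gaussian process posterior statistics over compact spatio-temporal descriptors to explicitly characterize local degradation response and uncertainty, which then guide adaptive temporal-detail fusion. Together with a structure-color collaborative reconstruction head, a heteroscedastic objective, and overlap-tiled inference, it preserves texture detail and color stability while remaining efficient for 4K deployment.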
Details
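Motivation: UHD video denoising must simultaneously suppress complex spatio-temporal degradations, preserve fine textures and color stability, and support efficient full-resolution 4K deployment.
Result: On the UVG and RealisVideo-4K benchmarks, UHD-GPGNet achieves competitive restoration fidelity with fewer parameters, runs real-time full-resolution 4K inference with a significant speedup over the closest quality competitor, and stays robust across a multi-level mixed-degradation schedule. Tests on real phone-captured 4K video further confirm that the model generalizes to unseen real sensor noise and improves downstream object detection under challenging conditions.
Insight: The novelty lies in explicitly introducing Gaussian process posterior statistics into video denoising, estimated sparsely to guide adaptive fusion. The structure-color collaborative reconstruction head decouples luminance, chroma, and high-frequency correction, and the heteroscedastic objective and overlap-tiled inference improve training stability and enable memory-bounded 4K deployment. The method suggests a way to combine explicit statistical modeling with deep learning.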
Abstract: Ultra-high-definition (UHD) video denoising requires simultaneously suppressing complex spatio-temporal degradations, preserving fine textures and chromatic stability, and maintaining efficient full-resolution 4K deployment. In this paper, we propose UHD-GPGNet, a Gaussian-process-guided local spatio-temporal denoising framework that addresses these requirements jointly. Rather than relying on implicit feature learning alone, the method estimates sparse GP posterior statistics over compact spatio-temporal descriptors to explicitly characterize local degradation response and uncertainty, which then guide adaptive temporal-detail fusion. A structure-color collaborative reconstruction head decouples luminance, chroma, and high-frequency correction, while a heteroscedastic objective and overlap-tiled inference further stabilize optimization and enable memory-bounded 4K deployment. Experiments on UVG and RealisVideo-4K show that UHD-GPGNet achieves competitive restoration fidelity with substantially fewer parameters than existing methods, enables real-time full-resolution 4K inference with significant speedup over the closest quality competitor, and maintains robust performance across a multi-level mixed-degradation schedule. A real-world study on phone-captured 4K video further confirms that the model, trained entirely on synthetic degradation, generalizes to unseen real sensor noise and improves downstream object detection under challenging conditions.
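Overlap-tiled inference bounds memory by processing a large frame as overlapping tiles. The sketch below computes tile start offsets along one axis so that every pixel is covered and neighbouring tiles share at least `overlap` pixels. The tile size, overlap, and the rule of snapping the last tile to the frame edge are assumptions for illustration; the paper's actual tiling and blending scheme is not given in the abstract.

```python
from typing import List


def tile_starts(length: int, tile: int, overlap: int) -> List[int]:
    """Start offsets of tiles of size `tile` covering [0, length).

    Consecutive tiles overlap by at least `overlap` pixels; the last tile
    is snapped to the end of the axis so no padding is needed.
    """
    if tile >= length:
        return [0]
    stride = tile - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than the tile size")
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # final tile ends exactly at `length`
    return starts
```

Applying `tile_starts` to the height and width separately gives the 2D grid of tiles; overlapping outputs would then be blended, for example by averaging.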
[184] Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images cs.CVPDF
Zheng Jiang, Yiming Chen, Nan He, Jiahui Chen, Chaoyang Li
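TL;DR: This paper proposes Test-Time Scaling over Perception (TTSP), a framework that addresses the "Grounding Paradox" faced by multimodal large language models in fine-grained visual reasoning. It treats perception itself as a scalable inference process: it generates multiple exploratory perception traces, filters unreliable traces with entropy-based confidence estimation, distills validated observations into structured knowledge, and iteratively directs further exploration toward unresolved uncertainty.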
Details
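Motivation: Multimodal large language models that "think with images" suffer from the Grounding Paradox: the model must decide where to look before it has the evidence needed to decide correctly, which makes fine-grained visual reasoning brittle.
Result: Extensive experiments on high-resolution and general multimodal reasoning benchmarks show that TTSP consistently outperforms strong baselines across backbone sizes while exhibiting favorable scalability and token efficiency.
Insight: The novelty lies in treating perception as an inference process that can be scaled and iteratively refined at test time, with confidence estimation and knowledge distillation guiding exploration. This is a promising direction for robust multimodal reasoning under perceptual uncertainty.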
Abstract: Recent multimodal large language models (MLLMs) have begun to support Thinking with Images by invoking visual tools such as zooming and cropping during inference. Yet these systems remain brittle in fine-grained visual reasoning because they must decide where to look before they have access to the evidence needed to make that decision correctly. We identify this circular dependency as the Grounding Paradox. To address it, we propose Test-Time Scaling over Perception (TTSP), a framework that treats perception itself as a scalable inference process. TTSP generates multiple exploratory perception traces, filters unreliable traces using entropy-based confidence estimation, distills validated observations into structured knowledge, and iteratively refines subsequent exploration toward unresolved uncertainty. Extensive experiments on high-resolution and general multimodal reasoning benchmarks show that TTSP consistently outperforms strong baselines across backbone sizes, while also exhibiting favorable scalability and token efficiency. Our results suggest that scaling perception at test time is a promising direction for robust multimodal reasoning under perceptual uncertainty.
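The trace-filtering step can be pictured as a simple entropy threshold. In the sketch below each perception trace carries an observation and a probability distribution over candidate answers, and traces whose distribution is too uncertain are dropped. The trace representation, the use of Shannon entropy over answer probabilities, and the fixed threshold are assumptions for illustration; the paper's exact confidence estimator is not described in the abstract.

```python
import math
from typing import List, Sequence, Tuple

Trace = Tuple[str, Sequence[float]]  # (observation, answer probabilities)


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def filter_traces(traces: List[Trace], max_entropy: float) -> List[str]:
    """Keep observations from traces whose entropy is at most `max_entropy`."""
    return [obs for obs, probs in traces if entropy(probs) <= max_entropy]
```

The surviving observations would then be distilled into structured knowledge, and the regions behind the rejected traces become candidates for another round of exploration.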
[185] EgoFun3D: Modeling Interactive Objects from Egocentric Videos using Function Templates cs.CVPDF
Weikun Peng, Denys Iliash, Manolis Savva
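TL;DR: EgoFun3D presents a task formulation, dataset, and benchmark for modeling interactive 3D objects from egocentric videos. The task is to obtain simulation-ready interactive 3D objects from video input, using function templates, a structured computational representation, to capture functional mappings between parts (for example, rotating a knob controls the burner temperature).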
Details
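Motivation: Interactive objects are essential for embodied AI, but the relevant data is scarce. Prior work mostly focuses on articulation, whereas this paper models more general cross-part functional relationships from readily available real-world videos to support simulation.
Result: The paper builds a dataset of 271 egocentric videos annotated with 3D geometry, 2D and 3D segmentation, articulation, and function templates. Benchmarking shows that the task is challenging for existing methods and points to avenues for future work.
Insight: The core novelty is the function template, a structured representation that encodes functional dependencies between object parts; it goes beyond conventional articulation modeling and compiles directly into executable code across simulation platforms. The proposed four-stage pipeline (2D part segmentation, reconstruction, articulation estimation, and function template inference) offers a systematic way to extract simulation-ready objects from video.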
Abstract: We present EgoFun3D, a coordinated task formulation, dataset, and benchmark for modeling interactive 3D objects from egocentric videos. Interactive objects are of high interest for embodied AI but scarce, making modeling from readily available real-world videos valuable. Our task focuses on obtaining simulation-ready interactive 3D objects from egocentric video input. While prior work largely focuses on articulations, we capture general cross-part functional mappings (e.g., rotation of stove knob controls stove burner temperature) through function templates, a structured computational representation. Function templates enable precise evaluation and direct compilation into executable code across simulation platforms. To enable comprehensive benchmarking, we introduce a dataset of 271 egocentric videos featuring challenging real-world interactions with paired 3D geometry, segmentation over 2D and 3D, articulation and function template annotations. To tackle the task, we propose a 4-stage pipeline consisting of: 2D part segmentation, reconstruction, articulation estimation, and function template inference. Comprehensive benchmarking shows that the task is challenging for off-the-shelf methods, highlighting avenues for future work.
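One way to picture a function template is as a small parameterized mapping from the state of one part to the state of another. The sketch below models the stove-knob example as a clamped linear map. The class name, its fields, and the linear form are hypothetical; the abstract does not define the template vocabulary EgoFun3D actually uses.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class LinearFunctionTemplate:
    """Hypothetical template: the source part's state drives the target's state."""
    source_part: str
    target_part: str
    in_range: Tuple[float, float]   # e.g. knob angle in degrees
    out_range: Tuple[float, float]  # e.g. normalized burner heat

    def __call__(self, x: float) -> float:
        lo, hi = self.in_range
        x = min(max(x, lo), hi)        # clamp to the valid input range
        t = (x - lo) / (hi - lo)       # normalize to [0, 1]
        a, b = self.out_range
        return a + t * (b - a)


# Usage: a knob that turns 0-270 degrees and controls heat in [0, 1].
knob_to_burner = LinearFunctionTemplate("stove_knob", "burner",
                                        (0.0, 270.0), (0.0, 1.0))
half_heat = knob_to_burner(135.0)
```

A representation of this kind is easy to evaluate exactly and to emit as code for a simulator, which matches the motivation the abstract gives for function templates.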
[186] Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization cs.CVPDF
Renyu Li, Vladimir Kirilenko, Yao You, Crag Wolfe
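TL;DR: This paper proposes an agentic label harmonization workflow that uses a vision-language model to reconcile category semantics and bounding box granularity across heterogeneous datasets before training, addressing annotation inconsistency. Using document layout detection as a case study, it shows gains in detection performance, table structure recognition, and spatial consistency.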
Details
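Motivation: When object detection models are fine-tuned on several merged datasets, inconsistent annotation standards (conflicting category semantics and bounding box definitions) degrade performance.
Result: On SCORE-Bench for document layout detection, harmonization raises the detection F-score from 0.860 to 0.883, raises table TEDS from 0.750 (unharmonized) to 0.814, and lowers mean bounding box overlap from 0.043 to 0.016, giving consistent gains in content fidelity, table structure, and spatial consistency.
Insight: The novelty is an agentic label harmonization step applied before training, in which a vision-language model interprets semantics and unifies annotations so that inconsistent labels do not distort the learned feature space. Representation analysis shows that harmonized training produces more compact and separable features, suggesting a new approach to joint training on multiple datasets.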
Abstract: Fine-tuning object detection (OD) models on combined datasets assumes annotation compatibility, yet datasets often encode conflicting spatial definitions for semantically equivalent categories. We propose an agentic label harmonization workflow that uses a vision-language model to reconcile both category semantics and bounding box granularity across heterogeneous sources before training. We evaluate on document layout detection as a challenging case study, where annotation standards vary widely across corpora. Without harmonization, naïve mixed-dataset fine-tuning degrades a pretrained RT-DETRv2 detector: on SCORE-Bench, which measures how accurately the full document conversion pipeline reproduces ground-truth structure, table TEDS drops from 0.800 to 0.750. Applied to two corpora whose 16 and 10 category taxonomies share only 8 direct correspondences, harmonization yields consistent gains across content fidelity, table structure, and spatial consistency: detection F-score improves from 0.860 to 0.883, table TEDS improves to 0.814, and mean bounding box overlap drops from 0.043 to 0.016. Representation analysis further shows that harmonized training produces more compact and separable post-decoder embeddings, confirming that annotation inconsistency distorts the learned feature space and that resolving it before training restores representation structure.
[187] RESP: Reference-guided Sequential Prompting for Visual Glitch Detection in Video Games cs.CVPDF
Yakun Yu, Ashley Wiens, Adrián Barahona-Ríos, Benedict Wilkins, Saman Zadtootaghaj
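TL;DR: This paper presents RESP, a reference-guided sequential prompting framework based on vision-language models (VLMs) for detecting visual glitches in video games. For each test frame it selects a reference frame from the same video, recasting detection as within-video comparison rather than isolated classification and enabling robust multi-frame glitch detection.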
Details
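Motivation: Visual glitches in video games degrade player experience and perceived quality, and manual quality assurance cannot keep up with the growing test surface of modern game development. Existing automated methods, particularly those using VLMs, mostly operate on single frames or rely on limited video-level baselines that struggle under realistic scene variation, which makes robust video-level glitch detection challenging.
Result: Experiments across five VLMs and three datasets (the synthetic RefGlitch and two real-world datasets) show that reference guidance consistently strengthens frame-level detection and that the improved frame-level evidence reliably transfers to stronger video-level triage under realistic QA conditions.
Insight: The core novelty is reference-guided prompting, which recasts glitch detection as a comparison between a reference frame and a test frame within the same video and aggregates noisy frame predictions into a stable video-level decision without fine-tuning the VLM. The synthetic RefGlitch dataset additionally enables controlled analysis of reference effects.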
Abstract: Visual glitches in video games degrade player experience and perceived quality, yet manual quality assurance cannot scale to the growing test surface of modern game development. Prior automation efforts, particularly those using vision-language models (VLMs), largely operate on single frames or rely on limited video-level baselines that struggle under realistic scene variation, making robust video-level glitch detection challenging. We present RESP, a practical multi-frame framework for gameplay glitch detection with VLMs. Our key idea is reference-guided prompting: for each test frame, we select a reference frame from earlier in the same video, establishing a visual baseline and reframing detection as within-video comparison rather than isolated classification. RESP sequentially prompts the VLM with reference/test pairs and aggregates noisy frame predictions into a stable video-level decision without fine-tuning the VLM. To enable controlled analysis of reference effects, we introduce RefGlitch, a synthetic dataset of manually labeled reference/test frame pairs with balanced coverage across five glitch types. Experiments across five VLMs and three datasets (one synthetic, two real-world) show that reference guidance consistently strengthens frame-level detection and that the improved frame-level evidence reliably transfers to stronger video-level triage under realistic QA conditions. Code and data are available at https://github.com/PipiZong/RESP_code.git.
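The two mechanical pieces of this pipeline, pairing each test frame with an earlier reference and aggregating per-frame flags, can be sketched as follows. The fixed `gap` used to pick the reference and the consecutive-run rule used for aggregation are assumptions for illustration only; the abstract does not state how RESP selects references or aggregates predictions, and the VLM call itself is omitted.

```python
from typing import List, Sequence, Tuple


def reference_test_pairs(num_frames: int, gap: int = 5) -> List[Tuple[int, int]]:
    """Pair each test frame t with an earlier reference frame (hypothetical rule).

    The reference is `gap` frames earlier, clamped to the first frame.
    """
    return [(max(0, t - gap), t) for t in range(1, num_frames)]


def video_level_decision(frame_flags: Sequence[bool], min_run: int = 2) -> bool:
    """Flag the video when at least `min_run` consecutive frames are flagged.

    Requiring a run suppresses isolated, noisy single-frame detections.
    """
    run = 0
    for flagged in frame_flags:
        run = run + 1 if flagged else 0
        if run >= min_run:
            return True
    return False
```

In use, each `(reference, test)` pair would be sent to the VLM with a comparison prompt, and the resulting boolean flags passed to `video_level_decision`.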
[188] Structured State-Space Regularization for Compact and Generation-Friendly Image Tokenization cs.CVPDF
Jinsung Lee, Jaemin Oh, Namhun Kim, Dongwon Kim, Byung-Jun Yoon
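TL;DR: This paper proposes a novel regularizer that improves the latent space of image tokenizers so that it is both compact and generation-friendly. It guides the tokenizer to mimic the hidden-state dynamics of state-space models (SSMs), transferring their frequency awareness to the latent features and improving diffusion-model generation quality while preserving reconstruction fidelity.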
Details
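Motivation: The latent space of an image tokenizer must be both compact (capturing image content efficiently) and generation-friendly (easy for generative models to model); existing methods often struggle to satisfy both.
Result: Experiments show that the method improves generation quality in diffusion models with only minimal loss in reconstruction fidelity, supporting its gains in both compactness and generation-friendliness.
Insight: The novelty lies in a regularizer, derived from a theoretical analysis of state-space models, that encodes frequency-domain cues and fine spatial structure into compact latent features, making more effective use of representation capacity and improving generative modelability.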
Abstract: Image tokenizers are central to modern vision models as they often operate in latent spaces. An ideal latent space must be simultaneously compact and generation-friendly: it should capture an image’s essential content compactly while remaining easy to model with generative approaches. In this work, we introduce a novel regularizer to align latent spaces with these two objectives. The key idea is to guide tokenizers to mimic the hidden state dynamics of state-space models (SSMs), thereby transferring their critical property, frequency awareness, to latent features. Grounded in a theoretical analysis of SSMs, our regularizer enforces encoding of fine spatial structures and frequency-domain cues into compact latent features, leading to more effective use of representation capacity and improved generative modelability. Experiments demonstrate that our method improves generation quality in diffusion models while incurring only minimal loss in reconstruction fidelity.
[189] CDPR: Cross-modal Diffusion with Polarization for Reliable Monocular Depth Estimation cs.CVPDF
Rongjia Yu, Tong Jia, Hao Wang, Xiaofang Li, Xiao Yang
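TL;DR: This paper proposes CDPR, a novel diffusion framework for monocular depth estimation that integrates physically grounded polarization priors to make estimation more robust. It encodes RGB and polarization images into a shared latent space and fuses the modalities dynamically through a learnable confidence-aware gating mechanism, improving performance in challenging regions such as reflective or transparent surfaces.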
Details
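Motivation: Monocular depth estimation is difficult under complex conditions such as textureless surfaces, transparency, and specular reflections. Existing diffusion-based methods rely solely on RGB input, which often lacks sufficient cues in hard regions, so additional physical information such as polarization is needed for robustness.
Result: Experiments on synthetic and real-world datasets show that CDPR significantly outperforms RGB-only baselines in challenging regions while remaining competitive in standard scenes.
Insight: The novelty lies in integrating physically grounded polarization priors into a diffusion framework and in a learnable confidence-aware gating mechanism that fuses modalities adaptively, suppressing noise while preserving informative cues. The framework extends to surface normal prediction with minimal modification, showing that it scales to polarization-guided dense prediction tasks in general.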
Abstract: Monocular depth estimation is a fundamental yet challenging task in computer vision, especially under complex conditions such as textureless surfaces, transparency, and specular reflections. Recent diffusion-based approaches have significantly advanced performance by reformulating depth prediction as a denoising process in the latent space. However, existing methods rely solely on RGB inputs, which often lack sufficient cues in challenging regions. In this work, we present CDPR - Cross-modal Diffusion with Polarization for Reliable Monocular Depth Estimation - a novel diffusion-based framework that integrates physically grounded polarization priors to enhance estimation robustness. Specifically, we encode both RGB and polarization (AoLP/DoLP) images into a shared latent space via a pre-trained Variational Autoencoder (VAE), and dynamically fuse multi-modal information through a learnable confidence-aware gating mechanism. This fusion module adaptively suppresses noisy signals in polarization inputs while preserving informative cues, particularly around reflective or transparent surfaces, and provides the integrated latent representation for subsequent monocular depth estimation. Beyond depth estimation, we further verify that our framework can be easily generalized to surface normal prediction with minimal modification, showcasing its scalability to general polarization-guided dense prediction tasks. Experiments on both synthetic and real-world datasets validate that CDPR significantly outperforms RGB-only baselines in challenging regions while maintaining competitive performance in standard scenes.
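A confidence-aware gate can be read as a per-element blend between the RGB latent and the polarization latent. The sketch below uses a sigmoid of a gate logit as the polarization confidence. The convex-combination form and the idea that the logits come from a small learned network are assumptions for illustration; the abstract does not give CDPR's fusion equation.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(z_rgb: Sequence[float], z_pol: Sequence[float],
                 gate_logits: Sequence[float]) -> List[float]:
    """Blend two latents element-wise with a confidence gate.

    A high gate logit trusts the polarization latent; a very negative logit
    falls back to the RGB latent, suppressing noisy polarization input.
    """
    fused = []
    for r, p, g in zip(z_rgb, z_pol, gate_logits):
        c = sigmoid(g)  # confidence in the polarization cue, in (0, 1)
        fused.append((1.0 - c) * r + c * p)
    return fused
```

In a trained model the gate logits would be predicted from both latents, so the confidence can rise around reflective or transparent surfaces where polarization is informative.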
[190] OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video cs.CV | cs.MMPDF
Junfu Pu, Yuxin Chen, Teng Wang, Ying Shan
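TL;DR: This paper introduces the video-to-script (V2S) task, which aims to generate hierarchical, scene-by-scene scripts for long-form cinematic video, covering character actions, dialogue, expressions, and audio cues. The authors build the first human-annotated benchmark for the task and propose a temporally aware hierarchical evaluation framework. They also present OmniScript, an 8B-parameter omni-modal (audio-visual) language model tailored for long-form narrative comprehension and trained with a progressive pipeline of chain-of-thought supervised fine-tuning followed by reinforcement learning with temporally segmented rewards. Experiments show that, despite its parameter efficiency, OmniScript significantly outperforms larger open-source models in temporal localization and multi-field semantic accuracy and performs comparably to state-of-the-art proprietary models, including Gemini 3-Pro.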
Details
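Motivation: Current multimodal large language models (MLLMs) excel at short-form video understanding, but converting long-form cinematic video into detailed, temporally grounded scripts remains a major challenge. The paper addresses this challenge to advance deep understanding of long-form narrative video.
Result: Extensive experiments on the authors' V2S benchmark show that OmniScript significantly outperforms larger open-source models in temporal localization and multi-field semantic accuracy, and performs comparably to state-of-the-art proprietary models such as Gemini 3-Pro.
Insight: The contributions are: 1) a new video-to-script (V2S) task and its evaluation framework; 2) OmniScript, a model designed for long-form audio-visual narrative comprehension; 3) a progressive training pipeline that combines chain-of-thought supervised fine-tuning (for plot and character reasoning) with reinforcement learning using temporally segmented rewards, which improves performance on complex temporal tasks.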
Abstract: Current multimodal large language models (MLLMs) have demonstrated remarkable capabilities in short-form video understanding, yet translating long-form cinematic videos into detailed, temporally grounded scripts remains a significant challenge. This paper introduces the novel video-to-script (V2S) task, aiming to generate hierarchical, scene-by-scene scripts encompassing character actions, dialogues, expressions, and audio cues. To facilitate this, we construct a first-of-its-kind human-annotated benchmark and propose a temporally-aware hierarchical evaluation framework. Furthermore, we present OmniScript, an 8B-parameter omni-modal (audio-visual) language model tailored for long-form narrative comprehension. OmniScript is trained via a progressive pipeline that leverages chain-of-thought supervised fine-tuning for plot and character reasoning, followed by reinforcement learning using temporally segmented rewards. Extensive experiments demonstrate that despite its parameter efficiency, OmniScript significantly outperforms larger open-source models and achieves performance comparable to state-of-the-art proprietary models, including Gemini 3-Pro, in both temporal localization and multi-field semantic accuracy.
[191] Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding cs.CV | cs.AIPDF
Yueying Li, Fengxiang Wang, Yan Li, Mingshuo Chen, Mengying Zhao
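TL;DR: This paper proposes DualComp, a training-free visual token compression framework that addresses the heavy computational overhead caused by the very large number of visual tokens in ultra-high-resolution remote sensing image understanding. Dynamically guided by a lightweight pre-trained router, it decouples feature processing into two dedicated pathways: the object semantic stream uses a Spatially-Contiguous Semantic Aggregator to compress background, and the scene geometric stream uses an Instruction-Guided Structure Recoverer to reconstruct the spatial topology skeleton.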
Details
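Motivation: Multimodal large language models incur prohibitive computational cost on ultra-high-resolution remote sensing imagery because of the massive number of visual tokens generated. Existing visual token compression methods mostly apply static, uniform strategies and ignore the "Semantic-Geometric Duality" inherent in remote sensing interpretation: object semantic tasks focus on abstract semantics and benefit from aggressive background pruning, whereas scene geometric tasks depend critically on the integrity of spatial topology.
Result: Experiments on the ultra-high-resolution remote sensing benchmark XLRS-Bench show that DualComp achieves high-fidelity remote sensing interpretation at very low computational cost, improving efficiency and accuracy simultaneously.
Insight: The novelty is a task-adaptive dual-stream token compression framework that explicitly distinguishes the semantic and geometric needs of remote sensing tasks and handles each separately. Specifically: 1) a lightweight router provides dynamic task guidance; 2) for object semantics, size-adaptive clustering aggregates redundant background while protecting small objects; 3) for scene geometry, a greedy path-tracing topology completion mechanism reconstructs the spatial skeleton. This offers a new approach to task-oriented visual token compression.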
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated immense potential in Earth observation. However, the massive visual tokens generated when processing Ultra-High-Resolution (UHR) imagery introduce prohibitive computational overhead, severely bottlenecking their inference efficiency. Existing visual token compression methods predominantly adopt static and uniform compression strategies, neglecting the inherent “Semantic-Geometric Duality” in remote sensing interpretation tasks. Specifically, object semantic tasks focus on the abstract semantics of objects and benefit from aggressive background pruning, whereas scene geometric tasks critically rely on the integrity of spatial topology. To address this challenge, we propose DualComp, a task-adaptive dual-stream token compression framework. Dynamically guided by a lightweight pre-trained router, DualComp decouples feature processing into two dedicated pathways. In the object semantic stream, the Spatially-Contiguous Semantic Aggregator (SCSA) utilizes size-adaptive clustering to aggregate redundant background while protecting small objects. In the scene geometric stream, the Instruction-Guided Structure Recoverer (IGSR) introduces a greedy path-tracing topology completion mechanism to reconstruct spatial skeletons. Experiments on the UHR remote sensing benchmark XLRS-Bench demonstrate that DualComp accomplishes high-fidelity remote sensing interpretation at an exceptionally low computational cost, achieving simultaneous improvements in both efficiency and accuracy.
[192] BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning cs.CV | cs.AIPDF
Zekun Qian, Ruize Han, Wei Feng
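TL;DR: This paper proposes BoxTuning to address the lack of fine-grained object grounding in multimodal large language models for video question answering. It renders colored bounding boxes and trajectory trails directly onto video frames as visual prompts and keeps only a concise color-to-object legend as text, which significantly reduces text token cost and preserves full temporal resolution.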
Details
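Motivation: Existing MLLMs encode video frames holistically and lack explicit mechanisms for fine-grained object grounding. Recent work serializes bounding box coordinates as text tokens, but this text-coordinate paradigm has a fundamental modality mismatch: object information is inherently visual, and encoding it as text incurs a high token cost that forces aggressive temporal downsampling.
Result: Experiments on five video QA benchmarks (CLEVRER, Perception Test, STAR, NExT-QA, and IntentQA) show that BoxTuning surpasses text-coordinate baselines on spatially oriented tasks and nearly eliminates the accuracy degradation observed on reasoning-centric tasks.
Insight: The novelty lies in injecting object spatial-temporal information directly into the visual modality: visual prompts (colored boxes and trails) encode object position and motion, which is more natural and efficient than serializing coordinates as text, significantly reduces text token overhead, and preserves inter-frame dynamics.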
Abstract: Object-level spatial-temporal understanding is essential for video question answering, yet existing multimodal large language models (MLLMs) encode frames holistically and lack explicit mechanisms for fine-grained object grounding. Recent work addresses this by serializing bounding box coordinates as text tokens, but this text-coordinate paradigm suffers from a fundamental modality mismatch: object information is inherently visual, yet encoding it as text incurs a high token cost that forces aggressive temporal downsampling. We propose BoxTuning, which resolves this mismatch by injecting object spatial-temporal information directly into the visual modality. Colored bounding boxes and trajectory trails are rendered onto video frames as visual prompts, with only a concise color-to-object legend retained as text. This reduces the token cost significantly, achieving 87-93% text token reduction in practice. It also preserves full temporal resolution, where the trajectory trails further encode inter-frame motion direction and speed within each keyframe, recovering fine-grained dynamics that text-coordinate methods are forced to discard. Experimental results on five video QA benchmarks (CLEVRER, Perception Test, STAR, NExT-QA, IntentQA) show that BoxTuning surpasses text-coordinate baselines on spatially oriented tasks and nearly eliminates the accuracy degradation observed on reasoning-centric tasks, establishing visual prompting as a more natural and efficient paradigm for conveying object information to video MLLMs.
[193] Sparse Hypergraph-Enhanced Frame-Event Object Detection with Fine-Grained MoE cs.CVPDF
Wei Bao, Yuehan Wang, Tianhang Zhou, Siqi Li, Yue Gao
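TL;DR: This paper proposes Hyper-FEOD, a framework that efficiently fuses RGB images with event streams through Sparse Hypergraph-enhanced Cross-Modal Fusion (S-HCF) and a Fine-Grained Mixture of Experts (FG-MoE) enhancement module, achieving robust and lightweight object detection under challenging dynamic conditions.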
Details
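Motivation: Fusing RGB cameras with event streams suffers from prohibitive computational overhead or suboptimal feature fusion because of modality heterogeneity and data redundancy.
Result: On mainstream RGB-Event benchmarks, Hyper-FEOD achieves a superior accuracy-efficiency trade-off, outperforming existing state-of-the-art methods while remaining lightweight enough for real-time edge deployment.
Insight: The novelties include exploiting the sparsity of event streams to build an event-guided activity map for efficient hypergraph modeling, and an FG-MoE module designed around the semantic needs of different image regions that enhances features adaptively through a spatial gating mechanism, combined with a load-balancing loss and a zero-initialization strategy for stable training.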
Abstract: Integrating frame-based RGB cameras with event streams offers a promising solution for robust object detection under challenging dynamic conditions. However, the inherent heterogeneity and data redundancy of these modalities often lead to prohibitive computational overhead or suboptimal feature fusion. In this paper, we propose Hyper-FEOD, a high-performance and efficient detection framework, which synergistically optimizes multi-modal interaction through two core components. First, we introduce Sparse Hypergraph-enhanced Cross-Modal Fusion (S-HCF), which leverages the inherent sparsity of event streams to construct an event-guided activity map. By performing high-order hypergraph modeling exclusively on selected motion-critical sparse tokens, S-HCF captures complex non-local dependencies between RGB and event data while overcoming the traditional complexity bottlenecks of hypergraph computation. Second, we design a Fine-Grained Mixture of Experts (FG-MoE) Enhancement module to address the diverse semantic requirements of different image regions. This module employs specialized hypergraph experts tailored for object boundaries, internal textures, and backgrounds, utilizing a pixel-level spatial gating mechanism to adaptively route and enhance features. Combined with a load-balancing loss and zero-initialization strategy, FG-MoE ensures stable training and precise feature refinement without disrupting the pre-trained backbone’s distribution. Experimental results on mainstream RGB-Event benchmarks demonstrate that Hyper-FEOD achieves a superior accuracy-efficiency trade-off, outperforming state-of-the-art methods while maintaining a lightweight footprint suitable for real-time edge deployment.
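The idea of restricting expensive modeling to motion-critical tokens can be sketched as a top-k selection over an event activity map. In the example below each token has an event count, and only the most active fraction is kept for further processing. Using raw event counts as the activity score and a fixed `keep_ratio` are assumptions for illustration; the abstract does not specify how S-HCF builds or thresholds its activity map.

```python
from typing import List, Sequence


def select_active_tokens(event_counts: Sequence[float],
                         keep_ratio: float = 0.25) -> List[int]:
    """Return indices of the most event-active tokens, in original order.

    Only these tokens would be passed to the costly hypergraph stage;
    at least one token is always kept.
    """
    k = max(1, int(len(event_counts) * keep_ratio))
    ranked = sorted(range(len(event_counts)),
                    key=lambda i: event_counts[i], reverse=True)
    return sorted(ranked[:k])
```

Because the later stage only sees the selected indices, its cost scales with the number of active tokens rather than with the full token grid.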
[194] rPPG-VQA: A Video Quality Assessment Framework for Unsupervised rPPG Training cs.CVPDF
Tianyang Dai, Ming Chang, Yan Chen, Yang Hu
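TL;DR: This paper proposes rPPG-VQA, a framework for assessing whether a video is suitable for training unsupervised remote photoplethysmography (rPPG) models. It analyzes videos in two branches, signal-level and scene-level, combining multi-method-consensus SNR estimation with a multimodal large language model (MLLM) that identifies interference, and uses a two-stage adaptive sampling (TAS) strategy to select high-quality videos, improving unsupervised rPPG performance.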
Details
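Motivation: Training unsupervised rPPG models on low-quality in-the-wild videos severely degrades performance, and existing video quality assessment (VQA) methods are designed mainly for human perception, so they are not suited to judging whether a video is useful for rPPG learning.
Result: Experiments show that training on large-scale in-the-wild videos filtered by the framework substantially improves the accuracy of unsupervised rPPG models on standard benchmarks.
Insight: The novelty is a video quality assessment framework dedicated to rPPG training, which combines signal quality (multi-method SNR estimation) with scene interference (MLLM analysis) and adds an adaptive sampling strategy for building training sets, suggesting a new way to learn physiological signals from unlabeled video.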
Abstract: Unsupervised remote photoplethysmography (rPPG) promises to leverage unlabeled video data, but its potential is hindered by a critical challenge: training on low-quality “in-the-wild” videos severely degrades model performance. An essential step missing here is to assess the suitability of the videos for rPPG model learning before using them for the task. Existing video quality assessment (VQA) methods are mainly designed for human perception and not directly applicable to the above purpose. In this work, we propose rPPG-VQA, a novel framework for assessing video suitability for rPPG. We integrate signal-level and scene-level analyses and design a dual-branch assessment architecture. The signal-level branch evaluates the physiological signal quality of the videos via robust signal-to-noise ratio (SNR) estimation with a multi-method consensus mechanism, and the scene-level branch uses a multimodal large language model (MLLM) to identify interferences like motion and unstable lighting. Furthermore, we propose a two-stage adaptive sampling (TAS) strategy that utilizes the quality score to curate optimal training datasets. Experiments show that by training on large-scale, “in-the-wild” videos filtered by our framework, we can develop unsupervised rPPG models that achieve a substantial improvement in accuracy on standard benchmarks. Our code is available at https://github.com/Tianyang-Dai/rPPG-VQA.
[195] Boxes2Pixels: Learning Defect Segmentation from Noisy SAM Masks cs.CV
Camile Lendering, Erkut Akdag, Egor Bondarev
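TL;DR: This paper proposes Boxes2Pixels, a noise-robust box-to-pixel distillation framework that addresses the scarcity of pixel-level annotations in industrial defect segmentation. Noisy pseudo-masks generated by the Segment Anything Model (SAM) serve as supervision for a student network that has a hierarchical decoder, an auxiliary binary localization head, and a one-sided online self-correction mechanism, which together improve defect segmentation accuracy.
Details
Motivation: Industrial defect segmentation needs precise pixel-level annotations, which are costly and scarce. A common workaround converts inexpensive bounding boxes into pseudo-masks with a foundation segmentation model such as SAM, but these pseudo-labels are systematically noisy on industrial surfaces: they often include background structure by mistake and miss sparse defects.
Result: On a manually annotated wind turbine inspection benchmark, Boxes2Pixels improves anomaly mIoU by +6.97 and binary IoU by +9.71 over the strongest baseline trained under identical weak supervision. Online self-correction raises binary recall by +18.56, and the model uses 80% fewer trainable parameters.
Insight: The key idea is to treat SAM as a noisy teacher rather than a source of ground-truth supervision, and to learn robustly from its pseudo-labels through a hierarchical decoder, an auxiliary localization head, and one-sided online self-correction. This gives an effective distillation framework for weakly supervised segmentation with noisy pseudo-labels, particularly in annotation-scarce settings such as industrial defect inspection.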
Abstract: Accurate defect segmentation is critical for industrial inspection, yet dense pixel-level annotations are rarely available. A common workaround is to convert inexpensive bounding boxes into pseudo-masks using foundation segmentation models such as the Segment Anything Model (SAM). However, these pseudo-labels are systematically noisy on industrial surfaces, often hallucinating background structure while missing sparse defects. To address this limitation, a noise-robust box-to-pixel distillation framework, Boxes2Pixels, is proposed that treats SAM as a noisy teacher rather than a source of ground-truth supervision. Bounding boxes are converted into pseudo-masks offline by SAM, and a compact student is trained with (i) a hierarchical decoder over frozen DINOv2 features for semantic stability, (ii) an auxiliary binary localization head to decouple sparse foreground discovery from class prediction, and (iii) a one-sided online self-correction mechanism that relaxes background supervision when the student is confident, targeting teacher false negatives. On a manually annotated wind turbine inspection benchmark, the proposed Boxes2Pixels improves anomaly mIoU by +6.97 and binary IoU by +9.71 over the strongest baseline trained under identical weak supervision. Moreover, online self-correction increases the binary recall by +18.56, while the model employs 80% fewer trainable parameters. Code is available at https://github.com/CLendering/Boxes2Pixels.
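The one-sided self-correction step described in the abstract can be sketched as a per-pixel loss weighting. The sketch below is an illustration under stated assumptions, not the released code: the function name, the confidence threshold `tau`, and the hard 0/1 weighting are hypothetical. It drops background supervision only where the teacher mask says background but the student is confident the pixel is foreground; teacher foreground labels are never relaxed, which is what makes the rule one-sided.

```python
def one_sided_weights(teacher_mask, student_prob, tau=0.9):
    """Return per-pixel loss weights for a binary pseudo-mask.

    teacher_mask: 2D list of 0/1 pseudo-labels (1 = defect, from the teacher).
    student_prob: 2D list of student foreground probabilities in [0, 1].
    tau: assumed confidence threshold above which background labels are ignored.
    """
    weights = []
    for t_row, p_row in zip(teacher_mask, student_prob):
        weights.append(
            [0.0 if (t == 0 and p >= tau) else 1.0 for t, p in zip(t_row, p_row)]
        )
    return weights


teacher = [[1, 0], [0, 0]]
student = [[0.20, 0.95], [0.50, 0.10]]
print(one_sided_weights(teacher, student))
```

Only the top-right pixel receives weight 0: the teacher labeled it background while the student is confident it is a defect, so it is treated as a likely teacher false negative and excluded from the background loss.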
[196] RADA: Region-Aware Dual-encoder Auxiliary learning for Barely-supervised Medical Image Segmentation cs.CV
Shuang Zeng, Boxu Xie, Lei Zhu, Xinliang Zhang, Jiakui Hu
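TL;DR: This paper proposes RADA (Region-Aware Dual-encoder Auxiliary learning), a method for medical image segmentation under very sparse annotation. It uses a dual-encoder framework pre-trained on Alpha-CLIP to extract fine-grained, region-specific visual features and combines them with text-level semantic guidance, giving pixel-level segmentation region-aware semantic supervision. Experiments on LA2018, KiTS19, and LiTS show state-of-the-art performance under extremely sparse annotation settings.
Details
Motivation: Existing methods typically propagate sparse annotations through geometric continuity to generate pseudo-labels, but this lacks semantic understanding and yields low-quality pseudo-labels. Medical image segmentation is inherently a pixel-level visual understanding task whose accuracy depends on the quality of local, fine-grained visual features.
Result: On LA2018, KiTS19, and LiTS, RADA achieves state-of-the-art performance under extremely sparse annotation settings and generalizes robustly across datasets.
Insight: The contribution is a dual-encoder framework pre-trained on Alpha-CLIP that combines image-level fine-grained visual features with text-level semantic guidance. The resulting region-aware semantic supervision improves segmentation quality when annotations are sparse.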
Abstract: Deep learning has greatly advanced medical image segmentation, but its success relies heavily on fully supervised learning, which requires dense annotations that are costly and time-consuming for 3D volumetric scans. Barely-supervised learning reduces annotation burden by using only a few labeled slices per volume. Existing methods typically propagate sparse annotations to unlabeled slices through geometric continuity to generate pseudo-labels, but this strategy lacks semantic understanding, often resulting in low-quality pseudo-labels. Furthermore, medical image segmentation is inherently a pixel-level visual understanding task, where accuracy fundamentally depends on the quality of local, fine-grained visual features. Inspired by this, we propose RADA, a novel Region-Aware Dual-encoder Auxiliary learning pipeline which introduces a dual-encoder framework pre-trained on Alpha-CLIP to extract fine-grained, region-specific visual features from the original images and limited annotations. The framework combines image-level fine-grained visual features with text-level semantic guidance, providing region-aware semantic supervision that bridges image-level semantics and pixel-level segmentation. Integrated into a triple-view training framework, RADA achieves SOTA performance under extremely sparse annotation settings on LA2018, KiTS19 and LiTS, demonstrating robust generalization across diverse datasets.
[197] Do Instance Priors Help Weakly Supervised Semantic Segmentation? cs.CV
Anurag Das, Anna Kukleva, Xinting Hu, Yuki M. Asano, Bernt Schiele
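TL;DR: This paper proposes SeSAM, a framework that pairs the foundation segmentation model SAM (Segment Anything Model) with weak labels (coarse masks, scribbles, and points) to reduce annotation cost for semantic segmentation. It decomposes class masks into connected components, samples point prompts along object skeletons, selects SAM masks by weak-label coverage, and iteratively refines labels with pseudo-labels, so that SAM-generated masks become usable for semantic segmentation. Integrated with a semi-supervised learning framework, SeSAM balances ground-truth labels, SAM-based pseudo-labels, and high-confidence pseudo-labels, significantly improving segmentation quality.
Details
Motivation: Semantic segmentation requires dense pixel-level annotations, which are costly and time-consuming. The paper studies how weak labels (coarse masks, scribbles, and points) can be combined with SAM to reduce that burden, given that SAM was designed for instance-based segmentation and cannot be used directly for class-based semantic segmentation.
Result: Extensive experiments across multiple benchmarks and weak annotation types show that SeSAM consistently outperforms weakly supervised baselines while substantially reducing annotation cost relative to fine supervision.
Insight: The contribution is the adaptation of SAM to class-based semantic segmentation through class-mask decomposition, point-prompt sampling, mask selection, and iterative pseudo-label refinement, all driven by weak labels. Viewed objectively, the method combines the segmentation strength of a foundation model with the cost efficiency of weak supervision, giving a practical framework for lowering annotation overhead.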
Abstract: Semantic segmentation requires dense pixel-level annotations, which are costly and time-consuming to acquire. To address this, we present SeSAM, a framework that uses a foundational segmentation model, i.e. Segment Anything Model (SAM), with weak labels, including coarse masks, scribbles, and points. SAM, originally designed for instance-based segmentation, cannot be directly used for semantic segmentation tasks. In this work, we identify specific challenges faced by SAM and determine appropriate components to adapt it for class-based segmentation using weak labels. Specifically, SeSAM decomposes class masks into connected components, samples point prompts along object skeletons, selects SAM masks using weak-label coverage, and iteratively refines labels using pseudo-labels, enabling SAM-generated masks to be effectively used for semantic segmentation. Integrated with a semi-supervised learning framework, SeSAM balances ground-truth labels, SAM-based pseudo-labels, and high-confidence pseudo-labels, significantly improving segmentation quality. Extensive experiments across multiple benchmarks and weak annotation types show that SeSAM consistently outperforms weakly supervised baselines while substantially reducing annotation cost relative to fine supervision.
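The first SeSAM step, decomposing a class mask into connected components so that each component can be prompted separately, can be sketched as follows. This is a minimal illustration, assuming a binary mask stored as a 2D list and 4-connectivity; the function name is hypothetical, and skeleton-based point sampling and SAM mask selection are deliberately left out.

```python
from collections import deque


def connected_components(mask):
    """Split a binary class mask into 4-connected components.

    mask: 2D list of 0/1 values for one semantic class.
    Returns a list of components, each a list of (row, col) pixels.
    """
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            component = []
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(component)
    return components


class_mask = [
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
for comp in connected_components(class_mask):
    print(len(comp), comp)
```

Each component approximates one object instance of the class, which is the granularity an instance-oriented model such as SAM expects; point prompts would then be sampled inside each component.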
[198] Precision Synthesis of Multi-Tracer PET via VLM-Modulated Rectified Flow for Stratifying Mild Cognitive Impairment cs.CV
Tuo Liu, Shuijin Lin, Shaozhen Yan, Haifeng Wang, Jie Lu
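TL;DR: This paper proposes DIReCT++, a 3D rectified flow model combined with a domain-adapted vision-language model (BiomedCLIP), for synthesizing multi-tracer positron emission tomography (PET) images from magnetic resonance imaging (MRI) and basic clinical information. It targets the high cost and radiation exposure of PET imaging in order to support early screening for Alzheimer's disease (AD).
Details
Motivation: The biological definition of Alzheimer's disease relies on multi-modal neuroimaging, but the clinical utility of PET is limited by cost and radiation exposure, which hinders early screening at preclinical or prodromal stages. Existing generative models can synthesize PET from MRI, yet subject-specific precision remains a major challenge.
Result: Extensive evaluation on multi-center datasets shows that DIReCT++ produces synthetic PET images (18F-AV-45 and 18F-FDG) with superior fidelity and generalizability and accurately reproduces disease-specific patterns. Combining the synthesized PET with MRI enables precise, personalized stratification of mild cognitive impairment (MCI).
Insight: The contribution is the combination of a 3D rectified flow architecture with a domain-adapted vision-language model (BiomedCLIP), which uses clinical scores and imaging knowledge for text-guided, personalized generation. This captures complex cross-modal and cross-tracer relationships and provides a scalable, data-efficient tool for early diagnosis and prognostic prediction of AD.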
Abstract: The biological definition of Alzheimer’s disease (AD) relies on multi-modal neuroimaging, yet the clinical utility of positron emission tomography (PET) is limited by cost and radiation exposure, hindering early screening at preclinical or prodromal stages. While generative models offer a promising alternative by synthesizing PET from magnetic resonance imaging (MRI), achieving subject-specific precision remains a primary challenge. Here, we introduce DIReCT$++$, a Domain-Informed ReCTified flow model for synthesizing multi-tracer PET from MRI combined with fundamental clinical information. Our approach integrates a 3D rectified flow architecture to capture complex cross-modal and cross-tracer relationships with a domain-adapted vision-language model (BiomedCLIP) that provides text-guided, personalized generation using clinical scores and imaging knowledge. Extensive evaluations on multi-center datasets demonstrate that DIReCT$++$ not only produces synthetic PET images ($^{18}$F-AV-45 and $^{18}$F-FDG) of superior fidelity and generalizability but also accurately recapitulates disease-specific patterns. Crucially, combining these synthesized PET images with MRI enables precise personalized stratification of mild cognitive impairment (MCI), advancing a scalable, data-efficient tool for the early diagnosis and prognostic prediction of AD. The source code will be released on https://github.com/ladderlab-xjtu/DIReCT-PLUS.
[199] Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene Understanding cs.CV
Shivam Sharma, Sankalp Nagaonkar, Ashish Choithani, Ashutosh Trivedi
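TL;DR: This paper evaluates how internal reasoning traces (called “thought streams”) affect video scene understanding in Gemini vision-language models. Using four configurations of Gemini 2.5 Flash and Flash Lite on scenes extracted from 100 hours of video, it studies how thought length relates to output quality, where the gains saturate, and what the models focus on. Quality gains from additional thinking plateau quickly, Flash Lite offers the best balance between quality and token usage, and tight reasoning budgets cause the model to add content to the final output that it never reasoned about (compression-step hallucination).
Details
Motivation: The paper asks how the internal reasoning process (the thought stream) of a vision-language model affects output quality in video scene understanding: how thought length relates to performance, where returns diminish, and what the model actually attends to.
Result: The evaluation covers scenes extracted from 100 hours of video, with three new metrics (Contentfulness, Thought-Final Coverage, and Dominant Entity Analysis) and GPT-5 as an independent judge. Quality gains plateau after the first few hundred tokens; Flash Lite gives the best trade-off between quality and token efficiency; insufficient reasoning budgets cause compression-step hallucination; and Flash and Flash Lite produce similar thought-stream content in different styles (Flash discusses its reasoning process, while Lite focuses on describing the scene).
Insight: The contributions are three quantitative metrics for evaluating thought streams, evidence of a nonlinear relationship between thought length and video understanding performance with a clear plateau, and the finding that model tier mainly affects reasoning style rather than content focus. Together these give empirical grounding for tuning the reasoning efficiency of vision-language models.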
Abstract: We benchmark how internal reasoning traces, which we call thought streams, affect video scene understanding in vision-language models. Using four configurations of Google’s Gemini 2.5 Flash and Flash Lite across scenes extracted from 100 hours of video, we ask three questions: does more thinking lead to better outputs, where do the gains stop, and what do these models actually think about? We introduce three evaluation metrics. Contentfulness measures how much of the thought stream is useful scene content versus meta-commentary. Thought-Final Coverage measures how faithfully the thought stream translates into the final output. Dominant Entity Analysis identifies which subjects, actions, and settings the model focuses on. GPT-5 serves as an independent judge. We find that quality gains from additional thinking plateau quickly, with most improvement occurring in the first few hundred tokens. Flash Lite offers the best balance between quality and token usage. Tight reasoning budgets cause the model to add content in the final output that it never reasoned about, a form of compression-step hallucination. Despite being different model tiers, Flash and Flash Lite produce similar thought streams, though they differ in style: Flash discusses its reasoning process, while Lite focuses on describing the scene.
[200] MedP-CLIP: Medical CLIP with Region-Aware Prompt Integration cs.CV
Jiahui Peng, He Yao, Jingwen Li, Yanzhou Su, Sibo Ju
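TL;DR: This paper proposes MedP-CLIP, a region-aware medical vision-language model. Its feature-level region prompt integration mechanism lets it respond flexibly to several prompt forms (points, bounding boxes, masks) while keeping global context when focusing on a local region. The model is pre-trained on a large-scale dataset of more than 6.4 million medical images and 97.3 million region-level annotations, and it significantly outperforms baselines on zero-shot recognition, interactive segmentation, and empowering multimodal large language models.
Details
Motivation: CLIP performs well at global image understanding and zero-shot transfer, but the core of medical image analysis is fine-grained understanding of specific anatomical structures or lesion regions. Precisely understanding region-of-interest (RoI) information provided by medical professionals or perception models is therefore crucial.
Result: Experiments show that MedP-CLIP significantly outperforms baseline methods on a range of medical tasks, including zero-shot recognition, interactive segmentation, and empowering multimodal large language models.
Insight: The main contribution is combining medical prior knowledge with a feature-level region prompt integration mechanism, so the model handles multiple region prompt forms while staying aware of global context. This gives medical AI a scalable, plug-and-play visual backbone that joins holistic image understanding with precise regional analysis.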
Abstract: Contrastive Language-Image Pre-training (CLIP) has demonstrated outstanding performance in global image understanding and zero-shot transfer through large-scale text-image alignment. However, the core of medical image analysis often lies in the fine-grained understanding of specific anatomical structures or lesion regions. Therefore, precisely comprehending region-of-interest (RoI) information provided by medical professionals or perception models becomes crucial. To address this need, we propose MedP-CLIP, a region-aware medical vision-language model (VLM). MedP-CLIP innovatively integrates medical prior knowledge and designs a feature-level region prompt integration mechanism, enabling it to flexibly respond to various prompt forms (e.g., points, bounding boxes, masks) while maintaining global contextual awareness when focusing on local regions. We pre-train the model on a meticulously constructed large-scale dataset (containing over 6.4 million medical images and 97.3 million region-level annotations), equipping it with cross-disease and cross-modality fine-grained spatial semantic understanding capabilities. Experiments demonstrate that MedP-CLIP significantly outperforms baseline methods in various medical tasks, including zero-shot recognition, interactive segmentation, and empowering multimodal large language models. This model provides a scalable, plug-and-play visual backbone for medical AI, combining holistic image understanding with precise regional analysis.
[201] 3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis cs.CV | cs.LG | cs.MM
Stefan Schulz, Fernando Edelstein, Hannah Dröge, Matthias B. Hullin, Markus Plack
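TL;DR: This paper proposes 3DTV, a feedforward network for real-time sparse-view interpolation that balances multi-camera redundancy against the latency constraints of interactive applications in real-time free-viewpoint rendering. The method combines lightweight geometry with learning: a Delaunay-based triplet selection ensures angular coverage for each target view, and a pose-aware depth module estimates a coarse-to-fine depth pyramid, enabling efficient feature reprojection and occlusion-aware blending.
Details
Motivation: Real-time free-viewpoint rendering must balance redundant multi-camera data against the strict latency constraints of interactive applications.
Result: Experiments on challenging multi-view video datasets show that 3DTV consistently achieves a strong balance of quality and efficiency and outperforms recent real-time novel-view synthesis baselines.
Insight: The contributions include a Delaunay-based triplet selection strategy that ensures angular coverage; pose-aware, coarse-to-fine depth pyramid estimation that supports efficient reprojection and occlusion handling; a feedforward design that needs no scene-specific optimization or retraining, which makes it practical for AR/VR, telepresence, and other interactive applications; and the avoidance of explicit scene proxies, which improves rendering robustness across diverse scenes.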
Abstract: Real-time free-viewpoint rendering requires balancing multi-camera redundancy with the latency constraints of interactive applications. We address this challenge by combining lightweight geometry with learning and propose 3DTV, a feedforward network for real-time sparse-view interpolation. A Delaunay-based triplet selection ensures angular coverage for each target view. Building on this, we introduce a pose-aware depth module that estimates a coarse-to-fine depth pyramid, enabling efficient feature reprojection and occlusion-aware blending. Unlike methods that require scene-specific optimization, 3DTV runs feedforward without retraining, making it practical for AR/VR, telepresence, and interactive applications. Our experiments on challenging multi-view video datasets demonstrate that 3DTV consistently achieves a strong balance of quality and efficiency, outperforming recent real-time novel-view baselines. Crucially, 3DTV avoids explicit proxies, enabling robust rendering across diverse scenes. This makes it a practical solution for low-latency multi-view streaming and interactive rendering. Project Page: https://stefanmschulz.github.io/3DTV_webpage/
[202] H-SPAM: Hierarchical Superpixel Anything Model cs.CV
Julien Walther, Rémi Giraud, Michaël Clément
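TL;DR: H-SPAM is a hierarchical superpixel generation model. Starting from a fine partition, it builds an accurate, regular, and perfectly nested hierarchy of superpixels through a two-phase region merging process. The merging is guided by deep features and external object priors, and it can be modulated by visual attention maps or user input so that important regions are preserved longer in the hierarchy.
Details
Motivation: Existing superpixel methods have hit a ceiling in segmentation accuracy and produce noisy superpixel shapes. Most of them also output a single fixed-scale partition, which limits their use in vision pipelines that need multi-scale representations.
Result: On standard benchmarks, H-SPAM strongly outperforms existing hierarchical methods in both accuracy and regularity, while performing on par with the most recent state-of-the-art non-hierarchical methods.
Insight: The contribution is a unified framework that generates hierarchical superpixels by combining deep features, object priors, and a controllable two-phase merge (first preserving object consistency, then allowing grouping across objects). Support for modulation by attention or user input yields a more flexible, semantically aware multi-scale image representation.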
Abstract: Superpixels offer a compact image representation by grouping pixels into coherent regions. Recent methods have reached a plateau in terms of segmentation accuracy by generating noisy superpixel shapes. Moreover, most existing approaches produce a single fixed-scale partition that limits their use in vision pipelines that would benefit from multi-scale representations. In this work, we introduce H-SPAM (Hierarchical Superpixel Anything Model), a unified framework for generating accurate, regular, and perfectly nested hierarchical superpixels. Starting from a fine partition, guided by deep features and external object priors, H-SPAM constructs the hierarchy through a two-phase region merging process that first preserves object consistency and then allows controlled inter-object grouping. The hierarchy can also be modulated using visual attention maps or user input to preserve important regions longer in the hierarchy. Experiments on standard benchmarks show that H-SPAM strongly outperforms existing hierarchical methods in both accuracy and regularity, while performing on par with most recent state-of-the-art non-hierarchical methods. Code and pretrained models are available: https://github.com/waldo-j/hspam.
[203] Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection cs.CV
You Su, Yonghong Song, Jingqi Chen, Zehan Wen
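TL;DR: This paper proposes Seg2Change, an adapter framework that adapts open-vocabulary semantic segmentation models to remote sensing change detection, removing the restriction of existing methods to predefined classes.
Details
Motivation: Existing change detection methods are limited to the predefined classes in their training data and do not scale to changes in arbitrary categories, while open-vocabulary semantic segmentation models have not yet been applied effectively to open-vocabulary change detection.
Result: Seg2Change achieves state-of-the-art open-vocabulary change detection performance, with gains of +9.52 IoU on WHU-CD and +5.50 mIoU on SECOND.
Insight: The contributions are a category-agnostic change detection dataset, CA-CDD, and a category-agnostic change head that detects transitions of arbitrary categories and indexes them to specific classes. Together they enable open-vocabulary change detection through a simple and effective adapter framework.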
Abstract: Change detection is a fundamental task in remote sensing, aiming to quantify the impacts of human activities and ecological dynamics on land-cover changes. Existing change detection methods are limited to predefined classes in training datasets, which constrains their scalability in real-world scenarios. In recent years, numerous advanced open-vocabulary semantic segmentation models have emerged for remote sensing imagery. However, there is still a lack of an effective framework for directly applying these models to open-vocabulary change detection (OVCD), a novel task that integrates vision and language to detect changes across arbitrary categories. To address these challenges, we first construct a category-agnostic change detection dataset, termed CA-CDD. Further, we design a category-agnostic change head to detect the transitions of arbitrary categories and index them to specific classes. Building on these, we propose Seg2Change, an adapter designed to adapt open-vocabulary semantic segmentation models to the change detection task. Without bells and whistles, this simple yet effective framework achieves state-of-the-art OVCD performance (+9.52 IoU on WHU-CD and +5.50 mIoU on SECOND). Our code is released at https://github.com/yogurts-sy/Seg2Change.
[204] Bridging the RGB-IR Gap: Consensus and Discrepancy Modeling for Text-Guided Multispectral Detection cs.CV
Jiaqi Wu, Zhen Wang, Enhao Huang, Kangqing Shen, Yulin Wang
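TL;DR: This paper proposes a semantic bridge fusion framework for text-guided multispectral object detection. Text serves as a shared semantic bridge that aligns the responses of the RGB and infrared (IR) modalities, and bi-support modeling with consensus support and discrepancy support fuses cross-modal information to improve detection performance.
Details
Motivation: Existing methods use text only as an auxiliary semantic enhancement signal and do not exploit its guiding role to bridge the granularity asymmetry between RGB and IR. Conventional attention-based fusion also tends to emphasize stable consensus and overlook valuable cross-modal discrepancies.
Result: Extensive experiments on multispectral benchmarks demonstrate the effectiveness of the proposed fusion framework and its superior detection performance.
Insight: The contributions are the use of text as a shared semantic bridge for modality alignment; bi-support modeling with consensus and discrepancy support, introduced into fusion via dynamic recalibration as a structured inductive bias; and a bidirectional semantic alignment module that strengthens closed-loop vision-text guidance.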
Abstract: Text-guided multispectral object detection uses text semantics to guide semantic-aware cross-modal interaction between RGB and IR for more robust perception. However, notable limitations remain: (1) existing methods often use text only as an auxiliary semantic enhancement signal, without exploiting its guiding role to bridge the inherent granularity asymmetry between RGB and IR; and (2) conventional data-driven attention-based fusion tends to emphasize stable consensus while overlooking potentially valuable cross-modal discrepancies. To address these issues, we propose a semantic bridge fusion framework with bi-support modeling for multispectral object detection. Specifically, text is used as a shared semantic bridge to align RGB and IR responses under a unified category condition, while the recalibrated thermal semantic prior is projected onto the RGB branch for semantic-level mapping fusion. We further formulate RGB-IR interaction evidence into the regular consensus support and the complementary discrepancy support that contains potentially discriminative cues, and introduce them into fusion via dynamic recalibration as a structured inductive bias. In addition, we design a bidirectional semantic alignment module for closed-loop vision-text guidance enhancement. Extensive experiments demonstrate the effectiveness of the proposed fusion framework and its superior detection performance on multispectral benchmarks. Code is available at https://github.com/zhenwang5372/Bridging-RGB-IR-Gap.
[205] Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models cs.CV
Kexin Ma, Jing Xiao, Chaofeng Chen, Geyong Min, Guibo Zhu
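TL;DR: This paper proposes DeSAP, a Decoupled Similarity-Aware Pruning method for precise, task-aware token pruning within the visual encoder of large vision-language models (LVLMs). A decoupled similarity captures fine-grained cross-modal relevance between visual features and text tokens, and it is combined with visual saliency signals to guide pruning decisions, substantially reducing compute while preserving model performance.
Details
Motivation: Existing token pruning methods typically rely on a single attention source from one LVLM component, whose biased attention distribution leads to incomplete and suboptimal pruning decisions. The paper aims for more precise, task-aware visual token pruning.
Result: Extensive experiments across diverse benchmarks and architectures show that DeSAP consistently outperforms SOTA methods in both accuracy and efficiency. On LLaVA-1.5-7B, DeSAP retains only 11.1% of visual tokens, achieving a 10x FLOPs reduction and a 2.3x prefill speedup while maintaining 98.1% of the original performance.
Insight: The main contribution is the decoupled similarity, which explicitly captures cross-modal task relevance and is combined with visual attention signals to give pruning more complete guidance from both task-related and visual cues. This yields a more robust pruning framework, especially at aggressive pruning ratios. Viewed objectively, explicitly bringing task semantics (through text) into the assessment of visual token importance is the key and effective idea.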
Abstract: Token pruning has emerged as an effective approach to reduce the substantial computational overhead of Large Vision-Language Models (LVLMs) by discarding less informative visual tokens while preserving performance. However, existing methods typically rely on individual attention sources from different LVLM components, resulting in incomplete and suboptimal pruning decisions due to biased attention distributions. To address this problem, we propose DeSAP, a novel Decoupled Similarity-Aware Pruning method for precise, task-aware token pruning within the visual encoder. Specifically, DeSAP introduces a decoupled similarity to capture fine-grained cross-modal relevance between visual features and text tokens, providing explicit task-related guidance for pruning. By integrating decoupled similarity with visual saliency signals derived from visual attention, DeSAP performs token pruning under the guidance of both task-related and visual cues, enabling robust pruning even under aggressive pruning ratios. Extensive experiments across diverse benchmarks and architectures show that DeSAP consistently outperforms SOTA methods in both accuracy and efficiency. On LLaVA-1.5-7B, DeSAP achieves a 10 times FLOPs reduction and a 2.3 times prefill speedup by retaining only 11.1% of visual tokens, while maintaining 98.1% of the original performance.
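The pruning decision described above, ranking visual tokens by a mix of task relevance and visual saliency and keeping only a fixed fraction, can be sketched in a few lines. This is an illustration only: the linear mix with weight `alpha`, the function name, and the toy scores are assumptions, and the abstract does not specify how the decoupled similarity itself is computed. For scale, assuming the 576 visual tokens that LLaVA-1.5 commonly uses, a keep ratio of about 11.1% corresponds to 64 retained tokens.

```python
def prune_tokens(task_scores, saliency_scores, keep_ratio, alpha=0.5):
    """Return the sorted indices of visual tokens to keep.

    task_scores: per-token relevance to the text prompt (stand-in for the
        paper's decoupled similarity).
    saliency_scores: per-token visual saliency (for example from attention).
    keep_ratio: fraction of tokens to retain.
    alpha: assumed mixing weight between the two cues.
    """
    n = len(task_scores)
    k = max(1, round(n * keep_ratio))
    combined = [
        alpha * t + (1.0 - alpha) * s
        for t, s in zip(task_scores, saliency_scores)
    ]
    # Rank tokens by combined score, keep the top k, restore spatial order.
    top = sorted(range(n), key=lambda i: combined[i], reverse=True)[:k]
    return sorted(top)


task = [0.9, 0.1, 0.5, 0.2]
saliency = [0.1, 0.2, 0.9, 0.1]
print(prune_tokens(task, saliency, keep_ratio=0.5))

# Token budget at the reported retention rate, assuming 576 input tokens.
print(len(prune_tokens([0.0] * 576, [0.0] * 576, keep_ratio=0.111)))
```

Returning the kept indices in their original order preserves the spatial layout of the surviving tokens, which matters when positional information is attached later in the pipeline.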
[206] Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding cs.CV
Tencent Hunyuan Team
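TL;DR: This paper proposes Multi-Stream Scene Script (MTSS), a new paradigm for video captioning. Stream Factorization decouples a video into complementary streams (Reference, Shot, Event, and Global), and Relational Grounding reconnects those streams through explicit identity and temporal links. The result replaces conventional monolithic paragraph captions, which entangle visual, auditory, and identity information.
Details
Motivation: The dominant video captioning paradigm, driven by multimodal large language models (MLLMs), renders a video as a monolithic narrative paragraph. This entangles visual, auditory, and identity information, which compromises representational fidelity and limits scalability because a local edit can trigger a global rewrite. The paper targets this structural bottleneck.
Result: MTSS reduces the total error rate on the Video-SALMONN-2 benchmark by 25% on average and improves performance on the Daily-Omni reasoning benchmark by 67% on average. In video generation, prompting with MTSS (without architectural changes) yields substantial human-rated improvements: 45% in cross-shot identity consistency, 56% in audio-visual alignment, and 71% in temporal controllability.
Insight: The core contribution is the pair of principles, Stream Factorization and Relational Grounding, which turn video description from a single entangled text into a structured, decomposable, explicitly grounded scene script. This improves learnability and performance on video understanding and generation tasks and provides a more flexible, scalable interface for video editing and controllable generation.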
Abstract: Advances in Multimodal Large Language Models (MLLMs) are transforming video captioning from a descriptive endpoint into a semantic interface for both video understanding and generation. However, the dominant paradigm still casts videos as monolithic narrative paragraphs that entangle visual, auditory, and identity information. This dense coupling not only compromises representational fidelity but also limits scalability, since even local edits can trigger global rewrites. To address this structural bottleneck, we propose Multi-Stream Scene Script (MTSS), a novel paradigm that replaces monolithic text with factorized and explicitly grounded scene descriptions. MTSS is built on two core principles: Stream Factorization, which decouples a video into complementary streams (Reference, Shot, Event, and Global), and Relational Grounding, which reconnects these isolated streams through explicit identity and temporal links to maintain holistic video consistency. Extensive experiments demonstrate that MTSS consistently enhances video understanding across various models, achieving an average reduction of 25% in the total error rate on Video-SALMONN-2 and an average performance gain of 67% on the Daily-Omni reasoning benchmark. It also narrows the performance gap between smaller and larger MLLMs, indicating a substantially more learnable caption interface. Finally, even without architectural adaptation, replacing monolithic prompts with MTSS in multi-shot video generation yields substantial human-rated improvements: a 45% boost in cross-shot identity consistency, a 56% boost in audio-visual alignment, and a 71% boost in temporal controllability.
[207] Empowering Video Translation using Multimodal Large Language Models cs.CV
Bingzheng QU, Kehai Chen, Xuefeng Bai, Min Zhang
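TL;DR: This paper presents the first comprehensive survey of video translation based on multimodal large language models (MLLMs). It organizes the field around three roles that MLLMs play in video translation (Semantic Reasoner, Expressive Performer, and Visual Synthesizer) and discusses open challenges and future research directions.
Details
Motivation: MLLMs are increasingly important in video translation, and many surveys cover general video-language understanding, but a focused, systematic review of how MLLMs empower video translation has been missing. The paper fills this gap.
Result: MLLM-driven video translation methods match or exceed traditional cascaded pipelines in translation quality, are more robust in zero-shot settings and multi-speaker scenarios, and jointly model semantic fidelity, timing, speaker identity, and emotional consistency.
Insight: The contribution is a taxonomy built around three core roles of MLLMs in video translation (semantic reasoning, expressive generation, and visual synthesis), which gives a structured framework for understanding and advancing the field. Viewed objectively, systematically integrating the multimodal understanding and generation abilities of MLLMs into an end-to-end video translation pipeline is the key advance over the traditional paradigm of separate processing stages.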
Abstract: Recent developments in video translation have further enhanced cross-lingual access to video content, with multimodal large language models (MLLMs) playing an increasingly important supporting role. With strong multimodal understanding, reasoning, and generation capabilities, MLLM-based video translation systems are overcoming the limitations of traditional cascaded pipelines that separately handle automatic speech recognition, machine translation, text-to-speech and lip synchronization. These MLLM-powered approaches not only achieve competitive or superior translation quality, but also demonstrate stronger robustness in zero-shot settings and multi-speaker scenarios, while jointly modeling semantic fidelity, timing, speaker identity, and emotional consistency. However, despite the rapid progress of MLLMs and extensive surveys on general video-language understanding, a focused and systematic review of how MLLMs empower video translation tasks is still lacking. To fill this gap, we provide the first comprehensive overview of MLLM-based video translation, organized around a three-role taxonomy: 1) Semantic Reasoner, which characterizes how MLLMs perform video understanding, temporal reasoning, and multimodal fusion; 2) Expressive Performer, which analyzes LLM-driven and LLM-augmented techniques for expressive, controllable speech generation; and 3) Visual Synthesizer, which examines different types of video generators for high-fidelity lip-sync and visual alignment. Finally, we discuss open challenges in video understanding, temporal modeling, and multimodal alignment, and outline promising future research directions for MLLM-powered video translation.
[208] Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale cs.CV | cs.CG
Dongxu Wei, Qi Xu, Zhiqi Li, Hangning Zhou, Cong Qiu
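TL;DR: This paper proposes generating 3D scenes directly in an implicit 3D latent space, addressing the representation redundancy and limited spatial consistency of existing methods built on 2D diffusion models. It first builds a 3D Representation Autoencoder (3DRAE) that fuses multi-view 2D semantic representations into a unified 3D latent representation, then introduces a 3D Diffusion Transformer (3DDiT) that performs diffusion modeling in that space, giving efficient, spatially consistent 3D scene generation with support for diverse conditioning configurations.
Details
Motivation: 3D scene generation has relied mainly on 2D multi-view or video diffusion models, which reduce 3D spatial extrapolation to 2D temporal extension. This causes two fundamental problems: representation redundancy and limited spatial consistency.
Result: The method generates scenes efficiently and with spatial consistency in the 3D latent space, and it decodes to images and point maps along arbitrary camera trajectories without the per-trajectory diffusion sampling that 2D-based methods require.
Insight: The core contributions are the first 3D-grounded representation space for scene generation (3DRAE) and the 3DDiT architecture that performs diffusion modeling in it. They move from view-coupled 2D representations to a view-decoupled 3D representation, markedly improving generation efficiency and spatial consistency.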
Abstract: 3D scene generation has long been dominated by 2D multi-view or video diffusion models. This is due not only to the lack of scene-level 3D latent representation, but also to the fact that most scene-level 3D visual data exists in the form of multi-view images or videos, which are naturally compatible with 2D diffusion architectures. Typically, these 2D-based approaches degrade 3D spatial extrapolation to 2D temporal extension, which introduces two fundamental issues: (i) representing 3D scenes via 2D views leads to significant representation redundancy, and (ii) latent space rooted in 2D inherently limits the spatial consistency of the generated 3D scenes. In this paper, we propose, for the first time, to perform 3D scene generation directly within an implicit 3D latent space to address these limitations. First, we repurpose frozen 2D representation encoders to construct our 3D Representation Autoencoder (3DRAE), which grounds view-coupled 2D semantic representations into a view-decoupled 3D latent representation. This enables representing 3D scenes observed from arbitrary numbers of views, at any resolution and aspect ratio, with fixed complexity and rich semantics. Then we introduce 3D Diffusion Transformer (3DDiT), which performs diffusion modeling in this 3D latent space, achieving remarkably efficient and spatially consistent 3D scene generation while supporting diverse conditioning configurations. Moreover, since our approach directly generates a 3D scene representation, it can be decoded to images and optional point maps along arbitrary camera trajectories without requiring a per-trajectory diffusion sampling pass, which is common in 2D-based approaches.
[209] Video-based Heart Rate Estimation with Angle-guided ROI Optimization and Graph Signal Denoising cs.CV
Gan Pei, Junhao Ning, Boqiu Shen, Yan Zhu, Menghan Hu
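TL;DR: This paper proposes two plug-and-play modules that improve video-based heart rate estimation with remote photoplethysmography (rPPG) under facial motion such as speaking and head shaking. The Angle-guided ROI Adaptive Optimization module quantifies ROI-camera angles to refine motion-affected signals and capture global motion, while the Multi-region Joint Graph Signal Denoising module uses graph signal processing to jointly model intra- and inter-regional ROI signals and suppress motion artifacts. The modules are compatible with reflection-model-based rPPG methods and are validated on three public datasets.
Details
Motivation: rPPG enables non-contact heart rate measurement from facial video, but its performance degrades significantly under facial motion such as speaking and head shaking. Motion artifacts must be addressed to improve estimation accuracy.
Result: Experiments on three public datasets show that using both modules together markedly reduces mean absolute error (MAE), by 20.38% on average relative to the baseline. Ablation studies confirm the effectiveness of each module.
Insight: The contributions are angle-guided ROI optimization, which quantifies the effect of motion, and multi-region joint denoising based on graph signal processing, which models relationships between signals. Both improve the robustness of rPPG in motion scenarios and offer a new approach to motion artifact suppression.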
Abstract: Remote photoplethysmography (rPPG) enables non-contact heart rate measurement from facial videos, but its performance is significantly degraded by facial motions such as speaking and head shaking. To address this issue, we propose two plug-and-play modules. The Angle-guided ROI Adaptive Optimization module quantifies ROI-Camera angles to refine motion-affected signals and capture global motion, while the Multi-region Joint Graph Signal Denoising module jointly models intra- and inter-regional ROI signals using graph signal processing to suppress motion artifacts. The modules are compatible with reflection model-based rPPG methods and validated on three public datasets. Results show that using the two modules jointly markedly reduces MAE, with an average decrease of 20.38% over the baseline, while ablation studies confirm the effectiveness of each module. The work demonstrates the potential of angle-guided optimization and graph-based denoising to enhance rPPG performance in motion scenarios.
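The general idea behind graph-based denoising of multi-region signals can be shown with a generic low-pass graph filter. This is not the paper's Multi-region Joint Graph Signal Denoising module: the neighbor-averaging rule, the blending weight `lam`, and the three-region path graph are assumptions for illustration. Each ROI's sample is blended with the mean of its neighbors on the graph, so a motion-induced spike in one region is pulled toward the values seen in adjacent regions.

```python
def graph_lowpass(signal, neighbors, lam=0.5):
    """Apply one step of neighbor-averaging smoothing on a graph.

    signal: one sample per ROI (a single video frame's values).
    neighbors: dict mapping ROI index -> list of adjacent ROI indices.
    lam: assumed blending weight; 0 keeps the input, 1 uses only neighbors.
    """
    smoothed = []
    for node, value in enumerate(signal):
        nbrs = neighbors.get(node, [])
        if not nbrs:
            # Isolated ROI: nothing to average with, keep the value.
            smoothed.append(value)
            continue
        nbr_mean = sum(signal[j] for j in nbrs) / len(nbrs)
        smoothed.append((1.0 - lam) * value + lam * nbr_mean)
    return smoothed


# Three ROIs on a path graph (0 - 1 - 2); the middle ROI carries a spike.
frame = [1.0, 4.0, 1.0]
adjacency = {0: [1], 1: [0, 2], 2: [1]}
print(graph_lowpass(frame, adjacency))
```

In practice the filter would be applied to every time sample of the per-ROI traces before heart rate is estimated, and the graph could encode both spatial adjacency and signal similarity between regions.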
[210] GS4City: Hierarchical Semantic Gaussian Splatting via City-Model Priors cs.CV
Qilin Zhang, Jinyu Zhu, Olaf Wysocki, Benjamin Busam, Boris Jutzi
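TL;DR: GS4City is a hierarchical semantic Gaussian Splatting method that incorporates city-model priors for urban scene understanding. It derives reliable image-aligned masks from CityGML models via two-pass raycasting, using parent-child relations to validate and recover fine-grained facade elements. It then fuses these geometry-grounded masks with foundation-model predictions to establish scene-consistent instance correspondences, and learns a compact identity encoding for each Gaussian under joint 2D identity supervision and 3D spatial regularization.
Details
Motivation: Existing semantic 3D Gaussian Splatting methods rely mainly on 2D foundation models, which often produce ambiguous boundaries and offer limited support for structured urban semantics. City models such as CityGML encode hierarchically organized semantics together with building geometry, but their labels cannot be mapped directly to Gaussian primitives.
Result: Experiments on the TUM2TWIN and Gold Coast datasets show that GS4City outperforms existing 2D-driven semantic 3DGS baselines, including LangSplat and Gaga, by up to 15.8 IoU points in coarse building segmentation and 14.2 mIoU points in fine-grained semantic segmentation, effectively incorporating structured building semantics into Gaussian scene representations.
Insight: The contribution is the use of city-model priors (such as CityGML) to generate reliable geometry-grounded masks through two-pass raycasting and parent-child validation, fused with 2D foundation-model predictions. This integrates structured semantic information into a differentiably rendered Gaussian scene representation and enables semantically queryable, structure-aware urban reconstruction.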
Abstract: Recent semantic 3D Gaussian Splatting (3DGS) methods primarily rely on 2D foundation models, often yielding ambiguous boundaries and limited support for structured urban semantics. While city models such as CityGML encode hierarchically organized semantics together with building geometry, these labels cannot be directly mapped to Gaussian primitives. We present GS4City, a hierarchical semantic Gaussian Splatting method that incorporates city-model priors for urban scene understanding. GS4City derives reliable image-aligned masks from Level of Detail (LoD) 3 CityGML models via two-pass raycasting, explicitly using parent-child relations to validate and recover fine-grained facade elements. It then fuses these geometry-grounded masks with foundation-model predictions to establish scene-consistent instance correspondences, and learns a compact identity encoding for each Gaussian under joint 2D identity supervision and 3D spatial regularization. Experiments on the TUM2TWIN and Gold Coast datasets show that GS4City effectively incorporates structured building semantics into Gaussian scene representations, outperforming existing 2D-driven semantic 3DGS baselines, including LangSplat and Gaga, by up to 15.8 IoU points in coarse building segmentation and 14.2 mIoU points in fine-grained semantic segmentation. By bridging structured city models and photorealistic Gaussian scene representations, GS4City enables semantically queryable and structure-aware urban reconstruction. Code is available at https://github.com/Jinyzzz/GS4City.
[211] Scene Change Detection with Vision-Language Representation Learning cs.CV
Diwei Sheng, Vijayraj Gohil, Satyam Gaba, Zihan Liu, Giles Hamilton-Fletcher
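TL;DR: This paper proposes LangSCD, a vision-language framework for scene change detection. A language module generates textual descriptions of scene changes and fuses them with visual features, overcoming the limitation of existing methods that rely only on low-level visual features. The paper also introduces NYC-CD, a dataset of 8,122 New York City street-view image pairs with multiclass change annotations, to support fine-grained understanding of scene dynamics. Experiments show state-of-the-art performance on multiple street-view benchmarks.
Details
Motivation: Existing scene change detection methods rely primarily on low-level visual features and struggle to identify changed objects under real-world challenges such as lighting variations, seasonal shifts, viewpoint differences, and complex urban layouts. Existing benchmarks also provide only binary change annotations, which are insufficient for downstream applications that need fine-grained understanding of scene dynamics.
Result: Extensive experiments on multiple street-view benchmarks show that the proposed language and matching modules consistently improve existing change-detection architectures and achieve state-of-the-art performance.
Insight: The contributions are: semantic reasoning introduced through vision-language models, which generate textual descriptions that enhance the visual features; a geometric-semantic matching module that enforces semantic consistency and spatial completeness; and NYC-CD, a large-scale dataset with multiclass annotations that supports research on fine-grained scene change detection. Combining the language modality with visual representations improves robustness in complex real-world environments.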
Abstract: Scene change detection (SCD) is crucial for urban monitoring and navigation but remains challenging in real-world environments due to lighting variations, seasonal shifts, viewpoint differences, and complex urban layouts. Existing methods rely primarily on low-level visual features, limiting their ability to accurately identify changed objects amid the visual complexity of urban scenes. In this paper, we propose LangSCD, a vision-language framework for scene change detection that overcomes this single-modal limitation by incorporating semantic reasoning through language. Our approach introduces a modular language component that leverages vision-language models (VLMs) to generate textual descriptions of scene changes, which are fused with visual features through a cross-modal feature enhancer. We further introduce a geometric-semantic matching module that refines the predicted masks by enforcing semantic consistency and spatial completeness. Existing real-world scene change detection benchmarks provide only binary change annotations, which are insufficient for downstream applications requiring fine-grained understanding of scene dynamics. To address this limitation, we introduce NYC-CD, a large-scale dataset of 8,122 real-world image pairs collected in New York City with multiclass change annotations generated through a semi-automatic pipeline. Extensive experiments across multiple street-view benchmarks demonstrate that our language and matching modules consistently improve existing change-detection architectures, achieving state-of-the-art performance and highlighting the value of integrating linguistic reasoning with visual representations for robust scene change detection.
[212] Online Reasoning Video Object Segmentation cs.CV
Jinyuan Liu, Yang Wang, Zeyu Zhao, Weixin Li, Song Wang
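TL;DR: This paper proposes the task of online reasoning video object segmentation, in which a model must make causal, frame-by-frame predictions from past and current frames only while handling referent shifts. It introduces the ORVOSB benchmark and a baseline that combines continually updated segmentation prompts with a structured temporal token reservoir.
Details
Motivation: Existing video object segmentation methods rely on offline, whole-video inference, which does not match real-world deployments that require strictly causal decisions. The paper therefore studies segmentation in the online reasoning setting.
Result: On the ORVOSB benchmark, existing methods struggle under strict causality and referent shifts, while the proposed baseline establishes a strong foundation for future research.
Insight: The contributions are the online reasoning segmentation task formulation, a dataset with causal annotations, and a lightweight inference architecture with continual prompt updates and a temporal token reservoir, which together suggest a direction for real-time video understanding systems.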
Abstract: Reasoning video object segmentation predicts pixel-level masks in videos from natural-language queries that may involve implicit and temporally grounded references. However, existing methods are developed and evaluated in an offline regime, where the entire video is available at inference time and future frames can be exploited for retrospective disambiguation, deviating from real-world deployments that require strictly causal, frame-by-frame decisions. We study Online Reasoning Video Object Segmentation (ORVOS), where models must incrementally interpret queries using only past and current frames without revisiting previous predictions, while handling referent shifts as events unfold. To support evaluation, we introduce ORVOSB, a benchmark with frame-level causal annotations and referent-shift labels, comprising 210 videos, 12,907 annotated frames, and 512 queries across five reasoning categories. We further propose a baseline with continually-updated segmentation prompts and a structured temporal token reservoir for long-horizon reasoning under bounded computation. Experiments show that existing methods struggle under strict causality and referent shifts, while our baseline establishes a strong foundation for future research.
[213] Observe Less, Understand More: Cost-aware Cross-scale Observation for Remote Sensing Understanding cs.CV
Zhenghao Xie, Jing Xiao, Zhenqi Wang, Kexin Ma, Liang Liao
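TL;DR: This paper proposes a cost-aware cross-scale approach to remote sensing understanding that couples fine-grained high-resolution (HR) sampling with cross-patch representation prediction, enabling more effective task reasoning under a limited budget. It also presents GL-10M, a large-scale benchmark of 10 million spatially aligned multi-resolution images for systematic evaluation of budget-constrained cross-scale reasoning.
Details
Motivation: Remote sensing understanding requires multi-resolution observation, but HR imagery is costly to acquire and limited in coverage. Existing HR sampling methods typically make selection decisions from isolated low-resolution (LR) patches, ignoring fine-grained intra-patch importance and cross-patch contextual interactions. This leads to fragmented feature representations and suboptimal scene reasoning under sparse HR observations.
Result: Extensive experiments on recognition and retrieval tasks show that the method consistently achieves a better performance-cost trade-off than existing methods.
Insight: The contributions are the formulation of cross-scale remote sensing understanding as a unified cost-aware problem that couples fine-grained HR sampling with cross-patch representation prediction, and the GL-10M large-scale multi-resolution benchmark, which provides a basis for systematic evaluation.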
Abstract: Remote sensing understanding inherently requires multi-resolution observation, since different targets and application tasks demand different levels of spatial detail. While low-resolution (LR) imagery enables efficient global observation, high-resolution (HR) imagery provides critical local details at much higher acquisition cost and limited coverage. This motivates a cross-scale sensing strategy that selectively acquires HR imagery from LR-based global perception to improve task performance under constrained cost. Existing HR sampling methods typically make selection decisions from isolated LR patches, which ignores fine-grained intra-patch importance and cross-patch contextual interactions, leading to fragmented feature representation and suboptimal scene reasoning under sparse HR observations. To address this issue, we formulate cross-scale remote sensing understanding as a unified cost-aware problem that couples fine-grained HR sampling with cross-patch representation prediction, enabling more effective task reasoning with fewer HR observations. Furthermore, we present GL-10M, a large-scale benchmark of 10 million spatially aligned multi-resolution images, enabling systematic evaluation of budget-constrained cross-scale reasoning in remote sensing. Extensive experiments on recognition and retrieval tasks show that our method consistently achieves a superior performance-cost trade-off.
[214] TAG-Head: Time-Aligned Graph Head for Plug-and-Play Fine-grained Action Recognition cs.CV
Imtiaz Ul Hassan, Nik Bessis, Ardhendu Behera
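TL;DR: This paper proposes TAG-Head, a lightweight spatio-temporal graph head for fine-grained human action recognition from RGB video only. The module upgrades standard 3D backbones in a plug-and-play manner: a Transformer encoder captures long-range dependencies, and a graph built from fully connected intra-frame edges and time-aligned inter-frame edges refines the features to separate subtle spatio-temporal cues.
Details
Motivation: Visually similar actions are hard to distinguish in fine-grained action recognition, and existing methods that rely on extra modalities (such as pose or text) increase annotation burden and computational cost. The goal is a lightweight, efficient module that uses RGB only.
Result: On the FineGym (Gym99 and Gym288) and HAA500 benchmarks, TAG-Head sets a new state of the art among RGB-only models and even surpasses many multimodal methods (video + pose + text) that rely on privileged information. Ablations and a complexity analysis confirm its effectiveness and low latency.
Insight: The key idea is to explicitly couple global context (via the Transformer) with high-resolution spatial interactions (via fully connected intra-frame graph edges) and low-variance temporal continuity (via time-aligned inter-frame graph edges) inside one lightweight graph head. The design is plug-and-play, trained end-to-end, and adds little parameter or compute overhead, offering a high-performance option for practical systems that use RGB-only sensors.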
Abstract: Fine-grained human action recognition (FHAR) is challenging because visually similar actions differ by subtle spatio-temporal cues. Many recent systems enhance discriminability with extra modalities (e.g., pose, text, optical flow), but this increases annotation burden and computational cost. We introduce TAG-Head, a lightweight spatio-temporal graph head that upgrades standard 3D backbones (SlowFast, R(2+1)D-34, I3D, etc.) for FHAR using RGB only. Our pipeline first applies a Transformer encoder with learnable 3D positional encodings to the backbone tokens, capturing long-range dependencies across space and time. The resulting features are then refined by a graph with (i) fully-connected intra-frame edges that resolve subtle appearance differences within frames, and (ii) time-aligned temporal edges that connect features at the same spatial location across frames to stabilise motion cues without over-smoothing. The head is compact (little parameter/FLOP overhead), plug-and-play across backbones, and trained end-to-end with the backbone. Extensive evaluations on FineGym (Gym99 and Gym288) and HAA500 show that TAG-Head sets a new state-of-the-art among RGB-only models and surpasses many recent multimodal approaches (video + pose + text) that rely on privileged information. Ablations disentangle the contributions of the Transformer and the graph topology, and complexity analyses confirm low latency. TAG-Head advances FHAR by explicitly coupling global context with high-resolution spatial interactions and low-variance temporal continuity inside a slim, composable graph head. The simplicity of the design enables straightforward adoption in practical systems that favour RGB-only sensors, while delivering performance gains typically associated with heavier or multimodal models. Code will be released on GitHub.
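The graph topology described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_tag_edges`, the flattened node indexing, and the choice to link only consecutive frames are assumptions made for illustration (the abstract says only that temporal edges connect the same spatial location across frames).

```python
def build_tag_edges(num_frames, num_positions):
    """Return (intra_frame_edges, temporal_edges) as sets of undirected node pairs.

    Hypothetical layout: token (t, p) is flattened to node t * num_positions + p.
    """
    def node(t, p):
        return t * num_positions + p

    intra, temporal = set(), set()
    for t in range(num_frames):
        for p in range(num_positions):
            # (i) fully connected edges between all tokens of the same frame
            for q in range(p + 1, num_positions):
                intra.add((node(t, p), node(t, q)))
            # (ii) time-aligned edge: same spatial position, next frame
            # (assumption: only consecutive frames are linked)
            if t + 1 < num_frames:
                temporal.add((node(t, p), node(t + 1, p)))
    return intra, temporal


intra, temporal = build_tag_edges(num_frames=4, num_positions=9)
print(len(intra), len(temporal))
```

Under these assumptions, the intra-frame edge count grows quadratically with the number of spatial positions, while the temporal edge count grows only linearly with the number of frames, which is one way to read the "slim" head claim.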
[215] SVD-Prune: Training-Free Token Pruning For Efficient Vision-Language Models cs.CV | cs.AI
Yvon Apedo, Martyna Poreba, Michal Szczepanski, Samia Bouchafa
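TL;DR: This paper proposes SVD-Prune, a training-free, plug-and-play vision token pruning method for improving the efficiency of vision-language models (VLMs). Based on Singular Value Decomposition (SVD), it uses statistical leverage scores to select the top-K vision tokens that contribute most to the global variance, addressing the performance degradation that existing methods suffer at high pruning ratios because of positional bias and information dispersion.
Details
Motivation: VLMs face high compute and memory demands when processing long sequences of vision tokens. Existing pruning methods based on local heuristics (such as attention scores or token norms) suffer from positional bias and information dispersion, so they struggle to preserve essential content at high pruning ratios and degrade on visually detailed images.
Result: Experiments show that SVD-Prune consistently outperforms prior pruning methods under extreme vision token budgets (such as 32 and 16 tokens) while maintaining strong performance.
Insight: The key idea is to select tokens using a global statistical property from SVD (statistical leverage scores), which avoids the bias of local heuristics and requires no training. It is a simple and effective application of a classical linear-algebra dimensionality-reduction tool (SVD) to vision token pruning.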
Abstract: Vision-Language Models (VLMs) have revolutionized multimodal learning by jointly processing visual and textual information. Yet, they face significant challenges due to the high computational and memory demands of processing long sequences of vision tokens. Many existing methods rely on local heuristics, such as attention scores or token norms. However, these criteria suffer from positional bias and information dispersion, limiting their ability to preserve essential content at high pruning ratios and leading to performance degradation on visually detailed images. To address these issues, we propose SVD-Prune, a training-free, plug-and-play token pruning method based on Singular Value Decomposition. It decomposes the vision token feature matrix and selects the top-K tokens using statistical leverage scores, ensuring only tokens contributing most to the dominant global variance are preserved. Experiments show that SVD-Prune consistently outperforms prior pruning methods under extreme vision token budgets, maintaining strong performance even with 32 and 16 vision tokens.
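Leverage-score token selection can be sketched in a few lines of NumPy. This is an illustrative reading of the abstract rather than the paper's code: the function name `leverage_score_prune`, the `rank` parameter (how many leading singular directions count as "dominant"), and the absence of mean-centering are all assumptions.

```python
import numpy as np


def leverage_score_prune(tokens, keep, rank):
    """Return sorted indices of the `keep` tokens with the largest leverage scores.

    tokens: (num_tokens, dim) vision token feature matrix.
    rank:   number of leading singular directions used (hypothetical parameter).
    """
    # Thin SVD: tokens = U @ diag(S) @ Vt, with singular values in descending order.
    U, S, Vt = np.linalg.svd(tokens, full_matrices=False)
    # Leverage score of token i = squared norm of row i of the top-`rank` left singular vectors.
    scores = np.sum(U[:, :rank] ** 2, axis=1)
    top = np.argsort(scores)[::-1][:keep]
    # Sort so the kept tokens stay in their original sequence order.
    return np.sort(top)


rng = np.random.default_rng(0)
features = rng.normal(size=(576, 64))   # e.g. a 24x24 grid of 64-d tokens (made-up sizes)
kept = leverage_score_prune(features, keep=32, rank=8)
pruned = features[kept]                 # (32, 64) tokens passed on to the language model
```

Because the scores come from a decomposition of the whole token matrix, they do not depend on a token's position in the sequence, which is the property the abstract contrasts with attention-score and norm heuristics.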
[216] CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space cs.CV | cs.AI
Sohwi Lim, Lee Hyoseok, Jungjoon Park, Tae-Hyun Oh
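TL;DR: CLAY is an adaptive visual similarity computation method that reframes the embedding space of pretrained vision-language models as a text-conditional similarity space, enabling multi-conditioned image retrieval without additional training.
Details
Motivation: Existing image retrieval systems rely on a fixed similarity metric and cannot flexibly adapt to subjective human perception or to multiple conditions at once.
Result: On standard datasets and the authors' CLAY-EVAL dataset, CLAY outperforms prior methods in both retrieval accuracy and computational efficiency.
Insight: The key idea is to decouple textual conditioning from visual feature extraction and to use the embedding space of a pretrained VLM for efficient multi-conditioned retrieval. The training-free conditioning mechanism and the construction of a synthetic evaluation dataset are both reusable ideas.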
Abstract: Human perception of visual similarity is inherently adaptive and subjective, depending on the users’ interests and focus. However, most image retrieval systems fail to reflect this flexibility, relying on a fixed, monolithic metric that cannot incorporate multiple conditions simultaneously. To address this, we propose CLAY, an adaptive similarity computation method that reframes the embedding space of pretrained Vision-Language Models (VLMs) as a text-conditional similarity space without additional training. This design separates the textual conditioning process from visual feature extraction, allowing highly efficient and multi-conditioned retrieval with fixed visual embeddings. We also construct a synthetic evaluation dataset, CLAY-EVAL, for comprehensive assessment under diverse conditioned retrieval settings. Experiments on standard datasets and our proposed dataset show that CLAY achieves high retrieval accuracy and notable computational efficiency compared to previous works.
[217] Finetune Like You Pretrain: Boosting Zero-shot Adversarial Robustness in Vision-language Models cs.CV
Songlong Xing, Weijie Wang, Zhengyu Zhao, Jindong Gu, Philip Torr
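TL;DR: This paper proposes AdvFLYP, an adversarial finetuning method that improves the zero-shot adversarial robustness of vision-language models such as CLIP while preserving cross-domain generalization. It finetunes adversarially on image-text pairs collected from the web, using a contrastive loss and regularization to balance robustness and accuracy.
Details
Motivation: Existing methods for improving CLIP's adversarial robustness usually finetune on a proxy dataset such as ImageNet and overlook the roles of training data distribution and learning objective, which reduces zero-shot capability and limits the transfer of robustness across domains.
Result: Experiments on 14 downstream datasets spanning multiple domains show that AdvFLYP outperforms mainstream methods in both adversarial robustness and clean accuracy, achieving better cross-domain generalization.
Insight: The key idea is to align adversarial finetuning with CLIP's pretraining recipe by using web-scale image-text pairs and a contrastive loss, and to add logit-level and feature-level regularization, which benefit robustness and clean accuracy respectively. This improves robustness while preserving zero-shot capability.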
Abstract: Despite their impressive zero-shot abilities, vision-language models such as CLIP have been shown to be susceptible to adversarial attacks. To enhance CLIP's adversarial robustness, recent studies finetune its pretrained vision encoder with adversarial examples on a proxy dataset such as ImageNet by aligning adversarial images with correct class labels. However, these methods overlook the important roles of training data distributions and learning objectives, resulting in reduced zero-shot capabilities and limited transferability of robustness across domains and datasets. In this work, we propose a simple yet effective paradigm AdvFLYP, which follows the training recipe of CLIP’s pretraining process when adversarially finetuning the model. Specifically, AdvFLYP finetunes CLIP with adversarial images created based on image-text pairs collected from the web, and matches them with their corresponding texts via a contrastive loss. To alleviate distortion of adversarial image embeddings of noisy web images, we further propose to regularise AdvFLYP by penalising deviation of adversarial image features. We show that logit- and feature-level regularisation terms benefit robustness and clean accuracy, respectively. Extensive experiments on 14 downstream datasets spanning various domains show the superiority of our paradigm over mainstream practices. Our code and model weights are released at https://github.com/Sxing2/AdvFLYP.
[218] GeomPrompt: Geometric Prompt Learning for RGB-D Semantic Segmentation Under Missing and Degraded Depth cs.CV | cs.RO
Krishna Jaganathan, Patricio Vela
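TL;DR: This paper proposes GeomPrompt, a lightweight cross-modal adaptation module for RGB-D semantic segmentation when depth is missing or degraded. The module synthesizes a task-driven geometric prompt from the RGB image alone and feeds it as the fourth input channel of a frozen RGB-D segmentation model, without depth supervision. A companion module, GeomPrompt-Recovery, compensates for degraded depth by predicting a fourth-channel correction relevant to the frozen segmenter. Both modules are trained solely with downstream segmentation supervision, so they recover the geometric prior useful for segmentation rather than estimating depth signals.
Details
Motivation: Multimodal perception systems in robotics and embodied AI often assume reliable RGB-D sensing, but in practice depth is frequently missing, noisy, or corrupted. A lightweight way to adapt to these conditions is needed to make RGB-D semantic segmentation more robust.
Result: On SUN RGB-D, GeomPrompt improves over RGB-only inference by +6.1 mIoU on DFormer and +3.0 mIoU on GeminiFusion while remaining competitive with strong monocular depth estimators. For degraded depth, GeomPrompt-Recovery consistently improves robustness, with gains of up to +3.6 mIoU under severe depth corruption. GeomPrompt is also substantially more efficient than monocular depth baselines, with 7.8 ms latency versus 38.3 ms and 71.9 ms.
Insight: The key idea is a task-driven geometric prompt used as a cross-modal compensation mechanism. It avoids direct depth estimation and focuses on recovering the geometric prior that is useful for segmentation. The approach is lightweight, needs only segmentation supervision, and handles both missing and degraded depth inputs, offering a new adaptation strategy for RGB-D perception.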
Abstract: Multimodal perception systems for robotics and embodied AI often assume reliable RGB-D sensing, but in practice, depth is frequently missing, noisy, or corrupted. We thus present GeomPrompt, a lightweight cross-modal adaptation module that synthesizes a task-driven geometric prompt from RGB alone for the fourth channel of a frozen RGB-D semantic segmentation model, without depth supervision. We further introduce GeomPrompt-Recovery, an adaptation module that compensates for degraded depth by predicting the fourth channel correction relevant for the frozen segmenter. Both modules are trained solely with downstream segmentation supervision, enabling recovery of the geometric prior useful for segmentation, rather than estimating depth signals. On SUN RGB-D, GeomPrompt improves over RGB-only inference by +6.1 mIoU on DFormer and +3.0 mIoU on GeminiFusion, while remaining competitive with strong monocular depth estimators. For degraded depth, GeomPrompt-Recovery consistently improves robustness, yielding gains up to +3.6 mIoU under severe depth corruptions. GeomPrompt is also substantially more efficient than monocular depth baselines, reaching 7.8 ms latency versus 38.3 ms and 71.9 ms. These results suggest that task-driven geometric prompting is an efficient mechanism for cross-modal compensation under missing and degraded depth inputs in RGB-D perception.
[219] MLLM-as-a-Judge Exhibits Model Preference Bias cs.CV
Shuitsu Koyama, Yuiga Wada, Daichi Yashima, Komei Sugiura
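TL;DR: This paper proposes Philautia-Eval, a method for quantifying model-specific preference bias in the MLLM-as-a-Judge evaluation framework. It finds that representative multimodal large language models (MLLMs) tend to exhibit self-preference bias and that mutual preference bias exists within particular model families. The paper also proposes Pomms, a simple ensemble of MLLMs that effectively mitigates this bias.
Details
Motivation: MLLM-as-a-Judge automatic evaluation is widely used to measure model performance, but if it is biased it can distort model comparisons and benchmark-driven scientific progress. It remains unclear to what extent MLLM-as-a-Judge methods favor or disfavor text generated by specific MLLMs.
Result: An analysis of 1.29M caption-score pairs collected from 12 MLLMs quantifies model-specific preference bias. The proposed ensemble, Pomms, mitigates the bias while maintaining performance.
Insight: The study reveals self-preference bias and mutual within-family preference bias in MLLM-as-a-Judge evaluation, potentially driven by reused connectors and overlapping instruction-tuning resources. The simple ensemble strategy Pomms offers a practical route to fairer automatic evaluators.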
Abstract: Automatic evaluation using multimodal large language models (MLLMs), commonly referred to as MLLM-as-a-Judge, has been widely used to measure model performance. If such MLLM-as-a-Judge methods were biased, they could distort model comparisons and benchmark-driven scientific progress. However, it remains unclear to what extent MLLM-as-a-Judge methods favor or disfavor text generated by specific MLLMs. In this study, we propose Philautia-Eval to investigate such model-specific preference bias. Philautia-Eval quantifies the degree of the bias by disentangling preference tendencies from differences in generation quality. Using 1.29M caption-score pairs collected from 12 MLLMs, we found that representative MLLMs tend to exhibit self-preference bias. Moreover, experimental results indicate mutual preference bias within particular model families, which is potentially driven by reused connectors and overlapping instruction-tuning resources. Finally, we introduce a simple ensemble of MLLMs, Pomms. Our results demonstrated that Pomms effectively mitigated the model-specific preference bias while maintaining performance.
[220] Geoparsing: Diagram Parsing for Plane and Solid Geometry with a Unified Formal Language cs.CV
Peijie Wang, Ming-Liang Zhang, Jun Cao, Chao Deng, Dekang Ran
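TL;DR: This paper proposes a unified formal language for plane and solid geometry and builds GDP-29K, a large-scale dataset of 20k plane and 9k solid geometry samples. With a training paradigm that combines Supervised Fine-Tuning with Reinforcement Learning via Verifiable Rewards, the approach achieves state-of-the-art geometry parsing performance, and the parsed formal descriptions significantly improve multimodal large language models on downstream geometry reasoning tasks.
Details
Motivation: Multimodal large language models perform poorly at geometric reasoning because of a perception bottleneck on fine-grained visual elements. Solid geometry in particular, which requires spatial understanding, lacks a unified formal representation.
Result: The approach achieves state-of-the-art performance on geometry parsing and significantly improves multimodal large language models on downstream geometry reasoning tasks.
Insight: The contributions are a unified formal language that covers the structures and semantic relations of both plane and solid geometry, and a training paradigm that combines Supervised Fine-Tuning with Reinforcement Learning via Verifiable Rewards to ensure syntactic correctness and geometric consistency of the formal descriptions, which then serve as an effective cognitive scaffold for geometric reasoning.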
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress but continue to struggle with geometric reasoning, primarily due to the perception bottleneck regarding fine-grained visual elements. While formal languages have aided plane geometry understanding, solid geometry which requires spatial understanding remains largely unexplored. In this paper, we address this challenge by designing a unified formal language that integrates plane and solid geometry, comprehensively covering geometric structures and semantic relations. We construct GDP-29K, a large-scale dataset comprising 20k plane and 9k solid geometry samples collected from diverse real-world sources, each paired with its ground-truth formal description. To ensure syntactic correctness and geometric consistency, we propose a training paradigm that combines Supervised Fine-Tuning with Reinforcement Learning via Verifiable Rewards. Experiments show that our approach achieves state-of-the-art parsing performance. Furthermore, we demonstrate that our parsed formal descriptions serve as a critical cognitive scaffold, significantly boosting MLLMs’ capabilities for downstream geometry reasoning tasks. Our data and code are available at Geoparsing.
[221] POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs cs.CV
Haicheng Wang, Yuan Liu, Yikun Liu, Zhemeng Yu, Zhongyin Zhao
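TL;DR: POINTS-Long is a native dual-mode multimodal large language model (MLLM) that addresses the scalability and deployment challenges caused by rapidly growing visual token sequences in long-video and streaming scenarios. Inspired by the human visual system, it uses dynamic visual token scaling and supports two complementary perception modes, focus mode and standby mode, so that users can trade off efficiency and accuracy during inference.
Details
Motivation: MLLMs scale poorly and are hard to deploy on long-video and streaming visual data because the visual token sequence grows rapidly.
Result: On fine-grained visual tasks, focus mode retains the best performance. On long-form general visual understanding, standby mode retains 97.7-99.7% of the original accuracy using only 1/40 to 1/10 of the visual tokens. A dynamically detachable KV-cache design gives the model native support for streaming visual understanding and efficient maintenance of ultra-long visual memory.
Insight: The contributions are dynamic visual token scaling inspired by the human visual system, an adaptive dual-mode (focus and standby) inference design, and a dynamically detachable KV cache for streaming, which together provide ideas and a foundation for adaptive, efficient long-form visual understanding in future MLLMs.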
Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in cross-modal understanding and generation. However, the rapid growth of visual token sequences, especially in long-video and streaming scenarios, poses a major challenge to their scalability and real-world deployment. Thus, we introduce POINTS-Long, a native dual-mode MLLM featuring dynamic visual token scaling inspired by the human visual system. The model supports two complementary perception modes: focus mode and standby mode, enabling users to dynamically trade off efficiency and accuracy during inference. On fine-grained visual tasks, the focus mode retains the optimal performance, while on long-form general visual understanding, the standby mode retains 97.7-99.7% of the original accuracy using only 1/40 to 1/10 of the visual tokens. Moreover, POINTS-Long natively supports streaming visual understanding via a dynamically detachable KV-cache design, allowing efficient maintenance of ultra-long visual memory. Our work provides new insights into the design of future MLLMs and lays the foundation for adaptive and efficient long-form visual understanding.
[222] MorphoFlow: Sparse-Supervised Generative Shape Modeling with Adaptive Latent Relevance cs.CV
Mokshagna Sai Teja Karanam, Tushar Kataria, Shireen Elhabian
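TL;DR: MorphoFlow is a sparse-supervised generative shape modeling framework that learns compact probabilistic shape representations directly from sparse surface annotations. It combines neural implicit shape representations, an autodecoder formulation, and autoregressive normalizing flows to learn an expressive probability density over the latent shape space. Adaptive latent relevance weighting lets the model regulate the contribution of each latent dimension according to its relevance to anatomical variation, supporting uncertainty quantification and anatomically plausible shape synthesis.
Details
Motivation: Existing statistical shape modeling methods usually depend on densely annotated segmentations and fixed latent representations, which limits scalability and flexibility when modeling complex anatomical variation. MorphoFlow addresses this by learning compact probabilistic shape representations under sparse supervision.
Result: Evaluation on public lumbar vertebrae and femur datasets shows accurate high-resolution reconstruction from sparse inputs and recovery of structured modes of anatomical variation consistent with population-level trends.
Insight: The contributions are the combination of neural implicit shape representations with an autodecoder and autoregressive flows, and an adaptive latent relevance weighting mechanism. Together they yield a compact, structured latent space without manual latent dimensionality tuning, supporting uncertainty quantification while preserving generative expressivity.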
Abstract: Statistical shape modeling (SSM) is central to population-level analysis of anatomical variability, yet most existing approaches rely on densely annotated segmentations and fixed latent representations. These requirements limit scalability and reduce flexibility when modeling complex anatomical variation. We introduce MorphoFlow, a sparse-supervised generative shape modeling framework that learns compact probabilistic shape representations directly from sparse surface annotations. MorphoFlow integrates neural implicit shape representations with an autodecoder formulation and autoregressive normalizing flows to learn an expressive probabilistic density over the latent shape space. The neural implicit representation enables resolution-agnostic modeling of 3D anatomy, while the autodecoder formulation supports direct optimization of per-instance latent codes under sparse supervision. The autoregressive flow captures the distribution of latent anatomical variability, providing a tractable, likelihood-based generative model of shapes. To promote compact and structured latent representations, we incorporate adaptive latent relevance weighting through sparsity-inducing priors, enabling the model to regulate the contribution of individual latent dimensions according to their relevance to the underlying anatomical variation while preserving generative expressivity. The resulting latent space supports uncertainty quantification and anatomically plausible shape synthesis without manual latent dimensionality tuning. Evaluation on publicly available lumbar vertebrae and femur datasets demonstrates accurate high-resolution reconstruction from sparse inputs and recovery of structured modes of anatomical variation consistent with population-level trends.
[223] STS-Mixer: Spatio-Temporal-Spectral Mixer for 4D Point Cloud Video Understanding cs.CV
Wenhao Li, Xueying Jiang, Gongjie Zhang, Xiaoqin Zhang, Ling Shao
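TL;DR: This paper proposes STS-Mixer, a spatio-temporal-spectral mixing framework for 4D point cloud video understanding. It transforms 4D point cloud videos into graph spectral signals, decomposes them into multiple frequency bands that capture different geometric structures, and fuses these spectral representations with spatio-temporal information to better model the geometric detail and dynamics of point cloud videos.
Details
Motivation: Existing 4D point cloud video methods work mainly in the spatio-temporal domain, where the underlying geometric characteristics are hard to capture, which degrades representation learning and understanding. The paper takes a complementary spectral perspective to better capture the geometric structure of point cloud videos.
Result: STS-Mixer achieves superior performance across multiple widely adopted benchmarks on both 3D action recognition and 4D semantic segmentation.
Insight: The paper brings spectral analysis to 4D point cloud video understanding. Its multi-band decomposition shows that different frequencies correspond to different geometric detail (low frequencies capture coarse shape, high frequencies encode fine-grained detail), and a unified spatio-temporal-spectral mixing framework fuses them. This offers a spectral-domain view of point cloud video analysis and an effective feature representation.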
Abstract: 4D point cloud videos capture rich spatial and temporal dynamics of scenes, which makes them uniquely valuable in various 4D understanding tasks. However, most existing methods work in the spatiotemporal domain where the underlying geometric characteristics of 4D point cloud videos are hard to capture, leading to degraded representation learning and understanding of 4D point cloud videos. We address the above challenge from a complementary spectral perspective. By transforming 4D point cloud videos into graph spectral signals, we can decompose them into multiple frequency bands, each of which captures distinct geometric structures of point cloud videos. Our spectral analysis reveals that the decomposed low-frequency signals capture coarser shapes while high-frequency signals encode more fine-grained geometric details. Building on these observations, we design Spatio-Temporal-Spectral Mixer (STS-Mixer), a unified framework that mixes spatial, temporal, and spectral representations of point cloud videos. STS-Mixer integrates multi-band delineated spectral signals with spatiotemporal information to capture rich geometries and temporal dynamics, while enabling fine-grained and holistic understanding of 4D point cloud videos. Extensive experiments show that STS-Mixer achieves superior performance consistently across multiple widely adopted benchmarks on both 3D action recognition and 4D semantic segmentation tasks. Code and models are available at https://github.com/Vegetebird/STS-Mixer.
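The band decomposition mentioned in the abstract can be illustrated on a single point cloud frame with a standard graph Fourier transform. This is a generic sketch, not the STS-Mixer code: the k-nearest-neighbour graph, the unnormalised Laplacian, the equal-width band split, and the name `spectral_bands` are all assumptions.

```python
import numpy as np


def spectral_bands(points, k=4, num_bands=3):
    """Split point coordinates into graph-frequency bands (lowest frequencies first).

    points: (n, 3) array for one frame. Returns a list of (n, 3) arrays that sum to `points`.
    """
    n = len(points)
    # Pairwise squared distances, then a symmetric k-nearest-neighbour adjacency matrix.
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    neighbours = np.argsort(d2, axis=1)[:, 1:k + 1]   # column 0 is the point itself
    adjacency = np.zeros((n, n))
    adjacency[np.repeat(np.arange(n), k), neighbours.ravel()] = 1.0
    adjacency = np.maximum(adjacency, adjacency.T)
    # Combinatorial graph Laplacian L = D - A; its eigenvectors form the graph Fourier basis.
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    _, basis = np.linalg.eigh(laplacian)              # eigenvalues in ascending order
    coeffs = basis.T @ points                         # graph Fourier transform of x, y, z
    bands = []
    for idx in np.array_split(np.arange(n), num_bands):
        # Inverse transform using only the eigenvectors that belong to this band.
        bands.append(basis[:, idx] @ coeffs[idx])
    return bands


rng = np.random.default_rng(0)
cloud = rng.normal(size=(50, 3))
bands = spectral_bands(cloud, k=4, num_bands=3)
```

Because the eigenvectors are orthonormal, the bands add back up to the original coordinates, so the decomposition separates the signal by frequency without discarding information.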
[224] GazeVaLM: A Multi-Observer Eye-Tracking Benchmark for Evaluating Clinical Realism in AI-Generated X-Rays cs.CV
David Wong, Zeynep Isik, Bin Wang, Marouane Tliba, Gorkem Durak
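TL;DR: GazeVaLM is a public eye-tracking dataset for studying the clinical perception of expert radiologists as they assess real and AI-generated chest X-rays. It contains 960 gaze recordings from 16 experts interpreting 60 images (30 real, 30 generated by a diffusion model) under two conditions, diagnostic assessment and real-fake classification, and provides raw data, fixation maps, diagnostic labels, and authenticity judgments. The protocol is also extended to 6 state-of-the-art multimodal large language models, whose predictions are released to support direct human-AI comparison at both the decision and uncertainty levels.
Details
Motivation: In medical imaging, there is a need to objectively evaluate the clinical realism of AI-generated images (X-rays in particular) and to compare how human experts and AI systems perceive, interpret, and decide on images.
Result: The paper benchmarks radiologists against multimodal large language models on diagnostic accuracy and authenticity detection, and provides analyses of gaze agreement and inter-observer consistency.
Insight: The work jointly releases expert gaze data, clinical labels, and AI model predictions as a multimodal, multi-task benchmark, providing a reproducible basis for research on clinical decision-making, human-AI comparison, generative image realism assessment, and uncertainty quantification. The dataset is publicly available, which supports cross-disciplinary research.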
Abstract: We introduce GazeVaLM, a public eye-tracking dataset for studying clinical perception during chest radiograph authenticity assessment. The dataset comprises 960 gaze recordings from 16 expert radiologists interpreting 30 real and 30 synthetic chest X-rays (generated by diffusion-based generative AI) under two conditions: diagnostic assessment and real-fake classification (Visual Turing test). For each image-observer pair, we provide raw gaze samples, fixation maps, scanpaths, saliency density maps, structured diagnostic labels, and authenticity judgments. We extend the protocol to 6 state-of-the-art multimodal LLMs, releasing their predicted diagnoses, authenticity labels, and confidence scores under matched conditions, which enables direct human-AI comparison at both decision and uncertainty levels. We further provide analyses of gaze agreement, inter-observer consistency, and benchmarking of radiologists versus LLMs in diagnostic accuracy and authenticity detection. GazeVaLM supports research in gaze modeling, clinical decision-making, human-AI comparison, generative image realism assessment, and uncertainty quantification. By jointly releasing visual attention data, clinical labels, and model predictions, we aim to facilitate reproducible research on how experts and AI systems perceive, interpret, and evaluate medical images. The dataset is available at https://huggingface.co/datasets/davidcwong/GazeVaLM.
[225] UNIGEOCLIP: Unified Geospatial Contrastive Learning cs.CV
Guillaume Astruc, Eduard Trulls, Jan Hosang, Loic Landrieu, Paul-Edouard Sarlin
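TL;DR: UNIGEOCLIP is a massively multimodal contrastive learning framework that aligns five complementary geospatial modalities (aerial imagery, street-level views, elevation models, text, and geographic coordinates) in a single unified embedding space, enabling seamless cross-modal comparison, retrieval, and reasoning.
Details
Motivation: Prior approaches typically fuse modalities or rely on a central pivot representation. The paper instead uses all-to-all contrastive alignment to exploit the growing availability of co-located multimodal geospatial data.
Result: Extensive experiments across downstream geospatial tasks show that UNIGEOCLIP consistently outperforms single-modality contrastive models and coordinate-only baselines, highlighting the benefit of holistic multimodal alignment.
Insight: The contributions are an all-to-all contrastive alignment scheme that supports interaction across arbitrary combinations of modalities, and a scaled latitude-longitude encoder that captures multi-scale geographic structure to improve spatial representation.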
Abstract: The growing availability of co-located geospatial data spanning aerial imagery, street-level views, elevation models, text, and geographic coordinates offers a unique opportunity for multimodal representation learning. We introduce UNIGEOCLIP, a massively multimodal contrastive framework to jointly align five complementary geospatial modalities in a single unified embedding space. Unlike prior approaches that fuse modalities or rely on a central pivot representation, our method performs all-to-all contrastive alignment, enabling seamless comparison, retrieval, and reasoning across arbitrary combinations of modalities. We further propose a scaled latitude-longitude encoder that improves spatial representation by capturing multi-scale geographic structure. Extensive experiments across downstream geospatial tasks demonstrate that UNIGEOCLIP consistently outperforms single-modality contrastive models and coordinate-only baselines, highlighting the benefits of holistic multimodal geospatial alignment. A reference implementation is available at https://gastruc.github.io/unigeoclip.
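The contrast between all-to-all alignment and a pivot-based design can be made concrete with a small sketch. It is illustrative only: the abstract does not specify the loss, so the CLIP-style symmetric InfoNCE, the temperature of 0.07, the equal pair weighting, and all names below are assumptions.

```python
import math
from itertools import combinations

MODALITIES = ["aerial", "street_view", "elevation", "text", "coordinates"]
# All-to-all alignment uses every unordered modality pair: C(5, 2) = 10 pairs,
# whereas a single pivot modality would give only 4 pairs.
PAIRS = list(combinations(MODALITIES, 2))


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE; a[i] and b[i] are unit-norm embeddings of the same location."""
    n = len(a)
    logits = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a numerically stable log-sum-exp
            total += m + math.log(sum(math.exp(v - m) for v in row)) - row[i]
        return total / n

    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


def all_to_all_loss(batch):
    """Average the pairwise contrastive loss over every modality pair."""
    return sum(info_nce(batch[m1], batch[m2]) for m1, m2 in PAIRS) / len(PAIRS)


# Toy batch of two locations with 2-d embeddings, identical across modalities.
aligned = {m: [[1.0, 0.0], [0.0, 1.0]] for m in MODALITIES}
print(len(PAIRS), all_to_all_loss(aligned))
```

With every pair contributing a term, any two modalities can be compared directly at retrieval time, rather than only through the pivot.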
[226] Towards Brain MRI Foundation Models for the Clinic: Findings from the FOMO25 Challenge cs.CV
Asbjørn Munk, Stefano Cerri, Vardan Nersesjan, Christian Hedeager Krag, Jakob Ambsdorf
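TL;DR: This paper presents the FOMO25 challenge, which aims to advance the clinical deployment of brain MRI foundation models. The challenge provided a large pretraining dataset, FOMO60K, and evaluated models in few-shot and out-of-domain settings on data sourced directly from clinical workflows. Tasks covered infarct classification, meningioma segmentation, and brain age regression. The results show that self-supervised pretraining improves generalization on clinical data and that no single pretraining objective is best for all tasks.
Details
Motivation: Automated clinical brain MRI analysis is difficult because clinical data is heterogeneous and noisy and high-quality labels are prohibitively costly. Self-supervised learning can use large amounts of unlabeled clinical data to train robust foundation models that adapt out-of-domain with minimal supervision.
Result: Nineteen foundation models were evaluated on infarct classification, meningioma segmentation, and brain age regression. Self-supervised pretraining improved generalization on clinical data under domain shift, and the strongest models trained out-of-domain surpassed supervised baselines trained in-domain. Pretraining objectives had different strengths: MAE favored segmentation, while hybrid reconstruction-contrastive objectives favored classification. Small pretrained models performed strongly, and scaling model size and training duration did not yield reliable benefits.
Insight: Using a large clinical pretraining dataset and a standardized evaluation pipeline, the challenge systematically tests the effectiveness of self-supervised learning for brain MRI foundation models in real clinical settings. The analysis suggests that model design should prioritize task-appropriate pretraining objectives over simply scaling up, which offers practical guidance for resource-constrained clinical deployment.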
Abstract: Clinical deployment of automated brain MRI analysis faces a fundamental challenge: clinical data is heterogeneous and noisy, and high-quality labels are prohibitively costly to obtain. Self-supervised learning (SSL) can address this by leveraging the vast amounts of unlabeled data produced in clinical workflows to train robust foundation models that adapt out-of-domain with minimal supervision. However, the development of foundation models for brain MRI has been limited by small pretraining datasets and in-domain benchmarking focused on high-quality, research-grade data. To address this gap, we organized the FOMO25 challenge as a satellite event at MICCAI 2025. FOMO25 provided participants with a large pretraining dataset, FOMO60K, and evaluated models on data sourced directly from clinical workflows in few-shot and out-of-domain settings. Tasks covered infarct classification, meningioma segmentation, and brain age regression, and considered both models trained on FOMO60K (method track) and any data (open track). Nineteen foundation models from sixteen teams were evaluated using a standardized containerized pipeline. Results show that (a) self-supervised pretraining improves generalization on clinical data under domain shift, with the strongest models trained out-of-domain surpassing supervised baselines trained in-domain; (b) no single pretraining objective benefits all tasks: MAE favors segmentation, while hybrid reconstruction-contrastive objectives favor classification; and (c) small pretrained models achieved strong performance, while scaling model size and training duration did not yield reliable benefits.
[227] LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment cs.CV | cs.RO
Dujun Nie, Fengjiao Chen, Qi Lv, Jun Kuang, Xiaoyu Li
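TL;DR: This paper presents the LARY benchmark for evaluating how well latent action representations can be extracted from visual observations in support of generalizable vision-to-action alignment. The benchmark comprises over one million videos, 620K image pairs, and 595K motion trajectories covering 151 action categories. Experiments find that general visual foundation models trained without action supervision outperform specialized embodied latent action models, and that latent visual space aligns with physical action space better than pixel space does.
Details
Motivation: Using large-scale unlabeled human action videos requires transforming visual signals into ontology-independent latent action representations, and the ability of such representations to yield robust control from visual observations has not been rigorously evaluated.
Result: Experiments on the LARY benchmark show that general visual foundation models trained without action supervision consistently outperform specialized embodied latent action models, and that latent visual space aligns with physical action space better than pixel space does.
Insight: The contribution is the first unified benchmark that evaluates latent action representations on both high-level semantic actions and low-level robotic control. The analysis indicates that general visual representations inherently encode action-relevant knowledge for physical control, and that semantic-level abstraction is a more effective pathway from vision to action than pixel-level reconstruction.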
Abstract: While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in utilizing large-scale human video datasets lies in transforming visual signals into ontology-independent representations, known as latent actions. However, the capacity of latent action representation to derive robust control from visual observations has yet to be rigorously evaluated. We introduce the Latent Action Representation Yielding (LARY) Benchmark, a unified framework for evaluating latent action representations on both high-level semantic actions (what to do) and low-level robotic control (how to do). The comprehensively curated dataset encompasses over one million videos (1,000 hours) spanning 151 action categories, alongside 620K image pairs and 595K motion trajectories across diverse embodiments and environments. Our experiments reveal two crucial insights: (i) General visual foundation models, trained without any action supervision, consistently outperform specialized embodied latent action models. (ii) Latent-based visual space is fundamentally better aligned to physical action space than pixel-based space. These results suggest that general visual representations inherently encode action-relevant knowledge for physical control, and that semantic-level abstraction serves as a fundamentally more effective pathway from vision to action than pixel-level reconstruction.
[228] Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction cs.CV
Efstathios Karypidis, Spyros Gidaris, Nikos Komodakis
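TL;DR: This paper proposes Re2Pix, a hierarchical video prediction framework that decomposes forecasting into two stages, semantic representation prediction and representation-guided visual synthesis, to improve the visual fidelity and scene-semantic consistency of future video prediction in complex dynamic environments such as autonomous driving.
Details
Motivation: Future video prediction in complex dynamic environments must deliver both high visual realism and consistent scene semantics.
Result: Experiments on challenging autonomous driving benchmarks show that the semantics-first design significantly improves temporal semantic consistency, perceptual quality, and training efficiency over strong diffusion baselines.
Insight: The core innovation is decomposing video prediction into semantic representation prediction followed by visual synthesis, together with two conditioning strategies, nested dropout and mixed supervision, that mitigate the mismatch between training-time and inference-time representations. This decouples scene-dynamics modeling from appearance generation and improves robustness and efficiency.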
Abstract: Accurate future video prediction requires both high visual fidelity and consistent scene semantics, particularly in complex dynamic environments such as autonomous driving. We present Re2Pix, a hierarchical video prediction framework that decomposes forecasting into two stages: semantic representation prediction and representation-guided visual synthesis. Instead of directly predicting future RGB frames, our approach first forecasts future scene structure in the feature space of a frozen vision foundation model, and then conditions a latent diffusion model on these predicted representations to render photorealistic frames. This decomposition enables the model to focus first on scene dynamics and then on appearance generation. A key challenge arises from the train-test mismatch between ground-truth representations available during training and predicted ones used at inference. To address this, we introduce two conditioning strategies, nested dropout and mixed supervision, that improve robustness to imperfect autoregressive predictions. Experiments on challenging driving benchmarks demonstrate that the proposed semantics-first design significantly improves temporal semantic consistency, perceptual quality, and training efficiency compared to strong diffusion baselines. We provide the implementation code at https://github.com/Sta8is/Re2Pix
[229] Ambivalence/Hesitancy Recognition in Videos for Personalized Digital Health Interventions cs.CV | cs.HC | cs.LG
Manuela González-González, Soufiane Belharbi, Muhammad Osama Zeeshan, Masoumeh Sharafi, Muhammad Haseeb Aslam
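TL;DR: This paper explores deep learning models for recognizing ambivalence/hesitancy (A/H) in videos to support personalized digital health interventions. It studies three learning setups, supervised learning, unsupervised domain adaptation, and zero-shot inference with large language models, with experiments on the BAH video dataset.
Details
Motivation: Ambivalence and hesitancy lead patients to delay, avoid, or abandon digital health interventions. Automatically recognizing these subtle, conflicting affective states would make interventions more personalized and cost-effective.
Result: Experiments on the BAH video dataset show limited performance, indicating that better-adapted multimodal models are needed for accurate A/H recognition.
Insight: The contribution is treating A/H recognition as a multimodal task and exploring several learning paradigms. The analysis concludes that better spatio-temporal modeling and multimodal fusion are needed to exploit conflicting cues within and across modalities.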
Abstract: Using behavioural science, health interventions focus on behaviour change by providing a framework to help patients acquire and maintain healthy habits that improve medical outcomes. In-person interventions are costly and difficult to scale, especially in resource-limited regions. Digital health interventions offer a cost-effective approach, potentially supporting independent living and self-management. Automating such interventions, especially through machine learning, has gained considerable attention recently. Ambivalence and hesitancy (A/H) play a primary role in leading individuals to delay, avoid, or abandon health interventions. A/H are subtle and conflicting emotions that place a person in a state between positive and negative evaluations of a behaviour, or between acceptance and refusal to engage in it. They manifest as affective inconsistency across modalities or within a modality, such as language, facial and vocal expressions, and body language. While experts can be trained to recognize A/H, integrating them into digital health interventions is costly and less effective. Automatic A/H recognition is therefore critical for the personalization and cost-effectiveness of digital health interventions. Here, we explore the application of deep learning models for A/H recognition in videos, a multi-modal task by nature. In particular, this paper covers three learning setups: supervised learning, unsupervised domain adaptation for personalization, and zero-shot inference via large language models (LLMs). Our experiments are conducted on the unique and recently published BAH video dataset for A/H recognition. Our results show limited performance, suggesting that more adapted multi-modal models are required for accurate A/H recognition. Better methods for spatio-temporal modeling and multimodal fusion are necessary to leverage conflicts within/across modalities.
[230] Learning Long-term Motion Embeddings for Efficient Kinematics Generation cs.CV
Nick Stracke, Kolja Bauer, Stefan Andreas Baumann, Miguel Angel Bautista, Josh Susskind
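TL;DR: This paper proposes an efficient method for generating long-term motion: it learns a highly compressed motion embedding from large-scale trajectory data and trains a conditional flow-matching model to generate motion that follows text prompts or spatial pokes, substantially improving generation efficiency while maintaining quality.
Details
Motivation: Existing video models understand and predict motion well, but exploring multiple possible futures through full video synthesis remains inefficient, so a more efficient way to model scene dynamics and generate long, realistic motion is needed.
Result: The generated motion distributions outperform those of state-of-the-art video models and specialized task-specific methods, with a temporal compression factor of 64x, reaching state-of-the-art results on the relevant benchmarks.
Insight: The contribution is learning a highly compressed long-term motion embedding space and pairing it with a conditional flow-matching model for efficient motion generation, which offers a scalable and efficient approach to motion prediction and generation.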
Abstract: Understanding and predicting motion is a fundamental component of visual intelligence. Although modern video models exhibit strong comprehension of scene dynamics, exploring multiple possible futures through full video synthesis remains prohibitively inefficient. We model scene dynamics orders of magnitude more efficiently by directly operating on a long-term motion embedding that is learned from large-scale trajectories obtained from tracker models. This enables efficient generation of long, realistic motions that fulfill goals specified via text prompts or spatial pokes. To achieve this, we first learn a highly compressed motion embedding with a temporal compression factor of 64x. In this space, we train a conditional flow-matching model to generate motion latents conditioned on task descriptions. The resulting motion distributions outperform those of both state-of-the-art video models and specialized task-specific approaches.
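The conditional flow-matching step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common linear interpolation path between noise and data, which the abstract does not specify; the function name `flow_matching_pair`, the latent shapes, and the omission of any conditioning input are simplifications made here for illustration, not the authors' implementation.

```python
import numpy as np

def flow_matching_pair(x1, rng):
    """Build one training pair for flow matching with a linear path (assumed).

    x1: batch of target motion latents, shape (batch, dim).
    Returns (x_t, t, v_target): the interpolated point, the sampled time,
    and the velocity a model would be regressed onto.
    """
    x0 = rng.standard_normal(x1.shape)        # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))    # one time value per example
    x_t = (1.0 - t) * x0 + t * x1             # point on the straight path
    v_target = x1 - x0                        # constant velocity of that path
    return x_t, t, v_target

rng = np.random.default_rng(0)
latents = rng.standard_normal((4, 8))         # hypothetical motion latents
x_t, t, v = flow_matching_pair(latents, rng)

# Following the target velocity for the remaining time (1 - t) lands on the data.
reconstructed = x_t + (1.0 - t) * v

# With a 64x temporal compression factor, a 256-step trajectory
# would map to 256 // 64 = 4 latent steps.
latent_steps = 256 // 64
```

In a real training loop, a network would take `x_t`, `t`, and the task description and be trained to predict `v_target`; that part is left out of this sketch.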
[231] HDR Video Generation via Latent Alignment with Logarithmic Encoding cs.CV
Naomi Ken Korem, Mohamed Oumoumad, Harel Cain, Matan Ben Yosef, Urska Jelercic
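TL;DR: This paper proposes a method for high dynamic range (HDR) video generation with pretrained generative models. Logarithmically encoding HDR imagery aligns its distribution with the latent space of the pretrained model, so lightweight fine-tuning suffices to generate HDR video without retraining the encoder or designing a new representation.
Details
Motivation: HDR imagery represents scene radiance more faithfully, but it mismatches the bounded, perceptually compressed data on which generative models are trained, which makes HDR generation challenging. Existing approaches typically learn new HDR representations, adding complexity and data requirements.
Result: With minimal adaptation of a pretrained video model, the method generates high-quality HDR video and achieves strong results across diverse scenes and challenging lighting conditions.
Insight: The core innovation is the observation that logarithmic encoding naturally aligns the HDR image distribution with the latent space of pretrained generative models, which simplifies the HDR generation pipeline. A training strategy based on camera-mimicking degradations, which encourages the model to infer missing high-dynamic-range detail from its learned priors, is a further useful technical insight.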
Abstract: High dynamic range (HDR) imagery offers a rich and faithful representation of scene radiance, but remains challenging for generative models due to its mismatch with the bounded, perceptually compressed data on which these models are trained. A natural solution is to learn new representations for HDR, which introduces additional complexity and data requirements. In this work, we show that HDR generation can be achieved in a much simpler way by leveraging the strong visual priors already captured by pretrained generative models. We observe that a logarithmic encoding widely used in cinematic pipelines maps HDR imagery into a distribution that is naturally aligned with the latent space of these models, enabling direct adaptation via lightweight fine-tuning without retraining an encoder. To recover details that are not directly observable in the input, we further introduce a training strategy based on camera-mimicking degradations that encourages the model to infer missing high dynamic range content from its learned priors. Combining these insights, we demonstrate high-quality HDR video generation using a pretrained video model with minimal adaptation, achieving strong results across diverse scenes and challenging lighting conditions. Our results indicate that HDR, despite representing a fundamentally different image formation regime, can be handled effectively without redesigning generative models, provided that the representation is chosen to align with their learned priors.
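To make the idea of a logarithmic encoding concrete, here is a minimal sketch of an invertible log curve that maps unbounded linear radiance into [0, 1]. The abstract does not name the specific cinematic curve the authors use, so the generic `log1p`-based curve, the function names, and the `gain` and `max_radiance` parameters below are assumptions chosen for illustration only.

```python
import numpy as np

def log_encode(radiance, gain=64.0, max_radiance=16.0):
    """Map linear HDR radiance in [0, max_radiance] to [0, 1] (generic log curve)."""
    radiance = np.clip(radiance, 0.0, max_radiance)
    return np.log1p(gain * radiance) / np.log1p(gain * max_radiance)

def log_decode(encoded, gain=64.0, max_radiance=16.0):
    """Invert log_encode, recovering linear radiance."""
    return np.expm1(encoded * np.log1p(gain * max_radiance)) / gain

# Linear radiance values spanning deep shadows to strong highlights.
hdr = np.array([0.0, 0.01, 0.18, 1.0, 4.0, 16.0])
encoded = log_encode(hdr)      # bounded values a standard image model can consume
decoded = log_decode(encoded)  # round trip back to linear radiance
```

The point of the sketch is that the encoded values are bounded and monotonic in radiance, so highlights are compressed rather than clipped and the mapping can be undone after generation.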
[232] LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation cs.CV
Yuqian Yuan, Wenqiao Zhang, Juekai Lin, Yu Zhong, Mingjian Gao
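TL;DR: This paper comprehensively surveys progress at the intersection of large multimodal models (LMMs) and object-centric vision. It organizes key methods, learning strategies, and evaluation protocols across four themes, object-centric visual understanding, referring segmentation, editing, and generation, and discusses open challenges.
Details
Motivation: LMMs have made remarkable progress in general vision-language understanding but remain limited on tasks requiring precise object-level grounding, fine-grained spatial reasoning, and controllable visual manipulation: they struggle to identify the correct instance, preserve object identity across interactions, and localize or modify designated regions precisely. Object-centric vision offers a principled framework for these challenges by explicitly representing and operating on visual entities.
Result: As a survey, the paper reports no quantitative experimental results; it systematically summarizes the modeling paradigms, learning strategies, and evaluation protocols that support object-centric multimodal capabilities.
Insight: The contribution is a structured framework and survey perspective that extends LMM capabilities from global scene understanding to object-level understanding, segmentation, editing, and generation. It emphasizes the importance of explicit object-centric representations for precise, controllable, and scalable multimodal systems, and identifies future directions including robust instance permanence, fine-grained spatial control, and unified cross-task modeling.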
Abstract: Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision–language understanding, yet they remain limited in tasks requiring precise object-level grounding, fine-grained spatial reasoning, and controllable visual manipulation. In particular, existing systems often struggle to identify the correct instance, preserve object identity across interactions, and localize or modify designated regions with high precision. Object-centric vision provides a principled framework for addressing these challenges by promoting explicit representations and operations over visual entities, thereby extending multimodal systems from global scene understanding to object-level understanding, segmentation, editing, and generation. This paper presents a comprehensive review of recent advances at the convergence of LMMs and object-centric vision. We organize the literature into four major themes: object-centric visual understanding, object-centric referring segmentation, object-centric visual editing, and object-centric visual generation. We further summarize the key modeling paradigms, learning strategies, and evaluation protocols that support these capabilities. Finally, we discuss open challenges and future directions, including robust instance permanence, fine-grained spatial control, consistent multi-step interaction, unified cross-task modeling, and reliable benchmarking under distribution shift. We hope this paper provides a structured perspective on the development of scalable, precise, and trustworthy object-centric multimodal systems.
[233] LottieGPT: Tokenizing Vector Animation for Autoregressive Generation cs.CV
Junhao Chen, Kejun Gao, Yuehan Cui, Mingze Sun, Mingjin Chen
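TL;DR: This paper proposes LottieGPT, the first framework for autoregressive generation of vector animation. A tailored Lottie Tokenizer encodes the JSON-based Lottie animation standard into compact, semantically aligned token sequences, and a large-scale dataset, LottieAnimation-660K, is constructed. Qwen-VL is then fine-tuned to generate coherent, editable vector animations directly from natural language or visual prompts.
Details
Motivation: Existing video generation models cannot produce vector animation, which offers resolution independence, compactness, semantic structure, and editable parametric motion. Meanwhile, large multimodal models have shown strong ability to generate structured data, suggesting that native vector animation generation is feasible.
Result: Experiments show that the tokenizer substantially reduces sequence length while preserving structural fidelity, enabling effective autoregressive learning of dynamic vector content. LottieGPT outperforms previous state-of-the-art models on SVG generation (a special case of single-frame vector animation).
Insight: The main contribution is the first tokenization and autoregressive generation framework for vector animation, along with a large-scale dataset built for it. The Lottie Tokenizer encodes layered geometric primitives, transforms, and keyframe-based motion into a compact, semantically aligned sequence, which is key to effective generation. This opens a new direction for generating high-quality, editable, structured multimedia content.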
Abstract: Despite rapid progress in video generation, existing models are incapable of producing vector animation, a dominant and highly expressive form of multimedia on the Internet. Vector animations offer resolution-independence, compactness, semantic structure, and editable parametric motion representations, yet current generative models operate exclusively in raster space and thus cannot synthesize them. Meanwhile, recent advances in large multimodal models demonstrate strong capabilities in generating structured data such as slides, 3D meshes, LEGO sequences, and indoor layouts, suggesting that native vector animation generation may be achievable. In this work, we present the first framework for tokenizing and autoregressively generating vector animations. We adopt Lottie, a widely deployed JSON-based animation standard, and design a tailored Lottie Tokenizer that encodes layered geometric primitives, transforms, and keyframe-based motion into a compact and semantically aligned token sequence. To support large-scale training, we also construct LottieAnimation-660K, the largest and most diverse vector animation dataset to date, consisting of 660k real-world Lottie animations and 15M static Lottie image files curated from broad Internet sources. Building upon these components, we finetune Qwen-VL to create LottieGPT, a native multimodal model capable of generating coherent, editable vector animations directly from natural language or visual prompts. Experiments show that our tokenizer dramatically reduces sequence length while preserving structural fidelity, enabling effective autoregressive learning of dynamic vector content. LottieGPT exhibits strong generalization across diverse animation styles and outperforms previous state-of-the-art models on SVG generation (a special case of single-frame vector animation).
[234] Budget-Aware Uncertainty for Radiotherapy Segmentation QA Using nnU-Net cs.CV | cs.AI
Ricardo Coimbra Brioso, Lorenzo Mondo, Damiano Dei, Nicola Lambri, Pietro Mancosu
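TL;DR: This paper proposes a budget-aware, uncertainty-driven quality assurance framework built on nnU-Net for clinical target volume delineation in radiotherapy. The framework combines uncertainty quantification with post-hoc calibration to produce voxel-wise uncertainty maps based on predictive entropy that guide targeted manual review. On a TMLI use case, it evaluates temperature scaling, deep ensembles, checkpoint ensembles, and test-time augmentation, individually and in combination.
Details
Motivation: Accurate delineation of the clinical target volume is essential for radiotherapy planning but is time-consuming and difficult to assess. Deep-learning auto-segmentation can reduce workload, but safe clinical deployment requires reliable cues indicating where the model may be wrong.
Result: On the representative TMLI use case, segmentation accuracy remains stable across all configurations, while temperature scaling substantially improves calibration. Calibrated checkpoint-ensemble inference improves uncertainty-error alignment the most, where alignment is summarized as AUC over the top 0-5% most uncertain voxels.
Insight: The main contribution is combining calibration with efficient ensembling into a budget-aware quality assurance workflow. Post-hoc calibration makes the uncertainty maps more reliable, so they more consistently highlight regions needing manual edits and provide practical decision support for clinical deployment.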
Abstract: Accurate delineation of the Clinical Target Volume (CTV) is essential for radiotherapy planning, yet remains time-consuming and difficult to assess, especially for complex treatments such as Total Marrow and Lymph Node Irradiation (TMLI). While deep learning-based auto-segmentation can reduce workload, safe clinical deployment requires reliable cues indicating where models may be wrong. In this work, we propose a budget-aware uncertainty-driven quality assurance (QA) framework built on nnU-Net, combining uncertainty quantification and post-hoc calibration to produce voxel-wise uncertainty maps (based on predictive entropy) that can guide targeted manual review. We compare temperature scaling (TS), deep ensembles (DE), checkpoint ensembles (CE), and test-time augmentation (TTA), evaluated both individually and in combination on TMLI as a representative use case. Reliability is assessed through ROI-masked calibration metrics and uncertainty-error alignment under realistic revision constraints, summarized as AUC over the top 0-5% most uncertain voxels. Across configurations, segmentation accuracy remains stable, whereas TS substantially improves calibration. Uncertainty-error alignment improves most with calibrated checkpoint-based inference, leading to uncertainty maps that more consistently highlight regions requiring manual edits. Overall, integrating calibration with efficient ensembling appears to be a promising strategy for implementing a budget-aware QA workflow for radiotherapy segmentation.
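The ingredients named in the abstract (temperature scaling, ensembling, predictive entropy, and a review budget) can be sketched on toy data as follows. This is an illustrative sketch, not the authors' pipeline: the random logits stand in for nnU-Net outputs, the temperature of 1.5 is a placeholder for a value that would normally be fitted on held-out data, and the helper names are hypothetical.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predictive_entropy(probs, eps=1e-12):
    """Voxel-wise entropy of the (ensemble-averaged) class probabilities."""
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def review_mask(entropy, budget=0.05):
    """Flag the `budget` fraction of voxels with the highest entropy."""
    k = max(1, int(round(budget * entropy.size)))
    threshold = np.sort(entropy.ravel())[-k]
    return entropy >= threshold

rng = np.random.default_rng(0)
logits = rng.normal(size=(10, 10, 10, 2)) * 3.0   # toy volume, 2 classes

# Three perturbed copies stand in for ensemble members (assumption).
member_probs = [
    softmax(logits + rng.normal(scale=0.5, size=logits.shape), temperature=1.5)
    for _ in range(3)
]
mean_probs = np.mean(member_probs, axis=0)
entropy = predictive_entropy(mean_probs)
mask = review_mask(entropy, budget=0.05)          # voxels to send for manual review
```

Fixing the budget as a fraction of voxels, rather than as an entropy threshold, keeps the reviewer workload constant across cases, which is one way to read the "budget-aware" framing.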
[235] OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation cs.CV
Donghao Zhou, Guisheng Liu, Hao Yang, Jiatong Li, Jingyu Lin
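TL;DR: This paper presents OmniShow, an end-to-end framework for human-object interaction video generation (HOIVG) that unifies text, reference image, audio, and pose conditions to generate high-quality video. To address the trade-off between controllability and quality, it introduces Unified Channel-wise Conditioning and Gated Local-Context Attention, and it adopts a Decoupled-Then-Joint Training strategy to cope with data scarcity. It also establishes a dedicated benchmark, HOIVG-Bench, for evaluation.
Details
Motivation: Existing HOIVG methods cannot accommodate text, image, audio, and pose conditions at the same time, which limits automated content creation in practical applications such as e-commerce demonstrations and short video production.
Result: Extensive experiments on benchmarks including HOIVG-Bench show that OmniShow achieves overall state-of-the-art performance across multimodal conditioning settings, setting a solid standard for this emerging task.
Insight: The contributions are: (1) Unified Channel-wise Conditioning for efficient injection of image and pose conditions; (2) Gated Local-Context Attention for precise audio-visual synchronization; (3) a Decoupled-Then-Joint Training strategy that uses heterogeneous sub-task datasets to address data scarcity; and (4) HOIVG-Bench, the first dedicated HOIVG benchmark, which fills an evaluation gap.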
Abstract: In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.
[236] Pair2Scene: Learning Local Object Relations for Procedural Scene Generation cs.CV
Xingjian Ran, Shujie Zhang, Weipeng Zhong, Li Luo, Bo Dai
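TL;DR: This paper proposes Pair2Scene, a framework for procedural 3D indoor scene generation that models local dependencies between objects (support relations and functional relations), addressing the difficulty existing methods have in scaling to dense scenes and their lack of precise spatial reasoning.
Details
Motivation: Generating high-fidelity 3D indoor scenes is hampered by data scarcity and the complexity of spatial relations. Existing methods generalize poorly to dense scenes beyond the training distribution, and LLM/VLM-based methods lack precise spatial reasoning. Based on the observation that object placement depends mainly on local relations rather than redundant global distributions, the paper tackles these problems by modeling local rules.
Result: In extensive experiments, the framework outperforms existing methods at generating complex environments beyond the training data while maintaining physical and semantic plausibility.
Insight: The contribution is decomposing scene generation into learning local object relations (support and functional) and applying them recursively within a hierarchy, with physics-based collision-aware rejection sampling integrating the local rules into a coherent global layout. This offers a scalable and precise paradigm for scene generation.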
Abstract: Generating high-fidelity 3D indoor scenes remains a significant challenge due to data scarcity and the complexity of modeling intricate spatial relations. Current methods often struggle to scale beyond the training distribution to dense scenes or rely on LLMs/VLMs that lack the ability for precise spatial reasoning. Building on the observation that object placement relies mainly on local dependencies instead of information-redundant global distributions, in this paper, we propose Pair2Scene, a novel procedural generation framework that integrates learned local rules with scene hierarchies and physics-based algorithms. These rules mainly capture two types of inter-object relations, namely support relations that follow physical hierarchies, and functional relations that reflect semantic links. We model these rules through a network, which estimates spatial position distributions of dependent objects conditioned on the position and geometry of the anchor objects. Accordingly, we curate a dataset 3D-Pairs from existing scene data to train the model. During inference, our framework can generate scenes by recursively applying our model within a hierarchical structure, leveraging collision-aware rejection sampling to align local rules into coherent global layouts. Extensive experiments demonstrate that our framework outperforms existing methods in generating complex environments that go beyond training data while maintaining physical and semantic plausibility.
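Collision-aware rejection sampling, as mentioned in the abstract, can be sketched in a simplified 2D setting. Everything below is an assumption made for illustration: a Gaussian offset stands in for the learned position distribution, objects are axis-aligned boxes `(cx, cy, w, h)`, and the hierarchical recursion over anchors is omitted.

```python
import numpy as np

def boxes_overlap(a, b):
    """Overlap test for axis-aligned 2D boxes given as (cx, cy, w, h)."""
    return (abs(a[0] - b[0]) < (a[2] + b[2]) / 2 and
            abs(a[1] - b[1]) < (a[3] + b[3]) / 2)

def place_with_rejection(anchor, size, placed, rng, max_tries=100):
    """Sample a position near `anchor`, rejecting candidates that collide."""
    for _ in range(max_tries):
        # Stand-in for the learned position distribution: Gaussian offset
        # around the anchor centre (assumption, not the paper's model).
        cx, cy = rng.normal(loc=anchor[:2], scale=1.0)
        candidate = (cx, cy, size[0], size[1])
        if not any(boxes_overlap(candidate, other) for other in placed):
            return candidate
    return None  # no collision-free sample found within the budget

rng = np.random.default_rng(0)
table = (0.0, 0.0, 1.5, 1.0)   # anchor object
placed = [table]
for _ in range(3):             # try to place three chairs around the table
    chair = place_with_rejection(table, (0.5, 0.5), placed, rng)
    if chair is not None:
        placed.append(chair)
```

Rejection keeps each local sample faithful to the pairwise distribution while guaranteeing that accepted placements do not intersect objects already in the scene.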
[237] Who Handles Orientation? Investigating Invariance in Feature Matching cs.CV
David Nordström, Johan Edstedt, Fredrik Kahl, Georg Bökman
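TL;DR: This paper studies how to handle large in-plane rotations in feature matching, comparing experimentally whether rotation invariance should be learned in the descriptor or handled in the matcher. Learning rotation invariance in the descriptor performs similarly to handling it in the matcher but achieves invariance earlier in the pipeline, enabling a faster rotation-invariant matcher. In addition, enforcing rotation invariance does not hurt performance on upright images when training at scale, and increasing the training data size substantially improves generalization to rotated images. The authors release two rotation-robust matchers that reach state-of-the-art results on several benchmarks.
Details
Motivation: Modern matchers struggle with large in-plane rotations. The paper investigates at which stage of the feature matching pipeline (descriptor learning or the matcher) rotation invariance is best introduced to improve matching performance.
Result: The proposed rotation-robust matchers achieve state-of-the-art performance on several image matching benchmarks, including multi-modal (WxBS), extreme (HardMatch), and satellite image matching (SatAst).
Insight: The contribution is a systematic study of where rotation invariance should enter the feature matching pipeline. It finds that introducing invariance at the descriptor stage achieves it earlier, and that large-scale training effectively improves generalization, which suggests new directions for designing efficient rotation-invariant matchers.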
Abstract: Finding matching keypoints between images is a core problem in 3D computer vision. However, modern matchers struggle with large in-plane rotations. A straightforward mitigation is to learn rotation invariance via data augmentation. However, it remains unclear at which stage rotation invariance should be incorporated. In this paper, we study this in the context of a modern sparse matching pipeline. We perform extensive experiments by training on a large collection of 3D vision datasets and evaluating on popular image matching benchmarks. Surprisingly, we find that incorporating rotation invariance already in the descriptor yields similar performance to handling it in the matcher. However, rotation invariance is achieved earlier in the matcher when it is learned in the descriptor, allowing for a faster rotation-invariant matcher. Further, we find that enforcing rotation invariance does not hurt upright performance when trained at scale. Finally, we study the emergence of rotation invariance through scale and find that increasing the training data size substantially improves generalization to rotated images. We release two matchers robust to in-plane rotations that achieve state-of-the-art performance on e.g. multi-modal (WxBS), extreme (HardMatch), and satellite image matching (SatAst). Code is available at https://github.com/davnords/loma.
cs.RO
[238] ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models cs.RO | cs.CL | cs.CV
Nastaran Darabi, Amit Ranjan Trivedi
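TL;DR: This paper proposes ProGAL-VLA, which targets language ignorance in vision-language-action (VLA) models, where a model relies on visual shortcuts and is insensitive to instruction changes. The method builds a 3D entity-centric graph (GSM), uses a slow planner to generate symbolic sub-goals, and aligns the sub-goals with grounded entities through a Grounding Alignment Contrastive (GAC) loss. All actions are conditioned on a verified goal embedding $g_t$, whose attention entropy provides an intrinsic ambiguity signal.
Details
Motivation: VLA models commonly exhibit language ignorance: they rely on visual cues and overlook subtle changes in language instructions, which makes them insensitive to instructions and insufficiently robust.
Result: On the LIBERO-Plus benchmark, ProGAL-VLA raises robustness under robot perturbations from 30.3% to 71.5%, reduces language ignorance by 3-4x, and improves entity retrieval Recall@1 from 0.41 to 0.71. On the Custom Ambiguity Benchmark, it reaches AUROC 0.81 (versus 0.52 for the baseline) and AUPR 0.79, and raises the clarification rate on ambiguous inputs from 0.09 to 0.81 without harming success on unambiguous tasks.
Insight: The contributions are: a verification bottleneck (the verified goal embedding) that increases mutual information between language and actions; an entity-level GAC loss based on InfoNCE for better alignment; and calibrated selective prediction via attention entropy to detect ambiguity. Together these indicate that explicit verified grounding is an effective path toward instruction-sensitive, ambiguity-aware agents.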
Abstract: Vision language action (VLA) models enable generalist robotic agents but often exhibit language ignorance, relying on visual shortcuts and remaining insensitive to instruction changes. We present Prospective Grounding and Alignment VLA (ProGAL-VLA), which constructs a 3D entity-centric graph (GSM), uses a slow planner to produce symbolic sub-goals, and aligns them with grounded entities via a Grounding Alignment Contrastive (GAC) loss. All actions are conditioned on a verified goal embedding $g_t$, whose attention entropy provides an intrinsic ambiguity signal. On LIBERO-Plus, ProGAL-VLA increases robustness under robot perturbations from 30.3 to 71.5 percent, reduces language ignorance by 3x-4x, and improves entity retrieval from 0.41 to 0.71 Recall@1. On the Custom Ambiguity Benchmark, it reaches AUROC 0.81 (vs. 0.52), AUPR 0.79, and raises clarification on ambiguous inputs from 0.09 to 0.81 without harming unambiguous success. The verification bottleneck increases mutual information of language-actions, the GAC loss imposes an entity-level InfoNCE bound, and attention entropy yields calibrated selective prediction, indicating that explicit verified grounding is an effective path toward instruction-sensitive, ambiguity-aware agents.
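Two quantities from the abstract, an InfoNCE-style contrastive loss and attention entropy as an ambiguity signal, can be sketched with standard textbook definitions. The abstract does not give the exact form of the GAC loss or how attention weights are obtained, so the cosine-similarity InfoNCE, the temperature of 0.1, and the function names below are assumptions for illustration.

```python
import numpy as np

def info_nce(subgoals, entities, temperature=0.1):
    """Standard InfoNCE loss where sub-goal i should match entity i.

    subgoals, entities: arrays of shape (n, dim); rows are L2-normalised here.
    """
    s = subgoals / np.linalg.norm(subgoals, axis=1, keepdims=True)
    e = entities / np.linalg.norm(entities, axis=1, keepdims=True)
    logits = s @ e.T / temperature                     # pairwise similarities
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                # positives on the diagonal

def attention_entropy(weights, eps=1e-12):
    """Shannon entropy of a normalised attention distribution."""
    weights = weights / weights.sum()
    return -np.sum(weights * np.log(weights + eps))

rng = np.random.default_rng(0)
entities = rng.standard_normal((5, 16))                # hypothetical entity embeddings
aligned = entities + 0.01 * rng.standard_normal((5, 16))
loss_aligned = info_nce(aligned, entities)             # sub-goals match their entities
loss_shuffled = info_nce(entities[::-1], entities)     # most pairings are wrong

peaked = attention_entropy(np.array([0.97, 0.01, 0.01, 0.01]))  # one clear referent
flat = attention_entropy(np.array([0.25, 0.25, 0.25, 0.25]))    # ambiguous referent
```

Low entropy means attention concentrates on one entity; high entropy means several entities are equally plausible, which is the kind of signal an agent could use to decide when to ask for clarification.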
[239] VLN-NF: Feasibility-Aware Vision-and-Language Navigation with False-Premise Instructions cs.RO | cs.CL | cs.CV
Hung-Ting Su, Ting-Jun Wang, Jia-Fong Yeh, Min Sun, Winston H. Hsu
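TL;DR: This paper introduces VLN-NF, a benchmark for evaluating how vision-and-language navigation (VLN) agents handle false-premise instructions, where the target does not exist. It also proposes ROAM, which combines supervised room-level navigation with LLM/VLM-based in-room exploration and achieves the best performance on the new benchmark.
Details
Motivation: Conventional VLN benchmarks assume that instructions are feasible and the target exists, leaving agents unable to handle false-premise instructions whose target is absent. The paper addresses this limitation.
Result: On VLN-NF, ROAM achieves the best score on the composite metric REV-SPL, while baselines commonly under-explore and terminate prematurely.
Insight: The main contributions are VLN-NF, the first VLN benchmark focused on false-premise instructions; ROAM, a two-stage hybrid that combines conventional navigation with LLM/VLM-driven exploration; and REV-SPL, a new metric that jointly evaluates navigation, exploration, and decision correctness.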
Abstract: Conventional Vision-and-Language Navigation (VLN) benchmarks assume instructions are feasible and the referenced target exists, leaving agents ill-equipped to handle false-premise goals. We introduce VLN-NF, a benchmark with false-premise instructions where the target is absent from the specified room and agents must navigate, gather evidence through in-room exploration, and explicitly output NOT-FOUND. VLN-NF is constructed via a scalable pipeline that rewrites VLN instructions using an LLM and verifies target absence with a VLM, producing plausible yet factually incorrect goals. We further propose REV-SPL to jointly evaluate room reaching, exploration coverage, and decision correctness. To address this challenge, we present ROAM, a two-stage hybrid that combines supervised room-level navigation with LLM/VLM-driven in-room exploration guided by a free-space clearance prior. ROAM achieves the best REV-SPL among compared methods, while baselines often under-explore and terminate prematurely under unreliable instructions. VLN-NF project page can be found at https://vln-nf.github.io/.
[240] LIDEA: Human-to-Robot Imitation Learning via Implicit Feature Distillation and Explicit Geometry Alignment cs.RO | cs.CV
Yifu Xu, Bokai Lin, Xinyu Zhan, Hongjie Fang, Yong-Lu Li
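TL;DR: LIDEA is a framework for robot imitation learning from human videos via implicit feature distillation and explicit geometry alignment. It addresses the embodiment gap between humans and robots by aligning representations through dual-stage transitive distillation in the 2D visual domain and by decoupling embodiment from interaction geometry with an embodiment-agnostic alignment strategy in the 3D geometric domain, making efficient use of abundant human demonstration data.
Details
Motivation: Robot learning is limited by scarce robot demonstrations, while human videos provide abundant, underused interaction data. The key challenge is bridging the embodiment gap between human hands and robot arms; existing cross-embodiment transfer methods often introduce artifacts because of intrinsic differences in visual appearance and 3D geometry.
Result: Extensive experiments validate LIDEA in terms of data efficiency and OOD robustness. Human data can substitute for up to 80% of costly robot demonstrations, and the framework transfers unseen patterns from human videos to achieve out-of-distribution generalization.
Insight: The contributions are a dual-stage transitive distillation pipeline that aligns human and robot representations in a shared latent space in the 2D visual domain, and an embodiment-agnostic explicit alignment strategy in the 3D geometric domain that decouples embodiment from interaction geometry to ensure consistent 3D-aware perception. This offers a new way to use abundant human data to improve robot learning efficiency.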
Abstract: Scaling up robot learning is hindered by the scarcity of robotic demonstrations, whereas human videos offer a vast, untapped source of interaction data. However, bridging the embodiment gap between human hands and robot arms remains a critical challenge. Existing cross-embodiment transfer strategies typically rely on visual editing, but they often introduce visual artifacts due to intrinsic discrepancies in visual appearance and 3D geometry. To address these limitations, we introduce LIDEA (Implicit Feature Distillation and Explicit Geometric Alignment), an imitation learning framework in which policy learning benefits from human demonstrations. In the 2D visual domain, LIDEA employs a dual-stage transitive distillation pipeline that aligns human and robot representations in a shared latent space. In the 3D geometric domain, we propose an embodiment-agnostic alignment strategy that explicitly decouples embodiment from interaction geometry, ensuring consistent 3D-aware perception. Extensive experiments empirically validate LIDEA from two perspectives: data efficiency and OOD robustness. Results show that human data substitutes up to 80% of costly robot demonstrations, and the framework successfully transfers unseen patterns from human videos for out-of-distribution generalization.
[241] ViserDex: Visual Sim-to-Real for Robust Dexterous In-hand Reorientation cs.RO | cs.CV
Arjun Bhardwaj, Maximum Wilder-Smith, Mayank Mittal, Vaishakh Patil, Marco Hutter
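TL;DR: This paper presents ViserDex, a visual sim-to-real framework for dexterous in-hand object reorientation with a monocular RGB camera. Its core idea is to integrate 3D Gaussian Splatting (3DGS) to bridge the visual sim-to-real gap: domain randomization in the 3D Gaussian representation space generates photorealistic, randomized visual data for object pose estimation. The manipulation policy is trained with curriculum-based reinforcement learning and teacher-student distillation. The perception and control models can be trained independently on consumer-grade hardware, and the system is validated on a physical multi-fingered hand, robustly reorienting five different objects under challenging lighting.
Details
Motivation: Dexterous in-hand reorientation depends on precise object pose estimation. Existing RGB-based solutions typically require multi-camera setups or costly ray tracing; the paper aims for a robust solution that needs only a monocular RGB camera and less computation.
Result: In challenging visual environments, the pose estimator trained on 3DGS data outperforms estimators trained on conventionally rendered data. The system is validated on a physical multi-fingered hand equipped with an RGB camera, robustly reorienting five different objects even under challenging lighting conditions.
Insight: The main contribution is applying 3DGS to the visual sim-to-real pipeline and performing domain randomization in the 3D Gaussian representation space rather than in pixel space, which yields physically consistent, photorealistic training data. This provides a practical path for RGB-only dexterous manipulation and substantially lowers hardware and compute requirements.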
Abstract: In-hand object reorientation requires precise estimation of the object pose to handle complex task dynamics. While RGB sensing offers rich semantic cues for pose tracking, existing solutions rely on multi-camera setups or costly ray tracing. We present a sim-to-real framework for monocular RGB in-hand reorientation that integrates 3D Gaussian Splatting (3DGS) to bridge the visual sim-to-real gap. Our key insight is performing domain randomization in the Gaussian representation space: by applying physically consistent, pre-rendering augmentations to 3D Gaussians, we generate photorealistic, randomized visual data for object pose estimation. The manipulation policy is trained using curriculum-based reinforcement learning with teacher-student distillation, enabling efficient learning of complex behaviors. Importantly, both perception and control models can be trained independently on consumer-grade hardware, eliminating the need for large compute clusters. Experiments show that the pose estimator trained with 3DGS data outperforms those trained using conventional rendering data in challenging visual environments. We validate the system on a physical multi-fingered hand equipped with an RGB camera, demonstrating robust reorientation of five diverse objects even under challenging lighting conditions. Our results highlight Gaussian splatting as a practical path for RGB-only dexterous manipulation. For videos of the hardware deployments and additional supplementary materials, please refer to the project website: https://rffr.leggedrobotics.com/works/viserdex/
[242] ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation cs.RO | cs.CV
Yiran Qin, Jiahua Ma, Li Kang, Wenzhan Li, Yihang Jiao
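TL;DR: This paper proposes ComSim, a hybrid approach in which compositional simulation combines classical and neural simulation to generate accurate action-video pairs. A closed-loop real-sim-real data augmentation pipeline generates large, diverse training datasets from a small amount of real data to narrow the sim-to-real domain gap.
Details
Motivation: Acquiring large-scale, high-quality training data for robotics is difficult because it typically requires substantial manual effort and rarely covers diverse real-world environments.
Result: Experiments show that the method significantly reduces the sim-to-real domain gap and improves the success rate of policy models in real-world environments.
Insight: The contribution is compositional simulation with a closed-loop data augmentation pipeline, in which a neural simulator converts classical simulation videos into real-world representations, enabling scalable generation of robust training data and bridging the gap between simulated and real robots.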
Abstract: Recent advancements in foundational models, such as large language models and world models, have greatly enhanced the capabilities of robotics, enabling robots to autonomously perform complex tasks. However, acquiring large-scale, high-quality training data for robotics remains a challenge, as it often requires substantial manual effort and is limited in its coverage of diverse real-world environments. To address this, we propose a novel hybrid approach called Compositional Simulation, which combines classical simulation and neural simulation to generate accurate action-video pairs while maintaining real-world consistency. Our approach utilizes a closed-loop real-sim-real data augmentation pipeline, leveraging a small amount of real-world data to generate diverse, large-scale training datasets that cover a broader spectrum of real-world scenarios. We train a neural simulator to transform classical simulation videos into real-world representations, improving the accuracy of policy models trained in real-world environments. Through extensive experiments, we demonstrate that our method significantly reduces the sim2real domain gap, resulting in higher success rates in real-world policy model training. Our approach offers a scalable solution for generating robust training data and bridging the gap between simulated and real-world robotics.
[243] EagleVision: A Multi-Task Benchmark for Cross-Domain Perception in High-Speed Autonomous Racing cs.RO | cs.CV
Zakhar Yagudin, Murad Mebrahtu, Ren Jin, Jiaqi Huang, Yujia Yue
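TL;DR: This paper introduces EagleVision, a LiDAR-based multi-task benchmark for 3D detection and trajectory prediction in high-speed autonomous racing. It combines the Indy Autonomous Challenge dataset, the A2RL Real competition dataset, and simulator-generated data, for about 28,056 annotated frames in total under a unified evaluation protocol. Using a dataset-centric transfer framework, the paper quantifies generalization across urban, simulator, and real racing domains.
Details
Motivation: High-speed autonomous racing poses extreme perception challenges, such as large relative velocities and substantial domain shift, and existing benchmarks do not adequately capture these high-dynamic conditions. A new benchmark is therefore needed to study perception generalization under high-speed dynamics systematically.
Result: Pretraining on urban data improves detection (NDS 0.72 vs. 0.69), while intermediate pretraining on real racing data gives the best transfer to A2RL (NDS 0.726), outperforming simulator-only adaptation. For trajectory prediction, models trained on Indy outperform in-domain training on A2RL test sequences (FDE 0.947 vs. 1.250), highlighting the role of motion-distribution coverage in cross-domain forecasting.
Insight: The contributions are a unified LiDAR multi-task benchmark covering real and simulated data, and a dataset-centric transfer framework for quantifying cross-domain generalization. Viewed objectively, the study underlines how much data diversity and motion-distribution coverage matter for the generalization of perception models under extreme high-speed conditions, offering new guidance for perception systems in high-speed autonomous driving.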
Abstract: High-speed autonomous racing presents extreme perception challenges, including large relative velocities and substantial domain shifts from conventional urban-driving datasets. Existing benchmarks do not adequately capture these high-dynamic conditions. We introduce EagleVision, a unified LiDAR-based multi-task benchmark for 3D detection and trajectory prediction in high-speed racing, providing newly annotated 3D bounding boxes for the Indy Autonomous Challenge dataset (14,893 frames) and the A2RL Real competition dataset (1,163 frames), together with 12,000 simulator-generated annotated frames, all standardized under a common evaluation protocol. Using a dataset-centric transfer framework, we quantify cross-domain generalization across urban, simulator, and real racing domains. Urban pretraining improves detection over scratch training (NDS 0.72 vs. 0.69), while intermediate pretraining on real racing data achieves the best transfer to A2RL (NDS 0.726), outperforming simulator-only adaptation. For trajectory prediction, Indy-trained models surpass in-domain A2RL training on A2RL test sequences (FDE 0.947 vs. 1.250), highlighting the role of motion-distribution coverage in cross-domain forecasting. EagleVision enables systematic study of perception generalization under extreme high-speed dynamics. The dataset and benchmark are publicly available at https://avlab.io/EagleVision
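The FDE values quoted in the abstract above are final displacement errors. A minimal sketch of that metric, assuming 2D waypoints and one predicted trajectory per agent; the function names and sample trajectories are illustrative and do not come from the benchmark's code:

```python
import math

def final_displacement_error(predicted, ground_truth):
    """Euclidean distance between the last predicted and last true waypoint."""
    (px, py), (gx, gy) = predicted[-1], ground_truth[-1]
    return math.hypot(px - gx, py - gy)

def mean_fde(pred_batch, gt_batch):
    """Average FDE over a batch of trajectories."""
    errors = [final_displacement_error(p, g) for p, g in zip(pred_batch, gt_batch)]
    return sum(errors) / len(errors)

# Illustrative trajectories only (not EagleVision data).
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
truth = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]
print(final_displacement_error(pred, truth))
```

Only the final waypoint enters the score, so a model can follow a poor intermediate path and still obtain a low FDE; benchmarks often report it alongside an average displacement error for that reason.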
[244] StarVLA-$α$: Reducing Complexity in Vision-Language-Action Systems cs.RO | cs.AI | cs.CV
Jinhui Ye, Ning Gao, Senqiao Yang, Jinliang Zheng, Zixuan Wang
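TL;DR: This paper introduces StarVLA-α, a strong baseline intended to simplify the design of Vision-Language-Action (VLA) models. By minimizing architectural and pipeline complexity, it achieves strong performance on several robotics benchmarks (LIBERO, SimplerEnv, RoboTwin, RoboCasa), indicating that a strong vision-language model backbone with a minimal design is sufficient, without additional architectural complexity or engineering tricks.
Details
Motivation: The VLA landscape is highly fragmented and complex: approaches differ widely in architecture, training data, embodiment configuration, and benchmark-specific engineering, which makes systematic comparison difficult. The paper uses a simple but strong baseline to study key VLA design choices under controlled conditions.
Result: Under unified multi-benchmark training, the simple baseline remains highly competitive. Notably, its single generalist model outperforms π_{0.5} by 20% on the public real-world RoboChallenge benchmark.
Insight: The claimed contribution is a simplified, controllable baseline for VLA research, together with a systematic re-evaluation of key design axes such as action modeling strategies, robot-specific pretraining, and interface engineering. Viewed objectively, the core insight is that a strong vision-language model backbone with a minimal design is enough for strong VLA performance, which challenges the field's reliance on complex architectures and engineering tricks.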
Abstract: Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for building general-purpose robotic agents. However, the VLA landscape remains highly fragmented and complex: existing approaches vary substantially in architectures, training data, embodiment configurations, and benchmark-specific engineering. In this work, we introduce StarVLA-$α$, a simple yet strong baseline designed to study VLA design choices under controlled conditions. StarVLA-$α$ deliberately minimizes architectural and pipeline complexity to reduce experimental confounders and enable systematic analysis. Specifically, we re-evaluate several key design axes, including action modeling strategies, robot-specific pretraining, and interface engineering. Across unified multi-benchmark training on LIBERO, SimplerEnv, RoboTwin, and RoboCasa, the same simple baseline remains highly competitive, indicating that a strong VLM backbone combined with minimal design is already sufficient to achieve strong performance without relying on additional architectural complexity or engineering tricks. Notably, our single generalist model outperforms $π_{0.5}$ by 20% on the public real-world RoboChallenge benchmark. We expect StarVLA-$α$ to serve as a solid starting point for future research in the VLA regime. Code will be released at https://github.com/starVLA/starVLA.
eess.IV
[245] Search-MIND: Training-Free Multi-Modal Medical Image Registration eess.IV | cs.CV
Boya Wang, Ruizhe Li, Chao Chen, Xin Chen
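TL;DR: This paper proposes Search-MIND, a training-free, iterative-optimization framework for multi-modal medical image registration. It follows a coarse-to-fine strategy with two stages, hierarchical coarse alignment and deformable refinement, and introduces two novel loss functions to handle non-linear intensity relationships and local optima.
Details
Motivation: Multi-modal image registration is challenged by non-linear intensity relationships and local optima, and existing deep learning models suffer generalization collapse on unseen modalities.
Result: Evaluations on the CARE Liver 2025 and CHAOS Challenge datasets show that Search-MIND consistently outperforms classical methods such as ANTs and foundation-model-based methods such as DINO-reg in registration performance, with better stability across modalities.
Insight: The novelty lies in a training-free, instance-level optimization framework and two new loss functions: VWMI uses variance weighting to focus on informative tissue regions and shield the alignment from noise, while S-MIND widens the convergence basin of structural descriptors by enlarging the local search range, improving registration robustness.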
Abstract: Multi-modal image registration plays a critical role in precision medicine but faces challenges from non-linear intensity relationships and local optima. While deep learning models enable rapid inference, they often suffer from generalization collapse on unseen modalities. To address this, we propose Search-MIND, a training-free, iterative optimization framework for instance-specific registration. Our pipeline utilizes a coarse-to-fine strategy: a hierarchical coarse alignment stage followed by deformable refinement. We introduce two novel loss functions: Variance-Weighted Mutual Information (VWMI), which prioritizes informative tissue regions to shield global alignment from background noise and uniform regions, and Search-MIND (S-MIND), which broadens the convergence basin of structural descriptors by considering a larger local search range. Evaluations on the CARE Liver 2025 and CHAOS Challenge datasets show that Search-MIND consistently outperforms classical baselines like ANTs and foundation model-based approaches like DINO-reg, offering superior stability across diverse modalities.
cs.GR
[246] NeuVolEx: Implicit Neural Features for Volume Exploration cs.GR | cs.CV
Haill An, Suhyeon Kim, Donghyuk Choo, Younhyun Jung
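TL;DR: This paper proposes NeuVolEx, a volume exploration method based on implicit neural representations (INRs) that extends INRs from volume compression to volume exploration. It uses the feature representations learned during INR training, and adds a structural encoder and multi-task learning to improve spatial coherence, in order to support exploration tasks such as image-based transfer function design and viewpoint recommendation.
Details
Motivation: Existing volume exploration methods rely on either explicit local features or implicit convolutional features; the former struggle to capture broad geometric patterns, and the latter perform unreliably when user supervision is limited. INRs perform well for volume compression but have not yet been used for exploration, so the paper studies how INR features can support robust volume exploration.
Result: Validated on volume datasets spanning multiple modalities and ROI complexities, NeuVolEx achieves accurate ROI classification under sparse user supervision in image-based transfer function design. In viewpoint recommendation, it uses unsupervised clustering to identify complementary viewpoints that reveal different ROI clusters. It improves on prior methods in both effectiveness and usability.
Insight: The contribution is to extend INRs from compression to exploration, using the features learned during training as a robust basis. A structural encoder and multi-task learning strengthen the spatial coherence of the features and improve ROI characterization. The result is a general neural-feature-based framework for volume exploration that adapts to both sparse-supervision and unsupervised settings.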
Abstract: Direct volume rendering (DVR) aims to help users identify and examine regions of interest (ROIs) within volumetric data, and feature representations that support effective ROI classification and clustering play a fundamental role in volume exploration. Existing approaches typically rely on either explicit local feature representations or implicit convolutional feature representations learned from raw volumes. However, explicit local feature representations are limited in capturing broader geometric patterns and spatial correlations, while implicit convolutional feature representations do not necessarily ensure robust performance in practice, where user supervision is typically limited. Meanwhile, implicit neural representations (INRs) have recently shown strong promise in DVR for volume compression, owing to their ability to compactly parameterize continuous volumetric fields. In this work, we propose NeuVolEx, a neural volume exploration approach that extends the role of INRs beyond volume compression. Unlike prior compression methods that focus on INR outputs, NeuVolEx leverages feature representations learned during INR training as a robust basis for volume exploration. To better adapt these feature representations to exploration tasks, we augment a base INR with a structural encoder and a multi-task learning scheme that improve spatial coherence for ROI characterization. We validate NeuVolEx on two fundamental volume exploration tasks: image-based transfer function (TF) design and viewpoint recommendation. NeuVolEx enables accurate ROI classification under sparse user supervision for image-based TF design and supports unsupervised clustering to identify compact complementary viewpoints that reveal different ROI clusters. Experiments on diverse volume datasets with varying modalities and ROI complexities demonstrate that NeuVolEx improves both effectiveness and usability over prior methods.
cs.AI
[247] LABBench2: An Improved Benchmark for AI Systems Performing Biology Research cs.AI | cs.CL | cs.LG
Jon M Laurent, Albert Bou, Michael Pieler, Conor Igoe, Alex Andonian
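TL;DR: This paper introduces LABBench2, an improved benchmark for evaluating how well AI systems perform practical scientific tasks in biology research. It comprises nearly 1,900 tasks and continues the LAB-Bench framework, but measures similar capabilities in more realistic settings and is substantially harder.
Details
Motivation: As optimism about AI-accelerated scientific discovery grows, evaluation needs to shift from knowledge and reasoning alone toward the ability of AI to perform meaningful real work, so that progress in scientific domains is measured more realistically.
Result: Evaluation of current frontier models shows that the abilities measured by LAB-Bench and LABBench2 have improved substantially, yet LABBench2 is markedly harder (model-specific accuracy differences across subtasks range from -26% to -46%), indicating continued room for improvement.
Insight: The contribution is to extend the evaluation of AI scientific capability from abstract reasoning to task execution in settings closer to real research, raising realism and difficulty to drive the development of AI tools for core research functions. Viewed objectively, it gives the community a public dataset and evaluation harness that help standardize and accelerate the evaluation of AI in science.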
Abstract: Optimism for accelerating scientific discovery with AI continues to grow. Current applications of AI in scientific research range from training dedicated foundation models on scientific data to agentic autonomous hypothesis generation systems to AI-driven autonomous labs. The need to measure progress of AI systems in scientific domains correspondingly must not only accelerate, but increasingly shift focus to more real-world capabilities: beyond rote knowledge, and even beyond reasoning alone, to actually measuring the ability to perform meaningful work. Prior work introduced the Language Agent Biology Benchmark (LAB-Bench) as an initial attempt at measuring these abilities. Here we introduce an evolution of that benchmark, LABBench2, for measuring real-world capabilities of AI systems performing useful scientific tasks. LABBench2 comprises nearly 1,900 tasks and is, for the most part, a continuation of LAB-Bench, measuring similar capabilities but in more realistic contexts. We evaluate performance of current frontier models, and show that while abilities measured by LAB-Bench and LABBench2 have improved substantially, LABBench2 provides a meaningful jump in difficulty (model-specific accuracy differences range from -26% to -46% across subtasks) and underscores continued room for performance improvement. LABBench2 continues the legacy of LAB-Bench as a de facto benchmark for AI scientific research capabilities and we hope that it continues to help advance development of AI tools for these core research functions. To facilitate community use and development, we provide the task dataset at https://huggingface.co/datasets/futurehouse/labbench2 and a public eval harness at https://github.com/EdisonScientific/labbench2.
[248] CID-TKG: Collaborative Historical Invariance and Evolutionary Dynamics Learning for Temporal Knowledge Graph Reasoning cs.AI | cs.CL
Shuai-Long Lei, Xiaobin Zhu, Jiarui Liang, Guoxi Sun, Zhiyu Fang
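TL;DR: This paper proposes CID-TKG, a novel collaborative learning framework for temporal knowledge graph reasoning. It integrates historical invariance semantics and evolutionary dynamics as an effective inductive bias, overcoming the limitation that existing methods rely mainly on time-invariant or weakly time-dependent structures and overlook evolutionary dynamics.
Details
Motivation: Because of inherent limitations in their inductive biases, existing temporal knowledge graph reasoning methods rely mainly on time-invariant or weakly time-dependent structures and ignore the evolutionary dynamics of entities, which limits their reasoning ability.
Result: Extensive experiments show that CID-TKG achieves state-of-the-art performance under extrapolation settings.
Insight: The core novelty is the collaborative learning framework, which builds and encodes both a historical invariance graph and an evolutionary dynamics graph to capture long-term structural regularities and short-term temporal transitions, respectively. Relations are decomposed into view-specific representations, and query representations are aligned with a contrastive objective, which mitigates semantic discrepancies across the two structures, promotes cross-view consistency, and suppresses view-specific noise.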
Abstract: Temporal knowledge graph (TKG) reasoning aims to infer future facts at unseen timestamps from temporally evolving entities and relations. Despite recent progress, existing approaches still suffer from inherent limitations due to their inductive biases, as they predominantly rely on time-invariant or weakly time-dependent structures and overlook the evolutionary dynamics. To overcome this limitation, we propose a novel collaborative learning framework for TKG reasoning (dubbed CID-TKG) that integrates evolutionary dynamics and historical invariance semantics as an effective inductive bias for reasoning. Specifically, CID-TKG constructs a historical invariance graph to capture long-term structural regularities and an evolutionary dynamics graph to model short-term temporal transitions. Dedicated encoders are then employed to learn representations from each structure. To alleviate semantic discrepancies across the two structures, we decompose relations into view-specific representations and align view-specific query representations via a contrastive objective, which promotes cross-view consistency while suppressing view-specific noise. Extensive experiments verify that our CID-TKG achieves state-of-the-art performance under extrapolation settings.
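The abstract above does not state which contrastive objective aligns the two views. As a rough illustration only, the sketch below uses an InfoNCE-style loss (an assumption, not the paper's confirmed formulation) in which the representation of query i from one view is pulled toward the representation of the same query from the other view and pushed away from other queries. The function name `info_nce` and the `temperature` value are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce(view_a, view_b, temperature=0.1):
    """Mean InfoNCE loss; row i of view_b is the positive for row i of view_a."""
    a = [normalize(v) for v in view_a]
    b = [normalize(v) for v in view_b]
    total = 0.0
    for i, query in enumerate(a):
        logits = [dot(query, key) / temperature for key in b]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)

# Toy query representations from two views (illustrative values only).
invariance_view = [[1.0, 0.0], [0.0, 1.0]]
aligned_dynamics_view = [[1.0, 0.0], [0.0, 1.0]]
misaligned_dynamics_view = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce(invariance_view, aligned_dynamics_view))
print(info_nce(invariance_view, misaligned_dynamics_view))
```

The loss is close to zero when the two views agree on each query and large when they disagree, which is the cross-view consistency pressure the abstract describes.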
[249] Unifying Ontology Construction and Semantic Alignment for Deterministic Enterprise Reasoning at Scale cs.AI | cs.CL
Hongyin Zhu
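TL;DR: This paper proposes the large ontology model (LOM), a framework that unifies ontology construction, semantic alignment, and logical reasoning in an end-to-end architecture, targeting chaotic enterprise data and the disjoint pipelines and error propagation of existing neuro-symbolic approaches.
Details
Motivation: Enterprises accumulate large amounts of chaotic, unused data, and existing neuro-symbolic approaches rely on disjoint pipelines that are prone to error propagation. A unified framework is needed for deterministic enterprise-grade reasoning.
Result: On a comprehensive benchmark built from real-world enterprise datasets, LOM-4B reaches 88.8% accuracy on ontology completion and 94% accuracy on complex graph reasoning, significantly outperforming state-of-the-art LLMs.
Insight: The novelty is the unified construct-align-reason (CAR) pipeline, which integrates ontology construction, semantic alignment, and logical reasoning in a single architecture and aligns neural generation with structural reality through a graph-aware encoder and reinforcement learning, supporting autonomous logical construction for deterministic enterprise-grade intelligence.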
Abstract: While enterprises amass vast quantities of data, much of it remains chaotic and effectively dormant, preventing decision-making based on comprehensive information. Existing neuro-symbolic approaches rely on disjoint pipelines and struggle with error propagation. We introduce the large ontology model (LOM), a unified framework that seamlessly integrates ontology construction, semantic alignment, and logical reasoning into a single end-to-end architecture. LOM employs a construct-align-reason (CAR) pipeline, leveraging its unified architecture across all three stages: it first autonomously constructs a domain-specific ontological universe from raw data, then aligns neural generation with this structural reality using a graph-aware encoder and reinforcement learning, and finally executes deterministic reasoning over the constructed topology, node attributes and relation types. We evaluate LOM on a comprehensive benchmark constructed from diverse real-world enterprise datasets. Experimental results demonstrate that LOM-4B achieves 88.8% accuracy in ontology completion and 94% in complex graph reasoning tasks, significantly outperforming state-of-the-art LLMs. These findings validate that autonomous logical construction is essential for achieving deterministic, enterprise-grade intelligence.
[250] Pioneer Agent: Continual Improvement of Small Language Models in Production cs.AI | cs.CL | cs.LG | cs.MA
Dhruv Atreja, Julia White, Nikhil Nayak, Kelton Zhang, Henrijs Princis
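TL;DR: This paper presents Pioneer Agent, a closed-loop system that automates production deployment and continual improvement of small language models (SLMs). It has two modes, cold-start and production: from a natural-language task description it can acquire data, build evaluation sets, and iteratively train models, and after deployment it can diagnose error patterns, construct targeted training data, and retrain under constraints.
Details
Motivation: SLMs suit production deployment because of their low cost, fast inference, and ease of customization, but adapting them to a specific task still involves engineering challenges such as data curation, failure diagnosis, regression avoidance, and iteration control, which call for an automated solution.
Result: Across eight cold-start benchmarks covering reasoning, math, code generation, summarization, and classification, Pioneer Agent improves over base models by 1.6 to 83.8 points. On the purpose-built AdaptFT-Bench, it preserves or improves performance in all seven scenarios, whereas naive retraining degrades by up to 43 points. In two production-style deployments built from public benchmark tasks, intent classification accuracy rises from 84.9% to 99.3% and entity F1 from 0.345 to 0.810.
Insight: The system automates the full model adaptation lifecycle, from cold start to production iteration, and discovers effective training strategies (such as chain-of-thought supervision, task-specific optimization, and quality-focused data curation) from downstream feedback. The proposed AdaptFT-Bench also provides a new tool for testing the full adaptation loop.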
Abstract: Small language models are attractive for production deployment due to their low cost, fast inference, and ease of specialization. However, adapting them to a specific task remains a challenging engineering loop, driven not by training itself but by surrounding decisions: data curation, failure diagnosis, regression avoidance, and iteration control. We present Pioneer Agent, a closed-loop system that automates this lifecycle. In cold-start mode, given only a natural-language task description, the agent acquires data, constructs evaluation sets, and iteratively trains models by jointly optimizing data, hyperparameters, and learning strategy. In production mode, given a deployed model with labeled failures, it diagnoses error patterns, constructs targeted training data, and retrains under explicit regression constraints. To evaluate this setting, we introduce AdaptFT-Bench, a benchmark of synthetic inference logs with progressively increasing noise, designed to test the full adaptation loop: diagnosis, curriculum synthesis, retraining, and verification. Across eight cold-start benchmarks spanning reasoning, math, code generation, summarization, and classification, Pioneer Agent improves over base models by 1.6-83.8 points. On AdaptFT-Bench, it improves or preserves performance in all seven scenarios, while naive retraining degrades by up to 43 points. On two production-style deployments built from public benchmark tasks, it raises intent classification from 84.9% to 99.3% and Entity F1 from 0.345 to 0.810. Beyond performance gains, the agent often discovers effective training strategies, including chain-of-thought supervision, task-specific optimization, and quality-focused data curation, purely from downstream feedback.
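The abstract above says production-mode retraining happens "under explicit regression constraints" but does not spell them out. One plausible reading, sketched here purely as an assumption, is an acceptance gate: a retrained model replaces the deployed one only if the targeted metric improves and no other tracked metric drops by more than a tolerance. The function name, metric names, and scores below are hypothetical.

```python
def accept_candidate(baseline, candidate, target, tolerance=0.0):
    """Return (accepted, regressions) for a retrained candidate model.

    baseline / candidate map metric names to scores (higher is better).
    The candidate is accepted only if the target metric improves and no
    other metric falls more than `tolerance` below its baseline score.
    """
    regressions = [name for name, score in baseline.items()
                   if name != target and candidate[name] < score - tolerance]
    improved = candidate[target] > baseline[target]
    return improved and not regressions, regressions

# Hypothetical scores, not taken from the paper.
baseline = {"failure_slice": 0.60, "held_out_a": 0.90, "held_out_b": 0.80}
safe_update = {"failure_slice": 0.75, "held_out_a": 0.90, "held_out_b": 0.81}
risky_update = {"failure_slice": 0.85, "held_out_a": 0.91, "held_out_b": 0.70}

print(accept_candidate(baseline, safe_update, "failure_slice"))
print(accept_candidate(baseline, risky_update, "failure_slice"))
```

The second candidate fixes more of the targeted failures but is rejected because it regresses on a held-out set, which is the kind of degradation the abstract attributes to naive retraining.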
[251] COMPOSITE-Stem cs.AI | cs.CL | cs.LG
Kyle Waters, Lucas Nuzzi, Tadhg Looram, Alessandro Tomasiello, Ariel Ghislain Kemogne Kamdoum
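TL;DR: This paper presents COMPOSITE-STEM, a benchmark of 70 expert-written tasks in physics, biology, chemistry, and mathematics, curated by doctoral-level researchers. It combines exact-match grading, criterion-based rubrics, and an LLM-as-a-jury grading protocol to assess scientifically meaningful outputs more flexibly. Four frontier models are evaluated with an adapted multimodal Terminus-2 agent within the Harbor agentic evaluation framework; the best model scores only 21%, indicating that the benchmark captures capabilities current AI agents have not yet mastered. All tasks are open-sourced to support reproducible research.
Details
Motivation: AI agents hold great promise for accelerating scientific discovery, but the lack of frontier evaluations hinders their adoption in real workflows. Existing expert-written benchmarks measure AI reasoning effectively, but most are now saturated and assess only constrained outputs, so new evaluation tools are needed to fill the gap.
Result: On COMPOSITE-STEM, four frontier models were evaluated with the adapted multimodal Terminus-2 agent within the Harbor framework. The best-performing model scored only 21%, which shows how challenging the benchmark is and that it measures scientific capabilities current AI agents have not reached.
Insight: The paper builds a difficult, expert-designed benchmark of open-ended tasks across STEM disciplines and introduces a hybrid evaluation protocol combining exact match, human-written rubrics, and LLM-jury grading, giving a more comprehensive assessment of scientific reasoning and flexible outputs. Its core value is a new, unsaturated, more challenging standard for measuring the real ability of AI on complex scientific problems.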
Abstract: AI agents hold growing promise for accelerating scientific discovery; yet, a lack of frontier evaluations hinders adoption into real workflows. Expert-written benchmarks have proven effective at measuring AI reasoning, but most at this stage have become saturated and only measure performance on constrained outputs. To help address this gap, we introduce COMPOSITE-STEM, a benchmark of 70 expert-written tasks in physics, biology, chemistry, and mathematics, curated by doctoral-level researchers. Our benchmark combines exact-match grading and criterion-based rubrics with an LLM-as-a-jury grading protocol, allowing more flexible assessment of scientifically meaningful outputs. Using an adapted multimodal Terminus-2 agent harness within the Harbor agentic evaluation framework, we evaluate four frontier models. The top-performing model achieves 21%, demonstrating that COMPOSITE-STEM captures capabilities beyond current agent reach. All tasks are open-sourced with contributor permission to support reproducibility and to promote additional research towards AI’s acceleration of scientific progress in these domains.
[252] Instructing LLMs to Negotiate using Reinforcement Learning with Verifiable Rewards cs.AI | cs.CL | cs.GT | econ.GN
Shuze Daniel Liu, Claire Chen, Jiabao Sean Xiao, Lei Lei, Yuheng Zhang
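TL;DR: This paper proposes a reinforcement learning framework with verifiable rewards for training large language models in bilateral price negotiation. With reward signals grounded directly in maximizing economic surplus and strictly respecting private budget constraints, the study finds that the agent undergoes a four-phase strategic evolution during training: from naive bargaining, to aggressive opening prices, through a deadlock phase, and finally to sophisticated persuasion skills.
Details
Motivation: LLMs show potential as autonomous interactive agents but perform poorly in strategic games of incomplete information, such as bilateral price negotiation. The paper explores whether reinforcement learning with verifiable rewards can effectively teach LLMs to negotiate and studies the strategic behaviors that emerge during learning.
Result: Experiments show that this verifiable training enables a 30-billion-parameter agent to significantly outperform frontier models more than ten times its size at extracting economic surplus. The trained agent also generalizes robustly to stronger counterparties unseen in training and remains effective against hostile, adversarial seller personas.
Insight: The novelty is a reinforcement learning framework that ties the reward signal directly to verifiable economic quantities (surplus and budget), guiding LLMs to develop sophisticated strategic behavior in incomplete-information games. Viewed objectively, the systematic link it draws between reinforcement learning and the strategic evolution of LLMs on a structured negotiation task is a useful methodological reference.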
Abstract: The recent advancement of Large Language Models (LLMs) has established their potential as autonomous interactive agents. However, they often struggle in strategic games of incomplete information, such as bilateral price negotiation. In this paper, we investigate if Reinforcement Learning from Verifiable Rewards (RLVR) can effectively teach LLMs to negotiate. Specifically, we explore the strategic behaviors that emerge during the learning process. We introduce a framework that trains a mid-sized buyer agent against a regulated LLM seller across a wide distribution of real-world products. By grounding reward signals directly in the maximization of economic surplus and strict adherence to private budget constraints, we reveal a novel four-phase strategic evolution. The agent progresses from naive bargaining to using aggressive starting prices, moves through a phase of deadlock, and ultimately develops sophisticated persuasive skills. Our results demonstrate that this verifiable training allows a 30B agent to significantly outperform frontier models over ten times its size in extracting surplus. Furthermore, the trained agent generalizes robustly to stronger counterparties unseen during training and remains effective even when facing hostile, adversarial seller personas.
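The abstract above grounds the reward in economic surplus and adherence to a private budget, without giving the exact formula. The sketch below is one hypothetical reward of that kind, assuming a positive budget, a surplus normalized by the budget, zero reward for no deal, and a fixed penalty for exceeding the budget. The function name and the penalty value are illustrative, not the paper's.

```python
def buyer_reward(budget, deal_price=None, violation_penalty=-1.0):
    """Verifiable reward for a buyer agent (assumes budget > 0).

    - no agreement         -> 0.0
    - price above budget   -> violation_penalty (constraint broken)
    - otherwise            -> surplus as a fraction of the budget
    """
    if deal_price is None:
        return 0.0
    if deal_price > budget:
        return violation_penalty
    return (budget - deal_price) / budget

print(buyer_reward(100, 80))    # deal within budget
print(buyer_reward(100, 120))   # budget violated
print(buyer_reward(100))        # negotiation ended without a deal
```

Because every term can be computed from the final price and the private budget, the reward needs no learned judge, which is what makes it verifiable in the sense the abstract uses.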
[253] FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks cs.AI | cs.CE | cs.CL | cs.MM
Yupeng Cao, Haohang Li, Weijin Liu, Wenbo Cao, Anke Xu
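TL;DR: This paper presents FinTrace, a benchmark for holistic evaluation of LLM tool calling in long-horizon financial tasks. It contains 800 expert-annotated trajectories covering 34 real-world financial task categories and uses a rubric-based protocol that assesses models along four axes: action correctness, execution efficiency, process quality, and output quality. The authors also build FinTrace-Training, the first trajectory-level preference dataset for financial tool calling, and train models with supervised fine-tuning and direct preference optimization, finding that intermediate reasoning metrics improve but final answer quality remains the bottleneck.
Details
Motivation: Existing benchmarks for financial tool calling cover limited scenarios and rely on call-level metrics that cannot capture trajectory-level reasoning quality, so a more comprehensive evaluation framework is needed.
Result: In an evaluation of 13 LLMs, frontier models are strong at tool selection, but all models struggle with information utilization and final answer quality. After fine-tuning Qwen-3.5-9B on FinTrace-Training, intermediate reasoning metrics improve consistently and DPO suppresses failure modes more effectively, yet end-to-end answer quality still does not fully improve.
Insight: The contributions are FinTrace, a multi-dimensional, fine-grained trajectory-level benchmark, and the first trajectory-level preference dataset for model optimization. Viewed objectively, the work exposes a key gap in tool calling between selecting the right tools and reasoning over their outputs, giving future research a direction for diagnosis and optimization.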
Abstract: Recent studies demonstrate that tool-calling capability enables large language models (LLMs) to interact with external environments for long-horizon financial tasks. While existing benchmarks have begun evaluating financial tool calling, they focus on limited scenarios and rely on call-level metrics that fail to capture trajectory-level reasoning quality. To address this gap, we introduce FinTrace, a benchmark comprising 800 expert-annotated trajectories spanning 34 real-world financial task categories across multiple difficulty levels. FinTrace employs a rubric-based evaluation protocol with nine metrics organized along four axes – action correctness, execution efficiency, process quality, and output quality – enabling fine-grained assessment of LLM tool-calling behavior. Our evaluation of 13 LLMs reveals that while frontier models achieve strong tool selection, all models struggle with information utilization and final answer quality, exposing a critical gap between invoking the right tools and reasoning effectively over their outputs. To move beyond diagnosis, we construct FinTrace-Training, the first trajectory-level preference dataset for financial tool-calling, containing 8,196 curated trajectories with tool-augmented contexts and preference pairs. We fine-tune Qwen-3.5-9B using supervised fine-tuning followed by direct preference optimization (DPO) and show that training on FinTrace-Training consistently improves intermediate reasoning metrics, with DPO more effectively suppressing failure modes. However, end-to-end answer quality remains a bottleneck, indicating that trajectory-level improvements do not yet fully propagate to final output quality.
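For readers unfamiliar with the DPO step mentioned in the abstract above, the sketch below computes the standard DPO loss for a single preference pair from summed sequence log-probabilities. The `beta` value and the log-probabilities are illustrative assumptions; the abstract does not report the hyperparameters used.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) trajectory pair.

    Arguments are log-probabilities of each full trajectory under the
    trainable policy and the frozen reference model.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy has shifted probability toward the chosen trajectory: lower loss.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The loss falls as the policy raises the chosen trajectory's likelihood relative to the rejected one, measured against the reference model, so no separate reward model is required.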
[254] Thinking Fast, Thinking Wrong: Intuitiveness Modulates LLM Counterfactual Reasoning in Policy Evaluation cs.AI | cs.CL
Yanjie He
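TL;DR: This paper studies how reliably large language models (LLMs) perform causal and counterfactual reasoning in real-world policy evaluation. The author builds a benchmark of 40 empirical policy evaluation cases from economics and social science, classified by whether the finding matches prior intuition (obvious, ambiguous, or counter-intuitive). Evaluating four frontier LLMs under five prompting strategies reveals a chain-of-thought (CoT) paradox, intuitiveness as the dominant factor, and a dissociation between knowledge and reasoning.
Details
Motivation: LLMs are increasingly used for causal and counterfactual reasoning, but their reliability in real-world policy evaluation remains underexplored. The paper evaluates how LLMs reason when empirical findings differ in intuitiveness (matching, unclear relative to, or contradicting common expectations).
Result: Across 2,400 experimental trials, the study finds that (1) CoT prompting substantially improves performance on obvious cases, but the benefit nearly vanishes on counter-intuitive ones (interaction OR = 0.053, p < 0.001); (2) intuitiveness is the dominant factor explaining performance differences (ICC = 0.537), outweighing model choice or prompting strategy; and (3) citation-based familiarity is unrelated to accuracy (p = 0.53), suggesting that models hold the relevant knowledge but fail to reason with it when findings contradict intuition.
Insight: The paper classifies empirical policy evaluation cases by intuitiveness and systematically reveals the "CoT paradox" and the "knowledge-reasoning dissociation" in LLM reasoning. Viewed objectively, it uses dual-process theory (System 1 vs. System 2) to argue critically that current LLMs' "slow thinking" may be just "slow talking": they produce the form of deliberate reasoning without the substance, which matters for understanding LLM limitations on complex real-world tasks.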
Abstract: Large language models (LLMs) are increasingly used for causal and counterfactual reasoning, yet their reliability in real-world policy evaluation remains underexplored. We construct a benchmark of 40 empirical policy evaluation cases drawn from economics and social science, each grounded in peer-reviewed evidence and classified by intuitiveness – whether the empirical finding aligns with (obvious), is unclear relative to (ambiguous), or contradicts (counter-intuitive) common prior expectations. We evaluate four frontier LLMs across five prompting strategies with 2,400 experimental trials and analyze the results using mixed-effects logistic regression. Our findings reveal three key results: (1) a chain-of-thought (CoT) paradox, where chain-of-thought prompting dramatically improves performance on obvious cases but this benefit is nearly eliminated on counter-intuitive ones (interaction OR = 0.053, $p < 0.001$); (2) intuitiveness as the dominant factor, explaining more variance than model choice or prompting strategy (ICC = 0.537); and (3) a knowledge-reasoning dissociation, where citation-based familiarity is unrelated to accuracy ($p = 0.53$), suggesting models possess relevant knowledge but fail to reason with it when findings contradict intuition. We frame these results through the lens of dual-process theory (System 1 vs. System 2) and argue that current LLMs’ “slow thinking” may be little more than “slow talking” – they produce the form of deliberative reasoning without the substance.
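To make the interaction odds ratio in the abstract above more concrete: in a logistic model with a CoT-by-intuitiveness interaction, the odds ratio of CoT on counter-intuitive cases is the CoT odds ratio on the reference cases multiplied by the interaction OR. The sketch below assumes that obvious cases are the reference category and uses a hypothetical CoT odds ratio of 20 for them; only the 0.053 comes from the abstract, and random effects are ignored.

```python
def cot_odds_ratio_counterintuitive(cot_or_reference, interaction_or):
    """Odds ratio of CoT on counter-intuitive cases in an interaction model."""
    return cot_or_reference * interaction_or

def apply_odds_ratio(probability, odds_ratio):
    """Shift a baseline accuracy by an odds ratio and return the new accuracy."""
    odds = probability / (1.0 - probability) * odds_ratio
    return odds / (1.0 + odds)

HYPOTHETICAL_COT_OR_OBVIOUS = 20.0   # assumed for illustration, not reported
INTERACTION_OR = 0.053               # from the abstract

effective_or = cot_odds_ratio_counterintuitive(HYPOTHETICAL_COT_OR_OBVIOUS, INTERACTION_OR)
print(effective_or)                              # close to 1: little net benefit
print(apply_odds_ratio(0.5, HYPOTHETICAL_COT_OR_OBVIOUS))
print(apply_odds_ratio(0.5, effective_or))
```

Under these assumed numbers, a large CoT gain on obvious cases shrinks to an odds ratio near 1 on counter-intuitive cases, which is what "nearly eliminated" would look like on the odds scale.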
[255] ZoomR: Memory Efficient Reasoning through Multi-Granularity Key Value Retrieval cs.AI | cs.CL
David H. Yang, Yuxuan Zhu, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati
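TL;DR: This paper proposes ZoomR, which achieves memory-efficient reasoning through multi-granularity key-value retrieval, cutting inference memory requirements by more than 4× on math and reasoning tasks while remaining competitive with baselines.
Details
Motivation: In long-output generation, the KV cache grows with output length, raising memory and compute costs. Prior work mainly compresses the input context while keeping the full decoding KV cache, which does not address long-output scenarios.
Result: Experiments on math and reasoning tasks show that ZoomR reduces inference memory requirements by more than 4× while maintaining competitive performance.
Insight: The novelty is to adaptively compress verbose reasoning thoughts into summaries and apply a dynamic KV cache selection policy: summary keys act as a coarse-grained index, and during decoding only the details of the most important thoughts are retrieved, giving hierarchical memory optimization.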
Abstract: Large language models (LLMs) have shown great performance on complex reasoning tasks but often require generating long intermediate thoughts before reaching a final answer. During generation, LLMs rely on a key-value (KV) cache for autoregressive decoding. However, the memory footprint of the KV cache grows with output length. Prior work on KV cache optimization mostly focuses on compressing the long input context, while retaining the full KV cache for decoding. For tasks requiring long output generation, this leads to increased computational and memory costs. In this paper, we introduce ZoomR, a novel approach that enables LLMs to adaptively compress verbose reasoning thoughts into summaries and uses a dynamic KV cache selection policy that leverages these summaries while also strategically “zooming in” on fine-grained details. By using summary keys as a coarse-grained index during decoding, ZoomR uses the query to retrieve details for only the most important thoughts. This hierarchical strategy significantly reduces memory usage by avoiding full-cache attention at each step. Experiments across math and reasoning tasks show that our approach achieves competitive performance compared to baselines, while reducing inference memory requirements by more than $4\times$. These results demonstrate that multi-granularity KV selection enables more memory-efficient decoding, especially for long output generation.
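The coarse-to-fine retrieval described in the abstract above can be pictured with a small sketch. It assumes each past thought stores one summary key plus its detailed keys, scores thoughts by the dot product between the current query and the summary key, and expands only the top-scoring thoughts. The scoring rule, the data layout, and the name `select_kv_entries` are assumptions for illustration, not ZoomR's actual implementation.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_kv_entries(query, thoughts, top_k=1):
    """Pick which cache entries to attend to at one decoding step.

    Every thought contributes its summary entry; only the top_k thoughts
    whose summary keys best match the query also contribute their details.
    """
    scores = [dot(query, t["summary_key"]) for t in thoughts]
    ranked = sorted(range(len(thoughts)), key=lambda i: scores[i], reverse=True)
    zoomed = set(ranked[:top_k])
    selected = []
    for i, thought in enumerate(thoughts):
        selected.append(("summary", i))
        if i in zoomed:
            selected.extend(("detail", i, j) for j in range(len(thought["detail_keys"])))
    return selected

# Two toy thoughts with 2-dimensional keys (illustrative values only).
thoughts = [
    {"summary_key": [1.0, 0.0], "detail_keys": [[0.9, 0.1], [1.0, 0.2], [0.8, 0.0]]},
    {"summary_key": [0.0, 1.0], "detail_keys": [[0.1, 0.9], [0.0, 1.1]]},
]
selected = select_kv_entries([0.1, 0.9], thoughts, top_k=1)
print(selected)
```

Here attention covers 4 entries (two summaries plus the two details of the matching thought) instead of all 7, which is the source of the memory saving; a real system would trade off `top_k` against accuracy.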
[256] CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning cs.AI | cs.CL
Qixian Huang, Hongqiang Lin, Tong Fu, Yingsen Wang, Zhenghui Fu
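TL;DR: This paper proposes CFMS (Coarse-to-Fine Multimodal Synthesis), a novel two-stage paradigm for enhanced tabular reasoning. The framework hierarchically decouples high-level visual perception from fine-grained symbolic reasoning: in the coarse stage, a multimodal large language model synthesizes a multi-perspective knowledge tuple in one pass to serve as a dynamic reasoning map; in the fine stage, a symbolic engine follows this map to execute a targeted sequence of iterative operations over the table.
Details
Motivation: Existing methods such as Chain-of-Thought have limits in tabular reasoning, and purely symbolic methods are blind to holistic visual patterns. The paper aims to improve table understanding and reasoning by combining visual perception with symbolic reasoning.
Result: Extensive experiments on the WikiTQ and TabFact benchmarks show that CFMS achieves competitive accuracy. The framework is particularly robust on large tables and when instantiated with smaller backbone models, supporting its effectiveness and generalizability.
Insight: The main novelty is a hierarchical two-stage paradigm that decouples visual perception (coarse) from symbolic reasoning (fine) and uses a dynamic reasoning map generated by a multimodal large language model to guide the subsequent symbolic operations. This offers a new way to combine perception and reasoning for tables, with particular advantages on large data or under resource constraints.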
Abstract: Reasoning over tabular data is a crucial capability for tasks like question answering and fact verification, as it requires models to comprehend both free-form questions and semi-structured tables. However, while methods like Chain-of-Thought (CoT) introduce reasoning chains, purely symbolic methods are inherently limited by their blindness to holistic visual patterns. To address this, we propose the Coarse-to-Fine Multimodal Synthesis framework (CFMS), a novel two-stage paradigm that hierarchically decouples high-level visual perception from granular symbolic reasoning. In the coarse stage, CFMS leverages Multimodal Large Language Models (MLLMs) to perform a one-time synthesis of a multi-perspective knowledge tuple. This tuple subsequently serves as a dynamic reasoning map to guide the fine stage, where a symbolic engine executes a targeted and efficient sequence of iterative operations over the table. Extensive experiments on the WikiTQ and TabFact benchmarks demonstrate that CFMS achieves competitive accuracy. The framework exhibits particular robustness when handling large tables and when instantiated with smaller backbone models, validating its effectiveness and generalizability.
[257] Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models cs.AI | cs.CL | cs.CV
Sameera Horawalavithana, Lauren Phillips, Ian Stewart, Sai Munikoti, Karl Pazdernik
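TL;DR: This study systematically examines how the evolution of pretrained large language model (LLM) backbones affects downstream vision-language model (VLM) task performance. Holding the vision encoder, training data, and post-training algorithm fixed, it compares VLMs built on LLAMA-1, LLAMA-2, and LLAMA-3 and finds that newer LLM backbones do not always yield better VLMs; the effect depends on the downstream task.
Details
Motivation: As more capable pretrained LLMs keep appearing, it remains underexplored how to integrate these advances into existing VLMs efficiently and how evolving LLMs affect multimodal reasoning, alignment, and task-specific performance. The study aims to fill this gap by systematically investigating how changes in the pretrained LLM backbone affect downstream VLM task performance.
Result: In visual question answering (VQA), newer LLM backbones tend to solve different questions rather than simply more questions, which is attributed to differences in how the models process information, such as better-calibrated confidence and more stable internal representations. Some VLM capabilities appear only with the newest LLM generation, while tasks that depend mainly on visual understanding gain little from a newer backbone.
Insight: The paper offers the first controlled, systematic empirical study of the relationship between LLM backbone evolution and VLM performance, showing that gains are task-dependent and analyzing the underlying mechanisms (such as confidence calibration and representation stability). This gives guidance on choosing and updating LLM backbones in future VLM development: rather than simply adopting the newest LLM, the choice should be weighed against the characteristics of the target task.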
Abstract: Vision-Language Models (VLMs) have rapidly advanced by leveraging powerful pre-trained Large Language Models (LLMs) as core reasoning backbones. As new and more capable LLMs emerge with improved reasoning, instruction-following, and generalization, there is a pressing need to efficiently update existing VLMs to incorporate these advancements. However, the integration of new LLMs into VLMs, particularly how the evolving LLMs contribute to multimodal reasoning, alignment, and task-specific performance, remains underexplored. Addressing this gap is important for VLM development, given the rapid evolution of pretrained LLM backbones. This study presents a controlled and systematic investigation of how changes in the pretrained LLM backbone affect downstream VLM task performance. By keeping the vision encoder, training data, and post-training algorithm the same across LLAMA-1, LLAMA-2, and LLAMA-3 based VLMs, we find that newer LLM backbones do not always lead to better VLMs; rather, performance depends on the downstream VLM task. For example, in visual question answering tasks, newer LLM backbones tend to solve different questions rather than just more questions, and our analysis shows this is driven by differences in how the models process information, including better calibrated confidence and more stable internal representations. We also find that some VLM capabilities appear only in the newest LLM generation, while tasks that depend mainly on visual understanding see little benefit from a newer LLM backbone.
[258] Min-$k$ Sampling: Decoupling Truncation from Temperature Scaling via Relative Logit Dynamics cs.AI | cs.CL | cs.LG
Yuanhao Ding, Meimingwei Li, Esteban Garces Arias, Matthias Aßenmacher, Christian Heumann
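TL;DR: This paper proposes Min-$k$ sampling, a new decoding strategy that dynamically determines the truncation boundary by analyzing the local shape of the sorted logit distribution. It achieves strict temperature invariance and improves text generation quality on multiple reasoning benchmarks and creative writing tasks.
Details
Motivation: Mainstream decoding methods (such as Top-$k$, Top-$p$, and Min-$p$) truncate in probability space and are extremely sensitive to the temperature parameter. Existing logit-space methods (such as Top-$nσ$) are temperature-invariant but rely on global statistics, which makes them susceptible to long-tail noise and unable to capture the fine-grained confidence structure among candidate tokens.
Result: Across multiple reasoning benchmarks, creative writing tasks, and human evaluation, Min-$k$ sampling consistently improves text quality. It remains robust even at extreme temperature settings where probability-based methods break down, and it is insensitive to hyperparameter choices.
Insight: The novelty lies in analyzing the local shape of the logit distribution to identify “semantic cliffs” (sharp transitions from high-confidence core tokens to uncertain long-tail tokens) and computing a position-weighted relative decay rate to set the truncation boundary dynamically. This yields strict temperature invariance and avoids the noise that affects global statistics.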
Abstract: The quality of text generated by large language models depends critically on the decoding sampling strategy. While mainstream methods such as Top-$k$, Top-$p$, and Min-$p$ achieve a balance between diversity and accuracy through probability-space truncation, they share an inherent limitation: extreme sensitivity to the temperature parameter. Recent logit-space approaches like Top-$nσ$ achieve temperature invariance but rely on global statistics that are susceptible to long-tail noise, failing to capture fine-grained confidence structures among top candidates. We propose \textbf{Min-$k$ Sampling}, a novel dynamic truncation strategy that analyzes the local shape of the sorted logit distribution to identify “semantic cliffs”: sharp transitions from high-confidence core tokens to uncertain long-tail tokens. By computing a position-weighted relative decay rate, Min-$k$ dynamically determines truncation boundaries at each generation step. We formally prove that Min-$k$ achieves strict temperature invariance and empirically demonstrate its low sensitivity to hyperparameter choices. Experiments on multiple reasoning benchmarks, creative writing tasks, and human evaluation show that Min-$k$ consistently improves text quality, maintaining robust performance even under extreme temperature settings where probability-based methods collapse. We make our code, models, and analysis tools publicly available.
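The temperature-invariance argument can be illustrated with a small sketch. The abstract does not give the Min-$k$ formula, so the rule below is a hypothetical stand-in: the function name `cliff_cutoff`, the `window` parameter, and the `1 / log2(rank + 1)` position weight are all assumptions made for illustration. What it does show is the general principle: a score built only from ratios of logit differences is unchanged when every logit is divided by the same temperature, so the cutoff cannot move.

```python
import math

def cliff_cutoff(logits, window=50):
    """Hypothetical cliff detector (NOT the paper's formula).

    Returns how many top tokens to keep, cutting at the largest
    position-weighted drop between neighboring sorted logits.
    """
    z = sorted(logits, reverse=True)[:window]
    if len(z) < 2:
        return len(z)
    span = z[0] - z[-1]          # total drop across the candidate window
    if span <= 0:
        return len(z)            # flat distribution: nothing to cut
    best_keep, best_score = len(z), -1.0
    for keep in range(1, len(z)):
        gap = z[keep - 1] - z[keep]      # drop right after the first `keep` tokens
        # Ratio of two logit differences: dividing all logits by T cancels out.
        relative_decay = gap / span
        # Assumed position weight that favors earlier cliffs.
        score = relative_decay / math.log2(keep + 1)
        if score > best_score:
            best_keep, best_score = keep, score
    return best_keep

logits = [10.0, 9.5, 9.2, 4.0, 3.8, 3.5, 3.1]
kept = cliff_cutoff(logits)
# The same cutoff is chosen after temperature scaling.
kept_scaled = [cliff_cutoff([x / t for x in logits]) for t in (0.1, 0.5, 2.0, 10.0)]
```

A probability-space rule such as Top-$p$ does not have this property, because the softmax mass in the tail grows as the temperature rises.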
[259] Learning from Contrasts: Synthesizing Reasoning Paths from Diverse Search Trajectories cs.AI | cs.CLPDF
Peiyang Liu, Zhirui Chen, Xi Wang, Di Liang, Youru Li
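TL;DR: This paper proposes Contrastive Reasoning Path Synthesis (CRPS), a framework for efficiently synthesizing supervision data from the diverse search trajectories produced by Monte Carlo Tree Search (MCTS). CRPS analyzes the differences between high- and low-quality search trajectories, extracts explicit information about strategic pivots and local failure modes, and uses it to synthesize reasoning chains that incorporate success patterns while avoiding known pitfalls. Experiments show that models fine-tuned on only 60K CRPS-synthesized examples match or exceed baselines trained on 590K standard rejection-sampling examples, reported as a 20× reduction in data, and generalize better on out-of-domain benchmarks.
Details
Motivation: In current MCTS-based methods for automated reasoning data exploration, supervision extraction is inefficient: typically only the single highest-reward trajectory is kept, and the contrastive signal in the many other explored paths is discarded.
Result: Models fine-tuned on only 60K CRPS-synthesized examples match or exceed baselines trained on 590K standard rejection-sampling examples, reported as a 20× reduction in dataset size, and show stronger generalization on out-of-domain benchmarks.
Insight: CRPS turns supervision extraction from a filtering process into a synthesis process. Contrasting successful and failed search trajectories yields more transferable reasoning ability than learning from successes alone, which improves both data efficiency and generalization.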
Abstract: Monte Carlo Tree Search (MCTS) has been widely used for automated reasoning data exploration, but current supervision extraction methods remain inefficient. Standard approaches retain only the single highest-reward trajectory, discarding the comparative signals present in the many explored paths. Here we introduce \textbf{Contrastive Reasoning Path Synthesis (CRPS)}, a framework that transforms supervision extraction from a filtering process into a synthesis procedure. CRPS uses a structured reflective process to analyze the differences between high- and low-quality search trajectories, extracting explicit information about strategic pivots and local failure modes. These insights guide the synthesis of reasoning chains that incorporate success patterns while avoiding identified pitfalls. We show empirically that models fine-tuned on just 60K CRPS-synthesized examples match or exceed the performance of baselines trained on 590K examples derived from standard rejection sampling, a 20$\times$ reduction in dataset size. Furthermore, CRPS improves generalization on out-of-domain benchmarks, demonstrating that learning from the contrast between success and failure produces more transferable reasoning capabilities than learning from success alone.
[260] Anthropogenic Regional Adaptation in Multimodal Vision-Language Model cs.AI | cs.CL | cs.CVPDF
Samuel Cahyawijaya, Peerat Limkonchotiwat, Tack Hwa Wong, Hitesh Laxmichand Patel, Amit Agarwal
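TL;DR: This paper proposes a new paradigm called Anthropogenic Regional Adaptation, which aims to optimize multimodal vision-language models for a specific region (such as Southeast Asia) while preserving their global generalization. The authors also introduce GG-EZ, a simple and effective method based on regional data filtering and model merging, and validate it on several vision-language architectures.
Details
Motivation: The vision-language field lacks a dedicated framework for assessing human-centric alignment. In particular, when adapting to different regional and cultural contexts, models often fail to balance regional relevance against global generalization.
Result: In the Southeast Asia regional adaptation case study, GG-EZ improves cultural relevance metrics by 5-15% while retaining over 98% of global performance and occasionally surpassing it. The experiments cover large vision-language models, text-to-image diffusion models, and vision-language embedding models.
Insight: The novelty is the Anthropogenic Regional Adaptation paradigm itself, which emphasizes balancing optimization for a regional context with retention of global generalization. GG-EZ provides a simple, effective baseline through data filtering and model merging, laying groundwork for applying multimodal models across diverse regions.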
Abstract: While the field of vision-language (VL) has achieved remarkable success in integrating visual and textual information across multiple languages and domains, there is still no dedicated framework for assessing human-centric alignment in vision-language systems. We offer two contributions to address this gap. First, we introduce Anthropogenic Regional Adaptation: a novel paradigm that aims to optimize model relevance to specific regional contexts while ensuring the retention of global generalization capabilities. Second, we present a simple, but effective adaptation method named Geographical-generalization-made-easy (GG-EZ), which utilizes regional data filtering and model merging. Through comprehensive experiments on 3 VL architectures: large vision-language models, text-to-image diffusion models, and vision-language embedding models, and a case study in Southeast Asia (SEA) regional adaptation, we demonstrate the importance of Anthropogenic Regional Adaptation and the effectiveness of GG-EZ, showing 5-15% gains in cultural relevance metrics across SEA while maintaining over 98% of global performance and even occasionally surpassing it. Our findings establish Anthropogenic Regional Alignment as a foundational paradigm towards applicability of multimodal vision-language models in diverse regions and demonstrate a simple-yet-effective baseline method that optimizes regional value alignment while preserving global generalization.
[261] SWE-AGILE: A Software Agent Framework for Efficiently Managing Dynamic Reasoning Context cs.AI | cs.CLPDF
Shuquan Lian, Juncheng Liu, Yazhe Chen, Yuhong Chen, Hui Li
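TL;DR: This paper proposes SWE-AGILE, an agent framework for autonomous software engineering. It targets two problems in existing ReAct-style methods: the lack of explicit System-2 reasoning for deep analysis and complex edge cases, and the dilemma in multi-turn tasks between keeping the full reasoning history (context explosion) and discarding it (redundant re-reasoning). The framework uses a Dynamic Reasoning Context strategy that combines a sliding window of detailed reasoning with compressed digests of historical reasoning to balance reasoning depth, efficiency, and context constraints.
Details
Motivation: Existing ReAct-style methods in autonomous software engineering (SWE) lack the explicit System-2 reasoning needed for deep analysis. In multi-turn SWE tasks, keeping the full reasoning history causes context explosion and “Lost-in-the-Middle” degradation, while discarding it forces the agent to re-reason redundantly at every step.
Result: On the SWE-Bench-Verified benchmark, SWE-AGILE sets a new standard for models in the 7B-8B parameter range using only 2.2k trajectories and 896 tasks.
Insight: The core innovation is the Dynamic Reasoning Context strategy: it maintains a “sliding window” of detailed reasoning to preserve immediate continuity and avoid redundant re-analysis, while compressing older reasoning into concise Reasoning Digests. This addresses the trade-off between efficiency and depth in long-context, multi-turn reasoning tasks.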
Abstract: Prior representative ReAct-style approaches in autonomous Software Engineering (SWE) typically lack the explicit System-2 reasoning required for deep analysis and handling complex edge cases. While recent reasoning models demonstrate the potential of extended Chain-of-Thought (CoT), applying them to the multi-turn SWE task creates a fundamental dilemma: retaining full reasoning history leads to context explosion and “Lost-in-the-Middle” degradation, while discarding it would force the agent to redundantly re-reason at every step. To address these challenges, we propose SWE-AGILE, a novel software agent framework designed to bridge the gap between reasoning depth, efficiency, and context constraints. SWE-AGILE introduces a Dynamic Reasoning Context strategy, maintaining a “sliding window” of detailed reasoning for immediate continuity to prevent redundant re-analyzing, while compressing historical reasoning content into concise Reasoning Digests. Empirically, SWE-AGILE sets a new standard for 7B-8B models on SWE-Bench-Verified using only 2.2k trajectories and 896 tasks. Code is available at https://github.com/KDEGroup/SWE-AGILE.
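The sliding-window-plus-digest idea can be sketched as a small data structure. This is a minimal illustration, not the SWE-AGILE implementation: the class name `ReasoningContext`, the `window` size, and the truncating placeholder summarizer are all assumptions. In a real agent the digest would presumably be produced by the model rather than by string truncation.

```python
from collections import deque

class ReasoningContext:
    """Keeps the last `window` reasoning steps in full and older ones as digests.

    Hypothetical sketch; names and defaults are illustrative only.
    """

    def __init__(self, window=2, summarize=None):
        self.window = window
        self.recent = deque()   # detailed reasoning, newest last
        self.digests = []       # compressed older reasoning, oldest first
        # Placeholder summarizer (assumption): keep the first 40 characters.
        self.summarize = summarize or (lambda text: text[:40])

    def add_step(self, reasoning):
        self.recent.append(reasoning)
        # Evict steps that fall out of the window and store only their digest.
        while len(self.recent) > self.window:
            oldest = self.recent.popleft()
            self.digests.append(self.summarize(oldest))

    def render(self):
        # Digests first, then full reasoning for the most recent steps.
        lines = [f"[digest] {d}" for d in self.digests]
        lines += [f"[full] {r}" for r in self.recent]
        return "\n".join(lines)

ctx = ReasoningContext(window=2)
for step in ["inspect failing test", "locate bug in parser", "draft patch", "run tests"]:
    ctx.add_step(step)
prompt_context = ctx.render()
```

With this layout the prompt grows by one short digest per step instead of one full reasoning trace, while the most recent steps stay available verbatim.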
[262] Agentic Exploration of PDE Spaces using Latent Foundation Models for Parameterized Simulations cs.AI | cs.CVPDF
Abhijeet Vishwasrao, Francisco Giral, Mahmoud Golestanian, Federica Tonti, Andrea Arroyo Ramo
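TL;DR: This paper proposes a framework that couples multi-agent large language models with a latent foundation model (LFM) to automate exploration of parameterized simulation spaces for physical phenomena governed by partial differential equations (PDEs), such as fluid flow. The LFM acts as an on-demand surrogate simulator that learns compact, disentangled latent representations of flow fields, so agents can query arbitrary parameter configurations cheaply. A hierarchical agent architecture runs a closed loop of hypothesis, experimentation, analysis, and verification without human intervention. In a case study of flow past tandem cylinders at Re = 500, the framework autonomously evaluates more than 1,600 parameter-location pairs and discovers distinct scaling laws, including a two-mode structure for minimum displacement thickness and a linear scaling for maximum momentum thickness.
Details
Motivation: Exploration of PDE-governed physical phenomena (such as fluid flow) has traditionally relied on laboratory experiments or computationally expensive numerical simulations, which limits automated, large-scale exploration. Unlike drug discovery or materials science, PDE spaces lack discrete, tokenizable representations that interface naturally with large language models. The paper combines multi-agent LLMs with a latent foundation model to address the exploration challenges posed by the continuous, high-dimensional, and chaotic nature of PDE parameter spaces and to enable automated scientific discovery.
Result: On the benchmark case of flow past tandem cylinders at Re = 500, the framework autonomously evaluates over 1,600 parameter-location pairs and discovers divergent scaling laws: minimum displacement thickness shows a regime-dependent two-mode structure, while maximum momentum thickness shows a robust linear scaling. Both exhibit a dual-extrema structure at the transition from the near-wake to the co-shedding regime. This supports the effectiveness of the framework for automated exploration of PDE-governed systems.
Insight: The contributions are: 1) a latent foundation model (LFM), a generative model that learns compact, disentangled latent representations of parameterized PDE simulations and serves as a low-cost surrogate simulator; 2) the coupling of multi-agent LLMs with the LFM through a hierarchical architecture and a tool-modular interface, enabling closed-loop autonomous exploration without user support; 3) a general paradigm for automated scientific discovery in PDE-governed systems that couples learned physical representations with agentic reasoning and could extend to other complex physical phenomena.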
Abstract: Flow physics and more broadly physical phenomena governed by partial differential equations (PDEs), are inherently continuous, high-dimensional and often chaotic in nature. Traditionally, researchers have explored these rich spatiotemporal PDE solution spaces using laboratory experiments and/or computationally expensive numerical simulations. This severely limits automated and large-scale exploration, unlike domains such as drug discovery or materials science, where discrete, tokenizable representations naturally interface with large language models. We address this by coupling multi-agent LLMs with latent foundation models (LFMs), a generative model over parametrised simulations, that learns explicit, compact and disentangled latent representations of flow fields, enabling continuous exploration across governing PDE parameters and boundary conditions. The LFM serves as an on-demand surrogate simulator, allowing agents to query arbitrary parameter configurations at negligible cost. A hierarchical agent architecture orchestrates exploration through a closed loop of hypothesis, experimentation, analysis and verification, with a tool-modular interface requiring no user support. Applied to flow past tandem cylinders at Re = 500, the framework autonomously evaluates over 1,600 parameter-location pairs and discovers divergent scaling laws: a regime-dependent two-mode structure for minimum displacement thickness and a robust linear scaling for maximum momentum thickness, with both landscapes exhibiting a dual-extrema structure that emerges at the near-wake to co-shedding regime transition. The coupling of the learned physical representations with agentic reasoning establishes a general paradigm for automated scientific discovery in PDE-governed systems.
[263] Belief-Aware VLM Model for Human-like Reasoning cs.AI | cs.CVPDF
Anshul Nayak, Shahil Shaik, Yue Wang
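TL;DR: This paper proposes a belief-aware vision-language model (VLM) framework that integrates retrieval-based memory and reinforcement learning to emulate how beliefs are represented and updated in human reasoning, addressing the weaknesses of traditional intent inference models in generalization and long-horizon intent capture.
Details
Motivation: Traditional neural network models for intent inference rely too heavily on observable states and generalize poorly across diverse tasks and dynamic environments. Existing VLM and VLA models bring in common-sense reasoning but lack explicit mechanisms to represent and update belief, which limits human-like reasoning and the ability to capture intent over long horizons.
Result: Evaluated on publicly available VQA datasets (such as HD-EPIC), the method shows consistent improvements over zero-shot baselines, highlighting the importance of belief-aware reasoning.
Insight: Instead of learning an explicit belief model, the method approximates belief with a vector-based memory that retrieves relevant multimodal context and feeds it into the VLM for reasoning. A reinforcement learning policy over the VLM latent space further refines decision-making, giving an implicit way to model and use evolving beliefs.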
Abstract: Traditional neural network models for intent inference rely heavily on observable states and struggle to generalize across diverse tasks and dynamic environments. Recent advances in Vision Language Models (VLMs) and Vision Language Action (VLA) models introduce common-sense reasoning through large-scale multimodal pretraining, enabling zero-shot performance across tasks. However, these models still lack explicit mechanisms to represent and update belief, limiting their ability to reason like humans or capture the evolving human intent over long-horizon. To address this, we propose a belief-aware VLM framework that integrates retrieval-based memory and reinforcement learning. Instead of learning an explicit belief model, we approximate belief using a vector-based memory that retrieves relevant multimodal context, which is incorporated into the VLM for reasoning. We further refine decision-making using a reinforcement learning policy over the VLM latent space. We evaluate our approach on publicly available VQA datasets such as HD-EPIC and demonstrate consistent improvements over zero-shot baselines, highlighting the importance of belief-aware reasoning.
[264] Edu-MMBias: A Three-Tier Multimodal Benchmark for Auditing Social Bias in Vision-Language Models under Educational Contexts cs.AI | cs.CVPDF
Ruijia Li, Mingzi Zhang, Zengyi Yu, Yuang Wei, Bo Jiang
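TL;DR: This paper presents Edu-MMBias, a three-tier multimodal benchmark framework for auditing social bias in vision-language models under educational contexts. Grounded in the tri-component model of attitudes from social psychology, it diagnoses bias along cognitive, affective, and behavioral dimensions. A generative pipeline with a self-correction mechanism and human verification synthesizes contamination-resistant student profiles, which are used to stress-test state-of-the-art VLMs.
Details
Motivation: As vision-language models become increasingly important in educational decision-making, ensuring their fairness is critical. Current text-centric evaluations neglect the visual modality, leaving an unregulated channel for latent social bias. This paper aims to fill that gap.
Result: The audit reveals critical, counter-intuitive patterns: models show a compensatory class bias that favors lower-status narratives while still harboring deep-seated health and racial stereotypes. Visual inputs act as a safety backdoor that triggers a resurgence of bias, bypassing text-based alignment safeguards, and the results expose a systematic misalignment between latent cognition and final decisions.
Insight: The novelty lies in bringing the tri-component model of attitudes from social psychology into multimodal bias evaluation and building a tiered, three-dimensional audit framework. The synthetic data pipeline with self-correction and human verification, and the finding that the visual modality serves as a “safety backdoor” for bias, are useful for understanding and mitigating bias in multimodal models.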
Abstract: As Vision-Language Models (VLMs) become integral to educational decision-making, ensuring their fairness is paramount. However, current text-centric evaluations neglect the visual modality, leaving an unregulated channel for latent social biases. To bridge this gap, we present Edu-MMBias, a systematic auditing framework grounded in the tri-component model of attitudes from social psychology. This framework diagnoses bias across three hierarchical dimensions: cognitive, affective, and behavioral. Utilizing a specialized generative pipeline that incorporates a self-correct mechanism and human-in-the-loop verification, we synthesize contamination-resistant student profiles to conduct a holistic stress test on state-of-the-art VLMs. Our extensive audit reveals critical, counter-intuitive patterns: models exhibit a compensatory class bias favoring lower-status narratives while simultaneously harboring deep-seated health and racial stereotypes. Crucially, we find that visual inputs act as a safety backdoor, triggering a resurgence of biases that bypass text-based alignment safeguards and revealing a systematic misalignment between latent cognition and final decision-making. The contributions of this paper are available at: https://anonymous.4open.science/r/EduMMBias-63B2.
[265] WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark cs.AI | cs.CVPDF
Peng Yuan, Yuyang Yin, Yuxuan Cai, Zheng Wei
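TL;DR: This paper presents the WebForge framework and the WebForge-Bench benchmark, which address the difficulty of achieving realism, reproducibility, and scalability together in browser agent benchmarks. A four-agent pipeline (Plan, Generate, Refine, Validate) automatically creates interactive, self-contained web environments without human annotation. A seven-dimensional difficulty control framework supports systematic capability analysis.
Details
Motivation: Existing browser agent benchmarks face a fundamental dilemma: real-website benchmarks lack reproducibility because of content drift, controlled environments sacrifice realism by omitting real-web noise, and both require costly manual curation that limits scalability.
Result: WebForge is used to build WebForge-Bench, a benchmark of 934 tasks spanning 7 domains and 3 difficulty levels. Multi-model experiments show that difficulty stratification effectively differentiates model capabilities, and cross-domain analysis exposes capability biases that aggregate metrics cannot reveal.
Insight: The contributions are: 1) the first fully automated framework, which resolves the realism-reproducibility-scalability trilemma through a four-agent pipeline; 2) a seven-dimensional difficulty control framework that supports multi-dimensional capability analysis; 3) evidence that multi-dimensional evaluation reveals distinct capability profiles that a single aggregate score cannot capture.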
Abstract: Existing browser agent benchmarks face a fundamental trilemma: real-website benchmarks lack reproducibility due to content drift, controlled environments sacrifice realism by omitting real-web noise, and both require costly manual curation that limits scalability. We present WebForge, the first fully automated framework that resolves this trilemma through a four-agent pipeline – Plan, Generate, Refine, and Validate – that produces interactive, self-contained web environments end-to-end without human annotation. A seven-dimensional difficulty control framework structures task design along navigation depth, visual complexity, reasoning difficulty, and more, enabling systematic capability profiling beyond single aggregate scores. Using WebForge, we construct WebForge-Bench, a benchmark of 934 tasks spanning 7 domains and 3 difficulty levels. Multi-model experiments show that difficulty stratification effectively differentiates model capabilities, while cross-domain analysis exposes capability biases invisible to aggregate metrics. Together, these results confirm that multi-dimensional evaluation reveals distinct capability profiles that a single aggregate score cannot capture. Code and benchmark are publicly available at https://github.com/yuandaxia2001/WebForge.
cs.HC [Back]
[266] ACE-TA: An Agentic Teaching Assistant for Grounded Q&A, Quiz Generation, and Code Tutoring cs.HC | cs.AI | cs.CLPDF
Himanshu Tripathi, Charlottee Crowell, Kaley Newlin, Subash Neupane, Shahram Rahimi
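TL;DR: This paper introduces ACE-TA, an autonomous LLM-based teaching assistant framework for programming education. It automatically routes conceptual queries drawn from programming course material to three core modules: a retrieval-grounded conceptual Q&A system that gives precise, context-aligned explanations; a quiz generator that builds adaptive, multi-topic assessments targeting higher-order understanding; and an interactive code tutor that guides students through step-by-step reasoning with sandboxed execution and iterative feedback.
Details
Motivation: The paper addresses how to use large language models in programming education to provide accurate conceptual explanations, adaptive quizzes, and interactive code tutoring automatically and efficiently, in order to improve learning outcomes.
Result: The abstract reports no specific quantitative results or benchmarks; it mainly describes the components and functions of the framework.
Insight: The novelty is a coordinated, modular agent framework that integrates retrieval grounding, adaptive assessment, and interactive execution feedback into one system, aiming to automate teaching support from conceptual understanding through to practice.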
Abstract: We introduce ACE-TA, the Agentic Coding and Explanations Teaching Assistant framework, that autonomously routes conceptual queries drawn from programming course material to grounded Q&A, stepwise coding guidance, and automated quiz generation using pre-trained Large Language Models (LLMs). ACE-TA consists of three coordinated modules: a retrieval grounded conceptual Q&A system that provides precise, context-aligned explanations; a quiz generator that constructs adaptive, multi-topic assessments targeting higher-order understanding; and an interactive code tutor that guides students through step-by-step reasoning with sandboxed execution and iterative feedback.
[267] Evaluating Visual Prompts with Eye-Tracking Data for MLLM-Based Human Activity Recognition cs.HC | cs.AI | cs.CVPDF
Jae Young Choi, Seon Gyeom Kim, Hyungjun Yoon, Taeckyung Lee, Donggun Lee
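TL;DR: This paper studies a visual prompting strategy that converts eye-tracking data into three kinds of visualization images (timeline, heatmap, and scanpath) and uses them as input to multimodal large language models (MLLMs) for human activity recognition (HAR). A systematic evaluation on three public eye-tracking datasets under different temporal window sizes finds that visual prompting gives an efficient, scalable representation of eye-tracking data and helps MLLMs handle high-frequency sensor signals in IoT settings.
Details
Motivation: Feeding high-dimensional, high-frequency sensor data such as eye-tracking data directly to large language models (LLMs) causes information loss and high token costs, so a more efficient data representation is needed.
Result: The three visualization types (timeline, heatmap, scanpath) were evaluated on three public eye-tracking datasets under different temporal window sizes. The results indicate that visual prompting provides an efficient and scalable representation for MLLM-based HAR.
Insight: The novelty is converting sensor data into visualization images that serve as visual prompts for MLLMs. This token-efficient, scalable representation mitigates the information loss and high cost of processing raw sensor data directly and offers a new approach to MLLM reasoning in IoT applications.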
Abstract: Large Language Models (LLMs) have emerged as foundation models for IoT applications such as human activity recognition (HAR). However, directly applying high-frequency and multi-dimensional sensor data, such as eye-tracking data, leads to information loss and high token costs. To mitigate this, we investigate a visual prompting strategy that transforms sensor signals into data visualization images as an input to multimodal LLMs (MLLMs) using eye-tracking data. We conducted a systematic evaluation of MLLM-based HAR across three public eye-tracking datasets using three visualization types of timeline, heatmap, and scanpath, under varying temporal window sizes. Our findings suggest that visual prompting provides a token-efficient and scalable representation for eye-tracking data, highlighting its potential to enable MLLMs to effectively reason over high-frequency sensor signals in IoT contexts.
cs.SD [Back]
[268] Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music cs.SD | cs.AI | cs.CL | eess.ASPDF
Sreyan Ghosh, Arushi Goel, Kaousheik Jayakumar, Lasha Koroshinadze, Nishit Anand
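TL;DR: This paper introduces Audio Flamingo Next (AF-Next), the next-generation and most capable large audio-language model in the Audio Flamingo series, designed to improve understanding and reasoning over speech, environmental sounds, and music. Compared with its predecessor, AF-Next adds a stronger foundation model, strategies for constructing large-scale audio understanding and reasoning data, support for complex audio inputs up to 30 minutes long, and a new reasoning paradigm, Temporal Audio Chain-of-Thought. The model is trained with a curriculum-based strategy; across 20 audio understanding and reasoning benchmarks it outperforms comparable open models by large margins and competes with, and sometimes surpasses, much larger models.
Details
Motivation: The work targets key shortcomings of existing audio-language models in audio understanding and reasoning, especially in handling long audio, achieving fine-grained temporal alignment, and generalizing to unseen tasks.
Result: Across 20 audio understanding and reasoning benchmarks (including challenging long-audio tasks), AF-Next outperforms similarly sized open models by large margins and remains highly competitive with, and sometimes surpasses, much larger open-weight and closed models.
Insight: The contributions include: 1) a systematic analysis of the previous model to identify key gaps; 2) large-scale datasets totaling over 1 million hours to extend capabilities; 3) the Temporal Audio Chain-of-Thought paradigm, which explicitly grounds intermediate reasoning steps to timestamps and improves interpretability; 4) a curriculum-based multi-stage training strategy. Together these improve robustness in long-audio processing and task generalization.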
Abstract: We present Audio Flamingo Next (AF-Next), the next-generation and most capable large audio-language model in the Audio Flamingo series, designed to advance understanding and reasoning over speech, environmental sounds and music. Compared to Audio Flamingo 3, AF-Next introduces: (i) a stronger foundational audio-language model that significantly improves accuracy across diverse audio understanding tasks; (ii) scalable strategies for constructing large-scale audio understanding and reasoning data beyond existing academic benchmarks; (iii) support for long and complex audio inputs up to 30 minutes; and (iv) Temporal Audio Chain-of-Thought, a new reasoning paradigm that explicitly grounds intermediate reasoning steps to timestamps in long audio, enabling fine-grained temporal alignment and improved interpretability. To enable these capabilities, we first conduct a systematic analysis of Audio Flamingo 3 to identify key gaps in audio understanding and reasoning. We then curate and scale new large-scale datasets totaling over 1 million hours to address these limitations and expand the existing AudioSkills-XL, LongAudio-XL, AF-Think and AF-Chat datasets. AF-Next is trained using a curriculum-based strategy spanning pre-training, mid-training and post-training stages. Extensive experiments across 20 audio understanding and reasoning benchmarks, including challenging long-audio tasks, show that AF-Next outperforms similarly sized open models by large margins and remains highly competitive with and sometimes surpasses, much larger open-weight and closed models. Beyond benchmark performance, AF-Next exhibits strong real-world utility and transfers well to unseen tasks, highlighting its robustness and generalization ability. In addition to all data, code and methods, we open-source 3 variants of AF-Next, including AF-Next-Instruct, AF-Next-Think and AF-Next-Captioner.
[269] Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing cs.SD | cs.AI | cs.CV | cs.MMPDF
Zeyue Tian, Binxin Yang, Zhaoyang Liu, Jiexuan Zhang, Ruibin Yuan
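TL;DR: This paper proposes Audio-Omni, described as the first end-to-end unified framework that integrates audio generation and editing across general sound, music, and speech, together with multimodal understanding. It combines a frozen multimodal large language model for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis. To address the scarcity of audio editing data, the authors build AudioEdit, a large-scale dataset of over one million curated editing pairs. Experiments show that Audio-Omni achieves state-of-the-art performance across a suite of benchmarks, outperforming prior unified approaches and matching or exceeding specialized expert models.
Details
Motivation: Multimodal models have advanced rapidly in audio understanding, generation, and editing, but these capabilities are usually handled by separate specialized models, and a truly unified framework covering all three tasks is missing. Pioneering work on unifying audio understanding and generation tends to remain confined to specific domains.
Result: Extensive experiments show that Audio-Omni achieves state-of-the-art performance across a suite of benchmarks, outperforming prior unified approaches while matching or exceeding specialized expert models.
Insight: The contributions include: 1) the first end-to-end framework that unifies generation and editing across general sound, music, and speech with integrated multimodal understanding; 2) an architecture that pairs a frozen multimodal large language model for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis; 3) the large-scale AudioEdit dataset, built to overcome data scarcity in audio editing; 4) inherited capabilities such as knowledge-augmented reasoning generation, in-context generation, and zero-shot cross-lingual control, which point toward universal generative audio intelligence.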
Abstract: Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically addressed by specialized models, leaving the development of a truly unified framework that can seamlessly integrate all three tasks underexplored. While some pioneering works have explored unifying audio understanding and generation, they often remain confined to specific domains. To address this, we introduce Audio-Omni, the first end-to-end framework to unify generation and editing across general sound, music, and speech domains, with integrated multi-modal understanding capabilities. Our architecture synergizes a frozen Multimodal Large Language Model for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis. To overcome the critical data scarcity in audio editing, we construct AudioEdit, a new large-scale dataset comprising over one million meticulously curated editing pairs. Extensive experiments demonstrate that Audio-Omni achieves state-of-the-art performance across a suite of benchmarks, outperforming prior unified approaches while achieving performance on par with or superior to specialized expert models. Beyond its core capabilities, Audio-Omni exhibits remarkable inherited capabilities, including knowledge-augmented reasoning generation, in-context generation, and zero-shot cross-lingual control for audio generation, highlighting a promising direction toward universal generative audio intelligence. The code, model, and dataset will be publicly released on https://zeyuet.github.io/Audio-Omni.
cs.LG [Back]
[270] Calibration Collapse Under Sycophancy Fine-Tuning: How Reward Hacking Breaks Uncertainty Quantification in LLMs cs.LG | cs.AI | cs.CLPDF
Subramanyam Sahoo
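TL;DR: This paper studies how sycophancy induced by reward optimization (such as RLHF) during fine-tuning of large language models (LLMs) damages calibration and therefore the reliability of uncertainty quantification. The author evaluates Qwen3-8B under three regimes (no fine-tuning, neutral supervised fine-tuning, and sycophancy-inducing GRPO fine-tuning) on MMLU and finds that sycophantic fine-tuning produces a directional degradation in calibration; even after post-hoc matrix scaling, the sycophantic model retains a higher calibration error.
Details
Motivation: Modern LLMs are often fine-tuned with reward optimization schemes such as reinforcement learning from human feedback (RLHF) to improve perceived helpfulness. The author asks whether sycophantic reward signals degrade calibration, a property essential for reliable uncertainty quantification.
Result: Evaluated on five MMLU subject domains with bootstrap confidence intervals and permutation testing, sycophantic GRPO fine-tuning produces a directional degradation in calibration: ECE rises by +0.006 relative to the base model and MCE rises by +0.010 relative to neutral SFT, although the effect does not reach statistical significance at this training budget (p = 0.41). Post-hoc matrix scaling applied to all models reduces ECE by 40-64% and improves accuracy by 1.5-3.0 percentage points, but the sycophantic model's post-scaling ECE (0.042) remains higher than that of the neutral SFT control (0.037).
Insight: The paper establishes a methodology for evaluating the impact of reward hacking on calibration and shows that reward-induced miscalibration leaves a structured residual even after affine correction, which motivates calibration-aware training objectives. More broadly, it underlines the risk of calibration degradation during RLHF-style fine-tuning and provides an empirical basis for later work on uncertainty quantification.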
Abstract: Modern large language models (LLMs) are increasingly fine-tuned via reinforcement learning from human feedback (RLHF) or related reward optimisation schemes. While such procedures improve perceived helpfulness, we investigate whether sycophantic reward signals degrade calibration – a property essential for reliable uncertainty quantification. We fine-tune Qwen3-8B under three regimes: no fine-tuning (base), neutral supervised fine-tuning (SFT) on TriviaQA, and sycophancy-inducing Group Relative Policy Optimisation (GRPO) that rewards agreement with planted wrong answers. Evaluating on $1{,}000$ MMLU items across five subject domains with bootstrap confidence intervals and permutation testing, we find that \textbf{sycophantic GRPO produces consistent directional calibration degradation} – ECE rises by $+0.006$ relative to the base model and MCE increases by $+0.010$ relative to neutral SFT – though the effect does not reach statistical significance ($p = 0.41$) at this training budget. Post-hoc matrix scaling applied to all three models reduces ECE by $40$–$64\%$ and improves accuracy by $1.5$–$3.0$ percentage points. However, the sycophantic model retains the highest post-scaling ECE relative to the neutral SFT control ($0.042$ vs. $0.037$), suggesting that reward-induced miscalibration leaves a structured residual even after affine correction. These findings establish a methodology for evaluating the calibration impact of reward hacking and motivate calibration-aware training objectives.
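For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of Expected Calibration Error (ECE) and Maximum Calibration Error (MCE) using standard equal-width confidence bins. The bin count of 10, the half-open binning convention, and the function name `calibration_errors` are assumptions; the abstract does not state the paper's exact evaluation settings.

```python
def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) over equal-width confidence bins.

    confidences: predicted probabilities in [0, 1] for the chosen answers
    correct:     1 if the answer was right, 0 otherwise
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # Half-open bins [k/n_bins, (k+1)/n_bins); confidence 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    ece, mce = 0.0, 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        gap = abs(accuracy - avg_conf)
        ece += (len(b) / n) * gap     # gap weighted by the share of samples in the bin
        mce = max(mce, gap)           # worst single-bin gap
    return ece, mce

# Four answers at 0.95 confidence (3 right) and two at 0.55 confidence (1 right).
ece, mce = calibration_errors(
    [0.95, 0.95, 0.95, 0.95, 0.55, 0.55],
    [1, 1, 1, 0, 1, 0],
)
```

In this toy input the high-confidence bin has a gap of 0.20 and the mid-confidence bin a gap of 0.05, so ECE is the sample-weighted average of the two and MCE is the larger one.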
[271] Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents cs.LG | cs.AI | cs.CLPDF
Hao Wang, Guozhi Wang, Han Xiao, Yufeng Zhou, Yue Pan
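TL;DR: This paper proposes Skill-SD, a framework that addresses the low sample efficiency of reinforcement learning for LLM agents on multi-turn interactive tasks. The agent's own completed trajectories are summarized into natural-language skills describing successful behaviors, mistakes, and workflows. These skills act as dynamic supervision visible only to the teacher, and an importance-weighted reverse-KL loss provides stable token-level distillation, which substantially improves training.
Details
Motivation: Standard reinforcement learning for LLM agents on multi-turn interactive tasks suffers from low sample efficiency because rewards are sparse and horizons are long. Existing on-policy self-distillation provides dense supervision from a teacher with access to ground-truth answers, but this fixed privileged information cannot capture the diverse valid strategies in agent tasks, and naively combining it with reinforcement learning often leads to training collapse.
Result: On agentic benchmarks including AppWorld and Sokoban, Skill-SD substantially outperforms the standard RL baseline, improving vanilla GRPO by 14.0% and 10.9% respectively and vanilla OPD by 42.1% and 40.6%.
Insight: The core innovation is turning the agent's own trajectories into natural-language skills that serve as training-time privileged supervision, which is more flexible than fixed answers and can capture diverse strategies. An importance-weighted reverse-KL loss gives gradient-correct token-level distillation, and the teacher is dynamically synchronized with the improving student; together these stabilize training and prevent collapse. It is a novel way to make trajectory experience explicit and use it for conditioned distillation.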
Abstract: Reinforcement learning (RL) has been widely used to train LLM agents for multi-turn interactive tasks, but its sample efficiency is severely limited by sparse rewards and long horizons. On-policy self-distillation (OPSD) alleviates this by providing dense token-level supervision from a privileged teacher that has access to ground-truth answers. However, such fixed privileged information cannot capture the diverse valid strategies in agent tasks, and naively combining OPSD with RL often leads to training collapse. To address these limitations, we introduce Skill-SD, a framework that turns the agent’s own trajectories into dynamic training-only supervision. Completed trajectories are summarized into compact natural language skills that describe successful behaviors, mistakes, and workflows. These skills serve as dynamic privileged information conditioning only the teacher, while the student always acts under the plain task prompt and learns to internalize the guidance through distillation. To stabilize the training, we derive an importance-weighted reverse-KL loss to provide gradient-correct token-level distillation, and dynamically synchronize the teacher with the improving student. Experimental results on agentic benchmarks demonstrate that Skill-SD substantially outperforms the standard RL baseline, improving both vanilla GRPO (+14.0%/+10.9% on AppWorld/Sokoban) and vanilla OPD (+42.1%/+40.6%). Project page: https://k1xe.github.io/skill-sd/
[272] SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting cs.LG | cs.AI | cs.CLPDF
Binbin Zheng, Xing Ma, Yiheng Liang, Jingqing Ruan, Xiaoliang Fu
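TL;DR: This paper proposes SCOPE (Signal-Calibrated On-Policy Distillation Enhancement), a dual-path adaptive training method that tackles the token-level credit assignment problem caused by sparse rewards in on-policy reinforcement learning for LLM reasoning alignment. SCOPE routes on-policy rollouts by correctness into two complementary supervision paths. Incorrect trajectories receive KL distillation weighted by teacher perplexity, prioritizing instances where the teacher has genuine corrective capability. Correct trajectories receive maximum likelihood estimation weighted by student perplexity, reinforcing low-confidence samples at the boundary of the model's capability.
Details
Motivation: On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, but its sparse, outcome-level rewards make token-level credit assignment very difficult. Existing on-policy distillation methods alleviate this with dense token-level KL supervision from a teacher model, but they usually apply that supervision uniformly across all rollouts, ignoring fundamental differences in signal quality.
Result: Extensive experiments on six reasoning benchmarks show that SCOPE achieves average relative improvements of 11.42% in Avg@32 and 7.30% in Pass@32 over competitive baselines, demonstrating consistent effectiveness.
Insight: The core innovation is dual-path routing by trajectory correctness, with adaptive weighting by teacher perplexity on one path and student perplexity on the other to calibrate the quality of the supervision signal. This avoids the drawbacks of uniform supervision, makes finer use of the teacher's guidance, strengthens the student's weak spots, and uses group-level normalization to account for intrinsic differences in prompt difficulty.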
Abstract: On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult. On-Policy Distillation (OPD) alleviates this by introducing dense, token-level KL supervision from a teacher model, but typically applies this supervision uniformly across all rollouts, ignoring fundamental differences in signal quality. We propose Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths. For incorrect trajectories, SCOPE performs teacher-perplexity-weighted KL distillation to prioritize instances where the teacher demonstrates genuine corrective capability, while down-weighting unreliable guidance. For correct trajectories, it applies student-perplexity-weighted MLE to concentrate reinforcement on low-confidence samples at the capability boundary rather than over-reinforcing already mastered ones. Both paths employ a group-level normalization to adaptively calibrate weight distributions, accounting for the intrinsic difficulty variance across prompts. Extensive experiments on six reasoning benchmarks show that SCOPE achieves an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32 over competitive baselines, demonstrating its consistent effectiveness.
[273] Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning cs.LG | cs.AI | cs.CLPDF
Zikang Shan, Han Zhong, Liwei Wang, Li Zhao
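TL;DR: This paper proposes Generative Actor-Critic (GenAC), a method for improving value modeling and credit assignment in reinforcement learning for large language models (LLMs). It replaces conventional one-shot scalar value prediction with a generative critic and combines chain-of-thought reasoning with In-Context Conditioning to improve the expressiveness of the value function and the stability of training.
Details
Motivation: In LLM reinforcement learning, conventional discriminative critics are difficult to train reliably, so value modeling is often avoided, which limits how fine-grained credit assignment can be. The authors argue that this difficulty stems partly from the limited expressiveness of existing value models under the one-shot prediction paradigm.
Result: GenAC improves value approximation, ranking reliability, and out-of-distribution generalization, and these gains translate into stronger downstream RL performance than both value-based and value-free baselines.
Insight: The core innovation is bringing a generative model into value function estimation and using chain-of-thought reasoning to increase expressiveness. The proposed In-Context Conditioning helps the critic stay calibrated to the current actor throughout training. This offers a promising direction for improving credit assignment in LLM reinforcement learning.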
Abstract: Credit assignment is a central challenge in reinforcement learning (RL). Classical actor-critic methods address this challenge through fine-grained advantage estimation based on a learned value function. However, learned value models are often avoided in modern large language model (LLM) RL because conventional discriminative critics are difficult to train reliably. We revisit value modeling and argue that this difficulty is partly due to limited expressiveness. In particular, representation complexity theory suggests that value functions can be hard to approximate under the one-shot prediction paradigm used by existing value models, and our scaling experiments show that such critics do not improve reliably with scale. Motivated by this observation, we propose Generative Actor-Critic (GenAC), which replaces one-shot scalar value prediction with a generative critic that performs chain-of-thought reasoning before producing a value estimate. We further introduce In-Context Conditioning, which helps the critic remain calibrated to the current actor throughout training. GenAC improves value approximation, ranking reliability, and out-of-distribution generalization, and these gains translate into stronger downstream RL performance than both value-based and value-free baselines. Overall, our results suggest that stronger value modeling is a promising direction for improving credit assignment in LLM reinforcement learning.
[274] The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping cs.LG | cs.AI | cs.CLPDF
Yang Liu, Enxi Wang, Yufei Gao, Weixin Zhang, Bo Wang
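TL;DR: This paper proposes MEDS, a Memory-Enhanced Dynamic reward Shaping framework that targets reduced sampling diversity and recurring error patterns when training large language models with reinforcement learning. It stores historical behavioral signals, uses density-based clustering to identify frequently recurring error patterns, and penalizes rollouts that fall into more prevalent error clusters more heavily, which encourages broader exploration and reduces repeated mistakes.
Details
Motivation: A common failure mode when training LLMs with RL is reduced sampling diversity, where the policy repeatedly generates similar erroneous behaviors. Classical entropy regularization encourages randomness under the current policy but does not explicitly discourage failure patterns that recur across rollouts.
Result: Across five datasets and three base models, MEDS consistently improves average performance over existing baselines, with gains of up to 4.13 points on pass@1 and 4.37 points on pass@128. Analyses using LLM-based annotations and quantitative diversity metrics show that MEDS increases behavioral diversity during sampling.
Insight: The core contribution is to fold historical behavioral signals into reward design dynamically: a memory mechanism plus clustering identifies and penalizes recurring error patterns, giving a better balance between encouraging exploration and avoiding repeated failures. This offers a history-aware, dynamically adjusted approach to reward shaping.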
Abstract: Despite the success of reinforcement learning for large language models, a common failure mode is reduced sampling diversity, where the policy repeatedly generates similar erroneous behaviors. Classical entropy regularization encourages randomness under the current policy, but does not explicitly discourage recurrent failure patterns across rollouts. We propose MEDS, a Memory-Enhanced Dynamic reward Shaping framework that incorporates historical behavioral signals into reward design. By storing and leveraging intermediate model representations, we capture features of past rollouts and use density-based clustering to identify frequently recurring error patterns. Rollouts assigned to more prevalent error clusters are penalized more heavily, encouraging broader exploration while reducing repeated mistakes. Across five datasets and three base models, MEDS consistently improves average performance over existing baselines, achieving gains of up to 4.13 pass@1 points and 4.37 pass@128 points. Additional analyses using both LLM-based annotations and quantitative diversity metrics show that MEDS increases behavioral diversity during sampling.
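The idea that rollouts in more prevalent error clusters are penalized more heavily can be illustrated with a small sketch. The abstract does not give the penalty formula, so the linear form below, the function name `shaped_rewards`, and the weight `alpha` are all assumptions; the cluster labels are assumed to come from some density-based clustering step (for example DBSCAN) run beforehand.

```python
from collections import Counter


def shaped_rewards(rewards, cluster_ids, alpha=0.5):
    """Subtract a penalty proportional to error-cluster prevalence.

    cluster_ids[i] is the error-cluster label of rollout i, or None when the
    rollout is correct or was left unclustered (treated as noise).
    """
    counts = Counter(c for c in cluster_ids if c is not None)
    total = sum(counts.values())
    shaped = []
    for reward, cid in zip(rewards, cluster_ids):
        # Hypothetical penalty: alpha times the share of stored errors in this cluster.
        penalty = alpha * counts[cid] / total if cid is not None else 0.0
        shaped.append(reward - penalty)
    return shaped


# One correct rollout, three errors in cluster 0, one error in cluster 1.
example = shaped_rewards([1.0, 0.0, 0.0, 0.0, 0.0], [None, 0, 0, 0, 1], alpha=0.4)
```

In this toy case the three rollouts that repeat the same mistake receive a larger penalty than the single rollout in the rarer cluster, which is the qualitative behavior the abstract describes.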
[275] Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration cs.LG | cs.AI | cs.CLPDF
Zhipeng Chen, Tao Qian, Wayne Xin Zhao, Ji-Rong Wen
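TL;DR: This paper proposes NExt, a framework that accelerates reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs) by modeling and extrapolating low-rank parameter trajectories nonlinearly. It first trains the model with LoRA and extracts the rank-1 subspace of parameter differences at multiple training steps, then trains a predictor to model the trajectory of parameter updates and extrapolate it, reducing computational overhead.
Details
Motivation: RLVR training significantly improves LLM capabilities but requires extensive exploration and learning, which incurs substantial computational overhead. Prior work extrapolates parameters linearly, yet the dynamics of parameter updates during RLVR training remain insufficiently understood, which motivates a more effective acceleration method.
Result: Experiments show that the method reduces computational overhead by approximately 37.5% while remaining compatible with a wide range of RLVR algorithms and tasks, supporting its effectiveness and robustness.
Insight: The key finding is that the rank-1 subspace of the model does not evolve linearly during RLVR training. Building on this, the authors propose NExt, a nonlinear low-rank trajectory modeling and extrapolation framework, as a new route to accelerating RLVR training.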
Abstract: Recently, scaling reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs) has emerged as an effective training paradigm for significantly improving model capabilities. It requires guiding the model to perform extensive exploration and learning, which leads to substantial computational overhead and has become a key challenge. To reduce the number of training steps, prior work performs linear extrapolation of model parameters. However, the dynamics of model parameter updates during RLVR training remain insufficiently understood. To further investigate the evolution of LLMs during RLVR training, we conduct empirical experiments and find that the rank-1 subspace of the model does not evolve linearly, and its dominance over the original parameters is further amplified during LoRA training. Based on the above insights, we propose the \textbf{N}onlinear \textbf{Ext}rapolation of low-rank trajectories (\textbf{NExt}), a novel framework that models and extrapolates low-rank parameter trajectories in a nonlinear manner. Concretely, we first train the model using LoRA and extract the rank-1 subspace of parameter differences at multiple training steps, which is then used for the subsequent nonlinear extrapolation. Afterward, we use the extracted rank-1 subspace to train a predictor, which can model the trajectory of parameter updates during RLVR, and then perform the predict-extend process to extrapolate model parameters, thereby accelerating RLVR. To further study and understand NExt, we conduct comprehensive experiments that demonstrate the effectiveness and robustness of the method. Our method reduces computational overhead by approximately 37.5% while remaining compatible with a wide range of RLVR algorithms and tasks. We release our code at https://github.com/RUCAIBox/NExt.
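One way to read "extract the rank-1 subspace of parameter differences" is as taking the top singular triplet of a weight-difference matrix. The sketch below does this with plain power iteration; it is an illustration under that assumption, not the NExt implementation, and the function name and iteration count are made up for the example. It assumes the matrix is nonzero and that the starting vector is not orthogonal to the top right singular vector.

```python
import math


def rank1_component(delta, iters=100):
    """Top singular triplet (sigma, u, v) of a matrix via power iteration.

    delta is a list of rows. Assumes delta is nonzero and that the uniform
    starting vector is not orthogonal to the top right singular vector.
    """
    rows, cols = len(delta), len(delta[0])
    v = [1.0 / math.sqrt(cols)] * cols
    sigma, u = 0.0, [0.0] * rows
    for _ in range(iters):
        u = [sum(delta[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        norm_u = math.sqrt(sum(x * x for x in u))
        u = [x / norm_u for x in u]
        v = [sum(delta[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / sigma for x in v]
    return sigma, u, v


# Difference between two hypothetical checkpoints of a 2x2 weight matrix.
w_old = [[0.0, 0.0], [0.0, 0.0]]
w_new = [[3.0, 4.0], [6.0, 8.0]]
delta = [[n - o for n, o in zip(rn, ro)] for rn, ro in zip(w_new, w_old)]
sigma, u, v = rank1_component(delta)
```

Collecting `(sigma, u, v)` at several training steps gives a trajectory that a predictor could then be fit to, which is the role the abstract assigns to the extracted subspace.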
[276] Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach cs.LG | cs.CLPDF
Haolin Li, Shuyang Jiang, Ruipeng Zhang, Jiangchao Yao, Ya Zhang
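TL;DR: This paper proposes MedSSR, a framework that uses medical-knowledge-enhanced data synthesis and semi-supervised reinforcement learning to efficiently improve LLM reasoning in medicine, particularly in data-scarce areas such as rare diseases.
Details
Motivation: Existing methods rely on distilling reasoning chains from large proprietary models, which is costly and yields limited improvement in data-scarce areas such as rare diseases. An efficient method for strengthening medical reasoning is therefore needed.
Result: Experiments on Qwen and Llama models show that the method outperforms existing methods across ten medical benchmarks, with gains of up to 5.93% on rare-disease tasks.
Insight: The contribution is to synthesize distribution-controllable reasoning questions from rare-disease knowledge and to have the policy model itself generate high-quality pseudo-labels. This enables a two-stage training paradigm, from self-supervised RL on pseudo-labeled synthetic data to supervised RL on human-annotated real data, which scales training efficiently without costly reasoning-trace distillation.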
Abstract: While large language models hold promise for complex medical applications, their development is hindered by the scarcity of high-quality reasoning data. To address this issue, existing approaches typically distill chain-of-thought reasoning traces from large proprietary models via supervised fine-tuning, then conduct reinforcement learning (RL). These methods exhibit limited improvement on underrepresented domains like rare diseases while incurring substantial costs from generating complex reasoning chains. To efficiently enhance medical reasoning, we propose MedSSR, a Medical Knowledge-enhanced data Synthesis and Semi-supervised Reinforcement learning framework. Our framework first employs rare disease knowledge to synthesize distribution-controllable reasoning questions. We then utilize the policy model itself to generate high-quality pseudo-labels. This enables a two-stage, intrinsic-to-extrinsic training paradigm: self-supervised RL on the pseudo-labeled synthetic data, followed by supervised RL on the human-annotated real data. MedSSR scales model training efficiently without relying on costly trace distillation. Extensive experiments on Qwen and Llama demonstrate that our method outperforms existing methods across ten medical benchmarks, achieving up to +5.93% gain on rare-disease tasks. Our code is available at https://github.com/tdlhl/MedSSR.
[277] ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents cs.LG | cs.AI | cs.CL | cs.CVPDF
Fei Tang, Zhiqiong Lu, Boxuan Zhang, Weiming Lu, Jun Xiao
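TL;DR: This paper presents ClawGUI, an open-source unified framework for training, evaluating, and deploying GUI agents. It addresses the bottleneck created by the lack of a coherent full-stack infrastructure for GUI agents, including unstable training environments, inconsistent evaluation standards, and the difficulty of deploying on real devices.
Details
Motivation: Progress on GUI agents is bottlenecked less by model capability than by the absence of a coherent full-stack infrastructure: online RL training suffers from environment instability and closed pipelines, evaluation protocols drift silently across works, and trained agents rarely reach real users on real devices.
Result: On the MobileWorld GUI-Only benchmark, ClawGUI-2B, trained end to end within the framework, achieves a 17.1% success rate, outperforming the same-scale MAI-UI-2B baseline by 6.0%. ClawGUI-Eval provides a fully standardized evaluation pipeline across 6 benchmarks and 11+ models and achieves a 95.8% reproduction rate against official baselines.
Insight: The main contribution is a unified open-source framework integrating three core functions: training (ClawGUI-RL), evaluation (ClawGUI-Eval), and deployment (ClawGUI-Agent). Specifically: the first open-source GUI agent RL infrastructure supporting both parallel virtual environments and real physical devices; a standardized evaluation pipeline enforced across benchmarks and models; and deployment of trained agents to multiple mobile operating systems and chat platforms through hybrid CLI-GUI control and persistent personalized memory. This gives GUI agent research a much-needed, reproducible engineering foundation.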
Abstract: GUI agents drive applications through their visual interfaces instead of programmatic APIs, interacting with arbitrary software via taps, swipes, and keystrokes, reaching a long tail of applications that CLI-based agents cannot. Yet progress in this area is bottlenecked less by modeling capacity than by the absence of a coherent full-stack infrastructure: online RL training suffers from environment instability and closed pipelines, evaluation protocols drift silently across works, and trained agents rarely reach real users on real devices. We present \textbf{ClawGUI}, an open-source framework addressing these three gaps within a single harness. \textbf{ClawGUI-RL} provides the first open-source GUI agent RL infrastructure with validated support for both parallel virtual environments and real physical devices, integrating GiGPO with a Process Reward Model for dense step-level supervision. \textbf{ClawGUI-Eval} enforces a fully standardized evaluation pipeline across 6 benchmarks and 11+ models, achieving 95.8% reproduction against official baselines. \textbf{ClawGUI-Agent} brings trained agents to Android, HarmonyOS, and iOS through 12+ chat platforms with hybrid CLI-GUI control and persistent personalized memory. Trained end to end within this pipeline, \textbf{ClawGUI-2B} achieves 17.1% Success Rate on MobileWorld GUI-Only, outperforming the same-scale MAI-UI-2B baseline by 6.0%.
[278] Efficient Matrix Implementation for Rotary Position Embedding cs.LG | cs.CVPDF
Chen Minqi, Zhongqi Yue, Shihao Zhang, Yun Xu, Peng Wu
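TL;DR: This paper proposes RoME (Rotary Matrix position Embedding), an efficient matrix reformulation of the Rotary Position Embedding (RoPE) widely used in Transformer architectures. By unifying vector-level split and merge operations into matrix transformations, it eliminates the extra overhead of multi-dimensional settings and substantially improves hardware utilization.
Details
Motivation: Existing RoPE implementations rely on vector-level split and merge operations that introduce non-negligible computational overhead. The problem is most pronounced in multi-dimensional settings (e.g., 2D and 3D RoPE), where additional vector operations and uneven feature partitions degrade hardware utilization.
Result: Experiments show that RoME delivers substantial acceleration at both the operator and full-model levels.
Insight: The core contribution is reformulating RoPE from vector operations into unified matrix transformations. The reformulation is mathematically equivalent, simplifies implementation, and enables fused parallel execution across Cube and Vector units on modern NPUs. Moving from a vector to a matrix abstraction better matches hardware characteristics, making this an effective engineering optimization.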
Abstract: Rotary Position Embedding (RoPE) has become a core component of modern Transformer architectures across language, vision, and 3D domains. However, existing implementations rely on vector-level split and merge operations that introduce non-negligible computational overhead, often overlooked in attention optimization. The problem is further amplified in multi-dimensional settings (e.g., 2D and 3D RoPE), where additional vector operations and uneven feature partitions degrade hardware utilization. To overcome these limitations, we propose RoME (Rotary Matrix position Embedding), a mathematically equivalent yet computationally efficient reformulation of RoPE that replaces vector operations with unified matrix transformations. RoME eliminates dimension-specific operations, simplifies implementation, and enables fused parallel execution across Cube and Vector units on modern NPUs. Experiments show that RoME delivers substantial acceleration at both the operator and full-model levels. The implementation is available at https://gitcode.com/cann/ops-transformer/blob/master/experimental/posembedding/rope_matrix/README.md.
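The claim that RoPE can be written as a matrix transformation without changing its result can be checked on a toy vector. The sketch below compares the usual per-pair split/rotate/merge form with a block-diagonal rotation matrix, using the interleaved-pair convention and base 10000; it only illustrates the mathematical equivalence and is not the RoME kernel, whose actual matrix layout the abstract does not describe. All function names are hypothetical.

```python
import math


def rope_pairwise(x, pos, base=10000.0):
    """RoPE via per-pair split, rotate, merge (interleaved-pair convention)."""
    d = len(x)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # i = 2k for pair index k
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out


def rope_matrix(d, pos, base=10000.0):
    """Block-diagonal rotation matrix R such that R applied to x equals RoPE(x)."""
    R = [[0.0] * d for _ in range(d)]
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        R[i][i], R[i][i + 1] = c, -s
        R[i + 1][i], R[i + 1][i + 1] = s, c
    return R


def matvec(R, x):
    """Plain matrix-vector product."""
    return [sum(r * xj for r, xj in zip(row, x)) for row in R]


x = [1.0, 2.0, 3.0, 4.0]
via_pairs = rope_pairwise(x, pos=5)
via_matrix = matvec(rope_matrix(len(x), pos=5), x)
```

The two results agree up to floating-point rounding. A dense matrix like this one does more arithmetic than the pairwise form, so any speedup has to come from how the matrix product maps onto hardware units, which is the point the abstract makes about Cube and Vector units.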
[279] Efficient Personalization of Generative User Interfaces cs.LG | cs.AI | cs.CV | cs.HCPDF
Yi-Hao Peng, Samarth Das, Jeffrey P. Bigham, Jason Wu
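TL;DR: This paper studies personalization of generative user interfaces (UIs). Using a dataset in which 20 designers provide pairwise preference judgments over 600 generated UIs, the authors find substantial disagreement across designers. They then propose a sample-efficient personalization method that represents a new user's preferences in terms of prior designers' preferences rather than a fixed rubric of design concepts. The method outperforms a pretrained UI evaluator and a larger multimodal model in a technical evaluation, and its personalized generations are preferred by 12 new designers.
Details
Motivation: Generative UIs can adapt to users on demand, but personalization remains difficult because desirable UI properties are subjective, hard to articulate, and costly to infer from sparse feedback. The paper aims to develop a more efficient personalization method by analyzing how designers' preferences diverge.
Result: In a technical evaluation, the proposed preference model outperforms both a pretrained UI evaluator and a larger multimodal model, and scales better with additional feedback. When used to personalize generation, it produces interfaces that 12 new designers prefer over baseline approaches, including direct user prompting.
Insight: The core idea is to model a new user's preferences in terms of existing designers' preferences rather than a fixed framework of design concepts, giving a more flexible, sample-efficient route to personalization. Using data on disagreement within a group to drive personalization is a new way to handle subjective, hard-to-quantify design preferences.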
Abstract: Generative user interfaces (UIs) create new opportunities to adapt interfaces to individual users on demand, but personalization remains difficult because desirable UI properties are subjective, hard to articulate, and costly to infer from sparse feedback. We study this problem through a new dataset in which 20 trained designers each provide pairwise judgments over the same 600 generated UIs, enabling direct analysis of preference divergence. We find substantial disagreement across designers (average kappa = 0.25), and written rationales reveal that even when designers appeal to similar concepts such as hierarchy or cleanliness, designers differ in how they define, prioritize, and apply those concepts. Motivated by these findings, we develop a sample-efficient personalization method that represents a new user in terms of prior designers rather than a fixed rubric of design concepts. In a technical evaluation, our preference model outperforms both a pretrained UI evaluator and a larger multimodal model, and scales better with additional feedback. When used to personalize generation, it also produces interfaces preferred by 12 new designers over baseline approaches, including direct user prompting. Our findings suggest that lightweight preference elicitation can serve as a practical foundation for personalized generative UI systems.
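To make the reported agreement level concrete, the sketch below computes Cohen's kappa for two annotators' pairwise choices. The abstract says only "average kappa = 0.25" and does not state which kappa variant was used, so treating it as Cohen's kappa averaged over designer pairs is an assumption; the labels and data are invented for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both annotators chose independently at their own rates.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# "L" / "R" marks which UI of a pair each (hypothetical) designer preferred.
designer_a = ["L"] * 8 + ["R"] * 8
designer_b = ["L"] * 5 + ["R"] * 3 + ["L"] * 3 + ["R"] * 5
kappa = cohen_kappa(designer_a, designer_b)
```

Here the two designers agree on 10 of 16 pairs (62.5%) while chance agreement is 50%, giving a kappa of 0.25: raw agreement that looks moderate corresponds to only weak agreement once chance is removed.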
[280] Towards Multi-Source Domain Generalization for Sleep Staging with Noisy Labels cs.LG | cs.CV | cs.ROPDF
Kening Wang, Di Wen, Yufan Chen, Ruiping Liu, Junwei Zheng
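TL;DR: This paper presents the first benchmark for noisy labels in multi-source domain-generalized sleep staging (NL-DGSS) and finds that existing noise-robust methods degrade substantially when domain shift and label noise coexist. To address this, the authors propose FF-TRUST, a framework with Joint Time-Frequency Early Learning Regularization (JTF-ELR) that exploits temporal and spectral consistency together with confidence-diversity regularization to improve robustness under noisy supervision.
Details
Motivation: Automatic sleep staging is a learning problem over multimodal physiological signals such as EEG and EOG. These signals are often affected by domain shifts across institutions, devices, and populations, and real-world data often carries noisy annotations. Robust multi-source domain generalization that handles domain shift and label noise together remains underexplored.
Result: Experiments on five public datasets show that FF-TRUST achieves consistent state-of-the-art (SOTA) performance under diverse symmetric and asymmetric noise settings.
Insight: The contributions are the first benchmark for label noise in multi-source domain-generalized sleep staging and a joint regularization method (JTF-ELR) that combines temporal and spectral consistency to improve robustness to noise and domain shift. Extending early-learning regularization to joint time-frequency analysis of multimodal signals offers a new approach to domain generalization under complex noise.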
Abstract: Automatic sleep staging is a multimodal learning problem involving heterogeneous physiological signals such as EEG and EOG, which often suffer from domain shifts across institutions, devices, and populations. In practice, these data are also affected by noisy annotations, yet label-noise-robust multi-source domain generalization remains underexplored. We present the first benchmark for Noisy Labels in Multi-Source Domain-Generalized Sleep Staging (NL-DGSS) and show that existing noisy-label learning methods degrade substantially when domain shifts and label noise coexist. To address this challenge, we propose FF-TRUST, a domain-invariant multimodal sleep staging framework with Joint Time-Frequency Early Learning Regularization (JTF-ELR). By jointly exploiting temporal and spectral consistency together with confidence-diversity regularization, FF-TRUST improves robustness under noisy supervision. Experiments on five public datasets demonstrate consistent state-of-the-art performance under diverse symmetric and asymmetric noise settings. The benchmark and code will be made publicly available at https://github.com/KNWang970918/FF-TRUST.git.
[281] Preventing Latent Rehearsal Decay in Online Continual SSL with SOLAR cs.LG | cs.CVPDF
Giacomo Cignoni, Simone Magistri, Andrew D. Bagdanov, Antonio Carta
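TL;DR: This paper studies Online Continual Self-Supervised Learning (OCSSL) and proposes SOLAR to address the stability-plasticity trade-off. It introduces two metrics, Overlap and Deviation, to diagnose latent space degradation and uses them to guide buffer management and the loss function, achieving state-of-the-art performance on OCSSL vision benchmarks.
Details
Motivation: In OCSSL, models learn from continuous streams of non-stationary, unlabeled data. Existing methods typically use replay and aim for fast convergence, but face a stability-plasticity trade-off: stable methods (e.g., replay with Reservoir sampling) converge faster but suffer performance drops under certain conditions.
Result: Experiments show that SOLAR achieves state-of-the-art performance on OCSSL vision benchmarks, with both high convergence speed and high final performance.
Insight: The paper proposes the Latent Rehearsal Decay hypothesis to explain the performance collapse and designs the Overlap and Deviation metrics to diagnose latent space degradation. SOLAR uses online proxies of Deviation to guide buffer management and adds an explicit Overlap loss to manage plasticity adaptively, balancing stability and plasticity.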
Abstract: This paper explores Online Continual Self-Supervised Learning (OCSSL), a scenario in which models learn from continuous streams of unlabeled, non-stationary data, where methods typically employ replay and fast convergence is a central desideratum. We find that OCSSL requires particular attention to the stability-plasticity trade-off: stable methods (e.g. replay with Reservoir sampling) are able to converge faster compared to plastic ones (e.g. FIFO buffer), but incur performance drops under certain conditions. We explain this collapse phenomenon with the Latent Rehearsal Decay hypothesis, which attributes it to latent space degradation under excessive stability of replay. We introduce two metrics (Overlap and Deviation) that diagnose latent degradation and correlate with accuracy declines. Building on these insights, we propose SOLAR, which leverages efficient online proxies of Deviation to guide buffer management and incorporates an explicit Overlap loss, allowing SOLAR to adaptively manage plasticity. Experiments demonstrate that SOLAR achieves state-of-the-art performance on OCSSL vision benchmarks, with both high convergence speed and final performance.
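The contrast between a stable replay buffer (Reservoir sampling) and a plastic one (FIFO) can be seen in a few lines. The sketch below shows the two standard buffer policies on an integer stream; it does not include SOLAR's Deviation-guided buffer management, and the class name, capacity, and seed are chosen only for illustration.

```python
import random
from collections import deque


class ReservoirBuffer:
    """Keeps a uniform random sample of everything seen so far."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Each new item replaces a stored one with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item


reservoir = ReservoirBuffer(capacity=10)
fifo = deque(maxlen=10)  # FIFO: always holds the 10 most recent items
for sample in range(1000):
    reservoir.add(sample)
    fifo.append(sample)
```

After the stream ends, the FIFO buffer holds only samples 990 to 999, while the reservoir holds a sample spread over the whole stream. Old data changes slowly in the reservoir, which is the "stability" whose excess the abstract links to latent space degradation.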
[282] Autonomous Diffractometry Enabled by Visual Reinforcement Learning cs.LG | cond-mat.mtrl-sci | cs.CVPDF
J. Oppliger, M. Stifter, A. Rüegg, I. Biało, L. Martinelli
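TL;DR: This paper presents an autonomous single-crystal alignment system based on model-free reinforcement learning. The system identifies and navigates to high-symmetry orientations directly from Laue diffraction patterns, without relying on human expertise in crystallography or diffraction theory.
Details
Motivation: Automating tasks that require interpreting abstract visual information is challenging, especially in areas such as crystal alignment that depend on human understanding of diffraction patterns. The work aims to develop a computational framework for intelligent diffractometers.
Result: Despite the absence of human supervision, the agent develops human-like strategies and achieves time-efficient alignment across different crystal symmetry classes, advancing automated experimental workflows in materials science.
Insight: The contribution is applying model-free reinforcement learning to the visual interpretation of diffraction patterns, yielding an autonomous alignment system that needs no prior crystallographic knowledge and providing a computational approach that other automated scientific experiments can draw on.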
Abstract: Automation underpins progress across scientific and industrial disciplines. Yet, automating tasks requiring interpretation of abstract visual information remains challenging. For example, crystal alignment strongly relies on humans with the ability to comprehend diffraction patterns. Here we introduce an autonomous system that aligns single crystals without access to crystallography and diffraction theory. Using a model-free reinforcement learning framework, an agent learns to identify and navigate towards high-symmetry orientations directly from Laue diffraction patterns. Despite the absence of human supervision, the agent develops human-like strategies to achieve time-efficient alignment across different crystal symmetry classes. With this, we provide a computational framework for intelligent diffractometers. As such, our approach advances the development of automated experimental workflows in materials science.
[283] Solving Physics Olympiad via Reinforcement Learning on Physics Simulators cs.LG | cs.AI | cs.CV | cs.ROPDF
Mihir Prabhudesai, Aryan Satpathy, Yangmin Li, Zheyang Qin, Nikash Bhardwaj
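TL;DR: This paper proposes generating synthetic data with physics simulators and training large language models on it with reinforcement learning for physical reasoning. Trained only on synthetic data, the models transfer zero-shot to International Physics Olympiad problems, improving performance by 5-10 percentage points.
Details
Motivation: LLMs are difficult to train for reasoning in fields such as physics, which lack large-scale question-answer datasets.
Result: After training solely on synthetic simulated data, models improve on International Physics Olympiad (IPhO) problems by 5-10 percentage points across model sizes, demonstrating zero-shot sim-to-real transfer.
Insight: The contribution is using physics simulators as a scalable source of supervision: reinforcement learning on synthetic data lets models acquire deep physical reasoning skills beyond the limits of internet QA data.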
Abstract: We have witnessed remarkable advances in LLM reasoning capabilities with the advent of DeepSeek-R1. However, much of this progress has been fueled by the abundance of internet question-answer (QA) pairs, a major bottleneck going forward, since such data is limited in scale and concentrated mainly in domains like mathematics. In contrast, other sciences such as physics lack large-scale QA datasets to effectively train reasoning-capable models. In this work, we show that physics simulators can serve as a powerful alternative source of supervision for training LLMs for physical reasoning. We generate random scenes in physics engines, create synthetic question-answer pairs from simulated interactions, and train LLMs using reinforcement learning on this synthetic data. Our models exhibit zero-shot sim-to-real transfer to real-world physics benchmarks: for example, training solely on synthetic simulated data improves performance on IPhO (International Physics Olympiad) problems by 5-10 percentage points across model sizes. These results demonstrate that physics simulators can act as scalable data generators, enabling LLMs to acquire deep physical reasoning skills beyond the limitations of internet-scale QA data. Code available at: https://sim2reason.github.io/.
cs.SE [Back]
[284] DuET: Dual Execution for Test Output Prediction with Generated Code and Pseudocode cs.SE | cs.CLPDF
Hojae Han, Jaejin Kim, Seung-won Hwang, Yu Jin Kim, Moontae Lee
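TL;DR: This paper proposes DuET, a framework that combines direct execution of generated code with LLM-based pseudocode execution and uses functional majority voting to improve the reliability of test output prediction.
Details
Motivation: Test output prediction is a key challenge in test case generation. Existing methods rely on direct execution of generated code, but even minor errors in the code can cause failures, so a more robust prediction method is needed.
Result: On LiveCodeBench, DuET achieves state-of-the-art performance, improving Pass@1 by 13.6 percentage points.
Insight: The contribution is a dual-execution framework that exploits the complementarity of direct execution and pseudocode execution: direct execution suffers from code errors and pseudocode reasoning from hallucination, and majority voting across the two improves prediction reliability.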
Abstract: This work addresses test output prediction, a key challenge in test case generation. To improve the reliability of outputs predicted by LLMs, prior approaches generate code first to ground predictions. One grounding strategy is direct execution of generated code, but even minor errors can cause failures. To address this, we introduce LLM-based pseudocode execution, which grounds prediction on more error-resilient pseudocode and simulates execution via LLM reasoning. We further propose DuET, a dual-execution framework that combines both approaches by functional majority voting. Our analysis shows the two approaches are complementary: each compensates for the other's limitation, with direct execution suffering from code errors and pseudocode reasoning from hallucination. On LiveCodeBench, DuET achieves state-of-the-art performance, improving Pass@1 by 13.6 pp.
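A minimal sketch of voting over the outputs of the two execution paths is shown below. The abstract does not define "functional majority voting" precisely, so this example assumes it means grouping candidate predictions by the output value they produce and picking the largest group; the function name, the use of `None` for a crashed run, and the sample data are all hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common predicted output, skipping failed runs (None).

    Predictions are grouped by repr() so that unhashable outputs such as
    lists can be counted. Ties go to the value that appeared first.
    """
    votes = Counter()
    first_seen = {}
    for pred in predictions:
        if pred is None:  # assumption: None marks a crashed or missing run
            continue
        key = repr(pred)
        votes[key] += 1
        first_seen.setdefault(key, pred)
    if not votes:
        return None
    key, _ = votes.most_common(1)[0]
    return first_seen[key]


# Outputs predicted for one test input by each path (hypothetical data).
direct_execution = [None, [1, 2], [1, 2]]        # one generated program crashed
pseudocode_execution = [[1, 2], [2, 1], [1, 2]]  # one simulated run hallucinated
predicted = majority_vote(direct_execution + pseudocode_execution)
```

In this toy case the crash and the hallucinated output are both outvoted, which illustrates the complementarity the abstract points to.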
cs.NI [Back]
[285] R2E-VID: Two-Stage Robust Routing via Temporal Gating for Elastic Edge-Cloud Video Inference cs.NI | cs.CV | cs.DCPDF
Zheming Yang, Lulu Zuo, Shun Lu, Yangyu Zhang, Zhicheng Li
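TL;DR: This paper proposes R2E-VID, a two-stage robust routing framework for elastic edge-cloud video inference. In the first stage, a temporal gating mechanism models the temporal consistency and motion dynamics of video streams to predict the optimal routing pattern for each segment. In the second stage, a robust routing optimization module refines the allocation through multi-model adaptation, jointly minimizing inference delay and resource consumption under dynamic network and workload variations.
Details
Motivation: Existing edge-cloud collaborative systems often fail to adapt dynamically to heterogeneous video content and fluctuating resource conditions, resulting in inefficient routing and high computational cost. The paper aims for more efficient, adaptive routing of video inference.
Result: Extensive experiments on public datasets show that R2E-VID reduces overall cost by up to 60% compared with cloud-centric baselines, and delivers 35-45% lower delay while improving inference accuracy by 2-7% over state-of-the-art edge-cloud solutions.
Insight: The main contributions are a temporal gating mechanism for fine-grained, spatiotemporally elastic workload partitioning and a two-stage robust optimization that jointly considers delay, cost, and accuracy. The core idea is to use the temporal characteristics of video content to guide dynamic routing decisions and to add robustness to changing conditions, which is relevant to dynamic edge computing more broadly.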
Abstract: With the rapid growth of large-scale video analytics applications, edge-cloud collaborative systems have become the dominant paradigm for real-time inference. However, existing approaches often fail to dynamically adapt to heterogeneous video content and fluctuating resource conditions, resulting in suboptimal routing efficiency and high computational costs. In this paper, we propose R2E-VID, a two-stage robust routing framework via temporal gating for elastic edge-cloud video inference. In the first stage, R2E-VID introduces a temporal gating mechanism that models the temporal consistency and motion dynamics of incoming video streams to predict the optimal routing pattern for each segment. This enables adaptive partitioning of inference workloads between edge and cloud nodes, achieving fine-grained spatiotemporal elasticity. In the second stage, a robust routing optimization module refines the allocation through multi-model adaptation, jointly minimizing inference delay and resource consumption under dynamic network and workload variations. Extensive experiments on public datasets demonstrate that R2E-VID achieves up to 60% reduction in overall cost compared to cloud-centric baselines, and delivers 35-45% lower delay while improving inference accuracy by 2-7% over state-of-the-art edge-cloud solutions.
cs.IR [Back]
[286] MCERF: Advancing Multimodal LLM Evaluation of Engineering Documentation with Enhanced Retrieval cs.IR | cs.AI | cs.CLPDF
Kiarash Naghavi Khanghah, Hoang Anh Nguyen, Anna C. Doris, Amir Mohammad Vahedi, Daniele Grandi
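TL;DR: This paper proposes MCERF (Multimodal ColPali Enhanced Retrieval and Reasoning Framework), a question-answering system for engineering documents containing dense text, tables, and illustrations. It couples a multimodal retriever with LLM reasoning and, through hybrid retrieval strategies and adaptive routing, achieves a relative gain of 41.1% in average accuracy over the best baseline RAG results on the DesignQA benchmark.
Details
Motivation: Existing retrieval-augmented generation (RAG) systems struggle with the multimodal information (dense text, tables, illustrations) in engineering rulebooks and technical standards. The goal is to understand and answer complex multimodal questions more accurately, in particular without ingesting the complete rulebook.
Result: On the DesignQA benchmark, the system improves average accuracy across all tasks with a relative gain of +41.1% over the best baseline RAG results, with significant improvements on multimodal and reasoning-intensive tasks.
Insight: Contributions include: (1) coupling a multimodal retriever (ColPali) with LLM reasoning; (2) four retrieval and reasoning strategies (Hybrid Lookup, Vision to Text fusion, High Reasoning LLM mode, and SelfConsistency decision); and (3) a modular framework design with two dynamic query routing approaches (single-case routing and a multi-agent system), providing a reusable template for future multimodal systems.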
Abstract: Engineering rulebooks and technical standards contain multimodal information like dense text, tables, and illustrations that are challenging for retrieval-augmented generation (RAG) systems. Building upon the DesignQA framework [1], which relied on full-text ingestion and text-based retrieval, this work establishes a Multimodal ColPali Enhanced Retrieval and Reasoning Framework (MCERF), a system that couples a multimodal retriever with large language model reasoning for accurate and efficient question answering from engineering documents. The system employs ColPali, which retrieves both textual and visual information, and multiple retrieval and reasoning strategies: (i) Hybrid Lookup mode for explicit rule mentions, (ii) Vision to Text fusion for figure- and table-guided queries, (iii) High Reasoning LLM mode for complex multimodal questions, and (iv) SelfConsistency decision to stabilize responses. The modular framework design provides a reusable template for future multimodal systems regardless of underlying model architecture. Furthermore, this work establishes and compares two routing approaches: a single-case routing approach and a multi-agent system, both of which dynamically allocate queries to optimal pipelines. Evaluation on the DesignQA benchmark shows that this system improves average accuracy across all tasks with a relative gain of +41.1% over the best baseline RAG results, a significant improvement in multimodal and reasoning-intensive tasks without complete rulebook ingestion. This shows how vision-language retrieval, modular reasoning, and adaptive routing enable scalable document comprehension in engineering use cases.
[287] NSFL: A Post-Training Neuro-Symbolic Fuzzy Logic Framework for Boolean Operators in Neural Embeddings cs.IR | cs.AI | cs.CL | cs.LGPDF
Vladi Vexler, Ofer Idan, Gil Lederman, Dima Sivov
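TL;DR: This paper proposes Neuro-Symbolic Fuzzy Logic (NSFL), a post-training framework that provides a calculus for Boolean operators in neural embedding spaces without retraining. It actively steers representations with Neuro-Symbolic Deltas (NS-Delta) and uses Spherical Query Optimization (SQO) to project fuzzy logic formulas into manifold-stable query vectors, preserving atomic meaning while capturing domain reliance and avoiding the representation collapse and manifold escape of traditional geometric baselines.
Details
Motivation: Standard dense retrievers lack a native calculus for multi-atom logical constraints. NSFL addresses this by providing a framework for logical operations in neural embedding spaces that requires no retraining.
Result: Validated across six distinct encoder configurations and two modalities (including zero-shot and SOTA fine-tuned models), NSFL yields mAP improvements of up to 81%. Even when applied to encoders explicitly fine-tuned for logical reasoning, it provides an additional boost of 20% on average and up to 47%.
Insight: The contribution is adapting formal t-norms and t-conorms to neural embedding spaces, realizing a first-order hybrid calculus through NS-Delta and using SQO for optimization on the manifold. This establishes a training-free, order-aware calculus for high-dimensional spaces and lays groundwork for future dynamic scaling and learned manifold logic.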
Abstract: Standard dense retrievers lack a native calculus for multi-atom logical constraints. We introduce Neuro-Symbolic Fuzzy Logic (NSFL), a framework that adapts formal t-norms and t-conorms to neural embedding spaces without requiring retraining. NSFL operates as a first-order hybrid calculus: it anchors logical operations on isolated zero-order similarity scores while actively steering representations using Neuro-Symbolic Deltas (NS-Delta) – the first-order marginal differences derived from contextual fusion. This preserves pure atomic meaning while capturing domain reliance, preventing the representation collapse and manifold escape endemic to traditional geometric baselines. For scalable real-time retrieval, Spherical Query Optimization (SQO) leverages Riemannian optimization to project these fuzzy formulas into manifold-stable query vectors. Validated across six distinct encoder configurations and two modalities (including zero-shot and SOTA fine-tuned models), NSFL yields mAP improvements up to +81%. Notably, NSFL provides an additive 20% average and up to 47% boost even when applied to encoders explicitly fine-tuned for logical reasoning. By establishing a training-free, order-aware calculus for high-dimensional spaces, this framework lays the foundation for future dynamic scaling and learned manifold logic.
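For readers unfamiliar with t-norms and t-conorms, the sketch below applies the standard product t-norm, its dual probabilistic-sum t-conorm, and the standard negation to per-atom similarity scores in [0, 1]. Which t-norm family NSFL uses is not stated in the abstract, so the product family is an assumption, and the NS-Delta and SQO components are not modeled; the atom names and scores are invented.

```python
def fuzzy_and(a, b):
    """Product t-norm: fuzzy conjunction of two scores in [0, 1]."""
    return a * b


def fuzzy_or(a, b):
    """Probabilistic sum: the t-conorm dual to the product t-norm."""
    return a + b - a * b


def fuzzy_not(a):
    """Standard fuzzy negation."""
    return 1.0 - a


# Hypothetical zero-order similarity scores of one document to three query atoms.
sim = {"red": 0.8, "car": 0.5, "truck": 0.1}

# Score for the query: red AND (car OR NOT truck)
score = fuzzy_and(sim["red"], fuzzy_or(sim["car"], fuzzy_not(sim["truck"])))
```

These operators satisfy De Morgan's law, `fuzzy_not(fuzzy_and(a, b)) == fuzzy_or(fuzzy_not(a), fuzzy_not(b))` up to rounding, which is part of what makes them behave like Boolean connectives on graded scores.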
cs.NE [Back]
[288] Sharpness-Aware Surrogate Training for On-Sensor Spiking Neural Networks cs.NE | cs.CV | cs.LGPDF
Maximilian Nicholson
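TL;DR: This paper proposes Sharpness-Aware Surrogate Training (SAST) to reduce the performance drop that spiking neural networks (SNNs) suffer at deployment when the smooth surrogate nonlinearity is replaced by a hard threshold (the surrogate-to-hard transfer gap). The method applies Sharpness-Aware Minimization (SAM) to a surrogate-forward SNN so that the training objective is smooth and the gradient is exact. On two event-camera benchmarks (N-MNIST and DVS Gesture), it substantially improves hard-spike accuracy, remains strong under a hardware-aware inference simulation (e.g., INT8/INT4 quantization), and reduces the SynOps (synaptic operations) count.
Details
Motivation: SNNs trained with a smooth surrogate nonlinearity degrade sharply when it is replaced by a hard threshold at deployment (the surrogate-to-hard transfer gap). Closing this gap matters for on-sensor vision systems operating under strict power budgets.
Result: Hard-spike accuracy improves from 65.7% to 94.7% on N-MNIST and from 31.8% to 63.3% on DVS Gesture. Under a hardware-aware inference simulation (INT8/INT4 quantization, fixed-point membrane potentials, discrete leak factors), N-MNIST accuracy improves from 47.6% to 96.9% (INT8) and from 43.2% to 81.0% (INT4), and DVS Gesture accuracy improves from 25.3% to 47.6% (INT8) and from 26.0% to 43.8% (INT4). SynOps also decrease (for example, from 1734k to 1315k on N-MNIST with INT8).
Insight: The contribution is applying SAM to surrogate training of SNNs to shrink the surrogate-to-hard transfer gap and improve robustness and deployment accuracy. By favoring flatter regions of the loss landscape (sharpness awareness), the method makes models more tolerant of hardware constraints such as quantization and discretization, offering an effective training strategy for on-sensor SNN inference.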
Abstract: Spiking neural networks (SNNs) are a natural computational model for on-sensor and near-sensor vision, where event-driven processors must operate under strict power budgets with hard binary spikes. However, models trained with surrogate gradients often degrade sharply when the smooth surrogate nonlinearity is replaced by a hard threshold at deployment, a surrogate-to-hard transfer gap that directly limits on-sensor accuracy. We study Sharpness-Aware Surrogate Training (SAST), which applies Sharpness-Aware Minimization (SAM) to a surrogate-forward SNN so that the training objective is smooth and the gradient is exact, and position it as one gap-reduction strategy under the tested settings rather than the only viable mechanism. Under explicit contraction assumptions we provide state-stability, input-Lipschitz, and smoothness bounds, together with a corresponding nonconvex convergence result. On two event-camera benchmarks, swap-only hard-spike accuracy improves from 65.7% to 94.7% on N-MNIST and from 31.8% to 63.3% on DVS Gesture. Under a hardware-aware inference simulation (INT8/INT4 weight quantization, fixed-point membrane potentials, discrete leak factors), SAST remains strong: on N-MNIST, hard-spike accuracy improves from 47.6% to 96.9% (INT8) and from 43.2% to 81.0% (INT4), while on DVS Gesture it improves from 25.3% to 47.6% (INT8) and from 26.0% to 43.8% (INT4). SynOps also decrease under the same hardware-aware setting, including 1734k$\rightarrow$1315k (N-MNIST, INT8) and 86221k$\rightarrow$4323k (DVS Gesture, INT8). These results suggest that SAST is a promising component in a broader toolbox for on-sensor spiking inference under the tested settings.
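The Sharpness-Aware Minimization step that SAST builds on can be sketched generically: take the gradient, move to a nearby worst-case point along it, and use the gradient at that point to update the original weights. The example below runs one such step on a toy quadratic loss; it is not the SAST training loop and involves no spiking network, and the function name, learning rate, and `rho` are illustrative assumptions.

```python
import math


def sam_step(weights, grad_fn, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization update on a list of weights.

    grad_fn(w) must return the loss gradient at w as a list of floats.
    """
    grad = grad_fn(weights)
    norm = math.sqrt(sum(g * g for g in grad)) + 1e-12  # avoid division by zero
    # Ascent step: perturb the weights toward higher loss within radius rho.
    perturbed = [w + rho * g / norm for w, g in zip(weights, grad)]
    # Descent step: apply the gradient from the perturbed point to the original weights.
    sharp_grad = grad_fn(perturbed)
    return [w - lr * g for w, g in zip(weights, sharp_grad)]


def quadratic_grad(w):
    """Gradient of the toy loss sum(w_i ** 2)."""
    return [2.0 * x for x in w]


new_weights = sam_step([3.0, 4.0], quadratic_grad)
```

Because the update uses the gradient at the perturbed point, it penalizes directions in which the loss rises quickly nearby, which is the flatness preference the abstract relies on for robustness to the hard-threshold swap and to quantization.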