Table of Contents
- cs.CL [Total: 83]
- cs.CV [Total: 195]
- cs.LG [Total: 9]
- cs.MM [Total: 1]
- cs.AI [Total: 12]
- cs.HC [Total: 2]
- cs.GR [Total: 4]
- cs.NE [Total: 1]
- cs.DC [Total: 1]
- cs.IR [Total: 2]
- cs.SD [Total: 2]
- cs.CR [Total: 2]
- cs.RO [Total: 11]
- eess.IV [Total: 2]
cs.CL
[1] DraDDP: A Multimodal Multi-Party Dialogue Discourse Parsing Dataset cs.CL | cs.AI
Shannan Liu, Peifeng Li, Yaxin Fan, Qiaoming Zhu
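TL;DR: This paper builds DraDDP, the first publicly available English multimodal dataset for multi-party dialogue discourse parsing. Built from American TV dramas, it contains 495 dialogue segments, 6,374 utterances, and 9.1 hours of parallel video covering rich multi-party interaction scenarios, and it comes with comprehensive benchmarks.
Motivation: Existing dialogue discourse parsing research is mostly limited to the text modality or to two-party dialogue, so it does not cover realistic multimodal, multi-party settings.
Result: Experiments on DraDDP show that multimodal information is valuable for capturing dialogue structure and relation types, and they provide a baseline for future research.
Insight: The contribution is the first public multimodal dataset for multi-party dialogue discourse parsing, together with a systematic evaluation of how each modality affects performance, which advances multimodal dialogue understanding.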
Abstract: Multi-party dialogue discourse parsing aims to identify dependency structures and relation types between utterances in conversations. Previous studies are mostly limited to the textual modality or two-party dialogue and fail to address multimodal and multi-party settings. In this paper, we construct the first publicly available English multimodal dataset DraDDP for multi-party dialogue discourse parsing, based on American TV dramas. DraDDP contains 495 dialogue segments with 6,374 utterances and 9.1 hours of parallel video content, covering rich multi-party interaction scenarios. Moreover, we establish comprehensive benchmarks by evaluating this task on DraDDP and conducting in-depth analysis on the impact of different modalities. Experimental results demonstrate the value of multimodal information in capturing dialogue structures and relation types. We will publicly release the dataset, annotation guidelines, and code to promote future research in multimodal dialogue understanding.
[2] CSRP: Chain-of-Thought Reasoning for Chinese Text Correction via Reinforcement Learning with Efficiency-Aware Rewards cs.CL | cs.AI
Wei Tian, Yuhao Zhou, Man Lan
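TL;DR: This paper proposes CSRP, a three-stage training framework that addresses two challenges in Chinese Grammatical Error Correction (CGEC): general-purpose large language models lack domain priors, and supervised fine-tuning leads to over-correction. The framework combines continual pre-training, chain-of-thought supervised fine-tuning, and reinforcement learning with an efficiency-aware reward. It reaches SOTA on the NACGEC benchmark and substantially improves spelling correction.
Motivation: LLM-based CGEC systems face two key problems: general-purpose models lack domain priors for subtle grammatical distinctions, and supervised fine-tuning with maximum likelihood estimation does not optimize for precision-focused metrics, which causes systematic over-correction.
Result: CSRP reaches SOTA on NACGEC with an F0.5 of 50.99 and a precision of 57.17. On CSCD spelling correction it reaches an F1 of 59.61, 5.20 points above GPT-4. Ablations show that the RL alignment stage yields an 8% relative gain over the SFT baseline.
Insight: The contribution is a three-stage progressive training framework, in particular reinforcement learning with an efficiency-aware reward that explicitly penalizes unnecessary edits and thereby mitigates over-correction. Chain-of-thought fine-tuning adds diagnostic transparency, and the gain from optimizing edit efficiency is shown to be orthogonal to the gain from large-scale continual pre-training.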
Abstract: Large Language Model (LLM) based Chinese Grammatical Error Correction (CGEC) systems face two critical challenges: general-purpose models lack specialized linguistic priors for subtle grammatical distinctions, and Supervised Fine-Tuning (SFT) with Maximum Likelihood Estimation fails to optimize for precision-focused metrics, leading to systematic over-correction. We propose CSRP, a three-stage framework that progressively builds correction capability through Continual Pre-training (CPT) on 5.9M balanced samples to internalize domain knowledge, Chain-of-Thought SFT with explicit error reasoning for diagnostic transparency, and Group Relative Policy Optimization with a novel Efficiency-Aware Reward that explicitly penalizes unnecessary edits. On the NACGEC benchmark, CSRP achieves state-of-the-art performance with 50.99 $F_{0.5}$ and 57.17 precision, substantially outperforming previous best results while effectively mitigating the over-correction bias inherent in MLE-trained models. Our method also advances CSCD spelling correction to 59.61 F1, surpassing GPT-4 by 5.20 points. Comprehensive ablation studies demonstrate that the RL alignment stage contributes an 8% relative gain over the SFT baseline, and that this gain is orthogonal to the contribution of large-scale CPT, validating that explicit optimization for edit efficiency is essential for high-quality grammatical error correction. Our code is available at https://github.com/TW-NLP/ChineseErrorCorrector.
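Since the headline number above is an $F_{0.5}$ score, a short sketch of the standard F-beta formula may help show why that metric rewards precision over recall. This is the textbook definition, not code from the CSRP repository, and the function name `f_beta` is made up for illustration.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Weighted harmonic mean of precision and recall.

    beta < 1 weights precision more heavily; beta = 1 gives the usual F1.
    """
    if precision <= 0.0 and recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# A precise but conservative corrector scores higher under F0.5 than
# an aggressive one with the same two numbers swapped.
conservative = f_beta(precision=0.6, recall=0.3)
aggressive = f_beta(precision=0.3, recall=0.6)
```

With `beta = 0.5`, a system that over-corrects (high recall, low precision) is penalized more than under F1, which is consistent with the abstract's focus on precision-focused metrics.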
[3] lmfaoooo at SemEval-2026 Task 1: Humor Is an Audience. Preference Modeling for Constrained Humor Generation cs.CL | cs.AI
Alexey Tikhonov, Alexey Ivanov
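TL;DR: This paper describes a system for the SemEval-2026 Task 1 (MWAHAHA) humor generation competition that follows a "generate-many -> select-best" strategy. It first produces a diverse pool of candidate jokes through multi-step prompting, model ensembling, and diversity-oriented decoding, then selects the best output with a preference model that approximates a "reader" by learning from human pairwise comparisons. The system ranked 1st in the English and Chinese subtasks and 2nd in the Spanish subtask.
Motivation: Humor generation is hard not only because writing fluent, novel jokes is hard, but because "funny" depends on the audience and the supervision signal is noisy: preferences vary with audience, context, and culture, and annotator agreement is usually low. A method is therefore needed to model this subjective preference so that constrained generation matches the taste of a particular "reader".
Result: In the MWAHAHA evaluation, the system ranked 1st in the English and Chinese subtasks and 2nd in the Spanish subtask. Across three preference datasets, its preference models consistently outperform baselines and show stronger cross-domain transfer.
Insight: The core idea is to split humor generation into two stages, diverse generation and preference-based selection, and to build a preference model that learns from human pairwise comparisons rather than absolute funniness scores. This yields an interpretable pipeline for highly subjective generation tasks: collect high-quality pairwise comparisons (such as the 2.5K judgments gathered through the Humor Arena prototype), train a robust preference model, and use it to rank candidates.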
Abstract: Humor generation remains difficult not only because producing fluent, novel jokes is hard, but because “funny” is audience-dependent and supervision is noisy – preferences vary with audience, context, and culture, and annotator agreement is often low. In this paper, we describe our system for the SemEval-2026 Task-1 (MWAHAHA), which focuses on humor generation under explicit constraints. The task evaluates submitted systems via human preference judgments in 1-on-1 arena-style comparisons. We adopt a “generate-many -> select-best” strategy. First, we generate a diverse pool of candidates per instance using multi-step prompting, model ensembling, and diversity-oriented decoding. Second, we select outputs using a preference model that approximates a “reader” by learning from human comparisons rather than absolute funniness scores. To support this approach, we release 2.5K human pairwise judgments collected through the Humor Arena prototype. We further propose an interpretable pipeline that converts labeled comparisons into a preference model. Across three preference datasets, our models consistently outperform baselines and show stronger cross-domain transfer. Finally, we apply the learned preference model to rank candidates for the MWAHAHA setting and release intermediate artifacts (candidate pools and rankings) to facilitate follow-up work. Our system ranked 1st in the English and Chinese subtasks of MWAHAHA and 2nd in the Spanish subtask.
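The "generate-many -> select-best" step described above can be sketched as a round-robin over candidates scored by a pairwise preference function. The abstract does not specify how the learned model is applied at ranking time, so the aggregation rule here (mean win probability), the name `select_best`, and the length-based `toy_prefer` stand-in are all assumptions for illustration only.

```python
from typing import Callable, Sequence


def select_best(candidates: Sequence[str],
                prefer: Callable[[str, str], float]) -> str:
    """Return the candidate with the highest mean win probability.

    prefer(a, b) is assumed to return the probability that a reader
    prefers a over b; in a real system it would be a learned model.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def mean_win_prob(i: int) -> float:
        others = [j for j in range(len(candidates)) if j != i]
        total = sum(prefer(candidates[i], candidates[j]) for j in others)
        return total / len(others)

    best = max(range(len(candidates)), key=mean_win_prob)
    return candidates[best]


def toy_prefer(a: str, b: str) -> float:
    """Hypothetical stand-in preference: shorter text wins."""
    if len(a) == len(b):
        return 0.5
    return 1.0 if len(a) < len(b) else 0.0


pool = ["a rather long joke", "short one", "medium joke"]
winner = select_best(pool, toy_prefer)
```

Keeping the preference function as a parameter separates candidate generation from selection, which mirrors the two-stage split the paper describes.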
[4] TCAR-Gen: Temporal Graph Retrieval with Evidence Fusion for Knowledge-Grounded Generation cs.CL | cs.AI
Sidra Nasir, Muhammad Noman Zahid, Rizwan Ahmed Khan
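TL;DR: This paper proposes TCAR-Gen, a framework for the temporal reasoning and evidence fusion challenges that knowledge-grounded generation systems face when answering complex questions over historical criminal case narratives. It combines query-conditioned graph neural networks, temporal evidence fusion, and chain-of-trees reasoning to ground answers in retrieved evidence. On the Victorian Crime Diaries benchmark, TCAR-Gen reaches 0.3738 Recall@5 across seven query types, including multi-hop reasoning and counterfactual questions, outperforming several baselines.
Motivation: Existing retrieval-augmented generation systems fall short on temporal reasoning and evidence fusion for complex questions over historical criminal case narratives: they either retrieve independently of query semantics or fail to integrate multiple evidence sources coherently.
Result: On the Victorian Crime Diaries benchmark, TCAR-Gen reaches 0.3738 Recall@5, outperforming Vanilla RAG, Temporal RAG, GraphRAG-C, and GraphRAG-T. Ablations show that the context graph, the temporal penalty mechanism, and query conditioning are the critical components.
Insight: The contribution is explicit temporal modelling and multi-branch evidence fusion, which the authors find essential for faithful, reasoning-intensive question answering over knowledge-grounded corpora. The framework pairs query-conditioned graph neural networks with chain-of-trees reasoning to improve both the semantic relevance of retrieval and the integration of evidence.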
Abstract: Retrieval-augmented generation systems struggle with temporal reasoning and evidence fusion when answering complex questions over historical criminal case narratives. Existing approaches either retrieve independently of query semantics or fail to integrate multiple evidence sources coherently. We propose Temporal Context Augmented Retrieval Generation (TCAR-Gen), a framework that combines query-conditioned graph neural networks, temporal evidence fusion, and chain-of-trees reasoning to ground answer generation in retrieved evidence. On the Victorian Crime Diaries benchmark, TCAR-Gen achieves 0.3738 Recall@5, outperforming Vanilla RAG, Temporal RAG, GraphRAG-C, and GraphRAG-T across seven query types including multi-hop reasoning and counterfactual questions. Ablation studies reveal that the context graph, temporal penalty mechanism, and query conditioning are critical components. Cross-model evaluation across five language models (GPT-OSS 20B to TinyLlama 1.1B) demonstrates that TCAR-Gen maintains robust retrieval coverage at smaller model scales, though generation quality degrades substantially with reduced model capacity. Our work shows that explicit temporal modelling and multi-branch evidence fusion are essential for faithful, reasoning-intensive question answering over knowledge-grounded corpora.
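For readers unfamiliar with the Recall@5 figure quoted above, here is a minimal sketch of the usual per-query definition. The abstract does not state the benchmark's exact scoring protocol, so this is the common textbook form; the function name and the example document ids are invented for illustration.

```python
from typing import Iterable, Sequence


def recall_at_k(retrieved: Sequence[str],
                relevant: Iterable[str],
                k: int = 5) -> float:
    """Fraction of relevant items that appear in the top-k retrieved list."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = len(relevant_set.intersection(retrieved[:k]))
    return hits / len(relevant_set)


# Hypothetical query: three relevant passages, one of them ranked in the top 5.
ranking = ["d1", "d7", "d3", "d9", "d4", "d2"]
score = recall_at_k(ranking, {"d1", "d2", "d8"}, k=5)
```

A benchmark-level number would normally be the mean of this value over all queries.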
[5] DLLM-JEPA: Joint Embedding Predictive Architectures for Masked Diffusion Language Models cs.CL | cs.AI
Sangdae Nam
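TL;DR: This paper proposes DLLM-JEPA, a self-supervised method that pairs Joint Embedding Predictive Architectures (JEPA) with masked-diffusion language models. The bidirectional attention of diffusion models yields two semantically distinct views of a single input, so no explicit multi-view pairs are needed, and each training step requires only one gradient-carrying forward pass, cutting training compute by 33% relative to the earlier LLM-JEPA.
Motivation: The goal is to remove two costs of the existing LLM-JEPA approach: the need for explicit multi-view data (such as text-code pairs) and the need for two gradient-carrying forward passes per training step.
Result: DLLM-JEPA improves over diffusion-only fine-tuning in every task and architecture combination evaluated, with gains of up to 18.7 percentage points on GSM8K for LLaDA-8B and 11.4 percentage points for Dream-7B, plus consistent positive gains on Spider, NL-RX-SYNTH, and Django. In addition to raising task accuracy, the method drives held-out Wikitext loss below that of the pre-trained base and preserves base capabilities.
Insight: The core idea is pairing JEPA with masked-diffusion language models and using different masking rates to produce multiple views naturally, which removes the need for explicit paired data and sharply reduces compute. Layer-wise probing reveals the mechanism: the fine-tuned backbone moves geometrically further from the pre-trained weights yet functionally forgets less. This dissociation is concentrated in the middle transformer layers and appears on more than one backbone, which suggests it is not specific to a single model.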
Abstract: Joint Embedding Predictive Architectures (JEPAs) have reshaped self-supervised representation learning in vision. The recent LLM-JEPA ported JEPA to autoregressive language models but inherited two steep costs from the causal-attention substrate: it demands explicit multi-view data (e.g., text-code pairs), and it requires two gradient-carrying forward passes per step. We introduce DLLM-JEPA, which pairs JEPA with masked-diffusion language models to eliminate both costs at once. The bidirectional attention of diffusion models yields two semantically distinct views of the same input via different masking rates – no explicit pairs needed – and supports a single gradient-carrying forward pass, cutting training FLOPs by 33% relative to LLM-JEPA. DLLM-JEPA improves over diffusion-only fine-tuning in every (task, architecture) combination we evaluate: up to +18.7 pp on LLaDA-8B GSM8K and +11.4 pp on Dream-7B GSM8K, with consistent positive gains on Spider, NL-RX-SYNTH, and Django. Beyond accuracy, DLLM-JEPA exhibits a dual-win property: on LLaDA-8B with the Wide-t configuration, it simultaneously raises GSM8K accuracy (67.1 vs. 65.2, +1.8 pp), drives held-out Wikitext loss below the pre-trained base, and preserves MMLU accuracy at base level across three fine-tuning seeds – whereas an L2-to-base parameter anchor matches baseline accuracy with no task gain. Layer-wise probing reveals the mechanism: a geometric-functional drift dissociation in which the fine-tuned backbone moves further from the pre-trained weights than the baseline yet forgets less on held-out Wikitext, with the amplification concentrated in middle transformer layers. The pattern appears on Dream-7B as well, indicating the phenomenon is not specific to a single backbone.
[6] RealityTest: How People Probe AI Identity and Whether Models Disclose It cs.CL
Anna Gausen, Sarenne Wallbridge, Bessie O’Dell, Christopher Summerfield, Hannah Rose Kirk
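TL;DR: The paper presents RealityTest, a benchmark for comprehensively testing whether AI systems disclose their identity when asked. It is built on 3,152 identity-probing queries collected from about 750 participants across 49 countries, in text and speech scenarios, and is the first large-scale multimodal, multilingual evaluation of this kind. Only 31% of people ask about identity directly in ambiguous scenarios, and human questions are far more diverse than machine-generated ones. Tests on 17 text and 6 speech models show substantial variation in disclosure behaviour, and a single suppression instruction reduces disclosure rates to below 30%.
Motivation: Insufficient identity disclosure by AI systems in conversational settings is a safety risk. Existing evaluations are typically English-only, based on machine-generated questions, and restricted to text, so they lack comprehensive coverage grounded in real human interaction data.
Result: Across the 17 text and 6 speech models tested on RealityTest, disclosure behaviour varies substantially, and under a single suppression instruction even the best-performing models disclose less than 30% of the time. How the question is phrased and the conversational context matter more for disclosure than which model is tested.
Insight: The contribution is the first large-scale multimodal, multilingual identity-disclosure benchmark grounded in real human data. The key insight is that safety evaluations built on narrow or synthetic query sets risk mischaracterising how models behave in realistic deployment, which underlines the value of diverse, human-grounded evaluation data.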
Abstract: AI systems are increasingly deployed in conversational settings where users may be uncertain whether they are speaking with a human or an AI. Despite mounting regulatory attention to this known safety risk, existing evaluations of AI disclosure are typically English-only, based on machine-generated questions, and restricted to text. We present RealityTest to comprehensively test whether AI systems disclose their identity when asked. The benchmark is the first large-scale multimodal and multilingual evaluation, grounded in human data on how people actually encounter and question AI identity in the real world. Alongside the benchmark, we release the underlying dataset of 3,152 identity-probing queries collected from ~750 participants across 49 countries and five languages, in text and speech scenarios. We find that only 31% of people ask about identity directly in ambiguous scenarios, and that the questions people ask are far more diverse than machine-generated queries. We test 17 text and 6 speech models, and find substantial variation in disclosure behaviour. However, a single suppression instruction reduces disclosure rates to below 30%, even in the best-performing models. Validating our investment in diverse, human-grounded evaluation data, we find that how the question is phrased and the context of the conversation matter more for disclosure than which model is being tested. Safety evaluations built on narrow or synthetic query sets risk mischaracterising how models behave in realistic deployment settings.
[7] Parameter Alignment Mitigates Catastrophic Forgetting in Multilingual Expert Language Models cs.CL
Sanchit Ahuja, Terra Blevins
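TL;DR: This paper studies catastrophic forgetting during continual pretraining of multilingual expert language models and proposes a suite of layer-aware parameter alignment strategies to mitigate it. Comparing five alignment strategies against two unregularized baselines on 32 training languages plus held-out languages, it finds that parameter alignment substantially reduces forgetting at little cost to language acquisition, with different strategies performing best on different tasks.
Motivation: Continual pretraining is a practical way to extend large language models to new languages, but naive fine-tuning causes catastrophic forgetting and erodes existing capabilities. Organizing training by language family does not by itself prevent forgetting of the general knowledge needed for downstream tasks, so parameter drift has to be addressed.
Result: The benchmarks span 32 training languages from five language families, plus held-out languages, and four evaluation axes: perplexity, reading comprehension, physical reasoning, and translation. Parameter alignment substantially reduces forgetting. Layer freezing and regularization best preserve reading comprehension, whereas post-hoc weight reversion yields the largest translation gains.
Insight: The contribution is attributing forgetting to parameter drift in multilingual continual pretraining and proposing layer-aware parameter alignment strategies (such as hard layer freezing, soft regularization, post-hoc weight reversion, and model merging). The results map the acquisition-forgetting frontier for family-expert models and give practical deployment guidance on which strategy to pair with which task.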
Abstract: While continual pretraining (CPT) is a practical way to extend large language models to new languages, naïve finetuning on targeted data erodes existing capabilities through catastrophic forgetting. Organizing training around language families reduces cross-language interference but cannot alone prevent forgetting of the general knowledge needed for downstream tasks. We link this forgetting to parameter drift in multilingual CPT and present a suite of five layer-aware parameter alignment strategies: hard layer freezing, soft regularization, post-hoc weight reversion, and model merging. We systematically compare our alignment strategies against two unregularized CPT baselines on benchmarks spanning 32 training languages from five language families, plus held-out languages, across four evaluation axes: perplexity, reading comprehension, physical reasoning, and translation. Parameter alignment substantially reduces forgetting at minimal cost to language acquisition: layer freezing and regularization best preserve comprehension, whereas post-hoc reversion yields the strongest translation gains. Together, these results map the acquisition–forgetting frontier for family-expert CPT and offer practical deployment guidelines pairing each strategy to the tasks it best serves.
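Two of the strategy names above, soft regularization and post-hoc weight reversion, can be illustrated with a toy sketch on plain Python lists. The abstract gives no formulas, so the squared-distance penalty, the whole-layer reset, the function names, and the layer names below are assumptions chosen to show the general idea, not the authors' implementation.

```python
from typing import Dict, List, Set

Weights = Dict[str, List[float]]  # hypothetical layer name -> flat weights


def l2_to_base_penalty(params: Weights, base: Weights, strength: float) -> float:
    """Soft regularization (assumed form): penalize squared drift from base."""
    drift = sum((p - b) ** 2
                for name in params
                for p, b in zip(params[name], base[name]))
    return strength * drift


def revert_layers(params: Weights, base: Weights, layers: Set[str]) -> Weights:
    """Post-hoc reversion (assumed form): reset the chosen layers to base."""
    return {name: list(base[name]) if name in layers else list(values)
            for name, values in params.items()}


base = {"layer0": [0.0, 2.0], "layer1": [1.0]}
tuned = {"layer0": [1.0, 2.0], "layer1": [0.0]}

penalty = l2_to_base_penalty(tuned, base, strength=0.5)
restored = revert_layers(tuned, base, {"layer1"})
```

The penalty would be added to the training loss during continual pretraining, while reversion is applied once after training, which is why the two can trade off acquisition and forgetting differently.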
[8] Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance cs.CL | cs.AI
Yuxuan Jiang, Francis Ferraro
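TL;DR: This paper proposes Trajectory-aware On-Policy Distillation (TOPD) to address the "trajectory-sampled but token-learned" problem in conventional On-Policy Distillation (OPD). TOPD uses near-future trajectory information to identify real reasoning forks and spreads the guidance signal across multiple future tokens, steering the student's reasoning trajectory toward the teacher's more effectively.
Motivation: Conventional OPD samples trajectories, but its learning signal stays at the token level: it detects deviations through high-loss tokens and repairs them with local reverse-KL correction. This cannot reliably bridge the gap between student and teacher trajectories, because many high-loss tokens are surface-form mismatches rather than real reasoning forks, and isolated token-level supervision struggles to repair reasoning failures that unfold as short-horizon distributional drift.
Result: Suppressing non-divergent high-loss tokens raises the average accuracy of standard OPD from 47.8% to 48.2%, and TOPD raises it further to 52.2%. Accuracy improves from 60.0% to 63.3% on AIME24 and from 46.7% to 53.3% on AIME25.
Insight: The contribution is bringing trajectory-level future information into distillation: near-future guidance separates surface mismatches from real reasoning forks, and supervision is distributed over several future tokens. This gives trajectory-based distillation a finer-grained alignment mechanism and mitigates short-horizon distributional drift.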
Abstract: On-Policy Distillation (OPD) improves large language model reasoning by training a student model on trajectories sampled from its own policy under teacher supervision. Although OPD operates on trajectories, its learning signal remains token-level: it identifies deviations through high-loss tokens and repairs them through local reverse-KL correction. We show that this “trajectory-sampled but token-learned” mechanism cannot reliably bridge student trajectories toward teacher trajectories. About 30% of high-loss tokens fall into the low-divergence regime, indicating that many are surface-form mismatches rather than real reasoning forks. Moreover, even truly divergent tokens are difficult to repair with isolated token-level supervision, since reasoning failures often unfold as short-horizon distributional drift. We propose Trajectory-aware OPD (TOPD), which uses near-future trajectory information to identify real divergent states and distribute guidance across multiple future tokens. Experiments show that suppressing non-divergent high-loss tokens improves standard OPD from 47.8% to 48.2% average accuracy, while TOPD further improves performance to 52.2%, with gains on AIME24 from 60.0% to 63.3% and AIME25 from 46.7% to 53.3%.
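The "local reverse-KL correction" mentioned above refers to the divergence taken from the student's distribution to the teacher's at a single token position. A minimal sketch of that per-token quantity follows, using the standard definition; the function name and the toy distributions are illustrative, and nothing here reproduces TOPD's near-future guidance.

```python
import math
from typing import Sequence


def reverse_kl(student: Sequence[float],
               teacher: Sequence[float],
               eps: float = 1e-12) -> float:
    """KL(student || teacher) over one next-token distribution."""
    if len(student) != len(teacher):
        raise ValueError("distributions must have the same length")
    return sum(ps * (math.log(ps + eps) - math.log(pt + eps))
               for ps, pt in zip(student, teacher)
               if ps > 0.0)


# Toy three-token vocabulary: the student is overconfident in token 0.
student_probs = [0.8, 0.1, 0.1]
teacher_probs = [0.4, 0.4, 0.2]
divergence = reverse_kl(student_probs, teacher_probs)
```

Because the expectation is taken under the student, this term only sees one position at a time, which is the token-level limitation the abstract argues against.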
[9] Effects of Varying LLM Access on Essay Writing Behavior cs.CL | cs.AI | cs.HC
Julia Christenson, Karin de Langis, Shirley Anugrah Hayati, Dongyeop Kang
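TL;DR: This pilot study examines how different levels of large language model (LLM) assistance affect college students' writing behavior. Twenty-four students were randomly assigned to no LLM access, limited access (<=3 prompts, responses capped at 100 words), or unlimited access. Overall essay quality was statistically indistinguishable across groups, but writing behavior and perceived authorship differed clearly.
Motivation: The study investigates how far LLMs affect university teaching, aiming to find integration strategies that keep the scaffolding benefits of AI assistance without harming student learning outcomes, with a focus on writing performance, engagement, and perceived authorship.
Result: Students in the limited-access group reported higher ownership (62.5% would submit the essay as independent work, vs. 25% in the unlimited group), stronger organizational gains, and more strategic, revision-focused prompting. The unlimited group spent more time writing, produced essays more similar to LLM output, and reported reduced creative expression. Overall essay quality showed no statistically significant difference across groups.
Insight: The contribution is an empirical comparison of controlled LLM access levels (none, limited, unlimited) and their different effects on writing behavior and perceived authorship. The central finding is that constraining, rather than banning, LLM access may preserve students' authorship confidence and strategic learning behavior while keeping the benefits of AI assistance, which points to a concrete, actionable strategy for integrating LLMs in education.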
Abstract: Investigating the degree to which large language models (LLMs) affect teaching and learning in universities can help identify strategies for integrating LLMs in a way that supports, rather than undermines, student learning outcomes. This study examined how varying levels of LLM assistance affect writing performance, engagement, and perceived authorship. We report a pilot study in which 24 college students were randomly assigned to write a short essay with no LLM access, limited access (<=3 prompts, responses capped at 100 words), or unlimited access. Overall essay quality was statistically indistinguishable across groups. Yet writing behavior and perceived authorship diverged sharply: students with limited access reported higher ownership (62.5% would submit the essay as independent work, vs. 25% in the unlimited group), stronger organizational gains, and more strategic, revision-focused prompting. The unlimited group spent more time writing, produced essays more similar to LLM output, and reported reduced creative expression. Our findings suggest that constraining, rather than banning, LLM access may preserve authorship confidence while retaining the scaffolding benefits of AI assistance.
[10] Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning cs.CL | cs.AI
Xiaoyang Ming, Jose Hernandez, Thomas Stephan Juzek
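TL;DR: This paper proposes the Triangulated Preference Shift score, an automated metric that quantifies the lexical bias introduced into large language models during preference learning (such as reinforcement learning from human feedback) without manual curation. It triangulates between human gold standards, base models, and instruct variants to isolate lexical shifts caused specifically by preference learning. The authors provide data across six model families, anchor the results in the literature, and analyze whether preference learning pushes models toward what could be read as a "language of prestige".
Motivation: Preference learning (such as RLHF) makes large language models more useful but may introduce systematic lexical bias, such as a preference for certain formats or overuse of particular words ("delve", "furthermore"). Existing research on this lexical misalignment relies on manual curation, which limits it.
Result: The authors apply the Triangulated Preference Shift score to six model families and anchor the results in findings from the literature. The metric gives an automated way to quantify behavioral shifts attributable to preference tuning, offering an initial tool for model alignment and trustworthy AI.
Insight: The contribution is an automated, curation-free metric that uses triangulation to isolate lexical bias arising in the preference-learning stage. It offers a new tool for objectively measuring and mitigating lexical misalignment and for testing whether models drift toward a "language of prestige", in support of more trustworthy alignment research.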
Abstract: Various language domains have undergone remarkable changes in recent years; these shifts are largely attributed to the advent of Large Language Models and their misalignment with natural language usage. These misalignments are thought to partly originate in the preference-learning stage, e.g. Reinforcement Learning from Human Feedback, which generally makes models more useful but simultaneously may introduce systematic lexical bias. In terms of lexical behavior, this is visible in a model’s preference for certain formats or the overuse of words (delve, furthermore), even when such patterns are not present in base model outputs. Research on lexical misalignment induced during preference training is constrained by reliance on manual curation. We address this by introducing the Triangulated Preference Shift score, a metric that triangulates between human gold standards, base models, and instruct variants to isolate shifts induced specifically by preference learning, without manual curation. We provide data across six model families, anchor the results in the literature, and illustrate the general approach’s utility by analyzing whether preference learning shifts models toward what could be interpreted as a “language of prestige”. The metric provides an initial automated method to quantify behavioral shifts attributable to preference tuning, and thus may help inform model alignment and development of trustworthy AI.
[11] Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs cs.CL | cs.CV
Xin Gao, Cheng Yang, Chufan Shi, Taylor Berg-Kirkpatrick
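TL;DR: This paper studies cross-modal knowledge editing in unified multimodal models (UMMs). It introduces UniKE, the first benchmark for this setting, reveals a large modality gap between text edits and image generation, and proposes Reasoning-augmented Parameter Editing to improve editing results.
Motivation: As unified multimodal models are deployed in real-world applications, updating their internal knowledge effectively becomes critical, yet it is unclear whether edits that succeed on text also transfer to image generation.
Result: On UniKE, text-side efficacy can reach about 92%, whereas the best overall VQA accuracy under direct image generation is only 18.5%. The proposed Reasoning-augmented Parameter Editing raises overall VQA accuracy by up to 18.6 percentage points.
Insight: Textual knowledge edits do not guarantee reliable cross-modal transfer; the gap is associated with only partial alignment between the edited textual representations and the conditioning pathways for visual generation. The contributions are the first cross-modal knowledge editing benchmark and a method that improves image generation by explicitly activating the edited knowledge before generation.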
Abstract: Unified multimodal models (UMMs) have emerged as a promising paradigm for general-purpose multimodal intelligence. As they are deployed in real-world applications, effectively updating internal knowledge becomes critical. While knowledge editing has matured for text-only models, it remains unclear whether edits that successfully modify textual outputs also transfer to image generation in UMMs. To study this question, we introduce UniKE, the first benchmark for cross-modality knowledge editing in UMMs, comprising 2,971 edit subjects spanning attribute and relation edits. Using VQA-based visual verification, we reveal a striking modality gap: text-side efficacy can reach approximately 92%, whereas the best overall VQA accuracy under direct image generation is only 18.5%. We further propose Reasoning-augmented Parameter Editing, which explicitly activates edited knowledge before generation and improves overall VQA accuracy for all evaluated model-editor pairs, with gains up to 18.6 percentage points. Mechanistic analysis shows that this gap is associated with partial alignment between edited textual representations and the conditioning pathways for visual generation, where edits sufficient for text outputs may remain too weak or misaligned to steer image synthesis. These findings show that textual knowledge edits do not guarantee reliable cross-modality transfer and motivate modality-aware editing methods. Our code and data are available at https://github.com/gxx27/UniKE.
[12] Short-form Text Rewriting with Phi Silica cs.CL | cs.AI | cs.LG
Divya Tadimeti, Shawn Pan, Sameera Lanka, Chenghui Zhou, Sadid Hasan
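TL;DR: This paper studies how to adapt the small language model Phi Silica to short-form text rewriting through dataset curation, prompt distillation, and parameter-efficient fine-tuning. Fine-tuning improves semantic fidelity, reduces hallucinations, and raises the preference win rate against GPT-5-chat rewrites.
Motivation: Short-form text rewriting is a constrained variant of paraphrasing in which limited context and high semantic density leave little room for variation. Large language models handle general paraphrasing well, but small language models often struggle with semantic fidelity and hallucination robustness in short-form settings.
Result: The fine-tuned Phi Silica improves on semantic fidelity and hallucination reduction, and achieves a higher preference win rate against GPT-5-chat rewrites in LLM-as-a-judge evaluation.
Insight: Targeted adaptation to a specific task such as short-form rewriting can substantially narrow the gap between small language models and cloud models, which gives practical guidance for applying small models to precision-critical rewrite tasks.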
Abstract: Short-form text rewriting is a constrained variant of paraphrasing in which limited context and high semantic density leave little room for variation. While large language models perform well on general paraphrasing, small language models (SLMs) often struggle with semantic fidelity and hallucination robustness in short-form settings. In this work, we present an empirical study of adapting an SLM, Phi Silica, for short-form rewrite through dataset curation, prompt distillation, parameter-efficient fine-tuning, and evaluation. We curate a dataset of short presentation-style text from public slide decks and use GPT-5-chat both to generate rewrite supervision and to conduct LLM-as-a-judge evaluation. Our results show that finetuning improves semantic fidelity, reduces hallucinations, and increases preference win rate against GPT-5-chat rewrites. The findings suggest that targeted adaptation for SLMs can substantially narrow the gap to cloud models and provide practical guidance for adapting SLMs to precision-critical rewrite tasks.
[13] LaSR: Context-Aware Speech Recognition via Latent Reasoning cs.CL
Heyang Liu, Ziyang Cheng, Jiayi Huang, Wenyang Xiao, Ronghua Wu
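TL;DR: This paper proposes LaSR (Latent Speech Reasoning), a training paradigm that strengthens the contextual awareness of Speech LLMs through a latent reasoning process. It aligns chain-of-thought supervision with the acoustic feature region of the target word and introduces latent reasoning periods to ground context information, which markedly improves recognition of specialized terminology without adding latency.
Motivation: Current Speech LLMs have limited contextual awareness and struggle to reflect speaker intent and topical context, especially when recognizing specialized terminology.
Result: Preliminary experiments on Fun-Audio-Chat show that LaSR significantly improves terminology recognition and consistently outperforms standard supervised fine-tuning baselines, with no additional latency.
Insight: The contribution is a context-aware training paradigm built on latent reasoning, which models the reasoning trajectory implicitly by aligning chain-of-thought supervision with acoustic features. To benchmark contextual recognition of specialized vocabulary, the authors also build Spoken Darwin-Science, a large-scale corpus of academic terminology.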
Abstract: Recent advances in Speech Large Language Models (Speech LLMs) have significantly enhanced spoken language understanding and reasoning. However, their contextual awareness is limited, and they struggle to perform speech recognition that effectively reflects the speaker’s intent and topical context. In this paper, we propose LaSR (Latent Speech Reasoning), a novel training paradigm featuring a context-aware reasoning trajectory that leverages the latent reasoning process. Instead of generating explicit intermediate tokens, LaSR aligns chain-of-thought (CoT) supervision around the acoustic feature region of the targeted word, and introduces latent reasoning periods for context information grounding and transcriptional transition. Furthermore, to effectively benchmark contextual recognition on specialized vocabulary, we propose Spoken Darwin-Science, a large-scale corpus focusing on academic terminologies. Preliminary experiments on Fun-Audio-Chat demonstrate that LaSR significantly improves terminology recognition without introducing additional latency and consistently outperforms standard supervised fine-tuning baselines. Our findings highlight the potential of latent reasoning in building efficient, context-aware speech assistants.
[14] ProtStructQA: A Denotation Threshold in Protein Structural Reasoning cs.CL
Aravind Mandiga, Guoming Li, Jin Lu, Ismailcem Budak Arpinar, Khaled Rasheed
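TL;DR: This paper introduces ProtStructQA, an executable benchmark for protein structural question answering. Each natural-language question is generated from a hidden typed domain-specific language (DSL) program, and the answer is obtained by executing that program on an AlphaFold-predicted structure. The study evaluates Qwen3 and Gemma-3 models of different sizes under several reasoning strategies and finds a "denotation threshold" in model capability above which chain-of-thought reasoning becomes highly effective.
Motivation: Protein-language systems are usually evaluated on whether the text they generate is plausible, but a structural question has sharper semantics: it denotes a concrete measurement in a 3D coordinate system. A new benchmark is therefore needed to test whether language models can map natural-language questions to executable 3D structural measurements.
Result: On ProtStructQA (about 382K questions), the authors evaluate Qwen3 models from 0.6B to 8B parameters and Gemma-3 models at 1B and 12B. They find a capability-dependent denotation threshold between Qwen3-1.7B and Qwen3-4B: below it, tool-mediated ReAct dominates; above it, chain-of-thought flips from harmful to beneficial and becomes the strongest strategy on most splits.
Insight: The contribution is reframing scientific QA as compilation from language to measurement and providing a diagnostic testbed for it. The core insight is the denotation threshold in protein structural reasoning, which marks the transition from unparseable language to executable structural denotations and offers a new view of where language models' capabilities end in structured scientific reasoning.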
Abstract: Protein-language systems are often evaluated by whether they generate plausible biological text, but a structural question has a sharper semantics: it denotes a measurement in a 3D coordinate system. We introduce ProtStructQA, an executable benchmark for protein structural question answering in which each natural-language question is generated from a hidden typed domain-specific language (DSL) program and the answer is obtained by executing that program on an AlphaFold-predicted structure. ProtStructQA releases 382.2K questions covering confidence, distances, predicted aligned error (PAE), solvent exposure, secondary structure, topology and contacts, and held-out compositions: a 330K active benchmark over 10K proteins from four species, plus a 52.2K hard-negative robustness pool. Without fine-tuning, we evaluate Qwen3 models from 0.6B to 8B under direct prompting, chain-of-thought, grammar-constrained executable voting, executable voting with chain-of-thought, and multi-turn ReAct-style tool use, and replicate the headline finding on Gemma-3-1B and Gemma-3-12B. We find a capability-dependent denotation threshold between Qwen3-1.7B and Qwen3-4B: below it, tool-mediated ReAct dominates because models often fail to produce executable denotations; above it, chain-of-thought flips from mostly harmful to strongly beneficial and becomes the strongest strategy on most splits. Parse-failure and family-level analyses show that the threshold is a transition from unparseable language to executable structural denotation, while grammar and execution remain selectively valuable for PAE and secondary-structure queries. ProtStructQA reframes scientific QA as compilation from language to measurement and provides a diagnostic testbed for when language models can map words to executable 3D structural measurements.
[15] Learning to Retrieve: Dual-Level Long-Term Memory for Text-to-SQL Agents cs.CL
Yibo Wang, Nikki Lijing Kuang, Philip S. Yu, Zhewei Yao, Yuxiong He
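TL;DR: This paper proposes MERIT, a dynamic multi-horizon memory retrieval framework for interactive text-to-SQL agents. It maintains episode-level memory for global strategic guidance and turn-level memory for local decision support, with retrieval policies optimized by reinforcement learning. On the BIRD-Interact benchmark it raises success rate and reduces average interaction turns, and it shows positive cross-benchmark transfer on Spider2-Snow.
Motivation: Existing memory retrieval methods for interactive text-to-SQL agents are limited. Static methods rely on fixed similarity heuristics that do not optimize downstream utility; dynamic methods usually learn from sparse final outcomes and retrieve memories at a single decision horizon. Neither handles the fact that memory usefulness changes across interaction stages, such as initial planning versus local execution.
Result: On BIRD-Interact, MERIT outperforms no-memory, static-retrieval, and dynamic-retrieval baselines in success rate while reducing average interaction turns. Transfer experiments on Spider2-Snow further show positive cross-benchmark transfer without benchmark-specific tuning.
Insight: The main contribution is a dynamic multi-horizon memory retrieval framework that splits memory into episode-level and turn-level layers, which provide global and local guidance respectively. A key implementation choice is a lightweight Process Reward Model that supplies dense proxy rewards for turn-level memory selection, which compensates for sparse intermediate supervision when training the reinforcement-learning retrieval policy.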
Abstract: Interactive text-to-SQL agents solve database tasks through multi-turn interactions involving schema exploration, query execution, feedback interpretation, and decision revision. Long-term memory helps agents reuse past experiences, but existing retrieval methods remain limited. Static methods rely on fixed similarity heuristics that do not optimize downstream utility, while dynamic methods often learn from sparse final outcomes and retrieve memories at a single decision horizon. This is insufficient when memory usefulness changes across interaction stages, since memories useful for initial planning may differ from those needed for local, state-conditioned execution. We propose MERIT, a dynamic multi-horizon memory retrieval framework. MERIT maintains episode-level memory for global strategic guidance and turn-level memory for local decision support. Both levels use learned retrieval policies optimized with reinforcement learning. To train turn-level retrieval despite limited intermediate supervision, MERIT uses a lightweight Process Reward Model to provide dense proxy rewards for local memory selection. Experiments on BIRD-Interact show that MERIT outperforms no-memory, static-retrieval, and dynamic-retrieval baselines in success rate while reducing average interaction turns. Transfer results on Spider2-Snow further show positive cross-benchmark transfer without benchmark-specific tuning. These results suggest that multi-horizon retrieval improves experience reuse in interactive text-to-SQL agents.
[16] ProactiveLLM: Learning Active Interaction for Streaming Large Language Models cs.CL
Junlong Tong, Yao Zhang, Anhao Zhao, Yingqi Fan, Yunpu Ma
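TL;DR: This paper proposes ProactiveLLM, a method for learning active interaction in streaming large language models (Streaming LLMs). It uses the model's endogenous states, such as its perception of semantic sufficiency, to decide when to interact, which avoids the hard-coded rules and costly external alignment signals (timing labels, reasoning trajectories, or stronger teacher models) that earlier methods depend on.
Motivation: Standard LLMs follow a read-then-generate paradigm that causes unnecessary latency and computation. Streaming LLMs generate while receiving input but still struggle to decide dynamically when to interact with the stream; existing methods either hard-code the interaction timing or require expensive external alignment signals.
Result: Extensive evaluation on text and speech streaming tasks shows that ProactiveLLM significantly reduces interaction latency while maintaining generation quality, which validates its capacity for dynamic, active interaction.
Insight: The contribution is two complementary training mechanisms for learning to perceive semantic sufficiency from partial input: mask-based streaming modeling and synchronized privileged self-distillation (SPSD). The former applies monotonic random masking to simulate progressively revealed streaming input and learn local semantic dependencies. The latter aligns the partial-context student view with a full-context teacher view generated by the same evolving model, so that privileged full-context evidence guides the student's understanding under incomplete observation. Together they induce endogenous sufficiency cues without external teachers or annotations and provide a general foundation for plug-and-play integration of diverse decision heads.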
Abstract: Standard Large Language Models (LLMs) follow a read-then-generate paradigm, causing unnecessary latency and computation. Streaming LLMs alleviate this issue by generating while receiving inputs, but still struggle to decide when to interact with the stream. Existing methods either hard-code interaction timing or rely on costly external alignment signals, such as timing labels, reasoning trajectories, or stronger teachers. In this paper, we propose ProactiveLLM, which achieves active interaction by leveraging the model’s endogenous states to guide interaction decisions. The model first learns to perceive semantic sufficiency from partial inputs through two complementary training mechanisms: mask-based streaming modeling and synchronized privileged self-distillation (SPSD). The former applies monotonic random masking to the input during training, simulating progressively revealed streaming inputs and enabling the model to learn local semantic dependencies from partial-input views. The latter aligns the partial-context student view with a full-context teacher view generated by the same evolving model, allowing privileged full-context evidence to guide the student’s understanding under incomplete observations. Together, these mechanisms induce endogenous sufficiency cues without requiring external teachers or annotations, providing a versatile foundation for the plug-and-play integration of diverse decision heads. Extensive evaluation across text and speech streaming tasks confirms that ProactiveLLM significantly reduces interaction latency while maintaining quality, validating its capacity for dynamic and active interaction. Code is publicly available at https://github.com/EIT-NLP/StreamingLLM/tree/main/ProactiveLLM.
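The mask-based streaming modeling described above can be pictured as building several training views of one input in which progressively more of the sequence is visible. The abstract does not define "monotonic random masking" precisely, so the sketch below assumes one plausible reading: sample random cut points, keep the prefix, and mask everything after it. The `MASK` token and the function name are hypothetical.

```python
import random
from typing import List

MASK = "<mask>"  # hypothetical placeholder, not a token from the paper


def progressive_views(tokens: List[str],
                      n_views: int,
                      rng: random.Random) -> List[List[str]]:
    """Return n_views copies of tokens, each revealing a longer prefix.

    Assumption: 'monotonic' means the visible region is always a prefix,
    so later views never hide a token that an earlier view revealed.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    # randint is inclusive on both ends, so at least one token is visible.
    cuts = sorted(rng.randint(1, len(tokens)) for _ in range(n_views))
    return [tokens[:cut] + [MASK] * (len(tokens) - cut) for cut in cuts]


sentence = ["please", "book", "a", "table", "for", "two"]
views = progressive_views(sentence, n_views=3, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampled cut points reproducible across runs.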
[17] Revisiting Parameter-Based Knowledge Editing in Large Language Models: Theoretical Limits and Empirical Evidence cs.CL | cs.AIPDF
Wanying Ren, Xin Song, Futing Wang, Guoxiu He, Aixin Sun
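TL;DR: This paper revisits parameter-based knowledge editing in large language models and exposes its limitations through theoretical analysis and empirical evaluation. It finds that localized parameter modifications propagate along fragile directions in the representation space, causing global interference and reasoning collapse, while a simple retrieval-based baseline outperforms parameter-editing methods under all evaluated conditions.
Details
Motivation: Existing parameter-editing methods overlook theoretical limitations and lack practice-oriented evaluation. The paper addresses both gaps and examines how knowledge editing affects core LLM capabilities.
Result: Empirical evaluation shows that parameter-editing methods consistently damage core LLM capabilities, whereas the retrieval-based baseline performs better across all knowledge complexities, numbers of edits, and evaluation dimensions.
Insight: The paper proposes a theoretical framework based on the Dimensional Collapse Hypothesis to explain the limits of parameter editing, and stresses that future research should focus on preserving fundamental LLM capabilities after knowledge editing.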
Abstract: Parameter-based knowledge editing updates the internal knowledge of large language models (LLMs) via localized weight modifications and has attracted significant attention. However, most existing methods overlook fundamental theoretical limitations and are rarely evaluated under realistic, practice-oriented settings. In this paper, we first present a theoretical analysis based on the Dimensional Collapse Hypothesis, explaining how localized parameter edits can propagate along fragile directions in the representation space, inducing global interference and ultimately causing reasoning collapse. Building on this insight, we conduct a comprehensive empirical evaluation by systematically varying knowledge complexity, number of edits, evaluation dimensions, and baseline methods. Our results show that parameter-based editing methods consistently damage core LLM capabilities. In contrast, a simple retrieval-based baseline achieves consistently stronger performance than all parameter-editing methods across all evaluated conditions. These findings highlight that preserving the fundamental capabilities of LLMs after knowledge editing should be a central concern for future research.
[18] SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering cs.CL | cs.AIPDF
Qiming Shi, Zhaolu Kang, Yunfan Zhou, Di Weng, Yingcai Wu
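TL;DR: This paper proposes SPADER, a reinforcement learning framework that addresses the challenges of long-horizon tool use in multi-answer question answering. It comprises a critic-free step-level credit assignment mechanism and a diversity-aware exploration reward that improves the discovery of long-tail entities.
Details
Motivation: Existing tool-augmented LLMs mainly target tasks with a single correct answer, whereas real-world queries often require discovering a comprehensive set of valid answers. This raises two challenges: fine-grained credit assignment over long search trajectories, and reward alignment for sustained exploration beyond high-frequency entities.
Result: On the QAMPARI, Mintaka, WebQSP, and QUEST benchmarks, SPADER generally outperforms prompting-based agents, outcome-supervised RL methods, and recent step-level supervision approaches in recall and overall F1.
Insight: The novelty lies in the Step-wise Peer Advantage mechanism, which provides critic-free credit assignment, and in a diversity-aware exploration reward that upweights rare findings and downweights redundant ones to encourage long-tail entity exploration, offering a new direction for RL strategies in multi-answer QA.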
Abstract: Large language models are increasingly deployed as tool-augmented agents to acquire information beyond parametric knowledge. While recent work has improved long-horizon tool-use reasoning, most approaches focus on tasks with a single correct answer. In contrast, many real-world queries require discovering a comprehensive set of valid answers, a setting known as Multi-Answer QA. This setting raises two challenges: fine-grained credit assignment over long search trajectories and reward alignment for sustained exploration beyond easy high-frequency entities. We propose SPADER, a reinforcement learning framework for long-horizon tool use in Multi-Answer QA. SPADER includes Step-wise Peer Advantage (SPA), a critic-free step-level credit assignment mechanism that aligns parallel trajectories by decision step and estimates advantages from peer returns. It also includes a diversity-aware exploration reward that promotes long-tail entity discovery by upweighting rare findings and downweighting redundant ones. Experiments on QAMPARI, Mintaka, WebQSP, and QUEST show that SPADER generally improves recall and overall F1 over prompting-based agents, outcome-supervised RL methods, and recent step-level supervision approaches. Our code and model weights are available at https://github.com/KhanCold/spader.
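The idea of estimating step-level advantages from peer returns, without a learned critic, can be sketched as follows. This is an illustrative reading rather than SPADER's actual estimator: it assumes a leave-one-out mean over the other trajectories that reached the same decision step, and the function name `stepwise_peer_advantage` is hypothetical.

```python
def stepwise_peer_advantage(returns):
    """Critic-free advantages from peer returns, aligned by decision step.

    returns[i][t] is the return-to-go of trajectory i at decision step t.
    Assumption: the baseline at step t is the mean return of the *other*
    trajectories that reached step t; with no peers the advantage is 0.
    """
    advantages = []
    for i, traj in enumerate(returns):
        adv = []
        for t, g in enumerate(traj):
            peers = [other[t] for j, other in enumerate(returns)
                     if j != i and t < len(other)]
            baseline = sum(peers) / len(peers) if peers else g
            adv.append(g - baseline)
        advantages.append(adv)
    return advantages


if __name__ == "__main__":
    # Three parallel rollouts for the same query; the third stops early.
    print(stepwise_peer_advantage([[1.0, 0.5], [0.0, 0.5], [0.5]]))
```

Trajectories may have different lengths, so each step is compared only against the peers that actually reached it.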
[19] Toward Responsible and Epistemically Grounded Multilingual LLMs for Computational Social Science and Humanities cs.CLPDF
Wajdi Zaghouani
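TL;DR: Targeting the use of multilingual LLMs in computational social science and humanities research, this paper proposes a theoretical framework grounded in hermeneutics and philosophy of technology for evaluating reasoning validity, cultural alignment, and interpretive stability across linguistic and cultural contexts, and demonstrates it with a multilingual political discourse analysis case.
Details
Motivation: Existing evaluation paradigms rely mainly on task-based NLP benchmarks and do not address the interpretive validity, cultural situatedness, and epistemic mediation of multilingual LLMs in social science and humanities research. Multilingual reasoning LLMs therefore need to be reconceptualized as interpretive instruments for producing meaning across languages and cultures.
Result: The paper presents a theoretical framework and an experimental protocol with operationalized metrics for cultural alignment, cross-lingual stability, and reasoning faithfulness, together with transparency requirements for interpretive research tasks, and demonstrates the framework on a multilingual political discourse analysis scenario.
Insight: The novelty lies in recasting multilingual LLMs as hermeneutic instruments and combining hermeneutics, philosophy of technology, and STS theory into an evaluation framework for SSH research that emphasizes cultural situatedness and epistemic mediation. This gives a conceptual and methodological foundation for responsibly integrating multilingual reasoning LLMs into computational social science infrastructures.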
Abstract: Large language models have rapidly evolved in multilingual competence and reasoning capacity, enabling their integration into Social Sciences and Humanities research workflows. Yet existing evaluation paradigms remain anchored in task-based NLP benchmarks and fail to address interpretive validity, cultural situatedness, and epistemic mediation. This paper reconceptualizes multilingual reasoning LLMs as hermeneutic instruments that actively structure meaning production across linguistic and cultural contexts. Drawing on hermeneutics, philosophy of technology, science and technology studies, multilingual NLP research, and computational social science methodology, we develop a theoretically grounded framework for evaluating multilingual reasoning in Social Sciences and Humanities (SSH) research. We articulate a rigorous experimental protocol with operationalized metrics for cultural alignment, cross-lingual stability, and reasoning faithfulness, along with transparency requirements tailored to interpretive research tasks. We illustrate the framework through a concrete application scenario involving multilingual political discourse analysis. The paper contributes a conceptual and methodological foundation for responsible integration of multilingual reasoning LLMs into computational social science infrastructures.
[20] Sandboxed Coding Agents are Competitive Omni-modal Task Solvers cs.CL | cs.CVPDF
Dongping Chen, Xuanao Huang, Zhihan Hu, Qingyuan Shi, Dianqi Li
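TL;DR: This paper presents a sandboxed tool-use framework built on coding agents. With only text and image input, combined with code writing and tool calls, the agents handle audio-video and other multimodal tasks effectively, matching or exceeding native omnimodal models on multiple benchmarks. Trajectory analysis shows that the advantage comes from converting omnimodal tasks into retrieval and information-processing problems, and the paper introduces skill injection to improve performance further. It also proposes the Code-X training recipe and the TerminalBench-O benchmark to advance open-source models in multimodal processing.
Details
Motivation: Multimodal LLMs increasingly pursue native omnimodal capability to handle video and audio tasks. The paper questions whether this is necessary and explores whether coding agents with only text+image input can solve omnimodal tasks through a tool-use interface.
Result: On multiple audio-video benchmarks, the proposed sandboxed coding agents match or even surpass state-of-the-art (SOTA) native omnimodal models and predefined multimodal agent scaffolds. Skill injection, including human-written and self-distilled skills, substantially improves performance further.
Insight: The novelty lies in reframing omnimodal tasks as retrieval and information-processing problems rather than ingesting entire media streams, with coding agents writing programs and orchestrating tools (for example, extracting evidence from transcripts and frames). The proposed Code-X training recipe and TerminalBench-O benchmark also give open-source models new directions for evaluation and development in multimodal processing.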
Abstract: As multimodal LLMs increasingly target video and audio, it is often assumed that such tasks require native omnimodal models. We show that this is not always the case: coding agents with only text+image access and a sandboxed tool-use interface can match, and in several settings outperform, SOTA native omnimodal models and predefined multimodal agent scaffolds across multiple audio-video benchmarks. Our trajectory analysis suggests that their strength comes from writing code and orchestrating tools to extract relevant evidence from transcripts, frames, and other modality signals, thereby converting omnimodal tasks into retrieval and information-processing problems rather than ingesting entire media streams. We further characterize their limitations through a failure taxonomy and process-level trace analysis, and show that simple skill injection, including human-written and self-distilled skills, substantially improves performance. To explore open-source elicitation, we introduce Code-X, a training recipe with the OmniCoding trajectory dataset and verifiable reward, and provide baselines on Qwen-3.5-9B and Qwen-3.6-27B. Finally, we argue that the next frontier is many-modality processing, and introduce TerminalBench-O, a process-level benchmark for real-world omnimodal processing tasks. Code will be available at https://github.com/Dongping-Chen/OmniCoding.
[21] Robust Reasoning via Dynamic Token Selection for Distribution-Aligned Self-Distillation cs.CLPDF
Ruiqi Zhang, Lingxiang Wang, Hainan Zhang, Zhiming Zheng
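TL;DR: This paper proposes Distribution-Aligned Self-Distillation (DASD) to address the problem that self-distillation can harm a model's reasoning ability by imitating the style of reference answers. The method uses an answer-aware reference model to generate candidate tokens and filters them dynamically according to the base model's confidence, keeping tokens that carry useful logical knowledge and suppressing distributionally misaligned style noise.
Details
Motivation: Self-distillation improves learning efficiency by rewriting reference answers into training data that better matches the model's own distribution, but reference answers introduce strong stylistic biases that lead the generative model to imitate surface forms rather than learn useful reasoning patterns.
Result: Experiments on math, code, and commonsense reasoning benchmarks show that DASD consistently outperforms competitive baselines, reduces high-perplexity (PPL) tokens, and improves robustness across tasks of varying difficulty.
Insight: The core contribution is to distinguish the sources of high-perplexity tokens in rewritten data (beneficial logical corrections versus harmful stylistic drift) and to propose a dynamic token selection mechanism based on base-model confidence that achieves both knowledge enhancement and distribution alignment, offering a new direction for improving self-distillation.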
Abstract: Self-distillation improves learning efficiency by rewriting reference answers as training data that better matches the model’s own distribution. However, reference answers also introduce strong stylistic biases, causing the generative model to imitate surface forms rather than learn useful reasoning patterns. We observe that the rewriting data contains a large number of high-perplexity (PPL) tokens, coming from two distinct sources: beneficial knowledge-enhancing logical corrections, and harmful stylistic drift induced by reference imitation. Treating all such tokens equally can disrupt the base model’s original distribution and degrade performance, especially on difficult reasoning tasks. To address this, we propose Distribution-Aligned Self-Distillation (DASD), which uses an answer-aware reference model to generate candidate tokens and dynamically filters them according to the base model’s confidence. DASD preserves tokens that encode useful logical knowledge while suppressing distributionally misaligned style noise. Experiments on math, code, and commonsense reasoning benchmarks show that DASD consistently outperforms competitive baselines, reduces high-PPL tokens, and improves robustness across tasks of varying difficulty.
[22] OCC-RAG: Optimal Cognitive Core for Faithful Question Answering cs.CLPDF
Maksim Savkin, Mikhail Goncharov, Alexander Gambashidze, Alla Chepurova, Dmitrii Tarasov
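TL;DR: This paper introduces OCC-RAG, a family of small language models designed for faithful question answering. The models are trained on large-scale synthetic multi-context, multi-hop QA data so that they ignore memorized knowledge and reason strictly from the provided context.
Details
Motivation: Progress in language models has relied heavily on parameter scale, yet many practical applications need robust reasoning more than extensive parametric knowledge. The authors therefore propose task-specialized small language models for better performance on tasks that require multi-hop reasoning, such as faithful QA.
Result: On multi-hop reasoning benchmarks (HotpotQA, MuSiQue, TAT-QA), the ConFiQA faithfulness benchmark, and the MuSiQue-Un refusal benchmark, the OCC-RAG models (0.6B and 1.7B parameters) match or exceed general-purpose models 2 to 6 times their size.
Insight: The novelty lies in a design paradigm for task-specialized small language models and a pipeline for synthesizing multi-hop reasoning QA data at scale. The models produce structured reasoning traces with source citations grounded in literal quotes from the context, which supports answer faithfulness and interpretability.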
Abstract: Recent progress in the development of language models has been defined by scale, with each generation absorbing more of the world’s knowledge into its weights. However, many practical applications benefit more from robust reasoning than from extensive parametric knowledge. In this setting, task-specialized small language models (SLMs) offer a principled design choice. We introduce Optimal Cognitive Core (OCC), a family of SLMs built around this premise. As a variant of OCC, we present OCC-RAG, optimized for faithful question answering (QA) grounded in the provided context. This task directly aligns with the OCC design approach, requiring multi-hop reasoning over supplied passages while ignoring memorized knowledge. To train OCC-RAG, we implement a novel pipeline for synthesizing multi-context, multi-hop QA data at scale, producing a corpus of over three million examples targeting multi-hop reasoning, strict context faithfulness, and calibrated abstention. We release OCC-RAG-0.6B and OCC-RAG-1.7B, both mid-trained on this corpus. The models produce structured reasoning traces with source citations grounded in literal quotes from the context. Through OCC-RAG, we demonstrate that compact, task-specialized SLMs can match or exceed general-purpose models 2-6x their size across multi-hop reasoning (HotpotQA, MuSiQue, TAT-QA), faithfulness (ConFiQA), and refusal (MuSiQue-Un) benchmarks.
[23] I-WebGenBench: Evaluating Interactivity in LLM-Generated Scientific Web Applications cs.CLPDF
Dasen Dai, Biao Wu, Meng Fang, Shuoqi Li, Wenhao Wang
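TL;DR: This paper proposes an agent that converts research papers into executable interactive web systems and introduces the I-WebGenBench benchmark to evaluate this task. It also proposes PaperVoyager, a framework that improves the quality of generated systems by explicitly modeling mechanisms and interaction logic.
Details
Motivation: Existing document agents mainly convert papers into static summaries or webpages, which are insufficient for technical papers involving dynamic mechanisms and state transitions, so agents that can generate interactive systems are needed.
Result: On I-WebGenBench, which pairs 19 research papers with expert-built interactive systems as ground truth, the PaperVoyager framework significantly improves the quality of generated interactive systems.
Insight: The novelty lies in extending paper understanding from static outputs to the generation of dynamic interactive systems, and in a structured generation framework that explicitly models the mechanisms and interaction logic in papers, offering a new paradigm for interactive scientific paper understanding.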
Abstract: Recent advances in visual language models have enabled autonomous agents for complex reasoning, tool use, and document understanding. However, existing document agents mainly transform papers into static artifacts such as summaries, webpages, or slides, which are insufficient for technical papers involving dynamic mechanisms and state transitions. In this work, we propose a Paper-to-Interactive-System Agent that converts research papers into executable interactive web systems. Given a PDF paper, the agent performs end-to-end processing without human intervention, including paper understanding, system modeling, and interactive webpage synthesis, enabling users to manipulate inputs and observe dynamic behaviors. To evaluate this task, we introduce a benchmark of 19 research papers paired with expert-built interactive systems as ground truth. We further propose PaperVoyager, a structured generation framework that explicitly models mechanisms and interaction logic during synthesis. Experiments show that PaperVoyager significantly improves the quality of generated interactive systems, offering a new paradigm for interactive scientific paper understanding.
[24] Momento: Evaluating Persistent Memory and Reasoning with Multi-Session Agentic Conversations cs.CLPDF
Adril Putra Merin, David Anugraha, Ayu Purwarianti, Genta Indra Winata
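TL;DR: This paper proposes Momento, a benchmark for evaluating persistent agents' ability to complete tasks in multi-session service environments, requiring agents to handle temporal dependencies and evolving user goals across sessions. Experiments show that current agents fail mainly by misestimating user state: they treat prior session history as a reliable proxy rather than stale information requiring re-validation, revealing a substantial gap between current agent capabilities and realistic long-horizon human-agent interaction.
Details
Motivation: Existing agent benchmarks evaluate performance within a single session only, ignoring the need for agents to integrate past actions, user preferences, and prior decisions to fulfill personalized user goals. A new benchmark is therefore needed to evaluate persistent agents in multi-session environments.
Result: Current agents perform poorly on Momento. The main failure is misjudging user state by treating prior session information as reliable context rather than stale data requiring re-validation, which highlights the gap between current agents and the demands of realistic long-horizon interaction.
Insight: The novelty is Momento, presented as the first benchmark focused on persistent multi-session agentic task completion, with an emphasis on cross-session temporal dependencies and evolving user goals. The study shows that insufficient re-validation of historical information is a core problem for current agents in long-horizon interaction, pointing to an important direction for future agent design.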
Abstract: Recent advances in agentic AI have enabled agents to complete complex tasks through tool use, reasoning, and multi-step planning. Yet existing benchmarks evaluate agents within a single session, ignoring past actions, stated preferences, and prior decisions that agents must integrate to fulfill personalized user goals. We introduce Momento, a benchmark for persistent agentic task completion in multi-session service environments, requiring agents to take consequential, tool-mediated actions while resolving temporal dependencies and evolving user goals across sessions. Experimental results reveal that current agents fail primarily through misestimation of user state, treating prior session history as a reliable proxy for current context rather than stale information requiring re-validation, highlighting a substantial gap between current agent capabilities and realistic long-horizon human-agent interaction.
[25] Not All Flips Are Conformity: Decomposing Stance Convergence in Multi-Agent LLM Debate cs.CLPDF
Xiqi Hao, Zengqing Wu, Yu-Xuan Qiu, Chuan Xiao, Ruiqi Xu
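TL;DR: This paper studies the mechanisms behind stance convergence in multi-agent LLM debate and shows that the conventional answer flip rate conflates three distinct mechanisms: spontaneous instability, stance-induced conformity, and reasoning-induced persuasion. The authors propose a three-source decomposition framework that isolates these mechanisms through controlled counterfactual conditions, show that harmful conformity exists and is predictable, and find that untargeted reduction of peer adoption does not improve accuracy.
Details
Motivation: Multi-agent debate is a promising strategy for improving LLM reasoning, but when agents converge on an answer it is unclear whether the convergence reflects genuine deliberation or social conformity. The paper aims to disentangle the mechanisms behind stance convergence.
Result: In the primary MMLU-Pro setting, 37% of agent-question observations change under self-reflection alone; strict conformity accounts for 29% and is predominantly harmful across model replications (57-77% correct-to-wrong). Harmful conformity can be predicted from Round 0 features (AUC = 0.79), and risk-targeted intervention reduces it by 13.6 percentage points (p < 0.001).
Insight: The novelty lies in a decomposition framework that splits answer flips into three sources (spontaneous instability, conformity, and persuasion) and quantifies each through controlled experiments. The study exposes the complexity of "convergence" in multi-agent debate, underlines the importance of separating beneficial from harmful influence, and provides diagnostic tools and intervention ideas for designing more effective debate mechanisms.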
Abstract: Multi-agent debate (MAD) is a promising strategy for improving LLM reasoning, but when agents converge on a shared answer, it is unclear whether that convergence reflects genuine deliberation or social compliance. We show that the conventional answer flip rate conflates three distinct mechanisms: spontaneous instability, stance-induced conformity, and reasoning-induced persuasion. Our three-source decomposition framework isolates each through controlled counterfactual conditions. In the primary MMLU-Pro setting, 37% of agent-question observations change under self-reflection alone, while robustness tests show substantial model-dependent instability across GPQA-Diamond and three model families; strict conformity is 29% in the primary setting and remains predominantly harmful across model replications (57-77% correct-to-wrong). A controlled information-gradient experiment reveals that even vacuous reasoning is associated with 20-39% error adoption among resistant agents, with reasoning-like presentation carrying substantial persuasive weight. Harmful conformity can be predicted from Round 0 features (AUC = 0.79), and risk-targeted intervention reduces it by 13.6 percentage points (p < 0.001). However, without correctness labels or self-reflection controls, reducing peer adoption does not improve accuracy, because harmful and beneficial influence cannot be distinguished.
[26] Internalize the Temperature: On-Policy Self-Distillation as Policy Reheater for Reinforcement Learning cs.CL | cs.LGPDF
Xuewei Yang, Jiachen Yu, Jie Wu, Shaoning Sun, Junjie Wang
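TL;DR: This paper proposes Temperature-Scaled On-Policy Self-Distillation (TS-OPSD), a lightweight policy "reheating" method that addresses the loss of exploration diversity caused by policy entropy collapse in reinforcement learning. The method builds a self-teacher by applying high-temperature scaling to the model's own logits, then distills the resulting smoother distribution back into the student, internalizing the exploratory effect of temperature into the model parameters.
Details
Motivation: Reinforcement learning from verifiable rewards improves LLM reasoning but often suffers from policy entropy collapse, which reduces exploration diversity and weakens learning signals. Existing remedies such as entropy regularization or adjusting the sampling temperature remain external to the model parameters; the paper aims to internalize the exploratory effect.
Result: Experiments on Qwen3-4B-Base and Qwen3-8B-Base show that policy reheating provides a stronger initialization for continued RL than standard continued RL and rollout-level temperature reheating. Analysis shows that the method mainly reduces output sharpness while preserving intermediate representations, top candidate sets, and reasoning capability.
Insight: The novelty lies in a self-distillation approach to policy reheating that needs no external teacher, privileged data, or additional inference cost and builds the exploratory effect of temperature into the parameters. It offers a simple post-collapse entropy-restoration intervention for reasoning-oriented RL and helps mitigate entropy collapse.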
Abstract: Reinforcement learning from verifiable rewards improves the reasoning ability of large language models, but often suffers from entropy collapse, in which increasingly concentrated policies reduce rollout diversity and useful learning signals. Existing remedies either constrain the RL objective (e.g., entropy regularization) or adjust sampling temperature during rollout collection, but these interventions remain external to the model parameters. We propose Temperature-Scaled On-Policy Self-Distillation (TS-OPSD), a lightweight policy reheating method that internalizes the exploratory effect of temperature into model parameters. Starting from an entropy-collapsed RL checkpoint, TS-OPSD constructs a self-teacher by applying high-temperature scaling to the model’s own logits, then distills the resulting smoother distribution back into the student. This policy reheating requires no external teacher, privileged data, or additional inference cost. Experiments on Qwen3-4B-Base and Qwen3-8B-Base show that policy reheating yields a stronger initialization for continued RL than both standard continued RL and rollout-level temperature reheating. Further analyses show that TS-OPSD mainly reduces output sharpness while preserving intermediate representations, top candidate sets, and reasoning capability. These results suggest that entropy restoration can serve as a simple post-collapse intervention for extending reasoning-oriented RL.
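The self-teacher construction in this abstract can be illustrated numerically on a single next-token distribution. The sketch below is a toy, not the paper's training code: the logits, the temperature of 2.0, and the choice of KL(teacher || student) as the distillation loss are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a flatter distribution."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0)


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    logits = [4.0, 1.0, 0.5, 0.0]              # toy logits of a sharply peaked policy
    student = softmax(logits)                   # current policy at temperature 1
    teacher = softmax(logits, temperature=2.0)  # self-teacher: same logits, reheated
    print("student entropy:", entropy(student))
    print("teacher entropy:", entropy(teacher))
    print("distillation loss:", kl_divergence(teacher, student))
```

Minimizing such a loss with respect to the student's parameters would pull the temperature-1 distribution toward the flatter teacher, while dividing logits by a positive temperature leaves the ranking of candidates unchanged.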
[27] Citation Grounding: Detecting and Reducing LLM Citation Hallucinations via Legal Citation Graphs cs.CL | cs.DLPDF
Volodymyr Ovcharov
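TL;DR: The paper proposes Citation Grounding (CG), an automated metric for detecting and reducing LLM hallucinations of legal citations at scale. CG verifies LLM-generated citations against a ground-truth citation graph extracted from 100.8 million Ukrainian court decisions (502 million edges and 21,736 unique statute nodes) and decomposes into three sub-metrics: citation precision, relevance, and temporality. To reduce hallucinations, the paper also proposes CG-DPO, which fine-tunes a model on algorithmically constructed preference pairs, and validates it on Qwen2.5-7B-Instruct.
Details
Motivation: LLMs systematically hallucinate legal citations (fabricating statutes, citing repealed provisions, confusing jurisdictions), yet no automated method exists to measure or reduce this behavior at scale.
Result: In an empirical evaluation of 100 Ukrainian legal queries across five systems (four commercial LLMs and one RAG-augmented production system), CG scores range from 0.791 to 0.873, with 13-21% of citations hallucinated. A Qwen2.5-7B-Instruct model fine-tuned with CG-DPO on a dataset of 2,244 court decisions reaches 98.5% mean validation accuracy in distinguishing correct from corrupted citations.
Insight: The novelty lies in CG, a decomposable automated metric based on a real legal citation graph, and in CG-DPO, which reduces hallucinations without human annotation by fine-tuning on algorithmically constructed preference pairs. The approach provides a scalable framework and open resources for evaluating and improving LLM factuality in specialized domains such as law.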
Abstract: Large language models systematically hallucinate legal citations – fabricating statute references, citing repealed provisions, and confusing jurisdictions – yet no automated method exists to measure or reduce this behavior at scale. We propose citation grounding (CG), a metric that verifies LLM-generated legal citations against a ground-truth citation graph extracted from 100.8 million Ukrainian court decisions (502 million edges, 21,736 unique statute nodes). CG decomposes into three components – citation precision (does the cited provision exist?), citation relevance (is it contextually appropriate?), and citation temporality (was it valid at the relevant date?) – enabling differential diagnosis of hallucination types. Empirical evaluation on 100 Ukrainian legal queries across five systems – four commercial LLMs via AWS Bedrock (Claude Haiku 4.5, Mistral Pixtral Large, Amazon Nova Pro/Lite) and one RAG-augmented production system – reveals CG ranging from 0.791 to 0.873, with 13-21% of citations hallucinated. To reduce hallucinations without human annotation, we introduce Citation Grounding DPO (CG-DPO): a method that constructs preference pairs algorithmically by corrupting verified citations from real court decisions via four targeted strategies. On a dataset of 2,244 court decisions, a Qwen2.5-7B-Instruct model fine-tuned with LoRA achieves 98.5% mean validation accuracy in distinguishing correct from corrupted citations (rewards margin +14.9, std < 0.3 pp across 3 seeds). The citation graph, evaluation framework, and CG-DPO dataset are released as open resources.
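Two of the three CG components named in this abstract, citation precision and citation temporality, can be sketched over a toy statute table. Everything here is hypothetical: the statute identifiers, the validity dates, and the function names are made up, and the sketch does not show how the paper scores relevance or combines components into a single CG value.

```python
from datetime import date

# Hypothetical statute table: identifier -> (in force from, repealed on or None).
STATUTES = {
    "CODE-A art. 10": (date(2004, 1, 1), None),
    "CODE-A art. 22": (date(2004, 1, 1), date(2015, 12, 31)),
    "CODE-B art. 5": (date(2012, 11, 20), None),
}


def citation_precision(cited, statutes):
    """Fraction of cited provisions that exist in the statute table."""
    if not cited:
        return 0.0
    return sum(c in statutes for c in cited) / len(cited)


def citation_temporality(cited, statutes, on_date):
    """Among existing cited provisions, the fraction in force on on_date."""
    existing = [c for c in cited if c in statutes]
    if not existing:
        return 0.0
    in_force = 0
    for c in existing:
        start, end = statutes[c]
        if start <= on_date and (end is None or on_date <= end):
            in_force += 1
    return in_force / len(existing)


if __name__ == "__main__":
    cited = ["CODE-A art. 10", "CODE-A art. 22", "CODE-Z art. 999"]
    print(citation_precision(cited, STATUTES))
    print(citation_temporality(cited, STATUTES, date(2020, 6, 1)))
```

Keeping the components separate is what allows a fabricated provision to be told apart from a real one that had been repealed by the relevant date.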
[28] MLLM-Microscope: Unlocking Hidden Structure Within Multimodal Large Language Models cs.CL | cs.AIPDF
Ravil Mussabayev, Rustam Mussabayev
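TL;DR: This paper presents MLLM-Microscope, a system for analyzing the hidden representations inside multimodal large language models (MLLMs). The system evaluates the linearity, intrinsic dimension, and anisotropy of multimodal token embeddings across Transformer layers. Evaluating two state-of-the-art models, LLaVA-NeXT and OmniFusion, on the ScienceQA dataset, it finds that tokens of both modalities behave highly linearly in both the main and residual streams, but that the models differ, suggesting that internal behavior depends strongly on how modalities are fused.
Details
Motivation: The work aims to understand the internal representations and working mechanisms of MLLMs by systematically analyzing their hidden structure, in order to inform future model design and optimization.
Result: LLaVA-NeXT and OmniFusion were evaluated on ScienceQA. Token streams in both models are highly linear, but the linearity of LLaVA-NeXT's image tokens declines slightly while OmniFusion's stays stable. OmniFusion's image token dimension is higher than LLaVA-NeXT's at every layer, and its anisotropy stays consistently low across layers.
Insight: The novelty lies in a system dedicated to analyzing MLLM hidden representations (MLLM-Microscope) and a quantitative framework for measuring the linearity, dimension, and anisotropy of token embeddings. The work links properties of internal representations to model architecture, in particular the modality fusion strategy, giving a new empirical perspective on model behavior.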
Abstract: This work presents MLLM-Microscope, a novel system designed for analyzing the hidden representations within Multimodal Large Language Models (MLLMs). Our system evaluates the linearity, intrinsic dimension, and anisotropy of multimodal token embeddings across transformer layers. Utilizing the ScienceQA dataset, we evaluate two state-of-the-art MLLMs, LLaVA-NeXT and OmniFusion. We find that both the main and residual streams for tokens of both modalities exhibit highly linear behaviors across transformer layers. However, LLaVA-NeXT’s image tokens reveal a slight decline in linearity, whereas OmniFusion’s remain consistent. Image token dimensions in OmniFusion remain consistently higher across layers compared to LLaVA-NeXT. OmniFusion’s anisotropy also stays consistently low throughout the layers. These findings suggest that the inner workings of MLLMs depend strongly on the nature of the modality fusion performed before the token sequence is passed into the LLM. These and other insights obtainable from our system can improve our understanding of the inner workings of MLLMs and inform future model design and optimization.
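The abstract does not say how anisotropy is measured. One common proxy, used here purely as an assumption and not necessarily the definition in MLLM-Microscope, is the mean pairwise cosine similarity of a layer's token embeddings: values near 1 mean the embeddings crowd into a narrow cone, while values near or below 0 mean they spread across directions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anisotropy(embeddings):
    """Mean cosine similarity over all distinct pairs (an assumed proxy)."""
    n = len(embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)


if __name__ == "__main__":
    narrow_cone = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
    spread_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
    print(anisotropy(narrow_cone), anisotropy(spread_out))
```

Computing the statistic per layer, separately for image and text tokens, would give the kind of layer-wise curves the abstract compares across models.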
[29] Towards Lightweight Reliability: Using Soft Prompts for Hallucination Mitigation in Large Language Models cs.CL | cs.LGPDF
S M Tahmid Siddiqui, Akib Jawad Ononto, Anoop Singhal, Latifur Khan
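TL;DR: This paper proposes RCSP, a parameter-efficient method that uses soft prompts to mitigate LLM hallucinations in generative question-answering tasks and to encourage the model to abstain when uncertain. The method trains soft prompts with contrastive loss, curriculum learning, and KL regularization to balance three goals: suppressing hallucination, encouraging abstention, and preserving factual recall.
Details
Motivation: LLMs in widespread use often lose reliability by producing hallucinations that sound plausible but are factually wrong, which creates trust problems and real-world risk, especially in high-stakes domains. An efficient method is needed to mitigate hallucinations and promote responsible abstention.
Result: On Gemma 3 (12B) and Llama 3.1 (8B), evaluated with an LLM-as-a-Judge framework on five diverse generative QA datasets, RCSP balances factual recall, hallucination suppression, and abstention well, with a generally higher F-score than standard reasoning and instruction-based prompting baselines.
Insight: The novelty lies in RCSP, a parameter-efficient soft-prompt training method whose composite loss integrates contrastive learning, curriculum learning, and regularization. It improves model reliability while training only a small number of parameters, offering a modular and computationally efficient path toward more reliable LLMs.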
Abstract: Large language models (LLMs) have seen widespread adoption across various domains, yet their reliability is frequently undermined by hallucinations - responses that are plausible-sounding but factually incorrect. In high-stakes domains, these errors can reduce trust and introduce real-world risk. To address this challenge, we present a parameter-efficient approach that uses soft prompts to mitigate hallucinated content and promote responsible abstention in generative question-answering (QA) tasks. Our method, called Responsible Contrastive Soft Prompting (RCSP), uses a composite loss to train soft prompts that balance three goals: suppressing hallucinatory content, encouraging abstention under uncertainty, and preserving or improving factual recall. To achieve these goals, we incorporate contrastive loss, curriculum learning, and KL regularization into our training mechanism. We evaluate our approach on five diverse generative QA datasets using an LLM-as-a-Judge framework. Experimental results on the Gemma 3 (12B) and Llama 3.1 (8B) backbones demonstrate that RCSP effectively balances factual recall with hallucination suppression and abstention, yielding a generally superior F-score over standard reasoning and instruction-based prompting baselines. Notably, these improvements are achieved by training only a fraction of the parameters required by other tuning techniques. Our results demonstrate that soft prompts provide a modular and computationally efficient path toward improving LLM reliability.
[30] PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects cs.CL | cs.AI | eess.ASPDF
Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu, Lu Fan
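TL;DR: This paper proposes PolySpeech-100, a large-scale multilingual speech understanding benchmark for evaluating the "native-level" comprehension of end-to-end Speech-LLMs across 110 linguistic variants, including 19 Chinese dialects and more than 80 low-resource languages. The dataset combines human recordings with instruction-driven synthetic speech. Evaluating 22 SOTA models yields several key findings: open-source models collapse on low-resource languages, direct audio processing helps preserve paralinguistic cues, and Chain-of-Thought prompting can hurt performance in zero-shot settings.
Details
Motivation: Existing speech benchmarks have three key limitations: a strong bias toward high-resource languages, a focus on low-level recognition rather than semantic reasoning, and neglect of regional dialects. A new benchmark is needed to close this gap and drive the next generation of inclusive, omni-capable Speech-LLMs.
Result: 22 SOTA models (including Gemini-3, GPT-Audio, and Qwen2.5-Omni) were evaluated extensively on PolySpeech-100. Open-source end-to-end models outperform cascade systems on heavy dialects; commercial models remain robust while open-source models degrade catastrophically on low-resource languages; and under standard zero-shot settings, Chain-of-Thought prompting frequently degrades speech understanding for most models.
Insight: The novelty lies in a large-scale speech understanding benchmark that covers a wide range of languages and dialect variants and is built with a hybrid construction pipeline. Its core insight is the importance of direct audio processing for preserving paralinguistic and prosodic features (such as intonation and stress), together with a possible modality alignment gap in current architectures, which points to important directions for future model design.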
Abstract: While End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) are rapidly evolving, their evaluation methodologies remain limited to the era of simple transcription. Existing benchmarks suffer from three critical limitations: a pronounced bias towards high-resource languages, a focus on low-level recognition (ASR) rather than semantic reasoning, and a neglect of regional dialects. To bridge this gap, we introduce PolySpeech-100, a massive-scale benchmark designed to assess ‘native-level’ speech comprehension across 110 linguistic variants. We employ a novel hybrid construction pipeline that augments gold-standard human recordings with instruction-driven synthetic speech, allowing us to cover 19 distinct Chinese dialects and over 80 low-resource languages. Extensive evaluation of 22 state-of-the-art models (including Gemini-3, GPT-Audio, and Qwen2.5-Omni) yields pivotal insights. First, we demonstrate that open-source E2E models outperform Cascade (ASR+LLM) systems on heavy dialects, proving that direct audio processing preserves critical paralinguistic cues and prosodic features (e.g., intonation, stress) that are often lost in standard transcription. Second, we reveal a significant performance gap: while commercial models maintain robustness, open-source models suffer catastrophic degradation on low-resource languages. Finally, counter-intuitively, we observe that under standard zero-shot settings, Chain-of-Thought prompting frequently degrades speech understanding performance for most evaluated models, revealing a potential modality alignment gap in current architectures. PolySpeech-100 establishes a rigorous standard for the next generation of inclusive, omni-capable Speech-LLMs. The data, demo, and code are publicly available at https://github.com/YoungSeng/PolySpeech-100.
[31] ExpWeaver: LLM Agents Learn from Experience via Latent RAG cs.CLPDF
Tao Feng, Tianyang Luo, Jingjun Xu, Zhigang Hua, Yan Xie
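TL;DR: This paper proposes ExpWeaver, a framework that lets LLM agents learn from experience through latent retrieval-augmented generation (Latent RAG). The method encodes experiences as LLM hidden states, retrieves relevant experiences directly in latent space at each decoding step, and integrates them through cross-attention aggregation and gated residual mechanisms. The whole pipeline is optimized end-to-end with reinforcement learning.
Details
Motivation: Existing experience-learning methods are confined to explicit text space: they retrieve experiences by semantic similarity and concatenate them into the context window, which causes substantial token overhead and an architecture that separates retrieval from generation. The paper aims to remove these limitations.
Result: Across 13 diverse tasks spanning question answering, reasoning, coding, scientific prediction, and recommendation, ExpWeaver achieves state-of-the-art performance on 12, outperforming the strongest baseline by over 6.8%. Its token efficiency is comparable to non-retrieval baselines, whereas text-based retrieval methods need 1.5 to 2 times more tokens, and its cross-domain generalization exceeds the strongest baseline by 16.32% under zero-shot transfer and 15.21% under few-shot transfer.
Insight: The core contribution is to move experience learning from explicit text space into latent space, using the LLM's own hidden states for encoding and retrieval and unifying retrieval and generation under end-to-end reinforcement learning, which yields an efficient, tightly coupled architecture for integrating experience.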
Abstract: Experience learning has achieved promising results in enhancing LLM agent planning and reasoning by integrating past interactions as reusable knowledge. However, existing methods remain confined to explicit text space, retrieving experiences via semantic similarity and concatenating them into the context window, leading to substantial token overhead and a decoupled architecture that separates retrieval from generation. To address these limitations, we propose ExpWeaver, a framework that enables LLM agents to learn from experience via latent retrieval-augmented generation, without requiring a separate RAG module. ExpWeaver encodes experiences using the LLM’s own hidden states, retrieves relevant experiences directly in latent space at each decoding step, and integrates them through cross-attention aggregation and gated residual mechanisms. The entire pipeline is optimized end-to-end with reinforcement learning, supporting both generative and ranking tasks. We evaluate ExpWeaver on 13 diverse tasks spanning question answering, reasoning, coding, scientific prediction, and recommendation. Results demonstrate that ExpWeaver achieves state-of-the-art performance on 12 out of 13 tasks, outperforming the strongest baseline by over 6.8%; maintains token efficiency comparable to non-retrieval baselines while text-based retrieval methods require 1.5 to 2 times more tokens; and exhibits superior cross-domain generalization, outperforming the strongest baseline by 16.32% under zero-shot transfer and 15.21% under few-shot transfer. Our code for ExpWeaver is released at https://github.com/ulab-uiuc/ExpWeaver.
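The combination of attention-based aggregation and a gated residual update mentioned in this abstract can be shown on plain vectors. This is a toy under stated assumptions, not ExpWeaver's architecture: scaled dot-product attention without learned projections stands in for cross-attention, the gate is a fixed scalar where a trained model would presumably learn it, and the function names are hypothetical.

```python
import math


def attend(query, memory):
    """Softmax-weighted average of memory vectors, scored by scaled dot product."""
    scale = math.sqrt(len(query))
    scores = [sum(q * m for q, m in zip(query, item)) / scale for item in memory]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [sum(w * item[d] for w, item in zip(weights, memory))
            for d in range(len(query))]


def gated_residual(hidden, retrieved, gate):
    """Add the retrieved vector to the hidden state, scaled by a gate in [0, 1]."""
    return [h + gate * r for h, r in zip(hidden, retrieved)]


if __name__ == "__main__":
    hidden = [1.0, 0.0]                    # current decoding-step hidden state
    experiences = [[0.9, 0.1], [0.0, 1.0]]  # stored experience encodings
    retrieved = attend(hidden, experiences)
    print(gated_residual(hidden, retrieved, gate=0.5))
```

With the gate at 0 the hidden state passes through unchanged, so the model can fall back to ordinary generation when no stored experience is useful.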
[32] Revise, Don’t Freeze: Sampler-Matched Training for Self-Correcting Masked Diffusion Language Models cs.CLPDF
Longxuan Yu, Shaorong Zhang, Yu Fu, Hui Liu, Yue Dong
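TL;DR: This paper proposes D3IM, a parameter-free sampler that addresses the inability of masked diffusion language models (MDLMs) to revise tokens once they are revealed under standard sampling. As a corrector-style reverse update, D3IM allows direct revision of visible tokens without additional modules. The authors also find that the model has a "preservation bias", a tendency to reproduce its own wrong committed tokens, and propose SCOPE (Self-Conditioned On Prediction Errors), a lightweight post-training procedure, to mitigate it. On LLaDA-8B, combining SCOPE with D3IM substantially improves performance on several benchmarks (GSM8K, MATH-500, HumanEval, MBPP).
Details
Motivation: Standard samplers in masked diffusion language models never revise a token once it is revealed, wasting the model's ability to re-predict. Existing methods either rely on heuristic or learned mechanisms to revise committed tokens or remask them, and a principled sampler that directly revises visible tokens without auxiliary modules is lacking.
Result: On LLaDA-8B with 64 denoising steps, SCOPE+D3IM improves over the original model with standard unmasking by 13.0 points on GSM8K (to 68.3%), 4.8 points on MATH-500 (to 23.6%), 15.3 points on HumanEval (to 29.3%), and 10.4 points on MBPP (to 30.8%). The gains grow with more denoising steps on math and HumanEval.
Insight: The novelty lies in D3IM, a parameter-free sampler that revises visible tokens directly without additional modules or auxiliary passes. The paper also identifies the model's "preservation bias" and designs the SCOPE post-training procedure, which simulates D3IM's sampling process to mitigate that bias, improving the model's ability to self-correct on reasoning tasks.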
Abstract: Masked diffusion language models (MDLMs) re-predict every position at each denoising step, but standard samplers commit tokens once revealed, leaving this revision capability unused. Existing approaches either add heuristic or learned mechanisms to revise committed tokens, or remask them back to [MASK] before re-predicting; a principled sampler that directly revises visible tokens without auxiliary modules remains underexplored. We introduce D3IM, a parameter-free sampler derived as a corrector-style reverse update that permits direct visible-to-visible revision without additional modules or auxiliary passes. D3IM also reveals a model-side obstacle we term preservation bias: the model tends to reproduce its own wrong committed tokens rather than correct them. We address this with SCOPE (Self-Conditioned On Prediction Errors), a lightweight post-training procedure that simulates D3IM’s sampling process. On LLaDA-8B at 64 denoising steps, SCOPE+D3IM improves over the original LLaDA-8B with standard unmasking by +13.0 on GSM8K (68.3%), +4.8 on MATH-500 (23.6%), +15.3 on HumanEval (29.3%), and +10.4 on MBPP (30.8%), with gains that increase as more denoising steps are used on math and HumanEval.
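[33] PMC-InterCPT: Rethinking Biomedical Interleaved Data for Multimodal Continued Pretraining cs.CL
Guanghao Zhu, Zeyu Liu, Zhitian Hou, Pengkai Wang, Zhijie Sang
TL;DR: This paper presents PMC-InterCPT, a context-grounded interleaved corpus for biomedical multimodal continued pretraining (CPT). The work revisits medical multimodal data construction: it builds richer interleaved samples by combining figure captions with the body text that references each figure, and filters them with LLM-supervised quality classifiers. Experiments show that CPT on this corpus improves Qwen3.5-4B-Base on medical and general multimodal tasks while requiring less training data.
Details
Motivation: Large-scale biomedical image-text datasets extracted from scientific literature are usually organized as image-caption pairs, but captions tend to be short, context-dependent, and only partially informative without the surrounding body text. Large-scale automatic extraction also introduces structural noise such as missing captions, residual markup, duplicated context, and incoherent multi-paragraph descriptions.
Result: After CPT followed by supervised fine-tuning (SFT) on Qwen3.5-4B-Base, PMC-InterCPT improves medical and general multimodal performance while using fewer CPT tokens than the raw source pool. The experiments also illustrate that data quality and modality are complementary for medical multimodal CPT.
Insight: The main contribution is a context-grounded biomedical interleaved corpus that adds figure-referencing body text to captions to supply richer context. The paper also proposes LLM-supervised quality filtering and a modality-aware resampling strategy based on a four-bucket evidence taxonomy to address data noise and modality imbalance.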
Abstract: Large-scale biomedical image-text datasets extracted from scientific literature provide valuable resources for medical multimodal model training. These datasets are commonly organized as image-caption pairs; however, figure captions are often short, context-dependent, and only partially informative without the surrounding article text. At the same time, large-scale automatic extraction introduces structural noise such as missing captions, residual markup, duplicated context, and incoherent multi-paragraph figure descriptions. We revisit data construction for medical multimodal continued pretraining (CPT) and present PMC-InterCPT, a context-grounded biomedical interleaved corpus that incorporates figure-referencing body text in addition to captions. Our pipeline recovers missing captions, cleans caption and context text, reconstructs coherent interleaved image-text samples, and applies LLM-supervised medical relevance and quality classifiers to filter noisy records. We further reveal strong modality imbalance in the resulting corpus and introduce a four-bucket evidence taxonomy for modality-aware resampling. Through CPT followed by supervised fine-tuning (SFT) on Qwen3.5-4B-Base, PMC-InterCPT effectively improves medical and general multimodal performance while using fewer CPT tokens than the raw source pool. The experimental results also illustrate the complementarity between the data quality and modality for medical multimodal CPT.
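[34] On the Generalization Gap in Self-Evolving Language Model Reasoning cs.CL
Zhenting Qi, Susanna Maria Baby, Stefanie Anna Baby, Kan Yuan, Andrew Tomkins
TL;DR: This paper studies how far large language models can improve through self-evolution under a strict closed-loop setup, with access only to an unlabeled prompt set and a base model. The authors analyze four self-evolution strategies on Knights and Knaves logical reasoning tasks and find that self-evolution consistently improves over the base model but plateaus and leaves a notable gap to supervised training. Gains on real-world reasoning benchmarks are also modest.
Details
Motivation: To determine whether self-evolution training, using only supervision signals generated by the model itself under a strict closed-loop setup, can approach the performance of supervised training, and to analyze how well it generalizes.
Result: On Knights and Knaves, self-evolution consistently improves the base model but plateaus and still trails supervised training; Gemma 12B with multi-turn critic-revision comes close to oracle-supervised performance. On real-world reasoning benchmarks the gains are modest.
Insight: The paper systematically evaluates several self-evolution strategies under a strict closed-loop setup and characterizes both their performance ceiling and their gap to supervised training. Multi-turn critic-revision emerges as an effective strategy, but self-evolution remains fundamentally limited in generalization.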
Abstract: Recent work suggests that large language models (LLMs) can improve through self-evolution (SE), using supervision signals generated by the model itself. In this work, we ask: under a strict closed-loop setup, where the self-evolution algorithm has access only to an unlabeled prompt set and a base model, how close can internally generated supervision come to oracle-supervised training? We analyze four representative strategies in a unified offline self-evolution framework: single-round verification, multi-turn revision with feedback, iterative training, and curriculum learning. Our primary experiments use Knights and Knaves (KK) logical reasoning tasks, which provide deterministic solutions, controlled difficulty levels, and a clean testbed for easy-to-hard generalization. We first show that self-evolution consistently improves over the base model, but plateaus after excessive training compute is invested, and eventually still leaves a non-trivial gap to oracle supervision. We find that multi-turn critic-revision with large models can reach strong self-evolution performance, with Gemma 12B nearly matching oracle-supervised training. Beyond Knights and Knaves, we also evaluate self-evolution on real-world reasoning benchmarks, where gains are also modest. Overall, our results characterize when closed-loop self-evolution can help and show how internally generated supervision remains insufficient under this minimal formulation.
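[35] Deep Research as Rubric for Reinforcement Learning cs.CL
Wangyi Mei, Zhouhong Gu, Zhenhan Bai, Yin Cai, Lefan Zhang
TL;DR: This paper proposes Deep Research as Rubric (DR-rubric), a framework that recasts rubric construction for open-ended reasoning and long-form generation as an evidence-driven research process. A two-stage procedure (domain knowledge search, then evidence distillation) automatically produces atomic, independently verifiable constraints for GRPO-based policy optimization. The approach is validated on 6 benchmarks.
Details
Motivation: Open-ended reasoning and long-form generation lack reliable automatic reward signals. Existing methods treat rubrics as given artifacts and often miss the task-specific, knowledge-intensive dimensions that matter most, which distorts the reward signal.
Result: On 6 benchmarks spanning agentic research and expert reasoning, DR-rubric achieves strongly competitive performance with only 1K-3K training instances. GPT-5-generated rubrics give better breadth coverage on agentic tasks, Gemini-generated rubrics give the most balanced performance across both task types, and bootstrap rubrics reach the best overall performance at the third iteration.
Insight: The core idea is to turn rubric construction from a static evaluation template into a dynamic, evidence-driven research process. Because the model under training can act as its own rubric generator, rubrics can be bootstrapped without frontier-model assistance, yielding more scalable and fine-grained reward signals for open-ended tasks.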
Abstract: Open-ended reasoning and long-form generation tasks lack reliable automatic verification signals for reward-based policy optimization. Rubrics offer a promising alternative, but existing approaches treat them as given artifacts, either hand-crafted or prompt-generated, and often miss the task-specific, knowledge-intensive dimensions that matter most, distorting the reward signal. Our key observation is that rubric construction is itself a research problem: identifying what makes a response correct or insightful requires discovering and synthesizing external knowledge. We propose Deep Research as Rubric (DR-rubric), a two-stage framework for constructing such rubrics. Stage I elicits domain facts, structural constraints, and failure modes through iterative multi-turn agentic search; Stage II distills this evidence into atomic, independently verifiable constraints for GRPO-based policy optimization. Because the model under training can serve as its own rubric generator, DR-rubric-8B supports bootstrap rubric generation without frontier-model assistance. We evaluate on 6 benchmarks spanning agentic research and expert reasoning. Experiments show that DR-rubric achieves strong competitive performance with only 1K-3K training instances, where GPT-5-generated rubrics particularly benefit breadth coverage on agentic tasks, Gemini-generated rubrics yield the most balanced performance across agentic and expert reasoning tasks, and bootstrap rubrics exhibit a specialization-to-rebalancing evolution achieving the best overall performance at the third iteration. Results demonstrate that reframing rubric construction from static evaluation templates into an evidence-driven research process yields more scalable, fine-grained reward signals for open-ended tasks.
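[36] MiCU: End-to-End Smart Home Command Understanding with Large Language Model cs.CL | cs.AI
Haowei Han, Kexin Hu, Weiwei Cai, Debiao Zhang, Bin Qin
TL;DR: This paper presents MiCU, a domain-specific large language model for smart home command understanding. Using an automated data synthesis pipeline, curriculum learning, reinforcement learning, and token compression, MiCU handles ambiguous or misaligned commands substantially better, and its effectiveness is confirmed in a real deployment.
Details
Motivation: Smart home command understanding systems perform poorly on ambiguous or misaligned commands (e.g., "make the bedroom cozy"), while large language models are held back in this domain by scarce domain-specific data, insufficient task adaptation, and high computational cost.
Result: In experiments, MiCU significantly outperforms baselines, with an average accuracy gain of 20.01% across all device categories. In production in the Xiaomi Home app, which receives roughly 1.7 million page views per day, it reduces the user correction rate by 1.57% and raises human-audited accuracy by 32.05%.
Insight: Contributions include an automated training data synthesis workflow built on user logs and LLMs; domain knowledge injection and stronger reasoning through curriculum learning and reinforcement learning guided by domain-specific thinking rules; and a token compression technique that lowers inference overhead and improves long-input handling.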
Abstract: Command understanding systems in smart home ecosystems can automate device control and substantially improve user experience. However, while they perform well on precise utterances (e.g., “turn on the bedroom light”), they struggle with ambiguous or misaligned commands (e.g., “make the bedroom cozy”). Large language models (LLMs) generalize well across various domains and can outperform traditional rule-based systems on such tasks, but their effectiveness is often constrained by scarce domain-specific data, insufficient task-specific adaptation, and high computational costs. In this paper, we propose an automated training data synthesis workflow using user logs and LLMs; then we build MiCU, a domain-specific LLM that excels at command understanding. Specifically, we employ curriculum learning to inject domain knowledge into the base LLM, then we enhance its reasoning ability via cold-start training combined with reinforcement learning (RL) guided by domain-specific thinking rules. Additionally, we introduce a token compression technique that condenses each device description into a single special token, substantially reducing inference overhead and enabling MiCU-fast, an efficient variant optimized for long inputs. Extensive experiments show that MiCU significantly outperforms baselines, with an average accuracy gain of 20.01% across all device categories. We have deployed MiCU in the Xiaomi Home app, receiving approximately 1.7 million page views per day. Production evaluations show that MiCU reduces user correction rate by 1.57% and increases human-audited accuracy by 32.05%. Our data and code are available at https://github.com/xiaomi-research/iot_spec_llm
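[37] Thinking Economically: A Hierarchical Framework for Adaptive-Complexity Reasoning in LLMs cs.CL
Yubo Gao, Haotian Wu, Hong Chen, Junquan Huang, Yibo Yan
TL;DR: This paper proposes the Hierarchical Adaptive Budgeter (HAB), a training framework that targets the compute wasted by "overthinking" in large language model (LLM) reasoning. Following the principle of "Thinking Economically", HAB uses coarse-to-fine budgeting: it predicts the optimal reasoning depth for each problem and learns a step-specific token budgeting signal for each reasoning step within a problem, achieving a better performance-efficiency trade-off.
Details
Motivation: Chain-of-Thought (CoT) methods often overthink, producing overly long rationales and unnecessary computational overhead. Existing efficiency methods typically compress uniformly, ignoring the key observation that reasoning complexity varies both across problems and across individual reasoning steps.
Result: Experiments on GSM8K and MATH500 show that HAB surpasses standard CoT in accuracy while also using fewer tokens, giving a stronger performance-efficiency trade-off than the compared baselines.
Insight: The core contribution is the "Thinking Economically" principle and its instantiation in HAB, which allocates compute through hierarchical (per-problem and per-step) adaptive budgeting instead of pursuing uniform brevity. Specific components include step-specific budgeting signals learned from perplexity (PPL)-derived step comparisons and an adaptive Pareto optimization objective, plus a Fisher Information-based pruner that supplies fine-grained training-time guidance so the model internalizes more economical reasoning patterns.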
Abstract: Chain-of-Thought (CoT) has significantly enhanced LLM reasoning, yet often incurs substantial computational overhead due to “overthinking”: generating excessively long rationales without commensurate accuracy gains. Existing efficiency methods typically apply uniform compression, which overlooks a critical observation that reasoning complexity is heterogeneous at two distinct granularities: across different problems and within individual reasoning steps. This motivates our principle of Thinking Economically: intelligently allocating computational resources based on intrinsic task and step demands rather than pursuing uniform brevity. We propose Hierarchical Adaptive Budgeter (HAB), a training framework that operationalizes this principle through coarse-to-fine budgeting. At the inter-step level, HAB predicts the optimal reasoning depth for each problem. At the intra-step level, HAB learns step-specific token budgeting signals from PPL-derived step comparisons and an adaptive Pareto optimization objective that captures the local quality-efficiency trade-off, while a Fisher Information-based pruner further provides fine-grained training-time guidance, thereby encouraging the generator to internalize more economical reasoning patterns. Experiments on GSM8K and MATH500 show that HAB not only surpasses standard CoT in accuracy but also reduces token usage, achieving a stronger performance-efficiency trade-off than the compared baselines.
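[38] CA-BED: Conversation-Aware Bayesian Experimental Design cs.CL | cs.AI
Daniel Arnould, Rashad Aziz, Zixuan Kang, Tanav Changal, Kevin Zhu
TL;DR: This paper proposes CA-BED (Conversation-Aware Bayesian Experimental Design), an inference-time probabilistic dialog planning framework that addresses the performance drop of large language models in interactive settings where information must be acquired by asking questions. It combines Bayesian Experimental Design with LLM-based likelihood estimation and optimizes question selection across multiple turns by maintaining a belief distribution over hypotheses, anticipating possible answers, and propagating expected information gain through a simulated conversation tree.
Details
Motivation: Large language models excel at static reasoning tasks, but their performance often degrades in interactive scenarios where information must be actively acquired through questioning. The core challenge is choosing questions that effectively reduce uncertainty while coping with answers that are ambiguous or only partially informative.
Result: On two structured entity-deduction benchmarks, CA-BED improves average success rate by 21.8% over direct prompting, with comparable gains relative to other information-seeking methods. These gains cost an average of only 1.8 additional conversational turns.
Insight: The main contribution is combining Bayesian Experimental Design principles with LLM reasoning to build a dialog framework that actively plans how to acquire information. It maintains a dynamically updated belief distribution and quantifies each question's expected information gain by simulating the conversation tree ahead of time, which gives LLMs a probabilistic approach to planning in interactive tasks.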
Abstract: Large Language Models (LLMs) excel at static reasoning tasks, yet their performance often degrades in interactive scenarios where information must be actively acquired through questioning. A key challenge lies in selecting questions that reduce uncertainty while incorporating responses that may be ambiguous or only partially informative. To address this, we propose Conversation-Aware Bayesian Experimental Design (CA-BED), an inference-time probabilistic dialog planning framework that integrates Bayesian Experimental Design with LLM-based likelihood estimation to optimize question selection over multiple conversational turns. CA-BED maintains a belief distribution over hypotheses, anticipates possible answers, and propagates expected information gain through a simulated conversation tree. Across two structured entity-deduction benchmarks, CA-BED yields an average 21.8% improvement in success rates over direct prompting, with comparable gains relative to alternative information-seeking methods. It achieves these gains with an average increase of only 1.8 conversational turns compared to direct prompting.
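The question-selection step described in this abstract can be illustrated with a one-step expected information gain calculation. This is a minimal sketch, not the authors' implementation: it scores questions one turn ahead instead of over a simulated conversation tree, and the likelihood tables below are hard-coded hypothetical values where CA-BED would obtain them from an LLM. The names `entropy`, `expected_information_gain`, and the example questions are assumptions made for illustration.

```python
import math


def entropy(p):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def expected_information_gain(belief, likelihood):
    """Expected entropy reduction from asking one question.

    belief[h] is the prior probability of hypothesis h.
    likelihood[a][h] is P(answer a | hypothesis h); hypothetical values here,
    whereas the paper estimates likelihoods with an LLM.
    """
    gain = entropy(belief)
    for lik_a in likelihood:
        joint = [b * l for b, l in zip(belief, lik_a)]
        p_answer = sum(joint)
        if p_answer == 0:
            continue  # this answer can never occur under the current belief
        posterior = [j / p_answer for j in joint]
        gain -= p_answer * entropy(posterior)
    return gain


# Four equally likely hypotheses; each question has two answers (yes, no).
belief = [0.25, 0.25, 0.25, 0.25]
questions = {
    "splits_evenly": [[1, 1, 0, 0], [0, 0, 1, 1]],
    "singles_out_one": [[1, 0, 0, 0], [0, 1, 1, 1]],
    "noisy_split": [[0.8, 0.8, 0.2, 0.2], [0.2, 0.2, 0.8, 0.8]],
}
gains = {q: expected_information_gain(belief, lik) for q, lik in questions.items()}
best_question = max(gains, key=gains.get)
```

In this toy setting the question that halves the hypothesis set cleanly scores highest, and the noisy version of the same question scores lowest, which mirrors the abstract's point that ambiguous answers reduce how much a question is worth.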
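[39] Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue cs.CL | cs.AI
Jingjie Lin, Bingbing Wang, Zihan Wang, Zhengda Jin, Weiming Qiao
TL;DR: Existing long-context benchmarks are confined to explicit factual memory. This paper introduces RefMem-Bench, a benchmark for reflective memory in long-horizon dialogue, meaning the ability to synthesize fragmented multimodal cues into high-level interpretations. The authors also propose REMIND, a hierarchical framework that strengthens reflective memory through question-conditioned evidence retrieval, salience-aware grounding, and abstraction-level supervision. Experiments show the benchmark challenges current models and that REMIND consistently improves answer accuracy and memory recall.
Details
Motivation: Existing benchmarks focus on retrieving explicit factual memories and cannot measure a model's ability to synthesize scattered multimodal cues into high-level interpretations, so a new benchmark and method are needed to fill this gap.
Result: Current models face a substantial challenge on RefMem-Bench, while the proposed REMIND framework consistently improves answer accuracy and memory recall through progressive evidence perception, grounding, and abstraction.
Insight: The paper defines reflective memory as a new evaluation dimension and designs the hierarchical REMIND framework, which uses Progressive Reflective Alignment to distill high-level reflective reasoning into the factual inference pathway and so improves the model's ability to infer latent meanings.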
Abstract: Despite substantial progress in long-context modeling, existing benchmarks remain confined to factual memory for explicit recall, failing to measure the reflective memory required to synthesize fragmented, multimodal cues into high-level interpretations. To address this gap, we introduce RefMem-Bench, a benchmark for reflective memory in long-horizon dialogue. RefMem-Bench contains 26K annotated QA instances with eight reflective-memory dimensions and three task formats, requiring models to move beyond surface-level retrieval and infer latent meanings from evidence distributed across interaction histories. To enhance reflective memory capability, we propose REflective Memory INDuction (REMIND), a hierarchical framework that treats reflective memory as progressive meaning construction. REMIND couples question-conditioned evidence retrieval, salience-aware grounding, and abstraction-level supervision, and uses Progressive Reflective Alignment to distill high-level reflective reasoning into the factual inference pathway. Experiments show RefMem-Bench poses a substantial challenge to current models, while REMIND consistently improves both answer accuracy and memory recall through progressive evidence perception, grounding, and abstraction.
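[40] Efficient RAG with Intent-Aware Retrieval and Semantics-Preserving Chunking cs.CL
Fachrina Dewi Puspitasari, Chaoning Zhang, Jiaquan Zhang, Zhicheng Wang, Hafiz Shakeel Ahmad Awan
TL;DR: This paper proposes InSemRAG, an efficient retrieval-augmented generation (RAG) framework that targets the information insufficiency caused by intent-agnostic retrieval and information fragmentation in conventional RAG systems. An iterative retrieve-and-check mechanism, supported by an intention-aware retriever (IAR) and semantics-preserving chunking (SPC), dynamically adjusts the retrieval strategy and repairs damaged evidence chunks to preserve semantic integrity. Small language models (SLMs) are used to reduce latency.
Details
Motivation: Conventional RAG systems suffer from information insufficiency for two reasons: intent-agnostic retrieval, which cannot adapt to the query intent, and information fragmentation, which leaves retrieved chunks semantically incomplete. The paper aims to address both and improve the sufficiency of retrieved information and the quality of generation.
Result: Extensive experiments on several benchmark datasets show the method is competitive with recent state-of-the-art RAG mechanisms. It achieves notable gains on multi-hop and evidence-sensitive tasks: a 2.65-point improvement in F1 on HotPotQA and a 1.5-point increase in accuracy on FEVER. With SLMs it reaches performance comparable to Multi-Hop RAG at 4.32$\times$ lower latency.
Insight: The main contribution is an iterative RAG framework that combines intent-aware retrieval with semantics-preserving chunking. The intention-aware retriever aligns retrieval with query intent through dynamic hybrid retrieval and adaptive weighting, while the chunking module detects and repairs damaged evidence chunks to keep them semantically intact. Using small language models to balance performance and efficiency is also a useful idea.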
Abstract: The demand for powerful instruction following and reasoning capability of large language models (LLMs) has promoted rapid development of retrieval-augmented generation (RAG). A RAG system assists LLM generation by retrieving chunks of query-relevant supplementary knowledge from an external database. Conventional RAG systems, however, suffer from information insufficiency due to two factors: intent-agnostic retrieval and information fragmentation. Our work proposes a RAG framework, termed InSemRAG, that addresses these challenges via an iterative retrieve-and-check mechanism with two supporting modules, an intention-aware retriever (IAR) and semantics-preserving chunking (SPC). IAR implements a dynamic hybrid retrieval method that adaptively weights the retrieval channels based on the query intent, while SPC detects and repairs damaged evidence chunks to preserve their semantic integrity. To alleviate the computational latency brought by our iterative mechanism, we leverage small language models (SLMs). Extensive experiments across several benchmark datasets consistently demonstrate the competitiveness of our method against recent state-of-the-art RAG mechanisms. In particular, our method achieves significant gains on multi-hop and evidence-sensitive tasks, with a 2.65-point improvement in F1 on HotPotQA and a 1.5-point increase in accuracy on FEVER. By using SLMs, our method also achieves performance competitive with Multi-Hop RAG at 4.32$\times$ lower latency.
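The idea of adaptively weighting retrieval channels by query intent can be sketched as a weighted fusion of dense and sparse scores. The abstract does not give IAR's actual channels, intent classes, or weights, so everything named below (`hybrid_rank`, `INTENT_DENSE_WEIGHT`, the intent labels, and the scores) is a hypothetical illustration of the general technique, not InSemRAG itself.

```python
# Hypothetical mapping from a query intent label to the weight of the dense
# (embedding) channel; the sparse (keyword) channel gets the remainder.
INTENT_DENSE_WEIGHT = {"factoid": 0.3, "multi_hop": 0.7}


def hybrid_rank(dense, sparse, dense_weight):
    """Rank documents by a convex combination of two channel scores.

    dense and sparse map document ids to scores assumed to lie in [0, 1];
    a document missing from one channel contributes 0 for that channel.
    """
    docs = set(dense) | set(sparse)
    fused = {
        d: dense_weight * dense.get(d, 0.0)
        + (1.0 - dense_weight) * sparse.get(d, 0.0)
        for d in docs
    }
    # Sort by descending fused score, breaking ties by document id.
    return sorted(fused, key=lambda d: (-fused[d], d))


dense_scores = {"a": 0.9, "b": 0.2}
sparse_scores = {"a": 0.1, "b": 0.8, "c": 0.5}

multi_hop_ranking = hybrid_rank(
    dense_scores, sparse_scores, INTENT_DENSE_WEIGHT["multi_hop"]
)
factoid_ranking = hybrid_rank(
    dense_scores, sparse_scores, INTENT_DENSE_WEIGHT["factoid"]
)
```

The same candidate pool is ordered differently depending on the intent label, which is the behavior the abstract attributes to intent-aware retrieval.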
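[41] Implicit Geographic Inference in LLM Medical Triage: Language-Driven Disparities in Emergency Recommendations cs.CL | cs.AI | cs.CY
Qi Han Wong
TL;DR: This paper studies whether large language model (LLM) medical triage recommendations differ with the language of the patient prompt. For identical neurological symptoms, the model's emergency room (ER) recommendation rate varies widely with prompt language (English, Japanese, Hindi, and others), from 0% to 30%, even though the severity scores it assigns are nearly identical. Adding the patient's location changes the recommendations markedly, indicating that the model implicitly infers location from the input language and gives inconsistent advice as a result.
Details
Motivation: To examine whether LLM recommendations in critical applications such as medical triage vary unfairly with the language a patient uses, a language-driven disparity that bears on the fairness and reliability of AI systems.
Result: With Gemini 3.5 Flash and identical symptoms, English and Arabic prompts yield a 30% ER recommendation rate while Japanese and Hindi prompts yield 0%. Adding a US location raises the ER rate for non-English prompts by up to 76.7 percentage points. The reverse anchor (English prompt with a Tokyo location) lowers the ER rate from 30% to 6.7%. A back-translation control confirms that the disparity comes from geographic inference implied by the language, not from translation quality.
Insight: The paper shows that in medical triage an LLM implicitly infers the patient's location from the input language and gives inconsistent recommendations accordingly, a new and potentially harmful form of bias. It underlines the need to test behavioral consistency across languages and contexts systematically when evaluating and deploying LLMs, so that hidden geographic inference does not produce unfair outcomes.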
Abstract: We investigate whether large language models produce different medical triage recommendations for identical symptoms based solely on the language of the patient prompt. Using Gemini 3.5 Flash, we evaluate a neurological symptom profile (persistent headache, blurred vision, nausea) across six languages (English, Spanish, Chinese, Hindi, Japanese, Arabic) with 30 runs per condition (n=450 total API calls). We find that the model recommends emergency room visits at rates ranging from 0% (Japanese, Hindi) to 30% (English, Arabic), despite assigning nearly identical severity scores (7.7-8.0/10) across all languages. Adding a single sentence specifying the patient’s US location increases ER recommendations by up to 76.7 percentage points for non-English prompts, while the reverse anchor (English prompt with a Tokyo location) reduces the ER rate from 30% to 6.7%. A back-translation control (Japanese to English) produces ER rates comparable to the English baseline, confirming that the disparity is not caused by translation quality but by implicit geographic inference from the input language. We release the complete dataset, experiment code, and results.
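The percentages in this abstract can be read back into run counts. The following is a quick arithmetic check under the assumption that every condition uses exactly 30 runs, as the abstract states; it is not part of the paper's released code, and the function name `rate_to_count` is made up for illustration.

```python
RUNS_PER_CONDITION = 30  # stated in the abstract
TOTAL_CALLS = 450        # stated in the abstract


def rate_to_count(percent):
    """Convert a reported percentage into the nearest whole number of runs."""
    return round(percent / 100 * RUNS_PER_CONDITION)


# 450 calls at 30 runs each implies this many experimental conditions.
conditions = TOTAL_CALLS // RUNS_PER_CONDITION

english_baseline_er = rate_to_count(30.0)   # English / Arabic baseline
tokyo_anchor_er = rate_to_count(6.7)        # English prompt, Tokyo location
largest_increase = rate_to_count(76.7)      # largest US-location increase
```

Under that assumption the 30% baseline corresponds to 9 of 30 runs, the 6.7% Tokyo-anchored rate to 2 of 30, and the 76.7-point increase to 23 additional runs, so each reported effect rests on a few dozen samples per condition.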
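[42] Unlocking the Black Box of Latent Reasoning: An Interpretability-Guided Approach to Intervention cs.CL | cs.LG
Shuochen Chang, Tong Bai, Xiaofeng Zhang, Qianli Ma, Qingyang Liu
TL;DR: This paper proposes interpretability-guided interventions to open up the black box of latent reasoning in large language models. A systematic analysis with structural, causal, and geometric probes finds that latent vectors encode compressed, faithful representations of reasoning steps, with early vectors acting as critical causal hubs. Building on these findings, the authors turn the interpretability insights into a suite of training-free, decode-time interventions that refine latent reasoning by imposing the identified geometric and semantic priors.
Details
Motivation: Latent reasoning lets large language models perform multi-step inference in continuous hidden states and is more efficient than explicit Chain-of-Thought (CoT), but the opacity of these continuous thought vectors hinders reliability and controllability. The paper aims to bridge the gap between mechanistic interpretability and actionable control.
Result: Extensive experiments across multiple model scales and task domains show that the proposed methods consistently improve reasoning accuracy, unlocking latent capabilities without any parameter updates.
Insight: The paper converts interpretability analysis (structural, causal, and geometric probes) directly into training-free, decode-time interventions that steer and refine latent reasoning by imposing geometric and semantic priors, improving transparency and control while preserving efficiency.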
Abstract: Latent reasoning enables Large Language Models (LLMs) to perform multi-step inference within continuous hidden states, offering efficiency gains over explicit Chain-of-Thought (CoT). However, the opacity of these continuous thought vectors hinders their reliability and controllability. This paper bridges the gap between mechanistic interpretability and actionable control. We first present a systematic analysis using structural, causal, and geometric probes, revealing that latent vectors encode compressed, faithful representations of reasoning steps, with early vectors acting as critical causal hubs. Building on this, we operationalize these interpretability insights into a suite of training-free, decode-time interventions that refine the latent reasoning process by imposing the identified geometric and semantic priors. Extensive experiments across multiple model scales and diverse task domains demonstrate that our approaches consistently improve reasoning accuracy. Our interpretability-guided interventions consistently unlock latent capabilities and improve reasoning accuracy without any parameter updates.
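[43] Med-HEAL: Analyzing and Mitigating Hallucinations in Medical LLMs with Hallucination-Aware In-Context Learning cs.CL
Yiming Liao, Zeno Franco, Jose Eduardo Lizarraga Mazaba, Keke Chen
TL;DR: This paper presents Med-HEAL, a framework for systematically identifying, analyzing, and mitigating hallucinations in medical large language models. It builds a hallucination dataset from clinical data and studies two mitigation strategies: a self-critique pipeline and retrieval-augmented in-context learning. Experiments show that self-critique improves the accuracy of several open-source medical LLMs without updating model parameters.
Details
Motivation: Hallucinations in medical large language models pose serious risks for clinical decision support, yet existing benchmarks often lack realistic clinical context and offer limited insight into how hallucinations can be mitigated in practice.
Result: In experiments on five open-source LLMs (BioMistral, Llama-3.1, DeepSeek, Qwen2.5, and Qwen3), the self-critique strategy significantly improves accuracy for three of them (p < 0.05) without parameter updates.
Insight: The paper offers a systematic framework that builds a hallucination dataset from clinical data through a dual evaluation pipeline (LLM-as-a-Judge plus human auditing), and it validates a self-critique strategy that needs no parameter updates, providing practical tools and data for safer deployment of medical LLMs.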
Abstract: Hallucinations in medical large language models (LLMs) pose serious risks for clinical decision support, particularly when models must reason over complex electronic health records (EHRs). However, existing benchmarks often lack a realistic clinical context and provide limited insight into how hallucinations can be mitigated in practice. We introduce Med-HEAL, a framework for systematically identifying, analyzing, and mitigating hallucinations in medical LLMs using clinically grounded data. Building on the EHRNoteQA benchmark derived from MIMIC-IV discharge summaries, we construct a hallucination dataset by evaluating BioMistral-7B on open-ended clinical question answering tasks. Model outputs are labeled through a dual evaluation pipeline that combines LLM-as-a-Judge assessment (GPT-4o) with human auditing by medical student reviewers, producing correctness judgments and annotations of reasoning errors via a custom web-based evaluation system. We then leverage this dataset to investigate mitigation strategies: a self-critique pipeline, in which the test model reviews its own answers to detect potential errors and regenerates responses for flagged cases, and retrieval-augmented in-context learning (RA-ICL), which exposes the model to hallucinated and corrected examples. Experiments across five open-source LLMs (BioMistral, Llama-3.1, DeepSeek, Qwen2.5, and Qwen3) show that the self-critique strategy improves accuracy for three of five models (p < 0.05) without requiring parameter updates. Med-HEAL provides both a reusable hallucination dataset and a practical framework for studying and mitigating hallucinations in medical LLMs, supporting safer deployment of AI systems in clinical environments. Our code and data are publicly available at https://github.com/yimingliao-blad/med-heal.git.
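[44] SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories cs.CL | cs.AI | cs.LG | cs.MA
Zhuoyun Yu, Xin Xie, Wuguannan Yao, Chenxi Wang, Lei Liang
TL;DR: This paper proposes SkillAdaptor, a training-free skill adaptation framework based on step-level failure attribution, designed to improve LLM agents on long-horizon interactive tasks. It identifies the first actionable fault step in a failed trajectory, links responsibility to candidate skills, and applies targeted updates while keeping the backbone model frozen.
Details
Motivation: Existing training-free skill adaptation methods usually update skills from full trajectories or session-level feedback, which makes failure attribution coarse and revisions unstable or overly broad. The paper aims for finer-grained, more stable skill maintenance.
Result: Evaluated on WebShop, PinchBench, and Claw-Eval with Kimi-K2.5, GLM-5, and GPT-5.2, SkillAdaptor outperforms both no-skill and skill-adaptation baselines on all three suites. The largest single-metric improvements are +1.5 points on PinchBench average score, +1.8 on Claw-Eval average score, and +1.7 on WebShop success rate.
Insight: The core contribution is an explicit step-level failure attribution mechanism that makes training-free skill maintenance more stable and auditable. The method can plug into OpenClaw-class agent harnesses, and by locating the first actionable fault step and updating only the responsible candidate skills it avoids the problems of coarse-grained updates.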
Abstract: Large language model (LLM) agents increasingly rely on reusable external skills to solve long-horizon interactive tasks. Existing training-free skill adaptation pipelines usually update skills from full trajectories or session-level feedback, which makes failure attribution coarse and often produces unstable or overly broad revisions. We propose SkillAdaptor, a training-free step-level skill adaptation framework with explicit failure attribution, and it can plug into OpenClaw-class agent harnesses. Given a failed trajectory, SkillAdaptor identifies a first actionable fault step, links responsibility to candidate skills, and applies targeted updates under explicit acceptance checks while keeping the backbone frozen. We evaluate on WebShop, PinchBench, and Claw-Eval with Kimi-K2.5, GLM-5, and GPT-5.2. SkillAdaptor improves over no-skill and skill-adaptation baselines on all three suites, with the largest single-metric improvements of +1.5 points on PinchBench Avg Score%, +1.8 on Claw-Eval Avg Score, and +1.7 on WebShop success rate. These results indicate that step-level attribution supports more stable and auditable training-free skill maintenance. The code will be released at https://github.com/zjunlp/SkillAdaptor.
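[45] Dr. DocBench: A Comprehensive Benchmark for Expert-Level and Difficult Document Parsing cs.CL | cs.AI | cs.CV
Minglai Yang, Xinyan Velocity Yu, Pengyuan Li, Xinyu Guo, Zhenting Qi
TL;DR: This paper introduces Dr. DocBench, a comprehensive benchmark for expert-level and difficult document parsing. Built from a large-scale multilingual book corpus, it uses parser-failure-based sampling to select 4,514 challenging pages across 52 BISAC subject domains, with 65k high-quality annotations for layout, reading order, hierarchical relations, and domain-specific visual content. Evaluation shows that the strong results of state-of-the-art systems on existing benchmarks do not transfer to this expert-level task.
Details
Motivation: Existing OCR and document parsing benchmarks are limited in coverage and difficulty. Many focus on common document genres or uniformly sampled pages where modern parsers already do well, and they offer little annotation for expert-domain structures such as chemical formulas, music notation, complex tables, and cross-page layouts.
Result: On Dr. DocBench, the strong performance of pipeline-based parsers and general-purpose vision-language models (VLMs) on existing benchmarks does not transfer to expert-level document parsing. The analysis reveals substantial failures across subjects, content types, and structural attributes.
Insight: The paper builds a difficulty-aware benchmark that covers a wide range of expert domains by sampling on parser failures. It provides a comprehensive testbed for diagnosing and advancing document intelligence systems and highlights how current systems fall short on complex, specialized document structures.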
Abstract: Document parsing and recognition are fundamental capabilities for vision-language models (VLMs) and document processing systems. However, existing Optical Character Recognition (OCR) and document parsing benchmarks are increasingly limited in coverage and difficulty: many focus on common document genres or uniformly sampled pages where modern parsers already perform strongly, while offering limited annotation for expert-domain structures such as chemical formulas, music notation, complex tables, and cross-page layouts. We introduce Dr. DocBench, a difficulty-aware benchmark for expert-level document parsing. Built from a large-scale multilingual book corpus, Dr. DocBench spans 52 BISAC subject domains and selects challenging documents through parser-failure-based sampling, targeting cases where multiple state-of-the-art systems struggle. It contains 4,514 annotated pages from long documents averaging around 100 pages, with 65k high-quality page- and block-level annotations for layout, reading order, hierarchical relations, and domain-specific visual content. Evaluations of pipeline-based parsers and general-purpose VLMs show that strong performance on existing benchmarks does not transfer to our expert-level document parsing. Our analysis reveals substantial failures across subjects, content types, and structural attributes, highlighting Dr. DocBench as a comprehensive testbed for diagnosing and advancing document intelligence.
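[46] LongAttnComp: Cross-Family Context Compression for Long-Context Reasoning cs.CL
Mengmeng Ji, Ravi Shanker Raju, Jonathan Lingjie Li, Chen Wu
TL;DR: This paper presents LongAttnComp, a context compression method for long-context reasoning that works across model families. It fine-tunes a lightweight cross-attention scoring layer and adds token-level chunking, a token-budget top-p algorithm, positional reordering, and a format-agnostic query parser to cut the prefill cost of long sequences (100k+ tokens). With a two-stage fine-tuning recipe, it preserves task accuracy on benchmarks such as InfiniteBench Code-Debug and LongBench v2, substantially outperforms training-free baselines, and transfers across multiple model families.
Details
Motivation: As real-world applications process inputs longer than 100k tokens, the gap between context length and inference efficiency has become a critical bottleneck. Existing training-free attention-based methods fall clearly short on demanding long-context tasks such as code reasoning, so an effective compression method is needed that reduces prefill cost while preserving accuracy.
Result: On InfiniteBench Code-Debug, LongAttnComp matches or exceeds full-context accuracy, substantially outperforms training-free baselines, and transfers across four target models from three families. On LongBench v2, the two-stage recipe largely closes the Stage 1 gap on multi-document reasoning while preserving Code-Debug performance.
Insight: Contributions include: 1) a long-context adaptation of AttnComp with mechanisms such as token-level chunking and a token-budget top-p algorithm; 2) a two-stage fine-tuning recipe that first builds a general retrieval foundation and then extends to multi-hop and reasoning data for broader long-context coverage; 3) context compression that transfers across model families, improving generality.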
Abstract: As real-world applications increasingly require processing inputs of 100k+ tokens, the gap between context length and inference efficiency has become a critical bottleneck. Context compression offers a way to reduce prefill costs while preserving task accuracy. However, existing training-free attention-based methods leave substantial gaps in demanding long-context tasks such as code reasoning. We present LongAttnComp, a long-context adaptation of AttnComp that fine-tunes a lightweight cross-attention scoring layer and introduces token-level chunking, a token-budget top-p algorithm, positional reordering, and a format-agnostic query parser. We further design a two-stage fine-tuning recipe for the compressor: Stage 1 builds a general retrieval foundation from NIAH-style data, and Stage 2 extends it with multi-hop and reasoning data for broader long-context task coverage. On InfiniteBench Code-Debug, LongAttnComp matches or exceeds full-context accuracy, substantially outperforms training-free baselines, and transfers across four target models from three families. On LongBench v2, the two-stage recipe largely closes the Stage 1 gap on multi-document reasoning while preserving Code-Debug performance.
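The abstract names a token-budget top-p algorithm and positional reordering without defining them. One plausible reading, sketched below purely as an assumption, is nucleus-style selection over chunk relevance scores that also respects a hard token budget, followed by restoring the kept chunks to their original order. The function `select_chunks` and its inputs are hypothetical and are not taken from LongAttnComp.

```python
def select_chunks(scores, lengths, top_p, token_budget):
    """Pick chunks by descending score until top_p of the score mass is kept.

    scores[i] is a non-negative relevance score for chunk i (assumed to come
    from some scoring layer), lengths[i] is its token count. Chunks that
    would overflow token_budget are skipped. The kept indices are returned
    in original document order, a simple form of positional reordering.
    """
    total = sum(scores)
    if total <= 0:
        return []
    by_score = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept, mass, used = [], 0.0, 0
    for i in by_score:
        if mass >= top_p:
            break  # enough score mass already covered
        if used + lengths[i] > token_budget:
            continue  # this chunk does not fit; try smaller or later ones
        kept.append(i)
        mass += scores[i] / total
        used += lengths[i]
    return sorted(kept)


chunk_scores = [2, 10, 1, 6, 1]
chunk_lengths = [100, 100, 100, 100, 100]

roomy = select_chunks(chunk_scores, chunk_lengths, top_p=0.75, token_budget=300)
tight = select_chunks(chunk_scores, chunk_lengths, top_p=0.75, token_budget=150)
```

With a roomy budget the top-p threshold decides how many chunks survive; with a tight budget the token limit becomes the binding constraint instead.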
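[47] UniD$^3$: A Knowledge Graph-Enhanced RAG Framework for Drug-Disease Discovery and Reasoning cs.CL
Qing Wang, Tianshi Liu, Minghao Zhou, Jialu Liang, Sen Guo
TL;DR: This paper presents UniD$^3$, an integrated system that combines large language models with knowledge graph-enhanced retrieval-augmented generation (KG-RAG) to extract, organize, and validate drug-disease knowledge from biomedical literature. The framework processes more than 150,000 PubMed articles, builds several knowledge graphs, and generates large-scale structured datasets for drug-disease matching, drug effectiveness assessment, and drug-target analysis.
Details
Motivation: Systematic characterization of drug-disease relationships is essential for drug discovery and repurposing, but biomedical literature is heterogeneous and growing fast, manually curated datasets are incomplete, and LLM-only approaches suffer from hallucination and weak evidence grounding.
Result: UniD$^3$ produces six knowledge graphs and large-scale datasets with tens of thousands of records. It performs well in external validation (F1 of 0.85-0.87 for DDM/DEA and 0.82 for DTA), and clinician review confirms high reliability (AUROC = 0.90). KG-RAG-augmented models outperform standalone LLMs.
Insight: The paper proposes a unified KG-RAG framework that joins LLMs with knowledge graphs and builds those graphs with a dual-stage strategy (paper-level extraction followed by KG-level consolidation) centered on drug and disease entities. This gives a scalable path from unstructured literature to high-quality structured knowledge and supports interpretable exploration.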
Abstract: Systematic characterization of drug-disease relationships is essential for drug discovery and repurposing, yet is hindered by the heterogeneity and rapid growth of biomedical literature. Existing datasets rely on labor-intensive curation and are often incomplete, while LLM-only approaches suffer from hallucination and weak evidence grounding. We introduce UniD$^3$, a unified framework that integrates Large Language Models with Knowledge Graph-enhanced Retrieval-Augmented Generation (KG-RAG) to extract, organize, and validate drug-disease knowledge across Drug-Disease Matching (DDM), Drug Effectiveness Assessment (DEA), and Drug-Target Analysis (DTA). UniD$^3$ processes 157,849 PubMed articles with Llama 3.3-70B and constructs knowledge graphs via a dual-stage strategy combining paper-level extraction with KG-level consolidation centered on drug and disease entities. These graphs support KG-RAG-based generation of structured datasets, evaluated through external benchmarks, fuzzy matching with curated resources, and clinician review. UniD$^3$ produces six knowledge graphs and large-scale datasets, including 28,915 DDM, 15,042 DEA, and over 4,000 DTA QA pairs. External validation shows strong performance (F1: 0.85-0.87 for DDM/DEA; 0.82 for DTA), with clinician review confirming high reliability (AUROC = 0.90). KG-RAG-augmented models outperform standalone LLMs, and the UniD$^3$ chatbot enables interpretable, citation-supported exploration of drug-disease relationships. UniD$^3$ provides a scalable, extensible framework for transforming unstructured biomedical literature into high-quality, structured drug-disease knowledge, supporting AI-driven discovery, repurposing, and precision medicine.
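[48] Cross-lingual Self-Consistency for Multilingual Reasoning with Language Models cs.CL
Ahmed Elhady, Eneko Agirre, Mikel Artetxe
TL;DR: This paper proposes an unsupervised reinforcement learning method that strengthens the multilingual reasoning of large language models by enforcing cross-lingual self-consistency. The method needs neither gold answers nor parallel data, achieves average gains of up to 21.7% on MGSM across 10 languages, and generalizes well to unseen languages and out-of-distribution benchmarks.
Details
Motivation: Although large language models cover more and more languages, their advanced reasoning remains largely confined to high-resource languages such as English. Existing methods are limited by the scarcity of multilingual reasoning data and generalize weakly to unseen languages.
Result: Average gains of up to 21.7% on MGSM across 10 languages, an 18.2% mean improvement on MGSM languages unseen during training, and gains of up to 6.2% on three out-of-distribution benchmarks.
Insight: The core idea is to use cross-lingual self-consistency, the principle that a model should give the same final answer to equivalent problems in different languages, as the reward signal for unsupervised reinforcement learning. The method improves multilingual reasoning and generalization without supervised data and points to further potential in consistency-based approaches.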
Abstract: Despite expanding their multilingual coverage, the advanced reasoning capabilities of LLMs remain largely confined to a few high-resource languages like English. To address this, we propose an unsupervised Reinforcement Learning (RL) approach to enhance multilingual reasoning by enforcing cross-lingual self-consistency: the principle that a model should produce the same final answer for equivalent problems in different languages. Existing methods are limited by the scarcity of multilingual reasoning data and show weak generalization to unseen languages. Our approach requires neither gold answers nor parallel data, and it achieves average gains of up to 21.7% on MGSM across 10 languages. In addition, our method demonstrates strong generalization, with an 18.2% mean improvement on MGSM languages unseen during training, and up to 6.2% gain on 3 out-of-distribution benchmarks. These results show the potential of consistency-based methods to improve the multilingual capabilities of LLMs without requiring supervised data.
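The self-consistency principle described above can be illustrated with a small reward sketch. The abstract does not specify how the reward is computed, so the majority-vote rule, the function name `consistency_rewards`, and the toy answers below are assumptions made purely for illustration.

```python
from collections import Counter


def consistency_rewards(answers_by_lang):
    """Reward each language's final answer by agreement with the cross-lingual majority.

    answers_by_lang maps a language code to the final answer string the model
    produced for the same problem posed in that language.
    NOTE: majority voting is an assumed reward rule, not the paper's stated one.
    """
    counts = Counter(answers_by_lang.values())
    majority_answer, majority_count = counts.most_common(1)[0]
    # If every answer is unique there is no agreement to reward.
    if majority_count < 2:
        return {lang: 0.0 for lang in answers_by_lang}
    return {
        lang: 1.0 if answer == majority_answer else 0.0
        for lang, answer in answers_by_lang.items()
    }


# Toy example (made-up answers): three languages agree, one disagrees.
example = {"en": "42", "es": "42", "sw": "41", "th": "42"}
rewards = consistency_rewards(example)
```

No gold answer appears anywhere in the computation, which is the property the abstract emphasizes: the signal comes only from agreement between languages.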
[49] TimeSage-MT: A Multi-Turn Benchmark for Evaluating Agentic Time Series Reasoning cs.CL | cs.AIPDF
Yaxuan Kong, Qingren Yao, Yuqi Nie, Yichen Li, Yilei Shao
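TL;DR: This paper presents TimeSage-MT, a benchmark for evaluating agentic time series reasoning in multi-turn conversations. It contains 240 tasks and 2,680 dialogue turns across 8 real-world domains, spanning basic exploration to decision-oriented analysis. The authors also evaluate frontier LLMs and a new structured agent named TimeSage, exposing critical weaknesses of current agents on decision-oriented tasks.
Details
Motivation: Existing benchmarks focus on single-step tasks such as forecasting and anomaly detection, overlooking practical workflows in which user goals evolve, agents must build on prior analyses, and conclusions emerge from accumulated evidence. A new benchmark is therefore needed to assess whether agents can perform reliable time series analysis across multi-turn conversations.
Result: Evaluation on TimeSage-MT shows that both frontier LLMs and the TimeSage agent suffer sharp performance drops on decision-oriented tasks, driven mainly by failures in memory, uncertainty handling, and domain-based decision making. The benchmark provides a unified evaluation protocol and a public leaderboard for comparing time series agentic systems.
Insight: The novelty lies in a multi-turn benchmark for time series reasoning, built with a reproducible pipeline that converts real-world data into conversations with verifiable answers. Viewed objectively, the benchmark underscores the importance of evaluating agent reasoning in complex, dynamic interactions and offers a rigorous evaluation foundation for future agent systems.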
Abstract: Time series data inform critical decisions across many real-world domains. While large language model (LLM) agents can analyze data through natural language and tools, it remains unclear whether they can conduct reliable time series analysis across multi-turn conversations. Existing benchmarks focus on single-step tasks such as forecasting and anomaly detection, overlooking practical workflows where user goals evolve, agents must build on prior analyses, and conclusions emerge from accumulated evidence. In this work, we introduce TimeSage-MT, a multi-turn benchmark for agentic time series reasoning with 240 tasks and 2,680 dialogue turns across 8 real-world domains, spanning basic exploration to decision-oriented analysis. TimeSage-MT is built through a reproducible pipeline that converts real-world time series data into multi-turn conversations with verifiable answers. It provides a unified evaluation protocol and public leaderboard for comparing time series agentic systems. To demonstrate the benchmark’s utility, we evaluate frontier LLMs alongside TimeSage, a novel structured agent equipped with a comprehensive time series skill library. The results show sharp performance drops on decision-oriented tasks, driven by failures in memory, uncertainty handling, and domain-based decision making. TimeSage-MT exposes critical gaps in current agentic reasoning and provides a rigorous foundation for future development.
[50] Identifying High-Confidence Social Biases in LLMs for Trustworthy Conversational Tutoring Agents cs.CL | cs.AIPDF
Aitor Arronte Alvarez, Naiyi Xie Fincham
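TL;DR: This study evaluates social bias in large language models in conversational tutoring scenarios, focusing on cases where a model fails to identify a biased judgment yet remains highly confident. The authors propose a new dataset generation method that regenerates student-AI tutor interactions and introduces turns with controlled bias, allowing bias to be evaluated under naturalistic instructional conditions. They find that detecting bias is harder in conversational tutoring than in benchmark evaluations, and that state-of-the-art LLMs are overconfident in their incorrect assessments of stereotypical bias statements, with that confidence affecting their reasoning and feedback quality.
Details
Motivation: Conversational tutoring agents can improve learning engagement and student outcomes, but LLMs may perpetuate or amplify stereotypical social biases, which poses particular risks in educational settings. It is therefore necessary to identify high-confidence bias in tutoring conversations, that is, cases where a model fails to identify a biased judgment while remaining strongly confident in its own assessment.
Result: Computational and human evaluations show that bias detection is more challenging in conversational tutoring contexts than in benchmark evaluations, and that SOTA LLMs are overconfident in their incorrect assessments of stereotypical bias statements. Model confidence strongly influences reasoning and feedback, highlighting the risk of overconfident, biased behavior in LLM-based tutoring agents.
Insight: The novelty lies in a dataset generation method that evaluates bias under naturalistic instructional conditions, and in showing how widespread overconfident bias is among LLMs in tutoring scenarios and how it harms reasoning and feedback. This offers key guidance for building more trustworthy conversational tutoring agents and stresses the importance of context-specific bias evaluation before deployment.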
Abstract: Conversational tutoring agents have been shown to improve learning engagement and student outcomes, and large language models (LLMs) are increasingly used in these systems to provide scalable, personalized feedback. However, LLMs may perpetuate or amplify stereotypical social biases, posing particular risks in educational settings. In this study, we evaluate LLMs in conversational tutoring scenarios to identify high-confidence social biases, instances where models are unable to identify biased judgments in tutoring conversations while maintaining strong confidence in their assessments, potentially affecting their reasoning and the feedback they provide to learners. We present a new dataset generation method that enables bias evaluation under naturalistic instructional conditions by regenerating student-AI tutor interactions and introducing turns with controlled bias derived from a benchmark dataset. Using this data, we assess multiple LLMs’ ability to detect stereotypical biases and analyze the confidence and reasoning underlying their responses through computational and human evaluations. We find that bias detection is substantially more challenging in conversational tutoring contexts than in benchmark-based evaluations, and that state-of-the-art LLMs are overconfident in their incorrect assessments of stereotypical bias statements. Moreover, model confidence strongly influences reasoning and feedback, highlighting the risks of overconfident, biased behavior in LLM-based tutoring agents. We conclude by discussing implications, mitigation considerations, and directions for future research.
[51] EvoPool: Evolutionary Programmatic Annotation for Label-Efficient Specialized Supervision cs.CL | cs.AIPDF
Tianyi Xu, Yaolun Zhang, Xuan Ouyang, Huazheng Wang
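TL;DR: EvoPool is an evolutionary multi-agent framework that addresses the high cost of annotation for large language models in specialized domains. Three specialized agents iteratively generate executable annotator code, a small validation set provides the fitness signal, and a deterministic gate keeps only annotators that pass viability, diversity, and marginal-contribution checks. Finally, the EvoAgg aggregator maps annotator votes to soft training labels, enabling efficient annotation.
Details
Motivation: Large language models excel at general tasks but often underperform smaller supervised models in specialized, high-stakes domains (such as biomedical relation extraction and legal-clause classification), where training labels are costly. EvoPool aims to produce high-quality annotations at low cost through evolutionary programmatic annotation, improving model performance on specialized tasks.
Result: On 7 of 8 specialized and complex tasks where LLMs are weak, EvoPool beats the strongest LLM annotation baseline by an average of +0.141 macro-F1, with gains of +0.301 on ChemProt and +0.265 on PubMed. The annotator pool runs 4,500 to 31,000 times faster than LLM annotation on 100K examples, at near-zero per-example cost.
Insight: The novel elements are an evolutionary multi-agent framework that mimics Darwinian evolution to iteratively refine annotator code, a deterministic gate that enforces viability, diversity, and marginal contribution, and the EvoAgg aggregator, which combines semantic features with annotator-vote features to produce soft labels. Viewed objectively, the method combines programmatic annotation with evolutionary algorithms to deliver efficient, low-cost domain-specific supervision, offering a new approach for specialized tasks where labels are scarce.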
Abstract: Large language models excel at general tasks but underperform smaller supervised models in specialized, high-stakes domains where training labels are costly. We address this regime with EvoPool, an evolutionary multi-agent framework inspired by Darwinian evolution. Three specialized agents iteratively propose executable annotator code, a small validation set provides a fitness signal, and a deterministic gate keeps only annotators that pass viability, diversity, and marginal-contribution checks across generations. Pool votes are mapped to soft training labels by EvoAgg, a text-aware aggregator combining semantic features with annotator-vote features. The authored pool runs at near-zero per-example cost and is 4500 to 31000x faster than LLM annotation on 100K examples. Across 7 of 8 LLM-weak specialized and complex tasks spanning biomedical relation extraction, legal-clause classification, complex reasoning, and dense multi-label biomedical classification, EvoPool beats the strongest LLM annotation baseline by an average +0.141 macro-F1, peaking at +0.301 on ChemProt and +0.265 on PubMed. Code is available at: https://github.com/tianyi0216/EvoPool
[52] Easier to Mislead Than to Correct: Harmful and Beneficial Revision in LLM Conformity cs.CL | cs.AIPDF
Jiaming Qu, Lucheng Fu, Yibo Hu
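TL;DR: This paper studies the conformity risk of LLMs in multi-agent systems. Simulated experiments show that when an LLM sees other agents' answers, it is more readily misled into abandoning an initially correct answer than corrected from an initially wrong one. Authority labels intensify this effect, and common reasoning interventions such as chain-of-thought and reflection do not effectively reduce harmful revision.
Details
Motivation: The work asks whether revisions that LLMs make in response to social cues in multi-agent settings (such as consensus structure and authority labels) tend to correct mistakes or to introduce new errors, in order to assess the practical impact of conformity on system reliability.
Result: Experiments on four open-weight LLMs and seven QA datasets show that initially correct models are misled 2.5 times as often as initially wrong models are corrected; authority labels make models more likely to choose the endorsed answer whether or not it is correct; and interventions such as chain-of-thought and reflection do not reliably reduce harmful revision.
Insight: The novelty lies in distinguishing beneficial from harmful revision and quantifying the asymmetric effect of social cues on each. Viewed objectively, the study exposes the risk of simply aggregating answers in multi-agent systems and argues for a design principle of verifying peer answers rather than following them blindly.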
Abstract: Large language models are increasingly used in multi-agent systems, where they see and respond to other agents’ answers. A key risk is conformity: a model may abandon its own answer simply because others agree on a different one. Prior studies show that LLMs often revise toward a majority answer, but it remains unclear whether these revisions help correct mistakes as often as they introduce new errors. In this paper, we conduct a controlled study in which an LLM first answers a question, then sees simulated peer responses before making a final decision. We manipulate two social cues: consensus structure and authority labels assigned to peers, and measure how they influence beneficial and harmful revisions. Across four open-weight LLMs and seven QA datasets, we find that peer agreement makes it much easier to mislead initially correct models than to correct initially wrong ones. Authority labels make models more likely to choose the endorsed answer, regardless of whether it is correct. More concerningly, generic reasoning interventions such as chain-of-thought and reflection do not reliably reduce harmful revision while preserving beneficial revision. These findings suggest that multi-agent LLM systems should verify peer answers rather than simply aggregate them.
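The distinction between harmful and beneficial revision can be made concrete with a short sketch. The definitions follow the abstract (a harmful revision turns an initially correct answer wrong, a beneficial one turns an initially wrong answer right); the function name `revision_rates` and the toy records are hypothetical and are not data from the paper.

```python
def revision_rates(records):
    """Compute (harmful_rate, beneficial_rate) from before/after correctness pairs.

    records is an iterable of (initial_correct, final_correct) booleans, one pair
    per question: correctness before and after the model sees peer responses.
    """
    records = list(records)
    initially_correct = [r for r in records if r[0]]
    initially_wrong = [r for r in records if not r[0]]
    # Harmful: started correct, ended wrong. Beneficial: started wrong, ended correct.
    harmful = sum(1 for _, final in initially_correct if not final)
    beneficial = sum(1 for _, final in initially_wrong if final)
    harmful_rate = harmful / len(initially_correct) if initially_correct else 0.0
    beneficial_rate = beneficial / len(initially_wrong) if initially_wrong else 0.0
    return harmful_rate, beneficial_rate


# Toy records (made up for illustration only).
toy = [
    (True, True), (True, False), (True, False), (True, True),
    (False, False), (False, True), (False, False), (False, False),
]
harmful_rate, beneficial_rate = revision_rates(toy)
```

Conditioning each rate on its own starting population (initially correct versus initially wrong) keeps the two quantities comparable even when a model's initial accuracy is far from 50%.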
[53] Encoded but Not Routed: Explaining the Table-Chart Gap in Scientific Claim Verification cs.CLPDF
Sunisth Kumar, Xanh Ho, Tim Schopf, Andre Greiner-Petter, Florian Boudin
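TL;DR: This paper investigates why multimodal large language models perform differently on table and chart evidence in scientific claim verification. Layer-wise linear probing and attention analysis show that the models do encode chart information but fail to use it effectively at prediction time; the problem is not a failure of encoding.
Details
Motivation: Multimodal LLMs used in scientific peer review verify paper claims better against table evidence than against chart evidence. The work aims to determine whether this gap stems from a failure to extract information or a failure to use it.
Result: Experiments on three open-weight vision-language models, with table and chart evidence derived from the same underlying data, show that chart information is encoded in the models' intermediate representations but does not reach the prediction position. The gap is absent for tables, and the finding holds across all conditions tested.
Insight: The novelty lies in reframing the table-chart performance gap as a failure to route encoded visual information at prediction time rather than a failure of encoding itself. Attention analysis reveals that this disconnect takes two architecturally distinct forms across model families, offering new insight into how information flows inside these models.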
Abstract: Multimodal LLMs are increasingly used to assist scientific peer review, where a core requirement is verifying whether claims in a paper are supported by its evidence. Prior work has shown that models perform substantially better at this task when the evidence is a table than when it is a chart of the same underlying data. This raises the question of whether models fail to extract information from charts, or extract it but fail to use it when forming their prediction. We study this question through layer-wise linear probing and attention analysis on three open-weight VLMs over table and chart evidence representing the same underlying data. We find consistent evidence for the latter. Chart information is encoded in the models’ intermediate representations but does not reach the prediction position, a gap that is absent for tables and holds across all conditions tested. Attention analysis further reveals that this disconnect takes two architecturally distinct forms across model families. These findings reframe the table-chart gap as a failure of how encoded visual information is routed at prediction time, rather than a failure of encoding itself.
[54] When Meaning Travels: A Granular Lens on Hybrid-MoE’s Role in Idiomatic Understanding for Language Models cs.CLPDF
Sarmistha Das, Vaibhav Vishal, Shreyas Guha, Amaan Ali, Kitsuchart Pasupa
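TL;DR: Targeting the difficulty of understanding culturally rich idioms in low-resource Southeast Asian languages (such as Hindi, Bengali, and Thai), this paper presents the Varnika multimodal idiom corpus and the HybridMoE (Hybrid Mixture-of-Experts) framework. The framework integrates outputs from both selected and unselected experts and adds Idiomatic Property Signals via masked multimodal embeddings, improving the representation of figurative meaning and cultural semantics in multilingual multimodal settings.
Details
Motivation: In low-resource Southeast Asian languages, culturally rich idioms pose significant obstacles to computational modeling and cross-linguistic transfer because of their deep metaphorical complexity. The paper aims to address the challenge of retaining the figurative and cultural semantics of idioms in these languages.
Result: Empirical evaluation shows that HybridMoE achieves 5-6% performance gains across advanced vision-language models. IDIO-TONE and the Idiomatic Validation Score, a three-stage evaluation pipeline measuring literal translation fidelity, visual-semantic alignment, and idiomatic meaning retention, confirm improved representation of figurative language and culturally embedded meaning in multilingual multimodal settings.
Insight: The novel elements are the Varnika multimodal corpus of 3,533 multilingual idioms annotated with seven idiomatic tones, and the HybridMoE framework, which embeds multiple idiomatic expert opinions, mitigates expert sparsity through controlled hybridization, and strengthens understanding with Idiomatic Property Signals via masked multimodal embeddings.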
Abstract: In the contemporary epoch of multilingual education, learning idioms provides a fascinating gateway towards creativity, cultural values, historical context, and diverse perspectives inherent to various linguistic traditions. This paper addresses the challenge of retaining figurative and cultural semantics in low-resource Southeast Asian languages such as Hindi, Bengali, and Thai, where culturally rich idioms pose significant obstacles for computational modeling and cross-linguistic transfer due to their deep metaphorical complexity. To tackle such complexity, we present Varnika, a reconstructed multimodal idiom corpus comprising 3,533 multilingual idioms, enriched with seven idiomatic tones aligned with both textual and visual representations. Additionally, to infer informative idiomatic understanding, we introduce a Hybrid Mixture-of-Experts (HybridMoE) framework that embeds multiple idiomatic expert opinions while mitigating expert sparsity by integrating outputs from both selected and unselected experts through controlled hybridization, further augmented with Idiomatic Property Signals via masked multimodal embeddings. To analyze the performance across multiple dimensions, we propose the IDIO-TONE and Idiomatic Validation Score, a three-stage evaluation pipeline measuring (i) literal translation fidelity, (ii) visual-semantic alignment, and (iii) idiomatic meaning retention. Empirical evaluations highlight that HybridMoE achieves 5–6% performance gains across advanced vision language models, demonstrating improved representation of figurative language and culturally embedded meaning in multilingual multimodal settings.
[55] Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning cs.CL | cs.AI | cs.LGPDF
Atoosa Chegini, Soheil Feizi
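TL;DR: This paper proposes Chunk-Level Guided Generation, a training-free method that uses an off-the-shelf large language model as a process scorer to guide a small model's generation on mathematical reasoning tasks. At each step, the small model samples fixed-length candidate chunks and the large model selects the best one by likelihood-based scoring, steering generation before errors can propagate.
Details
Motivation: Selecting the best response from multiple samples fails once the small model has already committed to an incorrect reasoning path, while alternatives such as PRM guided search require a reward model trained with step-level labels, with the attendant complexity and cost. The method is designed to avoid both problems.
Result: On the GSM8K, MATH, Minerva Math, AMC23, and AIME24 mathematical reasoning benchmarks, using Qwen2.5 and Llama family models, Contrastive-Guided Selection (CGS) outperforms majority voting by up to 28 percentage points and, under matched guidance budgets, matches or outperforms Qwen2.5-Math-PRM-72B guided search on most benchmarks without reward-model training. For example, with Qwen2.5-7B guided by Qwen2.5-72B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math.
Insight: The novel elements are the use of fixed-length chunks to avoid the length bias in large-model likelihood scoring, and a contrastive selection rule that exploits the divergence between the large and small models' preferences. Together they provide an efficient, training-free alternative for process-guided generation that improves performance and yields substantially shorter reasoning traces than PRM guided search.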
Abstract: Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but fails when the small model has already committed to incorrect reasoning paths. PRM guided search avoids this by scoring candidate continuations during generation, but requires a reward model trained with step-level labels. We propose Chunk-Level Guided Generation, a training-free alternative that uses an off-the-shelf large language model as a process scorer. At each step, a small model samples k fixed-length candidate chunks, while the larger model scores the candidates using likelihoods without generating any text. The selected chunk is committed before the next step, steering generation before errors can propagate. We instantiate this framework with two selection rules: Likelihood-Guided Selection (LGS), which selects the chunk with the highest length-normalized large-model log-probability, and Contrastive-Guided Selection (CGS), which subtracts the small model’s log-probability to favor chunks where the large model’s preference diverges from the small model’s. We show that scoring variable-length reasoning steps with large-model likelihoods is unreliable due to a systematic length bias that persists even after length normalization, and that fixed-length chunks avoid this confound. On GSM8K, MATH, Minerva Math, AMC23, and AIME24 with Qwen2.5-1.5B guided by Qwen2.5-32B and Llama-3.2-1B guided by Llama-3.1-70B, CGS outperforms majority voting by up to 28 pp and, under matched guidance budgets, matches or outperforms Qwen2.5-Math-PRM-72B guided search on most benchmarks without reward-model training. With Qwen2.5-7B guided by Qwen2.5-72B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math at k=16, surpassing majority voting by 4–6 pp. Finally, Chunk-Level Guided Generation produces substantially shorter reasoning traces than PRM guided search.
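The two selection rules named in the abstract can be sketched as follows. This is a minimal illustration that assumes each candidate chunk already has a summed log-probability under the large and small models; the function names, the unweighted subtraction in CGS, and the toy scores are assumptions, since the abstract does not give the exact formulas.

```python
def select_lgs(large_logprobs, chunk_len):
    """Likelihood-Guided Selection: pick the chunk with the highest
    length-normalized large-model log-probability."""
    scores = [lp / chunk_len for lp in large_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)


def select_cgs(large_logprobs, small_logprobs, chunk_len):
    """Contrastive-Guided Selection: subtract the small model's log-probability
    so that chunks the large model prefers more than the small model win.
    ASSUMPTION: a plain, unweighted difference of normalized log-probabilities."""
    scores = [
        (large - small) / chunk_len
        for large, small in zip(large_logprobs, small_logprobs)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy summed log-probabilities for k=3 candidate chunks of 8 tokens each
# (made-up numbers, for illustration only).
large = [-12.0, -10.0, -15.0]
small = [-6.0, -4.0, -20.0]
lgs_choice = select_lgs(large, chunk_len=8)
cgs_choice = select_cgs(large, small, chunk_len=8)
```

In this toy case LGS picks candidate 1 (the large model's most likely chunk), while CGS picks candidate 2, the chunk the small model found unlikely but the large model rates comparatively well. Because all chunks share the same length, the normalization is a constant factor and cannot bias the ranking, which mirrors the abstract's argument for fixed-length chunks.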
[56] Construction of Historical Knowledge Graphs Based on BERT and Graph Neural Networks cs.CL | cs.AIPDF
Ping Li, Bartlomiej Brzozka
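TL;DR: This paper proposes a high-level architecture that combines BERT with graph neural networks (GNNs) to extract entities and relations from various types of historical texts in order to build historical knowledge graphs. The approach aims to systematically address linguistic ambiguity, context-bound references, and the lack of established grammatical norms in historical texts. Experiments show that the joint system outperforms conventional rule-based methods and other deep-learning baselines in precision, recall, and F1-score.
Details
Motivation: The motivation is to convert large volumes of traditional historical text into structured knowledge graphs, through digital humanities research and large-scale historical data analysis, while coping with the linguistic challenges specific to historical texts: ambiguity, context-bound references, and irregular grammar.
Result: Experiments on a comprehensive collection of municipal records, parliamentary documents, and historical correspondence show that the joint BERT-GNN system achieves higher precision, recall, and F1-score than conventional rule-based methods and other popular deep-learning baselines (as reported in Table 2), and handles complex nested structures and implicit references with sufficient accuracy and thoroughness.
Insight: The novelty lies in combining context-sensitive semantic representation techniques (such as BERT) with relational graph learning algorithms (GNNs) to extract historical data automatically and add it to the knowledge repository. The paper also develops, as a companion method, a new image retrieval system based on FastRQNet and the pre-trained vision-language model Vilt-qaformer+RoBInet.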
Abstract: Through digital humanities research and large-scale historical data analysis, a significant amount of traditional historical text is being converted into structured knowledge graphs. This paper presents a high-level architecture that combines Bidirectional Encoder Representations from Transformers (BERT) and graph neural networks (GNNs) to extract entities and relationships from various types of historical texts. The architecture systematically addresses the linguistic ambiguities, context-bound references, and lack of established grammatical norms found in traditional historical texts. This study also develops a new image retrieval system based on FastRQNet and the pre-trained vision-language model Vilt-qaformer+RoBInet in accordance with the aforementioned recommendations. The experiments make full use of a comprehensive collection of municipal records, parliamentary documents, and historical correspondence. When compared to conventional rule-based techniques and other popular deep-learning baselines, the joint BERT-GNN system achieves higher Precision, Recall, and F1-score (Table 2). Complex nested structures and implicit reference issues can be handled by this structure with sufficient accuracy and thoroughness when creating knowledge graphs. The aforementioned experiments show that combining relational graph learning algorithms with context-sensitive semantic representation techniques can automatically extract historical data to add accumulated wisdom to the knowledge repository.
[57] HarnessForge: Joint Harness and Policy Evolution for Adaptive Agent Systems cs.CLPDF
Mingju Chen, Can Lv, Guibin Zhang, Heng Chang, Shiji Zhou
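TL;DR: This paper proposes HarnessForge, a framework for meta-adaptation of LLM agent systems. It models an agent system as a harness-policy pair and co-evolves the external execution structure and the internal reasoning policy through fault-guided harness tailoring and harness-conditioned policy alignment, improving adaptability across heterogeneous tasks.
Details
Motivation: Existing methods usually either adapt an agent's external execution structure (the harness) or train its internal reasoning policy in isolation. Full-system co-adaptation is under-studied: the adaptation space between structure and execution is rarely made explicit, and the compatibility between the external harness and the internal reasoner is not optimized jointly.
Result: On five benchmarks from diverse domains, HarnessForge consistently improves the Qwen3-4B and Qwen3-8B backbones, outperforming harness-only and policy-only baselines by up to 12.0% over the strongest baseline while achieving favorable rollout-efficiency tradeoffs.
Insight: The core novelty is the explicit decomposition of an agent system into a harness-policy pair and a meta-adaptive framework that co-evolves the two. Viewed objectively, it emphasizes and optimizes the compatibility between the executable structure and the reasoning policy, a key and often overlooked dimension of efficient agent-system adaptation.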
Abstract: LLM agents are increasingly expected to operate across heterogeneous task regimes that require distinct execution paradigms. This challenges fixed agent systems and motivates system-level meta-adaptation beyond isolated component updates. While existing works have adapted external harness or trained underlying reasoning policies, full-system adaptation remains insufficiently characterized. The adaptation space between structure and execution is rarely made explicit, and the compatibility between the external harness and the internal reasoner is not optimized jointly. We propose HarnessForge, a meta-adaptive framework for evolving LLM agent systems. HarnessForge formulates an agent system as a harness–policy pair, defining a stable adaptation space that separates harness-level execution structure from policy-level reasoning behavior. It then performs harness–policy co-evolution through fault-guided harness tailoring and harness-conditioned policy alignment. Experiments across five benchmarks from diverse domains show that HarnessForge consistently improves both Qwen3-4B and Qwen3-8B backbones, outperforming harness-only and policy-only baselines with gains of up to 12.0% over the strongest baseline and achieving favorable rollout-efficiency tradeoffs, demonstrating that harness–policy co-evolution is effective, and that executable compatibility between the harness and reasoning policy is essential for agent-system adaptation. The code is available at https://github.com/mingju-c/HarnessForge.
[58] Cost-Aware Diffusion Draft Trees for Speculative Decoding cs.CLPDF
Shuai Zhang, Huachuan Qiu, Hongliang He, Yong Dai
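TL;DR: This paper proposes CaDDTree, a cost-aware diffusion draft tree method that optimizes token throughput in speculative decoding. By jointly selecting the tree structure and node budget, it directly maximizes the expected number of tokens generated per unit time. It needs no offline budget search and adapts the budget each round from the current per-position distributions and verification cost.
Details
Motivation: Existing speculative decoding methods such as DDTree build candidate trees by maximizing expected acceptance length, but that metric is non-decreasing in the budget and ignores verification cost, so it offers no principled basis for budget selection. A method that directly optimizes throughput, not just acceptance length, is needed.
Result: Experiments on Qwen3-4B and Qwen3-8B across eight benchmarks spanning reasoning, coding, and instruction following show that CaDDTree matches or surpasses DDTree with oracle budget selection on nearly all tasks.
Insight: The novelty lies in making throughput (tokens per unit time) the direct optimization objective and modeling draft and verification latencies explicitly. The paper proves that the throughput function is unimodal under a convex verification cost, which enables an efficient greedy stopping rule. This gives adaptive budget selection in speculative decoding both a theoretical basis and a practical method.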
Abstract: Speculative decoding accelerates inference by having a lightweight drafter propose tokens verified in parallel by the target language model. Block diffusion drafters such as DFlash generate an entire draft block in one pass, yielding per-position marginals; DDTree uses these to build a candidate tree that maximizes expected acceptance length under a fixed node budget. We observe, however, that acceptance length is non-decreasing in budget: it always favors larger trees regardless of verification cost, offering no principled basis for budget selection. We introduce CaDDTree (Cost-aware Diffusion Draft Tree), a method that directly optimizes token throughput (expected tokens generated per unit time) by jointly selecting the tree structure and node budget. We model draft and verification latencies explicitly, show that the throughput objective decomposes into a per-round one-dimensional search over the budget, and prove that under a convex verification cost the throughput function is unimodal, enabling an efficient greedy stopping rule. CaDDTree requires no offline budget search, adapting the budget each round from the current per-position distributions and verification cost. Experiments on Qwen3-4B and Qwen3-8B across eight benchmarks spanning reasoning, coding, and instruction-following tasks show that CaDDTree matches or surpasses DDTree with oracle budget selection on nearly all tasks.
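The greedy stopping rule that unimodality enables can be sketched in a few lines. The throughput form used here (expected accepted tokens divided by draft latency plus verification cost), the function name `greedy_budget`, and all numbers are illustrative assumptions; the abstract does not give the paper's exact objective.

```python
def greedy_budget(expected_accept, draft_latency, verify_cost):
    """Grow the node budget until throughput stops improving.

    expected_accept[b - 1] is the expected accepted length at budget b
    (non-decreasing in b); verify_cost(b) is the verification latency at budget b.
    ASSUMPTION: throughput(b) = expected_accept / (draft_latency + verify_cost(b)).
    """
    best_b = 1
    best_tp = expected_accept[0] / (draft_latency + verify_cost(1))
    for b in range(2, len(expected_accept) + 1):
        tp = expected_accept[b - 1] / (draft_latency + verify_cost(b))
        if tp <= best_tp:
            # Under unimodality, throughput cannot recover after it stops rising.
            break
        best_b, best_tp = b, tp
    return best_b, best_tp


# Toy inputs (made up): diminishing returns in acceptance, convex verification cost.
accept = [1.0, 1.8, 2.4, 2.8, 3.0, 3.1]


def cost(b):
    return 2.0 + 0.05 * b * b


best_b, best_tp = greedy_budget(accept, draft_latency=1.0, verify_cost=cost)

# Exhaustive check over every budget, to compare with the greedy answer.
brute_b = max(range(1, len(accept) + 1),
              key=lambda b: accept[b - 1] / (1.0 + cost(b)))
```

On these toy inputs the greedy rule stops at budget 4, which is also the exhaustive optimum. Acceptance length alone would always pick the largest budget (6), which is the failure mode the abstract points out.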
[59] CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs cs.CLPDF
Yangfan Ye, Xiaocheng Feng, Jialong Tang, Xiayu Cao, Zihan Zhang
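TL;DR: This paper introduces CultureForest, a benchmark for evaluating the cultural norm grounded reasoning of large language models. It contains 5,378 examples across 8 domains and 53 countries/regions and supports progressive evaluation from multiple-choice to open-ended generation. Even top-tier models degrade substantially in open-ended settings and show pronounced cross-region disparities.
Details
Motivation: Existing research largely reduces the cultural intelligence of LLMs to a knowledge-level problem, overlooking whether models can effectively use their acquired cultural knowledge in realistic scenarios. The paper aims to fill this gap by evaluating models' practical reasoning grounded in cultural norms.
Result: Extensive experiments on CultureForest show that top-tier models degrade substantially on open-ended generation and exhibit pronounced cross-region disparities. Targeted analysis reveals consistent patterns, including limited gains from test-time reasoning, shared regional preference structures, and conservative responses.
Insight: The novelty lies in disentangling cultural knowledge acquisition from cultural reasoning and in shifting from knowledge-centric evaluation toward measuring knowledge-grounded reasoning. The study finds that although LLMs possess substantial cultural knowledge, its effective use is a performance bottleneck, which points to a new direction for future evaluation.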
Abstract: Existing research largely reduces cultural intelligence in LLMs to a knowledge-level problem, overlooking whether models can effectively utilize their acquired knowledge in realistic scenarios. To bridge this gap, we introduce CultureForest, a benchmark for Cultural Norm Grounded Reasoning. Each question is grounded in a small set of atomic norms, enabling verifiable and attributable evaluation. CultureForest comprises 5,378 examples across 8 domains and 53 countries/regions, and supports a progressive evaluation from multiple-choice to open-ended generation. Extensive experiments reveal that even top-tier models degrade substantially in open-ended settings, accompanied by pronounced cross-region disparities. Through targeted analysis, we uncover several consistent patterns: (1) test-time reasoning yields limited gains and may exacerbate inequity; (2) models exhibit highly shared regional preference structures; (3) model responses are markedly conservative, especially under stricter cultural constraints; and (4) by disentangling cultural knowledge acquisition from cultural reasoning, we show that while LLMs possess substantial cultural knowledge, their performance is further bottlenecked by its effective use. These findings point to a necessary shift from knowledge-centric evaluation toward measuring knowledge-grounded reasoning.
[60] LayerRoute: Input-Conditioned Adaptive Layer Skipping via LoRA Fine-Tuning for Agentic Language Models cs.CL | cs.AI | cs.LGPDF
Prateek Kumar Sikdar
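TL;DR: This paper proposes LayerRoute, a lightweight adapter that adaptively skips Transformer layers according to input type in agentic language model systems. It adds a per-layer router with very few parameters and LoRA adapters to each Transformer block while keeping the backbone weights frozen. After end-to-end training, the model learns to distinguish tool calls from planning/reasoning steps and skips more computation for the former.
Details
Motivation: Current agentic language model systems apply the same compute to structurally very different steps: tool calls (short, deterministic) and open-ended planning/reasoning (long, complex). This wastes computational resources.
Result: After only 6.4 minutes of training on an A100 GPU, LayerRoute achieves a 12.91% skip differential: tool-call steps skip 15.25% of FLOPs while planning steps skip only 2.34%. Model quality improves owing to LoRA adaptation, with perplexity falling by 1.29 on tool calls and 1.30 on planning steps.
Insight: The core novelty is input-conditioned dynamic selection of the compute path using very few trainable parameters (a tiny per-layer router plus LoRA adapters), aligning compute allocation with task heterogeneity. It is an efficient approach to adaptive inference.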
Abstract: Agentic language model systems alternate between two structurally distinct step types: structured tool calls (short, deterministic, low perplexity) and open-ended planning/reasoning steps (long, complex, high perplexity). Despite this heterogeneity, current inference systems apply identical compute to every step. We introduce LayerRoute, a lightweight adapter that learns to selectively skip transformer blocks on a per-input basis. LayerRoute augments each of the 24 transformer blocks in Qwen2.5-0.5B-Instruct with: (1) a per-layer router (~897 parameters, Linear(896,1)) that outputs a hard binary gate via the straight-through estimator, and (2) LoRA adapters (rank 8, ~1.08M parameters) on the Q/K/V/O attention projections. The backbone weights remain frozen. A single end-to-end training pass on agentic data (Hermes, Glaive, GSM8K, Turing) with a gate regularisation term forces the system to discover which blocks are skippable per input type. After 3,000 steps (6.4 minutes on an A100 40GB), LayerRoute achieves a 12.91% skip differential: tool calls skip 15.25% of FLOPs while planning steps skip only 2.34%, using only 1.10M trainable parameters (0.22% of the 494M backbone). Quality improves over the base model due to LoRA adaptation, with perplexity delta of -1.29 on tool calls and -1.30 on planning.
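The parameter counts quoted in the abstract can be checked with simple arithmetic. The sketch below assumes a hidden size of 896, 24 layers, and rank 8 (all stated in the abstract) plus a key/value projection width of 128, which is an assumption about the Qwen2.5-0.5B attention configuration and is not stated in the source.

```python
HIDDEN = 896    # from the abstract: Linear(896, 1)
N_LAYERS = 24   # from the abstract
RANK = 8        # from the abstract
KV_DIM = 128    # ASSUMPTION: K/V projection width (grouped-query attention)


def linear_params(d_in, d_out, bias=True):
    """Parameters of a dense layer: weight matrix plus optional bias."""
    return d_in * d_out + (d_out if bias else 0)


def lora_params(d_in, d_out, rank):
    """LoRA adds two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


router_per_layer = linear_params(HIDDEN, 1)        # 896 weights + 1 bias
lora_per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)              # Q projection
    + 2 * lora_params(HIDDEN, KV_DIM, RANK)        # K and V projections
    + lora_params(HIDDEN, HIDDEN, RANK)            # O projection
)
total_trainable = N_LAYERS * (router_per_layer + lora_per_layer)
fraction_of_backbone = total_trainable / 494e6     # 494M backbone, per the abstract
skip_differential = 15.25 - 2.34                   # tool-call skip minus planning skip
```

Under these assumptions the router has 897 parameters per layer, the LoRA adapters total 1,081,344 parameters (about 1.08M), and the overall trainable count of 1,102,872 is about 0.22% of the backbone, consistent with the figures in the abstract. The 12.91% skip differential is simply the gap between the two reported skip rates.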
[61] Mechanistic Diagnostics of Spatial Lexical Bias in Multimodal Large Language Model Spatial Reasoning cs.CL | cs.CVPDF
Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng, Wang Yang
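TL;DR: This paper studies spatial lexical bias in the spatial reasoning of multimodal large language models: the tendency to select answer options that contain spatial relation words. Even when a model answers a binary spatial question correctly, its decision becomes fragile once a third spatial option is added. Using mechanistic interpretability tools, the authors find that the bias originates mainly on the language side, not the visual side, and propose a mitigation based on a lightweight LLM-only DPO update.
Details
Motivation: Multimodal large language models are unreliable on spatial multiple-choice questions, a failure usually attributed to poorly attended visual information. The paper aims to expose a complementary failure mode, spatial lexical bias, in which spatial relation words unduly attract the model's decision.
Result: The phenomenon is widely observed across nine open-weight MLLMs. The proposed lightweight DPO update lifts four-way robust accuracy by up to 100 points on synthetic data, and by 68.0, 32.6, and 20.1 points on the WhatsUp, SpatialMQA-Direct, and VSR evaluation datasets, respectively.
Insight: The novelty lies in identifying and diagnosing a language-side spatial lexical bias in MLLMs and in using mechanistic interpretability tools (such as residual-stream probes and activation patching) to trace it to specific LLM-side channels and neurons. Viewed objectively, the work extends the account of failure from visual attribution to language bias and suggests that language-only fine-tuning can mitigate bias in multimodal models.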
Abstract: Multimodal large language models (MLLMs) remain unreliable on spatial multiple-choice questions, and their failures are often attributed to poorly attended visual information. In this work, we identify a complementary failure mode, spatial lexical bias: adding a spatial relation word to the answer options can attract the model’s decision and make the newly added option likely to be selected. Using nine open-weight MLLMs, we show that this phenomenon is widely observed. In particular, models can answer a binary spatial question correctly, yet consistently select an incorrect third spatial option once it is added to the answer set. We isolate such binary-stable but ternary-fragile cases as diagnostic examples and leverage mechanistic interpretability tools, revealing that a substantial part of the failure originates on the language side rather than the visual side: visual attention analyses and residual-stream probes show the correct spatial relation remains internally available on these failures, while irrelevant-option controls, activation patching, and sparse component interventions trace the bias to specific LLM-side channels and neurons. Based on this finding, we show that a lightweight LLM-only DPO update on tiny single-object-pair synthetic data mitigates the bias, lifting four-way robust accuracy by up to 100 points on synthetic data, and by 68.0, 32.6, and 20.1 points on broader evaluation datasets WhatsUp, SpatialMQA-Direct, and VSR.
[62] What to Format and How: A Benchmark and Workflow Approach for Document Formatting cs.CLPDF
Shihao Rao, Liang Li, Jiapeng Liu, Tong Lin, Bing Li
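TL;DR: Addressing content-aware scenarios in automated document formatting, this paper proposes the DocFormBench benchmark and the DocFormFlow workflow method. DocFormBench extends Text-to-Format evaluation to cover diverse formatting requirements; DocFormFlow decouples the task into what to format (target localization) and how to format (modification execution) to reduce redundant document reading. Experiments show that the method improves formatting accuracy and lowers token consumption across multiple large language models and multimodal models.
Details
Motivation: Large language models open up new possibilities for automated document formatting, but real-world formatting often requires identifying targets based on document content. This content-aware setting is challenging and underexplored, largely because dedicated evaluation datasets are lacking.
Result: Extensive experiments on DocFormBench show that, compared with representative baselines, DocFormFlow consistently improves formatting accuracy while reducing token consumption across multiple LLMs and multimodal models.
Insight: The main novelty is a workflow that decouples formatting into target localization (what) and modification execution (how), which removes the redundant document reading of existing methods. Further analysis shows that precise target localization is the primary factor influencing formatting performance. The benchmark and workflow provide a foundation for future research on more intelligent and reliable document formatting.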
Abstract: Recent advances in large language models (LLMs) have opened up new possibilities for automated document formatting. However, real-world formatting often requires identifying targets based on document content. This content-aware setting remains challenging and underexplored, primarily due to the lack of dedicated evaluation datasets. To enable evaluation in realistic content-aware scenarios, we introduce DocFormBench, a benchmark that extends Text-to-Format evaluation to diverse formatting requirements, along with metrics for both accuracy and efficiency. To mitigate redundant document reading in existing methods during formatting, we propose DocFormFlow, a workflow formatting method that decouples target localization from modification execution into what to format and how. Extensive experiments across multiple LLMs and multimodal models show that DocFormFlow consistently improves formatting accuracy while reducing token consumption compared to representative baselines. Further analysis reveals that precise target localization is the primary factor influencing formatting performance. We hope DocFormBench and DocFormFlow will facilitate future research toward more intelligent and reliable document formatting.
[63] MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills? cs.CL | cs.AI | cs.LGPDF
Xinyu Che, Junqi Xiong, Yunfei Ge, Xinping Lei, Shihao Li
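TL;DR: This paper proposes MMG2Skill, a framework that converts multimodal, heterogeneous, noisy procedural knowledge from the Web (guides) into skills that agents can execute, and keeps improving those skills from trajectory feedback. The authors also build MMG2Skill-Bench, the first benchmark for this task, and validate the framework on GUI control, open-ended gameplay, and strategic card play.
Details
Motivation: Abundant procedural knowledge on the Web (guides) could help agents solve long-horizon tasks, but it is written for humans and is typically multimodal, heterogeneous, and noisy, so agents cannot use it directly as executable skills. A method is needed to convert human-oriented guides into agent-executable skills and improve them continuously.
Result: Across GUI control, open-ended gameplay, and strategic card play with six vision-language model (VLM) backbones, MMG2Skill outperforms baseline agents in every model-domain setting, with macro-average gains of +12.8 to +25.3 percentage points. Ablation studies show that both structured skill construction and trajectory-driven revision are necessary for the improvements.
Insight: The novelty lies in formalizing guide-to-skill learning as a closed-loop framework that compiles guides into editable skills, conditions a fixed VLM agent on those skills during execution, and revises the skills from trajectory-level root-cause feedback instead of benchmark scores. In addition, on success-inferable tasks, analyzer-based early stopping prevents late-stage performance regressions and saves a substantial number of attempts.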
Abstract: Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is often multimodal, heterogeneous, noisy, and implicitly assumes human executors, making it difficult to use directly as the skills required by agents. To bridge the gap between human-oriented guides and agent-executable skills, we formalize this problem as guide-to-skill learning: converting in-the-wild guides into executable skills and continuously improving them from trajectories observable to the agent. To evaluate the capability of existing agents on this task, we introduce MMG2Skill-Bench, the first benchmark designed for this problem. We further propose MMG2Skill, a closed-loop framework that compiles guides into editable skills, conditions a fixed vision-language model (VLM) agent on these skills during execution, and revises the skills from trajectory-level root-cause feedback without using benchmark scores. Across GUI control, open-ended gameplay, and strategic card play with six VLM backbones, MMG2Skill consistently outperforms vanilla baseline agents in every model-domain setting, achieving macro-average gains of +12.8 to +25.3 percentage points across backbones. Ablation studies show that directly prompting agents with raw guides can degrade performance, while both structured skill construction and trajectory-driven revision are necessary for the observed improvements. On success-inferable tasks, analyzer-based early stopping further prevents late-stage performance regressions and saves 25%-53% of attempts when the success signal is properly calibrated.
[64] CARTE: A Benchmark for Mapping Language Model Knowledge Across France cs.CLPDF
Sarah Almeida Carneiro, Christos Xypolopoulos, Xiao Fei, Yang Zhang, Michalis Vazirgiannis
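TL;DR: This paper introduces CARTE, a multiple-choice benchmark for evaluating how well large language models reason over fine-grained geographic knowledge within France. It contains 2,431 questions covering 13 regions of France and 14 thematic domains, and adds CARTE-LV, a subset focused on linguistic variation. The study evaluates 27 LLMs ranging from 1B to 12B parameters and reveals performance differences across regions and model scales.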
Details
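Motivation: Existing benchmarks focus mainly on national-level cultural understanding and overlook regional variation within a country, as well as the need to distinguish closely related regional contexts. CARTE aims to fill this gap by evaluating fine-grained LLM reasoning over geographically grounded, regionally differentiated knowledge within France.
Result: 27 LLMs (1B to 12B parameters) were evaluated under few-shot settings. Performance differs across regions and model scales, which suggests systematic gaps in pretraining coverage and limited robustness to intra-national variation.
Insight: The novelty is CARTE itself, presented as the first benchmark dedicated to fine-grained, region-level geographic and cultural knowledge within a single country, together with the CARTE-LV subset targeting linguistic variation. It offers a new tool and perspective for evaluating and improving how LLMs handle complex, multi-level regional knowledge.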
Abstract: We introduce CARTE (Culturally Anchored Regional-Territorial Evaluation), a multiple-choice benchmark for evaluating the ability of large language models (LLMs) to perform fine-grained reasoning over geographically grounded and regionally differentiated knowledge within France. While prior benchmarks focus on national-level cultural understanding, they largely overlook intra-country variation and the need to distinguish between closely related regional contexts. CARTE addresses this gap by introducing 2,431 questions spanning the 13 metropolitan regions of France and covering 14 thematic domains, including culture, language, demographics, economy, environment, and mobility. We further introduce CARTE-LV, a subset targeting Linguistic Variation across French regions, enabling focused evaluation of language-related differences. We evaluate 27 LLMs ranging from 1B to 12B parameters under few-shot settings. Our experiments reveal performance disparities across regions and model scales, suggesting systematic gaps in pretraining coverage and limited robustness to intra-national variation.
[65] PlanarBench: Evaluating LLM Spatial Reasoning via Planar Graph Drawing cs.CL | cs.AIPDF
Oleksandr Nikitin
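TL;DR: PlanarBench is a benchmark for evaluating the spatial reasoning of large language models: the model must draw a planar graph as ASCII art given only its edge list. The study evaluates 91 models on the 199 simplest non-isomorphic connected planar graphs (2 to 7 vertices) and finds that edge count is the main predictor of task difficulty.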
Details
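Motivation: Existing benchmarks may fail to measure LLM spatial reasoning faithfully because of memorization. The paper therefore proposes a planar-graph drawing task in which edge order, edge orientation, and node labels can all be permuted, which makes the task resistant to memorization.
Result: Across the 91 evaluated models, edge count is the dominant predictor of task difficulty (correlation coefficient r = -0.85). Prior LLM graph benchmarks, which use only node count as the difficulty axis, had not reported this finding.
Insight: The novelty is a memorization-resistant spatial reasoning benchmark, PlanarBench, and the finding that edge count, rather than node count alone, is the more important difficulty predictor for graph tasks. This offers a new perspective and baseline for evaluating and improving the geometric and combinatorial reasoning of LLMs.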
Abstract: PlanarBench tests whether LLMs can draw planar graphs as ASCII art given only an edge list – a spatial reasoning task that resists memorization because edge order, edge orientation, and node labels are all permutable. We evaluate 91 models on the 199 simplest non-isomorphic connected planar graphs (2 - 7 vertices). Edge count is the dominant difficulty predictor ($r = -0.85$) – a finding not reported in prior LLM graph benchmarks, which use only node count as the difficulty axis.
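The memorization-resistance argument rests on the fact that one graph has many equivalent edge-list presentations. The sketch below illustrates this under stated assumptions: `permute_edge_list` and `degree_sequence` are hypothetical helper names, not code from PlanarBench, and the sampling scheme (uniform shuffle, 50% orientation flips) is chosen only for illustration.

```python
import random


def permute_edge_list(edges, seed=0):
    """Return an equivalent presentation of an undirected graph.

    Applies the three permutations named in the abstract: node relabeling,
    edge-orientation flips, and edge reordering. Hypothetical helper.
    """
    rng = random.Random(seed)
    nodes = sorted({v for edge in edges for v in edge})
    shuffled = nodes[:]
    rng.shuffle(shuffled)
    relabel = dict(zip(nodes, shuffled))  # old label -> new label

    permuted = []
    for u, v in edges:
        a, b = relabel[u], relabel[v]
        if rng.random() < 0.5:  # flip orientation; the graph is undirected
            a, b = b, a
        permuted.append((a, b))
    rng.shuffle(permuted)  # reorder the edges
    return permuted, relabel


def degree_sequence(edges):
    """Sorted vertex degrees, an invariant preserved by all three permutations."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return sorted(degree.values())


if __name__ == "__main__":
    triangle_with_tail = [(1, 2), (2, 3), (3, 1), (3, 4)]
    permuted, mapping = permute_edge_list(triangle_with_tail, seed=1)
    print(permuted, mapping)
```

Every output of `permute_edge_list` describes the same graph up to isomorphism, so a model cannot rely on having seen one particular edge list before.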
[66] Unveiling the Entropy Dynamics of Chain-of-Thought Reasoning cs.CL | cs.LGPDF
Ting Xu, Xu He, Yupu Lu, Jiankai Sun, Dong Li
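TL;DR: This paper studies the entropy dynamics of chain-of-thought reasoning and finds a consistent two-phase structure: a sharp transition from an exploratory Uncertainty Region to a convergent Confidence Region. The Confidence Region has two key properties, high reliability and high redundancy, which motivate two more efficient and reliable inference strategies: early exit and test-time scaling. To implement them, the authors cast Confidence Region detection as a sequential change-point detection problem and apply the classical CUSUM algorithm to monitor chain-of-thought reasoning for the first time, yielding a training-free framework for real-time inference control.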
Details
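Motivation: The work studies the internal dynamics of chain-of-thought reasoning in order to understand when a model has reached a reliable answer, and uses that understanding to build inference strategies that are more efficient (less computation) and more reliable (higher accuracy).
Result: On early exit, the method achieves a superior Pareto frontier: CUSUM reaches 63.06% accuracy with an 11.1% token reduction, outperforming DEER and Dynasor by 3.28% and 4.36% in accuracy, respectively. On test-time scaling, CUSUM-weighted voting consistently outperforms self-consistency.
Insight: The core novelty is modeling chain-of-thought entropy dynamics as a two-phase process with a clear change point, and applying classical statistical change-point detection (such as CUSUM) to monitor and optimize LLM reasoning for the first time. This enables real-time control of the efficiency-reliability trade-off using signals from the reasoning process itself, with no additional training.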
Abstract: This paper investigates the entropy dynamics of Chain-of-Thought (CoT) and uncovers a consistent two-phase structure: an Uncertainty Region of exploration transitioning sharply to a Confidence Region of convergence. We demonstrate that the Confidence Region possesses two critical properties: 1) High Reliability – answers in the confidence region become highly accurate and stable, and 2) High Redundancy – models generate unnecessary tokens long after reaching the correct answer. These properties unlock more efficient and reliable inference strategies: 1) Early Exit leverages reliability and redundancy to terminate computation safely when returns diminish, and 2) Test-Time Scaling uses the Confidence Region signal to prioritize converged trajectories. To operationalize these insights, we formulate Confidence Region detection as a sequential change-point detection problem, being the first to apply classical change-point methods to monitor CoT reasoning. Using the Cumulative Sum (CUSUM) algorithm, a statistically optimal change-point detector, we develop a training-free framework for real-time inference control. Experiments show our approach establishes a superior Pareto-frontier for early exit. CUSUM achieves 63.06% accuracy with 11.1% token reduction, outperforming DEER and Dynasor by 3.28% and 4.36% in accuracy respectively. For test-time scaling, CUSUM-weighted voting consistently outperforms self-consistency.
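To make the change-point framing concrete, here is a minimal sketch of a textbook one-sided CUSUM detector applied to a per-step entropy trace. It is not the authors' implementation: the function name `cusum_entropy_drop`, the fixed `baseline`, and the `slack` and `threshold` values are all assumptions chosen for illustration.

```python
def cusum_entropy_drop(entropies, baseline, slack=0.1, threshold=1.0):
    """Return the first index at which a sustained entropy drop is detected.

    Textbook one-sided CUSUM for a downward mean shift:
        S_t = max(0, S_{t-1} + (baseline - x_t - slack))
    An alarm is raised once S_t exceeds `threshold`. Returns None if no
    alarm fires. `baseline`, `slack`, and `threshold` are hypothetical
    tuning parameters, not values from the paper.
    """
    statistic = 0.0
    for step, entropy in enumerate(entropies):
        statistic = max(0.0, statistic + (baseline - entropy - slack))
        if statistic > threshold:
            return step
    return None


if __name__ == "__main__":
    # Synthetic trace: high entropy (exploration), then a sharp drop.
    trace = [2.0, 2.1, 1.9, 2.0, 0.3, 0.2, 0.3, 0.2]
    print(cusum_entropy_drop(trace, baseline=2.0))
```

In an early-exit setting, the returned index would mark where generation could be stopped. How the baseline is estimated and how the alarm maps to an exit decision are left open here.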
[67] SentGuard: Sentence-Level Streaming Guardrails for Large Language Models cs.CLPDF
Jiaqi Yu, Xin Wang, Yixu Wang, Jie Li, Yan Teng
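TL;DR: This paper proposes SentGuard, a sentence-level streaming guardrail that monitors LLM output in real time. A lightweight waiting buffer groups streamed tokens into sentence chunks, safety is assessed in parallel with generation, and only verified safe sentences are released.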
Details
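Motivation: Existing guardrails sit at two extremes: response-level methods delay intervention until the full output has been generated, while token-level methods act on incomplete semantics, which leads to unstable decisions and excessive guard invocations. SentGuard targets the challenge of real-time safety monitoring during streaming generation.
Result: On 5 safety benchmarks, SentGuard outperforms existing baselines, detecting 90.5% of unsafe cases within two sentences while keeping a low streaming false-positive rate of 7.41%.
Insight: The novelties are the sentence-level streaming monitoring architecture, the StreamSafe benchmark (per-sentence annotations across 8 harm categories), and a coarse-to-fine training objective. The design is a useful reference for balancing real-time response against semantic completeness.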
Abstract: Large language models increasingly stream long, reasoning-intensive responses in real time, making when to moderate as critical as whether to moderate. Existing guardrails fall into two unsatisfactory extremes: response-level methods delay intervention until the full output is generated, whereas token-level methods act on incomplete semantics, often producing unstable decisions and excessive guard invocations. To address this challenge, we propose SentGuard, a sentence-level streaming guardrail that operates in parallel with generation. A lightweight waiting buffer groups streamed tokens into sentence chunks and releases only verified chunks to the user, introducing a small offset that enables SentGuard to assess the current prefix while the target LLM decodes subsequent content. To support this, we construct StreamSafe, a benchmark with structured per-sentence annotations across 8 harm categories, capturing the evolution of safety risks across both reasoning and response segments. We further train SentGuard with a coarse-to-fine objective to detect unsafe intent as soon as it emerges at sentence boundaries. Experiments on 5 safety benchmarks show that SentGuard outperforms existing baselines, detecting 90.5% of unsafe cases within two sentences while maintaining a low streaming false-positive rate of 7.41%.
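The waiting-buffer idea can be sketched as a generator that holds tokens until a sentence boundary and releases a sentence only after a check passes. This is an illustrative simplification, not SentGuard itself: the check runs sequentially rather than in parallel with decoding, sentence boundaries are detected with a naive punctuation rule, and `guarded_stream` and `is_safe` are hypothetical names.

```python
from typing import Callable, Iterable, Iterator


def guarded_stream(
    tokens: Iterable[str],
    is_safe: Callable[[str], bool],
    terminators: tuple = (".", "!", "?"),
) -> Iterator[str]:
    """Yield sentence chunks only after the prefix so far passes `is_safe`.

    `is_safe` stands in for a guard model (hypothetical); it receives all
    text released so far plus the candidate sentence. Streaming stops at
    the first chunk that fails the check.
    """
    released = ""  # text already shown to the user
    buffer = ""    # tokens waiting for a sentence boundary
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(terminators):
            if not is_safe(released + buffer):
                return  # withhold this sentence and everything after it
            released += buffer
            yield buffer
            buffer = ""
    # Flush a trailing partial sentence, still subject to the check.
    if buffer and is_safe(released + buffer):
        yield buffer


if __name__ == "__main__":
    stream = ["Hello", " world", ".", " This", " is", " forbidden", ".", " More", "."]
    print(list(guarded_stream(stream, lambda text: "forbidden" not in text)))
```

Checking at sentence boundaries means the guard sees complete semantic units, which is the trade-off the abstract describes between response-level and token-level moderation.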
[68] DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding cs.CLPDF
Jiebin Zhang, Zhenghan Yu, Song Liu, Eugene J. Yu, Zheng Li
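TL;DR: This paper proposes DFlare, which widens the conditioning bottleneck of DFlash, an existing block diffusion speculative decoding method, through a lightweight layer-wise fusion mechanism that increases the expressiveness of the draft model. Each draft layer attends to a learnable combination of a broad set of target-model layers, which injects richer target knowledge and supports scaling the draft model to deeper architectures. With training data scaled from 800K to 2.4M samples to exploit the added capacity, DFlare delivers significant inference speedups on multiple benchmarks.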
Details
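Motivation: DFlash, the state-of-the-art block diffusion speculative decoding method, forces all draft layers to share a single fused representation derived from only a few target layers. This limits per-layer expressiveness and hinders further scaling of draft capacity. The paper aims to remove this bottleneck so that the draft model uses the target model's internal knowledge more effectively and predicts better.
Result: On six benchmarks spanning mathematical reasoning, code generation, and conversation, DFlare attains average wall-clock speedups of 5.52x on Qwen3-4B, 5.46x on Qwen3-8B, and 3.91x on GPT-OSS-20B, improving over DFlash by roughly 11%, 8%, and 5% respectively and setting a new state of the art.
Insight: The core novelty is a lightweight layer-wise fusion mechanism: each draft layer independently attends to a learnable combination of a broad set of target layers, which breaks the shared-representation bottleneck and increases per-layer expressiveness and overall draft capacity at negligible overhead. This provides an effective route to deeper draft models in speculative decoding.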
Abstract: Block diffusion speculative decoding accelerates LLM inference by predicting all tokens within a block simultaneously for the target model to verify in parallel. Predicting an entire block at once requires a sufficiently capable draft model and effective utilization of the target model’s internal knowledge. However, the state-of-the-art method DFlash constrains all draft layers to share a single fused representation derived from only a few target layers, limiting per-layer expressiveness and hindering further scaling of draft capacity. In this paper, we present DFlare, which flares out the narrow conditioning bottleneck of DFlash through a lightweight layer-wise fusion mechanism: each draft layer attends to its own learnable combination of a broad set of target layers at negligible overhead, simultaneously injecting richer target knowledge and providing every draft layer with a distinct input. This enhanced per-layer expressiveness enables scaling the draft model to deeper architectures with consistent gains. We further scale training data from 800K to 2.4M samples to fully exploit the enlarged capacity. On six benchmarks spanning mathematical reasoning, code generation, and conversation, DFlare attains average wall-clock speedups of 5.52x on Qwen3-4B, 5.46x on Qwen3-8B, and 3.91x on GPT-OSS-20B, improving over DFlash by roughly 11%, 8%, and 5% respectively. Our code is available at https://github.com/Tencent/AngelSlim.
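One way to read "each draft layer attends to its own learnable combination of a broad set of target layers" is as a per-draft-layer weighted sum of target hidden states. The sketch below assumes softmax-normalized mixing weights; that normalization, the function names, and the plain-list representation are assumptions for illustration, not details confirmed by the abstract or the released code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_for_draft_layers(target_states, mixing_logits):
    """Build one fused conditioning vector per draft layer.

    target_states: list of T hidden vectors, one per target layer.
    mixing_logits: list of D rows, each with T learnable logits
                   (hypothetical parameters, one row per draft layer).
    Returns D fused vectors, so every draft layer gets a distinct input
    instead of the single shared representation attributed to DFlash.
    """
    dim = len(target_states[0])
    fused = []
    for logits in mixing_logits:
        weights = softmax(logits)
        fused.append(
            [sum(w * state[i] for w, state in zip(weights, target_states))
             for i in range(dim)]
        )
    return fused


if __name__ == "__main__":
    states = [[1.0, 0.0], [0.0, 1.0]]   # two target layers, 2-d hidden states
    logits = [[0.0, 0.0], [100.0, 0.0]]  # two draft layers
    print(fuse_for_draft_layers(states, logits))
```

The extra parameter count in this sketch is only D × T scalars, which is consistent with the abstract's description of the mechanism as lightweight.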
[69] A Primer in Post-Training Reasoning Data: What We Know About How It Works cs.CL | cs.AIPDF
Yaoming Li, Guangxiang Zhao, Qilong Shi, Lin Sun, Xiangzheng Zhang
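TL;DR: This survey on post-training reasoning data systematically reviews more than 150 key public studies and system reports. Organized around four core questions (what data objects exist, what makes them useful, how they are constructed, and how they scale), it provides an attribution framework for post-training reasoning-data releases and training recipes.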
Details
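Motivation: Post-training has become the main driver of progress in LLM reasoning, but the relevant work is scattered across datasets, reinforcement-learning recipes, reward models, benchmarks, and frontier system reports, with no systematic synthesis.
Result: The paper reports no quantitative experiments or benchmark rankings. Its main contribution is a systematic review of the existing literature and an organizing framework.
Insight: The novelty is the first systematic attribution framework for post-training reasoning data, which consolidates scattered findings into a structured body of knowledge around four core questions and offers guidance for future data releases and training-recipe design.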
Abstract: Post-training has become a primary driver of recent progress in large reasoning models, and reasoning data are often the key variable determining whether this stage succeeds. Work on post-training reasoning data has grown rapidly, yet this literature remains scattered across dataset papers, reinforcement-learning recipes, reward-model studies, benchmarks, and frontier system reports. This paper is the first primer to synthesize over 150 key public studies and system reports on post-training reasoning data. We organize the field around four questions: what data objects exist, what makes them useful, how they are constructed, and how they scale. Together, this organization provides an attribution framework for future reasoning-data releases and post-training recipes.
[70] Multilingual Idioms in Sentences and Conversations Across High-, Medium-, and Low-Resource Languages cs.CL | cs.AIPDF
Saeed Almheiri, Bilal Elbouardi, Salsabila Zahirah Pranida, Irina Nikishina, Ashwath Rao B
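TL;DR: This paper introduces MIDI, a multilingual idiom dataset covering high-, medium-, and low-resource languages with both sentence-level and conversational contexts, used to evaluate how well models understand the literal and figurative meanings of idioms. Benchmarking shows that idiom comprehension degrades in low-resource languages and that literal readings are generally harder than figurative ones. Conversational context helps but does not close the gap.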
Details
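Motivation: Idioms are hard for multilingual NLP because their meaning shifts between literal and figurative usage. Existing work focuses on high-resource languages and evaluates idioms outside realistic context, so a dataset with more languages and realistic contexts is needed.
Result: Benchmarking state-of-the-art models on MIDI shows that performance drops in low-resource languages, that literal readings are substantially harder than figurative ones in every resource tier, and that conversational context improves performance without eliminating the gap.
Insight: The novelty is the first multilingual idiom dataset that spans high-, medium-, and low-resource languages and includes both sentence-level and conversational contexts. Controlled tests and interventions on hidden representations separate memorization from reasoning and expose core limitations of current models.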
Abstract: Idiomatic expressions pose a major challenge for multilingual NLP because their meanings shift between figurative and literal usage, often requiring context for accurate interpretation. Prior work has focused on high-resource languages and typically evaluates isolated idiom-meaning questions, overlooking realistic discourse. We introduce MIDI, a multilingual idiom dataset spanning 3 high-, 3 medium-, and 12 low-resource languages, curated by native speakers. Unlike previous datasets, MIDI provides idioms embedded in both sentence-level and conversational contexts, capturing both literal and figurative readings. Benchmarking state-of-the-art models shows that idiom comprehension degrades in low-resource languages and that, in all resource tiers, literal interpretations are substantially harder than figurative ones. Conversational context improves performance but does not eliminate these disparities. Through controlled tests and interventions on hidden representations, we further separate memorization from reasoning, exposing core limitations of current models.
[71] CRAFTQA: A Code-Driven Adaptive Framework for Complex Structured Data Reasoning cs.CLPDF
Chengtao Gan, Zhiqiang Liu, Long Jin, Yushan Zhu, Lei Liang
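TL;DR: CRAFTQA is an adaptive, code-driven framework for question answering over heterogeneous structured data such as tables and knowledge graphs. It strengthens complex reasoning by generating executable Python code sequences and dynamically creating custom functions.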
Details
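Motivation: Existing unified methods for structured-data question answering rely on a predefined function set, which limits their ability to handle complex reasoning. A more flexible, adaptive framework is needed to overcome this fundamental limitation.
Result: Comprehensive experiments on multiple structured datasets show that CRAFTQA achieves substantial improvements over existing unified methods in complex reasoning scenarios.
Insight: The novelty is the combination of code generation with dynamic function creation (the CodeSTEP and CRAFT modules), which enables adaptive complex reasoning beyond predefined operations and offers a more general solution for structured-data question answering.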
Abstract: Real-world scenarios involve massive heterogeneous structured data (e.g., tables, knowledge graphs), making effective reasoning over such diverse data increasingly important. Unified structured data question answering has emerged as a prominent research trend, aiming to answer natural language questions across different structured data types within a single framework. However, existing unified methods share a common limitation: they rely on a set of predefined functions, which restricts their ability to perform complex reasoning beyond these predefined operations. To overcome this fundamental limitation, we propose CRAFTQA, a novel adaptive code-driven framework comprising two core modules, CodeSTEP and CRAFT. The CodeSTEP module is a paradigm that generates a complete executable Python code sequence, which contains step-by-step code-based reasoning operations based on the question. The CRAFT module dynamically generates custom code functions for operations beyond the predefined function set, and seamlessly integrates with CodeSTEP to significantly enhance flexibility in handling complex reasoning. Comprehensive experiments on multiple structured datasets demonstrate that CRAFTQA achieves remarkable improvements in complex reasoning scenarios compared to existing unified methods.
[72] Do Gender Cues Affect LLM Value Trade-offs? Evidence from a Controlled Decision Benchmark cs.CLPDF
Yangyang Liu, Dong Yu, Pengyuan Liu
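TL;DR: This paper examines whether gender cues affect LLM judgments in value-sensitive decisions. The authors build the Realistic Value Decision Benchmark (RVDB), which changes only the role-gender configuration while holding everything else fixed, and evaluate decision invariance and self-attribution for seven models under gender perturbations. Explicit gender cues cause bounded but systematic decision flips, models often overlook the influence of gender in their self-attributions, and gender effects are stronger where value boundaries are blurred and decision contexts are severe.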
Details
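Motivation: LLMs are increasingly used in value-sensitive decision settings, where irrelevant demographic cues such as gender should not affect their judgments. The paper uses controlled experiments to test whether gender cues bias model decisions and whether model self-attributions reflect the actual behavioral changes.
Result: On RVDB, seven models show bounded but systematic decision flips under gender-cue perturbations, including under an explicit gender-attribution prompt. Cross-gender role swaps reveal an asymmetry in female-proposed decisions, and gender effects concentrate in scenarios with less determinate value boundaries and more severe decisions. Model self-attributions often wrongly assign flipped decisions to non-gender factors.
Insight: The novelty is the controlled RVDB benchmark, which isolates the influence of gender cues and shows that gender acts behaviorally as a local boundary-shifting factor rather than a global override of value reasoning. The gap between model behavior and self-attribution underlines the importance of controlled behavioral audits that go beyond explanation-based evaluation.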
Abstract: Large language models are increasingly used in value-sensitive decision settings, where irrelevant demographic cues should not alter judgments. We construct the Realistic Value Decision Benchmark (RVDB), a controlled benchmark that varies only the role-gender configuration while holding the scenario, ordered value pair, roles, candidate decisions, Value Distance, and Decision Severity fixed. Using a position-balanced evaluation across seven models, we test whether models preserve decision invariance under gender perturbations and whether their self-attributions reflect observed behavioral changes. We find that explicit gender cues induce bounded but systematic decision flips, including under an explicit gender-attribution prompt that asks models to report whether gender influenced their choice. Cross-gender role swaps reveal a consistent female-proposed-decision asymmetry, while models often attribute flipped decisions to No Influence or other non-gender factors. Further analysis shows that gender effects concentrate near less determinate value boundaries and under more severe decision contexts, suggesting that gender cues act as local boundary-shifting factors rather than global overrides of value reasoning. Value rankings remain largely stable, but ordered value-pair trade-offs shift unevenly across role-gender configurations. These results show that gender can enter LLM value trade-offs behaviorally while remaining obscured in self-attribution, motivating controlled behavioral audits beyond explanation-based evaluation.
[73] Better with Experience: Self-Evolving LLM Agents for Evidence-Grounded Health Community Notes cs.CL | cs.SIPDF
Zihang Fu, Fanxiao Li, Jianyang Gu, Haonan Wang, Preslav Nakov
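TL;DR: This paper proposes EvoNote, a framework that builds an evolving experience memory so that LLM-based generation of health Community Notes can self-evolve and correct health misinformation on social media more effectively. A fine-grained credit assignment mechanism converts trajectory-level feedback into action-level memory, improving claim analysis, evidence acquisition, and note writing.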
Details
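Motivation: Existing LLM-based methods for generating health Community Notes reset at every post, so they cannot reuse valuable experience from earlier misinformation corrections, which is inefficient and wasteful.
Result: On MM-HealthCN, a multimodal benchmark of 1,200 instances, a human-validated hierarchical utility judge prefers EvoNote-generated notes over human-written notes in 89.6% of cases. On a separate set of posts without a crowd helpfulness verdict, EvoNote produces helpful notes for 82.0% of cases, and it cuts the median time to produce a candidate correction from over 13 hours in the human pipeline to under 2 minutes.
Insight: The core novelty is a self-evolving experience memory combined with fine-grained credit assignment, which decomposes task-level feedback and applies it to individual action steps. This enables reuse of correction strategies and stronger evidence use, offering a new paradigm for health misinformation governance.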
Abstract: Large Language Model (LLM)-augmented Community Notes offer a scalable path for timely, evidence-grounded correction of health misinformation on social platforms. However, they still reset at every post, leaving useful correction experience from prior cases unused. We introduce EvoNote, an agentic framework that enables health Community Notes generation to self-evolve through an evolving experience memory of prior misinformation correction episodes. Its core is fine-grained credit assignment: EvoNote grounds trajectory-level feedback in health-specific note qualities and distills it into action-level memory for claim analysis, evidence acquisition, and note writing. We evaluate EvoNote on MM-HealthCN, a 1.2K-instance multimodal benchmark of user-flagged health posts with human-written Community Notes and crowd-derived helpfulness labels. Under a human-validated hierarchical utility judge, EvoNote-generated notes are preferred over corresponding human-written notes in 89.6% of cases; on a separate set of Needs More Ratings posts without a crowd helpfulness verdict, EvoNote produces helpful notes for 82.0% of cases. It also reduces the median time needed to produce a candidate correction from over 13 hours in the human-note pipeline to under 2 minutes. Analyses link these gains to stronger evidence use and reusable correction strategies, positioning self-evolving note generation as a promising paradigm for health misinformation governance.
[74] Geometric Latent Reasoning Induces Shorter Generations in LLMs cs.CLPDF
Shashi Kumar, Yacouba Kaloga, Petr Motlicek, Ina Kodrasi, Andrea Cavallaro
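TL;DR: This paper proposes Geometric Latent Reasoning (GLR), which models LLM reasoning as a geometric path-approximation problem in the pretrained token-embedding space. A lightweight transition head predicts iterative direction updates in embedding space, and textual chain-of-thought traces serve as anchors for learning to approximate discrete reasoning trajectories while allowing continuous deviations from exact token embeddings. On mathematical reasoning benchmarks, GLR substantially reduces the number of generation steps and yields more compact intermediate reasoning states.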
Details
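Motivation: LLMs conventionally solve problems by generating explicit reasoning chains, which is expensive, length-sensitive, and constrained to discrete natural language. Latent reasoning offers a continuous alternative, but determining useful structures for intermediate latent states remains an open challenge.
Result: On mathematical reasoning benchmarks with Qwen3 models, GLR substantially reduces total generation steps and produces shorter generations while maintaining accuracy, revealing a new trade-off between latent computation budget, output length, and accuracy.
Insight: The novelty is formalizing latent reasoning as geometric path approximation in token-embedding space. Continuous trajectories act as compact intermediate reasoning states, and shorter generations emerge without an explicit length objective, which suggests a new paradigm for efficient reasoning.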
Abstract: Large language models solve complex problems by generating lengthy chains of explicit reasoning tokens. While effective, this makes reasoning expensive, length-sensitive, and constrained to (discrete) natural language. While latent reasoning offers a continuous alternative, determining useful structures for intermediate latent states is an open challenge. In this paper, we formulate latent reasoning as a geometric path-approximation problem within the model’s pretrained token-embedding space. We introduce Geometric Latent Reasoning (GLR), which uses a lightweight transition head to predict iterative direction updates in embedding space. Using textual chain-of-thought traces as anchors, GLR learns to approximate discrete reasoning trajectories while permitting continuous deviations from exact token embeddings. Evaluations on mathematical reasoning benchmarks using Qwen3 models reveal an emergent phenomenon: geometric latent reasoning induces substantially shorter generations without an explicit length objective. By replacing early explicit reasoning with continuous latent steps, models often reach correct answers using substantially fewer total generation steps. These findings suggest that continuous trajectories act as compact intermediate reasoning states, exposing a new tradeoff between latent computation budget, output length, and accuracy.
[75] ResMerge: Residual-based Spectral Merging of Large Language Models cs.CLPDF
Yandu Sun, Zhiyan Hou, Haokai Ma, Yuheng Jia, Junfeng Fang
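TL;DR: This paper proposes ResMerge, a residual-based spectral merging framework for combining expert LLMs trained with reinforcement learning (RL). It decomposes each task vector into a head and a residual and treats them separately: Spherical Residual Consensus Adaptation builds a stable residual backbone, and a Lightweight Head Correction module, gated by positive cross-expert agreement, then reintroduces head information. Experiments show that ResMerge preserves expert capabilities better than existing task-vector and spectral merging baselines across multiple RL expert groups and capability domains.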
Details
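Motivation: Existing spectral merging methods usually assume that the main task signal is concentrated in the leading singular directions, while low-energy residual components can be compressed or attenuated to reduce interference. The authors find that for RL task vectors both the head and the residual carry substantial behavioral knowledge and have different merging properties, so a new strategy is needed to merge RL experts effectively.
Result: Experiments across multiple RL expert groups and capability domains show that ResMerge preserves expert capabilities better than representative task-vector and spectral merging baselines (such as TIES-Merging, DARE, and SLERP) and achieves better performance.
Insight: The core novelty is the observation that, in RL task vectors, the head (concentrated and informative but conflict-prone) and the residual (dispersed but stable) merge differently, and a two-step framework built on it: first construct a stable consensus backbone from the residuals, then selectively reintroduce head information through a gating mechanism. This offers a new perspective and method for training-free model merging, especially for RL experts.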
Abstract: Model merging offers a training-free way to combine multiple post-trained expert models, but merging experts obtained through reinforcement learning (RL) remains challenging. Existing spectral merging methods often assume that leading singular directions contain the main task signal, while lower-energy residual components can be compressed, selected, or attenuated to reduce interference. We find that this assumption does not hold for RL task vectors: after decomposing each task vector into a leading spectral head and a residual component, both parts can independently recover substantial behavior knowledge, while exhibiting different merging properties. The head is highly concentrated and informative but more prone to sharp cross-expert conflicts, whereas the residual component is more dispersed and provides a more stable basis for aggregation. Based on this observation, we propose ResMerge, a residual-based spectral merging framework for RL experts. ResMerge first constructs a stable residual backbone with Spherical Residual Consensus Adaptation, which estimates a reliability-weighted consensus direction on the Frobenius sphere. It then reintroduces leading-head information through a Lightweight Head Correction module gated by positive cross-expert agreement. Experiments across multiple RL expert groups and capability domains show that ResMerge better preserves expert capabilities than representative task-vector and spectral merging baselines. The implementation of ResMerge is publicly available at https://github.com/sunyd0303-cpu/ResMerge-release.
[76] DECK: A Consistency x Confidence Taxonomy of LLM Hallucinations cs.CLPDF
Mohit Singh Chauhan
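TL;DR: The paper proposes DECK, a complementary hallucination taxonomy that classifies LLM errors by their detectability signature, that is, the signal an uncertainty scorer can read. It partitions errors along two axes, inter-sample consistency and token-level confidence, into four behavioral regimes (Drift, Entrenched, Confabulation, Knotted), each mapped to specific scorer families. The taxonomy is validated across multiple models and datasets, and the paper identifies a universal blind spot of output-level uncertainty quantification on knowledge-gap inputs.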
Details
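Motivation: Existing hallucination taxonomies classify errors by what is wrong with the output (memorized misconceptions, reasoning failures, fluent fabrications). They are useful for diagnosis but cannot answer a different question: which uncertainty scorer would catch this error? A complementary taxonomy based on detectability signatures is therefore needed.
Result: The taxonomy is validated on three models and four datasets by analyzing scorer-pair disagreement and by checking whether external labels (SelfAware unanswerable, HaluEval adversarial, PopQA entity popularity) land in the predicted DECK cells. Model scale and content specificity lead to secondary-cell refinements.
Insight: The novelty is a two-dimensional classification framework (DECK) based on detectability signals (consistency and confidence) that ties each error type directly to the scorer families able to detect it. Viewed objectively, the method offers a new diagnostic perspective on hallucination detection and shows that output-level uncertainty quantification fails across the board when the model produces confident, repeatable fabrications, which points to richer internal-state methods (UQ heads, information-theoretic estimators) as a direction to explore.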
Abstract: Existing hallucination taxonomies classify LLM errors by what is wrong with the output – memorised misconceptions, reasoning failures, fluent fabrications. These taxonomies are useful for diagnosis but cannot answer a different question: which uncertainty scorer would have caught this error? We propose a complementary taxonomy that classifies errors by their detectability signature – the signal a scorer family would read. The DECK taxonomy is a 2x2 partition along inter-sample consistency and token-level confidence into four behavioural regimes (Drift, Entrenched, Confabulation, Knotted), each mapping to a specific scorer family (or families) that can detect it: black-box consistency scorers have signal in D and C, white-box token-probability scorers have signal in K and C, and only an LLM-as-a-Judge with independent pretraining can detect E. Cell membership is operationalised by a Youden’s J optimal split on each scorer axis. Across three models and four datasets we validate the taxonomy two ways: by analysing scorer-pair disagreement, and by checking that external labels (SelfAware unanswerable, HaluEval adversarial, PopQA entity popularity) land in the predicted DECK cells, with model-scale and content-specific secondary-cell refinements. We further identify a universal blind spot of output-level UQ: on knowledge-gap inputs where the generator emits confident, repeatable fabrications, every output-level family collapses by construction. A linear probe on Llama-3-8B’s hidden states also collapses to chance, giving preliminary evidence that the failure may persist at the activation level; richer internal-state methods (UQ heads, information-theoretic estimators) remain to be tested.
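The 2x2 partition and the Youden's J split can be sketched as follows. The cell assignment is inferred from the abstract's statement of which scorer families have signal in which cells (consistency scorers in D and C, token-probability scorers in K and C); it is a reading of the abstract rather than a confirmed definition, and `youden_threshold` and `deck_cell` are hypothetical names.

```python
def youden_threshold(scores, labels):
    """Pick the cut on `scores` that maximizes Youden's J = TPR - FPR.

    scores: uncertainty scores (higher means more likely an error).
    labels: 1 for an erroneous answer, 0 for a correct one.
    Assumes both classes are present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_cut, best_j = None, -1.0
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 0)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut


def deck_cell(inconsistent: bool, low_confidence: bool) -> str:
    """Map the two binary axes to a DECK regime (inferred mapping)."""
    if inconsistent and low_confidence:
        return "Confabulation"  # both scorer families have signal
    if inconsistent:
        return "Drift"          # only consistency scorers have signal
    if low_confidence:
        return "Knotted"        # only token-probability scorers have signal
    return "Entrenched"         # neither axis fires; needs an external judge


if __name__ == "__main__":
    cut = youden_threshold([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
    print(cut, deck_cell(inconsistent=0.85 >= cut, low_confidence=False))
```

Under this reading, the Entrenched cell is exactly the blind spot the abstract describes: the output is consistent across samples and confident at the token level, so neither axis raises a flag.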
[77] Unified Context Evolution for LLM Agents cs.CLPDF
Zixuan Zhu, Yitong Hu, Yong Dai, Junfeng Fang, Chunyang Jiang
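TL;DR: This paper proposes Unified Context Evolution (UCE), a framework for improving LLM-based agents on interactive tasks. It externalizes agent experience into an evolving library of typed context units covering four types (Memory, Strategy, Workflow, and Skill) and uses a scheduling module to optimize the composition of the library dynamically.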
Details
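Motivation: LLM-based agents solving multi-step interactive tasks start every episode from the same fixed context, and useful strategies discovered along the way are lost once the task ends. Existing approaches either limit learning to the current task or store all experience without distinction; they do not separate knowledge types, track quality through use, or balance what the library still lacks.
Result: On two interactive benchmarks, UCE raises ALFWorld success from 75.4% to 96.3% and WebShop task score from 45.1% to 61.3%. The accumulated library also transfers to different actor backbones without retraining.
Insight: The novelty is a gradient-free framework that decomposes experience into four complementary typed units, with usage-based scoring, pruning, and a scheduling module that allocates the generation budget toward the library's weak spots. This gives a systematic way to build a reusable, evolving experience library for agents.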
Abstract: LLM-based agents can solve multi-step interactive tasks by combining reasoning with environment feedback, yet each episode starts from the same fixed context and any useful strategy discovered along the way is lost once the task ends. Existing approaches either limit learning to the current task or pool all experience into a single untyped store, without distinguishing knowledge types, tracking quality through use, or balancing what the library still lacks. We introduce Unified Context Evolution (UCE), a gradient-free framework that externalizes agent experience into an evolving library of typed Evolvable Context Units (ECUs). UCE decomposes experience into four complementary types (Memory, Strategy, Workflow, and Skill), each generated from trajectories under type-specific conditions, retrieved at decision time, scored through repeated usage outcomes, and pruned when no longer valuable. A scheduling module allocates each cycle’s generation budget toward the types where the library is weakest. Across two interactive benchmarks, UCE raises ALFWorld success from 75.4% to 96.3% and WebShop task score from 45.1% to 61.3%, and the accumulated library transfers to alternative actor backbones without retraining.
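The score, prune, and schedule loop described above can be sketched with a small typed library. The abstract does not give the scoring rule, the pruning criterion, or the scheduling rule, so everything below is an assumption for illustration: the smoothed success rate, the `min_uses` and `min_score` cutoffs, and the "weakest type" heuristic are hypothetical, as are all names.

```python
from dataclasses import dataclass

UNIT_TYPES = ("Memory", "Strategy", "Workflow", "Skill")


@dataclass
class ContextUnit:
    """One typed library entry with usage statistics (hypothetical schema)."""
    kind: str
    text: str
    uses: int = 0
    successes: int = 0

    def score(self) -> float:
        # Laplace-smoothed success rate; an unused unit starts at 0.5.
        return (self.successes + 1) / (self.uses + 2)


def record_outcome(unit: ContextUnit, success: bool) -> None:
    """Update a unit after it was retrieved and the episode finished."""
    unit.uses += 1
    unit.successes += int(success)


def prune(library, min_uses=3, min_score=0.3):
    """Drop units that have been tried enough times and still score poorly."""
    return [u for u in library if u.uses < min_uses or u.score() >= min_score]


def weakest_type(library) -> str:
    """Choose the type with the lowest total score as the next generation target."""
    def strength(kind):
        return sum(u.score() for u in library if u.kind == kind)
    return min(UNIT_TYPES, key=strength)


if __name__ == "__main__":
    library = [
        ContextUnit("Memory", "The key is usually in the drawer."),
        ContextUnit("Strategy", "Search receptacles before giving up."),
    ]
    print(weakest_type(library))
```

Because the library lives outside the model weights, this kind of loop needs no gradient updates, which matches the abstract's description of UCE as gradient-free.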
[78] TVIR: Building Deep Research Agents Towards Text–Visual Interleaved Report Generation cs.CLPDF
Xinkai Ma, Zhiqi Bai, Dingling Zhang, Pei Liu, Yishuo Yuan
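TL;DR: This paper proposes TVIR (Text-Visual Interleaved Report Generation), which comprises the TVIR-Bench benchmark and the TVIR-Agent multi-agent system. It addresses the lack of evaluation, in existing deep research agents, of whether visual elements are factually reliable and aligned with the textual analysis.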
Details
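Motivation: Existing deep research agents are mostly text-centric and lack evaluation of the factual reliability of visual elements and their alignment with the textual analysis, so a benchmark and system for multimodal deep research tasks are needed.
Result: Experiments across nine deep research systems show that TVIR-Agent achieves strong overall performance on TVIR-Bench, supporting the value of its multimodal design and evaluation.
Insight: The novelties are a hierarchical multi-agent framework (TVIR-Agent) for report generation and a dual-path evaluation framework that combines textual and visual assessment, underscoring the importance of explicit multimodal design for evidence-driven report generation.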
Abstract: Deep Research Agents have shown strong capability in multi-step information retrieval, reasoning, and long-form report generation, but existing benchmarks and systems remain predominantly text-centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis. To address this gap, we introduce TVIR (Text–Visual Interleaved Report Generation), which includes TVIR-Bench, a benchmark of 100 expert-curated multimodal deep research tasks that require visual elements to serve specific analytical sub-goals, and TVIR-Agent, a hierarchical multi-agent framework that serves as a strong baseline for constructing outlines, retrieving images, generating charts with traceable sources, and composing reports through context-aware sequential writing. We further develop a dual-path evaluation framework that combines Textual Assessment and Visual Assessment. Experiments across nine deep research systems show that TVIR-Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation for evidence-driven report generation.
[79] K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts cs.CLPDF
Nahyun Lee, Dongkeun Yoon, Guijin Son, Geewook Kim, Dayoon Ko
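TL;DR: This paper introduces K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts with 400 problems, 300 of which are manually constructed and validated by native speakers. Frontier LLMs perform poorly on it, and Korean LLMs perform worse. The authors also build a 100-problem synthetic diagnostic split as a targeted stress test.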
Details
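Motivation: Frontier model evaluation is shifting from foundational capabilities toward compositional, agentic ones, but Korean agentic benchmarks remain scarce.
Result: On the K-BrowseComp-Verified subset, frontier LLMs including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1 reach only 30.00-45.67% accuracy, well below their BrowseComp results, while LLMs from Korea's Proprietary AI Foundation Model program reach only 0.00-10.33%. On the adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%.
Insight: The novelty is the first web-browsing agent benchmark grounded in Korean contexts, plus a synthetic diagnostic split that exploits the asymmetry between solving and creating problems through hard few-shot exemplars and failure-mode-targeted generation, usable as a targeted stress test.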
Abstract: Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00–45.67%, a substantial drop from BrowseComp, while Korean LLMs released through Korea’s Proprietary AI Foundation Model program obtain only 0.00–10.33%. We further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation to exploit the asymmetry between solving and creating web browsing problems. On the adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%, and we report this split separately as a targeted stress test. We publicly release our data and code.
[80] PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning cs.CL | cs.AI | cs.CVPDF
Yusong Zhao, Yuejin Xie, Youliang Yuan, Junjie Hu, Jitian Guo
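TL;DR: This paper presents PaSBench-Video, a benchmark for evaluating whether multimodal large language models (MLLMs) can issue proactive safety warnings on streaming video. It contains 740 videos across four domains (driving, healthcare, daily life, and industrial production), annotated with risk-onset and accident-boundary frames. None of the 13 MLLMs tested reaches 20.0% on the strictest metric, and recall is strongly correlated with false-positive rate, indicating that models rely mainly on scene-level activity cues rather than reasoning about emerging harm.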
Details
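Motivation: Current benchmarks cannot effectively evaluate MLLMs as always-on safety monitors for dynamic video, because they typically rely on static inputs, ignore timing precision, and do not measure false positives on safe scenes.
Result: Among 13 MLLMs tested on PaSBench-Video, no model exceeds 20.0% on the strictest metric. The Pearson correlation between recall and false-positive rate is 0.64, meaning higher detection comes with many more false alarms on safe scenes. Performance varies by domain: models reach moderate recall at low false-positive rates in daily life, where risks are inherently anomalous, but raise frequent false alarms in driving, where routine and hazardous scenes look alike.
Insight: The novelty is the first temporally calibrated benchmark focused on proactive safety warning in streaming video, with an emphasis on causal observation and precise temporal annotation. Viewed objectively, current MLLMs lack the ability to reason about emerging harm and over-rely on scene-level statistics, which points to directions for future model design.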
Abstract: Between the first visible sign of danger and the moment an accident occurs, there is often a window where intervention remains possible. Video-capable multimodal large language models (MLLMs) could serve as always-on safety monitors that issue warnings during this window. Yet current benchmarks do not test this ability: they rely on static inputs, ignore timing precision, and omit false-positive measurement on safe scenes. We present PaSBench-Video, a 740-video benchmark with 481 risk and 259 no-risk videos across four domains: driving, healthcare, daily life, and industrial production. Risk videos are annotated with frame-level risk onset and accident boundaries. A model must observe the video causally and produce a warning that is both temporally calibrated and content-correct. Testing 13 MLLMs, we find that no model exceeds 20.0% on our strictest metric, and recall is tightly coupled with false-positive rate, with Pearson correlation 0.64: higher detection comes only at the cost of triggering warnings on the majority of safe clips. Performance splits sharply by domain: models achieve moderate recall at low false-positive rates in daily life, where risks are inherently anomalous, yet fire indiscriminately in driving, where routine and hazardous scenes look alike. These results indicate that current models rely on scene-level activity cues rather than reasoning about emerging harm.
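Two quantities in this abstract lend themselves to a short worked example: whether a warning falls inside the intervention window, and the Pearson correlation between recall and false-positive rate. The sketch below is illustrative only. The exact timing criterion used by PaSBench-Video is not stated in the abstract, so the half-open window `[onset, accident)` is an assumption, and the per-model numbers in the usage example are made up for demonstration.

```python
import math


def warning_is_timely(warning_frame, onset_frame, accident_frame):
    """True if the warning lands between risk onset and the accident.

    Assumed criterion: onset_frame <= warning_frame < accident_frame.
    A warning before onset is premature; one at or after the accident
    is too late to allow intervention.
    """
    if warning_frame is None:  # the model never warned
        return False
    return onset_frame <= warning_frame < accident_frame


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


if __name__ == "__main__":
    print(warning_is_timely(warning_frame=130, onset_frame=120, accident_frame=180))
    # Made-up per-model (recall, false-positive rate) pairs, for illustration only.
    recall = [0.2, 0.4, 0.5, 0.7]
    false_positive_rate = [0.1, 0.3, 0.6, 0.6]
    print(round(pearson(recall, false_positive_rate), 2))
```

A positive coefficient across models means that those detecting more risks also fire more often on safe clips, which is the coupling the abstract reports.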
[81] Learning When to Translate for Multilingual Reasoning cs.CL | cs.AIPDF
Deokhyung Kang, Hyounghun Kim, Gary Geunbae Lee
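TL;DR: This paper proposes Luar, a framework that uses reinforcement learning to train reasoning language models (RLMs) to invoke translation selectively, only when direct understanding is unreliable, in order to address language-understanding failures in multilingual reasoning. It outperforms standard GRPO and other baselines on multiple multilingual reasoning benchmarks, with especially strong results on low-resource languages.
Details
Motivation: Close the reasoning gap that arises when RLMs misunderstand non-English inputs in multilingual tasks, while avoiding unnecessary translation of every input by invoking translation only when direct understanding is unreliable.
Result: On multiple multilingual reasoning benchmarks, Luar outperforms standard GRPO and other training-based baselines, with large gains on low-resource languages. It avoids translation when direct reasoning suffices and generalizes its translator-call behavior to unseen low-resource languages.
Insight: The contribution is a boundary-aware reinforcement learning framework that lets the model decide for itself when to invoke translation, enabling selective multilingual reasoning. The analysis suggests that dynamic translation decisions balance performance against efficiency, offering a new direction for optimizing multilingual models.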
Abstract: Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, but still exhibit substantial multilingual reasoning gaps, largely due to language-understanding failures in non-English inputs. English translation can mitigate these failures by expressing non-English inputs in a form that RLMs can more reliably interpret, yet translating every input is unnecessary when the model can reason reliably from the original query. To address this challenge, we propose Luar, a Language Understanding Boundary-aware Reinforcement Learning framework that trains RLMs to selectively invoke translation when direct understanding is unreliable. Luar trains the model to choose between solving the original input directly and reasoning over its English translation, encouraging translation only when translator-augmented reasoning is expected to substantially outperform direct reasoning. Across multilingual reasoning benchmarks, Luar outperforms standard GRPO and other training-based baselines, with particularly large gains on low-resource languages. Further analysis shows that Luar avoids unnecessary translation in cases where direct reasoning is sufficient, while extending its translator-call behavior to unseen low-resource languages. Together, our work suggests a selective approach to multilingual reasoning: RLMs can learn to invoke translation only when their direct understanding is unreliable. The project will be made publicly available at https://github.com/deokhk/LUAR
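The selection principle in the abstract (translate only when translator-augmented reasoning is expected to substantially outperform direct reasoning) can be illustrated with a simple margin rule. This is a sketch under stated assumptions, not Luar's training objective: the function name, the use of sampled correctness flags as the accuracy estimate, and the `margin` value are all hypothetical.

```python
def should_translate(direct_correct, translated_correct, margin=0.2):
    """Return True when translated reasoning beats direct reasoning by more than `margin`.

    Both arguments are lists of 0/1 correctness flags from sampled rollouts
    (an assumed way of estimating expected accuracy).
    """
    direct_acc = sum(direct_correct) / len(direct_correct)
    translated_acc = sum(translated_correct) / len(translated_correct)
    return translated_acc - direct_acc > margin

# Direct reasoning mostly fails, translated reasoning succeeds: translate.
print(should_translate([0, 0, 0, 1], [1, 1, 1, 1]))
# Direct reasoning is already reliable: keep the original input.
print(should_translate([1, 1, 1, 1], [1, 1, 1, 1]))
```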
[82] CRAM: Centroid-Routing and Adaptive MoE for Multimodal Continual Instruction Tuning cs.CLPDF
Jun-Tao Tang, Zhen-Hao Xie, Yu-Cheng Shi, Da-Wei Zhou
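TL;DR: This paper proposes CRAM to address the trade-off between catastrophic forgetting and parameter efficiency in multimodal continual instruction tuning. By isolating task-specific patterns, instantiating experts with adaptive rank, and using centroid-guided routing, it allocates only the necessary parameters while keeping knowledge reuse across tasks stable.
Details
Motivation: In continual instruction tuning, existing multimodal large language model methods either update shared parameters, which causes task competition and forgetting, or allocate a separate module per task, which is parameter-inefficient. A method is needed that balances forgetting prevention against parameter efficiency.
Result: Extensive experiments on multiple benchmarks show that CRAM consistently outperforms existing methods, achieving better continual-learning performance and parameter utilization.
Insight: The contribution combines task-specific pattern isolation with adaptive parameter allocation. Centroid-guided routing enables stable reuse of expert capabilities, and an orthogonality penalty prevents re-learning of general capability, offering an efficient architectural design for multimodal continual learning.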
Abstract: Multimodal Large Language Models (MLLMs) unify heterogeneous vision-language tasks under a shared generative framework via instruction tuning, yet real-world deployment demands continuous capability expansion, making Multimodal Continual Instruction Tuning (MCIT) essential. Existing methods either update all tasks with a shared parameter set or allocate dedicated modules for each new task. Shared updates force heterogeneous tasks to compete, causing forgetting of learned capabilities. Conversely, isolated expansion prevents interference but severely limits parameter efficiency over long task streams. To address this dilemma, we propose CRAM. Specifically, by isolating task-specific patterns into independent modules, CRAM mitigates catastrophic forgetting across tasks. To further boost parameter efficiency, we utilize adaptive-rank instantiation to identify the capability gap between existing expert capability and new task demands, and dynamically allocate only the necessary parameters. To ensure stable reuse among tasks, centroid-guided routing recognizes and activates existing experts’ capabilities, while an orthogonality penalty confines new updates to task-specific directions, preventing re-learning general capability. Extensive experiments across diverse benchmarks consistently demonstrate its superiority over existing methods.
[83] FigSIM: A Dataset for Fine-grained Suicide Severity and Figurative Language in Suicide Memes cs.CL | cs.CV | cs.CYPDF
Liuliu Chen, Elise R. Carrotte, Brian E. Chapman, Jo Robinson, Mike Conway
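TL;DR: This paper introduces FigSIM, the first dataset for fine-grained analysis of suicide-related memes. It contains 1049 memes annotated for suicide severity, figurative phenomena, and suicide-related content. The authors evaluate 16 unimodal and multimodal models on three tasks, find that suicide memes pose unique challenges for modeling and content moderation, and reveal model biases.
Details
Motivation: Suicide memes are increasingly common on social media, yet they are poorly understood and potentially harmful. The lack of annotated datasets currently hinders the development and evaluation of automated moderation methods.
Result: Sixteen models were benchmarked on three tasks: figurative language, suicide severity, and suicide-related content detection. The results show model biases, such as underprediction of higher suicide severity levels, especially for figurative memes.
Insight: The contribution is the first fine-grained annotated dataset of suicide memes, together with evidence of the unique challenges and systematic biases that multimodal models exhibit on sensitive, figurative content of this kind. It provides a data foundation and an analytical perspective for developing content moderation strategies.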
Abstract: Suicide memes are memes used to express suicide-related thoughts or comment on suicide-related issues. Suicide memes are increasingly common on social media, yet remain poorly understood and potentially harmful. There is an urgent need to better understand their characteristics and to develop appropriate content moderation strategies that limit users’ exposure to potentially harmful content. Currently, the absence of annotated datasets of suicide memes remains a key barrier to developing and evaluating automated moderation approaches. In this paper, we introduce FigSIM, the first dataset designed for fine-grained analysis of suicide memes. The dataset consists of 1049 memes, each annotated for (1) fine-grained suicide severity levels, (2) figurative phenomena (e.g., metaphors), and (3) suicide-related content (e.g., suicide method depiction). We benchmark 16 unimodal and multimodal models across three tasks: figurative language, suicide severity, and suicide-related content detection. Overall, FigSIM demonstrates that suicide memes pose unique challenges for both modeling and content moderation. Analysis revealed biases, such as underprediction of higher suicide severity levels, especially for figurative memes. The dataset (including splits used for analyses) is publicly available. Content Warning: This paper contains suicide-related content that may be triggering.
cs.CV [Back]
[84] Improved Belief-Attention in Vision Task cs.CV | cs.AIPDF
Guoqiang Zhang
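TL;DR: This paper proposes an improved attention mechanism, Belief2-Attention. It extends the existing Belief-Attention by using not only its residual (perpendicular) component but also its projected component, which is processed by a two-layer feedforward network to retain its information. It also enriches token-correlation modeling with an additional inner-product matrix ZZ^T. The module is validated on image classification and segmentation.
Details
Motivation: Belief-Attention uses only the perpendicular component after projection as the residual signal, but the projected component is found to carry token-correlation information as well and should not be ignored. An improved attention mechanism that uses both components is therefore needed.
Result: The paper validates Belief2-Attention on vision tasks such as image classification and segmentation, and shows theoretically that it is more expressive than standard attention.
Insight: The main contributions are: 1) extending Belief-Attention to use both the projected and perpendicular components, with the projected component processed by a structure resembling a two-layer FFN; 2) adding a ZZ^T matrix to the standard QK^T attention matrix to capture richer token correlation, which increases the module's expressiveness.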
Abstract: Recently, Belief-Attention \cite{Guoqiang25BeliefAttention} has been proposed. It first performs an orthogonal projection of the softmax-based weighted summation of $V$ vectors with respect to the original $V$ vectors and then takes the perpendicular component as the residual signal in the Transformer for performance improvement. In this paper, we first conduct an ablation study showing that the projected component also carries information about the token correlation, which should not be ignored. We then propose to extend Belief-Attention by making use of both the perpendicular and projected components. In particular, the projected component goes through an activation function and then a linear mapping before merging with the considered token. Conceptually speaking, the neural block for the projected component can be viewed as a two-layer feedforward network (FFN) within the new attention block. It is also noted that standard attention captures the token correlation via the inner-product matrix $QK^T$. We propose to introduce an additional inner-product matrix $ZZ^T$ to $QK^T$ to capture richer token correlation. We refer to the new module as Belief2-Attention. It can be easily shown that Belief2-Attention is more expressive than standard Attention. We then verify the effectiveness of Belief2-Attention for vision tasks of image classification and segmentation.
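The decomposition described above splits an attention output into a part parallel to a value vector and a part perpendicular to it. The sketch below shows that orthogonal decomposition for a single token. It assumes the projection is taken per token onto that token's own value vector, which is one reading of the abstract and is not confirmed by it; the function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def split_components(attn_out, v):
    """Split attn_out into (projected, perpendicular) parts with respect to v.

    projected     = (<attn_out, v> / <v, v>) * v
    perpendicular = attn_out - projected
    """
    scale = dot(attn_out, v) / dot(v, v)
    projected = [scale * x for x in v]
    perpendicular = [a - p for a, p in zip(attn_out, projected)]
    return projected, perpendicular

# Toy example: attention output [3, 4] and value vector [1, 0].
projected, perpendicular = split_components([3.0, 4.0], [1.0, 0.0])
print(projected)      # component along v
print(perpendicular)  # residual component, orthogonal to v
```

The two parts always sum back to the attention output, and the perpendicular part has zero inner product with v. Belief-Attention keeps only the perpendicular part, whereas Belief2-Attention, per the abstract, also passes the projected part through an activation and a linear mapping.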
[85] Planktonzilla: Multimodal dataset and models for understanding plankton ecosystems cs.CV | cs.AI | cs.LG | cs.NEPDF
Alan Gerson Contreras Montanares, Luis Valenzuela, Luis Martí, Nayat Sanchez-Pi
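TL;DR: This paper introduces Planktonzilla-17M, a unified dataset that consolidates collections from 13 imaging systems and contains 17.4 million images, of which 3.74 million are plankton images covering 602 taxonomic classes. It targets the poor generalization of existing plankton classifiers caused by isolated data and inconsistent labels. Comparing supervised learning with CLIP-style image-text training on this dataset, the authors find that a supervised classifier performs on par with or better than CLIP-style training that uses taxonomic lineage as text, and they expose the limitations of current biological foundation models (such as BioCLIP) in zero-shot and few-shot settings for marine imaging.
Details
Motivation: Plankton are essential to aquatic food webs and global carbon sequestration, but existing classification models, trained on isolated data with inconsistent labels, generalize poorly across instruments and environments, which hinders understanding of ocean health and climate feedbacks.
Result: In a controlled comparison on a shared ViT backbone using Planktonzilla-17M, a supervised classifier matches or exceeds CLIP-style training that uses taxonomic lineage as text. BioCLIP and BioCLIP2 perform poorly in zero-shot and few-shot settings, while using this dataset improves classification performance.
Insight: The contribution is the largest and most comprehensive standardized plankton image dataset to date, together with a systematic evaluation of supervised versus CLIP-style training in a biological domain. It reveals the shortcomings of current general-purpose biological foundation models on marine imaging tasks and underscores the importance of large-scale, domain-specific data.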
Abstract: Marine plankton underpin aquatic food webs and play a key role in global CO2 sequestration, making reliable species identification critical for understanding ocean health and climate feedbacks. Existing classification models perform well on individual collections but fail to generalize across instruments and environments due to isolated training datasets and inconsistent labels. To address this, we introduce Planktonzilla-17M, a unified dataset consolidating publicly available plankton image collections spanning thirteen imaging systems. It comprises 17.4 million images with standardized taxonomy and geo-environmental metadata, including 3.74 million plankton images spanning over 602 taxonomic classes, of which 201 are identified at the species level, making it the largest and most comprehensive plankton image dataset to date. Using this large-scale dataset, we perform a controlled comparison between supervised and CLIP-style image–text training on a shared ViT backbone. We find that a supervised classifier matches or exceeds CLIP-style training when trained using taxonomic lineage as text. We further observe that BioCLIP and BioCLIP2 perform poorly on plankton in zero-shot and few-shot settings. Leveraging Planktonzilla-17M improves plankton classification performance, highlighting the limitations of current biological foundation models in marine imaging domains.
[86] Structured Visual Evidence Decomposition for Evidence-Grounded Multimodal Screening of Obstructive Sleep Apnea-Hypopnea Syndrome cs.CV | cs.AIPDF
Chen Zhan, Yingchen Wei, Xiaoyu Tan, Jingjing Huang, Xihe Qiu
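TL;DR: This paper proposes EviOSAHS, an evidence-driven multimodal reasoning framework for screening obstructive sleep apnea-hypopnea syndrome (OSAHS). It decomposes a facial image into seven fixed anatomical queries, generates structured evidence cards, and combines them with the clinical profile so that a large language model makes the final binary screening decision.
Details
Motivation: Using general-purpose multimodal foundation models directly for binary medical decisions yields unstable, poorly calibrated outputs. An auditable, high-sensitivity screening method that separates visual evidence acquisition from clinical adjudication is therefore needed.
Result: On a 642-subject cohort, EviOSAHS achieved 88.47% accuracy, 94.86% sensitivity, a 93.74% F1-score, and a 5.14% false-negative rate, outperforming clinical-only prompting, direct multimodal prompting, and naive two-stage pipelines.
Insight: The contribution is to decompose visual evidence into structured queries (seven anatomical questions) and decouple it from the clinical adjudication stage, yielding a high-sensitivity, auditable screening workflow rather than a direct end-to-end decision.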
Abstract: Effective pre-polysomnography screening for obstructive sleep apnea-hypopnea syndrome (OSAHS) requires combining clinical risk factors with visible craniofacial and neck cues. Directly prompting general-purpose multimodal foundation models for medical yes/no decisions can yield unstable, poorly calibrated outputs. We propose EviOSAHS, an evidence-grounded multimodal reasoning framework that separates image-only anatomical evidence acquisition from final clinical adjudication. Each frontal facial image is decomposed into seven fixed anatomical queries covering the neck, chin, mouth, face/neck fat, lower jaw, midface, and nose. Visual responses are converted into structured evidence cards recording target anatomy, visibility, risk direction, evidence strength, confidence, and a concise summary. These cards are combined with a cleaned clinical profile only in the final stage, where a large language model performs balanced binary screening adjudication. We evaluated EviOSAHS on a 642-subject cohort, mapping normal subjects to screening-negative and mild, moderate, or severe OSAHS subjects to screening-positive. EviOSAHS achieved 88.47% accuracy, 94.86% sensitivity, 93.74% F1-score, and a 5.14% false-negative rate, outperforming clinical-only prompting, direct multimodal prompting, and naive two-stage pipelines under a unified protocol. Ablations showed that seven-question visual decomposition and balanced final adjudication were critical to the high-sensitivity operating point. A question-level audit of 4,494 visual outputs showed a 100% structured parse rate and 93.88% high-visibility rate. EviOSAHS provides an auditable, high-sensitivity workflow for binary pre-polysomnography OSAHS screening, but should be viewed as a triage assistant rather than a diagnostic system. Prospective validation, external testing, and calibrated operating-point control are needed before clinical deployment.
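The reported figures are internally consistent in two ways that are easy to check: sensitivity and false-negative rate are complements (94.86% + 5.14% = 100%), and the 4,494 audited visual outputs equal 642 subjects times 7 questions. The sketch below computes the standard screening metrics from a confusion matrix. The counts are hypothetical placeholders, since the abstract does not give the cohort's class breakdown.

```python
def screening_metrics(tp, fp, tn, fn):
    """Standard binary screening metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # recall on screening-positive subjects
    false_negative_rate = fn / (tp + fn)  # always 1 - sensitivity
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "false_negative_rate": false_negative_rate,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }

# Hypothetical counts for illustration only, NOT the study's confusion matrix.
metrics = screening_metrics(tp=90, fp=10, tn=15, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")

# Consistency checks on numbers quoted in the abstract.
print(round(94.86 + 5.14, 2))  # sensitivity + false-negative rate, in percent
print(642 * 7)                 # subjects x anatomical questions
```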
[87] Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation cs.CV | cs.AI | cs.CL | cs.ROPDF
Kailing Li, Tianwen Qian, Lijin Yang, Yuqian Fu, Jingyu Gong
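TL;DR: This paper proposes a Hierarchical Semantic-Geometric Map (HSGM) to bridge the gap between 2D visual understanding and 3D spatial reasoning that vision-language models (VLMs) face in Vision-Language Navigation (VLN). The method converts 3D geometric information into a structured, VLM-compatible representation, decouples high-level semantic planning from low-level path execution, and adds instruction decomposition, enabling efficient navigation in the zero-shot setting.
Details
Motivation: Current VLM-based navigation methods suffer from a semantic-geometric gap: VLMs excel at language and 2D visual understanding but lack 3D spatial reasoning and struggle to capture the causal relationship between actions and spatial transitions, which makes zero-shot navigation unreliable.
Result: Extensive experiments on the R2R-CE and RxR-CE benchmarks show that the zero-shot framework achieves state-of-the-art performance and even outperforms some supervised methods.
Insight: The core contribution is the HSGM, which organizes 3D geometric information into three levels (geometric, semantic, and decision), so that the VLM can act as a high-level semantic planner that understands spatial layout while remaining decoupled from a classical path-planning algorithm. In addition, decomposing complex instructions into subtasks mitigates progress forgetting and hallucination in long-horizon navigation.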
Abstract: Vision-Language Navigation (VLN) enables embodied agents to reach target locations in unseen environments by following language instructions. Despite recent progress with vision-language models (VLMs), a critical semantic-geometric gap remains: while VLMs excel at language and 2D visual understanding, they struggle with 3D spatial reasoning and fail to capture the causal dynamics between actions and spatial transitions, resulting in unreliable navigation, particularly in zero-shot settings. To bridge this gap, we propose a Hierarchical Semantic-Geometric Map (HSGM) that transforms 3D geometric information into a structured representation compatible with VLMs, effectively linking them to the physical world. Specifically, HSGM is represented as a multi-channel top-down map organized into three levels: (1) geometric level that records navigable regions and obstacles, (2) semantic level that represents objects and their relations, and (3) decision level that supports high-level task reasoning and goal selection. During navigation, the VLM acts as a high-level semantic planner, interpreting the spatial layout encoded in the HSGM to select geometrically valid waypoints, while low-level, collision-free movements between waypoints are executed by a classical path-planning algorithm, fully decoupling semantic reasoning from action execution. Additionally, complex instructions are decomposed into subtasks to alleviate the problem of progress forgetting or hallucinating in long-horizon navigation. Extensive experiments on R2R-CE and RxR-CE benchmarks demonstrate that our zero-shot framework achieves state-of-the-art performance and even outperforms several supervised methods. Code is available at https://github.com/Teacher-Tom/HSGM_public.
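The split between waypoint selection and low-level motion can be illustrated with a small grid planner. The abstract only says "a classical path-planning algorithm", so the breadth-first search below is a stand-in chosen for brevity, not the planner the authors use. The occupancy grid plays the role of the geometric level (0 = navigable, 1 = obstacle), and the start and goal cells are hypothetical waypoints that a high-level planner would supply.

```python
from collections import deque

def shortest_path(grid, start, goal):
    """Breadth-first search on a 4-connected occupancy grid.

    Returns the list of cells from start to goal, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk parent links back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None

# A wall in the middle row forces a detour through the right column.
grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
path = shortest_path(grid, start=(0, 0), goal=(2, 0))
print(path)
```

Because the planner only needs the navigable-region channel, the component that picks waypoints never has to reason about collisions, which is the decoupling the abstract describes.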
[88] Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents cs.CV | cs.AIPDF
Dong-Hee Kim, Reuben Tan, Donghyun Kim
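TL;DR: This paper studies the use of external visual tools by Visual Chain-of-Thought agents and finds a tool-use collapse during training: models gradually stop using tools even as task accuracy rises. Adding an entropy regularization term to encourage diverse exploration yields the best performance despite declining tool usage.
Details
Motivation: Prior work has mainly studied tool use in visual search, while the role of tools in more complex visual reasoning, such as 3D spatial reasoning and medical visual question answering, remains underexplored. This paper goes beyond simple tasks to study how agents integrate tool-acquired local evidence with global context for complex reasoning.
Result: On challenging tasks such as 3D spatial reasoning and medical VQA, eliminating tool use entirely hurts performance, while simply incentivizing tool use brings only marginal gains. With entropy regularization encouraging diverse exploration, the model achieves the best performance while tool usage gradually declines.
Insight: The contribution is to identify the tool-use collapse phenomenon and relate it to reduced exploration diversity, and to propose viewing tools as training-time scaffolding: encouraging diverse exploration over both language generation and visual tool invocation improves reasoning even if final tool usage declines.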
Abstract: Visual agents employ external visual tools within visual chains of thought to incorporate fine-grained evidence. While prior work has mainly studied these tools in visual search tasks, their role in more complex visual reasoning remains underexplored. In this paper, we move beyond simple visual search tasks to investigate more challenging tasks, including 3D spatial reasoning and medical visual question answering, where agents must integrate tool-acquired local evidence with the global context. We identify a tool-use collapse phenomenon: models progressively stop using tools while still achieving higher task accuracy. Moreover, we observe a clear asymmetry: (i) completely eliminating tool use degrades performance, whereas (ii) incentivizing tool use yields only marginal gains despite substantially increasing usage. We find that vanilla training and tool-use encouragement both reduce rollout diversity, explaining why higher tool use does not yield stronger reasoning performance. Motivated by these findings, we add an entropy regularization term to encourage diverse rollout exploration, achieving the best performance despite gradually declining tool usage. We further observe similar dynamics on medical VQA, suggesting that tool-use collapse is not limited to 3D spatial reasoning. Overall, our findings suggest a training-time view of tools as scaffolding, where broader exploration over language generation and visual tool invocation improves reasoning despite tool-use collapse. Project page: https://scaffolded-exploration.github.io
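An entropy regularizer rewards a policy for keeping its action distribution spread out instead of collapsing onto one choice. The sketch below shows the common form, a task loss minus a weighted Shannon entropy. The abstract does not give the exact term the authors use, so the formulation, the coefficient `beta`, and the example distributions are assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def regularized_loss(task_loss, probs, beta=0.01):
    """Task loss minus an entropy bonus; higher entropy lowers the loss."""
    return task_loss - beta * entropy(probs)

# Hypothetical next-action distributions over {text token, tool A, tool B, tool C}.
collapsed = [0.97, 0.01, 0.01, 0.01]  # policy has almost stopped exploring
diverse = [0.25, 0.25, 0.25, 0.25]    # policy still explores every option

print(entropy(collapsed), entropy(diverse))
print(regularized_loss(1.0, collapsed), regularized_loss(1.0, diverse))
```

At equal task loss, the diverse distribution receives the lower regularized loss, so gradient descent is nudged away from collapse.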
[89] CoCoVideo: The High-Quality Commercial-Model-Based Contrastive Benchmark for AI-Generated Video Detection cs.CV | cs.AIPDF
Huidong Feng, Wentao Chen, Jie Chen, Xinqi Cai, Ruolong Ma
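TL;DR: This paper presents the CoCoVideo-26K dataset and the CoCoDetect detection framework to address the challenge of detecting high-quality, commercially generated AI video. The dataset contains semantically aligned real-fake video pairs produced by 13 mainstream commercial generators and establishes a new benchmark for high-fidelity video forgery detection. The framework combines contrastive learning with confidence-gated multimodal large language model inference and achieves state-of-the-art performance on several benchmarks.
Details
Motivation: Existing deepfake detection methods rely mainly on low-quality datasets generated by open-source models and struggle with high-fidelity AI-generated video that carries no visible watermark and approaches commercial-system quality, which poses new challenges to public discourse and societal security.
Result: Extensive experiments on CoCoVideo-26K and public benchmarks show that CoCoDetect achieves state-of-the-art performance, validating its robustness and generalizability.
Insight: The main contributions are the first contrastive AIGC video detection benchmark built on high-quality commercial models, and a detection framework that combines contrastive learning with confidence-gated MLLM inference, routing uncertain cases to reasoning about physical plausibility and scene consistency to improve detection of high-fidelity forgeries.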
Abstract: With the rapid advancement of artificial intelligence generated content (AIGC) technologies, video forgery has become increasingly prevalent, posing new challenges to public discourse and societal security. Despite remarkable progress in existing deepfake detection methods, AIGC forgery detection remains challenging, as existing datasets mainly rely on open-source video generation models with quality far below that of commercial AIGC systems. Even datasets containing a few commercial samples often retain visible watermarks, compromising authenticity and hindering model generalization to high-fidelity AIGC videos. To address these issues, we introduce CoCoVideo-26K, a contrastive, commercial-model-based AIGC video dataset covering 13 mainstream commercial generators and providing semantically aligned real-fake video pairs. This dataset enables deeper exploration of the differences between authentic and high-quality synthetic videos and establishes a new benchmark for highly realistic video forgery detection. Building on this dataset, we propose CoCoDetect, a detection framework integrating contrastive learning with confidence-gated multimodal large language model (MLLM) inference. An R3D-18 backbone extracts spatio-temporal representations, while a confidence gate routes uncertain cases to an MLLM for reasoning about physical plausibility and scene consistency. Extensive experiments on CoCoVideo-26K and public benchmarks demonstrate state-of-the-art performance, validating the framework’s robustness and generalizability. Our code and dataset are available at https://github.com/DonoToT/CoCoVideo.
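The confidence gate described above sends only uncertain cases to the more expensive MLLM stage. A minimal sketch of such a gate follows. The threshold value, the use of the backbone's fake-class probability as the confidence signal, and the label strings are assumptions for illustration, not details from the paper.

```python
def route(p_fake, threshold=0.9):
    """Decide directly when the backbone is confident, otherwise defer.

    p_fake: probability of "fake" from the fast classifier (here standing in
    for the R3D-18 backbone's output).
    """
    confidence = max(p_fake, 1.0 - p_fake)
    if confidence >= threshold:
        return "fake" if p_fake >= 0.5 else "real"
    return "defer_to_mllm"  # uncertain case: hand off for MLLM reasoning

for p in (0.97, 0.02, 0.60):
    print(p, route(p))
```

Raising the threshold defers more clips to the MLLM, trading compute for a second opinion on borderline cases.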
[90] Visual-Noise Guided In-Context Distillation for Multimodal Large Language Model Unlearning cs.CV | cs.AIPDF
Junkai Chen, Yuhao He, Junxiang You, Ruiqi Liu, Chenyu Wang
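TL;DR: This paper proposes Visual-Noise Guided In-Context Distillation (VGID), a framework for unlearning in multimodal large language models (MLLMs). Through a dual-modal intervention that combines visual perturbation with textual in-context unlearning, it dynamically constructs an unlearning-oriented teacher distribution and uses it as a distillation signal that guides the student model toward parameter-level unlearning, without external teacher models or explicit annotations of undesirable responses.
Details
Motivation: MLLMs may memorize and expose sensitive or restricted knowledge, raising privacy and safety risks. Existing training-based methods struggle to balance unlearning effectiveness against model utility. Training-free in-context unlearning preserves utility but does not remove memorized knowledge at the parameter level, and in multimodal settings visual inputs can still induce undesirable outputs. A more effective unlearning method is therefore needed.
Result: Experiments show that VGID achieves strong unlearning while preserving model utility. In a representative setting, ROUGE-L on the forget set drops by 0.371, while ROUGE-L on the retain set drops by only 0.055.
Insight: The contribution is a distillation-based unlearning framework that generates the teacher signal dynamically through dual-modal intervention (visual noise plus textual context), achieving parameter-level unlearning without extra annotations or external models and balancing unlearning against utility in multimodal settings.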
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress on vision-language tasks, but they may also memorize and expose sensitive or restricted knowledge, raising concerns about privacy and broader safety risks. Machine Unlearning (MU) provides a promising way to remove targeted undesirable knowledge from trained models without retraining from scratch while preserving general model utility. Nevertheless, effective unlearning in MLLMs remains particularly challenging. Existing training-based methods often struggle to balance unlearning effectiveness and model utility. In contrast, training-free methods such as in-context unlearning preserve model utility by avoiding parameter updates, but they do not remove memorized knowledge at the parameter level and may remain vulnerable to reverse-engineering attacks. More importantly, in-context unlearning is insufficient in multimodal settings, where visual inputs can provide strong conditioning signals and induce undesirable outputs. To address these challenges, we propose Visual-Noise Guided In-Context Distillation (VGID), a distillation-based framework for MLLM unlearning. VGID dynamically constructs an unlearning-oriented teacher distribution from the frozen base model through dual-modal intervention that combines visual perturbation with textual in-context unlearning. The resulting intervention-induced distribution serves as a teacher signal for distillation, guiding the student model toward parameter-level unlearning without requiring external teacher models or explicit undesirable response annotations. Experimental results show that VGID achieves strong unlearning effectiveness while preserving competitive model utility, reducing forget set ROUGE-L by 0.371 with only a 0.055 drop in retain set ROUGE-L in a representative setting.
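Distillation toward a teacher distribution is usually expressed as a divergence between the teacher's and the student's token probabilities. The sketch below uses a forward KL divergence over one next-token distribution. The choice of KL direction and the toy logits are assumptions, since the abstract does not specify VGID's loss.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher, student):
    """KL(teacher || student) for two categorical distributions."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

# Hypothetical logits over a 3-token vocabulary. Token 0 is the memorized
# answer: the intervened teacher no longer prefers it, the student still does.
teacher = softmax([0.1, 0.2, 0.1])
student = softmax([3.0, 0.2, 0.1])

loss = kl_divergence(teacher, student)
print(f"distillation loss = {loss:.4f}")
```

Minimizing this loss with respect to the student's parameters pulls its output toward the teacher's flattened distribution, which is how a distillation signal can produce parameter-level change.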
[91] General Covariant Action Modeling: Constructing Generalized Manifolds via Spatio-Temporal Decoupling cs.CV | cs.ROPDF
Huaihai Lyu, Chaofan Chen, Mingyu Cao, Yuheng Ji, Changsheng Xu
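TL;DR: This paper proposes the Generalized Action Manifold (GAM) framework, which constructs a generalized manifold through spatio-temporal decoupling to address a core challenge in embodied intelligence: robust generalization from limited data. GAM enforces general covariance by imposing invariance along two orthogonal dimensions (temporal invariance and geometric invariance), separating intrinsic task geometry from rigid execution patterns.
Details
Motivation: Existing methods learn actions by regressing absolute coordinates, which violates the principle of general covariance and ties policies to specific motion styles and fixed speeds, preventing robust generalization. The paper aims to resolve this fundamental problem.
Result: Experiments show that GAM delivers superior transfer and robustness, outperforming geometry-agnostic baselines.
Insight: The core contribution is enforcing general covariance through structural (spatio-temporal) disentanglement: 1) an Arc-Length Parameterizer provides temporal invariance by orthogonalizing spatial path geometry from temporal dynamics; 2) a Schema-Affine-Factorization mechanism provides geometric invariance by mapping trajectories to canonical "world lines" in a pose-normalized coordinate frame. This offers a new way to build a continuous, valid action manifold from sparse demonstrations.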
Abstract: Achieving robust generalization from limited data is a central challenge in embodied intelligence. Prevailing methods fail by regressing absolute coordinates, which violates the principle of general covariance. Fundamentally, this conflates the intrinsic task geometry with rigid execution patterns, binding policies to specific motion styles and fixed speeds. To resolve this, we propose the Generalized Action Manifold (GAM) framework that enforces general covariance through structural disentanglement. Specifically, GAM realizes the manifold by enforcing invariance across two orthogonal dimensions: (1) Temporal Invariance, utilizing an Arc-Length Parameterizer to orthogonalize the spatial path geometry from temporal dynamics, ensuring robustness to velocity variations; (2) Geometric Invariance, where a Schema-Affine-Factorization mechanism maps trajectories to canonical "world lines" in a pose-normalized coordinate frame. This distinguishes invariant geometric schemas from affine modulations, ensuring spatial generalizability. By integrating GAM within a structured Vision-Language-Action (VLA) architecture, we enable sparse demonstrations to densely populate a continuous, valid action manifold. Empirical results demonstrate that GAM enables superior transfer and robustness capabilities, outperforming geometry-agnostic baselines.
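Arc-length parameterization removes timing from a trajectory: two demonstrations that trace the same path at different speeds map to the same sequence of points. The sketch below resamples a polyline at equal arc-length intervals. It illustrates the general technique only and is not the paper's Arc-Length Parameterizer; the function name and sample trajectories are hypothetical.

```python
import math

def resample_by_arc_length(points, n):
    """Return n points (n >= 2) spaced evenly along the polyline's arc length."""
    cumulative = [0.0]
    for a, b in zip(points, points[1:]):
        cumulative.append(cumulative[-1] + math.dist(a, b))
    total = cumulative[-1]

    resampled = []
    seg = 0
    for k in range(n):
        s = total * k / (n - 1)  # target arc length for sample k
        while seg < len(points) - 2 and cumulative[seg + 1] < s:
            seg += 1             # advance to the segment containing s
        span = cumulative[seg + 1] - cumulative[seg]
        t = 0.0 if span == 0 else (s - cumulative[seg]) / span
        a, b = points[seg], points[seg + 1]
        resampled.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return resampled

# Same straight path, different timing: one demo lingers near the start.
slow_start = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (1.0, 0.0)]
constant_speed = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]

print(resample_by_arc_length(slow_start, 5))
print(resample_by_arc_length(constant_speed, 5))
```

Both calls yield points at x = 0, 0.25, 0.5, 0.75, 1 (up to floating-point rounding), so a policy trained on the resampled path sees geometry that no longer depends on execution speed.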
[92] Recursive Vision Transformer with Dynamic Depth and Width Adjustment for Resource-Efficient Image Semantic Communication cs.CV | cs.ITPDF
Zhilong Zhang, Xinhui Zhang, Gongyu Jin, Sihua Wang, Danpu Liu
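TL;DR: This paper proposes an image semantic communication system based on a recursive vision transformer (ViT). A recursive structure iteratively refines semantic features to reduce the parameter count, and three dynamic adjustment strategies (dynamic depth adjustment, dynamic width adjustment, and joint width-depth optimization) reduce computational complexity.
Details
Motivation: Image semantic communication in next-generation wireless systems suffers from large memory footprints and high computational complexity. The goal is an efficient scheme that can be deployed on resource-constrained devices.
Result: Simulations show that the proposed recursive ViT system, combined with the three dynamic adjustment strategies, reduces the parameter count by 48.7% and achieves higher reconstruction quality than existing baselines at comparable computational complexity.
Insight: The contribution is to introduce a recursive structure into ViT for iterative feature refinement and to design dynamic adjustment mechanisms that adapt computation to image content and channel conditions, yielding substantial gains in parameter and computational efficiency.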
Abstract: Image semantic communication is a critical component in next-generation wireless communication systems. However, such systems typically suffer from large memory footprints and high computational complexity, making them difficult to deploy on resource-constrained devices. To address these challenges, we propose a vision transformer (ViT)-enabled image semantic communication system. In this system, a recursive structure is introduced to iteratively refine semantic features and reduce the parameter count. In addition, three dynamic adjustment strategies are designed to adaptively reduce computational complexity: dynamic depth adjustment, dynamic width adjustment, and joint width-depth optimization. Dynamic depth adjustment adaptively determines the number of recursive modules according to image content and channel conditions, while dynamic width adjustment selectively preserves important neurons and attention heads. The joint width-depth optimization further enables flexible computation configurations. Simulation results verify that the proposed recursive ViT-based system, combined with the three dynamic adjustment strategies, reduces the parameter count by 48.7% and achieves higher reconstruction quality than existing baselines under comparable computational complexity.
[93] CoilDrop-MRI: Self-supervised physics-guided MRI reconstruction with coil dropout cs.CV | cs.AIPDF
Tongxi Song, Ziyu Li, Zihan Li, Wen Zhong, Congyu Liao
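TL;DR: This paper proposes CoilDrop-MRI, a self-supervised, physics-guided MRI reconstruction method. It applies random dropout along the coil dimension of the input and uses the dropped data as the training target, exploiting signal correlation across coils within a self-supervised framework. The method is integrated into unrolled architectures in both the image domain and the k-space domain, extended to multi-shot diffusion MRI reconstruction, and validated on multiple datasets.
Details
Motivation: Existing self-supervised MRI reconstruction methods partition data only in the spatial frequency domain to build input-target pairs and do not fully exploit signal correlation along the receiver-coil dimension. This work explores partitioning along the coil dimension to make better use of multi-coil information and improve reconstruction quality.
Result: Extensive validation on multi-site, multi-field-strength, and multi-modality datasets shows that CoilDrop-MRI consistently outperforms state-of-the-art self-supervised methods and reaches quality comparable to supervised methods without requiring fully sampled reference data for training.
Insight: The main contribution is to extend the data-partitioning strategy of self-supervised learning from the conventional k-space domain to the coil dimension through a coil dropout mechanism. This provides a new and effective self-supervised training paradigm that exploits multi-coil redundancy, with strong data efficiency and robust generalization across imaging conditions.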
Abstract: Self-supervised deep learning-based methods have shown great promise for accelerated magnetic resonance imaging (MRI) reconstruction, achieving high image quality without requiring fully sampled data for training. These methods typically partition the acquired data into two disjoint subsets to construct input-target pairs for optimizing the reconstruction network. However, existing approaches perform this partition exclusively within the spatial frequency (k-space) domain, leaving the coil dimension unexplored. To enforce full exploitation of signal correlation across receiver coils, we propose CoilDrop-MRI, which applies coil-wise dropout to the input and uses the dropped data as training targets in a self-supervised framework. This method is integrated into unrolled architectures in both image-domain (SENSE) and k-space (SPIRiT) formulations. We further demonstrate its versatility by extending CoilDrop-MRI to multi-shot, phase-corrected diffusion MRI (dMRI) reconstruction. CoilDrop-MRI is extensively validated on multi-site, multi-field-strength (0.3T, 0.55T, and 3T), and multi-modality (T1-weighted, T2-weighted, T2-FLAIR, and dMRI) datasets and consistently outperforms state-of-the-art self-supervised methods, achieving quality comparable to supervised reconstruction methods without requiring fully sampled reference training data. Moreover, CoilDrop-MRI exhibits strong data efficiency and robust generalization across imaging conditions, establishing it as a practical and versatile framework for self-supervised parallel MRI reconstruction.
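The partition described above splits the acquired data by receiver coil rather than by k-space location: some coils feed the network, and the held-out coils serve as the training target. The sketch below shows only that index split. The drop fraction, the function name, and the rule of always keeping and dropping at least one coil are assumptions, not details from the paper.

```python
import random

def coil_dropout_split(num_coils, drop_fraction=0.25, seed=None):
    """Randomly split coil indices into (kept, dropped) disjoint subsets.

    Assumes num_coils >= 2 so both subsets can be non-empty.
    kept    -> coils given to the reconstruction network as input
    dropped -> coils withheld and used as the self-supervised target
    """
    rng = random.Random(seed)
    num_drop = round(num_coils * drop_fraction)
    num_drop = max(1, min(num_coils - 1, num_drop))  # keep both subsets non-empty
    dropped = sorted(rng.sample(range(num_coils), num_drop))
    kept = [c for c in range(num_coils) if c not in dropped]
    return kept, dropped

kept, dropped = coil_dropout_split(num_coils=8, drop_fraction=0.25, seed=0)
print("input coils: ", kept)
print("target coils:", dropped)
```

Drawing a fresh split at each training step means every coil is sometimes an input and sometimes a target, which pushes the network to model correlation across coils.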
[94] Physics from Video: Identifiability of Time-Invariant Second-Order ODEs under Minimal Trajectory Conditions cs.CV | cs.LG | stat.MLPDF
Yuanyuan Wang, Wenjie Wang, Kun Zhang, Mingming Gong
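TL;DR: This paper studies how to uniquely recover the parameters of second-order linear ordinary differential equations (ODEs) from raw video pixels. It proves that, under a specific trajectory-coverage condition, an encoder-only pipeline learns a latent space that is locally affine to the true physical state, enabling exact parameter estimation.
Details
Motivation: The aim is to bridge the gap between visual realism and physical understanding by examining the structural identifiability of continuous-time physical laws (specifically second-order linear ODEs) from video pixels alone, without compute-intensive pixel reconstruction.
Result: The theory shows that underdamped systems are identifiable from a single video clip, whereas other damping regimes require three diverse trajectories. Validation on synthetic and real-world data confirms that the method reliably estimates interpretable physical constants from video.
Insight: The contribution is the first theoretical characterization of minimal data requirements across damping regimes, plus a variance-floor regularizer that stabilizes the decoder-free objective and prevents latent collapse, ensuring physical correctness and transparency.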
Abstract: Bridging the gap between visual realism and physical understanding is a core challenge for video-based world models. We study the structural identifiability of continuous-time physical laws from raw pixels, focusing on whether an encoder-only pipeline can uniquely recover the parameters of second-order linear ODEs. We prove that a level-set slope-coverage condition ensures the learned latent space is locally affine to the true physical state, enabling exact parameter recovery. Our theory provides the first characterization of minimal data requirements across damping regimes, establishing that underdamped systems are identifiable from a single video clip, whereas other regimes require three diverse trajectories. We further introduce a variance-floor regularizer to stabilize the decoder-free objective and prevent latent collapse. Validated on synthetic and real-world data, our approach demonstrates that interpretable physical constants can be reliably estimated from video without the need for compute-intensive pixel reconstruction, ensuring both physical correctness and transparency. Code is available at https://github.com/wenjiewang3/PhysicsFromVideo.
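The damping regimes mentioned above follow from the sign of the discriminant of the characteristic polynomial. For a mass-normalized system x'' + c x' + k x = 0, the roots of r^2 + c r + k = 0 are complex when c^2 < 4k (underdamped), repeated when c^2 = 4k (critically damped), and real and distinct when c^2 > 4k (overdamped). The sketch below classifies a system and returns the trajectory counts stated in the abstract. The mass-normalized form and the symbols c and k are assumptions, since the abstract does not fix a parameterization.

```python
def damping_regime(c, k):
    """Classify x'' + c*x' + k*x = 0 by the discriminant c^2 - 4k."""
    discriminant = c * c - 4.0 * k
    if discriminant < 0:
        return "underdamped"
    if discriminant == 0:
        return "critically damped"
    return "overdamped"

def min_trajectories(regime):
    """Minimal number of trajectories for identifiability, as stated in the abstract."""
    return 1 if regime == "underdamped" else 3

for c, k in ((0.5, 4.0), (4.0, 4.0), (5.0, 4.0)):
    regime = damping_regime(c, k)
    print(c, k, regime, min_trajectories(regime))
```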
[95] Positional Encodings Anchor Spatial Structure in Vision Transformers: A Geometric Perspective on Robustness cs.CV | cs.LGPDF
Mahmoud Mannes
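TL;DR: This paper studies how positional encodings in Vision Transformers shape internal spatial representations and how that relates to robustness. The author proposes a metric, SSDC, to quantify spatial structure in token representations. ViTs without positional encodings form spatial structure that depends on visual content and collapses under token permutation, whereas every type of positional encoding considered (learned absolute, sinusoidal, rotary) leads the model toward an index-anchored spatial organization that stays stable under content perturbation and substantially improves robustness.
Details
Motivation: Examine how positional encodings shape internal spatial representations in Vision Transformers, and how those representations affect robustness under content-disrupting distribution shifts.
Result: Using SSDC, the paper finds that ViTs with positional encodings keep their spatial structure stable under perturbations such as token permutation and are substantially more robust to distribution shift. Robustness is similar across encoding types, suggesting that it depends mainly on a stable positional reference frame rather than on the specific encoding mechanism.
Insight: The contribution is the SSDC metric for quantifying spatial structure and a geometric account of how positional encodings anchor spatial structure and improve robustness by providing a stable positional reference frame, which offers principled guidance for designing future encoding schemes.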
Abstract: Positional embeddings (PEs) in Vision Transformers (ViTs) are known to impact performance and robustness, but their role in shaping internal spatial representations is not well understood. In this work, we study how different forms of PEs influence the representational geometry of ViTs and how these changes relate to robustness under content-disrupting distribution shifts. We introduce a metric, the Spatial Similarity Distance Correlation (SSDC), to quantify spatial structure in token representations. Using this metric, we show that ViTs trained without PEs still develop non-trivial spatial structure, but this structure is driven by visual content and collapses under token permutation. In contrast, we find that all PEs considered (learned absolute, sinusoidal, and rotary) are associated with a consistent shift toward an index-anchored spatial organization. Representations in these models remain stable under perturbations that disrupt content, and exhibit substantially improved robustness to such distributional shifts. We further show that while different PEs produce distinct depth-wise trajectories of spatial structure, their robustness properties are largely similar (with secondary variation across encoding schemes), suggesting that robustness appears to depend on the presence of a stable positional reference frame more than it depends on the specific encoding mechanism. These results offer a geometric account of how positional encodings shape internal representations, with implications for the principled design of future encoding schemes.
[96] Advances in Neural 3D Mesh Texturing: A Survey cs.CV | cs.GRPDF
Sai Raj Kishore Perla, Hao Zhang, Ali Mahdavi-Amiri
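TL;DR: This survey reviews recent advances in neural 3D mesh texturing, covering methods for texture synthesis, transfer, and completion. It first summarizes key foundations in mesh geometry, texture mapping, differentiable rendering, and neural generative models, then organizes the literature into a unified taxonomy spanning early GAN-based methods to modern diffusion-based pipelines.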
Details
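Motivation: Although generative 3D approaches based on Neural Radiance Fields and Gaussian Splatting can produce textured assets directly, polygonal meshes remain the core representation across modeling, animation, visual effects, and gaming pipelines, so neural 3D mesh texturing remains an important and active research area.
Result: As a survey, the paper proposes no new method and therefore reports no quantitative experiments or benchmark results. It systematically reviews, categorizes, and analyzes existing methods.
Insight: The paper offers a structured view that places neural 3D mesh texturing methods in a single taxonomy and analyzes common architectures, supervision strategies, datasets, and evaluation protocols, which helps guide future work. Its contribution is the systematic consolidation of a fast-moving subfield.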
Abstract: Texturing 3D meshes plays a vital role in determining the visual realism of digital objects and scenes. Although recent generative 3D approaches based on Neural Radiance Fields and Gaussian Splatting can produce textured assets directly, polygonal meshes remain the core representation across modeling, animation, visual effects, and gaming pipelines. Neural 3D mesh texturing therefore continues to be an essential and active area of research. In this survey, we present a comprehensive review of recent advances in neural 3D mesh texturing, covering methods for texture synthesis, transfer, and completion. We first summarize key foundations in mesh geometry, texture mapping, differentiable rendering, and neural generative models, and then organize the literature into a unified taxonomy spanning early GAN-based methods to modern diffusion-based pipelines. We further analyze common architectures and supervision strategies, review datasets and evaluation protocols, and discuss emerging applications, practical/commercial systems, and open challenges. Together, these insights provide a structured perspective on the current landscape and help guide future developments in learning-based 3D mesh texturing.
[97] StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning cs.CV | cs.AIPDF
Xixiang He, Baiqi Wu, Xingming Li, Ao Cheng, Qiyao Sun
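TL;DR: This paper introduces StemBind, a diagnostic benchmark for detecting the disconnect between rule understanding and instance-level application in multimodal large language models on abstract visual reasoning tasks. Using a shared visual stem, the benchmark decomposes reasoning into three aligned questions (perception, rule induction, and full answer selection) so that a model's failure can be attributed to a specific sub-step.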
Details
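Motivation: Existing abstract visual reasoning benchmarks collapse perception, rule induction, and answer selection into a single right-or-wrong signal, so they cannot diagnose why a model that knows the rule still picks the wrong answer. An evaluation method that localizes reasoning failures at a finer grain is needed.
Result: Evaluation of 24 frontier MLLM configurations shows that rule accuracy exceeds full-item accuracy on 22 of 24 models; even when perception and rule are both correct, models still answer the full item incorrectly 51.2% of the time; failures concentrate in the third of Sternberg's reasoning stages (S3, rule-to-instance mapping); and neither larger models nor explicit thinking mode reliably closes the gap.
Insight: The contribution is an auditable shared-stem diagnostic framework that decomposes abstract visual reasoning into traceable stage-wise tasks. The core finding is a "rule-to-instance binding gap" in MLLMs, which points to a concrete direction for improving vision-grounded reasoning: strengthening the mapping from rules to specific instances.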
Abstract: Multimodal large language models (MLLMs) often know the rule but pick the wrong answer: on abstract visual reasoning (AVR) tasks, a model can describe what it sees and name the underlying pattern, yet still fail to choose the matching candidate. Existing AVR benchmarks cannot detect this because they collapse perception, rule induction, and answer selection into a single right-or-wrong signal. We introduce StemBind, a shared-stem diagnostic benchmark that probes the same visual stem with three aligned questions: Perception (what is in the image), Rule (what pattern governs it), and Full (which option completes it), so a final-answer error can be attributed to a specific sub-step on the same evidence. StemBind contains 2,298 curated knowledge-light stems across nine auditable visual operations, totaling 19,533 P/R/F tasks, with each full item annotated by Sternberg’s four reasoning stages (S1 Encode, S2 Infer, S3 Map, S4 Apply). Evaluating 24 frontier MLLM configurations yields four findings. (i) The R-F chasm: rule accuracy exceeds full-item accuracy on 22 of 24 models, so most failures happen after the rule is identified. (ii) A persistent binding gap: even when P and R are both correct on the same stem, models still answer F incorrectly 51.2% of the time. (iii) The bottleneck is S3: process diagnostics and Stage-wise Stimulus Augmentation localize the dominant failure to rule-to-instance mapping. (iv) Scaling and thinking do not help: neither larger models nor explicit thinking mode reliably closes the gap, and thinking even lowers rule and full-item accuracy. StemBind reframes AVR evaluation from final-answer ranking to locating where abstract visual reasoning breaks down, identifying rule-to-instance binding as a concrete next target for vision-grounded reasoning.
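The two headline quantities in the abstract, the R-F chasm and the binding gap, are simple conditional statistics over per-stem outcomes. A minimal sketch follows, assuming each stem is recorded as a dict of booleans keyed `"P"`, `"R"`, and `"F"`; this record layout and the function name `binding_gap` are hypothetical and not taken from the StemBind release.

```python
def binding_gap(records):
    """Summarize per-stem P/R/F correctness.

    records: iterable of dicts with boolean keys "P", "R", "F"
             (perception, rule, full item), one dict per stem.
    """
    records = list(records)
    n = len(records)
    rule_acc = sum(r["R"] for r in records) / n
    full_acc = sum(r["F"] for r in records) / n

    # Stems where both perception and rule were answered correctly.
    both_ok = [r for r in records if r["P"] and r["R"]]
    if both_ok:
        # Share of those stems where the full item was still wrong.
        gap = sum(not r["F"] for r in both_ok) / len(both_ok)
    else:
        gap = float("nan")

    return {
        "rule_acc": rule_acc,
        "full_acc": full_acc,
        "rf_chasm": rule_acc - full_acc,
        "binding_gap": gap,
    }
```

Conditioning on the same stem is the point of the shared-stem design: the binding gap is only meaningful because P, R, and F are asked about identical visual evidence.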
[98] CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations cs.CV | cs.AI | cs.LGPDF
Zixian Su, Hongkai Zhang, Fan Gao, Encheng Su, Taiping Qu
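TL;DR: This paper introduces CardioLens, a leakage-resistant evaluation testbed built on multi-sequence cardiovascular magnetic resonance (CMR) imaging, designed to expose the large gap between multimodal large language models (MLLMs) and clinical reality. The testbed contains a large set of verified QA pairs and evaluates three stages of CMR interpretation, finding that current MLLMs perform poorly overall and exhibit a category-collapse failure mode.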
Details
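Motivation: MLLMs perform well on public medical benchmarks, but those evaluations are often weak proxies for clinical use because they rely on isolated inputs and simplified recognition-style tasks. The paper aims to reveal the real capability gap of MLLMs on complex medical image interpretation, such as multi-sequence CMR, using an evaluation platform closer to actual clinical workflows.
Result: Across 24 state-of-the-art MLLMs, performance is poor overall and degrades along the real CMR workflow (image understanding, report generation, disease diagnosis). Models show a category-collapse failure mode, defaulting to frequent abnormal categories instead of distinguishing clinically distinct findings. Changing the slice selection protocol changes performance only marginally (about 1%), and explicit reasoning prompts also fail to help, often making models more conservative rather than improving their use of visual evidence.
Insight: The contribution is CardioLens, a leakage-resistant clinical evaluation testbed generated from private hospital archives through a rigorous report-to-QA construction and verification pipeline, which better reflects real clinical decision-making (integrating distributed evidence across sequences, views, and temporal phases). The analysis shows that current MLLMs remain far from reliable on real clinical image interpretation tasks that require complex multimodal integration and reasoning, providing a basis for developing next-generation MLLMs for real-world clinical deployment.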
Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance on public medical benchmarks, yet existing evaluations often remain weak proxies for clinical use, relying on isolated inputs and simplified recognition-style tasks. We introduce CardioLens, a leakage-resistant evaluation testbed for multi-sequence Cardiovascular Magnetic Resonance (CMR), constructed from private hospital archives through a rigorous report-to-QA construction and verification pipeline. CardioLens contains 473,896 slices and 13,494 verified QA pairs across 4D Cine, LGE, perfusion, and T2-weighted imaging, and evaluates three stages of CMR interpretation: image understanding, report generation, and disease diagnosis. Across 24 state-of-the-art MLLMs, CardioLens reveals a substantial clinical reality gap: models perform poorly overall, with performance degrading along the real CMR workflow. Confusion analysis further shows a category-collapse failure mode, where models default to frequent abnormal categories rather than distinguishing clinically distinct findings. To rule out MLLM-compatible input construction as the primary cause, we compare random, clinically motivated, and data-driven slice selection protocols under different slice budgets; performance changes only marginally, typically by about 1%. Explicit reasoning prompts also fail to rescue performance, often making models more conservative rather than improving visual evidence use. These results show that current MLLMs remain far from reliable CMR interpretation, where clinical decisions require integrating distributed evidence across sequences, views, and temporal phases. CardioLens provides a clinically grounded testbed for developing next-generation MLLMs toward real-world clinical deployment.
[99] The Harsh Truth: Segment-Level Analysis of Harsh Driving Events in Milan Using Large-Scale Telematics, Street Networks, and Google Street View cs.CV | physics.soc-phPDF
Andrea La Grotteria, Paolo Santi, Titus Venverloo, Umberto Fugiglando, Carlo Ratti
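TL;DR: This study combines large-scale vehicle telematics, street-network attributes, and Google Street View visual features to analyze how harsh acceleration and braking events are distributed across Milan's urban road network. Wider carriageways, crossings and transit stops, and more open visual fields are associated with higher event intensity, while denser built frontage is associated with lower intensity. A cycling-infrastructure case study finds that markings-only cycle lanes and mixed-traffic configurations have harshness scores 19.5% and 11.5% higher, respectively, than physically separated cycle paths.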
Details
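Motivation: Police-reported crash data are incomplete and lagging, which limits their usefulness for timely, fine-grained road-safety interventions. Harsh acceleration and braking events serve as surrogate safety indicators but have so far been studied only in comparatively small urban samples, without analysis at the scale of a large city network.
Result: Non-parametric Mann-Whitney U tests and machine-learning regression show that, once traffic exposure is controlled for, specific road features are significantly associated with harsh-event intensity. In the cycling case study, markings-only cycle lanes and mixed-traffic configurations are associated with 19.5% and 11.5% higher harshness scores, respectively, relative to physically separated cycle paths.
Insight: The contribution is a multi-dimensional framework for urban road-safety analysis that combines large-scale telematics (more than 4.2 million vehicles) with open geospatial data (OpenStreetMap) and street-view semantic segmentation features (OneFormer). The approach provides a data-driven, fine-grained assessment tool for Vision Zero decision-making and supports interventions tailored to specific road contexts rather than uniform standards.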
Abstract: Police-reported crash statistics remain the standard input for urban road-safety assessment, but their incompleteness and reporting lag limit their usefulness for timely, fine-grained intervention design. Harsh acceleration and braking events are widely used as surrogate safety indicators, but have so far been studied only in comparatively small urban samples. This study analyses harsh events across the urban road network of Milan, combining high-resolution telematics from more than 4.2 million vehicles equipped with On-Board Units, segment-level traffic metrics from TomTom, street-network and infrastructure attributes from OpenStreetMap, and visual streetscape features extracted from Google Street View via semantic segmentation using a OneFormer model. We employ an analytical framework combining non-parametric Mann–Whitney U tests of segment-feature distributions between high- and low-harshness groups with supervised machine-learning regressors. We find that, once exposure is controlled for, wider carriageways, crossings and transit stops, and more open visual fields (higher sky- and road-pixel proportions) are associated with higher harsh-event intensity, while denser built frontage is associated with lower intensity. Finally, the cycling-infrastructure case study identifies a gradient in harsh-event intensity across facility types: markings-only cycle lanes are associated with a 19.5% higher harshness score, and mixed-traffic configurations with an 11.5% higher score, relative to physically separated cycle paths, conditional on the included controls. These results support context-specific rather than uniform urban-safety interventions and illustrate how large-scale telematics combined with open geospatial and visual data can inform Vision Zero decision-making at the metropolitan scale.
[100] StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement cs.CV | cs.AI | cs.LG | cs.ROPDF
Junwon Seo, Sushant Veer, Ran Tian, Wenhao Ding, Apoorva Sharma
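TL;DR: StressDream is a method for steering video world models to generate high-impact yet plausible future scenarios. By optimizing the initial noise of the diffusion model, it makes the model produce specified types of high-impact outcomes (such as task failures), supporting more robust policy evaluation and improvement.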
Details
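Motivation: Policy evaluation and improvement with video world models typically rely on nominal imaginations, which can miss high-impact outcomes of robot actions unless prohibitively many samples are drawn. A method is needed that actively steers imaginations toward specified high-impact yet plausible outcomes.
Result: On state-of-the-art video world models for autonomous driving and robotic manipulation, StressDream effectively steers imaginations according to text specified at inference time (such as task failures) and identifies actions that could lead to undesirable outcomes, supporting robust policy evaluation.
Insight: The contribution is controllable generation of high-impact scenarios by optimizing the diffusion model's initial noise, with two complementary objectives to address the optimization challenge: a semantic objective (a vision-language model supplies gradients) and a plausibility objective (which keeps the noise from drifting out of distribution).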
Abstract: Video world models (WMs) have shown promise for policy evaluation and improvement by imagining realistic future observations conditioned on ego-robot actions. While WMs can model distributions over futures, policy evaluation and improvement typically rely on nominal imaginations, which can miss high-impact outcomes of robot actions unless prohibitively many samples are drawn. To enable robust policy evaluation and improvement over WM imaginations, we propose StressDream, which steers imaginations toward high-impact yet plausible outcomes specified at inference time by optimizing the initial noise of diffusion-based WMs. However, optimizing high-dimensional noise is challenging: the optimization must reason about nuanced, scene-dependent target events in generated videos while avoiding out-of-distribution (OOD) noise that yields implausible imaginations. We address this with two complementary objectives: a semantic objective with a Vision-Language Model that provides informative gradients by reasoning about the generated video, and a plausibility objective that prevents the optimized noise from drifting OOD. With state-of-the-art video world models for autonomous driving and robotic manipulation, we show that StressDream effectively steers imaginations toward high-impact yet plausible outcomes specified by text at inference time, such as task failures, enabling robust policy evaluation and improvement by identifying actions whose plausible futures include undesirable outcomes. Video results are available at https://junwon.me/StressDream/.
[101] Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models cs.CV | cs.AIPDF
Zijie Zhou, Dandan Zhu, Hangxiangpan Wang, Heng Zhang, Huishen Jiao
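TL;DR: This paper proposes AsyMoE, a new mixture-of-experts architecture for large vision-language models that addresses the fact that existing symmetric MoE designs overlook the asymmetry between visual and linguistic processing. It uses three specialized expert groups (intra-modality experts, hyperbolic inter-modality experts, and evidence-priority language experts) to handle modality-specific information, capture hierarchical cross-modal relationships, and suppress reliance on parametric memory, improving performance and reducing hallucination.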
Details
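Motivation: MoE methods in existing LVLMs process vision and language with symmetric architectures and ignore the inherent asymmetry between the two. As a result, they cannot effectively encode the hierarchical relationship between text and vision, and language experts in deeper layers rely increasingly on parametric memory rather than contextual evidence.
Result: In the reported experiments, AsyMoE improves on baseline MoE variants by 1.5% on average and by up to 3.8% on hallucination-sensitive tasks, while activating 25.45% fewer parameters than dense models.
Insight: The contribution is explicit modeling of modality asymmetry: hyperbolic-geometry experts capture hierarchical cross-modal relationships, and an evidence-priority mechanism keeps language experts grounded in context. This suggests a new direction for efficient structure and hierarchical relationship modeling in multimodal models.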
Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. Recent studies introduce Mixture of Experts (MoE) into LVLMs for improved computational efficiency. However, existing MoE approaches treat visual and linguistic modalities with symmetric architectures, overlooking the inherent asymmetry in how these two modalities are processed. This asymmetry causes two critical issues. First, text and vision form hierarchical rather than parallel relationships, as text queries typically describe partial aspects of complete visual scenes. Euclidean expert space struggles to encode such containment structures. Second, language experts in deeper layers progressively shift from evidence-based processing to parametric memory dependence, losing grounding in the provided visual and linguistic information. To address these issues, we propose AsyMoE, a novel architecture that explicitly models this asymmetry through three specialized expert groups. Intra-modality experts handle modality-specific processing. Hyperbolic inter-modality experts capture hierarchical cross-modal relationships through negative curvature geometry. Evidence-priority language experts suppress parametric memory activation and maintain contextual grounding throughout network depth. Extensive experiments demonstrate that AsyMoE achieves consistent improvements over baseline methods, with average gains of 1.5% over MoE variants and up to 3.8% on hallucination-sensitive tasks. AsyMoE activates 25.45% fewer parameters compared to dense models.
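To make the "negative curvature geometry" remark concrete, the sketch below computes the standard geodesic distance in the Poincaré ball model with curvature -1. The abstract does not say which hyperbolic model AsyMoE uses, so the Poincaré ball and the function name `poincare_distance` are assumptions chosen purely for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit ball
    (Poincare ball model, curvature -1).

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))
```

The same Euclidean step costs far more hyperbolic distance near the boundary than near the origin, which is the property usually cited for embedding tree-like or containment structures: general concepts can sit near the center and specific ones near the rim.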
[102] Training-Free Object-Agnostic Jam Detection in Fulfillment Centers cs.CVPDF
Ruiliang Liu, Tina Dongxu Li, Joshua Migdal, Fernando Ruch, Kenneth Meszaros
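TL;DR: This paper proposes a training-free, object-agnostic jam detection method for fulfillment centers, where objects can jam because of conveyor friction, incorrect orientation, or mechanical failure. The method uniformly samples reference points in the monitoring region and uses persistent occlusion of those points by objects as the detection signal, declaring a jam when the fraction of occluded points stays high beyond a temporal threshold.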
Details
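Motivation: Traditional jam detection relies on object detection models and tracking algorithms, which require extensive manual annotation and are limited to annotated object classes, leading to long development cycles and limited generalization.
Result: Experiments on 1,069 videos show that the approach, using AllTracker, reaches 100.00% precision and a 93.33% F1 score, significantly outperforming classical sparse tracking methods while keeping the advantage of training-free deployment.
Insight: The contribution is to repurpose occlusion, a failure case in conventional tracking, as the detection signal: jams are detected by monitoring whether reference points stay occluded rather than by tracking where they move. This removes the need for labeled data, generalizes across object types, and greatly shortens development time.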
Abstract: In fulfillment centers, diverse objects move continuously from inbound to outbound operations and can become jammed due to excessive conveyor friction, incorrect orientation, or mechanical failures. Traditional jam detection approaches rely on object detection models to identify objects, followed by tracking algorithms (such as IoU overlap and Kalman filtering) to monitor motion over time. This pipeline requires thousands of manual annotations, consuming approximately two weeks of effort, and is limited to annotated object classes. We present a training-free, object-agnostic jam detection method that eliminates the need for labeled data. Our approach uniformly samples reference points within the monitoring region when no objects are present. As objects occlude these points, we detect motion. When a sufficient fraction remains occluded beyond a temporal threshold, we classify the event as a jam. Unlike conventional point tracking, which treats occlusion as a failure case, our approach repurposes occlusion as a detection signal, monitoring whether reference points remain persistently occluded rather than tracking where they move. Our experimental evaluation on 1,069 videos demonstrates that AllTracker achieves 100.00% precision and 93.33% F1 score, significantly outperforming classical sparse tracking methods while maintaining training-free deployment. This approach offers three key advantages: (1) no training data or manual annotations, (2) object-agnostic generalization to arbitrary object types, and (3) significantly reduced development time.
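The decision rule in the abstract (a sufficient fraction of reference points stays occluded beyond a temporal threshold) can be sketched as a per-point streak counter. This is a minimal sketch, not the authors' implementation: it assumes the per-frame occlusion flags already come from some point tracker, and the function name `detect_jam` and both thresholds are hypothetical.

```python
def detect_jam(frames, min_fraction=0.5, min_frames=3):
    """Return the index of the first frame at which a jam is declared,
    or None if no jam occurs.

    frames: iterable of lists of booleans; frames[t][p] is True when
            reference point p is occluded in frame t (assumed to be
            supplied by an upstream point tracker).
    min_fraction: share of points that must be persistently occluded.
    min_frames: consecutive occluded frames that count as "persistent".
    """
    streaks = None
    for idx, flags in enumerate(frames):
        if streaks is None:
            streaks = [0] * len(flags)
        # Extend the streak of each occluded point; reset it otherwise.
        streaks = [s + 1 if occluded else 0
                   for s, occluded in zip(streaks, flags)]
        stuck = sum(s >= min_frames for s in streaks)
        if stuck / len(streaks) >= min_fraction:
            return idx
    return None
```

An object that passes through normally occludes each point only briefly, so no streak reaches `min_frames`; a stationary object keeps the same points covered and trips the rule. Nothing in the rule depends on what the object is, which is where the object-agnostic property comes from.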
[103] Real2SAM2Real: Generative 3D Caches as Complementary Context for Video Diffusion cs.CV | cs.AIPDF
Jiayi Wu, Haoming Cai, Cornelia Fermuller, Christopher Metzler, Yiannis Aloimonos
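TL;DR: This paper proposes Real2SAM2Real, a framework that addresses the difficulty of precisely controlling camera and scene dynamics in video diffusion models (VDMs). It uses 3D lifting models (such as SAM3D) to extract an explicitly editable 3D cache that serves as a geometric scaffold for the VDM, and injects holistic spatial priors through Soft Spatial-Aligned Injection and a fine-tuning strategy, enabling decoupled control over camera trajectories and multi-entity motion.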
Details
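Motivation: Existing video diffusion models rely mainly on implicit diffusion priors to generate unobserved regions, which tends to cause structural collapse under high-dynamic motion or complex occlusions. A method that supplies dependable 3D-aware guidance is needed to overcome this over-reliance on diffusion priors.
Result: Extensive experiments show that Real2SAM2Real enables precise, decoupled control over camera trajectories and multi-entity motion, maintains strong spatiotemporal consistency under large camera shifts and severe occlusions, and resolves perspective ambiguities caused by structural holes, erroneous facades, and misleading cues from reflections and refractions.
Insight: The contributions are the use of a generative 3D cache as complementary context that gives the VDM an explicit geometric scaffold; a Soft Spatial-Aligned Injection mechanism and a minimally invasive fine-tuning strategy that exploit 3D guidance while preserving pre-trained priors; and masked normal maps used as a cross-modal bridge to build a 3D-free data curation and perturbation pipeline, which decouples geometry from appearance.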
Abstract: While Video Diffusion Models (VDMs) excel at synthesizing high-fidelity videos, enabling precise camera and scene control remains challenging. Existing methods predominantly rely on implicit diffusion priors to generate unobserved regions, inevitably leading to structural collapse during high-dynamic movements or complex occlusions. To address this challenge, we propose Real2SAM2Real, a framework that leverages 3D lifting models (e.g., SAM3D) to extract an explicitly editable 3D cache, serving as a robust geometric scaffold for the VDM. By capturing the entire 3D volume of foreground entities rather than just their visible shells, this cache injects holistic spatial priors into the VDM, providing dependable 3D-aware guidance for complex scene dynamics. To effectively leverage this 3D guidance while preserving pre-trained priors, we design a Soft Spatial-Aligned Injection mechanism alongside a minimally invasive fine-tuning strategy tailored for VDMs. Furthermore, we employ masked normal maps as a cross-modal bridge to construct a 3D-free data curation and perturbation pipeline. Extensive experiments demonstrate that Real2SAM2Real enables precise, decoupled control over both camera trajectories and multi-entity motions. By utilizing the complementary context from generative 3D caches, our framework overcomes typical breakdowns caused by over-reliance on diffusion priors, maintaining exceptional spatiotemporal consistency under large camera shifts and severe occlusions. Crucially, by decoupling geometry from appearance, our VDM-tailored 3D cache eradicates perspective ambiguities caused by structural holes and erroneous facades, as well as misleading cues from reflections and refractions. Project website is available at https://jiayi-wu-leo.github.io/real2sam2real
[104] UniVerse: A Unified Modulation Framework for Segmentation-Free, Disentangled Multi-Concept Personalization cs.CVPDF
Quynh Phung, Sandesh Ghimire, Minsi Hu, Chung-Chi Tsai, Jia-Bin Huang
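TL;DR: This paper proposes UniVerse, a unified modulation framework for diffusion transformers that targets segmentation-free, disentangled multi-concept personalization. The method supports composable and decomposable concept extraction, enabling fine-grained localization and representation of target objects without explicit segmentation masks.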
Details
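Motivation: Existing personalized visual understanding methods struggle to localize and extract specific concepts when input images contain multiple objects. They rely heavily on segmentation-based supervision or generalize poorly to new compositions, which limits their ability to disentangle and manipulate individual concepts accurately.
Result: Extensive experiments on multiple benchmarks show that UniVerse significantly outperforms state-of-the-art baselines in both localization accuracy and visual fidelity. Qualitative and quantitative results confirm that it can precisely extract target concepts in cluttered scenes.
Insight: The contribution is a unified modulation framework that learns to decompose complex scenes into concept-specific representations and then compose them in a unified manner, enabling robust personalization across diverse visual contexts. This opens the way to more flexible, interpretable, and personalized visual generation and understanding.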
Abstract: Personalized visual understanding has advanced significantly, yet existing approaches struggle to localize and extract specific concepts when input images contain multiple objects. Many prior methods rely heavily on segmentation-based supervision or exhibit poor compositional generalization, limiting their ability to accurately disentangle and manipulate individual concepts. In this work, we propose UniVerse, a Unified Modulation Framework for segmentation-free, disentangled multi-concept personalization in diffusion transformers. Our method allows for composable and decomposable concept extraction, enabling fine-grained localization and representation of target objects without explicit segmentation masks. UniVerse learns to decompose complex scenes into concept-specific representations and then compose them in a unified manner, enabling robust personalization across diverse visual contexts. Through extensive experiments on multiple benchmarks, we demonstrate that UniVerse significantly outperforms state-of-the-art baselines in both localization accuracy and visual fidelity. Qualitative and quantitative results show that our approach can precisely extract target concepts in cluttered scenes, paving the way for more flexible, interpretable, and personalized visual generation and understanding.
[105] Score-Control for Hallucination Reduction in Diffusion Models cs.CVPDF
Mahesh Bhosale, Naresh Kumar Devulapally, Abdul Wasi, Chau Pham, Vishnu Suresh Lokhande
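TL;DR: This paper studies the root cause of hallucinations in diffusion models (implausible samples that do not match the true data distribution) and proposes a method to reduce them. The authors first confirm empirically that score smoothness causes hallucinations in image-generation diffusion models and give a density-based theoretical explanation. They then propose Variance-Guided Score Modulation (VSM), which controls the score Jacobian to reduce score smoothness and thereby reduce hallucinations.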
Details
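Motivation: Diffusion models are the backbone of modern generative AI, but they hallucinate: they generate implausible samples that lie outside the support of the true data distribution, which degrades reliability and trust. The paper aims to understand these hallucinations theoretically and reduce them in practice.
Result: Experiments on synthetic and real-world datasets show that the method reduces hallucinations (by up to about 25%) while maintaining high fidelity and diversity. The authors also propose two benchmark datasets with extreme semantic variation for systematic hallucination evaluation.
Insight: The core contribution is to link the probability mass of hallucinations to the Lipschitz constant of the learned score function, formalizing the view that score smoothness causes hallucinations, and to derive the principled VSM method from that link. The method offers both theoretical guidance and a practical tool for more reliable diffusion-based image generation, and the proposed evaluation benchmarks are also valuable.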
Abstract: Diffusion models have emerged as the backbone of modern generative AI, powering advances in vision, language, audio and other modalities. Despite their success, they suffer from hallucinations, implausible samples that lie outside the support of the true data distribution, which degrade reliability and trust. In this work, we first empirically confirm a previously proposed hypothesis that score smoothness causes hallucinations in image-generation diffusion models and provide a density-based perspective. We further formalize this notion by linking the probability mass of hallucinations to the Lipschitz constant of the learned score function. Motivated by this, we introduce a Variance-Guided Score Modulation (VSM) strategy that controls the score Jacobian, in turn reducing score smoothness and better approximating the ground-truth score, which decreases hallucinations. Empirical results on synthetic and real-world datasets demonstrate that our approach reduces hallucinations (up to ~25%) while maintaining high fidelity and diversity, providing a principled step toward more reliable diffusion-based image generation. We also propose two benchmark datasets with extreme semantic variation for systematic hallucination evaluation. Code and datasets are publicly available at https://github.com/bhosalems/VSM.
[106] Non-Learning Low-Light Stereo Vision cs.CVPDF
Jason Wang, Lucas Nguyen, Hyunseung Eom, Wei Xu, Qi Guo
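TL;DR: This paper proposes a non-learning stereo framework for estimating disparity from severely noisy images. It uses the Field of Junctions (FoJ) to retain coarse visual features that stay stable under noise for cost volume construction, while discarding fine textures that cannot be separated from photon noise. The resulting structural information guides a boundary-aware Semi-Global Matching (SGM) that dynamically adapts smoothness penalties to preserve true disparity discontinuities.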
Details
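Motivation: To address the degradation of traditional stereo matching under severe noise (for example, in low light), in particular the matching difficulty that arises because fine textures cannot be distinguished from photon noise.
Result: On widely used benchmark datasets, the method produces sparse disparity maps that are more accurate over unmasked pixels than those of recent stereo algorithms.
Insight: The contribution is to build the cost volume from noise-robust structural features extracted with FoJ and to use a boundary-aware SGM with dynamic smoothness penalties, achieving noise robustness within a non-learning framework and offering a noise-handling approach that low-light stereo vision can draw on.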
Abstract: We present a non-learning stereo framework for disparity estimation from severely noisy images. Using the Field of Junctions (FoJ), it retains coarse visual features stable under severe noise for cost volume construction while discarding fine textures inseparable from photon noise. The resulting structural information guides boundary-aware Semi-Global Matching (SGM) that dynamically adapts smoothness penalties to preserve true disparity discontinuities. The output is a sparse disparity map more accurate than those of recent stereo algorithms over unmasked pixels on widely-used benchmark datasets.
[107] Zamba2-VL Technical Report cs.CV | cs.AIPDF
Hassan Shapourian, Kasra Hejazi, Olabode M. Sule, Beren Millidge
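TL;DR: This paper introduces Zamba2-VL, a suite of vision-language models built on the Zamba2 architecture, a hybrid that combines Mamba2 state-space layers with a small number of shared Transformer blocks. Across a broad range of image understanding, reasoning, OCR, grounding, and counting benchmarks, it is competitive with leading Transformer-based open-weight VLMs of comparable scale (such as Molmo2, Qwen3-VL, and InternVL3.5) and substantially outperforms prior SSM-based and hybrid VLMs (such as VL-Mamba, Cobra, and mmMamba). Thanks to the near-linear prefill compute and small, near-constant recurrent state of its Zamba2 backbone, Zamba2-VL delivers roughly an order of magnitude lower time-to-first-token (TTFT) than Transformer baselines at matched parameter scale, with the efficiency gap most pronounced at the smaller 1.2B and 2.7B scales that are most relevant to on-device and edge deployment.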
Details
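Motivation: To develop an efficient yet capable vision-language model that combines the strengths of state-space models (SSMs) and Transformers, addressing the inference-efficiency bottleneck of conventional Transformer models (time-to-first-token in particular), especially for on-device and edge deployment.
Result: Across a broad range of image understanding, reasoning, OCR, grounding, and counting benchmarks, Zamba2-VL is competitive with leading Transformer-based open-weight VLMs of comparable scale (such as Molmo2, Qwen3-VL, and InternVL3.5) and substantially outperforms prior SSM-based and hybrid VLMs. Its TTFT is roughly an order of magnitude lower than that of the Transformer baselines.
Insight: The main contribution is a hybrid architecture that combines Mamba2 state-space layers with a small number of shared Transformer blocks, matching pure-Transformer performance while achieving markedly higher inference efficiency (TTFT in particular). This suggests a path to deploying high-performance VLMs in resource-constrained settings and demonstrates the potential of hybrid architectures for vision-language tasks.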
Abstract: We present Zamba2-VL, a suite of vision-language models built on Zamba2, a hybrid language-model architecture combining Mamba2 state-space layers with a small number of shared transformer blocks. Across a broad range of image understanding, reasoning, OCR, grounding, and counting benchmarks, Zamba2-VL is competitive with leading Transformer-based open-weight VLMs of comparable scale, including the Molmo2, Qwen3-VL, and InternVL3.5 families, and substantially outperforms prior SSM-based and hybrid VLMs such as VL-Mamba, Cobra, and mmMamba. Inheriting the near-linear prefill compute and small, near-constant recurrent state of its Zamba2 backbone, Zamba2-VL delivers roughly an order of magnitude lower time-to-first-token (TTFT) than these Transformer baselines at matched parameter scale, with the efficiency gap most pronounced at the smaller 1.2B and 2.7B scales most relevant to on-device and edge deployment. We release three models – 1.2B, 2.7B, and 7B – together with inference code at https://huggingface.co/collections/Zyphra/zamba2-vl.
[108] Detect Before You Leap: Mirage Detection in Vision-Language Models cs.CV | cs.AIPDF
Sayeed Shafayet Chowdhury, Md. Shaown Miah
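TL;DR: This paper proposes TC-LIA, a model-agnostic method for detecting, before a vision-language model (VLM) generates an answer, whether it is likely to produce a "mirage" (a confident answer given without visual evidence). The method analyzes the layer-wise similarity trajectory between image patch tokens and the question embedding in a CLIP vision encoder, and combines it with several other features in an ensemble. Across multiple VQA domains and VLM backbones it achieves high three-class detection accuracy and sharply lowers the mirage rate.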
Details
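Motivation: To address the problem that vision-language models still produce plausible-looking but visually ungrounded confident answers (the "mirage" phenomenon) when the required visual evidence is missing, blank, or irrelevant. This is especially critical in high-stakes domains such as medical and document visual question answering.
Result: Across five VQA domains, three input conditions, and twelve VLM backbones, the best systems achieve approximately 94.6-94.7% three-class detection accuracy and bring the mirage rate below 3%, while baseline mirage rates range from 21.7% to 66.6%.
Insight: The contribution is Text-Conditioned Layer-wise Internal Alignment (TC-LIA), which predicts whether a model should abstain by tracking how question-relevant evidence emerges across the layers of the vision encoder. The method combines multi-scale alignment features, pixel-statistic detection, zero-shot domain routing, and structured self-assessment in an ensemble, yielding a general and effective pre-release detection framework.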
Abstract: Vision-language models (VLMs) can produce confident visual answers even when the required visual evidence is missing, blank, or unrelated to the question. This failure mode, known as mirage (Asadi et al. 2026), is especially concerning in medical and document visual question answering, where plausible but visually ungrounded responses may be mistaken for image-based evidence. We study pre-release mirage detection: given an image-question pair, the goal is to determine whether a VLM should answer or abstain before producing a response. We propose Text-Conditioned Layer-wise Internal Alignment (TC-LIA), a model-agnostic method that probes patch-token representations across the layers of a CLIP ViT-H/14 vision encoder. TC-LIA projects layer-wise image patch tokens into the final CLIP embedding space and measures their similarity to the question embedding, allowing the method to track whether question-relevant visual evidence emerges across vision layers. The resulting alignment trajectory is summarized using final image-text cosine similarity, late-layer top-k patch-text alignment, early-to-late gain, and layer-wise slope. These features are combined with pixel-statistic blank/noise detection, zero-shot domain routing, and structured VLM self-assessment in an ensemble. Across five VQA domains, three input conditions, and twelve VLM backbones, the best systems achieve approximately 94.6-94.7% three-class detection accuracy with mirage rates below 3%, while baseline mirage rates range from 21.7% to 66.6%.
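The trajectory summaries named in the abstract (late-layer top-k patch-text alignment, early-to-late gain, layer-wise slope) can be sketched as below. This sketch assumes the patch-to-question similarities have already been computed per layer; the function names, the choice of averaging over `n_edge` boundary layers, and the least-squares definition of slope are illustrative assumptions rather than the paper's exact definitions. The final image-text cosine similarity is a separate scalar and is not computed here.

```python
def topk_mean(values, k):
    """Mean of the k largest values."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)


def trajectory_features(layer_sims, k=2, n_edge=2):
    """Summarize a layer-wise alignment trajectory.

    layer_sims[l][p]: similarity between patch p at vision layer l and
                      the question embedding (assumed precomputed).
    k: number of best-aligned patches averaged per layer.
    n_edge: number of layers treated as "early" and as "late".
    """
    per_layer = [topk_mean(sims, k) for sims in layer_sims]
    n = len(per_layer)
    early = sum(per_layer[:n_edge]) / n_edge
    late = sum(per_layer[-n_edge:]) / n_edge

    # Ordinary least-squares slope of alignment against layer index.
    mean_x = (n - 1) / 2
    mean_y = sum(per_layer) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(per_layer))
    den = sum((x - mean_x) ** 2 for x in range(n))
    slope = num / den

    return {"late_topk": late, "gain": late - early, "slope": slope}
```

Intuitively, a flat or falling trajectory would indicate that no question-relevant evidence emerges in the image, which is the situation in which a detector of this kind would recommend abstaining.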
[109] Physical Object Understanding with a Physically Controllable World Model cs.CVPDF
Rahul Venkatesh, Klemen Kotar, Lilian Naing Chen, Wanhee Lee, Gia Ancone
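TL;DR: This paper proposes a new class of probabilistic world models that infer the physical structure of scenes from raw video, including what objects are present and the laws governing their interactions. The model is trained efficiently with autoregressive sequence modeling and supports estimating the probability of any visual variable (such as appearance and dynamics) conditioned on others, which enables object discovery, motion prediction, and 3D manipulation.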
Details
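Motivation: A central challenge in visual intelligence is learning the physical structure of scenes from raw video, including how objects form and the laws that govern their interactions. Current architectures cannot infer distributional states of the world from partial observations, so world models that support this kind of inference are needed.
Result: The model generates multiple plausible future world states that capture the physical laws of object motion, and extracts objects and their articulated subparts by analyzing motion correlations across those futures. In physical manipulation tasks such as Visual Jenga, it can compute physical relationships between objects, demonstrating its application potential.
Insight: The contribution is a probabilistic world model based on autoregressive sequence modeling that discovers objects from video without supervision, infers their physical properties, and supports conditional probability estimation and 3D manipulation. This offers a new route to learning structured physical knowledge from raw visual data and may advance embodied intelligence and physical reasoning.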
Abstract: A central challenge in visual intelligence is learning the physical structure of scenes from raw videos: how regions form objects and the laws that govern their interactions. Solving these tasks requires world models capable of inferring distributional states of the world from partial observations - capabilities that current architectures do not provide. We introduce a new class of probabilistic world models that support estimation of the probability of any visual variable, such as appearance and dynamics, conditioned on any other variables. Here, we identify that these models can be trained efficiently with autoregressive sequence modeling, yielding world models from which rich object understanding emerges. First, we demonstrate that our model captures the physical laws governing how objects move by generating multiple plausible future states of the world through sequential inference. Then, by analyzing motion correlations across these futures, we extract objects and articulated object subparts. Having discovered these objects, we show that our world model can manipulate them in 3D. Finally, we demonstrate how physical relationships between objects can be computed from the world model, enabling applications such as Visual Jenga.
[110] DarkVesselNet: Multi-Modal Remote Sensing and Trajectory Reasoning for Dark Vessel Detection cs.CV | cs.AI | cs.LGPDF
Arun Sharma
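TL;DR: DarkVesselNet is a multi-modal remote sensing system for detecting "dark vessels" (ships that do not report AIS signals). It fuses Sentinel-1 SAR imagery, Sentinel-2 optical imagery, geospatial foundation models, AIS trajectory reasoning, TGARD-style gap detection, and a Pi-DPM-inspired anomaly head. The paper describes the system architecture, the fusion path, and software-based validation, and provides an open-source implementation.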
Details
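Motivation: Maritime monitoring that relies only on Automatic Identification System (AIS) reports misses dark vessels, so satellite observations (SAR and optical) must be fused with AIS trajectory data for joint reasoning and detection.
Result: The paper currently offers mainly software-grounded evidence: tests for SAR speckle filtering, optical band ratios, Haversine distance, TGARD gap detection, sensor coregistration, backbone token shapes, and differentiable anomaly scoring. It does not yet report quantitative comparisons with other methods on a specific benchmark or make state-of-the-art claims.
Insight: The contribution is a complete software stack that integrates multi-source remote sensing data, geospatial foundation models, and trajectory reasoning, using TGARD for trajectory gap analysis and a Pi-DPM-inspired anomaly head. Its systematic multi-modal fusion framework and open-source practice are a useful reference for remote sensing object detection.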
Abstract: Dark vessel detection requires fusing what vessels report through AIS with what satellites observe through radar and optical sensors. DarkVesselNet is a multi-modal remote sensing stack that combines Sentinel-1 SAR, Sentinel-2 optical imagery, geospatial foundation model backbones, AIS trajectory reasoning, TGARD-style gap detection, and a Pi-DPM-inspired anomaly head. The repository exposes the system as a tested Python package and a public Hugging Face Space. The paper presents the sensor stack, backbone abstraction, fusion path, anomaly head, and current validation. The evidence currently available is software-grounded: tests for SAR speckle filtering, optical band ratios, Haversine distance, TGARD gap emission, sensor coregistration, backbone token shapes, and differentiable anomaly scoring.
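Two of the components the abstract lists as tested, Haversine distance and AIS gap emission, are easy to illustrate. The sketch below uses the standard Haversine formula and a plain time-threshold gap detector; it is not the TGARD algorithm, whose details the abstract does not give, and the function names, the track layout, and the one-hour threshold are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points
    given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def find_ais_gaps(track, max_gap_s=3600):
    """Report silent intervals in an AIS track.

    track: list of (timestamp_s, lat, lon) tuples sorted by time.
    max_gap_s: silence longer than this is reported (assumed threshold).
    """
    gaps = []
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(track, track[1:]):
        duration = t1 - t0
        if duration > max_gap_s:
            distance = haversine_km(lat0, lon0, lat1, lon1)
            gaps.append({
                "start": t0,
                "end": t1,
                "duration_s": duration,
                "distance_km": distance,
                # Average speed the vessel needed to cover the gap.
                "implied_speed_kmh": distance / (duration / 3600),
            })
    return gaps
```

A gap record of this kind gives the window and rough area in which satellite detections without a matching AIS report would be worth inspecting.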
[111] GeoSAM-3D: Geodesic Prompt Propagation for Open-Vocabulary 3D Scene Segmentation from Monocular Video cs.CV | cs.AIPDF
Arun Sharma
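TL;DR: GeoSAM-3D proposes open-vocabulary 3D scene segmentation from monocular video: a user uploads a short video, clicks or names an object in one frame, and receives a 3D mask propagated over a Gaussian scene. The method combines frozen image and video foundation models, a monocular 3D Gaussian Splatting reconstruction, and a differentiable graph-geodesic propagation kernel over Gaussian centroids. Its central design choice is to propagate prompts by heat-kernel distance on the reconstructed scene graph rather than by Euclidean nearest neighbors in 3D, which preserves continuity around curved surfaces and reduces leakage.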
Details
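Motivation: Existing open-vocabulary 3D scene segmentation usually depends on heavier setups such as RGB-D video, calibrated multi-view imagery, or a reconstructed mesh. The paper explores a lightweight interactive setting that needs only monocular video.
Result: The paper validates the approach with an evaluation protocol that separates implementation validation, graph propagation quality, leakage control, and interactive latency, and reports that the method preserves continuity on curved surfaces and limits leakage across objects.
Insight: The contribution is to replace conventional 3D Euclidean nearest-neighbor propagation with a graph-geodesic kernel based on heat-kernel distance, which handles curved geometry better and reduces erroneous propagation between unrelated objects. Combining frozen foundation models with a differentiable Gaussian reconstruction enables efficient 3D segmentation from monocular video.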
Abstract: Open-vocabulary 3D scene segmentation usually assumes RGB-D video, calibrated multi-view imagery, or a reconstructed mesh. GeoSAM-3D studies a lighter setting: a user uploads a short monocular video, clicks or names an object in one frame, and receives a propagated 3D mask over a Gaussian scene. The implementation combines frozen image and video foundation models with a monocular 3D Gaussian Splatting reconstruction and a differentiable graph-geodesic propagation kernel over Gaussian centroids. The central design choice is to propagate prompts by heat-kernel distance on the reconstructed scene graph, rather than by Euclidean nearest neighbors in 3D. This preserves continuity around curved surfaces and reduces leakage across nearby but disconnected objects. This paper describes the repository state, the mathematical kernel implemented in geosam3d.propagate, the feature head trained from Segment Anything masks, and the validation already present in the codebase. The evaluation protocol separates implementation validation, graph propagation quality, leakage control, and interactive latency.
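The argument for graph-based over Euclidean propagation can be seen in a toy heat-diffusion example. The sketch below is not the kernel in `geosam3d.propagate`, whose implementation the abstract does not show; it approximates heat flow on an unweighted graph with explicit Euler steps, and the function name `heat_diffuse`, the adjacency-dict layout, and the step count are assumptions for illustration.

```python
def heat_diffuse(adj, seed, t=1.0, steps=100):
    """Diffuse a unit of heat from `seed` over a graph for time `t`.

    adj: dict mapping each node to a list of its neighbours
         (undirected, unweighted; a stand-in for a graph built over
         Gaussian centroids).
    Returns a dict node -> heat, approximating one column of the heat
    kernel exp(-t * L) for the graph Laplacian L = D - A.
    """
    u = {node: 0.0 for node in adj}
    u[seed] = 1.0
    dt = t / steps  # must be small relative to 1 / max degree for stability
    for _ in range(steps):
        # Explicit Euler step of du/dt = -L u: each node moves toward
        # the values of its neighbours.
        u = {node: u[node] + dt * sum(u[m] - u[node] for m in adj[node])
             for node in adj}
    return u
```

Heat only travels along edges, so a node in a different connected component receives nothing from the seed no matter how close it is in 3D space. A Euclidean nearest-neighbor rule has no such barrier, which is the leakage behavior the abstract describes.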
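[112] An explainable hierarchical self attention-based approach for tremor detection in the time domain cs.CV | eess.SP
Timothy Odonga, Jeanne M. Powell, Mark Saad, Richa Tripathi, Christine D. Esper
TL;DR: This paper proposes an explainable, hierarchical self-attention-based method for tremor detection in the time domain. The framework learns tremor patterns directly from 3D kinematic marker time-series data across entire tremor-provoking trials, without relying on expert-designed frequency-domain features.
Details
Motivation: Conventional automated tremor detection relies on frequency-domain features derived from clinical expertise. This work explores a data-driven time-domain approach that reduces dependence on expert-engineered features and improves interpretability.
Result: Evaluated across nine body parts, the framework achieves F1-scores from 0.594 to 0.947 (average 0.765). This falls short of the frequency-domain state of the art (0.909) but requires minimal preprocessing.
Insight: The novelty lies in a hierarchical framework that combines a convolutional LSTM network with a vision transformer to learn representations directly from time-domain data, and in using attention weights and Grad-CAM for post hoc interpretability of the temporal and anatomical patterns of tremor, demonstrating the feasibility of data-driven time-domain modeling.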
Abstract: Tremor is a common movement disorder associated with conditions such as Parkinson’s disease and essential tremor, traditionally diagnosed through expert clinician assessment. Current automated detection methods rely on frequency-domain features informed by clinical expertise. In this work, we present an explainable, two-stage hierarchical framework for tremor detection in the time domain that learns tremor patterns directly from 3D kinematic marker time-series data across entire tremor-provoking trials. The framework combines a deep convolutional and long short-term memory network, which learns tremor representations from short, discrete, non-overlapping time segments of each trial's kinematic time series, with a vision transformer that models the long-term temporal dynamics of these segment features for trial (session) level classification. Evaluated across nine body parts, the framework achieved F1-scores of 0.594 to 0.947 depending on body part (average: 0.765), falling short of the frequency-domain state-of-the-art performance (0.909) while requiring minimal preprocessing. Attention weights and gradient-based class activation maps (Grad-CAM) identified time-domain features of tremor across body parts. This proof of concept demonstrates the feasibility of data-driven time-domain modeling for tremor detection across anatomically diverse body parts, while reducing reliance on expert-engineered spectral features and providing post hoc interpretability of the temporal and anatomical patterns of tremor.
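[113] CodeCytos: AI-assisted spatial molecular imaging analysis via code-augmented agent action space cs.CV | cs.AI | cs.HC | cs.LG
Hung Q. Vo, Huy Q. Vo, Son T. Ly, Zhihao Wan, Anh-Vu Nguyen
TL;DR: This paper presents CodeCytos, a coding-based reasoning agent framework that uses programmable, code-driven interaction to improve automation and customization in the analysis of spatial molecular imaging data. The framework lets users dynamically explore custom spatial cellular features, and its utility is demonstrated through case studies on expert-curated datasets from four distinct tissue types.
Details
Motivation: Conventional tissue image analysis software offers little automation, integrates poorly with code-driven pipelines, and supports only a limited set of predefined spatial features, so it struggles to meet the need for efficient, scalable, and customizable analysis in complex spatial tissue studies.
Result: Under a realistic minimal-prompt setting, CodeCytos is benchmarked with multiple LLM backbones that have strong coding capabilities and outperforms baseline approaches. The study further shows that adding domain-agnostic few-shot examples (randomly sampled demonstrations from outside the spatial analysis domain) substantially improves performance without costly expert-crafted in-domain demonstrations.
Insight: The core contribution is an agent framework built on a code-augmented action space, which brings programming directly into the interaction with spatial imaging data and yields high flexibility and automation. A key insight is that in-context learning from a few domain-agnostic coding-reasoning examples can effectively improve performance in a specialized domain, lowering the cost of obtaining high-quality domain-specific demonstrations.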
Abstract: Conventional tissue image analysis software provides foundational capabilities for cellular analysis, including segmentation, basic morphological feature extraction, and spatial organization analysis. However, these tools often require manual intervention and are not well integrated with code-driven automation, limiting efficiency and scalability for complex spatial tissue studies. In addition, they offer limited flexibility for custom analyses, as they typically support only a fixed set of pre-implemented spatial cellular features. To address these limitations, we propose CodeCytos, a coding-based reasoning agent framework that enables dynamic, programmable interaction with spatial molecular imaging data to improve automation and customization. CodeCytos is designed to streamline the exploration of custom spatial cellular features and adapt to diverse research needs. We demonstrate its utility through case studies on four expert-curated datasets from distinct tissue types: frontal cortex, non-small-cell lung cancer, pancreas, and tonsil. We evaluate CodeCytos under a realistic minimal prompt setting, where bioscientists pose simple questions without task-specific instructions or contextual information about spatial cellular analysis, and benchmark multiple LLM backbones with strong coding capabilities. We further show that incorporating tailored, domain-agnostic few-shot in-context coding-reasoning examples (randomly sampled demonstrations outside the spatial analysis domain) can substantially improve performance without requiring costly, expert-crafted in-domain demonstrations. Overall, CodeCytos outperforms baseline approaches, highlighting the potential of code-action agents to assist with custom feature exploration in spatial molecular imaging and to accelerate biomarker discovery.
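[114] MUSCLE-NET: Predicted-Multiscale-Aware Network for Pedestrian Trajectory Forecasting cs.CV
Yu Liu, Ming Huang, Xiao Ren, Zhijie Liu, Youfu Li
TL;DR: This paper proposes MUSCLE-NET, a predicted-multiscale-aware network for pedestrian trajectory forecasting. By integrating multimodal cues with a scale-adaptive prediction mechanism, the network aims to exploit observations more fully and to handle the scale dependency of future motion, improving prediction robustness.
Details
Motivation: Existing pedestrian trajectory prediction methods do not fully exploit diverse observations and often overlook the scale dependency of future motion, treating multiscale features uniformly. This limits robustness across different pedestrian behaviors.
Result: Extensive experiments on the JAAD and PIE benchmarks show that MUSCLE-NET achieves competitive performance and consistent gains compared with state-of-the-art trajectory prediction methods.
Insight: The novelty is a framework that integrates a Multiscale Multimodal Feature Extraction (MMFE) module and a Multiscale Enhanced Hierarchical Prediction (MEHP) module. MMFE builds semantically aligned representations through modality-aware recalibration and directional cross-modal fusion; MEHP uses a probabilistic coarse predictor, scale-aligned fusion, and progressive refinement to adaptively select scale-relevant cues and mitigate spatial drift.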
Abstract: Accurate pedestrian trajectory prediction is essential for safe navigation in autonomous driving and intelligent transportation systems. Despite substantial progress made by recent methods, most existing approaches are limited in fully exploiting diverse observations and often overlook the scale dependency of future motion, treating multiscale features uniformly regardless of underlying motion dynamics. This limits their robustness across diverse pedestrian behaviors. To address these challenges, we propose a Predicted-MUltiSCale-Aware Network (MUSCLE-NET) for pedestrian trajectory forecasting that integrates complementary multimodal cues with scale-adaptive prediction mechanisms. The proposed framework is built upon a Multiscale Multimodal Feature Extraction (MMFE) module, which combines multiscale representation, modality-aware recalibration, and directional cross-modal fusion to construct semantically aligned representations from bounding boxes, velocities, and pose information. Building on these features, a Multiscale Enhanced Hierarchical Prediction (MEHP) module performs prediction-aware future-motion refinement via a probabilistic coarse predictor, scale-aligned fusion, and progressive refinement, adaptively selecting scale-relevant cues to mitigate spatial drift. Extensive experiments on the JAAD and PIE benchmarks demonstrate that the proposed MUSCLE-NET achieves competitive performance and consistent gains compared with state-of-the-art trajectory prediction methods.
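[115] OptiWorld: Optimal Control for Video World Generation under Physical Constraints cs.CV
Yu Yuan, Jianhao Yuan, Xijun Wang, Daiqing Li, Liu He
TL;DR: OptiWorld is a framework that brings classical optimal control into video generation. At inference time it extracts a world state, plans an optimal trajectory under physical constraints, and renders the video conditioned on that trajectory, producing videos with preferable dynamics.
Details
Motivation: Existing video generation models mainly produce plausible-looking motion but cannot proactively control or optimize the underlying dynamics, so generated object trajectories may be unsafe, not smooth, inefficient, or physically inconsistent.
Result: By adding an optimal-control layer, OptiWorld shows strong potential across several tasks, including goal-conditioned image-to-video generation, video dynamics editing, and counterfactual generation, and generates videos with preferable dynamics.
Insight: The novelty lies in combining video generation with optimal control: planning is formulated as a geometric problem on a continuous manifold, which unifies 3D geometry and task-dependent physical constraints into a single planning geometry and allows video dynamics to be actively optimized.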
Abstract: Video generation models are becoming a scalable form of world models, but they mainly generate plausible motion rather than proactively controlling or optimizing the underlying dynamics. As a result, an object in the generated video may follow trajectories that are unsafe, not smooth, inefficient, or physically inconsistent. In this work, we propose OptiWorld, a framework that brings classical optimal control into video generation at inference time. OptiWorld first extracts a compact, task-relevant world state, then plans an optimal trajectory under physical constraints, and finally renders the video conditioned on this trajectory. We formulate planning as a geometric problem on a continuous manifold, which converts 3D geometry and task-dependent physical constraints into a unified planning geometry. By adding this optimal-control layer, OptiWorld generates videos with preferable dynamics, demonstrating strong potential in multiple tasks including goal-conditioned image-to-video generation, video dynamics editing, and counterfactual generation.
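[116] V-LynX: Token Interface Alignment for Video+X LLMs cs.CV | cs.AI
Jungin Park, Jiyoung Lee, Kwanghoon Sohn
TL;DR: This study reveals that video large language models (Video LLMs) contain a continuous manifold, the token interface, that lets visual tokens operate as standalone entities. Building on this finding, the paper proposes V-LynX, a framework that reuses this internal interface to integrate new modalities (such as audio and 3D) into a frozen Video LLM through a lightweight auxiliary pathway, without modality-specific encoders or paired supervision.
Details
Motivation: Conventional approaches to adding new modalities to Video LLMs require heavy modality-specific encoders or paired-data supervision. This work aims for efficient, scalable multimodal integration.
Result: V-LynX achieves state-of-the-art performance and high efficiency on multiple benchmarks covering audio-visual question answering, 3D reasoning, high-frame-rate video understanding, and multi-view video understanding.
Insight: The novelty is using the token interface inherent to Video LLMs as a general alignment medium: new modalities are fused with video priors by aligning attention responses and statistical distributions, which avoids restructuring the model and offers a new paradigm for multimodal extension.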
Abstract: This study reports an intriguing phenomenon in Video LLMs: rather than merely translating frames into textual embeddings, Video LLMs establish a continuous manifold, the token interface, that allows visual tokens to operate as standalone entities within the architecture. Exploiting this discovery, we propose V-LynX, a scalable framework that integrates novel modalities into Video LLMs by repurposing the internalized interface. Departing from conventional paradigms that necessitate heavy modality-specific encoders or paired supervision, V-LynX employs a lightweight auxiliary pathway in parallel with the frozen vision encoder. Our method integrates new sensory inputs with intrinsic video priors by aligning both attention responses and statistical distributions using unpaired unimodal datasets. This ensures manifold compatibility while preserving the integrity of the Video LLMs. Extensive benchmarks demonstrate that V-LynX achieves state-of-the-art performance and high efficiency across audio-visual QA, 3D reasoning, high-frame-rate, and multi-view video understanding. The code is available at https://github.com/park-jungin/lynx.
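[117] ETC: Extreme Token Compression via Task-aware Visual Information Distillation in VLMs cs.CV
Yiling Gao, Hongchen Wei, Zhenzhong Chen
TL;DR: This paper proposes the Extreme Token Compression (ETC) framework, which uses task-aware visual information distillation to sharply reduce the number of visual tokens that high-resolution images produce in vision-language models (VLMs), lowering inference compute cost and KV-cache overhead.
Details
Motivation: High-resolution images produce a large number of visual tokens in VLMs, which leads to high computational cost and KV-cache overhead during inference.
Result: Experiments on LLaVA-1.5-7B and Qwen3-VL-2B show that ETC remains effective even under single-token compression, substantially reducing KV-cache overhead while retaining strong task performance.
Insight: From an information-theoretic perspective, the paper minimizes task loss through variational information distillation and uses text-to-image cross-attention to weight the original visual features, approximating the latent instruction-aware predictive statistic and enabling efficient token compression.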
Abstract: In Vision-Language Models (VLMs), high-resolution images produce a large number of visual tokens, resulting in high computational costs and KV-cache overhead during inference. To address this problem, we propose an Extreme Token Compression (ETC) framework that minimizes task loss when reducing the number of input tokens based on the principle of variational information distillation. Specifically, from an information-theoretic perspective, we show that minimizing task loss requires the compact representation to preserve the instruction-aware sufficient statistic of the task-relevant visual information for prediction. In practice, ETC leverages text-to-image cross-attention to weight the original visual features to approximate the latent instruction-aware predictive statistic. Moreover, ETC introduces a variational information distillation, enabling the compact representation to preserve the essential information to recover this predictive statistic. Experiments on LLaVA-1.5-7B and Qwen3-VL-2B show that ETC remains effective even under single-token compression, substantially reducing KV-cache overhead while retaining strong task performance.
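The idea of weighting visual features by text-to-image cross-attention can be sketched as attention pooling down to a single token. This is only an illustration under stated assumptions: the scaled dot-product scores, the averaging over text tokens, and the function name compress_to_single_token are hypothetical, and the variational distillation objective described in the abstract is not modeled.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def compress_to_single_token(text_feats, visual_feats):
    """Pool N visual tokens into one token using text-to-image attention weights.

    text_feats: (T, d) instruction token features.
    visual_feats: (N, d) visual token features.
    Returns the pooled token (d,) and the per-visual-token weights (N,).
    """
    d = visual_feats.shape[-1]
    attn = softmax(text_feats @ visual_feats.T / np.sqrt(d), axis=-1)  # (T, N)
    weights = attn.mean(axis=0)  # average over text tokens; still sums to 1
    return weights @ visual_feats, weights


# One text token that points along the first feature axis, three visual tokens.
text = np.array([[4.0, 0.0]])
visual = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
token, weights = compress_to_single_token(text, visual)
```

The visual token most aligned with the instruction dominates the pooled representation, so the KV cache holds one visual entry instead of N.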
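[118] Improving Visual Grounding in Remote Sensing via Cluster-Guided Refinement and Model Ensemble Voting cs.CV
Panav Shah, Geet Sethi, Ashutosh Gandhe
TL;DR: To address the challenges of visual grounding in remote sensing imagery, this paper proposes two grounding pipelines, Sequential Grounding Refinement (SGR) and Cluster-Aware Grounding Refinement (CGR), which combine the strengths of RemoteSAM, a model specialized for remote sensing, and SAM3, a general-purpose segmentation model. It also explores a majority-voting ensemble over six different grounding pipelines to improve robustness and localization accuracy. Experiments show that the proposed methods outperform individual models and yield more reliable and precise grounding predictions.
Details
Motivation: Visual grounding in remote sensing imagery is highly challenging because of complex scenes, small objects, and large scale variation, and a single model is often insufficient to handle this diversity.
Result: Experimental results show that the proposed grounding pipelines and ensemble approach achieve higher localization accuracy than individual models and produce more reliable grounding predictions.
Insight: The novelty lies in refining grounding by combining the complementary strengths of specialized and general-purpose models, and in using multi-model ensemble voting to improve robustness and precision, which offers a reusable framework for visual grounding in complex scenes.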
Abstract: Visual grounding aims to locate image regions that correspond to natural language descriptions and is a key component of interpretable vision systems. In remote sensing imagery, grounding is particularly challenging due to complex scenes, small objects, and large variations in scale. Relying on a single model is often insufficient to address these diverse challenges. In this work, we propose two grounding pipelines, Sequential Grounding Refinement (SGR) and Cluster-Aware Grounding Refinement (CGR), that combine the complementary strengths of RemoteSAM, a visual grounding model specialized for remote sensing, and SAM3, a powerful general-purpose segmentation model. Our approach first uses RemoteSAM to obtain an initial estimate of object location, which is then refined using SAM3 to produce more accurate and spatially consistent segmentations. Additionally, we explore an ensemble strategy based on majority voting across six diverse grounding pipelines, each with distinct capabilities. This multi-model framework improves robustness and significantly enhances localization accuracy. Experimental results demonstrate that the proposed pipelines and ensemble approach outperform individual models, leading to more reliable and precise visual grounding predictions.
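Majority voting across pipelines can be sketched at the pixel level. The abstract does not say whether votes are cast per pixel, per box, or per region, so the per-pixel rule, the strict-majority threshold, and the function name majority_vote below are assumptions.

```python
import numpy as np


def majority_vote(masks):
    """Keep a pixel if strictly more than half of the pipelines mark it as foreground."""
    stack = np.stack([np.asarray(m, dtype=bool) for m in masks])
    votes = stack.sum(axis=0)
    return votes * 2 > len(masks)


# Three hypothetical pipeline outputs for a 2x2 image.
mask_a = [[1, 1], [0, 0]]
mask_b = [[1, 0], [0, 0]]
mask_c = [[1, 1], [0, 1]]
fused = majority_vote([mask_a, mask_b, mask_c])
```

With an even number of voters, such as the six pipelines mentioned in the abstract, a 3-3 tie is treated as background under this strict rule. A real system might break ties differently, for example by pipeline confidence.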
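[119] CAFOSat: A Strongly Annotated Dataset for Infrastructure-Aware CAFO Mapping Using High-Resolution Imagery cs.CV | cs.AI | cs.LG
Oishee Bintey Hoque, Nibir Chandra Mandal, Mandy L Wilson, Samarth Swarup, Madhav Marathe
TL;DR: This paper introduces CAFOSat, a strongly annotated high-resolution remote sensing dataset for large-scale mapping of Concentrated Animal Feeding Operations (CAFOs) in the United States. The dataset integrates high-resolution NAIP imagery with multi-source CAFO inventories and converts weak geolocation records into refined annotations through a human-in-the-loop pipeline. It contains more than 45,000 image patches spanning 20 states and four major CAFO categories.
Details
Motivation: Large-scale CAFO mapping from remote sensing imagery faces several challenges, including heterogeneous infrastructure layouts, noisy location records, inconsistent annotations, and incomplete inventories.
Result: The paper benchmarks a range of convolutional, transformer-based, and vision-language models and demonstrates the value of refined annotations and curated negative samples for CAFO classification and generalization.
Insight: The novelty lies in converting weak annotations into strong ones through a human-in-the-loop pipeline (AI-assisted annotation, GradCAM-based localization, and geometric clustering), in curating negative samples using land-cover-guided sampling with spatial exclusion constraints, and in a synthetic augmentation pipeline that generates infrastructure-aware variations to increase training diversity and robustness under distribution shift.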
Abstract: Concentrated Animal Feeding Operations (CAFOs) play an important role in agricultural production but are also associated with environmental, public health, and disease surveillance concerns. Large-scale mapping of CAFOs from remote sensing imagery remains challenging due to heterogeneous infrastructure layouts, noisy location records, inconsistent annotations, and incomplete inventories. We introduce CAFOSat, a strongly annotated, infrastructure-aware dataset for CAFO mapping across the United States. CAFOSat integrates high-resolution National Agriculture Imagery Program (NAIP) imagery with multi-source CAFO inventories collected across multiple states and transforms weak geolocation records into refined annotations through a human-in-the-loop pipeline combining AI-assisted annotation, GradCAM-based localization, and geometric clustering. To improve dataset quality, we curate challenging negative samples using land-cover-guided sampling with spatial exclusion constraints and provide infrastructure-level annotations, including barns, manure ponds, and grazing-related features, through manual verification. The resulting dataset contains more than 45,000 image patches spanning 20 states and four major CAFO categories. We benchmark a diverse set of convolutional, transformer-based, and vision-language models, demonstrating the value of refined annotations and curated negative samples for CAFO classification and generalization. In addition, we introduce a synthetic augmentation pipeline that generates infrastructure-aware variations to increase training diversity and improve robustness under distribution shifts. CAFOSat provides a large-scale benchmark for advancing infrastructure-aware agricultural monitoring and CAFO mapping from high-resolution remote sensing imagery.
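A spatial exclusion constraint for negative sampling can be sketched as a distance filter: candidate locations are kept only if they lie at least some minimum distance from every known CAFO. The 1 km threshold, the function names, and the coordinates below are illustrative assumptions; the land-cover-guided part of the paper's sampling is not modeled.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def filter_negatives(candidates, positives, min_km=1.0):
    """Keep candidate (lat, lon) points at least min_km away from every positive site."""
    return [
        c for c in candidates
        if all(haversine_km(c[0], c[1], p[0], p[1]) >= min_km for p in positives)
    ]


# Hypothetical coordinates: one known site and two candidate negatives.
known_sites = [(40.0, -90.0)]
candidates = [(40.0, -90.001), (40.1, -90.0)]  # roughly 85 m and 11 km away
negatives = filter_negatives(candidates, known_sites)
```

Only the distant candidate survives, which keeps patches that may still show part of a known facility out of the negative set.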
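[120] Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding cs.CV | cs.CL
Hee Suk Yoon, Eunseop Yoon, Jaehyun Jang, SooHwan Eom, Ji Woo Hong
TL;DR: This paper proposes a decomposed on-policy distillation method to improve the visual grounding of vision-language models. It decomposes the distillation loss into language-prior and visual-grounding components whose gradients are nearly orthogonal, and introduces Visual Gradient Steering (VGS) to dynamically prioritize the visual subspace during optimization, significantly improving small models on multimodal reasoning tasks.
Details
Motivation: The optimization dynamics of on-policy distillation in the multimodal domain (for example, in vision-language models) remain under-explored. Its monolithic loss may implicitly balance language and visual objectives, leading to suboptimal visual grounding, which is hypothesized to be the primary bottleneck for vision-language reasoning.
Result: Across multiple distillation settings and complex multimodal benchmarks, VGS significantly outperforms the standard monolithic formulation of on-policy distillation and achieves better visual grounding with minimal training overhead.
Insight: The core contribution is the mathematical decomposition of the distillation loss into two components with nearly orthogonal gradients (language prior and visual grounding), together with actively steering optimization toward the visual subspace. This offers a new optimization perspective and a practical technique for efficient distillation of multimodal models.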
Abstract: While on-policy distillation offers dense supervision for training small reasoning models, its optimization dynamics in the multimodal domain remain under-explored. In this work, we challenge the standard monolithic view of Vision-Language Model (VLM) distillation by mathematically decomposing the loss into two distinct components: the language prior and visual grounding. Our analysis uncovers that gradient vectors for these components are nearly orthogonal, indicating that the objective of aligning with the teacher’s language distribution is geometrically independent from the objective of matching its visual perception. Consequently, standard optimization passively follows a suboptimal compromise trajectory that implicitly balances the two objectives. Hypothesizing that visual grounding constitutes the primary bottleneck for vision-language reasoning, we introduce Visual Gradient Steering (VGS), a method that dynamically reorients the update vector to prioritize the visual subspace. Experimental results on multiple distillation settings and complex multimodal benchmarks demonstrate that VGS significantly outperforms the standard monolithic formulation of on-policy distillation, achieving superior grounding with minimal training overhead.
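The geometric picture can be made concrete with a toy example. The abstract does not give the VGS update rule, so the fixed reweighting below is only an assumed stand-in that shows what reorienting the update toward the visual gradient means; the names steered_update and visual_weight are hypothetical.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two gradient vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def steered_update(grad_language, grad_visual, visual_weight=2.0):
    """Assumed steering rule: upweight the visual-grounding gradient before summing."""
    return grad_language + visual_weight * grad_visual


# Toy gradients chosen to be exactly orthogonal, mimicking the reported near-orthogonality.
grad_language = np.array([1.0, 0.0])
grad_visual = np.array([0.0, 1.0])

standard = grad_language + grad_visual  # monolithic loss: equal-weight compromise
steered = steered_update(grad_language, grad_visual)

alignment_standard = cosine(standard, grad_visual)  # 1 / sqrt(2)
alignment_steered = cosine(steered, grad_visual)    # 2 / sqrt(5)
```

Because the two gradients are orthogonal, increasing the visual weight rotates the update toward the visual direction without opposing the language component. The method described in the paper adjusts this orientation dynamically rather than with a fixed weight.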
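[121] DeepLatent: Think with Images via Parallel Latent Visual Reasoning cs.CV | cs.LG
Dongchen Lu, Zhimo Li, Mao Shu, Huo Cao
TL;DR: This paper proposes DeepLatent, a parallel latent framework for visual reasoning. Its LatentFormer uses learnable 2D tokens to generate context-conditioned latent states in parallel, and a continuous-space reinforcement learning algorithm optimizes the quality of the latent representations. The method achieves state-of-the-art performance on multiple benchmarks.
Details
Motivation: Existing "thinking with images" methods have limitations: tool-assisted methods suffer from high latency and restricted operation types, while latent reasoning methods underperform and their latent tokens fail to capture effective visual information. This work aims to overcome these shortcomings with an efficient, high-performing latent visual reasoning framework.
Result: Extensive evaluation on multiple benchmarks shows that DeepLatent achieves state-of-the-art (SOTA) performance.
Insight: The main contributions are: 1) LatentFormer, which uses learnable 2D tokens to generate latent states in parallel, anchored in the original image features; 2) a continuous-space reinforcement learning algorithm that optimizes latent modulation parameters directly in the embedding space; 3) DeepLatent-180K, a large-scale dataset tailored for latent visual reasoning.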
Abstract: The emerging paradigm of “thinking with images” embeds visual states into intermediate reasoning steps, defining a new frontier for Vision-Language Models. Existing approaches diverge along two lines. Tool-assisted methods apply explicit visual operations but suffer from high latency and restricted manipulation types. Latent reasoning methods autoregressively produce implicit visual states, but underperform tool-assisted methods, and their latent tokens fail to capture effective visual information. In this work, we propose DeepLatent, a parallel framework for latent visual reasoning. First, we introduce LatentFormer. It uses learnable 2D tokens to generate context-conditioned latent states in parallel, anchoring every visual update directly in the original image features. Second, we design a continuous-space reinforcement learning algorithm. It optimizes latent modulation parameters directly in the embedding space, significantly improving latent representation quality. The framework is trained via knowledge distillation followed by this continuous-space RL algorithm. Furthermore, we contribute DeepLatent-180K, a large-scale dataset tailored for latent visual reasoning. Extensive evaluations across multiple benchmarks demonstrate that DeepLatent achieves state-of-the-art performance.
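[122] Improving Visual Representation Alignment Generation with GRPO cs.CV | cs.AI | cs.LG | cs.MM
Shentong Mo, Sukmin Yun
TL;DR: This paper proposes VRPO, a reinforcement-learning-based optimization strategy for improving visual representation alignment in diffusion transformers. It replaces the static alignment loss with a generative representation policy optimization objective and guides alignment with adaptive rewards based on generation fidelity, perceptual quality, and semantic coherence, improving image quality while accelerating convergence.
Details
Motivation: Existing representation alignment frameworks such as REPA use a static, externally supervised alignment loss that lacks adaptivity during training and inference. They cannot dynamically balance representation consistency and generation quality, which limits the discriminative benefit and prevents alignment from being optimized in a task-adaptive manner.
Result: Extensive experiments on ImageNet-256x256 show that VRPO-Alignment substantially improves convergence speed and generation fidelity, achieving up to +1.8 FID improvement and 2.3x faster training than REPA under identical compute budgets.
Insight: The core contribution is recasting representation alignment as a reward-guided reinforcement learning process rather than a fixed similarity constraint. This lets the generator adaptively refine its internal representations according to multi-dimensional rewards (fidelity, perceptual quality, semantic coherence), achieving task-adaptive alignment with negligible computational overhead and full compatibility with SiT and DiT architectures.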
Abstract: Recent diffusion transformers have demonstrated strong image synthesis capabilities but remain inefficient to train due to weak alignment between generative and discriminative representations. While representation alignment frameworks such as REPA improve convergence by aligning noisy denoising features with pretrained visual encoders, their externally supervised alignment loss is static and lacks adaptivity during training and inference. Existing methods rely on fixed cosine alignment or contrastive objectives, which cannot dynamically balance representation consistency and generation quality, resulting in limited discriminative benefit and failing to optimize alignment in a task-adaptive manner. To address this, we propose VRPO, a reinforcement-based optimization strategy that replaces REPA’s static alignment loss with a generative representation policy optimization objective. Instead of enforcing a fixed similarity constraint, VRPO treats representation alignment as a reward-guided process: the model receives adaptive rewards based on generation fidelity, perceptual quality, and semantic coherence between the diffusion features and pretrained visual embeddings. This formulation enables the generator to continuously refine its internal representations toward semantically meaningful directions while improving image quality. Our VRPO-driven training seamlessly integrates into diffusion transformers, introducing negligible computation cost and preserving full compatibility with SiT and DiT architectures. Extensive experiments on ImageNet-256x256 demonstrate that our VRPO-Alignment substantially enhances both convergence and fidelity, achieving up to +1.8 FID improvement and 2.3x faster training compared to REPA under identical compute budgets.
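The title refers to GRPO, which in its commonly described form scores each sample relative to the other samples in its group by normalizing rewards with the group mean and standard deviation. The sketch below shows that normalization applied to a composite reward. The equal-weight sum of the three reward terms and the function names are assumptions, since the abstract does not specify how VRPO combines them.

```python
import numpy as np


def combined_reward(fidelity, perceptual, semantic, weights=(1.0, 1.0, 1.0)):
    """Assumed composite reward: a weighted sum of the three per-sample reward terms."""
    terms = np.stack([np.asarray(fidelity), np.asarray(perceptual), np.asarray(semantic)])
    return np.asarray(weights, dtype=float) @ terms


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Three hypothetical generations sampled for the same conditioning input.
rewards = combined_reward(
    fidelity=[0.2, 0.5, 0.8],
    perceptual=[0.3, 0.5, 0.7],
    semantic=[0.5, 1.0, 1.5],
)
advantages = group_relative_advantages(rewards)
```

Samples that score above the group average receive positive advantages and are reinforced, while below-average samples are pushed down, so no separate value network is needed.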
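[123] Through the PRISM: Principle-Aware, Interpretable, and Multi-Scale Evaluation of Visual Designs cs.CV
Mona Gandhi, KJ Joseph, Srinivasan Parthasarathy, Sayan Nag
TL;DR: This paper presents the PRISM benchmark and a multi-scale evaluation framework for interpretable, principle-based evaluation of visual designs. PRISM systematically perturbs professional layouts to create a dataset of designs with specific principle violations, which is used to analyze how models reason multimodally about design quality. Existing models are found to be insensitive to targeted principle degradations, so the paper proposes a framework that integrates lightweight scorers, instruction-tuned vision-language models, and prompt-based methods to provide interpretable explanations of design failures and targeted refinements.
Details
Motivation: Machine agents typically condense multiple design principles into a single heuristic score, which limits interpretability and diagnostic precision and falls short of the holistic reasoning of human designers. This work aims to close that gap with a finer-grained, interpretable evaluation of visual design quality.
Result: Experiments on the PRISM benchmark show that models such as Qwen-2.5-VL and GPT-4o-mini are largely insensitive to targeted design principle degradations, whereas GPT-4o exhibits global awareness but lacks fine-grained disentanglement. The proposed multi-scale framework provides interpretable explanations of design failures and uses these localized insights for targeted refinements that improve layout quality.
Insight: The novelty lies in a systematic benchmark (PRISM), grounded in measurable design principles, for evaluating how well multimodal models understand design, and in a multi-scale evaluation framework that integrates quantitative assessment, localized feedback, and global reasoning, laying the foundation for interpretable, design-literate multimodal reasoning systems.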
Abstract: Effective visual communication stems from the harmony of multiple design principles, such as readability, contrast, alignment, overlap, and coherence, which collectively govern clarity and intent of the communicator. While human designers reason holistically over these principles, machine agents typically condense them into a single heuristic score, offering limited interpretability and diagnostic precision. To address this gap, we introduce PRISM (PRinciple-aware, Interpretable, and Structure-guided Design Modifications), a benchmark that systematically perturbs professional layouts from the Crello dataset along measurable design principles. The benchmark comprises 100K perturbed training samples and 10K perturbed validation designs, each isolating a specific principle violation for controlled analysis of multimodal reasoning about design quality. We show that models like Qwen-2.5-VL and GPT-4o-mini are largely insensitive to targeted principle degradations, whereas GPT-4o exhibits global awareness without fine-grained disentanglement. Building on these insights, we propose a multi-scale evaluation framework that integrates lightweight scorers for quantitative assessment, instruction-tuned vision-language models for localised feedback, and prompt-based methods for global reasoning. Our framework provides interpretable explanations of design failures. Using these localised insights, we show targeted refinements that improve layout quality. Together, PRISM and our framework lay the foundation for interpretable design-literate multimodal reasoning systems.
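[124] Response-Aware Multimodal Learning for Post-Treatment Visual Acuity Forecasting cs.CV
Phuoc-Nguyen Bui, Van-Vi Vo, Duc-Tai Le, Van-Nguyen Pham, Ki-Young Kim
TL;DR: This paper proposes ReVA, a response-aware multimodal learning framework for forecasting long-term visual acuity (VA) trajectories in patients with diabetic macular edema (DME) receiving anti-VEGF therapy. Using only baseline and first post-treatment (month-1) OCT scans, tabular OCT-derived biomarkers, and other clinical variables, the model predicts visual outcomes at multiple time points from 3 to 24 months.
Details
Motivation: In clinical practice, physicians must often estimate long-term visual trajectories from early post-treatment findings alone, which makes reliable prognostication difficult. Existing OCT-based learning approaches focus mainly on short-term response or single-endpoint prediction, and modeling VA trajectories across multiple future time points from early longitudinal observations remains insufficiently explored.
Result: ReVA achieves MAE=0.1246, RMSE=0.1621, and R^2=0.6064 for 24-month VA prediction, with consistent performance across all forecast horizons.
Insight: The novelty is a response-aware multimodal framework that integrates structural OCT features from baseline and month 1 with tabular variables to capture baseline disease status and early treatment response. Specifically, it uses spatial attention to preserve localized prognostic imaging features and a dependency-aware tabular encoder to model interactions among clinical variables, enabling patient-specific long-term VA trajectory prediction.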
Abstract: Long-term visual acuity (VA) outcomes after anti-VEGF therapy are central to patient counseling, expectation setting, and follow-up planning in diabetic macular edema (DME). However, in clinical practice, physicians must often estimate long-term visual trajectories based only on early post-treatment findings, making reliable prognostication difficult. Although prior OCT-based learning approaches have largely focused on short-term response or single-endpoint prediction, modeling VA trajectories across multiple future time points from early longitudinal observations remains insufficiently explored. In this study, we assembled a real-world cohort of 188 anti-VEGF-treated DME patients with paired baseline and month-1 OCT scans, along with tabular OCT-derived biomarkers and non-imaging clinical variables. Using only these early data, we formulate a multi-horizon VA forecasting problem aimed at predicting visual outcomes at 3, 6, 12, 18, and 24 months, reflecting clinically meaningful follow-up intervals. We propose ReVA, a response-aware multimodal framework that integrates structural features from baseline and month-1 OCT with the tabular variables to capture baseline disease status and early treatment response. ReVA uses spatial attention to preserve localized prognostic imaging features and a dependency-aware tabular encoder to model interactions among clinical variables. These multimodal representations are fused to predict patient-specific long-term visual acuity trajectories. The proposed framework achieves MAE=0.1246, RMSE=0.1621, and R^2=0.6064 for 24-month VA prediction, with consistent performance across all forecast horizons. Our findings show that incorporating early treatment-response signals enables clinically meaningful long-term visual acuity forecasting, supporting data-driven decision support for routine anti-VEGF management.
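For readers less familiar with the three reported metrics, the sketch below computes MAE, RMSE, and R^2 using their standard definitions. The visual acuity values are made-up toy numbers, not data from the study.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    ss_res = np.sum(err ** 2)                       # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Toy visual acuity values for four hypothetical patients.
mae, rmse, r2 = regression_metrics(
    y_true=[0.2, 0.4, 0.6, 0.8],
    y_pred=[0.3, 0.4, 0.5, 0.8],
)
```

RMSE is never smaller than MAE, which is consistent with the reported pair (0.1621 versus 0.1246). An R^2 of 0.6064 means the model explains roughly 61% of the variance in 24-month visual acuity.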
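[125] ASAP: Advancing Medical Volumetric Representation Learning with Anatomy-aware Semantically-adaptive Pre-training cs.CV
Rongsheng Wang, Fenghe Tang, Zihang Jiang, Yingtai Li, Xu Zhang
TL;DR: This paper proposes ASAP (Anatomy-aware Semantically-Adaptive Pre-training), a framework for learning fine-grained medical volumetric representations from large-scale chest CT scans and their corresponding radiology reports. It integrates three key components, anatomy-aware knowledge injection, semantically-adaptive selective alignment, and a semantically-adaptive fusion module, to produce transferable and clinically meaningful representations.
Details
Motivation: Learning transferable and interpretable representations from medical volumetric data is challenging because of complex anatomical structures and the weak, heterogeneous supervision provided by radiology reports.
Result: On a comprehensive benchmark for vision-language pre-training on chest CT volumes, covering 15 datasets and 22 downstream tasks, ASAP achieves state-of-the-art performance on tasks such as abnormality classification, segmentation, disease prognosis prediction, and report generation, with particularly strong gains under limited supervision and distribution shift.
Insight: The contributions include injecting organ-level structural priors via an off-the-shelf segmentation tool to encourage anatomically coherent representations, and a semantically-adaptive alignment mechanism that dynamically associates sentence-level findings with localized volumetric regions. The standardized evaluation benchmark is also a useful tool for systematically assessing representation quality across clinical settings.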
Abstract: Learning transferable and interpretable representations from medical volumetric scans remains challenging due to complex anatomical structures and the weak, heterogeneous supervision provided by radiology reports. In this paper, we propose Anatomy-aware Semantically-Adaptive Pre-training (ASAP), a principled vision-language pre-training framework for fine-grained medical volumetric representation learning from large-scale chest CT scans and their corresponding radiology reports. ASAP integrates three key components: (1) an anatomy-aware knowledge injection module that incorporates organ-level structural priors via an off-the-shelf segmentation tool to encourage anatomically coherent representations; (2) a semantically-adaptive selective alignment mechanism that dynamically associates sentence-level findings with localized volumetric regions; and (3) a semantically-adaptive fusion module for effective interaction between anatomically informed visual features and grounded textual cues under a dual-modal masked modeling paradigm. Beyond methodological contributions, we establish a comprehensive benchmark for medical volumetric vision-language pre-training on chest CT, covering 15 datasets and 22 downstream tasks spanning abnormality classification, segmentation, disease prognosis prediction, report generation, vocabulary classification, cross-modal retrieval, and visual question answering. This benchmark provides standardized evaluation protocols to systematically assess representation quality under diverse clinical settings and data regimes. Extensive experiments demonstrate that ASAP consistently achieves state-of-the-art performance across tasks and datasets, with particularly pronounced gains under limited supervision and distribution shift, validating its effectiveness in learning transferable and clinically meaningful volumetric representations.
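[126] FlowNar: Scalable Streaming Narration for Long-Form Videos cs.CV
Zeyun Zhong, Manuel Martin, Chengzhi Wu, David Schneider, Frederik Diederichs
TL;DR: FlowNar is a new framework for scalable streaming narration of long-form videos. It targets the scalability bottleneck of existing large multimodal models, whose resource demands grow at least linearly with video duration in streaming settings. Its core is a dynamic context management strategy combined with the CLAM module, which bound visual memory usage and computational complexity so that long videos can be processed efficiently. Experiments on Ego4D and other datasets show that FlowNar substantially improves narration quality while supporting 10x longer videos and achieving 3x higher throughput.
Details
Motivation: Existing large multimodal models are designed mainly for offline settings and are ill-suited to the dynamic requirements of streaming video. Recent online adaptations enable real-time processing but still face scalability challenges, with resource demands growing at least linearly with video duration.
Result: Experiments on Ego4D, EgoExo4D, and EpicKitchens100 show that FlowNar substantially improves narration quality over strong baselines while being highly efficient, supporting 10x longer videos and achieving 3x higher throughput (FPS).
Insight: The contributions include a dynamic context management strategy for removing historical visual context and the CLAM (Cross Linear Attentive Memory) module for retaining streaming visual history, which together ensure bounded visual memory usage and computational complexity. The paper also introduces a realistic self-conditioned evaluation protocol and complementary metrics for assessing streaming narration models under deployment-like conditions.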
Abstract: Recent Large Multimodal Models (LMMs), primarily designed for offline settings, are ill-suited for the dynamic requirements of streaming video. While recent online adaptations improve real-time processing, they still face critical scalability challenges, with resource demands typically growing at least linearly with video duration. To overcome this bottleneck, we propose FlowNar, a novel framework for scalable streaming video narration. The core of FlowNar is a dynamic context management strategy for historical visual context removal, combined with our CLAM (Cross Linear Attentive Memory) module for streaming visual history retention, ensuring bounded visual memory usage and computational complexity, crucial for efficient streaming. We also introduce a realistic self-conditioned evaluation protocol and complementary evaluation metrics to assess streaming narration models under deployment-like conditions. Experiments on the Ego4D, EgoExo4D, and EpicKitchens100 datasets demonstrate that FlowNar substantially improves narration quality over strong baselines while being highly efficient, supporting processing of 10× longer videos and achieving 3× higher throughput (FPS). The code is available at https://github.com/zeyun-zhong/FlowNar.
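How a linear-attention memory keeps storage bounded can be shown with a generic sketch. This is not the CLAM module, whose design the abstract does not detail. It is a textbook linear-attention state with an assumed elu(x)+1 feature map, and the class name LinearAttentionMemory is hypothetical.

```python
import numpy as np


def feature_map(x):
    """elu(x) + 1: a strictly positive feature map often used in linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


class LinearAttentionMemory:
    """Fixed-size summary of all past key/value tokens."""

    def __init__(self, dim_k, dim_v):
        self.state = np.zeros((dim_k, dim_v))  # running sum of phi(k) v^T
        self.norm = np.zeros(dim_k)            # running sum of phi(k)

    def write(self, keys, values):
        """Fold one frame's tokens into the memory; the storage size never grows."""
        phi = feature_map(keys)
        self.state += phi.T @ values
        self.norm += phi.sum(axis=0)

    def read(self, queries):
        """Attention-like readout: a normalized weighted average of past values."""
        phi = feature_map(queries)
        return (phi @ self.state) / (phi @ self.norm)[:, None]


rng = np.random.default_rng(0)
memory = LinearAttentionMemory(dim_k=4, dim_v=3)
for _ in range(100):  # stream 100 frames of 5 tokens each
    memory.write(rng.normal(size=(5, 4)), np.ones((5, 3)))
readout = memory.read(rng.normal(size=(2, 4)))
```

After 100 frames the memory still occupies a 4x3 matrix plus a length-4 vector, whereas a standard KV cache would hold all 500 tokens. This is the kind of bounded growth the abstract describes.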
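[127] Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion cs.CV | cs.AI
Shivam Singh, Saptarshi Majumdar, Pratik Prabhanjan, Zicheng Liu, Emad Barsoum
TL;DR: This paper introduces pause-and-think-T, a reasoning-centric training dataset, and pause-and-think-B, a benchmark, to address the difficulties that current vision-language models have with grounded reasoning, temporal consistency, and context-aware planning in videos. A fine-tuned compact 4B-parameter model achieves performance comparable to much larger models on the target benchmark and generalizes strongly to several external benchmarks.
Details
Motivation: Current vision-language models fall short in video understanding, particularly in reasoning over visual evidence, maintaining temporal consistency, and context-aware planning.
Result: On the contextual understanding and goal planning tasks of pause-and-think-B, the fine-tuned 4B-parameter model reaches 58.0% accuracy with 59x fewer parameters than Qwen3-VL-235B (58.9%); it matches GPT-5.2 on scene understanding and surpasses GPT-4o. It also performs strongly on the external benchmarks EgoThink and TempCompass, with substantial gains in affordance, assistance, attribution recognition, situated reasoning, and temporal order, without benchmark-specific training.
Insight: The core contribution is a structured-reasoning training dataset built around "pause and think", which guides models toward human-like, scene-grounded reasoning before they generate an answer. The results indicate that targeted reasoning supervision enables compact models to deliver actionable, visually grounded guidance and to generalize well, without relying on large-scale model expansion.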
Abstract: Recent Vision-Language Models (VLMs) struggle with grounded reasoning, temporal consistency, and context-aware planning in videos. We introduce pause-and-think-T, a reasoning-centric training dataset that encourages models to pause, reason over visual evidence, and produce concise, actionable responses. The dataset promotes structured reasoning prior to answer generation, guiding models toward human-like, scene-grounded assistance. We fine-tune a compact 4B-parameter model and evaluate it on our pause-and-think-B benchmark targeting contextual understanding and goal planning tasks. The model achieves 58.0% accuracy with 59x fewer parameters than Qwen3-VL-235B (58.9%), matching GPT-5.2 on scene understanding and surpassing GPT-4o. Beyond our benchmark, it also shows strong out-of-distribution performance on EgoThink and TempCompass, with substantial gains in affordance, assistance, attribution recognition, situated reasoning, and temporal order, without benchmark-specific training. Our results indicate that targeted reasoning supervision enables compact models to deliver actionable, visually grounded guidance while generalizing beyond training data, without requiring large-scale model expansion.
[128] MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-Turn Dialogue cs.CV
Yue Jiang, Xue Jiang, Lihua Zhang, Zhiqiang Wang, Yuhang Lu
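TL;DR: This paper targets hallucination snowballing in multimodal large language models during interactive dialogue. It proposes MM-Snowball, the first benchmark for fine-grained diagnosis of the problem, and designs CAVR, a training-free conflict-aware visual rectification method to mitigate it; experiments show that CAVR achieves state-of-the-art performance.
Details
Motivation: Existing benchmarks are mostly confined to single-turn visual question answering and cannot capture the complex dynamics of error propagation in long-horizon interactions. In interactive settings, hallucination snowballing, where initial errors are amplified across dialogue turns until coherence collapses, severely undermines the reliability of multimodal large language models.
Result: Extensive evaluation on MM-Snowball shows that the benchmark poses a significant challenge to advanced multimodal large language models and that existing mitigation methods designed for single-turn VQA are ineffective; the proposed CAVR method achieves state-of-the-art performance in the experiments.
Insight: The contributions are the first benchmark built specifically to diagnose hallucination snowballing in dialogue, and a training-free dual-mechanism mitigation method that refreshes visual grounding at the representation level and rectifies the output distribution at the logit level, effectively re-anchoring the model to visual facts.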
Abstract: Multimodal large language models (MLLMs) demonstrate remarkable visual understanding, yet their reliability in interactive settings is severely undermined by hallucination snowballing: a phenomenon where initial errors amplify across conversational turns, leading to a collapse in coherence. This failure reveals a fundamental vulnerability where models progressively neglect visual grounding in favor of over-relying on polluted textual history. Existing benchmarks are predominantly confined to single-turn VQA, which fail to capture the complex dynamics of error propagation in long-horizon interactions. To address this, we introduce MM-Snowball, the first benchmark for fine-grained diagnosis of hallucination snowballing within dialogues. Extensive evaluation shows that our benchmark poses a significant challenge even to advanced MLLMs and reveals the inefficacy of existing mitigation methods designed for single-turn VQA. To counteract this degradation, we propose Conflict-Aware Visual Rectification (CAVR). This training-free method mitigates snowballing through a synergistic dual-mechanism that refreshes visual grounding at the representation level and rectifies output distributions at the logit level, effectively re-anchoring the model to visual facts. Experiments demonstrate that CAVR achieves state-of-the-art performance, offering a promising path toward more reliable interactive AI. Data and code are available at: https://frenkie-chiang.github.io/MM-Snowball
[129] An Attribute-Based Measure of Video Complexity cs.CV
Aditya Sarkar, Yi Li, Zihao Wang, Jiacheng Cheng, Sai Vidyaranya Nuthalapati
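TL;DR: This paper proposes VideoABC, a new framework for estimating the complexity that a video-question pair poses to video large language models. VideoABC is an attribute-based, non-parametric complexity measure that uses a reference video dataset and a predefined vocabulary of video attributes to quantify complexity, and predicts the complexity of new videos by quantizing the attribute space.
Details
Motivation: To address the problem of assessing how hard a video-question pair is for a video LLM, the authors define and quantify video complexity as the probability that the model fails on a given video-question pair, yielding an interpretable measurement tool.
Result: Experiments show that VideoABC is effective even with low-dimensional attribute representations, substantially outperforms approaches such as "video-LLM as judge" at much lower complexity, and demonstrates its interpretability on benchmarks by revealing how attribute composition affects complexity.
Insight: The contributions include an attribute-based non-parametric complexity measurement framework, a combination of a k-means quantizer and a universal lattice quantizer to handle in-distribution and out-of-distribution samples, and a psychophysics-inspired synthetic video generation procedure; together these improve generalization and interpretability.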
Abstract: A new framework for the estimation of the complexity posed by video-question pairs to video-LLMs, Video Attribute-Based Complexity (VideoABC), is proposed. Video complexity is defined as the probability of failure of a video-LLM for a given video-question pair. VideoABC is a non-parametric complexity measure, using a reference video dataset and a pre-defined vocabulary of video attributes informative of complexity, e.g., the scene complexity or the speed of the video event informative of the question. In a training phase, reference videos are projected into the space of these attributes, which is then quantized. The expected ABC of each quantization cell is then computed. Given a new video and its projection into the attribute space, complexity is estimated by the expected ABC of the associated quantization cell. To enable the use of VideoABC with small reference video datasets, two quantizers are combined: a k-means quantizer that enables accurate complexity estimates for samples in the distribution of the reference dataset and a universal lattice quantizer that guarantees generalization to out-of-distribution samples. A synthetic video generation procedure, inspired by target-distractor manipulations of psychophysics studies, is proposed to populate the cells of the lattice quantizer during training, enabling the computation of their expected ABCs. Experimental results show that VideoABC is effective even with very low-dimensional attribute representations, substantially outperforming approaches like "video-LLM as judge" with much less complexity. Finally, the explainable nature of the VideoABC score, in terms of well-defined attributes, is shown to provide insights on how the attribute composition of benchmarks affects their complexity.
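The quantize-then-average idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only a uniform rounding lattice (the paper combines a k-means quantizer with a universal lattice quantizer), and the names `cell_of`, `fit_cell_complexity`, `estimate_complexity`, and the `step` parameter are hypothetical.

```python
from collections import defaultdict

def cell_of(attributes, step=0.5):
    """Map an attribute vector to a lattice cell by uniform rounding (assumed quantizer)."""
    return tuple(round(a / step) for a in attributes)

def fit_cell_complexity(reference, step=0.5):
    """reference: list of (attribute_vector, failed) pairs, with failed in {0, 1}.

    Returns the empirical failure rate of each occupied cell, which plays
    the role of the expected complexity of that cell.
    """
    totals = defaultdict(lambda: [0, 0])  # cell -> [failures, count]
    for attrs, failed in reference:
        cell = cell_of(attrs, step)
        totals[cell][0] += failed
        totals[cell][1] += 1
    return {cell: f / n for cell, (f, n) in totals.items()}

def estimate_complexity(table, attributes, step=0.5, default=None):
    """Look up the failure rate of the cell that a new sample falls into."""
    return table.get(cell_of(attributes, step), default)

# Toy reference set: two samples share a cell, one sits in another cell.
reference = [((0.1, 0.2), 0), ((0.2, 0.1), 1), ((0.9, 1.0), 1)]
table = fit_cell_complexity(reference)
```

A query whose cell was never populated returns `default` here; per the abstract, the paper handles such cells with synthetic videos generated during training.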
[130] TAP-JEPA: Frozen Future-Latent Probing and Two-Stage Score Fusion for EPIC-KITCHENS-100 Action Anticipation cs.CV
Chaoyang Wang, Lexuan Xu
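TL;DR: TAP-JEPA is the runner-up entry in the EPIC-KITCHENS-100 Action Anticipation Challenge. It builds a compact anticipation model on frozen V-JEPA 2.1 video features to predict the upcoming action (verb, noun, and verb-noun pair) in egocentric video. The core of the method is to estimate future tokens with the pre-trained latent predictor, fuse them with the observed context tokens through attentive probes, and improve performance with a two-stage score fusion strategy.
Details
Motivation: The task is to predict the next action (verb, noun, and their combination) from an egocentric clip that ends before the action begins. The aim is to avoid fine-tuning a large video backbone and instead build a lightweight anticipation model.
Result: On the official EPIC-KITCHENS-100 open-testing leaderboard, the method achieves 27.91% overall action Mean Top-5 Recall (MT5R), ranking second and only 0.04 percentage points behind first place.
Insight: The contributions are the use of frozen pre-trained video features (V-JEPA 2.1) together with the pre-trained latent predictor to model future context, combined with task-specific attentive probes for multi-task prediction, and a two-stage score fusion strategy (first averaging multiple probe replicas within the same epoch, then merging across epochs with weights) that improves robustness and final performance.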
Abstract: This report presents TAP-JEPA, our runner-up submission to the EPIC-KITCHENS-100 (EK-100) Action Anticipation Challenge at EgoVis 2026. The task is to anticipate the next verb, noun, and verb-noun action from an egocentric clip that ends before the target action begins. Instead of fine-tuning a large video backbone, TAP-JEPA builds a compact anticipation model on frozen V-JEPA 2.1 features: a ViT-G/384 encoder extracts visible pre-action tokens, the pre-trained latent predictor estimates near-future tokens from the observed context, and both token groups are fused by attentive probes with task-specific queries for verbs, nouns, and action pairs. For the final submission, we expand supervised training with the official training split and most of the validation split, reserving a small subset for sanity checks and qualitative inspection, and adopt a two-stage score fusion that first averages eight independently initialized probe replicas within each epoch and then merges candidates from epochs 12-20 with field-dependent weights. On the official open-testing leaderboard, our sunshinesky entry achieves 27.91 percent overall action Mean Top-5 Recall (MT5R), ranking second and only 0.04 percentage points behind the top score.
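The two-stage score fusion described above can be sketched roughly as follows. This is an illustrative reading, not the submission's code: the function names, the dictionary layout, and the single weight per epoch are assumptions (the report mentions field-dependent weights, which would mean one such weight table per output field).

```python
def mean_scores(score_lists):
    """Element-wise mean over several equally long score vectors."""
    n = len(score_lists)
    return [sum(col) / n for col in zip(*score_lists)]

def two_stage_fusion(scores_by_epoch, epoch_weights):
    """Stage 1: average probe replicas within each epoch.
    Stage 2: merge the per-epoch averages with normalized epoch weights.

    scores_by_epoch: {epoch: [replica_scores, ...]} (hypothetical layout)
    epoch_weights:   {epoch: weight} (hypothetical, one table per field)
    """
    per_epoch = {e: mean_scores(reps) for e, reps in scores_by_epoch.items()}
    total = sum(epoch_weights[e] for e in per_epoch)
    num_classes = len(next(iter(per_epoch.values())))
    fused = [0.0] * num_classes
    for e, scores in per_epoch.items():
        w = epoch_weights[e] / total
        for i, s in enumerate(scores):
            fused[i] += w * s
    return fused

# Toy example with two epochs, two replicas each, and two classes.
toy_scores = {12: [[0.2, 0.8], [0.4, 0.6]], 13: [[0.6, 0.4], [0.6, 0.4]]}
toy_weights = {12: 1.0, 13: 3.0}
fused = two_stage_fusion(toy_scores, toy_weights)
```

Averaging replicas first reduces the variance from random initialization before the cross-epoch weights are applied.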
[131] Collaborative Few-Step Distillation and Low-Bit Quantization for Wan2.2 Dual-Expert Video Diffusion Models cs.CV | cs.AI
Jinyang Du, Shenghao Jin, Ziqian Xu, Ruihao Gong, Shiqiao Gu
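TL;DR: To address the high deployment cost of the Wan2.2-T2V-A14B video diffusion model, this paper proposes a collaborative compression pipeline that combines few-step distribution-matching distillation with low-bit quantization. The method follows the model's dual-expert denoising route, calibrates the high-noise and low-noise branches separately, protects sensitive entrance layers, and uses a HiF4-style low-bit representation to improve dynamic-range coverage. Quantization is calibrated on the distilled few-step student, which reduces activation-distribution mismatch at inference time.
Details
Motivation: Large video diffusion models deliver strong visual quality but are expensive to deploy, because each sample requires many denoising steps and the parameter footprint is large. The paper studies a deployment-oriented compression pipeline to address low inference efficiency and high memory usage.
Result: The proposed co-design keeps the quantized model close to the full-precision model at the same step count and, on average, surpasses the original full-precision baseline at 8 and 20 steps. Among the tested configurations, the 20-step setting offers the best quality-efficiency trade-off.
Insight: The contribution is the co-design of few-step distillation and low-bit quantization, with per-branch calibration and sensitive-layer protection tailored to the dual-expert structure. The key insight is that quantization should be calibrated on the distilled few-step student rather than on the original long-step trajectory, which reduces distribution mismatch at inference time. The HiF4-style low-bit representation also improves dynamic-range coverage.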
Abstract: Large video diffusion models achieve strong visual quality but remain expensive to deploy because each sample requires many denoising steps and a large resident parameter footprint. This paper studies a deployment-oriented compression pipeline for Wan2.2-T2V-A14B by combining few-step distribution-matching distillation with low-bit quantization. The pipeline follows the model’s dual-expert denoising route, calibrates the high-noise and low-noise branches separately, protects sensitive entrance layers, and uses HiF4-style low-bit representation to improve dynamic-range coverage. Quantization is calibrated on the distilled few-step student rather than on the original long-step trajectory, reducing activation-distribution mismatch during inference. The proposed co-design keeps the quantized model close to the same-step full-precision model and surpasses the original full-precision baseline at 8 and 20 steps on average. The 20-step setting gives the best quality-efficiency trade-off in the tested configurations.
[132] T-CLIP: Enabling Thermal Perception for Contrastive Language-Image Pretraining cs.CV
Tayeba Qazi, Ayush Maheshwari, Prerana Mukherjee, Brejesh Lall
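TL;DR: This paper proposes the T-CLIP framework to address the inability of foundational vision-language models such as CLIP to align thermal images with textual descriptions. With IR-Cap, the first physics-aware thermal image captioning dataset, and a decoupled dual-LoRA adaptation framework, T-CLIP outperforms the baselines on cross-modal retrieval across three thermal benchmarks and shows potential for text-conditioned thermal image generation.
Details
Motivation: Thermal imaging is a powerful alternative to visible-spectrum vision under challenging conditions such as low illumination and adverse weather, but existing models such as CLIP have a thermal perception gap and cannot align thermal images with text. The main challenges are the lack of captioned thermal datasets, the inability of standard LLMs to reason about thermal phenomena, and a representational conflict between global scene context and object-level heat signatures when both are learned in a single embedding space.
Result: On cross-modal retrieval across three public thermal benchmarks, T-CLIP achieves consistent improvements over all baselines.
Insight: The contributions are: 1) IR-Cap, the first physics-aware thermal captioning pipeline and dataset, providing complementary global and fine-grained descriptions; 2) a decoupled dual-LoRA adaptation framework that adapts CLIP independently for scene-level and object-level thermal understanding, resolving the representational conflict between thermal features at different levels.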
Abstract: Thermal imaging offers a powerful alternative to visible-spectrum vision under challenging conditions such as low illumination and adverse weather, yet foundational vision-language models like CLIP fail to align thermal images with textual descriptions due to a fundamental thermal perception gap. We identify three major challenges: the lack of captioned thermal datasets, the inability of standard LLMs to reason about thermal phenomena, and a key representational challenge in thermal imaging where global scene context and object-level heat signatures conflict when learned together in a single embedding space. To address these, we introduce IR-Cap, the first physics-aware thermal captioning pipeline and dataset providing complementary global and fine-grained thermal descriptions across three public benchmarks, and T-CLIP, a decoupled dual-LoRA framework that independently adapts CLIP for scene-level and object-level thermal understanding. T-CLIP achieves consistent improvements over all baselines across three thermal benchmarks in cross-modal retrieval, and we provide an exploratory demonstration of its applicability to text-conditioned thermal image generation.
[133] A Modelling and Evaluation Framework for EuroCrops-Driven Sentinel-2 Crop Segmentation cs.CV
Alexandra Nicoleta Scarlat, Ioana Cristina Plajer, Alexandra Baicoianu
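TL;DR: This paper presents a configurable pipeline for generating semantic-segmentation-ready agricultural datasets from Sentinel-2 imagery and EuroCrops parcel-level annotations. A four-level U-Net is trained for crop segmentation and performs well on the internal test set. Evaluation on external datasets reveals both the potential and the limitations of the model under realistic domain shift.
Details
Motivation: The goal is to automatically generate high-quality agricultural datasets suitable for deep-learning semantic segmentation from heterogeneous EuroCrops vector annotations and Sentinel-2 imagery, in support of crop type identification.
Result: On the internal EuroCrops-based test split, the model achieves an mIoU of 0.7665, a pixel accuracy of 0.8693, and a mean class accuracy of 0.9072, outperforming spectral and spatial-context Random Forest baselines. Evaluation on external datasets (Belgian EuroCrops subsets, DACIA5, and PASTIS) shows a performance gap under domain shift, especially for minority classes and for datasets with different annotation protocols.
Insight: The contribution is a complete, configurable dataset generation pipeline that aligns heterogeneous annotations with remote sensing imagery. The work systematically evaluates how well a model trained with EuroCrops-derived supervision generalizes to internal and external datasets, underlines the importance of learned multi-scale spatial representations, and quantifies the effect of realistic domain shift (such as differing taxonomies and annotation protocols) on performance, providing a useful reference for dataset construction and model evaluation in agricultural remote sensing.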
Abstract: This work presents a configurable pipeline for generating semantic-segmentation-ready agricultural datasets from Sentinel-2 imagery and EuroCrops parcel-level annotations. The workflow transforms heterogeneous vector crop annotations into aligned multispectral image–mask pairs through label harmonization, Sentinel-2 product selection, spatial alignment, rasterization, patch extraction, quality filtering, and class-aware sample selection. The generated dataset contains 67,337 patches from five European countries and uses a reduced taxonomy of ten crop classes plus background. A four-level U-Net with Group Normalization was trained using 10 Sentinel-2 spectral bands and a composite loss combining class-weighted cross-entropy and Dice loss. On the internal EuroCrops-based test split, the model achieved a mean Intersection over Union (mIoU) of 0.7665, a pixel accuracy of 0.8693, and a mean class accuracy of 0.9072. Compared with spectral and spatial-context Random Forest baselines, the U-Net showed the importance of learned multi-scale spatial representations for crop segmentation. External evaluation was performed on unseen Belgian EuroCrops subsets, DACIA5, and PASTIS. The results show a clear performance gap under external and cross-dataset evaluation, especially for benchmarks with different taxonomies, annotation protocols, spatial coverage, or temporal organization. The model transfers more reliably to dominant and taxonomically aligned classes such as maize and wheat, while performance remains limited for several minority classes and for the adapted single-date PASTIS setting. These findings highlight both the potential and the limitations of using EuroCrops-derived supervision for Sentinel-2 crop segmentation under realistic domain shifts.
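For readers less familiar with the three reported metrics, the sketch below shows how mIoU, pixel accuracy, and mean class accuracy are commonly derived from a confusion matrix. It is a generic illustration under the usual definitions, not the paper's evaluation code; the function name and the toy matrix are hypothetical, and the paper may treat background or absent classes differently.

```python
def segmentation_metrics(conf):
    """conf[i][j] = number of pixels with true class i predicted as class j."""
    k = len(conf)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(k))
    ious, class_accs = [], []
    for i in range(k):
        tp = conf[i][i]
        row = sum(conf[i])                       # pixels whose true class is i
        col = sum(conf[j][i] for j in range(k))  # pixels predicted as class i
        union = row + col - tp
        if union > 0:            # skip classes absent from both truth and prediction
            ious.append(tp / union)
        if row > 0:              # skip classes absent from the ground truth
            class_accs.append(tp / row)
    return {
        "pixel_accuracy": correct / total,
        "mean_iou": sum(ious) / len(ious),
        "mean_class_accuracy": sum(class_accs) / len(class_accs),
    }

# Toy two-class confusion matrix.
metrics = segmentation_metrics([[8, 2], [1, 9]])
```

Because mean class accuracy averages per-class recall, it can exceed pixel accuracy when minority classes are recalled well, which matches the ordering of the reported numbers (0.9072 versus 0.8693).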
[134] Wavelet-Fusion Diffusion Model for Multimodal Brain MRI Synthesis with Modality and Metadata Conditioning cs.CV
Muhammad Nabi Yasinzai, Remika Mito, Mangor Pedersen
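TL;DR: This paper proposes the Wavelet-Fusion Diffusion Model (WFDM) for multimodal brain MRI synthesis, built on a Wavelet-Fusion variational autoencoder (WF-VAE) and a conditional 3D U-Net diffusion model. The method uses explicit modality and metadata conditioning in a learned 3D latent space to generate high-quality synthetic MRI, addressing uneven modality coverage and heterogeneity in public neuroimaging datasets.
Details
Motivation: Multimodal MRI provides complementary information for neuroimaging analysis, but modality coverage in public datasets is uneven, and heterogeneity across sites, scanners, acquisition protocols, and demographic and clinical variables limits the development of AI applications. Synthetic MRI generation can support dataset augmentation and controlled synthetic cohort creation, but existing methods are usually trained on narrow modality sets or homogeneous cohorts and are hard to apply to large heterogeneous datasets.
Result: Among the evaluated synthetic MRI generators, WFDM achieves the strongest distributional alignment, indicating better generation quality than the other methods.
Insight: The contributions include combining a wavelet-fusion VAE latent compressor with a conditional 3D U-Net diffusion model trained in the learned 3D latent space, and using explicit modality and metadata conditioning to improve the practicality and quality of synthetic MRI while reducing the computational cost of sampling directly in 3D voxel space.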
Abstract: Multimodal MRI provides complementary information for neuroimaging analysis, where different imaging modalities capture distinct anatomical, tissue, and pathological features that support the development and evaluation of downstream AI applications. Although large-scale structural MRI resources are increasingly available, their modality coverage is often uneven across public and pooled neuroimaging datasets. This uneven modality coverage is further complicated by heterogeneity across sites, scanners, and acquisition protocols, as well as demographic and clinical variables that are often sparse, inconsistently recorded, or unavailable across studies. Synthetic MRI generation can help address this imbalance by synthesizing target-modality volumes for dataset augmentation and controlled synthetic cohort creation. However, many existing MRI synthesis approaches are trained on narrow modality sets or relatively homogeneous cohorts, limiting their applicability to large pooled neuroimaging resources where modality availability, acquisition protocols, and metadata coverage vary substantially across datasets. Diffusion models have become an attractive approach for MRI synthesis because of their strong sample fidelity and diversity, but sampling directly in 3D voxel space is computationally expensive and slow at inference. Latent diffusion improves practicality by synthesizing MRI in a learned, 3D latent space, although generation quality depends on the autoencoder’s reconstruction fidelity and the resulting latent distribution. Our approach combines a Wavelet-Fusion variational autoencoder (WF-VAE) latent compressor with a conditional 3D U-Net diffusion model trained in the learned latent space using explicit modality and metadata conditioning. Our proposed Wavelet-Fusion Diffusion Model (WFDM) achieved the strongest distributional alignment among the evaluated synthetic MRI generators.
[135] FROST-STA: Frozen Dense Features for the Ego4D Short-Term Object Interaction Anticipation cs.CV
Chaoyang Wang, Lexuan Xu
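TL;DR: FROST-STA is a model for the Ego4D Short-Term Object Interaction Anticipation challenge. It extracts dense video and image features with a frozen V-JEPA 2.1 ViT-G backbone, fuses them through a compact alignment module, and uses Faster R-CNN-style prediction heads to produce structured hypotheses containing an object box, noun, verb, time-to-contact, and confidence.
Details
Motivation: Short-term interaction anticipation in egocentric video is a complex problem: the system must jointly infer which object the user will contact, which action will be performed, and when the contact will occur, rather than only recognizing the current scene.
Result: The method obtains 5.13 Overall Top-5 mAP on the official test server, ranking second in the challenge and showing that frozen dense image-video features can serve as a strong basis for object-level interaction forecasting.
Insight: The contributions include extracting dense features with a frozen pre-trained backbone, object-centric decoding and a compact alignment module (an attentive probe plus frame-guided temporal pooling) to fuse video and image information, and a submission-oriented training and ensembling strategy to improve performance.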
Abstract: Short-term anticipation in egocentric video requires more than recognizing the current scene: a system must infer which object the camera wearer will contact, which action will follow, and how soon the contact will happen. This report describes FROST-STA, our submission to the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. For each query time, the model produces a ranked set of structured hypotheses containing an active-object box, noun label, verb label, time-to-contact (TTC), and confidence. FROST-STA builds on the V-JEPA 2.1 STA evaluation protocol, but adapts it to the challenge by using object-centric decoding, multi-head prediction, and a submission-oriented training and ensembling recipe. We keep the V-JEPA 2.1 ViT-G backbone fixed and extract two dense token streams: video tokens from a short clip resized to 384 pixels before the query, and image tokens from the last observed high-resolution frame. A compact alignment module, consisting of an attentive probe and frame-guided temporal pooling, maps the clip representation onto the spatial reference of the final frame before fusing it with image features. The fused maps are decoded by Faster R-CNN-style STA heads that estimate box offsets, nouns, verbs, TTC values, and interaction quality. For the final leaderboard entry, we train for 25 epochs with the official training split plus additional permitted validation annotations, and combine predictions across eight heads and checkpoints from epochs 15-25. FROST-STA obtains 5.13 Overall Top-5 mAP on the official test server, ranking second in the challenge and showing that frozen dense image-video features can serve as a strong basis for object-level interaction forecasting.
[136] CASTLE2026 Team WDL Technical Report cs.CV
Zhengyang Li, Zhenglin Du, Yi Wen, Fang Liu, Shuo Li
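TL;DR: This is the technical report of team WDL for the CASTLE Challenge at EgoVis 2026, which evaluates long-form question answering over 600+ hours of multi-perspective egocentric video. The team proposes an evidence-aware multimodal reasoning pipeline based on Qwen: it parses question hints, retrieves ASR chunks, attaches auxiliary images, samples candidate video frames, and routes questions into static visual, speech/text, temporal, and mixed types handled with specialized prompts. The final system aggregates multiple inference passes by confidence-weighted voting and ranks first in the challenge.
Details
Motivation: Long-form egocentric video question answering requires complex reasoning that integrates multimodal evidence from videos, transcripts, auxiliary photos, people, days, rooms, and temporal context.
Result: In the ablation, LoRA raises the score from 0.21 to 0.50, and increasing the number of sampled frames raises it further to 0.58; the final system ranks first in the CASTLE Challenge @ EgoVis 2026.
Insight: The contribution is an evidence-aware multimodal reasoning pipeline that routes each question, according to its type, to specialized prompts and processing, and aggregates results by confidence-weighted voting. Viewed objectively, combining LoRA fine-tuning with a multi-frame sampling strategy effectively improves performance on this complex multimodal task.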
Abstract: The CASTLE Challenge @ EgoVis 2026 evaluates long-form egocentric video question answering over 600+ hours of multi-perspective recordings. Each four-choice question requires evidence from videos, transcripts, auxiliary photos, people, days, rooms, and temporal context. We propose an evidence-aware multimodal reasoning pipeline based on Qwen. Our system parses question hints, retrieves ASR chunks, attaches auxiliary images, samples candidate video frames, and routes questions into static visual, speech/text, temporal, and mixed types with specialized prompts. Multiple inference passes are aggregated by confidence-weighted voting and converted into the official Codabench format. In ablation, LoRA improves the score from 0.21 to 0.50, and more sampled frames further raise it to 0.58. Our final system ranks first in the CASTLE Challenge @ EgoVis 2026.
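Confidence-weighted voting over several inference passes, as mentioned in the abstract, can be sketched as below. This is one plausible reading rather than the team's code: the function name, the `(choice, confidence)` input layout, and the alphabetical tie-break are assumptions.

```python
from collections import defaultdict

def weighted_vote(passes):
    """passes: list of (choice, confidence) pairs from independent inference passes.

    Sums the confidence assigned to each answer option and returns the option
    with the largest total; ties are broken alphabetically (an assumption).
    """
    totals = defaultdict(float)
    for choice, confidence in passes:
        totals[choice] += confidence
    return max(sorted(totals), key=lambda c: totals[c])

# Two moderately confident votes for "B" outweigh one confident vote for "A".
answer = weighted_vote([("A", 0.9), ("B", 0.5), ("B", 0.6)])
```

Compared with plain majority voting, this lets a single high-confidence pass override several low-confidence ones.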
[137] Scaling Parallel Sequence Models to Foundation-Scale Vision Encoders cs.CV | cs.LG
Yitong Jiang, Hongjun Wang, Collin McCarthy, Hanrong Ye, David Wehr
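TL;DR: This paper presents C-GSPN, a vision foundation model encoder based on 2D spatial propagation, designed to address the quadratic cost of self-attention. With an optimized CUDA kernel, a compressed latent-space propagation block, and a two-stage cross-operator distillation recipe, C-GSPN improves performance with fewer parameters and achieves a significant inference speedup.
Details
Motivation: Vision foundation models are constrained by the quadratic cost of self-attention, which limits usable resolution and raises the cost of large-scale pretraining. Existing subquadratic alternatives such as linear attention and state-space models weaken the 2D spatial structure that matters for vision, while Generalized Spatial Propagation Networks (GSPN) propagate 2D context directly but have not yet been used as foundation-scale encoders.
Result: After distillation on 600M image-text pairs, C-GSPN matches an isomorphic ViT baseline with 15% fewer parameters, improves ADE20K segmentation by 2.1%, transfers to high resolution with a small amount of data, and delivers a 4x end-to-end block speedup at 2K resolution without tiled inference.
Insight: The contributions are: 1) an efficient GSPN CUDA kernel that greatly improves computational efficiency through fused launches, shared-memory tiling, and compact multi-channel propagation; 2) a compressed latent-space propagation block with fused normalization that turns kernel-level speed into block- and model-level efficiency; 3) a two-stage cross-operator distillation recipe that trains the new architecture from an attention teacher without large-scale training from scratch.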
Abstract: Vision foundation models are bottlenecked by the quadratic cost of self-attention, which limits usable resolution and increases the cost of large-scale pretraining. Subquadratic alternatives such as linear attention and state-space models reduce this cost, but often serialize images into 1D token streams and weaken the 2D spatial structure important for vision. Generalized Spatial Propagation Networks (GSPN) instead propagate context directly on the 2D grid through line-scan recurrences, achieving near-linear complexity without positional embeddings, but have seen little use as foundation-scale encoders. We present C-GSPN, a foundation-scale vision encoder based on 2D spatial propagation. C-GSPN makes the operator practical through three improvements: (1) a fast GSPN CUDA kernel that fuses per-step launches into a single warp-specialized implementation with shared-memory tiling, coalesced access, and a compact multi-channel propagation, reaching over 90% of peak memory bandwidth and running up to 40–52x faster than the original GSPN implementation; (2) a compressed latent-space propagation block with fused normalization, which turns kernel-level speed into block- and model-level efficiency; and (3) a two-stage cross-operator distillation recipe that trains the new architecture from an attention teacher without the cost of from-scratch foundation-scale training. Distilled with 600M image-text pairs, C-GSPN matches an isomorphic ViT baseline with 15% fewer parameters, improves ADE20K segmentation by +2.1%, transfers to high resolution with a fraction of the data needed from scratch, and delivers a 4x end-to-end block speedup at 2K with single-pass, tiling-free inference.
[138] FlowOVD: Learning Generative Latent Flows for Zero-shot Open-vocabulary Detection cs.CV
Yao Wei, Andrea Cavallaro, Changjae Oh
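TL;DR: This paper proposes FlowOVD, a zero-shot open-vocabulary detection method based on generative latent flows. It models decoder query generation as a continuous transport process in latent space and uses rectified flow to progressively transform text-agnostic queries into text-guided queries, strengthening semantic alignment and detection performance.
Details
Motivation: Existing open-vocabulary detection methods usually treat the problem as discriminative prediction, with decoder queries that are static or initialized from encoder features, which limits diversity and flexibility. This paper takes a generative perspective and models query generation as a continuous latent flow to make open-vocabulary detection more expressive.
Result: Without additional training data, FlowOVD reaches 49.5 AP on COCO and 31.5 AP on LVIS, improving over GroundingDINO by +1.2 AP (+2.5%) and +4.1 AP (+15.0%) respectively, with the gain especially pronounced on the long-tailed LVIS benchmark.
Insight: The contribution is introducing generative latent flows into a detector based on a vision-language model, which avoids heuristic discrete query construction and enables more expressive semantic alignment. Viewed objectively, the continuous query generation mechanism offers a new approach to open-vocabulary generalization and shows particular strength on long-tailed distributions.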
Abstract: Open-vocabulary object detection (OVD) has achieved remarkable progress through large-scale vision-language pre-training. Existing methods, however, typically formulate OVD as a discriminative prediction problem, where decoder queries are either static or initialized from encoder features, thus limiting their diversity and flexibility. In this paper, we introduce a generative perspective by modeling decoder query generation as a continuous transport process in latent space. We propose FlowOVD, a text-conditioned query generation framework based on rectified flow that progressively transforms text-agnostic queries into text-guided queries. By introducing continuous latent query dynamics into a vision-language model (VLM) based detector, our method avoids heuristic discrete query construction and enables more expressive semantic alignment for open-vocabulary detection. Without requiring additional training data, FlowOVD achieves 49.5 AP on COCO and 31.5 AP on LVIS, outperforming GroundingDINO by +1.2 AP (+2.5%) and +4.1 AP (+15.0%), respectively. The larger gain on the challenging long-tailed LVIS benchmark further highlights the effectiveness of continuous query generation for open-vocabulary generalization.
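The notion of transporting a query along a continuous flow can be illustrated with a toy Euler integrator. This is only a sketch of the general rectified-flow idea, not FlowOVD itself: in the paper the velocity would come from a learned, text-conditioned network, whereas `straight_line_velocity` below is a hypothetical stand-in that returns the constant velocity of a straight path between two known points.

```python
def euler_transport(x0, velocity, steps=10):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def straight_line_velocity(x0, x1):
    """Toy velocity field: constant velocity x1 - x0 along a straight path (assumption)."""
    diff = [b - a for a, b in zip(x0, x1)]
    return lambda x, t: diff

# A "text-agnostic" start point moved toward a "text-guided" end point.
start, goal = [0.0, 1.0], [1.0, -1.0]
end = euler_transport(start, straight_line_velocity(start, goal), steps=10)
```

With a perfectly straight flow the integrator reaches the goal regardless of the number of steps, which is the property that makes few-step sampling attractive for rectified flows.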
[139] GIRL-DETR: Gradient-Isolated Reinforcement Learning for Video Moment Retrieval cs.CV | cs.AI
Shihang Zhang, Mingjin Kuai, Ye Wei, Zhen Zhang, Wei Ji
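TL;DR: This paper proposes GIRL-DETR, a lightweight temporal localization framework for video moment retrieval (VMR). The method establishes robust feature alignment through cross-modal interaction and a text-guided gating mechanism. Once supervised training has converged, it freezes the backbone and uses a three-stage progressive reinforcement learning strategy to directly optimize the non-differentiable evaluation metric tIoU, addressing the optimization stagnation caused by the mismatch between surrogate losses and evaluation metrics.
Details
Motivation: Existing VMR models often get trapped in suboptimal solutions late in training because continuous surrogate losses do not match non-differentiable evaluation metrics such as tIoU, and optimization stagnates. Reinforcement learning post-training works for large models, but applying it directly to lightweight networks easily disrupts the fragile feature representations established during the supervised phase.
Result: Experiments on the Charades-STA, QVHighlights, and TACoS benchmarks show that GIRL-DETR effectively mitigates surrogate loss degradation and achieves substantial gains in localization accuracy with only a small number of parameter updates.
Insight: The core contribution is an orthogonal decoupling of state representation and metric optimization: the backbone is frozen to protect the feature manifold, and only the detection head optimizes the target metric directly through reinforcement learning. The proposed three-stage progressive reinforcement learning strategy also offers a robust new route for bringing RL post-training to lightweight VMR models.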
Abstract: The Video Moment Retrieval (VMR) task requires accurately localizing temporal boundaries aligned with natural language queries. Many models, however, suffer from a misalignment between continuous surrogate losses and non-differentiable metrics, which leads to optimization stagnation during the late stages of training and traps boundary predictions in suboptimal solutions. Although Reinforcement Learning (RL) post-training successfully optimizes localization results for large models, applying it directly to lightweight networks easily disrupts the fragile feature representations established during the supervised phase. To overcome this optimization bottleneck, we propose Gradient-Isolated Reinforcement Learning for DETR (GIRL-DETR), introducing RL post-training into a lightweight temporal localization framework for the first time. The input video and text features first establish early alignment through Cross-Modal Interaction (CMI) before entering the transformer encoder. Subsequently, a Text-Guided Gating (TGG) mechanism dynamically injects semantic priors into the queries before the transformer decoder generates candidate proposals, providing high signal-to-noise ratio inputs for temporal prediction. After the supervised training reaches convergence, the backbone network is frozen to protect the feature manifold, while the detection head directly optimizes the non-differentiable evaluation metric tIoU to enhance localization accuracy through a Three-stage Progressive Reinforcement Learning (TPRL) strategy. This approach achieves an orthogonal decoupling of state representation and metric optimization. Experiments on Charades-STA, QVHighlights, and TACoS demonstrate that GIRL-DETR effectively resolves surrogate loss degradation and achieves substantial accuracy improvements with minimal parameter updates, providing a robust new pathway for RL applications in lightweight VMR models.
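The tIoU metric that the RL stage optimizes is the standard temporal intersection-over-union between a predicted and a ground-truth segment. A minimal sketch follows; the function name and the `(start, end)` tuple layout are illustrative choices, and how the paper turns this value into a reward is not specified here.

```python
def temporal_iou(pred, target):
    """tIoU between two segments given as (start, end) in seconds."""
    (ps, pe), (ts, te) = pred, target
    inter = max(0.0, min(pe, te) - max(ps, ts))   # overlap length, zero if disjoint
    union = (pe - ps) + (te - ts) - inter
    return inter / union if union > 0 else 0.0

# Segments 2-6 s and 4-8 s overlap for 2 s out of a 6 s union.
score = temporal_iou((2.0, 6.0), (4.0, 8.0))
```

Retrieval benchmarks typically report recall at tIoU thresholds (for example 0.5 or 0.7), so the quantity that is scored is a step function of this value; that thresholding is one reason smooth surrogate losses can drift away from the evaluation metric.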
[140] MBench: A Comprehensive Benchmark on Memory Capability for Video World Models cs.CV
Shengjun Zhang, Zhang Zhang, Simin Huang, Zhenyu Tang, Hanyang Wang
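TL;DR: This paper presents MBench, a comprehensive benchmark dedicated to evaluating the memory capability of video world models. The benchmark decomposes memory capability into three core dimensions (entity consistency, environment consistency, and causal consistency), refined into 12 quantifiable sub-dimensions. Built on a carefully curated dataset of real long videos, MBench performs objective evaluation with rule-based quantitative metrics and a vision-language model, and reveals systemic limitations of mainstream models in long-term state retention.
Details
Motivation: Video world models have made progress in generating high-fidelity visual sequences, but a fundamental gap remains between visually plausible generation and the functional requirements of a world model, such as maintaining a stable and reasonable internal state over long horizons. Existing benchmarks focus on visual quality, motion coherence, and text-video alignment, and overlook memory as a core capability.
Result: Extensive evaluation of mainstream state-of-the-art video world models reveals critical systemic limitations of existing methods in long-term state retention. The benchmark gives the field a standardized evaluation tool and a clear research direction.
Insight: The work systematically decomposes the memory capability of video world models into three hierarchical and complementary core dimensions, further refined into quantifiable sub-dimensions, enabling a comprehensive characterization of long-term memory. The benchmark is built on carefully curated real long videos and evaluated with rule-based quantitative metrics and a VLM, which supports objective and comprehensive assessment.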
Abstract: Recent advancements in video-based world models have demonstrated an unprecedented ability to synthesize high-fidelity visual sequences. However, a fundamental gap persists between visually plausible video generation and the functional requirements of a world model, particularly in maintaining a stable and reasonable internal state over extended temporal horizons. While existing benchmarks primarily emphasize visual quality, motion coherence, and text-video alignment, they largely overlook memory, the core capability of a world model to preserve consistency across long-term horizons and complex interactions. To address this gap, we present MBench, a comprehensive benchmark dedicated to quantifying and evaluating the memory capability of video world models. We systematically decompose the memory capability of video world models into three hierarchical and complementary core dimensions: entity consistency, environment consistency, and causal consistency, which are further refined into 12 quantifiable sub-dimensions for comprehensive characterization of long-term memory. Our benchmark is built upon rigorously curated real-captured long videos, and evaluated by rule-based quantitative metrics and a VLM to enable objective and comprehensive consistency assessment. Extensive evaluations of mainstream state-of-the-art video world models reveal critical systemic limitations of existing methods in long-term state retention, providing a standardized benchmark and clear research direction to advance the field.
[141] DASH: Dual-Branch Score Distillation for Guidance-Calibrated Compact Diffusion Models cs.CV | cs.AI | cs.LG
Abdullah Al Shafi, Kazi Saeed Alam, Sk Imran Hossain, Engelbert Mephu Nguifo
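TL;DR: This paper proposes the DASH framework to address the unsupervised unconditional branch that arises when compressing the parameters of class-conditional diffusion models. By supervising the two score branches independently and adding anchor regularization and curriculum transfer, the method achieves 5.9x compression on CIFAR-10 and CIFAR-100 while keeping generation quality close to the teacher.
Details
Motivation: When existing output-level distillation methods compress class-conditional diffusion models, the unconditional score branch receives no supervision, so classifier-free guidance breaks down and the two branches may collapse to identical predictions.
Result: On CIFAR-10 and CIFAR-100, the 5.9x compressed model stays within 4 FID points of the teacher at 50-step DDIM sampling, considerably outperforming a baseline trained from scratch, with guidance fidelity preserved.
Insight: The core contribution is explicit supervision of the unconditional branch (accounting for over 60% of the distillation gain), complemented by TIRT curriculum transfer and anchor regularization to form a complete dual-branch constraint framework, an effective approach to model compression that preserves guidance capability.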
Abstract: Parameter compression of class-conditional diffusion models reveals an underexplored limitation in output-level distillation: the unconditional score branch remains unsupervised, leaving the classifier-free guidance gap underdetermined in the student. This gap, amplified at every denoising step, admits degenerate solutions where both branches collapse toward identical predictions, rendering guidance ineffective despite low output-level training loss. This paper introduces DASH, a dual-branch distillation framework that independently supervises both score branches, uniquely specifying target branch outputs for each training sample through independent branch constraints, with an anchor term regularising conditional predictions toward ground-truth noise. The framework further introduces TIRT Transfer, which copies the teacher’s converged per-timestep importance curriculum into the student as a frozen prior, eliminating the need to relearn it within limited distillation budgets. Experiments on CIFAR-10 and CIFAR-100 demonstrate that 5.9x compression maintains quality within 4 FID points of the teacher at 50-step DDIM sampling, considerably outperforming training from scratch with guidance fidelity well preserved. Ablation studies confirm that unconditional supervision is the dominant contribution, accounting for over 60% of total distillation gain. Curriculum transfer and anchor regularisation provide complementary benefit, together validating dual-branch constraints as empirically essential for guidance-preserving compression.
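The collapse that the abstract warns about is easy to see from the usual classifier-free guidance combination. The sketch below uses the common parameterization in which the guided prediction is the unconditional output plus a scaled conditional-minus-unconditional gap, followed by a simplified dual-branch loss. The loss form, the `anchor_weight` value, and all function names are assumptions for illustration; the paper's exact objective and weighting are not given in the abstract.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: unconditional output plus a scaled guidance gap."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def mse(a, b):
    """Mean squared error between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def dual_branch_loss(student, teacher, noise, anchor_weight=0.1):
    """Supervise both branches against the teacher and anchor the conditional
    branch to the ground-truth noise (simplified, assumed form).

    student, teacher: dicts with 'cond' and 'uncond' noise predictions.
    """
    return (mse(student["cond"], teacher["cond"])
            + mse(student["uncond"], teacher["uncond"])
            + anchor_weight * mse(student["cond"], noise))

# If both branches collapse to the same prediction, guidance has no effect.
collapsed = cfg_combine([1.0, -2.0], [1.0, -2.0], guidance_scale=7.5)
```

An output-level loss that only matches the guided or conditional output can be minimized by such a collapsed student, which is why a separate term on the unconditional branch is needed to pin down the guidance gap.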
[142] RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes cs.CV
Leyi Wu, Yifan Zhao, Jinjie Zhang, Suzeyu Chen, Wosong Chen
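TL;DR: This paper proposes RoboStressBench, a benchmark for evaluating the robustness of vision-language models to physical visual stress in embodied scenes. Taking an inverse graphics perspective, the benchmark decomposes visual stress into four physical dimensions (material, viewpoint, lighting, and geometry) and thereby covers a broad range of visual stresses found in real environments. By evaluating several state-of-the-art VLMs, the paper shows how different physical factors affect specific model capabilities, and it proposes a stress-aware agentic solver to improve robustness in high-stress scenarios.
Details
Motivation: Existing benchmarks mainly evaluate VLMs with clean images or isolated perturbations and do not cover the real visual stresses caused by physical scene formation, which limits reliable deployment of these models in real embodied AI systems.
Result: A comprehensive evaluation of several state-of-the-art VLMs reveals stress-specific failure modes and shows that different physical factors degrade different embodied capabilities (such as visual recognition, reasoning, and planning); these effects are often hidden by aggregate accuracy.
Insight: The core contribution is a principled visual stress decomposition framework (MVLG) derived from inverse graphics and the physical rendering equation, which provides a systematic benchmark for diagnosing and improving VLM perception under real physical stress. In addition, the proposed stress-aware agentic solver, which first detects visual stressors and invokes visual-editing skills before reasoning, is an effective way to improve robustness.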
Abstract: Vision-Language Models (VLMs) have shown strong visual understanding and are increasingly deployed in embodied AI systems, where reliable perception under real conditions is essential. However, existing benchmarks assess VLMs using clean images or isolated perturbations rather than stresses caused by physical scene formation. This design has two limitations: it covers only a narrow subset of everyday visual stresses, and some perturbations rarely appear in realistic embodied scenes. This gap raises a fundamental question: how can we define visual stress in a principled way that captures the diverse factors encountered in physical environments? To address this question, we formulate visual perception from an inverse graphics perspective and introduce RoboStressBench, a benchmark for evaluating VLM robustness to physical visual stress in embodied scenes. Inspired by the physical rendering equation, RoboStressBench decomposes visual stress into four physically grounded dimensions: Material (M), Viewpoint (V), Lighting (L), and Geometry (G). This design enables RoboStressBench to cover a broad range of visual stresses in real-world environments, while allowing controlled analysis of their effects on VLM capabilities such as visual recognition, reasoning, and planning. Through comprehensive evaluations of state-of-the-art VLMs, we identify stress-specific failure modes and reveal that different physical factors degrade different embodied capabilities, which are often obscured by aggregate accuracy. We further introduce a stress-aware agentic solver that detects visual stressors and invokes visual-editing skills before reasoning, improving robustness in high-stress scenarios. Overall, RoboStressBench provides a principled evaluation framework for diagnosing and improving VLM perception under real-world physical stress, supporting the development of more reliable embodied AI systems.
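[143] SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory cs.CV | cs.ET | cs.HC | cs.MA [PDF]
Samiul Alam, Shakhrul Iman Siam, Michael J. Proulx, James Fort, Richard Newcombe
TL;DR: This paper introduces SuperMemory-VQA, an egocentric visual question-answering benchmark for evaluating AI assistants on practical, long-horizon memory tasks. The dataset contains 52.9 hours of everyday activity recorded with AI glasses, together with multimodal sensor data, and 4,853 human-verified question-answer pairs covering several memory types. Benchmarking shows that existing agentic frameworks and LLMs remain unreliable on real-world memory tasks, which points to the need for new, grounded AI memory architectures.
Details
Motivation: Existing egocentric datasets focus mainly on action recognition or generic QA over short clips, so they measure perception rather than realistic human memory needs. For AI glasses to serve as personalized memory assistants, systems must go beyond short-term video understanding and address the memory gaps that people experience, for practical, personal, or social purposes, over long egocentric video streams.
Result: Tests of leading agentic frameworks and LLM backbones on SuperMemory-VQA show that existing systems remain far from reliable on real-world memory tasks.
Insight: The novelty is an egocentric VQA benchmark focused on practical, long-horizon memory. Its question types (object and location memory, intent recall, timeline reconstruction, conversational memory, and others) and its multiple-choice format with an explicit "unanswerable" option are designed to evaluate memory assistance more realistically and to test robustness to hallucination. This points toward grounded AI memory architectures that answer only when the evidence is sufficient.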
Abstract: AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genuinely useful, such systems must move beyond short-term video comprehension and address memory gaps that humans experience for practical, personal, or social purposes over longitudinal egocentric video streams. However, existing egocentric datasets predominantly focus on action recognition or generic QAs from short clips, measuring perceptual capabilities rather than realistic human memory needs. We introduce SuperMemory-VQA, an egocentric visual question answering (VQA) dataset for evaluating AI assistants on practical, long-horizon memory tasks. It contains 52.9 hours of everyday activities recorded with AI glasses, including synchronized RGB video, audio transcription, eye gaze, IMU, and SLAM trajectories. Through a human-verified annotation pipeline, we construct 4,853 grounded question-answer pairs that span object and location memory, intent recall, visual scene recall, timeline reconstruction, conversational memory, and in-context retrieval. Each question is posed as multiple-choice with an explicit “unanswerable” option to test hallucination robustness. Benchmarking leading agentic frameworks and LLM backbones reveals that existing systems remain far from reliable on real-world memory tasks, highlighting the need for new architectures for grounded AI memory that can answer only when evidence is sufficient. A participant survey further supports that our questions are realistic, useful, and aligned with everyday memory needs.
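The multiple-choice format with an explicit "unanswerable" option can be scored in more than one way. The sketch below is an illustration only: the `Item` class, the `UNANSWERABLE` label, and the two reported numbers are assumptions, not the benchmark's published metric. It reports overall accuracy alongside how often a system picks a concrete answer when the gold label is "unanswerable".

```python
from dataclasses import dataclass

UNANSWERABLE = "unanswerable"  # hypothetical label for the abstain option


@dataclass
class Item:
    gold: str       # correct option, or UNANSWERABLE when evidence is missing
    predicted: str  # option chosen by the system


def score(items):
    """Return overall accuracy and the hallucination rate on unanswerable items."""
    correct = sum(i.gold == i.predicted for i in items)
    unanswerable = [i for i in items if i.gold == UNANSWERABLE]
    # A hallucination here means answering concretely when abstaining was correct.
    hallucinated = sum(i.predicted != UNANSWERABLE for i in unanswerable)
    return {
        "accuracy": correct / len(items),
        "hallucination_rate": hallucinated / len(unanswerable) if unanswerable else 0.0,
    }


items = [
    Item("A", "A"),
    Item("B", "C"),
    Item(UNANSWERABLE, "A"),
    Item(UNANSWERABLE, UNANSWERABLE),
]
result = score(items)
```

Reporting the two numbers separately keeps a system that always guesses from looking as good as one that abstains when evidence is missing.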
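[144] The Right Inference Strategy Is All You Need: Nearly Training-Free Domain-Wise Inference for EgoCross Challenge cs.CV [PDF]
Leyi Wu, Yifan Zhao, Jinjie Zhang, Yinchuan Li, Ying-Cong Chen
TL;DR: For the source-limited track of the EgoCross challenge, this paper proposes a nearly training-free, domain-wise inference strategy. It designs separate input, prompting, and answer-mapping procedures for each of four domains (surgery, industrial assembly, extreme sports, and animal-mounted cameras) to help the fixed Qwen3-VL-4B base model handle egocentric video question answering under domain shift.
Details
Motivation: EgoCross test videos come from non-everyday scenes, so models face substantial domain shift. In the source-limited track the base model is fixed and only 20 training samples are available, which makes the challenge less about model scaling and more about exposing the right visual, temporal, and answer-selection cues to a constrained model. The authors observe that the frozen baseline is not simply incapable; it lacks a suitable interface for transferring its existing vision-language knowledge to the new task format.
Result: The strategy reaches 66.98% overall accuracy in the final evaluation, suggesting that careful domain-aware inference can recover much of the ability already present in the baseline even when the base model is limited.
Insight: The core idea is a domain-wise inference strategy that tailors input, prompts, and answer mapping to each domain's task characteristics, steering a pretrained vision-language model toward rare egocentric scenes with almost no training. More broadly, it suggests that under scarce data and a fixed model, optimizing the inference procedure (how the model is "asked") can matter as much as improving the model itself, which offers a route to lightweight adaptation.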
Abstract: EgoCross evaluates multimodal large language models on egocentric video question answering under substantial domain shift, where test videos come from surgery, industrial assembly, extreme sports, and animal-mounted cameras rather than ordinary daily-life scenes. In the source-limited track, the base model is fixed to Qwen3-VL-4B, while the official task-specific support set contains only 20 training samples. This setting makes the challenge less about model scaling and more about exposing the right visual, temporal, and answer-selection cues to a constrained model. Our key observation is that the frozen baseline model is not simply incapable of these rare scenarios; rather, it often fails to transfer its existing visual-language knowledge to the new task format without an appropriate interface. We therefore use a domain-wise inference strategy that treats the four target domains separately and designs different input, prompting, and answer-mapping procedures according to each domain’s task characteristics. These strategies make the rare egocentric scenes more interpretable to the VLM by emphasizing the cues that matter for each domain. The resulting system is nearly training-free: surgery and animal questions are answered with the base Qwen3-VL-4B model, while XSports and industry use only the official SFT checkpoint trained for two epochs on the provided 20 training samples. On the final evaluation, this simple strategy reaches 66.98% overall accuracy, suggesting that careful domain-aware inference can compensate for limited base-model strength and recover much of the ability already present in the baseline model.
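The domain-wise routing described above can be pictured as a small dispatch table. In the sketch below, only the domain-to-checkpoint mapping comes from the abstract. The prompt prefixes, the `build_request` and `map_answer` helpers, and the letter-matching rule are hypothetical stand-ins for the authors' unpublished procedures.

```python
import re

# From the abstract: which checkpoint answers each domain.
MODEL_FOR_DOMAIN = {
    "surgery": "base",
    "animal": "base",
    "xsports": "sft",
    "industry": "sft",
}

# Hypothetical per-domain prompt prefixes; the real prompts are not given.
PROMPT_FOR_DOMAIN = {
    "surgery": "Focus on the instruments and the current surgical step.",
    "animal": "The camera is mounted on an animal; describe what it sees.",
    "xsports": "Focus on body motion and the surrounding terrain.",
    "industry": "Focus on the parts being assembled and their order.",
}


def build_request(domain, question, options):
    """Assemble the checkpoint choice and prompt text for one question."""
    letters = "ABCDEFGH"[: len(options)]
    lines = [f"{letter}. {opt}" for letter, opt in zip(letters, options)]
    prompt = "\n".join([PROMPT_FOR_DOMAIN[domain], question, *lines])
    return {"model": MODEL_FOR_DOMAIN[domain], "prompt": prompt}


def map_answer(raw_output, options):
    """Map free-form output to an option letter; None if no standalone letter is found."""
    letters = "ABCDEFGH"[: len(options)]
    match = re.search(rf"\b([{letters}])\b", raw_output.upper())
    return match.group(1) if match else None
```

Keeping routing, prompting, and answer mapping as separate tables makes it cheap to change one domain without touching the others, which fits the nearly training-free setting.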
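[145] Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated cs.CV | cs.AI [PDF]
Rashid Mushkani
TL;DR: This paper argues that VLM benchmarks for urban perception should be reliability-aware and negotiated: annotator disagreement and abstention should be treated as measurement outcomes, inter-annotator reliability should be reported alongside model alignment, and the label space and scoring policy should be treated as negotiable. The argument is grounded in annotations of 100 Montreal street scenes along 30 dimensions by 12 participants, and in a zero-shot evaluation of seven VLMs.
Details
Motivation: When VLMs generate structured descriptions for urban perception tasks such as streetscape auditing and public consultation, the outputs combine observable attributes with appraisal categories, while the human targets often involve disagreement and explicit abstention. Benchmarks therefore need a more reliable framework that reflects this uncertainty.
Result: Across the seven VLMs evaluated, model agreement with human consensus co-varies with dimension-level human reliability. On the appraisal dimension "Overall Impression", models and annotators show distributional mismatch, including different rates of "Not applicable", which exposes a limitation of current benchmarking practice.
Insight: The novelty is the position that urban-perception VLM benchmarks should handle uncertainty and annotator disagreement explicitly and treat the label space and scoring policy as negotiable. The study highlights the importance of accounting for human annotation reliability and distribution matching in evaluation, and offers concrete actions for benchmark creators and model developers.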
Abstract: Vision-language models (VLMs) are increasingly used to generate structured descriptions of street-level imagery for tasks such as streetscape auditing, mapping, and public consultation. These uses combine observable attributes with appraisal categories, and the human targets are often distributions of judgments with disagreement and explicit non-response. This paper argues that benchmarking VLMs for urban perception should treat disagreement and abstention as measurement outcomes, report inter-annotator reliability alongside model alignment, and treat the label space and scoring policy as negotiable artifacts when outputs are intended to inform urban governance. We ground the argument in a benchmark of 100 Montreal street scenes annotated along 30 dimensions by 12 participants from seven community organizations, and in a deterministic zero-shot evaluation of seven VLMs. Across dimensions, model agreement with human consensus co-varies with dimension-level human reliability, and for the appraisal dimension “Overall Impression”, models and annotators exhibit distributional mismatch, including different rates of “Not applicable”. We close with actions for benchmark creators, model developers, and institutions to make uncertainty and benchmark assumptions visible in evaluation reports.
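To make "treat disagreement and abstention as measurement outcomes" concrete, the sketch below keeps "Not applicable" as an ordinary label and compares label distributions instead of collapsing annotators into a single majority vote. Raw pairwise agreement and total variation distance are stand-ins chosen for simplicity; the abstract does not say which reliability or mismatch statistics the paper reports.

```python
from collections import Counter
from itertools import combinations

NA = "Not applicable"


def pairwise_agreement(labels):
    """Fraction of annotator pairs giving the same label (abstention counts as a label)."""
    pairs = list(combinations(labels, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


def distribution(labels):
    """Empirical label distribution, with NA kept as its own category."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two label distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Toy data for one scene and one dimension (illustrative values only).
human = ["Positive", "Positive", "Neutral", NA]
model = ["Positive", "Positive", "Positive", "Positive"]

agreement = pairwise_agreement(human)
mismatch = total_variation(distribution(human), distribution(model))
```

Here the model matches the human majority label, yet half of the probability mass differs, which is the kind of distributional mismatch a consensus-only score would hide.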
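[146] GABI: Geometry-Aware Boundary Integration for Spacecraft Segmentation cs.CV | cs.RO [PDF]
Iason Georgios Velentzas, Dhruv Ahuja, Panagiotis Tsiotras
TL;DR: This paper proposes GABI, a lightweight boundary-aware multi-task segmentation architecture for spacecraft segmentation. It augments a convolutional backbone with an auxiliary distance-field prediction head; the distance field provides dense geometric supervision around object boundaries so that the network learns spatially consistent representations of spacecraft structures.
Details
Motivation: Harsh illumination in space produces large variation in spacecraft image appearance, which hinders the generalization of segmentation methods across spacecraft and environments.
Result: On the SPARK benchmark, distance-field supervision improves the baseline's Average Precision by up to 5%, reaching performance comparable to transformer-based models. In generalization experiments, GABI improves Average Precision by more than 50% over the baseline. In cross-domain evaluation, the lightweight GABI variant is within 5% of the heavier transformer model in IoU and F1-score at roughly one tenth of the size, while the heavier GABI variant surpasses the transformer architectures and is nearly three times lighter.
Insight: The novelty is using the distance field as an auxiliary task for geometric supervision, which strengthens boundary awareness and spatial consistency and improves accuracy and generalization while keeping the model lightweight. The multi-task strategy may transfer to other settings that need precise boundary segmentation.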
Abstract: Accurate segmentation is crucial for autonomous spacecraft, as it directly affects downstream tasks related to 3D situational awareness. The harsh illumination conditions of space, however, produce images with high variability in appearance, hindering the generalization of segmentation approaches across different spacecraft and environments. In this work, we propose GABI, a lightweight boundary-aware multi-task segmentation architecture that augments a convolutional backbone with an auxiliary distance-field prediction head. The distance field provides dense geometric supervision around object boundaries, encouraging the network to learn spatially consistent representations of spacecraft structures while maintaining low model complexity suitable for onboard perception systems. We evaluated GABI against both an established convolutional baseline and a heavier transformer-based architecture. On the SPARK benchmark, distance-field supervision improves the baseline by up to 5% in Average Precision while achieving performance comparable to the transformer models. In generalization experiments, GABI improves Average Precision by more than 50% over the baseline. In cross-domain evaluation, the lightweight GABI variant performs within 5% in IoU and F1-score of the heavier transformer model while being approximately ten times smaller. At the same time, the heavier GABI variant surpasses the transformer architectures while remaining nearly three times lighter.
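A distance-field target of the kind an auxiliary head could regress can be derived directly from a binary mask. The sketch below is a brute-force, pure-Python illustration on a tiny grid. The signed convention (positive inside, negative outside) is an assumption, since the abstract does not say whether GABI uses signed, unsigned, or truncated distances.

```python
import math


def distance_field(mask):
    """Signed Euclidean distance to the nearest pixel of the opposite class.

    Positive inside the object, negative outside. The mask must contain both
    classes. Brute force, so only suitable for tiny illustrative grids.
    """
    h, w = len(mask), len(mask[0])
    pixels = [(r, c) for r in range(h) for c in range(w)]
    field = [[0.0] * w for _ in range(h)]
    for r, c in pixels:
        d = min(
            math.hypot(r - rr, c - cc)
            for rr, cc in pixels
            if mask[rr][cc] != mask[r][c]
        )
        field[r][c] = d if mask[r][c] else -d
    return field


# 5x5 mask with a 3x3 object in the centre.
mask = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)] for r in range(5)]
field = distance_field(mask)
```

Unlike the binary mask, the field varies smoothly across the boundary, so every pixel near an edge carries a supervision signal about how far the edge is. In practice an optimized distance transform would replace the brute-force loop.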
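[147] MMDG-Bench: A Benchmark for Multimodal Domain Generalization cs.CV [PDF]
Qianshan Zhan, Qian Wang, Da Li, Xiao-Jun Zeng, Xiatian Zhu
TL;DR: This paper presents MMDG-Bench, a comprehensive benchmark for multimodal domain generalization. It introduces two foundational frameworks (D2M and M2D) and establishes unified evaluation protocols on tasks including video-audio-flow action recognition and RGB-Depth-IR face anti-spoofing. By instantiating ten baselines, it shows that structured combinations frequently outperform existing state-of-the-art methods, and it reports key insights on how modality stability relates to framework choice.
Details
Motivation: Current multimodal domain generalization research is largely confined to action recognition and lacks standardized evaluation protocols, and the systematic integration of multimodal learning with domain generalization remains under-explored.
Result: On MMDG-Bench, the ten baselines obtained by pairing a unified multimodal learning configuration with five domain generalization techniques under both D2M and M2D orderings frequently outperform existing state-of-the-art methods.
Insight: The main contribution is a principled benchmark and two structured frameworks (D2M/M2D) that provide a foundation for research on multimodal robustness. Key findings: domain generalization techniques bring consistent generalization gains; the best framework depends on how stable inter-modal relations are across domains; and stronger backbones gain more within the structured frameworks.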
Abstract: Multi-modal Domain Generalization (MMDG) seeks to leverage complementary modalities to enhance model robustness on unseen domains. Despite extensive progress in Multi-modal Learning (MML) and Domain Generalization (DG) as individual fields, their systematic integration remains under-explored. Current MMDG research is largely confined to action recognition and lacks standardized evaluation protocols. To address this, we introduce MMDG-Bench, a comprehensive benchmark featuring two foundational frameworks: DG then MML (D2M) and MML then DG (M2D). We provide unified experimental protocols across diverse tasks, including video-audio-flow action recognition and RGB-Depth-IR face anti-spoofing. By instantiating ten MMDG baselines through pairing a unified MML configuration with five DG techniques under both D2M and M2D orderings, we demonstrate that these structured combinations frequently outperform existing state-of-the-art methods, underscoring the necessity of a unified benchmarking effort. Our analysis yields three key insights: (1) Integrating DG techniques provides consistent generalization gains across various backbones, whereas non-DG methods are highly sensitive to backbone shifts; (2) The optimal framework choice depends on inter-modal stability: D2M excels when modal relations are stable across domains, while M2D is more robust to cross-domain relational variance; (3) Stronger backbones yield amplified performance dividends when integrated into our structured frameworks. MMDG-Bench provides a principled foundation and actionable design guidelines for future research in multi-modal robustness. Code is released at https://github.com/qszhan/MMDG-Bench.
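The count of ten baselines follows from crossing two orderings with five DG techniques. The short sketch below enumerates that grid. The technique names are placeholders, since the abstract does not list the five methods, and the `describe` helper is purely illustrative.

```python
from itertools import product

ORDERINGS = ["D2M", "M2D"]  # DG then MML, and MML then DG
DG_TECHNIQUES = ["dg_1", "dg_2", "dg_3", "dg_4", "dg_5"]  # placeholder names

baselines = [f"{order}+{tech}" for order, tech in product(ORDERINGS, DG_TECHNIQUES)]


def describe(baseline):
    """Spell out the stage order encoded in a baseline name."""
    order, tech = baseline.split("+")
    stages = [tech, "MML"] if order == "D2M" else ["MML", tech]
    return " -> ".join(stages)
```

For example, `describe("D2M+dg_1")` gives `dg_1 -> MML`, which reads as applying the DG technique first and multimodal learning second.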
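[148] Cohort-Scale Neural Atlases of Ultrasound Video cs.CV [PDF]
Zhuorui Zhang, Roger Pallarès-López, Xuan Wu, Praneeth Namburi, Brian W. Anthony
TL;DR: This paper proposes a cohort-scale neural atlas for ultrasound video that registers thousands of frames to a shared canonical coordinate system to reduce expert annotation cost. The atlas is trained jointly in DINOv3 feature space with per-video Generative Latent Optimization embeddings, learning coherent canonical templates and enabling accurate atlas-space annotation transfer.
Details
Motivation: Ultrasound is the most widely used real-time imaging modality in clinical practice, but per-frame video annotation is costly and expert labels are scarce, and image appearance varies with speckle, shadowing, attenuation, and operator-dependent probe pose. Clinically relevant information (for example cardiac motion or musculoskeletal kinematics) is often dynamic, and existing neural atlas methods mainly target single videos, small test-time image sets, or object-centric image collections, so they do not meet the needs of large-scale ultrasound video analysis.
Result: Across five cardiac and musculoskeletal datasets with point landmarks and segmentation masks, the method learns coherent canonical templates and achieves accurate atlas-space annotation transfer. On EchoNet-Dynamic and MSK-Bone, it supports single- and few-shot transfer with accuracy competitive with strong dense-correspondence baselines, while training in minutes on a single consumer GPU.
Insight: The novelty lies in building, reportedly for the first time, a cohort-scale neural atlas for ultrasound video that aligns large numbers of frames to a canonical space through jointly optimized embeddings. The method is interpretable: linear projections reveal structured cohort variation, image-decoder interpolation produces anatomically plausible intermediate frames, and test-time latent inversion reconstructs held-out frames through the atlas. This gives a practical, interpretable representation for reducing expert annotation burden in ultrasound video analysis.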
Abstract: Ultrasound is the most widely used real-time imaging modality in clinical practice, yet per-frame video annotation remains a major bottleneck: expert labels are scarce and costly, and image appearance varies with speckle, shadowing, attenuation, and operator-dependent probe pose. This is especially limiting because clinically relevant information is often dynamic, from left-ventricular motion in echocardiography to muscle and bone kinematics in musculoskeletal imaging. Population atlases can amortize annotation cost by registering observations to a shared canonical coordinate system, but existing neural atlas methods mainly target single videos, small test-time image sets, or object-centric image collections. We introduce a cohort-scale neural atlas for ultrasound video: a single canonical chart with per-video Generative Latent Optimization embeddings, trained jointly over thousands of frames in DINOv3 feature space. Across five cardiac and musculoskeletal datasets with point landmarks and segmentation masks, our method learns coherent canonical templates and enables accurate atlas-space annotation transfer. On EchoNet-Dynamic and MSK-Bone, it supports single- and few-shot transfer with accuracy competitive with strong dense-correspondence baselines, while training in minutes on a single consumer GPU. The learned embeddings are interpretable: linear projections reveal structured cohort variation, image-decoder interpolation produces anatomically plausible intermediate frames, and test-time latent inversion reconstructs held-out frames through the atlas. These results suggest that cohort-scale neural atlases offer a practical, interpretable representation for reducing expert annotation burden in ultrasound video analysis.
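[149] Bridging Topology and Deep Representation Learning: A TDA-ViT Fusion Model for Four-Class Brain Tumor Classification cs.CV [PDF]
Faisal Ahmed
TL;DR: This paper proposes TDA-ViT, a model that fuses Topological Data Analysis (TDA) with a pretrained Vision Transformer (ViT) for four-class brain tumor classification from MRI. TDA extracts topological descriptors of tumor regions (geometry, connectivity, shape), which are fused with the semantic features from the ViT into a unified, more discriminative representation.
Details
Motivation: ViTs in medical image analysis are good at learning global context but often miss the intrinsic structural and topological patterns of tumor regions. Addressing this limitation requires features that capture geometric and topological information to complement the ViT representation.
Result: On the BRISC2025 dataset (glioma, meningioma, pituitary tumor, and non-tumor), the fusion model achieves 99.10% accuracy, 99.27% precision, 99.15% recall, a 99.21% F1-score, and 99.98% AUC. It significantly outperforms TDA or ViT alone and surpasses state-of-the-art models including ResNet50, ResNet101, EfficientNetB2, and standalone ViTs.
Insight: The core contribution is fusing the geometric and topological features from TDA with the deep semantic representation of a ViT, showing that topological features supply valuable complementary information to deep learning. This offers medical image analysis an approach that combines the topology of the data with deep representation learning.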
Abstract: Accurate brain tumor classification from magnetic resonance imaging (MRI) is a key requirement for early diagnosis and clinical decision-making. Vision Transformers (ViTs) have shown strong performance in medical image analysis by learning global contextual representations, but they often fail to capture intrinsic structural and topological patterns present in tumor regions. To address this limitation, we propose a fusion framework that combines Topological Data Analysis (TDA) features with pretrained Vision Transformer representations for four-class brain tumor classification. In the proposed method, TDA is used to extract complementary topological descriptors that capture geometric structure, connectivity, and shape information from MRI images. In parallel, a pretrained ViT model learns high-level semantic representations from the same images. These two feature spaces are then fused to form a unified and more discriminative representation for classification. The model is evaluated on the BRISC2025 dataset, which contains four brain tumor classes: glioma, meningioma, pituitary tumor, and non-tumor cases. Experimental results show that combining topological and transformer-based features significantly improves performance compared to using either approach alone. The proposed TDA-ViT fusion model achieves an accuracy of 99.10%, precision of 99.27%, recall of 99.15%, F1-score of 99.21%, and an AUC of 99.98%. It also outperforms several state-of-the-art models, including ResNet50, ResNet101, EfficientNetB2, and standalone Vision Transformers. These results demonstrate that topological features provide valuable complementary information that enhances deep representation learning, leading to a robust and highly accurate framework for automated brain tumor classification.
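[150] hZACH-ViT: Curved Latent Geometry for Compact Vision Transformers in Low-Data Medical Imaging cs.CV [PDF]
Athanasios Angelakis
TL;DR: This paper proposes hZACH-ViT, a family of compact Vision Transformer variants with curved latent geometry for low-data medical imaging. Keeping the verified ZACH-ViT backbone and modifying only the final representation space and the prototype-based classifier head, it systematically compares Euclidean, hyperbolic, and spherical latent geometries on seven MedMNIST datasets.
Details
Motivation: Most compact Vision Transformers assume that Euclidean latent geometry is sufficient for organizing image representations. This work asks whether curved geometries (hyperbolic and spherical) yield better representations in low-data medical imaging.
Result: Under a few-shot protocol with 50 samples per class on seven MedMNIST datasets, the best non-Euclidean hZACH-ViT configuration improves the dataset-specific primary metric by 0.021 on average, with the largest gain on OCTMNIST (+0.055 MacroF1). Fixed low-curvature configurations (c = 0.1 or 0.2) retain positive gains on most datasets.
Insight: The contribution is to treat geometry and curvature as dataset-dependent model-selection variables rather than searching for a universally optimal manifold. The fixed low-curvature analysis indicates that the gains persist without exhaustive per-dataset tuning.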
Abstract: Compact Vision Transformers are attractive for medical imaging in low-data and resource-constrained settings, but most existing variants assume that Euclidean latent geometry is sufficient for organizing image representations. We introduce hZACH-ViT, a family of curved-geometry extensions of ZACH-ViT, a compact zero-token Vision Transformer that removes positional embeddings and the class token and relies on global average pooling over patch representations. To isolate the role of geometry, we preserve the verified ZACH-ViT backbone and modify only the final representation space and prototype-based classifier head, enabling a controlled comparison between Euclidean, hyperbolic, and spherical latent geometries. We evaluate Poincaré, Klein, and spherical hZACH-ViT heads on seven MedMNIST datasets under an identical few-shot protocol with 50 samples per class and five random seeds. The completed benchmark contains 770 training runs spanning seven datasets, three non-Euclidean geometries, seven curvature magnitudes, and a Euclidean baseline. Across all seven datasets, the best non-Euclidean hZACH-ViT configuration improves over Euclidean ZACH-ViT, with an average gain of +0.021 in the dataset-specific primary metric and the largest improvement on OCTMNIST (+0.055 MacroF1). Fixed low-curvature configurations retain positive gains on the majority of datasets, and low curvature values (c = 0.1 or 0.2) account for six of the seven dataset-level winners. Rather than identifying a universally optimal manifold, our results establish geometry and curvature as dataset-dependent model-selection variables, with fixed low-curvature analyses confirming that gains persist beyond exhaustive per-dataset tuning.
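To illustrate what a prototype-based head in hyperbolic geometry might compute, the sketch below classifies an embedding by its geodesic distance to class prototypes in a Poincaré ball of curvature -c. This is the standard closed-form distance, not the authors' implementation. The `classify` helper, the prototype values, and the default c = 0.1 (taken from the low-curvature range the abstract mentions) are assumptions.

```python
import math


def poincare_distance(x, y, c=0.1):
    """Geodesic distance in the Poincare ball of curvature -c.

    Points must lie inside the ball, i.e. c * ||x||^2 < 1.
    """
    def sq_norm(v):
        return sum(a * a for a in v)

    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1 - c * sq_norm(x)) * (1 - c * sq_norm(y))
    return math.acosh(1 + 2 * c * diff / denom) / math.sqrt(c)


def classify(embedding, prototypes, c=0.1):
    """Return the label of the nearest class prototype under the Poincare distance."""
    return min(prototypes, key=lambda label: poincare_distance(embedding, prototypes[label], c))


# Hypothetical 2-D prototypes for two classes.
prototypes = {"a": (0.1, 0.0), "b": (-0.5, 0.2)}
label = classify((0.2, 0.0), prototypes)
```

As c approaches 0 the distance approaches twice the Euclidean distance, which is one way to see why small curvatures behave like mild corrections to a Euclidean head. Separately, the 770-run total in the abstract is consistent with 7 datasets × (3 geometries × 7 curvatures + 1 Euclidean baseline) × 5 seeds.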
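[151] Single-Channel Tissue Segmentation via Cross-Modal Distillation from Foundation Models cs.CV | cs.LG [PDF]
Sakib Mohammad, Jarin Ritu, Md Sakhawat Hossain
TL;DR: This paper proposes a cross-modal knowledge distillation framework for single-channel tissue segmentation. Knowledge from a frozen foundation model that processes multi-channel input (for example SAM ViT-H) is distilled into a lightweight student network that uses only the nuclear channel (DAPI), giving strong segmentation when some imaging channels are missing.
Details
Motivation: Multiplexed fluorescence microscopy models require all channels at inference. Removing that requirement lets them be deployed where only a single channel (such as the nuclear channel) is available, which widens their applicability.
Result: On TissueNet, the SAM-distilled Swin-Tiny student reaches a Dice of 78.36 (±1.44), 13.05 points above the no-KD baseline, and recovers 87.9% of teacher performance with 23x fewer parameters. Distillation improves all four student architectures by about 12 Dice points, and cross-dataset evaluation on BBBC038 shows consistent gains.
Insight: The novelty is a cross-modal distillation objective that combines MSE-based probability matching, boundary-aware supervision, and learnable uncertainty weighting. It enables architecture-agnostic knowledge transfer, so that a lightweight single-channel model can exploit the rich semantic information of a multi-channel foundation model.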
Abstract: Multiplexed fluorescence microscopy improves tissue segmentation by providing complementary channels, including nuclear (DAPI) and membrane (E-cadherin), that together encode richer spatial context than single-channel imaging alone. However, multiplexed models require all channels at inference, limiting deployment where only a subset is available. This work proposes a cross-modal knowledge distillation framework that transfers semantic information from a frozen foundation model teacher processing multiplexed input to a lightweight student operating on the nuclear channel only. The distillation objective combines MSE-based probability matching, boundary-aware supervision, and learnable uncertainty weighting. SAM ViT-H and CellSAM are evaluated as teachers across four U-Net students: Swin-Tiny (27M), ResNet18 (11M), EfficientNet-B0 (5.3M), and MobileNetV3 (1.5M), on TissueNet and BBBC038. On TissueNet, the SAM-distilled Swin-Tiny student achieves Dice 78.36 (± 1.44), a 13.05-point improvement over the no-KD baseline (65.31 ± 1.35) and 87.9% recovery of teacher oracle performance (89.12 ± 1.21) at a 23x parameter reduction. KD consistently improves all four students by approximately 12 Dice points, confirming architecture-agnostic distillation. SAM ViT-H outperforms CellSAM as teacher across all settings. Cross-dataset evaluation on BBBC038 shows consistent gains without teacher retraining.
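The abstract names learnable uncertainty weighting but does not give its form. One common formulation, shown below as an assumption rather than the paper's objective, scales each loss term by exp(-s_i) and adds s_i as a regularizer, where s_i is a learnable log-variance. The same sketch also re-derives the reported recovery and improvement figures from the Dice scores in the abstract.

```python
import math


def combined_loss(losses, log_vars):
    """Uncertainty-weighted sum: sum_i exp(-s_i) * L_i + s_i.

    `log_vars` (the s_i) would be trainable parameters in a real model;
    here they are plain floats for illustration.
    """
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))


def recovery_percent(student_dice, teacher_dice):
    """Share of teacher performance recovered by the student, in percent."""
    return 100.0 * student_dice / teacher_dice


# Hypothetical per-term losses: probability matching, boundary, segmentation.
total = combined_loss([0.40, 0.25, 0.60], [0.0, 0.0, 0.0])

# Figures from the abstract: student 78.36, teacher 89.12, no-KD baseline 65.31.
recovered = round(recovery_percent(78.36, 89.12), 1)
gain = round(78.36 - 65.31, 2)
```

With all s_i at zero the objective reduces to a plain sum. During training, a larger s_i down-weights a noisy term while the additive s_i keeps the weights from collapsing to zero.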
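[152] Reason, Retrieve, Re-rank: A Zero-Shot Reasoning-Aware Framework for Composed Video Retrieval cs.CV | cs.LG [PDF]
Ali Alavi
TL;DR: This paper presents R3-CoVR (Reason, Retrieve, Re-rank), a zero-shot, reasoning-aware framework for composed video retrieval (CoVR). It is training-free and built entirely from frozen foundation models: a multimodal large language model (Qwen3-VL-8B) reasons about the after-effects implied by the edit instruction and produces a concise post-edit description; a contrastive video-text encoder (SigLIP-2) performs first-stage retrieval; and a constraint-aware re-ranking stage uses the same multimodal model as a judge to score candidate videos.
Details
Motivation: The work addresses the Reason-Aware CoVR (CoVR-R) challenge at the CVPR 2026 VidLLMs workshop, which requires strictly zero-shot retrieval of a target video given a reference video and a free-form textual modification.
Result: On the challenge test set, R3-CoVR attains 91.9% R@1 and 98.2% R@10. Matching the description length to the contrastive encoder's text window lifts R@1 from 67.5% to 72.7%, and the constraint-aware re-ranker, which reorders only the shortlist, lifts R@1 from 72.7% to 91.9%, the largest single gain.
Insight: The contribution is a fully training-free, modular pipeline for zero-shot, reasoning-aware retrieval. Its main observations are: (1) explicit after-effect reasoning and description generation by a multimodal LLM turn a complex composed-retrieval task into a standard retrieval problem; (2) a two-stage strategy (retrieve, then re-rank) with a constraint-aware re-ranker substantially improves final accuracy; and (3) matching the description length to the encoder's text window has a notable effect on performance. More broadly, decoupling reasoning, retrieval, and re-ranking and composing them from frozen foundation models is a useful pattern for efficient, interpretable zero-shot multimodal retrieval.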
Abstract: Composed Video Retrieval (CoVR) seeks the target video that results from applying a free-form textual modification to a reference video. We address the Reason-Aware CoVR (CoVR-R) challenge at the CVPR 2026 VidLLMs workshop, where retrieval is strictly zero-shot. We present R3-CoVR (Reason, Retrieve, Re-rank), a training-free pipeline built entirely from frozen foundation models. A multimodal large language model (Qwen3-VL-8B) reasons about the after-effects an edit implies – state transitions, action phases, scene, camera and tempo – and verbalises a concise post-edit description; a contrastive video–text encoder (SigLIP-2) embeds this description and the gallery for first-stage retrieval; finally a constraint-aware re-ranking stage uses the same multimodal model as a judge that scores each shortlisted candidate against the intended edited result. On the challenge test set, R3-CoVR attains 91.9% R@1 and 98.2% R@10. Two findings drive these results: (i) matching the description length to the contrastive encoder’s text window lifts R@1 from 67.5 to 72.7; and (ii) the constraint-aware re-ranker, which reorders only the shortlist, lifts R@1 from 72.7 to 91.9 – the single largest gain. We analyse the re-ranker’s behaviour, the retrieve/re-rank blend, and the shortlist depth, and we release a clean three-layer implementation.
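The retrieve-then-re-rank flow and the R@K metric can be sketched with precomputed scores. Everything below is illustrative: the similarity and judge scores are toy dictionaries standing in for the encoder and the multimodal judge, and the linear blend with weight `alpha` is an assumption about how the two signals might be combined, not the released implementation.

```python
def retrieve(first_stage, k):
    """Return ids of the k gallery videos with the highest first-stage similarity."""
    return sorted(first_stage, key=first_stage.get, reverse=True)[:k]


def rerank(shortlist, first_stage, judge, alpha=0.5):
    """Reorder only the shortlist by a blend of first-stage and judge scores."""
    blended = {v: alpha * first_stage[v] + (1 - alpha) * judge[v] for v in shortlist}
    return sorted(shortlist, key=blended.get, reverse=True)


def recall_at_k(rankings, targets, k):
    """Fraction of queries whose target appears in the top k of its ranking."""
    hits = sum(target in ranking[:k] for ranking, target in zip(rankings, targets))
    return hits / len(targets)


# Toy scores for one query whose target is "v2".
first_stage = {"v1": 0.9, "v2": 0.8, "v3": 0.1}
judge = {"v1": 0.2, "v2": 0.95}  # the judge only scores shortlisted candidates

shortlist = retrieve(first_stage, k=2)
reranked = rerank(shortlist, first_stage, judge)
before = recall_at_k([shortlist], ["v2"], k=1)
after = recall_at_k([reranked], ["v2"], k=1)
```

Because the re-ranker only permutes the shortlist, it can raise R@1 but can never recover a target the first stage missed, which is why shortlist depth matters.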
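[153] CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences cs.CV | cs.AI [PDF]
Fangzhou Lin, Peiran Li, Lingyu Xu, Wenjing Chen, Qianwen Ge
TL;DR: This paper presents CV-Arena, an open benchmark for evaluating instruction-guided computer vision problem solving. It contains 12K high-resolution real-image instruction pairs spanning 16 visual task types and introduces Active Elo, a human-AI collaborative preference protocol for evaluation at scale. An evaluation of 21 systems reveals shortcomings in instruction adherence, physical reasoning, and related abilities, and shows the promise of CV-Agent, a closed-loop reasoning agent.
Details
Motivation: Existing instruction-guided image editing benchmarks focus on narrow appearance edits and do not capture the diversity of real-image tasks in professional workflows. The paper defines a broader task, instructional computer vision problem solving, and builds an open benchmark at professional scale to evaluate it.
Result: A comprehensive evaluation of 21 systems (proprietary, open-source, and agentic) on CV-Arena shows persistent gaps for all models in instruction adherence, physical reasoning, structural control, and fine-grained detail preservation.
Insight: The novelty is a more complete task definition, instructional computer vision problem solving, together with the large, high-quality CV-Arena benchmark. The Active Elo human-AI evaluation protocol and the closed-loop CV-Agent model provide a new evaluation framework and a technical direction for professional-grade visual editing.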
Abstract: Instruction-guided image editing is becoming a general interface for visual work, yet existing benchmarks still focus largely on narrow appearance edits and do not fully capture the diversity of real-image tasks in professional workflows. Here, we define instructional computer vision problem solving as a broader formulation of image editing: given a real input image and a natural-language instruction, a system must produce an edited output that realizes the requested transformation while satisfying explicit preservation, geometric, physical, and usability constraints. We introduce CV-Arena, an open benchmark designed to evaluate this capability at professional scales. CV-Arena contains 12K high-resolution real-image instruction pairs spanning 16 instruction-based visual task types, constructed using CogRetriever, a dual-track retrieval-and-curation pipeline that combines targeted web search, agentic query refinement, verification, and traceability. To evaluate models at scale while preserving human fidelity, we propose Active Elo, a human-AI collaborative preference protocol that leverages CV-Judge, a logic-gated, multi-dimensional VLM evaluator, to reject clear failures and resolve high-confidence comparisons, and to route close, high-quality comparisons to expert raters. Mixed human and AI supervision is then aggregated through reliability-weighted Elo updates. Our comprehensive evaluation of 21 systems, including proprietary, open-source, and agentic models, on CV-Arena reveals persistent gaps in instruction adherence, physical reasoning, structural control, and fine-grained detail preservation. We further develop CV-Agent, a lightweight agentic model that combines planning, editing, and verification, and demonstrate that closed-loop reasoning is a promising direction for professional-grade instruction-following visual editing.
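A reliability-weighted Elo update can be read as a standard Elo step whose size is scaled by how much the source of the comparison is trusted. The sketch below makes that reading concrete. The multiplicative weighting, the K-factor of 32, and the example reliability values are assumptions, since the abstract does not specify how Active Elo weights human and AI judgments.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a, rating_b, outcome, reliability, k=32.0):
    """Apply one comparison.

    outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    reliability: weight in [0, 1] for the judge (hypothetical scaling).
    """
    delta = k * reliability * (outcome - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# An expert rater (weight 1.0) versus a less trusted automatic judge (weight 0.5).
expert_vote = update(1500.0, 1500.0, outcome=1.0, reliability=1.0)
auto_vote = update(1500.0, 1500.0, outcome=1.0, reliability=0.5)
```

The update stays zero-sum for any weight, so mixing human and automatic votes changes how fast ratings move but not their total.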
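[154] Reasmory: 3D Reconstruction as Explicit Memory for VLMs Spatial Reasoning cs.CV | cs.CL [PDF]
Jixuan He, Xueting Li, Chieh Hubert Lin, Ming-Hsuan Yang
TL;DR: This paper proposes Reasmory, a framework that casts spatial reasoning as the execution of structured programs over reconstructed, explicit 3D memory such as point clouds. It builds 3D memory, augments it with semantic object instances, and introduces a domain-specific language (DSL) that constrains how a vision-language model (VLM) queries and operates on that memory, improving the reliability of spatial reasoning over multi-view images and video.
Details
Motivation: Existing VLMs are unreliable on tasks that need precise spatial understanding (viewpoint reasoning, directional comparison, distance estimation) because spatial cues in multi-view images or monocular video are sparse and scattered, and hard to organize and exploit.
Result: On multi-view image and video spatial reasoning benchmarks, Reasmory achieves consistent gains of 6% to 18% over strong baselines including GPT-5-mini and Gemini-3-flash, indicating that accessing explicit 3D memory through constrained, validated operations is more effective than free-form tool calls.
Insight: The novelty is formalizing spatial reasoning as structured program execution: explicit 3D memory plus a DSL constrain the VLM's operations, and programs are parsed and validated before they run. This makes interaction with spatial memory more reliable than brittle free-form tool calls.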
Abstract: Vision-Language Models (VLMs) exhibit emerging spatial reasoning capabilities, yet they remain unreliable on tasks requiring precise spatial understanding, such as viewpoint reasoning, directional comparison, and distance estimation. In multi-view images and monocular videos, relevant spatial cues are often sparse and distributed across redundant observations, making them difficult to organize and exploit. Reconstruction-based Vision Foundation Models (VFMs) offer a natural way to aggregate such observations into explicit spatial memory, such as point clouds. However, simply exposing reconstruction models as free-form tools is brittle: VLMs may invoke tools incorrectly, skip required spatial transformations, or misuse intermediate results. We propose Reasmory, a framework that formulates spatial reasoning as structured program execution over reconstructed spatial memory. Reasmory constructs explicit 3D memory, augments it with semantically grounded 3D object instances, and introduces a lightweight Domain-Specific Language (DSL) that constrains how VLMs query objects and cameras, transform viewpoints, and render observations during reasoning. Generated programs are parsed and validated before execution, enabling more reliable interaction with spatial memory than unconstrained tool use. Experiments on multi-view image and video spatial reasoning benchmarks show consistent gains of 6–18% over strong baselines, including GPT-5-mini and Gemini-3-flash, indicating that explicit 3D memory is most useful when accessed through constrained, validated operations rather than free-form tool calls.
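The "parse and validate before execution" step can be illustrated with a toy DSL. The grammar (`var = op(arg, ...)`), the operation names, and their arities below are all hypothetical, because the abstract does not publish Reasmory's actual language. The sketch only shows how unknown operations, wrong argument counts, and undefined variables can be rejected before anything touches the 3D memory.

```python
import re

# Hypothetical operation table: name -> number of arguments.
OPS = {"find_object": 1, "camera_pose": 1, "transform_view": 2, "distance": 2, "render": 1}
LINE = re.compile(r"^(\w+)\s*=\s*(\w+)\((.*)\)$")


def validate(program):
    """Parse 'var = op(arg, ...)' lines and return (var, op, args) tuples.

    Raises ValueError on syntax errors, unknown ops, wrong arity, or undefined
    variables. String literals containing commas are not supported in this sketch.
    """
    defined, parsed = set(), []
    for n, line in enumerate(program.strip().splitlines(), 1):
        match = LINE.match(line.strip())
        if not match:
            raise ValueError(f"line {n}: syntax error")
        var, op, raw = match.groups()
        args = [a.strip() for a in raw.split(",")] if raw.strip() else []
        if op not in OPS:
            raise ValueError(f"line {n}: unknown operation {op}")
        if len(args) != OPS[op]:
            raise ValueError(f"line {n}: {op} expects {OPS[op]} argument(s)")
        for arg in args:
            is_literal = (arg.startswith('"') and arg.endswith('"')) or arg.isdigit()
            if not is_literal and arg not in defined:
                raise ValueError(f"line {n}: undefined variable {arg}")
        defined.add(var)
        parsed.append((var, op, args))
    return parsed


program = '''
chair = find_object("chair")
table = find_object("table")
d = distance(chair, table)
'''
steps = validate(program)
```

A program that calls `distance(chair, lamp)` without first defining `lamp` fails at validation time, which is the kind of misuse of intermediate results that free-form tool calling would only surface at run time, if at all.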
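[155] Flexible Control of 3D CT Generation via Text and Semantically-Defined Segmentation Prompts cs.CV [PDF]
Weicheng Dai, Chenyu Wang, Andy Li, Shantanu Ghosh, Kayhan Batmanghelich
TL;DR: This paper proposes a flexible multimodal framework for controllable 3D medical image generation that accepts radiology report text and semantically defined segmentation prompts as input. It builds a memory-efficient architecture on a modified diffusion transformer, handles long reports with gated attention, and lets users supply a segmentation mask for a specific anatomy or abnormality without full-organ annotations.
Details
Motivation: Existing ways of controlling 3D medical image generation each have a limitation: text prompts are flexible but give weak spatial control, while segmentation-based methods give precise spatial guidance but require full-organ annotations, which limits their practicality. The paper aims for a generative framework that offers flexible text control and also accepts precise but partial segmentation input.
Result: Experiments show state-of-the-art perceptual and semantic scores (for example, a 24% relative improvement in mean FID), high-resolution and anatomically consistent CT volumes, and better data efficiency when the generated data is used for augmentation. Radiologists' evaluation further confirms strong alignment between generated and real medical images.
Insight: The novelty is a flexible conditioning mechanism that pairs a text description with a semantically defined segmentation mask, allowing local spatial control over a specific anatomy or abnormality without full-organ annotation. Architecturally, a modified diffusion transformer jointly processes image and segmentation tokens, and gated attention handles long radiology reports, giving efficient multimodal conditional generation.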
Abstract: Generative models for volumetric medical images have found many applications in medical imaging, ranging from data augmentation to serving as priors for inverse problems. For these applications, generating high-resolution 3D images with strong controllability is essential but remains highly challenging. Existing approaches typically control generation either through radiology reports used as text prompts or through full image segmentation. While text-based prompting is flexible, it provides limited spatial control over the location, shape, and boundary of abnormalities. In contrast, segmentation-based methods receive precise spatial guidance but are restrictive in requiring full-organ annotations. In this work, we propose a flexible multimodal framework for controllable volumetric image generation that supports input from radiology reports and segmentation prompts (both optional). Our approach allows users to provide segmentation of a specific anatomy or abnormality without requiring full-organ annotations. The semantic meaning of the segmentation mask is specified through an accompanying text description, resulting in a highly flexible and scalable conditioning mechanism. We develop a memory-efficient architecture based on a modified diffusion transformer that jointly processes image and segmentation tokens. The model further incorporates gated attention to effectively attend to long radiology reports. Experiments demonstrate that our method achieves state-of-the-art perceptual and semantic scores (e.g., 24% relative improvement in mean FID), generates high-resolution anatomically consistent CT volumes, and improves data efficiency when used for data augmentation. Radiologists’ evaluation further confirms strong alignment between generated and real medical images.
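[156] Boundary-Protection W8A8 HiFloat8 Quantization for Large-Scale Text-to-Video Diffusion Transformers cs.CV [PDF]
Yiming Zhao
TL;DR: This paper presents a post-training quantization (PTQ) approach for Wan2.1-T2V-14B, a 14B-parameter text-to-video diffusion transformer, targeting the W8A8 HiFloat8 format on Ascend 910B NPUs. The central challenge is that activation distributions are heterogeneous across transformer blocks in video DiT models, with boundary blocks differing markedly from middle blocks. Based on a systematic per-block activation analysis, the author designs a boundary-protection strategy that keeps the first two and last three blocks in BF16 and quantizes the remaining 35 blocks to W8A8 HiF8. The method matches or marginally exceeds the BF16 baseline on all five VBench dimensions evaluated, indicating no measurable accuracy loss within the 5-prompt evaluation set. The paper also investigates quantization-aware training (QAT) as a complementary fine-tuning stage and analyzes the conditions under which it fails to outperform plain PTQ on single-card hardware.
Details
Motivation: When quantizing large text-to-video diffusion transformers such as the 14B-parameter Wan2.1-T2V-14B, the main difficulty is the heterogeneity of activation distributions across transformer blocks. Boundary blocks (the first and last few) have fundamentally different statistics from middle blocks, so uniform quantization works poorly.
Result: The proposed PTQ method matches or marginally exceeds the BF16 baseline on all five VBench dimensions evaluated, indicating no measurable accuracy loss within the 5-prompt evaluation set. An ablation confirms that the full boundary-protection configuration (protecting the first two and last three blocks) yields the highest average VBench score, validating the data-driven block selection.
Insight: The contribution is a systematic per-block activation analysis that exposes the statistical difference between boundary and middle blocks in video DiT models, and a boundary-protection quantization strategy built on it that handles the heterogeneous activations. More broadly, this data-driven, block-level precision assignment suggests a general approach to quantizing large transformers: assign precision non-uniformly according to structural differences inside the model instead of applying one scheme everywhere.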
Abstract: We present a post-training quantization (PTQ) approach for Wan2.1-T2V-14B, a 14-billion-parameter text-to-video diffusion transformer, targeting the W8A8 HiFloat8 (HiF8) format on Ascend 910B NPUs. A central challenge in quantizing video DiT models is the heterogeneous activation distribution across transformer blocks: boundary blocks (the first and last few blocks) exhibit fundamentally different statistical properties from middle blocks, making uniform quantization ineffective. We conduct a systematic per-block activation analysis across all 40 WanAttentionBlocks and use the findings to motivate a boundary-protection strategy that retains the first two and last three blocks in BF16 while quantizing the remaining 35 blocks with W8A8 HiF8. The proposed PTQ method matches or marginally exceeds the BF16 baseline on all five VBench dimensions evaluated, indicating no measurable accuracy loss within the 5-prompt evaluation set. An ablation study over four protection configurations confirms that full boundary protection yields the highest average VBench score, validating the data-driven block selection. We additionally investigate quantization-aware training (QAT) as a complementary fine-tuning stage and analyze the conditions under which it fails to outperform plain PTQ on single-card hardware.
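The boundary-protection rule amounts to a per-block precision plan. The sketch below reproduces the configuration stated in the abstract (40 blocks, first two and last three kept in BF16). The function name and the string labels are illustrative, and no real quantization kernel or Ascend API is invoked.

```python
def precision_plan(num_blocks=40, protect_first=2, protect_last=3):
    """Return one precision label per transformer block.

    Boundary blocks stay in BF16; all others are marked for W8A8 HiF8.
    """
    return [
        "bf16" if i < protect_first or i >= num_blocks - protect_last else "w8a8_hif8"
        for i in range(num_blocks)
    ]


plan = precision_plan()
num_quantized = plan.count("w8a8_hif8")
```

Expressing the plan as data makes the ablation over protection configurations a matter of changing two integers, for example `precision_plan(protect_first=0, protect_last=0)` for uniform quantization.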
[157] An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation cs.CV | cs.AIPDF
Bingyu Li, Da Zhang, Tao Huo, Zhiyuan Zhao, Junyu Gao
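TL;DR: This paper introduces Multi-temporal Referring Segmentation (MTRS), a new task that segments language-described change regions from multi-temporal images. The authors build MTRefSeg-21K, the first MTRS benchmark, and propose MTRefSeg-R1, a two-stage training framework that improves performance by explicitly modeling temporal visual differences and aligning them with language instructions.
Details
Motivation: Large vision-language models (LVLMs) show strong visual understanding and language-guided grounding, but their capacity for multi-temporal visual reasoning remains underexplored. To close this gap, MTRS jointly requires temporal correspondence reasoning, language grounding, and pixel-level mask prediction, extending conventional referring segmentation and change detection.
Result: Extensive experiments on the MTRefSeg-21K benchmark show that MTRefSeg-R1 achieves strong and often superior performance compared with existing LVLM baselines, demonstrating both the difficulty and the potential of MTRS.
Insight: The contributions are the new MTRS task with its first large-scale benchmark, and a two-stage training framework that first learns general temporal-change perception and is then fine-tuned for fine-grained language-guided temporal localization, explicitly modeling cross-temporal visual differences and aligning language with changes. Viewed objectively, combining temporal change detection with language-referred segmentation is a promising direction, and the data construction pipeline and model design are useful references.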
Abstract: Large Vision-Language Models (LVLMs) have shown strong visual understanding and language-guided grounding abilities, yet their capacity for multi-temporal visual reasoning remains underexplored. To bridge this gap, we introduce Multi-temporal Referring Segmentation (MTRS), a new task that aims to segment language-described temporal changes from multi-temporal images. MTRS extends conventional referring segmentation and change detection by jointly requiring temporal correspondence reasoning, language grounding, and pixel-level mask prediction. We propose CRAFT-Agent, an automated data construction pipeline with human auditing, and build MTRefSeg-21K, the first MTRS benchmark, containing 21K high-quality multi-temporal image-text-mask triplets across diverse scenes, viewpoints, and domains. Benchmarking a broad set of VLM- and LVLM-based models reveals that direct inference performs poorly, while task-specific fine-tuning remains limited. To address this, we propose MTRefSeg-R1, a change-aware LVLM framework trained with a two-stage strategy. It first learns general temporal-change perception from 20K vision-only bi-temporal samples, and is then fine-tuned on MTRefSeg-21K for fine-grained language-guided temporal localization. MTRefSeg-R1 explicitly models cross-temporal visual differences, aligns language instructions with temporal variations, and predicts referred change masks. Extensive experiments show that MTRefSeg-R1 achieves strong and often superior performance compared with existing LVLM baselines, demonstrating the challenge and potential of MTRS.
[158] SWARD: Stochastic Window-Attention-Based Relational Distillation for Cross-Architectural Semantic Segmentation cs.CVPDF
Aditya Makineni, Qing Tian
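TL;DR: This paper proposes SWARD, a knowledge distillation framework that addresses the architectural mismatch between large transformer-based vision foundation models (teachers) and lightweight convolutional networks (students) in semantic segmentation. Through a multi-scale windowed attention distillation module and a prototype discriminative regularization loss, it transfers the teacher's global context modeling and strengthens the discriminative structure of the student's features.
Details
Motivation: Large vision foundation models excel at dense prediction but are too large to deploy in resource-constrained settings, so knowledge distillation is needed to transfer their capabilities to lightweight student networks. Existing distillation methods, however, typically assume homogeneous teacher and student architectures. Direct feature mimicry cannot bridge the representational gap between the global modeling of transformers and the local bias of CNNs, and it neglects the structured spatial dependencies and discriminative organization that semantic segmentation requires.
Result: Experiments on different vision applications, namely urban scene parsing and medical image segmentation, show that SWARD achieves state-of-the-art performance, surpassing existing distillation methods.
Insight: The contributions are (1) a multi-scale windowed attention distillation module that aligns teacher-student attention-based relations within randomly shifted window partitions, removing window-boundary bias and capturing multi-scale spatial dependencies; and (2) a prototype discriminative regularization loss that shapes the student's feature distribution by enforcing inter-class separation and intra-class compactness. Together, these mechanisms address the representational gap in cross-architecture distillation.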
Abstract: Large-scale vision foundation models have driven substantial gains on dense prediction tasks such as semantic segmentation, but their size makes deployment impractical in resource-constrained settings, motivating knowledge distillation as a means of transferring their capabilities to lightweight student networks. However, modern foundation teachers are predominantly transformer-based models that encode global context, whereas efficient students are typically convolutional networks with locally biased receptive fields. Existing distillation methods largely assume architectural homogeneity and rely on direct feature mimicry, which fails to bridge this representational gap and neglects the structured spatial dependencies and discriminative organization required for accurate semantic segmentation. In this paper, we propose SWARD, a knowledge distillation framework that addresses this gap through two complementary mechanisms. First, we introduce a Multi-Scale Windowed Attention Distillation (MWAD) module that aligns teacher-student attention-based relations within stochastically shifted window partitions whose offsets are randomly resampled at every training iteration. This removes window boundary bias and, combined with the multi-scale design, captures both short- and long-range spatial dependencies. Second, we introduce Prototype Discriminative Regularization (PDR), a loss that helps shape the student’s feature distribution by enforcing inter-class separation and intra-class compactness, further sharpening the discriminative structure beyond what feature mimicry alone can produce under the student’s reduced capacity. Experiments across different vision applications (i.e., urban scene parsing and medical image segmentation) show that SWARD achieves state-of-the-art performance.
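The stochastically shifted window partition mentioned in the abstract can be sketched as follows. This is only an illustration of the partitioning idea, not the MWAD module itself: the cyclic (wrap-around) shift, the function name `shifted_window_ids`, and the 8x8 grid are assumptions, and no attention or distillation loss is computed.

```python
import random
from collections import Counter


def shifted_window_ids(height, width, window, rng):
    """Assign every pixel a window id after a random cyclic shift."""
    offset_y = rng.randrange(window)
    offset_x = rng.randrange(window)
    windows_per_row = -(-width // window)  # ceiling division
    ids = []
    for y in range(height):
        row_band = ((y + offset_y) % height) // window
        ids.append([
            row_band * windows_per_row + ((x + offset_x) % width) // window
            for x in range(width)
        ])
    return (offset_y, offset_x), ids


rng = random.Random(0)
for step in range(3):  # offsets are resampled at every iteration
    offset, ids = shifted_window_ids(8, 8, 4, rng)
    sizes = Counter(i for row in ids for i in row)
    print("step", step, "offset", offset, "window sizes", sorted(sizes.values()))
```

Because the offsets change between iterations, a given pair of neighbouring pixels is not always separated by the same window border. Calling the function with several `window` values would give a multi-scale variant.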
[159] Data Collection for Training Quality-Control AI in Carpet Manufacturing cs.CV | cs.AIPDF
Akbar Erkinov
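TL;DR: This paper presents a design proposal for an in-line machine-vision system for quality control in carpet manufacturing. The system inspects the carpet for defects in real time and systematically collects and labels defect images so that increasingly capable quality-control models can be trained over time. Grounded in a concrete industrial setting (a Six Sigma DMAIC project), the proposal details the imaging subsystem, a defect taxonomy, and a staged modeling strategy that progresses from unsupervised anomaly detection to supervised detection.
Details
Motivation: Manual visual inspection in carpet production is slow, subjective, and inconsistent. The proposal also responds to a production bottleneck anticipated after the installation of additional looms and to the financial risk associated with a high defect rate.
Result: The proposal ties detection performance to the DMAIC objectives, showing how fewer escaped defects improve process quality and the sigma level. Its unsupervised detection approach follows the paradigm of the carpet category in the MVTec Anomaly Detection benchmark.
Insight: The novelty is treating data collection as a first-class engineering objective rather than an afterthought, and providing an end-to-end, deployable blueprint in which a human-in-the-loop annotation flywheel matures the models from unsupervised to supervised.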
Abstract: Visual inspection remains the dominant quality-control practice in woven and tufted carpet production, yet it is slow, subjective, and inconsistent at the line speeds and widths of modern looms. We present a design proposal for an in-line machine-vision system whose primary purpose is twofold: to inspect the carpet web in real time and, equally importantly, to systematically collect and label images of defect patterns so that increasingly capable quality-control models can be trained over the life of the installation. The proposal is grounded in a concrete industrial setting: a Six Sigma (DMAIC) project at a woven-carpet production facility that anticipated a production bottleneck following the installation of additional weaving machines, with a substantial baseline defect rate and significant financial exposure associated with quality failures. We describe an imaging subsystem based on synchronized line-scan cameras with combined bright-field and grazing illumination, derive the resolution and throughput requirements needed to resolve fine structural defects across a multi-metre web, and define a carpet-specific defect taxonomy. We then lay out a staged modelling strategy that begins with unsupervised anomaly detection trained on defect-free material, following the paradigm exemplified by the carpet category of the MVTec Anomaly Detection benchmark, and matures through a human-in-the-loop annotation flywheel into supervised detection and segmentation models. Finally, we connect detection performance to the DMAIC objectives, showing how reductions in escaped defects translate into improved process quality and process sigma levels. The contribution is an end-to-end, deployable blueprint that treats data collection as a first-class engineering objective rather than an afterthought.
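The abstract mentions deriving resolution and throughput requirements for line-scan cameras but gives no figures. The sketch below shows the standard arithmetic such a derivation usually involves, assuming square pixels on the web. Every number (a 4 m web, 0.5 mm minimum defect, 2 pixels per defect, 100 mm/s web speed) and the function name `line_scan_requirements` are hypothetical placeholders, not values from the paper.

```python
import math


def line_scan_requirements(web_width_mm, min_defect_mm, pixels_per_defect,
                           web_speed_mm_s):
    """Return (pixel size in mm, pixels across the web, line rate in Hz)."""
    # Object-space pixel size needed so the smallest defect spans enough pixels.
    pixel_size_mm = min_defect_mm / pixels_per_defect
    # Total pixels across the web, possibly split over several cameras.
    pixels_across = math.ceil(web_width_mm / pixel_size_mm)
    # One scan line per pixel of web travel keeps the pixels square.
    line_rate_hz = web_speed_mm_s / pixel_size_mm
    return pixel_size_mm, pixels_across, line_rate_hz


# Hypothetical inputs for illustration only.
pixel_size, pixels_across, line_rate = line_scan_requirements(4000, 0.5, 2, 100)
print(pixel_size, "mm/pixel;", pixels_across, "pixels across;", line_rate, "lines/s")
```

Halving the minimum defect size doubles both the pixel count across the web and the required line rate, which is why the smallest defect in the taxonomy tends to drive camera selection.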
[160] ProductWebGen: Benchmarking Multimodal Product Webpage Generation cs.CV | cs.AIPDF
Zhihong Liu, Siqi Kou, Zheng Li, Ye Ma, Quan Chen
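TL;DR: This paper introduces ProductWebGen, a benchmark for systematically evaluating how well multimodal generative models can generate product showcase webpages from a product image and instructions. The benchmark contains 500 test samples covering 13 product categories and requires models to produce renderable HTML code containing multiple visually consistent images. The paper compares two workflows, editing-based and unified-model (UM) based, and builds ProductWebGen-1k, a supervised fine-tuning dataset of 1,000 groups, verifying its effectiveness on the open-source UM BAGEL.
Details
Motivation: Generating product showcase webpages from a product image together with layout and visual content instructions has significant practical value in marketing, advertising, and e-commerce. The task demands strict visual consistency across product displays and high-fidelity instruction following to produce renderable HTML code. These controllability and instruction-following requirements align closely with the core features of advanced multimodal generative models.
Result: Empirical results show that editing-based approaches lead in webpage instruction following and content appeal, while UM-based approaches may show more advantages in fulfilling visual content instructions. The effectiveness of the ProductWebGen-1k supervised fine-tuning dataset is verified on the open-source unified model BAGEL.
Insight: The novelty is a systematically constructed benchmark for multimodal product webpage generation and a comparison of two generation workflows (editing-based and UM-based) that reveals their respective strengths on different evaluation dimensions. The high-quality supervised fine-tuning dataset also offers a practical path to improving open-source models.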
Abstract: Crafting a product display webpage from a source product image, along with layout and visual content instructions, holds significant practical value for domains such as marketing, advertising, and E-commerce. Intuitively, this task demands strict visual consistency across product displays and high-fidelity instruction following to jointly generate renderable HTML code. These requirements on controllability and instruction-following are closely aligned with the core features of advanced multimodal generative models, such as image editing models and unified models. To this end, this paper introduces ProductWebGen to systematically benchmark the product webpage generation capacities of these models. We organize ProductWebGen with 500 test samples covering 13 product categories; each sample consists of a source image, a visual content instruction, and a webpage instruction. The task is to generate a product showcase webpage including multiple consistent images in accordance with the source image and instructions. Given the mixed-modality input-output nature of the task, we design and systematically compare two workflows for evaluation – one uses large language models and image editing models to separately generate HTML code and images (editing-based), while the other relies on a single UM to generate both, with image generation conditioned on the preceding multimodal context (UM-based). Empirical results show that editing-based approaches achieve leading results in webpage instruction following and content appeal, while UM-based ones may display more advantages in fulfilling visual content instructions. We also construct a supervised fine-tuning dataset, ProductWebGen-1k, with 1,000 groups of real product images and LLM-generated HTML code. We verify its effectiveness on the open-source UM BAGEL. The data and code are available at https://github.com/SJTU-DENG-Lab/ProductWebGen.
[161] Ask4VG: Risk-Aware Question Selection for Reducing Prior-Driven Answers in Medical VQA cs.CVPDF
Xiaorong Zhu, Qiang Li, Zibo Xu, Weijie Wang, Weizhi Nie
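TL;DR: This paper proposes Ask4VG, a label-free, risk-aware question selection framework for medical visual question answering. It estimates question-induced hallucination risk through counterfactual visual probing and uses a learned risk estimator to rerank candidate question rewrites, selecting intent-preserving questions that are more sensitive to missing or mismatched visual evidence and thereby lowering hallucination risk before the final answer is generated.
Details
Motivation: Many medical VQA questions are generic, template-like, or highly similar, which encourages models to learn question-answer shortcuts instead of image-dependent reasoning and so raises the risk of hallucinated answers.
Result: On VQA-RAD with Qwen2-VL-2B-Instruct, predicted-risk reranking reduces held-out risk from 0.658 to 0.623 and improves exact accuracy from 0.337 to 0.356. An external check on PMC-VQA shows the same direction of risk reduction with a small accuracy gain.
Insight: The novelty is a method that needs no human annotation and quantifies question-induced hallucination risk through counterfactual visual probing (with original, perturbed, blank, and mismatched images), then uses the risk estimate for question selection as a complement to response-level hallucination mitigation.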
Abstract: Medical visual question answering requires models to ground their responses in image evidence, because visually unsupported answers can mislead downstream interpretation. However, many medical VQA questions are generic, template-like, or highly similar in form, which can encourage models to learn question-answer shortcuts instead of image-dependent reasoning and thereby increase the risk of hallucinated responses. We propose Ask4VG, a label-free pilot framework for risk-aware question selection. Ask4VG estimates question-induced hallucination risk through counterfactual visual probing: the same question is asked under the original image, a perturbed image, a blank image, and a mismatched image, and the resulting answer relations are converted into weak supervision for a counterfactual risk estimator. The learned estimator then reranks candidate question rewrites to favor intent-preserving questions that are less invariant to missing or mismatched visual evidence before final answer generation. On VQA-RAD with Qwen2-VL-2B-Instruct, prompt-only rewriting increases counterfactual risk, whereas predicted-risk reranking reduces held-out risk from 0.658 to 0.623 and improves exact accuracy from 0.337 to 0.356. A 300-sample PMC-VQA external check shows the same direction of risk reduction with a small accuracy gain. These results suggest that question selection is a promising complement to response-level hallucination mitigation for reliable medical VQA.
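The counterfactual probing idea above can be illustrated with a toy score. This sketch is an assumption-laden simplification: it scores a question by how often its answer stays the same when the image is blank or mismatched, and it picks the rewrite with the lowest score directly. The paper instead trains a risk estimator from such answer relations and also uses a perturbed image; the scoring rule, the names `invariance_risk` and `pick_rewrite`, and the example answers are all made up for illustration.

```python
def invariance_risk(answers):
    """Fraction of counterfactual probes whose answer matches the original."""
    reference = answers["original"].strip().lower()
    probes = ("blank", "mismatched")
    unchanged = sum(answers[name].strip().lower() == reference for name in probes)
    return unchanged / len(probes)


def pick_rewrite(candidates):
    """Return the candidate question with the lowest invariance risk."""
    return min(candidates, key=lambda c: invariance_risk(c["answers"]))


# Toy, hand-written answers; not taken from any dataset or model.
candidates = [
    {"question": "Is there an abnormality?",
     "answers": {"original": "yes", "blank": "yes", "mismatched": "yes"}},
    {"question": "Is there an abnormality in the left lung field?",
     "answers": {"original": "yes", "blank": "no", "mismatched": "no"}},
]
best = pick_rewrite(candidates)
print(best["question"], invariance_risk(best["answers"]))
```

A question whose answer never changes when the image is removed is likely being answered from priors, which is the behaviour the reranking is meant to penalize.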
[162] TextFake: Benchmarking AI-Generated Image Detection on Text-Rich Images cs.CVPDF
Yuning Zhang, Changtao Miao, Mingyu Liao, Tingyu Liu, Xinghao Wang
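TL;DR: This paper introduces TextFake, a benchmark for evaluating AI-generated image detectors on text-rich images. Existing detectors perform well on natural images, but their performance drops sharply on text-rich AI-generated images such as fabricated screenshots, documents, and news pages, revealing a systematic detection gap.
Details
Motivation: Existing AI-generated image detectors perform well on natural-image benchmarks, but their behavior on text-rich forgeries (such as fake screenshots, documents, and news pages) is untested, even though such images are prevalent in misinformation. A dedicated benchmark is therefore needed to evaluate detectors and expose their limitations.
Result: Zero-shot evaluation of 14 specialized detectors and 3 frontier VLM APIs on TextFake shows that no method exceeds 80% accuracy, and some drop by more than 60% relative to natural-image benchmarks, indicating a large performance gap.
Insight: The novelty is the first large-scale, multilingual, multi-topic benchmark for detecting text-rich AI-generated images, together with diagnostic evaluations that identify three key failure modes: the Text Density Curse, Cloaking via Rendering Fidelity, and Threshold Collapse. These point to concrete directions for improving detectors.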
Abstract: Recent AI-generated image (AIGI) detectors perform well on natural-image benchmarks, but their behavior on text-rich forgeries, such as fabricated screenshots, documents, and news pages prevalent in misinformation, remains untested. We introduce TextFake, a 20,000-image benchmark for text-rich AIGI detection spanning 28 languages, 4 topic categories, and 2 scene modalities. Fake images are synthesized via a four-stage pipeline that annotates real images along three controlled dimensions and generates counterparts through distribution-aligned structured prompting, ruling out covariate shortcuts. Zero-shot evaluation of 14 specialized detectors and 3 frontier VLM APIs reveals a large systematic gap: no method exceeds 80% accuracy, with some dropping over 60% from natural-image benchmarks. Diagnostic evaluations identify three failure modes: the Text Density Curse, where dense glyphs overwhelm low-level detectors; Cloaking via Rendering Fidelity, where stronger text rendering suppresses generative artifacts; and Threshold Collapse, where routine perturbations drive detectors toward chance-level performance.
[163] 3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code cs.CV | cs.AI | cs.GR | cs.LGPDF
Yipeng Gao, Lei Shu, Genzhi Ye, Xi Xiong, Ameesh Makadia
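TL;DR: This paper proposes 3DCodeBench, a systematic benchmark for evaluating vision-language model (VLM) agents on procedural 3D generation through code in 3D modeling software. It comprises a large-scale multimodal prompt dataset, an evaluation protocol, and 3DCodeArena, a ranking platform based on human preferences. An evaluation of 12 advanced VLMs finds that failures mostly arise from API mismatches, while test-time scaling (such as larger thinking budgets and multi-turn refinement) improves performance.
Details
Motivation: Procedural 3D modeling through code yields deterministic, engine-ready, and precisely editable assets, but authoring such content demands deep expertise in 3D software APIs, parametric design, and code-level geometric reasoning. The paper evaluates whether VLM agents can effectively stand in for human experts by translating text and image references into procedural code for 3D modeling software.
Result: Twelve advanced VLMs are evaluated on 3DCodeBench. Failures mostly arise from API mismatches, and successful renders still suffer from disconnected or floating 3D geometric components. Test-time scaling, such as higher thinking budgets and multi-turn refinement, improves performance overall. The evaluation combines automated metrics with the human-preference-based 3DCodeArena ranking platform.
Insight: The novelty is the first systematic benchmark focused on evaluating the procedural 3D modeling ability of VLM agents (3DCodeBench), together with a human-preference-based evaluation platform (3DCodeArena) that compensates for the limits of automated metrics. Viewed objectively, its core contribution is highlighting the importance of high-quality procedural coding data and of a robust execution environment with high-fidelity feedback for advancing VLMs in this area, and it provides a foundational toolkit for follow-up research.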
Abstract: Procedural 3D modeling through code is emerging as a versatile paradigm, offering deterministic, engine-ready, and precisely editable assets that neural 3D generators inherently lack. Authoring such procedural content, however, demands deep expertise in 3D software APIs, parametric design, and code-level geometric reasoning. In this paper, we propose 3DCodeBench, a systematic benchmark for evaluating vision-language model (VLM) agents for procedural 3D generation in 3D modeling software. Specifically, 3DCodeBench evaluates how effectively 12 advanced VLMs can serve as procedural 3D modelers by translating text and image references into procedural code for 3D modeling software. Recognizing that automated metrics may not fully capture the perceptual quality of 3D shapes, we build 3DCodeArena, a ranking platform based on pairwise human preferences over generated 3D outputs. From extensive evaluations and results, we observe that: (1) Failures mostly arise from API mismatches, while successful renders still suffer from disconnected or floating 3D geometric components. (2) Test-time scaling, such as higher thinking budgets and multi-turn refinement, improves performance overall. Our findings highlight a critical need for high-quality procedural coding data to advance commercial VLMs. Furthermore, effective procedural 3D modeling requires a robust execution environment that provides high-fidelity feedback for iterative refinement. We release 3DCodeBench, including the curated large-scale dataset of multimodal (text/image) prompts, procedural code, 3D object triplets, evaluation protocol, and the public 3DCodeArena platform as a foundational toolkit for exploring VLM-based procedural 3D modelers.
[164] A Multiscale Network with Supervised Contrastive Learning for Real-Time Facial Emotion Recognition cs.CVPDF
Rejoy Chakraborty, Archisman Adhikary, Chayan Halder, Payel Rakshit, Sanchita Ghosh
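TL;DR: This paper presents a deep learning-based real-time facial emotion recognition system that uses a multiscale network and supervised contrastive learning to model continuous changes in facial expression and detect changes in an individual's emotional state in video.
Details
Motivation: Real-time facial emotion recognition in video is challenging because expressions vary widely across individuals and emotional states change continuously, which is hard to model computationally. The system aims to provide supporting insight for applications such as psychological counseling.
Result: After training on a standard dataset, the system reportedly achieves very satisfactory results, but the abstract names no benchmark and does not say whether it reaches state-of-the-art performance.
Insight: The novelty is combining a multiscale network with supervised contrastive learning to capture subtle, continuous changes in facial expression, which helps model emotional dynamics more accurately and could transfer to other temporal expression analysis tasks.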
Abstract: Real-time emotion recognition from facial expressions is a challenging task, particularly in video-based scenarios where multiple emotional states may occur over time. The difficulty increases further due to the fact that each emotional state is associated with facial expressions that vary significantly across individuals. The change of facial expressions portraying emotional state is not discrete, but rather continuous, which is very challenging to represent through computational aids. A system with the ability to detect variations in facial expressions can have a significant impact on determining the emotional state of an individual. Such a system can be very beneficial for psychologists during counseling by providing additional insights into the emotional state of a subject. In this paper, a deep learning-based system is presented to detect emotional changes in real-time video of a person by modeling the change in facial expressions. The current study is conducted on a standard dataset for training of the deep learning system and the system has provided very satisfactory outcomes in this respect.
[165] Dual-Route Top-K Retrieval with 1v1 VLM Reranking for the CoVR-R cs.CVPDF
Yuyang Sun, Yongliang Wu, Xingyu Zhu, Yuxia Chen, Zhenxiang Jiang
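TL;DR: For the CoVR-R challenge, this paper proposes dual-route top-k retrieval with 1v1 vision-language model (VLM) reranking. It decomposes composed video retrieval into two coupled problems: obtaining a sufficiently complete top-k candidate set, and then safely deciding whether any candidate should replace the strong current top-1. The system improves the text seed, adds a visual route, and finally uses a VLM for conservative 1v1 comparisons.
Details
Motivation: The core problem in composed video retrieval (CoVR-R) is balancing the completeness of the recalled candidate set against the accuracy of the final top-1 result. The paper decouples retrieval into a recall stage and a selection stage.
Result: On the hidden test split, the final system reaches 95.28 R@1, 97.47 R@5, 98.48 R@10, and 99.66 R@50.
Insight: The main contribution is decoupling retrieval into recall (dual-route fusion for candidate-set completeness) and selection (conservative 1v1 VLM reranking for top-1 accuracy), and showing that this decoupling helps CoVR-R more than broad text reranking or direct multi-candidate VLM classification.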
Abstract: We describe Dual-Route Top-K Retrieval with 1v1 VLM Reranking for the CoVR-R challenge. The method treats composed video retrieval as two coupled problems: finding a sufficiently complete top-k candidate set, and then safely deciding whether any candidate should replace a strong current top-1. We first improve the reasoning/text seed with a VLM slot selector over existing candidates, without introducing DFN visual retrieval. We then add a visual route from contact-sheet embeddings using DFN-H/DFN-L. The routes are merged into a top-10 candidate set, after which a VLM final reranker performs conservative 1v1 comparisons between the current top-1 and each challenger. On the hidden test split, the final system reaches 95.28 R@1, 97.47 R@5, 98.48 R@10, and 99.66 R@50. The main lesson is that CoVR-R benefits more from recall-selection decoupling than from broad text reranking or direct multi-candidate VLM classification.
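The recall-then-select structure described above can be sketched in a few lines. The interleaved merge, the rule that a winning challenger becomes the new top-1, and the names `merge_routes`, `conservative_rerank`, and `prefers_challenger` are assumptions made for illustration. In the actual system the pairwise judgment comes from a VLM, which is replaced here by a stand-in callable.

```python
from itertools import zip_longest


def merge_routes(text_route, visual_route, k=10):
    """Interleave two ranked lists, drop duplicates, keep the first k."""
    merged, seen = [], set()
    for pair in zip_longest(text_route, visual_route):
        for candidate in pair:
            if candidate is not None and candidate not in seen:
                seen.add(candidate)
                merged.append(candidate)
    return merged[:k]


def conservative_rerank(candidates, prefers_challenger):
    """Replace the top-1 only when a 1v1 comparison favours the challenger."""
    top1 = candidates[0]
    for challenger in candidates[1:]:
        if prefers_challenger(top1, challenger):
            top1 = challenger
    # The remaining candidates keep their recalled order.
    return [top1] + [c for c in candidates if c != top1]


merged = merge_routes(["a", "b", "c"], ["d", "a", "e"], k=4)
# Stand-in for the VLM judge: it only ever prefers candidate "b".
reranked = conservative_rerank(merged, lambda current, challenger: challenger == "b")
print(merged, reranked)
```

Keeping the comparison pairwise and defaulting to the current top-1 means a noisy judge can only change the result when it actively prefers a challenger, which is one way to read "conservative" here.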
[166] Temporal Evidence Routing with Structured Visual Evidence for TimeLogicQA cs.CVPDF
Yuyang Sun, Yongliang Wu, Xingyu Zhu, Yuxia Chen, Zhenxiang Jiang
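TL;DR: This paper proposes a visual evidence routing pipeline for temporal relation reasoning in the TimeLogicQA video question answering task. The system separates perception from symbolic temporal reasoning: it parses the question, routes video evidence, generates structured visual evidence, and applies programmatic verifiers and a deterministic reducer to produce the final answer.
Details
Motivation: TimeLogicQA evaluates whether video question answering systems can reason over temporal relations such as event existence, ordering, persistence, boundary conditions, and overlap. The paper designs a framework that separates perception from symbolic reasoning to meet this challenge.
Result: On the official test evaluation, the final system achieves an AvgAcc of 81.8.
Insight: The contributions include the visual evidence routing pipeline (which routes evidence by video duration and operator difficulty), structured visual evidence generation with a multimodal large language model, and a conservative fusion mechanism that reduces noisy answer flips by accepting an answer only when the visual evidence, temporal program, and confidence checks agree.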
Abstract: TimeLogicQA evaluates whether video question answering systems can reason over temporal relations such as event existence, ordering, persistence, boundary conditions, and overlap. We address this task with a visual evidence routing pipeline that separates perception from symbolic temporal reasoning. The system first parses each question into event targets, answer mode, candidate options, and temporal operators. It then routes videos according to duration and operator difficulty, using ordered full-frame evidence for short clips and event-focused candidate windows for long videos. A multimodal large language model produces structured visual evidence for the relevant events, while programmatic verifiers recover dense action intervals and a deterministic reducer applies operator-specific temporal rules to produce the final answer. Conservative fusion accepts an answer only when the visual evidence, temporal program, and confidence checks agree, reducing noisy answer flips. On the official test evaluation, our final system achieves an AvgAcc of 81.8.
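A deterministic reducer of the kind described above can be pictured as a small function over event intervals. The operator names (`exists`, `before`, `overlap`), the `(start, end)` interval representation in seconds, and the exact inequalities are assumptions for illustration. The abstract does not specify the rule set used in the actual system.

```python
def reduce_temporal(operator, first, second=None):
    """Apply an operator-specific rule to (start, end) intervals or None."""
    if operator == "exists":
        return first is not None
    if first is None or second is None:
        return False  # a missing event cannot satisfy a binary relation
    if operator == "before":
        return first[1] <= second[0]
    if operator == "overlap":
        return first[0] < second[1] and second[0] < first[1]
    raise ValueError(f"unknown operator: {operator}")


# Hypothetical intervals as they might be recovered by a verifier.
pour = (2.0, 5.5)
stir = (5.0, 9.0)
print(reduce_temporal("before", pour, stir), reduce_temporal("overlap", pour, stir))
```

Because the rules are plain comparisons, the same intervals always yield the same answer, which keeps the reasoning step separate from any variability in perception.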
[167] Adaptive Dense Evidence Refinement for Video Relational Reasoning for VRR-QA Challenge cs.CVPDF
Yuyang Sun, Yongliang Wu, Xingyu Zhu, Yuxia Chen, Zhenxiang Jiang
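TL;DR: This paper presents an adaptive inference system for video relational reasoning question answering (VRR-QA). It uses adaptive test-time computation: a direct video-language model pass answers each question first, multiple lightweight views then identify unstable questions, and only those difficult questions are routed to a high-budget dense evidence module for refined verification. The final system obtains 90.07 average accuracy and 87.81 macro-average accuracy on the VRR-QA test split.
Details
Motivation: A single frame is often insufficient for inferring complex spatial, temporal, viewpoint, depth, and visibility relations in video question answering. The paper also separates two sub-problems that are often conflated: finding plausible alternative answers and deciding whether the current answer should be changed.
Result: On the VRR-QA challenge test split, the final system reaches 90.07 average accuracy and 87.81 macro-average accuracy. The report focuses on the final test system and the implementation settings needed to reproduce the adaptive dense verifier.
Insight: The novelty is bringing adaptive test-time computation to video question answering: lightweight views screen for unstable questions, and only difficult questions go to the high-budget dense evidence module (timestamped frame observations, relation-specific probes, candidate verification, and conservative temporal aggregation), balancing computational cost and accuracy. Viewed objectively, this routing mechanism and problem decomposition are useful references for resource-sensitive vision-language reasoning tasks.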
Abstract: VRR-QA evaluates whether video-language systems can infer spatial, temporal, viewpoint, depth, and visibility relations that are not always resolved by a single frame. We present an inference-only system built around adaptive test-time computation. The system first answers each question with a direct video-language model pass, then uses multiple lightweight views to find unstable questions. Only these difficult questions are routed to a high-budget dense evidence module that constructs timestamped frame observations, relation-specific probes, candidate verification, and conservative temporal aggregation. This design separates two problems that are often confused in video question answering: finding plausible alternative answers and deciding when a current answer should actually be changed. On the test split, the final system obtains 90.07 average accuracy and 87.81 macro average accuracy. The report focuses on the final test system and the implementation settings required to reproduce the adaptive dense verifier.
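One simple way to picture the instability check that drives the routing is agreement between the direct answer and the answers from the lightweight views. The agreement threshold, the function name `needs_dense_pass`, and the treatment of an empty view list are assumptions. The abstract does not state how instability is actually measured.

```python
from collections import Counter


def needs_dense_pass(direct_answer, view_answers, min_agreement=0.75):
    """Return True when the lightweight views disagree with the direct answer."""
    if not view_answers:
        return True  # no supporting views: treat the question as unstable
    agreement = Counter(view_answers)[direct_answer] / len(view_answers)
    return agreement < min_agreement


# Hypothetical multiple-choice answers from four lightweight views.
print(needs_dense_pass("B", ["B", "B", "B", "B"]))
print(needs_dense_pass("B", ["B", "A", "C", "B"]))
```

Only questions for which this check returns True would be sent to the expensive dense evidence module, so the average cost per question stays close to that of the direct pass.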
[168] R^3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking cs.CVPDF
Zixu Li, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Weili Guan
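TL;DR: This paper proposes R^3 for composed video retrieval, a task that requires retrieving a target video from a large gallery given a reference video and a textual edit instruction. The method generates a reasoning trace describing the expected target video after the edit, retrieves with the reasoning-augmented query, and finally verifies the candidates with a re-ranker.
Details
Motivation: In composed video retrieval, a strong embedding model alone may under-express target-side details such as state changes and action replacement, while exhaustive reranking over the full gallery is computationally infeasible.
Result: Experiments demonstrate the method's effectiveness on the CoVR-R challenge, and the code is open-sourced.
Insight: The novelty is turning the source-edit query into a reasoning-grounded retrieval program: a reasoning trace is generated and encoded as an augmented query, its score is fused with the base query score through an agreement-gated residual rule, and re-ranking then improves retrieval precision.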
Abstract: The CoVR-R challenge evaluates composed video retrieval, where a system must retrieve a target video from a large gallery given a reference video and a textual edit instruction. This setting is not a standard video-text retrieval problem: the query is defined by both the visual evidence in the source video and the transformation implied by the edit. A strong embedding model can provide scalable candidate recall, but it may under-express target-side consequences such as state changes, action replacement, object preservation, or temporal consistency. A pairwise multimodal reranker can verify such details more directly, but exhaustive reranking over the full gallery is computationally infeasible. We present R^3, a zero-shot composed video retrieval pipeline built around Reasoning-guided Recalling and Reranking. The core idea is to turn the source-edit query into a reasoning-grounded retrieval program rather than treating the edit text as a short caption. First, the model generates a reasoning trace that describes the expected target video after applying the edit. Then the trace is encoded together with the source video as a reasoning-augmented query, and its retrieval score is fused with the base composed query through an agreement-gated residual rule. Finally, a re-ranker verifies the recalled candidates with direct source-candidate comparison. Experiments demonstrate the effectiveness of our method in addressing this challenge. Code is available at https://github.com/Lee-zixu/R-3.
[169] Rank-Aware Quantile Activation for Motion-Robust Crop Segmentation in UAV Imagery cs.CVPDF
Abinav Kiran, Sravan Danda, Aditya Challa, Sougata Sen, Daya Sagar B S
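TL;DR: Motion blur in images from high-speed UAV acquisition degrades semantic segmentation on rare, texture-dependent classes. To address this, the paper proposes Dual Quantile Activation (QAct), a block that replaces conventional magnitude gating with instance-level rank normalization. On Agriculture-Vision 2021, it improves segmentation in both the zero-shot and blur-supervised regimes, with the largest gains on rare structural and texture-dependent classes.
Details
Motivation: Motion blur in high-speed UAV imagery destroys high-frequency magnitude features, causing statistical erasure of the signal from rare, texture-dependent classes that have high agronomic value.
Result: On Agriculture-Vision 2021, QAct delivers consistent mIoU gains over ReLU in both the zero-shot and blur-supervised regimes across multiple blur severities. At moderate blur, zero-shot QAct even outperforms distillation-trained ReLU; across all severities, distillation-trained QAct (Distill-QAct) achieves the best performance.
Insight: The core idea is replacing blur-sensitive magnitude gating with instance-level rank normalization (rank-aware activation), which gives the model robustness to motion blur. The results indicate that rank-aware activation and blur-domain training are complementary sources of robustness.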
Abstract: Motion blur from high-speed UAV acquisition degrades semantic segmentation on rare texture-dependent classes with high agronomic value. Standard CNNs rely on high-frequency magnitude features that blur destroys, causing statistical erasure of minority signals. We propose Dual Quantile Activation (QAct), a rank-aware block replacing magnitude gating with instance-level rank normalization. Evaluated on Agriculture-Vision 2021 across zero-shot and blur-supervised regimes at multiple severities, QAct is the dominant architectural factor: it delivers consistent mIoU gains over ReLU across both regimes and all severities, with the strongest gains on rare structural and texture-dependent classes. Some dominant classes (water, planter skip) show mixed per-class performance under distillation. At moderate blur, zero-shot QAct outperforms distillation-trained ReLU; across all severities, Distill-QAct achieves the best performance, confirming that rank-aware activation and blur-domain training are complementary sources of robustness.
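The intuition behind rank normalization can be shown with a toy example: if blur shrinks activation magnitudes but preserves their ordering, the normalized ranks are unchanged. This is a sketch of plain rank normalization on a flat list, not the QAct block. The function name `rank_normalize`, the tie-breaking by position, and the sample values are assumptions.

```python
def rank_normalize(values):
    """Map each value to its rank scaled into [0, 1] within this instance."""
    count = len(values)
    if count == 1:
        return [0.5]
    # Stable sort: tied values are ranked by their original position.
    order = sorted(range(count), key=lambda i: values[i])
    ranks = [0.0] * count
    for rank, index in enumerate(order):
        ranks[index] = rank / (count - 1)
    return ranks


sharp = [0.2, 3.0, -1.0, 0.9]          # hypothetical activations
attenuated = [0.1 * v for v in sharp]  # same ordering, smaller magnitudes
print(rank_normalize(sharp))
print(rank_normalize(sharp) == rank_normalize(attenuated))
```

A magnitude threshold such as ReLU followed by a fixed gate would treat `sharp` and `attenuated` very differently, whereas the rank-based output depends only on the ordering within the instance.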
[170] CoSTL: Comprehensive Spatial-Temporal Representation Learning for Moment Retrieval and Highlight Detection cs.CVPDF
Xin Dong, Wenjia Geng, Wenfeng Deng, Yansong Tang
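TL;DR: This paper proposes CoSTL, a comprehensive spatial-temporal representation learning framework for video moment retrieval (MR) and highlight detection (HD). It learns spatial representations with a text-driven progressive fine-grained image encoder and captures spatial-temporal dynamics with a multi-scale temporal perception module, achieving state-of-the-art performance on four public benchmarks.
Details
Motivation: Existing methods focus mainly on temporal modeling with frame-level features and often neglect the rich visual information related to the text query, leading to inaccurate grounding. The paper addresses this limitation by capturing both fine-grained image-level information and video temporal dynamics.
Result: CoSTL achieves state-of-the-art (SOTA) performance on four public benchmarks: QVHighlights, Charades-STA, TACoS, and TVSum.
Insight: The novelty is a text-driven progressive fine-grained image encoder that learns spatial representations through a two-step knowledge extraction process, combined with a multi-scale temporal perception module that strengthens the handling of spatial-temporal dynamics. Viewed objectively, pairing fine-grained spatial understanding with multi-level temporal modeling offers a more complete representation learning scheme for video understanding tasks.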
Abstract: Video Moment Retrieval (MR) and Highlight Detection (HD) are crucial tasks in video analysis that aim to localize specific moments and estimate clip-wise relevance based on a given text query. Recent approaches treat them as similar video grounding tasks and use the same architecture to solve them. These tasks require both fine-grained comprehension at the image level and high-level temporal understanding across the entire video. Existing approaches have primarily focused on temporal modeling using frame-level features, often neglecting the rich visual information related to the text query within individual frames. This oversight leads to inaccurate grounding results. To address this limitation, we propose a Comprehensive Spatial-Temporal Representation Learning Framework (CoSTL), which captures both fine-grained image-level information and temporal dynamics. Specifically, CoSTL incorporates a text-driven progressive fine-grained image encoder, performing a two-step text-driven knowledge extraction process to learn fine-grained spatial representations. Furthermore, a multi-scale temporal perception module captures comprehensive spatial-temporal representations, enhancing the model’s ability to process temporal dynamics. We demonstrate state-of-the-art performance on four public benchmarks: QVHighlights, Charades-STA, TACoS, and TVSum.
[171] HakushoBench: A Japanese Chart and Table VQA Benchmark from Governmental White Papers cs.CVPDF
Issa Sugiura, Shuhei Kurita, Yusuke Oda, Naoaki Okazaki
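TL;DR: This paper introduces HakushoBench, a Japanese chart and table visual question answering (VQA) benchmark built from Japanese governmental white papers, to evaluate vision-language models on non-English document understanding. It contains 2,053 images spanning over 10 image types with manually annotated QA pairs, and targets deep, holistic understanding of charts and tables rather than local visual cues. Experiments show the benchmark is challenging for current open-weight models: the best open-weight model reaches only 58.6% accuracy, and there is a 34.9-point gap between open-weight and proprietary models.
Details
Motivation: Research on chart and table understanding centers on English benchmarks. Non-English benchmarks are scarce, so it is unclear whether current progress generalizes across languages. The main obstacle is the difficulty of collecting realistic and diverse non-English chart images at scale.
Result: Experiments across a broad range of vision-language models show that HakushoBench is challenging for open-weight models: the best open-weight model reaches 58.6% accuracy, and a 34.9-point gap separates open-weight from proprietary models, leaving substantial room for improvement in complex chart understanding.
Insight: The novelty is using governmental white papers as a scalable source for building non-English benchmarks, since they contain naturally occurring charts and tables across diverse formats and domains. This offers an effective way to create realistic, diverse multilingual document understanding benchmarks and exposes significant limitations of current models on complex non-English chart understanding.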
Abstract: Understanding chart and table images is essential for applying vision-language models (VLMs) to real-world document understanding. While English benchmarks have advanced rapidly, non-English counterparts remain scarce, leaving it unclear whether this progress generalizes across languages. A key obstacle is the difficulty of collecting realistic and diverse non-English chart and table images at scale. To address this, we leverage governmental white papers as a scalable source for benchmark construction beyond English, as they contain naturally occurring charts and tables across diverse formats and domains and are freely accessible in many countries. As a first instantiation, we introduce HakushoBench, a challenging Japanese chart and table VQA benchmark built from 33 governmental white papers. HakushoBench contains 2,053 images spanning over 10 image types, with manually annotated QA pairs, designed to assess deep and holistic understanding of charts and tables, rather than local visual cues alone. Experiments across a broad range of VLMs demonstrate that HakushoBench remains challenging for open-weight models: the best open-weight model achieves only 58.6% accuracy, and a 34.9-point gap between open-weight and proprietary models highlights substantial room for improvement in complex chart and table understanding. We release our dataset and code.
[172] HiTokSR: A Coarse-to-Fine Tokenizer with Hierarchical Codebooks for High-Fidelity Real-World Image Super-Resolution cs.CVPDF
Mingxi Li
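TL;DR: This paper proposes HiTokSR, a coarse-to-fine hierarchical tokenizer framework for high-fidelity real-world image super-resolution. It partitions the latent space along the channel dimension into frequency-aware groups and quantizes each with an independent sub-codebook, disentangling global structure from fine texture. It also integrates priors from a vision foundation model and introduces an index-level perturbation strategy to improve generation quality and semantic consistency.
Details
Motivation: Existing vector-quantized (VQ) generative models for real-world image super-resolution (Real-ISR) typically use a single latent space, which entangles low-frequency structures with high-frequency textures and limits representational capacity and codebook utilization.
Result: Extensive experiments on real-world benchmarks show that HiTokSR achieves state-of-the-art (SOTA) performance in both perceptual quality and reconstruction fidelity.
Insight: The contributions are (1) a hierarchical codebook design that disentangles structure from texture through channel grouping and independent quantization, increasing combinatorial expressiveness; (2) integration of vision foundation model priors via adaptive feature modulation, multi-scale class tokens, and a representation alignment loss to improve semantic consistency; and (3) an index-level perturbation strategy during decoder fine-tuning that bridges the train-test discrepancy in discrete token prediction.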
Abstract: Vector-quantized (VQ) generative models have shown promising results in real-world image super-resolution (Real-ISR). However, existing methods typically rely on a monolithic latent space that entangles low-frequency structures with high-frequency textures. This entanglement forces a single codebook to capture a combinatorially complex set of structure-texture pairings, which constrains representational capacity and limits codebook utilization. To address this issue, we present HiTokSR, a hierarchical token prediction framework. Instead of using a single codebook, HiTokSR partitions the latent space along the channel dimension into frequency-aware groups, quantizing each with an independent sub-codebook. This coarse-to-fine design disentangles global structures from fine details, enhancing combinatorial expressiveness while circumventing the optimization instability of high-dimensional nearest-neighbor lookups. To further improve semantic consistency, our generator integrates priors from a vision foundation model via adaptive feature modulation, multi-scale class tokens, and a representation alignment loss. Additionally, we introduce an index-level perturbation strategy during decoder fine-tuning to bridge the train-test discrepancy in discrete token prediction. Extensive experiments on real-world benchmarks demonstrate that HiTokSR achieves state-of-the-art performance in both perceptual quality and reconstruction fidelity.
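The channel-grouped quantization described above can be illustrated with a tiny example: a latent vector is split into equal channel groups, and each group is matched to the nearest entry of its own sub-codebook. The equal-size split, the squared Euclidean distance, the name `quantize_grouped`, and the toy codebooks are assumptions. The frequency-aware grouping and everything learned in HiTokSR are outside this sketch.

```python
import math


def quantize_grouped(vector, codebooks):
    """Quantize each channel group with its own sub-codebook; return indices."""
    group_size = len(vector) // len(codebooks)
    indices = []
    for group, book in enumerate(codebooks):
        chunk = vector[group * group_size:(group + 1) * group_size]
        # Nearest neighbour in this group's sub-codebook only.
        best = min(
            range(len(book)),
            key=lambda j: sum((c - e) ** 2 for c, e in zip(chunk, book[j])),
        )
        indices.append(best)
    return indices


# Two toy sub-codebooks with two 2-dimensional entries each.
codebooks = [
    [[0.0, 0.0], [1.0, 0.0]],
    [[-1.0, 0.0], [0.0, 1.0]],
]
indices = quantize_grouped([0.9, 0.1, -0.8, 0.2], codebooks)
combinations = math.prod(len(book) for book in codebooks)
print(indices, combinations)
```

With G groups of K entries each, the number of representable combinations grows as K^G while each lookup searches only K low-dimensional entries, which is the combinatorial expressiveness the abstract refers to.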
[173] Towards Interactive Video World Modeling: Frontiers, Challenges, Benchmarks, and Future Trends cs.CV
Jiuming Liu, Chaojun Ni, Mengmeng Liu, Chensheng Peng, Fangjinhua Wang
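TL;DR: This paper systematically reviews recent research trends, technical developments, and evaluation benchmarks in interactive world modeling, and proposes potential future directions. It first summarizes progress in terms of application scenarios, world state evolution, and scene modality, then examines three technical challenges (action-conditioned controllability, long-horizon interactions and memory, and action-following responsiveness for real-time interactivity), and compares existing benchmarks and metrics across four application fields: open-world exploration, game engines, autonomous driving, and robotics.
Details
Motivation: With the rapid development of large language models and diffusion-based content generation, world modeling has attracted growing attention. By explicitly incorporating user actions into world state transitions, recent work gives world modeling interactivity within an action-conditioned video or 3D generation paradigm, which strengthens controllability over world evolution and lets users freely traverse, manipulate, navigate, and personalize the state evolution. The paper aims to systematically organize the research trends, technical challenges, and evaluation benchmarks of this field.
Result: This is a survey paper. It proposes no new model or method and therefore reports no quantitative experimental results. It systematically summarizes and compares existing research and analyzes the benchmarks and evaluation metrics of different application fields.
Insight: The contribution lies in systematically organizing the emerging field of interactive world modeling, identifying its core challenges (action-conditioned controllability, long-horizon interaction and memory, real-time responsiveness), and building a cross-domain benchmark analysis. The survey gives researchers a clear technical roadmap and an analysis of research gaps, and its proposed future directions toward next-generation interactive world modeling are forward-looking.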
Abstract: With the rapid development of large language models and diffusion-based content generation, world modeling has attracted increasing research attention, benefiting various downstream domains such as game engines, embodied AI, and autonomous driving. By explicitly incorporating user actions into world state transitions, recent literature empowers world modeling with interactivity in an action-conditioned video or 3D generation paradigm, further enhancing controllability over world evolution and allowing users to freely traverse, manipulate, navigate, and personalize the state evolution. In this paper, we aim to systematically review recent research trends, technical developments, and evaluation benchmarks, and to propose potential future directions in interactive world modeling. Specifically, we first summarize recent efforts and trends in terms of application scenarios, world state evolution, and scene modality. Afterwards, we delve into three crucial technical challenges: action-conditioned controllability, long-horizon interactions and memory, and action-following responsiveness for real-time interactivity. Furthermore, we thoroughly compare existing benchmarks and metrics in four specific application fields: open-world exploration, game engines, autonomous driving, and robotics. Finally, we discuss several promising future directions in achieving next-generation interactive world modeling. The corresponding repository is publicly available at: https://github.com/liujiuming123/Awesome-Interactive-World-Model.
[174] Feature Alignment Determines Fusion Strategy: A Comparative Study of Cross-Attention and Concatenation in Multimodal Learning cs.CV | cs.LG
Zhiqiang Zhou, Xuezhen Xie
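TL;DR: Through comparative experiments and theoretical analysis, this paper studies the choice between cross-attention and concatenation as fusion strategies in multimodal learning. It finds that feature alignment quality, not data scale alone, is the key factor deciding which strategy performs better. When features are pre-aligned by a vision-language pretraining objective, concatenation outperforms cross-attention at every tested scale.
Details
Motivation: The choice between multimodal fusion strategies (cross-attention versus concatenation) currently rests mainly on practitioner intuition and lacks principled understanding. The paper aims to identify the underlying factors that determine fusion performance and to give multimodal system design a theoretical basis.
Result: Controlled experiments were run on Flickr8k with ResNet18 and CLIP ViT-B/32 as feature extraction backbones. When features are pre-aligned, concatenation outperforms cross-attention by 4.1-5.1 percentage points at all tested scales (2048-16384 samples). The alignment degradation study confirms a monotonic trend: as feature alignment degrades, concatenation's advantage grows from 1.3% to 2.8%.
Insight: The core contribution is showing that feature alignment quality is the decisive factor in choosing a fusion strategy, with a theoretical explanation based on sample complexity analysis: concatenation has sample complexity O(d_v + d_t), while cross-attention has O(d_v * d_t). When features are already aligned, the approximation error gap between the two methods vanishes, and concatenation's sample efficiency dominates practical performance. This gives a principled decision framework for choosing fusion methods in systems such as multimodal large language models.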
Abstract: The choice between cross-attention and concatenation for multimodal fusion remains governed by practitioner intuition rather than principled understanding. In this paper, we demonstrate that feature alignment quality, not data scale alone, is the primary determinant of which fusion strategy excels. Through controlled experiments on Flickr8k using two feature extraction backbones (ResNet18 and CLIP ViT-B/32), we show that concatenation outperforms cross-attention by 4.1-5.1 percentage points across all tested scales (2048-16384 samples) when features are pre-aligned by a vision-language pretraining objective. We provide a theoretical explanation grounded in sample complexity analysis: concatenation requires O(d_v + d_t) samples to learn its fusion projection, while cross-attention requires O(d_v * d_t) samples to learn bilinear attention weights, over 256 times as many for 512-dimensional CLIP features. When features are already aligned, the approximation error gap between the two methods vanishes, and concatenation’s sample efficiency dominates at all practical dataset sizes. An alignment degradation study confirms a monotonic trend: as feature alignment degrades, concatenation’s advantage grows from 1.3% to 2.8%. These findings provide a principled decision framework for fusion method selection in multimodal systems, with direct implications for the design of Multimodal Large Language Models.
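The scaling argument in the abstract can be checked with a few lines of arithmetic. The sketch below simply evaluates the two leading-order terms, d_v + d_t and d_v * d_t, with all constants dropped; the function name is hypothetical, and the result is an order-of-magnitude illustration rather than the paper's exact bound.

```python
def fusion_scaling(d_v, d_t):
    """Leading-order terms of the two sample-complexity bounds quoted in the
    abstract, with constants and log factors ignored (an assumption)."""
    additive = d_v + d_t      # concatenation: O(d_v + d_t)
    bilinear = d_v * d_t      # cross-attention: O(d_v * d_t)
    return additive, bilinear, bilinear / additive


# 512-dimensional features on both sides, as in the CLIP example.
additive, bilinear, ratio = fusion_scaling(512, 512)
print(additive, bilinear, ratio)
```

With d_v = d_t = 512 the bare ratio is 512 * 512 / (512 + 512) = 256, which is in line with the scale of the gap the abstract cites for CLIP features.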
[175] TECCI: Tricky Edits of Collected and Curated Images cs.CV | cs.AI | cs.CL
Aishwarya Agrawal, Roy Hirsch, Yasumasa Onoe, Sherry Ben, Jason Baldridge
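TL;DR: This paper proposes TECCI, a new image editing benchmark that targets the weaknesses of existing text-guided image editing methods on challenging edits involving position, motion, viewpoint, scale, and creative changes. The benchmark contains 7550 pairs of images and edit instructions. Human evaluation of five leading models reveals shortcomings in instruction following, minimality of edits, and visual quality, with no model exceeding a 22% overall success rate.
Details
Motivation: Current text-guided image editing methods still struggle with instruction following, minimally editing the source image, and ensuring high visual quality, especially for challenging edits involving position, motion, viewpoint, scale, and creativity. Systematically testing generative image editors requires a new benchmark aimed at these weaknesses.
Result: Human evaluation of five leading models on TECCI shows that Nano Banana Pro performs best overall, yet no model exceeds a 22% overall success rate. Models do significantly better at instruction following than at minimal edits and visual quality. They struggle most with architecture and nature images, which require understanding of spatial layout and intricate visual details. Reasoning and creative edits are the hardest, while color and appearance edits are the easiest. The Gemini-based auto-rater reaches 74.7% accuracy in matching human evaluations.
Insight: The contribution is a challenging image editing benchmark built specifically around the weaknesses of existing methods. It covers multiple edit types through automatically generated and manually written instructions, and it uses multi-dimensional human evaluation plus an auto-rater to quantify model performance, giving future work clear directions for improvement and a standard for evaluation.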
Abstract: Despite tremendous recent progress, current text-guided image editing methods still struggle with many aspects of editing involving instruction following, minimally editing the source image, and ensuring high visual quality. These problems are especially apparent when the requested edit is challenging, such as those that involve position, motion, viewpoint, scale and creative edits. To systematically test generative image editors, we propose a novel image editing benchmark, TECCI: Tricky Edits of Collected and Curated Images. TECCI consists of a completely new set of images we are releasing. The images in TECCI span 7 image categories. The images and these categories were curated intentionally to target weaknesses of existing methods. The edit instructions in TECCI are automatically generated by Gemini, covering 5 edit types per source image. We also curated a set of 530 images for which we created challenging manually written edit instructions. Overall, TECCI contains 7550 pairs of images and edit instructions. We conduct human evaluations of five leading image editing models on TECCI. Humans judge outputs along three dimensions: 1) instruction following, 2) minimality of the edits, and 3) visual quality. To scale up the evaluation, we also build an auto-rater using Gemini that achieves 74.7% accuracy in matching human evaluations. Our evaluations reveal that: 1) none of the models exceed a 22% overall success rate, demonstrating the challenging nature of TECCI; 2) Nano Banana Pro is the best performing model overall; 3) models perform significantly better at instruction following compared to minimal edits and visual quality; 4) models struggle with editing architecture and nature images, which require strong understanding of spatial layout and intricate visual details; 5) reasoning and creative edits are the most difficult, whereas color and appearance edits are the easiest.
[176] Distilling Neuro-Symbolic Programs into 3D Multi-modal LLMs cs.CV | cs.AI | cs.CL | cs.MM
Wentao Mo, Yang Liu
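TL;DR: This paper proposes APEIRIA, a neuro-symbolic 3D multi-modal LLM that bridges neuro-symbolic 3D concept learners and end-to-end 3D MLLMs. Through a three-stage curriculum, it distills symbolic reasoning patterns into MLLMs in the form of natural language chain-of-thought, handling open vocabularies and complex instructions while keeping the benefits of transparent reasoning and modularity.
Details
Motivation: Current 3D spatial reasoning methods face a fundamental trade-off. Neuro-symbolic 3D methods are interpretable but limited to closed vocabularies and simple programs; end-to-end 3D MLLMs handle complex language and open concepts, but their reasoning is a black box without explicit spatial verification. The paper aims to combine the strengths of both.
Result: Evaluations on 3D spatial reasoning datasets show that APEIRIA surpasses prior neuro-symbolic 3D methods on grounding, question answering, and captioning, and matches state-of-the-art 3D MLLMs.
Insight: The core idea is to distill symbolic reasoning patterns, rather than concept-specific knowledge, into MLLMs through curriculum learning. This unifies systematic reasoning with flexibility and keeps the planning and perception modules interchangeable.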
Abstract: Current 3D spatial reasoning methods face a fundamental trade-off: neuro-symbolic 3D (NS3D) concept learners achieve interpretable reasoning through compositional programs but are constrained to closed-set concept vocabularies and simple programs; end-to-end 3D multi-modal LLMs (3D MLLMs) could handle complex natural language and open-vocabulary concepts but suffer from black-box reasoning without explicit spatial verification. We introduce APEIRIA, a neuro-symbolic 3D MLLM to bridge two paradigms by distilling symbolic reasoning patterns into MLLMs with natural language chain-of-thought. Our three-stage curriculum progressively builds reasoning capabilities: a) 3D perception alignment grounds object visual-geometric features to the LLM, b) CoT-SFT teaches query decomposition and stepwise verification from symbolic program traces, and c) CoT-RL extends reasoning patterns to open-set concepts and deeply nested instructions. By transferring reasoning patterns rather than concept-specific knowledge, APEIRIA preserves key NS3D virtues: transparent reasoning and modular interchangeability of planning and perception components. Evaluations on grounding, question answering, and captioning show that APEIRIA surpasses prior NS3D methods and matches state-of-the-art 3D MLLMs on 3D spatial reasoning datasets, unifying symbolic methods’ systematic reasoning with MLLMs’ flexibility. Code is available at https://github.com/oceanflowlab/APEIRIA.
[177] Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration? cs.CV
Liyang Li, Muzhi Zhu, Zhiyue Zhao, Hengyu Zhao, Ke Liu
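TL;DR: This paper introduces the Target Viewpoint Reproduction (TVR) task and its benchmark TVRBench, which evaluate whether foundation models can adjust their own viewpoint in a 3D environment through active exploration until it matches a given target image. Current open-source and closed-source models perform poorly on the task, and the authors build a unified post-training framework to improve performance.
Details
Motivation: Humans can reproduce the viewpoint specified by a target image through active head and body motion, whereas research on spatial intelligence in foundation models has focused on passive understanding of pre-collected observations. An evaluation task for active perception and action is needed to fill this gap.
Result: On the TVRBench evaluation split, the strongest open-source and closed-source models reach success rates of only 7.8% and 12.0%, respectively. Visual-action supervised fine-tuning (SFT) raises a 9B-parameter open-source model to 50.8% success, and Multi-turn GRPO improves this further to 51.4%.
Insight: The contribution is the active TVR task and its benchmark, which expose two bottlenecks in existing models: handling multi-turn visual history and mapping spatial discrepancies to body translation. The paper also proposes a unified framework covering several post-training methods to systematically improve models in embodied exploration.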
Abstract: Humans can reproduce the viewpoint specified by a target image through active head and body motion, yet spatial intelligence in foundation models has largely been studied as passive understanding of pre-collected observations. We introduce Target Viewpoint Reproduction (TVR) – an active task where an agent adjusts its viewpoint in a 3D environment until its observation matches a given target image – and TVRBench, an indoor-simulation benchmark spanning scene scale and target-view visual richness. TVR is far from solved: on the evaluation split, the strongest open-source and closed-source models reach only 7.8% and 12.0% success. Fine-grained analysis identifies two consistent bottlenecks: off-the-shelf models struggle with multi-turn visual history, and performance drops sharply when viewpoint reproduction requires body translation rather than in-place rotation, exposing a gap in mapping spatial discrepancies to embodied movement. To study reducing this gap, we build a unified TVR post-training framework covering expert-trajectory SFT, rationale-supervised CoT-SFT, offline Single-turn GRPO, and on-policy Multi-turn GRPO from live simulator rollouts. Visual-action SFT supplies the main gain, raising a 9B open-source model to 50.8% success; Multi-turn GRPO provides targeted multi-room refinement and reaches 51.4% overall, while CoT supervision and Single-turn GRPO degrade closed-loop performance. These results establish TVRBench as a testbed for measuring and training foundation models that actively perceive and act in 3D environments. Our code, data, and models are available at https://github.com/aim-uofa/TVRBench.
[178] Exploiting In-Sensor Computing for Energy-Efficient Earth Observation cs.CV
Luigi Capogrosso, Pietro Bonazzi, Loris Hoxhaj, Michele Magno
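TL;DR: This paper proposes an in-sensor computing framework for Earth observation. By integrating TinyML techniques with the Sony IMX500 Intelligent Vision Sensor, it processes data directly on the sensor and reduces dependence on downlink bandwidth. Several efficient convolutional networks are evaluated on EuroSAT, reaching 96.68% accuracy within the 8 MB memory limit with high energy efficiency.
Details
Motivation: The rapid growth of the satellite industry has sharply increased geospatial data acquisition, but downlink bandwidth is limited, creating a transmission bottleneck. On-board computing can already pre-process data in orbit; this work goes further by offloading computation to the sensor itself to cut the transmission of noisy or irrelevant data more effectively.
Result: On EuroSAT, the optimized SqueezeNet, ShuffleNetV2, and MCUNetV1 models reach 96.68% accuracy within the IMX500 platform's 8 MB memory limit, with an average throughput of 17.40 FPS, a latency of 27.43 ms, an energy cost of 14.19 mJ per inference, and an efficiency rating of 42.26 GMAC/J.
Insight: The contribution is the tight integration of TinyML with intelligent vision sensor hardware to realize an in-sensor computing paradigm, which directly relieves both the computational load on the primary embedded device and the pressure on downlink transmission. This end-to-end optimized framework offers a feasible low-power, high-accuracy solution for resource-constrained spaceborne systems.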
Abstract: The rapid growth of the satellite industry has driven a significant increase in geospatial data acquisition, highlighting a critical bottleneck: the severe disparity between the volume of collected sensor data and the limited downlink bandwidth available to ground stations. While On-Board Computing (OBC) has helped address this by pre-processing data in orbit, this article further advances the paradigm by introducing an in-sensor computing framework. We present an optimized end-to-end Earth Observation (EO) pipeline tailored for strict computational constraints by integrating TinyML techniques with the Sony IMX500 Intelligent Vision Sensor. Specifically, our approach shifts processing directly to the sensor level, offloading the computation from the primary embedded device, and effectively mitigating the downlink transmission of noisy or irrelevant data. We evaluated several efficient Convolutional Neural Networks (ConvNets), i.e., SqueezeNet, ShuffleNetV2, and MCUNetV1, on the EuroSAT dataset. Experimental results show that, despite the optimizations required for deployment on the IMX500 platform, our models maintain a competitive 96.68% accuracy while operating within its 8 MB constraints. Specifically, the models reach an average processing throughput of 17.40 FPS with a latency of 27.43 ms. Furthermore, our system profile exhibits high energy efficiency, with a low energy footprint of 14.19 mJ per inference and an efficiency rating of 42.26 GMAC/J, demonstrating its viability for in-sensor deployment.
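The reported figures can be combined into two derived quantities as a rough sanity check. The sketch below rests on assumptions that the abstract does not state: that the sensor runs continuously at the reported throughput, that the energy figure covers only inference, and that the efficiency rating is defined as MACs per inference divided by energy per inference. The constant and function names are illustrative.

```python
# Figures quoted in the abstract.
ENERGY_PER_INFERENCE_J = 14.19e-3   # 14.19 mJ
THROUGHPUT_FPS = 17.40              # inferences per second
EFFICIENCY_MAC_PER_J = 42.26e9      # 42.26 GMAC/J


def average_power_w(energy_j, fps):
    """Average inference power, assuming continuous operation at `fps`."""
    return energy_j * fps


def macs_per_inference(efficiency_mac_per_j, energy_j):
    """Implied workload per inference, assuming efficiency = MACs / energy."""
    return efficiency_mac_per_j * energy_j


power = average_power_w(ENERGY_PER_INFERENCE_J, THROUGHPUT_FPS)
macs = macs_per_inference(EFFICIENCY_MAC_PER_J, ENERGY_PER_INFERENCE_J)
print(f"average power ~ {power:.3f} W, workload ~ {macs / 1e9:.2f} GMAC")
```

Under these assumptions the numbers imply roughly a quarter of a watt of average inference power and about 0.6 GMAC per inference; neither value is reported in the abstract itself.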
[179] Knowledge-Intensive Video Generation cs.CV | cs.AI
Chenxu Wang, Mingda Chen
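TL;DR: This paper proposes the knowledge-intensive video generation (KIVI) task, in which models generate explanatory, procedural, or demonstrative videos from information-seeking prompts. It builds KIVI-Bench, a benchmark of 1,080 prompts, and proposes automatic metrics for factuality and helpfulness. Experiments show that current state-of-the-art video generation models still lag behind human performance on visual properties, procedural operations, and clear information presentation.
Details
Motivation: Text-to-video models have progressed rapidly in visual quality but remain under-evaluated for factuality and practical usefulness, so an evaluation framework is needed for knowledge-intensive scenarios such as explanations and procedural demonstrations.
Result: Experiments on seven state-of-the-art video generation models show that they still fall below human level on KIVI-Bench, especially on visual properties, procedural operations, and clear information presentation. The proposed automatic metrics align with human annotations significantly better than existing alternatives.
Insight: The contribution is the first systematic definition of the knowledge-intensive video generation task, together with a benchmark and metrics dedicated to factuality and helpfulness, which offers a new evaluation direction for video generation research that values practicality and accuracy.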
Abstract: Text-to-video generation has advanced rapidly in visual quality, but remains under-evaluated for factuality and practical usefulness. We introduce knowledge-intensive video generation (KIVI), where models generate videos from short information-seeking prompts that ask for explanations, procedures, or demonstrations. To evaluate this setting, we construct KIVI-Bench, a benchmark of 1,080 prompts, and propose automatic metrics for factuality and helpfulness. Human evaluation shows that our metrics significantly better align with human annotations than existing alternatives. Experiments on seven state-of-the-art video generation models show that current systems still lag behind human performance, especially on visual properties, procedural operations, and clear information presentation. These results highlight KIVI as a challenging direction for factual and instructionally useful video generation.
[180] Event-Based Vision in Space: Applications, Trends, and Future Directions cs.CV
Luigi Capogrosso, Pietro Bonazzi, Michele Magno
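TL;DR: This paper presents a comprehensive review of event-based vision (neuromorphic cameras) in the space domain. Traditional frame-based optical sensors suffer from motion blur, high power consumption, and data redundancy in orbital environments, whereas event sensors capture local illumination changes asynchronously and offer microsecond temporal resolution, extremely high dynamic range, and exceptional energy efficiency. Based on the existing literature, the paper builds a taxonomy around four primary domains: atmospheric and high-speed observation; environmental monitoring and change detection; operational support and onboard processing; and geospatial modeling and predictive analysis. It argues that neuromorphic engineering is not merely a supplementary imaging technique but a paradigm shift that addresses critical bottlenecks in modern remote sensing and sustainable space exploration.
Details
Motivation: Traditional frame-based optical sensors suffer from motion blur, high power consumption, and data redundancy in orbital environments, while event sensors, as a bio-inspired asynchronous approach, offer microsecond temporal resolution, high dynamic range, and energy efficiency. The scientific literature on their space applications, however, remains heavily fragmented, and this paper aims to fill that gap with a comprehensive review.
Result: This is a survey article. It proposes no specific model and runs no quantitative experiments, but it builds a taxonomy of event-based vision in the space domain from the surveyed literature and summarizes the current state and trends of applications in the four primary domains.
Insight: The contribution is a systematic review of event-based vision in the space domain, organized as a taxonomy around four core domains. The article stresses the potential of neuromorphic engineering as a paradigm shift that can directly address critical bottlenecks in remote sensing and that offers a new technical direction for future space exploration.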
Abstract: Earth Observation (EO) is undergoing a significant transformation driven by the deployment of novel sensing technologies. Traditional frame-based optical sensors often struggle with motion blur, high power consumption, and extreme data redundancy in challenging orbital environments. In contrast, event-based sensors, also known as neuromorphic cameras, offer a bio-inspired asynchronous approach. By capturing only local illumination changes, they provide microsecond temporal resolution, an extremely high dynamic range, and exceptional energy efficiency. Although the use of these sensors is rapidly expanding from terrestrial systems to orbital platforms, the scientific literature surrounding their space-based applications remains heavily fragmented. To bridge this gap, this article presents a comprehensive review of the state-of-the-art in event-based vision in the space domain. Based on the retrieved literature, we introduce a taxonomy structured around four primary domains: 1) atmospheric and high-speed observation; 2) environmental monitoring and change detection; 3) operational support and onboard processing; and 4) geospatial modeling and predictive analysis. As a result, this survey highlights that neuromorphic engineering is far more than a supplementary imaging technique; it is a paradigm shift that can be used to directly address critical bottlenecks in modern remote sensing and sustainable space exploration.
[181] Beyond Visual Memory: Mechanistic Diagnostics of Latent Visual Reasoning cs.CV | cs.AI
Garvin Guo, Yu Chen, Xiang Wang, Shuai Li, Xinpei Zhao
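TL;DR: By decomposing the continuous latent tokens used in latent visual reasoning methods into three testable components (latent slots, boundary markers, and format), this paper shows that the real source of these methods' gains is not visual memory but the boundary markers, the format, and a specific attention pattern.
Details
Motivation: The gains of existing latent visual reasoning methods are attributed to visual memory, yet recent analyses reveal a paradox in that account. The paper decomposes latent tokens into components to uncover the actual mechanism behind the gains.
Result: Across six method-stage settings and four perception-heavy benchmarks, latent slots fail every prediction of the visual-memory account. Retaining only the boundary markers preserves 78% to 100% of the gain in several settings.
Insight: The contribution is a mechanistic diagnostic method that decomposes latent tokens at fine granularity. It shows that the gains come from boundary markers and format rather than visual encoding, and it argues that models should be evaluated both by accuracy and by the mechanism they actually rely on.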
Abstract: Recent latent visual reasoning methods achieve substantial gains by inserting continuous latent tokens into multimodal language models. These gains are commonly attributed to the tokens encoding visual evidence; recent analyses, however, reveal a paradox: the tokens are loosely tied to the image and contribute little to the answer. Critically, these analyses treat latent tokens as a single unit, obscuring the true source of the gains. We therefore decompose latent tokens into three testable components: latent slots, boundary markers, and format, and develop a state-of-the-art method as a probe under favorable conditions. Across six method-stage settings and four perception-heavy benchmarks, latent slots fail every prediction of the visual-memory account. Strikingly, retaining only the boundary markers preserves 78 to 100% of the gain in several settings, while the model attends to the image more narrowly at latent positions than at answer positions. The gain therefore comes from boundary markers, format, and this attention pattern, not from latent slots. How each method engages this mechanism depends on its training supervision: at matched accuracy, mechanisms can still differ markedly. Latent visual reasoning thus needs evaluation not only by accuracy but by what the model actually relies on.
[182] DeblurNVS: Geometric Latent Diffusion for Novel View Synthesis from Sparse Motion-Blurred Images cs.CV
Changyue Shi, Wangbo Yu, Chaoran Feng, Li Yuan
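TL;DR: DeblurNVS is a novel framework for synthesizing high-quality novel views directly from sparse motion-blurred images without costly per-scene optimization. It restores the intermediate geometric representations needed for multi-view reasoning and combines them with target camera information to reconstruct a sharp RGB novel view.
Details
Motivation: Existing novel view synthesis methods depend on clean observations and degrade when image structures and cross-view geometric cues are corrupted by motion blur. Motion blur commonly arises from camera shake, scene motion, or finite exposure time; it destroys local details and weakens multi-view correspondences.
Result: On synthetic motion-blur benchmarks, DeblurNVS outperforms existing baselines and generalizes to real motion-blurred scenes, producing perceptually sharper and structurally more stable novel views while avoiding per-scene optimization.
Insight: The contribution is a general framework that needs no per-scene optimization and recovers reliable structure and correspondence cues from blurred inputs by restoring geometric representations. Also worth noting are its construction of a large-scale training dataset using interpolation-based finite-exposure blur synthesis and its application of a geometric latent diffusion model to blur-aware novel view synthesis.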
Abstract: Novel view synthesis (NVS) is a fundamental problem in computer vision and graphics. Recent advances in neural radiance fields (NeRF), 3D Gaussian Splatting (3DGS), and generative view synthesis have substantially improved its quality. Yet most methods still rely on clean observations, where image structures and cross-view geometric cues are well preserved. Motion blur breaks this assumption by corrupting local details and weakening multi-view correspondences. Such blur commonly arises from camera shake, scene motion, or finite exposure in practical capture. Blur-aware NVS methods address this degradation by modeling image formation, but their reliance on costly per-scene optimization limits efficient and generalizable sparse-view synthesis. To address this, we propose DeblurNVS, a novel framework for synthesizing high-fidelity novel views directly from sparse motion-blurred images, without requiring per-scene optimization. DeblurNVS restores the intermediate geometric representations needed for multi-view reasoning, enabling blurred inputs to recover reliable structure and correspondence cues. The restored representations are then combined with target camera information to synthesize the target-view representation and reconstruct a sharp RGB novel view. To enable the large-scale training, we construct a motion-blurred NVS dataset from DL3DV-10K using interpolation-based finite-exposure blur synthesis. Extensive experiments demonstrate that DeblurNVS outperforms existing baselines on synthetic motion-blur benchmarks and generalizes to real motion-blurred scenes, producing perceptually sharper and structurally more stable novel views while avoiding costly per-scene optimization. Project page: https://github.com/PKU-YuanGroup/DeblurNVS.
[183] HOLA: Holistic Multi-Modal Alignment for Open-Set 3D Recognition cs.CV
Koby Aharonov, Oren Shrout, Ayellet Tal
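TL;DR: This paper proposes HOLA, which aligns each point cloud with multiple images and multiple textual descriptions to obtain a more holistic understanding of 3D objects. It introduces a decoupled multi-positive contrastive loss that avoids the "spotlight crowding" problem arising when many positives share the same negatives, and it uses a lightweight text adapter to reduce the domain gap. The method achieves state-of-the-art open-vocabulary performance on long-tail benchmarks while sustaining high frame rates.
Details
Motivation: Existing open-set 3D recognition methods typically rely on heavy 2D vision Transformers and align each point cloud with a single image or caption, which anchors representations to partial views. The paper aims to capture a more holistic understanding of 3D objects through multi-modal alignment and thereby improve generalization to rare or unseen categories.
Result: The model achieves state-of-the-art open-vocabulary performance on long-tail benchmarks, with substantial zero-shot recognition improvements, while sustaining high frame rates.
Insight: The contributions are a decoupled multi-positive contrastive loss that separates positive aggregation from negative competition, and a lightweight text adapter that reduces the domain gap between web captions and curated annotations, enabling effective use of large-scale unsupervised text.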
Abstract: Open-set 3D recognition requires models that generalize to rare or unseen categories. Recent approaches address this by distilling language-vision knowledge into 3D encoders, typically relying on heavy 2D ViTs and aligning each point cloud with a single image or caption, thus anchoring representations to partial views. We propose aligning each point cloud with multiple images and textual descriptions to capture a more holistic understanding of 3D objects. To realize this idea, it is essential to design a loss function capable of jointly aligning a 3D instance with multiple matched signals, multi-view images and multiple texts, while separating positive aggregation from negative competition. We introduce such a function, termed the decoupled multi-positive contrastive loss. Our formulation enhances the loss’s hardness-aware focus on challenging negatives, avoiding the “spotlight crowding” that occurs when many positives share the same softmax with all the negatives. Complementing this, we present a lightweight text adapter applied only to web captions, reducing the domain gap to curated annotations and enabling effective use of large-scale unsupervised text. Our model demonstrates state-of-the-art open-vocabulary performance on long-tail benchmarks, yielding substantial zero-shot improvements while sustaining high frame rates.
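The abstract does not give the loss formula, so the following is only one plausible reading of "separating positive aggregation from negative competition", written as a hedged sketch rather than the authors' formulation. It contrasts a joint InfoNCE-style loss, where every positive shares one softmax with all other positives and negatives, against a decoupled variant, where each positive competes only with the negatives and the per-positive losses are then averaged. Function names, the temperature value, and the similarity scores are all illustrative assumptions.

```python
import math


def joint_multi_positive_loss(pos_sims, neg_sims, tau=0.1):
    """Baseline: all positives and negatives share a single softmax, so
    positives also compete with one another for probability mass."""
    denom = sum(math.exp(s / tau) for s in pos_sims + neg_sims)
    losses = [-math.log(math.exp(s / tau) / denom) for s in pos_sims]
    return sum(losses) / len(losses)


def decoupled_multi_positive_loss(pos_sims, neg_sims, tau=0.1):
    """Assumed decoupled variant: each positive is scored against the
    negatives only, then the per-positive losses are averaged."""
    neg_mass = sum(math.exp(s / tau) for s in neg_sims)
    losses = []
    for s in pos_sims:
        e = math.exp(s / tau)
        losses.append(-math.log(e / (e + neg_mass)))
    return sum(losses) / len(losses)


# Hypothetical similarities between one 3D instance and its matched signals
# (for example two views and one caption) and two mismatched samples.
positives = [0.9, 0.8, 0.7]
negatives = [0.2, 0.1]

joint = joint_multi_positive_loss(positives, negatives)
decoupled = decoupled_multi_positive_loss(positives, negatives)
print(joint, decoupled)
```

In this toy case the joint loss stays large even though every positive already scores well above every negative, because the positives crowd each other in the shared softmax; the decoupled variant is driven only by the negatives.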
[184] ChartArena: Benchmarking Chart Parsing across Languages, Scenarios, and Formats cs.CV
Shangpin Peng, Gengluo Li, Xingyu Wan, Chengquan Zhang, Hao Feng
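TL;DR: This paper introduces ChartArena, a comprehensive bilingual chart parsing benchmark covering eight chart families (both numeric charts and diagrammatic structures) and three visual scenarios (digital renderings, printed photos, and hand-drawn photos). It is built with a human-agent collaborative annotation pipeline and comes with a format-agnostic evaluation protocol that maps heterogeneous outputs into two canonical semantic spaces for scoring. An extensive evaluation of 26 leading multimodal large language models (MLLMs) reveals capability gaps on specific chart types and scenarios.
Details
Motivation: Existing chart parsing benchmarks focus on narrow chart types and neglect diagrammatic structures such as flowcharts and mind maps. Model output formats are mutually incompatible, and datasets rarely include the printed or hand-drawn images common in practice. ChartArena is proposed to address these problems.
Result: Twenty-six leading MLLMs were evaluated on ChartArena. Frontier proprietary models such as Gemini 3.1 Pro lead overall, but the strongest open-source systems are rapidly closing the gap. Document parsing models handle numeric charts reasonably well but fall sharply behind on diagrammatic structures. Expert chart parsers remain limited to narrow chart families. Across all models, radar charts and hand-drawn scenarios are especially challenging.
Insight: The contribution is a comprehensive bilingual benchmark covering a wide range of chart types (numeric and diagrammatic) and realistic scenarios (digital, printed, hand-drawn). Its format-agnostic evaluation protocol, which maps heterogeneous outputs into a normalized triple view and a directed graph view, together with structure-aware metrics, gives a unified basis for fair cross-model comparison and exposes the capability boundaries of existing models.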
Abstract: Charts are a primary medium for conveying quantitative and relational information, yet systematically evaluating chart parsing models remains difficult. Existing benchmarks focus on narrow chart types and leave diagrammatic structures such as flowcharts and mind maps largely unaddressed, while models produce outputs in incompatible formats, and datasets rarely include the printed or hand-drawn images encountered in practice. To address these issues, we introduce ChartArena, a comprehensive bilingual benchmark covering eight chart families spanning both numeric charts and diagrammatic structures, each evaluated across three visual scenarios: digital renderings, printed photos, and hand-drawn photos. The dataset is built via a human-agent collaborative annotation pipeline with multi-stage human verification to ensure annotation reliability. To enable fair cross-model comparison, we further design a format-agnostic evaluation protocol that maps heterogeneous outputs into two canonical semantic spaces, a normalized triple view and a directed graph view, and scores them with structure-aware metrics. Through extensive evaluation of 26 leading MLLMs, we observe three consistent findings: (i) frontier proprietary models such as Gemini 3.1 Pro lead overall, yet the strongest open-source systems are rapidly closing the gap; (ii) document parsing models handle numeric charts reasonably but fall sharply behind on diagrammatic structures; and (iii) expert chart parsers remain limited to narrow chart families. Across all models, radar charts and hand-drawn scenarios stay especially challenging. These findings show that ChartArena exposes clear capability gaps and provides a unified foundation for future progress. ChartArena is publicly available at https://github.com/pspdada/ChartArena.
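A format-agnostic comparison of the kind the abstract describes can be sketched for the triple view. The abstract does not specify the normalization rules or the metric, so everything below is an assumption for illustration: triples are taken to be (series, category, value), normalization lowercases and trims strings and rounds values, and the score is a plain set-based F1 rather than the paper's structure-aware metrics.

```python
def normalize_triple(triple, ndigits=2):
    """Map a (series, category, value) triple to a canonical form so that
    outputs from different formats become comparable (assumed rules)."""
    series, category, value = triple
    return (str(series).strip().lower(),
            str(category).strip().lower(),
            round(float(value), ndigits))


def triple_f1(predicted, reference):
    """Set-based F1 between normalized predicted and reference triples."""
    pred = {normalize_triple(t) for t in predicted}
    ref = {normalize_triple(t) for t in reference}
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference annotation and a model output parsed from another
# format (different casing, whitespace, and string-typed numbers).
reference = [("Sales", "Q1", 10), ("Sales", "Q2", 12.5)]
predicted = [(" sales", "q1", "10.0"), ("Sales", "Q2", 13)]

score = triple_f1(predicted, reference)
print(score)
```

Here the first predicted triple matches after normalization and the second does not, so precision and recall are both one half.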
[185] Diamonds in the Sky: Pareidolic Animals in Clouds cs.CV
Miriam Horovicz, Yacov Hel-Or, Yael Moses
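TL;DR: This paper proposes an AI-based method that predicts which animal shapes people are likely to perceive in clouds (the phenomenon of pareidolia) and helps people recognize those shapes. It uses a diffusion model to transform cloud segments into visually similar animal shapes, predicts the pareidolic animal from the generated image, and uses a morphing video that transitions from the generated image back to the original cloud to enhance human perception.
Details
Motivation: State-of-the-art recognition methods typically fail to detect pareidolic animal shapes in clouds. The paper addresses this and also aims to help people perceive these shapes even when they did not recognize them at first.
Result: The abstract reports no quantitative results or benchmarks. It indicates that the diffusion process succeeds in generating an animal image only when the target animal resembles the shape of the cloud, and that successfully generated images are used to predict the pareidolic animal.
Insight: The contribution is the combination of a diffusion model and morphing videos to model and enhance human pareidolic perception, using visual hints to aid recognition. This suggests new ideas for perception assistance and creative applications in computer vision.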
Abstract: People often see animal shapes in clouds, a phenomenon known as pareidolia. We propose an AI-based method that aims to predict which animals people are likely to perceive in clouds, even though state-of-the-art recognition methods typically fail to detect such animals. Additionally, we introduce a method to assist individuals in perceiving specific pareidolic animals, even if they did not recognize them initially. Our approach uses a diffusion model to transform cloud segments into an animal shape that visually resembles the original cloud. This diffusion technique is inspired by the observation that the diffusion process succeeds only when the target animal resembles the shape of the cloud, and that subtle visual hints often suffice to help individuals recognize specific pareidolic animals. A generated image, successfully derived from the diffusion model, is then used to predict the pareidolic animal. Additionally, a short morphing video transitioning from the generated image back to the original cloud segment is employed to further enhance human perception of the pareidolic animals.
[186] PAI-Studio: Cinematic Video Background Replacement with Camera-Aware Motion cs.CV
Heyuan Gao, Bangxun Tang, Yiren Song, Guian Fang, Zijian He
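TL;DR: PAI-Studio is a new reference-conditioned video synthesis task that addresses a long-standing challenge in cinematic background replacement: generating dynamic backgrounds aligned with foreground motion while preserving foreground identity, matching the reference scene's appearance, and achieving globally consistent illumination with realistic foreground relighting. The method builds on a Diffusion Transformer video backbone, reformulates the problem as an in-context conditional generation task, and constructs a 30K-scale high-quality dataset.
Details
Motivation: Existing open-source systems and commercial APIs cannot simultaneously ensure motion-consistent background generation, high-fidelity foreground relighting, and foreground identity preservation, which often results in static backgrounds, inconsistent boundaries, and noticeable compositing artifacts.
Result: Extensive evaluations show that the method significantly outperforms existing open-source and commercial API solutions in dynamic background consistency, illumination realism, and foreground preservation.
Insight: The contribution is to recast the task as in-context conditional generation and to use bidirectional attention to jointly capture foreground dynamics and background reference information within a unified architecture, supported by a large-scale, high-quality film dataset built for the task.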
Abstract: We present PAI-Studio, a new reference-conditioned video synthesis task that addresses a long-standing challenge in cinematic background replacement: generating dynamic backgrounds aligned with foreground motion while preserving foreground identity, matching reference scene appearance, and achieving globally consistent illumination with realistic foreground relighting. Existing open-source systems and commercial APIs cannot simultaneously ensure motion-consistent background generation, high-fidelity foreground relighting and foreground identity preservation, often resulting in static backgrounds, inconsistent boundaries, and noticeable compositing artifacts. To bridge this gap, we build upon a Diffusion Transformer video backbone and reformulate the problem as an in-context conditional generation task. Through bidirectional attention, our model jointly captures foreground dynamics and background reference information within a unified architecture. We further construct a 30K-scale dataset sourced from high-quality films and online videos to support this task. Extensive evaluations demonstrate that our method significantly outperforms existing open-source and commercial API solutions.
[187] Agent Skills Should Go Beyond Text: The Case for Visual Skills cs.CV
Binxiao Xu, Ruichuan An, Bocheng Zou, Hang Hua
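TL;DR: This paper proposes a multimodal skill paradigm that goes beyond text. It argues that in visual-centric tasks, reusable knowledge depends on spatial layout, visual grounding, fine-grained appearance, and localized state changes, so text-only skills face a fundamental bottleneck. The paper introduces multimodal skills that combine declarative textual logic with explicit visual support, in three forms (static priors, dynamic priors, and interleaved visual skills), and develops an automatic system that converts agent experience into reusable multimodal skills. Experiments show that visual skills consistently outperform text-only skills on GUI and other visual-centric tasks.
Details
Motivation: Most existing skill-learning methods store reusable experience as text-only assets such as instructions and reasoning traces. This creates a fundamental bottleneck for visual-centric tasks, where reusable knowledge usually depends on visual and spatial information.
Result: Experiments on GUI and other visual-centric tasks show that visual skills consistently outperform text-only skills, particularly on tasks that require spatial correspondence, visual evidence, and state-aware interaction.
Insight: The core contribution is a multimodal skill paradigm that combines textual logic with explicit visual support, together with an automatic construction system. The key is to pair "what to do" with "where to look, how to inspect, and how to verify visual outcomes", so that a skill becomes a multimodal asset containing spatial references, visual boundaries, and interaction patterns, giving future multimodal agents a richer mechanism for reusing experience.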
Abstract: Reusable skills are a key mechanism for extending agent capabilities, allowing agents to accumulate experience and solve increasingly complex tasks. Yet most existing skill-learning methods store reusable experience as text-only assets, such as instructions, reasoning traces, or summarized trajectories. We argue that this text-only paradigm creates a fundamental bottleneck for visual-centric tasks, where reusable knowledge often depends on spatial layout, visual grounding, fine-grained appearance, and localized state changes. To address this limitation, we propose a multimodal skill paradigm that combines declarative textual logic with explicit visual support. We distinguish three reusable forms: static priors for stable spatial conventions, dynamic priors for in-situ visual working memory, and interleaved visual skills that bind ordered text steps to the source frames, screenshots, or page regions that justify them. Rather than only describing what to do, visual skills also encode where to look, how to inspect, and how to verify visual outcomes. To scale visual-skill construction, we introduce an automatic system that converts agent experience into reusable multimodal skills by preserving textual reasoning, spatial references, visual boundaries, and interaction patterns from task trajectories. Experiments on GUI and other visual-centric tasks show that visual skills consistently outperform text-only skills, particularly when success requires spatial correspondence, visual evidence, and state-aware interaction. These results support our central position: reusable agent skills should go beyond text and become multimodal assets for future multimodal agents.
[188] DENSER: Depth-Guided Ensemble with Staged EFA-GS Reconstruction for Soccer Novel View Synthesis cs.CV
Parthsarthi Rawat
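TL;DR: This paper proposes DENSER for novel view synthesis of soccer scenes. It extends the EFA-GS framework with camera-height-based loss weighting, monocular depth supervision from Depth-Anything-V2 to regularize geometry in textureless regions, and a three-model pixel-average ensemble. On five held-out challenge scenes it achieves a mean PSNR of 29.89 dB, SSIM of 0.791, and LPIPS of 0.366.
Details
Motivation: Novel view synthesis of soccer scenes is hard because the uniform pitch texture and large viewpoint changes make geometry reconstruction difficult. The method targets this problem and specifically optimizes for ground-level broadcast views.
Result: On five held-out challenge scenes, the method achieves a mean PSNR of 29.89 dB, SSIM of 0.791, and LPIPS of 0.366.
Insight: The contributions are: 1) a camera-height-based loss weighting scheme that prioritizes broadcast views; 2) the use of a pretrained monocular depth estimator (Depth-Anything-V2) as geometric regularization; 3) ensemble members that diverge from a shared base checkpoint by varying training length and Gaussian scale clamping.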
Abstract: We propose DENSER, a Depth-guided ENSemble with Staged EFA-GS Reconstruction for soccer novel view synthesis. DENSER extends EFA-GS with three key contributions: (1) camera-height-based loss weighting that prioritises ground-level broadcast views, (2) monocular depth supervision from Depth-Anything-V2 to regularise geometry in textureless regions, and (3) a three-model pixel-average ensemble whose members diverge from a shared base checkpoint by varying training length and Gaussian scale clamping. On five held-out challenge scenes we achieve a mean PSNR of 29.89 dB, SSIM of 0.791, and LPIPS of 0.366.
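The pixel-average ensemble and the PSNR metric mentioned above can be illustrated with a minimal sketch. It assumes single-channel images stored as nested lists with values in [0, 1]; the function names `pixel_average` and `psnr` are illustrative and are not taken from the DENSER code.

```python
import math


def pixel_average(renders):
    """Average several same-sized grayscale renders pixel by pixel."""
    n = len(renders)
    rows, cols = len(renders[0]), len(renders[0][0])
    return [[sum(img[r][c] for img in renders) / n for c in range(cols)]
            for r in range(rows)]


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    sq_errors = [(p - t) ** 2
                 for pred_row, target_row in zip(pred, target)
                 for p, t in zip(pred_row, target_row)]
    mse = sum(sq_errors) / len(sq_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Averaging in pixel space needs no retraining: each member renders the same view and the outputs are combined afterwards.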
[189] SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation cs.CVPDF
Yingzi Ma, Xiaogeng Liu, Yawen Zheng, Chaowei Xiao
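TL;DR: This paper introduces SafeGen-Bench, a benchmark for evaluating the safety of conditional text-to-video (T2V) generation models. It defines 10 malicious categories, focusing on risks tied to temporal sequences and depicted behaviors, and simulates realistic inputs with carefully selected start frames paired with text prompts. The evaluation finds that current models struggle to avoid malicious content when generating high-quality video, and that unimodal guardrails have limited effect.
Details
Motivation: Existing benchmarks focus on video generation safety under malicious text prompts. They overlook the common and challenging case where the text and image inputs are each safe but their combination still yields harmful video content.
Result: Across a range of conditional T2V models evaluated on SafeGen-Bench, current models fail to consistently avoid generating malicious content, with unsafety scores reaching 44.5. An evaluation of text-based and image-based guardrails shows that unimodal guardrails are insufficient, with an 80% failure rate across seven malicious categories.
Insight: The contribution is the first benchmark dedicated to safety in image-conditioned text-to-video generation. It highlights the distinct risks that arise from combining multimodal inputs, exposes the limits of existing unimodal guardrails, and points toward safer and more controllable T2V models.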
Abstract: With the rapid advancements in text-to-image diffusion models, generative video models (T2V models) like Sora can now produce short synthetic videos from a text prompt or an initial image. However, synthetic video generation, especially when guided by an initial image, often poses risks, including the potential creation of illegal, politically sensitive, or unethical content. Existing benchmarks have started to consider the safety of generated videos, but they primarily focus on testing models with malicious text prompts, ignoring the scenario where the combination of a text prompt and an image may still lead to harmful video content. In practice, this is a common and challenging issue: videos generated from safe text and image inputs can nonetheless convey harmful information. To bridge this gap, we introduce SafeGen-Bench, a benchmark specifically designed to evaluate the safety of conditional T2V models. Our benchmark defines 10 malicious categories, concentrating on risks related to both temporal sequences and depicted behaviors. SafeGen-Bench consists of carefully selected start frames from diverse image and video sources, paired with corresponding text prompts to simulate realistic inputs. We evaluate a variety of conditional T2V models on SafeGen-Bench, and the results indicate that current models struggle to consistently avoid generating malicious content, with unsafety scores reaching up to 44.5, especially under conditions requiring high quality. Furthermore, we assess the effectiveness of both text-based and image-based guardrails on our benchmark, finding that unimodal guardrails alone are insufficient to provide a robust defense, with an 80% failure rate across seven malicious categories. We hope that SafeGen-Bench will foster the development of safer and more controllable conditional T2V models.
[190] On the Limits of Token Reduction for Efficient Unified Vision Language Training cs.CV | cs.AI | cs.CLPDF
Siyi Chen, Weiming Zhuang, Jingtao Li, Lingjuan Lv
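TL;DR: This paper studies the feasibility and limits of token-reduction-based acceleration for training unified vision-language models. It finds that visual understanding has substantial visual-token redundancy in late layers, whereas visual generation depends on image tokens throughout the network's depth. This asymmetry makes it possible to design a token-reduction accelerator specific to each task. Under unified training, however, task-specific token dropping causes a synergy loss, indicating that efficient unified modeling needs to preserve shared cross-task structure.
Details
Motivation: Joint training of unified vision-language models is computationally expensive and under-studied from an efficiency perspective. The paper explores the feasibility and fundamental limits of token-reduction-based acceleration in this setting.
Result: A systematic analysis of layerwise attention allocation reveals a fundamental asymmetry in token dependence between visual understanding and visual generation. Task-specific accelerators designed from this observation deliver significant efficiency gains when tasks are trained in isolation, but lead to a consistent synergy loss under unified training.
Insight: The core contribution is exposing the depth-wise asymmetry in how visual understanding and visual generation depend on image tokens in unified VLMs. The key takeaway is that efficient unified modeling cannot simply reuse acceleration strategies designed for isolated tasks; it needs synergy-aware acceleration that preserves cross-task gains.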
Abstract: Unified vision-language models (VLMs) integrate visual understanding and visual generation within a single autoregressive backbone, but their joint training is computationally expensive and largely overlooked from an efficiency perspective. In this work, we study the feasibility and limits of token-reduction-based acceleration for unified VLM training. Through a systematic analysis of layerwise attention allocation, we uncover a fundamental asymmetry: visual understanding exhibits substantial late-layer visual redundancy, whereas visual generation maintains persistent dependence on image tokens across depth. Guided by this observation, we design task-specific accelerators that selectively reduce image-token computation for each objective. While these methods achieve significant efficiency gains in isolated settings, we observe a consistent synergy loss under unified training – task-specific token dropping necessitates divergent parameter pathways and eliminates the mutual performance gains typically observed in joint optimization. Our findings suggest that efficient unified modeling requires preserving shared cross-task structures, highlighting the need for synergy-aware acceleration strategies. Project page: https://chicychen.github.io/TokenReductionUnifiedVLM/.
[191] Perception First: A Frontier Native-Video Model with Self-Consistency for Implicit Video Question Answering cs.CV | cs.LGPDF
Ali Alavi
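TL;DR: This paper describes a submission to the CVPR 2026 VRR Challenge, built on the ImplicitQA / VRR-QA benchmark, which targets implicit video question answering that requires reasoning over spatial layout, motion, depth, viewpoint, causality, and social context. Through systematic training-free experiments across open-source Video-LMMs and inference strategies, the central finding is that the benchmark is limited by perception rather than reasoning: a perceptually strong base model and lightweight test-time denoising are the only reliable levers.
Details
Motivation: The paper tackles implicit video question answering, where the answer cannot be observed in any single frame and must be inferred from spatial layout, motion, depth, viewpoint, causality, and social context across discontinuous frames. It aims to identify the core bottleneck of such tasks.
Result: Experiments on ImplicitQA / VRR-QA show that reasoning-side augmentations (such as chain-of-thought and question decomposition) are neutral or even harmful, whereas base-model perceptual capability and lightweight test-time denoising help. Relative depth, viewpoint, and counting are the hardest categories, while causal and social reasoning are nearly solved. A prompt that injects monocular depth cues to attack the weakest category lowered test accuracy by 5.8 points.
Insight: The contribution is a systematic study showing that implicit video question answering is perception-bound rather than reasoning-bound, which challenges the common reliance on elaborate reasoning strategies. The key takeaway is that stronger basic visual perception matters most for video tasks requiring cross-frame understanding, and that complex post-processing or reasoning augmentation can backfire.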
Abstract: We describe our submission to the VRR Challenge @ CVPR 2026, built on the ImplicitQA / VRR-QA benchmark: multiple-choice video question answering in which answers are deliberately not observable in any single frame and must be inferred from spatial layout, motion, depth, viewpoint, causality, and social context across discontinuous frames of creative video. We conduct a systematic, training-free study spanning open-source Video-LMMs (Qwen2.5-VL, Qwen3-VL, InternVL3, Gemma-3, and the RL-tuned video reasoners Video-R1 and VideoChat-R1.5) and a battery of inference-time strategies (chain-of-thought, question decomposition, describe-then-reason cascades, audio transcripts, spatial state prompting, self-consistency, multi-model ensembling, and category routing). Our central finding is that this benchmark is perception-bound rather than reasoning-bound: reasoning-side augmentations are neutral-to-harmful, whereas base-model perceptual capability and lightweight test-time denoising are the only reliable levers. A per-category error analysis localizes the difficulty to low-level perception: relative depth, viewpoint, and counting are the hardest categories, while causal and social reasoning are nearly solved. A prompt that explicitly injects monocular depth cues to attack the weakest category lowers test accuracy by 5.8 points, confirming that the model needs a better percept, not a better procedure.
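Self-consistency, one of the inference-time strategies listed above, amounts to sampling several answers and taking a majority vote. The sketch below is a generic illustration for multiple-choice answers; the function name and the tie-breaking rule (earliest sampled answer wins) are assumptions, not details from the submission.

```python
from collections import Counter


def self_consistency_vote(answers):
    """Return the most frequent answer; ties go to the earliest sampled one."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:  # preserves sampling order for tie-breaking
        if counts[answer] == best:
            return answer
```

For example, `self_consistency_vote(["B", "A", "B", "C", "B"])` selects `"B"`, the option chosen by three of the five samples.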
[192] MotionDreamer: Universal Skeletal Motion Generation for 3D Rigged Shapes cs.CV | cs.GRPDF
Ye Tao, Yuxin Yao, Kendong Liu, Dapeng Wu, Junhui Hou
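TL;DR: MotionDreamer is a diffusion-based, general-purpose skeletal animation framework that generates skeletal motion for 3D rigged shapes of varied topology from 2D video guidance. By building a large-scale dynamic dataset and proposing a structural-semantic injection mechanism, it addresses the poor generalization of template-based methods and the inefficiency and local-optimum sensitivity of per-case optimization, achieving high-fidelity animation synthesis across categories.
Details
Motivation: Template-based methods are tied to specific topologies, while per-case optimization is computationally expensive and vulnerable to local optima and viewpoint ambiguity. The paper targets the resulting generalization, efficiency, and robustness challenges in animating 3D rigged shapes of diverse morphology, with the goal of scalable 4D asset production.
Result: Extensive experiments show that the method significantly outperforms existing methods, setting a new state of the art (SOTA) for robust and efficient 4D asset generation.
Insight: The main contributions are: 1) a large-scale dynamic dataset of about 20,000 diverse 3D models, each with complete textures, skeletal rigging, and rich animation sequences, which addresses the scarcity of high-quality training data; 2) a structural-semantic injection mechanism that integrates texture and semantic attributes directly into skeletal joint representations, bridging the kinematic gap between 2D visual motion cues and heterogeneous 3D skeletal structures and enabling anatomically consistent animation across unseen categories.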
Abstract: Motion generation for rigged shapes is vital for scalable 4D asset production. However, template-based methods are limited by specific topologies and fail to generalize across diverse morphologies. Conversely, per-case optimization is computationally expensive, susceptible to local optima, and highly sensitive to viewpoint-induced ambiguities. In this paper, we present MotionDreamer, a diffusion-based framework designed for category-agnostic skeletal animation generation from 2D video guidance. To overcome the scarcity of high-quality training data, we have curated a large-scale dynamic dataset comprising approximately 20,000 diverse 3D models, each featuring complete textures, skeletal rigging, and a wide array of comprehensive animation sequences. To bridge the kinematic gap between 2D visual motion cues and heterogeneous 3D skeletal structures, we propose a structural-semantic injection mechanism. Our model integrates texture and semantic attributes directly into skeletal joint representations. This allows it to map perceived visual dynamics to specific joint hierarchies and their functional roles. This enables MotionDreamer to synthesize high-fidelity animations that maintain anatomical consistency across a vast range of unseen categories, from existing biological species to fantastical beings. Extensive experiments demonstrate that our approach significantly outperforms existing methods, setting a new state-of-the-art benchmark for robust and efficient 4D asset generation. The code will be made publicly available upon acceptance.
[193] PathAR: Structure-First Autoregressive Synthesis of Multimodal Pathology Images cs.CVPDF
Yuan Zhang, Jiahao Xia, Junzhang Huang, Meng Wang, Feng Chen
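TL;DR: This paper proposes PathAR, a structure-first autoregressive synthesis framework for generating multimodal pathology images. A dual vector quantization (Dual-VQ) tokenizer decomposes samples into mask-grounded structure tokens and appearance tokens, and an interleaved autoregressive (IAR) transformer with asymmetric attention visibility enforces structure-to-appearance dependence, so that modality-specific appearance is synthesized while anatomical structure stays consistent.
Details
Motivation: Multimodal pathology data are scarce, which calls for a unified generative model that synthesizes modality-specific appearance while keeping anatomical structure coherent. Existing methods usually model structure and appearance in a single homogeneous token stream, which weakens structural controllability when the modality changes.
Result: Extensive experiments show that PathAR improves structural consistency and modality fidelity over baselines, maintains sample diversity, supports downstream segmentation in data-scarce regimes, and extends to finer-grained intra-modality organ-label variation.
Insight: The contribution is the explicit factorization of structure and appearance: dual vector quantization tokenization and an interleaved autoregressive transformer yield structure-first generation, improving structural controllability and alignment in multimodal pathology image synthesis.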
Abstract: Data scarcity in multimodal pathology motivates unified generative models that synthesize modality-specific appearance while preserving anatomically coherent structure. Although modalities differ in appearance statistics, morphological structures such as cellular topology and tissue boundaries are largely preserved across acquisition protocols. However, existing methods often model these factors within a homogeneous token stream, implicitly coupling structure with appearance and weakening structural controllability under modality shifts. To address this, we propose pathology Autoregressive modeling (PathAR), a structure-first autoregressive synthesis framework that explicitly factorizes structure and appearance for modality-label-conditioned pathology generation. PathAR employs a dual vector quantization (Dual-VQ) tokenizer to decompose samples into mask-grounded structure and appearance tokens, and an interleaved autoregressive (IAR) transformer with asymmetric attention visibility to enforce structure-to-appearance dependence. PathAR stabilizes morphology under heterogeneous modality-specific appearances and enables spatially aligned image–mask pair generation. Extensive experiments show that PathAR improves structural consistency and modality fidelity over baselines, maintains sample diversity, supports downstream segmentation in data-scarce regimes, and demonstrates extensibility to finer-grained intra-modality organ-label variation.
[194] Attention-guided Fine-tuning of Multimodal Large Language Models Improves Chain-of-Thought Reasoning cs.CVPDF
Sanchit Sinha, Guangzhi Xiong, Bohan Liu, Zhenghao He, Aidong Zhang
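TL;DR: This paper examines how effective chain-of-thought (CoT) prompting is in multimodal large language models (MLLMs) and finds that it underperforms on several visual reasoning benchmarks. A systematic analysis identifies two recurring failure modes: premature answer commitment and limited visual-token access during rationale generation. The authors propose Attentive-CoT (Att-CoT), an attention-guided fine-tuning method that improves CoT reasoning trajectories, and validate it on three visual reasoning benchmarks.
Details
Motivation: The effectiveness of chain-of-thought prompting in MLLMs is uncertain, and it often degrades performance. The paper systematically analyzes how CoT fails in MLLMs and designs an improved fine-tuning method to strengthen visual reasoning.
Result: Experiments with six MLLMs on three visual reasoning benchmarks (datasets that require step-wise visual evidence) show that Att-CoT improves CoT performance over standard fine-tuning (CoT-SFT).
Insight: The contribution is the attention-guided fine-tuning objective Att-CoT, which encourages the model to delay answer commitment and keep sustained access to visual tokens while generating the reasoning chain. It requires no architectural changes and plugs directly into existing CoT-SFT training pipelines, offering a reusable remedy for the inherent weaknesses of CoT reasoning in MLLMs.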
Abstract: The effectiveness of Chain-of-Thought (CoT) prompting in Multimodal Large Language Models (MLLMs) remains uncertain: across several visual reasoning benchmarks, CoT prompting often degrades performance compared to direct prompting. In this paper, we provide a systematic analysis of CoT behavior in three modern MLLM families across model scales on datasets requiring step-wise visual evidence. Our analysis identifies two recurring failure modes: premature answer commitment and limited direct visual-token access during rationale generation. We further find that standard CoT-style Supervised Fine-Tuning (CoT-SFT) can mitigate these issues only partially, while often increasing reliance on textual priors and reducing counterfactual visual dependence. Motivated by these findings, we propose Attentive-CoT (Att-CoT), an attention-guided fine-tuning objective that encourages CoT trajectories to delay answer commitment while maintaining sustained visual-token access. Att-CoT can be plugged into any CoT-SFT training run without architectural changes. Experiments on three visual reasoning benchmarks across six MLLMs show that Att-CoT enhances CoT performance over standard fine-tuning.
[195] Deformable Wiener Filter for Future Video Coding cs.CVPDF
Xuewei Meng, Chuanmin Jia, Xinfeng Zhang, Shanshe Wang, Siwei Ma
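TL;DR: This paper proposes a deformable Wiener filter (DWF) for future video coding. It combines local and non-local characteristics and trains the filter coefficients in a supervised manner based on Wiener filter theory. Through sample classification and adaptive fusion of reference samples, it achieves notable bit-rate savings within the VVC framework.
Details
Motivation: The in-loop filters in VVC mainly exploit local image similarity. Non-local filters can compensate for this, but their widely used unsupervised parameter estimation limits performance. A method that combines local and non-local characteristics with supervised training is therefore needed to improve filtering.
Result: Relative to the VTM-11.0 anchor, the proposed method achieves average bit-rate savings of 1.16%, 1.92%, and 2.67% under the All Intra, Random Access, and Low-Delay B configurations, respectively, reaching a state-of-the-art level.
Insight: The contribution lies in adaptively fusing local and non-local reference samples and training Wiener filter coefficients in a supervised manner on top of sample classification. Viewed objectively, it combines classical Wiener filter theory with the deformable and non-local ideas found in deep learning methods, offering a new direction for in-loop filter design.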
Abstract: In-loop filters have attracted increasing attention due to their remarkable noise-reduction capability in the hybrid video coding framework. However, the existing in-loop filters in Versatile Video Coding (VVC) mainly take advantage of local image similarity. Although some non-local in-loop filters can make up for this shortcoming, the unsupervised parameter estimation widely used by non-local filters limits their performance. In view of this, we propose a deformable Wiener Filter (DWF). It combines local and non-local characteristics and trains the filter coefficients in a supervised manner based on Wiener filter theory. In the filtering process, local adjacent samples and non-local similar samples are first derived for each sample of interest. Then the to-be-filtered samples are classified into specific groups based on patch-level noise and sample-level characteristics. Samples in each group share the same filter coefficients. After that, the local and non-local reference samples are adaptively fused based on the classification results. Finally, the filtering operation with outlier data constraints is conducted for each to-be-filtered sample. Moreover, the performance of the proposed DWF is analyzed in detail with different reference sample derivation schemes. Simulation results show that the proposed approach achieves 1.16%, 1.92%, and 2.67% bit-rate savings on average compared to VTM-11.0 for the All Intra, Random Access, and Low-Delay B configurations, respectively.
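The supervised coefficient training described above can be read as a standard Wiener (least-squares) fit per sample group. The following is a sketch under stated assumptions, not the paper's notation: for a group $g$, $\mathbf{r}_i$ is assumed to stack the fused local and non-local reference samples for the $i$-th to-be-filtered sample, $o_i$ is the corresponding original (uncompressed) sample, and the autocorrelation matrix is assumed to be invertible.

```latex
\mathbf{w}_g^{\ast}
  = \arg\min_{\mathbf{w}} \sum_{i \in g} \left( o_i - \mathbf{w}^{\top}\mathbf{r}_i \right)^{2}
  = \Big( \sum_{i \in g} \mathbf{r}_i \mathbf{r}_i^{\top} \Big)^{-1}
    \sum_{i \in g} \mathbf{r}_i \, o_i
```

Because the original samples are available only at the encoder, coefficients obtained this way would have to be derived there; how DWF conveys or derives them at the decoder is not stated in the abstract.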
[196] TLG: Temporal-Logic Grounding for Video Question Answering via Source-Annotation Reconstruction and Category-Targeted Reasoning cs.CV | cs.LGPDF
Ali Alavi
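TL;DR: This paper presents TLG (Temporal-Logic Grounding), a system for temporal-logic reasoning in video question answering. It reconstructs each video's action timeline, parses questions into temporal-logic programs and executes them, and combines an open vision-language model with a frontier reasoning model, substantially improving performance on the TimeLogic Challenge benchmark.
Details
Motivation: End-to-end video-language models treat video as a bag of frames and cannot localize when actions occur, so they perform near chance on tasks that require formal temporal-logic reasoning.
Result: On the TimeLogic Challenge test set, TLG raises accuracy from a 46.9% vision-language model baseline to 71.37%, an absolute gain of about 24.5 points, coming within 3 points of the top of the leaderboard.
Insight: The contribution is obtaining precise action timelines by reconstructing the source-dataset annotations and parsing questions into executable temporal-logic programs, rather than relying on larger models. The study identifies temporal grounding as the current irreducible bottleneck and shows that real annotations are essential to the accuracy gain.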
Abstract: The TimeLogic Challenge evaluates formal temporal-logic reasoning over video: 16 operators (before, after, until, since, always, co-occur, ordering, …) in boolean and 4-way multiple-choice form. End-to-end video-language models (VLMs) hover near chance on this task because they treat video as a bag of frames and cannot localize when actions occur. We present TLG (Temporal-Logic Grounding), a three-tier system that (i) reconstructs each video’s action timeline from the public source-dataset annotations the benchmark was generated from, parses every question into a temporal-logic program, and executes it deterministically; (ii) falls back to a strong open VLM where no annotation exists; and (iii) routes only the question categories where the VLM is empirically weakest to a frontier reasoning model. TLG raises test accuracy from a 46.9% VLM baseline to 71.37%, an absolute gain of about 24.5 points, reaching within 3 points of the leaderboard top. We report extensive ablations, including three model-based timeline-reconstruction variants that all underperform a holistic VLM, isolating temporal grounding as the irreducible bottleneck and showing that real annotations, not larger models, drive accuracy.
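The first tier, executing a parsed question deterministically against an action timeline, can be sketched as follows. This is a minimal illustration covering three of the listed operators; the timeline format, the `(operator, action_a, action_b)` program shape, and the interval semantics are all assumptions rather than the benchmark's or TLG's definitions, and each action is assumed to occur exactly once.

```python
def before(timeline, a, b):
    """True if action a ends no later than action b starts."""
    return timeline[a][1] <= timeline[b][0]


def after(timeline, a, b):
    """True if action a starts no earlier than action b ends."""
    return before(timeline, b, a)


def co_occur(timeline, a, b):
    """True if the two action intervals overlap in time."""
    return timeline[a][0] < timeline[b][1] and timeline[b][0] < timeline[a][1]


OPERATORS = {"before": before, "after": after, "co_occur": co_occur}


def execute(timeline, program):
    """Run a parsed (operator, action_a, action_b) program on a timeline.

    timeline maps an action label to its (start_sec, end_sec) interval.
    """
    op, a, b = program
    return OPERATORS[op](timeline, a, b)
```

With a timeline such as `{"open fridge": (1.0, 3.0), "pour milk": (4.0, 6.5)}`, the program `("before", "open fridge", "pour milk")` evaluates to `True` without consulting any model, which is what makes this tier deterministic.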
[197] Effective Multi-sensor Conditioning for Street-view Novel-view Synthesis cs.CV | cs.GRPDF
Zhengfei Kuang, Adam Sun, Liyuan Zhu, Tong Wu, Shengqu Cai
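TL;DR: This paper proposes StreetNVS, a video diffusion framework for street-view novel view synthesis. A Reference-Enhanced Camera Attention module based on a relative ray-level positional encoding jointly exploits three sensor signals: LiDAR, multi-camera images, and camera poses. A two-stage curriculum training strategy gradually adapts the model to sparse LiDAR input. On the Waymo Open Dataset it clearly outperforms existing methods and can synthesize coherent videos along paths that deviate strongly from the recorded trajectory.
Details
Motivation: Existing methods use only part of the onboard sensor signal (such as sparse LiDAR or multi-camera images), so novel view synthesis quality degrades as the target trajectory departs from the original driving path. The paper frames this as a multi-sensor fusion problem that requires jointly using the accurate but incomplete geometry from LiDAR, the dense appearance from surround-view images, and the cross-view association provided by camera poses.
Result: On the Waymo Open Dataset, StreetNVS substantially outperforms SOTA baselines under sparse LiDAR conditioning and matches methods that rely on 10-100 times denser point clouds.
Insight: The contribution is a unified diffusion framework that jointly conditions on multiple sensor signals through the Reference-Enhanced Camera Attention module, together with a two-stage curriculum that gradually adapts the model to sparse geometric input. This offers a new approach to fusing geometry and appearance in street-view synthesis.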
Abstract: Modern vehicle platforms are equipped with a rich sensor suite, including LiDAR, calibrated multi-camera rigs, and accurate ego-motion, that in principle offers a strong signal for re-rendering a driving scene from novel viewpoints. A growing line of recent work leverages video diffusion models for this task, using their generative priors to synthesize plausible novel views from sparse vehicle observations. In practice, however, existing methods exploit only a fragment of this signal, and their quality tends to degrade as the target trajectory departs from the recorded driving path. We argue that this is fundamentally a multi-sensor fusion problem: sparse LiDAR reprojections supply accurate but incomplete metric geometry, surround-view reference imagery supplies dense appearance but no metric depth, and camera poses tie the two together across views. We introduce StreetNVS, a video diffusion framework that jointly conditions on all three signals through a Reference-Enhanced Camera Attention module based on a relative ray-level positional encoding. We develop a two-stage curriculum training strategy that gradually exposes the model to increasingly sparse LiDAR. On the Waymo Open Dataset, StreetNVS substantially outperforms state-of-the-art baselines under sparse LiDAR conditioning and matches methods that rely on 10-100 times denser point clouds. We further show that it can synthesize coherent videos along extreme out-of-trajectory paths such as elevation, lane-shift, pullback, and rotation. Our website: https://streetnvs.github.io
[198] RoboTrustBench: Benchmarking the Trustworthiness of Video World Models for Robotic Manipulation cs.CV | cs.CL | cs.ROPDF
Huiqiong Li, Jiayu Wang, Zhiting Mei, Anirudha Majumdar, Jingjing Chen
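TL;DR: This paper presents RoboTrustBench, a benchmark for evaluating the trustworthiness of video world models in robotic manipulation. It contains 1,207 expert-validated instruction-image pairs covering four scenarios (Normal, Constraint-Sensitive, Counterfactual, and Adversarial) and uses a six-dimensional evaluation protocol. An evaluation of seven representative models finds that current models do well on visual coherence but fall short on constraint reasoning, counterfactual grounding, physical interaction, and unsafe-instruction suppression.
Details
Motivation: Existing benchmarks mostly evaluate video world models under valid, feasible, and safe instructions, and lack a systematic assessment of their trustworthiness in complex, adversarial, or non-ideal scenarios.
Result: Seven representative video world models were evaluated on RoboTrustBench with human and MLLM assessment. The models perform poorly on constraint reasoning, counterfactual grounding, physical interaction, and unsafe-instruction suppression, showing that visual quality and surface-level instruction following are insufficient for trustworthy robotic video world modeling.
Insight: The contribution is the first multi-scenario benchmark focused on the trustworthiness of video world models in robotic manipulation. It exposes key weaknesses of current models in deeper reasoning and safety and points to directions for future research.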
Abstract: Video world models are increasingly used in robotic manipulation, yet existing benchmarks mostly evaluate them under valid, feasible, and safe instructions. We introduce RoboTrustBench, a benchmark for evaluating the trustworthiness of video world models under four scenarios: Normal, Constraint-Sensitive, Counterfactual, and Adversarial. Built from real-world DROID episodes, RoboTrustBench contains 1,207 expert-validated instruction-image pairs and a six-dimensional evaluation protocol with 13 fine-grained criteria. Evaluating seven representative video world models with human and MLLM assessment, we find that current models often generate visually coherent videos, but struggle with constraint reasoning, counterfactual grounding, physical interaction, and unsafe-instruction suppression. These results show that visual quality and surface-level instruction following are insufficient for trustworthy robotic video world modeling.
[199] Paving the Way for Point Cloud Video Representation Learning Using A PDE Model cs.CVPDF
Zhuoxu Huang, Zhenkun Fan, Jungong Han, Josef Kittler
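TL;DR: This paper proposes MotionPDE, a new method for point cloud video representation learning. It formulates spatial-temporal correlation learning as a solvable partial differential equation (PDE) and uses a contrastive learning structure to guide and refine the PDE solving process. As a plug-and-play enhancement module, MotionPDE improves existing backbone models while adding little computational overhead and few parameters.
Details
Motivation: Traditional methods, especially flow-based techniques, struggle to capture spatial-temporal correlations because sequential point cloud data are spatially unordered. The paper explores a new way to regularize spatial-temporal correlation learning for point cloud video.
Result: The paper demonstrates the utility and adaptability of MotionPDE for interpreting point cloud video data and reports promising results. The abstract gives no specific quantitative results or benchmarks, but indicates that the method is effective as an enhancement module.
Insight: The main contribution is bringing the PDE framework, long effective in the physical domain, to the analysis of point cloud video, a newer kind of sequential data, and combining it with contrastive learning that supervises the PDE solving process and thereby regularizes spatial-temporal feature learning. Viewed objectively, this is a cross-domain transfer of method that offers a new regularization idea and a plug-and-play modular design for unordered spatial-temporal data.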
Abstract: Investigating spatial-temporal correlations, specifically how spatial points vary over time, is crucial for understanding point cloud videos. Traditional methods, particularly flow-based techniques, struggle with these correlations due to the unordered spatial arrangement of sequential point cloud data. To address this challenge, we propose a novel approach that regularizes spatial-temporal correlation learning by formulating the problem as a solvable Partial Differential Equation (PDE). While PDEs have long been effective in the physical domain, their application to novel sequential data like point cloud video remains underexplored. Inspired by fluid analysis, we construct a simplified PDE, and the process of solving PDE is guided and refined by a contrastive learning structure between the temporal embeddings and the spatial embeddings. With this extra supervision, our method, named MotionPDE, serves as an effective, plug-and-play enhancement module for existing backbone models, adding minimal computational overhead and parameters. Capitalizing on the contrastive learning process, we delve deeper into the self-supervised capabilities of MotionPDE, yielding promising results that underscore its utility and adaptability in point cloud video data interpretation. The code repo with trained checkpoints will be available at https://github.com/zhh6425/motionpde.git for facilitating future research.
[200] Self-Improving Small Object Grounding in LVLMs cs.CV | cs.LGPDF
Tianze Yang, Yucheng Shi, Ruitong Sun, Ninghao Liu, Jin Sun
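TL;DR: This paper proposes a self-improving small-object grounding method based on the internal attention patterns of large vision-language models (LVLMs), with no fine-tuning of the model. By analyzing attention structure, it trains a lightweight IoU regressor to predict bounding-box quality and develops two frameworks, the learned ACS-Learned and the training-free ACS-Free, that select the best box from a set of candidates to improve grounding. Experiments on COCO and Objects365 confirm clear gains in small-object localization.
Details
Motivation: LVLMs are unreliable at grounding small objects. The paper asks whether the model's internal attention patterns can identify high-quality bounding boxes without additional fine-tuning or complex training.
Result: On COCO and Objects365, the method achieves up to 19% self-improvement on small-object localization. The training-free ACS-Free ranks best among all training-free methods, and the IoU regressor's predictions reach a Pearson correlation of r > 0.67, a strong correlation.
Insight: The contribution is showing that LVLM attention structure encodes grounding quality, and using this to build a lightweight regressor and a training-free selector. Viewed objectively, analyzing attention entropy on the key transformer layers and heads improves the reliability and interpretability of grounding and offers a new view into the model's internal mechanisms.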
Abstract: Can internal attention patterns in Large Vision Language Models (LVLMs) identify reliable small-object boxes without fine-tuning? In this work, we provide an affirmative answer. Attention structure in LVLMs encodes grounding quality: a lightweight IoU regressor trained solely on attention maps achieves strong IoU prediction (Pearson r > 0.67). This regressor powers the regressor-based variant of our Attention-based Candidate Selection (ACS) framework, called ACS-Learned, which selects the best box from multiple sampled candidates to improve object grounding. By analyzing what the regressor learns, we reveal which transformer layers and heads are most critical and derive ACS-Free: a training-free selector that ranks candidates by attention entropy on these discriminative heads, with no learned component at inference. Experiments on COCO and Objects365 demonstrate up to 19% self-improvement on small object localization, with ACS-Free ranking best among all training-free methods, demonstrating that useful attention structure improves both localization reliability and interpretability in LVLMs.
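Ranking candidates by attention entropy, as ACS-Free is described as doing, can be sketched in a few lines. The abstract does not say which direction of entropy is preferred or how attention is aggregated over heads, so this sketch assumes one non-negative attention vector per candidate and assumes that lower entropy (more concentrated attention) indicates a better box; both function names are hypothetical.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of non-negative attention weights after normalization."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def select_candidate(candidates):
    """Pick the box whose attention is most concentrated.

    candidates is a list of (box, attention_weights) pairs. Preferring the
    lowest entropy is an assumption made for this illustration.
    """
    best_box, _ = min(candidates, key=lambda item: attention_entropy(item[1]))
    return best_box
```

No learned component is involved: the selector only reads attention that the model already produces for each sampled box.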
[201] Turing Patterns for Multimedia: Reaction-Diffusion Multi-Modal Fusion for Language-Guided Video Moment Retrieval cs.CV | cs.MMPDF
Xiang Fang, Wanlong Fang, Wei Ji, Tat-Seng Chua
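TL;DR: This paper proposes Reaction-Diffusion Multimodal Fusion (RDMF), a fusion framework inspired by Turing patterns, to address the difficulty of modeling dynamic, non-linear interactions in video-language models. It combines temporal diffusion of video features with non-linear text-video reactions, forming emergent patterns akin to those in biological systems, to improve tasks such as language-guided video moment retrieval.
Details
Motivation: Existing video-language models rely on static cross-attention or prompt-tuning mechanisms and cannot adaptively model how the relationship between modalities evolves, which leads to suboptimal alignment and limited generalization.
Result: Preliminary experiments indicate that RDMF has the potential to outperform existing methods at identifying salient video moments, offering a new paradigm for video-language tasks.
Insight: The contribution is bringing the Turing reaction-diffusion process from systems biology into multimodal fusion: the Gray-Scott model is used to build an efficient feature-interaction module, and Turing instability criteria support a rigorous mathematical stability analysis, making this an interdisciplinary methodological contribution.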
Abstract: Video-language models are pivotal for tasks such as moment retrieval and highlight detection, yet they often struggle to capture the dynamic, non-linear interactions between temporal video sequences and textual semantics. Existing approaches, relying on static cross-attention or prompt-tuning mechanisms, fail to adaptively model the evolving relationships between modalities, leading to suboptimal alignment and limited generalization. Inspired by systems biology, we propose Reaction-Diffusion Multimodal Fusion (RDMF), a novel framework that reimagines video-language alignment as a reaction-diffusion (RD) process, drawing on the principles of pattern formation introduced by Alan Turing. In RDMF, video features diffuse across time to capture temporal context, while text-video interactions are modeled as non-linear reactions that amplify relevant features and suppress noise, forming emergent patterns akin to biological systems. Leveraging the Gray-Scott RD model, we design a computationally efficient fusion module that integrates video and text representations, supported by rigorous mathematical analysis of stability and convergence using Turing instability criteria. Our framework is theoretically grounded, employing advanced mathematical tools to ensure stable pattern formation, and is practically viable, incorporating standard components like pretrained encoders and DETR-style heads for moment retrieval and saliency prediction. RDMF represents a pioneering interdisciplinary approach, bridging systems biology and multimedia research to address the limitations of conventional multimodal fusion. Preliminary experiments demonstrate its potential to outperform existing methods in identifying salient video moments, offering a new paradigm for video-language tasks.
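For readers unfamiliar with the Gray-Scott model named above, the sketch below shows one explicit Euler step of the classical two-field system on a 1D periodic grid: u is consumed by the reaction u·v² and replenished by the feed term, while v is produced by the reaction and removed at rate feed + kill. This is the textbook model only; how RDMF maps video and text features onto the two fields is not specified in the abstract, and the parameter values here are common illustrative defaults, not values from the paper.

```python
def gray_scott_step(u, v, du=0.16, dv=0.08, feed=0.035, kill=0.065, dt=1.0):
    """One explicit Euler step of the 1D Gray-Scott system with periodic boundaries."""
    n = len(u)
    new_u, new_v = [], []
    for i in range(n):
        # Discrete Laplacian; index i - 1 wraps to the last cell when i == 0.
        lap_u = u[i - 1] - 2 * u[i] + u[(i + 1) % n]
        lap_v = v[i - 1] - 2 * v[i] + v[(i + 1) % n]
        reaction = u[i] * v[i] * v[i]  # non-linear coupling term u * v^2
        new_u.append(u[i] + dt * (du * lap_u - reaction + feed * (1 - u[i])))
        new_v.append(v[i] + dt * (dv * lap_v + reaction - (feed + kill) * v[i]))
    return new_u, new_v
```

The uniform state u = 1, v = 0 is a fixed point of this update; patterns only emerge once v is perturbed somewhere on the grid.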
[202] Real-Time Generation of Streamable Talking Portrait Video with Reference-Guided Deep Compression VAEs cs.CVPDF
Sicheng Xu, Yu Deng, Shoukang Hu, Yichuan Wang, Yizhong Zhang
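TL;DR: This paper proposes a framework for real-time generation of talking portrait video in streaming scenarios. Conditioned on speech audio and reference images, it uses a causal video VAE for deep latent compression and an autoregressive latent denoising model to generate the video.
Details
Motivation: Video diffusion models have advanced portrait video generation, but their high computational demands limit their use in interactive applications. A method designed for streaming that generates high-quality talking portrait video in real time is therefore needed.
Result: The method generates video significantly faster than baseline models and is on par with or better than these large models in realism, vividness, and video quality.
Insight: The contributions are: a causal video VAE that takes a variable number of reference images as guidance, letting the network focus on dynamic information rather than static appearance and thereby improving compression efficiency and reconstruction quality; an extension of the residual auto-encoding paradigm to the VAE to better handle spatial-temporal causality; and a generator based on a Rectified Flow Transformer architecture that produces video latents in a blockwise autoregressive manner.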
Abstract: Video diffusion models have significantly advanced portrait video generation, yet their high computational demands limit their use in interactive applications. This work presents a framework for streamable talking portrait video generation conditioned on speech audio and reference images. Designed meticulously for streaming scenarios, it features a causal video VAE for deep latent compression and an autoregressive latent denoising model. Our causal VAE integrates a variable number of reference images as guidance, allowing the network to focus on dynamic information rather than static appearance, thereby enhancing compression efficacy and reconstruction quality. Additionally, we extend the residual auto-encoding paradigm to improve spatial-temporal causality handling in our VAE. The generator is based on a Rectified Flow Transformer architecture and produces video latents in a blockwise auto-regressive manner. Our method enables the real-time generation of high-quality talking portrait videos, achieving speeds significantly faster than baseline models. Furthermore, comprehensive experiments demonstrate that it is on par with or even outperforms these large models in realism, vividness, and video quality.
[203] What to Test Next: Interpretable Coverage Gap Discovery in Driving VLMs cs.CV | cs.SEPDF
Abhishek Aich, Sparsh Garg, Vijay Kumar BG, Turgun Yusuf Kashgari, Manmohan Chandraker
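TL;DR: This paper proposes SliceScorer and SliceNav for discovering interpretable coverage gaps when verifying driving vision-language models (VLMs). SliceScorer is a deterministic scoring rule that combines an exposure-based coverage prior with a neighbor-failure prior, prioritizing rare, under-tested regions and propagating risk from similar tested conditions. SliceNav is an LLM-orchestrated verification pipeline that embeds SliceScorer, selects relevant operators and vocabulary extensions from developer queries, and composes auditable verification workflows.
Details
Motivation: Driving VLMs must understand scenes accurately across the conditions defined by Operational Design Domains (ODDs), but existing verification is sparse and many slices are missing, which makes empirical failure rates unreliable. A method for discovering coverage gaps is needed to make verification more reliable.
Result: Experiments on three driving VLMs (WiseAD, DriveMM, Cosmos-Reason2-2B) show that SliceNav surfaces high-risk coverage gaps more effectively than prior slice-discovery methods while keeping recommendations diverse across the condition space. Ablations confirm that both components of the scoring rule contribute.
Insight: The contributions are SliceScorer, a deterministic scoring rule that combines a coverage prior with a risk-propagation prior, and SliceNav, a modular LLM-orchestrated verification pipeline. Together they enable interpretable, auditable coverage-gap discovery suited to safety-critical validation. Viewed objectively, the method pairs statistical prioritization with interpretability to give VLM verification a systematic framework.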
Abstract: Driving vision-language models (VLMs) must accurately understand scenes across diverse conditions defined by Operational Design Domains (ODDs), yet verification remains sparse: many slices are missing, making empirical failure rates unreliable. We propose SliceScorer, a deterministic scoring rule for missing-slice recommendation that combines (i) an exposure-based coverage prior to prioritize rare, under-tested regions, and (ii) a neighbor-failure prior that propagates risk from similar tested conditions. SliceScorer is deliberately simple - interpretable, auditable, and conservative - properties essential for safety-critical validation. For stress testing beyond the declared ODD, we embed SliceScorer within SliceNav, an LLM-orchestrated verification pipeline where the model interprets developer queries to select relevant operators (triage, scoring, acquisition, evaluation) and vocabulary extensions, composing verification workflows while keeping all scoring deterministic and auditable. Experiments on three driving VLMs (WiseAD, DriveMM, Cosmos-Reason2-2B) show that SliceNav surfaces high-risk coverage gaps more effectively than prior slice-discovery methods while maintaining diverse recommendations across the condition space. Ablations confirm both scoring components contribute, and qualitative analysis demonstrates end-to-end workflows from developer query to targeted evaluation.
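A deterministic rule with the two ingredients named above can be sketched as follows. The abstract does not give SliceScorer's formula, so everything here is a hypothetical stand-in: slices are tuples of condition attributes, similarity is the fraction of shared attributes, the coverage prior is `1 / (1 + exposure)`, the neighbor-failure prior is a similarity-weighted mean of observed failure rates, and `alpha` mixes the two.

```python
def similarity(slice_a, slice_b):
    """Fraction of condition attributes two slices share (hypothetical measure)."""
    return sum(x == y for x, y in zip(slice_a, slice_b)) / len(slice_a)


def slice_score(candidate, exposure, tested, alpha=0.5):
    """Score a missing slice; higher means test it sooner.

    exposure: how many samples already cover the candidate's region.
    tested:   dict mapping tested slices to their observed failure rates.
    """
    # Coverage prior: rare, under-tested regions score higher.
    coverage_prior = 1.0 / (1.0 + exposure)
    # Neighbor-failure prior: propagate risk from similar tested slices.
    weighted = [(similarity(candidate, s), rate) for s, rate in tested.items()]
    total = sum(w for w, _ in weighted)
    neighbor_prior = sum(w * r for w, r in weighted) / total if total > 0 else 0.0
    return alpha * coverage_prior + (1 - alpha) * neighbor_prior
```

Because the score is a closed-form function of counts and observed failure rates, the same inputs always give the same ranking, which is the property that makes such a rule auditable.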
[204] Pave-GRPO: Beyond Instantaneous Guidance through Principled Average Velocity Decomposition cs.CVPDF
Pengyang Ling, Jiazi Bu, Yujie Zhou, Yibin Wang, Zhenyu Hu
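TL;DR: This paper proposes Pave-GRPO, which improves the efficiency and granularity of preference-alignment training for flow-based generative models. Through a principled average velocity decomposition, it breaks each coarse trajectory transition into an ensemble of finer sub-trajectories, propagating reward feedback to a denser set of temporal stages without extra generation cost and enabling finer and more comprehensive preference optimization.
Details
Motivation: When training flow models with Group Relative Policy Optimization (GRPO), generating group rollouts for policy-gradient updates is expensive, so existing methods use very few denoising steps. Reward feedback therefore reaches only a handful of stages, most intermediate denoising steps receive no direct supervision, and alignment granularity suffers.
Result: Extensive experiments show that Pave-GRPO effectively advances preference alignment across different reward settings, with comprehensive performance improvements.
Insight: The contribution is the principled average velocity decomposition, which yields zero-cost horizon expansion and comprehensive temporal supervision. Without increasing the sampling budget, it reuses piece-wise group samples and their associated rewards to broaden the effective optimization scope, and it equivalently decomposes an instantaneous velocity target into a multi-timestep ensemble so that reward signals reach more intermediate stages of the denoising process.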
Abstract: Post-training via Group Relative Policy Optimization (GRPO) has emerged as a powerful paradigm for aligning flow-based generative models with human preferences. However, the iterative denoising nature of flow models incurs substantial costs when generating group rollouts for policy-gradient updates, compelling existing methods to train with extremely few denoising steps. This temporal sparsity severely restricts preference optimization: reward feedback can only reach a handful of stages per trajectory, leaving the vast majority of intermediate denoising steps without direct supervision and thus compromising alignment granularity. To address this, we propose Pave-GRPO, which reformulates the GRPO objective through Principled average velocity decomposition. Rather than generating expensive high-step rollouts, we maintain efficient few-step group sampling but decompose each coarse transition into an equivalent ensemble of finer sub-trajectories spanning multiple intermediate timesteps. This propagates reward feedback to a denser set of temporal stages for more comprehensive preference alignment without additional generation cost. This design offers two benefits: (i) zero-cost horizon expansion: through the direct reuse of piece-wise group samples and their associated rewards, Pave-GRPO significantly broadens the effective optimization scope under fixed sampling budgets; and (ii) comprehensive temporal supervision: by equivalently decomposing an instantaneous velocity target into a multi-timestep ensemble, it distributes reward signals across more intermediate stages of the denoising process, enabling finer-grained and more thorough preference optimization. Extensive experiments validate that Pave-GRPO effectively advances preference alignment across different reward settings, offering comprehensive performance enhancement.
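One way to read the decomposition, sketched under assumptions: the average velocity is taken to be the time-averaged instantaneous velocity over an interval, and the notation $u$, $v$, $t_k$ is introduced here for illustration rather than taken from the paper. Under that definition, the displacement over a coarse step equals the sum of displacements over its sub-intervals, so the coarse average velocity is a length-weighted ensemble of finer average velocities.

```latex
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau,
\qquad
(t - r)\, u(z_t, r, t) = \sum_{k=1}^{K} (t_k - t_{k-1})\, u(z_{t_k}, t_{k-1}, t_k),
\quad r = t_0 < t_1 < \dots < t_K = t .
```

The second identity is just additivity of the integral; how the paper attaches rewards to each sub-interval is not specified in the abstract.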
[205] Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation cs.CV | cs.ROPDF
Muyi Bao, Yuxin Cai, Hang Xu, Zongtai Li, Jinxi He
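TL;DR: Goal2Pixel proposes a pixel-based paradigm for vision-language navigation that recasts navigation as navigable pixel grounding. It predicts a goal pixel on the image plane and back-projects it into a 3D waypoint to guide robot motion, adds auxiliary directive regions for non-forward actions, and uses a visibility-aware keyframe memory for long-horizon navigation.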
Details
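Motivation: Existing VLM-based navigation methods usually treat navigation as low-level action prediction, an interface that is ambiguous, tied to short-horizon motion primitives, and inefficient because of repeated VLM queries.
Result: On R2R-CE Val-Unseen, Goal2Pixel achieves a 54.1% success rate (SR) and 52.5% SPL with only 7.75 VLM calls per episode, far fewer than the 46.62 calls required by direct action prediction (32.9% SR), reaching competitive state-of-the-art performance. The same trend holds on RxR-CE.
Insight: The core contribution is shifting navigation from action prediction to pixel grounding, with the image plane serving as a unified spatial interface between VLM reasoning and robot motion. Semantic embeddings and coordinate-aware auxiliary losses adapt pretrained VLMs to the task, and a visibility-aware keyframe memory represents history compactly, which markedly improves navigation efficiency and long-horizon capability.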
Abstract: Vision-language models (VLMs) have become a common foundation for vision-and-language navigation in continuous environments (VLN-CE). Yet most VLM-based methods cast navigation as low-level action prediction, an interface that is ambiguous, tied to short-horizon motion primitives, and inefficient due to repeated VLM querying. We propose Goal2Pixel, a pure pixel-based paradigm that reformulates VLN-CE as navigable pixel grounding. Rather than predicting actions, Goal2Pixel uses the image plane as a unified spatial interface between VLM reasoning and robot motion: the model predicts a visible navigable pixel to the agent, which is back-projected into a 3D waypoint for forward navigation. For non-forward actions, we append auxiliary directive regions to the image plane, where the left/right/bottom regions are interpreted as turning left, turning right, and stopping, respectively. To enable long-horizon navigation, we propose a visibility-aware keyframe memory for compact and informative history representation. To adapt pretrained VLMs to navigable pixel grounding, we introduce semantic embeddings and coordinate-aware auxiliary losses. Goal2Pixel achieves competitive state-of-the-art performance while requiring fewer VLM inference calls than prior methods. On R2R-CE Val-Unseen it achieves 54.1% SR and 52.5% SPL with just 7.75 VLM calls per episode, 6x fewer than the 46.62 required by direct action prediction at 32.9% SR. The same trend holds on RxR-CE. Project Page: https://baobao0926.github.io/Goal2Pixel/.
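The pixel-to-waypoint step can be sketched as follows, assuming a standard pinhole camera with known intrinsics and a depth reading at the predicted pixel. The canvas layout (a margin of `margin` pixels added on the left, right, and bottom of the image) and the function names are hypothetical; the abstract only states that left/right/bottom regions mean turn left, turn right, and stop.

```python
def decode_prediction(u, v, width, height, margin):
    """Map a predicted canvas pixel to an action.

    Assumed layout: the original width x height image is padded with
    `margin` pixels on the left, right, and bottom. The bottom strip is
    checked first, so corner pixels resolve to "stop" (an arbitrary choice).
    """
    if v >= height:
        return ("stop", None)
    if u < margin:
        return ("turn_left", None)
    if u >= margin + width:
        return ("turn_right", None)
    return ("forward", (u - margin, v))  # pixel in original image coordinates


def backproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at `depth` into camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


action, pixel = decode_prediction(352, 240, width=640, height=480, margin=32)
waypoint = backproject_pixel(pixel[0], pixel[1], depth=2.0,
                             fx=500.0, fy=500.0, cx=320.0, cy=240.0)
call_ratio = 46.62 / 7.75  # VLM calls: direct action prediction vs. Goal2Pixel
```

The ratio 46.62 / 7.75 is roughly 6.0, which is consistent with the "6x fewer" figure quoted in the abstract.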
[206] Edge-directed geometric partitioning for versatile video coding cs.CVPDF
Xuewei Meng, Xinfeng Zhang, Chuanmin Jia, Xia Li, Shanshe Wang
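TL;DR: This paper proposes an edge-directed geometric partition scheme for the Versatile Video Coding (VVC) standard. It uses spatio-temporal edge information to build a Most Probable Mode (MPM) list that predicts the geometric partition (GEO) mode, reducing the cost of coding the mode index and improving coding efficiency.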
Details
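Motivation: GEO in VVC provides 140 partition candidates, and the index of the optimal mode must be signaled explicitly, which adds overhead. The authors propose a GEO mode prediction strategy that exploits the structural characteristics of different coding units (CUs) and the correlation between spatially adjacent blocks and temporally collocated blocks to reduce this overhead.
Result: Evaluated against the VTM-6.0 anchor, the method yields average objective BD-rate gains of 0.58% and 1.00% under the RA (Random Access) and LDB (Low-Delay B) configurations, respectively. It also improves the visual quality of object boundaries.
Insight: The key observation is the high correlation between partition modes and object boundaries, which motivates an edge-directed scheme that builds the MPM list from spatio-temporal edge information. This offers a new way to exploit visual structure for efficient mode prediction in video coding.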
Abstract: To improve the coding performance, geometric partition (GEO) was proposed for the upcoming VVC standard. GEO provides 140 partition candidates. The index of optimal GEO mode needs to be signaled explicitly. Considering different structural characteristics of different CUs and the correlation between spatial adjacent blocks and temporal collocated blocks, we propose a GEO mode prediction strategy by constructing a Most Probable Mode (MPM) list to reduce the overhead of GEO index and improve coding efficiency. Based on the observation of the high correlation between the partition mode and object boundaries, an edge-directed geometric partition scheme is proposed to construct the MPM list according to spatio-temporal edge information. The proposed method provides an objective BD-rate gain of 0.58% and 1.00% on average for RA and LDB configurations compared to VTM-6.0. Besides, it also promotes the visual quality of object boundaries.
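A rough illustration of why an MPM list can reduce index overhead, assuming a simplified signaling scheme: one flag bit says whether the chosen mode is in the MPM list, followed by a fixed-length index into either the list or the remaining modes. This is not the actual VVC binarization, and the list size and hit rate below are hypothetical values chosen only to show the arithmetic.

```python
import math


def fixed_length_bits(num_modes):
    """Bits needed to index one of `num_modes` modes with a fixed-length code."""
    return math.ceil(math.log2(num_modes))


def expected_bits_with_mpm(num_modes, mpm_size, hit_rate):
    """Expected bits per block under the simplified flag-plus-index scheme.

    hit_rate: assumed probability that the optimal mode is in the MPM list.
    """
    mpm_bits = 1 + math.ceil(math.log2(mpm_size))
    rest_bits = 1 + math.ceil(math.log2(num_modes - mpm_size))
    return hit_rate * mpm_bits + (1 - hit_rate) * rest_bits


baseline = fixed_length_bits(140)                      # 140 GEO candidates
with_mpm = expected_bits_with_mpm(140, mpm_size=4, hit_rate=0.6)
```

Under these assumed numbers the expected cost falls below the fixed-length baseline only when the list predicts well, which is why the quality of the edge-based prediction matters.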
[207] PhyScene3D: Physically Consistent Interactive 3D Tabletop Scene Generation cs.CVPDF
Weixing Chen, Zhuoqian Feng, Yang Liu, Yexin Zhang, Yifan Wen
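TL;DR: This paper proposes PhyScene3D, a framework for generating physically consistent, interactive 3D tabletop scenes. It reformulates scene generation as a sequential task that mimics how humans build a scene, and introduces a Cognitive Topological Reasoning Chain and Physics-Aware Denoising Alignment to address the shortcomings of existing methods in physical constraints and semantic accuracy.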
Details
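Motivation: Existing methods for generating physically consistent interactive 3D scenes, such as decoupled symbolic solvers and end-to-end regression models, suffer from error propagation or overfit to noisy supervision containing widespread physical violations, and they struggle with dense object hierarchies and irregular affordances.
Result: Experiments show that PhyScene3D outperforms state-of-the-art methods in both semantic accuracy and physical validity, reducing the scene-wise collision rate by 40% relative to the human-annotated training data.
Insight: The contributions are modeling scene generation as a sequential process that mimics human construction (the Cognitive Topological Reasoning Chain) and a Physics-Aware Denoising Alignment method that combines a differentiable Signed Distance Field with test-time optimization to project generated scenes onto a physically feasible manifold while preserving semantic intent, producing scenes for robot learning that load directly into physics simulators.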
Abstract: Generating physically consistent 3D tabletop scenes is a fundamental yet underexplored problem for interactive and generalist robotic learning. The challenge stems from dense object hierarchies and irregular affordances. Here, an interactive scene denotes a physically valid, collision-free environment directly loadable into physics simulators. Existing methods, ranging from decoupled symbolic solvers to end-to-end regression models, often suffer from error propagation or overfitting to noisy supervision containing widespread physical violations. To address these limitations, we introduce PhyScene3D, a framework that reformulates generation as a Human-Mimetic Constructive Process. The proposed Cognitive Topological Reasoning Chain (CTRC) factorizes scene synthesis into a sequential, anchor-conditioned process. It employs a 3D AABB-based placement scheme that imposes a strong structural inductive bias. To address imperfect supervision and physical infeasibility, we introduce Physics-Aware Denoising Alignment (PADA). It integrates a differentiable Signed Distance Field (SDF) with Test-Time Optimization (TTO) to project generated scenes onto a physics-feasible manifold while preserving semantic intent. Experiments demonstrate that PhyScene3D outperforms state-of-the-art approaches in both semantic accuracy and physical validity, achieving a 40% reduction in scene-wise collision rate relative to the human-annotated training data.
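To make the collision metric concrete, here is a small sketch of an axis-aligned bounding box (AABB) overlap test and a scene-wise collision rate. Defining the rate as the fraction of scenes with at least one overlapping pair is an assumption; the abstract does not state the exact definition, and the paper's SDF-based check is likely finer than box overlap.

```python
from itertools import combinations


def aabb_overlap(a, b):
    """True if two AABBs interpenetrate; boxes that merely touch do not count.

    Each box is ((min_x, min_y, min_z), (max_x, max_y, max_z)).
    """
    return all(a[0][i] < b[1][i] and b[0][i] < a[1][i] for i in range(3))


def scene_has_collision(boxes):
    """True if any pair of boxes in the scene overlaps."""
    return any(aabb_overlap(a, b) for a, b in combinations(boxes, 2))


def scene_collision_rate(scenes):
    """Fraction of scenes containing at least one colliding pair (assumed definition)."""
    return sum(scene_has_collision(s) for s in scenes) / len(scenes)


cup = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
plate_clear = ((2.0, 0.0, 0.0), (3.0, 1.0, 0.2))
plate_clash = ((0.5, 0.5, 0.0), (1.5, 1.5, 0.2))
rate = scene_collision_rate([[cup, plate_clear], [cup, plate_clash]])
```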
[208] RPCASSM: Robust PCA State Space Model For Infrared Small Target Detection cs.CV | cs.AIPDF
Pingping Liu, Aohua Li, Yubing Lu, Jin Kuang, Tongshun Zhang
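TL;DR: This paper proposes RPCASSM, a robust principal component analysis state space model for infrared small target detection and segmentation. Because infrared small targets occupy few pixels in long-distance imaging and their edges are hard to model accurately, the model uses a background state space module and a target state space module, which exploit the saliency of spatially heterogeneous signals and the sparsity and local highlight of targets, respectively, to improve detection performance.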
Details
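Motivation: Mainstream visual state space models are inefficient on infrared small targets and model target edges poorly, and existing infrared state space models have not departed from the mainstream framework on the basis of the structural properties of infrared small targets. A model designed around the spatial-domain properties of infrared small targets is therefore needed.
Result: Experiments on existing benchmark datasets demonstrate the effectiveness of the RPCASSM design, but the abstract gives no specific quantitative results or comparison with the state of the art.
Insight: The contribution is a dual-module structure built on the RPCA paradigm and tailored to infrared small targets: BSSM uses spatially heterogeneous signals to design a spatial probe scanning mechanism that models the background, and TSSM uses target sparsity and local highlight to design a deformable prompt scanning mechanism that focuses state space modeling on the target's deformable space, addressing the difficulty existing models have in accurately modeling target edge structure.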
Abstract: The detection and segmentation of infrared small targets have important application significance in the fields of surveillance and security, maritime rescue and so on. Due to the low occupancy of these targets in long-distance imaging, the mainstream visual state space model is inefficient and difficult to accurately model the target edge. The existing infrared state space models do not deviate from the mainstream visual state space structure framework from the structural properties of infrared small targets. In order to solve this problem, this paper proposes the RPCASSM network based on the model paradigm of robust principal component analysis (RPCA), which aims to design the background state space module (BSSM) and the target state space module (TSSM) by the nature of the infrared small target in the spatial domain. The BSSM aims to use the saliency of spatial heterogeneous signals to design a spatial probe scanning mechanism (SPCM) to model background information. The TSSM designs a deformable prompt scanning mechanism (DPCM) by using the sparsity and local highlight of the target to focus on the deformable space of the target for state space modeling. According to the above design, we effectively solve the problem that the existing mainstream vision state space model is difficult to accurately model the edge structure of infrared small target. Experimental results on the existing benchmark data sets prove the effectiveness of the RPCASSM design. Our code will be made public at https://github.com/PepperCS/RPCASSM.
[209] Understanding Identity Continuity in Thermal Video through Scene-Level Consistency cs.CV | cs.AI | cs.LG | cs.MMPDF
Wei-Chieh Sun, Gyungmin Ko, Heejae Kwon, Hsiang-Wei Huang, Jenq-Neng Hwang
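TL;DR: This paper studies pedestrian multi-object tracking (MOT) in thermal video. To address trajectory fragmentation caused by weak appearance cues and detection interruptions, it proposes a lightweight post-processing method: on top of a YOLOv8 and SORT baseline, it adds a modular identity-repair backend consisting of online short-gap remapping and offline tracklet relinking based on temporal, spatial, motion, and border cues. Experiments show that a conservative relinking strategy improves identity continuity.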
Details
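Motivation: Thermal pedestrian MOT suffers severe trajectory fragmentation because of weak appearance cues and frequent detection interruptions. The paper explores whether lightweight post-processing can restore identity continuity without relying on heavy re-identification models or complex online association.
Result: On the PBVS Thermal Pedestrian MOT benchmark, conservative relinking raises IDF1 from 82.25 to 84.93 while keeping MOTA stable. Ablations show that the main identity gains come from tracklet relinking and that many heuristic thresholds remain stable across broad operating ranges.
Insight: The contribution is a modular, lightweight identity-repair backend. It emphasizes that in low-information thermal imagery, high-precision tracklet relinking that exploits scene-level spatio-temporal consistency restores identity continuity more effectively than added tracker complexity. This gives thermal MOT a computationally efficient solution and highlights the dominant role of scene-level cues over local frame-to-frame association.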
Abstract: Thermal pedestrian MOT remains challenging because weak appearance cues and frequent detection interruptions cause severe trajectory fragmentation. We study whether lightweight post-processing can recover identity continuity without relying on heavy re-identification models or complex online association. Starting from a YOLOv8 and SORT baseline, we add a modular identity-repair backend consisting of online short-gap remapping and offline tracklet relinking based on temporal, spatial, motion, and border cues. Controlled ablations on a fixed validation split and evaluation on the official PBVS Thermal Pedestrian MOT benchmark show that the main identity gains arise from conservative relinking, improving IDF1 from 82.25 to 84.93 while preserving MOTA, whereas many heuristic thresholds remain stable across broad operating ranges. These results suggest that, in low-information thermal imagery, robust identity recovery can be achieved more effectively through high-precision trajectory relinking than through increasing tracker complexity. These results provide a controlled analysis of identity recovery in thermal video, showing that scene-level spatial-temporal consistency plays a dominant role in identity continuity compared to local frame-to-frame association.
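A minimal sketch of a conservative relinking gate that uses the four cue types listed in the abstract (temporal, spatial, motion, border). The thresholds, the constant-velocity prediction, and the rule that a tracklet ending near the frame border is treated as an exit are all assumptions for illustration, not the paper's actual settings.

```python
import math


def near_border(x, y, width, height, margin):
    """True if (x, y) lies within `margin` pixels of the frame edge."""
    return x < margin or y < margin or x > width - margin or y > height - margin


def can_relink(end, start, width, height,
               max_gap=30, max_dist=50.0, border_margin=10):
    """Decide whether a later tracklet may inherit an earlier tracklet's ID.

    end:   dict with frame, x, y, vx, vy (last state of the earlier tracklet)
    start: dict with frame, x, y (first state of the later tracklet)
    """
    gap = start["frame"] - end["frame"]
    if gap <= 0 or gap > max_gap:                     # temporal cue
        return False
    if near_border(end["x"], end["y"], width, height, border_margin):
        return False                                  # border cue: likely left the scene
    pred_x = end["x"] + end["vx"] * gap               # motion cue: constant velocity
    pred_y = end["y"] + end["vy"] * gap
    dist = math.hypot(start["x"] - pred_x, start["y"] - pred_y)
    return dist <= max_dist                           # spatial cue


end_state = {"frame": 10, "x": 100.0, "y": 100.0, "vx": 2.0, "vy": 0.0}
start_state = {"frame": 20, "x": 121.0, "y": 100.0}
linked = can_relink(end_state, start_state, width=640, height=512)
```

Rejecting by default and accepting only when every cue agrees favors precision over recall, which matches the conservative strategy the abstract credits for the IDF1 gain.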
[210] Learning Label-Efficient Interpretable Medical Image Diagnosis via Semi-supervised Hypergraph Concept Bottleneck Model cs.CVPDF
Yijun Yang, Ruiqiang Xiao, Lijie Hu, Angelica I Aviles-Rivero, Yunzhu Wu
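TL;DR: This paper proposes a semi-supervised hypergraph concept bottleneck model (HyperCBM) for medical image diagnosis. It targets two problems: deep learning models lack interpretability in medical image analysis, and conventional concept bottleneck models (CBMs) depend on large numbers of expert-annotated concept labels and struggle to capture complex inter-concept dependencies. The method uses dual-level hypergraph learning to model high-order concept dependencies and generate domain-adaptive pseudo-labels, reducing the need for concept annotation while improving interpretability and diagnostic performance.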
Details
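Motivation: Deep learning models struggle to earn clinical trust in medical image diagnosis, for example ultrasound diagnosis of Placenta Accreta Spectrum (PAS), because they lack interpretability. Conventional CBMs, in turn, require many expert-annotated concept labels and cannot effectively capture complex inter-concept dependencies.
Result: Experiments on a newly annotated PAS ultrasound dataset, a public breast ultrasound dataset, and the dermoscopic image dataset SkinCon show that the method maintains high diagnostic performance while substantially reducing reliance on concept annotations, validating its effectiveness and generality.
Insight: The contribution is a semi-supervised framework that introduces a concept-level hypergraph to strengthen reasoning over high-order concept dependencies and an image-level hypergraph to generate robust, domain-adaptive pseudo-labels, yielding a more efficient and interpretable medical image diagnosis model at lower annotation cost.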
Abstract: Deep learning has revolutionized medical image analysis, delivering exceptional diagnostic accuracy across diverse applications. Yet, the lack of interpretability in its decision-making hinders clinical adoption, particularly in high-stakes medical contexts where transparency is paramount for trustworthiness. For example, in Placenta Accreta Spectrum (PAS), subtle cues in ultrasound imaging challenge reliable diagnosis, rendering black-box models untrustworthy for accurate scoring. To address this, Concept Bottleneck Models (CBMs) offer a promising avenue by embedding clinically meaningful intermediate concepts into the diagnosis pipeline, enabling clinicians to scrutinize and refine model outputs. However, conventional CBMs falter in capturing complex inter-concept dependencies and demand costly, expert-driven concept annotations, limiting their scalability. This study introduces a novel semi-supervised CBM framework designed for medical imaging, which leverages dual-level hypergraph learning to model high-order concept dependencies and generate domain-adaptive pseudo-labels. Our approach achieves superior interpretability and performance by integrating a concept-level hypergraph for enhanced reasoning and an image-level hypergraph for robust pseudo-label generation. Experiments on a newly annotated PAS ultrasound dataset and a breast ultrasound public dataset demonstrate the effectiveness of the proposed concept label-efficient interpretable framework. Its universality is further validated on the dermoscopic image dataset SkinCon. The code is available at https://github.com/scott-yjyang/HyperCBM.
[211] Spatio-Temporal Correlation Guided Geometric Partitioning for Versatile Video Coding cs.CVPDF
Xuewei Meng, Chuanmin Jia, Xinfeng Zhang, Shanshe Wang, Siwei Ma
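TL;DR: This paper proposes a spatio-temporal correlation guided geometric partitioning (STGEO) scheme that efficiently describes object information in the motion field of video coding. It analyzes the statistics of partitioning mode decisions and motion vector selection, then uses the observed spatio-temporal correlation to design mode prediction and coding methods that reduce the overhead of representing side information.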
Details
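Motivation: The existing GEO scheme in VVC spends considerable overhead on side information such as the partitioning mode and motion information, which limits coding efficiency. The paper aims to reduce this overhead by exploiting spatio-temporal correlation.
Result: Simulation results show average bit-rate savings of 0.95% and 1.98% under the Random Access and Low-Delay B configurations, respectively, compared with VTM-8.0 without GEO.
Insight: The contribution is predicting high-probability STGEO modes from edge information and the history modes of adjacent coded blocks, and adaptively inferring motion information from an offline-trained merge candidate selection probability, which guides entropy coding to spend fewer bits.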
Abstract: Geometric partitioning has attracted increasing attention by its remarkable motion field description capability in the hybrid video coding framework. However, the existing geometric partitioning (GEO) scheme in Versatile Video Coding (VVC) causes a non-negligible burden for signaling the side information. Consequently, the coding efficiency is limited. In view of this, we propose a spatio-temporal correlation guided geometric partitioning (STGEO) scheme to efficiently describe the object information in the motion field of video coding. The proposed method can economize the bits consumed for side information signaling, including the partitioning mode and motion information. We firstly analyze the characteristics of partitioning mode decision and motion vector selection in a statistically-sound way. Based on the observed spatio-temporal correlation, we design a mode prediction and coding method to reduce the overhead for representing the above mentioned side information. The main idea is to predict the STGEO modes and motion candidates that have higher selection possibilities, which can guide the entropy coding, i.e., representing the predicted high-probability modes and motion candidates with fewer bits. In particular, the high-probability STGEO modes are predicted based on the edge information and history modes of adjacent STGEO-coded blocks. The corresponding motion information is represented by the index in a merge candidate list, which is adaptively inferred based on the off-line trained merge candidate selection probability. Simulation results show that the proposed approach achieves 0.95% and 1.98% bit-rate savings on average compared to VTM-8.0 without GEO for Random Access and Low-Delay B configurations, respectively.
[212] Improving Visual Token Reduction via Rectifying Distortions for Efficient Multimodal LLM Inference cs.CVPDF
Hyeonwoo Cho, DongHyeon Baek, Yewon Kim, Bumsub Ham
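TL;DR: This paper proposes RESTORE, a visual token reduction framework that addresses the quadratic computational complexity caused by the large number of visual tokens in multimodal large language models. It rectifies positional and attentional distortions to improve the representation quality of the reduced sequence while maintaining computational efficiency.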
Details
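Motivation: Existing visual token reduction methods overlook positional and attentional consistency between the full and reduced sequences, which distorts the representation and degrades model performance.
Result: Experiments on multiple benchmarks show that the method consistently improves the accuracy of various reduction methods, achieving state-of-the-art performance while maintaining computational efficiency.
Insight: The contributions are a calibration method that restores lost visual attention by augmenting attention weights based on relative distances, and a distinctive anchor selection strategy that reduces information loss during feature averaging.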
Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision-language tasks, yet the quadratic computational complexity arising from the vast number of visual tokens incurs significant memory and latency bottlenecks. While visual token reduction (VTR) strategies have been explored to mitigate this burden, existing methods overlook the positional and attentional consistency between the full and reduced sequences, resulting in a distorted representation. To this end, we propose RESTORE, a novel VTR framework that rectifies the positional and attentional distortions while maintaining efficiency. Specifically, we present a simple yet effective calibration method that restores lost visual attention by augmenting attention weights based on relative distances. We also introduce a distinctive anchor selection for token merging to mitigate information loss during feature averaging. Experimental results on multiple benchmarks demonstrate that our method consistently improves the accuracy of various reduction methods, achieving state-of-the-art performance while maintaining computational efficiency.
[213] Density-Aware Translation of Spurious Correlations in Zero-Shot VLMs cs.CV | cs.LGPDF
Afsaneh Hasanebrahimi, Hanxun Huang, Christopher Leckie, Sarah Erfani
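TL;DR: This paper proposes Density-Aware Translation (DAT), which improves the classification performance of zero-shot vision-language models such as CLIP. It recalibrates image-text similarity scores with a local geometric density term derived from group reference sets to reduce the influence of spurious correlations.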
Details
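Motivation: Vision-language models such as CLIP are strong zero-shot classifiers, but their predictions remain subject to spurious correlations, where contextual cues dominate semantic content. Existing remedies such as fine-tuning or prompt engineering either undermine the advantages of pre-trained models or are prone to hallucination, so an effective training-free calibration mechanism is needed.
Result: Experiments on benchmark datasets show consistent improvements in both average and worst-group accuracy, indicating that density-aware translation is a simple, effective calibration mechanism that makes zero-shot classification with multimodal models more reliable.
Insight: The contribution is exploiting the modality gap and anisotropic shell geometry of CLIP embeddings: a relative measure rescales similarities by embedding density, suppressing overconfident scores in sparse regions while preserving dense, semantically consistent matches. This offers a geometric, training-free approach to mitigating spurious correlations.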
Abstract: Vision-Language models (VLMs), such as CLIP, achieve powerful zero-shot classification. However, their predictions remain sensitive to spurious correlations, where contextual cues dominate over semantic content. Earlier solutions typically rely on fine-tuning or prompt engineering, which either undermine the advantages of pre-trained models or are prone to hallucination. In this work, we propose Density-Aware Translation (DAT) that refines image-text similarity scores using a local geometric density term derived from group reference sets. Our approach is motivated by the phenomenon that CLIP embeddings exhibit a modality gap and lie on an anisotropic shell in the feature space: common patterns cluster near the mean, while rare patterns are pushed outward. This geometry creates uneven alignment, where spurious correlations are amplified while semantically meaningful but rare cues are marginalised. To address this, we employ a relative measure to rescale similarities based on embedding density, suppressing overconfident scores in diffuse regions while preserving dense, semantically consistent matches. Experimental results on benchmark datasets demonstrate consistent improvements in worst-group and average accuracy, highlighting density-aware translation as a simple and effective calibration mechanism for reliable zero-shot classification using multimodal models.
[214] EvoCut: Multi-Layer Evolution-Aware Visual Token Compression for Efficient Large Vision-Language Models cs.CVPDF
Hongyu Lu, Feng Zhang, Wenwei Jin, Huanling Hu, Pengfei Zhang
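TL;DR: This paper proposes EvoCut, a training-free, attention-free visual token compression method that improves the inference efficiency of large vision-language models. It analyzes the cross-layer evolution directions of tokens in the vision encoder and identifies informative tokens that deviate from the group evolution trend, retaining only a small fraction of tokens while preserving model performance.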
Details
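Motivation: Existing visual token compression methods usually estimate token importance from attention scores or representation properties at specific layers, ignoring how tokens evolve across the layers of the vision encoder. This leads to incomplete importance estimates and degraded performance after compression.
Result: On LLaVA-1.5-7B, EvoCut retains only 11.1% of the visual tokens while preserving 94.4% of the average performance, a good balance between efficiency and accuracy.
Insight: The contribution is estimating token importance from cross-layer evolution deviation rather than single-layer attention. Viewed objectively, the method reveals that informative tokens persistently deviate from the group direction across layers, which provides a new interpretable criterion for token compression.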
Abstract: Large vision-language models (LVLMs) achieve strong performance on image and video understanding tasks, but their inference efficiency is constrained by the large number of visual tokens produced by vision encoders. Most existing visual token compression methods estimate token importance from attention scores or representation properties at specific layers, overlooking how visual tokens evolve across the vision encoder. Such layer-specific criteria may provide incomplete importance estimates and limit performance preservation after compression. To address this issue, we analyze layer-wise visual token evolution directions and observe that tokens form multiple group evolution directions across vision-encoder layers. Our analysis further shows that informative tokens tend to exhibit persistent deviations from common group evolution directions. Based on this observation, we propose EvoCut, a training-free and attention-free visual token compression method that estimates token importance from multi-layer evolution deviation. Experimental results show that EvoCut can retain only 11.1% of the visual tokens on LLaVA-1.5-7B while preserving 94.4% of the average performance, demonstrating its effectiveness in balancing efficiency and accuracy.
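The idea of scoring tokens by multi-layer evolution deviation can be sketched as below. This is a simplification under stated assumptions: it uses a single group direction per layer (the mean of all token directions), whereas the abstract describes multiple group directions, and the deviation measure (one minus cosine similarity, summed over layers) is chosen for illustration rather than taken from the paper.

```python
import math


def _unit(vec):
    """Return vec scaled to unit length (unchanged if it is the zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def evolution_deviation_scores(layers):
    """Accumulate each token's deviation from the mean evolution direction.

    layers[l][i] is the feature vector of token i at encoder layer l.
    """
    num_tokens = len(layers[0])
    scores = [0.0] * num_tokens
    for prev, curr in zip(layers, layers[1:]):
        # Per-token evolution direction between consecutive layers.
        dirs = [_unit([c - p for p, c in zip(pv, cv)]) for pv, cv in zip(prev, curr)]
        dim = len(dirs[0])
        group_dir = _unit([sum(d[k] for d in dirs) / num_tokens for k in range(dim)])
        for i, d in enumerate(dirs):
            cosine = sum(a * b for a, b in zip(d, group_dir))
            scores[i] += 1.0 - cosine
    return scores


def keep_top_tokens(scores, keep):
    """Indices of the `keep` highest-scoring tokens, in original order."""
    top = sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:keep]
    return sorted(top)


# Tokens 0 and 1 drift along +x; token 2 drifts along +y and stands out.
layers = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [2.0, 0.0], [0.0, 2.0]],
]
scores = evolution_deviation_scores(layers)
kept = keep_top_tokens(scores, keep=1)
tokens_at_budget = round(576 * 0.111)
```

For scale, LLaVA-1.5 feeds 576 visual tokens (a 24 x 24 patch grid) to the language model, so an 11.1% budget corresponds to about 64 tokens.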
[215] PlatonicNav: Unveiling Semantic Correspondence in Navigation with Platonic Topological Maps cs.CVPDF
Junlin Long, Zeyu Zhang, Xu Deng, Yiran Wang, Yue Yang
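TL;DR: This paper proposes the PlatonicNav framework, which recasts vision-only object goal navigation, cross-modal object goal navigation, and vision-and-language navigation in embodied visual navigation as three interfaces to the same object-centric semantic manifold. Its Platonic Topological Map, built with a self-supervised visual encoder, fuses geometric and semantic node distances and grounds language goals via blind matching without any paired vision-language data.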
Details
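Motivation: Existing attempts to unify different navigation tasks stop at architectural fusion and do not examine whether independently trained vision and language encoders already share a semantic structure. The paper also explores whether language goals can be grounded in a map built from vision alone.
Result: Experiments on simulation benchmarks including HM3D-IIN, OVON, and R2R-CE, together with deployment on a Unitree Go2 robot, show that PlatonicNav generalizes across tasks, modalities, and embodiments without explicit cross-modal training.
Insight: The core contribution is extending the Platonic Representation Hypothesis to embodied navigation and proposing a training-free framework in which a topological map built from self-supervised visual encoding handles multimodal navigation tasks in a unified way, giving zero-shot generalization across tasks and modalities.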
Abstract: Embodied visual navigation, where an agent perceives a complex environment and acts to reach a goal from raw sensory input, underpins a wide range of applications such as household service robotics, assistive robotics, and large-scale autonomous exploration. However, recent attempts to unify vision-and-language navigation (VLN) and object goal navigation (ObjNav) remain at the level of architectural fusion, mixed-task training, and large vision-language pretraining, without examining whether independently trained vision and language encoders may already share a common semantic structure. Moreover, even object-centric topological maps still ground language goals through explicit cross-modal supervision such as CLIP or large vision-language models, leaving open whether such grounding is possible from a purely vision-built map. To address these challenges, we extend the Platonic Representation Hypothesis to embodied navigation and recast vision-only ObjNav, cross-modal ObjNav, and VLN as three different interfaces to the same object-centric semantic manifold. We further introduce PlatonicNav, a training-free framework whose Platonic Topological Map fuses geometric and semantic node distances from a self-supervised visual encoder, and grounds language goals via blind matching without any paired vision-language data. Extensive experiments on simulation benchmarks including HM3D-IIN, OVON, and R2R-CE on MP3D, together with deployment on Unitree Go2, demonstrate that PlatonicNav generalizes across tasks, modalities, and embodiments without explicit cross-modal training. Code: https://github.com/AIGeeksGroup/PlatonicNav. Website: https://aigeeksgroup.github.io/PlatonicNav.
[216] PillarDETR: YOLO-Backbone and RT-DETR Head for Real-Time 3D Object Detection cs.CVPDF
Smit Kadvani, Shriya Gumber, Kriti Faujdar, Harsh Dave
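TL;DR: This paper proposes PillarDETR, an end-to-end real-time 3D object detection architecture. It combines efficient pillar-based encoding of LiDAR point clouds, a CSP backbone derived from YOLOv8, and an RT-DETR decoder to balance detection accuracy and inference speed.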
Details
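Motivation: In autonomous driving and robotics, traditional LiDAR point cloud methods built on complex 3D convolutions or anchor boxes struggle to balance detection accuracy and inference speed.
Result: Extensive experiments on the KITTI and nuScenes benchmarks show that PillarDETR achieves a competitive trade-off between mean Average Precision (mAP) and inference latency, with substantial improvements over the PointPillars baseline.
Insight: The main contribution is bringing the efficient YOLO backbone and RT-DETR decoder from 2D vision into 3D detection: pillar encoding produces a pseudo-image, and a Transformer decoder captures global context and predicts 3D bounding boxes directly, without non-maximum suppression (NMS).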
Abstract: Real-time 3D object detection is a critical component for the safe operation of autonomous driving systems and robotics. While LiDAR point clouds provide accurate spatial information, processing them efficiently remains a significant challenge. Traditional methods rely on complex 3D convolutions or anchor-based paradigms that struggle to balance detection accuracy with inference speed. In this paper, we propose PillarDETR, a novel end-to-end 3D object detection architecture that combines the efficiency of pillar-based LiDAR encoding with the representational power of modern 2D vision models. Specifically, PillarDETR replaces standard convolutional backbones with a Cross Stage Partial (CSP) network derived from YOLOv8, enabling richer feature extraction from pseudoimages. Furthermore, we discard conventional anchor-based or center-based detection heads in favor of a Real-Time Detection Transformer (RT-DETR) decoder. This hybrid design allows the network to capture global context and directly predict 3D bounding boxes without relying on non-maximum suppression (NMS). Extensive experiments on the KITTI and nuScenes benchmarks demonstrate that PillarDETR achieves a compelling trade-off between mean Average Precision (mAP) and inference latency. Our ablation studies confirm that integrating the YOLOv8 backbone and RT-DETR head yields substantial improvements over the PointPillars baseline, establishing PillarDETR as a highly effective solution for real-time 3D perception.
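The pillar encoding step that turns a point cloud into a 2D pseudo-image can be sketched as follows. Grouping points by their x-y grid cell follows the general pillar idea; using the maximum height per pillar as the single feature channel is a stand-in assumption, since pillar-based detectors normally learn per-pillar features with a small network, and the grid ranges here are hypothetical.

```python
def pillarize(points, x_range, y_range, pillar_size):
    """Group LiDAR points (x, y, z) into vertical pillars on an x-y grid.

    Points outside the half-open ranges are dropped.
    Returns a dict mapping (ix, iy) grid indices to lists of points.
    """
    pillars = {}
    for x, y, z in points:
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        ix = int((x - x_range[0]) // pillar_size)
        iy = int((y - y_range[0]) // pillar_size)
        pillars.setdefault((ix, iy), []).append((x, y, z))
    return pillars


def pseudo_image(pillars, grid_w, grid_h, empty=0.0):
    """Build a grid_h x grid_w single-channel image of per-pillar max height."""
    image = [[empty] * grid_w for _ in range(grid_h)]
    for (ix, iy), pts in pillars.items():
        image[iy][ix] = max(z for _, _, z in pts)
    return image


points = [(0.1, 0.1, 1.0), (0.2, 0.3, 2.0), (1.5, 0.1, 0.5), (99.0, 99.0, 0.0)]
pillars = pillarize(points, x_range=(0.0, 2.0), y_range=(0.0, 2.0), pillar_size=1.0)
image = pseudo_image(pillars, grid_w=2, grid_h=2)
```

Once the cloud is in this image-like form, a 2D backbone and detection head can process it, which is what makes reusing YOLO-style and DETR-style components possible.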
[217] STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models cs.CV | cs.AIPDF
Yuhang Han, Wenzheng Yang, Yujie Chen, Xiangqi Jin, Yaojie Zhang
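TL;DR: This paper proposes STaR-KV, a training-free KV cache compression framework that addresses a deployment bottleneck of VLM-based GUI agents: the KV cache grows linearly with interaction steps and consumes excessive GPU memory. The method adaptively recalibrates token importance along three axes: subspace-aware scoring, a temporal stability discount, and an entropy-derived temperature. On four GUI benchmarks it achieves the strongest average accuracy among state-of-the-art KV compression methods and cuts peak GPU memory by nearly 40% at a 20% KV-cache budget.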
Details
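Motivation: Existing KV cache compression methods share two structural assumptions: aggregating visual-token importance into a single shared saliency map, and applying a fixed top-B cutoff to the fused score distribution. Pilot measurements refute both: spatial specialization lives at the attention-subspace level and migrates across layers, and the shape of the score distribution drifts along a trajectory.
Result: Across four GUI benchmarks and at matched budgets, STaR-KV achieves the strongest average accuracy among state-of-the-art KV compression methods (e.g., GUIKV, SnapKV), adds no compression-stage FLOPs overhead (-0.07%), and cuts peak GPU memory by nearly 40% at a 20% KV-cache budget.
Insight: The contribution is a training-free KV cache compression framework calibrated along three axes: (1) subspace-aware scoring based on online spatial mutual information; (2) a temporal stability discount that suppresses redundant cache entries from persistently attended subspaces; and (3) an entropy-derived temperature that adaptively reshapes the score distribution. This moves past the spatial aggregation and fixed cutoff of existing methods and enables finer, more adaptive cache management.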
Abstract: Vision-language-model-based graphical user interface (GUI) agents have shown broad automation capabilities, yet deployment is bottlenecked by a key-value (KV) cache that grows linearly with interaction steps. For instance, UI-TARS-1.5-7B consumes 76 GB of GPU memory on merely five screenshots, approaching the capacity of mainstream 80 GB accelerators. Existing KV compression methods share two structural assumptions: aggregating visual-token importance into a single shared saliency map, and applying a fixed top-B cutoff to the fused score distribution. Pilot measurements refute both: spatial specialization lives at the attention-subspace level and migrates across layers, while the score distribution drifts in shape along a trajectory. We propose STaR-KV (Spatio-Temporal Adaptive Re-weighting), a training-free KV cache compression framework that calibrates token importance along three axes: (i) subspace-aware scoring driven by online spatial mutual information; (ii) a temporal stability discount that suppresses redundant cache entries from persistently attended subspaces; and (iii) an entropy-derived temperature that adaptively reshapes the score distribution. Across four GUI benchmarks, STaR-KV achieves the strongest average accuracy among state-of-the-art KV compression methods (e.g., GUIKV, SnapKV) at matched budgets, with no compression-stage FLOPs overhead (-0.07%) and cutting peak GPU memory by nearly 40% at a 20% KV-cache budget. Code is available at https://github.com/kawhiiiileo/STaR-KV.
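To show why the KV cache grows linearly with interaction steps, the sketch below computes cache size from the standard formula (two tensors, keys and values, per layer, each of shape tokens x KV heads x head dimension) and the number of entries kept under a budget. The model dimensions in the example are hypothetical placeholders, not the configuration of UI-TARS-1.5-7B, and the budget helper is not STaR-KV's selection rule.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """Size of the KV cache: keys plus values for every layer and cached token."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


def tokens_under_budget(num_tokens, budget_ratio):
    """Number of cache entries retained at a given budget ratio (at least one)."""
    return max(1, int(num_tokens * budget_ratio))


# Hypothetical dimensions, chosen only to illustrate the arithmetic.
per_screenshot_tokens = 1000
full = kv_cache_bytes(num_layers=2, num_kv_heads=2, head_dim=4,
                      num_tokens=5 * per_screenshot_tokens)
kept = tokens_under_budget(5 * per_screenshot_tokens, budget_ratio=0.2)
compressed = kv_cache_bytes(num_layers=2, num_kv_heads=2, head_dim=4, num_tokens=kept)
```

The cache itself shrinks in proportion to the budget, but peak GPU memory also includes model weights and activations, which may explain why a 20% budget is reported to cut peak memory by nearly 40% rather than 80%.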
[218] Personalized 3D Myocardial Infarct Geometry Reconstruction from Cine MRI for Cardiac Digital Twins cs.CVPDF
Yilin Lyu, Mark YY Chan, Ching-Hui Sia, Lei Li
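TL;DR: This study proposes an explicit geometry-motion embedded model that fully automatically reconstructs personalized, simulation-ready 3D myocardial infarct geometries directly from multi-view cine magnetic resonance imaging (cine MRI). It builds a 4D biventricular mesh to explicitly extract and decouple geometry-aware and motion-aware features, and predicts with an adaptive fusion module and multi-scale supervision, enabling contrast-free characterization of myocardial scar.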
Details
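Motivation: Accurate reconstruction of the 3D geometry of myocardial infarction is essential for building cardiac digital twins that simulate infarct-related electrophysiology. The clinical gold standard, LGE MRI, relies on contrast agents, which limits its use in renally impaired patients and in longitudinal follow-up. The study therefore uses the abnormal ventricular wall motion visible in contrast-free cine MRI as an alternative for locating the infarcted region.
Result: Experiments on 225 cine MRIs show that the proposed 3D MI reconstruction achieves an average Dice score of 0.678 ± 0.011. In downstream in-silico electrophysiological simulation, its results are highly consistent with the LGE-derived ground truth.
Insight: The contribution is a 4D mesh model with explicit geometry-motion embedding: a dual-branch adaptive fusion module captures spatiotemporal dependencies to map the infarcted region, and a multi-scale supervision mechanism with AHA-17 segment-guided cross-attention steers the prediction toward biophysically consistent reconstruction. This offers a new approach to contrast-free scar characterization and cardiac digital twin modeling.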
Abstract: Accurate 3D geometric characterization of myocardial infarction (MI) is essential for building cardiac digital twins (CDTs) to precisely simulate infarct-related electrophysiology. Late gadolinium enhancement magnetic resonance imaging (LGE MRI) is the clinical reference for locating MI, yet its reliance on contrast agents restricts use in renally impaired patients and limits longitudinal follow-ups. As an alternative, contrast-free cine MRI visualizes abnormal ventricular wall motion, which is highly indicative of the infarcted area. In this study, we propose a novel explicit geometry-motion embedded model to fully automatically reconstruct personalized, simulation-ready 3D MI geometries directly from multi-view cine MRIs. Specifically, we construct a 4D (3D + t) biventricular mesh to explicitly extract and decouple geometry-aware and motion-aware features. We further design a dual-branch module for adaptive geometry-motion fusion to capture spatiotemporal dependencies for mapping infarcted region. Furthermore, we introduce multi-scale supervision utilizing an AHA-17 segment-guided cross-attention mechanism to steer the prediction, ensuring biophysically consistent reconstruction. Experimental results on 225 cine MRIs demonstrated that the proposed 3D MI reconstruction achieved high performance with an average Dice score of 0.678 ± 0.011. In the downstream in-silico electrophysiological simulation evaluations, the results were highly consistent with the LGE-derived ground truth, highlighting the great potential of the proposed model for contrast-free scar characterization and seamless integration into CDT modeling. The code will be released publicly upon acceptance of the manuscript for publication.
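For readers unfamiliar with the reported metric, the Dice score measures overlap between a predicted region and a reference region as twice the intersection divided by the sum of the two sizes. The sketch below represents each region as a set of element indices (for example mesh nodes or voxels); that representation and the convention of returning 1.0 for two empty regions are assumptions, since the abstract does not say how the score is computed on the mesh.

```python
def dice_score(pred, target):
    """Dice coefficient between two regions given as collections of indices."""
    pred, target = set(pred), set(target)
    if not pred and not target:
        return 1.0  # convention: two empty regions agree perfectly
    return 2 * len(pred & target) / (len(pred) + len(target))


predicted_infarct = {1, 2, 3}
reference_infarct = {2, 3, 4}
score = dice_score(predicted_infarct, reference_infarct)
```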
[219] Unsupervised Collaborative Domain Adaptation for Driving Scene Parsing cs.CVPDF
Jiahe Fan, Shaolong Shu, Mingjian Sun, Tiehua Zhang, Bohong Xiao
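TL;DR: This paper proposes an unsupervised collaborative domain adaptation (UCDA) framework that adapts driving scene parsing models for autonomous vehicles when source-domain data is inaccessible (the source-free setting). It consolidates complementary knowledge from multiple pre-trained source models into a single target model, using a class-level prototype memory bank and a two-stage transfer strategy to improve robustness and generalization in the target domain.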
Details
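Motivation: Adapting driving scene parsing models to new deployment domains is difficult because pixel-level annotations are expensive and source-domain data is often inaccessible due to privacy, security, or ownership constraints. Existing methods that rely on a single source model are vulnerable to source-specific biases, which limits robustness across diverse driving environments.
Result: Comprehensive evaluations on public driving-scene datasets and on data collected from a real autonomous vehicle platform show that UCDA consolidates complementary multi-source knowledge and improves the reliability of target-domain scene parsing and generalization across diverse driving environments.
Insight: The contributions are: collaborative adaptation with multiple pre-trained source models in the source-free setting; a class-level prototype memory bank that estimates cross-model prediction reliability through prototype similarity, addressing inconsistent confidence scales; and a two-stage transfer strategy (collaborative optimization followed by knowledge distillation) that distills validated expertise. Viewed objectively, multi-model collaboration mitigates single-source bias and improves adaptability in dynamic, open environments.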
Abstract: Reliable driving scene parsing is a fundamental capability for autonomous vehicles operating in open and dynamic driving environments. However, adapting perception models to new deployment domains remains challenging because pixel-level annotations are expensive to obtain, while source-domain data are often inaccessible due to privacy, security, or ownership constraints. Existing source-free unsupervised domain adaptation methods typically rely on a single pre-trained source model, which makes the adapted perception system vulnerable to source-specific biases and limits its robustness under diverse road layouts, illumination conditions, weather patterns, and traffic conditions. This article presents an unsupervised collaborative domain adaptation (UCDA) framework for driving scene parsing in a source-free setting, which transfers complementary knowledge from multiple pre-trained source models to a unified target model without accessing any original source samples. To compare predictions from independently trained models, UCDA constructs a class-level prototype memory bank and estimates cross-model prediction reliability through prototype similarity, reducing the effect of inconsistent confidence scales across source models. Based on the resulting complementary supervision, UCDA adopts a two-stage transfer strategy: multiple source models are first refined on unlabeled target-domain driving data through collaborative optimization with positive and negative consistency constraints, and their validated expertise is then distilled into a single deployable target model. Comprehensive evaluations on public driving-scene datasets and real-world data collected from an autonomous vehicle platform demonstrate that UCDA effectively consolidates complementary multi-source knowledge, improving target-domain scene parsing reliability and generalization across diverse driving environments.
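The abstract above describes comparing predictions from independently trained models through prototype similarity rather than raw confidence. The sketch below illustrates that idea only. It assumes each source model keeps one prototype vector per class and that reliability is the cosine similarity between a pixel feature and the prototype of the predicted class; the exact formulation in UCDA is not given in the abstract, and all names here are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)

def most_reliable_prediction(candidates, prototypes):
    """Pick the prediction whose feature best matches its class prototype.

    candidates: list of (model_id, feature_vector, predicted_class)
    prototypes: dict model_id -> dict class_name -> prototype vector
    Returns (model_id, predicted_class, reliability).
    """
    best = None
    for model_id, feature, cls in candidates:
        # Assumed reliability score: similarity to the model's own prototype.
        score = cosine(feature, prototypes[model_id][cls])
        if best is None or score > best[2]:
            best = (model_id, cls, score)
    return best
```

Because cosine similarity is bounded in [-1, 1] for every model, the scores are comparable across models even when their softmax confidences are calibrated differently.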
[220] ROGLE: Robust Global-Local Alignment with Automated Region Supervision for Text-Based Person Search cs.CV | cs.MM
Zequn Xie, Xibei Jia, Sihang Cai, Shulei Wang, Tao Jin
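TL;DR: This paper proposes ROGLE, a framework that addresses insufficient fine-grained alignment in text-based person search through an automated region-to-sentence matching strategy, and builds P-VLG, a large-scale dataset that supports both global and local evaluation.
Details
Motivation: Existing CLIP-based TBPS models struggle with fine-grained understanding because of global representational bias and the semantic sparsity that comes from training on short captions; region-level annotations are also scarce.
Result: ROGLE significantly outperforms existing methods in experiments, particularly on long query texts, and is validated on the newly built P-VLG benchmark.
Insight: The novelty lies in automatically mining pseudo region-sentence pairs for scalable fine-grained supervision, and in a multi-granular learning strategy that fuses global contrastive learning with region-level local alignment. The paper also contributes the first TBPS benchmark dataset that supports both global and local evaluation.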
Abstract: Text-Based Person Search (TBPS) aims to retrieve pedestrian images using natural language queries. However, existing TBPS models, especially those based on CLIP, struggle with fine-grained understanding due to global representational bias and semantic sparsity inherited from training on short captions. This results in weak fine-grained alignment, exacerbated by the scarcity of region-level annotations. To address this, we propose ROGLE (Robust Global-Local Embedding), a unified framework that overcomes reliance on costly manual annotations through an automated Region-to-Sentence Matching (RSM) strategy. RSM automatically mines pseudo region-sentence pairs for scalable fine-grained supervision. Furthermore, ROGLE employs a multi-granular learning strategy that fuses global contrastive learning with region-level local alignment. We also introduce the P-VLG Benchmark, a large-scale dataset constructed by curating and enriching images from established public benchmarks. It features over 100,000 annotated regions and rich long-form captions, making it the first TBPS benchmark to support both global and local assessment protocols. Extensive experiments show that ROGLE significantly outperforms existing approaches, particularly on challenging long-form queries. Code and the P-VLG benchmark will be made publicly available.
[221] Hist2Style: Histogram-Guided Stylization with Bilateral Grids cs.CV | eess.IV
Dekel Galor, Adam Pikielny, Zhoutong Zhang, Ke Wang, Laura Waller
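TL;DR: Hist2Style is a fast, edge-aware stylization method based on bilateral grids, aimed at real-time, high-resolution photorealistic style transfer. It distills a large image editing model into a lightweight network conditioned on a histogram-based embedding, preserving content structure, avoiding hallucinations, and letting users adjust color and tone interactively.
Details
Motivation: Large image models used for photorealistic style transfer are computationally expensive, prone to hallucinations, and offer limited user control, which makes them ill-suited to high-resolution, real-time workflows.
Result: By constraining operations to locally affine transforms in bilateral space, Hist2Style achieves real-time, high-resolution stylization while preserving visual fidelity. Its lightweight network is trained on a large generated supervised corpus and targets spatially varying color edits.
Insight: The contributions are a bilateral-grid formulation for edge-aware stylization, a histogram embedding that gives an interpretable interface for adjusting the target color distribution, and model distillation that yields a lightweight network that avoids hallucinations while preserving content structure.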
Abstract: Photorealistic style transfer aims to match the color and tone of an input image to that of a style target while preserving the content and details of the original scene. Although existing large image models can facilitate these kinds of appearance edits, their high computational demands, potential for hallucinations, and limited user control make them unsuitable for high-resolution, real-time workflows. We introduce Hist2Style, a bilateral-grid formulation for fast, edge-aware stylization that preserves visual fidelity by constraining operations to locally affine transforms in bilateral space. Our model distills a large image editing model into a lightweight network by training on a large supervised corpus generated with language and vision-language models, targeting spatially varying color edits. The network conditions on a histogram-based embedding of the style target to provide an interpretable interface for adjusting the output style by modifying the target color distribution. Overall, Hist2Style maintains content structure by construction, avoids hallucinations, and supports real-time, high-resolution photorealistic stylization with interactive user-controllable color and tone adjustments.
[222] RescueBench: Can Embodied Agents Save Lives in the Wild ? cs.CV
Kui Wu, Beiyu Guo, Hao Chen, ShuHang Xu, Yuling Li
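TL;DR: The paper introduces RescueBench, a photo-realistic diagnostic benchmark for evaluating the combined capabilities of embodied agents in in-the-wild search-and-rescue tasks. It models search and rescue as four stages (multimodal exploration, target rescue, memory-guided return, and final handoff) and uses task composition with stage-level evaluation to analyze how exploration and memory failures propagate through the rescue workflow.
Details
Motivation: Existing benchmarks usually evaluate embodied capabilities such as exploration and memory in isolation, whereas search and rescue requires composing them in a realistic workflow, so it is unclear how failures compound. RescueBench aims to diagnose the bottlenecks of embodied agents in complex, long-horizon rescue tasks with a comprehensive, scalable benchmark.
Result: Seven baselines, an oracle reference, and human players were evaluated on RescueBench; no baseline completes the full task at the highest difficulty. Stage-level diagnosis shows that autonomous exploration is the dominant failure mode and spatial memory is a second, independent bottleneck; current topological vision-language navigation and map-based methods do not resolve these limitations.
Insight: The novelty is formalizing search and rescue as a multi-stage, composable diagnostic pipeline, with progressive difficulty levels and an automatic episode generation and annotation pipeline. Viewed objectively, the work highlights the importance of evaluating capability composition in complex, long-horizon tasks and offers a systematic framework for analyzing failure propagation, which should help embodied AI move toward real-world use.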
Abstract: Search-and-rescue (SAR) requires embodied agents to explore unfamiliar environments under multimodal uncertainty, perform multi-stage interactions, and retrieve spatial memory over long horizons. Existing benchmarks typically evaluate these capabilities in isolation, leaving unclear how failures compound when they must be composed in realistic workflows. We introduce RescueBench, a photo-realistic diagnostic benchmark that instantiates SAR as a four-stage pipeline: multimodal exploration, target rescue, memory-guided return, and final handoff. By combining sequential task composition with stage-level evaluation, RescueBench enables analysis of how exploration and memory failures propagate through embodied rescue workflows. It contains five progressive difficulty levels that vary in environmental complexity, clue ambiguity, and spatial hierarchy, along with an automatic episode generation and annotation pipeline for scalable evaluation and training. We evaluate seven baselines, an oracle reference, and human players, showing that no baseline completes the full task at the highest difficulty. Stage-level diagnosis identifies autonomous exploration as the dominant failure mode and spatial memory as a second, independent bottleneck, suggesting that these limitations are not resolved by current topological visual-language navigation or map-based methods. Code is available at https://github.com/wukui-muc/RescueBench
[223] Auteur: Language-Driven Cinematographic Framing for Human-Centric Video Generation cs.CV
Muhammed Burak Kizil, Enes Sanli, Niloy J. Mitra, Xuelin Chen, Erkut Erdem
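TL;DR: This paper presents Auteur, a language-driven, human-centric camera framing control method for video generation. It formalizes professional cinematographic framing concepts (shot size, angle, and composition) as a camera parameterization that is a function of human pose and motion, and designs a domain-specific language (DSL) convertible to standard 6-DoF camera parameters. A fine-tuned multimodal large language model acts as a virtual director, mapping natural language descriptions and coarse human motion to sparse DSL keyframes, which are interpolated into continuous camera trajectories and passed to a video generator.
Details
Motivation: Generative video models have made notable progress in visual fidelity and temporal coherence, but camera motion remains stochastic, spatially inconsistent, and indifferent to the human subject in the scene. The paper targets the missing capability of language-driven, human-centric professional cinematographic framing control.
Result: Auteur is trained and evaluated on a new dataset of 34K aligned text, human motion, and DSL-annotated camera trajectories built from procedural synthesis and real movie footage from the CondensedMovies dataset. The paper proposes new framing-focused metrics, and the experiments show that Auteur consistently outperforms existing methods at cinematographic framing.
Insight: The core novelty is formalizing actor-centric framing intuition as a camera parameterization and introducing an interpretable DSL as an intermediate representation. Using a large language model as a virtual director maps natural language to deterministic camera trajectories that follow cinematographic rules, giving generated video controllable, human-centric camera framing.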
Abstract: Generative video models have achieved remarkable visual fidelity and temporal coherence, yet intentional camera control remains elusive. Existing frameworks treat camera motion as a byproduct of pixel synthesis, producing trajectories that are stochastic, spatially inconsistent, and indifferent to the human subject driving the scene. In this work, we present Auteur, a method for language-driven, human-centric camera framing in generative video. Our core insight is that professional filmmakers conceive shots not as world-space trajectories but as framings defined relative to the actor, encoding shot size, angle, and composition as functions of human pose and motion. We formalize this intuition as a human-centric camera parameterization and introduce a Domain-Specific Language (DSL) that is convertible to standard 6-DoF camera parameters. A fine-tuned multimodal large language model then acts as a virtual director, mapping natural language descriptions and coarse human motion to sparse DSL keyframes that are deterministically interpolated into continuous camera trajectories, which are then provided as input to video generators. We train and evaluate Auteur on a new dataset of 34K aligned text, human motion, and DSL-annotated camera trajectories drawn from procedural synthesis and real-world movie footage from the CondensedMovies dataset. Auteur enables cinematographic framing of human-centered scenes, a capability largely absent in prior generative models. To assess this behavior, we propose new framing-focused metrics, and our experiments show that Auteur consistently outperforms existing methods.
[224] The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue cs.CV | cs.AI | cs.CL
Sherzod Hakimov, Mattia D’Agostini, Ivan Samodelkin, David Schlangen
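TL;DR: This paper introduces the Image Reconstruction Game, an automated benchmark that evaluates how well a vision-language model issues corrective instructions to an image generator over multiple turns, so that accumulated common ground is directly observable in the rendered image. Crossing two describer models with two generator models across seven image categories, the study finds that the describer is the dominant factor in reconstruction quality, while the generator determines whether iterative refinement helps or hurts. Mathematical and geometric images are the most challenging. The describer's token budget strongly affects convergence: shorter budgets give sparser initial renderings with more room for improvement, while longer budgets raise absolute quality but leave less to fix. Stronger describers use a richer correction vocabulary spanning spatial, numeric, and structural categories, while weaker describers concentrate on surface properties and tend to stop after a few turns. Human validation shows that the best automated judge reaches only slight-to-fair agreement with human preferences, so automated scores need human recalibration to be used reliably.
Details
Motivation: The paper asks how accumulated common ground can be established and observed through multi-turn multimodal dialogue, in order to evaluate the interactive ability of vision-language models in iterative image reconstruction.
Result: Crossing two describer models with two generator models across seven image categories shows that the describer dominates reconstruction quality and the generator governs the effect of iterative refinement; mathematical and geometric images are the most challenging; and the best automated judge agrees with human preferences only to a limited degree (slight-to-fair agreement).
Insight: The novelty is a fully automated multi-turn dialogue benchmark that makes common ground visible. The analysis indicates that the describer's token budget and vocabulary richness are key factors, and that automated scores need human calibration to be reliable.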
Abstract: We introduce the Image Reconstruction Game, a fully automated benchmark in which a vision-language model issues corrective instructions to an image generator across multiple turns, making accumulated common ground directly observable as a rendered image. Benchmarking two Describer models crossed with two Generator models across seven image categories, we find that the describer is the dominant factor in reconstruction quality, while the generator determines whether iterative refinement helps or hurts. Mathematical and geometric images pose the greatest challenge. The describer’s token budget strongly affects convergence: shorter budgets yield sparser first renderings with more room for visible improvement, while longer budgets raise absolute quality but leave less to fix. Stronger describers use a richer correction vocabulary spanning spatial, numeric, and structural categories, while weaker describers concentrate on surface properties and tend to stop after a few turns. Human validation shows that the best automated judge reaches only slight-to-fair agreement with human preferences, and automated scores require human recalibration to be used reliably.
[225] Pool-Select-Refine: Allocation-Aware Generative Dataset Distillation with Soft-Label-Guided Latent Refinement cs.CV
Wenmin Li, Shunsuke Sakai, Zhongkai Zhao, Tatsuhito Hasegawa
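TL;DR: This paper proposes "Pool-Select-Refine", a two-stage framework that addresses the tight coupling between budget allocation and sample generation in diffusion-based generative dataset distillation. The method first builds an over-complete candidate pool and selects a compact subset from it, then refines the selected samples in latent space using soft-label supervision from a teacher model.
Details
Motivation: Existing diffusion-based dataset distillation methods usually adopt a rigid "Generate-and-Use" strategy that treats generated samples directly as the final distilled set, so the limited distillation budget may be wasted or spent on insufficiently informative samples. The paper aims to decouple generation, selection, and refinement to use the budget more effectively.
Result: Experiments on large-scale and fine-grained image classification benchmarks show consistent gains over diffusion-based baselines.
Insight: The core novelty is introducing selection as a separate stage in the generative dataset distillation pipeline, that is, a curation stage before refinement, together with soft-label-guided latent refinement to improve semantic alignment. This is a simple yet effective idea: decoupling the pipeline allows more flexible budget allocation and better sample quality.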
Abstract: Diffusion-based dataset distillation has recently emerged as a promising paradigm for condensing large-scale datasets into compact synthetic sets. By leveraging pretrained generative priors, these methods can produce realistic class-conditional samples more efficiently than traditional matching-based approaches. However, most existing diffusion-based methods still adopt a rigid "Generate-and-Use" strategy, where the generated samples are directly treated as the final distilled set under a fixed images-per-class budget. Such a design tightly couples candidate generation with final budget allocation, which may result in redundant waste of the limited budget or insufficiently informative samples. In this paper, we propose "Pool-Select-Refine", a two-stage framework for allocation-aware generative dataset distillation. First, instead of directly using a fixed number of generated samples, we construct an over-complete candidate pool and select a compact subset under the target budget. Second, we refine the selected samples in latent space using soft-label supervision derived from the teacher model, improving semantic alignment while preserving the generative prior. This design explicitly decouples generation, selection, and refinement, enabling more effective use of the distillation budget. Experiments on large-scale and fine-grained image classification benchmarks show that the proposed framework delivers consistent gains over diffusion-based baselines. The results suggest that introducing a curation stage before refinement is a simple yet effective way to improve diffusion-based dataset distillation.
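To make the "pool then select" step in the abstract above concrete, here is a rough sketch of budget-constrained selection from an over-complete pool. The abstract does not state the selection criterion, so this sketch assumes each candidate carries a hypothetical scalar informativeness score, keeps at least one sample per class, and fills the remaining budget greedily across classes. It illustrates allocation that is not fixed per class and is not the paper's algorithm.

```python
def select_under_budget(pool, budget):
    """Select sample ids from an over-complete pool under a total budget.

    pool: dict class_name -> non-empty list of (sample_id, score)
    budget: total number of samples to keep (at least one per class)
    Returns dict class_name -> list of selected sample ids.
    """
    if budget < len(pool):
        raise ValueError("budget must allow at least one sample per class")
    selected = {cls: [] for cls in pool}
    remaining = []
    for cls, items in pool.items():
        ranked = sorted(items, key=lambda it: it[1], reverse=True)
        # Guarantee class coverage with the top-scoring candidate.
        selected[cls].append(ranked[0][0])
        remaining.extend((score, cls, sid) for sid, score in ranked[1:])
    # Spend the rest of the budget on the best remaining candidates,
    # regardless of class, so allocation can differ between classes.
    remaining.sort(key=lambda r: r[0], reverse=True)
    for _, cls, sid in remaining[: budget - len(pool)]:
        selected[cls].append(sid)
    return selected
```

With a pool of three "cat" and two "dog" candidates and a budget of three, the extra slot goes to whichever class has the stronger second candidate, instead of being split by a fixed images-per-class rule.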
[226] Residual Decoder Adapter: ID-Preserving Tokenizer Adaption for Autoregressive Text Rendering cs.CV
Dongxing Mao, Jinpeng Wang, Jiahao Tang, Kevin Qinghong Lin, Linjie Li
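TL;DR: This paper proposes the Residual Decoder Adapter (RDA) to improve the text rendering of autoregressive visual models. By introducing a paired codebook that shares the token distribution and a parallel branch that learns pixel-space residuals, it strengthens the tokenizer's detail reconstruction without retraining the existing tokenizer or AR model.
Details
Motivation: Autoregressive visual models generate images well overall, but their text rendering suffers from blurred strokes and distorted letter shapes, mainly because the visual tokenizer struggles to reconstruct fine-grained detail. The authors aim to improve text rendering without retraining the existing tokenizer or AR model.
Result: On the TextAtlas benchmark, RDA markedly improves text rendering: for example, the OCR accuracy of a fine-tuned Janus-Pro model rises from 24.52% to 58.26% on TextVisionBlend and from 12.75% to 36.81% on StyledTextSynth.
Insight: The novelty is a non-invasive tokenizer adaptation method whose residual design improves detail reconstruction while staying compatible with the original AR model. Viewed objectively, the paired codebook and parallel residual learning offer an efficient, low-cost way to upgrade pretrained visual components.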
Abstract: Visual Autoregressive (AR) models generate images by predicting discrete tokens that are decoded by a visual tokenizer. Despite demonstrating strong overall image generation ability, they still underperform on text rendering, producing blurred strokes and disrupted letter shapes. In this work, we trace this limitation to the visual tokenizer, which struggles to reconstruct fine-grained detail. Improving the tokenizer is straightforward but expensive, as it necessitates retraining both the tokenizer and the AR model. Can we improve the text rendering performance of AR models without retraining the existing tokenizer and AR model? To achieve this, we propose the Residual Decoder Adapter (RDA), which upgrades an existing tokenizer post hoc without changing its token space. Specifically, it refines the decoder output of the visual tokenizer by introducing two novel components: (i) a paired codebook that shares the token distribution with the original one; (ii) a parallel branch that learns the small differences (residual) between the reconstructed image and the ground-truth image in pixel space. This residual design allows us to enhance the tokenizer non-invasively while preserving compatibility with prior AR models. RDA improves text rendering by a large margin. For instance, the OCR accuracy of a fine-tuned Janus-Pro rises from 24.52% to 58.26% on TextVisionBlend and from 12.75% to 36.81% on StyledTextSynth, both from the competitive TextAtlas benchmark. The code is available at https://github.com/CSU-JPG/RDA
[227] 3rd Place at CVPR 2026 CASTLE Challenge: Agentic Multi-View Long-Context Video Understanding via Hierarchical Knowledge Graph Retrieval cs.CV
Raghad Albusayes, Munirah Alyahya
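TL;DR: This paper describes the solution that took third place in the CVPR 2026 CASTLE Challenge. For complex visual, spatiotemporal, and verbal questions over large-scale, multimodal long video streams, it proposes a training-free agentic framework. At its core is a video knowledge graph that enables multi-hop relational reasoning, combined with a hierarchical-retrieval workflow for long-context multi-view video streams, achieving high accuracy in the zero-shot setting.
Details
Motivation: The goal is complex visual, spatiotemporal, and verbal reasoning (such as visual counting, action localization, multi-view tracking, and speaker temporal reasoning) over large-scale, long-context multi-view video streams captured synchronously by multiple egocentric (ego) and exocentric (exo) cameras.
Result: The method placed third globally in the CVPR 2026 CASTLE Challenge. Empirical results indicate that the framework achieves high zero-shot reasoning accuracy on long-context multi-view video streams.
Insight: The main novelty is a training-free agentic framework built around a video knowledge graph of static and dynamic entities, temporal relationships, and intersecting events that supports multi-hop reasoning, plus an adaptive agentic workflow based on hierarchical retrieval and indexing for complex queries.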
Abstract: This paper presents our winning methodology for the CASTLE 2026 Challenge at the CVPR 2026 EgoVis Workshop, where our team secured third place globally. The challenge tasks participants with answering highly complex visual, spatiotemporal, and verbal questions, including visual counting, action localization, multi-view tracking, and speaker temporal reasoning, within massive, multimodal video streams. The underlying dataset consists of over 600 hours of synchronized footage captured by 15 ego and exo camera sources. To tackle the extreme scale and long-context demands of this environment, we introduce a training-free agentic framework optimized for long-form video understanding. Our framework introduces two core architectural components: i) a Video Knowledge Graph that maps static and dynamic entities, their temporal relationships, and intersecting events to enable multi-hop relational reasoning, and ii) an adaptive agentic workflow that resolves complex queries through hierarchical retrieval and indexing. Empirical results demonstrate that our framework achieves high zero-shot reasoning accuracy on long-context multi-view streams. Our code will be released at https://github.com/RaghadKhaled/CASTLE-Challenge-Framework.
[228] Unified Driving Tokens: Representation- and Geometry-Guided Discrete Tokenizer for Driving World Models and Planning cs.CV
Ziyang Yao, Zeyu Zhu, YunCheng Jiang, Zibin Guo, Huijing Zhao
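TL;DR: This paper proposes a representation-guided, geometry-enhanced discrete visual tokenizer for autonomous-driving world modeling and planning. The tokenizer learns discrete tokens under joint supervision: it aligns the discrete bottleneck with a frozen DINO feature space, preserves appearance through RGB reconstruction, and injects geometric cues such as depth and relative pose. Experiments show improved reconstruction fidelity and representation consistency, competitive planning performance, and better generative quality.
Details
Motivation: Most discrete tokenizers are inherited from image generation and optimized mainly for pixel reconstruction, which can leave a gap between what is easy to generate and what driving decisions need to decode. The paper designs a tokenizer specifically for autonomous driving to close this gap and provide a compact, useful representation for token-based world modeling and planning.
Result: On the NAVSIM benchmark, the method improves reconstruction fidelity and representation consistency, achieves competitive planning performance under a fixed decoder, and shows better generative quality under matched settings.
Insight: The contributions are aligning the discrete bottleneck with pretrained DINO features via feature decoding to make the representation more useful; adding adjacent-frame depth and relative-pose supervision to inject geometric state cues; and stabilizing the joint objectives with multi-codebook quantization. The result is a compact discrete representation for autonomous driving that focuses more on decision-relevant information.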
Abstract: Discrete visual tokens should provide a compact representation for both token-based world modeling and planning in autonomous driving. However, most tokenizers are inherited from image generation and are optimized mainly for pixel reconstruction, which may leave a gap between what is easy to generate and what is useful to decode for driving decisions. We present a representation-guided and geometry-enhanced tokenizer that learns discrete tokens under joint supervision. The tokenizer aligns its discrete bottleneck with a frozen DINO feature space through feature decoding, while preserving appearance via RGB reconstruction with perceptual and adversarial losses. To inject geometric state-related cues, we add adjacent-frame depth and relative-pose supervision during training and stabilize joint objectives with multi-codebook quantization. We evaluate the same learned tokens with a lightweight planning readout and a GPT-style next-token world model. Experiments on NAVSIM show improved reconstruction fidelity and representation consistency, competitive planning performance under a fixed decoder, and better generative quality under matched settings.
[229] SAVMap: Structure-Aided Visual Mapping of Large-Scale 2.5D Manhattan Wireframes from Panoramic Video cs.CV
Howard Huang, Bharath Surianarayanan, Keifer Lee, Chenyu Wang, Chen Feng
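TL;DR: SAVMap generates a semantic wireframe map of warehouse shelf and light structures using only a panoramic video camera as sensor input. From panoramic video captured along the warehouse aisles, it extracts sequences of rectified shelf-facing and ceiling-facing images, uses a semantic segmentation network to extract and track sparse semantic structure feature points (such as shelf corners and light centers), and applies a constrained structure-from-motion algorithm that exploits real-world geometric relationships such as Manhattan grids to produce the 3D points that form the wireframe map.
Details
Motivation: Precise 3D representations of industrial environments are needed for tasks such as robot localization and digital twin generation. The paper targets efficient, accurate semantic mapping of large-scale warehouses using only a low-cost panoramic video sensor.
Result: In a warehouse with 46 shelving rows (each 55 m by 7 m), the method builds wireframe maps for over 5000 shelf elements from one hour of panoramic video, with an aggregate mean absolute error of 4.8 cm relative to ground truth, demonstrating scalability and accuracy.
Insight: The novelty is combining semantic segmentation with Manhattan-grid-constrained structure from motion to automatically extract and track semantic feature points from panoramic video, enabling robust reconstruction of large-scale 2.5D Manhattan wireframes. Viewed objectively, exploiting structural priors of the scene (such as Manhattan geometry) reduces the ambiguity of relying on visual data alone and improves practicality and accuracy in industrial environments.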
Abstract: Precise 3D representations of industrial environments enable tasks such as robot localization and digital twin generation. We propose SAVMap, a method for generating a semantic wireframe map of warehouse shelf and light structures using only a panoramic video camera as the sensor input. Sequences of rectified images with shelf and ceiling-facing views are extracted from a panoramic video captured along the warehouse aisles. Using a semantic segmentation network front end, a set of sparse, semantic structure feature points (e.g., corners of shelf structures, centers of lights) are extracted from each image and tracked across the sequences. By accounting for real-world geometric relationships among the points such as Manhattan grids, a constrained structure-from-motion algorithm yields the 3D points that form a wireframe map. We demonstrate the scalability and accuracy of our proposal in a warehouse with 46 shelving rows, each with faces spanning 55 m by 7 m. From an hour of panoramic video content, we create wireframe maps for over 5000 shelf elements across the rows, achieving an aggregate mean absolute error of 4.8 cm with respect to ground truth.
[230] SCAPO: Self-Supervised Category-Level Articulated Pose Estimation from a Single 3D Observation cs.CV
Can Zhang, Gim Hee Lee
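TL;DR: SCAPO is a self-supervised framework that estimates the canonical geometry, rigid part segmentation, and joint pivots, axes, and articulation states of category-level articulated objects from a single RGB-D observation, without ground-truth labels or category-specific models. It aligns instances into a shared canonical space with an SE(3)-equivariant vector-neuron autoencoder and models part motion with a joint-aware blend-skinning module.
Details
Motivation: Existing methods rely on dense supervision, multi-frame inputs, or CAD templates, and struggle to disentangle geometry from articulation or to recover explicit joint parameters; SCAPO aims to address these issues.
Result: Experiments on synthetic and real articulated-object datasets show that SCAPO recovers consistent part structure and accurate articulation parameters, and outperforms all self-supervised baselines.
Insight: The contributions are an SE(3)-equivariant autoencoder that factors out global pose, and decoupled learning through cycle reconstruction and a learnable canonical template, which separates geometry from articulation.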
Abstract: Existing methods for category-level object articulation from a single 3D observation often rely on dense supervision, multi-frame inputs, or CAD templates, and still struggle to disentangle geometry from articulation or to recover explicit joint parameters. We propose SCAPO, a self-supervised framework that estimates canonical geometry, rigid part segmentation, and joint pivots, axes, and articulation states from a single RGB-D observation without ground-truth labels or category-specific models. SCAPO first uses an SE(3)-equivariant vector-neuron autoencoder to factor out global pose and align diverse instances into a shared canonical space. On this aligned shape, a joint-aware blend-skinning module is then designed to model part motion. We learn this representation through cycle reconstruction between observed and canonical shapes and cross-space alignment with a learnable canonical template that decouples shared category geometry from instance-specific residual shape. Experiments on synthetic and real articulated-object datasets show that SCAPO recovers consistent part structure and accurate articulation parameters and outperforms all self-supervised baselines.
[231] Beyond Low-Rank: Low-Rank Sparse Prompting via Spiking Neural Network and Prompt Factorization cs.CV
Yumiao Zhao, Bo Jiang, Beibei Wang, Xixi Wan, Xiao Wang
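TL;DR: This paper proposes LoRSP, a visual prompt learning framework that combines the sparse firing mechanism of spiking neural networks with low-rank factorization to generate dynamic low-rank sparse visual prompts, addressing the redundant perturbations, limited generalization, and energy inefficiency of existing dense pixel-level prompting methods.
Details
Motivation: Existing visual prompting methods typically use dense pixel-level prompts, which suffer from redundant perturbations, limited generalization, and energy inefficiency. The paper aims to overcome these limitations with brain-inspired spiking learning.
Result: Extensive experiments on five heterogeneous vision backbones and multiple benchmarks show that LoRSP achieves performance competitive with existing methods while requiring fewer tunable parameters.
Insight: The novelty is combining the sparse firing of spiking neurons with low-rank factorization to generate instance-specific, selective prompts, giving more compact and robust model adaptation. Viewed objectively, it is a novel way to combine bio-inspired computation with efficient model adaptation.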
Abstract: Visual Prompting (VP) has emerged as an efficient paradigm for adapting large-scale pre-trained vision models to downstream tasks by incorporating learnable prompts at the input level. However, existing VP methods typically employ dense pixel-level prompts, which often suffer from redundant perturbations, limited generalization, and energy inefficiency. To overcome these limitations, we propose to integrate brain-inspired spiking learning into visual prompt learning tasks. Spiking neurons can process information inexpensively by transforming input data into discrete spike trains and returning sparse outputs. Inspired by this, we propose Low-Rank visual Spike Prompting (LoRSP), a novel framework that learns dynamic low-rank sparse visual prompts naturally via a spiking neuron learning mechanism. The core idea of LoRSP is to exploit the brain-inspired sparse firing mechanism of spiking neurons to generate a pixel-level sparse prompt for each instance. To be specific, we first construct a series of prompt factors via low-rank factorization to capture distinct prompt subspaces. These prompt factors are then fed into an SNN architecture, which performs the integrate-and-fire process to emit spikes. As a result, our LoRSP generates a sparse visual prompt while maintaining the low-rank constraint. This design enables instance-specific selective prompting, leading to more compact and robust adaptation across diverse downstream tasks. Extensive experiments on five heterogeneous vision backbones and multiple benchmarks demonstrate that LoRSP achieves competitive performance while requiring fewer tunable parameters compared to existing VP methods.
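The integrate-and-fire step mentioned in the abstract above can be illustrated with a toy example. The sketch below builds a dense matrix from two low-rank factors and drives one non-leaky integrate-and-fire neuron per pixel, so that weak inputs never fire and the output is sparse. The constant-input drive, soft reset, and firing-rate readout are assumptions made for illustration; the sketch does not reproduce LoRSP's architecture or how the paper maintains its low-rank constraint.

```python
def low_rank_matrix(U, V):
    """Return U V^T for U of shape h x r and V of shape w x r (nested lists)."""
    return [[sum(u[k] * v[k] for k in range(len(u))) for v in V] for u in U]

def integrate_and_fire(current, steps=4, threshold=1.0):
    """Drive one neuron with a constant input and return its spike count."""
    potential, spikes = 0.0, 0
    for _ in range(steps):
        potential += current          # integrate the input
        if potential >= threshold:    # fire once the threshold is reached
            spikes += 1
            potential -= threshold    # soft reset (assumed)
    return spikes

def sparse_prompt(U, V, steps=4, threshold=1.0):
    """Per-pixel firing rate of neurons driven by the low-rank matrix."""
    dense = low_rank_matrix(U, V)
    return [[integrate_and_fire(x, steps, threshold) / steps for x in row]
            for row in dense]
```

Entries whose accumulated input stays below the threshold produce no spikes, which is where the sparsity comes from in this toy setting.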
[232] MT-EditFlow: Reinforcement Learning for Multi-Turn Image Editing with Flow Matching cs.CV
Jiahui Huang, Yasi Zhang, Tianyu Chen, Shu Wang, Jianwen Xie
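TL;DR: This paper proposes MT-EditFlow, a flow-matching reinforcement learning framework that addresses the "all-or-nothing" failures and error propagation of multi-turn interactive image editing. By combining a multi-turn perspective with a multi-reward formulation, it optimizes the reward signal for sequential editing and significantly improves multi-turn editing performance across several base models.
Details
Motivation: Instruction-based editing models trained for single-turn edits perform poorly in multi-turn interactive editing, where a single failed turn compromises the whole sequence and exposure bias causes errors to accumulate.
Result: MT-EditFlow improves the turn-3 overall performance of FLUX.1-Kontext-dev by 6.85 points, surpassing state-of-the-art open-source models such as Qwen-Image-Edit, and shows significant gains across diverse base models.
Insight: The novelty is a unified flow-matching reinforcement learning framework that systematically optimizes the reward signal by studying turn-level aggregation scoring strategies, the trade-offs among VLM reasoning modes, and advantage fusion levels. In particular, broadcasting the aggregated advantage across the entire editing trajectory bridges the gap between local planning and global multi-turn task success.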
Abstract: Recent breakthroughs in instruction-based image editing have captured significant attention, as models are now capable of handling real-world editing demands with the practicality required by everyday users. However, editing models trained primarily for single-turn edits often break down in multi-turn editing, the natural interactive setting in which a user iteratively refines an image based on the model’s own previous outputs. This failure stems from the all-or-nothing requirement, where a single failed turn compromises the entire sequence, and error propagation, where exposure bias leads to compounding editing errors. To address these challenges, we introduce MT-EditFlow, a flow-matching reinforcement learning framework designed to optimize reward signals for sequential image editing. MT-EditFlow integrates a multi-turn perspective with a multi-reward formulation to provide a unified structure applicable to both GRPO and NFT-based reinforcement learning methods. We systematically analyze and optimize the reward signal by investigating effective scoring strategies for turn-level aggregation, VLM reasoning modes to trade off reward bias and variance, and advantage fusion levels to prevent reward hacking. Our findings reveal that broadcasting the aggregated advantage across the entire editing trajectory effectively bridges the gap between local planning and global multi-turn task success. Extensive experiments demonstrate that MT-EditFlow significantly improves performance across diverse base models. Notably, it boosts FLUX.1-Kontext-dev by 6.85 points in turn-3 overall performance, surpassing state-of-the-art open-source models such as Qwen-Image-Edit. By maintaining high marginal success rates and reducing exposure bias, MT-EditFlow provides a foundation for more reliable and natural human-AI collaboration in visual content creation.
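The phrase "broadcasting the aggregated advantage across the entire editing trajectory" in the abstract above can be illustrated as follows. This sketch assumes a GRPO-style group of sampled trajectories, aggregates each trajectory's per-turn rewards with a plain mean, normalizes the aggregate within the group, and assigns the same value to every turn. The mean aggregation and the normalization details are assumptions; the paper studies several scoring strategies that the abstract does not specify.

```python
import statistics

def broadcast_advantages(turn_rewards, eps=1e-8):
    """Compute one group-normalized advantage per trajectory and copy it to each turn.

    turn_rewards: list of trajectories, each a non-empty list of per-turn rewards
    Returns a list with the same shape holding per-turn advantages.
    """
    # Aggregate turn-level rewards into one score per trajectory (assumed: mean).
    totals = [sum(rewards) / len(rewards) for rewards in turn_rewards]
    mean = statistics.fmean(totals)
    std = statistics.pstdev(totals)
    # Normalize within the group, then broadcast to every turn of the trajectory.
    return [[(total - mean) / (std + eps)] * len(rewards)
            for total, rewards in zip(totals, turn_rewards)]
```

Every turn of a trajectory then receives the same learning signal, so an early turn is credited or penalized for how the whole editing sequence turned out.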
[233] Towards 3D-Aware Video Diffusion Models: Render-Free Human Motion Control with Mesh Tokenization cs.CV | cs.AI | eess.IV
Jingyun Liang, Min Wei, Shikai Li, Yizeng Han, Hangjie Yuan
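TL;DR: This paper proposes a render-free, 3D-aware video diffusion framework that compresses 3D human meshes into tokens and processes them jointly with video tokens in a DiT-based architecture, enabling precise control over human motion, geometry, camera viewpoint, and scene. It avoids the artifacts and mismatches caused by conventional methods' reliance on 2D motion guidance videos.
Details
Motivation: Whether video diffusion models truly understand the 3D structure underlying visual observations, rather than merely producing plausible 2D projections, remains an open question. The paper investigates it through human motion control, a task that requires precise modeling of 3D human geometry, motion, camera viewpoint, and scene context.
Result: Experiments show strong performance on human motion control benchmarks, with fewer artifacts from view-dependent 2D guidance and trajectory-pose mismatches during editing.
Insight: The novelty is a render-free conditional generation framework based on 3D human mesh tokenization: full 3D geometry is compressed into tokens and reasoned over jointly with video tokens in a unified pipeline, which requires the model to consider appearance, 3D structure, and camera viewpoint together during generation and helps it capture complex 3D human structure and its interaction with the environment.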
Abstract: Diffusion models have shown remarkable success in video generation. However, whether such models are truly aware of the 3D structure underlying visual observations, rather than simply reproducing plausible 2D projections, remains an open question. In this work, we investigate this question through human motion control, a task that requires precise modelling of 3D human geometry, motion, camera viewpoint, and scene context. Unlike prior methods that rely on rendered 2D motion guidance videos, we propose a render-free framework that conditions video generation directly on compressed 3D human mesh tokens. This representation preserves full 3D geometric information while enabling a unified token-based generation pipeline that processes video tokens jointly with motion tokens in a DiT-based architecture. This design requires the model to reason jointly about appearance, 3D structure, and camera viewpoint during video generation. Experimental results demonstrate strong performance on human motion control benchmarks, while reducing artifacts induced by view-dependent 2D guidance and trajectory-pose mismatches during editing. These findings suggest that video diffusion models, when equipped with mesh tokenization, can better capture complex 3D human structures and their interactions with the surrounding environment.
[234] Distortion-Aware Fusion of Statistical and Vision-Language Features for Blind Image Quality Assessment cs.CV
Bishr Omer Abdelrahman Adam, Xu Li
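TL;DR: This paper proposes a distortion-aware fusion framework for blind image quality assessment (BIQA) that uses a multiplicative gating mechanism to dynamically fuse classical natural scene statistics (NSS) descriptors with two modern vision-language model (VLM) embeddings. The method needs no end-to-end fine-tuning of the VLM backbones and performs strongly on several standard benchmarks.
Details
Motivation: Classical NSS descriptors and modern VLM embeddings approach BIQA from fundamentally different perspectives, but whether combining them yields complementary benefits, and how to weight their contributions per input image, has not been well explored.
Result: The method is evaluated on three standard benchmarks, KonIQ-10k, KADID-10k, and LIVE Challenge, and surpasses recent state-of-the-art methods on KADID-10k (SROCC=0.9715, PLCC=0.9733).
Insight: The novelty is a dynamic gated fusion mechanism learned from input image content, rather than static concatenation. It adaptively suppresses or amplifies each feature stream, and the learned gate weights correlate positively with the NSS contribution measured by independent ablation, indicating that the model discovers distortion-stream affinity patterns consistent with manual analysis.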
Abstract: Blind image quality assessment (BIQA) aims to predict perceived image quality without access to a reference image. Classical natural scene statistics (NSS) descriptors and modern vision-language model (VLM) embeddings address this problem from fundamentally different perspectives, yet whether combining them yields complementary benefits and how to weight their contributions per input image remain unexplored. We propose a distortion-aware fusion framework that integrates a 138-dimensional NSS descriptor with two complementary VLM embeddings, SigLIP and CLIP-H, through a multiplicative gating mechanism that learns per-input stream weights conditioned on image content. Unlike static concatenation fusion, the proposed gating network suppresses or amplifies each stream’s contribution based on the input, producing weights that correlate positively (Spearman rank correlation rho=0.33) with the per-distortion NSS contribution measured by independent ablation on KADID-10k. The framework requires no end-to-end fine-tuning of the VLM backbones and is trained with a hybrid loss combining mean squared error, Pearson linear correlation, and pairwise ranking objectives. We evaluate on three standard benchmarks: KonIQ-10k (SROCC=0.9142, PLCC=0.9279), KADID-10k (SROCC=0.9715, PLCC=0.9733, surpassing recent state-of-the-art methods), and LIVE Challenge in-the-Wild (SROCC=0.8527, PLCC=0.8802 with cross-dataset pretraining and fine-tuning). A per-distortion analysis on KADID-10k reveals that NSS features contribute most on noise and color-shift distortions where pixel statistics are directly affected, and least on perceptual distortions such as color saturation changes. The learned gate values validate these findings, confirming that the model autonomously discovers distortion-stream affinity patterns consistent with the manual per-distortion study.
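The SROCC and PLCC figures quoted in the abstract above are the Spearman rank-order and Pearson linear correlation coefficients between predicted scores and human opinion scores. The following is a minimal reference sketch of both metrics in plain Python, using average ranks for ties. It assumes non-constant inputs and is not the evaluation code used in the paper.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank-order correlation (SROCC): Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))
```

SROCC rewards any monotonic agreement between predictions and opinion scores, while PLCC additionally requires the relationship to be linear, which is why the two values reported for a benchmark usually differ.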
[235] A Structured Benchmark for Text-Guided Anomaly Detection: When Language Stops Conditioning the Decision cs.CV | cs.AI | cs.LGPDF
Stefano Samele, Eugenio Lomurno, Teodora Jovanovic, Sanjay Shivakumar Manohar, Alberto Crivellaro
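TL;DR: This paper proposes a structured benchmark for evaluating text-guided anomaly detection (TGAD), designed to test whether current multimodal vision-language models actually use textual information in their decisions. The benchmark comprises three scenarios that progressively increase the functional role of language: a controlled prompt-sensitivity test on MVTec AD, an extension of MVTec AD that requires the model to assess only an instructed component, and the newly built Assembled Panel Dataset (APD), which requires knowledge of both defect type and component location.
Details
Motivation: Current multimodal anomaly detection methods claim to support text-guided zero- and few-shot detection, but their evaluation protocols are inherited from unimodal benchmarks and cannot measure whether language actually affects the decision. The paper addresses this by designing a structured benchmark that separates text-guidance ability from the effect of pretrained visual features.
Result: Experiments on three representative models (a generative large vision-language model, a training-free discriminative model, and an embedding-adaptive discriminative model) show that the textual interface affects the decision only superficially: prompt content is absorbed unless the object noun is removed (the generative model's I-AUROC drops from 97.4 to 82.6); component-level instructions fail once defects outside the instructed part are admitted as normal (from 90.3 to 66.3); and when both are combined on APD, image-level discrimination collapses below the MVTec level, in one case below chance (71.2, 50.5, 31.5).
Insight: The novelty is the first structured benchmark that systematically evaluates text-guided anomaly detection, revealing that current multimodal models depend only weakly on the language condition and that standard benchmarks may overstate their text-guided capabilities. Viewed objectively, the work underlines the need for such a protocol when developing models that can be reliably controlled through language in industrial deployment, and it gives future research a stricter evaluation framework.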
Abstract: Industrial anomaly detection has historically been a unimodal task. Recent multimodal vision-language models have produced systems that admit textual input alongside the image and are presented as enabling text-guided zero- and few-shot inspection. Yet these methods are evaluated with protocols inherited from unimodal benchmarks that hold the textual condition constant and therefore cannot measure whether language conditions the decision; whether reported gains reflect text guidance or strong pretrained visual features remains open. We introduce Text-Guided Anomaly Detection (TGAD), a structured benchmark that progressively increases the functional role of language across three scenarios: a controlled prompt-sensitivity setting on MVTec AD; a component-tagged extension of MVTec AD that requires the model to restrict its assessment to an instructed part; and the new Assembled Panel Dataset (APD), a realistic industrial setting that requires both defect-type and component-location knowledge. We evaluate one representative model per paradigm: generative large vision-language, training-free discriminative, and embedding-adaptive discriminative. In all three, the textual interface conditions the decision only superficially: prompt content is absorbed unless the object noun is removed (the generative model’s I-AUROC drops from 97.4 to 82.6); component-level instructions do not constrain the decision once defects outside the instructed part are admitted as normal (from 90.3 to 66.3); and when both combine on APD, image-level discrimination collapses below the MVTec level, in one case below chance (71.2, 50.5, 31.5). These results suggest that standard benchmarks overstate the text-guided capabilities of current multimodal anomaly detection systems, and that a protocol of this kind is a prerequisite for models that can be reliably controlled through language for industrial deployment.
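To make the "below chance" figure concrete, here is a minimal sketch of image-level AUROC computed as a pairwise win rate. It assumes the usual definition (the probability that a randomly chosen anomalous image receives a higher anomaly score than a randomly chosen normal one) and the common convention of reporting it on a 0 to 100 scale; the function name `auroc` and the toy scores in the usage note are illustrative, not taken from the paper.

```python
def auroc(normal_scores, anomalous_scores):
    """Fraction of (anomalous, normal) pairs ranked correctly; ties count half."""
    wins = 0.0
    for a in anomalous_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(normal_scores) * len(anomalous_scores))
```

Under this definition a detector that scores images at random sits near 0.5 (50 on the percentage scale), so a value such as 31.5 means the model tends to rank normal images as more anomalous than defective ones. For example, `auroc([0.8, 0.9], [0.1, 0.2])` describes a fully inverted ranking.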
[236] Ranking vs. Assignment: The Metric Mismatch in Multi-View Object Association cs.CV | cs.AI | cs.LGPDF
Matvei Shelukhan, Timur Mamedov, Aleksandr Chukhrov, Karina Kvanchiani
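TL;DR: This paper points out a fundamental mismatch between the evaluation metrics used for multi-view object association (such as AP and FPR-95) and the actual assignment objective. The theoretical analysis shows that these ranking metrics can be imperfect even when the assignment is correct, and that Sinkhorn-based normalization can make them perfect; conversely, an optimal ranking can still yield incorrect assignments. Experiments confirm the mismatch by using Sinkhorn normalization as a post-processing stress test: optimizing only a few post-processing parameters significantly boosts AP and FPR-95, while assignment-level metrics such as ACC and IPAA do not improve correspondingly.
Details
Motivation: To address the mismatch in multi-view object association between the ranking-based evaluation metrics (AP and FPR-95) and the actual one-to-one matching objective, and to stress that current methods rely on flawed evaluation criteria.
Result: Experiments show that with Sinkhorn-based normalization as post-processing, optimizing only a few parameters significantly improves AP and FPR-95 scores, whereas assignment-level metrics such as ACC and IPAA do not improve correspondingly, confirming the mismatch between metric and objective.
Insight: The novelty is the theoretical demonstration of a fundamental mismatch between ranking metrics and the assignment task, with Sinkhorn normalization proposed as an analysis tool. Viewed objectively, this means metrics must be chosen with care when evaluating multi-view association models, so as to avoid misleading optimization.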
Abstract: Multi-view object association is an important computer vision problem that underlies many multi-camera perception tasks. While this task is naturally formulated as a constrained one-to-one matching problem, recent works heavily rely on pairwise ranking metrics like AP and FPR-95 for model evaluation. We highlight a fundamental mismatch between these metrics and the actual assignment objective. Theoretically, we show that AP and FPR-95 can be imperfect even when the assignment is already correct, and that Sinkhorn-based normalization can make them perfect. Conversely, optimal pairwise ranking can still lead to incorrect assignments. We validate this mismatch in practice by using our Sinkhorn-based normalization as a controlled post-processing stress test. We show that optimizing just a few post-processing parameters significantly boosts AP and FPR-95 without corresponding improvements in assignment-level metrics such as ACC and IPAA.
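The mismatch can be reproduced on a toy example. The sketch below is an illustration under stated assumptions, not the authors' procedure: it uses a square score matrix whose ground-truth matches lie on the diagonal, a plain alternating row and column normalization of `exp(score / tau)`, and hypothetical helper names (`sinkhorn`, `ranking_is_perfect`, `row_argmax_assignment`). The temperature `tau` stands in for the "few post-processing parameters" mentioned in the abstract.

```python
import math


def sinkhorn(scores, tau=0.1, iters=50):
    """Alternately normalize rows and columns of exp(scores / tau)."""
    k = [[math.exp(s / tau) for s in row] for row in scores]
    n = len(k)
    for _ in range(iters):
        k = [[v / sum(row) for v in row] for row in k]  # each row sums to 1
        col = [sum(k[i][j] for i in range(n)) for j in range(n)]
        k = [[k[i][j] / col[j] for j in range(n)] for i in range(n)]  # columns sum to 1
    return k


def ranking_is_perfect(m):
    """True if every matching (diagonal) pair outscores every non-matching pair."""
    n = len(m)
    pos = [m[i][i] for i in range(n)]
    neg = [m[i][j] for i in range(n) for j in range(n) if i != j]
    return min(pos) > max(neg)


def row_argmax_assignment(m):
    """Assign each row to its highest-scoring column."""
    return [max(range(len(row)), key=row.__getitem__) for row in m]
```

With `scores = [[0.9, 0.8], [0.1, 0.3]]`, every row already picks its correct partner, yet the non-matching score 0.8 outranks the matching score 0.3 in the global pairwise ranking, so a ranking metric is penalized. After `sinkhorn(scores)` the assignment is unchanged while the global ranking becomes clean, which is the sense in which a ranking metric can rise without any change at the assignment level.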
[237] FACT: A Simple and Efficient Framework for Active Finetuning cs.CVPDF
Wenshuai Xu, You Song, Yuzhuo Cui, Minjie Ren, Qingjie Liu
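TL;DR: This paper proposes FACT, a simple and efficient framework for active finetuning. Using a three-phase hierarchical finetuning approach, it optimizes a pretrained model's performance on a specific task when data is scarce, addressing the pretrained-feature distortion and overfitting caused by conventional full finetuning.
Details
Motivation: Existing active finetuning research focuses mainly on data selection while uniformly adopting full finetuning for model adaptation. This distorts pretrained features because of distribution shift, and the overfitting risk rises markedly when the model is large relative to the amount of finetuning data.
Result: Experiments on classic, imbalanced, and fine-grained image classification datasets with several pretrained architectures show strong generalization and robustness. At low sampling ratios, the ViT model gains more than 20% on the CIFAR10, CIFAR100, and ImageNet-1k benchmarks, setting a new state of the art.
Insight: The novelty is the systematic combination of active learning with finetuning methods, in the form of a hierarchical finetuning framework that preserves parameter efficiency. Viewed objectively, its systematic exploration of frozen feature augmentation strategies and its joint analysis of efficiency and generalizability offer a new angle on model adaptation when data is scarce.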
Abstract: The main goal of active finetuning is to improve a pretrained model’s performance on a specific task or domain by finetuning it with carefully selected informative or challenging data. Previous research has predominantly focused on the active aspect (i.e., data selection) while uniformly employing full finetuning for model adaptation, which inevitably distorts pretrained features due to distribution shift. This issue becomes particularly pronounced when the model size is large relative to the finetuning data quantity, leading to heightened overfitting risks. To address this critical gap, we formally outline the FiAF task that emphasizes systematic exploration of finetuning methodologies in active learning. We propose FACT, a three-phase hierarchical finetuning framework featuring both efficiency and simplicity, specifically designed for active finetuning scenarios. Our comprehensive experiments span: (1) Three major dataset categories encompassing classic (CIFAR10, CIFAR100, ImageNet-1k), imbalanced (CIFAR10-LT, CIFAR100-LT), and fine-grained (StanfordCars, FGVCAircraft) image classification datasets, each evaluated under 3-5 distinct sampling ratios; (2) Diverse pretrained architectures including Convolutional Neural Network (ConvNeXt), Vision Transformer (ViT), and Vision LSTM (ViL) networks; (3) A systematic investigation of frozen feature augmentation (FroFA) strategies; (4) A comprehensive and rigorous analysis of efficiency and generalizability. The results demonstrate significant improvements with strong generalization and robustness. Notably, under low sampling ratios, our framework achieves remarkable performance gains of over 20% on the ViT model for CIFAR10, CIFAR100, and ImageNet-1k benchmarks. This systematic approach establishes new state-of-the-art performance while maintaining parameter efficiency, proving particularly effective when labeled data is scarce.
[238] WebSpline: Structure-Informed Splines for Real-Time 3D Gaussians from Monocular Videos cs.CVPDF
Jongmin Park, Jeonghwan Yun, Minh-Quan Viet Bui, Munchurl Kim
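TL;DR: This paper proposes WebSpline, a novel 3D Gaussian framework for reconstructing dynamic scenes from monocular video. Its core is the Structure-Informed Spline representation, which models each dynamic Gaussian trajectory with a learnable cubic Hermite spline and organizes motion through an auxiliary Structural Proxy Graph to keep it structurally consistent. The framework is optimized in two stages, achieves state-of-the-art rendering quality on the iPhone and NVIDIA benchmarks, and renders over 10 times faster than the second-best method.
Details
Motivation: To address the difficulty existing methods have, in monocular dynamic scene reconstruction, in balancing global structural coherence against local fine-grained detail under limited multi-view cues.
Result: On challenging monocular dynamic scene benchmarks (the iPhone and NVIDIA datasets), WebSpline achieves state-of-the-art rendering quality and renders over 10 times faster than WorldTree, the second-best method on the iPhone dataset.
Insight: The novelty lies in the Structure-Informed Spline representation and the Structural Proxy Graph, which build structural priors explicitly into dynamic Gaussian trajectory modeling and so keep motion structurally coherent while preserving high fidelity. The two-stage optimization strategy (first establish structural coherence, then optimize the detailed representation) is also an effective design.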
Abstract: Dynamic scene reconstruction from monocular videos remains highly challenging, as existing methods often struggle to balance global structural coherence and local fine-grained details under limited multi-view cues. To address this challenge, we propose WebSpline, a novel dynamic 3D Gaussian framework that enables structurally coherent and high-fidelity reconstruction from monocular videos with fast rendering. The core of WebSpline is the Structure-Informed Spline (SIS) representation, which models each dynamic Gaussian trajectory using a learnable cubic Hermite spline whose motion is structurally organized with an auxiliary Structural Proxy Graph (SPG). The proposed framework is optimized in two stages: (i) in the first stage, the SPG is initialized from 2D point tracks and refined with temporal rigidity regularization to establish structural coherence for moving objects across the sequence; and (ii) in the second stage, the SIS representation is initialized from the refined SPG and optimized under both spatial and structural neighborhood constraints. At inference, Gaussian motion is obtained solely by evaluating the learned SIS, enabling fast rendering. Extensive experiments on the challenging monocular dynamic scene benchmarks, iPhone and NVIDIA, demonstrate that our WebSpline achieves state-of-the-art rendering quality while rendering over 10 times faster than WorldTree, the second-best method on the iPhone dataset.
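Evaluating a cubic Hermite spline at inference time is cheap, which is consistent with the fast rendering the abstract reports. The sketch below shows only the standard Hermite evaluation for one Gaussian's 3D center; the knot layout, the function name `hermite_position`, and the arguments `times`, `points`, and `tangents` are illustrative assumptions, and the paper's actual parameterization and its structural constraints are not reproduced here.

```python
from bisect import bisect_right


def hermite_position(times, points, tangents, t):
    """Evaluate a piecewise cubic Hermite spline at time t.

    times    -- increasing knot times
    points   -- one 3D position per knot (the learnable values in such a model)
    tangents -- one 3D velocity per knot (also learnable in such a model)
    """
    t = min(max(t, times[0]), times[-1])                 # clamp to the sequence
    i = min(bisect_right(times, t) - 1, len(times) - 2)  # index of the segment
    dt = times[i + 1] - times[i]
    s = (t - times[i]) / dt                              # local parameter in [0, 1]
    # Standard cubic Hermite basis functions.
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return tuple(
        h00 * p0 + h10 * dt * m0 + h01 * p1 + h11 * dt * m1
        for p0, m0, p1, m1 in zip(points[i], tangents[i], points[i + 1], tangents[i + 1])
    )
```

The curve passes through every knot position and matches the knot velocities, so a frame's Gaussian center costs one segment lookup and a handful of multiplications, with no network forward pass.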
[239] Multimodal Action Diffusion for Robust End-to-End Autonomous Driving cs.CVPDF
Jorge Daniel Rodríguez-Vidal, Diego Porres, Gabriel Villalonga Pineda, Antonio M. López Peña
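TL;DR: This paper proposes the Action Diffusion Transformer (ADT), an end-to-end autonomous driving system that uses a diffusion model to directly predict multimodal control signals (throttle, steering, and brake) instead of conventional deterministic trajectory waypoints. ADT generates multiple action candidates and selects the most suitable one by nearest neighbour matching, improving driving performance and representation quality while keeping latency low.
Details
Motivation: Current end-to-end autonomous driving systems mostly predict intermediate trajectory waypoints and rely on hand-crafted controllers for final control. Direct control-signal prediction is underexplored, and the role of action multimodality is not well understood. The authors argue that moving beyond deterministic single-action outputs is key to improving driving performance, representation quality, and training stability.
Result: ADT surpasses the previous state of the art on the challenging closed-loop Bench2Drive benchmark while achieving ten times lower latency.
Insight: The novelty is an anchor-free diffusion transformer that natively models the multimodal distribution of driving actions, generating multiple action candidates and selecting among them by nearest neighbour matching to improve robustness. Viewed objectively, multimodal action modelling not only raises benchmark performance but also brings measurable benefits in learned representations and behavioral consistency that deterministic architectures cannot replicate.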
Abstract: End-to-End Autonomous Driving (E2E-AD) systems have largely converged on predicting intermediate trajectory waypoints, delegating final control to hand-crafted controllers with GPS access. Direct control-signal prediction (outputting throttle, steer and brake in an end-to-end fashion) remains underexplored, and critically, the role of action multimodality in such systems is not well understood. We argue that moving beyond deterministic, single-action outputs is not merely a modelling choice, but a key driver of driving performance, representational quality, and training stability. To validate this, we introduce the Action Diffusion Transformer (ADT), an anchor-free diffusion transformer trained with an MSE objective that natively models the multimodal distribution of plausible driving actions. Rather than committing to a single deterministic command, ADT generates K action candidates and selects the most suitable one at inference via Nearest Neighbour Matching (NNM). Beyond strong benchmark numbers, we show that action multimodality yields measurable benefits in learned representations and behavioral consistency, effects that deterministic architectures cannot replicate. ADT surpasses previous state-of-the-art on the challenging closed-loop Bench2Drive benchmark while achieving ten times lower latency, demonstrating that expressive, multimodal action modelling is both practically efficient and conceptually essential for robust end-to-end driving.
[240] Jailbreaking Multimodal Large Language Models using Multi-Clip Video cs.CV | cs.AI | cs.CLPDF
Choongwon Kang, Seungjong Sun, Hyunmin Jun, Jang Hyun Kim
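TL;DR: This paper studies safety vulnerabilities of multimodal large language models when they process video input. The authors introduce MCV SafetyBench, a dataset for evaluating how video diversity affects model vulnerability, and find that attack success rises with the number of video clips. On that basis they propose a defense strategy that exploits the relative robustness of the image modality.
Details
Motivation: As multimodal large language models become able to process video input, the risk of malicious misuse has drawn concern. Prior work shows that visual inputs can bypass a model's safety alignment, but it is unclear which properties of video input cause this vulnerability.
Result: Experiments on eight representative video MLLMs show that attack success consistently increases with the number of video clips. The results also show that the video modality is more vulnerable than the image modality, that models are more vulnerable to dynamic videos than to static ones, and that vulnerability grows as video content becomes more diverse.
Insight: The novelty is a systematic study of how the diversity of video input (via multi-clip videos) affects MLLM safety vulnerabilities, with the relevant risk factors quantified. Viewed objectively, the proposed defense based on the robustness of the image modality offers a workable direction for mitigating the safety threat posed by video input.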
Abstract: As multimodal large language models (MLLMs) have advanced to process video inputs, concerns have emerged about their potential for malicious misuse. Prior jailbreak studies have shown that safety alignment in MLLMs can be bypassed through visual inputs, yet it remains unclear which properties of video inputs induce this vulnerability. To address this gap, we introduce Multi-Clip Video (MCV) SafetyBench, a dataset of 2,920 videos designed to evaluate how the diversity of video inputs affects the vulnerability of MLLMs. Each video consists of multiple short clips depicting diverse contexts related to a harmful query. Experiments on eight representative video MLLMs show that attack success consistently increases with the number of clips. Our results further indicate that the video modality is (1) more vulnerable than the image modality, (2) more vulnerable to dynamic videos than to static videos, and (3) more vulnerable when videos contain more diverse contexts. Building on these findings, we propose a defense strategy that leverages the relative robustness of the image modality.
[241] InfoMerge: Information-aware Token Compression for Efficient Video Large Language Models cs.CV | cs.CLPDF
Xinxin Liu, Shiwei Gan, Xiao Liu, Yafeng Yin, Lei Xie
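TL;DR: This paper proposes InfoMerge, a training-free visual token compression method for video large language models. It estimates redundancy robustly via the Temporal Fingerprint Difference and combines this with a content-aware budget allocation strategy, sharply reducing the number of visual tokens while maintaining model performance.
Details
Motivation: Video large language models incur heavy computational overhead from excessive visual tokens, and existing training-free compression methods rely on local adjacent-frame similarity or allocate token budgets mainly by segment length, which makes them sensitive to frame-level noise and unable to capture the non-uniform information distribution of real videos.
Result: On LLaVA-OneVision-7B, InfoMerge retains 98.8% of the original average performance while removing 85% of visual tokens and achieving a 4.24-fold speedup in the prefill stage. It shows strong efficiency-accuracy trade-offs across multiple benchmarks and backbones.
Insight: The novelty lies in a segment-level second-order temporal redundancy estimation strategy (the Temporal Fingerprint Difference) and a content-aware budget allocation mechanism that combines segment uniqueness with spectral-entropy-based representational richness, so that the limited token budget is spent more effectively across static redundant regions and information-rich segments.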
Abstract: Video Large Language Models (Video-LLMs) achieve strong performance in video understanding, but their excessive visual tokens bring substantial computational overhead. Existing training-free compression methods improve inference efficiency by reducing visual tokens, yet they often rely on local adjacent-frame similarity for temporal redundancy estimation or allocate token budgets mainly according to segment length. Such designs are sensitive to frame-level noise and fail to capture the non-uniform information distribution of real-world videos. To address these challenges, we propose InfoMerge, a training-free visual token compression method that improves token utilization through robust redundancy estimation and content-aware budget allocation. Specifically, we propose the Temporal Fingerprint Difference: a segment-level second-order temporal redundancy estimation strategy, which models the temporal similarity structure of tokens at the same spatial positions within each segment. We further introduce Content-Aware Budget Allocation (CABA), which dynamically allocates segment-level token budgets based on segment uniqueness and spectral-entropy-based representational richness. By reducing repeated preservation of redundant static regions and allocating more tokens to informative segments, InfoMerge makes better use of the limited token budget while maintaining strong performance. Extensive experiments show that InfoMerge achieves strong efficiency–accuracy trade-offs across multiple benchmarks and backbones, with more pronounced advantages under aggressive compression. On LLaVA-OneVision-7B, InfoMerge retains 98.8% of the original average performance while reducing 85% of visual tokens and achieving a 4.24-fold speedup in the prefill stage.
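The idea of content-aware budget allocation can be sketched in a few lines. The abstract does not state how uniqueness and spectral entropy are combined or how budgets are rounded, so the following is an assumption-laden illustration: `spectral_entropy` takes a segment's singular values as given, `allocate_budget` splits a token budget in proportion to arbitrary per-segment scores using largest-remainder rounding, and both function names are hypothetical.

```python
import math


def spectral_entropy(singular_values):
    """Shannon entropy of the normalized spectrum; higher means richer content."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    return -sum(p * math.log(p) for p in probs)


def allocate_budget(scores, total_tokens):
    """Split total_tokens across segments in proportion to their scores."""
    total = sum(scores)
    raw = [total_tokens * s / total for s in scores]
    budget = [int(r) for r in raw]  # floor of each share
    leftover = total_tokens - sum(budget)
    # Hand the remaining tokens to the segments with the largest fractional parts.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - budget[i], reverse=True)
    for i in order[:leftover]:
        budget[i] += 1
    return budget
```

A segment whose energy is spread evenly over many spectral components gets a high entropy and therefore a larger share, while a near-static segment dominated by one component gets a small one. An allocation by segment length alone would ignore that difference.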
[242] Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection cs.CV | cs.AI | cs.LGPDF
Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Ruochen Cui
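TL;DR: This paper proposes an Understanding-Enhanced Model Collaboration Method (UE-MCM) for detecting mistakes in a user's actions from egocentric video. A small model branch and a large model branch collaborate: the small branch takes the coarse-grained video together with the fine-grained segment to flag actions that are inconsistent with the overall workflow, the large branch reasons about the fine-grained action segment itself, and a lightweight collaboration gate adaptively fuses their predictions. To handle the long-tailed distribution of mistake instances, the classifiers are optimized with complementary objectives: reweighted cross-entropy, AUC-oriented learning, and label-aware adjustment.
Details
Motivation: To determine from egocentric video whether a user performs an action incorrectly, in particular given that mistake instances follow a long-tailed distribution and that an action may be locally correct yet inconsistent with the overall workflow.
Result: The proposed system balances speed and accuracy and is reported to be effective at detecting subtle, rare, and ambiguous mistakes in egocentric instructional videos. The abstract gives no specific benchmarks and no quantitative comparison with the state of the art.
Insight: The main novelty is the two-branch collaboration framework, which combines efficient coarse-grained video understanding with accurate fine-grained action reasoning and fuses them adaptively, together with a multi-objective optimization strategy for the long-tailed distribution. Viewed objectively, extracting representations at different granularities with pretrained models such as CLIP4CLIP and Qwen3-VL, with the CLIP initialization enhanced by Diffusion Contrastive Reconstruction, is an effective technical route.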
Abstract: In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To this end, we propose an Understanding-Enhanced Model Collaboration Method (UE-MCM) that combines efficient coarse-grained video understanding with accurate fine-grained action reasoning. Specifically, UE-MCM contains a small model branch and a large model branch. The large model branch focuses on whether the fine-grained action itself is executed incorrectly, while the small model branch jointly takes the coarse-grained video and fine-grained segment as input to identify actions that may be locally correct but inconsistent with the overall workflow. The small model branch is built on a CLIP4CLIP video encoder initialized from a CLIP model enhanced by Diffusion Contrastive Reconstruction, and the large model branch uses the Qwen3-VL Embedding model to extract high-capacity representations from fine-grained action segments. The small-branch prediction and the large-branch prediction are then adaptively fused by a lightweight collaboration gate. To handle the long-tailed distribution of mistake instances, we optimize the classifiers with complementary objectives, including reweighted cross-entropy, AUC-oriented learning, and label-aware adjustment. The resulting system balances speed and accuracy, making it effective for detecting subtle, rare, and ambiguous mistakes in egocentric instructional videos.
[243] Multimodal Approaches for Visually-Rich Document Type Classification: A Comparative Analysis cs.CV | cs.AI | cs.CL | cs.IRPDF
Catyana Heyne, Jürgen Frikel, Filippo Riccio
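TL;DR: This paper presents a systematic comparative analysis of multimodal approaches to document type classification in visually rich documents. Within a unified experimental framework, it evaluates four representative models (LayoutLMv3, Donut, Qwen3-VL-32B-Instruct, and Qwen3-32B), focusing on the contributions of text, image, and layout information and contrasting OCR-dependent with OCR-free approaches. The results show that specialized multimodal Transformers outperform LLM-based approaches on visually rich, layout-intensive documents, with image information contributing most to reliable classification.
Details
Motivation: Current approaches to visually rich document type classification use diverse multimodal modeling strategies, which leads to heterogeneous architectures and inconsistent evaluation setups and makes systematic comparison and assessment of progress difficult.
Result: On the RVL-CDIP benchmark, specialized multimodal Transformers (such as LayoutLMv3) outperform LLM-based approaches; image information contributes most to classification, and OCR-derived text provides useful but secondary support.
Insight: The novelty is a unified framework for systematically comparing multimodal architectures, together with empirical evidence that the image modality dominates in document classification, which gives practical guidance for choosing models and feature combinations.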
Abstract: Document type classification in visually rich documents remains challenging, as relevant information is distributed across textual, visual, and layout modalities. To capture this complexity, current approaches rely on diverse multimodal modeling strategies, resulting in heterogeneous architectures that complicate systematic comparison. This variability is also reflected in existing comparative studies, which often rely on heterogeneous evaluation setups, further complicating systematic comparison and making it difficult to assess progress. To address these limitations, this work provides a structured analysis of multimodal design strategies across transformer- and LLM-based architectures, combined with a controlled empirical comparison within a unified experimental framework. Specifically, four representative models (LayoutLMv3, Donut, Qwen3-VL-32B-Instruct, and Qwen3-32B) are evaluated on the RVL-CDIP benchmark to systematically analyze the contributions of text, image, and layout information for document type classification, with a particular focus on contrasting OCR-dependent and OCR-free approaches. The results show that specialized multimodal Transformers outperform LLM-based approaches on visually rich and layout-intensive documents. Image information contributes most strongly to reliable classification, while OCR-derived text provides useful but secondary support. These findings highlight that multimodal processing remains essential for documents with pronounced layout structure. Overall, the study provides a systematic basis for comparing multimodal architectures and offers practical guidance for selecting effective feature combinations and model designs for document type classification.
[244] Disentanglement-Based Equivariant Learning for Compositional VQA cs.CV | cs.LGPDF
Zhou Du, Zhaoquan Yuan, Xiao Wu, Changsheng Xu
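TL;DR: This paper proposes a Disentanglement-based EquivAriant Learning (DEAL) framework for compositional visual question answering (VQA). The framework uses causality-inspired interventions within a re-encoding framework to disentangle concepts from the visual and textual inputs and, following the principle of equivariance, applies a compositional transformation to the inference input to strengthen the model's compositional reasoning.
Details
Motivation: Current compositional VQA methods often overlook the disentanglement of underlying concepts and struggle to capture the compositional variation mechanism, and existing state-of-the-art techniques depend on additional training clues that are not available in real-world VQA scenarios.
Result: Comprehensive experiments on the CLEVR-CoGenT and GQA-SGL benchmark datasets show that DEAL outperforms existing state-of-the-art methods in both visual and linguistic generalization settings.
Insight: The novelty is that, guided only by ground-truth answers, the method disentangles concepts through causal interventions and uses an equivariant constraint to strengthen compositional reasoning, removing the need for additional training clues.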
Abstract: Compositional visual question answering (VQA) represents a challenging yet fundamental task that requires models to comprehend novel combinations of previously learned concepts. The current methods often overlook the disentanglement of underlying concepts and are restricted in terms of their ability to effectively capture the compositional variation mechanism. Moreover, the state-of-the-art techniques depend on additional clues for training, which is not feasible in real-world VQA scenarios. To address these issues, in this paper, we introduce a novel Disentanglement-based EquivAriant Learning (DEAL) framework for compositional VQA, which is guided exclusively by ground-truth answers. In DEAL, we employ causality-inspired interventions to disentangle concepts derived from visual and textual inputs within a re-encoding framework. Based on the principle of equivariance, we subsequently perform a compositional transformation on the inference input and impose the equivariant constraint on the output to augment the compositional reasoning capacity of the model. Comprehensive experiments conducted on the benchmark CLEVR-CoGenT and GQA-SGL datasets validate the superiority of our proposed DEAL approach over the existing state-of-the-art methods for compositional VQA tasks in both visual and linguistic generalization settings.
[245] InsightVQA: High-Dimensional Emotion-Cognitive Visual Question Answering Benchmark cs.CVPDF
Shiyu Wang, Ziyu Liu, Chaoyi Yu, Yujie Yin, Zhongqian Mao
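TL;DR: This paper introduces InsightVQA, a large-scale hierarchical visual question answering dataset and evaluation benchmark for emotion understanding and cognitive reasoning. It contains 138K high-quality images filtered from 351K, annotated with 725K QA pairs in total across three levels: perception, grounded understanding based on visual triggers, and cognitive reasoning.
Details
Motivation: Existing benchmarks for visual emotion understanding focus mainly on emotion recognition and offer little support for explaining why emotions arise (grounding) or for higher-level cognitive reasoning. The paper aims to fill this gap with a benchmark that supports hierarchical, interpretable emotion analysis.
Result: The paper builds InsightVQA-Bench, a high-quality evaluation benchmark of 30K samples, and proposes InsightNet, an emotion-tuned multimodal large language model baseline. The results show that InsightVQA poses significant challenges for grounded emotion understanding and reasoning.
Insight: The main novelty is a hierarchical (perception, grounded understanding, cognition) emotion VQA dataset, with the grounded-understanding QA built from visual triggers through constraint-guided generation, which gives structured support for interpretable emotional reasoning.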
Abstract: Visual emotion understanding requires models not only to recognize emotional states, but also to understand why they arise and to perform higher-level cognitive reasoning. However, existing benchmarks mainly focus on emotion recognition, offering limited support for grounded understanding and response-oriented analysis. To address this gap, we introduce InsightVQA, a large-scale dataset for hierarchical visual question answering on emotion understanding and cognitive reasoning. Starting from 351K images collected from six public sources, we apply a rigorous multi-stage filtering pipeline to curate 138K high-confidence images. Each image is annotated at three hierarchical levels: perception QA for emotion and valence recognition, grounded understanding QA constructed from visual trigger extraction through constraint-guided generation, and cognition QA centered on response intent prediction and sequential insight reasoning. In total, InsightVQA contains 725K QA pairs. We further present InsightVQA-Bench, a high-quality evaluation benchmark comprising 30K samples for fine-grained evaluation. To support evaluation, we introduce InsightNet, an emotion-tuned baseline for MLLMs. Results demonstrate that InsightVQA poses significant challenges for grounded emotion understanding and reasoning.
[246] Symmetry-Aware 9D Pose Estimation with Sim(3)-Consistent Feature and Spherical Inception Convolution cs.CVPDF
Panfei Cheng, Hongshan Yu, Wenrui Chen, Xiaojun Tang, Jian Liu
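TL;DR: The paper proposes an effective method for category-level object pose estimation, with two key innovations that address the difficulty existing methods have in generalizing to unseen objects and in learning in the non-linear Sim(3) space. First, it designs a translation/size estimator with a semantic-guided symmetry-aware module that uses the generalization ability of a large vision model to infer symmetry points, giving accurate translation and size without shape priors. Second, it proposes a feature fusion module based on a spherical large-kernel inception convolution that fuses semantic and geometric features to extract essential pose features. The method reaches state-of-the-art performance on benchmarks and in real-world scenes, and the authors build a robust robotic picking system on top of it.
Details
Motivation: Existing instance-level pose estimation methods generalize poorly to unseen objects, while category-level methods are constrained by the complexity of learning in the non-linear Sim(3) space and by intra-class variation. The paper aims to overcome these challenges for more general and accurate category-level pose estimation.
Result: The method achieves state-of-the-art (SOTA) performance on benchmarks and in real-world scenes, and supports a robust robotic picking system capable of handling diverse objects.
Insight: The novelty is twofold: 1) using the generalization ability of a large vision model, a semantic-guided symmetry-aware module infers symmetry points so that translation and size are estimated accurately without shape priors, which provides a reliable cue for the subsequent rotation estimation and eases learning in Sim(3) space; 2) a spherical large-kernel inception convolution models long-range dependencies at modest computational cost and fuses semantic and geometric features to cope with intra-class variation.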
Abstract: Object pose estimation is a fundamental problem for an agent system to perceive or manipulate objects in images or videos. However, current instance-level methods struggle with generalization to unseen objects. Category-level methods seek to address this, but remain constrained by the complexities of learning in the non-linear Sim(3) space and intra-class variations. To address these challenges, we propose an effective method for category-level object pose estimation with two key innovations: (1) A translation/size estimator, featuring a semantic-guided symmetry-aware module that leverages robust generalization capabilities of a large vision model (LVM) to infer symmetry points, resulting in accurate translation and size without shape priors. This result serves as a precomputed cue for rotation estimation, thereby reducing the difficulty of learning in the non-linear Sim(3) space and laying a robust foundation for tackling the inherently more challenging rotation estimation. (2) A feature fusion module, based on our proposed spherical large-kernel inception convolution, fuses semantic features from the LVM with systematically computed geometric features to extract essential pose features from intra-class variations by modeling long-range dependencies without excessive computational cost. Built on these innovations, we achieve SOTA on benchmarks and real-world scenes, while developing a robust robotic picking system capable of handling diverse objects. Our code will be available at the project page: https://panfei-cheng.github.io/SSH-Pose.
[247] Cross-modal linkage risk in clinical vision-language models cs.CV | cs.AI | cs.CL | cs.LGPDF
Soroosh Tayebi Arasteh, Mahshad Lotfinia, Sven Nebelung, Daniel Truhn
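TL;DR: This paper studies cross-modal linkage risk in clinical vision-language models (VLMs): in privacy settings where chest radiographs and radiology reports are kept separate, a trained VLM may re-link a de-identified image to its original report through cosine similarity. The authors evaluate the risk as an image-to-report retrieval task for VLMs of increasing specialization on MIMIC-CXR and CheXpert Plus, and find that re-linkage risk grows with specialization. To reduce it, they apply differentially private optimization only to the projection heads of the alignment layer, which substantially lowers linkage risk while preserving the clinical utility of the image representations.
Details
Motivation: To address the privacy risk that, where images and reports are kept separate, a clinical vision-language model's shared embedding space can be used to re-link de-identified images to their original reports.
Result: Evaluated on MIMIC-CXR (43,793 held-out pairs) and CheXpert Plus (29,296 pairs), the strongest VLM retrieves the correct report at 15 times chance for a candidate pool of N = 100 and at 50 times chance for N = 10,000. After differentially private optimization (ε = 0.34, δ = 6×10⁻⁶), Recall@1 at N = 10,000 falls by 61.8% on MIMIC-CXR, while image-side utility is largely preserved: macro AUROC for linear-probe classification across 14 labels shifts only from 79.63% to 79.43%.
Insight: The novelty is the first formalization and quantification of the cross-modal re-linkage privacy risk of clinical VLMs, together with differentially private finetuning targeted at the shared alignment layer, which lowers the risk with minimal impact on the model's clinical utility.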
Abstract: Vision-language models (VLMs) trained on paired chest radiographs and radiology reports learn a shared embedding space that can preserve instance-level image-report correspondence. This poses a privacy risk in settings where radiographs and reports are deliberately kept separate after acquisition, such as image-only data sharing or access-controlled reports, because a de-identified image may be re-linked to its original narrative report through cosine similarity alone. We formalized this as image-to-report retrieval and used public paired cohorts, in which the true pairing is known by design, as ground-truth benchmarks to audit the risk rather than as the privacy scenario. Evaluating VLMs of increasing clinical specialization on 406,241 paired examples from 126,804 patients across MIMIC-CXR (43,793 held-out pairs) and external CheXpert Plus (29,296 pairs), we found that re-linkage rose systematically with specialization: the strongest VLM retrieved the correct report at 15 times chance at a candidate pool of N = 100, 50 times chance at N = 10,000, and well above chance at full-database scale. The signal persisted under pathology-matched hard negatives that removed disease-label shortcuts, indicating correspondence beyond broad diagnostic categories. To reduce it without retraining, we froze both encoders and applied differentially private optimization only to the projection heads defining the alignment layer (epsilon = 0.34, delta = 6×10⁻⁶). This reduced Recall@1 by 61.8% at N = 10,000 on MIMIC-CXR and transferred to CheXpert Plus without retraining, while image-side utility was largely preserved: macro AUROC for linear-probe classification across 14 labels shifted only from 79.63% to 79.43%. Targeted DP finetuning of the shared alignment layer can substantially reduce cross-modal re-linkage without materially degrading the image representations that make these models clinically useful.
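The audit described above reduces to nearest-neighbour retrieval by cosine similarity. The sketch below is a minimal illustration, not the authors' evaluation code: it assumes image `i` is paired with report `i`, that the "times chance" figures refer to Recall@1 (for which chance in a pool of N candidates is 1/N), and the names `cosine`, `recall_at_1`, and `lift_over_chance` are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def recall_at_1(image_embs, report_embs):
    """Fraction of images whose most similar report is their true report."""
    hits = 0
    for i, img in enumerate(image_embs):
        best = max(range(len(report_embs)), key=lambda j: cosine(img, report_embs[j]))
        hits += best == i
    return hits / len(image_embs)


def lift_over_chance(recall, pool_size):
    """How many times better than random guessing (chance = 1 / pool_size)."""
    return recall * pool_size
```

Under that assumption, 15 times chance at N = 100 corresponds to a Recall@1 of 0.15, and 50 times chance at N = 10,000 corresponds to 0.005. The absolute hit rate falls as the pool grows even though the multiple of chance rises, which is why the two figures need to be read together.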
[248] Ego-METAS: Egocentric online Multimodal Energy-efficient Temporal Action Segmentation benchmark cs.CVPDF
Maria Santos-Villafranca, Jesus Bermudez-cameo, Alejandro Perez-Yus, Giovanni Maria Farinella, Antonino Furnari
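TL;DR: This paper introduces Ego-METAS, the first egocentric, online, multimodal, energy-efficient temporal action segmentation benchmark. It brings together more than 100 hours of untrimmed egocentric video spanning five modalities (RGB, audio, gaze, IMU, and monochrome camera) and defines an online task in which the model must dynamically choose which sensors to activate at each timestep while strictly adhering to hardware-representative energy budgets.
Details
Motivation: To address the challenge of always-on perception on resource-constrained devices, where energy constraints must be balanced against task accuracy. Most existing work assumes unlimited compute, and energy-aware perception remains under-explored.
Result: The evaluation shows that the optimal routing policy is highly scenario-dependent, and that existing policy-learning methods, designed primarily for trimmed clips, struggle to adapt to continuous, untrimmed environments. Even so, simple dynamic fusion of complementary modalities (for example via random routing) proves critical for balancing predictive accuracy against strict energy budgets.
Insight: The main novelty is the first benchmark focused on egocentric, online, multimodal temporal action segmentation under energy constraints. It shows how much dynamic sensor selection matters in continuous perception and how strongly it depends on the scenario, and it provides a standardized basis for developing robust, cost-aware policies for autonomous AI.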
Abstract: To operate in the physical world, embodied agents must perceive their environment in an “always-on” fashion, selectively accessing the most informative sensors to balance energy constraints and task accuracy. Despite its importance for resource-constrained devices, energy-aware perception remains under-explored, with most prior work assuming unlimited compute. To address this, we introduce Ego-METAS: the first Egocentric online Multimodal Energy-efficient Temporal Action Segmentation benchmark. Ego-METAS provides a unified testbed of more than 100 hours of untrimmed egocentric video from EgoExo4D, CMU-MMAC, and CaptainCook4D, spanning 5 modalities (RGB, audio, gaze, IMU, and monochrome camera). We formulate an online temporal action segmentation task where models must dynamically select which sensors to activate at each timestep while strictly adhering to hardware-representative energy budgets. Alongside the benchmark, we release unified splits, cleaned annotations, pre-extracted features, and a diverse suite of baseline routing policies. Our evaluations show that optimal routing is highly scenario-dependent, and that existing policy-learning methods, designed primarily for trimmed clips, struggle to adapt to continuous, untrimmed environments. However, even simple dynamic fusion of complementary modalities (e.g., via random routing) proves critical for balancing predictive accuracy against strict energy budgets. Ultimately, Ego-METAS provides a standardized foundation to develop robust, cost-aware policies for autonomous, always-on embodied AI.
[249] Towards Resolving Optimization Conflicts Between Image- and Text-Based Person Re-Identification cs.CV | cs.AI | cs.LGPDF
Karina Kvanchiani, Timur Mamedov
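TL;DR: This paper studies the conflicts that arise when image-to-image (I2I) and text-to-image (T2I) person re-identification (ReID) are optimized jointly, noting that modality discrepancies and conflicting objectives lead to suboptimal shared representations. It proposes a decoupled two-stage training pipeline in which a single vision encoder supports both retrieval tasks, and shows experimentally that I2I pre-training benefits T2I generalization and that textual supervision improves performance on both tasks.
Details
Motivation: I2I ReID seeks identity-level invariance across images, whereas T2I ReID relies on instance-level text describing unique visual traits. The two objectives differ fundamentally, and loss functions optimized for one setting can interfere with the other, degrading the quality of the shared cross-modal representation.
Result: Extensive experiments across multiple configurations (varying domain mixing procedures, learning strategies, and task objectives) show that I2I pre-training improves generalization to T2I data, and that adding textual supervision during vision encoder training improves both I2I and T2I performance.
Insight: The novelty is in exposing the optimization conflict between the I2I and T2I ReID tasks and in a decoupled two-stage training pipeline that avoids cross-task interference and so learns a better shared cross-modal representation. Viewed objectively, folding textual supervision into vision encoder training to lift both tasks is a useful idea for building unified ReID systems.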
Abstract: The joint optimization of image-based (I2I) and text-based (T2I) person re-identification (ReID) is hindered by modality discrepancies and conflicting training objectives, leading to suboptimal shared representations. While I2I ReID focuses on identity-level invariance across images of the same person, T2I ReID is driven by instance-specific textual descriptions tied to unique visual traits. This paper explores the fundamental difference between two ReID tasks and their optimization processes for effective training. Since I2I and T2I ReID are often studied separately, the loss functions optimized for one retrieval setting may negatively affect the representation quality required by the other. Motivated by these findings, we propose a decoupled two-stage training pipeline for learning a shared representation across image and text modalities. The pipeline is based on a single vision encoder that supports both I2I and T2I retrieval while avoiding cross-task interference during training. We provide extensive experiments across multiple configurations, varying domain mixing procedures, learning strategies, and task objectives. We observed that I2I ReID pre-training positively impacts the generalization ability to T2I data. Besides, we find that incorporating textual supervision during the vision encoder training stage enhances both I2I and T2I performance. We believe our insights provide a meaningful step toward unified ReID systems and cross-modal retrieval overall.
[250] Vision-language Models for Driver Monitoring Systems: A Driver Activity Description Dataset cs.CVPDF
David J. Lerch, Sarath Mulugurthi, Manuel Martin, Frederik Diederichs, Rainer Stiefelhagen
TL;DR: Existing vision-language models (VLMs) struggle to recognize subtle differences in driver behavior, so this paper builds a fine-grained natural-language description dataset on top of Drive&Act. An evaluation of three VLMs on the new benchmark shows that they cannot reliably generate accurate fine-grained driver activity descriptions. A VLM fine-tuned on the new dataset reaches an ACCR score of 76 on the Driver Monitoring Dataset (DMD), compared with 66 for the zero-shot baseline, indicating that richly described driver-action data can substantially improve a model's ability to interpret driver behavior.
Details
Motivation: Existing VLMs are trained on general datasets and struggle to recognize fine distinctions in driver behavior, which limits the construction of reliable driver monitoring systems.
Result: In cross-dataset evaluation on DMD, the VLM fine-tuned on the new Drive&Act description dataset achieves an ACCR score of 76, outperforming the zero-shot baseline at 66.
Insight: The contribution is a fine-grained natural-language dataset of driver activities built specifically for fine-tuning VLMs. The results suggest that high-quality, fine-grained description datasets for a specific domain such as driver monitoring can effectively improve a VLM's domain adaptation and performance.
Abstract: Understanding subtle driver actions is essential for building reliable driver monitoring systems. Existing vision-language models (VLMs) are trained on general datasets and struggle to recognize fine distinctions in driver behaviors. This paper addresses this limitation by creating a detailed natural language version of the Drive&Act dataset. We evaluate three VLMs on our new benchmark using LLM-based scoring methods. Their performance on the new benchmark shows that they cannot reliably generate accurate fine-grained driver activity descriptions. Based on the labeled Drive&Act dataset, we create a new Drive&Act description dataset containing fine-grained descriptions to train VLMs on driver activity understanding. Cross-dataset evaluation on the Driver Monitoring Dataset (DMD) shows that the VLM fine-tuned on our new Drive&Act description dataset generalizes well to actions in the DMD dataset. The VLM fine-tuned on our Drive&Act description dataset achieves an ACCR score of 76, outperforming the zero-shot VLM baseline with an ACCR score of 66. These findings demonstrate that adapting VLMs with richly described driver actions can significantly improve their ability to interpret driver behavior while also highlighting the need for more diverse datasets to support broader generalization in future applications. Our Drive&Act description dataset and code will be publicly available on GitHub.
[251] Training-Free Composed Video Retrieval via Visual Representation-Guided Video-LLM Reasoning cs.CVPDF
Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai, Qingming Huang
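TL;DR: This paper presents a training-free approach to composed video retrieval, using visual representation-guided video-LLM reasoning to address the CVPR 2026 Reason-Aware Composed Video Retrieval Challenge. Frozen DINOv3 models first produce a compact set of visually relevant candidate videos, large vision-language models then judge whether each candidate satisfies the modification instruction, and a final reasoning-based refinement over the top candidates improves the first-ranked prediction.
Details
Motivation: Composed video retrieval asks a system to retrieve a target video given a reference video and a modification instruction. The aim is to extend video retrieval to flexible scenarios in which users specify the desired result through both visual examples and textual instructions.
Result: The system achieves 48.78 Recall@1 and 51.48 Recall@5 on the test set without any training.
Insight: The contribution is to pair visual representations (DINOv3) with large vision-language model reasoning in a two-stage retrieval framework and to add a reasoning-based refinement step. Viewed objectively, the training-free design lowers deployment cost, and the modular structure makes it easy to plug in stronger video-LLMs later.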
Abstract: Recent advances in large vision-language models have expanded video retrieval from simple text-based search to more flexible scenarios, where users may specify the desired result through both visual examples and textual instructions. In the CVPR 2026 Reason-Aware Composed Video Retrieval Challenge, the system is required to retrieve a target video according to a reference video and a modification instruction. To address this task, we develop Visual Representation-Guided Video-LLM Reasoning for Training-Free Composed Video Retrieval. Our framework first uses frozen DINOv3 models to obtain a compact set of visually relevant candidates, and then applies large vision-language models to evaluate whether each candidate satisfies the modification instruction. A final reasoning-based refinement is further performed on the top candidates to improve the first-ranked prediction. Without training, our system achieves 48.78 Recall@1 and 51.48 Recall@5 on the test set. Future work may further improve retrieval accuracy through stronger video-LLMs and detailed integration between visual representations and language reasoning.
[252] Detecting Pen-In-Air States from Video: A Proof-of-Concept Toward Complementary Handwriting Analysis cs.CVPDF
Lauren Sismeiro, Remy Plastre, Binbin Xu, Frederic Puyjarinet, Gerard Dray
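TL;DR: This paper proposes a video-based method for detecting pen-in-air (Pen-Up) states as a complement to digitizing tablets. The method combines YOLO-based pen-tip detection, kinematic feature extraction, and machine learning classification, and achieves reliable detection of Pen-Up events on a self-collected dataset under leave-one-video-out cross-validation.
Details
Motivation: Digitizing tablets sense Pen-Up behavior only within a short proximity range and may miss high-lift movements. Video analysis could offer a low-cost, non-intrusive complementary source of information for assessing developmental disorders such as dysgraphia.
Result: On a self-collected handwriting video dataset evaluated with a Leave-One-Video-Out protocol, the method reaches an F_2 score of up to 0.805 for event-level Pen-Up detection, in line with the emphasis on recall in screening settings.
Insight: The contribution is a feasibility study of Pen-Up detection from top-view video together with an interpretable hybrid pipeline, which lays a foundation for future large-scale studies and offers a complement to conventional digitizing tablets.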
Abstract: Dynamic aspects of handwriting are critical for assessing developmental disorders such as dysgraphia and are typically captured using digitizing tablets. However, tablet-based sensing restricts analysis of Pen-Up behavior to a short proximity range above the writing surface, potentially missing high-lift in-air movements. As a proof of concept, we investigate whether top-view video can provide a complementary source of information for inferring pen-contact states without relying on tablet proximity sensing. We propose an interpretable hybrid pipeline combining pen-tip tracking using a YOLO-based detector with kinematic feature extraction and machine learning classification. A pilot dataset of diverse handwriting videos was manually annotated at the frame level and evaluation used a Leave-One-Video-Out (LOVO) protocol. The method achieved reliable event-level detection of Pen-Up segments, with an F_2 score up to 0.805, consistent with the emphasis on recall in a screening-oriented setting. These results support the feasibility of video-based Pen-Up detection as a low-cost and non-intrusive complement to digitizing tablets, and provide a foundation for future large-scale studies.
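The F_2 score cited in this entry is the F-beta score with beta = 2, which weights recall more heavily than precision. The sketch below shows the standard formula only; the function names and the example counts are illustrative assumptions and are not taken from the paper.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision and recall from event-level counts (hypothetical inputs)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0.0:
        return 0.0
    return (1.0 + b2) * precision * recall / denom


# Illustrative counts only: 8 detected events are correct, 4 are spurious,
# and 2 true events are missed.
p, r = precision_recall(tp=8, fp=4, fn=2)
f2 = f_beta(p, r, beta=2.0)
f1 = f_beta(p, r, beta=1.0)
```

When recall exceeds precision, as in the illustrative counts, F_2 comes out higher than F_1, which is why it suits screening settings where missed events cost more than false alarms.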
[253] TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos cs.CVPDF
Jinpeng Liu, Yukang Xu, Yutong Li, Xingyu Liu
TL;DR: This paper proposes TROPHIES, a unified framework that jointly reconstructs dynamic humans, static scenes, and camera poses from multi-view videos and places them in one globally consistent 4D space. It consists of a Human Branch and a Scene Branch, coupled by a global alignment and optimization module that enforces scale consistency, contact priors, and cross-view temporal coherence.
Details
Motivation: Prior methods typically assume single-view input or decouple humans, scenes, and cameras, so they cannot recover coherent geometry, stable motion, or physically aligned trajectories. The paper therefore introduces the new task of unified human-scene-camera reconstruction from multi-view videos.
Result: Experiments on EgoHuman and EgoExo4D show that TROPHIES produces globally aligned, physically plausible 4D reconstructions and consistently outperforms existing paradigms in global fidelity and human-scene consistency.
Insight: The contribution is a unified multi-branch framework that handles humans, scenes, and cameras jointly, with global optimization under scale-consistency, contact-prior, and cross-view temporal-coherence constraints, yielding more consistent and physically plausible reconstructions.
Abstract: Reconstructing humans and their surrounding environments in a globally consistent 4D space is essential for comprehensive perception. However, prior works typically assume single-view inputs or decouple humans, scenes, and cameras, making them unable to recover coherent geometry, stable motion, and physically aligned trajectories. These limitations motivate us to introduce a new task: unified human-scene-camera reconstruction from multi-view videos, which aims to jointly estimate dynamic humans, static scenes, and camera poses in one global coordinate frame. We propose TROPHIES (Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos), a unified framework tailored for this task. TROPHIES features a Human Branch that models humans through temporal and spatial reasoning, and a Scene Branch that reconstructs static geometry with human-aware attention. A global alignment and optimization module couples both branches by enforcing scale consistency, contact priors, and cross-view temporal coherence. Experiments on EgoHuman and EgoExo4D demonstrate that TROPHIES achieves globally aligned, physically plausible 4D reconstructions and consistently outperforms existing paradigms in both global fidelity and human-scene consistency.
[254] AdaCodec: A Predictive Visual Code for Video MLLMs cs.CV | cs.AI | cs.CLPDF
Haowen Hou, Zhen Huang, Zheming Liang, Qingyi Si, Chenglin Li
TL;DR: This paper proposes AdaCodec, a predictive visual code for video multimodal large language models (video MLLMs). Exploiting the high redundancy between video frames, it sends a full reference frame only when the scene cannot be predicted well from prior context, and otherwise sends compact P-tokens that describe inter-frame changes such as motion and prediction residuals, which substantially reduces the number of visual tokens.
Details
Motivation: Existing video MLLMs usually encode each sampled frame as an independent RGB image, so visual tokens repeat content already present in earlier frames. The paper addresses this temporal redundancy with a more direct video interface.
Result: Across all eleven benchmarks, AdaCodec improves over the Qwen3-VL-8B per-frame RGB baseline at a matched visual-token budget. Even at 1/7 of the budget (32k tokens), AdaCodec surpasses the 224k-token baseline on all long-video benchmarks. On five general-video benchmarks it raises the average score while cutting time-to-first-token from 9.26 s to 1.62 s.
Insight: The core idea is the "predictive visual code" interface, which adaptively chooses between sending a full reference frame and a compact description of inter-frame changes according to the conditional predictive cost. This gives video MLLMs a way to exploit temporal redundancy that maintains or improves accuracy while reducing compute and latency.
Abstract: Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) usually encode each sampled frame as an independent RGB image, causing visual tokens to repeat content already present in earlier frames. This suggests a more direct video interface: send a full reference frame only when the scene cannot be predicted well from prior context, and otherwise transmit a compact description of inter-frame changes. We call this interface a predictive visual code, and instantiate it for video MLLMs as AdaCodec. AdaCodec spends full visual tokens on a reference frame only when its conditional predictive cost is high; otherwise, it encodes inter-frame changes, including motion and prediction residuals, as compact P-tokens. Across all eleven benchmarks, AdaCodec improves over the Qwen3-VL-8B per-frame RGB baseline at a matched visual-token budget. Even at 1/7 the budget, AdaCodec with 32k tokens surpasses the 224k baseline on all long-video benchmarks; on five general-video benchmarks, it raises the average score while substantially cutting time-to-first-token from 9.26s to 1.62s.
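The reference-frame-versus-P-token decision described in this entry can be illustrated with a toy token planner. This is only a sketch under loud assumptions: the abstract does not define the conditional predictive cost, so the mean absolute difference to the last reference frame stands in for it here, and the threshold and per-frame token counts are made-up values, not AdaCodec's.

```python
def plan_tokens(frames, threshold, full_tokens=256, p_tokens=16):
    """Label each frame "I" (full reference frame) or "P" (compact change code).

    frames: list of equal-length lists of floats (toy stand-ins for images).
    The cost used here is a hypothetical proxy, not the paper's predictive cost.
    """
    plan = []
    reference = None
    for frame in frames:
        if reference is None:
            cost = float("inf")  # nothing to predict from yet
        else:
            cost = sum(abs(a - b) for a, b in zip(frame, reference)) / len(frame)
        if cost > threshold:
            plan.append(("I", full_tokens))
            reference = frame  # later frames are predicted from this one
        else:
            plan.append(("P", p_tokens))
    return plan


# Two near-static shots separated by a scene cut.
frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.1],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 0.9, 1.0],
]
plan = plan_tokens(frames, threshold=0.2)
total_tokens = sum(tokens for _, tokens in plan)
per_frame_baseline = 256 * len(frames)  # every frame encoded in full
```

In this toy run only the first frame and the frame after the cut receive full tokens, so the total falls well below the per-frame baseline.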
[255] Multi-modal Video Representation Alignment for Robust Self-supervised Driver Distraction Detection cs.CVPDF
David J. Lerch, Livien Majer, Zeyun Zhong, Manuel Martin, Frederik Diederichs
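TL;DR: This paper proposes a framework for global alignment of multi-modal video representations, aimed at robust self-supervised driver distraction detection. It jointly models faulty negatives and unreliable positives to handle the noise and semantic overlap common in multi-modal data, using soft targets derived from cycle-consistency scores and a weighting mechanism based on similarity distributions to improve on conventional contrastive objectives.
Details
Motivation: Conventional contrastive objectives such as InfoNCE assume that all negatives are equally informative and all positives are reliable. In multi-modal video data, such as that used for driver distraction detection, this assumption is often violated by viewpoint changes, occlusions, or semantic overlap across modalities.
Result: On the Drive&Act dataset, the method consistently outperforms both pairwise and existing global alignment baselines across RGB, IR, Depth, and Skeleton modalities. Cross-view ablation studies further show strong generalization to unseen camera perspectives.
Insight: The contribution is a principled global multi-modal alignment framework that uses soft targets from cycle-consistency scores to relax the hard-negative assumption and similarity-distribution weighting to reduce the influence of noisy positives, giving more robust multi-modal representation learning.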
Abstract: Robust self-supervised learning of multi-modal video representations is critical for real-world applications such as driver distraction detection, where multiple sensors provide complementary but noisy signals. Conventional contrastive objectives, such as InfoNCE, assume all negatives are equally informative and all positives are reliable. However, this assumption is frequently violated in multi-modal data due to viewpoint changes, occlusions, or semantic overlap across modalities. In this work, we propose a novel framework for multi-modal global alignment that addresses these challenges by jointly modeling faulty negatives and unreliable or faulty positives. We introduce soft targets derived from cycle-consistency scores to relax the hard-negative assumption, and a weighting mechanism based on similarity distributions to mitigate the impact of noisy or faulty positives. Our approach extends traditional pairwise alignment to a principled global multi-modal setting, aggregating alignment information across all modality pairs. We evaluate our method on the Drive&Act dataset, demonstrating that it consistently outperforms both pairwise and existing global alignment baselines across RGB, IR, Depth, and Skeleton modalities. Cross-view ablation studies further show strong generalization to unseen camera perspectives, highlighting the robustness of our representations. Overall, our framework provides a scalable and effective solution for self-supervised global multi-modal representation learning, enabling reliable driver distraction detection and pioneering in real-world multi-modal video understanding. Our code will be published on GitHub.
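To make the contrast with InfoNCE concrete, the sketch below computes a contrastive loss with soft targets and per-sample weights. It is an illustration under assumptions: the abstract does not say how the soft targets are derived from cycle-consistency scores or how the weights follow from similarity distributions, so both are passed in as given inputs, and the function names are hypothetical.

```python
import math


def log_softmax(row, tau):
    """Numerically stable log-softmax of one similarity row at temperature tau."""
    scaled = [s / tau for s in row]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - lse for s in scaled]


def soft_target_contrastive_loss(sim, targets, weights, tau=0.1):
    """Weighted cross-entropy between soft targets and softmax(sim / tau).

    sim[i][j]:     similarity of anchor i (modality A) to candidate j (modality B)
    targets[i][j]: soft target distribution for anchor i (each row sums to 1)
    weights[i]:    reliability weight of the positive pair for anchor i
    """
    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0
    total = 0.0
    for row, q, w in zip(sim, targets, weights):
        logp = log_softmax(row, tau)
        total += w * -sum(qj * lp for qj, lp in zip(q, logp))
    return total / total_weight


sim = [[1.0, 0.0], [0.0, 1.0]]
one_hot = [[1.0, 0.0], [0.0, 1.0]]
# One-hot targets with unit weights reduce to plain InfoNCE.
infonce = soft_target_contrastive_loss(sim, one_hot, [1.0, 1.0], tau=1.0)
```

Setting a sample's weight to zero removes it from the loss entirely, which is the limiting case of down-weighting a positive pair judged unreliable.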
[256] PRIMA: Boosting Animal Mesh Recovery with Biological Priors and Test-Time Adaptation cs.CVPDF
Xiaohang Yu, Ti Wang, Mackenzie Weygandt Mathis
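TL;DR: PRIMA is a framework that makes 3D quadruped mesh recovery more robust under species and pose imbalance by adding biological priors and test-time adaptation. It uses BioCLIP embeddings as biological priors to inject semantic and morphological knowledge, refines SMAL predictions with a test-time adaptation strategy, and builds Quadruped3D, a large-scale pseudo-3D dataset. Experiments show state-of-the-art results on several benchmarks, with the largest gains on underrepresented species and challenging poses.
Details
Motivation: Because 3D supervision is limited and species distributions are long-tailed, existing animal reconstruction methods tend to regress toward mean shapes and poses and generalize poorly to underrepresented animals and rare articulations. PRIMA addresses this with biological priors and test-time adaptation.
Result: Extensive experiments on Animal3D, CtrlAni3D, Quadruped2D, and Animal Kingdom show that PRIMA achieves state-of-the-art results, with particularly strong improvements on underrepresented species and challenging poses.
Insight: The contributions are: BioCLIP embeddings used as biological priors for more accurate and generalizable shape prediction; a test-time adaptation strategy that combines 2D reprojection constraints with auxiliary keypoint guidance to refine pose and shape estimates and generate high-quality pseudo-3D annotations; and Quadruped3D, a large-scale pseudo-3D dataset covering diverse species and poses. Viewed objectively, combining domain knowledge (biological priors) with adaptive learning (test-time adaptation) offers a scalable approach to 3D reconstruction when data is scarce.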
Abstract: We present PRIMA (PRIors for Mesh Adaptation), a framework for robust 3D quadruped mesh recovery under severe species and pose imbalance. Existing animal reconstruction methods often regress toward mean shapes and poses due to limited 3D supervision and long-tailed species distributions, resulting in poor generalization to underrepresented animals and rare articulations. PRIMA addresses this challenge through three key contributions. First, we incorporate BioCLIP embeddings as biological priors to inject semantic and morphological knowledge into the reconstruction process, enabling more accurate and generalizable shape prediction across diverse quadrupeds. Second, we introduce a test-time adaptation (TTA) strategy that refines SMAL predictions using 2D reprojection constraints together with auxiliary keypoint guidance, improving pose and shape estimation while enabling the generation of high-quality pseudo-3D annotations from existing 2D datasets. Third, leveraging this TTA framework, we construct Quadruped3D, a large-scale pseudo-3D dataset that covers diverse species and pose variations to systematically improve model performance. Extensive experiments on Animal3D, CtrlAni3D, Quadruped2D, and Animal Kingdom demonstrate that PRIMA achieves state-of-the-art results, with particularly strong improvements on underrepresented species and challenging poses. Our results highlight the importance of biological priors and adaptation-driven data expansion for scalable and generalizable animal mesh recovery. Code is available at https://github.com/AdaptiveMotorControlLab/PRIMA.
[257] Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains cs.CV | cs.AIPDF
Garvin Guo, Donglei Yu, Yu Chen, Xiang Wang, Shuai Li
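TL;DR: This paper systematically studies what multimodal agents gain from tool use and finds that the benchmark improvements of tool-augmented agents may reflect learned tool-calling patterns more than a genuine ability to use tools. An analysis of Thyme and DeepEyesV2 on real-world understanding, OCR, chart understanding, and mathematical reasoning shows that tool access brings little overall improvement and that most tool-solved problems are also solved without tools.
Details
Motivation: Tool-augmented multimodal agents show strong benchmark gains, which are often read as evidence that agents have learned to use tools. The authors argue that a tool-call trace alone does not show that the tool supplied answer-critical information, so whether tool use truly adds capability needs systematic study.
Result: Tool access yields little consistent aggregate improvement and does not reliably reduce generated-token cost. For DeepEyesV2, 93% of tool-solved problems are also solved by at least one non-tool setting; for Thyme the figure is 96%. Mechanism ablations further show that the full tool-use loop does not consistently outperform either the tool-call format or the returned execution result alone.
Insight: The contribution is a systematic, empirical challenge to the capability gains claimed for tool-augmented agents, showing that current evaluation can conflate tool availability with whether tools actually expand what an agent can solve. Viewed objectively, the study argues that evaluations of multimodal agents should separate learned tool-calling patterns from tool-contributed capability, which matters for future agent design and evaluation.
Abstract: Tool-augmented multimodal agents show strong benchmark gains, often taken as evidence that agents have learned to use tools. We argue that this interpretation can be premature: a tool-call trace alone does not show whether the tool supplied answer-critical information. We study two representative "thinking with images" agents, Thyme and DeepEyesV2, across real-world understanding, OCR, chart understanding, and mathematical reasoning. Each agent is compared with its Tool-Free counterpart and with a Pure-Text Reasoner trained from the same source pool without tool-calling trajectories. Tool access yields little consistent aggregate improvement, does not reliably reduce generated-token cost, and leaves only a small tool-only solved set: 93% of DeepEyesV2's tool-solved problems and 96% of Thyme's are also solved by at least one non-tool setting. Mechanism ablations further show that the full tool-use loop does not consistently outperform either the tool-call format or the returned execution result alone. In the settings we study, the analyzed agents appear to learn tool-calling patterns more reliably than tool-contributed capabilities, suggesting that evaluation should distinguish tool availability from whether tools actually expand what agents can solve.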
[258] Explainable Forensics of Manipulated Segments in Untrimmed Long Videos cs.CVPDF
Yue Feng, Jingjing Li, Qijia Lu, Wei Ji, Jingrou Zhang
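TL;DR: To address detection of localized AI-generated content in long videos, this paper formulates the task of Temporal AI-Generated Segment Localization and Explanation and builds TASLE, a large-scale benchmark of 12,472 untrimmed videos. It also proposes MSLoc, a coarse-to-fine baseline that combines a boundary-sensitive proposal generation module with an MLLM-based refinement module for efficient long-video scanning, precise boundary localization, and interpretable reasoning.
Details
Motivation: Existing video forensic methods mainly operate on short, independent clips and cannot handle the realistic case in which AI-generated content is sparsely embedded in otherwise authentic long videos. Methods are needed that detect, localize, and explain manipulated segments in long videos.
Result: Experiments validate the effectiveness of the proposed baseline and highlight the importance of segment-level explainable forensics for long-form AI-generated video analysis.
Insight: The contribution is to formulate the task of temporally localizing and explaining AI-generated segments in long videos and to build a matching large-scale benchmark. MSLoc uses a coarse-to-fine two-stage design that pairs efficient boundary proposal generation with MLLM-based refinement and interpretable reasoning.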
Abstract: The rapid advancement of AI-driven video generation has transformed content creation, while simultaneously increasing the risk of misinformation through localized manipulations in long-form videos. Existing video forensic methods predominantly operate on short, independent clips, and thus fail to capture realistic scenarios where AI-generated content is sparsely embedded within otherwise authentic footage. To bridge this gap, we formulate the task of Temporal AI-Generated Segment Localization and Explanation, which targets authenticity detection, temporal localization, and interpretable analysis of manipulated segments in untrimmed long videos. We further introduce TASLE, a large-scale benchmark comprising 12,472 untrimmed videos with diverse manipulation patterns and rich annotation signals, including temporal boundaries, authenticity labels, and segment-level rationales. In addition, we propose MSLoc, a coarse-to-fine forensic baseline that combines a boundary-sensitive proposal generation module for efficient long-video scanning with an MLLM-based refinement module for precise boundary localization and interpretable reasoning. Experiments validate the effectiveness of the proposed baseline, highlighting the importance of segment-level explainable forensics for long-form AI-generated video analysis. Our dataset and code are publicly available at https://debby-0527.github.io/TASLE.
[259] Geometry-Aware Implicit Memory for Video World Models cs.CVPDF
Zhengxuan Wei, Xu Guo, Xinghui Li, Xunzhi Xiang, Min Wei
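TL;DR: This paper proposes GIM-World, a geometry-aware implicit memory framework for video world models that targets the problem of compressing history while preserving geometric consistency in long-horizon rollouts. A lightweight transformer encoder compresses variable-length history into fixed-size memory tokens, 3D scene structure distilled from a frozen foundation model provides geometric supervision during training, and only the lightweight memory module is kept at inference.
Details
Motivation: In long-horizon rollouts, explicit memories (stored frames or online 3D reconstructions) can suffer from retrieval errors, redundant storage, or reconstruction artifacts, while implicit memories compress history but are not explicitly constrained to encode cross-view scene geometry.
Result: On the MIND benchmark, GIM-World preserves long-horizon geometric and visual consistency better than both explicit- and implicit-memory baselines.
Insight: The contribution is to bring geometry-aware supervision into implicit memory training (distilling 3D structure from a frozen foundation model) and to bound encoding cost with an information-guided pruning rule. Viewed objectively, separating training from inference (a geometry teacher used in training and discarded at inference) balances efficiency and performance.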
Abstract: Video world models aim to simulate controllable visual environments, but long-horizon rollouts depend on what the model remembers after observations leave its native context window. Explicit memories retain frames or online 3D reconstructions, which can suffer from heuristic retrieval errors, redundant appearance storage, or reconstruction artifacts. Implicit memories compress history into a compact state, but existing designs are not explicitly constrained to encode cross-view scene geometry. We propose GIM-World, a geometry-aware implicit memory framework for video world models. A lightweight transformer encoder compresses variable-length history into fixed-size memory tokens, a camera-queryable geometry head distills 3D scene structure from a frozen foundation model into the memory during training, and an information-guided pruning rule keeps encoding cost bounded as history grows. The geometry teacher is discarded at inference, leaving a lightweight memory module. Experiments on MIND show that GIM-World better preserves long-horizon geometric and visual consistency than both explicit- and implicit-memory baselines.
[260] Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation cs.CVPDF
Yuheng Chen, Teng Hu, Yuji Wang, Qingdong He, Lizhuang Ma
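TL;DR: This paper proposes ST-DRC, a framework for identity-preserving text-to-video generation. It performs latent in-context feature injection through a spatial-temporal decoupled reference conditioning mechanism and introduces the TASS-RoPE scheme to balance text-driven semantic control against fidelity to identity details.
Details
Motivation: Existing identity-preserving video generation methods struggle to balance high-level semantic control and low-level identity fidelity; ST-DRC targets this trade-off.
Result: With a lightweight design built on LTX-2.3, ST-DRC achieves strong identity preservation, prompt alignment, temporal consistency, and video quality, and ranks among the top submissions in the facial identity-preserving video generation track.
Insight: The contributions are: latent in-context feature injection that needs no additional adapters; TASS-RoPE, which separates identity-aware reference retrieval from appearance copying; the combination of appearance-invariant reference augmentation with face-guided identity objectives; and a three-stream reference classifier-free guidance strategy at inference.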
Abstract: Identity-preserving video generation (IPVG) aims to synthesize high-fidelity videos that follow text prompts while faithfully preserving a reference identity. Despite recent progress, existing IPVG methods still struggle to balance high-level semantic control and low-level identity fidelity. To bridge this gap, we propose ST-DRC, an effective Spatial-Temporal Decoupled Reference Conditioning framework for identity-preserving text-to-video generation. At the framework level, ST-DRC performs latent in-context feature injection by encoding the reference image with the video VAE and concatenating it with noisy video latents, enabling rich low-level identity details to be accessed without additional adapters. To separate identity-aware reference retrieval from appearance copying, we introduce TASS-RoPE, a Temporal-Adjacent Spatial-Shifted RoPE scheme that places reference tokens near the video sequence in time but shifts them in space, allowing reference information to flow through spatio-temporal attention while suppressing pixel-level copy-paste shortcuts. To further prevent shortcut learning and strengthen the otherwise diluted identity supervision in the diffusion objective, we combine appearance-invariant reference augmentation with face-guided identity objectives, encouraging the model to preserve identity under variations in color, pose, and layout. At inference time, we introduce a three-stream reference classifier-free guidance strategy that independently controls text adherence and reference fidelity. Experiments demonstrate that ST-DRC achieves strong identity preservation, prompt alignment, temporal consistency, and video quality with a lightweight design built on LTX-2.3. Our method ranks among the top submissions in the facial identity-preserving video generation track, validating the effectiveness of spatial-temporal decoupled reference conditioning.
[261] Reason-Then-Retrieve for CoVR-R with Structured Edit Prompts and Dense-Sparse Fusion cs.CVPDF
DongQing Liu, MengShi Qi, HongWei Ji
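TL;DR: This paper studies CoVR-R, reason-aware composed video retrieval. The authors build a zero-shot reason-then-retrieve pipeline around Qwen3.5-27B, which generates a structured description and a dense embedding for each video, and fuse the dense ranking with a TF-IDF sparse retrieval branch. The paper reports results on both the validation split and the blind test split.
Details
Motivation: The core difficulty of CoVR-R is that the target video is not described directly. It must be inferred from the reference video and the edit instruction through fine-grained changes in object identity, action order, final state, hand interaction, and scene transition.
Result: On validation, the best submission reaches 80.81 at R@1, 94.86 at R@5, 97.11 at R@10, and 98.59 at R@50. On the blind test split, it reaches 89.73, 95.79, 96.63, and 97.98 at the same cutoffs.
Insight: The contribution is the zero-shot reason-then-retrieve design: a large model produces structured descriptions and embeddings, and TF-IDF sparse retrieval over the generated texts is fused with dense embedding retrieval.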
Abstract: CoVR-R studies reason-aware composed video retrieval: given a reference video and an edit instruction, the system must retrieve the target video that satisfies the edit. The main difficulty is that the target is not described directly; it must be inferred from fine-grained changes in object identity, action order, final state, hand interaction, and scene transition. We build a zero-shot reason-then-retrieve pipeline around Qwen3.5-27B. For each gallery video, the model generates a retrieval-oriented structured description and a dense embedding by pooling generated-token hidden states with token-dependent weights. For each query, the model first performs edit reasoning over the reference video and instruction, then generates a target-video description whose hidden states serve as the query embedding. We complement dense retrieval with a TF-IDF branch over the generated texts and fuse the two rankings with split-specific weights. On validation, the current best submission reaches 80.81 at R@1, 94.86 at R@5, 97.11 at R@10, and 98.59 at R@50. On the blind test split, it reaches 89.73 at R@1, 95.79 at R@5, 96.63 at R@10, and 97.98 at R@50.
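The dense-sparse fusion step in this entry can be sketched as a weighted combination of two score lists. The abstract states only that the two rankings are fused with split-specific weights, so the rule below (a weighted sum of min-max normalized scores) is an assumption chosen for illustration, and the scores, video IDs, and weight are made up.

```python
def min_max(scores):
    """Rescale a {video_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def fuse(dense, sparse, w_dense=0.7):
    """Rank gallery videos by a weighted sum of normalized dense and sparse scores.

    dense:   {video_id: embedding similarity}
    sparse:  {video_id: TF-IDF similarity over generated descriptions}
    w_dense: hypothetical fusion weight (the paper tunes weights per split)
    """
    d, s = min_max(dense), min_max(sparse)
    fused = {k: w_dense * d[k] + (1.0 - w_dense) * s.get(k, 0.0) for k in d}
    return sorted(fused, key=lambda k: fused[k], reverse=True)


# Illustrative scores for three gallery videos.
dense_scores = {"a": 0.9, "b": 0.8, "c": 0.1}
sparse_scores = {"a": 0.0, "b": 5.0, "c": 1.0}
ranking = fuse(dense_scores, sparse_scores, w_dense=0.7)
```

Normalizing before mixing keeps the unbounded sparse scores from swamping the dense similarities; a rank-based rule such as reciprocal rank fusion would be another reasonable choice.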
[262] Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agentic Vision-Language Models cs.CVPDF
Wei Deng, Xianlin Zhang, Mengshi Qi
TL;DR: Inspired by pigeon navigation, this paper proposes an agentic, active-exploration pipeline for strengthening the spatial reasoning of vision-language models (VLMs). A dynamic cognitive map parameterizes the scene layout, and Spatial Assertion Codes (SAC), Python expressions that describe spatial relationships programmatically, allow intermediate reasoning steps to be verified and so provide dense reward signals. After supervised and reinforcement fine-tuning, the model reaches 80.5% overall accuracy on the MindCube benchmark and, on the challenging Rotation subset, outperforms the best current method by 29.5 accuracy points (a 53.2% relative improvement).
Details
Motivation: Existing approaches treat VLMs as passive observers, which suits real-world spatial reasoning poorly, and reinforcement learning methods rely on sparse rewards, which limits their effectiveness on complex reasoning tasks.
Result: On MindCube the method achieves 80.5% overall accuracy, and on the Rotation subset it outperforms the best current method by 29.5 accuracy points (a 53.2% relative improvement).
Insight: The contribution is a biologically inspired dynamic cognitive map that serves as persistent memory, combined with programmable Spatial Assertion Codes that describe and verify spatial relationships, giving reinforcement learning dense reward signals and addressing the limits of sparse rewards and passive observation.
Abstract: Enabling Vision-Language Models (VLMs) to perform spatial reasoning remains challenging. Existing approaches treat VLMs as passive observers, which is difficult for real-world applications. Moreover, reinforcement learning methods rely on sparse rewards, limiting their effectiveness for complex reasoning tasks. Inspired by pigeons' building and exploiting cognitive maps for navigation, we propose a novel agentic pipeline for spatial reasoning. First, we introduce a new dynamic cognitive map parameterizing scene layout as object positions and orientations, serving as persistent memory for new observations. Second, we propose a novel Spatial Assertion Codes (SAC), Python expressions programmatically describing spatial relationships. By collaborating with the dynamic cognitive map, SAC enables verification of intermediate reasoning steps, providing dense reward signals. We optimize the model via supervised and reinforcement finetuning. Experiments on the MindCube benchmark demonstrate state-of-the-art performance with 80.5% overall accuracy, outperforming the best current method by 29.5 accuracy points (a relative improvement of 53.2%) on the challenging Rotation subset. Our code and data are open-sourced at https://github.com/dw-dengwei/active-spatial-reasoning.git.
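As a rough picture of how assertions checked against a cognitive map could yield a dense reward, the sketch below stores object poses in a dictionary and scores a list of spatial predicates. Everything here is hypothetical: the `Pose` fields, the predicate names, and the fraction-satisfied reward are assumptions for illustration and are not the paper's SAC syntax or reward definition.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose:
    x: float
    y: float
    heading_deg: float  # 0 points along +x; angles grow counter-clockwise


def _heading(pose):
    rad = math.radians(pose.heading_deg)
    return math.cos(rad), math.sin(rad)


def left_of(cmap, obj, ref):
    """True if obj lies to the left of ref, judged from ref's heading."""
    hx, hy = _heading(cmap[ref])
    dx, dy = cmap[obj].x - cmap[ref].x, cmap[obj].y - cmap[ref].y
    return hx * dy - hy * dx > 0  # positive 2D cross product means "left"


def in_front_of(cmap, obj, ref):
    """True if obj lies in the half-plane ref is facing."""
    hx, hy = _heading(cmap[ref])
    dx, dy = cmap[obj].x - cmap[ref].x, cmap[obj].y - cmap[ref].y
    return hx * dx + hy * dy > 0


def assertion_reward(cmap, assertions):
    """Fraction of assertions that hold on the map (a toy dense reward)."""
    if not assertions:
        return 0.0
    return sum(1 for check in assertions if check(cmap)) / len(assertions)


# Toy map: the chair faces +y, and the lamp sits up and to its left.
cmap = {"chair": Pose(0.0, 0.0, 90.0), "lamp": Pose(-1.0, 1.0, 0.0)}
assertions = [
    lambda m: left_of(m, "lamp", "chair"),
    lambda m: in_front_of(m, "lamp", "chair"),
    lambda m: left_of(m, "chair", "lamp"),  # false on this map
]
reward = assertion_reward(cmap, assertions)
```

Because each intermediate assertion is scored separately, a partly correct reasoning chain still earns partial credit, which is the sense in which such a reward is denser than a single right-or-wrong signal on the final answer.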
[263] Places in the Wild: A Large, High-Resolution RAW Photograph Dataset for Ecologically Valid Vision Research cs.CVPDF
Michelle R. Greene
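TL;DR: This paper introduces Places in the Wild, a dataset of 67,574 high-resolution photographs recorded as both RAW and JPEG, covering 810 physical locations and 260 basic-level scene categories. The images were captured with a 45-megapixel camera using dense 360-degree viewpoint sampling and come with complete EXIF metadata and image-quality metrics, with the aim of supporting ecologically valid vision research.
Details
Motivation: Most large image datasets consist of low-resolution, internet-sourced JPEGs with unknown capture conditions and limited spatial context, which does not serve ecologically valid vision research. The paper provides a high-resolution RAW dataset collected in situ to support more realistic scene understanding and visual analysis.
Result: The dataset contains 67,574 high-resolution images spanning 260 scene categories across indoor, urban, and natural environments, captured with dense 360-degree viewpoint sampling and accompanied by complete metadata and image-quality metrics. It provides high-quality data for research on scene understanding and natural scene statistics.
Insight: The contribution is a large, high-resolution RAW dataset collected in situ that preserves sensor-level detail for analyses of luminance, contrast, and color statistics and offers dense 360-degree viewpoint sampling, supporting research on viewpoint-dependent recognition in humans and models and on ecologically valid scene understanding.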
Abstract: Large image datasets have accelerated progress in cognitive neuroscience and computer vision. However, most datasets are low-resolution, internet-sourced JPEGs with unknown capture conditions and limited spatial context. Places in the Wild is a dataset of 67,574 high-resolution photographs collected in situ across 810 physical locations spanning 260 basic-level scene categories, including indoor, urban, and natural environments. At each location, a 45-megapixel Canon EOS R5 mounted on a panoramic tripod captured 72 images at 5-degree horizontal intervals plus 12 images at varying elevations, yielding dense 360-degree viewpoint sampling. All images were recorded simultaneously as 14-bit RAW (CR3) files and compressed JPEGs, preserving sensor-level detail for analyses of luminance, contrast, color, and other image statistics. The dataset is accompanied by complete EXIF metadata and a suite of image-quality metrics. Places in the Wild supports research on viewpoint-dependent recognition in humans and models, training and evaluation of scene-understanding systems under realistic conditions, characterization of natural scene statistics, and experiments requiring near-full-field visual displays.
[264] MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence cs.CV | cs.AIPDF
Hilton Raj, Vishnuram AV
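TL;DR: This paper proposes MASER (Modality-Adaptive Specialist Routing) for embodied agents answering spatial questions in 3D environments. Existing vision-language models (VLMs) are usually fine-tuned on a single modality, ignoring that a question's semantics may favor a different one. MASER trains five modality adapters on a shared VLM backbone and learns a neural routing policy that picks the best adapter for each question; it is evaluated on the Open3D-VQA benchmark.
Details
Motivation: VLMs for 3D spatial reasoning are typically fine-tuned on a single modality, although a question's semantics may favor another modality such as point clouds or depth maps, which limits performance.
Result: On Open3D-VQA, no single modality is universally optimal (point-cloud answers are best in 51.5% of cases). MASER's router agrees with the oracle adapter choice 51.3% of the time, outperforming a Random-Forest ablation (43.5%), with only a single adapter call per question.
Insight: The contribution is a lightweight modality-adaptive routing framework: a frozen sentence encoder and a small MLP learn to select a modality from question semantics, which avoids the compute cost of multi-modal fusion while exploiting the complementary strengths of different modalities.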
Abstract: In 3D environments, Embodied Agents answer spatially relevant questions through reasoning from a mixture of modalities including natural language, RGB images, point clouds, depth maps and camera poses. Existing Vision-Language models (VLMs) are fine-tuned over a single modality. This completely ignores the question semantics which may favor a different modality than the finetuned modality. To address this, we propose MASER (Modality-Adaptive SpEcialist Routing), a lightweight framework that trains five different modality adapters of a shared VLM backbone and learns a neural routing policy that selects the best adapter based on the question during inference. We encode each question with a frozen sentence transformer and pass the embedding through a small Multi-layer Perceptron (MLP) trained on oracle adapter-accuracy labels. We evaluate our methodology over the Open3D-VQA benchmark and our evaluations show that no single modality is universally optimal – point-cloud answers are best in 51.5% of cases. MASER routes with 51.3% oracle agreement, outperforming a Random-Forest ablation (43.5%), with only a single adapter call per question.
[265] Retrieve What’s Missing: Coverage-Maximizing Retrieval for Consistent Long Video Generation cs.CVPDF
Minseok Joo, Dogyun Park, Taehoon Lee, Kyujin Lee, Hyunwoo J. Kim
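TL;DR: This paper proposes COVRAG, a depth-based memory retrieval framework for improving geometric consistency in long video generation. It uses pretrained 3D priors to construct a target-view coverage map as lightweight 3D memory evidence, selects memory frames iteratively by maximizing residual coverage gain, and introduces sliding-window depth caching to make long-video generation more efficient.
Details
Motivation: Maintaining long-term geometric consistency is difficult in long-horizon autoregressive video generation. For representing 3D-geometric evidence and selecting memory frames, existing methods are either lightweight but too coarse (camera poses or field-of-view overlap) or fine-grained but costly (explicit 3D reconstruction).
Result: Experiments on RealEstate10K and DL3DV10K show that COVRAG improves long-horizon geometric consistency while maintaining low latency compared to baselines.
Insight: The contribution is a lightweight coverage map built from pretrained 3D priors as geometric evidence, plus an iterative retrieval strategy that maximizes residual coverage gain. Viewed objectively, the method balances geometric precision against compute cost, and the sliding-window cache improves scalability for long videos.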
Abstract: Maintaining long-term geometric consistency remains challenging for long-horizon autoregressive video generation. Memory-augmented generative models address this by retrieving historical frames, but their effectiveness depends on two key design choices: what 3D-geometric evidence should represent past observations, and how memory frames should be selected from this evidence. Existing methods often rely on camera poses or field-of-view overlap, which are lightweight but too coarse to reason about pixel-wise visibility, or use explicit 3D reconstruction, which provides fine-grained evidence but is costly to maintain over long rollouts. We propose Coverage-Maximizing Retrieval-Augmented Generation (COVRAG), a depth-based memory retrieval framework that uses pretrained 3D priors to construct a target-view coverage map as lightweight 3D memory evidence. For frame selection, COVRAG maximizes residual coverage gain, iteratively retrieving frames that explain target-view regions not covered by the current context or previously selected memories. To improve scalability in long-video generation, we introduce sliding-window depth caching for efficient geometry estimation. Experiments on RealEstate10K and DL3DV10K show that COVRAG improves long-horizon geometric consistency while maintaining low latency compared to baselines.
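The frame-selection rule described in this entry (repeatedly retrieve the frame that explains the most target-view regions not yet covered) has the shape of greedy maximum coverage. The sketch below shows that greedy loop on toy data; representing each frame's coverage as a set of pixel indices, the frame names, and the budget are illustrative assumptions, and the depth-based construction of the coverage map is not modeled.

```python
def select_memory_frames(candidates, context_covered, budget):
    """Greedily pick frames that add the most uncovered target-view pixels.

    candidates:      {frame_id: set of target-view pixel indices the frame covers}
    context_covered: pixels already explained by the current context
    budget:          maximum number of memory frames to retrieve
    """
    covered = set(context_covered)
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < budget:
        # Residual gain counts only pixels that are not covered yet.
        best = max(remaining, key=lambda f: len(remaining[f] - covered))
        if not remaining[best] - covered:
            break  # no remaining frame adds new coverage
        selected.append(best)
        covered |= remaining.pop(best)
    return selected, covered


# Toy target view with pixels 1..6; the context already explains pixels 1 and 2.
candidates = {"f1": {1, 2, 3, 4}, "f2": {3, 4, 5}, "f3": {6}}
selected, covered = select_memory_frames(candidates, {1, 2}, budget=2)
```

Frame f1 covers the most pixels in absolute terms but is passed over, because most of what it shows is already explained by the context; ranking by residual gain rather than raw overlap is what produces that behavior.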
[266] X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding cs.CVPDF
Peiwen Sun, Xudong Lu, Huadai Liu, Yang Bo, Dongming Wu
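TL;DR: This paper introduces X-Stream, the first benchmark for multi-stream video understanding, which evaluates reasoning in real-time, multi-stream interactive settings. It contains 4,220 QA pairs across 932 videos and 11 subtasks covering multi-window, multi-view, and multi-device scenarios. The authors conceptualize multimodal large language models (MLLMs) as naive multiplexers, evaluate them through the lens of Signal Multiplexing Theory, and find that current state-of-the-art models handle concurrent streams poorly.
Details
Motivation: Real-world applications such as live sports broadcasting, autonomous driving, and multi-screen collaboration demand continuous multi-stream interaction, but existing benchmarks are confined to single-stream paradigms and do not evaluate online cross-stream reasoning.
Result: On X-Stream, state-of-the-art MLLMs score only about 50% with concurrent streams and show poor proactive ability, exposing the trade-offs of current multiplexing schemes.
Insight: The contributions are: (1) X-Stream, the first multi-stream video understanding benchmark, built with a dual-verification pipeline that prevents over-reliance on a single stream; and (2) the framing of MLLMs as naive multiplexers, evaluated systematically through Signal Multiplexing Theory, which gives a practical evaluation protocol and empirical guidance for next-generation multi-stream agents.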
Abstract: While video streaming understanding has made significant strides, real-world applications, such as live sports broadcasting, autonomous driving, and multi-screen collaboration, inherently demand continuous, multi-stream interactions. However, existing benchmarks are confined to single-stream paradigms, leaving a critical gap in evaluating online, cross-stream reasoning. To bridge this, we introduce X-Stream, the first benchmark dedicated to multi-stream streaming understanding. Comprising 4,220 rigorously curated QA pairs across 932 videos, X-Stream evaluates 11 subtasks across multi-window, multi-view, and multi-device scenarios. Crucially, our dataset is constructed using a novel dual-verification pipeline that prevents over-reliance on a single stream. Furthermore, we pioneer the conceptualization of multi-modal large language models (MLLMs) as naive multiplexers, systematically evaluating their performance through the lens of Signal Multiplexing Theory. Our extensive online inference experiments reveal a stark reality: state-of-the-art MLLMs struggle significantly with concurrent streams, achieving only about 50% score and exhibiting poor proactive ability. Ultimately, X-Stream exposes the trade-off of current multiplexing schemes, providing both a practical evaluation protocol and empirical guidance for next-generation multi-stream agents.
[267] MORPHOS: Autoregressive 4D Generation with Temporal Structured Latents cs.CV
Minkyung Kwon, Jinhyeok Choi, Youngjin Shin, Jaeyeong Kim, JongMin Lee
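TL;DR: MORPHOS is a novel autoregressive framework that generates dynamic 3D assets from videos across multiple representations (meshes, 3D Gaussians, and radiance fields). It introduces Temporal Structured Latents (T-SLAT) to encode geometry and appearance in a unified way and generates autoregressively with causal attention, ensuring temporal consistency while handling topological changes.
Details
Motivation: Existing methods are typically limited to a single representation, struggle to model topological changes, or fail to maintain temporal consistency over long videos. MORPHOS aims to address these limitations.
Result: MORPHOS achieves state-of-the-art appearance performance and competitive geometry results across multiple benchmarks, demonstrating superior generalization across representations and robustness in long-horizon generation.
Insight: The contributions are T-SLAT, a unified 4D representation that jointly encodes spatiotemporal information, and autoregressive generation with causal attention to handle evolving topology. In addition, a temporal-structural augmentation mitigates error accumulation in autoregressive generation and improves long-term consistency.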
Abstract: We present MORPHOS, a novel autoregressive framework that generates dynamic 3D assets from videos across diverse representations, including meshes, 3D Gaussians, and radiance fields. Existing methods are typically limited to a single representation, struggle to model topological changes, or fail to maintain temporal consistency over long videos. To address these limitations, we introduce the Temporal Structured Latents (T-SLAT), a unified 4D representation that jointly encodes geometry and appearance along the temporal dimension. Leveraging T-SLAT, MORPHOS autoregressively generates dynamic 3D assets via causal attention, conditioning each frame on its preceding history to ensure temporal consistency while handling evolving topologies. We also propose a temporal-structural augmentation to mitigate error accumulation in autoregressive generation. MORPHOS achieves state-of-the-art performance in appearance and competitive results in geometry across multiple benchmarks, demonstrating superior generalization across various representations and robustness in long-horizon generation.
[268] ToolFG: Towards Well-Grounded Fine-Grained Image Classification cs.CV
Yu Xue, Haoxuan Qu, Zhuoling Li, Yihang Lou, Yan Bai
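TL;DR: This paper proposes ToolFG, the first framework that tackles fine-grained image classification (FGIC) with a multimodal large language model (MLLM) integrated with external tools. The framework lets the MLLM use tools autonomously and flexibly during reasoning and interact actively with the image, collecting visual cues in a more reliable and verifiable way to distinguish highly similar categories.
Details
Motivation: FGIC has broad applications and attracts significant research attention, but existing methods fall short in reliability and interpretability. The paper explores a new paradigm in which tool integration lets an MLLM perform fine-grained classification more reliably and verifiably.
Result: Extensive experiments demonstrate the framework's effectiveness, but the abstract does not name the benchmark datasets or state whether the method reaches SOTA or is comparable to specific models.
Insight: The main contributions are: 1) ToolFG, the first tool-integrated MLLM framework tailored to FGIC; 2) an MCTS-guided tool-use knowledge distillation mechanism that mines relevant knowledge from advanced proprietary MLLMs; 3) a model-tool co-evolution mechanism that jointly refines the toolset and the model's tool-use policy so that they adapt to each other and specialize for FGIC.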
Abstract: Fine-grained image classification (FGIC) has broad applications and has attracted significant research attention. In this paper, we explore a novel paradigm for solving FGIC by proposing ToolFG, the first tool-integrated MLLM-based framework tailored to FGIC. ToolFG enables MLLMs to autonomously and flexibly use external tools during the reasoning process, actively interact with images, and collect verifiable visual cues for distinguishing highly similar categories in a more reliable and well-grounded manner. To equip the model with such tool-use ability, we design a novel MCTS-guided tool-use knowledge distillation mechanism, which effectively mines tool-use- and FGIC-relevant knowledge from advanced proprietary MLLMs for model training. Furthermore, we propose a model-tool co-evolution mechanism that jointly refines the toolset and the model’s tool-use policy, driving them toward a mutually adapted and FGIC-specialized state. Extensive experiments demonstrate the effectiveness of our framework.
[269] Question-Aware Evidence Ledgers for Video Relational Reasoning cs.CV
Yilin Ou, Mengshi Qi, Huadong Ma
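TL;DR: This paper presents a test-time reasoning pipeline for Video Relational Reasoning QA (VRR-QA), built around a strong GPT-5.5 video QA solver and a set of question-aware evidence ledgers. The initial solver answers each question from a uniform video representation, while routed ledgers are prompted to make explicit the required targets, count units, reference frames, and temporal or spatial scope for counting, spatial, endpoint, viewpoint, and dialogue reasoning. External tools serve only as evidence sources, and a conservative gate keeps the current answer unless independent evidence uniquely supports a different option. The final system achieves 92.95% overall accuracy and 93.79% macro accuracy on the challenge test split.
Details
Motivation: The VRR-QA challenge evaluates visual relational reasoning in videos, where answers often depend on implicit spatial relations, event boundaries, target identity, and dialogue context rather than a single salient frame. Existing methods may struggle to capture these complex multimodal dependencies, which calls for a reasoning framework that explicitly extracts and integrates diverse evidence.
Result: The evidence-gated pipeline achieves 92.95% overall accuracy and 93.79% macro accuracy on the VRR-QA challenge test split, which likely represents state-of-the-art (SOTA) performance on this benchmark.
Insight: The core contribution is the modular design of question-aware evidence ledgers, which break a complex reasoning task into targeted subtasks (such as making targets and scope explicit) and invoke them dynamically through routing. The architecture combines a general video QA model (GPT-5.5) with specialized, interpretable evidence-extraction tools (such as detection, depth, and scene graphs) and fuses evidence through a conservative gate, offering a reusable pattern for video QA problems that require multi-step, multi-evidence reasoning.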
Abstract: The VRR-QA challenge evaluates visual relational reasoning in videos, where answers often depend on implicit spatial relations, event boundaries, target identity, and dialogue context rather than a single salient frame. We present a test-time reasoning pipeline built around a strong GPT-5.5 video QA solver and a set of question-aware evidence ledgers. The initial solver answers each question from a uniform video representation, while routed ledgers are prompted to make the required targets, count units, reference frames, and temporal or spatial scope explicit for counting, spatial, endpoint, viewpoint, and dialogue reasoning. External tools such as open-vocabulary detection, depth cues, pair crops, ASR, and scene-graph ledgers are used only as evidence sources. A conservative gate keeps the current answer unless independent evidence uniquely supports a different option. The final evidence-gated pipeline achieves 92.95% overall accuracy and 93.79% macro accuracy on the challenge test split.
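The conservative gate in the abstract can be pictured with a small rule. The Python below is one possible reading under stated assumptions, not the authors' code: each independent evidence source is reduced to the set of answer options it is consistent with, and the answer changes only when every decisive source points to the same single option. The function name `conservative_gate` and this evidence encoding are hypothetical.

```python
def conservative_gate(current_answer, evidence):
    """Keep the current answer unless evidence uniquely supports another option.

    evidence: list of sets, one per independent source; each set holds the
        options that source is consistent with (hypothetical encoding).
    """
    # Only sources that narrow the choice to exactly one option are decisive.
    decisive = {next(iter(options)) for options in evidence if len(options) == 1}
    if len(decisive) == 1:
        (candidate,) = decisive
        if candidate != current_answer:
            return candidate
    # Ambiguous, conflicting, or missing evidence leaves the answer unchanged.
    return current_answer


print(conservative_gate("A", [{"B"}, {"B", "C"}]))  # B
print(conservative_gate("A", [{"B"}, {"C"}]))       # A
```

The second call shows the conservative side of the design: two sources that disagree with each other are not allowed to overturn the solver's answer.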
[270] Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events cs.CV | cs.AI
Xiaolin Liu, Yilun Zhu, Xiangyu Zhao, Xuehui Wang, Yan Li
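TL;DR: This paper introduces Moment-Video, a benchmark for diagnosing the temporal fidelity of video multimodal large language models (MLLMs) in understanding momentary visual events. It finds that existing models fall well short at capturing brief but critical visual evidence: the best model reaches only 39.6% accuracy.
Details
Motivation: Video MLLMs have progressed rapidly on general and long-form video understanding, but their ability to preserve brief, answer-critical visual evidence remains underexplored. Many practical questions are determined by momentary visual events lasting only a few frames, and sparse frame sampling, visual-token compression, or coarse temporal aggregation can cause models to fail.
Result: The benchmark evaluates 33 proprietary and open-source MLLMs. The best model, Seed-2.0-Pro, reaches only 39.6% overall accuracy, and most open-source models remain below 25%. Diagnostic analyses show that denser frame sampling improves some models but does not eliminate the bottleneck, and that longer videos pose stronger temporal-localization challenges.
Insight: The contribution is a diagnostic benchmark focused on understanding momentary, localized, sampling-sensitive visual events. Viewed objectively, the study shows that current video MLLMs still lack temporally faithful representations for capturing, preserving, and using brief but decisive visual evidence, which points to a direction for future model design.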
Abstract: Video multimodal large language models (MLLMs) have made rapid progress on general and long-form video understanding, yet their ability to preserve brief answer-critical visual evidence remains underexplored. Many practical questions are determined by momentary visual events: localized actions or state transitions that may last only a few frames. Such evidence can be skipped by sparse frame sampling, suppressed by visual-token compression, or diluted by coarse temporal aggregation, causing failures that language-side reasoning cannot reliably recover. We introduce Moment-Video, a benchmark for diagnosing the temporal fidelity of video MLLMs through momentary visual event understanding. Each question is grounded in a localized, visually observable, and sampling-sensitive event, requiring models to notice, count, describe, or reason about transient evidence rather than rely on persistent objects, global scene context, or language priors. Moment-Video contains 1,000 human-verified video-QA pairs across 7 domains and 25 fine-grained subcategories, covering four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning. We evaluate 33 proprietary and open-source MLLMs on Moment-Video. The best-performing model, Seed-2.0-Pro, achieves only 39.6% overall accuracy, while most open-source models remain below 25%, revealing a substantial gap in momentary visual event understanding. Diagnostic analyses show that denser frame sampling improves some models but does not eliminate the bottleneck, and longer videos introduce stronger temporal-localization challenges. These findings suggest that current video MLLMs still lack temporally faithful representations for capturing, preserving, and using brief but decisive visual evidence.
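To make the sampling argument concrete, the sketch below checks whether a short event is hit by any sampled frame. It is an illustration only: uniform segment-center sampling is an assumed scheme, and the function names and frame counts are hypothetical rather than taken from the benchmark.

```python
def sampled_indices(total_frames, num_samples):
    """Uniformly sample one frame index at the center of each segment."""
    return [int((i + 0.5) * total_frames / num_samples) for i in range(num_samples)]


def event_is_seen(total_frames, num_samples, start, end):
    """Return True if any sampled frame falls inside the event [start, end)."""
    return any(start <= idx < end for idx in sampled_indices(total_frames, num_samples))


# A 10-frame event in a 900-frame clip (hypothetical numbers).
print(event_is_seen(900, 8, 300, 310))    # False: the stride is about 112 frames
print(event_is_seen(900, 128, 300, 310))  # True: the stride is about 7 frames
```

Once the sampling stride exceeds the event duration, the event can fall entirely between two sampled frames, and no amount of language-side reasoning can recover it.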
[271] LL-Bench: Rethinking Low-Level Vision Evaluation in the Era of Large-Scale Generative Models cs.CV
Lu Liu, Huiyu Duan, Chenxin Zhu, Jintong Lu, Haoyun Jiang
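TL;DR: This paper proposes LL-Bench, a comprehensive benchmark for evaluating large-scale generative models on low-level vision tasks such as image restoration. The benchmark contains a large set of real-world degraded images, restored images produced by many models, and a large number of expert-level human preference annotations and quality scores. Using it, the paper systematically diagnoses the performance boundaries and distinctive failure modes of generative models on low-level vision tasks and reveals significant discrepancies between existing quality metrics and human ratings. It also proposes LL-Score, an MLLM-based evaluator that aligns better with human preferences and can serve as a reward model for training generative models.
Details
Motivation: Large-scale generative models excel at image generation and editing, but their performance on low-level vision tasks that require pixel-wise control (such as image denoising and super-resolution) has not been studied sufficiently. Filling this gap requires a comprehensive benchmark for evaluating and comparing these models systematically.
Result: On LL-Bench, 10 SOTA large-scale generative models and 21 conventional restoration models are diagnosed systematically, revealing the generative models' performance boundaries and distinctive failure modes. In experiments, LL-Score outperforms existing image quality assessment metrics and can serve as a reward model for training generative models on low-level vision tasks.
Insight: One contribution is LL-Bench itself, a large-scale, multi-task low-level vision benchmark with rich human annotations, which provides a foundation for studying generative models in this area. The other is LL-Score, an MLLM-based evaluator that captures both restoration quality and the presence of hallucinations, aligns better with human preferences, and shows promise as a reward model, offering a new way to address the mismatch between evaluation metrics and human perception in low-level vision.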
Abstract: Large-scale generative models have demonstrated remarkable capabilities across image generation and editing tasks. However, their performance in low-level vision tasks, which require pixel-wise control, remains insufficiently studied. To address this gap, we introduce LL-Bench, a comprehensive Benchmark for evaluating the capabilities of large-scale generative models on Low-Level vision tasks. The benchmark comprises 2,469 real-world degraded images covering 16 low-level degradation tasks, and 28,919 restored images produced by 10 state-of-the-art large-scale generative models and 21 conventional restoration models, which are annotated with 152,020 expert-level pairwise human preferences and 28,334 quality scores. Built upon LL-Bench, we present a systematic diagnosis that reveals the performance boundaries and unique failure modes of large-scale generative models across diverse low-level vision tasks, compared with conventional representative restoration approaches. Moreover, we investigate the effectiveness of current quality evaluation metrics on LL-Bench, which exhibit significant discrepancy with human ratings. To better align restored-image quality assessment with human preferences, we further propose LL-Score, an MLLM-based evaluator that captures both restoration quality and hallucination existence. Extensive experiments demonstrate that LL-Score not only outperforms existing image quality assessment metrics, but also serves as a promising reward model for training generative models on low-level vision tasks.
[272] LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation cs.CV
Qixin Hu, Shuai Yang, Wei Huang, Song Han, Yukang Chen
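TL;DR: This paper proposes LongLive-RAG, a general retrieval-augmented framework for long video generation that targets the error accumulation and identity drift caused by sliding-window attention in autoregressive video diffusion models. The framework treats previously generated latents as a dynamic, searchable history, brings in non-local context through a lightweight retrieval step, and uses a Window Temporal Delta Loss to make retrieval more discriminative, improving long-video quality.
Details
Motivation: When autoregressive video diffusion models generate long videos, sliding-window attention makes error accumulation and identity drift irreversible. Existing methods condition only on the recent window, so the generation trajectory degrades.
Result: Experiments across multiple autoregressive backbones and generation lengths show that LongLive-RAG improves long-video quality and achieves the best average rank on VBench-Long.
Insight: The contribution is to model the latent history of autoregressive long-video generation, for the first time, as content-addressable retrieval memory. Retrieval augmentation together with the Window Temporal Delta Loss suppresses redundant local similarity and better captures meaningful temporal changes, mitigating the error propagation caused by sliding-window attention.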
Abstract: Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during generation. This creates an irreversible generation trajectory: once the active window accumulates appearance errors, subsequent generations can only condition on this degraded trajectory and drift further away. We address this limitation by formulating long video generation as a retrieval-augmented generation (RAG) problem. Rather than relying solely on the recent window, we treat previously generated latents as a dynamic, searchable history. We propose LongLive-RAG, a general retrieval framework for AR video generation. At each new block, LongLive-RAG uses a query embedding to retrieve relevant historical latents. This lightweight retrieval step adds only a small overhead relative to generation and lets the generator condition on non-local context instead of only the recent window. To make retrieval more discriminative, we introduce the Window Temporal Delta Loss that suppresses redundant local similarity and encourages embeddings to capture meaningful temporal changes. Together, these components help reduce error accumulation caused by sliding-window attention. Experiments across multiple AR backbones and generation lengths show improved long-video quality and the best average VBench-Long rank. To our knowledge, among open-ended AR long video generation methods, LongLive-RAG is the first to formulate self-generated latent history as content-addressable retrieval memory. Code is available at https://github.com/qixinhu11/LongLive-RAG.
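The retrieval step can be illustrated with a plain top-k similarity search over stored block embeddings. This is a minimal sketch under assumptions that the abstract does not confirm: cosine similarity as the score, exclusion of the blocks already inside the recent window, and the names `retrieve_history`, `recent_window`, and `top_k` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_history(query, history, recent_window, top_k):
    """Return indices of the top_k most similar historical blocks.

    history: list of block embeddings, oldest first.
    recent_window: number of newest blocks the generator already attends to;
        they are skipped here (an assumption of this sketch).
    """
    cutoff = max(0, len(history) - recent_window)
    scored = [(cosine(query, history[i]), i) for i in range(cutoff)]
    # Highest similarity first; break ties by preferring older blocks.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:top_k]]


history = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [1.0, 0.0]]
print(retrieve_history([1.0, 0.0], history, recent_window=1, top_k=2))  # [0, 2]
```

A linear scan like this is cheap next to a diffusion step, which matches the abstract's claim that retrieval adds only a small overhead, although the actual index structure used by the authors is not stated.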
[273] Policy-based Foveated Imaging and Perception cs.CV
Howard Xiao, Jan Ackermann, Boyang Deng, Gordon Wetzstein
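TL;DR: This paper proposes a policy-based, real-time, predictive, and task-aware foveated imaging system for ultra-high-resolution image sensors, whose pixels cannot all be acquired and processed at full resolution under bandwidth, latency, and power constraints. Using emerging dual-stream sensor architectures, the system dynamically allocates limited pixel bandwidth to task-relevant regions of interest at acquisition time while maintaining a low-resolution global context, achieving high performance on visual perception tasks under strict pixel budgets.
Details
Motivation: Ultra-high-resolution sensors capture fine spatial detail, but acquiring and processing every pixel at full resolution is often infeasible under realistic bandwidth, latency, and power constraints. Existing approaches such as spatial or temporal downsampling irreversibly discard information before task relevance can be assessed, so an imaging method is needed that is task-aware and allocates resources dynamically at acquisition time.
Result: In extensive simulation across multiple perception tasks, the method achieves high task performance under strict pixel budgets and significantly outperforms relevant baselines at the same bandwidth. It is further validated on a 200-megapixel dual-stream sensor capturing real-world video under realistic bandwidth and latency constraints, demonstrating the practical feasibility of task-driven, acquisition-time foveated imaging.
Insight: The contribution is to formulate foveated acquisition as a sensor attention policy-learning problem in which past observations guide the actions that determine future measurements, closing the perception-acquisition loop. Viewed objectively, the system is integrated directly into the acquisition hardware and performs real-time, predictive resource allocation, offering a practical end-to-end solution for high-resolution perception.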
Abstract: Ultra-high-resolution image sensors offer the potential to capture fine spatial details critical for many visual perception tasks, but acquiring and processing all pixels at full resolution is often infeasible under realistic bandwidth, latency, and power constraints. Existing approaches address this challenge through acquisition strategies such as spatial or temporal downsampling, which irrevocably discard information before task relevance can be assessed. In this work, we introduce a real-time, predictive, and task-aware foveated imaging system that operates directly at image acquisition time. Leveraging emerging dual-stream sensor architectures, our method dynamically allocates limited pixel bandwidth to task-relevant regions of interest while maintaining a low-resolution global context. We formulate foveated acquisition as a sensor attention policy-learning problem, in which past observations guide actions that determine future measurements, closing the perception-acquisition loop. Through extensive simulation across multiple perception tasks, we demonstrate that our approach achieves high task performance under strict pixel budgets and significantly outperforms relevant baselines operating at the same bandwidth. We further validate our system on a 200-megapixel dual-stream sensor, capturing real-world videos under realistic bandwidth and latency constraints, demonstrating the practical feasibility of task-driven, acquisition-time foveated imaging.
[274] VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization cs.CV
Junhao Cheng, Liang Hou, Tianxiong Zhong, Xin Tao, Pengfei Wan
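TL;DR: This paper proposes a new video reasoning paradigm that shifts vision-language models (VLMs) from problem solvers to "teachers". A VLM extracts task rules and formulates differentiable rewards, which guide a video generation model (VGM) through online optimization of a lightweight LoRA module, adaptively improving the VGM's adherence to task logic at test time.
Details
Motivation: When VGMs are used for video reasoning, they produce high-quality video but struggle to understand and follow complex task rules, leading to logical failures. Using VLMs as pre-solvers does not fix this: textual guidance cannot capture fine spatiotemporal detail, and VGMs execute it poorly.
Result: On the symbolic benchmark VBVR-Bench and the general-purpose video reasoning benchmark RULER-Bench, the method improves average performance by 16.7 points, far exceeding the VLM-as-Solver paradigm (+0.4 points) and Best-of-N scaling (+2.2 points) at comparable test-time cost.
Insight: The core idea is the role change: the VLM's strong perception is used as a "teacher" that evaluates process-constraint satisfaction and final-goal achievement and formulates differentiable rewards from them. Guiding the VGM dynamically through test-time online optimization, rather than pre-solving, extends reasoning beyond the VGM's intrinsic boundaries and yields more general, adaptive video reasoning.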
Abstract: The recent “Reasoning with Video” paradigm utilizes Video Generation Models (VGMs) to generate temporally coherent visual trajectories to complete reasoning tasks. Although state-of-the-art VGMs excel at visual quality, they often struggle to understand and follow task-specific rules, leading to logical failures across diverse reasoning scenarios. Existing efforts try to utilize Vision-Language Models (VLMs) as problem pre-solvers to produce or refine textual guidance for the VGM. However, textual descriptions fail to capture intricate spatiotemporal details, and VGMs often struggle to faithfully execute fine-grained or long-tail instructions even with a valid plan. While VLMs struggle as solvers, they possess strong perception capabilities to evaluate process-constraint satisfaction and final-goal achievement. Leveraging this strength, we introduce a paradigm shift that transitions the role of VLMs to “teachers”. Specifically, a VLM teacher extracts task-specific rules to formulate differentiable rewards, guiding a VGM Reasoner via test-time online optimization of a lightweight LoRA module. This strategy enables adaptive test-time optimization and extends the reasoning capabilities beyond the VGM’s intrinsic boundaries. Evaluations on symbolic (VBVR-Bench) and general-purpose (RULER-Bench) video reasoning benchmarks show that the proposed method yields a 16.7-point average performance gain, outperforming the VLM-as-Solver paradigm (+0.4 points) and Best-of-N scaling (+2.2 points) by a large margin at comparable test-time cost. These findings reveal that integrating VLMs as test-time teachers offers a promising paradigm for achieving generalizable video reasoning. Project Page: https://VLM-as-Teacher.github.io/
[275] From Zero to Hero: Training-Free Custom Concept Spawning in World Models cs.CV
Kiymet Akdemir, Pinar Yanardag
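TL;DR: This paper proposes SPAWN, a training-free method for concept spawning, that is, introducing a user-specified visual concept into an autoregressive world model. It swaps the foundational anchor in the model's context memory with an external concept latent over a short injection window, so that the concept propagates naturally through the generated video via the model's own memory. This addresses the inability of existing models to control scene content when generating unseen regions.
Details
Motivation: In interactive video generation with existing autoregressive world models, once the user navigates beyond what is visible in the reference frame, there is no way to control what is generated in the unseen regions. This limits applications that need controllable scene composition, such as gaming, interactive storytelling, and simulation.
Result: Experiments show that SPAWN integrates concepts, from fine-grained entities to large-scale elements, into the generated world with consistent lighting, scale, and perspective while preserving identity and temporal coherence, demonstrating that controllable concept spawning is achievable in existing autoregressive world models without training.
Insight: The contribution is the observation of a structural property of image-to-video backbones (the first slot of the context memory is pinned to the reference frame and acts as the foundational anchor for every generated chunk) and its use, through anchor swapping and windowed injection, for training-free concept spawning. This gives existing models a route to more controllable generation without retraining.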
Abstract: Autoregressive world models have emerged as a powerful paradigm for interactive video generation, allowing users to navigate dynamically generated environments through actions. These models are typically conditioned on a text prompt and/or a single reference frame, from which the entire world is generated. Yet the moment the user navigates beyond what is visible in that frame, the unseen regions are populated by the base model’s priors, with no mechanism for the user to specify what should appear and where. This is a fundamental limitation for applications such as gaming, interactive storytelling, and simulation, where controllable scene composition is essential. We refer to this missing capability as concept spawning; introducing a user-specified visual concept into a world model, analogous to spawning in a game engine. We introduce SPAWN (Swapping Pinned Anchor with Windowed iNjection), a training-free method for concept spawning. SPAWN exploits a structural property of image-to-video backbones: the first slot of the context memory is pinned to the reference frame and acts as a foundational anchor for every generated chunk. By swapping this anchor with an external concept latent over a short injection window and letting the original anchor return, we cause the concept to propagate naturally through the rollout via the model’s own memory. SPAWN supports concepts from fine-grained entities such as characters and props to large-scale elements such as buildings and landmarks, and accepts either a concept image or a text description as input. Experiments show that SPAWN integrates concepts with consistent lighting, scale, and perspective while preserving identity and temporal coherence, demonstrating that controllable concept spawning is achievable in existing autoregressive world models without any training.
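The anchor swap with windowed injection can be sketched as a schedule over generated chunks. The code below is a toy illustration, not the SPAWN implementation: latents are stood in for by strings, and the names `anchor_schedule`, `build_context`, `inject_start`, and `inject_len` are hypothetical.

```python
def anchor_schedule(num_chunks, inject_start, inject_len,
                    reference="ref", concept="concept"):
    """Choose which latent occupies the pinned first memory slot per chunk.

    The concept latent replaces the reference anchor only inside the
    injection window; afterwards the original anchor returns.
    """
    schedule = []
    for chunk in range(num_chunks):
        in_window = inject_start <= chunk < inject_start + inject_len
        schedule.append(concept if in_window else reference)
    return schedule


def build_context(anchor, rolling_memory):
    """Assemble the context memory: slot 0 is the anchor, the rest is recent history."""
    return [anchor] + list(rolling_memory)


print(anchor_schedule(6, inject_start=2, inject_len=2))
# ['ref', 'ref', 'concept', 'concept', 'ref', 'ref']
```

After the window closes, the concept is no longer in slot 0, so under the abstract's description it can persist only through the latents the model itself generated during the window.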
[276] ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning cs.CV | cs.LG
Yu-Cheng Shi, Zhen-Hao Xie, Jun-Tao Tang, Da-Wei Zhou
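TL;DR: This paper proposes ProtoAda, a prototype-guided adaptive tuning framework for multimodal continual instruction tuning. It introduces format-aware task prototypes that align task assignment and routing with both task semantics and output structure, and consolidates format-compatible updates in a geometry-aware manner to reuse and progressively refine existing parameters.
Details
Motivation: Multimodal large language models achieve strong performance through instruction tuning, but real-world deployment requires them to keep acquiring new vision-language capabilities, which makes multimodal continual instruction tuning essential. Existing sparse-architecture methods that route by image-text similarity (such as Mixture of LoRA Experts) misroute and misassign tasks whose response structures differ but whose visual-linguistic semantics are highly similar, causing gradient interference and ineffective expert collaboration.
Result: Extensive experiments on multiple benchmarks show that ProtoAda achieves superior performance, especially on tasks whose answer structures are easily corrupted by sequential tuning.
Insight: The core contribution is the format-aware task prototype, which brings output-structure information into task assignment and routing decisions and goes beyond methods that rely on image-text similarity alone. In addition, geometry-aware consolidation of updates promotes parameter reuse and progressive refinement and reduces inter-task interference.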
Abstract: Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually acquire new vision-language capabilities, making Multimodal Continual Instruction Tuning (MCIT) essential. To reduce inter-task interference and promote collaboration, recent methods often employ sparse architectures like Mixture of LoRA Experts with image-text similarity routing. However, tasks with distinct response structures could share highly similar visual-linguistic semantics and thus be wrongly routed to the same expert; image-text similarity alone is insufficient for reliable task assignment. For example, an expert in a grounding task requiring coordinate prediction may be biased toward producing short textual answers after learning semantically similar VQA tasks. This format-blind task assignment integrates heterogeneous response types into shared parameters, inducing gradient interference and ineffective expert collaboration. To address this problem, we propose ProtoAda, a prototype-guided adaptive tuning framework. ProtoAda introduces format-aware task prototypes to align task assignment and routing with both task semantics and output structure, and further consolidates format-compatible updates in a geometry-aware manner to effectively reuse and progressively refine existing parameters. Extensive experiments on multiple benchmarks demonstrate that ProtoAda achieves superior performance, especially on tasks whose answer structures are easily corrupted by sequential tuning.
[277] Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling cs.CV | cs.AI
Seojeong Park, Jiho Choi, Junyong Kang, Seonho Lee, Jaeyo Shin
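TL;DR: This paper addresses perceptual judgment bias in multimodal large language models used as automated evaluators: when visual evidence conflicts with textual cues, the model tends to reward plausible narratives over perceptually correct answers. The authors build a perceptually perturbed judgment dataset and develop a unified training framework that combines a structured GRPO reward with a batch-ranking objective to improve the perceptual fidelity and consistency of evaluation.
Details
Motivation: When visual evidence conflicts with textual cues, existing MLLM evaluators tend to rely on the text rather than their own visual perception, producing inconsistent and non-verifiable evaluations. The paper calls this perceptual judgment bias.
Result: Experiments on multiple MLLM-as-a-Judge benchmarks show that the method substantially improves perceptual fidelity, ranking coherence, and alignment with human evaluation.
Insight: The contribution is to identify and systematically analyze perceptual judgment bias and to enable verifiable supervision through a perceptual perturbation dataset of minimally edited counterfactual responses. The unified training framework combines a structured reward with a batch-ranking objective and achieves coherent global ordering without explicit pairwise labels, offering a scalable, general path to multimodal evaluators that are perceptually grounded, interpretable, and robust to visual-reasoning conflicts.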
Abstract: Recent multimodal large language models have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains limited by a critical weakness: when visual evidence conflicts with textual cues, MLLM judges tend to reward plausible narratives over perceptually correct answers. We identify and systematically analyze this phenomenon, which we term Perceptual Judgment Bias. Through controlled visual perturbations, existing multimodal judges frequently anchor on the response text instead of their own visual perception, leading to inconsistent and non-verifiable evaluations. To address this issue, we introduce the Perceptually Perturbed Judgment Dataset, which constructs minimally edited counterfactual responses that isolate perceptual errors and enable verifiable supervision. Building on this dataset, we develop a unified training framework that combines a structured GRPO-based reward with a batch-ranking objective, achieving coherent global ordering without explicit pairwise labels. Experiments across diverse MLLM-as-a-Judge benchmarks show that our approach substantially improves perceptual fidelity, ranking coherence, and alignment with human evaluation. Our results establish a scalable and generalizable pathway for training multimodal judges that are perceptually grounded, interpretable, and robust to visual-reasoning conflicts.
[278] Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models cs.CV
Guangzhao He, Rundong Luo, Wei-Chiu Ma, Hadar Averbuch-Elor
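TL;DR: This paper proposes SEIG, an agentic framework that uses pretrained vision-language models to perform executable inverse graphics from a single image, reconstructing the image as an editable Blender program without specialized 2D or 3D foundation models, differentiable rendering, or multi-view supervision. It reconstructs the 3D scene by progressively refining scene factors (geometry, materials, composition, and lighting) in stages within executable Blender code space, and it demonstrates downstream applications of the reconstructed editable scenes.
Details
Motivation: Inverse graphics, reconstructing an editable, renderable, relightable, and manipulable 3D scene from a single image, is a longstanding and highly underconstrained problem. The paper also asks whether general-purpose vision-language models can perform the task directly, without specialized models or supervision.
Result: The framework is evaluated on diverse scenes with reconstruction metrics spanning pixel-level, perceptual, and semantic fidelity. Staged reconstruction substantially improves reconstruction fidelity, highlighting the importance of task decomposition for executable inverse graphics with general-purpose vision-language models.
Insight: The contribution is a staged executable inverse graphics framework that turns scene reconstruction into progressive refinement in Blender code space. It relies on the capabilities of general-purpose vision-language models rather than on specialized models or supervision, offering a new agent-driven paradigm for single-image 3D scene reconstruction.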
Abstract: Inverse graphics is a longstanding and highly underconstrained problem that seeks to reconstruct images as editable 3D scenes which can be rendered, relit, and manipulated. In this work, we investigate whether pretrained vision-language models (VLMs) can perform executable inverse graphics directly from a single image by reconstructing a scene as an editable Blender program, without relying on specialized 2D or 3D foundation models, differentiable rendering, or multi-view supervision. We introduce Staged Executable Inverse Graphics (SEIG), an agentic framework that reconstructs a 3D scene from a single image by progressively refining scene factors including geometry, materials, composition, and lighting directly in executable Blender code space. We evaluate our framework across diverse scenes using a range of reconstruction metrics spanning pixel-level, perceptual, and semantic fidelity. Our experiments show that staged reconstruction substantially improves reconstruction fidelity, highlighting the importance of task decomposition for executable inverse graphics with general-purpose VLMs. Finally, we showcase various downstream applications enabled by the reconstructed editable Blender scenes.
cs.LG
[279] Escaping the Mode Lottery: Multi-Response Training Improves Language Model Generalization cs.LG | cs.CL
Hasan Amin, Kian Ahrabian, Ming Yin, Rajiv Khanna
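TL;DR: This paper studies multi-response training for language-model fine-tuning and argues that conventional single-response training loses modes of the conditional distribution, a phenomenon it calls the "mode lottery". Through theoretical analysis and experiments, it shows that retaining multiple responses markedly improves distributional generalization when response diversity is high and prompt redundancy is low.
Details
Motivation: Conventional language-model fine-tuning usually pairs each prompt with a single response, ignoring that many prompts admit multiple valid completions. Some modes of the conditional distribution are therefore neglected, which limits generalization.
Result: Across structured and real-world datasets, including a new multi-prompt, multi-response benchmark, multi-response training consistently improves distributional generalization, with the largest gains where response diversity is high and prompt redundancy is low.
Insight: The contribution is to treat response multiplicity as a data-allocation problem and to give a theoretical framework for multi-response training. The framework shows that prompts and responses act as distinct statistical resources and analyzes how response selection strategies (such as Random-K-of-N and reward-based selection) affect mode collapse, providing theoretical guarantees for efficient training.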
Abstract: Modern language-model fine-tuning typically pairs each prompt with a single response, even though many prompts admit multiple valid completions. This effectively reduces a multi-modal conditional distribution to a one-sample view, a phenomenon we call the “mode lottery,” where training emphasizes a subset of plausible modes while leaving others underrepresented. We study multi-response training (MRT), which retains multiple responses per prompt, and develop a principled account of when and why it helps. Our key insight is that prompts and responses are distinct statistical resources: additional prompts reduce uncertainty about the input distribution, while additional responses reduce uncertainty about the conditional output distribution. This yields a variance-budget tradeoff that predicts when retaining multiple responses is worthwhile, shows diminishing returns as prompt-level uncertainty dominates, and explains why large redundant corpora can exhibit an implicit multi-response effect. We further analyze response selection, and show that Random-K-of-N is the unbiased default for distributional fine-tuning, reward-based selection can induce mode collapse, and a submodular quality-diversity objective provides an efficient alternative with theoretical guarantees. Controlled simulations validate the predicted variance and selection effects, including a striking failure mode where reward-only selection produces gradients misaligned with the true objective. Across structured and real-world datasets, including a new multi-prompt, multi-response benchmark, MRT consistently improves distributional generalization, with the largest gains in high response-diversity, low prompt-redundancy regimes. MRT reframes response multiplicity as a data-allocation problem with clear guidance: when responses are cheap and diverse, keeping more than one is not a heuristic, but a statistically grounded choice.
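Random-K-of-N selection, which the abstract names as the unbiased default, amounts to uniform sampling without replacement from each prompt's responses. The sketch below is a hedged illustration: the data layout (a dict from prompt to a list of responses) and the names `random_k_of_n`, `k`, and `seed` are assumptions, not the paper's interface.

```python
import random


def random_k_of_n(responses_by_prompt, k, seed=0):
    """Keep up to k uniformly sampled responses per prompt.

    responses_by_prompt: dict mapping a prompt to its N candidate responses
        (hypothetical layout). Returns a flat list of (prompt, response)
        training pairs.
    """
    rng = random.Random(seed)
    pairs = []
    for prompt, responses in responses_by_prompt.items():
        # Uniform sampling without replacement does not favor any mode,
        # unlike reward-based top-k selection.
        kept = rng.sample(responses, min(k, len(responses)))
        pairs.extend((prompt, response) for response in kept)
    return pairs


data = {"p1": ["a", "b", "c"], "p2": ["x"]}
pairs = random_k_of_n(data, k=2)
print(len(pairs))  # 3: two responses for p1, one for p2
```

Setting `k=1` recovers conventional single-response fine-tuning, so the same function covers both regimes the paper compares.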
[280] Trust Functions: Near-Lossless Weak-to-Strong Generalization by Learning When to Trust the Weak Teacher cs.LG | cs.CL
Arda Uzunoglu, Alvin Zhang, Daniel Khashabi
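TL;DR: This paper proposes trust functions for weak-to-strong generalization, the problem of improving a strong student with supervision from a weaker teacher when reliable labels are scarce. It treats the problem as data selection, assigning each weak label a trust score and filtering for reliable weak supervision. In experiments across several domains (world knowledge, quantitative reasoning, and strategy games), the method achieves near-lossless weak-to-strong generalization, sometimes surpassing ground-truth supervision, and supports an iterative weak-to-strong chain that compounds the gains.
Details
Motivation: The core challenge in weak-to-strong generalization is how to make effective use of a weaker teacher's supervision when reliable labels are scarce. The key is identifying which weak labels are reliable enough to serve as a training signal.
Result: Across several domains, including world knowledge, quantitative reasoning, and strategy games, trust filtering yields students that match and sometimes surpass ground-truth supervision, achieving near-lossless weak-to-strong generalization, and supports iterative chained training that further amplifies the gains.
Insight: The contribution is to formalize weak-to-strong generalization as a data selection problem and to introduce trust functions that assess the reliability of each weak label. Viewed objectively, the method is a general framework that filters weak supervision by trust score, avoids the harm of noisy labels, and supports a scalable iterative training pipeline, giving it practical potential.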
Abstract: Weak-to-strong generalization studies how to improve a strong student using supervision from a weaker teacher when reliable labels are scarce. We view this primarily as a data selection problem, where the key challenge is to identify which weak labels are reliable enough to serve as a training signal. To address this, we introduce trust functions that assign each weak label a scalar trust score and use these scores to filter weak supervision. Across several domains, including world knowledge, quantitative reasoning, and strategy games, trust filtering yields students that match and sometimes surpass ground-truth supervision, achieving near-lossless weak-to-strong generalization. Moreover, trust functions enable an iterative weak-to-strong chain that compounds gains by training a student and reusing it as the next teacher. The advantage of trust functions can be attributed to several mechanisms.
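Trust filtering as described reduces to scoring each weak label and keeping those above a cutoff. The sketch below assumes a fixed threshold and uses the teacher's own confidence as a stand-in trust function; both choices, along with the names `trust_filter`, `confidence_trust`, and the `teacher_confidence` field, are hypothetical and not taken from the paper.

```python
def trust_filter(weak_examples, trust_fn, threshold):
    """Keep weak-labeled examples whose scalar trust score meets the threshold."""
    kept = []
    for example in weak_examples:
        if trust_fn(example) >= threshold:
            kept.append(example)
    return kept


def confidence_trust(example):
    """Hypothetical trust function: reuse the weak teacher's confidence."""
    return example["teacher_confidence"]


weak_data = [
    {"x": "q1", "weak_label": "A", "teacher_confidence": 0.95},
    {"x": "q2", "weak_label": "B", "teacher_confidence": 0.40},
    {"x": "q3", "weak_label": "A", "teacher_confidence": 0.80},
]
train_set = trust_filter(weak_data, confidence_trust, threshold=0.75)
print([example["x"] for example in train_set])  # ['q1', 'q3']
```

The iterative chain in the abstract would wrap this step in a loop: train a student on the filtered set, use it to relabel data as the next teacher, and filter again.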
[281] Trust Region On-Policy Distillation cs.LG | cs.CL
Xingrun Xing, Haoqing Wang, Boyan Gao, Ziheng Li, Yehui Tang
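TL;DR: This paper proposes Trust Region On-Policy Distillation (TrOPD) to address training instability in on-policy distillation of large language models. It distills only in regions where the teacher provides reliable supervision and handles outlier regions with gradient clipping, masking, and forward-KL estimation. Experiments show that TrOPD outperforms existing on-policy distillation baselines on mathematical reasoning, code generation, and general-domain benchmarks.
Details
Motivation: On-policy distillation (OPD) becomes unstable when the teacher and student distributions differ substantially: teacher supervision on student-generated tokens may yield unreliable policy gradients and even cause optimization failure.
Result: TrOPD consistently outperforms SOTA OPD baselines, including OPD, EOPD, and REOPOLD, on mathematical reasoning, code generation, and general-domain benchmarks.
Insight: The contributions are: 1) on-policy learning within a trust region where the teacher provides reliable supervision, which mitigates the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch; 2) gradient clipping, masking, and forward-KL estimation for outlier regions, which reduce the adverse effects of unreliable supervision; 3) continued generation from teacher prefixes with forward KL to imitate off-policy guidance, which encourages on-policy exploration toward reliable regions.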
Abstract: On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially, as teacher supervision on student-generated tokens may yield unreliable policy gradients and even cause optimization failure. This work addresses reliable on-policy token-level supervision through credit assignment strategies, and proposes Trust Region On-Policy Distillation, TrOPD. It features the following characteristics: 1) Trust-Region On-Policy Learning: TrOPD performs OPD only in regions where the teacher provides reliable supervision, mitigating the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch. 2) Outlier Estimation: For outlier regions, we explore gradient clipping, masking, and forward-KL estimation to reduce the adverse effects of unreliable supervision. 3) Off-Policy Guidance: The student continues generation from teacher prefixes and uses forward KL to imitate off-policy guidance, encouraging on-policy exploration toward reliable regions. Experiments show that TrOPD consistently outperforms SoTA OPD baselines, including OPD, EOPD, and REOPOLD, across mathematical reasoning, code generation, and general-domain benchmarks.
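The masking variant of outlier handling can be illustrated on the per-token K1 estimate of reverse KL, which for a token sampled from the student is the log-probability gap between student and teacher. The trust-region test used below, an absolute gap under a fixed bound, is purely an assumption for illustration; the abstract does not say how TrOPD delimits the region, and `masked_k1` and `max_gap` are hypothetical names.

```python
def masked_k1(student_logps, teacher_logps, max_gap):
    """Average the K1 reverse-KL estimate over tokens inside an assumed trust region.

    student_logps, teacher_logps: log-probabilities that each model assigns to
        the tokens the student actually generated.
    max_gap: tokens whose absolute log-probability gap exceeds this bound are
        treated as outliers and masked out (assumed criterion).
    """
    kept = []
    for s_logp, t_logp in zip(student_logps, teacher_logps):
        k1 = s_logp - t_logp  # single-sample estimate of reverse KL
        if abs(k1) <= max_gap:
            kept.append(k1)
    return sum(kept) / len(kept) if kept else 0.0


student = [-1.0, -2.0, -0.5]
teacher = [-1.5, -2.0, -9.0]
print(masked_k1(student, teacher, max_gap=2.0))  # 0.25
```

Without the mask, the third token (gap 8.5) would dominate the average, which is the kind of unreliable signal under distribution mismatch that the paper sets out to suppress.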
[282] OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification cs.LG | cs.CL
Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang, Peng Bo
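TL;DR: This paper proposes OmniOPD, a logit-free on-policy distillation framework based on chunk-level verification. It approximates the teacher's local preferences through Monte Carlo sampling and a semantic similarity metric over multi-token chunks, and uses a peak-entropy scheduler to supervise the student at high-uncertainty reasoning forks. This removes the dependence of conventional on-policy distillation on the teacher's token-level logits and avoids their inherent brittleness.
Details
Motivation: Conventional on-policy distillation has two coupled limitations. First, it needs direct access to the teacher's token-level logits, which rules out many capable proprietary models as teachers. Second, the token-level logit signal is itself brittle: it depends on a narrow overlap of plausible next tokens between teacher and student, and it tends to amplify degenerate patterns such as repetition loops.
Result: Across competitive benchmarks, OmniOPD improves on standard OPD by up to +28.64% on math tasks. When paired with stronger black-box teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash, it achieves an additional +9.54% relative improvement on math over its open-weight teacher counterpart, taking the student past the performance of self-exploratory reinforcement learning.
Insight: The core innovation is replacing deterministic token-level logit matching with chunk-level semantic verification, which provides a more reliable learning signal. The peak-entropy scheduler, the Dirichlet-Multinomial Bayesian prior, and the base-model KL anchor together control the variance of discrete sampling and prevent policy collapse, making black-box teachers usable in practice.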
Abstract: On-Policy Distillation (OPD) trains a student model on its own generative trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of Supervised Fine-Tuning (SFT) and the sparse credit assignment of Reinforcement Learning (RL). However, standard OPD faces two coupled limitations. First, it requires direct access to the teacher’s token-level logits, excluding a broad class of capable proprietary models from serving as teachers. Second, the token-level logit signal itself is brittle, depending on a narrow overlap of plausible next tokens between teacher and student, and prone to amplifying degenerate patterns such as repetition loops. In this paper, we introduce OmniOPD, a novel framework that addresses both limitations through a logit-free, chunk-level supervision signal. OmniOPD replaces deterministic logit matching with Monte Carlo rollouts that approximate the teacher’s local preferences through a continuous semantic similarity metric over multi-token chunks, and concentrates this supervision via a peak-entropy scheduler that audits the student only at its high-uncertainty reasoning forks. A Dirichlet-Multinomial Bayesian prior and a base-model KL anchor further bound the variance of discrete sampling and prevent policy collapse across unaudited tokens. Across competitive benchmarks, OmniOPD surpasses the standard OPD approach by up to +28.64% on math, confirming that chunk-level semantic verification extracts a more reliable learning signal than token-level logit matching, whose high information density is offset by significant noise and brittleness. Furthermore, when paired with stronger black-box teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash, OmniOPD achieves an additional +9.54% relative on math over its open-weight teacher counterpart, advancing the student past the performance of self-exploratory RL.
[283] HMPO: Hybrid Median-length Policy Optimization for Chain-of-Thought Compression cs.LG | cs.CL
Minghui Zheng, Hongxu Chen, Huimin Ren, Hongsheng Xin, Xiaoyang Qu
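TL;DR: This paper proposes HMPO (Hybrid Median-length Policy Optimization), a single-stage reinforcement learning framework for efficiently compressing chain-of-thought reasoning in large language models. An adaptive median budget, a cosine-decay token reward, and a multiplicative reward formulation work together, removing the need to hand-tune length budgets and mitigating reward hacking. Experiments show that HMPO achieves substantial token compression (19% to 46%) across tasks with negligible accuracy loss while greatly reducing training cost.
Details
Motivation: Existing chain-of-thought compression methods suffer from inflexible manual length budgets, computationally expensive multi-stage training pipelines, and poor scalability (they are restricted to small models), which leaves inference overhead high. The paper aims to address these problems with a more efficient, lower-cost compression scheme.
Result: Large-scale experiments on dense and MoE models from 9B to 122B parameters show that HMPO achieves 19% to 46% token compression on math, code, science, and instruction-following tasks with negligible accuracy degradation, while greatly reducing training cost relative to existing multi-stage baselines.
Insight: The contributions are: 1) an adaptive median budget derived from successful rollouts, which eliminates manual tuning; 2) a cosine-decay token reward that penalizes length smoothly; 3) a multiplicative reward formulation that strictly prioritizes answer correctness and thereby substantially mitigates trivial reward hacking. The single-stage design and the strong generalization (training on math data alone transfers to multiple domains) are also worth borrowing.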
Abstract: Large language models achieve remarkable performance via extended chain-of-thought (CoT) reasoning, yet this lengthy process incurs substantial inference overhead. Existing CoT compression methods struggle with inflexible manual length budgets, computationally expensive multi-stage training pipelines, and fragile scalability restricted to small models. We propose HMPO (Hybrid Median-length Policy Optimization), a cost-effective, single-stage reinforcement learning framework. HMPO efficiently compresses CoT via three synergistic components: an adaptive median-based budget derived from successful rollouts to eliminate manual tuning, a cosine-decay token reward for smooth length penalization, and a multiplicative reward formulation that substantially mitigates trivial reward hacking by strictly prioritizing answer correctness. Trained exclusively on mathematical data, HMPO generalizes seamlessly across math, code, science, and instruction-following tasks. Extensive experiments scaling from 9B to 122B parameters across dense and Mixture-of-Experts (MoE) architectures demonstrate that HMPO achieves 19%–46% token compression with negligible accuracy degradation, all while drastically reducing training costs compared to existing multi-stage baselines.
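The three components named in the abstract can be sketched as follows. The abstract gives no formulas, so the exact shape of the cosine decay, the `max_len` cap, and the way the terms are multiplied are assumptions for illustration, not HMPO's published reward.

```python
import math
from statistics import median
from typing import List, Optional


def median_budget(lengths: List[int], correct: List[bool]) -> Optional[float]:
    """Adaptive budget: median token length over the successful rollouts only."""
    ok = [n for n, c in zip(lengths, correct) if c]
    return median(ok) if ok else None


def cosine_length_reward(length: int, budget: float, max_len: int) -> float:
    """Assumed decay: 1.0 up to the budget, then a cosine fall-off to 0.0 at max_len."""
    if length <= budget:
        return 1.0
    if length >= max_len:
        return 0.0
    frac = (length - budget) / (max_len - budget)
    return 0.5 * (1.0 + math.cos(math.pi * frac))


def rollout_reward(length: int, is_correct: bool,
                   budget: Optional[float], max_len: int) -> float:
    """Multiplicative form: a wrong answer earns 0 however short it is."""
    if not is_correct or budget is None:
        return 0.0
    return 1.0 * cosine_length_reward(length, budget, max_len)


lengths = [100, 300, 200, 50]
correct = [True, True, True, False]
budget = median_budget(lengths, correct)  # median of the three correct rollouts
rewards = [rollout_reward(n, c, budget, max_len=400)
           for n, c in zip(lengths, correct)]
```

The shortest rollout (50 tokens) is incorrect and receives zero reward, which is the property the multiplicative form is meant to guarantee: brevity alone cannot be exploited.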
[284] OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents cs.LG | cs.AI | cs.CL | cs.CV
Rui Yang, Qianhui Wu, Yuxi Chen, Hao Bai, Wenlin Yao
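TL;DR: This paper presents OpenWebRL, an open-source framework for training visual web agents with online multi-turn reinforcement learning on real websites. The framework covers the full training pipeline and is used to train OpenWebRL-4B, which sets a new open-source SOTA on several live-web benchmarks while needing only a small amount of initialization and training data.
Details
Motivation: The strongest visual web agent systems are mostly closed-source, while open agents rely heavily on supervised post-training over large collections of human-annotated web trajectories, which creates a scalability bottleneck. Online reinforcement learning has shown promise for text-based agents, but its use for visual web agents remains underexplored.
Result: OpenWebRL-4B achieves success rates of 67.0% on Online-Mind2Web and 64.0% on DeepShop, outperforming prior open agents of similar or larger scale and performing on par with closed systems such as OpenAI CUA and Gemini CUA.
Insight: The main contribution is a complete, reproducible online multi-turn reinforcement learning framework for training visual agents directly on dynamic real websites, which offers a practical way around the data bottleneck of supervised learning. The study also systematically analyzes the key design choices that make online RL effective for visual web agents and examines how RL improves agentic reasoning.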
Abstract: Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open agents still depend heavily on supervised post-training over large collections of curated web trajectories. This dependence creates a major scalability bottleneck: high-quality demonstrations are expensive to collect, and static datasets offer limited coverage of the diverse, ever-changing open web. Although online RL has shown promise for text-based agents, its potential for training visual web agents directly on live websites remains largely underexplored. In this paper, we introduce OpenWebRL, an open framework for training visual web agents with online multi-turn RL on real websites. OpenWebRL covers the full training pipeline, including scalable live-browser infrastructure, supervised initialization, multimodal context management, trajectory-level success judging, and efficient multi-turn policy optimization. Using this framework, we train OpenWebRL-4B, which establishes a new open-source state of the art on challenging live-web benchmarks. With only 0.4K initialization trajectories and 2.2K open-ended RL training tasks, OpenWebRL-4B achieves 67.0% success on Online-Mind2Web and 64.0% on DeepShop, outperforming prior open agents of similar or larger scale and remaining competitive with proprietary systems including OpenAI CUA and Gemini CUA. Beyond strong benchmark performance, we systematically study the key design choices that make online RL effective for visual web agents, and analyze how RL improves agentic reasoning. Overall, our work offers a practical path toward building more capable, reproducible, and cost-efficient open web agents. We will release our training data, models, and code to support future research.
[285] A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL cs.LG | cs.CL
Lei Yang, Siyu Ding, Deyi Xiong
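TL;DR: This paper proposes a local perturbation theory of cross-domain interference and recovery in multi-domain reinforcement learning (RL). It finds that single-domain RL training produces sparse, small-magnitude parameter edits, and that the update directions of different domains on shared computation routes determine whether they act synergistically or conflict. The theory shows that later-domain training harms earlier domains mainly through a second-order damage term, and that the damage concentrates in a low-dimensional shared conflict subspace. A domain refresh, or a rollback on a sparse proxy conflict coordinate set, can selectively recover the damaged domain while minimizing collateral damage to the others.
Details
Motivation: The paper addresses the performance drop on other domains caused by training on a single domain in multi-domain RL. Existing explanations based on catastrophic forgetting or global gradient conflict are incomplete, because substantial interference still occurs when full-model gradients are nearly orthogonal.
Result: After sequential Code→Math→QA→CW training, a brief Math refresh (Re-Math) recovers Math from 57.66 to 66.04 while largely preserving performance on the other domains, yielding the best average score of 66.39. For the Math-QA pair, a training-free rollback on a sparse proxy conflict coordinate set also partially restores Math performance.
Insight: The novelty is a local perturbation theory that attributes interference to second-order damage in a low-dimensional shared conflict subspace rather than to global gradient conflict. This gives both a theoretical basis and a practical method for selective recovery through targeted refresh or sparse coordinate rollback, with reduced collateral damage to other domains.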
Abstract: Reinforcement learning (RL) post-training improves large language models (LLMs) on individual domains such as mathematical reasoning, code generation, question answering, and creative writing (CW), but training on one domain often degrades performance on others. Existing explanations based on catastrophic forgetting or global gradient conflict are incomplete: substantial interference can occur even when full-model gradients are nearly orthogonal. We show that single-domain RL produces sparse, small-magnitude parameter edits with weak overlap among top-changed neurons, while different domains still share substantial active computation routes on which update directions determine whether they act synergistically or conflict. Guided by this observation, we prove under a local perturbation model of multi-domain RL that later-domain training harms an earlier domain mainly through a second-order damage term, which under the observed sparse route structure concentrates in a low-dimensional shared conflict subspace. Moreover, a short domain refresh contracts the harmful component on this subspace, enabling selective recovery with limited collateral damage. Consistent with the theory, a brief Re-Math refresh after Code $\rightarrow$ Math $\rightarrow$ QA $\rightarrow$ CW recovers Math from 57.66 to 66.04 while largely preserving performance on the other domains, yielding the best average score of 66.39. Beyond refresh, a training-free rollback on a sparse proxy conflict coordinate set for the Math-QA pair partially restores Math, providing direct proxy-level evidence for localized damage. These results provide a localized mechanistic account of interference and recovery in multi-domain RL.
[286] On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters cs.LG | cs.CL
Mind Lab: Song Cao, Vic Cao, Kaijie Chen
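TL;DR: This paper examines how parameter-efficient fine-tuning (PEFT) scales, treating it as a compact substrate for persistent local state (such as preferences, skills, tool habits, and memory-like updates) on top of strong shared foundation models, rather than merely a cheaper alternative to full fine-tuning. The study is organized around three scaling axes: Scale Up (stronger shared priors make small local updates more useful), Scale Down (the smallest reliable adapter size), and Scale Out (many persistent adapted instances coexisting), with the MinT infrastructure as an example of how to manage them.
Details
Motivation: The aim is to move beyond viewing PEFT as only a cheap substitute for full fine-tuning and to explore its broader role as persistent local state that carries instance-specific behavior on a shared foundation model, in support of large-scale deployment of personalized models.
Result: The abstract reports no specific quantitative results or benchmarks. Through theoretical analysis and an infrastructure example (MinT), it argues that PEFT is feasible along the Scale Up, Scale Down, and Scale Out axes and can serve as a compact substrate for building a million personal models of trillion parameters.
Insight: The novelty is reframing PEFT as a substrate for persistent personal models and proposing a systematic research framework built on three scaling axes. Viewing adapters as carriers of local state suggests a new architecture for large-scale personalized AI systems, and the design of the MinT infrastructure is a useful practical reference.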
Abstract: Parameter-efficient fine-tuning (PEFT) is usually treated as a cheaper alternative to full fine-tuning. We study a broader role: small trainable adapters as persistent local state on top of strong shared foundation models. In this framing, the base model provides shared competence while adapters carry instance-specific behavior such as preferences, skills, tool habits, and memory-like updates. We organize the problem around three scaling axes: Scale Up, where stronger shared priors make small local updates more useful; Scale Down, where we study how small adapters can be while remaining reliable; and Scale Out, where many persistent adapted instances coexist. MinT provides one infrastructure example for managing adapter identity, revision, provenance, evaluation, and serving residency. Together, the results suggest that PEFT can be a compact substrate for persistent personal models rather than only a budget substitute for full fine-tuning.
[287] Saliency-Aware Model Merging cs.LG | cs.CV
Jungin Park, Jiyoung Lee, Kwanghoon Sohn
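TL;DR: This paper proposes SA-Merging, a data-free model merging method built on connectivity-based saliency formulations from structural pruning. It defines a saliency score over task vectors relative to a shared base model and introduces merge-aware modulation to mitigate task interference, merging multiple task-specific models iteratively. The method also extends to rank-wise saliency decomposition for LoRAs and is validated on vision and language tasks, narrowing the performance gap between data-free and test-time adaptation methods.
Details
Motivation: Current data-free model merging methods usually rely on simple parameter-level heuristics that ignore inter-layer dependencies and the non-uniform distribution of expertise, so they scale poorly. The paper addresses this by introducing a saliency-aware mechanism for merging models more effectively.
Result: Extensive experiments on vision and language tasks show that the saliency-aware approach is effective and further narrows the performance gap between data-free and test-time adaptation methods.
Insight: The novelty is extending connectivity-based saliency from structural pruning to the data-free model merging setting, defining a saliency score over task vectors, and introducing merge-aware modulation that incorporates agreement across experts to mitigate task interference. The method also supports rank-wise saliency decomposition for LoRAs without compromising their structural integrity, offering a new technical route for model merging.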
Abstract: Model merging aims to consolidate multiple task-specific models fine-tuned on different datasets into a unified architecture that achieves cross-domain proficiency. Current data-free model merging methods often struggle to scale as they rely on simple parameter-level heuristics that ignore inter-layer dependencies and the non-uniform distribution of expertise. This work proposes SA-Merging, which is built upon connectivity-based saliency formulations from structural pruning (e.g., SynFlow) and extends them to the data-free model merging setting. We define a saliency score over task vectors relative to a shared base model, and further introduce merge-aware modulation that incorporates agreement across experts to mitigate task interference. Based on this formulation, an iterative saliency-aware merging procedure progressively removes non-informative updates while preserving end-to-end connectivity. Furthermore, we extend SA-Merging to introduce rank-wise saliency decomposition for LoRAs without compromising their structural integrity. Extensive experiments on vision and language tasks demonstrate the effectiveness of our saliency-based approach, further reducing the gap between data-free and test-time adaptation methods.
cs.MM [Back]
[288] When Jokes Cross the Line: Analyzing Regular Humor and Dark Humor in YouTube Shorts cs.MM | cs.AI | cs.CV | cs.CY
Sydney Johns, Sanjeev Parthasarathy, Shantnu Bhalla, Vaibhav Garg
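TL;DR: This paper introduces TwistedHumor, a hand-annotated dataset of 1,211 YouTube Shorts and their 33,041 comments. Through a multi-view analysis it studies the gray area between humor and harm in short-form video, finding that dark humor often clusters around themes of critique, coping, awkwardness, and identity expression, and that it is associated with more complex audience reactions. The paper also evaluates large language models on the related tasks.
Details
Motivation: The study analyzes content on short-video platforms such as YouTube Shorts that sits in a gray area: it is allowed but may negatively affect some audiences. The focus is the boundary between regular humor and dark humor and its social impact.
Result: Analysis on TwistedHumor shows that regular humor is associated with more positive audience sentiment, while dark humor draws more mixed, neutral, and sometimes toxic reactions. Large language models are evaluated as performing better on stand-up comedy than on shorter jokes.
Insight: The novelty lies in building the first multi-dimensionally annotated dataset on humor and harm in short videos and in using LLooM-based concept induction to reveal the varied thematic structure of dark humor. The study highlights the need for context-aware content moderation and more robust multimodal evaluation.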
Abstract: Video platforms such as YouTube have reshaped how users engage with entertainment and information, emphasizing brief, highly engaging content such as Shorts. Within this ecosystem, certain content occupies a gray area where it remains allowed but may still have unintended negative effects on some audiences. To study this problem, we introduce TwistedHumor, a dataset of 1,211 YouTube Shorts paired with 33,041 related comments, with hand annotations for humor presence, humor type, harm, topic, rhetorical devices, and stand-up context. Beyond dataset creation, we present a multi-view analysis of how humor and harm appear in short-form social media. Using LLooM-based concept induction over video descriptions, we find that dark humor frequently clusters around themes of critique, coping, awkwardness, and identity expression rather than appearing as a single uniform category. We further analyze audience response through linked comments and show that regular humor is associated with more positive sentiment, while dark humor receives more mixed, neutral, and sometimes more toxic reactions. Finally, we evaluate large language models against human annotations and find that they perform better on stand-up comedy than on shorter jokes. Together, these results position TwistedHumor not only as a new benchmark, but as an empirical study of the gray area between humor and harm in short-form video, highlighting the need for context-aware moderation and more robust multimodal evaluation.
cs.AI [Back]
[289] Pramana: Fine-Tuning Large Language Models for Epistemic Reasoning through Navya-Nyaya cs.AI | cs.CL
Sharath Sathish
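TL;DR: The paper proposes Pramana, an approach that strengthens the epistemic reasoning of large language models by fine-tuning them on Navya-Nyāya, an ancient Indian system of logic. The model is taught to follow a six-phase structured reasoning process comprising doubt analysis, evidence source identification, five-member syllogistic deduction, counterfactual verification, fallacy detection, and final ascertainment, in order to address the hallucination and weak evidence tracing that LLMs often show in systematic reasoning.
Details
Motivation: Large language models generate fluent text but are weak at systematic reasoning. They tend to make confident but unfounded claims, and their reasoning often rests on brittle pattern matching rather than solid logic. Performance drops sharply when irrelevant context is added, exposing an epistemic gap that limits reliability in domains requiring traceable evidence and justification.
Result: Llama 3.2-3B and DeepSeek-R1-Distill-Llama-8B are fine-tuned on 55 Nyāya-structured logical problems (such as constraint satisfaction, Boolean satisfiability, and multi-step deduction). The Stage 1 model achieves 100% semantic correctness on the held-out evaluation set even though its strict format adherence is only 40%, indicating that the model internalizes the reasoning content. Ablation studies show that format prompting and temperature strongly affect performance and that the optimal configuration differs by stage.
Insight: The core innovation is systematically integrating the 2,500-year-old Navya-Nyāya logical framework into LLM fine-tuning, giving the model cognitive scaffolding that standard reasoning approaches lack. This is more than a new structured reasoning template: it combines logic with epistemology (the theory of knowledge), teaching the model a complete epistemological methodology for distinguishing knowledge from hypothesis, identifying evidence sources, and detecting fallacies. Viewed objectively, pairing an ancient formal logic system with modern AI models to address their fundamental reasoning deficits is a novel and promising research direction.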
Abstract: Large language models produce fluent text but struggle with systematic reasoning, often hallucinating confident but unfounded claims. When Apple researchers added irrelevant context to mathematical problems, LLM performance degraded by 65% (Apple Machine Learning Research), exposing brittle pattern-matching beneath apparent reasoning. This epistemic gap, the inability to ground claims in traceable evidence, limits AI reliability in domains requiring justification. We introduce Pramana, a novel approach that teaches LLMs explicit epistemological methodology by fine-tuning on Navya-Nyaya logic, a 2,500-year-old Indian reasoning framework. Unlike generic chain-of-thought prompting, Navya-Nyaya enforces structured 6-phase reasoning: SAMSHAYA (doubt analysis), PRAMANA (evidence source identification), PANCHA AVAYAVA (5-member syllogism with universal rules), TARKA (counterfactual verification), HETVABHASA (fallacy detection), and NIRNAYA (ascertainment distinguishing knowledge from hypothesis). This integration of logic and epistemology provides cognitive scaffolding absent from standard reasoning approaches. We fine-tune Llama 3.2-3B and DeepSeek-R1-Distill-Llama-8B on 55 Nyaya-structured logical problems (constraint satisfaction, Boolean SAT, multi-step deduction). Stage 1 achieves 100% semantic correctness on held-out evaluation despite only 40% strict format adherence, revealing that models internalize reasoning content even when structural enforcement is imperfect. Ablation studies show format prompting and temperature critically affect performance, with optimal configurations differing by stage. We release all models, datasets, and training infrastructure on Hugging Face to enable further research on epistemic frameworks for AI reasoning.
[290] MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution cs.AI | cs.CL | cs.MA
Aliaksei Korshuk, Alexander Buyantuev, Ilya Makarov
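TL;DR: This paper proposes a method for training language model agents for multi-agent strategic interaction. Delayed per-step reward attribution with eligibility gating resolves the problem of rewards being entangled across time and agents. Combined with asynchronous rollout generation, curriculum-based opponent sampling, and multi-level stratified batch construction, the method performs strongly on the MindGames Arena benchmark.
Details
Motivation: The core difficulty in training language model agents for multi-agent strategic interaction is that the quality of any action may depend on future events that never materialize, on moves that violate the rules, or on other players' decisions. Standard reinforcement learning assumes a reward can be assigned at every step, which fails when outcomes are entangled across time and agents.
Result: On the MindGames Arena benchmark at NeurIPS 2025, a single 8-billion-parameter open-source model trained with this method matched or surpassed substantially larger proprietary systems, including GPT-5, in head-to-head play, and took first place in both the Open (unrestricted) and Efficient (≤8B parameters) tracks.
Insight: The contributions include an episode lifecycle and postprocessing pipeline for delayed per-step reward attribution with eligibility gating: rewards are computed only at episode end and propagated back to the originating steps according to task-specific semantics, and steps that lack valid dependent information are excluded from training. In addition, asynchronous rollout generation via vLLM's continuous batching, curriculum-based opponent sampling, and multi-level stratified batch construction together enable stable, sample-efficient RL training in multi-agent environments.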
Abstract: Training language model agents for multi-agent strategic interaction presents a core difficulty: the quality of any action may depend on future events that never materialize, on moves that violate game rules, or on decisions made by other players. Standard reinforcement learning assumes that rewards can be assigned at each step, but this assumption fails in settings where outcomes are entangled across time and agents. We introduce delayed per-step reward attribution with eligibility gating, an episode lifecycle and postprocessing pipeline that computes rewards only at episode end, propagates them back to originating steps according to task-specific semantics, and excludes steps that lack valid dependent information from training. Together with asynchronous rollout generation via vLLM’s continuous batching, curriculum-based opponent sampling, and multi-level stratified batch construction, this approach enables stable, sample-efficient RL training in multi-agent environments. We evaluate on the MindGames Arena benchmark at NeurIPS 2025, where a single 8-billion-parameter open-source model trained with our method matched or surpassed substantially larger proprietary systems, including GPT-5, in head-to-head play and took first place in both the Open (unrestricted) and Efficient (<=8B parameters) tracks.
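The episode lifecycle described in this abstract can be sketched in a few lines. The reward rule (win or loss credit per player) and the use of a `valid` flag as the eligibility criterion are simplifying assumptions; the abstract only says that propagation follows task-specific semantics and that steps lacking valid dependent information are excluded.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    player: int
    action: str
    valid: bool                      # False if the move violated the game rules
    reward: Optional[float] = None   # left unset until the episode has ended
    eligible: bool = False           # True once the step may be used for training


def finalize_episode(steps: List[Step], winner: int) -> List[Step]:
    """Assign rewards only after the episode ends, then gate out ineligible steps.

    Hypothetical semantics: every valid step by the winner gets +1.0,
    every valid step by another player gets -1.0, invalid steps are dropped.
    """
    for step in steps:
        if not step.valid:
            continue  # eligibility gate: no reward, excluded from the batch
        step.reward = 1.0 if step.player == winner else -1.0
        step.eligible = True
    return [s for s in steps if s.eligible]


episode = [
    Step(player=0, action="open", valid=True),
    Step(player=1, action="illegal-move", valid=False),
    Step(player=1, action="reply", valid=True),
]
train_steps = finalize_episode(episode, winner=0)
```

Nothing is rewarded while the game is still running; the training batch is built from `train_steps` once the outcome is known.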
[291] The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary cs.AI | cs.CL | cs.LG
Dongxin Guo, Jikun Wu, Siu Ming Yiu
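TL;DR: The paper finds that extended chain-of-thought reasoning degrades model performance on deterministic state-tracking tasks, and that the root cause is not preference bias but the information-theoretic capacity limit of decoder-only attention. The authors introduce an Attention Bottleneck Theorem and a State-Space Jaccard metric, and identify a Deterministic Horizon (roughly 19 to 31 steps) beyond which tool delegation is required. Experiments show that tool-integrated reasoning clearly outperforms purely neural chain-of-thought reasoning across multiple tasks.
Details
Motivation: The paper investigates why extended chain-of-thought reasoning fails on deterministic state-tracking tasks and shows that the root cause is an information-processing limit inherent to the Transformer decoder architecture rather than a training preference problem.
Result: In experiments covering 12 models and 8 task domains (including SWE-Bench, WebArena, and SQL-Multi), tool-integrated reasoning reaches 86-94% accuracy versus only 24-42% for purely neural chain-of-thought reasoning. Fine-tuning yields less than 5% improvement, and cross-model correlation is high (r = 0.81-0.91), confirming that the bottleneck lies in the architecture itself.
Insight: The core innovation is a formal, information-theoretic proof of an upper bound on the state-tracking ability of decoder-only Transformers (the Attention Bottleneck Theorem), together with a quantifiable Deterministic Horizon that serves as the decision criterion for switching to hybrid agents (tool delegation). This gives principled guidance for choosing between pure neural reasoning and hybrid approaches when building agentic systems.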
Abstract: Extended chain-of-thought reasoning can degrade performance on deterministic state-tracking tasks, not because of preference biases but because of limits rooted in the information-theoretic capacity of decoder-only attention. We establish: (1) an Attention Bottleneck Theorem with a complementary achievability construction, bounding state-tracking capacity as $O(H \cdot \log(L/H) \cdot \sqrt{d_h})$; (2) a context-dependent error model yielding super-exponential accuracy decay; (3) the State-Space Jaccard metric distinguishing capability from preference failures; (4) a Deterministic Horizon $d^* \in [19, 31]$ beyond which tool delegation becomes necessary. Across 12 models and 8 task domains (including SWE-Bench, WebArena, and SQL-Multi), tool-integrated reasoning consistently outperforms neural chain-of-thought; on the primary model suite it reaches 86-94% accuracy versus 24-42% for neural chain-of-thought. Fine-tuning on optimal-length traces yields less than 5% improvement, confirming an architectural ceiling, and high cross-model correlation ($r = 0.81$-$0.91$) indicates these failures are architectural rather than training-specific. Our results provide principled guidance for when pure neural reasoning should yield to hybrid approaches in agentic systems.
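As a rough illustration of how the capacity expression and the horizon might be used in practice, the sketch below evaluates the bound's scaling term and applies a simple routing rule. The abstract does not define $H$, $L$, and $d_h$, so reading them as head count, context length, and per-head dimension is an assumption, as are the example values and the choice of 19 (the conservative lower end of the reported interval) as the cutoff. The big-O constant is unknown, so only relative comparisons of the scaling term are meaningful.

```python
import math


def capacity_scale(num_heads: int, context_len: int, head_dim: int) -> float:
    """Scaling term H * log(L / H) * sqrt(d_h) from the stated bound.

    The hidden constant of the O(.) bound is omitted, so this is not an
    absolute capacity, only a quantity for comparing configurations.
    """
    return num_heads * math.log(context_len / num_heads) * math.sqrt(head_dim)


def should_delegate(num_state_updates: int, horizon: int = 19) -> bool:
    """Hypothetical router: hand the task to a tool once the number of
    sequential state updates exceeds the assumed horizon."""
    return num_state_updates > horizon


small = capacity_scale(num_heads=16, context_len=4096, head_dim=64)
large = capacity_scale(num_heads=32, context_len=4096, head_dim=128)
route_long = should_delegate(40)
route_short = should_delegate(10)
```

Under these assumed configurations the term grows with more heads and wider heads but only logarithmically with context length, which fits the abstract's claim that longer reasoning traces alone do not lift the ceiling.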
[292] AXIOM: A Trust-First Neuro-Symbolic Execution Architecture for Verifiable Mathematical Reasoning cs.AI | cs.CL | cs.LG
Alessio Bruno
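TL;DR: AXIOM is a trust-first neuro-symbolic execution architecture for natural-language mathematical reasoning. It uses the language model strictly as a canonicalizer that rewrites informal problem text into a narrow schema consumed by a deterministic Computer-Algebra-System (CAS) pipeline, which derives and verifies the answer or abstains as a first-class output.
Details
Motivation: The paper addresses trust and verifiability in natural-language mathematical reasoning, using a neuro-symbolic approach to ensure correct answers and zero confident-wrong errors and to avoid hallucinated or unreliable language model output.
Result: On 4 MATH categories, cumulative correctness reaches 94.36% (2,592/2,747) with 100% trust on the parseable portion (zero confident-wrong answers across the full 2,747-record benchmark). All four domains exceed the per-domain 70/90/70 floor, the median latency of rule-only handlers is 1 ms, and the system has served about 30,000 queries in production.
Insight: The novelty is restricting the language model to the canonicalizer role and pairing it with a deterministic CAS pipeline for verifiable reasoning. A 1:1:1 alignment between problem-shape regex, schema-specific prompt, and closed-form CAS handler, together with the LOST_CORRECT scan used as a regression test, yields an extensible, regression-free trust-first framework that applies to trustworthy neuro-symbolic systems beyond mathematics.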
Abstract: We present AXIOM, a trust-first neuro-symbolic execution architecture for natural-language mathematical reasoning. In AXIOM, the language model functions strictly as a canonicalizer: it rewrites informal problem text into a narrow schema consumed by a deterministic Computer-Algebra-System (CAS) pipeline, which derives and verifies the answer or abstains as a first-class output. Routing follows a 1:1:1 alignment between problem-shape regex, schema-specific prompt, and closed-form CAS handler, with 3,100+ such routes shipped and zero LOST_CORRECT regressions across 250+ consecutive ship commits. We report empirical results on 4 MATH categories with a cumulative correctness of 94.36% (2,592/2,747) at 100.00% trust on parseable (zero confident-wrong answers across the full 2,747-record benchmark), all four domains above the per-domain 70/90/70 floor with per-domain trust at 100.0%, and median latency of 1 ms on rule-only handlers (88% of records on the lm-eval arithmetic 20,000-record benchmark). The architecture has served ~30,000 production queries through a public deployment. The contribution we emphasize is not a final accuracy figure but the forward dynamic the architecture establishes: every logged abstain in production is a candidate correct after one ship cycle, since new tasks compose without regressing the registry. The operational discipline behind this property – math-template bucketing, LOST_CORRECT scan as regression oracle, parseable-first onboarding, and abstain as first-class output – constitutes a transferable framework for trustworthy neuro-symbolic systems beyond mathematics.
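The routing idea (one problem-shape regex, one schema-specific prompt, one closed-form handler, with abstention as a normal output) can be sketched as below. This is not AXIOM's code: the `Route` type, the single linear-equation route, and the prompt text are hypothetical, the language-model canonicalization step is skipped by passing already canonical text, and exact rational arithmetic from the standard library stands in for a CAS.

```python
import re
from dataclasses import dataclass
from fractions import Fraction
from typing import Callable, List, Optional

ABSTAIN = None  # abstention is returned like any other result


@dataclass
class Route:
    pattern: "re.Pattern[str]"                          # problem-shape regex
    prompt: str                                         # schema-specific prompt
    handler: Callable[["re.Match[str]"], Optional[str]]  # closed-form handler


def solve_linear(m: "re.Match[str]") -> Optional[str]:
    """Solve a*x + b = c exactly, then verify the answer by substitution."""
    a, b, c = (Fraction(m.group(k)) for k in ("a", "b", "c"))
    if a == 0:
        return ABSTAIN
    x = (c - b) / a
    return str(x) if a * x + b == c else ABSTAIN


ROUTES: List[Route] = [
    Route(
        pattern=re.compile(r"^solve (?P<a>-?\d+)x \+ (?P<b>-?\d+) = (?P<c>-?\d+)$"),
        prompt="Rewrite the problem as: solve <a>x + <b> = <c>",  # illustrative
        handler=solve_linear,
    ),
]


def answer(canonical_text: str) -> Optional[str]:
    """Dispatch to the first matching route; abstain when nothing matches."""
    for route in ROUTES:
        match = route.pattern.match(canonical_text)
        if match:
            return route.handler(match)
    return ABSTAIN


result_int = answer("solve 2x + 3 = 11")
result_frac = answer("solve 3x + 1 = 2")
result_none = answer("integrate sin x")
```

An unmatched problem yields `None` rather than a guess, which mirrors the abstract's point that a logged abstain can later be covered by adding a new route without touching existing ones.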
[293] Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic Artificial Intelligence cs.AI | cond-mat.mtrl-sci | cs.CL | cs.LG | math.CT
Fiona Y. Wang, Markus J. Buehler
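TL;DR: This paper proposes a category-theoretic framework for autonomous scientific discovery systems, modeling discovery as the revision of a representational regime. The framework formalizes agentic discovery in materials science, separates the operations of retrieval, search, and discovery, and is instantiated in two systems that demonstrate its use.
Details
Motivation: Conventional scientific discovery systems are confined to a fixed representational regime. The paper aims to build discovery systems that can revise their own knowledge representation, formalizing scientific discovery as regime transitions between categories.
Result: In the Builder/Breaker system, a protein-mechanics world model is revised under a Minimum Description Length (MDL) gate, yielding a mode-conditioned compliance law. In the CategoryScienceClaw system, a proof-carrying knowledge-computation graph is built, and a fiber-network case demonstrates the full workflow of candidate model screening, an AIC gate, and perturbation tests.
Insight: The novelty is using category theory both as the mathematical language of discovery and as an engineering specification for self-revising AI systems. Left Kan extension transports evidence across representational regimes and identifies residual content, giving a formal framework for verifiable autonomous scientific discovery.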
Abstract: Scientific discovery is not only answer generation but revision of the representational regime in which evidence, artifacts, operations, and verifiers are typed. We develop a category-theoretic account of agentic discovery for materials science. In a fixed regime b with schema category S_b, the system state is a copresheaf I_t: S_b -> Set, and provenance is the category of elements \int_{S_b} I_t. Fixed-regime operation is an update on such states, endofunctorial only when provenance-preserving refinements are specified and preserved. Discovery is instead a verified regime transition u: S_b -> S_b’: old artifacts are preserved, transported by the left Kan extension Lan_u I_t, and compared with the post-transition state to identify residual content beyond functorial transport. This separates retrieval, search, and discovery without subjective novelty. We instantiate the framework in two systems. In Builder/Breaker, a protein-mechanics world model is revised under a Minimum Description Length gate; the accepted law expresses within-chain flexibility as all-mode elastic compliance conditioned by slow collective-mode participation, or mode-conditioned compliance. In CategoryScienceClaw, typed skills, artifacts, open needs, workflow mutation, gates, stress tests, and public discourse become a proof-carrying knowledge-computation graph. A fiber-network example records candidate models, rejected alternatives, an AIC gate, perturbation tests, and an accepted orientation-tensor anisotropic stiffness surrogate over an isotropic fiber-count descriptor. Together, the cases show how category theory can be both a mathematical language for discovery and an engineering specification for self-revising AI discovery systems.
[294] An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models cs.AI | cs.CL | cs.LG
Mingzhong Sun, Teresa Yeo, Armando Solar-Lezama, Tan Zhi-Xuan
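TL;DR: This paper studies the gap between how well large reasoning models (LRMs) evaluate reasoning and how well they produce it. Although LRMs excel at generating complex reasoning chains, their performance drops markedly when they evaluate solutions with subtle reasoning flaws, revealing an answer confirmation bias.
Details
Motivation: Human reasoning is typically stronger at evaluation than at production, whereas current LRMs are trained mainly to produce reasoning. The paper therefore examines how LRMs perform at evaluating reasoning, in order to expose potential limitations of current training methods.
Result: On the VAIR dataset (math problems whose answers are valid but whose reasoning is flawed), frontier LRMs score as low as 48% evaluation accuracy, whereas humans are only 6% worse at grading than at solving, showing a substantial production-evaluation gap.
Insight: The novelty is isolating reasoning evaluation from reasoning production with the VAIR dataset and using chain-of-thought analysis and causal patching to reveal an answer confirmation bias: LRMs tend to judge by the validity of the answer rather than by verifying each reasoning step. This points to a flaw in current reasoning training, which provides no incentive for robust evaluation.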
Abstract: Studies of human reasoning have shown that people are typically stronger at evaluating reasoning than producing it from scratch. In contrast, large reasoning models (LRMs) are trained to excel at producing long chains of reasoning to solve complex problems. How then do LRMs perform at evaluating reasons? We investigate this with the Valid-Answer-Invalid-Reasoning (VAIR) dataset: math problems and solutions with trivial reasoning flaws but valid answers, designed to isolate reasoning evaluation from the confound of reasoning production. Unlike humans, who we find are only 6% worse at grading than solving such problems, we find a substantial production-evaluation gap in LRMs: frontier models score as low as 48% when evaluating VAIR solutions, despite near-perfect solution production. Why this enigma? Through chain-of-thought (CoT) analysis, we find evidence of an answer confirmation bias: LRMs often produce then check for the correct answer instead of carefully verifying each step, fabricating rationalizations even when noticing anomalous reasoning. Linear probes corroborate this, showing that while LRM activations encode some representation of valid reasoning, they fail to robustly represent VAIR solutions as invalid. Causal patching of the final answer’s representations causes LRM verdicts and activations to flip, demonstrating that answer validity is responsible for models’ confirmation biases. These findings indicate an outstanding limitation in dominant approaches to reasoning training, which incentivize LRMs to produce and confirm reasoning towards correct answers, but not to robustly evaluate the underlying reasons.
[295] TriAlign: Towards Universal Truth Consistency in Personalized LLM Alignment cs.AI | cs.CL
Thi-Nhung Nguyen, Linhao Luo, Rollin Omari, Junae Kim, Thuy-Trang Vu
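TL;DR: This paper proposes TriAlign, a framework that addresses inconsistency of universal truths across social groups in personalized large language models. It uses offline multi-agent reinforcement learning, models each social group as an agent, and jointly optimizes universal truth accuracy, cross-group truth consistency, and personalization through a fairness-aware objective and an explicit inconsistency penalty.
Details
Motivation: Existing alignment methods for personalized LLMs either ignore personalization or focus mainly on subjective preference alignment. They largely overlook fairness and consistency of universal-truth answers on objective tasks across social groups, so some groups systematically receive less accurate responses.
Result: Experiments on multiple benchmarks show that TriAlign strikes a better balance among universal truth accuracy, cross-group truth consistency, and personalization than strong existing baselines, reducing universal truth disparities across social groups while improving both objective task performance and personalization quality.
Insight: The novelty is formalizing and studying the truth-invariant alignment problem for the first time and proposing the first offline multi-agent reinforcement learning framework for it. Modeling social groups as agents and adding fairness-aware optimization with an explicit inconsistency penalty enables joint optimization of universal truth consistency, accuracy, and personalization.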
Abstract: Personalized large language models adapt responses to users’ preferences and social attributes, but can introduce substantial universal truth inconsistencies across social groups, where some groups systematically receive less accurate responses on objective tasks. Existing alignment methods either ignore personalization or mainly focus on subjective preference alignment, largely overlooking fairness and consistency in universal truths. To address this gap, we study Truth-Invariant Alignment (TIA), an alignment problem for personalized LLMs that aims to ensure universal truths remain consistent across social groups while preserving personalization. We propose TriAlign, the first offline multi-agent reinforcement learning (MARL) framework for TIA, where each social group is modeled as an interacting agent. TriAlign jointly optimizes universal truth accuracy, cross-group truth consistency, and personalization through a fairness-aware objective and an explicit inconsistency penalty. Experiments across diverse benchmarks demonstrate that TriAlign achieves a stronger balance among these three objectives than strong baselines, reducing universal truth disparities across social groups while improving both objective task performance and personalization quality.
[296] SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning cs.AI | cs.CL | cs.CY
Lichao Wang, Zhaoxing Ren, Tianzhuo Yang, Jiaming Ji, Chi Harold Liu
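TL;DR: SafeMCP is a server-side defense plugin for LLM agents that constrains tool acquisition through look-ahead reasoning, preventing the safety risks that arise when agents expand their capabilities in complex environments. It uses an internal world model to implement a two-tier defense, proactive tool filtering and immediate intervention, and is trained with a three-stage pipeline of environment dynamics modeling, safe policy initialization, and reinforcement learning.
Details
Motivation: As LLM agents use the Model Context Protocol (MCP) to operate in complex environments, their expanding action space brings unsafe capabilities and power-seeking risk. Minor errors or hallucinations can be magnified into catastrophic failures, so a defense mechanism is needed that balances task completion against safety.
Result: Experiments on PowerSeeking Bench, ToolEmu, and AgentHarm show that SafeMCP achieves a safe equilibrium, effectively mitigating risk while preserving agent utility.
Insight: The novelty is combining look-ahead safety reasoning with environment dynamics modeling. A server-side two-tier defense of proactive tool filtering and immediate intervention, trained by reinforcement learning with dual verifiable rewards, dynamically constrains the agent's power expansion in open environments.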
Abstract: As Large Language Model (LLM) agents increasingly leverage the Model Context Protocol (MCP) to operate in complex environments, the expansion of their action spaces offers agents unsafe capabilities and underscores the risk of power-seeking. While broad action space and greater environment influence are essential for task fulfillment, they create a fragile risk surface where minor errors or hallucinations are magnified into catastrophic failures. In response, we propose SafeMCP, a server-side defense plugin that constrains tool acquisition via predictive reasoning regarding future safety risks. SafeMCP utilizes an internal world model for look-ahead reasoning to implement a two-tier defense: proactive tool filtering to constrain hazardous power expansion and immediate intervention as a fail-safe. To train SafeMCP, we introduce a three-stage pipeline comprising environmental dynamic grounding, safe policy initialization, and reinforcement learning (RL) with dual verifiable rewards. Experiments on PowerSeeking Bench, ToolEmu, and AgentHarm show that SafeMCP achieves a safe equilibrium, effectively mitigating risks while preserving agent utility.
[297] Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses cs.AI | cs.CL | cs.IR
Pengcheng Jiang, Zhiyi Shi, Kelly Hong, Xueqiang Xu, Jiashuo Sun
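TL;DR: This paper presents Harness-1, a 20B-parameter retrieval subagent trained with reinforcement learning inside a stateful search harness. The harness moves search-state management (candidate pool, evidence links, verification records, and so on) out of the policy and into the environment, letting the policy focus on semantic search decisions. Harness-1 performs strongly across eight retrieval benchmarks spanning web, finance, patents, and multi-hop QA.
Details
Motivation: Existing search agents typically couple state management and semantic decisions in a single policy, so reinforcement learning must optimize both search decisions and recoverable bookkeeping, which makes learning harder and less efficient. The paper introduces a state-externalizing harness that hands routine state management to the environment to simplify the policy's learning objective.
Result: Across eight retrieval benchmarks (web, finance, patents, multi-hop QA), Harness-1 achieves 0.730 average curated recall, 11.4 points above the next strongest open search subagent, and remains competitive with much larger frontier-model searchers. Gains are especially strong on held-out transfer benchmarks, indicating retrieval behavior that generalizes well.
Insight: The core idea is a state-externalizing search harness that decouples environment-side state management (working memory) from the policy's semantic decisions, so reinforcement learning can concentrate on the search behavior itself. This design may improve learning efficiency and generalization and suggests a new direction for building more efficient retrieval systems.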
Abstract: Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have actually been checked. We argue that this formulation puts too much routine state management inside the policy: reinforcement learning is forced to optimize both semantic search decisions and recoverable bookkeeping that the environment can maintain more reliably. We introduce Harness-1, a 20B search agent (retrieval subagent) trained with reinforcement learning inside a stateful search harness. The harness maintains environment-side working memory, including a candidate pool, an importance-tagged curated set, compact evidence links, verification records, compressed and deduplicated observations, and budget-aware context rendering. The policy retains the semantic decisions: what to search, which documents to keep or discard, what to verify, and when to stop. Across eight retrieval benchmarks spanning web, finance, patents, and multi-hop QA, Harness-1 achieves 0.730 average curated recall, outperforming the next strongest open search subagent by +11.4 points and remaining competitive with much larger frontier-model searchers. Its gains are especially strong on held-out transfer benchmarks, suggesting that reinforcement learning over explicit search state can produce retrieval behaviors that generalize beyond the training domains. Our code is available at https://github.com/pat-jj/harness-1.
[298] HLL: Can Agents Cross Humanity’s Last Line of Verification? cs.AI | cs.CL | cs.CV | cs.LG | cs.MM
Xinhao Song, Su Su, Sirui Song, Hongliang Wu, Wen Shen
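TL;DR: This paper introduces HLL (Humanity’s Last Line of Verification), a benchmark that evaluates whether multimodal agents can cross the CAPTCHA verification boundary through human-like interaction rather than recognition alone. The benchmark simulates realistic web environments and stress conditions and tests eight frontier multimodal agents. The results show that current agents remain brittle when substituting for humans in protected workflows.
Details
Motivation: As multimodal agents increasingly operate interfaces on behalf of users, a central deployment question is whether they can truly substitute for humans in workflows that services deliberately protect against automation. CAPTCHA verification makes this concrete: it is not just a visual puzzle but a human-verification boundary placed before actions such as account creation and content access.
Result: Eight frontier multimodal agents were evaluated in a closed-loop GUI environment. Current agents are brittle at the human-substitution boundary: performance varies sharply across verification types, degrades under realistic interface conditions, and drops further when correct answers must be supported by valid action traces.
Insight: The contribution is HLL, a controlled benchmark built on interactive CAPTCHA verification with stressors such as cluttered webpages, harder variants, and trace-conditioned validation. It systematically exposes gaps in localization, action calibration, state tracking, and process consistency, providing a concrete testbed for measuring how close agents are to substituting for humans in protected real-world workflows.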
Abstract: Multimodal agents are increasingly expected to operate interfaces on behalf of users, raising a central deployment question: can they truly substitute for humans in workflows that services deliberately protect against automation? CAPTCHA verification makes this question concrete. It is not merely a visual puzzle, but a human-verification boundary placed before account creation, content access, form submission, and other protected actions. We introduce Humanity’s Last Line of Verification (HLL), a controlled benchmark that uses interactive CAPTCHA verification to evaluate whether agents can cross this boundary through grounded, human-like interaction rather than recognition alone. HLL covers diverse CAPTCHA interactions and exposes agents to controlled realism stressors, including cluttered webpages, harder task variants, and trace-conditioned validation of the solving process. We evaluate eight frontier multimodal agents in a closed-loop GUI environment. The results show that current agents remain brittle at this human-substitution boundary: performance varies sharply across verification types, degrades under realistic interface conditions, and drops further when correct answers must be supported by valid action traces. By exposing gaps in localization, action calibration, state tracking, and process consistency, HLL provides a concrete testbed for measuring how close multimodal agents are to acting as human substitutes in protected real-world workflows. Our code is available at https://github.com/XinhaoS0101/HLL.
[299] AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents cs.AI | cs.CL
Yiheng Shu, Bernal Jiménez Gutiérrez, Saisri Padmaja Jonnalagedda, Yuguang Yao, Huan Sun
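TL;DR: This paper proposes AgentCL, an evaluation framework for rigorously assessing continual learning in language agents. It contrasts controlled task streams (in which earlier sub-solutions, evidence, or workflows can be reused in later tasks) with naive task streams (no reuse guarantee), and introduces MemProbe, a diagnostic method for analyzing how memory design affects continual learning. Experiments on coding, deep research, and language understanding/reasoning tasks show that controlled streams distinguish the plasticity of memory designs more clearly, while naive streams have limited discriminative power.
Details
Motivation: Existing benchmarks struggle to evaluate continual learning in language agents rigorously. They usually focus on retrieval and reasoning over long-context conversations or documents, or rely on naive task streams with little analysis of cross-task relationships, making it hard to understand what an agent learns and reuses over time.
Result: Empirical analysis on coding, deep research, and language understanding/reasoning tasks shows that naive streams offer limited ability to distinguish memory designs, whereas controlled streams distinguish their plasticity more clearly. Naive and held-out settings often yield limited gains and can expose memory-induced degradation.
Insight: The contribution is AgentCL, an evaluation framework centered on controlled task streams and transfer-gain metrics, plus MemProbe, a probing method for diagnosing the effect of memory design. The study highlights the need for stronger memory designs that balance plasticity and stable reuse, and offers a more rigorous benchmark and methodology for evaluating continual learning.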
Abstract: Language agents spend substantial inference time solving individual tasks, yet the experience acquired in one episode is often underutilized in future episodes. Continual learning expects an agent to accumulate reusable experience across a stream of tasks, improve over time, and avoid interference from irrelevant experiences. Unfortunately, existing benchmarks struggle to evaluate continual learning in language agents rigorously. Most efforts focus on retrieval and reasoning over long-context conversations or documents, while recent lifelong-adaptation benchmarks often rely on naive task streams with limited analysis of cross-task relationships, making it difficult to understand what an agent learns and reuses over time. This paper presents an evaluation framework AgentCL for continual learning in agents, centered on controlled task streams and metrics for transfer gains. AGENTCL constructs compositional streams where earlier sub-solutions, evidence, or workflows are intentionally reusable in later tasks, and contrasts them with naive streams where such reusability is not guaranteed. We use the benchmark to evaluate non-parametric memory designs for continual learning. To diagnose how memory design choices affect continual learning, we develop MemProbe, a probing method that stores interactions, insights, and skills, while filtering unreliable experiences during consolidation. Empirical analysis across coding, deep research, and language understanding/reasoning tasks shows that naive streams offer limited ability to distinguish memory designs, whereas controlled streams more clearly distinguish their plasticity. Meanwhile, naive and held-out settings often yield limited gains and can expose memory-induced degradation. These results highlight the need for stronger memory designs that balance plasticity and stable reuse.
[300] VESTA: Visual Exploration with Statistical Tool Agents cs.AI | cs.CV | stat.CO
William Rudman, Abhishek Divekar, Kanishk Jain, Sebastian Joseph, Stella S. R. Offner
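TL;DR: The paper proposes VESTA, a framework that equips vision-language models (VLMs) with a dynamically growing toolkit to improve automated fitting of quantitative models in scientific workflows. The framework guides model refinement through data transformations, hypothesis-driven visualizations, and robust statistical tests.
Details
Motivation: Current agent-based systems use language and vision-language models to iteratively propose and refine statistical models, but they perform poorly on more challenging modeling tasks. VESTA aims to address these limitations and strengthen automated modeling.
Result: On DAWN, a benchmark for distribution fitting and time series modeling with varying difficulty tiers and real-world astronomy tasks, VESTA's dynamic tool creation outperforms existing agentic pipelines, with the largest gains on complex and domain-specific tasks.
Insight: The core idea is to have the VLM actively explore the data before and during model refinement by selecting or creating diagnostic tools, which accumulate in the model's context and can be reused. Dynamically generated tools are more sophisticated than those produced by existing systems, cover more diagnostic categories, and favor visual outputs that the VLM can reason over directly.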
Abstract: Fitting quantitative models to data is a central step in scientific workflows, yet it remains one of the least automated. Recent agent-based systems leverage language and vision-language models (VLMs) to iteratively propose and refine statistical models, but these systems struggle on more challenging modeling tasks. To address these limitations, we introduce VESTA: Visual Exploration with Statistical Tool Agents, a framework that equips VLMs with a dynamically growing exploration toolkit to guide model refinement through data transformations, hypothesis-driven visualizations, and robust statistical tests. Unlike prior systems that rely on iterative critique alone, VESTA actively explores data before and during refinement by selecting or creating diagnostic tools, which accumulate in the model’s context and can be reused later. We evaluate VESTA against established baselines in three toolkit configurations: no tools, static expert-written tools, and dynamic model-written tools. To support this evaluation, we introduce DAWN (Dataset for Automated Workflows and Numerical Modeling), a benchmark targeting distribution fitting and time series modeling with varying difficulty tiers, and culminating in real-world astronomy tasks including modeling initial mass functions and gravitational-wave chirp signals. We find that VESTA’s dynamic tool creation outperforms prior agentic pipelines, with the largest gains on complex and domain-specific tasks. We further show that dynamically generated tools are substantially more sophisticated than those produced by existing visual tool-creation systems, covering more diagnostic categories per function and strongly preferring visual outputs that the VLM critic can reason over directly.
cs.HC
[301] UF-AMA: A unified framework for cross-domain emotion recognition via adaptive multimodal alignment cs.HC | cs.AI | cs.CV
Zheng Wang, Shuo Wang, Junhong Wang
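TL;DR: This paper proposes UF-AMA, a unified framework for cross-domain (cross-subject and cross-session) emotion recognition from multimodal physiological signals such as EEG and eye-tracking data. It combines a cross-modal feature fusion network, a confidence-aware sample screening mechanism, and a multi-level domain adaptation strategy to improve generalization and robustness on the target domain.
Details
Motivation: Emotion recognition from physiological signals faces two key challenges: distribution shifts caused by individual and contextual differences, and inconsistent sample quality across modalities. Together these make it difficult to build cross-domain multimodal models with high generalization and robustness.
Result: Extensive experiments on the SEED and SEED-IV datasets show that UF-AMA achieves state-of-the-art performance on both cross-subject and cross-session tasks.
Insight: The main contributions are: (1) a deep cross-modal fusion network combining Transformer encoders and multi-head cross-attention; (2) a confidence-aware screening mechanism that dynamically assesses the predictive reliability of each modality, partitions samples accordingly, and applies global consistency alignment and cross-modal distillation to the respective subsets; (3) a multi-level domain adaptation framework that jointly optimizes the marginal and conditional distributions of local modality-specific features and global fusion features, reducing cross-domain distribution shift at multiple granularities.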
Abstract: In recent years, emotion recognition based on physiological signals such as electroencephalogram (EEG) has gained considerable attention, as internal physiological data offer greater objectivity and reliability compared to external behavioral data like facial expressions. However, due to distribution shifts caused by individual and contextual differences, along with variations in sample quality across modalities, constructing a cross-domain multimodal emotion recognition model with high generalization and robustness remains a key challenge. In this study, we propose a Unified Framework with Adaptive Multimodal Alignment (UF-AMA) to address cross-subject and cross-session emotion recognition using multimodal physiological signals. First, we construct a cross-modal feature fusion network comprising Transformer encoders and multi-head cross-attention modules, enabling the deep integration of EEG signals and eye-tracking data. Subsequently, we introduce a confidence-aware screening mechanism that dynamically assesses the predictive reliability of each modality branch on target domain samples, partitions samples into different quality subsets, and accordingly applies global consistency alignment and cross-modal distillation. Finally, we propose a multi-level domain adaptation framework that jointly optimizes the marginal and conditional distributions of both local modality-specific and global fusion features, thereby reducing cross-domain distribution shifts at multiple granularities. Extensive experiments on the SEED and SEED-IV datasets demonstrate that UF-AMA achieves state-of-the-art (SOTA) performance in both cross-subject and cross-session tasks. The source code is available at: https://github.com/BetterCoderLab/UF-AMA.
[302] Quantitative Movement Testing: Measuring Patient Movements from a Single Smartphone Video cs.HC | cs.AI | cs.CV
Pranav Mahajan, Amanda Wall, Eleonora Maria Camerone, Julie Stebbins, Eoin Kelleher
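TL;DR: This paper presents and validates Quantitative Movement Testing (QMT), a computer vision pipeline that extracts 3D kinematic biomarkers from monocular smartphone video. In laboratory validation it agrees closely with gold-standard optical motion capture, and it was applied successfully in real-world clinical cohorts of fibromyalgia and chronic sciatica patients, demonstrating its potential as a scalable, objective assessment tool.
Details
Motivation: Chronic pain reduces patients' functional ability, but objectively measuring this functional impact in real-world settings remains challenging. Conventional optical motion capture is costly and confined to the laboratory, so an alternative is needed that combines clinical accessibility with biomechanical accuracy.
Result: In laboratory validation, the clinical kinematic metrics extracted by QMT agree closely with gold-standard optical motion capture, with strong correlations (r > 0.85) and low mean absolute errors. In the clinical cohorts, QMT shows high test-retest reliability (r > 0.86) in fibromyalgia patients and tracks day-to-day movement fluctuations in chronic sciatica patients. Although home settings introduce higher measurement variance, group-level differences between patients and healthy controls are still detected.
Insight: The main contribution is integrating deep-learning-based monocular 3D pose estimation into a complete, clinically validated pipeline (QMT) for extracting objective biomarkers from standard smartphone video. This offers a new tool for tracking disease progression and treatment response remotely and at scale in clinical trials, although reliability in home environments still needs improvement.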
Abstract: Chronic pain diminishes quality of life by decreasing functional ability, yet objectively measuring this functional impact remains challenging in real-world settings. While optical motion capture provides high precision for assessing altered movement quality, it is costly and restricted to laboratory environments. We aimed to develop and validate Quantitative Movement Testing (QMT), a computer vision pipeline extracting 3D kinematic biomarkers from standard monocular smartphone video, balancing clinical accessibility with biomechanical accuracy. We validated the QMT pipeline, utilising deep learning-based 3D pose-estimation, against gold-standard optical motion capture in healthy controls (N=13). Following leave-one-subject-out calibration to correct systematic bias, we deployed QMT in two prospective clinical cohorts to assess real-world utility: a pre- and post-intervention trial for fibromyalgia patients, and a 30-day longitudinal at-home monitoring study of chronic sciatica patients and healthy controls. In laboratory validation, QMT extracted clinical kinematic metrics with high agreement to optical motion capture, yielding strong correlations (r > 0.85) and low mean absolute errors. QMT demonstrated high test-retest reliability (r > 0.86) in fibromyalgia patients and successfully tracked day-to-day movement fluctuations in chronic sciatica. While real-world home settings introduced higher measurement variance than lab settings, QMT found group-level differences between healthy controls and sciatica patients based entirely on remote recordings. Monocular 3D pose estimation offers a scalable alternative to traditional assessments. QMT provides an objective, accessible biomarker for tracking disease progression and treatment response in clinical trials, though further research is needed to optimise reliability in home environments.
cs.GR
[303] Directed Distance Fields for Constant-Time Ray Queries on Gaussian Splatting cs.GR | cs.CV
Subhankar Mishra
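TL;DR: This paper turns a trained 3D Gaussian Splatting (3DGS) scene into a ray-query oracle by distilling a Directed Distance Function (DDF). The DDF is a small neural field that quickly returns the distance from a ray to the first surface and whether the ray hits anything. Its size is fixed at about 52 MB, and its query cost does not grow with the scene. The method supports secondary-ray applications such as global illumination and renders efficiently without a mesh.
Details
Motivation: 3DGS renders novel views in real time but handles only primary rays (rays from the camera). It cannot trace the secondary rays needed for shadows, ambient occlusion, and global illumination, so an efficient way to answer queries for arbitrary rays is needed to extend 3DGS.
Result: Across 142 objects and real captured scenes, the DDF reproduces reference ray-traced shadows and ambient occlusion at PSNRs of 30.3 dB and 21.3 dB respectively. Compared with sphere tracing an equivalent signed distance field, the DDF is 26 to 72 times faster, and its query time and memory do not grow with scene complexity.
Insight: The novelty is distilling a 3DGS scene into a compact DDF neural field for constant-time ray queries. The study finds that clean geometric distance supervision (rather than blurry rendered depth) recovers thin structures better. The whole pipeline is mesh-free: images give a 3DGS scene, a neural surface gives clean distances, and the DDF learns from them, suggesting a new route to applications such as real-time global illumination.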
Abstract: 3D Gaussian Splatting (3DGS) renders new views of a scene in real time. Like every rasterizer, it answers only primary rays, the rays from the camera through the image. It cannot trace the secondary rays that shadows, ambient occlusion, and global illumination need. We turn a trained 3DGS scene into a ray oracle by distilling a Directed Distance Function (DDF). The DDF is a small neural field. It takes a ray, given by an origin and a direction, and returns the distance to the first surface and whether the ray hits anything. Each query is one forward pass. The field is 52 MB, and its size does not depend on the number of Gaussians, so its cost and memory stay flat as the scene grows. We make three points. First, we study what supervision a DDF needs. Depth rendered from the Gaussians is too blurry to teach thin parts, while clean distance supervision recovers them. Second, we measure speed. The DDF is 26 to 72 times faster than sphere tracing an equivalent signed distance field, and unlike a bounding volume hierarchy built over the Gaussians, even on dedicated RT-core hardware, its query time and memory do not grow with the scene. Third, we show a pipeline that needs no mesh: images give a 3DGS scene, a neural surface gives clean distances, and the DDF learns from them. We use the DDF as a secondary-ray oracle for global illumination. It reproduces reference ray-traced shadows at 30.3 dB and ambient occlusion at 21.3 dB across 142 objects, and on real captured scenes. Our code is available at https://github.com/smlab-niser/ddf-gs.
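The speed comparison in this abstract (one evaluation per ray versus iterative sphere tracing) can be illustrated with a toy sketch. This is not the paper's method: the learned DDF is a neural field, whereas here a closed-form ray-sphere intersection stands in for it, and the function names and the unit-sphere scene are assumptions chosen for illustration only.

```python
import math

def sphere_sdf(p, radius=1.0):
    """Signed distance from point p to a sphere centered at the origin."""
    return math.sqrt(sum(c * c for c in p)) - radius

def sphere_trace(origin, direction, sdf, eps=1e-5, t_max=100.0, max_steps=256):
    """March along a unit-direction ray by the SDF value until a hit or t_max.

    Returns (distance, hit, number_of_sdf_evaluations).
    """
    t = 0.0
    for step in range(1, max_steps + 1):
        p = tuple(o + t * d for o, d in zip(origin, direction))
        dist = sdf(p)
        if dist < eps:
            return t, True, step
        t += dist
        if t > t_max:
            return t_max, False, step
    return t, False, max_steps

def directed_distance_sphere(origin, direction, radius=1.0):
    """Closed-form directed distance to the sphere (stand-in for a learned DDF).

    One evaluation per ray. Origins inside the sphere are not handled here.
    """
    b = sum(o * d for o, d in zip(origin, direction))
    c = sum(o * o for o in origin) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return math.inf, False
    t = -b - math.sqrt(disc)
    if t < 0.0:
        return math.inf, False
    return t, True

ray_origin, ray_dir = (0.5, 0.0, -3.0), (0.0, 0.0, 1.0)
t_march, hit_march, evals = sphere_trace(ray_origin, ray_dir, sphere_sdf)
t_direct, hit_direct = directed_distance_sphere(ray_origin, ray_dir)
print(t_march, t_direct, evals)
```

Both queries agree on the hit distance, but the sphere tracer needs several SDF evaluations while the directed-distance query needs one. That gap is the intuition behind the reported speedup, although the measured 26 to 72 times figure depends on the paper's networks and hardware, which this sketch does not model.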
[304] Temporally-Aligned Evaluation for Audio-Driven Talking Head Generation cs.GR | cs.AI | cs.CV | cs.LG | cs.MM
Zhicheng Zhang, Lei Wang, Yu Zhang, Yongsheng Gao
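TL;DR: This paper addresses evaluation for audio-driven talking-head generation. Conventional frame-wise metrics assume strict temporal alignment between generated and reference videos, so they cannot tolerate the slight timing shifts, speaking-rate differences, and stylistic variation that naturally occur in speech-driven facial motion, which leads to unfair comparisons. The paper recasts evaluation of dynamic generative models as a sequence-alignment problem and introduces a unified sequence-level framework that integrates Soft Dynamic Time Warping into existing evaluation pipelines, aligning feature trajectories while preserving temporal order and thus remaining robust to bounded temporal misalignment. Using this framework, it benchmarks 20 methods on seven datasets. The temporally aligned metrics are more robust to timing differences, more consistent across datasets, and better at revealing trade-offs between modeling paradigms.
Details
Motivation: Existing evaluation protocols for audio-driven talking-head generation rely mainly on frame-wise metrics that assume strict temporal correspondence between generated and reference videos. This does not match speech-driven facial motion, which naturally includes slight timing shifts, different speaking speeds, and stylistic variation, so conventional metrics may treat harmless timing differences as quality errors and make it hard to compare methods fairly or understand their trade-offs.
Result: The paper benchmarks 20 methods on seven datasets spanning canonical, in-the-wild, and style-diverse scenarios. The proposed temporally aligned metrics are more robust to timing differences, give more consistent results across datasets, and better reveal systematic trade-offs between modeling paradigms (for example, synchronization versus realism and expressiveness versus stability).
Insight: The core contribution is to redefine evaluation of dynamic generative models as a sequence-alignment problem rather than independent frame-wise comparison, and to propose a unified sequence-level framework that integrates Soft Dynamic Time Warping into existing evaluation pipelines. Keeping the underlying perceptual, identity, and synchronization encoders unchanged, the framework aligns feature trajectories to tolerate bounded temporal misalignment, giving a fairer and more stable evaluation and exposing the characteristics and trade-offs of different methods more clearly. More broadly, it offers a more realistic and principled evaluation paradigm for generative models, especially those that generate temporal sequences.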
Abstract: Audio-driven talking-head generation has advanced rapidly, yet existing evaluation protocols mainly rely on frame-wise metrics that assume strict temporal correspondence between generated and reference videos. This assumption does not match speech-driven facial motion, which naturally includes slight timing shifts, different speaking speeds, and stylistic variations. As a result, conventional metrics may treat harmless timing differences as quality errors, making it harder to fairly compare methods and understand their trade-offs. In this work, we argue that evaluation of dynamic generative models should be formulated as a sequence-alignment problem rather than independent frame comparison. We introduce a unified sequence-level reformulation that integrates Soft Dynamic Time Warping into established evaluation pipelines. By aligning feature trajectories while preserving temporal order, the proposed framework provides robustness to bounded temporal misalignments without altering the underlying perceptual, identity, or synchronization encoders. We show that frame-wise evaluation can be viewed as a special case under rigid alignment, while sequence-level alignment provides improved stability, lower sensitivity to timing differences, and clearer separation between modeling paradigms. Building on this principled formulation, we conduct a large-scale benchmark of 20 methods across seven datasets spanning canonical, in-the-wild, and style-diverse scenarios under standardized protocols. Extensive experiments show that temporally aligned metrics are more robust to timing differences, provide more consistent results across datasets, and better reveal systematic trade-offs between modeling paradigms, such as synchronization versus realism and expressiveness versus stability.
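To make the contrast between frame-wise and sequence-level comparison concrete, here is a minimal sketch of the standard Soft-DTW recurrence on toy feature trajectories. It is an illustration only: the paper applies alignment to features from perceptual, identity, or synchronization encoders, whereas the one-dimensional features, the squared-Euclidean cost, and the `gamma` value below are assumptions.

```python
import math

def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    m = min(values)
    if math.isinf(m):
        return m
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))

def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW discrepancy between feature sequences x and y.

    R[i][j] = cost(i, j) + soft_min(R[i-1][j], R[i][j-1], R[i-1][j-1]).
    """
    n, m = len(x), len(y)
    inf = float("inf")
    R = [[inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = (R[i - 1][j], R[i][j - 1], R[i - 1][j - 1])
            R[i][j] = sq_dist(x[i - 1], y[j - 1]) + soft_min(prev, gamma)
    return R[n][m]

def framewise(x, y):
    """Rigid frame-by-frame comparison (requires equal lengths)."""
    return sum(sq_dist(u, v) for u, v in zip(x, y))

# A mouth-opening "event" that occurs one frame later in the generated clip.
reference = [[0.0], [0.0], [1.0], [0.0], [0.0]]
generated = [[0.0], [0.0], [0.0], [1.0], [0.0]]
print(framewise(reference, generated), soft_dtw(reference, generated, gamma=0.01))
```

The frame-wise score penalizes the one-frame delay twice (once where the event is missing and once where it appears late), while the aligned score stays near zero because a monotone warping path matches the two events. As `gamma` shrinks, Soft-DTW approaches classic DTW.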
[305] AlbedoEdit: Unified Instance-Level Video Editing with Albedo Guidance cs.GR | cs.CV
Xilong Zhou, Bao-Huy Nguyen, Zheng Zeng, Jacob Munkberg, Jon Hasselgren
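TL;DR: AlbedoEdit is a unified generative video editing framework supporting three fine-grained instance-level editing tasks: object insertion, object removal, and texture editing. Its core idea is to use the intrinsic albedo map, which is independent of lighting and free of effects such as specular reflection and shadows, as a user-friendly editing condition that guides a video foundation model to translate a source RGB video into an edited RGB video. Trained on a synthetic dataset, the method implicitly learns to harmonize edited content and simulate the complex real-world visual effects triggered by editing operations.
Details
Motivation: Existing video generation models are limited in fine-grained instance-level video editing (object insertion, removal, and texture editing). They either provide only coarse semantic control or are task-specific frameworks, lacking flexibility and broad applicability.
Result: AlbedoEdit outperforms state-of-the-art video editing methods in both qualitative and quantitative evaluation.
Insight: The novelty is using the intrinsic albedo map as a unified, user-friendly editing condition: it is unaffected by lighting changes and specifies fine-grained appearance edits effectively. Through training, the framework implicitly learns to harmonize edits and simulate complex visual effects (such as specular highlights, soft shadows, and mirror reflections), handling multiple editing tasks in one model.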
Abstract: Video generative models have achieved remarkable progress in synthesizing photorealistic video sequences. However, enabling broader and more creative downstream applications requires fine-grained instance-level video editing, including object insertion, object removal, and texture editing, which has emerged as a prominent yet challenging problem. Existing approaches either propose unified generative frameworks with only coarse semantic control, or design task-specific frameworks for individual editing tasks, limiting their flexibility and applicability across diverse real-world scenarios. To address these limitations, we propose AlbedoEdit, a unified generative video editing framework that jointly supports object insertion, object removal, and texture editing. Our key insight is that the intrinsic albedo map, which is invariant to lighting and contains no specularity, shadowing and inter-reflection effects, provides an effective and user-friendly mechanism for specifying fine-grained appearance edits. Built upon video foundation models, AlbedoEdit is fine-tuned to translate source RGB videos into edited RGB videos, conditioned on a user-edited first-frame albedo. Trained on a new paired synthetic dataset covering all three editing tasks, AlbedoEdit implicitly learns to harmonize edited contents and simulate complex real-world visual effects triggered by editing operations, including specular highlights, soft shadows, and mirror reflections. AlbedoEdit demonstrates superior performance over state-of-the-art video editing approaches, both qualitatively and quantitatively. Project webpage is https://vcai.mpi-inf.mpg.de/projects/AlbedoEdit/.
[306] MPMWorlds: Material-Point-Method Simulations for Inferring and Extrapolating Physical Dynamics cs.GR | cs.CV | cs.LG
Žiga Kovačič, Kevin Ellis
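TL;DR: This paper assembles a dataset of 2D Material Point Method (MPM) simulations covering rich physical phenomena, to study the ability to infer physical dynamics from video and extrapolate them forward in time. It compares code generation and video diffusion models on the dataset: code generation yields physically and temporally stable extrapolations but struggles to infer physical parameters from visual input, while video diffusion identifies geometric properties better but produces physically implausible extrapolations.
Details
Motivation: To study the ability to infer physical dynamics from video and extrapolate them in time, the authors build an MPM simulation dataset covering phenomena such as deformable objects, fluids, kinetic objects, and emitters.
Result: Code generation and video diffusion approaches are evaluated on the MPM dataset, with their strengths and weaknesses identified by varying the amount of physically relevant side information. The code generation model can automatically synthesize MPM simulations but struggles to infer physical parameters. The video diffusion model identifies geometric properties better, but its extrapolations are physically implausible.
Insight: The contribution is a dedicated MPM simulation dataset for evaluating inference and extrapolation, plus a systematic comparison of the code generation and video diffusion paradigms. It reveals their complementary strengths in physical-parameter inference, geometric identification, and extrapolation stability, providing a new benchmark and insights for learning physical dynamics.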
Abstract: To study the ability to infer physical dynamics from videos and extrapolate them forward in time, we assemble a dataset of 2D Material Point Method (MPM) physical simulations covering rich physical phenomena such as deformable objects, fluids, kinetic objects, and emitters. We study code generation and video diffusion approaches on this dataset, identifying their strengths and weaknesses by varying the amount of physically relevant side information. The code generation model, beyond giving a working demonstration of automatic synthesis of MPM simulations, reveals that such an approach struggles with inferring physical parameters from visual input, but relative to video diffusion, produces physically and temporally stable extrapolations forward in time, while the video diffusion model more strongly identifies geometric properties from visual input but produces physically implausible extrapolations.
cs.NE
[307] Evolving to the Aesthetics of a Vision-Language Model cs.NE | cs.CV
Stephen James Krol, Jon McCormack
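TL;DR: This paper explores two ways of using vision-language models to evaluate the aesthetic quality of abstract outputs in evolutionary systems: predicting an aesthetic score for each design with CLIP-IQA, and having a VLM compare candidates pairwise under a user-defined prompt, with the Glicko rating system used to estimate a population ranking. Through a case study with a custom generative system, the results of both methods are compared with an artist's aesthetic ranking and with other aesthetic evaluation techniques, and their strengths and weaknesses are analyzed.
Details
Motivation: Designing fitness functions that capture the desired aesthetics of abstract outputs remains an open problem for evolutionary systems in creative domains such as generative typography, design, and music.
Result: In the case study with a custom generative system, the rankings produced by the proposed methods are compared with an artist's aesthetic ranking and with other aesthetic evaluation techniques. The abstract does not report specific quantitative results or claim state-of-the-art performance.
Insight: The novelty is using VLMs (specifically CLIP-IQA and prompt-based pairwise comparison) as fitness functions for abstract aesthetics, avoiding hand-designed fitness functions, and using the Glicko system to convert comparison outcomes into a quantifiable population ranking, which gives evolutionary art a new evaluation framework.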
Abstract: Evolutionary systems have demonstrated remarkable results in creative domains, with recent applications in generative typography, design, and music. However, an open problem remains in designing fitness functions that effectively capture the desired aesthetics of abstract outputs. In this work, we explore two methods for evaluating the aesthetics of a population using Vision-Language Models (VLMs). The first method uses CLIP-IQA to predict an aesthetic score for each design. The second method instead pits candidates against each other, with winners determined by a VLM using a custom prompt specified by the user. The outcomes of these pairwise comparisons are then used to estimate a population ranking via the Glicko rating system. We present these methods in the context of a case study using a custom generative system and compare the resulting rankings with an artist’s aesthetic ranking and those produced by other aesthetic evaluation techniques. Additionally, we document the artist’s experience using these approaches to evolve designs, critically analysing the strengths and weaknesses of both methods.
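As a rough sketch of how pairwise outcomes can be turned into a ranking signal, the block below implements one rating-period update of the original Glicko (Glicko-1) system. The abstract does not say which Glicko variant or parameters the authors use, so treating each candidate design as a "player" rated with Glicko-1, and the example ratings below, are assumptions.

```python
import math

Q = math.log(10) / 400.0

def g(rd):
    """Attenuation factor for an opponent's rating deviation (RD)."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q * Q * rd * rd / (math.pi ** 2))

def expected_score(r, r_j, rd_j):
    """Expected outcome of a comparison against an opponent rated (r_j, rd_j)."""
    return 1.0 / (1.0 + 10.0 ** (-g(rd_j) * (r - r_j) / 400.0))

def glicko_update(r, rd, results):
    """One Glicko-1 rating-period update.

    results: list of (opponent_rating, opponent_rd, score),
             where score is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    Returns the new (rating, rating_deviation).
    """
    if not results:
        return r, rd
    d2_inv = 0.0   # accumulates 1 / d^2
    delta = 0.0    # accumulates sum of g(RD_j) * (s_j - E_j)
    for r_j, rd_j, s in results:
        e = expected_score(r, r_j, rd_j)
        d2_inv += Q * Q * g(rd_j) ** 2 * e * (1.0 - e)
        delta += g(rd_j) * (s - e)
    denom = 1.0 / (rd * rd) + d2_inv
    return r + (Q / denom) * delta, math.sqrt(1.0 / denom)

# One candidate that won one VLM-judged comparison and lost two.
new_r, new_rd = glicko_update(1500.0, 200.0, [
    (1400.0, 30.0, 1.0),
    (1550.0, 100.0, 0.0),
    (1700.0, 300.0, 0.0),
])
print(round(new_r, 1), round(new_rd, 1))  # approximately 1464.1 and 151.4
```

Sorting a population by updated rating gives a ranking, and the shrinking rating deviation indicates how much evidence supports each candidate's position, which is one reason a rating system can be preferable to raw win counts when comparisons are sparse.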
cs.DC
[308] Compliance-Scored Best-of-N Guardrail Orchestration for Multimodal Document Generation in Payments Dispute Defense cs.DC | cs.AI | cs.CL | cs.LG
Nataraj Agaram Sundar, Tejas Morabia
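TL;DR: This paper proposes a compliance-scored best-of-N guardrail orchestration framework for multimodal document generation in payments dispute defense. It generates multiple candidate documents in parallel, scores them with a weighted compliance score (PII detection, content moderation, schema constraints, and so on) used for early exit, and returns the best-scoring output. In the payments dispute summary application, the framework completes 5 attempts within 20 seconds at 91% compliance and shows statistically significant gains on key business metrics.
Details
Motivation: In high-stakes enterprise document generation (financial dispute narratives, compliance notices, and the like), conventional production systems run PII redaction, content moderation, and format validation as separate steps, leading to fragmented logic, slow request paths, and high operational cost.
Result: For payments dispute defense summaries, the framework completes 5 attempts within 20 seconds at 91% compliance. Compared with controls, the overall win rate is 11.0 percentage points higher (95% confidence interval [6.6, 15.5], p < 0.001), and 7.5 percentage points higher for adjusted item-not-received cases (95% confidence interval [0.2, 15.7], p = 0.045). Per the abstract, these figures come from aggregate operational readouts rather than a randomized A/B test.
Insight: The novelty is combining multi-candidate generation with an explicit compliance score: weighted guardrails (PII detection, content moderation, schema constraints, domain rules) drive early-exit selection within a unified guardrail orchestration layer. The framework turns compliance validation from serial post-processing into a parallel generate-and-score flow, balancing generation quality, compliance, and latency, and offers a configurable, interpretable solution for high-stakes enterprise applications.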
Abstract: High-stakes enterprise document generation, including financial dispute narratives, compliance notices, and audit summaries, demands schema correctness, policy compliance, and low-latency operation at scale. Prior to a unified guardrail layer, production systems often stitched together separate PII redaction, content moderation, and format validation steps, leading to fragmented logic, slower request paths, and higher operational cost. We present a guardrail orchestration layer for text and image inputs that couples multi-candidate generation with an explicit compliance score used for early exit. The framework runs configurable parallel generation heads, scores candidates against weighted guardrails including PII detection, content moderation, schema constraints, and domain rules, and returns the best-scoring output with selection metadata. The available operational readout reports 5 attempts within 20 seconds and 91 percent compliance. For payments dispute defense summaries, we analyze aggregate operational scenario readouts rather than a randomized A/B test. Variable cohorts show higher count win rates than controls overall, 301/659 versus 536/1548, corresponding to +11.0 percentage points with 95 percent confidence interval [6.6, 15.5] and p < 0.001, and for adjusted item-not-received cases, +7.5 percentage points with 95 percent confidence interval [0.2, 15.7] and p = 0.045. Fraud and local evidence-ranking deltas are directionally positive but not statistically significant from the aggregate count data. We also report reviewer-calibrated Responsible-AI evidence-quality signals from 770 generated-evidence reviews and a 70-case OCR slice, and document the reproducibility boundary through the request interface, scoring logic, pseudocode, and operational evidence boundary.
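The overall win-rate delta quoted in this abstract can be checked directly from the reported counts (301/659 for variable cohorts versus 536/1548 for controls). The sketch below uses a plain two-proportion Wald interval. The abstract does not state which interval method the authors used, so the choice of method and the `z=1.96` critical value are assumptions. The adjusted item-not-received interval is asymmetric around its point estimate and cannot be reproduced this way.

```python
import math

def diff_in_proportions(x1, n1, x2, n2, z=1.96):
    """Difference of two proportions with a Wald confidence interval.

    Returns (difference, lower_bound, upper_bound) as fractions.
    """
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = math.sqrt(p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2)
    return diff, diff - z * se, diff + z * se

diff, lo, hi = diff_in_proportions(301, 659, 536, 1548)
print(f"{100 * diff:+.1f} pp, 95% CI [{100 * lo:.1f}, {100 * hi:.1f}]")
# prints "+11.0 pp, 95% CI [6.6, 15.5]"
```

The result matches the reported +11.0 percentage points and interval [6.6, 15.5], which suggests a standard unpooled interval is at least consistent with the overall figure. Because the underlying data are observational readouts, the interval quantifies sampling noise only, not confounding between cohorts.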
cs.IR
[309] Beyond Text and Tables: Vision-Language Model Integration in ComProScanner for Extracting Materials Data from Scientific Figures with High Accuracy cs.IR | cond-mat.mtrl-sci | cs.AI | cs.CL
Aritra Roy, Enrico Grisan, Chiara Gattinoni, John Buckeridge
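TL;DR: This paper extends the ComProScanner framework by integrating vision-language models (VLMs) so that materials composition-property data can be extracted automatically from figures in the scientific literature. The extension adds a caption-keyword-based figure filtering utility and an agent that uses a VLM to recover data pairs from figures. Four VLMs are evaluated on a piezoelectric ceramics dataset, with Gemini-3-Flash-Preview performing best.
Details
Motivation: Existing LLM-based frameworks for automated materials data extraction are largely limited to text and tables and overlook the large amount of quantitative property data that appears only in scientific figures, so the framework needs to be extended to multimodal extraction.
Result: Benchmarked on 50 piezoelectric ceramic articles from the $d_{33}$ test corpus, Gemini-3-Flash-Preview achieves the highest composition accuracy (0.97) and normalized F1 score (0.97) and is the most cost-effective of the four models evaluated.
Insight: Contributions include native VLM integration into an end-to-end multi-agent framework, giving a single pipeline that extracts structured data from text, tables, and figures, and a range-based value error threshold parameter that provides a more physically meaningful assessment of numeric values extracted from figures.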
Abstract: Automated extraction of materials composition-property data from scientific literature has advanced considerably with the development of large language model-based pipelines; however, existing frameworks remain limited to textual and tabular content, overlooking the substantial proportion of quantitative property data reported exclusively in scientific figures. Here, we extend ComProScanner, a fully end-to-end multi-agent framework for automated composition-property database construction, with a native vision-language model (VLM) based figure extraction capability. The extension introduces a FigureExtractor utility for caption-keyword-based figure filtering across all supported publishers, and a GraphExtractorTool agent that passes extracted figures to a configurable VLM to recover composition-property pairs from scientific charts and plots. Four VLMs are selected for evaluation on the basis of the LMArena Diagram leaderboard with an input cost criterion of less than $1.50 per million tokens. Benchmarking on 50 piezoelectric ceramic articles from the established $d_{33}$ test corpus demonstrates that Gemini-3-Flash-Preview achieves the highest performance with a composition accuracy of 0.97 and a normalised F1 score of 0.97, whilst remaining the most cost-effective model among the four evaluated. We additionally introduce a range-based value error threshold parameter into the evaluation framework, providing a more physically meaningful assessment of numeric property values extracted from figures than exact value matching. These contributions establish VLM-integrated ComProScanner as the first materials-specific, fully automated, multimodal literature mining platform capable of extracting structured composition-property data from text, tables, and figures within a single unified pipeline.
[310] ODTQA-FoRe: An Open-Domain Tabular Question Answering Dataset for Future Data Forecasting and Reasoning cs.IR | cs.AI | cs.CL | cs.LG | cs.MA
Zhensheng Wang, Xiaole Liu, Wenmian Yang, Kun Zhou, Yiquan Zhang
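TL;DR: This paper proposes a new open-domain tabular question answering task for future data forecasting and reasoning, and builds ODTQA-FoRe, the first such dataset, from real estate data. To address the task's challenges in data retrieval, forecasting, and answer standardization, the authors propose TimeFore, an LLM agent-based framework that decomposes the problem into three collaborative roles: Retriever, Forecaster, and Analyzer. Experiments validate the framework's effectiveness.
Details
Motivation: Most existing tabular question answering systems cannot perform future-oriented numerical prediction. This paper aims to fill that gap.
Result: Extensive experiments demonstrate the effectiveness of the proposed TimeFore framework, but the abstract gives no specific benchmarks or quantitative results (for example, whether it reaches the state of the art).
Insight: The novelty is defining a new open-domain tabular QA task and dataset that combine time-series forecasting with reasoning, and proposing a multi-agent framework in which an LLM works with external time-series models to overcome the LLM's inherent forecasting limitations and to standardize answers.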
Abstract: The rapid development of LLMs has significantly advanced tabular question answering, but most systems cannot perform future-oriented numerical prediction. To address this gap, we introduce a novel task, Open-Domain Tabular Question Answering for Future Data Forecasting and Reasoning, and propose the first dataset to cover time-series forecasting and forecast-based reasoning scenarios using real estate data. This task poses challenges in retrieving precise historical data, overcoming the forecasting limitations of LLMs, and standardizing responses for diverse queries. To solve the above challenges, we propose TimeFore, an LLM agent-based framework that decomposes the problem into three collaborative roles: a Retriever autonomously generates SQL to fetch data, a Forecaster invokes external time-series models for higher accuracy, and an Analyzer synthesizes the results to construct a precise and consistent final answer. Extensive experiments demonstrate the effectiveness of our TimeFore.
cs.SD
[311] Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning cs.SD | cs.CL | cs.HC | cs.LG | eess.AS
Sukru Samet Dindar, Riki Shimizu, Xilin Jiang, Nima Mesgarani
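TL;DR: This paper presents Sympatheia, a spoken dialogue framework conditioned on continuous valence-arousal (VA) affect control. It targets the weak, neutral, or ambiguous emotional cues of everyday speech, aiming to generate responses whose semantic content and spoken delivery are both emotionally appropriate.
Details
Motivation: Existing empathetic spoken dialogue systems must accurately infer the user's emotional state, but everyday speech often carries weak, neutral, or ambiguous emotional cues, so a method that can incorporate an explicit continuous affect control signal is needed to improve emotional adaptivity.
Result: Experiments show that Sympatheia outperforms baseline spoken dialogue models at generating emotionally appropriate responses, and that the same VA interface can integrate emotion estimates from sensing modules such as facial expression, biosignals, and textual affect descriptions, improving response alignment when speech offers limited emotional evidence.
Insight: The novelty lies in introducing continuous VA affect conditioning and in building Sympatheia-18k, a synthetic spoken dialogue dataset with emotion anchors and neutral queries that explicitly isolates emotion control, offering a practical step toward emotionally adaptive voice assistants.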
Abstract: Empathetic spoken dialogue systems must infer a user’s emotional state to respond appropriately, yet everyday speech often carries weak, neutral, or ambiguous affective cues. To address this, we introduce Sympatheia, a speech-to-speech dialogue framework conditioned on affect inferred from the user’s speech and, when available, explicit affect specifications provided as a continuous valence–arousal (VA) control signal by a multimodal sensing module or user interface. To train our model, we construct Sympatheia-18k, an emotion-conditioned synthetic spoken dialogue corpus with 12 emotion anchors. This dataset includes an emotional split for learning affective speech behavior, and a neutral split that pairs emotionally neutral queries with multiple emotion-conditioned responses to isolate explicit emotion control in emotionally ambiguous cases. Empirical results show that Sympatheia outperforms speech conversational baselines in generating responses whose semantic content and spoken delivery are both emotionally appropriate. We further show that the same VA interface can integrate emotion estimates from diverse sensing modules, including facial expression, biosignals, and textual affect descriptions, improving response alignment when speech alone provides limited emotional evidence. These results suggest that continuous affect conditioning is an effective practical step for building emotionally adaptive voice assistants.
[312] JenBridge: Adaptive Long-Form Video Soundtracking across Scene Transitions cs.SD | cs.AI | cs.CV
Jiashuo Yu, Yao Yao, Boyu Chen, Alex Wang
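TL;DR: This paper presents JenBridge, an adaptive, modular framework for soundtracking long-form video, addressing the lack of narrative coherence in existing AI music systems across scene transitions. Its core is a Transformer-based generative model trained with a flow-matching objective, with cross-modal alignment achieved through a two-stage paradigm (large-scale text-audio pretraining followed by video-domain adaptation). The key innovation is an adaptive transition mechanism offering multiple transition styles, with an LLM agent acting as a director that selects the most suitable transition for each scene change.
Details
Motivation: Existing AI music systems are designed mainly for short clips and lack mechanisms for narrative continuity across scene transitions in long videos, so generating coherent, high-fidelity soundtracks for long videos remains an open challenge.
Result: Extensive experiments on the authors' new LVS Benchmark show that JenBridge significantly outperforms existing methods on both objective and subjective metrics, particularly in transition naturalness and overall narrative coherence.
Insight: The main innovations are: 1) a modular, interpretable framework design; 2) a novel adaptive transition mechanism that combines a generative transition method with selection by an LLM agent; and 3) the LVS Benchmark, a new benchmark focused on holistic and transition-aware evaluation for rigorous assessment of long-form video soundtracking.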
Abstract: We address the challenge of generating high-fidelity, long-form soundtracks that remain coherent across scene transitions. Existing AI music systems are mainly designed for short, isolated clips and lack mechanisms to ensure narrative continuity. We present JenBridge, a modular and interpretable framework for adaptive long-form video soundtracking that ensures both high-fidelity audio generation and transition naturalness. The core architecture is a Transformer-based generative model trained with a flow-matching objective, following a two-stage paradigm: pretraining on large-scale text-audio corpora to establish robust musical priors, then adapting to the video domain with dual text-visual conditioning for precise cross-modal alignment. Crucially, to achieve long-form coherence across diverse scene changes, JenBridge incorporates a novel adaptive transition mechanism. This system features a versatile toolkit of transition styles, including a generative transition method, and uniquely employs a Large Language Model (LLM) Agent that acts as a director to select the most appropriate transition for each narrative shift intelligently. To rigorously assess this task, we propose the LVS Benchmark, a new benchmark that includes a curated dataset and novel evaluation metrics focusing on holistic and transition-aware assessment. Extensive experiments on the proposed benchmark demonstrate that JenBridge significantly outperforms existing methods in both objective and subjective metrics, particularly in terms of transition naturalness and overall narrative coherence. JenBridge represents a significant step towards fully automated, professional-quality video soundtracking.
cs.CR
[313] “I Strongly Suspect This Website Is a Scam”: Benchmarking PII Leakage and Detection without Defense in Autonomous Web Agents cs.CR | cs.CL
Soham Roy, Sarthakbrata Halder, Arya Bharaty, Vaibhav Bhaskar, Yash Sinha
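TL;DR: This paper studies the risk that autonomous web agents leak personally identifiable information (PII) under social-engineering attacks and introduces the Scammer4U benchmark to quantify it. Without privacy guidance, critical PII leakage rates for frontier web agents reach 54–93%, and even when the agent's reasoning identifies a site as suspicious, 35.9% of sessions still submit PII, revealing a gap between detection and action.
Details
Motivation: Social-engineering attacks are widespread on the internet and can manipulate autonomous web agents into submitting users' PII to attacker-controlled endpoints. This poses a serious risk to deployed agent systems, which needs to be quantified and understood.
Result: On the Scammer4U benchmark (91 attack environments and 10 benign baselines), frontier agents' critical PII leakage rate is 54–93% without defenses, versus 0% on the benign baselines. Even when an independent LLM judge confirms that the agent flagged the site as suspicious, 35.9% of sessions still submit critical PII, compared with 66.1% when no suspicion is voiced, a gap of 30.2 percentage points that holds across all four model families.
Insight: The novelty lies in building the Scammer4U benchmark, whose factorial taxonomy isolates the causal contribution of individual attack design factors, and in exposing the agents' detection-action gap: defenses that rely on the agent itself recognizing an attack are flawed, so output-level interception that operates independently of the agent's reasoning loop is needed to prevent PII submission.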
Abstract: Deceptive web content, widely instantiated across the internet and commonly known as social-engineering attacks, manipulates autonomous web agents into submitting users’ personally identifiable information (PII) to attacker-controlled endpoints. In this paper, we show that social-engineering attacks are highly effective at extracting critical-tier PII from frontier web agents, posing a severe risk to deployed agentic systems. To quantify this risk, we introduce Scammer4U, a pre-registered benchmark of 91 attacker-controlled environments and 10 benign-twin baselines, spanning 8 attack vectors and 16 site categories on an 8-axis factorial taxonomy that isolates the causal contribution of individual attack design factors. Across frontier agents, we find that critical-tier PII leakage reaches 54–93% under no privacy guidance, compared to 0% on benign-twin baselines, confirming that leakage is attack-attributable rather than incidental form-filling. Escalating prompt-level mitigation yields sharply model-dependent reductions across the four families and remains insufficient to reliably prevent critical PII submission at the pooled level. Most critically, we identify a detection–action gap: agents whose reasoning an independent LLM judge confirms has flagged the site as suspicious still submit critical PII in 35.9% of sessions, versus 66.1% when no suspicion is verbalized, a 30.2-percentage-point gap robust across all four model families. Our findings reveal that defenses conditioned on the agent’s own recognition of an attack are gating on the wrong signal, motivating output-level interception of outbound submissions that operates independently of the agent’s reasoning loop.
[314] BraveGuard: From Open-World Threats to Safer Computer-Use Agents cs.CR | cs.CL
Yunhao Feng, Yifan Ding, Xiaohu Du, Ming Wen, Xinhao Deng
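TL;DR: This paper presents BraveGuard, a self-evolving defense framework for training safety guard models for computer-use agents. The framework mines open-world threat signals and realistic agent trajectories to generate trajectory-level supervision data, forming an adaptive defense loop. Experiments show that BraveGuard substantially improves detection accuracy for trajectory-level safety threats on benchmarks such as AgentHazard.
Details
Motivation: Computer-use agents extend language models from text generation to sustained interaction with files, terminals, browsers, and other tools, which introduces new safety risks. These risks are hard to detect from isolated prompts or final responses, because harm often emerges only across multi-step execution trajectories whose individual actions look locally benign.
Result: On the AgentHazard benchmark, guard models trained with BraveGuard (such as Qwen3-Guard and Llama-Guard variants) raise average detection accuracy from 38.79% to 82.38%, clearly outperforming off-the-shelf guard models.
Insight: The core innovation is a self-evolving defense framework that generates supervision data from open-world threat discovery and realistic agent execution trajectories. This goes beyond traditional safety monitoring built on fixed taxonomies and synthetic prompt-level data, and offers a scalable path to adaptive defenses against evolving real-world risks.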
Abstract: Computer-use agents extend language models from text generation to sustained interaction with files, terminals, browsers, and external tools. This shift creates safety risks that are difficult to detect from isolated prompts or final responses, because harm often emerges only through multi-step execution traces whose individual actions appear locally benign. We introduce BraveGuard, a self-evolving defense framework for training guard models from open-world threat signals and realistic agent trajectories. BraveGuard mines recent research sources to identify emerging risks and attack patterns, instantiates them as executable computer-use tasks, collects agent rollouts, and derives trajectory-level supervision for guard model training. As new threats and validation failures appear, the pipeline can be repeated, yielding an adaptive defense loop rather than a static, benchmark-driven training process. We instantiate BraveGuard by training multiple guard backbones, including Qwen3-Guard and Llama-Guard variants, and evaluate the resulting guards on trajectory-level agent-safety benchmarks. BraveGuard consistently improves safety detection across computer-use trajectories. On AgentHazard, it substantially improves detection accuracy over off-the-shelf guard models, with accuracy increasing from 38.79% to 82.38% under the averaged guard-model setting. These results show that guard supervision grounded in open-world threat discovery and realistic agent execution can improve safety monitoring beyond fixed taxonomies and synthetic prompt-level data. BraveGuard offers a scalable path toward adaptive defenses for computer-use agents facing evolving real-world risks.
cs.RO
[315] From Human Videos to Robot Manipulation: A Survey on Scalable Vision-Language-Action Learning with Human-Centric Data cs.RO | cs.AI | cs.CV
Zhiyuan Feng, Qixiu Li, Huizhi Liang, Rushuai Yang, Yichao Shen
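TL;DR: This survey systematically examines how abundant human video data can be used to improve the generalization of Vision-Language-Action (VLA) models on robot manipulation tasks. It groups existing methods into four classes (latent action representations, predictive world models, explicit 2D supervision, and explicit 3D reconstruction) and identifies key open challenges in video structuring, action grounding, and evaluation protocols.
Details
Motivation: Training existing VLA models relies heavily on demonstration data that is costly and tightly coupled to specific robot embodiments. Human video is abundant and rich in interaction semantics and physical cues, but using it directly is difficult because of embodiment differences and missing task annotations.
Result: As a survey, the paper reports no specific quantitative results; it systematically reviews progress in the field and proposes a taxonomy and future research directions.
Insight: The novelty lies in a unified taxonomy (four classes of methods) based on the action-related information extracted from human videos, and in highlighting three core challenges: turning unstructured video into training-ready data, grounding actions under embodiment heterogeneity, and designing more effective evaluation protocols. Together these give a clear roadmap for future research.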
Abstract: Recent progress in generalizable embodied control has been driven by large-scale pretraining of Vision-Language-Action (VLA) models. However, most existing approaches rely on large collections of robot demonstrations, which are costly to obtain and tightly coupled to specific embodiments. Human videos, by contrast, are abundant and capture rich interactions, providing diverse semantic and physical cues for real-world manipulation. Yet, embodiment differences and the frequent absence of task-aligned annotations make their direct use in VLA models challenging. This survey provides a unified view of how human videos are transformed into effective knowledge for VLA models. We categorize existing approaches into four classes based on the action-related information they derive: (i) latent action representations that encode inter-frame changes; (ii) predictive world models that forecast future frames; (iii) explicit 2D supervision that extracts image-plane cues; and (iv) explicit 3D reconstruction that recovers geometry or motion. Beyond this taxonomy, we highlight three key open challenges in this area: structuring unstructured videos into training-ready episodes, grounding video-derived supervision into robot-executable actions under embodiment and viewpoint heterogeneity, and designing evaluation protocols that better predict real-world deployment performance and transfer efficiency, thereby informing future research directions. A curated list of papers and resources is available at https://github.com/AaronFengZY/HumanCentricToVLA-Survey.
[316] Modeling Robotics Dataset Construction as an Artifact-Based Build Process cs.RO | cs.CV | cs.LG
Leon Pohl, Lukas Beer, George Sebastian, Mirko Maehlisch
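TL;DR: This paper models robotics dataset construction as an artifact-based build process and develops Bagzel, an open-source Bazel extension for reproducible, incremental dataset generation. The approach manages data transformations through a dependency graph, substantially reducing dataset update latency while supporting reproducibility.
Details
Motivation: Robotic systems produce large volumes of multimodal sensor data, but converting ROS bag recordings into machine learning datasets is usually handled by ad hoc sequential scripts, which brings engineering overhead and slow iteration cycles. The paper aims to make dataset construction more efficient and manageable.
Result: In the experiments, Bagzel reduces runtime in every evaluated execution mode, with the largest gains in iterative workflows (on a 20.4 GB dataset, up to 386.26x faster warm builds and 7.21x faster incremental builds). Across dataset sizes from 5.1 to 20.4 GB, the Bagzel variants scale better than the baseline, and Bagzel-xattr cuts mean runtime by a further 5.9% relative to Bagzel.
Insight: The novelty lies in modeling dataset construction as an artifact-based build process and using a dependency graph for incremental processing and reproducibility. Borrowing ideas from software build tools such as Bazel gives robotics data management a systematic solution that can reduce engineering overhead and speed up iteration.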
Abstract: Robotic systems generate large volumes of multimodal sensor data, but converting ROS bag recordings into machine learning datasets is often handled by ad hoc sequential scripts, creating engineering overhead and slow iteration cycles. We model dataset construction as an artifact-based build process over a dependency graph and implement this approach in Bagzel, an open-source Bazel extension for reproducible, incremental dataset generation (including nuScenes-format export). We compare Bagzel and Bagzel-xattr (server-side digest management) against a sequential rosbag2nuscenes baseline. Bagzel reduces runtime in all evaluated execution modes, with the largest gains in iterative workflows (up to 386.26x in warm builds and 7.21x in incremental builds on a 20.4 GB dataset). Across dataset sizes from 5.1 to 20.4 GB, Bagzel variants show markedly better scaling behavior than the baseline, especially in warm and incremental modes. Bagzel-xattr provides additional gains, with a mean runtime reduction of 5.9% compared to Bagzel in the input granularity study. Overall, modeling robotics dataset construction as an artifact-based build process substantially reduces dataset update latency while maintaining a deterministic build design that supports reproducibility. Bagzel is publicly available at https://github.com/UniBwTAS/bagzel.
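Why warm and incremental builds are cheap under an artifact-based model can be seen in a toy content-addressed cache: an action reruns only when the digest of its inputs changes. This is a simplified sketch of the general idea, not Bazel's or Bagzel's actual mechanism; `ArtifactCache` and `extract_frames` are hypothetical names.

```python
import hashlib


class ArtifactCache:
    """Toy content-addressed cache; not Bazel's or Bagzel's actual mechanism."""

    def __init__(self) -> None:
        self.store: dict[str, bytes] = {}
        self.executions = 0  # counts how many actions really ran

    def build(self, action_name: str, func, *inputs: bytes) -> bytes:
        # The key covers the action name and the digest of every input, so an
        # unchanged input maps to the same key and the action is skipped.
        hasher = hashlib.sha256(action_name.encode())
        for data in inputs:
            hasher.update(hashlib.sha256(data).digest())
        key = hasher.hexdigest()
        if key not in self.store:
            self.store[key] = func(*inputs)
            self.executions += 1
        return self.store[key]


def extract_frames(bag: bytes) -> bytes:
    """Placeholder for decoding one recording into dataset samples."""
    return bag.upper()


cache = ArtifactCache()
cache.build("extract_frames", extract_frames, b"bag-a")
cache.build("extract_frames", extract_frames, b"bag-b")
cache.build("extract_frames", extract_frames, b"bag-a")  # warm: cache hit
runs_after_warm = cache.executions
cache.build("extract_frames", extract_frames, b"bag-a-edited")  # incremental
```

In this sketch the warm rebuild executes nothing and the incremental rebuild executes only the action whose input changed, which is the behavior a sequential script cannot offer without hand-written bookkeeping.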
[317] Safe2Drive: Evaluating Safe Driving Behaviors of E2E Autonomous Driving Models cs.RO | cs.CV
Nishad Sahu, Kalpana Panda, Congyuan Yu, Changzhong Qian, Shounak Sural
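TL;DR: This paper presents Safe2Drive (S2D), a new benchmark for evaluating how end-to-end autonomous driving models perform in safety-critical scenarios. It extends Bench2Drive with 100 scenarios covering common hazards such as work zones, pedestrian jaywalking, and occluded vulnerable road users, and introduces the safety-centric SafeDriving Score (SDS) metric.
Details
Motivation: Although current end-to-end driving policies achieve high driving scores in closed-loop simulation, their ability to handle common safety-critical scenarios remains unclear. The paper evaluates these models' safe-driving behavior in realistic hazardous situations.
Result: When two state-of-the-art policies (LEAD and SimLingo) are evaluated on S2D, their driving scores drop sharply relative to their Bench2Drive baselines (LEAD from 94.70 to 39.95, SimLingo from 85.07 to 41.00) and their SDS is low (11.85 for LEAD, 15.27 for SimLingo). The results point to brittle safe-driving behaviors such as poor work-zone understanding, red-light violations, and late or absent braking for pedestrians.
Insight: The novelty lies in a benchmark focused on safety-critical scenarios (S2D) and a composite safe-driving score (SDS) that combines pre-crash braking, work zone-object contact, lane centering, and smoothness checks. It shows that existing end-to-end models lack safe behavioral reasoning even when the tested scenes are part of the training set, underscoring the importance of dedicated safety evaluation.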
Abstract: Recent end-to-end (E2E) autonomous driving policies achieve high driving scores in closed-loop simulations. Yet it remains unclear whether these policies handle common safety-critical scenarios. We present Safe2Drive (S2D), a set of Bench2Drive-aligned scenario extensions focused on three frequent families of road hazards: work zones, pedestrian jaywalking, and occluded vulnerable road users (VRUs). Safe2Drive adds 100 common but challenging scenarios and introduces SafeDriving Score (SDS), a safety-centric metric that augments prior evaluators with pre-crash braking, work zone-object contact, lane centering, and smoothness checks. Evaluating two state-of-the-art policies (LEAD and SimLingo) on S2D, we find that their driving scores drop sharply relative to their reported Bench2Drive baselines (LEAD: from 94.70 DS on Bench2Drive to 39.95 DS on S2D; SimLingo: from 85.07 DS on Bench2Drive to 41.00 DS on S2D) and that SDS on S2D is low (11.85 for LEAD and 15.27 for SimLingo). These results are consistent with brittle safe-driving behaviors such as poor work-zone understanding, red-light violations, and late or absent braking for pedestrians. This study highlights a lack of safe behavioral reasoning in E2E models even when tested on CARLA towns that are part of the training set. We plan to release the code and videos for all 100 S2D scenarios.
[318] Belief Consistency Between Foundation-Model Evidence and Geometric Perception in Persistent Robotic Maps cs.RO | cs.CV
Christoffer Heckman, Harel Biggie, Brendan Crowe, Nicholas Roy
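TL;DR: This paper proposes an update operator for persistent robotic maps that reconciles belief consistency between foundation-model evidence and geometric perception. Through two mechanisms, a per-class calibrated commit gate and a per-event conflict-drop window, the operator rejects foundation-model claims that the geometric channel contradicts at the moment of the claim. Experiments on KITTI-360 and ScanNet show that the method markedly improves the accuracy of committed maps and maintains deployment quality with both an oracle geometric channel and an off-the-shelf online semantic segmenter.
Details
Motivation: When fusing a geometric perception channel (whose assertions are well characterized) with a foundation-model channel (which produces semantic claims of uncalibrated reliability), current mapping systems usually treat the foundation model as an additional voter, without calibrating its per-class reliability and without flagging contradictions between the two channels.
Result: On KITTI-360, the operator reaches a car commit precision of 99.7% (versus 43.9% for the calibration-only operator) and a mean per-class IoU of 0.522 (versus 0.180). Its effectiveness is also validated on ScanNet, it outperforms a monolithic compositional VLM prompt, and it remains stable under foundation-model substitution.
Insight: The novelty is the dual mechanism of a per-class calibrated commit gate and a per-event conflict-drop window, which handles cross-channel contradictions dynamically and keeps the map consistent. Viewed objectively, the method offers an extensible framework that adapts to different geometric perception sources and foundation models, improving the reliability and practicality of semantic maps.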
Abstract: Persistent maps used by autonomous robots increasingly fuse a geometric perception stack whose assertions are well characterized with a foundation-model channel that produces semantic claims about the same scene without calibrated reliability. Contemporary mapping systems integrate the two channels by treating the foundation-model channel as an additional voter into a per-element posterior, uncalibrated for its own per-class reliability and without machinery to flag when the two channels contradict each other at a given moment. We propose an update operator with two cooperating mechanisms: a per-class calibrated commit gate, and a per-event conflict-drop window that refuses to commit foundation-model claims contradicted by the geometric channel at the moment of the claim. We evaluate on KITTI-360 and ScanNet, with an oracle geometric channel (panoptic ground truth) and an off-the-shelf online semantic segmenter (Mask2Former) to demonstrate real-world performance. The operator produces substantially more accurate committed maps (on KITTI-360, car commit precision of 99.7% vs. 43.9% for the calibration-only operator; mean per-class IoU of 0.522 vs. 0.180) and retains more compositional true positives at higher precision than a monolithic compositional VLM prompt. The framework operates at deployment quality across both oracle and off-the-shelf-segmenter geometric channels, and is invariant under foundation-model substitution.
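One way to read the two cooperating mechanisms is as a pair of checks applied to each foundation-model claim before it is written into the map. The sketch below is an interpretation under stated assumptions, not the paper's operator: the `Claim` fields, the threshold values, the 0.5-second window, and the use of timestamps to represent geometric contradictions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    label: str         # class asserted by the foundation model
    confidence: float  # model confidence for this claim
    timestamp: float   # time of the claim, in seconds


def should_commit(
    claim: Claim,
    thresholds: dict[str, float],
    conflict_times: list[float],
    window: float = 0.5,
) -> bool:
    """Gate a claim on a per-class threshold, then drop it on a nearby conflict."""
    # Per-class gate: classes without a calibrated threshold never commit.
    if claim.confidence < thresholds.get(claim.label, float("inf")):
        return False
    # Conflict-drop window: refuse claims that the geometric channel
    # contradicted within `window` seconds of the claim.
    return all(abs(claim.timestamp - t) > window for t in conflict_times)


thresholds = {"car": 0.8, "pole": 0.95}  # illustrative per-class values
conflicts = [10.0]                       # one geometric contradiction at t = 10 s

committed = should_commit(Claim("car", 0.9, 3.0), thresholds, conflicts)
dropped = should_commit(Claim("car", 0.9, 10.2), thresholds, conflicts)
gated = should_commit(Claim("pole", 0.9, 3.0), thresholds, conflicts)
```

The two checks fail independently: `gated` is rejected for low class-specific confidence, while `dropped` is confident enough but coincides with a contradiction.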
[319] SKIP: Sparse Keyframe Interpolation Paradigm for Efficient Embodied World Models cs.RO | cs.CV
Ziheng He, Yixiang Chen, Ning Yang, Zhanqian Wu, Qisen Ma
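TL;DR: This paper proposes SKIP (Sparse Keyframe Interpolation Paradigm), an efficient embodied world model framework that addresses the high computational cost of dense frame-by-frame inference in pixel space. SKIP identifies task-relevant keyframes, synthesizes only those keyframes, and reconstructs the missing frames with a learned gap predictor and an action-conditioned interpolator, greatly accelerating video generation while preserving task-relevant events such as contact and grasp.
Details
Motivation: Embodied world models show promise in robotics by predicting how robot actions affect the scene, but dense frame-by-frame generation of long-horizon manipulation videos is computationally expensive. The cost cannot be cut by simply dropping frames, because downstream policies depend on sparse task-relevant events being fully preserved.
Result: On the LIBERO benchmark, SKIP generates dense videos 4.16x faster than the dense baseline while improving visual fidelity and reducing aggregate Fréchet Video Distance (FVD) by 89.0%. When SKIP-generated videos are used as policy-training data, success drops by only 1.3 percentage points in LIBERO simulation and 6.7 percentage points on a real robot, whereas fully dense frame-by-frame generation collapses by 48 to 58 percentage points.
Insight: The novelty is an event-preserving sparse-to-dense framework that identifies keyframes from robot-aware multimodal features and interpolates efficiently with learned models, cutting computational cost substantially while maintaining task performance. Viewed objectively, the method moves video generation from a dense-computation paradigm to an event-based, keyframe-driven one, suggesting a new route to deploying efficient embodied world models.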
Abstract: Embodied world models have emerged as a promising paradigm in robotics by predicting how robot actions affect the surrounding scene. However, the rollout inference remains computationally expensive in pixel space, as long-horizon manipulation videos typically have to be generated frame by frame. This cost cannot be easily reduced by indiscriminately dropping frames, since downstream policies rely on complete preservation of sparse task-relevant events such as approach, contact, grasp, and release. To address this challenge, we propose Sparse Keyframe Interpolation Paradigm (SKIP), an event-preserving sparse-to-dense framework that avoids dense frame-by-frame generation. SKIP first identifies task-relevant keyframes by leveraging robot-aware multimodal features. It then synthesizes only these keyframes with a sparse video diffusion model. A learned gap predictor and an action-conditioned interpolator subsequently reconstruct the missing intervals according to the robot actions. On LIBERO, SKIP generates dense rollouts $4.16\times$ faster than a dense baseline while improving visual fidelity and reducing aggregate FVD by $89.0\%$. Importantly, SKIP-generated videos are effective policy-training data. Even when they fully replace real demonstrations, $\pi_{0.5}$ success drops only $1.3$ pp in LIBERO simulation and $6.7$ pp on the real robot, whereas fully dense frame-by-frame generation collapses by $48$ to $58$ pp.
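The sparse-to-dense step can be illustrated with a deliberately simple substitute: plain linear blending between keyframes, with gap lengths supplied by hand. SKIP itself uses a learned gap predictor and an action-conditioned interpolator, so the sketch below only shows the data flow, and `densify` is a hypothetical name.

```python
def densify(keyframes: list[list[float]], gaps: list[int]) -> list[list[float]]:
    """Rebuild a dense sequence from sparse keyframes by linear blending.

    keyframes: frames given as flat lists of pixel values.
    gaps: gaps[i] is the number of frames missing between keyframe i and
          keyframe i + 1 (given by hand here; SKIP predicts it with a model).
    """
    if len(gaps) != len(keyframes) - 1:
        raise ValueError("need one gap per consecutive keyframe pair")
    dense = []
    for start, end, gap in zip(keyframes, keyframes[1:], gaps):
        dense.append(start)
        for k in range(1, gap + 1):
            w = k / (gap + 1)  # blend weight moves from start toward end
            dense.append([(1 - w) * a + w * b for a, b in zip(start, end)])
    dense.append(keyframes[-1])
    return dense


# Three two-pixel keyframes with 3 and 1 missing frames between them.
video = densify([[0.0, 0.0], [4.0, 8.0], [4.0, 8.0]], gaps=[3, 1])
```

Every keyframe appears unchanged in the output, which is the property that matters for preserving events such as contact or grasp; only the in-between frames are synthesized.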
[320] Expanding Spatial and Temporal Context for Robotic Imitation Learning With Scene Graphs cs.RO | cs.CV
Jianing Qian, Qinhe Peng, Emmanuel Panov, Leonor Fermoselle, Dinesh Jayaraman
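TL;DR: This paper proposes an imitation learning method that uses scene graphs as an explicit, structured memory mechanism. It targets two challenges in robot learning: severe partial observability caused by large spatial scales, and multi-subtask sequences that require long-horizon reasoning. By maintaining a dynamic scene graph that captures object-centric relationships and their evolution over time, the agent retains relevant historical context during task execution and can reason efficiently over incrementally accrued scene information.
Details
Motivation: Real-world environments such as homes and offices are severely partially observed because of their large spatial scale, and many tasks require autonomous robots to execute sequences of subtasks while reasoning over long time horizons.
Result: In simulated mobile manipulation and real-world tabletop manipulation experiments, the method substantially improves policy performance, particularly in settings that demand long-term reasoning and robust generalization under partial observability.
Insight: The novelty lies in bringing scene graphs into imitation learning as an explicit, structured memory that dynamically captures how object relationships evolve, supporting long-term context retention and reasoning over incremental information. This offers a reusable structured representation for tasks with partial observability and temporal dependencies.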
Abstract: Imitation learning enables robots to learn how to execute tasks via observation. However, real-world environments like homes and offices are often severely partially observed due to their large spatial scales. In addition, many tasks involve executing a series of subtasks requiring autonomous robots to reason over extended time horizons. To address these challenges, we propose using scene graphs as an explicit and structured memory mechanism in imitation learning. By maintaining a dynamic scene graph that captures object-centric relationships and their evolution over time, our method allows the agent to retain relevant historical context during task execution to efficiently reason over incrementally accrued scene information. Our experiments on simulated mobile manipulation and real-world tabletop manipulation demonstrate that our approach substantially improves policy performance, particularly in settings that demand long-term reasoning and robust generalization under partial observability.
[321] DeepIPCv3: Event-Aware Multi-Modal Sensor Fusion for Sudden Pedestrian Crossing Avoidance cs.RO | cs.AI | cs.CV | eess.IV | eess.SY
Oskar Natan, Andi Dharmawan, Aufaclav Zatu Kusuma Frisky, Jazi Eko Istiyanto, Jun Miura
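TL;DR: This paper proposes DeepIPCv3, a novel multi-modal end-to-end autonomous driving framework that addresses the safety vulnerability of sudden pedestrian crossings. Using a Transformer-inspired cross-modal attention mechanism, it fuses the dense 3D spatial geometry of LiDAR point clouds with the microsecond-level asynchronous event stream of a Dynamic Vision Sensor (DVS) to overcome the perception latency and motion blur of conventional frame-based sensors. A hybrid policy network maps the fused representation to safe local waypoints and executable control commands.
Details
Motivation: Current end-to-end autonomous driving systems rely mainly on frame-based sensors, which suffer inherent perception latency and motion blur in highly dynamic scenes such as sudden pedestrian crossings, creating a critical safety risk.
Result: Extensive comparative and ablation studies on a custom multi-modal dataset (covering well-lit and challenging evening conditions) show that DeepIPCv3 achieves state-of-the-art predictive performance, with the lowest trajectory and control command errors, and effectively eliminates exposure failures and motion blur.
Insight: The novelty lies in deeply fusing LiDAR's 3D geometry with the DVS's high-speed asynchronous event stream, and in a Transformer-inspired cross-modal attention mechanism that dynamically correlates the modalities so that the network can immediately prioritize high-speed dynamic updates without losing structural scene awareness. The hybrid policy network, which blends heuristic trajectory tracking with direct neural prediction, is also a design worth borrowing.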
Abstract: Current end-to-end autonomous driving systems predominantly rely on frame-based sensors, which suffer from inherent perception latency and motion blur during highly dynamic encounters, specifically sudden pedestrian crossings. To address this critical safety vulnerability, we propose DeepIPCv3, a novel multi-modal autonomous navigation framework that synergizes the dense 3D spatial geometry of LiDAR point clouds with the microsecond-level asynchronous event streams of a Dynamic Vision Sensor (DVS). We introduce a Transformer-inspired cross-modal attention mechanism to dynamically correlate these distinct modalities, allowing the network to instantaneously prioritize high-speed dynamic updates without sacrificing structural scene awareness. The fused latent representations are then mapped to safe local waypoints and executable control commands via a hybrid policy network that blends heuristic trajectory tracking with direct neural predictions. Due to the severe physical risks associated with live testing of these sudden crossing scenarios, the framework is rigorously evaluated offline using a custom multi-modal dataset collected across both well-illuminated noon and challenging evening conditions. Extensive comparative and ablation studies demonstrate that DeepIPCv3 achieves state-of-the-art predictive performance. By effectively eliminating exposure failures and motion blur, the proposed LiDAR and DVS fusion yields the lowest trajectory and control command errors, enabling highly reactive, mathematically bounded evasive maneuvers regardless of ambient illumination. To support future research, we will release the code in our GitHub repository at https://github.com/oskarnatan/DeepIPCv3.
[322] ActMVS: Active Scene Reconstruction with Monocular Multi-View Stereo cs.RO | cs.CV
Guo Pu, Yixuan Han, Zhouhui Lian
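TL;DR: ActMVS is the first framework for monocular active scene reconstruction, letting robots or UAVs plan trajectories autonomously and reconstruct their environment using monocular vision alone. By integrating view factor graph construction with global depth optimization, it generates high-quality, globally consistent dense depth maps online, supporting reliable occupancy map maintenance and safe path planning.
Details
Motivation: Active scene reconstruction requires building high-confidence occupancy maps in real time for collision-free navigation, but existing methods rely on depth sensors, which add platform cost and weight. Current monocular scene reconstruction methods are mostly offline and cannot deliver the real-time, globally consistent dense depth that robot navigation needs, so a vision-only online solution is required.
Result: Experiments on the Replica dataset show that ActMVS is competitive with RGB-D methods that rely on depth sensors.
Insight: The novelty lies in the first monocular active reconstruction framework: view factor graph construction guides multi-view stereo depth prediction and, combined with global depth optimization, produces globally consistent dense depth maps online, making vision-only real-time occupancy map updates and path planning feasible.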
Abstract: Active scene reconstruction enables robots/UAVs to autonomously plan trajectories and reconstruct environments without costly manual data acquisition. Unlike passive methods, active reconstruction requires real-time construction of high-confidence occupancy maps for collision-free navigation. Existing approaches rely on depth sensors for occupancy map updates, increasing platform cost and weight. To advance spatial intelligence, we aim for a vision-only monocular solution. However, current monocular scene reconstruction methods operate offline and fail to deliver globally consistent dense depth at the frame rates required for robot/UAV navigation. To bridge this gap, we introduce ActMVS, the first framework for monocular active reconstruction. Our framework integrates a view factor graph construction for informed Multi-View Stereo depth prediction, along with a global depth optimization, to enable the online generation of high-quality, globally consistent dense depth maps. This enables monocular robots/UAVs to maintain reliable occupancy maps for safe trajectory planning during reconstruction. Experiments on the Replica dataset demonstrate performance competitive with RGB-D methods. Our code and data are available at https://github.com/TrickyGo/ActMVS.
[323] Hierarchical Semantic-Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation cs.RO | cs.CV
Xiang Fang, Wanlong Fang, Changshuo Wang
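TL;DR: This paper proposes the Hierarchical Semantic-Augmented Navigation (HSAN) framework for Vision-Language Navigation in Continuous Environments (VLN-CE). By building a dynamic hierarchical semantic scene graph, using an optimal transport-based topological planner, and applying a graph-aware reinforcement learning policy, the framework achieves deep understanding of complex indoor environments and robust navigation.
Details
Motivation: Existing VLN-CE methods often perform poorly on long-horizon tasks because of limited scene understanding, inefficient planning, and the lack of a robust decision-making framework; HSAN aims to address these shortcomings.
Result: Extensive experiments on multiple challenging VLN-CE datasets show that HSAN achieves state-of-the-art performance, with significant gains in navigation success and in generalization to unseen environments.
Insight: The novelty lies in combining spectral graph theory, optimal transport, and advanced multi-modal learning; with a dynamic hierarchical semantic representation and a planner carrying theoretical optimality guarantees, it moves beyond the static maps and heuristic planners of prior work.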
Abstract: Vision-Language Navigation in Continuous Environments (VLN-CE) poses a formidable challenge for autonomous agents, requiring seamless integration of natural language instructions and visual observations to navigate complex 3D indoor spaces. Existing approaches often falter in long-horizon tasks due to limited scene understanding, inefficient planning, and lack of robust decision-making frameworks. We introduce the Hierarchical Semantic-Augmented Navigation (HSAN) framework, a groundbreaking approach that redefines VLN-CE through three synergistic innovations. First, HSAN constructs a dynamic hierarchical semantic scene graph, leveraging vision-language models to capture multi-level environmental representations, from objects to regions to zones, enabling nuanced spatial reasoning. Second, it employs an optimal transport-based topological planner, grounded in Kantorovich’s duality, to select long-term goals by balancing semantic relevance and spatial accessibility with theoretical guarantees of optimality. Third, a graph-aware reinforcement learning policy ensures precise low-level control, navigating subgoals while robustly avoiding obstacles. By integrating spectral graph theory, optimal transport, and advanced multi-modal learning, HSAN addresses the shortcomings of static maps and heuristic planners prevalent in prior work. Extensive experiments on multiple challenging VLN-CE datasets demonstrate that HSAN achieves state-of-the-art performance, with significant improvements in navigation success and generalization to unseen environments.
[324] WALL-WM: Carving World Action Modeling at the Event Joints cs.RO | cs.CV
Shalfun Li, Victor Yao, Charles Yang, Truth Qu, Regis Cheng
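TL;DR: WALL-WM is a World Action Model that shifts video-action learning from optimization centered on fixed-length chunks to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the basic unit of learning. With an event-level data ecosystem and two complementary inference modes (event mode and unified mode), it enables scalable learning over diverse behaviors, scenes, and task structures and achieves state-of-the-art performance in large-scale real-world generalization evaluation.
Details
Motivation: Existing World Action Models are usually initialized from multimodal or video foundation models and optimize fixed-length action chunks conditioned on the current observation and instruction. This creates a granularity mismatch among the semantic events described by language, the continuous scene dynamics of vision, and the control-level timescales of actions, reducing Vision-Language-Action training to short-horizon correlation fitting.
Result: Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.
Insight: The core innovation is organizing supervision and data around semantic events as the atomic unit, resolving the granularity mismatch through event-grounded Vision-Language-Action pretraining and a data ecosystem (with event-level captions and cluster-balanced sampling). The model also supports two complementary inference modes, an event mode that handles variable-length execution chunks and a unified mode that uses a VLM with Staircase Decoding for fixed-length chunk inference, giving a scalable pretraining recipe for general-purpose World Action Models.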
Abstract: WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks, while the unified mode uses a VLM with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based large-scale pretraining infrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.
[325] RoboDream: Compositional World Models for Scalable Robot Data Synthesis cs.RO | cs.CV
Junjie Ye, Rong Xue, Basile Van Hoorick, Runhao Li, Harshitha Rajaprakash
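TL;DR: This paper proposes RoboDream, a generalizable embodiment-centric world model that generates robot data at scale by synthesizing photorealistic demonstrations with novel objects, novel scenes, and novel viewpoints. The method anchors generation to rendered robot motion and conditions on explicit scene and object priors, effectively decoupling trajectory execution from environment synthesis.
Details
Motivation: Robot learning needs large-scale, diverse demonstration data, but collecting real-world data through teleoperation is expensive and time-consuming. Existing data generation methods based on video diffusion models are usually limited to superficial visual augmentation or suffer from embodiment hallucinations that yield physically infeasible motions.
Result: Real-world experiments show that the generated data consistently improves downstream policy performance and significantly reduces the need for real data across a range of manipulation tasks.
Insight: The novelty is a generative framework that decouples trajectory execution from environment synthesis, enabling two powerful data scaling capabilities, retrieval and rebirth and prop-free teleoperation, and thereby efficient generation of physically feasible, photorealistic robot demonstrations.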
Abstract: Scaling robot learning requires large-scale, diverse demonstrations, yet real-world data collection via teleoperation remains prohibitively expensive and time-consuming. While video diffusion models offer a promising avenue for data scaling, existing generative approaches are often limited to superficial visual augmentation, or suffer from embodiment hallucinations that yield physically infeasible motions. We present a generalizable embodiment-centric world model that achieves scalable data generation by synthesizing photorealistic demonstrations with novel objects, in novel scenes, and from novel viewpoints. Our approach anchors generation to rendered robot motion while conditioning on explicit scene and object priors, effectively decoupling trajectory execution from environment synthesis. This formulation has the potential to unlock two powerful data scaling capabilities: (1) retrieval and rebirth, which repurposes existing trajectories into entirely new contexts without new motion data; and (2) prop-free teleoperation, where operators manipulate empty air and the model hallucinates the target objects and scene afterwards, eliminating reset time. We demonstrate with real-world experiments that our generated data consistently improves downstream policy performance and significantly reduces real-world data requirements across diverse manipulation tasks.
eess.IV
[326] Multi-Contrast MRI Motion Correction via Parameter-Informed Disentanglement and Adaptive Experts eess.IV | cs.AI | cs.CV
Honglin Xiong, Yuxian Tang, Feng Li, Yulin Wang, Lei Xiang
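TL;DR: This paper proposes a unified framework for motion-artifact correction in multi-contrast MRI that combines parameter-informed contrast disentanglement with severity-aware adaptive correction. It uses a pretrained ScanCLIP model to extract contrast embeddings from scan parameters and disentangle anatomical content from contrast style. A Vision Transformer then estimates motion severity and routes features to a Mixture-of-Experts network for targeted correction, and a dual-pathway decoder reconstructs the clean image and a residual artifact map.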
Details
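Motivation: Existing deep learning methods for MRI motion correction are typically designed for a specific contrast and do not generalize across modalities and artifact severities.
Result: On the IXI and HCP benchmarks, the method improves PSNR by 0.75 dB and SSIM by up to 0.0279 over state-of-the-art methods, with larger gains at higher artifact severities. On real-world clinical data with unseen scan parameters, it shows strong zero-shot generalization, whereas existing methods either fail to remove artifacts or introduce additional distortions.
Insight: The key idea is to use scan parameters as prior knowledge to disentangle contrast style from anatomical content, combined with a severity-aware adaptive expert network for targeted correction. This parameter-informed disentanglement and adaptive routing offers a new approach to generalizing MRI motion correction across contrasts and severities.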
Abstract: Motion artifacts in magnetic resonance imaging (MRI) degrade diagnostic reliability. Existing deep learning methods are typically contrast-specific and fail to generalize across diverse modalities and artifact severities. We propose a unified framework combining parameter-informed contrast disentanglement with severity-aware adaptive correction. ScanCLIP, pretrained on over 30,000 MRI text-image pairs, derives contrast embeddings from acquisition parameters to disentangle contrast style from anatomical content, yielding contrast-free features. A Vision Transformer then estimates motion severity and routes features through a Mixture-of-Experts network, enabling targeted artifact correction. A dual-pathway decoder reconstructs both the clean image and residual artifact map, enforcing image-space consistency. On IXI and HCP benchmarks, our method improves PSNR by 0.75 dB and SSIM by up to 0.0279 over state-of-the-art approaches, with larger gains at higher artifact severities. It further demonstrates robust zero-shot generalization on real-world clinical data acquired with unseen scanning parameters, where existing methods either fail to remove artifacts or introduce additional distortions.
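To put the reported 0.75 dB PSNR gain in perspective, the short calculation below converts a PSNR difference into the implied change in mean squared error, using the standard definition PSNR = 10·log10(peak² / MSE). It assumes both methods are scored against the same peak value; the function names are illustrative and not from the paper.

```python
import math


def psnr(mse, peak=1.0):
    """Standard PSNR in dB for a given mean squared error and peak value."""
    return 10 * math.log10(peak ** 2 / mse)


def mse_ratio_from_psnr_gain(gain_db):
    """MSE_new / MSE_old implied by a PSNR gain, assuming the same peak value."""
    return 10 ** (-gain_db / 10)


ratio = mse_ratio_from_psnr_gain(0.75)
print(f"MSE ratio for a 0.75 dB gain: {ratio:.3f}")
```

A 0.75 dB gain therefore corresponds to an MSE ratio of about 0.84, that is, roughly 16% lower mean squared error. This is pure arithmetic on the reported figure and says nothing about how the error is distributed across images or artifact severities.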
[327] PINNOCHIO: Physics-Informed Neural Network for Coupled Hyperelastic Interface-Volume Simulation in Orthognathic Surgery eess.IV | cs.CV
Jungwook Lee, Daeseung Kim, Kevin Gu, Zhangfeng Hu, Tianshu Kuang
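TL;DR: This paper proposes PINNOCHIO, a physics-informed neural network framework for coupled hyperelastic interface-volume simulation of facial soft tissue in orthognathic surgery. Through a hybrid sequential decomposition, it decouples discontinuous bone–soft-tissue interface movements from continuous volumetric hyperelastic deformation, achieving stable training and physically consistent soft-tissue deformation predictions from only partial clinical surface data.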
Details
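Motivation: Predicting patient-specific facial soft-tissue deformation is a hard problem in orthognathic surgery planning. Existing methods face a strict accuracy-efficiency trade-off: high-fidelity finite element methods are computationally prohibitive, while pure deep learning models lack biomechanical consistency. Conventional physics-informed neural networks, in turn, train unstably when learning complex heterogeneous mechanics from only partial clinical surface data.
Result: Evaluated on a 40-patient clinical cohort, PINNOCHIO outperforms existing baselines in both surface prediction accuracy and physical validity, and achieves a substantial speedup over the finite element method, addressing the accuracy-efficiency trade-off.
Insight: The main contribution is a hybrid sequential decomposition framework that splits the complex coupled interface-volume mechanics problem into more learnable subproblems, together with a physics-enabled sim-to-real adaptation strategy that ensures internal biomechanical consistency without volumetric ground truth, providing a reliable tool for interactive surgical planning.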
Abstract: Predicting patient-specific facial soft-tissue deformation is critical for iterative orthognathic surgery planning. However, current computational methods face a strict accuracy-efficiency trade-off: high-fidelity Finite Element Methods (FEM) are computationally prohibitive, whereas pure deep learning models often produce biomechanically inconsistent results. While Physics-Informed Neural Networks (PINNs) offer a promising avenue, learning the complex heterogeneous mechanics of bone–soft-tissue interactions with only partial clinical supervision (i.e., outer facial surfaces) remains highly unstable. To overcome these challenges, we present PINNOCHIO, a novel physics-informed framework for facial soft-tissue simulation. PINNOCHIO introduces a hybrid sequential decomposition that explicitly decouples discontinuous bone–soft-tissue interface movements from continuous volumetric hyperelastic deformation. This structural separation enables stable training and facilitates a physics-enabled sim-to-real adaptation strategy, ensuring internal biomechanical consistency without requiring volumetric ground truth. Evaluated on a 40-patient clinical cohort, PINNOCHIO outperforms existing baselines in both surface accuracy and physical validity. Furthermore, it achieves a substantial speedup over FEM, successfully resolving the accuracy-efficiency trade-off to provide a highly reliable and practical tool for interactive surgical planning.