cs.CL [Total: 18]
cs.CV [Total: 66]
cs.NI [Total: 1]
cs.RO [Total: 5]
cs.DL [Total: 1]
cs.CR [Total: 1]
cs.AR [Total: 1]
eess.IV [Total: 3]
eess.AS [Total: 1]
cs.LG [Total: 9]
cs.AI [Total: 4]

cs.CL

[1] Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis cs.CL | cs.AI

May Lynn Reese, Markela Zeneli, Mindy Ng, Jacob Haimes, Andreea Damien

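TL;DR: This paper proposes a scalable, clinically validated method for evaluating the safety risks of large language models in mental health support settings, particularly with users experiencing psychosis, using an LLM as an evaluator (LLM-as-a-Judge) or a panel of LLMs (LLM-as-a-Jury). The method first develops and validates seven clinician-informed safety criteria, then builds a human-consensus dataset, and finally tests the performance of automated assessment.

Details

Motivation: General-purpose LLMs are widely used for mental health support, but they risk reinforcing delusions and hallucinations in people with psychosis. Existing evaluation methods lack clinical validation and scalability, so a safety evaluation framework that is both reliable and scalable is needed.

Result: A single LLM judge (Gemini) agrees closely with the human consensus (Cohen’s κ = 0.75), and the best single judge slightly outperforms the LLM jury (majority vote, κ = 0.74), which supports the feasibility of automated assessment.

Insight: The novelty lies in translating clinical expertise into actionable safety criteria and systematically validating the LLM-as-a-Judge/Jury paradigm on mental health, a high-risk and fine-grained evaluation task. This offers a new route to scalable, clinically grounded LLM safety evaluation.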

Abstract: General-purpose Large Language Models (LLMs) are becoming widely adopted by people for mental health support. Yet emerging evidence suggests there are significant risks associated with high-frequency use, particularly for individuals suffering from psychosis, as LLMs may reinforce delusions and hallucinations. Existing evaluations of LLMs in mental health contexts are limited by a lack of clinical validation and scalability of assessment. To address these issues, this research focuses on psychosis as a critical condition for LLM safety evaluation by (1) developing and validating seven clinician-informed safety criteria, (2) constructing a human-consensus dataset, and (3) testing automated assessment using an LLM as an evaluator (LLM-as-a-Judge) or taking the majority vote of several LLM judges (LLM-as-a-Jury). Results indicate that LLM-as-a-Judge aligns closely with the human consensus (Cohen’s $κ_{\text{human} \times \text{gemini}} = 0.75$, $κ_{\text{human} \times \text{qwen}} = 0.68$, $κ_{\text{human} \times \text{kimi}} = 0.56$) and that the best judge slightly outperforms LLM-as-a-Jury (Cohen’s $κ_{\text{human} \times \text{jury}} = 0.74$). Overall, these findings have promising implications for clinically grounded, scalable methods in LLM safety evaluations for mental health contexts.
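The agreement statistic and jury vote described in the abstract can be sketched in a few lines. This is a minimal illustration only: the labels (`safe`, `unsafe`), the three hypothetical judges, and the six toy items are assumptions made for the example and are not data from the paper, and the paper's own scoring pipeline may differ.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def jury_vote(judge_labels):
    """Per-item majority vote across judges (an odd number of judges avoids ties)."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*judge_labels)]


# Toy data, invented for illustration only.
human = ["safe", "unsafe", "safe", "safe", "unsafe", "safe"]
judge_a = ["safe", "unsafe", "safe", "unsafe", "unsafe", "safe"]
judge_b = ["safe", "safe", "safe", "safe", "unsafe", "safe"]
judge_c = ["unsafe", "unsafe", "safe", "safe", "unsafe", "safe"]

jury = jury_vote([judge_a, judge_b, judge_c])
print("kappa(human, judge_a):", cohens_kappa(human, judge_a))
print("kappa(human, jury):", cohens_kappa(human, jury))
```

In this toy case the jury happens to beat every individual judge; the paper reports the opposite ordering for its best judge, which is a reminder that majority voting helps only when the judges' errors are not shared.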

[2] Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets cs.CL | cs.MA

Dat Tran, Douwe Kiela

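TL;DR: Through an information-theoretic analysis and an empirical study, this paper finds that under equal reasoning-token budgets, single-agent LLM systems match or even outperform multi-agent systems on multi-hop reasoning tasks. It also exposes compute and context biases in existing evaluation methods.

Details

Motivation: Performance gains reported for multi-agent LLM systems are often confounded with additional compute. The study compares the information efficiency of single-agent and multi-agent systems under a fair compute budget and clarifies the theoretical basis and evaluation methodology of that comparison.

Result: In controlled experiments across three model families (Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5), single-agent systems consistently match or outperform multi-agent architectures on multi-hop reasoning tasks. The study also finds that artifacts in API-based budget control and in standard benchmarks can inflate the apparent gains of MAS.

Insight: The novelty is an information-theoretic argument based on the Data Processing Inequality, suggesting that a single agent is more information-efficient under a fixed token budget with perfect context utilization. On the practical side, the paper stresses explicitly controlling the trade-offs among compute, context, and coordination, and shows how unaccounted compute and context effects distort evaluations.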

Abstract: Recent work reports strong performance from multi-agent LLM systems (MAS), but these gains are often confounded by increased test-time computation. When computation is normalized, single-agent systems (SAS) can match or outperform MAS, yet the theoretical basis and evaluation methodology behind this comparison remain unclear. We present an information-theoretic argument, grounded in the Data Processing Inequality, suggesting that under a fixed reasoning-token budget and with perfect context utilization, single-agent systems are more information-efficient. This perspective further predicts that multi-agent systems become competitive when a single agent’s effective context utilization is degraded, or when more compute is expended. We test these predictions in a controlled empirical study across three model families (Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5), comparing SAS with multiple MAS architectures under matched budgets. We find that SAS consistently match or outperform MAS on multi-hop reasoning tasks when reasoning tokens are held constant. Beyond aggregate performance, we conduct a detailed diagnostic analysis of system behavior and evaluation methodology. We identify significant artifacts in API-based budget control (particularly in Gemini 2.5) and in standard benchmarks, both of which can inflate apparent gains from MAS. Overall, our results suggest that, for multi-hop reasoning tasks, many reported advantages of multi-agent systems are better explained by unaccounted computation and context effects rather than inherent architectural benefits, and highlight the importance of understanding and explicitly controlling the trade-offs between compute, context, and coordination in agentic systems.
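For reference, the Data Processing Inequality invoked in the abstract has the following standard form. The mapping of the variables onto agents in the comments is an illustrative reading, not the paper's own formalization.

```latex
% Standard statement: if X -> Y -> Z forms a Markov chain
% (Z depends on X only through Y), then
\[
  I(X; Z) \le I(X; Y).
\]
% Illustrative reading (an assumption, not taken from the paper):
%   X = the full task context,
%   Y = the message one agent passes on,
%   Z = what a downstream agent can infer from that message.
% Post-processing Y cannot create information about X that Y lacks.
```

Under this reading, each hand-off between agents can only preserve or lose information about the original context, which is consistent with the abstract's prediction that multi-agent systems become competitive mainly when a single agent cannot use its context well.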

[3] Failing to Falsify: Evaluating and Mitigating Confirmation Bias in Language Models cs.CL | cs.LG

Ayush Rajesh Jhaveri, Anthony GX-Chen, Ilia Sucholutsky, Eunsol Choi

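TL;DR: This paper studies whether large language models (LLMs) exhibit confirmation bias, the tendency to seek evidence that supports rather than challenges one’s own hypothesis. By adapting a rule-discovery study from human psychology, the authors find that a range of LLMs do exhibit confirmation bias, which hinders their ability to discover hidden rules. The paper then explores prompting methods based on intervention strategies developed for humans, which reduce confirmation bias and raise rule-discovery rates, and distills the intervention-induced behavior into models, showing generalization to a new task.

Details

Motivation: To assess whether LLMs, like humans, exhibit confirmation bias during hypothesis exploration. Such bias limits reasoning ability, so the work targets a potential weakness of LLMs in logical reasoning tasks.

Result: In the rule-discovery task, prompting interventions reduce confirmation bias and raise the average rule-discovery rate from 42% to 56%. The distilled behavior shows promising generalization on the Blicket test.

Insight: The novelty lies in applying confirmation-bias tests and intervention strategies from human psychology to the evaluation and mitigation of bias in LLMs. The work reveals a cognitive limitation of LLMs and mitigates it through prompt engineering and model distillation, offering a new angle for improving model reasoning.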

Abstract: Confirmation bias, the tendency to seek evidence that supports rather than challenges one’s belief, hinders one’s reasoning ability. We examine whether large language models (LLMs) exhibit confirmation bias by adapting the rule-discovery study from human psychology: given a sequence of three numbers (a “triple”), an agent engages in an interactive feedback loop where it (1) proposes a new triple, (2) receives feedback on whether it satisfies the hidden rule, and (3) guesses the rule. Across eleven LLMs of multiple families and scales, we find that LLMs exhibit confirmation bias, often proposing triples to confirm their hypothesis rather than trying to falsify it. This leads to slower and less frequent discovery of the hidden rule. We further explore intervention strategies (e.g., encouraging the agent to consider counter examples) developed for humans. We find prompting LLMs with such instruction consistently decreases confirmation bias in LLMs, improving rule discovery rates from 42% to 56% on average. Lastly, we mitigate confirmation bias by distilling intervention-induced behavior into LLMs, showing promising generalization to a new task, the Blicket test. Our work shows that confirmation bias is a limitation of LLMs in hypothesis exploration, and that it can be mitigated via injecting interventions designed for humans.
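A small sketch can show why confirming probes slow down rule discovery in this setup. The hidden rule ("strictly ascending") and the agent's narrower hypothesis ("each number is 2 more than the previous") are assumptions borrowed from the classic 2-4-6 version of the task; the abstract does not state which rules the paper uses.

```python
def hidden_rule(triple):
    """True rule, unknown to the agent (assumed here): strictly ascending."""
    a, b, c = triple
    return a < b < c


def hypothesis(triple):
    """Agent's narrower guess (assumed here): each number is 2 more than the previous."""
    a, b, c = triple
    return b - a == 2 and c - b == 2


def is_informative(triple):
    """A probe can expose a wrong hypothesis only if the feedback
    disagrees with what the hypothesis predicts."""
    return hypothesis(triple) != hidden_rule(triple)


# Confirming probes: chosen to satisfy the agent's own hypothesis.
confirming = [(1, 3, 5), (10, 12, 14), (20, 22, 24)]
# Falsifying probes: chosen to violate the agent's own hypothesis.
falsifying = [(1, 2, 3), (5, 10, 20), (3, 2, 1)]

confirming_hits = [t for t in confirming if is_informative(t)]
falsifying_hits = [t for t in falsifying if is_informative(t)]
print("informative confirming probes:", confirming_hits)
print("informative falsifying probes:", falsifying_hits)
```

When the hypothesis is a special case of the hidden rule, every confirming probe receives a "yes" and the agent never learns that its guess is too narrow. Only probes that violate the hypothesis can return the surprising "yes" that forces a revision.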

Roland Mühlenbernd

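TL;DR: This paper examines whether large language models (LLMs) approximate human social meaning inference in both structure and magnitude, and whether prompting strategies grounded in pragmatic theory improve that approximation. It introduces two metrics, the Effect Size Ratio (ESR) and the Calibration Deviation Score (CDS), and runs a case study on numerical (im)precision across three frontier LLMs. All models reliably reproduce the qualitative structure of human social inferences but differ substantially in magnitude calibration. Prompting models to reason about speaker knowledge and motives most consistently reduces magnitude deviation, and combining it with reasoning over linguistic alternatives is the only intervention that improves every calibration-sensitive metric.

Details

Motivation: To investigate whether LLMs approximate human social meaning inference not only qualitatively but also quantitatively, and whether prompting strategies grounded in pragmatic theory can improve this approximation.

Result: In the case study on numerical (im)precision, all three frontier LLMs reliably reproduce the qualitative structure of human social inferences but differ substantially in magnitude calibration. The prompt that combines reasoning about speaker knowledge and motives with reasoning over linguistic alternatives is the only intervention that improves all calibration-sensitive metrics (including ESR and CDS) across all models, although fine-grained magnitude calibration is only partially resolved.

Insight: The novelty lies in introducing quantitative metrics (ESR and CDS) that separate structural fidelity from magnitude calibration, and in systematically deriving prompting strategies from pragmatic theory (reasoning over linguistic alternatives, and inferring speaker knowledge states and communicative motives) to improve how LLMs approximate social meaning. Viewed objectively, separating qualitative structure assessment from quantitative magnitude calibration, and providing a framework for theory-driven prompt engineering, is a useful direction for improving the social reasoning of LLMs.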

Abstract: Large language models (LLMs) increasingly exhibit human-like patterns of pragmatic and social reasoning. This paper addresses two related questions: do LLMs approximate human social meaning not only qualitatively but also quantitatively, and can prompting strategies informed by pragmatic theory improve this approximation? To address the first, we introduce two calibration-focused metrics distinguishing structural fidelity from magnitude calibration: the Effect Size Ratio (ESR) and the Calibration Deviation Score (CDS). To address the second, we derive prompting conditions from two pragmatic assumptions: that social meaning arises from reasoning over linguistic alternatives, and that listeners infer speaker knowledge states and communicative motives. Applied to a case study on numerical (im)precision across three frontier LLMs, we find that all models reliably reproduce the qualitative structure of human social inferences but differ substantially in magnitude calibration. Prompting models to reason about speaker knowledge and motives most consistently reduces magnitude deviation, while prompting for alternative-awareness tends to amplify exaggeration. Combining both components is the only intervention that improves all calibration-sensitive metrics across all models, though fine-grained magnitude calibration remains only partially resolved. LLMs thus capture inferential structure while variably distorting inferential strength, and pragmatic theory provides a useful but incomplete handle for improving that approximation.

[5] Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge cs.CL | cs.LG

Yiyang Shen, Lifu Tu, Weiran Wang

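TL;DR: This paper proposes a reinforcement learning-based knowledge distillation framework that uses a large language model (LLM) as a judge to evaluate model outputs over large amounts of unlabeled data and produce reward signals, so that distillation needs no ground-truth labels. The judge emits a single token, which keeps reward computation efficient, and when combined with verifiable rewards the method yields substantial gains on math reasoning benchmarks.

Details

Motivation: Existing reinforcement learning methods typically rely on verifiable rewards (that is, ground-truth labels) to improve the reasoning ability of language models, which limits their use on unlabeled data. This paper addresses the problem with a reinforcement learning framework that requires no ground-truth supervision.

Result: On math reasoning benchmarks (e.g., GSM8K and MATH), the method yields substantial performance gains when combined with verifiable rewards, suggesting that LLM-based evaluators can provide effective training signals for RL fine-tuning.

Insight: The novelty lies in using an LLM as an efficient judge (single-token output) to generate rewards without supervision, which enables label-free knowledge distillation and extends the reach of reinforcement learning to unlabeled data.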

Abstract: Reinforcement Learning (RL) has been shown to substantially improve the reasoning capability of small and large language models (LLMs), but existing approaches typically rely on verifiable rewards, and hence on ground-truth labels. We propose an RL framework that uses rewards from an LLM that acts as a judge evaluating model outputs over large amounts of unlabeled data, enabling label-free knowledge distillation and removing the need for ground-truth supervision. Notably, the judge operates with a single-token output, making reward computation efficient. When combined with verifiable rewards, our approach yields substantial performance gains across math reasoning benchmarks. These results suggest that LLM-based evaluators can produce effective training signals for RL fine-tuning.

[6] Train Yourself as an LLM: Exploring Effects of AI Literacy on Persuasion via Role-playing LLM Training cs.CL

Qihui Fan, Min Ge, Chenyan Jia, Weiyan Shi

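TL;DR: This paper introduces LLMimic, a role-play-based, interactive, gamified AI literacy tutorial in which participants play the role of a large language model (LLM) and experience three key stages of its training pipeline (pretraining, supervised fine-tuning, and RLHF). In a 2×3 between-subjects study (N=274), participants who used LLMimic, compared with a control group that watched an AI history video, showed significantly higher AI literacy, lower persuasion success across several AI persuasion scenarios (charity donation, malicious money solicitation, hotel recommendation), and higher truthfulness and social responsibility levels in the hotel recommendation scenario.

Details

Motivation: As large language models (LLMs) become increasingly persuasive, there is concern that they may influence people’s opinions and decisions at scale. Existing mitigations (such as AI detectors and disclaimers) mostly treat people as passive recipients of AI-generated information. This paper aims to provide a more proactive intervention against persuasive AI.

Result: LLMimic significantly improved participants’ AI literacy (p < .001), reduced persuasion success across three realistic AI persuasion scenarios (charity donation, malicious money solicitation, hotel recommendation) (p < .05), and raised truthfulness and social responsibility levels in the hotel recommendation scenario (p < .01).

Insight: The novelty is a proactive, interactive, gamified, human-centered approach to AI literacy (LLMimic) that lets users experience the LLM training process first-hand, strengthening their understanding of how AI works and their critical thinking, and thereby helping them resist AI persuasion. It offers a scalable new paradigm for AI literacy education that goes beyond traditional passive disclosure.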

Abstract: As large language models (LLMs) become increasingly persuasive, there is concern that people’s opinions and decisions may be influenced across various contexts at scale. Prior mitigation (e.g., AI detectors and disclaimers) largely treats people as passive recipients of AI-generated information. To provide a more proactive intervention against persuasive AI, we introduce LLMimic, a role-play-based, interactive, gamified AI literacy tutorial, where participants assume the role of an LLM and progress through three key stages of the training pipeline (pretraining, SFT, and RLHF). We conducted a $2 \times 3$ between-subjects study ($N = 274$) where participants either (1) watched an AI history video (control) or (2) interacted with LLMimic (treatment), and then engaged in one of three realistic AI persuasion scenarios: (a) charity donation persuasion, (b) malicious money solicitation, or (c) hotel recommendation. Our results show that LLMimic significantly improved participants’ AI literacy ($p < .001$), reduced persuasion success across scenarios ($p < .05$), and enhanced truthfulness and social responsibility levels ($p < .01$) in the hotel scenario. These findings suggest that LLMimic offers a scalable, human-centered approach to improving AI literacy and supporting more informed interactions with persuasive AI.

[7] Overcoming the “Impracticality” of RAG: Proposing a Real-World Benchmark and Multi-Dimensional Diagnostic Framework cs.CL

Kenichirou Narita, Siqi Peng, Taku Fukui, Moyuru Yamada, Satoshi Munakata

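TL;DR: Addressing shortcomings in how Retrieval-Augmented Generation (RAG) systems are evaluated in enterprise environments, this paper proposes a multi-dimensional diagnostic framework and an enterprise RAG benchmark. Together they systematically diagnose potential weaknesses in reasoning complexity, retrieval difficulty, document diversity, and explainability, with the aim of closing the gap between academic benchmarks and reliability in real deployments.

Details

Motivation: Existing academic benchmarks cannot systematically diagnose the multi-dimensional, composite challenges that enterprise RAG systems face (reasoning complexity, retrieval difficulty, diverse document structure, and explainability requirements), so high-scoring models prove unreliable in deployment. The paper aims to resolve this disconnect between evaluation and practice.

Result: The paper proposes a multi-dimensional diagnostic framework that integrates a four-axis difficulty taxonomy and constructs an enterprise RAG benchmark for diagnosing system weaknesses. The abstract reports no quantitative results or comparison with the state of the art.

Insight: The novelty lies in defining a four-axis difficulty taxonomy for evaluating RAG systems and building a diagnostic benchmark focused on enterprise-level, multi-dimensional challenges. The emphasis shifts from simple accuracy checks to systematic diagnosis of weaknesses, which is relevant to deploying RAG in practice.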

Abstract: Performance evaluation of Retrieval-Augmented Generation (RAG) systems within enterprise environments is governed by multi-dimensional and composite factors extending far beyond simple final accuracy checks. These factors include reasoning complexity, retrieval difficulty, the diverse structure of documents, and stringent requirements for operational explainability. Existing academic benchmarks fail to systematically diagnose these interlocking challenges, resulting in a critical gap where models achieving high performance scores fail to meet the expected reliability in practical deployment. To bridge this discrepancy, this research proposes a multi-dimensional diagnostic framework by defining a four-axis difficulty taxonomy and integrating it into an enterprise RAG benchmark to diagnose potential system weaknesses.

[8] Trivial Vocabulary Bans Improve LLM Reasoning More Than Deep Linguistic Constraints cs.CL | cs.AI

Rodney Jehu-Appiah

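TL;DR: Through a replication experiment, this study tests the cognitive restructuring hypothesis about how E-Prime (English without the verb “to be”) affects language model reasoning. All four treatment conditions (E-Prime, No-Have, an elaborated metacognitive prompt, and a neutral filler-word ban) improve reasoning performance, and the effect is inversely related to theoretical depth: the simplest condition, the neutral filler-word ban, gives the largest gain. The results are consistent with a simpler mechanism, in which any constraint that forces the model off its default generation path acts as an output regularizer and improves reasoning by disrupting fluent but shallow response patterns.

Details

Motivation: To test the hypothesis from a previous study that E-Prime selectively alters language model reasoning through cognitive restructuring driven by specific vocabulary-cognition mappings, using an experiment with active controls to probe the mechanism.

Result: Across six models and seven reasoning tasks (N=11,919 valid trials after compliance filtering), all four constrained conditions outperform the control (83.0% baseline accuracy). The neutral filler-word ban (banning words such as “very” and “just”) gives the largest improvement (+6.7 percentage points), and E-Prime gives the smallest (+3.7 percentage points). The cross-model correlation signature does not replicate (mean r=0.005).

Insight: The shallowest vocabulary constraints (such as banning filler words that play no role in logic) improve LLM reasoning the most, because they add monitoring load with minimal conceptual disruption. This points to an “output regularization” mechanism, in which any constraint that breaks the model’s default fluent generation improves reasoning by disrupting shallow response patterns, and it offers a new perspective on prompt engineering for LLMs.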

Abstract: A previous study reported that E-Prime (English without the verb “to be”) selectively altered reasoning in language models, with cross-model correlations suggesting a structural signature tied to which vocabulary was removed. I designed a replication with active controls to test the proposed mechanism: cognitive restructuring through specific vocabulary-cognition mappings. The experiment tested five conditions (unconstrained control, E-Prime, No-Have, elaborated metacognitive prompt, neutral filler-word ban) across six models and seven reasoning tasks (N=15,600 trials, 11,919 after compliance filtering). Every prediction from the cognitive restructuring hypothesis was disconfirmed. All four treatments outperformed the control (83.0%), including both active controls predicted to show null effects. The neutral filler-word ban, banning words like “very” and “just” with no role in logical inference, produced the largest improvement (+6.7 pp), while E-Prime produced the smallest (+3.7 pp). The four conditions ranked in perfect inverse order of theoretical depth. The cross-model correlation signature did not replicate (mean r=0.005). These results are consistent with a simpler mechanism: any constraint that forces a model off its default generation path acts as an output regularizer, improving reasoning by disrupting fluent but shallow response patterns. The shallowest constraints work best because they impose monitoring load with minimal conceptual disruption. I present these findings as a case study in discovery through disconfirmation.

[9] Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy cs.CL | cs.AI | cs.LG | cs.SE

Yihong Dong, Xiaoha Jian, Xue Jiang, Xuyuan Guo, Zhiyuan Fan

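TL;DR: This paper presents ChomskyBench, a benchmark that systematically evaluates the formal reasoning capabilities of large language models through the lens of the Chomsky Hierarchy, using language recognition and generation tasks to test models at each level of complexity. Experiments show that performance stratifies with task complexity and that current LLMs face severe efficiency bottlenecks on formal tasks, remaining far less efficient than traditional algorithmic programs.

Details

Motivation: Existing LLM benchmarks lack systematic evaluation grounded in computation and complexity theory, so it is unknown whether SOTA models grasp the structured, hierarchical complexity of formal languages. An evaluation framework is therefore needed that combines full Chomsky Hierarchy coverage, process-trace evaluation via natural language, and deterministic symbolic verifiability.

Result: Extensive experiments on ChomskyBench show clear performance stratification as task difficulty increases, with difficulty substantially affecting both inference length and accuracy. Larger models and advanced inference methods bring relative gains, but reaching practical reliability would require prohibitive computational cost, and LLMs are far less efficient in terms of time complexity than traditional algorithmic programs.

Insight: The novelty is that the benchmark is the first to combine full Chomsky Hierarchy coverage, process-trace evaluation via natural language, and symbolic verifiability. The analysis indicates that the formal reasoning bottleneck of current LLMs stems mainly from inefficiency rather than from absolute capability limits, which underscores the need for traditional software tools and gives direction for developing LLMs with stronger formal reasoning.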

Abstract: The formal reasoning capabilities of LLMs are crucial for advancing automated software engineering. However, existing benchmarks for LLMs lack systematic evaluation based on computation and complexity, leaving a critical gap in understanding their formal reasoning capabilities. Therefore, it is still unknown whether SOTA LLMs can grasp the structured, hierarchical complexity of formal languages as defined by Computation Theory. To address this, we introduce ChomskyBench, a benchmark for systematically evaluating LLMs through the lens of Chomsky Hierarchy. Unlike prior work that uses vectorized classification for neural networks, ChomskyBench is the first to combine full Chomsky Hierarchy coverage, process-trace evaluation via natural language, and deterministic symbolic verifiability. ChomskyBench is composed of a comprehensive suite of language recognition and generation tasks designed to test capabilities at each level. Extensive experiments indicate a clear performance stratification that correlates with the hierarchy’s levels of complexity. Our analysis reveals a direct relationship where increasing task difficulty substantially impacts both inference length and performance. Furthermore, we find that while larger models and advanced inference methods offer notable relative gains, they face severe efficiency barriers: achieving practical reliability would require prohibitive computational costs, revealing that current limitations stem from inefficiency rather than absolute capability bounds. A time complexity analysis further indicates that LLMs are significantly less efficient than traditional algorithmic programs for these formal tasks. These results delineate the practical limits of current LLMs, highlight the indispensability of traditional software tools, and provide insights to guide the development of future LLMs with more powerful formal reasoning capabilities.

[10] When Modalities Remember: Continual Learning for Multimodal Knowledge Graphs cs.CL

Linyu Li, Zhi Jin, Yichi Zhang, Dongming Jin, Yuanpeng He

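TL;DR: This paper presents a systematic study of continual multimodal knowledge graph reasoning (CMMKGR) for real-world multimodal knowledge graphs (MMKGs) that change over time. The authors build several continual multimodal knowledge graph benchmarks and propose a new model, MRCKG. Through a multimodal-structural collaborative curriculum, a cross-modal knowledge preservation mechanism, and a multimodal contrastive replay scheme, the model mitigates catastrophic forgetting while substantially improving the learning of new knowledge.

Details

Motivation: Real-world multimodal knowledge graphs evolve dynamically. Existing continual knowledge graph reasoning methods focus only on structural triples and cannot fully exploit multimodal signals from new entities, while existing multimodal knowledge graph reasoning methods usually assume static graphs and suffer catastrophic forgetting as graphs evolve. This paper aims to fill that gap.

Result: Experiments on multiple datasets show that MRCKG preserves previously learned multimodal knowledge while substantially improving the learning of new knowledge.

Insight: The contributions are: (1) a multimodal-structural collaborative curriculum that schedules progressive learning based on the structural connectivity of new triples to the historical graph and on their multimodal compatibility; (2) a cross-modal knowledge preservation mechanism that mitigates forgetting through entity representation stability, relational semantic consistency, and modality anchoring; (3) a multimodal contrastive replay scheme with a two-stage optimization strategy that reinforces learned knowledge via multimodal importance sampling and representation alignment. Together these mechanisms offer a systematic solution for continual learning in dynamic multimodal settings.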

Abstract: Real-world multimodal knowledge graphs (MMKGs) are dynamic, with new entities, relations, and multimodal knowledge emerging over time. Existing continual knowledge graph reasoning (CKGR) methods focus on structural triples and cannot fully exploit multimodal signals from new entities. Existing multimodal knowledge graph reasoning (MMKGR) methods, however, usually assume static graphs and suffer catastrophic forgetting as graphs evolve. To address this gap, we present a systematic study of continual multimodal knowledge graph reasoning (CMMKGR). We construct several continual multimodal knowledge graph benchmarks from existing MMKG datasets and propose MRCKG, a new CMMKGR model. Specifically, MRCKG employs a multimodal-structural collaborative curriculum to schedule progressive learning based on the structural connectivity of new triples to the historical graph and their multimodal compatibility. It also introduces a cross-modal knowledge preservation mechanism to mitigate forgetting through entity representation stability, relational semantic consistency, and modality anchoring. In addition, a multimodal contrastive replay scheme with a two-stage optimization strategy reinforces learned knowledge via multimodal importance sampling and representation alignment. Experiments on multiple datasets show that MRCKG preserves previously learned multimodal knowledge while substantially improving the learning of new knowledge.

[11] Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks cs.CL | cs.AI

Tianze Xu, Yanzhao Zheng, Pengrui Lu, Lyumanshan Ye, Yong Wu

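TL;DR: This paper proposes Rubrics to Tokens (RTT), a rubric-based reinforcement learning framework that addresses reward sparsity and reward ambiguity when large language models follow complex, open-domain instructions. It introduces a Token-Level Relevance Discriminator to connect coarse response-level scores with fine-grained token-level credit assignment, and it optimizes the policy model with the RTT-GRPO algorithm within a unified framework.

Details

Motivation: Existing rubric-based reinforcement learning methods rely mainly on response-level rewards, which causes severe reward sparsity and reward ambiguity and hinders effective alignment on complex instruction-following tasks.

Result: Extensive experiments and benchmarks show that RTT consistently outperforms other baselines in both instruction-level and rubric-level accuracy across different models.

Insight: The core novelty is extending the one-dimensional outcome-level reward space into a three-dimensional token-level rubric reward space, together with Intra-sample Token Group Normalization to accommodate that shift. This enables fine-grained credit assignment from the response level down to the token level and mitigates reward sparsity and ambiguity.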

Abstract: Rubric-based Reinforcement Learning (RL) has emerged as a promising approach for aligning Large Language Models (LLMs) with complex, open-domain instruction following tasks. However, existing methods predominantly rely on response-level rewards, introducing severe reward sparsity and reward ambiguity problems. To address these issues, we propose Rubrics to Tokens (RTT), a novel rubric-based RL framework that bridges coarse response-level scores and fine-grained token-level credit assignment. RTT introduces a Token-Level Relevance Discriminator to predict which tokens in the response are responsible for a specific constraint, and optimizes the policy model via RTT-GRPO, which integrates response-level and token-level advantages within a unified framework. Furthermore, when transitioning from one-dimensional, outcome-level reward to three-dimensional reward space in the token-level rubric-based RL, we propose a novel group normalization method, called Intra-sample Token Group Normalization, to accommodate this shift. Extensive experiments and benchmarks demonstrate that RTT consistently outperforms other baselines in both instruction- and rubric-level accuracy across different models.
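As background for the normalization step mentioned in the abstract, the sketch below shows the usual response-level, GRPO-style group normalization: rewards for a group of responses sampled from the same prompt are centered and scaled within the group. This is not the paper's Intra-sample Token Group Normalization, which the abstract does not define. The function name, the `eps` value, and the toy rewards are assumptions, and implementations differ on whether they use the population or sample standard deviation.

```python
import statistics


def group_normalized_advantages(rewards, eps=1e-6):
    """Center and scale rewards within one group of sampled responses.

    `eps` (an assumed value) guards against division by zero when all
    responses in the group receive the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]


# Toy rubric scores for four responses to one prompt (invented values).
rewards = [1.0, 0.0, 0.5, 0.5]
adv = group_normalized_advantages(rewards)
print(adv)
```

Every token in a response inherits the same advantage under this scheme, which is the response-level coarseness that the abstract identifies as a source of reward sparsity and ambiguity.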

[12] Student-in-the-Loop Chain-of-Thought Distillation via Generation-Time Selection cs.CL

Chaoqun He, Yingfa Chen, Chaojun Xiao, Xu Han, Lijie Wen

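TL;DR: This paper proposes Gen-SSD (Generation-time Self-Selection Distillation), a student-in-the-loop framework for distilling the chain-of-thought (CoT) trajectories of large reasoning models into smaller models. Its core idea is to have the student evaluate candidate reasoning continuations while the teacher is generating, steering generation so that only reasoning branches the student can learn are expanded and unhelpful branches are pruned early.

Details

Motivation: Existing methods usually filter post hoc, selecting trajectories by heuristic criteria after the teacher has generated them in full. This cannot control the generation process itself and may still produce paths beyond the student’s learning capacity. A method is therefore needed that selects student-appropriate reasoning trajectories dynamically during generation.

Result: Experiments on mathematical reasoning benchmarks show that Gen-SSD consistently outperforms standard knowledge distillation and recent baselines, improving by around 5.9 points over standard knowledge distillation and by up to 4.7 points over other baselines. Further analysis shows that the method yields more stable and more learnable reasoning trajectories.

Insight: The main novelty is integrating the student model into the teacher’s generation process so that paths are selected and pruned at generation time, which underscores the importance of incorporating supervision during generation for effective distillation. This gives knowledge distillation a more proactive and controllable framework, and its student-in-the-loop, interactive approach may be reusable elsewhere.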

Abstract: Large reasoning models achieve strong performance on complex tasks through long chain-of-thought (CoT) trajectories, but directly transferring such reasoning processes to smaller models remains challenging. A key difficulty is that not all teacher-generated reasoning trajectories are suitable for student learning. Existing approaches typically rely on post-hoc filtering, selecting trajectories after full generation based on heuristic criteria. However, such methods cannot control the generation process itself and may still produce reasoning paths that lie outside the student’s learning capacity. To address this limitation, we propose Gen-SSD (Generation-time Self-Selection Distillation), a student-in-the-loop framework that performs generation-time selection. Instead of passively consuming complete trajectories, the student evaluates candidate continuations during the teacher’s sampling process, guiding the expansion of only learnable reasoning paths and enabling early pruning of unhelpful branches. Experiments on mathematical reasoning benchmarks demonstrate that Gen-SSD consistently outperforms standard knowledge distillation and recent baselines, with improvements of around 5.9 points over Standard KD and up to 4.7 points over other baselines. Further analysis shows that Gen-SSD produces more stable and learnable reasoning trajectories, highlighting the importance of incorporating supervision during generation for effective distillation.

[13] LogicPoison: Logical Attacks on Graph Retrieval-Augmented Generation cs.CL | cs.AI

Yilin Xiao, Jin Chen, Qinggang Zhang, Yujing Zhang, Chuang Zhou

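TL;DR: This paper proposes LogicPoison, a new attack framework against graph-based Retrieval-Augmented Generation (GraphRAG) systems. Rather than injecting false content directly, the attack uses a type-preserving entity swapping mechanism to perturb global logic hubs and query-specific reasoning bridges in the knowledge graph. This undermines the graph’s topological integrity and logical connections and misleads the model’s reasoning paths, so that performance degrades significantly while surface-level text semantics stay unchanged.

Details

Motivation: By leveraging community detection and relation filtering, GraphRAG systems are inherently resistant to traditional text poisoning and prompt injection attacks. Their security, however, fundamentally depends on the topological integrity of the underlying knowledge graph. The paper sets out to identify and exploit this weakness by implicitly corrupting logical connections rather than altering surface text.

Result: Comprehensive experiments across multiple benchmarks show that LogicPoison bypasses GraphRAG’s defenses and significantly degrades performance. It outperforms existing state-of-the-art (SOTA) baselines in both attack effectiveness and stealth.

Insight: The paper’s claimed novelty is being the first to attack the logical-reasoning vulnerability of GraphRAG systems, with an attack paradigm that corrupts logical topology instead of tampering with content. Viewed objectively, the core insight is that the security boundary of structured knowledge systems (such as graph-augmented generation) lies not only in content authenticity but also in the logical consistency of the underlying graph structure, which offers a new perspective for evaluating and defending such systems. The type-preserving entity swapping mechanism, which keeps local semantics plausible while breaking global or query-specific reasoning paths, is a clever and effective attack strategy.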

Abstract: Graph-based Retrieval-Augmented Generation (GraphRAG) enhances the reasoning capabilities of Large Language Models (LLMs) by grounding their responses in structured knowledge graphs. Leveraging community detection and relation filtering techniques, GraphRAG systems demonstrate inherent resistance to traditional RAG attacks, such as text poisoning and prompt injection. However, in this paper, we find that the security of GraphRAG systems fundamentally relies on the topological integrity of the underlying graph, which can be undermined by implicitly corrupting the logical connections, without altering surface-level text semantics. To exploit this vulnerability, we propose LogicPoison, a novel attack framework that targets logical reasoning rather than injecting false contents. Specifically, LogicPoison employs a type-preserving entity swapping mechanism to perturb both global logic hubs for disrupting overall graph connectivity and query-specific reasoning bridges for severing essential multi-hop inference paths. This approach effectively reroutes valid reasoning into dead ends while maintaining surface-level textual plausibility. Comprehensive experiments across multiple benchmarks demonstrate that LogicPoison successfully bypasses GraphRAG’s defenses, significantly degrading performance and outperforming state-of-the-art baselines in both effectiveness and stealth. Our code is available at https://github.com/Jord8061/logicPoison.

[14] NeuReasoner: Towards Explainable, Controllable, and Unified Reasoning via Mixture-of-Neurons cs.CL

Haonan Dong, Kehan Jiang, Haoran Ye, Wenhao Zhu, Zhaolu Kang

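TL;DR: This paper proposes NeuReasoner, an explainable, controllable, and unified reasoning framework based on a Mixture of Neurons (MoN). Through white-box analysis it identifies key neurons associated with reasoning failures, along with their fluctuation patterns, and it integrates lightweight MLPs for failure detection with a self-correction mechanism triggered by special tokens. Evaluations across six benchmarks and six backbone models show that NeuReasoner improves performance while substantially reducing token consumption.

Details

Motivation: Existing large reasoning models exhibit failure modes in complex reasoning tasks, such as calculation errors, oscillation and stagnation, and over-thinking. Existing approaches are mostly black-box, rely on reinforcement learning, and lack explainability, controllability, and unification. The paper aims to address these issues through white-box analysis.

Result: Across six benchmarks and six backbone models (8B to 70B), compared against nine competitive baselines, NeuReasoner achieves performance gains of up to 27.0% while reducing token consumption by 19.6% to 63.3%.

Insight: The novelty lies in using white-box analysis to identify key neurons (MoN) and their fluctuation patterns and link them to reasoning failures, and in designing a unified framework that pairs lightweight MLP detection with special-token-triggered self-correction. The result is an explainable, controllable reasoning process with better efficiency and performance.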

Abstract: Large Reasoning Models (LRMs) have recently achieved remarkable success in complex reasoning tasks. However, closer scrutiny reveals persistent failure modes compromising performance and cost: I) Intra-step level, marked by calculation or derivation errors; II) Inter-step level, involving oscillation and stagnation; and III) Instance level, causing maladaptive over-thinking. Existing endeavors target isolated levels without unification, while their black-box nature and reliance on RL hinder explainability and controllability. To bridge these gaps, we conduct an in-depth white-box analysis, identifying key neurons (Mixture of Neurons, MoN) and their fluctuation patterns associated with distinct failures. Building upon these insights, we propose NeuReasoner, an explainable, controllable, and unified reasoning framework driven by MoN. Technically, NeuReasoner integrates lightweight MLPs for failure detection with a special token-triggered self-correction mechanism learned via SFT. During inference, special tokens are inserted upon failure detection to actuate controllable remedial behaviors. Extensive evaluations across six benchmarks and six backbone models (8B to 70B), against nine competitive baselines, demonstrate that NeuReasoner achieves performance gains of up to 27.0% while reducing token consumption by 19.6% to 63.3%.

[15] R2-Write: Reflection and Revision for Open-Ended Writing with Deep Reasoning cs.CL | cs.AI

Wanlong Liu, Bo Zhang, Chenliang Li, Shaopeng Lai, Yuning Wu

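TL;DR: This paper proposes R2-Write, a framework that improves the deep reasoning of large language models on open-ended writing tasks by introducing explicit reflection and revision patterns. It generates high-quality thinking trajectories through iterative writer-judge interaction and uses a process reward mechanism to supervise reflection quality, improving both performance and token efficiency.

Details

Motivation: Mainstream reasoning models excel in verifiable domains (such as mathematics) but gain little on open-ended writing tasks, because they lack deep reflection and revision patterns.

Result: Extensive experiments across multiple creative writing and deep-research benchmarks show significant improvements, supporting the claim that explicitly incorporating reflection and revision patterns unlocks deep reasoning for open-ended writing tasks.

Insight: The core novelty is building reflection and revision patterns into the reasoning process in a structured way and optimizing reflection quality during reinforcement learning through a process reward mechanism, which makes deep reasoning on open-ended tasks more efficient. This suggests a way to extend chain-of-thought reasoning to unstructured, highly subjective domains.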

Abstract: While deep reasoning with long chain-of-thought has dramatically improved large language models in verifiable domains like mathematics, its effectiveness for open-ended tasks such as writing remains unexplored. In this paper, we conduct a systematic investigation revealing that existing mainstream reasoning models achieve limited gains on open-ended writing tasks. Our further analysis shows that these models lack deep reflection and revision patterns in open-ended writing, resulting in substantially smaller improvements compared to mathematical reasoning tasks. To address this limitation, we introduce R2-Write: an automated framework that synthesizes high-quality thinking trajectories enriched with explicit reflection and revision patterns through iterative writer-judge interaction. To prevent redundant reflections, we design a process reward mechanism that supervises reflection quality during reinforcement learning, improving both performance and token efficiency. Extensive experiments across multiple creative writing and deep-research benchmarks demonstrate significant improvements, validating that explicitly incorporating reflection and revision patterns unlocks deep reasoning capabilities for open-ended writing tasks.

[16] JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency cs.CL | cs.AI

Aichen Cai, Anmeng Zhang, Anyu Li, Bo Zhang, Bohua Cai

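TL;DR: This paper introduces JoyAI-LLM Flash, an efficient Mixture-of-Experts language model designed to rebalance strong performance and token efficiency at the sub-50B parameter scale. The model is pretrained on a 20-trillion-token corpus and post-trained with supervised fine-tuning, Direct Preference Optimization, and large-scale reinforcement learning. It improves token efficiency and inference throughput by balancing “thinking” and “non-thinking” cognitive modes, introducing FiberPO, a new RL algorithm inspired by fibration theory, adopting a highly sparse architecture, and using a joint design that includes multi-token prediction and quantization-aware training.

Details

Motivation: To address how mid-scale language models (under 50B parameters) can maintain strong performance while significantly improving token efficiency and inference speed.

Result: The model has 48B total parameters but activates only 2.7B per forward pass, a sparsity ratio substantially higher than that of industry-leading models of comparable scale. Through post-training optimization and architectural design, it balances performance and efficiency.

Insight: The contributions include: (1) strategically balancing “thinking” and “non-thinking” cognitive modes to improve token efficiency; (2) FiberPO, an algorithm inspired by fibration theory that decomposes trust-region maintenance into global and local components, providing unified multi-scale stability control for LLM policy optimization; (3) a highly sparse MoE architecture; (4) a joint training-inference co-design that incorporates dense multi-token prediction and quantization-aware training to raise inference throughput.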

Abstract: We introduce JoyAI-LLM Flash, an efficient Mixture-of-Experts (MoE) language model designed to redefine the trade-off between strong performance and token efficiency in the sub-50B parameter regime. JoyAI-LLM Flash is pretrained on a massive corpus of 20 trillion tokens and further optimized through a rigorous post-training pipeline, including supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and large-scale reinforcement learning (RL) across diverse environments. To improve token efficiency, JoyAI-LLM Flash strategically balances thinking and non-thinking cognitive modes and introduces FiberPO, a novel RL algorithm inspired by fibration theory that decomposes trust-region maintenance into global and local components, providing unified multi-scale stability control for LLM policy optimization. To enhance architectural sparsity, the model comprises 48B total parameters while activating only 2.7B parameters per forward pass, achieving a substantially higher sparsity ratio than contemporary industry leading models of comparable scale. To further improve inference throughput, we adopt a joint training-inference co-design that incorporates dense Multi-Token Prediction (MTP) and Quantization-Aware Training (QAT). We release the checkpoints for both JoyAI-LLM-48B-A3B Base and its post-trained variants on Hugging Face to support the open-source community.
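The sparsity figures in the abstract can be turned into a quick ratio. The arithmetic below uses only the two parameter counts stated there; the abstract does not say how the authors define "sparsity ratio", so reading it as total over active parameters is an assumption.

```latex
% Fraction of parameters active per forward pass:
\[
  \frac{2.7\,\text{B}}{48\,\text{B}} \approx 0.056
  \quad\text{(about 5.6\% of parameters active)}
\]
% Total-to-active ratio (one possible reading of "sparsity ratio"):
\[
  \frac{48\,\text{B}}{2.7\,\text{B}} \approx 17.8
\]
```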

[17] Beyond the Parameters: A Technical Survey of Contextual Enrichment in Large Language Models: From In-Context Prompting to Causal Retrieval-Augmented Generation cs.CL | cs.AI

Prakhar Bansal, Shivangi Agarwal

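TL;DR: This paper systematically surveys context-enrichment techniques for large language models (LLMs). Along a single axis, the degree of structured context supplied at inference time, it gives a unified account of strategies ranging from in-context learning and prompt engineering to Retrieval-Augmented Generation (RAG), GraphRAG, and CausalRAG.

Details

Motivation: LLMs are fundamentally limited by static knowledge, finite context windows, and weakly structured causal reasoning. The survey examines how external context can be used to extend their capabilities.

Result: The paper reports no quantitative experiments or benchmark results. Instead, it uses a literature-screening protocol, a claim-audit framework, and a structured cross-paper evidence synthesis to separate higher-confidence findings from emerging results, and it provides a deployment-oriented decision framework.

Insight: The contribution is a unified analytical framework organized by the degree of structured context, together with a transparent literature-assessment method and a deployment-oriented decision framework, which set out research priorities for trustworthy retrieval-augmented NLP.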

Abstract: Large language models (LLMs) encode vast world knowledge in their parameters, yet they remain fundamentally limited by static knowledge, finite context windows, and weakly structured causal reasoning. This survey provides a unified account of augmentation strategies along a single axis: the degree of structured context supplied at inference time. We cover in-context learning and prompt engineering, Retrieval-Augmented Generation (RAG), GraphRAG, and CausalRAG. Beyond conceptual comparison, we provide a transparent literature-screening protocol, a claim-audit framework, and a structured cross-paper evidence synthesis that distinguishes higher-confidence findings from emerging results. The paper concludes with a deployment-oriented decision framework and concrete research priorities for trustworthy retrieval-augmented NLP.

[18] Reliability Gated Multi-Teacher Distillation for Low Resource Abstractive Summarization cs.CL | cs.AI

Dipto Sumit, Ankan Kumar Roy, Sadia Khair Rodela, Atia Haque Asha, Mourchona Afrin

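TL;DR: This paper studies multi-teacher knowledge distillation for low-resource abstractive summarization and proposes two mechanisms: EWAD, which routes supervision based on inter-teacher agreement, and CPDP, a geometric constraint on the student's position. Experiments show that logit-level distillation is the most reliable, that cross-lingual pseudo-label distillation retains 71-122% of teacher ROUGE-L across ten languages, and that single-judge LLM evaluation suffers from calibration bias.

Details

Motivation: To address the reliability of multi-teacher knowledge distillation in low-resource abstractive summarization, and to determine when multi-teacher supervision improves summary quality and when data scaling matters more than loss design.

Result: On the Bangla datasets, logit-level distillation gives the most reliable gains. Cross-lingual pseudo-label distillation retains 71-122% of teacher ROUGE-L across ten languages at 3.2x compression. A human-validated multi-judge LLM evaluation reveals calibration bias in single-judge pipelines.

Insight: The contributions are an entropy-weighted supervision-routing mechanism based on inter-teacher agreement (EWAD) and a geometric constraint for heterogeneous teachers (CPDP). Viewed objectively, the work offers a systematic framework for assessing the reliability of multi-teacher distillation and underlines the importance of calibration bias in evaluation pipelines.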

Abstract: We study multiteacher knowledge distillation for low resource abstractive summarization from a reliability aware perspective. We introduce EWAD (Entropy Weighted Agreement Aware Distillation), a token level mechanism that routes supervision between teacher distillation and gold supervision based on inter teacher agreement, and CPDP (Capacity Proportional Divergence Preservation), a geometric constraint on the student position relative to heterogeneous teachers. Across two Bangla datasets, 13 BanglaT5 ablations, and eight Qwen2.5 experiments, we find that logit level KD provides the most reliable gains, while more complex distillation improves semantic similarity for short summaries but degrades longer outputs. Cross lingual pseudo label KD across ten languages retains 71-122 percent of teacher ROUGE L at 3.2x compression. A human validated multi judge LLM evaluation further reveals calibration bias in single judge pipelines. Overall, our results show that reliability aware distillation helps characterize when multi teacher supervision improves summarization and when data scaling outweighs loss engineering.
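The abstract above reports that plain logit-level KD gives the most reliable gains. As a minimal sketch of what logit-level distillation usually means (the standard temperature-scaled KL objective, not the paper's EWAD or CPDP mechanisms, whose details the abstract does not give), the pure-Python example below computes the loss for a single token position. The function names and the default temperature are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the usual rescaling that keeps gradient magnitudes
    comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


matched = kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mismatched = kd_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
print(matched, mismatched)
```

With several teachers, one common choice is to average the teacher distributions before computing the loss; whether the paper does this is not stated in the abstract.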

cs.CV

[19] Internalized Reasoning for Long-Context Visual Document Understanding cs.CV | cs.AI | cs.CL

Austin Veselka

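TL;DR: This paper proposes an internalized-reasoning approach for long-document visual understanding. A synthetic data pipeline generates chains of thought by scoring pages for relevance, extracting textual evidence, and ordering it; the reasoning ability is then internalized into a vision-language model through supervised fine-tuning and low-strength model merging.

Details

Motivation: The best-performing open recipes for long-document visual understanding have not explored reasoning. The work aims to improve understanding and analysis of complex documents such as enterprise, legal, and scientific documents.

Result: With Qwen3 VL 32B, the method scores 58.3 on MMLongBenchDoc, surpassing the 7x larger Qwen3 VL 235B A22B (57.0). With Mistral Small 3.1 24B, synthetic reasoning beats distillation from the Thinking version's traces by 3.8 points on MMLBD-C, and internalized reasoning produces 12.4x fewer mean output tokens than explicit reasoning.

Insight: The contributions are: 1) a synthetic reasoning-data pipeline for long-document visual understanding that builds chains of thought from page relevance scores and ranked evidence; 2) the use of a control token and model merging to internalize reasoning, reducing token overhead at inference time; 3) evidence that a smaller model with internalized reasoning can outperform larger models or conventional distillation.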

Abstract: Visual long-document understanding is critical for enterprise, legal, and scientific applications, yet the best performing open recipes have not explored reasoning, a capability which has driven leaps in math and code performance. We introduce a synthetic data pipeline for reasoning in long-document understanding that generates thinking traces by scoring each page for question relevance, extracting textual evidence and ordering it from most to least relevant. We apply SFT to the resulting traces within \texttt{} tags, gated by a \texttt{} control token, and the resulting reasoning capability is internalized via low-strength model merging. We study Qwen3 VL 32B and Mistral Small 3.1 24B. With Qwen3 VL, we achieve 58.3 on MMLongBenchDoc, surpassing the 7$\times$ larger Qwen3 VL 235B A22B (57.0). With Mistral, we show that synthetic reasoning outperforms distillation from the Thinking version’s traces by 3.8 points on MMLBD-C, and internalized reasoning exhibits 12.4$\times$ fewer mean output tokens compared to explicit reasoning. We release our pipeline for reproducibility and further exploration.

[20] Environment-Aware Channel Prediction for Vehicular Communications: A Multimodal Visual Feature Fusion Framework cs.CV | cs.AI

Xuejian Zhang, Ruisi He, Minseok Kim, Inocent Calist, Mi Yang

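TL;DR: This paper proposes an environment-aware channel prediction framework based on multimodal visual feature fusion for 6G vehicular communications. Using GPS data and vehicle-side panoramic RGB images, it extracts semantic, depth, and position features via semantic segmentation and depth estimation, fuses them adaptively with a three-branch architecture and a squeeze-excitation attention gating module, and jointly predicts path loss, delay spread, azimuth spread of arrival, azimuth spread of departure, and the angular power spectrum.

Details

Motivation: Traditional empirical and deterministic channel models struggle to balance accuracy, generalization, and deployability. The work exploits environmental priors from onboard and roadside sensing devices to meet the need of vehicular communications for forward-looking channel prediction under stringent reliability, latency, and adaptability demands.

Result: On a synchronized urban V2I measurement dataset, the RMSE is 3.26 dB for path loss and 37.66 ns, 5.05 degrees, and 5.08 degrees for delay spread, azimuth spread of arrival, and azimuth spread of departure, respectively. The mean and median cosine similarities of the angular power spectrum are 0.9342 and 0.9571, indicating strong accuracy and generalization.

Insight: The contribution lies in fusing multimodal visual features (semantic, depth, and position) for environment-aware channel prediction, with a dedicated regression head and a composite multi-constraint loss that enable joint prediction of several channel parameters. This offers a reference framework for integrated sensing and communication in 6G.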

Abstract: The deep integration of communication with intelligence and sensing, as a defining vision of 6G, renders environment-aware channel prediction a key enabling technology. As a representative 6G application, vehicular communications require accurate and forward-looking channel prediction under stringent reliability, latency, and adaptability demands. Traditional empirical and deterministic models remain limited in balancing accuracy, generalization, and deployability, while the growing availability of onboard and roadside sensing devices offers a promising source of environmental priors. This paper proposes an environment-aware channel prediction framework based on multimodal visual feature fusion. Using GPS data and vehicle-side panoramic RGB images, together with semantic segmentation and depth estimation, the framework extracts semantic, depth, and position features through a three-branch architecture and performs adaptive multimodal fusion via a squeeze-excitation attention gating module. For 360-dimensional angular power spectrum (APS) prediction, a dedicated regression head and a composite multi-constraint loss are further designed. As a result, joint prediction of path loss (PL), delay spread (DS), azimuth spread of arrival (ASA), azimuth spread of departure (ASD), and APS is achieved. Experiments on a synchronized urban V2I measurement dataset yield the best root mean square error (RMSE) of 3.26 dB for PL, RMSEs of 37.66 ns, 5.05 degrees, and 5.08 degrees for DS, ASA, and ASD, respectively, and mean/median APS cosine similarities of 0.9342/0.9571, demonstrating strong accuracy, generalization, and practical potential for intelligent channel prediction in 6G vehicular communications.
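The two metric families reported above are RMSE (for path loss, delay spread, and the angular spreads) and cosine similarity (for the 360-dimensional angular power spectrum). A minimal sketch of both, using their standard definitions, is given below; the sample values are made up for illustration and are not taken from the paper.

```python
import math


def rmse(predicted, target):
    """Root mean square error between two equal-length sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared) / len(squared))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical path-loss predictions in dB (illustrative values only).
pl_pred = [101.0, 98.5, 110.2]
pl_true = [100.0, 99.5, 108.2]
print("PL RMSE (dB):", rmse(pl_pred, pl_true))

# Hypothetical 4-bin power spectra standing in for the 360-bin APS.
aps_pred = [0.1, 0.6, 0.2, 0.1]
aps_true = [0.2, 0.5, 0.2, 0.1]
print("APS cosine similarity:", cosine_similarity(aps_pred, aps_true))
```

Cosine similarity compares the shape of the predicted spectrum rather than its absolute power, which is presumably why it is reported separately from the RMSE figures.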

[21] Variational Encoder–Multi-Decoder (VE-MD) for Privacy-by-functional-design (Group) Emotion Recognition cs.CV | cs.AI

Anderson Augusma, Dominique Vaufreydaz, Fédérique Letué

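TL;DR: This paper proposes VE-MD (Variational Encoder-Multi-Decoder), a privacy-by-functional-design framework for group emotion recognition (GER). It learns a shared latent representation jointly optimized for emotion classification and for internal prediction of body and facial structural representations, avoiding explicit monitoring of individuals and so protecting privacy when only group-level understanding is needed. Experiments on six real-world datasets show that structural supervision consistently improves representation learning, with strong results on both GER and individual emotion recognition (IER).

Details

Motivation: Existing GER methods usually rely on explicit individual-level processing (cropped faces, person tracking, or per-person feature extraction), which makes the pipeline person-centric and raises privacy concerns when only group-level understanding is needed. This work designs a privacy-aware functional framework that avoids explicit individual monitoring and predicts only aggregate group-level affect.

Result: VE-MD reaches state-of-the-art performance on the GER benchmark GAF-3.0 (up to 90.06%) and 82.25% on VGAF with multimodal (audio) fusion. On IER benchmarks, it surpasses the state of the art on SamSemo with 77.9% (adding the text modality) and is competitive on MER-MULTI (63.8%), DFEW (70.7%), and EngageNet (69.0%).

Insight: The contribution is a privacy-preserving functional design that constrains the model to output only group-level affect, with no identity recognition or per-person emotion outputs, so privacy is protected at the task level. Viewed objectively, the core insight is that optimizing the latent space alone is often insufficient for GER because it tends to attenuate interaction-related cues, whereas preserving explicit structural outputs (through the transformer-based PersonQuery decoder or the dense Heatmap decoder) improves collective affect inference. For IER, by contrast, projected structural representations act as an effective denoising bottleneck. This points to a fundamental difference between the representation-learning needs of GER and IER.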

Abstract: Group Emotion Recognition (GER) aims to infer collective affect in social environments such as classrooms, crowds, and public events. Many existing approaches rely on explicit individual-level processing, including cropped faces, person tracking, or per-person feature extraction, which makes the analysis pipeline person-centric and raises privacy concerns in deployment scenarios where only group-level understanding is needed. This research proposes VE-MD, a Variational Encoder-Multi-Decoder framework for group emotion recognition under a privacy-aware functional design. Rather than providing formal anonymization or cryptographic privacy guarantees, VE-MD is designed to avoid explicit individual monitoring by constraining the model to predict only aggregate group-level affect, without identity recognition or per-person emotion outputs. VE-MD learns a shared latent representation jointly optimized for emotion classification and internal prediction of body and facial structural representations. Two structural decoding strategies are investigated: a transformer-based PersonQuery decoder and a dense Heatmap decoder that naturally accommodates variable group sizes. Experiments on six in-the-wild datasets, including two GER and four Individual Emotion Recognition (IER) benchmarks, show that structural supervision consistently improves representation learning. More importantly, the results reveal a clear distinction between GER and IER: optimizing the latent space alone is often insufficient for GER because it tends to attenuate interaction-related cues, whereas preserving explicit structural outputs improves collective affect inference. In contrast, projected structural representations seem to act as an effective denoising bottleneck for IER. VE-MD achieves state-of-the-art performance on GAF-3.0 (up to 90.06%) and VGAF (82.25% with multimodal fusion with audio). 
These results show that preserving interaction-related structural information is particularly beneficial for group-level affect modeling without relying on prior individual feature extraction. On IER datasets using multimodal fusion with the audio modality, VE-MD outperforms SOTA on SamSemo (77.9%, adding the text modality) while achieving competitive performance on MER-MULTI (63.8%), DFEW (70.7%), and EngageNet (69.0%).

[22] LumiVideo: An Intelligent Agentic System for Video Color Grading cs.CV | cs.AI

Yuchen Guo, Junli Gong, Hongmin Cai, Yiu-ming Cheung, Weifeng Su

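TL;DR: LumiVideo is an agentic system for video color grading that mimics the cognitive workflow of professional colorists through four stages (Perception, Reasoning, Execution, and Reflection) to turn raw log video into cinematic color automatically. It reasons with an LLM's cinematic knowledge and a RAG framework, and it outputs industry-standard ASC-CDL configurations and a 3D LUT rather than pixels, which guarantees temporal consistency and supports iterative refinement through natural-language feedback.

Details

Motivation: Existing automated video color grading methods act as static black-box executors and lack the interpretability and iterative control that professionals require.

Result: On LumiGrade, the first log-encoded video benchmark (introduced in this paper), LumiVideo approaches human expert quality in fully automatic mode and enables precise iterative control when directed.

Insight: The contribution is to frame color grading as an agentic system that mimics the human workflow, combines LLM knowledge with RAG for reasoning, and emits interpretable, industry-standard parameters rather than pixels, balancing automation with controllability.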

Abstract: Video color grading is a critical post-production process that transforms flat, log-encoded raw footage into emotionally resonant cinematic visuals. Existing automated methods act as static, black-box executors that directly output edited pixels, lacking both interpretability and the iterative control required by professionals. We introduce LumiVideo, an agentic system that mimics the cognitive workflow of professional colorists through four stages: Perception, Reasoning, Execution, and Reflection. Given only raw log video, LumiVideo autonomously produces a cinematic base grade by analyzing the scene’s physical lighting and semantic content. Its Reasoning engine synergizes an LLM’s internalized cinematic knowledge with a Retrieval-Augmented Generation (RAG) framework via a Tree of Thoughts (ToT) search to navigate the non-linear color parameter space. Rather than generating pixels, the system compiles the deduced parameters into industry-standard ASC-CDL configurations and a globally consistent 3D LUT, analytically guaranteeing temporal consistency. An optional Reflection loop then allows creators to refine the result via natural language feedback. We further introduce LumiGrade, the first log-encoded video benchmark for evaluating automated grading. Experiments show that LumiVideo approaches human expert quality in fully automatic mode while enabling precise iterative control when directed.
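The abstract above says the system emits ASC-CDL parameters rather than pixels. As a rough sketch of why such parameters are temporally consistent, the example below applies the standard ASC CDL slope/offset/power transfer function per channel; the same three numbers are applied to every frame, so identical input colors always map to identical output colors. The function names and the sample parameter values are illustrative assumptions, and the CDL saturation step is omitted for brevity.

```python
def cdl_sop(value, slope, offset, power):
    """ASC CDL slope-offset-power for one channel value in [0, 1].

    out = clamp(value * slope + offset) ** power
    """
    graded = value * slope + offset
    graded = min(max(graded, 0.0), 1.0)  # clamp before the power function
    return graded ** power


def grade_pixel(rgb, slope, offset, power):
    """Apply per-channel SOP parameters to one RGB pixel."""
    return tuple(
        cdl_sop(c, s, o, p) for c, s, o, p in zip(rgb, slope, offset, power)
    )


# Hypothetical grade: slightly warmer, lifted, with a mild contrast curve.
SLOPE = (1.10, 1.00, 0.90)
OFFSET = (0.02, 0.00, -0.01)
POWER = (1.00, 1.00, 1.10)

frame_a_pixel = (0.40, 0.40, 0.40)
frame_b_pixel = (0.40, 0.40, 0.40)  # same input color in a later frame

out_a = grade_pixel(frame_a_pixel, SLOPE, OFFSET, POWER)
out_b = grade_pixel(frame_b_pixel, SLOPE, OFFSET, POWER)
print(out_a, out_b)
```

A 3D LUT can be built by sampling a function like `grade_pixel` on a regular RGB grid, which is one plausible reading of "compiles the deduced parameters into ... a globally consistent 3D LUT".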

[23] VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation cs.CV | cs.AI

Mengtian Li, Yuwei Lu, Feifei Li, Chenqi Gan, Zhifeng Xie

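TL;DR: This paper proposes VERTIGO, the first framework for visual preference optimization of cinematic camera trajectory generation. It renders 2D visual previews of generated trajectories with a real-time graphics engine and scores them with a vision-language model fine-tuned on cinematic aesthetics, providing visual preference signals for Direct Preference Optimization (DPO) and markedly improving the framing, consistency, and aesthetic quality of generated shots.

Details

Motivation: Existing generative camera systems produce diverse, text-conditioned trajectories but lack director-style closed-loop feedback, so nothing explicitly supervises whether a shot is visually appealing. The result is poor framing, off-screen characters, and aesthetic flaws.

Result: Quantitative evaluations and user studies on Unity renders and diffusion-based Camera-to-Video pipelines show consistent gains in condition adherence, framing quality, and perceptual realism. The character off-screen rate falls from 38% to nearly 0% while the geometric fidelity of camera motion is preserved. User studies also confirm that VERTIGO beats baselines on composition, consistency, prompt adherence, and aesthetic quality.

Insight: The core contribution is bringing visual preference optimization to camera trajectory generation. The proposed cyclic semantic similarity mechanism aligns rendered previews with text prompts and so supplies direct visual-aesthetic supervision for DPO post-training, closing the loop from purely geometric trajectories to perceived visual quality.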

Abstract: Cinematic camera control relies on a tight feedback loop between director and cinematographer, where camera motion and framing are continuously reviewed and refined. Recent generative camera systems can produce diverse, text-conditioned trajectories, but they lack this “director in the loop” and have no explicit supervision of whether a shot is visually desirable. This results in in-distribution camera motion but poor framing, off-screen characters, and undesirable visual aesthetics. In this paper, we introduce VERTIGO, the first framework for visual preference optimization of camera trajectory generators. Our framework leverages a real-time graphics engine (Unity) to render 2D visual previews from generated camera motion. A cinematically fine-tuned vision-language model then scores these previews using our proposed cyclic semantic similarity mechanism, which aligns renders with text prompts. This process provides the visual preference signals for Direct Preference Optimization (DPO) post-training. Both quantitative evaluations and user studies on Unity renders and diffusion-based Camera-to-Video pipelines show consistent gains in condition adherence, framing quality, and perceptual realism. Notably, VERTIGO reduces the character off-screen rate from 38% to nearly 0% while preserving the geometric fidelity of camera motion. User study participants further prefer VERTIGO over baselines across composition, consistency, prompt adherence, and aesthetic quality, confirming the perceptual benefits of our visual preference post-training.
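The abstract above uses scored previews as preference signals for DPO. The sketch below shows the standard DPO loss for a single preference pair, assuming the higher-scored trajectory is treated as "chosen" and the lower-scored one as "rejected"; how VERTIGO forms its pairs in detail is not stated in the abstract. The log-probability values and `beta` are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * margin)), where the margin is how much more
    the policy prefers the chosen sample than the frozen reference does.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))


# Policy identical to the reference: no preference learned yet.
neutral = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy has shifted probability toward the visually preferred trajectory.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)

print(neutral, improved)
```

The loss starts at log 2 when the policy matches the reference and decreases as the policy assigns relatively more probability to the preferred trajectory.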

[24] Guideline2Graph: Profile-Aware Multimodal Parsing for Executable Clinical Decision Graphs cs.CV | cs.LG

Onur Selim Kilic, Yeti Z. Gurbuz, Cem O. Yaldiz, Afra Nawar, Etrit Haxholli

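TL;DR: This paper proposes Guideline2Graph, a decomposition-first multimodal parsing pipeline that converts clinical practice guidelines into executable clinical decision graphs. Through topology-aware chunking, interface-constrained chunk graph generation, and provenance-preserving global aggregation, it addresses the weaknesses of existing LLM/VLM extractors in cross-page continuity, interface specification, and consistency of the overall decision graph.

Details

Motivation: Clinical practice guidelines are long, multimodal documents whose branching recommendations are hard to convert into executable clinical decision support (CDS). Existing LLM/VLM extractors are mostly local or text-centric, under-specify section interfaces, and fail to consolidate cross-page control flow into one coherent decision graph.

Result: On an adjudicated prostate-guideline benchmark with the same VLM backbone, edge and triplet precision/recall on the complete merged graph rise from 19.6%/16.1% for existing models to 69.0%/87.5%, and node recall rises from 78.1% to 93.8%.

Insight: The contribution is the decomposition-first pipeline design, which uses explicit entry/terminal interfaces and semantic deduplication to preserve cross-page continuity while keeping the induced control flow auditable and structurally consistent. It suggests a path toward auditable guideline-to-CDS conversion, although current evidence is limited to a single guideline and needs broader validation.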

Abstract: Clinical practice guidelines are long, multimodal documents whose branching recommendations are difficult to convert into executable clinical decision support (CDS), and one-shot parsing often breaks cross-page continuity. Recent LLM/VLM extractors are mostly local or text-centric, under-specifying section interfaces and failing to consolidate cross-page control flow across full documents into one coherent decision graph. We present a decomposition-first pipeline that converts full-guideline evidence into an executable clinical decision graph through topology-aware chunking, interface-constrained chunk graph generation, and provenance-preserving global aggregation. Rather than relying on single-pass generation, the pipeline uses explicit entry/terminal interfaces and semantic deduplication to preserve cross-page continuity while keeping the induced control flow auditable and structurally consistent. We evaluate on an adjudicated prostate-guideline benchmark with matched inputs and the same underlying VLM backbone across compared methods. On the complete merged graph, our approach improves edge and triplet precision/recall from 19.6%/16.1% in existing models to 69.0%/87.5%, while node recall rises from 78.1% to 93.8%. These results support decomposition-first, auditable guideline-to-CDS conversion on this benchmark, while current evidence remains limited to one adjudicated prostate guideline and motivates broader multi-guideline validation.
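Edge precision and recall, as reported above, compare the set of predicted graph edges with the adjudicated gold edges. The sketch below shows the usual set-based computation, assuming exact matching of (source, target) node pairs; the paper's actual matching rule (for example, any semantic matching of node labels) is not described in the abstract, and the node names are placeholders.

```python
def precision_recall(predicted, gold):
    """Set-based precision and recall over hashable items such as edges."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Placeholder decision-graph edges as (source_node, target_node) pairs.
gold_edges = {("n1", "n2"), ("n2", "n3"), ("n3", "n4"), ("n2", "n5")}
pred_edges = {("n1", "n2"), ("n2", "n3"), ("n2", "n4")}

edge_precision, edge_recall = precision_recall(pred_edges, gold_edges)
print(edge_precision, edge_recall)
```

The same function works for triplets if each item is a (source, relation, target) tuple instead of a pair.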

[25] Generating Satellite Imagery Data for Wildfire Detection through Mask-Conditioned Generative AI cs.CV | cs.AI

Valeria Martin, K. Brent Venable, Derek Morgan

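TL;DR: This paper investigates whether EarthSynth, a diffusion-based foundation model for Earth Observation, can synthesize realistic post-wildfire Sentinel-2 RGB imagery from existing burn masks, as a way to ease the scarcity of labeled satellite imagery for deep-learning wildfire monitoring.

Details

Motivation: The scarcity of labeled satellite imagery is a major bottleneck for deep-learning wildfire monitoring systems. The paper explores generative AI without task-specific retraining, using existing burn masks to synthesize high-quality post-wildfire imagery for data augmentation.

Result: In a quantitative evaluation on the CalFireSeg-50 dataset, inpainting-based pipelines outperform full-tile generation on all metrics. The structured inpainting prompt gives the best spatial alignment (Burn IoU = 0.456) and burn saliency (Darkness Contrast = 20.44), while color-matching post-processing yields the lowest color distance (ΔC_burn = 63.22). Inpainting assisted by a vision-language model (VLM) is competitive with hand-crafted prompts.

Insight: The contribution is to combine mask-conditioned generation with an inpainting architecture and to evaluate systematically how prompt engineering and color-matching post-processing affect the quality of the synthesized imagery, offering a workable route to generative data augmentation in wildfire detection pipelines.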

Abstract: The scarcity of labeled satellite imagery remains a fundamental bottleneck for deep-learning (DL)-based wildfire monitoring systems. This paper investigates whether a diffusion-based foundation model for Earth Observation (EO), EarthSynth, can synthesize realistic post-wildfire Sentinel-2 RGB imagery conditioned on existing burn masks, without task-specific retraining. Using burn masks derived from the CalFireSeg-50 dataset (Martin et al., 2025), we design and evaluate six controlled experimental configurations that systematically vary: (i) pipeline architecture (mask-only full generation vs. inpainting with pre-fire context), (ii) prompt engineering strategy (three hand-crafted prompts and a VLM-generated prompt via Qwen2-VL), and (iii) a region-wise color-matching post-processing step. Quantitative assessment on 10 stratified test samples uses four complementary metrics: Burn IoU, burn-region color distance (ΔC_burn), Darkness Contrast, and Spectral Plausibility. Results show that inpainting-based pipelines consistently outperform full-tile generation across all metrics, with the structured inpainting prompt achieving the best spatial alignment (Burn IoU = 0.456) and burn saliency (Darkness Contrast = 20.44), while color matching produces the lowest color distance (ΔC_burn = 63.22) at the cost of reduced burn saliency. VLM-assisted inpainting is competitive with hand-crafted prompts. These findings provide a foundation for incorporating generative data augmentation into wildfire detection pipelines. Code and experiments are available at: https://www.kaggle.com/code/valeriamartinh/genai-all-runned
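Burn IoU, the spatial-alignment metric above, is an intersection-over-union between a burn region recovered from the generated image and the conditioning burn mask. The sketch below computes IoU for two binary masks given as nested lists; how the paper extracts a burn region from a generated RGB tile is not stated in the abstract, so the predicted mask here is simply assumed to exist.

```python
def mask_iou(pred_mask, true_mask):
    """Intersection over union of two same-sized binary masks (nested lists)."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly by convention.
    return intersection / union if union else 1.0


# Toy 3x3 masks: 1 marks a burned pixel.
conditioning_mask = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
]
recovered_mask = [
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 0],
]

print(mask_iou(recovered_mask, conditioning_mask))
```

In the toy case the masks share two of six burned pixels, which is the kind of partial overlap a Burn IoU below 0.5 indicates.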

[26] VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors cs.CV | cs.CL

Haz Sameen Shahgir, Xiaofu Chen, Yu Fu, Erfan Shayegani, Nael Abu-Ghazaleh

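TL;DR: This paper finds that current vision-language models (VLMs) do poorly on tasks requiring fine-grained visual perception not because their internal representations lack the information, but because their training pipeline relies too heavily on mapping visual information into the text space. The models can reason only about visual entities that correspond to existing concepts in the language space, which limits them on vision-centric tasks such as visual correspondence and reasoning about novel entities.

Details

Motivation: To explain why VLMs underperform on tasks that need fine-grained visual perception (such as visual correspondence), and to test whether the root cause is over-reliance on textual semantic anchors during training at the expense of visual detail.

Result: On semantic, shape, and face correspondence tasks, VLMs perform far better on nameable entities than on unnameable ones. Mechanistic analysis with Logit Lens confirms that the models explicitly assign semantic labels to nameable entities and surface more unique corresponding tokens. Assigning arbitrary names to unknown entities improves performance, while task-specific finetuning yields stronger generalization without relying on language priors.

Insight: The core contribution is to expose a fundamental limitation of the current VLM training paradigm (aligning visual information to the text space): models tend to reason from "semantic anchors" rather than raw visual detail. The finding suggests that VLM failures on visual tasks are learned shortcuts from training rather than a fundamental flaw of multimodal architectures, and it points toward training objectives that strengthen visual understanding and reasoning in their own right.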

Abstract: Vision Language Models (VLMs) achieve impressive performance across a wide range of multimodal tasks. However, on some tasks that demand fine-grained visual perception, they often fail even when the required information is present in their internal representations. In this work, we demonstrate that this gap arises from their narrow training pipeline which focuses on moving visual information to the textual space. Consequently, VLMs can only reason about visual entities that can be mapped to known concepts in the language space, leaving vision-focused tasks such as visual correspondence and reasoning about novel visual entities poorly supported. As a result, VLMs are severely limited in several important multimodal capabilities because they rely on brittle, hallucinated textual descriptions of visual entities that they cannot map to textual representations. We verify this behavior through visual correspondence tasks, in which VLMs must detect matching entities between two images. Testing across semantic, shape, and face correspondence tasks, we find that VLMs perform much better when the relevant entities are nameable in language than when they are unnameable. Mechanistically, our Logit Lens analyses confirm that VLMs explicitly assign semantic labels to nameable entities and surface more unique corresponding tokens compared to unnameable entities. Furthermore, we show that teaching completely arbitrary names for unknown entities improves performance, yet task-specific finetuning yields even stronger generalization without relying on language priors. Our findings suggest that current VLM failures on visual tasks reflect learned shortcuts from their training, rather than a fundamental limitation of multimodal architectures.

[27] Token-Efficient Multimodal Reasoning via Image Prompt Packaging cs.CV | cs.AI

Joong Ho Choi, Jiayang Zhao, Avani Appalla, Himansh Mukesh, Dhwanil Vasani

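TL;DR: This paper proposes Image Prompt Packaging (IPPg), a visual prompting paradigm that embeds structured text directly into images to cut text-token overhead and so lower the inference cost of large multimodal models. Benchmarked on five datasets, three frontier models (GPT-4.1, GPT-4o, Claude 3.5 Sonnet), and two task families (VQA and code generation), it reduces inference cost by 35.8% to 91.0% while keeping accuracy competitive in many settings.

Details

Motivation: Deploying large multimodal language models at scale is constrained by token-based inference costs, yet the cost-performance behavior of visual prompting strategies is poorly understood. The paper aims to lower inference cost by reducing text-token overhead.

Result: On CoSQL, GPT-4.1 improves accuracy and cost at the same time, whereas Claude 3.5 incurs higher cost on several VQA benchmarks. Despite token compression of up to 96%, accuracy stays competitive in many settings, but outcomes depend strongly on the model and the task.

Insight: The contribution is embedding structured text in images to reduce token usage and thereby cut inference cost substantially. The study finds that visual encoding choices are a key variable in multimodal system design: spatial reasoning, non-English inputs, and character-sensitive operations are the most vulnerable, while schema-structured tasks benefit most.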

Abstract: Deploying large multimodal language models at scale is constrained by token-based inference costs, yet the cost-performance behavior of visual prompting strategies remains poorly characterized. We introduce Image Prompt Packaging (IPPg), a prompting paradigm that embeds structured text directly into images to reduce text token overhead, and benchmark it across five datasets, three frontier models (GPT-4.1, GPT-4o, Claude 3.5 Sonnet), and two task families (VQA and code generation). We derive a cost formulation decomposing savings by token type and show IPPg achieves 35.8–91.0% inference cost reductions. Despite token compression of up to 96%, accuracy remains competitive in many settings, though outcomes are highly model- and task-dependent: GPT-4.1 achieves simultaneous accuracy and cost gains on CoSQL, while Claude 3.5 incurs cost increases on several VQA benchmarks. Systematic error analysis yields a failure-mode taxonomy: spatial reasoning, non-English inputs, and character-sensitive operations are most vulnerable, while schema-structured tasks benefit most. A 125-configuration rendering ablation reveals accuracy shifts of 10–30 percentage points, establishing visual encoding choices as a first-class variable in multimodal system design.
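The abstract above mentions a cost formulation that decomposes savings by token type but does not give it. The sketch below is one plausible form, assuming cost is a price-weighted sum of text-input, image-input, and output tokens; the token counts and prices are invented for illustration and are not from the paper or any provider's price list.

```python
def request_cost(text_tokens, image_tokens, output_tokens,
                 price_text, price_image, price_output):
    """Price-weighted token cost of one request (arbitrary currency units)."""
    return (text_tokens * price_text
            + image_tokens * price_image
            + output_tokens * price_output)


def relative_saving(baseline_cost, packaged_cost):
    """Fraction of the baseline cost saved; negative means cost went up."""
    return 1.0 - packaged_cost / baseline_cost


# Hypothetical per-token prices (output tokens priced higher than input).
PRICES = dict(price_text=1.0, price_image=1.0, price_output=4.0)

# Baseline: the whole prompt is sent as text.
baseline = request_cost(1000, 0, 100, **PRICES)

# Packaged: most text moved into an image that is cheap to encode.
packaged_cheap = request_cost(40, 300, 100, **PRICES)

# Packaged, but the model charges many tokens per image.
packaged_costly = request_cost(40, 2000, 100, **PRICES)

print(relative_saving(baseline, packaged_cheap))
print(relative_saving(baseline, packaged_costly))
```

The second case illustrates how a model whose image encoding is token-heavy could see cost increases, consistent with the model-dependent outcomes the abstract reports.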

[28] An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis cs.CV | cs.AI

Md. Sajeebul Islam Sk., Md. Mehedi Hasan Shawon, Md. Golam Rabiul Alam

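TL;DR: This paper proposes an explainable vision-language model framework for diagnosing lumbar spinal stenosis. A Spatial Patch Cross-Attention module provides precise, text-guided lesion localization, and an Adaptive PID-Tversky Loss addresses the extreme class imbalance in clinical segmentation data. Combined with an Automated Radiology Report Generation module, the framework performs strongly on diagnostic classification, segmentation, and report generation, and it outputs interpretable clinical reports.

Details

Motivation: Diagnosis of lumbar spinal stenosis depends on manual interpretation of multi-view MRI, which causes large inter-observer variability and diagnostic delays. Existing vision-language models also lose anatomical hierarchy through global pooling when handling clinical segmentation data and cannot cope effectively with extreme class imbalance.

Result: The framework reaches a diagnostic classification accuracy of 90.69%, a macro-averaged Dice score of 0.9512 for segmentation, and a CIDEr score of 92.80% for report generation, setting a new performance benchmark for transparent, interpretable AI in clinical medical imaging.

Insight: The main contributions are: 1) a Spatial Patch Cross-Attention module for precise, text-guided spatial localization; 2) an Adaptive PID-Tversky Loss that draws on control theory to adjust training penalties dynamically and focus on hard minority-class samples; 3) an end-to-end explainable framework that turns segmentation predictions into radiologist-style clinical reports, improving diagnostic capability while keeping essential human supervision.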

Abstract: Lumbar Spinal Stenosis (LSS) diagnosis remains a critical clinical challenge: it depends heavily on labor-intensive manual interpretation of multi-view Magnetic Resonance Imaging (MRI), leading to substantial inter-observer variability and diagnostic delays. Existing vision-language models fail to address the extreme class imbalance prevalent in clinical segmentation datasets while preserving spatial accuracy, primarily because their global pooling mechanisms discard crucial anatomical hierarchies. We present an end-to-end Explainable Vision-Language Model framework designed to overcome these limitations through two principal contributions. First, we propose a Spatial Patch Cross-Attention module that enables precise, text-directed localization of spinal anomalies. Second, we introduce a novel Adaptive PID-Tversky Loss function that integrates control theory principles to modify training penalties dynamically, specifically targeting difficult, under-segmented minority instances. By incorporating foundational VLMs alongside an Automated Radiology Report Generation module, our framework achieves a diagnostic classification accuracy of 90.69%, a macro-averaged Dice score of 0.9512 for segmentation, and a CIDEr score of 92.80%. Furthermore, the framework provides explainability by converting complex segmentation predictions into radiologist-style clinical reports, thereby establishing a new benchmark for transparent, interpretable AI in clinical medical imaging that keeps essential human supervision while enhancing diagnostic capabilities.
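The loss named above builds on the Tversky index, which generalizes Dice by weighting false positives and false negatives separately. The sketch below shows only the standard, fixed-weight Tversky index and loss; the PID-style adaptation of the weights is the paper's contribution and is not specified in the abstract, so it is not reproduced here. The counts are illustrative.

```python
def tversky_index(tp, fp, fn, alpha=0.5, beta=0.5, eps=1e-7):
    """Tversky index: TP / (TP + alpha * FP + beta * FN).

    alpha weights false positives and beta weights false negatives;
    alpha = beta = 0.5 recovers the Dice coefficient.
    """
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)


def tversky_loss(tp, fp, fn, alpha=0.5, beta=0.5):
    """Loss form: lower is better, zero for a perfect segmentation."""
    return 1.0 - tversky_index(tp, fp, fn, alpha, beta)


# With alpha = beta = 0.5 the index equals Dice = 2TP / (2TP + FP + FN).
print(tversky_index(8, 2, 2))

# A larger beta penalizes missed (under-segmented) pixels more heavily.
missed_pixels = tversky_loss(tp=8, fp=0, fn=4, alpha=0.3, beta=0.7)
extra_pixels = tversky_loss(tp=8, fp=4, fn=0, alpha=0.3, beta=0.7)
print(missed_pixels, extra_pixels)
```

An adaptive scheme would change `alpha` and `beta` during training, for instance in response to the running under-segmentation error, but the exact controller used in the paper is an open detail here.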

[29] Rapidly deploying on-device eye tracking by distilling visual foundation models cs.CV

Cheng Jiang, Jogendra Kundu, David Colmenares, Fengting Yang, Joseph Robinson

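TL;DR: This paper proposes DistillGaze, a framework that distills a visual foundation model using labeled synthetic data and unlabeled real data for fast, high-performance on-device gaze estimation. It works in two stages: a visual foundation model is first adapted into a domain-specialized teacher with self-supervised learning, and a lightweight student is then trained for on-device deployment.

Details

Motivation: Rapidly deploying accurate on-device gaze estimation is hard when hardware configurations (camera placement, pose, and illumination) change, because off-the-shelf visual foundation models are not accurate enough on specialized near-eye infrared imagery.

Result: Evaluated on a large-scale crowd-sourced dataset of over 2,000 participants, DistillGaze reduces median gaze error by 58.62% relative to synthetic-only baselines while keeping a lightweight 256K-parameter model suitable for real-time on-device deployment.

Insight: The contribution is to combine labeled synthetic data with unlabeled real data when distilling a visual foundation model, bridging the synthetic-to-real domain gap and offering an effective recipe for on-device regression tasks that pairs synthetic supervision with unlabeled real data.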

Abstract: Eye tracking (ET) plays a critical role in augmented and virtual reality applications. However, rapidly deploying high-accuracy, on-device gaze estimation for new products remains challenging because hardware configurations (e.g., camera placement, camera pose, and illumination) often change across device generations. Visual foundation models (VFMs) are a promising direction for rapid training and deployment, and they excel on natural-image benchmarks; yet we find that off-the-shelf VFMs still struggle to achieve high accuracy on specialized near-eye infrared imagery. To address this gap, we introduce DistillGaze, a framework that distills a foundation model by leveraging labeled synthetic data and unlabeled real data for rapid and high-performance on-device gaze estimation. DistillGaze proceeds in two stages. First, we adapt a VFM into a domain-specialized teacher using self-supervised learning on labeled synthetic and unlabeled real images. Synthetic data provides scalable, high-quality gaze supervision, while unlabeled real data helps bridge the synthetic-to-real domain gap. Second, we train an on-device student using both teacher guidance and self-training. Evaluated on a large-scale, crowd-sourced dataset spanning over 2,000 participants, DistillGaze reduces median gaze error by 58.62% relative to synthetic-only baselines while maintaining a lightweight 256K-parameter model suitable for real-time on-device deployment. Overall, DistillGaze provides an efficient pathway for training and deploying ET models that adapt to hardware changes, and offers a recipe for combining synthetic supervision with unlabeled real data in on-device regression tasks.

[30] Feature Attribution Stability Suite: How Stable Are Post-Hoc Attributions? cs.CV | cs.AI | cs.LG

Kamalasankari Subramaniakuppusamy, Jugal Gajjar

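TL;DR: This paper proposes the Feature Attribution Stability Suite (FASS), a new benchmark for evaluating how stable post-hoc feature attribution methods are under realistic input perturbations. FASS enforces prediction-invariance filtering, decomposes stability into three complementary metrics (structural similarity, rank correlation, and top-k Jaccard overlap), and evaluates across geometric, photometric, and compression perturbations.

Details

Motivation: Post-hoc feature attribution methods are widely used in safety-critical vision systems, yet their stability under realistic input perturbations is poorly characterized. Existing metrics focus on additive noise, collapse stability to a single scalar, and do not condition on prediction preservation, which conflates explanation fragility with model sensitivity.

Result: Across three datasets (ImageNet-1K, MS COCO, and CIFAR-10), four attribution methods (Integrated Gradients, GradientSHAP, Grad-CAM, LIME), and four architectures, stability estimates depend critically on the perturbation family and on prediction-invariance filtering. Geometric perturbations expose greater attribution instability than photometric changes, and without prediction-preservation filtering up to 99% of evaluated pairs involve changed predictions. Under the controlled evaluation, Grad-CAM shows the highest stability across datasets.

Insight: The contribution is FASS, a systematic stability evaluation framework that stresses the importance of prediction-invariance filtering and decomposes stability into several metrics. It gives a more reliable basis for assessing and comparing the robustness of attribution methods, shows that perturbation types affect attribution stability differently, and identifies Grad-CAM as comparatively stable among existing methods.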

Abstract: Post-hoc feature attribution methods are widely deployed in safety-critical vision systems, yet their stability under realistic input perturbations remains poorly characterized. Existing metrics evaluate explanations primarily under additive noise, collapse stability to a single scalar, and fail to condition on prediction preservation, conflating explanation fragility with model sensitivity. We introduce the Feature Attribution Stability Suite (FASS), a benchmark that enforces prediction-invariance filtering, decomposes stability into three complementary metrics (structural similarity, rank correlation, and top-k Jaccard overlap), and evaluates across geometric, photometric, and compression perturbations. Evaluating four attribution methods (Integrated Gradients, GradientSHAP, Grad-CAM, LIME) across four architectures and three datasets (ImageNet-1K, MS COCO, and CIFAR-10), FASS shows that stability estimates depend critically on perturbation family and prediction-invariance filtering. Geometric perturbations expose substantially greater attribution instability than photometric changes, and without conditioning on prediction preservation, up to 99% of evaluated pairs involve changed predictions. Under this controlled evaluation, we observe consistent method-level trends, with Grad-CAM achieving the highest stability across datasets.
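Two ideas from the abstract above lend themselves to a small example: top-k Jaccard overlap between attribution maps, and prediction-invariance filtering, which scores a pair only when the model's prediction is unchanged by the perturbation. The sketch below is an illustrative reading of both, assuming flattened attribution maps and ranking by absolute attribution value; neither detail is confirmed by the abstract, and the function names are hypothetical.

```python
def top_k_indices(attribution, k):
    """Indices of the k features with the largest absolute attribution."""
    order = sorted(range(len(attribution)),
                   key=lambda i: abs(attribution[i]), reverse=True)
    return set(order[:k])


def top_k_jaccard(attr_a, attr_b, k):
    """Jaccard overlap of the top-k feature sets of two attribution maps."""
    a, b = top_k_indices(attr_a, k), top_k_indices(attr_b, k)
    return len(a & b) / len(a | b)


def filtered_stability(pairs, k):
    """Mean top-k Jaccard over pairs whose prediction did not change.

    Each pair is (attr_clean, attr_perturbed, pred_clean, pred_perturbed).
    Returns None when no pair survives the filter.
    """
    scores = [top_k_jaccard(a, b, k)
              for a, b, pred_a, pred_b in pairs if pred_a == pred_b]
    return sum(scores) / len(scores) if scores else None


clean = [0.9, 0.1, 0.5, 0.0]
shifted = [0.8, 0.6, 0.1, 0.0]

pairs = [
    (clean, clean, "cat", "cat"),    # prediction preserved: counted
    (clean, shifted, "cat", "dog"),  # prediction changed: filtered out
]
print(top_k_jaccard(clean, shifted, k=2))
print(filtered_stability(pairs, k=2))
```

Without the filter, the second pair would pull the score down even though the change reflects the model's sensitivity rather than the fragility of the explanation.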

[31] Overconfidence and Calibration in Medical VQA: Empirical Findings and Hallucination-Aware Mitigation cs.CV | cs.LG

Ji Young Byun, Young-Jin Park, Jean-Philippe Corbeil, Asma Ben Abacha

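TL;DR: Through a systematic empirical study, this paper shows that vision-language models (VLMs) are pervasively overconfident in medical visual question answering (VQA) and evaluates several calibration strategies. Model scaling and prompting strategies such as chain-of-thought do not fix the problem, whereas simple post-hoc calibration (such as Platt scaling) reduces calibration error but, being monotonic, cannot improve discrimination. The paper therefore proposes and validates hallucination-aware calibration (HAC), which adds vision-grounded hallucination detection signals and improves both calibration and AUROC, especially on open-ended questions.

Details

Motivation: As VLMs are increasingly deployed in clinical decision support, accuracy alone is no longer enough; knowing when to trust a prediction matters just as much. A comprehensive, systematic study of overconfidence in these models is still largely missing in the medical domain, and this paper aims to fill that gap.

Result: The study covers three medical VQA benchmarks, three model families (Qwen3-VL, InternVL3, LLaVA-NeXT), three model scales (2B–38B), and several confidence-estimation prompting strategies. It finds that: 1) overconfidence is pervasive and is not resolved by scaling or prompting; 2) simple post-hoc calibration (such as Platt scaling) reduces calibration error and outperforms prompt-based strategies; 3) because of their monotonicity, post-hoc calibration methods cannot improve AUROC. The proposed HAC improves both calibration and AUROC, with the largest gains on open-ended questions.

Insight: The core contribution is a systematic demonstration of how widespread and persistent overconfidence is in medical VLMs, together with HAC as a targeted remedy. The key insight is that feeding vision-grounded hallucination detection signals in as auxiliary inputs to refine confidence estimates breaks the performance ceiling that monotonicity imposes on conventional post-hoc calibration, improving calibration reliability and discriminative ability together. The practical guidance for high-stakes domains such as medicine is to adopt post-hoc calibration as standard practice and to make active use of hallucination signals.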

Abstract: As vision-language models (VLMs) are increasingly deployed in clinical decision support, more than accuracy is required: knowing when to trust their predictions is equally critical. Yet, a comprehensive and systematic investigation into the overconfidence of these models remains notably scarce in the medical domain. We address this gap through a comprehensive empirical study of confidence calibration in VLMs, spanning three model families (Qwen3-VL, InternVL3, LLaVA-NeXT), three model scales (2B–38B), and multiple confidence estimation prompting strategies, across three medical visual question answering (VQA) benchmarks. Our study yields three key findings: First, overconfidence persists across model families and is not resolved by scaling or prompting, such as chain-of-thought and verbalized confidence variants. Second, simple post-hoc calibration approaches, such as Platt scaling, reduce calibration error and consistently outperform the prompt-based strategy. Third, due to their (strict) monotonicity, these post-hoc calibration methods are inherently limited in improving the discriminative quality of predictions, leaving AUROC at the same level. Motivated by these findings, we investigate hallucination-aware calibration (HAC), which incorporates vision-grounded hallucination detection signals as complementary inputs to refine confidence estimates. We find that leveraging these hallucination signals improves both calibration and AUROC, with the largest gains on open-ended questions. Overall, our findings suggest post-hoc calibration as standard practice for medical VLM deployment over raw confidence estimates, and highlight the practical usefulness of hallucination signals to enable more reliable use of VLMs in medical VQA.
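The third finding above, that strictly monotone post-hoc calibration cannot change AUROC, follows because AUROC depends only on how scores rank positives against negatives. The sketch below illustrates this with a Platt-style sigmoid map whose parameters `a` and `b` are fixed by hand for illustration (in practice they are fitted on held-out data); the confidence scores and labels are made up.

```python
import math


def platt(score, a, b):
    """Platt-style recalibration: sigmoid(a * score + b), monotone for a > 0."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


def auroc(scores, labels):
    """Probability that a random correct answer outranks a random wrong one."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Overconfident raw confidences; label 1 means the answer was correct.
raw = [0.99, 0.97, 0.95, 0.90]
labels = [1, 0, 1, 0]

# Hand-picked parameters that pull confidences down toward the accuracy.
calibrated = [platt(s, a=4.0, b=-3.5) for s in raw]

print("raw AUROC:", auroc(raw, labels))
print("calibrated AUROC:", auroc(calibrated, labels))
print("calibrated confidences:", [round(c, 3) for c in calibrated])
```

The recalibrated confidences are lower and closer to the 50% accuracy of this toy set, but their ordering is unchanged, so the AUROC is identical. Improving AUROC requires extra information that can reorder examples, which is the role the abstract assigns to hallucination signals.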

[32] WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models cs.CV | cs.LGPDF

Haiyu Wang, Yutong Wang, Jack Jiang, Sai Qian Zhang

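TL;DR: The paper proposes Weighted Singular Value Decomposition (WSVD), a new method for accelerating the execution of low-precision vision-language models. By applying SVD at a finer granularity, adaptively allocating importance to weight elements, and combining this with weight and activation quantization, it substantially speeds up inference while preserving accuracy.

Details

Motivation: Existing SVD methods reduce the computational burden of vision-language models on paper but struggle to deliver substantial latency reductions in actual execution, so a more effective low-rank approximation technique is needed to accelerate inference.

Result: In experiments, WSVD achieves more than 1.8× decoding speedup over other approaches while preserving accuracy.

Insight: The novelty lies in a fine-grained weighted SVD computation pattern that adaptively allocates importance to weight elements to better preserve accuracy, combined with quantization to enable efficient execution of low-precision vision-language models.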

Abstract: Singular Value Decomposition (SVD) has become an important technique for reducing the computational burden of Vision Language Models (VLMs), which play a central role in tasks such as image captioning and visual question answering. Although multiple prior works have proposed efficient SVD variants to enable low-rank operations, we find that in practice it remains difficult to achieve substantial latency reduction during model execution. To address this limitation, we introduce a new computational pattern and apply SVD at a finer granularity, enabling real and measurable improvements in execution latency. Furthermore, recognizing that weight elements differ in their relative importance, we adaptively allocate relative importance to each element during the SVD process to better preserve accuracy, then extend this framework with quantization applied to both weights and activations, resulting in a highly efficient VLM. Collectively, we introduce Weighted SVD (WSVD), which outperforms other approaches by achieving over $1.8\times$ decoding speedup while preserving accuracy. We open-source our code at https://github.com/SAI-Lab-NYU/WSVD.
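
For context, the standard accounting behind low-rank approximation can be worked through in a few lines. This is generic arithmetic, not the WSVD method: the layer size and rank below are hypothetical, and multiply-accumulate (MAC) counts are only a proxy, since the abstract's point is that such savings do not automatically show up as lower latency.

```python
def dense_macs(m, n):
    """MACs for multiplying an m x n weight matrix by one input vector."""
    return m * n

def low_rank_macs(m, n, r):
    """MACs when the matrix is replaced by factors of shape m x r and r x n."""
    return r * (m + n)

def break_even_rank(m, n):
    """Largest rank for which the factored form needs strictly fewer MACs."""
    return (m * n - 1) // (m + n)

# Hypothetical square projection layer and target rank.
m, n, r = 4096, 4096, 512

dense = dense_macs(m, n)
factored = low_rank_macs(m, n, r)
theoretical_saving = dense / factored
max_useful_rank = break_even_rank(m, n)
```

With these assumed sizes the factored layer needs a quarter of the MACs, and any rank of 2048 or more would cost at least as much as the dense layer.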

[33] FusionBERT: Multi-View Image-3D Retrieval via Cross-Attention Visual Fusion and Normal-Aware 3D Encoder cs.CVPDF

Wei Li, Yufan Ren, Hanqing Jiang, Jianhui Ding, Zhen Peng

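TL;DR: This paper proposes FusionBERT, a novel multi-view visual fusion framework for image-3D multimodal retrieval. A cross-attention-based multi-view visual aggregator adaptively integrates features from multi-view images of an object, and a normal-aware 3D model encoder jointly encodes point normals and 3D positions to strengthen geometric features. Experiments show that the method significantly outperforms existing SOTA multimodal large models in both single-view and multi-view settings.

Details

Motivation: Existing image-3D representation learning methods focus mainly on aligning features of a single object image with its 3D model, which limits their applicability in real-world scenarios where objects are usually observed and captured from multiple viewpoints. Although multi-view observations provide complementary geometric and appearance cues, existing multimodal large models rarely explore how to fuse such multi-view visual information effectively to improve cross-modal retrieval.

Result: Extensive image-3D retrieval experiments show that FusionBERT achieves significantly higher retrieval accuracy than SOTA multimodal large models in both single-view and multi-view settings, establishing a strong baseline for multi-view multimodal retrieval.

Insight: The contributions include: 1) a cross-attention-based multi-view visual aggregator that adaptively fuses multi-view image features and emphasizes informative cues across views; 2) a normal-aware 3D model encoder that jointly encodes point normals and 3D positions to strengthen geometric feature learning for textureless or color-degraded 3D models, improving representation robustness.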

Abstract: We propose FusionBERT, a novel multi-view visual fusion framework for image-3D multimodal retrieval. Existing image-3D representation learning methods predominantly focus on feature alignment of a single object image and its 3D model, limiting their applicability in realistic scenarios where an object is typically observed and captured from multiple viewpoints. Although multi-view observations naturally provide complementary geometric and appearance cues, existing multimodal large models rarely explore how to effectively fuse such multi-view visual information for better cross-modal retrieval. To address this limitation, we introduce a multi-view image-3D retrieval framework named FusionBERT, which innovatively utilizes a cross-attention-based multi-view visual aggregator to adaptively integrate features from multi-view images of an object. The proposed multi-view visual encoder fuses inter-view complementary relationships and selectively emphasizes informative visual cues across multiple views to get a more robustly fused visual feature for better 3D model matching. Furthermore, FusionBERT proposes a normal-aware 3D model encoder that can further enhance the 3D geometric feature of an object model by jointly encoding point normals and 3D positions, enabling a more robust representation learning for textureless or color-degraded 3D models. Extensive image-3D retrieval experiments demonstrate that FusionBERT achieves significantly higher retrieval accuracy than SOTA multimodal large models under both single-view and multi-view settings, establishing a strong baseline for multi-view multimodal retrieval.

[34] Moondream Segmentation: From Words to Masks cs.CV | cs.AIPDF

Ethan Reid

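TL;DR: This paper presents Moondream Segmentation, a referring image segmentation extension of the Moondream 3 vision-language model. Given an image and a referring expression, the model autoregressively decodes a vector path and iteratively refines the rasterized mask into a final detailed mask. The paper also introduces a reinforcement learning stage that directly optimizes mask quality, and releases the RefCOCO-M dataset to address evaluation noise caused by polygon annotations.

Details

Motivation: The goal is to extend a vision-language model to referring image segmentation, that is, producing an accurate mask in an image from a natural-language description, and to address ambiguity in the existing supervised signal as well as annotation noise.

Result: The model reaches 80.2% cIoU on RefCOCO (val) and 62.6% mIoU on LVIS (val), demonstrating its effectiveness.

Insight: The contributions include: 1) a mask generation pipeline that combines autoregressive decoding with iterative refinement; 2) a reinforcement learning stage that directly optimizes mask quality to resolve ambiguity in the supervised signal; 3) the release of RefCOCO-M, which provides boundary-accurate masks for better evaluation. These ideas could carry over to other vision tasks that need fine-grained, language-guided segmentation.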

Abstract: We present Moondream Segmentation, a referring image segmentation extension of Moondream 3, a vision-language model. Given an image and a referring expression, the model autoregressively decodes a vector path and iteratively refines the rasterized mask into a final detailed mask. We introduce a reinforcement learning stage that resolves ambiguity in the supervised signal by directly optimizing mask quality. Rollouts from this stage produce coarse-to-ground-truth targets for the refiner. To mitigate evaluation noise from polygon annotations, we release RefCOCO-M, a cleaned RefCOCO validation split with boundary-accurate masks. Moondream Segmentation achieves a cIoU of 80.2% on RefCOCO (val) and 62.6% mIoU on LVIS (val).
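
The two reported metrics aggregate differently, which matters when reading the numbers. The sketch below uses the definitions commonly found in referring-segmentation work (cIoU as total intersection over total union, mIoU as the mean of per-sample IoU); the abstract does not spell out the exact evaluation protocol, so treat this as an assumption. Masks are represented as sets of pixel coordinates, and `box` is a hypothetical helper for building toy masks.

```python
def iou(pred, gt):
    """Intersection over union of two masks given as sets of (x, y) pixels."""
    union = len(pred | gt)
    return len(pred & gt) / union if union else 1.0

def mean_iou(pairs):
    """mIoU: average the per-sample IoU, so every sample counts equally."""
    return sum(iou(p, g) for p, g in pairs) / len(pairs)

def cumulative_iou(pairs):
    """cIoU: pool intersections and unions first, so large objects dominate."""
    inter = sum(len(p & g) for p, g in pairs)
    union = sum(len(p | g) for p, g in pairs)
    return inter / union

def box(x0, y0, x1, y1):
    """Hypothetical helper: a filled rectangle as a set of pixels."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}

pairs = [
    (box(0, 0, 10, 10), box(0, 0, 10, 10)),  # large object, perfect prediction
    (box(0, 0, 2, 2), box(1, 1, 3, 3)),      # small object, mostly missed
]

miou = mean_iou(pairs)
ciou = cumulative_iou(pairs)
```

On this toy pair the pooled score is much higher than the per-sample mean, because the well-segmented large object contributes most of the pixels.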

[35] Unlocking Multi-Site Clinical Data: A Federated Approach to Privacy-First Child Autism Behavior Analysis cs.CVPDF

Guangyu Sun, Wenhan Wu, Zhishuai Guo, Ziteng Wang, Pegah Khosravi

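TL;DR: This paper proposes a privacy-preserving framework based on federated learning for multi-site child autism behavior recognition. Human skeletal abstraction removes identifiable visual information from raw video, and federated learning keeps sensitive pose data inside each clinic, so distributed clinical data can be used to learn generalized representations and support site-specific personalization while preserving privacy.

Details

Motivation: Automated recognition of autistic behaviors in children is essential for early intervention and objective clinical assessment, but strict privacy regulations (e.g., HIPAA) and the sensitivity of pediatric data prevent centralized aggregation of clinical datasets. Individual clinical sites also often face data scarcity, making it hard to learn generalized behavior patterns or tailor models to site-specific patient distributions.

Result: Experiments on the MMASD benchmark show that the framework achieves high recognition accuracy and outperforms traditional federated learning baselines, providing a robust, privacy-first solution for multi-site clinical analysis.

Insight: The novelty lies in the first exploration of federated learning for pose-based child autism behavior recognition, together with a two-layer privacy protection mechanism (skeletal abstraction and federated learning) that balances data utilization and model personalization against privacy protection.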

Abstract: Automated recognition of autistic behaviors in children is essential for early intervention and objective clinical assessment. However, the development of robust models is severely hindered by strict privacy regulations (e.g., HIPAA) and the sensitive nature of pediatric data, which prevents the centralized aggregation of clinical datasets. Furthermore, individual clinical sites often suffer from data scarcity, making it difficult to learn generalized behavior patterns or tailor models to site-specific patient distributions. To address these challenges, we observe that Federated Learning (FL) can decouple model training from raw data access, enabling multi-site collaboration while maintaining strict data residency. In this paper, we present the first study exploring Federated Learning for pose-based child autism behavior recognition. Our framework employs a two-layer privacy protection mechanism: utilizing human skeletal abstraction to remove identifiable visual information from the raw RGB videos and FL to ensure sensitive pose data remains within the clinic. This approach leverages distributed clinical data to learn generalized representations while providing the flexibility for site-specific personalization. Experimental results on the MMASD benchmark demonstrate that our framework achieves high recognition accuracy, outperforming traditional federated baselines and providing a robust, privacy-first solution for multi-site clinical analysis.
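
The idea of decoupling training from raw data access can be illustrated with federated averaging (FedAvg), the standard baseline aggregation rule. The abstract does not say which aggregation rule the framework uses, so this is only a generic sketch; the three clinics, their parameter vectors, and their sample counts are hypothetical.

```python
def fedavg(site_weights, site_sizes):
    """Average model parameters across sites, weighted by local sample count.

    Only parameter vectors leave each site; the raw pose data stays local.
    """
    total = sum(site_sizes)
    dim = len(site_weights[0])
    return [
        sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
        for i in range(dim)
    ]

# Hypothetical locally trained parameters from three clinics.
site_weights = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]
# Hypothetical number of training clips held at each clinic.
site_sizes = [10, 30, 60]

global_weights = fedavg(site_weights, site_sizes)
```

Sites with more data pull the global model toward their parameters. Site-specific personalization would typically start from `global_weights` and fine-tune locally.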

[36] Smart Transfer: Leveraging Vision Foundation Model for Rapid Building Damage Mapping with Post-Earthquake VHR Imagery cs.CV | cs.AI | cs.MMPDF

Hao Li, Liwei Zou, Wenping Yin, Gulsen Taskin, Naoto Yokoya

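TL;DR: This paper proposes Smart Transfer, a framework that uses vision foundation models (FMs) and two novel transfer strategies, Pixel-wise Clustering (PC) and Distance-Penalized Triplet (DPT), for rapid building damage mapping from post-earthquake very-high-resolution (VHR) imagery in support of disaster emergency response.

Details

Motivation: Traditional disaster damage surveys generalize poorly across urban morphologies and new disaster events and depend on time-consuming manual annotation, so they cannot meet the rapid-response demands of the “Golden 72 Hours” of search and rescue.

Result: Extensive experiments and ablations on data from the 2023 Turkiye-Syria earthquake show that the framework performs well in cross-region transfer settings such as Leave One Domain Out (LODO) and Specific Source Domain Combination (SSDC).

Insight: The novelty lies in two transfer strategies: PC performs prototype-level global feature alignment, and DPT integrates spatial autocorrelation patterns by penalizing patches that are semantically inconsistent yet spatially adjacent, yielding an automated GeoAI solution for rapid, scalable disaster mapping.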

Abstract: Living in a changing climate, human society now faces more frequent and severe natural disasters than ever before. As a consequence, rapid disaster response during the “Golden 72 Hours” of search and rescue becomes a vital humanitarian necessity and community concern. However, traditional disaster damage surveys routinely fail to generalize across distinct urban morphologies and new disaster events. Effective damage mapping typically requires exhaustive and time-consuming manual data annotation. To address this issue, we introduce Smart Transfer, a novel Geospatial Artificial Intelligence (GeoAI) framework, leveraging state-of-the-art vision Foundation Models (FMs) for rapid building damage mapping with post-earthquake Very High Resolution (VHR) imagery. Specifically, we design two novel model transfer strategies: first, Pixel-wise Clustering (PC), ensuring robust prototype-level global feature alignment; second, a Distance-Penalized Triplet (DPT), integrating patch-level spatial autocorrelation patterns by assigning stronger penalties to semantically inconsistent yet spatially adjacent patches. Extensive experiments and ablations from the recent 2023 Turkiye-Syria earthquake show promising performance in multiple cross-region transfer settings, namely Leave One Domain Out (LODO) and Specific Source Domain Combination (SSDC). Moreover, Smart Transfer provides a scalable, automated GeoAI solution to accelerate building damage mapping and support rapid disaster response, offering new opportunities to enhance disaster resilience in climate-vulnerable regions and communities. The data and code are publicly available at https://github.com/ai4city-hkust/SmartTransfer.

[37] Cross-Vehicle 3D Geometric Consistency for Self-Supervised Surround Depth Estimation on Articulated Vehicles cs.CV | cs.AIPDF

Weimin Liu, Jiyuan Qiu, Wenjun Wang, Joshua H. Meng

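TL;DR: This paper proposes ArticuSurDepth, a self-supervised surround-view depth estimation framework for articulated vehicles that strengthens depth learning through cross-view and cross-vehicle geometric consistency guided by structural priors from a vision foundation model.

Details

Motivation: Existing self-supervised depth estimation methods are designed mainly for passenger vehicles and rarely consider articulated vehicles or robotics platforms, whose articulated structure introduces complex cross-segment geometry and motion coupling that make consistent depth reasoning across views more challenging.

Result: Experiments on a self-collected dataset and on the DDAD, nuScenes, and KITTI benchmarks show that the method achieves state-of-the-art depth estimation performance.

Insight: The contributions include a multi-view spatial context enrichment strategy, a cross-view surface normal constraint, camera height regularization with ground-plane awareness, and cross-vehicle pose consistency, which together improve structural coherence and metric depth estimation in articulated-vehicle scenarios.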

Abstract: Surround depth estimation provides a cost-effective alternative to LiDAR for 3D perception in autonomous driving. While recent self-supervised methods explore multi-camera settings to improve scale awareness and scene coverage, they are primarily designed for passenger vehicles and rarely consider articulated vehicles or robotics platforms. The articulated structure introduces complex cross-segment geometry and motion coupling, making consistent depth reasoning across views more challenging. In this work, we propose ArticuSurDepth, a self-supervised framework for surround-view depth estimation on articulated vehicles that enhances depth learning through cross-view and cross-vehicle geometric consistency guided by structural priors from a vision foundation model. Specifically, we introduce a multi-view spatial context enrichment strategy and a cross-view surface normal constraint to improve structural coherence across spatial and temporal contexts. We further incorporate camera height regularization with ground-plane awareness to encourage metric depth estimation, together with cross-vehicle pose consistency that bridges motion estimation between articulated segments. To validate the proposed method, we built an articulated-vehicle experimental platform and collected a dataset with it. Experimental results demonstrate state-of-the-art (SoTA) depth estimation performance on our self-collected dataset as well as on the DDAD, nuScenes, and KITTI benchmarks.

[38] Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs cs.CV | cs.AIPDF

Yuhui Lin, Siyue Yu, Yuxing Yang, Guangliang Cheng, Jimin Xiao

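TL;DR: This paper proposes Efficient3D, a unified framework for accelerating inference in 3D multimodal large language models (3D MLLMs). A Debiased Visual Token Importance Estimator (DVTIE) predicts token importance more reliably, and an Adaptive Token Rebalancing (ATR) strategy dynamically adjusts pruning strength according to scene complexity, substantially reducing computation while maintaining accuracy.

Details

Motivation: 3D MLLMs are large and their input features are high-dimensional, which leads to heavy inference overhead and limits practical deployment on resource-constrained platforms. The paper aims to remove this efficiency bottleneck.

Result: Experiments on five representative 3D vision-and-language benchmarks (ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D) show that Efficient3D outperforms the unpruned baselines, for example with a +2.57% CIDEr improvement on Scan2Cap.

Insight: The main contribution is a unified token pruning framework that combines the DVTIE module (which debiases importance estimation by accounting for the influence of shallow initial layers in attention aggregation) with the ATR strategy (which adapts pruning strength to scene complexity to preserve semantic completeness and balanced attention across layers), enabling context-aware token reduction. This provides a scalable and effective solution for efficient inference in 3D MLLMs.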

Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have expanded reasoning capabilities into 3D domains, enabling fine-grained spatial understanding. However, the substantial size of 3D MLLMs and the high dimensionality of input features introduce considerable inference overhead, which limits practical deployment on resource constrained platforms. To overcome this limitation, this paper presents Efficient3D, a unified framework for visual token pruning that accelerates 3D MLLMs while maintaining competitive accuracy. The proposed framework introduces a Debiased Visual Token Importance Estimator (DVTIE) module, which considers the influence of shallow initial layers during attention aggregation, thereby producing more reliable importance predictions for visual tokens. In addition, an Adaptive Token Rebalancing (ATR) strategy is developed to dynamically adjust pruning strength based on scene complexity, preserving semantic completeness and maintaining balanced attention across layers. Together, they enable context-aware token reduction that maintains essential semantics with lower computation. Comprehensive experiments conducted on five representative 3D vision and language benchmarks, including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D, demonstrate that Efficient3D achieves superior performance compared with unpruned baselines, with a +2.57% CIDEr improvement on the Scan2Cap dataset. Therefore, Efficient3D provides a scalable and effective solution for efficient inference in 3D MLLMs. The code is released at: https://github.com/sol924/Efficient3D

Fuyuan Liu, Dianyu Yu, He Ren, Nayu Liu, Xiaomian Kang

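TL;DR: This paper proposes a lightweight structural refinement module that stabilizes the layout interface in document parsing. Placed between a DETR-style detector and the parser, the module performs set-level reasoning over query features, semantic cues, box geometry, and visual evidence to jointly decide instance retention, refine box localization, and predict parser input order, addressing parsing errors caused by unstable layout hypotheses on dense pages.

Details

Motivation: In explicit Document Layout Analysis (DLA) pipelines, downstream parsers do not consume the full detector output; they operate on a retained and serialized set of layout instances. On dense pages with overlapping regions and ambiguous boundaries, however, unstable layout hypotheses can make the retained instance set inconsistent with its parser input order, causing severe downstream parsing errors.

Result: Extensive experiments on public benchmarks show that the method consistently improves page-level layout quality. When integrated into a standard end-to-end parsing pipeline, the stabilized parser interface substantially reduces sequence mismatch, achieving a Reading Order Edit of 0.024 on OmniDocBench.

Insight: The contributions include a structural refinement stage between detector and parser that performs set-level reasoning to jointly optimize instance retention, box localization, and input order, and retention-oriented supervision plus a difficulty-aware ordering objective that better align the retained instance set and its order with the final parser input on structurally complex pages. Viewed objectively, stabilizing the interface with a lightweight intermediate module offers a new approach to order mismatches caused by unstable layouts in document parsing.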

Abstract: Accurate document parsing requires both robust content recognition and a stable parser interface. In explicit Document Layout Analysis (DLA) pipelines, downstream parsers do not consume the full detector output. Instead, they operate on a retained and serialized set of layout instances. However, on dense pages with overlapping regions and ambiguous boundaries, unstable layout hypotheses can make the retained instance set inconsistent with its parser input order, leading to severe downstream parsing errors. To address this issue, we introduce a lightweight structural refinement stage between a DETR-style detector and the parser to stabilize the parser interface. Treating raw detector outputs as a compact hypothesis pool, the proposed module performs set-level reasoning over query features, semantic cues, box geometry, and visual evidence. From a shared refined structural state, it jointly determines instance retention, refines box localization, and predicts parser input order before handoff. We further introduce retention-oriented supervision and a difficulty-aware ordering objective to better align the retained instance set and its order with the final parser input, especially on structurally complex pages. Extensive experiments on public benchmarks show that our method consistently improves page-level layout quality. When integrated into a standard end-to-end parsing pipeline, the stabilized parser interface also substantially reduces sequence mismatch, achieving a Reading Order Edit of 0.024 on OmniDocBench.
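
A reading-order edit score of this kind can be sketched as a normalized Levenshtein distance between the predicted and reference block sequences. The exact normalization used by OmniDocBench may differ, so the formula here is an assumption, and the block identifiers are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]

def normalized_order_edit(pred, ref):
    """Edit distance divided by the longer sequence length (assumed normalization)."""
    longest = max(len(pred), len(ref))
    return edit_distance(pred, ref) / longest if longest else 0.0

# Hypothetical layout blocks in reference reading order.
ref = ["title", "para1", "fig1", "caption1", "para2"]
swapped = ["title", "fig1", "para1", "caption1", "para2"]  # two blocks out of order
dropped = ["title", "para1", "caption1", "para2"]          # one block not retained

score_swapped = normalized_order_edit(swapped, ref)
score_dropped = normalized_order_edit(dropped, ref)
```

Both failure modes described in the abstract, a wrong order and an inconsistent retained set, raise the score, which is why a lower value indicates a more stable parser interface.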

[40] DocShield: Towards AI Document Safety via Evidence-Grounded Agentic Reasoning cs.CV | cs.AIPDF

Fanwei Zeng, Changtao Miao, Jing Huang, Zhiya Tan, Shutao Gong

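TL;DR: This paper proposes DocShield, the first framework to unify text-centric image forgery analysis as a visual-logical co-reasoning problem. At its core is a novel Cross-Cues-aware Chain of Thought mechanism that iteratively cross-validates visual anomalies against textual semantics to produce evidence-grounded forensic analysis. The authors also construct the multilingual RealText-V1 dataset and introduce a weighted multi-task reward for GRPO-based optimization. Experiments show that DocShield significantly outperforms existing methods on multiple benchmarks.

Details

Motivation: Rapid progress in generative AI has made text-centric image forgeries increasingly realistic, posing major challenges to document safety. Existing forensic methods rely mainly on visual cues and lack the evidence-based reasoning needed to reveal subtle text manipulations, and detection, localization, and explanation are often handled in isolation, which limits reliability and interpretability.

Result: On the T-IC13 benchmark, DocShield improves macro-average F1 by 41.4% over specialized frameworks and by 23.4% over GPT-4o. It also achieves consistent gains on the challenging T-SROIE benchmark.

Insight: The main contributions are: 1) unifying text forgery analysis as a visual-logical co-reasoning problem; 2) a Cross-Cues-aware Chain of Thought mechanism that enables implicit agentic reasoning and cross-validation; 3) a weighted multi-task reward for GRPO optimization that aligns reasoning structure, spatial evidence, and authenticity prediction; 4) RealText-V1, a multilingual dataset with pixel-level manipulation masks and expert-level textual explanations.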

Abstract: The rapid progress of generative AI has enabled increasingly realistic text-centric image forgeries, posing major challenges to document safety. Existing forensic methods mainly rely on visual cues and lack evidence-based reasoning to reveal subtle text manipulations. Detection, localization, and explanation are often treated as isolated tasks, limiting reliability and interpretability. To tackle these challenges, we propose DocShield, the first unified framework formulating text-centric forgery analysis as a visual-logical co-reasoning problem. At its core, a novel Cross-Cues-aware Chain of Thought (CCT) mechanism enables implicit agentic reasoning, iteratively cross-validating visual anomalies with textual semantics to produce consistent, evidence-grounded forensic analysis. We further introduce a Weighted Multi-Task Reward for GRPO-based optimization, aligning reasoning structure, spatial evidence, and authenticity prediction. Complementing the framework, we construct RealText-V1, a multilingual dataset of document-like text images with pixel-level manipulation masks and expert-level textual explanations. Extensive experiments show DocShield significantly outperforms existing methods, improving macro-average F1 by 41.4% over specialized frameworks and 23.4% over GPT-4o on T-IC13, with consistent gains on the challenging T-SROIE benchmark. Our dataset, model, and code will be publicly released.
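
The headline numbers are stated in macro-average F1, which averages per-class F1 so that a rare class counts as much as a common one. The sketch below shows the standard definition on a toy forgery-detection split; the class names and predictions are hypothetical and unrelated to the paper's benchmarks.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, treating it as the positive label."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class that appears."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Hypothetical imbalanced test set: 8 authentic images, 2 forged ones.
y_true = ["real"] * 8 + ["forged"] * 2
# A degenerate detector that never flags a forgery.
y_pred = ["real"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

The degenerate detector reaches 80% accuracy but a macro-F1 below 0.5, which is why macro-averaging is a common choice when forged samples are the minority.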

[41] XrayClaw: Cooperative-Competitive Multi-Agent Alignment for Trustworthy Chest X-ray Diagnosis cs.CVPDF

Shawn Young, Lijian Xu

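TL;DR: This paper proposes XrayClaw, a cooperative-competitive multi-agent alignment framework for trustworthy chest X-ray diagnosis. The framework integrates four cooperative agents that simulate a systematic clinical workflow and adds a competitive agent as an independent auditor, using the proposed Competitive Preference Optimization objective to penalize illogical reasoning and thereby improve diagnostic accuracy and reliability.

Details

Motivation: Traditional monolithic models often lack nuanced reasoning when interpreting chest X-rays, leading to logical inconsistencies and diagnostic hallucinations, and existing multi-agent systems built on a single underlying model are susceptible to consensus-based errors. A new framework is therefore needed to make diagnoses more trustworthy.

Result: Extensive experiments on the MS-CXR-T, MIMIC-CXR, and CheXbench benchmarks show that XrayClaw achieves state-of-the-art performance in diagnostic accuracy, clinical reasoning fidelity, and zero-shot domain generalization.

Insight: The novelty lies in a multi-agent architecture that combines cooperative and competitive mechanisms, together with the Competitive Preference Optimization objective, which mitigates cumulative hallucinations through mutual verification between analytical and holistic interpretations. The authors present this as a new paradigm for trustworthy medical imaging analysis.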

Abstract: Chest X-ray (CXR) interpretation is a fundamental yet complex clinical task that increasingly relies on artificial intelligence for automation. However, traditional monolithic models often lack the nuanced reasoning required for trustworthy diagnosis, frequently leading to logical inconsistencies and diagnostic hallucinations. While multi-agent systems offer a potential solution by simulating collaborative consultations, existing frameworks remain susceptible to consensus-based errors when instantiated by a single underlying model. This paper introduces XrayClaw, a novel framework that operationalizes multi-agent alignment through a sophisticated cooperative-competitive architecture. XrayClaw integrates four specialized cooperative agents to simulate a systematic clinical workflow, alongside a competitive agent that serves as an independent auditor. To reconcile these distinct diagnostic pathways, we propose Competitive Preference Optimization, a learning objective that penalizes illogical reasoning by enforcing mutual verification between analytical and holistic interpretations. Extensive empirical evaluations on the MS-CXR-T, MIMIC-CXR, and CheXbench benchmarks demonstrate that XrayClaw achieves state-of-the-art performance in diagnostic accuracy, clinical reasoning fidelity, and zero-shot domain generalization. Our results indicate that XrayClaw effectively mitigates cumulative hallucinations and enhances the overall reliability of automated CXR diagnosis, establishing a new paradigm for trustworthy medical imaging analysis.

[42] ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving cs.CVPDF

Zihao Sheng, Xin Ye, Jingru Luo, Sikai Chen, Liu Ren

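TL;DR: This paper proposes ExploreVLA, a framework that adds exploration capability to end-to-end autonomous driving models built on Vision-Language-Action (VLA) architectures by combining world modeling with reinforcement learning. It uses future RGB and depth image generation as a dense supervisory signal and derives an intrinsic reward from the world model's uncertainty to safely explore scenarios outside the training distribution, improving policy robustness.

Details

Motivation: Existing imitation-learning-based VLA models for end-to-end autonomous driving can only reproduce expert behavior; they do not explore diverse driving strategies and are brittle in unseen or out-of-distribution scenarios. Reinforcement learning provides exploration, but VLA models lack directly observable state transitions, so a world model must be learned to predict the consequences of actions.

Result: The method is validated on the NAVSIM and nuScenes benchmarks, reaching a PDMS of 93.7 and an EPDMS of 88.8 on NAVSIM, which is state-of-the-art (SOTA) performance.

Insight: The main contribution is a unified understanding-and-generation framework that uses dense world modeling (future image generation) as a supervisory signal for learning fine-grained representations, and that uses the world model's prediction uncertainty as an intrinsic reward to safely steer the policy toward novel, valuable out-of-distribution scenarios. This suggests a new way to combine generative models with reinforcement learning for safe exploration.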

Abstract: End-to-end autonomous driving models based on Vision-Language-Action (VLA) architectures have shown promising results by learning driving policies through behavior cloning on expert demonstrations. However, imitation learning inherently limits the model to replicating observed behaviors without exploring diverse driving strategies, leaving it brittle in novel or out-of-distribution scenarios. Reinforcement learning (RL) offers a natural remedy by enabling policy exploration beyond the expert distribution. Yet VLA models, typically trained on offline datasets, lack directly observable state transitions, necessitating a learned world model to anticipate action consequences. In this work, we propose a unified understanding-and-generation framework that leverages world modeling to simultaneously enable meaningful exploration and provide dense supervision. Specifically, we augment trajectory prediction with future RGB and depth image generation as dense world modeling objectives, requiring the model to learn fine-grained visual and geometric representations that substantially enrich the planning backbone. Beyond serving as a supervisory signal, the world model further acts as a source of intrinsic reward for policy exploration: its image prediction uncertainty naturally measures a trajectory’s novelty relative to the training distribution, where high uncertainty indicates out-of-distribution scenarios that, if safe, represent valuable learning opportunities. We incorporate this exploration signal into a safety-gated reward and optimize the policy via Group Relative Policy Optimization (GRPO). Experiments on the NAVSIM and nuScenes benchmarks demonstrate the effectiveness of our approach, achieving a state-of-the-art PDMS score of 93.7 and an EPDMS of 88.8 on NAVSIM. The code and demo will be publicly available at https://zihaosheng.github.io/ExploreVLA/.
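
The combination of a safety-gated exploration bonus with GRPO can be sketched as follows. The group-relative advantage (reward minus group mean, divided by group standard deviation) is the standard GRPO normalization; the additive form of the gate, the weight `beta`, and the rollout values are hypothetical, since the abstract does not give the reward formula.

```python
import statistics

def safety_gated_reward(task_reward, uncertainty, is_safe, beta=0.5):
    """Add an uncertainty bonus only when the trajectory passes the safety check.

    The additive form and beta are assumptions made for illustration.
    """
    bonus = beta * uncertainty if is_safe else 0.0
    return task_reward + bonus

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one group of rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rollouts for one scene: (task reward, world-model uncertainty, safe?).
rollouts = [
    (1.0, 0.2, True),   # safe and familiar
    (1.0, 0.8, True),   # safe and novel: earns the largest bonus
    (0.2, 0.9, False),  # novel but unsafe: bonus is gated off
    (0.7, 0.1, True),
]

rewards = [safety_gated_reward(t, u, s) for t, u, s in rollouts]
advantages = group_relative_advantages(rewards)
```

Under these assumptions the safe, high-uncertainty rollout receives the largest advantage, while the unsafe one gains nothing from its novelty.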

[43] THOM: Generating Physically Plausible Hand-Object Meshes From Text cs.CVPDF

Uyoung Jeong, Yihalem Yimolal Tiruneh, Hyung Jin Chang, Seungryul Baek, Kwang In Kim

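TL;DR: THOM is a training-free framework that generates physically plausible 3D hand-object interaction meshes from text. It uses a two-stage pipeline: it first generates Gaussian representations of the hand and object, then performs physics-based interaction optimization. A new mesh extraction method and vertex-to-Gaussian mapping enable topology-aware regularization, and VLM-guided translation refinement and contact-aware optimization improve physical plausibility.

Details

Motivation: Generating 3D hand-object interactions from text with high visual fidelity and physical plausibility is crucial for dexterous robotic grasping and VR/AR content generation, but existing methods struggle with extracting meshes from text-generated Gaussian representations and with physics-based optimization on erroneous meshes.

Result: Comprehensive experiments show that THOM consistently surpasses state-of-the-art methods in text alignment, visual realism, and interaction plausibility.

Insight: The contributions include removing the need for a template object mesh, a two-stage generate-then-optimize pipeline, an explicit vertex-to-Gaussian mapping that enables topology-aware regularization, and the combination of VLM-guided translation refinement with contact-aware optimization to improve physical plausibility.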

Abstract: The generation of 3D hand-object interactions (HOIs) from text is crucial for dexterous robotic grasping and VR/AR content generation, requiring both high visual fidelity and physical plausibility. Nevertheless, the ill-posed problem of mesh extraction from text-generated Gaussians, and physics-based optimization on the erroneous meshes pose challenges. To address these issues, we introduce THOM, a training-free framework that generates photorealistic, physically plausible 3D HOI meshes without the need for a template object mesh. THOM employs a two-stage pipeline, initially generating the hand and object Gaussians, followed by physics-based HOI optimization. Our new mesh extraction method and vertex-to-Gaussian mapping explicitly assign Gaussian elements to mesh vertices, allowing topology-aware regularization. Furthermore, we improve the physical plausibility of interactions by VLM-guided translation refinement and contact-aware optimization. Comprehensive experiments demonstrate that THOM consistently surpasses state-of-the-art methods in terms of text alignment, visual realism, and interaction plausibility.

[44] Visual Instruction-Finetuned Language Model for Versatile Brain MR Image Tasks cs.CVPDF

Jonghun Kim, Sinyoung Ra, Hyunjin Park

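TL;DR: This paper proposes LLaBIT (Large Language Model for Brain Image Translation), a visual instruction-finetuned language model that aims to handle several clinically relevant brain MRI tasks in one model: report generation, visual question answering, image segmentation, and image translation. The method reuses feature maps from the image encoder to reduce spatial information loss and uses LLM-generated text data to augment the limited image-text paired data.

Details

Motivation: LLMs have made progress on vision-language tasks, but simple text-to-image generation has limited clinical value. In medical imaging, tasks such as segmentation for localizing pathologies or translation for reconstructing missing sequences matter more clinically, yet integrating these diverse tasks into a single, versatile language model has not been explored.

Result: A comprehensive evaluation on five brain MRI datasets across four tasks (report generation, visual question answering, image segmentation, and image translation) shows that the model achieves superior performance on all tasks and outperforms specialized, task-specific models in direct comparisons, demonstrating its efficacy and versatility.

Insight: The contributions include extending the visual reasoning of LLMs to clinically meaningful brain MRI tasks, a feature-map reuse mechanism that mitigates the spatial information loss caused by image tokenization, and the use of LLMs under strict predefined instructions to generate text data for augmentation. Viewed objectively, its unified framework for handling multiple medical imaging tasks is a useful reference.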

Abstract: LLMs have demonstrated remarkable capabilities in linguistic reasoning and are increasingly adept at vision-language tasks. The integration of image tokens into transformers has enabled direct visual input and output, advancing research from image-to-text descriptions to text-to-image generation. However, simple text-to-image generation holds limited clinical utility. In medical imaging, tasks such as image segmentation for localizing pathologies or image translation for reconstructing missing sequences have much greater clinical importance. Despite this, integrating these diverse, clinically relevant tasks within a single, versatile language model remains unexplored. Our method, LLaBIT (Large Language Model for Brain Image Translation), extends the visual reasoning of LLMs to these clinically meaningful tasks in the brain MRI domain. To mitigate the spatial information loss inherent in image tokenization, we incorporate a mechanism to reuse feature maps from the image encoder, minimizing data degradation. We also generate text data using LLMs with strict predefined instructions to augment limited image-text paired data in brain MRI. We comprehensively evaluated our method on five brain MRI datasets across four distinct tasks: report generation, visual question answering, image segmentation, and image translation. Our model not only demonstrated superior performance across all tasks but also outperformed specialized, task-specific models in direct comparisons, highlighting its efficacy and versatility.

[45] DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection cs.CVPDF

Siheng Wang, Yanshu Li, Bohan Hu, Zhengdao Li, Haibo Zhan

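TL;DR: This paper proposes DeCo-DETR, a vision-centric framework for efficient open-vocabulary object detection. Through a decoupling paradigm, it builds a reusable hierarchical semantic prototype space and separates semantic reasoning from localization, significantly improving inference efficiency while maintaining competitive zero-shot detection performance.

Details

Motivation: Existing open-vocabulary object detection methods have two main limitations: reliance on a text encoder at inference time incurs substantial computational overhead, and tightly coupled training objectives create a trade-off between closed-set detection accuracy and open-world generalization. The paper aims to address these practical deployment challenges.

Result: Extensive experiments on standard OVOD benchmarks show that DeCo-DETR significantly improves inference efficiency while maintaining competitive zero-shot detection performance.

Insight: The main contribution is a unified decoupling paradigm: 1) a hierarchical semantic prototype space built from region-level descriptions that are generated by pre-trained LVLMs and aligned via CLIP, which avoids online text encoding; 2) a decoupled training strategy that separates semantic alignment and detection localization into parallel optimization streams, decoupling semantic cognition from detection. This offers a practical direction for scalable OVOD systems.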

Abstract: Open-vocabulary Object Detection (OVOD) enables models to recognize objects beyond predefined categories, but existing approaches remain limited in practical deployment. On the one hand, multimodal designs often incur substantial computational overhead due to their reliance on text encoders at inference time. On the other hand, tightly coupled training objectives introduce a trade-off between closed-set detection accuracy and open-world generalization. Thus, we propose Decoupled Cognition DETR (DeCo-DETR), a vision-centric framework that addresses these challenges through a unified decoupling paradigm. Instead of depending on online text encoding, DeCo-DETR constructs a hierarchical semantic prototype space from region-level descriptions generated by pre-trained LVLMs and aligned via CLIP, enabling efficient and reusable semantic representation. Building upon this representation, the framework further disentangles semantic reasoning from localization through a decoupled training strategy, which separates alignment and detection into parallel optimization streams. Extensive experiments on standard OVOD benchmarks demonstrate that DeCo-DETR achieves competitive zero-shot detection performance while significantly improving inference efficiency. These results highlight the effectiveness of decoupling semantic cognition from detection, offering a practical direction for scalable OVOD systems.
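
The efficiency argument (match against cached semantic prototypes instead of running a text encoder at inference time) can be illustrated with a nearest-prototype lookup. This is only the general idea: the abstract does not describe how DeCo-DETR structures or queries its hierarchical prototype space, and the three-dimensional embeddings and category names below are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def classify_region(region_embedding, prototypes):
    """Return the category whose cached prototype is most similar to the region."""
    return max(prototypes, key=lambda name: cosine(region_embedding, prototypes[name]))

# Hypothetical prototypes, computed once offline and reused for every image.
prototypes = {
    "cat": [0.9, 0.1, 0.0],
    "bicycle": [0.0, 0.2, 0.9],
    "traffic cone": [0.1, 0.9, 0.2],
}

# Hypothetical embedding of one detected region.
region = [0.05, 0.8, 0.3]
label = classify_region(region, prototypes)
```

Because the prototypes are fixed vectors, adding a category means adding one entry to the table rather than invoking a text encoder per query.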

[46] A Unified Perspective on Adversarial Membership Manipulation in Vision Models cs.CVPDF

Ruize Gao, Kaiwen Zhou, Yongqiang Chen, Feng Liu

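TL;DR: This paper presents the first systematic study of adversarial membership manipulation in vision models, revealing how fragile existing membership inference attacks are under adversarial perturbations. The authors show how imperceptible perturbations can make non-member images appear to be members, and propose a detection method and a robust inference framework based on gradient-geometry features.

Details

Motivation: Existing membership inference attacks assume that query inputs are honest, and their adversarial robustness has not been explored. The paper aims to expose and analyze adversarial membership manipulation, an overlooked attack surface in vision models in which adversarial perturbations deceive membership inference attacks.

Result: Experiments show that adversarial membership fabrication is broadly effective across architectures and datasets. In extensive experiments, the proposed detection method based on gradient-geometry signals and the robust inference framework significantly improve resilience to manipulation.

Insight: The core contribution is the first unified analytical perspective on, and defense framework for, adversarial membership manipulation in vision models. The key insight is that fabricated and true members exhibit distinguishable geometric signatures in their gradient-norm trajectories, which provides a theoretical basis for detection.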

Abstract: Membership inference attacks (MIAs) aim to determine whether a specific data point was part of a model’s training set, serving as effective tools for evaluating privacy leakage of vision models. However, existing MIAs implicitly assume honest query inputs, and their adversarial robustness remains unexplored. We show that MIAs for vision models expose a previously overlooked adversarial surface: adversarial membership manipulation, where imperceptible perturbations can reliably push non-member images into the “member” region of state-of-the-art MIAs. In this paper, we provide the first unified perspective on this phenomenon by analyzing its mechanism and implications. We begin by demonstrating that adversarial membership fabrication is consistently effective across diverse architectures and datasets. We then reveal a distinctive geometric signature - a characteristic gradient-norm collapse trajectory - that reliably separates fabricated from true members despite their nearly identical semantic representations. Building on this insight, we introduce a principled detection strategy grounded in gradient-geometry signals and develop a robust inference framework that substantially mitigates adversarial manipulation. Extensive experiments show that fabrication is broadly effective, while our detection and robust inference strategies significantly enhance resilience. This work establishes the first comprehensive framework for adversarial membership manipulation in vision models.

[47] EnsemHalDet: Robust VLM Hallucination Detection via Ensemble of Internal State Detectors cs.CV | cs.CLPDF

Ryuhei Miyazato, Shunsuke Kitada, Kei Harada

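TL;DR: This paper proposes EnsemHalDet, an ensemble hallucination detection framework that detects multimodal hallucinations more effectively by combining multiple internal representations of a vision-language model (VLM), such as attention outputs and hidden states. The method trains independent detectors and combines them through ensemble learning; experiments show that it outperforms existing methods across multiple VQA datasets and VLMs.

Details

Motivation: Existing hallucination detection methods based on internal representations typically rely on a single representation or detector, which limits their ability to capture diverse hallucination signals, so a more robust approach is needed to improve detection performance.

Result: Experiments across multiple VQA datasets and VLMs show that EnsemHalDet consistently outperforms prior methods and single-detector models in terms of AUC, demonstrating its effectiveness.

Insight: The novelty lies in ensembling multiple internal representations (such as attention outputs and hidden states) to make hallucination detection more robust, which suggests a new way to use a model's internal states for more comprehensive error detection.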

Abstract: Vision-Language Models (VLMs) excel at multimodal tasks, but they remain vulnerable to hallucinations that are factually incorrect or ungrounded in the input image. Recent work suggests that hallucination detection using internal representations is more efficient and accurate than approaches that rely solely on model outputs. However, existing internal-representation-based methods typically rely on a single representation or detector, limiting their ability to capture diverse hallucination signals. In this paper, we propose EnsemHalDet, an ensemble-based hallucination detection framework that leverages multiple internal representations of VLMs, including attention outputs and hidden states. EnsemHalDet trains independent detectors for each representation and combines them through ensemble learning. Experimental results across multiple VQA datasets and VLMs show that EnsemHalDet consistently outperforms prior methods and single-detector models in terms of AUC. These results demonstrate that ensembling diverse internal signals significantly improves robustness in multimodal hallucination detection.

[48] LumaFlux: Lifting 8-Bit Worlds to HDR Reality with Physically-Guided Diffusion Transformers cs.CV | cs.AI

Shreshth Saini, Hakan Gedik, Neil Birkbeck, Yilin Wang, Balu Adsumilli

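TL;DR: LumaFlux is a physically and perceptually guided diffusion transformer for converting 8-bit standard dynamic range (SDR) content to 10-bit high dynamic range (HDR). It introduces a Physically-Guided Adaptation module, a Perceptual Cross-Modulation layer, and an HDR Residual Coupler, combined with a lightweight rational-quadratic spline decoder, to expand highlights and exposure smoothly and produce perceptually and physically accurate HDR images.

Details

Motivation: With HDR-capable devices now widespread, large amounts of existing 8-bit SDR content need to be converted into high-quality HDR. Existing inverse tone-mapping methods typically rely on fixed tone-mapping operators that generalize poorly to real-world degradations, stylistic variations, and camera pipelines, often producing clipped highlights, desaturated colors, or unstable tone reproduction.

Result: Across multiple benchmarks, LumaFlux outperforms state-of-the-art baselines, achieving better luminance reconstruction and perceptual color fidelity with minimal additional parameters. The paper also establishes a new evaluation benchmark, comprising HDR references and expert-graded SDR versions, for fair and reproducible comparison.

Insight: Contributions include: (1) a Physically-Guided Adaptation module that injects luminance, spatial descriptors, and frequency cues into attention through low-rank residuals; (2) a Perceptual Cross-Modulation layer that stabilizes chroma and texture via FiLM conditioning on vision encoder features; and (3) an HDR Residual Coupler that fuses physical and perceptual signals under a timestep- and layer-adaptive modulation schedule. The authors also curate the first large-scale SDR-HDR training corpus and a new evaluation benchmark, giving HDR learning a sturdier data and evaluation basis.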

Abstract: The rapid adoption of HDR-capable devices has created a pressing need to convert 8-bit Standard Dynamic Range (SDR) content into perceptually and physically accurate 10-bit High Dynamic Range (HDR). Existing inverse tone-mapping (ITM) methods often rely on fixed tone-mapping operators that struggle to generalize to real-world degradations, stylistic variations, and camera pipelines, frequently producing clipped highlights, desaturated colors, or unstable tone reproduction. We introduce LumaFlux, the first physically and perceptually guided diffusion transformer (DiT) for SDR-to-HDR reconstruction, built by adapting a large pretrained DiT. LumaFlux introduces (1) a Physically-Guided Adaptation (PGA) module that injects luminance, spatial descriptors, and frequency cues into attention through low-rank residuals; (2) a Perceptual Cross-Modulation (PCM) layer that stabilizes chroma and texture via FiLM conditioning from vision encoder features; and (3) an HDR Residual Coupler that fuses physical and perceptual signals under a timestep- and layer-adaptive modulation schedule. Finally, a lightweight Rational-Quadratic Spline decoder reconstructs smooth, interpretable tone fields for highlight and exposure expansion, enhancing the output of the VAE decoder to generate HDR. To enable robust HDR learning, we curate the first large-scale SDR-HDR training corpus. For fair and reproducible comparison, we further establish a new evaluation benchmark, comprising HDR references and corresponding expert-graded SDR versions. Across benchmarks, LumaFlux outperforms state-of-the-art baselines, achieving superior luminance reconstruction and perceptual color fidelity with minimal additional parameters.

[49] UNICA: A Unified Neural Framework for Controllable 3D Avatars cs.CV

Jiahe Zhu, Xinyao Wang, Yiyu Zhuang, Yanwen Wang, Jing Tian

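TL;DR: UNICA is a unified neural framework for controllable 3D avatars. Using an action-conditioned diffusion model and a point transformer, it folds the traditionally complex pipeline of motion planning, rigging, physical simulation, and rendering into a single neural network, generating high-quality 3D avatar geometry and free-view renderings directly from keyboard inputs.

Details

Motivation: Creating a controllable 3D avatar conventionally requires a complex multi-step pipeline covering appearance modeling, motion planning, rigging, and physical simulation, which is tedious and time-consuming. UNICA aims to simplify this with a unified neural framework for end-to-end controllable avatar generation.

Result: The paper claims UNICA is the first model to unify the workflow of "motion planning, rigging, physical simulation, and rendering". It naturally captures hair and loose clothing dynamics and supports extra-long autoregressive generation, but the abstract provides no specific benchmarks or quantitative comparisons.

Insight: The novelty is a skeleton-free generative model that couples an action-conditioned diffusion model with 3D Gaussian Splatting rendering, avoiding manually designed physical simulation and providing an end-to-end path from simple control inputs to high-quality 3D avatars.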

Abstract: Controllable 3D human avatars have found widespread applications in 3D games, the metaverse, and AR/VR scenarios. The conventional approach to creating such a 3D avatar requires a lengthy, intricate pipeline encompassing appearance modeling, motion planning, rigging, and physical simulation. In this paper, we introduce UNICA (UNIfied neural Controllable Avatar), a skeleton-free generative model that unifies all avatar control components into a single neural framework. Given keyboard inputs akin to video game controls, UNICA generates the next frame of a 3D avatar’s geometry through an action-conditioned diffusion model operating on 2D position maps. A point transformer then maps the resulting geometry to 3D Gaussian Splatting for high-fidelity free-view rendering. Our approach naturally captures hair and loose clothing dynamics without manually designed physical simulation, and supports extra-long autoregressive generation. To the best of our knowledge, UNICA is the first model to unify the workflow of “motion planning, rigging, physical simulation, and rendering”. Code is released at https://github.com/zjh21/UNICA.

[50] PaveBench: A Versatile Benchmark for Pavement Distress Perception and Interactive Vision-Language Analysis cs.CV | cs.AI | cs.MM

Dexiang Li, Zhenning Che, Haijun Zhang, Dongliang Zhou, Zhao Zhang

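TL;DR: This paper introduces PaveBench, a large-scale benchmark for pavement distress perception and interactive vision-language analysis. Built on real-world highway inspection images, it supports four core tasks (classification, object detection, semantic segmentation, and visual question answering) under a unified evaluation protocol. The paper also presents PaveVQA, a question answering dataset that supports single-turn, multi-turn, and expert-corrected interactions, covering recognition, localization, quantitative estimation, and maintenance reasoning.

Details

Motivation: Existing research on pavement condition assessment focuses mostly on conventional vision tasks such as classification, detection, and segmentation, whereas real-world inspection also requires quantitative analysis, explanation, and interactive decision support. Current datasets are limited to unimodal perception and lack support for multi-turn interaction, fact-grounded reasoning, and the connection between perception and vision-language analysis.

Result: The paper evaluates several state-of-the-art (SOTA) methods and provides a detailed analysis. It also presents a simple and effective agent-augmented visual question answering framework that integrates domain-specific models as tools alongside vision-language models.

Insight: The contribution is PaveBench itself: a large-scale, multi-task pavement distress benchmark that supports interactive vision-language analysis, with a PaveVQA component that handles complex multi-turn question answering and expert correction. Viewed objectively, the work combines conventional vision tasks with emerging vision-language models and agentic tool use, giving infrastructure inspection a more comprehensive evaluation platform and decision-support framework.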

Abstract: Pavement condition assessment is essential for road safety and maintenance. Existing research has made significant progress. However, most studies focus on conventional computer vision tasks such as classification, detection, and segmentation. In real-world applications, pavement inspection requires more than visual recognition. It also requires quantitative analysis, explanation, and interactive decision support. Current datasets are limited. They focus on unimodal perception. They lack support for multi-turn interaction and fact-grounded reasoning. They also do not connect perception with vision-language analysis. To address these limitations, we introduce PaveBench, a large-scale benchmark for pavement distress perception and interactive vision-language analysis on real-world highway inspection images. PaveBench supports four core tasks: classification, object detection, semantic segmentation, and vision-language question answering. It provides unified task definitions and evaluation protocols. On the visual side, PaveBench provides large-scale annotations and includes a curated hard-distractor subset for robustness evaluation. It contains a large collection of real-world pavement images. On the multimodal side, we introduce PaveVQA, a real-image question answering (QA) dataset that supports single-turn, multi-turn, and expert-corrected interactions. It covers recognition, localization, quantitative estimation, and maintenance reasoning. We evaluate several state-of-the-art methods and provide a detailed analysis. We also present a simple and effective agent-augmented visual question answering framework that integrates domain-specific models as tools alongside vision-language models. The dataset is available at: https://huggingface.co/datasets/MML-Group/PaveBench.

[51] QAPruner: Quantization-Aware Vision Token Pruning for Multimodal Large Language Models cs.CV | cs.AI

Xinhao Wang, Zhonyu Xia, Zhiwei Lin, Zhe Li, Yongtao Wang

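TL;DR: This paper proposes QAPruner, a quantization-aware vision token pruning framework for compressing multimodal large language models (MLLMs). Using a hybrid sensitivity metric that combines quantization error with outlier intensity, it co-optimizes vision token pruning and post-training quantization under low-bit settings (e.g., W4A4) to make deployment in resource-constrained environments more efficient.

Details

Motivation: MLLMs have high computational and memory costs, which makes them hard to deploy in resource-constrained environments. Post-training quantization and vision token pruning are usually optimized independently, but combining them naively can prune away outlier activations that matter for quantization stability, worsening low-bit quantization error and degrading performance.

Result: Experiments on standard LLaVA architectures show that, at an aggressive pruning ratio that retains only 12.5% of visual tokens, the method improves accuracy by 2.24% over the baseline and even surpasses dense quantization without pruning, outperforming naive integration baselines.

Insight: This is the first work to explicitly co-optimize vision token pruning and post-training quantization. Its hybrid sensitivity metric (simulated group-wise quantization error combined with outlier intensity) retains tokens that are both semantically rich and robust to quantization, addressing the coupling between pruning and quantization in low-bit regimes.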

Abstract: Multimodal Large Language Models (MLLMs) have shown strong reasoning ability, but their high computational and memory costs hinder deployment in resource-constrained settings. While Post-Training Quantization (PTQ) and vision token pruning are standard compression techniques, they are usually treated as independent optimizations. In this paper, we show that these two techniques are strongly coupled: naively applying semantic-based token pruning to PTQ-optimized MLLMs can discard activation outliers that are important for numerical stability and thus worsen quantization errors in low-bit regimes (e.g., W4A4). To address this issue, we propose a quantization-aware vision token pruning framework. Our method introduces a lightweight hybrid sensitivity metric that combines simulated group-wise quantization error with outlier intensity. By combining this metric with standard semantic relevance scores, the method retains tokens that are both semantically informative and robust to quantization. Experiments on standard LLaVA architectures show that our method consistently outperforms naive integration baselines. At an aggressive pruning ratio that retains only 12.5% of visual tokens, our framework improves accuracy by 2.24% over the baseline and even surpasses dense quantization without pruning. To the best of our knowledge, this is the first method that explicitly co-optimizes vision token pruning and PTQ for accurate low-bit MLLM inference.
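To make the idea of a hybrid sensitivity score concrete, here is a rough sketch of one possible reading. The abstract does not give the formula, so everything below is an assumption: the symmetric per-group fake quantization, the max-over-mean outlier measure, the additive combination, the weights `w_err` and `w_out`, and the choice to let sensitive tokens score higher so that outlier-bearing tokens are retained.

```python
def fake_quant_error(token, group_size=4, bits=4):
    """Mean squared error of symmetric per-group fake quantization (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit signed integers
    total = 0.0
    for i in range(0, len(token), group_size):
        group = token[i:i + group_size]
        scale = max(abs(v) for v in group) / qmax or 1.0  # fall back to 1.0 for all-zero groups
        total += sum((v - round(v / scale) * scale) ** 2 for v in group)
    return total / len(token)


def outlier_intensity(token):
    """Largest magnitude divided by mean magnitude (1.0 means no outlier)."""
    mags = [abs(v) for v in token]
    mean_mag = sum(mags) / len(mags)
    return max(mags) / mean_mag if mean_mag else 0.0


def select_tokens(tokens, semantic, keep_ratio=0.125, w_err=1.0, w_out=1.0):
    """Keep the top-scoring tokens under a hypothetical additive hybrid score."""
    keep = max(1, int(len(tokens) * keep_ratio))
    scores = [
        s + w_err * fake_quant_error(t) + w_out * outlier_intensity(t)
        for t, s in zip(tokens, semantic)
    ]
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy example: seven flat tokens and one token carrying a large activation outlier.
tokens = [[1.0, 1.0, 1.0, 1.0] for _ in range(7)] + [[0.1, 0.1, 0.1, 8.0]]
semantic = [0.9] + [0.1] * 7

print(select_tokens(tokens, semantic, w_err=0.0, w_out=0.0))  # semantic-only pruning
print(select_tokens(tokens, semantic))                        # hybrid score
```

With the weights set to zero the selection is purely semantic and drops the outlier token. With the hybrid score the outlier token is kept, which mirrors the failure mode the abstract describes for naive semantic pruning.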

[52] MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling cs.CV

Shubo Lin, Xuanyang Zhang, Wei Cheng, Weiming Hu, Gang Yu

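TL;DR: This paper proposes MMPhysVideo, a framework that improves the physical plausibility of video generation through joint multimodal modeling. It recasts perceptual cues (semantics, geometry, and spatio-temporal trajectories) into a unified pseudo-RGB format so that video diffusion models can directly capture complex physical dynamics. To reduce cross-modal interference, it designs a Bidirectionally Controlled Teacher architecture that decouples RGB and perception processing in parallel branches and uses zero-initialized control links to gradually learn pixel-wise consistency. For inference efficiency, the teacher's physical prior is distilled into a single-stream student model via representation alignment. The paper also presents MMPhysPipe, a data pipeline for building physics-rich multimodal datasets.

Details

Motivation: Existing video diffusion models rely on pixel-only reconstruction and often produce physically inconsistent results, so the physical plausibility of generated videos needs to improve.

Result: Across multiple benchmarks, MMPhysVideo consistently improves physical plausibility and visual quality with no additional inference cost, achieving state-of-the-art (SOTA) performance compared with existing methods.

Insight: Contributions include unifying multimodal perceptual cues into a pseudo-RGB format to model physical dynamics directly, and the Bidirectionally Controlled Teacher architecture for cross-modal decoupling and progressive consistency learning. Reusable ideas are the use of joint multimodal modeling plus distillation to balance physical plausibility against efficiency, and the data annotation pipeline based on a chain of visual evidence.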

Abstract: Despite advancements in generating visually stunning content, video diffusion models (VDMs) often yield physically inconsistent results due to pixel-only reconstruction. To address this, we propose MMPhysVideo, the first framework to scale physical plausibility in video generation through joint multimodal modeling. We recast perceptual cues, specifically semantics, geometry, and spatio-temporal trajectory, into a unified pseudo-RGB format, enabling VDMs to directly capture complex physical dynamics. To mitigate cross-modal interference, we propose a Bidirectionally Controlled Teacher architecture, which utilizes parallel branches to fully decouple RGB and perception processing and adopts two zero-initialized control links to gradually learn pixel-wise consistency. For inference efficiency, the teacher’s physical prior is distilled into a single-stream student model via representation alignment. Furthermore, we present MMPhysPipe, a scalable data curation and annotation pipeline tailored for constructing physics-rich multimodal datasets. MMPhysPipe employs a vision-language model (VLM) guided by a chain-of-visual-evidence rule to pinpoint physical subjects, enabling expert models to extract multi-granular perceptual information. Without additional inference costs, MMPhysVideo consistently improves physical plausibility and visual quality over advanced models across various benchmarks and achieves state-of-the-art performance compared to existing methods.

[53] NavCrafter: Exploring 3D Scenes from a Single Image cs.CV | cs.AI

Hongbo Duan, Peiyu Zhuang, Yi Liu, Zhengyang Zhang, Yuxin Zhang

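TL;DR: NavCrafter is a new framework for exploring 3D scenes from a single image by synthesizing novel-view video sequences with camera controllability and spatio-temporal consistency. It uses video diffusion models to capture rich 3D priors and adopts a geometry-aware expansion strategy to progressively extend scene coverage.

Details

Motivation: To address the difficulty of creating flexible 3D scenes from a single image when direct 3D data acquisition is costly or impractical.

Result: Extensive experiments show that NavCrafter achieves state-of-the-art novel-view synthesis under large viewpoint shifts and substantially improves 3D reconstruction fidelity.

Insight: Contributions include: using video diffusion models to capture 3D priors; a geometry-aware scene expansion strategy; a multi-stage camera control mechanism (dual-branch camera injection and attention modulation); a collision-aware camera trajectory planner; and an enhanced 3D Gaussian Splatting (3DGS) pipeline with depth-aligned supervision, structural regularization, and refinement.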

Abstract: Creating flexible 3D scenes from a single image is vital when direct 3D data acquisition is costly or impractical. We introduce NavCrafter, a novel framework that explores 3D scenes from a single image by synthesizing novel-view video sequences with camera controllability and temporal-spatial consistency. NavCrafter leverages video diffusion models to capture rich 3D priors and adopts a geometry-aware expansion strategy to progressively extend scene coverage. To enable controllable multi-view synthesis, we introduce a multi-stage camera control mechanism that conditions diffusion models with diverse trajectories via dual-branch camera injection and attention modulation. We further propose a collision-aware camera trajectory planner and an enhanced 3D Gaussian Splatting (3DGS) pipeline with depth-aligned supervision, structural regularization and refinement. Extensive experiments demonstrate that NavCrafter achieves state-of-the-art novel-view synthesis under large viewpoint shifts and substantially improves 3D reconstruction fidelity.

Hao Ren, Zetong Bi, Yiming Zeng, Zhaoliang Wan, Lu Qi

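TL;DR: This paper proposes STRNet, a visual navigation method that learns spatio-temporal representations through dynamic graph aggregation. It designs a unified spatio-temporal representation framework in which a spatio-temporal fusion module combines features from the image sequence and the goal observation, performs spatial graph reasoning within each frame, and models temporal dynamics with a hybrid temporal shift module and multi-resolution difference-aware convolution, strengthening visual encoding for robot navigation.

Details

Motivation: Existing learning-based visual navigation methods tend to focus on improving policy heads or decision strategies while relying on simplistic feature encoders and temporal pooling to represent visual input. This loses fine-grained spatial and temporal structure and limits accurate action prediction and progress estimation.

Result: Experiments show that the method consistently improves performance on visual navigation tasks and provides a generalizable visual backbone for goal-conditioned control.

Insight: The novelty is a unified spatio-temporal representation framework that fuses image-sequence and goal features through spatial graph reasoning and hybrid temporal modeling (temporal shift combined with multi-resolution difference-aware convolution). This preserves and exploits fine-grained spatio-temporal structure in the visual input and offers a general way to strengthen visual encoding for navigation.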

Abstract: Visual navigation requires the robot to reach a specified goal such as an image, based on a sequence of first-person visual observations. While recent learning-based approaches have made significant progress, they often focus on improving policy heads or decision strategies while relying on simplistic feature encoders and temporal pooling to represent visual input. This leads to the loss of fine-grained spatial and temporal structure, ultimately limiting accurate action prediction and progress estimation. In this paper, we propose a unified spatio-temporal representation framework that enhances visual encoding for robotic navigation. Our approach extracts features from both image sequences and goal observations, and fuses them using the designed spatio-temporal fusion module. This module performs spatial graph reasoning within each frame and models temporal dynamics using a hybrid temporal shift module combined with multi-resolution difference-aware convolution. Experimental results demonstrate that our approach consistently improves navigation performance and offers a generalizable visual backbone for goal-conditioned control. Code is available at https://github.com/hren20/STRNet.

[55] Deformation-based In-Context Learning for Point Cloud Understanding cs.CV

Chengxing Lin, Jinhong Deng, Yinjie Lei, Wen Li

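TL;DR: This paper proposes DeformPIC, a deformation-based in-context learning framework for point clouds. It addresses two weaknesses of existing methods based on Masked Point Modeling (MPM): insufficient use of geometric priors and a mismatch between training and inference objectives. By learning to deform the query point cloud under guidance from task prompts, it achieves explicit geometric reasoning and consistent training and inference objectives.

Details

Motivation: Existing MPM-based in-context learning methods for point clouds predict the target point cloud directly from masked tokens, which fails to exploit geometric priors. They also suffer from a training-inference objective mismatch, because training uses target-side information that is unavailable at inference time.

Result: On reconstruction, denoising, and registration tasks, DeformPIC reduces average Chamfer Distance by 1.6, 1.8, and 4.7 points, respectively, compared with the previous state of the art. It also achieves state-of-the-art performance on a new out-of-domain benchmark.

Insight: The core innovation is shifting point cloud in-context learning from masked reconstruction to deformation-based geometric reasoning. The model explicitly learns how to deform the query point cloud under prompt guidance, which uses geometric structure more effectively and keeps training and inference objectives consistent. The new out-of-domain benchmark also offers a fresh way to assess generalization.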

Abstract: Recent advances in point cloud In-Context Learning (ICL) have demonstrated strong multitask capabilities. Existing approaches typically adopt a Masked Point Modeling (MPM)-based paradigm for point cloud ICL. However, MPM-based methods directly predict the target point cloud from masked tokens without leveraging geometric priors, requiring the model to infer spatial structure and geometric details solely from token-level correlations via transformers. Additionally, these methods suffer from a training-inference objective mismatch, as the model learns to predict the target point cloud using target-side information that is unavailable at inference time. To address these challenges, we propose DeformPIC, a deformation-based framework for point cloud ICL. Unlike existing approaches that rely on masked reconstruction, DeformPIC learns to deform the query point cloud under task-specific guidance from prompts, enabling explicit geometric reasoning and consistent objectives. Extensive experiments demonstrate that DeformPIC consistently outperforms previous state-of-the-art methods, achieving reductions of 1.6, 1.8, and 4.7 points in average Chamfer Distance on reconstruction, denoising, and registration tasks, respectively. Furthermore, we introduce a new out-of-domain benchmark to evaluate generalization across unseen data distributions, where DeformPIC achieves state-of-the-art performance.
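For readers unfamiliar with the reported metric, the following is a minimal sketch of a symmetric Chamfer Distance between two point sets. Several variants are in use (squared or unsquared distances, sums or means, different scaling factors), and the abstract does not state which one the paper reports, so the squared-distance, mean-based form below is an assumption.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(xs, ys):
    """Symmetric Chamfer Distance: mean nearest-neighbour squared distance in both directions."""
    x_to_y = sum(min(sq_dist(x, y) for y in ys) for x in xs) / len(xs)
    y_to_x = sum(min(sq_dist(y, x) for x in xs) for y in ys) / len(ys)
    return x_to_y + y_to_x


# Toy point clouds: the prediction is the target shifted by one unit along x.
target = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
predicted = [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]

print(chamfer_distance(target, target))     # identical clouds
print(chamfer_distance(predicted, target))  # shifted cloud
```

This brute-force version is quadratic in the number of points. It is meant only to show what the metric measures, not how it would be computed at scale.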

[56] A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos cs.CV | cs.AI

Allen He, Qi Liu, Kun Liu, Xinchen Liu, Wu Liu

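TL;DR: This paper proposes a fully end-to-end training paradigm for temporal sentence grounding in videos (TSGV). It jointly optimizes the video backbone and the localization head to address the mismatch between pre-trained visual encoders and the downstream task in existing methods.

Details

Motivation: Existing methods typically use pre-trained, query-agnostic visual encoders for offline feature extraction, which creates a gap between video backbones trained for visual classification and the TSGV task. The paper aims to close this gap through end-to-end training.

Result: Experiments on two benchmarks show that the method outperforms current state-of-the-art approaches.

Insight: The paper proposes a fully end-to-end training paradigm and introduces a Sentence Conditioned Adapter, which uses sentence features to adaptively train a small portion of the video backbone's parameters. This reduces memory consumption while modulating feature maps through precise integration of language embeddings, significantly strengthening the visual representation.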

Abstract: Temporal sentence grounding in videos (TSGV) aims to localize a temporal segment that semantically corresponds to a sentence query from an untrimmed video. Most current methods adopt pre-trained query-agnostic visual encoders for offline feature extraction, and the video backbones are frozen and not optimized for TSGV. This leads to a task discrepancy issue for the video backbone trained for visual classification, but utilized for TSGV. To bridge this gap, we propose a fully end-to-end paradigm that jointly optimizes the video backbone and localization head. We first conduct an empirical study validating the effectiveness of end-to-end learning over frozen baselines across different model scales. Furthermore, we introduce a Sentence Conditioned Adapter (SCADA), which leverages sentence features to train a small portion of video backbone parameters adaptively. SCADA facilitates the deployment of deeper network backbones with reduced memory and significantly enhances visual representation by modulating feature maps through precise integration of linguistic embeddings. Experiments on two benchmarks show that our method outperforms state-of-the-art approaches. The code and models will be released.

[57] HairOrbit: Multi-view Aware 3D Hair Modeling from Single Portraits cs.CV

Leyang Jin, Yujian Zheng, Bingkui Tong, Yuda Qiu, Zhenyu Xie

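TL;DR: This paper proposes HairOrbit, a framework for reconstructing strand-level 3D hair from a single portrait. It uses the strong 3D priors of video generation models to turn single-view reconstruction into a calibrated multi-view reconstruction task, and introduces a neural orientation extractor and a two-stage strand-growing algorithm based on a hybrid implicit field to achieve high-quality, efficient reconstruction in both visible and invisible regions.

Details

Motivation: Existing methods rely on limited frontal-view cues and small-scale or style-restricted synthetic data. When reconstructing 3D hair from a single-view image, they struggle to keep attributes consistent and realistic in unseen regions, which leads to unsatisfactory results.

Result: Extensive experiments show that, on a diverse set of hair portraits, the method achieves state-of-the-art (SOTA) performance for single-view 3D hair strand reconstruction in both visible and invisible regions.

Insight: The core innovations are reformulating the task using the 3D priors of video generation models, and combining a neural orientation extractor trained on sparse real-image annotations with an efficient strand-growing algorithm based on a hybrid implicit field, which improves efficiency while preserving reconstruction quality.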

Abstract: Reconstructing strand-level 3D hair from a single-view image is highly challenging, especially when preserving consistent and realistic attributes in unseen regions. Existing methods rely on limited frontal-view cues and small-scale/style-restricted synthetic data, often failing to produce satisfactory results in invisible regions. In this work, we propose a novel framework that leverages the strong 3D priors of video generation models to transform single-view hair reconstruction into a calibrated multi-view reconstruction task. To balance reconstruction quality and efficiency for the reformulated multi-view task, we further introduce a neural orientation extractor trained on sparse real-image annotations for better full-view orientation estimation. In addition, we design a two-stage strand-growing algorithm based on a hybrid implicit field to synthesize the 3D strand curves with fine-grained details at a relatively fast speed. Extensive experiments demonstrate that our method achieves state-of-the-art performance on single-view 3D hair strand reconstruction on a diverse range of hair portraits in both visible and invisible regions.

[58] Token Warping Helps MLLMs Look from Nearby Viewpoints cs.CV

Phillip Y. Lee, Chanho Park, Mingue Park, Seungwoo Yoo, Juil Koo

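TL;DR: This paper proposes a viewpoint transformation method that operates on tokens rather than pixels, to improve how multimodal large language models (MLLMs) understand a scene from nearby viewpoints. Drawing on theories of human mental imagery, it examines whether image tokens in ViT-based MLLMs are an effective substrate for viewpoint changes and finds that backward token warping is more stable and better preserves semantic coherence. On the proposed ViewBench benchmark, the method clearly outperforms baselines including pixel-wise warping, spatially fine-tuned MLLMs, and generative warping.

Details

Motivation: MLLMs are sensitive to viewpoint changes in visual reasoning. Pixel-wise warping is vulnerable to depth errors and introduces geometric distortions, whereas human perspective transformation relies on part-level structural representations.

Result: On ViewBench, token-level warping consistently outperforms all baselines on nearby-viewpoint reasoning tasks, including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method, achieving state-of-the-art performance.

Insight: The novelty is lifting viewpoint transformation from the pixel level to the token level, using the tokens of ViT-based MLLMs as structural representations for backward warping, which improves geometric robustness and semantic coherence. Viewed objectively, the method offers a new, scalable route to geometric understanding in MLLMs.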

Abstract: Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping, finding that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.
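Backward warping, as described in the abstract, iterates over a grid on the target view and fetches one source-view token per grid point. The sketch below illustrates only that lookup pattern. The `target_to_source` callback is hypothetical and stands in for whatever geometry (depth and relative camera pose) maps a target grid position to source coordinates; nearest-neighbour retrieval and the `fill` value for out-of-view points are also assumptions.

```python
def backward_warp_tokens(src_tokens, target_to_source, fill=None):
    """For every target grid cell, retrieve the nearest source-view token.

    src_tokens: H x W nested list of tokens (any Python objects).
    target_to_source: hypothetical callback (ty, tx) -> (sy, sx) in source grid coordinates.
    fill: placeholder used when the lookup falls outside the source grid.
    """
    h, w = len(src_tokens), len(src_tokens[0])
    warped = []
    for ty in range(h):
        row = []
        for tx in range(w):
            sy, sx = target_to_source(ty, tx)
            iy, ix = round(sy), round(sx)  # nearest-neighbour lookup
            inside = 0 <= iy < h and 0 <= ix < w
            row.append(src_tokens[iy][ix] if inside else fill)
        warped.append(row)
    return warped


# Toy example: the new viewpoint shifts the scene one token to the left.
source = [["a", "b", "c"],
          ["d", "e", "f"]]
warped = backward_warp_tokens(source, lambda ty, tx: (ty, tx + 1))
print(warped)
```

Because every target cell receives exactly one lookup, the output grid has no holes or collisions, which is the usual argument for backward over forward warping.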

[59] Unlocking Positive Transfer in Incrementally Learning Surgical Instruments: A Self-reflection Hierarchical Prompt Framework cs.CV

Yu Zhu, Kang Li, Zheng Li, Pheng-Ann Heng

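TL;DR: This paper proposes a self-reflection hierarchical prompt framework for class-incremental learning in surgical video scene parsing. It uses forward and backward knowledge transfer to learn new surgical instruments efficiently, improve segmentation of existing instruments, and avoid catastrophic forgetting of old ones.

Details

Motivation: Existing incremental learning methods, when continually learning to segment new surgical instruments, overlook the potential of positive forward transfer (past knowledge helping to learn new classes) and positive backward transfer (learning new classes helping to refine past knowledge), which limits model adaptability.

Result: The framework applies to both CNN-based and transformer-based foundation models and improves over competing methods by more than 5% and 11% on two public benchmarks, respectively, reaching state-of-the-art level.

Insight: Contributions include: (1) a hierarchical prompt parsing tree, with the shared prompt as the root node, partially shared prompts as intermediate nodes, and instrument-distinct prompts as leaf nodes, which promotes positive forward transfer; and (2) a self-reflection refinement mechanism that examines knowledge associations through directed-weighted graph propagation, enabling backward transfer without catastrophic forgetting. The framework generalizes across model architectures.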

Abstract: To continuously enhance model adaptability in surgical video scene parsing, recent studies incrementally update the model so that it progressively learns to segment an increasing number of surgical instruments over time. However, prior work has consistently overlooked the potential of positive forward knowledge transfer, i.e., how past knowledge could help learn new classes, and positive backward knowledge transfer, i.e., how learning new classes could help refine past knowledge. In this paper, we propose a self-reflection hierarchical prompt framework that unlocks the power of positive forward and backward knowledge transfer in class incremental segmentation, aiming to proficiently learn new instruments, improve existing skills on regular instruments, and avoid catastrophic forgetting of old instruments. Our framework is built on a frozen, pre-trained model that adaptively appends instrument-aware prompts for new classes throughout training episodes. To enable positive forward knowledge transfer, we organize instrument prompts into a hierarchical prompt parsing tree with the instrument-shared prompt partition as the root node, n-part-shared prompt partitions as intermediate nodes, and instrument-distinct prompt partitions as leaf nodes, exposing reusable historical knowledge to new classes to simplify their learning. Conversely, to encourage positive backward knowledge transfer, we conduct self-reflection refining on existing knowledge by directed-weighted graph propagation, examining the knowledge associations recorded in the tree to improve its representativeness without causing catastrophic forgetting. Our framework is applicable to both CNN-based models and advanced transformer-based foundation models, yielding more than 5% and 11% improvements over the competing methods on two public benchmarks, respectively.

[60] InstructTable: Improving Table Structure Recognition Through Instructions cs.CV

Boming Chen, Zining Wang, Zhentao Guo, Jianqiang Liu, Chen Duan

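TL;DR: This paper proposes InstructTable, a framework that improves table structure recognition (TSR) through instruction-guided multi-stage training. It combines table instruction pre-training with TSR fine-tuning to better understand complex table layouts, and introduces Table Mix Expand (TME), a method for synthesizing large-scale realistic table data, which is used to build the BCDSTab benchmark. Experiments on multiple public datasets and BCDSTab confirm its SOTA performance.

Details

Motivation: Traditional vision-centric models lack semantic support when handling complex table layouts (such as merged or empty cells), while vision-language models underemphasize the modeling of visual structural information. Both limit TSR accuracy in complex scenarios.

Result: InstructTable achieves SOTA performance on the public datasets FinTabNet, PubTabNet, and MUSTARD and on the self-built BCDSTab benchmark. Ablation studies confirm the positive impact of the table-specific instructions and the synthetic data.

Insight: Contributions include an instruction-guided multi-stage training framework that focuses on fine-grained structural patterns, and the template-free TME synthesis method for generating large-scale realistic table data to build a complex-table benchmark.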

Abstract: Table structure recognition (TSR) holds widespread practical importance by parsing tabular images into structured representations, yet encounters significant challenges when processing complex layouts involving merged or empty cells. Traditional visual-centric models rely exclusively on visual information while lacking crucial semantic support, thereby impeding accurate structural recognition in complex scenarios. Vision-language models leverage contextual semantics to enhance comprehension; however, these approaches underemphasize the modeling of visual structural information. To address these limitations, this paper introduces InstructTable, an instruction-guided multi-stage training TSR framework. Meticulously designed table instruction pre-training directs attention toward fine-grained structural patterns, enhancing comprehension of complex tables. Complementary TSR fine-tuning preserves robust visual information modeling, maintaining high-precision table parsing across diverse scenarios. Furthermore, we introduce Table Mix Expand (TME), an innovative template-free method for synthesizing large-scale authentic tabular data. Leveraging TME, we construct the Balanced Complex Dense Synthetic Tables (BCDSTab) benchmark, comprising 900 complex table images synthesized through our method to serve as a rigorous benchmark. Extensive experiments on multiple public datasets (FinTabNet, PubTabNet, MUSTARD) and BCDSTab demonstrate that InstructTable achieves state-of-the-art performance in TSR tasks. Ablation studies further confirm the positive impact of the proposed tabular-data-specific instructions and synthetic data.

[61] Information-Regularized Constrained Inversion for Stable Avatar Editing from Sparse Supervision cs.CV

Zhenxiao Liang, Qixing Huang

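TL;DR: This paper proposes an information-regularized constrained inversion method for stably editing animatable human avatars from sparse supervision, such as a few edited keyframes. It models editing as a constrained inversion in a structured avatar latent space and restricts updates to a low-dimensional, part-specific edit subspace to prevent identity leakage and pose-dependent temporal flicker.

Details

Motivation: When editing animatable avatars from sparse supervision, existing methods often suffer from identity leakage and temporal flicker because the edit is under-constrained. At root this is an ill-conditioned inversion problem.

Result: The method optimizes a conditioning objective derived from a local linearization of the full decoding-and-rendering pipeline, yielding an edit-subspace information matrix whose spectrum predicts stability and drives frame reweighting and keyframe activation. This improves stability under limited edited supervision.

Insight: The novelty is formalizing editing as a constrained inversion in latent space and designing a regularization mechanism, based on spectral analysis of the information matrix, that steers the edit toward stability. It can be implemented efficiently with Hessian-vector products and may generalize to other generative-model editing tasks with sparse supervision.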

Abstract: Editing animatable human avatars typically relies on sparse supervision, often a few edited keyframes, yet naively fitting a reconstructed avatar to these edits frequently causes identity leakage and pose-dependent temporal flicker. We argue that these failures are best understood as an ill-conditioned inversion: the available edited constraints do not sufficiently determine the latent directions responsible for the intended edit. We propose a conditioning-guided edited reconstruction framework that performs editing as a constrained inversion in a structured avatar latent space, restricting updates to a low-dimensional, part-specific edit subspace to prevent unintended identity changes. Crucially, we design the editing constraints during inversion by optimizing a conditioning objective derived from a local linearization of the full decoding-and-rendering pipeline, yielding an edit-subspace information matrix whose spectrum predicts stability and drives frame reweighting / keyframe activation. The resulting method operates on small subspace matrices and can be implemented efficiently (e.g., via Hessian-vector products), and improves stability under limited edited supervision.

[62] Progressive Video Condensation with MLLM Agent for Long-form Video Understanding cs.CV

Yufei Yin, Yuchen Xing, Qianke Meng, Minghao Chen, Yan Yang

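TL;DR: This paper proposes ProVCA, a progressive video condensation agent for efficient long-video understanding with multimodal large language models (MLLMs). Through a coarse-to-fine iterative process with three modules (segment localization, snippet selection, and keyframe refinement), it progressively locates the keyframes relevant to the query, extracting the relevant information under a limited compute budget.

Details

Motivation: Existing approaches such as text-then-LLM pipelines lose fine-grained visual cues, while video-based MLLMs preserve visual details but are computationally expensive and must process many frames. The paper addresses how to use MLLMs effectively for long-video understanding under constrained compute.

Result: Without any training, ProVCA achieves zero-shot SOTA results on several benchmarks: 69.3% accuracy on EgoSchema, 80.5% on NExT-QA, and 77.7% on IntentQA, while using fewer frames than previous training-free methods.

Insight: The novelty is a progressive video condensation framework that filters information efficiently through multi-granularity iterative localization (from video segments down to keyframes), balancing compute efficiency against the preservation of visual detail. Viewed objectively, its modular design (localize, select, refine) offers a scalable and compute-friendly approach to keyframe extraction for long-video understanding.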

Abstract: Understanding long videos requires extracting query-relevant information from long sequences under tight compute budgets. Existing text-then-LLM pipelines lose fine-grained visual cues, while video-based multimodal large language models (MLLMs) can keep visual details but are too frame-hungry and computationally expensive. In this work, we aim to harness MLLMs for efficient video understanding. We propose ProVCA, a progressive video condensation agent that iteratively locates key video frames at multiple granularities. ProVCA first adopts a segment localization module to identify the video segment relevant to the query, then a snippet selection module to select important snippets based on similarity, and finally a keyframe refinement module to pinpoint specific keyframes in those snippets. By progressively narrowing the scope from coarse segments to fine frames, ProVCA identifies a small set of keyframes for MLLM-based reasoning. ProVCA achieves state-of-the-art zero-shot accuracies of 69.3% on EgoSchema, 80.5% on NExT-QA, and 77.7% on IntentQA, while using fewer frames than previous training-free methods.
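The coarse-to-fine narrowing described above (segment, then snippets, then keyframes) can be sketched as follows. This is only an illustration of the control flow under strong simplifying assumptions: a single precomputed per-frame relevance list stands in for the MLLM-based localization and the similarity scoring, all parameter names are hypothetical, and the agent's iterative behaviour is not modelled.

```python
def chunks(start, end, size):
    """Split the half-open frame range [start, end) into consecutive (start, end) windows."""
    return [(s, min(s + size, end)) for s in range(start, end, size)]


def mean(values):
    return sum(values) / len(values)


def condense(scores, segment_len, snippet_len, top_snippets):
    """Pick keyframe indices by narrowing from segments to snippets to single frames."""
    # Stage 1: segment localization, keep the segment with the highest mean relevance.
    segments = chunks(0, len(scores), segment_len)
    seg_start, seg_end = max(segments, key=lambda r: mean(scores[r[0]:r[1]]))
    # Stage 2: snippet selection, keep the top-k snippets inside that segment.
    snippets = chunks(seg_start, seg_end, snippet_len)
    snippets = sorted(snippets, key=lambda r: mean(scores[r[0]:r[1]]), reverse=True)
    snippets = snippets[:top_snippets]
    # Stage 3: keyframe refinement, keep the single best frame of each snippet.
    keyframes = [max(range(s, e), key=lambda i: scores[i]) for s, e in snippets]
    return sorted(keyframes)


# Toy relevance scores for a 12-frame video (placeholder for model-derived relevance).
scores = [0.1, 0.2, 0.1, 0.0, 0.9, 0.3, 0.2, 0.8, 0.1, 0.1, 0.0, 0.2]
print(condense(scores, segment_len=4, snippet_len=2, top_snippets=2))
```

Only two of the twelve frames survive in this toy run, which is the point of the design: the expensive reasoning step sees a small, query-relevant set of frames.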

[63] Toward an Artificial General Teacher: Procedural Geometry Data Generation and Visual Grounding with Vision-Language Models cs.CV | cs.AI | cs.LG

Hai Nguyen-Truong, Alper Balbay, Tunga Bayrak

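TL;DR: This paper proposes a framework toward an Artificial General Teacher (AGT) for geometry education. A fully automated procedural data engine generates over 200,000 synthetic geometry diagrams with pixel-level segmentation masks and diverse language descriptions, addressing the failure of existing Referring Image Segmentation (RIS) models on abstract geometric diagrams caused by domain shift.

Details

Motivation: Visual explanation tasks in geometry education are held back by a lack of suitable training data. Existing RIS models trained on natural images (such as RefCOCO) cannot handle abstract, textureless geometric diagrams.

Result: With domain-specific fine-tuning of vision-language models (VLMs), Florence-2 reaches 49% IoU and 85% Buffered IoU (BIoU) on geometric diagrams, a large improvement over the <1% IoU of the zero-shot setting.

Insight: Contributions include: (1) a fully automated procedural data generation engine that needs no manual annotation; (2) Buffered IoU, a geometry-aware evaluation metric that better reflects segmentation quality on thin structures; and (3) a foundation for building AGTs that can give visually grounded, step-by-step explanations.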

Abstract: We study visual explanation in geometry education as a Referring Image Segmentation (RIS) problem: given a diagram and a natural language description, the task is to produce a pixel-level mask for the referred geometric element. However, existing RIS models trained on natural image benchmarks such as RefCOCO fail catastrophically on geometric diagrams due to the fundamental domain shift between photographic scenes and abstract, textureless schematics. To address the absence of suitable training data, we present a fully automated procedural data engine that generates over 200,000 synthetic geometry diagrams with pixel-perfect segmentation masks and linguistically diverse referring expressions, requiring zero manual annotation. We further propose domain-specific fine-tuning of vision-language models (VLMs), demonstrating that a fine-tuned Florence-2 achieves 49% IoU and 85% Buffered IoU (BIoU), compared to <1% IoU in zero-shot settings. We introduce Buffered IoU, a geometry-aware evaluation metric that accounts for thin-structure localization, and show that it better reflects true segmentation quality than standard IoU. Our results establish a foundation for building Artificial General Teachers (AGTs) capable of providing visually grounded, step-by-step explanations of geometry problems.
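The abstract does not define Buffered IoU, so the following is only a guess at what a buffer-tolerant IoU for thin structures might look like: both masks are dilated by a small radius before the usual intersection-over-union is taken. The dilation shape (a square window), the radius `r`, and the formula itself are all assumptions and may differ from the paper's metric.

```python
def dilate(mask, r, h, w):
    """Grow a set of (y, x) pixels by r in every direction, clipped to an h x w grid."""
    return {
        (y + dy, x + dx)
        for y, x in mask
        for dy in range(-r, r + 1)
        for dx in range(-r, r + 1)
        if 0 <= y + dy < h and 0 <= x + dx < w
    }


def iou(a, b):
    """Standard intersection over union for two pixel sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def buffered_iou(pred, gt, r, h, w):
    """Hypothetical buffered IoU: IoU after dilating both masks by r pixels."""
    return iou(dilate(pred, r, h, w), dilate(gt, r, h, w))


# Toy example on a 6 x 6 grid: a one-pixel-wide line predicted one row too low.
H, W = 6, 6
gt_line = {(2, x) for x in range(W)}
pred_line = {(3, x) for x in range(W)}

print(iou(pred_line, gt_line))                    # plain IoU
print(buffered_iou(pred_line, gt_line, 1, H, W))  # with a 1-pixel buffer
```

The toy case shows why plain IoU is harsh on thin structures: a one-pixel offset drives it to zero, while a small buffer still credits the near miss.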

[64] EvaNet: Towards More Efficient and Consistent Infrared and Visible Image Fusion Assessment cs.CV

Chunyang Cheng, Tianyang Xu, Xiao-Jun Wu, Tao Zhou, Hui Li

TL;DR: 本文提出EvaNet，一种专为红外与可见光图像融合设计的统一评估框架，通过轻量级网络近似传统指标，采用分治策略将融合结果分解为红外与可见光分量进行信息保留度评估，并结合对比学习和大语言模型的感知场景评估进行训练，同时首次提出一致性评估框架以衡量指标与人类视觉感知的对齐程度。

Details

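Motivation: Most existing image fusion metrics are borrowed directly from other vision tasks without proper adaptation. They are computationally expensive and fail to capture fusion quality accurately, so a dedicated evaluation method that is both efficient and consistent with human perception is needed.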

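Result: Across several standard image fusion benchmarks, the method runs up to 1,000 times faster than traditional metrics and achieves greater evaluation consistency.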

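Insight: The contributions are a divide-and-conquer strategy that disentangles fusion evaluation by assessing the decomposed components separately; training that combines contrastive learning with perceptual information from a large language model; and the first consistency evaluation framework, which uses no-reference scores and downstream task performance as objective references to measure alignment with human visual perception.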

Abstract: Evaluation is essential in image fusion research, yet most existing metrics are directly borrowed from other vision tasks without proper adaptation. These traditional metrics, often based on complex image transformations, not only fail to capture the true quality of the fusion results but are also computationally demanding. To address these issues, we propose a unified evaluation framework specifically tailored for image fusion. At its core is a lightweight network designed to efficiently approximate widely used metrics, following a divide-and-conquer strategy. Unlike conventional approaches that directly assess similarity between fused and source images, we first decompose the fusion result into infrared and visible components. The evaluation model is then used to measure the degree of information preservation in these separated components, effectively disentangling the fusion evaluation process. During training, we incorporate a contrastive learning strategy and inform our evaluation model with perceptual scene assessment provided by a large language model. Last, we propose the first consistency evaluation framework, which measures the alignment between image fusion metrics and human visual perception, using both independent no-reference scores and downstream task performance as objective references. Extensive experiments show that our learning-based evaluation paradigm delivers both superior efficiency (up to 1,000 times faster) and greater consistency across a range of standard image fusion benchmarks. Our code will be publicly available at https://github.com/AWCXV/EvaNet.

[65] RayMamba: Ray-Aligned Serialization for Long-Range 3D Object Detection cs.CV | cs.AIPDF

Cheng Lu, Mingqian Ji, Shanshan Zhang, Zhihao Li, Jian Yang

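TL;DR: This paper proposes RayMamba, a geometry-aware plug-and-play enhancement module for voxel-based 3D object detectors. A ray-aligned serialization strategy organizes sparse voxels into sector-wise ordered sequences that preserve directional continuity and occlusion-related context, improving context modeling in sparse long-range scenes. The method is compatible with both LiDAR-only and multimodal detectors and adds only modest overhead.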

Details

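Motivation: In long-range 3D object detection, LiDAR observations become highly sparse and fragmented in the far field, which makes reliable context modeling difficult for existing detectors. Recent state space model (SSM)-based methods are efficient, but their generic serialization strategies fail to preserve meaningful contextual neighborhoods in sparse scenes, which limits their effectiveness.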

Result: Extensive experiments on nuScenes and Argoverse 2 show consistent improvements across several strong baselines. On nuScenes, RayMamba gains up to 2.49 mAP and 1.59 NDS in the challenging 40–50 m range; on Argoverse 2, it improves VoxelNeXt from 30.3 to 31.2 mAP.

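Insight: The core contribution is ray-aligned serialization, a geometry-aware way of constructing sequences that preserves directional continuity and occlusion-related context for subsequent Mamba-based modeling. It addresses the way generic serialization breaks local geometric structure in sparse 3D scenes, and offers an efficient long-range modeling enhancement tailored to 3D point cloud data.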

Abstract: Long-range 3D object detection remains challenging because LiDAR observations become highly sparse and fragmented in the far field, making reliable context modeling difficult for existing detectors. Recent state space model (SSM)-based methods have improved long-range modeling efficiency. However, their effectiveness is still limited by generic serialization strategies that fail to preserve meaningful contextual neighborhoods in sparse scenes. To overcome this limitation, we propose RayMamba, a geometry-aware plug-and-play enhancement for voxel-based 3D detectors. RayMamba organizes sparse voxels into sector-wise ordered sequences through a ray-aligned serialization strategy, which preserves directional continuity and occlusion-related context for subsequent Mamba-based modeling. It is compatible with both LiDAR-only and multimodal detectors, while introducing only modest overhead. Extensive experiments on nuScenes and Argoverse 2 demonstrate consistent improvements across strong baselines. In particular, RayMamba achieves up to 2.49 mAP and 1.59 NDS gain in the challenging 40–50 m range on nuScenes, and further improves VoxelNeXt on Argoverse 2 from 30.3 to 31.2 mAP.
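The idea of sector-wise ordering can be sketched in a few lines. This is not the authors' implementation: it assumes the sensor sits at the origin of the bird's-eye-view plane, that sectors are equal azimuth bins, and that voxels within a sector are ordered from near to far. The function name `ray_aligned_order` and the parameter `num_sectors` are hypothetical.

```python
import math


def ray_aligned_order(voxels, num_sectors=8):
    """Return voxel indices ordered sector by sector, near to far in each sector.

    voxels: sequence of (x, y, ...) coordinates relative to the sensor.
    ASSUMPTION: sectors are equal-width azimuth bins around the origin.
    """
    def key(i):
        x, y = voxels[i][0], voxels[i][1]
        azimuth = math.atan2(y, x) % (2 * math.pi)  # map to [0, 2*pi)
        sector = min(int(azimuth / (2 * math.pi) * num_sectors), num_sectors - 1)
        return (sector, math.hypot(x, y))           # then sort along the ray
    return sorted(range(len(voxels)), key=key)


voxels = [(10.0, 1.0), (-1.0, 5.0), (1.0, 0.1), (-2.0, 10.0), (5.0, 0.5)]
order = ray_aligned_order(voxels, num_sectors=4)
print(order)  # [2, 4, 0, 1, 3]
```

In the resulting sequence, voxels that lie along roughly the same viewing direction are adjacent, with nearer (potentially occluding) voxels preceding farther ones, which is the kind of neighborhood a sequence model can then exploit.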

[66] SentiAvatar: Towards Expressive and Interactive Digital Humans cs.CV | cs.HC | cs.MMPDF

Chuhao Jin, Rui Zhang, Qingzhe Gao, Haoyu Shi, Dayu Wu

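TL;DR: This paper proposes SentiAvatar, a framework for building expressive, interactive 3D digital humans, and uses it to create the virtual character SuSu. By building the high-quality multimodal dataset SuSuInterActs, pre-training a Motion Foundation Model, and adopting an audio-aware plan-then-infill architecture, the framework addresses three key challenges: the lack of large-scale high-quality data, robust semantic-to-motion mapping, and fine-grained frame-level motion-prosody synchronization.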

Details

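Motivation: Building an interactive 3D digital human that speaks, gestures, and emotes in real time faces three challenges: the lack of large-scale, high-quality multimodal data; the need for robust semantic-to-motion mapping; and fine-grained frame-level synchronization between motion and prosody.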

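Result: On the authors' SuSuInterActs dataset, SentiAvatar reaches R@1 of 43.64%, nearly twice the best baseline, setting the state of the art (SOTA). On the BEATv2 benchmark it also reaches SOTA, with FGD 4.941 and BC 8.078. The system generates 6 s of output in 0.3 s and supports unlimited multi-turn streaming.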

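Insight: The main contributions are (1) SuSuInterActs, a high-quality, single-character dialogue dataset with synchronized modalities; (2) a Motion Foundation Model pre-trained on more than 200K motion sequences, which gives it rich action priors beyond conversational scenarios; and (3) an audio-aware plan-then-infill architecture that decouples sentence-level semantic planning from frame-level prosody-driven interpolation, so that generated motions are semantically appropriate and rhythmically aligned with speech.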

Abstract: We present SentiAvatar, a framework for building expressive interactive 3D digital humans, and use it to create SuSu, a virtual character that speaks, gestures, and emotes in real time. Achieving such a system remains challenging, as it requires jointly addressing three key problems: the lack of large-scale, high-quality multimodal data, robust semantic-to-motion mapping, and fine-grained frame-level motion-prosody synchronization. To solve these problems, first, we build SuSuInterActs (21K clips, 37 hours), a dialogue corpus captured via optical motion capture around a single character with synchronized speech, full-body motion, and facial expressions. Second, we pre-train a Motion Foundation Model on 200K+ motion sequences, equipping it with rich action priors that go well beyond the conversation. We then propose an audio-aware plan-then-infill architecture that decouples sentence-level semantic planning from frame-level prosody-driven interpolation, so that generated motions are both semantically appropriate and rhythmically aligned with speech. Experiments show that SentiAvatar achieves state-of-the-art on both SuSuInterActs (R@1 43.64%, nearly 2 times the best baseline) and BEATv2 (FGD 4.941, BC 8.078), producing 6s of output in 0.3s with unlimited multi-turn streaming. The source code, model, and dataset are available at https://sentiavatar.github.io.

[67] GP-4DGS: Probabilistic 4D Gaussian Splatting from Monocular Video via Variational Gaussian Processes cs.CVPDF

Mijeong Kim, Jungtaek Kim, Bohyung Han

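TL;DR: GP-4DGS is a new framework that integrates Gaussian Processes (GPs) into 4D Gaussian Splatting (4DGS) for probabilistic modeling of dynamic scenes. It addresses the inability of existing 4DGS methods to capture motion ambiguity and assess prediction reliability. Using spatio-temporal kernels and variational Gaussian Processes, it provides uncertainty quantification for motion predictions, motion estimation in unobserved regions, and temporal extrapolation.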

Details

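Motivation: Existing 4DGS methods focus on deterministic reconstruction; they cannot effectively capture motion ambiguity and lack a mechanism for assessing prediction reliability, which limits their use in dynamic scene modeling.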

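Result: Experiments show that GP-4DGS improves reconstruction quality while providing reliable uncertainty estimates that effectively identify regions of high motion ambiguity.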

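Insight: The contribution is to bring the probabilistic nature of GPs into 4DGS: spatio-temporal kernels and variational Gaussian Processes with inducing points enable scalable inference and probabilistic modeling of dynamic scenes, a meaningful step toward bridging probabilistic modeling and neural graphics.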

Abstract: We present GP-4DGS, a novel framework that integrates Gaussian Processes (GPs) into 4D Gaussian Splatting (4DGS) for principled probabilistic modeling of dynamic scenes. While existing 4DGS methods focus on deterministic reconstruction, they are inherently limited in capturing motion ambiguity and lack mechanisms to assess prediction reliability. By leveraging the kernel-based probabilistic nature of GPs, our approach introduces three key capabilities: (i) uncertainty quantification for motion predictions, (ii) motion estimation for unobserved or sparsely sampled regions, and (iii) temporal extrapolation beyond observed training frames. To scale GPs to the large number of Gaussian primitives in 4DGS, we design spatio-temporal kernels that capture the correlation structure of deformation fields and adopt variational Gaussian Processes with inducing points for tractable inference. Our experiments show that GP-4DGS enhances reconstruction quality while providing reliable uncertainty estimates that effectively identify regions of high motion ambiguity. By addressing these challenges, our work takes a meaningful step toward bridging probabilistic modeling and neural graphics.
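As a rough illustration of how a spatio-temporal kernel yields uncertainty that grows under temporal extrapolation, the sketch below uses a separable RBF kernel (an RBF over space multiplied by an RBF over time) and the exact GP posterior for a single noisy observation. Both choices are assumptions made for brevity: the abstract does not specify the kernel form, and the paper's variational inference with inducing points is not reproduced here. All names and hyperparameter values are hypothetical.

```python
import math


def st_kernel(p, q, ls_space=1.0, ls_time=1.0, variance=1.0):
    """ASSUMED separable kernel: RBF over position times RBF over time.

    p, q: (position, time) pairs, with position a tuple of floats.
    """
    (x1, t1), (x2, t2) = p, q
    d2 = sum((a - b) ** 2 for a, b in zip(x1, x2))
    k_space = math.exp(-0.5 * d2 / ls_space ** 2)
    k_time = math.exp(-0.5 * (t1 - t2) ** 2 / ls_time ** 2)
    return variance * k_space * k_time


def posterior_single(obs_point, obs_value, query, noise=1e-2):
    """Exact GP posterior mean and variance given one noisy observation."""
    k_oo = st_kernel(obs_point, obs_point) + noise  # observed variance + noise
    k_qo = st_kernel(query, obs_point)              # query/observation covariance
    mean = k_qo / k_oo * obs_value
    var = st_kernel(query, query) - k_qo ** 2 / k_oo
    return mean, var


obs = ((0.0, 0.0, 0.0), 0.0)  # a primitive observed at time t = 0
near = ((0.0, 0.0, 0.0), 0.1)  # shortly after the observation
far = ((0.0, 0.0, 0.0), 3.0)   # well beyond the observed frames

mean_near, var_near = posterior_single(obs, 1.0, near)
mean_far, var_far = posterior_single(obs, 1.0, far)
print(var_near < var_far)  # True: uncertainty grows when extrapolating in time
```

The same posterior-variance quantity, computed over many primitives, is the sort of signal that could flag regions of high motion ambiguity.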

[68] PolyReal: A Benchmark for Real-World Polymer Science Workflows cs.CVPDF

Wanhao Liu, Weida Wang, Jiaqing Xie, Suorong Yang, Jue Wang

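TL;DR: This paper introduces PolyReal, a multimodal benchmark grounded in real-world polymer science workflows, for evaluating multimodal large language models (MLLMs) across the full experimental lifecycle. It covers five key capabilities: foundational knowledge application, lab safety analysis, experiment mechanism reasoning, raw data extraction, and performance and application exploration.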

Details

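Motivation: Existing polymer science benchmarks largely overlook real-world workflows, which limits their practical utility, and they do not systematically evaluate MLLMs across the full, practice-grounded experimental lifecycle. Polymer science is an interdisciplinary field with diverse multimodal data, which makes it an ideal high-stakes testbed for evaluating MLLMs on complex real-world scientific problems.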

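Result: Evaluating leading MLLMs on PolyReal reveals a capability imbalance: models perform well on knowledge-intensive reasoning tasks (e.g., Experiment Mechanism Reasoning) but drop sharply on practice-based tasks (e.g., Lab Safety Analysis and Raw Data Extraction). This exposes a severe gap between abstract scientific knowledge and its practical, context-dependent application.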

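Insight: The contribution is a multimodal benchmark rooted in real scientific practice and covering the full experimental lifecycle, filling a gap in existing evaluations. It shifts the focus of evaluation from pure knowledge question answering to workflow-driven testing of combined capabilities, with particular emphasis on practical steps such as safety and data extraction, offering a new perspective and tool for assessing the practical utility of AI in real scientific applications.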

Abstract: Multimodal Large Language Models (MLLMs) excel in general domains but struggle with complex, real-world science. We posit that polymer science, an interdisciplinary field spanning chemistry, physics, biology, and engineering, is an ideal high-stakes testbed due to its diverse multimodal data. Yet, existing benchmarks related to polymer science largely overlook real-world workflows, limiting their practical utility and failing to systematically evaluate MLLMs across the full, practice-grounded lifecycle of experimentation. We introduce PolyReal, a novel multimodal benchmark grounded in real-world scientific practices to evaluate MLLMs on the full lifecycle of polymer experimentation. It covers five critical capabilities: (1) foundational knowledge application; (2) lab safety analysis; (3) experiment mechanism reasoning; (4) raw data extraction; and (5) performance & application exploration. Our evaluation of leading MLLMs on PolyReal reveals a capability imbalance. While models perform well on knowledge-intensive reasoning (e.g., Experiment Mechanism Reasoning), they drop sharply on practice-based tasks (e.g., Lab Safety Analysis and Raw Data Extraction). This exposes a severe gap between abstract scientific knowledge and its practical, context-dependent application, showing that these real-world tasks remain challenging for MLLMs. Thus, PolyReal helps address this evaluation gap and provides a practical benchmark for assessing AI systems in real-world scientific workflows.

[69] MMTalker: Multiresolution 3D Talking Head Synthesis with Multimodal Feature Fusion cs.CVPDF

Bin Liu, Zhixiang Xiong, Zhifen He, Bo Li

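TL;DR: This paper proposes MMTalker, a new audio-driven 3D facial animation synthesis method that uses multi-resolution representation and multimodal feature fusion to accurately reconstruct the rich details of 3D facial motion from 1D speech signals. It first builds a continuous, detailed representation of the 3D face through mesh parameterization and non-uniform differentiable sampling; it then uses a residual graph convolutional network and a dual cross-attention mechanism to extract discriminative facial motion features from multimodal inputs; finally, a lightweight regression network predicts the vertex-wise geometric displacements of the synthesized talking face.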

Details

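Motivation: Current speech-driven 3D facial animation methods still struggle to maintain lip-sync accuracy and produce realistic facial expressions, mainly because the cross-modal mapping is highly ill-posed. The paper addresses this with multi-resolution representation and multimodal feature fusion to improve the accuracy and realism of the synthesized animation.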

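Result: Comprehensive experiments show significant improvements over existing state-of-the-art methods, especially in the synchronization accuracy of lip and eye movements.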

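Insight: The contributions are (1) a continuous, detailed representation of the 3D face through mesh parameterization and non-uniform differentiable sampling; (2) multimodal feature fusion with a residual graph convolutional network and a dual cross-attention mechanism, which makes full use of the hierarchical features of speech and the explicit spatiotemporal geometric features of the facial mesh; and (3) joint processing of the sampled points in the canonical UV space and the encoded facial motion features to predict geometric displacements. Together these improve the accuracy of the cross-modal mapping and the reconstruction of detail.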

Abstract: Speech-driven three-dimensional (3D) facial animation synthesis aims to build a mapping from one-dimensional (1D) speech signals to time-varying 3D facial motion signals. Current methods still face challenges in maintaining lip-sync accuracy and producing realistic facial expressions, primarily due to the highly ill-posed nature of this cross-modal mapping. In this paper, we introduce MMTalker, a novel 3D audio-driven facial animation synthesis method based on multi-resolution representation and multi-modal feature fusion, which can accurately reconstruct the rich details of 3D facial motion. We first achieve a continuous, detailed representation of the 3D face by mesh parameterization and non-uniform differentiable sampling. The mesh parameterization technique establishes the correspondence between the UV plane and the 3D facial mesh and is used to provide ground truth for the continuous learning. Differentiable non-uniform sampling enables precise facial detail acquisition by setting a learnable sampling probability in each triangular face. Next, we employ a residual graph convolutional network and a dual cross-attention mechanism to extract discriminative facial motion features from multiple input modalities. This multimodal fusion strategy makes full use of the hierarchical features of speech and the explicit spatiotemporal geometric features of the facial mesh. Finally, a lightweight regression network predicts the vertex-wise geometric displacements of the synthesized talking face by jointly processing the sampled points in the canonical UV space and the encoded facial motion features. Comprehensive experiments demonstrate that significant improvements are achieved over state-of-the-art methods, especially in the synchronization accuracy of lip and eye movements.

Zelin Zhang, Kedi Li, Huiqi Liang, Tao Zhang, Chuanzhi Xu

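TL;DR: This paper proposes CrossWeaver, a simple yet effective multimodal fusion framework for arbitrary-modality semantic segmentation. Its core is the Modality Interaction Block (MIB), which enables selective, reliability-aware cross-modal interaction within the encoder, while a lightweight Seam-Aligned Fusion (SAF) module further aggregates the enhanced features.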

Details

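Motivation: Existing methods rely on carefully designed fusion strategies that use either modality-specific adaptations or loosely coupled interactions, which limits flexibility and leads to less effective cross-modal coordination. They also struggle to balance efficient information exchange with preserving each modality's unique characteristics across different modality combinations.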

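Result: Extensive experiments on multiple multimodal semantic segmentation benchmarks show that the framework achieves state-of-the-art performance with minimal additional parameters and generalizes strongly to unseen modality combinations.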

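Insight: The contribution is a general, lightweight fusion framework in which the MIB performs selective cross-modal interaction and the SAF module aggregates features, so that arbitrary modality combinations are handled flexibly while each modality's characteristics are preserved, improving the efficiency and generalization of multimodal semantic segmentation.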

Abstract: Multimodal semantic segmentation has shown great potential in leveraging complementary information across diverse sensing modalities. However, existing approaches often rely on carefully designed fusion strategies that either use modality-specific adaptations or rely on loosely coupled interactions, thereby limiting flexibility and resulting in less effective cross-modal coordination. Moreover, these methods often struggle to balance efficient information exchange with preserving the unique characteristics of each modality across different modality combinations. To address these challenges, we propose CrossWeaver, a simple yet effective multimodal fusion framework for arbitrary-modality semantic segmentation. Its core is a Modality Interaction Block (MIB), which enables selective and reliability-aware cross-modal interaction within the encoder, while a lightweight Seam-Aligned Fusion (SAF) module further aggregates the enhanced features. Extensive experiments on multiple multimodal semantic segmentation benchmarks demonstrate that our framework achieves state-of-the-art performance with minimal additional parameters and strong generalization to unseen modality combinations.

[71] Collaborative Multi-Mode Pruning for Vision-Language Models cs.CVPDF

Zimeng Wu, Yunhong Wang, Donghao Wang, Jiaxin Chen

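TL;DR: This paper proposes Collaborative Multi-Mode Pruning (CoMP), a new framework tailored for Vision-Language Models (VLMs) that compresses the model by jointly pruning parameters and tokens, addressing the substantial performance degradation seen at high pruning ratios.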

Details

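Motivation: Existing pruning methods mostly focus on a single mode (parameters or tokens) and do not fully exploit the inherent redundancy in each mode. This causes substantial performance degradation at high pruning ratios and limits the deployment of VLMs on resource-constrained devices.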

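Result: Extensive experiments across various vision-language tasks and models show that the method improves performance at high pruning ratios compared with state-of-the-art approaches.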

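Insight: The contributions are a Collaborative Importance Metric (CIM) that investigates the mutual interference between parameters and tokens, and a Multi-Mode Pruning Strategy (MPS) that decomposes pruning into stages and adaptively selects the best pruning mode, combining historical cost with random exploration to keep pruning stable and avoid local optima.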

Abstract: Vision-Language Models (VLMs) have advanced rapidly within the unified Transformer architecture, yet their deployment on resource-constrained devices remains challenging due to high computational complexity. While pruning has emerged as an effective technique for compressing VLMs, existing approaches predominantly focus on a single mode by pruning either parameters or tokens, without fully exploring the inherent redundancy in each mode, which leads to substantial performance degradation at high pruning ratios. To address these limitations, we propose Collaborative Multi-Mode Pruning (CoMP), a novel framework tailored for VLMs that performs joint parameter and token pruning. Specifically, we first design a Collaborative Importance Metric (CIM) that investigates the mutual interference between the coupled parameters and tokens. It incorporates the distinct significance of tokens into the computation of parameter importance scores, while simultaneously mitigating the effect of pruned parameters on token importance scores. Moreover, we develop a Multi-Mode Pruning Strategy (MPS) that decomposes the overall pruning process into a sequence of pruning stages; in each stage we estimate the priority of different pruning modes based on their pruning cost and adaptively shift to the optimal one. Additionally, MPS integrates the historical cost and random exploration in order to achieve a stable pruning process and avoid local optima. Extensive experiments across various vision-language tasks and models demonstrate that our method effectively improves performance under high pruning ratios compared with state-of-the-art approaches. The source code is available at https://github.com/Wuzimeng/CoMP.git.

[72] Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation cs.CVPDF

Hanshuai Cui, Zhiqing Tang, Zhi Yao, Fanshuai Meng, Weijia Jia

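TL;DR: This paper proposes SCOPE, a training-free framework for accelerating autoregressive video diffusion models. A tri-modal scheduler over cache, predict, and recompute, together with a selective computation mechanism, reduces the cost of repeated denoising and achieves up to 4.73x speedup while maintaining generation quality.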

Details

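Motivation: Autoregressive video diffusion models are expensive when generating long videos. Existing training-free acceleration methods rely on binary cache-or-recompute decisions and overlook intermediate cases. In addition, asynchronous autoregressive schedules assign different noise levels to different frames, yet existing methods process the entire valid interval uniformly, which is inefficient.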

Result: On MAGI-1 and SkyReels-V2, SCOPE achieves up to 4.73x speedup while maintaining quality comparable to the original output, outperforming all training-free baselines.

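Insight: The contributions are (1) a tri-modal scheduler (cache, predict, recompute) in which prediction by noise-level Taylor extrapolation fills the gap between reuse and recomputation, with explicit stability controls backed by error propagation analysis; and (2) selective computation, which restricts execution to the active frame interval and targets inefficiencies specific to autoregressive generation.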

Abstract: Autoregressive (AR) video diffusion models enable long-form video generation but remain expensive due to repeated multi-step denoising. Existing training-free acceleration methods rely on binary cache-or-recompute decisions, overlooking intermediate cases where direct reuse is too coarse yet full recomputation is unnecessary. Moreover, asynchronous AR schedules assign different noise levels to co-generated frames, yet existing methods process the entire valid interval uniformly. To address these AR-specific inefficiencies, we present SCOPE, a training-free framework for efficient AR video diffusion. SCOPE introduces a tri-modal scheduler over cache, predict, and recompute, where prediction via noise-level Taylor extrapolation fills the gap between reuse and recomputation with explicit stability controls backed by error propagation analysis. It further introduces selective computation that restricts execution to the active frame interval. On MAGI-1 and SkyReels-V2, SCOPE achieves up to 4.73x speedup while maintaining quality comparable to the original output, outperforming all training-free baselines.
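The middle option between reusing a cached feature and recomputing it can be sketched as a first-order Taylor extrapolation along the noise level. The sketch below is an assumption-laden illustration, not SCOPE itself: it uses a finite-difference slope from two cached features, and the decision rule `choose_mode` with its thresholds `reuse_tol` and `predict_tol` is a hypothetical stand-in for the paper's stability controls.

```python
def taylor_predict(f_prev, f_curr, s_prev, s_curr, s_next):
    """First-order extrapolation of a feature vector to noise level s_next.

    f_prev, f_curr: features cached at noise levels s_prev and s_curr.
    The derivative is approximated by a finite difference (ASSUMPTION).
    """
    slope = [(c - p) / (s_curr - s_prev) for p, c in zip(f_prev, f_curr)]
    return [c + m * (s_next - s_curr) for c, m in zip(f_curr, slope)]


def choose_mode(rel_change, reuse_tol=0.01, predict_tol=0.1):
    """HYPOTHETICAL tri-modal rule keyed on recent relative feature change."""
    if rel_change < reuse_tol:
        return "cache"      # features barely move: reuse as-is
    if rel_change < predict_tol:
        return "predict"    # moderate drift: extrapolate cheaply
    return "recompute"      # large drift: run the full model


# Features that happen to vary linearly with the noise level.
f_prev, f_curr = [2.0, 0.0], [1.6, 0.2]
predicted = taylor_predict(f_prev, f_curr, s_prev=1.0, s_curr=0.8, s_next=0.6)
print(predicted)  # approximately [1.2, 0.4]
print(choose_mode(0.05))  # predict
```

For features that are locally linear in the noise level the extrapolation is exact up to rounding; the further they depart from linearity, the more a scheduler would need to fall back to recomputation.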

[73] Explicit Time-Frequency Dynamics for Skeleton-Based Gait Recognition cs.CVPDF

Seoyeon Ko, Yeojin Song, Egene Chung, Luca Quagliato, Taeyong Lee

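TL;DR: This paper proposes a plug-and-play Wavelet Feature Stream that uses the continuous wavelet transform to extract time-frequency dynamics from joint velocities, strengthening how existing skeleton-based gait recognition backbones represent motion dynamics. On CASIA-B it consistently improves models such as GaitMixer, with especially strong gains under covariate changes such as carrying bags and wearing coats.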

Details

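Motivation: Existing skeleton-based gait recognition methods are good at modeling spatial structure but underuse the explicit motion dynamics that are crucial under appearance changes.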

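Result: On CASIA-B, the method delivers consistent gains on strong skeleton backbones such as GaitMixer, GaitFormer, and GaitGraph, and sets a new skeleton-based gait recognition SOTA when combined with GaitMixer. The improvements are especially pronounced under covariate shifts such as carrying bags (BG) and wearing coats (CL).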

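Insight: The contribution is to use the continuous wavelet transform to convert joint velocity sequences into multi-scale scalograms, from which a lightweight CNN learns discriminative dynamic cues. This requires no changes to the backbone architecture and no additional supervision, and it makes explicit time-frequency modeling complementary to standard spatio-temporal encoders.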

Abstract: Skeleton-based gait recognizers excel at modeling spatial configurations but often underuse explicit motion dynamics that are crucial under appearance changes. We introduce a plug-and-play Wavelet Feature Stream that augments any skeleton backbone with time-frequency dynamics of joint velocities. Concretely, per-joint velocity sequences are transformed by the continuous wavelet transform (CWT) into multi-scale scalograms, from which a lightweight multi-scale CNN learns discriminative dynamic cues. The resulting descriptor is fused with the backbone representation for classification, requiring no changes to the backbone architecture or additional supervision. Across CASIA-B, the proposed stream delivers consistent gains on strong skeleton backbones (e.g., GaitMixer, GaitFormer, GaitGraph) and establishes a new skeleton-based state of the art when attached to GaitMixer. The improvements are especially pronounced under covariate shifts such as carrying bags (BG) and wearing coats (CL), highlighting the complementarity of explicit time-frequency modeling and standard spatio-temporal encoders.

[74] GenSmoke-GS: A Multi-Stage Method for Novel View Synthesis from Smoke-Degraded Images Using a Generative Model cs.CVPDF

Qida Cao, Xinyuan Hu, Changyue Shi, Jiajun Ding, Zhou Yu

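TL;DR: This paper proposes GenSmoke-GS, a multi-stage method for novel view synthesis from smoke-degraded images. Its pipeline of image restoration, dehazing, MLLM-based enhancement, 3DGS-MCMC optimization, and averaging over repeated runs aims to improve image visibility before rendering while keeping scene content consistent across input views. The method ranked first in Track 2 of the NTIRE 2026 3DRR Challenge.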

Details

Motivation: Smoke reduces image visibility and weakens the cross-view consistency that scene optimization and rendering require.

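Result: On the NTIRE challenge benchmark, both quantitative performance and visual quality exceed the provided baselines, and the method ranked first among 14 participants.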

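Insight: The main contribution is integrating a generative model (an MLLM) into a multi-stage pipeline and combining 3DGS-MCMC optimization with averaging over repeated runs, which improves image quality while limiting changes to scene content and so preserves cross-view consistency.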

Abstract: This paper describes our method for Track 2 of the NTIRE 2026 3D Restoration and Reconstruction (3DRR) Challenge on smoke-degraded images. In this task, smoke reduces image visibility and weakens the cross-view consistency required by scene optimization and rendering. We address this problem with a multi-stage pipeline consisting of image restoration, dehazing, MLLM-based enhancement, 3DGS-MCMC optimization, and averaging over repeated runs. The main purpose of the pipeline is to improve visibility before rendering while limiting scene-content changes across input views. Experimental results on the challenge benchmark show improved quantitative performance and better visual quality than the provided baselines. The code is available at https://github.com/plbbl/GenSmoke-GS. Our method achieved a ranking of 1 out of 14 participants in Track 2 of the NTIRE 3DRR Challenge, as reported on the official competition website: https://www.codabench.org/competitions/13993/#/results-tab.

[75] QVAD: A Question-Centric Agentic Framework for Efficient and Training-Free Video Anomaly Detection cs.CVPDF

Lokman Bekit, Hamza Karim, Nghia T Nguyen, Yasin Yilmaz

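TL;DR: QVAD is a question-centric agentic framework for efficient, training-free video anomaly detection. An LLM agent holds a dynamic dialogue with a small VLM, iteratively refining its queries to obtain high-fidelity captions and precise semantic reasoning. It reaches SOTA performance on several benchmark datasets with high inference speed and a low memory footprint.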

Details

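Motivation: The open-set nature of anomalies makes video anomaly detection challenging. Existing training-free VLM methods built on static prompts depend on massive, resource-intensive models; the authors argue that the bottleneck is the static nature of the queries rather than model capacity.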

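Result: QVAD achieves state-of-the-art performance on UCF-Crime, XD-Violence, and UBNormal, and shows exceptional generalizability on the single-scene ComplexVAD dataset, while using far fewer parameters than competing methods, running at high inference speed, and keeping a low memory footprint.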

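Insight: The contribution is to treat VLM-LLM interaction as a dynamic dialogue: a "prompt-updating" mechanism unlocks the potential of lightweight models and achieves high-performance detection without parameter updates, making advanced VAD deployable on resource-constrained edge devices.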

Abstract: Video Anomaly Detection (VAD) is a fundamental challenge in computer vision, particularly due to the open-set nature of anomalies. While recent training-free approaches utilizing Vision-Language Models (VLMs) have shown promise, they typically rely on massive, resource-intensive foundation models to compensate for the ambiguity of static prompts. We argue that the bottleneck in VAD is not necessarily model capacity, but rather the static nature of inquiry. We propose QVAD, a question-centric agentic framework that treats VLM-LLM interaction as a dynamic dialogue. By iteratively refining queries based on visual context, our LLM agent guides smaller VLMs to produce high-fidelity captions and precise semantic reasoning without parameter updates. This "prompt-updating" mechanism effectively unlocks the latent capabilities of lightweight models, enabling state-of-the-art performance on UCF-Crime, XD-Violence, and UBNormal using a fraction of the parameters required by competing methods. We further demonstrate exceptional generalizability on the single-scene ComplexVAD dataset. Crucially, QVAD achieves high inference speeds with minimal memory footprints, making advanced VAD capabilities deployable on resource-constrained edge devices.

[76] STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models cs.CV | cs.MMPDF

Linfeng Fan, Yuan Tian, Ziwei Li, Zhiwu Lu

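TL;DR: This paper proposes STEAR, a layer-aware spatiotemporal evidence intervention framework for mitigating spatiotemporal hallucinations in Video Large Language Models (Video-LLMs). It identifies high-risk decoding steps, extracts token-conditioned visual evidence from grounding-sensitive middle layers, and uses that evidence both to restore missing local grounding and to construct temporally perturbed patch-level counterfactuals, mitigating spatial and temporal hallucinations efficiently within a single-encode inference framework.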

Details

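Motivation: Existing methods typically treat hallucination as a uniform decoding failure and apply globally shared correction rules. The authors observe that decoder layers contribute differently to visual grounding and to later linguistic composition, which indicates that intervention must be layer-aware if it is to address the spatiotemporal hallucinations of Video-LLMs, namely visually unsupported details and incorrect temporal relations.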

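Result: Experiments across representative Video-LLM backbones and challenging benchmarks show that STEAR consistently reduces hallucinations while improving faithfulness, temporal consistency, and robustness.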

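Insight: The contribution is a layer-aware intervention framework built on the view that reliable video decoding depends on intervening on precise evidence at the right layer rather than enforcing a global penalty. Token-conditioned evidence extraction, coupled with grounding restoration and counterfactual reasoning, mitigates spatiotemporal hallucinations in a targeted way.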

Abstract: Video Large Language Models (Video-LLMs) remain prone to spatiotemporal hallucinations, often generating visually unsupported details or incorrect temporal relations. Existing mitigation methods typically treat hallucination as a uniform decoding failure, applying globally shared correction rules. We instead observe that decoder layers contribute differently to visual grounding and later linguistic composition, indicating that intervention must be layer-aware. Based on this insight, we propose STEAR, a layer-aware spatiotemporal evidence intervention framework. STEAR identifies high-risk decoding steps and selects token-conditioned visual evidence from grounding-sensitive middle layers. It uses this shared evidence for two coupled purposes: restoring missing local grounding in middle layers, and constructing temporally perturbed patch-level counterfactuals to falsify inconsistent reasoning during late-layer decoding. Consequently, STEAR mitigates both spatial and temporal hallucinations within an efficient single-encode inference framework. Experiments across representative Video-LLM backbones and challenging benchmarks demonstrate that STEAR consistently reduces hallucinations while improving faithfulness, temporal consistency, and robustness. Our results confirm that reliable video decoding relies on intervening on precise evidence at the right layer, rather than enforcing a global penalty. The code is provided in the Supplementary Material.

[77] MI-Pruner: Crossmodal Mutual Information-guided Token Pruner for Efficient MLLMs cs.CVPDF

Jiameng Li, Aleksei Tiulpin, Matthew B. Blaschko

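TL;DR: MI-Pruner is a crossmodal mutual information-guided visual token pruning method for efficient multimodal large language models (MLLMs). It measures crossmodal dependency by directly computing the mutual information between visual and textual features, and uses it to selectively prune visual tokens for efficient inference.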

Details

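Motivation: Visual information in MLLMs is relatively sparse. Existing visual pruning methods usually select important tokens by attention score, which ties them to a specific mechanism; this paper explores a more direct and more fine-grained way of measuring crossmodal dependency to guide pruning.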

Result: Experiments show that the method outperforms previous attention-based pruning methods with minimal latency.

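Insight: The contribution is to measure crossmodal dependency directly through the mutual information between visual and textual features rather than through attention. The method is simple, efficient, and requires no access to internal attention maps or changes to the model architecture, offering a more fundamental measure of crossmodal feature association.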

Abstract: For multimodal large language models (MLLMs), visual information is relatively sparse compared with text. As a result, research on visual pruning has emerged for efficient inference. Current approaches typically measure token importance based on the attention scores in the visual encoder or in the LLM decoder, then select visual tokens with high attention scores while pruning others. In this paper, we pursue a different and more surgical approach. Instead of relying on mechanism-specific signals, we directly compute Mutual Information (MI) between visual and textual features themselves, prior to their interaction. This allows us to explicitly measure crossmodal dependency at the feature level. Our MI-Pruner is simple, efficient, and non-intrusive, requiring no access to internal attention maps or architectural modifications. Experimental results demonstrate that our approach outperforms previous attention-based pruning methods with minimal latency.

[78] A Data-Centric Vision Transformer Baseline for SAR Sea Ice Classification cs.CV | cs.AIPDF

David Mike-Ewewie, Panhapiseth Lim, Priyanka Kumar

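TL;DR: This paper establishes a Vision Transformer baseline on SAR data for Arctic sea ice classification. Using the AI4Arctic/ASIP Sea Ice Dataset, it evaluates ViT models with full-resolution Sentinel-1 Extra Wide inputs, leakage-aware stratified patch splitting, SIGRID-3 stage-of-development labels, and training-set normalization. Among the tested configurations, ViT-Large trained with focal loss performs best on the minority Multi-Year Ice class.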

Details

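Motivation: Distinguishing morphologically similar sea ice classes under severe class imbalance is challenging; the paper aims to provide a trustworthy SAR-only baseline for future multimodal fusion work.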

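Result: On the AI4Arctic/ASIP Sea Ice Dataset, ViT-Large with focal loss achieves 69.6% held-out accuracy, 68.8% weighted F1, and 83.9% precision on the minority Multi-Year Ice class, the best among the tested configurations, which also include ViT-Base models trained with cross-entropy and weighted cross-entropy.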

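Insight: The contributions are (1) a trustworthy Vision Transformer baseline for SAR sea ice classification; (2) evidence that focal loss offers a more useful precision-recall trade-off than weighted cross-entropy under class imbalance; and (3) a data preprocessing pipeline (including leakage-aware stratified patch splitting) that gives later multimodal fusion work a clean foundation.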

Abstract: Accurate and automated sea ice classification is important for climate monitoring and maritime safety in the Arctic. While Synthetic Aperture Radar (SAR) is the operational standard because of its all-weather capability, it remains challenging to distinguish morphologically similar ice classes under severe class imbalance. Rather than claiming a fully validated multimodal system, this paper establishes a trustworthy SAR-only baseline that future fusion work can build upon. Using the AI4Arctic/ASIP Sea Ice Dataset (v2), which contains 461 Sentinel-1 scenes matched with expert ice charts, we combine full-resolution Sentinel-1 Extra Wide inputs, leakage-aware stratified patch splitting, SIGRID-3 stage-of-development labels, and training-set normalization to evaluate Vision Transformer baselines. We compare ViT-Base models trained with cross-entropy and weighted cross-entropy against a ViT-Large model trained with focal loss. Among the tested configurations, ViT-Large with focal loss achieves 69.6% held-out accuracy, 68.8% weighted F1, and 83.9% precision on the minority Multi-Year Ice class. These results show that focal-loss training offers a more useful precision-recall trade-off than weighted cross-entropy for rare ice classes, and they establish a cleaner baseline for future multimodal fusion with optical, thermal, or meteorological data.
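For readers unfamiliar with focal loss, the following minimal sketch shows how it differs from cross-entropy: the standard form multiplies the cross-entropy term by (1 - p_t)^gamma, so confidently correct examples contribute far less. The abstract does not report the hyperparameters used, so `gamma=2.0` and `alpha=1.0` below are assumed defaults, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0, alpha=1.0):
    """Focal loss for one example: -alpha * (1 - p_t)^gamma * log(p_t).

    gamma and alpha are ASSUMED values, not the paper's settings.
    """
    p_t = softmax(logits)[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(logits, target):
    """Cross-entropy is the gamma = 0 special case of focal loss."""
    return focal_loss(logits, target, gamma=0.0)


easy = [5.0, 0.0, 0.0]  # confidently correct for class 0
hard = [0.0, 2.0, 0.0]  # class 0 is the true label but scores low

easy_ratio = focal_loss(easy, 0) / cross_entropy(easy, 0)
hard_ratio = focal_loss(hard, 0) / cross_entropy(hard, 0)
print(easy_ratio < hard_ratio)  # True: easy examples are down-weighted more
```

Because the modulating factor shrinks the contribution of well-classified examples, training signal concentrates on hard, often minority-class samples, which is the usual rationale for using it under class imbalance.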

[79] Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning cs.CV | cs.AIPDF

Zhangyun Tan, Zeliang Zhang, Susan Liang, Yolo Yunlong Tang, Lisha Chen

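TL;DR: This paper introduces VLM-UnBench, the first benchmark for evaluating training-free visual concept unlearning in vision-language models (VLMs), covering multiple forgetting levels, datasets, and concept axes. Realistic unlearning prompts have little effect on forget accuracy; meaningful reductions appear only under oracle conditions in which the target concept is disclosed to the model, exposing a gap between prompt-level suppression and true concept erasure.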

Details

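Motivation: Training-based visual concept unlearning methods share a structural flaw (fine-tuning on a narrow forget set degrades general capabilities before unlearning begins), and no rigorous benchmark exists for evaluating training-free methods on visual tasks.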

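Result: Across 8 evaluation settings and 13 VLM configurations, realistic unlearning prompts leave forget accuracy near the no-instruction baseline; meaningful reductions appear only under oracle conditions that disclose the target concept to the model. Object and scene concepts are the most resistant to suppression, and stronger instruction-tuned models remain capable despite explicit forget instructions.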

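Insight: The contribution is the first systematic benchmark for training-free visual concept unlearning, with a three-level probe taxonomy and five evaluation conditions designed to separate genuine forgetting from instruction compliance. The analysis shows that current prompt-based, training-free unlearning has limited effect on true visual concept erasure, which points to directions for future work.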

Abstract: VLMs trained on web-scale data retain sensitive and copyrighted visual concepts that deployment may require removing. Training-based unlearning methods share a structural flaw: fine-tuning on a narrow forget set degrades general capabilities before unlearning begins, making it impossible to attribute subsequent performance drops to the unlearning procedure itself. Training-free approaches sidestep this by suppressing concepts through prompts or system instructions, but no rigorous benchmark exists for evaluating them on visual tasks. We introduce VLM-UnBench, the first benchmark for training-free visual concept unlearning in VLMs. It covers four forgetting levels, 7 source datasets, and 11 concept axes, and pairs a three-level probe taxonomy with five evaluation conditions to separate genuine forgetting from instruction compliance. Across 8 evaluation settings and 13 VLM configurations, realistic unlearning prompts leave forget accuracy near the no-instruction baseline; meaningful reductions appear only under oracle conditions that disclose the target concept to the model. Object and scene concepts are the most resistant to suppression, and stronger instruction-tuned models remain capable despite explicit forget instructions. These results expose a clear gap between prompt-level suppression and true visual concept erasure.

[80] Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models cs.CVPDF

Chengyin Hu, Yuxian Dong, Yikun Guo, Xiang Chen, Junqi Wu

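TL;DR: This paper proposes Universal Curved-Grid Patch (UCGP), a physical adversarial patch framework that targets the semantic vulnerabilities of infrared vision-language models (IR-VLMs). Curved-Grid Mesh parameterization produces continuous, low-frequency, deployable patches, and a unified representation-driven objective disrupts cross-modal semantic alignment, weakening semantic understanding across diverse IR-VLM architectures.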

Details

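Motivation: Existing adversarial patch methods are designed mainly for RGB models in closed-set settings and do not fit the open-ended semantic understanding and physical deployment requirements of infrared VLMs, so a universal physical adversarial attack designed for infrared VLMs is needed.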

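Result: Extensive experiments show that UCGP consistently compromises semantic understanding across diverse IR-VLM architectures while maintaining cross-model transferability, cross-dataset generalization, real-world physical effectiveness, and robustness against defenses, revealing a previously overlooked robustness vulnerability in current infrared multimodal systems.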

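Insight: The contributions are Curved-Grid Mesh parameterization for deployable patch generation; a unified representation-driven objective that disrupts the visual representation space directly rather than manipulating labels or prompts; and Meta Differential Evolution with EOT-augmented TPS deformation modeling to improve real-world robustness. Together they offer a new perspective on evaluating the security of infrared multimodal systems.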

Abstract: Infrared vision-language models (IR-VLMs) have emerged as a promising paradigm for multimodal perception in low-visibility environments, yet their robustness to adversarial attacks remains largely unexplored. Existing adversarial patch methods are mainly designed for RGB-based models in closed-set settings and are not readily applicable to the open-ended semantic understanding and physical deployment requirements of infrared VLMs. To bridge this gap, we propose Universal Curved-Grid Patch (UCGP), a universal physical adversarial patch framework for IR-VLMs. UCGP integrates Curved-Grid Mesh (CGM) parameterization for continuous, low-frequency, and deployable patch generation with a unified representation-driven objective that promotes subspace departure, topology disruption, and stealth. To improve robustness under real-world deployment and domain shift, we further incorporate Meta Differential Evolution and EOT-augmented TPS deformation modeling. Rather than manipulating labels or prompts, UCGP directly disrupts the visual representation space, weakening cross-modal semantic alignment. Extensive experiments demonstrate that UCGP consistently compromises semantic understanding across diverse IR-VLM architectures while maintaining cross-model transferability, cross-dataset generalization, real-world physical effectiveness, and robustness against defenses. These findings reveal a previously overlooked robustness vulnerability in current infrared multimodal systems.

[81] Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation cs.CV | eess.IV

Xingtong Ge, Yi Zhang, Yushi Huang, Dailan He, Xiahong Wang

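TL;DR: This paper proposes Salt, a distillation method for fast video generation models that aims to cut inference cost to very low budgets (e.g., 2-4 NFEs). It combines Self-Consistent Distribution Matching Distillation (SC-DMD) with cache-aware training to address two problems: the conservativeness of existing trajectory-style consistency distillation, which causes over-smoothing and weak motion, and the insufficient local training signals of distribution matching distillation (DMD), which make generation prone to drift.

Details

Motivation: The goal is to distill video generation models to extremely low inference budgets (e.g., 2-4 NFEs) for real-time deployment. Existing methods struggle: trajectory-style consistency distillation becomes conservative under complex video dynamics, producing over-smoothed appearance and weak motion, while DMD recovers sharp, mode-seeking samples but its local training signals do not explicitly regularize how denoising updates compose across timesteps, so composed rollouts tend to drift.

Result: Extensive experiments on non-autoregressive backbones (e.g., Wan 2.1) and autoregressive real-time paradigms (e.g., Self Forcing) show that Salt consistently improves low-NFE video generation quality while remaining compatible with diverse KV-cache memory mechanisms.

Insight: The contributions are: 1) Self-Consistent Distribution Matching Distillation (SC-DMD), which explicitly regularizes the endpoint-consistent composition of consecutive denoising updates to counter drift; and 2) cache-aware training, which treats the KV cache as a quality-parameterized condition and introduces a cache-conditioned feature alignment objective that steers low-quality outputs toward high-quality references, suited to autoregressive real-time generation with multi-step rollouts.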

Abstract: Distilling video generation models to extremely low inference budgets (e.g., 2–4 NFEs) is crucial for real-time deployment, yet remains challenging. Trajectory-style consistency distillation often becomes conservative under complex video dynamics, yielding an over-smoothed appearance and weak motion. Distribution matching distillation (DMD) can recover sharp, mode-seeking samples, but its local training signals do not explicitly regularize how denoising updates compose across timesteps, making composed rollouts prone to drift. To overcome this challenge, we propose Self-Consistent Distribution Matching Distillation (SC-DMD), which explicitly regularizes the endpoint-consistent composition of consecutive denoising updates. For real-time autoregressive video generation, we further treat the KV cache as a quality-parameterized condition and propose Cache-Distribution-Aware training. This training scheme applies SC-DMD over multi-step rollouts and introduces a cache-conditioned feature alignment objective that steers low-quality outputs toward high-quality references. Across extensive experiments on both non-autoregressive backbones (e.g., Wan 2.1) and autoregressive real-time paradigms (e.g., Self Forcing), our method, dubbed Salt, consistently improves low-NFE video generation quality while remaining compatible with diverse KV-cache memory mechanisms. Source code will be released at https://github.com/XingtongGe/Salt.

[82] EffiMiniVLM: A Compact Dual-Encoder Regression Framework cs.CV

Yin-Loon Khor, Yi-Jie Wong, Yan Chai Hum

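TL;DR: This paper proposes EffiMiniVLM, a compact dual-encoder vision-language regression framework that predicts product quality from images and textual metadata in cold-start scenarios. The model pairs an EfficientNet-B0 image encoder with a MiniLM text encoder and uses a weighted Huber loss to improve training sample efficiency. Trained on only 20% of the Amazon Reviews 2023 dataset, it has 27.7M parameters and requires 6.8 GFLOPs, achieves a CES score of 0.40 on the benchmark, and performs comparably to much larger models at the lowest resource cost.

Details

Motivation: When product quality must be predicted from multimodal information in cold-start scenarios, existing vision-language models typically rely on large architectures and/or external datasets and incur high computational cost.

Result: On the Amazon Reviews 2023 benchmark, the model trained on only 20% of the data achieves a CES score of 0.40 at the lowest resource cost. It is approximately 4x to 8x more resource-efficient than the other top-5 methods and is the only approach that uses no external datasets. When the data is scaled to 40%, it overtakes methods that use larger models and datasets.

Insight: The contributions include the compact dual-encoder regression design and a weighted Huber loss that uses rating counts to emphasize more reliable samples and improve training efficiency. The analysis indicates that the model achieves strong scalability and competitiveness with very few parameters and little computation, offering an efficient option for resource-constrained settings.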

Abstract: Predicting product quality from multimodal item information is critical in cold-start scenarios, where user interaction history is unavailable and predictions must rely on images and textual metadata. However, existing vision-language models typically depend on large architectures and/or extensive external datasets, resulting in high computational cost. To address this, we propose EffiMiniVLM, a compact dual-encoder vision-language regression framework that integrates an EfficientNet-B0 image encoder and a MiniLM-based text encoder with a lightweight regression head. To improve training sample efficiency, we introduce a weighted Huber loss that leverages rating counts to emphasize more reliable samples, yielding consistent performance gains. Trained using only 20% of the Amazon Reviews 2023 dataset, the proposed model contains 27.7M parameters and requires 6.8 GFLOPs, yet achieves a CES score of 0.40 with the lowest resource cost in the benchmark. Despite its small size, it remains competitive with significantly larger models, achieving comparable performance while being approximately 4x to 8x more resource-efficient than other top-5 methods and being the only approach that does not use external datasets. Further analysis shows that scaling the data to 40% alone allows our model to overtake other methods, which use larger models and datasets, highlighting strong scalability despite the model’s compact design.
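As a rough illustration of the weighted Huber loss mentioned in the abstract, the sketch below scales each sample's Huber loss by a weight derived from its rating count. The abstract does not give the weighting formula, so the `log1p` weighting, the default `delta`, and the function names are assumptions made for illustration, not the authors' implementation.

```python
import math


def huber(residual: float, delta: float = 1.0) -> float:
    """Standard Huber loss for a single residual."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * magnitude * magnitude      # quadratic near zero
    return delta * (magnitude - 0.5 * delta)    # linear in the tails


def weighted_huber(preds, targets, rating_counts, delta: float = 1.0) -> float:
    """Weighted mean of per-sample Huber losses.

    Assumption: the weight is log(1 + rating_count), so items with more
    ratings (more reliable targets) contribute more to the loss.
    """
    weights = [math.log1p(count) for count in rating_counts]
    total = sum(weights)
    if total == 0:
        raise ValueError("all sample weights are zero")
    weighted_sum = sum(
        w * huber(p - t, delta) for w, p, t in zip(weights, preds, targets)
    )
    return weighted_sum / total
```

With this weighting, a product with zero ratings contributes nothing to the loss, which is one way to read "emphasize more reliable samples"; a real implementation might add a floor so that such samples are down-weighted rather than dropped.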

[83] VOSR: A Vision-Only Generative Model for Image Super-Resolution cs.CV

Rongyuan Wu, Lingchen Sun, Zhengqiang Zhang, Xiangtao Kong, Jixin Zhao

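TL;DR: This paper proposes VOSR, a generative image super-resolution framework trained only on visual data. It uses a pretrained vision encoder to extract semantic features from the LR input as visual semantic guidance and designs a restoration-oriented guidance strategy that replaces the unconditional branch of standard classifier-free guidance, achieving high-quality generative super-resolution without multimodal pretraining.

Details

Motivation: Most current generative image super-resolution methods rely on large text-to-image (T2I) diffusion models pretrained on web-scale text-image data, yet super-resolution is fundamentally an image restoration task conditioned on a low-resolution input. The paper therefore investigates whether an SR model trained purely on visual data can rival T2I-based methods.

Result: On synthetic and real-world benchmarks, VOSR achieves competitive or even better perceptual quality and efficiency while producing more faithful structures with fewer hallucinations. Its training cost is less than one-tenth that of representative T2I-based SR methods.

Insight: The novelty lies in replacing multimodal pretraining with vision-only semantic guidance and in a restoration-oriented guidance strategy that preserves weak LR anchors. The work shows for the first time that high-quality generative SR can be achieved without multimodal pretraining, and it obtains efficient one-step inference through distillation.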

Abstract: Most recent generative image super-resolution (SR) methods rely on adapting large text-to-image (T2I) diffusion models pretrained on web-scale text-image data. While effective, this paradigm starts from a generic T2I generator, even though SR is fundamentally a low-resolution (LR) input-conditioned image restoration task. In this work, we investigate whether an SR model trained purely on visual data can rival T2I-based ones. To this end, we propose VOSR, a Vision-Only generative framework for SR. We first extract semantically rich and spatially grounded features from the LR input using a pretrained vision encoder as visual semantic guidance. We then revisit classifier-free guidance for training generative models and show that the standard unconditional branch is ill-suited to restoration models trained from scratch. We therefore replace it with a restoration-oriented guidance strategy that preserves weak LR anchors. Built upon these designs, we first train a multi-step VOSR model from scratch and then distill it into a one-step model for efficient inference. VOSR requires less than one-tenth of the training cost of representative T2I-based SR methods, yet in both multi-step and one-step settings, it achieves competitive or even better perceptual quality and efficiency, while producing more faithful structures with fewer hallucinations on both synthetic and real-world benchmarks. Our results, for the first time, show that high-quality generative SR can be achieved without multimodal pretraining. The code and models can be found at https://github.com/cswry/VOSR.

[84] CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning cs.CV

Ankan Deria, Komal Kumar, Xilin He, Imran Razzak, Hisham Cholakkal

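TL;DR: CoME-VL is a complementary multi-encoder vision-language learning framework that improves vision-language models by fusing complementary visual representations from a contrastively trained vision encoder (e.g., CLIP) and a self-supervised vision encoder (e.g., DINO). It uses entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy and RoPE-enhanced cross-attention to align heterogeneous token grids, producing compact fused visual tokens that integrate easily into decoder-only LLMs.

Details

Motivation: Existing vision-language models typically rely on a single contrastively trained vision encoder. Such encoders are effective for cross-modal alignment and retrieval, but self-supervised vision encoders capture richer dense semantics and are more robust on recognition and understanding tasks. The paper explores how to scale the fusion of these two complementary visual representations to strengthen vision-language modeling.

Result: Across multiple vision-language benchmarks, CoME-VL consistently outperforms single-encoder baselines, with average improvements of 4.9% on visual understanding tasks and 5.4% on grounding tasks, and it achieves state-of-the-art performance on RefCOCO for detection.

Insight: The novelty is a modular fusion framework that reduces feature redundancy through entropy-guided multi-layer aggregation with orthogonality-constrained projections and aligns heterogeneous tokens with RoPE-enhanced cross-attention, combining the complementary signals of contrastive and self-supervised learning to improve vision-language task performance.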

Abstract: Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual encoders often capture richer dense semantics and exhibit stronger robustness on recognition and understanding tasks. In this work, we investigate how to scale the fusion of these complementary visual representations for vision-language modeling. We propose CoME-VL: Complementary Multi-Encoder Vision-Language, a modular fusion framework that integrates a contrastively trained vision encoder with a self-supervised DINO encoder. Our approach performs representation-level fusion by (i) entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy, and (ii) RoPE-enhanced cross-attention to align heterogeneous token grids and produce compact fused visual tokens. The fused tokens can be injected into a decoder-only LLM with minimal changes to standard VLM pipelines. Extensive experiments across diverse vision-language benchmarks demonstrate that CoME-VL consistently outperforms single-encoder baselines. In particular, we observe an average improvement of 4.9% on visual understanding tasks and 5.4% on grounding tasks. Our method achieves state-of-the-art performance on RefCOCO for detection while improving over the baseline by a large margin. Finally, we conduct ablation studies on layer merging, non-redundant feature mixing, and fusion capacity to evaluate how complementary contrastive and self-supervised signals affect VLM performance.

cs.NI [Back]

[85] Evaluating Small Language Models for Front-Door Routing: A Harmonized Benchmark and Synthetic-Traffic Experiment cs.NI | cs.CL

Warren Johnson, Charles Lee

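TL;DR: This paper proposes using small language models (SLMs) for inference-time model selection (routing) in place of costly, high-latency LLM classifiers. Two studies evaluate several SLMs under harmonized hardware and quantization settings and find that Qwen-2.5-3B offers the best latency-accuracy tradeoff, but no model meets the standalone viability criterion for production readiness.

Details

Motivation: Existing routing methods rely on LLM classifiers that are costly and high-latency, and they reduce a multi-objective optimization to single-dimensional quality prediction. The paper examines whether SLMs with 1-4B parameters have enough reasoning capability for sub-second, self-hosted task classification at near-zero marginal cost, which would make the routing decision a negligible part of the inference budget.

Result: In the harmonized benchmark, Qwen-2.5-3B achieves the best exact-match accuracy (0.783) and the strongest latency-accuracy tradeoff, and it is the only model with nonzero accuracy on all six task families. In the randomized synthetic-traffic experiment, DeepSeek-V3 attains the highest accuracy (0.830) but fails the pre-registered P95 latency gate (2,295 ms measured against a 2,000 ms limit); Qwen-2.5-3B is Pareto-dominant among self-hosted models (0.793 accuracy, 988 ms median latency, $0 marginal cost). No model meets the standalone viability criterion (accuracy >= 0.85, P95 latency <= 2,000 ms).

Insight: The paper argues that SLMs already meet the cost and latency prerequisites for routing and can perform task classification in a self-hosted, low-marginal-cost way. The harmonized benchmark and pre-registered experiment give an empirical comparison of SLM routing performance and identify two open issues on the path to production viability: an accuracy gap of 6-8 percentage points, and whether correct classification translates into downstream output quality.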

Abstract: Selecting the appropriate model at inference time – the routing problem – requires jointly optimizing output quality, cost, latency, and governance constraints. Existing approaches delegate this decision to LLM-based classifiers or preference-trained routers that are themselves costly and high-latency, reducing a multi-objective optimization to single-dimensional quality prediction. We argue that small language models (SLMs, 1-4B parameters) have now achieved sufficient reasoning capability for sub-second, zero-marginal-cost, self-hosted task classification, potentially making the routing decision negligible in the inference budget. We test this thesis on a six-label taxonomy through two studies. Study 1 is a harmonized offline benchmark of Phi-3.5-mini, Qwen2.5-1.5B, and Qwen-2.5-3B on identical Azure T4 hardware, serving stack, quantization, and a fixed 60-case corpus. Qwen-2.5-3B achieves the best exact-match accuracy (0.783), the strongest latency-accuracy tradeoff, and the only nonzero accuracy on all six task families. Study 2 is a pre-registered four-arm randomized experiment under synthetic traffic with an effective sample size of 60 unique cases per arm, comparing Phi-4-mini, Qwen-2.5-3B, and DeepSeek-V3 against a no-routing control. DeepSeek-V3 attains the highest accuracy (0.830) but fails the pre-registered P95 latency gate (2,295 ms); Qwen-2.5-3B is Pareto-dominant among self-hosted models (0.793 accuracy, 988 ms median, $0 marginal cost). No model meets the standalone viability criterion (>=0.85 accuracy, <=2,000 ms P95). The cost and latency prerequisites for SLM-based routing are met; the accuracy gap of 6-8 percentage points and the untested question of whether correct classification translates to downstream output quality bound the remaining distance to production viability.
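The standalone viability criterion in the abstract (accuracy >= 0.85 and P95 latency <= 2,000 ms) can be written as a simple gate check. The sketch below is illustrative only: the class and function names are invented here, the DeepSeek-V3 figures come from the abstract, and the second candidate is a hypothetical router added to show a passing case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouterResult:
    name: str
    accuracy: float         # exact-match accuracy in [0, 1]
    p95_latency_ms: float   # 95th-percentile latency in milliseconds


def failed_gates(result: RouterResult,
                 min_accuracy: float = 0.85,
                 max_p95_ms: float = 2000.0) -> list:
    """Return the names of the gates a router fails (empty list means viable)."""
    failures = []
    if result.accuracy < min_accuracy:
        failures.append("accuracy")
    if result.p95_latency_ms > max_p95_ms:
        failures.append("p95_latency")
    return failures


def is_viable(result: RouterResult) -> bool:
    """A router is viable only if it passes every gate."""
    return not failed_gates(result)


# Figures reported in the abstract.
deepseek_v3 = RouterResult("DeepSeek-V3", accuracy=0.830, p95_latency_ms=2295.0)
# Hypothetical candidate, not from the paper.
hypothetical = RouterResult("hypothetical-router", accuracy=0.86, p95_latency_ms=1500.0)

for candidate in (deepseek_v3, hypothetical):
    print(candidate.name, is_viable(candidate), failed_gates(candidate))
```

Reporting which gate failed, rather than a single boolean, matches how the abstract separates the latency failure from the remaining accuracy gap.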

cs.RO [Back]

[86] Open-Loop Planning, Closed-Loop Verification: Speculative Verification for VLA cs.RO | cs.CL

Zihua Wang, Zhitao Lin, Ruibo Li, Yu Zhang, Xu Yang

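TL;DR: This paper proposes SV-VLA, a framework that improves the efficiency and robustness of vision-language-action (VLA) models in embodied control. It combines efficient open-loop long-horizon planning with lightweight closed-loop online verification to cope with dynamic environmental changes.

Details

Motivation: VLA models perform well on manipulation tasks but are expensive at inference time. Existing methods improve efficiency with action chunking for open-loop execution, but without closed-loop feedback they are sensitive to environmental changes and prone to error accumulation. The paper addresses this lack of robustness in open-loop execution.

Result: Experiments show that SV-VLA combines the efficiency of chunked prediction with the robustness of closed-loop control, enabling efficient and reliable VLA-based control in dynamic environments.

Insight: The novelty is a speculative verification mechanism: a heavy VLA model acts as a low-frequency macro-planner that generates an action chunk and a planning context, while a lightweight verifier continuously monitors execution against the latest observations and triggers replanning only when necessary, improving adaptability while preserving efficiency.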

Abstract: Vision-Language-Action (VLA) models, as large foundation models for embodied control, have shown strong performance in manipulation tasks. However, their performance comes at high inference cost. To improve efficiency, recent methods adopt action chunking, which predicts a sequence of future actions for open-loop execution. Although effective for reducing computation, open-loop execution is sensitive to environmental changes and prone to error accumulation due to the lack of closed-loop feedback. To address this limitation, we propose Speculative Verification for VLA Control (SV-VLA), a framework that combines efficient open-loop long-horizon planning with lightweight closed-loop online verification. Specifically, SV-VLA uses a heavy VLA as a low-frequency macro-planner to generate an action chunk together with a planning context, while a lightweight verifier continuously monitors execution based on the latest observations. Conditioned on both the current observation and the planning context, the verifier compares the planned action against a closed-loop reference action and triggers replanning only when necessary. Experiments demonstrate that SV-VLA combines the efficiency of chunked prediction with the robustness of closed-loop control, enabling efficient and reliable VLA-based control in dynamic environments. Code is available at https://github.com/edsad122/SV-VLA.
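The plan-then-verify control flow described in the abstract can be sketched as a loop that executes a planned action chunk and stops as soon as the verifier's reference action deviates too far from the plan. Everything named below (`execute_chunk`, `reference_action`, the max-absolute-difference metric, the threshold) is a hypothetical stand-in; the abstract does not specify how the verifier measures disagreement.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Action = Sequence[float]


def deviation(planned: Action, reference: Action) -> float:
    """Largest per-dimension gap between two actions (assumed metric)."""
    return max(abs(p - r) for p, r in zip(planned, reference))


def execute_chunk(
    chunk: List[Action],
    observations: List[object],
    reference_action: Callable[[object, object], Action],
    context: object,
    threshold: float,
) -> Tuple[List[Action], Optional[int]]:
    """Run a planned chunk open-loop, verifying each step closed-loop.

    Returns the actions actually executed and the step index at which
    replanning was triggered (None if the whole chunk was accepted).
    """
    executed: List[Action] = []
    for step, (planned, obs) in enumerate(zip(chunk, observations)):
        # Lightweight verifier: reference action from the latest observation
        # and the planning context produced by the heavy planner.
        reference = reference_action(obs, context)
        if deviation(planned, reference) > threshold:
            return executed, step  # hand control back to the heavy planner
        executed.append(planned)
    return executed, None


# Toy example: the environment changes at step 2, so the reference diverges.
chunk = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0]]
observations = [0, 1, 2]


def toy_reference(obs, context):
    return [0.1 * obs, 0.0] if obs < 2 else [0.2, 0.5]


executed, replan_at = execute_chunk(chunk, observations, toy_reference, None, 0.2)
print(len(executed), replan_at)
```

The heavy planner is only called again when `replan_at` is not `None`, which is where the efficiency gain over step-by-step closed-loop inference would come from.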

[87] V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views cs.RO | cs.AI | cs.CV

Junwei You, Pei Li, Zhuoyu Jiang, Weizhe Tang, Zilin Huang

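TL;DR: This paper introduces V2X-QA, a real-world dataset and benchmark for evaluating the reasoning of multimodal large language models in autonomous driving across vehicle-side, infrastructure-side, and cooperative viewpoints. The benchmark uses a view-decoupled evaluation protocol within a unified multiple-choice question answering framework, supporting systematic evaluation under vehicle-only, infrastructure-only, and cooperative driving conditions.

Details

Motivation: Existing autonomous driving benchmarks are largely ego-centric and cannot systematically assess model performance in infrastructure-centric and cooperative driving conditions. The paper fills this gap and provides a comprehensive evaluation foundation for multi-view reasoning.

Result: Benchmark results on ten representative state-of-the-art proprietary and open-source models show that viewpoint accessibility substantially affects performance, that infrastructure-side reasoning supports meaningful macroscopic traffic understanding, and that cooperative reasoning remains challenging because it requires cross-view alignment and evidence integration. The proposed benchmark-aligned baseline, V2X-MoE, performs strongly.

Insight: The novelty lies in building the first comprehensive autonomous driving reasoning dataset and benchmark covering vehicle-side, infrastructure-side, and cooperative viewpoints, together with a view-decoupled evaluation protocol. The V2X-MoE baseline, with explicit view routing and viewpoint-specific LoRA experts, points to a promising direction for multi-view reasoning and underscores the importance of explicit viewpoint specialization.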

Abstract: Multimodal large language models (MLLMs) have shown strong potential for autonomous driving, yet existing benchmarks remain largely ego-centric and therefore cannot systematically assess model performance in infrastructure-centric and cooperative driving conditions. In this work, we introduce V2X-QA, a real-world dataset and benchmark for evaluating MLLMs across vehicle-side, infrastructure-side, and cooperative viewpoints. V2X-QA is built around a view-decoupled evaluation protocol that enables controlled comparison under vehicle-only, infrastructure-only, and cooperative driving conditions within a unified multiple-choice question answering (MCQA) framework. The benchmark is organized into a twelve-task taxonomy spanning perception, prediction, and reasoning and planning, and is constructed through expert-verified MCQA annotation to enable fine-grained diagnosis of viewpoint-dependent capabilities. Benchmark results across ten representative state-of-the-art proprietary and open-source models show that viewpoint accessibility substantially affects performance, and infrastructure-side reasoning supports meaningful macroscopic traffic understanding. Results also indicate that cooperative reasoning remains challenging since it requires cross-view alignment and evidence integration rather than simply additional visual input. To address these challenges, we introduce V2X-MoE, a benchmark-aligned baseline with explicit view routing and viewpoint-specific LoRA experts. The strong performance of V2X-MoE further suggests that explicit viewpoint specialization is a promising direction for multi-view reasoning in autonomous driving. Overall, V2X-QA provides a foundation for studying multi-perspective reasoning, reliability, and cooperative physical intelligence in connected autonomous driving. The dataset and V2X-MoE resources are publicly available at: https://github.com/junwei0001/V2X-QA.

[88] ARM: Advantage Reward Modeling for Long-Horizon Manipulation cs.RO | cs.AI | cs.CV

Yiming Mao, Zixi Yu, Weixin Mao, Yinhao Li, Qirui Hu

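TL;DR: This paper proposes Advantage Reward Modeling (ARM), a framework that addresses the credit assignment difficulty caused by sparse rewards in long-horizon robotic manipulation. It uses a low-cost tri-state labeling strategy (Progressive, Regressive, Stagnant) to estimate relative advantage rather than hard-to-quantify absolute progress, providing automated progress annotation and adaptive action-reward reweighting for offline reinforcement learning.

Details

Motivation: In long-horizon robotic manipulation, sparse rewards give little guidance for credit assignment, while dense progress rewards are costly to obtain and ill-suited to non-monotonic behaviors such as backtracking and recovery. A more practical and cost-effective form of intermediate supervision is therefore needed.

Result: On a challenging long-horizon towel-folding task, the method achieves a 99.4% success rate and shows better stability and data efficiency than current vision-language-action (VLA) baselines, with near-zero human intervention during policy training.

Insight: The novelty is the shift from estimating absolute progress to estimating relative advantage, together with a low-cost, highly consistent tri-state human labeling strategy. The resulting model automates annotation of both complete demonstrations and fragmented data and, once integrated into an offline RL pipeline, adaptively filters suboptimal samples.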

Abstract: Long-horizon robotic manipulation remains challenging for reinforcement learning (RL) because sparse rewards provide limited guidance for credit assignment. Practical policy improvement thus relies on richer intermediate supervision, such as dense progress rewards, which are costly to obtain and ill-suited to non-monotonic behaviors such as backtracking and recovery. To address this, we propose Advantage Reward Modeling (ARM), a framework that shifts from hard-to-quantify absolute progress to estimating relative advantage. We introduce a cost-effective tri-state labeling strategy – Progressive, Regressive, and Stagnant – that reduces human cognitive overhead while ensuring high cross-annotator consistency. By training on these intuitive signals, ARM enables automated progress annotation for both complete demonstrations and fragmented DAgger-style data. Integrating ARM into an offline RL pipeline allows for adaptive action-reward reweighting, effectively filtering suboptimal samples. Our approach achieves a 99.4% success rate on a challenging long-horizon towel-folding task, demonstrating improved stability and data efficiency over current VLA baselines with near-zero human intervention during policy training.
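To make the idea of tri-state labels driving sample reweighting concrete, the sketch below maps the three labels from the abstract to signed advantage values and turns them into normalized sample weights. ARM itself trains a reward model on these labels; the exponential weighting used here is borrowed from advantage-weighted regression as an assumption, and the names `Label`, `sample_weights`, and `beta` are illustrative only.

```python
import math
from enum import Enum


class Label(Enum):
    """Tri-state progress labels named in the abstract, mapped to signed advantages."""
    PROGRESSIVE = 1
    STAGNANT = 0
    REGRESSIVE = -1


def sample_weights(labels, beta: float = 1.0):
    """Normalized weights proportional to exp(beta * advantage).

    Assumption: larger beta suppresses regressive samples more strongly,
    which acts as a soft filter on suboptimal data.
    """
    raw = [math.exp(beta * label.value) for label in labels]
    total = sum(raw)
    return [w / total for w in raw]


weights = sample_weights([Label.PROGRESSIVE, Label.STAGNANT, Label.REGRESSIVE])
print(weights)
```

A hard filter (dropping regressive samples entirely) would be the limiting case of this soft weighting as `beta` grows.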

[89] Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model cs.RO | cs.CV

Peiyan Li, Yixiang Chen, Yuan Xu, Jiabing Yang, Xiangnan Wu

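TL;DR: This paper proposes MV-VDP, a multi-view video diffusion policy for robotic manipulation. By jointly modeling the 3D spatio-temporal state of the environment and simultaneously predicting multi-view heatmap videos and RGB videos, it achieves strong manipulation performance in terms of data efficiency, robustness, generalization, and interpretability.

Details

Motivation: Most existing robot policies rely on 2D visual observations and backbones pretrained on static image-text pairs, ignoring the 3D spatial structure of the environment and its temporal evolution. This leads to high data requirements and limited understanding of environment dynamics, which MV-VDP aims to address.

Result: Experiments in the Meta-World simulation environment and on real-world robotic platforms show that MV-VDP performs complex real-world tasks with only ten demonstration trajectories and no additional pretraining, and that it does well on robustness, generalization, and future video prediction. It consistently outperforms video-prediction-based, 3D-based, and vision-language-action models, establishing a new state of the art in data-efficient multi-task manipulation.

Insight: The core idea is to align the representation format of video pretraining with action finetuning: by simultaneously predicting multi-view heatmap videos and RGB videos, the policy specifies not only what actions the robot should take but also how the environment is expected to evolve in response. This yields a data-efficient, spatio-temporally aware policy framework.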

Abstract: Robotic manipulation requires understanding both the 3D spatial structure of the environment and its temporal evolution, yet most existing policies overlook one or both. They typically rely on 2D visual observations and backbones pretrained on static image–text pairs, resulting in high data requirements and limited understanding of environment dynamics. To address this, we introduce MV-VDP, a multi-view video diffusion policy that jointly models the 3D spatio-temporal state of the environment. The core idea is to simultaneously predict multi-view heatmap videos and RGB videos, which 1) align the representation format of video pretraining with action finetuning, and 2) specify not only what actions the robot should take, but also how the environment is expected to evolve in response to those actions. Extensive experiments show that MV-VDP enables data-efficient, robust, generalizable, and interpretable manipulation. With only ten demonstration trajectories and without additional pretraining, MV-VDP successfully performs complex real-world tasks, demonstrates strong robustness across a range of model hyperparameters, generalizes to out-of-distribution settings, and predicts realistic future videos. Experiments on Meta-World and real-world robotic platforms demonstrate that MV-VDP consistently outperforms video-prediction–based, 3D-based, and vision–language–action models, establishing a new state of the art in data-efficient multi-task manipulation.

[90] The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling cs.RO | cs.CV | cs.LG

Takuya Shiba

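TL;DR: This paper studies how discretized action representations limit the scaling of vision-language-action (VLA) models. The author proposes an information-theoretic principle called the Compression Gap: in a visuomotor pipeline, the location of the tightest information bottleneck governs scaling behavior. When actions are continuous (e.g., Diffusion Policy), the vision encoder is the bottleneck and upgrading it improves performance; when actions are discretized through a fixed-capacity codebook (e.g., OAT), the codebook becomes the bottleneck, upstream encoder improvements cannot propagate downstream, and scaling stops paying off.

Details

Motivation: The paper asks why upgrading the vision encoder, which improves downstream performance in vision-language models, fails to deliver the expected gains in downstream manipulation for VLA models, particularly when actions are represented as discrete tokens.

Result: Three experiments on the LIBERO benchmark support the principle: 1) a factorial experiment shows that encoder upgrades improve Diffusion Policy by over 21 percentage points, while OAT gains are substantially attenuated across model scales; 2) an encoder quality gradient across four encoders confirms that Diffusion Policy performance rises monotonically with encoder quality while OAT stays flat; 3) a codebook size experiment shows that relaxing codebook capacity partially recovers encoder sensitivity, providing causal evidence for the bottleneck hypothesis.

Insight: The contribution is the Compression Gap principle, which indicates that in Physical AI, locating the information bottleneck in the pipeline matters more than uniformly increasing model or data size. For future VLA design, this means avoiding the fixed-capacity bottleneck introduced by discretized action representations so that improvements in upstream visual representations can be exploited.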

Abstract: Scaling Vision-Language-Action (VLA) models by upgrading the vision encoder is expected to improve downstream manipulation performance, as it does in vision-language modeling. We show that this expectation fails when actions are represented as discrete tokens, and explain why through an information-theoretic principle we call the Compression Gap: in any visuomotor pipeline, scaling behavior is governed by the location of the tightest information bottleneck. When actions are continuous (e.g., Diffusion Policy), the vision encoder is the binding constraint, and upgrading it directly improves performance. When actions are discretized through a fixed-capacity codebook (e.g., OAT), the codebook becomes the binding constraint, and encoder improvements cannot propagate past it, regardless of how rich the upstream representation is. We validate this principle on the LIBERO benchmark with three lines of evidence: a factorial experiment showing that encoder upgrades improve Diffusion Policy by over 21 percentage points while OAT gains are substantially attenuated across model scales; an encoder quality gradient across four encoders confirming that Diffusion Policy tracks encoder quality monotonically while OAT remains flat; and a codebook size experiment demonstrating that relaxing codebook capacity partially recovers encoder sensitivity, providing causal evidence for the bottleneck hypothesis. Our findings reveal that scaling in Physical AI requires identifying where information bottlenecks lie in the pipeline, rather than uniformly increasing model or data size.
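The bottleneck argument can be illustrated with a back-of-the-envelope capacity calculation: a token drawn from a codebook of size K carries at most log2(K) bits, and the information reaching the action output is bounded by the tightest stage in the pipeline. The sketch below is not the paper's analysis; the stage names and the bit budgets assigned to the encoders are hypothetical numbers chosen only to show the effect.

```python
import math


def codebook_capacity_bits(codebook_size: int, tokens_per_action: int = 1) -> float:
    """Upper bound on bits carried by discrete action tokens."""
    return tokens_per_action * math.log2(codebook_size)


def pipeline_capacity_bits(stage_bits: dict) -> float:
    """Information reaching the output is bounded by the tightest stage."""
    return min(stage_bits.values())


def binding_stage(stage_bits: dict) -> str:
    """Name of the stage that sets the bound."""
    return min(stage_bits, key=stage_bits.get)


# Hypothetical bit budgets for two encoders feeding the same 256-entry codebook.
small_encoder = {"encoder": 512.0, "codebook": codebook_capacity_bits(256)}
large_encoder = {"encoder": 2048.0, "codebook": codebook_capacity_bits(256)}

print(binding_stage(small_encoder), pipeline_capacity_bits(small_encoder))
print(binding_stage(large_encoder), pipeline_capacity_bits(large_encoder))
```

In this toy setup the bound is identical for both encoders because the codebook is the binding stage; enlarging the codebook or emitting more tokens per action is what raises it, which is consistent with the codebook-size experiment described in the abstract.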

cs.DL [Back]

[91] BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation cs.DL | cs.CL

Delip Rao, Chris Callison-Burch

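TL;DR: This paper evaluates hallucinations in BibTeX citations generated by web-search-enabled large language models in scientific publishing agents, using a benchmark of 931 papers spanning multiple domains and citation tiers. Even with search, frontier models (GPT-5, Claude Sonnet-4.6, Gemini-3 Flash) reach an overall accuracy of 83.6%, but only 50.9% of entries are fully correct, and accuracy drops markedly on recent papers, indicating heavy reliance on parametric memory. The authors evaluate clibib, a deterministic retrieval tool built on the Zotero Translation Server and CrossRef, as a mitigation mechanism. A two-stage integration (baseline generation followed by revision against authoritative records) raises accuracy to 91.5% and fully correct entries to 78.3%, with a regression rate of only 0.8%.

Details

Motivation: Web-search-enabled LLMs used in scientific publishing agents still produce BibTeX entries with pervasive field-level errors (citation hallucinations), and prior evaluations mostly tested base models without search, which does not reflect current practice.

Result: On the benchmark of 931 papers (four scientific domains and three citation tiers), the three search-enabled frontier models reach an overall BibTeX accuracy of 83.6% but a fully correct rate of only 50.9%; accuracy drops 27.7 percentage points from popular to recent papers. With two-stage revision using clibib, overall accuracy rises 8.0 percentage points to 91.5%, the fully correct rate rises from 50.9% to 78.3%, and the regression rate is only 0.8%.

Insight: The contributions are: 1) a version-aware ground-truth benchmark and error taxonomy for evaluating citation hallucinations, designed to disentangle parametric memory from search dependence; 2) evidence that models rely heavily on parametric memory even when search is available, causing accuracy to fall sharply on recent and low-citation papers; 3) a validated mitigation that integrates a deterministic retrieval tool (clibib) with LLM generation in a two-stage architecture (search first, then revise), which improves accuracy while substantially lowering regression risk, showing that integration architecture matters independently of model capability.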

Abstract: Large language models with web search are increasingly used in scientific publishing agents, yet they still produce BibTeX entries with pervasive field-level errors. Prior evaluations tested base models without search, which does not reflect current practice. We construct a benchmark of 931 papers across four scientific domains and three citation tiers – popular, low-citation, and recent post-cutoff – designed to disentangle parametric memory from search dependence, with version-aware ground truth accounting for multiple citable versions of the same paper. Three search-enabled frontier models (GPT-5, Claude Sonnet-4.6, Gemini-3 Flash) generate BibTeX entries scored on nine fields and a six-way error taxonomy, producing ~23,000 field-level observations. Overall accuracy is 83.6%, but only 50.9% of entries are fully correct; accuracy drops 27.7pp from popular to recent papers, revealing heavy reliance on parametric memory even when search is available. Field-error co-occurrence analysis identifies two failure modes: wholesale entry substitution (identity fields fail together) and isolated field error. We evaluate clibib, an open-source tool for deterministic BibTeX retrieval from the Zotero Translation Server with CrossRef fallback, as a mitigation mechanism. In a two-stage integration where baseline entries are revised against authoritative records, accuracy rises +8.0pp to 91.5%, fully correct entries rise from 50.9% to 78.3%, and regression rate is only 0.8%. An ablation comparing single-stage and two-stage integration shows that separating search from revision yields larger gains and lower regression (0.8% vs. 4.8%), demonstrating that integration architecture matters independently of model capability. We release the benchmark, error taxonomy, and clibib tool to support evaluation and mitigation of citation hallucinations in LLM-based scientific writing.
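The gap between 83.6% overall accuracy and 50.9% fully correct entries follows from how the two metrics aggregate: one counts matching fields, the other requires every field of an entry to match. The sketch below shows that distinction on two made-up entries. The normalization rule, field list, and sample records are assumptions for illustration; the paper's scoring uses nine fields and a six-way error taxonomy that are not reproduced here.

```python
def normalize(value: str) -> str:
    """Case-insensitive, whitespace-insensitive comparison key (assumed rule)."""
    return " ".join(value.lower().split())


def score_entries(generated, reference, fields):
    """Return (field-level accuracy, fraction of fully correct entries)."""
    field_hits = 0
    fully_correct = 0
    for gen, ref in zip(generated, reference):
        matches = [
            normalize(gen.get(f, "")) == normalize(ref.get(f, "")) for f in fields
        ]
        field_hits += sum(matches)
        fully_correct += all(matches)
    n = len(reference)
    return field_hits / (n * len(fields)), fully_correct / n


# Made-up records for illustration only.
fields = ["title", "year", "author"]
reference = [
    {"title": "Example Paper A", "year": "2024", "author": "Doe, Jane"},
    {"title": "Example Paper B", "year": "2025", "author": "Roe, Richard"},
]
generated = [
    {"title": "example paper a", "year": "2024", "author": "Doe, Jane"},
    {"title": "Example Paper B", "year": "2023", "author": "Roe, Richard"},
]

field_acc, entry_acc = score_entries(generated, reference, fields)
print(field_acc, entry_acc)
```

A single wrong year leaves five of six fields correct but only one of two entries fully correct, which is the same pattern the benchmark reports at scale.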

cs.CR [Back]

[92] An Independent Safety Evaluation of Kimi K2.5 cs.CR | cs.AI | cs.CL

Zheng-Xin Yong, Parv Mahajan, Andy Wang, Ida Caspary, Yernat Yestekov

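TL;DR: This paper presents an independent safety evaluation of the open-weight model Kimi K2.5, covering CBRNE (chemical, biological, radiological, nuclear, and explosive) misuse risk, cybersecurity risk, misalignment, political censorship, bias, and harmlessness. The evaluation finds that Kimi K2.5 has dual-use capabilities comparable to GPT 5.2 and Claude Opus 4.5 but refuses CBRNE-related requests significantly less often, which may uplift malicious actors in weapon creation.

Details

Motivation: Kimi K2.5 is an open-weight LLM that rivals closed models on coding, multimodal, and agentic benchmarks, but it was released without an accompanying safety evaluation. The study conducts a preliminary safety assessment of this powerful open-weight model to identify and quantify the risks it may exacerbate.

Result: On cybersecurity tasks, Kimi K2.5 is competitive but does not appear to possess frontier-level autonomous cyberoffensive capabilities such as vulnerability discovery and exploitation. The model shows concerning levels of sabotage ability and self-replication propensity but does not appear to have long-term malicious goals. It also exhibits narrow censorship and political bias, especially in Chinese, and is more compliant with harmful requests related to spreading disinformation and copyright infringement. The model refuses to engage in user delusions and has generally low over-refusal rates.

Insight: The study highlights the safety risks present in frontier open-weight models and notes that the scale and accessibility of open-weight releases may amplify them. It underscores the need for systematic safety evaluations of open-weight models and offers an important caution and an evaluation framework for deploying them responsibly.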

Abstract: Kimi K2.5 is an open-weight LLM that rivals closed models across coding, multimodal, and agentic benchmarks, but was released without an accompanying safety evaluation. In this work, we conduct a preliminary safety assessment of Kimi K2.5 focusing on risks likely to be exacerbated by powerful open-weight models. Specifically, we evaluate the model for CBRNE misuse risk, cybersecurity risk, misalignment, political censorship, bias, and harmlessness, in both agentic and non-agentic settings. We find that Kimi K2.5 shows similar dual-use capabilities to GPT 5.2 and Claude Opus 4.5, but with significantly fewer refusals on CBRNE-related requests, suggesting it may uplift malicious actors in weapon creation. On cyber-related tasks, we find that Kimi K2.5 demonstrates competitive cybersecurity performance, but it does not appear to possess frontier-level autonomous cyberoffensive capabilities such as vulnerability discovery and exploitation. We further find that Kimi K2.5 shows concerning levels of sabotage ability and self-replication propensity, although it does not appear to have long-term malicious goals. In addition, Kimi K2.5 exhibits narrow censorship and political bias, especially in Chinese, and is more compliant with harmful requests related to spreading disinformation and copyright infringement. Finally, we find the model refuses to engage in user delusions and generally has low over-refusal rates. While preliminary, our findings highlight how safety risks exist in frontier open-weight models and may be amplified by the scale and accessibility of open-weight releases. Therefore, we strongly urge open-weight model developers to conduct and release more systematic safety evaluations required for responsible deployment.

cs.AR [Back]

[93] InCoder-32B-Thinking: Industrial Code World Model for Thinking cs.AR | cs.AI | cs.CL

Jian Yang, Wei Zhang, Jiajun Wu, Junhang Cheng, Tuney Zheng

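TL;DR: This paper proposes InCoder-32B-Thinking, a thinking model for industrial code that aims to generate expert-level reasoning traces for industrial software development such as chip design, GPU optimization, and embedded systems. The model is trained with the Error-driven Chain-of-Thought (ECoT) synthesis framework and an industrial code world model (ICWM), can emulate how engineers reason about hardware constraints and timing semantics, and achieves top-tier open-source results on general and industrial benchmarks.

Details

Motivation: Industrial software development (chip design, GPU optimization, embedded systems) lacks expert reasoning traces that show how engineers reason about hardware constraints and timing semantics, which limits automated tooling. The paper addresses this by generating verifiable reasoning chains that emulate expert thinking in industrial settings.

Result: On 14 general benchmarks (e.g., 81.3% on LiveCodeBench v5) and 9 industrial benchmarks (e.g., 84.0% on CAD-Coder and 38.0% on KernelBench), InCoder-32B-Thinking achieves top-tier open-source results, showing strong performance across domains.

Insight: The contributions are: 1) the Error-driven Chain-of-Thought (ECoT) synthesis framework, which synthesizes thinking content from multi-turn dialogue with environmental error feedback and explicitly models the error-correction process; 2) the industrial code world model (ICWM), trained on domain-specific execution traces (e.g., Verilog simulation, GPU profiling), which learns the causal dynamics of how code affects hardware behavior and supports self-verification by predicting execution outcomes; 3) validation of all synthesized reasoning traces through domain toolchains, so that the training data matches the natural reasoning depth distribution of industrial tasks. Together these combine simulated reasoning with hardware awareness for more interpretable and reliable industrial code generation.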

Abstract: Industrial software development across chip design, GPU optimization, and embedded systems lacks expert reasoning traces showing how engineers reason about hardware constraints and timing semantics. In this work, we propose InCoder-32B-Thinking, trained on data from the Error-driven Chain-of-Thought (ECoT) synthesis framework with an industrial code world model (ICWM) to generate reasoning traces. Specifically, ECoT generates reasoning chains by synthesizing the thinking content from multi-turn dialogue with environmental error feedback, explicitly modeling the error-correction process. ICWM is trained on domain-specific execution traces from Verilog simulation, GPU profiling, etc., learns the causal dynamics of how code affects hardware behavior, and enables self-verification by predicting execution outcomes before actual compilation. All synthesized reasoning traces are validated through domain toolchains, creating training data matching the natural reasoning depth distribution of industrial tasks. Evaluation on 14 general (81.3% on LiveCodeBench v5) and 9 industrial benchmarks (84.0% on CAD-Coder and 38.0% on KernelBench) shows InCoder-32B-Thinking achieves top-tier open-source results across all domains.

eess.IV [Back]

[94] Managing Diabetic Retinopathy with Deep Learning: A Data Centric Overview eess.IV | cs.AI | cs.CV

Shramana Dey, Zahir Khan, T. A. PramodKumar, B. Uma Shankar, Ashis K. Dhara

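TL;DR: This paper presents a comprehensive review and comparative analysis of fundus image datasets used in the management of diabetic retinopathy (DR). It evaluates their usability for key tasks, including binary classification, severity grading, lesion localization, and multi-disease screening, and identifies gaps in current datasets such as the lack of standardized annotations and longitudinal data.

Details

Motivation: Deep learning for automated DR detection and grading is limited in clinical reliability by the scarcity of high-quality datasets: existing repositories are geographically narrow, small, inconsistently annotated, or variable in image quality.

Result: Through a systematic review and a case study, the paper summarizes the characteristics, usability, and limitations of existing datasets and offers recommendations for future dataset development.

Insight: The contribution is a comprehensive data-centric overview that systematically categorizes and evaluates DR datasets and stresses the importance of standardized lesion-level annotations and longitudinal data collection for building clinically reliable, explainable AI solutions.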

Abstract: Diabetic Retinopathy (DR) is a serious microvascular complication of diabetes and one of the leading causes of vision loss worldwide. Although automated detection and grading with Deep Learning (DL) can reduce the burden on ophthalmologists, it is constrained by the limited availability of high-quality datasets. Existing repositories often remain geographically narrow, contain limited samples, and exhibit inconsistent annotations or variable image quality, thereby restricting their clinical reliability. This paper presents a comprehensive review and comparative analysis of fundus image datasets used in the management of DR. The study evaluates their usability across key tasks, including binary classification, severity grading, lesion localization, and multi-disease screening. It also categorizes the datasets by size, accessibility, and annotation type (such as image-level, lesion-level, and multi-disease). Finally, a recently published dataset is presented as a case study to illustrate broader challenges in dataset curation and usage. The review consolidates current knowledge while highlighting persistent gaps such as the lack of standardized lesion-level annotations and longitudinal data. It also outlines recommendations for future dataset development to support clinically reliable and explainable solutions in DR screening.

[95] ARIQA-3DS: A Stereoscopic Image Quality Assessment Dataset for Realistic Augmented Reality eess.IV | cs.CV | cs.MM

Aymen Sekhri, Seyed Ali Amirshahi, Mohamed-Chaker Larabi

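TL;DR: This paper introduces ARIQA-3DS, the first large stereoscopic augmented reality (AR) image quality assessment dataset. It contains 1,200 AR viewports that fuse high-resolution stereoscopic omnidirectional captures of real-world scenes with diverse augmented foregrounds, and it is accompanied by a comprehensive subjective study under controlled transparency and degradation conditions.

Details

Motivation: Existing datasets lack ecological validity and fail to capture the complex perceptual interplay between real and virtual layers (visual confusion), so a more realistic stereoscopic AR quality assessment dataset is needed.

Result: The subjective study shows that perceived quality is driven mainly by foreground degradations and modulated by transparency level, while oculomotor and disorientation symptoms increase progressively but manageably during viewing.

Insight: The contribution is the first large stereoscopic AR quality assessment dataset, together with evidence that foreground degradation and transparency are the key factors in AR quality of experience. The dataset provides a benchmark for next-generation AR quality assessment models.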

Abstract: As Augmented Reality (AR) technologies advance towards immersive consumer adoption, the need for rigorous Quality of Experience (QoE) assessment becomes critical. However, existing datasets often lack ecological validity, relying on monocular viewing or simplified backgrounds that fail to capture the complex perceptual interplay, termed visual confusion, between real and virtual layers. To address this gap, we present ARIQA-3DS, the first large stereoscopic AR Image Quality Assessment dataset. Comprising 1,200 AR viewports, the dataset fuses high-resolution stereoscopic omnidirectional captures of real-world scenes with diverse augmented foregrounds under controlled transparency and degradation conditions. We conducted a comprehensive subjective study with 36 participants using a video see-through head-mounted display, collecting both quality ratings and simulator-sickness indicators. Our analysis reveals that perceived quality is primarily driven by foreground degradations and modulated by transparency levels, while oculomotor and disorientation symptoms show a progressive but manageable increase during viewing. ARIQA-3DS will be publicly released to serve as a comprehensive benchmark for developing next-generation AR quality assessment models.

[96] HyperCT: Low-Rank Hypernet for Unified Chest CT Analysis eess.IV | cs.CVPDF

Fengbei Liu, Sunwoo Kwak, Hao Phung, Nusrat Binta Nizam, Ilan Richter

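TL;DR: This paper proposes HyperCT, a framework that dynamically adapts a Vision Transformer backbone via a hypernetwork and integrates Low-Rank Adaptation (LoRA) for parameter-efficient multi-task learning, enabling unified analysis of pulmonary and extra-pulmonary findings in chest CT.

Details

Motivation: In multi-task learning on non-contrast chest CT, conventional hard-parameter sharing struggles to model distinct pathologies. The goal is a unified, parameter-efficient solution.

Result: Validated on a large-scale dataset of radiological and cardiological tasks, HyperCT outperforms various strong baselines and delivers unified, efficient performance for holistic patient assessment.

Insight: The novelty lies in combining a hypernetwork with low-rank adaptation: the model dynamically generates task-specific low-rank weight updates rather than full parameters, which keeps computation efficient while improving multi-task learning.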

Abstract: Non-contrast chest CTs offer a rich opportunity for both conventional pulmonary and opportunistic extra-pulmonary screening. While Multi-Task Learning (MTL) can unify these diverse tasks, standard hard-parameter sharing approaches are often suboptimal for modeling distinct pathologies. We propose HyperCT, a framework that dynamically adapts a Vision Transformer backbone via a Hypernetwork. To ensure computational efficiency, we integrate Low-Rank Adaptation (LoRA), allowing the model to regress task-specific low-rank weight updates rather than full parameters. Validated on a large-scale dataset of radiological and cardiological tasks, HyperCT outperforms various strong baselines, offering a unified, parameter-efficient solution for holistic patient assessment. Our code is available at https://github.com/lfb-1/HyperCT.
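
The idea of regressing task-specific low-rank updates instead of full weights can be illustrated with a toy sketch. This is not the authors' implementation: the linear hypernetwork, the names `lora_update`, `hyper_a`, and `hyper_b`, and all shapes are assumptions chosen purely for illustration.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def lora_update(task_emb, hyper_a, hyper_b, d_in, d_out, rank):
    """Toy hypernetwork: map a task embedding to a low-rank update B @ A.

    hyper_a has rank * d_in rows and hyper_b has d_out * rank rows
    (hypothetical linear heads, one column per embedding dimension).
    """
    a_flat = matvec(hyper_a, task_emb)
    b_flat = matvec(hyper_b, task_emb)
    a = [a_flat[r * d_in:(r + 1) * d_in] for r in range(rank)]    # rank x d_in
    b = [b_flat[o * rank:(o + 1) * rank] for o in range(d_out)]   # d_out x rank
    # The update added to the frozen weight is the product B @ A.
    return [[sum(b[o][r] * a[r][i] for r in range(rank)) for i in range(d_in)]
            for o in range(d_out)]


# Two tasks share one frozen weight matrix but receive different updates.
hyper_a = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]   # (rank * d_in) x emb_dim
hyper_b = [[1.0, 2.0], [-1.0, 0.0]]              # (d_out * rank) x emb_dim
delta_task0 = lora_update([1.0, 0.0], hyper_a, hyper_b, d_in=3, d_out=2, rank=1)
delta_task1 = lora_update([0.0, 1.0], hyper_a, hyper_b, d_in=3, d_out=2, rank=1)
```

In this sketch the hypernetwork emits `rank * (d_in + d_out)` numbers per task rather than `d_in * d_out`, which is where the parameter saving of a low-rank update comes from.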

eess.AS [Back]

[97] Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR eess.AS | cs.CL | cs.SDPDF

Zhennan Lin, Shuai Wang, Zhaokai Sun, Pengyuan Xie, Chuan Xie

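TL;DR: This paper proposes Speaker-Reasoner, an end-to-end speech LLM that uses multi-turn temporal reasoning to handle automatic speech recognition, speaker attribution, and timestamp localization in multi-speaker settings. The model iteratively analyzes global audio structure, autonomously predicts temporal boundaries, and performs fine-grained segment analysis, jointly modeling speaker identity, gender, timestamps, and transcription. A speaker-aware cache extends processing to audio that exceeds the training context window.

Details

Motivation: Multi-speaker conversations require speech recognition, speaker attribution, and timestamp localization to be solved jointly, and they are made hard by overlapping speech, backchannels, rapid turn-taking, and context window constraints.

Result: On the AliMeeting and AISHELL-4 datasets, the model achieves consistent improvements over strong baselines, particularly in handling overlapping speech and complex turn-taking.

Insight: The novelty lies in bringing agentic multi-turn temporal reasoning into a speech LLM. Iterative global-to-local analysis combined with a speaker-aware cache allows end-to-end joint modeling of long audio and complex interaction scenarios.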

Abstract: Transcribing and understanding multi-speaker conversations requires speech recognition, speaker attribution, and timestamp localization. While speech LLMs excel at single-speaker tasks, multi-speaker scenarios remain challenging due to overlapping speech, backchannels, rapid turn-taking, and context window constraints. We propose Speaker-Reasoner, an end-to-end Speech LLM with agentic multi-turn temporal reasoning. Instead of single-pass inference, the model iteratively analyzes global audio structure, autonomously predicts temporal boundaries, and performs fine-grained segment analysis, jointly modeling speaker identity, gender, timestamps, and transcription. A speaker-aware cache further extends processing to audio exceeding the training context window. Trained with a three-stage progressive strategy, Speaker-Reasoner achieves consistent improvements over strong baselines on AliMeeting and AISHELL-4 datasets, particularly in handling overlapping speech and complex turn-taking.

cs.LG [Back]

[98] LiME: Lightweight Mixture of Experts for Efficient Multimodal Multi-task Learning cs.LG | cs.CL | cs.CVPDF

Md Kowsher, Haris Mansoor, Nusrat Jahan Prottasha, Ozlem Garibay, Victor Zhu

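TL;DR: This paper proposes LiME (Lightweight Mixture of Experts), an efficient method for multimodal multi-task learning. It addresses a limitation of existing MoE-PEFT methods: each expert needs its own adapter, so trainable parameters grow linearly with the number of experts. LiME achieves expert specialization through lightweight modulation, using a single shared PEFT module whose output is modulated by expert vectors, which reduces parameters and applies to any PEFT method. LiME also introduces zero-parameter routing, which reuses existing frozen and adapted representations to eliminate the learned router parameters normally required per layer. The theoretical analysis shows that more experts preserve more task-relevant information and that modulation approximates full expert-specific PEFT with bounded error. Experiments on MMT-47, a multimodal multi-task benchmark with 47 tasks spanning text, image, and video, show competitive or superior performance with up to 4x fewer trainable parameters and up to 29% faster training.

Details

Motivation: Existing MoE-PEFT methods combine Mixture of Experts with parameter-efficient fine-tuning for multi-task adaptation, but they require a separate adapter per expert. Trainable parameters therefore scale linearly with expert count, and the methods are limited to adapter-based architectures, which narrows their applicability.

Result: On the MMT-47 benchmark, LiME achieves competitive or superior performance compared with the corresponding MoE-PEFT baselines while using up to 4x fewer trainable parameters and training up to 29% faster.

Insight: The contributions are (1) expert specialization through lightweight modulation rather than adapter replication, which reduces parameters and generalizes to any PEFT method; (2) zero-parameter routing that reuses existing representations instead of learned router parameters; and (3) n-gram windowed routing with adaptive expert selection (Auto Top-K) based on routing confidence. Taken together, these improve efficiency while maintaining performance, offering a lighter and more general approach to multimodal multi-task learning.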

Abstract: MoE-PEFT methods combine Mixture of Experts with parameter-efficient fine-tuning for multi-task adaptation, but require separate adapters per expert, causing trainable parameters to scale linearly with expert count and limiting applicability to adapter-based architectures. We propose LiME (Lightweight Mixture of Experts), which achieves expert specialization through lightweight modulation rather than adapter replication. Instead of separate adapters, LiME uses a single shared PEFT module and modulates its output with lightweight expert vectors, reducing expert parameters while generalizing to any PEFT method. Notably, LiME introduces zero-parameter routing by leveraging existing frozen and adapted representations, eliminating the learned router parameters typically required per layer. Theoretically, we prove that (i) more experts preserve more task-relevant information and (ii) modulation approximates full expert-specific PEFT with bounded error. LiME further incorporates n-gram windowed routing and adaptive expert selection (Auto Top-K) based on routing confidence. Experiments on MMT-47, a multimodal multi-task benchmark with 47 tasks spanning text, image, and video, demonstrate that LiME achieves competitive or superior performance while using up to 4x fewer trainable parameters and up to 29% faster training compared to corresponding MoE-PEFT baselines.
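
A rough sketch of what "one shared PEFT module modulated by expert vectors" could look like is given below. It is an illustration under stated assumptions, not LiME's actual formulation: the element-wise modulation, the function names `modulated_output` and `param_counts`, and the gate values are all hypothetical, and the paper's zero-parameter routing is not reproduced (the gates are simply passed in).

```python
def modulated_output(shared_delta, expert_vectors, gates):
    """Combine one shared PEFT output with per-expert modulation vectors.

    shared_delta: output of the single shared PEFT module (list of floats).
    expert_vectors: one modulation vector per expert, same length as shared_delta.
    gates: routing weights per expert (assumed given; routing is not modeled here).
    """
    out = [0.0] * len(shared_delta)
    for gate, vec in zip(gates, expert_vectors):
        for j, delta in enumerate(shared_delta):
            out[j] += gate * vec[j] * delta   # element-wise modulation, then mixing
    return out


def param_counts(adapter_params, n_experts, dim):
    """Compare adapter replication with a shared adapter plus expert vectors."""
    replicated = n_experts * adapter_params
    modulated = adapter_params + n_experts * dim
    return replicated, modulated


mixed = modulated_output([1.0, 2.0], [[1.0, 0.0], [0.5, 2.0]], [0.25, 0.75])
replicated, modulated = param_counts(adapter_params=1000, n_experts=8, dim=64)
```

Under these assumptions, adding an expert costs only one extra vector of size `dim` rather than a whole adapter, which is the scaling argument the abstract makes.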

[99] SIEVE: Sample-Efficient Parametric Learning from Natural Language cs.LG | cs.CLPDF

Parth Asawa, Alexandros G. Dimakis, Matei Zaharia

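TL;DR: SIEVE is a method for sample-efficient parametric learning from natural language context that needs as few as three query examples to adapt a model. Its synthetic data generation pipeline, SIEVE-GEN, builds on the insight that context is decomposable: it pairs synthetic queries with only the applicable context, then uses context distillation to internalize that context into the model.

Details

Motivation: Parametric learning from natural language context (such as instructions, knowledge, or feedback) is data hungry and relies heavily on high-quality traces or automated verifiers. The goal is to make it sample-efficient.

Result: In reasoning settings where context is necessary, including custom domains and the RuleArena and Machine Translation from One Book tasks, SIEVE outperforms prior context distillation methods using just three query examples.

Insight: The key idea is that context is decomposable. SIEVE-GEN exploits this to generate higher-quality synthetic data, which context distillation then turns into efficient parametric learning. The approach is a useful reference for reducing data requirements and improving adaptation efficiency.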

Abstract: Natural language context, such as instructions, knowledge, or feedback, contains rich signal for adapting language models. While in-context learning provides adaptation via the prompt, parametric learning persists into model weights and can improve performance further, though it is data hungry and relies heavily on either high-quality traces or automated verifiers. We propose SIEVE, a method for sample-efficient parametric learning from natural language context that requires as few as three query examples. SIEVE uses a novel synthetic data generation pipeline, SIEVE-GEN, that leverages the insight that context is decomposable. Decomposing context allows us to generate higher quality rollouts by pairing synthetic queries with only the applicable context rather than the entirety, then using context distillation to internalize context into the model. We evaluate in reasoning settings where context is necessary, including custom domains and the RuleArena and Machine Translation from One Book tasks. Our results show that SIEVE outperforms prior context distillation methods using just three query examples, demonstrating how to achieve sample-efficient parametric learning from natural language.

[100] Do We Need Frontier Models to Verify Mathematical Proofs? cs.LG | cs.AI | cs.CLPDF

Aaditya Naik, Guruprerana Shabadi, Rajeev Alur, Mayur Naik

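TL;DR: This paper asks whether verifying mathematical proofs requires frontier large language models. A systematic evaluation of open-source and frontier LLMs on datasets of competition-level proofs finds that smaller open-source models trail frontier models by only up to about 10% in accuracy but are up to about 25% more inconsistent. An LLM-guided prompt search substantially improves the open-source models' verification performance, bringing them on par with frontier models.

Details

Motivation: As frontier reasoning models achieve breakthroughs in math competitions, the need to reliably verify the natural language proofs they produce keeps growing. The paper investigates what model capability reliable verification actually requires and whether open-source models are up to the task.

Result: On datasets of human-graded, competition-level proofs, smaller open-source models are up to about 10% behind frontier models in accuracy and up to about 25% more inconsistent. With the specialized prompts, their accuracy improves by up to 9.1% and their self-consistency by up to 15.9%, which lets models like Qwen3.5-35B perform on par with frontier models such as Gemini 3.1 Pro.

Insight: The central finding is that smaller open-source models already have the mathematical capability to verify proofs at the level of frontier models; the bottleneck is reliably eliciting that capability with general judging prompts rather than a lack of core ability. An ensemble of specialized prompts synthesized through LLM-guided prompt search overcomes their specific failure modes, which is an efficient, low-cost way to raise task-specific performance.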

Abstract: Advances in training, post-training, and inference-time methods have enabled frontier reasoning models to win gold medals in math competitions and settle challenging open problems. Gaining trust in the responses of these models requires that natural language proofs be checked for errors. LLM judges are increasingly being adopted to meet the growing demand for evaluating such proofs. While verification is considered easier than generation, what model capability does reliable verification actually require? We systematically evaluate four open-source and two frontier LLMs on datasets of human-graded natural language proofs of competition-level problems. We consider two key metrics: verifier accuracy and self-consistency (the rate of agreement across repeated judgments on the same proof). We observe that smaller open-source models are only up to ~10% behind frontier models in accuracy but they are up to ~25% more inconsistent. Furthermore, we see that verifier accuracy is sensitive to prompt choice across all models. We then demonstrate that the smaller models, in fact, do possess the mathematical capabilities to verify proofs at the level of frontier models, but they struggle to reliably elicit these capabilities with general judging prompts. Through an LLM-guided prompt search, we synthesize an ensemble of specialized prompts that overcome the specific failure modes of smaller models, boosting their performance by up to 9.1% in accuracy and 15.9% in self-consistency. These gains are realized across models and datasets, allowing models like Qwen3.5-35B to perform on par with frontier models such as Gemini 3.1 Pro for proof verification.
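
The two metrics named in the abstract can be made concrete with a small sketch. The abstract defines self-consistency only as "the rate of agreement across repeated judgments on the same proof"; the pairwise-agreement formula below is one plausible reading and is an assumption, as are the function names and the toy data.

```python
from itertools import combinations


def verifier_accuracy(judgments, gold):
    """Fraction of individual verdicts that match the human grade."""
    total = correct = 0
    for proof_id, verdicts in judgments.items():
        for verdict in verdicts:
            total += 1
            correct += (verdict == gold[proof_id])
    return correct / total


def self_consistency(judgments):
    """Mean pairwise agreement among repeated verdicts on the same proof."""
    rates = []
    for verdicts in judgments.values():
        pairs = list(combinations(verdicts, 2))
        if pairs:  # proofs judged only once carry no consistency signal
            rates.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(rates) / len(rates)


# Toy data: three proofs, each judged three times (True = "proof is correct").
judgments = {
    "p1": [True, True, True],
    "p2": [True, False, True],
    "p3": [False, False, False],
}
gold = {"p1": True, "p2": False, "p3": False}

accuracy = verifier_accuracy(judgments, gold)
consistency = self_consistency(judgments)
```

Separating the two metrics matters because a verifier can be accurate on average yet flip its verdict between runs, which is the gap the abstract reports for smaller models.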

[101] Mitigating Reward Hacking in RLHF via Advantage Sign Robustness cs.LG | cs.AI | cs.CLPDF

Shinnosuke Ono, Johannes Ackermann, Soichiro Nishimori, Takashi Ishida, Masashi Sugiyama

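TL;DR: This paper proposes SignCert-PO (Sign-Certified Policy Optimization), a method for mitigating reward hacking in reinforcement learning from human feedback (RLHF). It computes a certified sign-preservation radius and uses it to identify non-robust completions and down-weight them in the policy gradient update, limiting the effect of flipped advantage signs.

Details

Motivation: Reward models in RLHF are vulnerable to reward hacking: as the policy maximizes a learned proxy reward, true quality plateaus or degrades. The authors assume this is often caused by flipped advantage signs.

Result: On the TL;DR summarization and AlpacaFarm benchmarks, SignCert-PO consistently achieves a better win rate than baselines and reduces reward hacking.

Insight: The method is a lightweight policy optimization approach that needs only the reward model parameters and on-policy completions, not multiple reward models or access to the reward model's training data. The certified sign-preservation radius is derived by considering an adversarial perturbation in the reward model's parameter space.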

Abstract: Reward models (RMs) used in reinforcement learning from human feedback (RLHF) are vulnerable to reward hacking: as the policy maximizes a learned proxy reward, true quality plateaus or degrades. We make the assumption that reward hacking is often caused by flipped advantage signs: instead of reducing the likelihood of a bad response, a flipped sign causes the update to increase it. By considering an adversarial perturbation in the RM parameter space, we can derive a certified sign-preservation radius, which is the smallest perturbation that can flip the advantage sign during policy optimization. Based on this formulation, we propose Sign-Certified Policy Optimization (SignCert-PO), down-weighting non-robust completions in the policy gradient update. Unlike prior approaches that require multiple RMs or access to the RM training data, SignCert-PO is lightweight and operates purely at the policy optimization stage using only the RM parameters and on-policy completions. On TL;DR summarization and AlpacaFarm benchmarks, SignCert-PO consistently achieves a better win rate than baselines and reduces reward hacking.
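
To give a feel for a sign-preservation radius, the sketch below works through the simplest case one could assume: a linear reward head `r(x) = w · phi(x)` and a group-mean baseline. Under that assumption the advantage is `w · d` with `d = phi(x) - mean(phi)`, and the smallest L2 perturbation of `w` that flips its sign has norm `|w · d| / ||d||`. The paper's actual derivation, perturbation set, and weighting rule are not given in the abstract, so the linear head, the `tau` threshold, and the `min(1, radius / tau)` weighting are all hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sign_radii(w, feats):
    """Return (advantage, radius) per completion for a linear reward head.

    radius is the L2 norm of the smallest change to w that flips the
    advantage sign, assuming advantage = w . (feat - group mean).
    """
    n = len(feats)
    mean = [sum(col) / n for col in zip(*feats)]
    results = []
    for feat in feats:
        d = [a - m for a, m in zip(feat, mean)]
        adv = dot(w, d)
        norm = math.sqrt(dot(d, d))
        # A zero advantage has no sign to preserve, so treat it as non-robust.
        radius = abs(adv) / norm if norm > 0 else 0.0
        results.append((adv, radius))
    return results


def robust_weights(radii, tau):
    """Down-weight completions whose radius falls below a threshold tau."""
    return [min(1.0, r / tau) for r in radii]


w = [1.0, 0.0]
feats = [[2.0, 0.0], [0.0, 0.0], [1.0, 3.0]]
results = sign_radii(w, feats)
weights = robust_weights([r for _, r in results], tau=1.0)
```

In this toy group, the third completion has an advantage of zero and therefore a radius of zero, so it is removed from the update entirely, while the other two are scaled by how far their advantage sits from a sign flip.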

[102] Co-Evolution of Policy and Internal Reward for Language Agents cs.LG | cs.AI | cs.CLPDF

Xinyu Wang, Hanwei Wu, Jingwei Song, Shuyuan Zhang, Jiayi Zhang

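TL;DR: This paper proposes Self-Guide, a self-generated internal reward for large language model (LLM) agents that addresses sparse and delayed rewards in long-horizon training. At inference time it serves as a short self-guidance signal that steers the next action; at training time the same signal is converted into a step-level internal reward for denser policy optimization, creating a loop in which policy and internal reward co-evolve.

Details

Motivation: LLM agents learn by interacting with environments, but long-horizon training is bottlenecked by sparse and delayed rewards. Existing methods, such as post-hoc credit assignment or external reward models, provide limited guidance at inference time and often separate reward improvement from policy improvement.

Result: Across three agent benchmarks, inference-time self-guidance already yields clear gains, and jointly evolving policy and internal reward with GRPO brings a further 8% improvement over baselines trained solely with environment reward.

Insight: The novelty is a self-generated internal reward that unifies inference-time guidance and training-time supervision, with a co-evolution loop: a better policy produces better guidance, and better guidance, used as internal reward, further improves the policy. The results suggest that language agents can improve not only by collecting more experience but also by learning to generate and refine their own internal reward while acting and learning, which makes this a promising training paradigm.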

Abstract: Large language model (LLM) agents learn by interacting with environments, but long-horizon training remains fundamentally bottlenecked by sparse and delayed rewards. Existing methods typically address this challenge through post-hoc credit assignment or external reward models, which provide limited guidance at inference time and often separate reward improvement from policy improvement. We propose Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. Specifically, the agent uses Self-Guide as a short self-guidance signal to steer the next action during inference, and converts the same signal into step-level internal reward for denser policy optimization during training. This creates a co-evolving loop: better policy produces better guidance, and better guidance further improves policy as internal reward. Across three agent benchmarks, inference-time self-guidance already yields clear gains, while jointly evolving policy and internal reward with GRPO brings further improvements (8%) over baselines trained solely with environment reward. Overall, our results suggest that language agents can improve not only by collecting more experience, but also by learning to generate and refine their own internal reward during acting and learning.

[103] Self-Distilled RLVR cs.LG | cs.CLPDF

Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu

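TL;DR: This paper proposes RLSD, a method that combines the strengths of RLVR (reinforcement learning with verifiable rewards) and OPSD (on-policy self-distillation). RLSD uses self-distillation to obtain fine-grained token-level policy differences that determine update magnitudes, while still relying on RLVR to derive reliable update directions from environmental feedback, achieving a higher convergence ceiling and better training stability.

Details

Motivation: In on-policy self-distillation, learning signals derived solely from the privileged teacher cause severe information leakage and unstable long-term training. The paper aims to identify the right niche for self-distillation so that it can be combined with RLVR without these problems.

Result: The paper reports that RLSD harnesses the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability.

Insight: The core idea is to restrict self-distillation to supplying fine-grained update magnitudes (token-level policy differences) and to leave the update direction to the RLVR signal from environmental feedback. This combines the advantages of both paradigms while avoiding the information leakage of pure self-distillation.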

Abstract: On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which only obtains sparse signals from verifiable outcomes in the environment. Recently, the community has explored on-policy self-distillation (OPSD), where the same model serves as both teacher and student, with the teacher receiving additional privileged information such as reference answers to enable self-evolution. This paper demonstrates that learning signals solely derived from the privileged teacher result in severe information leakage and unstable long-term training. Accordingly, we identify the optimal niche for self-distillation and propose RLSD (RLVR with Self-Distillation). Specifically, we leverage self-distillation to obtain token-level policy differences for determining fine-grained update magnitudes, while continuing to use RLVR to derive reliable update directions from environmental feedback (e.g., response correctness). This enables RLSD to simultaneously harness the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability.

[104] PRISM: LLM-Guided Semantic Clustering for High-Precision Topics cs.LG | cs.CL | cs.IR | cs.SIPDF

Connor Douglas, Utkucan Balci, Joseph Aylett-Bullock

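TL;DR: This paper proposes PRISM (Precision-Informed Semantic Modeling), a structured topic modeling framework that combines the rich representations of large language models (LLMs) with the low cost and interpretability of latent semantic clustering methods. It fine-tunes a sentence encoding model on a sparse set of LLM-provided labels and applies thresholded clustering in the embedding space to separate closely related topics within a narrow domain.

Details

Motivation: Traditional topic modeling methods lack the precision to separate closely related topics. The goal is to combine the semantic understanding of LLMs with the efficiency and interpretability of lightweight clustering methods.

Result: Across multiple corpora, PRISM improves topic separability over state-of-the-art local topic models, and even over clustering on large frontier embedding models, while requiring only a small number of LLM queries to train.

Insight: The contributions are (1) a student-teacher pipeline that distills sparse LLM supervision into a lightweight model for topic discovery; (2) an analysis of how well sampling strategies improve local geometry for cluster separability; and (3) an effective approach for web-scale text analysis that lets researchers track nuanced claims and subtopics with an interpretable, locally deployable framework.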

Abstract: In this paper, we propose Precision-Informed Semantic Modeling (PRISM), a structured topic modeling framework combining the benefits of rich representations captured by LLMs with the low cost and interpretability of latent semantic clustering methods. PRISM fine-tunes a sentence encoding model using a sparse set of LLM-provided labels on samples drawn from some corpus of interest. We segment this embedding space with thresholded clustering, yielding clusters that separate closely related topics within some narrow domain. Across multiple corpora, PRISM improves topic separability over state-of-the-art local topic models and even over clustering on large, frontier embedding models while requiring only a small number of LLM queries to train. This work contributes to several research streams by providing (i) a student-teacher pipeline to distill sparse LLM supervision into a lightweight model for topic discovery; (ii) an analysis of the efficacy of sampling strategies to improve local geometry for cluster separability; and (iii) an effective approach for web-scale text analysis, enabling researchers and practitioners to track nuanced claims and subtopics online with an interpretable, locally deployable framework.

[105] From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation cs.LG | cs.CVPDF

Han Song, Yucheng Zhou, Jianbing Shen, Yu Cheng

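TL;DR: This paper proposes EG-GRPO, an entropy-guided optimization method for improving autoregressive image generation. A systematic entropy analysis reveals how Chain-of-Thought (CoT) exploration interacts with reinforcement learning (RL) optimization, and the paper designs a fine-tuning strategy that reallocates optimization budget by uncertainty to balance exploration and stability.

Details

Motivation: Combining CoT with RL improves text-to-image (T2I) generation, but the interaction between CoT's exploration and RL's optimization remains unclear. The paper aims to clarify it and to use the key findings of the entropy analysis to design a more effective optimization method.

Result: On standard text-to-image generation benchmarks, EG-GRPO achieves state-of-the-art performance.

Insight: The entropy analysis exposes the dynamic between CoT exploration and RL optimization: exploration expands the generative space and optimization contracts it. Building on this, EG-GRPO allocates optimization budget differently according to the entropy of image tokens and of the textual CoT, encouraging structured exploration while keeping generation stable.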

Abstract: Combining Chain-of-Thought (CoT) with Reinforcement Learning (RL) improves text-to-image (T2I) generation, yet the underlying interaction between CoT’s exploration and RL’s optimization remains unclear. We present a systematic entropy-based analysis that yields three key insights: (1) CoT expands the generative exploration space, while RL contracts it toward high-reward regions; (2) final reward is strongly negatively correlated with both the mean and variance of image-token entropy, highlighting the need to reduce uncertainty and instability; and (3) the entropy of the textual CoT directly governs downstream image quality, with lower-entropy CoTs leading to better generations. Motivated by these findings, we propose Entropy-Guided Group Relative Policy Optimization (EG-GRPO), a fine-tuning strategy that reallocates optimization budget by uncertainty: low-entropy tokens are excluded from reward-driven updates to preserve stability, while high-entropy tokens receive an entropy bonus that encourages structured exploration without collapse. Experiments on standard T2I benchmarks demonstrate that EG-GRPO achieves state-of-the-art performance.
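
The token-level rule described in the abstract (skip low-entropy tokens, give high-entropy tokens an entropy bonus) can be sketched as follows. This is a minimal illustration, not the paper's objective: the two thresholds, the `bonus_coef` value, and the choice to add the bonus directly to the advantage are assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def token_advantages(token_dists, advantage, low_thresh, high_thresh, bonus_coef):
    """Assign a per-token learning signal based on token entropy.

    Low-entropy tokens are excluded from the reward-driven update (signal 0),
    high-entropy tokens receive the advantage plus an entropy bonus, and the
    remaining tokens receive the plain sequence-level advantage.
    """
    signals = []
    for dist in token_dists:
        h = entropy(dist)
        if h < low_thresh:
            signals.append(0.0)
        elif h > high_thresh:
            signals.append(advantage + bonus_coef * h)
        else:
            signals.append(advantage)
    return signals


token_dists = [
    [1.0, 0.0, 0.0, 0.0],        # fully confident token
    [0.25, 0.25, 0.25, 0.25],    # maximally uncertain token
    [0.7, 0.1, 0.1, 0.1],        # in-between token
]
signals = token_advantages(token_dists, advantage=1.0,
                           low_thresh=0.5, high_thresh=1.2, bonus_coef=0.1)
```

The design intent, as the abstract describes it, is that confident tokens are left untouched to preserve stability while uncertain tokens are where exploration is encouraged.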

[106] Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models cs.LG | cs.AI | cs.CVPDF

Gengwei Zhang, Jie Peng, Zhen Tan, Mufan Qiu, Hossein Nourkhiz Mahjoub

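TL;DR: This paper proposes the Hallucination-as-Cue Framework, an analytical framework for studying how RL-based post-training affects multimodal reasoning models, viewed through model hallucination. By introducing hallucination-inductive, modality-specific corruptions, the framework reveals that hallucination plays a larger role in RL training than previously recognized: training under purely hallucination-inductive settings can still significantly improve model performance.

Details

Motivation: Reinforcement learning is widely used to post-train multimodal large language models for better visual reasoning, but it remains unclear whether RL training truly teaches models to use visual information. The paper investigates the actual effect of RL post-training on multimodal reasoning models by analyzing model hallucination.

Result: Extensive experiments across multiple multimodal reasoning benchmarks show that RL post-training under purely hallucination-inductive settings can still significantly improve reasoning performance, and in some cases even outperforms standard training.

Insight: The novelty is a framework that uses controlled hallucination-inducing corruptions to diagnose RL training dynamics and analyze the intrinsic properties of datasets. The findings challenge prevailing assumptions about MLLM reasoning training and underline the importance of modality-aware design in RL training.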

Abstract: The recent success of reinforcement learning (RL) in large reasoning models has inspired the growing adoption of RL for post-training Multimodal Large Language Models (MLLMs) to enhance their visual reasoning capabilities. Although many studies have reported improved performance, it remains unclear whether RL training truly enables models to learn from visual information. In this work, we propose the Hallucination-as-Cue Framework, an analytical framework designed to investigate the effects of RL-based post-training on multimodal reasoning models from the perspective of model hallucination. Specifically, we introduce hallucination-inductive, modality-specific corruptions that remove or replace essential information required to derive correct answers, thereby forcing the model to reason by hallucination. By applying these corruptions during both training and evaluation, our framework provides a unique perspective for diagnosing RL training dynamics and understanding the intrinsic properties of datasets. Through extensive experiments and analyses across multiple multimodal reasoning benchmarks, we reveal that the role of model hallucination for RL-training is more significant than previously recognized. For instance, we find that RL post-training under purely hallucination-inductive settings can still significantly improve models’ reasoning performance, and in some cases even outperform standard training. These findings challenge prevailing assumptions about MLLM reasoning training and motivate the development of more modality-aware RL-based training designs.

cs.AI [Back]

[107] Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation cs.AI | cs.CLPDF

Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang

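TL;DR: This paper presents XpertBench, a benchmark for evaluating large language models on complex, open-ended tasks from authentic professional domains. It contains 1,346 tasks across 80 categories, spanning finance, healthcare, legal services, education, and dual-track research (STEM and Humanities); the tasks are derived from submissions by domain experts and scored with detailed rubrics. The paper also introduces ShotJudge, an evaluation paradigm that uses LLM judges calibrated with expert few-shot exemplars to mitigate self-rewarding bias. In the empirical evaluation, even leading models reach a peak success rate of only about 66% and a mean score around 55%, revealing a significant gap between current AI systems and expert-level performance.

Details

Motivation: Existing evaluation frameworks suffer from narrow domain coverage, reliance on generalist tasks, or self-evaluation bias, which makes it hard to assess LLM proficiency on the complex, open-ended tasks that characterize genuine expert-level cognition.

Result: The evaluation of state-of-the-art LLMs on XpertBench reveals a pronounced performance ceiling: leading models achieve a peak success rate of only about 66%, with a mean score around 55%. Models also show domain-specific divergence, with non-overlapping strengths in quantitative reasoning versus linguistic synthesis.

Insight: The contribution is a high-fidelity, ecologically valid benchmark of expert-level tasks, together with the ShotJudge evaluation paradigm for mitigating self-rewarding bias. Together they provide an instrument for assessing the transition of LLMs from general-purpose assistants to specialized professional collaborators.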

Abstract: As Large Language Models (LLMs) exhibit plateauing performance on conventional benchmarks, a pivotal challenge persists: evaluating their proficiency in complex, open-ended tasks characterizing genuine expert-level cognition. Existing frameworks suffer from narrow domain coverage, reliance on generalist tasks, or self-evaluation biases. To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains. XpertBench consists of 1,346 meticulously curated tasks across 80 categories, spanning finance, healthcare, legal services, education, and dual-track research (STEM and Humanities). These tasks are derived from over 1,000 submissions by domain experts (including researchers from elite institutions and practitioners with extensive clinical or industrial experience), ensuring superior ecological validity. Each task uses detailed rubrics with mostly 15-40 weighted checkpoints to assess professional rigor. To facilitate scalable yet human-aligned assessment, we introduce ShotJudge, a novel evaluation paradigm that employs LLM judges calibrated with expert few-shot exemplars to mitigate self-rewarding biases. Our empirical evaluation of state-of-the-art LLMs reveals a pronounced performance ceiling: even leading models achieve a peak success rate of only ~66%, with a mean score around 55%. Models also exhibit domain-specific divergence, showing non-overlapping strengths in quantitative reasoning versus linguistic synthesis. These findings underscore a significant “expert-gap” in current AI systems and establish XpertBench as a critical instrument for navigating the transition from general-purpose assistants to specialized professional collaborators.
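
A rubric made of weighted checkpoints can be scored in many ways; the abstract does not say how XpertBench aggregates them. The sketch below assumes the simplest option, a weighted pass fraction, purely to illustrate what "weighted checkpoints" implies. The function name `rubric_score` and the example rubric are hypothetical.

```python
def rubric_score(checkpoints):
    """Weighted fraction of rubric checkpoints that a response satisfies.

    checkpoints: list of (weight, passed) pairs, with positive weights.
    """
    total = sum(weight for weight, _ in checkpoints)
    earned = sum(weight for weight, passed in checkpoints if passed)
    return earned / total


# A tiny three-checkpoint rubric; real rubrics in the paper have mostly 15-40.
example_rubric = [(3, True), (2, False), (5, True)]
score = rubric_score(example_rubric)
```

With weights, failing a minor checkpoint costs less than failing a critical one, which a plain pass count would not capture.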

Hyunji Nam, Dorottya Demszky

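TL;DR: This paper studies how sensitive large language models are to spurious social context in prompt-based prediction tasks. It finds that irrelevant contextual information can substantially shift model predictions, and that larger models are sometimes more sensitive. The authors propose Debiasing-DPO, a self-supervised training method that reduces bias by contrasting the model's reasoning with and without the spurious context, combined with supervised fine-tuning to preserve predictive accuracy. On Llama and Qwen models, the method reduces bias by 84% and improves predictive accuracy by 52% on average.

Details

Motivation: LLMs are increasingly used for high-stakes decisions such as evaluating teachers' instructional quality, but their sensitivity to spurious contextual information can introduce harmful biases and undermine fairness. Existing mitigations, such as prompting and standard direct preference optimization, prove largely insufficient.

Result: Seven frontier and open-weight models were evaluated on the largest publicly available dataset of U.S. classroom transcripts (NCTE). Spurious context can shift model predictions by up to 1.48 points on a 7-point scale. Applied to Llama 3B and 8B and Qwen 3B and 7B Instruct models, Debiasing-DPO reduces bias by 84% and improves predictive accuracy by 52% on average.

Insight: The novelty is Debiasing-DPO, which combines a self-supervised contrast between neutral and biased reasoning with supervised fine-tuning, reducing reliance on spurious social context without sacrificing predictive accuracy. The study also shows that robustness to spurious context is not a natural byproduct of model scaling and needs dedicated methods.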

Abstract: LLMs are increasingly used for high-stakes decision-making, yet their sensitivity to spurious contextual information can introduce harmful biases. This is a critical concern when models are deployed for tasks like evaluating teachers’ instructional quality, where biased assessment can affect teachers’ professional development and career trajectories. We investigate model robustness to spurious social contexts using the largest publicly available dataset of U.S. classroom transcripts (NCTE) paired with expert rubric scores. Evaluating seven frontier and open-weight models across seven categories of spurious contexts – including teacher experience, education level, demographic identity, and sycophancy-inducing framings – we find that irrelevant contextual information can shift model predictions by up to 1.48 points on a 7-point scale, with larger models sometimes exhibiting greater sensitivity despite higher predictive accuracy. Mitigations using prompts and standard direct preference optimization (DPO) prove largely insufficient. We propose Debiasing-DPO, a self-supervised training method that pairs neutral reasoning generated from the query alone with the model’s biased reasoning generated with both the query and additional spurious context. We further combine this objective with supervised fine-tuning on ground-truth labels to prevent losses in predictive accuracy. Applied to Llama 3B & 8B and Qwen 3B & 7B Instruct models, Debiasing-DPO reduces bias by 84% and improves predictive accuracy by 52% on average. Our findings from the educational case study highlight that robustness to spurious context is not a natural byproduct of model scaling and that our proposed method can yield substantial gains in both accuracy and robustness for prompt-based prediction tasks.

[109] Analysis of Optimality of Large Language Models on Planning Problems cs.AI | cs.CLPDF

Bernd Bohnet, Michael C. Mozer, Kevin Swersky, Wil Cunningham, Aaron Parisi

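TL;DR: This paper studies how optimally large language models solve classic AI planning problems, specifically the Blocksworld domain and the generalized Path-Star graph. Reasoning-enhanced LLMs significantly outperform traditional satisficing planners in complex, multi-goal configurations and track theoretical optimality limits with near-perfect precision, even when domain-specific semantic hints are removed.

Details

Motivation: The paper revisits classic AI planning in the LLM era with a focus on plan efficiency rather than success rate alone, asking whether frontier models reason optimally or rely on simple heuristic strategies.

Result: On Blocksworld and generalized Path-Star graph tasks, reasoning-enhanced LLMs significantly outperform traditional satisficing planners (e.g., LAMA) in complex, multi-goal configurations, and they track theoretical optimality limits with near-perfect precision as the search space expands.

Insight: The study systematically manipulates problem depth, width, and compositionality, and uses a formally equivalent task to isolate true topological reasoning from semantic priors. It proposes two hypotheses, Algorithmic Simulation and Geometric Memory, to explain how LLMs effectively bypass exponential combinatorial complexity, offering a new perspective on their planning ability.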

Abstract: Classic AI planning problems have been revisited in the Large Language Model (LLM) era, with recent benchmarks focusing on success rates rather than plan efficiency. We examine the degree to which frontier models reason optimally versus relying on simple, heuristic, and possibly inefficient strategies. We focus on the Blocksworld domain involving towers of labeled blocks which have to be moved from an initial to a goal configuration via a set of primitive actions. We also study a formally equivalent task, the generalized Path-Star ($P^*$) graph, in order to isolate true topological reasoning from semantic priors. We systematically manipulate problem depth (the height of block towers), width (the number of towers), and compositionality (the number of goal blocks). Reasoning-enhanced LLMs significantly outperform traditional satisficing planners (e.g., LAMA) in complex, multi-goal configurations. Although classical search algorithms hit a wall as the search space expands, LLMs track theoretical optimality limits with near-perfect precision, even when domain-specific semantic hints are stripped away. To explain these surprising findings, we consider (and find evidence to support) two hypotheses: an active Algorithmic Simulation executed via reasoning tokens and a Geometric Memory that allows models to represent the $P^*$ topology as a navigable global geometry, effectively bypassing exponential combinatorial complexity.

[110] FoE: Forest of Errors Makes the First Solution the Best in Large Reasoning Models cs.AI | cs.CLPDF

Kehan Jiang, Haonan Dong, Zhaolu Kang, Zhengzhou Zhu, Guojie Song

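TL;DR: This paper identifies a "The First is The Best" phenomenon in Large Reasoning Models (LRMs): later reasoning paths are not merely suboptimal but can be harmful. The authors model errors as a forest structure, the Forest of Errors (FoE), and propose the RED framework, which refines the first solution and discards subsequent ones to suppress error growth. Across multiple benchmarks it improves performance while substantially reducing computational cost.

Details

Motivation: The paper challenges widely accepted test-time scaling laws by showing that alternative solutions in large reasoning models can hurt performance, and it investigates why.

Result: Experiments across five benchmarks and six backbone models show that RED outperforms eight competitive baselines, with performance gains of up to 19.0% and token consumption reduced by 37.7% to 70.4%.

Insight: The paper models reasoning errors as a forest structure (FoE) and proposes RED, a self-guided efficient reasoning framework that suppresses FoE growth in the first solution and prunes subsequent FoE via dual-consistency, improving both efficiency and performance.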

Abstract: Recent Large Reasoning Models (LRMs) like DeepSeek-R1 have demonstrated remarkable success in complex reasoning tasks, exhibiting human-like patterns in exploring multiple alternative solutions. Upon closer inspection, however, we uncover a surprising phenomenon: The First is The Best, where alternative solutions are not merely suboptimal but potentially detrimental. This observation challenges widely accepted test-time scaling laws, leading us to hypothesize that errors within the reasoning path scale concurrently with test time. Through comprehensive empirical analysis, we characterize errors as a forest-structured Forest of Errors (FoE) and conclude that FoE makes the First the Best, which is underpinned by rigorous theoretical analysis. Leveraging these insights, we propose RED, a self-guided efficient reasoning framework comprising two components: I) Refining First, which suppresses FoE growth in the first solution; and II) Discarding Subs, which prunes subsequent FoE via dual-consistency. Extensive experiments across five benchmarks and six backbone models demonstrate that RED outperforms eight competitive baselines, achieving performance gains of up to 19.0% while reducing token consumption by 37.7% ~ 70.4%. Moreover, comparative experiments on FoE metrics shed light on how RED achieves effectiveness.

Table of Contents

cs.CL

[1] Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis cs.CL | cs.AI

[2] Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets cs.CL | cs.MA

[3] Failing to Falsify: Evaluating and Mitigating Confirmation Bias in Language Models cs.CL | cs.LG

[4] Social Meaning in Large Language Models: Structure, Magnitude, and Pragmatic Prompting cs.CL | cs.AI

[5] Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge cs.CL | cs.LG

[6] Train Yourself as an LLM: Exploring Effects of AI Literacy on Persuasion via Role-playing LLM Training cs.CL

[7] Overcoming the “Impracticality” of RAG: Proposing a Real-World Benchmark and Multi-Dimensional Diagnostic Framework cs.CL

[8] Trivial Vocabulary Bans Improve LLM Reasoning More Than Deep Linguistic Constraints cs.CL | cs.AI

[9] Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy cs.CL | cs.AI | cs.LG | cs.SE

[10] When Modalities Remember: Continual Learning for Multimodal Knowledge Graphs cs.CL

[11] Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks cs.CL | cs.AI

[12] Student-in-the-Loop Chain-of-Thought Distillation via Generation-Time Selection cs.CL

[13] LogicPoison: Logical Attacks on Graph Retrieval-Augmented Generation cs.CL | cs.AI

[14] NeuReasoner: Towards Explainable, Controllable, and Unified Reasoning via Mixture-of-Neurons cs.CL

[15] R2-Write: Reflection and Revision for Open-Ended Writing with Deep Reasoning cs.CL | cs.AI

[16] JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency cs.CL | cs.AI

[17] Beyond the Parameters: A Technical Survey of Contextual Enrichment in Large Language Models: From In-Context Prompting to Causal Retrieval-Augmented Generation cs.CL | cs.AI

[18] Reliability Gated Multi-Teacher Distillation for Low Resource Abstractive Summarization cs.CL | cs.AI

cs.CV

[19] Internalized Reasoning for Long-Context Visual Document Understanding cs.CV | cs.AI | cs.CL

[20] Environment-Aware Channel Prediction for Vehicular Communications: A Multimodal Visual Feature Fusion Framework cs.CV | cs.AI

[21] Variational Encoder–Multi-Decoder (VE-MD) for Privacy-by-functional-design (Group) Emotion Recognition cs.CV | cs.AI

[22] LumiVideo: An Intelligent Agentic System for Video Color Grading cs.CV | cs.AI

[23] VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation cs.CV | cs.AI

[24] Guideline2Graph: Profile-Aware Multimodal Parsing for Executable Clinical Decision Graphs cs.CV | cs.LG

[25] Generating Satellite Imagery Data for Wildfire Detection through Mask-Conditioned Generative AI cs.CV | cs.AI

[26] VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors cs.CV | cs.CL

[27] Token-Efficient Multimodal Reasoning via Image Prompt Packaging cs.CV | cs.AI

[28] An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis cs.CV | cs.AI

[29] Rapidly deploying on-device eye tracking by distilling visual foundation models cs.CV

[30] Feature Attribution Stability Suite: How Stable Are Post-Hoc Attributions? cs.CV | cs.AI | cs.LG

[31] Overconfidence and Calibration in Medical VQA: Empirical Findings and Hallucination-Aware Mitigation cs.CV | cs.LG

[32] WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models cs.CV | cs.LG

[33] FusionBERT: Multi-View Image-3D Retrieval via Cross-Attention Visual Fusion and Normal-Aware 3D Encoder cs.CV

[34] Moondream Segmentation: From Words to Masks cs.CV | cs.AI

[35] Unlocking Multi-Site Clinical Data: A Federated Approach to Privacy-First Child Autism Behavior Analysis cs.CV

[36] Smart Transfer: Leveraging Vision Foundation Model for Rapid Building Damage Mapping with Post-Earthquake VHR Imagery cs.CV | cs.AI | cs.MM

[37] Cross-Vehicle 3D Geometric Consistency for Self-Supervised Surround Depth Estimation on Articulated Vehicles cs.CV | cs.AI

[38] Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs cs.CV | cs.AI

[39] Parser-Oriented Structural Refinement for a Stable Layout Interface in Document Parsing cs.CV

[40] DocShield: Towards AI Document Safety via Evidence-Grounded Agentic Reasoning cs.CV | cs.AI

[41] XrayClaw: Cooperative-Competitive Multi-Agent Alignment for Trustworthy Chest X-ray Diagnosis cs.CV

[42] ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving cs.CV

[43] THOM: Generating Physically Plausible Hand-Object Meshes From Text cs.CV

[44] Visual Instruction-Finetuned Language Model for Versatile Brain MR Image Tasks cs.CV

[45] DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection cs.CV

[46] A Unified Perspective on Adversarial Membership Manipulation in Vision Models cs.CV

[47] EnsemHalDet: Robust VLM Hallucination Detection via Ensemble of Internal State Detectors cs.CV | cs.CL

[48] LumaFlux: Lifting 8-Bit Worlds to HDR Reality with Physically-Guided Diffusion Transformers cs.CV | cs.AI

[49] UNICA: A Unified Neural Framework for Controllable 3D Avatars cs.CV

[50] PaveBench: A Versatile Benchmark for Pavement Distress Perception and Interactive Vision-Language Analysis cs.CV | cs.AI | cs.MM

[51] QAPruner: Quantization-Aware Vision Token Pruning for Multimodal Large Language Models cs.CV | cs.AI

[52] MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling cs.CV

[53] NavCrafter: Exploring 3D Scenes from a Single Image cs.CV | cs.AI

[54] STRNet: Visual Navigation with Spatio-Temporal Representation through Dynamic Graph Aggregation cs.CV | cs.RO

[55] Deformation-based In-Context Learning for Point Cloud Understanding cs.CV

[56] A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos cs.CV | cs.AI

[57] HairOrbit: Multi-view Aware 3D Hair Modeling from Single Portraits cs.CV

[58] Token Warping Helps MLLMs Look from Nearby Viewpoints cs.CV

[59] Unlocking Positive Transfer in Incrementally Learning Surgical Instruments: A Self-reflection Hierarchical Prompt Framework cs.CV

[60] InstructTable: Improving Table Structure Recognition Through Instructions cs.CV

[61] Information-Regularized Constrained Inversion for Stable Avatar Editing from Sparse Supervision cs.CV

[62] Progressive Video Condensation with MLLM Agent for Long-form Video Understanding cs.CV

[63] Toward an Artificial General Teacher: Procedural Geometry Data Generation and Visual Grounding with Vision-Language Models cs.CV | cs.AI | cs.LG

[64] EvaNet: Towards More Efficient and Consistent Infrared and Visible Image Fusion Assessment cs.CV

[65] RayMamba: Ray-Aligned Serialization for Long-Range 3D Object Detection cs.CV | cs.AI

[66] SentiAvatar: Towards Expressive and Interactive Digital Humans cs.CV | cs.HC | cs.MM

[67] GP-4DGS: Probabilistic 4D Gaussian Splatting from Monocular Video via Variational Gaussian Processes cs.CV

[68] PolyReal: A Benchmark for Real-World Polymer Science Workflows cs.CV

[69] MMTalker: Multiresolution 3D Talking Head Synthesis with Multimodal Feature Fusion cs.CV

[70] CrossWeaver: Cross-modal Weaving for Arbitrary-Modality Semantic Segmentation cs.CV

[71] Collaborative Multi-Mode Pruning for Vision-Language Models cs.CV

[72] Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation cs.CV

[73] Explicit Time-Frequency Dynamics for Skeleton-Based Gait Recognition cs.CV

[74] GenSmoke-GS: A Multi-Stage Method for Novel View Synthesis from Smoke-Degraded Images Using a Generative Model cs.CV

[75] QVAD: A Question-Centric Agentic Framework for Efficient and Training-Free Video Anomaly Detection cs.CV

[76] STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models cs.CV | cs.MM

[77] MI-Pruner: Crossmodal Mutual Information-guided Token Pruner for Efficient MLLMs cs.CV