Table of Contents
- cs.CL [Total: 85]
- cs.CV [Total: 204]
- cs.CR [Total: 2]
- cs.DC [Total: 1]
- cs.MM [Total: 1]
- cs.LG [Total: 24]
- cs.AI [Total: 14]
- cs.GR [Total: 2]
- cs.RO [Total: 5]
- cs.IR [Total: 2]
- eess.IV [Total: 3]
- cs.DB [Total: 1]
cs.CL
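[1] SalesSim: Benchmarking and Aligning Multimodal Language Models as Retail User Simulators cs.CL
Yada Pruksachatkun, Elaine Wan, Lyanna Chen, Kai-Wei Chang, Chien-Sheng Wu
TL;DR: SalesSim is a framework and testbed for evaluating how well multimodal large language models (MLLMs) simulate realistic, persona-driven customer behavior in online retail conversations. It models retail interaction and decision-making as a grounded, agentic process and defines a suite of evaluation metrics centered on decision alignment. The study finds that existing models fall short on lexical diversity and persona consistency, and proposes UserGRPO, a reinforcement learning method that optimizes conversational fluency and decision alignment.
Details
Motivation: Prior work treats user simulation as surface-level dialogue generation and does not model the persona-driven decision process behind retail interactions. SalesSim addresses this gap with a more realistic testbed for evaluating and improving MLLM user simulators in goal-oriented settings.
Result: Six open- and closed-source state-of-the-art models were evaluated on SalesSim. Even the strongest model achieves less than 79% average alignment with its underlying persona specification. Models show significantly lower lexical diversity than human conversations and are easily swayed by sales agent suggestions into drifting from their persona. UserGRPO raises the baseline model's decision alignment by 13.8% while also improving conversational quality.
Insight: The main novelty is recasting user simulation as a grounded, agentic decision process and introducing an evaluation paradigm centered on decision alignment. UserGRPO, a multi-objective reinforcement learning recipe, jointly optimizes conversational fluency and persona consistency, which is relevant to building reliable dialogue agents and evaluation systems.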
Abstract: We present SalesSim, a framework and testbed for evaluating the ability of Multimodal Large Language Models (MLLMs) to simulate realistic, persona-driven customer behavior in multi-turn, multi-modal, tool-augmented online retail conversations. Unlike prior work that treats user simulation as surface-level dialogue generation, SalesSim models retail interaction and decision-making as a grounded, agentic process, where shoppers with diverse backgrounds, preferences, and dealbreakers interact with a sales agent, seek clarifications, and make informed purchasing decisions. For evaluation, we design a suite of metrics centered on decision alignment, measuring the consistency between the simulator’s actions and its persona specifications, as well as conversational quality. We find several behavioral gaps after benchmarking 6 open and closed-source state-of-the-art models. First, while models produce fluent conversations, they display significantly lower lexical diversity than human conversations and overdisclose criteria across personas. Second, models tend to be persuaded by sales agent suggestions and drift from persona specifications. Even the strongest model achieves less than 79% average alignment with its underlying persona specifications. To make progress on these limitations, we propose UserGRPO, a multi-turn, multi-objective reinforcement learning recipe to optimize both conversational fluency and decision alignment under persona specifications. Our experiments demonstrate that UserGRPO boosts decision alignment of the baseline model by 13.8% while improving conversational quality. By introducing SalesSim, we provide a new testbed for the community to investigate and improve the adherence of user simulators in goal-oriented settings.
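[2] Sanity Checks for Long-Form Hallucination Detection cs.CL | cs.AI
Geigh Zollicoffer, Minh Vu, Hongli Zhan, Raymond Li, Manish Bhattarai
TL;DR: The paper proposes a "controlled invariance" methodology that uses two oracle tests (Force and Remove) to check whether long-form hallucination detectors actually evaluate the chain-of-thought reasoning itself or merely exploit surface correlates of the final answer. It finds that existing methods often rely on answer-level artifacts rather than the validity of intermediate reasoning. It also introduces TRACT, a lightweight scorer built on lexical trajectory features, which is robust once artifacts are controlled for and matches or outperforms existing baselines on unperturbed traces.
Details
Motivation: Hallucination detection for large language models increasingly operates on chain-of-thought reasoning traces, yet it is unclear whether these methods evaluate the reasoning itself or only surface features correlated with the final answer. The paper aims to expose this distinction and to encourage detectors that genuinely attend to the reasoning process.
Result: With answer-level artifacts controlled for, TRACT performs on par with or better than existing baselines on unperturbed reasoning traces. The results indicate that the main challenge for current detectors is separating the reasoning signal from endpoint cues, not a lack of signal in the trace.
Insight: The core contribution is the controlled-invariance methodology and its two concrete oracle tests (Force and Remove), which give a clear diagnostic for whether a hallucination detector is actually grounded in reasoning. TRACT further shows that, once artifacts are controlled for, effective detection does not necessarily require complex learned representations: simple lexical trajectory features (hedging trends, step-length dynamics, and cross-response vocabulary convergence) already achieve robust performance. This suggests a path toward more robust and more interpretable hallucination detectors.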
Abstract: Hallucination detection methods for large language models increasingly operate on chain-of-thought reasoning traces, yet it remains unclear whether they evaluate the reasoning itself or merely exploit surface correlates of the final answer. We introduce a controlled-invariance methodology that exposes this distinction through two oracle tests: \textsc{Force}, which replaces each response’s final answer with the ground truth while preserving the reasoning trace, and \textsc{Remove}, which strips answer-announcement steps while leaving the trajectory intact. This reveals if their predictive power derives from answer-level artifacts rather than from the structure or validity of intermediate reasoning. We further show that once these artifacts are controlled for, effective detection does not necessarily require complex learned representations: TRACT, a lightweight scorer built on lexical trajectory features (hedging trends, step-length dynamics, and cross-response vocabulary convergence), achieves strong robustness while remaining competitive with or outperforming existing baselines on unperturbed traces. These findings suggest that the current central challenge in reasoning-aware hallucination detection is not the absence of signal in the trace, but the failure to isolate it from endpoint cues.
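The two oracle tests in the abstract above can be pictured as simple trace perturbations. The following is a minimal sketch, not the authors' implementation: the `ANSWER_STEP` pattern, the `force` and `remove` helpers, and the toy trace are all assumptions made for illustration, since the abstract does not specify how answer-announcement steps are identified.

```python
import re

# Hypothetical rule for spotting answer-announcement steps (an assumption;
# the abstract does not say how such steps are detected).
ANSWER_STEP = re.compile(r"^\s*(final answer|the answer is)\b", re.IGNORECASE)

def force(steps, ground_truth):
    """Force-style perturbation: keep the reasoning trace, swap in the ground-truth answer."""
    return {"steps": list(steps), "answer": ground_truth}

def remove(steps):
    """Remove-style perturbation: drop answer-announcement steps, keep the rest in order."""
    return [s for s in steps if not ANSWER_STEP.search(s)]

trace = ["Compute 6 * 7.", "That gives 42.", "The answer is 41."]
forced = force(trace, "42")
stripped = remove(trace)
```

A detector whose score changes sharply between `trace` and `forced`, even though the reasoning steps are identical, is likely keying on the answer rather than on the reasoning.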
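[3] Change My View? The Dynamics of Persuasion and Polarization in Online Discourse cs.CL
David Freeborn, Malihe Alikani, Anthony Sicilia
TL;DR: The study uses large language models to analyze a corpus of debates from Reddit's r/ChangeMyView and examine the dynamics of persuasion and polarization. Rhetorical moves that express concession or empathy substantially raise the likelihood of a change of view, whereas frontal refutation, credibility attacks, and topic deflection reduce persuasive effect.
Details
Motivation: Philosophical accounts assume that rational argumentation should lead to converging views, yet everyday discussion often polarizes. The study addresses this tension empirically by analyzing how rhetorical strategies in online debates affect belief revision.
Result: On the r/ChangeMyView data, adding rhetorical-strategy features markedly improves predictive performance. Concession and empathy are positively associated with belief change, while refutation and similar moves are negatively associated.
Insight: The novelty lies in combining LLM forecasts with machine-assisted coding to quantify how rhetorical strategies affect persuasion. The study indicates that in public reasoning, relational framing matters as much as evidential content, providing empirical input for normative theories of rational dialogue.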
Abstract: Philosophical accounts of persuasion often assume that shared evidence and rational argumentation should lead to a convergence of views between peers, yet everyday discourse often suggests otherwise. In this study, we use large language models to analyze a corpus of debates on Reddit’s r/ChangeMyView, where belief revision is publicly signaled. Large language models were asked, halfway through each discussion, to forecast whether such an acknowledgement would arise; their probabilistic estimates serve as a conversational baseline. Each reply was then coded, through a hybrid machine-assisted procedure, for ten familiar rhetorical strategies – concession, empathy, logical challenge, credibility appeals, and so forth. Adding these strategic features markedly improves predictive power and yields a consistent pattern: moves that express concession or empathetic alignment substantially increase the prospect of belief change, whereas frontal refutation, credibility attacks, and topic deflection diminish it. The findings indicate that effective public reasoning depends as much on relational framing as on evidential content, and they invite a refinement of normative accounts of rational dialogue.
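[4] jina-embeddings-v5-omni: Text-Geometry-Preserving Multimodal Embeddings via Frozen-Tower Composition cs.CL
Florian Hönicke, Michael Günther, Andreas Koukounas, Kalim Akram, Scott Martens
TL;DR: The paper introduces frozen-encoder model composition, a new approach to building multimodal embedding models. Building on a VLM-style architecture, it adds image and audio encoders to the Jina Embeddings v5 Text models, yielding the jina-embeddings-v5-omni suite, which supports text, image, audio, and video input. The backbone text embedding models and the added non-text encoders stay frozen; only the connecting components, 0.35% of the total weights, are trained. Training is therefore efficient, and embeddings for text inputs are exactly the same as those of the original text models.
Details
Motivation: To address the inefficiency of training multimodal embedding models while leaving the geometry of the text embeddings unchanged, so that a text model can be extended efficiently to additional media.
Result: In the evaluations, the approach is competitive with the state of the art and performs nearly on par with larger multimodal embedding models.
Insight: The novelty is freezing both the backbone and the added encoders and training only a small set of connecting components, which gives an efficient multimodal extension while guaranteeing that text embedding geometry is preserved. This modular composition offers a new route to building and transferring multimodal models efficiently.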
Abstract: In this work, we introduce frozen-encoder model composition, a novel approach to multimodal embedding models. We build on the VLM-style architecture, in which non-text encoders are adapted to produce input for a language model, which in turn generates embeddings for all varieties of input. We present the result: the jina-embeddings-v5-omni suite, a pair of models that encode text, image, audio, and video input into a single semantic embedding space. Our method is to extend the two Jina Embeddings v5 Text models to support additional media by adding encoders for images and audio. The backbone text embedding models and the added non-text media encoders remain frozen. We only trained the connecting components, representing 0.35% of the total weights of the joint model. Training is therefore much more efficient than full-parameter retraining. Additionally, the language model remains effectively unaltered, producing exactly the same embeddings for text inputs as the Jina Embeddings v5 Text models. Our evaluations show that this approach produces results that are competitive with the state-of-the-art, yielding nearly equal performance to larger multimodal embedding models.
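[5] AIPO: Learning to Reason from Active Interaction cs.CL | cs.AI
Junnan Liu, Linhao Luo, Thuy-Trang Vu, Gholamreza Haffari
TL;DR: The paper proposes AIPO, an enhanced reinforcement learning framework that improves LLM reasoning through active multi-agent interaction during exploration. When the policy model hits a reasoning bottleneck, it can proactively consult three functional collaborative agents and receive fine-grained, targeted guidance, actively expanding its capability boundary during training. After training, the policy model reasons independently without the collaborative agents.
Details
Motivation: Exploration in existing reinforcement learning with verifiable rewards is constrained by the inherent capability boundary of the policy model. Methods that bring in external expert demonstrations usually rely on complete trajectory-level guidance, which is sample-inefficient, information-sparse, and may confine exploration to a static guidance space.
Result: Extensive experiments on reasoning benchmarks including AIME, MATH500, GPQA-Diamond, and LiveCodeBench show that AIPO consistently improves reasoning performance, generalizes robustly across policy models and RLVR algorithms, and effectively expands the policy model's reasoning capability boundary.
Insight: The novelty is bringing active multi-agent interaction into the reinforcement learning loop: the policy model seeks fine-grained guidance at its bottlenecks and thereby expands its capability boundary dynamically. A tailored importance sampling coefficient and clipping strategy mitigate the off-policy bias and vanishing gradients that arise when learning from agent-provided feedback.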
Abstract: Recent advances in large language models (LLMs) have demonstrated remarkable reasoning capabilities, largely stimulated by Reinforcement Learning with Verifiable Rewards (RLVR). However, existing RL algorithms face a fundamental limitation: their exploration remains largely constrained by the inherent capability boundary of the policy model. Although recent methods introduce external expert demonstrations to extend this boundary, they typically rely on complete trajectory-level guidance, which is sample-inefficient, information-sparse, and may confine exploration to a static guidance space. Inspired by the potential of multi-agent systems, we propose $\textbf{AIPO}$, an enhanced reinforcement learning framework that improves LLM reasoning through active multi-agent interaction during exploration. Specifically, AIPO enables the policy model to proactively consult three functional collaborative agents, $\textit{Verify Agent}$, $\textit{Knowledge Agent}$, and $\textit{Reasoning Agent}$, when encountering reasoning bottlenecks, thereby receiving fine-grained and targeted guidance to actively expand its capability boundary during training. We further introduce a tailored importance sampling coefficient together with a clipping strategy to mitigate the off-policy bias and gradient vanishing issues that arise when learning from agent-provided feedback. After training, the policy model performs reasoning independently without relying on collaborative agents. Extensive experiments on diverse reasoning benchmarks, including AIME, MATH500, GPQA-Diamond, and LiveCodeBench, show that AIPO consistently improves reasoning performance, generalizes robustly across different policy models and RLVR algorithms, and effectively expands the reasoning capability boundary of the policy model.
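[6] Built Environment Reasoning from Remote Sensing Imagery Using Large Vision–Language Models cs.CL | cs.AI | cs.CV | cs.ET
Dongdong Wang, Deepak Balakrishnan, Ravi Srinivasan, Shenhao Wang
TL;DR: The study explores using large vision-language models to reason about the built environment from remote sensing imagery, covering design suggestions, constructability assessment, land-use patterns, and risk identification. It feeds imagery at multiple spatial scales to the models to measure the effect on built-environment reasoning, and compares the accuracy and reliability of state-of-the-art models such as InternVL and Qwen when generating recommendations.
Details
Motivation: To determine how remote sensing imagery and large language models can be used to analyze the urban built environment in support of smart-city decision-making and planning.
Result: The study shows the potential of combining remote sensing imagery with large language models to assist smart-city decision-making, assessed by comparing state-of-the-art models such as InternVL and Qwen on accuracy and reliability.
Insight: The novelty is using multi-scale remote sensing imagery as input to multimodal language models for built-environment reasoning and systematically evaluating several advanced models on this task, which opens a new direction for urban analysis that combines remote sensing and AI.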
Abstract: This work investigates the use of large language models (LLMs) for tasks in smart cities. The core idea is to leverage remote sensing imagery to characterize the built environment, including design suggestions, constructability assessment, land-use patterns, and risk identification. We examine remote sensing imagery at multiple spatial scales as inputs for multimodal language modeling and evaluate their effects on built-environment-related reasoning. In addition, we compare state-of-the-art LLMs, including InternVL and Qwen, in terms of accuracy and reliability when generating built environment recommendations. The results demonstrate the potential of integrating remote sensing imagery with large language models to assist smart cities and decision-making.
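[7] Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks cs.CL | cs.AI
Ramon Pires, Thales Sales Almeida, Celio Larcher Junior, Giovana Bonás, Hugo Abonizio
TL;DR: The paper introduces Magis-Bench, a new benchmark for evaluating large language models on magistrate-level legal tasks. It is built from 74 questions drawn from Brazilian competitive examinations for judicial positions held between 2023 and 2025, including discursive legal-analysis questions with multi-turn structure and practical exercises that require writing complete civil and criminal judicial sentences. Using an LLM-as-a-judge methodology with four frontier models as evaluators, the authors assess 23 advanced LLMs. Even the best model (Gemini-3-Pro-Preview) averages below 70% of the maximum score, indicating that judicial-level legal reasoning and writing remain challenging for current LLMs.
Details
Motivation: Existing legal AI benchmarks focus on tasks where models produce legal arguments or documents. The ability to judge such arguments (weighing competing claims, applying doctrine to facts, and rendering reasoned decisions) is just as essential to a functioning legal system. The paper fills this gap by evaluating LLMs on magistrate-level legal writing tasks.
Result: 23 state-of-the-art LLMs were evaluated on Magis-Bench. Under the LLM-as-a-judge methodology, inter-judge agreement is high (Kendall's W = 0.984; pairwise Kendall's τ ≥ 0.897). Google's Gemini-3-Pro-Preview achieves the highest average score (6.97/10), followed by Gemini-3-Flash-Preview (6.67) and Claude-4.5-Opus (6.46). All models score below 70% of the maximum.
Insight: The novelty is a benchmark focused on magistrate-level legal reasoning and writing rather than advocacy, filling a gap in legal AI evaluation. The core observation is that a working legal system needs the ability to judge and decide, not only the ability to generate arguments. Because the benchmark is built from real competitive examination questions, it is realistic and demanding, and it reveals that current top LLMs still fall well short on complex, specialized judicial decision-making.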
Abstract: Existing benchmarks for legal AI focus primarily on tasks where LLMs must produce legal arguments or documents, yet the capacity to \emph{judge} such arguments – weighing competing claims, applying doctrine to facts, and rendering reasoned decisions – is arguably as fundamental to a well-functioning legal system as advocacy itself. We introduce Magis-Bench, a benchmark for evaluating LLMs on magistrate-level writing tasks derived from recent Brazilian competitive examinations for judicial positions. Magis-Bench comprises 74 questions from eight examinations conducted between 2023 and 2025, including discursive legal analysis questions with multi-turn structure and practical exercises requiring the composition of complete civil and criminal judicial sentences. We evaluate 23 state-of-the-art LLMs using an LLM-as-a-judge methodology with four independent frontier models as evaluators. Our results show strong inter-judge agreement (Kendall’s $W = 0.984$; pairwise Kendall’s $\tau \ge 0.897$), with Google’s Gemini-3-Pro-Preview achieving the highest average score (6.97/10), followed by Gemini-3-Flash-Preview (6.67) and Claude-4.5-Opus (6.46). Even the best-performing models score below 70% of the maximum, indicating that judicial-level legal reasoning and writing remain challenging for current LLMs. We release the complete benchmark, model outputs, and evaluation code to support further research on legal AI capabilities.
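For readers unfamiliar with the agreement statistic quoted above, Kendall's W for m judges ranking n items without ties is W = 12S / (m²(n³ − n)), where S is the sum of squared deviations of the per-item rank sums from their mean. The sketch below illustrates that textbook formula on toy rankings; it is not the benchmark's evaluation code, and the function name and data are made up for the example.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance for m complete rankings of n items (no ties).

    rankings: list of m lists; rankings[j][i] is the rank (1..n) judge j gives item i.
    """
    m = len(rankings)
    n = len(rankings[0])
    rank_sums = [sum(judge[i] for judge in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2                      # expected rank sum per item
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Toy data: three judges in full agreement, and two judges in exact opposition.
w_agree = kendalls_w([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]])
w_oppose = kendalls_w([[1, 2, 3], [3, 2, 1]])
```

W ranges from 0 (no concordance) to 1 (identical rankings), so the reported 0.984 means the four judge models order the 23 evaluated systems almost identically.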
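[8] Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment cs.CL | cs.AI
I. F. Atasoy, B. Mutlu, E. A. Sezer, A. Wahdan
TL;DR: The study focuses on contextual hallucination detection in summarization. Comparing the original human annotations of QAGS-C and SummEval with reason- and span-based predictions from Gemini 2.5 Flash and GPT-5 Mini reveals systematic divergences. To resolve them, the authors introduce a human adjudication process in which two cross-cultural adjudicators re-evaluate the conflicted samples. After adjudication, both triple agreement among human, GPT, and Gemini and model accuracy increase, and adjudicators often side with the models when the models provide explicit reasoning.
Details
Motivation: Hallucination remains a serious problem for large language models in context-grounded settings such as RAG and agentic systems. The single-pass human annotations in existing benchmarks such as QAGS-C and SummEval may be insufficient for ambiguity-prone tasks, leading to underestimates of model performance.
Result: On QAGS-C, triple agreement rises by 6.38%, GPT accuracy by 4.25%, and Gemini accuracy by 8.51%. On SummEval, triple agreement rises by 7.62%, GPT accuracy by 2.34%, and Gemini accuracy by 3.80%. Overall human adjudicator agreement ranges from 83% to 87%.
Insight: The contribution is proposing and validating a model-assisted human adjudication framework. For ambiguity-prone tasks, re-evaluating labels with the help of LLM reasoning yields more reliable benchmarks, a methodology that may transfer to hallucination detection and to NLP evaluation more broadly.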
Abstract: Hallucination remains a persistent challenge in Large Language Models (LLMs), particularly in context-grounded settings such as RAG and agentic AI systems. This study focuses on contextual hallucination detection in summarization tasks. We analyze the QAGS-C and SummEval datasets by comparing original benchmark annotations with reason and span-based predictions from Gemini 2.5 Flash and GPT-5 Mini. To address systematic divergences between human labels and LLM judgments, we re-evaluated all conflicted samples through a human adjudication process involving 2 cross-cultural adjudicators. Following this re-evaluation, triple agreement (between human, GPT, and Gemini) increased by 6.38% for QAGS-C and 7.62% for SummEval. Similarly, model accuracy improved, with GPT increasing by 4.25% on QAGS-C and 2.34% on SummEval, while Gemini showed gains of 8.51% and 3.80%, respectively. Notably, adjudicators frequently sided with the models’ judgments over original human annotations when LLMs provided explicit reasoning. Overall human adjudicator agreement ranged between 83% and 87%. These findings suggest that for ambiguity-prone tasks, single-pass annotations may be insufficient, and model-assisted re-evaluation yields more reliable benchmarks.
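The triple-agreement and accuracy figures above are plain proportions over the labeled samples. The sketch below shows how such numbers could be computed before and after adjudication; the labels are invented toy data, and `triple_agreement` and `accuracy` are hypothetical helper names, not the authors' code.

```python
def triple_agreement(human, gpt, gemini):
    """Fraction of samples on which all three label sources agree."""
    hits = sum(1 for h, a, b in zip(human, gpt, gemini) if h == a == b)
    return hits / len(human)

def accuracy(pred, gold):
    """Fraction of predictions matching the reference labels."""
    return sum(1 for p, g in zip(pred, gold) if p == g) / len(gold)

# Toy labels (1 = hallucinated, 0 = faithful); purely illustrative.
original    = [1, 0, 1, 1]   # original benchmark annotations
adjudicated = [1, 0, 0, 1]   # labels after human adjudication of conflicts
gpt         = [1, 0, 0, 1]
gemini      = [1, 1, 0, 1]

before = triple_agreement(original, gpt, gemini)
after = triple_agreement(adjudicated, gpt, gemini)
gpt_gain = accuracy(gpt, adjudicated) - accuracy(gpt, original)
```

In this toy case the models were marked wrong on the third sample only because the original annotation was wrong, which is the kind of underestimate the paper investigates.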
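[9] PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents cs.CL | cs.AI | cs.LG
Mehmet Iscan
TL;DR: The paper presents PYTHALAB-MERA, a lightweight external controller for local validation-conditioned code generation. It wraps a frozen large language model and raises the success rate on reinforcement-learning coding tasks with strict validation gates through validation-grounded episodic memory, adaptive retrieval-action selection, delayed credit assignment, and structural skill reuse.
Details
Motivation: Local LLM-based coding agents work in settings that require execution feedback, persistent state, and bounded repair. Static retrieval, long-context prompting, self-refinement, execution-feedback repair, and reinforcement learning over model weights each cover part of this setting, but none jointly provides validation-grounded memory, adaptive retrieval, delayed credit assignment, and skill reuse around a frozen model. The paper targets this combined problem.
Result: In the hard RL coding setting (three tasks, three repetitions, three-attempt budget), PYTHALAB-MERA passed 8/9 strict validations, while the self-refinement baseline and the investigated GRACE extension each passed 0/9. This shows that, in this specific setting, the external memory-and-retrieval controller improved validation success.
Insight: The novelty is a lightweight external controller architecture that treats the frozen LLM as a proposer, converts validation outcomes into bounded shaped rewards, and assigns delayed credit through TD(λ)-style eligibility traces. This yields validation-grounded memory management and skill reuse, offering a bounded but effective control paradigm for local code-generation agents.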
Abstract: Local LLM-based coding agents increasingly work in settings where correctness is earned through execution feedback, persistent state, and bounded repair, not through a single fluent answer. Static retrieval, long-context prompting, self-refinement, execution-feedback repair, and reinforcement learning over model weights each address part of this setting, but they do not jointly provide validation-grounded episodic memory, adaptive retrieval-action selection, delayed credit assignment, and structural skill reuse around a frozen local model. We introduce PYTHALAB-MERA, a lightweight external controller for local validation-conditioned code generation. The frozen language model proposes complete source files; the controller decides which memory records and AST-derived skills should enter the next prompt, validates each candidate through a fail-fast pipeline, converts validation outcomes into bounded shaped rewards, and propagates delayed credit through TD(lambda)-style eligibility traces. We evaluate the implementation as a local CLI artifact on reinforcement-learning coding tasks with strict validation gates. In the measured hard RL setting with three tasks, three repetitions, and a three-attempt budget, PYTHALAB-MERA passed 8/9 strict validations; the self-refinement baseline and the investigated GRACE extension each passed 0/9. These results support a deliberately bounded claim: in this recorded setting, the external memory-and-retrieval controller improved validation success. They do not establish general-purpose code synthesis, state-of-the-art performance, formal program correctness, or formal safety.
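To make the delayed-credit idea concrete, here is a simplified eligibility-trace update over memory records. It is a generic sketch in the spirit of TD(λ), not PYTHALAB-MERA's actual update rule: the function name, the record keys, the clipping range, and the use of the raw reward in place of a full TD error are all assumptions made for illustration.

```python
def update_credit(values, traces, used, reward, alpha=0.5, gamma=1.0, lam=0.5):
    """One controller step: decay old traces, mark records used now, spread the reward.

    values: record -> learned usefulness score
    traces: record -> eligibility (how recently or often the record was used)
    used:   records that entered the prompt on this attempt
    """
    reward = max(-1.0, min(1.0, reward))      # bounded shaped reward (assumed range)
    for key in traces:
        traces[key] *= gamma * lam            # older usage counts for less
    for key in used:
        traces[key] = traces.get(key, 0.0) + 1.0
    for key, eligibility in traces.items():
        # Simplification: the reward is used directly instead of a TD error.
        values[key] = values.get(key, 0.0) + alpha * reward * eligibility
    return values, traces

values, traces = {}, {}
update_credit(values, traces, used=["rec_a"], reward=0.0)   # attempt 1 fails validation
update_credit(values, traces, used=["rec_b"], reward=5.0)   # attempt 2 passes; reward clipped to 1.0
```

After the second call, `rec_a` still receives partial credit even though it was only used on the earlier, failed attempt, which is what propagating delayed credit means here.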
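[10] Do Agents Need to Plan Step-by-Step? Rethinking Planning Horizon in Data-Centric Tool Calling cs.CL
Naoki Otani, Nikita Bhutani, Hannah Kim, Dan Zhang, Estevam Hruschka
TL;DR: The paper revisits planning strategies for LLM agents on data-centric tasks and questions whether step-by-step execution (single-step planning) needs to be the default. Comparing full-horizon (FH) and single-step horizon (SH) planning, it finds that on well-defined data-centric tasks, FH planning with lazy replanning matches SH accuracy while using 2-3x fewer tokens.
Details
Motivation: LLM agents solving complex data-centric tasks must call tools over external data sources precisely. The paper examines the efficiency of the two existing planning strategies (full-horizon and single-step) and re-evaluates the common assumption that eager step-by-step execution monitoring is necessary for adaptability.
Result: Experiments on Knowledge Base Question Answering (KBQA) and Multi-hop QA show that FH planning with lazy replanning reaches accuracy parity with SH across varying task depths, breadths, and tool robustness levels, while using 2-3x fewer tokens.
Insight: The main contribution is a controlled study that isolates planning horizon as the key architectural feature. It shows that for well-defined data-centric tasks, eager step-wise monitoring is often unnecessary, and full-horizon planning with on-demand replanning can be a more efficient default. This challenges a common assumption and informs agent architecture design.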
Abstract: Explicit planning is a critical capability for LLM-based agents solving complex data-centric tasks, which require precise tool calling over external data sources. Existing strategies fall into two paradigms based on planning horizon: (1) full-horizon (FH), which generates a complete plan before execution, and (2) single-step horizon (SH), which interleaves each action (tool call) with incremental reasoning and observation. While step-by-step execution is a common default under the assumption that eager execution monitoring is necessary for adaptability, we revisit this assumption for well-defined data-centric tasks. Our controlled empirical study isolates planning horizon as the key architectural feature and systematically analyzes the effects of topological complexity and tool robustness on both paradigms. Our experiments across Knowledge Base Question Answering and Multi-hop QA show that FH planning with lazy replanning achieves accuracy parity with SH across varying depths, breadths, and robustness levels, while using 2-3x fewer tokens. These findings suggest that for well-defined data-centric tasks, eager step-wise monitoring is often unnecessary, and full-horizon planning with on-demand replanning can offer a more efficient default.
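The full-horizon strategy with lazy replanning can be sketched as a loop that calls the planner once up front and again only when a tool call fails. This is an illustrative sketch, not the paper's agent: `run_full_horizon`, `toy_planner`, and the two toy tools are hypothetical, and a real planner would be an LLM call.

```python
def run_full_horizon(task, planner, tools, max_replans=2):
    """Execute a complete plan; invoke the planner again only when a step fails."""
    plan = planner(task, [], None)            # single up-front planning call
    planner_calls, done = 1, []
    while len(done) < len(plan):
        name, arg = plan[len(done)]
        try:
            result = tools[name](arg)
        except Exception as err:              # lazy replanning: triggered by failure only
            if planner_calls > max_replans:
                raise
            plan = plan[:len(done)] + planner(task, done, err)
            planner_calls += 1
            continue
        done.append(result)
    return done, planner_calls

def search(query):                            # toy tool that rejects a misspelled query
    if query != "capital of France":
        raise KeyError(query)
    return "Paris"

def toy_planner(task, done, error):           # stand-in for an LLM planner
    query = "capitol of France" if error is None else "capital of France"
    return [("search", query), ("upper", "paris")][len(done):]

results, calls = run_full_horizon("demo", toy_planner, {"search": search, "upper": str.upper})
```

A single-step agent would call the model before every tool call, whereas this loop pays for extra planning only on failure, which is one intuition for the token savings reported above.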
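[11] A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models cs.CL
Zeru Shi, Zhenting Wang, Fan Yang, Qifan Wang, Ruixiang Tang
TL;DR: The paper studies the origin of massive activations in large language models and identifies a specific layer, consistently present across model families, called the Massive Emergence Layer (ME Layer): the layer where massive activations first emerge before propagating to deeper layers through residual connections. The RMSNorm and feed-forward network parameters in the ME Layer jointly produce the massive activations. Once formed, the massive-activation token representation stays largely invariant across subsequent layers, which reduces the diversity of hidden representations passed to the attention module. To address this, the authors propose a simple and effective method that reduces the rigidity of the massive-activation token. It improves performance on tasks such as instruction following and math reasoning in both training-free and fine-tuning settings, and mitigates attention sinks by selectively weakening their influence.
Details
Motivation: To investigate where massive activations in large language models come from and how they may hurt model performance, in particular by reducing the diversity of hidden representations.
Result: The proposed method consistently improves LLM performance on multiple tasks, including instruction following and math reasoning, in both training-free and fine-tuning settings, and effectively mitigates attention sinks.
Insight: The novelty is identifying the ME Layer, present across models, as the source of massive activations and showing that RMSNorm and FFN parameters jointly cause them. Reducing the rigidity of the massive-activation token to improve performance offers a principled, hidden-state-level perspective on understanding and mitigating internal phenomena such as attention sinks.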
Abstract: We investigate the origins of massive activations in large language models (LLMs) and identify a specific layer, named the \textbf{Massive Emergence Layer (ME Layer)}, that is consistently observed across model families, where massive activations first emerge and subsequently propagate to deeper layers through residual connections. We show that, within the ME Layer, the RMSNorm and the FFN parameters jointly contribute to the emergence of massive activations. Once formed, the massive activation token representation remains largely invariant across layers, reducing the diversity of hidden representations passed to the attention module. Motivated by this limitation, we propose a simple and effective method to reduce the rigidity of the massive activation token. Our approach consistently improves LLM performance across multiple tasks, including instruction following and math reasoning, in both training-free and fine-tuning settings. Moreover, we show that our method mitigates attention sinks by selectively weakening their influence, elucidating their origin at the hidden state level and shedding new light on principled mitigation strategies.
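[12] AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators cs.CL | cs.AI | cs.LG
Aritra Mazumder, Shubhashis Roy Dipta, Nusrat Jahan Lia, Tanzila Khan, Kainat Raisa Hossain
TL;DR: The paper introduces AgentCollabBench, a diagnostic benchmark of 900 human-validated tasks for evaluating process-level risks in multi-agent collaboration. It reveals model-specific vulnerabilities that outcome-only evaluation cannot detect and shows that communication topology has a critical effect on information integrity.
Details
Motivation: Multi-agent systems achieve state-of-the-art results through collaboration, but when an agent in the pipeline silently drops a constraint, the final output may look correct even though the reasoning chain has been quietly corrupted. Existing outcome-based evaluations cannot detect such multi-hop process failures.
Result: Evaluating four modern LLMs (GPT 4.1 mini, Gemini 2.5 Flash Lite, Qwen-3.5-35B-A3B, and Llama 3.1 8B Instruct) exposes model-specific vulnerability profiles that outcome-only evaluation misses. For example, Qwen-3.5-35B-A3B leads on tracer durability and instruction stability, while GPT 4.1 mini leads on leakage containment and false-belief resistance. Communication topology is a primary risk factor, explaining 7-40% of the variance in multi-hop information survival.
Insight: The paper identifies and systematically evaluates four behavioral risks in multi-agent collaboration (instruction decay, false-belief contagion, context leakage, and tracer durability). It shows that communication topology, in particular the synthesis bottleneck at converging-DAG nodes, is a fundamental structural factor in multi-agent reliability, so scaling model intelligence alone is no substitute for architecture.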
Abstract: Multi-agent systems achieve state-of-the-art outcomes through peer collaboration. However, when an agent in the pipeline silently drops a constraint, the system’s final output may look correct even though the reasoning chain was quietly corrupted, and existing outcome-based evaluations are blind to such multi-hop process failures. To make these vulnerabilities measurable before deployment, we introduce AgentCollabBench, a diagnostic benchmark of 900 human-validated tasks spanning software engineering, DevOps, and data engineering. Each task isolates one of four behavioral risks: instruction decay (does a constraint survive peer pressure?), false-belief contagion (does a falsehood spread through consensus?), context leakage (does information bleed between tasks?), and tracer durability (does marked data reach the final agent?). Evaluating four modern LLMs (GPT 4.1 mini, Gemini 2.5 Flash Lite, Qwen-3.5-35B-A3B, and Llama 3.1 8B Instruct), we expose model-specific vulnerability profiles invisible to outcome-only evaluation; Qwen-3.5-35B-A3B, for example, leads on tracer durability and instruction stability, while GPT 4.1 mini leads on leakage containment and false-belief resistance. Beyond per-model differences, communication topology emerges as a primary risk factor that explains 7-40% of the variance in multi-hop information survival. The effect traces to a synthesis bottleneck specific to converging-DAG nodes: an agent weighing competing parent inputs discards constraints carried by a minority branch, a bottleneck structurally absent from linear chains. AgentCollabBench demonstrates that suboptimal topology can silently erase the safeguards of highly capable models, arguing that multi-agent reliability is fundamentally a structural problem and that scaling model intelligence alone is no substitute for architecture.
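[13] Hint Tuning: Less Data Makes Better Reasoners cs.CL
Siqi Fan, Minghao Li, Xiaoqian Ma, Xiusheng Huang, Zhuo Chen
TL;DR: The paper proposes Hint Tuning to address the tendency of large reasoning models to over-generate tokens through verbose chain-of-thought. Using the corresponding instruct model as a difficulty probe, it automatically builds training data in three states (No-Hint, Sparse-Hint, and Full-Hint), efficiently teaching the model to calibrate its reasoning depth. With only 1K self-annotated samples, it achieves substantial token reduction on several mainstream reasoning models while maintaining accuracy.
Details
Motivation: Large reasoning models reach high accuracy through extended chain-of-thought, but they generate far more tokens than necessary (typically 5–8× more) regardless of problem difficulty, which makes inference inefficient. The paper targets this mismatch between reasoning depth and problem difficulty and seeks a data-efficient way to calibrate the model's reasoning.
Result: On mainstream reasoning models including Qwen3-Thinking and DeepSeek-R1-Distill (4B to 32B), Hint Tuning reduces tokens by 24% to 66% (31.5% on average) while keeping accuracy competitive on five benchmarks.
Insight: The core idea is to turn the abstract problem of difficulty labeling into a consistency check between the instruct model and the reasoning model, so that training data can be built automatically. The method avoids massive distillation datasets and expensive reinforcement learning, gaining efficiency through simple alignment with the instruct model's capabilities. Its data efficiency (only 1K samples) and generality across models and scales are notable strengths.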
Abstract: Large reasoning models achieve high accuracy through extended chain-of-thought but generate 5–8× more tokens than necessary, applying verbose reasoning uniformly regardless of problem difficulty. We propose Hint Tuning, a data-efficient approach that teaches models to calibrate reasoning depth. Our key insight: the corresponding instruct model serves as an ideal difficulty probe. By testing what the instruct model can solve with varying guidance, we automatically construct training data across three states: No-Hint (direct answer), Sparse-Hint (minimal prefix), and Full-Hint (complete reasoning). This converts the abstract challenge of difficulty labeling into a measurable consistency check between the instruct and reasoning models. With only 1K self-annotated samples, Hint Tuning achieves 24–66% token reduction (31.5% average) across mainstream reasoning models (Qwen3-Thinking, DeepSeek-R1-Distill) at multiple scales (4B–32B) while maintaining competitive accuracy on five benchmarks. Unlike methods requiring massive distillation datasets or expensive RL, we achieve superior efficiency through simple alignment with the instruct model’s capabilities.
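The three-state labeling described above can be sketched as a probe loop over an instruct model. This is an illustration only: `label_state`, `toy_solver`, and the `sparse_frac` cut-off are hypothetical, and the abstract does not state how the minimal prefix is chosen.

```python
def label_state(question, reasoning, answer, instruct_solve, sparse_frac=0.2):
    """Assign No-Hint / Sparse-Hint / Full-Hint by probing the instruct model.

    reasoning: list of reasoning steps from the reasoning model
    instruct_solve(question, hint) -> answer string from the instruct model (stubbed below)
    sparse_frac: assumed fraction of the reasoning used as the minimal prefix
    """
    if instruct_solve(question, "") == answer:
        return "No-Hint"                      # solvable directly
    cut = max(1, int(len(reasoning) * sparse_frac))
    prefix = " ".join(reasoning[:cut])
    if instruct_solve(question, prefix) == answer:
        return "Sparse-Hint"                  # solvable with a short prefix
    return "Full-Hint"                        # needs the complete reasoning

def toy_solver(question, hint):               # stand-in for the instruct model
    if question == "2+2":
        return "4"
    if question == "17*23" and "17*20" in hint:
        return "391"
    return "?"

easy = label_state("2+2", ["Add."], "4", toy_solver)
medium = label_state("17*23", ["Split: 17*20 + 17*3.", "340 + 51.", "391."], "391", toy_solver)
hard = label_state("open problem", ["Step 1.", "Step 2."], "x", toy_solver)
```

Each label then decides how much reasoning the training target for that question should contain.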
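[14] Structured Recurrent Mixers for Massively Parallelized Sequence Generation cs.CL | cs.LG
Benjamin L. Badger
TL;DR: The paper introduces the Structured Recurrent Mixer (SRM), an architecture that uses a sequence-parallel representation at training time and converts to a recurrent representation at inference, combining training efficiency with inference throughput. Experiments show that SRMs outperform other linear-complexity models in training efficiency, input information capacity, and inference throughput and concurrency, and achieve a 30% improvement on the GSM8k benchmark.
Details
Motivation: Traditional recurrent architectures train inefficiently, while non-recurrent models have limited inference throughput. The paper aims to design an architecture that both trains efficiently and runs inference at high throughput.
Result: On GSM8k, the PyTorch implementation of SRMs yields a 30% increase in compute-constant Pass@k. Compared with Transformers on vLLM, the Mojo/MAX inference implementation of SRMs delivers 12x the throughput and 170x the concurrency.
Insight: The novelty is an algebraic conversion that switches between training and inference representations without specialized kernels or device-specific memory management. The architecture combines the strength of recurrent models when scaling in the sample dimension with the efficiency of non-recurrent models in processing sequence length, offering a new approach to large-scale sequence generation.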
Abstract: Over the last two decades, language modeling has experienced a shift from predominantly recurrent architectures that process tokens sequentially during training and inference to non-recurrent models that process sequence elements in parallel during training, which results in greater training efficiency and stability at the expense of lower inference throughput. Here we introduce the Structured Recurrent Mixer, an architecture that allows for algebraic conversion between a sequence-parallel representation at train time and a recurrent representation at inference, notably without the need for specialized kernels or device-specific memory management. We show experimentally that this dual representation allows for greater training efficiency, higher input information capacity, and larger inference throughput and concurrency when compared to other linear-complexity models. We postulate that recurrent models are poorly suited to extended sequence length scaling for information-rich inputs typical of language, but are well suited to scaling in the sample (batch) dimension due to their constant memory per sample. We provide Mojo/MAX inference implementations of SRMs exhibiting 12x the throughput and 170x the concurrency of similarly powerful Transformers run on vLLM; similar increases are characteristic of PyTorch implementations, resulting in a 30% increase in compute-constant GSM8k Pass@k. We conclude by demonstrating that SRMs are effective reinforcement learning training candidates.
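[15] AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems cs.CL | cs.AI | cs.MA
Boxuan Zhang, Jianing Zhu, Zeru Shi, Dongfang Liu, Ruixiang Tang
TL;DR: The paper presents AgentForesight, a framework that reframes failure prediction in multi-agent systems as online auditing. At each step of an unfolding trajectory, the auditor sees only the current prefix and must raise an alarm at the earliest decisive error, which makes deployment-time intervention possible. The authors build the AFTraj-2K dataset and develop AgentForesight-7B, trained with a coarse-to-fine reinforcement learning recipe, which outperforms leading proprietary models such as GPT-4.1 on multiple benchmarks.
Details
Motivation: Existing work treats errors in LLM-based multi-agent systems as post-hoc attribution, diagnosing the responsible party only after the trajectory has ended, which forfeits the chance to intervene during execution. The paper addresses this by predicting failures early enough to support online intervention.
Result: On the authors' AFTraj-2K dataset and the external Who&When benchmark, AgentForesight-7B outperforms leading proprietary models, including GPT-4.1 and DeepSeek-V4-Pro, with up to +19.9% performance gain and 3× lower step localization error.
Insight: The core contribution is shifting failure prediction from post-hoc attribution to online auditing, together with a coarse-to-fine reinforcement learning recipe. A reward that jointly targets the what, where, and who of an audit verdict gives the model precise step-level error localization, which opens the way from failure detection to deployment-time intervention.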
Abstract: LLM-based multi-agent systems are increasingly deployed on long-horizon tasks, but a single decisive error is often accepted by downstream agents and cascades into trajectory-level failure. Existing work frames this as \emph{post-hoc failure attribution}, diagnosing the responsible agent and step after the trajectory has ended. However, this paradigm forfeits any opportunity to intervene while the trajectory is still unfolding. In this work, we introduce AgentForesight, a framework that reframes this problem as online auditing: at each step of an unfolding trajectory, an auditor observes only the current prefix and must either continue the run or alarm at the earliest decisive error, without access to future steps. To this end, we curate AFTraj-2K, a corpus of agentic trajectories across Coding, Math, and Agentic domains, in which safe trajectories are retained under a strict curation pipeline and unsafe trajectories are annotated at the step of their decisive error via consensus among multiple LLM judges. Building on that, we develop AgentForesight-7B, a compact online auditor trained with a coarse-to-fine reinforcement learning recipe that first equips it with a risk-anticipation prior at the failure boundary on adjacent safe/unsafe prefix pairs, then sharpens this prior into precise step-level localization under a three-axis reward jointly targeting the what, where, and who of an audit verdict. Across AFTraj-2K and an external Who&When benchmark, AgentForesight-7B outperforms leading proprietary models, including GPT-4.1 and DeepSeek-V4-Pro, achieving up to +19.9% performance gain and 3$\times$ lower step localization error, closing the loop from post-hoc failure detection to deployment-time intervention. Project page: https://zbox1005.github.io/agent-foresight/
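The online-auditing protocol above amounts to showing an auditor growing prefixes of a trajectory and stopping at its first alarm. The sketch below illustrates that loop and a step-localization error; `audit_online`, the keyword-based `toy_auditor`, the trajectory, and the annotated decisive step are all invented for the example, whereas in the paper the auditor is a trained model.

```python
def audit_online(trajectory, auditor):
    """Show the auditor successive prefixes; return the index of the first alarmed step."""
    for t in range(1, len(trajectory) + 1):
        if auditor(trajectory[:t]) == "alarm":    # the auditor never sees future steps
            return t - 1
    return None                                   # run completed without an alarm

def toy_auditor(prefix):
    """Stand-in auditor: flags a step that accepts an unverified result."""
    return "alarm" if "unverified" in prefix[-1] else "continue"

steps = ["plan task", "call tool", "accept unverified result", "write report"]
alarm_step = audit_online(steps, toy_auditor)
decisive_step = 2                                 # hypothetical annotated label
localization_error = abs(alarm_step - decisive_step)
```

Because the alarm fires before "write report" is produced, a deployment could pause or reroute the run at that point instead of diagnosing the failure afterwards.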
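[16] Breaking the Impasse: Dual-Scale Evolutionary Policy Training for Social Language Agents cs.CL
Minzheng Wang, Run Luo, Yanbo Wang, Zichen Liu, Yuqiao Tan
TL;DR: This paper proposes Dual-scale Evolutionary Policy Training (DEPT) to address the evolution impasse that self-play policies reach in open-ended social language games when behaviors homogenize. A time-scaled evolutionary perception mechanism detects the impasse, and asymmetric advantage reshaping intervenes to restore gradient signals and sustain strategic exploration.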
Details
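Motivation: Although reinforcement learning with verifiable rewards works well on closed-ended tasks, in open-ended social language games the vast strategy space leads agents toward homogenized behaviors. This produces deterministic match outcomes and vanishing gradient signals, and hence an evolution impasse.
Result: Extensive experiments on multiple social language games show that DEPT outperforms strong baselines, avoids policy degeneration, and drives the continued evolution of social language agents.
Insight: The novelty lies in a time-scaled evolutionary perception mechanism that quantifies the impasse, combined with asymmetric advantage reshaping that dynamically adjusts the optimization landscape. Together they break the impasse in open-ended, high-dimensional strategy spaces and encourage diverse strategic exploration.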
Abstract: While Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for closed-ended tasks, extending it to open-ended social language games via self-play reveals a critical issue: evolution impasse. Due to the vast strategy space, language agents frequently converge to homogenized behaviors, leading to deterministic match outcomes that eliminate the gradient signals necessary for policy evolution. To tackle this issue, we propose Dual-scale Evolutionary Policy Training (DEPT) for social language games. DEPT introduces a time-scaled evolutionary perception mechanism that detects impasse by quantifying dual-scale value baseline divergence alongside match entropy. Upon perceiving the collapse, it then activates asymmetric advantage reshaping to dynamically modulate the optimization landscape for intervention. Thus, our method effectively restores gradient signals and enforces sustained strategic exploration. Extensive experiments on multiple social language games demonstrate that DEPT outperforms strong baselines, avoiding policy degeneration and driving the continuous evolution of social language agents.
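As a rough illustration of the two signals the abstract names, match entropy and the divergence between value baselines tracked at two time scales, the sketch below computes both over a window of self-play matches. Everything specific here is an assumption made for illustration: the exponential moving averages, the smoothing factors, the thresholds, and the rule that combines the two signals are not taken from the paper.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def match_entropy(outcomes: Sequence[str]) -> float:
    """Shannon entropy (bits) of the empirical match-outcome distribution."""
    n = len(outcomes)
    counts = Counter(outcomes)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def ema(values: Sequence[float], alpha: float) -> float:
    """Exponential moving average; larger alpha reacts faster."""
    avg = values[0]
    for v in values[1:]:
        avg = alpha * v + (1 - alpha) * avg
    return avg


def impasse_signals(outcomes: Sequence[str],
                    rewards: List[float]) -> Tuple[float, float]:
    """Return (match entropy, |fast baseline - slow baseline|)."""
    divergence = abs(ema(rewards, 0.5) - ema(rewards, 0.05))
    return match_entropy(outcomes), divergence


def looks_like_impasse(outcomes: Sequence[str], rewards: List[float],
                       entropy_floor: float = 0.2,
                       divergence_floor: float = 0.05) -> bool:
    """Hypothetical rule: outcomes are near-deterministic and the two
    baselines have stopped moving apart."""
    entropy, divergence = impasse_signals(outcomes, rewards)
    return entropy < entropy_floor and divergence < divergence_floor
```

A window in which one side always wins has zero entropy and is flagged, whereas an evenly split window has one bit of entropy and is not.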
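[17] Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning cs.CL
Zhengyang Zhao, Lu Ma, Wentao Zhang
TL;DR: This paper proposes On-Policy Harness Self-Distillation (OPHSD), which uses self-distillation to internalize the capabilities of an inference-time external workflow (harness) into the base large language model, improving the model's standalone performance on complex reasoning tasks.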
Details
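Motivation: Inference-time harnesses improve large language models on complex tasks but do not strengthen the model's intrinsic capabilities; the goal is to fold the benefits of the harness permanently into the model.
Result: Evaluated with a draft-verify harness for text classification and a plan-solve harness for mathematical reasoning, OPHSD substantially outperforms strong baselines (e.g., +10.83% over OPSD on HMMT25) and shows strong generalization and standalone performance.
Insight: The novelty is the on-policy harness self-distillation framework, which uses the harness-augmented current model as the teacher so that the extra supervisory signal from the harness is internalized by the student model. The key insight is that a complex harness can serve as a temporary training scaffold whose benefits are permanently fed back into the base model, so the harness no longer needs to be attached at inference time, which may even avoid a performance drop.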
Abstract: Inference-time harnesses substantially improve large language models on complex reasoning tasks. However, the intrinsic capabilities of the underlying model remain unchanged by the addition of these external workflows. To bridge this gap, we introduce On-Policy Harness Self-Distillation (OPHSD), which employs the harness-augmented current model as a teacher for self-distillation, thereby introducing extra supervisory signals from the harness beyond training data. OPHSD internalizes task-specific harness capabilities into the student model, yielding robust generalizability and strong standalone performance across diverse reasoning tasks. Evaluated across draft–verify harness for text classification and plan–solve for mathematical reasoning tasks, OPHSD consistently outperforms strong baselines (e.g., +10.83% over OPSD on HMMT25). Our analysis further indicates that reattaching the harness during inference yields no additional benefits and can even degrade performance, suggesting that complex harnesses need not always be permanent fixtures; instead, they can serve as temporary training scaffolds whose benefits are permanently fed back into the base model. Our code and training data are available at https://github.com/zzy1127/OPHSD-On-Policy-Harness-Self-Distillation.
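[18] Generating Leakage-Free Benchmarks for Robust RAG Evaluation cs.CL | cs.AI
Jiayi Liu, Jiaxing Zhang, Bowen Jin, Jennifer Neville
TL;DR: This paper proposes SeedRG, a semi-synthetic benchmark generation pipeline that mitigates knowledge leakage in RAG evaluation. It extracts reasoning graphs from a seed dataset, generates structurally similar but novel instances through type-constrained entity replacement, and adds a reasoning-graph consistency check and a knowledge-leakage filter to ensure quality.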
Details
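Motivation: Existing RAG benchmarks suffer from knowledge leakage: many questions can be answered from an LLM's parametric memory without retrieval, which makes evaluation unreliable. The problem worsens as benchmarks are reused for training (benchmark aging).
Result: The abstract reports no specific quantitative results or benchmark comparisons.
Insight: The novelty is generating structure-preserving but novel instances through reasoning-graph extraction and type-constrained entity replacement, combined with two verification steps (reasoning consistency and knowledge-leakage filtering), to build a leakage-free, robust RAG evaluation benchmark.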
Abstract: Retrieval-augmented generation (RAG) is widely used to augment large language models (LLMs) with external knowledge. However, many benchmark datasets, designed to test RAG performance, comprise many questions that can already be answered from an LLM’s parametric memory. This leads to unreliable evaluation. We refer to this phenomenon as knowledge leakage: cases where RAG tasks are solvable without retrieval. This issue worsens over time due to benchmark aging. As benchmarks are reused for training, their contents are increasingly absorbed into model parameters, making them less effective for evaluating retrieval. We introduce SeedRG, a semi-synthetic benchmark generation pipeline that mitigates knowledge leakage and addresses the issue of benchmark aging. Starting from a seed benchmark dataset, SeedRG extracts a reasoning graph from question-context pairs to capture their underlying reasoning structure, and then generates new examples via type-constrained entity replacement. This process produces structurally similar but novel instances that are unlikely to exist in the model’s parametric knowledge, while preserving the original reasoning patterns. To ensure quality, we incorporate two verification steps: (1) a reasoning-graph consistency check to maintain task difficulty, and (2) a knowledge-leakage filter to exclude instances answerable without retrieval.
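Type-constrained entity replacement, as described in this abstract, can be sketched on a single question-answer pair. This is only an illustration under stated assumptions: the entity annotations, the type labels, the candidate pool, and the function name `replace_entities` are all invented here, and the sketch omits SeedRG's reasoning-graph extraction and its two verification steps.

```python
import random
from typing import Dict, List, Tuple


def replace_entities(question: str, answer: str,
                     entities: Dict[str, str],
                     pool: Dict[str, List[str]],
                     rng: random.Random) -> Tuple[str, str, Dict[str, str]]:
    """Swap each annotated entity for a different entity of the same type.

    entities maps a surface string to its type; pool maps a type to
    candidate replacements (both are hypothetical inputs)."""
    mapping: Dict[str, str] = {}
    for name, etype in entities.items():
        used = set(mapping.values())
        candidates = [c for c in pool[etype] if c != name and c not in used]
        mapping[name] = rng.choice(candidates)  # same type, new surface form
    for old, new in mapping.items():
        question = question.replace(old, new)
        answer = answer.replace(old, new)
    return question, answer, mapping


pool = {"FILM": ["Zarvane Nights", "The Glass Orchard"],
        "CITY": ["Port Elwick", "Sanmere"]}
entities = {"Inception": "FILM", "London": "CITY"}
new_q, new_a, mapping = replace_entities(
    "In which city was the director of Inception born?", "London",
    entities, pool, random.Random(0))
```

The question keeps its two-hop structure (film, then director, then birthplace) while the film and city are replaced by invented names that a model is unlikely to have memorized.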
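[19] EmoS: A High-Fidelity Multimodal Benchmark for Fine-grained Streaming Emotional Understanding cs.CL
Pengze Guo, Jingxi Liang, Zhiwen Xie, Qifeng Wang, Derek F. Wong
TL;DR: This paper presents EmoS, a high-fidelity bilingual multimodal benchmark dataset that addresses the shortcomings of existing emotion-understanding datasets in ecological validity, signal clarity, and the reliability of fine-grained annotation. It combines strictly filtered static slices with a dynamic Streaming Monologue subset and uses a dual-layer human annotation pipeline to provide trusted ground truth for continuous emotional evolution. Experiments show that fine-tuning multimodal large language models (MLLMs) on EmoS yields significant gains over zero-shot baselines, laying a foundation for training and evaluating future emotion recognition and empathy models.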
Details
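Motivation: Today's high-pressure, aging society urgently needs large-scale emotional models that can provide empathetic support, yet existing benchmarks cannot simultaneously achieve ecological validity, signal clarity, and reliable fine-grained labeling.
Result: Fine-tuning multimodal large language models (MLLMs) on EmoS yields significant gains over zero-shot baselines, providing a foundation for training and evaluating future emotion recognition and empathy models.
Insight: The novelty is combining strictly filtered static slices with a dynamic Streaming Monologue subset under a dual-layer human annotation pipeline, producing a high-fidelity bilingual multimodal emotion-understanding benchmark that offers ecological validity, signal clarity, and reliable fine-grained labels, and thus addresses the limitations of existing datasets.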
Abstract: In the context of today’s high-pressure, aging society, the demand for large-scale emotional models capable of providing empathetic support is more critical than ever. However, existing benchmarks fail to simultaneously achieve ecological validity, signal clarity, and reliable fine-grained labeling. We introduce EmoS, a high-fidelity bilingual benchmark designed to resolve the limitations of ecological validity and noise in existing datasets by combining strictly filtered static slices with a dynamic Streaming Monologue subset. Supported by a rigorous dual-layer human annotation pipeline, EmoS provides trusted ground truth that captures continuous emotional evolution. Empirical results show that fine-tuning MLLMs (multimodal large language models) on EmoS yields significant gains over zero-shot baselines, laying the foundation for the training and evaluation of future emotion recognition models and empathy models. The dataset and code are publicly available at https://github.com/NLP2CT/EmoS.
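[20] DocScope: Benchmarking Verifiable Reasoning for Trustworthy Long-Document Understanding cs.CL | cs.CV
Xiang Feng, Jiawei Zhou, Zhangfeng Huang, Kewei Wang, Shanshan Ye
TL;DR: This paper introduces DocScope, a benchmark for evaluating whether multimodal large language models reason in a trustworthy, verifiable way over long documents. It casts long-document QA as structured reasoning trajectory prediction with four evaluation stages (page localization, region grounding, fact extraction, and answer verification) and contains 1,124 questions. Experiments show that answer accuracy alone cannot substitute for trajectory-level evaluation and that region grounding is the weakest stage for models.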
Details
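Motivation: Existing evaluation focuses mainly on end-to-end answer accuracy and cannot measure whether multimodal large language models reason in a trustworthy, verifiable way over long, visually rich documents, so a finer-grained evaluation method is needed.
Result: Six proprietary models, 12 open-weight models, and several domain-specific systems were evaluated on DocScope. Even among correct answers, the highest rate of complete evidence chains is only 29%. Region grounding is the weakest stage for all models. Aggregating evidence is the main difficulty, while faithful perception and fact extraction are the dominant capability bottleneck. Activated parameter count matters more than total scale.
Insight: The novelty is framing long-document QA as structured reasoning trajectory prediction and designing a four-stage decoupled evaluation protocol that audits each reasoning level independently. Viewed objectively, the approach offers a fine-grained framework for assessing model interpretability and trustworthiness, underscores the importance of complete evidence chains, and exposes the specific bottlenecks models face in long-document processing.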
Abstract: Evaluating whether Multimodal Large Language Models can produce trustworthy, verifiable reasoning over long, visually rich documents requires evaluation beyond end-to-end answer accuracy. We introduce DocScope, a benchmark that formulates long-document QA as a structured reasoning trajectory prediction problem: given a complete PDF document and a question, the model outputs evidence pages, supporting evidence regions, relevant factual statements, and a final answer. We design a four-stage evaluation protocol – Page Localization, Region Grounding, Fact Extraction, and Answer Verification – that audits each level of the trajectory independently through inter-stage decoupling, with all judges selected and calibrated via human alignment studies. DocScope comprises 1,124 questions derived from 273 documents, with all hierarchical evidence annotations completed by human annotators. We benchmark 6 proprietary models, 12 open-weight models, and several domain-specific systems. Our experiments reveal that answer accuracy cannot substitute for trajectory-level evaluation: even among correct answers, the highest observed rate of complete evidence chains is only 29%. Across all models, region grounding remains the weakest trajectory stage. Furthermore, the primary difficulty stems from aggregating evidence dispersed across long distances and multiple document clusters, while an oracle study identifies faithful perception and fact extraction as the dominant capability bottleneck. Cross-architecture comparisons further suggest that activated parameter count matters more than total scale. The benchmark and code will be publicly released at https://github.com/MiliLab/DocScope.
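[21] FragileFlow: Spectral Control of Correct-but-Fragile Predictions for Foundation Model Robustness cs.CL | cs.AI | cs.LG
Zhuoyun Li, Boxuan Wang, Jinwei Hu, Xiaowei Huang, Yi Dong
TL;DR: This paper proposes FragileFlow, a plug-in regularizer for improving foundation model robustness. It uses a calibrated margin buffer to identify correct-but-fragile predictions and organizes the predicted probability flow into a class-wise vulnerable-risk matrix. The aim is to control the probability mass that, under perturbation, flows from true classes toward systematic wrong competitors, and thereby to improve worst-class robustness.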
Details
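Motivation: Robustness evaluation of LLMs and VLMs usually relies on average accuracy or average consistency, but these metrics can hide a structured failure mode: a prediction remains correct while probability mass has already flowed from particular true classes toward systematic wrong competitors near the decision boundary. The paper aims to formalize and mitigate this "margin-aware error flow".
Result: Experiments on multiple-choice LLM benchmarks and few-shot CLIP adaptation show that FragileFlow consistently beats matched baselines on the proposed theory-facing risk measures, improves perturbed worst-class accuracy in most settings, and preserves clean accuracy across all comparisons.
Insight: The core novelty is formalizing the notion of margin-aware error flow and giving the first PAC-Bayes upper bound for this object, showing that empirical spectral control is a conservative route to deterministic worst-class robustness under a stability condition. Methodologically, analyzing and managing probability flow through a vulnerable-risk matrix offers a new perspective on robustness regularization.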
Abstract: Robust adaptation of LLMs and VLMs is often evaluated by average accuracy or average consistency under perturbations. However, these averages can hide a structured failure mode: a prediction may remain correct while probability mass already flows from particular true classes toward systematic wrong competitors near the decision boundary. In this paper, we formalize this phenomenon as margin-aware error flow and introduce FragileFlow, a plug-in regularizer that uses a calibrated margin buffer to identify correct-but-fragile predictions and organize their off-class probability mass into a class-wise vulnerable-risk matrix. Theoretically, we provide the first PAC-Bayes upper bound for this margin-aware error-flow object, showing how empirical spectral control yields a conservative route to deterministic worst-class robustness under a stability condition. Experiments on multiple-choice LLM benchmarks and few-shot CLIP adaptation show that FragileFlow consistently improves the proposed theory-facing risk measures over matched baselines, yields perturbed worst-class accuracy gains in most settings, and preserves clean accuracy across comparisons.
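[22] Decomposing and Steering Functional Metacognition in Large Language Models cs.CL
Yanshi Li, Xueru Bai, Shuman Liu, Haibo Zhang, Anxiang Zeng
TL;DR: This paper proposes that large language models (LLMs) maintain a decomposable space of functional metacognitive states that encode factors such as evaluation awareness, self-assessed capability, and perceived risk. Using residual stream analysis and activation steering, it shows that these states are linearly decodable from internal activations and causally modulate the model's reasoning behavior in dissociable ways, affecting verbosity, accuracy, and safety.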
Details
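Motivation: LLMs exhibit "evaluation awareness" in evaluation settings. The paper asks whether this is a single behavioral artifact or reflects deeper internal structure, and how such internal states affect performance measurement.
Result: Residual stream analysis across multiple reasoning models shows that functional metacognitive states are linearly decodable from internal activations and have distinct layer-wise profiles. Activation steering experiments show that manipulating these states causally and dissociably affects model behavior across tasks (e.g., verbosity, accuracy, and safety-related responses).
Insight: The novelty is conceptualizing LLM evaluation awareness as a decomposable, steerable space of internal states, and providing a methodological framework, based on residual stream analysis and activation steering, for studying and controlling these states mechanistically. This matters for reliable evaluation and deployment of reasoning models.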
Abstract: Large language models (LLMs) increasingly exhibit behaviors suggesting awareness of their evaluation context, often adapting their reasoning strategies in benchmark settings. Prior work has shown that such evaluation awareness can distort performance measurements; however, it remains unclear whether this phenomenon reflects a single behavioral artifact or a deeper internal structure within the model. We propose that LLMs maintain a decomposable space of functional metacognitive states: internal variables encoding factors such as evaluation awareness, self-assessed capability, perceived risk, computational effort allocation, audience expertise adaptation, and intentionality. Through residual stream analysis across multiple reasoning models, we demonstrate that these states are linearly decodable from internal activations and exhibit distinct layer-wise profiles. Moreover, by steering model activations along probe-derived directions, we show that each functional metacognitive state causally modulates reasoning behavior in dissociable ways, affecting verbosity, accuracy, and safety-related responses across tasks. Our findings suggest that benchmark performance reflects not only task competence but also the activation of specific functional metacognitive states. We argue that understanding and controlling these internal states is essential for reliable evaluation and deployment of reasoning models, and we provide a mechanistic framework for studying functional metacognition in artificial systems. Our code and data are publicly available at https://github.com/xlands/meta-cognition.
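[23] Improving Lexical Difficulty Prediction with Context-Aligned Contrastive Learning and Ridge Ensembling cs.CL | cs.AI
Wicaksono Leksono Muhamad, Joanito Agili Lopo, Tsamarah Rana Nugraha, Ahmad Cahyono Adi, Muhammad Oriza Nurfajri
TL;DR: This paper proposes a method that combines context-aligned contrastive learning with a Ridge regression ensemble to improve lexical difficulty prediction. By integrating two complementary objectives, Cross-View Context and Ordinal Soft Contrastive Learning, it targets the weaknesses of existing regression approaches in structuring the representation space and in cross-lingual alignment.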
Details
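Motivation: Existing lexical difficulty prediction methods rely mainly on regression-only training with scalar supervision, which does not explicitly structure the representation space and limits their ability to capture cross-lingual alignment and ordinal difficulty relations.
Result: Experiments on three first-language (L1) datasets show that the method improves cross-lingual representation alignment while preserving language-specific nuances, that the learned representations capture the ordinal structure of lexical difficulty, and that the ensemble mitigates the systematic biases of individual models, giving more stable performance across difficulty levels.
Insight: The novelty is combining contrastive objectives with the regression task, structuring the representation space through cross-view context contrast and ordinal soft contrast, and using a Ridge regression ensemble to improve robustness. This suggests a new approach for tasks that must model ordinal relations and cross-lingual alignment.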
Abstract: Lexical difficulty prediction is a fundamental problem in language learning and readability assessment, requiring models to estimate word difficulty across different first-language (L1) backgrounds. However, existing approaches rely on regression-only training with scalar supervision, which does not explicitly structure the representation space, limiting their ability to capture cross-lingual alignment and ordinal difficulty. To mitigate these issues, we propose Context-Aligned Contrastive Regression, which integrates Ridge regression ensemble with two complementary objectives, i.e., Cross-View Context and Ordinal Soft Contrastive Learning. Experiments on three L1 datasets show that (i) contrastive objectives improve cross-lingual representation alignment while preserving language-specific nuances, (ii) the learned representations capture the ordinal structure of lexical difficulty, and (iii) the ensemble effectively mitigates systematic biases of individual models, leading to more stable performance across difficulty levels.
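[24] GAMBIT: A Three-Mode Benchmark for Adversarial Robustness in Multi-Agent LLM Collectives cs.CL | cs.LG
Alexandre Le Mercier, Chris Develder, Thomas Demeester
TL;DR: This paper presents GAMBIT, a benchmark for evaluating imposter detectors that protect multi-agent large language model (LLM) collectives against adversaries. The benchmark has three evaluation modes and two independent scores, and ships with a dataset of 27,804 labeled instances spanning 240 co-evolved imposter strategies. The study uses chess as a deep reasoning problem and Gemini 3.1 Pro as the agents, aiming to simulate defense against a stealthy adaptive imposter under realistic constraints.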
Details
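Motivation: Existing adversarial studies of multi-agent systems (MAS) target only shallow tasks and ignore adaptive adversaries, that is, attackers who adjust their strategies to evade detectors, which leaves existing defenses inadequate. GAMBIT aims to fill this gap with a more comprehensive, dynamic benchmark for evaluating imposter detectors.
Result: A Gemini-based detector reaches an F1-score of only 50.5% against the adaptive imposter, meaning the imposter is essentially undetectable. Zero-shot evaluation is also highly misleading for adaptive adversaries: two detectors with near-identical zero-shot scores differ by 8x on few-shot adaptation, while the meta-learned variant converges 20x faster, a gap visible only in the recalibration mode.
Insight: The contributions are: (1) the first multi-agent benchmark in which adversarial attacks and defenses co-evolve (GAMBIT), with an adaptive imposter framework that generalizes beyond chess; (2) evidence of the limits of zero-shot evaluation in dynamic adversarial settings, underlining the importance of fast recalibration (e.g., meta-learning); and (3) a new evaluation paradigm and promising techniques for building more robust detectors in rapidly evolving adversarial systems.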
Abstract: In multi-agent systems (MAS), a single deceptive agent can nullify all gains of an agentic AI collective and evade deployed defenses. However, existing adversarial studies on MAS target only shallow tasks and do not consider adaptive adversaries, which evolve their strategies to evade the very detectors trained to catch them. To address that gap, we introduce GAMBIT, a benchmark with three evaluation modes and two independent scores for evaluating imposter detectors: the first two modes measure zero-shot detection under increasing distribution shift, and a third recalibration mode measures how quickly a detector adapts to novel attacks from just 20 labeled examples. The benchmark comes with a dataset of 27,804 labeled instances spanning 240 co-evolved imposter strategies. Our contributions are threefold: (1) Using chess as a substrate deep reasoning problem and Gemini 3.1 Pro for agents, we release GAMBIT and its dataset to evaluate imposter detectors under realistic constraints against a stealthy adaptive imposter; (2) We introduce an adaptive imposter agent based on an efficient evolutionary framework, generalizable beyond chess, that collapses collective task performance while remaining essentially undetectable (50.5% F1-score with a Gemini-based detector); (3) We show that zero-shot evaluation can be highly misleading for adaptive adversaries: two detectors with near-identical zero-shot scores differ by 8x on few-shot adaptation, while the meta-learned variant converges 20x faster, a gap only visible in the recalibration mode. Altogether, GAMBIT provides the first multi-agent benchmark where adversarial attacks and defenses co-evolve, with an imposter framework generalizable beyond our use case, and promising techniques for fast recalibration in a rapidly evolving adversarial system. Code and data: https://anonymous.4open.science/r/gambit.
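[25] Evaluating Pragmatic Reasoning in Large Language Models: Evidence from Scalar Diversity cs.CL
Ye-eun Cho
TL;DR: This study evaluates the pragmatic reasoning of large language models using scalar diversity as a graded diagnostic for pragmatic inference. The two evaluation methods, direct probability measurement and metalinguistic prompting, behave inconsistently across models and experimental settings, and pragmatic behavior is strongly affected by model family, prompting strategy, and task structure. Scalar diversity gradients emerge only in specific model-condition combinations, suggesting that pragmatic reasoning in LLMs results from an interaction between internal probabilistic representations and task-induced prompting behavior rather than being a stable competence that a single evaluation paradigm can capture.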
Details
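Motivation: The study addresses methodological inconsistency in evaluating the pragmatic reasoning of large language models, asks whether performance differences stem from underlying competence or from task-induced behavior, and tests the validity of scalar diversity as a graded diagnostic.
Result: Across multiple models and experimental settings, neither direct probability measurement nor metalinguistic prompting shows a consistent advantage. Pragmatic behavior differs substantially across model families, prompting strategies, and task structures, and scalar diversity gradients appear only in specific model-condition combinations.
Insight: The novelty is using scalar diversity as a graded diagnostic to compare evaluation methods systematically, and showing that pragmatic reasoning reflects an interaction between internal probabilities and task prompting. Viewed objectively, evaluation design plays a central role in interpreting the pragmatic abilities of LLMs, which underlines the importance of multi-dimensional evaluation.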
Abstract: Evaluating pragmatic reasoning in large language models (LLMs) remains challenging because model behavior can vary depending on evaluation methods. Previous studies suggest that prompt-based judgments may diverge from models’ internal probability distributions, raising questions about whether observed performance reflects underlying competence or task-induced behavior. This study examines this issue using scalar diversity as a graded diagnostic for pragmatic inference. Following Hu & Levy (2023), this study compares direct probability measurement and metalinguistic prompting across multiple models and experimental settings. The results show that neither evaluation method consistently outperforms the other and that pragmatic behavior varies substantially across model families, prompting strategies, and task structures. Moreover, scalar diversity gradients emerge only in specific model-condition combinations, suggesting that pragmatic reasoning in LLMs reflects an interaction between internal probabilistic representations and task-induced prompting behavior rather than a stable competence captured by a single evaluation paradigm. These findings highlight the central role of evaluation design in interpreting pragmatic abilities in LLMs.
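[26] Language-Conditioned Visual Grounding with CLIP Multilingual cs.CL
J. de Curtò, Mauro Liz, I. de Zarzà
TL;DR: Using a dense multilingual CLIP probe that holds the visual encoder fixed and varies only the XLM-RoBERTa text branch, this study evaluates visual grounding for 13 typologically diverse languages. It finds that low-resource languages carry a structural penalty in the text branch, that scaling the visual encoder affects languages unevenly, and that the dominant failure mode is spatial misalignment rather than signal collapse.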
Details
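Motivation: Multilingual vision-language models show systematic performance gaps across languages, but the mechanism is unclear. The study uses a controlled experiment to determine whether the gaps arise from the visual encoder, the text branch, or their interaction.
Result: Two CLIP architectures whose visual parameter counts differ by 7x were evaluated on 11 concepts and 210 images. Low-resource languages (Arabic, Basque, Luxembourgish) incur a structural penalty at both backbone scales (Wilcoxon test, p<10^-300; cluster-mask IoU gap of +0.114 for the base model and +0.143 for the large model). Scaling the visual encoder 7x worsens the failures for some languages (Basque Δ=-0.056, Luxembourgish Δ=-0.076) but improves Arabic (Δ=+0.033). Peak similarity is preserved across languages (mean ratio 0.94 at large scale), while cluster-mask IoU drops sharply.
Insight: The novelty is a dense multilingual CLIP probe that separates the effects of the visual and text branches, establishing that multilingual performance gaps stem mainly from the text branch, especially the structural penalty for low-resource languages. Viewed objectively, the uneven effect of visual-encoder scaling exposes two distinct failure modes, corpus coverage and tokenizer fertility, and the dominant failure is spatial misalignment. This provides a practical basis for efficient, energy-aware multilingual deployment.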
Abstract: Multilingual vision-language models exhibit systematic performance gaps across languages, but the mechanism remains ambiguous: cross-language divergence could arise from the visual encoder, the text branch, or their interaction. We resolve this ambiguity through a dense multilingual CLIP probe in which the visual encoder is held identical across thirteen typologically diverse languages and only the XLM-RoBERTa text branch varies. We evaluate two CLIP architectures spanning a 7x visual-encoder scale gap (XLM-R base + ViT-B/32, ~87M visual parameters; XLM-R large + ViT-H/14, ~632M) on 11 concepts and 210 images, and quantify cross-language agreement via cluster-mask IoU, top-percentile IoU, and Spearman rank correlation against an English reference (n=2,310 paired observations per language). Three findings emerge. First, low-resource languages (Arabic, Basque, Luxembourgish) incur a structural penalty at both backbone scales (Wilcoxon HR>LR p<10^-300; cluster-mask IoU gap +0.114 at base, +0.143 at large), isolating the deficit to the text branch. Second, scaling the encoder 7x widens the gap for structural failure cases (Basque Δ=-0.056, Luxembourgish Δ=-0.076) while improving Arabic (Δ=+0.033), separating corpus-coverage from tokeniser-fertility failures. Third, peak similarity is preserved across languages (mean ratio 0.94 at large scale) while cluster-mask IoU drops sharply, identifying spatial misalignment, not signal collapse, as the dominant failure mode. At 3.4-3.9 Wh per 1,000 queries, dense-CLIP grounding is competitive with high-throughput inference budgets, positioning it as a practical substrate for energy-aware multilingual deployment.
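The agreement metrics in this abstract compare a language's similarity map against an English reference map. The sketch below shows only the generic intersection-over-union step on top-fraction masks. How the paper builds its cluster masks and which percentile it uses are not stated in the abstract, so `top_fraction_mask` and the 50% fraction are assumptions for illustration.

```python
from typing import List

Grid = List[List[float]]
Mask = List[List[bool]]


def top_fraction_mask(sim: Grid, fraction: float) -> Mask:
    """Mark cells whose similarity is within the top `fraction` of the map.
    Ties at the threshold are all kept, so slightly more cells may pass."""
    flat = sorted((v for row in sim for v in row), reverse=True)
    k = max(1, int(len(flat) * fraction))
    threshold = flat[k - 1]
    return [[v >= threshold for v in row] for row in sim]


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two boolean masks of equal shape."""
    pairs = [(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    inter = sum(1 for x, y in pairs if x and y)
    union = sum(1 for x, y in pairs if x or y)
    return inter / union if union else 1.0


# Toy 2x2 patch-similarity maps (hypothetical values).
english = [[0.9, 0.8], [0.1, 0.2]]
other = [[0.9, 0.1], [0.8, 0.2]]
score = iou(top_fraction_mask(english, 0.5), top_fraction_mask(other, 0.5))
```

Both toy maps share the same peak value of 0.9 but highlight different regions, which gives a low IoU despite an identical peak; this mirrors the abstract's distinction between spatial misalignment and signal collapse.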
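[27] Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs cs.CL
Guijin Son, Seungone Kim, Catherine Arnett, Hyunwoo Ko, Hyein Lee
TL;DR: Soohak is a benchmark of 439 research-level math problems authored by mathematicians to evaluate the advanced mathematical reasoning of large language models. It also introduces a refusal subset that tests whether models can recognize ill-posed problems.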
Details
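Motivation: Now that large language models have reached gold-medal performance at competitions such as the IMO, the community needs a more challenging target for measuring reasoning. Research-level math problems better assess a model's potential to advance the frontier of mathematical knowledge, but existing benchmarks of this kind are scarce and small.
Result: On the Challenge subset, frontier models such as Gemini-3-Pro, GPT-5, and Claude-Opus-4.5 reach 30.4%, 26.4%, and 10.4% accuracy respectively, while open-weight models such as Qwen3-235B remain below 15%. On the refusal subset, no model exceeds 50%, showing that current models cannot yet reliably recognize ill-posed problems.
Insight: The novelty is building a large, high-quality research-level math benchmark and, for the first time, systematically including refusal as an evaluation dimension. This exposes an important gap in models' ability to "know when to stop" in mathematical research and points to a new optimization target for next-generation models.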
Abstract: Following the recent achievement of gold-medal performance on the IMO by frontier LLMs, the community is searching for the next meaningful and challenging target for measuring LLM reasoning. Whereas olympiad-style problems measure step-by-step reasoning alone, research-level problems use such reasoning to advance the frontier of mathematical knowledge itself, emerging as a compelling alternative. Yet research-level math benchmarks remain scarce because such problems are difficult to source (e.g., Riemann Bench and FrontierMath-Tier 4 contain 25 and 50 problems, respectively). To support reliable evaluation of next-generation frontier models, we introduce Soohak, a 439-problem benchmark newly authored from scratch by 64 mathematicians. Soohak comprises two subsets. On the Challenge subset, frontier models including Gemini-3-Pro, GPT-5, and Claude-Opus-4.5 reach 30.4%, 26.4%, and 10.4% respectively, leaving substantial headroom, while leading open-weight models such as Qwen3-235B, GPT-OSS-120B, and Kimi-2.5 remain below 15%. Notably, beyond standard problem solving, Soohak introduces a refusal subset that probes a capability intrinsic to research mathematics: recognizing ill-posed problems and pausing rather than producing confident but unjustified answers. On this subset, no model exceeds 50%, identifying refusal as a new optimization target that current models do not directly address. To prevent contamination, the dataset will be publicly released in late 2026, with model evaluations available upon request in the interim.
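[28] GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression cs.CL
Zhongtao Miao, Qiyu Wu, Yoshimasa Tsuruoka
TL;DR: This paper proposes GRC, a training framework that uses meta latent tokens and a unified generative, representative, and compressive tuning approach to combine three tasks (reasoning-driven generation, reasoning-enhanced text representation, and context compression) in a single LLM forward pass. It substantially reduces training cost and deployment effort, and introduces new paradigms such as self-reason-latent embeds and latent memory-augmented generation.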
Details
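Motivation: LLM-based text embedding and generation are usually trained separately, which incurs high training cost and deployment effort. Context compression is also vital for reasoning-driven generation and for agentic tasks that need long context and continual learning. The paper therefore explores how to unify these tasks in a single forward pass.
Result: Extensive experiments on reasoning-intensive retrieval benchmarks, generative tasks, document compression, latency evaluation, and RAG settings demonstrate the method's effectiveness. It achieves efficient inference and three times the data utilization during training, and may shed light on a truly unified model for reasoning-driven generation, embedding, and compression.
Insight: The novelties include bridging the three tasks with meta latent tokens and a unified tuning approach while keeping modular flexibility at inference; the new paradigms of self-reason-latent embeds and latent memory-augmented generation, plus hybrid paged attention to speed up inference; and a compressed KV cache of O(1) length that serves as updatable memory, improving the efficiency of RAG deployment.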
Abstract: Text embedding and generative tasks are usually trained separately based on large language models (LLMs) nowadays. This causes a large amount of training cost and deployment effort. Context compression is also a challenging and pressing task, which is vital to reasoning-driven generation, and agentic tasks requiring long context and continual learning. In this paper, we explore how to unify reasoning-driven generation, reasoning-enhanced text representation and context compression tasks in one forward pass for LLMs. Through meta latent tokens and a unified generative, representative and compressive tuning approach, we propose a training framework named GRC that bridges the three tasks. The trained models can accomplish three objectives in a single forward pass while maintaining modular, LEGO-style flexibility during inference. This design greatly reduces the deployment effort for retrieval-augmented generation (RAG) and achieves efficient inference and three times data utilization during training. Furthermore, this framework design enables a new paradigm for text embedding: self-reason-latent embeds, and a new generation paradigm, latent memory-augmented generation, where compressed and internalized KV cache with O(1) length is used as the updatable memory. We also propose hybrid paged attention to speed up the inference of our models. Extensive experiments on reasoning-intensive retrieval benchmarks, generative tasks, document compression, latency evaluation, and RAG settings demonstrate the effectiveness of our method and may shed light on the truly unified model that can handle reasoning-driven generation, embedding and compression tasks seamlessly.
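[29] Meow-Omni 1: A Multimodal Large Language Model for Feline Ethology cs.CL | q-bio.NC
Jucheng Hu, Zhangquan Chen, Yulin Chen, Chengjie Hong, Liang Zhou
TL;DR: This paper presents Meow-Omni 1, the first open-source quad-modal large language model purpose-built for computational ethology. It decodes feline intent by fusing video, audio, physiological time-series, and text, addressing the semantic aliasing that arises when existing models ignore high-frequency physiological data.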
Details
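Motivation: Existing MLLMs cannot process high-frequency biological time-series data, so they are limited to superficial behavioral pattern matching rather than genuine latent-state reasoning, which prevents accurate interpretation of animal intent.
Result: On MeowBench, a new expert-verified quad-modal benchmark, Meow-Omni 1 achieves state-of-the-art intent-recognition accuracy (71.16%), substantially outperforming leading vision-language and omni-modal baselines.
Insight: The novelty is integrating specialized scientific encoders into a unified backbone and formalizing intent inference through physiologically grounded cross-modal alignment. This offers a scalable paradigm for inter-species intent understanding and advances foundation models toward real-world uses such as veterinary diagnostics and wildlife conservation.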
Abstract: Deciphering animal intent is a fundamental challenge in computational ethology, largely because of semantic aliasing, the phenomenon where identical external signals (e.g., a cat’s purr) correspond to radically different internal states depending on physiological context. Existing Multimodal Large Language Models (MLLMs) are blind to high-frequency biological time-series data, restricting them to superficial behavioural pattern matching rather than genuine latent-state reasoning. To bridge this gap, we introduce Meow-Omni 1, the first open-source, quad-modal MLLM purpose-built for computational ethology. It natively fuses video, audio, and physiological time-series streams with textual reasoning. Through targeted architectural adaptation, we integrate specialized scientific encoders into a unified backbone and formalize intent inference via physiologically grounded cross-modal alignment. Evaluated on MeowBench, a novel, expert-verified quad-modal benchmark, Meow-Omni 1 achieves state-of-the-art intent-recognition accuracy (71.16%), substantially outperforming leading vision-language and omni-modal baselines. We release the complete open-source pipeline including model weights, training framework, and the Meow-10K dataset, to establish a scalable paradigm for inter-species intent understanding and to advance foundation models toward real-world veterinary diagnostics and wildlife conservation.
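[30] Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs cs.CL | cs.LG
Sohan Venkatesh
TL;DR: This paper shows that large language models fail at counting repeated tokens not because their internal count representation is deficient, but because a format-triggered MLP block overwrites the correct count. The correct count can be decoded almost perfectly at every layer, yet at roughly 88-93% of network depth an MLP block triggered by the space-separated list format outputs a fixed wrong answer.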
Details
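Motivation: The paper investigates why large language models that perform strongly on broad reasoning benchmarks nonetheless fail at counting repeated tokens, and challenges the common view that attributes the failure to weak internal count tracking.
Result: On Llama-3.2 (1B and 3B) and Qwen2.5 (1.5B, 3B, and 7B), linear probes decode the correct count almost perfectly at every layer, yet the models output a wrong answer. For space-separated repeated word-tokens, an MLP block overwrites the correct representation at roughly 88-93% of network depth; the effect is absent for repeated digit-tokens and is suppressed by comma delimiters in larger models.
Insight: The novelty is showing that counting failure is a failure of routing (the activation of a specific MLP block) rather than of representation, and that the two call for different interventions. This challenges common assumptions about the relationship between internal representations and output behavior, and highlights how strongly format sensitivity shapes model behavior.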
Abstract: Large language models fail at counting repeated tokens despite strong performance on broader reasoning benchmarks. These failures are commonly attributed to limitations in internal count tracking. We show this attribution is wrong. Linear probes on the residual stream decode the correct count with near-perfect accuracy at every post-embedding layer, across all model depths. This holds even at the exact layers where the wrong answer crystallizes while the model simultaneously outputs an incorrect count. Attention patterns show no evidence of collapse over repeated tokens and tokenization artifacts account for none of the failure. Instead, a format-triggered multi-layer perceptron (MLP) block overwrites the correctly-encoded count with a fixed wrong answer at roughly 88–93% network depth. This prior fires for repeated word-tokens in space-separated list format and is absent for repeated digit-tokens. It is suppressed by comma-separated delimiters in larger models but persists in smaller ones. The finding holds across Llama-3.2 (1B and 3B) and Qwen2.5 (1.5B, 3B and 7B) at consistent relative depth. Counting failure is a failure of routing not of representation and the two require different interventions.
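[31] LLM Agents Already Know When to Call Tools – Even Without Reasoning cs.CL
Chung-En Sun, Linbo Liu, Ge Yan, Zimo Wang, Tsui-Wei Weng
TL;DR: This paper proposes When2Tool, a benchmark for evaluating when LLM agents actually need to call a tool. Existing training-free baselines (Prompt-only and Reason-then-Act) give limited control over unnecessary tool calls. Analysis of hidden states shows that tool necessity is already linearly decodable, which motivates Probe&Prefill, a method that sharply reduces tool calls with little accuracy loss.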
Details
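Motivation: Tool-augmented LLM agents tend to call tools indiscriminately, even when the model could answer directly. This wastes API fees and adds latency, and no benchmark has systematically studied when a tool call is necessary.
Result: On the proposed When2Tool benchmark (18 environments), Probe&Prefill reduces tool calls by 48% on average across all tested models with only a 1.7% accuracy loss. The best baseline reduces tool calls by only 6% at comparable accuracy, or incurs a 5x higher accuracy loss at a similar reduction in tool calls.
Insight: The novelty is the finding that LLM hidden states already encode whether a tool is needed (linearly decodable, AUROC 0.89-0.96), but the model fails to use this information during generation. Probe&Prefill builds on this: a lightweight linear probe reads the hidden-state signal and the response is prefilled with a steering sentence, giving efficient control over tool calls.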
Abstract: Tool-augmented LLM agents tend to call tools indiscriminately, even when the model can answer directly. Each unnecessary call wastes API fees and latency, yet no existing benchmark systematically studies when a tool call is actually needed. We propose When2Tool, a benchmark of 18 environments (15 single-hop, 3 multi-hop) spanning three categories of tool necessity – computational scale, knowledge boundaries, and execution reliability – each with controlled difficulty levels that create a clear decision boundary between tool-necessary and tool-unnecessary tasks. We evaluate two families of training-free baselines: Prompt-only (varying the prompt to discourage unnecessary calls) and Reason-then-Act (requiring the model to reason about tool necessity before acting). Both provide limited control: Prompt-only suppresses necessary calls alongside unnecessary ones, and Reason-then-Act still incurs a disproportionate accuracy cost on hard tasks. To understand why these baselines fail, we probe the models’ hidden states and find that tool necessity is linearly decodable from the pre-generation representation with AUROC 0.89–0.96 across six models, substantially exceeding the model’s own verbalized reasoning. This reveals that models already know when tools are needed, but fail to act on this knowledge during generation. Building on this finding, we propose Probe&Prefill, which uses a lightweight linear probe to read the hidden-state signal and prefills the model’s response with a steering sentence. Across all models tested, Probe&Prefill reduces tool calls by 48% with only 1.7% accuracy loss, while the best baseline at comparable accuracy only reduces 6% of tool calls, or achieves a similar tool call reduction but incurs a 5× higher accuracy loss. Our code is available at https://github.com/Trustworthy-ML-Lab/when2tool
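The probe-then-prefill idea in this abstract can be sketched as a logistic read-out over a hidden-state vector followed by a choice of steering sentence. This is a toy under stated assumptions, not the released When2Tool code: the two-dimensional hidden state, the probe weights, the 0.5 threshold, and both steering sentences are invented for illustration.

```python
import math
from typing import Sequence


def probe_score(hidden: Sequence[float], weights: Sequence[float],
                bias: float) -> float:
    """Linear probe with a sigmoid: estimated probability a tool is needed."""
    z = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def choose_prefill(hidden: Sequence[float], weights: Sequence[float],
                   bias: float, threshold: float = 0.5) -> str:
    """Pick the sentence that the response would be prefilled with.
    Both sentences are hypothetical examples."""
    if probe_score(hidden, weights, bias) >= threshold:
        return "I need to call a tool for this."
    return "I can answer this directly without a tool."


weights, bias = [2.0, -1.0], 0.0                          # assumed probe
needs_tool = choose_prefill([1.0, 0.0], weights, bias)    # z = 2.0
direct = choose_prefill([0.0, 1.0], weights, bias)        # z = -1.0
```

In a real system the hidden vector would come from the model's pre-generation representation and the probe would be fitted on labeled tool-necessary and tool-unnecessary tasks; the gating logic itself stays this small.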
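[32] Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation cs.CL | cs.AI
Yuxuan Jiang, Runchao Li, Shubhashis Roy Dipta, Dawei Li, Zhao Yang
TL;DR: This paper studies high-loss tokens (Rock Tokens) in On-Policy Distillation (OPD) training. Even after training converges, a substantial subset of tokens keeps exhibiting high loss; these tokens contribute little to the model's reasoning performance yet consume a large share of optimization bandwidth. The study shows that strategically bypassing them can significantly streamline the alignment process.
Details
Motivation: According to existing studies, high-loss tokens, the most direct signal of student-teacher mismatch under OPD's per-token KL objective, should diminish as training converges. Empirical analysis shows otherwise, which motivates examining the nature of these persistently high-loss tokens and their effect on training.
Result: Rock Tokens account for up to 18% of the tokens in generated outputs. Although they contribute a disproportionately large share of the total gradient norm, they remain stagnant throughout training, and causal intervention shows that their contribution to actual reasoning performance is negligible.
Insight: The paper identifies and analyzes Rock Tokens in OPD: persistent, high-loss tokens with low functional contribution. This challenges the necessity of uniform token weighting and suggests that strategically bypassing these "stumbling blocks" offers a more efficient paradigm for large-scale model distillation.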
Abstract: While recent work in Reinforcement Learning with Verifiable Rewards (RLVR) has shown that a small subset of critical tokens disproportionately drives reasoning gains, an analogous token-level understanding of On-Policy Distillation (OPD) remains largely unexplored. In this work, we investigate high-loss tokens, a token type that, as the most direct signal of student-teacher mismatch under OPD’s per-token KL objective, should progressively diminish as training converges according to existing studies; however, our empirical analysis shows otherwise. Even after OPD training reaches apparent saturation, a substantial subset of tokens continues to exhibit persistently high loss; these tokens, which we term Rock Tokens, can account for up to 18% of the tokens in generated outputs. Our investigation reveals two startling paradoxes. First, despite their high occurrence frequency providing a disproportionately large share of total gradient norms, Rock Tokens themselves remain stagnant throughout training, resisting teacher-driven corrections. Second, through causal intervention, we find that these tokens provide negligible functional contribution to the model’s actual reasoning performance. These findings suggest that a vast amount of optimization bandwidth is spent on structural and discourse residuals that the student model cannot or need not internalize. By deconstructing these dynamics, we demonstrate that strategically bypassing these “stumbling blocks” can significantly streamline the alignment process, challenging the necessity of uniform token weighting and offering a more efficient paradigm for large-scale model distillation.
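One way to picture "bypassing" persistently high-loss tokens is to flag them from a loss history and drop them from the averaged loss. The sketch below is an assumption-laden illustration: the flagging rule (loss above a fixed threshold at every recorded checkpoint), the threshold itself, and both function names are hypothetical and not taken from the paper.

```python
def find_rock_tokens(loss_history, threshold):
    """Flag tokens whose loss stays above `threshold` at every checkpoint.

    loss_history[t][i] is the per-token loss of token i at checkpoint t.
    """
    n_tokens = len(loss_history[0])
    return [all(step[i] > threshold for step in loss_history)
            for i in range(n_tokens)]

def masked_mean_loss(losses, rock_mask):
    """Average the per-token loss over tokens that are not flagged."""
    kept = [loss for loss, is_rock in zip(losses, rock_mask) if not is_rock]
    return sum(kept) / len(kept) if kept else 0.0
```

A token that is high-loss early but later drops below the threshold is not flagged, which matches the abstract's emphasis on tokens that stay stagnant throughout training.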
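[33] Beyond Continuity: Challenges of Context Switching in Multi-Turn Dialogue with LLMs cs.CL | cs.AI
Aditya Sinha, Harald Steck, Vito Ostuni, Matteo Rinaldi
TL;DR: Using synthetic benchmarks built on real-world datasets, this paper stress-tests ten LLMs on context switching in multi-turn dialogue, evaluating zero-shot performance on two sub-tasks: detecting whether the user pivots to a new topic, and shortlisting relevant context from previous turns. Only some reasoning and strongly instructed models detect pivots accurately, open-weight models struggle, and all models suffer from position bias.
Details
Motivation: In multi-turn conversations, LLMs often miss a user's topic shift or request refinement and carry over irrelevant context from earlier turns, which leads to inaccurate responses.
Result: On synthetic benchmarks that simulate context shifts at different difficulty levels, only some reasoning and strongly instructed LLMs detect pivots accurately. Open-weight LLMs struggle with the task and frequently carry stale context even when given explicit cues, and all models suffer from position bias.
Insight: The paper systematically builds an evaluation framework for context switching in multi-turn dialogue, covering pivot detection and relevant-context shortlisting, and exposes common weaknesses of current LLMs on these tasks (weak open-weight performance, position bias), offering takeaways for improving long-term robustness in multi-turn settings.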
Abstract: Users interacting with Large Language Models (LLMs) in a multi-turn conversation routinely refine their requests or pivot to new topics. LLMs, however, often miss these topic shifts and carry over irrelevant context from previous turns, leading to inaccurate responses. In this paper, we stress-test the multi-turn understanding of LLMs and study the following two sub-tasks: (1) detecting whether the user pivots or refines in the current turn, and (2) shortlisting relevant context from previous turns. To this end, we construct synthetic benchmarks based on real-world datasets from varied domains, so as to simulate context shifts of different levels of difficulty. We then evaluate the zero-shot performance of ten LLMs (open-weight, closed-source and reasoning), and demonstrate that only some reasoning and strongly instructed LLMs are accurate in detecting pivots; open-weight LLMs struggle with the task and frequently carry stale context even with explicit cues; and all models suffer from a position bias. Based on the results, we discuss key takeaways for improving long-term robustness in multi-turn capabilities for LLMs.
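[34] DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification cs.CL | cs.CV
Rui Liu, Dian Yu, Zhenwen Liang, Yucheng Shi, Tong Zheng
TL;DR: This paper proposes DeltaRubric, which reformulates multimodal preference evaluation as a plan-and-execute process within a single multimodal large language model (MLLM), addressing the lazy judging and over-reliance on language priors of existing single-step evaluators. It works in two steps: the model first acts as a Disagreement Planner that generates a neutral, instance-specific verification checklist, then as a Checklist Verifier that executes those checks to produce the final judgment.
Details
Motivation: Existing reward models for aligning MLLMs are typically single-step evaluators, which can suffer from lazy judging: exploiting language priors instead of performing fine-grained visual verification. Rubric-based evaluation mitigates these biases in text-only settings, but extending it to multimodal tasks is bottlenecked by the complexity of visual reasoning.
Result: Validated on Qwen3-VL 4B and 8B Instruct models, DeltaRubric improves base-model overall accuracy on VL-RewardBench by +22.6 (4B) and +18.8 (8B) points, largely outperforming standard no-rubric baselines.
Insight: The core innovation is decomposing multimodal reward modeling into a structured, verifiable two-step plan-and-execute process, formulated as a multi-role reinforcement learning problem that jointly optimizes planning and verification. Dynamically synthesizing rubrics that isolate spatial and factual discrepancies yields more reliable and generalizable multimodal evaluation.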
Abstract: Aligning Multimodal Large Language Models (MLLMs) requires reliable reward models, yet existing single-step evaluators can suffer from lazy judging, exploiting language priors over fine-grained visual verification. While rubric-based evaluation mitigates these biases in text-only settings, extending it to multimodal tasks is bottlenecked by the complexity of visual reasoning. The critical differences between responses often depend on instance-specific visual details. Robust evaluation requires dynamically synthesizing rubrics that isolate spatial and factual discrepancies. To address this, we introduce $\textbf{DeltaRubric}$, an approach that reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. DeltaRubric operates in two steps: acting first as a $\textit{Disagreement Planner}$, the model generates a neutral, instance-specific verification checklist. Transitioning into a $\textit{Checklist Verifier}$, it executes these self-generated checks against the image and question to produce the final grounded judgment. We formulate DeltaRubric as a multi-role reinforcement learning problem, jointly optimizing planning and verification capabilities. Validated on Qwen3-VL 4B and 8B Instruct models, DeltaRubric achieves solid empirical gains. For instance, on VL-RewardBench, it improves base model overall accuracy by $\textbf{+22.6}$ (4B) and $\textbf{+18.8}$ (8B) points, largely outperforming standard no-rubric baselines. The results demonstrate that decomposing evaluation into structured, verifiable steps leads to more reliable and generalizable multimodal reward modeling.
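[35] LEAF-SQL: Level-wise Exploration with Adaptive Fine-graining for Text-to-SQL Skeleton Prediction cs.CL
Zhao Tan, Xiping Liu, Qing Shu, Qizhi Wan, Dexi Liu
TL;DR: This paper proposes LEAF-SQL, a framework for skeleton prediction in Text-to-SQL. It reframes skeleton prediction as a coarse-to-fine tree search, using a three-level skeleton hierarchy to guide the search, a Skeleton Formulation Agent to generate diverse candidates, and a Skeleton Evaluation Agent to prune the search space efficiently, with the aim of improving generation for complex queries.
Details
Motivation: Prompted large language models still struggle on Text-to-SQL with complex queries that involve deeply nested logic or multiple clauses. Widely used SQL-skeleton methods are limited by their reliance on a single structural hypothesis and their lack of progressive reasoning.
Result: On the official hidden test set of the challenging BIRD benchmark, the method achieves 71.6% execution accuracy, outperforming leading search-based and skeleton-based methods.
Insight: The main innovation is reframing skeleton prediction as a coarse-to-fine tree search that integrates a hierarchy to guide the search, an agent that generates diverse candidates, and an agent that prunes efficiently. The resulting skeletons are structurally diverse and granularity-adaptive, giving subsequent SQL generation a stronger foundation.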
Abstract: Text-to-SQL translates natural language questions into executable SQL queries, enabling intuitive database access for non-experts. While large language models achieve strong performance on Text-to-SQL with prompting, they still struggle with complex queries that involve deeply nested logic or multiple clauses. A widely used approach employs SQL skeletons (intermediate representations of query logic) to streamline generation, but existing methods are limited by their reliance on a single structural hypothesis and lack of progressive reasoning. To overcome these limitations, we propose LEAF-SQL, a novel framework that reframes skeleton prediction as a coarse-to-fine tree search process. LEAF-SQL enables systematic exploration of diverse structural hypotheses with adaptive refinement. Several key techniques are employed in LEAF-SQL: (1) a three-level skeleton hierarchy to guide the search, (2) a Skeleton Formulation Agent to generate diverse candidates, and (3) a Skeleton Evaluation Agent to efficiently prune the search space. This integrated design yields skeleton candidates that are both structurally diverse and granularity-adaptive, providing a stronger foundation for SQL generation. Extensive experiments show that LEAF-SQL consistently improves the performance of various LLM backbones. On the official hidden test set of the challenging BIRD benchmark, our method achieves 71.6% execution accuracy, which outperforms leading search-based and skeleton-based methods, affirming its effectiveness for complex queries.
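[36] Mem-W: Latent Memory-Native GUI Agents cs.CL | cs.CV | cs.LG
Guibin Zhang, Yaohui Ling, Fanci Meng, Kun Wang, Shuicheng Yan
TL;DR: This paper proposes Mem-W, a series of latent-memory-native GUI agents that encode historical trajectories and working memory directly into compact latent memory tokens and weave them with the current GUI observation into one continuous embedding sequence. The agent can then read past experience through a machine-native interface, which supports long-horizon GUI tasks.
Details
Motivation: Existing GUI agents usually treat memory as an external, human-readable symbolic record, which creates a mismatch between the form in which experience is stored and the latent embedding sequence over which the agent's policy actually acts. The paper aims to resolve this mismatch by making memory part of the agent's continuous context.
Result: Across four web and mobile navigation benchmarks, Mem-W consistently improves diverse backbones and memory-enhanced baselines, with gains of up to +30.0, suggesting that latent-context-native memory can serve as a scalable foundation for long-horizon GUI agents.
Insight: The core innovation is treating memory as an intrinsic part of the agent's continuous context rather than an external symbolic scaffold. A shared trajectory-to-latent compressor weaves historical and in-session information into compact latent tokens, and training with self-distillation and outcome-aware supervision preserves decision-relevant state while filtering memory toward evidence that truly supports task success.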
Abstract: GUI agents are beginning to operate the web, mobile, and desktop as interactive worlds, where successful control depends on carrying forward visual, procedural, and task-level evidence beyond the fleeting present screen. Yet most agents still treat memory as an external, human-readable artifact: histories are summarized, categorized, retrieved, and reinserted as text or structured records before being encoded again by the policy. This creates a mismatch between the representational form in which experience is stored and the latent embedding sequence over which modern GUI policies actually act. We introduce Mem-W, a series of latent-memory-native GUI agents that treat memory as part of the agent’s continuous context rather than as an auxiliary symbolic scaffold. Mem-W weaves both historical trajectories (as experiential memory) and in-session segments (as working memory) into compact memory tokens through a shared trajectory-to-latent compressor. These tokens are woven with the current GUI observation and local context into one continuous embedding sequence, allowing the agent to read successes, failures, and unfinished progress through the same machine-native interface. Mem-W is trained with self-distillation and outcome-aware supervision to preserve decision-relevant state while filtering memory toward evidence that truly supports task success. Across four web and mobile navigation benchmarks, Mem-W consistently improves diverse backbones and memory-enhanced baselines, with gains of up to $+30.0$, suggesting that latent-context-native memory can serve as a scalable foundation for long-horizon GUI agency.
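[37] RuPLaR: Efficient Latent Compression of LLM Reasoning Chains with Rule-Based Priors From Multi-Step to One-Step cs.CL | cs.AI
Xiaocheng Luo, Kang Wang, Zaifu Zhan, Yuechi Zhou, Xiangyu Duan
TL;DR: This paper proposes RuPLaR, a framework that compresses the multi-step reasoning chains of large language models into one-step latent reasoning. Guided by rule-based prior probability distributions, it trains an LLM to generate latent reasoning tokens autonomously in a single training stage, eliminating cascaded processes and inter-model dependencies.
Details
Motivation: Chain-of-Thought methods are constrained by the inefficiency and expressive limits of natural language. Latent Chain-of-Thought methods operate in a continuous latent space, but existing multi-step or multi-model paradigms suffer from structural complexity, error propagation, and coordination overhead.
Result: Extensive experiments show that the framework improves accuracy by 11.1% over existing latent CoT methods while using minimal tokens, underscoring its effectiveness and extensibility.
Insight: The novelty is the One-Model One-Step compression paradigm, which combines rule-based prior distributions with a joint training objective (cross-entropy for answer consistency, KL divergence for soft-token alignment, and a problem-thought semantic alignment constraint) to achieve high-quality, efficient compression of latent reasoning.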
Abstract: The Chain-of-Thought (CoT) paradigm, while enhancing the interpretability of Large Language Models (LLMs), is constrained by the inefficiencies and expressive limits of natural language. Latent Chain-of-Thought (latent CoT) reasoning, which operates in a continuous latent space, offers a promising alternative but faces challenges from structural complexities in existing multi-step or multi-model paradigms, such as error propagation and coordination overhead. In this paper, we introduce One-Model One-Step, a novel compression framework for Latent Reasoning with Rule-Based Priors (RuPLaR) to address this challenge. Our method trains an LLM to autonomously generate latent reasoning tokens in a single training stage, guided by rule-based prior probability distributions, thereby eliminating cascaded processes and inter-model dependencies. To ensure reasoning quality, we design a joint training objective that enforces answer consistency via cross-entropy, aligns soft tokens with rule-based priors via KL divergence (the Soft Thinking constraint), and adds a problem-thought semantic alignment constraint in the representation space. Extensive experiments show that our compression framework not only improves accuracy by 11.1% over existing latent CoT methods but also achieves this with minimal token usage, underscoring its effectiveness and extensibility. Code: https://github.com/xiaocen-luo/RuPLaR.
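[38] HOME-KGQA: A Benchmark Dataset for Multimodal Knowledge Graph Question Answering on Household Daily Activities cs.CL | cs.AI | cs.DB | cs.MM
Shusaku Egami, Aoi Ohta, Tomoki Tsujimura, Masaki Asada, Tatsuya Ishigaki
TL;DR: This paper introduces HOME-KGQA, a new knowledge graph question answering (KGQA) benchmark dataset built on a multimodal knowledge graph of daily household activities. It addresses the shortcomings of existing datasets, which are biased toward encyclopedic knowledge, limited to a single modality, and lacking fine-grained spatiotemporal data, so as to better serve the real-world scenarios targeted by Embodied AI.
Details
Motivation: Existing KGQA benchmark datasets are biased toward encyclopedic knowledge, limited to a single modality, and lack fine-grained spatiotemporal data, which limits their applicability to real-world scenarios such as Embodied AI. A more realistic and more challenging multimodal KGQA dataset is therefore needed.
Result: Experiments show that LLM-based KGQA methods perform far worse on HOME-KGQA than on existing datasets, highlighting significant challenges for the real-world deployment of KGQA systems.
Insight: The contribution is a multimodal KGQA benchmark dataset focused on daily household activities, with complex multi-hop questions, multi-level spatiotemporal reasoning, multimodal grounding, and aggregate functions, providing a new testbed for evaluating and advancing KGQA in real-world settings.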
Abstract: Large Language Models (LLMs) provide flexible natural language processing capabilities, while knowledge graphs (KGs) offer explicit and structured knowledge. Integrating these two in a complementary manner enables the development of reliable and verifiable AI systems. In particular, knowledge graph question answering (KGQA) has attracted attention as a means to reduce LLM hallucinations and to leverage knowledge beyond the training data. However, existing KGQA benchmark datasets are biased toward encyclopedic knowledge, limited to a single modality, and lack fine-grained spatiotemporal data, which limits their applicability to real-world scenarios targeted by Embodied AI. We introduce HOME-KGQA, a novel KGQA benchmark dataset built on a multimodal KG of daily household activities. HOME-KGQA consists of complex, multi-hop natural language questions paired with graph database query languages. Compared to existing benchmarks, it includes more challenging questions that involve multi-level spatiotemporal reasoning, multimodal grounding, and aggregate functions. Experimental results show that the LLM-based KGQA methods fail to achieve performance comparable to that on existing datasets when evaluated on HOME-KGQA. This highlights significant challenges that should be addressed for the real-world deployment of KGQA systems. Our dataset is available at https://github.com/aistairc/home-kgqa
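[39] Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs cs.CL | cs.CV
Jiafeng Liang, Zhihao Zhu, Zihan Zhang, Baoqi Ren, Shixin Jiang
TL;DR: This paper shows that Large Multimodal Models (LMMs) over-rely on textual prior shortcuts in video causal discovery, and proposes ProCauEval, a perturbation-based evaluation protocol, to diagnose the mechanism. Models perceive video content faithfully but underexploit visual evidence during causal reasoning, and stronger post-training amplifies the reliance on textual priors. To address this, the authors propose ADPO, a reinforcement learning framework built on negative teacher alignment, which maximizes the divergence between the policy distributions conditioned on the original and the visually corrupted inputs. This forces the model to ground its reasoning in visual evidence, improving visual engagement while preserving fundamental comprehension.
Details
Motivation: Although LMMs perform well on general video understanding, they are susceptible to textual prior shortcuts in causal discovery. Existing benchmarks only measure response accuracy and cannot reveal the sources and extent of the deficit, so a mechanism-level diagnostic method is needed to understand and address the problem.
Result: Evaluating 17 mainstream LMMs with ProCauEval shows that models systematically underexploit visual evidence, and that higher baseline performance correlates with greater fragility under perturbation. Experiments show that the proposed ADPO improves visual engagement without sacrificing fundamental comprehension.
Insight: The novelty lies in ProCauEval, a perturbation-based protocol that shifts from outcome assessment to mechanism diagnosis, and in ADPO, a reinforcement learning framework built on negative teacher alignment. By explicitly pushing the policy away from a prior-only counterfactual teacher, ADPO forces the model to reason from visual evidence, offering a new approach to the causal discovery deficit in LMMs.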
Abstract: Although Large Multimodal Models (LMMs) have achieved strong performance on general video understanding, their susceptibility to textual prior shortcuts during causal discovery has been recognized as a critical deficit. The underlying mechanisms of this phenomenon remain incompletely understood, as existing benchmarks only measure response accuracy without revealing the sources and extent of the deficit. We introduce ProCauEval, a perturbation-based evaluation protocol that shifts from outcome assessment to mechanism diagnosis, probing causal discovery through five controlled configurations that systematically manipulate visual and textual modalities to decompose their respective contributions to model behavior and dissect the failure modes. Evaluating 17 mainstream LMMs, we find that models faithfully perceive video content yet systematically underexploit it during causal reasoning. We further observe that stronger post-training amplifies rather than mitigates textual prior reliance, and that higher baseline performance correlates with greater fragility under perturbation. To address these, we propose Anti-Distillation Policy Optimization (ADPO), a reinforcement learning framework built on negative teacher alignment, which augments GRPO by explicitly pushing the policy away from a prior-only counterfactual teacher induced by visual corruption. Specifically, ADPO maximizes the divergence between the policy distributions conditioned on the original and visually corrupted inputs, thereby forcing the model to ground its reasoning in visual evidence rather than textual shortcuts. Extensive experiments show that ADPO improves visual engagement without sacrificing fundamental comprehension, thus offering a preliminary step toward reliable causal discovery.
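[40] Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning cs.CL | cs.AR | cs.LG
Aojie Yuan, Tianqi Shen, Dajun Zhang
TL;DR: This paper proposes a semantics-aware memory hierarchy for LLM reasoning. Using cumulative attention scoring, it sorts KV-cache tokens into four tiers (HBM, DDR, compressed, and evicted), moves low-importance tokens to CPU memory rather than discarding them permanently, and prefetches them back to the GPU before each attention step, achieving zero-approximation-error offloading. The method substantially reduces GPU HBM occupancy while maintaining high accuracy.
Details
Motivation: In reasoning LLMs, the KV cache occupies a large share of scarce GPU HBM. Existing methods permanently evict low-importance tokens, which causes reasoning accuracy to collapse. The paper asks whether every token must reside in HBM, and how tiered storage can preserve the information in these tokens and thereby avoid the accuracy loss.
Result: Experiments cover three model scales (7B-32B) and four benchmarks, including GSM8K and MATH-500. With only 3% of tokens evicted, the hierarchy retains 91% of full-cache accuracy on GSM8K and 71% on MATH-500. At 14B scale it matches the uncompressed baseline (90% vs. 86%) while halving HBM occupancy. In a head-to-head reproduction, R-KV, the current SOTA eviction method, achieves only 0-32% at comparable memory budgets. A system prototype shows a transfer overhead of only 5-7%, and scaling analysis projects 2-48 GB of HBM savings at production batch sizes.
Insight: The novelty lies in the zero-approximation-error offloading mechanism and the semantics-aware memory hierarchy. The central finding is that reasoning accuracy depends solely on the fraction of tokens permanently discarded (the eviction ratio), not on how many tokens remain in HBM. This challenges the assumption that low-importance tokens must be permanently removed: tiered storage with prefetching optimizes memory use while preserving information.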
Abstract: Reasoning LLMs produce thousands of chain-of-thought tokens whose KV cache must reside in scarce GPU HBM. The dominant response – permanently evicting low-importance tokens – is catastrophic for reasoning: accuracy collapses to 0-2.5% when half the cache is removed. We ask a different question: must every token live in HBM, or can some live elsewhere? We introduce a semantics-aware memory hierarchy that sorts tokens into four tiers – HBM, DDR, compressed, and evicted – using cumulative attention scoring. Low-importance tokens are moved to CPU memory rather than destroyed; before each attention step they are prefetched back at full precision, contributing exactly the same terms as if they had never left the GPU. We formalize this as zero-approximation-error offloading and derive our central finding: accuracy depends solely on how many tokens are permanently discarded (the eviction ratio), not on how many remain in HBM. A controlled 3x3 grid over HBM and eviction ratios confirms this across three model scales (7B-32B) and four benchmarks. With only 3% eviction, the hierarchy retains 91% of full-cache accuracy on GSM8K and 71% on MATH-500 (n=200); at 14B scale it matches the uncompressed baseline (90% vs. 86%) while halving HBM occupancy. A head-to-head reproduction of R-KV – the current SOTA eviction method – on our setup achieves only 0-32% at comparable budgets. A system prototype with real GPU-CPU data movement shows that the price of this preservation is modest – 5-7% transfer overhead – and scaling analysis projects 2-48 GB HBM savings at production batch sizes.
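The four-tier split described above can be illustrated by ranking tokens on their cumulative attention score and cutting the ranking at fixed sizes. This is a rough sketch under assumptions: the tier capacities, the function name `assign_tiers`, and the simple rank-based cut are hypothetical and not the paper's actual scoring or placement policy.

```python
def assign_tiers(cum_attention, n_hbm, n_ddr, n_compressed):
    """Assign each token to a tier by descending cumulative attention score.

    The top n_hbm tokens stay in HBM, the next n_ddr go to CPU DDR memory,
    the next n_compressed are kept in compressed form, and the rest are evicted.
    """
    order = sorted(range(len(cum_attention)),
                   key=lambda i: cum_attention[i], reverse=True)
    tiers = [None] * len(cum_attention)
    for rank, idx in enumerate(order):
        if rank < n_hbm:
            tiers[idx] = "HBM"
        elif rank < n_hbm + n_ddr:
            tiers[idx] = "DDR"
        elif rank < n_hbm + n_ddr + n_compressed:
            tiers[idx] = "compressed"
        else:
            tiers[idx] = "evicted"
    return tiers

def eviction_ratio(tiers):
    """Fraction of tokens permanently discarded."""
    return tiers.count("evicted") / len(tiers)
```

Under the abstract's central claim, only `eviction_ratio` should matter for accuracy; shifting tokens between the HBM and DDR tiers changes memory occupancy but not the attention result, because DDR tokens are prefetched back at full precision.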
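[41] APCD: Adaptive Path-Contrastive Decoding for Reliable Large Language Model Generation cs.CL | cs.AI
Tianyu Zheng, Hong Wu, Jiaji Zhong
TL;DR: This paper proposes Adaptive Path-Contrastive Decoding (APCD), a framework that targets hallucinations caused by error accumulation in autoregressive decoding, where suboptimal early token choices misguide later generation. It improves output reliability through adaptive exploration and controlled path interaction, with two core components: Entropy-Driven Path Expansion and Divergence-Aware Path Contrast.
Details
Motivation: Error propagation in autoregressive decoding leads LLMs to hallucinate, and existing multi-path decoding methods lack principled strategies for deciding when to branch and how to regulate interaction between paths.
Result: Experiments on eight benchmarks show improved factual accuracy while maintaining decoding efficiency.
Insight: The novelty is a principled multi-path decoding framework that decides when to branch from predictive uncertainty and dynamically attenuates inter-path influence as prediction distributions diverge, yielding more reliable generation and a new direction for improving decoding strategies.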
Abstract: Large language models (LLMs) often suffer from hallucinations due to error accumulation in autoregressive decoding, where suboptimal early token choices misguide subsequent generation. Although multi-path decoding can improve robustness by exploring alternative trajectories, existing methods lack principled strategies for determining when to branch and how to regulate inter-path interactions. We propose Adaptive Path-Contrastive Decoding (APCD), a multi-path decoding framework that improves output reliability through adaptive exploration and controlled path interaction. APCD consists of two components: (1) Entropy-Driven Path Expansion, which delays branching until predictive uncertainty - measured by Shannon entropy over top candidate tokens - indicates multiple plausible continuations; and (2) Divergence-Aware Path Contrast, which encourages diverse reasoning trajectories while dynamically attenuating inter-path influence as prediction distributions diverge. Experiments on eight benchmarks demonstrate improved factual accuracy while maintaining decoding efficiency. Our code is available at https://github.com/zty-king/APCD.
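The branching criterion in Entropy-Driven Path Expansion can be sketched as a Shannon-entropy check over the top candidate tokens. The snippet is an illustration only: renormalizing the top-k probabilities, the values of `k` and `threshold`, and the function names are assumptions, since the abstract does not specify them.

```python
import math

def topk_entropy(probs, k=5):
    """Shannon entropy (in nats) of the renormalized top-k token probabilities."""
    top = sorted(probs, reverse=True)[:k]
    total = sum(top)
    normalized = [p / total for p in top]
    return -sum(p * math.log(p) for p in normalized if p > 0)

def should_branch(probs, k=5, threshold=1.0):
    """Branch only when uncertainty suggests several plausible continuations."""
    return topk_entropy(probs, k) > threshold
```

A sharply peaked distribution gives low entropy and keeps decoding on a single path, while a near-uniform distribution over the top candidates crosses the threshold and triggers a branch.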
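[42] Beyond Language: Format-Agnostic Reasoning Subspaces in Large Language Models cs.CL | cs.LG
Aojie Yuan, Zhiyuan Su
TL;DR: This paper asks whether large language models share a common internal representation of reasoning across different symbolic systems (English prose, Python code, mathematical notation). Using the newly introduced TriForm Benchmark and several analysis methods, it finds a Format-Agnostic Reasoning Subspace (FARS) in the models' middle layers that amplifies concept structure while suppressing form information. It also reveals an asymmetry between declarative and procedural representations.
Details
Motivation: To determine whether the internal representations LLMs use when reasoning over different surface forms (natural language, code, mathematical notation) share a common, format-independent substrate.
Result: Five LLMs (1.6B-8B parameters) were analyzed on the TriForm Benchmark (18 concepts, 6 forms, 3 instances). A 10-dimensional FARS subspace amplifies concept structure 3x while suppressing form information to near zero. In cross-form activation patching, replacing only these 10 dimensions preserves 90-96% of model output, far exceeding full activation replacement (44-56%) and variance-maximizing PCA (60-74%). The subspace generalizes to held-out concepts and converges across architectures (CCA > 0.79 for all model pairs).
Insight: The novelty is the concept of a Format-Agnostic Reasoning Subspace, made concrete through concept-centroid PCA, which provides within-modality evidence for the Platonic Representation Hypothesis. A key finding is that the main representational asymmetry lies between declarative (prose/mathematics) and procedural (code) forms, not between linguistic and formal notation.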
Abstract: Large language models represent the same reasoning in vastly different surface forms – English prose, Python code, mathematical notation – yet whether they share a common internal substrate across these symbolic systems remains unknown. We introduce the TriForm Benchmark (18 concepts x 6 forms x 3 instances = 324 stimuli) and study five LLMs (1.6B-8B) across three architecture families. Using permutation-corrected RSA, cross-form probing, and activation patching, we find converging evidence for a Format-Agnostic Reasoning Subspace (FARS) in middle layers. We make FARS concrete: concept-centroid PCA extracts a 10-dimensional subspace that amplifies concept structure 3x while suppressing form information to near zero. Replacing only these 10 dimensions during cross-form patching preserves 90-96% of model output – far exceeding both full activation replacement (44-56%) and variance-maximizing PCA (60-74%) – while ablating them causes targeted disruption. FARS generalizes to held-out concepts and converges across architectures (CCA > 0.79 for all model pairs), providing within-modality evidence for the Platonic Representation Hypothesis. We further discover a declarative-procedural asymmetry: representations are far more compatible between prose and mathematics than between either and code, suggesting that the critical axis of divergence is not linguistic vs. formal but declarative vs. procedural.
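Replacing only a low-dimensional subspace during cross-form patching can be written as swapping the subspace component of one activation for that of another while leaving the orthogonal remainder untouched. The sketch below assumes an orthonormal basis for the subspace is already available (the paper obtains one via concept-centroid PCA, which is not reproduced here); `project` and `patch_subspace` are hypothetical helper names.

```python
def project(x, basis):
    """Project vector x onto the span of the orthonormal vectors in `basis`."""
    out = [0.0] * len(x)
    for b in basis:
        coeff = sum(xi * bi for xi, bi in zip(x, b))
        for j, bj in enumerate(b):
            out[j] += coeff * bj
    return out

def patch_subspace(target, source, basis):
    """Replace target's component inside the subspace with source's component."""
    p_target = project(target, basis)
    p_source = project(source, basis)
    return [t - pt + ps for t, pt, ps in zip(target, p_target, p_source)]
```

With a 10-vector basis this corresponds to the "replace only these 10 dimensions" setting in the abstract; full activation replacement would instead return `source` unchanged.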
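[43] Hidden Error Awareness in Chain-of-Thought Reasoning: The Signal Is Diagnostic, Not Causal cs.CL | cs.AI | cs.LG
Aojie Yuan, Zhiyuan Julian Su, Haiyue Zhang, Yi Nian, Yue Zhao
TL;DR: This paper finds that during chain-of-thought reasoning, large language models internally detect their own reasoning errors yet outwardly express high confidence in them. This "hidden error awareness" signal is diagnostic, not causal: it reflects the quality of the computation but cannot be used directly to correct the errors.
Details
Motivation: To challenge the basic assumption of chain-of-thought prompting, namely that generated reasoning reflects the model's internal computation, and to expose the disconnect between a model's internal error detection and what it expresses.
Result: A linear probe on hidden states predicts trace correctness with an AUROC of 0.95, while a text-surface classifier reaches only 0.59. Verbalized confidence for wrong traces (4.55/5) is nearly identical to that for correct ones (4.87/5). The phenomenon holds across several model families (Qwen, Llama, Phi) and for RL-trained reasoning models (DeepSeek-R1).
Insight: The paper reveals an internal error-detection signal that is decoupled from the model's outward expression. This delineates a boundary for mechanistic interpretability: error representations during reasoning are fundamentally different from the factual knowledge representations that have been successfully edited, and the signal, being diagnostic, cannot be used directly to intervene in and correct the reasoning process.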
Abstract: Chain-of-thought (CoT) prompting assumes that generated reasoning reflects a model’s internal computation. We show this assumption is wrong in a specific, measurable way: models internally detect their own reasoning errors but outwardly express confidence in them. A linear probe on hidden states predicts trace correctness with 0.95 AUROC – from the very first reasoning step (0.79) – while verbalized confidence for wrong traces is 4.55/5, nearly identical to correct ones (4.87/5). A text-surface classifier achieves only 0.59 on the same data, confirming a 0.20-point gap invisible in the generated text. This hidden error awareness holds across three model families (Qwen, Llama, Phi), 1.5B-72B parameters, and RL-trained reasoning models (DeepSeek-R1, 0.852 AUROC). The natural question is whether this signal can fix the errors it detects. It cannot. Four interventions – activation steering, probe-guided best-of-N, self-correction, and activation patching – all fail; patching destroys output coherence entirely. The signal is diagnostic, not causal: a readout of computation quality, not a lever to redirect it. This delineates a boundary for mechanistic interpretability: error representations during reasoning are fundamentally different from the factual knowledge representations that prior work has successfully edited.
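[44] TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM cs.CL | cs.AI
Haoyang Zhou, Li Kong, Shijie Ren, Xiting Wang, Shuang Liang
TL;DR: This paper proposes TAD (Temporal-Aware Trajectory Self-Distillation), a framework that targets the accuracy-parallelism trade-off of diffusion large language models (dLLMs) in parallel text generation. It partitions the masked positions in teacher-generated decoding trajectories into near and distant subsets according to the number of decoding steps remaining, then trains the student with a hard cross-entropy loss on near tokens and a soft KL divergence loss on distant tokens, speeding up generation while maintaining or improving quality.
Details
Motivation: Diffusion LLMs offer a promising paradigm for parallel text generation, but in practice they face an accuracy-parallelism trade-off: increasing tokens per forward (TPF) usually degrades generation quality. Existing acceleration methods often gain speed at the cost of accuracy, a limitation this paper aims to address.
Result: Experiments show that TAD consistently improves the accuracy-parallelism trade-off. On LLaDA, the Quality model raises average accuracy from 46.2% to 51.6%, and the Speed model raises average AUP from 46.2 to 257.1.
Insight: The core innovation is a temporal-aware trajectory self-distillation framework that partitions masked tokens by the number of decoding steps remaining before they are revealed, and supervises near tokens with hard labels and distant tokens with soft distributions. This encourages confident predictions for tokens about to be decoded while preserving future planning knowledge, giving a better balance of speed and accuracy and naturally yielding two deployment configurations, one favoring quality and one favoring speed.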
Abstract: Diffusion large language models (dLLMs) offer a promising paradigm for parallel text generation, but in practice they face an accuracy-parallelism trade-off, where increasing tokens per forward (TPF) often degrades generation quality. Existing acceleration methods often gain speed at the cost of accuracy. To address this limitation, we propose TAD, a Temporal-Aware trajectory self-Distillation framework. During data construction, we condition a teacher model on both the prompt and the ground-truth response to generate decoding trajectories, recording the intermediate masked states throughout the process. Based on how many decoding steps remain before each masked token is revealed, we partition masked positions into near and distant subsets. For near tokens, we train the student with a hard cross-entropy loss using the teacher trajectory tokens as labels, encouraging confident predictions for tokens that are about to be decoded. For distant tokens, we apply a soft KL divergence loss between the teacher and student token distributions, providing softer supervision and preserving future planning knowledge. This temporal-aware partition naturally gives rise to two deployment configurations: a Quality model that prioritizes accuracy and a Speed model that favors more aggressive acceleration. Experiments show that TAD consistently improves the accuracy-parallelism trade-off. On LLaDA, it raises average accuracy from 46.2% to 51.6% with the Quality model and average AUP from 46.2 to 257.1 with the Speed model. Our code is available at: https://github.com/BHmingyang/TAD
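The near/distant split can be sketched as a per-position loss that switches on the number of decoding steps remaining. This is a minimal illustration under stated assumptions: the cutoff `near_horizon`, the KL direction (teacher relative to student), and the function name are hypothetical, as the abstract only says that near tokens use hard cross-entropy and distant tokens use a soft KL loss.

```python
import math

def tad_position_loss(student_probs, teacher_probs, teacher_token,
                      steps_remaining, near_horizon=2):
    """Loss for one masked position.

    Near positions (revealed within `near_horizon` steps) use hard cross-entropy
    against the teacher's trajectory token; distant positions use KL divergence
    between the teacher and student distributions.
    """
    if steps_remaining <= near_horizon:
        return -math.log(student_probs[teacher_token])
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0)
```

Moving `near_horizon` up or down shifts how many positions get hard supervision, which is one plausible knob behind the Quality and Speed configurations, though the abstract does not say how those two models are actually derived.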
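[45] Crosslingual On-Policy Self-Distillation for Multilingual Reasoning cs.CL
Yihong Liu, Raoyuan Zhao, Michael A. Hedderich, Hinrich Schütze
TL;DR: This paper proposes Crosslingual On-Policy Self-Distillation (COPSD) to address the weak mathematical reasoning of large language models in low-resource languages. The method transfers a model's own reasoning behavior in a high-resource language (such as English) to low-resource languages, using the same model as both student and teacher: the student sees only the low-resource problem, while the teacher receives privileged crosslingual context (the English translation and reference solution). Training minimizes full-distribution token-level divergence on the student's own rollouts, providing dense supervision and avoiding the sparsity and instability of outcome-only reinforcement learning.
Details
Motivation: Large language models have made remarkable progress in mathematical reasoning, but this ability is uneven across languages, and low-resource languages in particular show much lower reasoning performance. A method is needed to improve reasoning in these languages.
Result: Experiments on 17 low-resource African languages show that COPSD consistently improves low-resource mathematical reasoning across model sizes and substantially outperforms Group Relative Policy Optimization (GRPO). Further analyses show that COPSD improves answer-format adherence, strengthens test-time scaling, and generalizes to harder multilingual reasoning benchmarks, with especially large gains for lower-resource languages.
Insight: The novelty is a crosslingual on-policy self-distillation framework that uses the model's own high-resource reasoning as the supervision signal and provides dense training through full-distribution token-level divergence minimization, avoiding the sparsity of conventional reinforcement learning. The approach may transfer to other tasks that require crosslingual or cross-domain knowledge transfer, especially where resources are unevenly distributed.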
Abstract: Large language models (LLMs) have achieved remarkable progress in mathematical reasoning, but this ability is not equally accessible across languages. Low-resource languages in particular exhibit much lower reasoning performance. To address this, we propose Crosslingual On-Policy Self-Distillation (COPSD), which transfers a model’s own high-resource reasoning behavior to low-resource languages. COPSD uses the same model as student and teacher: the student sees only the low-resource problem, while the teacher receives privileged crosslingual context, including the problem translation and reference solution in English. Training minimizes full-distribution token-level divergence on the student’s own rollouts, providing dense supervision while avoiding the sparsity and instability of outcome-only reinforcement learning (RL). Experiments on 17 low-resource African languages show that COPSD consistently improves low-resource mathematical reasoning across model sizes and substantially outperforms Group Relative Policy Optimization (GRPO). Further analyses show that COPSD improves answer-format adherence, strengthens test-time scaling, and generalizes to harder multilingual reasoning benchmarks, with especially large gains for lower-resource languages. We make our code and data available at: https://github.com/cisnlp/COPSD.
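[46] Towards Compact Sign Language Translation: Frame Rate and Model Size Trade-offs cs.CL | cs.CV
Kuanwei Chen, Mengfeng Tsai
TL;DR: This paper proposes a compact Sign Language Translation (SLT) pipeline that couples MMPose skeletal pose extraction with a linear projection into T5-small, using only 77M parameters. Varying the input frame rate exposes an efficiency-performance trade-off: at 12 fps, computational complexity drops substantially with only a limited loss in performance.
Details
Motivation: Current gloss-free SLT approaches rely on large encoder-decoder models, which limits deployment. The paper aims to design a more compact and efficient SLT system.
Result: On the How2Sign benchmark, BLEU-4 is 9.53 at 12 fps (10.06 at 24 fps), encoder self-attention computational complexity falls by 75%, and the model is roughly 3x smaller than prior T5-base systems while remaining competitive.
Insight: The novelty is simplifying the architecture with lightweight skeletal pose extraction and a linear projection, with no hierarchical encoders or large-scale models. The analysis indicates that adjusting the frame rate is a practical way to trade computational efficiency against translation quality.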
Abstract: Sign Language Translation (SLT) converts sign language videos into spoken-language text, bridging communication between Deaf and hearing communities. Current gloss-free approaches rely on large encoder-decoder models, limiting deployment. We propose a compact 77M-parameter pipeline that couples MMPose skeletal pose extraction with a single linear projection into T5-small. By varying the input frame rate, we expose a practical efficiency trade-off: at 12 fps the model halves its sequence length, achieving a 75% reduction in encoder quadratic self-attention computational complexity while incurring only a modest BLEU-4 drop (9.53 vs. 10.06 at 24 fps on How2Sign). Our system is roughly 3x smaller than prior T5-base systems, demonstrating that a lightweight architecture can remain competitive without hierarchical encoders or large-scale models.
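The 75% figure follows directly from the quadratic cost of self-attention in sequence length: halving the frame rate halves the sequence, and (1/2)^2 = 1/4 of the original cost remains. The short check below makes the arithmetic explicit; the function name is hypothetical, and it assumes sequence length scales linearly with frame rate.

```python
def attention_cost_reduction(fps_old, fps_new):
    """Relative reduction in quadratic self-attention cost when the frame rate
    changes, assuming sequence length is proportional to frame rate."""
    ratio = fps_new / fps_old
    return 1.0 - ratio ** 2

print(attention_cost_reduction(24, 12))  # 0.75
```

This covers only the quadratic attention term in the encoder; components that scale linearly with sequence length would shrink by about half instead.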
[47] CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics cs.CL | cs.AI | cs.LGPDF
Aishik Nagar, Arun-Kumar Kaliya-Perumal, Yu-Hsuan Han, Andrew Sheng-Han Huang, Kristen Kee
TL;DR: 论文提出了CLR-voyance框架,将住院临床推理重新定义为部分可观测马尔可夫决策过程,并利用基于结果且经过临床医生验证的奖励进行监督。该框架实例化为CLR-POMDP,通过将成功的患者诊疗路径划分为策略可见的过去和仅预言模型可见的未来,生成病例特定的查询-答案对和首个可验证的临床推理自适应评分标准。基于此,论文对Qwen3-8B和MedGemma-4B模型进行了GRPO后训练和模型合并,得到的CLR-voyance-8B模型在住院临床推理任务上达到了最先进的性能,并在现有医学基准测试中表现相当或更好。
Details
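Motivation: Existing clinical-LLM evaluations and RL reward signals reduce inpatient clinical reasoning, a sequential decision problem under partial observability, to closed-form retrieval, clinical journey leakage, or unanchored LLM-as-judge scoring.
Result: CLR-voyance-8B achieves 84.91% accuracy on the CLR-POMDP benchmark, ahead of frontier medical reasoning models such as GPT-5 (77.83%) and MedGemma-27B (66.66%), and performs comparably or better on existing medical benchmarks.
Insight: The main contributions are: (1) formalizing inpatient clinical reasoning as a POMDP; (2) the first method for generating verifiable, case-adaptive rubrics for clinical reasoning; and (3) a large-scale clinician alignment study that validates the framework's clinical relevance and offers insights on clinical LLM-as-judge and preference-model selection.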
Abstract: Inpatient clinical reasoning is a sequential decision under partial observability: the clinician sees the admission so far and must choose the next action whose downstream consequences are not yet visible. Existing clinical-LLM evaluations and RL reward signals collapse this into closed-form retrieval, clinical journey leakage, or unanchored LLM-as-judge scoring. We introduce CLR-voyance, a framework that reformulates inpatient reasoning as a Partially Observable Markov Decision Process (POMDP) and supervises it with rewards that are simultaneously outcome-grounded and clinician-validated. We instantiate the formulation as CLR-POMDP, which partitions successful patient journeys into a policy-visible past and an oracle-only future. Using the past information, an oracle LLM generates a case-specific query-answer pair and the first adaptive rubric for clinical reasoning, which is verifiable in the future of the patient journey. These rubrics are used for both post-training and evaluation of models for inpatient clinical reasoning. We post-train Qwen3-8B and MedGemma-4B with GRPO followed by model merging, yielding state-of-the-art inpatient clinical reasoning while retaining generalist capabilities. CLR-voyance-8B achieves 84.91% on CLR-POMDP, ahead of frontier medical reasoning models like GPT-5 (77.83%) and MedGemma-27B (66.66%), and has comparable or better performance on existing medical benchmarks. To ensure a clinically meaningful setting, we conduct a large-scale clinician alignment study, where physicians curate per-case rubrics, grade candidate responses, and provide blinded pairwise preferences of model reasoning. This study provides insights on clinical LLM-as-a-judge and clinical preference-model selection, which can inform the community at large. CLR-voyance has been deployed for 6+ months at a partner public hospital, drafting thousands of reasoning-heavy inpatient notes.
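[48] Edit-Based Refinement for Parallel Masked Diffusion Language Models cs.CL
Houxing Ren, Mingjie Zhan, Zimu Lu, Ke Wang, Yunqiao Yang
TL;DR: This paper proposes ME-DLM, an edit-based refinement framework that improves the generation quality of parallel masked diffusion language models. After parallel diffusion generation, lightweight post-editing steps (replacement, deletion, insertion) address the weak sequence consistency that arises when multiple tokens are generated at once, improving output quality while keeping decoding efficient.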
Details
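Motivation: Masked diffusion language models generate tokens in parallel and decode more efficiently than autoregressive models, but their performance degrades significantly when many tokens are generated simultaneously, because token-level training objectives do not match joint sequence consistency.
Result: Built on LLaDA, ME-DLM gains 11.6 points on HumanEval and 33.6 points on GSM8K while using only one-eighth of the diffusion steps, a marked improvement in both quality and efficiency.
Insight: The novelty is a deterministic, edit-distance-based supervision signal for post-editing refinement. Globally conditioned edits promote sequence-level consistency while preserving the efficiency of parallel diffusion decoding, combining parallel generation with sequence-level correction.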
Abstract: Masked diffusion language models enable parallel token generation and offer improved decoding efficiency over autoregressive models. However, their performance degrades significantly when generating multiple tokens simultaneously, due to a mismatch between token-level training objectives and joint sequence consistency. In this paper, we propose ME-DLM, an edit-based refinement framework that augments diffusion generation with lightweight post-editing steps. After producing an initial complete response, the model refines it through minimal edit operations, including replacement, deletion, and insertion, conditioned on the full sequence. Training supervision is derived from edit distance, providing a deterministic signal under a fixed canonicalization scheme for learning minimal corrections. This approach encourages sequence-level consistency through globally conditioned edits while preserving the efficiency benefits of parallel diffusion decoding. Extensive experiments demonstrate that ME-DLM improves the quality and robustness of multi-token parallel generation. In particular, when built upon LLaDA, our method achieves consistent gains of 11.6 points on HumanEval and 33.6 points on GSM8K while using one-eighth of the total diffusion steps. Code is available at https://github.com/renhouxing/ME-DLM.
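To make the edit-distance supervision concrete, here is a minimal sketch that derives a minimal edit script from a draft sequence to a reference using standard Levenshtein dynamic programming. The tie-breaking order in the backtrace (keep/replace, then delete, then insert) stands in for the "fixed canonicalization scheme" mentioned in the abstract; that order and the function name `minimal_edits` are assumptions for illustration, not the paper's actual scheme.

```python
def minimal_edits(source, target):
    """Return a minimal list of (op, source_index, token) edits turning source into target."""
    n, m = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + sub,  # keep or replace
                             dist[i - 1][j] + 1,        # delete
                             dist[i][j - 1] + 1)        # insert
    # Backtrace with a fixed preference order so the edit script is deterministic.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (source[i - 1] != target[j - 1]):
            if source[i - 1] != target[j - 1]:
                ops.append(("replace", i - 1, target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", i - 1, None))
            i -= 1
        else:
            ops.append(("insert", i, target[j - 1]))
            j -= 1
    ops.reverse()
    return ops

draft = ["the", "cat", "sat"]
reference = ["the", "dog", "sat", "down"]
edits = minimal_edits(draft, reference)
print(edits)  # [('replace', 1, 'dog'), ('insert', 3, 'down')]
```

Because the backtrace always prefers the same operation when several minimal scripts exist, the same draft and reference always yield the same training target.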
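[49] Statistical Scouting Finds Debate-Safe but Not Debate-Useful Cases: A Matched-Ceiling Study of Open-Weight LLM Reasoning Protocols cs.CL | cs.CY
Julia Hu, Alfred Shen, Kumar Lakshmipathi
TL;DR: This paper studies how to choose the best reasoning protocol (direct greedy decoding, voting, or two-agent critique-revise debate) for language models (Llama 3.1 8B Instruct and Ministral 3 8B Instruct) on MuSiQue and GSM8K under a matched ceiling on generated tokens (960 per example), using pre-deliberation signals such as vote entropy. An oracle that selects the correct protocol per example yields large gains (+14.0 and +13.7 percentage points), but cheap ex-ante signals recover little of this headroom.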
Details
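Motivation: To determine how cheap ex-ante signals such as vote entropy can dynamically select the best reasoning protocol (direct answering, voting, or multi-agent debate) under a limit on generated tokens, maximizing performance and clarifying where each protocol is effective.
Result: On MuSiQue, oracle selection gains +14.0 and +13.7 percentage points over the best fixed protocol. A vote-entropy threshold controller directionally beats the best fixed protocol on both models (+1.3 and +1.7 percentage points) but without statistical significance (joint analysis p=0.125). Learned controllers (logistic regression, gradient-boosted trees) do not outperform the threshold.
Insight: The key structural finding is that vote entropy predicts when debate is safe (avoiding regressions), not when debate is needed; debate-helpful cases mostly occur when voting is unanimous but wrong. This exposes the limits of cheap signals as routers and indicates that, at the 8B scale, recovering the remaining headroom requires behavioral probes that avoid format-compliance confounds.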
Abstract: When should a language model answer directly, sample and vote, or engage in multi-agent debate? Recent work shows voting often explains much of the gain attributed to debate, while selective-debate systems activate deliberation only on uncertain examples. We ask: under a matched ceiling on generated tokens (960 per example), how much per-example routing headroom exists, and how much is recoverable from cheap pre-deliberation signals? We evaluate greedy decoding, three-sample voting, and a two-agent critique-revise debate on MuSiQue and GSM8K using Llama 3.1 8B Instruct and Ministral 3 8B Instruct. On MuSiQue, an oracle selecting the correct protocol per example gains +14.0 and +13.7 pp over the best fixed one. The best fixed protocol is model- and dataset-dependent: each (model, dataset) cell has a different winner. This headroom is hard to recover from cheap ex-ante signals. A vote-entropy threshold is the only controller that directionally beats the best fixed protocol on both models (+1.3 and +1.7 pp), though individual paired-bootstrap CIs include zero. A joint analysis (meta-analysis +1.6 pp, p=0.125; Bayesian P(both>0)=0.59) is directionally consistent but not significant. Learned controllers (LR, GBT) do not outperform the threshold. The key finding is structural: vote entropy predicts where debate is safe, not where debate is needed. High entropy sharply reduces debate backfire, but 66% of debate-helpful examples (31/47) occur when voting is unanimous but wrong. A single-prompt self-critique probe on Llama flips the answer in 127/127 unanimous cases, yielding zero mutual information with the debate-helpful label; we cannot rule out a prompt-compliance artifact, but either interpretation disqualifies the probe as a router. Recovering the remaining headroom requires behavioral probes that avoid format-compliance confounds at the 8B scale.
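A vote-entropy threshold router of the kind discussed above can be sketched in a few lines. The abstract does not give the controller's exact rule or threshold, so the routing direction (debate only when entropy is high), the value `0.5`, and the function names are assumptions for illustration.

```python
import math
from collections import Counter

def vote_entropy(answers):
    """Shannon entropy (in bits) of the empirical distribution over sampled answers."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def route(answers, threshold=0.5):
    """Send high-entropy (disagreeing) votes to debate; otherwise keep the majority vote."""
    return "debate" if vote_entropy(answers) > threshold else "vote"

# Three-sample voting, as in the paper's setup; the answers are made up.
unanimous = ["Paris", "Paris", "Paris"]
split = ["Paris", "Paris", "Lyon"]
scattered = ["Paris", "Lyon", "Nice"]

for votes in (unanimous, split, scattered):
    print(votes, round(vote_entropy(votes), 3), route(votes))
```

A router of this shape never sends a unanimous vote to debate, so by construction it cannot reach the unanimous-but-wrong examples that the abstract identifies as 66% of debate-helpful cases.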
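[50] K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs cs.CL
Hao Liang, Qihan Lin, Zhaoyang Han, Xiaochen Ma, Zhen Hao Wong
TL;DR: The paper presents K12-KGraph, a knowledge graph aligned with the K-12 curriculum for evaluating and training educational LLMs. The graph is built from People's Education Press textbooks and contains multiple node and relation types. From it, the authors derive the K12-Bench evaluation benchmark and the K12-Train training dataset. Experiments show that current LLMs fall well short on curriculum cognition, and that supervised fine-tuning on K12-Train efficiently improves performance on educational tasks.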
Details
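Motivation: Existing educational AI benchmarks mainly evaluate factual recall, whereas effective educational AI also needs to understand how curriculum knowledge is structured (prerequisite chains, concept taxonomies, experiment-concept links, and so on), i.e., curriculum cognition. Resources for evaluating and improving this ability are lacking.
Result: On K12-Bench, Gemini-3-Flash reaches only 57% exact match and the best open-source model, Gemma-4-31B-IT, reaches 46%, revealing major gaps in curriculum cognition. Under a strictly matched 2,300-sample SFT budget, Qwen3-4B-Base and Llama-3.1-8B-Base fine-tuned on K12-Train consistently outperform the same models fine-tuned on equally sized subsets of eight mainstream instruction-tuning datasets on GaokaoBench and EduEval, showing that curriculum-structured supervision is highly sample-efficient.
Insight: The core contribution is a curriculum-aligned knowledge graph whose structure yields both a benchmark and training data, shifting evaluation from factual recall to understanding of structural relations among knowledge (curriculum cognition). This provides structured data resources and a methodology for evaluating and efficiently training educational LLMs.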
Abstract: Large language models (LLMs) are increasingly used in K-12 education, yet existing benchmarks such as C-Eval, CMMLU, GaokaoBench, and EduEval mainly evaluate factual recall through exam-style question answering. Effective educational AI additionally requires curriculum cognition: understanding how knowledge is structured through prerequisite chains, concept taxonomies, experiment-concept links, and pedagogical sequencing. To address this gap, we introduce K12-KGraph, a curriculum-aligned knowledge graph extracted from official People’s Education Press textbooks across mathematics, physics, chemistry, and biology from primary to high school. The graph contains seven node types (Concept, Skill, Experiment, Exercise, Section, Chapter, Book) and nine relation types covering taxonomy, prerequisite, association, verification, assessment, location, and order. Based on this graph, we construct two resources: (1) K12-Bench, a 23,640-question multi-select benchmark spanning five graph-derived task families (Ground, Prereq, Neighbor, Evidence, and Locate); and (2) K12-Train, a KG-guided supervised fine-tuning corpus of approximately 2,300 QA pairs synthesized from graph structure and node attributes. Experiments reveal substantial deficiencies in curriculum cognition: on K12-Bench, Gemini-3-Flash achieves only 57% exact match, while the best open-source model, Gemma-4-31B-IT, reaches 46%. Under a strictly matched 2,300-sample SFT budget on Qwen3-4B-Base and Llama-3.1-8B-Base, K12-Train consistently outperforms equally sized subsets from eight mainstream instruction-tuning corpora on both GaokaoBench and EduEval, demonstrating that curriculum-structured supervision is highly sample-efficient for educational tuning. We release the graph, benchmark, training data, and full construction pipeline.
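[51] MedMeta: A Benchmark for LLMs in Synthesizing Meta-Analysis Conclusion from Medical Studies cs.CL | cs.AI
Huy Hoang Ha, Benoit Favre, Francois Portet
TL;DR: The paper introduces MedMeta, the first benchmark for evaluating how well LLMs synthesize meta-analysis conclusions from medical study abstracts. It comprises 81 PubMed meta-analyses and two evaluation workflows, Golden-RAG and Parametric-only. Information grounding proves critical: Golden-RAG significantly outperforms the Parametric-only approach, while domain-specific fine-tuning brings little benefit once external material is provided. All models fail to identify negated evidence, and even under ideal RAG conditions current LLMs perform only slightly above average.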
Details
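Motivation: Higher-order reasoning by LLMs in medicine, such as synthesizing evidence from multiple sources, is under-evaluated; existing benchmarks focus on factual recall and lack a systematic assessment of evidence synthesis.
Result: On MedMeta, the Golden-RAG workflow significantly outperforms the Parametric-only approach; the LLM-as-a-judge protocol correlates strongly with human expert ratings (Pearson r=0.81); all models fail to identify negated evidence; and under ideal RAG conditions current LLMs average only about 2.7/5.0.
Insight: The paper presents the first benchmark focused on synthesizing medical meta-analysis conclusions and shows that information grounding (RAG) matters more than domain fine-tuning. It also exposes a widespread weakness of current RAG systems in handling negated evidence, suggesting that clinical applications should prioritize robust RAG systems over model specialization alone.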
Abstract: Large language models (LLMs) have saturated standard medical benchmarks that test factual recall, yet their ability to perform higher-order reasoning, such as synthesizing evidence from multiple sources, remains critically under-explored. To address this gap, we introduce MedMeta, the first benchmark designed to evaluate an LLM’s ability to generate conclusions from medical meta-analyses using only the abstracts of cited studies. MedMeta comprises 81 meta-analyses from PubMed (2018–2025) and evaluates models using two distinct workflows: a Retrieval-Augmented Generation (Golden-RAG) setting with ground-truth abstracts, and a Parametric-only approach relying on internal knowledge. Our evaluation framework is validated by a well-structured analysis showing our LLM-as-a-judge protocol strongly aligns with human expert ratings, as evidenced by high Pearson’s r correlation (0.81) and Bland-Altman analysis revealing negligible systematic bias, establishing it as a reliable proxy for scalable evaluation. Our findings underscore the critical importance of information grounding: the Golden-RAG workflow consistently and significantly outperforms the Parametric-only approach across models. In contrast, the benefits of domain-specific fine-tuning are marginal and largely neutralized when external material is provided. Furthermore, stress tests show that all models, regardless of architecture, fail to identify and reject negated evidence, highlighting a critical vulnerability in current RAG systems. Notably, even under ideal RAG conditions, current LLMs achieve only slightly above-average performance (~2.7/5.0). MedMeta provides a challenging new benchmark for evidence synthesis and demonstrates that for clinical applications, developing robust RAG systems is a more promising direction than model specialization alone.
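[52] The Silent Vote: Improving Zero-Shot LLM Reliability by Aggregating Semantic Neighborhoods cs.CL | cs.AI
Sanket Badhe, Priyanka Tiwari, Deep Shah
TL;DR: This paper proposes Semantic Softmax, an inference-time layer that improves the reliability of large language models in zero-shot classification. By aggregating the scores of the semantic neighborhood around each target label, it counters the renormalization bias of standard constrained decoding, reducing information loss and improving calibration and classification performance.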
Details
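Motivation: When standard constrained decoding restricts the model to a target label set, it discards the probability mass assigned to semantic synonyms. The resulting renormalization bias makes the model overconfident and poorly calibrated.
Result: On Qwen-3 and Phi-4-mini with the GoEmotions and Civil Comments datasets, Semantic Softmax improves consistently on every evaluation metric: it substantially reduces Expected Calibration Error and Brier Score while raising discriminative performance in AUROC and Macro-F1.
Insight: The paper identifies and defines Renormalization Bias and the Silent Vote phenomenon, and proposes Semantic Softmax, an inference-time layer that aggregates semantic-neighborhood information to make fuller use of the model's probability distribution and improve zero-shot calibration and accuracy.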
Abstract: Large Language Models are increasingly used as zero-shot classifiers in complex reasoning tasks. However, standard constrained decoding suffers from a phenomenon we define as Renormalization Bias. When a model is restricted to a small set of target labels, the standard softmax operation discards the probability mass assigned to semantic synonyms in the original distribution. This loss of information, which we call the Silent Vote, results in artificial overconfidence and poor calibration. We propose Semantic Softmax, an inference-time layer that recovers this lost information by aggregating the scores of the semantic neighborhood surrounding each target label. We evaluate this approach on Qwen-3 and Phi-4-mini models using GoEmotions and Civil Comments datasets. Our results demonstrate consistent improvements across all evaluation metrics: Semantic Softmax substantially reduces Expected Calibration Error (ECE) and Brier Score, while simultaneously enhancing discriminative performance in terms of AUROC and Macro-F1. By accounting for linguistic nuances, our method provides a more calibrated and accurate alternative for zero-shot classification.
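The contrast between constrained decoding and neighborhood aggregation can be sketched on a toy vocabulary. The abstract does not specify how neighborhoods are built or how scores are combined, so this sketch assumes hand-picked synonym sets and sums full-vocabulary probabilities per label; the logits and function names are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a dict of token -> logit."""
    peak = max(scores.values())
    exps = {tok: math.exp(val - peak) for tok, val in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def constrained_softmax(logits, labels):
    """Standard constrained decoding: renormalize over the label tokens only."""
    return softmax({label: logits[label] for label in labels})

def semantic_softmax(logits, neighborhoods):
    """Sum full-vocabulary probability over each label's neighborhood, then renormalize."""
    probs = softmax(logits)
    mass = {label: sum(probs[tok] for tok in toks if tok in probs)
            for label, toks in neighborhoods.items()}
    total = sum(mass.values())
    return {label: m / total for label, m in mass.items()}

# Toy next-token logits (assumed values).
logits = {"joy": 2.0, "happy": 1.9, "glad": 1.5, "anger": 2.2, "mad": 0.1}
# Hypothetical semantic neighborhoods; each includes the label token itself.
neighborhoods = {"joy": ["joy", "happy", "glad"], "anger": ["anger", "mad"]}

constrained = constrained_softmax(logits, ["joy", "anger"])
semantic = semantic_softmax(logits, neighborhoods)
print("constrained:", {k: round(v, 3) for k, v in constrained.items()})
print("semantic:   ", {k: round(v, 3) for k, v in semantic.items()})
```

In this toy case the label-only softmax favors "anger" because its single token has the highest logit, while the aggregated version favors "joy" once the mass on "happy" and "glad" is counted, which is the kind of discarded "silent vote" the abstract describes.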
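[53] ConFit v3: Improving Resume-Job Matching with LLM-based Re-Ranking cs.CL
Xiao Yu, Ruize Xu, Chengyuan Xue, Junyu Chen, Matthew So
TL;DR: This paper presents ConFit v3, an LLM-based re-ranking approach for improving resume-job matching systems. It systematically analyzes the LLM re-ranker training pipeline, covering inference algorithm design, RL algorithm selection, data processing, and SFT distillation, and trains on real-world person-job fit datasets, yielding significant gains in matching performance.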
Details
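Motivation: Embedding-based resume-job matching methods such as ConFit and ConFit v2 lack controllability and explainability, and existing LLM re-ranker training recipes were developed on short-document benchmarks without accounting for noise in real-world recruiting data.
Result: On real-world person-job fit datasets, ConFit v3 trained with Qwen3-8B and Qwen3-32B significantly outperforms the best existing person-job fit systems and strong LLMs such as GPT-5 and Claude Opus-4.5, reaching state-of-the-art performance.
Insight: The contributions include multi-pass re-ranking, training with listwise RL objectives, removing noisy samples, and SFT distillation from a stronger LLM before RL. These findings offer useful guidance for adapting LLM-based re-rankers to person-job fit systems.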
Abstract: A reliable resume-job matching system helps a company find suitable candidates from a pool of resumes and helps a job seeker find relevant jobs from a list of job posts. While recent advances in embedding-based methods such as ConFit and ConFit v2 can efficiently retrieve candidates at scale, the lack of controllability and explainability limits their real-world adaptations. LLM-based re-rankers can address these limitations through reasoning, but existing training recipes are developed on short-document benchmarks and do not account for noise in real-world recruiting data. In this work, we first conduct a systematic analysis over the LLM re-ranker training pipeline for person-job fit, covering inference algorithm design, RL algorithm selection, data processing, and SFT distillation. We find that using multi-pass re-ranking, training with listwise RL objectives, removing noisy samples, and distilling from a stronger LLM before RL significantly improves re-ranking performance. We then aggregate these findings to train ConFit v3 with Qwen3-8B and Qwen3-32B on real-world person-job fit datasets, and find significant improvements over existing best person-job fit systems as well as strong LLMs such as GPT-5 and Claude Opus-4.5. We hope our findings provide useful insights for future research on adapting LLM-based re-rankers to person-job fit systems.
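[54] cantnlp@DravidianLangTech 2026: organic domain adaptation improves multi-class hope speech detection in Tulu cs.CL
Andrew Li, Sidney Wong
TL;DR: This paper describes the authors' systems and results for the shared task on hope speech detection in code-mixed Tulu social media comments at the DravidianLangTech-2026 workshop. They trained an XLM-RoBERTa-based text classifier and compared it with a baseline model. On the development set, the organically domain-adapted model outperformed the baseline.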
Details
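Motivation: To address hope speech detection in code-mixed Tulu social media comments, in particular improving detection on text with mixed-script and code-mixed variation.
Result: On the development set, the organically domain-adapted XLM-RoBERTa model outperformed the baseline system. Performance on the official test set was more modest, but the results suggest that further adaptation on organically collected Tulu social media text can improve performance.
Insight: The paper proposes and tests organic domain adaptation, i.e., further fine-tuning a multilingual pretrained model such as XLM-RoBERTa on real social media text containing code-mixed and mixed-script variation, to improve hope speech detection in Tulu.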
Abstract: This paper presents our systems and results for the Hope Speech Detection in Code-Mixed Tulu Language shared task at the Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages (DravidianLangTech-2026). We trained an XLM-RoBERTa-based text classification system for detecting hope speech in code-mixed Tulu social media comments. We compared this organically adapted hope speech detection model with our baseline model. On the development set, the organically adapted model outperformed the baseline system. While our submitted systems performed more modestly on the official test set, these results suggest that further adapting XLM-RoBERTa on organically collected Tulu social media text containing code-mixed and mixed-script variation can improve hope speech detection in code-mixed Tulu.
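[55] Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants cs.CL
Joseph Suh, Ayush Raj, Minwoo Kang, Serina Chang
TL;DR: This paper quantifies user simulator quality by how an LLM assistant trained against the simulator performs in real human-AI interaction. Experiments show that simulators fine-tuned on real human conversations yield significantly better assistants than simulators based on role-playing LLMs.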
Details
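Motivation: User simulators are widely used to build interactive AI assistants, but how to assess their quality remains unclear. The paper quantifies a simulator's utility through a downstream task: the performance of the trained assistant when interacting with real humans.
Result: In a user study with 283 participants, the assistant trained against the fine-tuned simulator achieved significant win rates of 58% over the initial assistant and 57% over the assistant trained against the role-playing LLM simulator, and it generalized better on the WildBench benchmark.
Insight: The novelty is using real-user interaction outcomes as the measure of simulator quality. Empirically, simulators fine-tuned on real human behavior data have a clear advantage for training assistants, while methods that improve role-playing LLM simulators (such as persona conditioning) help but do not close the gap to the fine-tuned simulator.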
Abstract: User simulators are increasingly leveraged to build interactive AI assistants, yet how to measure the quality of these simulators remains an open question. In this work, we show how simulator quality can be quantified in terms of its downstream utility: how an LLM assistant trained with this user simulator performs in the wild when interacting with real humans. In a controlled experiment where only the user simulator varies, we train LLM assistants via reinforcement learning against a spectrum of simulators, from an LLM prompted to role-play a user to one fine-tuned on human utterances from WildChat. As evaluation, we measure pairwise win rates in a user study with 283 participants and on WildBench, a benchmark derived from real human–AI conversations. Training against the role-playing LLM yields an assistant statistically indistinguishable from the initial assistant in our user study (51% win rate), whereas training against the fine-tuned simulator yields significant gains (58% over the initial assistant and 57% over the one trained against the role-playing LLM). Closer inspection reveals three further patterns: methods for making role-playing LLMs more realistic (e.g., persona conditioning) improve trained assistants but do not close the gap to the fine-tuned simulator; scaling the simulator’s model size benefits the fine-tuned simulator but yields no gain for role-playing ones; and assistants trained against role-playing simulators fail to generalize when paired with other simulators at test time, while the one trained against the fine-tuned simulator does. Together, these results argue for grounding user simulators in real human behavior and measuring their quality by their downstream effect on real users.
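[56] Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions cs.CL | cs.AI
Sushrita Rakshit, Hanwen Zhang, Hua Shen
TL;DR: This paper studies the "value-action gap" in large language models (LLMs): the mismatch between a model's stated values and the behavior it generates. The authors call the persistence of this gap under explicit reasoning "Pseudo-Deliberation" and propose the VALDI framework to measure the mismatch systematically.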
Details
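Motivation: The stated values of LLMs do not reliably translate into their generated actions (the "value-action gap"). The paper investigates the deeper failure mode in which this gap persists even under explicit reasoning.
Result: Across proprietary and open-source LLMs, the VALDI framework (4,941 human-centered scenarios, three tasks, and five metrics) reveals consistent misalignment between expressed values and downstream dialogues.
Insight: The paper introduces the concept of "Pseudo-Deliberation" to characterize the deeper misalignment between reasoning and behavior, and designs the VALDI evaluation framework along with VIVALDI, a multi-agent value auditor used as an intervention strategy, providing systematic tools for evaluating and improving LLM value alignment.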
Abstract: Large language models (LLMs) are often evaluated based on their stated values, yet these do not reliably translate into their actions, a discrepancy termed “value-action gap.” In this work, we argue that this gap persists even under explicit reasoning, revealing a deeper failure mode we call “Pseudo-Deliberation”: the appearance of principled reasoning without corresponding behavioral alignment. To study this systematically, we introduce VALDI, a framework for measuring alignment between stated values and generated dialogue. VALDI includes 4,941 human-centered scenarios across five domains, three tasks that elicit value articulation, reasoning, and action, and five metrics for quantifying value adherence. Across both proprietary and open-source LLMs, we observe consistent misalignment between expressed values and downstream dialogues. To investigate intervention strategies, we propose VIVALDI, a multi-agent value auditor that intervenes at different stages of generation.
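[57] Team-Based Self-Play With Dual Adaptive Weighting for Fine-Tuning LLMs cs.CL | cs.AI
Wu Li, Yigeng Zhou, Zesheng Shi, Yequan Wang, Min Zhang
TL;DR: This paper proposes TPAW (Team-based self-Play with dual Adaptive Weighting), a self-play algorithm for improving LLM alignment in a fully self-supervised setting. It adopts a team-based framework in which the current policy model both collaborates with and competes against historical checkpoints, and introduces a dual adaptive weighting mechanism to strengthen learning. Experiments show that TPAW outperforms existing baselines across base models and multiple LLM benchmarks.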
Details
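Motivation: Existing self-training methods have reduced reliance on human-labeled data but still face key limitations: sensitivity to synthetic data quality, which causes instability and bias amplification in iterative training, and inefficient optimization as the gap between positive and negative responses shrinks over successive iterations.
Result: TPAW consistently outperforms existing baselines across multiple base models and a range of LLM benchmarks, achieving better alignment performance.
Insight: The main contributions are the team-based self-play framework, which combines cooperation and competition, and the dual adaptive weighting mechanism (response reweighting and player weighting). Together they help stabilize training, mitigate bias amplification, and improve optimization efficiency in fully self-supervised alignment training.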
Abstract: While recent self-training approaches have reduced reliance on human-labeled data for aligning LLMs, they still face critical limitations: (i) sensitivity to synthetic data quality, leading to instability and bias amplification in iterative training; (ii) ineffective optimization due to a diminishing gap between positive and negative responses over successive training iterations. In this paper, we propose Team-based self-Play with dual Adaptive Weighting (TPAW), a novel self-play algorithm designed to improve alignment in a fully self-supervised setting. TPAW adopts a team-based framework in which the current policy model both collaborates with and competes against historical checkpoints, promoting more stable and efficient optimization. To further enhance learning, we design two adaptive weighting mechanisms: (i) a response reweighting scheme that adjusts the importance of target responses, and (ii) a player weighting strategy that dynamically modulates each team member’s contribution during training. Initialized from a SFT model, TPAW iteratively refines alignment without requiring additional human supervision. Experimental results demonstrate that TPAW consistently outperforms existing baselines across various base models and LLM benchmarks. Our code is publicly available at https://github.com/lab-klc/TPAW.
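[58] PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning cs.CL | cs.AI
Luan Zhang, Dandan Song, Zhijing Wu, Zhengyu Chen, Chen Zhang
TL;DR: This paper proposes PruneTIR, a framework that improves the effectiveness and efficiency of tool-integrated reasoning at inference time for LLMs that can already call tools, with no additional training. Its three components, which prune trajectories, resample tool calls, and suspend tool usage, reduce the negative impact of erroneous tool calls and keep the model from getting stuck in repeated failed resolution attempts.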
Details
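Motivation: Current research focuses on equipping LLMs with tool use, while improving the reasoning of already tool-capable LLMs at inference time remains underexplored. Inference-time improvements need no additional training and help LLMs use tools more effectively to solve problems.
Result: Extensive experiments show that PruneTIR significantly improves Pass@1 and efficiency for tool-capable LLMs while reducing the working context length.
Insight: The framework is a lightweight, efficient inference-time intervention built on two observations: the number and proportion of erroneous tool calls correlate negatively with answer correctness, and errors are usually resolved within a few subsequent turns or else are hard to resolve at all. Its three triggered components (Success-Triggered Pruning, Stuck-Triggered Pruning and Resampling, and Retry-Triggered Tool Suspension) combine into a systematic approach to flow control in tool-integrated reasoning.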
Abstract: Tool-integrated reasoning (TIR) enables large language models (LLMs) to enhance their capabilities by interacting with external tools, such as code interpreters (CI). Most recent studies focus on exploring various methods to equip LLMs with the ability to use tools. However, how to further boost the reasoning ability of already tool-capable LLMs at inference time remains underexplored. Improving reasoning at inference time requires no additional training and can help LLMs better leverage tools to solve problems. We observe that, during tool-capable LLM inference, both the number and the proportion of erroneous tool calls are negatively correlated with answer correctness. Moreover, erroneous tool calls are typically resolved successfully within a few subsequent turns. If not, LLMs often struggle to resolve such errors even with many additional turns. Building on the above observations, we propose PruneTIR, a rather effective yet efficient framework that enhances the tool-integrated reasoning at inference time. During LLM inference, PruneTIR prunes trajectories, resamples tool calls, and suspends tool usage through three components: Success-Triggered Pruning, Stuck-Triggered Pruning and Resampling, and Retry-Triggered Tool Suspension. These three components enable PruneTIR to mitigate the negative impact of erroneous tool calls and prevent LLMs from getting stuck in repeated failed resolution attempts, thereby improving overall LLM performance. Extensive experimental results demonstrate the effectiveness of PruneTIR, which significantly improves Pass@1 and efficiency while reducing the working context length for tool-capable LLMs.
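[59] TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents cs.CL
Bihui Yu, Caijun Jia, Jing Chi, Xiaohan Liu, Yining Wang
TL;DR: This paper proposes TRACER, a framework that addresses provenance tracing in multimodal tool-using agents. As it generates each answer sentence, it also produces a structured provenance record that identifies the tool call, evidence unit, and semantic support relation backing each claim, and a verification mechanism checks the record's reliability.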
Details
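Motivation: When multimodal large language models call external tools to solve vision tasks, they usually expose only the tool trajectory and the final answer, without specifying which evidence supports each generated claim. This provenance gap makes tool use hard to verify and hard to optimize.
Result: On the newly constructed TRACE-Bench benchmark with Qwen3-VL-8B, TRACER reaches 78.23% answer accuracy and 95.72% summary accuracy, outperforming the strongest closed-source tool-augmented baseline by 23.80 percentage points. Compared with tool-only supervised fine-tuning, it also reduces total test-set tool calls from 4949 to 3486.
Insight: The framework builds structured provenance records during generation and defines three semantic support relations: Quotation, Compression, and Inference. Its verification pipeline covers schema checking, tool-turn alignment, source authenticity, and relation rationality, and verified provenance is converted into traceability constraints and local credit assignment for reinforcement learning. The results indicate that reliable multimodal tool reasoning depends on provenance-aware use of observations rather than on more tool calls alone.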
Abstract: Multimodal large language models increasingly solve vision-centric tasks by calling external tools for visual inspection, OCR, retrieval, calculation, and multi-step reasoning. Current tool-using agents usually expose the executed tool trajectory and the final answer, but they rarely specify which tool observation supports each generated claim. We call this missing claim-level dependency structure the provenance gap. The gap makes tool use hard to verify and hard to optimize, because useful evidence, redundant exploration, and unsupported reasoning are mixed in the same trajectory. We introduce TRACER, a framework for verifiable generative provenance in multimodal tool-using agents. Instead of adding citations after generation, TRACER generates each answer sentence together with a structured provenance record that identifies the supporting tool turn, evidence unit, and semantic support relation. Its relation space contains Quotation, Compression, and Inference, covering direct reuse, faithful condensation, and grounded derivation. TRACER verifies each record through schema checking, tool-turn alignment, source authenticity, and relation rationality, and then converts verified provenance into traceability constraints and provenance-derived local credit for reinforcement learning. We further construct TRACE-Bench, a benchmark for sentence-level provenance reconstruction from coarse multimodal tool trajectories. On TRACE-Bench, simply adding tools often introduces noise. With Qwen3-VL-8B, TRACER reaches 78.23% answer accuracy and 95.72% summary accuracy, outperforming the strongest closed-source tool-augmented baseline by 23.80 percentage points. Compared with tool-only supervised fine-tuning, it also reduces total test-set tool calls from 4949 to 3486. These results show that reliable multimodal tool reasoning depends on provenance-aware use of observations, not on more tool calls alone.
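[60] Speech-based Psychological Crisis Assessment using LLMs cs.CL | cs.AI
Terumi Chiba, Yang Luo, Ziyun Cui, Yongsheng Tong, Chao Zhang
TL;DR: The paper proposes a large language model (LLM)-based framework for automated psychological crisis level classification, intended to improve psychological support hotline services. Its core contributions are a paralinguistic injection method that integrates non-verbal emotional cues from spoken conversations into text transcripts, and a reasoning-enhanced training strategy in which the model generates diagnostic reasoning chains as an auxiliary task to improve classification. Combined with data augmentation, the system performs well on the three-class crisis level task.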
Details
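Motivation: Crisis assessment on psychological support hotlines currently relies mainly on human operators, whose judgments vary with professional experience and are constrained by limited staffing, leading to inconsistency and limited scalability. An automated, reliable method for crisis level classification is needed to support downstream tasks and improve hotline service quality.
Result: Under 5-fold cross-validation, the final system achieves a macro F1-score of 0.802 and an accuracy of 0.805 on the three-class task.
Insight: The paper makes two contributions. (1) Paralinguistic injection inserts non-verbal emotional cues from speech (such as intonation and pauses) into the transcript as markers, letting the LLM reason over acoustic nuances that text-only models miss. (2) Reasoning-enhanced training regularizes the model by having it generate diagnostic reasoning chains as an auxiliary task, which improves classification. Both ideas are relevant to multimodal affective analysis and LLM-based decision support systems.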
Abstract: Psychological support hotlines provide critical support for individuals experiencing mental health emergencies, yet current assessments largely rely on human operators whose judgments may vary with professional experience and are constrained by limited staffing resources. This paper proposes a large language model (LLM)-based framework for automated crisis level classification, a key indicator that supports many downstream tasks and improves the overall quality of hotline services. To better capture emotional signals in spoken conversations, we introduce a paralinguistic injection method that inserts identified non-verbal emotional cues into speech transcripts, enabling LLM-based reasoning to incorporate critical acoustic nuances. In addition, we propose a reasoning-enhanced training strategy that trains the model to generate diagnostic reasoning chains as an auxiliary task, which serves as a regulariser to improve classification performance. Combined with data augmentation, our final system achieves a macro F1-score of 0.802 and an accuracy of 0.805 on the three-class classification task under 5-fold cross-validation.
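[61] PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning cs.CL
Sajib Acharjee Dip, Song Li, Liqing Zhang
TL;DR: This paper presents PlantMarkerBench, a multi-species benchmark for evaluating literature-grounded interpretation of plant marker gene evidence. It is built with a modular pipeline that integrates large-scale literature retrieval, hybrid search, species-aware biological grounding, structured evidence extraction, and targeted human review. The benchmark covers four plant species (Arabidopsis, maize, rice, and tomato) and contains 5,550 sentence-level evidence instances annotated for evidence validity, evidence type, and support strength. It defines two tasks: deciding whether a candidate sentence provides valid marker evidence for a given gene-cell-type pair, and classifying the evidence type. The authors evaluate a range of open-weight and closed-source language models across species and prompting strategies. Frontier models do well on direct expression evidence but degrade markedly on functional, indirect, and weak-support evidence, with evidence-type confusion as the dominant failure mode. Open-weight models also show elevated false-positive rates in ambiguous biological contexts.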
Details
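Motivation: Existing resources for cell-type-specific plant marker genes rely mainly on manually curated databases or high-throughput studies and do not explicitly model the supporting evidence in the scientific literature. A literature-grounded benchmark is therefore needed to evaluate evidence interpretation.
Result: A range of open-weight and closed-source language models were evaluated on PlantMarkerBench. Frontier models perform relatively well on direct expression evidence, but performance drops sharply on functional, indirect, and weak-support evidence, with evidence-type confusion as the dominant error mode; open-weight models show higher false-positive rates in ambiguous biological contexts.
Insight: The contribution is a multi-species benchmark built from full-text biological literature with fine-grained evidence annotations (validity, type, and strength), together with a modular data construction pipeline. It offers a reproducible evaluation framework for literature-grounded biological evidence attribution and supports research on trustworthy scientific information extraction and AI-assisted plant biology.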
Abstract: Cell-type-specific marker genes are fundamental to plant biology, yet existing resources primarily rely on curated databases or high-throughput studies without explicitly modeling the supporting evidence found in scientific literature. We introduce PlantMarkerBench, a multi-species benchmark for evaluating literature-grounded plant marker evidence interpretation from full-text biological papers. PlantMarkerBench is constructed using a modular curation pipeline integrating large-scale literature retrieval, hybrid search, species-aware biological grounding, structured evidence extraction, and targeted human review. The benchmark spans four plant species – Arabidopsis, maize, rice, and tomato – and contains 5,550 sentence-level evidence instances annotated for marker-evidence validity, evidence type, and support strength. We define two benchmark tasks: determining whether a candidate sentence provides valid marker evidence for a gene-cell-type pair, and classifying the evidence into expression, localization, function, indirect, or negative categories. We benchmark diverse open-weight and closed-source language models across species and prompting strategies. Although frontier models achieve relatively strong performance on direct expression evidence, performance drops substantially on functional, indirect, and weak-support evidence, with evidence-type confusion emerging as a dominant failure mode. Open-weight models additionally exhibit elevated false-positive rates under ambiguous biological contexts. PlantMarkerBench provides a challenging and reproducible evaluation framework for literature-grounded biological evidence attribution and supports future research on trustworthy scientific information extraction and AI-assisted plant biology.
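[62] FERA: Uncertainty-Aware Federated Reasoning for Large Language Models cs.CL
Ruhan Wang, Chengkai Huang, Zhiyong Wang, Junda Wu, Rui Wang
TL;DR: This paper proposes FERA (Uncertainty-Aware Federated Reasoning), a training-free federated reasoning framework for multi-step LLM reasoning when high-quality demonstrations cannot be centralized because of regulatory, proprietary, or institutional constraints. A server iteratively co-refines with heterogeneous clients that hold private demonstrations: clients return reasoning traces with lightweight uncertainty estimates, the server synthesizes them into improved reasoning, and that reasoning is redistributed as context for the next round, progressively improving both server outputs and client-side reasoning.
Details
Motivation: LLMs reason strongly when guided by high-quality demonstrations, but such data is usually scattered across organizations and cannot be centralized because of regulatory, proprietary, or institutional constraints. This motivates federated reasoning, in which a server collaborates with heterogeneous clients holding private demonstrations to improve multi-step reasoning without centralized training or raw data sharing. The key challenge is that client reliability is query-dependent, while the server cannot inspect client data to decide which contributions are trustworthy.
Result: Experiments on multiple reasoning benchmarks show that FERA, although training-free, consistently outperforms both federated-training and training-free baselines, achieving progressively higher accuracy across rounds while remaining efficient in communication and computation.
Insight: The core contribution is the uncertainty-aware federated reasoning framework FERA and its central component UA-SCA (Uncertainty-Aware Self-Critique Aggregation). UA-SCA resolves conflicts among heterogeneous client traces through query-dependent trust weighting and structured cross-client verification; rather than simply discarding low-quality traces, it revises flawed reasoning steps to recover useful information. A theoretical analysis further shows that the iterative protocol converges and that uncertainty-aware weighting accelerates convergence.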
Abstract: Large language models (LLMs) exhibit strong reasoning capabilities when guided by high-quality demonstrations, yet such data is often distributed across organizations that cannot centralize it due to regulatory, proprietary, or institutional constraints. We study federated reasoning, where a server improves multi-step reasoning by coordinating with heterogeneous clients holding private demonstrations, without centralized training or raw data sharing. The key challenge is that client reliability is query-dependent, while the server cannot inspect client data to determine which contributions are trustworthy. To address this, we propose Uncertainty-Aware Federated Reasoning (FERA), a training-free framework based on iterative server-client co-refinement. Across communication rounds, clients generate reasoning traces with lightweight uncertainty estimates, and the server synthesizes them into improved reasoning that is redistributed as context for the next round, progressively improving both server outputs and client-side reasoning. Within each round, Uncertainty-Aware Self-Critique Aggregation (UA-SCA) resolves conflicts among heterogeneous client traces through query-dependent trust weighting and structured cross-client verification. Rather than simply discarding low-quality traces, UA-SCA revises flawed reasoning steps to recover useful information. We provide theoretical guarantees showing that the proposed iterative protocol converges and that uncertainty-aware weighting accelerates convergence. Experiments on multiple reasoning benchmarks show that FERA consistently outperforms both federated training and training-free baselines, achieving progressively higher accuracy across rounds while maintaining communication and computational efficiency.
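The trust-weighting idea in the abstract can be illustrated with a very small sketch. This is not the paper's UA-SCA, which also performs self-critique and revises flawed steps; it only shows how per-client uncertainty estimates could down-weight unreliable answers in a single round. The function name `aggregate_answers` and the weight rule `1 - uncertainty` are assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_answers(client_outputs):
    """Pick the answer with the largest total trust weight.

    client_outputs: list of (answer, uncertainty) pairs, uncertainty in [0, 1].
    The weight rule (1 - uncertainty) is an illustrative assumption.
    """
    scores = defaultdict(float)
    for answer, uncertainty in client_outputs:
        # Clamp so a malformed estimate cannot produce a negative weight.
        weight = max(0.0, 1.0 - uncertainty)
        scores[answer] += weight
    return max(scores, key=scores.get)


# Two uncertain clients agree on "a"; one confident client says "b".
round_outputs = [("a", 0.9), ("a", 0.9), ("b", 0.1)]
winner = aggregate_answers(round_outputs)
print(winner)
```

A plain majority vote would pick "a" here; the weighted vote prefers the single confident client, which is the behavior query-dependent trust weighting is meant to enable.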
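[63] NyayaAI: An AI-Powered Legal Assistant Using Multi-Agent Architecture and Retrieval-Augmented Generation cs.CL
Deepanshu, Divi Saxena, Deepali Rana, Ayesha Varshney, Sahinur Rahman Laskar
TL;DR: This paper presents NyayaAI, an AI legal assistant built on a multi-agent architecture and retrieval-augmented generation (RAG). It uses large language models and an Indian legal knowledge base to automate and simplify legal workflows for lawyers, students, and general users, improving access to legal information and working efficiency.
Details
Motivation: Legal information in India is hard to access because of complex legal language and the sheer volume of documents. The system aims to give legal practitioners and the public an automated, simplified legal assistance tool.
Result: On the test samples, domain classification precision reached 70%, RAG retrieval precision 74%, and overall response accuracy 72%, which the authors take as evidence that the system can improve legal accessibility and workflow efficiency.
Insight: The contribution is the combination of a multi-agent architecture (a main agent and specialized sub-agents coordinated through the Mastra framework) with a RAG pipeline grounded in an Indian legal knowledge base, plus a compliance validation module, yielding a structured legal AI system.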
Abstract: Legal information in India remains largely inaccessible due to the complexity of legal language and the sheer volume of legal documentation involved in research and case analysis. This paper presents NyayaAI, an AI-powered legal assistant that automates and simplifies legal workflows for lawyers, law students, and general users. The system combines Large Language Models with a Retrieval-Augmented Generation pipeline grounded in a curated Indian legal knowledge base comprising constitutional provisions, statutes, case laws, and judicial precedents. A multi-agent architecture orchestrated through the Mastra TypeScript framework coordinates a main agent with specialized sub-agents handling legal research, document summarization, case law retrieval, and drafting assistance. A compliance module validates all responses before delivery. Domain classification achieved 70% precision across test samples, with RAG retrieval precision at 74% and overall response accuracy at 72%, demonstrating that structured multi-agent LLM systems can meaningfully improve legal accessibility and workflow efficiency. The code (https://github.com/B97784/NyayaAI) is made publicly available for the benefit of the research community.
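[64] When Reviews Disagree: Fine-Grained Contradiction Analysis in Scientific Peer Reviews cs.CL | cs.AI
Sandeep Kumar, Yash Kamdar, Abid Hossain, Bharti Kumari, Tanik Saikh
TL;DR: This paper proposes a fine-grained approach to contradiction analysis in peer reviews that operates over full reviews by identifying evidence spans and assigning graded intensity scores. The authors build the expert-annotated RevCI benchmark and propose the IMPACT multi-agent framework together with its distilled version, TIDE. Experiments show that IMPACT substantially outperforms baselines in evidence identification and intensity agreement, while TIDE achieves competitive performance at lower inference cost.
Details
Motivation: Conflicting expert judgments are frequent in scientific peer review. Existing methods usually reduce reviewer disagreement to binary detection over isolated sentence pairs, ignoring review-level context and differences in the severity of the conflict, which makes disagreements hard to identify and interpret reliably.
Result: On the RevCI benchmark, IMPACT substantially outperforms strong single-agent and generic multi-agent baselines in both evidence identification and intensity agreement. The distilled model TIDE achieves competitive performance at significantly lower inference cost.
Insight: Contributions include a fine-grained contradiction analysis formulation (combining evidence extraction with graded intensity scoring), the expert-annotated RevCI benchmark, and the structured multi-agent framework IMPACT (integrating aspect-conditioned evidence extraction, deliberative reasoning, and adjudication), with distillation for efficient deployment. Viewed objectively, the approach lifts contradiction analysis from the binary sentence level to the context-aware review level and adds intensity grading, which makes it more practical.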
Abstract: Scientific peer reviews frequently contain conflicting expert judgments, and the increasing scale of conference submissions makes it challenging for Area Chairs and editors to reliably identify and interpret such disagreements. Existing approaches typically frame reviewer disagreement as binary contradiction detection over isolated sentence pairs, abstracting away the review-level context and obscuring differences in the severity of evaluative conflict. In this work, we introduce a fine-grained formulation of reviewer contradiction analysis that operates over full peer reviews by explicitly identifying contradiction evidence spans and assigning graded disagreement intensity scores. To support this task, we present RevCI, an expert-annotated benchmark of peer-review pairs with evidence-level contradiction annotations with graded intensity labels. We further propose IMPACT, a structured multi-agent framework that integrates aspect-conditioned evidence extraction, deliberative reasoning, and adjudication to model reviewer contradictions and their intensity. To support efficient deployment, we distill IMPACT into TIDE, a small language model that predicts contradiction evidence and intensity in a single forward pass. Experimental results show that IMPACT substantially outperforms strong single-agent and generic multi-agent baselines in both evidence identification and intensity agreement, while TIDE achieves competitive performance at significantly lower inference cost.
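[65] LegalCiteBench: Evaluating Citation Reliability in Legal Language Models cs.CL | cs.AI
Sijia Chen, Hang Yin, Shunfan Zhou
TL;DR: This paper introduces LegalCiteBench, a benchmark for evaluating the citation reliability of legal language models in a closed-book setting, covering five tasks: citation retrieval, citation completion, citation error detection, case matching, and case verification and correction. Built from 1,000 real U.S. judicial opinions, it contains roughly 24K evaluation instances. Current LLMs perform very poorly on exact citation recovery (the best models score below 7/100) and often produce plausible-looking but incorrect citations (Misleading Answer Rates above 94%).
Details
Motivation: Existing legal benchmarks focus on statutory reasoning or contract understanding and do not systematically evaluate a key failure mode in common-law practice: models generating incorrect or fabricated case citations when they lack external grounding. Such citation errors can cause serious professional harm.
Result: Across 21 LLMs, closed-book citation recovery is poor overall. The best models score below 7/100 on citation retrieval and completion; 20 models have Misleading Answer Rates above 94% on retrieval-heavy tasks; model scale and legal-domain pretraining bring limited gains; and explicit uncertainty instructions reduce some fabricated citations but do not improve correctness.
Insight: The contribution is a diagnostic benchmark, described as the first focused on legal citation reliability, that exposes a fundamental weakness of LLMs in closed-book generation of legal authorities. Viewed objectively, the study underlines the need for external grounding in legal AI applications and provides a framework for studying verification behavior and abstention mechanisms.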
Abstract: Large language models (LLMs) are increasingly integrated into legal drafting and research workflows, where incorrect citations or fabricated precedents can cause serious professional harm. Existing legal benchmarks largely emphasize statutory reasoning, contract understanding, or general legal question answering, but they do not directly study a central common-law failure mode: when asked to provide case authorities without external grounding, models may return plausible-looking but incorrect citations or cases. We introduce LegalCiteBench, a benchmark for studying closed-book citation recovery, citation verification, and case matching in legal language models. LegalCiteBench contains approximately 24K evaluation instances constructed from 1,000 real U.S. judicial opinions from the Case Law Access Project. The benchmark covers five citation-centric tasks: citation retrieval, citation completion, citation error detection, case matching, and case verification and correction. Across 21 LLMs, exact citation recovery remains highly challenging in this closed-book setting: even the strongest models score below 7/100 on citation retrieval and completion. Within the evaluated models, scale and legal-domain pretraining provide limited gains and do not resolve this difficulty. Models also frequently provide concrete but incorrect or low-overlap authorities under our evaluation protocol, with Misleading Answer Rates (MAR) exceeding 94% for 20 of 21 evaluated models on retrieval-heavy tasks. A prompt-only abstention experiment shows that explicit uncertainty instructions reduce some confident fabrication but do not improve citation correctness. LegalCiteBench is intended as a diagnostic framework for studying authority generation failures, verification behavior, and abstention when external grounding is absent, incomplete, or bypassed.
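[66] Relative Score Policy Optimization for Diffusion Language Models cs.CL
Zichao Yu, Shengze Xu, Bingqing Jiang, Wenyi Zhang, Difan Zou
TL;DR: This paper proposes Relative Score Policy Optimization (RSPO), a method that addresses the instability of reinforcement learning with verifiable rewards (RLVR) for diffusion large language models (dLLMs), which stems from the lack of tractable sequence-level log-ratios. RSPO uses reward advantages to calibrate noisy likelihood estimates, stabilizing training and improving performance on reasoning tasks.
Details
Motivation: dLLMs are promising for parallel, efficient text generation, but improving their reasoning requires effective post-training. RLVR is a natural choice, yet the sequence-level log-ratios that standard policy optimization depends on are intractable for dLLMs, so existing methods rely on high-variance approximations that make training unstable.
Result: Experiments on mathematical reasoning and planning benchmarks show that RSPO yields substantial gains on planning tasks and competitive results on mathematical reasoning.
Insight: The core idea is to reinterpret the reward advantage as a target for the relative log-ratio between the current and reference policies, and to update the policy according to the gap between the noisy estimate and that target rather than using the raw advantage directly. This gives a more stable RL training paradigm for dLLMs.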
Abstract: Diffusion large language models (dLLMs) offer a promising route to parallel and efficient text generation, but improving their reasoning ability requires effective post-training. Reinforcement learning with verifiable rewards (RLVR) is a natural choice for this purpose, yet its application to dLLMs is hindered by the absence of tractable sequence-level log-ratios, which are central to standard policy optimization. The lack of tractable sequence-level log-ratios forces existing methods to rely on high-variance ELBO-based approximations, where high verifier rewards can amplify inaccurate score estimates and destabilize RL training. To overcome this issue, we propose Relative Score Policy Optimization (RSPO), a simple RLVR method that uses verifiable rewards to calibrate noisy likelihood estimates in dLLMs. The core of our algorithm relies on a key observation: a reward advantage can be interpreted not only as an update direction, but also as a target for the relative log-ratio between the current and reference policies. Accordingly, RSPO calibrates this noisy relative log-ratio estimate by comparing its reward advantage with the reward-implied target relative log-ratio, updating the policy according to the gap between the current estimate and the target rather than the raw advantage alone. Experiments on mathematical reasoning and planning benchmarks show that RSPO yields especially strong gains on planning tasks and competitive mathematical-reasoning performance.
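The "advantage as target" observation can be made concrete with a small numeric sketch. The abstract does not state the objective, so everything below is an assumption chosen for illustration: the target is taken to be `advantage / beta` with a hypothetical scale `beta`, and the gap is penalized with a squared error. The real RSPO loss may differ.

```python
def relative_score_gap(log_ratio_estimate, advantage, beta=1.0):
    """Gap between the reward-implied target and the current noisy estimate.

    Assumption: the target relative log-ratio is advantage / beta.
    """
    target = advantage / beta
    return target - log_ratio_estimate


def squared_gap_loss(log_ratio_estimates, advantages, beta=1.0):
    """Mean squared gap over a batch of sampled sequences (illustrative form)."""
    gaps = [
        relative_score_gap(r, a, beta)
        for r, a in zip(log_ratio_estimates, advantages)
    ]
    return sum(g * g for g in gaps) / len(gaps)


# Sample 0 already sits on its target (1.0 / 2.0 = 0.5), so its gap is zero
# even though its raw advantage is large and positive.
estimates = [0.5, -0.2]
advantages = [1.0, -1.0]
gaps = [relative_score_gap(r, a, beta=2.0) for r, a in zip(estimates, advantages)]
loss = squared_gap_loss(estimates, advantages, beta=2.0)
print(gaps, loss)
```

The point of the sketch is the first sample: an update driven by the raw advantage would keep pushing it, while an update driven by the gap stops once the estimate matches the target.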
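[67] Route Before Retrieve: Activating Latent Routing Abilities of LLMs for RAG vs. Long-Context Selection cs.CL
Yiwen Chen, Kuan Li, Fuzhen Zhuang, Deqing Wang, Zhao Zhang
TL;DR: This paper proposes Pre-Route, a proactive routing framework that decides between retrieval-augmented generation (RAG) and long-context (LC) strategies for long-document tasks. It performs structured reasoning before answering (task analysis, coverage estimation, and information-need prediction) to make explainable, cost-efficient routing decisions. The study finds that LLMs have a latent routing ability that guidelines can reliably elicit, that structured prompts make the "optimal routing dimension" more separable in representation space, and that the ability can be distilled into smaller models.
Details
Motivation: The problem is how to choose proactively, efficiently, and interpretably between RAG and LC in long-document understanding tasks. Existing methods such as Self-Route are failure-driven fallback mechanisms that are passive, inefficient, and hard to interpret.
Result: Experiments on LaRA (in-domain) and LongBench-v2 (out-of-domain) show that Pre-Route achieves better overall cost-effectiveness than the Always-RAG, Always-LC, and Self-Route baselines.
Insight: The contribution is a proactive routing framework that performs structured reasoning over lightweight metadata (e.g., document type, length, initial snippet). Viewed objectively, the core insight is that LLMs have a latent routing ability that can be reliably elicited, bringing single-sample performance close to multi-sample (Best-of-N) results, and that representation-space analysis and knowledge distillation make this ability interpretable and cheap to deploy.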
Abstract: Recent advances in large language models (LLMs) have expanded the context window to beyond 128K tokens, enabling long-document understanding and multi-source reasoning. A key challenge, however, lies in choosing between retrieval-augmented generation (RAG) and long-context (LC) strategies: RAG is efficient but constrained by retrieval quality, while LC supports global reasoning at higher cost and with position sensitivity. Existing methods such as Self-Route adopt failure-driven fallback from RAG to LC, but remain passive, inefficient, and hard to interpret. We propose Pre-Route, a proactive routing framework that performs structured reasoning before answering. Using lightweight metadata (e.g., document type, length, initial snippet), Pre-Route enables task analysis, coverage estimation, and information-need prediction, producing explainable and cost-efficient routing decisions. Our study shows three key findings: (i) LLMs possess latent routing ability that can be reliably elicited with guidelines, allowing single-sample performance to approach that of multi-sample (Best-of-N) results; (ii) linear probes reveal that structured prompts sharpen the separability of the “optimal routing dimension” in representation space; and (iii) distillation transfers this reasoning structure to smaller models for lightweight deployment. Experiments on LaRA (in-domain) and LongBench-v2 (OOD) confirm that Pre-Route outperforms Always-RAG, Always-LC, and Self-Route baselines, achieving superior overall cost-effectiveness.
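To show where a routing decision sits in the pipeline, here is a toy router that decides from metadata before any answer is attempted. It is only a stand-in: Pre-Route elicits the decision from an LLM through structured reasoning, whereas this sketch uses a hand-written rule. The `DocMeta` fields mirror the metadata named in the abstract; the cue words, the 128K budget default, and the function name `pre_route` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class DocMeta:
    doc_type: str    # e.g. "novel", "financial report"
    num_tokens: int  # document length in tokens
    snippet: str     # opening text of the document


# Hypothetical cue words suggesting the question needs a whole-document view.
GLOBAL_CUES = ("summarize", "overall", "compare", "throughout")


def pre_route(question: str, meta: DocMeta, lc_budget: int = 128_000) -> str:
    """Return "LC" or "RAG" from the question and metadata alone.

    This toy rule looks only at length and question wording; a real router
    would also condition on doc_type and snippet.
    """
    if meta.num_tokens > lc_budget:
        return "RAG"  # the document does not fit the context window
    needs_global_view = any(cue in question.lower() for cue in GLOBAL_CUES)
    return "LC" if needs_global_view else "RAG"


novel = DocMeta(doc_type="novel", num_tokens=90_000, snippet="Chapter 1 ...")
print(pre_route("Summarize the main character's arc.", novel))
print(pre_route("In which city does chapter 3 open?", novel))
```

Unlike a failure-driven fallback, the decision here is made once, up front, so the expensive long-context call is only paid for when the question appears to need it.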
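[68] MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading cs.CL | cs.AI
Baibei Ji, Xiaoyang Weng, Juntao Li, Zecheng Tang, Yihang Lou
TL;DR: This paper proposes MemReread, which improves agent performance on long-context reasoning. Built on streaming reading, it uses question decomposition and rereading to recover indirect evidence lost to memory overwriting, thereby supporting non-linear reasoning. A reinforcement learning framework dynamically decides the number of rereading passes, balancing task complexity against computational overhead. Experiments show that MemReread consistently outperforms baselines on long-context reasoning tasks while keeping time complexity linear in context length.
Details
Motivation: Agent-memory approaches to long-context reasoning can lose latent evidence when memory is overwritten, and existing retrieval-based recall suffers from evidence loss and from interference caused by invalid queries.
Result: Extensive experiments on long-context reasoning tasks show that MemReread consistently outperforms baseline frameworks while maintaining linear time complexity with respect to context length.
Insight: The contribution is to bypass intermediate retrieval and instead use a question-triggered rereading mechanism to recover indirect facts that were discarded prematurely, supporting non-linear reasoning without breaking the inherent logical flow of document comprehension. Combining this with reinforcement learning to adjust the number of rereading passes improves length extrapolation and gives flexible control over computational overhead.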
Abstract: To tackle long-context reasoning tasks without the quadratic complexity of standard attention mechanisms, approaches based on agent memory have emerged, which typically maintain a dynamically updated memory when linearly processing document chunks. To mitigate the potential loss of latent evidence in this memorize-while-reading paradigm, recent works have integrated retrieval modules that allow agents to recall information previously discarded during memory overwriting. However, retrieval-based recall suffers from both evidence loss during memory formation and interference induced by invalid queries. To overcome these limitations, we propose MemReread. Built upon streaming reading, MemReread circumvents intermediate retrieval. It triggers question decomposition and rereading when the final memory is insufficient, enabling the recovery of indirect facts that were prematurely discarded. This design supports non-linear reasoning while preserving the inherent logical flow of document comprehension. To further enhance practicality, we introduce a reinforcement learning framework that enhances length extrapolation capability while dynamically determining the number of rereading passes based on task complexity, thereby flexibly controlling computational overhead. Extensive experiments demonstrate that MemReread consistently outperforms baseline frameworks on long-context reasoning tasks, while maintaining linear time complexity with respect to context length.
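The reread loop described above can be sketched with plain strings. This is a toy under stated assumptions: keyword matching stands in for the agent's memory update, and the `followup` callback is a hypothetical placeholder for LLM question decomposition. It only illustrates why a second linear pass can recover a fact that the first pass had no reason to keep.

```python
def stream_read(chunks, keywords, memory=None):
    """One linear pass: keep chunks that mention every keyword."""
    memory = list(memory or [])
    for chunk in chunks:
        hit = all(k.lower() in chunk.lower() for k in keywords)
        if hit and chunk not in memory:
            memory.append(chunk)
    return memory


def read_with_rereading(chunks, first_keywords, followup, max_passes=2):
    """Reread only while `followup` says the memory is still insufficient."""
    memory = stream_read(chunks, first_keywords)
    passes = 1
    while passes < max_passes:
        next_keywords = followup(memory)
        if next_keywords is None:  # memory judged sufficient
            break
        memory = stream_read(chunks, next_keywords, memory)
        passes += 1
    return memory, passes


chunks = ["Alice was born in Paris.", "Book X was written by Alice."]


def followup(memory):
    # Hypothetical decomposition: once the author is known, ask where she was born.
    if any("born" in m for m in memory):
        return None
    if any("written by Alice" in m for m in memory):
        return ["Alice", "born"]
    return None


memory, passes = read_with_rereading(chunks, ["Book X"], followup)
print(memory, passes)
```

On the first pass the birthplace sentence is skipped because it never mentions Book X; only after the author is known does the sub-question make it relevant. Each pass touches every chunk once, so the cost stays linear in the number of chunks per pass.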
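[69] An Annotation Scheme and Classifier for Personal Facts in Dialogue cs.CL
Konstantin Zaitsev
TL;DR: This paper proposes an extended annotation scheme for personal facts and a multi-head classifier based on transformer encoders for classifying personal facts in dialogue. Building on existing approaches such as PeaCoK, the scheme adds new categories (e.g., Demographics, Possessions) and attributes (e.g., Duration, Validity, Followup) to support structured storage, quality filtering, and dialogue continuation. The author manually annotated 2,779 facts from the Multi-Session Chat dataset and trained a classifier that, combined with the Gemma-300M encoder, reaches 81.6% macro F1, well above few-shot LLM baselines.
Details
Motivation: Existing personal fact classification approaches such as PeaCoK have limitations and cannot fully support the structured storage, quality filtering, and dialogue continuation needs of personalized dialogue systems, so a finer-grained annotation scheme and an efficient classifier are needed.
Result: On Multi-Session Chat, the multi-head classifier with the Gemma-300M encoder reaches 81.6 ± 2.6% macro F1, nearly 9 percentage points above the best few-shot LLM baseline (GPT-5.4-mini, 72.92%), with substantially lower compute requirements, reaching state-of-the-art performance.
Insight: Contributions include the extended annotation scheme (new categories and attributes that improve structure and practical usefulness) and an efficient classifier design (transformer-encoder based, cheap to run, and strong in performance). Semantic boundary ambiguity, temporal interpretation, and pragmatic reasoning remain open challenges.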
Abstract: The advancement of Large Language Models (LLMs) has enabled their application in personalized dialogue systems. We present an extended annotation scheme for personal fact classification that addresses limitations in existing approaches, particularly PeaCoK. Our scheme introduces new categories (Demographics, Possessions) and attributes (Duration, Validity, Followup) that enable structured storage, quality filtering, and identification of facts suitable for dialogue continuation. We manually annotated 2,779 facts from Multi-Session Chat and trained a multi-head classifier based on transformer encoders. Combined with the Gemma-300M encoder, the classifier achieves 81.6 ± 2.6% macro F1, outperforming all few-shot LLM baselines (best: GPT-5.4-mini, 72.92%) by nearly 9 percentage points while requiring substantially fewer computational resources. Error analysis reveals persistent challenges in semantic boundary disambiguation, temporal aspect interpretation, and pragmatic reasoning for followup assessment. The dataset and classifier are publicly available.
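[70] Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness cs.CL
Ivo Petrov, Jasper Dekoninck, Dimitar I. Dimitrov, Martin Vechev
TL;DR: This paper introduces ProofRank, a benchmark that evaluates the quality of LLM-generated mathematical proofs rather than correctness alone. ProofRank draws hard problems from mathematical competitions and scores proofs on several quality dimensions: conciseness, computational ease, cognitive simplicity, diversity, and adaptivity. Models differ substantially in proof quality, and proof-quality metrics trade off against correctness, suggesting that future evaluations of mathematical reasoning should measure how useful proofs are.
Details
Motivation: Evaluating only the correctness of LLM-generated proofs is not enough. High-quality proofs should also be clear, concise, and insightful, properties that matter in mathematical practice.
Result: Models differ substantially in proof quality on ProofRank, and these differences are not captured by correctness-only benchmarks. There are also significant trade-offs between proof-quality metrics and correctness.
Insight: The contribution is a systematic definition of multiple scalable proxies of proof quality, described as the first of its kind, and the ProofRank benchmark for quantifying them. Viewed objectively, this offers a more comprehensive view of LLM mathematical reasoning that is closer to human mathematical practice, and it underlines the importance of evaluating how useful a proof is.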
Abstract: Large language models (LLMs) have become capable mathematical problem-solvers, often producing correct proofs for challenging problems. However, correctness alone is not sufficient: mathematical proofs should also be clear, concise, insightful, and transferable to other problems. While this proof quality is subjective and depends on the reader and context, many of its components are concrete and broadly valued. In this work, we identify such components and introduce ProofRank, a benchmark curated from challenging mathematical competitions. ProofRank evaluates several scalable proxies of proof quality: (i) conciseness, measuring whether proofs avoid unnecessary steps; (ii) computational ease, measuring the extent to which a proof relies on tedious calculations; (iii) cognitive simplicity, measuring how accessible the used proof techniques are; (iv) diversity, measuring how varied a model’s proofs for a single problem are; and (v) adaptivity, measuring whether a model can follow a specified proof technique. Across models, we find substantial differences in proof quality that are not captured by correctness-only benchmarks. We also observe significant trade-offs between proof-quality metrics and correctness, suggesting that future evaluations of mathematical reasoning should measure how useful LLM-generated proofs are.
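[71] Phoenix-VL 1.5 Medium Technical Report cs.CL | cs.AI | cs.CV
Team Phoenix: Arka Ray, Askar Ali Mohamed Jawad, Biondi Lee
TL;DR: This paper introduces Phoenix-VL 1.5 Medium, a 123B-parameter natively multimodal, multilingual foundation model deeply adapted to regional languages and the Singapore context. It is built by continued pretraining of Mistral Medium 3.1 on a localized 1-trillion-token multimodal corpus, followed by long-context extension, post-training (on a Singapore multimodal dataset and a cultural knowledge corpus), and alignment with Online Direct Preference Optimization. The model retains broad intelligence while reaching state-of-the-art results for its size class on Singapore-specific benchmarks.
Details
Motivation: The goal is a multimodal foundation model that serves as a sovereign AI asset: deep domain adaptation to regional languages and Singapore-specific contexts (culture, law, policy) with minimal degradation of general intelligence and alignment.
Result: For its size class, the model achieves state-of-the-art (SOTA) performance on Singapore multimodal, legal, and government policy benchmarks while remaining globally competitive on general multimodal intelligence, multilingual, and STEM benchmarks.
Insight: The contribution is a demonstration that deep domain adaptation is feasible with carefully curated localized multimodal and text corpora, along with a new evaluation suite comprising localized knowledge benchmarks and an institutionally aligned model behavior and safety framework. Viewed objectively, the combination of large-scale continued pretraining, long-context extension, domain-specific post-training, and Online Direct Preference Optimization offers a reusable path toward large models with both domain expertise and general capability.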
Abstract: We introduce Phoenix-VL 1.5 Medium, a 123B-parameter natively multimodal and multilingual foundation model, adapted to regional languages and the Singapore context. Developed as a sovereign AI asset, it demonstrates that deep domain adaptation can be achieved with minimal degradation to broad-spectrum intelligence and alignment. Continued pretraining was performed on Mistral Medium 3.1 using a localized 1-trillion tokens multimodal corpus, followed by a 250-billion tokens long-context extension phase. Subsequent post-training incorporated a novel human-annotated Singapore multimodal dataset and curated textual corpus on Singapore culture, knowledge, and legislation, totaling 22-billion tokens. An additional 5 billion tokens of model alignment was performed through Online Direct Preference Optimization. Phoenix-VL 1.5 Medium achieves state-of-the-art performance for its size on Singapore multimodal, legal, and government policy benchmarks while remaining globally competitive on general multimodal intelligence, multilingual, and STEM benchmarks. We also introduce a novel evaluation suite encompassing localized knowledge benchmarks and an institutionally aligned model behavior and safety framework. We report the data curation principles, training methodology, and highlight benchmark and inference performance.
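[72] Aligning LLM Uncertainty with Human Disagreement in Subjectivity Analysis cs.CL
Junyu Lu, Deyi Ji, Xuanyi Liu, Lanyun Zhu, Bo Xu
TL;DR: This paper proposes DPUA, an uncertainty-aware framework for subjectivity analysis. Through decoupled learning and GRPO-based reward optimization, it aligns an LLM's predictive uncertainty with the distribution of human annotation disagreement, mitigating overconfidence on boundary samples and improving out-of-distribution generalization.
Details
Motivation: Subjectivity analysis tasks usually train LLMs on aggregated labels, which hides the intrinsic uncertainty reflected in annotator disagreement. Models then make overconfident predictions on low-agreement samples, hurting reliability and generalization in complex subjective settings.
Result: Experiments on three subjectivity analysis tasks show that DPUA preserves task performance while better aligning model uncertainty with human disagreement, mitigating overconfidence on boundary samples, and improving out-of-distribution generalization.
Insight: The contribution is to make alignment between expressed uncertainty and human annotation disagreement a training objective. Decoupled learning sharpens the model's sensitivity to disagreement cues, and GRPO-based reward optimization aligns uncertainty-aware reasoning with the human disagreement distribution, giving a more reliable modeling paradigm for subjective tasks.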
Abstract: Large language models for subjectivity analysis are typically trained with aggregated labels, which compress variations in human judgment into a single supervision signal. This paradigm overlooks the intrinsic uncertainty of low-agreement samples and often induces overconfident predictions, undermining reliability and generalization in complex subjective settings. In this work, we advocate uncertainty-aware subjectivity analysis, where models are expected to make predictions while expressing uncertainty that reflects human disagreement. To operationalize this perspective, we propose a two-phase Disagreement Perception and Uncertainty Alignment (DPUA) framework. Specifically, DPUA jointly models label prediction, rationale generation, and uncertainty expression under an uncertainty-aware setting. In the disagreement perception phase, adaptive decoupled learning enhances the model’s sensitivity to disagreement-related cues while preserving task performance. In the uncertainty alignment phase, GRPO-based reward optimization further improves uncertainty-aware reasoning and aligns the model’s confidence expression with the human disagreement distribution. Experiments on three subjectivity analysis tasks show that DPUA preserves task performance while better aligning model uncertainty with human disagreement, mitigating overconfidence on boundary samples, and improving out-of-distribution generalization.
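One way to picture "aligning confidence with the human disagreement distribution" is a reward that is highest when the model's stated label probabilities match the annotators' vote shares. The sketch below is an assumption for illustration, not DPUA's actual reward: it uses one minus the total variation distance, and the function names are hypothetical.

```python
from collections import Counter


def human_distribution(votes, labels):
    """Turn raw annotator votes into a probability per label."""
    counts = Counter(votes)
    return [counts[label] / len(votes) for label in labels]


def alignment_reward(model_probs, human_probs):
    """1 minus total variation distance; 1.0 means a perfect match.

    Total variation is an illustrative choice of distance.
    """
    tv = 0.5 * sum(abs(p - q) for p, q in zip(model_probs, human_probs))
    return 1.0 - tv


labels = ["positive", "negative"]
votes = ["positive", "positive", "positive", "negative", "negative"]
human = human_distribution(votes, labels)          # 3 of 5 vs 2 of 5

overconfident = alignment_reward([0.99, 0.01], human)
calibrated = alignment_reward([0.6, 0.4], human)
print(human, overconfident, calibrated)
```

Both model outputs pick the majority label, so an accuracy-only reward cannot tell them apart; the distribution-matching term is what penalizes the overconfident one on this low-agreement sample.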
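[73] Coherency through formalisations of Structured Natural Language, A case study on FRETish cs.CL | cs.LO
Joost J. Joosten, Marina López Chamosa, Sofía Santiago Fernández
TL;DR: This paper proposes a new guideline for requirement formalisation, "Coherency through Formalisations", which holds that requirement descriptions at different levels (natural language, structured natural language, formal language) should share a consistent logical structure. Taking NASA's Formal Requirement Elicitation Tool FRET as a case study, the paper analyzes the automated translation from the structured natural language FRETish to the formal language MTL, proposes an alternative translation, proves equivalence by model checking, and exposes inconsistencies in the original translation.
Details
Motivation: In Formal Methods, requirement formalisation is one of the most delicate and complicated steps in verification. Tools often use several levels of requirement description (natural language, technical language, diagrams, formal language) but lack explicit guidance for keeping the logical structure consistent across those levels. The paper proposes the coherency guideline to improve this process. Structured natural language is especially important as an intermediate layer when LLMs perform reasoning tasks that formal tools can check.
Result: The paper presents a case study of NASA's FRET tool, proposes a new automated translation from FRETish to MTL, and proves by model checking that it is equivalent to the original translation. Some statistics suggest the new translation is preferable. The analysis also reveals and discusses inconsistencies in the original translation.
Insight: The core contribution is the "Coherency through Formalisations" guideline, which stresses keeping the logical structure coherent across multi-level requirement descriptions. Viewed objectively, the study gives methodological support for using structured natural language as a bridge between LLMs and formal tools, and its case study shows how improving an automated translation can raise coherency and surface latent problems, which is relevant to requirements engineering and formal verification.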
Abstract: Formalisation is the process of writing system requirements in a formal language. These requirements mostly originate in Natural Language. In the field of Formal Methods, formalisation is often identified as one of the most delicate and complicated steps in the verification process. Not seldomly, formalisation tools and environments choose various levels of requirement descriptions: Natural Language, Technical Language, Diagram Representations and Formal Language, to mention a few. In the literature, there are various maxims and principles of good practice to guide the process of requirement formalisation. In this paper we propose a new guideline: Coherency through Formalisations. The guideline states that the different levels of formalisation mentioned above should roughly follow the same logical structure. The principle seems particularly relevant in the setting where LLMs are prompted to perform reasoning tasks that can be checked by formal tools using Structured Natural Language to act as an intermediate layer bridging both paradigms. In the light of coherency, we analyze NASA’s Formal Requirement Elicitation Tool FRET and propose an alternative automated translation of the Controlled Natural Language FRETish to the formal language of MTL. We compare our translation to the original translation and prove equivalence using model checking. Some statistics are performed which seem to favor the new translation. As expected, the translation process yielded interesting reflections and revealed inconsistencies which we present and discuss.
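[74] DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning cs.CL | cs.AI
Haoyu Huang, Jiaxin Bai, Shujie Liu, Yang Wei, Hong Ting Tsang
TL;DR: DeepRefine is a general LLM-based reasoning model that refines agent-compiled knowledge bases via reinforcement learning. Through multi-turn interaction, abductive diagnosis, and targeted updates, it addresses incompleteness, incorrectness, and redundancy in the knowledge base, improving performance on open-ended, knowledge-intensive downstream tasks.
Details
Motivation: Agent-compiled knowledge bases used in open-ended, knowledge-intensive downstream tasks have systematic quality problems: incompleteness (missing evidence or cross-document links), incorrectness (low-confidence or imprecise claims), and redundancy (ambiguity or coreference resolution issues). These defects worsen under iterative use, degrading retrieval fidelity and downstream task performance.
Result: Extensive experiments show consistent gains over strong baselines across downstream tasks.
Insight: The contribution is a reinforcement learning framework that optimizes refinement policies without gold references. It introduces the Gain-Beyond-Draft (GBD) reward and trains the reasoning process end-to-end, enabling general, incremental quality improvement for any pre-constructed knowledge base.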
Abstract: Agent-compiled knowledge bases provide persistent external knowledge for large language model (LLM) agents in open-ended, knowledge-intensive downstream tasks. Yet their quality is systematically limited by incompleteness, incorrectness, and redundancy, manifested as missing evidence or cross-document links, low-confidence or imprecise claims, and ambiguity or coreference resolution issues. Such defects compound under iterative use, degrading retrieval fidelity and downstream task performance. We present DeepRefine, a general LLM-based reasoning model for agent-compiled knowledge refinement that improves the quality of any pre-constructed knowledge base with user queries to make it more suitable for the downstream tasks. DeepRefine performs multi-turn interactions with the knowledge base and conducts abductive diagnosis over interaction history, localizes likely defects, and executes targeted refinement actions for incremental knowledge base updates. To optimize refinement policies of DeepRefine without gold references, we introduce a Gain-Beyond-Draft (GBD) reward and train the reasoning process end-to-end via reinforcement learning. Extensive experiments demonstrate consistent downstream gains over strong baselines.
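[75] Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing cs.CL
Jinchang Zhu, Jindong Li, Chengyu Zou, Rong Fu, Chao Wang
TL;DR: This paper proposes EXACT (Effective-Context Exposure Balancing) to address a supervision mismatch in long-context adaptation training. Under packed training with document masking, the effective context of each target token remains short. EXACT assigns extra weight to targets with long effective context, by inverse frequency within the long tail, so that supervision is allocated more effectively. Across multiple model configurations it substantially improves long-context benchmarks while preserving standard-task ability.
Details
Motivation: Long-context adaptation is not only a matter of window scaling. The core issue is a supervision mismatch during training: under packed training with document masking, each target token actually sees only a short effective context, so the model does not adequately learn long-range dependencies.
Result: Across seven Qwen and LLaMA CPT configurations, EXACT improves all 28 trained/extrapolated NoLiMa and RULER comparisons. On Qwen2.5-0.5B, NoLiMa improves by +10.09 (trained) and +5.34 (extrapolated), and RULER by +10.69 and +5.55; on LLaMA-3.2-3B, RULER improves by +17.91 and +16.11. Standard QA/reasoning performance is preserved (an average change of only +0.24 across six benchmarks). A distance-resolved probe shows that gains are pronounced when the evidence is thousands of tokens away, while short-distance cases are unchanged.
Insight: The core contribution is a supervision-centric view: the effect of long-context adaptation depends on how strongly training supervises long-context predictions. Through inverse-frequency weighting, EXACT directly strengthens supervision on long-tail effective-context targets, a supervision-allocation strategy rather than a change to model architecture or training data. This points to a new optimization direction for long-context modeling: rebalancing the supervision signal to use training data more efficiently.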
Abstract: Long-context adaptation is often viewed as window scaling, but this misses a token-level supervision mismatch: in packed training with document masking, each target token’s effective context remains short. We introduce EXACT, a supervision-allocation objective that assigns extra weight to long effective-context targets by inverse frequency within the long tail. Across seven Qwen/LLaMA CPT configurations, EXACT improves all 28 trained/extrapolated NoLiMa and RULER comparisons. On Qwen2.5-0.5B, NoLiMa improves by +10.09 (trained) and +5.34 (extrapolated); RULER by +10.69 and +5.55. On LLaMA-3.2-3B, RULER improves by +17.91 and +16.11. Standard QA/reasoning are preserved (+0.24 macro change across six benchmarks). A distance-resolved probe shows gains arise when evidence is thousands of tokens away, while short cases remain unchanged. Results support a supervision-centric thesis: long-context adaptation depends on how strongly training supervises long-context predictions.
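The supervision mismatch and the reweighting can both be seen on a tiny packed sequence. The sketch below is a minimal illustration under stated assumptions: the bucketing, the tail threshold `tail_start`, the boost rule `alpha * min_count / count`, and the function names are all hypothetical, since the abstract only says that long effective-context targets get extra weight by inverse frequency within the long tail.

```python
from collections import Counter


def effective_context_lengths(doc_lengths):
    """With document masking, token i of a document sees only i earlier tokens."""
    return [i for n in doc_lengths for i in range(n)]


def exposure_weights(lengths, tail_start, bucket=2, alpha=1.0):
    """Per-token loss weights: 1.0 for short targets, boosted for tail targets.

    Tail targets are bucketed by effective context; rarer buckets get a larger
    boost (inverse frequency, scaled so the rarest bucket gets +alpha).
    """
    buckets = [length // bucket for length in lengths]
    tail_counts = Counter(b for length, b in zip(lengths, buckets) if length >= tail_start)
    if not tail_counts:
        return [1.0] * len(lengths)
    min_count = min(tail_counts.values())
    weights = []
    for length, b in zip(lengths, buckets):
        if length < tail_start:
            weights.append(1.0)
        else:
            weights.append(1.0 + alpha * min_count / tail_counts[b])
    return weights


# Three documents of 2, 4, and 6 tokens packed into one 12-token sequence.
lengths = effective_context_lengths([2, 4, 6])
weights = exposure_weights(lengths, tail_start=2)
print(lengths)
print(weights)
```

Even though the packed sequence is 12 tokens long, no target sees more than 5 tokens of context, and the longest contexts are the rarest. The weights multiply the per-token loss so those rare long-context targets contribute more to the gradient.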
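[76] Multi-domain Multi-modal Document Classification Benchmark with a Multi-level Taxonomy cs.CL
Denghao Ma, Qing Liu, Zulong Chen, Chuanfei Xu, Jia Xu
TL;DR: This paper presents MMM-Bench, the first multi-level, multi-domain, multi-modal document classification benchmark, to address the limitations of existing benchmarks in hierarchy, modality, and domain coverage. It includes a five-level taxonomy and 5,990 real multi-modal documents curated from 12 commercial domains at Alibaba, each manually annotated by experts with a complete hierarchical path. The paper establishes comprehensive baselines and identifies four fundamental challenges with corresponding insights.
Details
Motivation: Existing document classification benchmarks are oversimplified, limited to single domains and flat label structures, and do not reflect the hierarchical, multi-modal, cross-domain complexity of real business documents. This holds back industrial-grade document intelligence.
Result: The paper establishes comprehensive baselines on MMM-Bench, covering open-weight and API-based models, and identifies four fundamental challenges through systematic experiments. The abstract does not report quantitative metrics (such as accuracy) or a direct comparison with SOTA.
Insight: The contribution is the first document classification benchmark that jointly covers hierarchy, domain, and modality. Its five-level taxonomy and multi-modal document set spanning 12 real commercial domains give research a more realistic evaluation basis. Viewed objectively, its systematic analysis of challenges also offers direction for future model design.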
Abstract: Document classification forms the backbone of modern enterprise content management, yet existing benchmarks remain trapped in oversimplified paradigms – single domain settings with flat label structures – that bear little resemblance to the hierarchical, multi-modal, and cross-domain nature of real-world business documents. This gap not only misrepresents practical complexity but also stifles progress toward industrially viable document intelligence. To bridge this gap, we construct the first Multi-level, Multi-domain, Multi-modal document classification Benchmark (MMM-Bench). MMM-Bench includes (1) a deeply hierarchical taxonomy spanning five levels that capture the authentic organizational logic of business documentation; and (2) 5,990 real-world multi-modal documents meticulously curated from 12 commercial domains in Alibaba. Each document is manually annotated with a complete hierarchical path by domain experts. We establish comprehensive baselines on MMM-Bench, which consists of open-weight models and API-based models. Through systematic experiments, we identify four fundamental challenges within MMM-Bench and propose corresponding insights. To provide a solid foundation for advancing research in multi-level, multi-domain document classification, we release all of the data and the evaluation toolkit of MMM-Bench at https://github.com/MMMDC-Bench/MMMDC-Bench.
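[77] VISTA: A Generative Egocentric Video Framework for Daily Assistance cs.CL
Yu-Hsiang Liu, Yu-Chien Tang, An-Zi Yen
TL;DR: VISTA is a generative egocentric video framework that produces high-quality, diverse videos of daily-assistance scenarios for training and evaluating AI assistants, addressing the difficulty, cost, or danger of collecting such data in the real world.
Details
Motivation: Training AI agents to proactively assist humans in daily activities requires large-scale visual data, but collecting it in the real world is difficult and expensive, and physics-based simulators lack the visual fidelity needed for transfer to real environments.
Result: The paper presents the VISTA system, which uses a 5-step script generation pipeline with causal reverse reasoning to create diverse, logically grounded intervention scenarios, producing high-fidelity egocentric video benchmarks for training and evaluating AI agents.
Insight: The core contribution is a scalable, controllable video synthesis framework that generates logically sound scripts through causal reverse reasoning and systematically divides agent autonomy into reactive and proactive modes (the latter split into explicit and implicit), providing a realistic, customizable source of synthetic data for training AI assistants.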
Abstract: Training AI agents to proactively assist humans in daily activities, from routine household tasks to urgent safety situations, requires large-scale visual data. However, capturing such scenarios in the real world is often difficult, costly, or unsafe, and physics-based simulators lack the visual fidelity needed to transfer learned behaviors to real settings. Therefore, we introduce VISTA, a video synthesis system that produces high-fidelity egocentric videos as training and evaluation data for AI agents. VISTA employs a 5-step script generation pipeline with causal reverse reasoning to create diverse, logically grounded intervention modes. These scenarios span two levels of agent autonomy: reactive and proactive. In reactive modes, the user explicitly asks the agent for help. In proactive modes, the agent offers help without receiving a direct request. We further divide proactive modes into explicit and implicit types. In explicit proactive scenarios, the user is aware of needing help but does not directly address the agent. In implicit proactive scenarios, the agent intervenes before the user even realizes that help is needed. VISTA allows users to customize and refine scenarios to generate video benchmarks for daily tasks, offering a scalable and controllable alternative to real-world data collection for training and evaluating AI agents in realistic environments.
[78] Why Low-Resource NLP Needs More Than Cross-Lingual Transfer: Lessons Learned from Luxembourgish cs.CL | cs.AIPDF
Fred Philippy, Siwen Guo, Jacques Klein, Tegawendé F. Bissyandé
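TL;DR: Using Luxembourgish as a case study, this paper examines the role and limits of cross-lingual transfer in low-resource NLP. It finds a fundamental interdependence between cross-lingual transfer and language-specific efforts and argues that the two should be integrated as complementary components of a sustainable low-resource NLP pipeline.
Details
Motivation: The paper asks whether cross-lingual transfer can substitute for language-specific efforts, particularly for low-resource languages such as Luxembourgish, which remains underrepresented in modern NLP technologies despite its typological proximity to high-resource languages and its multilingual context.
Result: Cross-lingual transfer can substantially improve target-language performance, but its success depends critically on high-quality, task-aligned target-language data. Such resources are limited in scale in low-resource settings and perform poorly on their own; they reach their full potential only within a cross-lingual framework.
Insight: The novelty lies in stressing that cross-lingual transfer and language-specific efforts are complementary rather than competing alternatives, and in providing practical guidelines for integrating and balancing the two in a sustainable low-resource NLP pipeline. Viewed objectively, this gives low-resource NLP research a more complete methodological perspective.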
Abstract: Cross-lingual transfer has become a central paradigm for extending natural language processing (NLP) technologies to low-resource languages. By leveraging supervision from high-resource languages, multilingual language models can achieve strong task performance with little or no labeled target-language data. However, it remains unclear to what extent cross-lingual transfer can substitute for language-specific efforts. In this paper, we synthesize prior research findings and data collection results on Luxembourgish, which, despite its typological proximity to high-resource languages and its presence in a multilingual context, remains insufficiently represented in modern NLP technologies. Across findings, we observe a fundamental interdependence between cross-lingual transfer and language-specific efforts. Cross-lingual transfer can substantially improve target-language performance, but its success depends critically on the availability of sufficiently high-quality, task-aligned target-language data. At the same time, such resources, particularly in low-resource settings, are typically too limited in scale to drive strong performance on their own. Instead, such resources reach their full potential only when leveraged within a cross-lingual framework. We therefore argue that cross-lingual transfer and language-specific efforts should not be viewed as competing alternatives. Instead, they function as complementary components of a sustainable low-resource NLP pipeline. Based on these insights, we provide practical guidelines for integrating and balancing cross-lingual transfer with language-specific development in sustainable low-resource NLP pipelines.
[79] Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents cs.CLPDF
Shijue Huang, Hangyu Guo, Chenxin Li, Junting Lu, Xinyu Geng
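TL;DR: This paper proposes an on-policy data evolution method for visual-native multimodal deep search agents. First, an image bank reference protocol makes intermediate visual evidence returned by tools reusable by later tools. Second, an on-policy data evolution framework uses a closed-loop data generator that produces training data dynamically from rollouts of the current policy, targeting what the agent still needs to learn. The method substantially improves agent performance across 8 multimodal deep search benchmarks.
Details
Motivation: Current multimodal deep search systems have two bottlenecks. First, existing tool-use harnesses treat images returned by search, browsing, or transformation as transient outputs, so intermediate visual evidence cannot be reused by later tools. Second, training data is usually built with fixed curation recipes that cannot track the evolving capability of the target agent.
Result: Across 8 multimodal deep search benchmarks, ODE raises the Qwen3-VL-8B agent's average score from 24.9% to 39.0%, surpassing Gemini-2.5 Pro (37.9%) in the standard agent-workflow setting. For the 30B model, the average score rises from 30.6% to 41.5%. Analyses confirm the effectiveness of image-bank reuse, especially on complex tasks requiring iterative visual refinement, and show that rollout-feedback evolution yields more grounded SFT traces and better policy-matched RL tasks than static synthesis.
Insight: The contributions are (1) a visual-native agent harness with an image bank reference protocol that makes intermediate visual evidence addressable and reusable, and (2) on-policy data evolution, in which a closed-loop data generator refines the training data dynamically so that it targets the current policy's learning needs. The same framework supports both supervised fine-tuning and policy-aware reinforcement learning data curation, covering the agent's full training lifecycle.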
Abstract: Multimodal deep search requires an agent to solve open-world problems by chaining search, tool use, and visual reasoning over evolving textual and visual context. Two bottlenecks limit current systems. First, existing tool-use harnesses treat images returned by search, browsing, or transformation as transient outputs, so intermediate visual evidence cannot be re-consumed by later tools. Second, training data is usually built by fixed curation recipes that cannot track the target agent’s evolving capability. To address these challenges, we first introduce a visual-native agent harness centered on an image bank reference protocol, which registers every tool-returned image as an addressable reference and makes intermediate visual evidence reusable by later tools. On top of this harness, On-policy Data Evolution (ODE) runs a closed-loop data generator that refines itself across rounds from rollouts of the policy being trained. This per-round refinement makes each round’s data target what the current policy still needs to learn. The same framework supports both diverse supervised fine-tuning data and policy-aware reinforcement learning data curation, covering the full training lifecycle of the target agent. Across 8 multimodal deep search benchmarks, ODE improves the Qwen3-VL-8B agent from 24.9% to 39.0% on average, surpassing Gemini-2.5 Pro in standard agent-workflow setting (37.9%). At 30B, ODE raises the average score from 30.6% to 41.5%. Further analyses validate the effectiveness of image-bank reuse, especially on complex tasks requiring iterative visual refinement, while rollout-feedback evolution yields more grounded SFT traces and better policy-matched RL tasks than static synthesis.
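The image bank reference protocol described above can be pictured with a minimal sketch. The abstract only states that every tool-returned image is registered as an addressable reference that later tools can consume; the class name `ImageBank`, the `img_0000` ID scheme, and the toy `crop_tool` are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ImageBank:
    """Hypothetical registry: each tool-returned image gets a stable reference ID."""
    _images: dict = field(default_factory=dict)

    def register(self, image_bytes: bytes, source_tool: str) -> str:
        # Assumed ID scheme: sequential, zero-padded counters.
        ref = f"img_{len(self._images):04d}"
        self._images[ref] = {"data": image_bytes, "source": source_tool}
        return ref

    def resolve(self, ref: str) -> bytes:
        # Later tools look images up by reference instead of receiving raw pixels.
        return self._images[ref]["data"]


def crop_tool(bank: ImageBank, ref: str, keep: int) -> str:
    """Toy transformation tool: consumes a reference and registers its own output."""
    cropped = bank.resolve(ref)[:keep]
    return bank.register(cropped, source_tool="crop")


bank = ImageBank()
first = bank.register(b"raw-search-result-pixels", source_tool="search")
second = crop_tool(bank, first, keep=10)
```

The point of the design is that `second` is itself addressable, so a further tool call could refine it again without the image ever leaving the bank.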
[80] Learning More from Less: Exploiting Counterfactuals for Data-Efficient Chart Understanding cs.CLPDF
Jianzhu Bao, Haozhen Zhang, Kuicai Dong, Bozhi Wu, Sarthak Ketanbhai Modi
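TL;DR: This paper proposes ChartCF, a framework that uses counterfactual data to strengthen vision-language models' perception of fine-grained visual differences in chart understanding. With substantially less training data, it matches or exceeds existing chart-specific vision-language models on several benchmarks.
Details
Motivation: Existing supervised fine-tuning on large-scale synthetic data is inefficient and overlooks a key property of charts as programmatically generated visual artifacts: small, code-controlled visual changes can drastically shift the semantics and the correct answer, and standard training does not learn this counterfactual sensitivity effectively.
Result: Experiments on five benchmarks show that ChartCF achieves performance superior or comparable to strong chart-specific vision-language models while using significantly less training data.
Insight: The contributions are (1) counterfactual data generation via code modification, (2) a chart-similarity-based data selection strategy that filters out overly difficult samples, and (3) multimodal preference optimization across textual and visual modalities. Viewed objectively, combining counterfactual learning with data selection and multimodal optimization offers a new approach to data-efficient training.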
Abstract: Vision-Language Models (VLMs) have demonstrated remarkable progress in chart understanding, largely driven by supervised fine-tuning (SFT) on increasingly large synthetic datasets. However, scaling SFT data alone is inefficient and overlooks a key property of charts: charts are programmatically generated visual artifacts, where small, code-controlled visual changes can induce drastic shifts in semantics and correct answers. Learning this counterfactual sensitivity requires VLMs to discriminate fine-grained visual differences, yet standard SFT treats training instances independently and provides limited supervision to enforce this behavior. To address this, we introduce ChartCF, a data-efficient training framework designed to enhance counterfactual sensitivity. ChartCF consists of: (1) a counterfactual data synthesis pipeline via code modification, (2) a chart similarity-based data selection strategy that filters overly difficult samples for improved training efficiency, and (3) multimodal preference optimization across both textual and visual modalities. Experiments on five benchmarks show that ChartCF achieves superior or comparable performance to strong chart-specific VLMs while using significantly less training data.
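To make the idea of code-controlled counterfactuals concrete, here is a minimal sketch in which a chart is represented by a small specification dictionary instead of real plotting code. The spec format, the helper names, and the question ("which category is highest?") are illustrative assumptions and not ChartCF's actual pipeline.

```python
import copy


def answer_highest(chart_spec: dict) -> str:
    """Ground-truth answer to 'which category has the highest bar?'."""
    values = chart_spec["values"]
    return max(values, key=values.get)


def make_counterfactual(chart_spec: dict, category: str, new_value: float) -> dict:
    """Edit one value in the chart specification, leaving the original untouched."""
    edited = copy.deepcopy(chart_spec)
    edited["values"][category] = new_value
    return edited


original = {"type": "bar", "values": {"A": 12, "B": 15, "C": 9}}
counterfactual = make_counterfactual(original, "A", 16)

# A visually similar pair whose correct answers differ.
pair = {
    "original_answer": answer_highest(original),
    "counterfactual_answer": answer_highest(counterfactual),
}
```

Raising a single bar from 12 to 16 is a small visual change, yet it flips the correct answer, which is the kind of sensitivity the abstract says independent SFT instances supervise only weakly.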
[81] DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization cs.CLPDF
Mengyi Deng, Zhiwei Li, Xin Li, Tingyu Zhu, Yulan Yuan
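TL;DR: This paper proposes DGPO (Directional-Groupwise Preference Optimization), a lightweight framework that addresses the difficulty of achieving both directional consistency and reasoning diversity in preference optimization for large language models. It organizes forward and reverse question-answer instances into structured sets and optimizes a margin-based likelihood objective, explicitly modeling direction-aware alignment through multi-candidate comparisons.
Details
Motivation: Current preference optimization methods for large language models struggle to maintain reasoning diversity while preserving directional consistency. DGPO aims to overcome this limitation by aggregating supervision signals at the group level.
Result: Across five benchmarks, the constructed reverse data yields an average improvement of 3.2%, and DGPO delivers further consistent gains across multiple datasets and model families, with average accuracy improvements of up to 3.6%.
Insight: The novelty is that group-wise optimization captures richer relative information than pairwise objectives and reinforces consistency across diverse reasoning paths. Viewed objectively, explicitly modeling alignment through direction-aware multi-candidate comparisons makes for a novel lightweight optimization framework.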
Abstract: Although Large Language Models (LLMs) have made remarkable progress, current preference optimization methods still struggle to align directional consistency while preserving reasoning diversity. To address this limitation, we propose Directional-Groupwise Preference Optimization (DGPO), a lightweight framework that aggregates supervision signals at the group level and explicitly models direction-aware alignment through multi-candidate comparisons. DGPO organizes forward and reverse question-answer instances into structured sets and optimizes a margin-based likelihood objective that separates coherent reasoning paths from inconsistent alternatives. This group-wise formulation captures richer relative information than pairwise objectives and reinforces consistency across diverse reasoning pathways. Empirical results show that our constructed reverse data yields a 3.2% average improvement across five benchmarks, while DGPO further delivers consistent gains across multiple datasets and model families, achieving average accuracy improvements of up to 3.6%.
[82] Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking cs.CLPDF
Reza Khanmohammadi, Erfan Miahi, Simerjot Kaur, Charese H. Smiley, Ivan Brugere
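TL;DR: This paper proposes BICR (Blind-Image Contrastive Ranking), a model-agnostic confidence estimation framework for detecting visual ungroundedness in large vision-language models (LVLMs). It contrasts the model's hidden states under the real image and a blacked-out image and trains a lightweight probe to assess prediction reliability, at no additional inference cost.
Details
Motivation: Large vision-language models suffer from visual ungroundedness: they can produce fluent, confident, and even correct responses from language priors alone, with the image contributing nothing to the prediction. Existing confidence estimation methods cannot detect this because they have no mechanism to determine whether a prediction was driven by the image or by text alone.
Result: On a benchmark covering visual question answering, object hallucination detection, medical imaging, and financial document understanding, evaluated across five modern LVLMs and seven baselines, BICR achieves the best cross-LVLM average on calibration and discrimination simultaneously. Its discrimination gains are statistically significant, and it uses 4-18x fewer parameters than the strongest probing baseline.
Insight: The novelty is to train explicitly on visual grounding as a signal of reliability by contrasting hidden states under the real image and the blacked-out image, which makes visual ungroundedness detectable. The method is model-agnostic, lightweight, and adds no inference cost, offering a new approach to confidence estimation for LVLMs.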
Abstract: Large vision-language models suffer from visual ungroundedness: they can produce a fluent, confident, and even correct response driven entirely by language priors, with the image contributing nothing to the prediction. Existing confidence estimation methods cannot detect this, as they observe model behavior under normal inference with no mechanism to determine whether a prediction was shaped by the image or by text alone. We introduce BICR (Blind-Image Contrastive Ranking), a model-agnostic confidence estimation framework that makes this contrast explicit during training by extracting hidden states from a frozen LVLM twice: once with the real image-question pair, and once with the image blacked out while the question is held fixed. A lightweight probe is trained on the real-image hidden state and regularized by a ranking loss that penalizes higher confidence on the blacked-out view, teaching it to treat visual grounding as a signal of reliability at zero additional inference cost. Evaluated across five modern LVLMs and seven baselines on a benchmark covering visual question answering, object hallucination detection, medical imaging, and financial document understanding, BICR achieves the best cross-LVLM average on both calibration and discrimination simultaneously, with statistically significant discrimination gains robust to cluster-aware analysis at 4-18x fewer parameters than the strongest probing baseline.
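The ranking regularizer can be sketched as follows. The abstract says only that the probe is trained on the real-image hidden state and that a ranking loss penalizes higher confidence on the blacked-out view; the hinge form, the `margin` and `weight` parameters, and the use of binary cross-entropy as the main term are assumptions made for illustration.

```python
import math


def bce(conf: float, label: int) -> float:
    """Binary cross-entropy between probe confidence and correctness label (0 or 1)."""
    eps = 1e-7
    conf = min(max(conf, eps), 1.0 - eps)
    return -(label * math.log(conf) + (1 - label) * math.log(1.0 - conf))


def blind_ranking_penalty(conf_real: float, conf_blind: float, margin: float = 0.1) -> float:
    """Hinge penalty (assumed form): zero once the real-image confidence
    exceeds the blacked-out confidence by at least `margin`."""
    return max(0.0, margin + conf_blind - conf_real)


def probe_loss(conf_real: float, conf_blind: float, label: int,
               weight: float = 1.0, margin: float = 0.1) -> float:
    """Main term on the real-image view plus the ranking regularizer."""
    return bce(conf_real, label) + weight * blind_ranking_penalty(conf_real, conf_blind, margin)
```

Because the blacked-out view enters only through the training loss, a probe trained this way would still need just the single real-image forward pass at inference, which matches the abstract's claim of zero additional inference cost.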
[83] RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards cs.CL | cs.LGPDF
Gaotang Li, Bhavana Dalvi Mishra, Zifeng Wang, Jun Yan, Yanfei Chen
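TL;DR: This paper proposes RubricEM, a framework that uses rubrics to guide reinforcement learning for deep research agents. It decomposes long-trajectory tasks into rubric-conditioned, stagewise policy execution, assigns credit with Stage-Structured GRPO, and trains a shared-backbone reflection meta-policy that distills reusable guidance from past trajectories.
Details
Motivation: Deep research agents (systems that plan, search, evaluate evidence, and write long-form reports) pose several challenges: verifiable rewards are unavailable, decision trajectories are long and tool-augmented, and conventional training struggles to turn past attempts into reusable experience.
Result: Across four long-form research benchmarks, RubricEM-8B achieves strong performance, outperforming comparable open models and approaching proprietary deep-research systems.
Insight: The novelty is to elevate rubrics from final-answer evaluators to a shared interface that structures policy execution, feedback, and memory, combining stagewise policy decomposition with reflection-based meta-policy evolution. This gives long-horizon tasks without ground-truth rewards denser semantic feedback and a mechanism for accumulating reusable experience.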
Abstract: Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interface that structures policy execution, judge feedback, and agent memory. Based on this view, we introduce RubricEM, a rubric-guided reinforcement learning framework that combines stagewise policy decomposition with reflection-based meta-policy evolution. RubricEM first makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then assigns credit with Stage-Structured GRPO, which uses stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. In parallel, RubricEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable rubric-grounded guidance for future attempts. The resulting RubricEM-8B achieves strong performance across four long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems. Beyond final performance, we perform thorough analyses to understand the key ingredients of RubricEM.
[84] WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation cs.CLPDF
Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu
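TL;DR: This paper presents WildClawBench, a native-runtime benchmark for evaluating real-world, long-horizon agents. It contains 60 human-authored, bilingual, multimodal tasks spanning six thematic categories; each task averages about 8 minutes of wall-clock time and more than 20 tool calls. Evaluation runs inside Docker containers hosting real command-line interface (CLI) agent harnesses and uses hybrid grading (rule-based checks, environment-state auditing, and LLM/VLM semantic verification). Across 19 frontier models, the best, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, highlighting the limits of current models on long-horizon, native-runtime tasks.
Details
Motivation: Most existing agent benchmarks rely on synthetic sandboxes, short-horizon tasks, mock APIs, and final-answer checks, so they cannot effectively assess whether agents can complete realistic long-horizon work in the environments where they are actually deployed.
Result: Across the 19 frontier models evaluated on WildClawBench, the best, Claude Opus 4.7, reaches only 62.2% overall under the OpenClaw harness, and every other model stays below 60%. Switching the harness alone (for example, between OpenClaw and Claude Code) shifts the same model's score by up to 18 points.
Insight: The novelty is a multimodal, long-horizon task benchmark built on native runtimes with real tools rather than mock services, together with a hybrid grading framework that combines deterministic rules, environment-state auditing, and LLM/VLM semantic verification. Viewed objectively, the work underscores the need to evaluate agents in realistic, complex environments and provides standardized infrastructure for reproducible agent evaluation.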
Abstract: Large language and vision-language models increasingly power agents that act on a user’s behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages roughly 8 minutes of wall-clock time and over 20 tool calls, and runs inside a reproducible Docker container hosting an actual CLI agent harness (OpenClaw, Claude Code, Codex, or Hermes Agent) with access to real tools rather than mock services. Grading is hybrid, combining deterministic rule-based checks, environment-state auditing of side effects, and an LLM/VLM judge for semantic verification. Across 19 frontier models, the best, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, while every other model stays below 60%, and switching harness alone shifts a single model by up to 18 points. These results show that long-horizon, native-runtime agent evaluation remains a far-from-resolved task for current frontier models. We release the tasks, code, and containerized tooling to support reproducible evaluation.
[85] ELF: Embedded Language Flows cs.CL | cs.AI | cs.LGPDF
Keya Hu, Linlu Qiu, Yiyang Lu, Hanhong Zhao, Tianhong Li
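TL;DR: This paper proposes ELF (Embedded Language Flows), a class of diffusion models in continuous embedding space based on continuous-time Flow Matching. Unlike existing diffusion language models, which mainly operate on discrete tokens, ELF stays predominantly in the continuous embedding space and maps to discrete tokens only at the final step, which makes it straightforward to adopt techniques from image-domain diffusion models such as classifier-free guidance. Experiments show that ELF outperforms leading discrete and continuous diffusion language models in both generation quality and sampling efficiency.
Details
Motivation: Diffusion and flow-based models have become the dominant approach for generating continuous data such as images and video, but their application to language modeling has mostly been confined to discrete tokens. The paper explores whether continuous diffusion language models can be made effective with minimal adaptation to the discrete domain, addressing the difficulty existing models have in directly applying mature image-domain techniques.
Result: ELF substantially outperforms leading discrete and continuous diffusion language models, achieving better generation quality with fewer sampling steps, which indicates performance at an advanced level on generation benchmarks.
Insight: The novelty is keeping the diffusion process mainly in the continuous embedding space and mapping to discrete tokens only at the end, which simplifies the transfer of image-domain techniques such as classifier-free guidance. Viewed objectively, the continuous representation may improve expressiveness and efficiency, opening a new path for continuous diffusion language models.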
Abstract: Diffusion and flow-based models have become the de facto approaches for generating continuous data, e.g., in domains such as images and videos. Their success has attracted growing interest in applying them to language modeling. Unlike their image-domain counterparts, today’s leading diffusion language models (DLMs) primarily operate over discrete tokens. In this paper, we show that continuous DLMs can be made effective with minimal adaptation to the discrete domain. We propose Embedded Language Flows (ELF), a class of diffusion models in continuous embedding space based on continuous-time Flow Matching. Unlike existing DLMs, ELF predominantly stays within the continuous embedding space until the final time step, where it maps to discrete tokens using a shared-weight network. This formulation makes it straightforward to adapt established techniques from image-domain diffusion models, e.g., classifier-free guidance (CFG). Experiments show that ELF substantially outperforms leading discrete and continuous DLMs, achieving better generation quality with fewer sampling steps. These results suggest that ELF offers a promising path toward effective continuous DLMs.
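A toy sketch of the sampling loop this describes: integrate a velocity field in embedding space with Euler steps, apply classifier-free guidance to the velocity, and map to a discrete token only at the end. The two-dimensional embedding table, the closed-form velocity field, and the nearest-neighbour decoding are stand-ins chosen for illustration; ELF itself uses a learned network, and its final mapping is a shared-weight network rather than a nearest-neighbour lookup.

```python
def cfg_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance: move from the unconditional velocity
    toward (or past) the conditional one by a factor `scale`."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(x0, velocity_fn, steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(x0)
    for i in range(steps):
        t = i / steps
        v = velocity_fn(x, t)
        x = [xi + vi / steps for xi, vi in zip(x, v)]
    return x


def nearest_token(embedding, table):
    """Stand-in for the final continuous-to-discrete mapping."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(table, key=lambda tok: sq_dist(embedding, table[tok]))


TOY_TABLE = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}  # hypothetical embeddings


def guided_field(target, scale):
    """Toy field: the conditional branch points straight at `target`,
    the unconditional branch is zero."""
    def field(x, t):
        v_cond = [(g - xi) / (1.0 - t) for g, xi in zip(target, x)]
        v_uncond = [0.0 for _ in x]
        return cfg_velocity(v_cond, v_uncond, scale)
    return field


final = euler_sample([0.3, -0.2], guided_field(TOY_TABLE["dog"], scale=1.0), steps=4)
token = nearest_token(final, TOY_TABLE)
```

With `scale=1.0` the guided velocity reduces to the conditional one, so the four Euler steps land on the target embedding and the discretization happens exactly once, after the continuous trajectory is complete.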
cs.CV [Back]
[86] VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving cs.CV | cs.AIPDF
Rui Zhao, Haofeng Hu, Zhenhai Gao, Jiaqiao Liu, Gao Fei
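TL;DR: This paper proposes VLADriver-RAG, a framework that uses retrieval-augmented vision-language-action models to address generalization to long-tail scenarios in autonomous driving. It abstracts sensory inputs into spatiotemporal semantic graphs to filter visual noise, retrieves relevant historical knowledge using a graph alignment metric, and fuses these priors in a VLA backbone to generate precise trajectories.
Details
Motivation: Current vision-language-action models for end-to-end autonomous driving rely on implicit parametric knowledge and generalize poorly in long-tail scenarios, while standard visual retrieval suffers from high latency and semantic ambiguity. A method is therefore needed that strengthens planning with explicit, structure-aware historical knowledge.
Result: The method sets a new state of the art on the Bench2Drive benchmark, with a Driving Score of 89.12.
Insight: The contributions are (1) a Visual-to-Scenario mechanism that abstracts sensory inputs into spatiotemporal semantic graphs, effectively filtering visual noise; (2) a Scenario-Aligned Embedding Model that uses a Graph-DTW (graph dynamic time warping) metric so that retrieval favors topological consistency over superficial visual similarity; and (3) fusion of the retrieved priors in a query-based VLA backbone to synthesize precise, disentangled trajectories.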
Abstract: Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving, yet their reliance on implicit parametric knowledge limits generalization in long-tail scenarios. While Retrieval-Augmented Generation (RAG) offers a solution by accessing external expert priors, standard visual retrieval suffers from high latency and semantic ambiguity. To address these challenges, we propose VLADriver-RAG, a framework that grounds planning in explicit, structure-aware historical knowledge. Specifically, we abstract sensory inputs into spatiotemporal semantic graphs via a Visual-to-Scenario mechanism, effectively filtering visual noise. To ensure retrieval relevance, we employ a Scenario-Aligned Embedding Model that utilizes Graph-DTW metric alignment to prioritize intrinsic topological consistency over superficial visual similarity. These retrieved priors are then fused within a query-based VLA backbone to synthesize precise, disentangled trajectories. Extensive experiments on the Bench2Drive benchmark establish a new state-of-the-art, achieving a Driving Score of 89.12.
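For readers unfamiliar with dynamic time warping, the sketch below shows classic DTW used as a retrieval metric over scenario sequences. It is not the paper's Graph-DTW: each frame's scene graph is replaced here by a hypothetical numeric feature vector, and the scenario names and the L1 frame cost are illustrative assumptions.

```python
def dtw_distance(seq_a, seq_b, cost):
    """Classic dynamic time warping: minimum cumulative frame-to-frame cost
    over all monotonic alignments of the two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(seq_a[i - 1], seq_b[j - 1])
            table[i][j] = step + min(table[i - 1][j],      # stretch seq_b
                                     table[i][j - 1],      # stretch seq_a
                                     table[i - 1][j - 1])  # advance both
    return table[n][m]


def l1(a, b):
    """Frame cost: L1 distance between two per-frame feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def retrieve(query, memory, k=1):
    """Return the names of the k stored scenarios closest to the query under DTW."""
    ranked = sorted(memory, key=lambda name: dtw_distance(query, memory[name], l1))
    return ranked[:k]


# Hypothetical memory of past scenarios, one feature vector per time step.
memory = {
    "cut_in": [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]],
    "free_drive": [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
}
# The same cut-in pattern, unfolding more slowly.
query = [[0.0, 1.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
best = retrieve(query, memory)
```

DTW tolerates differences in pacing, so the slower query still matches the stored cut-in at zero cost; this is the general property that makes a warping-based metric attractive for comparing scenarios by structure instead of frame-by-frame appearance.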
[87] Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models cs.CV | cs.AI | cs.LGPDF
Yuriel Ryan, Hei Man Ip, Adriel Kuek, Paul Pu Liang, Roy Ka-Wei Lee
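TL;DR: This paper proposes Self-Captioning Multimodal Interaction Tuning (SC-MIT), which addresses hallucination and robustness problems in current vision language models (VLMs) by amplifying redundant information in multimodal interactions. It introduces a Multimodal Interaction Gate that converts unique interactions between modalities into redundant ones, so that shared information can compensate for an impaired or ambiguous modality.
Details
Motivation: Current vision language models hallucinate and lack robustness when a modality is ambiguous or corrupted. The authors hypothesize that exploiting the information shared between modalities to compensate for the impaired one can address these issues.
Result: Experiments suggest that increasing redundant interactions can reduce visually induced errors by 38.3% and improve consistency by 16.8%, a marked gain on the relevant benchmarks.
Insight: The novelty is to amplify redundancy between modalities actively, through a self-captioning workflow and the Multimodal Interaction Gate, rather than relying on redundancy that existing datasets may have eliminated. This offers a new approach to improving reliability under ambiguous or corrupted inputs.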
Abstract: Current vision language models face hallucination and robustness issues against ambiguous or corrupted modalities. We hypothesize that these issues can be addressed by exploiting the shared information between modalities to compensate for the impaired one. To this end, we analyze multimodal interactions – redundant (shared), unique (exclusive), and synergistic (emergent) task-relevant information provided by the modalities – to determine their impacts on model reliability. Specifically, amplifying redundant interactions would increase this exploitable shared information to resolve these issues; yet, modern instruction datasets often eliminate redundancies to prioritize visual grounding. We bridge this gap through a self-captioning workflow featuring a Multimodal Interaction Gate: a mechanism to convert unique interactions into redundant interactions. Our findings suggest that increasing redundancy can reduce visually induced errors by 38.3% and improve consistency by 16.8%.
[88] VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning cs.CV | cs.AIPDF
Zi-Yi Jia, Zi-Jian Cheng, Xin-Yue Zhang, Kun-Yang Yu, Zhi Zhou
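TL;DR: This paper presents VT-Bench, the first unified benchmark for standardizing visual-tabular multimodal learning. It covers 14 datasets across 9 domains (including medical, pets, media, and transportation) with more than 756K samples. The benchmark evaluates discriminative prediction and generative reasoning tasks on visual-tabular data; an evaluation of 23 representative models reveals substantial open challenges.
Details
Motivation: Multimodal learning has focused mainly on visual-text tasks, while visual-tabular data, which is critical in high-stakes domains such as healthcare and industry, remains underexplored. A unified benchmark is needed to advance research on visual-tabular multimodal learning.
Result: 23 models were evaluated on VT-Bench, including unimodal experts, specialized visual-tabular models, general-purpose vision-language models (VLMs), and tool-augmented methods. The results indicate that visual-tabular learning still faces substantial challenges; no specific quantitative results or state-of-the-art comparisons are reported.
Insight: The novelty is the first unified visual-tabular multimodal benchmark, covering multiple domains and large-scale samples and providing a standardized evaluation platform for future research. Viewed objectively, it helps fill a gap in visual-tabular learning and should encourage the community to build stronger multimodal foundation models.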
Abstract: Multi-modal learning has attracted great attention in visual-text tasks. However, visual-tabular data, which plays a pivotal role in high-stakes domains like healthcare and industry, remains underexplored. In this paper, we introduce VT-Bench, the first unified benchmark for standardizing visual-tabular discriminative prediction and generative reasoning tasks. VT-Bench aggregates 14 datasets across 9 domains (medical-centric, while covering pets, media, and transportation) with over 756K samples. We evaluate 23 representative models, including unimodal experts, specialized visual-tabular models, general-purpose vision-language models (VLMs), and tool-augmented methods, highlighting substantial challenges of visual-tabular learning. We believe VT-Bench will stimulate the community to build more powerful multi-modal visual-tabular foundation models. Benchmark: https://github.com/Ziyi-Jia990/VT-Bench
[89] LAGO: Language-Guided Adaptive Object-Region Focus for Zero-Shot Visual-Text Alignment cs.CV | cs.AIPDF
Junyi Hu, Qiji Zhou, Lei Zhang, Yue Zhang
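TL;DR: LAGO is a language-guided adaptive object-region focus framework for fine-grained recognition in zero-shot visual-text alignment. It first performs class-agnostic, object-centric candidate region discovery to obtain a stable visual initialization, then applies adaptive language-guided refinement whose semantic guidance strength is controlled by intermediate confidence, and finally combines object-level, contextual, and full-image evidence through dual-channel aggregation.
Details
Motivation: In zero-shot fine-grained recognition, the relevant evidence often lies in local regions rather than the full image. Existing methods rely on many random or redundant image crops, which raises inference cost and introduces weakly relevant candidates, and introducing semantic guidance too early can create an error-amplifying prediction loop.
Result: LAGO consistently achieves state-of-the-art performance on standard zero-shot benchmarks and challenging distribution-shift settings while requiring substantially fewer candidate regions at inference time.
Insight: The novelty is the combination of class-agnostic, object-centric candidate discovery with confidence-based adaptive language-guided refinement, plus an effective object-context dual-channel aggregation method, which avoids the prediction loop and improves efficiency and robustness.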
Abstract: Zero-shot recognition aims to classify an image by selecting the most compatible label description from a set of candidate classes without any task-specific supervision. In fine-grained settings, however, the relevant evidence often lies in localized parts, attributes, or textures rather than in the full image, making whole-image alignment suboptimal. Recent localized visual-text alignment methods address this by comparing class descriptions with multiple image regions, but they typically rely on large sets of random or redundant crops, increasing inference cost and introducing many highly redundant or weakly relevant candidates. Moreover, introducing semantic guidance too early can create an error-amplifying feedback process in which inaccurate intermediate predictions bias later localization and reinforce subsequent mistakes; we refer to this failure mode as the prediction loop. We propose LAGO (LAnguage-Guided adaptive Object-region focus), a framework for efficient and robust zero-shot localized visual-text alignment. LAGO first performs class-agnostic object-centric candidate discovery to obtain a stable visual initialization, and then applies adaptive language-guided refinement with the strength of semantic guidance controlled by intermediate confidence. It further combines object-level, contextual, and full-image evidence through an effective object-context dual-channel aggregation strategy. Extensive experiments show that LAGO consistently achieves state-of-the-art performance on standard zero-shot benchmarks and challenging distribution-shift settings, while requiring substantially fewer candidate regions at inference time.
[90] HY-Himmel Technical Report: Hierarchical Interleaved Multi-stream Motion Encoding for Long Video Understanding cs.CV | cs.AIPDF
Haopeng Jin, Hongzhu Yi, Wenlong Zhao, Jinwen Luo, Shani Ye
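TL;DR: HY-Himmel is a hierarchical video-language framework for long-video understanding that addresses bottlenecks of existing multimodal language models by processing semantics and motion separately. Sparse anchor I-frames are encoded by a large vision transformer (ViT) to capture scene and object semantics, while dense inter-frame motion is extracted by a lightweight compressed-domain tri-stream adapter from motion-vector maps, residual maps, and I-frame context and injected into the large language model (LLM).
Details
Motivation: The paper targets three bottlenecks of multimodal language models in long-video understanding: the high cost of decoding dense RGB frames, quadratic token growth as the frame count increases, and weak motion perception under sparse keyframe sampling.
Result: On Video-MME, HY-Himmel surpasses the dense 32-frame baseline by 2.3 percentage points (61.2% to 63.5%) while using 3.6x fewer context tokens. Extensive ablations support its effectiveness.
Insight: The core novelty is a hierarchical, decoupled encoding strategy: semantics (sparse, high-resolution anchor frames) and motion (dense, lightweight compressed-domain tri-stream encoding) are handled separately, and the aligned motion representation is injected into the LLM through a differentiable placeholder mechanism. This suggests a new architectural approach to efficient long-video understanding: modeling motion from compressed-domain video information such as motion vectors, which avoids the cost of full-frame decoding and dense tokenization.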
Abstract: Long-video understanding with multimodal language models suffers from three compounding bottlenecks: heavy decode cost to obtain dense RGB frames, quadratic token growth with frame count, and weak motion perception under sparse keyframe sampling. We present HY-Himmel, a hierarchical video-language framework that allocates semantic and motion capacity separately. A small set of sparse anchor I-frames is routed to the expensive host ViT to ground object identity and scene layout, while the far denser inter-frame intervals are encoded by a lightweight compressed-domain tri-stream adapter that distils motion evidence from motion-vector maps, residual maps, and I-frame context into aligned motion tokens. These tokens are injected into the LLM via a differentiable placeholder mechanism after a dedicated Stage-1 contrastive alignment that places the motion representation in a geometry compatible with the frozen visual backbone. On Video-MME, HY-Himmel surpasses the dense 32-frame baseline by +2.3 pp (61.2 to 63.5%) while using 3.6x fewer context tokens. Extensive ablations over stream composition, motion encoder family, fusion mode, alignment objective, anchor count, LoRA rank, and video duration confirm that the full tri-stream is necessary and sufficient for the observed gains.
[91] WATCH: Wide-Area Archaeological Site Tracking for Change Detection cs.CV | cs.AIPDF
Girmaw Abebe Tadesse, Titien Bartette, Andrew Hassanali, Allen Kim, Jonathan Chemla
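TL;DR: This paper presents WATCH, a framework for change detection and event localization at archaeological sites at monthly resolution. It combines three complementary scoring approaches: a training-free temporal embedding distance method, self-supervised change detection, and a weakly supervised temporal localization model. The study benchmarks on 1,943 sites in Afghanistan and assesses cross-regional generalization in several other countries. Unsupervised methods perform best, and pairing them with different foundation models yields accurate event localization, providing a scalable solution for large-scale cultural heritage monitoring.
Details
Motivation: Monitoring archaeological sites at scale is vital for protecting cultural heritage, but pinpointing when disturbances occur remains difficult because visual cues are subtle and ground-truth data are sparse. The paper addresses this problem.
Result: On the benchmark of 1,943 archaeological sites in Afghanistan, the unsupervised methods (TED and SSCD) outperform the weakly supervised method. TED with SatMAE achieves the highest exact-month recall (55% at m=0), while TED with GeoRSCLIP, CLIP, or Satlas-Pretrain reaches 92.5% recall within a three-month tolerance (m=3). Cross-regional generalization is assessed on sites in Syria, Turkey, Pakistan, and Egypt.
Insight: The novelty is a unified framework (WATCH) that combines several complementary scoring approaches for fine-grained, month-level change-event localization. Viewed objectively, its key contribution is a systematic comparison of foundation-model embeddings (such as CLIP, GeoRSCLIP, and SatMAE) for change detection, which reveals systematic differences in detection tendency between methods such as TED and SSCD (early warning versus after-the-fact confirmation) and so informs the choice of monitoring strategy.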
Abstract: Monitoring archaeological sites at scale is vital for protecting cultural heritage, yet pinpointing when disturbances occur remains difficult because visual cues are subtle and ground-truth data are sparse. We introduce WATCH, a framework for month-level change-event localization over PlanetScope satellite mosaics (2017-2024, 4.7 m/px) that supports three complementary scoring approaches: (i) Temporal Embedding Distance (TED), a training-free method that scores month-to-month deviations from a local temporal reference; (ii) Self-Supervised Change Detection (SSCD), an ensemble of reconstruction, forecasting, and latent-novelty signals; and (iii) a Weakly Supervised (WS) temporal localization model trained with sparse event-month labels. We benchmark WATCH on 1,943 archaeological sites in Afghanistan using embeddings from six foundation models (CLIP, GeoRSCLIP, SatMAE, Prithvi-EO-2.0, DINOv3, and Satlas-Pretrain) alongside a handcrafted spectral and texture baseline, and assess cross-regional generalization on sites in Syria, Turkey, Pakistan, and Egypt. The unsupervised approaches (TED, SSCD) consistently outperform the weakly supervised alternative. TED with SatMAE achieves the highest exact-month recall (55% at m=0), while TED with GeoRSCLIP, CLIP, or Satlas-Pretrain reaches 92.5% within a three-month tolerance (m=3). Handcrafted features remain competitive for exact-month detection under weak supervision. Our directional margin analysis reveals systematic temporal biases: SSCD paired with GeoRSCLIP or Prithvi-EO-2.0 exhibits the strongest early-warning profile, detecting anomalies before the recorded event, while TED favors confirmation-oriented detection after a change has materialized. These results show that satellite imagery combined with foundation-model embeddings enables scalable, decision-relevant heritage monitoring. Code: https://github.com/microsoft/WATCH
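The TED scoring and the tolerance-based recall metric can be sketched as below. The abstract says only that TED scores month-to-month deviations from a local temporal reference and that recall is reported at a tolerance of m months; taking the reference to be the mean of the preceding `window` embeddings and using cosine distance are assumptions made for this illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def ted_scores(embeddings, window=3):
    """Score each month by its distance from the mean embedding of the
    preceding `window` months (assumed form of the local reference)."""
    scores = [0.0] * len(embeddings)
    for t in range(1, len(embeddings)):
        history = embeddings[max(0, t - window):t]
        reference = [sum(col) / len(history) for col in zip(*history)]
        scores[t] = cosine_distance(embeddings[t], reference)
    return scores


def recall_at_tolerance(predicted, actual, m):
    """Fraction of sites whose predicted event month is within m months of the label."""
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= m)
    return hits / len(actual)


# Toy site: the monthly embedding changes abruptly at month index 4.
series = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 3
scores = ted_scores(series)
event_month = max(range(len(scores)), key=scores.__getitem__)
```

The gap between exact-month recall (m=0) and three-month recall (m=3) in the reported results corresponds to predictions that fall near, but not exactly on, the labeled month, which `recall_at_tolerance` makes explicit.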
[92] Augmented Equivariant Mesh Networks for Anatomical Segmentation cs.CV | cs.LGPDF
Daniel Saragih
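TL;DR: This paper presents EAMS (Equivariant Anatomical Mesh Segmentor), a lightweight framework built on Equivariant Mesh Neural Networks (EMNN) for robust semantic segmentation of anatomical meshes. It combines intrinsic mesh descriptors with anatomy-aware priors and augments message passing to provide global context, remaining stable under geometric perturbations such as rotation and resolution changes, and it supports vertex-, edge-, and face-level supervision.
Details
Motivation: Existing task-specific mesh and point-cloud methods are not equivariant and can degrade sharply under test-time geometric perturbations such as changes in patient pose or mesh resolution; for example, a 40-degree tilt lowers IoU by 25-26 points on intraoral scan segmentation. A model is needed that operates directly on irregular surface geometry and is robust to pose and resolution variation.
Result: On intracranial aneurysm and intraoral scan segmentation, EAMS variants are competitive with specialized baselines on unperturbed inputs while remaining stable under geometric perturbations; on liver surface segmentation, EAMS shows a favorable trade-off between canonical-pose accuracy and rotation robustness. The model has fewer than 2M parameters.
Insight: The main novelty is a lightweight equivariant framework (EAMS, built on EMNN) that achieves robustness to a range of geometric perturbations by combining intrinsic mesh descriptors, anatomy-aware priors (such as PCA-derived reference frames), and augmented message passing, without task-specific architectures. This offers a new approach to general, robust segmentation on irregular mesh data.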
Abstract: Anatomical mesh segmentation requires models that operate directly on irregular surface geometry while remaining robust to arbitrary patient pose and mesh resolution variation. Existing task-specific mesh and point-cloud methods are not equivariant, and can degrade sharply under test-time perturbation, for example dropping by 25-26 IoU points on intraoral scan segmentation at 40° tilt. We present EAMS, an Equivariant Anatomical Mesh Segmentor built on Equivariant Mesh Neural Networks (EMNN), and evaluate it across four clinically distinct tasks spanning edge-, vertex-, and face-level supervision. We combine intrinsic mesh descriptors with anatomy-aware priors, including PCA-derived frames for dental arches and liver surfaces, and augment message passing to provide lightweight global context. Across intracranial aneurysm and intraoral segmentation, EAMS variants are competitive with specialized baselines on unperturbed inputs while remaining stable under geometric perturbations, and on liver surfaces they expose a favorable trade-off between canonical-pose accuracy and rotation robustness. These results show that a lightweight (<2M parameters) equivariant framework can deliver robust anatomical mesh segmentation across diverse supervision types without task-specific architectures.
[93] KARMA-MV: A Benchmark for Causal Question Answering on Music Videos cs.CV | cs.AIPDF
Archishman Ghosh, Abhinaba Roy, Dorien Herremans
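TL;DR: This paper introduces KARMA-MV, a large-scale causal question-answering benchmark built on music videos, designed to evaluate how well models integrate temporal audio-visual cues and reason about visual-to-musical influence. The dataset contains 37,737 multiple-choice questions, generated and validated at scale using LLM reasoning. The paper also proposes a causal knowledge graph (CKG) approach that augments vision-language models (VLMs) with structured retrieval of cross-modal dependencies; experiments show that CKG improves performance, especially for smaller models.
Details
Motivation: Despite progress in video question answering and cross-modal understanding, causal reasoning about how visual dynamics drive musical structure in music videos remains under-explored, which calls for a dedicated dataset and method.
Result: Experiments with state-of-the-art VLMs and LLMs on KARMA-MV show that CKG grounding improves performance, with the largest gains for smaller models, supporting the value of explicit causal structure for music-video reasoning.
Insight: The contributions are scalable dataset generation and validation using LLMs, and a causal knowledge graph (CKG) that strengthens VLMs' retrieval of cross-modal dependencies, providing a new benchmark and methodology for causal audio-visual understanding beyond correlation.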
Abstract: While significant progress has been made in Video Question Answering and cross-modal understanding, causal reasoning about how visual dynamics drive musical structure in music videos remains under-explored. We introduce KARMA-MV, a large-scale multiple-choice QA dataset derived from 2,682 YouTube music videos, designed to test models’ ability to integrate temporal audio-visual cues and reason about visual-to-musical influence across reasoning, prediction, and counterfactual questions. Unlike traditional datasets requiring manual annotation, KARMA-MV leverages LLM reasoning for scalable generation and validation, yielding 37,737 MCQs. We propose a causal knowledge graph (CKG) approach that augments vision-language models (VLMs) with structured retrieval of cross-modal dependencies. Experiments on state-of-the-art VLMs and LLMs show consistent gains from CKG grounding – especially for smaller models – establishing the value of explicit causal structure for music-video reasoning. KARMA-MV provides a new benchmark for advancing causal audio-visual understanding beyond correlation.
[94] Text-Guided Multi-Scale Frequency Representation Adaptation cs.CV | cs.AI | cs.LGPDF
Weicai Yan, Xinhua Ma, Wang Lin, Tao Jin
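TL;DR: This paper proposes the Multi-Scale Frequency Adapter (FreqAdapter), a parameter-efficient fine-tuning method that integrates textual information and fine-tunes signals at multiple scales in the frequency domain. It targets two weaknesses of existing methods: information redundancy from operating in the signal space domain, and insufficient attention to the multi-scale nature of signals.
Details
Motivation: Existing parameter-efficient fine-tuning methods add only a few trainable parameters, but most operate in the signal space domain, which causes information redundancy, and use fixed prompts or adaptation layers that do not fully account for the multi-scale characteristics of signals.
Result: Extensive experiments on multimodal models, including CLIP and LLaVA, show that FreqAdapter significantly improves performance and efficiency, gaining accuracy at minimal cost and converging within one epoch.
Insight: The main novelty is moving fine-tuning from the signal space domain to the frequency domain and combining it with textual information for multi-scale optimization. The multi-scale adaptation strategy optimizes receptive fields across different frequency ranges and strengthens the model's representational capacity, giving a frequency-domain view of parameter-efficient fine-tuning.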
Abstract: Parameter-efficient fine-tuning methods introduce a small number of training parameters, enabling pre-trained models to adapt rapidly to new data distributions. While these methods have shown promising results, they exhibit notable limitations. First, most existing methods operate in the signal space domain, which results in substantial information redundancy. Second, most existing methods utilize fixed prompts or adaptation layers, failing to fully account for the multi-scale characteristics of signals. To address these challenges, we propose the Multi-Scale Frequency Adapter (FreqAdapter), which integrates textual information and performs multi-scale fine-tuning of signals in the frequency domain. Additionally, we introduce a multi-scale adaptation strategy to optimize receptive fields across different frequency ranges, further enhancing the model’s representational capacity. Extensive experiments on multimodal models, including CLIP and LLaVA, demonstrate that FreqAdapter significantly improves both performance and efficiency. FreqAdapter improves performance with minimal cost and fast convergence within one epoch. Code is available at https://github.com/Kelvin-ywc/FreqAdapter.
[95] Neuroscience-Inspired Analyses of Visual Interestingness in Multimodal Transformers cs.CV | cs.AIPDF
Mathis Immertreu, Fitim Abdullahu, Thomas Kinfe, Helmut Grabner, Patrick Krauss
TL;DR: This study applies neuroscience-inspired methods to the multimodal vision-language model Qwen3-VL-8B to test whether its internal representations encode human visual interestingness. Information correlated with human Common Interestingness (CI) scores can be linearly decoded from the model; the representation emerges in intermediate vision layers and strengthens across language layers, indicating that the model forms a structured encoding of visual interestingness without explicit supervision.
Details
Motivation: The aim is to determine whether modern transformer models encode principles of human interest or merely exploit large-scale correlations, a question that matters for understanding cognition and for ensuring responsible use of AI in communication and marketing.
Result: On Qwen3-VL-8B, using a Common Interestingness (CI) score defined from large-scale human engagement data on Flickr, CI information is linearly decodable from final-layer embeddings. Dimensionality reduction and Generalized Discrimination Value (GDV) analyses confirm that CI-related representations emerge in intermediate vision transformer layers and become progressively more distinguishable across language model layers.
Insight: The novelty is applying neuroscience analysis methods to a multimodal transformer, showing that the model forms a robust, structured encoding aligned with human visual interestingness without explicit supervision, and offering a new perspective on computational principles linking human brain dynamics and transformer architectures.
Abstract: Human attention is the gateway to conscious perception, memory and decision-making. However, its role in modern transformer models remains largely unexplored. As these systems increasingly influence what people see, prefer and buy, the question arises as to whether they encode principles of human interest or merely exploit large-scale correlations. Addressing this issue is crucial for understanding cognition and ensuring the responsible use of AI in communication and marketing. In order to address this issue, the concept of visual interest was examined within the multimodal vision-language model Qwen3-VL-8B, using a pre-defined Common Interestingness (CI) score derived from large-scale human engagement data on the photo-sharing platform Flickr. Here, we analyzed internal representations across vision and language components using methods from the neurosciences. Our analyses revealed that CI information is linearly decodable from final-layer embeddings, indicating that it is aligned with human-derived measures of visual interestingness. Dimensionality reduction and Generalized Discrimination Value (GDV) analyses demonstrate that CI-related hidden representations emerge in intermediate vision transformer layers and become progressively more distinguishable across language model layers. Concept vectors derived using geometric, probe, and Sparse Auto-Encoder based methods converge in higher layers, as confirmed by representational similarity analysis. This indicates a robust and structured encoding of visual interestingness without explicit supervision. Future work will seek to identify shared computational principles linking human brain dynamics and transformer architectures, with the ultimate goal of uncovering the organizing mechanisms that give rise to attention and interest in both biological and artificial systems.
[96] Survey on Disaster Management Datasets for Remote Sensing Based Emergency Applications cs.CVPDF
Alain P. Ndigande, Josiah Wiggins, Sedat Ozer
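TL;DR: This paper is a survey of remote sensing datasets for disaster management. It reviews publicly available image datasets that support machine learning and deep learning applications across the pre-disaster, during-disaster, and post-disaster phases, and aims to serve as a centralized reference for researchers and practitioners.
Details
Motivation: With natural disasters occurring frequently, data-driven approaches to disaster management are urgently needed, and successful machine learning or deep learning applications depend on the availability of high-quality annotated datasets. A systematic catalogue of relevant datasets is therefore needed to support rapid solution development.
Result: The paper reports no specific quantitative results; as a survey, it systematically summarizes existing public datasets and provides a resource index for computer vision and remote sensing tasks in disaster management.
Insight: The novelty is the first systematic survey of remote sensing image datasets spanning the full disaster management cycle, which emphasizes the central role of datasets in disaster response solutions and provides a practical guide to data resources for the field.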
Abstract: Recent natural disasters have highlighted the urgent need for efficient data-driven approaches to disaster management. Machine learning (ML) and deep learning (DL) techniques have shown considerable promise in enhancing the key phases of disaster management including mitigation, preparedness, detection, response, and recovery. A critical enabler of successful ML or DL based applications in remote sensing, however, is the accessibility and quality of annotated datasets. With the growing availability of high-resolution imagery from unmanned aerial vehicles (UAVs) and satellites, computer vision and remote sensing algorithms have become essential tools for rapid detection, situational assessment, and decision-making in disaster scenarios. This survey provides a comprehensive overview of publicly available image-based datasets relevant to ML/DL-based disaster management pipelines. Emphasis is placed on datasets that support computer vision and remote sensing tasks across all phases of disaster events including pre-disaster, during, and post-disaster. The goal of this work is to serve as a centralized reference for researchers and practitioners seeking high-quality datasets for rapid development and deployment of remote sensing-driven disaster response solutions.
[97] A Breast Vision Pathology Foundation Model for Real-world Clinical Utility cs.CVPDF
Yingxue Xu, Zhengyu Zhang, Xiuming Zhang, Mengwei Xu, Fengtao Zhou
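TL;DR: This paper presents BRAVE, a breast cancer pathology foundation model developed and evaluated on 101,638 breast whole-slide images from 32 sources across Asia, Europe, and North America. BRAVE is assessed on 34 tasks in 82 cohorts spanning pre-operative biopsy, intra-operative frozen section, and post-operative resection, through an evidence chain of retrospective benchmarking, clinically challenging scenarios, workflow-oriented clinical impact simulations, prospective observational validation, and crossover pathologist-AI interaction studies. The results indicate that BRAVE can play practical roles in the clinical workflow, such as safely excluding low-risk cases, rescuing initially missed positives through AI-assisted second review, and prioritizing cases.
Details
Motivation: Pathology foundation models perform well in retrospective studies, but whether they can support clinical practice remains unclear. Pathological assessment is the gold standard for breast cancer diagnosis and guides treatment planning, surgical decision-making, and risk stratification, so a model with utility in real clinical settings is important.
Result: In prospective validation, BRAVE excluded 76.9% of negative biopsy cases (negative predictive value, NPV, 0.953) and 70.1% of negative frozen-section cases (NPV 0.973), and triaged 78.8% of post-operative subtyping cases as high-confidence clear-cut cases (NPV 1.000). In reader studies, AI assistance raised balanced accuracy from 88.5% to 95.1% (odds ratio, OR, 3.14, P<0.001) and improved efficiency, confidence, and inter-rater agreement. BRAVE-derived scores also independently predicted disease-free survival (adjusted hazard ratio, HR, 4.79, P<0.001) and overall survival (adjusted HR 8.14, P<0.001).
Insight: The novelty is a breast-specific pathology foundation model (BRAVE) adapted to real-world clinical workflows and rigorously evaluated through a comprehensive, multi-stage evidence chain, from retrospective to prospective validation and then to human-AI interaction studies. This demonstrates practical utility and generalization for clinical decision support (case triage, assisted diagnosis, and prognosis prediction), going beyond evaluation paradigms that focus on model performance alone.
Abstract: Pathology foundation models have shown strong retrospective performance, but whether such systems can support clinically relevant use remains unclear. This challenge is particularly important in breast cancer, where pathological assessment serves as the gold standard for diagnosis and guides treatment planning, surgical decision-making and risk stratification across pre-, intra- and post-operative stages. Here we present BRAVE, a breast-adaptive pathology foundation model developed and evaluated using a total resource of 101,638 breast whole-slide images from 32 sources across Asia, Europe and North America. We assessed BRAVE across 34 tasks in 82 cohorts spanning pre-operative biopsy, intra-operative frozen section and post-operative resection, using an evidence chain comprising retrospective benchmarking, clinically challenging scenarios, workflow-oriented clinical impact simulations, prospective observational validation with the thresholds locked in the retrospective cohorts and crossover pathologist-AI interaction studies. Across these settings, BRAVE supported practical roles in the clinical workflow, including safe exclusion of low-risk cases from routine review, AI-assisted second-review rescue of initially missed positives and prioritization of cases for further assessment. In prospective validation across three centres, BRAVE excluded 76.9% of negative biopsy cases (NPV 0.953) and 70.1% of negative frozen-section cases (NPV 0.973), and triaged 78.8% of post-operative subtyping cases as high-confidence clear-cut cases (NPV 1.000). In reader studies, AI assistance improved balanced accuracy from 88.5% to 95.1% (OR 3.14, P<0.001), with better efficiency, confidence and inter-rater agreement. BRAVE-derived scores also independently predicted disease-free survival (adjusted HR 4.79, P<0.001) and overall survival (adjusted HR 8.14, P<0.001).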
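The two rule-out figures quoted for each setting measure different things: the exclusion rate is taken over all truly negative cases, while NPV is taken over the cases the model ruled out. A minimal Python sketch of both follows; the function names and all counts are hypothetical, chosen only so that the toy numbers land near the reported biopsy figures, and are not data from the paper.

```python
def exclusion_rate(excluded_true_negatives, total_negatives):
    """Share of all truly negative cases that the model safely ruled out."""
    return excluded_true_negatives / total_negatives


def negative_predictive_value(excluded_true_negatives, excluded_false_negatives):
    """Share of ruled-out cases that were truly negative (NPV = TN / (TN + FN))."""
    ruled_out = excluded_true_negatives + excluded_false_negatives
    return excluded_true_negatives / ruled_out


# Hypothetical toy cohort: 1000 truly negative cases, of which 769 are ruled out,
# plus 38 positive cases that were ruled out by mistake.
rate = exclusion_rate(769, 1000)
npv = negative_predictive_value(769, 38)
print(f"exclusion rate = {rate:.3f}, NPV = {npv:.3f}")
```

A high exclusion rate with a low NPV would mean many missed positives among the ruled-out cases, which is why the two numbers are reported together.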
[98] Low-Cost Stereo Vision for Robust 3D Positioning of Thin Radiata Pine Branches in Autonomous Drone Pruning cs.CVPDF
Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green
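TL;DR: This paper studies robust 3D positioning of thin branches (10 mm in diameter) for autonomous drone pruning using a low-cost stereo camera. It compares several segmentation models (Mask R-CNN, YOLOv8/YOLOv9) and depth estimation methods (traditional SGBM and several deep learning models), and proposes a pipeline combining segmentation masks, disparity maps, centroid-based triangulation, and Median-Absolute-Deviation outlier rejection to handle the sparse texture, thin structures, and noisy disparity of forest scenes.
Details
Motivation: Manual pruning of radiata pine is hazardous and labour-intensive and faces workforce shortages, while existing autonomous pruning platforms rely on expensive sensors such as LiDAR and handle only thick branches. The paper asks whether a low-cost stereo camera can provide sufficiently accurate detection and 3D positioning for pruning thin branches.
Result: In qualitative evaluations at distances of 1-2 m, learning-based stereo matching methods (such as PSMNet and ACVNet) produce more coherent depth estimates than the traditional method (SGBM); cross-dataset fine-tuning experiments expose the domain gap between urban driving benchmarks and natural forestry scenes.
Insight: The main novelty is coupling stereo segmentation with a centroid-based triangulation algorithm and Median-Absolute-Deviation outlier rejection, converting a segmentation mask and disparity map into a single robust branch-to-camera distance that copes with the typical difficulties of forest scenes. The framework is designed to stay compatible with newer YOLO releases, which supports extensibility.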
Abstract: Manual pruning of radiata pine, a species of major economic importance to New Zealand forestry, is hazardous, labour-intensive, and increasingly constrained by workforce shortages. Existing autonomous pruning platforms typically rely on expensive sensors such as LiDAR and are limited to thick branches, which restricts their wider adoption. This paper investigates whether a single low-cost stereo camera mounted on a drone can provide sufficiently accurate branch detection and three-dimensional positioning to support autonomous pruning of branches as thin as 10 mm, thereby removing the need for auxiliary depth sensors. The proposed pipeline comprises two stages: branch segmentation and depth estimation. For segmentation, Mask R-CNN variants and the YOLOv8 and YOLOv9 families are compared on a custom dataset of 71 stereo image pairs captured with a ZED Mini camera; YOLOv8 and YOLOv9 are selected as representative state-of-the-art real-time segmentors at the time of data collection, and the framework is designed to remain compatible with newer YOLO releases. For depth estimation, a traditional method (SGBM with WLS filtering) and deep-learning-based methods (PSMNet, ACVNet, GWCNet, MobileStereoNet, RAFT-Stereo, and NeRF-Supervised Deep Stereo) are evaluated, including cross-dataset fine-tuning experiments that expose the domain gap between urban driving benchmarks and natural forestry scenes. The main novelty of this work lies in coupling stereo segmentation with a centroid-based triangulation algorithm and Median-Absolute-Deviation outlier rejection that converts a segmentation mask and disparity map into a single robust branch-to-camera distance, addressing the challenges of sparse texture, thin structures, and noisy disparity values typical of forest scenes. Qualitative evaluations at distances of 1-2 m show that the learning-based stereo methods produce more coherent depth es…
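The step from a segmentation mask and disparity map to one branch-to-camera distance can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it assumes the disparities inside the mask are already collected into a list, uses the median of the inliers in place of the paper's centroid-based sampling (whose details are not given here), and the function name, the cutoff `k`, the focal length, and the baseline are all assumed values.

```python
import statistics


def branch_distance(disparities, focal_px, baseline_m, k=3.0):
    """Estimate distance (metres) from disparities sampled inside a branch mask.

    Outliers are rejected with the Median Absolute Deviation (MAD); depth then
    follows from stereo triangulation, Z = focal_px * baseline_m / disparity.
    """
    valid = [d for d in disparities if d > 0]  # non-positive disparity is invalid
    if not valid:
        return None
    med = statistics.median(valid)
    mad = statistics.median([abs(d - med) for d in valid])
    # 1.4826 scales MAD to a standard-deviation equivalent for Gaussian noise.
    cutoff = k * 1.4826 * mad
    inliers = [d for d in valid if abs(d - med) <= cutoff]
    return focal_px * baseline_m / statistics.median(inliers)


# Hypothetical mask samples: five consistent values plus two gross outliers.
samples = [40.0, 41.0, 39.0, 40.5, 39.5, 5.0, 120.0]
distance = branch_distance(samples, focal_px=700.0, baseline_m=0.063)
print(f"{distance:.3f} m")
```

Using the median twice keeps the estimate stable when a thin branch covers only a few pixels and background disparities leak into the mask.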
[99] Test-Time Training for Visual Foresight Vision-Language-Action Models cs.CV | cs.LG | cs.ROPDF
Sangwu Park, Wonjoong Kim, Yeonjun In, Sein Kim, Hongseok Kang
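TL;DR: This paper proposes Test-Time Training Visual Foresight VLA ($T^3$VF), a method that addresses the fragility of visual foresight vision-language-action models on out-of-distribution data. It performs test-time training on the natural supervision pairs formed by each predicted future image and the observation that subsequently arrives, and adds an adaptive update filtering mechanism to handle the problems caused by indiscriminate updates.
Details
Motivation: Visual foresight VLA models have become a prominent architecture because of their strong performance, but their design makes them especially sensitive to out-of-distribution (OOD) shifts: action quality depends directly on the accuracy of the predicted future visual information, so OOD conditions affect both stages at once.
Result: Experiments show that $T^3$VF mitigates the OOD vulnerability of VF-VLA at a modest additional inference cost, without any architectural modification or auxiliary modules.
Insight: The novelty is using the natural pairing of predicted images and subsequent observations for test-time training, together with an adaptive filtering mechanism that controls which updates are applied, offering a way to improve the robustness of VLA models in dynamic environments.
Abstract: Visual Foresight VLA (VF-VLA) has become a prominent architectural choice in recent VLA research due to its impressive performance. Nevertheless, the inherent design of VF-VLA makes it particularly vulnerable to out-of-distribution (OOD) shifts. Because the quality of action directly depends on the accuracy of the predicted future visual information, OOD conditions affect both stages at once. To address this vulnerability, we propose Test-Time Training Visual Foresight VLA ($T^3$VF), a test-time training approach motivated by the observation that the predicted future image and its subsequent observation form a natural supervision pair. To further address the practical challenges that arise from indiscriminate test-time updates, we introduce an adaptive update filtering mechanism. Empirically, $T^3$VF mitigates the OOD vulnerability of VF-VLA at a modest additional inference cost, without requiring any architectural modification or auxiliary modules.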
[100] From Historical Tabular Image to Knowledge Graphs: A Provenance-Aware Modular Pipeline cs.CV | cs.AI | cs.IRPDF
Sarah Binta Alam Shoilee, Victor de Boer, Jacco van Ossenbruggen, Susan Legêne
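TL;DR: This paper proposes a modular, provenance-aware pipeline for converting handwritten historical tabular images into knowledge graphs with support for human-AI collaboration. The pipeline decomposes the workflow into three stages (table reconstruction, information extraction, and knowledge graph construction) and exposes intermediate representations for inspection, evaluation, and correction.
Details
Motivation: Handwritten archival tables contain rich historical information, but converting them into structured representations such as knowledge graphs requires integrating table structure recognition, handwriting recognition, and semantic interpretation, a complex multimodal process. End-to-end AI implementations can obscure these steps, producing opaque algorithmic operations that hinder human oversight, critical assessment, and trust.
Result: The pipeline is demonstrated through a series of experiments on real-world archival material concerning military careers. Results across three table reconstruction variants highlight the importance of modularisation.
Insight: The main novelty is the systematic combination of modularity and data provenance, ensuring that all extracted entities and literals remain traceable to their visual and textual origins, which advances transparent and collaboratively controllable image-to-KG pipelines for complex historical data.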
Abstract: Handwritten archival tables contain rich historical information, yet transforming them into structured representations, such as Knowledge Graphs, requires integrating table structure recognition, handwriting recognition, and semantic interpretation - a complex multimodal process. End-to-end AI implementations can obscure these steps, resulting in opaque algorithmic operations that hinder human oversight, critical assessment, and trust. To address this, we present a modular, provenance-aware pipeline to convert handwritten tabular images into KGs supporting human-AI collaboration. The pipeline decomposes the workflow into three stages - table reconstruction, information extraction, and KG construction - while exposing intermediate representations for inspection, evaluation, and correction. A key contribution of our approach is the systematic integration of data provenance at every stage, ensuring that all extracted entities and literals remain traceable to their visual and textual origins. The proposed pipeline is demonstrated through a number of experiments on real-world archival material concerning military careers. The results across three different table reconstruction variants highlight the importance of modularisation. By coupling modularity with data provenance, our work advances transparent and collaboratively controllable image-to-KG pipelines for complex historical data.
[101] SPECTRA-Net: Scalable Pipeline for Explainable Cross-domain Tensor Representations for AI-generated Images Detection cs.CVPDF
Sarra Arab, Anfal Achouri, Seif Eddine Bouziane
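TL;DR: This paper proposes SPECTRA-Net, a scalable pipeline for explainable, cross-domain tensor representations for detecting AI-generated images. The method fuses multi-view representations: global semantic features from a vision foundation model, spectral analysis, local patch-based anomaly detection, and statistical descriptors. It achieves state-of-the-art detection performance on several datasets and provides explainable artifact localization.
Details
Motivation: The rapid proliferation of AI-generated images poses a significant challenge to digital information integrity, and existing detection models struggle to keep pace with increasingly sophisticated generative models, so robust, real-time detection systems are needed.
Result: SPECTRA-Net reaches state-of-the-art performance in both in-domain and cross-domain settings on challenging datasets including WildFake, Chameleon, and RRDataset, showing high accuracy and strong generalization.
Insight: The novelty is a multi-view fusion framework for representation learning that combines complementary global semantics, spectral features, local anomalies, and statistical descriptors, and that yields explainable detection results such as artifact localization, supporting trustworthy content verification systems.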
Abstract: The rapid proliferation of AI-generated images (AIGI) presents a significant challenge to digital information integrity. While human observers and existing detection models struggle to keep pace with the increasing sophistication of generative models, the need for robust, real-time detection systems has become critical. This paper introduces SPECTRA-Net, a scalable pipeline for explainable, cross-domain tensor representations for AIGI detection. Our approach leverages a multi-view representation of images, combining global semantic features from a Vision Foundation Model (VFM), spectral analysis, local patch-based anomaly detection, and statistical descriptors. By fusing these complementary data streams, SPECTRA-Net achieves state-of-the-art performance in both in-domain and cross-domain settings, demonstrating high accuracy and generalization capabilities across a wide range of challenging datasets, including WildFake, Chameleon, and RRDataset. The proposed pipeline not only provides a robust solution for AIGI detection but also offers explainability through artifact localization, paving the way for more trustworthy and reliable content verification in real-world applications.
[102] TinySSL: Distilled Self-Supervised Pretraining for Sub-Megabyte MCU Models cs.CV | cs.AIPDF
Bibin Wilson
TL;DR: This paper presents TinySSL, a distilled self-supervised pretraining framework for microcontroller (MCU)-class models with fewer than 500K parameters. Through Capacity-Aware Distilled Self-Supervised Learning (CA-DSSL), it addresses projection head dominance, representation bottleneck, and augmentation sensitivity in small-model self-supervised learning, and achieves performance better than or comparable to existing methods on tasks such as CIFAR-100 and Pascal VOC.
Details
Motivation: Self-supervised learning (SSL) has succeeded on large models but remains unexplored for MCU-class models with fewer than 500K parameters. The paper targets three main obstacles at this scale: projection head dominance, representation bottleneck, and augmentation sensitivity.
Result: On CIFAR-100, with a MobileNetV2-0.35 backbone (396K parameters), CA-DSSL reaches 62.7% linear-probe accuracy, 18 percentage points above SimCLR-Tiny and on par with SEED (61.7%) while using 10× fewer projection parameters (426K vs. 3.15M), and attains 94.0% of a supervised upper bound. On Pascal VOC detection, CA-DSSL achieves 2.3× the mAP of random initialization and 3 percentage points more than SEED. The deployed model occupies only 378 KB (INT8) with no inference overhead.
Insight: The novelty is the Capacity-Aware Distilled Self-Supervised Learning (CA-DSSL) framework, which combines asymmetric distillation, multi-scale feature distillation, and a progressive augmentation curriculum. Its advantage is most pronounced in small-data regimes such as CIFAR-100, offering an approach to effective self-supervised pretraining for MCU-class models.
Abstract: Self-supervised learning (SSL) has transformed representation learning for large models, yet remains unexplored for microcontroller (MCU)-class models with fewer than 500K parameters. We identify three obstacles at this scale – projection head dominance, representation bottleneck, and augmentation sensitivity – and propose Capacity-Aware Distilled Self-Supervised Learning (CA-DSSL), a teacher-guided framework that overcomes them without labels or text supervision. CA-DSSL combines asymmetric distillation from a frozen DINO ViT-S/16 teacher, multi-scale feature distillation for spatial representations, and a progressive augmentation curriculum. On a MobileNetV2-0.35 backbone (396K parameters) pretrained on CIFAR-100, CA-DSSL reaches 62.7 ± 0.5% linear-probe accuracy (3-seed mean) – surpassing SimCLR-Tiny by 18 pp, matching SEED (61.7%) with 10× fewer projection parameters (426K vs. 3.15M), and reaching 94.0% of a supervised upper bound. Standard SSL methods (BYOL-Tiny, DINO-Tiny) collapse entirely at this scale. On Pascal VOC detection, CA-DSSL achieves 2.3× the mAP of random initialization and +3 pp over SEED, though SimCLR-Tiny matches CA-DSSL on detection mAP. The deployed backbone occupies 378 KB (INT8) with no inference overhead from pretraining. Preliminary ImageNet-100 experiments reveal that CA-DSSL’s advantage is specific to small-data regimes; scaling to ImageNet-1K is discussed as future work.
[103] When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models cs.CV | cs.AIPDF
Harshvardhan Saini, Samyak Jha, Yiming Tang, Dianbo Liu
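TL;DR: This paper investigates the root cause of hallucination in decoder-based vision-language models (VLMs), that is, confidently describing content not present in the input, and traces it to geometric over-alignment: to bridge the modality gap required by attention mechanisms, the models over-align visual embeddings with the text manifold, injecting a statistical linguistic bias that overshadows fine-grained visual evidence. The authors give the first quantitative characterization of this over-alignment and propose two complementary remedies, a training-free inference strategy and a bias-aware fine-tuning paradigm, both of which reduce hallucination by explicitly projecting the bias subspace out of visual representations.
Details
Motivation: VLMs hallucinate frequently in high-stakes applications. The paper seeks the underlying mechanism of hallucination in decoder-based VLMs rather than merely suppressing hallucinations with black-box decoding strategies.
Result: The methods significantly reduce hallucinations on the POPE, CHAIR, and AMBER benchmarks and improve CLAIR scores on long-form captioning, with the training-free variant adding no computational overhead over the base model.
Insight: The novelty is the first geometric quantification of over-alignment in VLMs, together with the finding that linguistic bias concentrates in the top principal components of a universal, dataset-agnostic text subspace. The two proposed methods (training-free inference and bias-aware fine-tuning) intervene directly on this geometric cause, giving efficient and interpretable remedies.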
Abstract: Vision-Language Models (VLMs) increasingly power high-stakes applications, from medical imaging to autonomous systems, yet they routinely hallucinate, confidently describing content not present in the input. We investigate the root causes of these failure modes with a mechanistic analysis focusing on the decoder-based VLMs. We trace these failure modes to a geometric over-alignment: to bridge the modality gap required by attention mechanisms, decoder-based VLMs over-align visual embeddings with the text manifold, injecting a statistical linguistic bias that systematically overshadows fine-grained visual evidence. While prior work either aggressively closes this gap or suppresses hallucinations through expensive black-box decoding strategies, none addresses the underlying geometric cause. We provide the first quantitative characterization of this over-alignment, demonstrating that linguistic bias concentrates in the top principal components of a universal, dataset-agnostic text subspace. Building on this insight, we propose two complementary remedies: a training-free inference strategy and a bias-aware fine-tuning paradigm, both of which explicitly project out this subspace from visual representations. Our methods significantly reduce hallucinations across POPE, CHAIR, and AMBER benchmarks, and improve CLAIR scores on long-form captioning tasks, with the training-free variant adding no computational overhead over the base model.
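Projecting a subspace out of a visual representation can be illustrated with a few lines of Python. This is a generic linear-algebra sketch, not the authors' implementation: it assumes the bias directions are already available as orthonormal vectors (for instance, top principal components of text embeddings computed offline), and the names `project_out` and `bias_basis` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(x, bias_basis):
    """Return x minus its components along each orthonormal bias direction.

    For orthonormal b_1..b_k this computes x - sum_i (x . b_i) b_i, so the
    result is orthogonal to the whole bias subspace.
    """
    out = list(x)
    for b in bias_basis:
        coeff = dot(x, b)
        out = [o - coeff * bi for o, bi in zip(out, b)]
    return out


# Toy 3-D visual embedding and a one-dimensional bias subspace (assumed values).
visual = [3.0, 4.0, 5.0]
bias_basis = [[1.0, 0.0, 0.0]]
debiased = project_out(visual, bias_basis)
print(debiased)  # the first coordinate is removed, the others are untouched
```

Because the projection is a fixed linear map, it could in principle be folded into an existing projection layer, which is consistent with the abstract's claim of no added overhead for the training-free variant.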
[104] Smart Railway Obstruction Detection System using IoT and Computer Vision cs.CV | cs.CR | cs.LGPDF
Pravin Kumar, Mritunjay Shall Peelam, Ramakant Kumar, Sanjay Kumar, Vinay Chamola
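TL;DR: This paper proposes NETRA, a smart railway obstruction detection system that combines IoT and computer vision. Deployed on Raspberry Pi Zero W and Raspberry Pi 4 edge platforms, it uses probabilistic sensor fusion (a PIR motion sensor and an HC-SR04 ultrasonic distance sensor) for event-driven camera activation and runs edge-AI classification with MobileNet-SSD or YOLOv5 ONNX to detect railway intrusion threats (humans, large animals, and track obstructions) in real time, at low cost and with high accuracy, sending alerts to the locomotive driver over LoRa within 2.4 seconds.
Details
Motivation: Indian Railways faces track intrusion safety challenges, including wildlife incursions and malicious obstructions. Existing solutions such as the Gajraj system are costly (USD 1000/km), have high false alarm rates, and are deployed only in limited areas, so a cost-effective, standalone real-time detection system is needed.
Result: Across 113 motion events, probabilistic fusion achieved 95% detection accuracy with zero false alarms, compared with 85% for binary methods. Raspberry Pi 4 with YOLOv5 reached an 83.5% F1-score on elephant detection, a 5.6x improvement over the heuristic approach on Pi Zero. LoRa communication achieved 100% packet delivery over 1-2 km in field trials. Deployment cost falls by 75% (USD 247/km vs. USD 1000/km for Gajraj).
Insight: The contributions include event-driven camera activation through probabilistic sensor fusion with a tunable threshold (tau_c = 0.65), which cuts unnecessary visual processing by 52%; the integration of edge-AI classifiers (MobileNet-SSD/YOLOv5) with LoRa communication for 2.4-second end-to-end alerts; and unified low-cost detection of both wildlife and obstruction threats, which improves deployment feasibility and real-time operation.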
Abstract: Railway track intrusions pose a critical safety challenge for Indian Railways, encompassing wildlife incursions and deliberate malicious obstructions. The December 2025 collision in Assam, in which seven elephants were killed by the Rajdhani Express, underscores the urgency of effective real-time detection. Existing solutions such as the optical fiber-based Gajraj system suffer from prohibitive costs ($1000/km) and high false alarm rates, limiting deployment to only 20 of India’s 101 elephant corridors. This paper proposes NETRA, a cost-effective, internet-independent intrusion detection system deployed on Raspberry Pi Zero W and Raspberry Pi 4 edge platforms. NETRA employs probabilistic sensor fusion integrating a PIR motion sensor and an HC-SR04 ultrasonic distance sensor with a tunable threshold (tau_c = 0.65), enabling event-driven camera activation that reduces unnecessary visual processing by 52%. Upon confirmed intrusion, edge-AI classification using MobileNet-SSD (Pi Zero) or YOLOv5 ONNX (Pi 4) identifies threats including humans, large animals, and track obstructions. Confirmed threats are transmitted via LoRa (868 MHz) to alert the locomotive driver within 2.4 seconds end-to-end. Experimental evaluation across 113 motion events demonstrated 95% detection accuracy with zero false alarms through probabilistic fusion, compared to 85% for binary methods. Raspberry Pi 4 with YOLOv5 achieved 83.5% elephant F1-score, a 5.6x improvement over Pi Zero’s heuristic approach (14.8%). LoRa communication achieved 100% packet delivery across 1-2 km in field trials. NETRA reduces deployment cost by 75% ($247/km vs $1000/km for Gajraj) while providing unified detection of both wildlife and obstruction threats.
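The idea of gating the camera on a fused confidence score can be sketched as below. The abstract states only the threshold (tau_c = 0.65), not the fusion rule, so the weighted average, the weights, and the function names here are assumptions made for illustration rather than NETRA's actual method.

```python
TAU_C = 0.65  # camera-activation threshold reported in the abstract


def fused_confidence(p_pir, p_ultrasonic, w_pir=0.6, w_ultrasonic=0.4):
    """Combine two per-sensor intrusion probabilities into one score.

    The weighted average and its weights are hypothetical; the paper's own
    fusion rule is not described in the abstract.
    """
    return w_pir * p_pir + w_ultrasonic * p_ultrasonic


def should_activate_camera(p_pir, p_ultrasonic, threshold=TAU_C):
    """Wake the camera only when the fused score reaches the threshold."""
    return fused_confidence(p_pir, p_ultrasonic) >= threshold


# Both sensors agree: the camera is activated.
print(should_activate_camera(0.9, 0.8))
# Motion detected but nothing in ultrasonic range: the camera stays off.
print(should_activate_camera(0.9, 0.1))
```

A binary rule would fire on either sensor alone; requiring a fused score above the threshold is one way such a system can skip camera activations that a single noisy sensor would trigger.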
[105] Dimensional Coactivation for Representational Consistency in Frozen Vision Foundation Models cs.CV | eess.IV | eess.SPPDF
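Izaldein Al-Zyoud, Abdulmotaleb El Saddik
TL;DR: This paper proposes Dimensional Coactivation (DCA), a method for measuring the consistency of representations within a single input sample in frozen vision foundation models such as DINOv3. It evaluates representational coherence by checking whether the same feature dimensions activate together across semantic subregions (such as eyes, nose, and mouth), and is applied to deepfake detection, where it achieves high AUC on CelebDF-v2 and DFD.
Details
Motivation: The goal is to determine whether a frozen vision foundation model keeps its learned coordinate system internally coherent within a single input image (the question of representational consistency) and to develop a suitable instrument for measuring this.
Result: On the deepfake detection validation task with frozen DINOv3 features, DCA reaches 0.9106 AUC on CelebDF-v2 and 0.9289 AUC on DFD under FF++ c23 cross-dataset transfer. Ablations show that reintroducing centering, L2 normalization, or cross-dimension coupling degrades performance, severely in the case of centering and cross-dimension coupling, supporting the DCA design.
Insight: The novelty is DCA, an instrument built specifically for measuring intra-sample representational consistency in frozen models. It avoids the alignment and normalization operations of classical similarity measures and emphasizes raw magnitude signals and per-dimension analysis in a fixed coordinate system, giving a new view of the internal representational structure of foundation models and showing practical value in tasks such as deepfake detection.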
Abstract: Frozen vision foundation models do not merely extract features; they organize images through a learned coordinate system. We ask whether that coordinate system remains internally coherent within a single input. This leads to Representational Consistency: the study of whether a frozen foundation model represents one sample coherently across its semantic subregions. We introduce Dimensional Coactivation (DCA), a per-dimension instrument for measuring this coherence. DCA compares semantic regions by asking whether the same feature dimensions coactivate across them. Unlike classical similarity measures, it deliberately avoids centering, L2 normalization, and full Gram coupling. These operations are useful when comparing different models or distributions, but they are mismatched to the intra-sample setting, where the coordinate system is fixed and raw magnitude carries signal. Deepfake detection provides a natural validation task. Synthetic faces may reproduce plausible eyes, noses, and mouths while breaking the representational structure that links those regions in real faces. Using frozen DINOv3 features, DCA exposes this break: an eyes-mouth-nose fingerprint achieves 0.9106 AUC on CelebDF-v2 and 0.9289 on DFD under FF++ c23 cross-dataset transfer. The design is also sharply validated by ablation: reintroducing centering collapses CelebDF-v2 AUC to 0.459, L2 normalization reduces it to 0.862, and cross-dimension coupling reduces it to 0.478. Finally, replacing DINOv3 with FaRL collapses CelebDF-v2 AUC to 0.582. DCA therefore depends on a stable per-dimension coordinate system, not on region extraction alone. These results position DCA as an instrument for measuring intra-sample representational coherence in frozen foundation models, with deepfake detection as the first validation task.
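A per-dimension comparison of regions without centering, L2 normalization, or cross-dimension coupling might look like the sketch below. The exact DCA formula is not given in the abstract, so the elementwise product used here is an assumed stand-in, and the function name and toy feature values are hypothetical.

```python
from itertools import combinations


def coactivation_fingerprint(region_features):
    """Concatenate per-dimension products for every pair of regions.

    Each region is a raw feature vector (for example, pooled patch features).
    Dimension i of one region is only ever compared with dimension i of the
    other, so no cross-dimension (Gram-style) coupling is introduced, and the
    raw magnitudes are kept because nothing is centered or normalized.
    """
    names = sorted(region_features)  # fixed pair order for reproducibility
    fingerprint = []
    for a, b in combinations(names, 2):
        fingerprint.extend(
            x * y for x, y in zip(region_features[a], region_features[b])
        )
    return fingerprint


# Toy 3-D features for three facial regions (assumed values).
regions = {
    "eyes": [1.0, 0.0, 2.0],
    "mouth": [2.0, 3.0, 0.0],
    "nose": [1.0, 1.0, 1.0],
}
print(coactivation_fingerprint(regions))
```

The resulting vector could then be fed to any lightweight classifier; with three regions and D feature dimensions it has 3 x D entries.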
[106] Multimodal Emotion Recognition via Causal-Diffusion Bridge (Affect-Diff) cs.CVPDF
Ankit Sanjyal
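TL;DR: This paper proposes Affect-Diff, a causal-diffusion bridge framework that addresses the extreme class imbalance in multimodal emotion recognition on CMU-MOSEI. It jointly trains three mechanisms: a NOTEARS-learned causal graph, a beta-VAE bottleneck, and a stop-gradiented 1D DDPM prior, which respectively re-weight modality contributions, compress the latent representation, and prevent majority-class dominance, so that minority emotion classes are recognized.
Details
Motivation: On CMU-MOSEI, emotion classes are extremely imbalanced (Happy accounts for 65.9% of samples, while three Ekman categories together make up under 7%), which leads standard fusion models to ignore minority emotions entirely.
Result: On 3,292 aligned CMU-MOSEI samples, Affect-Diff reaches a validation balanced accuracy of 0.384, an 18% relative improvement over the strongest baseline, TETFN (0.324). All evaluated baselines score zero F1 on Fear, Disgust, and Surprise, whereas Affect-Diff detects these minority classes. Ablations show that removing the diffusion prior costs 24% and removing the causal graph costs 13%.
Insight: The novelty is jointly modeling causal reasoning, a variational autoencoder, and a diffusion prior to structure the latent space and resist majority-class collapse. A key finding is that only the deterministic-encoder variant detects all six emotion classes, which reveals KL regularization strength as a direct lever for minority-class sensitivity.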
Abstract: Multimodal emotion recognition on CMU-MOSEI faces an extreme imbalance as Happy accounts for 65.9% of samples while three Ekman categories collectively represent under 7%, causing standard fusion models to maximize accuracy by ignoring minority emotions entirely. We present Affect-Diff, a Causal-Diffusion Bridge that addresses this through three jointly trained mechanisms: a NOTEARS-learned causal graph that re-weights modality contributions before fusion, a beta-VAE bottleneck for regularized latent compression, and a stop-gradiented 1D DDPM prior that structures the latent space against majority-class collapse. On 3,292 aligned CMU-MOSEI samples, Affect-Diff achieves validation balanced accuracy 0.384, an 18% relative improvement over the strongest baseline (TETFN: 0.324), while all evaluated baselines produce zero F1 on Fear, Disgust, and Surprise. Ablation studies confirm independent, non-redundant contributions from the diffusion prior (-24% without it) and causal graph (-13%). Notably, only the deterministic-encoder variant detects all six emotion classes, revealing KL regularization strength as a direct lever for minority-class sensitivity.
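Balanced accuracy is the metric that exposes majority-class collapse, and a short Python sketch makes the effect concrete. The class counts below are a toy distribution (only the roughly 66% share for Happy echoes the abstract), and the function names are hypothetical; the last lines recompute the relative improvement from the two reported scores.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(classes)


def accuracy(y_true, y_pred):
    """Plain fraction of correct predictions."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy six-class label set with a dominant class (assumed counts).
counts = {"Happy": 66, "Sad": 14, "Angry": 14, "Fear": 2, "Disgust": 2, "Surprise": 2}
y_true = [label for label, n in counts.items() for _ in range(n)]
always_happy = ["Happy"] * len(y_true)

print(accuracy(y_true, always_happy))           # high, despite ignoring five classes
print(balanced_accuracy(y_true, always_happy))  # one sixth

# Relative improvement implied by the reported scores (0.384 vs. 0.324).
relative_gain = (0.384 - 0.324) / 0.324
print(f"{relative_gain:.1%}")
```

A model that always predicts the majority class scores well on plain accuracy but only 1/6 on balanced accuracy, which is why the reported 0.384 is meaningful even though it looks low in absolute terms.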
[107] Bridging Modalities, Spanning Time: Structured Memory for Ultra-Long Agentic Video Reasoning cs.CV | cs.AIPDF
Jiazheng Li, Chi-Hao Wu, Yunze Liu, Kaize Ding, Jundong Li
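TL;DR: This paper proposes MAGIC-Video, a framework for understanding ultra-long videos such as egocentric recordings, live streams, and surveillance footage. It builds a multimodal memory graph with an interleaved narrative chain that unifies episodic, semantic, and visual content, supports cross-modal retrieval, and distils long-horizon entity biographies and recurring activity events. At inference time, an agentic loop interleaves graph retrieval with narrative fact injection to cover both the modality and time dimensions of ultra-long video.
Details
Motivation: When current multimodal LLMs process ultra-long videos, even million-token context windows give frame budgets that cover only tens of minutes of densely sampled video, and most evidence is discarded before inference begins. Existing memory-augmented and agentic approaches help with scale, but their retrieval is fragmented across modalities and lacks long-range narrative summaries spanning days or weeks.
Result: On the EgoLifeQA, Ego-R1, and MM-Lifelong benchmarks, MAGIC-Video outperforms strong general-purpose, long-video, and agentic baselines, with gains of 10.1, 7.4, and 5.9 points respectively over the prior best agentic system, setting a new state of the art.
Insight: The novelty is a training-free framework that unifies content from different modalities in a multimodal memory graph with six typed edges and adds an interleaved narrative chain to distil long-horizon entity and event information. Combining the graph with the narrative chain, and unifying retrieval over modality and time through an agentic loop, gives a structured and efficient memory and reasoning mechanism for ultra-long video understanding.
Abstract: Understanding ultra-long videos, such as egocentric recordings, live streams, or surveillance footage spanning days to weeks, remains a challenge for current multimodal LLMs: even with million-token context windows, frame budgets cover only tens of minutes of densely sampled video, and most evidence is discarded before inference begins. Memory-augmented and agentic approaches help with scale, but their retrieval remains fragmented across modalities and lacks long-range narrative summaries that span days or weeks. We propose MAGIC-Video, a training-free framework built around a multimodal memory graph with an interleaved narrative chain: the graph unifies episodic, semantic, and visual content through six typed edges and supports cross-modal retrieval, while the chain distils long-horizon entity biographies and recurring activity events. At inference time, an agentic loop interleaves graph retrieval with narrative fact injection, covering both the modality and time dimensions of ultra-long video in a single retrieval pipeline. On EgoLifeQA, Ego-R1 and MM-Lifelong, MAGIC-Video consistently outperforms strong general-purpose, long-video, and agentic baselines, with gains of 10.1, 7.4, and 5.9 points over the prior best agentic system on each benchmark. Code is available at https://github.com/lijiazheng0917/MAGIC-video.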
[108] BenchHAR: Benchmarking Self-Supervised Learning for Generalizable Sensor-based Activity Recognition cs.CV | eess.SPPDF
Yize Cai, Rui Feng, Anlan Yu, Baoshen Guo, Zhiqing Hong
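TL;DR: This paper presents BenchHAR, a unified framework for evaluating how well self-supervised learning (SSL) methods generalize in sensor-based activity recognition. The study curates a large-scale dataset, systematically evaluates combinations of SSL methods and encoder-classifier architectures, shows that existing methods fall short on generalization, and offers key insights into how model architecture, data scale, and sensor type affect generalization.
Details
Motivation: In sensor-based human activity recognition, data heterogeneity and the scarcity of labeled data limit real-world generalization. The paper addresses this problem and systematically evaluates and compares the generalization performance of SSL methods in this domain.
Result: Eight representative SSL methods and 12 encoder-classifier architectures are evaluated on the curated large-scale dataset (about 258K samples). Existing SSL methods struggle to reach satisfactory generalization. The hybrid paradigm (combining reconstruction and contrastive pretraining) performs best overall, CNN encoders learn the most generalizable representations, adding pretraining data from downstream activity classes consistently improves generalization, and adding more labeled data yields limited gains.
Insight: The main novelty is BenchHAR, the first unified benchmark framework for evaluating the generalization of SSL in sensor-based activity recognition. Viewed objectively, its core contribution is a large-scale systematic study that identifies the key factors behind SSL generalization in this field (the hybrid pretraining paradigm, the advantage of CNN encoders, and the effect of data source classes and device types), giving actionable guidance for building more generalizable systems rather than a simple transfer of methodology.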
Abstract: Human Activity Recognition (HAR) from wearable sensors supports broad healthcare and behavior science applications. However, data heterogeneity and the scarcity of labeled data limit its real-world generalization. Recent advances in self-supervised learning (SSL) in vision and language domains have shown strong capability for learning generalizable representations from unlabeled data. Yet, few studies have systematically compared the generalization performance of SSL methods or explored how to adapt them for generalizable HAR. To address these gaps, we present BenchHAR, a unified framework for evaluating the generalization capability of SSL methods for sensor-based HAR on unseen target distributions. BenchHAR curates a large-scale dataset (~258K samples) and evaluates eight representative SSL methods across 12 encoder-classifier architectures. Our results reveal that existing SSL methods struggle to achieve satisfactory generalization performance. We find that: (1) For HAR models, the hybrid paradigm (combining reconstruction and contrastive pretraining) achieves the best overall performance. The CNN encoder exhibits the strongest ability to learn generalizable representations, while more expressive classifier architectures further improve generalization. (2) For data scale, increasing the amount of pretraining data from downstream activity classes consistently improves generalization, while adding more labeled data yields limited gains. Interestingly, incorporating unlabeled data from non-downstream activity classes does not improve generalization. (3) Sensor data collected from custom-grade devices generalizes better than that from research-grade devices, and data from limb transfers more effectively to trunk positions. BenchHAR provides a unified benchmark and actionable insights for generalizable sensor-based HAR systems. Our code is available at https://github.com/saiketa/HAR-Bench.
[109] Decoupling Endpoint and Semantic Transition Learning for Zero-Shot Composed Image Retrieval cs.CV | cs.AI
Mingyu Liu, Sihan Huang, Yijia Fan, Yinlin Yan, Quan Zhang
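TL;DR: This paper proposes DeCIR to address the performance bottleneck of projection-based methods on complex semantic modifications in zero-shot composed image retrieval (ZS-CIR). It decouples endpoint learning from semantic transition learning by training separate low-rank text adapter branches and merging them into one deployable adapter with Low-Rank Directional Merge (LRDM), improving performance without increasing inference complexity.
Details
Motivation: Projection-based ZS-CIR methods are lightweight and do not depend on large language models (LLMs), but they underperform on complex semantic modifications. The cause is a semantic transition bottleneck: endpoint-level matching can let the edit text act merely as a target-side attribute cue rather than as a source-conditioned semantic transition. In addition, adding semantic transition supervision to the same text adapter creates a conflict between endpoint alignment and semantic transition alignment.
Result: Extensive experiments on CIRR, CIRCO, FashionIQ, and GeneCIS show that DeCIR consistently improves projection-based ZS-CIR methods without increasing inference complexity.
Insight: The novelty lies in decoupling endpoint learning from semantic transition learning: the method constructs paired forward/reverse edit tuples, trains separate low-rank adapter branches, and merges them with LRDM. This resolves the endpoint-transition conflict and suggests a new design direction for lightweight ZS-CIR methods.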
Abstract: Zero-shot composed image retrieval (ZS-CIR) retrieves a target image from a reference image and a text modification without human-annotated CIR triplets. Projection-based ZS-CIR methods are attractive because they do not rely on LLMs at inference and remain lightweight, but they often underperform LLM-based approaches on complex semantic modifications. This gap reflects a semantic transition bottleneck in projection-based ZS-CIR: endpoint-level matching can let the edit text act as a target-side attribute cue rather than grounding it as a source-conditioned semantic transition. We further show that adding semantic transition supervision to the same text adapter creates an endpoint–transition conflict between endpoint alignment and semantic transition alignment. To address this conflict, DeCIR decouples endpoint and transition learning. It constructs paired forward/reverse edit tuples from image-caption pairs, trains separate low-rank text adapter branches for endpoint alignment and semantic transition alignment, and merges them with Low-Rank Directional Merge (LRDM) into one deployable adapter. Extensive experiments on CIRR, CIRCO, FashionIQ, and GeneCIS demonstrate that DeCIR consistently improves projection-based ZS-CIR without increasing inference complexity.
[110] SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding cs.CV
Sara Ghazanfari, Siddharth Garg, Prashanth Krishnamurthy, Farshad Khorrami
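TL;DR: This paper introduces SYNCR, a synthetic benchmark for evaluating how well multimodal large language models (MLLMs) reason across multiple video streams. Programmatic verification provides precise spatial, temporal, and physical ground truth. The benchmark contains 8,163 multi-video question-answer pairs covering eight tasks across temporal alignment, spatial tracking, comparative reasoning, and holistic synthesis. In zero-shot evaluation, the best current model reaches only 52.5% average accuracy, far below the 89.5% human baseline, with pronounced weaknesses in physical and spatial reasoning.
Details
Motivation: Existing multi-video benchmarks built on real footage are limited by the precision of human annotation, which makes it hard to diagnose exactly why models fail at multi-video reasoning. A controllable synthetic benchmark is therefore needed to evaluate the cross-video reasoning of MLLMs systematically.
Result: On SYNCR, the best model reaches 52.5% average accuracy against an 89.5% human baseline. Models do relatively well on temporal ordering but reach only 26.0% accuracy on Kinematic Comparison. Parameter scaling and reasoning-specialized post-training improve temporal alignment but do little for fine-grained physical tracking or global spatial synthesis.
Insight: The novelty is a programmatically verified benchmark built from synthetic data, which enables fine-grained, controlled diagnosis of cross-video reasoning. Viewed objectively, it offers a new tool for systematically evaluating and improving the physical and spatial reasoning of MLLMs in complex multimodal settings, and it exposes reasoning weaknesses that existing evaluations do not adequately cover.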
Abstract: Multimodal Large Language Models (MLLMs) have made rapid progress in single-video understanding, yet their ability to reason across multiple independent video streams remains poorly understood. Existing multi-video benchmarks rely largely on human-annotated real-world footage, limiting the precision of spatial, temporal, and physical ground truth and making it difficult to diagnose model failures. We introduce SYNCR, a controlled synthetic benchmark for cross-video reasoning with programmatically verified grounding. Built using Habitat, Kubric, and CLEVRER simulator engines, SYNCR contains 8,163 multi-video question-answer pairs grounded in 9,650 unique videos. It evaluates MLLMs across eight tasks spanning four diagnostic pillars: Temporal Alignment, Spatial Tracking, Comparative Reasoning, and Holistic Synthesis. Our zero-shot evaluation of leading open- and closed-weight MLLMs reveals a substantial gap between current models and humans: the best model achieves only 52.5% average accuracy, compared to an 89.5% human baseline. Models perform relatively well on temporal ordering but struggle with precise physical and spatial reasoning, with the best model reaching only 26.0% accuracy on Kinematic Comparison. We further find that parameter scaling and reasoning-specialized post-training improve temporal alignment capabilities, but do not reliably address fine-grained physical tracking or global spatial synthesis. Finally, an exploratory sim-to-real correlation analysis suggests that several SYNCR tasks track model-level trends on real-world multi-video benchmarks, while also exposing reasoning capabilities underrepresented by existing evaluations. Code available at https://github.com/SaraGhazanfari/SYNCR.
[111] Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval cs.CV
Pascal Tilli, Mohsen Mesgar
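TL;DR: This paper proposes a multimodal encoder for visual document retrieval that learns a global layout embedding through textual supervision. The embedding augments retrieval models based on local patch representations so that they better capture a document's layout structure.
Details
Motivation: Existing visual document retrieval models rely mainly on interaction architectures over local patch embeddings and overlook how a document's global layout structure affects relevance. This leads to errors when retrieving documents with heterogeneous layouts that combine figures, tables, and text.
Result: Across four ViDoRe-v2 datasets, the method improves over the strongest architecturally comparable ColPali/ColQwen baseline by 2.4 points in nDCG@5 and 2.3 points in MAP@5, with statistically significant gains on each dataset.
Insight: The novelty is learning a global layout embedding supervised by textual descriptions, which injects document layout information into the local representations without changing the inference architecture and improves the model's grasp of overall document structure.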
Abstract: Visual Document Retrieval (VDR) models mostly rely on late-interaction architectures, in which documents are represented by a set of local patch embeddings and then matched against query tokens. While efficient, this architecture prioritizes local similarity over the global layout structure of documents when estimating the relevance between a document and a query. In practice, this leads to errors because, for documents with heterogeneous layouts combining figures, tables, and text, relevance originates from layout structure. We make document layout learnable without changing inference. We propose a multimodal encoder that augments local patch representations with a global layout embedding, trained via textual descriptions encoding document layout information. Across four ViDoRe-v2 datasets, our model improves over the strongest architecturally comparable ColPali/ColQwen baseline by +2.4 nDCG@5 and +2.3 MAP@5, with statistically significant per-dataset gains over ColQwen.
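To make the late-interaction idea concrete, here is a minimal pure-Python sketch of MaxSim scoring, the standard late-interaction rule in which each query token is matched to its best patch and the maxima are summed. The `add_global` helper is a hypothetical fusion rule (simple addition of one layout vector to every patch); the abstract does not say how the global layout embedding is combined with the patches, so treat that part as an assumption.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def add_global(patches: List[Vector], layout: Vector) -> List[List[float]]:
    """Hypothetical fusion: add one global layout vector to every patch embedding."""
    return [[p + g for p, g in zip(patch, layout)] for patch in patches]


def maxsim_score(query_tokens: List[Vector], doc_patches: List[Vector]) -> float:
    """Late-interaction score: best-matching patch per query token, summed over tokens."""
    return sum(max(dot(q, p) for p in doc_patches) for q in query_tokens)


# Toy 2-dimensional embeddings, invented for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
patches = [[0.5, 0.1], [0.2, 0.9]]
layout = [0.1, 0.0]

base = maxsim_score(query, patches)
augmented = maxsim_score(query, add_global(patches, layout))
```

Because the layout vector is folded into the patch embeddings before scoring, the scoring function itself is unchanged, which is one way to read "without changing inference".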
[112] NICE FACT: Diagnosing and Calibrating VLMs in Quantitative Reasoning for Kinematic Physics cs.CV
Jian Lan, Zhicheng Liu, Xinpeng Wang, Yuhao Zhou, Haokun Chen
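TL;DR: This paper proposes NICE and FACT, a dual-diagnostic paradigm for evaluating and calibrating vision-language models (VLMs) on quantitative reasoning in kinematic physics. It exposes model deficiencies in visual fidelity, physical law comprehension, and temporal grounding, and proposes a neighborhood-informed calibration method to make model confidence more reliable.
Details
Motivation: Current VLMs perform poorly on spatial intelligence tasks such as physical reasoning, and the field lacks a scientific analysis of whether models truly understand the physical world and follow physical laws. The paper aims to assess the reliability of model confidence and to provide a standardized diagnostic method.
Result: Evaluation on six recent state-of-the-art VLMs shows that models fail to identify visual preconditions or to use the physical laws needed to reach answers. The proposed diagnostic paradigm and calibration method improve the reliability of model confidence.
Insight: The novelty is to decompose quantitative reasoning into explicit diagnoses of visual fidelity, physical law comprehension, and temporal grounding, and to introduce a neighborhood-informed calibration method with new metrics for evaluating and calibrating confidence. Together these give a standardized diagnostic paradigm for developing faithful, physically grounded VLMs.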
Abstract: The ability to derive precise spatial and physical insights is a cornerstone of vision-language models (VLMs), yet their poor performance on related spatial intelligence tasks such as physical reasoning remains a fundamental barrier. The community critically lacks a scientific analysis revealing whether VLMs faithfully reach answers or merely make plausible guesses. This work aims to provide a fundamental understanding of how VLMs perceive the physical world and utilize physical laws, while assessing the reliability of model confidence. We propose NICE and FACT, a dual-diagnostic paradigm that explicitly decomposes quantitative reasoning for kinematic physics: FACT diagnoses visual fidelity, physical law comprehension, and temporal grounding. NICE studies our novel neighborhood-informed calibration method and novel metrics to evaluate and calibrate confidence reliability. Evaluating six recent state-of-the-art VLMs, we find that models fail to identify visual preconditions or utilize necessary physical laws to reach answers. This work highlights and establishes a standardized diagnostic paradigm to guide the development of faithful, physically-grounded VLMs.
[113] CapCLIP: A Vision-Language Representation Alignment Approach for Wireless Capsule Endoscopy Analysis cs.CV
Haroon Wahab, Irfan Mehmood, Hassan Ugail
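TL;DR: This paper proposes CapCLIP, a domain-specific vision-language representation learning framework for wireless capsule endoscopy (WCE) analysis. It aligns endoscopy images with textual descriptions based on standardized nomenclature and pathology-aware captions, learning embeddings that are semantically informed and transferable. Under strict zero-shot conditions, CapCLIP outperforms existing open-source vision and vision-language foundation models on several downstream tasks, and it is especially strong on out-of-distribution datasets.
Details
Motivation: A WCE examination produces a large number of frames, and subtle abnormalities are hard to recognize under highly variable imaging conditions. Existing learning-based methods are mostly vision-only, confined to narrow pathology sets, and transfer poorly across datasets and centers.
Result: On unseen WCE datasets, CapCLIP consistently outperforms the baselines on three downstream tasks: K-nearest neighbor classification, CLIP-style image-text classification, and text-to-image retrieval. The gains are largest for zero-shot image-text classification and cross-modal retrieval on out-of-distribution datasets.
Insight: The novelty is aligning domain-specific clinical text descriptions (based on standardized nomenclature) with visual representations to improve generalization and semantic interpretability in WCE analysis. This points toward foundation models tailored to capsule endoscopy and supports a language-grounded paradigm for WCE analysis.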
Abstract: Wireless capsule endoscopy (WCE) enables non-invasive visual assessment of the small bowel, but its clinical utility is constrained by the large volume of frames generated per examination and the difficulty of recognising subtle abnormalities under highly variable imaging conditions. Existing learning-based approaches for WCE are predominantly vision-only, often confined to narrow pathology sets, and show limited transfer across datasets and centres. To address these limitations, this study introduces CapCLIP, a domain-specific vision-language representation learning framework for WCE. CapCLIP aligns capsule endoscopy frames with clinically grounded textual descriptions derived from standardised nomenclature and pathology-aware caption templates, thereby learning embeddings that are both semantically informed and transferable. The proposed framework is evaluated against relevant open-source vision and vision-language foundation models under strict zero-shot conditions using unseen WCE datasets. Evaluation covers three downstream tasks: K-nearest neighbour classification, CLIP-style image-text classification, and text-to-image retrieval. Across these settings, CapCLIP consistently outperforms the compared baselines, with particularly strong gains in zero-shot image-text classification and cross-modal retrieval on out-of-distribution datasets. The results indicate that language-guided representation learning can improve both generalisation and semantic interpretability in WCE analysis. These findings position CapCLIP as a step toward foundation models tailored to capsule endoscopy and support the use of language-grounded WCE analysis.
[114] A Two-Stage Motion-Aware Framework for mmWave-based Human Mesh Recovery cs.CV
Hoang Hai Pham, Shuntian Zheng, Jiaqi Li, Yu Guan
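TL;DR: This paper proposes a two-stage motion-aware framework based on millimeter-wave (mmWave) radar for recovering accurate 3D human meshes from radar observations. The method first extracts human reflection signals through coarse-to-fine localization and voxel-wise segmentation, producing a confidence-weighted radar volume encoding. It then uses a motion-aware mesh recovery network with a dual-branch architecture that jointly models per-frame geometry and inter-frame dynamics.
Details
Motivation: Existing methods usually adopt end-to-end frameworks that regress human body parameters directly from raw radar data. They neither decouple signal interpretation from geometric reasoning nor exploit temporal motion cues, which limits learning performance. The paper targets the difficulty of human mesh recovery caused by severe clutter in mmWave radar signals and the inherently partial nature of the measurements.
Result: Extensive experiments show that the proposed method outperforms existing approaches while maintaining computational efficiency.
Insight: The novelty is to decouple radar-based human reconstruction into two stages, signal interpretation and geometric reasoning, and to introduce a motion-aware mechanism. Specifically: 1) a human reflection extraction module that outputs a confidence-weighted radar volume encoding; 2) a dual-branch motion-aware mesh recovery network that jointly models static geometry and temporal dynamics. This staged design with explicit use of temporal information could carry over to other 3D perception tasks based on sparse or noisy data.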
Abstract: Millimeter-wave (mmWave) radar has emerged as a promising sensing modality for human perception due to its robustness under challenging environmental conditions and strong privacy-preserving properties. However, recovering accurate 3D human body meshes from radar observations remains difficult due to severe signal clutter and the inherently partial nature of radar measurements. Previous works typically adopt end-to-end frameworks that directly regress human body parameters from raw radar data, without decoupling signal interpretation from geometric reasoning or exploiting temporal motion cues, limiting learning performance. To address this, we propose a two-stage framework for radar-based human body reconstruction. First, we introduce a human reflection extraction module that performs coarse-to-fine localization and voxel-wise segmentation to produce a confidence-weighted radar volume encoding voxel-level human likelihood. Second, we design a motion-aware mesh recovery network that reconstructs the human body by jointly modeling per-frame geometry and inter-frame dynamics using a dual-branch architecture. Extensive experiments demonstrate that the proposed method outperforms existing approaches while maintaining computational efficiency.
[115] MC-RFM: Geometry-Aware Few-Shot Adaptation via Mixed-Curvature Riemannian Flow Matching cs.CV | cs.AI | cs.LG
Salim Khazem, Ibrahim Mohamed Serouis, Zakaria Ezzahed
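TL;DR: This paper proposes MC-RFM, a mixed-curvature Riemannian flow-matching framework for few-shot adaptation of pretrained visual backbones. Adapted features are represented on a product manifold that combines a hyperbolic factor (capturing hierarchy-sensitive semantic structure) with a Euclidean factor (preserving locally discriminative visual variation). Adaptation is modeled as a task-conditioned continuous transport from frozen features to support-set prototypes and is trained with a flow-matching objective.
Details
Motivation: Existing parameter-efficient adaptation methods (linear probes, prompts, low-rank updates, or lightweight residual modules) usually treat adaptation as a discrete Euclidean perturbation of frozen representations and do not explicitly model the geometry of the task-induced feature displacement. The paper aims to improve few-shot adaptation with a geometry-aware flow-matching framework.
Result: Across seven visual recognition benchmarks, five frozen backbones, and 1/4/16-shot settings, MC-RFM performs best in a majority of the evaluated scenarios, with the largest gains on Transformer backbones and fine-grained datasets.
Insight: The novelty is to model few-shot adaptation as continuous transport on a mixed-curvature (hyperbolic plus Euclidean) product manifold, explicitly incorporating task-relevant geometric structure. Reusable ideas include the mixed-curvature feature representation, task-conditioned flow matching, adaptive branch gating, prototype shrinkage, and discriminative supervision working together.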
Abstract: Parameter-efficient adaptation of pretrained vision models is commonly performed through linear probes, prompts, low-rank updates, or lightweight residual modules. While effective, these methods usually treat adaptation as a discrete Euclidean perturbation of frozen representations, without explicitly modeling the geometry of the task-induced feature displacement. We propose MC-RFM, a mixed-curvature Riemannian flow-matching framework for few-shot adaptation of frozen visual backbones. The key idea is to represent adapted features on a product manifold combining a hyperbolic factor, which captures hierarchy-sensitive semantic structure, and a Euclidean factor, which preserves locally discriminative visual variation. Adaptation is formulated as a task-conditioned continuous transport from frozen features to support-set prototypes, trained with a flow-matching objective and coupled to a hybrid prototype-linear classifier. The method is lightweight, backbone-agnostic, and operates entirely on cached frozen features. Across seven visual recognition benchmarks, five frozen backbones, and 1/4/16-shot regimes, MC-RFM is the best-performing method in a majority of evaluated settings, with the strongest gains on Transformer backbones and fine-grained datasets. Ablations show that the mixed-curvature head, task conditioning, adaptive branch gating, prototype shrinkage, and discriminative supervision each contribute to performance. These results suggest that few-shot adaptation benefits not only from deciding which parameters to update, but also from modeling how representations should move through a geometry matched to the structure of the downstream task.
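The product-manifold idea can be illustrated with a small distance computation. The sketch below assumes the hyperbolic factor is a Poincaré ball of curvature -1 and combines the two factor distances with the usual product-metric rule (square root of the sum of squares); the abstract does not state which hyperbolic model or curvature MC-RFM uses, so both choices are assumptions, and all function names are hypothetical.

```python
import math
from typing import Sequence


def sq_norm(v: Sequence[float]) -> float:
    """Squared Euclidean norm."""
    return sum(x * x for x in v)


def poincare_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Geodesic distance in the Poincare ball (curvature -1); inputs must have norm < 1."""
    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Ordinary straight-line distance."""
    return math.sqrt(sq_norm([a - b for a, b in zip(u, v)]))


def product_distance(hyp_u, euc_u, hyp_v, euc_v) -> float:
    """Distance on the product manifold: sqrt(d_hyperbolic^2 + d_euclidean^2)."""
    return math.hypot(poincare_distance(hyp_u, hyp_v),
                      euclidean_distance(euc_u, euc_v))


# A feature and a prototype, each split into a hyperbolic part and a Euclidean part.
d = product_distance([0.0, 0.0], [0.0, 0.0], [0.5, 0.0], [3.0, 4.0])
```

A prototype classifier on such a space would assign a feature to the prototype with the smallest `product_distance`.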
[116] ZAYA1-VL-8B Technical Report cs.CV | cs.AI
Hassan Shapourian, Kasra Hejazi, Olabode M. Sule, Beren Millidge
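TL;DR: ZAYA1-VL-8B is a compact mixture-of-experts vision-language model built on the in-house language model ZAYA1-8B. Despite its small size, it is competitive with leading base models such as Molmo2-4B and InternVL3.5-4B on a range of image understanding, reasoning, and counting benchmarks, and it surpasses models including Qwen2.5-VL-3B, PLM-3B, and MolmoE-1B. Its architecture has two key innovations: integrated vision-specific LoRA adapters, and bidirectional attention over image tokens within the LLM.
Details
Motivation: The goal is a parameter-efficient, high-performing compact vision-language model whose visual understanding is competitive with larger models at a limited model scale.
Result: Across image understanding, reasoning, and counting benchmarks, the model performs on par with leading base models such as Molmo2-4B and InternVL3.5-4B and surpasses Qwen2.5-VL-3B, PLM-3B, and MolmoE-1B.
Insight: The main innovations are: 1) vision-specific LoRA adapters integrated into the LLM, which add modality-specific capacity without increasing the number of experts; 2) bidirectional attention over image tokens within the LLM to strengthen visual understanding. Both offer ideas for building efficient, compact multimodal models.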
Abstract: We present ZAYA1-VL-8B, a compact mixture-of-experts vision-language model built upon our in-house language model, ZAYA1-8B. Despite its compact size, ZAYA1-VL achieves performance competitive with leading base models such as Molmo2-4B and InternVL3.5-4B, while surpassing models including Qwen2.5-VL-3B, PLM-3B, and MolmoE-1B across a range of image understanding, reasoning, and counting benchmarks. The architecture incorporates two key innovations: (1) vision-specific LoRA adapters integrated into the LLM to increase modality-specific capacity without increasing the number of experts, and (2) bidirectional attention over image tokens within the LLM to enhance visual understanding. We detail the full training pipeline including data composition at each stage, sequence packing, and the attention masking scheme. The model comprises 9.2B total parameters, with 1.4B active parameters including the vision encoder, and is publicly available at https://huggingface.co/Zyphra/ZAYA1-VL.
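The second architectural idea, bidirectional attention over image tokens, amounts to relaxing the causal mask inside the image span. The following is a minimal sketch of such a mask, not the report's actual masking scheme: it assumes all image tokens belong to a single image (with several images one would restrict the relaxation to tokens of the same image), and `build_attention_mask` is a hypothetical name.

```python
from typing import List


def build_attention_mask(is_image: List[bool]) -> List[List[bool]]:
    """Return mask[q][k] = True when query position q may attend to key position k.

    Text tokens follow the usual causal rule (k <= q). Image tokens may
    additionally attend to every other image token, including later ones.
    Assumption: all image tokens come from one image.
    """
    n = len(is_image)
    return [
        [(k <= q) or (is_image[q] and is_image[k]) for k in range(n)]
        for q in range(n)
    ]


# Sequence layout: [text, image, image, text]
mask = build_attention_mask([False, True, True, False])
```

In this toy layout the first image token can see the second one (a forward look that a causal mask forbids), while text tokens remain strictly causal.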
[117] ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models cs.CV
Haotian Xue, Yipu Chen, Liqian Ma, Zelin Zhao, Lama Moukheiber
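TL;DR: This paper introduces ACWM-Phys, a new benchmark for evaluating how well action-conditioned world models predict under diverse physical dynamics. It is built in a controllable simulation environment and covers several types of physical interaction: rigid-body dynamics, kinematics, deformable-object interactions, and particle dynamics.
Details
Motivation: Existing benchmarks for action-conditioned world models are largely limited to egocentric navigation or narrow, task-specific robotics datasets and do not cover the rich physical interactions needed for generalized world understanding. A new benchmark is needed to evaluate generalization over physical interaction systematically.
Result: Systematic experiments on ACWM-DiT show that generalization differs between the in-distribution and out-of-distribution protocols. The model generalizes well on visually simple, low-dimensional interactions with clear geometric structure, but its performance drops more on deformable contacts, high-dimensional control, and complex articulated motion. This indicates that the model still relies heavily on visual appearance patterns rather than fully learning the underlying physics.
Insight: The novelty is a controllable, reproducible benchmark of physical interaction, together with a systematic analysis of how models generalize across physical regimes and task complexity. Viewed objectively, the ablations show how cross-attention, causal VAEs, and larger action spaces affect model performance, which gives guidance for designing physically grounded world models.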
Abstract: Action-conditioned world models (ACWMs) have shown strong promise for video prediction and decision-making. However, existing benchmarks are largely restricted to egocentric navigation or narrow, task-specific robotics datasets, offering only limited coverage of the rich physical interactions required for generalized world understanding. We introduce ACWM-Phys, a new benchmark for evaluating action-conditioned prediction under diverse physical dynamics in a clean, controllable simulation environment with a carefully designed action space. ACWM-Phys contains training and evaluation data spanning rigid-body dynamics, kinematics, deformable-object interactions, and particle dynamics. To evaluate both interpolation and generalization, we design in-distribution and out-of-distribution protocols with controlled shifts in interaction patterns or scene configurations. By building the benchmark in a fully controllable simulator, ACWM-Phys enables precise data collection, reproducible evaluation, and systematic analysis of model capabilities for physically grounded world modeling. Through systematic experiments on ACWM-DiT, we find that OoD generalization depends not only on the physical regime but also on effective task complexity: models generalize well on visually simple, low-dimensional interactions with clear geometric structure, but suffer larger drops on deformable contacts, high-dimensional control, and complex articulated motion. This suggests that the model still relies heavily on visual appearance patterns instead of fully learning the underlying physics. Ablations show that cross-attention improves high-dimensional action conditioning, causal VAEs outperform frame-wise encoders, and larger action spaces are harder to model but can improve generalization by providing richer control signals. These findings guide the design of physically grounded world models.
[118] Enhancing Consistency Models for Multi-Agent Trajectory Prediction cs.CV
Alen Mrdovic, Qingze Liu, Danrui Li, Mathew Schwartz
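TL;DR: This paper proposes ECTraj, an enhanced consistency model framework for multi-agent trajectory prediction. By improving the student-teacher consistency training scheme and combining conditional generation with top-K multimodal generation, it achieves high-quality single-step inference and sets competitive new benchmark results on the Argoverse 2 dataset.
Details
Motivation: Diffusion models for multi-agent trajectory prediction suffer inference latency from iterative denoising, and existing fast-sampling methods either fail to achieve true single-step generation or are constrained by the chosen noise distribution.
Result: The method establishes competitive new benchmarks on the large-scale Argoverse 2 dataset, with faster inference and higher prediction accuracy.
Insight: In the student-teacher training scheme, the teacher explicitly fuses its predictions with the ground-truth trajectory to provide stronger supervision. The direct denoising property of consistency models is used for top-K multimodal generation during training, and combining this with conditional generation improves performance.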
Abstract: Diffusion models for multi-agent trajectory prediction are limited by iterative denoising, which causes inference latency that hinders their use in time-critical settings like autonomous driving. Fast-sampling variants using DDIM and informed initial noise distributions partially alleviate this issue, but they either fail to achieve true single-step generation or are constrained by the chosen noise distribution. Consistency Models (CMs) offer high-quality one-step generation by mapping noise directly to data, but are difficult to train from scratch. We propose ECTraj, an enhanced CM pipeline with improved training and conditional generation for trajectory prediction. Our framework extends the student-teacher consistency training scheme: the student produces standard outputs, while the teacher explicitly fuses its predictions with parts of the ground truth to give stronger supervision. We also exploit CMs’ direct denoising for top-K multi-shot generation during training. Combining conditional generation with this enhanced consistency objective yields faster inference and improved prediction accuracy, establishing competitive new benchmarks on the large-scale Argoverse 2 dataset.
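One common way to train with several candidate trajectories per scene is a best-of-K objective: generate K samples, score each against the ground truth, and penalize only the closest one. The sketch below illustrates that generic idea with average displacement error (ADE); whether ECTraj's top-K multi-shot generation uses exactly this loss is an assumption, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def ade(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Average displacement error: mean Euclidean distance per timestep."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def best_of_k(candidates: List[Sequence[Point]], gt: Sequence[Point]) -> Tuple[int, float]:
    """Return (index, ADE) of the candidate trajectory closest to the ground truth."""
    errors = [ade(c, gt) for c in candidates]
    best = min(range(len(errors)), key=errors.__getitem__)
    return best, errors[best]


# Two candidate trajectories for a two-step horizon (toy numbers).
gt = [(0.0, 0.0), (1.0, 0.0)]
candidates = [
    [(0.0, 1.0), (1.0, 1.0)],   # offset by 1 m at every step
    [(0.0, 0.0), (1.0, 0.5)],   # exact at step 0, 0.5 m off at step 1
]
index, loss = best_of_k(candidates, gt)
```

Because a consistency model maps noise to a trajectory in one step, drawing K candidates during training costs K forward passes rather than K full denoising chains.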
[119] PromptDx: Differentiable Prompt Tuning for Multimodal In-Context Alzheimer’s Diagnosis cs.CV | cs.AI
Lujia Zhong, Yihao Xia, Shuo Huang, Jianwei Zhang, Yonggang Shi
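TL;DR: This paper proposes PromptDx, a new in-context learning framework for multimodal Alzheimer's disease diagnosis. A differentiable prompt tuning mechanism aligns a pretrained TabPFN in-context learning engine with multimodal representations and enables end-to-end optimization. On the ADNI dataset it outperforms traditional parametric baselines using only 1% context samples.
Details
Motivation: Existing in-context learning frameworks such as TabPFN rely on non-differentiable preprocessing pipelines, which cause manifold mismatch and gradient fracture on heterogeneous multimodal data. The paper addresses this and replaces the parametric-memory mode of conventional deep learning with a diagnostic paradigm closer to clinical analogical reasoning.
Result: On the ADNI dataset (3D MRI and tabular biomarkers) the method outperforms traditional parametric baselines. With only 1% context samples it exceeds the performance of standard in-context learning using 30% of the samples, and its generality is validated on six tabular datasets of different scales.
Insight: The core innovation is the differentiable prompt tuning mechanism: a lightweight adapter is trained as a differentiable surrogate for the non-differentiable preprocessors, enabling end-to-end optimization of multimodal prompts within the in-context learning paradigm. The method shows strong manifold condensation ability and offers a more data-efficient, clinically aligned paradigm for medical diagnosis.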
Abstract: Deep learning models in medical imaging typically operate as parametric memory, diagnosing patients by recalling fixed knowledge learned during training. This contrasts sharply with clinical practice, where physicians employ analogical reasoning to diagnose new cases by referencing similar records from past exemplars. While In-Context Learning (ICL) frameworks such as Tabular Prior-Fitted Networks (TabPFN) offer a promising diagnosis-by-reference paradigm, they are designed with tabular-specific inductive priors and rely on non-differentiable preprocessing pipelines, leading to manifold mismatch and gradient fracture when applied to heterogeneous multimodal data. To address these limitations, we propose PromptDx, a novel diagnosis-by-reference framework that leverages a pre-trained TabPFN as an ICL engine while enabling seamless integration with multimodal representations. Our core contribution is a Differentiable Prompt Tuning (DPT) mechanism that aligns a Masked Multimodal Modeling module with the pre-trained ICL engine. By training a lightweight adapter as a differentiable surrogate for the engine’s non-differentiable preprocessors, we enable an end-to-end optimization of multimodal prompts within the ICL paradigm. We validate our method on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset using 3D MRI and tabular biomarkers. Experiments demonstrate that our approach outperforms traditional parametric baselines. Notably, our method achieves superior performance using only 1% context samples compared to 30% in standard ICL, demonstrating exceptional manifold condensation ability. We further validate the generalizability of our DPT framework across six tabular datasets with diverse scales. Overall, our method offers a more data-efficient and clinically aligned paradigm for Alzheimer’s Disease diagnosis.
[120] Cross-Modal RGB-D Fusion Transformer for 6D Pose Estimation of Non-Cooperative Spacecraft with Stereo-Derived Depth cs.CV
Yongliang Zhen, Bo LÜ, Hang Yang, Xiaotian WU
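TL;DR: This paper proposes a passive stereo vision framework for six-degree-of-freedom pose estimation of non-cooperative spacecraft. It comprises a binocular stereo matching network called TSCA-Stereo and a cross-modal fusion Transformer that adaptively combines RGB appearance information with stereo depth features, enabling reliable pose recovery under demanding space imaging conditions.
Details
Motivation: In on-orbit servicing and active debris removal involving non-cooperative spacecraft, learning-based monocular methods suffer from depth ambiguity and tend to fail under harsh orbital illumination. The paper addresses these limitations while avoiding active depth sensors, whose power and mass requirements suit spacecraft platforms poorly.
Result: On the purpose-built synthetic binocular multimodal dataset, TSCA-Stereo outperforms the baseline on every evaluated metric. The full pose estimation pipeline achieves a mean translation error of 0.0419 m and a mean orientation error of 0.8632° under varied imaging conditions, showing that the passive stereo approach is effective and robust under the demanding visual conditions of space.
Insight: The novelty is a stereo matching network (TSCA-Stereo) designed for the weak-texture surfaces, specular highlights, and severe lighting variations of space imagery, together with a cross-modal Transformer that adaptively fuses RGB and stereo depth features. Viewed objectively, the synthetic binocular multimodal dataset, which covers a range of illumination, attitude, and noise levels, is also a useful benchmark for the field.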
Abstract: On-orbit servicing and active debris removal involving non-cooperative spacecraft require reliable pose estimation to supply accurate position and orientation data for autonomous visual navigation. Learning-based monocular methods have seen widespread adoption in spacecraft pose estimation, yet they suffer from an intrinsic depth ambiguity problem and tend to fail under the harsh illumination conditions routinely encountered in orbit. Active depth sensors could in principle address the geometric ambiguity, but their power and mass requirements make them poorly suited to most spacecraft platforms. This work addresses these issues through a passive stereo vision framework for six-degree-of-freedom (6-DOF) pose estimation of non-cooperative spacecraft. A binocular stereo matching network called TSCA-Stereo is developed to cope with weak-texture surfaces, specular highlights, and severe lighting variations typical of space imagery. A cross-modal fusion Transformer is introduced to combine RGB appearance information with stereo depth features in an adaptive manner, supporting reliable pose recovery. A synthetic binocular multimodal dataset is also built for the experiments, covering stereo disparity maps and 6-DOF pose annotations across a range of illumination scenarios, attitude configurations, and noise levels. Experimental results show that TSCA-Stereo outperforms the baseline across every evaluated metric on this space-specific dataset. The full pose estimation pipeline achieves a mean translation error of 0.0419 m and a mean orientation error of 0.8632° under varied imaging conditions, confirming that the passive stereo approach is both effective and resilient when operating under the demanding visual conditions of the space environment.
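For readers unfamiliar with how 6-DOF pose errors such as the reported 0.0419 m and 0.8632° are typically computed, here is a minimal sketch using two widely used definitions: Euclidean distance between translation vectors, and the geodesic angle between unit quaternions. The abstract does not spell out the paper's exact metric definitions, so these formulas and the function names are assumptions.

```python
import math
from typing import Sequence


def translation_error(t_pred: Sequence[float], t_true: Sequence[float]) -> float:
    """Euclidean distance between predicted and true positions (same unit as the inputs)."""
    return math.dist(t_pred, t_true)


def orientation_error_deg(q_pred: Sequence[float], q_true: Sequence[float]) -> float:
    """Smallest rotation angle, in degrees, between two unit quaternions (w, x, y, z).

    The absolute value handles the double cover: q and -q describe the same rotation.
    """
    d = abs(sum(a * b for a, b in zip(q_pred, q_true)))
    d = min(1.0, d)  # guard against rounding slightly above 1
    return math.degrees(2.0 * math.acos(d))


# Identity attitude versus a 90-degree rotation about the z axis.
identity = (1.0, 0.0, 0.0, 0.0)
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))

t_err = translation_error((0.0, 0.0, 0.0), (0.03, 0.04, 0.0))
r_err = orientation_error_deg(identity, quarter_turn_z)
```

Averaging these two quantities over a test set gives a mean translation error and a mean orientation error of the kind quoted in the abstract.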
[121] Egocentric Whole-Body Human Mesh Recovery with Prior-Guided Learning cs.CV
Soyeon Na, Seung Young Noh, Ju Yong Chang
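TL;DR: This paper proposes a prior-guided learning framework for recovering a whole-body human mesh from a single egocentric (head-mounted camera) image. It constructs more accurate optimization-based pseudo ground truth, leverages an exocentric human mesh recovery foundation model and a diffusion-based pose prior, and handles fisheye distortion with a deterministic undistortion module, improving the accuracy of whole-body reconstruction.
Details
Motivation: Egocentric human mesh recovery from monocular head-mounted cameras matters for AR/VR applications, but real egocentric images lack ground-truth annotations based on parametric body models such as SMPL and SMPL-X. Existing methods usually rely on pseudo ground truth and focus only on body pose estimation, so they struggle to recover fine-grained whole-body details such as hands and face. A new solution is needed.
Result: Experiments on multiple egocentric benchmarks show that the method outperforms state-of-the-art methods in whole-body reconstruction, and that its optimization-based pseudo ground truth is substantially more accurate than existing regression-based pseudo ground truth.
Insight: The innovations are: more accurate optimization-based pseudo ground truth aligned with 3D joint supervision; use of multiple priors by adapting an exocentric foundation model together with a diffusion pose prior; and a deterministic undistortion module for fisheye distortion. Viewed objectively, integrating optimization with prior knowledge addresses both the lack of ground-truth annotation and the difficulty of fine-grained recovery in the egocentric view.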
Abstract: Egocentric human mesh recovery (HMR) from monocular head-mounted cameras is increasingly important for AR/VR applications, but remains challenging due to the lack of reliable ground-truth (GT) annotations based on parametric human body models such as SMPL and SMPL-X for real egocentric images. Existing egocentric HMR methods typically rely on pseudo-GT and focus on body pose estimation, which limits their ability to recover fine-grained whole-body details such as hands and face. We study egocentric whole-body human mesh recovery and propose a prior-guided learning framework that reconstructs whole-body meshes from a single egocentric image. We construct more accurate optimization-based pseudo-GT aligned with 3D joint supervision, and leverage multiple priors by adapting an exocentric HMR foundation model together with a diffusion-based pose prior. A deterministic undistortion module is further adopted to handle fisheye distortions in egocentric images. Experiments across multiple egocentric benchmarks demonstrate improved whole-body reconstruction compared to state-of-the-art methods, and show that our optimization-based pseudo-GT is substantially more accurate than existing regression-based pseudo-GT. To facilitate reproducibility, the code and dataset annotations are publicly available at https://github.com/naso06/EgoSMPLX.
[122] Kinematics-Driven Gaussian Shape Deformation for Blurry Monocular Dynamic Scenes cs.CV
Yeon-Ji Song, Kiyoung Kwon, Junoh Lee, Jin-Hwa Kim, Byoung-Tak Zhang
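TL;DR: This paper proposes Kinematics-GS, a framework for reconstructing dynamic 3D scenes from blurry monocular video. It models blur as motion-aligned deformation and introduces a kinematic prior that reparameterizes Gaussian shapes along motion trajectories, avoiding degenerate shape collapse without auxiliary motion supervision. A scene decomposition based on temporal deformation variance and a coarse-to-fine deformation strategy stabilize optimization and capture both global motion and fine detail.
Details
Motivation: Reconstructing dynamic 3D scenes from blurry monocular video is challenging because motion-induced blur entangles object motion with geometry and breaks the geometric consistency of the reconstruction. Existing methods struggle with blur caused by complex, non-rigid motion.
Result: Extensive experiments on real-world benchmarks with realistic motion blur show that Kinematics-GS outperforms existing methods by a clear margin in monocular dynamic scene reconstruction.
Insight: The main innovations are: 1) modeling blur as motion-aligned deformation and introducing a kinematic prior that reparameterizes 3D Gaussian shapes, which mitigates blur-induced shape degeneration; 2) a dynamic/static scene decomposition based on temporal deformation variance and a coarse-to-fine deformation optimization strategy, which improve reconstruction stability and detail; 3) a challenging real-world dataset with non-rigid motion and spatially non-uniform motion blur.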
Abstract: Reconstructing dynamic 3D scenes from blurry monocular videos is challenging as motion-induced blur entangles object motion and geometry, hindering geometric consistency. We present Kinematics-GS, a kinematics-aware framework that models blur as motion-aligned deformation and introduces a kinematic prior to reparameterize Gaussian shapes along motion trajectories, thereby mitigating degenerate shape collapse without auxiliary motion supervision. To stabilize optimization, we decompose scenes into dynamic and static components using temporal deformation variance and employ a coarse-to-fine deformation strategy to capture both global motion and fine-grained details. We also introduce a challenging real-world dataset of deformable and elastic objects exhibiting non-rigid motion with spatially non-uniform motion blur that obscures geometric cues. Extensive experiments on real-world benchmarks with realistic motion blur demonstrate that Kinematics-GS outperforms prior methods by a clear margin in monocular dynamic scene reconstruction, highlighting its effectiveness in handling complex and non-rigid motion scenarios.
[123] Privacy-Aware Video Anomaly Detection through Orthogonal Subspace Projection cs.CV | cs.AI | cs.LG
Lei Wang, Wenxiang Diao, Andrew Busch, Jun Zhou, Yongsheng Gao
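TL;DR: This paper proposes a privacy-aware video anomaly detection method that removes task-irrelevant, privacy-sensitive information through orthogonal subspace projection. Its core is the Orthogonal Projection Layer (OPL) and a guided variant (G-OPL), which uses face-presence signals as weak supervision to suppress facial identity attributes while preserving non-identifying features such as pose and motion.
Details
Motivation: Existing video anomaly detection systems focus on accuracy and overlook privacy, which limits their deployment in the real world, especially in human-centered scenarios.
Result: Experiments show that embedding privacy constraints into model design reduces the leakage of sensitive information while maintaining or even improving detection accuracy, supporting projection-based architectures as a principled approach to privacy-aware VAD.
Insight: The innovations are: 1) a lightweight orthogonal projection layer that separates task-relevant from task-irrelevant information; 2) weak supervision from face-presence signals to guide the suppression of private information, without identity labels or adversarial training; 3) an evaluation framework that jointly assesses detection performance and privacy preservation.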
Abstract: Video anomaly detection (VAD) systems often prioritize accuracy while overlooking privacy concerns, limiting their suitability for real-world deployment. We propose the Orthogonal Projection Layer (OPL), a lightweight module that removes task-irrelevant variations to produce representations focused on anomaly-relevant cues. To address privacy risks in human-centered scenarios, we introduce Guided OPL (G-OPL), which suppresses facial attributes using weak supervision from face-presence signals while preserving non-identifying features such as pose and motion. A cosine alignment objective enforces consistent capture and removal of facial information without identity labels or adversarial training. We further present a privacy-aware evaluation framework that jointly assesses detection performance and privacy preservation, and enables analysis of how sensitive information is filtered. Experiments show that embedding privacy constraints into model design reduces sensitive information while maintaining or improving detection accuracy, supporting projection-based architectures as a principled approach for privacy-aware VAD.
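The linear algebra behind an orthogonal projection layer is compact: given directions that span the unwanted subspace, subtract from each feature its component along those directions. The sketch below shows that operation on a fixed, hand-picked orthonormal basis; in OPL the subspace is learned, and how G-OPL ties it to face-presence signals is not reproduced here, so this is only an illustration of the projection step with hypothetical names.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(x: Sequence[float], basis: List[Sequence[float]]) -> List[float]:
    """Remove from x every component lying in span(basis).

    Assumption: the basis vectors are orthonormal, so the projection is
    x - sum_i (x . u_i) u_i.
    """
    out = list(x)
    for u in basis:
        coeff = dot(x, u)
        out = [o - coeff * ui for o, ui in zip(out, u)]
    return out


# Toy feature whose first coordinate stands in for a "sensitive" direction.
feature = [3.0, 4.0, 5.0]
sensitive_basis = [[1.0, 0.0, 0.0]]
cleaned = project_out(feature, sensitive_basis)
```

After the projection the cleaned feature has zero inner product with every basis vector, so a linear probe along those directions can no longer read the removed information.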
[124] CAST: Channel-Aware Spatial Transfer Learning with Pseudo-Image Radar for Sign Language Recognition cs.CVPDF
Md. Shakhoyat Rahman Shujon, Sheikh Md. Galib Mahim, Md. Milon Islam, Md Rezwanul Haque, Md Rabiul Islam
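TL;DR: This paper proposes CAST, a dual-stream architecture for isolated sign language recognition that uses only 60 GHz radar Range-Time Maps (RTM). The framework combines three physics-aware components with pretrained vision backbones and handles clinical and alphabetical gestures under radar-only constraints. First, an explicit decibel-to-linear inversion followed by a windowed fast Fourier transform extracts Cadence Velocity Diagrams (CVD), avoiding the harmonic artifacts produced by spectral analysis of log-compressed signals. Second, a cross-antenna spatial attention module applies attention to the raw antenna channels before convolution, preserving inter-receiver amplitude covariance. Third, an asymmetric cross-attention mechanism fuses the representations of parallel ConvNeXt-Tiny (CVD) and EfficientNetV2-S (RTM) backbones. Experiments show a Top-1 accuracy of 80.5% under 5-fold cross-validation, 3.3 percentage points above the best single-model baseline (77.2%).
Details
Motivation: Address the challenges of isolated sign language recognition when the only sensor modality is magnitude-only 60 GHz radar Range-Time Maps (RTM), with the aim of improving radar-based sign language recognition.
Result: On the SignEval2026 benchmark under 5-fold cross-validation, CAST achieves 80.5% Top-1 accuracy, 3.3 percentage points above the best single-model baseline (77.2%), a new state-of-the-art result.
Insight: The contributions are: 1) physics-aware signal representations (decibel-to-linear inversion and CVD extraction) that avoid harmonic artifacts; 2) a cross-antenna spatial attention module that preserves covariance information between raw channels; 3) an asymmetric cross-attention mechanism that fuses the two streams (CVD and RTM). More broadly, combining signal-processing priors with deep architectures such as pretrained vision backbones is a promising direction for perception under constrained sensor modalities.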
Abstract: We propose CAST, a dual-stream architecture that utilizes channel-aware spatial transfer learning for isolated sign language recognition, addressing the challenges of magnitude-only 60 GHz radar Range-Time Maps (RTM). The proposed framework combines three physics-aware architectures with pretrained vision backbones, which operate under radar-only constraints across clinical and alphabetical gestures. First, an explicit decibel-to-linear inversion is combined with a windowed fast Fourier transform that extracts Cadence Velocity Diagrams (CVD) while avoiding the harmonic artifacts that arise from the spectral analysis of log-compressed signals. Second, a cross-antenna spatial attention module applies attention to raw antenna channels before the convolution, preserving inter-receiver amplitude covariance. Third, an asymmetric cross-attention mechanism fuses representations from parallel ConvNeXt-Tiny (CVD) and EfficientNetV2-S (RTM) backbones. Extensive experiments reveal that the architecture achieves a Top-1 accuracy of 80.5% under 5-fold cross-validation, a 3.3-percentage-point improvement over the best single-model baseline (77.2%). The findings suggest that physics-aware signal representations form a promising direction for radar-only sign language recognition under constrained sensor modalities. The source code is available at: https://github.com/Shakhoyat/CAST-at-SignEval2026.
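The first step in the abstract (decibel-to-linear inversion before a windowed Fourier transform) can be sketched on a toy signal. This is only an illustration under stated assumptions: the dB convention `20*log10(amplitude)`, the periodic Hann window, the mean removal, and the names `db_to_linear` and `windowed_spectrum` are choices made here, not details taken from the paper or its repository.

```python
import cmath
import math

def db_to_linear(db_values):
    # Assumption: the map stores magnitude in dB as 20*log10(amplitude).
    return [10.0 ** (v / 20.0) for v in db_values]

def windowed_spectrum(samples):
    """Magnitude spectrum of one range bin over time (mean removed, Hann window)."""
    n = len(samples)
    mean = sum(samples) / n
    hann = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]
    x = [(s - mean) * w for s, w in zip(samples, hann)]
    # Plain DFT, kept dependency-free; only the non-negative frequencies.
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]

# Toy range bin: amplitude modulated at a single cadence frequency K0.
N, K0 = 64, 4
linear_truth = [1.0 + 0.5 * math.cos(2.0 * math.pi * K0 * t / N) for t in range(N)]
db_map = [20.0 * math.log10(a) for a in linear_truth]

spec_db = windowed_spectrum(db_map)                 # spectrum of log-compressed data
spec_lin = windowed_spectrum(db_to_linear(db_map))  # spectrum after inversion

print("second-harmonic bin, dB input:    ", spec_db[2 * K0])
print("second-harmonic bin, linear input:", spec_lin[2 * K0])
```

In this toy case the log-compressed input shows energy at twice the modulation frequency, while the inverted signal does not, which matches the kind of harmonic artifact the abstract attributes to spectral analysis of log-compressed signals.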
[125] EditSleuth: A Dataset of Grounded Reasoning Chains for Image-Edit Forensics cs.CVPDF
Van-Loc Nguyen, AprilPyone MaungMaung, Minh-Triet Tran, Isao Echizen
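TL;DR: This paper presents EditSleuth, a dataset that supports image-edit forensic reasoning grounded in visual evidence. It contains more than 250,000 image-edit triplets; each sample provides an edit mask, a class label, a difficulty score, and a six-step reasoning chain, with the goal of moving forensic analysis beyond simple real-versus-fake detection toward explainable reasoning.
Details
Motivation: Existing image-forensics datasets usually focus only on detection or localization, while vision-language datasets with reasoning supervision rarely target image manipulation and often rely on LLM-generated rationales whose faithfulness is hard to verify. A dataset is therefore needed that supports localizing edits, identifying their semantic type, and reasoning from visual evidence.
Result: A preliminary learning study shows that fine-tuning Qwen2-VL-2B with LoRA, using the reasoning chain as the supervision target, matches a label-only baseline in classification accuracy among parseable answers, while also producing evidence-grounded explanatory text that label-only supervision cannot provide.
Insight: The contributions are a dataset of deterministically generated six-step reasoning chains with traceable evidence, and a simplified three-component difficulty score that increases score dispersion and avoids the correlation collapse of the original four-component formulation. The construction method keeps the reasoning chains faithful and verifiable.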
Abstract: Forensic analysis of AI-edited images requires more than binary real-versus-fake prediction: a useful system should localize the edit, identify its semantic type, and ground its decisions in visual evidence. Existing image-forensics datasets typically emphasize detection or localization, while reasoning-supervised vision-language datasets rarely target image manipulation and often rely on LLM-generated rationales whose faithfulness is difficult to verify. We introduce EditSleuth, a dataset of 257,725 image-edit triplets constructed from existing image-editing corpora for grounded image-edit forensic reasoning. Each example includes an edited image, its source image, a binary edit mask, a 12-class edit taxonomy label, a difficulty score, and a six-step reasoning chain. EditSleuth chains are generated deterministically from triplet-grounded upstream artifacts, with each statement tied to a specific computable source of evidence. Our analysis reveals that a naive four-component difficulty formulation suffers from a rank-2 correlation collapse among magnitude features; a simplified three-component formulation substantially increases score dispersion on both Pico-Banana and MagicBrush. Difficulty also varies meaningfully within most edit categories, indicating that the score is not a proxy for edit type. As an initial learning study, we fine-tune Qwen2-VL-2B with LoRA and find that chain-as-target supervision matches a label-only baseline on classification accuracy among parseable answers, while additionally yielding grounded explanatory prose that label-only supervision cannot produce. We release the dataset, the deterministic construction pipeline, and pilot training scripts.
[126] Gate-and-Merge: Zero-shot Compositional Personalization of Vision Language Models cs.CV | cs.AIPDF
Guodong Ding, Angela Yao
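TL;DR: This paper proposes Gate-and-Merge, a zero-shot framework for compositional personalization of vision-language models (VLMs). It lets multiple independently learned, user-defined concepts be composed at inference time without co-occurrence training. Concept-specific LoRA adapters are merged directly in weight space, and a gating mechanism suppresses irrelevant activations, giving efficient and stable composition.
Details
Motivation: Solve compositional personalization, in which a VLM must recognize or describe several user-defined concepts jointly at test time, without the reliance of conventional methods on concept co-occurrence training data.
Result: Quantitative and qualitative analyses across multiple personalization tasks, in both single-concept and compositional settings, show consistent performance gains.
Insight: The contribution is a zero-shot composition framework that needs no co-occurrence training: LoRA adapters are merged directly in weight space and a gate-based module selection mechanism keeps concepts disentangled and composition stable, preserving each concept's independence and preventing interference.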
Abstract: This paper tackles compositional personalization of vision-language models (VLMs). In this problem, multiple user-defined concepts must be recognized or described jointly at test time. We introduce Gate-and-Merge, a zero-shot framework that enables compositional personalization without the need for co-occurrence training. During personalization, each concept is learned independently as a lightweight LoRA adapter, paired with a concept token. The base model remains unchanged and concepts are kept disentangled. At inference, we enable composition by merging concept-specific LoRA updates directly in weight space. To suppress irrelevant activations and prevent interference, a gating mechanism is employed to estimate textual and visual cues and select only the modules that contribute to the prediction. We further stabilize composition by combining only the most meaningful and mutually consistent updates, helping preserve each concept’s identity. Our quantitative and qualitative analyses show consistent gains in performance across multiple personalization tasks in both single-concept and compositional settings.
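A minimal sketch of weight-space merging with a gate, assuming each adapter contributes an additive low-rank update `B @ A` to a base matrix. The relevance scores, the fixed threshold, and the function names are hypothetical stand-ins; the abstract does not specify how the gate is computed beyond estimating textual and visual cues.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def merge_lora(base, adapters, scores, threshold=0.5):
    """Add the LoRA updates of gated-in concepts to a copy of the base weights.

    adapters: {concept: (B, A)} with B of shape (d_out, r) and A of shape (r, d_in).
    scores:   {concept: relevance in [0, 1]} (hypothetical gate output).
    """
    merged = [row[:] for row in base]  # the base model itself stays unchanged
    for name, (b_mat, a_mat) in adapters.items():
        if scores.get(name, 0.0) < threshold:
            continue  # gate closed: this concept does not touch the weights
        delta = matmul(b_mat, a_mat)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += value
    return merged

base = [[0, 0], [0, 0]]
adapters = {
    "dog": ([[1], [0]], [[1, 2]]),  # rank-1 update touching row 0
    "mug": ([[0], [1]], [[3, 4]]),  # rank-1 update touching row 1
    "cat": ([[1], [1]], [[9, 9]]),  # irrelevant to the current query
}
scores = {"dog": 0.9, "mug": 0.8, "cat": 0.1}
print(merge_lora(base, adapters, scores))
```

Only the concepts whose gate is open contribute, so an irrelevant adapter such as `"cat"` cannot interfere with the merged weights.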
[127] UniShield: Unified Face Attack Detection via KG-Informed Multimodal Reasoning cs.CVPDF
Hongrui Li, Yichen Shi, Hongyang Wang, Yuhao Gao, Hui Ma
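TL;DR: This paper proposes UniShield, a knowledge-graph-based multimodal reasoning framework for unified face attack detection (UAD). The method builds a Face Attack Knowledge Graph (FAKG) that links attack categories to diagnostic visual cues and attack-conditioned relations, and uses it to generate 52,025 FAKG-QA examples for Attack-Graph Instruction Tuning (AGIT). To further improve reasoning consistency, it introduces Graph-Consistent Reasoning Optimization (GCRO), a GRPO-based objective whose KG-consistency reward encourages generated rationales to match graph-supported cues.
Details
Motivation: Existing unified face attack detection methods rely mainly on appearance correlations or prompting and lack evidence-grounded reasoning, which makes it hard to recognize both physical spoofing and digital forgery within a shared decision space.
Result: Experiments on a multimodal UAD benchmark show that UniShield performs strongly on binary, coarse-grained, and fine-grained protocols, with high accuracy (ACC) and low error (HTER), outperforming discriminative baselines and general-purpose multimodal large language models.
Insight: The contribution is bringing structured attack knowledge (FAKG) into a multimodal reasoning framework and optimizing it with AGIT and GCRO. Tying the knowledge graph to visual cues improves detection accuracy and reasoning reliability, and gives attack detection an interpretable, evidence-grounded decision path.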
Abstract: Unified face attack detection (UAD) requires recognizing physical spoofing and digital forgery within a shared decision space, yet existing discriminative or prompt-based methods largely rely on appearance correlations and provide limited evidence-grounded reasoning. We propose UniShield, a knowledge-grounded multimodal reasoning framework for unified face attack defense. UniShield constructs a Face Attack Knowledge Graph (FAKG) that links attack categories to diagnostic visual cues and attack-conditioned relations, and uses it to synthesize 52,025 FAKG-QA examples for Attack-Graph Instruction Tuning (AGIT). To improve rationale consistency, we further introduce Graph-Consistent Reasoning Optimization (GCRO), a GRPO-based objective with a KG-consistency reward that encourages generated rationales to match graph-supported cues while penalizing incompatible claims. Experiments on our multimodal UAD benchmark show that UniShield achieves strong performance across binary, coarse-grained, and fine-grained protocols, with consistently high ACC and low HTER. These results suggest that structured attack knowledge can improve both detection accuracy and reasoning reliability over discriminative baselines and general-purpose MLLMs. Our code will be released at https://anonymous.4open.science/r/Unishield-A6A3/.
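To make the GCRO description concrete, the sketch below pairs a toy KG-consistency reward with the group-relative advantage normalization used by GRPO (reward minus group mean, divided by group standard deviation). The tiny graph, the cue names, the penalty weight, and the way accuracy and consistency are summed are all assumptions made for illustration, not the paper's reward definition.

```python
import statistics

# Hypothetical miniature knowledge graph: attack category -> supported visual cues.
FAKG = {
    "print": {"moire", "paper_texture"},
    "deepfake": {"blending_boundary", "temporal_flicker"},
}

def kg_reward(predicted, cited_cues, label, graph, penalty=0.5):
    """Accuracy term plus a consistency term that rewards graph-supported cues."""
    supported = graph.get(predicted, set())
    hits = sum(1 for cue in cited_cues if cue in supported)
    misses = len(cited_cues) - hits  # cues the graph does not support
    consistency = (hits - penalty * misses) / max(len(cited_cues), 1)
    accuracy = 1.0 if predicted == label else 0.0
    return accuracy + consistency

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Three sampled responses to the same input whose true label is "print".
group = [
    ("print", ["moire", "paper_texture"]),     # correct, fully supported
    ("print", ["moire", "temporal_flicker"]),  # correct, one incompatible cue
    ("deepfake", ["moire"]),                   # wrong label, unsupported cue
]
rewards = [kg_reward(pred, cues, "print", FAKG) for pred, cues in group]
print(rewards, group_advantages(rewards))
```

Responses whose rationale cites graph-supported cues receive a higher advantage than equally accurate responses that cite incompatible ones.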
[128] From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation cs.CVPDF
Bohan Li, Shuojue Yang, Baorui Peng, Xianda Guo, Erli Zhang
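TL;DR: This paper proposes a kinematic-to-visual lifting paradigm for surgical video generation that converts articulated kinematics into five image-aligned control modalities, and designs a hierarchically routed visual control framework that dynamically selects the most relevant control modalities and motion scales. A budgeted training and inference scheme improves efficiency, and a newly built benchmark confirms gains in action faithfulness, visual fidelity, and cross-domain generalization.
Details
Motivation: Address the core challenge of action-conditioned video generation in robotic surgery: low-dimensional control vectors must precisely govern complex image-space evolution.
Result: On the new benchmark, the method consistently outperforms diverse baselines in action faithfulness, visual fidelity, and cross-domain generalization, and the efficient variant substantially reduces latency while maintaining strong control accuracy.
Insight: The contributions include a representation that converts articulated kinematics into a unified set of image-aligned control modalities, a hierarchical routing mechanism that dynamically allocates conditioning capacity, kinematic-prior-guided routing losses that keep routing physically meaningful and stable, and a budgeted training and inference scheme that exploits routing sparsity for adaptive computation.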
Abstract: Action-conditioned surgical video generation is a critical yet highly challenging problem for robotic surgery. The core difficulty is that low-dimensional control vectors must precisely govern complex image-space evolution. In this work, we propose a kinematic-to-visual lifting paradigm that converts articulated kinematics into a unified set of five image-aligned control modalities. Building on this representation, we introduce a hierarchically routed visual control framework that selectively activates the most relevant control modalities and motion scales. Instead of uniformly applying all control signals, our model performs hierarchical routing to dynamically allocate conditioning capacity. We further design kinematic-prior-guided routing loss functions to ensure physically meaningful, temporally stable, and efficient expert utilization. To improve efficiency, we propose a budgeted training and inference scheme that leverages routing-induced sparsity. By selectively discarding low-significance control pathways during training and execution, our approach enables adaptive computation that is complementary to standard distillation. We additionally construct a new benchmark with curated articulated annotations, obtained through human-in-the-loop semantic labeling and differentiable pose tracking, providing realistic supervision for action-conditioned surgical video generation. Extensive experiments demonstrate that our method consistently improves action faithfulness, visual fidelity, and cross-domain generalization over diverse baselines. Moreover, our efficient variant achieves substantial reductions in latency while maintaining strong control accuracy.
[129] EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing cs.CV | cs.MMPDF
Huilai Li, Xiaomeng Di, Ying Xing, Yonghao Dang, Yiming Wang
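TL;DR: This paper proposes EAR, a framework that enhances uni-modal representations for weakly supervised Audio-Visual Video Parsing (AVVP). It annotates pre-training data with a similarity-based label migration method so that the pseudo-label generator better understands uni-modal events, and it uses a soft constraint to refine uni-modal feature modeling in parallel with multi-modal fusion. Attending jointly to uni-modal and cross-modal representations improves event localization.
Details
Motivation: Existing weakly supervised AVVP methods overemphasize multi-modal fusion and neglect the guidance and preservation of uni-modal semantics, which leads to noisy pseudo-labels and sub-optimal parsing performance. This paper addresses the problem by enhancing uni-modal representations.
Result: On the weakly supervised AVVP task, the method outperforms existing state-of-the-art methods in both pseudo-label generation and AVVP performance.
Insight: The contribution is enhancing the uni-modal representations of both the pseudo-label generator and the AVVP model. Similarity-based label migration and soft-constrained parallel modeling jointly optimize uni-modal and cross-modal representations, offering a new perspective for multi-modal learning.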
Abstract: Weakly supervised Audio-Visual Video Parsing (AVVP) aims to recognize and temporally localize audio, visual, and audio-visual events in videos using only coarse-grained labels. Faced with the challenging task settings, existing research advances along two main paths: pre-training pseudo-label generators for fine-grained cross-modal semantic guidance, or refining AVVP model architectures to enhance audio-visual fusion. However, since audio and visual signals are typically unaligned, achieving accurate video parsing fundamentally relies on precise perception of uni-modal events. Yet these multi-modal focused strategies excessively emphasize multi-modal fusion while inadequately guiding and preserving uni-modal semantics, resulting in noisy pseudo-labels and sub-optimal video parsing performance. This paper proposes a novel framework that enhances uni-modal representations for both the pseudo-label generator and the AVVP model. Specifically, we introduce a similarity-based label migration approach to annotate pre-training data, thereby enabling the pseudo-label generator to better understand uni-modal events. We also employ a soft-constrained manner to refine modeling of uni-modal features in parallel with multi-modal fusion. These designs enable coordinated attention to both uni-modal and cross-modal representations, thus boosting the localization performance for events. Extensive experiments show that our method outperforms state-of-the-art methods in both pseudo-label and AVVP performance.
[130] SynerMedGen: Synergizing Medical Multimodal Understanding with Generation via Task Alignment cs.CVPDF
Weiren Zhao, Yi Dong, Cheng Chen
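TL;DR: SynerMedGen is a unified medical multimodal framework that synergizes understanding and generation tasks through task alignment. It proposes generation-aligned understanding tasks and a two-stage training strategy, achieves strong zero-shot performance on 22 medical image synthesis tasks, and outperforms current state-of-the-art specialized models on synthesis.
Details
Motivation: Existing unified medical models treat understanding and generation as separate objectives with no functional synergy. The paper asks what form of understanding actually benefits generation.
Result: With understanding training alone, the model achieves strong zero-shot performance on 22 medical image synthesis tasks. Combined with generation training, it outperforms current state-of-the-art specialized models and unified models on medical image synthesis.
Insight: The paper proposes the principle of generation-aligned understanding, which achieves synergy between understanding and generation through task alignment. It introduces three generation-aligned understanding tasks and a two-stage training strategy that transfers useful representations learned in the understanding stage to generation tasks, and it releases the large-scale SynerMed dataset to support further research.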
Abstract: Unifying multimodal understanding and generation is a compelling frontier that is beginning to emerge in the medical field. However, the limited existing unified medical models typically treat understanding and generation as disjoint objectives, lacking a meaningful functional synergy. In this work, we identify and address a critical question in unified medical modeling: what form of understanding truly benefits generation. We present SynerMedGen, a unified framework built on the proposed principle of generation-aligned understanding, which synergizes understanding objectives with generation tasks via task alignment. SynerMedGen introduces three generation-aligned understanding tasks and a two-stage training strategy that transfers generation-beneficial representations learned during understanding training to medical image synthesis. Remarkably, even with understanding training alone, our SynerMedGen achieves strong zero-shot performance across 22 medical image synthesis tasks and demonstrates robust generalization to unseen datasets. When combined with generation training, SynerMedGen consistently outperforms state-of-the-art specialized medical image synthesis models as well as recent unified medical models. We also release a large-scale dataset named SynerMed consisting of 1M paired synthesis samples and 2M generation-derived understanding instances to support further research on understanding-generation synergy. Our project can be accessed at https://github.com/Mhilab/SynerMedGen.
[131] Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation cs.CV | cs.GR | cs.MM | cs.SDPDF
Shihao Cheng, Jiaxu Zhang, Quanyue Song, Shansong Liu, Zhizhi Guo
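TL;DR: This paper proposes Unison, a framework that addresses the alignment of motion, speech, and sound effects in human-centric video generation. A semantic-guided audio harmonization strategy decouples speech and sound-effect generation, and a bidirectional cross-modal forcing strategy with progressive stabilization synchronizes audio and motion, improving multimodal consistency and generation quality.
Details
Motivation: Motion, speech, and environmental sound have heterogeneous temporal characteristics, so existing audio-video generation models struggle to keep them consistent, producing noticeable mismatches between modalities. A unified framework that explicitly coordinates multimodal generation is therefore needed.
Result: In extensive experiments, Unison achieves state-of-the-art (SOTA) performance in both audio perceptual quality and cross-modal synchronization, highlighting the importance of explicit multimodal harmonization in human-centric video generation.
Insight: The contributions are a semantic-guided audio harmonization strategy (adaptive recomposition through bidirectional audio cross-attention and semantic-conditioned gating) and a bidirectional cross-modal forcing strategy (the cleaner modality guides the noisier one through decoupled denoising schedules and progressive stabilization). These designs mitigate speech dominance and improve acoustic clarity and synchronization robustness.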
Abstract: Motion, speech, and sound effects are fundamental elements of human-centric videos, yet their heterogeneous temporal characteristics make joint generation highly challenging. Existing audio-video generation models often fail to maintain consistent alignment across these modalities, leading to noticeable mismatches between motion, speech, and environmental sounds. We present Unison, a unified framework that explicitly promotes coherence across the motion, speech, and sound modalities. Within the audio stream, Unison employs a semantic-guided harmonization strategy that decouples the generation of speech and sound-effect components. Leveraging bidirectional audio cross-attention and semantic-conditioned gating for semantic-driven adaptive recomposition, this approach effectively mitigates speech dominance and enhances acoustic clarity. For audio-motion synchronization, we propose a bidirectional cross-modal forcing strategy where the cleaner modality guides the noisier one through decoupled denoising schedules, reinforced by a progressive stabilization strategy. Extensive experiments demonstrate that Unison achieves state-of-the-art performance in both audio perceptual quality and cross-modal synchronization, highlighting the importance of explicit multimodal harmonization in human-centric video generation.
[132] CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models cs.CVPDF
Joowon Kim, Seungho Shin, Joonhyung Park, Eunho Yang
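TL;DR: This paper proposes CollabVR, a framework in which a vision-language model (VLM) and a video generation model (VGM) collaborate in a closed loop at step level. It targets the long-horizon drift and mid-clip simulation errors that existing video generation models show on goal-directed tasks, and thereby improves video reasoning.
Details
Motivation: Existing video generation models are prone to long-horizon drift and simulation errors on complex tasks and lack explicit reasoning. VLMs can supply that reasoning, but conventional upfront planning or post-hoc critique has limited effect, so a finer-grained collaboration mechanism is needed.
Result: On the Gen-ViRe and VBVR-Bench benchmarks, CollabVR outperforms single-inference, Pass@k, and prior test-time scaling baselines at matched compute. It improves both open-source and closed-source video generation models, with the largest gains on the hardest tasks, and stacks with reasoning-oriented fine-tuning.
Insight: The contribution is a step-level closed-loop collaboration framework in which the VLM plans, inspects, and repairs the clips generated by the VGM, folding the verifier's diagnosis directly into the next prompt. This fine-grained VLM-VGM interaction offers a reusable supervision mechanism for compounding errors in video generation.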
Abstract: Recent “Thinking with Video” approaches use Video Generation Models (VGMs) for visual reasoning by producing temporally coherent Chain-of-Frames as reasoning artifacts. Even strong VGMs, however, exhibit two recurring failure modes on goal-directed tasks: long-horizon drift on multi-step tasks and mid-clip simulation errors that compound. Both stem from the absence of explicit reasoning built upon the VGM’s short-horizon visual prior, a role naturally filled by Vision-Language Models (VLMs), but where to place the VLM is non-trivial: upfront plans commit before any frame is generated and post-hoc critiques over whole videos intervene too late. We propose VLM-VGM Collaborative Video Reasoning (CollabVR), a closed-loop framework that couples the VLM with the VGM at step-level granularity: the VLM plans the immediate next action, inspects the clip the VGM generates, and folds the verifier’s diagnosis directly into the next action prompt to repair detected failures. On Gen-ViRe and VBVR-Bench, CollabVR improves both open-source and closed-source VGMs over single-inference, Pass@$k$, and prior test-time scaling baselines at matched compute, with the largest gains on the hardest tasks. It also yields further improvements on top of a reasoning-fine-tuned VGM, indicating that step-level VLM supervision is orthogonal to and stackable with reasoning-oriented fine-tuning. We provide video samples and additional qualitative results at our project page: https://joow0n-kim.github.io/collabvr-project-page.
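The plan, generate, inspect, repair loop described in the abstract can be sketched as plain control flow. Everything concrete below is a toy stand-in: the integer "state", the stub planner, generator, and verifier, and the retry budget are hypothetical and only show where the verifier's diagnosis re-enters the next prompt.

```python
def collaborative_rollout(plan_next, generate_clip, verify, goal_reached,
                          state, max_steps=8, max_retries=2):
    """Step-level closed loop: plan one action, generate a clip, inspect, repair."""
    clips = []
    for _ in range(max_steps):
        if goal_reached(state):
            break
        prompt = plan_next(state, None)  # planner proposes the next action
        clip = None
        for _attempt in range(max_retries + 1):
            clip = generate_clip(state, prompt)        # generator renders one step
            ok, diagnosis = verify(state, prompt, clip)
            if ok:
                break
            prompt = plan_next(state, diagnosis)       # fold diagnosis into prompt
        clips.append(clip)  # the last attempt is kept even if it still fails
        state = clip["end_state"]
    return clips, state

# Toy stubs: the state is a counter and each step should advance it by one.
def plan_next(state, feedback):
    text = f"advance from {state} to {state + 1}"
    return text + (f"; fix: {feedback}" if feedback else "")

def generate_clip(state, prompt):
    step = 1 if "fix:" in prompt else 2  # overshoots unless told what to fix
    return {"prompt": prompt, "end_state": state + step}

def verify(state, prompt, clip):
    error = clip["end_state"] - (state + 1)
    return (error == 0, "" if error == 0 else f"overshot by {error}")

clips, final_state = collaborative_rollout(
    plan_next, generate_clip, verify, lambda s: s >= 3, state=0)
print(final_state, [c["prompt"] for c in clips])
```

Because each clip is checked before the next action is planned, an error is repaired at the step where it occurs instead of compounding across the rest of the video.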
[133] Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models cs.CVPDF
Mashrafi Monon, Umaima Rahman, Asif Hanif, Numan Saeed, Mohammad Yaqub
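TL;DR: This paper introduces CT-SpatialVQA, a benchmark designed to evaluate the semantic-spatial reasoning of 3D medical vision-language models on CT data. It contains 9,077 clinically grounded question-answer pairs derived from 1,601 radiology reports and CT volumes, validated with an LLM-assisted pipeline at a 95% human consensus agreement rate. Existing 3D medical VLMs perform poorly on the task, averaging only 34% accuracy and often falling below random, which points to insufficient integration of volumetric evidence.
Details
Motivation: 3D medical vision-language models perform well on medical VQA and report generation, but it is unclear whether they learn spatial anatomy from 3D volumes or rely mainly on priors and language correlations. The lack of a systematic evaluation of semantic-spatial reasoning in volumetric medical VLMs hinders their use for clinically reliable decision support.
Result: Eight 3D medical VLMs were evaluated on CT-SpatialVQA. They degrade severely on semantic-spatial reasoning tasks, averaging only 34% accuracy and often falling below random guessing.
Insight: The contribution is the first benchmark dedicated to evaluating semantic-spatial reasoning in 3D medical VLMs; its questions require explicit anatomical localization, laterality awareness, structural comparison, and relational reasoning between 3D structures. The study exposes a major weakness in the spatial understanding of current 3D medical VLMs and underlines the need for deeper integration of volumetric evidence in trustworthy clinical applications.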
Abstract: Recent advances in 3D medical vision-language models have enabled joint reasoning over volumetric images and text, showing strong performance in medical visual question-answering (VQA) and report generation. Despite this progress, it remains unclear whether these models learn spatially grounded anatomy from 3D volumes or rely primarily on learned priors and language correlations. This uncertainty stems from the lack of systematic evaluation of semantic-spatial reasoning in volumetric medical VLMs for clinically reliable decision support. To address this gap, we introduce CT-SpatialVQA, a benchmark designed to evaluate semantic-spatial reasoning in 3D CT data. The benchmark comprises 9077 clinically grounded question-answer (QA) pairs derived directly from 1601 radiology reports and CT volumes, which are validated via a robust LLM-assisted pipeline with a 95% human consensus agreement rate. Our dataset requires explicit anatomical localization, laterality awareness, structural comparison, and 3D inter-structure relational reasoning. We also introduce a standardized evaluation protocol and benchmark eight 3D medical VLMs, finding severe degradation on semantic-spatial reasoning tasks, averaging 34% accuracy and often below random, highlighting the need for deeper integration of volumetric evidence for trustworthy clinical use.
[134] PPU-Bench: Real World Benchmark for Personalized Partial Unlearning in Vision Language Models cs.CV | cs.AIPDF
Jiahui Guang, Zexun Zhan, Zhenlin Xu, Cuiyun Gao, Haiyan Wang
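TL;DR: This paper presents PPU-Bench, a real-world benchmark for evaluating personalized partial unlearning in multimodal large language models. It contains 24K multimodal and unimodal samples built from pre-existing knowledge about 500 public figures, organized into three progressively harder settings. Experiments reveal key problems under the different unlearning settings, and the paper proposes Boundary-Aware Optimization to model the forget-retain boundary within a subject.
Details
Motivation: Existing MLLM unlearning benchmarks rely on synthetic knowledge injection or complete subject-level deletion and cannot capture realistic, personalized deletion requests that require fine-grained factual control. A real-world, fine-tuning-free benchmark for personalized partial unlearning is therefore needed.
Result: Extensive experiments on PPU-Bench show that complete unlearning often suppresses visual identity rather than factual knowledge, while selective and personalized unlearning expose significant forget-retain trade-offs and difficulties with intra-subject factual boundaries. Results on two representative methods show that the proposed Boundary-Aware Optimization effectively enforces intra-subject factual boundaries.
Insight: The contribution is the first real-world, fine-tuning-free benchmark for personalized partial unlearning, together with a systematic account of the core challenges under different unlearning settings (forget-retain trade-offs, blurred factual boundaries). Boundary-Aware Optimization offers a new way to model knowledge boundaries within a subject and is relevant to fine-grained, controllable unlearning.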
Abstract: Multimodal Large Language Models (MLLMs) may memorize sensitive cross-modal information during pretraining. However, existing MLLM unlearning benchmarks rely on synthetic knowledge injection or complete subject-level deletion, which fail to capture realistic, personalized deletion requests that require fine-grained factual control. In this paper, we introduce PPU-Bench, a real-world and fine-tuning-free benchmark for personalized partial unlearning in MLLMs. PPU-Bench contains 24K multimodal and unimodal samples derived from pre-existing knowledge of 500 public figures under three progressively challenging settings: Complete, Selective, and Personalized unlearning. The benchmark evaluates whether methods can remove target knowledge while preserving non-target facts, model utility, and cross-modal consistency. Extensive experiments show that Complete Unlearning often suppresses visual identity rather than factual knowledge, while Selective and Personalized Unlearning expose significant forget–retain trade-offs and challenges in intra-subject factual boundaries. Robustness analysis under cross-image and prompt-based attacks reveals distinct vulnerabilities across different unlearning settings. Motivated by these findings, we propose Boundary-Aware Optimization (BAO), which explicitly models intra-subject forget-retain boundaries. Experimental results on two representative methods demonstrate that BAO can effectively enforce intra-subject factual boundaries.
[135] CoLVR: Enhancing Exploratory Latent Visual Reasoning via Contrastive Optimization cs.CVPDF
Ziyang Ding, Linjian Meng, Yiming Wu, Yuhan Li, Yuhao Liu
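TL;DR: This paper proposes CoLVR, which uses contrastive optimization to make latent visual reasoning more exploratory. The method has two stages: first, a latent contrastive objective guided by angle-based perturbation learns diverse, exploratory representations; second, reinforcement learning post-training with a latent trajectory contrastive reward fine-tunes the reasoning process. Experiments show clear gains on several benchmarks.
Details
Motivation: Existing methods typically rely on hard alignment objectives that force latent representations to match predefined visual features, which severely limits how exploratory latent reasoning can be. This paper addresses that limitation to strengthen the exploratory ability of multimodal large language models in latent visual reasoning.
Result: Average improvements of 5.83% on VSP and 8.00% on Jigsaw, plus a 3.40% gain on the out-of-domain benchmark MMStar, outperforming existing latent models.
Insight: The contribution is a two-stage latent contrastive training framework: a contrastive objective guided by angle-based perturbation avoids over-constrained embeddings, and a trajectory contrastive reward enables fine-grained optimization. Together they encourage diverse reasoning behaviors and more exploratory representations.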
Abstract: Because Latent Visual Reasoning has the potential for exploratory reasoning, recent works enable MLLMs (Multimodal Large Language Models) to perform visual reasoning by propagating continuous hidden states instead of decoding intermediate steps into discrete tokens. However, existing works typically rely on hard alignment objectives to force latent representations to match predefined visual features, thereby severely limiting the exploratory nature of the latent reasoning process. To address this problem, we propose CoLVR (Contrastive Optimization for Latent Visual Reasoning). To obtain more exploratory visual reasoning, CoLVR introduces a latent contrastive training framework. First, CoLVR learns diverse and exploratory representations with a latent contrastive objective guided by angle-based perturbation, which expands the semantic latent space and avoids over-constrained embeddings. Then, CoLVR employs a latent trajectory contrastive reward for RL (Reinforcement Learning) post-training to enable fine-grained optimization of the latent visual reasoning process, thus fostering diverse reasoning behaviors. Experiments demonstrate that CoLVR significantly enhances the exploratory capability of latent representations, achieving average improvements of 5.83% on VSP and 8.00% on Jigsaw, while also outperforming existing latent models on out-of-domain benchmarks, with a 3.40% gain on MMStar. The data, code, and models are released at https://github.com/Oscar-dzy/CoLVR.
[136] LightAVSeg: Lightweight Audio-Visual Segmentation cs.CVPDF
Qing Zhong, Guodong Ding, Lingqiao Liu, Zaiwen Feng, Lin Yuanbo Wu
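TL;DR: LightAVSeg is a lightweight audio-visual segmentation framework for efficiently localizing, at pixel level, the objects that emit sound in a video. It replaces compute-heavy cross-modal attention with a decoupled design for semantic filtering and spatial grounding, which gives linear computational cost and efficient inference on a mobile processor.
Details
Motivation: Existing audio-visual segmentation models depend on costly dense cross-modal attention, which limits deployment in resource-constrained environments. This paper targets that efficiency bottleneck, focusing on a lightweight design for the interaction module.
Result: 50.4 mIoU on the MS3 benchmark with only 20.5M parameters (about 1/7 of AVSegFormer), a new state of the art (SOTA) among lightweight methods.
Insight: The contribution is replacing quadratic-cost attention with a decoupled design for semantic filtering and spatial grounding, and adding an auxiliary alignment loss with no inference overhead to strengthen semantic consistency during training. The result scales linearly in computation and deploys efficiently on mobile hardware.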
Abstract: Audio-Visual Segmentation (AVS) targets pixel-level localization of sound-emitting objects in videos. However, existing models rely on dense cross-modal attention with quadratic computational cost, limiting their suitability for resource-efficient deployment. Most efficiency-oriented methods focus on backbone reduction and overlook the interaction module as the primary bottleneck. This paper proposes LightAVSeg, a lightweight framework that replaces heavy attention with a decoupled design for semantic filtering and spatial grounding, resulting in interaction costs that scale linearly with spatial resolution. Furthermore, we introduce an auxiliary alignment loss to enforce semantic consistency during training with zero inference overhead. Extensive experiments demonstrate that LightAVSeg achieves a new state-of-the-art among lightweight methods: with 20.5M parameters (~1/7 of AVSegFormer), it reaches 50.4 mIoU on the MS3 benchmark and enables efficient inference on a mobile processor.
[137] FraudBench: A Multimodal Benchmark for Detecting AI-Generated Fraudulent Refund Evidence cs.CV | cs.AI | cs.CRPDF
Xinyu Yan, Boyang Chen, Jiaming Zhang, Tiantong Wu, Hong Xi Tae
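TL;DR: This paper presents FraudBench, a multimodal benchmark for detecting AI-generated fraudulent refund evidence. It is built from user-review evidence in real e-commerce, food delivery, and travel-service scenarios and contains real-damaged evidence, real-undamaged evidence, and fake-damaged evidence synthesized with six advanced image editing and generation models. The study evaluates multimodal large language models, specialized AI-generated image detectors, and human participants, and finds that existing methods fall well short at detecting fake-damaged evidence.
Details
Motivation: AI-generated images are increasingly realistic and easy to adapt to concrete real-world claims, which creates new challenges for verifying visual evidence, in particular the risk of AI-generated refund fraud. Existing benchmarks mainly evaluate standalone authenticity classification, cross-generator transfer, or forensic localization, and do not evaluate claim-conditioned fraudulent evidence detection.
Result: Current multimodal large language models (MLLMs) recognize real-damaged evidence but fail on many fake-damaged subsets, with fake-damage detection rates (TPR) far below the 50% baseline on most generator subsets. Specialized detectors do better but are inconsistent across generators and can produce false positives on real-damaged samples, revealing a clear gap between generic AI image detection and reliable claim-conditioned refund-evidence verification.
Insight: The contribution is the first multimodal benchmark for AI-generated evidence detection in a concrete claim scenario (refund fraud), which stresses conditioning verification on both text (reviews, product metadata) and images. The study shows the limits of current advanced models on specific, complex real-world fraud detection and points toward more robust, scenario-aware detectors.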
Abstract: Artificial Intelligence (AI)-generated images have become increasingly realistic and readily adaptable to concrete real-world claims, creating new challenges for verifying visual evidence. A concrete emerging risk is AI-generated refund fraud, in which manipulated or synthetic images are used to support claims about damaged products, poor delivery conditions, or service-related defects. Existing AI-generated image detection benchmarks mainly evaluate standalone authenticity classification, cross-generator transfer, or forensic localization, leaving claim-conditioned fraudulent evidence detection underexplored. To bridge this gap, we introduce FraudBench, a multimodal benchmark for detecting AI-generated fraudulent refund evidence. FraudBench is constructed from real-world user-review evidence across e-commerce, food delivery, and travel-service scenarios. We curate real evidence images together with their associated review and product metadata, identify genuine damaged and undamaged evidence through MLLM-assisted filtering and human annotation, and synthesize fake-damaged evidence from genuine undamaged reference images using six state-of-the-art image editing and generation models. Using FraudBench, we evaluate MLLMs, specialized AI-generated image detectors, and human participants under the same settings. Experiments show that current MLLMs often recognize real-damaged evidence but fail on many fake-damaged subsets, with fake-damage detection rates (TPR) far below the 50% baseline on most generator subsets. Specialized detectors generally perform better but remain inconsistent across generators and can produce false positives on real-damaged samples, revealing a clear gap between generic AI image detection and reliable claim-conditioned refund-evidence verification.
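The per-subset metrics mentioned in the abstract (fake-damage detection rate, i.e. TPR, per generator subset, and false positives on real-damaged samples) can be computed as in the sketch below. The record layout and the subset names are hypothetical; the benchmark's actual data format is not described in the abstract.

```python
from collections import defaultdict

def detection_rates(records):
    """Per-subset TPR and FPR, treating 'fake' as the positive class.

    records: iterable of (subset, is_fake, predicted_fake) tuples.
    A rate is None when the subset has no samples of the required class.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for subset, is_fake, predicted_fake in records:
        c = counts[subset]
        if is_fake:
            c["tp" if predicted_fake else "fn"] += 1
        else:
            c["fp" if predicted_fake else "tn"] += 1
    rates = {}
    for subset, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        rates[subset] = {
            "tpr": c["tp"] / positives if positives else None,
            "fpr": c["fp"] / negatives if negatives else None,
        }
    return rates

# Hypothetical predictions: one generator subset and the real-damaged subset.
records = [
    ("generator_a", True, True),
    ("generator_a", True, False),
    ("generator_a", True, False),
    ("generator_a", True, False),
    ("real_damaged", False, False),
    ("real_damaged", False, False),
    ("real_damaged", False, True),
    ("real_damaged", False, False),
]
print(detection_rates(records))
```

Reporting TPR per generator subset alongside FPR on real-damaged samples keeps a detector from looking good simply by flagging every damaged-looking image as fake.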
[138] Rethinking Event-Based Object Detection through Representation-Level Temporal Aggregation and Model-Level Hypergraph Reasoning cs.CVPDF
Meisen Wang, Hao Deng, Wei Bao, Ma Yuanxiao, Chengjie Wang
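TL;DR: This paper proposes Ev-DTAD, a unified object detection framework for event cameras that addresses the limitations of existing methods at both the representation and model levels. Hierarchical Temporal Aggregation (HTA) builds a compact pseudo-RGB representation that encodes temporal information explicitly, and a Frequency-aware Hypergraph Temporal Fusion (FHTF) module performs high-order relational reasoning and feature enhancement over sparse, fragmented event responses. Experiments on several benchmark datasets show a good balance of accuracy and efficiency.
Details
Motivation: Event cameras offer high temporal resolution, low latency, and high dynamic range, which suits perception under fast motion and challenging illumination. Existing event-based detectors are limited in two ways: event representations usually encode temporal information indirectly through redundant structures, and detection models struggle to explicitly aggregate fragmented event responses into coherent high-order object features.
Result: +0.8 mAP and 1.7× faster on Gen1, +0.5 mAP and 1.6× faster on 1Mpx/Gen4, and +3.0 mAP and 2.0× faster on eTraM. Ev-DTAD reaches a competitive accuracy-efficiency trade-off on multiple benchmarks, confirming that compact temporal representation and temporal-hypergraph feature reasoning are complementary.
Insight: The contribution is unifying representation-level temporal encoding and model-level temporal-hypergraph reasoning in one framework: 1) the HTA representation, a compact three-channel pseudo-RGB form that explicitly embeds the temporal information of intra-window and inter-window events; 2) the FHTF module, which refines multi-scale event features through temporal evolution modeling and high-order relational reasoning. Pairing a compact low-level representation with high-order model reasoning is a new route to handling the sparsity and fragmentation of event data.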
Abstract: Event cameras provide microsecond-level temporal resolution, low latency, and high dynamic range, offering potential for perception under fast motion and challenging illumination conditions. However, existing Event-based Object Detection (EOD) methods face limitations at both the representation and model levels: prior event representations usually encode temporal information indirectly through redundant structures, while detection models struggle to explicitly aggregate fragmented event responses into coherent high-order object features. To address these limitations, we present Event Dual Temporal-Relational Aggregation Detector (Ev-DTAD), a unified EOD framework that integrates representation-level temporal encoding with model-level temporal-hypergraph reasoning. Specifically, we introduce Hierarchical Temporal Aggregation (HTA), a compact three-channel pseudo-RGB representation that explicitly embeds temporal information across intra- and inter-window events. To further enhance detection under sparse and fragmented event responses, we propose Frequency-aware Hypergraph Temporal Fusion (FHTF), which refines multi-scale event features through temporal evolution modeling and high-order relational reasoning. Extensive experiments on Gen1 (+0.8 mAP and 1.7$\times$ faster), 1Mpx/Gen4 (+0.5 mAP and 1.6$\times$ faster), and eTraM (+3.0 mAP and 2.0$\times$ faster) demonstrate that Ev-DTAD achieves a competitive accuracy-efficiency trade-off, validating the complementarity between compact temporal representation and temporal-hypergraph feature reasoning.
[139] VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving cs.CV | cs.AI | cs.ROPDF
Rui Zhao, Jianlin Yu, Zhenhai Gao, Jiaqiao Liu, Fei Gao
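TL;DR: VECTOR-DRIVE is a tightly coupled vision-language-action (VLA) framework for end-to-end autonomous driving, built on Qwen2.5-VL-3B. It keeps all tokens coupled through shared self-attention while routing feed-forward computation to different expert modules according to token semantics: a Vision-Language Expert handles semantic priors and a Trajectory Expert handles motion planning. The framework couples semantic reasoning and motion planning inside a single multimodal Transformer while separating task-specific computation.
Details
Motivation: Resolve the coupling trade-off in existing end-to-end driving VLA models: a fully shared backbone can entangle language reasoning with trajectory prediction, whereas a decoupled reasoning-action pipeline weakens the coupling between semantics and motion.
Result: On the Bench2Drive benchmark, VECTOR-DRIVE achieves a Driving Score of 88.91, outperforming representative end-to-end and VLA-based baselines and reaching SOTA.
Insight: The contribution is a semantic-aware expert routing mechanism: on top of shared attention, feed-forward computation is routed to different experts by token semantics, which gives tight coupling between semantic reasoning and motion planning within one model while decoupling the tasks. The flow-matching action decoder is a further technical highlight.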
Abstract: End-to-end autonomous driving requires models to understand traffic scenes, infer driving intent, and generate executable motion plans. Recent vision-language-action (VLA) models inherit semantic priors from large-scale vision-language pretraining, yet still face a coupling trade-off: fully shared backbones preserve multimodal interaction but may entangle language reasoning and trajectory prediction, whereas decoupled reasoning-action pipelines reduce task conflict but weaken semantic-motion coupling. We propose VECTOR-DRIVE, a tightly coupled VLA framework built on Qwen2.5-VL-3B. VECTOR-DRIVE keeps all tokens coupled through shared self-attention and routes feed-forward computation according to token semantics. Vision and language tokens are processed by a Vision-Language Expert to preserve semantic priors, while target-point, ego-state, and noisy action tokens are routed to a Trajectory Expert for motion-specific computation. On the action-token pathway, a flow-matching planner refines noisy action tokens into future waypoints and speed profiles. This design couples semantic reasoning and motion planning within a single multimodal Transformer while separating task-specific FFN computation. On Bench2Drive, VECTOR-DRIVE achieves 88.91 Driving Score and outperforms representative end-to-end and VLA-based baselines. Qualitative results and ablations further validate the benefits of shared attention, semantic-aware expert routing, progressive training, and flow-based action decoding.
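The routing idea in this abstract (shared attention, then a per-token choice of feed-forward expert) can be illustrated with a minimal sketch. This is not the authors' implementation: the token-kind labels, the function names, and the toy "experts" below are all hypothetical placeholders, and the shared self-attention step is omitted entirely.

```python
from typing import List, Tuple

Vector = List[float]

# Token groups named in the abstract; the string labels are hypothetical.
VL_KINDS = {"vision", "language"}
TRAJ_KINDS = {"target_point", "ego_state", "noisy_action"}


def vl_expert(x: Vector) -> Vector:
    # Placeholder standing in for the Vision-Language Expert FFN.
    return [2.0 * v for v in x]


def trajectory_expert(x: Vector) -> Vector:
    # Placeholder standing in for the Trajectory Expert FFN.
    return [v + 1.0 for v in x]


def route_ffn(tokens: List[Tuple[str, Vector]]) -> List[Vector]:
    """Send each token to one expert based on its semantic kind.

    Token order is preserved, so the sequence can be fed back into a
    shared attention layer (not modeled here).
    """
    out: List[Vector] = []
    for kind, x in tokens:
        if kind in VL_KINDS:
            out.append(vl_expert(x))
        elif kind in TRAJ_KINDS:
            out.append(trajectory_expert(x))
        else:
            raise ValueError(f"unknown token kind: {kind}")
    return out
```

The routing here is deterministic and keyed on token type, as the abstract describes, rather than learned by a gating network.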
[140] Illusion-Aware Visual Preprocessing and Anti-Illusion Prompting for Classic Illusion Understanding in Vision-Language Models cs.CVPDF
Junli Zha, Jiahui Wang, Xinkai Lu, Jinbo Wang
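TL;DR: This paper proposes a training-free framework that addresses the perception-versus-memory conflict vision-language models exhibit on classic visual illusion tasks. It combines three complementary strategies: illusion-aware image preprocessing, anti-illusion prompt engineering, and a multi-vote ensemble.
Details
Motivation: Vision-language models show a systematic bias when facing visual illusions, tending to recall memorized facts rather than perceive the actual visual differences; the paper aims to resolve this conflict between perception and memory.
Result: On the official 630-image test set of Task 1 of the CVPR 2026 DataCV Challenge, the method reaches 90.48% accuracy using a Claude model with a 5-vote majority ensemble, and 98.41% on a human-verified subset. It finished in 2nd place, only 0.47% behind the 1st-place solution.
Insight: The novelty is that, without any fine-tuning, visual manipulation and prompt design alone steer the model toward qualitative visual comparison; combining type-specific image transformations (such as edge extraction and color isolation) with an ensemble strategy effectively improves robustness to visual illusions.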
Abstract: Vision-Language Models (VLMs) exhibit systematic bias toward visual illusions, recalling memorized facts rather than perceiving actual visual differences. This paper presents a training-free framework for the 5th DataCV Challenge Task 1 at CVPR 2026, addressing this perception-versus-memory conflict through three complementary strategies: (1) illusion-aware image preprocessing that weakens illusion-inducing context via type-specific transformations (edge extraction, color isolation, morphological processing, and reference-line overlay), (2) anti-illusion prompt engineering guiding VLMs toward qualitative visual comparison, and (3) a multi-vote ensemble that further improves robustness. Our method achieves 90.48% accuracy on the official 630-image test set using Claude (claude-opus-4-6) with a 5-vote majority ensemble, and 98.41% on a human-verified subset. The approach requires no finetuning, relying solely on visual manipulation and prompt design. Our solution secured 2nd place in the challenge, only 0.47% behind the 1st-place solution. Code is available at https://github.com/jasminezz/sf-illusion-aware-vlm.git.
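The multi-vote ensemble step can be sketched as a plain majority vote over repeated model answers. The abstract does not say how ties are broken or how answers are normalized, so the tie rule below (first answer to reach the top count, in order of appearance) and the function name are assumptions made for illustration.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer among repeated model calls.

    Assumption: ties are resolved in favor of the answer that appears
    first in the list; the paper may use a different rule.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer
    raise AssertionError("unreachable: some answer must hold the top count")
```

With five votes and two candidate answers, as in the 5-vote setup reported above, a tie cannot occur; the tie rule only matters when more than two distinct answers are returned.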
[141] ProDG: Prototypes for Data-Free Generative Post-Hoc Explainability cs.CVPDF
Piotr Borycki, Magdalena Trędowicz, Jacek Tabor, Łukasz Struski, Przemysław Spurek
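TL;DR: ProDG proposes a generative post-hoc explainability framework that needs no external data: a generative model synthesizes high-fidelity prototypes directly from the weights of a frozen model, removing the dependence of existing prototype methods on a data subset.
Details
Motivation: Existing prototype-based post-hoc explainability methods avoid retraining the neural network, but still rely on a data subset such as a test or validation set to extract visual prototypes. This is limiting in privacy-sensitive domains (such as healthcare and finance) where the data is inaccessible.
Result: The paper validates ProDG on multiple benchmark datasets. The results indicate that the generated prototypes are of high quality and provide interpretability comparable to data-dependent methods, setting a new standard for explainable AI (XAI) in settings where data is inaccessible.
Insight: The novelty is the first fully data-free prototype generation: a generative model synthesizes prototypes directly from model weights, removing post-hoc explainability's dependence on the original data and extending it to applications with strict privacy constraints.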
Abstract: Ante-hoc interpretability methods based on prototypes provide highly accurate explanations by utilizing the intuitive “this looks like that” reasoning paradigm. On the other hand, post-hoc models can explain predictions for a single image without relying on an underlying dataset or requiring costly neural network retraining. Recent approaches successfully solve the retraining problem for prototype-based networks. However, they still face a fundamental limitation: they require access to a subset of data (e.g., a test or validation set) to search for and extract the visual prototypes. In this paper, we address this issue and introduce ProDG: Generative Prototypes for Data-Free Post-Hoc Explainability, a novel framework that leverages generative models to synthesize pure, high-fidelity prototypes directly from the frozen model’s weights, completely eliminating the dependency on any external data. By establishing this new frontier in Data-Free XAI, ProDG unlocks robust visual interpretability for privacy-sensitive domains, where original data is strictly restricted or fundamentally inaccessible. Project page: https://github.com/piotr310100/ProDG
[142] Semantic Alignment in Hyperbolic Space for Open-Vocabulary Semantic Segmentation cs.CVPDF
Hoang M. Truong, Hai Nguyen-Truong, Dang Huynh
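TL;DR: This paper proposes HyRo, a hyperbolic fine-tuning framework that addresses the mismatch between hierarchical structure and semantic alignment in the embedding space for open-vocabulary semantic segmentation. The method decouples hierarchical alignment from semantic alignment in the Poincaré ball model: it adjusts the hyperbolic radius for hierarchical alignment, and performs angular alignment through an orthogonal transformation that preserves the hyperbolic radius to refine semantic relationships.
Details
Motivation: Open-vocabulary semantic segmentation requires adapting image-level vision-language models (such as CLIP) to pixel-level prediction, which is challenging because hierarchical structure and semantic alignment are mismatched in the embedding space. Existing methods use hyperbolic geometry to model hierarchical relationships, but only align embeddings across hierarchical levels and overlook semantic misalignment among embeddings within the same level.
Result: Experiments on standard open-vocabulary semantic segmentation benchmarks show that HyRo achieves state-of-the-art (SOTA) performance, surpassing prior methods.
Insight: The novelty is explicitly decoupling hierarchical alignment from semantic alignment in hyperbolic space (the Poincaré ball model): hierarchical relationships are handled by adjusting the hyperbolic radius, while semantic relationships are refined by angular alignment through an orthogonal transformation that theoretically preserves the hyperbolic radius, making more effective use of hyperbolic geometry for open-vocabulary segmentation.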
Abstract: Open-vocabulary semantic segmentation requires adapting image-level vision-language models such as CLIP to dense pixel-level prediction, which is challenging due to the mismatch between hierarchical structure and semantic alignment in the embedding space. While recent works leverage hyperbolic geometry to model hierarchical relationships, they align embeddings across hierarchical levels but overlook semantic misalignment among embeddings within the same level. In this work, we propose HyRo, a hyperbolic fine-tuning framework that decouples hierarchical and semantic alignment in the Poincaré ball model. HyRo aligns hierarchical levels by adjusting the hyperbolic radius and refines semantic relationships through angular alignment using an orthogonal transformation that theoretically preserves the hyperbolic radius. Experiments on standard open-vocabulary semantic segmentation benchmarks demonstrate that HyRo achieves state-of-the-art performance over prior methods.
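Why an orthogonal transformation preserves the hyperbolic radius can be checked numerically. The sketch below assumes the unit-curvature Poincaré ball, where the distance from the origin is 2·artanh(‖x‖); the abstract does not state the curvature or parameterization HyRo uses, and the function names are hypothetical. A 2D rotation stands in for a general orthogonal matrix.

```python
import math
from typing import List


def hyperbolic_radius(x: List[float]) -> float:
    """Distance from the origin in the unit-curvature Poincare ball."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm >= 1.0:
        raise ValueError("point must lie strictly inside the unit ball")
    return 2.0 * math.atanh(norm)


def rotate2d(x: List[float], theta: float) -> List[float]:
    """Apply a 2D rotation, the simplest orthogonal transformation."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * x[0] - s * x[1], s * x[0] + c * x[1]]


def rescale_radius(x: List[float], target_radius: float) -> List[float]:
    """Move a point along its ray so its hyperbolic radius equals target_radius."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        raise ValueError("direction is undefined at the origin")
    new_norm = math.tanh(target_radius / 2.0)
    return [v * new_norm / norm for v in x]
```

Rotation changes only the direction of a point (the angular, semantic part) and leaves its Euclidean norm, and therefore its hyperbolic radius, unchanged; rescaling does the opposite. This is the sense in which the two alignments can be adjusted independently.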
[143] Unified Modeling of Lane and Lane Topology for Driving Scene Reasoning cs.CVPDF
Han Li, Yulu Gao, Si Liu, Yuhang Wang, Bo Liu
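TL;DR: This paper proposes UniTopo, a method that models lanes and their topology in a unified way. It represents lane topology as connected lanes (predecessor lanes, successor lanes, and their interconnections), so that lane positions and topology are obtained together in a shared perception pipeline, establishing a new paradigm of perceiving lane topology directly from raw image features. The method reaches TOP_ll scores of 30.1% and 31.8% on the two subsets of the OpenLane-V2 benchmark, significantly surpassing the previous best method, T^2SG.
Details
Motivation: Autonomous vehicles need to perceive both physical elements of the driving scene (such as lane lines and traffic lights) and logical elements (such as lane centerlines and their topology). Existing lane topology reasoning methods typically follow a reasoning-by-detection paradigm in which topological relationships are derived mainly from lane detection results, which limits efficiency and accuracy.
Result: On the two subsets of the OpenLane-V2 benchmark, built on Argoverse2 and nuScenes respectively, UniTopo achieves TOP_ll scores of 30.1% and 31.8%, improving on the existing SOTA method T^2SG by 6.0% and 8.6%.
Insight: The novelty is representing lane topology uniformly as connected lanes, which enables joint perception of lane positions and topology and avoids the separate processing of the traditional reasoning-by-detection paradigm, improving reasoning efficiency and accuracy. Viewed objectively, this end-to-end unified modeling paradigm offers a new approach to driving scene reasoning and could extend to other perception tasks that must handle physical and logical elements together.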
Abstract: Autonomous vehicles need to perceive not only physical elements in the driving scene, such as lane lines and traffic lights, but also logical elements like lane centerlines and their topology. Existing lane topology reasoning methods typically follow a reasoning-by-detection paradigm, where lane topological relationships are primarily derived from lane detection results. In this paper, we propose an innovative method called Unified Modeling of Lane and Lane Topology (UniTopo), which represents the topological relationships between lanes as connected lanes, encompassing predecessor lanes, successor lanes, and their interconnections. This unified representation of lanes and lane topology allows us to simultaneously obtain both the positions and topological information of lanes within a shared perception pipeline, establishing a new paradigm for directly perceiving lane topology from original image features. We validate our method on the driving scene reasoning benchmark OpenLane-V2, which consists of two subsets, built based on Argoverse2 and nuScenes, respectively. Our method achieves TOP_ll of 30.1% and 31.8% on the two subsets, significantly surpassing the existing state-of-the-art method T^2SG by 6.0% and 8.6%.
[144] PIDNet: Progressive Implicit Decouple Network for Multimodal Action Quality Assessment cs.CVPDF
Qiqi Li, Pengfei Wang, Nenggan Zheng
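TL;DR: This paper proposes PIDNet, a progressive implicit decoupling and fusion network for multimodal action quality assessment (AQA). An iMambaWave module maps RGB, optical flow, and audio features into a shared latent space and disentangles them to capture long-range temporal dependencies and local perturbation details separately, and a gated aggregation mechanism fuses temporal and frequency-domain information. Group3M blocks then form a three-stage progressive fusion network in which modality complementary attention retrieves cross-modal evidence while suppressing redundancy, and multi-scale convolutions enrich the feature representations.
Details
Motivation: Existing multimodal AQA methods usually rely on coarse fusion or unified temporal modeling, which can blur modality-specific cues, preserve cross-modal redundancy, and weaken stage-specific quality evidence. The paper aims to address these issues by integrating modality-specific information, cross-modal complementary cues, and global quality semantics more precisely.
Result: Experiments on the Rhythmic Gymnastics and Fis-V datasets show that PIDNet achieves highly competitive score correlation with favorable error control, outperforming existing unimodal and multimodal methods. Ablation studies verify the effectiveness of each module. In addition, the iMambaWave module consistently improves visual representation and temporal modeling across multiple backbones, showing good generalization and plug-and-play capability.
Insight: The novelty is the progressive implicit decoupling and fusion framework together with the iMambaWave module, which combines a Bi-Mamba branch (for long-range dependencies) with a wavelet-transform branch (for local details) and adaptively fuses temporal and frequency-domain information through a gating mechanism. This design handles the heterogeneity and temporal progression of multimodal data at a finer granularity, separating and fusing quality evidence effectively.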
Abstract: Action quality assessment (AQA) aims to automatically quantify the execution quality of human actions in videos and is valuable for applications such as competitive sports judging. In multimodal AQA, quality evidence from different modalities is heterogeneous, and quality cues evolve progressively over time. Existing methods often rely on coarse fusion or unified temporal modeling, which may blur modality-specific cues, preserve cross-modal redundancy, and weaken stage-specific quality evidence. To address these issues, we propose a progressive implicit decoupling and fusion network (PIDNet) that progressively integrates modality-specific information, cross-modal complementary cues, and global quality semantics for accurate assessment. Specifically, we design an iMambaWave module that maps RGB, optical flow, and audio features into a shared latent space and disentangles them with a Bi-Mamba branch and a wavelet-transform branch to capture long-range temporal dependencies and local perturbation details, respectively. A gated aggregation mechanism adaptively fuses temporal and frequency-domain information. We further build a three-stage progressive fusion network using Group3M blocks, where modality complementary attention retrieves cross-modal evidence while suppressing redundancy, and multi-scale convolutions enrich feature representations. Experiments on the Rhythmic Gymnastics and Fis-V datasets show that PIDNet achieves highly competitive score correlation with favorable error control compared with existing unimodal and multimodal methods. Ablation studies verify the effectiveness of each component. Moreover, iMambaWave consistently improves visual representation and temporal modeling across multiple backbones, showing good generalization and plug-and-play capability.
[145] Can MLLMs Reason About Visual Persuasion? Evaluating the Efficacy and Faithfulness of Reasoning cs.CVPDF
Naeun Lee, Hyunjong Kim, Sunghwan Choi, Injin Kong, Yohan Jo
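TL;DR: This paper studies the reasoning ability of multimodal large language models on visual persuasion tasks and finds that directly prompting a model to reason does not improve persuasiveness prediction and can even degrade it. The authors show that supervised fine-tuning on diverse teacher-generated rationales improves prediction, and introduce a three-dimensional faithfulness evaluation framework to assess the reliability of rationales.
Details
Motivation: Although multimodal large language models perform strongly on multimodal tasks, predicting whether and why an image is persuasive remains challenging, and there is no established method for training models to reason about visual persuasion or for evaluating the faithfulness of their rationales.
Result: Supervised fine-tuning on diverse teacher-generated rationales improves visual persuasiveness prediction; within the proposed three-dimensional faithfulness framework, rationale-to-decision sensitivity aligns best with human rationale preferences.
Insight: The novelty lies in showing that prediction performance alone does not guarantee faithful rationales, and in proposing supervised fine-tuning on diverse rationales together with a three-dimensional faithfulness evaluation framework, offering a new perspective on the reliability and interpretability of reasoning in visual persuasion tasks.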
Abstract: Despite strong performance of Multimodal Large Language Models (MLLMs) on multimodal tasks, predicting whether and why an image is persuasive remains challenging. We first show that prompting MLLMs to reason before prediction does not consistently help, and can even reduce persuasiveness prediction performance, suggesting that naively generated rationales are unreliable signals for this task. Yet, no established methodology exists for training MLLMs to reason about visual persuasion or evaluating whether their rationales faithfully support their decisions. To address this gap, we show empirically and theoretically that diverse teacher-generated rationales, when used for supervised fine-tuning, improve visual persuasiveness prediction. We further introduce a three-dimensional faithfulness evaluation framework covering rationale-to-decision consistency, rationale-to-image groundedness, and rationale-to-decision sensitivity. Applying this framework shows that prediction performance alone does not guarantee faithful rationales, while rationale-to-decision sensitivity is most aligned with human rationale preferences. These findings motivate faithfulness-aware training objectives and scalable rationale supervision for visual persuasiveness evaluation. Our code and dataset will be made publicly available.
[146] Extrusion Segmentation Strategy to improve CAD Reconstruction from Point Cloud cs.CV | cs.AIPDF
Said Harb, Mehdi Maboudi, Markus Gerke
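TL;DR: This paper presents an end-to-end model that reconstructs CAD models from point cloud data, and introduces a segmentation approach that decomposes a point cloud into individual extrusions to improve the generalization and robustness of deep learning models.
Details
Motivation: To recover CAD models from point cloud scans of physical objects, with applications in reverse engineering and quality control, converting unordered point clouds into structured CAD models.
Result: Segmenting partial shapes increases data diversity and improves the reconstruction performance of deep learning models; the abstract does not report specific benchmarks or quantitative results.
Insight: The novelty is the extrusion segmentation strategy, which increases data diversity by decomposing point clouds into individual extrusions and thereby improves reconstruction in a simple and effective way.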
Abstract: Computer-Aided Design is ubiquitous in today's world, as almost every manufactured object begins as a digital model across industries. At the same time, advances in 3D sensing have made point clouds a dominant form of raw 3D data. Recovering the CAD model of a physical object from its point cloud scan has two major applications: reverse engineering, where physical or hand-crafted prototypes need to be reconstructed automatically as editable digital models, and quality control, where recovering the CAD description of a manufactured object helps quantify and understand deviations introduced during the production process. Thus, converting unordered point clouds into structured CAD models is increasingly important for modern applications. Deep learning has enabled major progress in computer vision for both 2D and 3D data, and new datasets facilitate data-driven CAD reconstruction. Building on this foundation, we develop an end-to-end model that reconstructs CAD models from point clouds and introduce a segmentation approach that decomposes them into individual extrusions. These partial shapes increase data diversity, improving the generalization and robustness of deep learning models. Our strategy thereby provides a simple, yet effective way to increase reconstruction performance of deep learning models.
[147] Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models cs.CV | cs.AIPDF
Tri Cao, Khoi Le, Thong Nguyen, Cong-Duy Nguyen, Quynh Vo
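TL;DR: Targeting the tendency of video large language models to hallucinate in dynamic scenes, this paper introduces the STEMO-Bench benchmark to diagnose their deficits in spatio-temporal monitoring, and proposes the STEMO-Track framework, which improves spatio-temporally consistent understanding by explicitly constructing and reasoning over structured object trajectories.
Details
Motivation: Existing MLLMs are prone to hallucination in video understanding because they lack the ability to persistently monitor object identities, states, and relations over space and time, and existing benchmarks cannot truly evaluate this ability.
Result: On the proposed STEMO-Bench benchmark, experiments show that STEMO-Track significantly reduces hallucinated answers and improves spatio-temporal reasoning consistency, outperforming current state-of-the-art MLLMs.
Insight: The novelty is the object-centric perspective: object trajectories are explicitly constructed for reasoning via chunk-wise state extraction and temporal aggregation, and the benchmark decomposes queries into sub-questions to distinguish genuine understanding from coincidental correctness.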
Abstract: While multimodal large language models (MLLMs) have advanced video understanding, they remain highly prone to hallucinations in dynamic scenes. We argue this stems from a failure in spatio-temporal monitoring, the ability to persistently track object identities, states, and relations over time. Existing benchmarks obscure this deficit by relying on single final-answer evaluations for queries that can often be resolved via local visual cues or statistical priors. To rigorously diagnose this, we introduce STEMO-Bench (Spatio-TEmporal MOnitoring), a benchmark of human-verified object-centric facts that evaluates intermediate reasoning by decomposing queries into sub-questions, distinguishing genuine temporal understanding from coincidental correctness. To address failure modes exposed by STEMO, we propose STEMO-Track, a novel object-centric framework that explicitly constructs and reasons over structured object trajectories via chunk-wise state extraction and temporal aggregation. Extensive experiments demonstrate that our object-centric framework significantly reduces hallucinated answers and improves spatio-temporal reasoning consistency over state-of-the-art MLLMs.
[148] LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs? cs.CVPDF
Kechen Fang, Yihua Qin, Chongyi Wang, Wenshuo Ma, Tianyu Yu
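TL;DR: This paper presents LLaVA-UHD v4, an efficient and compute-controllable visual encoding scheme for high-resolution inputs. By combining a slice-based encoding strategy with intra-ViT early compression, it substantially reduces visual-encoding compute (55.8% fewer FLOPs) across multiple benchmarks while matching or surpassing baseline performance.
Details
Motivation: To address visual encoding becoming the main computational bottleneck when multimodal large language models (MLLMs) process high-resolution images. The conventional approach uses global encoding followed by post-ViT compression, which produces massive token sequences and incurs the full quadratic attention cost before any compression takes place.
Result: Across diverse benchmarks covering document understanding, OCR, and general VQA, LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% while matching or even surpassing baseline performance.
Insight: The main contributions are: 1) controlled experiments showing that slice-based encoding outperforms global encoding, suggesting that preserving local details through sliced views benefits fine-grained perception more than applying global attention; 2) intra-ViT early compression, which reduces the number of tokens in shallow ViT layers and thereby greatly lowers visual-encoding compute without harming downstream performance. This provides a practical design direction for efficient high-resolution MLLMs.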
Abstract: Visual encoding constitutes a major computational bottleneck in Multimodal Large Language Models (MLLMs), especially for high-resolution image inputs. The prevailing practice typically adopts global encoding followed by post-ViT compression. Global encoding produces massive token sequences, while post-ViT compression incurs the full quadratic attention cost of the ViT before any token reduction takes place. In this work, we revisit this convention along two dimensions: the encoding strategy and visual token compression. First, controlled experiments show that slice-based encoding outperforms global encoding across benchmarks, suggesting that preserving local details through sliced views can be more beneficial than applying global attention for fine-grained perception. Second, we introduce intra-ViT early compression, which reduces tokens in shallow ViT layers and substantially lowers visual-encoding FLOPs while preserving downstream performance. By integrating intra-ViT compression into the slice-based encoding framework, we present LLaVA-UHD v4, an efficient and compute-controllable visual encoding scheme tailored for high-resolution inputs. Across a diverse set of benchmarks covering document understanding, OCR, and general VQA, LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% while matching or even surpassing baseline performance. These results suggest that visual-encoding efficiency can be substantially improved without sacrificing downstream performance, providing a practical design direction for efficient high-resolution MLLMs. All model weights and code will be publicly released to support further research.
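The cost argument in this abstract rests on self-attention scaling quadratically with the number of tokens. The following back-of-the-envelope sketch counts only query-key pairs per layer and ignores projections and MLPs (which scale linearly in tokens). All function names, layer counts, token counts, and compression ratios are hypothetical; the sketch is not meant to reproduce the reported 55.8% figure.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Number of query-key pairs in one self-attention layer."""
    return num_tokens * num_tokens


def global_cost(num_tokens: int, num_layers: int) -> int:
    """All tokens attend to each other in every layer (global encoding)."""
    return num_layers * attention_pair_count(num_tokens)


def sliced_cost(num_tokens: int, num_slices: int, num_layers: int) -> int:
    """Tokens attend only within their slice (assumes equal-sized slices)."""
    per_slice = num_tokens // num_slices
    return num_layers * num_slices * attention_pair_count(per_slice)


def early_compressed_cost(num_tokens: int, num_layers: int,
                          compress_after: int, ratio: int) -> int:
    """Full token count for the first `compress_after` layers, then tokens
    reduced by `ratio` for the remaining layers (intra-ViT compression)."""
    kept = num_tokens // ratio
    early = compress_after * attention_pair_count(num_tokens)
    late = (num_layers - compress_after) * attention_pair_count(kept)
    return early + late
```

Under these assumptions, splitting N tokens into k equal slices divides the pair count by k, and compressing early means most layers pay the quadratic cost on the reduced sequence. Post-ViT compression corresponds to `compress_after == num_layers`, which saves nothing inside the encoder.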
[149] CT-IDP: Segmentation-Derived Quantitative Phenotypes for Interpretable Abdominal CT Disease Classification cs.CV | cs.AIPDF
Lavsen Dahal, Joseph Y. Lo
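TL;DR: This study develops CT-IDP, a quantitative phenotyping framework for abdominal CT disease classification. The framework was developed on the MERLIN dataset (15,175, 5,018, and 5,082 studies in the training, validation, and test sets, respectively) and externally evaluated on two independent datasets, Duke-Abdomen (2,000) and AMOS (1,107). Multi-organ segmentations are generated with TotalSegmentator and used to extract over 900 organ- and compartment-level descriptors (covering morphometry, attenuation, and contextual/burden findings). Sparse disease-specific logistic regression with elastic-net regularization is used for training and validation, and performance is compared against a DINOv3-based vision-transformer baseline.
Details
Motivation: To develop an interpretable framework based on segmentation-derived quantitative phenotypes that improves abdominal CT disease classification and yields more explainable results than black-box deep learning models.
Result: Macro-AUC for CT-IDP versus the baseline is 0.897 versus 0.880 on MERLIN, 0.877 versus 0.857 on Duke-Abdomen, and 0.780 versus 0.756 on AMOS. CT-IDP outperforms the DINOv3-based vision-transformer baseline on all three datasets, reaching a state-of-the-art level.
Insight: The novelty is combining automatic segmentation with a large set of hand-engineered quantitative phenotype features and using sparse logistic regression for interpretable disease classification. This is a hybrid approach that pairs the strengths of deep-learning segmentation with classical interpretable modeling; its phenotype audits and coefficient inspection strengthen interpretability.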
Abstract: In this retrospective multi-institutional study, a quantitative phenotyping framework, CT-IDP (CT Image-Derived Phenotypes), was developed on the MERLIN abdominal CT benchmark (training, validation, and test sets: 15,175, 5,018, and 5,082 studies, respectively) and externally evaluated on two independent datasets: Duke-Abdomen (2,000) and AMOS (1,107). Multi-organ segmentations were generated with TotalSegmentator and used to derive over 900 organ- and compartment-level descriptors spanning morphometry, attenuation, and contextual/burden findings. Sparse disease-specific logistic regression with elastic-net regularization was trained on MERLIN and externally validated under a frozen specification. Performance was compared against a DINOv3-based vision-transformer baseline using AUC and average precision (AP), supported by phenotype-stratified audits and coefficient-level inspection. Macro-AUC for CT-IDP versus the baseline was 0.897 versus 0.880 on MERLIN, 0.877 versus 0.857 on the Duke-Abdomen dataset, and 0.780 versus 0.756 on AMOS.
[150] LCGNav: Local Candidate-Aware Geometric Enhancement for General Topological Planning in Vision-Language Navigation cs.CVPDF
Jiankun Peng, Jianyuan Guo, Yiguang Yang, Yue Liu, Jiashuang Yan
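TL;DR: This paper proposes LCGNav, a modular local geometric enhancement framework for vision-language navigation (VLN). It targets two problems in existing online topological planning methods: redundant local depth information, and weakened focus on the current candidate nodes as the topological graph grows. The method explicitly converts candidate depth views into 3D point clouds and applies physical truncation for more compact local geometric modeling, and introduces a dimension-preserving local fusion strategy that applies geometric enhancement only to the currently relevant ghost nodes without changing the original planner interface.
Details
Motivation: To address two limitations of online topological planning methods for Vision-Language Navigation in Continuous Environments (VLN-CE): redundant local depth information, and weakened focus on the current frontier candidates as the topological graph expands.
Result: Experiments on the R2R-CE and RxR-CE benchmarks show that LCGNav works as an effective cross-architecture enhancement module, consistently improving multiple key metrics of representative online topological baselines at low additional training cost. When integrated with ETP-R1, LCGNav achieves the best performance among the compared online topological methods on the val-unseen splits of R2R-CE and RxR-CE.
Insight: The novelty is explicitly converting candidate depth views into 3D point clouds with physical truncation for compact local geometric modeling, and a dimension-preserving local fusion strategy (with transient state degradation) that restricts geometric enhancement to the currently relevant ghost nodes, keeping the planner interface compatible. Viewed objectively, the method is a lightweight, modular way to add geometric information that can be integrated flexibly into different topological planning architectures and makes the navigation system use local geometric structure more efficiently.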
Abstract: Online topological planning has become an effective paradigm for Vision-Language Navigation in Continuous Environments (VLN-CE), but existing methods still suffer from two limitations: redundant local depth information and weakened focus on current frontier candidates as the topological graph grows. To address this, we propose LCGNav, a modular local geometric enhancement framework for topological VLN. LCGNav explicitly converts candidate depth views into 3D point clouds and applies physical truncation based on the agent’s reachable range, enabling more compact local geometric modeling. It further introduces a dimension-preserving local fusion strategy with transient state degradation, so that geometric enhancement is applied only to the currently relevant ghost nodes without changing the original planner interface. Experiments on R2R-CE and RxR-CE show that LCGNav serves as an effective cross-architecture enhancement module, consistently improving multiple key metrics of representative online topological baselines with low additional training cost. When integrated with ETP-R1, LCGNav achieves the best performance among the compared online topological methods on the val-unseen splits of the R2R-CE and RxR-CE benchmarks. The code is available at https://github.com/shannanshouyin/LCGNav.
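Converting a depth view into a 3D point cloud and truncating it by range can be sketched with the standard pinhole back-projection. This is an illustration under stated assumptions, not LCGNav's code: the camera intrinsics (`fx`, `fy`, `cx`, `cy`), the function name, and the choice to truncate on depth along the optical axis rather than on Euclidean distance are all hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def depth_to_points(depth: List[List[float]],
                    fx: float, fy: float, cx: float, cy: float,
                    max_range: float) -> List[Point]:
    """Back-project a depth image to camera-frame 3D points.

    Pixels with non-positive depth (invalid) or depth beyond `max_range`
    (outside the assumed reachable range) are dropped.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):          # v: pixel row
        for u, d in enumerate(row):          # u: pixel column
            if d <= 0.0 or d > max_range:
                continue
            x = (u - cx) * d / fx
            y = (v - cy) * d / fy
            points.append((x, y, d))
    return points
```

Truncation shrinks the point set before any further geometric encoding, which is one simple way to realize the "more compact local geometric modeling" the abstract mentions.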
[151] KEPIL: Knowledge-Enhanced Prompt-Image Learning for Prompt-Robust Disease Detection cs.CVPDF
Haozhe Luo, Shelley Zixin Shu, Ziyu Zhou, Robert Berke, Mauricio Reyes
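TL;DR: KEPIL is a knowledge-enhanced prompt-image learning framework that aims to improve the zero-shot inference and prompt robustness of medical vision-language models for radiological disease detection. It integrates medical knowledge bases through dynamic prompt enrichment, a semantic-aware contrastive loss, and entity-centric report standardization, reducing the model's sensitivity to prompt variations.
Details
Motivation: Current CLIP-style medical vision-language models are sensitive to prompt variations and lack trustworthy external knowledge, which hinders reliable deployment in clinical radiology, especially since zero-shot inference is essential for handling long-tailed diseases.
Result: Across seven benchmarks, KEPIL achieves state-of-the-art zero-shot inference performance; under prompt-variation tests, it improves AUC by 6.37% on CheXpert and by 4.11% on average.
Insight: The contributions include dynamic prompt enrichment using ontologies and an LLM, a semantic-aware contrastive loss that aligns equivalent prompt variants through a dual-embedding objective, and entity-centric report standardization that yields ontology-aligned representations. Viewed objectively, systematically incorporating structured knowledge into model training and inference is a key route to more robust and reliable vision-language models for clinical radiology.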
Abstract: Vision–language models (VLMs) show promise for clinical decision support in radiology because they enable joint reasoning over radiological images and clinical text, thereby leveraging complementary clinical information. However, radiological findings are long-tailed in practice, leaving some conditions underrepresented and making zero-shot inference essential. Yet current CLIP-style medical VLMs are sensitive to prompt variations and often lack trustworthy external knowledge at inference time, which hinders reliable clinical deployment. We present KEPIL, a prompt-robust framework that integrates curated medical knowledge to stabilize zero-shot generalization. KEPIL comprises: (i) dynamic prompt enrichment using ontologies with LLM assistance, (ii) a semantic-aware contrastive loss aligning embeddings of equivalent prompt variants via a dual-embedding objective, and (iii) entity-centric report standardization to yield ontology-aligned representations. Across seven benchmarks, KEPIL achieves state-of-the-art zero-shot inference performance; under prompt-variation tests, it improves AUC by 6.37% on CheXpert and by 4.11% on average. These results suggest that structured knowledge and robust prompt design are key to clinically reliable radiology-facing VLMs. Code will be released at https://github.com/Roypic/KEPIL.
[152] Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search cs.CVPDF
Jingdong Zhang, Yizhou Wang, Zhengzhong Tu, Xin Li, Wenping Wang
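TL;DR: This paper proposes "Imagining in 360°", a new framework for exploring immersive 360-degree environments in humanoid visual search. It decouples exploration into two specialized modules, an Imaginator and an Actor; the Imaginator infers the semantic layout in a single step and samples multiple hypotheses to guide search, which significantly lowers data engineering costs and improves search efficiency.
Details
Motivation: Existing methods treat humanoid visual search as a monolithic task relying on cumulative multi-turn chain-of-thought reasoning, which imposes a heavy cognitive burden and requires expensive trajectory-level annotations. The paper aims to overcome these limitations with a decoupled architecture.
Result: Extensive experiments show that explicitly modeling semantic spatial priors greatly improves search efficiency and success rates in complex, in-the-wild environments. The method needs no full-trajectory chain-of-thought annotations and generates over 1.96 million curated training samples.
Insight: The core novelty is decoupling exploration into an independent probabilistic predictor (the Imaginator) and an Actor. The Imaginator infers, in a single step, the semantic layout of both observed and unobserved regions and produces a distribution of multiple hypotheses, giving the Actor robust spatial guidance that hedges against uncertainty and reduces the need for annotated data.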
Abstract: Humanoid Visual Search (HVS) requires agents to actively explore immersive 360$^\circ$ environments. While prior methods treat this as a monolithic task relying on cumulative, multi-turn Chain-of-Thought (CoT) reasoning, they impose heavy cognitive burdens and require expensive trajectory-level annotations. In this paper, we propose Imagining in 360$^\circ$, a novel framework that decouples the exploration process into a specialized Imaginator and an Actor. The Imaginator functions as a probabilistic predictor of spatial priors; instead of maintaining a cumulative reasoning chain, it infers the semantic layout of both observed and unobserved regions in a single step. By sampling multiple hypotheses within this semantic space, we provide the Actor with a distribution of effective spatial information, offering robust guidance that hedges against uncertainty during active search. This decoupled architecture significantly lowers data engineering costs by eliminating the need for full-trajectory CoT annotations, enabling the generation of over 1.96 million curated training samples. Extensive experiments demonstrate that explicitly modeling semantic spatial priors drastically improves search efficiency and success rates in complex, in-the-wild environments.
[153] MultiMedVision: Multi-Modal Medical Vision Framework cs.CVPDF
Frank Li, Bardia Khosravi, Mohammadreza Chavoshi, Young Seok Jeon, Theo Dapamede
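TL;DR: MultiMedVision is a unified multi-modal medical vision framework built on a Sparse Vision Transformer that jointly learns representations of 2D and 3D medical images. It uses 3D Rotary Positional Embeddings and variable-length sequence packing to process mixed-modality batches directly in a shared latent space, without modality-specific adapters or treating 3D volumes as 2D slice sequences. Trained with a self-supervised objective on chest X-rays (MIMIC-CXR) and CT scans (CT-RATE), using a single shared encoder and 5x less data, it achieves competitive performance on both 2D and 3D benchmarks.
Details
Motivation: Current foundation models typically process 2D (e.g. X-ray) and 3D (e.g. CT) medical image data separately with dimensionality-specific architectures, which limits the comprehensive diagnostic potential of multi-modal medical imaging. The paper addresses this with a unified framework for joint cross-dimensional representation learning.
Result: On 2D benchmarks, MultiMedVision achieves a macro AUROC of 0.82 on MIMIC and 0.84 on CheXpert; on the 3D task, it achieves an AUROC of 0.85 on CT-RATE. These results indicate performance comparable to existing methods while using less data.
Insight: The paper's claimed novelty is a unified cross-dimensional representation learning framework that processes mixed-modality data in a shared latent space via 3D Rotary Positional Embeddings and variable-length sequence packing, avoiding modality-specific adapters. An objective reading is that the core contribution is showing that unified representation learning is feasible without sacrificing modality-specific performance, and that the learned representations contain both modality-specific and shared feature subspaces, which offers a new direction for multi-modal medical image analysis.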
Abstract: Multi-modal medical imaging enables comprehensive diagnostics, yet current foundation models process 2D (e.g. X-ray) and 3D (e.g. CT) data with separate, dimensionality-specific architectures. We present MultiMedVision, a unified framework for joint 2D/3D representation learning built on a Sparse Vision Transformer. Our model uses 3D Rotary Positional Embeddings and variable-length sequence packing to process mixed-modality batches natively within a shared latent space, without modality-specific adapters or treating 3D volumes as 2D slice sequences. Trained with a self-supervised objective on chest X-rays (MIMIC-CXR) and CT scans (CT-RATE), and using a single shared encoder with 5x less data, MultiMedVision achieves competitive performance on both 2D benchmarks (Macro AUROC 0.82 on MIMIC, 0.84 on CheXpert) and 3D tasks (0.85 on CT-RATE). Analysis of the learned representations reveals coexisting modality-specific and shared feature subspaces, demonstrating that unified cross-dimensional representation learning is feasible without sacrificing modality-specific performance.
[154] Establishing Robust Retinal Eye Tracking: A Weakly Supervised Algorithmic Framework cs.CV | cs.ET | eess.IVPDF
Bo Wen, Dillon Lohr, Yatong An, Pushkar Anand, Alexander Fix
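TL;DR: This paper proposes a novel weakly supervised learning framework for robust retinal image-based eye tracking, replacing classical template matching with the aim of improving robustness and tracking accuracy under real-world imaging conditions.
Details
Motivation: Existing retinal tracking algorithms rely mainly on classical template-matching registration, which is not robust enough to retinal feature variability and real-world imaging conditions, so a more robust solution is needed.
Result: An initial study achieves high accuracy across a cohort of 6 participants, with a 95th-percentile gaze error below 0.45 degrees.
Insight: The novelty is replacing classical template matching with a weakly supervised learning framework, which improves adaptability to feature variability and changing imaging conditions and offers a potential high-accuracy eye tracking solution for devices such as AR/VR headsets.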
Abstract: Retinal image-based eye tracking is widely used in ophthalmic imaging and vision science, and is a promising path to deliver higher gaze accuracy than the pupil- and cornea-based approaches commonly used in modern AR/VR devices. Nevertheless, existing retinal tracking algorithms still primarily rely on classical template-matching registration, which can be insufficiently robust to retinal feature variability and real-world imaging conditions. In this work, we propose a novel weakly-supervised, learning-based framework for robust retinal eye tracking. Initial studies demonstrate high accuracy, achieving the 95th-percentile gaze error < 0.45 deg across a cohort of 6 participants.
[155] Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models cs.CV | cs.AI | cs.LG | cs.ROPDF
Sagar Bharadwaj, Ziyong Ma, Anurag Ghosh, Srinivasan Seshan, Anthony Rowe
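TL;DR: Flame3D is a training-free, zero-shot 3D scene understanding framework. It represents a 3D scene as an editable visual-textual memory and exposes it to an off-the-shelf multimodal large language model through composable spatial tools, supporting open-ended reasoning about free space, object grounding, hypothetical object insertions, and complex geometric relationships.
Details
Motivation: Existing 3D understanding methods rely on large-scale 3D-language training or focus only on object grounding and simple spatial relationships. The paper aims to achieve broad generalization at inference time, without 3D-specific training.
Result: The method performs on par with fine-tuned 3D-LMM methods on the ScanQA benchmark, and demonstrates multi-hop reasoning on Compose3D, the authors' curated compositional spatial-reasoning benchmark, showing the importance of synthesizing spatial operations at inference time.
Insight: The novelty is a training-free framework that uses an editable 3D memory and composable tools for zero-shot compositional reasoning, plus the ability to synthesize spatial programs at inference time, pointing to a new direction for 3D scene understanding based on rich scene memories and compositional abstractions.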
Abstract: 3D scene understanding spans reasoning about free space, object grounding, hypothetical object insertions, complex geometric relationships, and integrating all of these with external tools and data sources. Existing 3D understanding methods typically rely on large-scale 3D-language training or focus on object grounding and simple spatial relationships. We argue that the broad generalization that motivates 3D-language training can be achieved at inference time, without 3D-specific training. We propose Flame3D, a training-free framework that represents scenes as editable visual-textual 3D memories and exposes them to an off-the-shelf MLLM through composable spatial tools. Flame3D also lets the agent synthesize custom spatial programs at inference time, enabling open-ended reasoning over layouts, empty space, and objects not yet present in the scene. External data and corrections can be added to the memory without retraining. In addition to showing competitive performance to finetuned 3D-LMM methods on ScanQA, we study multi-hop 3D reasoning capabilities of Flame3D by evaluating it on a curated compositional spatial-reasoning benchmark, Compose3D. We find that fixed tools fall short and that the agent’s ability to synthesize spatial operations at inference time is essential. These results invite the question: should future progress in 3D scene understanding focus on richer scene memories and expressive compositional abstractions?
[156] CATS: Curvature Aware Temporal Selection for efficient long video understanding cs.CVPDF
Mehrajul Abadin Miraj, Abdul Mohaimen Al Radi, Shariful Islam Rayhan, Md. Tanvir Alam, Ismat Rahman
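TL;DR: This paper proposes CATS, a curvature-aware temporal selection method for efficient long-video understanding under limited compute. It models the temporal geometry of query-frame relevance and adaptively selects key frames, capturing salient events and their surrounding context while suppressing redundant frames.
Details
Motivation: Multimodal large language models cannot process every frame of a long video under a computational budget, and optimal frame selection is a combinatorial problem.
Result: On LongVideoBench and VideoMME, under a fixed backbone and frame budget, CATS consistently outperforms lightweight methods such as AKS. Compared with multi-stage methods such as MIRA, CATS retains approximately 93-95% of MIRA's performance at only 3-4% of its preprocessing cost, a favorable efficiency-accuracy trade-off. An evaluation under an LLM-as-a-judge protocol further indicates that CATS yields more coherent and informative descriptions.
Insight: The novelty lies in explicitly modeling the temporal geometry (curvature) of query-frame relevance and using curvature to adapt selection density, so that both abrupt transitions and gradually evolving content are captured. Viewed objectively, the method offers a computationally efficient and principled frame-selection strategy that approaches the performance of more complex methods at very low preprocessing cost, a practical efficiency-accuracy balance for long-video understanding.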
Abstract: Understanding long videos with multimodal large language models (MLLMs) requires selecting a small subset of informative frames under strict computational budgets, where exhaustive processing is infeasible and optimal selection is combinatorial. We propose CATS, a curvature-aware frame selection method that explicitly models the temporal geometry of query-frame relevance to identify salient events and their surrounding context. By leveraging temporal curvature to adapt selection density, CATS captures both abrupt transitions and gradually evolving content while suppressing redundant frames. Under a fixed backbone and frame budget, CATS consistently outperforms prior lightweight approaches such as AKS on LongVideoBench and VideoMME. While multi-stage methods such as MIRA achieve higher absolute accuracy, they incur substantial computational overhead; in contrast, CATS retains approximately 93-95% of MIRA’s performance while requiring only 3-4% of its preprocessing cost, yielding a favorable efficiency-accuracy trade-off. Beyond answer accuracy, we evaluate description generation using an LLM-as-a-judge protocol, and the obtained results show that CATS produces more coherent and informative outputs, indicating improved grounding in visual evidence. These results position CATS as a computationally efficient and principled approach to long-video understanding.
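The abstract does not say how temporal curvature is computed or how it drives selection density. As a rough illustration only, the sketch below assumes curvature is approximated by the absolute discrete second difference of a per-frame query-relevance score, and that frames are picked at evenly spaced quantiles of the resulting density. The function name, the `floor` parameter, and the quantile rule are hypothetical and are not taken from the paper.

```python
def curvature_select(relevance, budget, floor=0.1):
    """Pick at most `budget` frame indices, denser where relevance bends sharply.

    Hypothetical sketch: not the CATS implementation.
    """
    n = len(relevance)
    if budget >= n:
        return list(range(n))

    # Discrete second difference as a stand-in for temporal curvature.
    curv = [0.0] * n
    for t in range(1, n - 1):
        curv[t] = abs(relevance[t - 1] - 2.0 * relevance[t] + relevance[t + 1])

    # A small floor keeps flat regions from being ignored entirely.
    cum = []
    acc = 0.0
    for c in curv:
        acc += floor + c
        cum.append(acc)
    total = cum[-1]

    # Take the frame at each evenly spaced quantile of the cumulative density.
    picked = []
    idx = 0
    for k in range(budget):
        target = (k + 0.5) / budget * total
        while idx < n - 1 and cum[idx] < target:
            idx += 1
        if not picked or picked[-1] != idx:
            picked.append(idx)
    return picked


# A single relevance spike at frame 4 attracts most of the budget.
frames = curvature_select([0, 0, 0, 0, 1, 0, 0, 0, 0, 0], budget=3)
```

Because repeated indices are dropped, the function can return fewer than `budget` frames when the density is concentrated on very few frames.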
[157] An Elastic Shape Variational Autoencoder for Skeleton Pose Trajectories cs.CV | stat.ML
Arafat Rahman, Shashwat Kumar, Laura E. Barnes, Anuj Srivastava
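TL;DR: This paper proposes the Elastic Shape Variational Autoencoder (ES-VAE), a geometry-aware generative model for sequences of human skeleton poses. It uses the transported square-root velocity field (TSRVF) representation on Kendall's shape manifold, which removes rigid transformations and temporal rate variability and isolates the intrinsic geometric dynamics of shape. In experiments, ES-VAE outperforms standard VAEs and several sequence-modeling baselines on predicting clinical mobility scores from gait, classifying healthy versus post-stroke subjects, and action recognition on NTU RGB+D.
Details
Motivation: When applied to skeleton sequences, standard variational autoencoders (VAEs) spend much of their capacity on nuisance factors such as camera orientation, subject scale, viewpoint, and execution speed, rather than on the intrinsic geometry of shapes and their motion. A generative model that directly models the geometric dynamics of shape is therefore needed.
Result: ES-VAE is validated on two datasets. In the analysis of skeletal gait cycles it predicts clinical mobility scores and classifies subjects as healthy or post-stroke; action recognition is evaluated on NTU RGB+D. In both settings ES-VAE consistently outperforms standard VAEs and a range of sequence-modeling baselines, including temporal convolutional networks, transformers, and graph convolutional networks.
Insight: The novelty is combining the TSRVF representation with the VAE framework to build a generative model on Kendall's shape manifold, with encoding and decoding carried out through the Riemannian logarithm and exponential maps, so that the latent space learns shape dynamics invariant to rigid transformations and temporal rate. This gives a principled framework for learning from longitudinal data on pose shape manifolds and improves latent representation quality and downstream performance.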
Abstract: Deep generative models provide flexible frameworks for modeling complex, structured data such as images, videos, 3D objects, and texts. However, when applied to sequences of human skeletons, standard variational autoencoders (VAEs) often allocate substantial capacity to nuisance factors, such as camera orientation, subject scale, viewpoint, and execution speed, rather than the intrinsic geometry of shapes and their motion. We propose the Elastic Shape Variational Autoencoder (ES-VAE), a geometry-aware generative model for skeletal trajectories that leverages the transported square-root velocity field (TSRVF) representation on Kendall’s shape manifold. This representation inherently removes rigid translations, rotations, and global scaling of shapes, as well as temporal rate variability of sequences, isolating the underlying shape dynamics. The ES-VAE encoder maps skeletal sequences to a low-dimensional latent space incorporating the Riemannian logarithm map, while the decoder reconstructs sequences using the corresponding exponential map. We demonstrate the effectiveness of ES-VAE on two datasets. First, we analyze skeletal gait cycles to predict clinical mobility scores and classify subjects into healthy and post-stroke groups. Second, we evaluate action recognition on the NTU RGB+D dataset. Across both settings, ES-VAE consistently outperforms standard VAEs and a range of sequence modeling baselines, including temporal convolutional networks, transformers, and graph convolutional networks. More broadly, ES-VAE provides a principled framework for learning generative models of longitudinal data on pose shape manifolds, offering improved latent representation and downstream performance compared to existing deep learning approaches.
[158] Monocular Biomechanical Tracking of Fingers with Inverse Kinematics to Foundation Models cs.CV | cs.AI
R. James Cotton, Pouyan Firouzabadi, Wendy Murray
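TL;DR: This paper combines the SAM 3D Body foundation model with inverse kinematics optimization to extract anatomically constrained finger joint angles from monocular video. SAM 3D Body is ported from PyTorch to JAX for integration with MuJoCo-MJX, enabling GPU-accelerated optimization, and a new mapping is developed between MHR outputs and the markers of the biomechanical model. Validated against 8-camera multiview reconstruction on 4,590 frames from 7 participants performing various hand poses and object manipulation tasks, the method shows finger joint angle errors of about 10 degrees and hand position errors of about 6 mm.
Details
Motivation: Accurate hand and finger tracking from video has significant clinical value for monitoring activities of daily living and measuring range of motion, yet methods for obtaining hand biomechanics from monocular video remain immature.
Result: Validation on 4,590 frames of multiview video from 7 participants gives, after Procrustes alignment, finger joint angle errors of about 10 degrees and hand position errors of about 6 mm. Results are consistent across camera viewpoints and robust to the method used to produce reference values from multiview video.
Insight: The main novelty is combining the SAM 3D Body foundation model with inverse kinematics optimization inside a full-body biomechanical model, which enables biomechanical analysis with detailed finger tracking from monocular video. Implementation contributions include porting the model to JAX for GPU acceleration and a new mapping from MHR outputs to biomechanical markers.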
Abstract: Accurate hand and finger tracking from video has significant clinical applications for monitoring activities of daily living and measuring range of motion, yet monocular video approaches for obtaining hand biomechanics remain under-developed. We present a method that combines the SAM 3D Body foundation model with inverse kinematics optimization in a full-body biomechanical model to extract anatomically-constrained finger joint angles from single-view video. We port SAM 3D Body from PyTorch to JAX for integration with MuJoCo-MJX, enabling GPU-accelerated optimization, and develop a novel mapping between the Momentum Human Rig (MHR) outputs and biomechanical model markers. Validation against 8-camera multiview reconstruction on 4,590 frames from 7 participants performing a variety of hand poses and object manipulation tasks shows finger joint angle errors of approximately 10 degrees and hand position errors of approximately 6 mm, after Procrustes alignment. Results were consistent across camera viewpoints and robust to different methods for producing reference values from multiview video. This work extends monocular biomechanical analysis to detailed finger tracking, expanding access to quantitative characterization of hand movement from readily available video.
[159] Reinforcing Multimodal Reasoning Against Visual Degradation cs.CV | cs.CL
Rui Liu, Dian Yu, Haolin Liu, Yucheng Shi, Tong Zheng
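TL;DR: This paper proposes ROMA, a reinforcement-learning fine-tuning framework that strengthens the reasoning robustness of multimodal large language models (MLLMs) under visual degradation such as blur and compression artifacts. Using a dual forward pass, a surrogate KL penalty, an auxiliary policy gradient loss, and correctness-conditioned regularization, it avoids the reward poisoning and unstable optimization that come with naive data augmentation, and improves robustness to both seen and unseen degradations while preserving clean-input performance.
Details
Motivation: Reasoning policies of MLLMs trained with reinforcement learning are brittle under real-world visual degradation. Robustness techniques from vision and deep RL, such as static data augmentation or value-based regularization, do not transfer directly to critic-free RL fine-tuning of autoregressive MLLMs, and simply injecting degraded views during training causes reward poisoning and unstable optimization.
Result: With Qwen3-VL 4B/8B on seven multimodal reasoning benchmarks, the method improves robustness over the GRPO baseline by +2.4% on seen degradations and +2.3% on unseen degradations, while matching the baseline's clean-input accuracy.
Insight: The novelty is a robustness framework designed specifically for RL fine-tuning of autoregressive MLLMs. Its core is a dual forward pass that avoids new rollouts on degraded inputs, combined with several regularizers (a surrogate KL penalty against the worst-case augmentation, an auxiliary policy gradient anchored to clean-image advantages, and correctness-conditioned regularization restricted to successful trajectories) that together stabilize optimization and reinforce reasoning under visual degradation, avoiding reward poisoning and policy collapse.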
Abstract: Reinforcement Learning has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), yet the resulting policies remain brittle against real-world visual degradations such as blur, compression artifacts, and low-resolution scans. Prior robustness techniques from vision and deep RL rely on static data augmentation or value-based regularization, neither of which transfers cleanly to critic-free RL fine-tuning of autoregressive MLLMs. Reinforcing reasoning against such corruptions is non-trivial: naively injecting degraded views during rollout induces reward poisoning, where perceptual occlusions trigger hallucinated trajectories and destabilize optimization. We propose ROMA, an RL fine-tuning framework that modifies the optimization dynamics to reinforce reasoning against visual degradation while preserving clean-input performance. A dual-forward-pass strategy uses teacher forcing to evaluate corrupted views against clean-image trajectories, avoiding new rollouts on degraded inputs. For distributional consistency, we apply a token-level surrogate KL penalty against the worst-case augmentation; to prevent policy collapse under regularization, an auxiliary policy gradient loss anchored to clean-image advantages preserves a reliable reward signal; and to avoid systematically incorrect invariance, correctness-conditioned regularization restricts enforcement to successful trajectories. On Qwen3-VL 4B/8B across seven multimodal reasoning benchmarks, our method improves robustness by +2.4% on seen and +2.3% on unseen corruptions over GRPO while matching clean accuracy.
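To make the interaction between the worst-case KL penalty and the correctness-conditioned regularization more concrete, here is a minimal sketch under stated assumptions. It assumes the per-token next-token distributions for the clean view and for each corrupted view have already been obtained by teacher forcing on the same clean-image trajectory, that the divergence is KL(clean || corrupted) averaged over tokens, and that "worst case" means the maximum over augmentations. These choices, and all function names, are illustrative and are not confirmed by the abstract.

```python
import math


def token_kl(p, q, eps=1e-12):
    """KL(p || q) for one token position; p and q are probability lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def worst_case_kl_penalty(clean_dists, aug_dists_list, is_correct):
    """Mean token-level KL against the most divergent augmentation.

    clean_dists:    per-token distributions under the clean image.
    aug_dists_list: one list of per-token distributions per corrupted view,
                    all evaluated on the same (clean-image) trajectory.
    is_correct:     whether the trajectory earned a positive reward.
    """
    # Correctness conditioning: do not enforce invariance on failed trajectories.
    if not is_correct:
        return 0.0
    worst = 0.0
    for aug_dists in aug_dists_list:
        mean_kl = sum(
            token_kl(p, q) for p, q in zip(clean_dists, aug_dists)
        ) / len(clean_dists)
        worst = max(worst, mean_kl)
    return worst


clean = [[0.5, 0.5], [0.9, 0.1]]
blurred = [[0.6, 0.4], [0.7, 0.3]]
penalty = worst_case_kl_penalty(clean, [clean, blurred], is_correct=True)
```

In a training loop this scalar would be added to the policy loss with some weight; the weight and the auxiliary policy gradient term are outside the scope of this sketch.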
[160] Low-Cost Neural Radiance Fields cs.CV
Alice Huang, Prathamesh Sonawane, Yashdeep Thorat, Yug Rao
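TL;DR: This paper compares three accelerated NeRF variants (DS-NeRF, TensoRF, and HashNeRF) and explores extensions for the low-compute, low-data regime. Specifically, it adds a depth-supervision loss based on COLMAP keypoints to TensoRF (TensoRF-DS) and evaluates it on the LLFF dataset; ablates TensoRF's feature-decoding MLP and studies the effect of input downsampling on PSNR and runtime for the synthetic Lego scene; and proposes four architectural variants of HashNeRF's color and density networks, including residual and convolutional designs, reporting PSNR/training-time trade-offs under matched iteration budgets.
Details
Motivation: NeRF achieves high-quality novel-view synthesis, but long training times and reliance on dense input views limit its accessibility. This work explores ways to accelerate NeRF when compute and data are scarce.
Result: Under iso-time evaluation, none of the proposed extensions conclusively outperforms the published baselines. Experiments are run on the LLFF dataset with reduced view counts and on the synthetic Lego scene, and mainly report the trade-off between PSNR and training time.
Insight: Contributions include extending depth supervision to TensoRF (TensoRF-DS), the MLP ablation and input-downsampling analysis for TensoRF, and several network architecture variants for HashNeRF. Viewed objectively, the work systematically examines how well different NeRF acceleration methods transfer to constrained settings and raises concrete design questions for future low-cost NeRF work.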
Abstract: Neural Radiance Fields (NeRF) achieve high-quality novel-view synthesis, but their long training times and reliance on dense input views limit accessibility. We present a comparative study of three accelerated NeRF variants (DS-NeRF, TensoRF, and HashNeRF) and explore extensions targeted at the low-compute, low-data regime. First, we add a depth-supervision loss derived from COLMAP keypoints to TensoRF (TensoRF-DS) and evaluate it on the LLFF dataset under reduced view counts. Second, we ablate the feature-decoding MLP of TensoRF and study the effect of input downsampling on PSNR and runtime on the synthetic Lego scene. Third, we propose four architectural variants of the HashNeRF color and density networks, including residual and convolutional designs, and report PSNR/training-time tradeoffs under matched iteration budgets. Under iso-time evaluation, none of our extensions conclusively outperform the published baselines, but the experiments characterize which extensions transfer to constrained settings and surface design questions for future work.
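Since the study reports its comparisons in PSNR, a short reminder of the metric may help. The sketch below uses the standard definition, PSNR = 10 log10(MAX^2 / MSE), on flattened pixel lists with values assumed to lie in [0, 1]. The function name and the flat-list input format are illustrative choices, not something taken from the paper.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01, i.e. about 20 dB.
score = psnr([0.5, 0.5], [0.4, 0.6])
```

Each halving of the root-mean-square error raises PSNR by roughly 6 dB, which is useful context when reading PSNR/training-time trade-offs.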
[161] EduStory: A Unified Framework for Pedagogically-Consistent Multi-Shot STEM Instructional Video Generation cs.CV | cs.AI | cs.CL
Xinyi Wu, Jayant Teotia, Shuai Zhao, Erik Cambria
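TL;DR: This paper proposes EduStory, a framework addressing knowledge consistency and pedagogical narrative coherence in multi-shot STEM instructional video generation. It integrates pedagogical state modeling, script-guided structured control, and learning-oriented evaluation metrics, and introduces EduVideoBench, a diagnostic benchmark with multi-granularity annotations. Experiments indicate that domain-aware state modeling and structured control reduce narrative breakdown and improve alignment with instructional intent.
Details
Motivation: Long-horizon video generation has advanced in visual quality, but existing methods struggle to maintain knowledge consistency and a coherent pedagogical narrative across multi-shot instructional videos, especially in STEM, which motivates a framework for reliable instructional video generation.
Result: Extensive experiments show that domain-aware state modeling and structured control substantially reduce narrative breakdown and improve alignment with instructional intent, highlighting the importance of domain-specific structural constraints and tailored benchmarks for reliable, controllable, and trustworthy long-horizon video generation.
Insight: The novelty is combining pedagogical state modeling with script-guided structured control to maintain knowledge consistency, together with EduVideoBench, a dedicated benchmark for instructional video whose multi-granularity annotations (pedagogical storyboards, shot-level semantics, and knowledge state transitions) give controllable instructional video generation a systematic evaluation framework.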
Abstract: Long-horizon video generation has advanced in visual quality, yet existing methods still struggle to maintain knowledge consistency and coherent pedagogical narratives across multi-shot instructional videos, especially in STEM domains. To address these challenges, we propose EduStory, a unified framework for reliable instructional video generation. EduStory integrates pedagogical state modeling to track persistent knowledge states, script-guided structured control to organize multi-shot narratives, and learning-oriented evaluation metrics to assess knowledge fidelity and constraint satisfaction. To support rigorous evaluation, we further introduce EduVideoBench, a diagnostic benchmark with multi-granularity annotations, including pedagogical storyboards, shot-level semantics, and knowledge state transitions, together with baseline tasks for controllable instructional video generation. Extensive experiments demonstrate that domain-aware state modeling and structured control substantially reduce narrative breakdown and improve alignment with instructional intent. These results highlight the significance of domain-specific structural constraints and tailored benchmarks for advancing reliable, controllable, and also trustworthy long-horizon video generation.
[162] LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering cs.CV | cs.AI | q-bio.QM
Runze Ma, Shunbo Jia, Haonan Lyu, Guo Liu, Caizhi Liao
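TL;DR: This paper proposes LiteMedCoT-VL, a parameter-efficient adaptation method that addresses the lack of multi-step reasoning in compact vision-language models (VLMs) for medical visual question answering (VQA). Using LoRA-based fine-tuning, it distills chain-of-thought reasoning from a 235B teacher model into 2B student models and reaches 64.9% accuracy on the PMC-VQA benchmark, exceeding the zero-shot Qwen3-VL-4B baseline and other published baselines.
Details
Motivation: The work targets the reasoning gap between large and compact VLMs so that compact models (2-4B parameters) can be deployed on resource-constrained portable clinical devices and provide interpretable clinical decision support. Existing knowledge distillation methods transfer answers without the underlying reasoning.
Result: On PMC-VQA the method reaches 64.9% accuracy, 11.0 percentage points above the zero-shot Qwen3-VL-4B baseline (53.9%) and above all published baselines, indicating that a 2B model with reasoning distillation can match or exceed models with twice the parameters.
Insight: The novelty is parameter-efficient adaptation and interpretable reasoning for compact models through chain-of-thought distillation and LoRA-based fine-tuning, with inference run without image captions to simulate a clinician reading the image directly. Viewed objectively, the method combines visual evidence with clinical knowledge, and the visual grounding analysis indicates that the model relies on image content rather than exploiting textual priors.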
Abstract: The reasoning gap between large and compact vision-language models (VLMs) limits the deployment of medical AI on portable clinical devices. Compact VLMs of 2–4B parameters can run on resource-constrained hardware but lack the multi-step reasoning capacity needed for interpretable clinical decision support. Existing knowledge distillation methods transfer answers without the reasoning process behind them. Medical visual question answering (VQA) serves as a testbed for this problem, as it requires models to integrate visual evidence with clinical knowledge through structured reasoning chains. We introduce LiteMedCoT-VL, a pipeline that transfers chain-of-thought reasoning from a 235B teacher model to 2B student models through LoRA-based fine-tuning on explanation-enriched training data. All inference is conducted without image captions by default, simulating the clinical scenario in which a physician interprets a medical image directly without an accompanying radiology report. On the PMC-VQA benchmark, LiteMedCoT-VL achieves 64.9% accuracy, exceeding the zero-shot Qwen3-VL-4B baseline of 53.9% by 11.0 percentage points and outperforming all published baselines. This result indicates that a 2B model with reasoning distillation can match or exceed models with twice the parameters. Visual grounding analysis shows that the model relies on image content rather than exploiting textual priors. Our code is publicly available at https://anonymous.4open.science/r/LiteMedCoT-VL.
[163] HyNeuralMap: Hyperbolic Mapping of Visual Semantics to Neural Hierarchies cs.CV
Zihan Ma, Tian Xia, Kexin Wang, Xiao Li, Xiaowei He
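TL;DR: This paper proposes HyNeuralMap, a framework that uses the hyperbolic Lorentz model to map visual semantics into a shared, cross-subject neural hierarchy, in order to better capture hierarchical semantic relationships between visual stimuli and neural responses.
Details
Motivation: Existing methods mainly align images and functional magnetic resonance imaging (fMRI) responses in Euclidean space, which struggles to preserve fine-grained semantic relationships and the latent hierarchical structure across modalities. A geometry better suited to hierarchical semantic alignment is needed.
Result: In experiments, HyNeuralMap outperforms state-of-the-art Euclidean baselines on both multi-label semantic prediction and cross-modal retrieval, supporting the advantage of hyperbolic geometry for cross-modal semantic alignment and hierarchical modeling.
Insight: The novelty is using hyperbolic space as an inductive bias, with geodesic distances preserving semantic proximity and hierarchical relationships more effectively, which opens a new avenue for vision-neural representation learning. Viewed objectively, the method applies hyperbolic geometry to cross-modal alignment and strengthens the model's ability to capture hierarchical structure.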
Abstract: Understanding the intricate mappings between visual stimuli and neural responses is a fundamental challenge in cognitive neuroscience. While current approaches predominantly align images and functional magnetic resonance imaging (fMRI) responses in Euclidean space, this geometry often struggles to preserve fine-grained semantic relationships and latent hierarchical structures across visual and neural modalities. To overcome this, we propose HyNeuralMap, a framework that employs the hyperbolic Lorentz model to map visual semantics into a shared, cross-subject neural hierarchy. By leveraging the negative curvature of hyperbolic space as an inductive bias, the proposed framework better captures hierarchical semantic organization and cross-subject neural similarities. Specifically, visual and neural embeddings are jointly optimized through hyperbolic geometric alignment, where geodesic distances preserve semantic proximity and hierarchical relationships more effectively than Euclidean embeddings. Experiments demonstrate that HyNeuralMap consistently outperforms state-of-the-art Euclidean baselines in both multi-label semantic prediction and cross-modal retrieval tasks. This confirms hyperbolic geometry’s superiority for cross-modal semantic alignment and hierarchical modeling, providing a new avenue for vision-neural representation learning.
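For readers unfamiliar with the Lorentz model, the geodesic distance mentioned above has a closed form. The sketch below assumes unit negative curvature (the abstract does not state the curvature used), where a point x lies on the hyperboloid with Lorentzian inner product <x, x>_L = -1 and the distance is d(x, y) = arccosh(-<x, y>_L). The helper names are illustrative.

```python
import math


def lift(v):
    """Lift a Euclidean vector onto the unit hyperboloid (time coordinate first)."""
    return [math.sqrt(1.0 + sum(c * c for c in v))] + list(v)


def lorentz_inner(x, y):
    """Lorentzian inner product: minus the time term plus the spatial terms."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lorentz_distance(x, y):
    """Geodesic distance on the hyperboloid, assuming curvature -1."""
    # Clamp to 1.0 to guard against rounding slightly below the acosh domain.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))


origin = lift([0.0, 0.0])
point = lift([math.sinh(1.0), 0.0])  # sits at hyperbolic distance 1 from the origin
d = lorentz_distance(origin, point)
```

Distances grow roughly logarithmically with Euclidean norm far from the origin, which is the property that makes this geometry a natural fit for tree-like hierarchies.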
[164] SAMOFT: Robust Multi-Object Tracking via Region and Flow cs.CV
Yanchao Wang, Dawei Zhang, Chengzhuan Yang, Wei Liu, Minglu Li
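TL;DR: SAMOFT is a robust multi-object tracking method that integrates pixel-level cues to handle object deformation, nonlinear motion, and occlusion. It combines the Segment Anything Model with dense optical flow to refine motion prediction, and adds centroid-distance matching, a distribution-based correction module, and a cluster-aware re-identification strategy to improve tracking robustness.
Details
Motivation: Existing MOT methods rely mainly on instance-level features for trajectory association, and their performance degrades in complex scenes with object deformation, nonlinear motion, and occlusion. Pixel-level cues are needed to improve robustness.
Result: Extensive experiments on the DanceTrack and MOTChallenge benchmarks show that SAMOFT consistently improves baseline trackers and is competitive with recent state-of-the-art methods.
Insight: The novelty is combining SAM with optical flow for pixel motion matching and a training-free distribution-based correction module for long-tailed motion patterns. The transferable idea is to use pixel-level information and dynamic correction from online statistics to make tracking more robust under complex motion.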
Abstract: Multi-object tracking (MOT) is a fundamental task in computer vision that requires continuously tracking multiple targets while maintaining consistent identities across frames. However, most existing approaches primarily rely on instance-level object features for trajectory association, which often leads to degraded performance under challenging conditions such as object deformation, nonlinear motion, and occlusion. In this work, we propose SAMOFT, a robust tracker that leverages pixel-level cues to improve robustness under complex motion scenarios. Specifically, we introduce a Pixel Motion Matching (PMM) module that integrates the Segment Anything Model (SAM) with dense optical flow to refine Kalman filter-based motion prediction using instantaneous foreground pixel motion. To further enhance robustness under unreliable detections, we design a Centroid Distance Matching (CDM) module that performs flexible mask-based centroid matching for low-confidence or partially occluded observations. Moreover, a Distribution-Based Correction (DBC) module models long-tailed motion patterns in a training-free manner using historical optical flow statistics and dynamically corrects trajectory states online. We also incorporate a Cluster-Aware ReID (CA-ReID) strategy to improve the stability and discriminative power of trajectory appearance features. Extensive experiments on the DanceTrack and MOTChallenge benchmarks demonstrate that SAMOFT consistently improves baseline trackers and achieves competitive performance compared with recent state-of-the-art methods, validating the effectiveness of leveraging pixel-level cues for robust multi-object tracking.
[165] MAG-VLAQ: Multi-modal Aerial-Ground Query Aggregation for Cross-View Place Recognition cs.CV | cs.RO
Zhengyi Xu, Yuhang Ming, Zhihao Zhan, Hanyu Zhu, Javier Civera
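TL;DR: This paper proposes MAG-VLAQ, a framework for multi-modal aerial-ground cross-view place recognition. It uses pre-trained foundation models to extract visual tokens from ground and aerial images and geometric tokens from ground LiDAR, fuses the modalities dynamically through an ODE-conditioned VLAQ mechanism, and produces a global descriptor that improves matching.
Details
Motivation: Aerial-ground cross-view place recognition is hard to match because of discrepancies in viewpoint, modality, and spatial structure.
Result: The method is validated on KITTI360-AG and nuScenes-AG. In the KITTI360-AG satellite setting it reaches a Recall@1 of 61.1%, nearly double the previous state of the art (34.5% for the closest competing approach).
Insight: The novelty is tightly coupling neural-ODE-based RGB-LiDAR fusion with VLAQ, so that the query centers adapt dynamically to the fused multi-modal state, balancing globally learned retrieval prototypes against scene-specific evidence.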
Abstract: Multi-modal cross-view place recognition remains a fundamental challenge in computer vision and robotics due to the severe viewpoint, modality, and spatial-structure discrepancies between ground observations and aerial references. To address this challenge, we present MAG-VLAQ, a foundation-model-enhanced query aggregation framework for multi-modal aerial-ground cross-view place recognition. Specifically, our approach leverages pre-trained foundation models to extract dense visual tokens from both ground and aerial images, as well as expressive geometric tokens from ground LiDAR observations. These heterogeneous tokens are then projected into a shared embedding space for cross-modal alignment and fusion. As our main contribution, we propose ODE-conditioned VLAQ, which tightly couples neural ordinary differential equations (ODE)-based RGB-LiDAR fusion with vectors of locally aggregated queries (VLAQ). In this design, the VLAQ query centers are dynamically adapted according to the fused multi-modal state. This mechanism allows the final global descriptor to preserve globally learned retrieval prototypes while remaining responsive to scene-specific visual and geometric evidence, significantly improving aerial-ground matching. Extensive experiments on KITTI360-AG and nuScenes-AG validate the effectiveness of our proposed MAG-VLAQ. Notably, on KITTI360-AG, our MAG-VLAQ nearly doubles the state-of-the-art performance, achieving 61.1 Recall@1 in the satellite setting, compared with 34.5 from the closest competing approach.
[166] Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models cs.CV | cs.AI
Jie Ma, Yihang Liu, Zhike Qiu, Jiayi Ji, Xiaoshuai Sun
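TL;DR: This paper proposes COAST (COntrastive Adaptive Semantic Token Pruning), a training-free pruning framework based on adaptive semantic routing. It targets "Visual Aphasia", a failure that arises when existing pruning methods for large vision-language models (LVLMs) discard low-attention visual tokens too early. COAST uses native cross-modal attention to identify query-relevant anchors, estimates contextual dispersion via attention entropy, and adapts the trade-off between retaining semantic evidence and spatial context, accelerating inference while preserving compositional reasoning over secondary objects, spatial relations, and contextual cues.
Details
Motivation: Existing pruning methods typically rank visual tokens with a scalar score from shallow text-to-image attention and discard low-scoring patches to speed up LVLM inference. This scalar criterion is unreliable for compositional reasoning: tokens ignored in early layers can become essential in later layers for resolving secondary objects, spatial relations, and contextual cues. Premature pruning makes the model lose visual grounding and fall back on language priors, which the paper calls "Visual Aphasia".
Result: Across seven benchmarks, COAST reduces visual tokens by 77.8% and achieves a 2.15x latency speedup while retaining 98.64% of the original average performance. It consistently outperforms strong pruning baselines across token budgets and generalizes across multiple LVLM families, indicating that adaptive semantic routing is a robust alternative to one-shot scalar pruning.
Insight: The novelty is treating compression as adaptive semantic routing rather than simple scalar pruning. This includes identifying query-specific anchors with native cross-modal attention, estimating contextual dispersion via attention entropy, adapting the retention trade-off between semantic evidence and spatial context, and a contrastive routing score that preserves both anchor-aligned evidence and complementary spatial context. The result is a more reliable pruning strategy for efficient LVLM inference that avoids the loss of compositional reasoning caused by over-pruning.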
Abstract: Are low-attention visual tokens truly redundant in vision-language reasoning? Existing pruning methods often assume so, ranking visual tokens by shallow text-to-image attention and discarding low-scoring patches to accelerate LVLM inference. We show that this scalar criterion is unreliable for compositional reasoning: tokens ignored in early layers can later become essential for resolving secondary objects, spatial relations, and contextual cues. Premature pruning can therefore induce Visual Aphasia, a failure mode in which the model loses visual grounding and falls back on language priors. We introduce COAST (COntrastive Adaptive Semantic Token Pruning), a training-free pruning framework that casts compression as adaptive semantic routing. COAST uses native cross-modal attention to identify query-specific anchors and estimate contextual dispersion via attention entropy, then adapts the retention trade-off between semantic evidence and spatial context. It further uses a contrastive routing score to preserve both anchor-aligned evidence and complementary spatial context. Across seven benchmarks, COAST reduces visual tokens by 77.8% and achieves a 2.15x latency speedup while retaining 98.64% of the original average performance. Beyond a single backbone or compression setting, COAST consistently outperforms strong pruning baselines across token budgets and generalizes across multiple LVLM families, showing that adaptive semantic routing is a robust alternative to one-shot scalar pruning.
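The abstract names attention entropy as the dispersion estimate but does not give the rule that turns it into a retention trade-off. The sketch below shows one hypothetical rule, purely for illustration: normalize the text-to-image attention over visual tokens, compute its Shannon entropy scaled to [0, 1], and let that fraction of the token budget go to spatial context while the rest goes to the highest-attention anchors. The function names and the linear split are assumptions, not the COAST algorithm.

```python
import math


def normalized_entropy(attn):
    """Shannon entropy of an attention row over visual tokens, scaled to [0, 1]."""
    total = sum(attn)
    probs = [a / total for a in attn]
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(attn)) if len(attn) > 1 else 0.0


def split_budget(attn, budget):
    """Hypothetical rule: higher dispersion shifts budget from anchors to context.

    Returns the indices of the anchor tokens to keep and the number of
    remaining slots reserved for spatial-context tokens.
    """
    h = normalized_entropy(attn)
    n_context = min(budget, round(budget * h))
    n_anchor = budget - n_context
    ranked = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)
    return ranked[:n_anchor], n_context


# Attention concentrated on token 0: moderate entropy, one anchor and one context slot.
anchors, n_context = split_budget([0.7, 0.1, 0.1, 0.1], budget=2)
```

A sharply peaked attention row keeps mostly anchors, while a diffuse one reserves most of the budget for context, which matches the qualitative behaviour the abstract describes.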
[167] SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation cs.CV | cs.AI
Shanwen Tan, Hao Li, Jingtao Zhang, Xiaosong Jia, Xue Yang
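TL;DR: This paper proposes SWIFT, a training-free framework for multi-prompt long-video generation that uses semantic windowing and injection to switch semantics efficiently while preserving temporal coherence.
Details
Motivation: Existing methods rely on cache rebuilding at prompt boundaries or on fixed memory budgets, which introduces redundant computation and limits semantic adaptation; they cannot balance visual continuity against the need for rapid semantic switching.
Result: SWIFT reaches 22.6 FPS on a single H100 GPU and, while preserving generation quality, is substantially more efficient than current state-of-the-art methods.
Insight: Contributions include a lightweight Semantic Injection Cache, head-wise semantic injection based on attention-head alignment, an Adaptive Dynamic Window for memory allocation, and segment-level semantic anchors, which together balance computational efficiency and semantic consistency.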
Abstract: Streaming long-video generation faces a central challenge in continuous semantic switching, requiring adaptive memory to preserve coherent visual evolution. Current approaches rely on cache rebuilding at prompt boundaries or fixed memory budgets, but they introduce redundant computation and limit flexible semantic adaptation. This limitation arises from a mismatch between cached video history and prompt updates, as memory preserves visual continuity while prompt switches demand rapid semantic adaptation. Motivated by this observation, we present SWIFT, Semantic Windowing and Injection for Flexible Transitions, a training-free framework for multi-prompt long-video generation that enables efficient semantic switching while preserving temporal coherence in causal video diffusion models. SWIFT introduces a lightweight Semantic Injection Cache that augments cached video memory rather than reconstructing it from scratch at every prompt boundary. To avoid uniformly perturbing all attention channels, we further perform head-wise semantic injection, so that each attention head receives a prompt update proportional to its alignment with the current video state. In addition, we introduce an Adaptive Dynamic Window that allocates temporal memory according to prompt phase, using larger local context near switching boundaries and smaller windows during stable segments to reduce average inference cost. To preserve long-range semantic consistency under compressed local attention, we further maintain segment-level semantic anchors that summarize prompt-conditioned video history and reintroduce it as compact memory tokens. Compared with current state-of-the-art methods, SWIFT preserves generation quality while achieving 22.6 FPS on a single H100 GPU, establishing a substantially more efficient solution for multi-prompt long-video generation. Our code is available at https://github.com/ShanwenTan/SWIFT.
[168] Through the Lens of Character: Resolving Modality-Role Interference in Multimodal Role-Playing Agent cs.CV | cs.CL
Yihong Tang, Kehai Chen, Xuefeng Bai, Min Zhang
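TL;DR: This paper proposes Character-Aware Visual Intervention (CAVI), a training-free framework for resolving Modality-Role Interference in multimodal role-playing agents. Through three core mechanisms (character-guided token pruning, orthogonal feature modulation, and modality-adaptive role steering), CAVI lets an agent perceive the visual world from its character's perspective, improving character-consistent multimodal interaction.
Details
Motivation: Applying multimodal large language models to role-playing agents runs into Modality-Role Interference: generic, character-agnostic visual features overwhelm fragile character traits, so agents struggle to maintain both visual grounding and character consistency.
Result: Extensive experiments show that CAVI alleviates Modality-Role Interference and significantly enhances character-consistent multimodal interaction.
Insight: The novelty is a training-free, systematic visual intervention framework that combines macroscopic token pruning, microscopic feature projection, and dynamic steering at decoding time so that the model processes visual information from the character's subjective perspective. This offers a new approach to identity- or role-aware perception in multimodal tasks.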
Abstract: The advancement of Multimodal Large Language Models (MLLMs) has expanded Role-Playing Agents (RPAs) into visually grounded environments. However, human vision is inherently subjective and identity-driven, whereas existing MLLMs extract objective, character-agnostic features for general tasks. In RPAs, this generic visual noise overpowers fragile character traits, causing Modality-Role Interference (MRI), where agents struggle to integrate visual grounding and character consistency. To address this, we introduce the training-free Character-Aware Visual Intervention (CAVI) framework, enabling agents to perceive the world through the lens of character. CAVI systematically targets MRI: macroscopically, Character-Guided Token Pruning (CTP) restricts the visual receptive field to role-relevant entities; microscopically, Orthogonal Feature Modulation (OFM) projects tokens onto a character-context subspace to extract aligned facts; and during decoding, Modality-Adaptive Role Steering (MARS) dynamically optimizes steering intensity based on visual reliance. Extensive experiments show CAVI effectively alleviates MRI, significantly enhancing character-consistent multimodal interactions.
[169] SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs cs.CV
Bo Gu, Zhikang Zhang, Zizhuang Wei, Zhenyuan Chen, Lingyun Li
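TL;DR: SpaceMind++ is a video multimodal large language model (MLLM) architecture that builds a voxelized cognitive map to support spatially consistent 3D reasoning. It extracts and integrates semantic and spatial cues from RGB video into a world-centered representation that preserves object permanence and spatial topology. A Coordinate-Guided Deep Iterative Fusion mechanism relays map-level spatial knowledge back into the original 2D visual features, strengthening spatial understanding without disrupting the pretrained model's interface.
Details
Motivation: MLLMs have made notable progress in visual understanding and language reasoning but lack a persistent world-centered representation, which makes spatially consistent reasoning in 3D environments difficult. Inspired by the mammalian dual-stream system, the paper targets these gaps in spatial cognition and cross-view consistency.
Result: The model achieves new state-of-the-art performance on VSI-Bench and shows superior out-of-distribution generalization on SPBench, SITE-Bench, and SPAR-Bench, underscoring its robustness in unseen 3D environments.
Insight: Contributions include: 1) a voxelized cognitive map that reorganizes egocentric observations into a shared 3D metric representation; 2) Coordinate-Guided Deep Iterative Fusion, which uses coordinate embeddings and 3D Rotary Positional Encoding to ground semantic interactions in metric 3D space, resembling the entorhinal binding of sensory features to metric space. This gives video MLLMs an explicit spatial representation and fusion method, improving the coherence and generalization of spatial reasoning.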
Abstract: Recent multimodal large language models (MLLMs) have made remarkable progress in visual understanding and language-based reasoning, yet they lack a persistent world-centered representation for spatially consistent reasoning in 3D environments. Inspired by the mammalian dual-stream system, where semantic and spatial cues are processed separately and integrated into an allocentric cognitive map, we propose SpaceMind++, a video MLLM architecture that explicitly builds a voxelized cognitive map from RGB videos. This map reorganizes fragmented egocentric observations into a shared 3D metric representation, enabling the model to preserve object permanence and spatial topology across changing viewpoints. To make this allocentric representation usable by a pretrained video MLLM without disrupting its native visual-token interface, we introduce Coordinate-Guided Deep Iterative Fusion, a new mechanism that relays map-level spatial knowledge back into the original 2D visual features. This fusion is explicitly guided by coordinate embeddings and 3D Rotary Positional Encoding, which ground semantic interactions in metric 3D space, resembling the entorhinal binding of sensory features to metric space. Extensive experiments show that SpaceMind++ achieves new state-of-the-art performance on VSI-Bench. Furthermore, it demonstrates superior out-of-distribution generalization on SPBench, SITE-Bench, and SPAR-Bench, underscoring its robustness in unseen 3D environments.
[170] Uncertainty-Aware and Decoder-Aligned Learning for Video Summarization cs.CV
Omer Tariq, Syed Muhammad Raza, Jeongbae Son
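TL;DR: This paper proposes VASTSum, an uncertainty-aware and decoder-aligned learning framework for video summarization. It predicts probabilistic frame-level importance scores with a variational formulation, explicitly modeling the uncertainty that arises from multi-annotator supervision, and adds a decoder-aligned regularization that stabilizes knapsack-based summary selection.
Details
Motivation: Video summarization is inherently difficult because annotations are highly subjective and evaluation depends on discrete decoding procedures such as temporal segmentation and knapsack-based selection. Existing methods either learn deterministic importance scores that ignore these characteristics or adopt complex generative models that increase training and inference cost.
Result: The framework is evaluated on the SumMe and TVSum benchmarks with standard rank-based metrics. It obtains consistent and competitive Kendall and Spearman correlations across multiple data splits, improving robustness under annotation disagreement while keeping efficient single-forward inference.
Insight: The novelty is handling annotation uncertainty explicitly through probabilistic modeling and using a regularization objective aligned with the decoding stage (knapsack selection). This offers a principled alternative to deterministic and diffusion-based video summarization methods, accounting for subjectivity while remaining efficient at inference.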
Abstract: Video summarization aims to produce a compact representation of a long video by selecting a subset of temporally important segments that best reflect human preferences. This task is inherently difficult due to strong annotation subjectivity and the reliance on discrete decoding procedures, such as temporal segmentation and knapsack-based selection, during evaluation. Most existing approaches either learn deterministic importance scores that overlook these characteristics or adopt complex generative models that increase training and inference cost. In this paper, we propose VASTSum, an uncertainty-aware and decoder-aligned learning framework for video summarization that addresses both challenges within a single-pass model. The proposed method predicts probabilistic frame-level importance scores using a variational formulation, enabling explicit modeling of uncertainty arising from multi-annotator supervision. To account for subjectivity, particularly under binary annotations, we employ a supervision strategy that encourages alignment with plausible human annotation modes rather than enforcing a single consensus target. Furthermore, we introduce a decoder-aligned regularization that promotes stability of knapsack-based summary selection, reducing sensitivity to small perturbations in predicted scores. We evaluate the proposed framework on the SumMe and TVSum benchmarks using standard rank-based metrics. Experimental results show consistent and competitive Kendall and Spearman correlations across multiple data splits, demonstrating improved robustness under annotation disagreement while maintaining efficient single-forward inference. These results indicate that explicitly modeling uncertainty and aligning learning objectives with the decoding stage provide a principled alternative to both deterministic and diffusion-based video summarization methods.
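The knapsack-based decoding step that the regularizer targets can be sketched as a standard 0/1 knapsack over temporal segments: each segment has an importance score and a length in frames, and the summary must fit a total length budget. The code below is a generic dynamic-programming solution, not VASTSum itself; the function name and the example numbers are made up for illustration.

```python
def knapsack_select(scores, lengths, capacity):
    """Choose segments maximizing total score with total length <= capacity.

    scores:   per-segment importance (for example, mean frame score).
    lengths:  per-segment length in frames (positive integers).
    capacity: summary length budget in frames.
    Returns the sorted indices of the selected segments.
    """
    n = len(scores)
    # best[i][c]: best total score using the first i segments within budget c.
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = lengths[i - 1], scores[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]
            if w <= c and best[i - 1][c - w] + v > best[i][c]:
                best[i][c] = best[i - 1][c - w] + v

    # Backtrack to recover which segments were taken.
    chosen = []
    c = capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    return sorted(chosen)


# Two short segments (total score 0.9) beat one long segment (score 0.6).
summary = knapsack_select([0.6, 0.5, 0.4], [5, 3, 2], capacity=5)
```

The example also shows why the abstract stresses stability: a small change in the first score (for instance from 0.6 to 0.95) flips the selection from segments 1 and 2 to segment 0, so the discrete summary can be sensitive to small perturbations in predicted scores.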
[171] QueST: Persistent Queries as Semantic Monitors for Drift Suppression in Long-Horizon Tracking cs.CV | cs.ROPDF
Mayank Anand, Mohammad Saqlain, Kyan Mahajan, Priya Shukla, Gora Chand Nandi
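TL;DR: The paper proposes QueST, a framework that treats interaction-relevant entities in long-horizon video tracking as persistent semantic queries rather than transient point tracks. It suppresses semantic drift with global spatio-temporal attention and lightweight 3D physical constraints, and markedly improves long-horizon tracking stability on the PartNet-Mobility dataset.
Details
Motivation: Existing video point trackers rely on local frame-to-frame matching. Over long horizons, articulation, occlusion, and viewpoint change cause errors to accumulate into semantic drift that these trackers can neither detect nor correct, which motivates redesigning long-horizon tracking from a monitoring perspective.
Result: Evaluated on long-horizon articulated sequences from PartNet-Mobility against TAP-Net, RAFT-3D, and CoTracker, QueST substantially reduces terminal drift, improves Absolute Point Error (APE) by 67.7% over TAP-Net, and better preserves identity over extended horizons.
Insight: The novel elements are modeling tracked objects as persistent semantic queries instead of point tracks, providing a stable semantic anchor through global spatio-temporal attention at every time step, and adding lightweight 3D physical constraints (geometric plausibility) to suppress unbounded drift under occlusion. Together these embed semantic monitoring directly into perception.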
Abstract: Tracking points in videos is typically formulated as frame-to-frame correspondence, where each point is matched locally to the next frame. While this works over short horizons, errors accumulate under articulation, occlusion, and viewpoint change, leading to silent semantic drift that existing trackers cannot detect or correct. In this work, we revisit long-horizon tracking from a monitoring perspective and introduce QueST, a monitoring-by-design framework that treats interaction-relevant entities as persistent semantic queries rather than transient point tracks. Instead of local propagation, each query attends globally over spatio-temporal video features at every time step, providing a stable semantic anchor across time. We further constrain query trajectories with lightweight 3D physical grounding, using geometric plausibility to suppress unbounded drift under occlusion. We evaluate QueST on long-horizon articulated sequences from PartNet-Mobility in SAPIEN and compare against RAFT-3D, CoTracker, and TAP-Net. QueST substantially reduces terminal drift, achieving a 67.7% Absolute Point Error (APE) improvement over TAP-Net while better preserving identity over extended horizons. Our results show that embedding semantic monitoring directly into perception enables more reliable long-horizon tracking under distribution shift.
[172] KAN Text to Vision? The Exploration of Kolmogorov-Arnold Networks for Multi-Scale Sequence-Based Pose Animation from Sign Language Notation cs.CV | cs.AI | cs.MMPDF
Guanyi Du, Lintao Wang, Kun Hu, Ziyang Wang
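TL;DR: This paper proposes KANMultiSign, a multi-scale sequence generator that converts the HamNoSys sign language notation into 2D human pose sequences. The model follows a coarse-to-fine generation strategy: an intermediate body-hand-face scaffold first enforces global structural coherence, after which hand articulation is refined to improve finger-level detail. It also integrates Kolmogorov-Arnold Network (KAN) modules into a Transformer backbone, using learnable univariate function primitives to model the non-linear mapping from discrete phonological symbols to continuous body kinematics with a compact parameterization.
Details
Motivation: Generating sign language animation from symbolic notation is a route to scalable, accessible sign animation. The paper addresses the highly non-linear mapping from the discrete HamNoSys notation to continuous human pose sequences and explores more efficient ways to model it.
Result: Experiments on several public corpora covering Polish, German, Greek, and French sign languages show consistent reductions in dynamic-time-warping-based joint error relative to a strong notation-to-pose baseline, with substantially fewer parameters. Ablations further show that KAN-based variants cut the parameter count sharply while remaining competitive, but that the main performance gains come from multi-scale supervision.
Insight: The main contributions are a coarse-to-fine generation strategy with multi-scale supervision and the integration of KAN modules into a Transformer for compact, efficient modeling of the non-linear mapping. Viewed objectively, multi-scale supervision is positioned as the key mechanism for improving notation-conditioned pose generation, while KAN provides a parameter-efficient alternative rather than the main driver of accuracy gains.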
Abstract: Sign language production from symbolic notation offers a scalable route to accessible sign animation. We present KANMultiSign, a multi-scale sequence generator that translates HamNoSys notation into two-dimensional human pose sequences. Our framework makes two complementary contributions. First, we introduce a coarse-to-fine generation strategy with multi-scale supervision: the model is first guided by an intermediate body–hand–face scaffold to encourage global structural coherence, and then refines fine-grained hand articulation to improve finger-level detail. Second, we investigate integrating Kolmogorov–Arnold Network modules into a Transformer backbone, using learnable univariate function primitives to model the highly non-linear mapping from discrete phonological symbols to continuous body kinematics with a compact parameterization. Experiments on multiple public corpora spanning Polish, German, Greek, and French sign languages show consistent reductions in dynamic time warping based joint error compared with a strong notation-to-pose baseline, while using substantially fewer parameters. Controlled ablations further indicate that KAN-based variants substantially reduce parameter count while maintaining competitive performance when coupled with multi-scale supervision, rather than serving as the main driver of accuracy gains. These findings position multi-scale supervision as the key mechanism for improving notation-conditioned pose generation, with KAN offering a compact alternative for efficient modeling. Our code will be publicly available.
[173] FPGA-Based Hardware Architecture for Contrast Maximization in Event-Based Vision cs.CVPDF
Michal Filipkowski, Marcin Kowalczyk, Tomasz Kryjak
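TL;DR: This paper presents an FPGA-based hardware architecture that implements the Contrast Maximization algorithm for event-based vision systems. Using event warping, contrast computation, and iterative optimization modules, the architecture exploits FPGA parallelism and pipelining to deliver high-speed, low-power, real-time motion parameter estimation.
Details
Motivation: Event-based vision sensors produce sparse data with high temporal resolution, which suits hardware processing, but existing CPU and GPU implementations are not efficient enough. A dedicated hardware architecture is needed for real-time embedded applications.
Result: Experiments show that the FPGA architecture estimates motion parameters more than 200 times faster than CPU and GPU implementations. Its performance is evaluated in terms of processing speed, energy efficiency, and hardware resource utilization, and its real-time capability is validated with an event-based object tracking application.
Insight: The novel elements are what the authors describe as the first hardware architecture to accelerate the CM algorithm, a deeply pipelined design that exploits the deterministic, parallel structure of the FPGA, and a hardware-aware optimization method. Together they provide a basis for real-time motion estimation in high-speed, low-power embedded systems.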
Abstract: This paper presents a hardware architecture that implements the Contrast Maximization (CM) algorithm in Field-Programmable Gate Array (FPGA) resources for event-based vision systems. CM estimates motion parameters by maximizing the contrast of an Image of Warped Events (IWE) reconstructed from asynchronous event streams. Event-based vision sensors generate sparse data with high temporal resolution and low spatial redundancy, which makes them well suited for hardware processing. The deterministic, massively parallel structure of the FPGA is leveraged to design a deeply pipelined architecture capable of high-throughput, energy-efficient processing suitable for real-time embedded applications. This paper details the hardware modules responsible for event warping, contrast computation, and iterative optimization, discusses key implementation decisions, and presents the hardware-aware optimization method used in the design. Experimental results demonstrate a substantial speed and efficiency improvement over CPU- and GPU-based implementations, with motion parameter estimation executing over 200 times faster. To the best of our knowledge, this is the first hardware architecture enabling acceleration of CM algorithm computations. Its performance is evaluated in terms of processing speed, energy efficiency, and hardware resource utilization. The proposed design is validated using an event-based object tracking application. The results confirm that the architecture provides a solid foundation for real-time motion estimation in high-speed, low-power embedded systems.
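As a software-level illustration of the Contrast Maximization idea described above, the sketch below warps events along a candidate velocity, accumulates an Image of Warped Events (IWE), and scores it by pixel variance. It assumes a simple 2D translational motion model and replaces the paper's iterative optimization with a plain grid search over candidates; all function names and the synthetic events are hypothetical and unrelated to the FPGA design.

```python
def iwe_contrast(events, vx, vy, width, height):
    """Variance of the Image of Warped Events for a candidate velocity.

    events: iterable of (x, y, t). Each event is warped back to t = 0
    under an assumed constant translational velocity (vx, vy).
    """
    img = [[0] * width for _ in range(height)]
    for x, y, t in events:
        xw = round(x - vx * t)
        yw = round(y - vy * t)
        if 0 <= xw < width and 0 <= yw < height:
            img[yw][xw] += 1
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)


def estimate_velocity(events, candidates, width, height):
    """Grid search (an illustrative stand-in for iterative optimization)."""
    return max(candidates,
               key=lambda v: iwe_contrast(events, v[0], v[1], width, height))


# Synthetic events from a point moving 2 px per time unit along x
events = [(3 + 2 * t, 5, t) for t in range(5)]
candidates = [(0, 0), (1, 0), (2, 0), (3, 0)]
print(estimate_velocity(events, candidates, width=16, height=8))
```

With the correct velocity, all events collapse onto one pixel, so the warped image is sharpest and its variance is highest; wrong candidates smear the events across several pixels.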
[174] DeformMaster: An Interactive Physics-Neural World Model for Deformable Objects from Videos cs.CV | cs.ROPDF
Can Li, Zhoujian Li, Ren Li, Jie Gu, Lei Lei
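TL;DR: DeformMaster is an interactive physics-neural world model for deformable objects learned from real videos. Within a unified dynamics-and-appearance framework, it turns real interaction videos into an online interactive model that predicts future dynamics and renders dynamic appearance, supporting novel action rollout, material-parameter variation, and dynamic novel-view synthesis.
Details
Motivation: Learning a world model of deformable objects from real videos is difficult. The model must infer a physical state from visual observations, roll it forward under new interactions, and render the resulting dynamics with high visual fidelity, while recovering geometry, appearance, underlying physical dynamics, interaction grounding, and material behavior.
Result: Experiments on real-world deformable-object sequences show that DeformMaster outperforms state-of-the-art baselines at rolling out future dynamics and rendering dynamic appearance.
Insight: The novel elements are: combining structured physical rollout with a neural residual that compensates for unmodeled effects within one framework; grounding sparse hand motion as a distributed compliant actuator for hand-continuum interaction; representing material response with spatially varying constitutive experts; and driving high-fidelity 4D appearance from the predicted physical evolution. Viewed objectively, the tight coupling of physical simulation and neural rendering to handle high-dimensional deformation and complex material response is an idea worth borrowing.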
Abstract: World models for deformable objects should recover not only geometry and appearance, but also underlying physical dynamics, interaction grounding, and material behavior. Learning such a model from real videos is challenging because deformable linear, planar, and volumetric objects evolve under high-dimensional deformation, noisy interactions, and complex material response. The model must therefore infer a physical state from visual observations, roll it forward under new interactions, and render the resulting dynamics with high visual fidelity. We present DeformMaster, a video-derived interactive physics–neural world model that turns real interaction videos into an online interactive model of deformable objects within a unified dynamics-and-appearance framework. DeformMaster preserves structured physical rollout while using a neural residual to compensate for unmodeled effects, grounds sparse hand motion as a distributed compliant actuator for hand–continuum interaction, represents material response with spatially varying constitutive experts, and drives high-fidelity 4D appearance from the predicted physical evolution. Experiments on real-world deformable-object sequences demonstrate DeformMaster’s ability to roll out future dynamics and render dynamic appearance, outperforming state-of-the-art baselines while supporting novel action rollout, material-parameter variation, and dynamic novel-view synthesis.
[175] From Pixels to Concepts: Do Segmentation Models Understand What They Segment? cs.CVPDF
Shuang Liang, Zeqing Wang, Yuxian Li, Xihui Liu, Han Wang
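TL;DR: This paper introduces the CAFE benchmark for evaluating whether promptable segmentation models (such as SAM3) truly understand and stay faithful to the semantic concept in a text prompt instead of relying only on visually salient cues. Built through attribute-level counterfactual manipulation, the benchmark contains 2,146 test samples across three categories of misleading semantics and reveals a systematic gap between localization quality and concept discrimination.
Details
Motivation: Existing segmentation benchmarks mainly evaluate mask accuracy or object presence. They cannot tell whether a model is faithful to the queried concept or relies on visually salient but semantically misleading cues.
Result: Various model types and sizes are evaluated on CAFE. The experiments show a systematic gap between localization quality and concept discrimination: models often produce accurate masks for misleading prompts, so strong mask prediction does not necessarily imply faithful semantic grounding.
Insight: The novelty lies in building an evaluation benchmark through counterfactual attribute manipulation and in stressing that segmentation models should perform concept-faithful grounding rather than shortcut-driven mask retrieval. The analysis exposes the limits of current promptable segmentation models in semantic understanding and provides a controlled benchmark for model diagnosis.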
Abstract: Segmentation is a fundamental vision task underlying numerous downstream applications. Recent promptable segmentation models, such as Segment Anything Model 3 (SAM3), extend segmentation from category-agnostic mask prediction to concept-guided localization conditioned on high-level textual prompts. However, existing benchmarks primarily evaluate mask accuracy or object presence, leaving unclear whether these models faithfully ground the queried concept or instead rely on visually salient but semantically misleading cues. We introduce CAFE (Counterfactual Attribute Factuality Evaluation), a novel benchmark for evaluating concept-faithful segmentation in promptable segmentation models. CAFE is built on attribute-level counterfactual manipulation: the target region and ground-truth mask are preserved, while attributes such as surface appearance, context, or material composition are modified to introduce misleading semantic cues. The benchmark contains 2,146 paired test samples, each consisting of a target image, a ground-truth mask, a positive prompt, and a misleading negative prompt. These samples cover three counterfactual categories: Superficial Mimicry (SM), Context Conflict (CC), and Ontological Conflict (OC). We evaluate various model types and sizes on CAFE. Experiments reveal a systematic gap between localization quality and concept discrimination: models often generate accurate masks even for misleading prompts, suggesting that strong mask prediction does not necessarily imply faithful semantic grounding. CAFE provides a controlled benchmark for diagnosing whether promptable segmentation models perform concept-faithful grounding rather than shortcut-driven mask retrieval.
[176] SoccerLens: Grounded Soccer Video Understanding Beyond Accuracy cs.CVPDF
Ismael Elsharkawi, Ahmed Sait, Silvio Giancola, Bernard Ghanem, Hossam Sharara
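TL;DR: The paper introduces SoccerLens, a benchmark for evaluating whether soccer video understanding models base their predictions on meaningful visual evidence rather than spurious correlations. It contains annotated video segments covering 13 common soccer events, together with an attribution method that combines spatial and temporal attention and metrics that measure how well model attention aligns with the annotated cues.
Details
Motivation: Existing evaluation of soccer video understanding focuses on classification accuracy and does not assess visual grounding, so it cannot tell whether a model relies on real visual evidence or exploits spurious correlations and shortcut learning.
Result: An evaluation of state-of-the-art soccer vision-language models (VLMs) shows that, despite high classification accuracy, current models do not exceed 50% grounding performance even under the loosest cue definitions, and they consistently underuse temporal information.
Insight: The novelty lies in SoccerLens, which the summary describes as the first benchmark focused on grounded understanding of soccer video, and in extending an attribution method to jointly model spatial and temporal attention. The work reveals a large gap between predictive performance and true visual grounding in complex spatio-temporal domains and underlines the need for grounded evaluation.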
Abstract: Vision-language models (VLMs) have recently shown strong potential in soccer video understanding. However, given the high complexity of soccer videos due to large viewpoint variations, rapid shot transitions, and cluttered scenes, it remains unclear whether VLMs rely on meaningful visual evidence or exploit spurious correlations and shortcut learning. Existing evaluation protocols focus primarily on classification accuracy and do not assess visual grounding. To address this limitation, we introduce SoccerLens, a benchmark for grounded soccer video understanding. The benchmark contains annotated video segments spanning $13$ common soccer events, with structured visual cues organized into three levels of semantic relevance. We further extend the attribution method of Chefer et al. [arXiv:2103.15679] to jointly model spatial and temporal attention, and introduce evaluation metrics that measure whether model attention aligns with annotated cues or drifts toward spurious regions. Our evaluation of state-of-the-art soccer VLMs shows that, despite strong classification accuracy, current models fail to exceed $50\%$ grounding performance even under the loosest cue definitions and consistently underutilize temporal information. These results reveal a substantial gap between predictive performance and true visual grounding, highlighting the need for grounded evaluation in complex spatio-temporal domains such as soccer.
[177] Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning cs.CVPDF
Xuan Gong, Hanbo Huang, Hao Zheng, Yiran Zhang, Wenbin Dai
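TL;DR: This paper proposes an information-theoretic approach to visual retention. By analyzing a lower bound on downstream visual gain, it introduces reflection-anchor policy optimization (RAPO) to retain visual information more effectively during long-chain multimodal reasoning, improving the performance of large vision-language models.
Details
Motivation: Long chain-of-thought reasoning improves large vision-language models, but visual information tends to fade during generation, which limits long-horizon multimodal reasoning. Existing methods rely on perception heuristics rather than principled gain analysis, and how local visual influence propagates is left implicit.
Result: On reasoning-intensive and general-domain benchmarks, RAPO substantially outperforms strong baselines across multiple LVLM backbones. Mechanism analyses show that reflection anchors are enriched at visually sensitive decision points and that RAPO strengthens contrastive visual-dependence signals along generated trajectories.
Insight: The novelty lies in deriving a lower bound on downstream visual gain from an information-theoretic standpoint, an anchor selection strategy based on entropy and divergence, and a chain-masked finite-window KL surrogate objective. Together they achieve propagation-aware visual retention, an idea that can carry over to long-sequence generation in multimodal models.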
Abstract: Long chain-of-thought (CoT) reasoning improves large vision–language models, but visual information often fades during generation, limiting long-horizon multimodal reasoning. Existing methods either re-inject vision at inference or train policies for stronger grounding, but where to intervene relies on perception heuristics rather than principled gain analysis, and how local visual influence propagates remains implicit. We study this problem from an information-theoretic standpoint and derive a lower bound on the downstream visual gain of a one-step intervention, which suggests two factors: local branching room (token entropy) and downstream visual propagation potential (suffix divergence from a vision-marginalized reference). Guided by this analysis, we propose reflection-anchor policy optimization (RAPO), a GRPO-based policy optimization method that selects high-entropy reflection anchors and optimizes a chain-masked finite-window KL surrogate for downstream visual dependence. Experiments on reasoning-intensive and general-domain benchmarks show that RAPO delivers substantial gains over strong baselines across multiple LVLM backbones. Mechanism analyses further indicate that reflection anchors are enriched for visually sensitive decision points and that RAPO increases contrastive visual-dependence signals along generated trajectories.
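The first factor named in the abstract, local branching room measured by token entropy, can be sketched as follows. This is only an illustration of selecting high-entropy positions; it does not model the suffix-divergence factor or the RAPO objective, and the function names and the top-`k` selection rule are assumptions.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_anchors(step_distributions, k):
    """Return the indices of the k generation steps with highest entropy.

    The top-k rule is an illustrative assumption, not the paper's criterion.
    """
    entropies = [token_entropy(p) for p in step_distributions]
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical next-token distributions at four generation steps
steps = [
    [1.0, 0.0, 0.0],        # fully confident: entropy 0
    [0.5, 0.5, 0.0],        # two-way branch
    [1 / 3, 1 / 3, 1 / 3],  # maximal uncertainty over three tokens
    [0.9, 0.1, 0.0],        # mostly confident
]
print(select_anchors(steps, k=2))
```

Steps where the distribution is spread across several tokens are the ones with room to branch, which is why the analysis treats them as candidate intervention points.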
[178] GSMap: 2D Gaussians for Online HD Mapping cs.CVPDF
Zhenxuan Zeng, Sheng Yang, Lingxuan Wang, Yanan He, Mingxia Chen
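TL;DR: GSMap proposes an online HD map construction framework based on a learnable 2D Gaussian representation. By modeling each map element as an ordered sequence of 2D Gaussians, it combines the strengths of vectorization-based methods (topology preservation) and rasterization-based methods (geometric accuracy), optimizing geometric constraints and topological structure jointly.
Details
Motivation: Existing HD map construction methods face a fundamental trade-off: vectorization-based methods preserve topology but have poor geometric fidelity, while rasterization-based methods allow precise geometric supervision but produce unstructured outputs. GSMap aims to bridge this gap with a unified representation that optimizes geometric accuracy and topological regularity together.
Result: Experiments on nuScenes and Argoverse2 show significant performance improvements, strong compatibility with existing HD mapping architectures, and an effective unification of geometric and topological learning.
Insight: The novelty lies in representing map elements as ordered sequences of 2D Gaussians whose centers correspond to the vertices of the vectorized polyline or polygon. Differentiable rasterization enforces pixel-level geometric constraints, and topology-aware vectorization maintains structural regularity, offering a new way to learn geometry and topology jointly in HD map construction.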
Abstract: Accurate High-Definition (HD) map construction is critical for autonomous driving, yet existing methods face a fundamental trade-off: vectorization-based approaches preserve topology but struggle with geometric fidelity, while rasterization-based approaches enable precise geometric supervision but produce unstructured outputs. To bridge this gap, we propose GSMap, a novel framework that unifies both paradigms via a learnable 2D Gaussian representation. Each map element is modeled as an ordered sequence of 2D Gaussians, whose centers correspond to the vertices of the vectorized polyline/polygon. This formulation enables simultaneous optimization through: (1) Differentiable rasterization that enforces pixel-level geometric constraints, and (2) Topology-aware vectorization that maintains structural regularity. Experiments on both nuScenes and Argoverse2 demonstrate that our Gaussian-based representation effectively unifies geometric and topological learning, achieving significant performance improvements and demonstrating strong compatibility with existing HD mapping architectures. Code will be available at https://github.com/peakpang/GSMap
[179] Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study cs.CV | cs.AIPDF
Yuhan Wang, Zihan Li, Han Liu, Simon Arberet, Martin Kraus
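TL;DR: This paper proposes DiffKT3D, a unified Any2Any 3D diffusion framework that uses prior knowledge from pretrained video diffusion models for efficient, clinically meaningful radiotherapy dose prediction. It introduces a conditioning paradigm that supports multiple input modalities (CT, anatomical structures, body, beam settings, and others) and a reinforcement learning post-training mechanism guided by a clinical scorecard to better match institutional treatment preferences.
Details
Motivation: Voxel-wise dose prediction in radiotherapy planning is hard to adapt to diverse clinical settings because bespoke models generalize poorly. The paper also seeks to exploit the strong priors of large-scale pretrained generative models from the vision domain.
Result: On the GDP-HMM challenge dataset, DiffKT3D reduces voxel-level mean absolute error (MAE) from 2.07 to 1.93, a new state of the art, and also performs well on image quality and preference match.
Insight: The novelty lies in a unified 3D diffusion framework supporting any-to-any modality conditioning through modality-specific embeddings without cross-attention overhead, and in an RL post-training mechanism guided by a clinical scorecard that explicitly aligns the model with institutional treatment preferences. This offers an effective route for transferring knowledge from large pretrained diffusion models to a specific medical domain.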
Abstract: Voxel-wise dose prediction is a critical yet challenging task in practical radiotherapy (RT) planning, as bespoke models trained from scratch often struggle to generalize across diverse clinical settings. Meanwhile, generative models trained on billion-scale datasets from vision domains have achieved impressive performance. Herein, we propose DiffKT3D, a unified Any2Any 3D diffusion framework that leverages prior knowledge from pretrained video diffusion models for efficient and clinically meaningful dose prediction. To enable flexible conditioning across multiple clinical modalities (CT, anatomical structures, body, beam settings, etc.), we introduce an Any2Any conditional paradigm utilizing modality-specific embeddings without cross-attention overhead. Further, we design a novel reinforcement learning (RL) post-training mechanism guided by a clinically informed Scorecard explicitly tailored to institutional treatment preferences. Compared with the winner of the GDP-HMM challenge, DiffKT3D sets a new state of the art in dose prediction by reducing voxel-level MAE from 2.07 to 1.93. In addition, DiffKT3D achieves superior image quality and preference match. These results demonstrate that transferring diffusion priors via modality-aware conditioning and clinically aligned RL post-training can provide a robust and generalizable solution for RT planning across various clinical scenarios.
[180] Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning cs.CV | cs.LGPDF
Meng Lou, Hanzhong Guo, Linwei Chen, Yizhou Yu
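TL;DR: This paper proposes Retention-aware Policy Optimization (RaPO), a new method for addressing catastrophic forgetting in visual continual learning. It explicitly mitigates forgetting through trajectory-level reward shaping and achieves leading performance across several visual continual learning settings.
Details
Motivation: Although Reinforcement Fine-Tuning (RFT) is more resilient to catastrophic forgetting than Supervised Fine-Tuning (SFT), it still suffers from non-negligible forgetting in challenging visual continual learning settings such as class-incremental and domain-incremental learning. The paper analyzes the bottleneck behind this forgetting and proposes a remedy.
Result: Extensive experiments on five visual continual learning settings show that RaPO achieves leading performance, substantially reducing catastrophic forgetting while preserving strong plasticity.
Insight: The paper identifies "Trajectory-level Drift Agnosticism" as the key bottleneck behind forgetting in RFT and proposes RaPO, whose core is trajectory-level reward shaping through a Retention Reward and Cross-Task Advantage Normalization (CTAN). This provides new insight and direction for the systematic study of RFT in visual continual learning.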
Abstract: Recent studies suggest that Reinforcement Fine-Tuning (RFT) is inherently more resilient to catastrophic forgetting than Supervised Fine-Tuning (SFT). However, whether RFT (e.g., GRPO) can effectively overcome forgetting in challenging visual continual learning settings, such as class-incremental learning (CIL) and domain-incremental learning (DIL), remains an open problem. Through a pilot study, we confirm that while RFT consistently outperforms SFT, it still suffers from non-negligible forgetting. We empirically trace this bottleneck to Trajectory-level Drift Agnosticism: among candidate rollouts achieving identical task rewards, the KL divergence from the preceding-task policy varies substantially, which strongly correlates with catastrophic forgetting across sequential tasks. Motivated by this insight, we propose Retention-aware Policy Optimization (RaPO), a simple yet effective RFT method that explicitly mitigates forgetting through trajectory-level reward shaping. Specifically, RaPO comprises two core components: (1) Retention Reward that converts trajectory-level distribution drift into a continuous reward signal, preferentially reinforcing knowledge-preserving rollouts within each group; (2) Cross-Task Advantage Normalization (CTAN), which maintains a persistent exponential moving average of reward statistics across task boundaries to stabilize the optimization progress during continual learning. Leveraging the free-form textual generalization of MLLMs, we comprehensively evaluate RaPO across five visual continual learning settings. Extensive experiments demonstrate that RaPO achieves leading performance, substantially reducing catastrophic forgetting while preserving strong plasticity. To the best of our knowledge, this work represents the first systematic exploration of RFT in visual continual learning, offering insights that we hope will inspire future research.
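The two components described in the abstract can be sketched roughly as below. The abstract does not give the formulas, so everything concrete here is an assumption made for illustration: the retention bonus `beta * exp(-kl)`, the momentum value, and the class and function names are hypothetical and are not RaPO's actual definitions.

```python
import math


def shaped_rewards(task_rewards, kl_drifts, beta=0.5):
    """Add a retention bonus that shrinks as KL drift from the old policy grows.

    ASSUMPTION: the bonus form beta * exp(-kl) is illustrative only.
    """
    return [r + beta * math.exp(-kl) for r, kl in zip(task_rewards, kl_drifts)]


class EmaAdvantageNormalizer:
    """Keeps exponential moving averages of reward statistics.

    The statistics persist across task boundaries instead of being reset,
    which is the idea attributed to CTAN; the update rule is an assumption.
    """

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.mean = None
        self.var = None

    def update(self, rewards):
        batch_mean = sum(rewards) / len(rewards)
        batch_var = sum((r - batch_mean) ** 2 for r in rewards) / len(rewards)
        if self.mean is None:
            self.mean, self.var = batch_mean, batch_var
        else:
            m = self.momentum
            self.mean = m * self.mean + (1 - m) * batch_mean
            self.var = m * self.var + (1 - m) * batch_var

    def advantages(self, rewards, eps=1e-6):
        std = math.sqrt(self.var)
        return [(r - self.mean) / (std + eps) for r in rewards]


# Three rollouts with identical task reward but different drift
rewards = shaped_rewards([1.0, 1.0, 1.0], kl_drifts=[0.1, 1.0, 3.0])
normalizer = EmaAdvantageNormalizer()
normalizer.update(rewards)
print(normalizer.advantages(rewards))
```

In this toy group, the rollouts tie on task reward, so the ranking comes entirely from the drift term: the low-drift rollout receives a positive advantage and the high-drift one a negative advantage.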
[181] BEA-GS: BEyond RAdiance Supervision in 3DGS for Precise Object Extraction cs.CVPDF
Alessio Mazzucchelli, Maria Naranjo-Almeida, Jorge Bustos-Sanchez, Mariella Dimiccoli, Francesc Moreno-Noguer
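TL;DR: This paper proposes BEA-GS to address the lack of geometry optimization in existing 3D Gaussian Splatting techniques for semantic scene representation. It introduces two new loss functions that optimize the geometry of visible and non-visible Gaussians, enabling precise object extraction and boundary segmentation.
Details
Motivation: Most Gaussian Splatting techniques that provide a 3D semantic representation do not optimize the underlying geometry, which makes object-level editing and asset extraction difficult. The paper aims to obtain more precise object boundaries by improving the geometric representation.
Result: Compared against 12 state-of-the-art methods on 4 datasets using six metrics, the method achieves the best boundary segmentation to date.
Insight: The novelty lies in two loss functions: one propagates gradients through the rasterization to optimize the geometry of visible Gaussians so that it respects semantic boundaries, and the other adjusts the geometry of non-visible Gaussians directly, without passing through the rasterization. Together they yield near-perfect boundaries when objects are extracted.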
Abstract: Most Gaussian Splatting techniques that provide a 3D semantic representation of the scene do not optimize the underlying 3D geometry, making object-level editing or asset extraction challenging. Recent methods, such as COBGS, Trace3D, and ObjectGS, acknowledge this limitation and propose approaches that modify the scene’s geometry to represent the underlying semantics. We advance this concept further by proposing a novel solution that provides near-perfect boundaries in object extraction. We do so by introducing two new losses in the optimization: 1) a loss that modifies the geometry of visible Gaussians to respect semantic boundaries, and 2) a loss that adjusts the geometry of non-visible Gaussians that appear once the object is extracted. Our first loss propagates gradients directly through the rasterization, allowing for seamless integration within the optimization of the Gaussian parameters. The second loss also propagates gradients to Gaussian parameters but does so without passing through the rasterization, enabling modification of the scene’s geometry even when little transmittance reaches a Gaussian (partial or non-visible). Exhaustive comparisons with 12 state-of-the-art methods across 4 datasets, using six metrics, demonstrate that our approach produces the best overall boundary segmentation to date.
[182] VFM-SDM: A vision foundation model-based framework for training-free, marker-free, and calibration-free structural displacement measurement cs.CVPDF
Qingyu Xian, Hao Cheng, Berend Jan van der Zwaag, Rolands Kromanis, Ozlem Durmaz Incel
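TL;DR: This paper proposes VFM-SDM, a structural displacement measurement framework built on vision foundation models (VFMs). It needs no task-specific training, no installed markers, and no manual camera calibration. It integrates VFM-inferred camera parameter estimation with point tracking, reconstructs multi-directional structural displacement by triangulation, and adds structural geometry constraints to improve estimation consistency.
Details
Motivation: Conventional vision-based displacement measurement is constrained in deployment by task-specific model training, on-site marker installation, or manual camera calibration. The goal is efficient, non-contact deployment in real-world applications without on-site preparation.
Result: The framework is evaluated with a unified benchmarking protocol on a multi-modal field dataset collected from an in-service pedestrian bridge. For vertical and lateral displacements it shows low amplitude errors (NRMSE$_{\text{range}}$: 0.11/0.12), strong temporal agreement (correlation coefficient: 0.86/0.88), and small peak-to-peak amplitude errors (RPPAE: 0.01/0.02), indicating robust performance under real-world conditions.
Insight: The main novelty is an end-to-end displacement measurement framework that uses vision foundation models to avoid training, markers, and calibration, with structural geometry constraints that suppress physically implausible deviations. It improves automation and scalability and lays groundwork for structural response measurement in digital twin and data-centric construction workflows.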
Abstract: Reliable displacement measurement is fundamental for structural health monitoring and digital engineering workflows, as it provides direct structural response information. Vision-based measurement has emerged as a promising approach for low-cost, non-contact displacement monitoring. However, its deployment often remains constrained by task-specific model training or on-site preparation, such as marker installation or manual camera calibration. This study presents a Vision Foundation Model-based framework for Structural Displacement Measurement (VFM-SDM) that integrates VFM-inferred camera parameter estimation and point tracking to reconstruct multi-directional structural displacements via triangulation without task-specific training or on-site preparation, enabling efficient non-contact deployment in real-world applications. Structural geometry constraints are incorporated to suppress physically implausible deviations and improve estimation consistency. A multi-modal field dataset collected from an in-service pedestrian bridge is introduced alongside a unified benchmarking protocol to support reproducible evaluation. Representative results show low amplitude errors (NRMSE$_{\text{range}}$: 0.11/0.12), strong temporal agreement (correlation coefficient: 0.86/0.88), and small peak-to-peak amplitude errors (RPPAE: 0.01/0.02) for vertical and lateral displacements, indicating robust performance under real-world conditions. The proposed framework advances automated, scalable displacement monitoring and lays the groundwork for VFM-enabled structural response measurements in digital twin and data-centric construction workflows.
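The three reported metrics can be sketched as follows, under assumed definitions that the abstract does not spell out: NRMSE normalized by the range of the reference signal, the Pearson correlation coefficient, and RPPAE as the relative difference in peak-to-peak amplitude. The function names and the sample signals are hypothetical.

```python
import math


def nrmse_range(reference, estimate):
    """RMSE divided by the (max - min) range of the reference signal."""
    n = len(reference)
    rmse = math.sqrt(sum((e - r) ** 2 for r, e in zip(reference, estimate)) / n)
    return rmse / (max(reference) - min(reference))


def pearson(reference, estimate):
    """Pearson correlation coefficient between two equal-length signals."""
    n = len(reference)
    mr, me = sum(reference) / n, sum(estimate) / n
    cov = sum((r - mr) * (e - me) for r, e in zip(reference, estimate))
    var_r = sum((r - mr) ** 2 for r in reference)
    var_e = sum((e - me) ** 2 for e in estimate)
    return cov / math.sqrt(var_r * var_e)


def rppae(reference, estimate):
    """Relative peak-to-peak amplitude error (assumed definition)."""
    pp_ref = max(reference) - min(reference)
    pp_est = max(estimate) - min(estimate)
    return abs(pp_est - pp_ref) / pp_ref


# Hypothetical displacement traces (e.g. in mm): sensor reference vs. estimate
reference = [0.0, 1.0, 2.0, 1.0, 0.0]
estimate = [0.0, 0.5, 1.0, 0.5, 0.0]
print(nrmse_range(reference, estimate),
      pearson(reference, estimate),
      rppae(reference, estimate))
```

The example shows why the metrics are reported together: an estimate that is perfectly correlated with the reference can still understate the amplitude, which only the amplitude-based metrics reveal.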
[183] DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents cs.CV | cs.AIPDF
Yixiong Chen, Wenjie Xiao, Pedro R. A. S. Bassi, Boyan Wang, Liang He
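TL;DR: This paper introduces DeepTumorVQA, a hierarchical 3D CT benchmark for evaluating medical vision-language models and tool-augmented agents. It decomposes the multi-stage evidence chain of tumor diagnosis into four stages (recognition, measurement, visual reasoning, and medical reasoning) and contains about 476K questions. The benchmark supports both direct reasoning and tool interaction, and it shows that reliable quantitative measurement is the main bottleneck for current models, which tool augmentation effectively mitigates.
Details
Motivation: Existing medical visual question answering benchmarks compress model capability into a single accuracy score and cannot reveal where and why models fail. The paper aims at a finer, stage-wise evaluation through a hierarchical benchmark that follows the multi-stage evidence chain of tumor diagnosis.
Result: Across more than 30 model configurations, reliable quantitative measurement emerges as the main bottleneck, which makes later-stage visual and medical reasoning harder for VLMs. Tool augmentation substantially mitigates the problem, but once tools are available, reasoning with medical knowledge and tools becomes the new challenge.
Insight: The novelty lies in a 3D CT benchmark that evaluates medical VLMs and agents stage by stage, decomposing complex diagnostic tasks into independently scorable levels and integrating tool-interaction environments. Viewed objectively, its ground-truth step-by-step tool-use traces can supervise agents, and the benchmark provides a concrete roadmap from recognition to medical reasoning for future work.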
Abstract: Medical vision-language models (VLMs) and AI agents have made significant progress in learning to analyze and reason about clinical images. However, existing medical visual question answering (VQA) benchmarks collapse model capabilities into a single accuracy score, obscuring where and why models fail. We propose DeepTumorVQA, a hierarchical benchmark that follows the multi-stage evidence chain in tumor diagnosis and decomposes 3D CT reasoning into four stages: recognition, measurement, visual reasoning, and medical reasoning. Higher-level questions remain independently scorable, while their ground-truth evidence chains are defined over lower-level primitives. The benchmark contains 476K questions across 42 clinical subtypes on 9,262 3D CT volumes. In addition to a direct reasoning mode for VLMs, DeepTumorVQA provides tool-interaction environments for agent evaluation, where a model can call external tools, including segmentation models, measurement programs, and medical knowledge modules, before answering the question. Evaluating over 30 model configurations, we find that reliable quantitative measurement is the primary bottleneck, making later-stage visual and medical reasoning harder for VLMs, while tool augmentation substantially mitigates this issue. When tools are available, leveraging medical knowledge and tools to reason about medical images becomes a new challenge. We further show that ground-truth step-by-step tool-use traces from DeepTumorVQA can supervise agents and reduce tool-use and reasoning failures. This stage-wise progression from recognition to measurement to visual and medical reasoning provides a concrete roadmap for future medical VLM and AI agent studies. All data and code are released at https://github.com/Schuture/DeepTumorVQA.
[184] Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models cs.CVPDF
Yicheng Ji, Zhizhou Zhong, Jun Zhang, Qin Yang, XiTai Jin
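TL;DR: This paper proposes Forcing-KV, a hybrid KV cache compression strategy that addresses the high attention complexity and memory overhead caused by redundant KV caches over historical frames in autoregressive video diffusion models. It classifies attention heads as static or dynamic, applies structured static pruning to the former and segment-similarity-based dynamic pruning to the latter, and significantly raises generation speed and reduces cache memory while maintaining output quality.
Details
Motivation: Existing autoregressive video diffusion models (for example, those trained with the Self Forcing paradigm) enable long video generation with real-time responsiveness, but redundant key-value (KV) caches across historical frames still cause significant attention complexity and severe memory overhead, which limits scalability.
Result: The method generates more than 29 frames per second on a single NVIDIA H200 GPU and reduces cache memory by 30%. At 480P resolution it delivers speedups of up to 1.35x on LongLive and 1.50x on Self Forcing; at 1080P resolution the speedup reaches 2.82x, with output quality maintained.
Insight: The core idea is a functional classification of attention heads (static heads handle transitions across autoregressive chunks and intra-frame fidelity; dynamic heads handle inter-frame motion and consistency) and a hybrid compression strategy designed around it (structured pruning for static heads, segment-similarity-based dynamic pruning for dynamic heads). This offers a new, targeted approach to KV cache management for efficient autoregressive video generation.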
Abstract: Autoregressive (AR) video diffusion models adopt a streaming generation framework, enabling long-horizon video generation with real-time responsiveness, as exemplified by the Self Forcing training paradigm. However, existing AR video diffusion models still suffer from significant attention complexity and severe memory overhead due to the redundant key-value (KV) caches across historical frames, which limits scalability. In this paper, we tackle this challenge by introducing KV cache compression into autoregressive video diffusion. We observe that attention heads in mainstream AR diffusion models exhibit markedly distinct attention patterns and functional roles that remain stable across samples and denoising steps. Building on our empirical study of head-wise functional specialization, we divide the attention heads into two categories: static heads, which focus on transitions across autoregressive chunks and intra-frame fidelity, and dynamic heads, which govern inter-frame motion and consistency. We then propose Forcing-KV, a hybrid KV cache compression strategy that performs structured static pruning for static heads and dynamic pruning based on segment-wise similarity for dynamic heads. While maintaining output quality, our method achieves a generation speed of over 29 frames per second on a single NVIDIA H200 GPU along with 30% cache memory reduction, delivering up to 1.35x and 1.50x speedups on LongLive and Self Forcing at 480P resolution, and further scaling to 2.82x speedup at 1080P resolution. Code and demo videos are provided at https://zju-jiyicheng.github.io/Forcing-KV-Page.
[185] Do multimodal models imagine electric sheep? cs.CV | cs.AI | cs.LGPDF
Santhosh Kumar Ramakrishnan, Carl Vondrick, Raja Giryes, Philipp Krähenbühl, Vladlen Koltun
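TL;DR: The paper finds that large multimodal models form mental imagery when solving spatial puzzles and verifies this by fine-tuning a Qwen3.5 VLM on 12 visual reasoning tasks. When the model predicts action sequences, its activations encode visual information about intermediate states, which suggests that learning to select correct actions spontaneously produces an imperfect visual world model. Building on this, the paper proposes two ways to sharpen and use these mental images, raising the average solve rate from 83% to 89%.
Details
Motivation: The work asks whether large multimodal models form mental imagery while solving spatial reasoning tasks and examines how this internal representation affects task performance.
Result: On 12 visual reasoning tasks, including jigsaw and 3D mental rotation, integrating a small number of visual tokens into the chain of thought raises the average solve rate from 83% to 89%, with especially strong gains on reasoning-heavy tasks.
Insight: The novelty lies in showing that multimodal models spontaneously form a visual world model through action-prediction learning and in proposing methods that use mental imagery to strengthen reasoning, offering a new perspective on the study of internal model representations.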
Abstract: Yes. We find that large multimodal models develop mental imagery when solving spatial puzzles, and they do imagine sheep when solving sheep puzzles. We fine-tune a Qwen3.5 VLM to solve twelve diverse visual reasoning tasks – including tangram, jigsaw, sokoban, 3D mental rotation, and rush hour – that require understanding geometry, spatial relationships, and the consequences of actions. By supervising the model to predict the open-loop sequence of actions to solve a puzzle from an initial state, we show that the model’s activations after each action encode meaningful visual information about the intermediate state. This finding suggests that an imperfect visual world model begins to form as a byproduct of learning to select correct actions, in the absence of any explicit visual supervision. Building on this observation, we propose two ways to sharpen and use the mental images formed by the model. We find that integrating as few as sixteen visual tokens per step into the chain of thought improves the average solve rate from 83% to 89%, with particularly strong gains on reasoning-heavy tasks such as jigsaw and 3D mental rotation.
[186] Discriminative Span as a Predictor of Synthetic Data Utility via Classifier Reconstruction cs.CV | cs.LGPDF
Radhika Amar Desai, Modigari Narendra
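TL;DR: This paper proposes a geometry-based metric that predicts the utility of synthetic data for binary classification without training a downstream model. The method works in the embedding space of a pre-trained foundation model, represents the dataset through difference vectors between samples, and scores synthetic data quality by the relative projection error of a linear classifier's weight vector onto the subspace spanned by those difference vectors. Experiments show that the metric correlates strongly with the classification performance of CNNs trained on mixtures of real negative and synthetic positive samples.
Details
Motivation: In computer vision applications such as medical imaging and industrial inspection, binary classification often suffers from a severe scarcity of positive samples. Synthetic positives are commonly generated with image-to-image transformations, but there is no reliable way to assess whether such data will actually improve downstream model performance.
Result: Experiments across multiple datasets and architectures show that the proposed metric correlates strongly with the downstream classification performance of CNNs trained on mixtures of real negative and synthetic positive data, supporting its use as a tool for assessing synthetic data quality.
Insight: The novelty is a training-free, geometry-driven metric that predicts utility by checking whether the directions of variation induced by synthetic data in a pre-trained embedding space cover the task-relevant directions. The core idea is to use the reconstructability of a linear classifier as a proxy, which gives an efficient and informative evaluation tool for data-scarce settings.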
Abstract: In many real-world computer vision applications, including medical imaging and industrial inspection, binary classification tasks are characterized by a severe scarcity of positive samples. A widely adopted solution is to generate synthetic positive data using image-to-image transformations applied to negative samples. However, a fundamental challenge remains: how can we reliably assess whether such synthetic data will improve downstream model performance? In this work, we propose a geometry-driven metric that predicts the utility of synthetic data without requiring model training. Our approach operates in the embedding space of a pre-trained foundation model and represents the dataset through difference vectors between samples. We evaluate whether the weight vector of a linear classifier can be expressed within the subspace spanned by these variations by measuring the relative projection error. Intuitively, if the variations induced by synthetic data capture task-relevant directions, their span can approximate the classifier, resulting in low projection error. Conversely, poor synthetic data fails to span these directions, leading to higher error. Across multiple datasets and architectures, we show that this metric exhibits strong correlation with downstream classification performance of CNNs trained on mixtures of real negative and synthetic positive data. These findings suggest that the proposed metric serves as a practical and informative tool for evaluating synthetic data quality in data-scarce settings.
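The relative projection error described in the abstract can be illustrated with a minimal sketch. The abstract does not say which sample pairs are differenced or how the linear classifier is obtained, so the function below simply takes a weight vector and a list of difference vectors as given. The helper names and the Gram-Schmidt projection are illustrative assumptions, not the authors' implementation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Modified Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already (numerically) inside the span
            basis.append([wi / norm for wi in w])
    return basis


def relative_projection_error(weight, diff_vectors):
    """Return ||w - P w|| / ||w||, where P projects onto span(diff_vectors)."""
    basis = orthonormal_basis(diff_vectors)
    proj = [0.0] * len(weight)
    for b in basis:
        c = dot(weight, b)
        proj = [p + c * bi for p, bi in zip(proj, b)]
    resid = [wi - pi for wi, pi in zip(weight, proj)]
    return math.sqrt(dot(resid, resid)) / math.sqrt(dot(weight, weight))
```

A value near 0 means the difference vectors span the classifier direction, and a value near 1 means they miss it, which matches the intuition stated in the abstract.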
[187] MOTOR-Bench: A Real-world Dataset and Multi-agent Framework for Zero-shot Human Mental State Understanding cs.CVPDF
Xiaoyu Yuan, Niklas Heikkala, Tiina Törmänen, Hanna Järvenoja, Guoying Zhao
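TL;DR: The paper introduces MOTOR-Bench, a real-world benchmark for zero-shot human mental state understanding, comprising a multimodal video dataset (MOTOR-dataset) and a multi-agent reasoning framework (MOTOR-MAS). The benchmark targets structured reasoning from observable behavior to deeper mental states, and experiments confirm the effectiveness of the proposed framework.
Details
Motivation: Current research mostly predicts isolated mental state labels and lacks structured analysis of complex interpersonal interactions. Supporting structured analysis requires a benchmark that reflects real-world data challenges such as class imbalance, visual noise, and domain-specific language.
Result: Several state-of-the-art multimodal large language models and multi-agent systems were evaluated on MOTOR-Bench and showed limited zero-shot performance. The proposed MOTOR-MAS framework outperforms the best single-model baseline by 15.93 points in Macro-F1 across the behavior, cognition, and emotion labels, and outperforms the general multi-agent baseline by 10.2 points on internal cognition prediction.
Insight: The contributions are: 1) a real-world multimodal benchmark dataset grounded in self-regulated learning theory and annotated by educational experts; 2) a multi-agent reasoning framework that uses a structured agent coordination mechanism to infer explicit behaviors, internal cognitions, and psychological emotions, addressing the challenge of structured reasoning.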
Abstract: Understanding human mental states from natural behavior is crucial for intelligent systems in the real world. However, most current research focuses on predicting isolated mental state labels, lacking structured annotations of complex interpersonal interactions. To support structured analysis, we introduce MOTOR-Bench, a carefully-designed benchmark with a real-world dataset MOTOR-dataset, containing 1,440 multimodal video clips in collaborative learning scenarios, reflecting key real-world data challenges including natural class imbalance, visual noise, and domain-specific language. Each sample is labeled by educational experts based on self-regulated learning theory. We further evaluate several state-of-the-art multimodal large language models and multi-agent systems in a zero-shot setting on our MOTOR-Bench. However, their performance on this task remains limited, suggesting that existing methods still struggle with structured reasoning from observable behavior to deeper mental states. To address this challenge, we propose a reasoning multi-agent framework, named MOTOR-MAS. It coordinates multiple agents through a structured agent coordination mechanism to infer explicit behaviors, internal cognitions, and psychological emotions. Experimental results show that our MOTOR-MAS outperforms the best single-model benchmark by 15.93 points in Macro-F1 scores for the three labels of behavior, cognition, and emotion, and outperforms the general multi-agent benchmark by 10.2 points in internal cognition prediction.
[188] Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT cs.CV | cs.AIPDF
Alaa Asfour, Christopher Indris, Leihan Chen, Tejas Vyas, Guanghui Wang
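TL;DR: This paper proposes a knowledge distillation framework that transfers the spatial reasoning ability of a large 3D vision-language model to a lightweight student model. It introduces "Hidden CoT" (learnable latent tokens that act as an internal scratchpad) to improve reasoning, and substantially reduces computational overhead while retaining much of the teacher's performance.
Details
Motivation: Large 3D vision-language models such as LLaVA-3D are hard to deploy because of their high computational cost. The goal is to distill spatial reasoning into a smaller model that enables efficient 3D scene question answering on resource-constrained platforms.
Result: On the ScanNet and 3D-FRONT benchmarks, the student model reaches 68-72% accuracy on proximity and contact tasks and retains 54-72% of the teacher's performance, with 8.7x lower inference latency and a 3x reduction in model size.
Insight: The contributions include the first use of latent scratchpad reasoning (Hidden CoT) in a distilled 3D vision-language model, and a multi-task distillation pipeline with uncertainty-aware loss weighting. These ideas may transfer to other vision-language tasks that need lightweight models with stronger reasoning.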
Abstract: Large-scale 3D vision-language models (VLMs) like LLaVA-3D offer strong spatial reasoning but are difficult to deploy due to high computational costs. We propose a knowledge distillation framework that transfers spatial reasoning from a 7B teacher to a 2.29B student model. Our approach achieves 8.7x lower inference latency and a 3x reduction in model size while retaining 54-72% of the teacher’s performance. The framework utilizes VGGT as the vision encoder and a multi-task distillation pipeline with uncertainty-aware loss weighting. To improve reasoning without chain-of-thought (CoT) data, we introduce “Hidden CoT”: learnable latent tokens that serve as an internal scratchpad before answer generation. This is the first use of latent scratchpad reasoning in distilled 3D VLMs. The student model jointly performs spatial description, depth estimation, and object detection. Experiments on ScanNet and 3D-FRONT show strong spatial understanding, reaching 68-72% accuracy on proximity and contact tasks. Our framework enables efficient 3D scene QA on resource-constrained platforms.
[189] On-Policy Distillation with Best-of-N Teacher Rollout Selection cs.CVPDF
Ke Zhang, Yunjie Tian, DongDi Zhao, Yijiang Li, Yuanye Liu
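TL;DR: This paper proposes BRTS (Best-of-N Rollout Teacher Selection), a framework that improves on-policy distillation (OPD). It selects the best sample from several teacher trajectories, prioritizing correctness and then alignment with the student's behavior, to provide a more reliable supervision signal. This improves performance on complex reasoning tasks while preserving data efficiency.
Details
Motivation: Standard OPD computes teacher supervision under noisy student-generated contexts and usually relies on a single stochastic teacher rollout per prompt. The resulting supervision signal has high variance and may be incorrect or poorly matched to the student's current reasoning behavior.
Result: On challenging reasoning benchmarks including AIME 2024, AIME 2025, and AMC 2023, BRTS improves over standard OPD, with the largest gains on the harder datasets.
Insight: The novelty is sampling and selecting among multiple teacher trajectories (with a priority rule of correctness first, student alignment second), combined with a teacher-context supervision branch and an auxiliary loss, which stabilizes distillation and strengthens reasoning.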
Abstract: On-policy distillation (OPD), which supervises a student on its own sampled trajectories, has emerged as a data-efficient post-training method for improving reasoning while avoiding the reward dependence of reinforcement learning and the catastrophic forgetting often observed in standard supervised fine-tuning. However, standard OPD typically computes teacher supervision under noisy student-generated contexts and often relies on a single stochastic teacher rollout per prompt. As a result, the supervision signal can be high-variance: the sampled teacher trajectory can be incorrect, uninformative, or poorly matched to the student’s current reasoning behavior. To address this limitation, we propose BRTS, a Best-of-N Rollout Teacher Selection framework for on-policy distillation. BRTS augments standard student-context OPD with a teacher-context supervision branch constructed from the curated teacher trajectory. Rather than distilling from the first sampled teacher rollout, BRTS samples a small pool of teacher trajectories and selects the auxiliary trajectory using a simple priority rule: correctness first, student alignment second. When multiple correct teacher trajectories are available, BRTS chooses the one most aligned with the student’s current behavior; when unconditioned teacher samples fail on harder prompts, it invokes a ground-truth-conditioned recovery step to elicit a natural derivation. The selected trajectory is then used to provide reliable teacher-context supervision inside the OPD loop, augmented with an auxiliary loss on the teacher trajectory. Experiments on AIME 2024, AIME 2025, and AMC 2023 show that BRTS improves over standard OPD on challenging reasoning benchmarks, with the largest gains on harder datasets. Our code is available at https://github.com/BWGZK-keke/BRTS.
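The priority rule in the abstract (correctness first, student alignment second, with a recovery step when no sampled rollout is correct) can be sketched as below. The `Rollout` fields, the use of a student log-probability as the alignment score, and the `recover` callback are all hypothetical stand-ins. The abstract does not specify how alignment is measured or how the ground-truth-conditioned recovery is implemented.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Rollout:
    text: str
    is_correct: bool
    student_logprob: float  # hypothetical alignment score: higher = closer to student behavior


def select_teacher_rollout(
    pool: List[Rollout],
    recover: Optional[Callable[[], Rollout]] = None,
) -> Optional[Rollout]:
    """Pick one teacher rollout: correctness first, student alignment second."""
    correct = [r for r in pool if r.is_correct]
    if correct:
        # Among correct rollouts, prefer the one most aligned with the student.
        return max(correct, key=lambda r: r.student_logprob)
    if recover is not None:
        # No correct sample: fall back to a (hypothetical) ground-truth-conditioned recovery.
        return recover()
    return None
```

For example, given one incorrect rollout with a high alignment score and two correct rollouts, the function returns the better-aligned of the two correct ones and never the incorrect one.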
[190] Fetal Brain Imaging: A Composite Neural Network Approach for Keyframe Detection in Ultrasound Videos cs.CVPDF
Aleksander Zamojski, Kacper Jarczak, Radoslaw Roszczyk
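TL;DR: This paper proposes a composite neural network for keyframe detection in ultrasound videos, with a focus on fetal brain imaging. The method combines a convolutional neural network (CNN) with a recurrent neural network (RNN): the CNN extracts spatial features from individual frames, and the RNN captures temporal dependencies between consecutive frames in the video sequence.
Details
Motivation: To improve the efficiency and accuracy of keyframe detection in fetal brain ultrasound video analysis, in support of earlier detection, diagnosis, and treatment planning for selected fetal brain conditions.
Result: The abstract reports no quantitative results, benchmarks, or performance levels (such as SOTA). It only states that the model may improve efficiency and accuracy.
Insight: The novelty is the composite CNN-RNN architecture, which uses spatial and temporal information jointly for keyframe detection. Viewed objectively, this kind of spatio-temporal fusion is a useful reference for medical video analysis.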
Abstract: This article presents a novel approach to keyframe detection in ultrasound videos, with a particular focus on fetal brain imaging. The proposed model is a composite neural network architecture that combines a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN). The CNN extracts spatial features from individual video frames, while the RNN captures temporal dependencies between consecutive frames within each video sequence. The proposed model may improve the efficiency and accuracy of fetal brain ultrasound analysis, thereby supporting earlier detection, diagnosis, and treatment planning for selected fetal brain conditions.
[191] DRIVE-C: A Controlled Corruption Dataset for Autonomous Driving cs.CVPDF
Shiva Aher
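TL;DR: DRIVE-C is a controlled corruption dataset for evaluating the robustness of visual perception in autonomous driving systems. Built from real-world driving videos across multiple scene types, it applies physics-inspired synthetic degradations to produce annotated data with 10 clean clips and 600 corrupted clips, supporting robustness benchmarking, degradation-aware modeling, and sensor health monitoring.
Details
Motivation: Autonomous driving perception systems lack a standardized benchmark for real-world camera degradation such as lens soiling, rain, and fog. DRIVE-C provides a structured testbed for studying perception reliability under controlled camera degradation.
Result: The dataset contains 600 corrupted video clips covering 12 camera degradation types (such as lens soiling, raindrops, and motion blur) at 5 severity levels. Each clip comes with metadata and a Global Sensor Health Index (GSHI) annotation, and the corruption parameters are reproducible.
Insight: The novelty is the construction of pixel-aligned clean and corrupted video pairs through physics-inspired synthetic degradation, together with GSHI annotations. This gives a systematic benchmark tool for robustness evaluation, OOD detection, and sensor health monitoring in autonomous driving perception.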
Abstract: DRIVE-C is a controlled corruption dataset designed to evaluate visual perception robustness in autonomous driving systems. It is built from real-world forward-facing driving videos collected across daytime, nighttime, urban, rural, freeway, and parking environments. Clean clips are anonymized via localized face and license plate blurring, then transformed with physics-inspired synthetic degradations. The dataset contains 10 clean clips and 600 corrupted clips spanning 12 camera degradation types across five severity levels, with per-clip metadata and Global Sensor Health Index (GSHI) annotations. DRIVE-C supports robustness benchmarking, degradation-aware modeling, uncertainty estimation, out-of-distribution (OOD) detection, and sensor health monitoring for Advanced Driver Assistance Systems (ADAS). By providing pixel-aligned clean and degraded video clips with fully reproducible corruption parameters, DRIVE-C offers a structured testbed for studying perception reliability under controlled camera degradation.
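The clip counts in the abstract are consistent with a full factorial design: 10 clean clips x 12 degradation types x 5 severity levels = 600 corrupted clips. The sketch below enumerates such a grid. The factorial layout is an inference from the numbers, not something the abstract states, and the integer ids are placeholders for the real clip and corruption names.

```python
from itertools import product

N_CLEAN_CLIPS = 10       # from the abstract
N_CORRUPTION_TYPES = 12  # from the abstract
N_SEVERITIES = 5         # from the abstract


def corruption_grid():
    """Enumerate (clip_id, corruption_id, severity) triples, assuming a full factorial design."""
    return list(
        product(
            range(N_CLEAN_CLIPS),
            range(N_CORRUPTION_TYPES),
            range(1, N_SEVERITIES + 1),  # severities numbered 1..5 (an assumption)
        )
    )


grid = corruption_grid()
print(len(grid))  # 600
```

Under this assumption, every corrupted clip maps back to exactly one clean source clip, which is what makes pixel-aligned clean and degraded pairs possible.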
[192] CrossVL: Complexity-Aware Feature Routing and Paired Curriculum for Cross-View Vision-Language Detection cs.CV | cs.AI | cs.LGPDF
Zhipeng Liu, Chunbo Luo
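TL;DR: This paper proposes CrossVL, a framework that improves object detection by vision-language models in cross-view (ground and aerial) scenarios. It has two core components, complexity-aware pathway aggregation and paired curriculum learning, which handle the systematic complexity differences caused by geometric changes between viewpoints, stabilizing training and improving detection accuracy.
Details
Motivation: Existing vision-language models degrade severely in cross-view detection because ground and aerial views differ markedly in altitude, scale, and spatial layout. This leads to different image complexity, which fixed fusion mechanisms cannot handle.
Result: On the MAVREC dataset, CrossVL improves Florence-2's aerial mAP from 58.66% to 61.03%, narrows the ground-aerial performance gap from 8.63 to 6.65 percentage points, and reduces variance across random seeds by 3.3x.
Insight: The novelty is a complexity-aware feature routing mechanism that adapts feature processing to each viewpoint, plus a curriculum learning strategy that exploits the semantic consistency of paired images to stabilize training. The results suggest that coordinated architectural and training adaptations are crucial for robust cross-view vision-language detection.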
Abstract: Vision-language models (VLMs) enable text-guided object detection but degrade severely under cross-view scenarios where ground and aerial viewpoints differ in altitude, scale, and spatial layout. These geometric changes introduce systematic complexity variations between viewpoints, e.g., ground-view images contain dense and highly occluded structures, while aerial images are sparse and globally organized. Fixed VLM fusion mechanisms cannot handle this discrepancy. We propose CrossVL, a framework combining Complexity-Aware Pathway Aggregation (CPA) and Paired Curriculum Learning (PCL) to enhance cross-view detection in VLMs. CPA estimates scene complexity from multimodal statistics and routes visual features through multiple pathways to obtain view-specific representations. PCL leverages semantic consistency of synchronized ground-aerial pairs to provide stable early supervision and then gradually shifts toward randomized sampling. On MAVREC, CrossVL improves Florence-2’s aerial mAP from 58.66% to 61.03% and reduces the ground-aerial performance gap from 8.63pp to 6.65pp, while also achieving a 3.3x reduction in variance across random seeds. CPA provides stable complexity-aware feature aggregation, and PCL enhances optimization dynamics. Together, they demonstrate that coordinated architectural and training adaptations are crucial for robust cross-view VLM detection.
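The paired-curriculum idea (synchronized ground-aerial pairs early, then a gradual shift toward randomized sampling) can be sketched as a sampling schedule. The linear decay, the `warmup_frac` parameter, and the function names are assumptions for illustration, since the abstract does not give the actual schedule.

```python
import random


def paired_probability(step, total_steps, warmup_frac=0.2):
    """Probability of drawing a synchronized ground-aerial pair at `step`.

    Assumed schedule: always paired during warmup, then a linear decay to 0.
    """
    warmup = warmup_frac * total_steps
    if step <= warmup:
        return 1.0
    remaining = total_steps - warmup
    return max(0.0, 1.0 - (step - warmup) / remaining)


def sample_batch_mode(step, total_steps, rng):
    """Return 'paired' or 'random' according to the curriculum schedule."""
    if rng.random() < paired_probability(step, total_steps):
        return "paired"
    return "random"
```

With `total_steps=100`, the first 20 steps always use paired batches, step 60 uses them about half the time, and the final step uses randomized sampling only.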
[193] Fashion Florence: Fine-Tuning Florence-2 for Structured Fashion Attribute Extraction cs.CV | cs.AIPDF
Anushree Berlia
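TL;DR: This paper presents Fashion Florence, a Florence-2 vision-language model fine-tuned with LoRA to extract structured fashion attributes from clothing images. Given a single photograph, the model outputs a JSON object with category, color, material, style tags, and occasion tags, which downstream recommendation and retrieval systems can consume directly.
Details
Motivation: To extract structured attribute information from clothing images automatically, supporting downstream fashion recommendation and retrieval systems without relying on unstructured text output.
Result: On a held-out test set of 461 images, Fashion Florence reaches 94.6% category accuracy and 63.0% material accuracy, ahead of GPT-4o-mini (89.3% / 43.3%) and Gemini 2.5 Flash (87.4%). Style tag F1 is 0.753, also above the compared models. The model produces valid JSON in 99.8% of outputs.
Insight: The novelty is the efficient LoRA fine-tuning of Florence-2 for structured fashion attribute extraction, together with a compact label schema. Viewed objectively, it pairs a vision-language model with a domain-specific structured-output task and achieves strong performance at low inference cost through lightweight adaptation, which gives it practical deployment value.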
Abstract: We present Fashion Florence, a Florence-2 vision-language model fine-tuned with LoRA to extract structured fashion attributes from clothing images. Given a single photograph, the model generates a JSON object containing category, color, material, style tags, and occasion tags, structured output suitable for direct programmatic consumption by downstream recommendation and retrieval systems. Fine-tuning data is derived from the iMaterialist Fashion dataset (228 labels), where we collapse fine-grained annotations into a compact 6-category, 16-color, 19-style schema via rule-based label engineering. We apply LoRA (r=16, alpha=32) to all decoder linear layers, training for 3 epochs on 3,688 examples. On a held-out test set of 461 images, Fashion Florence achieves 94.6% category accuracy and 63.0% material accuracy, compared to 89.3% / 43.3% for GPT-4o-mini and 87.4% for Gemini 2.5 Flash. Fashion Florence produces valid JSON in 99.8% of outputs while running at 0.77B parameters on a single GPU at zero marginal inference cost. Style tag F1 reaches 0.753 vs. 0.612 (Gemini) and 0.398 (GPT-4o-mini). The model is deployed as a Hugging Face Space and integrated into Loom, an open-source outfit recommendation system.
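A "valid JSON" rate like the 99.8% reported above could be computed with a check along these lines. The key names (`style_tags`, `occasion_tags`, and so on) and the type rules are assumptions, since the abstract lists the fields but not the exact schema or the validity criterion used.

```python
import json

# Hypothetical schema: field names and types are assumed, not taken from the paper.
REQUIRED_FIELDS = {
    "category": str,
    "color": str,
    "material": str,
    "style_tags": list,
    "occasion_tags": list,
}


def parse_attributes(raw):
    """Return the parsed attribute dict, or None if `raw` is not valid under the schema."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    for key, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(obj.get(key), expected_type):
            return None
    return obj


def valid_json_rate(outputs):
    """Fraction of model outputs that parse and match the schema."""
    return sum(parse_attributes(o) is not None for o in outputs) / len(outputs)
```

A downstream recommender would call `parse_attributes` on each model output and skip or retry the few that return `None`.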
[194] Clip-level Uncertainty and Temporal-aware Active Learning for End-to-End Multi-Object Tracking cs.CVPDF
Riku Inoue, Shogo Sato, Kazuhiko Murasaki, Tomoyasu Shimada, Toshihiko Nishimura
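TL;DR: This paper proposes CUTAL, a clip-level active learning method for end-to-end multi-object tracking (MOT). It addresses the structural mismatch between existing frame-level active learning and Transformer-based end-to-end trackers, whose training and inference rely on multi-frame clips. CUTAL scores the informativeness of each clip with uncertainty metrics derived from multi-frame predictions and enforces temporal diversity to select informative, non-redundant clips for annotation, which yields stronger performance at the same annotation budget.
Details
Motivation: Training end-to-end MOT models requires large amounts of costly annotation. Existing active learning methods mainly target frame-level annotation, which is mismatched with modern end-to-end trackers that train and infer on multi-frame clips, leading to inefficient labeling.
Result: Experiments with the MeMOTR and SambaMOTR trackers show that CUTAL outperforms baseline methods overall at the same annotation budgets. Notably, with MeMOTR, CUTAL reaches performance comparable to full supervision using only 50% of the labeled training data.
Insight: The claimed novelty is formulating active learning at the clip level and proposing a selection strategy that combines uncertainty metrics with temporal diversity. Viewed objectively, the core contribution is aligning the query unit of active learning with the basic training and inference unit of end-to-end trackers (the clip), and designing uncertainty metrics that capture ambiguity in inter-frame correspondence, which offers a new way to handle redundancy in active learning for video tasks.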
Abstract: Multi-Object Tracking (MOT) in dynamic environments relies on robust temporal reasoning to maintain consistent object identities over time. Transformer-based end-to-end MOT models achieve strong performance by explicitly modeling temporal dependencies, yet training them requires extensive bounding-box and identity annotations. Given the high labeling cost and strong redundancy in videos, Active Learning (AL) is an effective approach to improve annotation efficiency. However, existing AL methods for MOT primarily operate at the frame level, which is structurally misaligned with modern end-to-end trackers whose inference and training rely on multi-frame clips. To bridge this gap, we formulate clip-level active learning and propose Clip-level Uncertainty and Temporal-aware Active Learning (CUTAL). In contrast to frame-based approaches, CUTAL scores each clip using uncertainty metrics derived from multi-frame predictions to capture inter-frame correspondence ambiguities, while enforcing temporal diversity to select an informative and non-redundant subset. Experiments show that CUTAL achieves stronger overall performance than baselines at the same label budgets across MeMOTR and SambaMOTR. Notably, CUTAL achieves performance comparable to full supervision for MeMOTR on both datasets using only 50% of the labeled training data.
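One simple way to combine clip-level uncertainty with temporal diversity is a greedy selection with a minimum frame gap, sketched below. This is an illustrative stand-in and not the CUTAL algorithm. The abstract does not define the uncertainty metrics or the diversity constraint, so the tuple layout, `min_gap`, and the greedy rule are all assumptions.

```python
def select_clips(clips, budget, min_gap):
    """Greedily pick high-uncertainty clips that are temporally spread out.

    clips:   list of (video_id, start_frame, uncertainty) tuples.
    budget:  maximum number of clips to select for annotation.
    min_gap: minimum start-frame distance between selected clips of the same video.
    """
    chosen = []
    # Visit clips from most to least uncertain.
    for video_id, start, score in sorted(clips, key=lambda c: c[2], reverse=True):
        if len(chosen) == budget:
            break
        too_close = any(
            v == video_id and abs(s - start) < min_gap for v, s, _ in chosen
        )
        if not too_close:
            chosen.append((video_id, start, score))
    return chosen
```

In the example `select_clips([("a", 0, 0.9), ("a", 5, 0.8), ("a", 50, 0.5), ("b", 0, 0.7)], budget=3, min_gap=10)`, the second clip is skipped because it nearly overlaps the first, so the budget goes to clips from different parts of the data.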
[195] Learning to Align Generative Appearance Priors for Fine-grained Image Retrieval cs.CVPDF
Shijie Wang, Yadan Luo, Zijian Wang, Xin Yu, Zi Huang
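TL;DR: This paper proposes GAPan (Generative Appearance Prior alignment network), which reformulates the learning objective of fine-grained image retrieval from category prediction to appearance modeling, addressing the limited generalization to unseen categories that results from relying on seen-category supervision. GAPan builds an invertible density model with normalizing flows. In the forward direction, it maps instance features into a latent space and performs exact likelihood estimation under class-conditional Gaussian priors, preserving rich appearance detail. In the reverse direction, it samples from high-density regions of the priors to generate appearance-aware anchors that reflect intra-class variation. A prior-driven alignment objective then aligns retrieval embeddings with category-specific appearance distributions, improving generalization to unseen categories.
Details
Motivation: Existing fine-grained image retrieval methods usually learn discriminative embeddings from seen-category supervision. This biases the model toward the semantics of seen categories rather than the underlying appearance characteristics that generalize across categories, which limits retrieval performance on unseen categories.
Result: Evaluations show that GAPan achieves state-of-the-art performance on widely used fine-grained and coarse-grained benchmarks.
Insight: The novelty is recasting retrieval as appearance modeling rather than category prediction, and using the invertibility of normalizing flows to build a generative appearance prior alignment mechanism in which appearance-aware anchors drive embeddings toward category-specific distributions. Viewed objectively, the method fuses generative and discriminative learning through density modeling and invertible transformations, offering a new perspective on fine-grained retrieval.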
Abstract: Fine-grained image retrieval (FGIR) typically relies on supervision from seen categories to learn discriminative embeddings for retrieving unseen categories. However, such supervision often biases retrieval models toward the semantics of seen categories rather than the underlying appearance characteristics that generalize across categories, thereby limiting retrieval performance on unseen categories. To tackle this, we propose GAPan, a Generative Appearance Prior alignment network that reformulates the learning objective from category prediction toward appearance modeling. Technically, GAPan treats retrieval features with an invertible density model based on normalizing flows. In the forward direction, the flow maps all instance features into a latent density space, where each seen category is modeled by a class-conditional Gaussian prior and optimized via exact likelihood estimation. This formulation preserves richer appearance details by leveraging the invertible property of the flows. In the reverse direction, samples from the high-density regions of these learned priors are mapped back to the feature space to produce appearance-aware anchors that reflect intra-category variation. These anchors supervise a prior-driven alignment objective that aligns retrieval embeddings with category-specific appearance distributions, thereby improving generalization to unseen categories. Evaluations demonstrate that our GAPan achieves state-of-the-art performance on both widely-used fine- and coarse-grained benchmarks.
[196] EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding cs.CV | cs.AI | cs.CLPDF
Ziyang Wang, Yue Zhang, Shoubin Yu, Ce Zhang, Zengqi Zhao
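TL;DR: The paper introduces EgoMemReason, a benchmark for evaluating memory-driven reasoning in long-horizon egocentric video understanding. It contains 500 questions covering three memory types (entity, event, and behavior memory) and requires models to integrate sparsely distributed evidence across up to a week of video. Experiments show that existing methods, including MLLMs and agentic frameworks, perform poorly on the benchmark, with a best accuracy of only 39.6%, indicating that long-horizon memory reasoning remains an open challenge.
Details
Motivation: Existing week-long video benchmarks mainly target perception and recognition tasks such as moment localization or global summarization, and do not evaluate reasoning that requires integrating evidence across multiple days. The authors build a benchmark dedicated to memory-driven reasoning over long-horizon egocentric video to fill this gap.
Result: An evaluation of 17 methods, including multimodal large language models and agentic frameworks, shows that the best model reaches only 39.6% overall accuracy. Performance degrades as the evidence spans longer temporal horizons, and the three memory types fail for different reasons, indicating that long-horizon memory is far from solved.
Insight: The novelty is a systematic definition of three complementary memory types in long-horizon video understanding (entity, event, and behavior memory) and a corresponding evaluation benchmark. The work lays a foundation for evaluating and advancing long-context, memory-aware multimodal systems, and exposes fundamental limits of current models in integrating and reasoning over sparse long-horizon information.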
Abstract: Next-generation visual assistants, such as smart glasses, embodied agents, and always-on life-logging systems, must reason over an entire day or more of continuous visual experience. In ultra-long video settings, relevant information is sparsely distributed across hours or days, making memory a fundamental challenge: models must accumulate information over time, recall prior states, track temporal order, and abstract recurring patterns. However, existing week-long video benchmarks are primarily designed for perception and recognition, such as moment localization or global summarization, rather than reasoning that requires integrating evidence across multiple days. To address this gap, we introduce EgoMemReason, a comprehensive benchmark that systematically evaluates week-long egocentric video understanding through memory-driven reasoning. EgoMemReason evaluates three complementary memory types: entity memory, tracking how object states evolve and change across days; event memory, recalling and ordering activities separated by hours or days; and behavior memory, abstracting recurring patterns from sparse, repeated observations over the whole week period. EgoMemReason comprises 500 questions across three memory types and six core challenges, with an average of 5.1 video segments of evidence per question and 25.9 hours of memory backtracking. We evaluate EgoMemReason on 17 methods across MLLMs and agentic frameworks, revealing that even the best model achieves only 39.6% overall accuracy. Further analysis shows that the three memory types fail for distinct reasons and that performance degrades as evidence spans longer temporal horizons, revealing that long-horizon memory remains far from solved. We believe EgoMemReason establishes a strong foundation for evaluating and advancing long-context, memory-aware multimodal systems.
[197] The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space cs.CV | cs.AIPDF
Xia Hu, Zhenrui Yue, Brian Potetz, Howard Zhou, Leonidas Guibas
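TL;DR: This paper identifies a pervasive vulnerability called the "Cartesian Shortcut": on visual reasoning benchmarks, current multimodal large language models (MLLMs) rely heavily on orthogonal grid layouts that can easily be discretized into textual coordinates, and so may not have acquired robust visual understanding. To remove this shortcut systematically, the authors build Polaris-Bench, which reformulates 53 visual reasoning tasks in polar coordinates while keeping the logical constraints and task semantics consistent. An evaluation of 14 frontier MLLMs shows that accuracy of 70-83% on Cartesian layouts collapses to 31-39% on the polar equivalents, revealing that current MLLMs lack topology-invariant visual reasoning.
Details
Motivation: As multimodal large language models rapidly saturate standard visual reasoning benchmarks, the authors question whether these high scores reflect robust visual understanding, and aim to expose and address a possible systematic weakness.
Result: The evaluation on Polaris-Bench shows that frontier MLLMs reach 70-83% accuracy on Cartesian layouts but drop to 31-39% on the polar equivalents. The degradation persists even under complete logical equivalence, and the gains from reasoning are severely diminished.
Insight: The novelty is identifying and systematically dismantling the "Cartesian Shortcut" by building a polar-coordinate benchmark that breaks the models' reliance on the orthogonal prior, exposing a critical deficiency of current MLLMs in topology-invariant visual reasoning. Viewed objectively, the approach provides a new and stricter benchmark for assessing genuine visual understanding.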
Abstract: As current Multimodal Large Language Models rapidly saturate canonical visual reasoning benchmarks, a key question emerges: do these strong scores genuinely reflect robust visual understanding? We identify a pervasive vulnerability, the Cartesian Shortcut: visual reasoning benchmarks prevalently build on orthogonal grid-based layouts that can be readily discretized into explicit textual coordinates. Models systematically exploit this property, heavily leveraging text-based deductive reasoning to assist visual problem-solving. To systematically dismantle this shortcut, we introduce Polaris-Bench, which re-formulates 53 visual reasoning tasks in Polar coordinate space with paired Cartesian counterparts as reference, while preserving consistent logical constraints and task semantics, thus fundamentally breaking the orthogonal prior that models exploit. Comprehensive evaluation across 14 state-of-the-art MLLMs reveals that frontier models achieving 70-83% on Cartesian layouts collapse to 31-39% on Polar equivalents, with degradation persisting even under complete logical equivalence. Moreover, reasoning gains observed on Cartesian layouts are severely diminished on Polar equivalents. These findings expose a critical deficiency in current MLLMs: the lack of topology-invariant visual reasoning.
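To make the Cartesian-to-polar idea concrete, the sketch below maps a grid cell to a ring and sector, so that rows become radii and columns become angles. This is only one naive re-parameterization, and all names and parameters are assumptions. The abstract does not describe how Polaris-Bench constructs its layouts. Note that wrapping columns around a circle makes the first and last columns adjacent, so a real benchmark would need extra care to keep the task logic equivalent.

```python
import math


def grid_to_polar(row, col, n_cols, r0=1.0, dr=1.0):
    """Map grid cell (row, col) to (radius, angle): rows become rings, columns become sectors."""
    radius = r0 + row * dr
    theta = 2 * math.pi * col / n_cols
    return radius, theta


def polar_to_xy(radius, theta):
    """Convert polar coordinates to Cartesian drawing coordinates."""
    return radius * math.cos(theta), radius * math.sin(theta)
```

In a 4-column grid, cell (2, 1) lands on the third ring at a quarter turn. Its drawing position is no longer a simple function of integer row and column offsets, which is the property that makes textual discretization harder.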
[198] Hyperbolic Distillation: Geometry-Guided Cross-Modal Transfer for Robust 3D Object Detection cs.CV | cs.AIPDF
Kanglin Ning, Wenrui Li, Houde Quan, Qifan Li, Xingtao Wang
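TL;DR: This paper proposes HGC-Det, a hyperbolic-geometry-constrained cross-modal distillation method for robust multimodal 3D object detection. It extracts features with an image branch and a point cloud branch and introduces three core components: 2D semantic-guided voxel optimization (SGVO), hyperbolic geometry constrained cross-modal feature transfer (HFT), and feature aggregation-based geometry optimization (FAGO). These address modality heterogeneity, spatial misalignment, and the representation crisis, and achieve a good balance between detection accuracy and computational cost on indoor and outdoor datasets.
Details
Motivation: Existing cross-modal distillation methods for 3D perception are often limited by modality heterogeneity, spatial misalignment, and the multimodal representation crisis, which make feature fusion inefficient. This paper aims to overcome these limitations.
Result: Extensive experiments on indoor datasets (SUN RGB-D, ARKitScenes) and outdoor datasets (KITTI, nuScenes) show that the method achieves a better trade-off between detection accuracy and computational cost.
Insight: The contributions include using the geometric properties of hyperbolic space to reduce semantic loss when fusing high-dimensional image features with low-dimensional point cloud features, and using semantic-guided voxel optimization and feature aggregation-based geometry optimization to strengthen spatial representation and compensate for feature degradation. This offers a new geometric-constraint perspective on cross-modal 3D detection.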
Abstract: Cross-modal knowledge distillation has emerged as an effective strategy for integrating point cloud and image features in 3D perception tasks. However, modality heterogeneity, spatial misalignment, and the representation crisis of multiple modalities often limit the efficiency of these cross-modal distillation methods. To address these limitations in existing approaches, we propose a hyperbolic constrained cross-modal distillation method for multimodal 3D object detection (HGC-Det). The proposed HGC-Det framework includes an image branch and a point cloud branch to extract semantic features from two different modalities. The point cloud branch comprises three core components: a 2D semantic-guided voxel optimization component (SGVO), a hyperbolic geometry constrained cross-modal feature transfer component (HFT), and a feature aggregation-based geometry optimization component (FAGO). Specifically, the SGVO component adaptively refines the spatial representation of the 3D branch by leveraging semantic cues from the image branch, thereby mitigating the issue of inadequate representation fusion. The HFT component exploits the intrinsic geometric properties of hyperbolic space to alleviate semantic loss during the fusion of high-dimensional image features and low-dimensional point cloud features. Finally, the FAGO component compensates for potential spatial feature degradation introduced by the 2D semantic-guided voxel optimization component. Extensive experiments on indoor datasets (SUN RGB-D, ARKitScenes) and outdoor datasets (KITTI, nuScenes) demonstrate that our method achieves a better trade-off between detection accuracy and computational cost.
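For readers unfamiliar with hyperbolic feature spaces, the sketch below computes the standard geodesic distance in the Poincare ball model, one common way to measure distances between embeddings in hyperbolic space. The abstract does not say which hyperbolic model or distance HGC-Det uses, so the choice of the Poincare ball here is an assumption made purely for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between points u and v inside the unit Poincare ball.

    d(u, v) = arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    Both points must have Euclidean norm strictly less than 1.
    """
    def sq_norm(x):
        return sum(a * a for a in x)

    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)
```

Distances grow rapidly as points approach the boundary of the ball, which is why hyperbolic spaces are often used to embed hierarchical structure in few dimensions.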
[199] Adversarial Attacks Against MLLMs via Progressive Resolution Processing and Adaptive Feature Alignment cs.CVPDF
Haobo Wang, Xiaorong Ma, Weiqi Luo, Xiaojun Jia, Jiwu Huang
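TL;DR: This paper proposes PRAF-Attack, an adversarial attack framework that improves the transferability and robustness of transfer-based targeted attacks against multimodal large language models (MLLMs). Through progressive resolution processing and adaptive feature alignment, it integrates multi-scale global semantic guidance with robust intermediate-layer local alignment, overcoming the reliance of existing methods on final global features and fixed-resolution target crops.
Details
Motivation: Existing transfer-based targeted attacks typically rely on the final global features of the surrogate encoder and anchor optimization to original-resolution target crops, which limits their transferability and robustness. Understanding and improving the robustness of black-box MLLMs in safety-critical scenarios such as autonomous driving and medical diagnosis requires more effective attacks for assessing model vulnerabilities.
Result: PRAF-Attack was evaluated on a diverse suite of black-box MLLMs, including six open-source models and six closed-source commercial APIs. Compared with seven state-of-the-art targeted attack baselines, it consistently achieves superior transferability.
Insight: The novelty is an adaptive feature alignment strategy (using intermediate-layer representations and an adaptive intermediate layer selection mechanism based on gradient consistency) and a progressive resolution processing strategy (refining optimization from coarse to fine), which together make better use of multi-scale target information and strengthen transferability. Viewed objectively, combining multi-scale semantics with local feature alignment offers a new technical route to improving the cross-model transferability of adversarial examples.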
Abstract: Adversarial perturbations can mislead Multimodal Large Language Models (MLLMs) into recognizing a benign image as a specific target object, posing serious risks in safety-critical scenarios such as autonomous driving and medical diagnosis. This makes transfer-based targeted attacks crucial for understanding and improving black-box MLLM robustness. Existing transfer-based targeted attack methods typically rely on the final global features of the surrogate encoder and anchor optimization to original-resolution target crops, which limits their transferability and robustness. To address these challenges, we propose Progressive Resolution Processing and Adaptive Feature Alignment (PRAF-Attack), a targeted transfer-based attack framework that integrates multi-scale global semantic guidance with robust intermediate-layer local alignment. Unlike prior methods that align only the surrogate encoder’s final layer, we design an adaptive feature alignment strategy that leverages intermediate representations to enhance transferability. Specifically, we introduce an adaptive intermediate layer selection mechanism to identify transferable hierarchical features across surrogate ensembles via gradient consistency, along with an adaptive patch-level optimization strategy that preserves highly correlated local regions through efficient patch filtering. To overcome the reliance on fixed original-resolution target crops, we propose a progressive resolution processing strategy that gradually refines optimization from coarse to fine, enabling the attack to better exploit target information at multiple scales and achieve stronger transferability. We evaluate PRAF-Attack on a diverse suite of black-box MLLMs, including six open-source models and six closed-source commercial APIs. Compared with seven state-of-the-art targeted attack baselines, the proposed PRAF-Attack consistently achieves superior transferability.
[200] TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models cs.CVPDF
Junzhe Chen, Siyuan Meng, Yuxi Chen, Man Zhao, Xiaojie Guo
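TL;DR: This paper introduces TOC-Bench, a diagnostic benchmark for evaluating temporal object consistency in video large language models (Video-LLMs). It tests whether a model can consistently preserve the identity, state, and temporal continuity of the same object through occlusion, disappearance, reappearance, state transitions, and cross-object interactions. Using a three-layer temporal-necessity filtering protocol, the authors build a human-verified dataset of 2,323 high-quality QA pairs. Experiments show that existing models, despite strong results on general video understanding benchmarks, still have significant weaknesses in temporal object consistency.
Details
Motivation: Existing benchmarks focus mainly on event recognition, action understanding, or coarse temporal reasoning, and rarely evaluate whether a model maintains object-centric temporal consistency under complex temporal changes. This can lead to overestimating a model's temporal reasoning ability.
Result: Experiments on representative Video-LLMs show that temporal object consistency remains a major unsolved challenge. Current models show substantial weaknesses in event counting, event ordering, identity-sensitive reasoning, and hallucination-aware verification, despite strong performance on general video understanding benchmarks.
Insight: The novelty is a diagnostic benchmark dedicated to temporal object consistency, built around object trajectories and structured temporal event timelines, with a strict three-layer filtering protocol that ensures test items depend on temporal visual evidence rather than language priors or single-frame shortcuts. This provides a new tool and perspective for evaluating and improving fine-grained temporal reasoning.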
Abstract: Video large language models (Video-LLMs) have achieved remarkable progress in general video understanding, yet their ability to maintain temporal object consistency remains insufficiently explored. Existing benchmarks primarily focus on event recognition, action understanding, or coarse temporal reasoning, but rarely evaluate whether a model can consistently preserve the identity, state, and temporal continuity of the same object across occlusion, disappearance, reappearance, state transitions, and cross-object interactions. As a result, current evaluations may overestimate temporal reasoning ability while overlooking failures in object-centric temporal coherence. To address this issue, we introduce TOC-Bench, a diagnostic benchmark specifically designed to evaluate temporal object consistency in Video-LLMs. TOC-Bench is explicitly object-track grounded: each queried subject is associated with a per-frame object trajectory and a structured temporal event timeline. To ensure that benchmark items depend on temporally ordered visual evidence rather than language priors, single-frame shortcuts, or unordered frame cues, we propose a three-layer temporal-necessity filtering protocol that removes 60.7% of candidate QA pairs and retains 17,900 temporally dependent items spanning 10 diagnostic dimensions. From this filtered pool, we further construct a human-verified benchmark containing 2,323 high-quality QA pairs over 1,951 videos. Experiments on representative Video-LLMs show that temporal object consistency remains a major unsolved challenge. Current models exhibit substantial weaknesses in event counting, event ordering, identity-sensitive reasoning, and hallucination-aware verification, despite strong performance on general video understanding benchmarks.
[201] Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception cs.CV | cs.IR | cs.LGPDF
Yiwei Ou, Chung Ching Cheung, Jun Yang Ang, Xiaobin Ren, Ronggui Sun
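TL;DR: This paper presents Urban-ImageNet, a large-scale multi-modal dataset and evaluation framework for urban space perception. The dataset contains over 2 million social media images with paired text collected from Weibo, covering 61 sites in 24 Chinese cities, and provides benchmark subsets at 1K, 10K, and 100K scale alongside the full 2M corpus. It is organized by HUSIC, a 10-class hierarchical classification framework grounded in urban theory, and is designed to evaluate how well models perceive spatial, social, and functional distinctions in urban space. The benchmark supports three tasks: urban scene semantic classification, cross-modal image-text retrieval, and instance segmentation. Experiments on representative models show strong performance on supervised scene classification, while cross-modal retrieval and instance segmentation are more challenging.
Details
Motivation: Existing datasets usually treat urban imagery as generic scene data and do not systematically evaluate the spatial, social, and functional distinctions that are central to urban studies. The goal is a unified, theory-grounded benchmark for evaluating how AI systems perceive contemporary urban spaces across modalities, scales, and tasks.
Result: Representative vision, vision-language, and segmentation models were evaluated on the benchmark. Models perform strongly on supervised scene classification but find cross-modal image-text retrieval and instance-level urban object segmentation more challenging. A multi-scale study further examines how model performance changes as balanced training data grows from 1K to 10K to 100K images.
Insight: The main novelty is a large-scale, multi-city, multi-modal, theory-driven dataset and evaluation framework for urban space perception (the HUSIC taxonomy), which ties urban image analysis to urban theory and goes beyond generic scene understanding to evaluate perception of the functional and social attributes of urban space. Its multi-task (classification, retrieval, segmentation) and multi-scale (different data volumes) benchmark design also offers a new approach to systematic model evaluation.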
Abstract: We present Urban-ImageNet, a large-scale multi-modal dataset and evaluation benchmark for urban space perception from user-generated social media imagery. The corpus contains over 2 million public social media images and paired textual posts collected from Weibo across 61 urban sites in 24 Chinese cities from 2019 to 2025, with controlled benchmark subsets at 1K, 10K, and 100K scale and a full 2M corpus for large-scale training and evaluation. Urban-ImageNet is organized by HUSIC, a Hierarchical Urban Space Image Classification framework that defines a 10-class taxonomy grounded in urban theory. The taxonomy is designed to distinguish activated and non-activated public spaces, exterior and interior urban environments, accommodation spaces, consumption content, portraits, and non-spatial social-media content. Rather than treating urban imagery as generic scene data, Urban-ImageNet evaluates whether machine perception models can capture spatial, social, and functional distinctions that are central to urban studies. The benchmark supports three tasks within one standardized library: (T1) urban scene semantic classification, (T2) cross-modal image-text retrieval, and (T3) instance segmentation. Our experiments evaluate representative vision, vision-language, and segmentation models, revealing strong performance on supervised scene classification but more challenging behavior in cross-modal retrieval and instance-level urban object segmentation. A multi-scale study further examines how model performance changes as balanced training data increases from 1K to 10K to 100K images. Urban-ImageNet provides a unified, theory-grounded, multi-city benchmark for evaluating how AI systems perceive and interpret contemporary urban spaces across modalities, scales, and task formulations. Dataset and benchmark are available at: huggingface.co/datasets/Yiwei-Ou/Urban-ImageNet and github.com/yiasun/dataset-2.
[202] SDTalk: Structured Facial Priors and Dual-Branch Motion Fields for Generalizable Gaussian Talking Head Synthesis cs.CV | cs.AIPDF
Peng Jia, Zhen Xiao, Jia Li, Xueliang Liu, Zhenzhen Hu
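TL;DR: SDTalk is a one-shot, generalizable talking head synthesis framework based on 3D Gaussian Splatting (3DGS) that needs no identity-specific personalized training or fine-tuning. It uses a two-stage training strategy. The first stage introduces structured facial priors and separately predicts 3DGS parameters for visible and occluded regions, enabling complete head reconstruction from a single image. The second stage introduces a dual-branch motion field that models coarse and fine facial dynamics, improving detail fidelity and lip synchronization.
Details
Motivation: Existing reconstruction- and rendering-based talking head synthesis methods typically rely on identity-specific models and generalize poorly across identities. The work targets high-quality, real-time, generalizable talking head synthesis.
Result: Experiments show that SDTalk surpasses existing methods in both visual quality and inference efficiency, reaching state-of-the-art performance.
Insight: The novelty lies in integrating structured facial priors into 3DGS reconstruction and in a dual-branch motion field that handles coarse and fine facial motion separately. Together these allow generalization to unseen identities and high-quality dynamic detail while keeping inference efficient.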
Abstract: High-quality, real-time talking head synthesis remains a fundamental challenge in computer vision. Existing reconstruction- and rendering-based methods typically rely on identity-specific models, limiting cross-identity generalization. To address this issue, we propose SDTalk, a one-shot 3D Gaussian Splatting (3DGS)-based framework that generalizes to unseen identities without personalized training or fine-tuning. Our framework comprises two modules with a two-stage training strategy. In the first stage, we incorporate structured facial priors into the reconstruction module and separately predict 3DGS parameters for visible and occluded regions, enabling complete head reconstruction from a single image. In the second stage, we introduce a dual-branch motion field to model coarse and fine facial dynamics, improving detail fidelity and lip synchronization. Experiments demonstrate that SDTalk surpasses existing methods in both visual quality and inference efficiency.
[203] Learning to Perceive “Where”: Spatial Pretext Tasks for Robust Self-Supervised Learning cs.CVPDF
Yang Shen, Yusen Cai, Weronika Hryniewska-Guzik, Qing Lin, Mengmi Zhang
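TL;DR: This paper proposes Spatial Prediction (SP), a self-supervised pretext task that predicts the relative position and scale between two disentangled local views of the same image, strengthening the model's perception of spatial structure and of relationships among object parts. SP can be integrated as a plug-and-play module into a variety of self-supervised learning frameworks. It brings consistent gains on image recognition, fine-grained classification, semantic segmentation, and depth estimation, and substantially improves robustness in out-of-distribution settings.
Details
Motivation: Existing self-supervised learning methods mainly learn object-invariant representations but often neglect spatial structure and relationships among object parts. The paper addresses this gap with a spatially aware pretext regression task.
Result: Improvements are observed across benchmarks for image recognition, fine-grained classification, semantic segmentation, and depth estimation, with substantial gains in out-of-distribution robustness evaluation. The model also performs strongly on purpose-built spatial reasoning tasks (position and scale prediction on image patch pairs, and a jigsaw understanding task), indicating improved spatial structure and geometric awareness.
Insight: The novelty lies in modeling part-to-part relationships in a continuous geometric space as an explicit inductive bias for self-supervised learning, which yields more structured, compositional representations of visual scenes. The method is designed as a decoupled plug-in that integrates easily with existing frameworks, offering an effective way to strengthen spatial reasoning.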
Abstract: Existing self-supervised learning (SSL) methods primarily learn object-invariant representations but often neglect the spatial structure and relationships among object parts. To address this limitation, we introduce Spatial Prediction (SP), a spatially aware pretext regression task that predicts the relative position and scale between a pair of disentangled local views from the same image. By modeling part-to-part relationships in a continuous geometric space, SP encourages representations to capture fine-grained spatial dependencies beyond invariant categorical semantics, thereby learning the compositional structure of visual scenes. SP is implemented as a decoupled plug-in and can be seamlessly integrated into diverse SSL frameworks. Extensive experiments show consistent improvements across image recognition, fine-grained classification, semantic segmentation, and depth estimation, as well as substantial gains in out-of-distribution robustness for object recognition. To evaluate spatial reasoning, we introduce (1) a position and scale prediction task on image patch pairs and (2) a jigsaw understanding task requiring patch reordering and recognition after reconstruction. Strong performance on these tasks indicates improved spatial structure and geometric awareness. Overall, explicitly modeling spatial information provides an effective inductive bias for SSL, leading to more structured representations and better generalization. Code and models will be released.
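As a rough illustration of what a "relative position and scale" regression target could look like, the sketch below computes one from two crop boxes taken from the same image. The parameterization (center offset normalized by the first box, plus a log scale ratio) and the function name `relative_geometry` are assumptions made for illustration; the abstract does not specify the authors' exact target.

```python
import math

def relative_geometry(box_a, box_b):
    """Hypothetical SP-style target for two crops of one image.

    Boxes are (x_min, y_min, width, height) in pixels.
    Returns (dx, dy, ds): the center offset of B relative to A,
    normalized by A's size, and the log of the linear scale ratio.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Offset between box centers, expressed in units of box A's size.
    dx = ((bx + bw / 2) - (ax + aw / 2)) / aw
    dy = ((by + bh / 2) - (ay + ah / 2)) / ah
    # Log ratio of linear scale (square root of the area ratio).
    ds = 0.5 * math.log((bw * bh) / (aw * ah))
    return dx, dy, ds

# Crop B sits one box-width to the right of crop A at the same scale.
print(relative_geometry((0, 0, 10, 10), (10, 0, 10, 10)))
```

A regression head trained on such targets would see continuous values rather than discrete patch indices, which is one way to read the abstract's "continuous geometric space".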
[204] Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse cs.CVPDF
Kuan Zhang, Dongchen Liu, Qiyue Zhao, Tianyu Xin, Yue Su
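TL;DR: This paper surveys progress in building generalist game players with foundation models, treating the multiverse of games as the ultimate platform for training and evaluating artificial general intelligence (AGI). It reviews four eras of development, from environment-specific agents to today's generalist foundation-model players, and proposes a full-lifecycle framework built on four pillars: Dataset, Model, Harness, and Benchmark. The aim is to break five fundamental trade-offs that bound current systems. The paper closes with a five-level roadmap from single-game mastery to a "creator" stage in which the agent both creates and evolves.
Details
Motivation: The paper asks how to achieve general intelligence that, like humans, can generalize experience from a single physical world into a multiverse of games with entirely different rules, aesthetics, physics, and objectives. The authors regard this as a key step toward AGI.
Result: The paper is a survey. It reports no specific quantitative experimental results or benchmark rankings, and instead systematically maps the field's development, core challenges, and future roadmap.
Insight: The contribution is a unified framework for analyzing and building generalist game players (the four pillars of Dataset, Model, Harness, and Benchmark), together with the identification of five fundamental trade-offs that limit current systems. The core view is that the multiverse of games serves as a training ground for AGI, with a gradual path from player to creator that offers principled guidance for research in the field.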
Abstract: The real world unfolds along a single set of physical laws, yet human intelligence demonstrates a remarkable capacity to generalize experiences from this singular physical existence into a multiverse of games, each governed by entirely different rules, aesthetics, physics, and objectives. This omni-reality adaptability is a hallmark of general intelligence. As Artificial Intelligence progresses towards Artificial General Intelligence, the multiverse of games has evolved from mere entertainment into the ultimate ground for training and evaluating AGI. The pursuit of this generality has unfolded across four eras: from environment-specific symbolic and reinforcement learning agents, to current large foundation models acting as generalist players, and toward a future creator stage where an agent both creates new game worlds and continually evolves within them. We trace the full lifecycle of a generalist game player along four interdependent pillars: Dataset, Model, Harness, and Benchmark. Every advance across these pillars can be read as an attempt to break one of five fundamental trade-offs that currently bound the whole system. Building on this end-to-end view, we chart a five-level roadmap, progressing from single-game mastery to the ultimate creator stage in which the agent simultaneously creates and evolves within a theoretical game multiverse. Taken together, our work offers a unified lens onto a rapidly shifting field, and a principled path toward the omnipotent generalist agent capable of seamlessly mastering any challenge within the multiverse of games, thereby paving the way for AGI.
[205] OZ-TAL: Online Zero-Shot Temporal Action Localization cs.CVPDF
Chaolei Han, Hongsong Wang, Xin Gong, Jie Gui
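TL;DR: This paper introduces a new task, Online Zero-shot Temporal Action Localization (OZ-TAL), which aims to detect previously unseen actions in streaming video in an online fashion. The authors propose a training-free framework that uses off-the-shelf Vision-Language Models (VLMs) and adds mechanisms to enhance visual representations and mitigate the models' inherent biases. New benchmarks and representative baselines are established on THUMOS14 and ActivityNet-1.3, and experiments show the framework substantially outperforms existing state-of-the-art methods in both offline and online zero-shot settings.
Details
Motivation: Existing online temporal action localization methods are usually trained on specific domains, generalize poorly, and struggle with arbitrary videos that contain unseen actions. The paper tackles online zero-shot action detection, that is, recognizing new actions without training.
Result: OZ-TAL benchmarks are established on THUMOS14 and ActivityNet-1.3. Experiments show the proposed method outperforms existing state-of-the-art methods by a large margin in both offline and online zero-shot settings.
Insight: The novelty lies in bringing zero-shot learning to online temporal action localization and in a training-free framework that improves performance by enhancing visual representations and mitigating VLM biases, offering a new approach to open-world video action detection.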
Abstract: Online Temporal Action Localization (On-TAL) aims to detect the occurrence time and category of actions in untrimmed streaming videos immediately upon their completion. Recent advancements in this field focus on developing more sophisticated frameworks, shifting from Online Action Detection (OAD)-based aggregation paradigm to instance-level understanding. However, existing approaches are typically trained on specific domains and often exhibit limited generalization capabilities when applied to arbitrary videos, particularly in the presence of previously unseen actions. In this paper, we introduce a new task called Online Zero-shot Temporal Action Localization (OZ-TAL), which aims to detect previously unseen actions in an online fashion. Furthermore, we propose a training-free framework that leverages off-the-shelf Vision-Language Models (VLMs) while introducing additional mechanisms to enhance visual representations and mitigate their inherent biases. We establish new benchmarks and representative baselines for OZ-TAL on THUMOS14 and ActivityNet-1.3, and extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches under both offline and online zero-shot settings.
[206] ERASE: Eliminating Redundant Visual Tokens via Adaptive Two-Stage Token Pruning cs.CVPDF
Yuna Lee, Kyoungho Min, Yulhwa Kim
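TL;DR: This paper proposes ERASE, an adaptive two-stage visual token pruning framework that removes redundant visual tokens produced when vision-language models process high-resolution images, reducing computational overhead. By adjusting its pruning strategy to image complexity, the method retains the key tokens and cuts the token count substantially while preserving model accuracy.
Details
Motivation: Vision-language models generate a large number of visual tokens for high-resolution images, which increases the computational burden. Existing pruning methods rely mainly on semantic features learned by the model to capture redundancy, and they lack a mechanism for adapting the pruning strategy to the complexity of the input image.
Result: On Qwen2.5-VL-7B, at a token pruning ratio of 85%, ERASE retains 89.46% of the original model accuracy, whereas the best prior method retains only 78.1%.
Insight: The contribution is an adaptive two-stage pruning framework that adjusts its strategy to image complexity in order to retain salient tokens. Viewed objectively, the complexity-adaptive mechanism improves the balance between pruning efficiency and accuracy and offers a new approach to visual token compression.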
Abstract: Recent advancements in Vision-Language Models (VLMs) enable large language models (LLMs) to process high-resolution images, significantly improving real-world multimodal understanding. However, this capability introduces a large number of vision tokens, resulting in substantial computational overhead. To mitigate this issue, various vision token pruning methods have been proposed. Nevertheless, existing approaches predominantly rely on learned semantic features within the model to capture visual redundancy. Moreover, they lack adaptive mechanisms to adjust pruning strategies according to the complexity of the input image. In this paper, we propose ERASE, a two-stage vision token pruning framework that identifies and retains salient tokens through pruning strategies adaptive to image complexity. Experiment results demonstrate that ERASE significantly reduces vision tokens while preserving accuracy. For Qwen2.5-VL-7B, at a token pruning ratio of 85%, ERASE retains 89.46% of the original model accuracy, whereas the best prior method retains only 78.1%. Our code is available at https://github.com/Tuna-Luna/ERASE.
[207] Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization cs.CVPDF
Yeongtak Oh, Dongwook Lee, Sangkwon Park, Heeseung Kim, Sungroh Yoon
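TL;DR: This paper presents Omni-Persona, the first comprehensive benchmark for evaluating omnimodal personalization across text, image, and audio. The task is formalized as cross-modal routing over a "Persona Modality Graph", comprising 4 task groups and 18 fine-grained tasks. The study also introduces Calibrated Accuracy (Cal), which jointly evaluates correct grounding and appropriate abstention, and it reveals a gap between audio and visual grounding in open-source models, the limits of existing evaluation metrics, and the strengths and weaknesses of different training methods such as SFT and RLVR.
Details
Motivation: Personalization research for multimodal large language models has been largely confined to the vision-language setting. There is no unified omnimodal benchmark covering text, image, and audio, and no rigorous way to evaluate model behavior in absent-persona scenarios or to study grounding systematically.
Result: Experiments on the proposed Omni-Persona benchmark yield three diagnostic findings. Open-source models show a consistent gap between audio and visual grounding. Answerable recall and parameter count are incomplete diagnostics on their own. Supervised fine-tuning (SFT) is bounded by the difficulty of constructing annotated data at scale, while rule-based reinforcement learning (RLVR) generalizes more consistently but drifts toward conservative behavior and lower generation quality under the specific reward design used.
Insight: The contribution is the first comprehensive omnimodal personalization benchmark, together with Calibrated Accuracy (Cal), a new metric that jointly evaluates grounding correctness and abstention appropriateness. Viewed objectively, the study systematically exposes key challenges in omnimodal personalization (such as performance gaps between modalities, hallucination, and calibration) and provides a clear diagnostic framework and guidance for future post-training and reward design.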
Abstract: While multimodal large language models have advanced across text, image, and audio, personalization research has remained primarily vision-language, with unified omnimodal benchmarking that jointly covers text, image, and audio still limited, and lacking the methodological rigor to account for absent-persona scenarios or systematic grounding studies. We introduce Omni-Persona, the first comprehensive benchmark for omnimodal personalization. We formalize the task as cross-modal routing over the \emph{Persona Modality Graph}, encompassing 4 task groups and 18 fine-grained tasks across ${\sim}750$ items. To rigorously diagnose grounding behavior, we propose \emph{Calibrated Accuracy ($\mathrm{Cal}$)}, which jointly rewards correct grounding and appropriate abstention, incorporating absent-persona queries within a unified evaluation framework. On our dedicated experiments, three diagnostic findings emerge: (i) open-source models show a consistent audio-vs-visual grounding gap that RLVR partially narrows via dense rule-based supervision; (ii) answerable recall and parameter scale are incomplete diagnostics, since strong recall can coexist with absent-persona hallucination and larger models do not always achieve higher $\mathrm{Cal}$, exposing calibration as a separate evaluation axis; and (iii) SFT is bounded by the difficulty of constructing annotated ground-truth supervision at scale, while RLVR generalizes more consistently through outcome-level verifiable feedback yet drifts toward conservative behavior and lower generation quality under our reward design. Omni-Persona thus serves as a diagnostic framework that surfaces the pitfalls of omnimodal personalization, guiding future post-training and reward design.
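The abstract describes Calibrated Accuracy as jointly rewarding correct grounding and appropriate abstention, but does not give the formula. The sketch below shows one plausible reading under stated assumptions: an item counts as a hit if the persona is present and the model answers correctly, or if the persona is absent and the model abstains. The field names and the function `calibrated_accuracy` are hypothetical, not the benchmark's API.

```python
def calibrated_accuracy(items):
    """Hypothetical Cal-style score over a list of evaluation items.

    Each item is a dict with boolean fields:
      'answerable' - the queried persona is actually present
      'abstained'  - the model declined to answer
      'correct'    - the model's answer matched the reference
    """
    hits = 0
    for item in items:
        if item["answerable"]:
            # Reward only a non-abstaining, correct answer.
            hits += int((not item["abstained"]) and item["correct"])
        else:
            # Reward abstention when the persona is absent.
            hits += int(item["abstained"])
    return hits / len(items)

items = [
    {"answerable": True, "abstained": False, "correct": True},    # grounded correctly
    {"answerable": True, "abstained": False, "correct": False},   # wrong answer
    {"answerable": False, "abstained": True, "correct": False},   # proper abstention
    {"answerable": False, "abstained": False, "correct": False},  # hallucination
]
print(calibrated_accuracy(items))
```

Under this reading, a model with perfect recall on answerable items can still score poorly if it hallucinates on absent-persona queries, which matches the abstract's point that calibration is a separate evaluation axis.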
[208] Med-StepBench: A Hierarchical Reasoning Framework for Evaluating Hallucinations in Medical Vision-Language Models cs.CVPDF
Minh Khoi Nguyen, Dai Lam Le, Amir Reza Jafari, Tuan Dung Nguyen, Mai Hong Son
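TL;DR: This paper presents Med-StepBench, a large-scale hierarchical reasoning benchmark for evaluating hallucinations in medical vision-language models, focused on 3D oncological PET/CT imaging. By decomposing clinical reasoning into four expert-designed diagnostic stages, it exposes critical reasoning errors hidden behind seemingly correct diagnoses and shows how vulnerable existing models are to adversarial explanations.
Details
Motivation: Existing medical hallucination benchmarks focus mainly on 2D imaging and one-shot diagnostic questions. They cannot assess whether predictions are grounded in correct localization and abnormality identification, so critical reasoning errors stay hidden. A new benchmark is needed to detect hallucinations systematically in multi-step clinical reasoning.
Result: Med-StepBench contains over 12,000 images and more than 1,000,000 image-statement pairs. The first step-level evaluation of general-purpose and medical VLMs on it reveals systematic failure modes that aggregate accuracy metrics obscure, and shows that current VLMs are highly susceptible to adversarial yet clinically plausible intermediate explanations, which significantly amplify hallucinations.
Insight: The contribution is the first large-scale, hierarchical (four-stage) hallucination detection benchmark for 3D medical imaging (PET/CT). It underlines the importance of fine-grained, step-level evaluation of multi-step clinical reasoning, reports the new finding that models are vulnerable to adversarial explanations, and provides a rigorous evaluation framework for developing safer and more reliable medical VLMs.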
Abstract: Large vision-language models (VLMs) demonstrate strong performance in medical image understanding, but frequently generate clinically plausible yet incorrect statements, raising significant safety concerns. Existing medical hallucination benchmarks primarily focus on 2D imaging with one-shot diagnostic questions, offering limited insight into whether predictions are grounded in correct localization and abnormality identification, allowing critical reasoning errors to remain hidden behind seemingly correct diagnoses. We introduce Med-StepBench, the first large-scale benchmark for step-wise hallucination detection in 3D oncological PET/CT, comprising over 12,000 images and more than 1,000,000 image-statement pairs across volumetric and multi-view 2D data, which decomposes clinical reasoning into four expert-designed diagnostic stages. Using clinician-verified annotations, we perform the first step-level evaluation of general-purpose and medical VLMs, revealing systematic failure modes obscured by aggregate accuracy metrics. Furthermore, we show that current VLMs are highly susceptible to adversarial yet clinically plausible intermediate explanations, which significantly amplify hallucinations despite contradictory visual evidence. Together, our findings highlight fundamental limitations in grounding multi-step clinical reasoning and establish Med-StepBench as a rigorous benchmark for developing safer and more reliable medical VLMs.
[209] Hystar: Hypernetwork-driven Style-adaptive Retrieval via Dynamic SVD Modulation cs.CVPDF
Yujia Cai, Boxuan Li, Chenghao Xu, Jiexi Yan
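TL;DR: This paper proposes Hystar, a framework that addresses the distribution shift caused by diverse query styles in query-based image retrieval. A hypernetwork dynamically generates singular-value perturbations for attention layers, combined with static singular-value offsets on MLP layers, to achieve lightweight style-adaptive retrieval.
Details
Motivation: Query-based image retrieval faces distribution shift from diverse query styles (such as sketches, artworks, and low-resolution previews), and existing large-scale vision-language representation models perform poorly on unseen styles.
Result: On multi-style retrieval and cross-style classification benchmarks, Hystar consistently outperforms strong baselines and achieves state-of-the-art performance while remaining parameter-efficient and stable across styles.
Insight: The contributions are: 1) hypernetwork-driven dynamic singular-value modulation for per-input adaptation; 2) static singular-value offsets on MLP layers for cross-style stability; 3) the StyleNCE loss, an optimal-transport-weighted contrastive objective that emphasizes hard cross-style negatives and reduces semantic confusion across styles.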
Abstract: Query-based image retrieval (QBIR) requires retrieving relevant images given diverse and often stylistically heterogeneous queries, such as sketches, artworks, or low-resolution previews. While large-scale vision–language representation models (VLRMs) like CLIP offer strong zero-shot retrieval performance, they struggle with distribution shifts caused by unseen query styles. In this paper, we propose the Hypernetwork-driven Style-adaptive Retrieval (Hystar), a lightweight framework that dynamically adapts model weights to each query’s style. Hystar employs a hypernetwork to generate singular-value perturbations ($ΔS$) for attention layers, enabling flexible per-input adaptation, while static singular-value offsets on MLP layers ensure cross-style stability. To better handle semantic confusions across styles, we design StyleNCE as part of Hystar, an optimal-transport-weighted contrastive loss that emphasizes hard cross-style negatives. Extensive experiments on multi-style retrieval and cross-style classification benchmarks demonstrate that Hystar consistently outperforms strong baselines, achieving state-of-the-art performance while being parameter-efficient and stable across styles.
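To make the idea of singular-value modulation concrete, here is a minimal pure-Python sketch. It assumes a weight matrix already factored as W = U diag(S) V^T and rebuilds an adapted weight W' = U diag(S + ΔS) V^T, with ΔS produced by a stand-in linear "hypernetwork". The helper names (`toy_hypernetwork`, `modulate`) and the linear form of the hypernetwork are illustrative assumptions, not the paper's architecture.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def toy_hypernetwork(style_embedding, weights):
    """Hypothetical stand-in: a single linear map from a style embedding to delta_s."""
    return [sum(w * e for w, e in zip(row, style_embedding)) for row in weights]

def modulate(u, s, vt, delta_s):
    """Rebuild W' = U diag(S + delta_S) V^T; U and V^T stay frozen."""
    n = len(s)
    diag = [[(s[i] + delta_s[i]) if i == j else 0.0 for j in range(n)]
            for i in range(n)]
    return matmul(matmul(u, diag), vt)

# Toy example with identity singular vectors, so W' is just the shifted diagonal.
identity = [[1.0, 0.0], [0.0, 1.0]]
delta_s = toy_hypernetwork([1.0, 1.0], [[0.5, 0.0], [0.0, -0.25]])
print(modulate(identity, [2.0, 1.0], identity, delta_s))
```

Because only the singular values change, the number of query-dependent parameters equals the rank of the layer rather than the size of the full weight matrix, which is consistent with the abstract's description of the method as lightweight.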
[210] MUSDA: Multi-source Multi-modality Unsupervised Domain Adaptive 3D Object Detection for Autonomous Driving cs.CVPDF
Xiaohu Lu, Hamed Khatounabadi, Hayder Radha
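TL;DR: This paper proposes MUSDA, a multi-source, multi-modality unsupervised domain adaptation framework for 3D object detection in autonomous driving. The framework aligns camera and LiDAR features with hierarchical spatially-conditioned domain classifiers and fuses predictions from multiple source-domain detection heads with a prototype graph weighting strategy, enabling 3D object detection in new environments without manual annotation.
Details
Motivation: Existing domain adaptation methods usually target a single source domain or a single modality, so they cannot make effective use of the growing number of multi-source, multi-modality annotated datasets in autonomous driving, which limits adaptation to new environments.
Result: Experiments on three mainstream 3D object detection datasets (Waymo, nuScenes, and Lyft) show that the framework effectively integrates multi-modality and multi-source information and consistently outperforms existing state-of-the-art methods.
Insight: The contributions are hierarchical spatially-conditioned domain classifiers for cross-modal feature alignment and a prototype-graph-weighted multi-source fusion strategy. The core idea is to model and exploit relationships among source domains through a structured prototype graph, offering a new approach to multi-source, multi-modality domain adaptation.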
Abstract: With the advancement of autonomous driving, numerous annotated multi-modality datasets have become available. This presents an opportunity to develop domain-adaptive 3D object detectors for new environments without relying on labor-intensive manual annotations. However, traditional domain adaptation methods typically focus on a single source domain or a single modality, limiting their effectiveness in multi-source, multi-modality scenarios. In this paper, we propose a novel framework for multi-source, multi-modality unsupervised domain adaptation in 3D object detection for autonomous driving. Given multiple labeled source domains and one unlabeled target domain, our framework first introduces hierarchical spatially-conditioned (HSC) domain classifiers, which jointly align features from both camera and LiDAR modalities at two distinct levels for each source-target domain pair. To effectively leverage information from multiple source domains, we construct a prototype graph between each pair of domains. Based on this, we develop a prototype graph weighted (PGW) multi-source fusion strategy to aggregate predictions from multiple source detection heads. Experimental results on three widely used 3D object detection datasets - Waymo, nuScenes, and Lyft - demonstrate that our proposed framework effectively integrates information across both modalities and source domains, consistently outperforming state-of-the-art methods.
[211] EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs cs.CVPDF
Jiameng Li, Minye Wu, Jiezhang Cao, Aleksei Tiulpin, Matthew B. Blaschko
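TL;DR: This paper proposes EchoPrune, a lightweight, training-free visual token pruning method for video large language models (VideoLLMs). It interprets redundant video tokens as temporal "echoes" and scores tokens by query-guided cross-modal relevance and temporal reconstruction error. Under a fixed LLM-side visual token budget, it preserves task-relevant cues and temporal novelty while suppressing predictable redundancy, allowing the model to process more frames.
Details
Motivation: In long-form video understanding, dense frame sampling produces too many visual tokens, while sparse sampling risks missing critical temporal evidence and triggering LLM hallucination. Existing training-free token reduction methods either treat video like static images or rely on segment-level merging heuristics, which weakens fine-grained spatiotemporal modeling and adds overhead.
Result: Extensive experiments were run on LLaVA-OV, Qwen2.5VL, and Qwen3VL across six video understanding benchmarks. Under the same token budget, EchoPrune lets VideoLLMs process up to 20x as many frames, yielding improved performance (+8.6%) and faster inference (5.6x in the prefilling stage) on Qwen2.5VL-7B.
Insight: The core idea is to conceptualize redundant video tokens as "temporal echoes" and to score and prune tokens using query relevance together with a reconstruction error derived from cross-frame correspondence matching and echo matching. It is a training-free, lightweight, dynamic token selection strategy aimed at more efficient spatiotemporal modeling in video understanding.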
Abstract: Long-form video understanding remains challenging for Video Large Language Models (VideoLLMs), as the dense frame sampling introduces massive visual tokens while sparse sampling risks missing critical temporal evidence and leading to LLM hallucination. Existing training-free token reduction methods either treat videos equally as static images or rely on segment-level merging heuristics, which weaken fine-grained spatiotemporal modeling and introduce additional overhead. In this paper, we propose EchoPrune, a lightweight and training-free token pruning method that improves temporal resolution under a fixed LLM-side visual token budget. Our core idea is to interpret redundant video tokens as temporal echoes: if a token is well reconstructed from the previous frame, it is merely a temporally redundant echo; otherwise, it may capture new events, motion, or query-relevant visual evidence. Based on this insight, EchoPrune scores visual tokens by (i) query-guided crossmodal relevance and (ii) temporal reconstruction error, measured by correspondence matching and echo matching across consecutive frames. The selected tokens preserve task-relevant cues and temporal novelty while suppressing predictable redundancy, allowing VideoLLMs to observe more frames without increasing the decoding budget. Extensive experiments on LLaVA-OV, Qwen2.5VL, and Qwen3VL across six video understanding benchmarks show that EchoPrune enables VideoLLMs to process up to 20x frames under the same token budget, yielding improved performance (+8.6%) and inference speedup (5.6x for prefilling) on Qwen2.5VL-7B.
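The two scoring signals described in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions: temporal reconstruction error is approximated by one minus the best cosine match against the previous frame's tokens, and the two signals are mixed with a hypothetical weight `alpha`. The paper's correspondence and echo matching are not specified in the abstract and are not reproduced here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def echo_scores(prev_tokens, cur_tokens, query, alpha=0.5):
    """Score current-frame tokens by query relevance and temporal novelty."""
    scores = []
    for token in cur_tokens:
        # A token that closely matches something in the previous frame is an "echo".
        novelty = 1.0 - max(cosine(token, p) for p in prev_tokens)
        relevance = cosine(token, query)
        scores.append(alpha * relevance + (1.0 - alpha) * novelty)
    return scores

def select_tokens(scores, budget):
    """Keep the indices of the `budget` highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return sorted(ranked[:budget])

prev_frame = [[1.0, 0.0]]
cur_frame = [[1.0, 0.0], [0.0, 1.0]]   # first token repeats the previous frame
query = [0.0, 1.0]
scores = echo_scores(prev_frame, cur_frame, query)
print(scores, select_tokens(scores, budget=1))
```

In this toy case the repeated token scores zero on both signals and is dropped, while the novel, query-aligned token is kept.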
[212] Explanation-Aware Learning for Enhanced Interpretability in Biomedical Imaging cs.CVPDF
Zubair Faruqui, Rahul Dubey
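TL;DR: This paper presents a systematic approach that integrates explanation supervision directly into the training objective to steer medical image diagnosis models toward clinically relevant regions, and it analyzes how different explanation loss designs and supervision strengths affect predictive performance and the spatial faithfulness of explanations.
Details
Motivation: Deep neural networks for medical image diagnosis often reach high predictive accuracy by relying on spurious or clinically irrelevant visual cues, which limits their trustworthiness in practice. Post-hoc explanation methods can visualize model decisions but do not influence training, so non-causal or confounding features persist.
Result: Experiments on annotated chest X-ray datasets reveal a clear trade-off between explanation quality and the explanation loss coefficient. Quantitative statistical analysis shows consistently improved explanation alignment while accuracy remains comparable.
Insight: The contributions are the inclusion of an explanation loss in the training objective to improve interpretability, and two complementary quantitative metrics, annotation coverage and saliency precision, for rigorous evaluation of explanation performance. The framework applies to a broad range of annotated biomedical imaging modalities and offers practical guidance for integrating explanation loss under noisy clinical annotations.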
Abstract: Deep neural networks for medical image diagnosis often achieve high predictive accuracy while relying on spurious or clinically irrelevant visual cues, limiting their trustworthiness in practice. Post-hoc explanation methods are widely used to visualize model decisions in the form of saliency maps; however, these explanations do not influence how models learn during training, allowing non-causal or confounding features to persist. This motivates the incorporation of explanation supervision directly into the training objective to guide model attention toward clinically meaningful regions and promote clinically grounded decision-making. This paper presents a systematic approach to integrate explanation loss into model training and analyzes how different explanation loss designs and supervision strengths influence both predictive performance and spatial faithfulness of explanations. To quantitatively assess interpretability, two complementary explanation performance metrics, annotation coverage and saliency precision, are introduced, enabling rigorous evaluation beyond qualitative visualization. Our experimental results reveal a clear trade-off between explanation quality and explanation loss coefficients. Furthermore, quantitative statistical analysis shows consistently improved explanation alignment while maintaining comparable accuracy. Experiments were conducted on annotated chest X-ray datasets; however, the proposed framework is applicable to a broad range of annotated biomedical imaging modalities. Overall, these findings demonstrate that explanation supervision is not a monolithic design choice and provide practical guidance for incorporating explanation loss into training objectives under noisy clinical annotations.
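The abstract names an explanation loss coefficient and two metrics without defining them. The sketch below shows one plausible formulation, offered purely as an assumption: the training objective is a weighted sum of classification and explanation losses, annotation coverage is the fraction of annotated pixels that the saliency map marks as salient, and saliency precision is the fraction of salient pixels that fall inside the annotation. The names `total_loss`, `coverage_and_precision`, and the 0.5 threshold are hypothetical.

```python
def total_loss(classification_loss, explanation_loss, coefficient):
    """Assumed combined objective: L = L_cls + coefficient * L_expl."""
    return classification_loss + coefficient * explanation_loss

def coverage_and_precision(saliency, annotation, threshold=0.5):
    """Compare a saliency map (floats) with a binary annotation mask (0/1)."""
    salient = [[value >= threshold for value in row] for row in saliency]
    overlap = sum(int(s and bool(a))
                  for s_row, a_row in zip(salient, annotation)
                  for s, a in zip(s_row, a_row))
    n_annotated = sum(sum(1 for a in row if a) for row in annotation)
    n_salient = sum(sum(1 for s in row if s) for row in salient)
    # Coverage: how much of the clinician-marked region the model attends to.
    coverage = overlap / n_annotated if n_annotated else 0.0
    # Precision: how much of the model's attention lies in the marked region.
    precision = overlap / n_salient if n_salient else 0.0
    return coverage, precision

saliency_map = [[0.9, 0.8], [0.1, 0.2]]
annotation_mask = [[1, 0], [1, 0]]
print(coverage_and_precision(saliency_map, annotation_mask))
print(total_loss(1.0, 0.5, 0.2))
```

Raising the coefficient pushes the model toward the annotated regions at the possible expense of classification loss, which is the trade-off the abstract reports.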
[213] MFVLR: Multi-domain Fine-grained Vision-Language Reconstruction for Generalizable Diffusion Face Forgery Detection and Localization cs.CVPDF
Yaning Zhang, Tianyi Wang, Zan Gao, Yibo Zhao, Chunjie Ma
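TL;DR: This paper proposes MFVLR, a multi-domain fine-grained vision-language reconstruction model that uses language-guided face forgery representation learning to achieve generalizable detection and localization of diffusion-synthesized face forgeries. The model combines a fine-grained language transformer, a multi-domain vision encoder, a vision decoder, and a vision injection module to capture general visual forgery patterns across the image and residual domains, and it outperforms existing methods in cross-generator, cross-forgery, and cross-dataset evaluations.
Details
Motivation: With the rapid progress of photo-realistic face generation, society and academia urgently need generalizable face forgery detection and localization methods. Existing methods mostly rely on the image modality to capture multi-domain forgery patterns and do not fully explore other modalities such as fine-grained text, which limits generalization. They also usually target GAN-generated faces and struggle to detect and localize content synthesized by diffusion models.
Result: In extensive experiments and visual analyses across cross-generator, cross-forgery, and cross-dataset settings, MFVLR surpasses existing state-of-the-art methods on diffusion-synthesized face forgery detection and localization and shows strong generalization.
Insight: The contributions are: a fine-grained language transformer that learns general fine-grained language embeddings through language reconstruction; a multi-domain vision encoder that captures complementary visual forgery patterns across the image and residual domains; a vision decoder that performs appearance reconstruction and forgery localization; and a plug-and-play vision injection module that strengthens interaction between vision and language embeddings. Viewed objectively, the study deeply fuses the fine-grained text modality with multi-domain visual information and offers a novel cross-modal solution for detecting and localizing diffusion-synthesized forgeries.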
Abstract: The swift advancement in photo-realistic face generation technology has sparked considerable concerns across society and academia, emphasizing the need for generalizable face forgery detection and localization methods. Prior works tend to capture face forgery patterns across multiple domains using the image modality, while other modalities such as fine-grained text are not comprehensively investigated, which restricts the generalization capability of models. Besides, they usually analyze facial images created by GANs, but struggle to identify and localize those synthesized by diffusion models. To address these problems, we devise a novel multi-domain fine-grained vision-language reconstruction (MFVLR) model, which explores comprehensive and diverse visual forgery traces via language-guided face forgery representation learning, to achieve generalizable diffusion-synthesized face forgery detection and localization (DFFDL). Specifically, we devise a fine-grained language transformer that learns general fine-grained language embeddings using language reconstruction. We propose a multi-domain vision encoder to capture general and complementary visual forgery patterns across the image and residual domains. A vision decoder is designed to reconstruct image appearance and achieve forgery localization. Besides, we propose an innovative plug-and-play vision injection module to enhance the interaction between the vision and language embeddings. Extensive experiments and visualizations demonstrate that our network outperforms the state of the art in settings such as cross-generator, cross-forgery, and cross-dataset evaluations.
[214] SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation cs.CVPDF
Liangyang Ouyang, Ruicong Liu, Caixin Kang, Yifei Huang, Yoichi Sato
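TL;DR: This paper proposes SocialDirector, a training-free social interaction controller that strengthens interaction control in multi-person video generation by modulating cross-attention maps. The method consists of two modules, Social Actor Masking and Directional Reweighting, and targets the actor-action mismatch, disordered social dynamics, and wrong action targets that existing models produce in multi-person videos.
Details
Motivation: When generating multi-person social interaction videos, existing video generation models offer no explicit control over the interaction (who performs which action, when it occurs, and toward whom), leading to actor-action mismatch, disordered social dynamics, and wrong action targets. Fields such as film production and social robotics have a pressing need for this kind of controllable generation.
Result: Experiments on different video generation models show that SocialDirector significantly improves interaction fidelity and approaches the upper bound set by real videos. Evaluation is based on annotating existing datasets with interaction descriptions and building a fully automated pipeline powered by open-source vision-language models.
Insight: The contribution is a training-free interaction controller that achieves fine-grained social interaction control through Social Actor Masking (restricting each person's visual tokens to attend only to that person's own textual description) and Directional Reweighting (amplifying attention to directional words). The method is plug-and-play and needs no additional training.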
Abstract: Video generation has advanced rapidly, producing photorealistic videos from text or image prompts. Meanwhile, film production and social robotics increasingly demand multi-person videos with rich social interactions, including conversations, gestures, and coordinated actions. However, existing models offer no explicit control over interactions, such as who performs which action, when it occurs, and toward whom it is directed. This often results in the wrong person performing unintended actions (actor-action mismatch), disordered social dynamics, and wrong action targets. To address these challenges, we present SocialDirector, a training-free interaction controller that enhances the generation model by modulating cross-attention maps. SocialDirector contains two modules: Social Actor Masking and Directional Reweighting. Social Actor Masking constrains each person’s visual tokens to attend only to their own textual descriptions via a spatiotemporal mask, avoiding actor-action mismatch and disordered social dynamics. Directional Reweighting amplifies attention to directional words (e.g., “leftward”, “right”), leading each action towards its intended target. To evaluate generated social interactions, we annotate existing datasets with interaction descriptions and build a fully automated evaluation pipeline powered by open-source VLMs. Experiments on different video generation models show that SocialDirector significantly improves interaction fidelity and approaches the upper bound set by real videos.
[215] Initiation of Interaction Detection Framework using a Nonverbal Cue for Human-Robot Interaction cs.CVPDF
Guhnoo Yun, Juhan Yoo, Kijung Kim, Dong Hwan Kim
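TL;DR: This paper proposes a framework for detecting the initiation of interaction in human-robot interaction, based on audio and vision sensor fusion, for keyword-free detection in domestic environments. The framework uses the robot's own sensors together with an external vision sensor for stable human detection and tracking, and combines sound source localization with face orientation to decide whether the user is initiating an interaction.
Details
Motivation: The work addresses how to detect a user's intent to interact naturally, without specific keywords, and in particular how to make the initiation of interaction more intuitive in domestic environments.
Result: Experiments validate the state transition model, and all components are integrated and deployed in the Robot Operating System (ROS).
Insight: The novelty lies in fusing audio and visual information (such as sound source localization and face orientation) to detect the initiation of interaction, and in adding a gaze-duration criterion for cases without direct speech, which improves the naturalness and robustness of the interaction.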
Abstract: This paper describes a keyword-free initiation of interaction (IoI) detection framework for human-robot interaction (HRI) based on audio and vision sensor fusion in a domestic environment. In the proposed framework, the robot has its own audio and vision sensors and can also employ an external vision sensor for stable human detection and tracking. When the user starts to speak while looking at the robot, the robot can localize the user's position through sound source localization combined with human tracking information. The robot then detects the IoI if it perceives that the speaker's face is turned toward the robot. If the user does not speak, the robot can also detect the IoI when the user looks at the robot for longer than a predefined period of time. A state transition model for the proposed IoI detection framework is designed and verified by experiments with a mobile robot. To implement and integrate our model in a robot architecture, all the components are implemented and integrated in the Robot Operating System (ROS) environment.
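The two detection rules in the abstract (speech while facing the robot, or sustained gaze without speech) can be sketched as a small decision loop. This is a simplified illustration, not the paper's state transition model: the observation format, the function name `detect_ioi`, and the 3-second gaze threshold are all assumptions, and the sketch presumes that upstream tracking and sound source localization have already attributed speech and face orientation to one tracked person.

```python
def detect_ioi(observations, gaze_threshold_s=3.0):
    """Return the timestamp at which an initiation of interaction is detected.

    observations: iterable of (timestamp_s, speaking, facing_robot) tuples
    for one tracked person. Returns None if no IoI is detected.
    """
    gaze_start = None
    for timestamp, speaking, facing_robot in observations:
        if not facing_robot:
            # Looking away resets the gaze timer.
            gaze_start = None
            continue
        if speaking:
            # Rule 1: the user speaks while the face is turned toward the robot.
            return timestamp
        if gaze_start is None:
            gaze_start = timestamp
        elif timestamp - gaze_start > gaze_threshold_s:
            # Rule 2: silent gaze held longer than the predefined period.
            return timestamp
    return None

speech_case = [(0.0, False, True), (1.0, False, True), (2.0, True, True)]
gaze_case = [(0.0, False, True), (2.0, False, True), (4.0, False, True)]
broken_gaze = [(0.0, False, True), (2.0, False, False), (4.0, False, True), (5.0, False, True)]
print(detect_ioi(speech_case), detect_ioi(gaze_case), detect_ioi(broken_gaze))
```

In a ROS deployment, each observation would presumably arrive as a message from the tracking and audio nodes; that wiring is outside the scope of this sketch.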
[216] HYPERPOSE: Hyperbolic Kinematic Phase-Space Attention for 3D Human Pose Estimation cs.CV | cs.AIPDF
Vinduja T., Ashish M., Ajay Waghumbare, Upasna Singh
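TL;DR: HYPERPOSE is a novel 3D human pose estimation framework that performs spatio-temporal reasoning in hyperbolic space (the Lorentz model) to natively preserve the hierarchical tree topology of the human skeleton. It models joint relationships and temporal dynamics with Hyperbolic Kinematic Phase-Space Attention (HKPSA) and a multi-scale windowed hyperbolic attention mechanism, and it introduces a Riemannian loss suite and an uncertainty-weighted curriculum to stabilize training. Evaluations on Human3.6M and MPI-INF-3DHP show state-of-the-art structural consistency, temporal coherence, and positional accuracy.
Details
Motivation: Current state-of-the-art pose estimators, such as transformers and graph convolutional networks, operate in Euclidean space, which mismatches the inherent tree structure of the human body and leads to exponential volume distortion and weak structural coherence. To resolve this geometric mismatch, the paper models poses directly in hyperbolic space, which better preserves the hierarchical topology of the skeleton.
Result: Extensive evaluations on Human3.6M and MPI-INF-3DHP show that HYPERPOSE significantly reduces volume distortion and velocity error, sets a new state of the art in overall positional accuracy, and achieves state-of-the-art structural and temporal coherence.
Insight: The contributions are: 1) complete spatio-temporal reasoning in hyperbolic space (the Lorentz model), presented as a first, to natively preserve the hierarchical structure of the skeleton; 2) Hyperbolic Kinematic Phase-Space Attention (HKPSA) and a multi-scale windowed hyperbolic attention mechanism for efficient modeling of joint relationships and temporal dynamics; 3) a Riemannian loss suite and an uncertainty-weighted curriculum that stabilize training on non-Euclidean manifolds through physical geodesic constraints such as bone length and velocity consistency.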
Abstract: We introduce HYPERPOSE, a novel 3D human pose estimation framework that performs spatio-temporal reasoning entirely within the Lorentz model of hyperbolic space $\mathbb{H}^d$ to natively preserve the hierarchical tree topology of the human skeleton. Current state-of-the-art pose estimators aim to capture complex joint dynamics by relying on transformers and graph convolutional networks. Since these architectures operate exclusively in Euclidean space which fundamentally mismatches the inherent tree structure of the human body, these methods inevitably suffer from exponential volume distortion and struggle to maintain structural coherence. To this end, we depart from flat spaces and aim to improve geometric fidelity with Hyperbolic Kinematic Phase-Space Attention (HKPSA), natively embedding complex joint relationships without distortion, alongside a multi-scale windowed hyperbolic attention mechanism that efficiently models temporal dynamics in $O(TW)$ complexity. Furthermore, to overcome the well-known instability of training non-Euclidean manifolds, HYPERPOSE introduces a novel Riemannian loss suite and an uncertainty-weighted curriculum, enforcing physical geodesic constraints like bone length and velocity consistency. Extensive evaluations on the Human3.6M and MPI-INF-3DHP datasets demonstrate that HYPERPOSE achieves state-of-the-art structural and temporal coherence, significantly reducing both volume distortion and velocity error, while establishing new state-of-the-art benchmarks in overall positional accuracy.
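For readers unfamiliar with the Lorentz model mentioned above, the sketch below computes its standard geodesic distance, d(x, y) = arccosh(-⟨x, y⟩_L), where ⟨x, y⟩_L = -x_0 y_0 + Σ x_i y_i. Unit negative curvature is assumed, and the helper names (`lift`, `lorentz_inner`, `lorentz_distance`) are illustrative; none of this is taken from the HYPERPOSE code, and the attention mechanism itself is not shown.

```python
import math

def lorentz_inner(x, y):
    """Minkowski inner product: negative time component, positive space components."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lift(v):
    """Lift a Euclidean vector v onto the hyperboloid, so that <x, x>_L = -1."""
    x0 = math.sqrt(1.0 + sum(a * a for a in v))
    return [x0] + list(v)

def lorentz_distance(x, y):
    """Geodesic distance on the unit hyperboloid (curvature -1 assumed)."""
    # Clamp to 1.0 to guard against rounding slightly below acosh's domain.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

origin = lift([0.0, 0.0])
point = lift([math.sinh(1.0), 0.0])   # lies at hyperbolic distance 1 from the origin
print(lorentz_distance(origin, point))
```

Distances in this space grow roughly logarithmically with the lifted coordinates, which is why tree-like structures such as a kinematic chain can be embedded with low distortion.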
[217] ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models cs.CV | cs.AIPDF
Tingshu Mou, Jiabo He, Renying Wang, Ce Liu, Hao Yang
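TL;DR: This paper proposes ViSRA, a training-free video-based spatial reasoning agent designed to probe the spatial reasoning mechanism of multimodal large language models (MLLMs). The framework elicits spatial reasoning in a modular, extensible way by leveraging explicit spatial information from expert models, yielding a flexible plug-and-play paradigm.
Details
Motivation: Progress on 3D spatial intelligence in MLLMs has relied mainly on post-training on specific benchmarks, while inference-time approaches remain relatively underexplored. The paper studies the spatial reasoning mechanism of MLLMs from a training-free perspective.
Result: Experiments show that ViSRA improves a range of MLLMs on both existing benchmarks and unseen 3D spatial reasoning tasks, outperforming baselines by absolute margins of up to 15.6% and 28.9%, respectively.
Insight: The contribution is a training-free, human-aligned, and transferable framework for 3D understanding that avoids task-specific overfitting and requires neither the computational cost of post-training nor heavy manual curation of spatial reasoning datasets. Viewed objectively, the modular plug-and-play design and the use of explicit spatial information from expert models are ideas worth borrowing.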
Abstract: Recent advances in Multi-modal Large Language Models (MLLMs) target 3D spatial intelligence, yet the progress has been largely driven by post-training on curated benchmarks, leaving the inference-time approach relatively underexplored. In this paper, we take a training-free perspective and introduce ViSRA, a human-aligned Video-based Spatial Reasoning Agent, as a framework to probe the spatial reasoning mechanism of MLLMs. ViSRA elicits spatial reasoning in a modular and extensible manner by leveraging explicit spatial information from expert models, enabling a flexible plug-and-play paradigm. ViSRA offers two key advantages: (1) human-aligned and transferable 3D understanding rather than task-specific overfitting; and (2) no post-training computational cost and no heavy manual curation of spatial reasoning datasets. Experimental results demonstrate consistent improvement across a set of MLLMs on both existing benchmarks and unseen 3D spatial reasoning tasks, with ViSRA outperforming baselines by absolute margins of up to 15.6% and 28.9%, respectively.
[218] MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph cs.CV | cs.AIPDF
Manyu Li, Ruian He, Chenxi Ma, Weimin Tan, Bo Yan
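TL;DR: This paper proposes MicroWorld, a framework that builds a multimodal attributed property graph (MAPG) to strengthen the reasoning of multimodal large language models in the microscopy domain without domain-specific fine-tuning. It extracts biomedical entities and relations from large-scale scientific image-text corpora, assembles them into a knowledge graph, and at inference time injects structured knowledge into the prompt through graph-augmented retrieval, substantially improving performance on microscopy visual question answering.
Details
Motivation: MLLM performance in specialized domains such as microscopy is limited by the scarcity of domain-specific training data and the difficulty of encoding fine-grained expert knowledge; the paper addresses this limitation.
Result: On the MicroVQA benchmark, MicroWorld improves the reasoning performance of Qwen3-VL-8B-Instruct by 37.5% and outperforms GPT-5 by 13.0%, setting a new SOTA; on the MicroBench benchmark it yields a 6.0% gain.
Insight: The novelty lies in building the MAPG as an external knowledge base and dynamically injecting structured knowledge at inference time through retrieval-augmented generation (RAG), which avoids domain-specific fine-tuning and improves generalization and domain adaptability.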
Abstract: Multimodal large language models (MLLMs) show remarkable potential for scientific reasoning, yet their performance in specialized domains such as microscopy remains limited by the scarcity of domain-specific training data and the difficulty of encoding fine-grained expert knowledge into model parameters. To bridge the gap, we introduce MicroWorld, a framework that constructs a multimodal attributed property graph (MAPG) from large-scale scientific image–caption corpora and leverages it to augment MLLM reasoning at inference time without any domain-specific fine-tuning. MicroWorld extracts biomedical entities and relations via scispaCy or LLM-based triplet mining, aligns images and entities in a shared embedding space using Qwen3-VL-Embedding, and assembles a knowledge graph comprising approximately 111K nodes and 346K typed edges spanning eight relation categories. At inference time, a graph-augmented retrieval pipeline matches query entities to the MAPG and injects structured knowledge context into the MLLM prompt. On the MicroVQA benchmark, MicroWorld improves the reasoning performance of Qwen3-VL-8B-Instruct by 37.5%, outperforming GPT-5 by 13.0% to achieve a new state-of-the-art. Furthermore, it yields a 6.0% performance gain on the MicroBench benchmark. Extensive experiments demonstrate the enhanced generalization capability introduced by MicroWorld. A qualitative case study further reveals both the mechanisms through which structured knowledge improves reasoning and the failure modes that point to promising future directions. Code and data are available at https://github.com/ieellee/MicroWorld.
[219] Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition cs.CVPDF
Yu He, Ting Zhu, Yichun Liu, Lichen Ma, Xinyuan Shan
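TL;DR: This paper introduces Fashion130K, an e-commerce fashion dataset, and designs a Unified Multi-modal Condition (UMC) framework that integrates text and visual prompts to generate visually consistent outfits.
Details
Motivation: Existing outfit generation work still falls short in using reference images and text prompts to promote visual consistency. It lacks a comprehensive e-commerce dataset and fine-grained use of multi-modal conditions, so a new dataset and framework are needed to explore the potential of multi-modal prompts.
Result: Extensive experiments on real-world applications and benchmarks show that UMC is effective for visual consistency, with results more promising than those of existing SOTA methods.
Insight: The contributions are the large-scale e-commerce dataset Fashion130K and the UMC framework. In UMC, an embedding refiner and a Fusion Transformer align the multi-modal embeddings, and the attention in the generation model is redesigned to emphasize correlations between the prompts and the noise image, so that pivotal tokens are selected for consistent outfit generation.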
Abstract: Recent research on fashion outfit generation focuses on promoting the visual consistency of garments by leveraging key information from a reference image and a text prompt. However, the potential of outfit generation remains underexplored, as it requires a comprehensive e-commerce dataset and careful use of multi-modal conditions. In this paper, we propose a brand-new e-commerce dataset, named Fashion130K, covering various occasions, models, and garment types. For consistent garment generation, we design a framework with Unified Multi-modal Condition (UMC) to align and integrate the text and visual prompts into the generation model. Specifically, we explore an embedding refiner to extract the unified embeddings of multi-modal prompts, within which a Fusion Transformer is proposed to align the multi-modal embeddings by adjusting the modality gap between text and image. Based on the unified embeddings, the attention in the generation model is redesigned to emphasize the correlations between the prompts and the noise image, so that the noise image can select the pivotal tokens of the prompts for consistent outfit generation. Our dataset and proposed framework offer a general and nuanced exploration of multi-modal prompts for generation models. Extensive experiments on real-world applications and benchmarks demonstrate the effectiveness of UMC in visual consistency, achieving more promising results than SoTA methods.
[220] Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection cs.CVPDF
Yasiru Ranasinghe, Elim Schenck, Florence Yellin, Shuowen Hu, Christopher Funk
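TL;DR: This paper presents Thermal-Det, the first open-vocabulary object detector for thermal images supervised by a large language model. To address the semantic challenges posed by low texture and emissivity variations in thermal imagery, it builds a large-scale synthetic dataset by converting GroundingCap-1M into the thermal domain and filtering the captions, then jointly optimizes detection, captioning, and cross-modal distillation objectives. A frozen RGB teacher provides geometric and semantic pseudo-supervision, so open-vocabulary knowledge is transferred without manual annotation.
Details
Motivation: Existing open-vocabulary detectors target RGB images and generalize poorly to thermal imagery, where low texture and emissivity variations challenge RGB-based semantic understanding.
Result: Experiments on public benchmarks show 2-4% average precision (AP) gains over existing open-vocabulary detectors, establishing a strong foundation for scalable, language-driven thermal perception.
Insight: The contributions are: 1) the first large-scale, thermally aligned synthetic dataset for training; 2) a multi-task framework that jointly optimizes detection, captioning, and cross-modal distillation; 3) a frozen RGB teacher for knowledge distillation, which transfers open-vocabulary knowledge without manual annotation; 4) a Thermal-Text Alignment Head and a Modality-Fused Cross-Attention module for text calibration and dual-modality reasoning. Unlike prior domain-adaptation methods, the detector is fully fine-tuned to internalize thermal contrast patterns while preserving language alignment.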
Abstract: Existing open-vocabulary detectors focus on RGB images and fail to generalize to thermal imagery, where low texture and emissivity variations challenge RGB-based semantics. We present Thermal-Det, the first large language model (LLM) supervised open-vocabulary detector tailored for thermal images. To enable large-scale training, we develop a synthetic dataset by converting GroundingCap-1M into the thermal domain and filtering captions to remove RGB-specific terms, yielding over one million thermally aligned samples with bounding boxes, grounding texts, and detailed captions. Thermal-Det jointly optimizes detection, captioning, and cross-modal distillation objectives. A frozen RGB teacher provides geometric and semantic pseudo-supervision for paired but unlabeled RGB-thermal data, transferring open-vocabulary knowledge without manual annotation. The model further employs a Thermal-Text Alignment Head for text calibration and a Modality-Fused Cross-Attention module for dual-modality reasoning. Unlike prior domain-adaptation methods, the detector is fully fine-tuned to internalize thermal contrast patterns while preserving language alignment. Experiments on public benchmarks show consistent 2-4% AP gains over existing open-vocabulary detectors, establishing a strong foundation for scalable, language-driven thermal perception.
[221] Scaling Vision Models Does Not Consistently Improve Localisation-Based Explanation Quality cs.CV | cs.AIPDF
Mateusz Cedro, Marcin Chlebus
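TL;DR: This paper studies the relationship between the scale of computer vision models (depth and parameter count) and the quality of post-hoc explanations. Evaluating 11 models from the ResNet, DenseNet, and Vision Transformer families on three image datasets, it finds that scaling does not consistently improve localisation-based explanation quality, that smaller models often perform comparably or better, and that high predictive performance does not guarantee accurate explanation localisation.
Details
Motivation: To examine whether scaling models (depth, complexity, pretraining) improves the explanation quality of post-hoc explainable AI methods, particularly on localisation-based metrics, given how important explainability assessment is for safety-sensitive deployments.
Result: In most statistical comparisons, increasing architectural depth and parameter count did not improve explanation quality. Pretraining typically improved predictive performance but did not consistently raise localisation scores (Relevance Rank Accuracy and the proposed Dual-Polarity Precision). In some cases predictive performance was strong while localisation precision was near zero.
Insight: The novelty is the Dual-Polarity Precision metric, which quantifies the localisation accuracy of positive and negative attributions, together with a systematic demonstration of the inconsistency between model scale and explanation quality. Viewed objectively, the study underlines that explainability should be assessed independently during model selection rather than inferred from predictive performance as a proxy, which matters for safety-critical applications.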
Abstract: Artificial intelligence models are increasingly scaled to improve predictive accuracy, yet it remains unclear whether scale improves the quality of post-hoc explanations. We investigate this relationship by evaluating 11 computer vision models representing increasing levels of depth and complexity within the ResNet, DenseNet, and Vision Transformer families, trained from scratch or pretrained, across three image datasets with ground-truth segmentation masks. For each model, we generate explanations using five post-hoc explainable AI methods and quantify mask alignment using two localisation metrics: Relevance Rank Accuracy (Arras et al., 2022) and the proposed Dual-Polarity Precision, which measures positive attributions inside the class mask and negative attributions outside it. Across datasets and methods, increasing architectural depth and parameter count does not improve explanation quality in most statistical comparisons, and smaller models often match or exceed deeper variants. While pretraining typically improves predictive performance and increases the dependence of explanations on learned weights, it does not consistently increase localisation scores. We also observe scenarios in which models achieve strong predictive performance while localisation precision is near zero, suggesting that performance metrics alone may not indicate whether predictions are based on the annotated regions. These results indicate that larger models do not reliably provide higher-quality explanations, and that explainability should therefore be assessed explicitly during model selection for safety-sensitive deployments.
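The two localisation metrics can be sketched in a few lines. `relevance_rank_accuracy` follows the usual definition attributed to Arras et al. (the share of the K highest-relevance pixels that fall inside the mask, with K the mask size). The abstract describes Dual-Polarity Precision only in words, so `dual_polarity_precision_sketch` is one assumed reading (sign agreement between attributions and the mask) and should not be taken as the paper's exact formula.

```python
import numpy as np

def relevance_rank_accuracy(attribution, mask):
    """Fraction of the K most relevant pixels that lie inside the mask, K = mask size."""
    mask = mask.astype(bool)
    k = int(mask.sum())
    if k == 0:
        return 0.0
    top_k = np.argsort(attribution.ravel())[::-1][:k]  # indices of the K largest values
    return float(mask.ravel()[top_k].sum()) / k

def dual_polarity_precision_sketch(attribution, mask):
    """ASSUMED formulation, not the paper's definition.

    Counts positive attributions inside the mask and negative attributions
    outside it, divided by the number of non-zero attributions.
    """
    mask = mask.astype(bool)
    pos = attribution > 0
    neg = attribution < 0
    total = int(pos.sum() + neg.sum())
    if total == 0:
        return 0.0
    agree = int((pos & mask).sum() + (neg & ~mask).sum())
    return agree / total
```

Both functions take a 2D attribution map and a binary ground-truth mask of the same shape, so they can be applied to the output of any post-hoc attribution method.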
[222] MicroViTv2: Beyond the FLOPS for Edge Energy-Friendly Vision Transformers cs.CVPDF
Novendra Setyawan, Chi-Chia Sun, Mao-Hsiu Hsu, Wen-Kai Kuo, Jun-Wei Hsieh
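TL;DR: This paper presents MicroViTv2, a lightweight Vision Transformer optimized for edge devices. Building on the original MicroViT, it adopts reparameterized designs (RepEmbed and RepDW) and introduces a Single Depth-Wise Transposed Attention (SDTA) module, improving accuracy while keeping inference fast and energy-efficient. Experiments indicate that hardware-aware design and structural reparameterization are key to high accuracy and low energy consumption.
Details
Motivation: Vision Transformers (ViTs) perform well on visual tasks but are computationally expensive and hard to deploy on edge devices. The paper aims to design a lightweight, energy-friendly ViT variant that addresses computational efficiency and energy consumption in edge deployment.
Result: Experiments on ImageNet-1K and COCO show that MicroViTv2 improves accuracy by up to 0.5% over its predecessor and surpasses MobileViTv2, EdgeNeXt, and EfficientViT, while maintaining fast inference and high energy efficiency on Jetson AGX Orin.
Insight: The claimed novelty is the hardware-aware reparameterized design (RepEmbed and RepDW) and the Single Depth-Wise Transposed Attention (SDTA) module for capturing long-range dependencies. Viewed objectively, the core insight is that efficiency on edge devices should not be judged by FLOPs alone: hardware-friendly architecture design and structural reparameterization are essential for achieving high accuracy and low energy consumption in practice.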
Abstract: The Vision Transformer (ViT) achieves remarkable accuracy across visual tasks but remains computationally expensive for edge deployment. This paper presents MicroViTv2, a lightweight Vision Transformer optimized for real-device efficiency. Built upon the original MicroViT, the proposed model adopts a reparameterized design, specifically Reparameterized Patch Embedding (RepEmbed) and a Reparameterized Depth-Wise convolution mixer (RepDW) for faster inference, and introduces Single Depth-Wise Transposed Attention (SDTA) to capture long-range dependencies with minimal redundancy. Despite slightly higher FLOPs, MicroViTv2 improves accuracy by up to 0.5% over its predecessor and surpasses MobileViTv2, EdgeNeXt, and EfficientViT while maintaining fast inference and high energy efficiency on Jetson AGX Orin. Experiments on ImageNet-1K and COCO demonstrate that hardware-aware design and structural re-parameterization are key to achieving high accuracy and low energy consumption, validating the need to evaluate efficiency beyond FLOPs. Code is available at https://github.com/novendrastywn/MicroViT.
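The idea behind structural reparameterization can be illustrated with a small NumPy sketch: several parallel linear branches used during training are folded into a single kernel for inference, so the deployed model runs one convolution instead of three operations. The branch layout here (a 3x3 depthwise convolution, a per-channel 1x1 scale, and an identity shortcut, with batch normalization omitted) is an assumption chosen for illustration; the abstract does not specify the exact structure of RepDW.

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """Depthwise 3x3 cross-correlation, stride 1, zero padding 1.

    x:       (C, H, W) feature map.
    kernels: (C, 3, 3), one kernel per channel.
    """
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c, h, w), dtype=float)
    for i in range(3):
        for j in range(3):
            out += kernels[:, i, j][:, None, None] * padded[:, i:i + h, j:j + w]
    return out

def multi_branch_forward(x, k3, k1):
    """Training-time form (assumed layout): 3x3 branch + 1x1 scale branch + identity."""
    return depthwise_conv3x3(x, k3) + k1[:, None, None] * x + x

def merge_branches(k3, k1):
    """Fold the 1x1 scale and the identity into the centre tap of the 3x3 kernel."""
    merged = k3.astype(float)  # astype returns a copy, so k3 is left untouched
    merged[:, 1, 1] += k1 + 1.0
    return merged
```

Because every branch is linear, `depthwise_conv3x3(x, merge_branches(k3, k1))` reproduces `multi_branch_forward(x, k3, k1)` up to floating-point error, which is why the merge changes latency but not accuracy.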
[223] Improving Temporal Action Segmentation via Constraint-Aware Decoding cs.CVPDF
Yeo Keat Ee, Debaditya Roy, Chen Li, Hao Zhang, Basura Fernando
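TL;DR: This paper proposes a lightweight, constraint-based refinement framework that improves temporal action segmentation (TAS) predictions by integrating statistical structural priors such as transition confidence, action boundary sets, and per-class duration. Refinement is performed at inference time with a modified Viterbi decoding algorithm, requires no retraining or added model complexity, and applies to both fully and semi-supervised TAS models.
Details
Motivation: To address the challenges of action variability, ambiguous boundaries, and high annotation cost in temporal action segmentation, especially in new or low-resource domains, while overcoming the reliance of grammar-based methods on complex parsing with limited scalability.
Result: The method refines TAS predictions at inference time through constraint-aware decoding, correcting structural errors and improving segmentation accuracy while remaining efficient. The code is open-source.
Insight: The novelty lies in integrating statistical structural priors directly into the decoding process, giving a lightweight refinement framework that needs no retraining, can be applied flexibly to existing models, and improves segmentation performance.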
Abstract: Temporal action segmentation (TAS) divides untrimmed videos into labeled action segments. While fully supervised methods have advanced the field, challenges such as action variability, ambiguous boundaries, and high annotation costs remain, especially in new or low-resource domains. Grammar-based approaches improve segmentation with structural priors but rely on complex parsing, which limits scalability. In this work, we propose a lightweight, constraint-based refinement framework that enhances TAS predictions by integrating statistical structural priors, such as transition confidence, action boundary sets, and per-class duration, that can be directly extracted from annotated data. These constraints are integrated into a modified Viterbi decoding algorithm, allowing inference-time refinement without retraining or added model complexity. Our approach improves both fully and semi-supervised TAS models by correcting structural prediction errors while maintaining high efficiency. Code is available at https://github.com/LUNAProject22/CAD
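A minimal sketch of constraint-aware decoding, assuming the simplest possible prior: transitions that never occur in the annotated training sequences receive a large penalty, and standard Viterbi decoding then finds the best label path under that constraint. The function names and the hard-penalty scheme are illustrative assumptions; the paper's modified Viterbi also uses boundary sets and per-class durations, which are not modelled here.

```python
import numpy as np

def transition_scores(label_sequences, num_classes, penalty=-1e9):
    """Log-score matrix: 0 for self-transitions and transitions seen in the
    annotated sequences, `penalty` for everything else (assumed scheme)."""
    allowed = np.eye(num_classes, dtype=bool)
    for seq in label_sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            allowed[a, b] = True
    return np.where(allowed, 0.0, penalty)

def viterbi_decode(log_probs, log_trans):
    """Best label path given frame-wise log-probabilities (T, C) and
    transition log-scores (C, C)."""
    num_frames, num_classes = log_probs.shape
    score = log_probs[0].copy()
    back = np.zeros((num_frames, num_classes), dtype=int)
    for t in range(1, num_frames):
        # cand[i, j]: best score of a path in class i at t-1 that moves to j at t
        cand = score[:, None] + log_trans
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(num_classes)] + log_probs[t]
    # Trace the best path backwards from the final frame.
    path = [int(np.argmax(score))]
    for t in range(num_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

When a frame-wise classifier emits an isolated out-of-order label, the decoded path replaces it with a label consistent with the observed action order, without touching the underlying model.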
[224] MolSight: Molecular Property Prediction with Images cs.CV | cs.CLPDF
Aaditya Baranwal, Akshaj Gupta, Shruti Vyas, Yogesh S Rawat
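TL;DR: MolSight is the first systematic large-scale study of vision-based molecular property prediction. It takes 2D skeletal diagrams of molecules as input and processes them with a vision encoder. The study evaluates 10 vision architectures, 7 pre-training strategies, and 2 million molecule images across 10 downstream tasks, and it proposes a curriculum learning strategy based on chemical structural complexity, achieving competitive results on multiple benchmarks.
Details
Motivation: Modern molecular property prediction usually relies on molecular graphs, 3D conformers, or large language models, all of which carry substantial computational and data-engineering overhead. The paper explores the universally available but overlooked 2D skeletal diagram and tests whether chemical insight can be obtained from images alone.
Result: Across 10 benchmarks (covering physical-property regression, drug-discovery classification, and quantum-chemistry prediction), the best curriculum-trained configuration achieves the top result on 5 tasks and ranks in the top two on all 10, at 80 times lower compute than the nearest multi-modal competitor.
Insight: The main contributions are: the first systematic study of vision-based molecular property prediction; a curriculum learning strategy that stratifies the pre-training data by chemical structural complexity and improves model performance; and evidence that a single rendered bond-line image is sufficient for competitive prediction, which suggests a new route to computationally efficient molecular modelling.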
Abstract: Every molecule ever synthesised can be drawn as a 2D skeletal diagram, yet in modern property prediction this universally available representation has received less focus in favour of molecular graphs, 3D conformers, or billion-parameter language models, each imposing its own computational and data-engineering overhead. We present $\textbf{MolSight}$, the first systematic large-scale study of vision-based Molecular Property Prediction (MPP). Using 10 vision architectures, 7 pre-training strategies, and $2\,\mathrm{M}$ molecule images, we evaluate performance across 10 downstream tasks spanning physical-property regression, drug-discovery classification, and quantum-chemistry prediction. To account for the wide variation in structural complexity across pre-training molecules, we further propose a $\textbf{chemistry-informed curriculum}$: five structural complexity descriptors partition the corpus into five tiers of increasing chemical difficulty, consistently outperforming non-curriculum baselines. We show that a single rendered bond-line image, processed by a vision encoder, is sufficient for competitive molecular property prediction, i.e. $\textit{chemical insight from sight alone}$. The best curriculum-trained configuration achieves the top result on $\textbf{5 of 10}$ benchmarks and top two on $\textbf{all 10}$, at $\textbf{\textit{80$\times$ lower}}$ FLOPs than the nearest multi-modal competitor.
[225] V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning cs.CV | cs.CLPDF
Zhiwei Ning, Xuanang Gao, Jiaxi Cao, Gengming Zhang, Shengnan Ma
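TL;DR: This paper proposes V-ABS, an action-observer driven beam search framework for dynamic visual reasoning. It targets the imagination-action-observer (IAO) bias that arises when multimodal large language models neglect execution feedback during complex multi-step reasoning.
Details
Motivation: Multimodal large language models struggle with complex multi-step visual reasoning. Existing agentic approaches often ignore critical execution feedback, which creates a mismatch between prior imagination and observer feedback and undermines the stability and optimality of reasoning.
Result: Extensive experiments on eight diverse benchmarks show that V-ABS achieves state-of-the-art performance, with an average improvement of 19.7% over the Qwen3-VL-8B baseline and consistent gains on both open-source and proprietary models.
Insight: The contributions are: 1) an action-observer driven beam search framework that enables deliberate reasoning through thinker-actor-observer iterations; 2) an entropy-based adaptive weighting algorithm that mitigates the bias by dynamically balancing the confidence of policy priors against that of observational feedback; 3) a large-scale supervised fine-tuning dataset that guides the model to assign higher prior confidence to correct action paths.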
Abstract: Multimodal large language models (MLLMs) have achieved remarkable success in general perception, yet complex multi-step visual reasoning remains a persistent challenge. Although recent agentic approaches incorporate tool use, they often neglect critical execution feedback. Consequently, they suffer from the imagination-action-observer (IAO) bias, a misalignment between prior imagination and observer feedback that undermines reasoning stability and optimality. To bridge this gap, we introduce V-ABS, an action-observer driven beam search framework that enables deliberate reasoning through thinker-actor-observer iterations. We also propose an entropy-based adaptive weighting algorithm to mitigate the IAO bias by dynamically balancing the confidence scores between the policy priors and the observational feedback. Moreover, we construct a large-scale supervised fine-tuning (SFT) dataset comprising over 80k samples to guide the model to assign higher prior confidence to correct action paths. Extensive experiments across eight diverse benchmarks show that V-ABS achieves state-of-the-art performance, delivering an average improvement of 19.7% on the Qwen3-VL-8B baseline and consistent gains across both open-source and proprietary models.
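One way to picture entropy-based adaptive weighting is shown below. The abstract does not give the formula, so everything here is an assumption made for illustration: each source (policy prior, observer feedback) is treated as a distribution over candidate actions, its confidence is taken as one minus its normalized entropy, and the fused score is a confidence-weighted mixture used to pick the beam. The function names are hypothetical, and at least two candidate actions are assumed.

```python
import numpy as np

def normalized_entropy(p):
    """Shannon entropy of p divided by log(n), so the result lies in [0, 1]."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum() / np.log(len(p)))

def fuse_scores(prior, feedback, eps=1e-8):
    """ASSUMED scheme: weight each source by its confidence (1 - normalized entropy).

    Returns the fused distribution and the weight given to the prior.
    """
    prior = np.asarray(prior, dtype=float) / np.sum(prior)
    feedback = np.asarray(feedback, dtype=float) / np.sum(feedback)
    c_prior = 1.0 - normalized_entropy(prior)
    c_feedback = 1.0 - normalized_entropy(feedback)
    w = (c_prior + eps) / (c_prior + c_feedback + 2 * eps)
    return w * prior + (1.0 - w) * feedback, w

def select_beam(prior, feedback, beam_width):
    """Indices of the `beam_width` candidate actions with the highest fused score."""
    fused, _ = fuse_scores(prior, feedback)
    return [int(i) for i in np.argsort(fused)[::-1][:beam_width]]
```

Under this scheme an uninformative (near-uniform) prior is almost ignored in favour of sharp observer feedback, and vice versa, which is the balancing behaviour the abstract describes in words.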
[226] MTA-RL: Robust Urban Driving via Multi-modal Transformer-based 3D Affordances and Reinforcement Learning cs.CV | cs.AI | cs.ROPDF
Guangli Chen, Dianzhao Li, Wenjian Zhong, Bangquan Xie, Ostap Okhrin
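TL;DR: This paper proposes MTA-RL, the first framework to bridge the perception and control modules through multi-modal transformer-based 3D affordances and reinforcement learning, targeting robustness in urban autonomous driving. The method fuses RGB images and LiDAR point clouds to predict explicit, geometry-aware affordance representations, which serve as a compact observation space for the reinforcement learning policy and significantly improve sample efficiency and decision stability.
Details
Motivation: Existing end-to-end driving models lack interpretability, while modular pipelines suffer from error propagation across brittle interfaces. The paper bridges perception and control with a structured, interpretable intermediate representation to improve robustness and generalization in densely interactive environments.
Result: In the CARLA simulator (Town01-03), across traffic densities of 20-60 background vehicles, MTA-RL consistently outperforms state-of-the-art baselines. Trained only on Town03, it shows superior zero-shot generalization to unseen towns: up to a 9.0% increase in Route Completion, an 11.0% increase in Total Distance, and an 83.7% improvement in Distance Per Violation. Ablation studies confirm the critical role of multi-modal fusion and reward shaping.
Insight: The core innovation is an explicit 3D affordance representation, predicted by a multi-modal transformer, that acts as a structured interface between perception and reinforcement learning control. It improves interpretability, and its compact, semantic observation space greatly improves the sample efficiency and policy stability of reinforcement learning, offering a new paradigm for robust urban autonomous driving.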
Abstract: Robust urban autonomous driving requires reliable 3D scene understanding and stable decision-making under dense interactions. However, existing end-to-end models lack interpretability, while modular pipelines suffer from error propagation across brittle interfaces. This paper proposes MTA-RL, the first framework that bridges perception and control through Multi-modal Transformer-based 3D Affordances and Reinforcement Learning (RL). Unlike previous fusion models that directly regress actions, RGB images and LiDAR point clouds are fused using a transformer architecture to predict explicit, geometry-aware affordance representations. These structured representations serve as a compact observation space, enabling the RL policy to operate purely on predicted driving semantics, which significantly improves sample efficiency and stability. Extensive evaluations in CARLA Town01-03 across varying densities (20-60 background vehicles) show that MTA-RL consistently outperforms state-of-the-art baselines. Trained solely on Town03, our method demonstrates superior zero-shot generalization in unseen towns, achieving up to a 9.0% increase in Route Completion, an 11.0% increase in Total Distance, and an 83.7% improvement in Distance Per Violation. Furthermore, ablation studies confirm that our multi-modal fusion and reward shaping are critical, significantly outperforming image-only and unshaped variants, demonstrating the effectiveness of MTA-RL for robust urban autonomous driving.
[227] Developing a foundation model for high-resolution remote sensing data of the Netherlands cs.CV | cs.AIPDF
Paul Vermeeren, Heysem Kaya
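TL;DR: This paper develops a foundation model for 1.2 m high-resolution satellite imagery of the Netherlands. The model combines a convolutional neural network with a Vision Transformer to capture both low-frequency landscape features (such as large terrain structures) and high-frequency ones (such as fine textures). Using temporal data as input, it learns broader context across time and exploits temporal dependencies such as topographic features, land-cover changes, and seasonal dynamics, which reduces feature ambiguity, improves representation learning, and enables better generalization from few labeled samples.
Details
Motivation: To develop a foundation model specifically for high-resolution remote sensing data that learns rich, generalizable representations from limited data, and to use temporal information to improve the model's understanding of landscape features and its downstream performance.
Result: On a vegetation monitoring dataset of the Netherlands, incorporating temporal information yields clear improvements over a single time point. Despite the smaller model and pretraining data limited to the Netherlands, it achieves competitive results against state-of-the-art models on global benchmarks.
Insight: The novelty lies in combining a CNN and a ViT to capture multi-scale landscape features and in using temporal data as an additional constraint to improve representation learning and generalization. Viewed objectively, the work shows that domain-specific design and temporal information can reach performance competitive with large-scale SOTA models under limited data and parameters, which is useful for resource-constrained remote sensing applications.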
Abstract: We develop a foundation model using 1.2 m high-resolution satellite images of the Netherlands. By combining a Convolutional Neural Network and a Vision Transformer, the model captures both high- and low-frequency landscape features, such as fine textures, edges, and small objects as well as large terrain structures, elevation patterns, and land-cover distributions. Leveraging temporal data as input, the model learns from broader contextual information across time, allowing it to exploit temporal dependencies, such as topographic features, land-cover changes, and seasonal dynamics. These additional constraints reduce feature ambiguity, improve representation learning, and enable better generalization with fewer labeled samples. The foundation model is evaluated on multiple downstream tasks, ranging from use cases within the Netherlands to global benchmarking datasets. On the vegetation monitoring dataset of the Netherlands, the model shows clear performance improvements by incorporating temporal information instead of relying on a single time point. Despite using a smaller model and less pretraining data limited to the Netherlands, it achieves competitive results on global benchmarks when compared to state-of-the-art models. These results demonstrate that the model can learn rich, generalizable representations from limited data while using a fraction of the parameters of larger state-of-the-art remote sensing models. To maximize reproducibility and reuse, we made the scripts and the model accessible on GitHub.
[228] SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation cs.CVPDF
Longteng Guo, Xuanxu Lin, Dongze Hao, Tongtian Yue, Pengkang Huo
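TL;DR: This paper introduces SciVQR, a multidisciplinary multimodal benchmark for evaluating the scientific reasoning of multimodal large language models. It covers 54 subfields across mathematics, physics, chemistry, geography, astronomy, and biology, includes domain-specific visuals such as equations, charts, and diagrams, and requires models to combine visual comprehension with reasoning. Tasks range from basic factual recall to complex multi-step inference.
Details
Motivation: Existing MLLM benchmarks fail to capture the complexity and traceability that scientific reasoning requires, so a new benchmark is needed to rigorously evaluate how well models integrate multimodal inputs, domain knowledge, and multi-step inference.
Result: An evaluation of leading proprietary and open-source MLLMs reveals significant limitations on complex multimodal reasoning tasks, underscoring the need for better multi-step reasoning and better integration of interdisciplinary knowledge.
Insight: The novelty is a multidisciplinary, multimodal benchmark that evaluates not only final answers but also the reasoning process, giving insight into how models reach their conclusions. This provides a useful tool for advancing MLLMs toward genuine scientific intelligence.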
Abstract: Scientific reasoning is a key aspect of human intelligence, requiring the integration of multimodal inputs, domain expertise, and multi-step inference across various subjects. Existing benchmarks for multimodal large language models (MLLMs) often fail to capture the complexity and traceability of reasoning processes necessary for rigorous evaluation. To fill this gap, we introduce SciVQR, a multimodal benchmark covering 54 subfields in mathematics, physics, chemistry, geography, astronomy, and biology. SciVQR includes domain-specific visuals, such as equations, charts, and diagrams, and challenges models to combine visual comprehension with reasoning. The tasks range from basic factual recall to complex, multi-step inferences, with 46% including expert-authored solutions. SciVQR not only evaluates final answers but also examines the reasoning process, providing insights into how models reach their conclusions. Our evaluation of leading MLLMs, including both proprietary and open-source models, reveals significant limitations in handling complex multimodal reasoning tasks, underscoring the need for improved multi-step reasoning and better integration of interdisciplinary knowledge in advancing MLLMs toward true scientific intelligence. The dataset and evaluation code are publicly available at https://github.com/CASIA-IVA-Lab/SciVQR.
[229] 3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective, Transparent, and Low-Texture Objects cs.CVPDF
Zhicheng Liang, Haoyi Yu, Boyan Li, Dayou Zhang, Zijian Cao
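TL;DR: This paper introduces 3DReflecNet, a large-scale hybrid dataset exceeding 22 TB, designed to benchmark and advance 3D reconstruction methods for reflective, transparent, and low-texture objects. It contains over 120,000 synthetic instances and over 1,000 real-world objects, totalling more than 7 million multi-view frames, and spans diverse materials, complex lighting, and a wide range of geometric forms.
Details
Motivation: Existing 3D reconstruction methods struggle with reflective, transparent, or low-texture surfaces because such materials violate key assumptions of multi-view reconstruction pipelines, such as photometric consistency and the availability of geometric texture cues. Existing datasets focus mainly on diffuse, textured objects and do not reflect the complexity of real-world materials.
Result: Benchmarks on five core tasks (image matching, structure-from-motion, novel view synthesis, reflection removal, and relighting) show that state-of-the-art methods struggle to maintain accuracy in these settings, highlighting the need for more robust 3D vision models.
Insight: The main contribution is a large-scale hybrid (synthetic and real) 3D reconstruction dataset dedicated to challenging materials, together with a comprehensive set of benchmark tasks. Viewed objectively, by combining physically-based rendered data with real captures, and by including shapes generated from both real and LLM-synthesized images, the dataset is a valuable resource for evaluating and improving the robustness of 3D vision methods in complex real-world scenes.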
Abstract: Accurate 3D reconstruction of objects with reflective, transparent, or low-texture surfaces remains notoriously challenging. Such materials often violate key assumptions in multi-view reconstruction pipelines, such as photometric consistency and the availability of distinct geometric texture cues. Existing datasets primarily focus on diffuse, textured objects, and therefore provide limited insight into performance under real-world material complexities. We introduce 3DReflecNet, a large-scale hybrid dataset exceeding 22 TB that is specifically designed to benchmark and advance 3D vision methods for these challenging materials. 3DReflecNet combines two types of data: over 120,000 synthetic instances generated via physically-based rendering of more than 12,000 shapes, and over 1,000 real-world objects captured using consumer devices. Together, these data consist of more than 7 million multi-view frames. The dataset spans diverse materials, complex lighting conditions, and a wide range of geometric forms, including shapes generated from both real and LLM-synthesized 2D images using diffusion-based pipelines. To support robust evaluation, we design benchmarks for five core tasks: image matching, structure-from-motion, novel view synthesis, reflection removal, and relighting. Extensive experiments demonstrate that state-of-the-art methods struggle to maintain accuracy across these settings, highlighting the need for more resilient 3D vision models.
[230] VPD-100K: Towards Generalizable and Fine-grained Visual Privacy Protection cs.CV | cs.CYPDF
Xiaobin Hu, Enpu Zuo, Lanping Hu, Kaiwen Yang, Dianshu Liao
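TL;DR: This paper presents VPD-100K, a large-scale, fine-grained visual privacy dataset that addresses the small scale, coarse annotations, and narrow domain coverage of existing privacy detection datasets. It contains 100,000 images spanning four domains (Human Presence, On-Screen Personally Identifiable Information, Physical Identifiers, and Location Indicators), annotated with 33 fine-grained classes and over 190,000 object instances. The paper also designs an effective frequency-enhanced lightweight module that uses frequency-domain attention fusion and an adaptive spectral gating mechanism to capture subtle details of sensitive information. Experiments on diverse image and streaming video benchmarks validate both the dataset and the method.
Details
Motivation: Robust privacy detection models are severely limited by the lack of comprehensive datasets. Existing datasets suffer from limited scale, coarse annotation granularity, and narrow domain coverage, and they fail to capture the intricate details of sensitive information in real-world environments.
Result: Extensive experiments on diverse image and streaming video benchmarks consistently demonstrate the effectiveness of the VPD-100K dataset and the proposed frequency mechanism.
Insight: The novelty lies in building a large-scale, fine-grained, multi-domain visual privacy dataset and in designing a frequency-enhanced module (frequency-domain attention fusion and adaptive spectral gating) that goes beyond spatial pixel intensity to better capture subtle features of sensitive information. This is particularly valuable for real-time leakage scenarios such as live streaming.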
Abstract: Privacy protection has become a critical requirement in the era of ubiquitous visual data sharing, imposing higher demands on efficient and robust privacy detection algorithms. However, current robust detection models are severely hindered by the lack of comprehensive datasets. Existing privacy-oriented datasets often suffer from limited scale, coarse-grained annotations, and narrow domain coverage, failing to capture the intricate details of sensitive information in real-world environments. To bridge this gap, we present a large-scale, fine-grained Visual Privacy Dataset (VPD-100K), designed to facilitate generalized privacy detection. We establish a holistic taxonomy comprising four primary domains: Human Presence, On-Screen Personally Identifiable Information (PII), Physical Identifiers, and Location Indicators, containing 100,000 images annotated with 33 fine-grained classes and over 190,000 object instances. Statistical analysis reveals that our dataset features long-tailed distributions, small object scales, and high visual complexity. These characteristics make the dataset particularly valuable for demanding, unconstrained applications such as live streaming, where actors frequently face unintentional, real-time information leakage. Furthermore, we design an effective frequency-enhanced lightweight module consisting of frequency-domain attention fusion and an adaptive spectral gating mechanism that breaks the limitations of spatial pixel intensity to better capture the subtle details of sensitive information. Extensive experiments conducted on both diverse image and streaming video benchmarks consistently demonstrate the effectiveness of our VPD-100K dataset and the well-curated frequency mechanism. The code and dataset are available at https://vpd-100k.github.io/.
[231] AdaptSplat: Adapting Vision Foundation Models for Feed-Forward 3D Gaussian Splatting cs.CVPDF
Mingwei Xing, Xinliang Wang, Yifeng Shi
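TL;DR: This paper proposes AdaptSplat, a lightweight adapter design for feed-forward 3D Gaussian Splatting (3DGS). A Frequency-Preserving Adapter (FPA) with only 1.5M parameters extracts direction-aware high-frequency structural priors from the shallow features of a powerful vision foundation model backbone and integrates them into the generic pipeline through high-frequency positional encodings and adaptive residual modulation, addressing the weaknesses of existing methods in cross-domain generalization and high-frequency geometric fidelity.
Details
Motivation: Existing feed-forward 3DGS methods built on the generic pipeline (image feature extraction → multi-view interaction → feature decoding) are constrained by the scale bottleneck of 3D training data and the low-pass filtering effect of deep networks, so they fall short in cross-domain generalization and high-frequency geometric fidelity. The paper addresses these problems with a simple, efficient adapter design.
Result: On multiple standard benchmarks, AdaptSplat achieves state-of-the-art (SOTA) feed-forward reconstruction performance and shows stable cross-domain generalization.
Insight: The novelty is a minimal, lightweight Frequency-Preserving Adapter (FPA) that requires no complex component engineering. By extracting and fusing high-frequency priors from the shallow features of a vision foundation model, it compensates for the high-frequency attenuation caused by over-smoothing in deep features and improves the fitting accuracy of Gaussian primitives on complex surfaces and sharp boundaries. This offers an efficient, general way to use foundation models for 3D reconstruction.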
Abstract: This work explores a simple yet powerful lightweight adapter design for feed-forward 3D Gaussian Splatting (3DGS). Existing methods typically apply complex, architecture-specific designs on top of the generic pipeline of image feature extraction $\rightarrow$ multi-view interaction $\rightarrow$ feature decoding. However, constrained by the scale bottleneck of 3D training data and the low-pass filtering effect of deep networks, these methods still fall short in cross-domain generalization and high-frequency geometric fidelity. To address these problems, we propose AdaptSplat, which demonstrates that without complex component engineering, introducing a single adapter of only 1.5M parameters into the generic architecture is sufficient to achieve superior performance. Specifically, we design a lightweight Frequency-Preserving Adapter (FPA) that extracts direction-aware high-frequency structural priors from the shallow features of a powerful vision foundation model backbone, and seamlessly integrates them into the generic pipeline via high-frequency positional encodings and adaptive residual modulation. This effectively compensates for the high-frequency attenuation caused by over-smoothing in deep features, improving the fitting accuracy of Gaussian primitives on complex surfaces and sharp boundaries. Extensive experiments demonstrate that AdaptSplat achieves state-of-the-art feed-forward reconstruction performance on multiple standard benchmarks, with stable generalization across domains. Code available at: https://github.com/xmw666/AdaptSplat.
[232] Efficient Hybrid CNN-GNN Architecture for Monocular Depth Estimation cs.CVPDF
Ishan Narayan
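TL;DR: This paper proposes GraphDepth, an efficient hybrid CNN-GNN architecture for monocular depth estimation. It embeds efficient GraphSAGE layers at multiple scales of a ResNet-101 U-Net backbone to explicitly model long-range spatial relationships beyond the receptive field of local convolutions. Unlike transformer-based hybrids, GraphDepth scales linearly with spatial resolution; it reaches accuracy comparable to SOTA transformer models on several benchmarks at much lower computational cost and shows strong zero-shot cross-domain generalization.
Details
Motivation: To address the difficulty that local convolutions have in modelling long-range spatial relationships in monocular depth estimation, while avoiding the quadratic computational cost of transformer-style methods.
Result: Experiments were run on NYU Depth V2, WHU Aerial, ETH3D, and Mid-Air. On indoor scenes, accuracy is within 4.6% of state-of-the-art transformers at substantially lower computational cost (25 FPS vs 9 FPS, 3.8 GB vs 8.8 GB VRAM). GraphDepth achieves the best reported result on WHU Aerial (RMSE 8.24 m) and shows superior zero-shot cross-domain transfer to the Mid-Air synthetic aerial dataset.
Insight: The main contributions are: 1) GraphSAGE layers integrated at multi-scale bottleneck and decoder stages to propagate global context; 2) a scalable, batch-parallelized graph construction method; 3) channel-attention gated skip connections; 4) a heteroscedastic uncertainty estimation head for confidence-aware loss weighting. The core idea is to obtain a global receptive field at linear complexity through GNN message passing, providing an efficient and generalizable framework for explicit relational reasoning in depth estimation.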
Abstract: We present GraphDepth, a monocular depth estimation architecture that synergistically integrates Graph Neural Networks (GNNs) within a convolutional encoder-decoder framework. Our approach embeds efficient GraphSAGE layers at multiple scales of a ResNet-101 U-Net backbone, enabling explicit modeling of long-range spatial relationships that lie beyond the receptive field of local convolutions. Key technical contributions include: (1) batch-parallelized graph construction with configurable k-NN and grid-based adjacency for scalable training; (2) multi-scale GraphSAGE integration at bottleneck and decoder stages (1/32, 1/16, 1/8 resolution) to propagate global context throughout the feature hierarchy; (3) channel-attention gated skip connections that adaptively weight encoder features before fusion; and (4) heteroscedastic uncertainty estimation via a dedicated aleatoric uncertainty head, enabling confidence-aware loss weighting during optimization. Unlike transformer-based hybrids, which suffer from quadratic complexity in sequence length, GraphDepth scales linearly with spatial resolution while achieving comparable global receptive fields through iterative message passing. Experiments on NYU Depth V2, WHU Aerial, ETH3D, and Mid-Air benchmarks demonstrate competitive accuracy within 4.6% of state-of-the-art transformers on indoor scenes with substantially lower computational cost (25 FPS vs 9 FPS, 3.8 GB vs 8.8 GB VRAM). GraphDepth achieves the best reported result on WHU Aerial (RMSE 8.24 m) and exhibits superior zero-shot cross-domain transfer to the Mid-Air synthetic aerial dataset, validating the generalization power of explicit relational reasoning for depth estimation.
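The message-passing step can be sketched with plain NumPy: build a k-nearest-neighbour graph over node features, then apply one GraphSAGE layer with the standard mean aggregator (a self transform plus a transform of the averaged neighbour features, followed by ReLU). This is a generic illustration rather than GraphDepth's implementation; the brute-force `knn_graph` below costs O(N^2) and is an assumption made for brevity, whereas the aggregation itself is O(N·k), i.e. linear in the number of nodes for fixed k.

```python
import numpy as np

def knn_graph(features, k):
    """Indices of the k nearest neighbours of each node, shape (N, k).

    Brute-force squared Euclidean distances; self-matches are excluded.
    """
    sq = np.sum(features ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * features @ features.T
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]

def graphsage_mean_layer(h, neighbors, w_self, w_neigh):
    """One GraphSAGE layer with mean aggregation and ReLU.

    h:         (N, D_in) node features.
    neighbors: (N, k) neighbour indices.
    w_self, w_neigh: (D_in, D_out) weight matrices.
    """
    agg = h[neighbors].mean(axis=1)  # (N, k, D_in) -> (N, D_in)
    return np.maximum(h @ w_self + agg @ w_neigh, 0.0)
```

In a depth network the nodes would be feature-map locations at a given scale (for example 1/32 resolution), and stacking a few such layers lets information travel between distant locations without dense all-pairs attention.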
[233] Increasing the Efficiency of DETR for Maritime High-Resolution Images cs.CV | cs.ROPDF
Tinsae Yehuala, Hao Cheng, Ville Lehtola
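TL;DR: This paper proposes an efficient maritime object detection method built on a Vision Mamba (ViM) backbone, targeting the compute and memory challenges of real-time detection on high-resolution images. By serializing images into token sequences and using State Space Models (SSMs) to capture long-range dependencies, together with a tailored Feature Pyramid Network and token pruning, it substantially improves computational efficiency while maintaining accuracy.
Details
Motivation: Safe navigation of unmanned surface vessels (USVs) requires real-time object detection in high-resolution maritime images, which is hampered by small objects, large scale variations, edge-computing limits, and high memory demands. Existing remedies such as downsampling or image splitting tend to sacrifice accuracy or add processing overhead.
Result: On maritime object detection, the method achieves a better balance between performance and computational efficiency than state-of-the-art methods such as RT-DETR with a ResNet50 backbone.
Insight: The novelty lies in bringing a ViM backbone to object detection, exploiting the SSM's linear scaling with sequence length to process high-resolution images efficiently. The tailored Feature Pyramid Network, which combines successive downsampling, SSM layers, and token pruning, cuts unnecessary computation on background regions and offers a memory-efficient approach to high-resolution vision tasks.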
Abstract: Maritime object detection is critical for the safe navigation of unmanned surface vessels (USVs), requiring accurate recognition of obstacles from small buoys to large vessels. Real-time detection is challenging due to long distances, small object sizes, large-scale variations, edge computing limitations, and the high memory demands of high-resolution imagery. Existing solutions, such as downsampling or image splitting, often reduce accuracy or require additional processing, while memory-efficient models typically handle only limited resolutions. To overcome these limitations, we leverage Vision Mamba (ViM) backbones, which build on State Space Models (SSMs) to capture long-range dependencies while scaling linearly with sequence length. Images are tokenized into sequences for efficient high-resolution processing. For further computational efficiency, we design a tailored Feature Pyramid Network with successive downsampling and SSM layers, as well as token pruning to reduce unnecessary computation on background regions. Compared to state-of-the-art methods like RT-DETR with ResNet50 backbone, our approach achieves a better balance between performance and computational efficiency in maritime object detection.
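The claim that SSMs scale linearly with sequence length can be seen from a toy discrete state-space recurrence. The sketch below is a generic scalar SSM, not the ViM block used in the paper; the function name `ssm_scan` and the parameters `a`, `b`, `c` are assumptions for illustration.

```python
def ssm_scan(inputs, a, b, c):
    """Run h_t = a*h_{t-1} + b*x_t and y_t = c*h_t over a token sequence.

    Each token is visited once, so the cost grows linearly with sequence length,
    in contrast to the pairwise token comparisons of self-attention.
    """
    h = 0.0
    outputs = []
    for x in inputs:
        h = a * h + b * x      # state update carries information from earlier tokens
        outputs.append(c * h)  # readout for the current token
    return outputs

# An impulse at the first token decays geometrically through the state.
print(ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0))  # [2.0, 1.0, 0.5]
```

Because the state summarizes everything seen so far, a later token can still be influenced by a distant earlier one without any explicit pairwise interaction.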
[234] PolarVSR: A Unified Framework and Benchmark for Continuous Space-Time Polarization Video Reconstruction cs.CVPDF
Chenggong Li, Yidong Luo, Junchao Zhang, Boxin Shi, Degui Yang
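TL;DR: This paper proposes PolarVSR, a unified framework and benchmark for continuous space-time polarization video reconstruction. The framework jointly models polarization directions in space and time, uses a polarization-aware implicit neural representation for continuous high-fidelity upsampling, and introduces a flow-guided polarization variation loss to supervise polarization dynamics. The paper also establishes the first large-scale color DoFP polarization video benchmark dataset.
Details
Motivation: In mainstream Division-of-Focal-Plane (DoFP) color polarization imaging, recovering polarization parameters from the captured mosaic array is a challenging inverse problem. Existing DoFP cameras also face hardware bottlenecks that make high-frame-rate acquisition difficult, which limits polarimetric imaging in dynamic video tasks and motivates joint spatial and temporal enhancement.
Result: Extensive experiments on the newly established large-scale color DoFP polarization video benchmark demonstrate the effectiveness of the method.
Insight: The contributions are the first space-time polarization video reconstruction architecture, which achieves continuous upsampling through a polarization-aware implicit neural representation, and a flow-guided polarization variation loss that supervises temporal dynamics. The paper also creates the first large-scale benchmark dataset in this area, supporting further research in this direction.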
Abstract: Polarimetric imaging captures surface polarization characteristics, such as the Degree of Linear Polarization (DoLP) and the Angle of Polarization (AoP). In mainstream Division-of-Focal-Plane (DoFP) color polarization imaging, recovering polarization parameters from captured mosaic arrays remains a challenging inverse problem. Existing DoFP cameras also face hardware bottlenecks and often cannot support high-frame-rate acquisition, limiting polarimetric imaging in dynamic video tasks. These limitations motivate joint spatial and temporal enhancement. To this end, we propose the first space-time polarization video reconstruction architecture. The method jointly models polarization directions in space and time and uses a polarization-aware implicit neural representation for continuous, high-fidelity upsampling. By analyzing temporal variations in polarization parameters, we further introduce a flow-guided polarization variation loss to supervise polarization dynamics. We also establish the first large-scale color DoFP polarization video benchmark to support this research direction. Extensive experiments on this benchmark demonstrate the effectiveness of the method.
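For readers unfamiliar with the polarization parameters named in the abstract, the sketch below shows the standard way DoLP and AoP are computed from the four polarizer orientations (0°, 45°, 90°, 135°) that a DoFP sensor samples. It is a per-pixel illustration that assumes the four intensities are already demosaicked; the function names are hypothetical and this is not PolarVSR's reconstruction network.

```python
import math

def stokes_from_intensities(i0, i45, i90, i135):
    """Linear Stokes parameters from intensities behind 0/45/90/135 degree polarizers."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                       # horizontal vs vertical component
    s2 = i45 - i135                     # diagonal components
    return s0, s1, s2

def dolp_aop(i0, i45, i90, i135):
    """Return (DoLP, AoP in radians) for one pixel; (0, 0) if the pixel is dark."""
    s0, s1, s2 = stokes_from_intensities(i0, i45, i90, i135)
    if s0 <= 0.0:
        return 0.0, 0.0
    dolp = math.hypot(s1, s2) / s0
    aop = 0.5 * math.atan2(s2, s1)
    return dolp, aop

# Fully polarized light along 0 degrees, then along 45 degrees.
print(dolp_aop(1.0, 0.5, 0.0, 0.5))  # (1.0, 0.0)
print(dolp_aop(0.5, 1.0, 0.5, 0.0))  # DoLP 1.0, AoP pi/4
```

Because both quantities are nonlinear in the raw intensities, small interpolation errors in the mosaic can produce large errors in DoLP and AoP, which is one reason the recovery is treated as a challenging inverse problem.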
[235] PaMoSplat: Part-Aware Motion-Guided Gaussian Splatting for Dynamic Scene Reconstruction cs.CV | cs.GR | cs.ROPDF
Yinan Deng, Jianyu Dou, Jiahui Wang, Jingyu Zhao, Yi Yang
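TL;DR: This paper proposes PaMoSplat, a new Gaussian splatting framework for dynamic scene reconstruction. It combines part awareness with motion priors: multi-view segmentation masks are used to build consistent 3D Gaussian parts, and optical flow cues are used to estimate each part's rigid motion with a differential evolution algorithm, improving both rendering and tracking.
Details
Motivation: Existing 3DGS-based dynamic scene modeling methods struggle to deliver high-fidelity rendering and accurate tracking in scenes with substantial, complex motion.
Result: In comprehensive evaluations across diverse scenes, including real-world environments, PaMoSplat outperforms existing methods in rendering quality, tracking accuracy, and convergence speed.
Insight: The novelty lies in treating parts as the basic units of scene deformation and using optical-flow motion cues to guide part motion. An adaptive iteration count mechanism, internal learnable rigidity, and a flow-supervised rendering loss accelerate and improve training, and the part-level representation supports downstream applications such as 4D scene editing.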
Abstract: Dynamic scene reconstruction represents a fundamental yet demanding challenge in computer vision and robotics. While recent progress in 3DGS-based methods has advanced dynamic scene modeling, obtaining high-fidelity rendering and accurate tracking in scenarios with substantial, intricate motions remains significantly challenging. To address these challenges, we propose PaMoSplat, a novel dynamic Gaussian splatting framework incorporating part awareness and motion priors. Our approach is grounded in two key observations: 1) Parts serve as primitives for scene deformation, and 2) Motion cues from optical flow can effectively guide part motion. Specifically, PaMoSplat initializes by lifting multi-view segmentation masks into 3D space via graph clustering, establishing coherent Gaussian parts. For subsequent timestamps, we leverage a differential evolutionary algorithm to estimate the rigid motion of these parts using multi-view optical flow cues, providing a robust warm-start for further optimization. Additionally, PaMoSplat introduces an adaptive iteration count mechanism, internal learnable rigidity, and flow-supervised rendering loss to accelerate and optimize the training process. Comprehensive evaluations across diverse scenes, including real-world environments, demonstrate that PaMoSplat delivers superior rendering quality, improved tracking precision, and faster convergence compared to existing methods. Furthermore, it enables multiple part-level downstream applications, such as 4D scene editing.
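The idea of estimating a part's rigid motion from flow cues with differential evolution can be sketched on a 2D toy problem. Everything here is an assumption for illustration: the paper works with 3D Gaussian parts and multi-view optical flow, while this sketch fits a planar rotation and translation to point correspondences using the textbook DE/rand/1/bin variant, with hypothetical function names and hyperparameters.

```python
import math
import random

def apply_rigid(params, pts):
    """Rotate points by theta and translate by (tx, ty)."""
    theta, tx, ty = params
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in pts]

def flow_cost(params, pts, targets):
    """Sum of squared distances between moved points and flow-predicted targets."""
    moved = apply_rigid(params, pts)
    return sum((mx - qx) ** 2 + (my - qy) ** 2
               for (mx, my), (qx, qy) in zip(moved, targets))

def differential_evolution(cost, bounds, pop_size=30, generations=200,
                           f=0.7, cr=0.9, seed=0):
    """Minimize cost with DE/rand/1/bin; returns (best_params, best_cost)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [cost(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one gene comes from the mutant
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < cr:
                    lo, hi = bounds[d]
                    v = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    trial.append(min(max(v, lo), hi))  # clip to the search box
                else:
                    trial.append(pop[i][d])
            s = cost(trial)
            if s <= scores[i]:  # greedy selection keeps the better candidate
                pop[i], scores[i] = trial, s
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]

# Toy part (assumed data): five points moved by a known rigid transform.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
true_motion = (0.3, 0.5, -0.2)
targets = apply_rigid(true_motion, pts)
bounds = [(-math.pi, math.pi), (-2.0, 2.0), (-2.0, 2.0)]
estimate, residual = differential_evolution(lambda p: flow_cost(p, pts, targets), bounds)
print(estimate, residual)
```

A gradient-free search of this kind does not need a good initial guess, which fits the abstract's description of the estimate as a warm start for further optimization.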
[236] EvoStreaming: Your Offline Video Model Is a Natively Streaming Assistant cs.CV | cs.AIPDF
Zichen Wen, Boxue Yang, Junlong Ke, Jiajie Huang, Chenfei Liao
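TL;DR: This paper proposes EvoStreaming, a framework for efficiently adapting offline video-language models (VideoLLMs) into streaming video understanding assistants. The base model itself acts as data generator, relevance annotator, and roll-out policy, synthesizing streaming interaction trajectories without external supervision and thereby learning a policy for deciding when to respond.
Details
Motivation: Existing VideoLLMs are trained mainly for offline inference and lack an interaction policy for deciding in real time when to respond in streaming video. Existing streaming benchmarks externalize this timing decision to the evaluator and therefore do not effectively assess the model's own streaming ability.
Result: On the proposed RealStreamEval, a frame-level multi-turn evaluation protocol, EvoStreaming uses only 1,000 self-generated samples (139× fewer than the leading streaming instruction-tuning approach) and no architectural changes, yet improves the overall score by up to 10.8 points across five open VideoLLM backbones (Qwen2/2.5/3-VL, InternVL-3.5, MiniCPM-V4.5) while largely preserving offline video performance.
Insight: The novelty is a data-efficient, self-evolving streaming adaptation framework in which the model generates its own training data to learn a streaming interaction policy, offering a practical path for turning existing offline VideoLLMs into streaming assistants. Viewed objectively, the core insight is to use the model's own capabilities for unsupervised interaction-policy learning, avoiding the need for large-scale human annotation or instruction-tuning data.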
Abstract: Streaming video understanding demands more than watching longer videos: assistants must decide when to speak in real time, balancing responsiveness against verbosity. Yet most video-language models (VideoLLMs) are trained for offline inference, and existing streaming benchmarks externalize this timing decision to the evaluator. We address this gap with RealStreamEval, a frame-level multi-turn evaluation protocol that exposes models to sequential observations and penalizes unnecessary responses. Under this protocol, we observed that strong offline VideoLLMs retain useful visual understanding but lack an interaction policy for deciding when to respond. Motivated by this observation, we propose EvoStreaming, a self-evolved streaming adaptation framework in which the base model itself acts as data generator, relevance annotator, and roll-out policy to synthesize streaming trajectories without external supervision. With only $1{,}000$ self-generated samples ($139\times$ less than the leading streaming instruction-tuning approach) and no architectural changes, EvoStreaming consistently improves the overall RealStreamEval score by up to $10.8$ points across five open VideoLLM backbones (Qwen2/2.5/3-VL, InternVL-3.5, MiniCPM-V4.5) while largely preserving offline video performance. These results suggest that data-efficient interaction tuning is a practical path for adapting existing VideoLLMs to streaming assistants.
[237] BGG: Bridging the Geometric Gap between Cross-View images by Vision Foundation Model Adaptation for Geo-Localization cs.CVPDF
Wei Wang, Dou Quan, Ning Huyan, Shuang Wang, Yi Li
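TL;DR: This paper proposes BGG, a parameter-efficient adaptation framework that uses a vision foundation model (e.g., DINOv3) to bridge the geometric gap between cross-view images, such as drone and satellite views, and improve cross-view geo-localization. The framework comprises a Multi-granularity Feature Enhancement Adapter and a Frequency-Aware Structural Aggregation module; by improving the scale adaptability and viewpoint robustness of features and incorporating local structural features, it enables more accurate image retrieval and localization.
Details
Motivation: Geometric differences between cross-view images (e.g., drone vs. satellite views) make cross-view geo-localization considerably harder. Existing methods under-exploit the general representations and generalization ability of vision foundation models, so a parameter-efficient way to capture robust and consistent features is needed.
Result: Extensive experiments on the University-1652 and SUES-200 datasets show that BGG has significant advantages over other methods and achieves state-of-the-art localization performance at low training cost.
Insight: The contributions are a Multi-granularity Feature Enhancement Adapter (MFEA) that uses multi-level dilated convolutions to improve scale adaptability and viewpoint robustness, and a FASA module that enhances local structural features through frequency-domain modulation and adaptive aggregation. The latter compensates for the [CLS] token's lack of spatial detail, efficiently bridging the geometric gap and improving localization accuracy.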
Abstract: Geometric differences between cross-view images, such as drone and satellite views, significantly increase the challenge of Cross-View Geo-Localization (CVGL), which aims to acquire the geolocation of images by image retrieval. To further enhance the CVGL performance, this paper proposes a parameter-efficient adaptation framework for bridging the geometric gap across images based on the vision foundation model (VFM) (e.g., DINOv3), termed BGG. BGG not only effectively leverages the general visual representations of VFM and captures the robust and consistent features from cross-view images, but also utilizes the generalization capabilities of the VFM, significantly improving the CVGL performance. It mainly contains a Multi-granularity Feature Enhancement Adapter (MFEA) and a Frequency-Aware Structural Aggregation (FASA) module. Specifically, MFEA enhances the scale adaptability and viewpoint robustness of features by multi-level dilated convolutions, effectively bridging the cross-view geometric gap with small training costs. Additionally, considering the [CLS] token lacks spatial details for precise image retrieval and localization, the FASA module modulates patch tokens in the frequency domain and performs adaptive aggregation for local structural feature enhancement. Finally, BGG fuses the enhanced local features with the [CLS] token for more accurate CVGL. Extensive experiments on University-1652 and SUES-200 datasets demonstrate that BGG has significant advantages over other methods and achieves state-of-the-art localization performance with low training costs.
[238] BabelDOC: Better Layout-Preserving PDF Translation via Intermediate Representation cs.CV | cs.CLPDF
Qi Yang, Xiangyao Ma, Xiao Wang, Hao Wang, Rui Wang
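TL;DR: BabelDOC is an Intermediate Representation (IR)-based framework for layout-preserving PDF translation. It decouples visual layout metadata from semantic content, supports document-level translation operations such as terminology extraction, cross-page context handling, glossary-constrained generation, and formula placeholdering, and then re-anchors the translated content to the original layout through an adaptive typesetting engine.
Details
Motivation: Existing document translation pipelines face a conflict between linguistic processing and layout preservation: text-oriented Computer-Assisted Translation (CAT) systems usually discard structural metadata, while document parsers focus on extraction and do not support faithful re-rendering after translation.
Result: On a curated 200-page benchmark, with both human evaluation and multimodal LLM-as-a-judge evaluation, BabelDOC outperforms representative baselines in layout fidelity, visual aesthetics, and terminology consistency while maintaining competitive translation precision.
Insight: The novelty lies in decoupling layout from content through an intermediate representation, which enables document-level translation operations and adaptive re-typesetting and so better preserves the original document's visual structure. Viewed objectively, the method lifts translation from plain-text processing to the structured-document level, combining the strengths of traditional CAT systems and document parsers.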
Abstract: As global cross-lingual communication intensifies, language barriers in visually rich documents such as PDFs remain a practical bottleneck. Existing document translation pipelines face a tension between linguistic processing and layout preservation: text-oriented Computer-Assisted Translation (CAT) systems often discard structural metadata, while document parsers focus on extraction and do not support faithful re-rendering after translation. We introduce BabelDOC, an Intermediate Representation (IR)-based framework for layout-preserving PDF translation. BabelDOC decouples visual layout metadata from semantic content, enabling document-level translation operations such as terminology extraction, cross-page context handling, glossary-constrained generation, and formula placeholdering. The translated content is then re-anchored to the original layout through an adaptive typesetting engine. Experiments on a curated 200-page benchmark, together with human evaluation and multimodal LLM-as-a-judge evaluation, show that BabelDOC improves layout fidelity, visual aesthetics, and terminology consistency over representative baselines, while maintaining competitive translation precision. The open-source toolkit and its interactive downstream applications are publicly available and have attracted over 8.4K GitHub stars and 17 contributors at the time of writing. A demonstration video is also available.
[239] Halo Separation-guided Underwater Multi-scale Image Restoration cs.CVPDF
Jiaxin Yang, Honglin Liu, Yongli Wang, Shuyi Cao, Chengcheng Jiang
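TL;DR: This paper proposes a single halo image correction method based on an iterative structure, addressing the halos that appear in the foreground of images captured by Autonomous Underwater Vehicles (AUVs) under artificial light. The method has two sub-networks: a halo layer separation sub-network that separates the halo by gradient minimization, and a multi-scale recovery sub-network that recovers the image information masked by the halo.
Details
Motivation: Existing underwater image enhancement methods do not adequately account for halos caused by artificial light sources and are not robust in such scenes, which seriously harms image quality and downstream vision tasks.
Result: The method is trained on the UIEB and EUVP synthetic datasets and tested on a large number of halo images captured under real artificial light. A radial gradient constraint is introduced to eliminate halos and improve underwater image restoration.
Insight: The novelty is a network designed explicitly for underwater artificial-light halos: an iterative framework of halo separation and multi-scale recovery, combined with a radial gradient constraint, that targets this specific degradation rather than performing generic enhancement.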
Abstract: Underwater images captured by Autonomous Underwater Vehicles (AUVs) are inevitably affected by artificial light sources, which often produce halos in the foreground and seriously degrade image quality. Existing underwater image enhancement methods do not fully account for this problem and are not robust on images taken under artificial light. Because underwater image enhancement is already a challenging task, the added influence of artificial light sources severely degrades image quality and affects subsequent vision tasks. To address this problem, this paper designs a single halo image correction method based on an iterative structure. The network consists of two sub-networks: a halo layer separation sub-network, which separates the halo by gradient minimization, and a multi-scale recovery sub-network, which recovers the image information masked by the halo. The UIEB and EUVP synthetic datasets are used for training so that the network can learn the characteristics of underwater halo images. A large number of halo images taken in an underwater environment with real artificial light are then collected for testing. In addition, the brightness distribution of underwater halo images is analyzed, and a radial gradient constraint is introduced to eliminate halos and improve underwater image restoration.
[240] SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation cs.CVPDF
Niyati Rawal, Sushant Ravva, Shah Alam Abir, Saksham Jain, Aman Chadha
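TL;DR: SleepWalk is a three-tier benchmark for evaluating instruction-guided vision-language navigation, focused on testing whether models can turn language instructions into spatially consistent, executable trajectories in single-scene 3D environments. The benchmark generates navigable 3D worlds from textual scene descriptions and requires models, given visual observations and a natural-language instruction, to predict a trajectory that respects scene geometry, avoids collisions, and terminates at an action-compatible location.
Details
Motivation: Vision-language models have advanced rapidly in multimodal perception and language understanding, but it remains unclear whether they can reliably translate language instructions into spatially coherent, executable actions in 3D digital environments. A benchmark focused on localized, interaction-centric embodied reasoning is needed to evaluate this ability systematically.
Result: Three frontier vision-language models were evaluated on 2,472 curated 3D environments with nine instructions per scene. The results show systematic failures in spatial reasoning, especially under occlusion, interaction constraints, and multi-step instructions, and performance drops markedly as the difficulty tier increases.
Insight: The novelty is a benchmark with three difficulty tiers of increasing spatial and temporal complexity, which supports fine-grained analysis of grounding as compositional complexity grows. With a standardized pointwise judge-based evaluation protocol, it exposes the embodied-planning shortcomings of current models in a controlled yet scalable setting, providing a key benchmark for advancing grounded multimodal reasoning and action-capable agents in 3D environments.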
Abstract: Vision-Language Models (VLMs) have advanced rapidly in multimodal perception and language understanding, yet it remains unclear whether they can reliably ground language into spatially coherent, plausibly executable actions in 3D digital environments. We introduce SleepWalk, a benchmark for evaluating instruction-grounded trajectory prediction in single-scene 3D worlds generated from textual scene descriptions and filtered for navigability. Unlike prior navigation benchmarks centered on long-range exploration across rooms, SleepWalk targets localized, interaction-centric embodied reasoning: given rendered visual observations and a natural-language instruction, a model must predict a trajectory that respects scene geometry, avoids collisions, and terminates at an action-compatible location. The benchmark covers diverse indoor and outdoor environments and organizes tasks into three tiers of spatial and temporal difficulty, enabling fine-grained analysis of grounding under increasing compositional complexity. Using a standardized pointwise judge-based evaluation protocol, we evaluate three frontier VLMs on 2,472 curated 3D environments with nine instructions per scene. Results reveal systematic failures in grounded spatial reasoning, especially under occlusion, interaction constraints, and multi-step instructions: performance drops as the difficulty level of the tasks increases. In general, current VLMs are only somewhat able to produce trajectories that are simultaneously spatially coherent, plausibly executable, and aligned with intended actions. By exposing failures in a controlled yet scalable setting, SleepWalk provides a critical benchmark for advancing grounded multimodal reasoning, embodied planning, vision-language navigation, and action-capable agents in 3D environments.
[241] Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection cs.CVPDF
Andreas Goulas, Damianos Galanopoulos, Evlampios Apostolidis, Vasileios Mezaris
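TL;DR: This paper introduces Sens-VisualNews, a new benchmark dataset for detecting sensational content in news images. It contains 9,576 annotated images and is used to evaluate the performance, prompt sensitivity, and robustness of a range of state-of-the-art multimodal large language models in zero-shot and fine-tuned settings.
Details
Motivation: Detecting sensationalism in media content is important for identifying check-worthy content and flagging potential disinformation, because such content triggers physiological arousal that bypasses critical evaluation and accelerates viral sharing. Prior work lacks both a task definition and a dataset for sensational image detection.
Result: Using Sens-VisualNews, the study evaluates a range of state-of-the-art multimodal large language models and analyzes their performance, prompt sensitivity, and robustness in zero-shot and fine-tuned settings. The abstract does not report specific quantitative results or state whether SOTA is reached.
Insight: The paper is the first to define the task of sensational image detection and contributes Sens-VisualNews, a large-scale benchmark annotated with multiple concepts, opening a new direction for multimodal content moderation research. Viewed objectively, the dataset design and the systematic evaluation framework for multimodal LLMs are useful references.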
Abstract: The detection of sensational content in media items can be a critical filtering mechanism for identifying check-worthy content and flagging potential disinformation, since such content triggers physiological arousal that often bypasses critical evaluation and accelerates viral sharing. In this paper we introduce the task of sensational image detection, which aims to determine whether an image contains shocking, provocative, or emotionally charged features to grab attention and trigger strong emotional responses. To support research on this task, we create a new benchmark dataset (called Sens-VisualNews) that contains 9,576 images from news items, annotated based on the presence or absence of various sensational concepts and events in their visual content. Finally, using Sens-VisualNews, we study the prompt sensitivity, performance and robustness of a wide range of open SotA Multimodal LLMs, across both zero-shot and fine-tuned settings.
[242] AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation cs.CV | cs.AIPDF
Xi Jiang, Yinjie Zhao, Zesheng Yang, Feng Zheng
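TL;DR: This paper proposes AnomalyClaw, a training-free, general-purpose visual anomaly detection agent. It turns anomaly judgment into a multi-round refutation process and draws on a 13-tool library for visual verification, reference parsing, and frozen expert probing, improving the reliability and performance of vision-language models on cross-domain anomaly detection.
Details
Motivation: Anomaly definitions, data modalities, and annotation standards differ across domains, so visual anomaly detection models trained on a single domain transfer poorly. Vision-language models can perceive across domains, but their single-inference judgments rely on prior knowledge rather than normal-sample references or fine-grained feature evidence, which makes them unreliable.
Result: On the CrossDomainVAD-12 benchmark (12 datasets), AnomalyClaw consistently improves macro-AUROC over single-step direct inference, by +6.23 pp on GPT-5.5, +7.93 pp on Seed2.0-lite, and +3.52 pp on Qwen3.5-VL-27B. Its optional verbalized self-evolution extension yields a +2.09 pp mean gain on Qwen3.5-VL-27B, comparable to a K = 10 supervised baseline (+1.99 pp).
Insight: The novelty is recasting anomaly detection as a tool-grounded, multi-round refutation process, in which a tool library integrates visual verification and expert knowledge to reduce reliance on priors. The proposed unsupervised self-evolution mechanism learns rules from internal-branch disagreement and improves performance without ground-truth labels, indicating that agentic refutation strengthens a VLM's anomaly understanding and reasoning rather than merely aggregating tool outputs.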
Abstract: Visual anomaly detection (VAD) is crucial in many real-world fields, such as industrial inspection, medical imaging, infrastructure monitoring, and remote sensing. However, the specific anomaly definitions, data modalities, and annotation standards across different domains make it difficult to transfer single-domain trained VAD models. Vision-language models (VLMs), pre-trained on large-scale cross-domain data, can perform visual perception under task instructions, offering a promising solution for cross-domain VAD. However, single-inference VLM judgments are unreliable, since they rely more on prior knowledge than on normal-sample references or fine-grained feature evidence. We therefore present AnomalyClaw, a training-free VAD agent that turns anomaly judgment into a multi-round refutation process. In each round, the agent proposes candidate anomalies and refutes each against normal-sample references, drawing on a 13-tool library for visual verification, reference parsing, and frozen expert probing. On the CrossDomainVAD-12 benchmark (12 datasets), AnomalyClaw achieves consistent macro-AUROC improvements over single-step direct inference with +6.23 pp on GPT-5.5, +7.93 pp on Seed2.0-lite, and +3.52 pp on Qwen3.5-VL-27B. We further introduce an optional verbalized self-evolution extension. It builds an online rulebook from internal-branch disagreement without oracle labels. On Qwen3.5-VL-27B, it delivers a +2.09 pp mean gain, comparable to a K = 10 oracle-label supervised baseline (+1.99 pp). These results show that agentic refutation improves the anomaly understanding and reasoning of VLMs, rather than merely aggregating tool outputs.
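To make the reported metric concrete, the sketch below computes a macro-AUROC, assuming it means the unweighted mean of per-dataset AUROC values (the abstract does not spell out the averaging). The function names and the two toy datasets are hypothetical; a gain quoted in "pp" is the difference between two such values multiplied by 100.

```python
def auroc(scores, labels):
    """AUROC as the probability that a random anomaly outscores a random normal sample.

    labels: 1 for anomalous, 0 for normal. Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def macro_auroc(per_dataset):
    """Unweighted mean of per-dataset AUROCs; per_dataset maps name -> (scores, labels)."""
    values = [auroc(scores, labels) for scores, labels in per_dataset.values()]
    return sum(values) / len(values)

# Two toy datasets (assumed data): one perfectly ranked, one with a single inversion.
datasets = {
    "toy_a": ([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]),
    "toy_b": ([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]),
}
print(macro_auroc(datasets))  # 0.875
```

Averaging per dataset rather than pooling all samples keeps a large dataset from dominating the score, which matters when the twelve datasets differ in size.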
[243] Position: Life-Logging Video Streams Make the Privacy-Utility Trade-off Inevitable cs.CVPDF
Tianyuan Zou, Liang Yue, Yang Liu, Ya-Qin Zhang, Sijie Cheng
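TL;DR: This position paper examines the central role of life-logging video streams in persistent-perception AI systems as always-on hardware such as smart glasses and body cameras spreads, and the privacy-utility trade-off that results. The authors point out the limitations of existing privacy protections and call for privacy-preserving designs that cover the whole data processing pipeline, along with formal privacy leakage metrics and standardized benchmarks.
Details
Motivation: Life-logging video streams raise the utility of next-generation AI systems but, by continuously recording sensitive information such as behavioral patterns and emotional states, also create significant privacy risks. Left unaddressed, these risks could erode public trust and hinder the sustainable development of always-on AI.
Result: No quantitative experiments or benchmark results are reported. The paper observes that existing methods are either attack-specific or incur substantial utility loss, and that they fail to consider the entire data exploitation pipeline.
Insight: The paper advances the idea of "pipeline-aware" privacy-preserving design, stressing the need to jointly optimize utility and privacy for long-horizon life-logging visual data, and identifies formal privacy leakage metrics and standardized benchmarks as important open research directions.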
Abstract: With the growing prevalence of always-on hardware such as smart glasses, body cameras, and home security systems, life-logging visual sensing is becoming inevitable, forming the backbone of persistent, always-on AI systems. Meanwhile, recent advances in proactive agents and world models signal a fundamental shift from episodic, prompt-driven tools to next-generation AI systems that continuously perceive and react to the physical world. Although life-logging video streams can substantially improve the utility of these promising systems, they also introduce significant privacy risks by revealing sensitive information, such as behavioral patterns, emotional states, and social interactions, beyond what isolated images expose. If unresolved, these risks may undermine public trust and hinder the sustainable development of always-on AI technologies. Existing privacy protections are either attack-specific or incur substantial utility loss, and fail to consider the entire data exploitation pipeline. We therefore posit that the privacy-utility trade-off in life-logging video streams is a foundational challenge for next-generation AI systems that demands further investigation. We call for novel pipeline-aware privacy-preserving designs that jointly optimize utility and privacy for long-horizon life-logging visual data. In parallel, formal privacy leakage metrics and standardized benchmarks remain important open directions for future research.
[244] Progressive Photorealistic Simplification cs.CVPDF
Adi Rosenthal, Dana Berman, Yedid Hoshen, Ariel Shamir
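TL;DR: This paper proposes a progressive semantic image simplification framework that gradually simplifies image content while preserving photorealism. The method combines the semantic understanding of Vision-Language Models (VLMs) with generative editing: an iterative Select-Remove-Verify pipeline removes and inpaints image elements in a controlled manner, producing a sequence of realistic images that runs from complex to simplified. The process is also distilled into an image-to-video generation model that predicts a coherent simplification sequence directly from a single input image, with applications in content-aware decluttering, semantic layer decomposition, and interactive editing.
Details
Motivation: Existing image simplification techniques, such as Non-Photorealistic Rendering (NPR), typically convert photographs into stylized sketches, cartoons, or paintings, reducing visual complexity at the expense of photographic realism. This paper explores a complementary direction: simplifying an image while keeping its photorealistic appearance.
Result: The iterative Select-Remove-Verify pipeline produces high-quality simplification trajectories (image sequences) in which every step's output is a plausible natural photograph. For efficiency, the process is further distilled into an image-to-video generation model that predicts coherent simplification sequences directly from a single input image.
Insight: The core contribution is combining semantic understanding with generative editing so that structured content removal guides visual interpretation, achieving controllable simplification within the photorealistic domain. This enables image editing uses such as content-aware decluttering and semantic layer decomposition and complements traditional abstraction methods.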
Abstract: Existing image simplification techniques often rely on Non-Photorealistic Rendering (NPR), transforming photographs into stylized sketches, cartoons, or paintings. While effective at reducing visual complexity, such approaches typically sacrifice photographic realism. In this work, we explore a complementary direction: simplifying images while preserving their photorealistic appearance. We introduce progressive semantic image simplification, a framework that iteratively reduces scene complexity by removing and inpainting elements in a controlled manner. At each step, the resulting image remains a plausible natural photograph. Our method combines semantic understanding with generative editing, leveraging Vision-Language Models (VLMs) to identify and prioritize elements for removal, and a learned verifier to ensure photorealism and coherence throughout the process. This is implemented via an iterative Select-Remove-Verify pipeline that produces high-quality simplification trajectories. To improve efficiency, we further distill this process into an image-to-video generation model that directly predicts coherent simplification sequences from a single input image. Beyond generating cleaner and more focused compositions, our approach enables applications such as content-aware decluttering, semantic layer decomposition, and interactive editing. More broadly, our work suggests that simplification through structured content removal can serve as a practical mechanism for guiding visual interpretation within the photorealistic domain, complementing traditional abstraction methods.
[245] CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving cs.CV | cs.AIPDF
Minqing Huang, Yujiao Xiang, Zihan Liang, Jiajie Huang, Jingqi Wang
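TL;DR: This paper proposes CoWorld-VLA, a multi-expert world-model reasoning framework for autonomous driving. It extracts complementary world information (semantic interaction, geometric structure, dynamic evolution, and ego trajectory) through multi-source supervision and encodes it into expert tokens that serve as explicit conditions for action generation. A diffusion-based hierarchical multi-expert fusion planner, coupled with scene context throughout joint denoising, generates continuous ego trajectories.
Details
Motivation: The reasoning mechanisms of existing Vision-Language-Action (VLA) models fall short: textual Chain-of-Thought (CoT) struggles to preserve continuous spatiotemporal structure, while latent world reasoning is hard to use directly as a condition for action generation. This paper aims to provide planning-oriented, explicit intermediate world representations as conditions for end-to-end autonomous driving.
Result: On the NAVSIM v1 benchmark, CoWorld-VLA achieves competitive results on both future scene generation and planning, performing strongly in collision avoidance and trajectory accuracy. Ablation studies further confirm that the expert tokens are complementary and effective as planning conditions for action generation.
Insight: The novelty is a multi-expert world representation framework that encodes different dimensions of world information (interaction intent, spatial structure, future dynamics, behavioral goals) as explicit, planner-accessible token conditions, and performs joint denoising and trajectory generation through the hierarchical fusion mechanism of a diffusion model, bridging the gap between world reasoning and action generation.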
Abstract: Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving. However, existing reasoning mechanisms still struggle to provide planning-oriented intermediate representations: textual Chain-of-Thought (CoT) fails to preserve continuous spatiotemporal structure, while latent world reasoning remains difficult to use as a direct condition for action generation. In this paper, we propose CoWorld-VLA, a multi-expert world reasoning framework for autonomous driving, where world representations serve as explicit conditions to guide action planning. CoWorld-VLA extracts complementary world information through multi-source supervision and encodes it into expert tokens within the VLA, thereby providing planner-accessible conditioning signals. Specifically, we construct four types of tokens: semantic interaction, geometric structure, dynamic evolution, and ego trajectory tokens, which respectively model interaction intent, spatial structure, future temporal dynamics, and behavioral goals. During action generation, CoWorld-VLA employs a diffusion-based hierarchical multi-expert fusion planner, which is coupled with scene context throughout the joint denoising process to generate continuous ego trajectories. Experiments show that CoWorld-VLA achieves competitive results in both future scene generation and planning on the NAVSIM v1 benchmark, demonstrating strong performance in collision avoidance and trajectory accuracy. Ablation studies further validate the complementarity of expert tokens and their effectiveness as planning conditions for action generation. Code will be available at https://github.com/potatochip1211/CoWorld-VLA.
[246] WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors cs.CVPDF
Keming Wu, Yijing Cui, Wenhan Xue, Qijie Wang, Xuan Luo
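TL;DR: This paper introduces WorldReasonBench, a benchmark that reframes video generation evaluation as world-state prediction. It contains 436 curated test cases spanning 22 subcategories across four reasoning dimensions (physical, social, logical, and informational) and uses a human-aligned two-part evaluation: process-aware reasoning verification and multi-dimensional quality assessment. It also introduces WorldRewardBench, a preference benchmark with about 6K expert-annotated video pairs. The evaluation shows a persistent gap between visual plausibility and world reasoning in modern video generators.
Details
Motivation: Commercial video generation systems such as Seedance2.0 and Veo3.1 are advancing quickly and are increasingly viewed as potential "world simulators", yet the community lacks a benchmark that directly tests whether a model can reason about how the world evolves over time. This paper builds such a benchmark to assess whether, given an initial state and an action, a generated video stays physically, socially, logically, and informationally consistent.
Result: Evaluation across several modern video generators reveals a persistent gap between visual plausibility and world reasoning: generated videos can look convincing while failing on dynamics, causality, or information preservation. The benchmark supports both ranking video generators and reward modeling.
Insight: The novelty is reframing video generation evaluation as world-state prediction and building a structured, multi-dimensional, human-aligned benchmark. Its process-aware reasoning verification and structured QA diagnostics can detect temporal and causal failures in depth, giving a more complete framework for evaluating world-aware video generation. The accompanying preference benchmark, WorldRewardBench, supports reward-model evaluation and helps advance research in the area.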
Abstract: Commercial video generation systems such as Seedance2.0 and Veo3.1 have rapidly improved, strengthening the view that video generators may be evolving into “world simulators.” Yet the community still lacks a benchmark that directly tests whether a model can reason about how an observed world should evolve over time. We introduce WorldReasonBench, which reframes video generation evaluation as world-state prediction: given an initial state and an action, can a model generate a future video whose state evolution remains physically, socially, logically, and informationally consistent? WorldReasonBench contains 436 curated test cases with structured ground-truth QA annotations spanning four reasoning dimensions and 22 subcategories. We evaluate generated videos with a human-aligned two-part methodology: Process-aware Reasoning Verification uses structured QA and reasoning-phase diagnostics to detect temporal and causal failures, while Multi-dimensional Quality Assessment scores reasoning quality, temporal consistency, and visual aesthetics for ranking and reward modeling. We further introduce WorldRewardBench, a preference benchmark with approximately 6K expert-annotated pairs over 1.4K videos, supporting pair-wise and point-wise reward-model evaluation. Across modern video generators, our results expose a persistent gap between visual plausibility and world reasoning: videos can look convincing while failing dynamics, causality, or information preservation. We will release our benchmarks and evaluation toolkit to support community research on genuinely world-aware video generation at https://github.com/UniX-AI-Lab/WorldReasonBench/.
[247] Uni-Synergy: Bridging Understanding and Generation for Personalized Reasoning via Co-operative Reinforcement Learning cs.CVPDF
Zijun Shen, Sihan Yang, Ruichuan An, Ziyu Guo, Hao Liang
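TL;DR: This paper proposes Sync-R1, an end-to-end reinforcement learning framework that jointly optimizes personalized understanding and generation, addressing the difficulty unified multimodal models have in connecting the two. It introduces the Sync-GRPO reinforcement learning algorithm and Dynamic Group Scaling, and validates them on the newly built UnifyBench++ benchmark.
Details
Motivation: Existing unified multimodal models leave a gap between personalized understanding and generation. Prior work relies mainly on supervised fine-tuning for implicit token-level alignment, which does not fully exploit the potential synergy between understanding and creation.
Result: On the UnifyBench++ benchmark, Sync-R1 achieves state-of-the-art performance, showing strong cross-task reasoning and robust personalization without requiring complex cold-start procedures.
Insight: The novelty is an explicit cooperative reasoning loop in which, through a unified reward mechanism, personalized understanding guides content generation while generation quality in turn refines understanding. Methodologically, the paper introduces Sync-GRPO with an ensemble reward and Dynamic Group Scaling to reduce gradient variance, and it builds UnifyBench++, a new benchmark with denser textual descriptions and richer user contexts that better reflects real-world complexity.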
Abstract: Unified Multimodal Models (UMMs) excel in general tasks but struggle to bridge the gap between personalized understanding and generation. Prior works largely rely on implicit token-level alignment via supervised fine-tuning, which fails to fully capture the potential synergy between comprehension and creation. In this work, we propose Sync-R1, an end-to-end reinforcement learning framework that jointly optimizes personalized understanding and generation within a single, explicit reasoning loop. Through this unified feedback process, Sync-R1 enables personalized comprehension to guide content creation, while the resulting generation quality reciprocally refines understanding within an integrated reward landscape. To efficiently orchestrate this dual-task synergy, we introduce Sync-GRPO, a reinforcement learning method utilizing an ensemble reward system. Furthermore, we propose Dynamic Group Scaling (DGS), which adaptively filters low-potential trajectories to reduce gradient variance and accelerate convergence. To better reflect real-world complexity, we introduce UnifyBench++, featuring denser textual descriptions and richer user contexts. Experimental results demonstrate that Sync-R1 achieves state-of-the-art performance, showcasing superior cross-task reasoning and robust personalization without requiring complex cold-start procedures. The code and the UnifyBench++ dataset will be released at: https://github.com/arctanxarc/UniCTokens.
[248] Automated high-frequency quantification of fish communities and biomass using computer vision cs.CVPDF
Kota Ishikawa, Takuma Masui, Keita Koeda, Rickdane Gomez, Lucas Yutaka Kimura
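TL;DR: This paper presents an automated computer-vision framework for quantifying fish communities and biomass from underwater video. The framework combines deep learning-based fish identification, multi-object tracking, and 3D reconstruction to estimate species-level abundance and biomass. With hourly daytime observations over 20 days, the method reveals dynamic fluctuations in species richness, abundance, and biomass in a reef fish community. A comparison with conventional visual census and environmental DNA surveys shows its complementary strengths for continuous, non-invasive, quantitative monitoring.
Details
Motivation: Existing fish community survey methods (catch-based methods, underwater visual censuses, and environmental DNA metabarcoding) are either labor-intensive or unable to estimate abundance and biomass reliably, and they lack high-frequency quantitative observation capability. An automated, high-frequency quantification method is therefore needed to better understand biodiversity and ecosystem responses.
Result: Over a 20-day observation of a reef fish community with hourly daytime sampling, the method revealed dynamic fluctuations associated with changes in species composition. Compared with conventional visual census and environmental DNA surveys, it proved complementary for continuous monitoring of consistently observed species, providing a scalable foundation for long-term monitoring.
Insight: The novelty lies in integrating deep learning identification, multi-object tracking, and 3D reconstruction for automated, high-frequency, non-invasive quantitative monitoring of fish communities. Viewed objectively, the method uses computer vision to raise temporal resolution, resolving fine-scale temporal dynamics of fish communities and offering a new tool for ecological monitoring.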
Abstract: Quantifying fish community structure is essential for understanding biodiversity and ecosystem responses in a changing environment, yet existing survey methods provide limited high-frequency, quantitative observations. Conventional approaches, including catch-based methods, underwater visual censuses, and environmental DNA metabarcoding, either require intensive labor or lack reliable estimates of abundance and biomass. Here, we develop an automated framework for quantifying fish communities from underwater video using computer vision. Using videos acquired with a custom-made stereo camera system, the framework integrates deep learning-based fish identification, multi-object tracking, and 3D reconstruction to estimate species-level abundance and biomass. We applied the approach to a reef fish community over a 20-day period with hourly daytime observations, revealing dynamic fluctuations in species richness, abundance, and biomass associated with changes in species composition. By comparing fish communities estimated from visual census and environmental DNA surveys, we demonstrate that our method provides complementary strengths for continuous, non-invasive, and quantitative monitoring of consistently observed species. This approach provides a scalable foundation for long-term monitoring and advances the capacity to resolve fine-scale temporal dynamics in fish communities.
[249] Adaptive Context Matters: Towards Provable Multi-Modality Guidance for Super-Resolution cs.CVPDF
Jinyi Luo, Minghao Liu, Yifan Li, Zejia Fan, Jiaying Liu
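TL;DR: Treating super-resolution (SR) as an ill-posed problem, this paper provides the first theoretical modeling of multi-modal guided SR and reveals the modality-utilization bottleneck of existing methods. Building on this analysis, the authors propose a novel Multi-Modal Mixture-of-Experts Super-Resolution framework (M^3ESR) with generalization-oriented dynamic modality fusion: a spatially dynamic modality weighting module and a temporally adaptive modality temperature scheduling mechanism enable flexible, adaptive spatial-temporal modality weighting for effective risk control. Extensive experiments show that M^3ESR significantly improves generalization and semantic consistency.
Details
Motivation: Super-resolution is a severely ill-posed problem with inherent ambiguity. Although existing semantic-guided and multi-modal SR methods use large models or external priors to enhance semantic alignment, the fusion of heterogeneous modalities remains insufficiently understood in both practice and theory. This paper models multi-modal SR theoretically to address the bottleneck of sub-optimal modality utilization.
Result: Extensive experiments show that the proposed M^3ESR framework significantly improves generalization and semantic consistency, which the authors take as confirmation of its superiority.
Insight: The main novelty is the first theoretical modeling of multi-modal SR, which shows how the generalization risk bound depends on the alignment between modality weights and their effective contributions and on representation complexity. On this basis, the paper proposes a theory-driven dynamic modality fusion framework (M^3ESR) whose core components are spatially dynamic modality weighting and temporally adaptive temperature scheduling, giving effective control of generalization risk and optimized modality contributions.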
Abstract: Super-resolution (SR) is a severely ill-posed problem with inherent ambiguity, as widely recognized in both empirical and theoretical studies. Although recent semantic-guided and multi-modal SR methods exploit large models or external priors to enhance semantic alignment, the fusion of heterogeneous modalities remains insufficiently understood in practice and theory. In this work, we provide the first theoretical modeling of multi-modal SR, revealing that prior methods are bottlenecked by sub-optimal modality utilization. Our analysis shows that the generalization risk bound can be improved by strengthening the alignment between modality weights and their effective contributions, while reducing representation complexity. This theoretical insight inspires us to propose the novel Multi-Modal Mixture-of-Experts Super-Resolution framework (M$^3$ESR) that employs generalization-oriented dynamic modality fusion for accurate risk control and modality contribution optimization. In detail, we propose a novel spatially dynamic modality weighting module and a temporally adaptive modality temperature scheduling mechanism, enabling flexible and adaptive spatial-temporal modality weighting for effective risk control. Extensive experiments demonstrate that our M$^3$ESR significantly boosts generalization and semantic consistency performances, which confirms our superiority.
[250] OpenSGA: Efficient 3D Scene Graph Alignment in the Open World cs.CV | cs.ROPDF
Gang Chen, Sebastián Barbas Laina, Stefan Leutenegger, Javier Alonso-Mora
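TL;DR: This paper proposes OpenSGA, an efficient 3D scene graph alignment framework that predicts object correspondences by fusing vision-language, textual, and geometric features, and introduces the large-scale dataset ScanNet-SG. The method achieves the best performance on both frame-to-scan (F2S) and subscan-to-subscan (S2S) alignment tasks.
Details
Motivation: Existing methods focus mainly on subscan-to-subscan (S2S) alignment and depend heavily on geometric point-cloud features, leaving frame-to-scan (F2S) alignment and open-set vision-language features underexplored. Existing datasets are also small with limited object diversity, which constrains systematic training and evaluation.
Result: Experiments show that the method achieves the best overall performance on both F2S and S2S tasks, substantially outperforming existing scene graph alignment methods.
Insight: The contributions are: 1) a unified and efficient scene graph alignment framework that fuses multi-modal features (vision-language, textual, geometric) with spatial context; 2) modules including a distance-gated spatial attention encoder, a minimum-cost-flow-based allocator, and a global scene embedding generator, designed for accurate alignment under large coordinate discrepancies; 3) the large-scale dataset ScanNet-SG, generated via an automated annotation pipeline and covering a wide range of object categories, which supports systematic training and evaluation.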
Abstract: Scene graph alignment establishes object correspondences between two 3D scene graphs constructed from partially overlapping observations. This enables efficient scene understanding and object-level relocalization when a robot revisits a place, as well as global map fusion across multiple agents. Such capabilities are essential for robots that require long-term memory for long-horizon tasks involving interactions with the environment. Existing approaches mainly focus on subscan-to-subscan (S2S) alignment and depend heavily on geometric point-cloud features, leaving frame-to-scan (F2S) alignment and open-set vision-language features underexplored. In addition, existing datasets for scene graph alignment remain small-scale with limited object diversity, constraining systematic training and evaluation. We present a unified and efficient scene graph alignment framework that predicts object correspondences by fusing vision-language, textual, and geometric features with spatial context. The framework comprises modules such as a distance-gated spatial attention encoder, a minimum-cost-flow-based allocator, and a global scene embedding generator to achieve accurate alignment even under large coordinate discrepancies. We further introduce ScanNet-SG, a large-scale dataset generated via an automated annotation pipeline with over 700k samples, covering 509 object categories from ScanNet labels and over 3k categories from GPT-4o-based tagging. Experiments show that our method achieves the best overall performance on both F2S and S2S tasks, substantially outperforming existing scene graph alignment methods. Our code and dataset are released at: https://autonomousrobots.nl/paper_websites/opensga.
[251] Simultaneous Long-tailed Recognition and Multi-modal Fusion for Highly Imbalanced Multi-modal Data cs.CV | cs.AI | stat.MLPDF
Heegeon Yoon, Heeyoung Kim
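TL;DR: This paper proposes a unified framework for long-tailed recognition and multi-modal fusion on highly imbalanced multi-modal data. The framework extends multi-expert architectures to the multi-modal setting, fuses heterogeneous data into a unified representation, and uses modality-specific networks to estimate the informativeness of each modality. Confidence-guided weights dynamically modulate the fusion so that more informative modalities contribute more to the final decision. Specialized training and test procedures accommodate different modality combinations, such as images and tabular data.
Details
Motivation: Deep learning models trained on long-tailed data are biased toward majority classes. Existing long-tailed recognition methods are mostly limited to single-modal inputs and cannot fully exploit complementary information from multiple data sources, so a long-tailed recognition framework that handles multi-modal inputs is needed.
Result: Extensive experiments on benchmark and real-world datasets show that the proposed method not only integrates multi-modal information effectively but also outperforms existing methods in long-tailed, class-imbalanced scenarios, demonstrating robustness and generalization capability.
Insight: The novelty lies in combining multi-expert architectures with multi-modal fusion, using modality-specific networks to dynamically estimate and weight each modality's informativeness to guide fusion, and designing training and test procedures that adapt to diverse modality combinations. The confidence-guided dynamic fusion mechanism is a reusable idea for imbalanced multi-modal learning.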
Abstract: Long-tailed distributions in class-imbalanced data present a fundamental challenge for deep learning models, which tend to be biased toward majority classes. While recent methods for long-tailed recognition have mitigated this issue, they are largely restricted to single-modal inputs and cannot fully exploit complementary information from diverse data sources. In this work, we introduce a new framework for long-tailed recognition that explicitly handles multi-modal inputs. Our approach extends multi-expert architectures to the multi-modal setting by fusing heterogeneous data into a unified representation while leveraging modality-specific networks to estimate the informativeness of each modality. These confidence-guided weights dynamically modulate the fusion process, ensuring that more informative modalities contribute more strongly to the final decision. To further enhance performance, we design specialized training and test procedures that accommodate diverse modality combinations, including images and tabular data. Extensive experiments on benchmark and real-world datasets demonstrate that the proposed approach not only effectively integrates multi-modal information but also outperforms existing methods in handling long-tailed, class-imbalanced scenarios, highlighting its robustness and generalization capability.
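Confidence-guided fusion as described above can be sketched in a few lines. The softmax weighting, the function names, and the toy feature vectors are assumptions chosen for illustration; the abstract does not specify how the confidence scores are normalized or how the features are combined.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(features, confidences):
    """Weight each modality's feature vector by its normalized confidence.

    features:    one equal-length feature vector per modality
    confidences: one scalar informativeness score per modality (hypothetical)
    """
    weights = softmax(confidences)
    dim = len(features[0])
    fused = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return fused, weights


# Toy example: the image branch is judged more informative than the tabular one.
image_feat, tabular_feat = [1.0, 0.0], [0.0, 1.0]
fused, weights = fuse([image_feat, tabular_feat], confidences=[2.0, 0.0])
```

With these inputs the image branch receives the larger weight, so the fused vector lies closer to the image features.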
[252] Improving Human Image Animation via Semantic Representation Alignment cs.CVPDF
Chang Liu, Mengting Chen, Yixuan Huang, Haoning Wu, Chen Ju
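TL;DR: This paper proposes SemanticREPA, a method that improves human image animation through semantic representation alignment. A structure alignment module aligns video latent representations with depth estimation features, and an ID alignment module aligns the ID representations of generated videos with face recognition features. Together they correct structural distortion and strengthen identity consistency, producing more coherent and stable human animation videos.
Details
Motivation: Existing human image animation methods often feed semantic representations such as dense poses or ID embeddings as additional conditions, which can reduce generation flexibility. Their reliance on RGB pixel supervision also neglects 3D geometric relationships and temporal coherence, leading to limb twisting and facial distortion, especially for long videos or intensive motion.
Result: With structure and ID alignment, the method shows better quality on extended character motions and improved character consistency, although the abstract reports no specific benchmarks or quantitative results.
Insight: The novelty is to use semantic representations as supervision signals rather than conditioning inputs, correcting structure and identity through representation alignment. This avoids the loss of flexibility caused by conditioning and emphasizes learning 3D geometry and temporal coherence. The idea may transfer to other generation tasks that must preserve structure and identity.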
Abstract: The field of image-to-video generation has made remarkable progress. However, challenges such as human limb twisting and facial distortion persist, especially when generating long videos or modeling intensive motions. Existing human image animation works address these issues by incorporating human-specific semantic representations, e.g., dense poses or ID embeddings, as additional conditions. However, conditioning on these representations could decrease the generation flexibility. Moreover, their reliance on RGB pixel supervision also lacks emphasis on learning necessary 3D geometric relationships and temporal coherence. In contrast, we introduce a novel approach named SemanticREPA that leverages these semantic representations as supervision signals through representation alignment. Specifically, we begin by training a structure alignment module that aligns the structure representations obtained from video latents with video depth estimation features. We then fix the pretrained module, and utilize it to provide additional supervision on the structure representations of the diffusion models, achieving structure rectification to generate coherent and stable human structures. Simultaneously, we develop an ID alignment module to align the ID representations of the generated videos to face recognition features. We further propose to use the predicted structure representations to refine identity restoration in relevant regions. With structure and ID alignment, our method demonstrates superior quality on extended character motions and enhanced character consistency.
[253] GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth cs.CVPDF
Yuecheng Liu, Junda Cheng, Longliang Liu, Wenjing Liao, Hanrui Cheng, Yuzhou Wang
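TL;DR: GemDepth is a video depth estimation framework that explicitly embeds camera motion and global 3D structure to address the spatial blurring in fine-detail regions and the temporal inconsistency of existing methods. It introduces a Geometry-Embedding Module (GEM) that predicts inter-frame camera poses and generates implicit geometric embeddings, combined with an Alternating Spatio-Temporal Transformer (ASTT) that improves spatial precision and temporal consistency together.
Details
Motivation: Existing video depth estimation methods rely mainly on Transformers for temporal smoothing and struggle to maintain strict 3D geometric consistency under rotations or drastic view changes, resulting in blurred spatial detail and temporal inconsistency.
Result: Comprehensive evaluation on multiple datasets shows that GemDepth achieves state-of-the-art performance, particularly in complex dynamic scenes.
Insight: The novelty is to inject explicit camera-motion priors and global 3D structure awareness, giving the network intrinsic 3D perception and alignment capabilities. The Alternating Spatio-Temporal Transformer uses geometric cues to capture latent point-level correspondences, improving spatial detail and temporal consistency at once. A data-efficient training strategy balances high efficiency with robust geometric consistency.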
Abstract: Video depth estimation extends monocular prediction into the temporal domain to ensure coherence. However, existing methods often suffer from spatial blurring in fine-detail regions and temporal inconsistencies. We argue that current approaches, which primarily rely on temporal smoothing via Transformers, struggle to maintain strict 3D geometric consistency, particularly under rotations or drastic view changes. To address this, we propose GemDepth, a framework built on the insight that an explicit awareness of camera motion and global 3D structure is a prerequisite for 3D consistency. Distinctively, GemDepth introduces a Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate implicit geometric embeddings. This injection of motion priors equips the network with intrinsic 3D perception and alignment capabilities. Guided by these geometric cues, our Alternating Spatio-Temporal Transformer (ASTT) captures latent point-level correspondences to simultaneously enhance spatial precision for sharp details and enforce rigorous temporal consistency. Furthermore, GemDepth employs a data-efficient training strategy, effectively bridging the gap between high efficiency and robust geometric consistency. As shown in Fig. 2, comprehensive evaluations demonstrate that GemDepth achieves state-of-the-art performance across multiple datasets, particularly in complex dynamic scenarios. The code is publicly available at: https://github.com/Yuecheng919/GemDepth
[254] TIE: Time Interval Encoding for Video Generation over Events cs.CVPDF
Zhilei Shu, Shangwen Zhu, Zihang Liang, Xiaofan Li, Qianyu Peng
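TL;DR: This paper proposes Time Interval Encoding (TIE), a new method that addresses the fundamental dimension mismatch video generators face when handling concurrent and overlapping events. By elevating time intervals to first-class primitives in DiT cross-attention, the method substantially improves temporal controllability while preserving the visual quality of the base model.
Details
Motivation: Existing video generators such as DiT represent time with point-wise positional encodings, which cannot mathematically represent temporally extended intervals or overlapping events. This conflicts with applications that need temporal grounding, such as director-style prompting and robotic action prediction, where a large share of videos contain overlapping events.
Result: On the OmniEvents dataset, TIE raises the human-verified Temporal Constraint Satisfaction Rate from 77.34% to 96.03%, reduces temporal boundary error from 0.261 s to 0.073 s, and improves trajectory-level temporal alignment metrics.
Insight: The core novelty is a principled, plug-and-play interval-aware generalization of rotary embeddings (RoPE). Its design rests on two basic principles, Temporal Integrability and Duration Invariance, and under a uniform kernel it yields an efficient closed-form sinc-based solution that naturally attenuates boundary noise through interval integration.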
Abstract: Director-style prompting, robotic action prediction, and interactive video agents demand temporal grounding over concurrent events – a regime in which 68% of general clips and over 99% of robotics/gameplay clips contain overlapping events, yet existing multi-event generators rest on a single-active-prompt assumption. However, modern video generators, such as Diffusion Transformers (DiT), represent time as discrete points through point-wise positional encodings. This formulation creates a fundamental dimension mismatch: temporally extended intervals and overlapping events are mathematically unrepresentable to the attention mechanism. In this paper, we propose Time Interval Encoding (TIE), a principled, plug-and-play interval-aware generalization of rotary embeddings that elevates time intervals to first-class primitives inside DiT cross-attention. Rather than introducing another heuristic interval embedding, we show that, within RoPE-compatible bilinear attention, TIE is characterized by two basic principles: Temporal Integrability, which requires an event to aggregate positional evidence over its full duration, and Duration Invariance, which removes the trivial bias toward longer intervals. Under a uniform kernel, this characterization yields an efficient closed-form sinc-based solution that preserves the standard attention interface and naturally attenuates boundary noise through interval integration. Empirically, TIE preserves the visual quality of the base DiT model while substantially improving temporal controllability. In our experiments on the OmniEvents dataset, it improves human-verified Temporal Constraint Satisfaction Rate from 77.34% to 96.03% and reduces temporal boundary error from 0.261s to 0.073s, while also improving trajectory-level temporal alignment metrics. The code and dataset are available at https://github.com/MatrixTeam-AI/TIE.
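One way to see where a sinc-based closed form can come from is to average a single rotary phase over an interval. The sketch below is an assumption-laden reading of "uniform kernel" and "Duration Invariance" (a duration-normalized average of exp(i·ω·t) over [start, end]), not the paper's exact formulation; the function names are hypothetical. Averaging gives exp(i·ω·c)·sin(ω·h)/(ω·h), where c is the interval midpoint and h its half-width, and the code checks this against numerical integration.

```python
import cmath
import math


def sinc(x):
    """Unnormalized sinc, sin(x)/x, with the limit value 1 at x = 0."""
    return 1.0 if abs(x) < 1e-12 else math.sin(x) / x


def interval_rotary(omega, start, end):
    """Closed-form average of exp(i*omega*t) over [start, end]."""
    center = 0.5 * (start + end)
    half_width = 0.5 * (end - start)
    return cmath.exp(1j * omega * center) * sinc(omega * half_width)


def interval_rotary_numeric(omega, start, end, steps=10000):
    """Midpoint-rule average of the same phase, used only as a check."""
    dt = (end - start) / steps
    total = sum(cmath.exp(1j * omega * (start + (k + 0.5) * dt)) for k in range(steps))
    return total / steps


closed = interval_rotary(omega=3.0, start=1.0, end=3.0)
numeric = interval_rotary_numeric(omega=3.0, start=1.0, end=3.0)
```

A zero-length interval recovers the ordinary point-wise rotation exp(i·ω·t). For longer intervals the sinc factor shrinks the magnitude of high-frequency components, which matches the abstract's remark that interval integration attenuates boundary noise.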
[255] EnergyLens: Interpretable Closed-Form Energy Models for Multimodal LLM Inference Serving cs.CV | cs.LGPDF
Vittorio Palladino, Gianluca Palermo, Michael E. Papka, Zhiling Lan
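TL;DR: EnergyLens is an interpretable closed-form energy model for multimodal LLM inference serving. Using symbolic regression on a small amount of profiling data, it derives a closed-form model with only 12 parameters, expressed in system properties such as degree of parallelism, batch size, and sequence length, that accurately predicts inference energy and generalizes across model architectures and hardware platforms.
Details
Motivation: Existing approaches either treat latency as an energy proxy or rely on data-hungry black-box surrogates. Both fail under varying parallelism strategies: latency and energy optima diverge in over 20% of configurations, and black-box methods need many samples to generalize. A sample-efficient, interpretable model that accurately guides energy-optimal deployment is therefore needed.
Result: Across many evaluation scenarios, EnergyLens needs only 50 profiling samples to reach 88.2% Top-1 configuration selection accuracy, well above the closest prior analytical baseline (60.9%). It matches the predictive accuracy of ensemble machine learning methods with 10x fewer samples, and extrapolates reliably to unseen batch sizes and hardware platforms without structural modification.
Insight: The novelty is to use symbolic regression as a structure-discovery tool to derive a physically interpretable closed-form model that explicitly decouples the contributions of tensor and pipeline parallelism and separates prefill from decode energy. The result is a sample-efficient, actionable energy modeling approach that generalizes across platforms and outperforms black-box surrogates and simple proxy metrics.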
Abstract: As large language models span dense, mixture-of-experts, and state-space architectures and are deployed on heterogeneous accelerators under increasingly diverse multimodal workloads, optimizing inference energy has become as critical as optimizing latency and throughput. Existing approaches either treat latency as an energy proxy or rely on data-hungry black-box surrogates. Both fail under varying parallelism strategies: latency and energy optima diverge in over 20% of configurations we tested, and black-box surrogates require hundreds of profiling samples to generalize across model families and hardware. We present EnergyLens, which uses symbolic regression as a structure-discovery tool over profiling data to derive a single twelve-parameter closed-form energy model expressed in terms of system properties such as degree of parallelism, batch size, and sequence length. Unlike black-box surrogates, EnergyLens decouples tensor and pipeline parallelism contributions and separates prefill from decode energy, making its predictions physically interpretable and actionable. Fitted from as few as 50 profiling measurements, EnergyLens achieves 88.2% Top-1 configuration selection accuracy across many evaluation scenarios compared to 60.9% for the closest prior analytical baseline, matches the predictive accuracy of ensemble ML methods with 10x fewer profiling samples, and extrapolates reliably to unseen batch sizes and hardware platforms without structural modification, making it a practical, interpretable tool for energy-optimal LLM deployment.
[256] DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving cs.CV | cs.ROPDF
Lingjun Zhang, Changjie Wu, Linzhe Shi, Jiangyang Li, Jiaxin Liu
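TL;DR: This paper proposes DeepSight, a long-horizon world modeling method for end-to-end autonomous driving. It models future world states over a long horizon by predicting, in parallel, the latent semantic features of consecutive future frames in bird's-eye-view (BEV) space, and it introduces an efficient, adaptive text reasoning mechanism that draws on additional social knowledge and reasoning ability to improve driving performance in challenging long-tail scenarios.
Details
Motivation: Current end-to-end autonomous driving systems integrate vision-language models, but their reasoning mechanisms are mostly transferred directly from general domains. They lack in-depth exploration tailored to driving scenarios, especially in visual reasoning modules, and so cannot effectively perform long-horizon world modeling or handle long-tail scenarios.
Result: The method achieves state-of-the-art (SOTA) results on the closed-loop Bench2drive benchmark.
Insight: The novelty lies in a long-horizon world model that predicts future latent semantic features in parallel in BEV space, together with an adaptive text reasoning mechanism that uses social knowledge, aiming at a more specialized treatment of long-horizon planning and long-tail scenarios in autonomous driving.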
Abstract: End-to-end autonomous driving systems are increasingly integrating Vision-Language Model (VLM) architectures, incorporating text reasoning or visual reasoning to enhance the robustness and accuracy of driving decisions. However, the reasoning mechanisms employed in most methods are direct adaptations from general domains, lacking in-depth exploration tailored to autonomous driving scenarios, particularly within visual reasoning modules. In this paper, we propose a driving world model that performs parallel prediction of latent semantic features for consecutive future frames in the bird’s-eye-view (BEV) space, thereby enabling long-horizon modeling of future world states. We also introduce an efficient and adaptive text reasoning mechanism that utilizes additional social knowledge and reasoning capabilities to further improve driving performance in challenging long-tail scenarios. We present a novel, efficient, and effective approach that achieves state-of-the-art (SOTA) results on the closed-loop Bench2drive benchmark. Codes are available at: https://github.com/hotdogcheesewhite/DeepSight.
[257] VeloGauss: Learning Physically Consistent Gaussian Velocity Fields from Videos cs.CVPDF
Nengbo Lu, Bin Zhao
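TL;DR: This paper proposes VeloGauss, which aims to jointly model the geometry, appearance, and physical information of 3D scenes from dynamic multi-view videos alone, without relying on any physical priors. The method learns a velocity field for each Gaussian particle by introducing a Physics Code and a Particle Dynamics System, and adds Global Physical Constraints to ensure the physical consistency of the scene.
Details
Motivation: Existing methods typically use physical losses only as soft constraints or integrate physical simulation into neural networks, and they struggle to learn complex motion physics. Modeling velocity fields has the potential to capture real physical information, but without appropriate physical constraints current methods cannot correctly learn the interaction mechanisms between rigid and non-rigid particles.
Result: Extensive experiments on four public datasets show that the method achieves state-of-the-art performance on both Novel View Interpolation and Future Frame Extrapolation.
Insight: The novelty is to learn a per-Gaussian velocity field through a Physics Code and a Particle Dynamics System and to add Global Physical Constraints for physical consistency, which allows the physical properties of complex dynamic 3D scenes to be modeled without physical priors.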
Abstract: In this paper, we aim to jointly model the geometry, appearance, and physical information of 3D scenes solely from dynamic multi-view videos, without relying on any physical priors. Existing works typically employ physical losses merely as soft constraints or integrate physical simulations into neural networks; however, these approaches often fail to effectively learn complex motion physics. Although modeling velocity fields holds the potential to capture authentic physical information, due to the lack of appropriate physical constraints, current methods are unable to correctly learn the interaction mechanisms between rigid and non-rigid particles. To address this, we propose VeloGauss, designed to learn the physical properties of complex dynamic 3D scenes without physical priors. Our method learns the velocity field for each Gaussian particle by introducing a Physics Code and a Particle Dynamics System, and ultimately incorporates Global Physical Constraints to ensure the physical consistency of the scene. Extensive experiments on four public datasets demonstrate that our method achieves state-of-the-art performance in both Novel View Interpolation and Future Frame Extrapolation tasks.
[258] SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models cs.CV | cs.AIPDF
Chen Zhong, Xiao An, Jiaxing Sun, Zihan Gui, Guangyi Yang
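TL;DR: This paper proposes SenseBench, the first benchmark dedicated to evaluating the low-level visual perception and description abilities of large vision-language models on remote sensing imagery. Built on a physics-driven hierarchical taxonomy, it contains over 10,000 carefully annotated instances covering 6 major categories and 22 fine-grained remote sensing degradation types. The paper designs two complementary evaluation protocols, objective perception and subjective description, and evaluates 29 advanced vision-language models, revealing domain bias, multi-distortion collapse, a fluency illusion, and a perception-description inversion.
Details
Motivation: Current remote sensing image quality assessment methods output hard-to-interpret scalar scores and cannot describe physics-driven remote sensing degradations, which leaves a clear gap with the diagnostic needs of remote sensing experts. Vision-language models offer the potential for language-grounded image quality assessment, but their visual priors are heavily biased toward ground-level natural images, and whether they can cross the domain gap to perceive and describe remote sensing artifacts has not been sufficiently studied.
Result: A comprehensive evaluation of 29 state-of-the-art vision-language models on SenseBench reveals skewed domain priors, multi-distortion collapse, a fluency illusion, and a perception-description inversion effect. SenseBench provides a robust evaluation testbed and high-quality diagnostic data for remote sensing low-level visual perception.
Insight: The novelty is the first diagnostic benchmark dedicated to remote sensing low-level visual perception and description, with a physics-based hierarchical taxonomy that unifies non-reference and reference-based evaluation paradigms and two complementary protocols for perception and description. Viewed objectively, the study systematically exposes failure modes specific to vision-language models in remote sensing, offering insight and data for domain adaptation research.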
Abstract: Low-level visual perception underpins reliable remote sensing (RS) image analysis, yet current image quality assessment (IQA) methods output uninterpretable scalar scores rather than characterizing physics-driven RS degradations, deviating markedly from the diagnostic needs of RS experts. While Vision-Language Models (VLMs) present a compelling alternative by delivering language-grounded IQA, their visual priors are heavily biased toward ground-level natural images. Consequently, whether VLMs can overcome this domain gap to perceive and articulate RS artifacts remains insufficiently studied. To bridge this gap, we propose SenseBench, the first dedicated diagnostic benchmark for RS low-level visual perception and description. Driven by a physics-based hierarchical taxonomy that unifies both non-reference and reference-based paradigms, SenseBench features over 10K meticulously curated instances across 6 major and 22 fine-grained RS degradation categories. Specifically, two complementary protocols are designed for evaluation: objective low-level visual perception and subjective diagnostic description. Comprehensive evaluation of 29 state-of-the-art VLMs reveals not only skewed domain priors and multi-distortion collapse, but also fluency illusion and a perception-description inversion effect. We hope SenseBench provides a robust evaluation testbed and high-quality diagnostic data to advance the development of VLMs in RS low-level perception. Code and datasets are available at https://github.com/Zhong-Chenchen/SenseBench.
[259] FrequencyCT: Frequency domain pseudo-label generation for self-supervised low-dose CT denoising cs.CVPDF
Guoquan Wei, Liu Shi, Chong Chen, Qiegen Liu
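TL;DR: This paper proposes FrequencyCT, a zero-shot self-supervised method for low-dose CT denoising. It generates pseudo-labels in the frequency domain, separating noise from clean signal through a regional low-frequency anchoring technique, phase-preserving amplitude modulation, and mask perturbation in the high-frequency region, and it truncates generated samples according to projection-domain noise variance to stabilize the network's optimization gradient.
Details
Motivation: Existing CT denoising research rarely exploits projection-domain data characteristics to mitigate noise correlation. This paper addresses low-dose CT denoising through frequency-domain pseudo-label generation.
Result: Evaluation on multiple public and real-world datasets supports the method's potential for clinical application. The authors additionally claim it will have a revolutionary impact on the denoising field.
Insight: The novelty is the first zero-shot self-supervised pseudo-label generation in the frequency domain, which uses frequency-domain properties to isolate noise and strengthens denoising through regional low-frequency anchoring and amplitude modulation. Viewed objectively, its frequency-domain processing and sample truncation strategy can effectively improve denoising performance.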
Abstract: Despite extensive research on computed tomography (CT) denoising, few studies exploit projection-domain data characteristics to mitigate noise correlation. To address this, this work proposes FrequencyCT, the first zero-shot self-supervised method for pseudo-label generation in the frequency domain for low-dose CT denoising. Leveraging the characteristic of the frequency domain that largely isolates noise from clean signals, a regional low-frequency anchoring technique is proposed. Phase-preserving amplitude modulation and mask perturbation in the high-frequency region generate pseudo-label data for self-supervision. The fluctuating noise variance in the projection domain prompts truncation of the generated samples to stabilize the network’s optimization gradient. Evaluation results on multiple public and real-world datasets confirm the clinical application potential of this research, which will have a revolutionary impact on the field of denoising. The code can be obtained from https://github.com/yqx7150/FrequencyCT.
[260] CausalGS: Learning Physical Causality of 3D Dynamic Scenes with Gaussian Representations cs.CVPDF
Nengbo Lu, Minghua Pan
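TL;DR: CausalGS is a framework that learns the physical causality of 3D dynamic scenes from multi-view videos. It infers physical laws by decoupling the initial velocity field from intrinsic material properties and uses a differentiable physics simulator for regularized learning, enabling long-term future frame prediction and novel view interpolation without explicit priors.
Details
Motivation: Existing methods rely on partial differential equations as soft constraints or on physics simulators, and require strong priors or high-quality geometry reconstruction. This paper aims to learn the causal dynamics of complex dynamic 3D scenes from multi-view videos alone, removing the dependence on explicit priors.
Result: CausalGS surpasses the state of the art on long-term future frame extrapolation and shows advanced performance on novel view interpolation. Experiments validate that the model learns physical properties and causal relationships from visual observations alone.
Insight: The novelty is to decouple the complex dynamics problem into the joint inference of an initial velocity field and material properties, and to achieve physics-regularized learning through a differentiable physics simulator, showing the potential of learning physical causality from purely visual data without supervision.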
Abstract: Learning a physical model from video data that can comprehend physical laws and predict the future trajectories of objects is a formidable challenge in artificial intelligence. Prior approaches either leverage various Partial Differential Equations (PDEs) as soft constraints in the form of PINN losses, or integrate physics simulators into neural networks; however, they often rely on strong priors or high-quality geometry reconstruction. In this paper, we propose CausalGS, a framework that learns the causal dynamics of complex dynamic 3D scenes solely from multi-view videos, while dispensing with the reliance on explicit priors. At its core is an inverse physics inference module that decouples the complex dynamics problem from the video into the joint inference of two factors: the initial velocity field representing the scene’s kinematics, and the intrinsic material properties governing its dynamics. This inferred physical information is then utilized within a differentiable physics simulator to guide the learning process in a physics-regularized manner. Extensive experiments demonstrate that CausalGS surpasses the state-of-the-art on the highly challenging task of long-term future frame extrapolation, while also exhibiting advanced performance in novel view interpolation. Crucially, our work shows that, without any human annotation, the model is able to learn the complex interactions between multiple physical properties and understand the causal relationships driving the scene’s dynamic evolution, solely from visual observations.
[261] Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence cs.CVPDF
Yanbing Zhang, Bo Wang, Jianhui Liu, Nan Jiang, Jiaxiu Jiang
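TL;DR: This paper proposes a new paradigm called Thinking with Novel Views (TwNV), which integrates generative novel-view synthesis into the reasoning loop to address the weakness of current large multimodal models on spatial reasoning tasks that require viewpoint-dependent understanding. A Reasoner LMM identifies spatial ambiguity, instructs a Painter to synthesize an alternative viewpoint, and re-examines the scene with the additional evidence.
Details
Motivation: Current large multimodal models perform poorly on spatial reasoning tasks that require viewpoint-dependent understanding, mainly because they are confined to a single static observation. This paper aims to strengthen their spatial intelligence by introducing dynamic view synthesis.
Result: Systematic experiments across four spatial subtask categories and four LMM architectures show that TwNV consistently improves accuracy by 1.3 to 3.9 percentage points, with the largest gains on viewpoint-sensitive subtasks.
Insight: The contribution is to establish novel-view generation as a practical lever for improving the spatial intelligence of LMMs, and to analyze systematically the effects of instruction format, generation fidelity, and inference-time visual scaling. The findings are that numerical camera-pose specifications are more reliable than free-form language, that synthesized view quality is tightly coupled with downstream spatial accuracy, and that iterative multi-turn view refinement further improves performance.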
Abstract: Current Large Multimodal Models (LMMs) struggle with spatial reasoning tasks requiring viewpoint-dependent understanding, largely because they are confined to a single, static observation. We propose Thinking with Novel Views (TwNV), a paradigm that integrates generative novel-view synthesis into the reasoning loop: a Reasoner LMM identifies spatial ambiguity, instructs a Painter to synthesize an alternative viewpoint, and re-examines the scene with the additional evidence. Through systematic experiments we address three research questions. (1) Instruction format: numerical camera-pose specifications yield more reliable view control than free-form language. (2) Generation fidelity: synthesized view quality is tightly coupled with downstream spatial accuracy. (3) Inference-time visual scaling: iterative multi-turn view refinement further improves performance, echoing recent scaling trends in language reasoning. Across four spatial subtask categories and four LMM architectures (both closed- and open-source), TwNV consistently improves accuracy by +1.3 to +3.9 pp, with the largest gains on viewpoint-sensitive subtasks. These results establish novel-view generation as a practical lever for advancing spatial intelligence of LMMs.
[262] Segment Anything with Robust Uncertainty-Accuracy Correlation cs.CVPDF
Hongyou Zhou, Marc Toussaint, Ling Shao, Zihan Ye
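TL;DR: SAM becomes unreliable under domain shift because of Mask-level Confidence Confusion (MCC). To address this, the paper proposes RUAC, a robust pixel-wise uncertainty estimation method. RUAC adds a lightweight uncertainty head, trains it with a collaborative style-deformation attack, and applies Uncertainty-Accuracy Alignment, improving segmentation quality and uncertainty-accuracy correlation across many zero-shot domains.
Details
Motivation: SAM has strong zero-shot performance but is unreliable under domain shift, mainly because of Mask-level Confidence Confusion (MCC): a single IoU-based mask score fails to reflect pixel-wise reliability near boundaries. Motivated by the contrast between texture-biased shortcuts in neural networks and shape-centric processing in human vision, the paper models out-of-domain variation as appearance shifts and non-rigid deformations that jointly stress calibration.
Result: Across 23 zero-shot domains, RUAC improves segmentation quality and yields more reliable uncertainty with stronger uncertainty-accuracy correlation.
Insight: The contributions are: modeling out-of-domain variation as a joint perturbation of appearance and deformation; a collaborative style-deformation attack for training the uncertainty head; and Uncertainty-Accuracy Alignment, which ensures that uncertainty consistently highlights erroneous pixels even under adversarial perturbations. Viewed objectively, jointly modeling texture and geometry changes strengthens the model's robustness in unseen domains and the reliability of its uncertainty estimates.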
Abstract: Despite strong zero-shot performance, SAM is unreliable under domain shift due to Mask-level Confidence Confusion (MCC), where a single IoU-based mask score fails to reflect pixel-wise reliability near boundaries. Motivated by the contrast between texture-biased shortcuts in neural networks and shape-centric processing in human vision, we model out-of-domain variation as appearance shifts and non-rigid deformations that jointly stress calibration. We propose Segment Anything with Robust Uncertainty-Accuracy Correlation (RUAC) for robust pixel-wise uncertainty estimation under appearance and deformation shifts. RUAC adds a lightweight uncertainty head, trains it with a collaborative style-deformation attack that jointly perturbs texture and geometry, and applies Uncertainty-Accuracy Alignment to ensure uncertainty consistently highlights erroneous pixels even under adversarial perturbations. Across 23 zero-shot domains, RUAC improves segmentation quality and yields more faithful uncertainty with stronger uncertainty-accuracy correlation. Project page: https://github.com/HongyouZhou/ruac.git.
[263] Hypergraph-Enhanced Training-Free and Language-Free Few-Shot Anomaly Detection cs.CVPDF
Guohuan Xie, Xin He, Dingying Fan, Siqi Li, Yun Liu
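TL;DR: This paper proposes HyperFSAD, a new few-shot anomaly detection framework that requires no training and no language supervision and is robust across domains. Built on DINOv3 visual features and a hypergraph-based inference mechanism, it achieves state-of-the-art performance on several industrial and medical datasets through Sparse Hyper Matching and Dual-Branch Image Scoring.
Details
Motivation: Existing few-shot anomaly detection methods face three key challenges: dependence on task- or dataset-specific training or fine-tuning, reliance on language supervision or carefully hand-crafted prompts, and limited robustness across domains. This paper addresses these issues with a robust, purely visual framework that needs neither training nor language prompts.
Result: Under the strict training-free and language-free setting, HyperFSAD achieves state-of-the-art performance across six datasets: four industrial datasets (MVTecAD, VisA, MPDD, BTAD) and two medical datasets (RESC, BraTS).
Insight: The main contributions are: 1) Sparse Hyper Matching, which replaces sensitive nearest-neighbor matching by using sparsemax to select the most relevant support patches and aggregating them into a hyperedge to suppress background noise; 2) Dual-Branch Image Scoring, which fuses spatial anomaly evidence from the patch-level anomaly map with global semantic deviation captured by support-aware CLS matching, producing a robust image-level anomaly score in a purely visual manner. The whole framework is purely visual and avoids dependence on hand-crafted text prompts.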
Abstract: Few-shot anomaly detection (FSAD) has made significant strides, yet existing methods still face critical challenges: (i) dependence on task- or dataset-specific training/fine-tuning, (ii) reliance on language supervision or carefully hand-crafted prompts, and (iii) limited robustness across domains. In this paper, we introduce HyperFSAD, a novel FSAD framework that is training-free, language-free, and robust across domains, offering a powerful solution to these challenges. Built upon DINOv3 and a hypergraph-based inference mechanism, our approach performs inference without any task-specific optimization or text prompts, while remaining competitive. Specifically, we replace sensitive nearest-neighbor / top-$n$ matching with Sparse Hyper Matching: sparsemax first selects the most relevant support patches, which are then aggregated into a hyperedge as compact normal evidence to suppress background noise and distractors. We further introduce Dual-Branch Image Scoring, which fuses spatial anomaly evidence from the patch-grid anomaly map with global semantic deviation captured by support-aware CLS matching, yielding a robust image-level anomaly score in a strictly visual manner. Notably, all components of HyperFSAD are purely visual, eliminating the need for labor-intensive hand-crafted text prompts. Under the stringent training-free and language-free setting, HyperFSAD achieves state-of-the-art performance across six datasets spanning four industrial datasets (MVTecAD, VisA, MPDD, BTAD) and two medical datasets (RESC, BraTS).
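The sparsemax selection step can be made concrete with a short sketch. `sparsemax` below follows the standard sort-and-threshold algorithm for projecting scores onto the probability simplex. The `hyperedge_prototype` helper, which averages support patches with the resulting sparse weights, is a hypothetical reading of "aggregated into a hyperedge" and not the paper's implementation.

```python
def sparsemax(scores):
    """Project scores onto the probability simplex; many outputs are exactly 0."""
    ordered = sorted(scores, reverse=True)
    cumulative, tau = 0.0, 0.0
    for j, z in enumerate(ordered, start=1):
        cumulative += z
        # Entry j stays in the support while 1 + j*z exceeds the running sum.
        if 1.0 + j * z > cumulative:
            tau = (cumulative - 1.0) / j
    return [max(s - tau, 0.0) for s in scores]


def hyperedge_prototype(similarities, support_patches):
    """Hypothetical aggregation: sparsemax-weighted mean of support patch features."""
    weights = sparsemax(similarities)
    dim = len(support_patches[0])
    return [sum(w * p[d] for w, p in zip(weights, support_patches)) for d in range(dim)]


# Toy example: one query patch compared against three support patches.
similarities = [1.0, 0.8, 0.1]
weights = sparsemax(similarities)
```

Unlike softmax, sparsemax assigns exactly zero weight to weakly matching patches, so only a few support patches contribute to the aggregated evidence.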
[264] LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models cs.CV | cs.AIPDF
Nikolaos Gkalelis, Vasileios Mezaris
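TL;DR: This paper proposes LLaVA-CKD, a cascaded knowledge distillation framework that targets the high memory and compute requirements of deploying large vision-language models (VLMs). By introducing one or more teachers of intermediate capacity, it transfers knowledge bottom-up and step by step from a high-capacity teacher to a smaller student, mitigating the drop in transfer efficiency caused by an overly large teacher-student capacity gap.
Details
Motivation: Large VLMs perform well on tasks such as visual question answering, but their memory and compute costs limit practical deployment. Knowledge distillation is an effective remedy, yet an overly large capacity gap between teacher and student hinders efficient knowledge transfer.
Result: The method is applied to models built on the LLaVA architecture and evaluated on seven standard, publicly available VQA benchmarks, where it reaches state-of-the-art performance.
Insight: The core innovation is a cascaded distillation framework inspired by human formal education: intermediate-capacity teachers act as "bridges" so that knowledge is transferred in stages, improving distillation when the capacity gap is large. This offers a structured distillation strategy for efficiently compressing large multimodal models.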
Abstract: Large Vision-Language Models (VLMs) are successful in addressing a multitude of vision-language understanding tasks, such as Visual Question Answering (VQA), but their memory and compute requirements remain a concern for practical deployment. A promising class of techniques for mitigating this concern is Knowledge Distillation, where knowledge from a high-capacity Teacher network is transferred to a considerably smaller Student network. However, the capacity gap between the two networks is both a blessing and a curse: the smaller the Student network, the better its efficiency, and the larger the Teacher, the more knowledge it carries; yet, beyond a point, a larger capacity gap between the two leads to worse knowledge transfer. To counter this effect, we propose a bottom-up cascaded knowledge distillation (CKD) framework. Instead of treating knowledge transfer as an activity involving one high-capacity Teacher (or an ensemble of such), inspired by human formal education systems, we introduce one (or potentially more) additional Teacher(s) of intermediate capacity that gradually bring the Student network to the next level, where the next (higher-capacity) Teacher can take over. We provide a theoretical analysis to study the effect of cascaded distillation on the generalization performance of the Student. We apply the proposed framework to models built upon the LLaVA methodology and evaluate the derived models on seven standard, publicly available VQA benchmarks, demonstrating their SotA performance.
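A minimal sketch of the bottom-up ordering follows. It assumes the classic temperature-scaled KL distillation loss; the abstract does not state which loss LLaVA-CKD uses, and the teacher names and sizes below are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: the standard soft-target loss, not necessarily the paper's.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


def cascade_schedule(teachers):
    """Order (name, capacity) teachers bottom-up: smallest capacity first."""
    return [name for name, _ in sorted(teachers, key=lambda item: item[1])]


# Hypothetical teacher pool; the student would be distilled in this order.
stages = cascade_schedule([("teacher-large", 13), ("teacher-mid", 7)])
```

Each stage would minimize `distill_loss` against the current teacher before handing over to the next, larger one.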
[265] bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition cs.CV | cs.AIPDF
Michal Byra, Pawel Olszowiec, Grzegorz Stefanski, Grzegorz Gruszczynski, Alberto Presta
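TL;DR: This paper proposes bViT, a single-block recurrent Vision Transformer that processes an image by repeatedly applying one transformer block, in order to study how much of ViT depth requires layer-specific parameterization and how much can be realized through recurrent computation. On ImageNet-1K, a 12-step bViT-B matches standard ViT-B under the same training recipe and compute budget while using an order of magnitude fewer parameters.
Details
Motivation: To determine what fraction of ViT depth requires layer-specific transformations and how much can be achieved through recurrent computation, with a view to more parameter-efficient architectures.
Result: On ImageNet-1K classification, bViT-B reaches accuracy comparable to standard ViT-B; it also transfers competitively to downstream tasks and supports parameter-efficient fine-tuning.
Insight: The novelty is the single-block recurrent ViT design, which shows that a large fraction of ViT depth can be realized through recurrent reuse, provided the representation space is wide enough. Mechanistic analyses show that the shared block changes its effective behavior across recurrent steps rather than simply repeating the same computation, which the authors interpret as implicit depth multiplexing.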
Abstract: Vision Transformers (ViTs) are built by stacking independently parameterized blocks, but it remains unclear how much of this depth requires layer specific transformations and how much can be realized through recurrent computation. We study this question with bViT, a single-block recurrent ViT in which one transformer block is applied repeatedly to process an image. This architecture preserves the iterative structure of a deep ViT while removing layer specific block parameterization, providing a controlled setting for studying recurrence in vision. On ImageNet-1K, a 12-step bViT-B achieves accuracy comparable to standard ViT-B under the same training recipe and computational budget, while using an order of magnitude fewer parameters. We observe that recurrent performance improves with representation width, with wider bViTs recovering much more of the performance of standard ViTs than narrow variants. We interpret this behavior as implicit depth multiplexing, where a shared block expresses multiple step-dependent computations through the evolving hidden state. Beyond ImageNet classification, bViT transfers competitively to downstream tasks and enables parameter-efficient fine-tuning. Mechanistic analyses of activations, attention and step-specific pruning show that the shared block changes its effective behavior across recurrent steps rather than simply repeating the same computation. Our results suggest that a large fraction of ViT depth can be implemented through recurrent reuse, provided that the representation space is sufficiently wide.
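The difference between stacking and single-block recurrence can be sketched with a toy model. The residual map below is a stand-in for a full transformer block (no attention, no normalization), and every name in it is hypothetical; it only illustrates the weight sharing and the resulting parameter count.

```python
import math
import random


def make_block(width, seed):
    """Create one toy block: a width x width weight matrix."""
    rng = random.Random(seed)
    scale = width ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(width)] for _ in range(width)]


def apply_block(weights, x):
    """Residual update x + tanh(W x), standing in for a transformer block."""
    return [
        xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
        for xi, row in zip(x, weights)
    ]


def stacked_forward(blocks, x):
    """Standard ViT-style depth: one independently parameterized block per layer."""
    for weights in blocks:
        x = apply_block(weights, x)
    return x


def recurrent_forward(weights, x, steps):
    """bViT-style depth: the same block applied `steps` times."""
    for _ in range(steps):
        x = apply_block(weights, x)
    return x


def n_params(blocks):
    """Count parameters over a list of distinct blocks."""
    return sum(len(w) * len(w[0]) for w in blocks)


shared = make_block(4, seed=0)
stack = [make_block(4, seed=s) for s in range(12)]
ratio = n_params(stack) // n_params([shared])  # 12 blocks vs. 1 shared block
```

In this toy setting the 12-layer stack carries 12 times the block parameters of the recurrent model, while both perform 12 sequential block applications.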
[266] Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium cs.CV | cs.LGPDF
Qingxin Xiao, Peilin Zhao, Yangyang Zhao, Lingwei Dang, Qingyao Wu
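TL;DR: This paper proposes Adversarial Counter-Commonsense Equilibrium (ACE), a training-free framework that addresses the abnormal concentration of attention on irrelevant image tokens during decoding in multimodal large language models (MLLMs). The paper argues that these tokens are critical carriers of visual and narrative logic and that forcibly correcting them worsens the vision-language imbalance. ACE perturbs the visual context with counter-commonsense image patches and exploits the fact that authentic visual features stay stable under perturbation while hallucinated ones fluctuate; its dynamic game decoding strategy suppresses perturbation-sensitive priors and compensates for stable visual signals to restore balance.
Details
Motivation: Prior work treats attention concentration on irrelevant image tokens as invalid noise and forcibly corrects it, which the authors argue aggravates the imbalance between vision and language. The paper aims to show that hallucinations stem from an equilibrium imbalance between linguistic priors and visual information, and to restore that balance without training.
Result: Extensive experiments show that ACE, as a plug-and-play strategy, significantly improves model trustworthiness with negligible inference overhead.
Insight: The novelty lies in viewing decoding as a game and using adversarial counter-commonsense perturbations to separate stable visual features from fluctuating hallucinations, so that the decoding strategy can be adjusted dynamically to restore vision-language balance. This offers a new and efficient training-free perspective on mitigating MLLM hallucinations.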
Abstract: During MLLM decoding, attention often abnormally concentrates on irrelevant image tokens. While existing research dismisses this as invalid noise and forcibly redirects attention to compel focusing on key image information, we argue these tokens are critical carriers of visual and narrative logic, and such coercive corrections exacerbate visual-language imbalance. Adopting a “decoding-as-game” perspective, we reveal that hallucinations stem from an equilibrium imbalance between linguistic priors and visual information. We propose Adversarial Counter-Commonsense Equilibrium (ACE), a training-free framework that perturbs visual context via counter-commonsense patches. Leveraging the fact that authentic visual features remain stable under perturbation while hallucinations fluctuate, ACE implements a dynamic game decoding strategy. This approach precisely suppresses perturbation-sensitive priors while compensating for stable visual signals to restore balance. Extensive experiments demonstrate that ACE, as a plug-and-play strategy, enhances model trustworthiness with negligible inference overhead.
[267] AllocMV: Optimal Resource Allocation for Music Video Generation via Structured Persistent State cs.CV | cs.AI | cs.LG | cs.MAPDF
Huimin Wang, Leilei Ouyang, Chang Xia, Yongqi Kang, Yu Fu
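TL;DR: AllocMV is a hierarchical framework that formulates music video generation as a Multiple-Choice Knapsack Problem (MCKP). A global planner produces a structured persistent state comprising character entities, scene priors, and sharing graphs, and a dynamic-programming resource allocator distributes the budget across High-Gen, Mid-Gen, and Reuse branches to obtain the best trade-off between cost and perceived quality.
Details
Motivation: To address the prohibitive computational cost of generating long-horizon music videos and the difficulty of maintaining cross-shot consistency.
Result: Evaluated by Cost-Quality Ratio (CQR) under strict budget and rhythm constraints, AllocMV achieves an optimal trade-off between perceived quality and resource expenditure.
Insight: The paper formalizes video generation as an MCKP and optimizes resource allocation accordingly; introduces a structured persistent state to maintain cross-shot consistency; and proposes a divergence-based forking strategy that reuses visual prefixes to cut the cost of repeated musical motifs.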
Abstract: Generating long-horizon music videos (MVs) is frequently constrained by prohibitive computational costs and difficulty maintaining cross-shot consistency. We propose AllocMV, a hierarchical framework formulating music video synthesis as a Multiple-Choice Knapsack Problem (MCKP). AllocMV represents the video’s persistent state as a compact, structured object comprising character entities, scene priors, and sharing graphs, produced by a global planner prior to realization. By estimating segment saliency from multimodal cues, a group-level MCKP solver based on dynamic programming optimally allocates resources across High-Gen, Mid-Gen, and Reuse branches. For repetitive musical motifs, we implement a divergence-based forking strategy that reuses visual prefixes to reduce costs while ensuring motif-level continuity. Evaluated via the Cost-Quality Ratio (CQR), AllocMV achieves an optimal trade-off between perceived quality and resource expenditure under strict budgetary and rhythmic constraints.
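The MCKP formulation can be sketched as a textbook dynamic program: pick exactly one branch per segment so that total cost stays within budget and total quality is maximal. The integer costs, quality values, and the function name `solve_mckp` are illustrative assumptions; the paper's saliency estimation and group-level details are not reproduced here.

```python
def solve_mckp(groups, budget):
    """Choose one (name, cost, quality) option per group within an integer budget.

    Returns (best_quality, plan) or None if no feasible assignment exists.
    """
    neg_inf = float("-inf")
    best = [0.0] + [neg_inf] * budget  # best[c]: max quality at exact cost c
    history = []
    for options in groups:
        updated = [neg_inf] * (budget + 1)
        picked = [None] * (budget + 1)
        for cost_so_far in range(budget + 1):
            if best[cost_so_far] == neg_inf:
                continue  # this cost level is unreachable
            for name, cost, quality in options:
                total = cost_so_far + cost
                if total <= budget and best[cost_so_far] + quality > updated[total]:
                    updated[total] = best[cost_so_far] + quality
                    picked[total] = (name, cost_so_far)
        best = updated
        history.append(picked)
    end = max(range(budget + 1), key=lambda c: best[c])
    if best[end] == neg_inf:
        return None
    plan, cost = [], end
    for picked in reversed(history):  # walk back through the chosen options
        name, cost = picked[cost]
        plan.append(name)
    plan.reverse()
    return best[end], plan


# Hypothetical segments: a salient chorus, a verse, and a repeated motif.
segments = [
    [("high", 5, 10.0), ("mid", 3, 6.0), ("reuse", 1, 2.0)],
    [("high", 5, 6.0), ("mid", 3, 5.0), ("reuse", 1, 1.0)],
    [("high", 5, 7.0), ("mid", 3, 5.0), ("reuse", 1, 4.0)],
]
result = solve_mckp(segments, budget=9)  # → (19.0, ['high', 'mid', 'reuse'])
```

With a budget of 9, the solver spends the most on the salient segment and reuses earlier footage for the repeated motif, which mirrors the allocation behavior the abstract describes.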
[268] Qwen-Image-2.0 Technical Report cs.CVPDF
Bing Zhao, Chenfei Wu, Deqing Li, Hao Meng, Jiahao Li
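TL;DR: Qwen-Image-2.0 is an omni-capable image generation foundation model that unifies high-fidelity image generation and precise image editing in a single framework. It couples Qwen3-VL as the condition encoder with a Multimodal Diffusion Transformer, supported by large-scale data curation and a customized multi-stage training pipeline, to tackle the challenges existing models face in ultra-long text rendering, multilingual typography, high-resolution photorealism, robust instruction following, and efficient deployment.
Details
Motivation: In text-rich and compositionally complex scenarios, existing image generation models still struggle with ultra-long text rendering, multilingual typography, high-resolution photorealism, robust instruction following, and efficient deployment. Qwen-Image-2.0 aims to address these challenges within one unified framework.
Result: Extensive human evaluations show that Qwen-Image-2.0 substantially outperforms the previous Qwen-Image models in both generation and editing. It supports instructions of up to 1K tokens when generating text-rich content such as slides, posters, infographics, and comics, significantly improves multilingual text fidelity and typography, and enhances the detail, texture, and lighting coherence of photorealistic generation.
Insight: The main innovation is the joint modeling of a strong vision-language model, Qwen3-VL, as the condition encoder with a Multimodal Diffusion Transformer, which yields deep understanding of multimodal conditions while preserving flexible generation and editing. Large-scale data curation and the customized multi-stage training pipeline are also key to its performance. The work points toward more general, reliable, and practical image generation foundation models.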
Abstract: We present Qwen-Image-2.0, an omni-capable image generation foundation model that unifies high-fidelity generation and precise image editing within a single framework. Despite recent progress, existing models still struggle with ultra-long text rendering, multilingual typography, high-resolution photorealism, robust instruction following, and efficient deployment, especially in text-rich and compositionally complex scenarios. Qwen-Image-2.0 addresses these challenges by coupling Qwen3-VL as the condition encoder with a Multimodal Diffusion Transformer for joint condition-target modeling, supported by large-scale data curation and a customized multi-stage training pipeline. This enables strong multimodal understanding while preserving flexible generation and editing capabilities. The model supports instructions of up to 1K tokens for generating text-rich content such as slides, posters, infographics, and comics, while significantly improving multilingual text fidelity and typography. It also enhances photorealistic generation with richer details, more realistic textures, and coherent lighting, and follows complex prompts more reliably across diverse styles. Extensive human evaluations show that Qwen-Image-2.0 substantially outperforms previous Qwen-Image models in both generation and editing, marking a step toward more general, reliable, and practical image generation foundation models.
[269] iPay: Integrated Payment Action Recognition via Multimodal Networks and Adaptive Spatial Prior Learning cs.CV | cs.AIPDF
Kaicong Huang, Weiheng Oh, Thomas Guggisberg, Ruimin Ke
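TL;DR: This paper presents iPay, an integrated payment action recognition framework for onboard transit surveillance systems. It adopts a multimodal mixture-of-experts architecture that combines RGB and skeleton streams, with dual-attention fusion and a Spatial Difference Discriminator to improve recognition of subtle payment actions.
Details
Motivation: Existing vision- and skeleton-based action recognition methods are brittle under noisy onboard surveillance and depend on handcrafted features that generalize poorly. RGB features lack reliable temporal continuity, while skeleton features under-model the subtle local relative motions that distinguish payment actions.
Result: Experiments on more than 55 hours of collected real onboard surveillance footage (over 500 payment clips) show that iPay outperforms existing methods, reaching 83.45% recognition accuracy with competitive computational efficiency, which makes it suitable for edge deployment.
Insight: The novelty is a tightly coupled multimodal architecture: a dual-attention fusion stream provides skeleton-to-RGB temporal transfer and RGB-to-skeleton spatial enhancement, and a prior-driven Spatial Difference Discriminator explicitly models hand-to-anchor relative motion to improve task-specific discriminability.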
Abstract: Automated transit payment analysis is vital for scalable fare auditing and passenger analytics, yet practice still relies on limited manual inspection. Prior vision- and skeleton-based methods remain brittle under noisy onboard surveillance and often depend on poorly generalizable handcrafted features. Building on the success of graph convolutional networks in human action recognition, we observe that skeleton features excel at modeling global spatiotemporal dependencies but tend to underemphasize the subtle local relative motions that distinguish payment actions. In contrast, RGB features preserve fine-grained spatial details yet often lack reliable temporal continuity in surveillance footage. To address both system-level deployment needs and model-level design challenges, we present iPay, an integrated payment action recognition framework for onboard transit surveillance systems. iPay adopts a multimodal mixture-of-experts architecture with four tightly coupled streams: (1) an RGB expert stream emphasizing local evidence via region-focused computation; (2) a skeleton expert stream modeling articulated motion with a graph convolutional backbone; (3) a dual-attention fusion stream enabling skeleton-to-RGB temporal transfer and RGB-to-skeleton spatial enhancement; and (4) a prior-driven Spatial Difference Discriminator (SDD) that explicitly models hand-to-anchor relative motion to improve task-specific discriminability. We also collaborate with local transit agencies to collect over 55 hours of real onboard surveillance footage, yielding 500+ payment clips. Experiments show that iPay outperforms prior methods and achieves 83.45% recognition accuracy with competitive computational efficiency, making it suitable for edge deployment. Code is available at https://github.com/ccoopq/iPay.
[270] C-CoT: Counterfactual Chain-of-Thought with Vision-Language Models for Safe Autonomous Driving cs.CV | cs.ROPDF
Kefei Tian, Yuansheng Lian, Kai Yang, Xiangdong Chen, Shen Li
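TL;DR: This paper proposes a counterfactual chain-of-thought (C-CoT) framework that uses vision-language models (VLMs) to decompose safety-critical planning decisions in autonomous driving into five sequential stages: scene description, critical object identification, risk prediction, counterfactual risk reasoning, and final action planning. The goal is more robust and safer decision-making in complex urban scenes, especially at intersections.
Details
Motivation: Existing autonomous driving planners, whether rule-based or data-driven, struggle to capture complex scene semantics, infer potential risks, and make reliable decisions in rare, high-risk situations. Current VLM-based approaches also lack reflective and causal reasoning, which limits their overall robustness.
Result: On the DeepAccident-CCoT dataset, built on the DeepAccident benchmark, a Qwen2.5-VL (7B) model fine-tuned with low-rank adaptation achieves a risk prediction recall of 81.9%, reduces the collision rate to 3.52%, and lowers L2 error to 1.98 m. Ablation studies confirm the key role of counterfactual reasoning and the meta-action evaluation tree in improving safety and interpretability.
Insight: The novelty lies in combining counterfactual reasoning with chain-of-thought and introducing a structured meta-action evaluation tree that explicitly assesses the potential consequences of alternative action combinations. This establishes causal links between action choices and safety outcomes and improves robustness and interpretability in long-tail and out-of-distribution scenarios.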
Abstract: Safety-critical planning in complex environments, particularly at urban intersections, remains a fundamental challenge for autonomous driving. Existing methods, whether rule-based or data-driven, frequently struggle to capture complex scene semantics, infer potential risks, and make reliable decisions in rare, high-risk situations. While vision-language models (VLMs) offer promising approaches for safe decision-making in these environments, most current approaches lack reflective and causal reasoning, thereby limiting their overall robustness. To address this, we propose a counterfactual chain-of-thought (C-CoT) framework that leverages VLMs to decompose driving decisions into five sequential stages: scene description, critical object identification, risk prediction, counterfactual risk reasoning, and final action planning. Within the counterfactual reasoning stage, we introduce a structured meta-action evaluation tree to explicitly assess the potential consequences of alternative action combinations. This self-reflective reasoning establishes causal links between action choices and safety outcomes, improving robustness in long-tail and out-of-distribution scenarios. To validate our approach, we construct the DeepAccident-CCoT dataset based on the DeepAccident benchmark and fine-tune a Qwen2.5-VL (7B) model using low-rank adaptation. Our model achieves a risk prediction recall of 81.9%, reduces the collision rate to 3.52%, and lowers L2 error to 1.98 m. Ablation studies further confirm the critical role of counterfactual reasoning and the meta-action evaluation tree in enhancing safety and interpretability.
[271] TINS: Test-time ID-prototype-separated Negative Semantics Learning for OOD Detection cs.CVPDF
Yifeng Yang, Jubo Feng, Jing Xu, Xinbing Wang, Qinying Gu
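TL;DR: This paper proposes TINS, a test-time ID-prototype-separated negative semantics learning method that improves OOD detection with vision-language models. It learns sample-specific negative text embeddings through image-to-text modality inversion, adds ID-prototype-separated regularization to avoid contamination by ID semantics, and stabilizes negative semantics expansion with group-wise aggregation scoring and a buffer update strategy.
Details
Motivation: Existing negative-label-based OOD detection methods rely mainly on static negative labels constructed before inference, which cannot cover diverse and evolving OOD concepts. Learning negative semantics directly from potential OOD samples at test time, however, easily introduces ID contamination, so a test-time learning method that effectively separates ID from OOD semantics is needed.
Result: Across the Four-OOD, OpenOOD, Temporal-shift, and Various ID benchmarks, TINS consistently improves over strong baselines. On the Four-OOD benchmark with ImageNet-1K as ID, it reduces the average FPR95 from 14.04% to 6.72%.
Insight: The novelty is a test-time negative semantics learning mechanism with ID-prototype separation: modality inversion and regularization dynamically produce sample-specific negative embeddings, which avoids ID contamination and widens OOD coverage. The group-wise aggregation and buffer strategies add stability to test-time learning.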
Abstract: Vision-language models enable OOD detection by comparing image alignment with ID labels and negative semantics. Existing negative-label-based methods mainly rely on static negative labels constructed before inference, limiting their ability to cover diverse and evolving OOD concepts. Although test-time expansion provides a natural solution, naively learning negative semantics from potential OOD samples may introduce hard ID contamination. To address this issue, we propose a \textbf{T}est-time \textbf{I}D-prototype-separated \textbf{N}egative \textbf{S}emantics learning method, termed \textbf{TINS}. TINS learns sample-specific negative text embeddings via image-to-text modality inversion and introduces ID-prototype-separated regularization to keep them separated from ID semantics. To further stabilize negative semantics expansion, TINS employs group-wise aggregation scoring and a buffer update strategy. Extensive experiments across Four-OOD, OpenOOD, Temporal-shift, and Various ID settings show consistent improvements over strong baselines. Notably, on the Four-OOD benchmark with ImageNet-1K as ID, TINS reduces the average FPR95 from 14.04% to 6.72%. Our code is available at https://github.com/zxk1212/tins.
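The FPR95 metric quoted above can be computed as follows. This is a minimal sketch of the usual definition (false positive rate on OOD samples at the threshold that accepts 95% of ID samples), assuming higher scores mean "more ID-like"; the function name `fpr_at_tpr` is hypothetical and this is not the paper's evaluation code.

```python
import math


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted at the threshold that keeps `tpr` of ID samples."""
    ranked = sorted(id_scores, reverse=True)
    # Number of ID samples that must score at or above the threshold.
    keep = math.ceil(tpr * len(ranked) - 1e-9)
    threshold = ranked[keep - 1]
    accepted_ood = sum(1 for score in ood_scores if score >= threshold)
    return accepted_ood / len(ood_scores)


# Toy scores: 20 ID samples scored 1..20 and 4 OOD samples.
rate = fpr_at_tpr(list(range(1, 21)), [0, 1, 2, 3])  # → 0.5
```

In the toy call, 19 of the 20 ID samples must be accepted, so the threshold is 2 and two of the four OOD samples pass it.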
[272] RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology cs.CVPDF
Wenxuan Li, Pedro R. A. S. Bassi, Xinze Zhou, Jakob Wasserthal, Alan L. Yuille
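TL;DR: This paper introduces RadThinking, a visual question answering (VQA) dataset for longitudinal clinical reasoning in radiology. It contains VQA pairs at three difficulty tiers: foundation perception questions, single-step reasoning questions, and compositional questions that require multi-step chain-of-thought. The dataset is built on 20,362 CT scans from 9,131 patients across 43 cancer types, plus 2,077 healthy controls. RadThinking aims to let AI systems reason about cancer rather than merely detect it.
Details
Motivation: Current AI systems for cancer screening lack clinical reasoning ability. The paper addresses this by building a tiered VQA dataset grounded in clinical reporting standards, so that AI can emulate a radiologist's multi-step reasoning.
Result: The dataset covers 20,362 CT scans from 9,131 patients across 43 cancer groups, with 2,077 verified healthy controls. According to the authors, it is the first cancer-screening VQA corpus that stratifies questions by reasoning depth and grounds compositional questions in clinical reporting standards.
Insight: The novelty is stratifying VQA questions by reasoning depth (foundation perception, single-step reasoning, compositional reasoning) and providing chain-of-thought data for compositional questions that follows clinical standards. This supplies verifiable reward signals for reinforcement learning and supports systematic training and evaluation of clinical reasoning.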
Abstract: Cancer screening is a reasoning task. A radiologist observes findings, compares them to prior scans, integrates clinical context, and reaches a diagnostic conclusion confirmed by pathology. We present RadThinking, a Visual Question Answering (VQA) dataset that makes this reasoning explicit and trainable. RadThinking releases VQA pairs at three difficulty tiers. Foundation VQAs are atomic perception questions. Single-step reasoning VQAs apply one clinical rule. Compositional VQAs require multi-step chain-of-thought to reach a guideline category such as LI-RADS-5. For every compositional VQA, we release the chain of foundation VQAs that solves it. The chain follows the rules of the governing clinical reporting standard. The dataset spans 20,362 CT scans from 9,131 patients across 43 cancer groups, plus 2,077 verified healthy controls with >1-year follow-up. To our knowledge, RadThinking is the first cancer-screening VQA corpus that stratifies questions by reasoning depth and grounds compositions in clinical reporting standards. The foundation tier supplies atomic perception supervision. The compositional tier supplies chain-of-thought data and verifiable rewards for reinforcement-learning recipes such as DeepSeek-R1 and OpenAI o1. RadThinking enables systematic training and evaluation of whether AI systems can reason about cancer, not merely detect it.
[273] GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs cs.CV | cs.AIPDF
Mohamed Eltahir, Lama Ayash, Ali Habibullah, Tanveer Hussain, Naeemullah Khan
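TL;DR: This paper proposes GridProbe, a training-free posterior-probing inference paradigm for adaptive test-time compute in long-video vision-language models (VLMs). It scores evidence in answer space and adaptively selects question-relevant frames, reaching accuracy close to a single monolithic forward pass at sub-quadratic attention cost.
Details
Motivation: Long-video understanding in VLMs is bottlenecked by a single expensive forward pass over thousands of frames. Existing training-free frame selectors based on auxiliary encoder-space similarities are capped by contrastive pretraining and struggle with reasoning-heavy queries such as negation, cross-frame counting, and holistic summarization.
Result: On Video-MME-v2, GridProbe stays within 1.6 percentage points of the monolithic baseline in average accuracy at a 3.36× TFLOPs reduction; on LongVideoBench it Pareto-dominates the baseline (+0.9 pp at 0.35× compute). Decoupling a small 2B selector from a stronger 4B or 8B QA model improves on the 2B monolithic baseline by up to 4.0 pp at 0.52× compute on average, with no retraining.
Insight: The novelty is a grid-based posterior-probing mechanism (lightweight row and column probes) and a Shape-Adaptive Selection rule that use answer-space confidence for test-time adaptive compute, together with the ability to decouple the selector from the QA model and combine them flexibly. The interpretability of the importance maps opens future directions in behavioral diagnostics, grounding, and frame-selection distillation.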
Abstract: Long-video understanding in VLMs is bottlenecked by a single monolithic forward pass over thousands of frames at quadratic attention cost. A common mitigation is to select a small subset of informative frames before the forward pass; training-free selectors typically do this via auxiliary encoder-space similarities. Such signals are capped by contrastive pretraining, which usually fails on reasoning-heavy queries (negation, cross-frame counting, holistic summarization). We propose GridProbe, an efficient training-free posterior-probing inference paradigm that scores evidence in answer space using a frozen VLM’s own reasoning and then selects question-relevant frames adaptively, resulting in sub-quadratic attention cost with little to no accuracy loss. We arrange frames on a $K{\times}K$ grid and run lightweight row R and column C probes, where each probe reads its peak posterior as a query-conditioned confidence. The outer product of R and C yields an interpretable importance map whose skewness and kurtosis drive Shape-Adaptive Selection, a closed-form rule that reliably replaces the fixed frame budget $M$ with a per-question $M_{\mathrm{eff}}$. We show empirically that $M_{\mathrm{eff}}$ tracks intrinsic question difficulty without ever seeing the answer, a sign of test-time adaptive compute. On Video-MME-v2, GridProbe matches the monolithic baseline within $1.6$ pp Avg Acc at $3.36\times$ TFLOPs reduction, while on LongVideoBench it Pareto-dominates the baseline ($+0.9$ pp at $0.35\times$ compute). Because the selector and QA models can be decoupled, pairing a small 2B selector with a stronger 4B or 8B QA model is strictly Pareto-dominant over the 2B monolithic baseline (up to $+4.0$ pp at $0.52\times$ compute, on average), with no retraining. Finally, the interpretability of the importance maps opens future avenues for behavioral diagnostics, grounding, and frame-selection distillation.
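The importance map and its shape statistics can be sketched as follows. The abstract does not give the closed-form Shape-Adaptive Selection rule, so this sketch takes the frame count `m` as an input instead of deriving $M_{\mathrm{eff}}$; the row-major frame layout and all function names are assumptions.

```python
def importance_map(row_conf, col_conf):
    """Outer product of row-probe and column-probe confidences."""
    return [[r * c for c in col_conf] for r in row_conf]


def shape_stats(grid):
    """Skewness and excess kurtosis of the flattened importance map."""
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0.0:
        return 0.0, 0.0  # a flat map has no shape to measure
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def top_frames(grid, m):
    """Indices of the m highest-importance frames, assuming row-major layout."""
    width = len(grid[0])
    cells = [(v, i * width + j) for i, row in enumerate(grid) for j, v in enumerate(row)]
    cells.sort(key=lambda cell: (-cell[0], cell[1]))
    return sorted(index for _, index in cells[:m])


grid = importance_map([0.9, 0.2, 0.1], [0.1, 0.8, 0.1])
skewness, kurtosis = shape_stats(grid)
selected = top_frames(grid, 2)  # → [1, 4]
```

A sharply peaked map such as this one has positive skewness, which is the kind of signal a shape-based rule could use to keep fewer frames for easy questions.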
[274] Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization cs.CV | cs.AIPDF
Mengqi He, Xinyu Tian, Xin Shen, Shu Zou, Jinhong Ni
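TL;DR: This paper proposes UJEM-KL, a lightweight untargeted jailbreak attack that flips refusal responses by maximizing entropy at the high-entropy decision tokens in a vision-language model's (VLM's) autoregressive decoding, while stabilizing low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks, it achieves competitive white-box attack success rates, improves cross-model transferability, and remains effective under representative defenses.
Details
Motivation: Existing gradient-based universal image jailbreaks on VLMs transfer poorly across models. The paper revisits that conclusion under a strictly untargeted threat model and explores more effective transferable multimodal jailbreaks.
Result: On three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves cross-model transferability while remaining effective under representative defenses, indicating that limited transferability stems mainly from overly constrained optimization objectives.
Insight: The key finding is that refusal behavior concentrates at high-entropy tokens during autoregressive decoding. Maximizing entropy at these decision tokens flips refusal outcomes, while stabilizing low-entropy positions preserves semantic coherence. This informs the design of more general, transferable multimodal adversarial attacks.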
Abstract: Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model without enforcing a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already carry substantial probability mass among the top-ranked candidates before attack. Motivated by this finding, we propose Untargeted Jailbreak via Entropy Maximization (UJEM)-KL, a lightweight attack that maximizes entropy at these decision tokens to flip refusal outcomes, while stabilizing the remaining low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under representative defenses. Our experimental results indicate that the limited transferability primarily stems from overly constrained optimization objectives.
[275] Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning cs.CV | cs.AI | cs.LGPDF
Tao Hu, Da-Wei Zhou
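TL;DR: This paper proposes DRAPE, a dynamic cross-modal prompt generation framework that addresses catastrophic forgetting in multimodal continual instruction tuning. It derives queries from the textual instruction, cross-attends to visual features, and generates instance-specific soft prompts for each query-image pair instead of relying on a fixed combination of task-level modules.
Details
Motivation: Existing methods mainly follow a module-composition paradigm: they maintain task-level prompts or LoRA experts and dynamically route or aggregate them at inference. Samples within the same task can still differ substantially in visual scene, question intent, and reasoning demands, which calls for instance-level adaptation to individual query-image pairs rather than only selecting or combining task-level modules.
Result: Extensive experiments on multimodal continual instruction tuning benchmarks show that DRAPE achieves state-of-the-art performance among prompt-based and LoRA-based continual-learning baselines.
Insight: The novelty is a framework that dynamically generates instance-specific soft prompts, achieving fine-grained adaptation through cross-modal interaction. It mitigates forgetting during sequential updates with null-space gradient projection and CLIP-based prototype routing, and it selects the generator at inference without task labels.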
Abstract: Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, yet real-world deployment often requires continual capability expansion across sequential tasks. In such scenarios, Multimodal Continual Instruction Tuning (MCIT) aims to acquire new capabilities while limiting catastrophic forgetting. Existing methods mainly follow a module-composition paradigm: they maintain task-level prompts or LoRA experts and dynamically route or aggregate a subset of them at inference. However, samples within the same task can still differ substantially in visual scenes, question intents, and reasoning demands. This motivates instance-level adaptation to individual query-image pairs rather than only selecting or combining task-level modules. To this end, we propose DRAPE (Dynamic Cross-Modal Prompt Generation), a prompt-learning framework that synthesizes continuous instance-specific soft prompts for MCIT. Instead of selecting prompts from a fixed pool, DRAPE derives prompt queries from the textual instruction and cross-attends to visual patch features, producing query-image conditioned prompts that are prepended to the frozen LLM. To mitigate forgetting during sequential updates, DRAPE applies null-space gradient projection to the shared projector and uses CLIP-based prototype routing for task-label-free generator selection at inference. Extensive experiments on MCIT benchmarks show that DRAPE achieves state-of-the-art performance among representative prompt-based and LoRA-based continual-learning baselines.
[276] MPerS: Dynamic MLLM MixExperts Perception-Guided Remote Sensing Scene Segmentation cs.CV | cs.AIPDF
Ziyi Wang, Xianping Ma, Ziyao Wang, Hongyang Zhang, Man On Pun
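TL;DR: This paper proposes MPerS, a dynamic MLLM mixture-of-experts perception-guided framework for remote sensing scene segmentation. It designs multiple prompts that guide MLLMs to produce high-quality remote sensing captions from different expert perspectives, extracts dense visual features with DINOv3, adaptively fuses the most effective textual semantics through a Dynamic MixExperts module, and uses Linguistic Query Guided Attention so that textual semantics guide visual features toward precise segmentation.
Details
Motivation: For complex remote sensing scenes, existing work concentrates on architectural optimizations for fusing textual semantics with visual features, while neglecting the generation of high-quality remote sensing captions and the study of their effectiveness in multimodal semantic fusion.
Result: The method achieves superior performance on three public remote sensing semantic segmentation datasets.
Insight: The novelty lies in highlighting and systematically addressing high-quality remote sensing caption generation, and in designing a Dynamic MixExperts module that adaptively selects effective textual semantics together with a Linguistic Query Guided Attention mechanism that lets text guide visual segmentation precisely.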
Abstract: The multimodal fusion of images and scene captions has been extensively explored and applied in various fields. However, when dealing with complex remote sensing (RS) scenes, existing studies have predominantly concentrated on architectural optimizations for integrating textual semantic information with visual features, while largely neglecting the generation of high-quality RS captions and the investigation of their effectiveness in multimodal semantic fusion. In this context, we propose the Dynamic MLLM Mixture-of-Experts Perception-Guided Remote Sensing Scene Segmentation, referred to as MPerS. We design multiple prompts for MLLMs to generate high-quality RS captions, enabling MLLMs to perceive RS scenes from diverse expert perspectives. DINOv3 is employed to extract dense visual representations of land covers. We design a Dynamic MixExperts module that adaptively integrates the most effective textual semantics. Linguistic Query Guided Attention is constructed to utilize textual semantic information to guide visual features for precise segmentation. The MLLMs include LLaVA, ChatGPT, and Qwen. Our method achieves superior performance on three public semantic segmentation RS datasets.
[277] Towards a Large Language-Vision Question Answering Model for MSTAR Automatic Target Recognition cs.CV | cs.AI | eess.IVPDF
David F. Ramirez, Tim L. Overman, Kristen Jaskie, Marv Kleine, Andreas Spanias
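TL;DR: This study explores applying large language-vision models to automatic target recognition in synthetic aperture radar (SAR) imagery. It builds a benchmark on the MSTAR dataset with descriptive captions and visual question-answer pairs, and reaches 98% accuracy on fine-grained target recognition through parameter-efficient fine-tuning.
Details
Motivation: To address the difficulty of accurately identifying and differentiating military vehicle types in SAR imagery, especially under complex environmental conditions, a skill that takes human analysts long training to acquire, and to advance machine-assisted remote sensing ATR for military and intelligence applications.
Result: On the SAR training and evaluation benchmark derived from the MSTAR dataset, the parameter-efficiently fine-tuned model reaches 98% accuracy on fine-grained target recognition.
Insight: The authors describe the work as a unique effort to apply LLVMs (such as the CLIP and LLaVA architectures) specifically to visual question answering and automatic target recognition on SAR imagery, with a dedicated challenge dataset that pushes models to identify nuanced ATR details. Viewed objectively, the parameter-efficient fine-tuning approach improves domain-specific performance while retaining model generality, offering a new direction for multimodal learning in remote sensing.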
Abstract: Large language-vision models (LLVM), such as OpenAI’s ChatGPT and GPT-4, have gained prominence as powerful tools for analyzing text and imagery. The merging of these data domains represents a significant paradigm shift with far-reaching implications for automatic target recognition (ATR). Recent transformer-based LLVM research has shown substantial improvements for geospatial perception tasks. Our study examines the application of LLVM to remote sensing image captioning and visual question-answering (VQA), with a specific focus on synthetic aperture radar (SAR) imagery. We examine newly published LLVM methods, including CLIP and LLaVA neural network transformer architectures. We have developed a work-in-progress SAR training and evaluation benchmark derived from the MSTAR Public Dataset. This has been extended to include descriptive text captions and question-answer pairs for VQA tasks. This challenge dataset is designed to push the boundaries of an LLVM in identifying nuanced ATR details in SAR imagery. Utilizing parameter-efficient fine-tuning, we train an LLVM method to identify fine-grained target qualities at 98% accuracy. We detail our data setup and experiments, addressing potential pitfalls that could lead to misleading conclusions. Accurately identifying and differentiating military vehicle types in SAR data poses a critical challenge, especially under complex environmental conditions. Mastering this target recognition skill may require a human analyst months of training and years of practice. This research represents a unique effort to apply LLVM to SAR applications, advancing machine-assisted remote sensing ATR for military and intelligence contexts.
[278] Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization cs.CV | cs.AIPDF
Xuanyu Zhu, Yan Bai, Yang Shi, Yihang Lou, Yuanxing Zhang
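TL;DR: This paper proposes DRoRAE (Depth-Routed Representation AutoEncoder), a visual tokenizer that enriches the latent representation by adaptively fusing multi-layer features from a pretrained vision encoder, substantially improving image reconstruction and generation quality.
Details
Motivation: Existing methods use only the last-layer features of the pretrained encoder and discard the rich hierarchical visual information in intermediate layers, so low-level details are heavily attenuated during semantic abstraction. The paper aims to recover this lost information by fusing multi-layer features.
Result: On ImageNet-256, DRoRAE reduces reconstruction rFID from 0.57 to 0.29 and improves generation FID (with AutoGuidance) from 1.74 to 1.65; the gains also transfer to text-to-image synthesis.
Insight: The core innovation is a lightweight module that adaptively fuses multi-layer features via energy-constrained routing and incremental correction. The paper also finds a log-linear scaling law between fusion capacity and reconstruction quality, establishing "representation richness" as a new, predictably scalable dimension for visual tokenizers, analogous to vocabulary size in NLP.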
Abstract: Representation autoencoders that reuse frozen pretrained vision encoders as visual tokenizers have achieved strong reconstruction and generation quality. However, existing methods universally extract features from only the last encoder layer, discarding the rich hierarchical information distributed across intermediate layers. We show that low-level visual details survive in the last layer merely as attenuated residuals after multiple layers of semantic abstraction, and that explicitly fusing multi-layer features can substantially recover this lost information. We propose DRoRAE (Depth-Routed Representation AutoEncoder), a lightweight fusion module that adaptively aggregates all encoder layers via energy-constrained routing and incremental correction, producing an enriched latent compatible with a frozen pretrained decoder. A three-phase decoupled training strategy first learns the fusion under the implicit distributional constraint of the frozen decoder, then fine-tunes the decoder to fully exploit the enriched representation. On ImageNet-256, DRoRAE reduces rFID from 0.57 to 0.29 and improves generation FID from 1.74 to 1.65 (with AutoGuidance), with gains also transferring to text-to-image synthesis. Furthermore, we uncover a log-linear scaling law ($R^2{=}0.86$) between fusion capacity and reconstruction quality, identifying \textit{representation richness} as a new, predictably scalable dimension for visual tokenizers analogous to vocabulary size in NLP.
[279] PhyGround: Benchmarking Physical Reasoning in Generative World Models cs.CV | cs.AI | cs.LGPDF
Juyi Lin, Arash Akbari, Yumei He, Lin Zhao, Haichao Zhang
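TL;DR: This paper presents PhyGround, a benchmark for evaluating the physical reasoning of generative world models in video generation. It comprises 250 curated prompts and a taxonomy of 13 physical laws, and it systematically evaluates eight modern video generation models through large-scale human annotation and an automated judge, PhyJudge-9B.
Details
Motivation: Existing ways of evaluating whether generated videos obey physical laws face three main challenges: coarse evaluation frameworks, bias and fatigue in human annotation, and automated evaluators that are insufficiently physics-aware or hard to audit.
Result: Eight modern video generation models are evaluated in a large-scale, quality-controlled human study in which 459 annotators provided 5,796 complete annotations and over 37.4K fine-grained labels; the retained annotations show high split-half model-ranking correlation (Spearman's rho > 0.90). The automated judge PhyJudge-9B has substantially lower aggregate relative bias than Gemini-3.1-Pro (3.3% vs. 16.6%).
Insight: The novelty is a criteria-grounded, diagnostic benchmark for physical reasoning that operationalizes physical laws as observable sub-questions, together with a quality-controlled human evaluation protocol inspired by social science experiment design and an open, physics-specialized vision-language judge that supports reproducible automated evaluation.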
Abstract: Generative world models are increasingly used for video generation, where learned simulators are expected to capture the physical rules that govern real-world dynamics. However, evaluating whether generated videos actually follow these rules remains challenging. Existing physics-focused video benchmarks have made important progress, but they still face three key challenges, including the coarse evaluation frameworks that hide law-specific failures, response biases and fatigue that undermine the validity of annotation judgments, and automated evaluators that are insufficiently physics-aware or difficult to audit. To address those challenges, we introduce PhyGround, a criteria-grounded benchmark for evaluating physical reasoning in video generation. The benchmark contains 250 curated prompts, each augmented with an expected physical outcome, and a taxonomy of 13 physical laws across solid-body mechanics, fluid dynamics, and optics. Each law is operationalized through observable sub-questions to enable per-law diagnostics. We evaluate eight modern video generation models through a large-scale, quality-controlled human study, grounded on social science lab experiment design. A total of 459 annotators provided 5,796 complete annotations and over 37.4K fine-grained labels; after quality control, the retained annotations exhibited high split-half model-ranking correlations (Spearman’s rho > 0.90). To support reproducible automated evaluation, we release PhyJudge-9B, an open physics-specialized VLM judge. PhyJudge-9B achieves substantially lower aggregate relative bias than Gemini-3.1-Pro (3.3% vs. 16.6%). We release prompts, human annotations, model checkpoints, and evaluation code on the project page https://phyground.github.io/.
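The split-half ranking check can be sketched in a few lines: score each model separately on two halves of the annotations and compute Spearman's rho between the two score lists. The per-model scores below are made-up placeholders, not data from the paper, and the helper names are hypothetical.

```python
import math


def ranks(values):
    """Average ranks (1-based), with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1
        for position in range(start, end + 1):
            result[order[position]] = average
        start = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(ranks(x), ranks(y))


# Placeholder per-model mean scores from two random halves of the annotators.
half_a = [3.1, 2.4, 4.0, 1.2]
half_b = [2.9, 2.5, 3.8, 1.0]
rho = spearman(half_a, half_b)
```

A rho close to 1 means both halves of the annotator pool rank the models the same way, which is the reliability property the abstract reports.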
[280] Predicting 3D structure by latent posterior sampling cs.CV | cs.LGPDF
Azmi Haider, Dan Rosenbaum
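TL;DR: This paper proposes a method that combines a neural radiance field (NeRF) representation with the probabilistic inference of diffusion models. It models the 3D scene as a stochastic latent variable and uses a diffusion model for posterior sampling, reconstructing 3D structure from several kinds of observations (single-view, multi-view, noisy images, sparse pixels, and sparse depth data).
Details
Motivation: Combine the strengths of 2D image generative models (such as diffusion models) and neural field representations of 3D scenes (such as NeRF), treating 3D reconstruction as a perception problem with inherent uncertainty that benefits from probabilistic inference.
Result: Experiments show that the method accurately predicts 3D structure from different types of observations and models the uncertainty in each task. The abstract does not mention specific benchmarks or comparisons with state-of-the-art methods.
Insight: The novel elements are representing the 3D scene as a stochastic latent variable, a two-stage training process (first training the reconstruction model while auto-decoding the latents, then training a diffusion model as the prior), and a posterior sampling framework that combines the diffusion model’s score-based inference with a volumetric rendering likelihood, which generalizes across reconstruction tasks.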
Abstract: The remarkable achievements of both generative models of 2D images and neural field representations for 3D scenes present a compelling opportunity to integrate the strengths of both approaches. In this work, we propose a methodology that combines a NeRF-based representation of 3D scenes with probabilistic modeling and reasoning using diffusion models. We view 3D reconstruction as a perception problem with inherent uncertainty that can thereby benefit from probabilistic inference methods. The core idea is to represent the 3D scene as a stochastic latent variable for which we can learn a prior and use it to perform posterior inference given a set of observations. We formulate posterior sampling using the score-based inference method of diffusion models in conjunction with a likelihood term computed from a reconstruction model that includes volumetric rendering. We train the model using a two-stage process: first we train the reconstruction model while auto-decoding the latent representations for a dataset of 3D scenes, and then we train the prior over the latents using a diffusion model. By using the model to generate samples from the posterior we demonstrate that various 3D reconstruction tasks can be performed, differing by the type of observation used as inputs. We showcase reconstruction from single-view, multi-view, noisy images, sparse pixels, and sparse depth data. These observations vary in the amount of information they provide for the scene and we show that our method can model the varying levels of inherent uncertainty associated with each task. Our experiments illustrate that this approach yields a comprehensive method capable of accurately predicting 3D structure from diverse types of observations.
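The combination of a learned prior with a rendering-based likelihood follows from Bayes' rule. As a sketch in assumed notation (z for the scene latent and y for the observations, neither symbol taken from the paper), the posterior score splits into two terms:

```latex
\nabla_{z} \log p(z \mid y)
  = \nabla_{z} \log p(z) + \nabla_{z} \log p(y \mid z)
```

The first term is the prior score, which a diffusion model trained on the latents can approximate, and the second is the gradient of the likelihood computed through the reconstruction model and its volumetric renderer. How the likelihood term is evaluated at intermediate noise levels is a design choice the abstract does not specify.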
[281] MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection cs.CV | cs.AIPDF
Xiran Zhao, Jing Jin, Yan Bai, Zhongan Wang, Yifeng Sun
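TL;DR: This paper presents MMVIAD, which the authors describe as the first continuous multi-view video dataset for industrial anomaly detection, with an accompanying multi-task evaluation benchmark. To address the weaknesses of existing video multimodal large language models in fine-grained defect recognition and temporal grounding, the authors develop a two-stage post-training pipeline consisting of Perception-Structured Supervised Fine-Tuning (PS-SFT) and Visibility-grounded Industrial Structured Temporal Anomaly Group Relative Policy Optimization (VISTA-GRPO). The resulting model, VISTA, substantially improves multi-task performance on the unseen test set.
Details
Motivation: Existing industrial anomaly detection datasets are mainly built from static images or sparse views and do not adequately reflect the continuous inspection process of real industrial settings, so a continuous multi-view video dataset and a corresponding multi-task benchmark are needed.
Result: On the MMVIAD-Unseen test set, VISTA raises the base model’s average score across the four tasks from 45.0 to 57.5, surpassing GPT-5.4. Current commercial and open-source video MLLMs nonetheless remain far below human performance, especially in fine-grained defect recognition and temporal grounding.
Insight: The contributions are the MMVIAD dataset, described as the first continuous multi-view video dataset for industrial anomaly detection, and a two-stage post-training pipeline that improves transferable anomaly understanding by combining perception-structured reasoning initialization with reinforcement learning driven by a semantic-gated defect reward and a visibility-aware temporal reward.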
Abstract: Industrial anomaly detection is critical for manufacturing quality control, yet existing datasets mainly focus on static images or sparse views, which do not fully reflect continuous inspection processes in real industrial scenarios. We introduce MMVIAD (Multi-view Multi-task Video Industrial Anomaly Detection), to the best of our knowledge the first continuous multi-view video dataset for industrial anomaly detection and understanding, together with a benchmark for multi-task evaluation. MMVIAD contains object-centric 2-second inspection clips with approximately 120 degrees of camera motion, covering 48 object categories, 14 environments, and 6 structural anomaly types. It supports anomaly detection, defect classification, object classification, and anomaly visible-time localization. Systematic evaluations on MMVIAD show that current commercial and open-source video MLLMs remain far below human performance, especially for fine-grained defect recognition and temporal grounding. To improve transferable anomaly understanding, we further develop a two-stage post-training pipeline where PS-SFT (Perception-Structured Supervised Fine-Tuning) initializes perception-structured reasoning and VISTA-GRPO (Visibility-grounded Industrial Structured Temporal Anomaly Group Relative Policy Optimization) refines the model with semantic-gated defect reward and visibility-aware temporal reward, producing the final model VISTA. On MMVIAD-Unseen, VISTA improves the base model’s average score across the four tasks from 45.0 to 57.5, surpassing GPT-5.4. Source code is available at https://github.com/Georgekeepmoving/MMVIAD.
[282] Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA cs.CVPDF
Ruinan Jin, Beidi Zhao, Myeongkyun Kang, Qiong Zhang, Xiaoxiao Li
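TL;DR: This paper proposes a diagnostic framework for mapping the reliability boundary of vision-language model (VLM) self-verification in medical visual question answering (VQA). It shows that self-verification suffers from a “verification mirage”: the verifier over-agrees with the generator’s wrong answers, producing both high verifier error and high agreement bias, and the effect varies by task.
Details
Motivation: Self-verification is widely used as a default safety layer in medical VQA. The authors argue that this practice is fundamentally unreliable and set out to map its reliability boundary and potential risks through systematic analysis.
Result: Six open-weight VLMs were evaluated on five medical VQA datasets and seven medical tasks. Verification reliability depends strongly on task type: knowledge-intensive clinical tasks fall deepest into the verification mirage, perceptual tasks come next, and simpler tasks are more resistant. Cross-verification reduces but does not eliminate the mirage, and in multi-turn verification loops most initially wrong answers become locked in by false verification.
Insight: The novelty is the decomposition of verifier behavior into discrimination capability and agreement bias, along with the concepts of the “verification mirage” and the “lazy verifier”. The study exposes the limits of self-verification as a standalone safety measure in high-stakes domains such as medicine and highlights how strongly reliability depends on the task, which is relevant for designing more dependable VLM safety mechanisms.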
Abstract: Self-verification, re-invoking the same vision language model (VLM) in a fresh context to check its own generated answer, is increasingly used as a default safety layer for medical visual question answering (VQA). We argue that this practice is fundamentally unreliable. We introduce [METHOD NAME], a diagnostic framework for mapping the reliability boundary of medical VLM self-verification by decomposing verifier behavior into discrimination capability and agreement bias. Because the verifier and answer generator are capacity-coupled, the verifier can overly agree with the generator, creating a verification mirage: a regime with both high verifier error and high agreement bias, driven by false acceptance of incorrect answers. Evaluating six open-weight VLMs across five medical VQA datasets and seven medical tasks, we find that this boundary is strongly task-conditioned. Knowledge-intensive clinical tasks fall deepest into the mirage, simpler tasks are more resistant, and perceptual tasks lie in between. Verification also fails to provide an independent safety signal: logistic mixed-effects analysis shows that verifier error and agreement bias become more likely when the generator is wrong, while saliency analyses show that verifiers under-attend to image evidence relative to generators, a phenomenon we call the lazy verifier. Cross-verification reduces but does not eliminate the mirage. Moreover, when verification is reused in multi-turn actor-verifier loops, most initially wrong answers become locked in by false verification. Since our experiments use clean benchmarks, the observed reliability boundary likely underestimates failures in real clinical deployment.
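To make the decomposition of verifier behavior concrete, the sketch below computes a few summary rates from paired outcomes. The abstract does not give the paper's formulas for discrimination capability or agreement bias, so the quantities here (overall verifier error, agreement rate, and acceptance rates conditioned on generator correctness) are assumed stand-ins, and the function name is hypothetical.

```python
def verifier_stats(records):
    """records: iterable of (generator_correct, verifier_accepts) boolean pairs.

    Returns illustrative rates only; these are assumptions, not the
    paper's definitions of discrimination or agreement bias.
    """
    records = list(records)
    n = len(records)
    wrong = [r for r in records if not r[0]]
    right = [r for r in records if r[0]]
    # The verifier errs when it accepts a wrong answer or rejects a right one.
    errors = sum(1 for ok, accepted in records if ok != accepted)
    false_accept = sum(acc for _, acc in wrong) / len(wrong) if wrong else 0.0
    true_accept = sum(acc for _, acc in right) / len(right) if right else 0.0
    return {
        "verifier_error": errors / n,
        "agreement_rate": sum(acc for _, acc in records) / n,
        "false_accept_rate": false_accept,
        "true_accept_rate": true_accept,
        # A simple separation score: how much more often right answers are accepted.
        "separation": true_accept - false_accept,
    }
```

Under these assumed definitions, a verifier with a high agreement rate and a high false-accept rate matches the pattern the abstract attributes to the mirage regime, where error is driven by false acceptance of incorrect answers.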
[283] Is Your Driving World Model an All-Around Player? cs.CV | cs.ROPDF
Lingdong Kong, Ao Liang, Tianyi Yan, Hongsi Liu, Wesley Yang
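TL;DR: This paper presents WorldLens, a unified benchmark for comprehensively evaluating the fidelity of driving world models. It spans pixel quality, 4D geometry, closed-loop driving, and human perceptual alignment, organized into five complementary aspects and 24 standardized dimensions. The study finds that every existing model has limitations, and it further contributes the human-annotated preference dataset WorldLens-26K and the vision-language evaluator WorldLens-Agent, forming an ecosystem for assessing the physical and behavioral fidelity of generated worlds.
Details
Motivation: Driving world models that generate realistic video are uneven: some produce photorealistic textures but violate physics, while others are geometrically consistent but fail in closed-loop planning. The field rarely evaluates whether generated worlds behave realistically, which exposes a critical gap.
Result: An evaluation of six representative models shows that no existing method dominates on all axes: texture-rich models violate geometry, geometry-aware models lack behavioral fidelity, and even the strongest models score only 2-3 out of 10 on human realism ratings.
Insight: The novelty is WorldLens, presented as the first unified benchmark for comprehensively evaluating world-model fidelity, together with an ecosystem comprising a human-annotated preference dataset and a scalable, explainable automatic evaluation agent, which extends evaluation from visual appeal to physical and behavioral fidelity.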
Abstract: Today’s driving world models can generate remarkably realistic dash-cam videos, yet no single model excels universally. Some generate photorealistic textures but violate basic physics; others maintain geometric consistency but fail when subjected to closed-loop planning. This disconnect exposes a critical gap: the field evaluates how real generated worlds appear, but rarely whether they behave realistically. We introduce WorldLens, a unified benchmark that measures world-model fidelity across the full spectrum, from pixel quality and 4D geometry to closed-loop driving and human perceptual alignment, through five complementary aspects and 24 standardized dimensions. Our evaluation of six representative models reveals that no existing approach dominates across all axes: texture-rich models violate geometry, geometry-aware models lack behavioral fidelity, and even the strongest performers achieve only 2-3 out of 10 on human realism ratings. To bridge algorithmic metrics with human perception, we further contribute WorldLens-26K, a 26,808-entry human-annotated preference dataset pairing numerical scores with textual rationales, and WorldLens-Agent, a vision-language evaluator distilled from these judgments that enables scalable, explainable auto-assessment. Together, the benchmark, dataset, and agent form a unified ecosystem for assessing generated worlds not merely by visual appeal, but by physical and behavioral fidelity.
[284] CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation cs.CV | cs.AIPDF
Anna C. Doris, Jacob Thomas Sony, Ghadi Nehme, Era Syla, Amin Heyrani Nobari
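TL;DR: CADBench is a multimodal benchmark for evaluating AI-assisted CAD program generation. It contains 18,000 samples spanning six dataset families, five input modalities, and six evaluation metrics, and aims to provide a unified measure of progress in recovering editable CAD programs from images or 3D observations.
Details
Motivation: Existing evaluations are fragmented across datasets, modalities, and metrics, which makes it hard to measure progress in recovering editable CAD programs from images or 3D observations for AI-assisted design. A unified benchmark is therefore needed.
Result: Under idealized inputs, specialized mesh-to-CAD models substantially outperform code-generating vision-language models (VLMs), which remain far from reliable CAD program reconstruction. The benchmark reveals three recurring failure modes: reconstruction quality degrades with geometric complexity, specialized models are brittle under modality shift, and model rankings change across metrics.
Insight: The novelty is a unified multimodal benchmark for CAD program generation, described as the first of its kind, that supports controlled analysis across complexity and object variation. The benchmark can diagnose key challenges in editable 3D reconstruction and multimodal CAD understanding, such as modality robustness and metric consistency.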
Abstract: Recovering editable CAD programs from images or 3D observations is central to AI-assisted design, but progress is difficult to measure because existing evaluations are fragmented across datasets, modalities, and metrics. We introduce CADBench, a unified benchmark for multimodal CAD program generation. CADBench contains 18,000 evaluation samples spanning six benchmark families derived from DeepCAD, Fusion 360, ABC, MCB, and Objaverse; five input modalities including clean meshes, noisy meshes, single-view renders, photorealistic renders, and multi-view renders; and six metrics covering geometric fidelity, executability, and program compactness. STEP-based families are stratified by B-rep face count and all families are diversity-sampled to support controlled analysis across complexity and object variation. We benchmark eleven CAD-specialized and general-purpose vision-language systems, generating more than 1.4 million CAD programs. Under idealized inputs, specialized mesh-to-CAD models substantially outperform code-generating VLMs, which remain far from reliable CAD program reconstruction. CADBench further reveals three recurring failure modes: reconstruction quality degrades with geometric complexity, CAD-specialized models can be brittle under modality shift, and model rankings change across metrics. Together, these results position CADBench as a diagnostic testbed for measuring progress in editable 3D reconstruction and multimodal CAD understanding. The benchmark is publicly available at https://huggingface.co/datasets/DeCoDELab/CADBench.
[285] Count Anything at Any Granularity cs.CVPDF
Chang Liu, Haoning Wu, Weidi Xie
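TL;DR: This paper proposes a new framework for open-world object counting that recasts the task as multi-grained counting, and introduces the KubriCount dataset and the HieraCount model to address the failure of existing methods to follow fine-grained prompt semantics.
Details
Motivation: Existing open-world object counting methods treat “what to count” as a single category-level matching problem and ignore the semantic granularity of user intent (a specific identity, an attribute, an instance type, a category, or an abstract concept), which makes counting unreliable.
Result: Systematic benchmarking on the newly built KubriCount dataset shows that existing multimodal large language models and specialist counting models exhibit severe prompt-following failures under fine-grained distinctions. The proposed HieraCount model substantially improves multi-grained counting accuracy and generalizes robustly to challenging real-world scenes.
Insight: The core contribution is making counting granularity explicit (five explicit levels) and proposing what the authors describe as the first fully automatic data-scaling pipeline for building a large-scale counting dataset with multi-grained annotations. On the model side, HieraCount jointly uses text and visual exemplars as complementary target specifications to follow fine-grained prompt semantics more closely.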
Abstract: Open-world object counting remains brittle: despite rapid advances in vision-language models (VLMs), reliably counting the objects a user intends is far from solved. We argue that a central reason is that counting granularity is left implicit; users may refer to a specific identity, an attribute, an instance type, a category, or an abstract concept, yet most methods treat “what to count” as a single, category-level matching problem. In this work, we redefine open-world counting as multi-grained counting, where visual exemplars specify target appearance and fine-grained text, with optional negative prompts, specifies the intended semantic granularity across five explicit levels. Making granularity explicit, however, exposes a critical data bottleneck: existing counting datasets lack the multi-category scenes, controlled distractors, and instance-level annotations needed to verify fine-grained prompt semantics. To address this, we propose the first fully automatic data-scaling pipeline that integrates controllable 3D synthesis with consistent image editing and VLM-based filtering, and use it to construct KubriCount, the largest and most comprehensively annotated counting dataset to date, supporting both training and multi-grained evaluation. Systematic benchmarking reveals that both multimodal large language models and specialist counting models exhibit severe prompt-following failures under fine-grained distinctions. Motivated by these findings, we train HieraCount, a multi-grained counting model that jointly leverages text and visual exemplars as complementary target specifications. HieraCount substantially improves multi-grained counting accuracy and generalizes robustly to challenging real-world scenarios. The project page is available here: https://verg-avesta.github.io/KubriCount/.
[286] CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models cs.CV | cs.ROPDF
Wenxuan Song, Han Zhao, Fuhao Li, Ziyang Zhou, Xi Wang
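TL;DR: This paper proposes CapVector, a method aimed at the problem that pretrained vision-language-action (VLA) models often fail to improve performance effectively and reduce adaptation cost under standard supervised finetuning (SFT). It decouples the two objectives of auxiliary-objective finetuning (enhancing general capabilities and fitting task-specific action distributions) and learns transferable capability vectors in parameter space, reaching performance comparable to auxiliary-finetuning baselines at lower computational overhead.
Details
Motivation: Pretrained VLA models gain little from standard SFT and are costly to adapt. Existing finetuning methods with auxiliary objectives improve performance but add significant computational overhead, so a method is needed that enhances capabilities while keeping the simplicity of standard SFT.
Result: Internal and external experiments show that the proposed capability vectors are effective and versatile across different models and generalize to novel environments and embodiments. When standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary-finetuning baselines with reduced computational overhead.
Insight: The novelty is decoupling auxiliary-objective finetuning into capability vectors learned in parameter space, merging them with the pretrained parameters to form a capability-enhanced meta model, and combining this with lightweight regularization for efficient finetuning. This offers a parametric route to capability transfer that can lower VLA adaptation cost and improve generalization.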
Abstract: This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can improve performance and reduce the number of convergence steps. However, they typically incur significant computational overhead due to the additional losses from auxiliary objectives. To combine the enhanced capabilities of auxiliary training with the simplicity of standard SFT, we decouple the two objectives of auxiliary-objective SFT within the parameter space, namely, enhancing general capabilities and fitting task-specific action distributions. To achieve this goal, we only need to train the model to converge on a small-scale task set using two distinct training strategies, resulting in two finetuned models. The parameter difference between the two models can then be interpreted as capability vectors provided by auxiliary objectives. These vectors are then merged with pretrained parameters to form a capability-enhanced meta model. Moreover, when standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary finetuned baselines with reduced computational overhead. Internal and external experiments demonstrate that our capability vectors (1) are effective and versatile across diverse models, and (2) can generalize to novel environments and embodiments out of the box.
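The parameter-space arithmetic described in this abstract can be sketched in a few lines. The sketch assumes that the capability vector is the auxiliary-finetuned parameters minus the standard-SFT parameters, and that merging adds a scaled copy of this vector to the pretrained parameters. The subtraction order, the `scale` coefficient, and the function names are illustrative assumptions; the abstract only says that the difference between the two finetuned models is merged with the pretrained parameters.

```python
def capability_vector(theta_aux, theta_sft):
    """Per-parameter difference between two finetuned models.

    Both arguments map a parameter name to a flat list of floats
    (a stand-in for real tensors in this sketch).
    """
    return {
        name: [a - s for a, s in zip(theta_aux[name], theta_sft[name])]
        for name in theta_aux
    }

def merge(theta_pre, vector, scale=1.0):
    """Add a scaled capability vector to the pretrained parameters.

    `scale` is a hypothetical knob; the abstract does not mention one.
    """
    return {
        name: [p + scale * v for p, v in zip(theta_pre[name], vector[name])]
        for name in theta_pre
    }
```

For example, with `theta_pre = {"w": [1.0, 2.0]}`, `theta_sft = {"w": [1.5, 2.0]}`, and `theta_aux = {"w": [2.0, 1.0]}`, the vector is `{"w": [0.5, -1.0]}` and the merged model is `{"w": [1.5, 1.0]}`. The merged model would then be finetuned on the target task with standard SFT.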
[287] Confidence-Guided Diffusion Augmentation for Enhanced Bangla Compound Character Recognition cs.CV | cs.AIPDF
Md. Sultan Al Rayhan, Maheen Islam
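TL;DR: This paper proposes a confidence-guided diffusion augmentation framework to improve recognition of low-resolution Bangla compound characters. The framework combines class-conditional diffusion modeling with classifier guidance to generate high-quality handwritten compound character samples, and uses a confidence-based filtering mechanism to select high-quality synthetic samples for augmenting the training data. Experiments on the AIBangla dataset show consistent gains across several classification architectures.
Details
Motivation: Handwritten Bangla compound character recognition is challenged by complex character structures, large intra-class variation, and limited high-quality annotated data. Existing systems struggle to generalize across diverse writing styles, particularly for compound characters with intricate ligatures and diacritical variations.
Result: On the AIBangla compound character dataset, the method yields consistent improvements across ResNet50, DenseNet121, VGG16, and Vision Transformer architectures. The best model reaches 89.2% classification accuracy, surpassing the previously published AIBangla benchmark by a substantial margin.
Insight: The main contributions are: (1) sample generation that combines class-conditional diffusion modeling with classifier guidance; (2) Squeeze-and-Excitation enhanced residual blocks in the diffusion model’s U-Net backbone to improve generation quality; (3) a confidence-based filtering mechanism in which pre-trained classifiers act as quality gates and retain only highly class-consistent synthetic samples for augmentation. This provides an effective quality-aware data augmentation approach for handwritten character recognition in low-resource script domains.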
Abstract: Recognition of handwritten Bangla compound characters remains a challenging problem due to complex character structures, large intra-class variation, and limited availability of high-quality annotated data. Existing Bangla handwritten character recognition systems often struggle to generalize across diverse writing styles, particularly for compound characters containing intricate ligatures and diacritical variations. In this work, we propose a confidence-guided diffusion augmentation framework for low-resolution Bangla compound character recognition. Our framework combines class-conditional diffusion modeling with classifier guidance to synthesize high-quality handwritten compound character samples. To further improve generation quality, we introduce Squeeze-and-Excitation enhanced residual blocks within the diffusion model’s U-Net backbone. We additionally propose a confidence-based filtering mechanism where pre-trained classifiers act as quality gates to retain only highly class-consistent synthetic samples. The filtered synthetic images are fused with the original training data and used to retrain multiple classification architectures. Experiments conducted on the AIBangla compound character dataset demonstrate consistent performance improvements across ResNet50, DenseNet121, VGG16, and Vision Transformer architectures. Our best-performing model achieves 89.2% classification accuracy, surpassing the previously published AIBangla benchmark by a substantial margin. The results demonstrate that quality-aware diffusion augmentation can effectively enhance handwritten character recognition performance in low-resource script domains.
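The confidence-based quality gate can be sketched as follows. The sketch assumes that a synthetic sample is kept only when a pre-trained classifier predicts the class it was conditioned on with probability at or above a threshold. The threshold value, the single-classifier setup, and the names `filter_by_confidence` and `classify` are hypothetical; the abstract does not state the paper's gating rule in detail.

```python
def filter_by_confidence(samples, classify, threshold=0.9):
    """Keep synthetic samples that a classifier assigns to their target class.

    samples:   iterable of (image, target_class_index) pairs
    classify:  callable returning a list of class probabilities for an image
    threshold: minimum probability for the target class (assumed value)
    """
    kept = []
    for image, target in samples:
        probs = classify(image)
        predicted = max(range(len(probs)), key=probs.__getitem__)
        # Require both the correct top-1 class and sufficient confidence.
        if predicted == target and probs[target] >= threshold:
            kept.append((image, target))
    return kept
```

The retained pairs would then be merged with the original training set before the classifiers are retrained. A stricter threshold trades the number of synthetic samples for class consistency.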
[288] Personal Visual Context Learning in Large Multimodal Models cs.CVPDF
Zihui Xue, Ami Baid, Sangho Kim, Mi Luo, Kristen Grauman
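TL;DR: This paper introduces Personal Visual Context Learning (Personal VCL), the ability of large multimodal models (LMMs) to use user-specific visual context to answer personalized queries. The authors build the Personal-VCL-Bench benchmark to evaluate this ability and find a pronounced context utilization gap in frontier LMMs. They then propose the Agentic Context Bank, which builds a self-refining memory bank and applies query-adaptive evidence selection, and which consistently improves performance across tasks and models.
Details
Motivation: As wearable devices such as smart glasses integrate LMMs into users’ continuous first-person visual streams, a true personal assistant requires visual personalization: the ability to reason over visual information unique to the user.
Result: On the proposed Personal-VCL-Bench, frontier LMMs show a severe context utilization gap. The proposed Agentic Context Bank baseline consistently improves over standard context prompting across tasks and evaluated backbones.
Insight: The novelty is formalizing visual personalization as Personal VCL and building a dedicated evaluation benchmark. The Agentic Context Bank, with its structured, self-refining memory bank and query-adaptive mechanism, offers a practical inference-time approach to LMMs’ weaknesses in using and aggregating evidence from multiple visual observations.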
Abstract: As wearable devices like smart glasses integrate Large Multimodal Models (LMMs) into the continuous first-person visual streams of individual users, the evolution of these models into true personal assistants hinges on visual personalization: the ability to reason over visual information unique to the wearer. We formalize this capability as Personal Visual Context Learning (Personal VCL), the prompt-time capability of using user-specific visual context to resolve personalized queries. To systematically evaluate this, we present Personal-VCL-Bench, a comprehensive benchmark capturing the personal visual world across persons, objects, and behaviors. Our analysis of frontier LMMs identifies a profound context utilization gap, revealing that the mechanisms for leveraging visual evidence, as well as aggregating multiple visual observations, remain critically understudied. Motivated by these findings, we propose the Agentic Context Bank, a strong inference-time baseline that structures a user’s visual context into a self-refining memory bank and employs query-adaptive evidence selection. Our baseline approach consistently improves over standard context prompting regimes across tasks and evaluated backbones, demonstrating a practical path towards future personalized LMMs.
[289] Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping cs.CVPDF
Haoyuan Sun, Jing Wang, Yuxin Song, Yu Lu, Bo Fang
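TL;DR: This paper proposes Super-Linear Advantage Shaping (SLAS), a post-training method for text-to-image (T2I) models. It revisits the functional update from an information geometry perspective and extends the Fisher-Rao information metric with advantage-dependent weighting, reshaping the local policy space to mitigate the reward hacking common in existing reinforcement learning post-training methods such as GRPO and to improve training dynamics and generalization.
Details
Motivation: Existing reinforcement-learning-based post-training methods such as GRPO are prone to reward hacking, where the model exploits biases in an imperfect reward function rather than achieving genuine performance gains. The authors find that normalization can cause miscalibration, and that directly removing the prompt-level standard deviation term gives a policy ascent direction that is linear in the advantage but still limits the separation of genuine signal from noise.
Result: Extensive evaluations show that SLAS consistently surpasses the DanceGRPO baseline across multiple backbones and benchmarks. Specifically, it yields faster training dynamics, improved out-of-domain performance on GenEval and UniGenBench++, and greater robustness to model scaling, while mitigating reward hacking and preserving semantic and compositional fidelity in generations.
Insight: The main novelty is redesigning the policy update from an information geometry perspective: a non-linear geometric structure with advantage-dependent weighting reshapes the local policy space, amplifying informative updates along high-advantage directions and suppressing illusory gradients in low-advantage regions. Batch-level normalization is additionally applied to stabilize training under varying reward scales. The approach offers a geometric angle on mitigating reward hacking and improving post-training.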
Abstract: Recently, post-training methods based on reinforcement learning, with a particular focus on Group Relative Policy Optimization (GRPO), have emerged as the robust paradigm for further advancement of text-to-image (T2I) models. However, these methods are often prone to reward hacking, wherein models exploit biases in imperfect reward functions rather than yielding genuine performance gains. In this work, we identify that normalization could lead to miscalibration and directly removing the prompt-level standard deviation term yields an optimal policy ascent direction that is linear in the advantage but still limits the separation of genuine signals from noise. To mitigate the above issues, we propose Super-Linear Advantage Shaping (SLAS) by revisiting the functional update from an information geometry perspective. By extending the Fisher-Rao information metric with advantage-dependent weighting, SLAS introduces a non-linear geometric structure that reshapes the local policy space. This design relaxes constraints along high-advantage directions to amplify informative updates, while tightening those in low-advantage regions to suppress illusory gradients. In addition, batch-level normalization is applied to stabilize training under varying reward scales. Extensive evaluations demonstrate that SLAS consistently surpasses the DanceGRPO baseline across multiple backbones and benchmarks. In particular, it yields faster training dynamics, improved out-of-domain performance on GenEval and UniGenBench++, and enhanced robustness to model scaling, while mitigating reward hacking and preserving semantic and compositional fidelity in generations.
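The normalization issue raised in this abstract can be illustrated with a small sketch of group-relative advantages. It contrasts the commonly described GRPO form, which divides centered rewards by the per-prompt standard deviation, with a centered-only variant and a batch-level rescaling. The sketch does not implement the super-linear shaping itself, whose exact form the abstract does not give; the use of the population standard deviation and the `eps` constant are assumptions.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Per-prompt standardized advantages: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def centered_advantages(rewards):
    """Advantages with the prompt-level standard deviation term removed."""
    mu = mean(rewards)
    return [r - mu for r in rewards]

def batch_normalize(groups, eps=1e-8):
    """Rescale all advantages by one batch-wide standard deviation.

    groups: list of per-prompt advantage lists, already centered.
    """
    flat = [a for group in groups for a in group]
    sigma = pstdev(flat)
    return [[a / (sigma + eps) for a in group] for group in groups]
```

With per-prompt standardization, a prompt whose samples score 0.50 and 0.51 receives almost the same advantages as one whose samples score 0.0 and 1.0, even though the first difference may be reward noise. Centering followed by a single batch-level scale keeps the relative size of the two signals.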
cs.CR [Back]
[290] On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models cs.CR | cs.CVPDF
Yule Liu, Yilong Yang, Jiale Teng, Hanze Jia, Zeren Luo
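TL;DR: This paper systematically evaluates the risk that image-to-3D models generate harmful geometry, and the available mitigations. It finds that existing models can effectively generate three categories of harmful geometry (direct-use physical hazards, risky templates or components, and deceptive replicas), while commercial moderation flags fewer than 0.3% of cases. The study further tests three representative safeguard families and proposes a stacked defense that reduces harmful retention to below 1% but still incurs an 11% false-positive rate.
Details
Motivation: Image-to-3D technology can be misused to generate 3D-printable harmful geometry (such as dangerous objects and deceptive replicas), yet there has been no systematic evaluation of this generation capability or of the effectiveness of safeguards.
Result: In tests of open-source and commercial image-to-3D models with original, degraded, viewpoint-shifted, and semantically camouflaged inputs, the models effectively reconstruct harmful geometry. Commercial moderation flags fewer than 0.3% of cases. The proposed stacked defense reduces harmful retention to <1%, at an overall false-positive rate of 11%.
Insight: The contributions include a risk framework that systematically defines three categories of harmful geometry; multi-dimensional evaluation metrics (geometric validity, multi-view VLM-based semantic scoring, human validation, and physical fabrication tests); and an analysis of the limitations of existing safeguards (input moderation, model-level alignment, output filtering), along with a stacked defense that lowers harmful retention. The work underlines the need for geometry-aware moderation.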
Abstract: Recent advances in image-to-3D models have significantly improved the fidelity and accessibility of 3D content creation. The same powerful reconstruction capability that enables creative design can also be misused by an adversary to generate harmful geometries, which can be further fabricated via 3D printers and pose real-world risks. However, such risks are largely underexplored: it remains unclear how well current image-to-3D models can produce these harmful geometries, and whether existing safeguards can reliably prevent such generation. To fill this gap, we conduct a systematic measurement study of harmful geometry generation and mitigation. We first describe this risk through three unsafe categories: direct-use physical hazards, risky templates or components, and deceptive replicas. Each category is instantiated with representative objects. We evaluate both open-source and commercial image-to-3D models under original, degraded, viewpoint-shifted, and semantically camouflaged inputs. We consider different evaluation metrics, including geometric validity, multi-view VLM-based semantic scoring, targeted human validation, and controlled physical fabrication. The results reveal a concerning reality that current image-to-3D models can effectively reconstruct the harmful geometries, while fewer than 0.3% of such geometries trigger commercial moderation flags. As a first step toward mitigation, we evaluate three representative safeguard families, including input moderation, model-level benign alignment, and output-level filtering. We find that existing safeguards have distinct weaknesses. We further develop a stacked defense that can reduce harmful retention to <1%, but still at an 11% overall false-positive cost. Taken together, our findings demonstrate the risk in current systems and encourage better geometry-aware safeguards for moderation.
[291] BEACON: A Multimodal Dataset for Learning Behavioral Fingerprints from Gameplay Data cs.CR | cs.AI | cs.CV | cs.LG | cs.NIPDF
Ishpuneet Singh, Gursmeep Kaur, Uday Pratap Singh Atwal, Guramrit Singh, Gurjot Singh
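TL;DR: This paper introduces BEACON, a large-scale multimodal dataset for behavioral fingerprinting built from Valorant gameplay data. It contains synchronized mouse dynamics, keystroke events, network data, screen recordings, and other modalities, and is intended to support research on continuous authentication and behavioral analysis.
Details
Motivation: Current datasets for continuous authentication are small in scale, unimodal, or lack synchronized environmental context, and so fall short of the needs of high-stakes digital environments. A more comprehensive and realistic dataset is needed to advance behavioral biometrics.
Result: BEACON contains approximately 430 GB of synchronized modality data from 79 sessions across 28 players, totaling an estimated 102.51 hours of gameplay. It covers a range of behavioral signals and environmental context and provides a reproducible benchmark for evaluating next-generation behavioral fingerprinting and security models.
Insight: The novelty is using the high cognitive load and fine motor skills demanded by a tactical shooter to approximate realistic behavioral stress, and integrating synchronized multimodal data in an esports setting, which gives a high-fidelity environment for studying continuous authentication, behavioral profiling, and multimodal representation learning.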
Abstract: Continuous authentication in high-stakes digital environments requires datasets with fine-grained behavioral signals under realistic cognitive and motor demands. However, current benchmarks are often limited by small scale, unimodal sensing, or lack of synchronised environmental context. To address this gap, this paper introduces BEACON (Behavioral Engine for Authentication & Continuous Monitoring), a large-scale multimodal dataset that captures diverse skill tiers in competitive Valorant gameplay. BEACON contains approximately 430 GB of synchronised modality data (461 GB total on-disk including auxiliary Valorant configuration captures) from 79 sessions across 28 distinct players, estimated at 102.51 hours of active gameplay, including high-frequency mouse dynamics, keystroke events, network packet captures, screen recordings, hardware metadata, and in-game configuration context. BEACON leverages the high precision motor skills and high cognitive load that are inherent to tactical shooters, making it a rigorous stress test for the robustness of behavioral biometrics. The dataset allows for the study of continuous authentication, behavioral profiling, user drift and multimodal representation learning in a high-fidelity esports setting. The authors release the dataset and code on Hugging Face and GitHub to create a reproducible benchmark for evaluating next-generation behavioral fingerprinting and security models.
cs.DC [Back]
[292] Scaling Mobile Agent Systems: From Capability Density to Collective Intelligence cs.DC | cs.CL | cs.MA | cs.NIPDF
Bowei He
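TL;DR: This paper proposes a unified research agenda for scaling mobile agent systems along two complementary dimensions: raising the capability density of individual agents through compact foundation model design and compression, and achieving collective intelligence through communication-rich multi-agent collaboration, so that isolated mobile agents become an efficient and scalable distributed intelligent system.
Details
Motivation: Mobile agent systems are emerging as a key paradigm for intelligent applications on edge devices and in AIoT ecosystems, but their scalability is fundamentally constrained by limited on-device computation and fragmented intelligence across devices.
Result: The abstract reports no quantitative experiments or benchmarks; it presents a research vision built on recent model and infrastructure advances.
Insight: The novelty is decomposing the challenge of scaling mobile agent systems into two complementary dimensions, the capability density of individual agents and the collective intelligence of multiple agents, and integrating them into a single research agenda aimed at building distributed intelligent systems.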
Abstract: Mobile agent systems are emerging as a key paradigm for enabling intelligent applications on edge devices and in AIoT ecosystems. However, their scalability is fundamentally constrained by limited on-device computation and fragmented intelligence across devices. In this work, we propose a unified research agenda for scaling mobile agent systems along two complementary dimensions: (1) improving capability density of individual agents through compact foundation model design and compression, and (2) enabling collective intelligence via communication-rich multi-agent collaboration. Building on recent model and infrastructure advances, this vision aims to transform isolated mobile agents into a distributed intelligent system that is efficient and scalable.
cs.MM [Back]
[293] Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination cs.MM | cs.CVPDF
Yangneng Chen, Junlin Li, Weijun Yao, Xilai Ma, Guodong Du
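TL;DR: This paper identifies a phenomenon called Vocabulary Hijacking: in large vision-language models (LVLMs), specific visual tokens (Inert Tokens) attract disproportionate attention and decode to a fixed set of unrelated words (Hijacking Anchors) across layers, causing semantic rigidity and triggering hallucination. To address this, the authors propose Hijacking Anchor-Based Identification (HABI) to localize these tokens and the Non-Hijacked Visual Attention Ratio (NHAR) metric to identify the key attention heads that remain resilient to hijacking. Building on these, they propose Hijacking-Aware Visual Attention Enhancement (HAVAE), a training-free intervention that mitigates hallucination by strengthening these heads’ focus on salient visual content while preserving the model’s general capabilities.
Details
Motivation: LVLMs have made remarkable progress on multimodal tasks, but their reliability is persistently undermined by hallucination, that is, generated text that contradicts the visual input. Recent studies often attribute these errors to inadequate visual attention; this paper analyzes the attention mechanism to uncover and mitigate an underlying cause of hallucination.
Result: Extensive experiments on multiple benchmarks show that HAVAE significantly mitigates hallucination with no additional computational overhead while preserving the model’s general capabilities.
Insight: The novelty is identifying the Vocabulary Hijacking phenomenon and the semantic rigidity behind it, and proposing a robust identification strategy based on Hijacking Anchors (HABI) and a quantitative metric (NHAR). The training-free HAVAE intervention, which selectively strengthens key attention heads, offers an efficient, low-cost way to reduce hallucination in LVLMs.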
Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable progress in multimodal tasks, yet their reliability is persistently undermined by hallucinations, that is, generating text that contradicts visual input. Recent studies often attribute these errors to inadequate visual attention. In this work, we analyze the attention mechanisms via the logit lens, uncovering a distinct anomaly we term Vocabulary Hijacking. We discover that specific visual tokens, defined as Inert Tokens, disproportionately attract attention. Crucially, when their intermediate hidden states are projected into the vocabulary space, they consistently decode to a fixed set of unrelated words (termed Hijacking Anchors) across layers, revealing a rigid semantic collapse. Leveraging this semantic rigidity, we propose Hijacking Anchor-Based Identification (HABI), a robust strategy to accurately localize these Inert Tokens. To quantify the impact of this phenomenon, we introduce the Non-Hijacked Visual Attention Ratio (NHAR), a novel metric designed to identify attention heads that remain resilient to hijacking and are critical for factual accuracy. Building on these insights, we propose Hijacking-Aware Visual Attention Enhancement (HAVAE), a training-free intervention that selectively strengthens the focus of these identified heads on salient visual content. Extensive experiments across multiple benchmarks demonstrate that HAVAE significantly mitigates hallucinations with no additional computational overhead, while preserving the model’s general capabilities. Our code is publicly available at https://github.com/lab-klc/HAVAE.
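One plausible reading of the metric's name is the share of a head's visual attention that lands on tokens other than the identified Inert Tokens. The sketch below computes that share for a single head and query position. This reading, the choice of denominator, and the function name are assumptions; the abstract does not define NHAR formally, and the released code may differ.

```python
def non_hijacked_ratio(attn_weights, visual_idx, inert_idx):
    """Fraction of visual attention mass that avoids inert tokens.

    attn_weights: attention weights of one head for one query position
    visual_idx:   indices of all visual tokens in the sequence
    inert_idx:    indices of visual tokens flagged as inert (assumed given)
    """
    inert = set(inert_idx)
    visual_total = sum(attn_weights[i] for i in visual_idx)
    if visual_total == 0:
        return 0.0  # head places no attention on visual tokens
    kept = sum(attn_weights[i] for i in visual_idx if i not in inert)
    return kept / visual_total
```

Under this reading, heads with a high ratio attend mostly to informative visual content, which matches the abstract's description of heads that remain resilient to hijacking.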
cs.LG [Back]
[294] Reasoning emerges from constrained inference manifolds in large language models cs.LG | cs.CL | cs.CVPDF
Yanbiao Ma, Fei Luo, Linfeng Zhang, Chuangxin Zhao, Mingxuan Wang
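TL;DR: This paper studies the internal dynamics of large language models during inference. It finds that inference-time representations self-organize into low-dimensional manifolds embedded in high-dimensional space, but that geometric compression alone does not guarantee stable reasoning. Effective reasoning requires three conditions: adequate representational expressivity, spontaneous manifold compression, and preservation of non-degenerate information volume within the compressed subspace. On this basis, the paper proposes a unified, label-free diagnostic computed solely from internal dynamics.
Details
Motivation: Move beyond reasoning evaluation based on labeled benchmarks and study reasoning as an intrinsic dynamical process, in order to separate task performance from the quality of internal inference and to probe the geometric and informational constraints that govern reasoning.
Result: Effective reasoning emerges only when the inference dynamics satisfy specific geometric and informational constraints (expressivity, compression, and preserved information volume); models outside this regime exhibit pathological dynamics. The proposed label-free diagnostic provides a framework for assessing this.
Insight: The novelty is modeling reasoning as a dynamical system subject to geometric and informational constraints, identifying the structural conditions for effective reasoning, and providing a benchmark-independent diagnostic based on internal dynamics, which gives a new perspective on LLM reasoning mechanisms.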
Abstract: Reasoning in large language models is predominantly evaluated through labeled benchmarks, conflating task performance with the quality of internal inference. Here we study reasoning as an intrinsic dynamical process by examining the evolution of internal representations during inference. We find that inference-time dynamics consistently self-organize into low-dimensional manifolds embedded within high-dimensional representation spaces. We further find that such geometric compression, although pervasive, is not sufficient for stable or reliable reasoning. Instead, effective reasoning dynamics emerge within a constrained structural regime characterized by three conditions: adequate representational expressivity, spontaneous manifold compression, and preservation of non-degenerate information volume within the compressed subspace. Models outside this regime exhibit characteristic pathological inference dynamics. Based on these insights, we introduce a unified, label-free diagnostic computed solely from internal dynamics. These findings suggest that reasoning in LLMs is fundamentally governed by geometric and informational constraints, offering a complementary framework to benchmark-centric assessment.
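[295] HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control cs.LG | cs.AI | cs.CL
Xincheng Yao, Ruoqi Li, Cheng Chen, Daoxin Zhang, Yi Wu
TL;DR: This paper proposes HTPO (Hierarchical Token-level Objective Control Policy Optimization), a reinforcement learning algorithm that addresses a limitation of existing RL methods for improving LLM reasoning: they apply the same optimization objective to every response token and offer no fine-grained guidance. HTPO hierarchically partitions response tokens into functional groups by prompt difficulty, answer correctness, and token entropy, and designs a dedicated objective for each group to achieve a more balanced exploration-exploitation trade-off.
Details
Motivation: When mainstream RL algorithms are used for Reinforcement Learning with Verifiable Rewards (RLVR) to improve LLM reasoning, they typically treat all tokens in a response equally and assign each the same optimization objective. This provides no fine-grained guidance for the reasoning process and no effective mechanism for dynamically balancing exploration and exploitation during learning.
Result: Extensive experiments on challenging reasoning benchmarks (such as AIME’24 and AIME’25) show that HTPO significantly outperforms the strong DAPO baseline (e.g., +8.6% and +6.7% on AIME’24 and AIME’25, respectively). When test-time compute is scaled, the HTPO-trained model keeps a consistent advantage over the DAPO baseline, and the gap widens as the sampling budget grows, indicating that adaptive token-level control fosters effective exploration without sacrificing exploitation performance.
Insight: The paper claims to be the first to bring the divide-and-conquer idea to RLVR, using hierarchical token-level objective control to balance exploration and exploitation dynamically. Viewed objectively, its core contribution is designing differentiated objectives according to the functional role each token plays in chain-of-thought reasoning (along three dimensions: prompt difficulty, answer correctness, and token entropy), which gives fine-grained, adaptive guidance of the reasoning process and a structured perspective on the exploration-exploitation trade-off in RL.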
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a pivotal technique for enhancing the reasoning capabilities of Large Language Models (LLMs). However, the de facto practice of mainstream RL algorithms is to treat all tokens of one response equally and assign the same optimization objective to each token, failing to provide granular guidance for the reasoning process. In Chain-of-Thought (CoT) reasoning, however, different tokens usually play distinct roles. As a result, current RL algorithms lack an effective mechanism to dynamically balance the exploration-exploitation trade-off during learning. To this end, we propose Hierarchical Token-level Objective Control Policy Optimization (HTPO), a novel RL algorithm that uses the divide-and-conquer idea to hierarchically partition the response tokens into specific functional groups from three aspects (i.e., prompt difficulty, answer correctness, and token entropy). Within each group, according to the contributions to exploration or exploitation, we design specialized optimization objectives to facilitate the effective execution of each token’s expected functionality. In this way, HTPO can achieve a more balanced exploration-exploitation trade-off. Extensive experiments on challenging reasoning benchmarks validate the superiority of our HTPO algorithm, which significantly outperforms the strong DAPO baseline (e.g., +8.6% and +6.7% on AIME’24 and AIME’25, respectively). When scaling test-time compute, the HTPO-trained model maintains a consistent performance advantage over the DAPO baseline, and the gap widens as the sampling budget increases, validating that our adaptive token-level control method fosters effective exploration without sacrificing exploitation performance. Code will be available at https://github.com/xcyao00/HTPO.
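One of the three partitioning aspects named in the abstract, token entropy, can be illustrated with a minimal sketch. The threshold value, the two group names, and the helper functions below are hypothetical and chosen for illustration only; the abstract does not specify how HTPO actually forms its groups or what objective each group receives.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def partition_by_entropy(step_distributions, threshold=0.5):
    """Split token positions into high- and low-entropy groups.

    `threshold` is an illustrative assumption, not a value from the paper.
    """
    high, low = [], []
    for position, probs in enumerate(step_distributions):
        if token_entropy(probs) > threshold:
            high.append(position)
        else:
            low.append(position)
    return {"high_entropy": high, "low_entropy": low}


# Toy next-token distributions for three decoding steps.
distributions = [
    [0.97, 0.01, 0.01, 0.01],  # confident step
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain step
    [0.90, 0.05, 0.03, 0.02],  # fairly confident step
]
groups = partition_by_entropy(distributions)
```

In a scheme of this kind, high-entropy positions would plausibly be the ones given an exploration-oriented objective and low-entropy positions an exploitation-oriented one, but that mapping is an assumption here.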
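[296] CDS4RAG: Cyclic Dual-Sequential Hyperparameter Optimization for RAG cs.LG | cs.AI | cs.CL | cs.PF | cs.SE
Pengzhou Chen, Tao Chen
TL;DR: This paper proposes CDS4RAG, a framework for optimizing all hyperparameters of the retriever and generator in a Retrieval-Augmented Generation (RAG) system. It adopts a new cyclic dual-sequential strategy that separates retriever and generator hyperparameters and optimizes them in alternating cycles, which uses the evaluation budget more effectively and speeds up convergence.
Details
Motivation: RAG systems are sensitive to the many hyperparameters of the retriever and generator, but optimizing them with given queries is hard because of complex parameter interactions and expensive evaluations. Existing methods usually treat RAG as a monolithic black box or optimize only some of the hyperparameters, which leads to poor results and slow convergence.
Result: Experiments on four common benchmarks and two backbone LLMs show that CDS4RAG considerably improves the vanilla algorithms in 21 of 24 cases and significantly outperforms state-of-the-art algorithms in all cases, with up to 1.54x higher generation quality and better speedup.
Insight: The core novelty is the cyclic dual-sequential paradigm, which decouples retriever and generator hyperparameters and optimizes them cyclically, together with fine-grained within-cycle budget provision and cross-cycle seeding to speed up generator optimization. The framework is algorithm-agnostic and can be paired with a variety of general optimization algorithms.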
Abstract: Retrieval-Augmented Generation (RAG) is sensitive to the vast hyperparameters of the retriever and generator, yet optimizing them using given queries is a challenging task due to the complex interactions and expensive evaluation costs. Existing algorithms are ineffective and slow in convergence, since they often treat RAG as a monolithic black box or only optimize partial hyperparameters. In this paper, we propose CDS4RAG, a framework that optimizes the full RAG hyperparameters using given queries via a new cyclic dual-sequential formulation. CDS4RAG is special in the sense that it distinguishes the hyperparameters of the retriever and generator, cyclically optimizing them in turn. Such a paradigm allows us to design fine-grained within-cycle budget provision and expedite the optimization via cross-cycle seeding when optimizing the generator. CDS4RAG is also an algorithm-agnostic framework that can be paired with diverse general algorithms. Through experiments on four common benchmarks and two backbone LLMs, we reveal that CDS4RAG considerably boosts the vanilla algorithms in 21/24 cases while significantly outperforming state-of-the-art algorithms in all cases with up to 1.54x improvements of generation quality and better speedup.
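The alternating structure described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions: `evaluate` is a made-up stand-in for an expensive RAG evaluation, plain random search stands in for the "general algorithm" the framework is paired with, the fixed per-phase budget is not the paper's budget provision scheme, and carrying the incumbent generator configuration into the next cycle is only a crude analogue of cross-cycle seeding.

```python
import random


def evaluate(retriever_cfg, generator_cfg):
    """Toy stand-in for an expensive RAG evaluation (higher is better)."""
    top_k = retriever_cfg["top_k"]
    temperature = generator_cfg["temperature"]
    return -((top_k - 5) ** 2) - 10.0 * (temperature - 0.3) ** 2


def cyclic_optimize(cycles=3, budget_per_phase=8, seed=0):
    """Alternate between tuning the retriever and tuning the generator."""
    rng = random.Random(seed)
    retriever = {"top_k": 1}
    generator = {"temperature": 1.0}
    best = evaluate(retriever, generator)
    history = [best]
    for _ in range(cycles):
        # Phase 1: search retriever settings with the generator held fixed.
        for _ in range(budget_per_phase):
            candidate = {"top_k": rng.randint(1, 10)}
            score = evaluate(candidate, generator)
            if score > best:
                retriever, best = candidate, score
        # Phase 2: search generator settings with the retriever held fixed.
        # The incumbent generator carries over from the previous cycle.
        for _ in range(budget_per_phase):
            candidate = {"temperature": rng.uniform(0.0, 1.0)}
            score = evaluate(retriever, candidate)
            if score > best:
                generator, best = candidate, score
        history.append(best)
    return retriever, generator, history


retriever, generator, history = cyclic_optimize()
```

Because each phase only accepts strict improvements, the best score recorded per cycle never decreases; a real optimizer would replace the random proposals with a model-based or evolutionary search.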
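[297] Reinforcement Learning for Scalable and Trustworthy Intelligent Systems cs.LG | cs.AI | cs.CL
Guangchen Lan
TL;DR: This dissertation examines the scalability and trustworthiness challenges of reinforcement learning and proposes methods spanning federated optimization, preference alignment, and contextual safety that make RL more efficient in distributed environments and safer and more reliable in large language models and autonomous agents.
Details
Motivation: To address two central problems in deploying RL in practice: scalability in distributed environments (limited communication bandwidth, heterogeneous computation) and trustworthiness when post-training LLMs and autonomous agents (alignment with human preferences, safety requirements such as privacy).
Result: The dissertation tackles these challenges through four complementary contributions (covering federated optimization, preference alignment, and contextual safety); the abstract reports no specific quantitative results or benchmarks.
Insight: The novelty lies in applying RL as a unifying framework for both scalability and trustworthiness, including communication-efficient asynchronous federated optimization and techniques that improve human-preference alignment and reduce contextually inappropriate information disclosure in language-based intelligent systems.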
Abstract: Reinforcement learning has become a powerful paradigm for improving the capability of intelligent systems, but its practical deployment faces two central challenges. First, reinforcement learning must scale efficiently in distributed environments where communication bandwidth is limited and computation is heterogeneous across agents. Second, as reinforcement learning is increasingly used in post-training large language models and autonomous agents, the optimized policies must also be aligned with human preferences and satisfy safety requirements such as privacy-aware information disclosure. This dissertation addresses both challenges through four complementary contributions spanning federated optimization, preference alignment, and contextual safety. The first part of the dissertation studies scalable reinforcement learning in federated settings. The second part of the dissertation studies trustworthy reinforcement learning for large language models. Together, these contributions advance reinforcement learning along two complementary dimensions. On the one hand, they make reinforcement learning more scalable through communication-efficient and asynchronous federated optimization. On the other hand, they make reinforcement learning more trustworthy by improving alignment with human preferences and by reducing contextually inappropriate information disclosure in language-based intelligent systems. As a whole, this dissertation argues that the next generation of intelligent systems will require both efficient optimization and trustworthy behavior, and that reinforcement learning provides a unifying framework for addressing both goals.
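[298] PAAC: Privacy-Aware Agentic Device-Cloud Collaboration cs.LG | cs.CL | cs.DC
Liangqi Yuan, Wenzhi Fang, Shiqiang Wang, Christopher G. Brinton
TL;DR: PAAC is a privacy-aware agentic framework that resolves the structural tension between privacy and capability in LLM agents by aligning planner-executor decomposition with the device-cloud boundary. The cloud agent reasons over typed placeholder tokens so that sensitive content stays protected, while the on-device agent identifies sensitive information and distills execution outcomes, which substantially improves accuracy and reduces leakage under strict privacy settings.
Details
Motivation: LLM agents face a structural tension: cloud agents offer strong reasoning but expose user data, while on-device agents protect privacy at the cost of overall capability. Existing device-cloud designs do not treat the boundary as a trust boundary suited to agentic workloads, and existing sanitizers force a choice between policy flexibility and the structural fidelity that tool calls require.
Result: On three agentic benchmarks under strict privacy settings, PAAC dominates the privacy-accuracy Pareto frontier, improving average accuracy by 15-36% and reducing average leakage by 2-6x over state-of-the-art device-cloud baselines, with the largest margins on privacy targets outside fixed entity taxonomies. It also shows consistent improvements on 17 additional benchmarks spanning 10 domains, including math, science, and finance.
Insight: The novelty is making role specialization itself the privacy mechanism: by aligning planner-executor decomposition with the device-cloud boundary, the cloud agent reasons only over typed placeholders and the on-device agent focuses on identifying sensitive information and distilling results. A reusable idea is the deterministic registry that performs all substitution and reversal, keeping actions directly executable on device and balancing policy flexibility against structural fidelity.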
Abstract: Large language model (LLM) agents face a structural tension: cloud agents provide strong reasoning but expose user data, while on-device agents preserve privacy at the cost of overall capability. Existing device-cloud designs treat this boundary as a compute split rather than a trust boundary suited to agentic workloads, and existing sanitizers force a choice between policy flexibility and the structural fidelity tool calls require. In this work, we develop PAAC, a privacy-aware agentic framework that aligns planner–executor decomposition with the device-cloud boundary so that role specialization itself becomes the privacy mechanism. The cloud agent reasons over typed placeholder tokens that preserve each sensitive value’s reasoning role while discarding its content, while the on-device agent identifies sensitive spans and distills each step’s execution outcome into compact key findings. Sanitization confines the on-device LLM to proposing which spans to mask, while a deterministic registry performs all substitution and reversal, keeping actions directly executable on device. On three agentic benchmarks under strict privacy settings, PAAC dominates the Pareto frontier of privacy and accuracy, improving average accuracy by 15-36% and reducing average leakage by 2-6$\times$ over state-of-the-art device-cloud baselines, with the largest margins on privacy targets outside fixed entity taxonomies. We find consistent improvements on 17 additional benchmarks spanning 10 domains, including math, science, and finance.
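The division of labor described above (a model proposes which spans to mask, a deterministic registry performs substitution and reversal) can be sketched as below. The class name, the `<TYPE_n>` token format, and the example tool call are hypothetical and not taken from PAAC; the span list is hard-coded here in place of an on-device LLM.

```python
class PlaceholderRegistry:
    """Deterministic two-way mapping between sensitive values and typed tokens."""

    def __init__(self):
        self._token_for_value = {}
        self._value_for_token = {}
        self._counts = {}

    def mask(self, text, spans):
        """Replace each proposed (value, type) span with a typed placeholder."""
        # Longest values first, so a value nested inside another is not clobbered.
        for value, kind in sorted(spans, key=lambda span: -len(span[0])):
            if value not in self._token_for_value:
                index = self._counts.get(kind, 0) + 1
                self._counts[kind] = index
                token = f"<{kind}_{index}>"
                self._token_for_value[value] = token
                self._value_for_token[token] = value
            text = text.replace(value, self._token_for_value[value])
        return text

    def unmask(self, text):
        """Restore original values so an action can run on device."""
        for token, value in self._value_for_token.items():
            text = text.replace(token, value)
        return text


registry = PlaceholderRegistry()
request = "Email alice@example.com and wire 4200 USD to Alice Chen."
# In a real system an on-device model would propose these spans.
spans = [("alice@example.com", "EMAIL"), ("Alice Chen", "PERSON"), ("4200 USD", "AMOUNT")]
masked = registry.mask(request, spans)

# The cloud planner only ever sees placeholders and answers in terms of them.
cloud_action = 'send_email(to="<EMAIL_1>", body="Transfer of <AMOUNT_1> confirmed")'
local_action = registry.unmask(cloud_action)
```

Keeping substitution outside the language model means the reversal step is exact: every placeholder the planner emits maps back to precisely one stored value.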
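[299] Relative Kinetic Utility for Reasoning-Aware Structural Pruning in Large Language Models cs.LG | cs.CL
Tianhao Qian
TL;DR: This paper proposes Relative Kinetic Utility (RKU), a theoretical framework that addresses the "magnitude trap" encountered when structurally pruning large language models (LLMs) used with Chain-of-Thought (CoT) prompting. The method performs a continuous kinetic integral over the model's depth manifold based on Alternating Gradient Flow (AGF), combined with Fisher trace normalization, to identify and keep the key structural pathways responsible for high-curvature logical routing (kinetic spikes), so that reasoning ability is better preserved at high sparsity (e.g., 40%).
Details
Motivation: CoT prompting greatly improves LLM reasoning, but generating long reasoning sequences causes severe inference latency and KV-cache memory bottlenecks. Existing magnitude-based structural pruning methods rely too heavily on discrete cross-entropy objectives and fall into a "magnitude trap": they prioritize high-frequency, low-information syntactic tokens, and reasoning ability collapses at high sparsity.
Result: Extensive experiments on Qwen-2.5-7B and LLaMA-3-8B show that RKU improves performance at high sparsity (around 40%). On GSM8K, RKU reaches 13.34% accuracy at 40% sparsity, exceeding the strongest baseline, and it appears to better preserve reasoning-relevant representations under out-of-distribution evaluation.
Insight: The core novelty is lifting the discrete pruning problem to a continuous kinetic integral over the model's depth manifold based on AGF, and introducing Fisher trace normalization as a lightweight curvature-aware normalization that isolates the "kinetic spike" pathways essential to logical routing. This offers a more theoretically grounded view of structural pruning that better preserves the topology of model reasoning.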
Abstract: Chain-of-Thought (CoT) prompting marked a major improvement in the reasoning capabilities of Large Language Models (LLMs). However, scaling up test-time computation yields extensive CoT sequences, introducing severe inference latency and key-value (KV) cache memory bottlenecks. While structural pruning offers a fundamental, hardware-aware solution to alleviate static parameter burdens, existing magnitude-based methods may prune the neurons that CoT relies on: by over-indexing on discrete cross-entropy objectives, these heuristics fall into a magnitude trap, prioritizing high-frequency, low-information syntactic tokens and triggering a reasoning collapse at high sparsities (e.g., 40%). To overcome this topological phase transition, we propose Relative Kinetic Utility (RKU), a novel theoretical framework that elevates discrete pruning to a continuous kinetic integral over the depth manifold of the model based on Alternating Gradient Flow (AGF). By modifying it with Fisher trace normalization, RKU acts as a lightweight curvature-aware normalization to isolate kinetic spikes, the fundamental structural pathways responsible for high-curvature logical routing. Extensive experiments on Qwen-2.5-7B and LLaMA-3-8B show improved performance in the high-sparsity regime around 40%. RKU attains 13.34% accuracy on GSM8K at 40% sparsity, outperforming the strongest baseline, and appears to better preserve reasoning-relevant representations under out-of-distribution evaluation.
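[300] Your Simulation Runs but Solves the Wrong Physics: PDE-Grounded Intent Verification for LLM-Generated Multiphysics Simulation Code cs.LG | cs.AI | cs.CL | cs.SE
Zhenghan Song, Yulong Liu, Cheng Wan, Chenjun Li, Lingfu Liu
TL;DR: This paper proposes PDE-grounded intent verification for LLM-generated scientific simulation code, arguing that executability alone does not guarantee correctness and that a "comprehension-generation gap" exists. Working in the MOOSE framework, the authors develop the Intent Fidelity Score (IFS) to quantify how well generated code matches the target partial differential equations, and build an IFS-driven iterative refinement loop. On the 220-case MooseBench benchmark, the method substantially raises IFS on hard cases and shows that executability and intent fidelity are separable failure modes.
Details
Motivation: LLM-generated simulation code can run successfully while encoding the wrong physical equations (the "comprehension-generation gap"). The goal is to ensure that generated code matches the user's intent in its mathematical structure, not merely that it executes.
Result: On the 220-case MooseBench multiphysics benchmark, iterative refinement consistently improves mean IFS over direct generation. On the hard subset where direct generation falls below IFS 0.7, refinement adds +0.22 to +0.41 absolute IFS. A deployment audit shows that repairing executability alone leaves 39-40% of cases runnable but solving the wrong physics.
Insight: The novelty is the PDE-grounded intent verification framework and the structural IFS metric, which move code verification from the execution level to the level of mathematical structure. Viewed objectively, the core contribution is exposing the separation between executability and correctness in scientific computing and providing a verification pattern that may extend to other PDE-oriented domain-specific languages (such as FEniCS and FreeFEM).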
Abstract: Execution-based evaluation of LLM-generated code implicitly treats successful execution as a proxy for correctness. In scientific simulation, this proxy is insufficient: a generated input file can run, mesh, and converge while encoding governing equations that differ from the user’s intent. We call this mismatch between intended physics and generated code the comprehension-generation gap. We instantiate this in MOOSE, where Kernel and BC objects map compositionally to weak-form residual terms, enabling deterministic reconstruction of the encoded PDE and comparison against an intended contract. We formalize this comparison as the Intent Fidelity Score (IFS), a structural metric covering governing terms, BCs, ICs, coefficients, and time scheme. Building on IFS, we develop a PDE-grounded refinement loop that uses deterministic violation reports to correct generated code iteratively. We evaluate on MooseBench, a 220-case multiphysics benchmark with PDE-level ground truth released with this work. On this benchmark, our method consistently improves mean IFS over direct generation, with gains concentrated on hard cases. On the subset where direct generation falls below IFS 0.7, refinement adds +0.22 to +0.41 absolute IFS. In the deployment audit, execution-only repair improves execution success while leaving 39-40% of all 220 cases runnable but still solving the wrong physics across the three main deployment-audit models, exposing executability and intent fidelity as separable failure modes. Static proof-of-concept experiments on four PDE-oriented DSLs (UFL/FEniCS, FreeFEM, FiPy, and Devito) suggest that the reconstruction-and-comparison pattern extends beyond MOOSE. These findings reinforce that executable simulation code should be verified against the mathematical structure it is intended to encode, not accepted on execution alone.
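To make the idea of a structural comparison concrete, here is a minimal sketch that scores a reconstructed PDE description against an intended contract across the five component types the abstract lists. The scoring rule (mean per-category Jaccard overlap), the string labels, and the report format are assumptions made for illustration; the abstract does not give the actual IFS formula or the MOOSE reconstruction procedure.

```python
CATEGORIES = ("governing_terms", "bcs", "ics", "coefficients", "time_scheme")


def fidelity_score(intended, reconstructed):
    """Return (score in [0, 1], list of violation messages).

    The score is the mean Jaccard overlap per category, an illustrative
    stand-in for the paper's Intent Fidelity Score.
    """
    scores, violations = [], []
    for category in CATEGORIES:
        want = set(intended.get(category, ()))
        got = set(reconstructed.get(category, ()))
        union = want | got
        scores.append(len(want & got) / len(union) if union else 1.0)
        violations += [f"missing {category}: {item}" for item in sorted(want - got)]
        violations += [f"unexpected {category}: {item}" for item in sorted(got - want)]
    return sum(scores) / len(scores), violations


# Intended: transient heat conduction with a source term.
intended = {
    "governing_terms": {"time_derivative", "diffusion", "source"},
    "bcs": {"dirichlet:left", "neumann:right"},
    "ics": {"T=300"},
    "coefficients": {"k=1.5"},
    "time_scheme": {"implicit_euler"},
}
# Reconstructed from generated code that runs but silently solves a steady problem.
reconstructed = {
    "governing_terms": {"diffusion", "source"},
    "bcs": {"dirichlet:left", "neumann:right"},
    "ics": {"T=300"},
    "coefficients": {"k=1.5"},
    "time_scheme": {"steady"},
}
score, report = fidelity_score(intended, reconstructed)
```

The violation list plays the role of the deterministic report that a refinement loop could feed back to the code generator; the example input would execute and converge in a solver while still failing this check.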
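[301] Let the Target Select for Itself: Data Selection via Target-Aligned Paths cs.LG | cs.CL | cs.CV
Huitao Yang, Hengzhi He, Guang Cheng
TL;DR: This paper proposes a data selection method based on target-aligned paths. It uses a reference path induced by the validation set to reduce the reference path bias of conventional methods, and so selects training samples that benefit a specific downstream task more effectively from a heterogeneous candidate pool. A capacity-limited warmup produces the reference trajectory, and candidates are scored by a normalized endpoint loss drop, with no gradients or Hessian approximations required, which substantially lowers compute and storage cost.
Details
Motivation: Data selection methods that aggregate local attribution scores can suffer from reference path bias on heterogeneous candidate pools, because the reference trajectory is misaligned with the dynamics of a target-aligned subset, and selection quality degrades. The paper addresses this with a more efficient alternative way to construct the reference path.
Result: In controlled experiments on logistic regression, vision tasks, and instruction tuning, the method is competitive with strong dynamic attribution baselines while substantially reducing warmup and storage cost. The same compact warmup model can be reused across candidate pools without recomputing the trajectory.
Insight: The novelty is generating a target-aligned reference path (a validation-induced flow) from a capacity-limited warmup on the validation proxy, and designing a zero-order selection rule based on the normalized endpoint loss drop, which removes the dependence on candidate gradients or Hessian approximations and yields efficient, reusable data selection. Viewed objectively, decoupling the reference trajectory from the candidate pool is a general design that may apply to many heterogeneous data selection settings.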
Abstract: Targeted data selection aims to identify training samples from a large candidate pool that improve performance on a specific downstream task. Many recent methods estimate candidate utility by aggregating local attribution scores along a trajectory induced by the candidate pool. When the pool is heterogeneous, however, this reference trajectory may be misaligned with the dynamics of a target-aligned selected subset, creating what we call reference path bias. We propose an alternative reference path: a validation-induced flow obtained from a short, capacity-limited warmup on the available target validation proxy. Along this path, candidates are scored by a normalized endpoint loss drop, yielding a simple zero-order selection rule that requires no candidate gradients or Hessian approximations. Across controlled logistic, vision, and instruction-tuning experiments, this score is competitive with strong dynamic attribution baselines while substantially reducing warmup and storage cost. Moreover, since the reference trajectory is decoupled from any specific candidate pool, the same compact warmup can be reused across additional pools without recomputing the trajectory.
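A zero-order rule of the kind described can be sketched in a few lines. The normalization used here (dividing the loss drop by the starting loss) and the field names `loss_start` and `loss_end` are assumptions; the abstract says only that the endpoint loss drop is normalized, not how. The losses are assumed to be each candidate's loss under the first and last checkpoints of the validation-induced warmup path.

```python
def normalized_endpoint_drop(loss_start, loss_end, eps=1e-12):
    """Relative loss drop between the two endpoints of the reference path.

    Assumption: normalization divides by the starting loss.
    """
    return (loss_start - loss_end) / (abs(loss_start) + eps)


def select_top_k(candidates, k):
    """Rank candidates by score using forward passes only (no gradients)."""
    ranked = sorted(
        candidates,
        key=lambda c: normalized_endpoint_drop(c["loss_start"], c["loss_end"]),
        reverse=True,
    )
    return [c["id"] for c in ranked[:k]]


# Hypothetical per-candidate losses at the start and end of the warmup.
candidates = [
    {"id": "a", "loss_start": 2.0, "loss_end": 1.0},   # drop 0.50
    {"id": "b", "loss_start": 0.5, "loss_end": 0.45},  # drop 0.10
    {"id": "c", "loss_start": 1.0, "loss_end": 1.2},   # loss got worse
    {"id": "d", "loss_start": 4.0, "loss_end": 1.0},   # drop 0.75
]
selected = select_top_k(candidates, k=2)
```

Since the two checkpoints come from the target validation proxy rather than from the pool, the same pair could score a second pool by evaluating two more forward passes per candidate.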
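[302] Learning Multi-Indicator Weights for Data Selection: A Joint Task-Model Adaptation Framework with Efficient Proxies cs.LG | cs.AI | cs.CL
Jingze Song, Zihao Chen, Wenqing Chen, Zibin Zheng
TL;DR: This paper proposes a joint task-model adaptation framework that learns multi-indicator weights for data selection. It uses in-context learning signals on small validation sets as efficient performance proxies, so that optimal weight configurations can be found without full fine-tuning and data selection adapts to both the downstream task and the specific model. Experiments across several benchmarks and model families show performance comparable to or better than full-dataset tuning with only 30% of the training samples.
Details
Motivation: Most existing data selection methods rely on static weighting schemes that are agnostic to task and model, ignoring the differing needs of downstream tasks and the differing capabilities models already have. The paper proposes a dynamic data selection framework that adapts jointly to task and model.
Result: Experiments on multiple benchmarks, including GSM8K, and on the Mistral, Qwen, and Llama model families show that the method matches or exceeds full-dataset tuning while using only 30% of the training samples on GSM8K.
Insight: The main novelty is a joint task-model adaptation framework that learns dynamic multi-indicator weights using in-context learning signals as efficient performance proxies, avoiding costly full fine-tuning. Viewed objectively, its key finding is the trade-off between semantic diversity and logical complexity in reasoning tasks, which underlines the need for joint adaptation and offers a new perspective on data selection for efficient instruction tuning.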
Abstract: Data selection is a key component of efficient instruction tuning for large language models, as recent work has shown that data quality often matters more than data quantity. Accordingly, prior studies have introduced various multi-dimensional heuristics to evaluate and filter instruction data. However, most existing methods rely on static task-agnostic and model-agnostic weighting schemes, which overlook the varying requirements of specific downstream tasks and the differing pre-existing capabilities of models. In this paper, we propose a framework for learning multi-indicator weights that jointly adapts data selection to both the downstream task and the specific model. Our method identifies optimal weight configurations without full-scale fine-tuning by utilizing in-context learning (ICL) signals on compact tiny-validation sets. These signals serve as efficient performance proxies that ensure high-fidelity evaluation at minimal computational cost. Experiments across multiple benchmarks and model families, including Mistral, Qwen, and Llama, show that the approach achieves performance comparable to or exceeding full-dataset tuning while using only 30% of the training samples on GSM8K. Furthermore, our analysis reveals a trade-off between semantic diversity and logical complexity in reasoning tasks, highlighting the necessity of joint task-model adaptation.
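[303] G-Zero: Self-Play for Open-Ended Generation from Zero Data cs.LG | cs.AI | cs.CL | cs.ET
Chengsong Huang, Haolin Liu, Tong Zheng, Runpeng Dai, Langlin Huang
TL;DR: This paper proposes G-Zero, a co-evolutionary framework that needs no external verifier, for autonomous self-improvement of large language models on open-ended generation tasks. Its core innovation is Hint-δ, an intrinsic reward that drives evolution by quantifying the predictive shift between the Generator's unassisted response and its response conditioned on a self-generated hint. The Proposer is trained with GRPO to produce challenging queries and hints, the Generator internalizes the improvements through DPO, and an optimality guarantee is proven under idealized conditions.
Details
Motivation: Self-evolving LLMs do well in verifiable domains, but on open-ended generation tasks their reliance on proxy LLM judges causes capability bottlenecks and reward hacking. A framework is therefore needed that improves autonomously and continuously without external verification.
Result: The abstract reports no specific quantitative results or benchmarks. A theoretical analysis proves a best-iterate suboptimality guarantee for an idealized standard-DPO version, and the paper emphasizes scalability and robustness.
Insight: The novelty is the Hint-δ intrinsic reward, which derives the supervision signal entirely from internal distributional dynamics and so avoids the capability ceiling of external judges, opening a path to continuous LLM self-evolution in unverifiable domains.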
Abstract: Self-evolving LLMs excel in verifiable domains but struggle in open-ended tasks, where reliance on proxy LLM judges introduces capability bottlenecks and reward hacking. To overcome this, we introduce G-Zero, a verifier-free, co-evolutionary framework for autonomous self-improvement. Our core innovation is Hint-$δ$, an intrinsic reward that quantifies the predictive shift between a Generator model’s unassisted response and its response conditioned on a self-generated hint. Using this signal, a Proposer model is trained via GRPO to continuously target the Generator’s blind spots by synthesizing challenging queries and informative hints. The Generator is concurrently optimized via DPO to internalize these hint-guided improvements. Theoretically, we prove a best-iterate suboptimality guarantee for an idealized standard-DPO version of G-Zero, provided that the Proposer induces sufficient exploration coverage and the data filtering keeps pseudo-label score noise low. By deriving supervision entirely from internal distributional dynamics, G-Zero bypasses the capability ceilings of external judges, providing a scalable, robust pathway for continuous LLM self-evolution across unverifiable domains.
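[304] The Truth Lies Somewhere in the Middle (of the Generated Tokens) cs.LG | cs.CL
Sophie L. Wang, Phillip Isola, Brian Cheung
TL;DR: This paper studies how to aggregate the hidden states produced during autoregressive generation into a representation that reflects a language model's internal state. It finds that mean pooling the hidden states of generated tokens yields more semantic representations than any single token, and that these representations outperform those derived from prompt tokens.
Details
Motivation: To determine how to extract effective semantic representations from the hidden states of an autoregressively generating language model, and to explore how information is distributed across generated tokens.
Result: Quantitative evaluation through kernel alignment to reference spaces in the language, vision, and protein domains shows that mean pooling consistently improves representation quality and reveals interpretable dynamics in model behavior.
Insight: The novelty is finding and verifying that information is distributed across multiple generated tokens rather than concentrated at a single position, and that mean pooling is an effective aggregation strategy, which offers a new perspective on internal model representations.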
Abstract: How should hidden states generated autoregressively be collapsed into a representation that reflects a language model’s internal state? Despite tokens being generated under causal masking, we find that mean pooling across their hidden states yields more semantic representations than any individual token alone. We quantify this through kernel alignment to reference spaces in language, vision, and protein domains. The improvement through mean pooling is consistent with information being distributed across generated tokens rather than localized to a single position. Furthermore, representations derived from generated tokens outperform those from prompt tokens, and alignment across generation reveals interpretable dynamics in model behavior.
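Mean pooling and an alignment score can be sketched with the standard library alone. The abstract does not say which kernel alignment measure is used; linear centered kernel alignment (CKA) is swapped in here as one common choice, and the tiny two-dimensional "hidden states" and reference embeddings are made up for illustration.

```python
import math


def mean_pool(hidden_states):
    """Average per-token hidden vectors into a single representation."""
    count = len(hidden_states)
    return [sum(column) / count for column in zip(*hidden_states)]


def _center(matrix):
    """Subtract the column mean from every entry (rows are examples)."""
    means = [sum(column) / len(matrix) for column in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]


def _cross_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices a and b."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            total += sum(x * y for x, y in zip(col_a, col_b)) ** 2
    return total


def linear_cka(x, y):
    """Linear CKA between two sets of representations of the same examples."""
    x, y = _center(x), _center(y)
    return _cross_sq_norm(x, y) / math.sqrt(_cross_sq_norm(x, x) * _cross_sq_norm(y, y))


# Three generated sequences, each with two token-level hidden vectors.
sequences = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[0.0, 1.0], [0.0, 3.0]],
    [[2.0, 2.0], [4.0, 0.0]],
]
pooled = [mean_pool(tokens) for tokens in sequences]

# Hypothetical reference embeddings of the same three items.
reference = [[4.0, 2.0], [0.0, 4.0], [6.0, 2.0]]
alignment = linear_cka(pooled, reference)
```

Comparing `linear_cka(pooled, reference)` with the same score computed from only the last token of each sequence is the kind of contrast the abstract describes, here reduced to toy scale.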
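[305] MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image cs.LG | cs.CL | cs.CV
Alan Arazi, Eilam Shapira, Shoham Grunblat, Mor Ventura, Elad Hoffer
TL;DR: This paper introduces MulTaBench, a benchmark of 40 datasets for evaluating image-tabular and text-tabular multimodal tabular learning. It finds that existing tabular foundation models rely on frozen pretrained embeddings to handle unstructured modalities, and that tuning the embeddings to the task improves performance. The benchmark focuses on tasks where the modalities provide complementary predictive signal, to encourage new architectures that combine joint modeling with target-aware representations.
Details
Motivation: Existing tabular foundation models lack native support for unstructured modalities such as text and images, and existing multimodal tabular benchmarks often focus only on the co-occurrence of modalities, which produces high variance across datasets and masks the benefit of task-specific tuning.
Result: The gains from target-aware representation tuning generalize across text and image modalities, several tabular learners, encoder scales, and embedding dimensions. MulTaBench is the largest image-tabular benchmark to date and covers high-impact domains such as healthcare and e-commerce.
Insight: The novelty is a benchmark that emphasizes modality complementarity, shows the importance of task-specific embedding tuning, and paves the way for new multimodal tabular foundation models; its design encourages research on architectures that combine joint modeling with target-aware representations.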
Abstract: Tabular Foundation Models have recently established the state of the art in supervised tabular learning, by leveraging pretraining to learn generalizable representations of numerical and categorical structured data. However, they lack native support for unstructured modalities such as text and image, and rely on frozen, pretrained embeddings to process them. On established Multimodal Tabular Learning benchmarks, we show that tuning the embeddings to the task improves performance. Existing benchmarks, however, often focus on the mere co-occurrence of modalities; this leads to high variance across datasets and masks the benefits of task-specific tuning. To address this gap, we introduce MulTaBench, a benchmark of 40 datasets, split equally between image-tabular and text-tabular tasks. We focus on predictive tasks where the modalities provide complementary predictive signal, and where generic embeddings lose critical information, necessitating Target-Aware Representations that are aligned with the task. Our experimental results demonstrate that the gains from target-aware representation tuning generalize across both text and image modalities, several tabular learners, encoder scales, and embedding dimensions. MulTaBench constitutes the largest image-tabular benchmarking effort to date, spanning high-impact domains such as healthcare and e-commerce. It is designed to enable the research of novel architectures which incorporate joint modeling and target-aware representations, paving the way for the development of novel Multimodal Tabular Foundation Models.
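[306] Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR cs.LG | cs.CL
Jeonghye Kim, Jiwon Jeon, Dongsheng Li, Yuqing Yang
TL;DR: This paper proposes RLRT (RLVR with Reversed Teacher), a method that improves self-distillation post-training of large language models (LLMs). By reversing the teacher's signal, it reinforces the student's own successful reasoning paths during reasoning exploration and thereby improves model performance.
Details
Motivation: In conventional self-distillation, the teacher overwrites the student's choices on successful rollouts and suppresses the student's own reasoning. The paper addresses this by exploiting paths on which the student succeeds but which the teacher would not have predicted, to promote valuable exploration grounded in the student's own success.
Result: Across base, instruction-tuned, and thinking-tuned Qwen3 models, RLRT substantially outperforms standard self-distillation and exploration-based baselines, establishing information asymmetry as a new, principled design axis for RLVR.
Insight: The core novelty is reading the self-distillation signal in reverse: tokens on which the student succeeds but which the teacher did not predict are treated as evidence of self-driven reasoning and are reinforced. This offers reinforcement learning from human or AI feedback (RLHF/RLAIF) a new exploration paradigm based on the model's own successes rather than on uniform diversity.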
Abstract: Self-distillation has emerged as a powerful framework for post-training LLMs, where a teacher conditioned on extra information guides a student without it, both from the same model. While this guidance is useful when the student has failed, on successful rollouts, the same mechanism instead overwrites the student’s choices and suppresses its own reasoning. Therefore, we propose reading the original self-distillation signal in reverse: when the student succeeds along a path the teacher would not have predicted, these tokens reflect its self-driven reasoning. Building on this, we propose RLRT (RLVR with Reversed Teacher), which augments GRPO by reinforcing these tokens on correct rollouts. We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student’s own success. Across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints, RLRT substantially outperforms self-distillation and exploration-based baselines, establishing information asymmetry as a new, principled design axis for RLVR.
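[307] The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies cs.LG | cs.AI | cs.CL
Gabriel Garcia
TL;DR: This paper identifies a systematic confound in evaluating chain-of-thought (CoT) faithfulness. Because the answer text usually appears at the end of the reasoning chain in standard benchmarks, conventional corruption studies (which measure accuracy after replacing steps with errors) detect where the answer text appears, not where computation actually occurs. Using format ablation, conflicting-answer experiments, and generation-time probes, the paper shows that at consumption time models tend to follow the explicit answer suffix rather than the earlier reasoning, and it proposes a protocol with three prerequisites (question-only control, format characterization, all-position sweep) as a minimum standard for corruption-based faithfulness studies.
Details
Motivation: To identify and resolve a systematic confound in CoT faithfulness evaluation: conventional corruption studies may attribute a model's sensitivity to answer-text format to the computational importance of reasoning steps, and so fail to assess CoT faithfulness accurately.
Result: Experiments cover several model scales (3B to 32B) and architecture families. On GSM8K, removing the answer suffix reduces suffix sensitivity about 19x at 3B. In conflicting-answer experiments, CC accuracy at 7B drops to near zero (<=0.02); the followed-wrong rate is 0.63-1.00 at 3B-7B and attenuates at larger scales (0.300 for Phi-4-14B, about 0.01 at 32B). On MATH, DeepSeek-R1-7B shows a 10.9x suffix-survival recovery. The format-determination effect remains significant at 14B (8.5x ratio) and approaches zero at 32B.
Insight: The novelty is the first systematic identification and quantification of the format confound in CoT faithfulness evaluation, showing that reliance on explicit answer text can mask the real reasoning process. Viewed objectively, the three-prerequisite protocol gives future corruption-based studies a more rigorous methodology, stressing control of format variables and a full sweep over all chain positions.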
Abstract: Corruption studies, the primary tool for evaluating chain-of-thought (CoT) faithfulness, identify which chain positions are “computationally important” by measuring accuracy when steps are replaced with errors. We identify a systematic confound: for chains with explicit terminal answer statements, the dominant format in standard benchmarks, corruption studies detect where the answer text appears, not where computation occurs. A within-dataset format ablation provides the key evidence: on standard GSM8K chains ending with “the answer is X,” removing only the answer statement, preserving all reasoning, collapses suffix sensitivity ~19x at 3B (N=300, p=0.022). Conflicting-answer experiments quantify the causal mechanism: at 7B, CC accuracy drops to near-zero (<=0.02) across five architecture families; the followed-wrong rate spans 0.63-1.00 at 3B-7B and attenuates at larger scales (0.300 at Phi-4-14B, ~0.01 at 32B). A within-stable 7B replication (9.3x attenuation, N=76, p=7.8e-3; Qwen3-8B N=299, p=0.004) provides converging evidence, and the pattern replicates on MATH (DeepSeek-R1-7B: 10.9x suffix-survival recovery). On chains without answer suffixes the same protocol identifies the prefix as load-bearing (Delta=-0.77, p<10^-12). Generation-time probes confirm a dissociation: the answer is not early-determined during generation (early commitment <5%), yet at consumption time model outputs systematically follow the explicit answer text. The format-determination effect persists through 14B (8.5x ratio, p=0.001) and converges toward zero at 32B. We propose a three-prerequisite protocol (question-only control, format characterization, all-position sweep) as a minimum standard for corruption-based faithfulness studies.
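[308] SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing cs.LG | cs.AI | cs.CE | cs.CL
Mingxu Zhang, Yuhan Li, Lujundong Li, Dazhong Shen, Hui Xiong
TL;DR: This paper proposes SLIM, a framework that uses a sparse autoencoder to decompose an LLM's hidden states into sparse, property-aligned features, improving molecular editing success without modifying model parameters and achieving significant gains on the MolEditRL benchmark.
Details
Motivation: In LLM-based molecular editing, property-relevant information is implicitly entangled across dense hidden states and there is no explicit handle for control, which leads to a high rate of failed edits.
Result: On the MolEditRL benchmark, across four model architectures and eight molecular properties, SLIM shows consistent gains over baselines, with improvements of up to 42.4 points.
Insight: A sparse autoencoder with learnable, interpretable importance gates enables targeted activation in the sparse feature space, giving both property control and interpretable behavior in a plug-and-play, general framework.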
Abstract: Large language models possess strong chemical reasoning capabilities, making them effective molecular editors. However, property-relevant information is implicitly entangled across their dense hidden states, providing no explicit handle for property control: a substantial fraction of edits fail to improve or even degrade target properties. To address these issues, we propose SLIM (Sparse Latent Interpretable Molecular editing), a plug-and-play framework that decomposes the editor’s hidden states into sparse, property-aligned features via a Sparse Autoencoder with learnable importance gates. Steering in this sparse feature space precisely activates property-relevant dimensions, improving editing success rate without modifying model parameters. The same sparse basis further supports interpretable analysis of editing behavior. Experiments on the MolEditRL benchmark across four model architectures and eight molecular properties show consistent gains over baselines, with improvements of up to 42.4 points.
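[309] Compute Where it Counts: Self Optimizing Language Models cs.LG | cs.CL
Yash Akhauri, Mohamed S. Abdelfattah
TL;DR: This paper proposes Self-Optimizing Language Models (SOL), which improve LLM inference efficiency by allocating the compute budget dynamically. A lightweight policy network is attached to a frozen base LLM and, from the hidden state, selects efficiency actions for each decoding step (attention sparsification, MLP activation pruning, and quantization bit-width), so that computation adapts to token difficulty.
Details
Motivation: Existing LLM inference optimizations (such as quantization and pruning) usually apply a uniform compute budget to every token, although token difficulty varies widely, so static compression over-computes on easy tokens and under-computes on hard ones. The paper studies dynamic budget allocation in autoregressive decoding: learning how much computation to spend per token.
Result: Across model variants and compute budgets, SOL outperforms static allocation and random schedule search at matched budget, finds a better quality-efficiency Pareto front in all experiments, and improves MMLU accuracy by up to 7.3% over uniform budget allocation strategies.
Insight: The novelty is framing dynamic compute allocation as a reinforcement learning problem, in which a policy network jointly controls several efficiency operations (attention sparsity, activation pruning, quantization) and is trained with group-relative policy optimization on fixed token sequences, adapting per-step compute without changing the base model's weights.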
Abstract: Efficient LLM inference research has largely focused on reducing the cost of each decoding step (e.g., using quantization, pruning, or sparse attention), typically applying a uniform computation budget to every generated token. In practice, token difficulty varies widely, so static compression can over-compute on easy steps and under-compute on hard ones. We study dynamic budget allocation for autoregressive decoding: learning how much computation to spend per token from within a single model. Self-Optimizing Language Models (SOL) pair a frozen LLM with a lightweight policy network that reads the LLM hidden state and selects a discrete efficiency action at each decode step. Actions can jointly control (i) token-level attention sparsity, (ii) structured activation pruning in the MLP, and (iii) activation quantization bit-width, while leaving the base model weights unchanged. We train the policy with group-relative policy optimization on teacher-forced episodes: the token sequence is fixed, while we sample multiple compute schedules (i.e., “counterfactual” schedules that vary only the efficiency actions for the same token path) and compare their likelihoods under the same supervision. Our reward trades off language-model quality against soft penalties that encourage episode-average budget usage to match a requested target. Across model variants and compute regimes, SOL improves quality at matched budget over static allocation and strong random schedule search, offering a complementary axis for inference-efficiency optimization. SOL discovers a better quality-efficiency pareto-front across all our experiments and improves MMLU accuracy by up to 7.3% over uniform budget allocation strategies.
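The training signal described in the abstract, a quality term traded off against a soft budget penalty and compared within a group of schedules for the same token sequence, can be sketched as follows. The exact reward form is not given in the abstract: the absolute-deviation penalty, the `penalty_weight` value, and the toy numbers are assumptions, and the normalization shown is the generic group-relative one (subtract the group mean, divide by the group standard deviation).

```python
def schedule_reward(log_likelihood, step_costs, target_budget, penalty_weight=4.0):
    """Quality minus a soft penalty on episode-average budget mismatch.

    Assumption: the penalty is the absolute deviation from the target.
    """
    average_cost = sum(step_costs) / len(step_costs)
    return log_likelihood - penalty_weight * abs(average_cost - target_budget)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of schedules for the same tokens."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (variance ** 0.5 + eps) for r in rewards]


# Three sampled compute schedules for one teacher-forced token sequence:
# (log-likelihood under that schedule, per-step compute fraction).
schedules = [
    (-10.0, [0.5, 0.5]),    # on budget
    (-9.0, [1.0, 1.0]),     # better likelihood, but twice the target compute
    (-12.0, [0.25, 0.25]),  # cheap, and noticeably worse likelihood
]
rewards = [schedule_reward(ll, costs, target_budget=0.5) for ll, costs in schedules]
advantages = group_relative_advantages(rewards)
```

Because the token path is fixed, differences in reward within the group come only from the efficiency actions, which is what makes the within-group comparison meaningful.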
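[310] Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning cs.LG | cs.CL
Junhao Shen, Teng Zhang, Xiaoyan Zhao, Hong Cheng
TL;DR: This paper proposes SLIM, a framework for dynamically managing the lifecycle of external skills in skill-based agentic reinforcement learning. It treats the active skill set as a dynamic variable optimized jointly with policy learning, estimates each skill's marginal contribution through leave-one-out validation, and applies three operations (retain, retire, expand) to match task and stage requirements.
Details
Motivation: Existing methods assume that external skills either accumulate as persistent guidance or are internalized into the policy, ending in zero-skill inference. This assumption is too restrictive: with limited parametric capacity and uneven marginal contribution across skills, the optimal active skill set is non-monotonic and depends on task and stage.
Result: On the ALFWorld and SearchQA benchmarks, SLIM outperforms the best baselines by 7.1 percentage points on average. The results indicate that policy learning and external skill retention are not mutually exclusive: some skills are internalized while others continue to provide external value.
Insight: The novelty is making skill management dynamic, optimizing the skill set through lifecycle operations and allowing internalized and externally retained skills to coexist, which gives a more general paradigm for skill-based agentic RL.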
Abstract: Large language model agents increasingly rely on external skills to solve complex tasks, where skills act as modular units that extend their capabilities beyond what parametric memory alone supports. Existing methods assume external skills either accumulate as persistent guidance or are internalized into the policy, eventually leading to zero-skill inference. We argue this assumption is overly restrictive, since with limited parametric capacity and uneven marginal contribution across skills, the optimal active skill set is non-monotonic and task- and stage-dependent. In this work, we propose SLIM, a framework of dynamic Skill LIfecycle Management for agentic reinforcement learning (RL), which treats the active external skill set as a dynamic optimization variable jointly updated with policy learning. Specifically, SLIM estimates each active skill’s marginal external contribution through leave-one-skill-out validation, then applies three lifecycle operations: retaining high-value skills, retiring skills whose contribution becomes negligible after sufficient exposure, and expanding the skill bank when persistent failures reveal missing capability coverage. Experiments show that SLIM outperforms the best baselines by an average of 7.1 percentage points across ALFWorld and SearchQA. Results further indicate that policy learning and external skill retention are not mutually exclusive: some skills are absorbed into the policy, while others continue to provide external value, supporting SLIM as a more general paradigm for skill-based agentic RL.
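Leave-one-skill-out estimation and the retain/retire decision can be sketched as below. The `validate` function, the skill names, the gains, and the `retire_below` threshold are all hypothetical; the expansion operation and the "after sufficient exposure" condition from the abstract are omitted for brevity.

```python
def marginal_contributions(active_skills, validate):
    """Validation score lost when each skill is removed on its own."""
    full_score = validate(frozenset(active_skills))
    return {
        skill: full_score - validate(frozenset(active_skills) - {skill})
        for skill in active_skills
    }


def lifecycle_step(active_skills, validate, retire_below=0.01):
    """Retain skills that still help; retire those with negligible contribution."""
    contributions = marginal_contributions(active_skills, validate)
    retained = {s for s, gain in contributions.items() if gain >= retire_below}
    retired = set(active_skills) - retained
    return retained, retired, contributions


# Toy validation: success rate of the current policy given a skill set.
# "greet" adds nothing, as if the policy had already internalized it.
SKILL_GAINS = {"search": 0.2, "plan": 0.1, "greet": 0.0}


def validate(skills):
    return 0.5 + sum(SKILL_GAINS[s] for s in skills)


retained, retired, contributions = lifecycle_step({"search", "plan", "greet"}, validate)
```

A skill's contribution is measured against the current policy, so the same skill can be worth keeping early in training and be retired later once the policy no longer needs it.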
[311] Weakly Supervised Concept Learning for Object-centric Visual Reasoning cs.LG | cs.AI | cs.CVPDF
Sparsh Tiwari, Bettina Finzel, Gesina Schwalbe
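TL;DR: This paper proposes a weakly supervised concept learning method for object-centric visual reasoning. It combines a slot-based object-centric architecture with a Variational Autoencoder (VAE) for self-supervision, which competes with concept guidance on the latent dimensions to achieve human-interpretable symbol grounding. Predictions are converted into symbolic background knowledge for reasoning frameworks such as Inductive Logic Programming (ILP), decision trees, and Bayesian networks, and the method is validated on synthetic and real-world datasets.
Details
Motivation: Two-stage neurosymbolic approaches, which decouple DNN-based perception from rule-based reasoning, require costly labels for the perception output. The paper introduces an efficient weak supervision scheme to reduce that requirement.
Result: Extensive experiments on synthetic and real-world datasets show that the method discovers complex, abstract object-centric reasoning rules with as little as 1% of labels and stays robust under substantial domain shift. At 1% supervision, it even outperforms state-of-the-art foundation model baselines in domain generalization.
Insight: The novelty is combining a slot architecture with a VAE for weakly supervised object-centric symbol grounding, which reduces dependence on costly labels while improving interpretability and generalization. Viewed objectively, the method offers neurosymbolic systems an efficient and scalable weak supervision approach.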
Abstract: Neurosymbolic systems promise to combine deep neural networks’ (DNNs) processing of raw sensor inputs with the few-shot performance of symbolic artificial intelligence. Two-stage approaches explicitly decouple DNN-based perception from subsequent rule-based reasoning. This avoids the optimization and interpretability issues of end-to-end differentiable approaches, but requires costly labels for the perception output. This paper introduces an efficient weak supervision scheme for the perception stage to ground its output symbols for logical induction in object-centric reasoning tasks. It combines a slot-based architecture for object-centricity with a Variational Autoencoder (VAE) for self-supervision, competing with concept guidance on latent dimensions for human-interpretable grounding. The resulting predictions are translated into symbolic background knowledge for reasoning frameworks, such as Inductive Logic Programming (ILP), Decision Trees, and Bayesian Networks. Our extensive empirical evaluation on synthetic and real-world datasets shows that our approach can discover complex, abstract rules for object-centric reasoning whilst reducing supervision to as little as 1% of labels, and being robust even under substantial domain shift. Notably, at 1% supervision it even outperforms state-of-the-art foundation model baselines in domain generalization.
[312] Anchoring the Eigengap: Cross-Modal Spectral Stabilization for Sample-Efficient Representation Learning cs.LG | cs.CV | eess.IVPDF
Nikhil J. Dhinagar, Vidhi Chhatbar, Chirag Jagad, Pavithra Senthilkumar, Sophia I. Thomopoulos
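TL;DR: This paper proposes a spectral theory of finite-sample representation learning, showing that the root cause of degraded deep vision models in low-data regimes is the collapse of the eigengap, which limits the number of recoverable signal modes. Using perturbation theory and concentration inequalities, it quantifies the recoverable dimension K(N) and introduces a truncated Mahalanobis energy to assess classification performance. In multimodal learning, vision-language models suppress noise-dominated directions through low-rank constraints and stabilize the spectral structure, improving data efficiency. Experiments on MNIST and multi-disease neuroimaging datasets support the framework.
Details
Motivation: Deep vision models degrade sharply in low-data regimes, especially in label-scarce fields such as medical imaging. The paper argues that the cause is not only overfitting: finite-sample noise collapses the spectral structure of the embedding covariance matrix, shrinking the eigengap and limiting the number of learnable signal modes.
Result: On MNIST and multi-disease neuroimaging datasets, multimodal training preserves more stable eigenmodes and improves class separation, even when unimodal models reach comparable few-shot accuracy. Using the truncated Mahalanobis energy and the recoverable dimension K(N) as diagnostics, the paper identifies spectral collapse as a fundamental bottleneck in low-data learning.
Insight: The contributions are: 1) formalizing the recoverable dimension K(N) of low-data representation learning from a spectral perspective; 2) proposing the truncated Mahalanobis energy as a theoretical proxy for classification performance; 3) explaining how multimodal learning improves data efficiency through spectral stabilization (suppressing noise and preserving the eigengap); 4) introducing a spectral filtering method based on the Riemann zeta function as a principled framework for improving data efficiency.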
Abstract: Deep vision models degrade sharply in low-data regimes, particularly in medical imaging where labeled samples are scarce. We show this arises not merely from overfitting but from a geometric failure: finite-sample noise corrupts the embedding covariance, collapsing the eigengap and limiting the number of recoverable signal-bearing modes. We develop a spectral theory of finite-sample representation learning that quantifies the recoverable dimension K(N), the number of eigenmodes that can be stably estimated from N samples. Using perturbation theory and concentration bounds, we show that only modes with eigenvalues above the noise floor $\|\hat{\Sigma} - \Sigma\|_{\mathrm{op}} \sim \sqrt{D/N}$ are reliable, yielding a truncated Mahalanobis energy that governs classification performance. Under a power-law spectral model, this energy can be approximated by a truncated Riemann zeta function, linking eigenvalue decay to data efficiency and AUC. Within this framework, multimodal learning acts as spectral stabilization: vision-language models impose low-rank constraints that suppress noise-dominated directions and preserve the eigengap, increasing K(N) under data scarcity. Across MNIST and multi-disease neuroimaging, we show that multimodal training maintains more stable modes and improves class separation, even when unimodal models achieve comparable few-shot accuracy. These results identify spectral collapse as a fundamental bottleneck in low-data learning. We use truncated Mahalanobis energy and K(N) to diagnose encoder quality, and introduce zeta-based spectral filtering as a principled approach to improve data efficiency.
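A rough numerical reading of the recoverable dimension K(N) is to count the eigenvalues that sit above a noise floor proportional to sqrt(D/N). The sketch below assumes a proportionality constant `c = 1` and a power-law spectrum λ_k = k^(-α); both are illustrative choices, and the abstract does not give the exact mapping from the truncated Mahalanobis energy to the zeta partial sum, so `truncated_zeta` shows only the partial sum itself.

```python
import math
from typing import List


def recoverable_dimension(eigenvalues: List[float], dim: int, n_samples: int,
                          c: float = 1.0) -> int:
    """Count eigenmodes whose eigenvalue exceeds the noise floor c * sqrt(D / N)."""
    noise_floor = c * math.sqrt(dim / n_samples)
    return sum(1 for lam in eigenvalues if lam > noise_floor)


def truncated_zeta(s: float, k: int) -> float:
    """Partial sum of the Riemann zeta series: sum_{n=1}^{k} n^(-s)."""
    return sum(n ** -s for n in range(1, k + 1))


# Hypothetical power-law spectrum with alpha = 1 in D = 64 dimensions.
spectrum = [1.0 / k for k in range(1, 65)]
k_small = recoverable_dimension(spectrum, dim=64, n_samples=256)    # floor = 0.5
k_large = recoverable_dimension(spectrum, dim=64, n_samples=1024)   # floor = 0.25
```

With this toy spectrum, quadrupling the sample count halves the noise floor and lets more modes clear it, which is the qualitative behaviour the abstract attributes to K(N).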
[313] Uncertainty-Aware Token Importance Estimation in Spiking Transformers cs.LG | cs.CVPDF
Wenxuan Liu, Zecheng Hao, Tong Bu, Yuran Wang, Zhaofei Yu
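TL;DR: This paper proposes Uncert, a training-free, plug-and-play framework for estimating token importance in spiking transformers. It models token-wise class evidence with a Dirichlet distribution and summarizes each token's temporal uncertainty across spiking steps (mean and fluctuation) to separate informative tokens from redundant ones, enabling token pruning at inference time.
Details
Motivation: Token processing in spiking transformers introduces redundancy and inference cost across multiple spiking steps. Existing token reduction methods rely mainly on response-based cues such as activation magnitude, firing statistics, or feature similarity, and do not explicitly characterize token importance from the perspective of temporally evolving class evidence.
Result: Experiments on static and neuromorphic benchmarks show that Uncert achieves a favorable trade-off between accuracy and efficiency, with the most consistent gains observed under token pruning.
Insight: The novelty lies in being the first to assess token importance from the perspective of temporal uncertainty patterns, using Dirichlet-based evidence modeling and cross-step uncertainty aggregation, which offers new insight into token dynamics in spiking transformers.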
Abstract: Spiking transformers have shown strong potential for neuromorphic vision, yet their token processing across multiple spiking steps still introduces substantial redundancy and inference cost. Existing token reduction methods mainly rely on response-based cues, such as activation magnitude, firing statistics, or feature similarity. Although effective, these criteria do not explicitly characterize token importance from the perspective of temporally evolving class evidence. In spiking transformers, token representations are progressively formed across multiple spiking steps rather than determined at a single instant, suggesting that token importance should be evaluated not only by instantaneous responses but also by temporal uncertainty patterns. Our key observation is that tokens exhibit heterogeneous uncertainty trajectories over time, and that their temporally aggregated uncertainty statistics provide an effective cue for distinguishing informative tokens from redundant ones. Motivated by this, we propose Uncert, a training-free and plug-and-play token importance estimation framework for spiking transformers. Specifically, Uncert models token-wise class evidence with a Dirichlet distribution and summarizes each token’s temporal uncertainty using its mean and fluctuation across spiking steps, yielding an uncertainty-aware importance score for token reduction during inference. Experiments on both static and neuromorphic benchmarks show that Uncert achieves favorable accuracy and efficiency tradeoffs, with the most consistent gains observed under token pruning. Further analysis reveals a clear empirical connection between temporal uncertainty patterns and token contribution, offering new insights into token dynamics in spiking transformers.
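One common way to turn class evidence into a Dirichlet uncertainty is the evidential formulation α_k = e_k + 1, u = K / Σ_k α_k. The sketch below assumes that formulation and summarizes a token's uncertainty trajectory by its mean and population standard deviation. The abstract does not state Uncert's exact evidence function or how the two statistics are combined into a score, so those parts are deliberately left out and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def dirichlet_uncertainty(evidence: List[float]) -> float:
    """Uncertainty mass u = K / sum(alpha), with alpha_k = evidence_k + 1 (assumed form)."""
    num_classes = len(evidence)
    return num_classes / (sum(evidence) + num_classes)


def temporal_uncertainty_stats(evidence_per_step: List[List[float]]) -> Tuple[float, float]:
    """Mean and fluctuation (population std) of one token's uncertainty across spiking steps."""
    trajectory = [dirichlet_uncertainty(e) for e in evidence_per_step]
    return mean(trajectory), pstdev(trajectory)


# One token, two classes, two spiking steps: no evidence at first, then evidence for class 0.
token_mean, token_fluctuation = temporal_uncertainty_stats([[0.0, 0.0], [2.0, 0.0]])
```

A token with no evidence has uncertainty 1.0, and the value shrinks as evidence accumulates, so the trajectory statistics capture both how uncertain a token is on average and how much that changes over time.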
[314] Learning-Augmented Scalable Linear Assignment Problem Optimization via Neural Dual Warm-Starts cs.LG | cs.CV | cs.DS | math.OCPDF
Ilay Yavlovich, Jad Agbaria, Muhamed Mhamed, Jose Yallouz, Nir Weinberger
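TL;DR: This paper proposes a learning-augmented framework for the linear assignment problem in which a neural network predicts dual variables to warm-start a classical exact solver, substantially speeding up solving while preserving optimality and worst-case guarantees. It uses RowDualNet, a lightweight row-independent architecture that avoids memory bottlenecks, and ensures feasibility through LP duality, enabling efficient solving of large instances (N=16,384).
Details
Motivation: The linear assignment problem is a fundamental combinatorial optimization task. Classical exact solvers (such as the Hungarian and LAPJV algorithms) have cubic time complexity, which becomes a bottleneck on large instances. Existing learning-based methods often sacrifice exactness or are limited by memory constraints, so they fail to scale.
Result: The method achieves over 2x speedups on challenging synthetic distributions, over 1.25x on real-world multi-object tracking (MOT) datasets, and over 1.5x on transportation (LPT) datasets, while strictly maintaining full optimality and generalizing zero-shot to real-world tasks.
Insight: The contributions are: 1) a learning-augmented framework that warm-starts a classical solver with predicted dual variables, accelerating computation while keeping optimality guarantees; 2) RowDualNet, a lightweight row-independent architecture that avoids the O(N²) memory bottleneck of graph-based models; 3) a constructive mechanism based on LP duality (the Min-Trick) that ensures feasibility without costly iterative projection; 4) a fallback that prevents asymptotic runtime degradation when the learned advice is unreliable.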
Abstract: The Linear Assignment Problem (LAP) is a fundamental combinatorial optimization task with applications ranging from computer vision to logistics. Classical exact solvers such as the Hungarian and Jonker-Volgenant (LAPJV) algorithms guarantee optimality, but their cubic time complexity $\mathcal{O}(N^{3})$ becomes a bottleneck for large-scale instances. Recent learning-based approaches aim to replace these solvers with neural models, often sacrificing exactness or failing to scale due to memory constraints. We propose a learning-augmented framework that accelerates exact assignment solvers while maintaining optimality and worst-case guarantees. Our method predicts dual variables to warm-start a classical solver, with a fallback that prevents asymptotic runtime degradation when the learned advice is unreliable. We introduce RowDualNet, a lightweight row-independent architecture that avoids the $\mathcal{O}(N^{2})$ memory bottleneck of graph-based models, enabling neural warm-starting at large scale ($N=16{,}384$). Feasibility is ensured via a constructive mechanism based on LP duality (namely, the Min-Trick), eliminating costly iterative projection. Empirically, our approach reduces the search effort of LAPJV and achieves over $2{\times}$ speedups on challenging synthetic distributions, in addition to improving over $1.25{\times}$ and $1.5{\times}$ on real-world tracking (MOT) and transportation (LPT) datasets, respectively, while strictly maintaining full optimality, effectively yielding a robust zero-shot generalization to real-world tasks.
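For a minimization LAP, the dual constraints are u_i + v_j ≤ c_ij. One standard way to make any predicted row duals u feasible is to set v_j = min_i (c_ij − u_i). The sketch below assumes this is what the abstract's "Min-Trick" refers to; the function names and the plain-list cost matrix are illustrative, and the solver warm-start itself is not shown.

```python
from typing import List


def min_trick(cost: List[List[float]], u: List[float]) -> List[float]:
    """Column duals v_j = min_i (c_ij - u_i), the largest values that keep (u, v) feasible."""
    n_rows, n_cols = len(cost), len(cost[0])
    return [min(cost[i][j] - u[i] for i in range(n_rows)) for j in range(n_cols)]


def is_dual_feasible(cost: List[List[float]], u: List[float], v: List[float],
                     tol: float = 1e-9) -> bool:
    """Check u_i + v_j <= c_ij for every cell of the cost matrix."""
    return all(u[i] + v[j] <= cost[i][j] + tol
               for i in range(len(cost)) for j in range(len(cost[0])))


def dual_objective(u: List[float], v: List[float]) -> float:
    """Dual objective sum(u) + sum(v), a lower bound on the optimal assignment cost."""
    return sum(u) + sum(v)


cost = [[4.0, 1.0], [2.0, 3.0]]
u_pred = [1.0, 2.0]              # stand-in for a network's predicted row duals
v = min_trick(cost, u_pred)
```

Because feasibility holds by construction for any `u_pred`, a poor prediction can only weaken the lower bound rather than invalidate the warm start, which is consistent with the optimality guarantee the abstract emphasizes.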
[315] Beyond Spatial Compression: Interface-Centric Generative States for Open-World 3D Structure cs.LG | cs.CVPDF
Xiang Chen, Alexander Binder
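TL;DR: This paper proposes a new 3D representation, interface-centric generative states, as an alternative to conventional spatial-compression 3D tokenizers. It factorizes the representation into canonical local geometry, partition-conditioned context, and relational seam variables, so that local geometry, component ownership, and attachment validity can be queried, constrained, and repaired during decoding, improving the structural robustness of open-world multi-component 3D assets.
Details
Motivation: Current 3D tokenizers mostly treat representation as spatial compression. In open-world assets (with intersecting components, noisy topology, and weak canonical structure), local shape, component identity, and assembly relations become entangled in the latent stream and are not natively addressable during decoding, which causes a representation mismatch. The paper aims to address this limitation.
Result: Trained on single-object CAD models and evaluated zero-shot on open-world multi-component assets, C2LT-3D improves structural robustness, and its latent variables remain actionable under adversarial attachment settings. The results suggest that open-world 3D generative representations should be evaluated not only by reconstruction fidelity but also by whether their discrete states remain operational for assembly-level structural reasoning.
Insight: The novelty is recasting 3D representation from a passive compressed code into an operational generative state. The factorized representation (canonical local geometry, partition-conditioned context, relational seam variables) targets pose leakage, cross-component interference, and invalid local attachment respectively, and supports attachment validation, latent structural repair, targeted intervention, and constrained serialization without a separate post-hoc structure recovery module.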
Abstract: Current 3D tokenizers largely treat representation as spatial compression: compact codes reconstruct surface geometry, but leave component ownership and attachment validity implicit. In open-world assets with intersecting components, noisy topology, and weak canonical structure, this creates a representation mismatch: local shape, component identity, and assembly relations become entangled in a latent stream and are not natively addressable during decoding. We formulate an alternative view, interface-centric generative states, in which tokenization constructs an operational state rather than a passive compressed code. The state exposes local geometry, component ownership, and attachment validity as variables that can be queried, constrained, and repaired during decoding. We instantiate this formulation with Component-Conditioned Canonical Local Tokens (C2LT-3D), factorizing representation into canonical local geometry, partition-conditioned context, and relational seam variables. Each factor targets a distinct failure mode of compression-centric tokens: pose leakage, cross-component interference, or invalid local attachment. This exposed state supports attachment validation, latent structural repair, targeted intervention, and constrained serialization without a separate post-hoc structure recovery module. Trained on single-object CAD models and evaluated zero-shot on open-world multi-component assets, C2LT-3D improves structural robustness and shows that its latent variables remain actionable under adversarial attachment settings. These results suggest that open-world 3D generative representations should be evaluated not only by reconstruction fidelity, but by whether their discrete states remain operational for assembly-level structural reasoning.
[316] Heteroscedastic Diffusion for Multi-Agent Trajectory Modeling cs.LG | cs.CVPDF
Guillem Capellera, Antonio Rubio, Luis Ferraz, Antonio Agudo
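TL;DR: This paper proposes two models, U2Diffine and U2Diff, for multi-agent trajectory modeling. They handle trajectory completion and forecasting in a unified way and provide state-wise heteroscedastic uncertainty estimates as well as an error-probability ranking of generated modes.
Details
Motivation: Existing methods usually focus only on trajectory forecasting, neglect trajectory completion, and lack state-wise uncertainty quantification. Multi-modal sampling methods also cannot rank the scenes generated under the same prior observations by error probability, which limits practical use.
Result: On four challenging sports datasets (NBA, Basketball-U, Football-U, Soccer-U), the method outperforms state-of-the-art (SOTA) approaches on both trajectory completion and forecasting.
Insight: Adding the negative log-likelihood of the predicted noise to the standard denoising loss, and propagating latent-space uncertainty to the real state space with a first-order Taylor approximation, yields state-wise heteroscedastic uncertainty estimates. A RankNN used in post-processing provides an error probability estimate for each generated mode, making the predictions more trustworthy and practical.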
Abstract: Multi-agent trajectory modeling traditionally focuses on forecasting, often neglecting more general tasks like trajectory completion, which is essential for real-world applications such as correcting tracking data. Existing methods also generally predict agents’ states without offering any state-wise measure of heteroscedastic uncertainty. Moreover, popular multi-modal sampling methods lack error probability estimates for each generated scene under the same prior observations, which makes it difficult to rank the predictions at inference time. We introduce U2Diffine, a unified diffusion model built to perform trajectory completion while simultaneously offering state-wise heteroscedastic uncertainty estimates. This is achieved by augmenting the standard denoising loss with the negative log-likelihood of the predicted noise, and then propagating the latent space uncertainty to the real state space using a first-order Taylor approximation. We also propose U2Diff, a faster baseline that avoids gradient computation during sampling. This approach significantly increases inference speed, making it as efficient as a standard generative-only diffusion model. For post-processing, we integrate a Rank Neural Network (RankNN) that enables error probability estimation for each generated mode, demonstrating strong correlation with ground truth errors. Our method outperforms state-of-the-art solutions in both trajectory completion and forecasting across four challenging sports datasets (NBA, Basketball-U, Football-U, Soccer-U), underscoring the effectiveness of our uncertainty and error probability estimation.
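The two ingredients named in the abstract, a Gaussian negative log-likelihood term and first-order Taylor propagation of variance, can be illustrated in one dimension. This is a scalar sketch under the assumption of a Gaussian with predicted variance; the paper's actual loss weighting, multivariate form, and decoder are not given in the abstract, and the finite-difference slope is only a stand-in for an analytic or autodiff derivative.

```python
import math
from typing import Callable


def gaussian_nll(target: float, mean: float, var: float) -> float:
    """Negative log-likelihood of `target` under N(mean, var), with var > 0."""
    return 0.5 * (math.log(2.0 * math.pi * var) + (target - mean) ** 2 / var)


def propagate_variance(f: Callable[[float], float], x: float, var_x: float,
                       h: float = 1e-5) -> float:
    """First-order Taylor propagation: Var[f(x)] is approximately f'(x)^2 * Var[x]."""
    slope = (f(x + h) - f(x - h)) / (2.0 * h)   # central finite difference
    return slope ** 2 * var_x


# Hypothetical latent-to-state map that scales the latent value by 3.
state_var = propagate_variance(lambda z: 3.0 * z, x=0.5, var_x=0.04)
```

Under the NLL term, a large error is penalized less when the model also predicts a large variance, while the log-variance term keeps it from inflating the variance everywhere, which is what makes the learned uncertainty heteroscedastic.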
[317] Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models cs.LG | cs.CVPDF
Andreas Bergmeister, Stefanie Jegelka, Nikolas Nüsken, Carles Domingo-Enrich, Jakiw Pidstrigach
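TL;DR: This paper proposes Reinforce Adjoint Matching (RAM), a new method for reinforcement learning (RL) post-training of diffusion and flow-matching models. Through a consistency loss, it folds the reward signal directly into the pretraining regression structure, with no need for costly SDE rollouts, backward adjoint sweeps, or reward gradients, enabling efficient and scalable model alignment.
Details
Motivation: Existing RL post-training methods (relying on SDE rollouts, reward gradients, or surrogate losses) break the efficient regression structure of diffusion and flow-matching pretraining, which makes them computationally expensive and hard to scale. The paper extends the pretraining regression structure to RL post-training so that models can be aligned with rewards (for example, better image composition, text rendering, and human preference) in a simpler, more scalable way.
Result: On Stable Diffusion 3.5M, RAM achieves the highest reward on composability, text rendering, and human preference. It reaches Flow-GRPO's peak reward in up to 50x fewer training steps.
Insight: The core idea is to combine the form of the optimal solution of KL-regularized reward maximization with the adjoint-matching optimality condition and a REINFORCE identity, deriving a simple consistency loss (RAM). RL post-training can then proceed like pretraining, using only sample noising and regression, which preserves the scalability of pretraining.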
Abstract: Diffusion and flow-matching models scale because pretraining is supervised regression: a clean sample is noised analytically, and a model regresses against a closed-form target. RL post-training aligns the model with a reward. In image generation, this makes samples compose objects correctly, render text legibly, and match human preferences. Existing methods rely on costly SDE rollouts, reward gradients, or surrogate losses, sacrificing pretraining’s regression structure. We show that the structure extends to RL post-training. Under KL-regularized reward maximization, the optimal generative process tilts the clean-endpoint distribution towards samples with higher reward and leaves the noising law unchanged. Combining this with the adjoint-matching optimality condition and a REINFORCE identity, we derive Reinforce Adjoint Matching (RAM): a consistency loss that corrects the pretraining target with the reward. At each step, we draw a clean endpoint from the current model, evaluate its reward, noise it as in pretraining, and regress. No SDE rollouts, backward adjoint sweeps, or reward gradients are required. Like the pretraining objective, RAM is simple and scales. On Stable Diffusion 3.5M, RAM achieves the highest reward on composability, text rendering, and human preference, reaching Flow-GRPO’s peak reward in up to $50\times$ fewer training steps.
cs.AI [Back]
[318] Spatial Priming Outperforms Semantic Prompting: A Grid-Based Approach to Improving LLM Accuracy on Chart Data Extraction cs.AI | cs.CE | cs.CL | cs.CV | cs.SEPDF
Andrei Lazarev, Dmitrii Sedov, Alexander Galkin
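TL;DR: This paper studies strategies for improving the accuracy of multimodal large language models when extracting data from scientific charts, comparing high-level semantic priming with low-level spatial priming. Semantic methods (such as a two-stage metadata-first framework and Chain-of-Thought) brought no significant improvement, whereas a simple spatial priming method (overlaying a coordinate grid on the chart image) significantly reduced data extraction error.
Details
Motivation: Automated data extraction from scientific charts is essential for large-scale literature analysis, but the accuracy of multimodal LLMs on non-standardized charts remains a challenge. The paper asks which strategy improves model performance more: high-level semantic priming or low-level spatial priming.
Result: Quantitative experiments on a synthetic dataset show that grid-based spatial priming significantly reduces data extraction error relative to the baseline (SMAPE from 25.5% to 19.5%, p < 0.05), while semantic methods yield no statistically significant improvement.
Insight: The contribution is a comparative experiment showing that spatial priming (a coordinate grid overlay) is more effective than semantic priming, giving current multimodal models a more reliable source of spatial context for this class of tasks. Viewed objectively, the grid method is simple to implement, improves the model's grasp of a chart's spatial structure, and has practical value.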
Abstract: The automated extraction of data from scientific charts is a critical task for large-scale literature analysis. While multimodal Large Language Models (LLMs) show promise, their accuracy on non-standardized charts remains a challenge. This raises a key research question: what is the most effective strategy to improve model performance, high-level semantic priming or low-level spatial priming? This paper presents a comparative investigation into these two distinct strategies. We describe our exploratory experiments with semantic methods, such as a two-stage metadata-first framework and Chain-of-Thought, which failed to produce a statistically significant improvement. In contrast, we present a simple but highly effective spatial priming method: overlaying a coordinate grid onto the chart image before analysis. Our quantitative experiment on a synthetic dataset demonstrates that this grid-based approach provides a statistically significant reduction in data extraction error (SMAPE reduced from 25.5% to 19.5%, p < 0.05) compared to a baseline. We conclude that for the current generation of multimodal models, providing explicit spatial context is a more effective and reliable strategy than high-level semantic guidance for this class of tasks.
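The reported error metric, SMAPE, has several variants in use. The sketch below assumes the common form with the mean of absolute values in the denominator, reported as a percentage; the abstract does not say which variant the authors use, so treat the exact normalization as an assumption.

```python
from typing import Sequence


def smape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2.0
        # Treat a pair of zeros as a perfect match instead of dividing by zero.
        total += 0.0 if denom == 0.0 else abs(p - a) / denom
    return 100.0 * total / len(actual)


# Hypothetical extracted values compared with ground-truth chart values.
error = smape([100.0, 200.0], [100.0, 200.0])
```

Because the denominator uses both the true and the extracted value, the metric stays bounded even when a model reads a value far too high or far too low, which suits noisy chart extraction.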
[319] RewardHarness: Self-Evolving Agentic Post-Training cs.AI | cs.CL | cs.CV | cs.LGPDF
Yuxuan Zhang, Penghui Du, Bo Li, Cong Wei, Junwen Miao
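TL;DR: RewardHarness is a self-evolving agentic reward framework that reframes reward modeling as context evolution rather than weight optimization. Using only about 100 preference demonstrations, it aligns with human preferences by iteratively evolving a library of tools and skills for evaluating instruction-guided image edits. An Orchestrator selects relevant tools and skills from the library, a frozen Sub-Agent uses them to build a reasoning chain that produces a preference judgment, and the library is refined automatically by comparing predictions with ground-truth preferences, without additional human annotation.
Details
Motivation: Reward models for evaluating instruction-guided image edits typically depend on large-scale preference annotation and additional model training, which creates a data-efficiency gap: humans can infer evaluation criteria from a few examples, while models need hundreds of thousands of comparisons. The paper proposes a more data-efficient approach to reward modeling to close this gap.
Result: Using only 0.05% of the EditReward preference data (about 100 demonstrations), RewardHarness reaches 47.4% average accuracy on image-editing evaluation benchmarks, surpassing GPT-5 by 5.3 percentage points. When used as the reward signal for GRPO fine-tuning, RL-tuned models score 3.52 on ImgEdit-Bench.
Insight: The core innovation is shifting reward modeling from training weights on large-scale data to a context-evolution paradigm, in which an agent evolves a library of tools and skills from a small number of demonstrations. The Orchestrator refines the library by automatically analyzing reasoning successes and failures, giving data-efficient, adaptive reward learning without additional annotation.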
Abstract: Evaluating instruction-guided image edits requires rewards that reflect subtle human preferences, yet current reward models typically depend on large-scale preference annotation and additional model training. This creates a data-efficiency gap: humans can often infer the target evaluation criteria from only a few examples, while models are usually trained on hundreds of thousands of comparisons. We present RewardHarness, a self-evolving agentic reward framework that reframes reward modeling as context evolution rather than weight optimization. Instead of learning from large-scale annotations, RewardHarness aligns with human preferences by iteratively evolving a library of tools and skills from as few as 100 preference demonstrations. Given a source image, candidate edited images, and an editing instruction, an Orchestrator selects the most relevant subset of tools and skills from the maintained library, and a frozen Sub-Agent uses them to construct a reasoning chain that produces a preference judgment. By comparing predicted judgments with ground-truth preferences and analyzing successes and failures in the reasoning process, the Orchestrator automatically refines its library of tools and skills without additional human annotation. Using only 0.05% of the EditReward preference data, RewardHarness achieves 47.4% average accuracy on image-editing evaluation benchmarks, surpassing GPT-5 by 5.3 points. When used as a reward signal for GRPO fine-tuning, RL-tuned models achieve 3.52 on ImgEdit-Bench. Project page: https://rewardharness.com.
[320] Ace-Skill: Bootstrapping Multimodal Agents with Prioritized and Clustered Evolution cs.AI | cs.CLPDF
Feng Xiong, Zengbin Wang, Yong Wang, Xuecai Hu, Jinghan He
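TL;DR: Ace-Skill is a co-evolutionary framework that jointly optimizes task sampling and knowledge management for multimodal agents through prioritized sampling and clustered organization. It addresses data inefficiency and knowledge interference in self-evolving agents and achieves substantial gains on several multimodal tool-use benchmarks.
Details
Motivation: Self-evolving agents face two coupled bottlenecks in continual adaptation: data inefficiency (costly rollout effort is wasted on low-value samples) and knowledge interference (heterogeneous knowledge stored in a shared repository causes noisy retrieval and task-misaligned guidance). Together these form a self-reinforcing failure loop.
Result: Across four multimodal tool-use benchmarks, Ace-Skill delivers strong gains (for example, a +35.46% relative improvement in Avg@4 accuracy), enabling an open-source 35B mixture-of-experts model to match or surpass proprietary models. The acquired knowledge also transfers zero-shot to smaller 9B and 4B models.
Insight: The novelty is jointly optimizing prioritized sampling (combining prior priority with lazy-decay proficiency tracking) and semantically clustered knowledge organization, which turns self-evolution into a virtuous cycle. Viewed objectively, improving sampling and organization together breaks the failure loop, and the demonstrated transfer to smaller models gives resource-constrained agents a path to inherit advanced capabilities.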
Abstract: Self-evolving agents present a promising path toward continual adaptation by distilling task interactions into reusable knowledge artifacts. In practice, this paradigm remains hindered by two coupled bottlenecks: data inefficiency, where costly rollout effort is disproportionately spent on low-value samples rather than informative ones, and knowledge interference, where heterogeneous knowledge stored in shared repositories leads to noisy retrieval and task-misaligned guidance. Together, these issues form a self-reinforcing failure loop in which uninformative rollouts yield noisy knowledge, which in turn degrades subsequent rollouts. In this work, we introduce Ace-Skill, a co-evolutionary framework that jointly optimizes rollout allocation and knowledge organization for self-evolving multimodal agents. Specifically, Ace-Skill combines a prioritized sampler with lazy-decay proficiency tracking to focus rollouts on informative and insufficiently mastered samples, and a clustered organizer that semantically clusters knowledge for cleaner retrieval and more reliable adaptation. By improving sampling and organization together, Ace-Skill turns self-evolution into a virtuous cycle in which more informative rollouts produce higher-quality knowledge that supports stronger subsequent rollouts. Across four multimodal tool-use benchmarks, Ace-Skill delivers strong gains (e.g., +35.46% relative improvement in Avg@4 accuracy), enabling an open-source 35B MoE model to match or surpass proprietary models. The acquired knowledge also transfers effectively in a zero-shot manner to smaller 9B and 4B models, allowing resource-constrained agents to inherit advanced capabilities without additional training. The code is publicly available at https://github.com/AMAP-ML/Ace-Skill.
[321] Open Ontologies: Tool-Augmented Ontology Engineering with Stable Matching Alignment cs.AI | cs.CL | cs.DBPDF
Fabio Rovai
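TL;DR: This paper presents Open Ontologies, an open-source ontology engineering system implemented in Rust. Through the Model Context Protocol, it combines LLM-driven ontology construction with formal OWL reasoning and ontology alignment. The main findings are that stable 1-to-1 matching is the dominant factor in ontology alignment quality, and that in tool-augmented ontology interaction, structured tool access works better than having the LLM read raw OWL files directly.
Details
Motivation: The goal is to build an ontology engineering system that integrates large language models with formal reasoning in order to improve the quality and efficiency of ontology construction and alignment.
Result: On the OAEI Anatomy track, the system reaches F1 = 0.832 (precision 0.963, recall 0.733), competitive with state-of-the-art systems and with the highest precision. On the Conference track, F1 is 0.438. Ablations show that stable matching is the key factor: removing it drops F1 to 0.728. In the tool interaction task, structured MCP tool access achieves F1 = 0.717, well above an LLM reading the raw file (F1 = 0.323) or having no file access (F1 = 0.431).
Insight: The main contribution is establishing stable 1-to-1 matching as the core mechanism of ontology alignment and showing that it is insensitive to signal weights. The study also shows that giving an LLM access through a structured tool interface (such as MCP) enables a qualitatively different and better mode of interaction than having it parse raw syntax, which matters for designing tools that integrate LLMs with formal systems.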
Abstract: We present Open Ontologies, an open-source ontology engineering system implemented in Rust that integrates LLM-driven construction with formal OWL reasoning and ontology alignment via the Model Context Protocol. Our primary finding is that stable 1-to-1 matching is the dominant factor in ontology alignment quality: on the OAEI Anatomy track, it achieves F1 = 0.832 (P = 0.963, R = 0.733), competitive with state-of-the-art systems and exceeding all in precision. Ablation across five weight configurations shows that signal weights are irrelevant when stable matching is applied (F1 varies by less than 0.004), while removing stable matching drops F1 to 0.728. On the Conference track, the same method achieves F1 = 0.438. On tool-augmented ontology interaction, we find a surprising result: an LLM reading a raw OWL file (F1 = 0.323) performs worse than the same LLM with no file at all (F1 = 0.431), while structured MCP tool access achieves F1 = 0.717. This demonstrates that tool structure provides a qualitatively different mode of access that the LLM cannot replicate by reading raw syntax. The system ships as a single binary under the MIT licence.
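Stable 1-to-1 matching over a similarity matrix can be sketched with the classic Gale-Shapley deferred-acceptance algorithm. This is an illustration in Python, not the system's Rust implementation; how Open Ontologies builds its similarity signals and breaks ties is not described in the abstract, so `stable_match` and the toy matrix are assumptions. The `f1` helper reproduces the reported Anatomy-track F1 from the stated precision and recall.

```python
from typing import Dict, List


def stable_match(sim: List[List[float]]) -> Dict[int, int]:
    """Gale-Shapley matching: source entities propose to targets in order of similarity."""
    n_src, n_tgt = len(sim), len(sim[0])
    prefs = [sorted(range(n_tgt), key=lambda j, i=i: -sim[i][j]) for i in range(n_src)]
    next_choice = [0] * n_src           # index of the next target each source will try
    holder: Dict[int, int] = {}         # target index -> source currently holding it
    free = list(range(n_src))
    while free:
        i = free.pop()
        if next_choice[i] >= n_tgt:
            continue                    # source has tried every target; stays unmatched
        j = prefs[i][next_choice[i]]
        next_choice[i] += 1
        current = holder.get(j)
        if current is None:
            holder[j] = i
        elif sim[i][j] > sim[current][j]:
            holder[j] = i               # target prefers the new proposer
            free.append(current)
        else:
            free.append(i)              # rejected; try the next target later
    return {src: tgt for tgt, src in holder.items()}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# Both source entities prefer target 0; a per-row argmax would map them to the same target.
alignment = stable_match([[0.9, 0.8], [0.85, 0.1]])
```

In the toy matrix, stable matching resolves the conflict by giving target 0 to the source it matches best and pushing the other source to its second choice, which is the kind of 1-to-1 constraint the abstract credits with the precision gain.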
[322] Emergent Semantic Role Understanding in Language Models cs.AI | cs.CL | cs.LGPDF
Carla Griffiths, Mirco Musolesi
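TL;DR: This paper studies whether semantic role understanding ("who did what to whom") emerges spontaneously during language model pre-training. It freezes decoder-only transformers and trains linear probes to extract semantic role information, finding that pre-trained representations already contain substantial semantic role information but still fall short of fine-tuned models. Semantic role understanding therefore emerges partially, but not completely, from pre-training.
Details
Motivation: The paper investigates whether semantic role understanding in language models emerges from pre-training alone or depends on task-specific fine-tuning, in order to understand what models learn from data and how much supervision they need.
Result: Across model scales, frozen representations support substantial extraction of semantic role information. Performance improves with scale but does not fully match fine-tuned models, indicating that the pre-training objective induces semantic role structure, while its internal implementation becomes more distributed as model scale increases.
Insight: The novelty is using linear probes on frozen pre-trained models to reveal the partial emergence of semantic role understanding during pre-training and the effect of model scale on how distributed the representation is, providing a method for studying the internal semantic representations of language models.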
Abstract: Understanding how linguistic structure emerges in language models is central to interpreting what these systems learn from data and how much supervision they truly require. In particular, semantic role understanding (“who did what to whom”) is a core component of meaning representation, yet it remains unclear whether it arises from pre-training alone or depends on task-specific fine-tuning. We study whether semantic role understanding emerges during language model pre-training or requires task-specific fine-tuning. We freeze decoder-only transformers and train linear probes to extract semantic roles, using performance to infer whether role information is already encoded in pre-training or learned during adaptation. Across model scales, we find that frozen representations contain substantial semantic role information, with performance improving but not fully matching fine-tuned models. This indicates partial but incomplete emergence from pre-training alone. We show that semantic role structure emerges from language modeling objectives, but its internal implementation shifts toward more distributed representations as model scale increases.
[323] Towards Conversational Medical AI with Eyes, Ears and a Voice cs.AI | cs.CL | cs.CVPDF
Meet Shah, Jason Gusdorf, Anil Palepu, Chunjong Park, Jack W. O’Sullivan
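TL;DR: This paper introduces AI co-clinician, a first-of-its-kind conversational AI system built on Gemini's low-latency audio and video processing, which uses continuous audio-visual streams from live doctor-patient conversations to support clinical decisions. In a randomized crossover study of simulated telemedicine consultations, the system approaches primary care physicians on key metrics and significantly outperforms GPT-Realtime.
Details
Motivation: Medical practice relies not only on skillful dialogue but also on the nuanced exchange and interpretation of rich auditory and visual cues between doctors and patients. Text-only approaches cannot capture the real challenges of medical consultation, so AI systems that handle real-time multimodal data are needed to support clinical decisions.
Result: In a randomized crossover simulation study based on 20 standardized outpatient scenarios, using TelePACES evaluation criteria and case-specific rubrics, AI co-clinician approaches primary care physicians on key TelePACES dimensions such as management plans and differential diagnosis and significantly outperforms GPT-Realtime on all general criteria, but still trails physicians in overall performance on case-specific assessments.
Insight: The contribution is the first conversational medical AI system to use continuous real-time audio-visual streams, with a dual-agent architecture that balances deep clinical reasoning against the low latency needed for natural dialogue. The central insight is that high-stakes real-time diagnostic AI is most safely advanced as a collaborative, triadic model in which the AI acts as a supportive co-clinician for doctors and patients rather than as a replacement.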
Abstract: The practice of medicine relies not only upon skillful dialogue but also on the nuanced exchange and interpretation of rich auditory and visual cues between doctors and patients. Building on the low-latency voice and video processing capabilities of Gemini, we introduce AI co-clinician, a first-of-its-kind conversational AI system utilizing continuous streams of audio-visual data from live patient conversations to inform real-time clinical decisions. Its dual-agent architecture balances deep clinical reasoning with the low latency required for natural dialogue. To assess this system, we implemented a video-based interface emulating telemedicine consultations. We crafted 20 standardized outpatient scenarios requiring proactive real-time auditory and visual reasoning and designed “TelePACES” evaluation criteria alongside case-specific rubrics. In a randomized, interface-blinded, crossover simulation study (n = 120 encounters) with 10 internal medicine residents as patient actors, we compared AI co-clinician with primary care physicians (PCPs), GPT-Realtime, and a baseline agent. AI co-clinician approached PCPs in key TelePACES dimensions, including management plans and differential diagnosis, while significantly outperforming GPT-Realtime across all general criteria. While our agent demonstrated parity with PCPs in case-specific triage measures, physicians maintained superior overall performance in case-specific assessments. Although AI co-clinician marks a significant advance in real-time telemedical AI, gaps remain in physical examination and disease-specific reasoning. Our work shows that text-only approaches fail to capture the true challenges of medical consultation and suggests that high-stakes real-time diagnostic AI is most safely advanced in collaborative, triadic models where AI can be a supportive co-clinician for doctors and patients.
[324] The Metacognitive Probe: Five Behavioural Calibration Diagnostics for LLMs cs.AI | cs.CL | cs.LGPDF
Rafael C. T. Oliveira
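TL;DR: This paper presents the Metacognitive Probe, a five-task diagnostic that decomposes a large language model's confidence behaviour into five distinct dimensions: confidence calibration, epistemic vigilance, knowledge boundary, calibration range, and reasoning-chain validation. The instrument is evaluated on 8 frontier models and 69 humans and aims to reveal whether a model knows when its answer is wrong, not merely whether the answer is correct.
Details
Motivation: Existing composite benchmarks (such as MMLU and BIG-Bench) only check whether a model produces a correct answer and cannot reveal whether the model knows when its answer is wrong. The Metacognitive Probe decomposes confidence behaviour along behavioural dimensions to identify overconfidence in specific domains.
Result: Gemini 2.5 Flash shows a 47-point within-model dissociation: it is best on within-task calibration (T1-CC = 88; Spearman rho = +0.551) but worst on cross-task difficulty prediction (T4-CR = 41; sigma_conf = 1.4). The instrument was evaluated on 8 frontier models and 69 humans, and the pre-specified human developmental hypothesis was falsified.
Insight: The novelty is decomposing a model's confidence behaviour into five behaviourally distinct diagnostic dimensions, going beyond conventional benchmarks that look only at correctness. This gives a new framework for assessing a model's metacognitive ability ("knowing when it knows") and helps identify calibration deficits in specific knowledge domains.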
Abstract: The Metacognitive Probe is an exploratory five-task, 15-slot diagnostic that decomposes an LLM’s confidence behaviour into five behaviourally-distinct dimensions: confidence calibration (T1-CC), epistemic vigilance (T2-EV), knowledge boundary (T3-KB), calibration range (T4-CR), and reasoning-chain validation (T5-RCV). It is evaluated on N=8 frontier models and N=69 humans. The instrument is motivated by Flavell (1979) and Nelson and Narens (1990) but operates on observable confidence-correctness alignment; it is not a validated cross-species metacognition scale, and the pre-specified human developmental hypothesis was falsified. Composite benchmarks (MMLU, BIG-Bench, HELM, GPQA) ask whether a model produces a correct response. They are silent on whether the model knows when its response is wrong. A model can score 80 on a composite calibration benchmark and still be wildly overconfident in narrow pockets the aggregate cannot surface. The Metacognitive Probe surfaces those pockets. Our headline is a 47-point within-model dissociation in Gemini 2.5 Flash: panel-best within-task calibration (T1-CC = 88; Spearman rho = +0.551, 95% CI [+0.14, +0.80], p = 0.005) and panel-worst cross-task difficulty prediction (T4-CR = 41; sigma_conf = 1.4 across twelve factoids).
[325] The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark cs.AI | cs.CL | cs.CVPDF
Hao Liu, Jicheng Liu
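TL;DR: This paper presents KnotBench, a benchmark for evaluating the reasoning of vision-language models on knot diagrams. It contains 858,318 images covering 1,951 prime-knot prototypes and defines 14 tasks in four families: equivalence judgment, move prediction, identification, and cross-modal grounding. Current state-of-the-art models (such as Claude Opus and GPT-5) perform poorly on most tasks, in some cases below a random baseline, indicating that models can perceive diagram structure but cannot simulate operations on it.
Details
Motivation: The paper aims to expose the limits of current vision-language models in understanding and reasoning over structured visual information such as knot diagrams: models can "see" a diagram but cannot effectively "operate" on its structure. It proposes a challenging benchmark to quantify this perception-operation gap.
Result: On KnotBench's 14 tasks, Claude Opus and GPT-5 (each with and without thinking) sit at or below a random baseline in 15 of 56 (task, model) cases, and 8 tasks have a best score under 1.5x random. On diagram-to-symbol transcription, no model produces a strictly correct string, and permissive decoding recovers the knot in only 0 to 4 items. Thinking-mode reasoning lifts overall accuracy by only 1.65 points for Claude and 9.25 points for GPT-5, a limited improvement.
Insight: The contribution is a hard benchmark built on knot diagrams, with an image-versus-symbol split that locates where models fail between perception and operation. Viewed objectively, the study highlights a fundamental weakness of current vision-language models on structured reasoning, namely the lack of a mechanism for simulating operations on visual features, and offers an evaluation direction for future model design.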
Abstract: A vision-language model can look at a knot diagram and report what it sees, yet fail to act on that structure. KnotBench pairs an 858,318-image corpus from 1,951 prime-knot prototypes (crossing numbers 3 to 19) with a protocol whose answers are checked against Regina’s canonical knot signature. Its 14 tasks span four families, equivalence judgment, move prediction, identification, and cross-modal grounding; an image-versus-symbol split locates failures along the perception-operation gap. We score Claude Opus 4.7 and GPT-5, each with and without thinking, under a 64K output-token budget matched on both vendors. Across 56 (task, model) cases, 15 sit at or below a random baseline and 8 of 14 tasks have a best score under 1.5x random. On diagram-to-symbol transcription, no model produces a strictly correct string, and permissive Regina decoding recovers the knot in 0 to 4 of 100 items. Thinking-mode reasoning lifts overall accuracy by 1.65 points for Claude and 9.25 points for GPT-5, narrowing the gap only modestly. Read together, the four families suggest current vision-language models hold features of a diagram but lack apparatus to simulate moves on those features.
[326] How Mobile World Model Guides GUI Agents? cs.AI | cs.CL
Weikai Xu, Kun Huang, Yunren Feng, Jiaxing Li, Yuhan Chen
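TL;DR: This paper studies how useful different world-model representations (text, images, renderable code) are for mobile GUI agents, whether generated rollouts can replace real environments for training, and how test-time guidance affects agents of different strengths. It finds that renderable code has high in-distribution fidelity and suits supervision for data construction; text feedback is more robust in out-of-distribution execution; world-model-generated trajectories improve end-to-end task performance; and posterior self-reflection yields limited gains for overconfident agents.
Details
Motivation: For mobile GUI agents, it is unclear how useful existing world models (text or image representations) are, whether generated rollouts can replace real environments, and how test-time guidance helps agents of different strengths.
Result: The models reach SOTA performance on MobileWorldBench and Code2WorldBench. Downstream evaluation on AITZ, AndroidControl, and AndroidWorld confirms the relative strengths of each representation and the usefulness of generated trajectories.
Insight: The contribution is a systematic comparison of world models across four modalities (delta text, full text, diffusion-based images, renderable code) that clarifies where each one fits. Renderable code suits high-fidelity supervision for data construction, while text feedback is more robust for OOD execution. World-model-generated trajectories can serve as transferable interaction experience that improves task performance. World models work better as prior perception or training supervision than as universal post-hoc verifiers.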
Abstract: Recent advances in vision-language models have enabled mobile GUI agents to perceive visual interfaces and execute user instructions, but reliable prediction of action consequences remains critical for long-horizon and high-risk interactions. Existing mobile world models provide either text-based or image-based future states, yet it remains unclear which representation is useful, whether generated rollouts can replace real environments, and how test-time guidance helps agents of different strengths. To answer the above questions, we filter and annotate mobile world-model data, then train world models across four modalities: delta text, full text, diffusion-based images, and renderable code. These models achieve SoTA performance on both MobileWorldBench and Code2WorldBench. Furthermore, by evaluating their downstream utility on AITZ, AndroidControl, and AndroidWorld, we obtain three findings. First, renderable code reconstruction achieves high in-distribution fidelity and provides effective multimodal supervision for data construction, while text-based feedback is more robust for online out-of-distribution (OOD) execution. Second, world-model-generated trajectories can provide transferable interaction experience in the training process and improve agents’ end-to-end task performance, although these data do not preserve the original distribution. Last, for overconfident mobile agents with low action entropy, posterior self-reflection provides limited gains, suggesting that world models are more effective as prior perception or training supervision than as universal post-hoc verifiers.
[327] Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge cs.AI | cs.CL | stat.ML
Wenbo Zhang, Lijinghua Zhang, Liner Xiang, Hengrui Cai
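TL;DR: This paper examines how reasoning affects judgment accuracy and compute cost in LLM-as-a-Judge settings. Reasoning substantially improves accuracy only on tasks that require structured verification (such as math and coding); on simpler tasks the gains are limited or even negative, and the cost is significantly higher. The authors therefore propose RACER (Robust Adaptive Cost-Efficient Routing), which formulates routing as a constrained distributionally robust optimization problem and dynamically selects between reasoning and non-reasoning judges under a fixed budget, handling distribution shift while achieving the best accuracy-cost trade-off.
Details
Motivation: Reasoning-capable LLMs are now used as automated judges, but their benefits and costs in LLM-as-a-Judge settings remain unclear. The paper asks on which tasks reasoning helps and how to use it efficiently to reduce cost.
Result: On tasks requiring structured verification (such as math and coding), reasoning judges significantly improve judgment accuracy; on simpler evaluation tasks, the gains are limited or even negative. Experiments show that RACER achieves superior accuracy-cost trade-offs under distribution shift.
Insight: The key point is that reasoning judges should be used selectively, according to task needs, rather than universally. RACER routes dynamically through distributionally robust optimization and comes with theoretical guarantees of a unique optimal policy and linear convergence, offering a cost-efficient approach to LLM-as-a-Judge.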
Abstract: Reasoning-capable large language models (LLMs) have recently been adopted as automated judges, but their benefits and costs in LLM-as-a-Judge settings remain unclear. Through controlled comparisons between reasoning and non-reasoning judges, we show that explicit reasoning substantially improves judgment accuracy on tasks requiring structured verification (e.g., math and coding), while offering limited or even negative gains on simpler evaluations and incurring significantly higher computational cost. These findings suggest that reasoning should be used selectively rather than universally, with awareness of possible distribution shift. We propose Robust Adaptive Cost-Efficient Routing (RACER), which dynamically selects between reasoning and non-reasoning judges under a fixed budget by formulating routing as a constrained distributionally robust optimization problem. RACER explicitly accounts for distribution shift via a KL-divergence uncertainty set, admits an efficient primal–dual algorithm, and enjoys theoretical guarantees including uniqueness of the optimal policy and linear convergence. Extensive experiments show that RACER achieves superior accuracy–cost trade-offs under distribution shift.
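To make the budgeted routing idea concrete, here is a minimal sketch of a plain Lagrangian relaxation: an item goes to the reasoning judge when its estimated accuracy gain outweighs a price on the extra cost, and the price is tuned by bisection until the budget holds. This is a simplified stand-in, not RACER itself: it omits the KL-divergence uncertainty set and the paper's primal-dual algorithm, and the per-item gain and cost estimates are hypothetical inputs.

```python
def route_to_reasoning(gains, extra_costs, budget, iters=60):
    """Return one boolean per item: True means 'use the reasoning judge'.

    gains[i]       -- assumed estimate of accuracy gain from reasoning on item i
    extra_costs[i] -- assumed extra cost (> 0) of the reasoning judge on item i
    budget         -- maximum total extra cost allowed
    """
    def decide(price):
        # Route an item only if its gain beats the priced extra cost.
        return [g - price * c > 0 for g, c in zip(gains, extra_costs)]

    def spend(decisions):
        return sum(c for d, c in zip(decisions, extra_costs) if d)

    # If routing every helpful item already fits the budget, no price is needed.
    if spend(decide(0.0)) <= budget:
        return decide(0.0)

    low, high = 0.0, 1.0
    while spend(decide(high)) > budget:  # grow the price until the budget holds
        high *= 2
    for _ in range(iters):  # bisection keeps 'high' feasible at every step
        mid = (low + high) / 2
        if spend(decide(mid)) > budget:
            low = mid
        else:
            high = mid
    return decide(high)


# Toy example: four items, unit extra cost each, budget for two reasoning calls.
choices = route_to_reasoning([0.30, 0.02, -0.05, 0.20], [1.0, 1.0, 1.0, 1.0], 2.0)
```

Items with negative estimated gain are never routed, which mirrors the abstract's observation that reasoning can hurt on simpler evaluations.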
[328] The Generalized Turing Test: A Foundation for Comparing Intelligence cs.AI | cs.CL | cs.LG
Daniel Mitropolsky, Susan S. Hong, Riccardo Neumarker, Emanuele Rimoldi, Tomaso Poggio
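TL;DR: This paper proposes the Generalized Turing Test (GTT), a formal framework for comparing the capabilities of arbitrary agents via indistinguishability. It defines a Turing comparator between agents: A ≥ B holds if agent B cannot reliably distinguish interactions with agent A imitating B from interactions with another instance of B. This yields a notion of relative intelligence that is independent of datasets and tasks. The authors study the comparator's structure, including the conditions for transitivity, and define several variants. Empirically, they apply the framework to a set of modern models, evaluating pairwise indistinguishability over thousands of trials; the resulting comparisons show a stratified structure consistent with existing rankings.
Details
Motivation: To establish a formal basis for comparing intelligence that does not depend on specific datasets or tasks, so that intelligence can be reasoned about in a unified way, and to provide a theoretical foundation for evaluation and possibly for training objectives.
Result: Pairwise indistinguishability comparisons of modern models under the GTT framework produce a stratified ordering that agrees with existing capability rankings, suggesting that the framework yields meaningful empirical orderings.
Insight: The contribution is a formal, task- and dataset-agnostic framework for comparing intelligence based on indistinguishability, which offers a potentially general foundation for evaluating intelligence. Viewed objectively, it generalizes the core idea of the Turing test (the imitation game) and formalizes it as an operational comparator, which may give a new perspective on agent evaluation and alignment research beyond conventional benchmarks.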
Abstract: We introduce the Generalized Turing Test (GTT), a formal framework for comparing the capabilities of arbitrary agents via indistinguishability. For agents A and B, we define the Turing comparator A $\geq$ B to hold if B, acting as a distinguisher, cannot reliably distinguish between interactions with A (instructed to imitate B) and another instance of B. This yields a dataset- and task-agnostic notion of relative intelligence. We study the comparator’s structure, including conditions under which it is transitive and therefore induces an ordering over equivalence classes, and we define and analyze variants with querying, bounded interaction, and fixed distinguishers. To complement the theory, we instantiate the framework on a collection of modern models, empirically evaluating pairwise indistinguishability across thousands of trials. The resulting comparisons exhibit a stratified structure consistent with existing rankings, hinting that the proposed framework yields meaningful empirical orderings. Our results position indistinguishability as a unifying lens for reasoning about intelligence, suggesting a foundation for evaluation and, potentially, training objectives that are inherently independent of fixed datasets or benchmarks.
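One way to operationalize "cannot reliably distinguish" is to count how often the distinguisher guesses correctly over repeated trials and ask whether that count is plausible under pure guessing. The sketch below does this with an exact one-sided binomial test; the significance threshold and the decision rule are assumptions made for illustration, not the paper's stated criterion.

```python
from math import comb


def chance_tail(correct_guesses, trials):
    """P[X >= correct_guesses] for X ~ Binomial(trials, 1/2), computed exactly."""
    favourable = sum(comb(trials, k) for k in range(correct_guesses, trials + 1))
    return favourable / 2 ** trials


def comparator_holds(correct_guesses, trials, alpha=0.05):
    """Assumed rule: A >= B if B's guesses are not significantly above chance."""
    return chance_tail(correct_guesses, trials) > alpha


# Hypothetical outcomes: B is right 52 times out of 100 against one imitator,
# and 70 times out of 100 against another.
near_chance = comparator_holds(52, 100)
well_above_chance = comparator_holds(70, 100)
```

Integer arithmetic is used for the tail so the computation stays exact even with thousands of trials, where floating-point powers of 0.5 would underflow.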
[329] Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits cs.AI | cs.CV | cs.LG
Logan Mann, Ajit Saravanan, Ishan Dave, Shikhar Shiromani, Saadullah Ismail
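TL;DR: Through a systematic mechanistic analysis, this paper challenges the intuition that in vision-language models (VLMs) "the more concentrated the attention, the more reliable the answer". It finds that attention structure contributes almost nothing to predicting correctness, and that reliability is read more easily from hidden-state geometry late in the computation, layer-wise margin formation, and sparse late-layer circuits.
Details
Motivation: To test the widely held "Attention-Confidence Assumption" in VLMs, namely that sharper and more concentrated attention maps mean more trustworthy answers, and to find where reliability actually resides in the model's mechanisms.
Result: Experiments on three open-weight VLM families (LLaVA-1.5, PaliGemma, Qwen2-VL; 3-7B parameters) show that attention structure has near-zero power to predict correctness (R_pb close to 0), whereas a linear probe on late hidden states reaches AUROC>0.95 on POPE for two families, and self-consistency (K=10) is the strongest behavioral predictor measured (R_pb=0.43). Causal neuron ablations reveal an architectural difference: in late-fusion models (such as LLaVA) reliability concentrates in a fragile late bottleneck, while in early-fusion models (such as PaliGemma and Qwen2-VL) it is widely distributed and robust.
Insight: The contribution is a unified VLM Reliability Probe (VRP) pipeline that systematically compares attention structure, generation dynamics, and hidden-state geometry against correctness. The core insight is that for 3-7B VLMs, reliability shows up more dependably in hidden-state geometry, layer-wise margin formation, and sparse late-layer circuits than in attention-map sharpness. This gives direct guidance for model monitoring and design, in particular by exposing the basic difference in how early-fusion and late-fusion architectures distribute reliability.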
Abstract: A pervasive intuition holds that vision-language models (VLMs) are most trustworthy when their attention maps look sharp: concentrated attention on the queried region should imply a confident, calibrated answer. We test this Attention-Confidence Assumption directly. We instrument three open-weight VLM families (LLaVA-1.5, PaliGemma, Qwen2-VL; 3-7B parameters) with a unified mechanistic pipeline – the VLM Reliability Probe (VRP) – that compares attention structure, generation dynamics, and hidden-state geometry against a single correctness label. Three results emerge. (i) Attention structure is a near-zero predictor of correctness (R_pb(C_k,y)=0.001, 95% CI [-0.034,0.036]; R_pb(H_s,y)=-0.012, [-0.047,0.024] on a pooled n=3,090 split), even though attention remains causally necessary for feature extraction (top-30% patch masking drops accuracy by 8.2-11.3 pp, p<0.001). (ii) Reliability becomes legible later in the computation: a single hidden-state linear probe reaches AUROC>0.95 on POPE for two of three families, and self-consistency at K=10 is the strongest behavioral predictor we measure at 10x inference cost (R_pb=0.43). (iii) Causal neuron-level ablations expose a sharp architectural split with direct monitor-design implications: late-fusion LLaVA concentrates reliability in a fragile late bottleneck (-8.3 pp object-identification accuracy after top-5 probe-neuron ablation), whereas early-fusion PaliGemma and Qwen2-VL distribute it widely and absorb destruction of ~50% of their peak-layer hidden dimension with <=1 pp degradation. The takeaway is narrow but consequential: in 3-7B VLMs, reliability is read more reliably off hidden-state geometry, layer-wise margin formation, and sparse late-layer circuits than off attention-map sharpness.
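The correlations above relate a continuous signal (for example an attention statistic) to a binary correctness label. Assuming R_pb denotes the point-biserial correlation, a minimal sketch of that statistic is shown below; the function name and toy data are illustrative and are not drawn from the paper.

```python
from statistics import mean, pstdev


def point_biserial(scores, labels):
    """Point-biserial correlation between continuous scores and 0/1 labels.

    r_pb = (M1 - M0) / s * sqrt(n1 * n0 / n^2), where M1 and M0 are the mean
    scores of the two label groups and s is the population standard deviation
    of all scores.
    """
    ones = [s for s, y in zip(scores, labels) if y == 1]
    zeros = [s for s, y in zip(scores, labels) if y == 0]
    n = len(scores)
    spread = pstdev(scores)
    group_weight = (len(ones) * len(zeros) / n ** 2) ** 0.5
    return (mean(ones) - mean(zeros)) / spread * group_weight


# Toy data (hypothetical): the signal is higher on correctly answered items.
r_pb = point_biserial([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
```

A value near zero, as reported for attention structure, means the signal barely separates correct from incorrect answers, regardless of how sharp the attention maps look.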
[330] LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models cs.AI | cs.CV | cs.RO
Boyang Shen, Kaixiang Yang, Hao Wang, Qiuyu Yu, Qiang Xie
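TL;DR: LoopVLA is a recurrent vision-language-action model for robotic manipulation. It iteratively refines multimodal representations with a shared Transformer block and, at each iteration, produces both a candidate action and a sufficiency score that judges whether the representation is already good enough for action prediction. This substantially improves efficiency and inference speed while preserving task success.
Details
Motivation: Current VLA models usually treat the deepest representation of the vision-language backbone as universally optimal for action prediction. Robotic manipulation, however, involves many closed-loop spatial adjustments, for which excessive abstraction wastes computation and weakens the low-level geometric cues needed for precise control. Existing early-exit strategies cannot directly decide when a representation is sufficient for action prediction.
Result: On the LIBERO, LIBERO-Plus, and VLA-Arena benchmarks, LoopVLA matches or exceeds the task success of strong baselines while reducing parameters by 45% and raising inference throughput by up to 1.7x, advancing the efficiency-performance frontier of VLA policies.
Insight: The contribution is a recurrent architecture that jointly learns representation refinement, action prediction, and sufficiency estimation, and that uses parameter sharing to decouple refinement from absolute layer indices. At its core is a self-supervised distribution alignment objective that aligns intermediate confidence scores with the relative action quality across refinement steps, linking sufficiency learning to policy optimization signals without direct supervision.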
Abstract: Current Vision-Language-Action (VLA) models typically treat the deepest representation of a vision-language backbone as universally optimal for action prediction. However, robotic manipulation is composed of many frequent closed-loop spatial adjustments, for which excessive abstraction may waste computation and weaken low-level geometric cues essential for precise control. Existing early-exit strategies attempt to reduce computation by stopping at predefined layers or applying heuristic rules such as action consistency, but they do not directly answer when a representation is actually sufficient for action. In this paper, we present LoopVLA, a recurrent VLA architecture that jointly learns representation refinement, action prediction, and sufficiency estimation. LoopVLA iteratively applies a shared Transformer block to refine multimodal tokens, and at each iteration produces both a candidate action and a sufficiency score that estimates whether further refinement is necessary. By sharing parameters across iterations, LoopVLA decouples refinement from absolute layer indices and grounds sufficiency estimation in the evolving representation itself. Since sufficiency has no direct supervision, we introduce a self-supervised distribution alignment objective, where intermediate confidence scores are trained to match the relative action quality across refinement steps, thereby linking sufficiency learning to policy optimization signals. Experiments on LIBERO, LIBERO-Plus, and VLA-Arena show that LoopVLA pushes the efficiency-performance frontier of VLA policies, reducing parameters by 45% and improving inference throughput by up to 1.7 times while matching or outperforming strong baselines in task success.
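The inference-time control flow described above can be sketched as a loop that reapplies one shared block and stops once the sufficiency score is high enough. The threshold-based stopping rule, the iteration cap, and the callables passed in are all assumptions for illustration; the abstract does not specify how the score is turned into a stopping decision.

```python
def refine_until_sufficient(tokens, shared_block, action_head, sufficiency_head,
                            threshold=0.85, max_iters=8):
    """Reapply one shared block until the sufficiency score passes a threshold.

    Returns the last candidate action and the number of iterations used.
    All four callables are hypothetical stand-ins for learned modules.
    """
    action = None
    steps_used = 0
    for step in range(1, max_iters + 1):
        tokens = shared_block(tokens)          # same parameters at every step
        action = action_head(tokens)           # candidate action at this depth
        steps_used = step
        if sufficiency_head(tokens) >= threshold:
            break                              # further refinement judged unnecessary
    return action, steps_used


# Toy stand-ins: each pass halves the token values, and the score rises
# as the summed residual shrinks.
action, steps = refine_until_sufficient(
    [0.4, 0.4],
    shared_block=lambda t: [x * 0.5 for x in t],
    action_head=lambda t: sum(t),
    sufficiency_head=lambda t: 1.0 - abs(sum(t)),
)
```

Because the block is shared, the number of iterations, rather than a fixed layer index, determines how much computation each input receives.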
[331] BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD cs.AI | cs.CV | cs.SE
Haozhe Zhang, Kaichen Liu, Miaomiao Chen, Lei Li, Shaojie Yang
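TL;DR: This paper presents BenchCAD, a comprehensive benchmark for programmatic code generation in industrial computer-aided design (CAD). It contains 17,900 execution-verified CadQuery programs across 106 industrial part families and evaluates multimodal large language models on visual question answering, code question answering, image-to-code generation, and instruction-guided code editing, measuring perception, parametric abstraction, and executable program synthesis in realistic industrial CAD settings.
Details
Motivation: Multimodal large language models lack a thorough evaluation on industrial CAD code generation in realistic industrial scenarios. Existing approaches often focus only on recognizing a part's outer shape and neglect key capabilities such as understanding 3D structure, inferring engineering parameters, and choosing CAD operations that reflect how the part is designed and manufactured.
Result: Tests on more than 10 frontier models show that current systems usually recover coarse outer geometry but struggle to generate accurate parametric CAD programs. Common failures include missing fine 3D structure, misinterpreting industrial design parameters, and replacing essential operations such as sweeps, lofts, and twist-extrudes with simple sketch-and-extrude patterns. Fine-tuning and reinforcement learning improve in-distribution performance, but generalization to unseen part families remains limited.
Insight: The contribution is a unified, execution-verified industrial CAD benchmark that supports fine-grained capability analysis. Viewed objectively, the study exposes key gaps in current MLLMs for industrial CAD automation (such as structural understanding and operation selection) and provides a clear evaluation framework and direction for improving industrial readiness.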
Abstract: Industrial Computer-Aided Design (CAD) code generation requires models to produce executable parametric programs from visual or textual inputs. Beyond recognizing the outer shape of a part, this task involves understanding its 3D structure, inferring engineering parameters, and choosing CAD operations that reflect how the part would be designed and manufactured. Despite the promise of multimodal large language models (MLLMs) for this task, they are rarely evaluated on whether these capabilities jointly hold in realistic industrial CAD settings. We present BenchCAD, a unified benchmark for industrial CAD reasoning. BenchCAD contains 17,900 execution-verified CadQuery programs across 106 industrial part families, including bevel gears, compression springs, twist drills, and other reusable engineering designs. It evaluates models through visual question answering, code question answering, image-to-code generation, and instruction-guided code editing, enabling fine-grained analysis across perception, parametric abstraction, and executable program synthesis. Across 10+ frontier models, BenchCAD shows that current systems often recover coarse outer geometry but fail to produce faithful parametric CAD programs. Common failures include missing fine 3D structure, misinterpreting industrial design parameters, and replacing essential operations such as sweeps, lofts, and twist-extrudes with simpler sketch-and-extrude patterns. Fine-tuning and reinforcement learning improve in-distribution performance, but generalization to unseen part families remains limited. These results position BenchCAD as a benchmark for measuring and improving the industrial readiness of multimodal CAD automation.
cs.GR
[332] Alice v1: Distillation-Enhanced Video Generation Surpassing Closed-Source Models cs.GR | cs.CV | cs.LG
Wang Xiaoyu, Phong Nguyen, Chen Zhao
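TL;DR: Alice v1 is a 14-billion-parameter open-source video generation model. Using consistency distillation with score regularization, it achieves a 7x speedup while maintaining high quality, and its automated evaluation scores exceed those of closed-source models including Veo3 and Sora2.
Details
Motivation: To address the usual trade of quality for speed when distilling video generation models, with the goal of an open-source video generation model that is both fast and good.
Result: On the VBench benchmark, Alice v1 raises the score from the teacher model's 84.0 to 91.2, surpassing the closed-source Veo3 (about 90) and Sora2 (about 88). It generates 5-second 720p video within 4 denoising steps, a 7x speedup.
Insight: The contribution is a score-regularized consistency distillation method that exceeds the teacher's quality through a mode-seeking objective, a synthetic data pipeline targeted at failure modes, and consistency enforcement acting as implicit regularization.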
Abstract: We present Alice v1, a 14-billion-parameter open-source video generation model that achieves state-of-the-art quality through consistency distillation with score regularization (rCM). Contrary to conventional distillation, which trades quality for speed, we demonstrate that rCM-based distillation can exceed teacher model quality. We attribute this to three mechanisms: (1) the score regularization term acts as a mode-seeking objective that concentrates probability mass on high-quality outputs rather than covering the full teacher distribution, (2) our targeted synthetic data pipeline with hard example mining provides training signal specifically for failure modes (physics, hands, faces) that the teacher handles inconsistently, and (3) consistency enforcement acts as implicit regularization, eliminating “lucky path” dependence on specific noise samples. Alice v1 generates 5-second 720p videos at 24fps in 4 denoising steps (8 seconds on H100), a 7x speedup over the 50-step teacher while improving VBench score from 84.0 (Wan2.2) to 91.2. This surpasses both the teacher and closed-source systems including Veo3 (90) and Sora2 (~88) on automated benchmarks, with competitive results in human preference studies. We release all model weights, training code, synthetic data pipelines, and evaluation scripts to advance open research in video generation.
[333] CAGS: Color-Adaptive Volumetric Video Streaming with Dynamic 3D Gaussian Splatting cs.GR | cs.CV | cs.MM | cs.NI | eess.IV
Daheng Yin, Yili Jin, Jianxin Shi, Isaac Ding, Miao Zhang
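TL;DR: This paper proposes CAGS, an adaptive volumetric video streaming system built on dynamic 3D Gaussian Splatting. Its color-adaptive scheme uses vector quantization to establish levels of detail and corrects color distortion with low-resolution reference images, enabling high-quality, low-latency immersive access to remote 3D environments under heterogeneous network conditions.
Details
Motivation: As a real-time interface to remote physical environments, volumetric video streaming imposes new system-level demands for photorealistic scene representation, low-latency interaction, and robust performance under heterogeneous networks. Existing density-based level-of-detail methods are ill-suited to Gaussian representations, causing visible gaps and severe quality degradation, and existing attribute compression techniques mainly cause color distortion.
Result: Extensive experiments on a prototype system show that, under fluctuating bandwidth, CAGS exceeds existing adaptive streaming systems in PSNR by 5 to 20 dB, runs significantly faster than existing scalable Gaussian compression methods, and generalizes across different Gaussian representations.
Insight: The contribution is a color-adaptive scheme that establishes levels of detail with vector quantization and corrects compression-induced color distortion by rendering reference images on the server and restoring color on the client, which reduces bandwidth consumption while keeping high visual quality and real-time rendering performance.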
Abstract: Volumetric video (VV) streaming enables real-time, immersive access to remote 3D environments, powering telepresence, ecological monitoring, and robotic teleoperation. These applications turn VV streaming into a real-time interface to remote physical environments, imposing new system-level demands for photorealistic scene representation, low-latency interaction, and robust performance under heterogeneous networks. 3D Gaussian Splatting (3DGS) has been widely used for real-time photorealistic rendering, offering superior visual quality and rendering performance, but it faces challenges due to bandwidth consumption. Furthermore, as the foundation of adaptive VV streaming, existing Levels of Detail (LoD) methods based on density are not well-suited to Gaussian representations, leading to visible gaps and severe quality degradation. Recent studies have also explored attribute compression techniques to reduce bandwidth consumption. Our preliminary studies reveal that aggressive attribute compression primarily causes color distortion, which can be effectively corrected in the rendered image using a reference image. Motivated by these findings, we propose a novel Color-Adaptive scheme for adaptive VV streaming that uses vector quantization (VQ) to establish LoDs and correct color distortions with low-resolution reference images. We further present CAGS, an adaptive VV streaming system compatible with diverse Gaussian representations, which integrates the Color-Adaptive scheme by rendering reference images on the streaming server and performing color restoration on the client. Extensive experiments on our prototype system demonstrate that CAGS outperforms the existing adaptive streaming systems in PSNR by 5$\sim$20 dB under fluctuating bandwidth, operates significantly faster than existing scalable Gaussian compression methods, and generalizes across different Gaussian representations.
cs.RO
[334] Towards Generative Predictive Display for Vision-Based Teleoperation: A Zero-Shot Benchmark of Off-the-Shelf Video Models cs.RO | cs.CV
Aws Khalil, Jaerock Kwon
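TL;DR: This paper presents a zero-shot benchmark that evaluates whether off-the-shelf generative video models are suitable for short-horizon predictive display, a technique for coping with communication latency in teleoperation. Using driving data from the CARLA simulator, it evaluates five publicly released video models from the transformer-based and diffusion-based families and finds that, in the zero-shot setting, no existing model simultaneously delivers low prediction error, stable per-step error behavior, and real-time inference.
Details
Motivation: Teleoperation systems are limited by communication latency. Predictive display mitigates this by presenting an estimate of the current visual state, but the suitability of existing generative video models for latency-sensitive predictive display is unclear.
Result: On the benchmark built from CARLA driving data, no tested model in the zero-shot setting simultaneously achieves low rollout error, non-divergent per-step error behavior, and real-time inference at the source frame rate. Increasing model scale or resolution brings limited improvement and sometimes makes results worse.
Insight: The study exposes a gap between general-purpose generative video synthesis and the requirements of predictive display in teleoperation. Practical deployment will need explicit short-horizon temporal supervision, in-domain adaptation, or aggressive inference optimization rather than direct use of off-the-shelf models. The paper also provides a unified benchmarking pipeline and evaluation metrics.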
Abstract: Teleoperation systems are fundamentally limited by communication latency, which degrades situational awareness and control performance. Predictive display aims to mitigate this limitation by presenting an estimate of the current visual state rather than delayed observations. While recent advances in generative video models enable high-quality video synthesis, their suitability for latency-sensitive predictive display remains unclear. This paper presents a zero-shot benchmark of off-the-shelf generative video models for short-horizon predictive display, without task-specific fine-tuning. We formulate the problem as rollout-based future frame prediction and develop a unified benchmarking pipeline using simulated driving data from the CARLA simulator. Five publicly released video models spanning transformer-based and diffusion-based families are evaluated across two resolutions and two conditioning regimes (multi-frame and single-frame). Performance is assessed using prediction accuracy (mean absolute difference), per-rollout latency, peak GPU memory usage, and temporal error evolution across the prediction horizon. On this zero-shot benchmark, no tested model simultaneously achieves low rollout error, non-divergent per-step error behavior, and real-time inference at the source frame rate. Increasing model scale or resolution yields limited and, in some cases, inverted improvements. These findings highlight a gap between general-purpose generative video synthesis and the requirements of predictive display in teleoperation, suggesting that practical deployment will require either explicit short-horizon temporal supervision, in-domain adaptation, or aggressive inference optimization rather than direct application of off-the-shelf models. Code, configurations, and qualitative results are released on the project page: https://bimilab.github.io/paper-GenPD
[335] JODA: Composable Joint Dynamics for Articulated Objects cs.RO | cs.CV
Tianhong Gao, Cheng Yu, Yinghao Xu, Mengyu Chu
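TL;DR: JODA is a framework for generating joint-level dynamics for articulated objects. It models dynamics as a structured three-channel field (conservative forces, dry friction, and damping) over the joint degree of freedom, uses shape-constrained piecewise cubic interpolation (PCHIP) to obtain a compact and interpretable function space, and supports inferring and refining joint dynamics from multimodal inputs.
Details
Motivation: Articulated objects in simulation and embodied AI are usually specified only by geometry and kinematic structure and lack fine-grained dynamical effects such as frictional holding, detents, soft closing, and snap latching, so realistic mechanical behavior is poorly modeled. Existing methods either ignore dynamical detail or use simple models with limited expressiveness.
Result: JODA enables plausible and controllable modeling of diverse joint behaviors and provides a unified interface for inference, editing, and optimization. The abstract does not report specific benchmarks or quantitative results (such as SOTA comparisons).
Insight: The contributions include a structured three-channel field for representing joint dynamics, the use of a vision-language model to infer dynamical primitives from multimodal inputs, and a compact, interpretable function space built with PCHIP that is compatible with differentiable simulation and supports both direct manipulation and gradient-based refinement.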
Abstract: Articulated objects used in simulation and embodied AI are typically specified by geometry and kinematic structure, but lack the fine-grained dynamical effects that govern realistic mechanical behavior, such as frictional holding, detents, soft closing, and snap latching. Existing approaches either ignore the detailed structure of dynamics entirely, or use simple models with limited expressiveness. We introduce JODA, a framework for generating joint-level dynamics as a structured three-channel field over the joint degree of freedom, capturing conservative forces, dry friction, and damping. Instantiated using shape-constrained piecewise cubic interpolation (PCHIP), this formulation defines a compact and expressive function space that is both interpretable and compatible with differentiable simulation. Building on this representation, we develop methods for inferring and refining joint dynamics from multimodal inputs. Given visual observations and joint context, a vision-language model proposes structured dynamical primitives, which are composed into a unified dynamics field. The resulting representation supports both direct manipulation and gradient-based refinement. We demonstrate that JODA enables plausible and controllable modeling of diverse joint behaviors, providing a unified interface for inference, editing, and optimization. Code and example assets with their generated profiles will be released upon publication.
[336] HiDrive: A Closed-Loop Benchmark for High-Level Autonomous Driving cs.RO | cs.CV
Zhongyu Xia, Guanyu Zhu, Guo Tang, Wenhao Chen, Yongtao Wang
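TL;DR: This paper proposes HiDrive, a new closed-loop benchmark for end-to-end autonomous driving that focuses on long-tail scenarios and the evaluation of high-level driving capabilities. By introducing rare objects and complex traffic situations and extending the metrics to cover rule compliance and moral reasoning, it aims to address the limited scenario diversity and narrow capability coverage of existing benchmarks.
Details
Motivation: Existing autonomous driving benchmarks are limited in scenario diversity, object variety, and the range of driving capabilities they evaluate. In particular, they lack safety-critical long-tail scenarios and the evaluation of advanced decision-making (such as legal compliance and ethical reasoning), so model scores approach saturation while the real problem remains unsolved.
Result: The abstract gives no specific quantitative results or benchmark comparisons. It states that HiDrive is built on a more advanced physics engine and provides physically realistic lighting and high-fidelity visual rendering, offering a more challenging and realistic testbed for assessing whether autonomous driving systems can handle real-world complexity.
Insight: The contribution is a closed-loop benchmark that emphasizes long-tail scenarios and a comprehensive evaluation of high-level driving capabilities (such as rule compliance and moral reasoning), and that extends collision-avoidance-centered metrics into a full evaluation system covering collision and braking, traffic-rule compliance, and moral-reasoning indicators.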
Abstract: End-to-end autonomous driving has witnessed rapid progress, yet existing benchmarks are increasingly saturated, with state-of-the-art models achieving near-perfect scores on widely used open-loop and closed-loop benchmarks. This saturation does not mean that the problem has been solved; instead, it reveals that current benchmarks remain limited in scenario diversity, object variety, and the breadth of driving capabilities they evaluate. In particular, they lack sufficient long-tail scenarios involving rare but safety-critical objects and fail to assess advanced decision-making such as legal compliance, ethical reasoning, and emergency response. To address these gaps, we propose HiDrive, a new closed-loop benchmark for end-to-end autonomous driving that emphasizes long-tail scenarios and a richer evaluation of driving capabilities. HiDrive introduces a diverse set of rare objects and uncommon traffic situations, and expands evaluation from basic driving skills to more advanced capabilities, including rule compliance, moral reasoning, and context-dependent emergency maneuvers. Correspondingly, we extend previous collision-avoidance-centered metrics into a comprehensive evaluation system that encompasses collision and braking, traffic-rule compliance, and moral-reasoning indicators. Built on a more advanced physics engine, HiDrive provides physically realistic lighting and high-fidelity visual rendering, offering a more challenging and realistic testbed for assessing whether autonomous driving systems can handle the complexity of real-world deployment. The HiDrive software, source code, digital assets, and documentation are available at https://github.com/VDIGPKU/HiDrive.
[337] StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception cs.RO | cs.CV
Evans Han, Yunfan Jiang, Yingke Wang, Haoyue Xiao, Huang Huang
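TL;DR: This paper proposes StereoPolicy, a visuomotor policy learning framework that uses synchronized stereo image pairs to strengthen the geometric reasoning of robotic manipulation policies. It processes each image independently with pretrained 2D vision encoders and fuses the features with a Stereo Transformer, without explicit 3D reconstruction or camera calibration. Experiments on three simulation benchmarks (RoboMimic, RoboCasa, and OmniGibson) and on real-robot tasks show that it outperforms RGB, RGB-D, point cloud, and multi-view baselines.
Details
Motivation: Existing robot imitation learning policies based on monocular vision lack reliable depth cues and spatial awareness, which makes precise manipulation difficult in cluttered or geometrically complex scenes. Stereo vision is therefore introduced to strengthen geometric reasoning.
Result: On the RoboMimic, RoboCasa, and OmniGibson simulation benchmarks, StereoPolicy outperforms RGB, RGB-D, point cloud, and multi-view baselines when combined with both diffusion-based policies and pretrained vision-language-action (VLA) policies. It is further validated in real-robot tabletop manipulation and bimanual mobile manipulation experiments.
Insight: The contribution is to use stereo image pairs directly so that spatial correspondence and disparity cues are captured implicitly, fusing pretrained 2D visual representations through a Stereo Transformer. This bridges 2D pretrained representations and 3D geometric understanding and provides a scalable and robust visual modality.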
Abstract: Recent advances in robot imitation learning have yielded powerful visuomotor policies capable of manipulating a wide variety of objects directly from monocular visual inputs. However, monocular observations inherently lack reliable depth cues and spatial awareness, which are critical for precise manipulation in cluttered or geometrically complex scenes. To address this limitation, we introduce StereoPolicy, a new visuomotor policy learning framework that directly leverages synchronized stereo image pairs to strengthen geometric reasoning, without requiring explicit 3D reconstruction or camera calibration. StereoPolicy employs pretrained 2D vision encoders to process each image independently and fuses the resulting representations through a Stereo Transformer. This design implicitly captures spatial correspondence and disparity cues. The framework integrates seamlessly with diffusion-based and pretrained vision-language-action (VLA) policies, delivering consistent improvements over RGB, RGB-D, point cloud, and multi-view baselines across three simulation benchmarks: RoboMimic, RoboCasa, and OmniGibson. We further validate StereoPolicy on real-robot experiments spanning both tabletop and bimanual mobile manipulation settings. Our results underscore stereo vision as a scalable and robust modality that bridges 2D pretrained representations with 3D geometric understanding for robotic manipulation.
[338] ALAM: Algebraically Consistent Latent Transitions for Vision-Language-Action Models cs.RO | cs.AI | cs.CV
Zuojin Tang, Haoyun Liu, Xinyuan Chang, Changjie Wu, Dongjie Huo
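TL;DR: This paper proposes ALAM (Algebraically Consistent Latent Action Model), a method for learning structured latent action representations from videos without action labels. It uses algebraic consistency constraints (composition and reversal) to build a locally additive latent transition space, then uses the learned latent transition sequences as auxiliary generative targets that are generated jointly with robot actions under a flow-matching objective, which improves the policy performance of vision-language-action models.
Details
Motivation: Vision-language-action models are constrained by the scarcity of action-labeled robot data, while action-free videos contain rich information about how the physical world changes. The latent codes learned by existing reconstruction-based latent action models lack structure and are hard to use directly for policy generation.
Result: On the MetaWorld MT50 benchmark, the average success rate rises from 47.9% to 85.0%; on LIBERO it rises from 94.1% to 98.1%; real-world manipulation tasks show consistent gains. Representation probes show that ALAM reduces additivity and reversibility errors by 25-85 times and improves long-horizon cumulative reconstruction.
Insight: The contribution is to turn temporal relations in video into structured supervision through algebraic consistency (composition and reversal), building a locally additive latent transition space, and to couple these structured latent transitions with flow-based policy generation through joint flow matching, which removes the need for latent-to-action decoding.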
Abstract: Vision-language-action (VLA) models remain constrained by the scarcity of action-labeled robot data, whereas action-free videos provide abundant evidence of how the physical world changes. Latent action models offer a promising way to extract such priors from videos, but reconstruction-trained latent codes are not necessarily suitable for policy generation: they may predict future observations while lacking the structure needed to be reused or generated coherently with robot actions. We introduce ALAM (Algebraic Latent Action Model), an Algebraically Consistent Latent Action Model that turns temporal relations in action-free video into structural supervision. Given frame triplets, ALAM learns latent transitions that are grounded by reconstruction while being regularized by composition and reversal consistency, encouraging a locally additive transition space. For downstream VLA learning, we freeze the pretrained encoder and use its latent transition sequences as auxiliary generative targets, co-generated with robot actions under a joint flow-matching objective. This couples structured latent transitions with flow-based policy generation, allowing the policy to exploit ALAM’s locally consistent transition geometry without requiring latent-to-action decoding. Representation probes show that ALAM reduces additivity and reversibility errors by 25-85 times over unstructured latent-action baselines and improves long-horizon cumulative reconstruction. When transferred to VLA policies, ALAM raises the average success rate from 47.9% to 85.0% on MetaWorld MT50 and from 94.1% to 98.1% on LIBERO, with consistent gains on real-world manipulation tasks. Ablations further confirm that the strongest improvements arise from the synergy between algebraically structured latent transitions and joint flow matching.
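The two consistency properties named above can be written as simple residuals on latent transition vectors: a transition from frame a to frame c should match the sum of the a-to-b and b-to-c transitions, and a reversed transition should be the negation of the forward one. The sketch below measures both with squared error; the squared-error form and the function names are assumptions, since the abstract does not give the exact regularizers.

```python
def composition_error(z_ab, z_bc, z_ac):
    """Squared residual of additivity: z(a->b) + z(b->c) should equal z(a->c)."""
    return sum((ab + bc - ac) ** 2 for ab, bc, ac in zip(z_ab, z_bc, z_ac))


def reversal_error(z_ab, z_ba):
    """Squared residual of reversibility: z(b->a) should equal -z(a->b)."""
    return sum((ab + ba) ** 2 for ab, ba in zip(z_ab, z_ba))


# Toy latent transitions (hypothetical 2-D vectors) for a frame triplet a, b, c.
z_ab, z_bc, z_ac = [1.0, 2.0], [3.0, -1.0], [4.0, 1.0]
additive_gap = composition_error(z_ab, z_bc, z_ac)
reverse_gap = reversal_error(z_ab, [-1.0, -2.0])
```

Both residuals vanish for the toy vectors because they were chosen to be exactly additive and reversible; a learned encoder would only approximate this, and the residuals could then serve as regularization terms.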
cs.IR
[339] UserGPT Technical Report cs.IR | cs.CL
Yunyi Xuan, Hao Yi, Fengling Mao, Daye Cai, Leikun Liang
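TL;DR: This paper proposes UserGPT, a framework that improves large language models' understanding of user personas through attribute generation and summary generation. To address the scarcity of real behavioral data, it develops a User Behavior Simulation Engine that generates complex user trajectories and introduces a Data-Centric Semantization module to process heterogeneous logs. A curriculum-driven post-training strategy combines supervised fine-tuning and reinforcement learning to strengthen reasoning over long behavioral histories. On the self-built HPR-Bench benchmark, UserGPT performs strongly on tag prediction and summary generation while compressing behavioral records by 97.9% with critical information preserved.
Details
Motivation: Traditional user profiling relies on discriminative models and manual feature engineering, producing fragmented and logically inconsistent profiles that generalize poorly to long-tail behaviors. This work pursues a generative paradigm in which large language models turn noisy behavioral histories into coherent narratives that capture how users evolve.
Result: On the self-built HPR-Bench benchmark, UserGPT achieves an Avg@10 score of 0.7325 on tag prediction and an $Acc_{Ex}$ score of 0.7528 on summary generation, while compressing behavioral records by up to 97.9% with critical information preserved, demonstrating its effectiveness for holistic persona reasoning and personalized interaction.
Insight: The contributions include a user-understanding framework that combines attribute generation and summary generation; a User Behavior Simulation Engine that addresses data scarcity; a Data-Centric Semantization module that handles noise in heterogeneous logs; and a curriculum-driven post-training strategy (multi-stage SFT combined with DF-GRPO) that strengthens reasoning over long behavioral histories. Viewed objectively, combining simulated data generation with structured semantic transformation gives LLM-based user modeling a scalable data foundation, and the curriculum reinforcement learning strategy may improve the robustness of complex reasoning.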
Abstract: Personalized user understanding from large-scale digital traces remains a fundamental challenge. Traditional user profiling methods rely on discriminative models and manual feature engineering to predict discrete attributes, often producing fragmented and logically inconsistent profiles that generalize poorly to long-tail behaviors. In this work, we study a generative paradigm in which large language models (LLMs) summarize long and noisy behavioral histories into coherent narratives that capture nuanced user evolution. Our experiments show that even strong LLMs remain limited in complex and implicit personalization reasoning. We propose UserGPT, a framework for improving LLM-based persona understanding through both attribute generation and summary generation. To address the scarcity of real-world behavioral data, we develop a User Behavior Simulation Engine that produces realistic and complex user trajectories. We further introduce a Data-Centric Semantization module that transforms heterogeneous behavioral logs into structured and semantically coherent inputs, reducing noise and sparsity. On top of this pipeline, we design a curriculum-driven post-training strategy that combines multi-stage Supervised Fine-Tuning (SFT) with Dual-Filter Group Relative Policy Optimization (DF-GRPO) to strengthen reasoning over long behavioral histories. We also construct HPR-Bench, a benchmark for holistic persona reasoning derived from simulated data. On HPR-Bench, UserGPT achieves an Avg@10 score of 0.7325 on tag prediction and an $Acc_{Ex}$ score of 0.7528 on summary generation, while compressing behavioral records by up to 97.9% with critical information preserved. These results demonstrate the effectiveness of UserGPT for holistic persona reasoning and personalized user-agent interaction.
[340] Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient? cs.IR | cs.AI | cs.CL
Tz-Huan Hsu, Jheng-Hong Yang, Jimmy Lin
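TL;DR: This paper asks whether a lexical retriever remains sufficient in an agentic loop as large language models become more capable. The authors pair a BM25 lexical retriever with frontier LLMs that have stronger reasoning and tool-use abilities, and build Pi-Serini, a search agent equipped with tools for retrieving, browsing, and reading documents. Experiments on the BrowseComp-Plus benchmark show that a well-configured lexical retriever paired with more capable LLMs can support effective deep research, outperforming released search agents that use dense retrievers.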
Details
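Motivation: The work revisits whether a lexical retriever is still sufficient when building deep research systems, given that LLMs are becoming more capable in the agentic loop, and aims to give researchers a practical tool and concrete insights.
Result: On the BrowseComp-Plus benchmark, Pi-Serini with gpt-5.5 achieves 83.1% answer accuracy and 94.7% surfaced evidence recall, outperforming released search agents that use dense retrievers. Ablations show that BM25 tuning improves answer accuracy by 18.0% and surfaced evidence recall by 11.1% over the default BM25 setting, and that increasing retrieval depth improves surfaced evidence recall by 25.3% over the shallow-retrieval setting.
Insight: The paper challenges the assumption that dense retrievers are necessary for agentic search. It shows that a lexical retriever such as BM25 can still reach high performance when it is properly configured and paired with frontier LLMs, and it highlights the importance of retrieval depth and tuning, offering a new direction for building efficient search agents.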
Abstract: Does a lexical retriever suffice as large language models (LLMs) become more capable in an agentic loop? This question naturally arises when building deep research systems. We revisit it by pairing BM25 with frontier LLMs that have better reasoning and tool-use abilities. To support researchers asking the same question, we introduce Pi-Serini, a search agent equipped with three tools for retrieving, browsing, and reading documents. Our results show that, on BrowseComp-Plus, a well-configured lexical retriever with sufficient retrieval depth can support effective deep research when paired with more capable LLMs. Specifically, Pi-Serini with gpt-5.5 achieves 83.1% answer accuracy and 94.7% surfaced evidence recall, outperforming released search agents that use dense retrievers. Controlled ablations further show that BM25 tuning improves answer accuracy by 18.0% and surfaced evidence recall by 11.1% over the default BM25 setting, while increasing retrieval depth further improves surfaced evidence recall by 25.3% over the shallow-retrieval setting. Source code is available at https://github.com/justram/pi-serini.
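To make concrete what "BM25 tuning" adjusts, here is a minimal sketch of Okapi BM25 scoring with its two free parameters, `k1` (term-frequency saturation) and `b` (length normalization). It is not Pi-Serini's code: the function name, the whitespace tokenizer, the toy corpus, and the parameter values are assumptions for illustration only.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25.

    k1 and b are the knobs a BM25 tuning pass would search over;
    the defaults here are illustrative, not the paper's settings.
    """
    tokenized = [d.lower().split() for d in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length-normalized term-frequency saturation.
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


toy_docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "bm25 is a lexical retrieval function",
]
toy_scores = bm25_scores("lexical retrieval", toy_docs)
```

Only the third toy document contains the query terms, so it is the only one with a non-zero score. Lowering `b` weakens the penalty on documents longer than average, which is one of the effects a tuning sweep would explore.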
eess.IV
[341] Cross-Modal Semantic-Enhanced Diffusion Framework for Diabetic Retinopathy Grading eess.IV | cs.CV
Yiqun Wang
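TL;DR: This paper proposes CLIP-Guided Semantic Diffusion (CGSD), a cross-modal semantic-enhanced diffusion framework for automated diabetic retinopathy (DR) grading. The framework combines vision-language pretraining with diffusion probabilistic modeling: a domain-specific vision-language model provides semantic guidance and is adapted to the target domain with Low-Rank Adaptation (LoRA), and a cross-modal semantic conditioning vector guides the diffusion denoising network to improve grading performance.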
Details
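Motivation: The work targets key challenges in automated DR grading: subtle visual differences between grades in fine-grained lesion patterns, distribution shifts caused by heterogeneous imaging devices and acquisition conditions, and the inability of purely visual methods to exploit clinical semantic knowledge.
Result: On the APTOS 2019 dataset, the method reaches 87.5% accuracy and a macro-averaged F1 score of 0.731, outperforming a range of representative methods.
Insight: The contributions include using vision-language pretraining and diffusion models jointly for medical image classification; applying LoRA for efficient domain adaptation to bridge the distribution gap; and building a cross-modal conditioning vector that fuses visual content with clinical semantics, which replaces the complex dual-branch visual prior used in existing diffusion-based classification methods.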
Abstract: Automated grading of diabetic retinopathy (DR) faces several critical challenges: subtle inter-grade visual distinctions in fine-grained lesion patterns, distributional discrepancies induced by heterogeneous imaging devices and acquisition conditions, and the inherent inability of purely visual approaches to exploit clinical semantic knowledge. In this paper, we propose CLIP-Guided Semantic Diffusion (CGSD), a DR grading framework that synergistically integrates vision-language pretraining with diffusion probabilistic modeling. We adopt a domain-specific vision-language model tailored for DR grading as the semantic guidance module and adapt it to the target domain via Low-Rank Adaptation (LoRA), effectively bridging the distributional gap between the pretrained model and the target dataset with only a minimal number of trainable parameters. Building on this foundation, we construct a cross-modal semantic conditioning vector by computing the dot product between image features and the text description features of each DR grade, yielding a joint representation that simultaneously encodes visual content and clinical-grade semantics. This vector serves as the conditioning signal for the diffusion denoising network, replacing the structurally complex dual-branch visual prior employed in existing diffusion-based classification methods. Experiments on the APTOS 2019 dataset demonstrate that the proposed approach achieves an accuracy of 87.5% and a macro-averaged F1 score of 0.731, outperforming a variety of representative methods. Ablation studies further validate the independent contribution of each constituent module.
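The conditioning vector described in the abstract (a dot product between image features and the text features of each DR grade) can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names, the L2 normalization step (common in CLIP-style models but not stated in the abstract), and the toy 4-dimensional features for five grades are all assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumed CLIP-style preprocessing)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def conditioning_vector(image_feat, grade_text_feats, normalize=True):
    """Return one similarity score per DR grade.

    image_feat:        list of d floats from an image encoder (hypothetical).
    grade_text_feats:  one list of d floats per grade description (hypothetical).
    The result has one entry per grade and would be passed to the
    denoising network as its conditioning signal.
    """
    if normalize:
        image_feat = l2_normalize(image_feat)
        grade_text_feats = [l2_normalize(t) for t in grade_text_feats]
    return [
        sum(i * t for i, t in zip(image_feat, text_feat))
        for text_feat in grade_text_feats
    ]


# Toy features: five grade descriptions embedded in a 4-dimensional space.
toy_text_feats = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
]
toy_image_feat = [0.0, 0.0, 3.0, 0.0]
toy_cond = conditioning_vector(toy_image_feat, toy_text_feats)
```

In this toy case the image feature points along the third text feature, so the third entry of the vector is the largest. The vector's length equals the number of grades regardless of the encoder's feature dimension, which is what makes it a compact conditioning signal.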
[342] A Real-Calibrated Synthetic-First Data Engine eess.IV | cs.CV | cs.GR | cs.LG
Yukang Shen
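TL;DR: This paper presents the Real-Calibrated Synthetic-First Data Engine, a modular data engineering framework aimed at the unstable performance gains of synthetic augmentation in data-scarce domains. The framework combines controllable diffusion generation with a multi-stage curation and filtering pipeline, optionally supports uncertainty-driven selection and human verification, and constructs datasets systematically to make synthetic augmentation more reliable in low-data regimes.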
Details
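Motivation: Modern computer vision systems hit performance bottlenecks in data-scarce domains because collecting large-scale, high-quality labeled data is costly or impractical. Controllable diffusion models enable scalable synthetic image generation, but applying synthetic augmentation directly often yields unstable gains because of dataset-level quality issues and insufficient feedback mechanisms.
Result: In an empirical evaluation centered on human pose estimation, synthetic data improves a real-data baseline when used as near-zero-human-annotation-cost augmentation alongside real anchor data, while training on synthetic data alone remains substantially below real-only performance. Supplementary segmentation diagnostics show the same domain-gap pattern.
Insight: The contribution is a data-centric, systematic orchestration approach rather than a new generative algorithm. The paper provides a modular, configurable CLI pipeline that emphasizes reproducibility, flexibility, and practical deployment in real-world data workflows, offering an engineering answer to the domain gap between synthetic and real data.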
Abstract: Modern computer vision systems increasingly encounter performance limitations in data-scarce domains, where collecting large-scale, high-quality labeled data is costly or impractical. While controllable diffusion models enable scalable synthetic image generation, directly applying synthetic augmentation often leads to unstable performance gains due to dataset-level quality issues and insufficient feedback mechanisms. In this work, we present a Real-Calibrated Synthetic-First Data Engine, a modular data engineering framework that combines controllable diffusion generation and multi-stage curation/filtering within a unified pipeline, with optional support for uncertainty-driven selection and human verification. Instead of introducing new generative algorithms, our approach focuses on systematic dataset construction for improving the practical reliability of synthetic augmentation in low-data regimes. The framework is implemented as a modular CLI-based pipeline, where generation, filtering, selection, and validation components can be independently configured and replaced. This design emphasizes reproducibility, flexibility, and practical deployment in real-world data workflows. Through empirical evaluation centered on human pose estimation, we show that synthetic data improves a real-data baseline when used as near-zero-human-annotation-cost augmentation alongside real anchors, while synthetic-only training remains substantially below real-only performance. Supplementary segmentation diagnostics show the same domain-gap pattern. These results highlight the practical value of data-centric orchestration for low-data augmentation.
[343] Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model eess.IV | cs.AI | cs.CV
David F. Ramirez, Tim Overman, Kristen Jaskie, Andreas Spanias
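TL;DR: This paper introduces SMART-HC-VQA, a visual question answering dataset built on Sentinel-2 satellite imagery and derived from the IARPA SMART Heavy Construction dataset, for spatiotemporal analysis of human activity. The dataset converts construction-site annotations, construction-type labels, temporal-phase labels, geographic metadata, and observation relationships into natural language question-answer pairs, recasting the existing dataset as a temporally extended automatic target recognition and visual question answering challenge. It currently contains 21,837 accessible Sentinel-2 image chips, 65,511 single-image VQA examples, and approximately 2.3 million two-image temporal comparison examples generated with a novel image-pairwise combinatorial augmentation. The paper also details a multi-image multimodal LLM training framework implemented on LLaVA-NeXT Mistral-7B, intended as a reproducible foundation for language-guided understanding of remote sensing activity.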
Details
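Motivation: Remote sensing image analysis needs to do more than detect change: it must also reason about ongoing processes, their progression, and potential future developments. The work addresses this by combining spatiotemporal analysis with natural language understanding.
Result: The authors build a VQA dataset with a large number of single-image and two-image temporal comparison examples, and implement a multi-image MLLM training framework that accepts multiple dated image inputs, providing a benchmark and foundation for follow-up research.
Insight: The contribution lies in recasting remote sensing spatiotemporal analysis as a visual question answering challenge and in introducing an image-pairwise combinatorial augmentation that generates temporal comparison samples at scale. Adapting a multimodal large language model to multi-temporal remote sensing inputs also moves the field toward higher-level semantic understanding and reasoning.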
Abstract: We introduce SMART-HC-VQA, a Sentinel-2-based visual question answering dataset derived from the IARPA SMART Heavy Construction dataset, designed for spatiotemporal analysis of human activity. The dataset transforms construction-site annotations, construction-type labels, temporal-phase labels, geographic metadata, and observation relationships into natural language question-answer triplets. This approach redefines the existing dataset as a temporally extended automatic target recognition and visual question answering (VQA) challenge, considering a fixed geospatial site as a target whose attributes and activity states evolve across sparse satellite observations. Currently, SMART-HC-VQA comprises 21,837 accessible Sentinel-2 image chips, 65,511 single-image VQA examples, and approximately 2.3 million two-image temporal comparison examples generated via our novel Image-Pairwise Combinatorial Augmentation. We detail the workflow for retrieving and processing Sentinel-2 imagery, segmenting large satellite tiles into site-centered images, maintaining traceability to SMART-HC annotations, and analyzing the distributions of site size, observation count, temporal coverage, construction type, and phase labels. Additionally, we describe an implemented multi-image MLLM training framework based on LLaVA-NeXT Mistral-7B, adapted to accept multiple dated image inputs and train on metadata-derived VQA examples. This work offers a reproducible foundation for understanding language-guided remote sensing activities, aiming not only to detect change but also to reason about the ongoing processes, their progression, and potential future developments.
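The abstract does not spell out how Image-Pairwise Combinatorial Augmentation works, so the following is only a guess at the general idea: pairing dated observations of the same site and deriving a comparison question from their metadata. The record layout, the phase names, the question template, and the choice of unordered pairs are all hypothetical.

```python
from datetime import date
from itertools import combinations


def pairwise_examples(observations):
    """Build two-image comparison examples from per-site observations.

    observations: iterable of (site_id, capture_date, phase_label) tuples
    (a hypothetical record layout). Pairs are formed only within a site,
    so the number of examples grows quadratically with observation count.
    """
    by_site = {}
    for site_id, capture_date, phase in observations:
        by_site.setdefault(site_id, []).append((capture_date, phase))

    examples = []
    for site_id, obs in by_site.items():
        obs.sort()  # chronological order within the site
        for (d1, p1), (d2, p2) in combinations(obs, 2):
            examples.append({
                "site": site_id,
                "dates": (d1, d2),
                # Hypothetical question template.
                "question": "Did the construction phase change between the two images?",
                "answer": "yes" if p1 != p2 else "no",
            })
    return examples


toy_observations = [
    ("site_A", date(2020, 1, 5), "site preparation"),
    ("site_A", date(2020, 6, 1), "active construction"),
    ("site_A", date(2020, 9, 1), "active construction"),
    ("site_B", date(2021, 2, 1), "site preparation"),
    ("site_B", date(2021, 3, 1), "site preparation"),
]
toy_examples = pairwise_examples(toy_observations)
```

A site with n observations yields n(n-1)/2 unordered pairs, which suggests how a few tens of thousands of image chips could expand into millions of two-image examples, though the paper's exact pairing rules may differ.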
cs.DB
[344] Toward Multi-Database Query Reasoning for Text2Cypher cs.DB | cs.CL
Makbule Gulcin Ozsoy
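TL;DR: This paper proposes a shift from single-database query generation to multi-database query reasoning, motivated by the fact that real-world graph data is usually spread across multiple independent databases. It formalizes the setting through a three-phase roadmap (database routing, multi-database decomposition, and heterogeneous query reasoning), aiming to support more realistic and scalable natural language interfaces to graph databases.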
Details
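Motivation: Existing Text2Cypher systems usually assume a single preselected graph database, but real systems are often distributed across multiple independent graph databases, and relevant information may span several sources. This makes multi-database query reasoning a problem that needs to be addressed.
Result: The paper reports no experimental results or benchmarks. It contributes a structured framework and an analysis of open challenges.
Insight: The contribution is extending Text2Cypher to the multi-database setting. The three-phase roadmap (database routing, decomposition, and heterogeneous reasoning) gives a systematic treatment of source selection, query decomposition, and result integration, and sets a conceptual direction for more practical natural language interfaces to graph databases.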
Abstract: Large language models have significantly improved natural language interfaces to databases by translating user questions into executable queries. In particular, Text2Cypher focuses on generating Cypher queries for graph databases, enabling users to access graph data without query language expertise. Most existing Text2Cypher systems assume a single preselected graph database, where queries are generated over a known schema. However, real-world systems are often distributed across multiple independent graph databases organized by domain or system boundaries, where relevant information may span multiple sources. To address this limitation, we propose a shift from single-database query generation to multi-database query reasoning. Instead of assuming a fixed execution context, the system must reason about (i) relevant databases, (ii) how to decompose a question across them, and (iii) how to integrate partial results. We formalize this setting through a three-phase roadmap: database routing, multi-database decomposition, and heterogeneous query reasoning across database types and query languages. This work provides a structured formulation of multi-database reasoning for Text2Cypher and identifies challenges in source selection, query decomposition, and result integration, aiming to support more realistic and scalable natural language interfaces to graph databases.
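The paper is a roadmap without a reference implementation, so the sketch below is purely illustrative of what the first phase, database routing, might look like in its simplest form: picking candidate databases by matching question words against schema vocabulary. The function name, the keyword-overlap heuristic, and the toy schemas are assumptions; a real router would more likely rely on embeddings or an LLM.

```python
def route_databases(question, schemas, min_overlap=1):
    """Rank databases by how many schema terms appear in the question.

    schemas: dict mapping a database name to its node labels, relationship
    types, or property names (a hypothetical, flattened schema view).
    Returns (database name, matched terms) pairs, best match first.
    """
    # Naive tokenization: lowercase, drop question marks, split on whitespace.
    q_tokens = set(question.lower().replace("?", " ").split())

    hits = []
    for db_name, schema_terms in schemas.items():
        overlap = q_tokens & {term.lower() for term in schema_terms}
        if len(overlap) >= min_overlap:
            hits.append((db_name, sorted(overlap)))

    # More matched terms first; ties broken alphabetically by database name.
    hits.sort(key=lambda hit: (-len(hit[1]), hit[0]))
    return hits


toy_schemas = {
    "hr_graph": ["Employee", "Department", "Manager"],
    "sales_graph": ["Customer", "Order", "Product"],
    "logistics_graph": ["Warehouse", "Shipment"],
}
toy_routes = route_databases(
    "Which employee handled the order for this customer?", toy_schemas
)
```

Here the question touches two databases, which is exactly the case where the later phases (decomposing the question per database and integrating partial results) become necessary.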