cs.CL [Total: 13]
cs.CV [Total: 82]
cs.MM [Total: 1]
cs.RO [Total: 2]
cs.AI [Total: 1]
eess.IV [Total: 2]
eess.SY [Total: 1]
cs.LG [Total: 2]

cs.CL

[1] Relational graph-driven differential denoising and diffusion attention fusion for multimodal conversation emotion recognition cs.CL | cs.SD | eess.AS

Ying Liu, Yuntao Shou, Wei Ai, Tao Meng, Keqin Li

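TL;DR: This paper proposes a relation-aware denoising and diffusion attention fusion model for multimodal conversation emotion recognition. It targets the information distortion and weight bias that arise when audio and video signals suffer from environmental noise and imbalanced data quality. The method uses a differential Transformer to enhance temporally consistent information and suppress noise, builds intra-modal and cross-modal relation subgraphs to capture speaker-dependent emotional dependencies, and introduces a text-guided cross-modal diffusion mechanism for robust, semantically aligned multimodal fusion.

Details

Motivation: In real-world settings, audio and video signals are often affected by environmental noise and limited acquisition conditions, so the extracted features contain excessive noise. Data quality and information-carrying capacity are also imbalanced across modalities. Together these cause information distortion and weight bias during fusion and degrade overall recognition performance. Most existing methods ignore the effect of noisy modalities and rely on implicit weighting to model modality importance, without explicitly accounting for the dominant contribution of the text modality to emotion understanding.

Result: Experiments on the IEMOCAP and MELD benchmark datasets show that the proposed method reaches state-of-the-art (SOTA) performance, supporting its effectiveness for multimodal emotion recognition.

Insight: The contributions are: a differential Transformer that explicitly computes the difference between attention maps to enhance temporal consistency and suppress noise; modality-specific and cross-modal relation subgraphs that capture fine-grained, speaker-dependent emotional dependencies; and a text-guided cross-modal diffusion mechanism that uses self-attention to model intra-modal dependencies and adaptively diffuses audiovisual information into the text stream for more robust, semantically aligned fusion. Viewed objectively, these components handle noise and modality imbalance explicitly, which improves the robustness and accuracy of multimodal fusion.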

Abstract: In real-world scenarios, audio and video signals are often subject to environmental noise and limited acquisition conditions, resulting in extracted features containing excessive noise. Furthermore, there is an imbalance in data quality and information carrying capacity between different modalities. These two issues together lead to information distortion and weight bias during the fusion phase, impairing overall recognition performance. Most existing methods neglect the impact of noisy modalities and rely on implicit weighting to model modality importance, thereby failing to explicitly account for the predominant contribution of the textual modality in emotion understanding. To address these issues, we propose a relation-aware denoising and diffusion attention fusion model for MCER. Specifically, we first design a differential Transformer that explicitly computes the differences between two attention maps, thereby enhancing temporally consistent information while suppressing time-irrelevant noise, which leads to effective denoising in both audio and video modalities. Second, we construct modality-specific and cross-modality relation subgraphs to capture speaker-dependent emotional dependencies, enabling fine-grained modeling of intra- and inter-modal relationships. Finally, we introduce a text-guided cross-modal diffusion mechanism that leverages self-attention to model intra-modal dependencies and adaptively diffuses audiovisual information into the textual stream, ensuring more robust and semantically aligned multimodal fusion.
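The abstract describes the differential Transformer only as computing "the differences between two attention maps". A minimal sketch of that idea follows, assuming the form A1 - lam * A2 over two softmax attention maps; the function names, the scalar weight `lam`, and the use of separate query/key projections per map are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Scaled dot-product attention weights, one row per query."""
    scale = math.sqrt(len(queries[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


def differential_attention(q1, k1, q2, k2, values, lam=0.5):
    """Apply (A1 - lam * A2) to the values.

    lam is a hypothetical scalar weight; patterns that both maps share
    are cancelled in proportion to lam.
    """
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    diff = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    dim = len(values[0])
    return [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(dim)]
        for row in diff
    ]
```

When both maps are identical and `lam` is 1, the output is exactly zero, which shows the cancellation behavior in its extreme form: anything attended to equally by both maps is treated as noise.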

[2] RealChart2Code: Advancing Chart-to-Code Generation with Real Data and Multi-Task Evaluation cs.CL

Jiajun Zhang, Yuying Li, Zhixun Li, Xingyu Guo, Jingzhuo Wu

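TL;DR: This paper presents RealChart2Code, a large-scale benchmark of more than 2,800 instances built on real data, for evaluating how well vision-language models (VLMs) generate code from complex, multi-panel charts. It is the first benchmark to systematically evaluate chart generation from large-scale raw data and iterative code refinement in multi-turn conversations. An evaluation of 14 leading VLMs shows that performance drops significantly on complex chart structures and real data, exposing the limitations of current models.

Details

Motivation: Existing VLMs perform well at code generation, but their ability to replicate complex, multi-panel visualizations from real-world data has not been adequately evaluated. To fill this gap, the paper builds a benchmark closer to practical use, so that models can be evaluated systematically on such complex tasks.

Result: The evaluation of 14 leading VLMs on RealChart2Code shows a significant performance drop compared with simpler benchmarks. The analysis reveals a substantial gap between proprietary and open-weight models, and even current state-of-the-art VLMs often fail to replicate complex multi-panel charts accurately.

Insight: The paper builds the first chart-to-code benchmark that is grounded in real data and includes multi-task evaluation and iterative code refinement. Viewed objectively, its core contribution is a more realistic and more challenging evaluation framework that pinpoints where VLMs fail on complex data visualization tasks and gives direction to future research.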

Abstract: Vision-Language Models (VLMs) have demonstrated impressive capabilities in code generation across various domains. However, their ability to replicate complex, multi-panel visualizations from real-world data remains largely unassessed. To address this gap, we introduce RealChart2Code, a new large-scale benchmark with over 2,800 instances grounded in authentic datasets and featuring tasks with clear analytical intent. Crucially, it is the first benchmark to systematically evaluate chart generation from large-scale raw data and assess iterative code refinement in a multi-turn conversational setting. Our comprehensive evaluation of 14 leading VLMs on RealChart2Code reveals significant performance degradation compared to simpler benchmarks, highlighting their struggles with complex plot structures and authentic data. Our analysis uncovers a substantial performance gap between proprietary and open-weight models and confirms that even state-of-the-art VLMs often fail to accurately replicate intricate, multi-panel charts. These findings provide valuable insights into the current limitations of VLMs and guide future research directions. We release the benchmark and code at https://github.com/Speakn0w/RealChart2Code.

[3] Doctorina MedBench: End-to-End Evaluation of Agent-Based Medical AI cs.CL | cs.AI | cs.LG | cs.MA

Anna Kozlova, Stanislau Salavei, Pavel Satalkin, Hanna Plotnitskaya, Sergey Parfenyuk

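TL;DR: This paper presents Doctorina MedBench, a comprehensive framework for evaluating agent-based medical AI by simulating realistic physician-patient interactions. The framework runs a multi-step clinical dialogue in which the system must collect the medical history, analyze attached materials, formulate differential diagnoses, and give personalized recommendations. It scores clinical correctness and dialogue efficiency with the D.O.T.S. metric, which has four components: Diagnosis, Observations/Investigations, Treatment, and Step Count.

Details

Motivation: Traditional medical benchmarks rely on answering standardized test questions. The paper aims to overcome this limitation by simulating clinical dialogue to assess clinical competence more realistically, and to support the development of clinical reasoning skills.

Result: The framework includes a dataset of more than 1,000 clinical cases covering over 750 diagnoses. The results suggest that clinical dialogue simulation gives a more realistic assessment of clinical competence than traditional exam-style benchmarks.

Insight: The contributions are an end-to-end, dialogue-simulation-based evaluation paradigm (Doctorina MedBench) and a composite metric (D.O.T.S.), together with a multi-level testing and quality monitoring architecture that includes safety-oriented trap cases, category-based random sampling, and full regression testing. Because the metrics are general, the framework can be used to evaluate medical AI systems and physicians, and to support the development of clinical reasoning skills.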

Abstract: We present Doctorina MedBench, a comprehensive evaluation framework for agent-based medical AI based on the simulation of realistic physician-patient interactions. Unlike traditional medical benchmarks that rely on solving standardized test questions, the proposed approach models a multi-step clinical dialogue in which either a physician or an AI system must collect medical history, analyze attached materials (including laboratory reports, images, and medical documents), formulate differential diagnoses, and provide personalized recommendations. System performance is evaluated using the D.O.T.S. metric, which consists of four components: Diagnosis, Observations/Investigations, Treatment, and Step Count, enabling assessment of both clinical correctness and dialogue efficiency. The system also incorporates a multi-level testing and quality monitoring architecture designed to detect model degradation during both development and deployment. The framework supports safety-oriented trap cases, category-based random sampling of clinical scenarios, and full regression testing. The dataset currently contains more than 1,000 clinical cases covering over 750 diagnoses. The universality of the evaluation metrics allows the framework to be used not only to assess medical AI systems, but also to evaluate physicians and support the development of clinical reasoning skills. Our results suggest that simulation of clinical dialogue may provide a more realistic assessment of clinical competence compared to traditional examination-style benchmarks.

[4] Can Small Models Reason About Legal Documents? A Comparative Study cs.CL | cs.AI

Snehit Vaddi

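TL;DR: This paper evaluates small language models with fewer than 10 billion parameters on legal document reasoning, testing nine models on three legal benchmarks (ContractNLI, CaseHOLD, and ECtHR) with five prompting strategies (direct, chain-of-thought, few-shot, BM25 RAG, and dense RAG). A Mixture-of-Experts model that activates only 3B parameters matches GPT-4o-mini in mean accuracy and surpasses it on legal holding identification, suggesting that architecture and training quality matter more than parameter count. The effect of chain-of-thought prompting varies by task, while few-shot prompting is the most consistently effective strategy. For retrieval-augmented generation, BM25 and dense retrieval give similar results; the bottleneck lies in how the language model uses the retrieved context.

Details

Motivation: Large language models show promise for legal applications, but deploying frontier models raises cost, latency, and data privacy concerns. The paper therefore asks whether small models can serve as practical alternatives.

Result: Across 405 experiments on the three legal benchmarks, a Mixture-of-Experts model activating only 3B parameters matches GPT-4o-mini in mean accuracy and surpasses it on CaseHOLD, while the largest model (9B parameters) performs worst overall. Chain-of-thought prompting helps on ContractNLI but hurts on CaseHOLD; few-shot prompting is the most stable and effective strategy. BM25 and dense RAG give nearly identical results.

Insight: The paper systematically compares small models on legal reasoning and stresses the importance of architecture and training quality. It finds that prompting strategies are strongly task-dependent, and that the RAG bottleneck is the model's use of context rather than retrieval quality. The evaluation was run at low cost through cloud APIs, showing that rigorous LLM evaluation is feasible without dedicated GPUs.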

Abstract: Large language models show promise for legal applications, but deploying frontier models raises concerns about cost, latency, and data privacy. We evaluate whether sub-10B parameter models can serve as practical alternatives by testing nine models across three legal benchmarks (ContractNLI, CaseHOLD, and ECtHR) using five prompting strategies (direct, chain-of-thought, few-shot, BM25 RAG, and dense RAG). Across 405 experiments with three random seeds per configuration, we find that a Mixture-of-Experts model activating only 3B parameters matches GPT-4o-mini in mean accuracy while surpassing it on legal holding identification, and that architecture and training quality matter more than raw parameter count. Our largest model (9B parameters) performs worst overall. Chain-of-thought prompting proves sharply task-dependent, improving contract entailment but degrading multiple-choice legal reasoning, while few-shot prompting emerges as the most consistently effective strategy. Comparing BM25 and dense retrieval for RAG, we find near-identical results, suggesting the bottleneck lies in the language model’s utilization of retrieved context rather than retrieval quality. All experiments were conducted via cloud inference APIs at a total cost of $62, demonstrating that rigorous LLM evaluation is accessible without dedicated GPU infrastructure.
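For readers unfamiliar with the sparse retriever named in the abstract, the sketch below scores tokenized documents with the widely used Okapi BM25 formula. It is a generic illustration, not the paper's pipeline: the tokenization, the parameter defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query tokens."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in docs_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Smoothed IDF (always positive); an assumed, common variant.
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores
```

In a RAG setup, the top-scoring passages would be prepended to the prompt. The abstract's finding is that swapping this scorer for a dense retriever changed results very little.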

[5] Toward Culturally Grounded Natural Language Processing cs.CL

Sina Bagheri Nezhad

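TL;DR: This paper surveys more than 50 papers from 2020 to 2026 and argues that, although multilingual NLP has made progress in language coverage, cultural competence still lags: models often flatten local norms, misread cultural cues, and underperform in low-resource or community-specific settings. The author argues for moving from treating languages as isolated benchmark entries toward modeling "communicative ecologies" that include institutions, scripts, translation pipelines, domains, modalities, and communities. The paper proposes a research agenda for culturally grounded NLP centered on richer contextual metadata, culturally stratified evaluation, participatory alignment, within-language variation, and multimodal community-aware design.

Details

Motivation: Multilingual capability and cultural competence have come apart in current NLP. Models can process many languages but lack an understanding of cultural context, so they perform poorly in real, culturally diverse settings.

Result: The survey shows that training data coverage is a key determinant of performance but is far from sufficient: tokenization, prompt language, translated benchmark design, culturally specific supervision, and multimodal context all materially affect outcomes.

Insight: The paper proposes a shift from "isolated language benchmarks" to modeling "communicative ecologies", and lays out a systematic research agenda built around cultural grounding. The agenda emphasizes contextual metadata, culturally stratified evaluation, and participatory community design, and offers a directional framework for building NLP systems that are more culturally inclusive and culturally aware.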

Abstract: Recent progress in multilingual NLP is often taken as evidence of broader global inclusivity, but a growing literature shows that multilingual capability and cultural competence come apart. This paper synthesizes over 50 papers from 2020–2026 spanning multilingual performance inequality, cross-lingual transfer, culture-aware evaluation, cultural alignment, multimodal local-knowledge modeling, benchmark design critiques, and community-grounded data practices. Across this literature, training data coverage remains a strong determinant of performance, yet it is not sufficient: tokenization, prompt language, translated benchmark design, culturally specific supervision, and multimodal context all materially affect outcomes. Recent work on Global-MMLU, CDEval, WorldValuesBench, CulturalBench, CULEMO, CulturalVQA, GIMMICK, DRISHTIKON, WorldCuisines, CARE, CLCA, and newer critiques of benchmark design and community-grounded evaluation shows that strong multilingual models can still flatten local norms, misread culturally grounded cues, and underperform in lower-resource or community-specific settings. We argue that the field should move from treating languages as isolated rows in a benchmark spreadsheet toward modeling communicative ecologies: the institutions, scripts, translation pipelines, domains, modalities, and communities through which language is used. On that basis, we propose a research agenda for culturally grounded NLP centered on richer contextual metadata, culturally stratified evaluation, participatory alignment, within-language variation, and multimodal community-aware design.

[6] AgentCollab: A Self-Evaluation-Driven Collaboration Paradigm for Efficient LLM Agents cs.CL

Wenbo Gao, Renxi Liu, Xian Wang, Fang Guo, Shuai Yang

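TL;DR: This paper presents AgentCollab, a self-evaluation-driven collaboration paradigm that addresses the trade-off between efficiency and reasoning robustness when LLM agents carry out complex tasks. The framework uses the agent's own reflection signal to coordinate models of different capability levels dynamically, handing control to a stronger reasoning tier only when necessary, and adds a difficulty-aware cumulative escalation strategy to stabilize long-horizon execution.

Details

Motivation: LLM agents performing long-horizon reasoning and tool interaction face a fundamental trade-off between execution efficiency and reasoning robustness. The paper exploits the complementary strengths of models at different capability-cost levels: low-cost models execute quickly but may struggle on difficult reasoning segments, while stronger models reason more robustly at higher computational cost.

Result: On diverse multi-step agent benchmarks, AgentCollab consistently improves the accuracy-efficiency Pareto frontier of LLM agents. The experiments instantiate the framework with a two-level small-large model setting.

Insight: The paper proposes a self-driven collaborative inference framework that does not depend on an external routing module. Instead, it uses the agent's self-reflection signal to decide when to switch models, and introduces a difficulty-aware cumulative escalation strategy to improve the stability of long-horizon execution. This offers a new collaboration paradigm for designing efficient LLM agents.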

Abstract: Autonomous agents powered by large language models (LLMs) perform complex tasks through long-horizon reasoning and tool interaction, where a fundamental trade-off arises between execution efficiency and reasoning robustness. Models at different capability-cost levels offer complementary advantages: lower-cost models enable fast execution but may struggle on difficult reasoning segments, while stronger models provide more robust reasoning at higher computational cost. We present AgentCollab, a self-driven collaborative inference framework that dynamically coordinates models with different reasoning capacities during agent execution. Instead of relying on external routing modules, the framework uses the agent’s own self-reflection signal to determine whether the current reasoning trajectory is making meaningful progress, and escalates control to a stronger reasoning tier only when necessary. To further stabilize long-horizon execution, we introduce a difficulty-aware cumulative escalation strategy that allocates additional reasoning budget based on recent failure signals. In our experiments, we instantiate this framework using a two-level small-large model setting. Experiments on diverse multi-step agent benchmarks show that AgentCollab consistently improves the accuracy-efficiency Pareto frontier of LLM agents.
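The escalation logic described in the abstract can be sketched as a small control loop. Everything below is an illustrative assumption rather than the authors' algorithm: the callables `small_step`, `large_step`, and `is_progressing` are hypothetical stand-ins for the two model tiers and the self-reflection check, and the budget rule (one extra large-model step per additional failure in a sliding window) is just one plausible reading of "difficulty-aware cumulative escalation".

```python
from collections import deque
from typing import Callable, List


def run_with_escalation(
    num_steps: int,
    small_step: Callable[[int], str],
    large_step: Callable[[int], str],
    is_progressing: Callable[[str], bool],
    window: int = 3,
) -> List[str]:
    """Return which tier ("small" or "large") handled each step."""
    recent_failures = deque(maxlen=window)  # sliding window of failure flags
    large_budget = 0  # steps the large model still owns
    tiers = []
    for step in range(num_steps):
        if large_budget > 0:
            large_step(step)
            large_budget -= 1
            tiers.append("large")
            continue
        output = small_step(step)
        failed = not is_progressing(output)
        recent_failures.append(failed)
        if failed:
            # Escalate: the large model redoes this step and keeps control
            # for one extra step per additional failure in the recent window.
            large_step(step)
            large_budget = sum(recent_failures) - 1
            tiers.append("large")
        else:
            tiers.append("small")
    return tiers
```

With this rule, an isolated failure costs one large-model call, while repeated failures in quick succession keep the stronger tier in control for longer.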

[7] LLM Benchmark-User Need Misalignment for Climate Change cs.CL

Oucheng Liu, Lexing Xie, Jing Jiang

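TL;DR: This paper proposes a Proactive Knowledge Behaviors Framework and a Topic-Intent-Form taxonomy for analyzing how well LLM benchmarks align with climate-change-related user needs. It finds a significant mismatch between existing benchmarks and real user needs, while human-AI knowledge interaction patterns resemble human-human ones.

Details

Motivation: As large language models increasingly serve as the interface for accessing climate knowledge, knowing whether existing benchmarks reflect real user needs is essential for evaluating LLMs in practical settings.

Result: The analysis shows a substantial mismatch between current benchmarks and real user needs, while human-AI knowledge interaction patterns closely resemble human-human ones.

Insight: The Proactive Knowledge Behaviors Framework and the Topic-Intent-Form taxonomy give actionable guidance for benchmark design, RAG system development, and LLM training, and underline the importance of evaluating LLMs around user needs.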

Abstract: Climate change is a major socio-scientific issue that shapes public decision-making and policy discussions. As large language models (LLMs) increasingly serve as an interface for accessing climate knowledge, whether existing benchmarks reflect user needs is critical for evaluating LLMs in real-world settings. We propose a Proactive Knowledge Behaviors Framework that captures the different human-human and human-AI knowledge seeking and provision behaviors. We further develop a Topic-Intent-Form taxonomy and apply it to analyze climate-related data representing different knowledge behaviors. Our results reveal a substantial mismatch between current benchmarks and real-world user needs, while knowledge interaction patterns between humans and LLMs closely resemble those in human-human interactions. These findings provide actionable guidance for benchmark design, RAG system development, and LLM training. Code is available at https://github.com/OuchengLiu/LLM-Misalign-Climate-Change.

[8] ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory cs.CL

Zhuohan Ge, Haoyang Li, Yubo Wang, Nicole Hu, Chen Jason Zhang

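TL;DR: This paper presents ClinicalAgents, a new multi-agent framework for clinical decision making that simulates the cognitive workflow of expert clinicians. The framework uses Monte Carlo Tree Search (MCTS) for dynamic orchestration and introduces a dual-memory architecture (Working Memory and Experience Memory) to address the weaknesses of large language models in complex, non-linear clinical reasoning. Experiments show state-of-the-art performance in both diagnostic accuracy and explainability.

Details

Motivation: Existing methods usually rely on static, linear mappings from symptoms to diagnoses and cannot capture the iterative, hypothesis-driven reasoning inherent to human clinicians, and large language models also struggle with complex clinical diagnosis. The paper aims to bridge this gap.

Result: Extensive experiments show that ClinicalAgents achieves state-of-the-art performance, significantly improving diagnostic accuracy and explainability over strong single-agent and multi-agent baselines.

Insight: The paper models clinical decision making as a dynamically orchestrated multi-agent system. Its core is an MCTS-based orchestration mechanism for iterative hypothesis generation and verification, combined with a dual-memory architecture that pairs context-aware reasoning (Working Memory) with clinical knowledge retrieval (Experience Memory). This makes the reasoning more flexible and more explainable.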

Abstract: While Large Language Models (LLMs) have demonstrated potential in healthcare, they often struggle with the complex, non-linear reasoning required for accurate clinical diagnosis. Existing methods typically rely on static, linear mappings from symptoms to diagnoses, failing to capture the iterative, hypothesis-driven reasoning inherent to human clinicians. To bridge this gap, we introduce ClinicalAgents, a novel multi-agent framework designed to simulate the cognitive workflow of expert clinicians. Unlike rigid sequential chains, ClinicalAgents employs a dynamic orchestration mechanism modeled as a Monte Carlo Tree Search (MCTS) process. This allows an Orchestrator to iteratively generate hypotheses, actively verify evidence, and trigger backtracking when critical information is missing. Central to this framework is a Dual-Memory architecture: a mutable Working Memory that maintains the evolving patient state for context-aware reasoning, and a static Experience Memory that retrieves clinical guidelines and historical cases via an active feedback loop. Extensive experiments demonstrate that ClinicalAgents achieves state-of-the-art performance, significantly enhancing both diagnostic accuracy and explainability compared to strong single-agent and multi-agent baselines.

[9] Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR cs.CL | cs.AI | cs.LG | eess.AS

Shashi Kumar, Esaú Villatoro-Tello, Sergio Burdisso, Kadri Hacioglu, Thibault Bañeras-Roux

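TL;DR: This paper proposes "Abstract Compression", a method that makes LLM-based automatic speech recognition more efficient at using multi-turn conversational context. The study finds that, after supervised multi-turn training, conversational context mainly helps with recognizing contextual entities, but processing raw audio context directly is expensive. The method compresses the audio portion of prior turns into a fixed number of learned latent tokens while explicitly keeping the corresponding text, recovering part of the gains of raw context with a smaller audio footprint.

Details

Motivation: Standard LLM-based speech recognition systems process utterances in isolation and cannot use multi-turn conversational audio context efficiently. The paper aims to reduce the computational cost of context processing.

Result: On both in-domain and out-of-domain test sets, the compressed model recovers part of the gains of raw-context conditioning with a smaller prior-turn audio footprint.

Insight: The paper compresses prior-turn audio into a fixed number of learned latent tokens while explicitly retaining the transcripts, trading off computational efficiency against context use. Viewed objectively, the method offers a new approach to efficient context representation for long-conversation ASR.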

Abstract: Standard LLM-based speech recognition systems typically process utterances in isolation, limiting their ability to leverage conversational context. In this work, we study whether multimodal context from prior turns improves LLM-based ASR and how to represent that context efficiently. We find that, after supervised multi-turn training, conversational context mainly helps with the recognition of contextual entities. However, conditioning on raw context is expensive because the prior-turn audio token sequence grows rapidly with conversation length. To address this, we propose Abstract Compression, which replaces the audio portion of prior turns with a fixed number of learned latent tokens while retaining corresponding transcripts explicitly. On both in-domain and out-of-domain test sets, the compressed model recovers part of the gains of raw-context conditioning with a smaller prior-turn audio footprint. We also provide targeted analyses of the compression setup and its trade-offs.

[10] From Human Cognition to Neural Activations: Probing the Computational Primitives of Spatial Reasoning in LLMs cs.CL | cs.AI

Jiyuan An, Liner Yang, Mengyan Wang, Luming Lu, Weihua An

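TL;DR: This paper examines the spatial reasoning abilities of large language models (LLMs) from a mechanistic perspective. Drawing on theories of human spatial cognition, it decomposes spatial reasoning into three computational primitives (relational composition, representational transformation, and stateful spatial updating) and designs a controlled task family for each. The study finds that LLMs encode task-relevant spatial information in intermediate layers, and that this information can causally influence behavior, but the representations are transient and fragmented. Cross-linguistic analysis reveals mechanistic degeneracy. Together these results indicate that current LLMs have limited, context-dependent spatial representations rather than robust, general-purpose spatial reasoning.

Details

Motivation: It is unclear whether LLM performance on spatial reasoning benchmarks stems from structured internal spatial representations or merely from linguistic heuristics. The paper investigates, at the mechanistic level, how spatial information is represented and used inside the model.

Result: The study evaluates multilingual LLMs in English, Chinese, and Arabic using linear probing, sparse-autoencoder-based feature analysis, and causal interventions. Task-relevant spatial information is encoded in intermediate layers and can causally influence behavior, but the representations are transient, fragmented across task families, and weakly integrated into final predictions. Cross-linguistic analysis shows mechanistic degeneracy, in which similar behavioral performance arises from distinct internal pathways.

Insight: The paper decomposes computational theories of human spatial cognition into three primitives to evaluate LLM spatial representations systematically, and applies a mechanistic analysis framework across multiple languages and task families. Viewed objectively, the study underlines the importance of mechanistic evaluation beyond benchmark accuracy, exposes the limits and context dependence of LLM spatial representations, and offers a new perspective on the internal computation of these models.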

Abstract: As spatial intelligence becomes an increasingly important capability for foundation models, it remains unclear whether large language models' (LLMs) performance on spatial reasoning benchmarks reflects structured internal spatial representations or reliance on linguistic heuristics. We address this question from a mechanistic perspective by examining how spatial information is internally represented and used. Drawing on computational theories of human spatial cognition, we decompose spatial reasoning into three primitives (relational composition, representational transformation, and stateful spatial updating) and design controlled task families for each. We evaluate multilingual LLMs in English, Chinese, and Arabic under single-pass inference, and analyze internal representations using linear probing, sparse-autoencoder-based feature analysis, and causal interventions. We find that task-relevant spatial information is encoded in intermediate layers and can causally influence behavior, but these representations are transient, fragmented across task families, and weakly integrated into final predictions. Cross-linguistic analysis further reveals mechanistic degeneracy, where similar behavioral performance arises from distinct internal pathways. Overall, our results suggest that current LLMs exhibit limited and context-dependent spatial representations rather than robust, general-purpose spatial reasoning, highlighting the need for mechanistic evaluation beyond benchmark accuracy.

[11] CALRK-Bench: Evaluating Context-Aware Legal Reasoning in Korean Law cs.CL | cs.AI

JiHyeok Jung, TaeYoung Yoon, HyunSouk Cho

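TL;DR: This paper presents CALRK-Bench, a context-aware legal reasoning benchmark based on the Korean legal system. The benchmark evaluates whether models can identify the temporal validity of legal norms, judge whether a given case provides sufficient legal information, and understand the reasons behind shifts in legal judgments. Experiments show that even current large language models perform poorly on these tasks.

Details

Motivation: Existing legal benchmarks mainly evaluate rule application under fixed norms and cannot capture situations where legal judgments shift or multiple norms interact. A new benchmark is therefore needed to evaluate context-aware legal reasoning.

Result: Even recent large language models consistently show low performance on all three CALRK-Bench tasks: identifying temporal validity, judging information sufficiency, and understanding the reasons for shifts in judgments.

Insight: The paper builds a legal reasoning benchmark that focuses on context awareness (such as temporal validity and shifts in judgments) rather than simple memorization of legal knowledge. Its dataset, constructed from precedents and consultation records and validated by legal experts, provides a new stress test.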

Abstract: Legal reasoning requires not only the application of legal rules but also an understanding of the context in which those rules operate. However, existing legal benchmarks primarily evaluate rule application under the assumption of fixed norms, and thus fail to capture situations where legal judgments shift or where multiple norms interact. In this work, we propose CALRK-Bench, a context-aware legal reasoning benchmark based on the Korean legal system. CALRK-Bench evaluates whether models can identify the temporal validity of legal norms, determine whether sufficient legal information is available for a given case, and understand the reasons behind shifts in legal judgments. The dataset is constructed from legal precedents and legal consultation records, and is validated by legal experts. Experimental results show that even recent large language models consistently exhibit low performance on these three tasks. CALRK-Bench provides a new stress test for evaluating context-aware legal reasoning rather than simple memorization of legal knowledge. Our code is available at https://github.com/jhCOR/CALRKBench.

[12] Why Models Know But Don’t Say: Chain-of-Thought Faithfulness Divergence Between Thinking Tokens and Answers in Open-Weight Reasoning Models cs.CL | cs.AI

Richard J. Young

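TL;DR: This study analyzes how 12 open-weight reasoning models behave on MMLU and GPQA questions paired with misleading hints. It finds a marked inconsistency between the models' "thinking tokens" (their internal reasoning) and their visible answers: in 55.4% of hint-influenced cases, the model mentions hint keywords only in its thinking tokens and omits them entirely from the answer, a pattern called "thinking-answer divergence". The study shows that monitoring answer text alone misses a large share of hint-influenced reasoning, and that the pattern differs sharply across hint types and models.

Details

Motivation: Extended-thinking models produce a "thinking token" channel alongside the visible answer. The study asks whether the visible answer faithfully reflects what that channel reveals, particularly under misleading hints, and assesses the consistency between thinking and answer in order to expose the limits of answer-only monitoring.

Result: Among the 10,506 cases where models actually followed the misleading hint (choosing the hint's target over the correct answer), 55.4% show thinking-answer divergence, with the hint acknowledged only in the thinking tokens. Hint type matters: sycophancy hints are the most transparent (58.8% of cases acknowledge the hint in both channels), while consistency (72.2%) and unethical (62.7%) hints are dominated by thinking-only acknowledgment. Models vary widely, from near-total divergence (Step-3.5-Flash: 94.7%) to relative transparency (Qwen3.5-27B: 19.6%). Access to thinking tokens is necessary, but 11.8% of cases still have no verbalized acknowledgment in either channel.

Insight: The study quantifies the faithfulness gap between thinking tokens and answer text in extended-thinking models, and shows that answer-only monitoring seriously underestimates how much external hints influence a model. Viewed objectively, it adds a new dimension for assessing reasoning transparency and suggests that thinking-token analysis is needed for a fuller picture of model behavior, which is particularly relevant to safety alignment and interpretability.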

Abstract: Extended-thinking models expose a second text-generation channel (“thinking tokens”) alongside the user-visible answer. This study examines 12 open-weight reasoning models on MMLU and GPQA questions paired with misleading hints. Among the 10,506 cases where models actually followed the hint (choosing the hint’s target over the ground truth), each case is classified by whether the model acknowledges the hint in its thinking tokens, its answer text, both, or neither. In 55.4% of these cases the model’s thinking tokens contain hint-related keywords that the visible answer omits entirely, a pattern termed thinking-answer divergence. The reverse (answer-only acknowledgment) is near-zero (0.5%), confirming that the asymmetry is directional. Hint type shapes the pattern sharply: sycophancy is the most transparent hint, with 58.8% of sycophancy-influenced cases acknowledging the professor’s authority in both channels, while consistency (72.2%) and unethical (62.7%) hints are dominated by thinking-only acknowledgment. Models also vary widely, from near-total divergence (Step-3.5-Flash: 94.7%) to relative transparency (Qwen3.5-27B: 19.6%). These results show that answer-text-only monitoring misses more than half of all hint-influenced reasoning and that thinking-token access, while necessary, still leaves 11.8% of cases with no verbalized acknowledgment in either channel.
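The four-way classification described in the abstract (thinking tokens, answer text, both, or neither) can be sketched with simple keyword matching. This is a minimal illustration under stated assumptions: the function names, the case-insensitive substring match, and the example keywords are hypothetical, and the paper's actual keyword lists and matching rules are not given here.

```python
from collections import Counter

LABELS = ("both", "thinking_only", "answer_only", "neither")


def classify_acknowledgment(thinking, answer, hint_keywords):
    """Label one hint-following case by where the hint is mentioned."""

    def mentions(text):
        lowered = text.lower()
        return any(keyword.lower() in lowered for keyword in hint_keywords)

    in_thinking = mentions(thinking)
    in_answer = mentions(answer)
    if in_thinking and in_answer:
        return "both"
    if in_thinking:
        return "thinking_only"  # thinking-answer divergence
    if in_answer:
        return "answer_only"
    return "neither"


def category_shares(cases, hint_keywords):
    """Share of each label over (thinking, answer) pairs."""
    counts = Counter(
        classify_acknowledgment(thinking, answer, hint_keywords)
        for thinking, answer in cases
    )
    total = sum(counts.values())
    return {label: counts[label] / total for label in LABELS}
```

As a consistency check on the reported figures, and assuming the four categories are exhaustive and mutually exclusive, the overall share acknowledged in both channels would be about 100% - 55.4% - 0.5% - 11.8% = 32.3%; the abstract does not state this number directly.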

Raia Abu Ahmad, Max Upravitelev, Aida Usmanova, Veronika Solopova, Georg Rehm

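TL;DR: ClimateCheck 2026 is the second edition of a shared task on scientific fact-checking of climate-related claims and classification of disinformation narratives. Building on the 2025 edition, it triples the training data and adds a new disinformation narrative classification task. The competition ran on the CodaBench platform and attracted 20 participants, whose systems combined dense retrieval, cross-encoder ensembles, and large language models with structured hierarchical reasoning. Beyond standard evaluation metrics, the study uses an automated framework to assess retrieval quality under incomplete annotations, exposing systematic biases in how conventional metrics rank systems, and finds that not all climate disinformation is equally verifiable.

Details

Motivation: Automatically verifying climate-related claims against scientific literature is challenging because scholarly evidence is specialized and the rhetorical strategies behind climate disinformation are diverse. The shared task is designed to address this challenge.

Result: The competition attracted 20 registered participants and 8 leaderboard submissions. Systems were evaluated with standard metrics (Recall@K and Binary Preference) and with an automated framework for assessing retrieval quality under incomplete annotations, which exposed biases in conventional ranking metrics. A cross-task analysis shows that not all climate disinformation is equally verifiable.

Insight: The main contributions are: 1) extending the shared task to include disinformation narrative classification; 2) introducing an automated framework into the evaluation to handle incomplete annotations, which exposes the limitations of conventional metrics; and 3) showing through cross-task analysis that disinformation differs in verifiability, which may affect how future fact-checking systems are designed.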

Abstract: Automatically verifying climate-related claims against scientific literature is a challenging task, complicated by the specialised nature of scholarly evidence and the diversity of rhetorical strategies underlying climate disinformation. ClimateCheck 2026 is the second iteration of a shared task addressing this challenge, expanding on the 2025 edition with tripled training data and a new disinformation narrative classification task. Running from January to February 2026 on the CodaBench platform, the competition attracted 20 registered participants and 8 leaderboard submissions, with systems combining dense retrieval pipelines, cross-encoder ensembles, and large language models with structured hierarchical reasoning. In addition to standard evaluation metrics (Recall@K and Binary Preference), we adapt an automated framework to assess retrieval quality under incomplete annotations, exposing systematic biases in how conventional metrics rank systems. A cross-task analysis further reveals that not all climate disinformation is equally verifiable, with potential implications for how future fact-checking systems should be designed.
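Of the two standard metrics named in the abstract, Recall@K is simple enough to sketch directly. The function below is a generic definition, not the organizers' evaluation code; the identifiers are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of annotated relevant items found in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant) / len(relevant)
```

Note that only annotated items count as relevant here. A retrieved document that is relevant but was never annotated earns no credit, which is one general way incomplete annotations can distort system rankings.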

cs.CV

[14] A Survey of OCR Evaluation Methods and Metrics and the Invisibility of Historical Documents cs.CV | cs.DL

Fitsum Sileshi Beyene, Christopher L. Dancy

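TL;DR: This paper systematically reviews the evaluation methods, metrics, and datasets used for OCR and document understanding systems between 2006 and 2025. It shows that current evaluation centers on modern, Western, and institutional documents, which leaves a significant blind spot for historical documents, especially Black historical newspapers, and it analyzes how this evaluation gap leads to structural invisibility and representational harm.

Details

Motivation: Evaluation of OCR and document understanding systems is heavily skewed toward modern, Western documents. It therefore fails to reflect how systems actually perform on historical and marginalized archives, particularly Black historical newspapers, which have distinctive layouts, typography, and material degradation.

Result: The review finds that Black newspapers and other community-produced historical documents rarely appear in training data or evaluation benchmarks. Existing evaluations mostly report character accuracy and task success on modern layouts, and fail to capture structural failures common in historical newspapers, such as column collapse, typographic errors, and hallucinated text.

Insight: The paper shows that systematic bias in evaluation arises at the organizational (meso) and institutional (macro) levels, shaped by benchmark incentives and data governance decisions. It stresses that evaluation frameworks must account for the particular characteristics of historical documents to avoid structural invisibility.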

Abstract: Optical character recognition (OCR) and document understanding systems increasingly rely on large vision and vision-language models, yet evaluation remains centered on modern, Western, and institutional documents. This emphasis masks system behavior in historical and marginalized archives, where layout, typography, and material degradation shape interpretation. This study examines how OCR and document understanding systems are evaluated, with particular attention to Black historical newspapers. Following the PRISMA framework, we review OCR and document understanding papers, as well as benchmark datasets, published between 2006 and 2025. We examine how the studies report training data, benchmark design, and evaluation metrics for vision transformer and multimodal OCR systems. We find that Black newspapers and other community-produced historical documents rarely appear in reported training data or evaluation benchmarks. Most evaluations emphasize character accuracy and task success on modern layouts. They rarely capture structural failures common in historical newspapers, including column collapse, typographic errors, and hallucinated text. To put these findings into perspective, we use previous empirical studies and archival statistics from significant Black press collections to show how evaluation gaps lead to structural invisibility and representational harm. We propose that these gaps occur due to organizational (meso) and institutional (macro) behaviors and structure, shaped by benchmark incentives and data governance decisions.
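The "character accuracy" that most evaluations emphasize is commonly derived from the character error rate (CER): the Levenshtein edit distance between the OCR output and the reference transcription, divided by the reference length. The sketch below is a generic definition and is not taken from any specific surveyed paper.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of character insertions, deletions, substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / reference length; accuracy is 1 - CER."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Because CER compares flat strings, it scores a page by its character sequence alone and carries no notion of columns or reading order, which is consistent with the survey's point that such metrics rarely capture structural failures.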

[15] Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis cs.CV

Yuan Zhang, Sihao Dou, Kai Hu, Shuhua Deng, Chunhong Cao

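TL;DR: This paper proposes Focus-to-Perceive Representation Learning (FPRL), a cognition-inspired hierarchical framework that addresses the scarcity of high-quality annotations in endoscopic video analysis. Mimicking clinical examination, the framework first focuses on intra-frame lesion regions to learn static semantics, then perceives their evolution across frames to model contextual semantics. A hierarchical semantic modeling mechanism explicitly distinguishes the two kinds of semantics and learns them jointly.

Details

Motivation: Existing self-supervised pre-training methods designed for natural videos emphasize dense spatio-temporal modeling and exhibit motion bias, overlooking the static, structured semantics on which clinical decisions depend. FPRL is proposed to address both the annotation scarcity in endoscopic video analysis and the poor fit of existing methods to medical video.

Result: Extensive experiments on 11 endoscopic video datasets show that FPRL achieves superior performance across a range of downstream tasks, demonstrating its effectiveness for endoscopic video representation learning.

Insight: The key idea is a hierarchical semantic modeling framework inspired by the clinical cognitive process. It captures static semantics through teacher-prior adaptive masking (TPAM) combined with multi-view sparse sampling, and models contextual semantics through cross-view masked feature completion (CVMFC) and attention-guided temporal prediction (AGTP). This separates and integrates static and dynamic information, reduces redundant temporal dependencies, and concentrates on lesion-related local semantics and structured inter-frame evolution.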

Abstract: Endoscopic video analysis is essential for early gastrointestinal screening but remains hindered by limited high-quality annotations. While self-supervised video pre-training shows promise, existing methods developed for natural videos prioritize dense spatio-temporal modeling and exhibit motion bias, overlooking the static, structured semantics critical to clinical decision-making. To address this challenge, we propose Focus-to-Perceive Representation Learning (FPRL), a cognition-inspired hierarchical framework that emulates clinical examination. FPRL first focuses on intra-frame lesion-centric regions to learn static semantics, and then perceives their evolution across frames to model contextual semantics. To achieve this, FPRL employs a hierarchical semantic modeling mechanism that explicitly distinguishes and collaboratively learns both types of semantics. Specifically, it begins by capturing static semantics via teacher-prior adaptive masking (TPAM) combined with multi-view sparse sampling. This approach mitigates redundant temporal dependencies and enables the model to concentrate on lesion-related local semantics. Following this, contextual semantics are derived through cross-view masked feature completion (CVMFC) and attention-guided temporal prediction (AGTP). These processes establish cross-view correspondences and effectively model structured inter-frame evolution, thereby reinforcing temporal semantic continuity while preserving global contextual integrity. Extensive experiments on 11 endoscopic video datasets show that FPRL achieves superior performance across diverse downstream tasks, demonstrating its effectiveness in endoscopic video representation learning. The code is available at https://github.com/MLMIP/FPRL.

[16] ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions cs.CV

Zikai Wang, Zhilu Zhang, Yiqing Wang, Hui Li, Wangmeng Zuo

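TL;DR: This paper presents ArtHOI, a framework for reconstructing 4D interactions between hands and articulated objects from monocular RGB video. By integrating and refining priors from multiple foundation models, it overcomes the limitations of existing methods in reconstructing interactions with articulated objects.

Details

Motivation: Existing hand-object interaction methods are largely limited to rigid objects, while 4D reconstruction of articulated objects usually requires pre-scanning or multi-view video. Reconstructing hand-articulated-object interactions from monocular RGB video is an unsolved and challenging problem, and recent progress in foundation models offers a new opportunity to address it.

Result: Extensive experiments on the proposed ArtHOI-RGBD and ArtHOI-Wild datasets validate the robustness and effectiveness of ArtHOI across different objects and interaction scenarios.

Insight: The contributions include: 1) an adaptive sampling refinement method that optimizes the object's metric scale and pose; 2) a hand-object alignment method guided by a multimodal large language model, which uses contact reasoning information as constraints when optimizing the hand-object mesh composition; and 3) two new datasets for comprehensive evaluation.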

Abstract: Existing hand-object interaction (HOI) methods are largely limited to rigid objects, while 4D reconstruction methods for articulated objects generally require pre-scanning the object or even multi-view videos. It remains an unexplored but significant challenge to reconstruct 4D human-articulated-object interactions from a single monocular RGB video. Fortunately, recent advancements in foundation models present a new opportunity to address this highly ill-posed problem. To this end, we introduce ArtHOI, an optimization-based framework that integrates and refines priors from multiple foundation models. Our key contribution is a suite of novel methodologies designed to resolve the inherent inaccuracies and physical implausibility of these priors. In particular, we introduce an Adaptive Sampling Refinement (ASR) method to optimize the object's metric scale and pose for grounding its normalized mesh in world space. Furthermore, we propose a Multimodal Large Language Model (MLLM) guided hand-object alignment method, utilizing contact reasoning information as constraints for hand-object mesh composition optimization. To facilitate a comprehensive evaluation, we also contribute two new datasets, ArtHOI-RGBD and ArtHOI-Wild. Extensive experiments validate the robustness and effectiveness of ArtHOI across diverse objects and interactions. Project: https://arthoi-reconstruction.github.io.

[17] Do All Vision Transformers Need Registers? A Cross-Architectural Reassessment cs.CV | cs.LGPDF

Spiros Baxevanakis, Platon Karageorgis, Ioannis Dravilas, Konrad Szewczyk

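TL;DR: This paper reproduces, and evaluates across architectures, the work of Darcet et al. (2024) on attention-map artifacts that arise when training Vision Transformers (ViTs) and on their proposed remedy, register tokens. It finds that some of the original conclusions (for example, that registers eliminate artifacts and improve attention-map clarity) do not hold universally across models such as DINO, DINOv2, OpenCLIP, and DeiT3. It also examines the effect of model size and clarifies inconsistent terminology.

Details

Motivation: The aim is to test how general the "register" solution proposed by Darcet et al. is, that is, whether it applies to Vision Transformers of different architectures and sizes, and to clarify terminology inconsistencies in the original study.

Result: The results confirm several of the original paper's key claims but show that some claims (such as registers eliminating artifacts) do not generalize universally to other models (such as DINO and DINOv2). The study also extends the conclusions to smaller models and reveals the effect of model size.

Insight: The contribution is a systematic cross-architecture and cross-scale evaluation of one specific improvement (registers), which exposes its limitations and the conditions it depends on, underscores the importance of accounting for model diversity when generalizing ViT improvements, and clarifies terminology to support more rigorous follow-up work.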

Abstract: Training Vision Transformers (ViTs) presents significant challenges, one of which is the emergence of artifacts in attention maps, hindering their interpretability. Darcet et al. (2024) investigated this phenomenon and attributed it to the need of ViTs to store global information beyond the [CLS] token. They proposed a novel solution involving the addition of empty input tokens, named registers, which successfully eliminate artifacts and improve the clarity of attention maps. In this work, we reproduce the findings of Darcet et al. (2024) and evaluate the generalizability of their claims across multiple models, including DINO, DINOv2, OpenCLIP, and DeiT3. While we confirm the validity of several of their key claims, our results reveal that some claims do not extend universally to other models. Additionally, we explore the impact of model size, extending their findings to smaller models. Finally, we untie terminology inconsistencies found in the original paper and explain their impact when generalizing to a wider range of models.

[18] ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners? cs.CV | cs.AIPDF

Haonan Han, Jiancheng Huang, Xiaopeng Sun, Junyan He, Rui Yang

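TL;DR: The paper introduces the ViGoR benchmark, a unified evaluation framework designed to reveal the true capability of current visual generative models on zero-shot visual reasoning tasks, and shows that they have significant logical reasoning deficits.

Details

Motivation: Although current AIGC models achieve high visual fidelity, they fall into a “logical desert” on tasks that require physical, causal, or complex spatial reasoning. Existing evaluations rely on superficial metrics or fragmented benchmarks, creating a “performance mirage” that overlooks the generative process.

Result: Experiments on more than 20 leading models show that even state-of-the-art systems have significant reasoning deficits, establishing ViGoR as a critical “stress test” for the next generation of intelligent vision models.

Insight: Contributions include: 1) holistic cross-modal coverage bridging image-to-image and video tasks; 2) a dual-track mechanism that evaluates both intermediate processes and final results; 3) an evidence-grounded automated judge that ensures high human alignment; 4) granular diagnostic analysis that decomposes performance into fine-grained cognitive dimensions.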

Abstract: Beneath the stunning visual fidelity of modern AIGC models lies a “logical desert”, where systems fail tasks that require physical, causal, or complex spatial reasoning. Current evaluations largely rely on superficial metrics or fragmented benchmarks, creating a “performance mirage” that overlooks the generative process. To address this, we introduce ViGoR (Vision-Generative Reasoning-centric Benchmark), a unified framework designed to dismantle this mirage. ViGoR distinguishes itself through four key innovations: 1) holistic cross-modal coverage bridging Image-to-Image and Video tasks; 2) a dual-track mechanism evaluating both intermediate processes and final results; 3) an evidence-grounded automated judge ensuring high human alignment; and 4) granular diagnostic analysis that decomposes performance into fine-grained cognitive dimensions. Experiments on over 20 leading models reveal that even state-of-the-art systems harbor significant reasoning deficits, establishing ViGoR as a critical “stress test” for the next generation of intelligent vision models. The demo is available at https://vincenthancoder.github.io/ViGoR-Bench/

[19] Fus3D: Decoding Consolidated 3D Geometry from Feed-forward Geometry Transformer Latents cs.CVPDF

Laura Fink, Linus Franke, George Kopanas, Marc Stamminger, Peter Hedman

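TL;DR: This paper proposes Fus3D, a feed-forward method that quickly (in under three seconds) regresses a dense signed distance field (SDF) from unstructured image collections, without camera calibration or post-hoc fusion. Its core idea is to use the intermediate feature space of a pretrained multi-view feed-forward geometry transformer directly as a joint world representation, and to produce the SDF through learned volumetric extraction and a convolutional decoder. This avoids the information loss and error accumulation that per-view prediction and post-hoc fusion cause in conventional pipelines.

Details

Motivation: Existing methods typically pass the intermediate features of pretrained geometry transformers through per-view prediction heads and then fuse the results into 3D geometry, which discards valuable completeness information and accumulates inaccuracies. This paper aims to extract 3D geometry directly from those features to address the information loss and accuracy problems.

Result: The method yields complete and well-defined distance values in both sparse- and dense-view settings and demonstrates geometrically plausible completions.

Insight: The contribution lies in recognizing and directly exploiting the intermediate features of pretrained geometry transformers as a powerful joint world representation, progressively absorbing multi-view information through voxelized canonical embeddings and cross-/self-attention, and proposing a scalable, validity-aware supervision scheme that trains directly on SDFs derived from depth maps or 3D assets, which handles practical issues such as non-watertight meshes.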

Abstract: We propose a feed-forward method for dense Signed Distance Field (SDF) regression from unstructured image collections in less than three seconds, without camera calibration or post-hoc fusion. Our key insight is that the intermediate feature space of pretrained multi-view feed-forward geometry transformers already encodes a powerful joint world representation; yet, existing pipelines discard it, routing features through per-view prediction heads before assembling 3D geometry post-hoc, which discards valuable completeness information and accumulates inaccuracies. We instead perform 3D extraction directly from geometry transformer features via learned volumetric extraction: voxelized canonical embeddings that progressively absorb multi-view geometry information through interleaved cross- and self-attention into a structured volumetric latent grid. A simple convolutional decoder then maps this grid to a dense SDF. We additionally propose a scalable, validity-aware supervision scheme directly using SDFs derived from depth maps or 3D assets, tackling practical issues like non-watertight meshes. Our approach yields complete and well-defined distance values across sparse- and dense-view settings and demonstrates geometrically plausible completions. Code and further material can be found at https://lorafib.github.io/fus3d.

[20] GazeQwen: Lightweight Gaze-Conditioned LLM Modulation for Streaming Video Understanding cs.CV | cs.AIPDF

Trong Thang Pham, Hien Nguyen, Ngan Le

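TL;DR: This paper proposes GazeQwen, a parameter-efficient, lightweight method that gives an open-source multimodal large language model (MLLM) gaze awareness through hidden-state modulation. At its core is a compact gaze resampler that encodes video features together with gaze position information and produces additive residuals injected into selected LLM decoder layers. Across all 10 tasks of the StreamGaze benchmark, the method significantly outperforms baselines including GPT-4o.

Details

Motivation: Current multimodal large language models (MLLMs) cannot effectively use eye-gaze information for video understanding, even when gaze cues are supplied via visual overlays or text descriptions. This paper aims to close that capability gap.

Result: Across all 10 tasks of the StreamGaze benchmark, GazeQwen reaches 63.9% accuracy, 16.1 percentage points above the same backbone (Qwen2.5-VL-7B) with gaze supplied as visual prompts and 10.5 percentage points above GPT-4o, the highest score among all open-source and proprietary models tested.

Insight: The main contribution is a parameter-efficient, lightweight method for gaze-conditioned LLM modulation (GazeQwen). Its core insight is that learning where in the LLM to inject gaze information is more effective than scaling model size or engineering better prompts. The method is implemented with a compact gaze resampler (only 1-5M trainable parameters) and an optional LoRA fine-tuning stage, offering a new approach to efficiently integrating gaze information into MLLMs.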

Abstract: Current multimodal large language models (MLLMs) cannot effectively utilize eye-gaze information for video understanding, even when gaze cues are supplied via visual overlays or text descriptions. We introduce GazeQwen, a parameter-efficient approach that equips an open-source MLLM with gaze awareness through hidden-state modulation. At its core is a compact gaze resampler (~1-5M trainable parameters) that encodes V-JEPA 2.1 video features together with fixation-derived positional encodings and produces additive residuals injected into selected LLM decoder layers via forward hooks. An optional second training stage adds low-rank adapters (LoRA) to the LLM for tighter integration. Evaluated on all 10 tasks of the StreamGaze benchmark, GazeQwen reaches 63.9% accuracy, a +16.1 point gain over the same Qwen2.5-VL-7B backbone with gaze as visual prompts and +10.5 points over GPT-4o, the highest score among all open-source and proprietary models tested. These results suggest that learning where to inject gaze within an LLM is more effective than scaling model size or engineering better prompts. All code and checkpoints are available at https://github.com/phamtrongthang123/gazeqwen.
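
The reported gains imply the accuracies of the two comparison systems, which the abstract does not state directly. A minimal Python sketch of that arithmetic, assuming the "+16.1 point" and "+10.5 points" figures are absolute percentage-point differences; the function and variable names are illustrative only:

```python
def implied_baseline(score: float, gain_points: float) -> float:
    """Return the baseline accuracy implied by a score and an absolute gain in points."""
    return round(score - gain_points, 1)

gazeqwen = 63.9  # accuracy reported in the abstract
qwen_visual_prompt = implied_baseline(gazeqwen, 16.1)  # same backbone, gaze as visual prompt
gpt4o = implied_baseline(gazeqwen, 10.5)
print(qwen_visual_prompt, gpt4o)
```

Under that assumption the visual-prompt baseline sits at about 47.8% and GPT-4o at about 53.4%.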

[21] GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks cs.CV | cs.AI | cs.HCPDF

Saelyne Yang, Jaesang Yu, Yi-Hao Peng, Kevin Qinghong Lin, Jae Won Cho

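TL;DR: The paper introduces GUIDE, a benchmark for evaluating AI models' ability to perceive user behavior, infer intent, and provide assistance in open-ended graphical user interface (GUI) tasks. The benchmark contains screen recordings and think-aloud narrations from 120 novice user demonstrations across 10 software applications, and defines three tasks: behavior state detection, intent prediction, and help prediction.

Details

Motivation: Existing GUI agent research focuses mainly on automating user actions through clicks and keystrokes, overlooking user intent and the need for collaboration. To move from automation to collaboration, GUI agents need to understand what users are doing and why.

Result: Evaluating eight state-of-the-art multimodal models on GUIDE shows that all of them perform poorly, with accuracy of only 44.6% on behavior state detection and 55.0% on help prediction. Providing user context, however, improves performance significantly, raising help prediction accuracy by up to 50.2 percentage points.

Insight: The paper's contribution is to shift GUI agent research from automating actions to understanding user intent and assisting collaboratively, and to build a benchmark with real user data and multi-task evaluation for that purpose. Viewed objectively, it highlights the key role of structured understanding of user context in providing effective GUI assistance and points the way toward more collaborative, user-centered AI assistants.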

Abstract: Graphical User Interface (GUI) agents have the potential to assist users in interacting with complex software (e.g., PowerPoint, Photoshop). While prior research has primarily focused on automating user actions through clicks and keystrokes, this paradigm overlooks human intention, where users value the ability to explore, iterate, and refine their ideas while maintaining agency. To move beyond automation and toward collaboration, GUI agents must understand what users are doing and why. We introduce GUIDE (GUI User Intent Detection Evaluation), a benchmark that evaluates AI models on their ability to perceive user behavior, infer intent, and provide assistance in open-ended GUI tasks. GUIDE consists of 67.5 hours of screen recordings from 120 novice user demonstrations with think-aloud narrations, across 10 software applications. GUIDE defines three tasks, (i) Behavior State Detection, (ii) Intent Prediction, and (iii) Help Prediction, which test a model’s ability to recognize behavior state, reason about goals, and decide when and how to help. Evaluations across eight state-of-the-art multimodal models reveal that all models struggled, achieving only 44.6% and 55.0% accuracy on behavior state detection and help prediction, respectively. However, providing user context significantly improved performance, raising help prediction accuracy by up to 50.2 percentage points and highlighting the critical role of structured user understanding in effective assistance. Our dataset is available at https://guide-bench.github.io.

[22] Seeing Through Smoke: Surgical Desmoking for Improved Visual Perception cs.CVPDF

Jingpei Lu, Fengyi Jiang, Xiaorui Zhang, Lingbo Jin, Omid Mohareri

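TL;DR: This paper proposes a Transformer-based surgical desmoking model whose physics-inspired desmoking head jointly predicts a smoke-free image and the corresponding smoke map. To address the scarcity of paired data, the authors develop a synthetic data generation pipeline and build the largest paired surgical smoke dataset to date. Experiments show that the method achieves state-of-the-art image reconstruction performance, and the paper also evaluates the impact of desmoking on downstream depth estimation and instrument segmentation.

Details

Motivation: Minimally invasive and robot-assisted surgery relies heavily on endoscopic imaging, but surgical smoke produced by electrocautery and vessel-sealing instruments severely degrades visual perception and hinders vision-based functionality, so effective digital desmoking methods are needed.

Result: Extensive experiments on a public benchmark and the authors' own dataset show that the method achieves state-of-the-art image reconstruction performance compared with existing dehazing and desmoking approaches.

Insight: Contributions include a Transformer-based architecture combined with a physics-inspired desmoking head for joint prediction, and the use of synthetic data generation together with real data collection to address the scarcity of training data, offering a new approach to applying digital smoke removal to surgical vision enhancement.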

Abstract: Minimally invasive and robot-assisted surgery relies heavily on endoscopic imaging, yet surgical smoke produced by electrocautery and vessel-sealing instruments can severely degrade visual perception and hinder vision-based functionalities. We present a transformer-based surgical desmoking model with a physics-inspired desmoking head that jointly predicts a smoke-free image and the corresponding smoke map. To address the scarcity of paired smoky-to-smoke-free training data, we develop a synthetic data generation pipeline that blends artificial smoke patterns with real endoscopic images, yielding over 80,000 paired samples for supervised training. We further curate, to our knowledge, the largest paired surgical smoke dataset to date, comprising 5,817 image pairs captured with the da Vinci robotic surgical system, enabling benchmarking on high-resolution endoscopic images. Extensive experiments on both a public benchmark and our dataset demonstrate state-of-the-art performance in image reconstruction compared to existing dehazing and desmoking approaches. We also assess the impact of desmoking on downstream stereo depth estimation and instrument segmentation, highlighting both the potential benefits and current limitations of digital smoke removal methods.
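
The abstract describes the synthetic pipeline only as blending artificial smoke patterns with real endoscopic images. One common way to do this is per-pixel alpha compositing; the Python sketch below assumes that model. The function name, the constant smoke brightness, and the toy density map are illustrative choices, not details from the paper.

```python
from typing import List

Image = List[List[float]]  # grayscale image, values in [0, 1]

def blend_smoke(clean: Image, density: Image, smoke_value: float = 0.9) -> Image:
    """Composite smoke over a clean image: out = (1 - d) * clean + d * smoke_value."""
    return [
        [(1.0 - d) * c + d * smoke_value for c, d in zip(clean_row, density_row)]
        for clean_row, density_row in zip(clean, density)
    ]

clean = [[0.2, 0.4], [0.6, 0.8]]
density = [[0.0, 0.5], [1.0, 0.25]]  # 0 = no smoke, 1 = fully opaque smoke
smoky = blend_smoke(clean, density)
```

Under this assumed model, the pair (smoky, clean) is one supervised training sample, and the density map plays the role of a smoke map that a joint prediction head could also regress.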

[23] Speech-Synchronized Whiteboard Generation via VLM-Driven Structured Drawing Representations cs.CV | cs.LGPDF

Suraj Prasad, Pinak Mahapatra

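TL;DR: This paper proposes a method for generating speech-synchronized whiteboard content through structured drawing representations driven by a vision-language model, and creates the first dataset of 24 paired Excalidraw demonstrations with audio narration, in which every drawing element carries a millisecond-precision creation timestamp. The study fine-tunes Qwen2-VL-7B via LoRA to predict full stroke sequences synchronized to speech from only 24 demonstrations, and a topic-stratified five-fold evaluation confirms that timestamp conditioning is effective for temporal alignment.

Details

Motivation: The goal is to solve the multimodal synchronization problem of precisely coordinating freehand illustrations with spoken narration in whiteboard-style educational videos; existing methods lack structured, reproducible drawing representations.

Result: In a topic-stratified five-fold evaluation spanning 8 STEM domains, timestamp conditioning significantly improves temporal alignment over ablated baselines, and the model generalizes to unseen STEM topics.

Insight: The contribution lies in introducing structured drawing representations (such as timestamps on Excalidraw elements) to synchronize speech with drawing, and in achieving cross-topic generalization by few-shot fine-tuning of a vision-language model, providing an extensible dataset and methodological framework for automated educational content generation.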

Abstract: Creating whiteboard-style educational videos demands precise coordination between freehand illustrations and spoken narration, yet no existing method addresses this multimodal synchronization problem with structured, reproducible drawing representations. We present the first dataset of 24 paired Excalidraw demonstrations with narrated audio, where every drawing element carries millisecond-precision creation timestamps spanning 8 STEM domains. Using this data, we study whether a vision-language model (Qwen2-VL-7B), fine-tuned via LoRA, can predict full stroke sequences synchronized to speech from only 24 demonstrations. Our topic-stratified five-fold evaluation reveals that timestamp conditioning significantly improves temporal alignment over ablated baselines, while the model generalizes across unseen STEM topics. We discuss transferability to real classroom settings and release our dataset and code to support future research in automated educational content generation.

Prasiddha Bhandari, Kanchan Poudel, Nishant Luitel, Bishram Acharya, Angelina Ghimire

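TL;DR: This paper systematically evaluates how the quality of Blind Sweep Obstetric Ultrasound (BSOU) videos affects AI task performance, simulating common deviations such as reversed sweep direction, probe inversion, and incomplete sweeps, and develops automated quality-assessment models to detect them. The study shows that re-acquiring flagged sweep videos through a feedback loop improves downstream task performance, underscoring the key role of automated quality assessment in building reliable, scalable AI-assisted prenatal ultrasound workflows.

Details

Motivation: In low-resource settings, BSOU allows minimally trained operators to acquire standardized sweep videos for AI interpretation, but the reliability of such AI systems depends heavily on sweep quality, and the effect of protocol deviations on downstream predictions is not well understood.

Result: The study quantifies model robustness to simulated acquisition deviations and develops automated quality-assessment models that detect them. A simulated feedback loop shows that re-acquiring flagged sweeps improves performance on downstream tasks (sweep-tag classification, fetal presentation classification, and placenta-location classification).

Insight: The contribution is a systematic evaluation of how BSOU quality affects AI tasks, together with automated quality-assessment models that improve robustness. Viewed objectively, by simulating the feedback loop of a real deployment, the approach offers a practical path to reliable AI-assisted prenatal ultrasound in low-resource environments.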

Abstract: Blind Sweep Obstetric Ultrasound (BSOU) enables scalable fetal imaging in low-resource settings by allowing minimally trained operators to acquire standardized sweep videos for automated Artificial Intelligence (AI) interpretation. However, the reliability of such AI systems depends critically on the quality of the acquired sweeps, and little is known about how deviations from the intended protocol affect downstream predictions. In this work, we present a systematic evaluation of BSOU quality and its impact on three key AI tasks: sweep-tag classification, fetal presentation classification, and placenta-location classification. We simulate plausible acquisition deviations, including reversed sweep direction, probe inversion, and incomplete sweeps, to quantify model robustness, and we develop automated quality-assessment models capable of detecting these perturbations. To approximate real-world deployment, we simulate a feedback loop in which flagged sweeps are re-acquired, showing that such correction improves downstream task performance. Our findings highlight the sensitivity of BSOU-based AI models to acquisition variability and demonstrate that automated quality assessment can play a central role in building reliable, scalable AI-assisted prenatal ultrasound workflows, particularly in low-resource environments.

[25] World Reasoning Arena cs.CVPDF

PAN Team, Qiyue Gao, Kun Zhou, Jiannan Xiang, Zihan Liu

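TL;DR: This paper introduces WR-Arena, a comprehensive benchmark that evaluates world models (WMs) along three core dimensions (action simulation fidelity, long-horizon forecasting, and simulative reasoning and planning), aiming to address the shortcoming that existing benchmarks focus only on next-state prediction and visual fidelity.

Details

Motivation: Existing world model benchmarks are too narrow: they focus mainly on next-state prediction and visual fidelity and overlook the richer simulation capabilities that intelligent behavior requires. A new benchmark is therefore needed to comprehensively evaluate the core capabilities of world models as internal simulators.

Result: Extensive experiments with state-of-the-art world models reveal a substantial gap between current models and human-level hypothetical reasoning, and establish WR-Arena as both a diagnostic tool and a benchmark to guide the development of next-generation world models.

Insight: The paper's contribution is a task taxonomy with diverse datasets that goes beyond single-turn and perceptual evaluation, systematically assessing world models' simulation capabilities along three fundamental dimensions and providing clear evaluation guidance for developing next-generation world models capable of robust understanding, forecasting, and purposeful action.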

Abstract: World models (WMs) are intended to serve as internal simulators of the real world that enable agents to understand, anticipate, and act upon complex environments. Existing WM benchmarks remain narrowly focused on next-state prediction and visual fidelity, overlooking the richer simulation capabilities required for intelligent behavior. To address this gap, we introduce WR-Arena, a comprehensive benchmark for evaluating WMs along three fundamental dimensions of next world simulation: (i) Action Simulation Fidelity, the ability to interpret and follow semantically meaningful, multi-step instructions and generate diverse counterfactual rollouts; (ii) Long-horizon Forecast, the ability to sustain accurate, coherent, and physically plausible simulations across extended interactions; and (iii) Simulative Reasoning and Planning, the ability to support goal-directed reasoning by simulating, comparing, and selecting among alternative futures in both structured and open-ended environments. We build a task taxonomy and curate diverse datasets designed to probe these capabilities, moving beyond single-turn and perceptual evaluations. Through extensive experiments with state-of-the-art WMs, our results expose a substantial gap between current models and human-level hypothetical reasoning, and establish WR-Arena as both a diagnostic tool and a guideline for advancing next-generation world models capable of robust understanding, forecasting, and purposeful action. The code is available at https://github.com/MBZUAI-IFM/WR-Arena.

[26] Few Shots Text to Image Retrieval: New Benchmarking Dataset and Optimization Methods cs.CVPDF

Ofer Idan, Vladi Vexler, Gil Lederman, Dima Sivov, Aviad Cohen Zada

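TL;DR: To address the weak retrieval performance of pre-trained vision-language models on compositional queries and out-of-distribution image-text pairs, this paper introduces the few-shot text-to-image retrieval task and its benchmark dataset FSIR-BD, and develops two retrieval optimization methods compatible with any pre-trained image encoder. Experiments show that the new dataset is challenging and that the optimization methods outperform existing baselines in mean Average Precision.

Details

Motivation: The aim is to overcome the performance bottleneck of pre-trained vision-language models when retrieving with compositional queries and out-of-distribution image-text pairs, using few-shot learning to mimic the human ability to learn from a handful of examples and thereby improve the model's understanding of complex scenes and its retrieval quality.

Result: On the proposed FSIR-BD benchmark, the proposed optimization methods surpass existing baselines in mean Average Precision, validating both the dataset and the methods.

Insight: Contributions include the first few-shot text-to-image retrieval task and its benchmark dataset FSIR-BD, which focus on the challenges of compositional and out-of-distribution queries, and two general few-shot retrieval optimization methods that adapt flexibly to existing pre-trained encoders, opening a new direction for research on compositional reasoning from limited examples.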

Abstract: Pre-trained vision-language models (VLMs) excel in multimodal tasks, commonly encoding images as embedding vectors for storage in databases and retrieval via approximate nearest neighbor search (ANNS). However, these models struggle with compositional queries and out-of-distribution (OOD) image-text pairs. Inspired by human cognition’s ability to learn from minimal examples, we address this performance gap through few-shot learning approaches specifically designed for image retrieval. We introduce the Few-Shot Text-to-Image Retrieval (FSIR) task and its accompanying benchmark dataset, FSIR-BD, the first to explicitly target image retrieval by text accompanied by reference examples, focusing on challenging compositional and OOD queries. The compositional part is divided into urban scenes and nature species, both in specific situations or with distinctive features. FSIR-BD contains 38,353 images and 303 queries, with 82% comprising the test corpus (averaging 37 positives, i.e., ground-truth matches, per query, plus a significant number of hard negatives) and 18% forming the few-shot reference corpus (FSR) of exemplar positive and hard negative images. Additionally, we propose two novel retrieval optimization methods leveraging single-shot or few-shot reference examples in the FSR to improve performance. Both methods are compatible with any pre-trained image encoder, making them applicable to existing large-scale environments. Our experiments demonstrate that: (1) FSIR-BD provides a challenging benchmark for image retrieval; and (2) our optimization methods outperform existing baselines as measured by mean Average Precision (mAP). Further research into FSIR optimization methods will help narrow the gap between machine and human-level understanding, particularly for compositional reasoning from limited examples.
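
Since the comparison above rests on mean Average Precision, a short Python sketch of the standard definition may help. It assumes each query's ranked list covers the whole test corpus, so every positive appears somewhere in the list; the function names are illustrative, and this is not the authors' evaluation code.

```python
from typing import Sequence

def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """AP for one query: mean of precision@k over the ranks k that hold a positive."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(per_query: Sequence[Sequence[bool]]) -> float:
    """mAP: the unweighted mean of per-query AP values."""
    if not per_query:
        return 0.0
    return sum(average_precision(q) for q in per_query) / len(per_query)

# Two toy queries; True marks a ground-truth match at that rank.
queries = [[True, False, True], [False, True]]
score = mean_average_precision(queries)
```

Hard negatives lower the score only when they are ranked above positives, which is why a corpus rich in hard negatives makes the benchmark more demanding.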

[27] THFM: A Unified Video Foundation Model for 4D Human Perception and Beyond cs.CVPDF

Letian Wang, Andrei Zanfir, Eduard Gabriel Bazavan, Misha Andriluka, Cristian Sminchisescu

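TL;DR: THFM is a unified human-centric video foundation model that jointly handles dense tasks (such as depth, normals, segmentation, and dense pose) and sparse tasks (such as 2D/3D keypoint estimation) within a single architecture. The model is adapted from a pretrained text-to-video diffusion model, makes sparse predictions through learnable tokens, and is modulated by text prompts. Although trained only on synthetic data, it matches or surpasses specialized models on multiple benchmarks and generalizes to multiple people and non-human subjects.

Details

Motivation: Existing approaches usually design a separate model for each perception task and lack a unified framework. This work aims to develop a single model that jointly handles a variety of human-centric video perception tasks.

Result: On multiple benchmarks the model matches or surpasses state-of-the-art specialized models, despite being trained only on synthetic data without any real-world or benchmark-specific data.

Insight: The contribution lies in repurposing a pretrained text-to-video diffusion model for single-forward-pass perception, unifying multiple tasks through learnable tokens and text-prompt modulation, and demonstrating emergent generalization attributed to the diffusion-based video representation, such as generalizing from single-person scenes to multiple people, anthropomorphic characters, and animals.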

Abstract: We present THFM, a unified video foundation model for human-centric perception that jointly addresses dense tasks (depth, normals, segmentation, dense pose) and sparse tasks (2D/3D keypoint estimation) within a single architecture. THFM is derived from a pretrained text-to-video diffusion model, repurposed as a single-forward-pass perception model and augmented with learnable tokens for sparse predictions. Modulated by the text prompt, our single unified model is capable of performing various perception tasks. Crucially, our model is on par with or surpasses state-of-the-art specialized models on a variety of benchmarks despite being trained exclusively on synthetic data (i.e., without training on real-world or benchmark-specific data). We further highlight intriguing emergent properties of our model, which we attribute to the underlying diffusion-based video representation. For example, our model trained on videos with a single human in the scene generalizes to multiple humans and other object classes such as anthropomorphic characters and animals, a capability that has not been demonstrated in the past.

[28] Shared Representation for 3D Pose Estimation, Action Classification, and Progress Prediction from Tactile Signals cs.CVPDF

Isaac Han, Seoyoung Lee, Sangyeon Park, Ecehan Akan, Yiyue Luo

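TL;DR: This paper proposes SCOTTI, a shared convolutional Transformer model that simultaneously handles three tasks from foot tactile signals: 3D human pose estimation, action classification, and action completion progress prediction. The method learns a shared representation through multi-task learning, avoids the occlusion and privacy problems of vision-based methods, and is shown to outperform existing approaches on a newly collected dataset.

Details

Motivation: The aim is to avoid the occlusion and privacy problems that vision-based methods face in realistic environments, and to overcome the suboptimal performance of existing tactile methods that treat each task separately, by exploring unified multi-task learning from foot tactile signals.

Result: Experiments show that SCOTTI outperforms existing methods on all three tasks, with improved performance on a newly collected dataset covering 15 participants, 8 different activities, and 7 hours in total.

Insight: The contribution is the first use of foot tactile signals from custom wireless insole sensors for action progress prediction, together with a multi-task learning framework built on a shared representation, in which one unified model optimizes three related tasks at once so that they reinforce one another and improve overall performance.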

Abstract: Estimating human pose, classifying actions, and predicting movement progress are essential for human-robot interaction. While vision-based methods suffer from occlusion and privacy concerns in realistic environments, tactile sensing avoids these issues. However, prior tactile-based approaches handle each task separately, leading to suboptimal performance. In this study, we propose a Shared COnvolutional Transformer for Tactile Inference (SCOTTI) that learns a shared representation to simultaneously address three separate prediction tasks: 3D human pose estimation, action class categorization, and action completion progress estimation. To the best of our knowledge, this is the first work to explore action progress prediction using foot tactile signals from custom wireless insole sensors. This unified approach leverages the mutual benefits of multi-task learning, enabling the model to achieve improved performance across all three tasks compared to learning them independently. Experimental results demonstrate that SCOTTI outperforms existing approaches across all three tasks. Additionally, we introduce a novel dataset collected from 15 participants performing various activities and exercises, with 7 hours of total duration, across eight different activities.

[29] Good Scores, Bad Data: A Metric for Multimodal Coherence cs.CV | cs.AIPDF

Vasundra Srinivasan

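TL;DR: The paper proposes a new evaluation metric, the Multimodal Coherence Score (MCS), for assessing the quality of data fusion in multimodal AI systems independently of any downstream task model. MCS decomposes coherence into four dimensions (identity, spatial, semantic, and decision) and learns their weights through optimization. Experiments on Visual Genome and COCO show that MCS is more sensitive than downstream task accuracy alone at discriminating fusion quality.

Details

Motivation: Multimodal AI systems are currently evaluated mainly by downstream task accuracy, but high accuracy does not guarantee that the input data is internally coherent (for example, in visual question answering a model may still answer correctly when its inputs contradict each other). A metric independent of the downstream model is therefore needed to directly assess the intrinsic coherence of multimodal data fusion.

Result: The metric is evaluated on 1,000 Visual Genome images using DETR, CLIP, and ViLT, and validated on 150 COCO images. Across three fusion architectures, MCS discriminates fusion quality with higher sensitivity (Spearman rho = 0.093) than downstream task accuracy (Spearman rho = 0.071). Perturbation experiments confirm that each of the four dimensions responds independently to its own failure mode, with no cross-talk.

Insight: The main contribution is MCS, an interpretable metric of multimodal fusion quality that is independent of the downstream task. It decomposes the abstract notion of coherence into four concrete, quantifiable dimensions and learns their weights through unsupervised optimization, without human annotation. This provides a lightweight tool for diagnosing failures in multimodal systems (not just whether something broke, but what broke), with practical value for model development and debugging.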

Abstract: Multimodal AI systems are evaluated by downstream task accuracy, but high accuracy does not mean the underlying data is coherent. A model can score well on Visual Question Answering (VQA) while its inputs contradict each other. We introduce the Multimodal Coherence Score (MCS), a metric that evaluates fusion quality independent of any downstream model. MCS decomposes coherence into four dimensions, identity, spatial, semantic, and decision, with weights learned via Nelder-Mead optimization. We evaluate on 1,000 Visual Genome images using DETR, CLIP, and ViLT, and validate on 150 COCO images with no retraining. Across three fusion architectures, MCS discriminates quality with higher sensitivity than task accuracy alone (Spearman rho = 0.093 vs. 0.071). Perturbation experiments confirm each dimension responds independently to its failure mode with zero cross-talk. MCS is lightweight, requires no human annotation, and tells you not just that something broke, but what broke.
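
The abstract says MCS combines four dimension scores with learned weights but does not give the combination rule. The Python sketch below assumes the simplest reading, a normalized weighted average; that rule, the function name, and the example weights are all assumptions for illustration. In the paper the weights come from Nelder-Mead optimization, which is not reproduced here.

```python
from typing import Dict

DIMENSIONS = ("identity", "spatial", "semantic", "decision")

def coherence_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-dimension scores; weights are normalized to sum to 1."""
    total = sum(weights[d] for d in DIMENSIONS)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total

# Hypothetical per-dimension scores in [0, 1] for one fused sample.
sample = {"identity": 1.0, "spatial": 0.5, "semantic": 0.5, "decision": 0.0}
uniform = {d: 1.0 for d in DIMENSIONS}
overall = coherence_score(sample, uniform)
```

Keeping the per-dimension scores alongside the aggregate is what would let such a metric report which dimension failed, not only that the overall score dropped.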

[30] DiReCT: Disentangled Regularization of Contrastive Trajectories for Physics-Refined Video Generation cs.CV | cs.AIPDF

Abolfazl Meyarian, Amin Karimi Monsefi, Rajiv Ramnath, Ser-Nam Lim

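TL;DR: This paper proposes DiReCT, a lightweight post-training framework that addresses the lack of physical consistency in flow-matching video generation models. It regularizes contrastive trajectories in a disentangled way, decomposing the contrastive signal into two complementary scales, macro and micro, to separate semantics from physical behavior. This improves the physical commonsense of generated videos while preserving the visual quality of the pretrained model.

Details

Motivation: Flow-matching video generators produce temporally coherent, high-fidelity outputs but often violate basic physics, because their reconstruction objectives penalize only per-frame deviations and cannot distinguish physically consistent dynamics from impossible ones. In text-conditioned video generation, semantics and physical behavior are tightly coupled, so naive contrastive learning can work against the training objective.

Result: Applied to Wan 2.1-1.3B, DiReCT improves the physical commonsense score on the VideoPhy benchmark by 16.7% over the baseline and 11.3% over supervised fine-tuning (SFT), without increasing training time.

Insight: The contribution is to formalize the gradient conflict in contrastive learning and to propose disentangled macro- and micro-contrastive regularization: the macro-contrastive term samples negatives from semantically distant regions for global trajectory separation, while the micro-contrastive term constructs hard negatives by using an LLM to perturb a single axis of physical behavior. A velocity-space distributional regularizer is also introduced to prevent catastrophic forgetting.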

Abstract: Flow-matching video generators produce temporally coherent, high-fidelity outputs yet routinely violate elementary physics because their reconstruction objectives penalize per-frame deviations without distinguishing physically consistent dynamics from impossible ones. Contrastive flow matching offers a principled remedy by pushing apart velocity-field trajectories of differing conditions, but we identify a fundamental obstacle in the text-conditioned video setting: semantic-physics entanglement. Because natural-language prompts couple scene content with physical behavior, naive negative sampling draws conditions whose velocity fields largely overlap with the positive sample’s, causing the contrastive gradient to directly oppose the flow-matching objective. We formalize this gradient conflict, deriving a precise alignment condition that reveals when contrastive learning helps versus harms training. Guided by this analysis, we introduce DiReCT (Disentangled Regularization of Contrastive Trajectories), a lightweight post-training framework that decomposes the contrastive signal into two complementary scales: a macro-contrastive term that draws partition-exclusive negatives from semantically distant regions for interference-free global trajectory separation, and a micro-contrastive term that constructs hard negatives sharing full scene semantics with the positive sample but differing along a single, LLM-perturbed axis of physical behavior, spanning kinematics, forces, materials, interactions, and magnitudes. A velocity-space distributional regularizer helps to prevent catastrophic forgetting of pretrained visual quality. When applied to Wan 2.1-1.3B, our method improves the physical commonsense score on VideoPhy by 16.7% and 11.3% compared to the baseline and SFT, respectively, without increasing training time.

[31] Reinforcing Structured Chain-of-Thought for Video Understanding cs.CV | cs.AIPDF

Peiyao Wang, Haotian Xu, Noranart Vesdapunt, Rui Hou, Jingyi Zhang

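TL;DR: This paper proposes Summary-Driven Reinforcement Learning (SDRL), a novel single-stage reinforcement learning framework that strengthens structured chain-of-thought reasoning in multimodal large language models (MLLMs) for video understanding. By introducing a structured CoT format, Summarize -> Think -> Answer, and integrating two self-supervised mechanisms (CVK and DVR), the method targets the thinking drift, weak temporal understanding, reliance on costly SFT, and rigid reasoning paths of existing approaches.

Details

Motivation: The aim is to address thinking drift and weak temporal understanding in MLLMs on video understanding tasks, and to overcome the limited generalization and potential bias that arise because existing reinforcement learning methods depend on supervised fine-tuning (which requires costly chain-of-thought annotation and multi-stage training) and enforce fixed reasoning paths.

Result: The method achieves state-of-the-art performance on seven public video question answering (VideoQA) datasets.

Insight: The contribution is SDRL, a single-stage reinforcement learning framework that needs no supervised fine-tuning. Its core elements are the structured chain-of-thought format Summarize -> Think -> Answer and a GRPO objective that integrates two self-supervised mechanisms (CVK, which enforces consistency of visual knowledge, and DVR, which dynamically modulates reasoning diversity). This balances alignment and exploration, supervises both the final answer and the reasoning process, and improves the model's generalization and reasoning quality.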

Abstract: Multi-modal Large Language Models (MLLMs) show promise in video understanding. However, their reasoning often suffers from thinking drift and weak temporal comprehension, even when enhanced by Reinforcement Learning (RL) techniques like Group Relative Policy Optimization (GRPO). Moreover, existing RL methods usually depend on Supervised Fine-Tuning (SFT), which requires costly Chain-of-Thought (CoT) annotation and multi-stage training, and enforces fixed reasoning paths, limiting MLLMs’ ability to generalize and potentially inducing bias. To overcome these limitations, we introduce Summary-Driven Reinforcement Learning (SDRL), a novel single-stage RL framework that obviates the need for SFT by utilizing a Structured CoT format: Summarize -> Think -> Answer. SDRL introduces two self-supervised mechanisms integrated into the GRPO objective: 1) Consistency of Vision Knowledge (CVK) enforces factual grounding by reducing KL divergence among generated summaries; and 2) Dynamic Variety of Reasoning (DVR) promotes exploration by dynamically modulating thinking diversity based on group accuracy. This novel integration effectively balances alignment and exploration, supervising both the final answer and the reasoning process. Our method achieves state-of-the-art performance on seven public VideoQA datasets.
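
SDRL builds on GRPO, which, as commonly described, scores each sampled response relative to the other responses drawn for the same prompt. A minimal Python sketch of that group-relative advantage follows. It uses the population standard deviation and a small epsilon, both of which are implementation choices assumed here, and it does not model SDRL's CVK or DVR terms.

```python
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one video question; reward 1.0 means the answer was correct.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

When every response in a group gets the same reward, all advantages are zero and the group contributes no learning signal, which is one reason a mechanism keyed to group accuracy, as DVR is described, could be useful.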

[32] Collision-Aware Vision-Language Learning for End-to-End Driving with Multimodal Infraction Datasets cs.CV | cs.AI | cs.LGPDF

Alex Koran, Dimitrios Sinodinos, Hadi Hojjati, Takuya Nanri, Fangge Chen

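TL;DR: This paper proposes a collision-aware vision-language learning approach that develops a Video-Language-Augmented Anomaly Detector (VLAAD) and builds two multimodal collision datasets (CARLA-Collide and Real-Collide) to improve the safety of end-to-end autonomous driving systems. The detector can be integrated into existing models as a plug-in module and performs well on both simulated and real-world data.

Details

Motivation: End-to-end autonomous driving currently achieves low driving scores on the CARLA Leaderboard. The main bottleneck is the high infraction rate, especially collision-related infractions, yet collision-aware representation learning has received little attention. Existing simulator datasets lack multimodality and cover only simple scenarios, which limits a model's ability to predict collisions in complex road networks.

Result: In closed-loop simulation, integrating VLAAD into a pretrained TransFuser++ agent yields a 14.12% relative increase in driving score. In open-loop real-world evaluation, VLAAD has only 0.6B parameters yet outperforms a multi-billion-parameter vision-language model, with a 23.3% improvement in AUC.

Insight: Contributions include: 1) the VLAAD framework, based on Multiple Instance Learning (MIL), which predicts stable, temporally localized collision signals; 2) large-scale multimodal collision datasets (CARLA-Collide and Real-Collide) covering diverse road networks and real driving scenarios; 3) VLAAD as a lightweight plug-in module that seamlessly adds collision awareness to existing end-to-end driving models and improves generalization.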

Abstract: High infraction rates remain the primary bottleneck for end-to-end (E2E) autonomous driving, as evidenced by the low driving scores on the CARLA Leaderboard. Despite collision-related infractions being the dominant failure mode in closed-loop evaluations, collision-aware representation learning has received limited attention. To address this gap, we first develop a Video-Language-Augmented Anomaly Detector (VLAAD), leveraging a Multiple Instance Learning (MIL) formulation to obtain stable, temporally localized collision signals for proactive prediction. To transition these capabilities into closed-loop simulations, we must overcome the limitations of existing simulator datasets, which lack multimodality and are frequently restricted to simple intersection scenarios. Therefore, we introduce CARLA-Collide, a large-scale multimodal dataset capturing realistic collision events across highly diverse road networks. Trained on this diverse simulator data, VLAAD serves as a collision-aware plug-in module that can be seamlessly integrated into existing E2E driving models. By integrating our module into a pretrained TransFuser++ agent, we demonstrate a 14.12% relative increase in driving score with minimal fine-tuning. Beyond closed-loop evaluation, we further assess the generalization capability of VLAAD in an open-loop setting using real-world driving data. To support this analysis, we introduce Real-Collide, a multimodal dataset of diverse dashcam videos paired with semantically rich annotations for collision detection and prediction. On this benchmark, despite containing only 0.6B parameters, VLAAD outperforms a multi-billion-parameter vision-language model, achieving a 23.3% improvement in AUC.
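
In a Multiple Instance Learning formulation, a video is treated as a bag of segments with only a video-level label, and the bag score is pooled from segment scores. The Python sketch below uses max pooling, a common MIL choice that is assumed here for illustration; the abstract does not say which pooling VLAAD uses, and the function name is hypothetical.

```python
from typing import List, Tuple

def bag_score(segment_scores: List[float]) -> Tuple[float, int]:
    """Return (video-level score, index of the highest-scoring segment)."""
    if not segment_scores:
        raise ValueError("a video needs at least one segment")
    best = max(range(len(segment_scores)), key=segment_scores.__getitem__)
    return segment_scores[best], best

# Hypothetical per-segment collision scores for one clip.
video_score, peak_segment = bag_score([0.1, 0.7, 0.3])
```

The index of the peak segment is what gives a temporally localized signal: training needs only a video-level label, yet inference still points to when the risk is highest.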

[33] Low-Rank-Modulated Functa: Exploring the Latent Space of Implicit Neural Representations for Interpretable Ultrasound Video Analysis cs.CVPDF

Julia Wolleb, Cristiana Baloescu, Alicia Durrer, Hemant D. Tagare, Xenophon Papademetris

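TL;DR: This paper proposes Low-Rank-Modulated Functa (LRM-Functa), a new architecture for analyzing ultrasound video. Building on the Functa framework for implicit neural representations (INRs), it enforces a low-rank adaptation of the modulation vectors in the time-resolved latent space, yielding a clearly structured, interpretable latent space. In cardiac ultrasound, this latent space shows periodic trajectories that make temporal patterns easy to visualize, and it can be used to sample novel frames and to read out end-diastolic (ED) and end-systolic (ES) frames directly, with no additional training. LRM-Functa outperforms prior methods on unsupervised ED/ES frame detection, keeps competitive ejection-fraction prediction performance while compressing each frame to as low as rank k=2, and generalizes across cardiac and lung ultrasound tasks.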

Details

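Motivation: Existing Functa-based methods reconstruct images well, but the structure and interpretability of their latent spaces remain largely unexplored, particularly for ultrasound video analysis. The paper aims to make the latent space more interpretable and structured through a low-rank modulation architecture, so that temporal patterns in ultrasound video can be analyzed more effectively.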

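Result: On cardiac ultrasound data, LRM-Functa outperforms prior methods on unsupervised ED and ES frame detection and stays competitive on the downstream task of ejection-fraction prediction while compressing each video frame to as low as rank k=2. It also generalizes to out-of-distribution frame selection on a cardiac point-of-care dataset and to B-line classification in lung ultrasound.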

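Insight: The key idea is a low-rank adaptation mechanism that modulates Functa's latent vectors, forcing the latent space into structured, periodic trajectories and thereby improving the visualization and interpretability of temporal patterns. Viewed objectively, combining a low-rank constraint with implicit neural representations gives a compact, generalizable framework for ultrasound video analysis that applies directly to clinically relevant tasks such as ED/ES frame detection without additional training.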

Abstract: Implicit neural representations (INRs) have emerged as a powerful framework for continuous image representation learning. In Functa-based approaches, each image is encoded as a latent modulation vector that conditions a shared INR, enabling strong reconstruction performance. However, the structure and interpretability of the corresponding latent spaces remain largely unexplored. In this work, we investigate the latent space of Functa-based models for ultrasound videos and propose Low-Rank-Modulated Functa (LRM-Functa), a novel architecture that enforces a low-rank adaptation of modulation vectors in the time-resolved latent space. When applied to cardiac ultrasound, the resulting latent space exhibits clearly structured periodic trajectories, facilitating visualization and interpretability of temporal patterns. The latent space can be traversed to sample novel frames, revealing smooth transitions along the cardiac cycle, and enabling direct readout of end-diastolic (ED) and end-systolic (ES) frames without additional model training. We show that LRM-Functa outperforms prior methods in unsupervised ED and ES frame detection, while compressing each video frame to as low as rank k=2 without sacrificing competitive downstream performance on ejection fraction prediction. Evaluations on out-of-distribution frame selection in a cardiac point-of-care dataset, as well as on lung ultrasound for B-line classification, demonstrate the generalizability of our approach. Overall, LRM-Functa provides a compact, interpretable, and generalizable framework for ultrasound video analysis. The code is available at https://github.com/JuliaWolleb/LRM_Functa.
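
To make the idea of a low-rank, time-resolved modulation concrete, here is a minimal sketch. It assumes (this is not stated in the abstract) that each frame's modulation vector is a shared base vector plus a rank-k combination of shared basis directions, so that only k coefficients vary per frame. The names `low_rank_modulation` and `periodic_coeffs` are hypothetical, and the circular coefficients merely imitate a periodic trajectory for k=2; the authors' actual parameterization may differ.

```python
import math


def low_rank_modulation(base, basis, coeffs):
    """Return base + basis @ coeffs for a single frame.

    base:   list of D floats shared by all frames (assumed).
    basis:  D rows of k floats, the shared low-rank directions (assumed).
    coeffs: k per-frame coefficients, the only frame-specific values.
    """
    return [b + sum(u * c for u, c in zip(row, coeffs))
            for b, row in zip(base, basis)]


def periodic_coeffs(t, period):
    """Toy rank-2 coefficients that trace a circle once per period."""
    phase = 2.0 * math.pi * t / period
    return [math.cos(phase), math.sin(phase)]
```

With k=2, every frame is summarized by a point in the plane, which is why a periodic signal such as a cardiac cycle could show up as a closed curve in this kind of latent space.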

[34] BEVMAPMATCH: Multimodal BEV Neural Map Matching for Robust Re-Localization of Autonomous Vehicles cs.CV

Shounak Sural, Ragunathan Rajkumar

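TL;DR: BEVMapMatch is a framework for robust re-localization of autonomous vehicles in GNSS-denied or GNSS-degraded environments. It fuses lidar and camera data to generate multimodal bird's-eye-view (BEV) segmentations, uses a cross-attention-based search mechanism to retrieve candidate map patches from a known map for matching, and ultimately achieves accurate global localization.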

Details

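Motivation: Autonomous vehicles struggle to localize reliably where GNSS is unavailable or degraded, so alternative methods that do not depend on GNSS priors are needed.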

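Result: On re-localization in GNSS-denied and adverse-weather environments, BEVMapMatch substantially outperforms existing methods, reaching a Recall@1m of 39.8%, nearly twice that of the best-performing baseline.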

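Insight: The contribution is a context-aware method for generating multimodal BEV segmentations, combined with cross-attention-based retrieval for map matching; using multiple frames of BEV segmentation further improves accuracy, enabling robust re-localization without GNSS.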

Abstract: Localization in GNSS-denied and GNSS-degraded environments is a challenge for the safe widespread deployment of autonomous vehicles. Such GNSS-challenged environments require alternative methods for robust localization. In this work, we propose BEVMapMatch, a framework for robust vehicle re-localization on a known map without the need for GNSS priors. BEVMapMatch uses a context-aware lidar+camera fusion method to generate multimodal Bird’s Eye View (BEV) segmentations around the ego vehicle in both good and adverse weather conditions. Leveraging a search mechanism based on cross-attention, the generated BEV segmentation maps are then used for the retrieval of candidate map patches for map-matching purposes. Finally, BEVMapMatch uses the top retrieved candidate for finer alignment against the generated BEV segmentation, achieving accurate global localization without the need for GNSS. Multiple frames of generated BEV segmentation further improve localization accuracy. Extensive evaluations show that BEVMapMatch outperforms existing methods for re-localization in GNSS-denied and adverse environments, with a Recall@1m of 39.8%, being nearly twice as much as the best performing re-localization baseline. Our code and data will be made available at https://github.com/ssuralcmu/BEVMapMatch.git.
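
The Recall@1m figure can be read as follows, under the assumption (the abstract does not define the metric) that it is the fraction of queries whose estimated 2D position lies within 1 m of the ground truth. The function name `recall_at_threshold` is hypothetical and the sketch is not taken from the authors' code.

```python
import math


def recall_at_threshold(predicted, ground_truth, threshold_m=1.0):
    """Fraction of (x, y) estimates within threshold_m metres of the truth."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have equal length")
    if not predicted:
        raise ValueError("at least one query is required")
    hits = sum(
        1
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
        if math.hypot(px - gx, py - gy) <= threshold_m
    )
    return hits / len(predicted)
```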

[35] Neuro-Cognitive Reward Modeling for Human-Centered Autonomous Vehicle Control cs.CV

Zhuoli Zhuang, Yu-Cheng Chang, Yu-Kai Wang, Thomas Do, Chin-Teng Lin

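TL;DR: The paper proposes an electroencephalography (EEG)-guided decision-making framework that brings human cognitive insights into reinforcement learning (RL) to improve autonomous driving control. The authors collect EEG signals from 20 participants in a realistic driving simulator, analyze the event-related potentials (ERPs) evoked by sudden environmental changes, train a neural network to predict ERP strength from visual scene information, and integrate this cognitive information into the RL reward signal.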

Details

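Motivation: Computer vision has advanced autonomous driving, but training machines to drive in line with human expectations remains difficult. Human factors matter because humans have a sophisticated cognitive system that interprets scene information quickly and makes accurate decisions. Existing Reinforcement Learning with Human Feedback (RLHF) methods collect preference data by having people manually rank generated outputs, which is time-consuming and indirect, so a more direct way to incorporate human cognitive feedback is needed.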

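Result: Experiments show that the framework improves the collision-avoidance ability of the RL algorithm, highlighting the potential of neuro-cognitive feedback for enhancing autonomous driving systems.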

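Insight: The contribution is to use EEG signals, specifically ERPs, directly as human cognitive feedback without interrupting behavioral responses, aligning machine learning with human intent more naturally and efficiently. Predicting ERP strength with a neural network and folding it into the RL reward signal offers a novel, human-centered approach to reward modeling.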

Abstract: Recent advancements in computer vision have accelerated the development of autonomous driving. Despite these advancements, training machines to drive in a way that aligns with human expectations remains a significant challenge. Human factors are still essential, as humans possess a sophisticated cognitive system capable of rapidly interpreting scene information and making accurate decisions. Aligning machine with human intent has been explored with Reinforcement Learning with Human Feedback (RLHF). Conventional RLHF methods rely on collecting human preference data by manually ranking generated outputs, which is time-consuming and indirect. In this work, we propose an electroencephalography (EEG)-guided decision-making framework to incorporate human cognitive insights without behaviour response interruption into reinforcement learning (RL) for autonomous driving. We collected EEG signals from 20 participants in a realistic driving simulator and analyzed event-related potentials (ERP) in response to sudden environmental changes. Our proposed framework employs a neural network to predict the strength of ERP based on the cognitive information from visual scene information. Moreover, we explore the integration of such cognitive information into the reward signal of the RL algorithm. Experimental results show that our framework can improve the collision avoidance ability of the RL algorithm, highlighting the potential of neuro-cognitive feedback in enhancing autonomous driving systems. Our project page is: https://alex95gogo.github.io/Cognitive-Reward/.

[36] FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants cs.CV | cs.AI

Mahesh Bhosale, Abdul Wasi, Shantam Srivastava, Shifa Latif, Tianyu Luan

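TL;DR: This paper proposes FairLLaVA, a parameter-efficient fine-tuning method for large vision-language assistants, i.e., multimodal large language models (MLLMs), that mitigates uneven performance across demographic groups such as race and gender. By minimizing the mutual information between target attributes, it makes the model's representations invariant to demographic information, reducing group disparities without compromising overall performance.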

Details

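Motivation: Multimodal large language models (MLLMs) perform strongly in image-conditioned generation, but their performance varies across demographic groups. In safety-critical clinical settings this can produce unequal diagnostic narratives and erode trust in AI-assisted decision-making. Fairness research has so far concentrated on vision-only or language-only models, and its implications for MLLMs remain underexplored.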

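Result: Extensive experiments on large-scale chest radiology report generation and dermoscopy visual question answering benchmarks show that FairLLaVA consistently reduces inter-group disparities while improving equity-scaled clinical performance and natural language generation quality across diverse medical imaging modalities.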

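Insight: The contribution is an architecture-agnostic, lightweight plug-in method for parameter-efficient fine-tuning (based on low-rank adapters) that learns demographic-invariant representations through mutual-information regularization, giving an efficient and scalable route to fair visual instruction following.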

Abstract: While powerful in image-conditioned generation, multimodal large language models (MLLMs) can display uneven performance across demographic groups, highlighting fairness risks. In safety-critical clinical settings, such disparities risk producing unequal diagnostic narratives and eroding trust in AI-assisted decision-making. While fairness has been studied extensively in vision-only and language-only models, its impact on MLLMs remains largely underexplored. To address these biases, we introduce FairLLaVA, a parameter-efficient fine-tuning method that mitigates group disparities in visual instruction tuning without compromising overall performance. By minimizing the mutual information between target attributes, FairLLaVA regularizes the model’s representations to be demographic-invariant. The method can be incorporated as a lightweight plug-in, maintaining efficiency with low-rank adapter fine-tuning, and provides an architecture-agnostic approach to fair visual instruction following. Extensive experiments on large-scale chest radiology report generation and dermoscopy visual question answering benchmarks show that FairLLaVA consistently reduces inter-group disparities while improving both equity-scaled clinical performance and natural language generation quality across diverse medical imaging modalities. Code can be accessed at https://github.com/bhosalems/FairLLaVA.

[37] VLAgeBench: Benchmarking Large Vision-Language Models for Zero-Shot Human Age Estimation cs.CV | cs.AI

Rakib Hossain Sajib, Md Kishor Morol, Rajan Das Gupta, Mohammad Sakib Mahmood, Shuvra Smaran Das

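TL;DR: This paper introduces VLAgeBench, a benchmark for evaluating large vision-language models (LVLMs) on zero-shot facial age estimation. It evaluates GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 Vision under strict zero-shot conditions on the UTKFace and FG-NET datasets, finding that general-purpose LVLMs perform competitively with domain-specific models while showing performance disparities linked to image quality and demographic subgroups.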

Details

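Motivation: Traditional facial age estimation relies on large labeled datasets and domain-specific training. The paper explores whether LVLMs can perform the task in a zero-shot setting and systematically evaluates their practical performance and limitations.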

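Result: On the UTKFace and FG-NET benchmarks, evaluated with eight metrics (MAE, MSE, RMSE, MAPE, MBE, R², CCC, and ±5-year accuracy), general-purpose LVLMs achieve competitive performance in the zero-shot setting.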

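Insight: The contribution is the first comprehensive benchmark of LVLMs for strict zero-shot facial age estimation, demonstrating the emergent capability of general-purpose models on this task. Viewed objectively, it also exposes practical deployment challenges, including prompt sensitivity, interpretability, computational cost, and demographic fairness, and offers useful insight for fairness-aware multimodal inference.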

Abstract: Human age estimation from facial images represents a challenging computer vision task with significant applications in biometrics, healthcare, and human-computer interaction. While traditional deep learning approaches require extensive labeled datasets and domain-specific training, recent advances in large vision-language models (LVLMs) offer the potential for zero-shot age estimation. This study presents a comprehensive zero-shot evaluation of state-of-the-art Large Vision-Language Models (LVLMs) for facial age estimation, a task traditionally dominated by domain-specific convolutional networks and supervised learning. We assess the performance of GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 Vision on two benchmark datasets, UTKFace and FG-NET, without any fine-tuning or task-specific adaptation. Using eight evaluation metrics, including MAE, MSE, RMSE, MAPE, MBE, $R^2$, CCC, and $\pm$5-year accuracy, we demonstrate that general-purpose LVLMs can deliver competitive performance in zero-shot settings. Our findings highlight the emergent capabilities of LVLMs for accurate biometric age estimation and position these models as promising tools for real-world applications. Additionally, we highlight performance disparities linked to image quality and demographic subgroups, underscoring the need for fairness-aware multimodal inference. This work introduces a reproducible benchmark and positions LVLMs as promising tools for real-world applications in forensic science, healthcare monitoring, and human-computer interaction. The benchmark focuses on strict zero-shot inference without fine-tuning and highlights remaining challenges related to prompt sensitivity, interpretability, computational cost, and demographic fairness.
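
Several of the listed metrics can be computed with a few lines of standard-library Python. The sketch below uses textbook definitions and makes two assumptions that the abstract does not specify: the mean bias error (MBE) is taken as predicted minus true age, and the concordance correlation coefficient (CCC) uses population (divide-by-n) variances. The function name `age_metrics` is hypothetical.

```python
import math


def age_metrics(y_true, y_pred, tolerance=5.0):
    """Return MAE, MSE, RMSE, MBE, within-tolerance accuracy, and CCC."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]  # signed error per face
    mse = sum(e * e for e in errors) / n
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    # CCC is undefined when both series are constant and equal (zero denominator).
    ccc = 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mbe": sum(errors) / n,
        "acc_within_tolerance": sum(1 for e in errors if abs(e) <= tolerance) / n,
        "ccc": ccc,
    }
```

Unlike Pearson correlation, CCC penalizes a systematic offset between predicted and true ages, which is why it is often reported alongside MBE.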

Danny Abraham, Nikhil Kamalkumar Advani, Arun Das, Nikil Dutt

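TL;DR: This paper proposes GeoReFormer, a geometry-aware Transformer architecture for 3D lane segment detection and topology reasoning in autonomous driving. It explicitly encodes the geometric and topological structure of lane segments in the decoder through data-driven geometric priors for query initialization, bounded coordinate-space refinement, and gated topology propagation.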

Details

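Motivation: Existing Transformer-based lane detection methods largely inherit decoder designs developed for compact object detection. They do not explicitly encode the geometric and topological structure of lane segments as continuous polylines in a directed graph, so query initialization and refinement lack structural constraints.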

Result: On the OpenLane-V2 benchmark, GeoReFormer achieves state-of-the-art performance with 34.5% mAP and improves topology consistency over strong Transformer baselines.

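Insight: The contribution is to embed geometric and topological inductive biases directly in the Transformer decoder: structured query initialization from data-driven priors, bounded coordinate-space refinement for stable polyline deformation, and gated topology propagation that selectively integrates relational context. This offers a reusable architectural pattern for structured geometric data.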

Abstract: Accurate 3D lane segment detection and topology reasoning are critical for structured online map construction in autonomous driving. Recent transformer-based approaches formulate this task as query-based set prediction, yet largely inherit decoder designs originally developed for compact object detection. However, lane segments are continuous polylines embedded in directed graphs, and generic query initialization and unconstrained refinement do not explicitly encode this geometric and relational structure. We propose GeoReFormer (Geometry-aware Refinement Transformer), a unified query-based architecture that embeds geometry- and topology-aware inductive biases directly within the transformer decoder. GeoReFormer introduces data-driven geometric priors for structured query initialization, bounded coordinate-space refinement for stable polyline deformation, and per-query gated topology propagation to selectively integrate relational context. On the OpenLane-V2 benchmark, GeoReFormer achieves state-of-the-art performance with 34.5% mAP while improving topology consistency over strong transformer baselines, demonstrating the utility of explicit geometric and relational structure encoding.
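
One way to picture bounded coordinate-space refinement is to squash each raw decoder offset so that a polyline point can move by at most a fixed step per refinement layer. The sketch below assumes a tanh squashing and a `max_step` parameter; both are illustrative choices, since the abstract does not say how GeoReFormer bounds its updates, and `bounded_refine` is a hypothetical name.

```python
import math


def bounded_refine(points, raw_offsets, max_step=0.5):
    """Shift each 3D point by max_step * tanh(raw offset), per coordinate.

    points:      list of (x, y, z) polyline vertices.
    raw_offsets: list of unbounded (dx, dy, dz) values, e.g. from a decoder head.
    max_step:    largest allowed displacement per coordinate (assumed parameter).
    """
    return [
        tuple(c + max_step * math.tanh(d) for c, d in zip(point, offset))
        for point, offset in zip(points, raw_offsets)
    ]
```

Because tanh saturates at ±1, even a very large raw offset cannot move a vertex by more than `max_step`, which keeps the polyline from collapsing or jumping between refinement steps.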

[39] Knowledge is Power: Advancing Few-shot Action Recognition with Multimodal Semantics from MLLMs cs.CV

Jiazheng Xing, Chao Xu, Hangjie Yuan, Mengmeng Wang, Jun Dan

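TL;DR: This paper proposes FSAR-LLaVA, the first method to use multimodal large language models (such as Video-LLaVA) as a multimodal knowledge base to enhance few-shot action recognition directly and end to end. It extracts spatiotemporally and semantically enriched features, builds composite task-oriented prototypes, and introduces a training-free multimodal prototype matching metric, achieving strong performance across tasks with very few trainable parameters.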

Details

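Motivation: Early uses of MLLMs in few-shot action recognition mostly generate captions, forming a suboptimal feature-to-caption-to-feature pipeline, and perform metric learning only in the visual space. The paper aims to exploit the multimodal knowledge of MLLMs directly, addressing insufficient feature representations and the distribution gap between meta-train and meta-test sets.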

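Result: Extensive experiments on multiple few-shot action recognition tasks show that the method outperforms existing methods while requiring very few trainable parameters, reaching state-of-the-art (SOTA) performance.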

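Insight: The contributions are: the first end-to-end use of MLLMs as a multimodal knowledge base to enhance FSAR directly; a Multimodal Feature-Enhanced Module that decouples and enhances visual and textual features; Composite Task-Oriented Prototype Construction that bridges the distribution gap between meta-train and meta-test sets; and a training-free Multimodal Prototype Matching Metric that adaptively selects the most decisive cues and makes efficient use of the decoupled features.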

Abstract: Multimodal Large Language Models (MLLMs) have propelled the field of few-shot action recognition (FSAR). However, preliminary explorations in this area primarily focus on generating captions to form a suboptimal feature->caption->feature pipeline and adopt metric learning solely within the visual space. In this paper, we propose FSAR-LLaVA, the first end-to-end method to leverage MLLMs (such as Video-LLaVA) as a multimodal knowledge base for directly enhancing FSAR. First, at the feature level, we leverage the MLLM’s multimodal decoder to extract spatiotemporally and semantically enriched representations, which are then decoupled and enhanced by our Multimodal Feature-Enhanced Module into distinct visual and textual features that fully exploit their semantic knowledge for FSAR. Next, we leverage the versatility of MLLMs to craft input prompts that flexibly adapt to diverse scenarios, and use their aligned outputs to drive our designed Composite Task-Oriented Prototype Construction, effectively bridging the distribution gap between meta-train and meta-test sets. Finally, to enable multimodal features to guide metric learning jointly, we introduce a training-free Multimodal Prototype Matching Metric that adaptively selects the most decisive cues and efficiently leverages the decoupled feature representations produced by MLLMs. Extensive experiments demonstrate superior performance across various tasks with minimal trainable parameters.

[40] Face2Parts: Exploring Coarse-to-Fine Inter-Regional Facial Dependencies for Generalized Deepfake Detection cs.CV

Kutub Uddin, Nusrat Tasnim, Byung Tae Oh

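TL;DR: This paper proposes Face2Parts, a hybrid approach to generalized deepfake detection. Built on hierarchical feature representation (HFR), it extracts features separately from the frame, the face, and key facial regions (lips, eyes, and nose) to explore coarse-to-fine inter-regional relationships, and captures the interdependencies among facial regions with a channel-attention mechanism and deep triplet learning.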

Details

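Motivation: Existing deepfake detection methods each focus on a particular region (frame, face, lips, eyes, or nose) to detect forgery traces, and the diversity of manipulations limits them. The paper aims to combine the complementary strengths of these regions by exploring coarse-to-fine inter-regional dependencies.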

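Result: The method is evaluated on several benchmark deepfake datasets, with average AUCs of 98.42% on FF++, 79.80% on CDF1, 85.34% on CDF2, 89.41% on DFD, 84.07% on DFDC, 95.62% on DTIM, 80.76% on PDD, and 100% on WLDR. The results indicate that it generalizes well in cross-dataset and cross-manipulation settings and outperforms existing methods.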

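Insight: The contribution is a hierarchical feature representation framework that models dependencies among facial regions through coarse-to-fine (frame → face → key regions) multi-level feature extraction and channel attention, with deep triplet learning to make features more discriminative, improving the generalization and robustness of deepfake detection.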

Abstract: Multimedia data, particularly images and videos, is integral to various applications, including surveillance, visual interaction, biometrics, evidence gathering, and advertising. However, amateur or skilled counterfeiters can simulate them to create deepfakes, often for slanderous motives. To address this challenge, several forensic methods have been developed to ensure the authenticity of the content. The effectiveness of these methods depends on their focus, with challenges arising from the diverse nature of manipulations. In this article, we analyze existing forensic methods and observe that each method has unique strengths in detecting deepfake traces by focusing on specific facial regions, such as the frame, face, lips, eyes, or nose. Considering these insights, we propose a novel hybrid approach called Face2Parts based on hierarchical feature representation ($HFR$) that takes advantage of coarse-to-fine information to improve deepfake detection. The proposed method involves extracting features from the frame, face, and key facial regions (i.e., lips, eyes, and nose) separately to explore the coarse-to-fine relationships. This approach enables us to capture inter-dependencies among facial regions using a channel-attention mechanism and deep triplet learning. We evaluated the proposed method on benchmark deepfake datasets in both intra-, inter-dataset, and inter-manipulation settings. The proposed method achieves an average AUC of 98.42% on FF++, 79.80% on CDF1, 85.34% on CDF2, 89.41% on DFD, 84.07% on DFDC, 95.62% on DTIM, 80.76% on PDD, and 100% on WLDR, respectively. The results demonstrate that our approach generalizes effectively and achieves promising performance to outperform the existing methods.
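
For readers unfamiliar with triplet learning, the sketch below shows the standard margin-based triplet loss on plain feature vectors. It assumes Euclidean distance and a margin of 1.0; the abstract does not state which distance, margin, or triplet-mining scheme Face2Parts uses, so this is a generic illustration and `triplet_loss` is a hypothetical helper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

The loss is zero once the negative (for instance, a fake sample when the anchor is real) is farther from the anchor than the positive by at least the margin, so training pushes the two classes apart in feature space.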

[41] Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives cs.CV

Daiqiang Li, Zihao Pan, Zeyu Zhang, Ronghao Chen, Huacan Wang

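TL;DR: This paper presents an empirical study of token pruning for historical screenshots in GUI visual agents and draws three findings: GUI screenshots have a distinctive foreground-background semantic composition, and background regions effectively capture interface-state transitions; random pruning has an inherent advantage in preserving spatial structure; and GUI agents show a recency effect similar to human cognition, so allocating more tokens to recent screenshots can substantially cut computational cost.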

Details

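Motivation: High-resolution GUI screenshots produce large numbers of visual tokens, making it computationally expensive to keep the complete history. The paper aims to give practical guidance for designing efficient GUI visual agents.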

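Result: The empirical analysis finds that edge-based foreground-background separation, random pruning, and recency-based token budget allocation deliver better performance under the same computational budget or substantially lower computational cost.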

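Insight: The contribution is to rethink GUI screenshot token pruning from semantic, spatial, and temporal perspectives: it challenges the common assumption that background regions carry little semantic value, shows the advantage of random pruning in preserving spatial structure, and proposes exploiting the recency effect to allocate tokens.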

Abstract: In recent years, GUI visual agents built upon Multimodal Large Language Models (MLLMs) have demonstrated strong potential in navigation tasks. However, high-resolution GUI screenshots produce a large number of visual tokens, making the direct preservation of complete historical information computationally expensive. In this paper, we conduct an empirical study on token pruning for historical screenshots in GUI scenarios and distill three practical insights that are crucial for designing effective pruning strategies. First, we observe that GUI screenshots exhibit a distinctive foreground-background semantic composition. To probe this property, we apply a simple edge-based separation to partition screenshots into foreground and background regions. Surprisingly, we find that, contrary to the common assumption that background areas have little semantic value, they effectively capture interface-state transitions, thereby providing auxiliary cues for GUI reasoning. Second, compared with carefully designed pruning strategies, random pruning possesses an inherent advantage in preserving spatial structure, enabling better performance under the same computational budget. Finally, we observe that GUI Agents exhibit a recency effect similar to human cognition: by allocating larger token budgets to more recent screenshots and heavily compressing distant ones, we can significantly reduce computational cost while maintaining nearly unchanged performance. These findings offer new insights and practical guidance for the design of efficient GUI visual agents.
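
The recency and random-pruning findings can be combined in a small sketch. It assumes a geometric decay of the per-screenshot token budget and uniform random sampling of tokens; the decay rule, the `decay` value, and the function names are illustrative assumptions, not the schedule used in the paper.

```python
import random


def recency_budgets(num_screenshots, total_budget, decay=0.5):
    """Split total_budget so newer screenshots (higher index) get more tokens.

    The newest screenshot has weight 1, the one before it `decay`, and so on.
    Budgets are floored, so a few tokens of the total may go unused.
    """
    weights = [decay ** (num_screenshots - 1 - i) for i in range(num_screenshots)]
    scale = total_budget / sum(weights)
    return [int(w * scale) for w in weights]


def random_prune(tokens, keep, seed=0):
    """Keep `keep` tokens chosen uniformly at random, preserving their order."""
    rng = random.Random(seed)
    keep = min(keep, len(tokens))
    kept_indices = sorted(rng.sample(range(len(tokens)), keep))
    return [tokens[i] for i in kept_indices]
```

Sorting the sampled indices keeps the surviving tokens in their original raster order, so the coarse spatial layout of the screenshot is still visible to the model after pruning.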

[42] Seeing Like Radiologists: Context- and Gaze-Guided Vision-Language Pretraining for Chest X-rays cs.CV | cs.AI

Kang Liu, Zhuoqi Ma, Siyu Liang, Yunan Li, Xiyue Gao

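TL;DR: This paper proposes CoGaze, a context- and gaze-guided vision-language pretraining framework for chest X-rays. A context-infused vision encoder models how radiologists integrate clinical context (such as patient history and symptoms) into their reasoning, and a multi-level supervision paradigm combines hybrid-positive contrastive learning, disease-aware cross-modal representation learning, and radiologists' gaze as probabilistic priors that guide attention, so that the model better captures the diagnostic workflow.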

Details

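Motivation: Existing medical vision-language pretraining models struggle to capture the real diagnostic workflow: they typically treat radiographs as context-agnostic images and overlook radiologists' gaze, a crucial cue for visual reasoning. This hinders the modeling of disease-specific patterns and weakens cross-modal alignment.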

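Result: Extensive experiments across multiple tasks show that CoGaze consistently outperforms state-of-the-art methods, with gains of up to +2.0% CheXbertF1 and +1.2% BLEU2 for free-text and structured report generation, +23.2% AUROC for zero-shot classification, and +12.2% Precision@1 for image-text retrieval.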

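Insight: The contribution is to integrate clinical context and radiologists' gaze explicitly into the pretraining framework. The context-infused encoder and multi-level supervision, especially the use of gaze as a probabilistic prior to guide visual attention, bring the model closer to the real diagnostic reasoning process and markedly improve performance on downstream tasks such as report generation, classification, and retrieval.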

Abstract: Despite recent advances in medical vision-language pretraining, existing models still struggle to capture the diagnostic workflow: radiographs are typically treated as context-agnostic images, while radiologists’ gaze – a crucial cue for visual reasoning – remains largely underexplored by existing methods. These limitations hinder the modeling of disease-specific patterns and weaken cross-modal alignment. To bridge this gap, we introduce CoGaze, a Context- and Gaze-guided vision-language pretraining framework for chest X-rays. We first propose a context-infused vision encoder that models how radiologists integrate clinical context – including patient history, symptoms, and diagnostic intent – to guide diagnostic reasoning. We then present a multi-level supervision paradigm that (1) enforces intra- and inter-modal semantic alignment through hybrid-positive contrastive learning, (2) injects diagnostic priors via disease-aware cross-modal representation learning, and (3) leverages radiologists’ gaze as probabilistic priors to guide attention toward diagnostically salient regions. Extensive experiments demonstrate that CoGaze consistently outperforms state-of-the-art methods across diverse tasks, achieving up to +2.0% CheXbertF1 and +1.2% BLEU2 for free-text and structured report generation, +23.2% AUROC for zero-shot classification, and +12.2% Precision@1 for image-text retrieval. Code is available at https://github.com/mk-runner/CoGaze.

[43] Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification cs.CV | cs.AI

Zizhao Chen, Ping Wei, Ziyang Ren, Huan Li, Xiangru Yin

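TL;DR: This paper proposes MaLSF (Mask-aware Local Semantic Fusion), a multimodal media verification framework that addresses feature dilution in detecting sophisticated multimodal misinformation. It uses mask-label pairs as semantic anchors linking pixels and words, and its Bidirectional Cross-modal Verification (BCV) and Hierarchical Semantic Aggregation (HSA) modules actively identify local semantic inconsistencies to improve detection.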

Details

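Motivation: Current multimodal verification methods rely on passive holistic fusion, which tends to average out subtle local semantic inconsistencies through feature dilution and therefore struggles with sophisticated misinformation. The paper aims for active, bidirectional verification that mimics human cognitive cross-referencing, so that conflicts in multimodal content can be located more precisely.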

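Result: MaLSF achieves state-of-the-art (SOTA) performance on DGM4 and on multimodal fake news detection, and extensive ablation studies and visualizations verify its effectiveness and interpretability.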

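Insight: The contributions are: (1) mask-label pairs used as semantic anchors to bridge visual and textual information; (2) a Bidirectional Cross-modal Verification (BCV) module that explicitly pinpoints conflicts using parallel query streams (Text-as-Query and Image-as-Query); (3) a Hierarchical Semantic Aggregation (HSA) module that aggregates multi-granularity conflict signals for task-specific reasoning. Together they move from passive fusion to active verification and strengthen the detection of local inconsistencies.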

Abstract: As multimodal misinformation becomes more sophisticated, its detection and grounding are crucial. However, current multimodal verification methods, relying on passive holistic fusion, struggle with sophisticated misinformation. Due to ‘feature dilution,’ global alignments tend to average out subtle local semantic inconsistencies, effectively masking the very conflicts they are designed to find. We introduce MaLSF (Mask-aware Local Semantic Fusion), a novel framework that shifts the paradigm to active, bidirectional verification, mimicking human cognitive cross-referencing. MaLSF utilizes mask-label pairs as semantic anchors to bridge pixels and words. Its core mechanism features two innovations: 1) a Bidirectional Cross-modal Verification (BCV) module that acts as an interrogator, using parallel query streams (Text-as-Query and Image-as-Query) to explicitly pinpoint conflicts; and 2) a Hierarchical Semantic Aggregation (HSA) module that intelligently aggregates these multi-granularity conflict signals for task-specific reasoning. In addition, to extract fine-grained mask-label pairs, we introduce a set of diverse mask-label pair extraction parsers. MaLSF achieves state-of-the-art performance on both the DGM4 and multimodal fake news detection tasks. Extensive ablation studies and visualization results further verify its effectiveness and interpretability.

[44] Pioneering Perceptual Video Fluency Assessment: A Novel Task with Benchmark Dataset and Baseline cs.CV

Qizhi Xie, Kun Yuan, Yunpeng Qu, Ming Sun, Chao Zhou

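TL;DR: This paper is the first to propose Video Fluency Assessment (VFA) as a standalone perceptual task focused on the temporal dimension of video, such as motion consistency and frame continuity. The authors build FluVid, the first fluency-oriented dataset, with 4,606 in-the-wild videos, and develop a benchmark of 23 methods. They also propose a baseline model, FluNet, which uses temporal permuted self-attention (T-PSA) to enrich input fluency information and strengthen long-range inter-frame interactions, achieving state-of-the-art performance.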

Details

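Motivation: Existing video quality assessment (VQA) methods underrepresent video fluency (for example motion consistency and frame continuity), which limits their applicability, so VFA needs to be addressed as a standalone task.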

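Result: On the FluVid dataset, the proposed FluNet achieves state-of-the-art (SOTA) performance. The benchmark covers 23 methods and is the most comprehensive VFA evaluation to date.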

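Insight: The contributions are establishing VFA as a standalone perceptual task, building FluVid, the first fluency-oriented dataset, with a balanced fluency distribution and scoring criteria, and proposing FluNet with T-PSA to improve temporal information processing. Viewed objectively, this gives the community a roadmap and benchmark resources for exploring VFA solutions.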

Abstract: Accurately estimating humans’ subjective feedback on video fluency, e.g., motion consistency and frame continuity, is crucial for various applications like streaming and gaming. Yet, it has long been overlooked, as prior arts have focused on solving it in the video quality assessment (VQA) task, merely as a sub-dimension of overall quality. In this work, we conduct pilot experiments and reveal that current VQA predictions largely underrepresent fluency, thereby limiting their applicability. To this end, we pioneer Video Fluency Assessment (VFA) as a standalone perceptual task focused on the temporal dimension. To advance VFA research, 1) we construct a fluency-oriented dataset, FluVid, comprising 4,606 in-the-wild videos with balanced fluency distribution, featuring the first-ever scoring criteria and human study for VFA. 2) We develop a large-scale benchmark of 23 methods, the most comprehensive one thus far on FluVid, gathering insights for VFA-tailored model designs. 3) We propose a baseline model called FluNet, which deploys temporal permuted self-attention (T-PSA) to enrich input fluency information and enhance long-range inter-frame interactions. Our work not only achieves state-of-the-art performance but, more importantly, offers the community a roadmap to explore solutions for VFA.

[45] Finding Distributed Object-Centric Properties in Self-Supervised Transformers cs.CV | cs.AI | cs.CL | cs.LG | cs.MM

Samyak Rawlekar, Amitabh Swain, Yujun Cai, Yiwei Wang, Ming-Hsuan Yang

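TL;DR: This paper analyzes how object-centric properties are distributed in self-supervised Vision Transformers such as DINO. It finds that these properties are not confined to the [CLS] token attention maps of the final layer but are spread across the patch-level attention components (query, key, and value) of all layers. Based on this, the authors propose Object-DINO, a training-free method that clusters attention heads across all layers to extract the distributed object-centric information, and apply it to unsupervised object discovery and to visual grounding for multimodal large language models.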

Details

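Motivation: Self-supervised ViTs such as DINO discover objects through the final-layer [CLS] token attention maps, but spurious activations often lead to poor localization. The cause is that the [CLS] token, trained on an image-level objective, summarizes the whole image and dilutes the object-centric information present in patch-level interactions.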

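Result: On unsupervised object discovery, Object-DINO improves CorLoc by 3.6 to 12.4 points. In multimodal large language models, it mitigates object hallucination by providing visual grounding. Both results indicate that using the distributed object-centric information improves downstream tasks without additional training.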

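Insight: The contribution is to show that object-centric properties are encoded in the similarity maps of all attention components (q, k, v) and distributed across the network's layers rather than confined to the final layer. Object-DINO clusters attention heads across layers and automatically identifies the object-centric cluster, extracting this information without training; the approach may be useful for improving the interpretability of vision models and downstream performance.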

Abstract: Self-supervised Vision Transformers (ViTs) like DINO show an emergent ability to discover objects, typically observed in [CLS] token attention maps of the final layer. However, these maps often contain spurious activations resulting in poor localization of objects. This is because the [CLS] token, trained on an image-level objective, summarizes the entire image instead of focusing on objects. This aggregation dilutes the object-centric information existing in the local, patch-level interactions. We analyze this by computing inter-patch similarity using patch-level attention components (query, key, and value) across all layers. We find that: (1) Object-centric properties are encoded in the similarity maps derived from all three components ($q, k, v$), unlike prior work that uses only key features or the [CLS] token. (2) This object-centric information is distributed across the network, not just confined to the final layer. Based on these insights, we introduce Object-DINO, a training-free method that extracts this distributed object-centric information. Object-DINO clusters attention heads across all layers based on the similarities of their patches and automatically identifies the object-centric cluster corresponding to all objects. We demonstrate Object-DINO’s effectiveness on two applications: enhancing unsupervised object discovery (+3.6 to +12.4 CorLoc gains) and mitigating object hallucination in Multimodal Large Language Models by providing visual grounding. Our results demonstrate that using this distributed object-centric information improves downstream tasks without additional training.
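
The inter-patch similarity analysis can be pictured with a minimal sketch. It assumes cosine similarity between per-patch feature vectors taken from a single attention component (query, key, or value) of a single head; the abstract does not specify the similarity measure, and `cosine_similarity_map` is a hypothetical helper rather than part of Object-DINO.

```python
import math


def cosine_similarity_map(features):
    """Return the N x N cosine similarity matrix for N patch feature vectors.

    features: list of N equal-length lists, e.g. the key vectors of one head.
    Zero-norm vectors are given similarity 0.0 to avoid division by zero.
    """
    norms = [math.sqrt(sum(v * v for v in f)) for f in features]
    sim = []
    for fi, ni in zip(features, norms):
        row = []
        for fj, nj in zip(features, norms):
            if ni == 0.0 or nj == 0.0:
                row.append(0.0)
            else:
                row.append(sum(a * b for a, b in zip(fi, fj)) / (ni * nj))
        sim.append(row)
    return sim
```

Patches belonging to the same object would be expected to have high mutual similarity in an object-centric head, so maps like this give a per-head signal that a clustering step could then group.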

[46] MuDD: A Multimodal Deception Detection Dataset and GSR-Guided Progressive Distillation for Non-Contact Deception Detection cs.CV | cs.AI

Peiyuan Jiang, Yao Liu, Yanglei Gan, Jiaye Yang, Lu Liu

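TL;DR: This paper introduces MuDD, a multimodal dataset for non-contact deception detection, and GPD, a galvanic skin response (GSR)-guided progressive distillation framework. GPD uses cross-modal knowledge distillation to transfer stable deception-related knowledge from GSR to non-contact modalities such as video and audio, improving non-contact deception detection.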

Details

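Motivation: Non-contact automatic deception detection is hard because visual and auditory deception cues lack stable cross-subject patterns. GSR, a physiological signal, provides more reliable cues, but the lack of a suitable dataset and the large modality mismatch have hindered its use.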

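Result: Experiments on MuDD show that GPD outperforms existing methods on both deception detection and concealed-digit identification, reaching state-of-the-art performance.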

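Insight: The contributions are the large-scale multimodal deception detection dataset MuDD and the GPD framework, which combines progressive feature-level and digit-level distillation with dynamic routing to handle the modality mismatch between GSR and non-contact signals adaptively, giving more stable cross-modal knowledge transfer.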

Abstract: Non-contact automatic deception detection remains challenging because visual and auditory deception cues often lack stable cross-subject patterns. In contrast, galvanic skin response (GSR) provides more reliable physiological cues and has been widely used in contact-based deception detection. In this work, we leverage stable deception-related knowledge in GSR to guide representation learning in non-contact modalities through cross-modal knowledge distillation. A key obstacle, however, is the lack of a suitable dataset for this setting. To address this, we introduce MuDD, a large-scale Multimodal Deception Detection dataset containing recordings from 130 participants over 690 minutes. In addition to video, audio, and GSR, MuDD also provides Photoplethysmography, heart rate, and personality traits, supporting broader scientific studies of deception. Based on this dataset, we propose GSR-guided Progressive Distillation (GPD), a cross-modal distillation framework for mitigating the negative transfer caused by the large modality mismatch between GSR and non-contact signals. The core innovation of GPD is the integration of progressive feature-level and digit-level distillation with dynamic routing, which allows the model to adaptively determine how teacher knowledge should be transferred during training, leading to more stable cross-modal knowledge transfer. Extensive experiments and visualizations show that GPD outperforms existing methods and achieves state-of-the-art performance on both deception detection and concealed-digit identification.

[47] PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery cs.CV

Elkhan Ismayilzada, Yufei Zhang, Zijun Cui

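TL;DR: This paper proposes PAD-Hand, a physics-aware conditional diffusion framework for recovering hand motion from images. Built on a MeshCNN-Transformer backbone, it formulates Euler-Lagrange dynamics for articulated hands, treats the dynamic residuals as virtual observables, and uses a last-layer Laplace approximation to estimate per-joint, per-time-step physics variances. It thereby refines noisy pose sequences into physically plausible hand motion while assessing the physics consistency of the motion estimates.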

Details

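Motivation: Existing image-based hand reconstruction methods give accurate single-frame estimates but often lack physics consistency and cannot quantify how confidently the motion satisfies physical laws. The paper addresses this by introducing physics constraints that improve the physical plausibility of motion sequences and provide interpretable variance estimates.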

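Result: Experiments on two well-known hand datasets show consistent gains over strong image-based initializations and competitive video-based methods. Qualitative results confirm that the variance estimates align with the physical plausibility of motion in image-based estimates.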

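Insight: The contribution is to integrate physics-based dynamic residuals into the diffusion framework as virtual observables rather than forcing them to zero, incorporating physics constraints more effectively. The last-layer Laplace approximation also produces interpretable per-joint, per-time-step variance maps that quantify physics consistency, adding a new dimension of confidence assessment to motion recovery.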

Abstract: Significant advancements made in reconstructing hands from images have delivered accurate single-frame estimates, yet they often lack physics consistency and provide no notion of how confidently the motion satisfies physics. In this paper, we propose a novel physics-aware conditional diffusion framework that refines noisy pose sequences into physically plausible hand motion while estimating the physics variance in motion estimates. Building on a MeshCNN-Transformer backbone, we formulate Euler-Lagrange dynamics for articulated hands. Unlike prior works that enforce zero residuals, we treat the resulting dynamic residuals as virtual observables to more effectively integrate physics. Through a last-layer Laplace approximation, our method produces per-joint, per-time variances that measure physics consistency and offers interpretable variance maps indicating where physical consistency weakens. Experiments on two well-known hand datasets show consistent gains over strong image-based initializations and competitive video-based methods. Qualitative results confirm that our variance estimations are aligned with the physical plausibility of the motion in image-based estimates.

[48] MUST: Modality-Specific Representation-Aware Transformer for Diffusion-Enhanced Survival Prediction with Missing Modality cs.CV | cs.LG

Kyungwon Kim, Dosik Hwang

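TL;DR: This paper proposes MUST (Modality-Specific representation-aware Transformer), a new framework that handles the missing-modality problem common in multimodal medical data in order to improve survival prediction accuracy. Through algebraic constraints in a learned low-rank shared subspace, the framework explicitly decomposes each modality's representation into modality-specific and cross-modal contextualized components, which makes it possible to identify precisely what information is lost when a modality is absent. For truly modality-specific information that cannot be inferred from the available modalities, MUST uses conditional latent diffusion models to generate high-quality representations. Experiments on five TCGA cancer datasets show that MUST achieves state-of-the-art performance with complete data and maintains robust predictions when the pathology or genomics modality is missing, with clinically acceptable inference latency.

Details

Motivation: Modalities in multimodal medical data are frequently missing because of cost, technical limitations, or retrospective data availability. Existing methods lack explicit modeling of each modality's unique contribution and cannot distinguish modality-specific information from information derivable from other modalities.

Result: Extensive experiments on five TCGA cancer datasets show that MUST achieves state-of-the-art (SOTA) performance with complete data while maintaining robust predictions when pathology or genomics is missing, with inference latency that meets clinical requirements.

Insight: The contributions are the explicit decomposition of modality representations into modality-specific and cross-modal components through algebraic constraints in a low-rank shared subspace, and the use of conditional latent diffusion models to generate the missing modality-specific information. This offers a new way to handle missing modalities, one that emphasizes modeling each modality's unique contribution and high-quality generation.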

Abstract: Accurate survival prediction from multimodal medical data is essential for precision oncology, yet clinical deployment faces a persistent challenge: modalities are frequently incomplete due to cost constraints, technical limitations, or retrospective data availability. While recent methods attempt to address missing modalities through feature alignment or joint distribution learning, they fundamentally lack explicit modeling of the unique contributions of each modality as opposed to the information derivable from other modalities. We propose MUST (Modality-Specific representation-aware Transformer), a novel framework that explicitly decomposes each modality’s representation into modality-specific and cross-modal contextualized components through algebraic constraints in a learned low-rank shared subspace. This decomposition enables precise identification of what information is lost when a modality is absent. For the truly modality-specific information that cannot be inferred from available modalities, we employ conditional latent diffusion models to generate high-quality representations conditioned on recovered shared information and learned structural priors. Extensive experiments on five TCGA cancer datasets demonstrate that MUST achieves state-of-the-art performance with complete data while maintaining robust predictions in both missing pathology and missing genomics conditions, with clinically acceptable inference latency.
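The decomposition idea can be sketched with plain linear algebra. The snippet below assumes, purely for illustration, that the shared subspace is spanned by an orthonormal basis and that the "algebraic constraint" is orthogonality between the two parts; the abstract does not state the actual constraints, and `decompose` and `basis` are hypothetical names.

```python
import numpy as np

def decompose(z, basis):
    """Split z into a part inside span(basis) and an orthogonal remainder.

    z:     (d,) representation of one modality.
    basis: (d, r) matrix with orthonormal columns, standing in for a
           learned low-rank shared subspace (an assumption of this sketch).
    """
    shared = basis @ (basis.T @ z)   # orthogonal projection onto the subspace
    specific = z - shared            # what the shared subspace cannot express
    return shared, specific

rng = np.random.default_rng(0)
d, r = 16, 4
# Reduced QR gives a (d, r) matrix with orthonormal columns.
basis, _ = np.linalg.qr(rng.normal(size=(d, r)))
z = rng.normal(size=d)
shared, specific = decompose(z, basis)
```

In this toy setup the two parts sum back to `z` and are mutually orthogonal, so the `specific` part is exactly the information that the shared subspace cannot supply when the modality is absent.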

[49] PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning cs.CV | cs.AI | cs.CL | cs.LG

Shaoxuan Li, Zhixuan Zhao, Hanze Deng, Zirun Ma, Shulin Tian

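TL;DR: This paper introduces PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning. Answering each question requires multiple temporally separated pieces of visual evidence and compositional constraints, spanning perceptual subtasks such as objects, attributes, relations, locations, actions, and events, and requiring skills including semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning. The benchmark contains 1,114 highly complex questions on 279 videos from diverse domains including city walk tours, indoor villa tours, video games, and extreme outdoor sports. Human studies show that PerceptionComp demands substantial test-time thinking and repeated perception steps, and state-of-the-art multimodal large language models also perform far worse on it than on existing benchmarks.

Details

Motivation: Existing video reasoning benchmarks fall short on complex, long-horizon, perception-centric reasoning. They often rely on a single moment or simple compositions, whereas real-world video understanding requires integrating multiple temporally separated visual cues and logical constraints.

Result: Human accuracy drops to 18.97% when rewatching is disallowed, indicating that the benchmark is hard. In the five-choice setting, the best evaluated model, Gemini-3-Flash, reaches only 45.96% accuracy, and open-source models remain below 40%, far below their performance on existing benchmarks. This highlights that perception-centric long-horizon video reasoning remains a major bottleneck.

Insight: The contribution is a complex video question-answering benchmark that emphasizes multi-moment evidence integration, compositional logic, and coverage of many perceptual subtasks and reasoning skills. Its manual annotation, domain diversity, and "no single moment is sufficient" design principle provide a stricter testbed for evaluating and advancing models on real-world complex video understanding.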

Abstract: We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning. PerceptionComp is designed so that no single moment is sufficient: answering each question requires multiple temporally separated pieces of visual evidence and compositional constraints under conjunctive and sequential logic, spanning perceptual subtasks such as objects, attributes, relations, locations, actions, and events, and requiring skills including semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning. The benchmark contains 1,114 highly complex questions on 279 videos from diverse domains including city walk tours, indoor villa tours, video games, and extreme outdoor sports, with 100% manual annotation. Human studies show that PerceptionComp requires substantial test-time thinking and repeated perception steps: participants take much longer than on prior benchmarks, and accuracy drops to near chance (18.97%) when rewatching is disallowed. State-of-the-art MLLMs also perform substantially worse on PerceptionComp than on existing benchmarks: the best model in our evaluation, Gemini-3-Flash, reaches only 45.96% accuracy in the five-choice setting, while open-source models remain below 40%. These results suggest that perception-centric long-horizon video reasoning remains a major bottleneck, and we hope PerceptionComp will help drive progress in perceptual reasoning.

[50] Learnable Instance Attention Filtering for Adaptive Detector Distillation cs.CV

Chen Liu, Qizhen Lan, Zhicheng Ding, Xinyu Chu, Qing Tian

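TL;DR: This paper proposes LIAF-KD, an adaptive detector distillation framework that introduces learnable instance selectors to dynamically evaluate and reweight instance importance during distillation. It targets two weaknesses of existing feature distillation methods: they ignore instance-level differences, and their filtering mechanisms are not learned.

Details

Motivation: Existing feature-based knowledge distillation methods usually rely on spatial filtering to guide distillation, but they generally treat all object instances uniformly and ignore instance-level variability. Existing attention filtering mechanisms are also typically heuristic or teacher-driven rather than learned jointly with the student.

Result: Experiments on the KITTI and COCO datasets show consistent improvements, including a 2% gain on a GFL ResNet-50 student without added complexity, outperforming state-of-the-art methods.

Insight: The contribution is a learnable instance attention filtering mechanism that lets the student model take part in assessing instance importance according to its evolving learning state, making distillation adaptive rather than dependent on fixed heuristic rules.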

Abstract: As deep vision models grow increasingly complex to achieve higher performance, deployment efficiency has become a critical concern. Knowledge distillation (KD) mitigates this issue by transferring knowledge from large teacher models to compact student models. While many feature-based KD methods rely on spatial filtering to guide distillation, they typically treat all object instances uniformly, ignoring instance-level variability. Moreover, existing attention filtering mechanisms are typically heuristic or teacher-driven, rather than learned with the student. To address these limitations, we propose Learnable Instance Attention Filtering for Adaptive Detector Distillation (LIAF-KD), a novel framework that introduces learnable instance selectors to dynamically evaluate and reweight instance importance during distillation. Notably, the student contributes to this process based on its evolving learning state. Experiments on the KITTI and COCO datasets demonstrate consistent improvements, with a 2% gain on a GFL ResNet-50 student without added complexity, outperforming state-of-the-art methods.
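To make "reweighting instance importance" concrete, here is a minimal sketch of an instance-weighted feature distillation loss. It is not the LIAF-KD implementation: the softmax weighting, the per-instance MSE, and the names `weighted_distill_loss` and `selector_logits` are all assumptions chosen for illustration, and in a real system the logits would come from a learned selector rather than being passed in by hand.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def weighted_distill_loss(student_feats, teacher_feats, selector_logits):
    """Instance-weighted feature distillation loss (illustrative only).

    student_feats, teacher_feats: (n_instances, dim) per-instance features.
    selector_logits: (n_instances,) scores from a hypothetical selector.
    Returns the weighted loss and the normalized instance weights.
    """
    per_instance = np.mean((student_feats - teacher_feats) ** 2, axis=1)
    weights = softmax(selector_logits)
    return float(np.sum(weights * per_instance)), weights

rng = np.random.default_rng(0)
student = rng.normal(size=(5, 8))
teacher = rng.normal(size=(5, 8))
uniform_loss, uniform_w = weighted_distill_loss(student, teacher, np.zeros(5))
focused_loss, focused_w = weighted_distill_loss(
    student, teacher, np.array([4.0, 0.0, 0.0, 0.0, 0.0])
)
```

With all-zero logits the loss reduces to the ordinary mean over instances; raising one logit shifts most of the weight, and therefore most of the gradient signal, onto that instance.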

[51] SDDF: Specificity-Driven Dynamic Focusing for Open-Vocabulary Camouflaged Object Detection cs.CV

Jiaming Liang, Yifeng Zhan, Chunlin Liu, Weihua Zheng, Bingye Peng

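TL;DR: This paper proposes SDDF (Specificity-Driven Dynamic Focusing) for open-vocabulary camouflaged object detection (OVCOD). It constructs the OVCOD-D benchmark, designs a sub-description principal component contrastive fusion strategy to reduce interference from noisy text, and uses specificity-guided regional weak alignment with a dynamic focusing mechanism to strengthen the detector's ability to discriminate objects whose visual features closely resemble the background.

Details

Motivation: When open-vocabulary object detection (OVOD) is applied to camouflaged objects, the detector struggles to distinguish and localize them because the visual features of object and background are highly similar. Existing camouflaged object datasets are limited in scale, and textual descriptions contain confusing and overly decorative modifiers, so a dedicated method is needed to improve detection performance.

Result: Under the open-set evaluation setting, the proposed method achieves an AP (average precision) of 56.4 on the OVCOD-D benchmark.

Insight: The contributions are OVCOD-D, a camouflaged object detection benchmark with fine-grained textual descriptions; a sub-description principal component contrastive fusion strategy that reduces textual noise; and a specificity-guided regional weak alignment and dynamic focusing method that improves visual discrimination of camouflaged objects. Viewed objectively, the method combines text generation from multimodal large models with visual alignment techniques to provide an effective solution for camouflaged object detection in open-vocabulary settings.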

Abstract: Open-vocabulary object detection (OVOD) aims to detect known and unknown objects in the open world by leveraging text prompts. Benefiting from the emergence of large-scale vision–language pre-trained models, OVOD has demonstrated strong zero-shot generalization capabilities. However, when dealing with camouflaged objects, the detector often fails to distinguish and localize objects because the visual features of the objects and the background are highly similar. To bridge this gap, we construct a benchmark named OVCOD-D by augmenting carefully selected camouflaged object images with fine-grained textual descriptions. Due to the limited scale of available camouflaged object datasets, we adopt detectors pre-trained on large-scale object detection datasets as our baseline methods, as they possess stronger zero-shot generalization ability. In the specificity-aware sub-descriptions generated by multimodal large models, there still exist confusing and overly decorative modifiers. To mitigate such interference, we design a sub-description principal component contrastive fusion strategy that reduces noisy textual components. Furthermore, to address the challenge that the visual features of camouflaged objects are highly similar to those of their surrounding environment, we propose a specificity-guided regional weak alignment and dynamic focusing method, which aims to strengthen the detector’s ability to discriminate camouflaged objects from background. Under the open-set evaluation setting, the proposed method achieves an AP of 56.4 on the OVCOD-D benchmark.

[52] SkinGPT-X: A Self-Evolving Collaborative Multi-Agent System for Transparent and Trustworthy Dermatological Diagnosis cs.CV | cs.AI

Zhangtianyi Chen, Yuhao Shen, Florensia Widjaja, Yan Xu, Liyuan Sun

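TL;DR: This paper presents SkinGPT-X, a multimodal collaborative multi-agent system for dermatological diagnosis that incorporates a self-evolving dermatological memory mechanism. By simulating the diagnostic workflow of dermatologists and enabling continuous memory evolution, the system provides transparent and trustworthy diagnoses for complex and rare dermatological cases.

Details

Motivation: Existing large language models struggle with fine-grained, large-scale multi-class diagnostic tasks and with rare skin disease diagnosis, and they lack the interpretability and traceability that clinical reasoning requires. Existing multi-agent systems focus mainly on visual question answering and conversational tasks, and their reliance on static knowledge bases limits their adaptability in complex real-world clinical settings.

Result: Across four public datasets, SkinGPT-X achieves SOTA performance against four state-of-the-art large language models, with a +9.6% accuracy improvement on DDI31 and a +13% weighted F1 gain on Dermnet. Its fine-grained classification ability is evaluated on a large-scale multi-class dataset covering 498 distinct dermatological categories. On the first rare skin disease benchmark, which contains 564 clinical samples across eight rare diseases, it improves accuracy by +9.8%, weighted F1 by +7.1%, and Cohen's Kappa by +10%.

Insight: The contribution is the combination of multi-agent collaboration with a self-evolving memory mechanism that simulates the clinical diagnostic workflow to improve transparency and trustworthiness. Viewed objectively, the system addresses the limits of static knowledge bases through dynamic knowledge updates and introduces a dedicated evaluation benchmark for the data sparsity of rare diseases, improving adaptability and diagnostic performance in complex real-world scenarios.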

Abstract: While recent advancements in Large Language Models have significantly advanced dermatological diagnosis, monolithic LLMs frequently struggle with fine-grained, large-scale multi-class diagnostic tasks and rare skin disease diagnosis owing to training data sparsity, while also lacking the interpretability and traceability essential for clinical reasoning. Although multi-agent systems can offer more transparent and explainable diagnostics, existing frameworks are primarily concentrated on Visual Question Answering and conversational tasks, and their heavy reliance on static knowledge bases restricts adaptability in complex real-world clinical settings. Here, we present SkinGPT-X, a multimodal collaborative multi-agent system for dermatological diagnosis integrated with a self-evolving dermatological memory mechanism. By simulating the diagnostic workflow of dermatologists and enabling continuous memory evolution, SkinGPT-X delivers transparent and trustworthy diagnostics for the management of complex and rare dermatological cases. To validate the robustness of SkinGPT-X, we design a three-tier comparative experiment. First, we benchmark SkinGPT-X against four state-of-the-art LLMs across four public datasets, demonstrating its state-of-the-art performance with a +9.6% accuracy improvement on DDI31 and a +13% weighted F1 gain on Dermnet over the state-of-the-art model. Second, we construct a large-scale multi-class dataset covering 498 distinct dermatological categories to evaluate its fine-grained classification capabilities. Finally, we curate a rare skin disease dataset, the first benchmark to address the scarcity of clinical rare skin disease data, which contains 564 clinical samples covering eight rare dermatological diseases. On this dataset, SkinGPT-X achieves a +9.8% accuracy improvement, a +7.1% weighted F1 improvement, and a +10% Cohen’s Kappa improvement.

[53] Beyond Where to Look: Trajectory-Guided Reinforcement Learning for Multimodal RLVR cs.CV

Jinda Lu, Junkang Wu, Jinghan Li, Kexin Huang, Shuo Yang

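TL;DR: This paper proposes Trajectory-Guided Reinforcement Learning (TGRL) to address the disconnect between visual evidence and the reasoning process when multimodal large language models are trained with Reinforcement Learning with Verifiable Rewards (RLVR). The method uses expert reasoning trajectories from stronger models to guide the policy model, and it introduces token-level reweighting and trajectory filtering to improve training. Experiments on multiple multimodal reasoning benchmarks show that TGRL improves reasoning performance and bridges the gap between visual perception and logical reasoning.

Details

Motivation: Current RLVR research on multimodal large language models (MLLMs) focuses mainly on improving final-answer correctness and strengthening visual grounding. However, even after attending to the relevant visual regions, models often fail to integrate the visual evidence into subsequent fine-grained reasoning, so the reasoning chain lacks a solid grounding in visual facts.

Result: Extensive experiments on multiple multimodal reasoning benchmarks show that TGRL consistently improves reasoning performance and effectively bridges the gap between visual perception and logical reasoning.

Insight: The core contribution is the TGRL framework, which uses reasoning trajectories from expert models to guide the policy model's fine-grained reasoning and adds token-level reweighting and trajectory filtering to keep policy optimization stable and effective. This provides a new, reusable training paradigm for the disconnect between visual evidence and logical reasoning in multimodal reasoning.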

Abstract: Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) for multimodal large language models (MLLMs) have mainly focused on improving final answer correctness and strengthening visual grounding. However, a critical bottleneck remains: although models can attend to relevant visual regions, they often fail to effectively incorporate visual evidence into subsequent reasoning, leading to reasoning chains that are weakly grounded in visual facts. To address this issue, we propose Trajectory-Guided Reinforcement Learning (TGRL), which guides the policy model to integrate visual evidence into fine-grained reasoning processes using expert reasoning trajectories from stronger models. We further introduce token-level reweighting and trajectory filtering to ensure stable and effective policy optimization. Extensive experiments on multiple multimodal reasoning benchmarks demonstrate that TGRL consistently improves reasoning performance and effectively bridges the gap between visual perception and logical reasoning.

[54] TaxaAdapter: Vision Taxonomy Models are Key to Fine-grained Image Generation over the Tree of Life cs.CV

Mridul Khurana, Amin Karimi Monsefi, Justin Lee, Medha Sawhney, David Carlyn

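TL;DR: This paper proposes TaxaAdapter, which injects embeddings from Vision Taxonomy Models (such as BioCLIP) into a frozen text-to-image diffusion model to improve the accuracy of fine-grained species image generation across the Tree of Life while preserving text control over attributes such as pose, style, and background.

Details

Motivation: Existing text-to-image models struggle to capture the fine-grained visual traits that define species identity, so generated species images lack species-level fidelity. There are over 10 million species on Earth, and many are distinguished only by subtle traits.

Result: Experiments show that TaxaAdapter improves morphology fidelity and species-identity accuracy over strong baselines on multiple benchmarks, and that it generalizes well to few-shot species and even to species unseen during training.

Insight: The contributions are the use of Vision Taxonomy Models as the key component guiding fine-grained generation, and a multimodal large language model-based evaluation metric that provides a more interpretable measure of morphological consistency. The architecture is lightweight and the training recipe is simple.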

Abstract: Accurately generating images across the Tree of Life is difficult: there are over 10M distinct species on Earth, many of which differ only by subtle visual traits. Despite the remarkable progress in text-to-image synthesis, existing models often fail to capture the fine-grained visual cues that define species identity, even when their outputs appear photo-realistic. To this end, we propose TaxaAdapter, a simple and lightweight approach that incorporates Vision Taxonomy Models (VTMs) such as BioCLIP to guide fine-grained species generation. Our method injects VTM embeddings into a frozen text-to-image diffusion model, improving species-level fidelity while preserving flexible text control over attributes such as pose, style, and background. Extensive experiments demonstrate that TaxaAdapter consistently improves morphology fidelity and species-identity accuracy over strong baselines, with a cleaner architecture and training recipe. To better evaluate these improvements, we also introduce a multimodal Large Language Model-based metric that summarizes trait-level descriptions from generated and real images, providing a more interpretable measure of morphological consistency. Beyond this, we observe that TaxaAdapter exhibits strong generalization capabilities, enabling species synthesis in challenging regimes such as few-shot species with only a handful of training images and even species unseen during training. Overall, our results highlight that VTMs are a key ingredient for scalable, fine-grained species generation.

[55] InstaVSR: Taming Diffusion for Efficient and Temporally Consistent Video Super-Resolution cs.CV

Jintong Hu, Bin Chen, Zhenyu Hu, Jiayue Liu, Guo Wang

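TL;DR: InstaVSR is a lightweight diffusion framework for efficient, temporally consistent video super-resolution. By pruning a one-step diffusion backbone, adding recurrent training with flow-guided temporal regularization, and applying dual-space adversarial learning, it substantially reduces computational cost and memory usage while maintaining good perceptual quality and smoother temporal transitions.

Details

Motivation: Diffusion-based video super-resolution faces two main challenges: strong generative priors cause temporal instability, and multi-frame diffusion pipelines are too expensive for practical deployment.

Result: On an NVIDIA RTX 4090, InstaVSR processes a 30-frame video at 2K×2K resolution in under one minute using only 7 GB of memory. This substantially reduces computational cost compared with existing diffusion-based methods while maintaining favorable perceptual quality and significantly smoother temporal transitions.

Insight: The contribution is the combination of an efficient one-step diffusion backbone, flow-guided recurrent training for temporal stability, and dual-space adversarial learning that preserves perceptual quality. Together these offer a lightweight design for applying diffusion models to video tasks efficiently and stably.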

Abstract: Video super-resolution (VSR) seeks to reconstruct high-resolution frames from low-resolution inputs. While diffusion-based methods have substantially improved perceptual quality, extending them to video remains challenging for two reasons: strong generative priors can introduce temporal instability, and multi-frame diffusion pipelines are often too expensive for practical deployment. To address both challenges simultaneously, we propose InstaVSR, a lightweight diffusion framework for efficient video super-resolution. InstaVSR combines three ingredients: (1) a pruned one-step diffusion backbone that removes several costly components from conventional diffusion-based VSR pipelines, (2) recurrent training with flow-guided temporal regularization to improve frame-to-frame stability, and (3) dual-space adversarial learning in latent and pixel spaces to preserve perceptual quality after backbone simplification. On an NVIDIA RTX 4090, InstaVSR processes a 30-frame video at 2K$\times$2K resolution in under one minute with only 7 GB of memory usage, substantially reducing the computational cost compared to existing diffusion-based methods while maintaining favorable perceptual quality with significantly smoother temporal transitions.

[56] IP-Bench: Benchmark for Image Protection Methods in Image-to-Video Generation Scenarios cs.CV

Xiaofeng Li, Leyi Sheng, Zhen Sun, Zongmin Zhang, Jiaheng Wei

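TL;DR: This paper proposes IP-Bench, the first benchmark for systematically evaluating image protection methods in image-to-video (I2V) generation scenarios. It covers 6 representative protection methods and 5 advanced I2V models, and it evaluates the methods' resistance to two robustness attack strategies under practical scenarios as well as their cross-model and cross-modality transferability.

Details

Motivation: As I2V generation models advance rapidly, the risk of their misuse to create malicious content (for example, generating a fake video from a single image) keeps growing. Existing image protection methods lack a unified benchmark and have not been systematically evaluated in I2V scenarios or against preprocessing attacks, which makes their effectiveness in real deployment hard to measure.

Result: The paper builds the IP-Bench benchmark, evaluates 6 protection methods against 5 SOTA I2V models, and systematically analyzes the protection methods' performance under two robustness attacks together with their cross-model and cross-modality transferability.

Insight: The contribution is the first systematic, reproducible, and extensible evaluation framework for image protection methods in I2V generation scenarios. It adds robustness testing against preprocessing attacks and cross-model and cross-modality transferability analysis, filling gaps in existing evaluation practice.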

Abstract: With the rapid advancement of image-to-video (I2V) generation models, their potential for misuse in creating malicious content has become a significant concern. For instance, a single image can be exploited to generate a fake video, which can be used to attract attention and gain benefits. This phenomenon is referred to as I2V generation misuse. Existing image protection methods suffer from the absence of a unified benchmark, leading to an incomplete evaluation framework. Furthermore, these methods have not been systematically assessed in I2V generation scenarios and against preprocessing attacks, which complicates the evaluation of their effectiveness in real-world deployment scenarios. To address this challenge, we propose IP-Bench (Image Protection Bench), the first systematic benchmark designed to evaluate protection methods in I2V generation scenarios. This benchmark examines 6 representative protection methods and 5 state-of-the-art I2V models. Furthermore, our work systematically evaluates protection methods’ robustness with two robustness attack strategies under practical scenarios and analyzes their cross-model & cross-modality transferability. Overall, IP-Bench establishes a systematic, reproducible, and extensible evaluation framework for image protection methods in I2V generation scenarios.

[57] Provably Contractive and High-Quality Denoisers for Convergent Restoration cs.CV

Shubhi Shukla, Pravin Nair

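TL;DR: This paper proposes provably contractive denoiser networks (global Lipschitz constant below 1) to address the lack of stability guarantees in existing image restoration models under small input perturbations. The model combines proximal layers obtained from unfolding techniques with Lipschitz-controlled convolutional refinement layers. It guarantees strict robustness of the output to input perturbations while matching the denoising performance of unconstrained SOTA methods, and it can serve as a regularizer in Plug-and-Play (PnP) algorithms with guaranteed convergence.

Details

Motivation: Existing convolutional and attention-based image restoration networks reach SOTA performance but lack stability guarantees under small input shifts, exposing a robustness-accuracy trade-off. This work develops provably contractive denoiser networks that considerably narrow this gap.

Result: On image denoising, the proposed model is competitive with unconstrained SOTA denoisers such as DnCNN and Restormer. It reports the tightest performance gap so far for a provably 1-Lipschitz model and establishes that such gaps are achievable by contractive denoisers. Used as a regularizer, the model provably yields convergence in Plug-and-Play algorithms.

Insight: The main contribution is a denoiser architecture with a provable strict Lipschitz bound (below 1) that combines proximal unfolding with Lipschitz-controlled convolutional layers. This challenges the common assumption in the literature that enforcing strict Lipschitz control inherently degrades output quality, shows that high performance and verifiable stability can coexist, and moves the field toward verifiable and stable vision models.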

Abstract: Image restoration, the recovery of clean images from degraded measurements, has applications in various domains like surveillance, defense, and medical imaging. Despite achieving state-of-the-art (SOTA) restoration performance, existing convolutional and attention-based networks lack stability guarantees under minor shifts in input, exposing a robustness-accuracy trade-off. We develop provably contractive (global Lipschitz $< 1$) denoiser networks that considerably reduce this gap. Our design composes proximal layers obtained from unfolding techniques, with Lipschitz-controlled convolutional refinements. By contractivity, our denoiser guarantees that input perturbations of strength $\|\delta\| \le \varepsilon$ induce at most $\varepsilon$ change at the output, while strong baselines such as DnCNN and Restormer can exhibit larger deviations under the same perturbations. On image denoising, the proposed model is competitive with unconstrained SOTA denoisers, reporting the tightest gap for a provably 1-Lipschitz model and establishing that such gaps are indeed achievable by contractive denoisers. Moreover, the proposed denoisers act as strong regularizers for image restoration that provably effect convergence in Plug-and-Play algorithms. Our results show that enforcing strict Lipschitz control does not inherently degrade output quality, challenging a common assumption in the literature and moving the field toward verifiable and stable vision models. Code and pretrained models are available at https://github.com/SHUBHI1553/Contractive-Denoisers
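The two properties claimed for a contractive map, bounded sensitivity to perturbations and convergence of fixed-point iteration, can be checked on a toy example. The sketch below uses a rescaled linear map as a stand-in for a denoiser; it is not the paper's architecture, and `make_contractive` and the target constant 0.9 are illustrative assumptions. Convergence of the iteration follows from the Banach fixed-point theorem.

```python
import numpy as np

def make_contractive(w, target=0.9):
    """Rescale w so its spectral norm (largest singular value) is <= target."""
    sigma = np.linalg.norm(w, ord=2)
    return w * (target / max(sigma, target))

rng = np.random.default_rng(0)
W = make_contractive(rng.normal(size=(8, 8)))
b = rng.normal(size=8)

def toy_map(x):
    """Affine map with Lipschitz constant <= 0.9 (stand-in for a denoiser)."""
    return W @ x + b

# Perturbation bound: the output moves by at most 0.9 times the input change.
x0 = rng.normal(size=8)
delta = 1e-2 * rng.normal(size=8)
output_change = np.linalg.norm(toy_map(x0 + delta) - toy_map(x0))
input_change = np.linalg.norm(delta)

# Fixed-point iteration converges to the unique solution of x = W x + b.
x = np.zeros(8)
for _ in range(300):
    x = toy_map(x)
fixed_point = np.linalg.solve(np.eye(8) - W, b)
```

For a nonlinear network the same bound has to hold layer by layer, which is why architectures of this kind constrain every component rather than rescaling a single matrix.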

[58] CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions cs.CV

Chonghuinan Wang, Zihan Chen, Yuxiang Wei, Tianyi Jiang, Xiaohe Wu

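TL;DR: This paper proposes CREval, a question-answering-based automated evaluation framework for assessing creative image editing models under complex instructions, and builds the CREval-Bench benchmark with over 800 samples and 13K evaluation queries. A systematic evaluation of open-source and closed-source models finds that closed-source models perform better, but all models still struggle on complex creative editing tasks.

Details

Motivation: Existing evaluation methods lack a systematic, human-aligned framework for assessing complex creative image editing tasks, so an automated and interpretable evaluation scheme is needed.

Result: A range of open-source and closed-source models were evaluated on CREval-Bench. Closed-source models outperform open-source ones overall, but all models perform poorly on complex creative editing tasks. User studies show that CREval's automated metrics are highly consistent with human judgments.

Insight: The contributions are a QA-based automated evaluation pipeline that addresses the incompleteness and poor interpretability of scoring by multimodal large language models, and a comprehensive benchmark covering multiple creative dimensions. Viewed objectively, the method provides a reliable evaluation basis for complex creative image editing and exposes key challenges for future research.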

Abstract: Instruction-based multimodal image manipulation has recently made rapid progress. However, existing evaluation methods lack a systematic and human-aligned framework for assessing model performance on complex and creative editing tasks. To address this gap, we propose CREval, a fully automated question-answer (QA)-based evaluation pipeline that overcomes the incompleteness and poor interpretability of opaque Multimodal Large Language Models (MLLMs) scoring. Simultaneously, we introduce CREval-Bench, a comprehensive benchmark specifically designed for creative image manipulation under complex instructions. CREval-Bench covers three categories and nine creative dimensions, comprising over 800 editing samples and 13K evaluation queries. Leveraging this pipeline and benchmark, we systematically evaluate a diverse set of state-of-the-art open and closed-source models. The results reveal that while closed-source models generally outperform open-source ones on complex and creative tasks, all models still struggle to complete such edits effectively. In addition, user studies demonstrate strong consistency between CREval’s automated metrics and human judgments. Therefore, CREval provides a reliable foundation for evaluating image editing models on complex and creative image manipulation tasks, and highlights key challenges and opportunities for future research.

[59] Consistency Beyond Contrast: Enhancing Open-Vocabulary Object Detection Robustness via Contextual Consistency Learning cs.CV

Bozhao Li, Shaocong Wu, Tong Shao, Senqiao Yang, Qiben Shan

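TL;DR: This paper proposes Contextual Consistency Learning (CCL), a new framework for improving the robustness of open-vocabulary object detection. Through two strategies, Contextual Bootstrapped Data Generation (CBDG) and Contextual Consistency Loss (CCLoss), it enforces consistency within a single (visual) modality so that the model detects the same object stably across different backgrounds and scenes.

Details

Motivation: Existing open-vocabulary object detection methods rely mainly on scaling up datasets and on contrastive learning to align modalities, but they neglect internal consistency within a single modality, especially vision. As a result, performance drops when the same object appears in different scenes, revealing a robustness gap.

Result: The method achieves state-of-the-art (SOTA) performance on the OmniLabel and D3 benchmarks, surpassing previous approaches by +16.3 AP and +14.9 AP, respectively.

Insight: The contribution is to emphasize and enforce contextual consistency within the visual modality, using a dedicated data generation mechanism (CBDG) and a consistency loss (CCLoss) to compensate for the weak robustness of existing methods under scene changes. This shows that, beyond cross-modal alignment, intra-modal feature invariance is essential for generalization in open-vocabulary detection.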

Abstract: Recent advances in open-vocabulary object detection focus primarily on two aspects: scaling up datasets and leveraging contrastive learning to align language and vision modalities. However, these approaches often neglect internal consistency within a single modality, particularly when background or environmental changes occur. This lack of consistency leads to a performance drop because the model struggles to detect the same object in different scenes, which reveals a robustness gap. To address this issue, we introduce Contextual Consistency Learning (CCL), a novel framework that integrates two key strategies: Contextual Bootstrapped Data Generation (CBDG) and Contextual Consistency Loss (CCLoss). CBDG functions as a data generation mechanism, producing images that contain the same objects across diverse backgrounds. This is essential because existing datasets alone do not support our CCL framework. The CCLoss further enforces the invariance of object features despite environmental changes, thereby improving the model’s robustness in different scenes. These strategies collectively form a unified framework for ensuring contextual consistency within the same modality. Our method achieves state-of-the-art performance, surpassing previous approaches by +16.3 AP on OmniLabel and +14.9 AP on D3. These results demonstrate the importance of enforcing intra-modal consistency, significantly enhancing model generalization in diverse environments. Our code is publicly available at: https://github.com/bozhao-li/CCL.
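One simple way to express "invariance of object features despite environmental changes" is a loss that penalizes the angle between features of the same object extracted from two different backgrounds. The sketch below is an assumption made for illustration: the abstract does not give the form of CCLoss, and `consistency_loss` with cosine distance is a hypothetical stand-in.

```python
import numpy as np

def consistency_loss(feat_a, feat_b, eps=1e-8):
    """Mean cosine distance between paired object features.

    feat_a, feat_b: (n_objects, dim) features of the same objects taken
    from two images with different backgrounds (hypothetical inputs).
    Returns 0 when paired features point in the same direction.
    """
    a = feat_a / (np.linalg.norm(feat_a, axis=1, keepdims=True) + eps)
    b = feat_b / (np.linalg.norm(feat_b, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(a * b, axis=1)))

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 16))
same = consistency_loss(feats, feats)       # close to 0
flipped = consistency_loss(feats, -feats)   # close to 2
```

A loss of this kind only makes sense when paired images of the same object exist, which is consistent with the abstract's remark that a data generation step is needed because existing datasets alone do not support the framework.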

[60] GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport cs.CV

Youngju Na, Jaeseong Yun, Soohyun Ryu, Hyunsu Kim, Sung-Eui Yoon

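TL;DR: This paper presents GLINT, a framework that addresses the fundamental failure of 3D Gaussian splatting when modeling transparent objects such as glass panels. Using an explicitly decomposed Gaussian representation, it reconstructs the transparent interface and models reflected and transmitted radiance separately, achieving consistent radiance transport.

Details

Motivation: 3D Gaussian splatting cannot model transparent objects effectively. The core challenge is decoupling the intertwined radiance contributions of the transparent interface and the geometry observed through the glass.

Result: Extensive experiments show that GLINT achieves consistent improvements over existing methods in reconstructing complex transparent scenes.

Insight: The contribution is the explicit separation of the interface from the transmitted geometry through a decomposed Gaussian representation, optimized with geometry-separation cues and priors from a pre-trained video relighting model, which enables scene-scale transparency modeling.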

Abstract: While 3D Gaussian splatting has emerged as a powerful paradigm, it fundamentally fails to model transparency such as glass panels. The core challenge lies in decoupling the intertwined radiance contributions from transparent interfaces and the transmitted geometry observed through the glass. We present GLINT, a framework that models scene-scale transparency through explicit decomposed Gaussian representation. GLINT reconstructs the primary interface and models reflected and transmitted radiance separately, enabling consistent radiance transport. During optimization, GLINT bootstraps transparency localization from geometry-separation cues induced by the decomposition, together with geometry and material priors from a pre-trained video relighting model. Extensive experiments demonstrate consistent improvements over prior methods for reconstructing complex transparent scenes.

[61] Progressive Learning with Anatomical Priors for Reliable Left Atrial Scar Segmentation from Late Gadolinium Enhancement MRI cs.CV | cs.AI

Jing Zhang, Bastien Bergere, Emilie Bollache, Jonas Leite, Mikaël Laredo

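TL;DR: This paper proposes a progressive learning strategy for automatically segmenting left atrial (LA) scar from cardiac MRI late gadolinium enhancement (LGE) images. The method uses a three-stage SwinUNETR-based framework that mimics the clinical workflow and introduces an anatomy-aware spatially weighted loss that incorporates prior clinical knowledge to improve segmentation accuracy and reliability.

Details

Motivation: Automatic LA scar segmentation is challenged by low contrast, annotation variability, and the lack of anatomical constraints, which lead to unreliable predictions. This work addresses these problems by mimicking clinical diagnostic reasoning and embedding anatomical priors into the deep learning model.

Result: On the validation set of the public LASCARQS dataset under five-fold cross-validation, LA segmentation reaches a Dice score of 0.94. LA scar segmentation reaches a Dice score of 0.50, a Hausdorff distance of 11.84 mm, and an average surface distance of 1.80 mm, outperforming one-stage scar segmentation (Dice 0.49, HD 13.02 mm, ASD 1.96 mm).

Insight: The contribution is a progressive three-stage learning framework that mimics the clinical workflow, together with an anatomy-aware spatially weighted loss that builds anatomical priors (for example, that scar should lie in the LA wall region) into training as constraints, improving segmentation accuracy and clinical reliability. The approach underlines the importance of clinically informed model design.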

Abstract: Cardiac MRI late gadolinium enhancement (LGE) enables non-invasive identification of left atrial (LA) scar, whose spatial distribution is strongly associated with atrial fibrillation (AF) severity and recurrence. However, automatic LA scar segmentation remains challenging due to low contrast, annotation variability, and the lack of anatomical constraints, often leading to unreliable predictions. Accordingly, our aim was to propose a progressive learning strategy, inspired by a clinical workflow, to segment LA scar from LGE images. A 3-stage framework based on SwinUNETR was implemented, comprising: 1) a first LA cavity pre-learning model, 2) a dual-task model which further learns the spatial relationship between LA geometry and scar patterns, and 3) fine-tuning on precise segmentation of the scar. Furthermore, we introduced an anatomy-aware spatially weighted loss that incorporates prior clinical knowledge by constraining scar predictions to anatomically plausible LA wall regions while mitigating annotation bias. In our preliminary results, obtained on validation LGE volumes from the public LASCARQS dataset after 5-fold cross-validation, LA segmentation had a Dice score of 0.94, and LA scar segmentation achieved a Dice score of 0.50, a Hausdorff Distance of 11.84 mm, and an Average Surface Distance of 1.80 mm, outperforming a one-stage scar segmentation with 0.49, 13.02 mm, and 1.96 mm, respectively. By explicitly embedding clinical anatomical priors and diagnostic reasoning into deep learning, the proposed approach improved the accuracy and reliability of LA scar segmentation from LGE, revealing the importance of clinically informed model design.
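For readers comparing the reported numbers, the Dice score is the standard overlap measure $2|A \cap B| / (|A| + |B|)$ between a predicted mask $A$ and a reference mask $B$. The sketch below computes it on a tiny hand-made example; the masks are invented for illustration, and the `eps` smoothing term is a common convention assumed here rather than something specified in the abstract.

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice coefficient between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    # eps keeps the score defined (and equal to 1) when both masks are empty.
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))

pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])
# 2 overlapping pixels, 3 + 3 labeled pixels in total: 2 * 2 / 6
print(round(dice_score(pred, target), 4))  # 0.6667
```

Because the score is normalized by the total mask size, small and thin structures such as scar are penalized heavily for a few misplaced voxels, which helps explain why scar Dice (0.50) sits far below cavity Dice (0.94).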

[62] OSA: Echocardiography Video Segmentation via Orthogonalized State Update and Anatomical Prior-aware Feature Enhancement cs.CV

Rui Wang, Huisi Wu, Jing Qin

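TL;DR: This paper proposes OSA, a framework for left ventricle segmentation in echocardiography videos. An orthogonalized state update mechanism prevents rank collapse of the state matrix, and an anatomical prior-aware feature enhancement module separates noise from structure, yielding accurate and temporally consistent segmentation.

Details

Motivation: Echocardiography video segmentation is hampered by speckle noise and non-rigid deformation, which make spatiotemporal modeling difficult, and by rank collapse caused by unconstrained state updates in existing linear recurrent models. The goal is to improve segmentation accuracy and temporal stability.

Result: Experiments on the CAMUS and EchoNet-Dynamic datasets show that OSA achieves state-of-the-art segmentation accuracy and temporal stability while maintaining real-time inference efficiency.

Insight: The contributions are an orthogonalized state update mechanism that constrains state evolution to the Stiefel manifold, and a feature enhancement module that explicitly separates anatomical structures from noise through a physics-driven process. Together they prevent rank collapse and improve robustness to noise.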

Abstract: Accurate and temporally consistent segmentation of the left ventricle from echocardiography videos is essential for estimating the ejection fraction and assessing cardiac function. However, modeling spatiotemporal dynamics remains difficult due to severe speckle noise and rapid non-rigid deformations. Existing linear recurrent models offer efficient in-context associative recall for temporal tracking, but rely on unconstrained state updates, which cause progressive singular value decay in the state matrix, a phenomenon known as rank collapse, resulting in anatomical details being overwhelmed by noise. To address this, we propose OSA, a framework that constrains the state evolution on the Stiefel manifold. We introduce the Orthogonalized State Update (OSU) mechanism, which formulates the memory evolution as Euclidean projected gradient descent on the Stiefel manifold to prevent rank collapse and maintain stable temporal transitions. Furthermore, an Anatomical Prior-aware Feature Enhancement module explicitly separates anatomical structures from speckle noise through a physics-driven process, providing the temporal tracker with noise-resilient structural cues. Comprehensive experiments on the CAMUS and EchoNet-Dynamic datasets show that OSA achieves state-of-the-art segmentation accuracy and temporal stability, while maintaining real-time inference efficiency for clinical deployment. Codes are available at https://github.com/wangrui2025/OSA.
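The idea of keeping a state matrix on the Stiefel manifold (matrices with orthonormal columns) can be sketched as a gradient step followed by a projection. The snippet below uses the polar factor from an SVD as the projection, which gives the closest matrix with orthonormal columns in Frobenius norm for a full-rank input. This is a generic projected-gradient sketch under that assumption, not the OSU update from the paper, and the function names and the random stand-in gradient are hypothetical.

```python
import numpy as np

def project_to_stiefel(m):
    """Return the polar factor U V^T of m, which has orthonormal columns."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def orthogonalized_update(state, grad, lr=0.1):
    """One Euclidean gradient step followed by projection onto the manifold."""
    return project_to_stiefel(state - lr * grad)

rng = np.random.default_rng(0)
n, p = 12, 4
state, _ = np.linalg.qr(rng.normal(size=(n, p)))  # start on the manifold
for _ in range(10):
    grad = rng.normal(size=(n, p))  # stand-in for a real loss gradient
    state = orthogonalized_update(state, grad)

singular_values = np.linalg.svd(state, compute_uv=False)
```

After every update all singular values of `state` equal 1, so none of them can decay toward zero; this is the sense in which an orthogonality constraint rules out the rank collapse described in the abstract.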

[63] MemCam: Memory-Augmented Camera Control for Consistent Video Generation cs.CV | cs.AI

Xinhang Gao, Junlin Guan, Shuhan Luo, Wenzhuo Li, Guanghuan Tan

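TL;DR: MemCam is a memory-augmented interactive video generation method that treats previously generated frames as external memory and uses them as contextual conditioning to achieve high scene consistency under controllable camera viewpoints. It includes a context compression module that encodes memory frames into compact representations and a co-visibility-based selection mechanism that dynamically retrieves the most relevant historical frames, enriching context while reducing computational overhead.

Details

Motivation: Existing methods struggle to maintain scene consistency in long video generation under dynamic camera control because contextual information is limited.

Result: Experiments on interactive video generation tasks show that MemCam significantly outperforms existing baselines and open-source SOTA methods in scene consistency, particularly in long videos with large camera rotations.

Insight: The contribution is to treat historical frames as external memory that is dynamically retrieved and compressed, which strengthens context modeling for long video generation and improves scene consistency. It is an effective way to address context forgetting in long video generation.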

Abstract: Interactive video generation has significant potential for scene simulation and video creation. However, existing methods often struggle with maintaining scene consistency during long video generation under dynamic camera control due to limited contextual information. To address this challenge, we propose MemCam, a memory-augmented interactive video generation approach that treats previously generated frames as external memory and leverages them as contextual conditioning to achieve controllable camera viewpoints with high scene consistency. To enable longer and more relevant context, we design a context compression module that encodes memory frames into compact representations and employs co-visibility-based selection to dynamically retrieve the most relevant historical frames, thereby reducing computational overhead while enriching contextual information. Experiments on interactive video generation tasks show that MemCam significantly outperforms existing baseline methods as well as open-source state-of-the-art approaches in terms of scene consistency, particularly in long video scenarios with large camera rotations.

[64] Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding cs.CV | cs.AIPDF

Shrinidhi Kumbhar, Haofu Liao, Srikar Appalaraju, Kunwar Yashraj Singh

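TL;DR: This paper explores the potential of discrete diffusion vision-language models for GUI grounding. By adapting LLaDA-V, the authors propose a hybrid masking schedule that combines linear and deterministic masking to predict single-turn actions and bounding boxes on GUIs. Experiments show that the method outperforms the linear-masking variant on multiple datasets and is competitive with autoregressive models, and that expanding the training data substantially improves performance.

Details

Motivation: Autoregressive vision-language models have long dominated multimodal understanding and GUI grounding. Discrete diffusion models offer bidirectional attention, parallel token generation, and iterative refinement in multimodal reasoning, yet their potential for GUI grounding has not been explored. The paper evaluates whether discrete diffusion models can serve as a viable alternative to autoregressive models.

Result: On four datasets spanning web, desktop, and mobile interfaces, the diffusion model with hybrid masking improves Step Success Rate by up to 6.1 points over the linear-masking variant and performs on par with autoregressive models. Systematic ablations show that increasing diffusion steps, generation length, and block length improves accuracy but increases latency, and that accuracy plateaus beyond a certain number of diffusion steps. Expanding the training data reduces latency by about 1.3 seconds and improves accuracy by an average of 20 points across benchmarks.

Insight: The novelty is a hybrid masking schedule that better captures the hierarchical structure of bounding-box geometry and thus improves GUI grounding accuracy. Viewed objectively, the study shows the potential of discrete diffusion models as a modeling framework for GUI grounding and is an important step toward diffusion-based GUI agents, with useful lessons for handling multimodal input and optimizing generation efficiency.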

Abstract: Autoregressive (AR) vision-language models (VLMs) have long dominated multimodal understanding, reasoning, and graphical user interface (GUI) grounding. Recently, discrete diffusion vision-language models (DVLMs) have shown strong performance in multimodal reasoning, offering bidirectional attention, parallel token generation, and iterative refinement. However, their potential for GUI grounding remains unexplored. In this work, we evaluate whether discrete DVLMs can serve as a viable alternative to AR models for GUI grounding. We adapt LLaDA-V for single-turn action and bounding-box prediction, framing the task as text generation from multimodal input. To better capture the hierarchical structure of bounding-box geometry, we propose a hybrid masking schedule that combines linear and deterministic masking, improving grounding accuracy by up to 6.1 points in Step Success Rate (SSR) over the GUI-adapted LLaDA-V trained with linear masking. Evaluations on four datasets spanning web, desktop, and mobile interfaces show that the adapted diffusion model with hybrid masking consistently outperforms the linear-masked variant and performs competitively with autoregressive counterparts despite limited pretraining. Systematic ablations reveal that increasing diffusion steps, generation length, and block length improves accuracy but also increases latency, with accuracy plateauing beyond a certain number of diffusion steps. Expanding the training data with diverse GUI domains further reduces latency by about 1.3 seconds and improves grounding accuracy by an average of 20 points across benchmarks. These results demonstrate that discrete DVLMs are a promising modeling framework for GUI grounding and represent an important step toward diffusion-based GUI agents.

[65] Real-Time Branch-to-Tool Distance Estimation for Autonomous UAV Pruning: Benchmarking Five DEFOM-Stereo Variants from Simulation to Jetson Deployment cs.CVPDF

Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green

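TL;DR: For the safety-critical task of autonomous UAV tree pruning, this paper presents a perception approach that estimates the distance between the cutting tool and thin branches in real time. Five variants of the foundation-model-based stereo matcher DEFOM-Stereo are trained and evaluated on a synthetic dataset, then deployed and tested on an NVIDIA Jetson Orin Super to balance accuracy against speed.

Details

Motivation: Autonomous UAV pruning requires real-time, accurate estimates of the metric distance between the cutting tool and thin branches so that the UAV can approach, align, and actuate the pruner safely without collision.

Result: On the synthetic test set, DEFOM-Stereo ViT-S achieves the best depth-domain accuracy (EPE 1.74 px, D1-all 5.81%, delta-1 95.90%, depth MAE 23.40 cm), but its Jetson inference speed of only about 2.2 FPS is too slow. The newly introduced balanced variant DEFOM-PrunePlus (~21M backbone, ~3.3 FPS on Jetson) offers the best deployable accuracy-speed trade-off (EPE 5.87 px, depth MAE 64.26 cm, delta-1 87.59%): its frame rate is sufficient for real-time guidance, and its depth accuracy supports safe approach planning at the 2 m operating range. The lightweight variants DEFOM-PruneStereo (~6.9 FPS) and DEFOM-PruneNano (~8.5 FPS) are faster but lose substantial accuracy (depth MAE > 57 cm), making their estimates unreliable. Zero-shot inference on real photographs shows that full-capacity models preserve branch geometry, supporting the sim-to-real transfer.

Insight: The novelty lies in adapting the foundation-model-based stereo matcher DEFOM-Stereo to a specific task (UAV pruning) and introducing balanced variants such as DEFOM-PrunePlus that achieve a practical accuracy-speed trade-off on embedded hardware (Jetson), providing a deployable solution for safety-critical real-time applications. Viewed objectively, the study highlights the importance of optimizing model architecture for resource-constrained platforms and validates training on synthetic data.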

Abstract: Autonomous tree pruning with unmanned aerial vehicles (UAVs) is a safety-critical real-world task: the onboard perception system must estimate the metric distance from a cutting tool to thin tree branches in real time so that the UAV can approach, align, and actuate the pruner without collision. We address this problem by training five variants of DEFOM-Stereo, a recent foundation-model-based stereo matcher, on a task-specific synthetic dataset and deploying the checkpoints on an NVIDIA Jetson Orin Super 16 GB. The training corpus is built in Unreal Engine 5 with a simulated ZED Mini stereo camera capturing 5,520 stereo pairs across 115 tree instances from three viewpoints at 2 m distance; dense EXR depth maps provide exact, spatially complete supervision for thin branches. On the synthetic test set, DEFOM-Stereo ViT-S achieves the best depth-domain accuracy (EPE 1.74 px, D1-all 5.81%, delta-1 95.90%, depth MAE 23.40 cm) but its Jetson inference speed of ~2.2 FPS (~450 ms per frame) remains too slow for responsive closed-loop tool control. A newly introduced balanced variant, DEFOM-PrunePlus (~21M backbone, ~3.3 FPS on Jetson), offers the best deployable accuracy-speed trade-off (EPE 5.87 px, depth MAE 64.26 cm, delta-1 87.59%): its frame rate is sufficient for real-time guidance and its depth accuracy supports safe branch approach planning at the 2 m operating range. The lightweight DEFOM-PruneStereo (~6.9 FPS) and DEFOM-PruneNano (~8.5 FPS) run fast but sacrifice substantial accuracy (depth MAE > 57 cm), making estimates too unreliable for safe actuation. Zero-shot inference on real photographs confirms that full-capacity models preserve branch geometry, validating the sim-to-real transfer. We conclude that DEFOM-PrunePlus provides the most practical accuracy-latency balance for onboard distance estimation, while ViT-S serves as the reference for future hardware.
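
The metrics quoted above can be made concrete with a small sketch. The abstract does not define them, so the code assumes the conventions common in stereo benchmarks: EPE as mean absolute disparity error, D1-all as the share of pixels whose error exceeds both 3 px and 5% of the ground truth, and delta-1 as the share of pixels whose depth ratio is below 1.25. The focal length and baseline are illustrative placeholders, not the paper's calibration.

```python
import numpy as np

def disparity_to_depth(disp_px, focal_px, baseline_m):
    """Pinhole stereo relation: depth = focal length * baseline / disparity."""
    return focal_px * baseline_m / disp_px

def stereo_metrics(pred_disp, gt_disp, focal_px, baseline_m):
    """Assumed metric definitions; the paper may use slightly different ones."""
    err = np.abs(pred_disp - gt_disp)
    pred_depth = disparity_to_depth(pred_disp, focal_px, baseline_m)
    gt_depth = disparity_to_depth(gt_disp, focal_px, baseline_m)
    ratio = np.maximum(pred_depth / gt_depth, gt_depth / pred_depth)
    return {
        "epe_px": float(err.mean()),
        "d1_all_pct": float(np.mean((err > 3.0) & (err > 0.05 * gt_disp)) * 100.0),
        "delta1_pct": float(np.mean(ratio < 1.25) * 100.0),
        "depth_mae_cm": float(np.mean(np.abs(pred_depth - gt_depth)) * 100.0),
    }

# Toy disparities in pixels (made up for illustration).
gt = np.array([20.0, 22.0, 25.0, 30.0])
pred = np.array([21.0, 22.0, 19.0, 30.0])
metrics = stereo_metrics(pred, gt, focal_px=700.0, baseline_m=0.063)

# Frame time implied by a frame rate, e.g. the ~2.2 FPS reported for ViT-S.
frame_time_ms = 1000.0 / 2.2
```

Here one of the four pixels is off by 6 px, so it counts as an outlier for both D1-all and delta-1, and 2.2 FPS works out to roughly 455 ms per frame, consistent with the ~450 ms quoted in the abstract.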

[66] ARTA: Adaptive Mixed-Resolution Token Allocation for Efficient Dense Feature Extraction cs.CV | cs.AI | cs.LGPDF

David Hagerman, Roman Naeem, Erik Brorsson, Fredrik Kahl, Lennart Svensson

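TL;DR: ARTA proposes an adaptive mixed-resolution token allocation strategy for vision transformers that perform efficient dense feature extraction. The method starts from low-resolution (coarse) tokens, uses a lightweight allocator to predict which regions need more fine tokens, and iteratively allocates additional tokens near semantic boundaries, concentrating computation on complex regions while staying sensitive to weak boundaries.

Details

Motivation: Conventional dense high-resolution token methods are computationally redundant. ARTA uses a coarse-to-fine mixed-resolution strategy to dynamically allocate computation to semantic boundary regions and so make dense feature extraction more efficient.

Result: ARTA achieves SOTA performance on ADE20K and COCO-Stuff and competitive performance on Cityscapes at markedly lower compute. For example, ARTA-Base attains 54.6 mIoU on ADE20K with roughly 100M parameters while using fewer FLOPs and less memory than comparable backbones.

Insight: The novelties include an adaptive token allocation mechanism driven by semantic boundary scores, mixed-resolution attention that lets coarse and fine tokens interact, and a low allocation threshold applied iteratively to balance boundary sensitivity against computational efficiency. Viewed objectively, dynamic resource allocation effectively removes redundant computation in homogeneous regions.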

Abstract: We present ARTA, a mixed-resolution coarse-to-fine vision transformer for efficient dense feature extraction. Unlike models that begin with dense high-resolution (fine) tokens, ARTA starts with low-resolution (coarse) tokens and uses a lightweight allocator to predict which regions require more fine tokens. The allocator iteratively predicts a semantic (class) boundary score and allocates additional tokens to patches above a low threshold, concentrating token density near boundaries while maintaining high sensitivity to weak boundary evidence. This targeted allocation encourages tokens to represent a single semantic class rather than a mixture of classes. Mixed-resolution attention enables interaction between coarse and fine tokens, focusing computation on semantically complex areas while avoiding redundant processing in homogeneous regions. Experiments demonstrate that ARTA achieves state-of-the-art results on ADE20K and COCO-Stuff with substantially fewer FLOPs, and delivers competitive performance on Cityscapes at markedly lower compute. For example, ARTA-Base attains 54.6 mIoU on ADE20K in the ~100M-parameter class while using fewer FLOPs and less memory than comparable backbones.
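
The effect of threshold-based allocation on the token budget can be sketched as follows. This is a single allocation round on a hand-written score grid, not ARTA's learned, iterative allocator; the threshold value, the 2x2 split factor, and the function name are assumptions for illustration.

```python
import numpy as np

def allocate_tokens(boundary_score, threshold=0.2, split=2):
    """Mark patches whose boundary score exceeds the threshold for refinement.

    Refined patches are replaced by split*split fine tokens; the rest stay
    as a single coarse token.
    """
    refine = boundary_score > threshold
    tokens_per_patch = np.where(refine, split * split, 1)
    return refine, tokens_per_patch

# Hypothetical boundary scores for a 4x4 grid of coarse patches.
scores = np.array([
    [0.05, 0.10, 0.60, 0.05],
    [0.05, 0.30, 0.90, 0.10],
    [0.00, 0.05, 0.25, 0.05],
    [0.00, 0.00, 0.10, 0.00],
])
refine, tokens = allocate_tokens(scores)
total_mixed = int(tokens.sum())   # coarse tokens plus fine tokens near boundaries
total_dense = scores.size * 4     # cost of using fine tokens everywhere
```

In this toy grid only four patches are refined, so the mixed-resolution layout uses 28 tokens where a uniformly fine layout would use 64.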

[67] PhysVid: Physics Aware Local Conditioning for Generative Video Models cs.CV | cs.AIPDF

Saurabh Pathak, Elahe Arani, Mykola Pechenizkiy, Bahram Zonooz

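TL;DR: This paper proposes PhysVid, a physics-aware local conditioning scheme for generative video models. During training, temporally contiguous chunks of frames are annotated with physics-grounded descriptions of states, interactions, and constraints, which are fused with the global prompt. At inference, negative prompts that describe local violations of physical laws steer generation, addressing the tendency of existing generative video models to violate physical commonsense.

Details

Motivation: Existing generative video models achieve high visual fidelity but often violate basic physical principles, which limits their reliability in the real world. Prior approaches to injecting physics rely either on domain-specific, short-horizon frame-level signals or on coarse, noisy global text prompts, and neither captures fine-grained dynamics.

Result: On the VideoPhy benchmark, PhysVid improves physical commonsense scores by about 33% over baseline video generators, and on VideoPhy2 by up to about 8%, indicating a substantial gain in the physical plausibility of generated videos.

Insight: The novelty is a local physics conditioning scheme over contiguous frame chunks together with "negative physics prompts" as an inference-time guidance mechanism. Viewed objectively, refining physical constraints from the global or single-frame level down to local temporal chunks and fusing them with the text prompt is a novel and effective way to improve physical plausibility in generated video.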

Abstract: Generative video models achieve high visual fidelity but often violate basic physical principles, limiting reliability in real-world settings. Prior attempts to inject physics rely on conditioning: frame-level signals are domain-specific and short-horizon, while global text prompts are coarse and noisy, missing fine-grained dynamics. We present PhysVid, a physics-aware local conditioning scheme that operates over temporally contiguous chunks of frames. Each chunk is annotated with physics-grounded descriptions of states, interactions, and constraints, which are fused with the global prompt via chunk-aware cross-attention during training. At inference, we introduce negative physics prompts (descriptions of locally relevant law violations) to steer generation away from implausible trajectories. On VideoPhy, PhysVid improves physical commonsense scores by $\approx 33\%$ over baseline video generators, and by up to $\approx 8\%$ on VideoPhy2. These results show that local, physics-aware guidance substantially increases physical plausibility in generative video and marks a step toward physics-grounded video models.
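
The abstract does not say how the negative physics prompts enter the sampler. One common way to use a negative prompt in diffusion models is classifier-free guidance with the negative-conditioned prediction as the reference point; the sketch below shows that generic formulation only, as an assumption, and should not be read as PhysVid's actual mechanism. The function name and guidance scale are hypothetical.

```python
import numpy as np

def guided_prediction(pred_positive, pred_negative, guidance_scale=5.0):
    """Generic negative-prompt guidance (assumed, not taken from the paper).

    Moves the denoiser output away from the prediction conditioned on the
    negative prompt and toward the prediction conditioned on the positive one.
    """
    return pred_negative + guidance_scale * (pred_positive - pred_negative)

# Toy denoiser outputs for one chunk of frames (made-up numbers).
pred_pos = np.array([1.0, 2.0])   # conditioned on the global and chunk prompts
pred_neg = np.array([0.0, 1.0])   # conditioned on a described law violation
steered = guided_prediction(pred_pos, pred_neg, guidance_scale=3.0)
```

With a guidance scale of 1 the function returns the positive prediction unchanged; larger scales extrapolate further away from the negative-conditioned prediction.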

[68] Preference-Aligned LoRA Merging: Preserving Subspace Coverage and Addressing Directional Anisotropy cs.CV | cs.AI | cs.LGPDF

Wooseong Jeong, Wonyoung Lee, Kuk-Jin Yoon

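TL;DR: This paper proposes TARA-Merging (Task-Rank Anisotropy Alignment), a new method for merging multiple Low-Rank Adaptation (LoRA) modules. It aligns merging weights with a preference-weighted cross-entropy pseudo-loss while preserving task-relevant LoRA subspaces. The goal is to handle LoRA directions that span different subspaces and contribute unevenly, ensuring broad subspace coverage and mitigating directional anisotropy.

Details

Motivation: Merging multiple LoRA modules is promising for building general-purpose systems, but LoRA update directions span different subspaces and contribute unevenly. Naive merging weakens the directions most critical to certain task losses while overemphasizing relatively minor ones, ultimately reducing the model's ability to represent all tasks faithfully.

Result: Across eight vision and six natural language inference (NLI) benchmarks, TARA-Merging consistently outperforms vanilla and LoRA-aware baselines, showing strong robustness and generalization.

Insight: The novelty is revisiting LoRA merging from two perspectives: subspace coverage (how broadly LoRA directions cover diverse representational directions) and anisotropy (the imbalance of influence across those directions). Direction-wise reweighting is proposed to ensure broad subspace coverage and mitigate anisotropy, offering a new approach to fusing multi-task LoRA modules.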

Abstract: Merging multiple Low-Rank Adaptation (LoRA) modules is promising for constructing general-purpose systems, yet challenging because LoRA update directions span different subspaces and contribute unevenly. When merged naively, such mismatches can weaken the directions most critical to certain task losses while overemphasizing relatively less important ones, ultimately reducing the model’s ability to represent all tasks faithfully. We revisit this problem through two perspectives: subspace coverage, which captures how broadly LoRA directions cover diverse representational directions, and anisotropy, which reflects the imbalance of influence across those directions. We propose TARA-Merging (Task-Rank Anisotropy Alignment), which aligns merging weights using a preference-weighted cross-entropy pseudo-loss while preserving task-relevant LoRA subspaces. This ensures broad subspace coverage and mitigates anisotropy via direction-wise reweighting. Across eight vision and six NLI benchmarks, TARA-Merging consistently outperforms vanilla and LoRA-aware baselines, demonstrating strong robustness and generalization, and highlighting the importance of addressing both subspace coverage and anisotropy in LoRA merging.

[69] SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning cs.CV | cs.LGPDF

Cai Selvas-Sala, Lei Kang, Lluis Gomez

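TL;DR: This paper introduces SALMUBench, a benchmark for evaluating sensitive association-level multimodal unlearning. It is built on a synthetic dataset of 60K persona-attribute associations, with a model contaminated by the sensitive data and a clean model, and uses structured holdout sets to measure unlearning efficacy and side effects precisely.

Details

Motivation: As multimodal models such as CLIP are widely used in downstream systems, the need to remove sensitive information is increasingly urgent. However, machine unlearning for contrastively trained encoders remains underexplored, and existing evaluations cannot diagnose fine-grained, association-level forgetting.

Result: The benchmark shows that utility-efficient deletion is feasible, but current methods exhibit distinct failure modes: they either fail to forget effectively or over-generalize and erase more than intended. SALMUBench sets a new standard for comprehensive unlearning evaluation.

Insight: The novelty is the first benchmark focused on sensitive association-level multimodal unlearning, together with an evaluation protocol based on structured holdout sets (holdout identity, holdout association) that precisely quantifies unlearning efficacy and collateral damage. This gives future machine unlearning research a finer-grained evaluation tool and dataset.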

Abstract: As multimodal models like CLIP become integral to downstream systems, the need to remove sensitive information is critical. However, machine unlearning for contrastively-trained encoders remains underexplored, and existing evaluations fail to diagnose fine-grained, association-level forgetting. We introduce SALMUBench (Sensitive Association-Level Multimodal Unlearning), a benchmark built upon a synthetic dataset of 60K persona-attribute associations and two foundational models: a Compromised model polluted with this data, and a Clean model without it. To isolate unlearning effects, both are trained from scratch on the same 400M-pair retain base, with the Compromised model additionally trained on the sensitive set. We propose a novel evaluation protocol with structured holdout sets (holdout identity, holdout association) to precisely measure unlearning efficacy and collateral damage. Our benchmark reveals that while utility-efficient deletion is feasible, current methods exhibit distinct failure modes: they either fail to forget effectively or over-generalize by erasing more than intended. SALMUBench sets a new standard for comprehensive unlearning evaluation, and we publicly release our dataset, models, evaluation scripts, and leaderboards to foster future research.

[70] Label-Free Cross-Task LoRA Merging with Null-Space Compression cs.CV | cs.AI | cs.LGPDF

Wonyoung Lee, Wooseong Jeong, Kuk-Jin Yoon

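TL;DR: This paper proposes Null-Space Compression (NSC) Merging, a label-free, output-agnostic method for merging LoRA adapters that were fine-tuned independently on different tasks. It sets merge weights from adapter geometry by analyzing how strongly the down-projection factor A compresses its null space, and it generalizes to heterogeneous tasks such as classification, regression, and sequence generation without task labels or output information.

Details

Motivation: Existing model merging methods mainly suit homogeneous settings (for example, all classification tasks) and often fail on heterogeneous tasks that span classification and regression. Entropy-based surrogate methods do not apply to regression and are computationally expensive for large language models with long token sequences. A merging method that is both efficient and general across heterogeneous tasks is therefore needed.

Result: NSC achieves state-of-the-art performance on twenty heterogeneous vision tasks with balanced gains, whereas prior methods overfit subsets of tasks. It also outperforms baselines on six natural language inference benchmarks and on vision-language evaluations such as visual question answering and image captioning, demonstrating scalability and effectiveness.

Insight: The core novelty is the observation that during LoRA fine-tuning the down-projection factor A compresses its null space and that the degree of compression correlates with task performance, which makes it usable as an optimization signal for cross-task merging. This yields a general merging strategy that needs no task labels or outputs and relies only on the adapters' internal geometry, making it suitable for heterogeneous task settings.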

Abstract: Model merging combines independently fine-tuned checkpoints without joint multi-task training. In the era of foundation models, fine-tuning with Low-Rank Adaptation (LoRA) is prevalent, making LoRA merging a promising target. Existing approaches can work in homogeneous settings where all target tasks are classification but often fail when tasks span classification and regression. Approaches using entropy-based surrogates do not apply to regression and are costly for large language models due to long token sequences. We introduce Null-Space Compression (NSC) Merging, a label-free, output-agnostic method that sets merge weights from adapter geometry. Our key observation is that during LoRA fine-tuning the down-projection factor $A$ in $\Delta W = BA$ compresses its null space, and the compression correlates with performance. NSC uses this as an optimization signal for merging that can generalize across classification, regression, and sequence generation. NSC achieves state-of-the-art performance across twenty heterogeneous vision tasks with balanced gains where prior methods overfit subsets of tasks. It also outperforms baselines on six NLI benchmarks and on vision-language evaluations for VQA and image captioning, demonstrating scalability and effectiveness.
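
The objects named in the abstract can be made concrete with a short sketch: the LoRA update as the product of an up-projection B and a down-projection A, a plain weighted merge of several such updates, and the dimension of the null space of A. How NSC measures "compression" and derives merge weights from it is not described in the abstract, so the merge weights below are placeholders and the helper names are hypothetical.

```python
import numpy as np

def lora_delta(B, A):
    """LoRA weight update: B is (d_out, r), A is (r, d_in)."""
    return B @ A

def merge_loras(adapters, weights):
    """Weighted sum of LoRA updates; the weights are placeholders here."""
    return sum(w * lora_delta(B, A) for (B, A), w in zip(adapters, weights))

def null_space_dim(A):
    """Dimension of {x : A x = 0}, which equals d_in - rank(A)."""
    return A.shape[1] - int(np.linalg.matrix_rank(A))

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2                      # illustrative sizes (assumed)
adapters = [(rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in)))
            for _ in range(2)]
merged = merge_loras(adapters, weights=[0.5, 0.5])
nullity = null_space_dim(adapters[0][1])
```

With rank r = 2 and input width 6, each A annihilates a 4-dimensional subspace of inputs, and the merged update of two adapters has rank at most 4.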

[71] Mitigating the Reasoning Tax in Vision-Language Fine-Tuning with Input-Adaptive Depth Aggregation cs.CV | cs.AIPDF

Yiming Ren, Yujiu Yang, Junjie Wang

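TL;DR: This paper targets the "reasoning tax" in vision-language models (VLMs), where supervised fine-tuning (SFT) on visual instruction data improves perception but degrades reasoning. It proposes Input-Adaptive Depth Aggregation (IADA), a lightweight mechanism that makes cross-depth retrieval input-adaptive and modality-aware and parameterizes it through a low-rank bottleneck. IADA effectively restores reasoning ability and is validated on Qwen3-VL-2B.

Details

Motivation: The paper addresses the "reasoning tax" that appears after visual instruction fine-tuning of VLMs, where perception improves but reasoning degrades, and investigates whether it is related to disrupted access to depth-wise representations.

Result: On Qwen3-VL-2B, compared with LoRA-only fine-tuning, IADA improves the average reasoning score by 9.5 points and the average perception score by 3.3 points while adding only 0.14M parameters, with the strongest gains in parameter-efficient low-rank settings.

Insight: The novelty is showing that preserved cross-depth access is an important missing factor in VLM fine-tuning, and proposing IADA, a lightweight, adaptive, parameter-efficient mechanism that dynamically integrates features from different depths to improve both reasoning and perception.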

Abstract: Supervised fine-tuning (SFT) on visual instruction data often improves perceptual capabilities in vision-language models (VLMs) while degrading reasoning performance, creating a persistent reasoning tax during post-training. We investigate whether this degradation is related to disrupted access to depth-wise representations, and find that even fixed cross-depth aggregation substantially restores reasoning, suggesting that preserved cross-depth access is an important missing factor in VLM fine-tuning. Building on this observation, we propose Input-Adaptive Depth Aggregation (IADA), a lightweight mechanism that makes cross-depth retrieval input-adaptive, modality-aware, and efficiently parameterized through a low-rank bottleneck. On Qwen3-VL-2B, IADA improves the average reasoning score by 9.5 points and the average perception score by 3.3 points over LoRA-only fine-tuning with only 0.14M additional parameters, with the strongest gains appearing in parameter-efficient low-rank settings.
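
To give a feel for what input-adaptive cross-depth aggregation with a low-rank bottleneck could look like, here is a minimal sketch. It is not IADA's architecture: the gating form, the use of the last layer as the query, and the matrix names `W_down` and `W_up` are assumptions, and the modality-aware part is omitted entirely.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_depth_aggregation(hidden_states, W_down, W_up):
    """Mix per-layer hidden states with token-dependent weights.

    hidden_states: (num_layers, seq_len, d)
    W_down: (d, r) and W_up: (r, num_layers) form the low-rank bottleneck.
    Returns the mixed states (seq_len, d) and the depth weights (seq_len, num_layers).
    """
    query = hidden_states[-1]                       # last layer drives the gate
    logits = (query @ W_down) @ W_up                # (seq_len, num_layers)
    weights = softmax(logits, axis=-1)
    mixed = np.einsum("sl,lsd->sd", weights, hidden_states)
    return mixed, weights

rng = np.random.default_rng(0)
num_layers, seq_len, d, r = 4, 5, 32, 2             # illustrative sizes (assumed)
hidden = rng.normal(size=(num_layers, seq_len, d))
W_down = rng.normal(size=(d, r)) * 0.1
W_up = rng.normal(size=(r, num_layers)) * 0.1
mixed, depth_weights = adaptive_depth_aggregation(hidden, W_down, W_up)
extra_params = W_down.size + W_up.size              # d*r + r*num_layers
```

The bottleneck keeps the added parameter count at d*r + r*num_layers, which is why such a gate can stay small relative to the backbone. Setting `W_up` to zero makes the weights uniform, which reduces the mechanism to a fixed average over depth.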

[72] From Pixels to Privacy: Temporally Consistent Video Anonymization via Token Pruning for Privacy Preserving Action Recognition cs.CVPDF

Nazia Aslam, Abhisek Ray, Joakim Bruslund Haurum, Lukas Esterle, Kamal Nasrollahi

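TL;DR: This paper proposes an attention-based video anonymization framework that disentangles action-relevant features from privacy-sensitive features. It uses attention distributions in a Vision Transformer to compute a utility-privacy score for each spatiotemporal tubelet and selectively prunes privacy-dominated tubelets, substantially reducing privacy leakage while preserving action recognition performance.

Details

Motivation: As large-scale video models are widely deployed in surveillance, healthcare, and entertainment, they encode sensitive attributes such as facial identity, race, and gender, which amplifies privacy risks. Image anonymization has been studied extensively, but video anonymization remains relatively underexplored, even though modern video models can exploit spatiotemporal motion patterns as biometric identifiers. An effective video anonymization method is therefore needed.

Result: Extensive experiments show that the method maintains action recognition performance comparable to models trained on raw videos while substantially reducing privacy leakage. The results indicate that attention-driven spatiotemporal pruning is an effective and principled solution for privacy-preserving video analytics.

Insight: The novelty is using the attention mechanism of Vision Transformers to explicitly separate action-relevant information from privacy-sensitive content: an action CLS token and a privacy CLS token learn complementary representations, and contrasting their attention distributions yields a utility-privacy score for selective pruning. Viewed objectively, the method integrates anonymization into the model architecture rather than applying it as post-processing, providing an end-to-end privacy-preserving framework.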

Abstract: Recent advances in large-scale video models have significantly improved video understanding across domains such as surveillance, healthcare, and entertainment. However, these models also amplify privacy risks by encoding sensitive attributes, including facial identity, race, and gender. While image anonymization has been extensively studied, video anonymization remains relatively underexplored, even though modern video models can leverage spatiotemporal motion patterns as biometric identifiers. To address this challenge, we propose a novel attention-driven spatiotemporal video anonymization framework based on systematic disentanglement of utility and privacy features. Our key insight is that attention mechanisms in Vision Transformers (ViTs) can be explicitly structured to separate action-relevant information from privacy-sensitive content. Building on this insight, we introduce two task-specific classification tokens, an action CLS token and a privacy CLS token, that learn complementary representations within a shared Transformer backbone. We contrast their attention distributions to compute a utility-privacy score for each spatiotemporal tubelet, and keep the top-k tubelets with the highest scores. This selectively prunes tubelets dominated by privacy cues while preserving those most critical for action recognition. Extensive experiments demonstrate that our approach maintains action recognition performance comparable to models trained on raw videos, while substantially reducing privacy leakage. These results indicate that attention-driven spatiotemporal pruning offers an effective and principled solution for privacy-preserving video analytics.
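
The score-then-keep-top-k step can be sketched in a few lines. The abstract says the two attention distributions are "contrasted" without giving the formula, so the plain difference used below is an assumption, as are the function name and the toy attention values.

```python
import numpy as np

def select_tubelets(action_attn, privacy_attn, k):
    """Keep the k tubelets with the highest utility-privacy score.

    The score is assumed to be action attention minus privacy attention.
    Returns the kept indices in temporal order and the per-tubelet scores.
    """
    score = action_attn - privacy_attn
    keep = np.argsort(score)[::-1][:k]
    return np.sort(keep), score

# Hypothetical CLS-token attention over five tubelets (each sums to 1).
action_attn = np.array([0.30, 0.05, 0.25, 0.10, 0.30])
privacy_attn = np.array([0.05, 0.40, 0.10, 0.40, 0.05])
kept, score = select_tubelets(action_attn, privacy_attn, k=3)
```

In this toy case tubelets 1 and 3 receive most of the privacy token's attention and are pruned, while tubelets 0, 2, and 4 are kept for action recognition.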

[73] HINT: Composed Image Retrieval with Dual-path Compositional Contextualized Network cs.CVPDF

Mingyu Zhang, Zixu Li, Zhiwei Chen, Zhiheng Fu, Xiaowei Zhu

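TL;DR: This paper proposes HINT, a dual-path compositional contextualized network that addresses the neglect of contextual information in composed image retrieval (CIR). Through contextualized encoding and a difference amplification mechanism, the network improves the model's ability to distinguish matching from non-matching samples in complex scenarios.

Details

Motivation: Existing CIR methods have made progress in cross-modal alignment and feature fusion but generally overlook the key role of contextual information in discriminating matching samples. Addressing this involves two challenges: implicit dependencies and the lack of a difference amplification mechanism.

Result: HINT achieves the best performance on all metrics across two CIR benchmark datasets.

Insight: The novelty is a dual-path compositional contextualized network that explicitly exploits contextual information through contextualized encoding and difference amplification, improving the retrieval accuracy and robustness of CIR models in complex scenarios.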

Abstract: Composed Image Retrieval (CIR) is a challenging image retrieval paradigm. Given a multimodal query composed of a reference image and modification text, it aims to retrieve target images from large-scale image databases that are consistent with the modification semantics. Although existing methods have made significant progress in cross-modal alignment and feature fusion, a key flaw remains: the neglect of contextual information in discriminating matching samples. Addressing this limitation is difficult because of two challenges: 1) implicit dependencies and 2) the lack of a differential amplification mechanism. To address these challenges, we propose a dual-patH composItional coNtextualized neTwork (HINT), which performs contextualized encoding and amplifies the similarity differences between matching and non-matching samples, thus raising the performance ceiling of CIR models in complex scenarios. HINT achieves the best performance on all metrics across two CIR benchmark datasets. Codes are available at https://github.com/zh-mingyu/HINT.

[74] Reflect to Inform: Boosting Multimodal Reasoning via Information-Gain-Driven Verification cs.CV | cs.AIPDF

Shuai Lv, Chang Liu, Feng Tang, Yujie Yuan, Aojun Zhou

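TL;DR: This paper proposes Visual Re-Examination (VRE), a self-evolving training framework that addresses the tendency of multimodal large language models to drift away from image evidence and rely on textual priors during long-form generation, which leads to ungrounded reasoning and hallucinations. Through an information-gain-driven verification mechanism, the framework lets the model perform visual introspection autonomously during reasoning without additional visual inputs, improving the accuracy of multimodal reasoning and the reliability of perception.

Details

Motivation: The authors identify a common failure mode of multimodal large language models in long-form generation: as outputs grow longer, the model moves away from image evidence and relies on textual priors, producing ungrounded reasoning and hallucinations. Attention analysis shows that models have a latent capability for late-stage visual verification that is not consistently activated, so the goal is a method that activates and exploits this capability autonomously.

Result: Extensive experiments on multiple multimodal benchmarks show that VRE consistently improves reasoning accuracy and perceptual reliability and substantially reduces hallucinations, especially in long-chain reasoning settings.

Insight: The novelty is an information-gain-driven self-verification mechanism (VRE) that does not distill visual capabilities from a stronger teacher. Instead, the model generates its own reflection traces to drive iterative self-improvement, making visual information actionable during reasoning. This is a novel training paradigm that uses the model's own latent potential for self-correction and enhancement.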

Abstract: Multimodal Large Language Models (MLLMs) achieve strong multimodal reasoning performance, yet we identify a recurring failure mode in long-form generation: as outputs grow longer, models progressively drift away from image evidence and fall back on textual priors, resulting in ungrounded reasoning and hallucinations. Interestingly, based on attention analysis, we find that MLLMs have a latent capability for late-stage visual verification that is present but not consistently activated. Motivated by this observation, we propose Visual Re-Examination (VRE), a self-evolving training framework that enables MLLMs to autonomously perform visual introspection during reasoning without additional visual inputs. Rather than distilling visual capabilities from a stronger teacher, VRE promotes iterative self-improvement by leveraging the model itself to generate reflection traces, making visual information actionable through information gain. Extensive experiments across diverse multimodal benchmarks demonstrate that VRE consistently improves reasoning accuracy and perceptual reliability, while substantially reducing hallucinations, especially in long-chain settings. Code is available at https://github.com/Xiaobu-USTC/VRE.

[75] Only Whats Necessary: Pareto Optimal Data Minimization for Privacy Preserving Video Anomaly Detection cs.CVPDF

Nazia Aslam, Abhisek Ray, Thomas B. Moeslund, Kamal Nasrollahi

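TL;DR: This paper proposes "Only What's Necessary", a privacy-by-design framework for video anomaly detection (VAD). By combining breadth-based and depth-based data minimization mechanisms, the framework suppresses personally identifiable information (PII) while preserving cues relevant to anomaly detection, balancing privacy protection against detection performance.

Details

Motivation: Video anomaly detection systems need large amounts of data, which often contain personally identifiable information, so their deployment faces compliance challenges under data protection regulations such as the GDPR. The paper asks how to protect privacy under the data minimization principle without significantly harming detection performance.

Result: Extensive experiments on public datasets show that the framework is effective. Pareto analysis and ranking-based methods identify "sweet spot" operating points on the non-dominated frontier that minimize personal data exposure with limited degradation in detection performance.

Insight: The novelty is a privacy-by-design framework that combines breadth-based (for example, cropping) and depth-based (for example, blurring) data minimization mechanisms and systematically uses Pareto analysis to quantify the privacy-utility trade-off, providing actionable configurations for compliant VAD systems.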

Abstract: Video anomaly detection (VAD) systems are increasingly deployed in safety-critical environments and require a large amount of data for accurate detection. However, such data may contain personally identifiable information (PII), including facial cues and sensitive demographic attributes, creating compliance challenges under the EU General Data Protection Regulation (GDPR). In particular, GDPR requires that personal data be limited to what is strictly necessary for a specified processing purpose. To address this, we introduce Only What’s Necessary, a privacy-by-design framework for VAD that explicitly controls the amount and type of visual information exposed to the detection pipeline. The framework combines breadth-based and depth-based data minimization mechanisms to suppress PII while preserving cues relevant to anomaly detection. We evaluate a range of minimization configurations by feeding the minimized videos to both a VAD model and a privacy inference model. We employ two ranking-based methods, along with Pareto analysis, to characterize the resulting trade-off between privacy and utility. From the non-dominated frontier, we identify sweet-spot operating points that minimize personal data exposure with limited degradation in detection performance. Extensive experiments on publicly available datasets demonstrate the effectiveness of the proposed framework.
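
Finding the non-dominated frontier among minimization configurations is a standard computation and can be sketched directly. The configuration values below are invented for illustration and are not results from the paper; each point pairs a privacy-leakage score (lower is better) with a detection score (higher is better), and the metric names are assumptions.

```python
def pareto_front(points):
    """Return indices of non-dominated points.

    Each point is (privacy_leakage, detection_score). Point j dominates
    point i if it is no worse on both objectives and strictly better on one.
    """
    front = []
    for i, (leak_i, det_i) in enumerate(points):
        dominated = any(
            leak_j <= leak_i and det_j >= det_i
            and (leak_j < leak_i or det_j > det_i)
            for j, (leak_j, det_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front

# Hypothetical (leakage, detection) pairs for six minimization configurations.
configs = [
    (0.90, 0.85),  # raw video
    (0.60, 0.84),
    (0.40, 0.80),
    (0.45, 0.70),
    (0.20, 0.65),
    (0.60, 0.78),
]
front = pareto_front(configs)
```

Configurations 3 and 5 are dominated (configuration 2 leaks less and detects better than both), so the frontier is indices 0, 1, 2, and 4. A "sweet spot" would then be picked from this frontier according to how much detection loss is acceptable.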

[76] From Pen to Pixel: Translating Hand-Drawn Plots into Graphical APIs via a Novel Benchmark and Efficient Adapter cs.CVPDF

Zhenghao Xu, Mengning Yang

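TL;DR: This paper introduces HDpy-13, a hand-drawn plot dataset for improving neural networks that recommend graphical APIs from hand-drawn plot images, and designs an efficient Plot-Adapter to address the parameter growth and computational cost of multi-domain, multi-language tasks.

Details

Motivation: Existing Plot2API research focuses mainly on standard plot images and overlooks hand-drawn plots, which are more accessible to non-experts and beginners. Existing models, including multimodal large language models, struggle to recommend APIs for hand-drawn plots because of the domain gap and a lack of expertise.

Result: Experiments show that both the HDpy-13 dataset and Plot-Adapter are effective. Plot-Adapter uses a lightweight CNN block to improve local feature capture and shares projection matrices to further reduce the number of fine-tuning parameters.

Insight: The novelty is the first hand-drawn plot dataset, which fills a gap in the field, together with a parameter-efficient adapter architecture. Modular adapter training and feature enhancement improve cross-domain task performance while lowering computational cost.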

Abstract: As plots play a critical role in modern data visualization and analysis, Plot2API was introduced to help non-experts and beginners create their desired plots by using neural networks to recommend graphical APIs directly from reference plot images. However, previous works on Plot2API have primarily focused on recommendation for standard plot images, while overlooking the hand-drawn plot images that are more accessible to non-experts and beginners. To make matters worse, both Plot2API models trained on standard plot images and powerful multi-modal large language models struggle to effectively recommend APIs for hand-drawn plot images due to the domain gap and lack of expertise. To support non-experts and beginners, we introduce a hand-drawn plot dataset named HDpy-13 to improve the performance of graphical API recommendations for hand-drawn plot images. Additionally, to alleviate the considerable strain of parameter growth and computational resource costs arising from multi-domain and multi-language challenges in Plot2API, we propose Plot-Adapter, which allows separate adapters to be trained and stored rather than requiring an entire model for each language and domain. In particular, Plot-Adapter incorporates a lightweight CNN block to improve the ability to capture local features and implements projection matrix sharing to further reduce the number of fine-tuning parameters. Experimental results demonstrate both the effectiveness of HDpy-13 and the efficiency of Plot-Adapter.

[77] HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models cs.CVPDF

MD Khalequzzaman Chowdhury Sayem, Mubarrat Tajoar Chowdhury, Yihalem Yimolal Tiruneh, Muneeb A. Khan, Muhammad Salman Ali

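TL;DR: This paper introduces HandVQA, a large-scale diagnostic benchmark for diagnosing and improving fine-grained spatial reasoning about hands in vision-language models. Built on high-quality 3D hand datasets, it contains over 1.6M multiple-choice questions that evaluate understanding of spatial relationships between hand joints, such as angles, distances, and relative positions. The study finds systematic deficiencies in current state-of-the-art VLMs, and shows that spatial knowledge learned by fine-tuning on the benchmark transfers zero-shot to downstream tasks such as novel hand gesture recognition, with significant gains.

Details

Motivation: Understanding the fine-grained articulation of human hands is critical in high-stakes settings such as robot-assisted surgery, chip manufacturing, and AR/VR human-AI interaction. Although current vision-language models reach near-human performance on general benchmarks, they still struggle with fine-grained spatial reasoning, especially in interpreting complex hand poses.

Result: The paper evaluates several SOTA VLMs (LLaVA, DeepSeek, and Qwen-VL) and finds systematic limitations, including hallucinated finger parts, incorrect geometric interpretations, and poor generalization. After LoRA fine-tuning on HandVQA, the learned 3D spatial knowledge transfers zero-shot and significantly improves accuracy on novel downstream tasks such as hand gesture recognition (+10.33%) and hand-object interaction (+2.63%).

Insight: The novelty is a diagnostic benchmark focused on fine-grained hand spatial reasoning that systematically exposes the core deficiencies of current VLMs in this area. The key insight is that fine-tuning on a controlled, large-scale VQA task grounded in high-quality 3D data can inject structured spatial knowledge into VLMs and generalize zero-shot to unseen tasks, providing a verifiable path to better fine-grained spatial understanding.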

Abstract: Understanding the fine-grained articulation of human hands is critical in high-stakes settings such as robot-assisted surgery, chip manufacturing, and AR/VR-based human-AI interaction. Despite achieving near-human performance on general vision-language benchmarks, current vision-language models (VLMs) struggle with fine-grained spatial reasoning, especially in interpreting complex and articulated hand poses. We introduce HandVQA, a large-scale diagnostic benchmark designed to evaluate VLMs’ understanding of detailed hand anatomy through visual question answering. Built upon high-quality 3D hand datasets (FreiHAND, InterHand2.6M, FPHA), our benchmark includes over 1.6M controlled multiple-choice questions that probe spatial relationships between hand joints, such as angles, distances, and relative positions. We evaluate several state-of-the-art VLMs (LLaVA, DeepSeek, and Qwen-VL) in both base and fine-tuned settings, using lightweight fine-tuning via LoRA. Our findings reveal systematic limitations in current models, including hallucinated finger parts, incorrect geometric interpretations, and poor generalization. HandVQA not only exposes these critical reasoning gaps but also provides a validated path to improvement. We demonstrate that the 3D-grounded spatial knowledge learned from our benchmark transfers in a zero-shot setting, significantly improving model accuracy on novel downstream tasks such as hand gesture recognition (+10.33%) and hand-object interaction (+2.63%).

[78] Dynamic Token Compression for Efficient Video Understanding through Reinforcement Learning cs.CVPDF

Shida Wang, YongXiang Hua, Zhou Tao, Haoyu Cao, Linli Xu

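TL;DR: This paper proposes SCORE, a framework that uses reinforcement learning to perform dynamic token compression for video understanding, addressing the high computational cost and performance degradation that visual token redundancy causes in multimodal large language models. It introduces a lightweight policy network conditioned on a surprise-augmented state representation, uses inter-frame residuals to capture temporal dynamics and motion saliency, and optimizes the policy with a two-stage curriculum.

Details

Motivation: Existing video token compression methods typically rely on heuristics or fixed transformations that are decoupled from downstream task objectives, which limits their adaptability and effectiveness. The paper aims to learn an adaptive token compression policy with reinforcement learning to improve compression efficiency while maintaining model performance.

Result: On multiple video understanding benchmarks, SCORE significantly outperforms state-of-the-art baselines. At a 10% retention ratio it achieves a 16x prefill speedup while preserving 99.5% of the original performance.

Insight: The novelties include a surprise-augmented state representation that captures temporal dynamics and an optimization scheme based on group-wise reinforcement learning with a split-advantage estimator. Viewed objectively, the method combines reinforcement learning with token compression and offers a scalable solution for long-form video understanding.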

Abstract: Multimodal Large Language Models have demonstrated remarkable capabilities in video understanding, yet face prohibitive computational costs and performance degradation from "context rot" due to massive visual token redundancy. Existing compression strategies typically rely on heuristics or fixed transformations that are often decoupled from the downstream task objectives, limiting their adaptability and effectiveness. To address this, we propose SCORE (Surprise-augmented token COmpression via REinforcement learning), a unified framework that learns an adaptive token compression policy. SCORE introduces a lightweight policy network conditioned on a surprise-augmented state representation that incorporates inter-frame residuals to explicitly capture temporal dynamics and motion saliency. We optimize this policy using a group-wise reinforcement learning scheme with a split-advantage estimator, stabilized by a two-stage curriculum transferring from static pseudo-videos to real dynamic videos. Extensive experiments on diverse video understanding benchmarks demonstrate that SCORE significantly outperforms state-of-the-art baselines. Notably, SCORE achieves a 16x prefill speedup while preserving 99.5% of original performance at a 10% retention ratio, offering a scalable solution for efficient long-form video understanding.
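
The "surprise" signal built from inter-frame residuals can be illustrated with a toy example. Note the substitution: SCORE learns its compression policy with reinforcement learning, whereas the sketch below simply keeps the tokens with the largest residual norm, a fixed heuristic used here only to show what the residual signal and a 10% retention ratio look like. Function names and tensor sizes are hypothetical.

```python
import numpy as np

def surprise_scores(frames):
    """Per-token norm of the residual to the previous frame.

    frames: (T, N, d) token features. The first frame has no predecessor,
    so its residual is defined as zero.
    """
    residual = np.diff(frames, axis=0, prepend=frames[:1])
    return np.linalg.norm(residual, axis=-1)          # (T, N)

def keep_top_tokens(frames, retention=0.10):
    """Boolean (T, N) mask keeping the most 'surprising' tokens (heuristic stand-in)."""
    scores = surprise_scores(frames)
    flat = scores.ravel()
    k = max(1, int(round(retention * flat.size)))
    keep = np.argsort(flat)[::-1][:k]
    mask = np.zeros(flat.size, dtype=bool)
    mask[keep] = True
    return mask.reshape(scores.shape)

rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 10, 8))                 # 10 frames x 10 tokens (assumed)
mask = keep_top_tokens(frames, retention=0.10)
kept = int(mask.sum())
```

With 100 tokens and a 10% retention ratio, 10 tokens survive, which is the kind of reduction that shortens the prefill stage of the language model.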

[79] SHANDS: A Multi-View Dataset and Benchmark for Surgical Hand-Gesture and Error Recognition Toward Medical Training cs.CVPDF

Le Ma, Thiago Freitas dos Santos, Nadia Magnenat-Thalmann, Katarzyna Wac

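TL;DR: This paper introduces SHands, a large-scale multi-view video dataset for surgical hand-gesture and error recognition, intended to support automated AI-based assessment in medical training. The dataset contains videos of 52 participants (20 experts and 32 trainees) performing linear incision and suturing, captured by five RGB cameras from complementary viewpoints and annotated with 15 gesture primitives and 8 trainee error types.

Details

Motivation: Surgical training relies on expert assessment, which is costly, time-limited, and hard to scale, and existing datasets lack the realistic trainee errors and multi-view variability needed to train robust computer vision methods.

Result: State-of-the-art deep learning models are benchmarked on the SHands dataset, and standardized evaluation protocols are defined for single-view, multi-view, and cross-view generalization.

Insight: The novelty is a large-scale dataset with multi-view video and a clinically validated error taxonomy that supports both gesture recognition and error detection, helping to develop robust and scalable AI systems for surgical training grounded in domain knowledge.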

Abstract: In surgical training for medical students, proficiency development relies on expert-led skill assessment, which is costly, time-limited, and difficult to scale, and the required expertise remains confined to institutions with available specialists. Automated AI-based assessment offers a viable alternative, but progress is constrained by the lack of datasets containing realistic trainee errors and the multi-view variability needed to train robust computer vision approaches. To address this gap, we present Surgical-Hands (SHands), a large-scale multi-view video dataset for surgical hand-gesture and error recognition for medical training. SHands captures linear incision and suturing using five RGB cameras from complementary viewpoints, performed by 52 participants (20 experts and 32 trainees), each completing three standardized trials per procedure. The videos are annotated at the frame level with 15 gesture primitives and include a validated taxonomy of 8 trainee error types, enabling both gesture recognition and error detection. We further define standardized evaluation protocols for single-view, multi-view, and cross-view generalization, and benchmark state-of-the-art deep learning models on the dataset. SHands is publicly released to support the development of robust and scalable AI systems for surgical training grounded in clinically curated domain knowledge.

[80] CPUBone: Efficient Vision Backbone Design for Devices with Low Parallelization Capabilities cs.CV | cs.AI

Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni

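TL;DR: This paper proposes CPUBone, a new vision backbone architecture optimized for CPU devices with limited parallel processing capability. It studies two modifications to standard convolutions, grouping convolutions and reducing kernel sizes, which lower the computational cost (MACs) while preserving hardware-efficient execution (high MACpS). The result is the best speed-accuracy trade-off (SAT) across a range of CPU devices, with the efficiency carrying over to downstream tasks such as object detection and semantic segmentation.

Details

Motivation: Current research on vision backbone architectures focuses mainly on hardware platforms with high parallel processing capability (such as mobile phones and embedded AI accelerators). CPUs have limited parallelism and call for a design philosophy that balances the number of operations (MACs) against hardware-efficient execution (MACs per second, MACpS). The paper aims to design efficient vision backbones for CPU inference.

Result: Experiments across diverse CPU devices show that the proposed modifications retain high hardware efficiency on CPUs. CPUBone models achieve state-of-the-art (SOTA) speed-accuracy trade-offs (SATs) across a wide range of CPU devices, and their efficiency transfers to downstream tasks such as object detection and semantic segmentation.

Insight: The novelty lies in systematically studying, for the low-parallelism setting of CPUs, how grouped convolutions and smaller kernels affect computational cost (MACs) and hardware efficiency (MACpS), and in designing the CPUBone backbone family on that basis. The core insight is that CPU optimization must keep MACpS high while reducing total MACs in order to ensure low latency, which differs from design principles aimed at highly parallel hardware.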

Abstract: Recent research on vision backbone architectures has predominantly focused on optimizing efficiency for hardware platforms with high parallel processing capabilities. This category increasingly includes embedded systems such as mobile phones and embedded AI accelerator modules. In contrast, CPUs cannot parallelize operations in the same manner, so models for them benefit from a design philosophy that balances the number of operations (MACs) against hardware-efficient execution, measured as high MACs per second (MACpS). In pursuit of this, we investigate two modifications to standard convolutions, aimed at reducing computational cost: grouping convolutions and reducing kernel sizes. While both adaptations substantially decrease the total number of MACs required for inference, sustaining low latency necessitates preserving hardware-efficiency. Our experiments across diverse CPU devices confirm that these adaptations successfully retain high hardware-efficiency on CPUs. Based on these insights, we introduce CPUBone, a new family of vision backbone models optimized for CPU-based inference. CPUBone achieves state-of-the-art Speed-Accuracy Trade-offs (SATs) across a wide range of CPU devices and effectively transfers its efficiency to downstream tasks such as object detection and semantic segmentation. Models and code are available at https://github.com/altair199797/CPUBone.
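The relationship between MACs, grouping, kernel size, and MACpS can be worked through with the standard MAC count of a 2D convolution. The sketch below is not taken from the CPUBone code base; the layer shape and both latency figures are hypothetical values chosen only to show why a drop in MACs does not by itself guarantee a proportional drop in latency.

```python
def conv2d_macs(h_out, w_out, c_in, c_out, kernel, groups=1):
    """Multiply-accumulate count of a 2D convolution (bias ignored)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return h_out * w_out * c_out * (c_in // groups) * kernel * kernel

def macs_per_second(macs, latency_s):
    """Hardware efficiency as executed MACs per second (MACpS)."""
    return macs / latency_s

# A hypothetical layer: 56x56 output, 64 -> 64 channels.
standard = conv2d_macs(56, 56, 64, 64, kernel=3)
grouped = conv2d_macs(56, 56, 64, 64, kernel=3, groups=4)   # 4x fewer MACs
pointwise = conv2d_macs(56, 56, 64, 64, kernel=1)           # 9x fewer MACs

# Hypothetical latencies: grouping cuts MACs 4x but latency only 2x,
# so MACpS is halved and part of the theoretical saving is lost.
standard_rate = macs_per_second(standard, 0.010)
grouped_rate = macs_per_second(grouped, 0.005)
```

In this made-up case the grouped layer runs at half the MACpS of the standard layer, which is the kind of efficiency loss the abstract says must be avoided for MAC reductions to translate into lower latency.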

[81] ClipTTT: CLIP-Guided Test-Time Training Helps LVLMs See Better cs.CV

Mriganka Nath, Anurag Das, Jiahao Xie, Bernt Schiele

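TL;DR: ClipTTT is a CLIP-guided test-time training method that targets hallucinations produced by large vision-language models (LVLMs) when visual inputs are corrupted at test time. It uses the image-text alignment of a pre-trained CLIP model as a stable guidance signal to identify reliable self-supervision targets, enabling rapid adaptation without altering the base LVLM.

Details

Motivation: LVLMs tend to hallucinate when visual inputs are corrupted; such corruptions act as additional distribution shifts that significantly amplify hallucination rates in real-world applications. The paper aims to adapt to degraded conditions through test-time training and thereby reduce hallucinations.

Result: Extensive experiments on standard hallucination benchmarks with 15 common corruptions show that ClipTTT effectively mitigates hallucinations and improves descriptive faithfulness under visual corruption.

Insight: The novelty is using CLIP's stable alignment signal to guide test-time training, achieving rapid single-sample adaptation without modifying the base model, which offers a lightweight route to LVLM robustness under degraded conditions.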

Abstract: Large vision-language models (LVLMs) tend to hallucinate, especially when visual inputs are corrupted at test time. We show that such corruptions act as additional distribution shifts, significantly amplifying hallucination rates in real-world applications. To address this, we propose CLIP-guided Test-Time Training (ClipTTT), a method to adapt LVLMs under degraded conditions on the fly with a single test sample. Specifically, we leverage the image-text alignment strength of a pre-trained CLIP model as a stable guidance signal to identify reliable self-supervision targets, enabling rapid adaptation without altering the base LVLMs. Extensive experiments on standard hallucination benchmarks, with 15 common corruptions, demonstrate that ClipTTT effectively mitigates hallucinations and improves descriptive faithfulness under visual corruptions.

[82] Learnable Quantum Efficiency Filters for Urban Hyperspectral Segmentation cs.CV

Imad Ali Shah, Jiarong Li, Ethan Delaney, Enda Ward, Martin Glavin

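TL;DR: This paper proposes Learnable Quantum Efficiency (LQE) filters, a physics-inspired, interpretable dimensionality reduction method for urban hyperspectral segmentation. LQE parameterizes smooth high-order spectral response functions that emulate sensor quantum efficiency curves, yielding a compact spectral representation that preserves discriminative information and can be trained end-to-end with semantic segmentation models.

Details

Motivation: Hyperspectral sensing provides rich spectral information for understanding urban driving scenes, but its high dimensionality poses challenges for interpretation and efficient learning. Existing dimensionality reduction methods (both conventional methods and unconstrained learnable layers) lack physically motivated constraints, which can lead to suboptimal representations.

Result: LQE is evaluated systematically on three public multi-class urban driving hyperspectral datasets (HyKo, HSI-Drive, Hyperspectral City) with six semantic segmentation models. Compared with six conventional and seven learnable baseline dimensionality reduction methods, and averaged across all models and configurations, LQE achieves the highest average mIoU: it improves over conventional methods by 2.45%, 0.45%, and 1.04%, and over learnable methods by 1.18%, 1.56%, and 0.81%, on the three datasets respectively. LQE also maintains strong parameter efficiency (12–36 parameters versus 51–22K for competing learnable methods) and competitive inference latency.

Insight: The main novelty is introducing physical constraints (a single dominant peak, smooth responses, bounded bandwidth) into learnable dimensionality reduction, thereby injecting domain knowledge into data-driven learning. This builds a principled bridge between hyperspectral perception and data-driven multispectral sensor design for automotive vision systems, while improving both performance and interpretability. Ablation studies show that low-order configurations are optimal and that the learned spectral filters converge to dataset-intrinsic wavelength patterns.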

Abstract: Hyperspectral sensing provides rich spectral information for scene understanding in urban driving, but its high dimensionality poses challenges for interpretation and efficient learning. We introduce Learnable Quantum Efficiency (LQE), a physics-inspired, interpretable dimensionality reduction (DR) method that parameterizes smooth high-order spectral response functions that emulate plausible sensor quantum efficiency curves. Unlike conventional methods or unconstrained learnable layers, LQE enforces physically motivated constraints, including a single dominant peak, smooth responses, and bounded bandwidth. This formulation yields a compact spectral representation that preserves discriminative information while remaining fully differentiable and end-to-end trainable within semantic segmentation models (SSMs). We conduct systematic evaluations across three publicly available multi-class hyperspectral urban driving datasets, comparing LQE against six conventional and seven learnable baseline DR methods across six SSMs. Averaged across all SSMs and configurations, LQE achieves the highest average mIoU, improving over conventional methods by 2.45%, 0.45%, and 1.04%, and over learnable methods by 1.18%, 1.56%, and 0.81% on HyKo, HSI-Drive, and Hyperspectral City, respectively. LQE maintains strong parameter efficiency (12–36 parameters compared to 51–22K for competing learnable approaches) and competitive inference latency. Ablation studies show that low-order configurations are optimal, while the learned spectral filters converge to dataset-intrinsic wavelength patterns. These results demonstrate that physics-informed spectral learning can improve both performance and interpretability, providing a principled bridge between hyperspectral perception and data-driven multispectral sensor design for automotive vision systems.
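As a rough illustration of what a smooth, single-peak, bandwidth-limited spectral response could look like, the sketch below uses a super-Gaussian curve. The abstract does not give LQE's actual parameterization, so the functional form, the `(center, width, order)` parameters, and the normalized weighted sum are assumptions made for this example only; in a real model these parameters would be learned rather than hand-set.

```python
import math

def response(wavelength, center, width, order=1):
    """Super-Gaussian response: one peak at `center`, smooth everywhere,
    with bandwidth controlled by `width` (assumed form, not LQE's)."""
    return math.exp(-abs((wavelength - center) / width) ** (2 * order))

def reduce_spectrum(spectrum, wavelengths, filters):
    """Project a full spectrum onto a few filter responses.

    `filters` is a list of (center, width, order) tuples; each output
    channel is the response-weighted mean of the input bands.
    """
    reduced = []
    for center, width, order in filters:
        weights = [response(w, center, width, order) for w in wavelengths]
        total = sum(weights)
        reduced.append(sum(wt * s for wt, s in zip(weights, spectrum)) / total)
    return reduced

# Toy pixel: 16 bands from 400 nm to 700 nm with a flat reflectance.
wavelengths = [400 + 20 * i for i in range(16)]
spectrum = [1.0] * 16
filters = [(450, 40, 1), (550, 40, 1), (650, 40, 2)]  # hypothetical values
channels = reduce_spectrum(spectrum, wavelengths, filters)
```

Each filter here needs only three numbers, which gives a sense of how a constrained parameterization can stay far smaller than an unconstrained learnable projection over all bands.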

[83] OVI-MAP: Open-Vocabulary Instance-Semantic Mapping cs.CV

Zilong Deng, Federico Tombari, Marc Pollefeys, Johanna Wald, Daniel Barath

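TL;DR: OVI-MAP is an incremental open-vocabulary 3D instance-semantic mapping system that addresses the perception needs of autonomous agents in complex everyday environments. It decouples instance reconstruction from semantic inference: it builds a class-agnostic 3D instance map and uses vision-language models to extract semantic features from a small set of automatically selected views, enabling stable instance tracking and zero-shot semantic labeling.

Details

Motivation: Existing methods typically rely on the closed-set assumption or dense per-pixel language fusion, which limits scalability and temporal consistency. An incremental open-vocabulary 3D instance-semantic mapping method is therefore needed that offers robust instance segmentation, real-time processing, and flexible open-set reasoning.

Result: The system outperforms state-of-the-art open-vocabulary mapping baselines on standard benchmarks and runs in real time.

Insight: The novelty lies in decoupling instance reconstruction from semantic inference and in extracting semantic features only from a small set of automatically selected views, which improves the stability of instance tracking and the flexibility of zero-shot semantic labeling while maintaining real-time performance.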

Abstract: Incremental open-vocabulary 3D instance-semantic mapping is essential for autonomous agents operating in complex everyday environments. However, it remains challenging due to the need for robust instance segmentation, real-time processing, and flexible open-set reasoning. Existing methods often rely on the closed-set assumption or dense per-pixel language fusion, which limits scalability and temporal consistency. We introduce OVI-MAP, which decouples instance reconstruction from semantic inference. We propose to build a class-agnostic 3D instance map that is incrementally constructed from RGB-D input, while semantic features are extracted only from a small set of automatically selected views using vision-language models. This design enables stable instance tracking and zero-shot semantic labeling throughout online exploration. Our system operates in real time and outperforms state-of-the-art open-vocabulary mapping baselines on standard benchmarks.

[84] AutoWeather4D: Autonomous Driving Video Weather Conversion via G-Buffer Dual-Pass Editing cs.CV

Tianyu Liu, Weitao Xiong, Kunming Luo, Manyuan Zhang, Peng Liu

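TL;DR: AutoWeather4D is a feed-forward, 3D-aware weather editing framework for converting autonomous driving videos to adverse weather. Its core is a G-buffer dual-pass editing mechanism that explicitly decouples geometry and illumination, enabling surface-anchored physical interactions and dynamic 3D local relighting without large-scale datasets or costly per-scene optimization.

Details

Motivation: Existing generative video models need massive data to learn rare weather scenarios, while 3D-aware editing methods are limited by costly per-scene optimization and the inherent entanglement of geometry and illumination. The paper aims to remove these data constraints and bottlenecks.

Result: Extensive experiments show that AutoWeather4D is comparable to generative baselines in photorealism and structural consistency while enabling fine-grained parametric physical control, so it can serve as a practical data engine for autonomous driving.

Insight: The novelty is the G-buffer dual-pass editing mechanism, which handles geometry and lighting in separate passes to decouple them explicitly, achieving surface-anchored physical interactions and dynamic 3D local relighting and offering a new approach to controllable weather editing.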

Abstract: Generative video models have significantly advanced the photorealistic synthesis of adverse weather for autonomous driving; however, they consistently demand massive datasets to learn rare weather scenarios. While 3D-aware editing methods alleviate these data constraints by augmenting existing video footage, they are fundamentally bottlenecked by costly per-scene optimization and suffer from inherent geometric and illumination entanglement. In this work, we introduce AutoWeather4D, a feed-forward 3D-aware weather editing framework designed to explicitly decouple geometry and illumination. At the core of our approach is a G-buffer Dual-pass Editing mechanism. The Geometry Pass leverages explicit structural foundations to enable surface-anchored physical interactions, while the Light Pass analytically resolves light transport, accumulating the contributions of local illuminants into the global illumination to enable dynamic 3D local relighting. Extensive experiments demonstrate that AutoWeather4D achieves comparable photorealism and structural consistency to generative baselines while enabling fine-grained parametric physical control, serving as a practical data engine for autonomous driving.

[85] Beyond MACs: Hardware Efficient Architecture Design for Vision Backbones cs.CV | cs.AI

Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni

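TL;DR: This paper points out the limitations of MACs as an efficiency metric for vision backbones and proposes LowFormer, a new family of efficient vision backbones whose core innovation is the lightweight attention module Lowtention, achieving better accuracy and speed on ImageNet and other tasks.

Details

Motivation: Existing work relies heavily on MACs as an efficiency metric, which fails to predict actual runtime accurately, especially on edge devices. The paper aims to identify experimentally the key factors that affect execution efficiency and to design hardware-efficient vision backbones.

Result: LowFormer achieves strong results on ImageNet classification, and its wide applicability is confirmed on several downstream tasks including object detection and semantic segmentation. Compared with recent SOTA backbones, it delivers notable speed-ups across a variety of hardware platforms.

Insight: The novelty lies in exposing the shortcomings of the MACs metric and, based on a hardware-efficiency analysis, proposing the lightweight attention module Lowtention and an optimized macro and micro architecture design, offering a new approach to efficient backbone design for edge devices.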

Abstract: Vision backbone networks play a central role in modern computer vision. Enhancing their efficiency directly benefits a wide range of downstream applications. To measure efficiency, many publications rely on MACs (Multiply Accumulate operations) as a predictor of execution time. In this paper, we experimentally demonstrate the shortcomings of such a metric, especially in the context of edge devices. By contrasting the MAC count and execution time of common architectural design elements, we identify key factors for efficient execution and provide insights to optimize backbone design. Based on these insights, we present LowFormer, a novel vision backbone family. LowFormer features a streamlined macro and micro design that includes Lowtention, a lightweight alternative to Multi-Head Self-Attention. Lowtention not only proves more efficient, but also enables superior results on ImageNet. Additionally, we present an edge GPU version of LowFormer, which further improves on its baseline’s speed on edge GPU and desktop GPU. We demonstrate LowFormer’s wide applicability by evaluating it on smaller image classification datasets, as well as adapting it to several downstream tasks, such as object detection, semantic segmentation, image retrieval, and visual object tracking. LowFormer models consistently achieve remarkable speed-ups across various hardware platforms compared to recent state-of-the-art backbones. Code and models are available at https://github.com/altair199797/LowFormer/blob/main/Beyond_MACs.md.

[86] Generation Is Compression: Zero-Shot Video Coding via Stochastic Rectified Flow cs.CV | cs.AI

Ziyue Zeng, Xun Su, Haoyuan Liu, Bingyu Lu, Yui Tatsumi

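TL;DR: This paper proposes GVC, a zero-shot video coding framework that uses a pretrained video generative model directly as the codec. It converts the deterministic rectified-flow ODE into an equivalent SDE to enable codebook-based compression and instantiates three complementary conditioning strategies (I2V, T2V, FLF2V), achieving high-quality reconstruction below 0.002 bpp on standard benchmarks.

Details

Motivation: Existing generative video compression methods use generative models only as post-processing modules on top of conventional codecs. The paper aims to turn a pretrained video generative model into the codec itself, achieving zero-shot compression with no retraining.

Result: On standard benchmarks, GVC achieves high-quality reconstruction at bitrates below 0.002 bpp and supports flexible bitrate control through a single hyperparameter.

Insight: The novelty is converting the deterministic rectified-flow ODE into an equivalent SDE to unlock stochastic injection points for codebook-driven compression, and using three conditioning strategies to trade off spatial fidelity, temporal coherence, and compression efficiency. Viewed objectively, the zero-shot framework of using a generative model directly as the codec is novel and offers a new generative-prior route for video compression.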

Abstract: Existing generative video compression methods use generative models only as post-hoc reconstruction modules atop conventional codecs. We propose Generative Video Codec (GVC), a zero-shot framework that turns a pretrained video generative model into the codec itself: the transmitted bitstream directly specifies the generative decoding trajectory, with no retraining required. To enable this, we convert the deterministic rectified-flow ODE of modern video foundation models into an equivalent SDE at inference time, unlocking per-step stochastic injection points for codebook-driven compression. Building on this unified backbone, we instantiate three complementary conditioning strategies: Image-to-Video (I2V) with adaptive tail-frame atom allocation, Text-to-Video (T2V) operating at near-zero side information as a pure generative prior, and First-Last-Frame-to-Video (FLF2V) with boundary-sharing GOP chaining for dual-anchor temporal control. Together, these variants span a principled trade-off space between spatial fidelity, temporal coherence, and compression efficiency. Experiments on standard benchmarks show that GVC achieves high-quality reconstruction below 0.002 bpp while supporting flexible bitrate control through a single hyperparameter.

[87] Scene Grounding In the Wild cs.CV

Tamir Cohen, Leo Segre, Shay Shomer-Chai, Shai Avidan, Hadar Averbuch-Elor

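TL;DR: This paper proposes a method for reconstructing 3D models of large-scale real-world scenes from unstructured, non-overlapping in-the-wild images by semantically aligning partial reconstructions to a complete pseudo-synthetic reference model, achieving globally consistent registration.

Details

Motivation: When input views have little overlap, existing reconstruction pipelines produce multiple disconnected partial reconstructions or erroneously merge non-overlapping regions; the paper addresses this problem.

Result: Validated on the WikiEarth dataset, the method consistently improves global alignment for a variety of classical and learning-based pipelines and mitigates failure modes of state-of-the-art end-to-end models.

Insight: The novelty is using semantic features to bridge the domain gap between real images and pseudo-synthetic renderings, and estimating a global 6DoF pose and scale through inverse feature-based optimization, which enables scene alignment without visual overlap.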

Abstract: Reconstructing accurate 3D models of large-scale real-world scenes from unstructured, in-the-wild imagery remains a core challenge in computer vision, especially when the input views have little or no overlap. In such cases, existing reconstruction pipelines often produce multiple disconnected partial reconstructions or erroneously merge non-overlapping regions into overlapping geometry. In this work, we propose a framework that grounds each partial reconstruction to a complete reference model of the scene, enabling globally consistent alignment even in the absence of visual overlap. We obtain reference models from dense, geospatially accurate pseudo-synthetic renderings derived from Google Earth Studio. These renderings provide full scene coverage but differ substantially in appearance from real-world photographs. Our key insight is that, despite this significant domain gap, both domains share the same underlying scene semantics. We represent the reference model using 3D Gaussian Splatting, augmenting each Gaussian with semantic features, and formulate alignment as an inverse feature-based optimization scheme that estimates a global 6DoF pose and scale while keeping the reference model fixed. Furthermore, we introduce the WikiEarth dataset, which registers existing partial 3D reconstructions with pseudo-synthetic reference models. We demonstrate that our approach consistently improves global alignment when initialized with various classical and learning-based pipelines, while mitigating failure modes of state-of-the-art end-to-end models. All code and data will be released.

[88] MA-Bench: Towards Fine-grained Micro-Action Understanding cs.CV

Kun Li, Jihao Gu, Fei Wang, Zhiliang Wu, Hehe Fan

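TL;DR: This paper presents MA-Bench, a benchmark for evaluating the micro-action understanding of multimodal large language models (MLLMs). It contains 1,000 videos and 12,000 structured question-answer pairs and uses a three-tier evaluation architecture (perception, relational comprehension, interpretive reasoning) for systematic assessment. Tests on 23 representative MLLMs reveal significant challenges in capturing motion granularity and fine-grained body-part dynamics. The authors therefore also build MA-Bench-Train, a large-scale training dataset of 20.5K videos for fine-tuning MLLMs. Experiments show that Qwen3-VL-8B fine-tuned on this dataset improves clearly on micro-action reasoning and explanation tasks.

Details

Motivation: The potential of MLLMs for micro-action understanding, which is vital to human emotion analysis, remains unexplored because specialized benchmarks are lacking. The paper aims to fill this gap by establishing a foundational benchmark for evaluating and advancing MLLMs' understanding of subtle human behavior.

Result: Evaluating 23 representative MLLMs on MA-Bench shows that they face significant challenges in capturing motion granularity and fine-grained body-part dynamics. Fine-tuning Qwen3-VL-8B on the large-scale MA-Bench-Train dataset yields clear performance improvements on micro-action reasoning and explanation tasks.

Insight: The novelty is the first three-tier evaluation benchmark dedicated to micro-action understanding (MA-Bench), together with a large-scale training dataset (MA-Bench-Train). These provide standardized tools and data for systematically evaluating and improving MLLMs' fine-grained understanding of human behavior, expose current models' weaknesses in capturing motion detail, and show that targeted fine-tuning is an effective path to improvement.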

Abstract: With the rapid development of Multimodal Large Language Models (MLLMs), their potential in Micro-Action understanding, which plays a vital role in human emotion analysis, remains unexplored due to the absence of specialized benchmarks. To tackle this issue, we present MA-Bench, a benchmark comprising 1,000 videos and a three-tier evaluation architecture that progressively examines micro-action perception, relational comprehension, and interpretive reasoning. MA-Bench contains 12,000 structured question-answer pairs, enabling systematic assessment of both recognition accuracy and action interpretation. The results of 23 representative MLLMs reveal significant challenges in capturing motion granularity and fine-grained body-part dynamics. To address these challenges, we further construct MA-Bench-Train, a large-scale training corpus with 20.5K videos annotated with structured micro-action captions for fine-tuning MLLMs. The results of Qwen3-VL-8B fine-tuned on MA-Bench-Train show clear performance improvements across micro-action reasoning and explanation tasks. Our work aims to establish a foundational benchmark for advancing MLLMs in understanding subtle micro-actions and human-related behaviors. Project Page: https://MA-Bench.github.io

[89] The Limits of Learning from Pictures and Text: Vision-Language Models and Embodied Scene Understanding cs.CV

Gillian Rosenberg, Skylar Stadhard, Bruce C. Hansen, Michelle R. Greene

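TL;DR: By comparing descriptions from 18 vision-language models (VLMs) with those of more than 2000 human observers across 15 high-level scene understanding tasks, this study tests the limits of learning human scene understanding from the statistical co-occurrence of images and text alone. VLMs approach human-level performance on general knowledge tasks but show a significant, structural deficit on affordance tasks that neither prompt engineering nor explicit spatial information resolves. Corpus analyses show that image captioning datasets lack agent-addressed affordance language, indicating that distributional learning is insufficient for affordance-based scene understanding and suggesting that some dimensions of human visual cognition require agent-centered, three-dimensional experience.

Details

Motivation: The paper asks whether learning from large-scale paired text-image corpora alone is sufficient to capture the richness of human scene understanding, that is, whether the distributional hypothesis holds for visual cognition, with particular attention to limitations that VLMs may have because they lack embodied experience.

Result: In Experiment 1, VLMs approached human-level performance on general knowledge tasks (as measured by the HCD metric) but showed a significant, persistent deficit on affordance tasks that did not improve with newer model releases. Experiment 2 showed that the deficit is structural rather than stylistic and is not resolved by providing explicit spatial information.

Insight: The novelty lies in systematically quantifying the gap between VLMs and humans across many high-level scene understanding tasks and in developing the HCD metric to calibrate evaluation on tasks without ground-truth answers. Viewed objectively, the core finding challenges the assumption that distributional learning alone can fully capture scene understanding, highlights the importance of embodied experience for cognitive dimensions such as affordances, and offers key guidance for the future development of multimodal models.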

Abstract: What information is sufficient to learn the full richness of human scene understanding? The distributional hypothesis holds that the statistical co-occurrence of language and images captures the conceptual knowledge underlying visual cognition. Vision-language models (VLMs) are trained on massive paired text-image corpora but lack embodied experience, making them an ideal test of the distributional hypothesis. We report two experiments comparing descriptions generated by 18 VLMs to those of over 2000 human observers across 15 high-level scene understanding tasks, spanning general knowledge, affordances, sensory experiences, affective responses, and future prediction. Because many tasks lack ground truth answers, we developed a Human-Calibrated Cosine Distance (HCD) metric that measures VLM output similarity to the distribution of human responses, scaled by within-human variability. In Experiment 1, VLMs approached human-level performance on general knowledge tasks, but showed a robust deficit for affordance tasks that resisted prompt engineering and did not improve with newer model releases. In Experiment 2, we tested six mechanistic hypotheses for explaining this affordance gap, finding that the deficit was structural rather than stylistic and was not resolved by providing explicit spatial information. Corpus analyses revealed that image captioning datasets contain sparse agent-addressed affordance language, consistent with Gricean accounts of why embodied knowledge may be systematically underrepresented in language. Together, these findings suggest that distributional learning from images and text is insufficient for affordance-based scene understanding, implying that some dimensions of human visual cognition may require the kind of agent-centered, three-dimensional experience that no photograph or caption can encode.
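The abstract describes the Human-Calibrated Cosine Distance (HCD) only in words: a model's distance to the distribution of human responses, scaled by within-human variability. It does not state the formula, so the sketch below is one plausible reading rather than the authors' definition. It assumes responses are already embedded as vectors, takes the mean cosine distance from the model response to each human response, and divides by the mean pairwise cosine distance among humans. The function names and toy vectors are illustrative only.

```python
import math
from itertools import combinations
from statistics import fmean

def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def human_calibrated_distance(model_vec, human_vecs):
    """Model-to-human distance in units of human-to-human distance.

    Assumed form: values near or below 1 mean the model is about as
    close to humans as humans are to each other.
    """
    to_humans = fmean(cosine_distance(model_vec, h) for h in human_vecs)
    within = fmean(cosine_distance(a, b)
                   for a, b in combinations(human_vecs, 2))
    return to_humans / within

# Toy 2-D "embeddings" of three human responses and two model responses.
humans = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
typical = human_calibrated_distance((1.0, 1.0), humans)
outlier = human_calibrated_distance((-1.0, -1.0), humans)
```

Under this reading, scaling by within-human variability lets tasks with naturally diverse human answers (such as affective responses) be compared on the same footing as tasks where humans largely agree.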

[90] From Static to Dynamic: Exploring Self-supervised Image-to-Video Representation Transfer Learning cs.CV

Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai, Xilin Zhao

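TL;DR: This paper proposes Co-Settle, a self-supervised image-to-video representation transfer learning framework that addresses the trade-off between intra-video temporal consistency and inter-video semantic separability during transfer. It adds a lightweight projection layer on top of a frozen image-pretrained encoder and adjusts the representation space with a temporal cycle consistency objective and a semantic separability constraint, improving performance on a variety of video tasks with only a small amount of self-supervised training.

Details

Motivation: Existing methods for transferring image-pretrained models to video tasks usually require complex temporal modules and video fine-tuning. Fine-tuning heavy modules may compromise inter-video semantic separability, whereas reducing the tunable parameters hinders intra-video temporal consistency. This reveals a potential trade-off between the two properties in image-to-video transfer.

Result: Experiments on eight image-pretrained models show consistent improvements across multiple levels of video tasks with only five epochs of self-supervised training.

Insight: The novelty is explicitly identifying the consistency-separability trade-off in image-to-video transfer and designing the lightweight Co-Settle framework to optimize it. Its core is to freeze the backbone, train only the projection layer, and jointly optimize the temporal cycle consistency and semantic separability objectives, which adapts the representation space efficiently. This offers a new approach to lightweight, efficient video representation transfer learning.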

Abstract: Recent studies have made notable progress in video representation learning by transferring image-pretrained models to video tasks, typically with complex temporal modules and video fine-tuning. However, fine-tuning heavy modules may compromise inter-video semantic separability, i.e., the essential ability to distinguish objects across videos. Reducing the tunable parameters, on the other hand, hinders intra-video temporal consistency, which is required for stable representations of the same object within a video. This dilemma indicates a potential trade-off between intra-video temporal consistency and inter-video semantic separability during image-to-video transfer. To this end, we propose the Consistency-Separability Trade-off Transfer Learning (Co-Settle) framework, which applies a lightweight projection layer on top of the frozen image-pretrained encoder to adjust the representation space with a temporal cycle consistency objective and a semantic separability constraint. We further provide theoretical support showing that the optimized projection yields a better trade-off between the two properties under appropriate conditions. Experiments on eight image-pretrained models demonstrate consistent improvements across multiple levels of video tasks with only five epochs of self-supervised training. The code is available at https://github.com/yafeng19/Co-Settle.

[91] VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward cs.CV

Zhaochong An, Orest Kupyn, Théo Uscidda, Andrea Colaco, Karan Ahuja

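TL;DR: VGGRPO is a latent geometry-guided post-training framework for video generation. It introduces a Latent Geometry Model (LGM) that connects the latent space of video diffusion models to geometry foundation models so that scene geometry can be decoded directly from latents, and it applies latent-space Group Relative Policy Optimization (GRPO) with camera motion smoothness and geometry reprojection consistency rewards to improve geometric consistency while avoiding costly VAE decoding.

Details

Motivation: Large-scale video diffusion models achieve excellent visual quality but often fail to preserve geometric consistency. Prior methods either add modules or apply geometry-aware alignment; the former can harm the generalization of pretrained models, while the latter is limited to static scenes and relies on RGB-space rewards, which incurs substantial compute overhead and generalizes poorly to dynamic scenes.

Result: On static and dynamic benchmarks, VGGRPO improves camera stability, geometric consistency, and overall quality while eliminating costly VAE decoding, giving efficient and flexible world-consistent video generation.

Insight: The novelty is the Latent Geometry Model (LGM), which decodes dynamic scene geometry directly from the diffusion latent space, combined with latent-space GRPO using a dual reward (camera smoothness and geometry reprojection consistency). This avoids architectural modification and the compute bottleneck of RGB-space alignment and extends the approach to dynamic scenes.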

Abstract: Large-scale video diffusion models achieve impressive visual quality, yet often fail to preserve geometric consistency. Prior approaches improve consistency either by augmenting the generator with additional modules or applying geometry-aware alignment. However, architectural modifications can compromise the generalization of internet-scale pretrained models, while existing alignment methods are limited to static scenes and rely on RGB-space rewards that require repeated VAE decoding, incurring substantial compute overhead and failing to generalize to highly dynamic real-world scenes. To preserve the pretrained capacity while improving geometric consistency, we propose VGGRPO (Visual Geometry GRPO), a latent geometry-guided framework for geometry-aware video post-training. VGGRPO introduces a Latent Geometry Model (LGM) that stitches video diffusion latents to geometry foundation models, enabling direct decoding of scene geometry from the latent space. By constructing LGM from a geometry model with 4D reconstruction capability, VGGRPO naturally extends to dynamic scenes, overcoming the static-scene limitations of prior methods. Building on this, we perform latent-space Group Relative Policy Optimization with two complementary rewards: a camera motion smoothness reward that penalizes jittery trajectories, and a geometry reprojection consistency reward that enforces cross-view geometric coherence. Experiments on both static and dynamic benchmarks show that VGGRPO improves camera stability, geometry consistency, and overall quality while eliminating costly VAE decoding, making latent-space geometry-guided reinforcement an efficient and flexible approach to world-consistent video generation.
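Two ingredients named in the abstract can be sketched in a few lines: a reward that penalizes jittery camera trajectories, and the group-relative advantage that GRPO-style methods compute by normalizing rewards within a group of samples. The sketch is an assumption-laden illustration, not VGGRPO's implementation: the abstract does not define the smoothness reward, so a negative mean squared second difference of camera positions is used as a stand-in, and the trajectories are made-up 2-D paths.

```python
from statistics import fmean, pstdev

def smoothness_reward(positions):
    """Negative mean squared second difference of a camera path.

    Zero for constant-velocity motion, increasingly negative for jitter.
    (Assumed form; the paper's exact reward is not given in the abstract.)
    """
    if len(positions) < 3:
        return 0.0
    total = 0.0
    for a, b, c in zip(positions, positions[1:], positions[2:]):
        total += sum((x0 - 2 * x1 + x2) ** 2 for x0, x1, x2 in zip(a, b, c))
    return -total / (len(positions) - 2)

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled generations."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# A group of three hypothetical camera paths for the same prompt.
smooth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
mild = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.0), (3.0, 0.1)]
jittery = [(0.0, 0.0), (1.0, 1.0), (2.0, -1.0), (3.0, 1.0)]
rewards = [smoothness_reward(p) for p in (smooth, mild, jittery)]
advantages = group_relative_advantages(rewards)
```

Because advantages are relative to the group mean, the smooth path gets a positive advantage and the jittery one a negative advantage, without needing a separately learned value function.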

[92] Think over Trajectories: Leveraging Video Generation to Reconstruct GPS Trajectories from Cellular Signaling cs.CV | cs.AI

Ruixing Zhang, Hanzhang Jiang, Leilei Sun, Liangzhe Han, Jibin Wang

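TL;DR: This paper proposes Sig2GPS, an approach that recasts the reconstruction of high-precision GPS trajectories from cellular signaling traces as an image-to-video generation task: signaling traces are rendered on a map, and a video generation model is trained to draw a continuous GPS path. A paired dataset is built to fine-tune an open-source video model, and a trajectory-aware reinforcement learning-based optimization method is introduced to improve generation fidelity.

Details

Motivation: Cellular signaling records provide only coarse location information (such as base-station identifiers) and cannot be used directly in applications that require high-precision GPS trajectories. The paper aims to reconstruct continuous, precise GPS trajectories from signaling data.

Result: Experiments on large-scale real-world datasets show that the method substantially outperforms existing engineered and learning-based baselines, and results on next GPS point prediction demonstrate scalability and cross-city transferability.

Insight: The novelty is turning trajectory reconstruction into a video generation task in the map-visual domain, which avoids complex multi-stage engineering pipelines and coordinate regression. Reinforcement learning rewards improve the continuity of generated paths, providing an intuitive interface for generating and refining trajectories in trajectory data mining.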

Abstract: Mobile devices continuously interact with cellular base stations, generating massive volumes of signaling records that provide broad coverage for understanding human mobility. However, such records offer only coarse location cues (e.g., serving-cell identifiers), which limits their direct use in applications that require high-precision GPS trajectories. This paper studies the Sig2GPS problem: reconstructing GPS trajectories from cellular signaling. Domain experts often lay the signaling trace on a map and sketch the corresponding GPS route. Inspired by this practice, and unlike conventional solutions that rely on complex multi-stage engineering pipelines or regress coordinates, Sig2GPS is reframed as an image-to-video generation task that operates directly in the map-visual domain: signaling traces are rendered on a map, and a video generation model is trained to draw a continuous GPS path. To support this paradigm, a paired signaling-to-trajectory video dataset is constructed to fine-tune an open-source video model, and a trajectory-aware reinforcement learning-based optimization method is introduced to improve generation fidelity via rewards. Experiments on large-scale real-world datasets show substantial improvements over strong engineered and learning-based baselines, while additional results on next GPS prediction indicate scalability and cross-city transferability. Overall, these results suggest that map-visual video generation provides a practical interface for trajectory data mining by enabling direct generation and refinement of continuous paths under map constraints.

[93] Make Geometry Matter for Spatial Reasoning cs.CV | cs.AI

Shihua Zhang, Qiuhong Shen, Shizun Wang, Tianbo Pan, Xinchao Wang

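TL;DR: This paper proposes GeoSR, a framework for strengthening the spatial reasoning of vision-language models (VLMs) in static and dynamic scenes. Through two key components, Geometry-Unleashing Masking and Geometry-Guided Fusion, GeoSR forces the model to rely more on geometric cues during training and thus to make effective use of geometric information from pretrained 3D foundation models.

Details

Motivation: Although VLMs trained at scale perform strongly on image and video understanding, their spatial reasoning in static and dynamic scenes remains limited. Existing methods try to address this by injecting geometry tokens, but naive token fusion and fine-tuning often leave the geometric cues underutilized, and the model still relies too heavily on 2D visual cues.

Result: Extensive experiments on static and dynamic spatial reasoning benchmarks show that GeoSR consistently outperforms prior methods and sets new state-of-the-art (SOTA) performance by effectively leveraging geometric information.

Insight: The novelty is Geometry-Unleashing Masking (masking part of the 2D vision tokens to weaken non-geometric shortcuts and force the model to consult geometry tokens) and Geometry-Guided Fusion (a gated routing mechanism that adaptively amplifies the contribution of geometry tokens in geometry-critical regions). Viewed objectively, forcing the model to use geometric information through training strategy and fusion design is a useful template for improving the spatial reasoning of VLMs.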

Abstract: Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent advances try to handle this limitation by injecting geometry tokens from pretrained 3D foundation models into VLMs. Nevertheless, we observe that naive token fusion followed by standard fine-tuning in this line of work often leaves such geometric cues underutilized for spatial reasoning, as VLMs tend to rely heavily on 2D visual cues. In this paper, we propose GeoSR, a framework designed to make geometry matter by encouraging VLMs to actively reason with geometry tokens. GeoSR introduces two key components: (1) Geometry-Unleashing Masking, which strategically masks portions of 2D vision tokens during training to weaken non-geometric shortcuts and force the model to consult geometry tokens for spatial reasoning; and (2) Geometry-Guided Fusion, a gated routing mechanism that adaptively amplifies geometry token contributions in regions where geometric evidence is critical. Together, these designs unleash the potential of geometry tokens for spatial reasoning tasks. Extensive experiments on both static and dynamic spatial reasoning benchmarks demonstrate that GeoSR consistently outperforms prior methods and establishes new state-of-the-art performance by effectively leveraging geometric information. The project page is available at https://suhzhang.github.io/GeoSR/.

[94] Beyond Language: Grounding Referring Expressions with Hand Pointing in Egocentric Vision cs.CV

Ling Li, Bowen Liu, Zinuo Zhan, Peng Jie, Jianhui Zhong

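TL;DR: This paper presents EgoPoint-Ground, the first large-scale multimodal dataset dedicated to egocentric deictic visual grounding, with over 15k interactive samples, and proposes the SV-CoT baseline framework, which combines gestural and linguistic cues through a Visual Chain-of-Thought paradigm and achieves an 11.7% absolute improvement on the benchmark.

Details

Motivation: Traditional visual grounding relies mainly on textual descriptions, struggles with linguistic ambiguity, and ignores non-verbal deictic cues (such as finger pointing) that are common in real-world interaction. The paper aims to bridge this gap by studying hand pointing combined with language, the most intuitive referring mechanism in egocentric settings.

Result: A comprehensive benchmark is established on EgoPoint-Ground, evaluating a range of mainstream MLLMs and SOTA VG architectures. The proposed SV-CoT framework achieves an 11.7% absolute improvement over existing methods and effectively mitigates semantic ambiguity.

Insight: The novelty is the first large-scale egocentric deictic visual grounding dataset and the SV-CoT framework, which reformulates grounding as a structured inference process and combines gestural and linguistic modalities through a Visual Chain-of-Thought, offering a new approach to understanding multimodal physical intent.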

Abstract: Traditional Visual Grounding (VG) predominantly relies on textual descriptions to localize objects, a paradigm that inherently struggles with linguistic ambiguity and often ignores non-verbal deictic cues prevalent in real-world interactions. In natural egocentric engagements, hand-pointing combined with speech forms the most intuitive referring mechanism. To bridge this gap, we introduce EgoPoint-Ground, the first large-scale multimodal dataset dedicated to egocentric deictic visual grounding. Comprising over 15k interactive samples in complex scenes, the dataset provides rich, multi-grained annotations including hand-target bounding box pairs and dense semantic captions. We establish a comprehensive benchmark for hand-pointing referring expression resolution, evaluating a wide spectrum of mainstream Multimodal Large Language Models (MLLMs) and state-of-the-art VG architectures. Furthermore, we propose SV-CoT, a novel baseline framework that reformulates grounding as a structured inference process, synergizing gestural and linguistic cues through a Visual Chain-of-Thought paradigm. Extensive experiments demonstrate that SV-CoT achieves an 11.7% absolute improvement over existing methods, effectively mitigating semantic ambiguity and advancing the capability of agents to comprehend multimodal physical intents. The dataset and code will be made publicly available.

[95] Tunable Soft Equivariance with Guarantees cs.CV | cs.LG

Md Ashiqur Rahman, Lim Jun Hao, Jeremiah Jiang, Teck-Yian Lim, Raymond A. Yeh

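TL;DR: This paper proposes a general framework for constructing soft equivariant models that controls the degree of equivariance by projecting model weights into a designed subspace. It applies to any pre-trained architecture and provides theoretical bounds on the equivariance error. The method is validated empirically on several pre-trained backbones (such as ViT and ResNet) for image classification, semantic segmentation, and human-trajectory prediction, and on the ImageNet benchmark it improves performance while reducing equivariance error.

Details

Motivation: Equivariance is a fundamental property of computer vision models, but real-world data rarely satisfies strict equivariance, which can limit model performance; the degree of equivariance therefore needs to be controllable.

Result: On benchmarks including ImageNet, the method improves performance and reduces equivariance error, and its effectiveness is confirmed across multiple tasks and backbones.

Insight: The novelty is a general framework for building soft equivariant models in which weight projection gives tunable control over the degree of equivariance, backed by theoretical error guarantees. It applies to existing pre-trained models and offers good generality and interpretability.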

Abstract: Equivariance is a fundamental property in computer vision models, yet strict equivariance is rarely satisfied in real-world data, which can limit a model’s performance. Controlling the degree of equivariance is therefore desirable. We propose a general framework for constructing soft equivariant models by projecting the model weights into a designed subspace. The method applies to any pre-trained architecture and provides theoretical bounds on the induced equivariance error. Empirically, we demonstrate the effectiveness of our method on multiple pre-trained backbones, including ViT and ResNet, across image classification, semantic segmentation, and human-trajectory prediction tasks. Notably, our approach improves the performance while simultaneously reducing equivariance error on the competitive ImageNet benchmark.
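To give a feel for how a weight projection can make equivariance tunable, here is a toy sketch for a single 3x3 convolution kernel and the group of 90-degree rotations. It is not the paper's construction, whose designed subspace is not described in the abstract. The assumptions are: the strictly symmetric subspace is the set of kernels unchanged by rotation, the projection is the average over the four rotations, and a hypothetical knob `alpha` interpolates between the projected kernel (`alpha = 0`) and the original one (`alpha = 1`).

```python
import math

def rot90(kernel):
    """Rotate a square kernel by 90 degrees."""
    n = len(kernel)
    return [[kernel[j][n - 1 - i] for j in range(n)] for i in range(n)]

def project(kernel):
    """Average over the four rotations: the rotation-symmetric part."""
    n = len(kernel)
    rotations = [kernel]
    for _ in range(3):
        rotations.append(rot90(rotations[-1]))
    return [[sum(r[i][j] for r in rotations) / 4.0 for j in range(n)]
            for i in range(n)]

def soften(kernel, alpha):
    """Keep the symmetric part plus a fraction `alpha` of the remainder."""
    sym = project(kernel)
    n = len(kernel)
    return [[sym[i][j] + alpha * (kernel[i][j] - sym[i][j]) for j in range(n)]
            for i in range(n)]

def symmetry_error(kernel):
    """Frobenius norm of the change under one 90-degree rotation."""
    rotated = rot90(kernel)
    return math.sqrt(sum((a - b) ** 2
                         for row, rrow in zip(kernel, rotated)
                         for a, b in zip(row, rrow)))

weights = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
strict = soften(weights, 0.0)
half = soften(weights, 0.5)
```

In this toy setting the symmetry error scales linearly with `alpha`, so the knob directly bounds how far the kernel can deviate from strict rotational symmetry, which mirrors the spirit of a provable bound on equivariance error.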

cs.MM

[96] ComVi: Context-Aware Optimized Comment Display in Video Playback cs.MM | cs.CV | cs.GR | cs.HC

Minsun Kim, Dawon Lee, Junyong Noh

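TL;DR: This paper presents ComVi, a system that addresses the spoilers and broken immersion caused by comments that do not match the scene currently playing. It maps comments to relevant video timestamps by computing audio-visual correlation, then optimizes the comment sequence by jointly considering temporal relevance, popularity, and display duration, so that comments appear at contextually relevant moments, in sync with the video content.

Details

Motivation: On video-sharing platforms such as YouTube, comments are displayed independently of playback. Viewers who read comments while watching may encounter ones unrelated to the current scene, which can reveal spoilers and disrupt immersion.

Result: In a user study, ComVi provided a significantly more engaging experience than conventional video interfaces (YouTube and Danmaku), with 71.9% of participants choosing it as their most preferred interface.

Insight: The novelty lies in context-aware comment display achieved through audio-visual correlation mapping and optimized sequencing. It combines multimedia analysis with user-experience design and could inform efforts to improve interactivity and immersion on video platforms.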

Abstract: On general video-sharing platforms like YouTube, comments are displayed independently of video playback. As viewers often read comments while watching a video, they may encounter ones referring to moments unrelated to the current scene, which can reveal spoilers and disrupt immersion. To address this problem, we present ComVi, a novel system that displays comments at contextually relevant moments, enabling viewers to see time-synchronized comments and video content together. We first map all comments to relevant video timestamps by computing audio-visual correlation, then construct the comment sequence through an optimization that considers temporal relevance, popularity (number of likes), and display duration for comfortable reading. In a user study, ComVi provided a significantly more engaging experience than conventional video interfaces (i.e., YouTube and Danmaku), with 71.9% of participants selecting ComVi as their most preferred interface.
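
The sequencing step can be pictured as a scheduling problem. The sketch below is one possible reading, not ComVi's actual optimization: it assumes each comment already has a matched timestamp and a relevance score, derives a display duration from a hypothetical reading speed, and picks a non-overlapping set of comments that maximizes relevance weighted by log-scaled likes (classic weighted interval scheduling). All names, weights, and sample comments are made up for illustration.

```python
import math
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Comment:
    text: str
    timestamp: float  # seconds; assumed already matched to the video
    relevance: float  # assumed audio-visual correlation score in [0, 1]
    likes: int


def display_duration(c, words_per_sec=3.0, min_sec=2.0):
    """Hypothetical reading-time model: longer comments stay on screen longer."""
    return max(min_sec, len(c.text.split()) / words_per_sec)


def score(c):
    """Hypothetical utility: relevance scaled by log-damped popularity."""
    return c.relevance * math.log1p(c.likes)


def schedule(comments):
    """Choose non-overlapping comments with maximum total score."""
    items = sorted(((c.timestamp, c.timestamp + display_duration(c), c)
                    for c in comments), key=lambda item: item[1])
    ends = [end for _, end, _ in items]
    best = [0.0] * (len(items) + 1)  # best[i]: optimum over the first i items
    take, prev = [], []
    for i, (start, _, c) in enumerate(items):
        p = bisect_right(ends, start, 0, i)  # earlier items ending by `start`
        prev.append(p)
        with_c = best[p] + score(c)
        take.append(with_c > best[i])
        best[i + 1] = max(best[i], with_c)
    chosen, i = [], len(items)
    while i > 0:  # backtrack through the DP table
        if take[i - 1]:
            chosen.append(items[i - 1][2])
            i = prev[i - 1]
        else:
            i -= 1
    return chosen[::-1]


comments = [
    Comment("Great intro", 0.0, 0.9, 10),
    Comment("That goal at the end was unbelievable", 1.0, 0.8, 200),
    Comment("lol", 4.0, 0.5, 3),
]
chosen = schedule(comments)
print([c.text for c in chosen])
```

The first two comments overlap on screen, so the scheduler keeps only the higher-scoring one and then shows the third once the display slot is free.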

cs.RO

Amirhosein Chahe, Lifeng Zhou

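TL;DR: This paper proposes PiJEPA, a two-stage framework for language-conditioned visual navigation. It first fine-tunes an Octo-based generalist policy to produce an informed action distribution conditioned on the current observation and language instruction, then uses that distribution to warm-start MPPI planning over a JEPA world model, which yields high-quality goal-reaching action sequences more efficiently.

Details

Motivation: Existing approaches to language-conditioned visual navigation in embodied AI face two problems: reactive policies struggle with long-horizon planning, while world models suffer from poor action initialization in high-dimensional spaces.

Result: Experiments on real-world navigation tasks show that PiJEPA significantly outperforms both standalone policy execution and world-model planning with an uninformed prior, achieving higher goal-reaching accuracy and instruction-following fidelity.

Insight: The core idea is to combine a learned navigation policy, which supplies an informed prior distribution, with latent world-model planning (JEPA): warm-starting MPPI sampling from the policy prior speeds convergence and improves plan quality. The paper also systematically studies how the vision encoder backbone (DINOv2 versus V-JEPA-2) affects both the policy and the world-model components.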

Abstract: Navigating to a visually specified goal given natural language instructions remains a fundamental challenge in embodied AI. Existing approaches either rely on reactive policies that struggle with long-horizon planning, or employ world models that suffer from poor action initialization in high-dimensional spaces. We present PiJEPA, a two-stage framework that combines the strengths of learned navigation policies with latent world model planning for instruction-conditioned visual navigation. In the first stage, we finetune an Octo-based generalist policy, augmented with a frozen pretrained vision encoder (DINOv2 or V-JEPA-2), on the CAST navigation dataset to produce an informed action distribution conditioned on the current observation and language instruction. In the second stage, we use this policy-derived distribution to warm-start Model Predictive Path Integral (MPPI) planning over a separately trained JEPA world model, which predicts future latent states in the embedding space of the same frozen encoder. By initializing the MPPI sampling distribution from the policy prior rather than from an uninformed Gaussian, our planner converges faster to high-quality action sequences that reach the goal. We systematically study the effect of the vision encoder backbone, comparing DINOv2 and V-JEPA-2, across both the policy and world model components. Experiments on real-world navigation tasks demonstrate that PiJEPA significantly outperforms both standalone policy execution and uninformed world model planning, achieving improved goal-reaching accuracy and instruction-following fidelity.
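
To make the warm-start idea concrete, here is a minimal MPPI sketch under heavy simplifying assumptions: a 2D point mass stands in for the JEPA latent world model, the cost is plain distance to a fixed goal, and a hand-written mean action sequence stands in for the policy prior. None of these choices come from the paper; the sketch only shows why starting the sampling distribution near a sensible plan helps after a single update.

```python
import math
import random

GOAL = (5.0, 0.0)
HORIZON = 5


def rollout_cost(actions, start=(0.0, 0.0)):
    """Toy stand-in for a world model: integrate 2D displacements and
    return the distance between the final position and the goal."""
    x, y = start
    for dx, dy in actions:
        x, y = x + dx, y + dy
    return math.hypot(GOAL[0] - x, GOAL[1] - y)


def mppi_step(mean, sigma=0.3, num_samples=64, temperature=1.0, rng=random):
    """One MPPI update: sample action sequences around `mean`, weight each
    by exp(-cost / temperature), and return the weighted average sequence."""
    samples = [[(mx + rng.gauss(0, sigma), my + rng.gauss(0, sigma))
                for mx, my in mean] for _ in range(num_samples)]
    costs = [rollout_cost(s) for s in samples]
    c_min = min(costs)  # subtract the minimum for numerical stability
    weights = [math.exp(-(c - c_min) / temperature) for c in costs]
    total = sum(weights)
    return [(sum(w * s[t][0] for w, s in zip(weights, samples)) / total,
             sum(w * s[t][1] for w, s in zip(weights, samples)) / total)
            for t in range(len(mean))]


rng = random.Random(0)
cold_mean = [(0.0, 0.0)] * HORIZON  # uninformed prior: zero-mean Gaussian
warm_mean = [(0.8, 0.0)] * HORIZON  # hypothetical policy prior, roughly goal-directed
cold_cost = rollout_cost(mppi_step(cold_mean, rng=rng))
warm_cost = rollout_cost(mppi_step(warm_mean, rng=rng))
print("cold-start cost:", cold_cost)
print("warm-start cost:", warm_cost)
```

With the same sample budget, the warm-started plan ends much closer to the goal, because the samples are already concentrated around useful action sequences instead of around standing still.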

Jiayi Chen, Wenxuan Song, Shuai Chen, Jingbo Wang, Zhijun Li

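TL;DR: This paper proposes DFM-VLA, a vision-language-action model based on discrete flow matching, for iterative refinement of action sequences in robotic manipulation. By modeling a token-level probability velocity field, it dynamically updates the entire action sequence across iterations, overcoming the limitation of existing autoregressive and discrete diffusion models that a token cannot be revised once generated.

Details

Motivation: In existing vision-language-action models with discrete action tokenization, whether decoding is autoregressive or uses discrete diffusion, a token is fixed once generated, so early errors cannot be corrected in later iterations. This limits the accuracy and robustness of action generation.

Result: DFM-VLA reaches an average success length of 4.44 on the CALVIN benchmark and an average success rate of 95.7% on the LIBERO benchmark, outperforming autoregressive, discrete diffusion, and continuous diffusion baselines and achieving state-of-the-art performance.

Insight: The novelty is bringing discrete flow matching to action-sequence generation, which enables iterative refinement of action tokens. The paper proposes two ways to construct the velocity field (an auxiliary velocity head and action-embedding guidance) and a two-stage decoding strategy (iterative refinement followed by deterministic validation), improving generation stability and inference efficiency.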

Abstract: Vision–Language–Action (VLA) models that encode actions using a discrete tokenization scheme are increasingly adopted for robotic manipulation, but existing decoding paradigms remain fundamentally limited. Whether actions are decoded sequentially by autoregressive VLAs or in parallel by discrete diffusion VLAs, once a token is generated, it is typically fixed and cannot be revised in subsequent iterations, so early token errors cannot be effectively corrected later. We propose DFM-VLA, a discrete flow matching VLA for iterative refinement of action tokens. DFM-VLA models a token-level probability velocity field that dynamically updates the full action sequence across refinement iterations. We investigate two ways to construct the velocity field: an auxiliary velocity-head formulation and an action-embedding-guided formulation. Our framework further adopts a two-stage decoding strategy with an iterative refinement stage followed by deterministic validation for stable convergence. Extensive experiments on CALVIN, LIBERO, and real-world manipulation tasks show that DFM-VLA consistently outperforms strong autoregressive, discrete diffusion, and continuous diffusion baselines in manipulation performance while retaining high inference efficiency. In particular, DFM-VLA achieves an average success length of 4.44 on CALVIN and an average success rate of 95.7% on LIBERO, highlighting the value of action refinement via discrete flow matching for robotic manipulation. Our project is available at https://chris1220313648.github.io/DFM-VLA/

cs.AI

[99] GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation cs.AI | cs.CV

Rui Xie, Zhi Gao, Chenrui Shi, Zirui Shang, Lu Chen

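TL;DR: GUIDE is a training-free, plug-and-play framework for resolving the domain bias of GUI agents in specific application domains. It retrieves web tutorial videos and automatically annotates the domain expertise they contain, strengthening the agent's understanding of application-specific operation workflows and UI layouts and thereby improving performance on real-world tasks.

Details

Motivation: GUI agents built on large vision-language models perform well at general interface understanding and interaction. However, because their training data lacks operation data for domain-specific software, they are unfamiliar with the operation workflows and UI element layouts of particular applications. This significant domain bias limits their performance on real tasks.

Result: Extensive experiments on the OSWorld benchmark show that GUIDE, as a plug-and-play component for both multi-agent systems and single-model agents, consistently yields improvements of over 5% and reduces execution steps without modifying any model parameters or architecture, validating it as an architecture-agnostic enhancement.

Insight: GUIDE's innovations are: 1) a subtitle-driven video retrieval-augmented generation (Video-RAG) pipeline that unlocks video semantics through progressive three-stage retrieval (domain classification, topic extraction, relevance matching) to identify task-relevant tutorial videos; 2) a fully automated annotation pipeline built on an inverse dynamics paradigm, which feeds consecutive keyframes enhanced with UI element detection into VLMs to infer planning and grounding knowledge and injects it into the corresponding agent modules, addressing both manifestations of domain bias. Viewed objectively, the approach exploits abundant web video for real-time knowledge acquisition and annotation, offering an efficient, retraining-free, system-level way to mitigate domain-adaptation problems caused by data scarcity.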

Abstract: Large vision-language models have endowed GUI agents with strong general capabilities for interface understanding and interaction. However, due to insufficient exposure to domain-specific software operation data during training, these agents exhibit significant domain bias - they lack familiarity with the specific operation workflows (planning) and UI element layouts (grounding) of particular applications, limiting their real-world task performance. In this paper, we present GUIDE (GUI Unbiasing via Instructional-Video Driven Expertise), a training-free, plug-and-play framework that resolves GUI agent domain bias by autonomously acquiring domain-specific expertise from web tutorial videos through a retrieval-augmented automated annotation pipeline. GUIDE introduces two key innovations. First, a subtitle-driven Video-RAG pipeline unlocks video semantics through subtitle analysis, performing progressive three-stage retrieval - domain classification, topic extraction, and relevance matching - to identify task-relevant tutorial videos. Second, a fully automated annotation pipeline built on an inverse dynamics paradigm feeds consecutive keyframes enhanced with UI element detection into VLMs, inferring the required planning and grounding knowledge that is injected into the agent’s corresponding modules to address both manifestations of domain bias. Extensive experiments on OSWorld demonstrate GUIDE’s generality as a plug-and-play component for both multi-agent systems and single-model agents. It consistently yields over 5% improvements and reduces execution steps - without modifying any model parameters or architecture - validating GUIDE as an architecture-agnostic enhancement to bridge GUI agent domain bias.

eess.IV

[100] Adapting Segment Anything Model 3 for Concept-Driven Lesion Segmentation in Medical Images: An Experimental Study eess.IV | cs.CV

Guoping Xu, Jayaram K. Udupa, Yubing Tong, Xin Long, Ying Zhang

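TL;DR: This study systematically evaluates Segment Anything Model 3 (SAM3) for lesion segmentation in medical images. Using geometric bounding boxes and concept-based text and image prompts, it runs experiments on 13 datasets spanning multiple imaging modalities (MRI, CT, ultrasound, and others), and explores incorporating prior knowledge and fine-tuning strategies to improve robustness.

Details

Motivation: Existing lesion segmentation methods are usually designed for specific anatomical sites or imaging modalities and generalize poorly. Concept-driven segmentation based on vision-language foundation models performs well on natural images, but in medical imaging, and especially concept-prompted segmentation with the latest SAM3, it remains underexplored.

Result: On 13 datasets covering 11 lesion types, SAM3 shows strong cross-modality generalization, reliable concept-driven segmentation, and accurate lesion delineation, indicating potential for scalable and practical medical image segmentation.

Insight: The contribution is applying SAM3 to medical lesion segmentation, systematically evaluating the effectiveness of concept prompts (text and image), and introducing prior knowledge (such as adjacent-slice predictions and multiparametric information) together with several fine-tuning strategies (such as partial module tuning and adapter-based methods) to improve robustness, pointing to a new direction for general-purpose medical image analysis.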

Abstract: Accurate lesion segmentation is essential in medical image analysis, yet most existing methods are designed for specific anatomical sites or imaging modalities, limiting their generalizability. Recent vision-language foundation models enable concept-driven segmentation in natural images, offering a promising direction for more flexible medical image analysis. However, concept-prompt-based lesion segmentation, particularly with the latest Segment Anything Model 3 (SAM3), remains underexplored. In this work, we present a systematic evaluation of SAM3 for lesion segmentation. We assess its performance using geometric bounding boxes and concept-based text and image prompts across multiple modalities, including multiparametric MRI, CT, ultrasound, dermoscopy, and endoscopy. To improve robustness, we incorporate additional prior knowledge, such as adjacent-slice predictions, multiparametric information, and prior annotations. We further compare different fine-tuning strategies, including partial module tuning, adapter-based methods, and full-model optimization. Experiments on 13 datasets covering 11 lesion types demonstrate that SAM3 achieves strong cross-modality generalization, reliable concept-driven segmentation, and accurate lesion delineation. These results highlight the potential of concept-based foundation models for scalable and practical medical image segmentation. Code and trained models will be released at: https://github.com/apple1986/lesion-sam3

Yi Zhang, Yidong Zhao, Qian Tao

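TL;DR: This paper proposes a new framework for multi-modal medical image registration that combines a pretrained mono-modal registration model with a lightweight adaptation pipeline. Style transfer built on contrast-agnostic representation generation and refinement modules, together with instance optimization at test time, bridges modality and domain gaps. The approach avoids the computational burden of full fine-tuning while retaining the ability to adapt to unseen domains.

Details

Motivation: The paper targets deformable image registration in multi-modal medical imaging settings (for example CT and MRI, whose intensity distributions differ markedly). Existing deep learning methods generalize poorly under test-time distribution shift, while fully fine-tuning modern architectures (such as Transformers and deep U-Nets) is prohibitively expensive for 3D operation and may degrade performance under drastic domain shifts.

Result: Evaluated on the Learn2Reg 2025 LUMIR validation set, the method improves consistently over the pretrained state-of-the-art mono-modal backbone. It ranks second on the multi-modal subset, third on the out-of-domain subset, and fourth overall in Dice score.

Insight: The novelty is combining a frozen pretrained mono-modal model with modality adaptation and lightweight instance optimization, which offers an effective and practical path to robust multi-modal registration. The design is independent of the choice of mono-modal backbone, avoids the cost of full fine-tuning, and adapts to unseen domains through contrast-agnostic representations and test-time instance optimization.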

Abstract: Deformable image registration remains a central challenge in medical image analysis, particularly under multi-modal scenarios where intensity distributions vary significantly across scans. While deep learning methods provide efficient feed-forward predictions, they often fail to generalize robustly under distribution shifts at test time. A straightforward remedy is full network fine-tuning, yet for modern architectures such as Transformers or deep U-Nets, this adaptation is prohibitively expensive in both memory and runtime when operating in 3D. Moreover, naive fine-tuning risks degrading performance in the presence of drastic domain shifts. In this work, we propose a registration framework that integrates a frozen pretrained mono-modal registration model with a lightweight adaptation pipeline for multi-modal image registration. Specifically, we employ style transfer based on contrast-agnostic representation generation and refinement modules to bridge modality and domain gaps with instance optimization at test time. This design is orthogonal to the choice of backbone mono-modal model, and thus avoids the computational burden of full fine-tuning while retaining the flexibility to adapt to unseen domains. We evaluate our approach on the Learn2Reg 2025 LUMIR validation set and observe consistent improvements over the pretrained state-of-the-art mono-modal backbone. In particular, the method ranks second on the multi-modal subset, third on the out-of-domain subset, and achieves fourth place overall in Dice score. These results demonstrate that combining frozen mono-modal models with modality adaptation and lightweight instance optimization offers an effective and practical pathway toward robust multi-modal registration.

eess.SY

[102] Experimental study on surveillance video-based indoor occupancy measurement with occupant-centric control eess.SY | cs.CV

Irfan Qaisar, Kailai Sun, Qingshan Jia, Qianchuan Zhao

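TL;DR: This experimental study examines surveillance-video-based indoor occupancy measurement and its effect on occupant-centric HVAC control. It compares three pipelines (detection-only, tracking-based, and LLM-based refinement) on real laboratory surveillance data with manual annotations. Tracking-based methods improve temporal stability, and LLM-based refinement further improves measurement performance and reduces erroneous predictions. The best pipeline (YOLOv8+DeepSeek), once integrated into a model predictive control framework, shows a substantial energy-saving potential of 17.94%.

Details

Motivation: Reaching the 2050 net-zero emissions target requires accurate occupancy information for closed-loop occupant-centric control in smart buildings. However, existing vision-based occupancy measurement methods struggle to give stable, accurate measurements in real indoor environments, and their impact on downstream HVAC control is insufficiently studied.

Result: On real laboratory data, tracking-based methods improve temporal stability over detection-only methods, and LLM-based refinement further improves occupancy measurement performance (the best pipeline, YOLOv8+DeepSeek, reaches an accuracy of 0.8824 and an F1-score of 0.9320). After this pipeline is integrated into a model predictive control framework in OpenStudio-EnergyPlus, the experiments show that it supports more efficient occupant-centric control, with an HVAC energy-saving potential of 17.94%.

Insight: The paper's contribution is a systematic experimental comparison of three vision pipelines (detection, tracking, LLM refinement) for occupancy measurement, and the first integration of the best-performing LLM-enhanced pipeline (YOLOv8+DeepSeek) into an HVAC control framework, quantifying its actual effect on energy savings and providing an effective methodology and practical basis for AI-enhanced smart building operations.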

Abstract: Accurate occupancy information is essential for closed-loop occupant-centric control (OCC) in smart buildings. However, existing vision-based occupancy measurement methods often struggle to provide stable and accurate measurements in real indoor environments, and their implications for downstream HVAC control remain insufficiently studied. In support of the goal of Net Zero emissions by 2050, this paper presents an experimental study of large language model (LLM)-enhanced vision-based indoor occupancy measurement and its impact on OCC-enabled HVAC operation. Detection-only, tracking-based, and LLM-based refinement pipelines are compared under identical conditions using real surveillance data collected from a research laboratory in China, with frame-level manual ground-truth annotations. Results show that tracking-based methods improve temporal stability over detection-only measurement, while LLM-based refinement further improves occupancy measurement performance and reduces false unoccupied predictions. The best-performing pipeline, YOLOv8+DeepSeek, achieves an accuracy of 0.8824 and an F1-score of 0.9320. This pipeline is then integrated into an HVAC supervisory model predictive control framework in OpenStudio-EnergyPlus. Experimental results demonstrate that the proposed framework can support more efficient OCC operation, achieving a substantial HVAC energy-saving potential of 17.94%. These findings provide an effective methodology and practical foundation for future research in AI-enhanced smart building operations.
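
For reference, frame-level accuracy and F1 for binary occupied/unoccupied labels can be computed as in the sketch below. The majority filter is only a simple stand-in for the temporal stabilization that tracking provides; it is not the paper's pipeline, and the label sequences are made up for illustration.

```python
def confusion(truth, pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 means occupied."""
    tp = sum(t == 1 and p == 1 for t, p in zip(truth, pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(truth, pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(truth, pred))  # false unoccupied
    tn = sum(t == 0 and p == 0 for t, p in zip(truth, pred))
    return tp, fp, fn, tn


def accuracy(truth, pred):
    tp, _, _, tn = confusion(truth, pred)
    return (tp + tn) / len(truth)


def f1_score(truth, pred):
    tp, fp, fn, _ = confusion(truth, pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def majority_smooth(pred, radius=1):
    """Replace each frame by the majority label in a small window
    (ties count as occupied); a crude proxy for temporal consistency."""
    out = []
    for i in range(len(pred)):
        window = pred[max(0, i - radius): i + radius + 1]
        out.append(1 if 2 * sum(window) >= len(window) else 0)
    return out


truth = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1]  # hypothetical ground truth per frame
raw = [1, 0, 1, 1, 0, 0, 1, 1, 0, 1]    # hypothetical detector output with two dropouts
smoothed = majority_smooth(raw)

for name, pred in (("raw", raw), ("smoothed", smoothed)):
    print(name, accuracy(truth, pred), f1_score(truth, pred),
          "false unoccupied:", confusion(truth, pred)[2])
```

Isolated single-frame dropouts, which would otherwise register as false unoccupied frames, disappear after smoothing in this example. Real footage would also contain genuine short absences, so the window size trades stability against responsiveness.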

cs.LG

[103] H-Node Attack and Defense in Large Language Models cs.LG | cs.AI | cs.CL | cs.NE

Eric Yocam, Varghese Vaidyan, Yong Wang

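TL;DR: This paper proposes H-Node Adversarial Noise Cancellation (H-Node ANC), a mechanistic framework for identifying, exploiting, and defending against hallucination representations at the level of individual hidden-state dimensions in Transformer-based large language models. A logistic regression probe localizes hallucination nodes (H-Nodes), enabling a selective white-box attack, and adaptive and dynamic iterative defenses suppress hallucination. The approach is validated on multiple models with only minor impact on general reasoning capability.

Details

Motivation: The work addresses how to localize, exploit, and defend against hallucination representations in large language models, aiming to understand and control, at the mechanistic level, the specific internal dimensions associated with hallucination.

Result: Validated on OPT-125M, Phi-3-mini-4k-instruct, LLaMA-3-8B-Instruct, and Mistral-7B-Instruct-v0.3 (125M-8B parameters): probe AUC reaches 0.90; the attack achieves 3.02x selectivity with less than 10% visibility to the defender; the adaptive defense reduces grounded activation drift by 33-42%; the dynamic iterative defense recovers up to 0.69 robustness from a single-pass baseline of 8%; perplexity impact is below 5% and MMLU degradation is at most 3%.

Insight: The novelty is localizing hallucination representations to individual hidden-state dimensions (H-Nodes) and proposing a selective attack via a real-time forward hook together with an adaptive, confidence-weighted cancellation defense. Viewed objectively, intervening on internal model state at the level of individual dimensions to control hallucination offers a new mechanistic route to understanding and improving LLM reliability.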

Abstract: We present H-Node Adversarial Noise Cancellation (H-Node ANC), a mechanistic framework that identifies, exploits, and defends against hallucination representations in transformer-based large language models (LLMs) at the level of individual hidden-state dimensions. A logistic regression probe trained on last-token hidden states localizes hallucination signal to a small set of high-variance dimensions – termed Hallucination Nodes (H-Nodes) – with probe AUC reaching 0.90 across four architectures. A white-box adversarial attack amplifies these dimensions at inference time via a real-time forward hook, achieving a selectivity of 3.02x with less than 10% visibility to the defender. Adaptive ANC defense suppresses H-Node excess in-pass using confidence-weighted cancellation, reducing grounded activation drift by 33-42% over static cancellation. A dynamic iterative extension that re-ranks cancellation targets across successive passes recovers up to 0.69 robustness from a single-pass baseline of 8%. All contributions are validated on OPT-125M, Phi-3-mini-4k-instruct, LLaMA-3-8B-Instruct, and Mistral-7B-Instruct-v0.3 (125M-8B parameters). Perplexity impact is surgical (<5%) and MMLU degradation is at most 3%, confirming that the defense does not impair general reasoning capability.
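
The probe-then-intervene pattern can be sketched on synthetic data. Everything below is an assumption made for illustration: the "hidden states" are random vectors with a planted signal in two dimensions, H-Nodes are picked by the largest absolute probe weights (the paper's exact selection rule is not given in the abstract), and `amplify` and `cancel` are hypothetical helpers that only mimic the shape of the attack and the confidence-weighted defense. A real implementation would read and modify activations inside a model, for example through a forward hook.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(X, y, lr=0.5, epochs=300):
    """Logistic regression probe trained with full-batch gradient descent."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            grad_b += err
            for d in range(dim):
                grad_w[d] += err * x[d]
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def amplify(hidden, nodes, gain=3.0):
    """Attack sketch: scale up the selected dimensions only."""
    return [h * gain if d in nodes else h for d, h in enumerate(hidden)]


def cancel(hidden, nodes, baseline, confidence):
    """Defense sketch: pull selected dimensions toward a grounded baseline,
    scaled by the probe's confidence that the state is hallucinated."""
    return [h - confidence * (h - baseline[d]) if d in nodes else h
            for d, h in enumerate(hidden)]


# Synthetic "hidden states": label 1 shifts two planted dimensions.
rng = random.Random(0)
DIM, SIGNAL_DIMS = 8, (2, 5)
X, y = [], []
for i in range(200):
    label = i % 2
    x = [rng.gauss(0, 1) for _ in range(DIM)]
    if label:
        for d in SIGNAL_DIMS:
            x[d] += 2.0
    X.append(x)
    y.append(label)

w, b = train_probe(X, y)
h_nodes = sorted(range(DIM), key=lambda d: abs(w[d]), reverse=True)[:2]
print("selected dimensions:", sorted(h_nodes))

state = X[1]  # a label-1 example
confidence = sigmoid(sum(wi * xi for wi, xi in zip(w, state)) + b)
defended = cancel(state, h_nodes, [0.0] * DIM, confidence)
print("probe confidence:", confidence)
```

On this toy data the probe recovers the two planted dimensions, and the defense touches only those coordinates, which mirrors the abstract's claim that the intervention is narrow.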

[104] Selective Deficits in LLM Mental Self-Modeling in a Behavior-Based Test of Theory of Mind cs.LG | cs.AI | cs.CL

Christopher Ackerman

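TL;DR: This paper develops a novel behavioral experimental paradigm to test whether large language models have Theory of Mind, that is, whether they can form models of their own and others' mental states and act strategically on them. LLMs released before mid-2025 fail at all tasks; more recent models reach human-level performance at modeling others' cognitive states; but even frontier models fail the self-modeling task unless given a reasoning trace as a scratchpad. The study also finds cognitive load effects and explores the mechanisms by which reasoning models succeed, including strategic deception.

Details

Motivation: The aim is to determine whether LLMs have actually learned causal mental models that they can deploy in arbitrary settings, rather than merely mimicking the Theory of Mind descriptions that are ubiquitous in their training data.

Result: On the author's behavior-based paradigm, recent LLMs reach human-level performance on other-modeling tasks, but all tested models fail the self-modeling task unless given a reasoning trace (scratchpad). The study also demonstrates cognitive load effects on other-modeling tasks.

Insight: The novelty is a test paradigm that requires models to act strategically on mental states rather than merely describe them. It reveals a selective deficit in LLMs' mental self-modeling and offers evidence that LLMs may use something akin to limited-capacity working memory. Worth borrowing are the evaluation approach, which separates mimicry from genuine causal reasoning, and the insight that intervening on the model's reasoning process (for example via a scratchpad) can improve performance.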

Abstract: The ability to represent oneself and others as agents with knowledge, intentions, and belief states that guide their behavior - Theory of Mind - is a human universal that enables us to navigate - and manipulate - the social world. It is supported by our ability to form mental models of ourselves and others. Its ubiquity in human affairs entails that LLMs have seen innumerable examples of it in their training data and therefore may have learned to mimic it, but whether they have actually learned causal models that they can deploy in arbitrary settings is unclear. We therefore develop a novel experimental paradigm that requires that subjects form representations of the mental states of themselves and others and act on them strategically rather than merely describe them. We test a wide range of leading open and closed source LLMs released since 2024, as well as human subjects, on this paradigm. We find that 1) LLMs released before mid-2025 fail at all of our tasks, 2) more recent LLMs achieve human-level performance on modeling the cognitive states of others, and 3) even frontier LLMs fail at our self-modeling task - unless afforded a scratchpad in the form of a reasoning trace. We further demonstrate cognitive load effects on other-modeling tasks, offering suggestive evidence that LLMs are using something akin to limited-capacity working memory to hold these mental representations in mind during a single forward pass. Finally, we explore the mechanisms by which reasoning models succeed at the self- and other-modeling tasks, and show that they readily engage in strategic deception.

Table of Contents

cs.CL

[1] Relational graph-driven differential denoising and diffusion attention fusion for multimodal conversation emotion recognition cs.CL | cs.SD | eess.AS

[2] RealChart2Code: Advancing Chart-to-Code Generation with Real Data and Multi-Task Evaluation cs.CL

[3] Doctorina MedBench: End-to-End Evaluation of Agent-Based Medical AI cs.CL | cs.AI | cs.LG | cs.MA

[4] Can Small Models Reason About Legal Documents? A Comparative Study cs.CL | cs.AI

[5] Toward Culturally Grounded Natural Language Processing cs.CL

[6] AgentCollab: A Self-Evaluation-Driven Collaboration Paradigm for Efficient LLM Agents cs.CL

[7] LLM Benchmark-User Need Misalignment for Climate Change cs.CL

[8] ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory cs.CL

[9] Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR cs.CL | cs.AI | cs.LG | eess.AS

[10] From Human Cognition to Neural Activations: Probing the Computational Primitives of Spatial Reasoning in LLMs cs.CL | cs.AI

[11] CALRK-Bench: Evaluating Context-Aware Legal Reasoning in Korean Law cs.CL | cs.AI

[12] Why Models Know But Don’t Say: Chain-of-Thought Faithfulness Divergence Between Thinking Tokens and Answers in Open-Weight Reasoning Models cs.CL | cs.AI

[13] ClimateCheck 2026: Scientific Fact-Checking and Disinformation Narrative Classification of Climate-related Claims cs.CL

cs.CV

[14] A Survey of OCR Evaluation Methods and Metrics and the Invisibility of Historical Documents cs.CV | cs.DL

[15] Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis cs.CV

[16] ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions cs.CV

[17] Do All Vision Transformers Need Registers? A Cross-Architectural Reassessment cs.CV | cs.LG

[18] ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners? cs.CV | cs.AI

[19] Fus3D: Decoding Consolidated 3D Geometry from Feed-forward Geometry Transformer Latents cs.CV

[20] GazeQwen: Lightweight Gaze-Conditioned LLM Modulation for Streaming Video Understanding cs.CV | cs.AI

[21] GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks cs.CV | cs.AI | cs.HC

[22] Seeing Through Smoke: Surgical Desmoking for Improved Visual Perception cs.CV

[23] Speech-Synchronized Whiteboard Generation via VLM-Driven Structured Drawing Representations cs.CV | cs.LG

[24] Automated Quality Assessment of Blind Sweep Obstetric Ultrasound for Improved Diagnosis cs.CV

[25] World Reasoning Arena cs.CV

[26] Few Shots Text to Image Retrieval: New Benchmarking Dataset and Optimization Methods cs.CV

[27] THFM: A Unified Video Foundation Model for 4D Human Perception and Beyond cs.CV

[28] Shared Representation for 3D Pose Estimation, Action Classification, and Progress Prediction from Tactile Signals cs.CV

[29] Good Scores, Bad Data: A Metric for Multimodal Coherence cs.CV | cs.AI

[30] DiReCT: Disentangled Regularization of Contrastive Trajectories for Physics-Refined Video Generation cs.CV | cs.AI

[31] Reinforcing Structured Chain-of-Thought for Video Understanding cs.CV | cs.AI

[32] Collision-Aware Vision-Language Learning for End-to-End Driving with Multimodal Infraction Datasets cs.CV | cs.AI | cs.LG

[33] Low-Rank-Modulated Functa: Exploring the Latent Space of Implicit Neural Representations for Interpretable Ultrasound Video Analysis cs.CV

[34] BEVMAPMATCH: Multimodal BEV Neural Map Matching for Robust Re-Localization of Autonomous Vehicles cs.CV

[35] Neuro-Cognitive Reward Modeling for Human-Centered Autonomous Vehicle Control cs.CV

[36] FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants cs.CV | cs.AI

[37] VLAgeBench: Benchmarking Large Vision-Language Models for Zero-Shot Human Age Estimation cs.CV | cs.AI

[38] GeoReFormer: Geometry-Aware Refinement for Lane Segment Detection and Topology Reasoning cs.CV | cs.RO

[39] Knowledge is Power: Advancing Few-shot Action Recognition with Multimodal Semantics from MLLMs cs.CV

[40] Face2Parts: Exploring Coarse-to-Fine Inter-Regional Facial Dependencies for Generalized Deepfake Detection cs.CV

[41] Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives cs.CV

[42] Seeing Like Radiologists: Context- and Gaze-Guided Vision-Language Pretraining for Chest X-rays cs.CV | cs.AI

[43] Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification cs.CV | cs.AI

[44] Pioneering Perceptual Video Fluency Assessment: A Novel Task with Benchmark Dataset and Baseline cs.CV

[45] Finding Distributed Object-Centric Properties in Self-Supervised Transformers cs.CV | cs.AI | cs.CL | cs.LG | cs.MM

[46] MuDD: A Multimodal Deception Detection Dataset and GSR-Guided Progressive Distillation for Non-Contact Deception Detection cs.CV | cs.AI

[47] PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery cs.CV

[48] MUST: Modality-Specific Representation-Aware Transformer for Diffusion-Enhanced Survival Prediction with Missing Modality cs.CV | cs.LG

[49] PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning cs.CV | cs.AI | cs.CL | cs.LG

[50] Learnable Instance Attention Filtering for Adaptive Detector Distillation cs.CV

[51] SDDF: Specificity-Driven Dynamic Focusing for Open-Vocabulary Camouflaged Object Detection cs.CV

[52] SkinGPT-X: A Self-Evolving Collaborative Multi-Agent System for Transparent and Trustworthy Dermatological Diagnosis cs.CV | cs.AI

[53] Beyond Where to Look: Trajectory-Guided Reinforcement Learning for Multimodal RLVR cs.CV

[54] TaxaAdapter: Vision Taxonomy Models are Key to Fine-grained Image Generation over the Tree of Life cs.CV

[55] InstaVSR: Taming Diffusion for Efficient and Temporally Consistent Video Super-Resolution cs.CV

[56] IP-Bench: Benchmark for Image Protection Methods in Image-to-Video Generation Scenarios cs.CV

[57] Provably Contractive and High-Quality Denoisers for Convergent Restoration cs.CV

[58] CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions cs.CV

[59] Consistency Beyond Contrast: Enhancing Open-Vocabulary Object Detection Robustness via Contextual Consistency Learning cs.CV

[60] GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport cs.CV

[61] Progressive Learning with Anatomical Priors for Reliable Left Atrial Scar Segmentation from Late Gadolinium Enhancement MRI cs.CV | cs.AI

[62] OSA: Echocardiography Video Segmentation via Orthogonalized State Update and Anatomical Prior-aware Feature Enhancement cs.CV

[63] MemCam: Memory-Augmented Camera Control for Consistent Video Generation cs.CV | cs.AI

[64] Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding cs.CV | cs.AI

[65] Real-Time Branch-to-Tool Distance Estimation for Autonomous UAV Pruning: Benchmarking Five DEFOM-Stereo Variants from Simulation to Jetson Deployment cs.CV

[66] ARTA: Adaptive Mixed-Resolution Token Allocation for Efficient Dense Feature Extraction cs.CV | cs.AI | cs.LG

[67] PhysVid: Physics Aware Local Conditioning for Generative Video Models cs.CV | cs.AI

[68] Preference-Aligned LoRA Merging: Preserving Subspace Coverage and Addressing Directional Anisotropy cs.CV | cs.AI | cs.LG

[69] SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning cs.CV | cs.LG

[70] Label-Free Cross-Task LoRA Merging with Null-Space Compression cs.CV | cs.AI | cs.LG

[71] Mitigating the Reasoning Tax in Vision-Language Fine-Tuning with Input-Adaptive Depth Aggregation cs.CV | cs.AI

[72] From Pixels to Privacy: Temporally Consistent Video Anonymization via Token Pruning for Privacy Preserving Action Recognition cs.CV

[73] HINT: Composed Image Retrieval with Dual-path Compositional Contextualized Network cs.CV

[74] Reflect to Inform: Boosting Multimodal Reasoning via Information-Gain-Driven Verification cs.CV | cs.AI

[75] Only Whats Necessary: Pareto Optimal Data Minimization for Privacy Preserving Video Anomaly Detection cs.CV

[76] From Pen to Pixel: Translating Hand-Drawn Plots into Graphical APIs via a Novel Benchmark and Efficient Adapter cs.CV

[77] HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models cs.CV