Table of Contents

cs.CL [Back]

[1] Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning cs.CL | cs.IRPDF

Chris Samarinas, Haw-Shiuan Chang, Hamed Zamani

TL;DR: 本文提出了SLATE框架,通过截断步级采样和密集LLM作为评判者的奖励机制,解决了检索增强推理中强化学习训练大语言模型时的信用分配问题,显著降低了梯度方差并提升了性能。

Details

Motivation: 现有方法如Search-R1仅提供稀疏的轨迹级奖励,而StepSearch等过程奖励方法依赖启发式奖励且采样完整轨迹导致高梯度方差,难以将成功或失败归因于单个推理和检索决策。

Result: 在七个QA基准测试上的实验表明,SLATE在稀疏奖励和过程奖励基线方法中均表现更优,尤其在更困难的多跳任务和小模型上提升最大。

Insight: 创新点包括截断步级采样(共享前缀的轨迹采样)和密集LLM评判者奖励(替代启发式评分),理论上证明在相同密集奖励结构下,截断采样能将优势估计的方差降低多达T倍(T为步数),从而提供更低方差、更精准的策略梯度。

Abstract: Training large language models to reason with search engines via reinforcement learning is hindered by a fundamental credit assignment problem: existing methods such as Search-R1 provide only a sparse outcome reward after an entire multi-step trajectory, making it infeasible to attribute success or failure to individual reasoning and retrieval decisions. Process-reward methods like StepSearch alleviate this by introducing step-level supervision, but rely on heuristic rewards such as TF-IDF overlap with gold documents, and still sample k complete trajectories per example, retaining high gradient variance. We propose SLATE, a framework built on two complementary ideas: (1) truncated step-level sampling, which generates k trajectories that share a common prefix and differ only at the next step, and (2) dense LLM-as-judge rewards, which replace heuristic scoring with a capable LLM evaluator that assesses the quality of each reasoning step, search query, and answer, providing richer and more reliable supervision. We theoretically prove that under the same dense reward structure, truncated sampling reduces the variance of advantage estimates by up to a factor of T compared to full-trajectory sampling for T-step trajectories, yielding lower-variance, better-targeted policy gradients. Experiments on seven QA benchmarks confirm that SLATE consistently outperforms both sparse-reward and process-reward baselines, with the largest gains on harder multi-hop tasks and smaller models.


[2] CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era cs.CL | cs.DLPDF

Zhengqing Yuan, Kaiwen Shi, Zheyuan Zhang, Lichao Sun, Nitesh V. Chawla

TL;DR: 本文提出了CiteAudit,首个用于检测科学写作中幻觉引用的综合基准和检测框架。该框架通过多智能体验证流程将引用检查分解为声明提取、证据检索、段落匹配、推理和校准判断等步骤,以评估被引用的来源是否真正支持其声明。作者构建了一个大规模的人工验证数据集,并定义了统一的引用忠实性和证据对齐度指标。实验表明,该框架在准确性和可解释性上显著优于现有方法。

Details

Motivation: 科学研究依赖准确的引用以确保归属性和完整性,但大型语言模型(LLM)带来了新的风险:捏造的引用看起来合理但对应不上真实的出版物。这种幻觉引用已在主要机器学习会议的投稿和录用论文中被观察到,暴露了同行评审的脆弱性。同时,快速增长的参考文献列表使得手动验证不切实际,而现有的自动化工具对噪声和异构的引用格式仍然脆弱,且缺乏标准化评估。

Result: 实验使用最先进的LLM揭示了大量的引用错误,并表明该框架在准确性和可解释性上都显著优于先前的方法。

Insight: 论文的创新点在于首次提出了一个可扩展的、用于审计LLM时代科学引用的基础设施,通过多智能体验证管道和统一的评估指标,为解决幻觉引用问题提供了实用的工具和基准。从客观角度看,其将复杂任务分解为可管理的子任务并集成校准判断的方法,为构建更可靠的自动化科学诚信检查系统提供了有价值的思路。

Abstract: Scientific research relies on accurate citation for attribution and integrity, yet large language models (LLMs) introduce a new risk: fabricated references that appear plausible but correspond to no real publications. Such hallucinated citations have already been observed in submissions and accepted papers at major machine learning venues, exposing vulnerabilities in peer review. Meanwhile, rapidly growing reference lists make manual verification impractical, and existing automated tools remain fragile to noisy and heterogeneous citation formats and lack standardized evaluation. We present the first comprehensive benchmark and detection framework for hallucinated citations in scientific writing. Our multi-agent verification pipeline decomposes citation checking into claim extraction, evidence retrieval, passage matching, reasoning, and calibrated judgment to assess whether a cited source truly supports its claim. We construct a large-scale human-validated dataset across domains and define unified metrics for citation faithfulness and evidence alignment. Experiments with state-of-the-art LLMs reveal substantial citation errors and show that our framework significantly outperforms prior methods in both accuracy and interpretability. This work provides the first scalable infrastructure for auditing citations in the LLM era and practical tools to improve the trustworthiness of scientific references.


[3] FHIRPath-QA: Executable Question Answering over FHIR Electronic Health Records cs.CLPDF

Michael Frew, Nishit Bheda, Bryan Tripp

TL;DR: 本文提出了FHIRPath-QA,首个针对患者特定问题回答的开放数据集和基准,基于真实临床数据并包含FHIRPath查询。论文引入了一种文本转FHIRPath的问答范式,将推理从自由文本生成转向FHIRPath查询合成,从而显著减少大语言模型的使用。

Details

Motivation: 解决现有电子健康记录接口难以提供精确、可信的患者特定问题答案,以及基于检索的大语言模型方法存在计算效率低、易产生幻觉且难以在真实电子健康记录上部署的问题。

Result: 在基于MIMIC-IV on FHIR Demo构建的数据集上,包含超过1.4万个自然语言问题及其对应的已验证FHIRPath查询和答案。实验表明,最先进的大语言模型在处理患者语言歧义和FHIRPath查询合成方面表现不佳,但通过监督微调后性能显著提升。

Insight: 创新点在于提出了文本到FHIRPath查询合成的问答范式,这为安全、高效、可互操作的消费者健康应用提供了潜在基础;同时发布的开放数据集和基准为未来研究提供了起点。

Abstract: Though patients are increasingly granted digital access to their electronic health records (EHRs), existing interfaces may not support precise, trustworthy answers to patient-specific questions. Large language models (LLM) show promise in clinical question answering (QA), but retrieval-based approaches are computationally inefficient, prone to hallucination, and difficult to deploy over real-life EHRs. In this work, we introduce FHIRPath-QA, the first open dataset and benchmark for patient-specific QA that includes open-standard FHIRPath queries over real-world clinical data. We propose a text-to-FHIRPath QA paradigm that shifts reasoning from free-text generation to FHIRPath query synthesis, significantly reducing LLM usage. Built on MIMIC-IV on FHIR Demo, the dataset pairs over 14k natural language questions in patient and clinician phrasing with validated FHIRPath queries and answers. Further, we demonstrate that state-of-the-art LLMs struggle to deal with ambiguity in patient language and perform poorly in FHIRPath query synthesis. However, they benefit strongly from supervised fine-tuning. Our results highlight that text-to-FHIRPath synthesis has the potential to serve as a practical foundation for safe, efficient, and interoperable consumer health applications, and our dataset and benchmark serve as a starting point for future research on the topic. The full dataset and generation code is available at: https://github.com/mooshifrew/fhirpath-qa.


[4] IDP Accelerator: Agentic Document Intelligence from Extraction to Compliance Validation cs.CLPDF

Md Mofijul Islam, Md Sirajus Salekin, Joe King, Priyashree Roy, Vamsi Thilak Gudi

TL;DR: 本文提出了IDP Accelerator框架,这是一个用于端到端文档智能处理的智能体AI系统。它通过四个核心组件解决非结构化文档处理中的关键挑战:文档分割、信息提取、智能分析和规则验证,并在实际部署中显著提升了准确性和效率。

Details

Motivation: 解决工业NLP中从非结构化文档理解和提取结构化信息的根本性挑战,特别是传统流程难以处理多文档包、复杂推理和严格合规性要求的问题。

Result: 在领先的医疗保健提供商的生产部署中,实现了98%的分类准确率,处理延迟降低了80%,运营成本降低了77%,超越了传统基线方法。

Insight: 创新点包括:1)引入使用BIO标记进行复杂文档包分割的新型基准数据集和多模态分类器DocSplit;2)采用符合模型上下文协议(MCP)的智能体分析模块,通过安全的沙箱代码执行提供数据访问;3)用LLM驱动的逻辑替代确定性引擎进行复杂的合规性检查。从客观角度看,该框架将多模态LLM与可配置的模块化智能体架构相结合,为工业文档处理提供了一个灵活且高效的端到端解决方案。

Abstract: Understanding and extracting structured insights from unstructured documents remains a foundational challenge in industrial NLP. While Large Language Models (LLMs) enable zero-shot extraction, traditional pipelines often fail to handle multi-document packets, complex reasoning, and strict compliance requirements. We present IDP (Intelligent Document Processing) Accelerator, a framework enabling agentic AI for end-to-end document intelligence with four key components: (1) DocSplit, a novel benchmark dataset and multimodal classifier using BIO tagging to segment complex document packets; (2) configurable Extraction Module leveraging multimodal LLMs to transform unstructured content into structured data; (3) Agentic Analytics Module, compliant with the Model Context Protocol (MCP) providing data access through secure, sandboxed code execution; and (4) Rule Validation Module replacing deterministic engines with LLM-driven logic for complex compliance checks. The interactive demonstration enables users to upload document packets, visualize classification results, and explore extracted data through an intuitive web interface. We demonstrate effectiveness across industries, highlighting a production deployment at a leading healthcare provider achieving 98% classification accuracy, 80% reduced processing latency, and 77% lower operational costs over legacy baselines. IDP Accelerator is open-sourced with a live demonstration available to the community.


[5] Humans and LLMs Diverge on Probabilistic Inferences cs.CL | cs.AIPDF

Gaurav Kamath, Sreenath Madathil, Sebastian Schuster, Marie-Catherine de Marneffe, Siva Reddy

TL;DR: 该论文通过构建ProbCOPA数据集,对比研究了人类与大型语言模型在概率推理任务上的表现差异,发现LLMs在非确定性推理中无法生成类似人类的概率分布,揭示了二者在推理模式上的根本性分歧。

Details

Motivation: 研究动机在于探索LLMs在开放域概率推理任务中的表现,填补当前对LLMs在非确定性推理能力评估的空白。

Result: 在自建的ProbCOPA数据集上,8个SOTA推理LLMs均无法复现人类参与者的概率判断分布,表现出与人类推理模式的系统性差异。

Insight: 创新点在于构建了首个针对概率推理的标注数据集,并通过思维链分析揭示了LLMs处理概率推理的固定模式,为评估非确定性推理能力提供了新范式。

Abstract: Human reasoning often involves working over limited information to arrive at probabilistic conclusions. In its simplest form, this involves making an inference that is not strictly entailed by a premise, but rather only likely given the premise. While reasoning LLMs have demonstrated strong performance on logical and mathematical tasks, their behavior on such open-ended, non-deterministic inferences remains largely unexplored. We introduce ProbCOPA, a dataset of 210 handcrafted probabilistic inferences in English, each annotated for inference likelihood by 25–30 human participants. We find that human responses are graded and varied, revealing probabilistic judgments of the inferences in our dataset. Comparing these judgments with responses from eight state-of-the-art reasoning LLMs, we show that models consistently fail to produce human-like distributions. Finally, analyzing LLM reasoning chains, we find evidence of a common reasoning pattern used to evaluate such inferences. Our findings reveal persistent differences between humans and LLMs, and underscore the need to evaluate reasoning beyond deterministic settings.


[6] Multi-Agent Causal Reasoning for Suicide Ideation Detection Through Online Conversations cs.CLPDF

Jun Li, Xiangmeng Wang, Haoyang Li, Yifei Yan, Shijie Zhang

TL;DR: 本文提出了一种多智能体因果推理(MACR)框架,用于通过在线对话树检测自杀意念。该框架包含一个推理智能体,用于生成反事实用户反应以扩展用户交互,以及一个偏差感知决策智能体,通过前门调整策略减轻隐藏偏差(如从众和模仿自杀行为)。实验证明MACR在识别自杀风险方面有效且鲁棒。

Details

Motivation: 现有社交媒体自杀风险检测方法存在两大局限:一是依赖预定义规则(如引用或回复)记录对话,仅捕捉狭窄的用户交互范围;二是忽视了隐藏影响(如用户从众和自杀模仿行为),这些因素会显著影响在线社区中的自杀表达和传播。

Result: 在真实世界对话数据集上的大量实验表明,MACR在识别自杀风险方面具有有效性和鲁棒性,但摘要未具体提及基准测试名称或与SOTA的比较结果。

Insight: 创新点在于结合多智能体协作,将认知评价理论融入推理智能体以生成反事实反应,并利用前门调整进行偏差感知决策,从而同时缓解隐藏偏差并利用反事实知识丰富用户交互的上下文信息。

Abstract: Suicide remains a pressing global public health concern. While social media platforms offer opportunities for early risk detection through online conversation trees, existing approaches face two major limitations: (1) They rely on predefined rules (e.g., quotes or relies) to log conversations that capture only a narrow spectrum of user interactions, and (2) They overlook hidden influences such as user conformity and suicide copycat behavior, which can significantly affect suicidal expression and propagation in online communities. To address these limitations, we propose a Multi-Agent Causal Reasoning (MACR) framework that collaboratively employs a Reasoning Agent to scale user interactions and a Bias-aware Decision-Making Agent to mitigate harmful biases arising from hidden influences. The Reasoning Agent integrates cognitive appraisal theory to generate counterfactual user reactions to posts, thereby scaling user interactions. It analyses these reactions through structured dimensions, i.e., cognitive, emotional, and behavioral patterns, with a dedicated sub-agent responsible for each dimension. The Bias-aware Decision-Making Agent mitigates hidden biases through a front-door adjustment strategy, leveraging the counterfactual user reactions produced by the Reasoning Agent. Through the collaboration of reasoning and bias-aware decision making, the proposed MACR framework not only alleviates hidden biases, but also enriches contextual information of user interactions with counterfactual knowledge. Extensive experiments on real-world conversational datasets demonstrate the effectiveness and robustness of MACR in identifying suicide risk.


[7] LLM-Driven Multi-Turn Task-Oriented Dialogue Synthesis for Realistic Reasoning cs.CL | cs.AIPDF

Yu Zhu, Kai Yang

TL;DR: 本文提出了一种由大语言模型驱动的多轮任务导向对话合成框架,旨在生成基于真实推理场景的对话数据,以解决现有基准数据集在评估LLM现实推理能力方面的不足。该方法利用三层优化提升对话质量,并围绕生成的对话设计相应的推理任务,最终构建了一个用于评估和提升LLM现实逻辑推理能力的数据集。

Details

Motivation: 现有基准数据集未能充分反映现实世界场景的复杂性,过于简化且抽象,与真实任务流程、领域约束和操作规则脱节,同时存在数据污染问题,且传统众包构建方法成本高、难以扩展,因此需要一种新方法来生成高质量、贴近现实的推理对话数据。

Result: 实验结果表明,基于合成数据的推理任务引入了非平凡的推理挑战,并为提升LLM的推理能力提供了有意义的支持。

Insight: 创新点在于提出了一个LLM驱动的、基于三层优化的对话合成框架,能够生成扎根于真实任务场景、富含现实信息且上下文连贯的多轮对话,并围绕其迭代优化设计推理任务,从而构建出更有效的评估基准。从客观角度看,该方法通过合成数据生成而非传统众包,为解决数据污染和可扩展性问题提供了新思路。

Abstract: The reasoning capability of large language models (LLMs), defined as their ability to analyze, infer, and make decisions based on input information, is essential for building intelligent task-oriented dialogue systems. However, existing benchmarks do not sufficiently reflect the complexity of real-world scenarios, which limits their effectiveness in evaluating and enhancing LLM reasoning in practical contexts. Many current reasoning datasets are overly simplistic and abstract, often disconnected from realistic task flows, domain constraints, and operational rules, making it difficult to effectively evaluate LLMs’ logical reasoning ability. In addition, data contamination from pretraining corpora undermines the reliability of evaluation results, and traditional crowdsourcing methods for dataset construction are labor-intensive and difficult to scale. To address these challenges, we propose a LLM-driven framework for synthesizing multi-turn, task-oriented dialogues grounded in realistic reasoning scenarios, leveraging trilevel optimization to enhance dialogue quality. Our method generates dialogues grounded in authentic task scenarios, enriched with real-world information, and exhibiting strong contextual coherence. Corresponding reasoning tasks are carefully designed around these dialogues and iteratively refined to continuously improve the tasks’ quality and challenge. The resulting dataset serves as a valuable benchmark for assessing and advancing the realistic logical reasoning capabilities of LLMs. Experimental results show that our synthetic data-based reasoning tasks introduce non-trivial reasoning challenges and provide meaningful support for improving the reasoning capabilities of LLMs.


[8] TRIZ-RAGNER: A Retrieval-Augmented Large Language Model for TRIZ-Aware Named Entity Recognition in Patent-Based Contradiction Mining cs.CL | cs.AIPDF

Zitong Xu, Yuqing Wu, Yue Zhao

TL;DR: 本文提出了TRIZ-RAGNER,一个用于专利矛盾挖掘中TRIZ感知命名实体识别的检索增强大语言模型框架。该框架将矛盾挖掘重新定义为语义级NER任务,通过集成TRIZ知识库的密集检索、交叉编码器重排序和结构化LLM提示,从专利句子中提取改善和恶化的技术参数。

Details

Motivation: 现有基于规则或传统机器学习的方法在处理复杂专利语言时存在语义模糊、领域依赖和泛化能力有限的问题,而直接应用大语言模型又面临幻觉和缺乏结构化TRIZ知识基础等挑战。

Result: 在PaTRIZ数据集上的实验表明,TRIZ-RAGNER在TRIZ矛盾对识别任务中达到了85.6%的精确率、82.9%的召回率和84.2%的F1分数,相比使用提示增强GPT的最强基线,F1分数绝对提升了7.3个百分点,性能优于传统序列标注模型和基于LLM的基线方法。

Insight: 核心创新点在于将检索增强生成(RAG)范式与结构化提示相结合,将领域特定的TRIZ知识注入LLM的推理过程,有效减少了语义噪声并提高了提取的一致性,为基于专利的稳健准确矛盾挖掘提供了新思路。

Abstract: TRIZ-based contradiction mining is a fundamental task in patent analysis and systematic innovation, as it enables the identification of improving and worsening technical parameters that drive inventive problem solving. However, existing approaches largely rely on rule-based systems or traditional machine learning models, which struggle with semantic ambiguity, domain dependency, and limited generalization when processing complex patent language. Recently, large language models (LLMs) have shown strong semantic understanding capabilities, yet their direct application to TRIZ parameter extraction remains challenging due to hallucination and insufficient grounding in structured TRIZ knowledge. To address these limitations, this paper proposes TRIZ-RAGNER, a retrieval-augmented large language model framework for TRIZ-aware named entity recognition in patent-based contradiction mining. TRIZ-RAGNER reformulates contradiction mining as a semantic-level NER task and integrates dense retrieval over a TRIZ knowledge base, cross-encoder reranking for context refinement, and structured LLM prompting to extract improving and worsening parameters from patent sentences. By injecting domain-specific TRIZ knowledge into the LLM reasoning process, the proposed framework effectively reduces semantic noise and improves extraction consistency. Experiments on the PaTRIZ dataset demonstrate that TRIZ-RAGNER consistently outperforms traditional sequence labeling models and LLM-based baselines. The proposed framework achieves a precision of 85.6%, a recall of 82.9%, and an F1-score of 84.2% in TRIZ contradiction pair identification. Compared with the strongest baseline using prompt-enhanced GPT, TRIZ-RAGNER yields an absolute F1-score improvement of 7.3 percentage points, confirming the effectiveness of retrieval-augmented TRIZ knowledge grounding for robust and accurate patent-based contradiction mining.


[9] From Static Benchmarks to Dynamic Protocol: Agent-Centric Text Anomaly Detection for Evaluating LLM Reasoning cs.CL | cs.AI | cs.LGPDF

Seungdong Yoa, Sanghyu Yoon, Suhee Yoon, Dongmin Kim, Ye Seul Sim

TL;DR: 本文提出了一种以智能体为中心的动态评测范式,用于评估大语言模型的推理能力。该范式通过教师、编排者和学生三个智能体的协作,动态生成、验证和解决问题,使评测难度能随智能体能力提升而自动扩展,无需人工标注数据集。以文本异常检测作为评测任务,该方法能系统性地揭示传统静态基准无法发现的推理错误。

Details

Motivation: 现有大语言模型评测主要依赖静态数据集,其扩展性有限且难以捕捉模型不断演进的推理能力,因此需要一种更动态、可持续的评测方法。

Result: 该方法在文本异常检测任务上进行了演示,表明其能系统性地暴露传统基准无法揭示的边界情况推理错误,并提出了跨模型成对性能和问题演进进度等多个互补的评估维度。

Insight: 核心创新在于从静态数据集转向动态协议,通过多智能体协作实现评测难度的自动扩展和可持续演进,为评估持续进化的大语言模型提供了新方向,并引入了以智能体为中心基准共同演进的研究议程。

Abstract: The evaluation of large language models (LLMs) has predominantly relied on static datasets, which offer limited scalability and fail to capture the evolving reasoning capabilities of recent models. To overcome these limitations, we propose an agent-centric benchmarking paradigm that moves beyond static datasets by introducing a dynamic protocol in which autonomous agents iteratively generate, validate, and solve problems. Within this protocol, a teacher agent generates candidate problems, an orchestrator agent rigorously verifies their validity and guards against adversarial attacks, and a student agent attempts to solve the validated problems. An invalid problem is revised by the teacher agent until it passes validation. If the student correctly solves the problem, the orchestrator prompts the teacher to generate more challenging variants. Consequently, the benchmark scales in difficulty automatically as more capable agents are substituted into any role, enabling progressive evaluation of large language models without manually curated datasets. Adopting text anomaly detection as our primary evaluation format, which demands cross-sentence logical inference and resists pattern-matching shortcuts, we demonstrate that this protocol systematically exposes corner-case reasoning errors that conventional benchmarks fail to reveal. We further advocate evaluating systems along several complementary axes including cross-model pairwise performance and progress between the initial and orchestrator-finalized problems. By shifting the focus from fixed datasets to dynamic protocols, our approach offers a sustainable direction for evaluating ever-evolving language models and introduces a research agenda centered on the co-evolution of agent-centric benchmarks.


[10] Divide and Conquer: Accelerating Diffusion-Based Large Language Models via Adaptive Parallel Decoding cs.CLPDF

Xiangzhong Luo, Yilin An, Zhicheng Yu, Weichen Liu, Xu Yang

TL;DR: 本文提出了一种名为DiCo的自适应并行解码方法,用于加速基于扩散的大语言模型(dLLMs)的推理过程。该方法采用分而治之的三阶段范式(划分、征服和最终化),通过识别种子令牌并构建局部簇来实现并行解码,从而在保持生成质量的同时显著提升推理速度。

Details

Motivation: 尽管扩散大语言模型理论上支持并行生成多个令牌,但实际应用中仍倾向于逐令牌生成,因为直接解码多个掩码令牌会导致生成质量和稳定性下降。本文旨在弥合dLLMs理论并行性与实际性能之间的差距。

Result: 大量实验表明,DiCo在保持竞争力的生成质量的同时,能实现显著的推理加速。

Insight: 创新点在于提出了自适应并行解码的三阶段分治范式,通过动态识别种子令牌和局部簇来解锁dLLMs的固有并行性,避免了直接解码多个掩码令牌的退化问题,为加速扩散模型推理提供了新思路。

Abstract: Diffusion-based large language models (dLLMs) have shown promising performance across various reasoning tasks, establishing themselves as an alternative to autoregressive large language models (LLMs). Unlike autoregressive LLMs that generate one token per step based on all previous tokens, dLLMs theoretically enable parallel generation of multiple tokens at each decoding step. However, recent dLLMs still favor one-token-per-step generation in practice, as directly decoding multiple masked tokens often leads to degraded generation quality and stability. This reveals a substantial gap between the theoretical parallelism and practical performance of dLLMs. To bridge this gap, we introduce an adaptive parallel decoding approach, namely DiCo, which features a three-phase divide-and-conquer paradigm to unleash the inherent parallelism of dLLMs. During the Divide phase, DiCo first explores the input masked sequence and identifies masked tokens as seed tokens, which are then expanded to construct a set of local clusters. During the Conquer phase, DiCo performs parallel decoding across different local clusters constructed in the Divide phase. The divide-and-conquer process repeatedly alternates between the Divide and Conquer phases until convergence. During the Finalize phase, DiCo decodes the remaining few masked tokens using an effective fine-grained compound decoding scheme to finalize the generation. Extensive experiments demonstrate that DiCo can achieve significant inference speedups while maintaining competitive generation quality.


[11] Task Complexity Matters: An Empirical Study of Reasoning in LLMs for Sentiment Analysis cs.CL | cs.AIPDF

Donghao Huang, Zhaoxia Wang

TL;DR: 该论文通过系统评估七种大语言模型家族在情感分析任务上的504种配置,实证检验了推理能力普遍提升语言任务性能的假设。研究发现推理效果高度依赖于任务复杂度:在简单的二元分类中推理反而导致性能下降(F1分数最多降低19.9个百分点),而在复杂的27类情感识别中推理带来显著提升(F1分数最高增加16.0个百分点)。

Details

Motivation: 挑战当前普遍认为推理能力总能提升大语言模型在各种语言任务中性能的假设,探究推理效果与任务复杂度之间的具体依赖关系。

Result: 在情感分析数据集(二元、五类、27类情感)上的实验表明:推理对简单任务产生负面影响(蒸馏推理变体比基础模型低3-18个百分点),但对复杂任务有显著增益;少样本学习在多数情况下优于零样本学习;帕累托前沿分析显示基础模型在效率-性能权衡中占优,推理仅对复杂情感识别任务有价值(尽管计算开销增加2.1-54倍)。

Insight: 揭示了推理效果的任务复杂性依赖规律,提出了’系统性过度思考’的机制解释;通过帕累托分析为推理模块的部署提供了成本效益决策依据;证明了少样本提示能部分缓解简单任务上的推理性能退化。

Abstract: Large language models (LLMs) with reasoning capabilities have fueled a compelling narrative that reasoning universally improves performance across language tasks. We test this claim through a comprehensive evaluation of 504 configurations across seven model families–including adaptive, conditional, and reinforcement learning-based reasoning architectures–on sentiment analysis datasets of varying granularity (binary, five-class, and 27-class emotion). Our findings reveal that reasoning effectiveness is strongly task-dependent, challenging prevailing assumptions: (1) Reasoning shows task-complexity dependence–binary classification degrades up to -19.9 F1 percentage points (pp), while 27-class emotion recognition gains up to +16.0pp; (2) Distilled reasoning variants underperform base models by 3-18 pp on simpler tasks, though few-shot prompting enables partial recovery; (3) Few-shot learning improves over zero-shot in most cases regardless of model type, with gains varying by architecture and task complexity; (4) Pareto frontier analysis shows base models dominate efficiency-performance trade-offs, with reasoning justified only for complex emotion recognition despite 2.1x-54x computational overhead. We complement these quantitative findings with qualitative error analysis revealing that reasoning degrades simpler tasks through systematic over-deliberation, offering mechanistic insight beyond the high-level overthinking hypothesis.


[12] CoME: Empowering Channel-of-Mobile-Experts with Informative Hybrid-Capabilities Reasoning cs.CL | cs.AIPDF

Yuxuan Liu, Weikai Xu, Kun Huang, Changyu Chen, Jiankun Zhao

TL;DR: 本文提出了一种名为CoME的新型移动智能体架构,通过四个独立的专家模块分别对应屏幕摘要、子任务规划、行动决策和行动函数等混合能力推理阶段,并采用面向输出的激活机制来调用相应专家。为了赋予CoME混合能力推理能力,作者设计了一种渐进式训练策略,包括专家微调、路由器微调和思维链微调,以分别实现能力解耦与增强、专家激活对齐以及多能力协同优化。此外,为减少推理中的错误传播,提出了基于信息增益的DPO方法(Info-DPO),用于评估中间步骤的贡献并引导模型进行更具信息量的推理。实验表明,CoME在AITZ和AMEX数据集上优于密集移动智能体和MoE方法。

Details

Motivation: 现有移动智能体在实现屏幕摘要、子任务规划、行动决策和行动函数等混合能力推理时,难以同时做到能力解耦增强与平衡整合,因此需要一种新的架构来解决这些挑战。

Result: 在AITZ和AMEX数据集上的综合实验表明,CoME超越了密集移动智能体和MoE方法,达到了SOTA水平。

Insight: 创新点包括:1. 基于专家模块的架构设计,实现混合能力推理的阶段对齐和解耦;2. 渐进式训练策略,分阶段优化专家能力、激活对齐和协同优化;3. 引入信息增益驱动的DPO方法,以减少错误传播并提升推理的信息量。从客观角度看,这种模块化设计和渐进训练策略为复杂任务中的智能体能力平衡提供了可借鉴的思路。

Abstract: Mobile Agents can autonomously execute user instructions, which requires hybrid-capabilities reasoning, including screen summary, subtask planning, action decision and action function. However, existing agents struggle to achieve both decoupled enhancement and balanced integration of these capabilities. To address these challenges, we propose Channel-of-Mobile-Experts (CoME), a novel agent architecture consisting of four distinct experts, each aligned with a specific reasoning stage, CoME activates the corresponding expert to generate output tokens in each reasoning stage via output-oriented activation. To empower CoME with hybrid-capabilities reasoning, we introduce a progressive training strategy: Expert-FT enables decoupling and enhancement of different experts’ capability; Router-FT aligns expert activation with the different reasoning stage; CoT-FT facilitates seamless collaboration and balanced optimization across multiple capabilities. To mitigate error propagation in hybrid-capabilities reasoning, we propose InfoGain-Driven DPO (Info-DPO), which uses information gain to evaluate the contribution of each intermediate step, thereby guiding CoME toward more informative reasoning. Comprehensive experiments show that CoME outperforms dense mobile agents and MoE methods on both AITZ and AMEX datasets.


[13] ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models cs.CL | cs.AIPDF

Adam Dejl, Deniz Gorur, Francesca Toni

TL;DR: 本文提出了ArgLLM-App,一个基于Web的交互式系统,通过结合大型语言模型(LLMs)与计算论证技术,支持二元决策任务。该系统旨在生成可解释且可被人类质疑的决策,提供可视化解释并允许用户与系统交互以识别和挑战推理中的错误。

Details

Motivation: 动机在于利用ArgLLMs(论证性大型语言模型)来增强决策的可解释性和可争议性,使人类能够理解和质疑AI系统的推理过程,从而提高透明度和可信度。

Result: 论文实现了一个公开可用的Web系统(https://argllm.app),支持模块化设计,并能从可信外部源获取信息,但摘要中未提及具体的定量实验结果或基准测试性能。

Insight: 创新点在于将LLMs与计算论证结合,构建了一个交互式系统,允许用户可视化并质疑AI的推理;从客观角度看,其模块化设计和外部信息集成增强了系统的实用性和可扩展性,为可解释AI提供了新思路。

Abstract: Argumentative LLMs (ArgLLMs) are an existing approach leveraging Large Language Models (LLMs) and computational argumentation for decision-making, with the aim of making the resulting decisions faithfully explainable to and contestable by humans. Here we propose a web-based system implementing ArgLLM-empowered agents for binary tasks. ArgLLM-App supports visualisation of the produced explanations and interaction with human users, allowing them to identify and contest any mistakes in the system’s reasoning. It is highly modular and enables drawing information from trusted external sources. ArgLLM-App is publicly available at https://argllm.app, with a video demonstration at https://youtu.be/vzwlGOr0sPM.


[14] Controllable Reasoning Models Are Private Thinkers cs.CL | cs.AIPDF

Haritz Puerto, Haonan Li, Xudong Han, Timothy Baldwin, Iryna Gurevych

TL;DR: 这篇论文提出了一种通过增强推理模型在推理轨迹上的指令遵循能力来保护隐私的方法。作者通过微调模型,使其在生成最终答案的同时,也能遵循对推理轨迹的明确限制,并引入了一种使用独立LoRA适配器将推理过程与答案生成解耦的生成策略。

Details

Motivation: 动机是解决由AI智能体驱动的推理模型在处理敏感用户数据时,其推理轨迹难以控制,可能导致隐私信息无意泄露给外部的问题。

Result: 在六个参数量从1.7B到14B的模型上,于两个指令遵循基准和两个隐私基准上进行了评估。方法带来了显著提升,指令遵循性能最高提升20.9分,隐私基准最高提升51.9个百分点。但这也可能以牺牲任务效用为代价。

Insight: 宣称的创新点在于将指令遵循能力扩展到推理轨迹控制以增强隐私保护,并提出了解耦推理与答案生成的LoRA适配器策略。客观来看,这为开发隐私感知的智能体提供了一个有前景的方向,并实证了指令遵循与隐私保护之间的关联及权衡。

Abstract: AI agents powered by reasoning models require access to sensitive user data. However, their reasoning traces are difficult to control, which can result in the unintended leakage of private information to external parties. We propose training models to follow instructions not only in the final answer, but also in reasoning traces, potentially under different constraints. We hypothesize that improving their instruction following abilities in the reasoning traces can improve their privacy-preservation skills. To demonstrate this, we fine-tune models on a new instruction-following dataset with explicit restrictions on reasoning traces. We further introduce a generation strategy that decouples reasoning and answer generation using separate LoRA adapters. We evaluate our approach on six models from two model families, ranging from 1.7B to 14B parameters, across two instruction-following benchmarks and two privacy benchmarks. Our method yields substantial improvements, achieving gains of up to 20.9 points in instruction-following performance and up to 51.9 percentage points on privacy benchmarks. These improvements, however, can come at the cost of task utility, due to the trade-off between reasoning performance and instruction-following abilities. Overall, our results show that improving instruction-following behavior in reasoning models can significantly enhance privacy, suggesting a promising direction for the development of future privacy-aware agents. Our code and data are available at https://github.com/UKPLab/arxiv2026-controllable-reasoning-models


[15] Do LLMs Benefit From Their Own Words? cs.CL | cs.AIPDF

Jenny Y. Huang, Leshem Choshen, Ramon Astudillo, Tamara Broderick, Jacob Andreas

TL;DR: 本研究探讨了在多轮对话中,大型语言模型是否真正受益于保留自身先前回复的历史。通过比较标准全上下文提示与仅用户轮次提示方法,发现移除助手侧历史对大部分轮次响应质量无显著影响,且能大幅减少上下文长度。研究还识别出上下文污染现象,并提出了选择性过滤助手侧上下文的改进方法。

Details

Motivation: 重新审视多轮对话中保留助手自身历史回复的设计选择,探究LLMs是否从自身先前的回复中获益,旨在优化上下文利用效率并减少潜在错误传播。

Result: 在三个开放推理模型和一个SOTA模型上的实验表明,移除助手侧历史对大量轮次(36.4%的自包含提示)的响应质量无影响,上下文长度最多可减少10倍;同时,在某些情况下仅用户轮次提示能显著优于全上下文,减少了因过度依赖先前回复导致的错误、幻觉或风格伪影。

Insight: 创新点在于揭示了多轮对话中助手侧历史并非总是必要,许多提示可仅基于当前和先前的用户轮次独立回答;提出了上下文污染的概念及选择性过滤方法,为优化LLM内存消耗和响应质量提供了新思路。

Abstract: Multi-turn interactions with large language models typically retain the assistant’s own past responses in the conversation history. In this work, we revisit this design choice by asking whether large language models benefit from conditioning on their own prior responses. Using in-the-wild, multi-turn conversations, we compare standard (full-context) prompting with a user-turn-only prompting approach that omits all previous assistant responses, across three open reasoning models and one state-of-the-art model. To our surprise, we find that removing prior assistant responses does not affect response quality on a large fraction of turns. Omitting assistant-side history can reduce cumulative context lengths by up to 10x. To explain this result, we find that multi-turn conversations consist of a substantial proportion (36.4%) of self-contained prompts, and that many follow-up prompts provide sufficient instruction to be answered using only the current user turn and prior user turns. When analyzing cases where user-turn-only prompting substantially outperforms full context, we identify instances of context pollution, in which models over-condition on their previous responses, introducing errors, hallucinations, or stylistic artifacts that propagate across turns. Motivated by these findings, we design a context-filtering approach that selectively omits assistant-side context. Our findings suggest that selectively omitting assistant history can improve response quality while reducing memory consumption.


cs.CV [Back]

[16] DesignSense: A Human Preference Dataset and Reward Modeling Framework for Graphic Layout Generation cs.CV | cs.AIPDF

Varun Gopal, Rishabh Jain, Aradhya Mathur, Nikitha SR, Sohan Patnaik

TL;DR: 本文介绍了DesignSense-10k,一个包含10,235个人类标注偏好对的大规模图形布局评估数据集,以及基于视觉语言模型(VLM)的分类器DesignSense,用于提升图形布局生成与人类审美偏好的对齐。

Details

Motivation: 现有布局生成模型常与人类细微的审美判断不一致,且文本到图像生成的偏好数据集和奖励模型无法泛化到以空间排列质量为核心的布局评估领域,因此需要专门的布局偏好数据集和模型。

Result: DesignSense分类器在综合评估指标上显著优于现有开源和专有模型(在Macro F1上比最强的专有基线提升54.6%),并在基于RL的训练和推理时缩放中,分别使生成器胜率提升约3%和3.6%。

Insight: 创新点在于通过五阶段筛选流程构建高质量布局偏好数据集,并训练专门的VLM分类器进行布局感知的偏好建模,解决了前沿VLM在四类分类任务上不可靠的问题,提升了布局生成的下游性能。

Abstract: Graphic layouts serve as an important and engaging medium for visual communication across different channels. While recent layout generation models have demonstrated impressive capabilities, they frequently fail to align with nuanced human aesthetic judgment. Existing preference datasets and reward models trained on text-to-image generation do not generalize to layout evaluation, where the spatial arrangement of identical elements determines quality. To address this critical gap, we introduce DesignSense-10k, a large-scale dataset of 10,235 human-annotated preference pairs for graphic layout evaluation. We propose a five-stage curation pipeline that generates visually coherent layout transformations across diverse aspect ratios, using semantic grouping, layout prediction, filtering, clustering, and VLM-based refinement to produce high-quality comparison pairs. Human preferences are annotated using a 4-class scheme (left, right, both good, both bad) to capture subjective ambiguity. Leveraging this dataset, we train DesignSense, a vision-language model-based classifier that substantially outperforms existing open-source and proprietary models across comprehensive evaluation metrics (54.6% improvement in Macro F1 over the strongest proprietary baseline). Our analysis shows that frontier VLMs remain unreliable overall and fail catastrophically on the full four-class task, underscoring the need for specialized, preference-aware models. Beyond the dataset, our reward model DesignSense yields tangible downstream gains in layout generation. Using our judge during RL based training improves generator win rate by about 3%, while inference-time scaling, which involves generating multiple candidates and selecting the best one, provides a 3.6% improvement. These results highlight the practical impact of specialized, layout-aware preference modeling on real-world layout generation quality.


[17] Modelling and Simulation of Neuromorphic Datasets for Anomaly Detection in Computer Vision cs.CV | cs.AI | cs.LGPDF

Mike Middleton, Teymoor Ali, Hakan Kayan, Basabdatta Sen Bhattacharya, Charith Perera

TL;DR: 本文针对神经形态计算机视觉研究中动态视觉传感器(DVS)数据稀缺的问题,提出了一个名为ANTShapes的新型数据集模拟框架。该框架基于Unity引擎构建,能够生成包含随机运动行为的抽象3D场景,并自动标注异常对象,以支持对象识别、定位和异常检测等任务。

Details

Motivation: 动态视觉传感器(DVS)数据的可用性有限,现有数据集样本或场景不足,阻碍了神经形态计算机视觉应用的研究。

Result: 论文提出了ANTShapes框架,能够通过调整少量参数生成任意数量的样本数据集,并导出标签和帧数据,为基于事件的计算机视觉研究提供了定制化数据模拟解决方案。

Insight: 创新点在于利用Unity引擎构建可配置的3D场景模拟器,通过遵循中心极限定理的统计过程对对象行为进行采样和异常标注,从而灵活生成大规模神经形态数据集,弥补了真实数据收集的不足。

Abstract: Limitations on the availability of Dynamic Vision Sensors (DVS) present a fundamental challenge to researchers of neuromorphic computer vision applications. In response, datasets have been created by the research community, but often contain a limited number of samples or scenarios. To address the lack of a comprehensive simulator of neuromorphic vision datasets, we introduce the Anomalous Neuromorphic Tool for Shapes (ANTShapes), a novel dataset simulation framework. Built in the Unity engine, ANTShapes simulates abstract, configurable 3D scenes populated by objects displaying randomly-generated behaviours describing attributes such as motion and rotation. The sampling of object behaviours, and the labelling of anomalously-acting objects, is a statistical process following central limit theorem principles. Datasets containing an arbitrary number of samples can be created and exported from ANTShapes, along with accompanying label and frame data, through the adjustment of a limited number of parameters within the software. ANTShapes addresses the limitations of data availability to researchers of event-based computer vision by allowing for the simulation of bespoke datasets to suit purposes including object recognition and localisation alongside anomaly detection.


[18] Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos cs.CVPDF

Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G. You

TL;DR: 本文介绍了Synthetic Visual Genome 2 (SVG2),一个大规模全景视频场景图数据集,以及基于此数据集训练的TRaSER模型。SVG2通过自动化流程从视频中提取包含对象、属性和关系的时空场景图,其规模远超现有数据集。TRaSER模型通过创新的轨迹对齐令牌机制和重采样模块,能够高效生成紧凑的时空场景图,并在多个基准测试中显著提升了关系检测、对象和属性预测的性能,同时证明场景图作为中间表示能有效提升视频问答的准确性。

Details

Motivation: 动机是解决现有时空场景图数据集规模小、多样性不足的问题,并探索如何从视频中高效、准确地提取大规模、细粒度的时空场景图作为中间表示,以支持下游视觉语言任务。

Result: 在PVSG、VIPSeg、VidOR和SVG2测试集上,TRaSER在关系检测上比最强开源基线提升15-20%,对象预测提升30-40%(比GPT-5高13%),属性预测提升15%。将TRaSER生成的场景图用于视频问答,比仅使用视频或使用Qwen2.5-VL生成的场景图,绝对准确率提升1.5%至4.6%。

Insight: 创新点包括:1) 构建了规模空前(包含63.6万视频、660万对象)的SVG2数据集及其全自动生成流程;2) 提出了TRaSER模型,其轨迹对齐令牌安排机制、对象轨迹重采样器和时间窗口重采样器能有效整合局部运动与全局上下文,实现单次前向传播生成紧凑时空场景图;3) 实证了显式时空场景图作为中间表示对提升视频理解任务(如VQA)的有效性。

Abstract: We introduce Synthetic Visual Genome 2 (SVG2), a large-scale panoptic video scene graph dataset. SVG2 contains over 636K videos with 6.6M objects, 52.0M attributes, and 6.7M relations, providing an order-of-magnitude increase in scale and diversity over prior spatio-temporal scene graph datasets. To create SVG2, we design a fully automated pipeline that combines multi-scale panoptic segmentation, online-offline trajectory tracking with automatic new-object discovery, per-trajectory semantic parsing, and GPT-5-based spatio-temporal relation inference. Building on this resource, we train TRaSER, a video scene graph generation model. TRaSER augments VLMs with a trajectory-aligned token arrangement mechanism and new modules: an object-trajectory resampler and a temporal-window resampler to convert raw videos and panoptic trajectories into compact spatio-temporal scene graphs in a single forward pass. The temporal-window resampler binds visual tokens to short trajectory segments to preserve local motion and temporal semantics, while the object-trajectory resampler aggregates entire trajectories to maintain global context for objects. On the PVSG, VIPSeg, VidOR and SVG2 test datasets, TRaSER improves relation detection by +15 to 20%, object prediction by +30 to 40% over the strongest open-source baselines and by +13% over GPT-5, and attribute prediction by +15%. When TRaSER’s generated scene graphs are sent to a VLM for video question answering, it delivers a +1.5 to 4.6% absolute accuracy gain over using video only or video augmented with Qwen2.5-VL’s generated scene graphs, demonstrating the utility of explicit spatio-temporal scene graphs as an intermediate representation.


[19] LE-NeuS: Latency-Efficient Neuro-Symbolic Video Understanding via Adaptive Temporal Verification cs.CVPDF

Shawn Liang, Sahil Shah, Chengwei Zhou, SP Sharan, Harsh Goel

TL;DR: 本文提出了LE-NeuS,一种延迟高效的神经符号框架,用于长视频问答(LVQA)。该框架通过自适应时间采样和批量命题检测等优化技术,在保持时序逻辑引导视频理解所带来的准确性优势的同时,大幅降低了推理延迟。

Details

Motivation: 现有的神经符号方法虽然通过形式化验证进行时序推理,显著提升了长视频问答的准确性,但其引入了过高的延迟开销(比基础VLM提示慢高达90倍),导致在延迟敏感的边缘部署中不切实际。本文旨在解决这一延迟瓶颈问题。

Result: 在LongVideoBench和Video-MME基准测试(部署于NVIDIA H100 GPU)上,LE-NeuS将延迟差距从90倍降低至约10倍,同时在时序复杂的查询上保持了超过10%的准确率提升。

Insight: 论文的核心创新在于识别出自动机构建过程中跨视频帧的顺序且密集的命题检测是主要计算瓶颈,并提出了两项优化:1)利用视觉冗余的CLIP引导两阶段自适应采样,以跳过语义相似的帧同时保留时序边界;2)跨时间窗口并行化VLM推理的批量命题检测。理论上,论文推导了延迟与视频长度、命题复杂度和采样密度之间的关系,为实现延迟效率提供了理论依据。

Abstract: Neuro-symbolic approaches to long-form video question answering (LVQA) have demonstrated significant accuracy improvements by grounding temporal reasoning in formal verification. However, existing methods incur prohibitive latency overheads, up to 90x slower than base VLM prompting, rendering them impractical for latency-sensitive edge deployments. We present LE-NeuS, a latency-efficient neuro-symbolic framework that preserves the accuracy benefits of temporal logic-guided video understanding while drastically reducing inference latency. Our key insight is that the dominant computational bottleneck arises from sequential and dense proposition detection across video frames during automaton construction. We address this through two principled optimizations: (1) CLIP guided two-stage adaptive sampling that exploits visual redundancy to skip semantically similar frames while preserving temporal boundaries, and (2) batched proposition detection that parallelizes VLM inference across temporal windows. Theoretically, we derive latency bounds as a function of video length, proposition complexity, and sampling density, establishing conditions under which latency efficiency is achievable. Empirically, on LongVideoBench and Video-MME benchmarks deployed on NVIDIA H100 GPUs, LE-NeuS reduces the latency gap from 90x to approximately 10x while maintaining >10% accuracy gains on temporally complex queries.


[20] Hyperdimensional Cross-Modal Alignment of Frozen Language and Image Models for Efficient Image Captioning cs.CV | cs.AI | cs.LGPDF

Abhishek Dalvi, Vasant Honavar

TL;DR: 本文提出HDFLIM框架,通过超维计算在不微调预训练视觉和语言模型的情况下实现跨模态对齐,利用绑定、捆绑和相似性检索等轻量级符号操作在超维空间中构建关联表示,实现高效的图像描述生成。

Details

Motivation: 解决传统跨模态对齐方法需要计算密集型多模态微调、大规模参数更新和扰动预训练表示的问题,探索是否能在不修改模型本身的情况下实现跨模态对齐。

Result: HDFLIM在图像描述任务上性能与端到端视觉语言训练方法相当,生成的描述比零样本基线更具语义基础。

Insight: 创新点在于通过超维编码的符号操作实现冻结模型的语义映射,为模型对齐提供了无需大规模重训练的新范式,强调预训练模型间已存在的潜在语义兼容性。

Abstract: Large unimodal foundation models for vision and language encode rich semantic structures, yet aligning them typically requires computationally intensive multimodal fine-tuning. Such approaches depend on large-scale parameter updates, are resource intensive, and can perturb pretrained representations. Emerging evidence suggests, however, that independently trained foundation models may already exhibit latent semantic compatibility, reflecting shared structures in the data they model. This raises a fundamental question: can cross-modal alignment be achieved without modifying the models themselves? Here we introduce HDFLIM (HyperDimensional computing with Frozen Language and Image Models), a framework that establishes cross-modal mappings while keeping pretrained vision and language models fully frozen. HDFLIM projects unimodal embeddings into a shared hyperdimensional space and leverages lightweight symbolic operations – binding, bundling, and similarity-based retrieval to construct associative cross-modal representations in a single pass over the data. Caption generation emerges from high-dimensional memory retrieval rather than iterative gradient-based optimization. We show that HDFLIM achieves performance comparable to end-to-end vision-language training methods and produces captions that are more semantically grounded than zero-shot baselines. By decoupling alignment from parameter tuning, our results suggest that semantic mapping across foundation models can be realized through symbolic operations on hyperdimensional encodings of the respective embeddings. More broadly, this work points toward an alternative paradigm for foundation model alignment in which frozen models are integrated through structured representational mappings rather than through large-scale retraining. The codebase for our implementation can be found at https://github.com/Abhishek-Dalvi410/HDFLIM.


[21] Pseudo Contrastive Learning for Diagram Comprehension in Multimodal Models cs.CV | cs.AIPDF

Hiroshi Sasaki

TL;DR: 本文提出了一种名为伪对比学习的新训练范式,旨在提升多模态模型(如CLIP)对图表(如流程图)的理解能力。该方法通过图表渲染器生成伪对比样本,这些样本突出了图表的结构差异,无需修改原始数据,从而帮助模型学习更精确的语义一致的图表结构。

Details

Motivation: 现有多模态模型(如CLIP)在视觉-语言对齐方面表现优异,但对细微结构变化敏感度不足,这在语义意义重大的图表理解领域尤为挑战,因此需要增强模型对细粒度结构差异的感知能力。

Result: 在流程图基准数据集上的实验表明,该方法在图像-文本匹配和视觉问答任务上均显著优于标准CLIP和硬负样本CLIP训练,取得了实质性改进。

Insight: 创新点在于引入伪对比样本,通过合成图表来强调结构差异,这是一种无需原始数据编辑的领域特定训练策略,可推广到其他需要细粒度视觉理解的场景,提升模型的结构敏感性。

Abstract: Recent multimodal models such as Contrastive Language-Image Pre-training (CLIP) have shown remarkable ability to align visual and linguistic representations. However, domains where small visual differences carry large semantic significance, such as diagram understanding, remain challenging due to the models’ limited sensitivity to fine-grained structural variations. We propose a new training paradigm designed to enhance diagram comprehension in vision-language models. Our approach introduces pseudo contrastive samples generated by a diagram renderer that creates synthetic diagrams using randomly picked text elements. These samples highlight structural differences in diagrammatic imagery without requiring any modification or editing of the original data. By incorporating these pseudo contrastive samples into the training objective, the model learns to capture more precise and semantically consistent diagram structures. Empirical evaluations on a benchmark dataset of flowcharts demonstrate substantial improvements over standard CLIP and hard-negative CLIP training in both image-text matching and visual question answering tasks. The results underscore the value of domain-specific training strategies and contribute to advancing diagrammatic understanding within the broader context of vision-language learning.


[22] Annotation-Free Visual Reasoning for High-Resolution Large Multimodal Models via Reinforcement Learning cs.CVPDF

Jiacheng Yang, Anqi Chen, Yunkai Dang, Qi Fan, Cong Wang

TL;DR: 本文提出了一种无需标注的高分辨率视觉推理技术HART,通过强化学习训练大型多模态模型自主聚焦和验证高分辨率图像中的关键区域,以解决现有模型因图像令牌数量随分辨率平方增长而导致的冗余和无关信息问题。

Details

Motivation: 当前大型多模态模型在处理高分辨率视觉输入时面临令牌数量激增带来的冗余和无关信息干扰,而依赖外部视觉监督(如人工标注的定位标签)成本高昂,因此需要一种无需额外标注的方法来增强模型的定位能力以支持推理。

Result: 实验表明,HART在多种高分辨率视觉任务上均提升了性能,持续优于强基线模型;当应用于Qwen2.5-VL-7B进行后训练时,在面向高分辨率的视觉中心基准测试中甚至超越了更大规模的模型如Qwen2.5-VL-72B和LLaVA-OneVision-72B。

Insight: 创新点包括:1)提出闭环框架HART,使模型能自主聚焦和验证关键区域,无需外部标注;2)设计AP-GRPO强化学习优化方法,促进准确的关键区域定位;3)提供可解释的推理路径并实现高效的定位优化,为高分辨率视觉推理提供了一种标注自由的解决方案。

Abstract: Current Large Multimodal Models (LMMs) struggle with high-resolution visual inputs during the reasoning process, as the number of image tokens increases quadratically with resolution, introducing substantial redundancy and irrelevant information. A common practice is to identify key image regions and refer to their high-resolution counterparts during reasoning, typically trained with external visual supervision. However, such visual supervision cues require costly grounding labels from human annotators. Meanwhile, it remains an open question how to enhance a model’s grounding abilities to support reasoning without relying on additional annotations. In this paper, we propose High-resolution Annotation-free Reasoning Technique (HART), a closed-loop framework that enables LMMs to focus on and self-verify key regions of high-resolution visual inputs. HART incorporates a post-training paradigm in which we design Advantage Preference Group Relative Policy Optimization (AP-GRPO) to encourage accurate localization of key regions. Notably, HART provides explainable reasoning pathways and enables efficient optimization of localization. Extensive experiments demonstrate that HART improves performance across a wide range of high-resolution visual tasks, consistently outperforming strong baselines. When applied to post-train Qwen2.5-VL-7B, HART even surpasses larger-scale models such as Qwen2.5-VL-72B and LLaVA-OneVision-72B on high-resolution, vision-centric benchmarks.


[23] DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model cs.CV | cs.AIPDF

Shibo Hong, Boxian Ai, Jun Kuang, Wei Wang, FengJiao Chen

TL;DR: 本文提出了首个专门评估基于指令的图像编辑模型(IIEMs)在小尺度物体编辑能力上的基准测试DLEBench,包含1889个样本和七种指令类型,并设计了双模式评估框架以减少主观性。

Details

Motivation: 现有IIEMs在遵循指令和推理能力上表现良好,但在编辑小物体方面的能力尚未充分探索,而这对精确局部编辑和细节优化至关重要。

Result: 在10个IIEMs上的实证结果显示,这些模型在小尺度物体编辑上存在显著的性能差距,突显了专门基准测试的必要性。

Insight: 论文的创新点在于构建了首个专注于小物体编辑的基准测试DLEBench,并引入了双模式评估协议(工具驱动和Oracle引导模式)来减少评估中的主观性和模糊性,从而更准确地衡量模型性能。

Abstract: Significant progress has been made in the field of Instruction-based Image Editing Models (IIEMs). However, while these models demonstrate plausible adherence to instructions and strong reasoning ability on current benchmarks, their ability to edit small objects remains underexplored, despite its importance for precise local editing and refining details in both real and generated images. In this paper, we introduce DeepLookEditBench (DLEBench), the first benchmark dedicated to assessing the abilities of IIEMs in editing small-scale objects. Specifically, we construct a challenging testbed comprising 1889 samples across seven instruction types. In these samples, target objects occupy only 1%-10% of the image area, covering complex scenarios such as partial occlusion and multi-object editing. To ensure robust evaluation on this benchmark, we propose an evaluation protocol with refined score rubrics to minimize subjectivity and ambiguity in two criteria: Instruction Following and Visual Consistency. This protocol also introduces a dual-mode evaluation framework (Tool-driven and Oracle-guided Modes) addressing the misalignment between LMM-as-a-Judge and human judgements on DLEBench. Empirical results on 10 IIEMs reveal significant performance gaps in small-scale object editing, highlighting the need for specialized benchmarks to advance this ability.


[24] 3D Modality-Aware Pre-training for Vision-Language Model in MRI Multi-organ Abnormality Detection cs.CV | cs.AIPDF

Haowen Zhu, Ning Yin, Xiaogen Zhou

TL;DR: 该论文提出了一种名为MedMAP的医学模态感知预训练框架,专门针对3D MRI多器官异常检测任务。该框架包含模态感知视觉-语言对齐预训练阶段和下游任务微调阶段,通过构建包含7,392个3D MRI-报告对的MedMoM-MRI3D数据集进行验证。实验表明MedMAP在3D MRI多器官异常检测任务上显著优于现有视觉语言模型。

Details

Motivation: 解决将视觉语言模型应用于多器官医学影像时面临的两个主要挑战:1)特定模态的视觉-语言对齐;2)跨模态特征融合。

Result: 在MedMoM-MRI3D数据集上的大量实验表明,MedMAP在基于3D MRI的多器官异常检测任务上显著优于现有视觉语言模型,达到了新的SOTA水平。

Insight: 创新点包括:1)提出模态感知视觉语言对齐机制,使编码器能够隐式捕获联合模态分布;2)构建了专门针对3D医学分析任务的大规模多模态MRI数据集MedMoM-MRI3D;3)采用两阶段训练策略(预训练+微调),在微调时冻结文本编码器以保持语言表示稳定性。

Abstract: Vision-language models (VLMs) show strong potential for complex diagnostic tasks in medical imaging. However, applying VLMs to multi-organ medical imaging introduces two principal challenges: (1) modality-specific vision-language alignment and (2) cross-modal feature fusion. In this work, we propose MedMAP, a Medical Modality-Aware Pretraining framework that enhances vision-language representation learning in 3D MRI. MedMAP comprises a modality-aware vision-language alignment stage and a fine-tuning stage for multi-organ abnormality detection. During the pre-training stage, the modality-aware encoders implicitly capture the joint modality distribution and improve alignment between visual and textual representations. We then fine-tune the pre-trained vision encoders (while keeping the text encoder frozen) for downstream tasks. To this end, we curated MedMoM-MRI3D, comprising 7,392 3D MRI volume-report pairs spanning twelve MRI modalities and nine abnormalities tailored for various 3D medical analysis tasks. Extensive experiments on MedMoM-MRI3D demonstrate that MedMAP significantly outperforms existing VLMs in 3D MRI-based multi-organ abnormality detection. Our code is available at https://github.com/RomantiDr/MedMAP.


[25] ProtoDCS: Towards Robust and Efficient Open-Set Test-Time Adaptation for Vision-Language Models cs.CV | cs.AIPDF

Wei Luo, Yangfan Ou, Jin Deng, Zeshuai Deng, Xiquan Yan

TL;DR: 本文提出了ProtoDCS框架,用于解决大规模视觉语言模型在开放集测试时适应中的关键挑战。该方法通过一种新颖的双重检查分离机制和基于证据的适应策略,有效区分协变量偏移的分布内数据和分布外数据,从而在保证安全的前提下高效地适应已知类别。

Details

Motivation: 现有基于VLM的测试时适应方法在封闭集假设下工作,无法处理包含未知分布外数据的开放集场景,导致模型性能下降和计算成本高昂。本文旨在解决开放集TTA中准确分离未知样本与安全适应已知样本的难题。

Result: 在CIFAR-10/100-C和Tiny-ImageNet-C基准测试上的大量实验表明,ProtoDCS取得了最先进的性能,显著提升了已知类别的准确率和OOD检测指标。

Insight: 主要创新点包括:1)用概率高斯混合模型验证取代脆性阈值判定的双重检查分离机制;2)利用不确定性感知损失和高效原型级更新的证据驱动适应策略,缓解过度自信并降低计算开销。

Abstract: Large-scale Vision-Language Models (VLMs) exhibit strong zero-shot recognition, yet their real-world deployment is challenged by distribution shifts. While Test-Time Adaptation (TTA) can mitigate this, existing VLM-based TTA methods operate under a closed-set assumption, failing in open-set scenarios where test streams contain both covariate-shifted in-distribution (csID) and out-of-distribution (csOOD) data. This leads to a critical difficulty: the model must discriminate unknown csOOD samples to avoid interference while simultaneously adapting to known csID classes for accuracy. Current open-set TTA (OSTTA) methods rely on hard thresholds for separation and entropy minimization for adaptation. These strategies are brittle, often misclassifying ambiguous csOOD samples and inducing overconfident predictions, and their parameter-update mechanism is computationally prohibitive for VLMs. To address these limitations, we propose Prototype-based Double-Check Separation (ProtoDCS), a robust framework for OSTTA that effectively separates csID and csOOD samples, enabling safe and efficient adaptation of VLMs to csID data. Our main contributions are: (1) a novel double-check separation mechanism employing probabilistic Gaussian Mixture Model (GMM) verification to replace brittle thresholding; and (2) an evidence-driven adaptation strategy utilizing uncertainty-aware loss and efficient prototype-level updates, mitigating overconfidence and reducing computational overhead. Extensive experiments on CIFAR-10/100-C and Tiny-ImageNet-C demonstrate that ProtoDCS achieves state-of-the-art performance, significantly boosting both known-class accuracy and OOD detection metrics. Code will be available at https://github.com/O-YangF/ProtoDCS.


[26] Suppressing Prior-Comparison Hallucinations in Radiology Report Generation via Semantically Decoupled Latent Steering cs.CVPDF

Ao Li, Rui Liu, Mingjie Li, Sheng Liu, Lei Wang

TL;DR: 本文提出了一种名为语义解耦潜在引导(SDLS)的训练无关推理时控制框架,旨在解决放射学报告生成中视觉语言模型(VLM)的“先验比较幻觉”问题。该方法通过大语言模型驱动的语义分解和基于QR的正交化构建语义无关的干预向量,从而在抑制幻觉的同时保持临床准确性。

Details

Motivation: 解决放射学报告自动生成中视觉语言模型因依赖先验知识而产生与当前研究不符的历史发现幻觉的问题。

Result: 在MIMIC-CXR数据集上,该方法显著降低了历史幻觉概率(FilBERT分数从0.2373降至0.1889)并提高了临床标签保真度(CheXpert macro-F1从0.2242提升至0.3208);在CheXpert Plus和IU-Xray上的零样本迁移评估也验证了其鲁棒性。

Insight: 创新点在于通过语义分解和正交化构建语义无关的干预向量,利用几何约束过滤临床语义纠缠,从而精准针对“历史比较”轴进行引导,避免了通用激活引导中的语义纠缠问题,实现了幻觉抑制与临床准确性的平衡。

Abstract: Automated radiology report generation using vision-language models (VLMs) is limited by the risk of prior-comparison hallucination, where the model generates historical findings unsupported by the current study. We address this challenge with a training-free, inference-time control framework termed Semantically Decoupled Latent Steering (SDLS). Unlike generic activation steering, which often suffers from semantic entanglement, our approach constructs a semantic-free intervention vector via large language model (LLM)-driven semantic decomposition followed by $QR$-based orthogonalization. This orthogonalization step is critical. It leverages geometric constraints to filter out the clinical semantics often entangled in standard principal component analysis (PCA) directions, ensuring that the steering vector targets only the ``historical comparison” axis. We validate our method on the BiomedGPT foundation model, demonstrating that it overcomes the trade-off between hallucination suppression and clinical accuracy. Extensive experiments on MIMIC-CXR, and zero-shot transfer evaluation on CheXpert Plus and IU-Xray, demonstrate the robustness of our approach. Quantitative evaluations on MIMIC-CXR show that our approach significantly reduces the probability of historical hallucinations (FilBERT score decreases from 0.2373 to 0.1889) and improves clinical label fidelity (CheXpert macro-F1 increases from 0.2242 to 0.3208). Supplementary evaluations confirm that the structural integrity of the clinical narrative is maintained.


[27] Vision-Language Semantic Grounding for Multi-Domain Crop-Weed Segmentation cs.CVPDF

Nazia Hossain, Xintong Jiang, Yu Tian, Philippe Seguin, O. Grant Clark

TL;DR: 本文提出了一种名为Vision-Language Weed Segmentation (VL-WS)的新框架,用于解决精细化的作物-杂草分割任务。该框架通过将像素级分割建立在语义对齐、领域不变的表示之上,利用冻结的CLIP嵌入和任务特定的空间特征,并通过基于自然语言描述的FiLM层进行融合与调制,从而提升了模型在异构农业环境中的泛化能力。

Details

Motivation: 现有深度学习模型在异构农业环境中泛化能力不足,主要依赖于数据集特定的视觉特征。本文旨在通过结合视觉与语言语义,构建领域不变的表示,以支持跨域、标签高效的杂草分割。

Result: 在四个基准数据集上的实验表明,VL-WS的平均Dice分数达到91.64%,比CNN基线高出4.98%。在最具挑战性的杂草类别上,VL-WS的Dice分数为80.45%,相比最佳基线的65.03%提升了15.42%。此外,在有限目标域监督下,VL-WS仍能保持稳定的分割性能,显示出更好的泛化能力和数据效率。

Insight: 论文的创新点在于将视觉-语言语义对齐引入像素级分割任务,通过双编码器设计和基于文本描述的FiLM调制,实现了领域不变的特征表示。这为构建可扩展、跨真实世界农业领域部署的分割模型提供了新思路。

Abstract: Fine-grained crop-weed segmentation is essential for enabling targeted herbicide application in precision agriculture. However, existing deep learning models struggle to generalize across heterogeneous agricultural environments due to reliance on dataset-specific visual features. We propose Vision-Language Weed Segmentation (VL-WS), a novel framework that addresses this limitation by grounding pixel-level segmentation in semantically aligned, domain-invariant representations. Our architecture employs a dual-encoder design, where frozen Contrastive Language-Image Pretraining (CLIP) embeddings and task-specific spatial features are fused and modulated via Feature-wise Linear Modulation (FiLM) layers conditioned on natural language captions. This design enables image level textual descriptions to guide channel-wise feature refinement while preserving fine-grained spatial localization. Unlike prior works restricted to training and evaluation on single-source datasets, VL-WS is trained on a unified corpus that includes close-range ground imagery (robotic platforms) and high-altitude UAV imagery, covering diverse crop types, weed species, growth stages, and sensing conditions. Experimental results across four benchmark datasets demonstrate the effectiveness of our framework, with VL-WS achieving a mean Dice score of 91.64% and outperforming the CNN baseline by 4.98%. The largest gains occur on the most challenging weed class, where VL-WS attains 80.45% Dice score compared to 65.03% for the best baseline, representing a 15.42% improvement. VL-WS further maintains stable weed segmentation performance under limited target-domain supervision, indicating improved generalization and data efficiency. These findings highlight the potential of vision-language alignment to enable scalable, label-efficient segmentation models deployable across diverse real-world agricultural domains.


[28] Towards Source-Aware Object Swapping with Initial Noise Perturbation cs.CVPDF

Jiahui Zhan, Xianbing Sun, Xiangnan Zhu, Yikun Ji, Ruitong Liu

TL;DR: 本文提出SourceSwap,一种自监督、源感知的物体交换框架,通过在初始噪声空间进行频率分离扰动来合成高质量伪配对数据,无需额外配对数据或逐物体微调即可实现跨物体对齐,并引入高质量基准SourceBench进行评测。

Details

Motivation: 现有物体交换方法要么需要逐物体微调且推理慢,要么依赖额外配对数据(通常为同一物体在不同背景下的图像),导致模型依赖背景线索而非学习跨物体对齐,本文旨在解决这些问题。

Result: 在提出的SourceBench基准(更高分辨率、更多类别、更丰富交互)上,SourceSwap在保真度、场景保留和自然和谐度方面表现优异,并可迁移到主体驱动细化和人脸交换等编辑任务。

Insight: 创新点包括:在初始噪声空间进行频率分离扰动以合成伪配对数据,实现自监督学习;采用全源条件双U-Net和无噪声参考编码器,实现直接跨物体对齐和零样本推理;提出高质量基准SourceBench以促进研究。

Abstract: Object swapping aims to replace a source object in a scene with a reference object while preserving object fidelity, scene fidelity, and object-scene harmony. Existing methods either require per-object finetuning and slow inference or rely on extra paired data that mostly depict the same object across contexts, forcing models to rely on background cues rather than learning cross-object alignment. We propose SourceSwap, a self-supervised and source-aware framework that learns cross-object alignment. Our key insight is to synthesize high-quality pseudo pairs from any image via a frequency-separated perturbation in the initial-noise space, which alters appearance while preserving pose, coarse shape, and scene layout, requiring no videos, multi-view data, or additional images. We then train a dual U-Net with full-source conditioning and a noise-free reference encoder, enabling direct inter-object alignment, zero-shot inference without per-object finetuning, and lightweight iterative refinement. We further introduce SourceBench, a high-quality benchmark with higher resolution, more categories, and richer interactions. Experiments demonstrate that SourceSwap achieves superior fidelity, stronger scene preservation, and more natural harmony, and it transfers well to edits such as subject-driven refinement and face swapping.


[29] HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning, and Early Exit cs.CV | cs.CLPDF

Hao Wu, Yingqi Fan, Jinyang Dai, Junlong Tong, Yunpu Ma

TL;DR: 本文提出HiDrop框架,通过延迟注入、凹金字塔剪枝和提前退出机制,在保持性能的同时显著减少多模态大语言模型中视觉令牌的计算开销,实现约90%的令牌压缩和1.72倍的训练加速。

Details

Motivation: 解决多模态大语言模型中视觉令牌二次计算成本过高的问题,现有渐进式令牌剪枝方法误解浅层功能且使用僵化策略,未能充分发挥效率潜力。

Result: 在实验中压缩约90%视觉令牌,性能与原始模型相当,训练加速1.72倍,为高效MLLM训练和推理设定了新的SOTA。

Insight: 创新点包括延迟注入将视觉令牌引入到主动融合开始的层,以及基于层间相似度和可微分top-k优化的凹金字塔剪枝与提前退出机制;客观分析认为其通过持久位置编码、FlashAttention兼容令牌选择等技术消除了动态令牌减少的隐藏开销,深入揭示了多模态融合的层次性本质。

Abstract: The quadratic computational cost of processing vision tokens in Multimodal Large Language Models (MLLMs) hinders their widespread adoption. While progressive vision token pruning offers a promising solution, current methods misinterpret shallow layer functions and use rigid schedules, which fail to unlock the full efficiency potential. To address these issues, we propose HiDrop, a framework that aligns token pruning with the true hierarchical function of MLLM layers. HiDrop features two key innovations: (1) Late Injection, which bypasses passive shallow layers to introduce visual tokens exactly where active fusion begins; and (2) Concave Pyramid Pruning with an Early Exit mechanism to dynamically adjust pruning rates across middle and deep layers. This process is optimized via an inter-layer similarity measure and a differentiable top-k operator. To ensure practical efficiency, HiDrop further incorporates persistent positional encoding, FlashAttention-compatible token selection, and parallel decoupling of vision computation to eliminate hidden overhead associated with dynamic token reduction. Extensive experiments show that HiDrop compresses about 90% visual tokens while matching the original performance and accelerating training by 1.72 times. Our work not only sets a new state-of-the-art for efficient MLLM training and inference but also provides valuable insights into the hierarchical nature of multimodal fusion. The code is released at https://github.com/EIT-NLP/HiDrop.


[30] EgoGraph: Temporal Knowledge Graph for Egocentric Video Understanding cs.CVPDF

Shitong Sun, Ke Han, Yukai Huang, Weitong Cai, Jifei Song

TL;DR: 本文提出了EgoGraph,一种无需训练的动态知识图谱构建框架,用于解决超长第一人称视角视频理解中的长期依赖建模问题。该框架通过统一的自我中心模式提取人物、物体、地点和事件等核心实体,并对其属性和交互进行结构化推理,从而构建比传统片段式视频模型更丰富、更连贯的语义表示。

Details

Motivation: 现有方法在处理跨越多天的超长第一人称视角视频时,仍依赖于碎片化的局部处理和有限的时间建模,难以对长序列进行有效推理。

Result: 在EgoLifeQA和EgoR1-bench基准测试上的大量实验表明,EgoGraph在长期视频问答任务上达到了最先进的性能。

Insight: 创新点在于提出了一种无需训练的动态知识图谱构建框架,并引入了新颖的自我中心模式来统一提取和抽象核心实体,以及开发了能够捕捉跨实体时间依赖性和积累多天稳定长期记忆的时间关系建模策略。

Abstract: Ultra-long egocentric videos spanning multiple days present significant challenges for video understanding. Existing approaches still rely on fragmented local processing and limited temporal modeling, restricting their ability to reason over such extended sequences. To address these limitations, we introduce EgoGraph, a training-free and dynamic knowledge-graph construction framework that explicitly encodes long-term, cross-entity dependencies in egocentric video streams. EgoGraph employs a novel egocentric schema that unifies the extraction and abstraction of core entities, such as people, objects, locations, and events, and structurally reasons about their attributes and interactions, yielding a significantly richer and more coherent semantic representation than traditional clip-based video models. Crucially, we develop a temporal relational modeling strategy that captures temporal dependencies across entities and accumulates stable long-term memory over multiple days, enabling complex temporal reasoning. Extensive experiments on the EgoLifeQA and EgoR1-bench benchmarks demonstrate that EgoGraph achieves state-of-the-art performance on long-term video question answering, validating its effectiveness as a new paradigm for ultra-long egocentric video understanding.


[31] Can Unified Generation and Understanding Models Maintain Semantic Equivalence Across Different Output Modalities? cs.CVPDF

Hongbo Jiang, Jie Li, Yunhang Shen, Pingyang Dai, Xing Sun

TL;DR: 本文研究了统一多模态大语言模型(U-MLLMs)在不同输出模态(文本与图像)间保持语义等价性的能力。研究发现,尽管模型在文本推理上表现稳健,但在需要以图像模态呈现相同推理结果时,却无法维持语义等价性。为此,作者提出了VGUBench诊断框架,包含文本生成理解、视觉生成理解和视觉渲染控制三个任务,以解耦推理逻辑与生成保真度。评估结果表明,模型在视觉问答任务上性能显著崩溃,且这种失败源于跨模态语义对齐的失效,而非生成质量不足。

Details

Motivation: 现有评估通常将统一多模态大语言模型(U-MLLMs)的理解与生成能力分开评估,忽略了语义等价性,即无论输出模态如何,模型应展现一致推理结果的能力。本文旨在探究当前U-MLLMs是否满足这一前提。

Result: 在VGUBench框架上的评估显示,U-MLLMs在文本理解和视觉渲染控制任务上表现良好,但在视觉生成理解(即生成视觉答案)任务上性能显著崩溃。视觉回答性能与基本渲染质量之间几乎没有相关性,表明失败源于跨模态语义对齐的失效。

Insight: 论文的创新点在于首次系统性地诊断了U-MLLMs在跨模态语义等价性上的缺陷,并提出了VGUBench这一解耦推理与生成保真度的诊断框架。从客观角度看,该研究揭示了当前统一模型在模态间语义对齐上的根本性挑战,为未来模型设计提供了关键的诊断视角和改进方向。

Abstract: Unified Multimodal Large Language Models (U-MLLMs) integrate understanding and generation within a single architecture. However, existing evaluations typically assess these capabilities separately, overlooking semantic equivalence, i.e., the ability to manifest consistent reasoning results regardless of the output modality. In this work, we investigate whether current U-MLLMs satisfy this premise. We observe that while models demonstrate robust textual reasoning, they fail to maintain semantic equivalence when required to render the same results in the image modality. To rigorously diagnose this discrepancy, we introduce VGUBench, a framework to decouple reasoning logic from generation fidelity. VGUBench comprises three diagnostic tasks: (1)Textual Generative Understanding, establishing a baseline for reasoning accuracy in textual response; (2)Visual Generative Understanding, evaluating the ability to generate visual responses that represent the correct answer; and (3)a Visual Rendering control task, which assesses the ability to directly render explicit visual descriptions into images without complex reasoning. Our evaluation reveals a significant disparity: despite strong performance in textual understanding and visual rendering, U-MLLMs exhibit a marked performance collapse when required to generate visual answers to questions. Furthermore, we find a negligible correlation between visual answering performance and basic rendering quality. These results suggest that the failure stems not from insufficient generation fidelity, but from a breakdown in cross-modal semantic alignment. We provide diagnostic insights to address this challenge in future Unified Generation and Understanding Models.


[32] UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking cs.CV | cs.CLPDF

Hao Wu, Xudong Wang, Jialiang Zhang, Junlong Tong, Xinghao Chen

TL;DR: UTPTrack是一种简单统一的视觉跟踪令牌剪枝框架,首次联合压缩搜索区域、动态模板和静态模板三个组件,通过注意力引导和令牌类型感知策略建模冗余,在保持性能的同时显著降低计算开销。

Details

Motivation: 解决基于Transformer的单流跟踪器计算开销大、难以实时部署的问题,现有令牌剪枝方法孤立处理各组件,忽略了组件间依赖关系,导致剪枝效果不佳和精度下降。

Result: 在10个基准测试上评估,UTPTrack在基于剪枝的跟踪器中实现了精度与效率权衡的新SOTA,在RGB跟踪中剪除65.4%的视觉令牌并保持99.7%的基线性能,在统一跟踪中剪除67.5%的令牌并保持100.5%的基线性能。

Insight: 创新点在于首次提出联合压缩所有三个组件的统一框架,采用注意力引导和令牌类型感知的剪枝策略,能够无缝支持多模态和语言引导的统一跟踪任务,为高效视觉跟踪研究提供了鲁棒基础。

Abstract: One-stream Transformer-based trackers achieve advanced performance in visual object tracking but suffer from significant computational overhead that hinders real-time deployment. While token pruning offers a path to efficiency, existing methods are fragmented. They typically prune the search region, dynamic template, and static template in isolation, overlooking critical inter-component dependencies, which yields suboptimal pruning and degraded accuracy. To address this, we introduce UTPTrack, a simple and Unified Token Pruning framework that, for the first time, jointly compresses all three components. UTPTrack employs an attention-guided, token type-aware strategy to holistically model redundancy, a design that seamlessly supports unified tracking across multimodal and language-guided tasks within a single model. Extensive evaluations on 10 benchmarks demonstrate that UTPTrack achieves a new state-of-the-art in the accuracy-efficiency trade-off for pruning-based trackers, pruning 65.4% of vision tokens in RGB-based tracking and 67.5% in unified tracking while preserving 99.7% and 100.5% of baseline performance, respectively. This strong performance across both RGB and multimodal scenarios underlines its potential as a robust foundation for future research in efficient visual tracking. Code will be released at https://github.com/EIT-NLP/UTPTrack.


[33] U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation cs.CVPDF

Xiang Deng, Feng Gao, Yong Zhang, Youxin Pang, Xu Xiaoming

TL;DR: 本文提出了U-Mind,一个用于实时多模态交互的统一框架,能够在一个交互循环中联合建模语言、语音、动作和视频的生成。其核心是统一对齐与推理框架,通过分段对齐策略增强跨模态同步,并通过排练驱动学习保持推理能力。

Details

Motivation: 现有系统要么仅限于单模态生成,要么存在推理能力下降和跨模态对齐不佳的问题,阻碍了连贯且基于感知的交互。本文旨在解决实时、全栈多模态交互的挑战。

Result: 大量实验表明,U-Mind在一系列多模态交互任务(包括问答、指令跟随和动作生成)上达到了最先进的性能。

Insight: 主要创新点包括:1. 首个支持实时生成并联合建模多种模态的统一系统;2. 统一对齐与推理框架,特别是分段对齐策略和排练驱动学习;3. 采用文本优先的解码流程,结合内部思维链规划和跨模态时间同步生成;4. 实现了基于姿态和语音的实时视频渲染框架,提供同步的视觉反馈。

Abstract: Full-stack multimodal interaction in real-time is a central goal in building intelligent embodied agents capable of natural, dynamic communication. However, existing systems are either limited to unimodal generation or suffer from degraded reasoning and poor cross-modal alignment, preventing coherent and perceptually grounded interactions. In this work, we introduce U-Mind, the first unified system for high-intelligence multimodal dialogue that supports real-time generation and jointly models language, speech, motion, and video synthesis within a single interactive loop. At its core, U-Mind implements a Unified Alignment and Reasoning Framework that addresses two key challenges: enhancing cross-modal synchronization via a segment-wise alignment strategy, and preserving reasoning abilities through Rehearsal-Driven Learning. During inference, U-Mind adopts a text-first decoding pipeline that performs internal chain-of-thought planning followed by temporally synchronized generation across modalities. To close the loop, we implement a real-time video rendering framework conditioned on pose and speech, enabling expressive and synchronized visual feedback. Extensive experiments demonstrate that U-Mind achieves state-of-the-art performance on a range of multimodal interaction tasks, including question answering, instruction following, and motion generation, paving the way toward intelligent, immersive conversational agents.


[34] Learning Accurate Segmentation Purely from Self-Supervision cs.CVPDF

Zuyao You, Zuxuan Wu, Yu-Gang Jiang

TL;DR: 本文提出了一种名为Selfment的完全自监督框架,用于从原始图像中分割前景对象,无需人工标注、预训练分割模型或后处理。该方法首先基于自监督特征构建块级亲和图,并通过归一化割(NCut)获得初始的粗粒度前景-背景分离;随后引入迭代块优化(IPO)进行特征空间细化,通过迭代聚类增强空间一致性和语义一致性;最后,利用细化后的掩码作为监督信号,训练一个轻量级分割头,结合对比学习和区域一致性目标,学习稳定且可迁移的对象表示。

Details

Motivation: 解决无需任何人工标注即可准确分割对象这一计算机视觉核心挑战,旨在开发一个完全自监督的分割框架,摆脱对人工标签、预训练模型或后处理的依赖。

Result: 在多个基准测试中取得了新的最先进(SOTA)结果:在ECSSD、HKUIS和PASCAL-S数据集上,F_max指标相比之前的无监督显著性检测方法分别提升了4.0%、4.6%和5.7%;在零样本泛化到伪装对象检测任务上,在CHAMELEON数据集上Sm达到0.910,在CAMO数据集上Fβ^ω达到0.792,超越了所有现有无监督方法,甚至可与全监督SOTA方法相媲美。

Insight: 创新点包括:1)完全自监督的分割框架设计,无需任何外部监督信号;2)迭代块优化(IPO)过程,通过特征空间迭代聚类实现掩码的渐进式细化;3)利用细化掩码作为伪标签训练分割头,结合对比学习和区域一致性目标,学习可迁移的表示。从客观角度看,该方法将图割、迭代优化和表示学习有机结合,为无监督分割提供了简洁有效的解决方案,并在零样本泛化上展现出强大潜力。

Abstract: Accurately segmenting objects without any manual annotations remains one of the core challenges in computer vision. In this work, we introduce Selfment, a fully self-supervised framework that segments foreground objects directly from raw images without human labels, pretrained segmentation models, or any post-processing. Selfment first constructs patch-level affinity graphs from self-supervised features and applies NCut to obtain an initial coarse foreground–background separation. We then introduce Iterative Patch Optimization (IPO), a feature-space refinement procedure that progressively enforces spatial coherence and semantic consistency through iterative patch clustering. The refined masks are subsequently used as supervisory signals to train a lightweight segmentation head with contrastive and region-consistency objectives, allowing the model to learn stable and transferable object representations. Despite its simplicity and complete absence of manual supervision, Selfment sets new state-of-the-art (SoTA) results across multiple benchmarks. It achieves substantial improvements on $F_{\max}$ over previous unsupervised saliency detection methods on ECSSD ($+4.0%$), HKUIS ($+4.6%$), and PASCAL-S ($+5.7%$). Moreover, without any additional fine-tuning, Selfment demonstrates remarkable zero-shot generalization to camouflaged object detection tasks (e.g., $0.910$ $S_m$ on CHAMELEON and $0.792$ $F_β^ω$ on CAMO), outperforming all existing unsupervised approaches and even rivaling the SoTA fully supervised methods.


[35] See, Act, Adapt: Active Perception for Unsupervised Cross-Domain Visual Adaptation via Personalized VLM-Guided Agent cs.CV | cs.AIPDF

Tianci Tang, Tielong Cai, Hongwei Wang, Gaoang Wang

TL;DR: 本文提出Sea²框架,通过一个智能姿态控制代理来调整预训练感知模型的部署方式,而非直接微调模型本身。该框架冻结所有感知模块,无需下游标注,仅利用标量感知反馈引导代理寻找信息丰富的视角,从而在室内场景等新环境中提升视觉感知任务的性能。

Details

Motivation: 预训练感知模型在通用图像领域表现良好,但在室内等新环境中性能显著下降。传统微调方法会导致灾难性遗忘且需要昂贵的场景特定标注,因此需要一种无需标注、不遗忘先验知识的适应方法。

Result: 在ReplicaCAD数据集上,Sea²在视觉定位、分割和3D边界框估计三个任务上分别实现了13.54%、15.92%和27.68%的性能提升。

Insight: 创新点在于将视觉语言模型(VLM)转化为低层姿态控制器,通过两阶段训练(基于规则的探索轨迹微调和无监督强化学习)实现主动感知,直接利用现成感知模型而无需重新训练,实现了模型部署方式的范式转变。

Abstract: Pre-trained perception models excel in generic image domains but degrade significantly in novel environments like indoor scenes. The conventional remedy is fine-tuning on downstream data which incurs catastrophic forgetting of prior knowledge and demands costly, scene-specific annotations. We propose a paradigm shift through Sea$^2$ (See, Act, Adapt): rather than adapting the perception modules themselves, we adapt how they are deployed through an intelligent pose-control agent. Sea$^2$ keeps all perception modules frozen, requiring no downstream labels during training, and uses only scalar perceptual feedback to navigate the agent toward informative viewpoints. Specially, we transform a vision-language model (VLM) into a low-level pose controller through a two-stage training pipeline: first fine-tuning it on rule-based exploration trajectories that systematically probe indoor scenes, and then refining the policy via unsupervised reinforcement learning that constructs rewards from the perception module’s outputs and confidence. Unlike prior active perception methods that couple exploration with specific models or collect data for retraining them, Sea$^2$ directly leverages off-the-shelf perception models for various tasks without the need for retraining. We conducted experiments on three visual perception tasks, including visual grounding, segmentation and 3D box estimation, with performance improvements of 13.54%, 15.92% and 27.68% respectively on dataset ReplicaCAD.


[36] Footprint-Guided Exemplar-Free Continual Histopathology Report Generation cs.CVPDF

Pratibha Kumari, Daniel Reisenbüchler, Afshin Bozorgpour, yousef Sadegheih, Priyankar Choudhary

TL;DR: 本文提出了一种无样本的持续学习框架,用于从全切片图像生成病理报告,通过构建紧凑的域足迹(包括形态学标记码本、共现摘要和补丁计数先验)来支持生成式回放,避免存储原始数据,并利用风格描述符适应报告惯例变化,在多个基准测试中优于现有方法。

Details

Motivation: 解决病理报告生成在临床部署中因新器官、机构和报告惯例随时间出现而导致的灾难性遗忘问题,避免存储原始切片或补丁样本的需求。

Result: 在多个公开持续学习基准测试中,该方法优于无样本和有限缓冲回放基线,展示了基于足迹的生成式回放作为实用解决方案的有效性。

Insight: 创新点包括构建冻结补丁嵌入空间中的紧凑域足迹以支持生成式回放,以及通过风格描述符蒸馏和自适应来应对报告惯例变化,实现无需显式域标识的领域无关设置。

Abstract: Rapid progress in vision-language modeling has enabled pathology report generation from gigapixel whole-slide images, but most approaches assume static training with simultaneous access to all data. In clinical deployment, however, new organs, institutions, and reporting conventions emerge over time, and sequential fine-tuning can cause catastrophic forgetting. We introduce an exemplar-free continual learning framework for WSI-to-report generation that avoids storing raw slides or patch exemplars. The core idea is a compact domain footprint built in a frozen patch-embedding space: a small codebook of representative morphology tokens together with slide-level co-occurrence summaries and lightweight patch-count priors. These footprints support generative replay by synthesizing pseudo-WSI representations that reflect domain-specific morphological mixtures, while a teacher snapshot provides pseudo-reports to supervise the updated model without retaining past data. To address shifting reporting conventions, we distill domain-specific linguistic characteristics into a compact style descriptor and use it to steer generation. At inference, the model identifies the most compatible descriptor directly from the slide signal, enabling domain-agnostic setup without requiring explicit domain identifiers. Evaluated across multiple public continual learning benchmarks, our approach outperforms exemplar-free and limited-buffer rehearsal baselines, highlighting footprint-based generative replay as a practical solution for deployment in evolving clinical settings.


[37] APPO: Attention-guided Perception Policy Optimization for Video Reasoning cs.CVPDF

Henghui Du, Chang Zhou, Xi Chen, Di Hu

TL;DR: 本文提出了一种名为APPO(Attention-guided Perception Policy Optimization)的注意力引导感知策略优化算法,旨在通过推理低成本地增强模型在视频理解任务中的细粒度感知能力。该方法利用token级密集奖励来优化那些关注同一关键视频帧的感知token,实验表明其在多个视频基准测试和不同规模模型上均优于现有方法。

Details

Motivation: 论文的动机源于观察到复杂视频推理任务性能的提升更依赖于增强模型的感知能力,而非专家级推理能力。实证表明,感知模型规模的微小提升比推理能力的大幅增强带来更显著的性能改进,因此探索如何在不依赖昂贵细粒度标注的情况下,通过推理来增强感知能力具有重要意义。

Result: 在多个视频基准测试和不同规模模型(3B/7B)上的实验结果表明,APPO方法一致优于GRPO和DAPO基线方法,性能提升幅度在0.5%到4%之间。

Insight: 论文的核心创新点在于提出了APPO算法,通过注意力机制识别并优化关注同一关键视频帧的感知token(即组内感知token),利用token级密集奖励来低成本地增强模型的细粒度感知能力,为通过推理提升感知性能提供了一种有前景的途径。

Abstract: Complex video reasoning, actually, relies excessively on fine-grained perception rather than on expert (e.g., Ph.D, Science)-level reasoning. Through extensive empirical observation, we have recognized the critical impact of perception. In particular, when perception ability is almost fixed, enhancing reasoning from Qwen3-8B to OpenAI-o3 yields only 0.7% performance improvement. Conversely, even minimal change in perception model scale (from 7B to 32B) boosts performance by 1.4%, indicating enhancing perception, rather than reasoning, is more critical to improve performance. Therefore, exploring how to enhance perception ability through reasoning without the need for expensive fine-grained annotation information is worthwhile. To achieve this goal, we specially propose APPO, the Attention-guided Perception Policy Optimization algorithm that leverages token-level dense rewards to improve model’s fine-grained perception. The core idea behind APPO is to optimize those tokens from different responses that primarily focus on the same crucial video frame (called intra-group perception tokens). Experimental results on diverse video benchmarks and models with different scales (3/7B) demonstrate APPO consistently outperforms GRPO and DAPO (0.5%~4%). We hope our work provides a promising approach to effectively enhance model’s perception abilities through reasoning in a low-cost manner, serving diverse scenarios and demands.


[38] NAU-QMUL: Utilizing BERT and CLIP for Multi-modal AI-Generated Image Detection cs.CV | cs.CLPDF

Xiaoyu Guo, Arkaitz Zubiaga

TL;DR: 本文提出了一种多模态多任务模型,用于检测AI生成的图像并识别其生成模型。该模型利用预训练的BERT和CLIP分别提取文本和图像特征,通过跨模态特征融合和定制多任务损失函数进行训练,并采用基于伪标签的数据增强策略扩充训练数据。在CT2竞赛的Task A和Task B中均获得第五名。

Details

Motivation: 解决AI生成图像的检测问题,并识别生成这些图像的具体模型,以应对现实场景中AI生成内容识别的挑战。

Result: 在CT2竞赛的Task A(检测AI生成图像)和Task B(识别生成模型)中,F1分数分别为83.16%和48.88%,均排名第五。

Insight: 创新点在于结合BERT与CLIP进行多模态特征提取与融合,并设计多任务损失函数;同时采用伪标签数据增强策略提升模型性能,为实际应用中的AI生成内容检测提供了有效架构。

Abstract: With the aim of detecting AI-generated images and identifying the specific models responsible for their generation, we propose a multi-modal multi-task model. The model leverages pre-trained BERT and CLIP Vision encoders for text and image feature extraction, respectively, and employs cross-modal feature fusion with a tailored multi-task loss function. Additionally, a pseudo-labeling-based data augmentation strategy was utilized to expand the training dataset with high-confidence samples. The model achieved fifth place in both Tasks A and B of the `CT2: AI-Generated Image Detection’ competition, with F1 scores of 83.16% and 48.88%, respectively. These findings highlight the effectiveness of the proposed architecture and its potential for advancing AI-generated content detection in real-world scenarios. The source code for our method is published on https://github.com/xxxxxxxxy/AIGeneratedImageDetection.


[39] Open-Vocabulary Semantic Segmentation in Remote Sensing via Hierarchical Attention Masking and Model Composition cs.CVPDF

Mohammadreza Heidarianbaei, Mareike Dorozynski, Hubert Kanyamahanga, Max Mehltretter, Franz Rottensteiner

TL;DR: 本文提出了一种无需训练的开放词汇遥感图像语义分割方法ReSeg-CLIP。该方法通过引入SAM生成的多尺度掩码来分层约束自注意力层的交互,以解决CLIP等视觉语言模型在语义分割中的问题,并采用一种基于不同文本提示评估表征质量的新加权方案,对多个遥感专用CLIP变体的参数进行平均组合。该方法在三个遥感基准测试上取得了最先进的结果。

Details

Motivation: 解决CLIP等视觉语言模型在遥感图像开放词汇语义分割中,因自注意力层内不恰当的交互而导致性能不佳的问题。

Result: 在三个遥感基准测试上取得了最先进(SOTA)的结果,且无需额外训练。

Insight: 创新点在于提出了分层注意力掩码机制来约束多尺度特征交互,以及一种基于文本提示评估的模型参数加权平均组合方法,以提升开放词汇分割的泛化能力。

Abstract: In this paper, we propose ReSeg-CLIP, a new training-free Open-Vocabulary Semantic Segmentation method for remote sensing data. To compensate for the problems of vision language models, such as CLIP in semantic segmentation caused by inappropriate interactions within the self-attention layers, we introduce a hierarchical scheme utilizing masks generated by SAM to constrain the interactions at multiple scales. We also present a model composition approach that averages the parameters of multiple RS-specific CLIP variants, taking advantage of a new weighting scheme that evaluates representational quality using varying text prompts. Our method achieves state-of-the-art results across three RS benchmarks without additional training.


[40] Altitude-Aware Visual Place Recognition in Top-Down View cs.CV | cs.ROPDF

Xingyu Shao, Mengfan He, Chunyu Li, Liangzheng Sun, Ziyang Meng

TL;DR: 本文提出了一种用于解决显著高度变化下航空视觉地点识别(VPR)挑战的自适应方法。该方法通过分析图像中地面特征的密度来估计飞行平台的相对高度,并基于此进行图像裁剪以生成规范查询图像,最后采用基于分类的VPR策略进行定位。

Details

Motivation: 解决传统航空VPR方法在显著高度变化下性能下降的问题,并避免依赖额外的硬件传感器(如气压高度计或飞行时间传感器),为中小型飞行平台提供一种即插即用的纯视觉解决方案。

Result: 在多种地形和高度条件下的广泛实验表明,该方法在高度估计和VPR任务中均实现了高精度和鲁棒性。在显著高度变化下,将相对高度估计模块集成到VPR检索流程中,相比单独使用VPR检索,平均R@1和R@5分别提升了29.85%和60.20%。相比传统的单目度量深度估计(MMDE)方法,平均误差降低了202.1米,并额外带来了R@1平均31.4%和R@5平均44%的性能提升。

Insight: 创新点在于将地面特征密度分析作为相对高度估计的代理,并以此驱动图像预处理(裁剪),从而生成对高度变化鲁棒的规范视图用于VPR。这为三维视觉地点识别提供了一个稳健的、仅需视觉的框架,其核心思想是利用图像内容本身来隐式估计并补偿平台位姿(尤其是高度)的变化,而非依赖外部传感器或显式的深度估计网络。

Abstract: To address the challenge of aerial visual place recognition (VPR) problem under significant altitude variations, this study proposes an altitude-adaptive VPR approach that integrates ground feature density analysis with image classification techniques. The proposed method estimates airborne platforms’ relative altitude by analyzing the density of ground features in images, then applies relative altitude-based cropping to generate canonical query images, which are subsequently used in a classification-based VPR strategy for localization. Extensive experiments across diverse terrains and altitude conditions demonstrate that the proposed approach achieves high accuracy and robustness in both altitude estimation and VPR under significant altitude changes. Compared to conventional methods relying on barometric altimeters or Time-of-Flight (ToF) sensors, this solution requires no additional hardware and offers a plug-and-play solution for downstream applications, {making it suitable for small- and medium-sized airborne platforms operating in diverse environments, including rural and urban areas.} Under significant altitude variations, incorporating our relative altitude estimation module into the VPR retrieval pipeline boosts average R@1 and R@5 by 29.85% and 60.20%, respectively, compared with applying VPR retrieval alone. Furthermore, compared to traditional {Monocular Metric Depth Estimation (MMDE) methods}, the proposed method reduces the mean error by 202.1 m, yielding average additional improvements of 31.4% in R@1 and 44% in R@5. These results demonstrate that our method establishes a robust, vision-only framework for three-dimensional visual place recognition, offering a practical and scalable solution for accurate airborne platforms localization under large altitude variations and limited sensor availability.


[41] DACESR: Degradation-Aware Conditional Embedding for Real-World Image Super-Resolution cs.CVPDF

Xiaoyan Lei, Wenlong Zhang, Biao Luo, Hui Liang, Weifeng Cao

TL;DR: 本文提出了一种名为DACESR的退化感知条件嵌入方法,用于真实世界图像超分辨率。该方法首先通过退化选择策略和对比学习训练一个真实嵌入提取器(REE),以提升对退化图像内容的识别能力;然后利用条件特征调制器(CFM)将REE提取的高层信息融入一个基于Mamba的强大网络中,有效恢复图像纹理并产生视觉上令人愉悦的结果。

Details

Motivation: 现有基于语言类别作为条件信息的多模态大模型在处理真实世界退化图像的超分辨率时能力有限,本文旨在解决这一问题,提升模型对退化图像的识别和恢复能力。

Result: 大量实验表明,所提出的真实嵌入提取器(REE)能有效帮助图像超分辨率网络在保真度和感知质量之间取得平衡,并突显了Mamba在真实世界应用中的巨大潜力。

Insight: 创新点在于提出了一个退化感知的条件嵌入框架,通过专门设计的真实嵌入提取器(REE)来捕获退化图像的高层语义信息,并利用基于Mamba的网络进行条件特征调制,从而在真实世界图像超分辨率任务中实现了性能提升。

Abstract: Multimodal large models have shown excellent ability in addressing image super-resolution in real-world scenarios by leveraging language class as condition information, yet their abilities in degraded images remain limited. In this paper, we first revisit the capabilities of the Recognize Anything Model (RAM) for degraded images by calculating text similarity. We find that directly using contrastive learning to fine-tune RAM in the degraded space is difficult to achieve acceptable results. To address this issue, we employ a degradation selection strategy to propose a Real Embedding Extractor (REE), which achieves significant recognition performance gain on degraded image content through contrastive learning. Furthermore, we use a Conditional Feature Modulator (CFM) to incorporate the high-level information of REE for a powerful Mamba-based network, which can leverage effective pixel information to restore image textures and produce visually pleasing results. Extensive experiments demonstrate that the REE can effectively help image super-resolution networks balance fidelity and perceptual quality, highlighting the great potential of Mamba in real-world applications. The source code of this work will be made publicly available at: https://github.com/nathan66666/DACESR.git


[42] AoE: Always-on Egocentric Human Video Collection for Embodied AI cs.CV | cs.ROPDF

Bowen Yang, Zishuo Li, Yang Sun, Changtao Miao, Yifan Yang

TL;DR: 本文提出了Always-on Egocentric (AoE)数据收集系统,旨在通过利用人类自身和智能手机,简化硬件依赖,实现低成本、高效率、场景无关的真实世界交互数据收集,以解决具身AI模型训练数据稀缺的挑战。该系统采用符合人体工程学的颈戴式手机支架和云边协同架构,通过跨平台移动应用进行实时处理,并在云端进行自动化标注和过滤,支持任何人在任何时间、地点进行分布式自我中心视频数据采集。

Details

Motivation: 现有具身基础模型需要大规模、高质量的真实世界交互数据进行预训练和扩展,但现有数据收集方法存在基础设施成本高、硬件依赖复杂、交互范围有限等问题,难以实现规模化扩展。人类自身是理想的物理具身智能体,因此从全球分布的“人类智能体”获取自我中心视角的真实世界交互数据具有低成本、可持续的优势。

Result: 论文在数据预处理质量和下游任务上评估了AoE系统,结果表明,高质量的自我中心数据显著提升了模型在真实世界中的泛化能力。

Insight: 创新点在于提出了一种以人为中心、利用智能手机和云边协同架构的轻量化、可扩展数据收集范式,通过将人类本身作为数据收集代理,并结合设备端实时处理与云端自动化流水线,实现了低成本、大规模、场景无关的具身AI数据采集,为解决数据瓶颈问题提供了新思路。

Abstract: Embodied foundation models require large-scale, high-quality real-world interaction data for pre-training and scaling. However, existing data collection methods suffer from high infrastructure costs, complex hardware dependencies, and limited interaction scope, making scalable expansion challenging. In fact, humans themselves are ideal physically embodied agents. Therefore, obtaining egocentric real-world interaction data from globally distributed “human agents” offers advantages of low cost and sustainability. To this end, we propose the Always-on Egocentric (AoE) data collection system, which aims to simplify hardware dependencies by leveraging humans themselves and their smartphones, enabling low-cost, highly efficient, and scene-agnostic real-world interaction data collection to address the challenge of data scarcity. Specifically, we first employ an ergonomic neck-mounted smartphone holder to enable low-barrier, large-scale egocentric data collection through a cloud-edge collaborative architecture. Second, we develop a cross-platform mobile APP that leverages on-device compute for real-time processing, while the cloud hosts automated labeling and filtering pipelines that transform raw videos into high-quality training data. Finally, the AoE system supports distributed Ego video data collection by anyone, anytime, and anywhere. We evaluate AoE on data preprocessing quality and downstream tasks, demonstrating that high-quality egocentric data significantly boosts real-world generalization.


[43] SelfOccFlow: Towards end-to-end self-supervised 3D Occupancy Flow prediction cs.CVPDF

Xavier Timoneda, Markus Herb, Fabian Duerr, Daniel Goehring

TL;DR: 本文提出了一种名为SelfOccFlow的自监督方法,用于估计自动驾驶车辆周围环境的3D占用和运动,无需依赖昂贵的人工标注或外部光流监督。该方法通过将场景解耦为静态和动态的有符号距离场,并利用时间聚合隐式学习运动,同时引入基于特征余弦相似度的自监督流线索。

Details

Motivation: 现有方法在联合学习几何和运动时,依赖于昂贵的3D占用和流标注、边界框速度标签或预训练光流模型,这限制了其可扩展性和实用性。本文旨在开发一种自监督方法,以消除对人工标注或外部监督的需求,从而更高效地实现3D占用流预测。

Result: 在SemanticKITTI、KITTI-MOT和nuScenes等基准数据集上验证了方法的有效性,展示了其自监督学习能力,但摘要未提及具体的定量结果或与现有SOTA模型的比较。

Insight: 创新点包括:1) 将场景解耦为静态和动态有符号距离场以隐式学习运动;2) 引入基于特征余弦相似度的自监督流线索,增强运动估计的鲁棒性。从客观角度看,该方法通过自监督设计降低了数据标注成本,为3D占用流预测提供了新的端到端解决方案。

Abstract: Estimating 3D occupancy and motion at the vehicle’s surroundings is essential for autonomous driving, enabling situational awareness in dynamic environments. Existing approaches jointly learn geometry and motion but rely on expensive 3D occupancy and flow annotations, velocity labels from bounding boxes, or pretrained optical flow models. We propose a self-supervised method for 3D occupancy flow estimation that eliminates the need for human-produced annotations or external flow supervision. Our method disentangles the scene into separate static and dynamic signed distance fields and learns motion implicitly through temporal aggregation. Additionally, we introduce a strong self-supervised flow cue derived from features’ cosine similarities. We demonstrate the efficacy of our 3D occupancy flow method on SemanticKITTI, KITTI-MOT, and nuScenes.


[44] Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks cs.CV | cs.AI | cs.CLPDF

Qihua Dong, Kuo Yang, Lin Ju, Handong Zhao, Yitian Zhang

TL;DR: 该论文提出了一个新的基准数据集Ref-Adv,旨在评估多模态大语言模型在指代表达理解任务中的视觉推理和基础能力。作者认为现有基准(如RefCOCO系列)存在捷径,无法有效测试模型真正的推理能力。Ref-Adv通过设计包含复杂语言表达、困难干扰项和必要推理要素(如否定)的样本,抑制了模型利用简单线索的捷径。实验表明,尽管现有模型在传统基准上表现良好,但在Ref-Adv上性能显著下降,揭示了它们在视觉推理和基础方面的不足。

Details

Motivation: 现有指代表达理解基准(如RefCOCO, RefCOCO+, RefCOCOg)存在缺陷:表达式过短、图像干扰项少、存在冗余描述,使得模型可以通过捷径(如简单匹配)解决问题,无法真正测试多模态大语言模型的视觉推理和基础能力。

Result: 在Ref-Adv基准上,一系列当代多模态大语言模型的表现相比在RefCOCO、RefCOCO+和RefCOCOg等传统基准上显著下降,揭示了模型对捷径的依赖以及在视觉推理和基础方面的差距。

Insight: 论文的创新点在于构建了一个旨在抑制捷径、强调必要推理的基准数据集Ref-Adv,其设计原则(如使用非平凡语言表达、精心设计困难干扰项、标注推理要素)为评估和提升多模态模型的视觉推理能力提供了新的方向。从客观角度看,这种通过构建更具挑战性的诊断性基准来揭示模型真实能力短板的方法,对于推动该领域向更鲁棒、更可解释的视觉语言理解发展具有借鉴意义。

Abstract: Referring Expression Comprehension (REC) links language to region level visual perception. Standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg) have progressed rapidly with multimodal LLMs but remain weak tests of visual reasoning and grounding: (i) many expressions are very short, leaving little reasoning demand; (ii) images often contain few distractors, making the target easy to find; and (iii) redundant descriptors enable shortcut solutions that bypass genuine text understanding and visual reasoning. We introduce Ref-Adv, a modern REC benchmark that suppresses shortcuts by pairing linguistically nontrivial expressions with only the information necessary to uniquely identify the target. The dataset contains referring expressions on real images, curated with hard distractors and annotated with reasoning facets including negation. We conduct comprehensive ablations (word order perturbations and descriptor deletion sufficiency) to show that solving Ref-Adv requires reasoning beyond simple cues, and we evaluate a broad suite of contemporary multimodal LLMs on Ref-Adv. Despite strong results on RefCOCO, RefCOCO+, and RefCOCOg, models drop markedly on Ref-Adv, revealing reliance on shortcuts and gaps in visual reasoning and grounding. We provide an in depth failure analysis and aim for Ref-Adv to guide future work on visual reasoning and grounding in MLLMs.


[45] Half-Truths Break Similarity-Based Retrieval cs.CVPDF

Bora Kargi, Arnas Uselis, Seong Joon Oh

TL;DR: 本文揭示了CLIP等双编码器模型在图像-文本相似性评估中的一个关键缺陷:当在原本正确的描述中添加一个看似合理但错误的细节时,模型反而可能给出更高的相似性分数,作者称这种现象为‘半真’。论文提出了CS-CLIP方法,通过对描述进行组件级监督训练来缓解这一问题,并在组合理解基准上取得了性能提升。

Details

Motivation: 解决CLIP等双编码器模型在评估图像与文本描述相似性时,对部分错误细节(即‘半真’描述)缺乏鲁棒性的问题,即模型可能错误地偏好包含错误细节的更长描述。

Result: 在COCO数据集上,原始CLIP模型仅在40.6%的情况下偏好正确的较短描述;当添加的细节是关系时,性能降至32.9%。提出的CS-CLIP方法将‘半真’准确率提升至69.3%,并在已建立的组合理解基准上将平均性能提高了5.7个百分点。

Insight: 创新点在于揭示了对比学习中对完整句子进行对齐的弱监督不足,未能显式地确保单个实体和关系被正确建模。提出的CS-CLIP通过将描述分解为实体和关系单元,并为每个单元构建最小编辑的负例进行微调,从而在保持标准双编码器推理的同时,增强了模型对组合细节的鲁棒性。这为改进视觉-语言模型的组合理解提供了一种有效的组件级监督思路。

Abstract: When a text description is extended with an additional detail, image-text similarity should drop if that detail is wrong. We show that CLIP-style dual encoders often violate this intuition: appending a plausible but incorrect object or relation to an otherwise correct description can increase the similarity score. We call such cases half-truths. On COCO, CLIP prefers the correct shorter description only 40.6% of the time, and performance drops to 32.9% when the added detail is a relation. We trace this vulnerability to weak supervision on caption parts: contrastive training aligns full sentences but does not explicitly enforce that individual entities and relations are grounded. We propose CS-CLIP (Component-Supervised CLIP), which decomposes captions into entity and relation units, constructs a minimally edited foil for each unit, and fine-tunes the model to score the correct unit above its foil while preserving standard dual-encoder inference. CS-CLIP raises half-truth accuracy to 69.3% and improves average performance on established compositional benchmarks by 5.7 points, suggesting that reducing half-truth errors aligns with broader gains in compositional understanding. Code is publicly available at: https://github.com/kargibora/CS-CLIP


[46] The Geometry of Transfer: Unlocking Medical Vision Manifolds for Training-Free Model Ranking cs.CV | cs.AIPDF

Jiaqi Tang, Shaoyang Zhang, Xiaoqi Wang, Jiaying Zhou, Yang Liu

TL;DR: 本文提出了一种新颖的拓扑驱动的可迁移性估计框架,用于在无需微调的情况下,高效地为特定医学图像分割任务选择最优的基础模型。该框架通过评估流形的可处理性而非统计重叠,包含全局表示拓扑差异、局部边界感知拓扑一致性和任务自适应融合三个组件。

Details

Motivation: 现有可迁移性估计指标主要针对分类任务设计,依赖于全局统计假设,无法捕捉密集预测所需的关键拓扑复杂性,导致为特定医学分割任务选择最优基础模型时存在计算瓶颈。

Result: 在大规模OpenMind基准测试上,针对多种解剖目标和自监督学习基础模型进行验证,该方法在加权Kendall系数上显著优于现有最先进基线,相对提升约31%。

Insight: 创新点在于从拓扑几何角度评估模型可迁移性,特别是引入了量化特征-标签结构同构性的全局拓扑差异和关注关键解剖边界处流形可分性的局部一致性度量,并动态融合以适应不同语义复杂度的任务,为无训练模型选择提供了新视角。

Abstract: The advent of large-scale self-supervised learning (SSL) has produced a vast zoo of medical foundation models. However, selecting optimal medical foundation models for specific segmentation tasks remains a computational bottleneck. Existing Transferability Estimation (TE) metrics, primarily designed for classification, rely on global statistical assumptions and fail to capture the topological complexity essential for dense prediction. We propose a novel Topology-Driven Transferability Estimation framework that evaluates manifold tractability rather than statistical overlap. Our approach introduces three components: (1) Global Representation Topology Divergence (GRTD), utilizing Minimum Spanning Trees to quantify feature-label structural isomorphism; (2) Local Boundary-Aware Topological Consistency (LBTC), which assesses manifold separability specifically at critical anatomical boundaries; and (3) Task-Adaptive Fusion, which dynamically integrates global and local metrics based on the semantic cardinality of the target task. Validated on the large-scale OpenMind benchmark across diverse anatomical targets and SSL foundation models, our approach significantly outperforms state-of-the-art baselines by around \textbf{31%} relative improvement in the weighted Kendall, providing a robust, training-free proxy for efficient model selection without the cost of fine-tuning. The code will be made publicly available upon acceptance.


[47] Leveraging Geometric Prior Uncertainty and Complementary Constraints for High-Fidelity Neural Indoor Surface Reconstruction cs.CVPDF

Qiyu Feng, Jiwei Shan, Shing Shin Cheng, Hesheng Wang

TL;DR: 本文提出GPU-SDF,一种用于室内表面重建的神经隐式框架,旨在解决现有方法在恢复薄结构和复杂几何细节时,因依赖不可靠或有噪声的几何先验而面临的挑战。该框架通过自监督模块显式估计先验不确定性,并设计不确定性引导的损失函数来调制而非丢弃先验信息,同时引入边缘距离场和多视图一致性正则化作为互补约束来处理高不确定性区域,从而提升重建的保真度。

Details

Motivation: 现有基于符号距离函数的神经隐式表面重建方法在处理薄结构和复杂几何时,常因几何先验不可靠或噪声而效果不佳;现有方法依赖优化过程中产生的隐式不确定性来过滤先验,这种方式间接且低效,且在高不确定性区域屏蔽监督会导致优化约束不足。

Result: 大量实验证实,GPU-SDF改善了精细细节的重建效果,并能作为即插即用的增强模块用于现有框架。

Insight: 创新点在于显式地自监督估计几何先验不确定性,并据此调制先验影响以保留弱但信息丰富的线索;同时,通过引入边缘距离场(加强边界监督)和多视图一致性正则化(强制几何一致性)作为互补约束,直接处理高不确定性区域,这比单纯屏蔽监督更有效。

Abstract: Neural implicit surface reconstruction with signed distance function has made significant progress, but recovering fine details such as thin structures and complex geometries remains challenging due to unreliable or noisy geometric priors. Existing approaches rely on implicit uncertainty that arises during optimization to filter these priors, which is indirect and inefficient, and masking supervision in high-uncertainty regions further leads to under-constrained optimization. To address these issues, we propose GPU-SDF, a neural implicit framework for indoor surface reconstruction that leverages geometric prior uncertainty and complementary constraints. We introduce a self-supervised module that explicitly estimates prior uncertainty without auxiliary networks. Based on this estimation, we design an uncertainty-guided loss that modulates prior influence rather than discarding it, thereby retaining weak but informative cues. To address regions with high prior uncertainty, GPU-SDF further incorporates two complementary constraints: an edge distance field that strengthens boundary supervision and a multi-view consistency regularization that enforces geometric coherence. Extensive experiments confirm that GPU-SDF improves the reconstruction of fine details and serves as a plug-and-play enhancement for existing frameworks. Source code will be available at https://github.com/IRMVLab/GPU-SDF


[48] PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning cs.CV | cs.AI | cs.MMPDF

Dongxu Zhang, Yiding Sun, Pengcheng Li, Yumou Liu, Hongqiang Lin

TL;DR: 本文提出了PointCoT框架,旨在解决多模态大语言模型在3D点云理解中存在的几何幻觉问题。该框架通过引入显式的思维链推理,采用“先观察、再思考、后回答”的范式,并构建了一个包含约8.6万指令调优样本的大规模基准数据集Point-Reason-Instruct。实验表明,该方法在复杂推理任务上达到了最先进的性能。

Details

Motivation: 当前多模态大语言模型主要擅长2D场景,但在3D点云理解中,现有方法通常将几何推理视为隐式映射过程,缺乏中间逻辑步骤,导致模型产生脱离精确结构细节的几何幻觉。本文旨在弥合这一差距,赋予模型对3D数据进行显式推理的能力。

Result: 广泛的实验表明,PointCoT在复杂的3D几何推理任务上实现了最先进的性能。

Insight: 论文的核心创新点在于将显式思维链推理范式引入3D点云理解,提出了“Look, Think, then Answer”的推理流程,并构建了带有分层思维链标注的大规模指令调优数据集。从客观角度看,该方法通过双流多模态架构融合语义外观与几何真值,为解决3D场景下的精确几何推理问题提供了新思路。

Abstract: While Multimodal Large Language Models (MLLMs) demonstrate proficiency in 2D scenes, extending their perceptual intelligence to 3D point cloud understanding remains a significant challenge. Current approaches focus primarily on aligning 3D features with pre-trained models. However, they typically treat geometric reasoning as an implicit mapping process. These methods bypass intermediate logical steps and consequently suffer from geometric hallucinations. They confidently generate plausible responses that fail to ground in precise structural details. To bridge this gap, we present PointCoT, a novel framework that empowers MLLMs with explicit Chain-of-Thought (CoT) reasoning for 3D data. We advocate for a \textit{Look, Think, then Answer} paradigm. In this approach, the model is supervised to generate geometry-grounded rationales before predicting final answers. To facilitate this, we construct Point-Reason-Instruct, a large-scale benchmark comprising $\sim$86k instruction-tuning samples with hierarchical CoT annotations. By leveraging a dual-stream multi-modal architecture, our method synergizes semantic appearance with geometric truth. Extensive experiments demonstrate that PointCoT achieves state-of-the-art performance on complex reasoning tasks.


[49] Toward Guarantees for Clinical Reasoning in Vision Language Models via Formal Verification cs.CV | cs.AI | cs.CL | cs.LOPDF

Vikash Singh, Debargha Ganguly, Haotian Yu, Chengwei Zhou, Prerna Singh

TL;DR: 本文提出了一种神经符号验证框架,用于确定性地审核视觉语言模型(VLM)生成的放射学报告的内部逻辑一致性。该框架将自由文本的放射学发现自动形式化为结构化命题证据,并利用SMT求解器(Z3)和临床知识库来验证每个诊断主张是数学上可推导的、幻觉的还是被遗漏的。在五个胸部X光基准上评估七个VLM,该验证器揭示了传统指标无法捕捉的独特推理失败模式,如保守观察和随机幻觉。在标记数据集上,强制执行求解器支持的蕴涵关系作为一种严格的事后保证,系统地消除了无支持的幻觉,显著提高了生成式临床助手的诊断可靠性和精确度。

Details

Motivation: 视觉语言模型在起草放射学报告方面显示出潜力,但经常遭受逻辑不一致的问题,例如生成不受其自身感知发现支持的诊断印象,或遗漏逻辑上必然的结论。标准的词汇度量指标严重惩罚临床释义,并且在无参考设置下无法捕捉这些演绎失败。

Result: 在五个胸部X光基准(如MIMIC-CXR、CheXpert等)上评估了七个VLMs(如Flamingo、BLIP-2、LLaVA-Med等),验证器暴露了传统指标不可见的独特推理失败模式。在标记数据集上,强制执行求解器支持的蕴涵关系可系统消除无支持的幻觉,显著提高诊断可靠性和精确度。

Insight: 创新点在于将神经符号方法(自动形式化文本为命题证据,结合SMT求解器和临床知识库)引入VLM输出验证,提供了一种确定性的、可解释的逻辑一致性审计框架,为生成式临床助手提供了严格的事后保证,并能揭示传统评估指标无法捕捉的深层推理失败模式。

Abstract: Vision-language models (VLMs) show promise in drafting radiology reports, yet they frequently suffer from logical inconsistencies, generating diagnostic impressions unsupported by their own perceptual findings or missing logically entailed conclusions. Standard lexical metrics heavily penalize clinical paraphrasing and fail to capture these deductive failures in reference-free settings. Toward guarantees for clinical reasoning, we introduce a neurosymbolic verification framework that deterministically audits the internal consistency of VLM-generated reports. Our pipeline autoformalizes free-text radiographic findings into structured propositional evidence, utilizing an SMT solver (Z3) and a clinical knowledge base to verify whether each diagnostic claim is mathematically entailed, hallucinated, or omitted. Evaluating seven VLMs across five chest X-ray benchmarks, our verifier exposes distinct reasoning failure modes, such as conservative observation and stochastic hallucination, that remain invisible to traditional metrics. On labeled datasets, enforcing solver-backed entailment acts as a rigorous post-hoc guarantee, systematically eliminating unsupported hallucinations to significantly increase diagnostic soundness and precision in generative clinical assistants.


[50] CC-VQA: Conflict- and Correlation-Aware Method for Mitigating Knowledge Conflict in Knowledge-Based Visual Question Answering cs.CVPDF

Yuyang Hong, Jiaqi Gu, Yujin Lou, Lubin Fan, Qi Yang

TL;DR: 本文提出了一种名为CC-VQA的新型免训练方法,用于缓解基于知识的视觉问答(KB-VQA)中的知识冲突问题。该方法通过视觉中心的上下文冲突推理和相关性引导的编码解码,有效整合了视觉信息并处理了冗余检索上下文,在多个基准测试上取得了最先进的性能。

Details

Motivation: 解决基于知识的视觉问答中,预训练视觉语言模型的静态参数知识与动态检索信息之间的冲突问题,以及现有方法忽视视觉信息作用和受冗余上下文影响的问题。

Result: 在E-VQA、InfoSeek和OK-VQA基准测试上进行了广泛评估,CC-VQA达到了最先进的性能,相比现有方法获得了3.3%到6.4%的绝对准确率提升。

Insight: 创新点在于提出了视觉中心的冲突分析和相关性引导的编码解码机制,强调了视觉信息在知识冲突中的关键作用,并通过位置编码压缩和自适应解码有效处理了冗余上下文,是一种免训练的高效解决方案。

Abstract: Knowledge-based visual question answering (KB-VQA) demonstrates significant potential for handling knowledge-intensive tasks. However, conflicts arise between static parametric knowledge in vision language models (VLMs) and dynamically retrieved information due to the static model knowledge from pre-training. The outputs either ignore retrieved contexts or exhibit inconsistent integration with parametric knowledge, posing substantial challenges for KB-VQA. Current knowledge conflict mitigation methods primarily adapted from language-based approaches, focusing on context-level conflicts through engineered prompting strategies or context-aware decoding mechanisms. However, these methods neglect the critical role of visual information in conflicts and suffer from redundant retrieved contexts, which impair accurate conflict identification and effective mitigation. To address these limitations, we propose \textbf{CC-VQA}: a novel training-free, conflict- and correlation-aware method for KB-VQA. Our method comprises two core components: (1) Vision-Centric Contextual Conflict Reasoning, which performs visual-semantic conflict analysis across internal and external knowledge contexts; and (2) Correlation-Guided Encoding and Decoding, featuring positional encoding compression for low-correlation statements and adaptive decoding using correlation-weighted conflict scoring. Extensive evaluations on E-VQA, InfoSeek, and OK-VQA benchmarks demonstrate that CC-VQA achieves state-of-the-art performance, yielding absolute accuracy improvements of 3.3% to 6.4% compared to existing methods. Code is available at https://github.com/cqu-student/CC-VQA.


[51] AgenticOCR: Parsing Only What You Need for Efficient Retrieval-Augmented Generation cs.CV | cs.CLPDF

Zhengren Wang, Dongsheng Ma, Huaping Zhong, Jiayu Li, Wentao Zhang

TL;DR: 本文提出了AgenticOCR,一种用于视觉文档检索增强生成(RAG)的动态解析范式。它将传统的静态、全文光学字符识别(OCR)转变为查询驱动的按需提取系统,通过分析文档布局并选择性识别感兴趣区域,以解决页面级检索带来的信息过载和幻觉风险问题。

Details

Motivation: 多模态检索增强生成在处理复杂视觉文档(如财务报告)时面临挑战:页面级分块和检索会向生成器引入过多无关上下文,不仅过载其注意力机制,还稀释了关键证据,同时在有限视觉令牌预算下压缩信息丰富的页面会增加幻觉风险。

Result: 实验结果表明,AgenticOCR提高了视觉RAG系统的效率和准确性,在长文档理解任务中达到了专家级性能。

Insight: 核心创新在于将OCR从静态过程转变为动态、查询驱动的按需提取系统,实现了检索粒度与固定页面级分块的解耦。该方法采用“用图像思考”的方式自主分析布局,并按需解压视觉令牌,有潜力成为视觉文档RAG栈中嵌入和重排模块之外的“第三构建块”。

Abstract: The expansion of retrieval-augmented generation (RAG) into multimodal domains has intensified the challenge for processing complex visual documents, such as financial reports. While page-level chunking and retrieval is a natural starting point, it creates a critical bottleneck: delivering entire pages to the generator introduces excessive extraneous context. This not only overloads the generator’s attention mechanism but also dilutes the most salient evidence. Moreover, compressing these information-rich pages into a limited visual token budget further increases the risk of hallucinations. To address this, we introduce AgenticOCR, a dynamic parsing paradigm that transforms optical character recognition (OCR) from a static, full-text process into a query-driven, on-demand extraction system. By autonomously analyzing document layout in a “thinking with images” manner, AgenticOCR identifies and selectively recognizes regions of interest. This approach performs on-demand decompression of visual tokens precisely where needed, effectively decoupling retrieval granularity from rigid page-level chunking. AgenticOCR has the potential to serve as the “third building block” of the visual document RAG stack, operating alongside and enhancing standard Embedding and Reranking modules. Experimental results demonstrate that AgenticOCR improves both the efficiency and accuracy of visual RAG systems, achieving expert-level performance in long document understanding. Code and models are available at https://github.com/OpenDataLab/AgenticOCR.


[52] SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls cs.CVPDF

Qianxun Xu, Chenxi Song, Yujun Cai, Chi Zhang

TL;DR: SwitchCraft是一个无需训练的多事件视频生成框架,通过事件对齐查询引导(EAQS)和自适应平衡强度求解器(ABSS)来解决现有文本到视频扩散模型在处理多事件提示时产生的场景混合或崩溃问题,显著提升了提示对齐、事件清晰度和场景一致性。

Details

Motivation: 当前文本到视频扩散模型主要针对单事件生成优化,处理多事件提示时由于缺乏明确的时间定位,往往会产生混合或崩溃的场景,破坏了预期的叙事结构。

Result: 大量实验表明,SwitchCraft在提示对齐、事件清晰度和场景一致性方面相比现有基线方法有显著提升,为多事件视频生成提供了一个简单而有效的解决方案。

Insight: 创新点在于认识到均匀的跨时间提示注入忽略了事件与帧之间的对应关系,因此提出了事件对齐查询引导(EAQS)来引导帧级注意力与相关事件提示对齐,以及自适应平衡强度求解器(ABSS)来自适应平衡引导强度以保持时间一致性和视觉保真度。

Abstract: Recent advances in text-to-video diffusion models have enabled high-fidelity and temporally coherent videos synthesis. However, current models are predominantly optimized for single-event generation. When handling multi-event prompts, without explicit temporal grounding, such models often produce blended or collapsed scenes that break the intended narrative. To address this limitation, we present SwitchCraft, a training-free framework for multi-event video generation. Our key insight is that uniform prompt injection across time ignores the correspondence between events and frames. To this end, we introduce Event-Aligned Query Steering (EAQS), which steers frame-level attention to align with relevant event prompts. Furthermore, we propose Auto-Balance Strength Solver (ABSS), which adaptively balances steering strength to preserve temporal consistency and visual fidelity. Extensive experiments demonstrate that SwitchCraft substantially improves prompt alignment, event clarity, and scene consistency compared with existing baselines, offering a simple yet effective solution for multi-event video generation.


[53] Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought cs.CVPDF

Kesen Zhao, Beier Zhu, Junbao Zhou, Xingyu Zhu, Zhongqi Yue

TL;DR: 本文提出了一种名为数值视觉思维链(NV-CoT)的框架,使多模态大语言模型能够使用连续的数值坐标对图像进行推理,从而将模型的动作空间从离散的词汇标记扩展到连续的欧几里得空间,以提升区域定位精度和答案准确性。

Details

Motivation: 现有方法通过文本化坐标或固定粒度图像块进行区域定位,存在模态不匹配、语义碎片化以及限制精确区域选择的问题,需要一种更精确且无需重大架构更改的视觉推理方法。

Result: 在三个基准测试上与八个代表性视觉推理基线模型相比,NV-CoT显著提高了定位精度和最终答案准确性,同时加速了训练收敛,验证了连续动作视觉推理在多模态大语言模型中的有效性。

Insight: 创新点在于将模型输出从离散标记扩展到连续坐标空间,采用高斯(或拉普拉斯)策略并通过重参数化采样引入随机性,使其完全兼容GRPO风格的策略优化,仅需最小架构修改即可实现精确的区域推理。

Abstract: Recent multimodal large language models (MLLMs) increasingly rely on visual chain-of-thought to perform region-grounded reasoning over images. However, existing approaches ground regions via either textified coordinates-causing modality mismatch and semantic fragmentation or fixed-granularity patches that both limit precise region selection and often require non-trivial architectural changes. In this paper, we propose Numerical Visual Chain-of-Thought (NV-CoT), a framework that enables MLLMs to reason over images using continuous numerical coordinates. NV-CoT expands the MLLM action space from discrete vocabulary tokens to a continuous Euclidean space, allowing models to directly generate bounding-box coordinates as actions with only minimal architectural modification. The framework supports both supervised fine-tuning and reinforcement learning. In particular, we replace categorical token policies with a Gaussian (or Laplace) policy over coordinates and introduce stochasticity via reparameterized sampling, making NV-CoT fully compatible with GRPO-style policy optimization. Extensive experiments on three benchmarks against eight representative visual reasoning baselines demonstrate that NV-CoT significantly improves localization precision and final answer accuracy, while also accelerating training convergence, validating the effectiveness of continuous-action visual reasoning in MLLMs. The code is available in https://github.com/kesenzhao/NV-CoT.


[54] SpikeTrack: A Spike-driven Framework for Efficient Visual Tracking cs.CVPDF

Qiuyang Zhang, Jiujun Cheng, Qichao Mao, Cong Liu, Yu Fang

TL;DR: SpikeTrack是一个脉冲驱动的框架,用于高效RGB目标跟踪。它采用非对称设计,包括非对称时间步扩展和单向信息流,以利用时空动态并减少计算。通过受神经推理机制启发的记忆检索模块,实现分支间的有效单向信息传递。实验表明,该框架在SNN跟踪器中达到SOTA,并与先进ANN跟踪器竞争,在LaSOT数据集上超越TransT,同时能耗仅为1/26。

Details

Motivation: 现有SNN跟踪框架要么未完全对齐脉冲驱动计算,要么未充分利用神经元的时空动态,导致效率与准确性之间的权衡。SpikeTrack旨在解决这一问题,实现高效且准确的RGB目标跟踪。

Result: 在LaSOT数据集上,SpikeTrack超越TransT,同时能耗仅为后者的1/26;在SNN跟踪器中达到SOTA,并与先进ANN跟踪器保持竞争力。

Insight: 创新点包括非对称设计(非对称时间步扩展和单向信息流)和受神经推理机制启发的记忆检索模块,这些设计在保持准确性的同时显著降低能耗,是首个实现RGB跟踪既准确又高效的脉冲驱动框架。

Abstract: Spiking Neural Networks (SNNs) promise energy-efficient vision, but applying them to RGB visual tracking remains difficult: Existing SNN tracking frameworks either do not fully align with spike-driven computation or do not fully leverage neurons’ spatiotemporal dynamics, leading to a trade-off between efficiency and accuracy. To address this, we introduce SpikeTrack, a spike-driven framework for energy-efficient RGB object tracking. SpikeTrack employs a novel asymmetric design that uses asymmetric timestep expansion and unidirectional information flow, harnessing spatiotemporal dynamics while cutting computation. To ensure effective unidirectional information transfer between branches, we design a memory-retrieval module inspired by neural inference mechanisms. This module recurrently queries a compact memory initialized by the template to retrieve target cues and sharpen target perception over time. Extensive experiments demonstrate that SpikeTrack achieves the state-of-the-art among SNN-based trackers and remains competitive with advanced ANN trackers. Notably, it surpasses TransT on LaSOT dataset while consuming only 1/26 of its energy. To our knowledge, SpikeTrack is the first spike-driven framework to make RGB tracking both accurate and energy efficient. The code and models are available at https://github.com/faicaiwawa/SpikeTrack.


[55] Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping cs.CVPDF

Tianxiang Du, Hulingxiao He, Yuxin Peng

TL;DR: 本文提出Venus框架,旨在增强多模态大语言模型(MLLMs)的美学指导与裁剪能力。通过构建首个大规模美学指导数据集AesGuide,并设计两阶段训练方法(先赋予美学指导能力,再激活美学裁剪),Venus显著提升了模型在美学问题识别与构图优化方面的性能。

Details

Motivation: 解决普通用户与专业摄影师在拍摄时美学问题识别与指导能力上的差距,现有MLLMs仅提供正面反馈而无法指出问题或给出可操作建议,限制了其在美学裁剪等任务中的应用。

Result: 在构建的AesGuide数据集上进行实验,Venus大幅提升了美学指导能力,并在美学裁剪任务上达到了最先进的(SOTA)性能。

Insight: 创新点包括:1) 定义并构建了首个大规模美学指导(AG)数据集与基准;2) 提出两阶段框架,通过渐进式美学问题训练赋予MLLMs AG能力,并利用思维链(CoT)激活其美学裁剪能力,实现了可解释、交互式的美学优化。

Abstract: The widespread use of smartphones has made photography ubiquitous, yet a clear gap remains between ordinary users and professional photographers, who can identify aesthetic issues and provide actionable shooting guidance during capture. We define this capability as aesthetic guidance (AG) – an essential but largely underexplored domain in computational aesthetics. Existing multimodal large language models (MLLMs) primarily offer overly positive feedback, failing to identify issues or provide actionable guidance. Without AG capability, they cannot effectively identify distracting regions or optimize compositional balance, thus also struggling in aesthetic cropping, which aims to refine photo composition through reframing after capture. To address this, we introduce AesGuide, the first large-scale AG dataset and benchmark with 10,748 photos annotated with aesthetic scores, analyses, and guidance. Building upon it, we propose Venus, a two-stage framework that first empowers MLLMs with AG capability through progressively complex aesthetic questions and then activates their aesthetic cropping power via CoT-based rationales. Extensive experiments show that Venus substantially improves AG capability and achieves state-of-the-art (SOTA) performance in aesthetic cropping, enabling interpretable and interactive aesthetic refinement across both stages of photo creation. Code is available at https://github.com/PKU-ICST-MIPL/Venus_CVPR2026.


[56] Interpretable Debiasing of Vision-Language Models for Social Fairness cs.CV | cs.AIPDF

Na Min An, Yoonna Jang, Yusuke Hirota, Ryo Hachiuma, Isabelle Augenstein

TL;DR: 本文提出了一种名为DeBiasLens的可解释、模型无关的视觉语言模型(VLM)去偏框架,通过在多模态编码器上应用稀疏自编码器(SAEs)来定位与特定社会属性(如人口统计学特征)高度响应的神经元,并通过选择性失活这些神经元来有效减轻VLM的社会偏见行为,同时保持其语义知识。

Details

Motivation: 当前VLM的去偏方法主要关注通过后处理或测试时算法缓解表面偏见信号,而模型内部动态机制未被充分探索,这可能导致模型的黑盒推理过程产生意外的社会偏见。

Result: 通过在不含社会属性标签的面部图像或字幕数据集上训练SAEs,成功定位了与特定人口统计学特征(包括少数群体)高度相关的神经元,选择性失活这些神经元有效减轻了VLM的社会偏见行为,且未损害其语义知识。

Insight: 创新点在于利用稀疏自编码器的解耦能力,以无监督方式定位VLM中与社会属性相关的特定神经元,从而实现可解释的、模型无关的去偏,为未来AI系统的社会公平审计工具奠定了基础。

Abstract: The rapid advancement of Vision-Language models (VLMs) has raised growing concerns that their black-box reasoning processes could lead to unintended forms of social bias. Current debiasing approaches focus on mitigating surface-level bias signals through post-hoc learning or test-time algorithms, while leaving the internal dynamics of the model largely unexplored. In this work, we introduce an interpretable, model-agnostic bias mitigation framework, DeBiasLens, that localizes social attribute neurons in VLMs through sparse autoencoders (SAEs) applied to multimodal encoders. Building upon the disentanglement ability of SAEs, we train them on facial image or caption datasets without corresponding social attribute labels to uncover neurons highly responsive to specific demographics, including those that are underrepresented. By selectively deactivating the social neurons most strongly tied to bias for each group, we effectively mitigate socially biased behaviors of VLMs without degrading their semantic knowledge. Our research lays the groundwork for future auditing tools, prioritizing social fairness in emerging real-world AI systems.


[57] Steering and Rectifying Latent Representation Manifolds in Frozen Multi-modal LLMs for Video Anomaly Detection cs.CVPDF

Zhaolin Cai, Fan Li, Huiyu Duan, Lijun He, Guangtao Zhai

TL;DR: 本文提出了一种名为SteerVAD的新型干预框架,用于视频异常检测(VAD)。该方法通过主动引导和校正冻结多模态大语言模型(MLLMs)的内部表示来解决现有免调优方法性能受限的问题,仅需1%的训练数据即可在主流基准上达到最先进的性能。

Details

Motivation: 现有基于冻结MLLMs的免调优VAD方法直接继承了预训练偏差,无法使内部表示适应特定视频上下文,导致难以处理细微或模糊的异常。本文旨在通过主动干预模型内部表示来克服这些限制。

Result: 在主流VAD基准上的大量实验表明,该方法在仅需1%训练数据的免调优方法中达到了最先进的性能。

Insight: 创新点在于从被动读取转向主动引导和校正内部表示。具体包括:1)利用无梯度的表示可分性分析(RSA)识别出对VAD最具判别力的注意力头作为潜在异常专家(LAEs);2)通过分层元控制器(HMC)联合全局上下文和LAE输出来生成动态校正信号,对LAE表示流形进行有针对性的各向异性缩放,以放大异常相关维度并抑制固有偏差。

Abstract: Video anomaly detection (VAD) aims to identify abnormal events in videos. Traditional VAD methods generally suffer from the high costs of labeled data and full training, thus some recent works have explored leveraging frozen multi-modal large language models (MLLMs) in a tuning-free manner to perform VAD. However, their performance is limited as they directly inherit pre-training biases and cannot adapt internal representations to specific video contexts, leading to difficulties in handling subtle or ambiguous anomalies. To address these limitations, we propose a novel intervention framework, termed SteerVAD, which advances MLLM-based VAD by shifting from passively reading to actively steering and rectifying internal representations. Our approach first leverages the gradient-free representational separability analysis (RSA) to identify top attention heads as latent anomaly experts (LAEs) which are most discriminative for VAD. Then a hierarchical meta-controller (HMC) generates dynamic rectification signals by jointly conditioning on global context and these LAE outputs. The signals execute targeted, anisotropic scaling directly upon the LAE representation manifolds, amplifying anomaly-relevant dimensions while suppressing inherent biases. Extensive experiments on mainstream benchmarks demonstrate our method achieves state-of-the-art performance among tuning-free approaches requiring only 1% of training data, establishing it as a powerful new direction for video anomaly detection. The code will be released upon the publication.


[58] GuardAlign: Test-time Safety Alignment in Multimodal Large Language Models cs.CV | cs.MMPDF

Xingyu Zhu, Beier Zhu, Junfeng Fang, Shuo Wang, Yin Zhang

TL;DR: 本文提出了GuardAlign,一种无需训练的多模态大语言模型(MLLM)安全防御框架,旨在解决现有方法在复杂场景下检测不准确和解码过程中安全信号不稳定的问题。该框架通过结合最优传输增强的安全检测和跨模态注意力校准两种策略,有效识别恶意图像区域并确保安全信号在生成过程中持续激活。

Details

Motivation: 大型视觉语言模型(LVLM)在安全对齐方面存在挑战,现有基于输入端的防御方法(如使用CLIP检测不安全图像并添加安全前缀)在复杂场景下检测不准确,且解码时安全信号不稳定。

Result: 在六个代表性MLLM上的广泛评估表明,GuardAlign在SPA-VL基准上将不安全响应率降低了高达39%,同时在保持模型实用性的前提下,将VQAv2基准上的准确率从78.51%提升至79.21%。

Insight: 创新点在于将最优传输理论用于图像区域与不安全语义的分布距离度量以实现精准检测,以及通过跨层自适应注意力重分配来稳定安全信号的影响。这为无需训练、即插即用的测试时安全对齐提供了新思路。

Abstract: Large vision-language models (LVLMs) have achieved remarkable progress in vision-language reasoning tasks, yet ensuring their safety remains a critical challenge. Recent input-side defenses detect unsafe images with CLIP and prepend safety prefixes to prompts, but they still suffer from inaccurate detection in complex scenes and unstable safety signals during decoding. To address these issues, we propose GuardAlign, a training-free defense framework that integrates two strategies. First, OT-enhanced safety detection leverages optimal transport to measure distribution distances between image patches and unsafe semantics, enabling accurate identification of malicious regions without additional computational cost. Second, cross-modal attentive calibration strengthens the influence of safety prefixes by adaptively reallocating attention across layers, ensuring that safety signals remain consistently activated throughout generation. Extensive evaluations on six representative MLLMs demonstrate that GuardAlign reduces unsafe response rates by up to 39% on SPA-VL, while preserving utility, achieving an improvement on VQAv2 from 78.51% to 79.21%.


[59] Look Carefully: Adaptive Visual Reinforcements in Multimodal Large Language Models for Hallucination Mitigation cs.CVPDF

Xingyu Zhu, Kesen Zhao, Liang Yi, Shuo Wang, Zhicai Wang

TL;DR: 本文提出了一种名为自适应视觉增强(AIR)的训练免费框架,旨在缓解多模态大语言模型(MLLMs)中的幻觉问题。AIR通过原型标记缩减和最优传输引导的补丁增强两个组件,选择性地强化关键视觉信息,减少冗余和背景干扰,从而提升模型对显著视觉特征的依赖。

Details

Motivation: 多模态大语言模型在视觉语言推理方面取得显著进展,但仍易产生与视觉证据不符的幻觉。现有缓解方法要么需要昂贵的训练监督,要么在推理时引入额外延迟,而最近的视觉增强方法在解码时对所有视觉标记进行无差别注入,导致背景区域干扰和关键线索分散。

Result: 在多个代表性MLLMs上的广泛实验表明,AIR显著减少了幻觉,同时保持了模型的通用能力,成为构建可靠MLLMs的有效解决方案。

Insight: 创新点在于提出了一种无需训练的自适应视觉增强框架,通过原型标记缩减压缩冗余视觉标记,并利用最优传输(OT)量化隐藏状态与补丁嵌入的一致性,以选择性地将最一致的补丁集成到前馈层中,从而增强模型对显著视觉信息的依赖,有效缓解幻觉。从客观角度看,该方法避免了全标记注入的干扰,实现了更精准的视觉信息强化,为MLLMs的可靠性提升提供了新思路。

Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language reasoning, yet they remain vulnerable to hallucination, where generated content deviates from visual evidence. Existing mitigation strategies either require costly supervision during training or introduce additional latency at inference time. Recent vision enhancement methods attempt to address this issue by reinforcing visual tokens during decoding, but they typically inject all tokens indiscriminately, which causes interference from background regions and distracts the model from critical cues. To overcome this challenge, we propose Adaptive Visual Reinforcement (AIR), a training-free framework for MLLMs. AIR consists of two components. Prototype-based token reduction condenses the large pool of visual tokens into a compact subset to suppress redundancy. OT-guided patch reinforcement quantifies the alignment between hidden states and patch embeddings to selectively integrate the most consistent patches into feed-forward layers. As a result, AIR enhances the model’s reliance on salient visual information and effectively mitigates hallucination. Extensive experiments across representative MLLMs demonstrate that AIR substantially reduces hallucination while preserving general capabilities, establishing it as an effective solution for building reliable MLLMs.


[60] Spatio-Temporal Garment Reconstruction Using Diffusion Mapping via Pattern Coordinates cs.CVPDF

Yingxuan You, Ren Li, Corentin Dumery, Cong Cao, Hao Li

TL;DR: 本文提出了一种统一的框架,用于从单张图像和视频序列中进行高保真度的3D服装重建。该方法结合了隐式缝纫图案(ISP)和生成扩散模型,在2D UV空间中学习富有表现力的服装形状先验,并引入一个映射模型来建立图像像素、UV图案坐标和3D几何之间的对应关系。此外,通过引入具有测试时引导的时空扩散方案,该方法被扩展到动态重建,以强制执行长程时间一致性。

Details

Motivation: 从单目图像和视频中重建3D着装人体是一个基础性问题,在虚拟试穿、虚拟化身创建和混合现实中具有应用价值。尽管人体恢复取得了显著进展,但准确重建服装几何,特别是宽松服装,仍然是一个开放挑战。

Result: 尽管仅在合成模拟的布料数据上训练,但该方法能很好地泛化到真实世界图像,并且在紧身和宽松服装上都持续优于现有方法。重建的服装保留了精细的几何细节,同时展现出逼真的动态运动。

Insight: 创新点在于将隐式缝纭图案与扩散模型结合以学习服装形状先验,并引入映射模型建立像素、UV坐标与3D几何的对应关系。通过时空扩散方案和基于分析的投影约束,实现了从单图像到视频的动态、高保真重建,并保证了时间一致性和遮挡区域的连贯补全。

Abstract: Reconstructing 3D clothed humans from monocular images and videos is a fundamental problem with applications in virtual try-on, avatar creation, and mixed reality. Despite significant progress in human body recovery, accurately reconstructing garment geometry, particularly for loose-fitting clothing, remains an open challenge. We propose a unified framework for high-fidelity 3D garment reconstruction from both single images and video sequences. Our approach combines Implicit Sewing Patterns (ISP) with a generative diffusion model to learn expressive garment shape priors in 2D UV space. Leveraging these priors, we introduce a mapping model that establishes correspondences between image pixels, UV pattern coordinates, and 3D geometry, enabling accurate and detailed garment reconstruction from single images. We further extend this formulation to dynamic reconstruction by introducing a spatio-temporal diffusion scheme with test-time guidance to enforce long-range temporal consistency. We also develop analytic projection-based constraints that preserve image-aligned geometry in visible regions while enforcing coherent completion in occluded areas over time. Although trained exclusively on synthetically simulated cloth data, our method generalizes well to real-world imagery and consistently outperforms existing approaches on both tight- and loose-fitting garments. The reconstructed garments preserve fine geometric detail while exhibiting realistic dynamic motion, supporting downstream applications such as texture editing, garment retargeting, and animation.


[61] Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization cs.CV | cs.AIPDF

Chenwei Jia, Baoting Li, Xuchong Zhang, Mingzhuo Wei, Bochen Lin

TL;DR: 本文提出了一种名为Quant Experts (QE)的量化方法,用于大型视觉语言模型(VLMs)的后训练量化(PTQ)。该方法通过将重要通道分为token无关和token相关两组,并分别使用共享专家和路由专家(包含多个低秩适配器)来补偿全局和局部量化误差,从而适应不同模态和token间的重要通道分布差异,提升量化性能。

Details

Motivation: 现有PTQ方法主要依赖对敏感或异常通道的静态识别和全局补偿,但忽略了这些重要通道在不同输入(包括跨模态和token间)的分布差异,导致量化效果不佳。本文旨在解决这一问题,通过token感知的自适应误差补偿来改进VLMs的量化。

Result: 大量实验表明,QE在各种量化设置和模型规模(从2B到70B参数)下均能持续提升任务准确率,同时保持与全精度模型相当的性能。

Insight: 创新点在于将重要通道区分为token无关和token相关组,并分别设计共享专家和路由专家进行误差补偿,实现了token感知的自适应量化误差重建。这为处理VLMs中跨模态和token间分布差异提供了新思路,可借鉴其混合专家(MoE)结构和低秩适配器设计来提升量化方法的适应性。

Abstract: Post-Training Quantization (PTQ) has emerged as an effective technique for alleviating the substantial computational and memory overheads of Vision-Language Models (VLMs) by compressing both weights and activations without retraining the full model. Existing PTQ methods primarily rely on static identification and global compensation of sensitive or outlier channels, yet they often overlook the distributional differences of these important channels across inputs, leading to unsatisfactory quantization. In this work, we observe that the distributions and occurrence frequencies of important channels vary significantly both across modalities and among tokens, even within the same modality. Accordingly, we propose \textbf{Quant Experts (QE)}, a token-aware adaptive error compensation with mixture-of-experts for VLMs quantization. QE divides the important channels into token-independent and token-dependent groups. For the former, a shared expert is designed for most tokens to compensate for global quantization error using a low-rank adapter. For the latter, routed experts including multiple routed low-rank adapters are elaborated to compensate for local quantization error related to specific tokens. Extensive experiments demonstrate that QE consistently enhances task accuracy across various quantization settings and model scales, ranging from 2B to 70B parameters, while maintaining performance comparable to full-precision models.


[62] Multimodal Optimal Transport for Unsupervised Temporal Segmentation in Surgical Robotics cs.CV | cs.AIPDF

Omar Mohamed, Edoardo Fazzari, Ayah Al-Naji, Hamdan Alhadhrami, Khalfan Hableel

TL;DR: 本文提出了一种名为TASOT的无监督方法,用于手术视频中的阶段和步骤识别。该方法将时序动作分割建模为多模态最优传输问题,结合了视觉和文本信息,无需特定手术预训练或大规模标注数据。

Details

Motivation: 解决现有手术视频识别方法严重依赖大规模有标注视频预训练所带来的高昂计算和数据收集成本问题,探索是否可以不依赖这种繁重预训练而实现有效识别。

Result: 在多个基准手术数据集(StrasBypass70、BernBypass70、Cholec80、AutoLaparo)上评估,相比现有零样本方法取得了显著且一致的提升(分别提升23.7、4.5、16.5、19.6)。

Insight: 创新点在于将时序动作分割问题形式化为多模态最优传输问题,并引入由视频直接生成的文本信息作为补充语义线索,与视觉信息通过时间一致的非平衡Gromov-Wasserstein公式联合正则化,从而有效利用标准视觉和文本表示中的信息,避免了复杂预训练流程。

Abstract: Recognizing surgical phases and steps from video is a fundamental problem in computer-assisted interventions. Recent approaches increasingly rely on large-scale pre-training on thousands of labeled surgical videos, followed by zero-shot transfer to specific procedures. While effective, this strategy incurs substantial computational and data collection costs. In this work, we question whether such heavy pre-training is truly necessary. We propose Text-Augmented Action Segmentation Optimal Transport (TASOT), an unsupervised method for surgical phase and step recognition that extends Action Segmentation Optimal Transport (ASOT) by incorporating textual information generated directly from the videos. TASOT formulates temporal action segmentation as a multimodal optimal transport problem, where the matching cost is defined as a weighted combination of visual and text-based costs. The visual term captures frame-level appearance similarity, while the text term provides complementary semantic cues, and both are jointly regularized through a temporally consistent unbalanced Gromov-Wasserstein formulation. This design enables effective alignment between video frames and surgical actions without surgical-specific pretraining or external web-scale supervision. We evaluate TASOT on multiple benchmark surgical datasets and observe consistent and substantial improvements over existing zero-shot methods, including StrasBypass70 (+23.7), BernBypass70 (+4.5), Cholec80 (+16.5), and AutoLaparo (+19.6). These results demonstrate that fine-grained surgical understanding can be achieved by exploiting information already present in standard visual and textual representations, without resorting to increasingly complex pre-training pipelines. The code will be available at https://github.com/omar8ahmed9/TASOT.


[63] HumanOrbit: 3D Human Reconstruction as 360° Orbit Generation cs.CVPDF

Keito Suzuki, Kunyao Chen, Lei Wang, Bang Du, Runfa Blark Li

TL;DR: 本文提出HumanOrbit方法,能够从单张输入图像生成围绕人物的360度环绕视频,并进一步重建出纹理化的3D人体网格。该方法利用视频扩散模型来合成几何一致的多视角图像,并保持人物的外观和身份一致性。

Details

Motivation: 现有方法通常基于图像扩散模型进行多视角合成,但存在视角间不一致以及与原始身份不匹配的问题。视频扩散模型在生成与提示词对齐的逼真结果方面表现出色,因此本文受此启发,旨在解决单图生成多视角一致且身份保持的人体图像问题。

Result: 实验结果表明,HumanOrbit在多视角图像生成方面有效,并且重建出的3D模型在完整性和保真度上优于最先进的基线方法。

Insight: 核心创新点在于将多视角人体图像生成任务重新定义为视频生成问题,利用视频扩散模型来保证视角间的时空一致性。这为从单图进行3D重建提供了一种新颖且高质量的解决方案。从客观角度看,将视频生成技术迁移到多视角合成任务是一个巧妙的思路,有望提升3D内容创建的效率和效果。

Abstract: We present a method for generating a full 360° orbit video around a person from a single input image. Existing methods typically adapt image-based diffusion models for multi-view synthesis, but yield inconsistent results across views and with the original identity. In contrast, recent video diffusion models have demonstrated their ability in generating photorealistic results that align well with the given prompts. Inspired by these results, we propose HumanOrbit, a video diffusion model for multi-view human image generation. Our approach enables the model to synthesize continuous camera rotations around the subject, producing geometrically consistent novel views while preserving the appearance and identity of the person. Using the generated multi-view frames, we further propose a reconstruction pipeline that recovers a textured mesh of the subject. Experimental results validate the effectiveness of HumanOrbit for multi-view image generation and that the reconstructed 3D models exhibit superior completeness and fidelity compared to those from state-of-the-art baselines.


[64] RAViT: Resolution-Adaptive Vision Transformer cs.CV | cs.LGPDF

Martial Guidez, Stefan Duffner, Christophe Garcia

TL;DR: RAViT是一种基于多分支网络的新型图像分类框架,通过在多个不同分辨率的同一图像副本上进行操作,以减少计算成本同时保持整体精度。该框架还包含早期退出机制,使模型能够自适应地在运行时选择精度与计算成本之间的权衡。

Details

Motivation: 视觉变换器在计算机视觉中表现出色,但计算成本远高于卷积神经网络等替代方法,因此需要一种能降低计算成本同时保持精度的解决方案。

Result: 在CIFAR-10、Tiny ImageNet和ImageNet数据集上评估,RAViT仅需约70%的FLOPs即可达到与经典视觉变换器模型相当的精度。

Insight: 创新点包括多分支网络处理不同分辨率图像以优化计算效率,以及早期退出机制实现运行时自适应权衡,这为视觉变换器的轻量化设计提供了新思路。

Abstract: Vision transformers have recently made a breakthrough in computer vision showing excellent performance in terms of precision for numerous applications. However, their computational cost is very high compared to alternative approaches such as Convolutional Neural Networks. To address this problem, we propose a novel framework for image classification called RAViT based on a multi-branch network that operates on several copies of the same image with different resolutions to reduce the computational cost while preserving the overall accuracy. Furthermore, our framework includes an early exit mechanism that makes our model adaptive and allows to choose the appropriate trade-off between accuracy and computational cost at run-time. For example in a two-branch architecture, the original image is first resized to reduce its resolution, then a prediction is performed on it using a first transformer and the resulting prediction is reused together with the original-size image to perform a final prediction on a second transformer with less computation than a classical Vision transformer architecture. The early-exit process allows the model to make a final prediction at intermediate branches, saving even more computation. We evaluated our approach on CIFAR-10, Tiny ImageNet, and ImageNet. We obtained an equivalent accuracy to the classical Vision transformer model with only around 70% of FLOPs.


[65] GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction cs.CVPDF

Chao Xu, Xiaochen Zhao, Xiang Deng, Jingxiang Sun, Zhuo Su

TL;DR: 本文提出GeoDiff4D,一种基于几何感知扩散模型的4D头部虚拟人重建框架,从单张肖像图像联合合成图像和表面法线,结合无姿态表情编码器,构建基于3D高斯分布的虚拟人,实现高保真、可动画化的实时渲染。

Details

Motivation: 现有基于扩散模型的虚拟人重建方法主要依赖2D先验,难以保持一致的3D几何一致性,本文旨在解决从单张肖像图像重建高保真、可动画4D头部虚拟人的几何一致性问题。

Result: 在视觉质量、表情保真度和跨身份泛化能力上显著优于现有SOTA方法,并支持实时渲染。

Insight: 创新点在于通过几何感知扩散联合学习图像和表面法线先验,结合无姿态表情编码器隐式捕捉表情表示,并集成到3D高斯虚拟人中,实现了几何准确的高保真重建。

Abstract: Reconstructing photorealistic and animatable 4D head avatars from a single portrait image remains a fundamental challenge in computer vision. While diffusion models have enabled remarkable progress in image and video generation for avatar reconstruction, existing methods primarily rely on 2D priors and struggle to achieve consistent 3D geometry. We propose a novel framework that leverages geometry-aware diffusion to learn strong geometry priors for high-fidelity head avatar reconstruction. Our approach jointly synthesizes portrait images and corresponding surface normals, while a pose-free expression encoder captures implicit expression representations. Both synthesized images and expression latents are incorporated into 3D Gaussian-based avatars, enabling photorealistic rendering with accurate geometry. Extensive experiments demonstrate that our method substantially outperforms state-of-the-art approaches in visual quality, expression fidelity, and cross-identity generalization, while supporting real-time rendering.


[66] A Mixed Diet Makes DINO An Omnivorous Vision Encoder cs.CV | cs.AIPDF

Rishabh Kabra, Maks Ovsjanikov, Drew A. Hudson, Ye Xia, Skanda Koppula

TL;DR: 本文提出了一种名为Omnivorous Vision Encoder的新框架,旨在解决DINOv2等预训练视觉编码器在不同模态(如RGB图像、深度图、分割图)特征表示不一致的问题。该框架通过最大化同一场景不同模态间的特征对齐,并结合对冻结教师模型(如DINOv2)的知识蒸馏,学习到一个模态无关的特征空间,使编码器能够为同一场景的不同输入模态生成一致且强大的嵌入。

Details

Motivation: 现有预训练视觉编码器(如DINOv2)在单模态任务上表现出色,但其特征表示在不同模态间对齐性差,例如同一场景的RGB图像与其深度图的特征余弦相似度与随机无关图像几乎相同,这限制了跨模态理解能力。

Result: 论文提出的Omnivorous Vision Encoder框架通过双目标训练,实现了跨模态特征对齐,同时保留了基础模型的判别性语义,从而获得了强大的跨模态理解能力。摘要中未提及具体的基准测试或定量结果,但宣称该方法能生成一致且强大的跨模态嵌入。

Insight: 创新点在于将跨模态特征对齐目标与对冻结单模态教师模型的知识蒸馏目标相结合,从而在不牺牲单模态性能的前提下,使编码器变得“杂食性”,能够统一处理多种输入模态。这为构建模态无关的通用视觉表示提供了一种有效途径。

Abstract: Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representations are poorly aligned across different modalities. For instance, the feature embedding for an RGB image and its corresponding depth map of the same scene exhibit a cosine similarity that is nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a novel framework that learns a modality-agnostic feature space. We train the encoder with a dual objective: first, to maximize the feature alignment between different modalities of the same scene; and second, a distillation objective that anchors the learned representations to the output of a fully frozen teacher such as DINOv2. The resulting student encoder becomes “omnivorous” by producing a consistent, powerful embedding for a given scene, regardless of the input modality (RGB, Depth, Segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model.


[67] A multimodal slice discovery framework for systematic failure detection and explanation in medical image classification cs.CV | cs.LGPDF

Yixuan Liu, Kanwal K. Bhatia, Ahmed E. Fetit

TL;DR: 本文提出了一种用于医学图像分类系统故障检测与解释的多模态切片发现框架,旨在通过整合多模态信息来提升现有单模态或基于元数据审计方法的局限,从而更全面地发现和解释隐藏的系统性故障。

Details

Motivation: 现有基于机器学习的医学图像分类器在安全性和可靠性方面存在隐患,而传统的审计方法主要依赖单模态特征或基于元数据的子组分析,这些方法解释性有限且难以捕捉隐藏的系统性故障,因此需要一种更有效的自动化审计框架。

Result: 在MIMIC-CXR-JPG数据集上的综合实验表明,该框架在常见故障场景下具有较强的故障发现和解释生成能力,多模态信息通常能实现更全面有效的分类器审计,而仅使用图像之外的单模态变体在资源受限场景中也显示出强大潜力。

Insight: 创新点在于首次将切片发现方法扩展到多模态表示以专门用于医学应用,提供了自动化的系统性故障审计框架;从客观角度看,该研究强调了多模态整合在提升医学AI系统可靠性和可解释性方面的重要性,并为资源受限环境下的单模态审计提供了实用见解。

Abstract: Despite advances in machine learning-based medical image classifiers, the safety and reliability of these systems remain major concerns in practical settings. Existing auditing approaches mainly rely on unimodal features or metadata-based subgroup analyses, which are limited in interpretability and often fail to capture hidden systematic failures. To address these limitations, we introduce the first automated auditing framework that extends slice discovery methods to multimodal representations specifically for medical applications. Comprehensive experiments were conducted under common failure scenarios using the MIMIC-CXR-JPG dataset, demonstrating the framework’s strong capability in both failure discovery and explanation generation. Our results also show that multimodal information generally allows more comprehensive and effective auditing of classifiers, while unimodal variants beyond image-only inputs exhibit strong potential in scenarios where resources are constrained.


[68] SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching cs.CV | cs.LGPDF

Yasaman Haghighi, Alexandre Alahi

TL;DR: 本文提出了一种名为SenCache的加速扩散模型推理的方法,通过分析模型输出对去噪输入(噪声潜在表示和时间步)扰动的敏感性,来指导缓存策略,从而在减少计算量的同时保持生成质量。

Details

Motivation: 扩散模型在视频生成中达到SOTA质量,但推理成本高,因为需要大量顺序去噪步骤。现有基于缓存的加速方法依赖启发式准则选择缓存/重用时间步,需要大量调优,缺乏理论依据。

Result: 在Wan 2.1、CogVideoX和LTX-Video等基准上的实验表明,在相似计算预算下,SenCache比现有缓存方法实现了更好的视觉质量。

Insight: 创新点在于提出了一个基于敏感性的理论框架来形式化缓存误差,并据此设计了动态的、样本自适应的缓存策略(SenCache),为自适应缓存提供了理论解释,并扩展了先前经验性启发式方法。

Abstract: Diffusion models achieve state-of-the-art video generation quality, but their inference remains expensive due to the large number of sequential denoising steps. This has motivated a growing line of research on accelerating diffusion inference. Among training-free acceleration methods, caching reduces computation by reusing previously computed model outputs across timesteps. Existing caching methods rely on heuristic criteria to choose cache/reuse timesteps and require extensive tuning. We address this limitation with a principled sensitivity-aware caching framework. Specifically, we formalize the caching error through an analysis of the model output sensitivity to perturbations in the denoising inputs, i.e., the noisy latent and the timestep, and show that this sensitivity is a key predictor of caching error. Based on this analysis, we propose Sensitivity-Aware Caching (SenCache), a dynamic caching policy that adaptively selects caching timesteps on a per-sample basis. Our framework provides a theoretical basis for adaptive caching, explains why prior empirical heuristics can be partially effective, and extends them to a dynamic, sample-specific approach. Experiments on Wan 2.1, CogVideoX, and LTX-Video show that SenCache achieves better visual quality than existing caching methods under similar computational budgets.


[69] MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy cs.CV | cs.LGPDF

Albert Dominguez Mantes, Gioele La Manno, Martin Weigert

TL;DR: MuViT是一种专为显微镜图像设计的多分辨率视觉Transformer架构,通过将不同分辨率的图像块嵌入到共享的世界坐标系中,并扩展旋转位置编码,实现了跨尺度信息的有效融合。

Details

Motivation: 现代显微镜图像通常包含从细胞形态到组织结构的多个空间尺度信息,现有视觉模型多基于单一分辨率或从单一视图提取多尺度特征,难以充分利用显微镜数据固有的多分辨率特性。

Result: 在合成基准测试、肾脏组织病理学和高分辨率小鼠脑显微镜图像上,MuViT相比强大的ViT和CNN基线模型取得了持续改进;多分辨率MAE预训练进一步产生了尺度一致的表征,提升了下游任务性能。

Insight: 通过显式世界坐标建模为大规模显微镜分析提供了一种简单而强大的多分辨率信息利用机制,其核心创新在于将多分辨率观测统一到共享坐标系并扩展位置编码,使注意力机制能同时整合广域上下文和高分辨率细节。

Abstract: Modern microscopy routinely produces gigapixel images that contain structures across multiple spatial scales, from fine cellular morphology to broader tissue organization. Many analysis tasks require combining these scales, yet most vision models operate at a single resolution or derive multi-scale features from one view, limiting their ability to exploit the inherently multi-resolution nature of microscopy data. We introduce MuViT, a transformer architecture built to fuse true multi-resolution observations from the same underlying image. MuViT embeds all patches into a shared world-coordinate system and extends rotary positional embeddings to these coordinates, enabling attention to integrate wide-field context with high-resolution detail within a single encoder. Across synthetic benchmarks, kidney histopathology, and high-resolution mouse-brain microscopy, MuViT delivers consistent improvements over strong ViT and CNN baselines. Multi-resolution MAE pretraining further produces scale-consistent representations that enhance downstream tasks. These results demonstrate that explicit world-coordinate modelling provides a simple yet powerful mechanism for leveraging multi-resolution information in large-scale microscopy analysis.


[70] Enhancing Spatial Understanding in Image Generation via Reward Modeling cs.CVPDF

Zhenyu Tang, Chaoran Feng, Yufan Deng, Jie Wu, Xiaojie Li

TL;DR: 本文提出了一种增强图像生成模型空间理解能力的新方法,通过构建包含超过8万对偏好数据的SpatialReward-Dataset,并基于此训练了名为SpatialScore的奖励模型,该模型能够评估文生图生成中空间关系的准确性,其性能甚至超越了领先的专有模型,并进一步证明了该奖励模型能有效支持复杂空间生成的在线强化学习。

Details

Motivation: 解决当前文生图模型在编码复杂空间关系时对提示词复杂度要求高、往往需要多次采样才能获得满意结果的问题。

Result: 在多个基准测试上的广泛实验表明,该专用奖励模型在图像生成的空间理解方面带来了显著且一致的性能提升,其空间评估性能超越了领先的专有模型。

Insight: 通过构建大规模、高质量的偏好数据集来训练专门的奖励模型,以量化评估并优化生成模型在特定能力(如空间关系理解)上的表现,并将奖励模型与在线强化学习结合,为提升生成模型的特定子能力提供了一种有效途径。

Abstract: Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt complexity-particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset with over 80k preference pairs. Building on this dataset, we build SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation, achieving performance that even surpasses leading proprietary models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for the complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation.


[71] Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models cs.CV | cs.LGPDF

Arnas Uselis, Andrea Dittadi, Seong Joon Oh

TL;DR: 该论文从理论上形式化了组合泛化的三个几何约束条件(可分性、可迁移性、稳定性),并证明这些条件要求视觉嵌入模型的表示必须具有线性分解和跨概念正交的结构。研究进一步推导了维度界限,并在CLIP、SigLIP、DINO等现代视觉模型上进行了实证验证,发现表示的部分线性分解程度与模型在未见组合上的组合泛化能力相关。

Details

Motivation: 解决现代模型在庞大训练数据下仍难以泛化到未见输入组合的问题,探究支持组合泛化的表示结构应具备何种几何特性。

Result: 在CLIP、SigLIP、DINO等视觉模型上的实证评估表明,其表示确实呈现出部分线性分解和低秩、近正交的每概念因子结构,且这种结构的程度与模型在未见组合上的组合泛化性能正相关。

Insight: 为广泛观察到的神经表示线性结构(线性表示假说)提供了理论依据,指出组合泛化能力必然要求表示具有线性正交分解的几何特性;研究将可组合概念的数量与嵌入几何的维度界限联系起来,为理解模型缩放时的表示收敛趋势提供了预测框架。

Abstract: Compositional generalization, the ability to recognize familiar parts in novel contexts, is a defining property of intelligent systems. Although modern models are trained on massive datasets, they still cover only a tiny fraction of the combinatorial space of possible inputs, raising the question of what structure representations must have to support generalization to unseen combinations. We formalize three desiderata for compositional generalization under standard training (divisibility, transferability, stability) and show they impose necessary geometric constraints: representations must decompose linearly into per-concept components, and these components must be orthogonal across concepts. This provides theoretical grounding for the Linear Representation Hypothesis: the linear structure widely observed in neural representations is a necessary consequence of compositional generalization. We further derive dimension bounds linking the number of composable concepts to the embedding geometry. Empirically, we evaluate these predictions across modern vision models (CLIP, SigLIP, DINO) and find that representations exhibit partial linear factorization with low-rank, near-orthogonal per-concept factors, and that the degree of this structure correlates with compositional generalization on unseen combinations. As models continue to scale, these conditions predict the representational geometry they may converge to. Code is available at https://github.com/oshapio/necessary-compositionality.


[72] Hierarchical Action Learning for Weakly-Supervised Action Segmentation cs.CVPDF

Junxian Huang, Ruichu Cai, Hao Zhu, Juntao Fang, Boyan Xu

TL;DR: 本文提出了一种用于弱监督动作分割的层次化动作学习(HAL)模型。该模型基于人类通过关键转换感知多层次抽象动作的观察,通过引入层次化因果数据生成过程,将高层潜在动作变量建模为控制底层视觉特征动态的因素,并利用确定性过程对齐不同时间尺度的潜在变量,结合层次化金字塔Transformer和稀疏转换约束来增强高层动作变量的识别。

Details

Motivation: 解决机器视觉特征在视频理解中倾向于过分割的问题,通过模仿人类对动作的层次化感知,利用高层潜在动作变量变化较慢、更易识别的特性,提升弱监督动作分割的性能。

Result: 在多个基准测试中,HAL模型显著优于现有的弱监督动作分割方法,证实了其在实际应用中的有效性。

Insight: 创新点在于将层次化因果生成过程引入动作分割,通过建模不同时间尺度的潜在变量(高层动作变量变化慢、底层视觉变量变化快)并证明其严格可识别性,结合稀疏转换约束和层次化金字塔Transformer,有效提升了模型对动作结构的理解能力。

Abstract: Humans perceive actions through key transitions that structure actions across multiple abstraction levels, whereas machines, relying on visual features, tend to over-segment. This highlights the difficulty of enabling hierarchical reasoning in video understanding. Interestingly, we observe that lower-level visual and high-level action latent variables evolve at different rates, with low-level visual variables changing rapidly, while high-level action variables evolve more slowly, making them easier to identify. Building on this insight, we propose the Hierarchical Action Learning (\textbf{HAL}) model for weakly-supervised action segmentation. Our approach introduces a hierarchical causal data generation process, where high-level latent action governs the dynamics of low-level visual features. To model these varying timescales effectively, we introduce deterministic processes to align these latent variables over time. The \textbf{HAL} model employs a hierarchical pyramid transformer to capture both visual features and latent variables, and a sparse transition constraint is applied to enforce the slower dynamics of high-level action variables. This mechanism enhances the identification of these latent variables over time. Under mild assumptions, we prove that these latent action variables are strictly identifiable. Experimental results on several benchmarks show that the \textbf{HAL} model significantly outperforms existing methods for weakly-supervised action segmentation, confirming its practical effectiveness in real-world applications.


[73] Mode Seeking meets Mean Seeking for Fast Long Video Generation cs.CV | cs.LGPDF

Shengqu Cai, Weili Nie, Chao Liu, Julius Berner, Lvmin Zhang

TL;DR: 本文提出了一种名为’Mode Seeking meets Mean Seeking’的训练范式,通过解耦扩散变换器统一表示,将局部保真度与长期连贯性分离。该方法利用在长视频上通过监督学习训练的全局流匹配头来捕捉叙事结构,同时采用局部分布匹配头,通过模式寻求的反向KL散度将滑动窗口与冻结的短视频教师模型对齐。这使得模型能够从有限的长视频中学习长期连贯性和运动,同时通过将学生的每个滑动窗口段与短视频教师对齐来继承局部真实性,从而实现了快速生成分钟级视频。

Details

Motivation: 解决视频生成从秒级扩展到分钟级时面临的关键瓶颈:短视频数据丰富且保真度高,而连贯的长视频数据稀缺且局限于狭窄领域。

Result: 评估表明,该方法通过联合改善局部清晰度、运动质量和长期一致性,有效缩小了保真度与生成时长之间的差距。

Insight: 核心创新点在于将’模式寻求’与’均值寻求’相结合的训练范式,通过解耦的全局流匹配与局部分布匹配,利用有限的连贯长视频数据和丰富的短视频数据,分别学习长期结构和局部细节,从而在保持生成速度的同时实现高质量的长视频生成。

Abstract: Scaling video generation from seconds to minutes faces a critical bottleneck: while short-video data is abundant and high-fidelity, coherent long-form data is scarce and limited to narrow domains. To address this, we propose a training paradigm where Mode Seeking meets Mean Seeking, decoupling local fidelity from long-term coherence based on a unified representation via a Decoupled Diffusion Transformer. Our approach utilizes a global Flow Matching head trained via supervised learning on long videos to capture narrative structure, while simultaneously employing a local Distribution Matching head that aligns sliding windows to a frozen short-video teacher via a mode-seeking reverse-KL divergence. This strategy enables the synthesis of minute-scale videos that learns long-range coherence and motions from limited long videos via supervised flow matching, while inheriting local realism by aligning every sliding-window segment of the student to a frozen short-video teacher, resulting in a few-step fast long video generator. Evaluations show that our method effectively closes the fidelity-horizon gap by jointly improving local sharpness, motion and long-range consistency. Project website: https://primecai.github.io/mmm/.


math.AC [Back]

[74] Multiprojective Geometry of Compatible Triples of Fundamental and Essential Matrices math.AC | cs.CV | math.AGPDF

Timothy Duff, Viktor Korotynskiy, Anton Leykin, Tomas Pajdla

TL;DR: 本文通过计算多重度和多重齐次消失理想,刻画了兼容基本矩阵三元组的簇结构,解决了Bråtelund和Rydell提出的问题。研究改进了计算机视觉几何中先前不完整的代数约束集,并发现了一组简单的四次约束,这些约束在本质矩阵三元组的兼容性分析中也具有局部刻画作用。

Details

Motivation: 解决几何计算机视觉中兼容基本矩阵三元组的完整代数刻画问题,改进现有不完整且带有缩放限制的约束集。

Result: 计算了兼容基本矩阵三元组簇的多重度和多重齐次消失理想,证明了新发现的四次约束与已知约束局部刻画了兼容本质矩阵三元组簇。

Insight: 首次完整刻画了兼容基本矩阵三元组的代数簇,发现了一组简单的四次约束,该约束可推广至本质矩阵三元组的局部兼容性分析,为广义兼容簇的多重齐次消失理想研究提供了新思路。

Abstract: We characterize the variety of compatible fundamental matrix triples by computing its multidegree and multihomogeneous vanishing ideal. This answers the first interesting case of a question recently posed by Bråtelund and Rydell. Our result improves upon previously discovered sets of algebraic constraints in the geometric computer vision literature, which are all incomplete (as they do \emph{not} generate the vanishing ideal) and sometimes make restrictive assumptions about how a matrix triple should be scaled. Our discussion touches more broadly on generalized compatibility varieties, whose multihomogeneous vanishing ideals are much less well understood. One of our key new discoveries is a simple set of quartic constraints vanishing on compatible fundamental matrix triples. These quartics are also significant in the setting of essential matrices: together with some previously known constraints, we show that they locally cut out the variety of compatible essential matrix triples.


cs.IR [Back]

[75] Reason to Contrast: A Cascaded Multimodal Retrieval Framework cs.IR | cs.AI | cs.CLPDF

Xuanming Cui, Hong-You Chen, Hao Yu, Hao Yuan, Zihao Wang

TL;DR: 本文提出了TTE-v2,一个级联多模态检索框架,通过引入基于额外输入token预算的推理驱动性能扩展,而非依赖模型或嵌入维度。该方法在初始检索后增加推理步骤进行重排序,并在测试时实现更丰富的查询-候选交互。重排序阶段还为困难负样本挖掘和假负样本过滤提供细粒度监督,形成反馈循环以增强上游检索器。

Details

Motivation: 传统多模态检索系统主要依赖双编码器架构,性能与嵌入维度紧密相关。近期工作Think-Then-Embed (TTE)表明,在嵌入前融入多模态推理以引出额外信息token可以进一步提升检索性能。本文旨在扩展此范式,探索基于token而非模型规模的性能扩展方法。

Result: 在MMEB-V2基准测试中,TTE-v2-7B达到了75.7%的最新最先进(SOTA)准确率,而TTE-v2-2B匹配或超越了使用显著更多外部数据训练的领先7B模型。

Insight: 创新点在于提出了基于推理token的级联性能扩展范式,通过重排序实现测试时增强和细粒度监督反馈,有效提升了检索性能。这为多模态检索提供了一种替代模型规模扩展的新思路,即通过增加推理token预算来实现性能提升。

Abstract: Traditional multimodal retrieval systems rely primarily on bi-encoder architectures, where performance is closely tied to embedding dimensionality. Recent work, Think-Then-Embed (TTE), shows that incorporating multimodal reasoning to elicit additional informative tokens before embedding can further improve retrieval. In this paper, we extend this paradigm with TTE-v2, a hybrid multimodal retrieval framework that introduces reasoning-driven performance scaling based on additional input token budget rather than model or embedding size. Our approach augments the initial multimodal retrieval with additional reasoning steps for reranking, enabling more expressive query-candidate interactions at test time. The reranking stage further provides fine-grained supervision for hard negative mining and false negative filtering, creating a feedback loop that effectively strengthens the upstream retriever. This cascaded design delivers substantial test-time improvements based on intermediate reasoning token scaling. Experiments on the MMEB-V2 benchmark demonstrate that TTE-v2-7B achieves a new state-of-the-art accuracy of 75.7%, and that TTE-v2-2B matches or surpasses leading 7B models trained with significantly larger external data. Our results highlight the promise of token-wise scaling as an alternative scaling paradigm for multimodal retrieval.


Rakshita Goel, S Pranav Kumar, Anmol Agrawal, Divyan Poddar, Pratik Narang

TL;DR: 本文提出了一种面向印度法律研究的领域分区混合检索增强生成(RAG)与知识图谱架构,旨在解决传统检索方法在处理印度长篇幅、异构法律文档时在结构化推理、多跳推理和跨领域依赖方面的不足。该系统整合了三个针对最高法院判例法、法规宪法文本和印度刑法典的专用RAG管道,并构建了一个基于Neo4j的法律知识图谱来捕获结构化关系,通过LLM驱动的智能编排器动态路由查询并融合证据,生成有依据且包含引用的回答。

Details

Motivation: 解决印度法律研究中传统基于关键词或纯嵌入的检索系统难以支持结构化法律推理、多跳推理和跨领域依赖的问题,为印度司法背景提供可扩展且可解释的法律AI基础。

Result: 在一个包含40个问题的合成法律问答基准(源自印度权威法律资料,并通过LLM作为法官框架评估)上,该混合架构达到了70%的通过率,显著优于仅使用RAG的基线(37.5%),在完整性和法律推理质量上有明显提升。

Insight: 创新点在于将领域分区的检索与结构化的关系知识(知识图谱)相结合,并通过LLM驱动的智能编排器进行动态查询路由和证据融合,这为法律AI系统提供了模块化、可解释且可扩展的解决方案,特别是在处理复杂、多源法律文档时。

Abstract: Legal research in India involves navigating long and heterogeneous documents spanning statutes, constitutional provisions, penal codes, and judicial precedents, where purely keyword-based or embedding-only retrieval systems often fail to support structured legal reasoning. Recent retrieval augmented generation (RAG) approaches improve grounding but struggle with multi-hop reasoning, citation chaining, and cross-domain dependencies inherent to legal texts. We propose a domain partitioned hybrid RAG and Knowledge Graph architecture designed specifically for Indian legal research. The system integrates three specialized RAG pipelines covering Supreme Court case law, statutory and constitutional texts, and the Indian Penal Code, each optimized for domain specific retrieval. To enable relational reasoning beyond semantic similarity, we construct a Neo4j based Legal Knowledge Graph capturing structured relationships among cases, statutes, IPC sections, judges, and citations. An LLM driven agentic orchestrator dynamically routes queries across retrieval modules and the knowledge graph, fusing evidence into grounded and citation aware responses. We evaluate the system using a 40 question synthetic legal question answer benchmark curated from authoritative Indian legal sources and assessed via an LLM as a Judge framework. Results show that the hybrid architecture achieves a 70 percent pass rate, substantially outperforming a RAG only baseline at 37.5 percent, with marked improvements in completeness and legal reasoning quality. These findings demonstrate that combining domain partitioned retrieval with structured relational knowledge provides a scalable and interpretable foundation for advanced legal AI systems in the Indian judicial context.


cs.SD [Back]

[77] Leveraging large multimodal models for audio-video deepfake detection: a pilot study cs.SD | cs.CVPDF

Songjun Cao, Yuqi Li, Yunpeng Luo, Jianjun Yin, Long Ma

TL;DR: 本文提出了一种名为AV-LMMDetect的监督微调大型多模态模型,用于音频-视频深度伪造检测。该方法将检测任务构建为基于提示的二元分类问题,并基于Qwen 2.5 Omni模型,通过两阶段训练(轻量级LoRA对齐和音视频编码器全参数微调)联合分析音频和视觉流。在FakeAVCeleb和Mavos-DD数据集上,其性能达到或超越了现有方法,并在Mavos-DD上创造了新的最先进水平。

Details

Motivation: 当前大多数多模态检测器是小型、任务特定的模型,虽然在精心设计的测试上表现良好,但扩展性差且跨领域泛化能力弱。为了解决这一问题,作者探索利用大型多模态模型进行音频-视频深度伪造检测。

Result: 在FakeAVCeleb和Mavos-DD基准测试上,AV-LMMDetect的性能与先前方法相当或更优,并在Mavos-DD数据集上达到了新的最先进水平。

Insight: 论文的主要创新点在于将音频-视频深度伪造检测任务重新构建为基于提示的二元分类问题,并利用大型多模态模型(LMM)的能力进行联合分析。从客观角度看,其两阶段微调策略(先LoRA对齐,后全参数微调)是一种高效利用预训练LMM进行特定任务适配的有效方法,为多模态伪造检测领域提供了一种可扩展且泛化能力更强的解决方案。

Abstract: Audio-visual deepfake detection (AVD) is increasingly important as modern generators can fabricate convincing speech and video. Most current multimodal detectors are small, task-specific models: they work well on curated tests but scale poorly and generalize weakly across domains. We introduce AV-LMMDetect, a supervised fine-tuned (SFT) large multimodal model that casts AVD as a prompted yes/no classification - “Is this video real or fake?”. Built on Qwen 2.5 Omni, it jointly analyzes audio and visual streams for deepfake detection and is trained in two stages: lightweight LoRA alignment followed by audio-visual encoder full fine-tuning. On FakeAVCeleb and Mavos-DD, AV-LMMDetect matches or surpasses prior methods and sets a new state of the art on Mavos-DD datasets.


eess.IV [Back]

[78] SALIENT: Frequency-Aware Paired Diffusion for Controllable Long-Tail CT Detection eess.IV | cs.AI | cs.CV | cs.LGPDF

Yifan Li, Mehrdad Salimitari, Taiyu Zhang, Guang Li, David Dreizin

TL;DR: 本文提出SALIENT,一种基于小波域的掩码条件扩散框架,用于生成长尾CT检测中可控的配对病灶-掩码体数据以增强训练。该方法通过在小波系数上进行结构化扩散,分离低频亮度与高频结构细节,并利用可学习的频率感知目标解耦病灶与背景属性,从而提升生成真实性和检测性能。

Details

Motivation: 解决全身CT中罕见病灶检测面临的极端类别不平衡和低目标-体积比问题,现有扩散模型方法计算成本高且缺乏可控的属性和配对监督,导致精度崩溃。

Result: 在生成质量上,SALIENT将MS-SSIM从0.63提升至0.83,FID从118.4降低至46.5;在下游检测任务中,显著提高了长尾检测的AUPRC,尤其在低患病率和低目标-体积比情况下获得不成比例的增益,并发现最优合成比例随标注种子大小减少从2倍增至4倍。

Insight: 创新点包括在小波域进行结构化扩散以分离频率成分,引入频率感知目标实现属性解耦和可解释优化,以及结合3D VAE生成多样掩码和半监督教师生成配对伪标签,为低标注条件下的可控数据增强提供了高效框架。

Abstract: Detection of rare lesions in whole-body CT is fundamentally limited by extreme class imbalance and low target-to-volume ratios, producing precision collapse despite high AUROC. Synthetic augmentation with diffusion models offers promise, yet pixel-space diffusion is computationally expensive, and existing mask-conditioned approaches lack controllable attribute-level regulation and paired supervision for accountable training. We introduce SALIENT, a mask-conditioned wavelet-domain diffusion framework that synthesizes paired lesion-masking volumes for controllable CT augmentation under long-tail regimes. Instead of denoising in pixel space, SALIENT performs structured diffusion over discrete wavelet coefficients, explicitly separating low-frequency brightness from high-frequency structural detail. Learnable frequency-aware objectives disentangle target and background attributes (structure, contrast, edge fidelity), enabling interpretable and stable optimization. A 3D VAE generates diverse volumetric lesion masks, and a semi-supervised teacher produces paired slice-level pseudo-labels for downstream mask-guided detection. SALIENT improves generative realism, as reflected by higher MS-SSIM (0.63 to 0.83) and lower FID (118.4 to 46.5). In a separate downstream evaluation, SALIENT-augmented training improves long-tail detection performance, yielding disproportionate AUPRC gains across low prevalences and target-to-volume ratios. Optimal synthetic ratios shift from 2x to 4x as labeled seed size decreases, indicating a seed-dependent augmentation regime under low-label conditions. SALIENT demonstrates that frequency-aware diffusion enables controllable, computationally efficient precision rescue in long-tail CT detection.


[79] SGDC: Structurally-Guided Dynamic Convolution for Medical Image Segmentation eess.IV | cs.CVPDF

Bo Shi, Wei-ping Zhu, M. N. S. Swamy

TL;DR: 本文提出了一种结构引导的动态卷积(SGDC)机制,用于医学图像分割,通过显式监督的结构提取分支来生成动态核和门控信号,以实现结构感知的特征调制,从而提升边界保真度。

Details

Motivation: 解决现有动态卷积方法中平均池化导致高频空间细节丢失、预测过度平滑、细粒度临床结构保真度下降的问题。

Result: 在ISIC 2016、PH2、ISIC 2018和CoNIC数据集上达到SOTA性能,Hausdorff距离(HD95)降低2.05,IoU相比基于池化的基线模型提升0.99%-1.49%。

Insight: 创新点在于用像素级结构引导替代上下文聚合,通过辅助分支提供高保真边界信息来指导动态核生成,有效防止平均池化造成的信息损失,并可扩展至其他细粒度结构敏感视觉任务。

Abstract: Spatially variant dynamic convolution provides a principled approach of integrating spatial adaptivity into deep neural networks. However, mainstream designs in medical segmentation commonly generate dynamic kernels through average pooling, which implicitly collapses high-frequency spatial details into a coarse, spatially-compressed representation, leading to over-smoothed predictions that degrade the fidelity of fine-grained clinical structures. To address this limitation, we propose a novel Structure-Guided Dynamic Convolution (SGDC) mechanism, which leverages an explicitly supervised structure-extraction branch to guide the generation of dynamic kernels and gating signals for structure-aware feature modulation. Specifically, the high-fidelity boundary information from this auxiliary branch is fused with semantic features to enable spatially-precise feature modulation. By replacing context aggregation with pixel-wise structural guidance, the proposed design effectively prevents the information loss introduced by average pooling. Experimental results show that SGDC achieves state-of-the-art performance on ISIC 2016, PH2, ISIC 2018, and CoNIC datasets, delivering superior boundary fidelity by reducing the Hausdorff Distance (HD95) by 2.05, and providing consistent IoU gains of 0.99%-1.49% over pooling-based baselines. Moreover, the mechanism exhibits strong potential for extension to other fine-grained, structure-sensitive vision tasks, such as small-object detection, offering a principled solution for preserving structural integrity in medical image analysis. To facilitate reproducibility and encourage further research, the implementation code for both our SGE and SGDC modules has been is publicly released at https://github.com/solstice0621/SGDC.


[80] VideoPulse: Neonatal heart rate and peripheral capillary oxygen saturation (SpO2) estimation from contact free video eess.IV | cs.CVPDF

Deependra Dewagiri, Kamesh Anuradha, Pabadhi Liyanage, Helitha Kulatunga, Pamuditha Somarathne

TL;DR: 本文提出了VideoPulse,一个用于新生儿心率与血氧饱和度(SpO2)无接触式视频估计的数据集和端到端流程。该流程通过面部视频,利用3D CNN进行回归,在短至2秒的窗口内实现准确的生命体征监测。

Details

Motivation: 解决新生儿生命体征监测中传统接触式方法(如粘性探头)易刺激脆弱皮肤、增加感染控制负担的问题,旨在通过远程光电容积描记术(rPPG)实现无接触、低成本的非侵入性监测。

Result: 在NBHR新生儿数据集上,心率估计的平均绝对误差(MAE)为2.97 bpm(2秒窗口),SpO2估计的MAE为1.69%。跨数据集评估显示,在VideoPulse数据集上心率模型MAE为5.34 bpm,微调后的SpO2模型MAE为1.68%,表明该方法在短时间未对齐视频片段上也能达到高精度。

Insight: 创新点包括构建了首个专门针对新生儿的多姿态面部视频数据集(VideoPulse),以及采用了基于去噪脉搏血氧信号的伪影感知监督、标签分布平滑和加权回归的端到端3D CNN回归流程。客观来看,其将rPPG技术专门适配于具有挑战性的新生儿场景,并通过数据驱动方法有效处理了运动伪影和短时窗口估计问题。

Abstract: Remote photoplethysmography (rPPG) enables contact free monitoring of vital signs and is especially valuable for neonates, since conventional methods often require sustained skin contact with adhesive probes that can irritate fragile skin and increase infection control burden. We present VideoPulse, a neonatal dataset and an end to end pipeline that estimates neonatal heart rate and peripheral capillary oxygen saturation (SpO2) from facial video. VideoPulse contains 157 recordings totaling 2.6 hours from 52 neonates with diverse face orientations. Our pipeline performs face alignment and artifact aware supervision using denoised pulse oximeter signals, then applies 3D CNN backbones for heart rate and SpO2 regression with label distribution smoothing and weighted regression for SpO2. Predictions are produced in 2 second windows. On the NBHR neonatal dataset, we obtain heart rate MAE 2.97 bpm using 2 second windows (2.80 bpm at 6 second windows) and SpO2 MAE 1.69 percent. Under cross dataset evaluation, the NBHR trained heart rate model attains 5.34 bpm MAE on VideoPulse, and fine tuning an NBHR pretrained SpO2 model on VideoPulse yields MAE 1.68 percent. These results indicate that short unaligned neonatal video segments can support accurate heart rate and SpO2 estimation, enabling low cost non invasive monitoring in neonatal intensive care.


[81] Breaking the Data Barrier: Robust Few-Shot 3D Vessel Segmentation using Foundation Models eess.IV | cs.CVPDF

Kirato Yoshihara, Yohei Sugawara, Yuta Tokuoka, Lihang Hong

TL;DR: 本文提出了一种利用预训练视觉基础模型(DINOv3)进行三维血管分割的新框架,通过引入轻量级3D适配器、多尺度3D聚合器和Z通道嵌入,有效弥合了2D预训练与3D医学模态之间的差距,从而在极少量标注数据下实现了对连续血管结构的鲁棒分割。

Details

Motivation: 现有血管分割方法通常需要大规模标注数据,且在域偏移下性能严重下降;临床实践中难以针对每个新扫描仪或协议获取大量标注,因此需要一种能在数据稀缺或域偏移下鲁棒工作的少样本分割方法。

Result: 在TopCoW(域内)和Lausanne(域外)数据集上验证,在仅使用5个训练样本的极端少样本情况下,Dice分数达到43.42%,相比最先进的nnU-Net(33.41%)相对提升30%,并优于SwinUNETR和UNETR等Transformer基线达45%;在域外设置下,模型Dice分数为21.37%,相比nnU-Net(14.22%)相对提升50%,显示出更强的鲁棒性。

Insight: 创新点在于将2D预训练基础模型有效适配到3D医学分割任务,通过轻量级3D适配器保持体积一致性、多尺度聚合器实现分层特征融合、以及Z通道嵌入来捕捉血管连续性;这为数据稀缺或域偏移场景提供了一种可行的冷启动解决方案,提升了临床可靠性。

Abstract: State-of-the-art vessel segmentation methods typically require large-scale annotated datasets and suffer from severe performance degradation under domain shifts. In clinical practice, however, acquiring extensive annotations for every new scanner or protocol is unfeasible. To address this, we propose a novel framework leveraging a pre-trained Vision Foundation Model (DINOv3) adapted for volumetric vessel segmentation. We introduce a lightweight 3D Adapter for volumetric consistency, a multi-scale 3D Aggregator for hierarchical feature fusion, and Z-channel embedding to effectively bridge the gap between 2D pre-training and 3D medical modalities, enabling the model to capture continuous vascular structures from limited data. We validated our method on the TopCoW (in-domain) and Lausanne (out-of-distribution) datasets. In the extreme few-shot regime with 5 training samples, our method achieved a Dice score of 43.42%, marking a 30% relative improvement over the state-of-the-art nnU-Net (33.41%) and outperforming other Transformer-based baselines, such as SwinUNETR and UNETR, by up to 45%. Furthermore, in the out-of-distribution setting, our model demonstrated superior robustness, achieving a 50% relative improvement over nnU-Net (21.37% vs. 14.22%), which suffered from severe domain overfitting. Ablation studies confirmed that our 3D adaptation mechanism and multi-scale aggregation strategy are critical for vascular continuity and robustness. Our results suggest foundation models offer a viable cold-start solution, improving clinical reliability under data scarcity or domain shifts.


[82] FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy eess.IV | cs.CVPDF

Hyejin Park, Jiwon Yoon, Sumin Park, Suree Kim, Sinae Jang

TL;DR: 本文提出了荧光显微镜中染色感知的聚焦质量评估任务,指出现有方法忽略了不同荧光染料导致的聚焦行为差异。作者构建了首个染色感知FQA数据集FluoMix,并提出了两阶段视觉语言框架FluoCLIP,该框架利用CLIP的对齐能力,通过染色表征学习和染色引导排序来预测聚焦质量。

Details

Motivation: 解决荧光显微镜中因荧光染料光学特性不同导致的聚焦质量评估难题,现有数据集和模型将聚焦质量视为与染色无关的问题,忽略了染色依赖性。

Result: 在现有数据集(FocusPath、BBBC006)和新构建的FluoMix数据集上的定量分析表明,不同染色间的聚焦排序关系差异显著。FluoCLIP框架在多种荧光显微镜条件下展现出强大的泛化能力。

Insight: 创新点在于首次将荧光显微镜聚焦质量评估形式化为染色感知任务,并构建了相应的数据集和两阶段视觉语言框架。其核心是利用CLIP的图文对齐能力学习通用染色表征,并优化染色特定的排序提示进行有序预测,为染色感知FQA奠定了基础。

Abstract: Accurate focus quality assessment (FQA) in fluorescence microscopy remains challenging, as the stain-dependent optical properties of fluorescent dyes cause abrupt and heterogeneous focus shifts. However, existing datasets and models overlook this variability, treating focus quality as a stain-agnostic problem. In this work, we formulate the task of stain-aware FQA, emphasizing that focus behavior in fluorescence microscopy must be modeled as a function of staining characteristics. Through quantitative analysis of existing datasets (FocusPath, BBBC006) and our newly curated FluoMix, we demonstrate that focus-rank relationships vary substantially across stains, underscoring the need for stain-aware modeling in fluorescence microscopy. To support this new formulation, we propose FluoMix, the first dataset for stain-aware FQA that encompasses multiple tissues, fluorescent stains, and focus variations. Building on this dataset, we propose FluoCLIP, a two-stage vision-language framework that leverages CLIP’s alignment capability to interpret focus quality in the context of biological staining. In the stain-grounding phase, FluoCLIP learns general stain representations by aligning textual stain tokens with visual features, while in the stain-guided ranking phase, it optimizes stain-specific rank prompts for ordinal focus prediction. Together, our formulation, dataset, and framework establish the first foundation for stain-aware FQA, and FluoCLIP achieves strong generalization across diverse fluorescence microscopy conditions.


[83] Revisiting Integration of Image and Metadata for DICOM Series Classification: Cross-Attention and Dictionary Learning eess.IV | cs.CVPDF

Tuan Truong, Melanie Dohmen, Sara Lorio, Matthias Lenga

TL;DR: 本文提出了一种用于DICOM医学图像序列分类的端到端多模态框架,该框架联合建模图像内容和采集元数据,并专门处理图像内容异构、序列长度可变以及元数据缺失或不一致等挑战。方法包括使用模态感知模块编码图像和元数据,并通过双向跨模态注意力机制进行融合;采用基于可学习特征字典和值条件调制的稀疏、缺失感知编码器处理元数据,无需任何形式的插补;通过2.5D视觉编码器和对等距采样切片的注意力机制处理序列长度和图像尺寸的可变性。

Details

Motivation: DICOM序列分类对于大规模医学图像分析、质量控制、协议协调和可靠的下游处理至关重要,但由于图像切片内容异构、序列长度可变以及DICOM元数据完全缺失、不完整或不一致,该任务仍具挑战性。

Result: 在公开的Duke Liver MRI数据集和一个大型多机构内部队列上进行了评估,涵盖了域内性能和域外泛化能力。在所有评估设置中,所提出的方法始终优于仅使用图像、仅使用元数据以及多模态2D/3D基线方法。

Insight: 创新点在于明确建模元数据稀疏性和跨模态交互以提高DICOM序列分类的鲁棒性。具体包括:1)使用双向跨模态注意力机制融合图像和元数据;2)设计无需插补的、基于可学习特征字典的缺失感知元数据编码器;3)采用2.5D视觉编码器和基于采样的注意力处理可变序列长度。这些设计为处理医学图像中常见的不完整和多模态数据提供了可借鉴的思路。

Abstract: Automated identification of DICOM image series is essential for large-scale medical image analysis, quality control, protocol harmonization, and reliable downstream processing. However, DICOM series classification remains challenging due to heterogeneous slice content, variable series length, and entirely missing, incomplete or inconsistent DICOM metadata. We propose an end-to-end multimodal framework for DICOM series classification that jointly models image content and acquisition metadata while explicitly accounting for all these challenges. (i) Images and metadata are encoded with modality-aware modules and fused using a bi-directional cross-modal attention mechanism. (ii) Metadata is processed by a sparse, missingness-aware encoder based on learnable feature dictionaries and value-conditioned modulation. By design, the approach does not require any form of imputation. (iii) Variability in series length and image data dimensions is handled via a 2.5D visual encoder and attention operating on equidistantly sampled slices. We evaluate the proposed approach on the publicly available Duke Liver MRI dataset and a large multi-institutional in-house cohort, assessing both in-domain performance and out-of-domain generalization. Across all evaluation settings, the proposed method consistently outperforms relevant image only, metadata-only and multimodal 2D/3D baselines. The results demonstrate that explicitly modeling metadata sparsity and cross-modal interactions improves robustness for DICOM series classification.


[84] Clinically-aligned ischemic stroke segmentation and ASPECTS scoring on NCCT imaging using a slice-gated loss on foundation representations eess.IV | cs.CVPDF

Hiba Azeem, Behraj Khan, Tahir Qasim Syed

TL;DR: 本文提出了一种临床对齐的缺血性卒中分割和ASPECTS评分框架,通过结合冻结的DINOv3骨干网络与轻量级解码器,并引入领土感知门控损失来增强基底节与上节段的一致性,从而在非增强CT图像上实现更准确的卒中分割。

Details

Motivation: 现有深度学习方法通常进行逐像素分割,而未建模ASPECTS评分中基底节和上节段耦合的解剖结构推理,因此需要一种能结合临床先验知识的方法来改进卒中评估。

Result: 在AISD数据集上达到Dice分数0.6385,优于之前的CNN和基础模型基线;在专有ASPECTS数据集上,TAGL将平均Dice从0.698提升至0.767,显示出性能提升。

Insight: 创新点在于将基础模型表示与结构化临床先验结合,通过领土感知门控损失在训练中强制解剖一致性,且不增加推理复杂度,为医学图像分析提供了可借鉴的临床对齐监督策略。

Abstract: Rapid infarct assessment on non-contrast CT (NCCT) is essential for acute ischemic stroke management. Most deep learning methods perform pixel-wise segmentation without modeling the structured anatomical reasoning underlying ASPECTS scoring, where basal ganglia (BG) and supraganglionic (SG) levels are clinically interpreted in a coupled manner. We propose a clinically aligned framework that combines a frozen DINOv3 backbone with a lightweight decoder and introduce a Territory-Aware Gated Loss (TAGL) to enforce BG-SG consistency during training. This anatomically informed supervision adds no inference-time complexity. Our method achieves a Dice score of 0.6385 on AISD, outperforming prior CNN and foundation-model baselines. On a proprietary ASPECTS dataset, TAGL improves mean Dice from 0.698 to 0.767. These results demonstrate that integrating foundation representations with structured clinical priors improves NCCT stroke segmentation and ASPECTS delineation.


cs.MM [Back]

[85] MSVBench: Towards Human-Level Evaluation of Multi-Shot Video Generation cs.MM | cs.CVPDF

Haoyuan Shi, Yunxin Li, Nanhao Deng, Zhenran Xu, Xinyu Chen

TL;DR: 本文提出了MSVBench,这是首个专为多镜头视频生成设计的综合性基准测试,包含分层脚本和参考图像。论文引入了一个混合评估框架,结合了大型多模态模型的高层语义推理和领域专家模型的细粒度感知能力。通过对20种不同范式的视频生成方法进行评估,发现当前模型主要充当视觉插值器而非真正的世界模型。该基准测试与人类判断的斯皮尔曼等级相关性达到94.4%,并可通过其管道精炼的推理轨迹为轻量级模型提供可扩展的监督信号,实现与商业模型相当的性能。

Details

Motivation: 当前视频生成技术正朝着复杂、多镜头的叙事方向发展,但现有评估方法仍局限于单镜头范式,缺乏评估长视频连贯性和吸引力的综合故事素材和跨镜头指标。

Result: 在MSVBench上评估了20种视频生成方法,发现当前模型尽管视觉保真度高,但主要作为视觉插值器而非世界模型。基准测试与人类判断的斯皮尔曼等级相关性达到94.4%(SOTA)。使用其管道精炼的推理轨迹微调轻量级模型,可实现与Gemini-2.5-Flash等商业模型相当的人类对齐性能。

Insight: 创新点在于首次构建了针对多镜头视频生成的综合基准测试(MSVBench),并提出了结合大型多模态模型语义推理和专家模型感知的混合评估框架。客观来看,该工作通过提供分层脚本和参考图像,以及可扩展的监督信号,为评估和提升长视频生成模型的叙事连贯性提供了新工具和方法。

Abstract: The evolution of video generation toward complex, multi-shot narratives has exposed a critical deficit in current evaluation methods. Existing benchmarks remain anchored to single-shot paradigms, lacking the comprehensive story assets and cross-shot metrics required to assess long-form coherence and appeal. To bridge this gap, we introduce MSVBench, the first comprehensive benchmark featuring hierarchical scripts and reference images tailored for Multi-Shot Video generation. We propose a hybrid evaluation framework that synergizes the high-level semantic reasoning of Large Multimodal Models (LMMs) with the fine-grained perceptual rigor of domain-specific expert models. Evaluating 20 video generation methods across diverse paradigms, we find that current models–despite strong visual fidelity–primarily behave as visual interpolators rather than true world models. We further validate the reliability of our benchmark by demonstrating a state-of-the-art Spearman’s rank correlation of 94.4% with human judgments. Finally, MSVBench extends beyond evaluation by providing a scalable supervisory signal. Fine-tuning a lightweight model on its pipeline-refined reasoning traces yields human-aligned performance comparable to commercial models like Gemini-2.5-Flash.


cs.LG [Back]

[86] MINT: Multimodal Imaging-to-Speech Knowledge Transfer for Early Alzheimer’s Screening cs.LG | cs.AI | cs.CVPDF

Vrushank Ahire, Yogesh Kumar, Anouck Girard, M. A. Ganaie

TL;DR: 本文提出了一种名为MINT的多模态成像到语音知识迁移框架,用于早期阿尔茨海默病筛查。该框架通过三阶段跨模态学习,将MRI中的生物标志物结构迁移到语音编码器中,使语音分类器在推理时无需依赖神经影像数据,同时保持与MRI相似的决策边界。

Details

Motivation: 解决早期阿尔茨海默病筛查中神经影像(如MRI)成本高、部署困难,而纯语音分类器缺乏生物学基础、可靠性有限的问题。

Result: 在ADNI-4数据集上评估,对齐后的语音分类器性能与纯语音基线相当(AUC 0.720 vs 0.711),且多模态融合优于单独使用MRI(AUC 0.973 vs 0.958)。

Insight: 创新点在于首次实现了MRI到语音的知识迁移,通过几何损失对齐语音表示到冻结的成像流形,为无需神经影像的群体级认知筛查提供了生物学基础路径;关键设计包括dropout正则化和自监督预训练。

Abstract: Alzheimer’s disease is a progressive neurodegenerative disorder in which mild cognitive impairment (MCI) marks a critical transition between aging and dementia. Neuroimaging modalities, such as structural MRI, provide biomarkers of this transition; however, their high costs and infrastructure needs limit their deployment at a population scale. Speech analysis offers a non-invasive alternative, but speech-only classifiers are developed independently of neuroimaging, leaving decision boundaries biologically ungrounded and limiting reliability on the subtle CN-versus-MCI distinction. We propose MINT (Multimodal Imaging-to-Speech Knowledge Transfer), a three-stage cross-modal framework that transfers biomarker structure from MRI into a speech encoder at training time. An MRI teacher, trained on 1,228 subjects, defines a compact neuroimaging embedding space for CN-versus-MCI classification. A residual projection head aligns speech representations to this frozen imaging manifold via a combined geometric loss, adapting speech to the learned biomarker space while preserving imaging encoder fidelity. The frozen MRI classifier, which is never exposed to speech, is applied to aligned embeddings at inference and requires no scanner. Evaluation on ADNI-4 shows aligned speech achieves performance comparable to speech-only baselines (AUC 0.720 vs 0.711) while requiring no imaging at inference, demonstrating that MRI-derived decision boundaries can ground speech representations. Multimodal fusion improves over MRI alone (0.973 vs 0.958). Ablation studies identify dropout regularization and self-supervised pretraining as critical design decisions. To our knowledge, this is the first demonstration of MRI-to-speech knowledge transfer for early Alzheimer’s screening, establishing a biologically grounded pathway for population-level cognitive triage without neuroimaging at inference.


cs.AI [Back]

[87] Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance cs.AI | cs.CLPDF

Yanwei Ren, Haotian Zhang, Likang Xiao, Xikai Zhang, Jiaxing Huang

TL;DR: 本文提出了一种名为SCOPE的新框架,用于解决强化学习可验证奖励(RLVR)中探索空间过早缩窄的问题。该方法利用过程奖励模型精确定位次优轨迹中的首个错误步骤,并应用细粒度的、逐步的离策略修正,以挽救部分正确的轨迹,从而维持广泛的探索空间。

Details

Motivation: 标准基于结果的监督在RLVR中存在关键限制:它会对基本正确但因少数错误步骤而失败的轨迹施加与完全错误轨迹同样严重的惩罚,导致模型丢弃有价值的、基本正确的轨迹,从而降低轨迹多样性并过早缩窄探索空间。现有方法未能有效利用模型自身生成的基本正确的轨迹。

Result: 在数学推理任务上,该方法实现了46.6%的平均准确率,并在分布外推理任务上表现出稳健的泛化能力,达到53.4%的准确率,建立了新的最先进(SOTA)结果。此外,该方法将多样性分数提高了13.5%。

Insight: 核心创新点在于提出了一种细粒度的、逐步的离策略修正机制(SCOPE),它能够精确定位并修正轨迹中的首个错误步骤,从而挽救部分正确的轨迹,有效维持了探索空间的多样性。这为RLVR中如何更有效地利用失败或次优轨迹提供了新的思路。

Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the complex reasoning capabilities of Large Reasoning Models. However, standard outcome-based supervision suffers from a critical limitation that penalizes trajectories that are largely correct but fail due to several missteps as heavily as completely erroneous ones. This coarse feedback signal causes the model to discard valuable largely correct rollouts, leading to a degradation in rollout diversity that prematurely narrows the exploration space. Process Reward Models have demonstrated efficacy in providing reliable step-wise verification for test-time scaling, naively integrating these signals into RLVR as dense rewards proves ineffective.Prior methods attempt to introduce off-policy guided whole-trajectory replacement that often outside the policy model’s distribution, but still fail to utilize the largely correct rollouts generated by the model itself and thus do not effectively mitigate the narrowing of the exploration space. To address these issues, we propose SCOPE (Step-wise Correction for On-Policy Exploration), a novel framework that utilizes Process Reward Models to pinpoint the first erroneous step in suboptimal rollouts and applies fine-grained, step-wise off-policy rectification. By applying precise refinement on partially correct rollout, our method effectively salvages partially correct trajectories and increases diversity score by 13.5%, thereby sustaining a broad exploration space. Extensive experiments demonstrate that our approach establishes new state-of-the-art results, achieving an average accuracy of 46.6% on math reasoning and exhibiting robust generalization with 53.4% accuracy on out-of-distribution reasoning tasks.


[88] Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume cs.AI | cs.CL | cs.CV | cs.LGPDF

Gregory Kang Ruey Lau, Hieu Dao, Nicole Kan Hui Lin, Bryan Kian Hsiang Low

TL;DR: 本文提出了UMPIRE,一种无需训练、高效的多模态大语言模型不确定性量化框架。它通过计算给定任务实例下采样响应的非一致性调整语义体积,有效捕捉响应的全局语义多样性和基于模型内部置信度的局部非一致性。该框架适用于多种输入输出模态,无需外部工具,在图像、音频和视频-文本基准测试的错误检测和不确定性校准方面优于基线方法。

Details

Motivation: 多模态大语言模型可能产生看似合理但错误的输出,阻碍其可靠部署。现有不确定性度量方法存在局限,如仅适用于特定模态、依赖外部工具或计算成本高昂。

Result: 在图像、音频和视频-文本基准测试(包括对抗性和分布外设置)上的广泛实验表明,UMPIRE在错误检测和不确定性校准方面持续优于基线指标。

Insight: 创新点在于提出了一种仅依赖模型内部模态特征、无需训练或外部工具的不确定性量化框架,通过非一致性调整语义体积统一捕捉响应的多样性和置信度,并成功泛化到非文本输出任务(如图像和音频生成)。

Abstract: Despite their capabilities, Multimodal Large Language Models (MLLMs) may produce plausible but erroneous outputs, hindering reliable deployment. Accurate uncertainty metrics could enable escalation of unreliable queries to human experts or larger models for improved performance. However, existing uncertainty metrics have practical constraints, such as being designed only for specific modalities, reliant on external tools, or computationally expensive. We introduce UMPIRE, a training-free uncertainty quantification framework for MLLMs that works efficiently across various input and output modalities without external tools, relying only on the models’ own internal modality features. UMPIRE computes the incoherence-adjusted semantic volume of sampled MLLM responses for a given task instance, effectively capturing both the global semantic diversity of samples and the local incoherence of responses based on internal model confidence. We propose uncertainty desiderata for MLLMs and provide theoretical analysis motivating UMPIRE’s design. Extensive experiments show that UMPIRE consistently outperforms baseline metrics in error detection and uncertainty calibration across image, audio, and video-text benchmarks, including adversarial and out-of-distribution settings. We also demonstrate UMPIRE’s generalization to non-text output tasks, including image and audio generation.


[89] DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science cs.AI | cs.CLPDF

Fan Shu, Yite Wang, Ruofan Wu, Boyi Liu, Zhewei Yao

TL;DR: 该论文提出了DARE-bench,一个用于评估大型语言模型在数据科学任务中建模能力和指令遵循能力的基准测试。它包含6,300个源自Kaggle的任务,提供可验证的真实答案,旨在弥补现有基准在标准化、过程感知评估和高质量训练数据方面的不足。实验表明,即使是GPT-4-mini等先进模型在该基准上表现不佳,而使用其数据进行微调能显著提升模型性能。

Details

Motivation: 现有基准测试在评估LLMs处理复杂多步骤数据科学任务时存在两大缺陷:一是缺乏标准化的、能捕捉指令遵循和过程保真度的过程感知评估方法;二是缺乏准确标注的训练数据。

Result: 在DARE-bench上的广泛评估显示,即使是GPT-4-mini等高性能模型也难以取得良好表现,尤其在机器学习建模任务中。使用DARE-bench的训练任务进行微调能大幅提升模型性能:例如,监督微调使Qwen3-32B的准确率提升1.83倍,强化学习使Qwen3-4B的准确率提升超过8倍。

Insight: 论文的创新点在于构建了一个具有可验证真实答案、覆盖广泛任务并支持智能体工具的数据科学基准,确保了评估的客观性和可复现性。其核心价值在于同时作为精准的评估工具和关键的训练数据源,填补了领域空白。

Abstract: The fast-growing demands in using Large Language Models (LLMs) to tackle complex multi-step data science tasks create an emergent need for accurate benchmarking. There are two major gaps in existing benchmarks: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the scarcity of accurately labeled training data. To bridge these gaps, we introduce DARE-bench, a benchmark designed for machine learning modeling and data science instruction following. Unlike many existing benchmarks that rely on human- or model-based judges, all tasks in DARE-bench have verifiable ground truth, ensuring objective and reproducible evaluation. To cover a broad range of tasks and support agentic tools, DARE-bench consists of 6,300 Kaggle-derived tasks and provides both large-scale training data and evaluation sets. Extensive evaluations show that even highly capable models such as gpt-o4-mini struggle to achieve good performance, especially in machine learning modeling tasks. Using DARE-bench training tasks for fine-tuning can substantially improve model performance. For example, supervised fine-tuning boosts Qwen3-32B’s accuracy by 1.83x and reinforcement learning boosts Qwen3-4B’s accuracy by more than 8x. These significant improvements verify the importance of DARE-bench both as an accurate evaluation benchmark and critical training data.


[90] EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models cs.AI | cs.CVPDF

Yiyang Fang, Wenke Huang, Pei Fu, Yihao Yang, Kehua Su

TL;DR: 本文提出了一种名为EMO-R3的反思性强化学习框架,旨在提升多模态大语言模型在情感推理方面的能力。该框架通过结构化情感思维引导模型进行逐步推理,并利用反思性情感奖励机制,基于视觉-文本一致性和情感连贯性对推理过程进行重新评估。实验表明,该方法在多个视觉情感理解基准上取得了优越性能,显著增强了模型的可解释性和情感智能。

Details

Motivation: 现有基于监督微调的多模态大语言模型在捕捉人类情感的复杂性和主观性方面存在局限,泛化能力和可解释性不足,而现有的强化学习方法(如GRPO)未能与情感认知的内在特性对齐。

Result: 在多个视觉情感理解基准测试中,EMO-R3取得了优越的性能,显著提升了模型的可解释性和情感智能。

Insight: 创新点在于提出了结构化情感思维来引导可解释的逐步推理,并设计了反思性情感奖励机制,使模型能够基于视觉-文本一致性和情感连贯性进行自我重新评估,这为增强AI的情感理解能力提供了一种新的强化学习范式。

Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable progress in visual reasoning and understanding tasks but still struggle to capture the complexity and subjectivity of human emotions. Existing approaches based on supervised fine-tuning often suffer from limited generalization and poor interpretability, while reinforcement learning methods such as Group Relative Policy Optimization fail to align with the intrinsic characteristics of emotional cognition. To address these challenges, we propose Reflective Reinforcement Learning for Emotional Reasoning (EMO-R3), a framework designed to enhance the emotional reasoning ability of MLLMs. Specifically, we introduce Structured Emotional Thinking to guide the model to perform step-by-step emotional reasoning in a structured and interpretable manner, and design a Reflective Emotional Reward that enables the model to re-evaluate its reasoning based on visual-text consistency and emotional coherence. Extensive experiments demonstrate that EMO-R3 significantly improves both the interpretability and emotional intelligence of MLLMs, achieving superior performance across multiple visual emotional understanding benchmarks.


cs.SE [Back]

[91] SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale cs.SE | cs.CLPDF

Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Alexander Golubev

TL;DR: 本文介绍了SWE-rebench V2,一个语言无关的自动化流水线,用于大规模收集可执行的真实世界软件工程任务并构建强化学习训练环境。该流水线通过交互式设置代理合成仓库特定的安装和测试流程,并使用LLM评判器集成过滤不可靠实例,最终构建了一个包含32,000多个任务、覆盖20种语言和3,600多个仓库的数据集,并提供了预构建镜像以确保可复现执行。

Details

Motivation: 当前软件工程智能体的训练受限于大规模、可复现且具有可靠测试套件的任务集合的稀缺性,现有数据集在规模、多样性和语言覆盖上存在不足。

Result: 通过诊断研究验证了收集实例的质量,该研究涵盖了五种编程语言中七个流行模型的一个任务子集,并提供了实例级元数据以标记常见干扰因素。

Insight: 创新点在于提出了一个自动化、语言无关的流水线来大规模构建可执行的软件工程任务环境,并通过LLM评判器集成和人工验证确保数据质量,为跨语言、跨仓库的大规模软件工程智能体训练提供了资源支持。

Abstract: Software engineering agents (SWE) are improving rapidly, with recent gains largely driven by reinforcement learning (RL). However, RL training is constrained by the scarcity of large-scale task collections with reproducible execution environments and reliable test suites. Although a growing number of benchmarks have emerged, datasets suitable for training remain limited in scale and diversity or often target a limited set of high-resource language ecosystems. We introduce SWE-rebench V2, a language-agnostic automated pipeline for harvesting executable real-world SWE tasks and constructing RL training environments at scale. The pipeline synthesizes repository-specific installation and test procedures via an interactive setup agent, and filters unsound instances using an ensemble of LLM judges, validated against human-verified SWE-bench annotations. Using this pipeline, we construct a dataset of 32,000+ tasks spanning 20 languages and 3,600+ repositories, with pre-built images for reproducible execution. To further scale training data, we additionally release 120,000+ tasks with installation instructions, fail-to-pass tests and rich metadata, where the problem statement is generated based on the original pull request description. We validate the collected instances through a diagnostic study that covers a subset of tasks in five programming languages across seven popular models, and provide instance-level metadata that flags common confounders such as overly restrictive tests and underspecified descriptions. We release the datasets, the collection and execution code, and associated artifacts to enable large-scale training of SWE agents across diverse languages and repositories.


cs.RO [Back]

[92] StemVLA:An Open-Source Vision-Language-Action Model with Future 3D Spatial Geometry Knowledge and 4D Historical Representation cs.RO | cs.CVPDF

Jiasong Xiao, Yutao She, Kai Li, Yuyang Sha, Ziang Cheng

TL;DR: 本文提出了一种名为StemVLA的新型视觉-语言-动作模型,该模型通过显式地整合面向未来的3D空间几何知识与历史4D时空表示,来增强机器人对动态环境的理解和长时程决策能力。

Details

Motivation: 现有VLA模型主要依赖从2D视觉输入到动作序列的直接映射,缺乏对底层3D空间结构和时间动态的显式建模,这限制了在动态环境中的空间推理和长时程决策。

Result: 在仿真实验中,StemVLA显著提升了长时程任务的成功率,并在CALVIN ABC-D基准测试上达到了最先进的性能水平。

Insight: 创新点在于显式地预测未来3D空间几何知识以预见场景变化,并利用预训练的视频-几何Transformer骨干网络提取并聚合历史帧的隐式3D世界表示,形成统一的4D历史时空表示,从而实现了更全面的世界理解。

Abstract: Vision-language-action (VLA) models integrate visual observations and language instructions to predict robot actions, demonstrating promising generalization in manipulation tasks. However, most existing approaches primarily rely on direct mappings from 2D visual inputs to action sequences, without explicitly modeling the underlying 3D spatial structure or temporal world dynamics. Such representations may limit spatial reasoning and long-horizon decision-making in dynamic environments. To address this limitation, we propose StemVLA, a novel framework that explicitly incorporates both future-oriented 3D spatial knowledge and historical 4D spatiotemporal representations into action prediction. First, instead of relying solely on observed images, StemVLA forecasts structured 3D future spatial-geometric world knowledge, enabling the model to anticipate upcoming scene geometry and object configurations. Second, to capture temporal consistency and motion dynamics, we feed historical image frames into a pretrained video-geometry transformer backbone to extract implicit 3D world representations, and further aggregate them across time using a temporal attention module, termed VideoFormer [20], forming a unified 4D historical spatiotemporal representation. By jointly modeling 2D observations, predicted 3D future structure, and aggregated 4D temporal dynamics, StemVLA enables more comprehensive world understanding for robot manipulation. Extensive experiments in simulation demonstrate that StemVLA significantly improves long-horizon task success and achieves state-of-the-art performance on the CALVIN ABC-D benchmark [46], achieving an average sequence length of XXX.


[93] Enhancing Vision-Language Navigation with Multimodal Event Knowledge from Real-World Indoor Tour Videos cs.RO | cs.CVPDF

Haoxuan Xu, Tianfu Li, Wenbo Chen, Yi Liu, Xingxing Zuo

TL;DR: 本文提出了一种基于事件知识增强的视觉语言导航方法,通过从真实室内游览视频中构建大规模多模态时空知识图谱(YE-KG),并设计了一种从粗到细的分层检索机制(STE-VLN)将事件知识融入导航模型,以解决长时推理和粗粒度指令理解问题。

Details

Motivation: 现有视觉语言导航(VLN)智能体在未见环境中进行长时推理时面临困难,尤其是在处理模糊、粗粒度指令时表现不佳;尽管已有研究利用知识图谱增强推理,但受人类情景记忆启发的多模态事件知识潜力尚未充分挖掘。

Result: 在REVERIE、R2R和R2R-CE基准测试上的实验表明,该方法优于当前最先进(SOTA)方法,在不同动作空间上均取得了更好的性能。

Insight: 创新点在于首次从真实世界视频中构建了大规模多模态时空知识图谱(YE-KG),并提出了一个从粗到细的分层检索机制来动态融合事件序列与第一视角视觉观察,从而显式地利用情景记忆进行因果推理。

Abstract: Vision-Language Navigation (VLN) agents often struggle with long-horizon reasoning in unseen environments, particularly when facing ambiguous, coarse-grained instructions. While recent advances use knowledge graph to enhance reasoning, the potential of multimodal event knowledge inspired by human episodic memory remains underexplored. In this work, we propose an event-centric knowledge enhancement strategy for automated process knowledge mining and feature fusion to solve coarse-grained instruction and long-horizon reasoning in VLN task. First, we construct YE-KG, the first large-scale multimodal spatiotemporal knowledge graph, with over 86k nodes and 83k edges, derived from real-world indoor videos. By leveraging multimodal large language models (i.e., LLaVa, GPT4), we extract unstructured video streams into structured semantic-action-effect events to serve as explicit episodic memory. Second, we introduce STE-VLN, which integrates the above graph into VLN models via a Coarse-to-Fine Hierarchical Retrieval mechanism. This allows agents to retrieve causal event sequences and dynamically fuse them with egocentric visual observations. Experiments on REVERIE, R2R, and R2R-CE benchmarks demonstrate the efficiency of our event-centric strategy, outperforming state-of-the-art approaches across diverse action spaces. Our data and code are available on the project website https://sites.google.com/view/y-event-kg/.


cs.HC [Back]

[94] Shape vs. Context: Examining Human–AI Gaps in Ambiguous Japanese Character Recognition cs.HC | cs.CVPDF

Daichi Haraguchi

TL;DR: 本文通过比较人类与视觉语言模型(VLMs)在识别模糊日本字符时的决策模式,发现两者在仅基于形状的任务中存在决策边界差异,而在上下文中嵌入模糊字符时,某些条件下可改善VLM与人类判断的对齐。

Details

Motivation: 研究动机是探究高文本识别性能的VLMs在解决模糊性时是否具有类人决策模式,揭示人类与AI在模糊字符识别中的行为差异。

Result: 实验使用β-VAE生成的连续插值日本字符形状,在单字符识别(仅形状任务)中估计决策边界,并在词级上下文中评估VLM响应与人类判断的对齐情况,结果显示在某些条件下上下文能改善对齐。

Insight: 创新点在于通过形状与上下文的对比分析,直接量化人类与VLM的行为差距,为人类-VLM对齐基准测试提供了基础性见解,强调定性行为差异的重要性。

Abstract: High text recognition performance does not guarantee that Vision-Language Models (VLMs) share human-like decision patterns when resolving ambiguity. We investigate this behavioral gap by directly comparing humans and VLMs using continuously interpolated Japanese character shapes generated via a $β$-VAE. We estimate decision boundaries in a single-character recognition (shape-only task) and evaluate whether VLM responses align with human judgments under shape in context (i.e., embedding an ambiguous character near the human decision boundary in word-level context). We find that human and VLM decision boundaries differ in the shape-only task, and that shape in context can improve human alignment in some conditions. These results highlight qualitative behavioral differences, offering foundational insights toward human–VLM alignment benchmarking.