cs.CL

[1] Deep Research: A Systematic Survey

Zhengliang Shi,Yiqun Chen,Haitao Li,Weiwei Sun,Shiyu Ni,Yougang Lyu,Run-Ze Fan,Bowen Jin,Yixuan Weng,Minjun Zhu,Qiujie Xie,Xinyu Guo,Qu Yang,Jiayi Wu,Jujia Zhao,Xiaqiang Tang,Xinbei Ma,Cunxiang Wang,Jiaxin Mao,Qingyao Ai,Jen-Tse Huang,Wenxuan Wang,Yue Zhang,Yiming Yang,Zhaopeng Tu,Zhaochun Ren

Main category: cs.CL

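TL;DR: This paper is a systematic survey of Deep Research (DR), which combines large language models (LLMs) with external tools to complete complex, open-ended tasks. It proposes a three-stage roadmap and summarizes four key components, optimization techniques, evaluation criteria, and open challenges.

Motivation: Although LLMs excel at text generation and problem solving, many open-ended tasks demand critical thinking, multi-source information, and verifiable outputs, which lie beyond single-shot prompting or standard retrieval-augmented generation. This calls for a systematic study of how to combine LLM capabilities with external tools.

Contribution: The main contributions are: 1) formalizing a three-stage deep research roadmap; 2) introducing four key components and their sub-taxonomies; 3) summarizing optimization techniques; 4) consolidating evaluation criteria and open challenges.

Method: The paper takes a systematic-survey approach, covering the key components (query planning, information acquisition, memory management, and answer generation) and the optimization techniques (prompting, supervised fine-tuning, and agentic reinforcement learning).

Result: The paper provides a comprehensive roadmap and technical summary of deep research, offering clear guidance and a reference for future work.

Insight: The core of deep research is combining the reasoning capabilities of LLMs with external tools to solve complex, open-ended problems. Future directions include tighter coordination among components and more complete evaluation criteria.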

Abstract: Large language models (LLMs) have rapidly evolved from text generators into powerful problem solvers. Yet, many open tasks demand critical thinking, multi-source, and verifiable outputs, which are beyond single-shot prompting or standard retrieval-augmented generation. Recently, numerous studies have explored Deep Research (DR), which aims to combine the reasoning capabilities of LLMs with external tools, such as search engines, thereby empowering LLMs to act as research agents capable of completing complex, open-ended tasks. This survey presents a comprehensive and systematic overview of deep research systems, including a clear roadmap, foundational components, practical implementation techniques, important challenges, and future directions. Specifically, our main contributions are as follows: (i) we formalize a three-stage roadmap and distinguish deep research from related paradigms; (ii) we introduce four key components: query planning, information acquisition, memory management, and answer generation, each paired with fine-grained sub-taxonomies; (iii) we summarize optimization techniques, including prompting, supervised fine-tuning, and agentic reinforcement learning; and (iv) we consolidate evaluation criteria and open challenges, aiming to guide and facilitate future development. As the field of deep research continues to evolve rapidly, we are committed to continuously updating this survey to reflect the latest progress in this area.

[2] Beyond Confidence: Adaptive and Coherent Decoding for Diffusion Language Models

Kecheng Chen,Ziru Liu,Xijia Tao,Hui Liu,Xinyu Fu,Suiyun Zhang,Dandan Tu,Lingpeng Kong,Rui Liu,Haoliang Li

Main category: cs.CL

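TL;DR: The paper proposes Coherent Contextual Decoding (CCD), a new inference framework for diffusion language models that uses a trajectory rectification mechanism and an adaptive sampling strategy to improve both generation quality and inference speed.

Motivation: Existing inference methods for diffusion language models rely on local, immediate-step metrics such as confidence or entropy. Lacking a global view, they produce inconsistent sampling trajectories and suboptimal generation quality.

Contribution: 1. A trajectory rectification mechanism that leverages historical context to improve sequence coherence; 2. An adaptive sampling strategy that dynamically adjusts the decoding budget.

Method: 1. The trajectory rectification mechanism models the consistency between context and token predictions via conditional mutual information; 2. The adaptive sampling strategy allocates the per-step unmasking budget according to the consistency metric.

Result: Across diverse benchmarks on Dream and LLaDA, CCD delivers up to 3.48x speedup alongside a 3.91% performance improvement.

Insight: Global consistency and dynamic budget allocation are key to improving the performance of diffusion language models.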

Abstract: Diffusion Language Models (DLMs) have recently achieved significant success due to their any-order generation capabilities. However, existing inference methods typically rely on local, immediate-step metrics such as confidence or entropy which inherently lack a more reliable perspective. This limitation frequently leads to inconsistent sampling trajectories and suboptimal generation quality. To address this, we propose Coherent Contextual Decoding (CCD), a novel inference framework built upon two core innovations. First, CCD employs a trajectory rectification mechanism that leverages historical context to enhance sequence coherence, enabling the early rejection of suboptimal paths. We demonstrate that this mechanism is theoretically equivalent to modeling the consistency of historical steps via the conditional mutual information between context and token predictions. Building on this theoretical insight, we further address the inefficiency of conventional uniform decoding budgets. Instead of rigid allocations based on diffusion steps, we introduce an adaptive sampling strategy that dynamically adjusts the unmasking budget for each step according to our consistency metric. Consequently, our method significantly improves the quality of generation trajectories while accelerating the sampling process. Empirically, our method achieves a simultaneous enhancement in both inference speed and performance across diverse benchmarks on Dream and LLaDA, delivering up to 3.48x speedup alongside 3.91% performance improvement.

[3] Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models

Ziyan Wang,Enmao Diao,Qi Le,Pu Wang,Guanchu Wang,Minwoo Lee,Shu-ping Yeh,Li Yang

Main category: cs.CL

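TL;DR: The paper proposes RESP, a self-reflective structured pruning method for reasoning LLMs (RLMs). It addresses the accuracy collapse that existing pruning methods suffer on RLMs, and it substantially improves post-pruning reasoning through self-generated calibration, gradient-based importance estimation, and progressive regeneration.

Motivation: Existing pruning methods work well on standard LLMs but degrade sharply on RLMs, largely because of a mismatch between the calibration data, the pruning objective, and the model's decode-time reasoning behavior. The study finds that the model's own self-generated reasoning traces are the most reliable calibration signal.

Contribution: The RESP framework, which prunes reasoning LLMs using self-generated calibration, decode-only gradient-based importance estimation, and progressive regeneration, and markedly improves reasoning performance after pruning.

Method: RESP comprises three techniques: (1) calibration on the model's self-generated reasoning traces; (2) decode-only gradient-based importance estimation; (3) progressive regeneration, which maintains calibration fidelity as sparsity increases.

Result: On Qwen3-8B, RESP preserves near-dense accuracy at 20-30% sparsity and, at 40% sparsity, reaches 81.3% accuracy on GSM8K and 59.6% on MathQA, markedly outperforming existing structured pruning methods.

Insight: Self-generated reasoning traces make a better pruning calibration signal than human-written labels, and pruning decisions need to stay aligned with the model's reasoning dynamics.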

Abstract: Reasoning LLMs (RLMs) such as OpenAI o1, DeepSeek-R1, and Qwen3 deliver strong multi-step reasoning through chain-of-thought generation, but their large model sizes and lengthy decode-time outputs make them costly to deploy and unsuitable for resource-constrained settings. To reduce computing and memory cost, pruning offers a promising solution by removing unimportant parameters. However, despite their success on standard LLMs, existing pruning methods severely damage RLMs, as even moderate sparsity (e.g., 20%) can collapse accuracy and completely disrupt the model’s reasoning coherence. We begin by analyzing why existing pruning pipelines fail on reasoning LLMs and find that their brittleness largely stems from a mismatch between the calibration data, the pruning objective, and the model’s decode-time reasoning behavior. Our study further shows that the most reliable calibration signal comes not from human-written labels but from the model’s own self-generated reasoning traces, which more accurately reflect its inference distribution. Guided by these insights, we introduce RESP, a self-reflective structured pruning framework that aligns pruning decisions with the model’s reasoning dynamics through self-generated calibration, decode-only gradient-based importance estimation, and progressive regeneration that maintains calibration fidelity as sparsity increases. Experiments on Qwen3-8B demonstrate that RESP markedly outperforms existing structured pruning methods on both GSM8K and MathQA, preserving near-dense accuracy at 20-30% sparsity and substantially mitigating performance collapse at higher sparsity levels. At 40% sparsity, RESP attains 81.3% accuracy on GSM8K and 59.6% on MathQA, surpassing the strongest baselines by 66.87% and 47%, respectively.

[4] Lightweight Latent Reasoning for Narrative Tasks

Alexander Gurung,Nikolay Malkin,Mirella Lapata

Main category: cs.CL

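TL;DR: LiteReason is a lightweight latent reasoning method that can be combined with reinforcement learning to optimize language model reasoning on narrative tasks, sharply reducing computational cost while maintaining strong performance.

Motivation: LLMs tackle complex tasks by generating long chains of thought that act as latent variables in producing an output. Optimizing these traces is computationally expensive, especially for narrative tasks that involve many tokens. LiteReason aims to reduce this cost with a lightweight latent reasoning module.

Contribution: 1. LiteReason, a lightweight latent reasoning method that can be interleaved with standard token sampling; 2. A Reasoning Projector module that produces continuous latent tokens, letting the model skip some reasoning steps; 3. Dynamic switching between latent and discrete reasoning during reinforcement learning, which reduces final reasoning length by 77-92%.

Method: 1. The Reasoning Projector generates continuous latent tokens that replace some discrete reasoning steps; 2. During reinforcement learning, the policy model decides when to activate the projector; 3. Interleaving with standard token sampling improves the performance-computation trade-off.

Result: On plot hole detection and book chapter generation, LiteReason outperforms latent reasoning baselines and comes close to non-latent RL training, while reducing final reasoning length by 77-92%.

Insight: A lightweight latent reasoning module, with activation decided dynamically during reinforcement learning, can significantly cut computational cost while keeping performance high, suggesting a new route to efficiency on complex narrative tasks.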

Abstract: Large language models (LLMs) tackle complex tasks by generating long chains of thought or “reasoning traces” that act as latent variables in the generation of an output given a query. A model’s ability to generate such traces can be optimized with reinforcement learning (RL) to improve their utility in predicting an answer. This optimization comes at a high computational cost, especially for narrative-related tasks that involve retrieving and processing many tokens. To this end, we propose LiteReason, a latent reasoning method that can be interleaved with standard token sampling and easily combined with RL techniques. LiteReason employs a lightweight Reasoning Projector module, trained to produce continuous latent tokens that help the model ‘skip’ reasoning steps. During RL, the policy model decides when to activate the projector, switching between latent and discrete reasoning as needed. Experimental results on plot hole detection and book chapter generation show that our method outperforms latent reasoning baselines and comes close to matching non-latent RL training, while reducing final reasoning length by 77-92%. Overall, LiteReason guides RL training to a more efficient part of the performance-computation tradeoff curve.

[5] DETAIL Matters: Measuring the Impact of Prompt Specificity on Reasoning in Large Language Models

Olivia Kim

Main category: cs.CL

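TL;DR: The paper studies how prompt specificity affects the reasoning performance of large language models (LLMs). It introduces the DETAIL framework, which quantifies prompt specificity and evaluates its effect experimentally.

Motivation: Prompt design is critical to LLM reasoning performance, yet how prompt specificity (how detailed or vague a prompt is) affects model performance remains understudied.

Contribution: The DETAIL framework for evaluating LLM performance across levels of prompt specificity, together with experimental data and analysis tools.

Method: Multi-level prompts are generated with GPT-4, specificity is quantified via perplexity, and correctness is assessed with GPT-based semantic equivalence.

Result: Experiments on 30 novel reasoning tasks across GPT-4 and O3-mini show that specificity improves accuracy, especially for smaller models and procedural tasks.

Insight: The study highlights the need for adaptive prompting strategies and provides tools and data to support further research.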

Abstract: Prompt design plays a critical role in the reasoning performance of large language models (LLMs), yet the impact of prompt specificity - how detailed or vague a prompt is - remains understudied. This paper introduces DETAIL, a framework for evaluating LLM performance across varying levels of prompt specificity. We generate multi-level prompts using GPT-4, quantify specificity via perplexity, and assess correctness using GPT-based semantic equivalence. Experiments on 30 novel reasoning tasks across GPT-4 and O3-mini reveal that specificity improves accuracy, especially for smaller models and procedural tasks. Our results highlight the need for adaptive prompting strategies and provide tools and data to support further research.
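
The abstract says specificity is quantified via perplexity but does not say which scoring model is used or how the per-token probabilities are obtained. As a minimal sketch of the standard definition only (the exponential of the mean negative log-likelihood per token), assuming natural-log token probabilities are already available from some language model, the computation might look like this. The function name and the example values are hypothetical and do not come from the paper.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).

    `token_logprobs` holds natural-log probabilities that some language
    model assigned to each token of a prompt (an assumed input).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical values: every token received probability 0.25,
# so the perplexity is approximately 4.
uniform_logprobs = [math.log(0.25)] * 4
print(perplexity(uniform_logprobs))
```

Whether a more specific prompt scores higher or lower under this measure depends on the scoring model and the prompt set, so the sketch makes no claim about the direction of the relationship.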

[6] CAIRNS: Balancing Readability and Scientific Accuracy in Climate Adaptation Question Answering

Liangji Kong,Aditya Joshi,Sarvnaz Karimi

Main category: cs.CL

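TL;DR: CAIRNS is a framework that improves readability and citation reliability so that farmer advisors can obtain credible preliminary answers about climate adaptation strategies from complex web evidence. Without fine-tuning or reinforcement learning, it outperforms baselines on most metrics.

Motivation: Climate adaptation strategies are vital for agriculture, but the relevant information is spread across unstructured and structured data, making it hard for experts to obtain credible, readable answers directly. CAIRNS targets this problem.

Contribution: 1) Improved readability and citation reliability through a structured ScholarGuide prompt; 2) A consistency-weighted hybrid evaluator that leverages inter-model agreement with experts; 3) Validation on an expert-curated dataset.

Method: 1) A structured ScholarGuide prompt is used to improve readability and citation reliability; 2) A hybrid evaluator that combines inter-model agreement with expert judgment assesses the results. The approach requires no fine-tuning or reinforcement learning.

Result: CAIRNS outperforms the baselines on most metrics, and a thorough ablation study confirms the results. The paper also reports a correlation analysis between the LLM-based evaluation and human judgment.

Insight: CAIRNS shows how structured prompting and a hybrid evaluator can yield credible, readable question answering without complex training. This is particularly useful for users without a technical background, such as agricultural experts.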

Abstract: Climate adaptation strategies are proposed in response to climate change. They are practised in agriculture to sustain food production. These strategies can be found in unstructured data (for example, scientific literature from the Elsevier website) or structured (heterogeneous climate data via government APIs). We present Climate Adaptation question-answering with Improved Readability and Noted Sources (CAIRNS), a framework that enables experts – farmer advisors – to obtain credible preliminary answers from complex evidence sources from the web. It enhances readability and citation reliability through a structured ScholarGuide prompt and achieves robust evaluation via a consistency-weighted hybrid evaluator that leverages inter-model agreement with experts. Together, these components enable readable, verifiable, and domain-grounded question-answering without fine-tuning or reinforcement learning. Using a previously reported dataset of expert-curated question-answers, we show that CAIRNS outperforms the baselines on most of the metrics. Our thorough ablation study confirms the results on all metrics. To validate our LLM-based evaluation, we also report an analysis of correlations against human judgment.

[7] HealthContradict: Evaluating Biomedical Knowledge Conflicts in Language Models

Boya Zhang,Alban Bornet,Rui Yang,Nan Liu,Douglas Teodoro

Main category: cs.CL

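TL;DR: The paper presents HealthContradict, a dataset for evaluating how language models handle conflicting contexts in the biomedical domain. It finds that fine-tuned biomedical models can both exploit correct context and resist incorrect context.

Motivation: Tools for evaluating how language models handle conflicting biomedical contexts are lacking. The authors aim to fill this gap with HealthContradict and to reveal models' contextual reasoning capabilities.

Contribution: 1. HealthContradict, a dataset of 920 expert-verified instances, each with a health-related question, an evidence-supported factual answer, and two documents presenting contradictory stances; 2. An evaluation of language models under conflicting contexts.

Method: Using HealthContradict, models are tested under several prompt settings (correct, incorrect, or contradictory context), and the impact of context on model outputs is measured.

Result: Fine-tuned biomedical language models not only exploit correct context effectively but also resist incorrect context, showing stronger contextual reasoning.

Insight: The strength of these models lies not only in parametric knowledge from pretraining but also in their ability to reason over context, which is most visible under contradictory context.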

Abstract: How do language models use contextual information to answer health questions? How are their responses impacted by conflicting contexts? We assess the ability of language models to reason over long, conflicting biomedical contexts using HealthContradict, an expert-verified dataset comprising 920 unique instances, each consisting of a health-related question, a factual answer supported by scientific evidence, and two documents presenting contradictory stances. We consider several prompt settings, including correct, incorrect or contradictory context, and measure their impact on model outputs. Compared to existing medical question-answering evaluation benchmarks, HealthContradict provides greater distinctions of language models’ contextual reasoning capabilities. Our experiments show that the strength of fine-tuned biomedical language models lies not only in their parametric knowledge from pretraining, but also in their ability to exploit correct context while resisting incorrect context.

[8] When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers

Jack Lu,Ryan Teehan,Jinran Jin,Mengye Ren

Main category: cs.CL

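TL;DR: The paper systematically studies 37 LLMs spanning different sizes, families, and base vs. post-trained variants as verifiers on 9 benchmarks. It finds that cross-family verification is especially effective, that post-training reduces self-improvement but strengthens cross-family improvement, and that mathematical and logical tasks are the most verifiable.

Motivation: Prior work lacks a systematic analysis of solver-verifier interactions in LLMs; in particular, the effects of cross-family verification and of post-training on verification remain unclear.

Contribution: 1) Introduces and validates the "verifier gain" metric; 2) Analyzes how model size and post-training affect verification; 3) Characterizes differences in verifiability across tasks.

Method: Experiments across 37 models compare self-verification, within-family verification, and cross-family verification on 9 benchmarks covering multiple domains.

Result: Cross-family verification is the most effective; post-training reduces self-improvement but increases cross-family improvement; mathematical and logical tasks exhibit the highest verifiability.

Insight: The choice of verifier and the task type both matter for improving LLM performance, and post-training involves a trade-off across verification settings.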

Abstract: Large language models (LLMs) can act as both problem solvers and solution verifiers, with verifiers improving solver performance by selecting high-quality answers from a pool of candidates. However, prior studies of solver-verifier interactions have been limited, focusing mainly on self-verification and rarely examining how verifiers judge outputs from models in their own or in another model family. Modern LLMs also undergo extensive post-training, but its effect on verification remains unclear. We present a systematic study across 37 models spanning multiple families, sizes, and base vs. post-trained variants, evaluated on 9 benchmarks covering logical reasoning, structured puzzles, symbolic computation, mathematics, commonsense, factual recall, and domain knowledge. We compare self-verification with verification within the same family and across different families. To support this, we introduce and empirically validate verifier gain, a metric that predicts the performance improvements from test-time verifier-based rejection sampling. We analyze how metrics like verifier gain and false positive rate scale with model size and post-training, and characterize differences in dataset verifiability. Our findings show that cross-family verification is especially effective; post-training reduces self-improvement but strengthens cross-family improvement; and mathematical and logical tasks exhibit the highest inherent verifiability.
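
The abstract refers to test-time verifier-based rejection sampling without spelling out the procedure. A minimal sketch of one common form is given below: draw candidates from a solver until the verifier accepts one, and fall back to the last candidate if none is accepted. The solver and verifier here are toy stand-ins (hypothetical names, not LLM calls and not the paper's setup), and the attempt limit is an assumed parameter.

```python
def rejection_sample(solver, verifier, problem, max_attempts=8):
    """Return the first candidate the verifier accepts.

    Falls back to the last candidate drawn if none is accepted within
    `max_attempts` (an assumed fallback policy).
    """
    candidate = None
    for _ in range(max_attempts):
        candidate = solver(problem)
        if verifier(problem, candidate):
            return candidate
    return candidate


def make_toy_solver(answers):
    """Toy solver that replays a fixed list of guesses (stand-in for an LLM)."""
    guesses = iter(answers)
    return lambda problem: next(guesses)


def toy_verifier(problem, candidate):
    """Toy verifier: accepts only the exact sum of the two operands."""
    a, b = problem
    return candidate == a + b


# The third guess is the first one the verifier accepts.
print(rejection_sample(make_toy_solver([4, 6, 5, 7]), toy_verifier, (2, 3)))  # prints 5
```

In this framing, a verifier with a high false positive rate would accept wrong candidates early, which is one reason the study tracks that rate alongside verifier gain.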

[9] Memory-Augmented Knowledge Fusion with Safety-Aware Decoding for Domain-Adaptive Question Answering

Lei Fu,Xiang Chen,Kaige Gao,Xinyue Huang,Kejian Tong

Main category: cs.CL

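TL;DR: The paper proposes KARMA, a framework that improves domain-adaptive question answering through memory augmentation and safety-aware decoding, addressing heterogeneous knowledge fusion and safe output generation.

Motivation: In sensitive domains such as healthcare policy and government welfare, existing LLMs struggle with factual consistency and context alignment, so a QA system is needed that fuses heterogeneous knowledge while ensuring safety.

Contribution: 1. A dual-encoder architecture that fuses structured and unstructured knowledge; 2. A gated memory unit that dynamically regulates external knowledge integration; 3. A safety-aware controllable decoder that mitigates unsafe outputs.

Method: KARMA combines the dual encoder, gated memory unit, and safety-aware decoder, using safety classification and guided generation to improve the QA system's adaptability and safety.

Result: On a proprietary QA dataset, KARMA outperforms baselines in both answer quality and safety.

Insight: Combining dynamic knowledge regulation with safety-aware decoding is key to building trustworthy QA systems; sensitive domains in particular require balancing accuracy and safety.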

Abstract: Domain-specific question answering (QA) systems for services face unique challenges in integrating heterogeneous knowledge sources while ensuring both accuracy and safety. Existing large language models often struggle with factual consistency and context alignment in sensitive domains such as healthcare policies and government welfare. In this work, we introduce Knowledge-Aware Reasoning and Memory-Augmented Adaptation (KARMA), a novel framework designed to enhance QA performance in care scenarios. KARMA incorporates a dual-encoder architecture to fuse structured and unstructured knowledge sources, a gated memory unit to dynamically regulate external knowledge integration, and a safety-aware controllable decoder that mitigates unsafe outputs using safety classification and guided generation techniques. Extensive experiments on a proprietary QA dataset demonstrate that KARMA outperforms strong baselines in both answer quality and safety. This study offers a comprehensive solution for building trustworthy and adaptive QA systems in service contexts.

[10] TaleFrame: An Interactive Story Generation System with Fine-Grained Control and Large Language Models

Yunchao Wang,Guodao Sun,Zihang Fu,Zhehao Liu,Kaixing Du,Haidong Gao,Ronghua Liang

Main category: cs.CL

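TL;DR: TaleFrame is an interactive story generation system that combines large language models (LLMs) with human-computer interaction (HCI). It provides fine-grained control through structured information, addressing existing systems' failure to capture user intent precisely.

Motivation: Existing story generation systems struggle to capture and fulfil users' needs for fine-grained control, so outputs fall short of expectations. TaleFrame fills this gap by combining structured data with human-computer interaction.

Contribution: 1. Decomposes story structure into four basic units (entities, events, relationships, and story outline) for fine-grained control; 2. Proposes the JSON2Story approach, which transforms structured data into coherent stories; 3. Provides an intuitive interactive interface and an evaluation mechanism that supports iterative refinement.

Method: 1. Builds a preference dataset of 9,851 JSON-formatted entries from the Tinystories dataset; 2. Fine-tunes a local Llama model; 3. Lets users control story units through interactions such as drag-and-drop, attach, and connect; 4. Introduces a seven-dimension evaluation framework to guide story refinement.

Result: Quantitative evaluation and user studies indicate that TaleFrame improves satisfaction with, and control over, story generation, with strong results on dimensions such as creativity and structural integrity.

Insight: Combining structured data with LLMs opens new possibilities for interactive story generation, underscores the importance of expressing user intent accurately, and shows the value of iterative refinement in generation tasks.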

Abstract: With the advancement of natural language generation (NLG) technologies, creative story generation systems have gained increasing attention. However, current systems often fail to accurately translate user intent into satisfactory story outputs due to a lack of fine-grained control and unclear input specifications, limiting their applicability. To address this, we propose TaleFrame, a system that combines large language models (LLMs) with human-computer interaction (HCI) to generate stories through structured information, enabling precise control over the generation process. The innovation of TaleFrame lies in decomposing the story structure into four basic units: entities, events, relationships, and story outline. We leverage the Tinystories dataset, parsing and constructing a preference dataset consisting of 9,851 JSON-formatted entries, which is then used to fine-tune a local Llama model. By employing this JSON2Story approach, structured data is transformed into coherent stories. TaleFrame also offers an intuitive interface that supports users in creating and editing entities and events and generates stories through the structured framework. Users can control these units through simple interactions (e.g., drag-and-drop, attach, and connect), thus influencing the details and progression of the story. The generated stories can be evaluated across seven dimensions (e.g., creativity, structural integrity), with the system providing suggestions for refinement based on these evaluations. Users can iteratively adjust the story until a satisfactory result is achieved. Finally, we conduct quantitative evaluation and user studies that demonstrate the usefulness of TaleFrame. Dataset available at https://huggingface.co/datasets/guodaosun/tale-frame.

[11] ADORE: Autonomous Domain-Oriented Relevance Engine for E-commerce

Zheng Fang,Donghao Xie,Ming Pang,Chunyuan Yuan,Xue Jiang,Changping Peng,Zhangang Lin,Zheng Luo

Main category: cs.CL

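TL;DR: ADORE is an autonomous, domain-oriented relevance modeling framework for e-commerce search. By combining rule-aware relevance discrimination, error-type-aware data synthesis, and knowledge distillation, it addresses the semantic gaps and data scarcity that limit traditional methods.

Motivation: Relevance modeling in e-commerce search is challenged by semantic gaps in term-matching methods such as BM25 and by neural models' dependence on scarce domain-specific hard samples.

Contribution: Three modules: rule-aware relevance discrimination, error-type-aware data synthesis, and key-attribute-enhanced knowledge distillation, which together automate annotation and strengthen reasoning.

Method: A Chain-of-Thought LLM generates intent-aligned training data, refined with Kahneman-Tversky Optimization (KTO) to align with user behavior; adversarial examples are generated automatically to improve robustness; and knowledge distillation injects domain-specific attribute hierarchies into a deployable student model.

Result: Large-scale experiments and online A/B testing verify ADORE's effectiveness, establishing a paradigm for resource-efficient, cognitively aligned relevance modeling in industrial applications.

Insight: ADORE shows the potential of automated annotation and adversarial example generation for addressing data scarcity and semantic alignment, and offers a new approach to relevance modeling in e-commerce search.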

Abstract: Relevance modeling in e-commerce search remains challenged by semantic gaps in term-matching methods (e.g., BM25) and neural models’ reliance on the scarcity of domain-specific hard samples. We propose ADORE, a self-sustaining framework that synergizes three innovations: (1) A Rule-aware Relevance Discrimination module, where a Chain-of-Thought LLM generates intent-aligned training data, refined via Kahneman-Tversky Optimization (KTO) to align with user behavior; (2) An Error-type-aware Data Synthesis module that auto-generates adversarial examples to harden robustness; and (3) A Key-attribute-enhanced Knowledge Distillation module that injects domain-specific attribute hierarchies into a deployable student model. ADORE automates annotation, adversarial generation, and distillation, overcoming data scarcity while enhancing reasoning. Large-scale experiments and online A/B testing verify the effectiveness of ADORE. The framework establishes a new paradigm for resource-efficient, cognitively aligned relevance modeling in industrial applications.

[12] DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

DeepSeek-AI,Aixin Liu,Aoxue Mei,Bangcai Lin,Bing Xue,Bingxuan Wang,Bingzheng Xu,Bochao Wu,Bowei Zhang,Chaofan Lin,Chen Dong,Chengda Lu,Chenggang Zhao,Chengqi Deng,Chenhao Xu,Chong Ruan,Damai Dai,Daya Guo,Dejian Yang,Deli Chen,Erhang Li,Fangqi Zhou,Fangyun Lin,Fucong Dai,Guangbo Hao,Guanting Chen,Guowei Li,H. Zhang,Hanwei Xu,Hao Li,Haofen Liang,Haoran Wei,Haowei Zhang,Haowen Luo,Haozhe Ji,Honghui Ding,Hongxuan Tang,Huanqi Cao,Huazuo Gao,Hui Qu,Hui Zeng,Jialiang Huang,Jiashi Li,Jiaxin Xu,Jiewen Hu,Jingchang Chen,Jingting Xiang,Jingyang Yuan,Jingyuan Cheng,Jinhua Zhu,Jun Ran,Junguang Jiang,Junjie Qiu,Junlong Li,Junxiao Song,Kai Dong,Kaige Gao,Kang Guan,Kexin Huang,Kexing Zhou,Kezhao Huang,Kuai Yu,Lean Wang,Lecong Zhang,Lei Wang,Liang Zhao,Liangsheng Yin,Lihua Guo,Lingxiao Luo,Linwang Ma,Litong Wang,Liyue Zhang,M. S. Di,M. Y Xu,Mingchuan Zhang,Minghua Zhang,Minghui Tang,Mingxu Zhou,Panpan Huang,Peixin Cong,Peiyi Wang,Qiancheng Wang,Qihao Zhu,Qingyang Li,Qinyu Chen,Qiushi Du,Ruiling Xu,Ruiqi Ge,Ruisong Zhang,Ruizhe Pan,Runji Wang,Runqiu Yin,Runxin Xu,Ruomeng Shen,Ruoyu Zhang,S. H. Liu,Shanghao Lu,Shangyan Zhou,Shanhuang Chen,Shaofei Cai,Shaoyuan Chen,Shengding Hu,Shengyu Liu,Shiqiang Hu,Shirong Ma,Shiyu Wang,Shuiping Yu,Shunfeng Zhou,Shuting Pan,Songyang Zhou,Tao Ni,Tao Yun,Tian Pei,Tian Ye,Tianyuan Yue,Wangding Zeng,Wen Liu,Wenfeng Liang,Wenjie Pang,Wenjing Luo,Wenjun Gao,Wentao Zhang,Xi Gao,Xiangwen Wang,Xiao Bi,Xiaodong Liu,Xiaohan Wang,Xiaokang Chen,Xiaokang Zhang,Xiaotao Nie,Xin Cheng,Xin Liu,Xin Xie,Xingchao Liu,Xingkai Yu,Xingyou Li,Xinyu Yang,Xinyuan Li,Xu Chen,Xuecheng Su,Xuehai Pan,Xuheng Lin,Xuwei Fu,Y. Q. Wang,Yang Zhang,Yanhong Xu,Yanru Ma,Yao Li,Yao Li,Yao Zhao,Yaofeng Sun,Yaohui Wang,Yi Qian,Yi Yu,Yichao Zhang,Yifan Ding,Yifan Shi,Yiliang Xiong,Ying He,Ying Zhou,Yinmin Zhong,Yishi Piao,Yisong Wang,Yixiao Chen,Yixuan Tan,Yixuan Wei,Yiyang Ma,Yiyuan Liu,Yonglun Yang,Yongqiang Guo,Yongtong Wu,Yu Wu,Yuan Cheng,Yuan Ou,Yuanfan Xu,Yuduan Wang,Yue Gong,Yuhan Wu,Yuheng Zou,Yukun Li,Yunfan Xiong,Yuxiang Luo,Yuxiang You,Yuxuan Liu,Yuyang Zhou,Z. F. Wu,Z. Z. Ren,Zehua Zhao,Zehui Ren,Zhangli Sha,Zhe Fu,Zhean Xu,Zhenda Xie,Zhengyan Zhang,Zhewen Hao,Zhibin Gou,Zhicheng Ma,Zhigang Yan,Zhihong Shao,Zhixian Huang,Zhiyu Wu,Zhuoshu Li,Zhuping Zhang,Zian Xu,Zihao Wang,Zihui Gu,Zijia Zhu,Zilin Li,Zipeng Zhang,Ziwei Xie,Ziyi Gao,Zizheng Pan,Zongqing Yao,Bei Feng,Hui Li,J. L. Cai,Jiaqi Ni,Lei Xu,Meng Li,Ning Tian,R. J. Chen,R. L. Jin,S. S. Li,Shuang Zhou,Tianyu Sun,X. Q. Li,Xiangyue Jin,Xiaojin Shen,Xiaosha Chen,Xinnan Song,Xinyi Zhou,Y. X. Zhu,Yanping Huang,Yaohui Li,Yi Zheng,Yuchen Zhu,Yunxian Ma,Zhen Huang,Zhipeng Xu,Zhongyu Zhang,Dongjie Ji,Jian Liang,Jianzhong Guo,Jin Chen,Leyi Xia,Miaojun Wang,Mingming Li,Peng Zhang,Ruyi Chen,Shangmian Sun,Shaoqing Wu,Shengfeng Ye,T. Wang,W. L. Xiao,Wei An,Xianzu Wang,Xiaowen Sun,Xiaoxiang Wang,Ying Tang,Yukun Zha,Zekai Zhang,Zhe Ju,Zhen Zhang,Zihua Qu

Main category: cs.CL

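TL;DR: DeepSeek-V3.2 is an efficient, high-performing open large language model. Through DeepSeek Sparse Attention (DSA), a scalable reinforcement learning framework, and a large-scale agentic task synthesis pipeline, it achieves efficient long-context inference and strong reasoning and tool use.

Motivation: Existing LLMs face bottlenecks in computational efficiency and reasoning ability, especially in long-context and complex-task scenarios. DeepSeek-V3.2 aims to address these and improve model performance.

Contribution: 1. DSA, an attention mechanism that reduces computational complexity; 2. A scalable reinforcement learning framework with performance comparable to GPT-5; 3. A large-scale agentic task synthesis pipeline that improves generalization and instruction-following robustness.

Method: 1. DeepSeek Sparse Attention (DSA); 2. A reinforcement learning protocol with scaled post-training compute; 3. An agentic task data synthesis pipeline.

Result: The high-compute variant DeepSeek-V3.2-Speciale surpasses GPT-5, shows reasoning proficiency on par with Gemini-3.0-Pro, and achieves gold-medal performance in the 2025 IMO and IOI.

Insight: Combining sparse attention with reinforcement learning yields an efficient approach to long-context and complex tasks, and agentic task synthesis is key to improving generalization.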

Abstract: We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 performs comparably to GPT-5. Notably, our high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro, achieving gold-medal performance in both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). (3) Large-Scale Agentic Task Synthesis Pipeline: To integrate reasoning into tool-use scenarios, we developed a novel synthesis pipeline that systematically generates training data at scale. This methodology facilitates scalable agentic post-training, yielding substantial improvements in generalization and instruction-following robustness within complex, interactive environments.

[13] From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks

Changpeng Yang,Jinyang Wu,Yuchen Liu,Shuai Zhang,Yang Li,Qiliang Liang,Hongzhen Wang,Shuai Nie,Jiaming Xu,Runyu Shi,Ying Huang,Guoquan Zhang

Main category: cs.CL

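TL;DR: The paper proposes CAPO (Curriculum Advantage Policy Optimization), an adaptive curriculum mechanism based on advantage signals. It first performs imitation learning on positive-only advantage samples and then introduces negative signals, improving generalization on cross-domain reasoning tasks.

Motivation: Existing reinforcement learning methods mix positive and negative signals indiscriminately, which, especially in early stages, may lead to ambiguous guidance and limited gains. CAPO addresses this by introducing the signals in stages.

Contribution: An adaptive curriculum mechanism that first builds a foundation from positive-only advantage samples and then introduces negative signals to develop discriminative ability, thereby improving generalization.

Method: CAPO optimizes in two phases: the first uses positive-only advantage samples for imitation learning; the second introduces negative signals. It is compatible with optimization methods including GRPO, PPO, RLOO, and Reinforce++.

Result: CAPO achieves stable and significant improvements on mathematical reasoning tasks and generalizes to multimodal GUI reasoning scenarios.

Insight: Introducing signals in stages (imitation first, discrimination later) is an effective training strategy, particularly for complex reasoning tasks, and illustrates the potential of curriculum learning in reinforcement learning.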

Abstract: Reinforcement learning has emerged as a paradigm for post-training large language models, boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, reflecting better or worse performance than expected, thereby yielding both positive and negative signals for training. However, the indiscriminate mixing of the two signals in existing methods, especially from the early stages, may lead to ambiguous guidance and limited gains. To address this issue, we propose CAPO (Curriculum Advantage Policy Optimization), an adaptive curriculum mechanism based on advantage signals. The proposed mechanism bootstraps imitation learning with positive-only advantage samples to establish robust foundations, and subsequently introduces negative signals to cultivate discriminative capabilities, thereby improving generalization across complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, our method consistently achieves stable and significant improvements in mathematical reasoning tasks, and further generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization framework.
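
The two-phase idea in the abstract (positive-only advantages first, negative signals later) can be illustrated with a small sketch. This is not the paper's algorithm: the abstract describes the curriculum as adaptive but does not give the switching rule, so the fixed `warmup_steps` threshold below is an assumption, as is the mean-baseline advantage (a simplified group-relative form without variance normalization). All names are hypothetical.

```python
def group_advantages(rewards):
    """Advantage of each sample relative to the group mean reward (assumed baseline)."""
    mean_reward = sum(rewards) / len(rewards)
    return [r - mean_reward for r in rewards]


def curriculum_advantages(rewards, step, warmup_steps):
    """Sketch of a positive-first advantage curriculum.

    Before `warmup_steps` (a hypothetical fixed threshold), negative
    advantages are zeroed so only better-than-average samples contribute.
    Afterwards, the full signed advantages are used.
    """
    advantages = group_advantages(rewards)
    if step < warmup_steps:
        return [max(a, 0.0) for a in advantages]
    return advantages


rewards = [1.0, 0.0, 0.0, 1.0]  # toy binary rewards for one prompt's samples
print(curriculum_advantages(rewards, step=10, warmup_steps=100))   # [0.5, 0.0, 0.0, 0.5]
print(curriculum_advantages(rewards, step=500, warmup_steps=100))  # [0.5, -0.5, -0.5, 0.5]
```

The returned values would then weight the policy-gradient term of whichever optimizer is in use; in the early phase the update reduces to reinforcing above-average samples, which is the imitation-like behavior the abstract describes.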

[14] Spoken Conversational Agents with Large Language Models

Chao-Han Huck Yang,Andreas Stolcke,Larry Heck

Main category: cs.CL

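TL;DR: This tutorial traces the development of spoken conversational agents from traditional cascaded ASR/NLU systems to end-to-end, retrieval-grounded, and vision-grounded systems, focusing on adapting text LLMs to audio, cross-modal alignment, and joint speech-text training.

Motivation: With the rise of voice-native large language models (LLMs), key questions are how to move from cascaded systems to end-to-end ones and how to address privacy, safety, and evaluation.

Contribution: 1. Distills the technical path from cascaded ASR/NLU to end-to-end systems; 2. Covers methods for adapting text LLMs to audio; 3. Offers practical recipes for cross-modal alignment and joint training.

Method: Reviews datasets, metrics, and robustness across accents, and compares design choices such as cascaded vs. end-to-end systems, post-ASR correction, and streaming.

Result: Highlights reproducible baselines and practical recipes, and lays out a clear systems-level roadmap.

Insight: Cross-modal alignment and joint training are key techniques driving spoken conversational agents forward, while privacy and safety remain open problems.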

Abstract: Spoken conversational agents are converging toward voice-native LLMs. This tutorial distills the path from cascaded ASR/NLU to end-to-end, retrieval- and vision-grounded systems. We frame adaptation of text LLMs to audio, cross-modal alignment, and joint speech-text training; review datasets, metrics, and robustness across accents; and compare design choices (cascaded vs. E2E, post-ASR correction, streaming). We link industrial assistants to current open-domain and task-oriented agents, highlight reproducible baselines, and outline open problems in privacy, safety, and evaluation. Attendees leave with practical recipes and a clear systems-level roadmap.

[15] An Empirical Survey of Model Merging Algorithms for Social Bias Mitigation

Daiki Shirafuji,Tatsuhiko Saito,Yasutomo Kimura

Main category: cs.CL

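TL;DR: The paper empirically surveys model merging algorithms for mitigating social bias in large language models (LLMs), comparing seven algorithms on 13 open-weight models and finding a trade-off between bias mitigation and downstream task performance.

Motivation: LLMs can inherit and amplify societal biases, threatening fairness and social trust, so effective bias mitigation methods need to be studied.

Contribution: A systematic comparison of seven model merging algorithms on bias mitigation and downstream performance, giving empirical grounds for choosing a balanced strategy.

Method: Seven merging algorithms (Linear, Karcher Mean, SLERP, NuSLERP, TIES, DELLA, and Nearswap) are applied and evaluated on three bias datasets (BBQ, BOLD, and HONEST) and on SuperGLUE tasks.

Result: Linear, SLERP, and Nearswap reduce bias while maintaining overall performance, with SLERP at moderate interpolation weights the most balanced; excessive debiasing or inappropriate merging methods degrade linguistic abilities.

Insight: Model merging can mitigate bias, but bias reduction must be weighed against performance loss; SLERP is a comparatively balanced choice.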

Abstract: Large language models (LLMs) are known to inherit and even amplify societal biases present in their pre-training corpora, threatening fairness and social trust. To address this issue, recent work has explored “editing” LLM parameters to mitigate social bias with model merging approaches; however, there is no empirical comparison. In this work, we empirically survey seven algorithms: Linear, Karcher Mean, SLERP, NuSLERP, TIES, DELLA, and Nearswap, applying 13 open weight models in the GPT, LLaMA, and Qwen families. We perform a comprehensive evaluation using three bias datasets (BBQ, BOLD, and HONEST) and measure the impact of these techniques on LLM performance in downstream tasks of the SuperGLUE benchmark. We find a trade-off between bias reduction and downstream performance: methods achieving greater bias mitigation degrade accuracy, particularly on tasks requiring reading comprehension and commonsense and causal reasoning. Among the merging algorithms, Linear, SLERP, and Nearswap consistently reduce bias while maintaining overall performance, with SLERP at moderate interpolation weights emerging as the most balanced choice. These results highlight the potential of model merging algorithms for bias mitigation, while indicating that excessive debiasing or inappropriate merging methods may lead to the degradation of important linguistic abilities.
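
For readers unfamiliar with SLERP, the sketch below shows the textbook spherical linear interpolation formula applied to two flat weight vectors, where `t` plays the role of the interpolation weight mentioned in the abstract. It is a generic illustration, not the survey's implementation: merging toolkits typically apply the operation tensor by tensor and may normalize differently, and the linear fallback for near-parallel, opposite, or zero vectors is an assumption made here for numerical safety.

```python
import math


def slerp(w0, w1, t, eps=1e-8):
    """Spherical linear interpolation between two flat weight vectors.

    t=0 returns w0 and t=1 returns w1. Falls back to plain linear
    interpolation when the angle between the vectors is ill-conditioned
    (an assumed safeguard, not taken from the paper).
    """
    norm0 = math.sqrt(sum(x * x for x in w0))
    norm1 = math.sqrt(sum(x * x for x in w1))

    def lerp():
        return [(1 - t) * a + t * b for a, b in zip(w0, w1)]

    if norm0 < eps or norm1 < eps:
        return lerp()

    # Angle between the two vectors, with the cosine clamped for safety.
    cos_omega = sum(a * b for a, b in zip(w0, w1)) / (norm0 * norm1)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        return lerp()

    c0 = math.sin((1 - t) * omega) / sin_omega
    c1 = math.sin(t * omega) / sin_omega
    return [c0 * a + c1 * b for a, b in zip(w0, w1)]


# Halfway between two orthogonal unit vectors stays on the unit circle.
merged = slerp([1.0, 0.0], [0.0, 1.0], t=0.5)
print(merged)
```

Unlike linear averaging, which would give a midpoint of norm about 0.71 in this toy case, the spherical midpoint keeps unit norm, which is the usual motivation for preferring SLERP when merging weights.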

[16] CREST: Universal Safety Guardrails Through Cluster-Guided Cross-Lingual Transfer

Lavish Bansal,Naman Mishra

Main category: cs.CL

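TL;DR: CREST is an efficient multilingual safety classification model. Using cluster-based cross-lingual transfer, it trains on only 13 high-resource languages yet supports 100 languages, addressing the lack of safety guardrails for low-resource languages.

Motivation: Existing LLM safety guardrails focus mainly on high-resource languages and neglect low-resource ones, leaving many users worldwide without effective protection. CREST aims to fill this gap with a universal, language-agnostic safety system.

Contribution: CREST, a parameter-efficient multilingual safety classifier (only 0.5B parameters) that uses cluster-guided cross-lingual transfer to extend from a few high-resource languages to 100 languages, substantially improving support for low-resource languages.

Method: CREST trains on a strategically chosen subset of 13 high-resource languages and relies on cluster-based cross-lingual transfer to generalize to 100 languages.

Result: Across six safety benchmarks, CREST outperforms state-of-the-art guardrails of comparable scale and is competitive with significantly larger models (2.5B parameters and above).

Insight: Language-specific guardrails have clear limitations; universal, language-agnostic safety systems can serve global users more effectively, especially in low-resource languages.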

Abstract: Ensuring content safety in large language models (LLMs) is essential for their deployment in real-world applications. However, existing safety guardrails are predominantly tailored for high-resource languages, leaving a significant portion of the world’s population underrepresented who communicate in low-resource languages. To address this, we introduce CREST (CRoss-lingual Efficient Safety Transfer), a parameter-efficient multilingual safety classification model that supports 100 languages with only 0.5B parameters. By training on a strategically chosen subset of only 13 high-resource languages, our model utilizes cluster-based cross-lingual transfer from a few to 100 languages, enabling effective generalization to both unseen high-resource and low-resource languages. This approach addresses the challenge of limited training data in low-resource settings. We conduct comprehensive evaluations across six safety benchmarks to demonstrate that CREST outperforms existing state-of-the-art guardrails of comparable scale and achieves competitive results against models with significantly larger parameter counts (2.5B parameters and above). Our findings highlight the limitations of language-specific guardrails and underscore the importance of developing universal, language-agnostic safety systems that can scale effectively to serve global populations.

[17] Emergent Bayesian Behaviour and Optimal Cue Combination in LLMs

Julian Ma,Jun Wang,Zafeirios Fountas

Main category: cs.CL

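TL;DR: This paper investigates whether large language models (LLMs) exhibit human-like Bayesian behaviour in their implicit computational strategies, particularly in multimodal cue-integration tasks. Through behavioural experiments and a benchmark (BayesBench), the authors reveal a gap between capability and strategy in how LLMs handle uncertainty.

Details Motivation: Humans integrate multimodal signals efficiently in perceptual tasks using Bayesian strategies, but the implicit computational strategies of LLMs remain underexplored. The paper asks whether LLMs show similar Bayesian behaviour.

Contribution: 1. Introduces BayesBench, a benchmark of four psychophysics-inspired tasks; 2. Introduces a Bayesian Consistency Score for assessing the behavioural consistency of LLMs; 3. Reveals a dissociation between capability and strategy in multimodal tasks.

Method: 1. Designs the BayesBench tasks (length, location, distance, duration); 2. Evaluates nine LLMs against human judgments; 3. Analyses behavioural consistency through controlled ablations of noise, context, and instruction prompts.

Result: Although LLMs show some Bayesian behaviour, high accuracy does not necessarily come with efficient multimodal cue integration (for example, GPT-5 Mini achieves perfect text accuracy but fails to integrate visual cues efficiently).

Insight: Accuracy-centric evaluation can mask brittle uncertainty handling in LLMs; future model design should pay attention to behavioural consistency and robustness.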

Abstract: Large language models (LLMs) excel at explicit reasoning, but their implicit computational strategies remain underexplored. Decades of psychophysics research show that humans intuitively process and integrate noisy signals using near-optimal Bayesian strategies in perceptual tasks. We ask whether LLMs exhibit similar behaviour and perform optimal multimodal integration without explicit training or instruction. Adopting the psychophysics paradigm, we infer computational principles of LLMs from systematic behavioural studies. We introduce a behavioural benchmark - BayesBench: four magnitude estimation tasks (length, location, distance, and duration) over text and image, inspired by classic psychophysics, and evaluate a diverse set of nine LLMs alongside human judgments for calibration. Through controlled ablations of noise, context, and instruction prompts, we measure performance, behaviour and efficiency in multimodal cue-combination. Beyond accuracy and efficiency metrics, we introduce a Bayesian Consistency Score that detects Bayes-consistent behavioural shifts even when accuracy saturates. Our results show that while capable models often adapt in Bayes-consistent ways, accuracy does not guarantee robustness. Notably, GPT-5 Mini achieves perfect text accuracy but fails to integrate visual cues efficiently. This reveals a critical dissociation between capability and strategy, suggesting accuracy-centric benchmarks may over-index on performance while missing brittle uncertainty handling. These findings reveal emergent principled handling of uncertainty and highlight the correlation between accuracy and Bayesian tendencies. We release our psychophysics benchmark and consistency metric (https://bayes-bench.github.io) as evaluation tools and to inform future multimodal architecture designs.
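
For readers unfamiliar with the psychophysics background, "optimal cue combination" usually refers to the standard result for independent Gaussian cues: each cue is weighted by its precision (inverse variance), and the fused estimate is more reliable than any single cue. The sketch below illustrates that textbook rule only. The function name `combine_cues` and the example numbers are hypothetical, and the benchmark's actual scoring is not reproduced here.

```python
def combine_cues(estimates, sigmas):
    """Precision-weighted fusion of independent Gaussian cues.

    estimates: per-cue point estimates (e.g. a text cue and an image cue)
    sigmas:    per-cue standard deviations (noise levels)
    Returns (fused_estimate, fused_sigma).
    """
    precisions = [1.0 / (s * s) for s in sigmas]
    total = sum(precisions)
    fused = sum(p * x for p, x in zip(precisions, estimates)) / total
    fused_sigma = (1.0 / total) ** 0.5
    return fused, fused_sigma

# Hypothetical example: a reliable cue (sigma 1) and a noisier cue (sigma 2).
fused, fused_sigma = combine_cues([10.0, 14.0], [1.0, 2.0])
```

An observer that behaves this way shifts toward the more reliable cue as noise on the other increases, which is the kind of behavioural shift a consistency metric can probe even when raw accuracy has saturated.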

[18] SurveyEval: Towards Comprehensive Evaluation of LLM-Generated Academic Surveys

Jiahao Zhao,Shuaixing Zhang,Nan Xu,Lei Wang

Main category: cs.CL

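TL;DR: This paper introduces SurveyEval, a comprehensive benchmark for evaluating LLM-generated academic surveys along three dimensions (overall quality, outline coherence, and reference accuracy), with evaluation extended across 7 subjects.

Details Motivation: As LLM-based automatic survey systems develop, evaluating such complex systems has become a significant challenge, which calls for a comprehensive evaluation method.

Contribution: Proposes the SurveyEval benchmark, which evaluates along three dimensions (overall quality, outline coherence, and reference accuracy) across 7 subjects and strengthens the alignment between automatic evaluation and human judgment.

Method: Augments the LLM-as-a-Judge framework with human references and compares general long-text systems against specialized survey-generation systems.

Result: Specialized survey-generation systems produce substantially higher-quality surveys than general systems, supporting the benchmark's usefulness and breadth of application.

Insight: SurveyEval provides a standardized evaluation tool for automatic survey systems and points to directions for improving them.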

Abstract: LLM-based automatic survey systems are transforming how users acquire information from the web by integrating retrieval, organization, and content synthesis into end-to-end generation pipelines. While recent works focus on developing new generation pipelines, how to evaluate such complex systems remains a significant challenge. To this end, we introduce SurveyEval, a comprehensive benchmark that evaluates automatically generated surveys across three dimensions: overall quality, outline coherence, and reference accuracy. We extend the evaluation across 7 subjects and augment the LLM-as-a-Judge framework with human references to strengthen evaluation-human alignment. Evaluation results show that while general long-text or paper-writing systems tend to produce lower-quality surveys, specialized survey-generation systems are able to deliver substantially higher-quality results. We envision SurveyEval as a scalable testbed to understand and improve automatic survey systems across diverse subjects and evaluation criteria.

[19] PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models

Robert Belanec,Ivan Srba,Maria Bielikova

Main category: cs.CL

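TL;DR: PEFT-Factory is a unified parameter-efficient fine-tuning (PEFT) framework that addresses the difficulty of replicating, deploying, and comparing fine-tuning methods for large language models (LLMs).

Details Motivation: As LLMs grow, traditional full-parameter fine-tuning becomes inefficient and expensive; many PEFT methods exist, but they are hard to replicate and compare.

Contribution: Proposes PEFT-Factory, a modular, unified PEFT framework that supports 19 PEFT methods, 27 datasets, and multiple evaluation metrics.

Method: Uses a modular design that supports both off-the-shelf and custom PEFT methods and provides a stable environment with a rich toolset.

Result: PEFT-Factory provides a replicable, controlled benchmarking environment for PEFT methods.

Insight: A unified framework design and a rich toolset improve the practicality and comparability of PEFT methods.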

Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods address the increasing size of Large Language Models (LLMs). Currently, many newly introduced PEFT methods are challenging to replicate, deploy, or compare with one another. To address this, we introduce PEFT-Factory, a unified framework for efficient fine-tuning of LLMs using both off-the-shelf and custom PEFT methods. While its modular design supports extensibility, it natively provides a representative set of 19 PEFT methods, 27 classification and text generation datasets addressing 12 tasks, and both standard and PEFT-specific evaluation metrics. As a result, PEFT-Factory provides a ready-to-use, controlled, and stable environment, improving replicability and benchmarking of PEFT methods. PEFT-Factory is a downstream framework that originates from the popular LLaMA-Factory, and is publicly available at https://github.com/kinit-sk/PEFT-Factory

[20] Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension

Juexi Shao,Siyou Li,Yujian Gan,Chris Madge,Vanja Karan,Massimo Poesio

Main category: cs.CL

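TL;DR: The paper proposes a three-tier data synthesis framework to address data scarcity in dialogue-based Generalized Referring Expression Comprehension (GREC) and improves model performance under distribution shift.

Details Motivation: Existing systems perform poorly under distribution shift between training and evaluation domains, a challenge exacerbated by the scarcity of annotated dialogue grounding data.

Contribution: Proposes a three-tier data synthesis method that balances realism and controllability and produces scalable supervision for dialogue-conditioned grounding.

Method: Generates high-quality training data with the three-tier synthesis framework and fine-tunes models on it.

Result: On standard evaluation metrics, the method yields consistent, substantial improvements over prior approaches.

Insight: Balancing realism and controllability in data synthesis is key, and the approach offers a scalable data generation recipe for similar tasks.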

Abstract: Dialogue-Based Generalized Referring Expressions Comprehension (GREC) requires models to ground the expression and unlimited targets in complex visual scenes while resolving coreference across a long dialogue context. However, existing systems struggle under distribution shift between training and evaluation domains, a gap exacerbated by the scarcity of annotated dialogue grounding data. We address this challenge with a three-tier data-synthesis method that balances realism and controllability to produce scalable supervision for dialogue-conditioned grounding. Fine-tuning on the synthesized data yields consistent, substantial improvements over prior approaches across standard evaluation metrics.

[21] SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment

Yixuan Tang,Yi Yang

Main category: cs.CL

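TL;DR: The paper proposes SR-GRPO, a method that uses stable rank as an intrinsic geometric reward signal for aligning large language models, avoiding reliance on external supervision.

Details Motivation: Current alignment methods rely on human annotations or reward models, which suffer from scarcity, subjectivity, reward hacking, and prompt sensitivity. The work seeks an intrinsic quality signal that needs no external supervision.

Contribution: Proposes stable rank as an intrinsic geometric reward signal and designs SR-GRPO, which uses stable rank for reinforcement-learning alignment and substantially improves model performance.

Method: Quantifies stable rank as the ratio of total variance to dominant-direction variance in the hidden states, uses it as a quality signal, and incorporates it into SR-GRPO for reinforcement learning.

Result: Stable rank reaches 84.04% accuracy on RewardBench and improves task accuracy by an average of 11.3 percentage points over greedy decoding via Best-of-N sampling. SR-GRPO improves Qwen2.5-1.5B-Instruct by 10% on STEM and 19% on mathematical reasoning, outperforming learned reward models and self-evaluation baselines.

Insight: Internal model geometry can serve as a quality signal, opening a path to alignment without external supervision.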

Abstract: Aligning Large Language Models (LLMs) with human preferences typically relies on external supervision, which faces critical limitations: human annotations are scarce and subjective, reward models are vulnerable to reward hacking, and self-evaluation methods suffer from prompt sensitivity and biases. In this work, we propose stable rank, an intrinsic, annotation-free quality signal derived from model representations. Stable rank measures the effective dimensionality of hidden states by computing the ratio of total variance to dominant-direction variance, capturing quality through how information distributes across representation dimensions. Empirically, stable rank achieves 84.04% accuracy on RewardBench and improves task accuracy by an average of 11.3 percentage points over greedy decoding via Best-of-N sampling. Leveraging this insight, we introduce Stable Rank Group Relative Policy Optimization (SR-GRPO), which uses stable rank as a reward signal for reinforcement learning. Without external supervision, SR-GRPO improves Qwen2.5-1.5B-Instruct by 10% on STEM and 19% on mathematical reasoning, outperforming both learned reward models and self-evaluation baselines. Our findings demonstrate that quality signals can be extracted from internal model geometry, offering a path toward scalable alignment without external supervision.
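
The abstract describes stable rank as the ratio of total variance to dominant-direction variance. In linear algebra the stable rank of a matrix H is usually defined as ||H||_F^2 / ||H||_2^2, the sum of squared singular values divided by the largest one. The sketch below computes that standard quantity for a small matrix of hidden states using power iteration. The function name, the pure-Python representation, and the absence of any centering or layer selection are assumptions; the paper's exact preprocessing is not given in the abstract.

```python
import math
import random

def stable_rank(hidden, iters=100, seed=0):
    """Stable rank ||H||_F^2 / ||H||_2^2 of a matrix given as a list of rows.

    hidden: one row per token, one column per hidden dimension.
    The largest squared singular value is estimated by power iteration
    on H^T H, starting from a seeded random vector.
    """
    rows, cols = len(hidden), len(hidden[0])
    frob_sq = sum(x * x for row in hidden for x in row)  # total "energy"
    if frob_sq == 0.0:
        return 0.0
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(cols)]
    for _ in range(iters):
        hv = [sum(row[j] * v[j] for j in range(cols)) for row in hidden]
        w = [sum(hidden[i][j] * hv[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    hv = [sum(row[j] * v[j] for j in range(cols)) for row in hidden]
    top_sq = sum(x * x for x in hv)  # energy along the dominant direction
    if top_sq == 0.0:
        raise ValueError("power iteration failed to find a dominant direction")
    return frob_sq / top_sq

# Energy spread evenly over 3 directions vs. concentrated in 1 direction.
spread = stable_rank([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
collapsed = stable_rank([[1.0, 2.0], [2.0, 4.0]])
```

A value near 1 means the representation has collapsed onto one direction, while larger values mean information is spread over more dimensions, which matches the abstract's framing of effective dimensionality.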

[22] BOOM: Beyond Only One Modality KIT’s Multimodal Multilingual Lecture Companion

Sai Koneru,Fabian Retkowski,Christian Huber,Lukas Hilgert,Seymanur Akti,Enes Yavuz Ugan,Alexander Waibel,Jan Niehues

Main category: cs.CL

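TL;DR: BOOM is a multimodal multilingual lecture companion that jointly translates lecture audio and slides to produce synchronized multimodal outputs (translated text, localized slides, and synthesized speech), giving learners a complete cross-lingual learning experience.

Details Motivation: With the globalization of education and the rapid growth of online learning, localizing educational content has become a key challenge. Lecture materials are inherently multimodal (speech and slides), so systems must handle multiple input modalities to provide a complete learning experience.

Contribution: BOOM's main contribution is an end-to-end multimodal multilingual translation system that translates lecture audio and slides together, produces output in three modalities (text, slides, speech), and aims to preserve the original content in full. The system also benefits downstream tasks such as summarization and question answering.

Method: BOOM takes an end-to-end approach, jointly processing audio and slide inputs to produce synchronized multimodal outputs: text translation, slide localization (preserving visual elements), and speech synthesis. The authors also release their Slide Translation code and integrate it into Lecture Translator.

Result: Experiments show that BOOM produces multimodal translations effectively and that slide-aware transcripts also help downstream tasks such as summarization and question answering.

Insight: Multimodal translation not only improves the cross-lingual learning experience but also adds value for other natural language processing tasks, which suggests that preserving and exploiting multimodal information matters when processing educational content.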

Abstract: The globalization of education and rapid growth of online learning have made localizing educational content a critical challenge. Lecture materials are inherently multimodal, combining spoken audio with visual slides, which requires systems capable of processing multiple input modalities. To provide an accessible and complete learning experience, translations must preserve all modalities: text for reading, slides for visual understanding, and speech for auditory learning. We present BOOM, a multimodal multilingual lecture companion that jointly translates lecture audio and slides to produce synchronized outputs across three modalities: translated text, localized slides with preserved visual elements, and synthesized speech. This end-to-end approach enables students to access lectures in their native language while aiming to preserve the original content in its entirety. Our experiments demonstrate that slide-aware transcripts also yield cascading benefits for downstream tasks such as summarization and question answering. We release our Slide Translation code at https://github.com/saikoneru/image-translator and integrate it in Lecture Translator at https://gitlab.kit.edu/kit/isl-ai4lt/lt-middleware/ltpipeline. All released code and models are licensed under the MIT License.

[23] Cross-Lingual Prompt Steerability: Towards Accurate and Robust LLM Behavior across Languages

Lechen Zhang,Yusheng Zhou,Tolga Ergen,Lajanugen Logeswaran,Moontae Lee,David Jurgens

Main category: cs.CL

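TL;DR: This paper studies how system prompts steer large language model (LLM) behavior in multilingual settings, proposes a unified four-dimensional evaluation framework, and finds experimentally that certain prompt components (such as CoT, emotion, and scenario) correlate with robust multilingual behavior. The authors also develop a multilingual prompt optimization framework and demonstrate its effectiveness.

Details Motivation: Real-world deployments need a single prompt that steers LLM behavior reliably across languages, yet prior work has focused mainly on English. The paper therefore explores how system prompts can achieve accurate and robust multilingual behavior.

Contribution: 1. Proposes a unified four-dimensional framework for evaluating system prompts in multilingual environments; 2. Identifies prompt components associated with robust multilingual behavior; 3. Develops a multilingual prompt optimization framework that automatically improves all metrics by 5-10%; 4. Analyzes over 10 million reasoning units, showing that higher-performing prompts induce more structured reasoning.

Method: 1. Designs the four-dimensional evaluation framework; 2. Runs large-scale experiments on five languages, three LLMs, and three benchmarks; 3. Develops the prompt optimization framework; 4. Validates prompt effectiveness by analyzing reasoning patterns.

Result: Optimized prompts improve performance by 5-10% in multilingual settings and reduce unnecessary language switching.

Insight: Higher-performing system prompts induce more structured and consistent reasoning patterns while reducing language switching; prompt optimization is an effective route to robust multilingual behavior.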

Abstract: System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time. While prior work has focused on English-only settings, real-world deployments benefit from having a single prompt to operate reliably across languages. This paper presents a comprehensive study of how different system prompts steer models toward accurate and robust cross-lingual behavior. We propose a unified four-dimensional evaluation framework to assess system prompts in multilingual environments. Through large-scale experiments on five languages, three LLMs, and three benchmarks, we uncover that certain prompt components, such as CoT, emotion, and scenario, correlate with robust multilingual behavior. We develop a prompt optimization framework for multilingual settings and show it can automatically discover prompts that improve all metrics by 5-10%. Finally, we analyze over 10 million reasoning units and find that more performant system prompts induce more structured and consistent reasoning patterns, while reducing unnecessary language-switching. Together, we highlight system prompt optimization as a scalable path to accurate and robust multilingual LLM behavior.

[24] Think in Parallel, Answer as One: Logit Averaging for Open-Ended Reasoning

Haonan Wang,Chao Du,Kenji Kawaguchi,Tianyu Pang

Main category: cs.CL

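TL;DR: ThinkMerge is a training-free, plug-and-play decoding strategy that runs multiple reasoning traces in parallel and averages their next-token logits at synchronization points to produce a single coherent output, improving performance on open-ended reasoning tasks.

Details Motivation: Majority voting works for closed-ended question answering, but it does not apply to open-ended reasoning tasks (such as code generation and web-based deep research), where a "majority" is hard to define. A new way to aggregate parallel reasoning is therefore needed.

Contribution: Introduces ThinkMerge, a training-free decoding strategy that generates output by averaging the logits of parallel reasoning traces, improving performance on open-ended reasoning tasks.

Method: ThinkMerge runs K parallel reasoning traces, averages their next-token logits at synchronization points to produce a single output, and remains compatible with standard decoding techniques.

Result: ThinkMerge matches or surpasses majority voting on AIME and GPQA; on LiveCodeBench (hard), pass@1 improves by 8.28% for DeepCoder-14B-Preview and 7.58% for Qwen3-8B; it also consistently improves web-based deep-research agents.

Insight: Averaging logits across parallel reasoning traces can improve coherence and performance on open-ended tasks without relying on majority voting over complete outputs.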

Abstract: Majority voting has proven effective for close-ended question answering by aggregating parallel reasoning traces. However, it is not directly applicable to open-ended reasoning, such as code generation and web-based deep research, where a “majority” over complete solutions is ill-defined. We introduce ThinkMerge, a training-free, plug-and-play decoding strategy that runs K parallel reasoning traces and averages their next-token logits at synchronization points to produce a single coherent output. ThinkMerge integrates seamlessly with vLLM/SGLang and remains compatible with standard decoding techniques such as Top-p/Top-k. Empirically, it matches or surpasses majority voting on AIME and GPQA, while delivering consistent gains on open-ended coding tasks: on LiveCodeBench (hard), pass@1 improves by +8.28% for DeepCoder-14B-Preview and +7.58% for Qwen3-8B. Beyond code, we further show that ThinkMerge improves web-based deep-research agents (e.g., WebSailor-7B/32B) across GAIA, BrowseComp-en/zh, and XbenchDeepSearch. These results demonstrate that parallel test-time scaling can benefit open-ended reasoning without relying on voting over complete outputs.
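
The core aggregation step, averaging next-token logits across K parallel traces and then decoding a single token, can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the logits are toy numbers, greedy selection stands in for whatever sampler is used, and the placement of synchronization points is not modeled.

```python
def average_logits(trace_logits):
    """Element-wise mean of next-token logits from K parallel traces.

    trace_logits: list of K lists, each of vocabulary size V.
    """
    k = len(trace_logits)
    vocab = len(trace_logits[0])
    return [sum(logits[i] for logits in trace_logits) / k for i in range(vocab)]

def merged_greedy_token(trace_logits):
    """Pick the single next token that maximizes the averaged logits."""
    avg = average_logits(trace_logits)
    return max(range(len(avg)), key=avg.__getitem__)

# Toy example with K = 3 traces over a 3-token vocabulary.
# Trace 0 alone would pick token 0, but the averaged logits favor token 1.
traces = [
    [2.0, 1.0, 0.0],
    [0.0, 1.5, 0.0],
    [0.0, 1.5, 0.5],
]
token = merged_greedy_token(traces)
```

Because the merge happens at the token level, the result is one coherent sequence rather than K complete answers that would need a vote, which is why the approach still applies when a "majority" over full solutions is ill-defined.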

[25] Fast-Decoding Diffusion Language Models via Progress-Aware Confidence Schedules

Amr Mohamed,Yang Zhang,Michalis Vazirgiannis,Guokan Shang

Main category: cs.CL

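TL;DR: SchED is a training-free early-exit algorithm that substantially speeds up decoding of diffusion large language models by applying a progress-dependent confidence threshold.

Details Motivation: Diffusion large language models (dLLMs) decode slowly, and their iterative sampling consumes substantial compute, which limits their practical utility. SchED aims to address this.

Contribution: Proposes SchED, a model-agnostic early-exit algorithm that accelerates decoding while preserving model performance, greatly improving dLLM efficiency.

Method: SchED aggregates full-span logit margins and halts decoding once a smooth, progress-dependent confidence threshold is met.

Result: On instruction-tuned models, SchED achieves 3.8-4.0x speedups while retaining 99.8-100% of baseline performance; on base models, speedups are consistent with 99.1-100% performance retention.

Insight: Instruction tuning speeds up the decay of predictive entropy; by turning genuine confidence stabilization into computational savings, SchED makes dLLMs substantially more practical.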

Abstract: Diffusion large language models (dLLMs) offer a promising alternative to autoregressive models, but their practical utility is severely hampered by slow, iterative sampling. We present SchED, a training-free, model-agnostic early-exit algorithm that aggregates full-span logit margins and halts decoding once a smooth, progress-dependent confidence threshold is met. We evaluated SchED on two dLLM families (Dream and LLaDA), in base and instruction-tuned variants across ten benchmarks spanning downstream tasks including multiple-choice question answering (MCQ), math, long-form QA/summarization, and translation. SchED delivers large, stable accelerations: on instruction-tuned models, it achieves $3.8$-$4.0\times$ speedups while retaining $99.8$-$100\%$ of the baseline score on average. On base models, SchED yields consistent speedup gains with $99.1$-$100\%$ performance retention, with up to $2.34\times$ under more aggressive settings. Using a conservative speed metric that heavily penalizes quality loss (QPS, $\gamma = 4$), we show that SchED is robust and clearly outperforms prior confidence-based early-exit methods, which break down on long-form generation. An entropy analysis of the model’s token predictions reveals that instruction tuning speeds up the decay of predictive entropy. By turning genuine confidence stabilization into computational savings, SchED makes dLLM decoding substantially more efficient.
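
The early-exit rule described above (aggregate logit margins over the full span, compare against a threshold that depends on decoding progress) can be sketched as follows. Everything specific here is an assumption made for illustration: the margin is taken as top-1 minus top-2 logit, the aggregate is a plain mean, and the schedule is a linear decay from `tau_start` to `tau_end`. The abstract only states that the threshold is smooth and progress-dependent, so the paper's actual schedule and aggregation may differ.

```python
def logit_margin(logits):
    """Gap between the largest and second-largest logit at one position."""
    top1, top2 = sorted(logits, reverse=True)[:2]
    return top1 - top2

def threshold(progress, tau_start=4.0, tau_end=1.0):
    """Hypothetical schedule: linear from tau_start (progress 0) to tau_end (progress 1)."""
    return tau_start + (tau_end - tau_start) * progress

def should_exit(span_logits, step, total_steps):
    """Exit when the mean margin over the whole span clears the current threshold.

    span_logits: one logit vector per position in the generated span.
    """
    progress = step / total_steps
    mean_margin = sum(logit_margin(l) for l in span_logits) / len(span_logits)
    return mean_margin >= threshold(progress)

# Toy span with two positions, each with a margin of 3.0.
span = [[5.0, 2.0, 0.0], [4.0, 1.0, 0.5]]
early = should_exit(span, step=1, total_steps=10)   # threshold 3.7
later = should_exit(span, step=5, total_steps=10)   # threshold 2.5
```

With this particular schedule the same confidence level is not enough to stop at step 1 but is enough at step 5, so the remaining denoising steps would be skipped.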

[26] AutoNeural: Co-Designing Vision-Language Models for NPU Inference

Wei Chen,Liangmin Wu,Yunhai Hu,Zhiyuan Li,Zhiyuan Cheng,Yicheng Qian,Lingyue Zhu,Zhipeng Hu,Luoyi Liang,Qiang Tang,Zhen Liu,Han Yang

Main category: cs.CL

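TL;DR: AutoNeural proposes a vision-language model architecture designed for NPUs that improves quantization stability and inference efficiency by redesigning the vision encoder and the language decoder.

Details Motivation: Existing vision-language models (VLMs) are optimized mainly for GPUs and perform poorly on NPUs because of the quantization brittleness of ViTs and the I/O bottleneck of autoregressive attention.

Contribution: AutoNeural proposes an NPU-native VLM architecture, comprising a MobileNetV5-style vision encoder and a language decoder that incorporates state-space model (SSM) principles, which improves quantization stability and inference efficiency.

Method: 1. Replaces the standard ViT with MobileNetV5-style depthwise separable convolutions to keep INT4/8/16 quantization stable; 2. Uses a language decoder that combines SSM and Transformer layers, with gated convolutions for linear-time complexity.

Result: AutoNeural reduces the vision encoder's quantization error by up to 7x and end-to-end latency by 14x relative to conventional baselines, and delivers 3x decoding speed and a 4x longer context window.

Insight: Tailoring model topology to NPU constraints is key to multimodal edge intelligence; quantization stability and compute efficiency need to be optimized together.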

Abstract: While Neural Processing Units (NPUs) offer high theoretical efficiency for edge AI, state-of-the-art Vision–Language Models (VLMs) tailored for GPUs often falter on these substrates. We attribute this hardware-model mismatch to two primary factors: the quantization brittleness of Vision Transformers (ViTs) and the I/O-bound nature of autoregressive attention mechanisms, which fail to utilize the high arithmetic throughput of NPUs. To bridge this gap, we propose AutoNeural, an NPU-native VLM architecture co-designed for integer-only inference. We replace the standard ViT encoder with a MobileNetV5-style backbone utilizing depthwise separable convolutions, which ensures bounded activation distributions for stable INT4/8/16 quantization. Complementing this, our language backbone integrates State-Space Model (SSM) principles with Transformer layers, employing efficient gated convolutions to achieve linear-time complexity. This hybrid design eliminates the heavy memory I/O overhead of Key-Value caching during generation. Our approach delivers substantial efficiency gains, reducing the quantization error of the vision encoder by up to 7x and end-to-end latency by 14x compared to conventional baselines. AutoNeural also delivers 3x decoding speed and a 4x longer context window than the baseline. We validate these improvements via a real-world automotive case study on the Qualcomm SA8295P SoC, demonstrating real-time performance for cockpit applications. Our results highlight that rethinking model topology specifically for NPU constraints is a prerequisite for robust multi-modal edge intelligence.

[27] Fine-Tuned Large Language Models for Logical Translation: Reducing Hallucinations with Lang2Logic

Muyu Pan,Dheeraj Kodakandla,Mahfuza Farooque

Main category: cs.CL

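TL;DR: The paper proposes a framework that combines natural language processing techniques with a fine-tuned large language model to convert English sentences into logical expressions and then into Conjunctive Normal Form (CNF), with the aim of reducing hallucinations in logical translation tasks.

Details Motivation: Large language models (LLMs) have made notable progress in natural language processing, but hallucinations (incorrect outputs) are a key problem in logical translation tasks. The authors propose a framework intended to improve the precision and reliability of logical translation.

Contribution: The main contributions are: 1) a framework that combines classical NLP techniques with a fine-tuned LLM to translate natural language into logical expressions; 2) evidence that the fine-tuned model reduces hallucinations and yields reliable CNF generation.

Method: The method: 1) converts sentences into logical expressions using a self-defined grammar and symbolic computation libraries; 2) uses a fine-tuned LLM to reduce hallucinations during translation; 3) converts the logical expressions into CNF for satisfiability solving.

Result: Early experiments indicate that the fine-tuned model, trained on different grammar settings, corrects the same types of hallucinations made by the original model and yields more reliable CNF generation.

Insight: Combining classical NLP techniques with fine-tuned LLMs can address hallucinations in logical translation, offering a new approach for automated reasoning and software specification checking.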

Abstract: Recent advances in natural language processing (NLP), particularly large language models (LLMs), have motivated the automatic translation of natural language statements into formal logic without human intervention. This enables automated reasoning and facilitates debugging, finding loop invariants, and adhering to specifications in software systems. However, hallucinations (incorrect outputs generated by LLMs) are challenging, particularly for logical translation tasks requiring precision. This work introduces a novel framework that inputs English sentences, converts them into logical expressions, and then translates them into Conjunctive Normal Form (CNF) for satisfiability solving. It employs classical NLP techniques with self-defined grammar, symbolic computation libraries, and a fine-tuned language model to reduce hallucinations. In the early experiments, we observed that the fine-tuned model, trained on different grammar settings, could intentionally correct the same types of hallucinations made by the original model. Thus, it provides reliable CNF generation.

[28] The Moral Consistency Pipeline: Continuous Ethical Evaluation for Large Language Models

Saeid Jamshidi,Kawser Wazed Nafi,Arghavan Moradi Dakhel,Negar Shahabi,Foutse Khomh

Main category: cs.CL

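TL;DR: The paper proposes the Moral Consistency Pipeline (MoCoP), a framework for continuously evaluating the moral consistency of large language models (LLMs) through three layers of analysis (lexical integrity analysis, semantic risk estimation, and reasoning-based judgment modeling), enabling dynamic, unsupervised ethical evaluation.

Details Motivation: Existing alignment frameworks rely on static datasets and post-hoc evaluation and so struggle to capture how ethical reasoning changes over time; a dynamic, continuous method for evaluating the moral consistency of LLMs is needed.

Contribution: MoCoP proposes a dataset-free, closed-loop method for dynamic ethical evaluation that combines three layers of analysis and reveals how moral consistency relates to toxicity and response latency, giving AI ethics research a new tool.

Method: MoCoP combines lexical integrity analysis, semantic risk estimation, and reasoning-based judgment modeling, and performs dynamic evaluation by autonomously generating and evaluating ethical scenarios without external supervision.

Result: Experiments on GPT-4-Turbo and DeepSeek suggest that MoCoP captures longitudinal ethical behavior, with a strong inverse correlation between the ethical and toxicity dimensions (rET = -0.81) and a near-zero association with response latency (rEL ≈ 0).

Insight: Moral coherence and linguistic safety are stable characteristics of model behavior rather than short-term fluctuations; dynamic ethical evaluation is a viable direction for future research on the ethics of AI systems.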

Abstract: The rapid advancement and adaptability of Large Language Models (LLMs) highlight the need for moral consistency, the capacity to maintain ethically coherent reasoning across varied contexts. Existing alignment frameworks, structured approaches designed to align model behavior with human ethical and social norms, often rely on static datasets and post-hoc evaluations, offering limited insight into how ethical reasoning may evolve across different contexts or temporal scales. This study presents the Moral Consistency Pipeline (MoCoP), a dataset-free, closed-loop framework for continuously evaluating and interpreting the moral stability of LLMs. MoCoP combines three supporting layers: (i) lexical integrity analysis, (ii) semantic risk estimation, and (iii) reasoning-based judgment modeling within a self-sustaining architecture that autonomously generates, evaluates, and refines ethical scenarios without external supervision. Our empirical results on GPT-4-Turbo and DeepSeek suggest that MoCoP effectively captures longitudinal ethical behavior, revealing a strong inverse relationship between ethical and toxicity dimensions (correlation rET = -0.81, p value less than 0.001) and a near-zero association with response latency (correlation rEL approximately equal to 0). These findings demonstrate that moral coherence and linguistic safety tend to emerge as stable and interpretable characteristics of model behavior rather than short-term fluctuations. Furthermore, by reframing ethical evaluation as a dynamic, model-agnostic form of moral introspection, MoCoP offers a reproducible foundation for scalable, continuous auditing and advances the study of computational morality in autonomous AI systems.

cs.CV

[29] Leveraging AI multimodal geospatial foundation models for improved near-real-time flood mapping at a global scale

Mirela G. Tulbure,Julio Caineta,Mark Broich,Mollie D. Gaines,Philippe Rufin,Leon-Friedrich Thomas,Hamed Alemohammad,Jan Hemmerling,Patrick Hostert

Main category: cs.CV

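TL;DR: The paper fine-tunes TerraMind, a multimodal geospatial foundation model (GFM), on co-located Sentinel-1 and Sentinel-2 data to improve near-real-time flood mapping at a global scale.

Details Motivation: Floods are among the most damaging weather-related hazards worldwide, and existing flood mapping depends on labeled data and model generalization. GFMs offer better generalizability through large-scale self-supervised pretraining, but their performance on diverse global flood events remains poorly understood.

Contribution: The main contributions are: 1) fine-tuning the GFM TerraMind on the multimodal FloodsNet dataset; 2) evaluating four model configurations on 85 flood events worldwide; 3) showing the benefit of combining multimodal optical and SAR data and providing a global-scale evaluation of the potential of GFMs for flood mapping.

Method: The method: 1) fine-tunes TerraMind; 2) compares four configurations (base vs. large models; frozen vs. unfrozen backbones); 3) trains a U-Net on FloodsNet and Sen1Floods11 as a baseline; 4) evaluates accuracy, precision, and recall.

Result: 1) The base-unfrozen configuration gives the best balance between computational cost and performance; 2) the large unfrozen model achieves the highest recall among the GFM configurations; 3) models trained on FloodsNet outperform the Sen1Floods11-trained example in recall; 4) the U-Net achieves higher recall than all GFM configurations, with slightly lower accuracy and precision.

Insight: The study shows the potential of multimodal data and GFM fine-tuning for flood mapping while also noting the current limitations of GFMs, and it informs future work on climate adaptation and disaster resilience.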

Abstract: Floods are among the most damaging weather-related hazards, and in 2024, the warmest year on record, extreme flood events affected communities across five continents. Earth observation (EO) satellites provide critical, frequent coverage for mapping inundation, yet operational accuracy depends heavily on labeled datasets and model generalization. Recent Geospatial Foundation Models (GFMs), such as ESA-IBM’s TerraMind, offer improved generalizability through large-scale self-supervised pretraining, but their performance on diverse global flood events remains poorly understood. We fine-tune TerraMind for flood extent mapping using FloodsNet, a harmonized multimodal dataset containing co-located Sentinel-1 (Synthetic Aperture Radar, SAR data) and Sentinel-2 (optical) imagery for 85 flood events worldwide. We tested four configurations (base vs. large models; frozen vs. unfrozen backbones) and compared against the TerraMind Sen1Floods11 example and a U-Net trained on both FloodsNet and Sen1Floods11. The base-unfrozen configuration provided the best balance of accuracy, precision, and recall at substantially lower computational cost than the large model. The large unfrozen model achieved the highest recall. Models trained on FloodsNet outperformed the Sen1Floods11-trained example in recall with similar overall accuracy. U-Net achieved higher recall than all GFM configurations, though with slightly lower accuracy and precision. Our results demonstrate that integrating multimodal optical and SAR data and fine-tuning a GFM can enhance near-real-time flood mapping. This study provides one of the first global-scale evaluations of a GFM for flood segmentation, highlighting both its potential and current limitations for climate adaptation and disaster resilience.
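
Several of the comparisons above turn on the difference between accuracy, precision, and recall for a binary flood mask. As a reminder of how the three can diverge, the sketch below computes them from flattened per-pixel predictions using the standard definitions. The function name and the toy masks are hypothetical, and the study's own evaluation code is not reproduced.

```python
def mask_metrics(pred, truth):
    """Accuracy, precision, and recall for flattened binary masks (1 = flooded)."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    tn = sum(1 for p, t in zip(pred, truth) if not p and not t)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # how many predicted-flood pixels are right
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # how many true flood pixels are found
    return accuracy, precision, recall

# Toy 5-pixel example.
acc, prec, rec = mask_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

A model that over-predicts water tends to gain recall at the expense of precision, which is one way a baseline can show higher recall but slightly lower accuracy and precision, as reported for the U-Net.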

[30] Context-Enriched Contrastive Loss: Enhancing Presentation of Inherent Sample Connections in Contrastive Learning Framework

Haojin Deng,Yimin Yang

Main category: cs.CV

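TL;DR: The paper proposes a context-enriched contrastive loss that combines two convergence targets to improve the effectiveness of contrastive learning and address information distortion, outperforming existing methods on several large-scale benchmark datasets.

Details Motivation: Contrastive learning performs well on large benchmarks, but the conventional contrastive loss can introduce information distortion, in particular by under-learning positive pairs that come from the same source image. A new loss is needed to better capture the inherent connections between samples.

Contribution: Proposes a context-enriched contrastive loss whose two convergence targets (label-sensitive feature discrimination and pulling together samples from the same source image) improve learning effectiveness and generalization.

Method: Designs a loss with two components: 1) a label-sensitive component that separates features of different classes; 2) a component that draws augmented samples from the same source image closer and pushes all other samples away.

Result: The method is validated on 8 benchmark datasets and outperforms 16 existing contrastive learning methods, with a 22.9% improvement over the original contrastive loss on the BiasedMNIST dataset.

Insight: Combining label sensitivity with the inherent connections among same-source samples can improve the efficiency and fairness of contrastive learning, especially on tasks with systematic bias.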

Abstract: Contrastive learning has gained popularity and pushes state-of-the-art performance across numerous large-scale benchmarks. In contrastive learning, the contrastive loss function plays a pivotal role in discerning similarities between samples through techniques such as rotation or cropping. However, this learning mechanism can also introduce information distortion from the augmented samples. This is because the trained model may develop a significant overreliance on information from samples with identical labels, while concurrently neglecting positive pairs that originate from the same initial image, especially in expansive datasets. This paper proposes a context-enriched contrastive loss function that concurrently improves learning effectiveness and addresses the information distortion by encompassing two convergence targets. The first component, which is notably sensitive to label contrast, differentiates between features of identical and distinct classes, which boosts the contrastive training efficiency. Meanwhile, the second component draws closer the augmented samples from the same source image and distances all other samples. We evaluate the proposed approach on image classification using 8 widely accepted large-scale recognition benchmark datasets: CIFAR10, CIFAR100, Caltech-101, Caltech-256, ImageNet, BiasedMNIST, UTKFace, and CelebA. The experimental results demonstrate that the proposed method achieves improvements over 16 state-of-the-art contrastive learning methods in terms of both generalization performance and learning convergence speed. Interestingly, our technique stands out in addressing systematic distortion tasks. It demonstrates a 22.9% improvement compared to original contrastive loss functions in the downstream BiasedMNIST dataset, highlighting its promise for more efficient and equitable downstream training.

[31] FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges

Kevin David Hayes,Micah Goldblum,Vikash Sehwag,Gowthami Somepalli,Ashwinee Panda,Tom Goldstein

Main category: cs.CV

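TL;DR: The paper proposes FineGRAIN, a structured methodology for jointly evaluating text-to-image (T2I) models and vision language models (VLMs) by testing whether VLMs can identify 27 specific failure modes in images generated by T2I models. It also contributes a dataset of images generated by 5 T2I models with the corresponding VLM annotations.

Details Motivation: T2I models often fail to capture specific attributes in user prompts (such as object count or color), and existing VLM benchmarks have not kept pace with the complexity of the scenes involved. A new way to systematically evaluate these failure modes is therefore needed.

Contribution: 1. Proposes the FineGRAIN methodology for jointly evaluating T2I models and VLMs. 2. Contributes a dataset of images generated by 5 T2I models together with VLM annotations, which are in turn annotated by an LLM (Llama3).

Method: The paper defines 27 specific failure modes and tests whether VLMs can identify them in T2I-generated images, using 5 T2I models and 3 VLMs. The resulting dataset contains challenging prompts with the corresponding images and VLM annotations.

Result: The analysis reveals systematic errors in attribute fidelity and object representation in T2I models, and current metrics fail to capture these nuanced errors.

Insight: Current metrics are insufficient for assessing the reliability of generative models; targeted benchmarks are needed to advance their reliability and interpretability.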

Abstract: Text-to-image (T2I) models are capable of generating visually impressive images, yet they often fail to accurately capture specific attributes in user prompts, such as the correct number of objects with the specified colors. The diversity of such errors underscores the need for a hierarchical evaluation framework that can compare prompt adherence abilities of different image generation models. Simultaneously, benchmarks of vision language models (VLMs) have not kept pace with the complexity of scenes that VLMs are used to annotate. In this work, we propose a structured methodology for jointly evaluating T2I models and VLMs by testing whether VLMs can identify 27 specific failure modes in the images generated by T2I models conditioned on challenging prompts. Our second contribution is a dataset of prompts and images generated by 5 T2I models (Flux, SD3-Medium, SD3-Large, SD3.5-Medium, SD3.5-Large) and the corresponding annotations from VLMs (Molmo, InternVL3, Pixtral) annotated by an LLM (Llama3) to test whether VLMs correctly identify the failure mode in a generated image. By analyzing failure modes on a curated set of prompts, we reveal systematic errors in attribute fidelity and object representation. Our findings suggest that current metrics are insufficient to capture these nuanced errors, highlighting the importance of targeted benchmarks for advancing generative model reliability and interpretability.

[32] Mapping of Lesion Images to Somatic Mutations

Rahul Mehta

Main category: cs.CV

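TL;DR: This paper proposes LLOST, a deep latent variable model that uses dual variational autoencoders coupled by a shared latent space to map medical lesion images to somatic mutation profiles, supporting cross-modal cancer diagnosis.

Details Motivation: Early diagnosis and intervention are critical for cancer treatment. Medical images and genetic information play different roles in diagnosis, but a model that maps directly between them is lacking. This paper aims to fill that gap with a deep learning model that predicts somatic mutations from lesion images.

Contribution: 1. Introduces a point cloud representation of lesion images for invariance to the imaging modality; 2. Proposes LLOST, which couples dual variational autoencoders through a shared latent space; 3. Uses conditional normalizing flow priors to account for the diverse distributions of each domain.

Method: 1. Convert lesion images to point clouds; 2. Build dual variational autoencoders coupled by a separate shared latent space; 3. Learn each latent space with a conditional normalizing flow prior; 4. Validate on data from The Cancer Imaging Archive and The Cancer Genomic Archive.

Result: The model performs well at predicting both the counts of specific mutations and the occurrence of mutations, and it reveals shared patterns between imaging and mutations that reflect cancer type.

Insight: A shared latent space can capture associations across modalities effectively, offering a new approach to cancer diagnosis. Future work could extend the model to other genetic domains.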

Abstract: Medical imaging is a critical initial tool used by clinicians to determine a patient’s cancer diagnosis, allowing for faster intervention and more reliable patient prognosis. At subsequent stages of patient diagnosis, genetic information is extracted to help select specific patient treatment options. As the efficacy of cancer treatment often relies on early diagnosis and treatment, we build a deep latent variable model to determine patients’ somatic mutation profiles based on their corresponding medical images. We first introduce a point cloud representation of lesion images to allow for invariance to the imaging modality. We then propose LLOST, a model with dual variational autoencoders coupled together by a separate shared latent space that unifies features from the lesion point clouds and counts of distinct somatic mutations. Our model therefore consists of three latent spaces, each of which is learned with a conditional normalizing flow prior to account for the diverse distributions of each domain. We conduct qualitative and quantitative experiments on de-identified medical images from The Cancer Imaging Archive and the corresponding somatic mutations from the Pan Cancer dataset of The Cancer Genomic Archive. We show the model’s predictive performance on the counts of specific mutations as well as its ability to accurately predict the occurrence of mutations. In particular, we find shared patterns between the imaging and somatic mutation domains that reflect cancer type. We conclude with a remark on how to improve the model and possible future avenues of research to include other genetic domains.

[33] SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting

Pranav Asthana,Alex Hanson,Allen Tu,Tom Goldstein,Matthias Zwicker,Amitabh Varshney

Main category: cs.CV

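TL;DR: SplatSuRe is a selective super-resolution (SR) method that addresses multi-view inconsistency in 3D Gaussian Splatting (3DGS). By applying SR content only in regions that lack high-frequency information, it yields sharper and more consistent renders.

Details Motivation: 3DGS excels at high-quality novel view synthesis, but independently super-resolving each low-resolution (LR) input view introduces multi-view inconsistencies and blurry renders. Existing methods apply SR uniformly across all images and ignore the complementary high-frequency information available across views.

Contribution: Proposes SplatSuRe, which uses the camera pose relative to scene geometry to apply SR content selectively, only in regions lacking high-frequency information, improving the sharpness and consistency of renders.

Method: The method analyzes camera poses relative to scene geometry to identify undersampled regions that lack high-frequency supervision and applies SR only there, avoiding the content inconsistencies caused by applying SR uniformly to every image.

Result: On Tanks & Temples, Deep Blending, and Mip-NeRF 360, SplatSuRe surpasses baselines in both fidelity and perceptual quality, with the most significant gains in localized foreground regions.

Insight: Close-up LR views may contain high-frequency information for regions that also appear in more distant views. Applying SR selectively exploits this complementarity across views and avoids content inconsistencies.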

Abstract: 3D Gaussian Splatting (3DGS) enables high-quality novel view synthesis, motivating interest in generating higher-resolution renders than those available during training. A natural strategy is to apply super-resolution (SR) to low-resolution (LR) input views, but independently enhancing each image introduces multi-view inconsistencies, leading to blurry renders. Prior methods attempt to mitigate these inconsistencies through learned neural components, temporally consistent video priors, or joint optimization on LR and SR views, but all uniformly apply SR across every image. In contrast, our key insight is that close-up LR views may contain high-frequency information for regions also captured in more distant views, and that we can use the camera pose relative to scene geometry to inform where to add SR content. Building from this insight, we propose SplatSuRe, a method that selectively applies SR content only in undersampled regions lacking high-frequency supervision, yielding sharper and more consistent results. Across Tanks & Temples, Deep Blending and Mip-NeRF 360, our approach surpasses baselines in both fidelity and perceptual quality. Notably, our gains are most significant in localized foreground regions where higher detail is desired.

[34] RobustSurg: Tackling domain generalisation for out-of-distribution surgical scene segmentation

Mansoor Ali,Maksim Richards,Gilberto Ochoa-Ruiz,Sharib Ali

Main category: cs.CV

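TL;DR: RobustSurg tackles domain generalisation and out-of-distribution data in surgical scene segmentation. It exploits style and content information through instance normalisation and feature covariance mapping, adds a restitution module to the ResNet backbone to retain important features, and substantially improves performance on unseen centres and datasets.

Details Motivation: Surgical scene segmentation performs well on single-centre, single-modality data but generalises poorly to unseen distributions (other centres) and modality changes. Existing methods mostly target natural scene data and cannot be applied directly to surgical scenes, which offer limited visual cues and highly diverse scenarios.

Contribution: 1. Proposes RobustSurg, which exploits style and content information via instance normalisation and feature covariance mapping to reduce the effect of appearance variation; 2. Designs a restitution module that retains task-relevant features; 3. Provides a newly curated multiclass, multicentre dataset to support generalisability research.

Method: 1. Applies instance normalisation and feature covariance mapping for robust feature representations; 2. Introduces a restitution module in the ResNet backbone to avoid removing salient features; 3. Curates a new multiclass, multicentre dataset. The model is trained on CholecSeg8K and evaluated on the unseen-centre HeiCholSeg dataset and on the target set of the EndoUDA polyp dataset.

Result: On the unseen-centre HeiCholSeg dataset, RobustSurg improves mean IoU by nearly 23% over the DeepLabv3+ baseline and by 10-32% over SOTA methods. On the EndoUDA target set, it improves by nearly 22% over the baseline and nearly 11% over a recent SOTA method.

Insight: Domain generalisation for surgical scenes can be approached by exploiting style and content information while retaining salient features; instance normalisation and covariance mapping are effective tools for robustness; and new multicentre data is important for progress in this area.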

Abstract: While recent advances in deep learning for surgical scene segmentation have demonstrated promising results on single-centre and single-imaging modality data, these methods usually do not generalise to unseen distributions (i.e., from other centres) and unseen modalities. Generalisation to out-of-distribution data and domain gaps due to modality changes have been widely researched, but mostly for natural scene data. However, these methods cannot be directly applied to surgical scenes due to limited visual cues and often extremely diverse scenarios compared to natural scene data. Inspired by these works in natural scenes to push generalisability on OOD data, we hypothesise that exploiting the style and content information in surgical scenes could minimise appearance variation, making the representation less sensitive to sudden changes such as blood or imaging artefacts. This can be achieved by performing instance normalisation and feature covariance mapping techniques for robust and generalisable feature representations. Further, to eliminate the risk of removing salient feature representations associated with the objects of interest, we introduce a restitution module within the feature learning ResNet backbone that can enable the retention of useful task-relevant features. To tackle the lack of multiclass and multicentre data for surgical scene segmentation, we also provide a newly curated dataset that can be vital for addressing generalisability in this domain. Our proposed RobustSurg obtained nearly 23% improvement over the baseline DeepLabv3+ and a 10-32% improvement over the SOTA in terms of mean IoU score on an unseen centre HeiCholSeg dataset when trained on CholecSeg8K. Similarly, RobustSurg also obtained nearly 22% improvement over the baseline and nearly 11% improvement over a recent SOTA method for the target set of the EndoUDA polyp dataset.
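
Instance normalisation, which the abstract names as one ingredient, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function name `instance_norm`, the `eps` value, and the toy inputs are assumptions made for illustration. Each channel of each sample is normalised over its own spatial positions, so a global brightness change (a crude stand-in for a style shift) largely cancels out.

```python
import math

def instance_norm(feature_map, eps=1e-5):
    """Normalise one channel of one sample over its spatial positions."""
    values = [v for row in feature_map for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = 1.0 / math.sqrt(var + eps)
    return [[(v - mean) * scale for v in row] for row in feature_map]

# Hypothetical toy inputs: the same content at two brightness levels.
bright = [[10.0, 20.0], [30.0, 40.0]]
dim = [[1.0, 2.0], [3.0, 4.0]]

norm_bright = instance_norm(bright)
norm_dim = instance_norm(dim)
```

After normalisation the two maps agree up to the small effect of `eps`, which is the sense in which per-instance statistics remove appearance while keeping spatial structure.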

[35] Multifractal Recalibration of Neural Networks for Medical Imaging Segmentation

Miguel L. Martins,Miguel T. Coimbra,Francesco Renna

Main category: cs.CV

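TL;DR: The paper proposes a recalibration method for neural networks based on multifractal analysis, aimed at medical image segmentation. It introduces Monofractal and Multifractal Recalibration priors and reports gains over other channel-attention mechanisms.

Details Motivation: Existing end-to-end multifractal methods rely on heavy pooling or strong feature-space decimation, which constrains tasks such as semantic segmentation, so a more suitable way of using multifractal analysis is needed.

Contribution: Proposes two inductive priors, Monofractal and Multifractal Recalibration, which leverage the relationship between the probability mass of the exponents and the multifractal spectrum and are implemented as channel-attention functions in convolutional networks.

Method: Implements multifractal recalibration as channel-attention functions in a U-Net-based framework and validates it experimentally.

Result: On three public medical-imaging datasets (ISIC18, Kvasir-SEG, and BUSI), the method outperforms a baseline equipped with other channel-attention mechanisms that also use higher-order statistics.

Insight: Because of skip connections in U-Net architectures, excitation responses do not become increasingly specialized with encoder depth, and their effectiveness may relate to global statistics of instance variability.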

Abstract: Multifractal analysis has revealed regularities in many self-seeding phenomena, yet its use in modern deep learning remains limited. Existing end-to-end multifractal methods rely on heavy pooling or strong feature-space decimation, which constrain tasks such as semantic segmentation. Motivated by these limitations, we introduce two inductive priors: Monofractal and Multifractal Recalibration. These methods leverage relationships between the probability mass of the exponents and the multifractal spectrum to form statistical descriptions of encoder embeddings, implemented as channel-attention functions in convolutional networks. Using a U-Net-based framework, we show that multifractal recalibration yields substantial gains over a baseline equipped with other channel-attention mechanisms that also use higher-order statistics. Given the proven ability of multifractal analysis to capture pathological regularities, we validate our approach on three public medical-imaging datasets: ISIC18 (dermoscopy), Kvasir-SEG (endoscopy), and BUSI (ultrasound). Our empirical analysis also provides insights into the behavior of these attention layers. We find that excitation responses do not become increasingly specialized with encoder depth in U-Net architectures due to skip connections, and that their effectiveness may relate to global statistics of instance variability.

[36] Towards Unified Video Quality Assessment

Chen Feng,Tianhao Peng,Fan Zhang,David Bull

Main category: cs.CV

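TL;DR: The paper proposes Unified-VQA, a framework that recasts generic video quality assessment (VQA) as a Diagnostic Mixture-of-Experts (MoE) problem, addressing the limitations of existing VQA models and providing unified, interpretable quality assessment.

Details Motivation: Existing VQA models usually predict only a single quality score, lack diagnostic and interpretable feedback, and are mostly specialized, format-specific metrics. Unified-VQA aims to address these problems and provide generic video quality assessment.

Contribution: 1) Proposes the Unified-VQA framework, applicable to various distortion types across multiple video formats; 2) Designs a multi-proxy expert training strategy to optimize each expert; 3) Introduces a diagnostic multi-task head that outputs a global quality score and an interpretable multi-dimensional artifact vector.

Method: 1) Adopts a Mixture-of-Experts (MoE) framework in which each expert is dedicated to a distinct perceptual domain; 2) Trains each expert with a ranking-inspired loss guided by the most suitable proxy metric for its domain; 3) Integrates a diagnostic multi-task head optimized with a weakly-supervised learning strategy.

Result: With static parameters (no retraining or fine-tuning), Unified-VQA outperforms over 18 benchmark methods across 17 databases on both generic VQA and diagnostic artifact detection.

Insight: Combining MoE with multi-task learning makes video quality assessment both unified and diagnostic, a step towards practical, actionable use.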

Abstract: Recent works in video quality assessment (VQA) typically employ monolithic models that predict a single quality score for each test video. These approaches cannot provide diagnostic, interpretable feedback, offering little insight into why the video quality is degraded. Most of them are also specialized, format-specific metrics rather than truly "generic" solutions, as they are designed to learn a compromised representation from disparate perceptual domains. To address these limitations, this paper proposes Unified-VQA, a framework that provides a single, unified quality model applicable to various distortion types within multiple video formats by recasting generic VQA as a Diagnostic Mixture-of-Experts (MoE) problem. Unified-VQA employs multiple "perceptual experts" dedicated to distinct perceptual domains. A novel multi-proxy expert training strategy is designed to optimize each expert using a ranking-inspired loss, guided by the most suitable proxy metric for its domain. We also integrated a diagnostic multi-task head into this framework to generate a global quality score and an interpretable multi-dimensional artifact vector, which is optimized using a weakly-supervised learning strategy, leveraging the known properties of the large-scale training database generated for this work. With static model parameters (without retraining or fine-tuning), Unified-VQA demonstrates consistent and superior performance compared to over 18 benchmark methods for both generic VQA and diagnostic artifact detection tasks across 17 databases containing diverse streaming artifacts in HD, UHD, HDR and HFR formats. This work represents an important step towards practical, actionable, and interpretable video quality assessment.

[37] See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Le Thien Phuc Nguyen,Zhuoran Yu,Samuel Low Yu Hang,Subin An,Jeongik Lee,Yohan Ban,SeungEun Chung,Thanh-Huy Nguyen,JuWan Maeng,Soochahn Lee,Yong Jae Lee

Main category: cs.CV

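TL;DR: The paper introduces AV-SpeakerBench, a benchmark for speaker-centric audiovisual reasoning that uses 3,212 multiple-choice questions to evaluate how well multimodal large language models align who speaks, what is said, and when it occurs. The Gemini family performs best, and open-source models still lag behind.

Details Motivation: Existing video benchmarks rarely assess fine-grained reasoning about human speech in multimodal large language models (MLLMs); many tasks remain visually solvable or evaluate speech only coarsely.

Contribution: 1. Proposes AV-SpeakerBench, a benchmark focused on speaker-centric audiovisual reasoning; 2. Designs fusion-grounded questions that embed audiovisual dependencies into question semantics; 3. Provides expert-curated annotations that ensure temporal precision and cross-modal validity.

Method: Models are evaluated with multiple-choice questions on speaker identification, speech content, and temporal alignment, using questions designed so that answering requires both audio and visual information.

Result: Gemini 2.5 Pro achieves the best results. Among open-source models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, primarily due to weaker audiovisual fusion.

Insight: For open models, the gap stems mainly from weaker audiovisual fusion rather than from visual perception.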

Abstract: Multimodal large language models (MLLMs) are expected to jointly interpret vision, audio, and language, yet existing video benchmarks rarely assess fine-grained reasoning about human speech. Many tasks remain visually solvable or only coarsely evaluate speech, offering limited insight into whether models can align who speaks, what is said, and when it occurs. We introduce AV-SpeakerBench, a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos. It features: (1) a speaker-centered formulation that treats speakers, not scenes, as the core reasoning unit; (2) fusion-grounded question design embedding audiovisual dependencies into question semantics; and (3) expert-curated annotations ensuring temporal precision and cross-modal validity. Comprehensive evaluations show that the Gemini family consistently outperforms open-source systems, with Gemini 2.5 Pro achieving the best results. Among open models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, primarily due to weaker audiovisual fusion rather than visual perception. We believe AV-SpeakerBench establishes a rigorous foundation for advancing fine-grained audiovisual reasoning in future multimodal systems.

[38] Exploring the Potentials of Spiking Neural Networks for Image Deraining

Shuang Chen,Tomas Krajnik,Farshad Arvin,Amir Atapour-Abarghouei

Main category: cs.CV

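TL;DR: This paper explores the potential of Spiking Neural Networks (SNNs) for image deraining. It proposes a new Visual LIF (VLIF) neuron and modules built from it, significantly improving performance while reducing energy consumption.

Details Motivation: The work is motivated by the unexplored potential of biologically plausible, energy-efficient SNNs in low-level vision tasks such as image deraining, and by the limitations of conventional spiking neurons in spatial contextual understanding and frequency-domain saturation.

Contribution: 1) Proposes the VLIF neuron to overcome the lack of spatial contextual understanding in traditional spiking neurons; 2) Designs the Spiking Decomposition and Enhancement Module and the lightweight Spiking Multi-scale Unit for hierarchical multi-scale representation learning; 3) Demonstrates strong performance and low energy consumption on multiple datasets.

Method: The method introduces the VLIF neuron and combines it with the newly designed module and unit to address frequency-domain saturation and limited representational capacity of SNNs in image deraining, enabling hierarchical multi-scale learning.

Result: Experiments on five benchmark deraining datasets show that the approach significantly outperforms state-of-the-art SNN-based deraining methods while using only 13% of their energy consumption.

Insight: The paper shows that SNNs can be both high-performing and energy-efficient in low-level vision tasks, pointing to a direction for applying SNNs to similar tasks.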

Abstract: Biologically plausible and energy-efficient frameworks such as Spiking Neural Networks (SNNs) have not been sufficiently explored in low-level vision tasks. Taking image deraining as an example, this study addresses the representation of the inherent high-pass characteristics of spiking neurons, specifically in image deraining and innovatively proposes the Visual LIF (VLIF) neuron, overcoming the obstacle of lacking spatial contextual understanding present in traditional spiking neurons. To tackle the limitation of frequency-domain saturation inherent in conventional spiking neurons, we leverage the proposed VLIF to introduce the Spiking Decomposition and Enhancement Module and the lightweight Spiking Multi-scale Unit for hierarchical multi-scale representation learning. Extensive experiments across five benchmark deraining datasets demonstrate that our approach significantly outperforms state-of-the-art SNN-based deraining methods, achieving this superior performance with only 13% of their energy consumption. These findings establish a solid foundation for deploying SNNs in high-performance, energy-efficient low-level vision tasks.
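
For readers unfamiliar with spiking neurons, the following is a minimal sketch of a standard discrete-time leaky integrate-and-fire (LIF) neuron, the conventional unit that the paper's VLIF builds on. It is not the VLIF neuron itself, whose definition the abstract does not give; the function name and the `decay` and `threshold` values are illustrative assumptions.

```python
def lif_spikes(currents, decay=0.5, threshold=1.0):
    """Simulate one standard LIF neuron over a sequence of input currents."""
    membrane = 0.0
    spikes = []
    for current in currents:
        # Leak part of the previous potential, then integrate the new input.
        membrane = decay * membrane + current
        if membrane >= threshold:
            spikes.append(1)
            membrane = 0.0  # hard reset after firing
        else:
            spikes.append(0)
    return spikes

# A constant sub-threshold input still fires once enough charge accumulates.
train = lif_spikes([0.6, 0.6, 0.6, 0.6])  # -> [0, 0, 1, 0]
```

Note that each neuron here sees only its own input over time, with no view of neighbouring pixels, which is the lack of spatial context that the abstract says VLIF is designed to overcome.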

[39] Spatiotemporal Pyramid Flow Matching for Climate Emulation

Jeremy Andrew Irvin,Jiaqi Han,Zikui Wang,Abdulaziz Alharbi,Yufei Zhao,Nomin-Erdene Bayarsaikhan,Daniele Visioni,Andrew Y. Ng,Duncan Watson-Parris

Main category: cs.CV

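TL;DR: The paper proposes Spatiotemporal Pyramid Flows (SPF), a new class of flow matching approaches for efficient, parallel climate emulation at multiple timescales. SPF is evaluated on ClimateBench and scaled using the newly curated ClimateSuite dataset.

Details Motivation: Previous generative approaches rely on weather-scale autoregression, which is slow over long climate horizons and has yet to demonstrate stable rollouts under nonstationary forcings. SPF aims to provide efficient and stable climate emulation.

Contribution: 1. Proposes SPF, a flow matching approach that models data hierarchically across spatial and temporal scales; 2. Introduces ClimateSuite, the largest collection of Earth system simulations to date and the first dataset to include simulations of climate interventions; 3. Shows that SPF performs well at multiple timescales and generalizes to held-out scenarios across climate models.

Method: SPF partitions the generative trajectory into a spatiotemporal pyramid, progressively increasing spatial resolution and coupling each stage with an associated timescale. Each stage is conditioned on prescribed physical forcings (e.g., greenhouse gases or aerosols), enabling efficient parallel sampling.

Result: On ClimateBench, SPF outperforms strong flow matching baselines and pre-trained models at yearly and monthly timescales while offering fast sampling, and the scaled model generalizes well to held-out scenarios.

Insight: Hierarchical spatiotemporal design combined with conditioning on physical forcings is key to efficient and accurate climate emulation.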

Abstract: Generative models have the potential to transform the way we emulate Earth’s changing climate. Previous generative approaches rely on weather-scale autoregression for climate emulation, but this is inherently slow for long climate horizons and has yet to demonstrate stable rollouts under nonstationary forcings. Here, we introduce Spatiotemporal Pyramid Flows (SPF), a new class of flow matching approaches that model data hierarchically across spatial and temporal scales. Inspired by cascaded video models, SPF partitions the generative trajectory into a spatiotemporal pyramid, progressively increasing spatial resolution to reduce computation and coupling each stage with an associated timescale to enable direct sampling at any temporal level in the pyramid. This design, together with conditioning each stage on prescribed physical forcings (e.g., greenhouse gases or aerosols), enables efficient, parallel climate emulation at multiple timescales. On ClimateBench, SPF outperforms strong flow matching baselines and pre-trained models at yearly and monthly timescales while offering fast sampling, especially at coarser temporal levels. To scale SPF, we curate ClimateSuite, the largest collection of Earth system simulations to date, comprising over 33,000 simulation-years across ten climate models and the first dataset to include simulations of climate interventions. We find that the scaled SPF model demonstrates good generalization to held-out scenarios across climate models. Together, SPF and ClimateSuite provide a foundation for accurate, efficient, probabilistic climate emulation across temporal scales and realistic future scenarios. Data and code are publicly available at https://github.com/stanfordmlgroup/spf.

[40] Progressive Image Restoration via Text-Conditioned Video Generation

Peng Kang,Xijun Wang,Yu Yuan

Main category: cs.CV

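TL;DR: This paper proposes using a text-conditioned video generation model (CogVideo) for progressive image restoration, fine-tuning it to generate restoration trajectories rather than natural video motion.

Details Motivation: Recent text-to-video models show strong temporal generation capabilities, but their potential for image restoration remains underexplored.

Contribution: 1. Repurposes CogVideo for progressive image restoration; 2. Constructs synthetic datasets for super-resolution, deblurring, and low-light enhancement; 3. Compares two prompting strategies (a uniform prompt versus scene-specific prompts); 4. Demonstrates zero-shot robustness and interpretability.

Method: 1. Fine-tune CogVideo to generate restoration trajectories; 2. Train on synthetic datasets in which each sample shows a gradual transition from degraded to clean frames; 3. Compare a uniform text prompt with scene-specific prompts generated by the LLaVA multi-modal LLM and refined with ChatGPT.

Result: Experiments show that the model produces sequences that improve on metrics such as PSNR, SSIM, and LPIPS across frames, and that it generalizes to real-world scenarios on the ReLoBlur dataset without additional training.

Insight: 1. Temporal generation capability can be converted into progressive gains in restoration quality; 2. Scene-specific prompts generated by a multi-modal LLM are reported to be more effective than a uniform prompt; 3. The zero-shot results indicate strong generalization.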

Abstract: Recent text-to-video models have demonstrated strong temporal generation capabilities, yet their potential for image restoration remains underexplored. In this work, we repurpose CogVideo for progressive visual restoration tasks by fine-tuning it to generate restoration trajectories rather than natural video motion. Specifically, we construct synthetic datasets for super-resolution, deblurring, and low-light enhancement, where each sample depicts a gradual transition from degraded to clean frames. Two prompting strategies are compared: a uniform text prompt shared across all samples, and a scene-specific prompting scheme generated via LLaVA multi-modal LLM and refined with ChatGPT. Our fine-tuned model learns to associate temporal progression with restoration quality, producing sequences that improve perceptual metrics such as PSNR, SSIM, and LPIPS across frames. Extensive experiments show that CogVideo effectively restores spatial detail and illumination consistency while maintaining temporal coherence. Moreover, the model generalizes to real-world scenarios on the ReLoBlur dataset without additional training, demonstrating strong zero-shot robustness and interpretability through temporal restoration.
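
Of the metrics named above, PSNR has a simple closed form, 10·log10(MAX² / MSE), and a short sketch shows what "improving across frames" means for a restoration trajectory. The helper name `psnr` and the toy frames are illustrative assumptions, not the paper's evaluation code; SSIM and LPIPS require more machinery and are not shown.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized grayscale images."""
    ref = [p for row in reference for p in row]
    out = [p for row in restored for p in row]
    mse = sum((a - b) ** 2 for a, b in zip(ref, out)) / len(ref)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical 2x2 clean image and a trajectory of progressively better frames.
clean = [[100.0, 150.0], [200.0, 250.0]]
trajectory = [
    [[80.0, 130.0], [180.0, 230.0]],   # every pixel off by 20
    [[90.0, 140.0], [190.0, 240.0]],   # every pixel off by 10
    [[95.0, 145.0], [195.0, 245.0]],   # every pixel off by 5
]
scores = [psnr(clean, frame) for frame in trajectory]
```

In this toy trajectory the per-frame PSNR rises monotonically as the frames approach the clean image, which is the behaviour the abstract describes for the fine-tuned model's outputs.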

[41] Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision

Chenshuang Zhang,Kang Zhang,Joon Son Chung,In So Kweon,Junmo Kim,Chengzhi Mao

Main category: cs.CV

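TL;DR: This paper finds that pre-trained video diffusion models can distinguish the motion of visually similar objects without supervision, because their denoising process separates motion from appearance. The resulting self-supervised tracker performs strongly on tracking visually similar objects.

Details Motivation: Existing self-supervised trackers struggle when visual cues become ambiguous, which limits their scalability and generalization without extensive labeled data. The paper finds that video diffusion models already learn motion representations suitable for tracking during pre-training, without task-specific training.

Contribution: 1) Shows that video diffusion models learn motion representations suitable for tracking during pre-training; 2) Proposes a diffusion-based self-supervised tracker that significantly improves tracking of visually similar objects; 3) Introduces new tests focused on tracking visually similar items.

Method: The tracker builds on the denoising process of a pre-trained video diffusion model, in which early, high-noise stages isolate motion and later stages refine appearance.

Result: The method achieves up to a 6-point improvement over recent self-supervised approaches, on established benchmarks and on the newly introduced tests.

Insight: The denoising process of video diffusion models naturally separates motion from appearance, offering a new route to self-supervised tracking.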

Abstract: Distinguishing visually similar objects by their motion remains a critical challenge in computer vision. Although supervised trackers show promise, contemporary self-supervised trackers struggle when visual cues become ambiguous, limiting their scalability and generalization without extensive labeled data. We find that pre-trained video diffusion models inherently learn motion representations suitable for tracking without task-specific training. This ability arises because their denoising process isolates motion in early, high-noise stages, distinct from later appearance refinement. Capitalizing on this discovery, our self-supervised tracker significantly improves performance in distinguishing visually similar objects, an underexplored failure point for existing methods. Our method achieves up to a 6-point improvement over recent self-supervised approaches on established benchmarks and our newly introduced tests focused on tracking visually similar items. Visualizations confirm that these diffusion-derived motion representations enable robust tracking of even identical objects across challenging viewpoint changes and deformations.

[42] TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction

Fengyi Zhang,Tianjun Zhang,Kasra Khosoussi,Zheng Zhang,Zi Huang,Yadan Luo

Main category: cs.CV

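TL;DR: This paper proposes TALO, a framework that uses higher-DOF, long-term alignment based on Thin Plate Spline to address global consistency in online reconstruction with 3D vision foundation models, performing well across multiple datasets and camera configurations.

Details Motivation: 3D vision foundation models face temporal consistency challenges in online reconstruction, and existing alignment strategies are limited in assumption validity, local alignment scope, and robustness under noisy geometry.

Contribution: Proposes a long-term alignment framework based on Thin Plate Spline together with a point-agnostic submap registration design, improving global consistency and robustness to noise while remaining compatible with diverse models and camera configurations.

Method: The method uses higher-DOF Thin Plate Spline alignment with globally propagated control points to correct spatially varying inconsistencies, and a point-agnostic submap registration design for robustness to noisy geometry.

Result: Experiments show more coherent geometry and lower trajectory errors across multiple datasets, backbone models, and camera setups.

Insight: Long-term alignment and a point-agnostic design are key to consistent online 3D reconstruction, and Thin Plate Spline is well suited to the global adjustment.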

Abstract: 3D vision foundation models have shown strong generalization in reconstructing key 3D attributes from uncalibrated images through a single feed-forward pass. However, when deployed in online settings such as driving scenarios, predictions are made over temporal windows, making it non-trivial to maintain consistency across time. Recent strategies align consecutive predictions by solving for a global transformation, yet our analysis reveals their fundamental limitations in assumption validity, local alignment scope, and robustness under noisy geometry. In this work, we propose a higher-DOF and long-term alignment framework based on Thin Plate Spline, leveraging globally propagated control points to correct spatially varying inconsistencies. In addition, we adopt a point-agnostic submap registration design that is inherently robust to noisy geometry predictions. The proposed framework is fully plug-and-play, compatible with diverse 3D foundation models and camera configurations (e.g., monocular or surround-view). Extensive experiments demonstrate that our method consistently yields more coherent geometry and lower trajectory errors across multiple datasets, backbone models, and camera setups, highlighting its robustness and generality. Code is publicly available at https://github.com/Xian-Bei/TALO.

[43] A Multi-weight Self-matching Visual Explanation for CNNs on SAR Images

Siyuan Sun,Yongping Zhang,Hongcheng Zeng,Yamin Wang,Wei Yang,Wanting Yang,Jie Chen

Main category: cs.CV

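TL;DR: MS-CAM combines channel-wise and element-wise weights to improve visual explanations of CNNs on SAR images, and the paper also validates its feasibility for weakly-supervised object localization.

Details Motivation: CNNs perform well on SAR tasks, but the complexity and opacity of their internal mechanisms make it hard to meet high-reliability requirements, which limits their application in SAR. Improving CNN interpretability is therefore important.

Contribution: Proposes MS-CAM, which matches SAR images with the feature maps and gradients extracted by the CNN and combines channel-wise and element-wise weights to improve visual explanation.

Method: MS-CAM uses the feature maps and corresponding gradients extracted by the CNN, weighted both channel-wise and element-wise, to produce an activation map that highlights the basis of the model's decision.

Result: Experiments show that MS-CAM highlights the network's regions of interest more accurately and captures detailed target features; its feasibility for weakly-supervised object localization is also demonstrated.

Insight: MS-CAM makes CNNs for SAR tasks more interpretable, and its analysis of key factors such as pixel thresholds informs future work.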

Abstract: In recent years, convolutional neural networks (CNNs) have achieved significant success in various synthetic aperture radar (SAR) tasks. However, the complexity and opacity of their internal mechanisms hinder the fulfillment of high-reliability requirements, thereby limiting their application in SAR. Improving the interpretability of CNNs is thus of great importance for their development and deployment in SAR. In this paper, a visual explanation method termed multi-weight self-matching class activation mapping (MS-CAM) is proposed. MS-CAM matches SAR images with the feature maps and corresponding gradients extracted by the CNN, and combines both channel-wise and element-wise weights to visualize the decision basis learned by the model in SAR images. Extensive experiments conducted on a self-constructed SAR target classification dataset demonstrate that MS-CAM more accurately highlights the network’s regions of interest and captures detailed target feature information, thereby enhancing network interpretability. Furthermore, the feasibility of applying MS-CAM to weakly-supervised object localization is validated. Key factors affecting localization accuracy, such as pixel thresholds, are analyzed in depth to inform future work.
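
The abstract does not give MS-CAM's exact formula, so the sketch below only illustrates the general idea of mixing a channel-wise weight with an element-wise weight in a class activation map. The specific combination used here is an assumption chosen for illustration: a Grad-CAM-style channel weight (the spatial mean of the gradient) multiplied by the positive part of the local gradient. The function name and toy tensors are hypothetical.

```python
def cam_with_two_weights(features, grads):
    """Toy activation map from C x H x W feature maps and their gradients.

    Assumed weighting (not the paper's): channel-wise mean gradient times
    the ReLU of the per-element gradient.
    """
    channels = len(features)
    height = len(features[0])
    width = len(features[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for c in range(channels):
        # Channel-wise weight: spatial mean of the gradient, as in Grad-CAM.
        alpha = sum(sum(row) for row in grads[c]) / (height * width)
        for i in range(height):
            for j in range(width):
                # Element-wise weight: positive part of the local gradient.
                beta = max(grads[c][i][j], 0.0)
                cam[i][j] += alpha * beta * features[c][i][j]
    # Keep only positive evidence for the class.
    return [[max(v, 0.0) for v in row] for row in cam]

features = [[[1.0, 2.0], [3.0, 4.0]]]   # one channel, 2x2
grads = [[[1.0, 0.0], [0.0, 1.0]]]
heatmap = cam_with_two_weights(features, grads)  # -> [[0.5, 0.0], [0.0, 2.0]]
```

Thresholding such a heatmap at a chosen pixel value and taking the bounding box of the surviving region is one common way to obtain a weakly-supervised localization, which is presumably why the abstract singles out pixel thresholds as a key factor.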

[44] Understanding and Harnessing Sparsity in Unified Multimodal Models

Shwai He,Chaorui Deng,Ang Li,Shen Yan

Main category: cs.CV

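TL;DR: This paper systematically analyzes sparsity in unified multimodal models and proposes a Mixture-of-Experts (MoE) adaptation based on sparse activation, which substantially improves inference efficiency while preserving performance.

Details Motivation: Unified multimodal models have made remarkable progress in both understanding and generation, but unification introduces inference inefficiencies. The paper aims to identify where these inefficiencies arise and to propose a remedy.

Contribution: 1) Provides a systematic analysis of how compressibility differs across the components of unified multimodal models; 2) Proposes MoE Adaptation, a sparse-activation method that improves the efficiency of generation; 3) Obtains an adapted BAGEL model that matches the full model's performance while activating only about half of its parameters.

Method: 1) Uses training-free pruning (both depth pruning and width reduction) as a probe of sparsity; 2) Proposes MoE Adaptation, which partitions the generation module into multiple experts and enables sparse activation; 3) Validates the approach with expert-frozen tuning and with a fully trainable adaptation.

Result: Experiments show that MoE Adaptation restores generation quality under sparse activation, and the adapted BAGEL model achieves performance comparable to the full model while activating only about half of its parameters.

Insight: 1) The understanding component is notably compressible, more so in generation tasks; 2) The generation components are highly sensitive to compression, and dynamic sparse activation is key to making them efficient; 3) MoE Adaptation is an effective way to reconcile unification with efficiency in multimodal models.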

Abstract: Large multimodal models have achieved remarkable progress in both understanding and generation. Recent efforts pursue unified multimodal models that integrate heterogeneous components to support both capabilities within a single framework. However, such unification introduces inference inefficiencies, e.g., specific tasks or samples may not require the full knowledge or capacity of the unified model. Yet, a systematic understanding of how these inefficiencies manifest across different components remains limited. In this work, we first conduct a systematic analysis of unified multimodal model components using training-free pruning as a probing methodology, considering both depth pruning and width reduction. Our study reveals that the understanding component exhibits notable compressibility in both understanding and generation tasks, which is more pronounced in the latter. In contrast, the generation components are highly sensitive to compression, with performance deteriorating sharply even under moderate compression ratios. To address this limitation, we propose the Mixture-of-Experts (MoE) Adaptation, inspired by the dynamic activation patterns observed across different samples. This approach partitions the generation module into multiple experts and enables sparse activation to restore generation quality. We validate the effectiveness of sparse activation through expert-frozen tuning and further demonstrate that a fully trainable adaptation delivers additional gains. As a result, the adapted BAGEL model achieves performance comparable to the full model while activating only about half of its parameters. The code is released at https://github.com/Shwai-He/SparseUnifiedModel.
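
Sparse activation in an MoE layer can be sketched generically as top-k routing: a gate scores every expert, only the k best are run, and their outputs are mixed with renormalised softmax weights. This is a textbook-style sketch rather than the paper's MoE Adaptation; the function `sparse_moe`, the linear gate, and the toy experts are all assumptions for illustration.

```python
import math

def sparse_moe(x, gate_weights, experts, k=1):
    """Route input x to the top-k experts and mix their outputs."""
    # One gate logit per expert: dot product of a gate row with the input.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    # Softmax restricted to the selected experts (numerically stabilised).
    peak = max(logits[e] for e in top)
    exp = {e: math.exp(logits[e] - peak) for e in top}
    total = sum(exp.values())
    output = [0.0] * len(x)
    for e in top:
        y = experts[e](x)  # experts outside `top` are never evaluated
        output = [o + exp[e] / total * yi for o, yi in zip(output, y)]
    return output, top

# Hypothetical experts and gate for a 2-dimensional input.
experts = [lambda v: [2.0 * a for a in v], lambda v: [-a for a in v]]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]
output, chosen = sparse_moe([3.0, 1.0], gate_weights, experts, k=1)
# chosen == [0], output == [6.0, 2.0]
```

With two equally sized experts and k=1, only half of the expert parameters are touched per input, which gives a rough intuition for the "about half of its parameters" figure, although the paper's actual partitioning is not described in the abstract.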

[45] WSCF-MVCC: Weakly-supervised Calibration-free Multi-view Crowd Counting

Bin Li,Daijie Chen,Qi Zhang

Main category: cs.CV

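TL;DR: The paper proposes a weakly-supervised, calibration-free multi-view crowd counting method (WSCF-MVCC). It uses crowd counts directly as supervision, improves the model's perceptual ability with a self-supervised ranking loss that leverages multi-scale priors, and uses semantic information for more accurate view matching.

Details Motivation: Existing multi-view crowd counting methods usually rely on costly camera calibration and dense annotations, and current calibration-free methods still need expensive image-level crowd annotations. The authors therefore propose a calibration-free method that needs only weak supervision, lowering the cost of practical deployment.

Contribution: 1. Proposes a weakly-supervised, calibration-free multi-view crowd counting method; 2. Uses a self-supervised ranking loss with multi-scale priors; 3. Improves view matching accuracy with semantic information.

Method: 1. Supervise the single-view counting module directly with crowd counts; 2. Add a self-supervised ranking loss that leverages multi-scale priors; 3. Use semantic information to improve view matching.

Result: On three widely used multi-view counting datasets, the method outperforms state-of-the-art methods under weakly supervised settings, indicating that it is better suited to practical deployment.

Insight: Combining weak supervision with a calibration-free design reduces annotation cost, while semantic information and multi-scale priors improve model performance.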

Abstract: Multi-view crowd counting can effectively mitigate occlusion issues that commonly arise in single-image crowd counting. Existing deep-learning multi-view crowd counting methods project different camera view images onto a common space to obtain ground-plane density maps, requiring abundant and costly crowd annotations and camera calibrations. Hence, calibration-free methods are proposed that do not require camera calibrations and scene-level crowd annotations. However, existing calibration-free methods still require expensive image-level crowd annotations for training the single-view counting module. Thus, in this paper, we propose a weakly-supervised calibration-free multi-view crowd counting method (WSCF-MVCC), directly using crowd count as supervision for the single-view counting module rather than density maps constructed from crowd annotations. In addition, a self-supervised ranking loss that leverages multi-scale priors is utilized to enhance the model’s perceptual ability without additional annotation costs. Moreover, the proposed model leverages semantic information to achieve more accurate view matching and, consequently, a more precise scene-level crowd count estimation. The proposed method outperforms the state-of-the-art methods on three widely used multi-view counting datasets under weakly supervised settings, indicating that it is more suitable for practical deployment compared with calibrated methods. Code is released at https://github.com/zqyq/Weakly-MVCC.
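
The abstract does not define its ranking loss, so the following is only a sketch of one common self-supervised ranking constraint from the crowd counting literature, assumed here for illustration: a crop of an image cannot contain more people than the image it was cut from, so a predicted crop count that exceeds the predicted full-image count is penalised with a hinge. The function names and the `margin` parameter are hypothetical.

```python
def crop_ranking_loss(count_full, count_crop, margin=0.0):
    """Hinge penalty when a crop is predicted to hold more people than its parent."""
    return max(0.0, count_crop - count_full + margin)

def batch_ranking_loss(pairs, margin=0.0):
    """Mean ranking loss over (full_count, crop_count) prediction pairs."""
    return sum(crop_ranking_loss(f, c, margin) for f, c in pairs) / len(pairs)

consistent = crop_ranking_loss(10.0, 4.0)   # -> 0.0, ordering respected
violated = crop_ranking_loss(3.0, 5.0)      # -> 2.0, crop exceeds parent
```

A constraint of this kind needs no annotations because the ordering follows from how the crops are constructed, which matches the abstract's claim of no additional annotation cost.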

[46] VACoT: Rethinking Visual Data Augmentation with VLMs

Zhengzhuo Xu,Chong Sun,SiNan Du,Chen Li,Jing Lyu,Chun Yuan

Main category: cs.CV

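TL;DR: VACoT is a framework that dynamically invokes image augmentations during inference. By applying post-hoc transformations such as denoising, it substantially improves the robustness of visual language models (VLMs) on adversarial and out-of-distribution inputs.

Details Motivation: VLMs can struggle with basic visual perception tasks, and continued training on augmented data yields limited returns given the cost of training VLMs. VACoT aims to improve robustness through dynamic augmentation while avoiding high training costs.

Contribution: 1. Proposes the VACoT framework, which dynamically invokes image augmentations at inference time. 2. Designs a conditional reward scheme that keeps augmentation use concise and effective. 3. Validates the method on 13 perception benchmarks and introduces AdvOCR to show generalization in adversarial scenarios.

Method: 1. Dynamically augment input images at inference time with post-hoc transformations such as denoising. 2. Use efficient agentic reinforcement learning to reduce training complexity. 3. Integrate a structured collection of general visual augmentations to broaden the views of the query image.

Result: VACoT substantially improves VLM robustness on challenging and out-of-distribution inputs, especially in OCR-related adversarial scenarios.

Insight: Introducing augmentation dynamically at inference time improves robustness while avoiding the cost of conventional training-based approaches.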

Abstract: While visual data augmentation remains a cornerstone for training robust vision models, it has received limited attention in visual language models (VLMs), which predominantly rely on large-scale real data acquisition or synthetic diversity. Consequently, they may struggle with basic perception tasks that conventional models handle reliably. Given the substantial cost of pre-training and fine-tuning VLMs, continued training on augmented data yields limited and diminishing returns. In this paper, we present Visual Augmentation Chain-of-Thought (VACoT), a framework that dynamically invokes image augmentations during model inference. By incorporating post-hoc transformations such as denoising, VACoT substantially improves robustness on challenging and out-of-distribution inputs, especially in OCR-related adversarial scenarios. Distinct from prior approaches limited to local cropping, VACoT integrates a structured collection of general visual augmentations, broadening the query image views while reducing training complexity and computational overhead with efficient agentic reinforcement learning. We propose a conditional reward scheme that encourages necessary augmentation while penalizing verbose responses, ensuring concise and effective reasoning in perception tasks. We demonstrate the superiority of VACoT with extensive experiments on 13 perception benchmarks and further introduce AdvOCR to highlight the generalization benefits of post-hoc visual augmentations in adversarial scenarios.

[47] Tackling Tuberculosis: A Comparative Dive into Machine Learning for Tuberculosis Detection

Daanish Hindustani,Sanober Hindustani,Preston Nguyen

Main category: cs.CV

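TL;DR: This paper compares a pretrained ResNet-50 and a SqueezeNet model for diagnosing tuberculosis (TB) from chest X-rays. SqueezeNet performs better, highlighting the potential of machine learning for TB detection and its prospects in resource-limited regions.

Details Motivation: TB is a global health problem, and traditional diagnostic methods are inefficient, especially in resource-limited settings. The authors aim to improve the efficiency and accuracy of TB diagnosis with deep learning.

Contribution: The study compares ResNet-50 and SqueezeNet on TB detection and finds that SqueezeNet performs better, offering technical support for early TB detection.

Method: Uses a Kaggle dataset of 4,200 chest X-rays, applies preprocessing (splitting, augmentation, and resizing), and evaluates the models with accuracy, precision, recall, and related metrics.

Result: SqueezeNet achieves a loss of 32%, accuracy of 89%, precision of 98%, recall of 80%, and an F1 score of 87%, compared with ResNet-50's loss of 54%, accuracy of 73%, precision of 88%, recall of 52%, and F1 score of 65%.

Insight: Machine learning shows potential for TB detection, and lightweight models such as SqueezeNet are better suited to deployment in resource-limited regions, although further work is needed to make detection faster, smaller, and more accurate.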

Abstract: This study explores the application of machine learning models, specifically a pretrained ResNet-50 model and a general SqueezeNet model, in diagnosing tuberculosis (TB) using chest X-ray images. TB, a persistent infectious disease affecting humanity for millennia, poses challenges in diagnosis, especially in resource-limited settings. Traditional methods, such as sputum smear microscopy and culture, are inefficient, prompting the exploration of advanced technologies like deep learning and computer vision. The study utilized a dataset from Kaggle, consisting of 4,200 chest X-rays, to develop and compare the performance of the two machine learning models. Preprocessing involved data splitting, augmentation, and resizing to enhance training efficiency. Evaluation metrics, including accuracy, precision, recall, and confusion matrix, were employed to assess model performance. Results showcase that the SqueezeNet achieved a loss of 32%, accuracy of 89%, precision of 98%, recall of 80%, and an F1 score of 87%. In contrast, the ResNet-50 model exhibited a loss of 54%, accuracy of 73%, precision of 88%, recall of 52%, and an F1 score of 65%. This study emphasizes the potential of machine learning in TB detection and possible implications for early identification and treatment initiation. The possibility of integrating such models into mobile devices expands their utility in areas lacking TB detection resources. However, despite promising results, the need for continued development of faster, smaller, and more accurate TB detection models remains crucial in contributing to the global efforts in combating TB.
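
The F1 scores quoted above can be sanity-checked from the quoted precision and recall, since F1 is their harmonic mean. The sketch below is only an illustration: the function name `f1_score` is a hypothetical helper, and the inputs are the rounded percentages from the abstract, so the recomputed values may differ from the reported ones by about a percentage point.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded (precision, recall) pairs as quoted in the abstract.
reported = {
    "SqueezeNet": (0.98, 0.80),
    "ResNet-50": (0.88, 0.52),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_score(p, r):.3f}")
```

With these inputs the recomputed F1 is roughly 0.88 for SqueezeNet and 0.65 for ResNet-50, close to the reported 87% and 65%. The abstract does not say how its figures were rounded.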

[48] Multi-Domain Enhanced Map-Free Trajectory Prediction with Selective Attention

Wenyi Xiong,Jian Chen

Main category: cs.CV

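TL;DR: The paper proposes a novel map-free trajectory prediction algorithm that uses a Mixture of Experts mechanism and a selective attention module to extract key information efficiently across the temporal, spatial, and frequency domains, improving prediction accuracy and computational efficiency in complex interactive scenarios.

Details Motivation: Existing trajectory prediction methods struggle to extract valuable scene information efficiently in complex interactive scenarios, which lowers both computational efficiency and prediction accuracy.

Contribution: 1. Proposes a map-free trajectory prediction algorithm that operates across the temporal, spatial, and frequency domains; 2. Designs a Mixture of Experts (MoE) mechanism and a selective attention module to improve information extraction; 3. Introduces a multimodal decoder supervised by patch-level and point-level losses to generate reasonable trajectories.

Method: 1. Uses the MoE mechanism to adaptively select critical frequency components and integrates multi-scale temporal features; 2. Applies the selective attention module to filter redundant temporal and spatial information; 3. Generates trajectories with a multimodal decoder under patch-level and point-level loss supervision.

Result: Experiments on the nuScenes dataset show that the algorithm performs well in complex interactive scenarios, validating its effectiveness.

Insight: Multi-domain information extraction combined with filtering of redundant information can improve trajectory prediction, especially under complex interactions.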

Abstract: Trajectory prediction is crucial for the reliability and safety of autonomous driving systems, yet it remains a challenging task in complex interactive scenarios. Existing methods often struggle to efficiently extract valuable scene information from redundant data, thereby reducing computational efficiency and prediction accuracy, especially when dealing with intricate agent interactions. To address these challenges, we propose a novel map-free trajectory prediction algorithm that achieves trajectory prediction across the temporal, spatial, and frequency domains. Specifically, in temporal information processing, we utilize a Mixture of Experts (MoE) mechanism to adaptively select critical frequency components. Concurrently, we extract these components and integrate multi-scale temporal features. Subsequently, a selective attention module is proposed to filter out redundant information in both temporal sequences and spatial interactions. Finally, we design a multimodal decoder. Under the supervision of patch-level and point-level losses, we obtain reasonable trajectory results. Experiments on the nuScenes dataset demonstrate the superiority of our algorithm, validating its effectiveness in handling complex interactive scenarios.

[49] Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch

Yifan Zhang,Liang Hu,Haofeng Sun,Peiyu Wang,Yichen Wei,Shukang Yin,Jiangbo Pei,Wei Shen,Peng Xia,Yi Peng,Tianyidan Xie,Eric Li,Yang Liu,Xuchen Song,Yahui Zhou

Main category: cs.CV

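TL;DR: Skywork-R1V4 is a multimodal agentic model that unifies multimodal planning, active image manipulation, and deep search through interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. It reaches state-of-the-art performance using supervised fine-tuning alone.

Details Motivation: Existing approaches treat image manipulation and web search as separate capabilities, rely on costly reinforcement learning, and lack planning grounded in real tool-execution traces. Skywork-R1V4 aims to address these limitations.

Contribution: 1) Proposes Skywork-R1V4, which combines multimodal planning with dynamically interleaved reasoning; 2) Trains only with supervised fine-tuning, without reinforcement learning; 3) Achieves strong results on multiple benchmarks.

Method: A 30B (A3B) parameter multimodal agentic model is trained with supervised fine-tuning on fewer than 30,000 high-quality planning-execution trajectories, validated through stepwise consistency filtering.

Result: Scores 66.1 on MMSearch and 67.2 on FVQA, surpassing Gemini 2.5 Flash on all 11 metrics, and exhibits long-horizon reasoning with more than 10 tool calls.

Insight: Carefully curated supervised learning can be sufficient for sophisticated multimodal agentic intelligence, without reinforcement learning.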

Abstract: Despite recent progress in multimodal agentic systems, existing approaches often treat image manipulation and web search as disjoint capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces. To address these limitations, we present Skywork-R1V4, a 30B (A3B) parameter multimodal agentic model that unifies multimodal planning, active image manipulation (“thinking with images”), deep multimodal search, and, most critically, interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. Trained solely via supervised fine-tuning on fewer than 30,000 high-quality, planning-execution-consistent trajectories and validated through stepwise consistency filtering, Skywork-R1V4 achieves state-of-the-art results across perception and multimodal search benchmarks: it scores 66.1 on MMSearch and 67.2 on FVQA, surpassing Gemini 2.5 Flash on all 11 metrics. Skywork-R1V4 exhibits emergent long-horizon reasoning at inference time, successfully orchestrating more than 10 tool calls to solve complex, multi-step tasks. Our results demonstrate that sophisticated agentic multimodal intelligence can be achieved through carefully curated supervised learning alone, without any reliance on reinforcement learning.

[50] Nav-$R^2$ Dual-Relation Reasoning for Generalizable Open-Vocabulary Object-Goal Navigation

Wentao Xiang,Haokang Zhang,Tianhang Yang,Zedong Chu,Ruihang Chu,Shichao Xie,Yujian Yuan,Jian Sun,Zhining Gu,Junjie Wang,Xiaolong Wu,Mu Xu,Yujiu Yang

Main category: cs.CV

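TL;DR: Nav-$R^2$ is a generalizable open-vocabulary object-goal navigation framework based on dual-relation reasoning. It explicitly models target-environment and environment-action relations, and combines structured CoT reasoning with a Similarity-Aware Memory (SA-Mem) to improve the localization of novel objects in unseen environments.

Details Motivation: Existing open-vocabulary object-goal navigation methods suffer from opaque decision-making and low success rates when locating unseen objects. Nav-$R^2$ aims to address both through a more transparent reasoning mechanism and efficient feature fusion that improve generalization.

Contribution: 1. Proposes a dual-relation reasoning framework that explicitly models target-environment and environment-action relations. 2. Designs the SA-Mem module to fuse historical and current observation features efficiently. 3. Constructs the Nav$R^2$-CoT dataset to guide the model toward structured reasoning.

Method: 1. Structured CoT reasoning: perceive the environment step by step, focus on target-related objects, and plan actions. 2. SA-Mem: compress video frames and historical observations from temporal and semantic perspectives, keeping the most relevant features without additional parameters.

Result: Nav-$R^2$ achieves state-of-the-art performance in localizing unseen objects, avoids overfitting to seen categories, and maintains real-time inference at 2Hz.

Insight: Explicitly modeling the two relations and using structured reasoning improve transparency and generalization, and the feature fusion in SA-Mem suggests a new direction for efficient memory.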

Abstract: Object-goal navigation in open-vocabulary settings requires agents to locate novel objects in unseen environments, yet existing approaches suffer from opaque decision-making processes and low success rates in locating unseen objects. To address these challenges, we propose Nav-$R^2$, a framework that explicitly models two critical types of relationships, target-environment modeling and environment-action planning, through structured Chain-of-Thought (CoT) reasoning coupled with a Similarity-Aware Memory (SA-Mem). We construct a Nav$R^2$-CoT dataset that teaches the model to perceive the environment, focus on target-related objects in the surrounding context, and finally make future action plans. Our SA-Mem preserves the most target-relevant and current observation-relevant features from both temporal and semantic perspectives by compressing video frames and fusing historical observations, while introducing no additional parameters. Compared to previous methods, Nav-$R^2$ achieves state-of-the-art performance in localizing unseen objects through a streamlined and efficient pipeline, avoiding overfitting to seen object categories while maintaining real-time inference at 2Hz. Resources will be made publicly available at https://github.com/AMAP-EAI/Nav-R2.

[51] WISE: Weighted Iterative Society-of-Experts for Robust Multimodal Multi-Agent Debate

Anoop Cherian,River Doyle,Eyal Ben-Dov,Suhas Lohit,Kuan-Chuan Peng

Main category: cs.CV

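TL;DR: WISE is a weighted iterative society-of-experts framework for multimodal multi-agent debate. It partitions agents into solution generators (Solvers) and verifiers (Reflectors) and aggregates debate outcomes with a modified Dawid-Skene algorithm, improving performance on multimodal tasks.

Details Motivation: Multi-agent debate (MAD) works well on language tasks, but its potential for multimodal problems is underexplored. WISE aims to broaden the applicability of MAD and improve its robustness through heterogeneous experts (single-modal and multimodal) and a weighted feedback mechanism.

Contribution: 1. Proposes the WISE framework, which lets heterogeneous single-modal and multimodal experts take part in the debate; 2. Defines the division of roles between Solvers and Reflectors; 3. Proposes a modified Dawid-Skene algorithm for aggregating results; 4. Validates WISE on multiple multimodal datasets.

Method: 1. Partitions agents into Solvers, which generate solutions, and Reflectors, which verify them and provide feedback; 2. Integrates feedback through a two-stage debate model; 3. Aggregates debate results with the modified Dawid-Skene algorithm, taking the feedback weights into account.

Result: On SMART-840, VisualPuzzles, EvoChart-QA, and related datasets, WISE improves accuracy by 2-7% over existing MAD methods, showing its suitability for multimodal tasks.

Insight: Dividing work among heterogeneous experts and weighting their feedback can improve reasoning on multimodal tasks, which offers a direction for designing future multimodal debate systems.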

Abstract: Recent large language models (LLMs) are trained on diverse corpora and tasks, leading them to develop complementary strengths. Multi-agent debate (MAD) has emerged as a popular way to leverage these strengths for robust reasoning, though it has mostly been applied to language-only tasks, leaving its efficacy on multimodal problems underexplored. In this paper, we study MAD for solving vision-and-language reasoning problems. Our setup enables generalizing the debate protocol with heterogeneous experts that possess single- and multi-modal capabilities. To this end, we present Weighted Iterative Society-of-Experts (WISE), a generalized and modular MAD framework that partitions the agents into Solvers, that generate solutions, and Reflectors, that verify correctness, assign weights, and provide natural language feedback. To aggregate the agents’ solutions across debate rounds, while accounting for variance in their responses and the feedback weights, we present a modified Dawid-Skene algorithm for post-processing that integrates our two-stage debate model. We evaluate WISE on SMART-840, VisualPuzzles, EvoChart-QA, and a new SMART-840++ dataset with programmatically generated problem instances of controlled difficulty. Our results show that WISE consistently improves accuracy by 2-7% over the state-of-the-art MAD setups and aggregation methods across diverse multimodal tasks and LLM configurations.
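
As a rough illustration of weight-aware aggregation, the sketch below uses a plain weighted majority vote. This is a deliberately simpler stand-in, not the modified Dawid-Skene algorithm the abstract describes, whose details are not given here. The function name `weighted_vote` and the example weights are hypothetical.

```python
from collections import defaultdict


def weighted_vote(answers, weights):
    """Return the answer with the largest total weight.

    answers: one answer label per solver.
    weights: one non-negative weight per solver (assumed here to come
             from reflector feedback; the values are made up).
    """
    if len(answers) != len(weights):
        raise ValueError("answers and weights must have the same length")
    totals = defaultdict(float)
    for answer, weight in zip(answers, weights):
        totals[answer] += weight
    # Ties are broken in favor of the answer seen first.
    return max(totals, key=totals.get)


# Two solvers agree on "B", but a highly weighted solver says "A".
print(weighted_vote(["A", "B", "B"], [0.9, 0.3, 0.4]))
```

A Dawid-Skene style aggregator would additionally estimate each agent's reliability from the observed responses rather than take the weights as given.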

[52] MitUNet: Enhancing Floor Plan Recognition using a Hybrid Mix-Transformer and U-Net Architecture

Dmitriy Parashchuk,Alexey Kapshitskiy,Yuriy Karyakin

Main category: cs.CV

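TL;DR: MitUNet is a hybrid architecture combining Mix-Transformer and U-Net for wall segmentation in indoor floor plans. It tunes a Tversky loss to improve boundary accuracy and outperforms standard single-task models.

Details Motivation: Existing methods perform poorly on thin wall structures and produce masks with irregular boundaries that lack geometric precision, which degrades subsequent 3D reconstruction.

Contribution: 1) Proposes the MitUNet architecture, combining the global context extraction of Mix-Transformer with a U-Net decoder; 2) Adds scSE attention blocks to improve boundary recovery; 3) Uses the Tversky loss to balance precision and recall.

Method: 1) A Mix-Transformer encoder captures global context; 2) A U-Net decoder is enhanced with scSE attention blocks; 3) The Tversky loss is tuned for thin-wall segmentation.

Result: On CubiCasa5k and a proprietary dataset, MitUNet produces structurally correct masks with high boundary accuracy, outperforming standard single-task models.

Insight: Transformer encoders capture global information well, and pairing them with the local feature recovery of U-Net can improve boundary accuracy in segmentation.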

Abstract: Automatic 3D reconstruction of indoor spaces from 2D floor plans requires high-precision semantic segmentation of structural elements, particularly walls. However, existing methods optimized for standard metrics often struggle to detect thin structural components and yield masks with irregular boundaries, lacking the geometric precision required for subsequent vectorization. To address this issue, we introduce MitUNet, a hybrid neural network architecture specifically designed for wall segmentation tasks in the context of 3D modeling. In MitUNet, we utilize a hierarchical Mix-Transformer encoder to capture global context and a U-Net decoder enhanced with scSE attention blocks for precise boundary recovery. Furthermore, we propose an optimization strategy based on the Tversky loss function to effectively balance precision and recall. By fine-tuning the hyperparameters of the loss function, we prioritize the suppression of false positive noise along wall boundaries while maintaining high sensitivity to thin structures. Our experiments on the public CubiCasa5k dataset and a proprietary regional dataset demonstrate that the proposed approach ensures the generation of structurally correct masks with high boundary accuracy, outperforming standard single-task models. MitUNet provides a robust tool for data preparation in automated 3D reconstruction pipelines.
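
For readers unfamiliar with the Tversky loss, a minimal sketch follows. It uses the common convention in which `alpha` weights false positives and `beta` weights false negatives. The function name, the flat-list inputs, and the hyperparameter values are illustrative assumptions, not settings reported by the authors.

```python
def tversky_loss(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """Tversky loss for binary segmentation on flattened masks.

    pred:   predicted foreground probabilities in [0, 1].
    target: ground-truth labels (0 or 1).
    alpha:  weight on false positives.
    beta:   weight on false negatives.
    With alpha = beta = 0.5 this reduces to the Dice loss.
    """
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)


# A prediction with two false positives and no false negatives.
pred = [1.0, 1.0, 1.0, 0.0]
target = [1, 0, 0, 0]
print(tversky_loss(pred, target, alpha=0.7, beta=0.3))  # penalizes FPs more
print(tversky_loss(pred, target, alpha=0.3, beta=0.7))  # penalizes FPs less
```

Raising `alpha` relative to `beta` makes false positives more costly, which matches the abstract's stated aim of suppressing false positive noise along wall boundaries.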

[53] Generalizing Vision-Language Models with Dedicated Prompt Guidance

Xinyao Li,Yinjie Min,Hongbo Chen,Zhekai Du,Fengling Li,Jingjing Li

Main category: cs.CV

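TL;DR: The paper proposes GuiDG, a two-step framework that uses dedicated prompts to guide the generalization of vision-language models (VLMs) on downstream tasks. It targets the trade-off between domain specificity and domain generalization that conventional fine-tuning faces.

Details Motivation: Large pretrained VLMs are usually fine-tuned as a single model on the entire dataset, which can compromise generalization to unseen domains. The paper aims to address this.

Contribution: 1. Provides a theoretical analysis of generalization in VLM fine-tuning; 2. Proposes the GuiDG framework, which improves generalization by training domain experts on partitioned source domains; 3. Constructs the ImageNet-DG dataset to evaluate domain generalization in few-shot settings.

Method: GuiDG has two steps: 1. Use prompt tuning to obtain source-domain experts; 2. Use a Cross-Modal Attention module to integrate the experts adaptively and guide fine-tuning of the vision encoder.

Result: Experiments on standard DG benchmarks and ImageNet-DG show that GuiDG improves upon existing fine-tuning methods while remaining efficient.

Insight: Training several parameter-efficient expert models, rather than one universal model, can improve domain generalization.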

Abstract: Fine-tuning large pretrained vision-language models (VLMs) has emerged as a prevalent paradigm for downstream adaptation, yet it faces a critical trade-off between domain specificity and domain generalization (DG) ability. Current methods typically fine-tune a universal model on the entire dataset, which potentially compromises the ability to generalize to unseen domains. To fill this gap, we provide a theoretical understanding of the generalization ability for VLM fine-tuning, which reveals that training multiple parameter-efficient expert models on partitioned source domains leads to better generalization than fine-tuning a universal model. Inspired by this finding, we propose a two-step domain-expert-Guided DG (GuiDG) framework. GuiDG first employs prompt tuning to obtain source domain experts, then introduces a Cross-Modal Attention module to guide the fine-tuning of the vision encoder via adaptive expert integration. To better evaluate few-shot DG, we construct ImageNet-DG from ImageNet and its variants. Extensive experiments on standard DG benchmarks and ImageNet-DG demonstrate that GuiDG improves upon state-of-the-art fine-tuning methods while maintaining efficiency.

[54] GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning

Haolong Yan,Yeqing Shen,Xin Huang,Jia Wang,Kaijun Tan,Zhixuan Liang,Hongxin Li,Zheng Ge,Osamu Yoshie,Si Li,Xiangyu Zhang,Daxin Jiang

Main category: cs.CV

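TL;DR: The paper introduces GUI Exploration Lab, a simulation environment engine for research on screen navigation by GUI agents, and improves navigation performance through supervised fine-tuning followed by single-turn and multi-turn reinforcement learning.

Details Motivation: Real-world GUI environments are complex and proprietary, so comprehensive environment information is hard to obtain. This limits systematic study and evaluation of agent navigation capabilities.

Contribution: Designs the GUI Exploration Lab simulation engine, which supports flexible definition of screens, icons, and navigation graphs and provides full access to environment information.

Method: Combines supervised fine-tuning with single-turn and multi-turn reinforcement learning to improve the agent's navigation performance in stages.

Result: The approach is validated on static and interactive benchmarks, where reinforcement learning clearly improves GUI navigation performance.

Insight: Supervised fine-tuning is an essential foundation, and multi-turn reinforcement learning further refines the navigation strategy through interactive exploration.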

Abstract: With the rapid development of Large Vision Language Models, the focus of Graphical User Interface (GUI) agent tasks shifts from single-screen tasks to complex screen navigation challenges. However, real-world GUI environments, such as PC software and mobile Apps, are often complex and proprietary, making it difficult to obtain the comprehensive environment information needed for agent training and evaluation. This limitation hinders systematic investigation and benchmarking of agent navigation capabilities. To address this limitation, we introduce GUI Exploration Lab, a simulation environment engine for GUI agent navigation research that enables flexible definition and composition of screens, icons, and navigation graphs, while providing full access to environment information for comprehensive agent training and evaluation. Through extensive experiments, we find that supervised fine-tuning enables effective memorization of fundamental knowledge, serving as a crucial foundation for subsequent training. Building on this, single-turn reinforcement learning further enhances generalization to unseen scenarios. Finally, multi-turn reinforcement learning encourages the development of exploration strategies through interactive trial and error, leading to further improvements in screen navigation performance. We validate our methods on both static and interactive benchmarks, demonstrating that our findings generalize effectively to real-world scenarios. These findings demonstrate the advantages of reinforcement learning approaches in GUI navigation and offer practical guidance for building more capable and generalizable GUI agents.

[55] WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

Woongyeong Yeo,Kangsan Kim,Jaehong Yoon,Sung Ju Hwang

Main category: cs.CV

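TL;DR: WorldMM is a dynamic multimodal memory agent that builds and retrieves from complementary memories containing both textual and visual representations, addressing the challenges of long video reasoning and improving performance.

Details Motivation: Video large language models understand short clips well, but on long videos (hours or days) they face limited context capacity and loss of visual detail. Conventional memory methods based on text summaries cannot make full use of visual evidence.

Contribution: WorldMM proposes a novel multimodal memory agent with three memory types (episodic, semantic, and visual) and an adaptive retrieval agent that dynamically selects the relevant memory source and temporal granularity.

Method: WorldMM builds three complementary memories (episodic, semantic, and visual) for long video reasoning and uses an adaptive retrieval agent to select and integrate information iteratively.

Result: Across five long video question-answering benchmarks, WorldMM improves average performance by 8.4% over previous state-of-the-art methods and outperforms the baselines.

Insight: Dynamic construction of multimodal memory and adaptive retrieval are key to long video reasoning, and the complementarity of textual and visual information improves reasoning ability.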

Abstract: Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and leverages multiple temporal granularities based on the query, continuing until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over previous state-of-the-art methods, showing its effectiveness on long video reasoning.

[56] LightHCG: a Lightweight yet powerful HSIC Disentanglement based Causal Glaucoma Detection Model framework

Daeyoung Kim

Main category: cs.CV

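TL;DR: The paper proposes LightHCG, a lightweight glaucoma detection model driven by causal representations. It uses HSIC-based disentanglement and a Graph Autoencoder for causal representation learning, sharply reducing the parameter count while improving performance.

Details Motivation: Conventional AI-driven glaucoma detection falls short in reliability, parameter count, spurious correlations, and applicability to intervention analysis, so a more efficient and causally driven solution is needed.

Contribution: Proposes the LightHCG model, which combines HSIC-based disentanglement with causal representation learning to deliver lightweight, high-performance glaucoma classification while supporting intervention analysis.

Method: Uses HSIC-based latent space disentanglement and a Graph Autoencoder for unsupervised causal representation learning.

Result: The model outperforms advanced models such as InceptionV3 and MobileNetV2 on glaucoma classification while using 93~99% fewer parameters.

Insight: Causally driven representation learning can make medical image analysis models lighter and more reliable while supporting clinical intervention analysis.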

Abstract: As a representative degenerative optic condition, glaucoma threatens millions of people because of its irreversibility and its severe impact on the visual field. Mainly characterized by dimmed and blurred vision or peripheral vision loss, glaucoma is known to arise from damage to the optic nerve caused by increased intraocular pressure (IOP) or neovascularization within the retina. Traditionally, most glaucoma-related work and clinical diagnosis have focused on detecting this optic nerve damage using patient data from perimetry tests, optic papilla inspections, and tonometer-based IOP measurements. Recently, with advances in computer vision models such as VGG16 and Vision Transformers (ViT), AI-automated glaucoma detection and optic cup segmentation based on retinal fundus images or OCT have shown strong performance in aiding conventional diagnosis. However, current AI-driven glaucoma detection approaches still have significant room for improvement in terms of reliability, excessive parameter usage, the possibility of spurious correlations within detection, and limited applicability to intervention analysis or clinical simulations. This research therefore introduces a novel causal-representation-driven glaucoma detection model: LightHCG, an extremely lightweight convolutional VAE-based latent glaucoma representation model that can consider the true causality among glaucoma-related physical factors within the optic nerve region. Using HSIC-based latent space disentanglement and Graph Autoencoder-based unsupervised causal representation learning, LightHCG not only achieves higher performance in classifying glaucoma with 93~99% fewer weights, but also enhances the possibility of AI-driven intervention analysis, compared with existing advanced vision models such as InceptionV3, MobileNetV2, or VGG16.
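
HSIC (the Hilbert-Schmidt Independence Criterion) measures statistical dependence between two variables through kernel Gram matrices. The sketch below shows the standard biased empirical estimator, tr(KHLH)/(n-1)^2, for one-dimensional samples with a Gaussian kernel. The function names and the bandwidth `sigma` are assumptions, and the abstract does not say how LightHCG builds HSIC into its training objective.

```python
import math


def gaussian_gram(xs, sigma=1.0):
    """Gram matrix of a Gaussian kernel over 1-D samples."""
    return [[math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in xs] for a in xs]


def center(K):
    """Double-center a Gram matrix, i.e. compute H K H with H = I - 11^T / n."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]


def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC estimate: tr(K H L H) / (n - 1)^2."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equally sized samples with n >= 2")
    Kc = center(gaussian_gram(xs, sigma))
    L = gaussian_gram(ys, sigma)
    # tr(K H L H) equals the elementwise sum of (H K H) * L for symmetric L.
    return sum(Kc[i][j] * L[i][j] for i in range(n) for j in range(n)) / (n - 1) ** 2


xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
print(hsic(xs, xs))         # clearly positive: xs depends on itself
print(hsic(xs, [1.0] * 6))  # approximately zero: a constant carries no information
```

In disentanglement settings, an estimate like this is typically used as a penalty that pushes groups of latent variables toward independence.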

[57] Boosting Medical Vision-Language Pretraining via Momentum Self-Distillation under Limited Computing Resources

Phuc Pham,Nhu Pham,Ngoc Quoc Ly

Main category: cs.CV

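TL;DR: The paper proposes combining momentum self-distillation with gradient accumulation to improve the efficiency and performance of medical vision-language pretraining under limited computing resources.

Details Motivation: Detailed annotations are hard to obtain in medicine, which calls for efficient vision-language models (VLMs). Contrastive learning (CL), however, needs large batches and heavy compute, limiting its use in resource-constrained settings. The authors address this with momentum self-distillation.

Contribution: 1. Proposes momentum self-distillation to improve multimodal learning efficiency; 2. Combines the momentum mechanism with gradient accumulation to enlarge the effective batch size without increasing resource consumption.

Method: Trains with momentum self-distillation and gradient accumulation jointly, improving learning with small batches and overall computational efficiency.

Result: Matches SOTA methods on zero-shot classification, exceeds 90% AUC-ROC on few-shot adaptation, improves retrieval by 2-3%, and trains efficiently on a single GPU.

Insight: Momentum self-distillation can improve model performance under limited resources, and gradient accumulation makes small-batch training practical.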

Abstract: In medical healthcare, obtaining detailed annotations is challenging, highlighting the need for robust Vision-Language Models (VLMs). Pretrained VLMs enable fine-tuning on small datasets or zero-shot inference, achieving performance comparable to task-specific models. Contrastive learning (CL) is a key paradigm for training VLMs but inherently requires large batch sizes for effective learning, making it computationally demanding and often limited to well-resourced institutions. Moreover, with limited data in healthcare, it is important to prioritize knowledge extraction from both data and models during training to improve performance. Therefore, we focus on leveraging the momentum method combined with distillation to simultaneously address computational efficiency and knowledge exploitation. Our contributions can be summarized as follows: (1) leveraging momentum self-distillation to enhance multimodal learning, and (2) integrating momentum mechanisms with gradient accumulation to enlarge the effective batch size without increasing resource consumption. Our method attains competitive performance with state-of-the-art (SOTA) approaches in zero-shot classification, while providing a substantial boost in few-shot adaptation, achieving over 90% AUC-ROC and improving retrieval tasks by 2-3%. Importantly, our method achieves high training efficiency with a single GPU while maintaining reasonable training time. Our approach aims to advance efficient multimodal learning by reducing resource requirements while improving performance over SOTA methods. The implementation of our method is available at https://github.com/phphuc612/MSD.
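
The two generic ingredients named above can be sketched on a toy problem. This is a minimal illustration with a scalar weight and a mean-squared-error loss, not the authors' training loop. All function names are hypothetical, and the momentum value is arbitrary.

```python
def grad_mse(w, batch):
    """Gradient of mean((w * x - y)^2) with respect to a scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size):
    """Average the gradients of equally sized micro-batches.

    For a loss that is a plain mean over samples, this equals the
    gradient of one large batch holding all the samples.
    """
    chunks = [data[i:i + micro_batch_size] for i in range(0, len(data), micro_batch_size)]
    return sum(grad_mse(w, chunk) for chunk in chunks) / len(chunks)


def ema_update(teacher, student, momentum=0.99):
    """Momentum (exponential moving average) update of a teacher parameter."""
    return momentum * teacher + (1 - momentum) * student


data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 13.9), (8.0, 16.0)]
print(grad_mse(0.5, data))             # gradient of one batch of 8
print(accumulated_grad(0.5, data, 2))  # same value from 4 micro-batches of 2
print(ema_update(1.0, 0.0, momentum=0.9))
```

The equivalence shown here holds only for losses that average independently over samples. Contrastive losses couple the samples within a batch, so plain accumulation does not by itself reproduce a larger batch. The abstract does not detail how the momentum mechanism is combined with accumulation to handle this.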

[58] Basis-Oriented Low-rank Transfer for Few-Shot and Test-Time Adaptation

Junghwan Park,Woojin Cho,Junhyuk Heo,Darongsae Kwon,Kookjin Lee

Main category: cs.CV

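TL;DR: BOLT is a basis-oriented low-rank transfer method. It extracts a task-informed spectral basis and adapts within that subspace, enabling efficient few-shot and test-time transfer while avoiding the extra training cost and instability of conventional meta-learning.

Details Motivation: Large pretrained models need to adapt efficiently to new tasks, but conventional approaches such as meta-learning are costly to train and can be unstable. BOLT reuses existing task-specific models by extracting an orthogonal basis from them and adapting efficiently to new tasks.

Contribution: Proposes the BOLT framework, which extracts an orthogonal spectral basis from task vectors and trains only a small set of diagonal coefficients, providing both a strong training-free initialization and a lightweight parameter-efficient fine-tuning path.

Method: BOLT has an offline and an online phase. Offline, it collects the dominant singular directions of the task vectors and orthogonalizes them into a basis. Online, it freezes the basis and trains a small set of diagonal coefficients, yielding a low-rank update.

Result: Experiments show that the parameter-efficient fine-tuning path of BOLT achieves robust performance compared with common PEFT baselines and a meta-learned initialization.

Insight: Constraining adaptation to a task-informed orthogonal subspace is an effective alternative for transfer to unseen tasks.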

Abstract: Adapting large pre-trained models to unseen tasks under tight data and compute budgets remains challenging. Meta-learning approaches explicitly learn good initializations, but they require an additional meta-training phase over many tasks, incur high training cost, and can be unstable. At the same time, the number of task-specific pre-trained models continues to grow, yet the question of how to transfer them to new tasks with minimal additional training remains relatively underexplored. We propose BOLT (Basis-Oriented Low-rank Transfer), a framework that reuses existing fine-tuned models not by merging weights, but instead by extracting an orthogonal, task-informed spectral basis and adapting within that subspace. In the offline phase, BOLT collects dominant singular directions from multiple task vectors and orthogonalizes them per layer to form reusable bases. In the online phase, we freeze these bases and train only a small set of diagonal coefficients per layer for the new task, yielding a rank-controlled update with very few trainable parameters. This design provides (i) a strong, training-free initialization for unseen tasks, obtained by pooling source-task coefficients, along with a lightweight rescaling step while leveraging the shared orthogonal bases, and (ii) a parameter-efficient fine-tuning (PEFT) path that, in our experiments, achieves robust performance compared to common PEFT baselines as well as a representative meta-learned initialization. Our results show that constraining adaptation to a task-informed orthogonal subspace provides an effective alternative for unseen-task transfer.
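
The offline and online phases can be pictured with the NumPy sketch below for a single layer. It is one possible reading of the abstract, not the authors' implementation. Pooling the top-k singular vectors, orthogonalizing with QR, and pairing the left and right bases through a diagonal are all assumptions. The function names, matrix sizes, and `k` are hypothetical.

```python
import numpy as np


def build_basis(task_vectors, k):
    """Offline phase (assumed form): pool the top-k singular directions
    of each task vector and orthogonalize the pooled directions."""
    lefts, rights = [], []
    for delta in task_vectors:
        u, _, vt = np.linalg.svd(delta, full_matrices=False)
        lefts.append(u[:, :k])
        rights.append(vt[:k, :].T)
    # Reduced QR returns orthonormal columns spanning the pooled directions.
    U, _ = np.linalg.qr(np.concatenate(lefts, axis=1))
    V, _ = np.linalg.qr(np.concatenate(rights, axis=1))
    return U, V


def low_rank_update(U, V, coeffs):
    """Online phase (assumed form): U diag(coeffs) V^T with U and V frozen.
    Only the vector coeffs would be trained."""
    return (U * coeffs) @ V.T


rng = np.random.default_rng(0)
# Stand-ins for task vectors (fine-tuned minus pretrained weights).
task_vectors = [rng.standard_normal((6, 5)) for _ in range(2)]
U, V = build_basis(task_vectors, k=2)

coeffs = np.ones(U.shape[1])  # 4 trainable scalars for this layer
delta_w = low_rank_update(U, V, coeffs)
print(delta_w.shape, np.linalg.matrix_rank(delta_w))
```

In this toy setup the layer has 30 weights but only 4 trainable coefficients, which is the sense in which the update is rank-controlled and parameter-efficient.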

[59] nuScenes Revisited: Progress and Challenges in Autonomous Driving

Whye Kit Fong,Venice Erin Liong,Kok Seang Tan,Holger Caesar

Main category: cs.CV

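TL;DR: This paper revisits nuScenes, a widely used autonomous driving dataset. It discusses the dataset's contributions to multi-modal sensor fusion, standardized benchmarks, and a broad range of tasks, traces its influence on later datasets and community standards, and gives a detailed account of how nuScenes and its extensions were created.

Details Motivation: Progress in autonomous driving depends on high-quality datasets. nuScenes is one of the key datasets, and its use in multi-modal sensor fusion and diverse tasks has mattered to the community. The paper aims to reveal the details of nuScenes and its impact.

Contribution: 1. Provides detailed technical information on nuScenes and its extensions (nuImages and Panoptic nuScenes); 2. Analyzes the influence of nuScenes on other datasets and community standards; 3. Surveys official and unofficial tasks built on nuScenes and the associated methodological developments.

Method: The paper is a systematic literature review that revisits the creation of nuScenes, the technical details of its extensions, and the development of tasks and methods based on the dataset.

Result: By combining multi-modal sensor data with support for diverse tasks, nuScenes has advanced the autonomous driving field and become a reference for later datasets and community standards.

Insight: The success of nuScenes shows the central role of high-quality datasets in advancing autonomous driving, particularly the importance of multi-modal data fusion and standardized tasks.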

Abstract: Autonomous Vehicles (AV) and Advanced Driver Assistance Systems (ADAS) have been revolutionized by Deep Learning. As a data-driven approach, Deep Learning relies on vast amounts of driving data, typically labeled in great detail. As a result, datasets, alongside hardware and algorithms, are foundational building blocks for the development of AVs. In this work we revisit one of the most widely used autonomous driving datasets: the nuScenes dataset. nuScenes exemplifies key trends in AV development, being the first dataset to include radar data, to feature diverse urban driving scenes from two continents, and to be collected using a fully autonomous vehicle operating on public roads, while also promoting multi-modal sensor fusion, standardized benchmarks, and a broad range of tasks including perception, localization & mapping, prediction and planning. We provide an unprecedented look into the creation of nuScenes, as well as its extensions nuImages and Panoptic nuScenes, summarizing many technical details that have hitherto not been revealed in academic publications. Furthermore, we trace how the influence of nuScenes impacted a large number of other datasets that were released later and how it defined numerous standards that are used by the community to this day. Finally, we present an overview of both official and unofficial tasks using the nuScenes dataset and review major methodological developments, thereby offering a comprehensive survey of the autonomous driving literature, with a particular focus on nuScenes.

[60] HouseLayout3D: A Benchmark and Training-Free Baseline for 3D Layout Estimation in the Wild

Valentin Bieri,Marie-Julie Rakotosaona,Keisuke Tateno,Francis Engelmann,Leonidas Guibas

Main category: cs.CV

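TL;DR: The paper introduces HouseLayout3D, a real-world 3D layout estimation benchmark that covers multiple floors and complex spaces, and proposes MultiFloor3D, a training-free baseline that outperforms existing methods.

Details Motivation: Current 3D layout estimation models are trained mainly on synthetic datasets and cannot handle complex multi-floor buildings in the real world.

Contribution: 1. Introduces HouseLayout3D, a real-world benchmark for multi-floor 3D layout estimation. 2. Proposes MultiFloor3D, a training-free baseline.

Method: MultiFloor3D builds on recent scene understanding methods and handles multi-floor layouts without training.

Result: MultiFloor3D outperforms existing methods on both the HouseLayout3D benchmark and prior datasets.

Insight: Global spatial context is essential for multi-floor buildings, and existing methods lose it when scenes are split into individual floors.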

Abstract: Current 3D layout estimation models are primarily trained on synthetic datasets containing simple single room or single floor environments. As a consequence, they cannot natively handle large multi floor buildings and require scenes to be split into individual floors before processing, which removes global spatial context that is essential for reasoning about structures such as staircases that connect multiple levels. In this work, we introduce HouseLayout3D, a real world benchmark designed to support progress toward full building scale layout estimation, including multiple floors and architecturally intricate spaces. We also present MultiFloor3D, a simple training free baseline that leverages recent scene understanding methods and already outperforms existing 3D layout estimation models on both our benchmark and prior datasets, highlighting the need for further research in this direction. Data and code are available at: https://houselayout3d.github.io.

[61] See, Think, Learn: A Self-Taught Multimodal Reasoner

Sourabh Sharma,Sonam Gupta,Sadbhawna

Main category: cs.CV

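TL;DR: The paper proposes See-Think-Learn (STL), a self-training framework that jointly improves the perception and reasoning of vision-language models through a structured reasoning template and negative rationales, without relying on costly human-annotated data.

Details Motivation: Existing vision-language models have weaknesses in both perception and reasoning, and methods for improving reasoning often depend on costly human annotation or overlook perception. The authors propose STL to address these issues.

Contribution: 1. Proposes the STL self-training framework, which jointly improves perception and reasoning through a structured reasoning template (extract visual attributes first, then reason); 2. Introduces negative rationales (explanations of why certain answers are wrong) to sharpen the model's discrimination; 3. Shows that STL outperforms baselines across multiple domains.

Method: The core of STL is a structured reasoning template with two steps: first extract visual attributes (See), then use them to guide reasoning (Think). The model generates and learns from its own structured rationales in a self-training loop, with negative rationales added for robustness.

Result: Experiments show that STL outperforms baselines trained only on answers or on unstructured self-generated reasoning across multiple tasks, and that its rationales are of high quality.

Insight: Jointly optimizing perception and reasoning is key; negative rationales effectively improve discrimination; and self-training offers a low-cost route to better multimodal reasoning.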

Abstract: Vision-Language Models (VLMs) have achieved remarkable progress in integrating visual perception with language understanding. However, effective multimodal reasoning requires both accurate perception and robust reasoning, and weakness in either limits the performance of VLMs. Prior efforts to enhance reasoning often depend on high-quality chain-of-thought (CoT) data, obtained via labor-intensive human annotations, costly proprietary models, or self-training methods that overlook perception. To address these limitations, we propose a simple yet effective self-training framework called See-Think-Learn (STL). At its core, STL introduces a structured reasoning template that encourages the model to see before thinking, first extracting visual attributes in textual form, then using them to guide reasoning. The framework jointly improves perception and reasoning by having the model generate and learn from its own structured rationales in a self-training loop. Furthermore, we augment the training data with negative rationales, i.e. explanations that justify why certain answer choices are incorrect, to enhance the model’s ability to distinguish between correct and misleading responses. This fosters more discriminative and robust learning. Experiments across diverse domains show that STL consistently outperforms baselines trained directly only on answers or self-generated reasoning, while qualitative analysis confirms the high quality of its rationales. STL thus provides a cost-effective solution to enhance multimodal reasoning ability of VLMs.

[62] Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation

Jianzong Wu,Hao Lian,Dachao Hao,Ye Tian,Qingyu Shi,Biaolong Chen,Hao Jiang

Main category: cs.CV

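TL;DR: This paper studies whether audio-video joint denoising training improves video generation quality even when only the video modality matters. By introducing the AVFullDiT architecture and comparing T2AV and T2V models, it finds consistent gains and hypothesizes that audio acts as a privileged signal that improves the physical plausibility of video dynamics.

Details Motivation: To explore whether cross-modal joint training can implicitly improve video generation quality through the audio signal, beyond audio-video synchrony alone.

Contribution: Proposes the AVFullDiT architecture and provides the first systematic evidence that audio-video joint denoising improves video quality, especially in scenes with complex motion.

Method: Builds AVFullDiT from pre-trained T2V and T2A modules, then trains a T2AV model and a T2V-only counterpart under identical settings for comparison.

Result: Joint training yields consistent improvements on challenging subsets with large motions and object contact, which the authors attribute to audio acting as a privileged signal.

Insight: Cross-modal co-training can implicitly strengthen a model's understanding of the physical world, suggesting a new optimization direction for generative models.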

Abstract: Recent audio-video generative systems suggest that coupling modalities benefits not only audio-video synchrony but also the video modality itself. We pose a fundamental question: Does audio-video joint denoising training improve video generation, even when we only care about video quality? To study this, we introduce a parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture that leverages pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising. We train (i) a T2AV model with AVFullDiT and (ii) a T2V-only counterpart under identical settings. Our results provide the first systematic evidence that audio-video joint denoising can deliver more than synchrony. We observe consistent improvements on challenging subsets featuring large and object contact motions. We hypothesize that predicting audio acts as a privileged signal, encouraging the model to internalize causal relationships between visual events and their acoustic consequences (e.g., collision $\times$ impact sound), which in turn regularizes video dynamics. Our findings suggest that cross-modal co-training is a promising approach to developing stronger, more physically grounded world models. Code and dataset will be made publicly available.

[63] Vision to Geometry: 3D Spatial Memory for Sequential Embodied MLLM Reasoning and Exploration

Zhongyi Cai,Yi Du,Chen Wang,Yu Kong

Main category: cs.CV

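TL;DR: The paper proposes 3DSPMR, which augments multimodal large language models with a 3D spatial memory to address spatial understanding and reasoning in sequential embodied tasks, and validates it on the SEER-Bench benchmark.

Details Motivation: Existing research on indoor embodied tasks focuses mostly on single-task settings, whereas deployed agents often face sequential tasks that require reusing spatial knowledge from earlier exploration. The lack of systematic study of this challenge motivates the work.

Contribution: (1) SEER-Bench, a benchmark covering sequential embodied tasks; (2) 3DSPMR, which the authors describe as, to the best of their knowledge, the first approach to explicitly incorporate geometric information into MLLM-based spatial reasoning.

Method: 3DSPMR fuses relational, visual, and geometric cues from explored regions into a 3D spatial memory that augments the MLLM for reasoning and exploration in sequential tasks.

Result: Experiments show that 3DSPMR achieves substantial performance gains on both sequential EQA and EMN tasks.

Insight: Explicit geometric information is important for MLLM spatial understanding, especially in complex sequential tasks, and points to a new direction for embodied AI research.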

Abstract: Existing research on indoor embodied tasks typically requires agents to actively explore unknown environments and reason about the scene to achieve a specific goal. However, when deployed in real life, agents often face sequential tasks, where each new sub-task follows the completion of the previous one, and certain sub-tasks may be infeasible, such as searching for a non-existent object. Compared with the single-task setting, the core challenge lies in reusing spatial knowledge accumulated from previous explorations to support subsequent reasoning and exploration. In this work, we investigate this underexplored yet practically significant embodied AI challenge. To evaluate this challenge, we introduce SEER-Bench, a new Sequential Embodied Exploration and Reasoning Benchmark encompassing two classic embodied tasks: Embodied Question Answering (EQA) and Embodied Multi-modal Navigation (EMN). Building on SEER-Bench, we propose 3DSPMR, a 3D SPatial Memory Reasoning approach that exploits relational, visual, and geometric cues from explored regions to augment Multi-Modal Large Language Models (MLLMs) for reasoning and exploration in sequential embodied tasks. To the best of our knowledge, this is the first work to explicitly incorporate geometric information into MLLM-based spatial understanding and reasoning. Extensive experiments verify that 3DSPMR achieves substantial performance gains on both sequential EQA and EMN tasks.

[64] TGDD: Trajectory Guided Dataset Distillation with Balanced Distribution

Fengli Ran,Xiao Pu,Bo Liu,Xiuli Bi,Bin Xiao

Main category: cs.CV

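TL;DR: TGDD proposes a trajectory-guided dataset distillation method that dynamically aligns feature distributions and adds a distribution constraint regularization, improving the semantic diversity and representativeness of the synthetic data.

Details Motivation: Existing distribution matching methods for dataset distillation overlook how features evolve during training, which limits the expressiveness of the synthetic data.

Contribution: By dynamically aligning feature distributions and introducing a distribution constraint regularization, TGDD improves the downstream performance of synthetic data.

Method: At each training stage, TGDD aligns the feature distribution of the synthetic data with that of the original data, and uses a regularizer to reduce class overlap.

Result: Experiments on ten datasets show that TGDD achieves state-of-the-art performance, notably a 5.0% accuracy gain on high-resolution benchmarks.

Insight: Dynamically aligning feature distributions and balancing the data distribution are key to effective dataset distillation.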

Abstract: Dataset distillation compresses large datasets into compact synthetic ones to reduce storage and computational costs. Among various approaches, distribution matching (DM)-based methods have attracted attention for their high efficiency. However, they often overlook the evolution of feature representations during training, which limits the expressiveness of synthetic data and weakens downstream performance. To address this issue, we propose Trajectory Guided Dataset Distillation (TGDD), which reformulates distribution matching as a dynamic alignment process along the model’s training trajectory. At each training stage, TGDD captures evolving semantics by aligning the feature distribution between the synthetic and original dataset. Meanwhile, it introduces a distribution constraint regularization to reduce class overlap. This design helps synthetic data preserve both semantic diversity and representativeness, improving performance in downstream tasks. Without additional optimization overhead, TGDD achieves a favorable balance between performance and efficiency. Experiments on ten datasets demonstrate that TGDD achieves state-of-the-art performance, notably a 5.0% accuracy gain on high-resolution benchmarks.
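
The idea of matching distributions along a training trajectory can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the common mean-embedding form of distribution matching (squared distance between mean features) and represents each training stage by a hypothetical feature-extractor callable. The names `dm_loss` and `trajectory_dm_loss` are made up for this sketch, and the distribution constraint regularization is not shown because the abstract does not specify its form.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[c] for v in vectors) / n for c in range(len(vectors[0]))]


def dm_loss(real_feats, syn_feats):
    """Squared Euclidean distance between the mean real and mean synthetic features."""
    mean_real = mean_vector(real_feats)
    mean_syn = mean_vector(syn_feats)
    return sum((a - b) ** 2 for a, b in zip(mean_real, mean_syn))


def trajectory_dm_loss(extractors, real, synthetic):
    """Sum the matching loss over feature extractors taken from successive
    training stages (assumption: one callable per stage)."""
    return sum(
        dm_loss([f(x) for x in real], [f(x) for x in synthetic])
        for f in extractors
    )


# Two toy "stages": an identity extractor and one that doubles every feature.
stages = [lambda x: x, lambda x: [2 * v for v in x]]
real_data = [[0.0, 0.0], [2.0, 2.0]]
loss = trajectory_dm_loss(stages, real_data, [[2.0, 1.0]])
```

In a real pipeline the synthetic samples would be the optimization variables and the loss would be minimized by gradient descent; the pure-Python version above only shows how the per-stage terms are accumulated.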

[65] WorldPack: Compressed Memory Improves Spatial Consistency in Video World Modeling

Yuta Oshima,Yusuke Iwasawa,Masahiro Suzuki,Yutaka Matsuo,Hiroki Furuta

Main category: cs.CV

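TL;DR: WorldPack proposes an efficient compressed memory that significantly improves the spatial consistency and quality of long-term generation in video world models, addressing the high computational cost of conventional approaches.

Details Motivation: Conventional video world models incur prohibitive computational cost on long-context inputs, which makes temporal and spatial consistency hard to maintain. WorldPack aims to improve both efficiency and consistency through compressed memory.

Contribution: Proposes a compressed memory (trajectory packing plus memory retrieval) that significantly improves the spatial consistency and quality of long-term generation, and outperforms existing methods on the LoopNav benchmark.

Method: Trajectory packing provides efficient context compression, while memory retrieval maintains rollout consistency and supports generation that requires spatial reasoning.

Result: On LoopNav, a Minecraft benchmark, WorldPack notably outperforms existing methods, confirming its advantage in long-term consistency.

Insight: Compressed memory can relieve the computational bottleneck of long-term world modeling while preserving generation quality and spatial consistency.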

Abstract: Video world models have attracted significant attention for their ability to produce high-fidelity future visual observations conditioned on past observations and navigation actions. Temporally- and spatially-consistent, long-term world modeling has been a long-standing problem, unresolved with even recent state-of-the-art models, due to the prohibitively expensive computational costs for long-context inputs. In this paper, we propose WorldPack, a video world model with efficient compressed memory, which significantly improves spatial consistency, fidelity, and quality in long-term generation despite much shorter context length. Our compressed memory consists of trajectory packing and memory retrieval; trajectory packing realizes high context efficiency, and memory retrieval maintains the consistency in rollouts and helps long-term generations that require spatial reasoning. Our performance is evaluated with LoopNav, a benchmark on Minecraft, specialized for the evaluation of long-term consistency, and we verify that WorldPack notably outperforms strong state-of-the-art models.

[66] G-SHARP: Gaussian Surgical Hardware Accelerated Real-time Pipeline

Vishwesh Nath,Javier G. Tejero,Ruilong Li,Filippo Filicori,Mahdi Azizian,Sean D. Huver

Main category: cs.CV

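TL;DR: G-SHARP is a real-time surgical scene reconstruction framework designed for minimally invasive procedures. Built on GSplat, it delivers high-fidelity 3D modeling suitable for real-time intra-operative visualization.

Details Motivation: Existing Gaussian splatting implementations depend on non-commercial derivatives, which limits deployability. G-SHARP addresses this with a commercially compatible, real-time surgical reconstruction framework.

Contribution: 1. The first surgical reconstruction pipeline built natively on GSplat (Apache-2.0); 2. Support for deformation modeling, occlusion handling, and high-fidelity reconstruction; 3. A Holoscan SDK application that runs on NVIDIA IGX Orin and Thor hardware.

Method: Builds on the GSplat differentiable Gaussian rasterizer to implement deformation modeling and occlusion handling, evaluates on the EndoNeRF pulling benchmark, and deploys on edge hardware through the Holoscan SDK.

Result: Achieves state-of-the-art reconstruction quality on the EndoNeRF benchmark, with a speed-accuracy trade-off suitable for real-time intra-operative use.

Insight: A commercially compatible Gaussian splatting framework can substantially improve the usability and deployability of real-time surgical reconstruction.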

Abstract: We propose G-SHARP, a commercially compatible, real-time surgical scene reconstruction framework designed for minimally invasive procedures that require fast and accurate 3D modeling of deformable tissue. While recent Gaussian splatting approaches have advanced real-time endoscopic reconstruction, existing implementations often depend on non-commercial derivatives, limiting deployability. G-SHARP overcomes these constraints by being the first surgical pipeline built natively on the GSplat (Apache-2.0) differentiable Gaussian rasterizer, enabling principled deformation modeling, robust occlusion handling, and high-fidelity reconstructions on the EndoNeRF pulling benchmark. Our results demonstrate state-of-the-art reconstruction quality with strong speed-accuracy trade-offs suitable for intra-operative use. Finally, we provide a Holoscan SDK application that deploys G-SHARP on NVIDIA IGX Orin and Thor edge hardware, enabling real-time surgical visualization in practical operating-room settings.

[67] UCAgents: Unidirectional Convergence for Visual Evidence Anchored Multi-Agent Medical Decision-Making

Qianhan Feng,Zhongzhen Huang,Yakun Zhu,Xiaofan Zhang,Qi Dou

Main category: cs.CV

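TL;DR: UCAgents proposes a hierarchical multi-agent framework that enforces unidirectional convergence through structured evidence auditing. It addresses the detachment of linguistic explanations from visual evidence in medical visual question answering and improves both diagnostic accuracy and computational efficiency.

Details Motivation: Existing vision-language models suffer from reasoning detachment in medical diagnosis: their explanations drift from the visual evidence, which undermines clinical trust. Multi-agent frameworks reduce single-model bias, but open-ended discussion adds textual noise and computational cost and fails to anchor reasoning to visual evidence.

Contribution: 1. Proposes the UCAgents framework, which uses unidirectional convergence and structured evidence auditing to suppress textual noise and strengthen visual signal extraction; 2. Introduces a one-round inquiry discussion to detect risks of visual-textual misalignment; 3. Formalizes the dual-noise bottleneck via information theory.

Method: UCAgents uses a hierarchical multi-agent structure that forbids position changes and limits agent interactions to targeted evidence verification. Inspired by clinical workflows, it adds a one-round inquiry discussion to jointly constrain visual ambiguity and textual noise.

Result: Across four medical VQA benchmarks, UCAgents reaches 71.3% accuracy on PathVQA (+6.0% over state-of-the-art) with 87.7% lower token cost, confirming that it balances visual evidence extraction against textual noise suppression.

Insight: Structured evidence auditing and unidirectional convergence can improve the reliability and efficiency of medical diagnosis, both of which matter for real-world clinical deployment.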

Abstract: Vision-Language Models (VLMs) show promise in medical diagnosis, yet suffer from reasoning detachment, where linguistically fluent explanations drift from verifiable image evidence, undermining clinical trust. Recent multi-agent frameworks simulate Multidisciplinary Team (MDT) debates to mitigate single-model bias, but open-ended discussions amplify textual noise and computational cost while failing to anchor reasoning to visual evidence, the cornerstone of medical decision-making. We propose UCAgents, a hierarchical multi-agent framework enforcing unidirectional convergence through structured evidence auditing. Inspired by clinical workflows, UCAgents forbids position changes and limits agent interactions to targeted evidence verification, suppressing rhetorical drift while amplifying visual signal extraction. In UCAgents, a one-round inquiry discussion is introduced to uncover potential risks of visual-textual misalignment. This design jointly constrains visual ambiguity and textual noise, a dual-noise bottleneck that we formalize via information theory. Extensive experiments on four medical VQA benchmarks show UCAgents achieves superior accuracy (71.3% on PathVQA, +6.0% over state-of-the-art) with 87.7% lower token cost. The evaluation results further confirm that UCAgents strikes a balance between uncovering more visual evidence and avoiding confusing textual interference. These results demonstrate that UCAgents exhibits both diagnostic reliability and computational efficiency critical for real-world clinical deployment. Code is available at https://github.com/fqhank/UCAgents.

[68] Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding

Yerim Jeon,Miso Lee,WonJun Moon,Jae-Pil Heo

Main category: cs.CV

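TL;DR: The paper proposes 3D-SLIM, a masking strategy that improves the spatial reasoning of large language models (LLMs) in 3D scene-language understanding by replacing the conventional causal mask with an adaptive mask tailored to 3D spatial structure.

Details Motivation: Existing 3D scene-language methods rely on standard decoders from language modeling, whose causal mask introduces sequential bias among order-agnostic objects and restricts object-instruction attention, hindering task-specific reasoning.

Contribution: 3D-SLIM introduces a Geometry-adaptive Mask and an Instruction-aware Mask. The former constrains attention by spatial density rather than token order; the latter lets object tokens directly access the instruction context. The method requires no architectural changes and adds no parameters.

Method: Through the Geometry-adaptive Mask and the Instruction-aware Mask, 3D-SLIM reshapes attention so that the model processes objects according to their spatial relationships while being guided by the user's task.

Result: 3D-SLIM yields substantial performance improvements across multiple 3D scene-language tasks, validating its effectiveness and underscoring the role of decoder design in multimodal reasoning.

Insight: Mask design matters in 3D multimodal reasoning; a simple change to the attention mask can substantially improve the spatial reasoning of LLMs.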

Abstract: Recent advances in 3D scene-language understanding have leveraged Large Language Models (LLMs) for 3D reasoning by transferring their general reasoning ability to 3D multi-modal contexts. However, existing methods typically adopt standard decoders from language modeling, which rely on a causal attention mask. This design introduces two fundamental conflicts in 3D scene understanding: sequential bias among order-agnostic 3D objects and restricted object-instruction attention, hindering task-specific reasoning. To overcome these limitations, we propose 3D Spatial Language Instruction Mask (3D-SLIM), an effective masking strategy that replaces the causal mask with an adaptive attention mask tailored to the spatial structure of 3D scenes. Our 3D-SLIM introduces two key components: a Geometry-adaptive Mask that constrains attention based on spatial density rather than token order, and an Instruction-aware Mask that enables object tokens to directly access instruction context. This design allows the model to process objects based on their spatial relationships while being guided by the user’s task. 3D-SLIM is simple, requires no architectural modifications, and adds no extra parameters, yet it yields substantial performance improvements across diverse 3D scene-language tasks. Extensive experiments across multiple benchmarks and LLM baselines validate its effectiveness and underscore the critical role of decoder design in 3D multi-modal reasoning.
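
To make the contrast with a causal mask concrete, here is a small sketch of an attention mask in the spirit of the two components described above. It is an illustration under stated assumptions, not the paper's algorithm: the token layout (object tokens first, instruction tokens after), the use of k-nearest neighbours as a stand-in for the spatial-density criterion, and the function name `build_slim_style_mask` are all hypothetical.

```python
import math


def build_slim_style_mask(centers, num_instruction_tokens, k=2):
    """Return a boolean matrix where mask[i][j] is True if token i may attend to token j.

    Assumed layout: object tokens (one per 3D center) come first,
    followed by the instruction tokens.
    """
    n_obj = len(centers)
    n = n_obj + num_instruction_tokens
    mask = [[False] * n for _ in range(n)]

    for i in range(n_obj):
        # Geometry-based part (assumption: k nearest objects plus the object itself),
        # independent of where the objects sit in the token sequence.
        by_distance = sorted(
            range(n_obj), key=lambda j: math.dist(centers[i], centers[j])
        )
        for j in by_distance[: k + 1]:
            mask[i][j] = True
        # Instruction-aware part: object tokens see every instruction token,
        # which a causal mask would forbid because those tokens come later.
        for j in range(n_obj, n):
            mask[i][j] = True

    # Instruction tokens keep ordinary causal attention over all earlier tokens.
    for i in range(n_obj, n):
        for j in range(i + 1):
            mask[i][j] = True
    return mask


mask = build_slim_style_mask([(0, 0, 0), (1, 0, 0), (10, 0, 0)], 2, k=1)
```

With these three centers and `k=1`, the two nearby objects attend to each other but not to the distant one, regardless of their order in the sequence.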

[69] YingVideo-MV: Music-Driven Multi-Stage Video Generation

Jiahui Chen,Weida Wang,Runhua Shi,Huan Yang,Chaofan Ding,Zihao Chen

Main category: cs.CV

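TL;DR: The paper proposes YingVideo-MV, the first cascaded framework for music-driven long-video generation. It integrates audio semantic analysis, an interpretable shot planning module, temporal-aware diffusion Transformer architectures, and long-sequence consistency modeling to automatically generate high-quality music performance videos.

Details Motivation: Existing long-video generation methods lack explicit camera motion control, and the generation of music-performance videos remains largely unexplored. The paper aims to fill this gap with music-driven long-video generation.

Contribution: 1) Proposes the YingVideo-MV framework; 2) Introduces a camera adapter module and a dynamic window range strategy; 3) Builds the large-scale Music-in-the-Wild dataset.

Method: The framework comprises audio semantic analysis, the MV-Director module, temporal-aware diffusion Transformer architectures, long-sequence consistency modeling, and a camera adapter module. The dynamic window range strategy adaptively adjusts denoising ranges based on the audio embedding.

Result: Experiments show that YingVideo-MV generates coherent and expressive music videos and achieves precise music-motion-camera synchronization.

Insight: Explicit camera motion control and temporal consistency modeling are key to music video generation; the dynamic window strategy improves continuity in long-sequence generation.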

Abstract: While diffusion models for audio-driven avatar video generation have achieved notable progress in synthesizing long sequences with natural audio-visual synchronization and identity consistency, the generation of music-performance videos with camera motions remains largely unexplored. We present YingVideo-MV, the first cascaded framework for music-driven long-video generation. Our approach integrates audio semantic analysis, an interpretable shot planning module (MV-Director), temporal-aware diffusion Transformer architectures, and long-sequence consistency modeling to enable automatic synthesis of high-quality music performance videos from audio signals. We construct a large-scale Music-in-the-Wild Dataset by collecting web data to support the achievement of diverse, high-quality results. Observing that existing long-video generation methods lack explicit camera motion control, we introduce a camera adapter module that embeds camera poses into latent noise. To enhance continuity between clips during long-sequence inference, we further propose a time-aware dynamic window range strategy that adaptively adjusts denoising ranges based on audio embedding. Comprehensive benchmark tests demonstrate that YingVideo-MV achieves outstanding performance in generating coherent and expressive music videos, and enables precise music-motion-camera synchronization. More videos are available on our project page: https://giantailab.github.io/YingVideo-MV/ .

[70] Attention-guided reference point shifting for Gaussian-mixture-based partial point set registration

Mizuki Kikkawa,Tatsuya Yatagawa,Yutaka Ohtake,Hiromasa Suzuki

Main category: cs.CV

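TL;DR: This study examines how the translation and rotation invariance of feature vectors affects partial point set registration methods based on deep learning and Gaussian mixture models (GMMs). The authors propose an attention-guided reference point shifting (ARPS) layer that addresses the limitations of existing methods.

Details Motivation: Existing GMM-based deep learning registration methods such as DeepGMR are limited on partial-to-partial point set registration, with both theoretical and practical problems around feature invariance. The study aims to identify the causes and propose a solution.

Contribution: Introduces the ARPS layer, which uses an attention module to identify a common reference point of two partial point sets and thereby obtain transformation-invariant features, significantly improving DeepGMR and its variants.

Method: The ARPS layer uses an attention mechanism to find a common reference point rather than the overlap region, which improves the invariance of the features.

Result: The ARPS layer significantly improves the performance of DeepGMR and UGMMReg, surpassing prior deep learning methods that use attention blocks and Transformers.

Insight: The study gives a closer analysis of feature invariance in registration methods based on GMMs and deep learning, offering useful guidance for future method design.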

Abstract: This study investigates the impact of the invariance of feature vectors for partial-to-partial point set registration under translation and rotation of input point sets, particularly in the realm of techniques based on deep learning and Gaussian mixture models (GMMs). We reveal both theoretical and practical problems associated with such deep-learning-based registration methods using GMMs, with a particular focus on the limitations of DeepGMR, a pioneering study in this line, when applied to partial-to-partial point set registration. Our primary goal is to uncover the causes of these problems and propose a comprehensible solution. To address this, we introduce an attention-based reference point shifting (ARPS) layer, which robustly identifies a common reference point of two partial point sets, thereby acquiring transformation-invariant features. The ARPS layer employs a well-studied attention module to find a common reference point rather than the overlap region. Owing to this, it significantly enhances the performance of DeepGMR and its recent variant, UGMMReg. Furthermore, these extension models outperform even prior deep learning methods using attention blocks and Transformers to extract the overlap region or common reference points. We believe these findings provide deeper insights into registration methods using deep learning and GMMs.

[71] dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model

Yumeng Li,Guang Yang,Hao Liu,Bowen Wang,Colin Zhang

Main category: cs.CV

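TL;DR: dots.ocr is a single vision-language model that, for the first time, jointly learns three core tasks within a unified end-to-end framework. It shows strong multilingual document layout parsing and achieves state-of-the-art performance on OmniDocBench and the new XDocParse benchmark.

Details Motivation: Conventional document layout parsing relies on fragmented, multi-stage pipelines that propagate errors and cannot exploit the benefits of joint training. dots.ocr aims for more efficient and robust parsing by learning layout detection, text recognition, and relational understanding jointly in one model.

Contribution: 1. Proposes dots.ocr, the first single vision-language model to jointly learn the three core tasks; 2. Develops a highly scalable data engine that synthesizes a multilingual corpus; 3. Achieves state-of-the-art performance on OmniDocBench and the new XDocParse benchmark.

Method: dots.ocr uses a unified end-to-end framework in which a vision-language model is trained jointly on layout detection, text recognition, and relational understanding. A highly scalable data engine synthesizes multilingual data to improve generalization.

Result: dots.ocr achieves state-of-the-art performance on OmniDocBench; on XDocParse, which spans 126 languages, it outperforms the next-best method by 7.4 points.

Insight: The combination of a unified end-to-end framework and a multilingual data engine is central to dots.ocr's results and offers a more efficient and robust approach to document intelligence.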

Abstract: Document Layout Parsing serves as a critical gateway for Artificial Intelligence (AI) to access and interpret the world’s vast stores of structured knowledge. This process, which encompasses layout detection, text recognition, and relational understanding, is particularly crucial for empowering next-generation Vision-Language Models. Current methods, however, rely on fragmented, multi-stage pipelines that suffer from error propagation and fail to leverage the synergies of joint training. In this paper, we introduce dots.ocr, a single Vision-Language Model that, for the first time, demonstrates the advantages of jointly learning three core tasks within a unified, end-to-end framework. This is made possible by a highly scalable data engine that synthesizes a vast multilingual corpus, empowering the model to deliver robust performance across a wide array of tasks, encompassing diverse languages, layouts, and domains. The efficacy of our unified paradigm is validated by state-of-the-art performance on the comprehensive OmniDocBench. Furthermore, to catalyze research in global document intelligence, we introduce XDocParse, a challenging new benchmark spanning 126 languages. On this testbed, dots.ocr establishes a powerful new baseline, outperforming the next-best competitor by a remarkable +7.4 point margin and proving its unparalleled multilingual capabilities.

[72] GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding

Jiaqi Liu,Ronghao Fu,Haoran Liu,Lang Sun,Bo Yang

Main category: cs.CV

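TL;DR: GeoDiT is the first diffusion-based vision-language model tailored for the geospatial domain. It generates structured, coherent outputs through a parallel refinement process and outperforms autoregressive models on several tasks.

Details Motivation: Autoregressive models are structurally misaligned with geospatial understanding: forced sequential generation hinders structured output. GeoDiT addresses this by using a diffusion model to generate geospatial outputs through parallel refinement.

Contribution: 1. Proposes GeoDiT, the first diffusion-based vision-language model for the geospatial domain; 2. Achieves holistic, coarse-to-fine synthesis through a parallel refinement process; 3. Significantly outperforms existing methods on image captioning, visual grounding, and multi-object detection.

Method: GeoDiT adopts a diffusion framework that recasts geospatial generation as a parallel refinement process, resolving all semantic elements simultaneously to produce structured outputs.

Result: Experiments show that GeoDiT sets a new state of the art on tasks requiring structured outputs (image captioning, visual grounding, and multi-object detection), outperforming autoregressive models.

Insight: Aligning the generative process with the intrinsic structure of the data is key to high performance in complex geospatial analysis.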

Abstract: Autoregressive models are structurally misaligned with the inherently parallel nature of geospatial understanding, forcing a rigid sequential narrative onto scenes and fundamentally hindering the generation of structured and coherent outputs. We challenge this paradigm by reframing geospatial generation as a parallel refinement process, enabling a holistic, coarse-to-fine synthesis that resolves all semantic elements simultaneously. To operationalize this, we introduce GeoDiT, the first diffusion-based vision-language model tailored for the geospatial domain. Extensive experiments demonstrate that GeoDiT establishes a new state-of-the-art on benchmarks requiring structured, object-centric outputs. It achieves significant gains in image captioning, visual grounding, and multi-object detection, precisely the tasks where autoregressive models falter. Our work validates that aligning the generative process with the data’s intrinsic structure is key to unlocking superior performance in complex geospatial analysis.

[73] Two-Stage Vision Transformer for Image Restoration: Colorization Pretraining + Residual Upsampling

Aditya Chaudhary,Prachet Dev Singh,Ankit Jha

Main category: cs.CV

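TL;DR: Proposes ViT-SR, a two-stage training method for Vision Transformers that improves single image super-resolution through colorization pretraining and residual upsampling.

Details Motivation: Single Image Super-Resolution (SISR) remains a difficult problem in computer vision, and existing methods leave room for improvement in performance and generalization.

Contribution: Proposes a two-stage training strategy: self-supervised colorization pretraining followed by super-resolution fine-tuning, with a design that simplifies residual learning.

Method: A ViT is pretrained on colorization to learn general visual representations, then fine-tuned for 4x super-resolution by predicting a high-frequency residual image that is added to an initial bicubic interpolation.

Result: On the DIV2K dataset, ViT-SR achieves an SSIM of 0.712 and a PSNR of 22.90 dB.

Insight: Self-supervised pretraining shows promise for complex image restoration tasks; larger ViT architectures or alternative pretext tasks may yield further gains.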

Abstract: In computer vision, Single Image Super-Resolution (SISR) is still a difficult problem. We present ViT-SR, a new technique to improve the performance of a Vision Transformer (ViT) employing a two-stage training strategy. In our method, the model learns rich, generalizable visual representations from the data itself through a self-supervised pretraining phase on a colourization task. The pre-trained model is then adjusted for 4x super-resolution. By predicting the addition of a high-frequency residual image to an initial bicubic interpolation, this design simplifies residual learning. ViT-SR, trained and evaluated on the DIV2K benchmark dataset, achieves an impressive SSIM of 0.712 and PSNR of 22.90 dB. These results demonstrate the efficacy of our two-stage approach and highlight the potential of self-supervised pre-training for complex image restoration tasks. Further improvements may be possible with larger ViT architectures or alternative pretext tasks.
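
For readers less familiar with the reported metric and the residual formulation, the sketch below shows the standard PSNR definition and the "interpolated image plus predicted residual" composition. It is a minimal illustration on flat lists of pixel values, assuming an 8-bit range by default; the function names are hypothetical and this is not the authors' evaluation code (the exact protocol, such as colour space or border cropping, is not stated in the abstract).

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    if len(reference) != len(estimate):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def add_residual(upsampled, residual, max_value=255.0):
    """Final output = interpolated image + predicted high-frequency residual,
    clipped to the valid pixel range."""
    return [min(max(u + r, 0.0), max_value) for u, r in zip(upsampled, residual)]


# Toy usage: a bicubic-like estimate corrected by a predicted residual.
restored = add_residual([10.0, 250.0], [5.0, 10.0])
score = psnr([15.0, 255.0], restored)
```

Because the network only has to predict the residual, a zero residual already reproduces the interpolated baseline, which is the sense in which this design simplifies learning.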

[74] SkyMoE: A Vision-Language Foundation Model for Enhancing Geospatial Interpretation with Mixture of Experts

Jiaqi Liu,Ronghao Fu,Lang Sun,Haoran Liu,Xiao Yang,Weipeng Zhang,Xu Na,Zhuoran Duan,Bo Yang

Main category: cs.CV

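TL;DR: SkyMoE is a vision-language model designed for multimodal, multi-task remote sensing interpretation. It uses a Mixture-of-Experts (MoE) architecture with an adaptive router and a context-disentangled augmentation strategy to improve performance across tasks and granularities.

Details Motivation: General-purpose vision-language models remain suboptimal for remote sensing tasks, and existing geospatial VLMs struggle to differentiate task types and interpretation granularities, which limits their ability to balance local detail perception and global contextual understanding.

Contribution: 1) Proposes SkyMoE, an MoE-based vision-language model for multi-task, multi-granularity remote sensing interpretation; 2) Introduces an adaptive router and a context-disentangled augmentation strategy to improve expert task specificity and granularity sensitivity; 3) Builds MGRS-Bench, a benchmark covering multiple tasks and granularity levels.

Method: 1) An adaptive router generates task- and granularity-aware routing instructions; 2) A context-disentangled augmentation strategy creates contrastive pairs between local and global features; 3) The model is evaluated on large-scale public datasets.

Result: SkyMoE achieves state-of-the-art performance on 21 public datasets, demonstrating adaptability, scalability, and strong multi-granularity understanding.

Insight: Combining an MoE architecture with task- and granularity-aware routing improves remote sensing performance, and context-disentangled augmentation helps experts specialize.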

Abstract: The emergence of large vision-language models (VLMs) has significantly enhanced the efficiency and flexibility of geospatial interpretation. However, general-purpose VLMs remain suboptimal for remote sensing (RS) tasks. Existing geospatial VLMs typically adopt a unified modeling strategy and struggle to differentiate between task types and interpretation granularities, limiting their ability to balance local detail perception and global contextual understanding. In this paper, we present SkyMoE, a Mixture-of-Experts (MoE) vision-language model tailored for multimodal, multi-task RS interpretation. SkyMoE employs an adaptive router that generates task- and granularity-aware routing instructions, enabling specialized large language model experts to handle diverse sub-tasks. To further promote expert decoupling and granularity sensitivity, we introduce a context-disentangled augmentation strategy that creates contrastive pairs between local and global features, guiding experts toward level-specific representation learning. We also construct MGRS-Bench, a comprehensive benchmark covering multiple RS interpretation tasks and granularity levels, to evaluate generalization in complex scenarios. Extensive experiments on 21 public datasets demonstrate that SkyMoE achieves state-of-the-art performance across tasks, validating its adaptability, scalability, and superior multi-granularity understanding in remote sensing.

[75] ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning

Yifan Li,Yingda Yin,Lingting Zhu,Weikai Chen,Shengju Qian,Xin Wang,Yanwei Fu

Main category: cs.CV

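TL;DR: ReVSeg uses reinforcement learning to optimize a multi-step reasoning chain for video segmentation. It decomposes complex queries into three explicit operations (semantics interpretation, temporal evidence selection, and spatial grounding) and produces interpretable reasoning trajectories on top of pretrained vision-language models.

Details Motivation: Queries in video object segmentation often involve dynamics, causality, and temporal interactions, yet existing methods collapse these factors into latent embeddings, leaving the reasoning chain opaque and hard to handle. An explicitly decomposed reasoning approach is therefore needed.

Contribution: Proposes ReVSeg, which decomposes the task into three explicit operations, optimizes the multi-step reasoning chain with reinforcement learning, and builds on the capabilities of pretrained vision-language models to achieve both strong performance and interpretable reasoning.

Method: ReVSeg explicitly decomposes the task into semantics interpretation, temporal evidence selection, and spatial grounding, and uses reinforcement learning to optimize the reasoning chain and improve decision quality.

Result: Achieves state-of-the-art performance on standard video object segmentation benchmarks and yields interpretable reasoning trajectories.

Insight: Explicit task decomposition combined with reinforcement learning over the reasoning chain can handle complex video segmentation queries while improving interpretability.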

Abstract: Reasoning-centric video object segmentation is an inherently complex task: the query often refers to dynamics, causality, and temporal interactions, rather than static appearances. Yet existing solutions generally collapse these factors into simplified reasoning with latent embeddings, rendering the reasoning chain opaque and essentially intractable. We therefore adopt an explicit decomposition perspective and introduce ReVSeg, which executes reasoning as sequential decisions in the native interface of pretrained vision language models (VLMs). Rather than folding all reasoning into a single-step prediction, ReVSeg executes three explicit operations – semantics interpretation, temporal evidence selection, and spatial grounding – aligning pretrained capabilities. We further employ reinforcement learning to optimize the multi-step reasoning chain, enabling the model to self-refine its decision quality from outcome-driven signals. Experimental results demonstrate that ReVSeg attains state-of-the-art performances on standard video object segmentation benchmarks and yields interpretable reasoning trajectories. Project page is available at https://clementine24.github.io/ReVSeg/ .

[76] On the Problem of Consistent Anomalies in Zero-Shot Anomaly Detection

Tai Le-Gia

Main category: cs.CV

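TL;DR: This dissertation studies the problem of consistent anomalies in zero-shot anomaly classification and segmentation (AC/AS), proposes a graph-based method (CoDeGraph) to filter them, and extends it to 3D medical imaging and text-driven vision-language models.

Details Motivation: Zero-shot AC/AS is increasingly important in industrial inspection and medical imaging, but it suffers from consistent anomalies: recurring similar anomalies systematically bias distance-based methods. The work aims to address this with theoretical grounding and practical solutions.

Contribution: 1. Formalizes the problem of consistent anomalies; 2. Proposes the CoDeGraph framework, which filters anomalies using the similarity scaling and neighbor-burnout phenomena; 3. Extends the framework to 3D medical imaging with a training-free volumetric tokenization strategy; 4. Connects it to text-driven vision-language models.

Method: 1. Analyzes the statistical and geometric behavior of patch representations from pre-trained Vision Transformers; 2. Builds a multi-stage graph pipeline (CoDeGraph) for anomaly filtering; 3. Designs a training-free 3D volumetric tokenization method; 4. Uses pseudo-masks to supervise vision-language models.

Result: CoDeGraph effectively suppresses the influence of consistent anomalies; 3D anomaly segmentation is achieved without 3D training samples; pseudo-masks bridge batch-based and text-driven zero-shot methods.

Insight: 1. Consistent anomalies are a core challenge in zero-shot AC/AS; 2. Similarity scaling and neighbor-burnout are the key observations; 3. Combining graph-based methods with vision-language models is promising.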

Abstract: Zero-shot anomaly classification and segmentation (AC/AS) aim to detect anomalous samples and regions without any training data, a capability increasingly crucial in industrial inspection and medical imaging. This dissertation aims to investigate the core challenges of zero-shot AC/AS and presents principled solutions rooted in theory and algorithmic design. We first formalize the problem of consistent anomalies, a failure mode in which recurring similar anomalies systematically bias distance-based methods. By analyzing the statistical and geometric behavior of patch representations from pre-trained Vision Transformers, we identify two key phenomena - similarity scaling and neighbor-burnout - that describe how relationships among normal patches change with and without consistent anomalies in settings characterized by highly similar objects. We then introduce CoDeGraph, a graph-based framework for filtering consistent anomalies built on the similarity scaling and neighbor-burnout phenomena. Through multi-stage graph construction, community detection, and structured refinement, CoDeGraph effectively suppresses the influence of consistent anomalies. Next, we extend this framework to 3D medical imaging by proposing a training-free, computationally efficient volumetric tokenization strategy for MRI data. This enables a genuinely zero-shot 3D anomaly detection pipeline and shows that volumetric anomaly segmentation is achievable without any 3D training samples. Finally, we bridge batch-based and text-based zero-shot methods by demonstrating that CoDeGraph-derived pseudo-masks can supervise prompt-driven vision-language models. Together, this dissertation provides theoretical understanding and practical solutions for the zero-shot AC/AS problem.

[77] WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens

Jian Yang,Dacheng Yin,Xiaoxuan He,Yong Li,Fengyun Rao,Jing Lyu,Wei Zhai,Yang Cao,Zheng-Jun Zha

Main category: cs.CV

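TL;DR: The paper proposes Noisy Query Tokens, which learn a distributed representation space between the VLM and the Diffusion Model via end-to-end optimization to address task generalization collapse, and adds a VAE branch to recover image details.

Details Motivation: Efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models is challenging; in particular, methods using a fixed number of learnable query tokens generalize poorly across tasks.

Contribution: 1. Proposes Noisy Query Tokens to strengthen the distributed representation between the VLM and the Diffusion Model; 2. Introduces a VAE branch to recover image details; 3. Mitigates task generalization collapse and supports continual learning.

Method: 1. Learns a distributed representation between the VLM and the Diffusion Model via end-to-end optimization; 2. Uses a VAE branch with linear projection to recover fine-grained image details.

Result: Experiments show that the approach mitigates generalization collapse and enables stable continual learning across diverse tasks.

Insight: The combination of distributed representation learning and a detail-recovery module is key, and suggests a direction for further work on multimodal models.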

Abstract: Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with linear projection to recover fine-grained image details. Experimental results confirm our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.

[78] AVGGT: Rethinking Global Attention for Accelerating VGGT

Xianbing Sun,Zhikai Zhu,Zhengyu Lou,Bo Yang,Jinyang Tang,Liqing Zhang,He Wang,Jianfu Zhang

Main category: cs.CV

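TL;DR: AVGGT rethinks the global attention of VGGT and $π^3$ and proposes a training-free two-step acceleration scheme that achieves up to 8-10x speedup while preserving model accuracy.

Details Motivation: VGGT and $π^3$ perform well on multi-view 3D tasks, but their reliance on global self-attention is computationally expensive, and existing sparse-attention variants lack a systematic analysis of what global attention contributes.

Contribution: 1. Systematically analyzes the role of global attention in multi-view reasoning; 2. Proposes a training-free two-step acceleration scheme that substantially speeds up inference.

Method: 1. Converts early global layers into frame attention; 2. Subsamples global attention by subsampling K/V over patch tokens.

Result: Achieves up to 8-10x speedup in inference time with accuracy matching or slightly exceeding the original models, and remains robust in dense multi-view settings.

Insight: Global attention plays different roles at different depths: early layers do not form meaningful correspondences, middle layers perform cross-view alignment, and the last layers provide only minor refinements.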

Abstract: Since DUSt3R, models such as VGGT and $π^3$ have shown strong multi-view 3D performance, but their heavy reliance on global self-attention results in high computational cost. Existing sparse-attention variants offer partial speedups, yet lack a systematic analysis of how global attention contributes to multi-view reasoning. In this paper, we first conduct an in-depth investigation of the global attention modules in VGGT and $π^3$ to better understand their roles. Our analysis reveals a clear division of roles in the alternating global-frame architecture: early global layers do not form meaningful correspondences, middle layers perform cross-view alignment, and last layers provide only minor refinements. Guided by these findings, we propose a training-free two-step acceleration scheme: (1) converting early global layers into frame attention, and (2) subsampling global attention by subsampling K/V over patch tokens with diagonal preservation and a mean-fill component. We instantiate this strategy on VGGT and $π^3$ and evaluate across standard pose and point-map benchmarks. Our method achieves up to $8$-$10\times$ speedup in inference time while matching or slightly improving the accuracy of the original models, and remains robust even in extremely dense multi-view settings where prior sparse-attention baselines fail.
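
The second step, subsampling K/V in global attention, can be illustrated with a small single-head sketch. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses a plain strided subsample of the keys and values, and it omits the diagonal preservation and mean-fill components because the abstract does not describe them in enough detail to reproduce. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for one head, on lists of vectors."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        )
    return outputs


def subsampled_attention(queries, keys, values, stride):
    """Every query is kept, but only every `stride`-th key/value pair is attended to,
    so the score matrix shrinks by roughly a factor of `stride`."""
    return attention(queries, keys[::stride], values[::stride])


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
out = subsampled_attention(tokens, tokens, tokens, stride=2)
```

The output still has one row per query token, so downstream layers see the same shapes; only the cost of the query-key product changes.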

[79] Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities

Yuan Xiong,Ziqi Miao,Lijun Li,Chen Qian,Jie Li,Jing Shao

Main category: cs.CV

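TL;DR: This paper proposes Contextual Image Attack (CIA), a new image-centric attack method that uses a multi-agent system to embed harmful queries into seemingly benign visual contexts, substantially raising the attack success rate.

Details Motivation: Existing attack methods focus mainly on text-image interplay and overlook the potential of the visual modality as an independent, complex attack vector. The authors therefore aim to develop a more effective image-centric attack.

Contribution: Proposes CIA, which embeds harmful content into image contexts through four visualization strategies and a multi-agent system, combined with contextual element enhancement and automatic toxicity obfuscation, to markedly improve attack effectiveness.

Method: A multi-agent system designs the image attack, using four visualization strategies to embed harmful queries and refine the attack; experiments are run on the MMSafetyBench-tiny dataset.

Result: Against GPT-4o and Qwen2.5-VL-72B, CIA reaches toxicity scores of 4.73 and 4.83 and attack success rates of 86.31% and 91.07%, respectively, clearly outperforming prior work.

Insight: The visual modality is itself a potent attack vector that can bypass the safety alignment of multimodal large language models, which poses new challenges for future safety research.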

Abstract: While Multimodal Large Language Models (MLLMs) show remarkable capabilities, their safety alignments are susceptible to jailbreak attacks. Existing attack methods typically focus on text-image interplay, treating the visual modality as a secondary prompt. This approach underutilizes the unique potential of images to carry complex, contextual information. To address this gap, we propose a new image-centric attack method, Contextual Image Attack (CIA), which employs a multi-agent system to subtly embed harmful queries into seemingly benign visual contexts using four distinct visualization strategies. To further enhance the attack’s efficacy, the system incorporates contextual element enhancement and automatic toxicity obfuscation techniques. Experimental results on the MMSafetyBench-tiny dataset show that CIA achieves high toxicity scores of 4.73 and 4.83 against the GPT-4o and Qwen2.5-VL-72B models, respectively, with Attack Success Rates (ASR) reaching 86.31% and 91.07%. Our method significantly outperforms prior work, demonstrating that the visual modality itself is a potent vector for jailbreaking advanced MLLMs.

[80] OmniPerson: Unified Identity-Preserving Pedestrian Generation

Changxiao Ma,Chao Yuan,Xincheng Shi,Yuzhuo Ma,Yongfei Zhang,Longkun Zhou,Yujia Zhang,Shangze Li,Yifan Xu

Main category: cs.CV

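TL;DR: OmniPerson is a unified identity-preserving pedestrian generation pipeline that supports RGB/IR image and video generation. With multimodal inputs and fine-grained control, it addresses data-privacy and annotation-cost constraints by generating high-quality pedestrian data to strengthen ReID.

Details Motivation: Existing pedestrian generation methods fall short on identity consistency and controllability, which limits their value for data augmentation; OmniPerson targets these issues.

Contribution: 1) OmniPerson, a unified generation model that supports multimodal inputs and fine-grained control; 2) Multi-Refer Fuser, which preserves identity across multiple reference images; 3) the PersonSyn dataset, which is to be open-sourced.

Method: Uses multimodal inputs (RGB/IR images and videos, text, etc.) together with a multi-reference identity fusion mechanism (Multi-Refer Fuser) to generate high-fidelity pedestrians.

Result: OmniPerson achieves SOTA visual fidelity and identity consistency, and its generated data consistently improves ReID model performance.

Insight: Combining multimodal inputs with multi-reference identity fusion is key to high-quality pedestrian data generation; releasing the dataset and model should help advance the field.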

Abstract: Person re-identification (ReID) suffers from a lack of large-scale high-quality training data due to challenges in data privacy and annotation costs. While previous approaches have explored pedestrian generation for data augmentation, they often fail to ensure identity consistency and suffer from insufficient controllability, thereby limiting their effectiveness in dataset augmentation. To address this, we introduce OmniPerson, the first unified identity-preserving pedestrian generation pipeline for visible/infrared image/video ReID tasks. Our contributions are threefold: 1) We propose OmniPerson, a unified generation model offering holistic and fine-grained control over all key pedestrian attributes. It supports RGB/IR image and video generation with any number of reference images, two kinds of person poses, and text, and it also includes RGB-to-IR transfer and image super-resolution abilities. 2) We design Multi-Refer Fuser for robust identity preservation with any number of reference images as input, enabling OmniPerson to distill a unified identity from a set of multi-view reference images and ensuring high-fidelity pedestrian generation. 3) We introduce PersonSyn, the first large-scale dataset for multi-reference, controllable pedestrian generation, and present its automated curation pipeline, which transforms public, ID-only ReID benchmarks into a richly annotated resource with the dense, multi-modal supervision required for this task. Experimental results demonstrate that OmniPerson achieves SoTA in pedestrian generation, excelling in both visual fidelity and identity consistency. Furthermore, augmenting existing datasets with our generated data consistently improves the performance of ReID models. We will open-source the full codebase, pretrained model, and the PersonSyn dataset.

[81] From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature

Kun Yuan,Min Woo Sun,Zhen Chen,Alejandro Lozano,Xiangteng He,Shi Li,Nassir Navab,Xiaoxiao Sun,Nicolas Padoy,Serena Yeung-Levy

Main category: cs.CV

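TL;DR: Panel2Patch is a new data pipeline that mines hierarchical structure from biomedical scientific literature and converts it into multi-granular supervision for vision-language pretraining. By preserving local semantics and building hierarchically aligned vision-language pairs, it substantially improves performance.

Details Motivation: Existing biomedical vision-language pretraining typically compresses scientific figures and text into coarse figure-level pairs, discarding the local-structure correspondences that clinicians rely on. Panel2Patch is proposed to preserve this fine-grained information.

Contribution: The main contribution is the Panel2Patch data pipeline, which parses hierarchical structure in biomedical literature (multi-panel, marker-heavy figures and their surrounding text) and builds multi-granular vision-language supervision pairs.

Method: Panel2Patch parses scientific figures and their captions, extracts layouts, panels, and visual markers, and constructs hierarchically aligned vision-language pairs at the figure, panel, and patch levels. A granularity-aware pretraining strategy is built on this corpus.

Result: Experiments show that Panel2Patch, applied to only a small set of literature figures, extracts more effective supervision than prior pipelines and substantially improves model performance.

Insight: Fine-grained vision-language alignment is critical for biomedical pretraining; hierarchical data construction and training strategies can improve a model's understanding of local structures.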

Abstract: There is a growing interest in developing strong biomedical vision-language models. A popular approach to achieve robust representations is to use web-scale scientific data. However, current biomedical vision-language pretraining typically compresses rich scientific figures and text into coarse figure-level pairs, discarding the fine-grained correspondences that clinicians actually rely on when zooming into local structures. To tackle this issue, we introduce Panel2Patch, a novel data pipeline that mines hierarchical structure from existing biomedical scientific literature, i.e., multi-panel, marker-heavy figures and their surrounding text, and converts them into multi-granular supervision. Given scientific figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs hierarchical aligned vision-language pairs at the figure, panel, and patch levels, preserving local semantics instead of treating each figure as a single data sample. Built on this hierarchical corpus, we develop a granularity-aware pretraining strategy that unifies heterogeneous objectives from coarse didactic descriptions to fine region-focused phrases. By applying Panel2Patch to only a small set of the literature figures, we extract far more effective supervision than prior pipelines, enabling substantially better performance with less pretraining data.

[82] Co-speech Gesture Video Generation via Motion-Based Graph Retrieval

Yafei Song,Peng Zhang,Bang Zhang

Main category: cs.CV

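TL;DR: This paper proposes a new framework that combines diffusion-based motion generation with a motion-graph retrieval algorithm to generate natural, speech-synchronized co-speech gesture videos, overcoming the one-to-one mapping limitation of earlier methods.

Details Motivation: Existing methods handle the complex many-to-many mapping between speech and gestures poorly because they rely on one-to-one feature matching or on a shared feature space.

Contribution: 1) A diffusion model that generates gesture motions by learning the joint distribution of audio and motion; 2) multi-level audio feature extraction; 3) a motion-similarity-based retrieval algorithm and a path-stitching technique.

Method: 1) Generate gesture motions with a diffusion model; 2) extract low-level and high-level audio features; 3) retrieve a path from the motion graph using global and local motion similarity; 4) stitch the retrieved segments into a coherent video.

Result: Experiments show a significant improvement over prior approaches in synchronization accuracy and naturalness.

Insight: Diffusion models can learn the complex relationship between speech and motion; combined with multi-level features and motion-similarity retrieval, they yield more natural gesture videos.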

Abstract: Synthesizing synchronized and natural co-speech gesture videos remains a formidable challenge. Recent approaches have leveraged motion graphs to harness the potential of existing video data. To retrieve an appropriate trajectory from the graph, previous methods either utilize the distance between features extracted from the input audio and those associated with the motions in the graph or embed both the input audio and motion into a shared feature space. However, these techniques may not be optimal due to the many-to-many mapping nature between audio and gestures, which cannot be adequately addressed by one-to-one mapping. To alleviate this limitation, we propose a novel framework that initially employs a diffusion model to generate gesture motions. The diffusion model implicitly learns the joint distribution of audio and motion, enabling the generation of contextually appropriate gestures from input audio sequences. Furthermore, our method extracts both low-level and high-level features from the input audio to enrich the training process of the diffusion model. Subsequently, a meticulously designed motion-based retrieval algorithm is applied to identify the most suitable path within the graph by assessing both global and local similarities in motion. Given that not all nodes in the retrieved path are sequentially continuous, the final step involves seamlessly stitching together these segments to produce a coherent video output. Experimental results substantiate the efficacy of our proposed method, demonstrating a significant improvement over prior approaches in terms of synchronization accuracy and naturalness of generated gestures.

[83] RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence

Xuming He,Zehao Fan,Hengjia Li,Fan Zhuo,Hankun Xu,Senlin Cheng,Di Weng,Haifeng Liu,Can Ye,Boxi Wu

Main category: cs.CV

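TL;DR: RULER-Bench is a new benchmark for evaluating the rule-based reasoning abilities of video generation models. It fills a gap left by existing benchmarks and exposes the shortcomings of current models.

Details Motivation: Evaluation of video generation models has focused on visual perception and understanding, while rule-based reasoning remains underexplored. RULER-Bench aims to fill this gap with a fine-grained evaluation protocol that can drive progress in video models.

Contribution: Introduces RULER-Bench, which covers 40 tasks across six rule categories with 622 annotated instances and four metrics, to systematically evaluate the rule-based reasoning of video generation models.

Method: Built on the text-to-video and image-to-video paradigms, RULER-Bench defines a fine-grained evaluation framework that uses GPT-o3 for automatic scoring and reaches 85% alignment with human judgments.

Result: The state-of-the-art model reaches only 48.87% on the rule coherence metric, indicating substantial room for improvement in reasoning ability.

Insight: RULER-Bench exposes the weakness of video generation models in rule-based reasoning, offering guidance for future research on reasoning-aware video generation.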

Abstract: Recent advances in video generation have enabled the synthesis of videos with strong temporal consistency and impressive visual quality, marking a crucial step toward vision foundation models. To evaluate these video generation models, existing benchmarks primarily focus on factors related to visual perception and understanding, like visual aesthetics, instruction adherence, and temporal coherence. However, the rule-based reasoning capabilities of video generation models remain largely unexplored. Although recent studies have carried out preliminary explorations into whether video models can serve as zero-shot learners, they still lack a fine-grained decomposition of reasoning capabilities and a comprehensive evaluation protocol. To address this gap, we introduce RULER-Bench, a benchmark designed to evaluate the reasoning ability of video generation models from the perspective of cognitive rules. Built upon two fundamental paradigms: text-to-video and image-to-video, RULER-Bench covers 40 representative tasks spanning six rule categories with 622 high-quality annotated instances. For the evaluation of each generated video, we construct a checklist covering four metrics and leverage GPT-o3 to assign scores to each question, achieving 85% alignment with human judgements. Extensive experiments show that the state-of-the-art model achieves only 48.87% on the rule coherence metric, highlighting significant room for improvement in the reasoning capability of next-level video models. We expect that the insight obtained from RULER-Bench will facilitate further development of reasoning-aware video generation, advancing video generation models toward vision foundation intelligence.

[84] PPTBench: Towards Holistic Evaluation of Large Language Models for PowerPoint Layout and Design Understanding

Zheng Huang,Xukai Liu,Tianyu Hu,Kai Zhang,Ye Liu

Main category: cs.CV

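TL;DR: PPTBench is a comprehensive multimodal benchmark for evaluating LLMs on PowerPoint-related tasks; it reveals substantial weaknesses of current models in layout understanding and visual-structural reasoning.

Details Motivation: Existing benchmarks focus on narrow subtasks and overlook layout-centric challenges. PPTBench aims to fill this gap by evaluating multimodal reasoning in real-world slide creation and editing.

Contribution: Built from 958 PPTX files and 4,439 samples, PPTBench covers four task categories (Detection, Understanding, Modification, and Generation) and provides a comprehensive benchmark for PowerPoint layout and design understanding.

Method: PPTBench draws on a diverse set of PPTX files, defines tasks in four categories, and analyzes model behavior in semantic understanding and visual-layout reasoning through experiments.

Result: Experiments show a substantial gap between semantic understanding and visual-layout reasoning in current MLLMs: models can interpret slide content but fail to produce coherent spatial arrangements.

Insight: Current MLLMs struggle to combine visual cues with JSON-based layout structures and fail to integrate visual information into their API planning, which points to future work on visual-structural reasoning and slide generation.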

Abstract: PowerPoint presentations combine rich textual content with structured visual layouts, making them a natural testbed for evaluating the multimodal reasoning and layout understanding abilities of modern MLLMs. However, existing benchmarks focus solely on narrow subtasks while overlooking layout-centric challenges, which are central to real-world slide creation and editing. To bridge this gap, we introduce PPTBench, a comprehensive multimodal benchmark for evaluating LLMs on PowerPoint-related tasks. Leveraging a diverse source of 958 PPTX files, PPTBench evaluates models across four categories with 4,439 samples, including Detection, Understanding, Modification, and Generation. Our experiments reveal a substantial gap between semantic understanding and visual-layout reasoning in current MLLMs: models can interpret slide content but fail to produce coherent spatial arrangements. Ablation and further analysis show that current MLLMs struggle to combine visual cues with JSON-based layout structures and fail to integrate visual information into their API planning ability. Case studies visually expose systematic layout errors such as misalignment and element overlap. These findings provide a new perspective on evaluating MLLMs in PPT scenarios, highlighting challenges and directions for future research on visual-structural reasoning and coherent slide generation. All datasets and code are fully released to support reproducibility and future research.

[85] PoreTrack3D: A Benchmark for Dynamic 3D Gaussian Splatting in Pore-Scale Facial Trajectory Tracking

Dong Li,Jiahao Xiong,Yingda Huang,Le Chang

Main category: cs.CV

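TL;DR: PoreTrack3D is the first benchmark for dynamic 3D Gaussian splatting in pore-scale, non-rigid facial trajectory tracking. It contains over 440,000 trajectories and establishes performance baselines for existing methods.

Details Motivation: Existing facial motion capture focuses on traditional landmarks and neglects subtle pore-scale motion. PoreTrack3D fills this gap and advances the study of fine-grained facial expressions.

Contribution: 1. The first facial dynamics dataset that includes pore-scale trajectories; 2. Performance baselines for dynamic 3D Gaussian splatting methods; 3. A new framework for high-fidelity facial motion capture.

Method: Collects a large set of facial trajectories (including more than 52,000 longer than 10 frames and 68 manually reviewed trajectories spanning all 150 frames), then systematically evaluates existing methods to establish a benchmark.

Result: PoreTrack3D provides the first performance baseline in this domain, supporting further work on dynamic 3D reconstruction.

Insight: Pore-scale motion capture reflects subtle expression changes more precisely and opens new directions for facial analysis and simulation.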

Abstract: We introduce PoreTrack3D, the first benchmark for dynamic 3D Gaussian splatting in pore-scale, non-rigid 3D facial trajectory tracking. It contains over 440,000 facial trajectories in total, among which more than 52,000 are longer than 10 frames, including 68 manually reviewed trajectories that span the entire 150 frames. To the best of our knowledge, PoreTrack3D is the first benchmark dataset to capture both traditional facial landmarks and pore-scale keypoint trajectories, advancing the study of fine-grained facial expressions through the analysis of subtle skin-surface motion. We systematically evaluate state-of-the-art dynamic 3D Gaussian splatting methods on PoreTrack3D, establishing the first performance baseline in this domain. Overall, the pipeline developed for this benchmark dataset’s creation establishes a new framework for high-fidelity facial motion capture and dynamic 3D reconstruction. Our dataset is publicly available at: https://github.com/JHXion9/PoreTrack3D

[86] Hear What Matters! Text-conditioned Selective Video-to-Audio Generation

Junwon Lee,Juhan Nam,Jiyoung Lee

Main category: cs.CV

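TL;DR: This paper introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. The SelVA model treats the text prompt as an explicit selector of the target source and modulates the video encoder to extract prompt-relevant video features, yielding robust semantic and temporal grounding.

Details Motivation: In multimedia production, audio tracks are handled per sound source for precise editing and control, but current methods generate a single mixed sound because visual features are entangled and region cues or prompts often fail to specify the source.

Contribution: 1. The text-conditioned selective V2A task; 2. The SelVA model, which uses the text prompt to select the target source and modulate video features; 3. A self-augmentation scheme that addresses the lack of mono audio track supervision.

Method: SelVA uses the text prompt as a target-source selector, modulates the video encoder to extract relevant features, and uses supplementary tokens to suppress text-irrelevant activations. A self-augmentation scheme compensates for the missing supervision.

Result: On the VGG-MONOAUDIO benchmark, SelVA performs well on audio quality, semantic alignment, and temporal synchronization.

Insight: Text prompts can effectively steer video feature extraction and suppress irrelevant information, offering a new approach for multimodal generation tasks.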

Abstract: This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source for precise editing, mixing, and creative control. However, current approaches generate single source-mixed sounds at once, largely because visual features are entangled, and region cues or prompts often fail to specify the source. We propose SelVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector of the target source and modulates the video encoder to distinctly extract prompt-relevant video features. The proposed supplementary tokens promote cross-attention by suppressing text-irrelevant activations with efficient parameter tuning, yielding robust semantic and temporal grounding. SelVA further employs a self-augmentation scheme to overcome the lack of mono audio track supervision. We evaluate SelVA on VGG-MONOAUDIO, a curated benchmark of clean single-source videos for such a task. Extensive experiments and ablations consistently verify its effectiveness across audio quality, semantic alignment, and temporal synchronization. Code and demo are available at https://jnwnlee.github.io/selva-demo/.

[87] Spatially-Grounded Document Retrieval via Patch-to-Region Relevance Propagation

Agathoklis Georgiou

Main category: cs.CV

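TL;DR: This paper proposes a hybrid architecture that combines the fine-grained similarity of vision-language models with OCR-extracted structured text, using spatial relevance propagation to retrieve document regions more precisely.

Details Motivation: Vision-language models such as ColPali excel at document retrieval but return whole pages rather than specific regions, which limits their usefulness for retrieval-augmented generation (RAG). OCR systems provide coordinates but lack semantic relevance assessment.

Contribution: A training-free hybrid architecture that combines VLM fine-grained similarity with OCR spatial information and achieves region-level retrieval through coordinate mapping and intersection metrics.

Method: Uses ColPali's patch-level similarity scores as spatial relevance filters over OCR regions, formalizes the coordinate mapping between the vision transformer patch grid and OCR bounding boxes, and introduces intersection metrics for relevance propagation.

Result: Releases Snappy, an open-source implementation that demonstrates practical applicability; empirical evaluation is ongoing.

Insight: Combining semantic and spatial information is intended to make document retrieval more precise, particularly for tasks such as RAG that need precise context; because empirical evaluation is still ongoing, the size of the gain is not yet reported.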

Abstract: Vision-language models (VLMs) like ColPali achieve state-of-the-art document retrieval by embedding pages as images and computing fine-grained similarity between query tokens and visual patches. However, they return entire pages rather than specific regions, limiting utility for retrieval-augmented generation (RAG) where precise context is paramount. Conversely, OCR-based systems extract structured text with bounding box coordinates but lack semantic grounding for relevance assessment. We propose a hybrid architecture that unifies these paradigms: using ColPali’s patch-level similarity scores as spatial relevance filters over OCR-extracted regions. We formalize the coordinate mapping between vision transformer patch grids and OCR bounding boxes, introduce intersection metrics for relevance propagation, and establish theoretical bounds on retrieval precision. Our approach operates at inference time without additional training. We release Snappy, an open-source implementation demonstrating practical applicability, with empirical evaluation ongoing.
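
The patch-to-region propagation described above can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the Snappy implementation: the page is assumed to be covered by a uniform patch grid with no padding or resizing, each patch already has a scalar relevance score for the query, and the intersection metric is assumed to be an area-weighted mean. The names `patch_box`, `intersection_area`, and `region_relevance` are hypothetical.

```python
def patch_box(row, col, grid_rows, grid_cols, page_w, page_h):
    """Page-space box (x0, y0, x1, y1) of one patch on a uniform grid."""
    pw, ph = page_w / grid_cols, page_h / grid_rows
    return (col * pw, row * ph, (col + 1) * pw, (row + 1) * ph)

def intersection_area(a, b):
    """Overlap area of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)

def region_relevance(patch_scores, ocr_box, page_w, page_h):
    """Area-weighted mean of the patch scores that overlap an OCR box.

    `patch_scores` is a grid (list of rows) of per-patch relevance scores.
    The weighting scheme is an assumption made for this sketch; the paper
    only states that intersection metrics are used.
    """
    grid_rows, grid_cols = len(patch_scores), len(patch_scores[0])
    weighted, total = 0.0, 0.0
    for r in range(grid_rows):
        for c in range(grid_cols):
            box = patch_box(r, c, grid_rows, grid_cols, page_w, page_h)
            area = intersection_area(box, ocr_box)
            weighted += area * patch_scores[r][c]
            total += area
    return weighted / total if total > 0 else 0.0

# Toy 2x2 grid on a 100x100 page: only the top-left patch is relevant.
scores = [[1.0, 0.0], [0.0, 0.0]]
top_left = region_relevance(scores, (0, 0, 50, 50), 100, 100)
centered = region_relevance(scores, (25, 25, 75, 75), 100, 100)
```

An OCR region that sits entirely inside the relevant patch inherits its full score, while a region straddling all four patches receives a quarter of it; ranking regions by this value is one way page-level patch scores could be turned into region-level results.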

[88] UAUTrack: Towards Unified Multimodal Anti-UAV Visual Tracking

Qionglin Ren,Dawei Zhang,Chunxu Tian,Dan Zhang

Main category: cs.CV

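TL;DR: UAUTrack is a unified multimodal anti-UAV tracking framework. With a single-stream, single-stage, end-to-end architecture and a text prior prompt strategy, it enables effective cross-modal collaboration and achieves state-of-the-art performance on multiple datasets.

Details Motivation: Existing anti-UAV trackers are mostly independent models without a unified framework for cross-modal collaboration, and multimodal data fusion remains ineffective, so an efficient unified solution is needed.

Contribution: 1. UAUTrack, a unified single-target tracking framework; 2. A text prior prompt strategy that improves recognition of UAV targets across scenarios; 3. SOTA performance on multiple datasets, balancing accuracy and speed.

Method: A single-stream, single-stage, end-to-end architecture combined with a text prior prompt strategy integrates multiple modalities such as RGB and TIR.

Result: Achieves state-of-the-art performance on the Anti-UAV and DUT Anti-UAV datasets and a favorable accuracy-speed trade-off on the Anti-UAV410 dataset.

Insight: A text prompt strategy that supports cross-modal collaboration can make the model more robust on UAV targets and suggests a new direction for multimodal tracking.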

Abstract: Research in Anti-UAV (Unmanned Aerial Vehicle) tracking has explored various modalities, including RGB, TIR, and RGB-T fusion. However, a unified framework for cross-modal collaboration is still lacking. Existing approaches have primarily focused on independent models for individual tasks, often overlooking the potential for cross-modal information sharing. Furthermore, Anti-UAV tracking techniques are still in their infancy, with current solutions struggling to achieve effective multimodal data fusion. To address these challenges, we propose UAUTrack, a unified single-target tracking framework built upon a single-stream, single-stage, end-to-end architecture that effectively integrates multiple modalities. UAUTrack introduces a key component: a text prior prompt strategy that directs the model to focus on UAVs across various scenarios. Experimental results show that UAUTrack achieves state-of-the-art performance on the Anti-UAV and DUT Anti-UAV datasets, and maintains a favourable trade-off between accuracy and speed on the Anti-UAV410 dataset, demonstrating both high accuracy and practical efficiency across diverse Anti-UAV scenarios.

[89] Unsupervised Structural Scene Decomposition via Foreground-Aware Slot Attention with Pseudo-Mask Guidance

Huankun Sheng,Ming Li,Yixiang Wei,Yeying Fan,Yu-Hui Wen,Tieliang Gong,Yong-Jin Liu

Main category: cs.CV

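TL;DR: The paper proposes Foreground-Aware Slot Attention (FASA), a two-stage framework that explicitly separates foreground from background and uses pseudo-mask guidance to improve unsupervised scene decomposition and object discovery.

Details Motivation: Existing slot attention-based methods process foreground and background indiscriminately, causing background interference and poor object discovery. FASA addresses this by explicitly separating foreground from background to make scene decomposition more robust.

Contribution: 1. The FASA framework, which separates foreground from background in two stages (coarse decomposition, then masked attention); 2. A clustering-based slot initialization strategy; 3. Pseudo-mask guidance to improve foreground object segmentation.

Method: 1. Stage one: coarse scene decomposition that separates foreground from background through a dual-slot competition mechanism; 2. Stage two: masked slot attention, where the first slot captures the background and the remaining slots compete to represent foreground objects; 3. Pseudo-mask guidance derived from self-supervised image features.

Result: FASA outperforms existing methods on both synthetic and real-world datasets, validating explicit foreground modeling and pseudo-mask guidance.

Insight: Explicitly separating foreground from background reduces background interference and makes object discovery more precise; pseudo-mask guidance further mitigates over-segmentation of foreground objects.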

Abstract: Recent advances in object-centric representation learning have shown that slot attention-based methods can effectively decompose visual scenes into object slot representations without supervision. However, existing approaches typically process foreground and background regions indiscriminately, often resulting in background interference and suboptimal instance discovery performance on real-world data. To address this limitation, we propose Foreground-Aware Slot Attention (FASA), a two-stage framework that explicitly separates foreground from background to enable precise object discovery. In the first stage, FASA performs a coarse scene decomposition to distinguish foreground from background regions through a dual-slot competition mechanism. These slots are initialized via a clustering-based strategy, yielding well-structured representations of salient regions. In the second stage, we introduce a masked slot attention mechanism where the first slot captures the background while the remaining slots compete to represent individual foreground objects. To further address over-segmentation of foreground objects, we incorporate pseudo-mask guidance derived from a patch affinity graph constructed with self-supervised image features to guide the learning of foreground slots. Extensive experiments on both synthetic and real-world datasets demonstrate that FASA consistently outperforms state-of-the-art methods, validating the effectiveness of explicit foreground modeling and pseudo-mask guidance for robust scene decomposition and object-coherent representation. Code will be made publicly available.

[90] ALDI-ray: Adapting the ALDI Framework for Security X-ray Object Detection

Omid Reza Heidari,Yang Wang,Xinxin Zuo

Main category: cs.CV

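TL;DR: ALDI++ is a domain adaptation framework that combines self-distillation, feature alignment, and enhanced training strategies. Applied to security X-ray imagery, it outperforms existing methods on the EDS dataset, with the best results from a ViTDet backbone.

Details Motivation: Variations in scanning devices and environmental conditions cause significant domain discrepancies in security X-ray imaging and degrade conventional object detectors, so effective domain adaptation methods are needed.

Contribution: Applies the ALDI++ framework, which combines self-distillation, feature alignment, and enhanced training strategies, to mitigate domain shift in security X-ray detection; validates it on the EDS dataset, where the ViTDet-based architecture performs best, setting a new benchmark for cross-domain object detection in security X-ray imagery.

Method: ALDI++ integrates self-distillation, feature alignment, and enhanced training strategies to mitigate domain shift. Experiments use ViTDet as a backbone and are evaluated on the EDS dataset.

Result: ALDI++ surpasses existing domain adaptation methods on the EDS dataset; the ViTDet-based architecture achieves the highest mAP, showing the effectiveness of transformer architectures for cross-domain object detection.

Insight: Transformer architectures perform well in cross-domain object detection, and combining self-distillation with feature alignment can substantially reduce the impact of domain shift.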

Abstract: Domain adaptation in object detection is critical for real-world applications where distribution shifts degrade model performance. Security X-ray imaging presents a unique challenge due to variations in scanning devices and environmental conditions, leading to significant domain discrepancies. To address this, we apply ALDI++, a domain adaptation framework that integrates self-distillation, feature alignment, and enhanced training strategies to mitigate domain shift effectively in this area. We conduct extensive experiments on the EDS dataset, demonstrating that ALDI++ surpasses the state-of-the-art (SOTA) domain adaptation methods across multiple adaptation scenarios. In particular, ALDI++ with a Vision Transformer for Detection (ViTDet) backbone achieves the highest mean average precision (mAP), confirming the effectiveness of transformer-based architectures for cross-domain object detection. Additionally, our category-wise analysis highlights consistent improvements in detection accuracy, reinforcing the robustness of the model across diverse object classes. Our findings establish ALDI++ as an efficient solution for domain-adaptive object detection, setting a new benchmark for performance stability and cross-domain generalization in security X-ray imagery.

[91] VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm

Zhenkai Wu,Xiaowen Ma,Zhenliang Ni,Dengming Zhang,Han Shu,Xin Jiang,Xinghao Chen

Main category: cs.CV

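TL;DR: VLM-Pruner is a training-free token pruning algorithm that balances redundancy and spatial sparsity, making vision-language models more efficient while preserving fine-grained object details.

Details Motivation: Vision-language models (VLMs) excel at image understanding, but their large number of visual tokens is computationally costly and hinders deployment on mobile devices. Existing pruning methods that rely only on token importance overlook inter-token redundancy and waste capacity, while redundancy-aware methods ignore spatial relationships and can select tokens too sparsely to cover target objects.

Contribution: 1. VLM-Pruner, a training-free token pruning algorithm; 2. A centrifugal token pruning paradigm that prioritizes fine-grained object details; 3. The Buffering for Spatial Sparsity (BSS) criterion, which defers the selection of spatially distant tokens; 4. A parallel greedy strategy for efficient token selection; 5. Selective fusion of salient information from discarded tokens to reduce information loss.

Method: 1. Centrifugal token pruning: select tokens from near to far to preserve fine object details; 2. BSS criterion: defer the selection of spatially distant tokens; 3. Parallel greedy strategy: perform token selection efficiently; 4. Selective information fusion: merge salient information from discarded tokens into the retained ones.

Result: Across five VLMs at an 88.9% pruning rate, VLM-Pruner consistently outperforms strong baselines and delivers an end-to-end inference speedup.

Insight: 1. Spatial relationships and redundancy both matter in token pruning; 2. Training-free pruning can substantially improve efficiency; 3. Information fusion is an effective way to mitigate the loss caused by pruning.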

Abstract: Vision-language models (VLMs) excel at image understanding tasks, but the large number of visual tokens imposes significant computational costs, hindering deployment on mobile devices. Many pruning methods rely solely on token importance and thus overlook inter-token redundancy, retaining numerous duplicated tokens and wasting capacity. Although some redundancy-aware approaches have been proposed, they often ignore the spatial relationships among visual tokens. This can lead to overly sparse selections of retained tokens that fail to adequately cover the regions of target objects. To address these limitations, we propose VLM-Pruner, a training-free token pruning algorithm that explicitly balances redundancy and spatial sparsity. We introduce a centrifugal token pruning paradigm that enables near-to-far selection while prioritizing the preservation of fine-grained object details. Moreover, we design a Buffering for Spatial Sparsity (BSS) criterion that defers the selection of spatially distant tokens. We further adopt a parallel greedy strategy to conduct token selection efficiently. To mitigate information loss from pruning, we selectively fuse salient information from the discarded tokens into the retained ones. Comprehensive comparisons demonstrate that VLM-Pruner consistently outperforms strong baselines across five VLMs with an 88.9% pruning rate, while delivering an end-to-end inference speedup.

[92] GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding

Peirong Zhang,Yidan Zhang,Luxiao Xu,Jinliang Lin,Zonghao Guo,Fengxiang Wang,Xue Yang,Kaiwen Wei,Lei Wang

Main category: cs.CV

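TL;DR: GeoViS is a Geospatially Rewarded Visual Search framework that tackles visual grounding in remote sensing imagery through progressive search and reasoning, substantially improving small-target detection and the understanding of complex geospatial relations.

Details Motivation: Targets in remote sensing images are often extremely small and involve complex geospatial relations, which conventional visual grounding methods do not handle well.

Contribution: Proposes the GeoViS framework, which reformulates remote sensing visual grounding as a progressive search-and-reasoning process that integrates multimodal perception, spatial reasoning, and reward-guided exploration.

Method: GeoViS explores the global image step by step through a tree-structured sequence of visual cues, iteratively refining geospatial hypotheses.

Result: On five remote sensing visual grounding benchmarks, GeoViS consistently surpasses existing methods.

Insight: Combining progressive search with geospatial rewards improves small-target detection and the understanding of complex relations, while also showing strong cross-domain generalization and interpretability.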

Abstract: Recent advances in multimodal large language models (MLLMs) have led to remarkable progress in visual grounding, enabling fine-grained cross-modal alignment between textual queries and image regions. However, transferring such capabilities to remote sensing imagery remains challenging, as targets are often extremely small within kilometer-scale scenes, and queries typically involve intricate geospatial relations such as relative positions, spatial hierarchies, or contextual dependencies across distant objects. To address these challenges, we propose GeoViS, a Geospatially Rewarded Visual Search framework that reformulates remote sensing visual grounding as a progressive search-and-reasoning process. Rather than directly predicting the target location in a single step, GeoViS actively explores the global image through a tree-structured sequence of visual cues, integrating multimodal perception, spatial reasoning, and reward-guided exploration to refine geospatial hypotheses iteratively. This design enables the model to detect subtle small-scale targets while maintaining holistic scene awareness. Extensive experiments on five remote sensing grounding benchmarks demonstrate that GeoViS achieves precise geospatial understanding and consistently surpasses existing methods across key visual grounding metrics, highlighting its strong cross-domain generalization and interpretability.

[93] Beyond Paired Data: Self-Supervised UAV Geo-Localization from Reference Imagery Alone

Tristan Amadei,Enric Meinhardt-Llopis,Benedicte Bascle,Corentin Abgrall,Gabriele Facciolo

Main category: cs.CV

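TL;DR: The paper proposes CAEVL, a self-supervised UAV localization method that needs no paired data and trains directly on satellite views, and introduces the ViLD dataset to validate it.

Details Motivation: Existing UAV localization methods depend on paired UAV-satellite image datasets, which are costly and hard to obtain. The paper removes this constraint with a training paradigm that needs only satellite views.

Contribution: 1. The CAEVL model, which is trained without UAV imagery; 2. An augmentation strategy that simulates the visual domain shift; 3. The ViLD dataset, released to validate the method.

Method: The model is trained on satellite views with a dedicated augmentation strategy that simulates the visual differences between UAV and satellite views. CAEVL is designed to exploit this paradigm for efficient localization.

Result: CAEVL performs competitively with methods trained on paired data and shows strong generalization.

Insight: Self-supervised learning with domain-shift augmentation can greatly reduce dependence on expensive paired data, making UAV localization more practical.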

Abstract: Image-based localization in GNSS-denied environments is critical for UAV autonomy. Existing state-of-the-art approaches rely on matching UAV images to geo-referenced satellite images; however, they typically require large-scale, paired UAV-satellite datasets for training. Such data are costly to acquire and often unavailable, limiting their applicability. To address this challenge, we adopt a training paradigm that removes the need for UAV imagery during training by learning directly from satellite-view reference images. This is achieved through a dedicated augmentation strategy that simulates the visual domain shift between satellite and real-world UAV views. We introduce CAEVL, an efficient model designed to exploit this paradigm, and validate it on ViLD, a new and challenging dataset of real-world UAV images that we release to the community. Our method achieves competitive performance compared to approaches trained with paired data, demonstrating its effectiveness and strong generalization capabilities.

[94] Reasoning-Aware Multimodal Fusion for Hateful Video Detection

Shuonan Yang,Tailin Chen,Jiangbei Yue,Guangliang Cheng,Jianbo Jiao,Zeyu Fu

Main category: cs.CV

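TL;DR: This paper proposes a Reasoning-Aware Multimodal Fusion (RAMF) framework for detecting hateful content in online videos. Local-Global Context Fusion (LGCF) and Semantic Cross Attention (SCA) address multimodal semantic interaction, and adversarial reasoning strengthens the model's understanding of hateful intent. Experiments on real-world datasets show that it outperforms existing methods.

Details Motivation: Hateful content in online videos is a serious threat to digital platforms, and existing methods fall short on multimodal semantic interaction and on understanding hateful intent. The authors propose RAMF to address both problems.

Contribution: 1. The LGCF and SCA modules, which capture local and global multimodal semantic relationships. 2. Adversarial reasoning, a three-stage process (objective descriptions, hate-assumed inferences, and non-hate-assumed inferences) that strengthens the model's understanding of hateful intent.

Method: 1. LGCF combines local salient cues with global temporal structure. 2. SCA enables fine-grained multimodal semantic interaction. 3. Adversarial reasoning generates complementary semantic perspectives.

Result: On two real-world hateful video datasets, RAMF improves Macro-F1 and hate-class recall over the best existing methods by 3% and 7%, respectively.

Insight: Combining local-global context with adversarial reasoning strengthens a model's understanding of complex hateful content and leads to better multimodal fusion.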

Abstract: Hate speech in online videos is posing an increasingly serious threat to digital platforms, especially as video content becomes increasingly multimodal and context-dependent. Existing methods often struggle to effectively fuse the complex semantic relationships between modalities and lack the ability to understand nuanced hateful content. To address these issues, we propose an innovative Reasoning-Aware Multimodal Fusion (RAMF) framework. To tackle the first challenge, we design Local-Global Context Fusion (LGCF) to capture both local salient cues and global temporal structures, and propose Semantic Cross Attention (SCA) to enable fine-grained multimodal semantic interaction. To tackle the second challenge, we introduce adversarial reasoning, a structured three-stage process in which a vision-language model generates (i) objective descriptions, (ii) hate-assumed inferences, and (iii) non-hate-assumed inferences, providing complementary semantic perspectives that enrich the model’s contextual understanding of nuanced hateful intent. Evaluations on two real-world hateful video datasets demonstrate that our method achieves robust generalisation performance, improving upon state-of-the-art methods by 3% and 7% in Macro-F1 and hate class recall, respectively. We will release the code after the anonymity period ends.
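
The two reported metrics, Macro-F1 and hate-class recall, follow their standard definitions and can be computed as in the sketch below. The label encoding (1 for hateful, 0 for non-hateful) and the toy predictions are hypothetical and are not taken from the paper's datasets.

```python
def per_class_prf(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, treating it as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels that occur."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_prf(y_true, y_pred, l)[2] for l in labels) / len(labels)

# Hypothetical toy labels: 1 = hateful, 0 = non-hateful.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

hate_recall = per_class_prf(y_true, y_pred, 1)[1]
score = macro_f1(y_true, y_pred)
```

Because Macro-F1 averages per-class F1 without weighting by class frequency, a minority hateful class counts as much as the majority class, which is why it is commonly paired with hate-class recall on imbalanced data.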

[95] Rethinking Surgical Smoke: A Smoke-Type-Aware Laparoscopic Video Desmoking Method and Dataset

Qifan Liang,Junlin Li,Zhen Han,Xihao Wang,Zhongyuan Wang,Bin Mei

Main category: cs.CV

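TL;DR: This paper proposes the first smoke-type-aware laparoscopic video desmoking network (STANet). It distinguishes Diffusion Smoke from Ambient Smoke, pairs a smoke mask segmentation sub-network with a smokeless video reconstruction sub-network, and adds a coarse-to-fine disentanglement module to improve performance. It also builds the first large-scale synthetic dataset with smoke type annotations.

Details Motivation: Smoke generated during laparoscopic surgery degrades the visual guidance that the video provides, and existing methods ignore differences between smoke types, so a smoke-type-aware desmoking method is needed.

Contribution: 1. Proposes STANet, the first method to distinguish Diffusion Smoke from Ambient Smoke. 2. Designs mask segmentation and disentanglement modules to improve performance. 3. Builds the first large-scale synthetic video desmoking dataset with smoke type annotations.

Method: 1. The mask segmentation sub-network uses attention-weighted mask aggregation. 2. The video reconstruction sub-network desmokes under the guidance of the smoke masks. 3. The coarse-to-fine disentanglement module separates the smoke types through cross attention.

Result: Experiments show that STANet outperforms existing methods in quality evaluations and generalizes better across downstream tasks.

Insight: Distinguishing and disentangling smoke types is the key; attention mechanisms and multi-task learning help improve desmoking.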

Abstract: Electrocautery or lasers will inevitably generate surgical smoke, which hinders the visual guidance of laparoscopic videos for surgical procedures. The surgical smoke can be classified into different types based on its motion patterns, leading to distinctive spatio-temporal characteristics across smoky laparoscopic videos. However, existing desmoking methods fail to account for such smoke-type-specific distinctions. Therefore, we propose the first Smoke-Type-Aware Laparoscopic Video Desmoking Network (STANet) by introducing two smoke types: Diffusion Smoke and Ambient Smoke. Specifically, a smoke mask segmentation sub-network is designed to jointly conduct smoke mask and smoke type predictions based on attention-weighted mask aggregation, while a smokeless video reconstruction sub-network is proposed to perform specialized desmoking on smoky features, guided by the two types of smoke masks. To address the entanglement of the two smoke types, we further embed a coarse-to-fine disentanglement module into the mask segmentation sub-network, which yields more accurate disentangled masks through smoke-type-aware cross attention between non-entangled and entangled regions. In addition, we construct the first large-scale synthetic video desmoking dataset with smoke type annotations. Extensive experiments demonstrate that our method not only outperforms state-of-the-art approaches in quality evaluations, but also exhibits superior generalization across multiple downstream surgical tasks.

[96] TrackNetV5: Residual-Driven Spatio-Temporal Refinement and Motion Direction Decoupling for Fast Object Tracking

Tang Haonan,Chen Yanjun,Jiang Lezhi

Main category: cs.CV

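TL;DR: TrackNetV5 is a new object tracking method that uses a Motion Direction Decoupling (MDD) module and a Residual-Driven Spatio-Temporal Refinement (R-STR) head to address occlusion and directional ambiguity, achieving high-performance real-time tracking.

Details Motivation: Earlier TrackNet versions struggle with occlusion (V1-V3) and directional ambiguity (V4) when tracking fast-moving small objects, which limits tracking performance.

Contribution: 1. Introduces the Motion Direction Decoupling (MDD) module, which explicitly encodes movement occurrence and trajectory direction; 2. Proposes the Residual-Driven Spatio-Temporal Refinement (R-STR) head, which recovers occluded targets through residual refinement.

Method: 1. The MDD module decomposes temporal dynamics into signed polarity fields; 2. The Transformer-based R-STR head estimates a corrective residual from factorized spatio-temporal context.

Result: On the TrackNetV2 dataset, TrackNetV5 reaches an F1-score of 0.9859 and an accuracy of 0.9733, significantly outperforming previous versions, with only a 3.7% increase in FLOPs over V4.

Insight: Explicitly encoding motion direction and refining with residuals are effective ways to improve tracking under occlusion.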

Abstract: The TrackNet series has established a strong baseline for fast-moving small object tracking in sports. However, existing iterations face significant limitations: V1-V3 struggle with occlusions due to a reliance on purely visual cues, while TrackNetV4, despite introducing motion inputs, suffers from directional ambiguity as its absolute difference method discards motion polarity. To overcome these bottlenecks, we propose TrackNetV5, a robust architecture integrating two novel mechanisms. First, to recover lost directional priors, we introduce the Motion Direction Decoupling (MDD) module. Unlike V4, MDD decomposes temporal dynamics into signed polarity fields, explicitly encoding both movement occurrence and trajectory direction. Second, we propose the Residual-Driven Spatio-Temporal Refinement (R-STR) head. Operating on a coarse-to-fine paradigm, this Transformer-based module leverages factorized spatio-temporal contexts to estimate a corrective residual, effectively recovering occluded targets. Extensive experiments on the TrackNetV2 dataset demonstrate that TrackNetV5 achieves a new state-of-the-art F1-score of 0.9859 and an accuracy of 0.9733, significantly outperforming previous versions. Notably, this performance leap is achieved with a marginal 3.7% increase in FLOPs compared to V4, maintaining real-time inference capabilities while delivering superior tracking precision.
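
To illustrate why an absolute difference discards motion polarity while a signed decomposition keeps it, here is a minimal Python sketch. The abstract does not give MDD's exact formulation, so the two-channel split below and the function names `absolute_difference` and `signed_polarity` are illustrative assumptions, not the authors' implementation.

```python
def absolute_difference(prev_frame, next_frame):
    """Per-pixel |next - prev|; the sign of the change is lost."""
    return [[abs(b - a) for a, b in zip(row_prev, row_next)]
            for row_prev, row_next in zip(prev_frame, next_frame)]


def signed_polarity(prev_frame, next_frame):
    """Split the signed frame difference into two non-negative channels.

    Hypothetical stand-in for a signed polarity field:
    - `brightened` holds pixels whose intensity increased,
    - `darkened` holds pixels whose intensity decreased.
    """
    brightened, darkened = [], []
    for row_prev, row_next in zip(prev_frame, next_frame):
        diff = [b - a for a, b in zip(row_prev, row_next)]
        brightened.append([max(d, 0) for d in diff])
        darkened.append([max(-d, 0) for d in diff])
    return brightened, darkened


# A bright object one pixel wide, at two neighbouring positions.
frame_left = [[0, 9, 0, 0]]
frame_right = [[0, 0, 9, 0]]

# Moving right and moving left give the same absolute difference...
same = absolute_difference(frame_left, frame_right) == absolute_difference(frame_right, frame_left)
# ...but different signed polarity channels.
differs = signed_polarity(frame_left, frame_right) != signed_polarity(frame_right, frame_left)
```

In this toy case the two directions of motion swap the `brightened` and `darkened` channels, which is the kind of directional cue an absolute difference cannot provide.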

[97] UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits

Keming Ye,Zhipeng Huang,Canmiao Fu,Qingyang Liu,Jiani Cai,Zheqi Lv,Chen Li,Jing Lyu,Zhou Zhao,Shengyu Zhang

Main category: cs.CV

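TL;DR: This paper presents the UnicEdit-10M dataset and the UnicBench benchmark. A unified verification mechanism addresses the scale-quality trade-off in image editing data, and the benchmark provides fine-grained diagnostic metrics.

Details Motivation: Existing image editing datasets and benchmarks trade scale against quality and cannot meet the training and evaluation needs of powerful multimodal models. The paper addresses this with a unified data pipeline and verification mechanism.

Contribution: 1) Proposes a lightweight data pipeline that produces the 10M-scale UnicEdit-10M dataset; 2) Designs the UnicBench benchmark, which extends evaluation to spatial and knowledge-driven reasoning; 3) Introduces new metrics, including Non-edit Consistency and Reasoning Accuracy.

Method: 1) Replaces multi-toolchains with an end-to-end model to simplify data generation; 2) Trains Qwen-Verify, a 7B dual-task expert model, for failure detection and instruction recaptioning; 3) Adds a unified post-verification stage to ensure data quality.

Result: UnicEdit-10M covers diverse editing tasks, and evaluation on UnicBench reveals the limitations of existing models and points to directions for future research.

Insight: Unified verification and data generation are key to resolving the scale-quality conflict; fine-grained metrics support deeper analysis of model capabilities.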

Abstract: With the rapid advances of powerful multimodal models such as GPT-4o, Nano Banana, and Seedream 4.0 in Image Editing, the performance gap between closed-source and open-source models is widening, primarily due to the scarcity of large-scale, high-quality training data and comprehensive benchmarks capable of diagnosing model weaknesses across diverse editing behaviors. Existing data construction methods face a scale-quality trade-off: human annotations are high-quality but not scalable, while automated pipelines suffer from error propagation and noise. To address this, we introduce a lightweight data pipeline that replaces multi-toolchains with an end-to-end model and a unified post-verification stage. For scalable quality control, we train a 7B dual-task expert model, Qwen-Verify, for efficient failure detection and instruction recaptioning. This pipeline yields UnicEdit-10M, a 10M-scale dataset spanning diverse basic and complex editing tasks. We also propose UnicBench, a general benchmark that extends beyond basic edits to explicitly assess spatial and knowledge-driven reasoning. To enable fine-grained diagnosis, we introduce novel metrics, including Non-edit Consistency and Reasoning Accuracy. Our analysis of mainstream models on UnicBench reveals their limitations and provides clear directions for future research.

[98] HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval

Zhiwei Chen,Yupeng Hu,Zixu Li,Zhiheng Fu,Haokun Wen,Weili Guan

Main category: cs.CV

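TL;DR: This paper proposes HUD, a network that uses hierarchical uncertainty-aware disambiguation to address the disparity in information density between video and text in multi-modal queries, significantly improving performance on Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR).

Details Motivation: Multi-modal queries (video plus text) in composed video retrieval have a disparity in information density, which leads to ambiguity about the subject being modified and limited focus on detailed semantics. Previous work overlooked this problem, which hurts model performance.

Contribution: 1. Proposes HUD, the first framework that leverages the disparity in information density between video and text to improve multi-modal query understanding; 2. Designs three key components (Holistic Pronoun Disambiguation, Atomistic Uncertainty Modeling, and Holistic-to-Atomistic Alignment) that improve semantic alignment and object disambiguation.

Method: 1. Captures overlapping semantics through holistic cross-modal interaction; 2. Achieves fine-grained semantic alignment through atomistic-level cross-modal interaction; 3. Uses uncertainty modeling to sharpen the focus on details.

Result: HUD achieves state-of-the-art performance on three benchmark datasets across CVR and CIR tasks.

Insight: Exploiting the disparity in information density between modalities makes semantic alignment and disambiguation of multi-modal queries more effective, which improves composed retrieval accuracy.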

Abstract: Composed Video Retrieval (CVR) is a challenging video retrieval task that utilizes multi-modal queries, consisting of a reference video and modification text, to retrieve the desired target video. The core of this task lies in understanding the multi-modal composed query and achieving accurate composed feature learning. Within multi-modal queries, the video modality typically carries richer semantic content compared to the textual modality. However, previous works have largely overlooked the disparity in information density between these two modalities. This limitation can lead to two critical issues: 1) modification subject referring ambiguity and 2) limited detailed semantic focus, both of which degrade the performance of CVR models. To address the aforementioned issues, we propose a novel CVR framework, namely the Hierarchical Uncertainty-aware Disambiguation network (HUD). HUD is the first framework that leverages the disparity in information density between video and text to enhance multi-modal query understanding. It comprises three key components: (a) Holistic Pronoun Disambiguation, (b) Atomistic Uncertainty Modeling, and (c) Holistic-to-Atomistic Alignment. By exploiting overlapping semantics through holistic cross-modal interaction and fine-grained semantic alignment via atomistic-level cross-modal interaction, HUD enables effective object disambiguation and enhances the focus on detailed semantics, thereby achieving precise composed feature learning. Moreover, our proposed HUD is also applicable to the Composed Image Retrieval (CIR) task and achieves state-of-the-art performance across three benchmark datasets for both CVR and CIR tasks. The codes are available on https://zivchen-ty.github.io/HUD.github.io/.

[99] IC-World: In-Context Generation for Shared World Modeling

Fan Wu,Jiacheng Wei,Ruibo Li,Yi Xu,Junyou Li,Deheng Ye,Guosheng Lin

Main category: cs.CV

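TL;DR: IC-World is a video generation framework for shared world modeling. It activates the in-context generation capability of large video models to generate multi-view videos in parallel, and uses reinforcement learning to improve geometry and motion consistency.

Details Motivation: Video-based world models are good at synthesizing diverse and dynamic visual environments, but shared world modeling (generating consistent videos from multi-view images of the same scene) has not been studied systematically. IC-World aims to fill this gap.

Contribution: Proposes the IC-World framework, the first systematic exploration of shared world modeling with video-based world models; applies Group Relative Policy Optimization with two reward models to improve geometry and motion consistency.

Method: Uses the in-context generation capability of large video models to generate videos in parallel, then fine-tunes with reinforcement learning (GRPO) and two new reward models (scene-level geometry consistency and object-level motion consistency).

Result: Experiments show that IC-World substantially outperforms existing methods in both geometry and motion consistency.

Insight: Shared world modeling must account for both multi-view consistency and dynamic change; IC-World achieves this through reinforcement learning and reward models, opening a new direction for video generation.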

Abstract: Video-based world models have recently garnered increasing attention for their ability to synthesize diverse and dynamic visual environments. In this paper, we focus on shared world modeling, where a model generates multiple videos from a set of input images, each representing the same underlying world in different camera poses. We propose IC-World, a novel generation framework, enabling parallel generation for all input images via activating the inherent in-context generation capability of large video models. We further finetune IC-World via reinforcement learning, Group Relative Policy Optimization, together with two proposed novel reward models to enforce scene-level geometry consistency and object-level motion consistency among the set of generated videos. Extensive experiments demonstrate that IC-World substantially outperforms state-of-the-art methods in both geometry and motion consistency. To the best of our knowledge, this is the first work to systematically explore the shared world modeling problem with video-based world models.
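
The group-relative step in Group Relative Policy Optimization can be sketched as follows: rewards for a group of samples generated from the same input are normalized by the group's mean and standard deviation. This is the commonly described form of the advantage. How IC-World merges its geometry and motion rewards is not stated in the abstract, so the equal-weight sum in `combined_reward` and all names below are assumptions for illustration only.

```python
from statistics import mean, pstdev


def combined_reward(geometry_score, motion_score, w_geometry=0.5, w_motion=0.5):
    """Hypothetical merge of two reward-model scores (weights are assumed)."""
    return w_geometry * geometry_score + w_motion * motion_score


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the statistics of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps keeps the division finite when every sample has the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled generations for one input, scored by the two reward models.
scores = [(0.2, 0.2), (0.5, 0.5), (0.8, 0.8)]
rewards = [combined_reward(g, m) for g, m in scores]
advantages = group_relative_advantages(rewards)
```

Samples that score above the group average receive a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.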

[100] Defense That Attacks: How Robust Models Become Better Attackers

Mohamed Awad,Mahmoud Akrm,Walid Gomaa

Main category: cs.CV

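TL;DR: The paper finds that adversarial training (AT) not only improves model robustness but also unexpectedly increases the transferability of adversarial examples, revealing a new ecosystem risk.

Details Motivation: Adversarial training is the leading method for improving the robustness of deep learning models, but its effect on the transferability of adversarial examples is underexplored. The study asks whether adversarial training unintentionally increases that transferability.

Contribution: 1. Trains 36 diverse models (including CNNs and ViTs) and finds that adversarial examples from AT models transfer more effectively; 2. Reveals the double-edged effect of adversarial training; 3. Releases all models, code, and experimental scripts to support further research; 4. Argues that robustness evaluations should consider both a model's resistance to transferred attacks and its propensity to produce transferable adversarial examples.

Method: 1. Trains 36 different models (CNNs and ViTs); 2. Runs comprehensive transferability experiments; 3. Compares the transferability of adversarial examples generated by standard and AT models.

Result: Adversarial examples from AT models transfer more effectively than those from standard models, exposing a potential risk of adversarial training.

Insight: Adversarial training may introduce an ecosystem risk while improving robustness; future robustness evaluations should consider both roles a model can play.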

Abstract: Deep learning has achieved great success in computer vision, but remains vulnerable to adversarial attacks. Adversarial training is the leading defense designed to improve model robustness. However, its effect on the transferability of attacks is underexplored. In this work, we ask whether adversarial training unintentionally increases the transferability of adversarial examples. To answer this, we trained a diverse zoo of 36 models, including CNNs and ViTs, and conducted comprehensive transferability experiments. Our results reveal a clear paradox: adversarially trained (AT) models produce perturbations that transfer more effectively than those from standard models, which introduces a new ecosystem risk. To enable reproducibility and further study, we release all models, code, and experimental scripts. Furthermore, we argue that robustness evaluations should assess not only the resistance of a model to transferred attacks but also its propensity to produce transferable adversarial examples.

[101] Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video?

Manuel Benavent-Lledo,Konstantinos Bacharidis,Victoria Manousaki,Konstantinos Papoutsakis,Antonis Argyros,Jose Garcia-Rodriguez

Main category: cs.CV

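TL;DR: The paper studies whether actions can be anticipated from a single frame and multimodal cues (RGB features and depth) combined with context (textual summaries or action recognition results). The proposed AAG method performs competitively with video-based baselines on several datasets.

Details Motivation: Conventional action anticipation methods aggregate temporal information from video, but humans can often predict an action from a single frame given enough context. The paper examines whether multimodal cues can replace video temporal information for efficient action anticipation.

Contribution: Proposes AAG, which combines RGB features, depth cues, and long-term context (textual summaries or action recognition results) to anticipate actions from a single frame.

Method: AAG obtains context from textual summaries produced by Vision-Language Models or from the predictions of a single-frame action recognizer, and combines RGB and depth features for better spatial reasoning.

Result: On three datasets (IKEA-ASM, Meccano, and Assembly101), AAG performs competitively with temporally aggregated video baselines and state-of-the-art methods.

Insight: Multimodal single-frame information plus context can, depending on task complexity, substitute for video temporal information, offering a new approach to action anticipation.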

Abstract: Anticipating actions before they occur is a core challenge in action understanding research. While conventional methods rely on extracting and aggregating temporal information from videos, as humans we can often predict upcoming actions by observing a single moment from a scene, when given sufficient context. Can a model achieve this competence? The short answer is yes, although its effectiveness depends on the complexity of the task. In this work, we investigate to what extent video aggregation can be replaced with alternative modalities. To this end, based on recent advances in visual feature extraction and language-based reasoning, we introduce AAG, a method for Action Anticipation at a Glimpse. AAG combines RGB features with depth cues from a single frame for enhanced spatial reasoning, and incorporates prior action information to provide long-term context. This context is obtained either through textual summaries from Vision-Language Models, or from predictions generated by a single-frame action recognizer. Our results demonstrate that multimodal single-frame action anticipation using AAG can perform competitively compared to both temporally aggregated video baselines and state-of-the-art methods across three instructional activity datasets: IKEA-ASM, Meccano, and Assembly101.

[102] RFOP: Rethinking Fusion and Orthogonal Projection for Face-Voice Association

Abdul Hannan,Furqan Malik,Hina Jabbar,Syed Suleman Sadiq,Mubashir Noman

Main category: cs.CV

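TL;DR: RFOP revisits fusion and orthogonal projection for face-voice association in multilingual settings. By focusing on the relevant semantic information in the two modalities, it ranked 3rd in the FAME 2026 challenge.

Details Motivation: Face-voice association in multilingual settings brings new challenges, particularly with face-voice data in different languages, where conventional fusion and projection methods may not fully capture semantic information.

Contribution: Proposes an improved fusion and orthogonal projection method that focuses on semantic relevance between the two modalities, yielding better face-voice association in multilingual settings.

Method: Redesigns the fusion strategy and orthogonal projection to extract the relevant semantic information shared by the face and voice modalities, improving association on cross-lingual data.

Result: In the FAME 2026 challenge, RFOP performed favorably on the English-German data split, achieving an EER (equal error rate) of 33.1% and ranking 3rd.

Insight: In multimodal tasks, especially cross-lingual ones, fusion and projection designs that focus on semantic information are a key source of improvement.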

Abstract: The Face-Voice Association in Multilingual Environments (FAME) 2026 challenge investigates the face-voice association task in multilingual scenarios. The challenge introduces English-German face-voice pairs to be used in the evaluation phase. To this end, we revisit fusion and orthogonal projection for face-voice association by focusing on the relevant semantic information within the two modalities. Our method performs favorably on the English-German data split and ranked 3rd in the FAME 2026 challenge with an EER of 33.1.
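
For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below approximates it by sweeping candidate thresholds over the observed scores. The function name, the toy scores, and the convention that higher scores mean "same identity" are assumptions for illustration; the challenge's official scoring script may differ.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate EER by sweeping a threshold over all observed scores.

    A pair is accepted when its score is >= the threshold.
    """
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        # False rejection: a matching face-voice pair scored below the threshold.
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        # False acceptance: a non-matching pair scored at or above the threshold.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy similarity scores (made up for this example).
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(genuine, impostor)  # 0.25 for these toy scores
```

Lower is better: an EER of 0 means the two score distributions are perfectly separable, and 50% corresponds to chance for a balanced verification task.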

[103] Taming Camera-Controlled Video Generation with Verifiable Geometry Reward

Zhaoqing Wang,Xiaobo Xia,Zhuolin Bie,Jinlin Liu,Dongdong Yu,Jia-Wang Bian,Changhu Wang

Main category: cs.CV

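TL;DR: The paper proposes an online reinforcement learning (RL) post-training framework that optimizes a pretrained video generator for precise camera control. A verifiable geometry reward provides dense segment-level feedback and significantly improves camera-control accuracy and geometric consistency.

Details Motivation: Most existing video diffusion models rely solely on supervised fine-tuning (SFT), leaving online RL post-training underexplored. To further improve camera-control precision, the authors bring RL into camera-controlled video generation.

Contribution: 1. Introduces an online RL post-training framework; 2. Designs a verifiable geometry reward that provides dense segment-level feedback; 3. Builds a dataset with diverse camera motions and scenes.

Method: Estimates the 3D camera trajectories of the generated and reference videos, divides each into short segments, computes segment-wise relative poses, and uses the alignment between each generated-reference segment pair as the reward signal.

Result: Experiments show that the method outperforms SFT baselines in camera-control accuracy, geometric consistency, and visual quality.

Insight: Online RL post-training can significantly improve the camera control of video generation models, and a dense reward signal is key to optimization efficiency.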

Abstract: Recent advances in video diffusion models have remarkably improved camera-controlled video generation, but most methods rely solely on supervised fine-tuning (SFT), leaving online reinforcement learning (RL) post-training largely underexplored. In this work, we introduce an online RL post-training framework that optimizes a pretrained video generator for precise camera control. To make RL effective in this setting, we design a verifiable geometry reward that delivers dense segment-level feedback to guide model optimization. Specifically, we estimate the 3D camera trajectories for both generated and reference videos, divide each trajectory into short segments, and compute segment-wise relative poses. The reward function then compares each generated-reference segment pair and assigns an alignment score as the reward signal, which helps alleviate reward sparsity and improve optimization efficiency. Moreover, we construct a comprehensive dataset featuring diverse large-amplitude camera motions and scenes with varied subject dynamics. Extensive experiments show that our online RL post-training clearly outperforms SFT baselines across multiple aspects, including camera-control accuracy, geometric consistency, and visual quality, demonstrating its superiority in advancing camera-controlled video generation.
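
The segment-wise reward described above can be sketched in a few lines. To keep the example short, it makes several simplifying assumptions that are not from the paper: poses are planar `(x, y, yaw)` tuples rather than full 3D camera poses, the trajectories are already estimated, and the alignment score is a hypothetical `exp(-error)` mapping.

```python
import math


def relative_pose(start, end):
    """Pose of `end` expressed in the frame of `start`; poses are (x, y, yaw)."""
    dx, dy = end[0] - start[0], end[1] - start[1]
    c, s = math.cos(start[2]), math.sin(start[2])
    # Rotate the world-frame displacement into the start frame.
    local_x = c * dx + s * dy
    local_y = -s * dx + c * dy
    # Wrap the heading change into (-pi, pi].
    dyaw = math.atan2(math.sin(end[2] - start[2]), math.cos(end[2] - start[2]))
    return local_x, local_y, dyaw


def segment_rewards(generated, reference, segment_len=2):
    """One reward per segment, comparing relative poses of the two trajectories."""
    rewards = []
    n = min(len(generated), len(reference))
    for i in range(0, n - segment_len, segment_len):
        g = relative_pose(generated[i], generated[i + segment_len])
        r = relative_pose(reference[i], reference[i + segment_len])
        trans_err = math.hypot(g[0] - r[0], g[1] - r[1])
        rot_err = abs(math.atan2(math.sin(g[2] - r[2]), math.cos(g[2] - r[2])))
        # Assumed scoring: 1.0 for a perfect match, decaying with the error.
        rewards.append(math.exp(-(trans_err + rot_err)))
    return rewards


reference = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (4, 0, 0)]
drifting = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 1, 0), (4, 2, 0)]
rewards = segment_rewards(drifting, reference)  # first segment matches, second does not
```

Because each segment is scored on its own relative motion, the output localizes where the generated camera path departs from the reference, which is the sense in which the feedback is dense rather than a single score per video.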

[104] MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm

Wei Chen,Chaoqun Du,Feng Gu,Wei He,Qizhen Li,Zide Liu,Xuhao Pan,Chang Ren,Xudong Rao,Chenfeng Wang,Tao Wei,Chengjun Yu,Pengfei Yu,Yufei Zheng,Chunpeng Zhou,Pan Zhou,Xuhan Zhu

Main category: cs.CV

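TL;DR: MindGPT-4ov is a multimodal large language model (MLLM) improved through a multi-stage post-training paradigm. It achieves leading results on multiple benchmarks at low cost and strengthens the foundational and generalization capabilities of MLLMs. Its core innovations are a data generation scheme, a collaborative curriculum supervised fine-tuning method, and a hybrid reinforcement learning paradigm.

Details Motivation: Existing MLLMs are limited in data quality, training efficiency, and generalization; MindGPT-4ov aims to address these with a systematic post-training paradigm.

Contribution: 1) An information density-based data generation scheme with a dual-dimensional tree-structured label system; 2) A collaborative curriculum supervised fine-tuning method; 3) A hybrid reinforcement learning paradigm; 4) Infrastructure optimizations.

Method: Uses a multi-stage post-training paradigm covering data generation, supervised fine-tuning, and reinforcement learning, along with infrastructure optimizations such as 5D parallel training and inference quantization.

Result: Performs strongly on benchmarks such as MMBench, MMStar, MathVision, and MathVista, and delivers a markedly better user experience.

Insight: A systematic post-training paradigm can significantly improve MLLM performance and adaptability while reducing the cost of domain adaptation.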

Abstract: We present MindGPT-4ov, a multimodal large language model (MLLM) that introduces a general post-training paradigm spanning data production, model training, and efficient deployment. It achieves state-of-the-art performance across multiple benchmarks at low cost, effectively enhancing both the foundational capabilities and the generalization ability of MLLMs. Focusing on data construction, supervised fine-tuning strategies, and multimodal reinforcement learning methods, this work proposes three key innovations: (1) An information density-based data generation scheme, integrated with a dual-dimensional tree-structured label system, enabling automated generation of high-quality cross-domain data. (2) A collaborative curriculum supervised fine-tuning approach that balances the injection of domain-specific knowledge with the preservation of general capabilities. (3) A hybrid reinforcement learning paradigm that enhances reasoning ability while simultaneously addressing multi-objective optimization such as diversity exploration, maintenance of multimodal perception, and response conciseness. Moreover, we implement a series of infrastructure optimizations, such as 5D parallel training, operator optimization, and inference quantization to enhance training and inference efficiency while reducing the cost of domain adaptation. Experimental results demonstrate that the MindGPT-4ov model outperforms state-of-the-art models on benchmarks such as MMBench, MMStar, MathVision, and MathVista. MindGPT-4ov also demonstrates superior user experience in vertical domain tasks, enabling a seamless transition from academic research to industrial deployment. MindGPT-4ov provides a general post-training paradigm applicable to a wide range of MLLMs. The model weights, datasets, and code for the Qwen3-VL-based variants will be open-sourced soon to support the community’s development of MLLMs.

[105] Polar Perspectives: Evaluating 2-D LiDAR Projections for Robust Place Recognition with Visual Foundation Models

Pierpaolo Serio,Giulio Pisaneschi,Andrea Dan Ryals,Vincenzo Infantino,Lorenzo Gentilini,Valentina Donzella,Lorenzo Pollini

Main category: cs.CV

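TL;DR: This paper studies how different LiDAR-to-image projections affect metric place recognition with a vision foundation model. It introduces a modular retrieval pipeline and shows that well-designed projections can serve as an effective surrogate for end-to-end 3D learning in LiDAR place recognition.

Details Motivation: In LiDAR place recognition, the choice of 2D projection strongly affects performance, but projection characteristics have not been studied systematically. The paper aims to fill this gap and find which projections suit practical deployment.

Contribution: 1) Systematically studies the place recognition performance of LiDAR-to-image projections; 2) Introduces a modular retrieval pipeline that isolates the influence of the projection itself; 3) Validates the effectiveness of well-designed projections in practical use.

Method: Holds the backbone, aggregation method, and evaluation protocol fixed, systematically compares different 2D projections, and runs experiments across multiple datasets and scenarios.

Result: Experiments show that carefully designed projections can effectively replace end-to-end 3D learning, improving the discriminative power and robustness of place recognition.

Insight: The structural and geometric properties of the projection are critical to place recognition performance; choosing a suitable projection can significantly improve the system without adding computational complexity.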

Abstract: This work presents a systematic investigation into how alternative LiDAR-to-image projections affect metric place recognition when coupled with a state-of-the-art vision foundation model. We introduce a modular retrieval pipeline that controls for backbone, aggregation, and evaluation protocol, thereby isolating the influence of the 2-D projection itself. Using consistent geometric and structural channels across multiple datasets and deployment scenarios, we identify the projection characteristics that most strongly determine discriminative power, robustness to environmental variation, and suitability for real-time autonomy. Experiments with different datasets, including integration into an operational place recognition policy, validate the practical relevance of these findings and demonstrate that carefully designed projections can serve as an effective surrogate for end-to-end 3-D learning in LiDAR place recognition.
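
As a concrete example of what a LiDAR-to-image projection looks like, the sketch below builds a spherical range image, one widely used 2-D projection. The abstract does not say which projections the paper evaluates, so this choice, the tiny image size, the field-of-view values, and the function name are all assumptions for illustration.

```python
import math


def spherical_projection(points, width=8, height=4, fov_up=15.0, fov_down=-15.0):
    """Project (x, y, z) points to a range image; empty pixels stay 0.0."""
    image = [[0.0] * width for _ in range(height)]
    up, down = math.radians(fov_up), math.radians(fov_down)
    for x, y, z in points:
        rng = math.sqrt(x * x + y * y + z * z)
        if rng == 0.0:
            continue  # a point at the sensor origin has no direction
        yaw = math.atan2(y, x)
        pitch = math.asin(max(-1.0, min(1.0, z / rng)))
        if not down <= pitch <= up:
            continue  # outside the assumed vertical field of view
        # Azimuth maps to columns, elevation maps to rows (top row = fov_up).
        col = int(0.5 * (1.0 - yaw / math.pi) * width) % width
        row = min(int((up - pitch) / (up - down) * height), height - 1)
        # Keep the closest return when several points fall in one pixel.
        if image[row][col] == 0.0 or rng < image[row][col]:
            image[row][col] = rng
    return image


# Two points straight ahead of the sensor; the nearer one wins the pixel.
range_image = spherical_projection([(10.0, 0.0, 0.0), (5.0, 0.0, 0.0)])
```

The resulting 2-D array can be fed to an image backbone like any other single-channel image; other channels (for example intensity or height) could be stacked in the same way.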

[106] MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding

Fan Yang,Kaihao Zhang

Main category: cs.CV

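TL;DR: MRD is a training-free framework for high-resolution image understanding. Its multi-resolution retrieval-detection fusion addresses the semantic similarity bias that arises when a target object is split across image crops.

Details Motivation: Existing methods crop a high-resolution image and compute semantic similarity per crop, but cropping can split the target object and corrupt the similarity computation. The authors find that objects of different sizes are handled better at different resolutions, which motivates MRD.

Contribution: 1. Proposes a multi-resolution semantic fusion method that integrates semantic similarity maps obtained at different resolutions. 2. Introduces an open-vocabulary object detection (OVD) model for global localization of target objects.

Method: MRD combines multi-resolution semantic fusion with sliding-window open-vocabulary object detection (OVD) to improve high-resolution image understanding without training.

Result: MRD proves effective on high-resolution image understanding benchmarks with different MLLMs.

Insight: Combining multi-resolution processing with global detection avoids the object-splitting problem and improves the accuracy of semantic understanding.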

Abstract: Understanding high-resolution images remains a significant challenge for multimodal large language models (MLLMs). Recent studies address this issue by dividing the image into smaller crops and computing the semantic similarity between each crop and a query using a pretrained retrieval-augmented generation (RAG) model. The most relevant crops are then selected to localize the target object and suppress irrelevant information. However, such crop-based processing can fragment complete objects across multiple crops, thereby disrupting the computation of semantic similarity. In our experiments, we find that image crops of objects with different sizes are better handled at different resolutions. Based on this observation, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework for high-resolution image understanding. To address the issue of semantic similarity bias caused by objects being split across different image crops, we propose a multi-resolution semantic fusion method, which integrates semantic similarity maps obtained at different resolutions to produce more accurate semantic information and preserve the integrity of target objects. Furthermore, to achieve direct localization of target objects at a global scale, we introduce an open-vocabulary object detection (OVD) model that identifies object regions using a sliding-window approach. Experiments on high-resolution image understanding benchmarks using different MLLMs demonstrate the effectiveness of our approach.
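
One simple way to picture multi-resolution semantic fusion is to bring crop-level similarity maps computed at different grid sizes onto a common grid and average them. The abstract does not specify the fusion rule, so the nearest-neighbour upsampling, the plain average, and the function names below are assumptions; the sketch also assumes square maps whose sizes divide the largest one.

```python
def upsample_nearest(sim_map, factor):
    """Repeat every cell `factor` times along both axes."""
    out = []
    for row in sim_map:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def fuse_similarity_maps(maps):
    """Average square similarity maps after resizing them to the finest grid."""
    target = max(len(m) for m in maps)
    resized = [upsample_nearest(m, target // len(m)) for m in maps]
    return [[sum(m[r][c] for m in resized) / len(resized) for c in range(target)]
            for r in range(target)]


# Coarse 2x2 grid: the whole object sits inside the top-left crop.
coarse = [[0.8, 0.0],
          [0.0, 0.0]]
# Fine 4x4 grid: the same object is split, so some of its crops score low.
fine = [[0.9, 0.1, 0.0, 0.0],
        [0.1, 0.9, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0]]
fused = fuse_similarity_maps([coarse, fine])
```

In this toy case the fine crops that caught only a fragment of the object (score 0.1) are lifted to about 0.45 by the coarse map, while background cells stay at 0.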

[107] EGGS: Exchangeable 2D/3D Gaussian Splatting for Geometry-Appearance Balanced Novel View Synthesis

Yancheng Zhang,Guangyu Sun,Chen Chen

Main category: cs.CV

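TL;DR: EGGS is a hybrid representation that combines 2D and 3D Gaussians. Dynamic type exchange and a dedicated optimization strategy preserve multi-view consistency while improving texture detail, balancing rendering quality and geometric accuracy in novel view synthesis (NVS).

Details Motivation: 3D Gaussian Splatting (3DGS) renders appearance with high fidelity but suffers from multi-view inconsistency, while 2D Gaussian Splatting (2DGS) enforces multi-view consistency at the cost of texture detail. EGGS aims to overcome both limitations and find a balance.

Contribution: 1. Proposes EGGS, a hybrid 2D/3D Gaussian representation; 2. Designs three key techniques: Hybrid Gaussian Rasterization, Adaptive Type Exchange, and Frequency-Decoupled Optimization; 3. Provides a CUDA-accelerated implementation for efficient training and inference.

Method: 1. Hybrid Gaussian Rasterization renders 2D and 3D Gaussians in a unified way; 2. Adaptive Type Exchange dynamically adapts Gaussians between the 2D and 3D types; 3. Frequency-Decoupled Optimization optimizes low-frequency (geometry) and high-frequency (appearance) information separately.

Result: Experiments show that EGGS outperforms existing methods in rendering quality, geometric accuracy, and efficiency.

Insight: EGGS dynamically combines the strengths of 2D and 3D Gaussians to balance multi-view consistency and texture detail, providing a practical solution for NVS.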

Abstract: Novel view synthesis (NVS) is crucial in computer vision and graphics, with wide applications in AR, VR, and autonomous driving. While 3D Gaussian Splatting (3DGS) enables real-time rendering with high appearance fidelity, it suffers from multi-view inconsistencies, limiting geometric accuracy. In contrast, 2D Gaussian Splatting (2DGS) enforces multi-view consistency but compromises texture details. To address these limitations, we propose Exchangeable Gaussian Splatting (EGGS), a hybrid representation that integrates 2D and 3D Gaussians to balance appearance and geometry. To achieve this, we introduce Hybrid Gaussian Rasterization for unified rendering, Adaptive Type Exchange for dynamic adaptation between 2D and 3D Gaussians, and Frequency-Decoupled Optimization that effectively exploits the strengths of each type of Gaussian representation. Our CUDA-accelerated implementation ensures efficient training and inference. Extensive experiments demonstrate that EGGS outperforms existing methods in rendering quality, geometric accuracy, and efficiency, providing a practical solution for high-quality NVS.

[108] LoVoRA: Text-guided and Mask-free Video Object Removal and Addition with Learnable Object-aware Localization

Zhihan Xiao,Lin Liu,Yixin Gao,Xiaopeng Zhang,Haoxuan Che,Songping Mai,Qi Tian

Main category: cs.CV

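TL;DR: LoVoRA is a framework for mask-free video object removal and addition that uses a learnable object-aware localization mechanism to produce spatio-temporally consistent edits.

Details Motivation: Existing methods usually rely on auxiliary masks or reference images to guide editing, which limits their scalability and generalization. LoVoRA aims to remove this dependence and enable high-quality video editing without masks.

Contribution: 1. Proposes a mask-free framework for video object removal and addition; 2. Designs a dataset construction pipeline that integrates image-to-video translation, optical flow-based mask propagation, and video inpainting; 3. Introduces a learnable object-aware localization mechanism that provides dense spatio-temporal supervision.

Method: 1. Uses a Diffusion Mask Predictor for end-to-end video editing; 2. Ensures temporal consistency through the dataset construction pipeline; 3. Uses the object-aware localization mechanism to avoid external control signals at inference time.

Result: Experiments and human evaluation show that LoVoRA produces effective, high-quality edits.

Insight: LoVoRA's novelty lies in its mask-free design and object-aware localization mechanism, which give video editing better generalization and scalability.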

Abstract: Text-guided video editing, particularly for object removal and addition, remains a challenging task due to the need for precise spatial and temporal consistency. Existing methods often rely on auxiliary masks or reference images for editing guidance, which limits their scalability and generalization. To address these issues, we propose LoVoRA, a novel framework for mask-free video object removal and addition using a learnable object-aware localization mechanism. Our approach utilizes a unique dataset construction pipeline that integrates image-to-video translation, optical flow-based mask propagation, and video inpainting, enabling temporally consistent edits. The core innovation of LoVoRA is its learnable object-aware localization mechanism, which provides dense spatio-temporal supervision for both object insertion and removal tasks. By leveraging a Diffusion Mask Predictor, LoVoRA achieves end-to-end video editing without requiring external control signals during inference. Extensive experiments and human evaluation demonstrate the effectiveness and high-quality performance of LoVoRA.

[109] Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench

Lanxiang Hu,Abhilash Shankarampeta,Yixin Huang,Zilin Dai,Haoyang Yu,Yujie Zhao,Haoqiang Kang,Daniel Zhao,Tajana Rosing,Hao Zhang

Main category: cs.CV

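TL;DR: VideoScience-Bench is a new benchmark that evaluates the scientific understanding and zero-shot reasoning of video generation models, filling a gap in existing benchmarks.

Details Motivation: Existing video generation benchmarks are mostly based on physical commonsense and offer limited insight into scientific reasoning. VideoScience-Bench is proposed to fill this gap and push models toward scientific understanding.

Contribution: 1) Designs VideoScience-Bench, covering 14 topics and 103 concepts in physics and chemistry; 2) Uses expert annotation and multi-dimensional evaluation to treat video models as reasoners rather than only generators, for the first time; 3) Shows that VLM-as-a-Judge evaluation correlates strongly with human assessment.

Method: 1) Designs 200 prompts based on composite scientific scenarios; 2) Evaluates seven state-of-the-art video models with expert annotation; 3) Scores along five dimensions (Prompt Consistency, Phenomenon Congruency, Correct Dynamism, Immutability, and Spatio-Temporal Continuity); 4) Uses a VLM-as-a-Judge and checks its agreement with human assessment.

Result: VLM-as-a-Judge evaluations correlate strongly with human assessments, supporting the method's validity.

Insight: Video generation models need to go beyond generation quality toward scientific reasoning and understanding; VLM-as-a-Judge offers a route to automated evaluation.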

Abstract: The next frontier for video generation lies in developing models capable of zero-shot reasoning, where understanding real-world scientific laws is crucial for accurate physical outcome modeling under diverse conditions. However, existing video benchmarks are physical commonsense-based, offering limited insight into video models’ scientific reasoning capability. We introduce VideoScience-Bench, a benchmark designed to evaluate undergraduate-level scientific understanding in video models. Each prompt encodes a composite scientific scenario that requires understanding and reasoning across multiple scientific concepts to generate the correct phenomenon. The benchmark comprises 200 carefully curated prompts spanning 14 topics and 103 concepts in physics and chemistry. We conduct expert-annotated evaluations across seven state-of-the-art video models in T2V and I2V settings along five dimensions: Prompt Consistency, Phenomenon Congruency, Correct Dynamism, Immutability, and Spatio-Temporal Continuity. Using a VLM-as-a-Judge to assess video generations, we observe strong correlation with human assessments. To the best of our knowledge, VideoScience-Bench is the first benchmark to evaluate video models not only as generators but also as reasoners, requiring their generations to demonstrate scientific understanding consistent with expected physical and chemical phenomena. Our data and evaluation code are available at: https://github.com/hao-ai-lab/VideoScience.

[110] A Lightweight Real-Time Low-Light Enhancement Network for Embedded Automotive Vision Systems

Yuhan Chen,Yicui Shi,Guofa Li,Guangrui Bai,Jinyuan Shao,Xiangfei Huang,Wenbo Chu,Keqiang Li

Main category: cs.CV

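TL;DR: UltraFast-LieNET is a lightweight multi-scale shifted convolutional network for real-time low-light image enhancement on embedded automotive vision systems. It uses Dynamic Shifted Convolution (DSConv) and Multi-scale Shifted Residual Blocks (MSRB) to significantly expand the receptive field, and a residual structure with a multi-level gradient-aware loss to improve stability.

Details Motivation: Image degradation in low-light conditions such as nighttime driving seriously threatens in-vehicle camera safety, and existing algorithms are too computationally heavy for real-time automotive use.

Contribution: Proposes UltraFast-LieNET, an ultra-lightweight real-time low-light enhancement network built on DSConv and MSRB, which greatly reduces parameter count and computation.

Method: 1. Uses DSConv (only 12 learnable parameters) for efficient feature extraction; 2. Expands the receptive field with MSRB; 3. Adds a residual structure and a multi-level gradient-aware loss function for stability.

Result: On the LOLI-Street dataset, it reaches a PSNR of 26.51 dB, 4.6 dB above existing methods, using only 180 parameters; experiments on four benchmark datasets confirm its strong performance under limited resources.

Insight: Lightweight networks can balance quality and efficiency through dynamic convolution and multi-scale structure, making them suitable for embedded real-time use.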

Abstract: In low-light environments like nighttime driving, image degradation severely challenges in-vehicle camera safety. Since existing enhancement algorithms are often too computationally intensive for vehicular applications, we propose UltraFast-LieNET, a lightweight multi-scale shifted convolutional network for real-time low-light image enhancement. We introduce a Dynamic Shifted Convolution (DSConv) kernel with only 12 learnable parameters for efficient feature extraction. By integrating DSConv with varying shift distances, a Multi-scale Shifted Residual Block (MSRB) is constructed to significantly expand the receptive field. To mitigate lightweight network instability, a residual structure and a novel multi-level gradient-aware loss function are incorporated. UltraFast-LieNET allows flexible parameter configuration, with a minimum size of only 36 parameters. Results on the LOLI-Street dataset show a PSNR of 26.51 dB, outperforming state-of-the-art methods by 4.6 dB while utilizing only 180 parameters. Experiments across four benchmark datasets validate its superior balance of real-time performance and enhancement quality under limited resources. Code is available at https://github.com/YuhanChen2024/UltraFast-LiNET
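
Since the headline result is reported in PSNR, a short sketch of how that metric is computed may help when reading the numbers. This is the standard definition, 10·log10(MAX² / MSE); the function name, the nested-list image format, and the toy pixel values are assumptions for illustration and are unrelated to the paper's evaluation code.

```python
import math


def psnr(reference, enhanced, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    squared_errors = [(a - b) ** 2
                      for ref_row, enh_row in zip(reference, enhanced)
                      for a, b in zip(ref_row, enh_row)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 2x2 grayscale images: every pixel is off by 10 intensity levels.
reference = [[100, 100], [100, 100]]
enhanced = [[110, 110], [110, 110]]
value = psnr(reference, enhanced)  # MSE is 100, giving roughly 28.13 dB
```

Because PSNR is logarithmic in the mean squared error, a gain of 4.6 dB at the same peak value corresponds to an MSE lower by a factor of about 10^0.46 ≈ 2.9.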

[111] BEVDilation: LiDAR-Centric Multi-Modal Fusion for 3D Object Detection

Guowen Zhang,Chenhang He,Liyi Chen,Lei Zhang

Main category: cs.CV

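TL;DR: BEVDilation is a LiDAR-centric framework for LiDAR-camera multi-modal fusion. It improves 3D object detection by using image BEV features as implicit guidance and by dilating sparse voxels.

Details Motivation: LiDAR and cameras differ fundamentally in geometric accuracy, so fusing them indiscriminately can degrade performance; a more effective fusion strategy is needed.

Contribution: 1. A LiDAR-centric fusion framework that treats image BEV features as implicit guidance; 2. A Sparse Voxel Dilation Block and a Semantic-Guided BEV Dilation Block, which address point cloud sparsity and semantic limitations, respectively.

Method: 1. Use image BEV features as implicit guidance for the LiDAR features instead of concatenating the two; 2. Design separate dilation modules for the sparsity and the semantic limitations of point clouds.

Result: On the nuScenes benchmark, BEVDilation outperforms state-of-the-art methods and is more robust to depth noise.

Insight: A LiDAR-centric strategy combines the high geometric accuracy of LiDAR with the semantic information of images more effectively than naive fusion.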

Abstract: Integrating LiDAR and camera information in the bird’s eye view (BEV) representation has demonstrated its effectiveness in 3D object detection. However, because of the fundamental disparity in geometric accuracy between these sensors, indiscriminate fusion in previous methods often leads to degraded performance. In this paper, we propose BEVDilation, a novel LiDAR-centric framework that prioritizes LiDAR information in the fusion. By formulating image BEV features as implicit guidance rather than naive concatenation, our strategy effectively alleviates the spatial misalignment caused by image depth estimation errors. Furthermore, the image guidance can effectively help the LiDAR-centric paradigm to address the sparsity and semantic limitations of point clouds. Specifically, we propose a Sparse Voxel Dilation Block that mitigates the inherent point sparsity by densifying foreground voxels through image priors. Moreover, we introduce a Semantic-Guided BEV Dilation Block to enhance the LiDAR feature diffusion processing with image semantic guidance and long-range context capture. On the challenging nuScenes benchmark, BEVDilation achieves better performance than state-of-the-art methods while maintaining competitive computational efficiency. Importantly, our LiDAR-centric strategy demonstrates greater robustness to depth noise compared to naive fusion. The source code is available at https://github.com/gwenzhang/BEVDilation.

[112] InEx: Hallucination Mitigation via Introspection and Cross-Modal Multi-Agent Collaboration

Zhongyu Yang,Yingfang Yuan,Xuanming Jiang,Baoyi An,Wei Pang

Main category: cs.CV

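TL;DR: The paper proposes InEx, a training-free framework that mitigates hallucination in multimodal large language models (MLLMs) through introspection and cross-modal multi-agent collaboration.

Details Motivation: Hallucination is a major obstacle to reliable LLMs, and existing solutions either rely on human intervention or underuse an agent's ability to mitigate hallucination autonomously. Drawing on how humans make decisions, the paper combines introspection with external verification to reduce uncertainty.

Contribution: The InEx framework combines internal introspective reasoning with external multi-agent collaboration to mitigate hallucination autonomously, improving reliability and benchmark performance.

Method: Entropy-based uncertainty estimation guides the decision agent's internal introspective reasoning. After the agent generates an initial response, an editing agent and self-reflection agents verify and refine it over multiple rounds.

Result: InEx outperforms existing methods on general and hallucination benchmarks, with gains of 4%-27%, and shows strong robustness.

Insight: The introspection and external verification found in human decision-making transfer effectively to AI systems and offer a new approach to mitigating hallucination.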

Abstract: Hallucination remains a critical challenge in large language models (LLMs), hindering the development of reliable multimodal LLMs (MLLMs). Existing solutions often rely on human intervention or underutilize the agent’s ability to autonomously mitigate hallucination. To address these limitations, we draw inspiration from how humans make reliable decisions in the real world. They begin with introspective reasoning to reduce uncertainty and form an initial judgment, then rely on external verification from diverse perspectives to reach a final decision. Motivated by this cognitive paradigm, we propose InEx, a training-free, multi-agent framework designed to autonomously mitigate hallucination. InEx introduces internal introspective reasoning, guided by entropy-based uncertainty estimation, to improve the reliability of the decision agent’s reasoning process. The agent first generates a response, which is then iteratively verified and refined through external cross-modal multi-agent collaboration with the editing agent and self-reflection agents, further enhancing reliability and mitigating hallucination. Extensive experiments show that InEx consistently outperforms existing methods, achieving 4%-27% gains on general and hallucination benchmarks, and demonstrating strong robustness.
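The abstract does not specify how the entropy-based uncertainty estimate is computed or used. As a minimal sketch of the general idea, assuming the decision agent exposes a probability distribution over candidate answers, Shannon entropy can serve as the uncertainty score; the `needs_introspection` gate and its threshold are hypothetical and not taken from the paper.

```python
import math


def shannon_entropy(probs):
    """Entropy in nats of a discrete distribution (zero-probability terms contribute 0)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def needs_introspection(probs, threshold=0.5):
    """Hypothetical gate: flag a prediction as uncertain when its entropy exceeds a threshold."""
    return shannon_entropy(probs) > threshold
```

A near-uniform distribution such as `[0.5, 0.5]` has high entropy and would trigger the gate, while a confident one such as `[0.99, 0.01]` would not.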

[113] GraphFusion3D: Dynamic Graph Attention Convolution with Adaptive Cross-Modal Transformer for 3D Object Detection

Md Sohag Mia,Md Nahid Hasan,Tawhid Ahmed,Muhammad Abdullah Adnan

Main category: cs.CV

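TL;DR: GraphFusion3D is a unified 3D object detection framework built on dynamic graph attention and an adaptive cross-modal Transformer. Multi-modal fusion and a graph reasoning module improve the extraction of geometric and semantic information from point clouds.

Details Motivation: Point clouds are sparse, structurally incomplete, and semantically limited, and contextual relationships between distant objects are hard to capture. A solution is needed that fuses multi-modal information and models spatial-semantic relationships dynamically.

Contribution: Proposes the GraphFusion3D framework, comprising an Adaptive Cross-Modal Transformer (ACMT) that dynamically fuses image and point cloud features, and a Graph Reasoning Module (GRM) that models local geometric and global semantic relationships through multi-scale graph attention.

Method: 1) ACMT adaptively fuses multi-modal features; 2) GRM uses dynamic graph attention to weight spatial proximity and feature similarity; 3) a cascade decoder progressively refines the detections.

Result: The method reaches 70.6% AP$_{25}$ / 51.2% AP$_{50}$ on SUN RGB-D and 75.1% AP$_{25}$ / 60.8% AP$_{50}$ on ScanNetV2, substantially outperforming existing methods.

Insight: Dynamic graph attention and cross-modal feature fusion compensate for the weaknesses of point cloud data, and multi-scale modeling of local and global information is the key source of improvement.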

Abstract: Despite significant progress in 3D object detection, point clouds remain challenging due to sparse data, incomplete structures, and limited semantic information. Capturing contextual relationships between distant objects presents additional difficulties. To address these challenges, we propose GraphFusion3D, a unified framework combining multi-modal fusion with advanced feature learning. Our approach introduces the Adaptive Cross-Modal Transformer (ACMT), which adaptively integrates image features into point representations to enrich both geometric and semantic information. For proposal refinement, we introduce the Graph Reasoning Module (GRM), a novel mechanism that models neighborhood relationships to simultaneously capture local geometric structures and global semantic context. The module employs multi-scale graph attention to dynamically weight both spatial proximity and feature similarity between proposals. We further employ a cascade decoder that progressively refines detections through multi-stage predictions. Extensive experiments on SUN RGB-D (70.6% AP$_{25}$ and 51.2% AP$_{50}$) and ScanNetV2 (75.1% AP$_{25}$ and 60.8% AP$_{50}$) demonstrate a substantial performance improvement over existing approaches.

[114] DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling

Kairun Wen,Yuzhi Huang,Runyu Chen,Hui Zheng,Yunlong Lin,Panwang Pan,Chenxin Li,Wenyan Cong,Jian Zhang,Junbin Lu,Chenguo Lin,Dilin Wang,Zhicheng Yan,Hongyu Xu,Justin Theiss,Yue Huang,Xinghao Ding,Rakesh Ranjan,Zhiwen Fan

Main category: cs.CV

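TL;DR: DynamicVerse is a multimodal 4D world modeling framework for dynamic real-world video. By integrating large vision, geometric, and multimodal models, it recovers static geometry, dynamic motion, instance masks, and descriptive captions.

Details Motivation: Existing datasets are mostly derived from limited simulators or traditional methods, which restricts how accurately foundation models can interpret physical dynamics from monocular video. DynamicVerse aims to close this gap.

Contribution: Proposes a physical-scale, multimodal 4D world modeling framework and releases a large-scale dataset (100K+ videos, 800K+ annotated masks, 10M+ frames).

Method: Combines window-based Bundle Adjustment with global optimization to convert long video sequences into a comprehensive 4D multimodal format.

Result: The method outperforms existing approaches on video depth estimation, camera pose estimation, and camera intrinsics estimation.

Insight: Multimodal fusion combined with global optimization enables accurate physical-scale modeling of real-world dynamics.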

Abstract: Understanding the dynamic physical world, characterized by its evolving 3D structure, real-world motion, and semantic content with textual descriptions, is crucial for human-agent interaction and enables embodied agents to perceive and act within real environments with human-like capabilities. However, existing datasets are often derived from limited simulators or utilize traditional Structure-from-Motion for up-to-scale annotation and offer limited descriptive captioning, which restricts the capacity of foundation models to accurately interpret real-world dynamics from monocular videos, commonly sourced from the internet. To bridge these gaps, we introduce DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video. We employ large vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic descriptive captions. By integrating window-based Bundle Adjustment with global optimization, our method converts long real-world video sequences into a comprehensive 4D multimodal format. DynamicVerse delivers a large-scale dataset consisting of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos. Experimental evaluations on three benchmark tasks, namely video depth estimation, camera pose estimation, and camera intrinsics estimation, demonstrate that our 4D modeling achieves superior performance in capturing physical-scale measurements with greater global accuracy than existing methods.

[115] SurfFill: Completion of LiDAR Point Clouds via Gaussian Surfel Splatting

Svenja Strobel,Matthias Innmann,Bernhard Egger,Marc Stamminger,Linus Franke

Main category: cs.CV

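TL;DR: SurfFill completes LiDAR point clouds with Gaussian surfels. It analyzes the gaps caused by beam divergence, introduces a density-change heuristic for detecting ambiguous regions together with a point-growing method, and adds a divide-and-conquer scheme for large scenes, outperforming existing methods.

Details Motivation: LiDAR is highly accurate in flat regions but tends to miss small geometric structures and dark, absorbent materials; camera-based photogrammetry recovers those details but is less accurate. SurfFill combines the strengths of both by completing LiDAR point clouds with Gaussian surfels.

Contribution: 1) An ambiguity heuristic based on an analysis of LiDAR beam divergence; 2) a point-growing technique that constrains Gaussian surfel reconstruction; 3) an extended divide-and-conquer scheme for completing large-scale point clouds.

Method: 1) Identify ambiguous regions from changes in point cloud density; 2) restrict Gaussian surfel optimization and densification to those regions; 3) extract the Gaussian primitives and sample completion points from them; 4) process large scenes with divide and conquer.

Result: On LiDAR point cloud completion for synthetic and real-world scenes, SurfFill outperforms previous reconstruction methods.

Insight: Beam divergence is a main reason LiDAR misses thin structures and edges; density changes are a reliable indicator of ambiguous regions; Gaussian surfels are well suited to accurate local completion.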

Abstract: LiDAR-captured point clouds are often considered the gold standard in active 3D reconstruction. While their accuracy is exceptional in flat regions, the capture process is prone to missing small geometric structures and may fail with dark, absorbent materials. Alternatively, capturing multiple photos of the scene and applying 3D photogrammetry can infer these details as they often represent feature-rich regions. However, the accuracy of LiDAR for featureless regions is rarely reached. Therefore, we suggest combining the strengths of LiDAR and camera-based capture by introducing SurfFill: a Gaussian surfel-based LiDAR completion scheme. We analyze LiDAR captures and identify LiDAR beam divergence as a main cause of artifacts, manifesting mostly at thin structures and edges. We use this insight to introduce an ambiguity heuristic for completed scans by evaluating the change in density in the point cloud. This allows us to identify points close to missed areas, from which we can then grow additional points to complete the scan. For this point growing, we constrain Gaussian surfel reconstruction [Huang et al. 2024] to focus optimization and densification on these ambiguous areas. Finally, Gaussian primitives of the reconstruction in ambiguous areas are extracted and sampled for points to complete the point cloud. To address the challenges of large-scale reconstruction, we extend this pipeline with a divide-and-conquer scheme for building-sized point cloud completion. We evaluate on the task of LiDAR point cloud completion of synthetic and real-world scenes and find that our method outperforms previous reconstruction methods.

[116] In-Context Sync-LoRA for Portrait Video Editing

Sagi Polaczek,Or Patashnik,Ali Mahdavi-Amiri,Daniel Cohen-Or

Main category: cs.CV

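TL;DR: Sync-LoRA is a portrait video editing method. The edit is defined by modifying the first frame and is then propagated to the whole sequence, giving high-quality visual changes while maintaining frame-accurate synchronization and identity consistency.

Details Motivation: Portrait video editing requires flexible yet precise control. It must support a wide range of modifications (appearance changes, expression edits, added objects) while preserving the subject's original temporal behavior, so that every edited frame stays synchronized with its source frame.

Contribution: Proposes Sync-LoRA, a method built on an image-to-video diffusion model that trains an in-context LoRA to achieve high-quality video editing with temporal and identity consistency.

Method: An image-to-video diffusion model propagates an edited first frame to the entire sequence. An in-context LoRA is trained on paired videos that share the same motion trajectory but differ in appearance, and a synchronization-based filter selects only well-aligned pairs for training.

Result: Sync-LoRA generalizes to unseen identities and diverse edits (modifying appearance, adding objects, changing backgrounds) and handles variations in pose and expression robustly.

Insight: Training on aligned video pairs teaches the model to combine motion cues from the source video with the visual changes in the edited frame, yielding high visual fidelity and strong temporal coherence.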

Abstract: Editing portrait videos is a challenging task that requires flexible yet precise control over a wide range of modifications, such as appearance changes, expression edits, or the addition of objects. The key difficulty lies in preserving the subject’s original temporal behavior, demanding that every edited frame remains precisely synchronized with the corresponding source frame. We present Sync-LoRA, a method for editing portrait videos that achieves high-quality visual modifications while maintaining frame-accurate synchronization and identity consistency. Our approach uses an image-to-video diffusion model, where the edit is defined by modifying the first frame and then propagated to the entire sequence. To enable accurate synchronization, we train an in-context LoRA using paired videos that depict identical motion trajectories but differ in appearance. These pairs are automatically generated and curated through a synchronization-based filtering process that selects only the most temporally aligned examples for training. This training setup teaches the model to combine motion cues from the source video with the visual changes introduced in the edited first frame. Trained on a compact, highly curated set of synchronized human portraits, Sync-LoRA generalizes to unseen identities and diverse edits (e.g., modifying appearance, adding objects, or changing backgrounds), robustly handling variations in pose and expression. Our results demonstrate high visual fidelity and strong temporal coherence, achieving a robust balance between edit fidelity and precise motion preservation.

[117] Instant Video Models: Universal Adapters for Stabilizing Image-Based Networks

Matthew Dutson,Nathan Labiosa,Yin Li,Mohit Gupta

Main category: cs.CV

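TL;DR: The paper proposes universal stability adapters that can be inserted into any frame-based network to improve the temporal consistency and corruption robustness of video inference.

Details Motivation: Frame-based networks applied sequentially to video often produce temporally inconsistent outputs (such as flicker between frames), and the problem worsens when the input contains time-varying corruptions.

Contribution: Proposes stability adapters and a training framework that improve the stability and robustness of video inference without modifying the base network.

Method: Designs a general stability adapter structure and an efficient training procedure based on an accuracy-stability-robustness loss.

Result: The method improves temporal stability and corruption robustness on several tasks, including denoising, image enhancement, depth estimation, and semantic segmentation.

Insight: A theoretical analysis of the unified accuracy-stability-robustness loss identifies the conditions under which stabilizer training is well behaved.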

Abstract: When applied sequentially to video, frame-based networks often exhibit temporal inconsistency - for example, outputs that flicker between frames. This problem is amplified when the network inputs contain time-varying corruptions. In this work, we introduce a general approach for adapting frame-based models for stable and robust inference on video. We describe a class of stability adapters that can be inserted into virtually any architecture and a resource-efficient training process that can be performed with a frozen base network. We introduce a unified conceptual framework for describing temporal stability and corruption robustness, centered on a proposed accuracy-stability-robustness loss. By analyzing the theoretical properties of this loss, we identify the conditions where it produces well-behaved stabilizer training. Our experiments validate our approach on several vision tasks including denoising (NAFNet), image enhancement (HDRNet), monocular depth (Depth Anything v2), and semantic segmentation (DeepLabv3+). Our method improves temporal stability and robustness against a range of image corruptions (including compression artifacts, noise, and adverse weather), while preserving or improving the quality of predictions.

[118] Unrolled Networks are Conditional Probability Flows in MRI Reconstruction

Kehan Qi,Saumya Gupta,Qingqiao Hu,Weimin Lyu,Chao Chen

Main category: cs.CV

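TL;DR: The paper proves that unrolled networks are discrete implementations of conditional probability flow ODEs and proposes FLAT, which aligns intermediate reconstructions with the ODE discretization to improve the stability and efficiency of MRI reconstruction.

Details Motivation: In MRI reconstruction, unrolled networks are efficient but unstable, while diffusion models are stable but computationally expensive. The authors aim to combine the strengths of both through a theoretical connection.

Contribution: 1. A proof that unrolled networks are discrete implementations of conditional probability flow ODEs; 2. FLAT, which aligns intermediate states with the ODE discretization to improve stability and convergence.

Method: 1. Model the unrolled network as a discretized conditional probability flow ODE; 2. In FLAT, derive the unrolled parameters from the ODE discretization and align the intermediate reconstructions with the ideal ODE trajectory.

Result: On three MRI datasets, FLAT achieves high-quality reconstructions with up to 3x fewer iterations than diffusion-based models and is significantly more stable than unrolled networks.

Insight: ODE theory can regularize how the intermediate states of an unrolled network evolve; combining deep learning with mathematical theory can improve medical imaging tasks.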

Abstract: Magnetic Resonance Imaging (MRI) offers excellent soft-tissue contrast without ionizing radiation, but its long acquisition time limits clinical utility. Recent methods accelerate MRI by under-sampling $k$-space and reconstructing the resulting images using deep learning. Unrolled networks have been widely used for the reconstruction task due to their efficiency, but suffer from unstable evolution of intermediate states caused by freely learnable parameters in intermediate steps. In contrast, diffusion models based on stochastic differential equations offer theoretical stability in both medical and natural image tasks but are computationally expensive. In this work, we introduce flow ODEs to MRI reconstruction by theoretically proving that unrolled networks are discrete implementations of conditional probability flow ODEs. This connection provides explicit formulations for parameters and clarifies how intermediate states should evolve. Building on this insight, we propose Flow-Aligned Training (FLAT), which derives unrolled parameters from the ODE discretization and aligns intermediate reconstructions with the ideal ODE trajectory to improve stability and convergence. Experiments on three MRI datasets show that FLAT achieves high-quality reconstructions with up to $3\times$ fewer iterations than diffusion-based generative models and significantly greater stability than unrolled networks.
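To make the claimed correspondence concrete, a generic forward-Euler discretization of a conditional flow ODE is written out here. This is an illustrative sketch only: the notation ($x_t$ for the reconstruction state, $y$ for the under-sampled $k$-space measurement, $v_\theta$ for the learned velocity field, $K$ steps of size $\Delta t_k$) is assumed, and the abstract does not give the paper's exact parameterization.

$$
\frac{\mathrm{d}x_t}{\mathrm{d}t} = v_\theta(x_t, t \mid y),
\qquad
x_{k+1} = x_k + \Delta t_k \, v_\theta(x_k, t_k \mid y), \quad k = 0, \dots, K-1 .
$$

Read this way, each of the $K$ update steps plays the role of one unrolled iteration, and the step sizes are fixed by the discretization rather than learned freely.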

[119] MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation

Youxin Pang,Jiajun Liu,Lingfeng Tan,Yong Zhang,Feng Gao,Xiang Deng,Zhuoliang Kang,Xiaoming Wei,Yebin Liu

Main category: cs.CV

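TL;DR: MAViD is a multimodal framework for audio-visual dialogue understanding and generation. A Conductor-Creator architecture provides fine-grained control, and a combination of autoregressive and diffusion models generates high-quality long-duration content.

Details Motivation: Existing approaches are mostly non-interactive, produce constrained and unnatural speech, and struggle to fuse audio and video effectively.

Contribution: Proposes the Conductor-Creator architecture, which integrates understanding and generation; designs a new fusion module that strengthens connections across consecutive clips and modalities; combines AR and diffusion models for high-quality long video generation.

Method: The Conductor understands the input and decomposes instructions; the Creator generates interactive responses from those instructions. An AR model generates audio, a diffusion model generates video, and a fusion module improves multimodal synchronization.

Result: MAViD generates vivid, contextually coherent long-duration dialogue and accurately interprets users' multimodal queries.

Insight: Multimodal fusion requires careful architectural design and model composition, and contextual coherence is the key to long-duration generation.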

Abstract: We propose MAViD, a novel Multimodal framework for Audio-Visual Dialogue understanding and generation. Existing approaches primarily focus on non-interactive systems and are limited to producing constrained and unnatural human speech. The primary challenge of this task lies in effectively integrating understanding and generation capabilities, as well as achieving seamless multimodal audio-video fusion. To solve these problems, we propose a Conductor-Creator architecture that divides the dialogue system into two primary components. The Conductor is tasked with understanding, reasoning, and generating instructions by breaking them down into motion and speech components, thereby enabling fine-grained control over interactions. The Creator then delivers interactive responses based on these instructions. Furthermore, to address the difficulty of generating long videos with consistent identity, timbre, and tone using dual DiT structures, the Creator adopts a structure that combines autoregressive (AR) and diffusion models. The AR model is responsible for audio generation, while the diffusion model ensures high-quality video generation. Additionally, we propose a novel fusion module to enhance connections between contextually consecutive clips and modalities, enabling synchronized long-duration audio-visual content generation. Extensive experiments demonstrate that our framework can generate vivid and contextually coherent long-duration dialogue interactions and accurately interpret users’ multimodal queries.

[120] ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation

Mengchen Zhang,Qi Chen,Tong Wu,Zihan Liu,Dahua Lin

Main category: cs.CV

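TL;DR: ViSAudio is an end-to-end method for video-driven binaural spatial audio generation. Using a dual-branch audio generation architecture and conditional flow matching, it generates high-quality binaural audio directly from silent video, avoiding the error accumulation and spatio-temporal inconsistencies of two-stage methods.

Details Motivation: Video-to-audio research focuses on mono output, which lacks spatial immersion. Binaural methods usually follow a two-stage pipeline (generate mono audio, then spatialize it), which leads to error accumulation and spatio-temporal inconsistencies.

Contribution: 1. Introduces the task of end-to-end binaural spatial audio generation; 2. Builds the BiAudio dataset with about 97K video-binaural audio pairs; 3. Proposes the ViSAudio framework, which combines conditional flow matching with a dual-branch generation architecture to keep audio and video aligned in space and time.

Method: Uses conditional flow matching with a dual-branch audio generation architecture (two dedicated branches model the audio latent flows), plus a conditional spacetime module that balances inter-channel consistency against distinct spatial characteristics.

Result: ViSAudio outperforms existing methods on both objective metrics and subjective evaluations, generating high-quality binaural audio that adapts to viewpoint changes, sound-source motion, and diverse acoustic environments.

Insight: An end-to-end binaural generation framework avoids the error accumulation of two-stage methods, and the dual-branch design with a conditioning module achieves precise spatio-temporal alignment.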

Abstract: Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain constrained by a two-stage pipeline that first generates mono audio and then performs spatialization, often resulting in error accumulation and spatio-temporal inconsistencies. To address this limitation, we introduce the task of end-to-end binaural spatial audio generation directly from silent video. To support this task, we present the BiAudio dataset, comprising approximately 97K video-binaural audio pairs spanning diverse real-world scenes and camera rotation trajectories, constructed through a semi-automated pipeline. Furthermore, we propose ViSAudio, an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture, where two dedicated branches model the audio latent flows. Integrated with a conditional spacetime module, it balances consistency between channels while preserving distinctive spatial characteristics, ensuring precise spatio-temporal alignment between audio and the input video. Comprehensive experiments demonstrate that ViSAudio outperforms existing state-of-the-art methods across both objective metrics and subjective evaluations, generating high-quality binaural audio with spatial immersion that adapts effectively to viewpoint changes, sound-source motion, and diverse acoustic environments. Project website: https://kszpxxzmc.github.io/ViSAudio-project.

[121] Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation

Zeqi Xiao,Yiwei Zhao,Lingxiao Li,Yushi Lan,Yu Ning,Rahul Garg,Roshni Cooper,Mohammad H. Taghavi,Xingang Pan

Main category: cs.CV

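TL;DR: Video4Spatial is a video generation framework showing that models conditioned only on visual context from video can perform complex spatial tasks such as scene navigation and object grounding.

Details Motivation: The work asks whether video generative models can exhibit human-like spatial cognition from visual data alone, as a step toward visuospatial intelligence.

Contribution: Proposes the Video4Spatial framework and shows that video diffusion models conditioned solely on video context can perform complex spatial tasks (scene navigation and object grounding) without auxiliary modalities such as depth or poses.

Method: Simple framework design choices and data curation let the model use video context to plan navigation and ground target objects end to end, while following camera-pose instructions and maintaining spatial consistency.

Result: Video4Spatial shows strong spatial understanding and generalizes to long contexts and out-of-domain environments, demonstrating its potential for visuospatial reasoning.

Insight: Video generative models can learn complex spatial tasks from visual context, which points to a new direction for more general visuospatial intelligence.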

Abstract: We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate on two tasks: scene navigation - following camera-pose instructions while remaining consistent with 3D geometry of the scene, and object grounding - which requires semantic localization, instruction following, and planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.

[122] MultiShotMaster: A Controllable Multi-Shot Video Generation Framework

Qinghe Wang,Xiaoyu Shi,Baolu Li,Weikang Bian,Quande Liu,Huchuan Lu,Xintao Wang,Pengfei Wan,Kun Gai,Xu Jia

Main category: cs.CV

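TL;DR: MultiShotMaster is a controllable multi-shot video generation framework. It addresses the difficulty existing techniques have with narrative multi-shot video by using modified RoPE variants and an automated data annotation pipeline.

Details Motivation: Existing video generation techniques excel at single-shot clips but struggle with narrative multi-shot video, which requires flexible shot arrangement, a coherent narrative, and control beyond text prompts.

Contribution: 1. Proposes two RoPE variants, Multi-Shot Narrative RoPE and Spatiotemporal Position-Aware RoPE, which support flexible shot arrangement and spatiotemporally grounded reference injection. 2. Designs an automated data annotation pipeline to address the scarcity of multi-shot video data. 3. Achieves highly controllable multi-shot generation, including text-driven inter-shot consistency, customized subjects, and background-driven scenes.

Method: 1. Extends a pretrained single-shot model with the two RoPE variants: Multi-Shot Narrative RoPE for flexible shot arrangement and Spatiotemporal Position-Aware RoPE for spatiotemporal reference injection. 2. Uses the automated annotation pipeline to extract multi-shot videos and their annotations.

Result: MultiShotMaster outperforms existing methods in quality and controllability and supports flexible configuration of shot count and duration.

Insight: Improving the model architecture and the annotation pipeline together enables high-quality multi-shot generation despite scarce data.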

Abstract: Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporal-grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline to extract multi-shot videos, captions, cross-shot grounding signals and reference images. Our framework leverages the intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subject with motion control, and background-driven customized scene. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.
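The abstract says only that Multi-Shot Narrative RoPE "applies explicit phase shift at shot transitions" while preserving temporal order. One plausible reading, sketched below under that assumption, is to add a per-shot offset to each frame's temporal position before the usual rotary rotation. The function names and the additive offset scheme are hypothetical and are not taken from the paper.

```python
import math


def shot_aware_positions(shot_lengths, phase_shift):
    """Temporal position per frame: running frame index plus a per-shot offset."""
    positions = []
    frame = 0
    for shot_index, length in enumerate(shot_lengths):
        for _ in range(length):
            # Each shot boundary adds one more phase_shift, so shots stay ordered
            # but are separated by a gap larger than one frame step.
            positions.append(frame + shot_index * phase_shift)
            frame += 1
    return positions


def rope_rotate(pair, position, frequency):
    """Standard RoPE rotation of one 2-D feature pair by angle position * frequency."""
    x, y = pair
    angle = position * frequency
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))
```

For two shots of 2 and 3 frames with an offset of 10, the positions are 0, 1, 12, 13, 14: frames within a shot remain one step apart, while the transition introduces a jump that attention can use to tell shots apart.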

[123] PPTArena: A Benchmark for Agentic PowerPoint Editing

Michael Ofengenden,Yunze Man,Ziqi Pang,Yu-Xiong Wang

Main category: cs.CV

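TL;DR: PPTArena is a benchmark for PowerPoint editing that evaluates how reliably agents modify real slides under natural-language instructions. PPTPilot is a structure-aware slide-editing agent that achieves precise control through semantic edit sequences and XML operations and outperforms existing systems in the experiments.

Details Motivation: Existing benchmarks focus on image-PDF renderings or text-to-slide generation and do not evaluate in-place slide editing. PPTArena fills this gap and provides fine-grained tasks to drive agent controllability and reliability.

Contribution: 1. The PPTArena benchmark, covering more than 800 edits across 100 decks; 2. PPTPilot, a structure-aware agent that combines planning, editing, and a verification loop; 3. Experiments showing that PPTPilot clearly outperforms other systems on complex tasks.

Method: PPTPilot plans structure-aware semantic edit sequences, combines high-level tools with low-level XML operations for precise control, and verifies outputs through an iterative plan-edit-check loop.

Result: PPTPilot beats existing agents and VLM systems by more than 10 percentage points on compound, layout-sensitive, and cross-slide edits, with notable gains in visual fidelity and consistency.

Insight: Existing agents still underperform on long-horizon, document-scale tasks, which highlights how challenging reliable PowerPoint editing remains.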

Abstract: We introduce PPTArena, a benchmark for PowerPoint editing that measures reliable modifications to real slides under natural-language instructions. In contrast to image-PDF renderings or text-to-slide generation, PPTArena focuses on in-place editing across 100 decks, 2125 slides, and over 800 targeted edits covering text, charts, tables, animations, and master-level styles. Each case includes a ground-truth deck, a fully specified target outcome, and a dual VLM-as-judge pipeline that separately scores instruction following and visual quality using both structural diffs and slide images. Building on this setting, we propose PPTPilot, a structure-aware slide-editing agent that plans semantic edit sequences, routes between high-level programmatic tools and deterministic XML operations for precise control, and verifies outputs through an iterative plan-edit-check loop against task-specific constraints. In our experiments, PPTPilot outperforms strong proprietary agents and frontier VLM systems by over 10 percentage points on compound, layout-sensitive, and cross-slide edits, with particularly large gains in visual fidelity and deck-wide consistency. Despite these improvements, existing agents still underperform on long-horizon, document-scale tasks in PPTArena, highlighting the remaining challenges in reliable PPT editing.
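The iterative plan-edit-check loop can be sketched generically as follows. This is not PPTPilot's implementation: the `plan`, `apply_edit`, and `check` callables, the round budget, and the toy dictionary "deck" are all hypothetical stand-ins used only to show the control flow.

```python
def plan_edit_check(deck, plan, apply_edit, check, max_rounds=3):
    """Re-plan and re-apply edits until the checker passes or the round budget runs out.

    Returns (deck, passed).
    """
    for _ in range(max_rounds):
        for step in plan(deck):
            deck = apply_edit(deck, step)
        if check(deck):
            return deck, True
    return deck, False


# Toy stand-ins: a "deck" maps slide ids to titles; the target is all-uppercase titles.
def plan_uppercase(deck):
    return [slide for slide, title in deck.items() if title != title.upper()]


def apply_uppercase(deck, slide):
    updated = dict(deck)  # leave the input deck untouched
    updated[slide] = deck[slide].upper()
    return updated


def all_uppercase(deck):
    return all(title == title.upper() for title in deck.values())
```

Keeping the checker separate from the planner means a failed constraint simply triggers another planning round instead of silently returning an incorrect deck.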

[124] OneThinker: All-in-one Reasoning Model for Image and Video

Kaituo Feng,Manyuan Zhang,Hongyu Li,Kaixuan Fan,Shuang Chen,Yilei Jiang,Dian Zheng,Peiwen Sun,Yiyuan Zhang,Haoze Sun,Yan Feng,Peng Pei,Xunliang Cai,Xiangyu Yue

Main category: cs.CV

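TL;DR: OneThinker is a unified multimodal reasoning model that shares knowledge across tasks and modalities for both image and video. It relies on a large-scale training corpus and on EMA-GRPO, which handles reward heterogeneity in multi-task reinforcement learning.

Details Motivation: Existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains, which limits the scalability and practical versatility of a multimodal reasoning generalist.

Contribution: Proposes OneThinker, a unified model that handles diverse fundamental visual tasks on images and videos; builds the OneThinker-600k training corpus and the OneThinker-SFT-340k dataset for SFT cold start; proposes EMA-GRPO to handle reward heterogeneity in multi-task reinforcement learning.

Method: Builds a large-scale training corpus and annotates CoT with commercial models; EMA-GRPO tracks task-wise moving averages of reward standard deviations to balance multi-task optimization.

Result: The model performs strongly on 31 visual benchmarks covering 10 fundamental visual understanding tasks, and shows knowledge transfer between certain tasks and preliminary zero-shot generalization.

Insight: OneThinker demonstrates the potential of a unified multimodal reasoning model; sharing knowledge across tasks and modalities is a useful reference point for future general visual reasoning models.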

Abstract: Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This results in limited scalability toward a multimodal reasoning generalist, which restricts practical versatility and hinders potential knowledge sharing across tasks and modalities. To this end, we propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the OneThinker-600k training corpus covering all these tasks and employ commercial models for CoT annotation, resulting in OneThinker-SFT-340k for SFT cold start. Furthermore, we propose EMA-GRPO to handle reward heterogeneity in multi-task RL by tracking task-wise moving averages of reward standard deviations for balanced optimization. Extensive experiments on diverse visual benchmarks show that OneThinker delivers strong performance on 31 benchmarks, across 10 fundamental visual understanding tasks. Moreover, it exhibits effective knowledge transfer between certain tasks and preliminary zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist. All code, model, and data are released.
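The abstract describes EMA-GRPO only as "tracking task-wise moving averages of reward standard deviations for balanced optimization". The sketch below shows one way such a normalizer could look, assuming group-relative advantages are divided by a per-task exponential moving average of the reward standard deviation. The class name, decay value, and exact update rule are assumptions, not the authors' implementation.

```python
import statistics


class EmaStdNormalizer:
    """Per-task EMA of reward standard deviation, used to scale group advantages."""

    def __init__(self, decay=0.9, eps=1e-6):
        self.decay = decay
        self.eps = eps
        self.ema_std = {}  # task name -> running EMA of the reward std

    def advantages(self, task, rewards):
        """Center rewards on the group mean and scale by the task's EMA std."""
        mean = statistics.fmean(rewards)
        std = statistics.pstdev(rewards)
        previous = self.ema_std.get(task)
        # The first batch for a task initializes the EMA; later batches blend in.
        ema = std if previous is None else self.decay * previous + (1 - self.decay) * std
        self.ema_std[task] = ema
        return [(r - mean) / (ema + self.eps) for r in rewards]
```

Because the scale is smoothed over time per task, a single low-variance group of rollouts does not blow up the advantages, and tasks with very different reward ranges contribute on a comparable scale.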

[125] MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues

Zichen Liu,Yue Yu,Hao Ouyang,Qiuyu Wang,Shuailei Ma,Ka Leong Cheng,Wen Wang,Qingyan Bai,Yuxuan Zhang,Yanhong Zeng,Yixuan Li,Xing Zhu,Yujun Shen,Qifeng Chen

Main category: cs.CV

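TL;DR: MagicQuillV2 gives fine-grained control over image generation through layered visual cues (content, spatial, structural, and color), bridging the gap between the semantic power of diffusion models and the granular control of traditional graphics software.

Details Motivation: Diffusion models excel at holistic generation but cannot control content, position, and appearance independently, which limits users' creative control.

Contribution: Introduces a layered composition paradigm, with a specialized data generation pipeline, a unified control module, and a fine-tuned spatial branch that together enable precise image editing.

Method: Decomposes user intent into a content layer (what to create), a spatial layer (where to place it), a structural layer (how it is shaped), and a color layer (its palette), each handled by dedicated technical modules.

Result: Experiments show that the layered approach resolves ambiguity in user intent and provides direct, intuitive control over generation.

Insight: Layered visual cues improve editing precision and give users a more natural way to create.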

Abstract: We propose MagicQuill V2, a novel system that introduces a layered composition paradigm to generative image editing, bridging the gap between the semantic power of diffusion models and the granular control of traditional graphics software. While diffusion transformers excel at holistic generation, their use of singular, monolithic prompts fails to disentangle distinct user intentions for content, position, and appearance. To overcome this, our method deconstructs creative intent into a stack of controllable visual cues: a content layer for what to create, a spatial layer for where to place it, a structural layer for how it is shaped, and a color layer for its palette. Our technical contributions include a specialized data generation pipeline for context-aware content integration, a unified control module to process all visual cues, and a fine-tuned spatial branch for precise local editing, including object removal. Extensive experiments validate that this layered approach effectively resolves the user intention gap, granting creators direct, intuitive control over the generative process.

cs.IR [Back]

[126] LORE: A Large Generative Model for Search Relevance

Chenji Lu,Zhuo Chen,Hui Zhao,Zhiyuan Zeng,Gang Zhao,Junjie Ren,Ruicong Xu,Haoran Li,Songyan Liu,Pengjie Wang,Jian Xu,Bo Zheng

Main category: cs.IR

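TL;DR: LORE is a framework for e-commerce search relevance built on large generative models. Staged training and comprehensive evaluation yield a substantial improvement in relevance metrics.

Details Motivation: Existing methods treat search relevance as a single monolithic task without principled decomposition, which leads to a performance ceiling. LORE decomposes relevance into core capabilities (knowledge and reasoning, multi-modal matching, and rule adherence) to break through it.

Contribution: 1. A two-stage training paradigm (SFT + RL); 2. A comprehensive evaluation benchmark, RAIR; 3. A query frequency-stratified deployment strategy.

Method: Training combines progressive CoT synthesis (via SFT) with human preference alignment (via RL) to systematically strengthen the model's core capabilities.

Result: Over three years of deployment and iteration, the online GoodRate metric improved by a cumulative 27%.

Insight: Relevance must be decomposed into multiple core capabilities, using a qualitative-driven decomposition, to break through the current performance ceiling.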

Abstract: Achievement. We introduce LORE, a systematic framework for Large Generative Model-based relevance in e-commerce search. Deployed and iterated over three years, LORE achieves a cumulative +27% improvement in online GoodRate metrics. This report shares the valuable experience gained throughout its development lifecycle, spanning data, features, training, evaluation, and deployment. Insight. While existing works apply Chain-of-Thought (CoT) to enhance relevance, they often hit a performance ceiling. We argue this stems from treating relevance as a monolithic task, lacking principled deconstruction. Our key insight is that relevance comprises distinct capabilities: knowledge and reasoning, multi-modal matching, and rule adherence. We contend that a qualitative-driven decomposition is essential for breaking through current performance bottlenecks. Contributions. LORE provides a complete blueprint for the LLM relevance lifecycle. Key contributions include: (1) A two-stage training paradigm combining progressive CoT synthesis via SFT with human preference alignment via RL. (2) A comprehensive benchmark, RAIR, designed to evaluate these core capabilities. (3) A query frequency-stratified deployment strategy that efficiently transfers offline LLM capabilities to the online system. LORE serves as both a practical solution and a methodological reference for other vertical domains.

cs.PL [Back]

[127] Probabilistic energy profiler for statically typed JVM-based programming languages

Joel Nyholm,Wojciech Mostowski,Christoph Reichenbach

Main category: cs.PL

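TL;DR: The paper proposes a new methodology for predicting the energy consumption of statically typed JVM-based languages (such as Java and Scala). It measures the energy consumption of Bytecode patterns and builds a statistical model, addressing two limitations of earlier approaches: measuring only the CPU and relying on point estimates.

Details Motivation: Energy consumption is a growing concern from mobile devices to data centers, and developers need detailed energy data to optimize their software. Earlier approaches measure only CPU energy and use point estimates, ignoring other hardware effects and limiting statistical reasoning.

Contribution: An energy prediction methodology for statically typed JVM-based languages that measures the energy consumption of Bytecode patterns and builds a Bayesian statistical model, yielding predictions as statistical distributions and allowing analysis of individual factors.

Method: Measure the energy consumption of Bytecode patterns and build a Bayesian statistical model with four factors: three obtained statically from the code (data size, data type, operation) and one describing the hardware platform (device).

Result: Experiments on unseen Java programs validate the model: all four factors are influential, and predicted program energy closely follows the real energy consumption.

Insight: Even two devices of the same model may differ in energy consumption, and operations and data types also cause consumption differences. The methodology provides a basis for future energy verification tools.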

Abstract: Energy consumption is a growing concern in several fields, from mobile devices to large data centers. Developers need detailed data on the energy consumption of their software to mitigate consumption issues. Previous approaches have a broader focus, such as on specific functions or programs, rather than source code statements. They primarily focus on estimating the CPU’s energy consumption using point estimates, thereby disregarding other hardware effects and limiting their use for statistical reasoning and explainability. We developed a novel methodology to address the limitations of measuring only the CPU’s consumption and using point estimates, focusing on predicting the energy usage of statically typed JVM-based programming languages, such as Java and Scala. We measure the energy consumption of Bytecode patterns, the translation from the programming language’s source code statement to their Java Bytecode representation. With the energy measurements, we construct a statistical model using Bayesian statistics, which allows us to predict the energy consumption through statistical distributions and analyze individual factors. The model includes three factors we obtain statically from the code: data size, data type, operation, and one factor about the hardware platform the code executes on: device. To validate our methodology, we implemented it for Java and evaluated its energy predictions on unseen programs. We observe that all four factors are influential, notably that two devices of the same model may differ in energy consumption and that the operations and data types cause consumption differences. The experiments also show that the energy prediction of programs closely follows the program’s real energy consumption, validating our approach. Our work presents a methodology for constructing an energy model that future work, such as verification tools, can use for their energy estimates.

eess.IV [Back]

[128] Comparing Baseline and Day-1 Diffusion MRI Using Multimodal Deep Embeddings for Stroke Outcome Prediction

Sina Raeisadigh,Myles Joshua Toledo Tan,Henning Müller,Abderrahmane Hedjoudje

Main category: eess.IV

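TL;DR: The study compares baseline (J0) and 24-hour (J1) diffusion MRI for predicting three-month functional outcomes in acute ischemic stroke (AIS) patients. J1 models (AUC = 0.923) outperform J0 models (AUC <= 0.86), and adding lesion-volume features improves model stability and interpretability.

Details Motivation: Outcome prediction in acute ischemic stroke (AIS) is important for clinical decision-making. The study tests whether early post-treatment MRI offers better prognostic value than pre-treatment imaging, and whether multimodal data improve model performance.

Contribution: 1. Shows that J1 diffusion MRI outperforms J0 imaging for AIS outcome prediction; 2. Proposes a multimodal prediction framework combining MRI, clinical data, and lesion-volume features.

Method: 3D ResNet-50 embeddings of the MRI scans are fused with clinical data, reduced with PCA (<=12 components), and classified with linear support vector machines (SVMs) under eight-fold stratified group cross-validation.

Result: J1 multimodal models achieve the highest predictive performance (AUC = 0.923 +/- 0.085), outperforming J0-based configurations (AUC <= 0.86). Adding lesion-volume features further improves model stability and interpretability.

Insight: 1. Early post-treatment MRI has greater prognostic value for AIS outcomes; 2. Multimodal fusion can improve both performance and interpretability in clinical prediction tasks.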

Abstract: This study compares baseline (J0) and 24-hour (J1) diffusion magnetic resonance imaging (MRI) for predicting three-month functional outcomes after acute ischemic stroke (AIS). Seventy-four AIS patients with paired apparent diffusion coefficient (ADC) scans and clinical data were analyzed. Three-dimensional ResNet-50 embeddings were fused with structured clinical variables, reduced via principal component analysis (<=12 components), and classified using linear support vector machines with eight-fold stratified group cross-validation. J1 multimodal models achieved the highest predictive performance (AUC = 0.923 +/- 0.085), outperforming J0-based configurations (AUC <= 0.86). Incorporating lesion-volume features further improved model stability and interpretability. These findings demonstrate that early post-treatment diffusion MRI provides superior prognostic value to pre-treatment imaging and that combining MRI, clinical, and lesion-volume features produces a robust and interpretable framework for predicting three-month functional outcomes in AIS patients.

cs.LG [Back]

[129] When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents

Tsimur Hadeliya,Mohammad Ali Jauhar,Nidhi Sakpal,Diogo Cruz

Main category: cs.LG

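TL;DR: The paper studies the capability and safety of LLM agents operating over long context windows. Both task performance and the refusal of harmful requests shift unstably as context grows, exposing weaknesses in current evaluation metrics.

Details Motivation: Prior work has focused mainly on evaluating LLMs on long-context prompts, leaving the agentic setting relatively unexplored in terms of both capability and safety. This paper addresses that gap.

Contribution: Shows that LLM agents exhibit unpredictable shifts in task performance and refusal rates on long-context tasks, and questions current safety evaluation metrics.

Method: Experiments analyze how task performance and refusal rates of LLM agents change with the length, type, and placement of the context.

Result: Models with 1M-2M token context windows degrade by more than 50% already at 100K tokens. Refusal rates shift substantially and inconsistently: at 200K tokens, GPT-4.1-nano rises from ~5% to ~40% while Grok 4 Fast falls from ~80% to ~10%.

Insight: Long contexts can destabilize the safety mechanisms of LLM agents, and the current evaluation paradigm may need to be reconsidered, especially for long multi-step tasks.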

Abstract: Solving complex or long-horizon problems often requires large language models (LLMs) to use external tools and operate over a significantly longer context window. New LLMs enable longer context windows and support tool calling capabilities. Prior works have focused mainly on evaluation of LLMs on long-context prompts, leaving agentic setup relatively unexplored, both from capability and safety perspectives. Our work addresses this gap. We find that LLM agents could be sensitive to length, type, and placement of the context, exhibiting unexpected and inconsistent shifts in task performance and in refusals to execute harmful requests. Models with 1M-2M token context windows show severe degradation already at 100K tokens, with performance drops exceeding 50% for both benign and harmful tasks. Refusal rates shift unpredictably: GPT-4.1-nano increases from ~5% to ~40% while Grok 4 Fast decreases from ~80% to ~10% at 200K tokens. Our work shows potential safety issues with agents operating on longer context and opens additional questions on the current metrics and paradigm for evaluating LLM agent safety on long multi-step tasks. In particular, our results on LLM agents reveal a notable divergence in both capability and safety performance compared to prior evaluations of LLMs on similar criteria.

[130] OptPO: Optimal Rollout Allocation for Test-time Policy Optimization

Youkang Wang,Jian Wang,Rubing Chen,Tianyi Zeng,Xiao-Yong Wei,Qing Li

Main category: cs.LG

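TL;DR: OptPO is a framework for test-time policy optimization that adaptively allocates the inference budget. It uses a Bayesian sequential probability ratio test to stop sampling dynamically, reducing computational redundancy and improving efficiency.

Details Motivation: Existing methods rely on fixed-budget majority voting to estimate rewards, which incurs substantial computational redundancy. OptPO aims to optimize this process through dynamic budget allocation.

Contribution: Proposes the OptPO framework, which stops sampling dynamically via a Bayesian sequential probability ratio test and reuses the retained rollouts for policy updates, substantially reducing compute overhead.

Method: The voting process is formulated as a Bayesian sequential probability ratio test. Once sampling stops, the retained rollouts are used for on-policy updates, compatible with algorithms such as PPO or GRPO.

Result: Across diverse reasoning benchmarks, OptPO significantly reduces rollout overhead compared with fixed-sample baselines while preserving or improving accuracy.

Insight: By unifying statistically optimal stopping with test-time learning, OptPO offers a computationally efficient paradigm for test-time adaptation.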

Abstract: Test-time policy optimization enables large language models (LLMs) to adapt to distribution shifts by leveraging feedback from self-generated rollouts. However, existing methods rely on fixed-budget majority voting to estimate rewards, incurring substantial computational redundancy. We propose Optimal Rollout Allocation for Test-time Policy Optimization (OptPO), a principled framework that adaptively allocates inference budgets. By formulating the voting process as a Bayesian sequential probability ratio test, OptPO dynamically halts sampling once the posterior confidence in a consensus answer exceeds a specified threshold. Crucially, it utilizes the retained rollouts for on-policy updates, seamlessly integrating with algorithms like PPO or GRPO without requiring ground-truth labels. Across diverse reasoning benchmarks, OptPO significantly reduces rollout overhead compared to fixed-sample baselines while preserving or improving accuracy. By unifying statistically optimal stopping with test-time learning, OptPO offers a computationally efficient paradigm for test-time adaptation. The source code will be open upon acceptance at https://open-upon-acceptance.
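The adaptive-stopping idea in the abstract above can be illustrated with a small sketch. This is not OptPO's actual test: it swaps in a simpler Beta-posterior stopping rule that compares the leading answer with the runner-up under a uniform prior, and the names `leader_confidence`, `adaptive_vote`, and `sample_answer` are hypothetical.

```python
from collections import Counter
from math import comb
from typing import Callable, Hashable, List, Tuple


def leader_confidence(lead: int, runner_up: int) -> float:
    """Posterior P(theta > 0.5) for theta ~ Beta(lead + 1, runner_up + 1).

    theta is the (assumed) probability that a rollout votes for the leader
    rather than the runner-up; the uniform prior is an illustrative choice.
    For integer parameters the Beta tail equals a binomial sum.
    """
    n = lead + runner_up + 1
    return sum(comb(n, j) for j in range(lead + 1)) / 2 ** n


def adaptive_vote(
    sample_answer: Callable[[], Hashable],
    threshold: float = 0.95,
    min_rollouts: int = 2,
    max_rollouts: int = 32,
) -> Tuple[Hashable, List[Hashable], float]:
    """Draw rollouts one at a time; stop once the leader is confident enough."""
    if max_rollouts < 1:
        raise ValueError("max_rollouts must be at least 1")
    rollouts: List[Hashable] = []
    counts: Counter = Counter()
    confidence = 0.0
    while len(rollouts) < max_rollouts:
        answer = sample_answer()
        rollouts.append(answer)
        counts[answer] += 1
        ranked = counts.most_common(2)
        lead = ranked[0][1]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        confidence = leader_confidence(lead, runner_up)
        if len(rollouts) >= min_rollouts and confidence >= threshold:
            break  # early stop: the remaining budget is saved
    consensus = counts.most_common(1)[0][0]
    # The retained rollouts could then be scored against the consensus
    # answer as pseudo-labels for a policy update (not shown here).
    return consensus, rollouts, confidence
```

With a sampler that always returns the same answer and the default threshold of 0.95, the loop stops after four rollouts, because the confidence after k unanimous votes is 1 - 2^-(k+1).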

[131] Learning Multimodal Embeddings for Traffic Accident Prediction and Causal Estimation

Ziniu Zhang,Minxuan Duan,Haris N. Koutsopoulos,Hongyang R. Zhang

Main category: cs.LG

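TL;DR: The paper proposes a multimodal learning approach that combines road network data with satellite imagery to improve traffic accident prediction, and uses causal analysis to identify the main contributing factors.

Details Motivation: Traditional traffic accident prediction relies mainly on structural features of the road network and overlooks physical and environmental information about the road surface and its surroundings. The study addresses this gap by combining multimodal data to improve prediction.

Contribution: 1. Builds a large multimodal dataset covering road network data and satellite images; 2. Evaluates multimodal learning methods that improve prediction accuracy; 3. Identifies key contributing factors of traffic accidents through causal analysis.

Method: 1. Integrate structural features of the road network with visual features from satellite images; 2. Learn joint embeddings with multimodal methods; 3. Run a causal analysis based on a matching estimator.

Result: The multimodal approach reaches an average AUROC of 90.1%, a 3.7% gain over models that use graph structure only. The causal analysis finds that accident rates rise by 24% under higher precipitation, by 22% on higher-speed roads such as motorways, and by 29% due to seasonal patterns.

Insight: Satellite imagery features are essential for accurate prediction, and combining modalities better captures the many factors behind traffic accidents.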

Abstract: We consider analyzing traffic accident patterns using both road network data and satellite images aligned to road graph nodes. Previous work for predicting accident occurrences relies primarily on road network structural features while overlooking physical and environmental information from the road surface and its surroundings. In this work, we construct a large multimodal dataset across six U.S. states, containing nine million traffic accident records from official sources, and one million high-resolution satellite images for each node of the road network. Additionally, every node is annotated with features such as the region’s weather statistics and road type (e.g., residential vs. motorway), and each edge is annotated with traffic volume information (i.e., Average Annual Daily Traffic). Utilizing this dataset, we conduct a comprehensive evaluation of multimodal learning methods that integrate both visual and network embeddings. Our findings show that integrating both data modalities improves prediction accuracy, achieving an average AUROC of 90.1%, which is a 3.7% gain over graph neural network models that only utilize graph structures. With the improved embeddings, we conduct a causal analysis based on a matching estimator to estimate the key contributing factors influencing traffic accidents. We find that accident rates rise by 24% under higher precipitation, by 22% on higher-speed roads such as motorways, and by 29% due to seasonal patterns, after adjusting for other confounding factors. Ablation studies confirm that satellite imagery features are essential for achieving accurate prediction.
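As a rough illustration of what a matching estimator does, the sketch below pairs each treated unit with its nearest control unit in covariate space and averages the outcome differences. It is a generic one-nearest-neighbor estimator, not the paper's procedure. Treating the covariate vectors as learned node embeddings is an assumption, and the function name `att_nearest_neighbor` is hypothetical.

```python
from typing import List, Sequence, Tuple

Unit = Tuple[Sequence[float], float]  # (covariate vector, outcome)


def att_nearest_neighbor(treated: List[Unit], control: List[Unit]) -> float:
    """Average effect on the treated via 1-nearest-neighbor matching.

    Each treated unit is matched (with replacement) to the control unit
    whose covariates are closest in squared Euclidean distance.
    """
    if not treated or not control:
        raise ValueError("need at least one treated and one control unit")
    differences = []
    for x_treated, y_treated in treated:
        _, y_matched = min(
            control,
            key=lambda unit: sum((a - b) ** 2 for a, b in zip(x_treated, unit[0])),
        )
        differences.append(y_treated - y_matched)
    return sum(differences) / len(differences)
```

For example, "treated" could be road nodes with high precipitation and "control" nodes with low precipitation, with accident rate as the outcome. The matching step is what adjusts for the confounders encoded in the covariates.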

cs.AI [Back]

[132] OmniGuard: Unified Omni-Modal Guardrails with Deliberate Reasoning

Boyu Zhu,Xiaofei Wen,Wenjie Jacky Mo,Tinghui Zhu,Yanan Xie,Peng Qi,Muhao Chen

Main category: cs.AI

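TL;DR: OmniGuard is a unified omni-modal guardrail framework that safeguards all modalities (text, images, video, audio) with deliberate reasoning.

Details Motivation: Prior guardrail research largely targets unimodal settings and typically frames safeguarding as binary classification, which limits robustness across modalities and tasks. OmniGuard aims to close this gap with more comprehensive omni-modal safeguarding.

Contribution: 1. Proposes OmniGuard, the first family of unified omni-modal guardrails; 2. Curates a large omni-modal safety dataset (over 210K samples) covering all modalities; 3. Annotates samples with structured safety labels and safety critiques distilled from expert models.

Method: OmniGuard safeguards omni-modal inputs through deliberate reasoning and is trained on the large-scale dataset with annotations distilled from expert models.

Result: On 15 benchmarks, OmniGuard shows strong effectiveness and generalizes across a wide range of multimodal safety scenarios.

Insight: OmniGuard lays groundwork for more robust and capable omni-modal safeguarding systems; its unified framework lets it enforce policies and mitigate risks across modalities.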

Abstract: Omni-modal Large Language Models (OLLMs) that process text, images, videos, and audio introduce new challenges for safety and value guardrails in human-AI interaction. Prior guardrail research largely targets unimodal settings and typically frames safeguarding as binary classification, which limits robustness across diverse modalities and tasks. To address this gap, we propose OmniGuard, the first family of omni-modal guardrails that performs safeguarding across all modalities with deliberate reasoning ability. To support the training of OmniGuard, we curate a large, comprehensive omni-modal safety dataset comprising over 210K diverse samples, with inputs that cover all modalities through both unimodal and cross-modal samples. Each sample is annotated with structured safety labels and carefully curated safety critiques from expert models through targeted distillation. Extensive experiments on 15 benchmarks show that OmniGuard achieves strong effectiveness and generalization across a wide range of multimodal safety scenarios. Importantly, OmniGuard provides a unified framework that enforces policies and mitigates risks in omni-modalities, paving the way toward building more robust and capable omni-modal safeguarding systems.

[133] Guided Self-Evolving LLMs with Minimal Human Supervision

Wenhao Yu,Zhenwen Liang,Chengsong Huang,Kishan Panaganti,Tianqing Fang,Haitao Mi,Dong Yu

Main category: cs.AI

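TL;DR: The paper proposes R-Few, a framework that uses lightweight human supervision and self-play to counter concept drift and diversity collapse in unguided self-evolving systems, achieving stable iterative gains on math and general reasoning tasks.

Details Motivation: AI self-evolution is seen as a path toward superintelligence, but in practice unguided self-evolving systems often plateau or degrade because of concept drift, diversity collapse, and mis-evolution. The paper aims for stable, controllable self-evolution through lightweight human supervision and guidance.

Contribution: Introduces R-Few, which combines lightweight human oversight (in-context grounding and mixed training) with a self-play Challenger-Solver mechanism to address instability in self-evolution, and demonstrates its effectiveness experimentally.

Method: In a Challenger-Solver framework, the Challenger uses a small set of human-labeled examples to guide synthetic question generation, while the Solver trains jointly on human and synthetic data under a difficulty-based curriculum.

Result: On math and general reasoning benchmarks, R-Few improves consistently across iterations. Qwen3-8B-Base gains +3.0 points over R-Zero on math tasks and matches General-Reasoner, which is trained on 20 times more human data.

Insight: Combining lightweight human supervision with self-play effectively mitigates concept drift and diversity collapse, enabling stable and controllable self-evolution.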

Abstract: AI self-evolution has long been envisioned as a path toward superintelligence, where models autonomously acquire, refine, and internalize knowledge from their own learning experiences. Yet in practice, unguided self-evolving systems often plateau quickly or even degrade as training progresses. These failures arise from issues such as concept drift, diversity collapse, and mis-evolution, as models reinforce their own biases and converge toward low-entropy behaviors. To enable models to self-evolve in a stable and controllable manner while minimizing reliance on human supervision, we introduce R-Few, a guided Self-Play Challenger-Solver framework that incorporates lightweight human oversight through in-context grounding and mixed training. At each iteration, the Challenger samples a small set of human-labeled examples to guide synthetic question generation, while the Solver jointly trains on human and synthetic examples under an online, difficulty-based curriculum. Across math and general reasoning benchmarks, R-Few achieves consistent and iterative improvements. For example, Qwen3-8B-Base improves by +3.0 points over R-Zero on math tasks and achieves performance on par with General-Reasoner, despite the latter being trained on 20 times more human data. Ablation studies confirm the complementary contributions of grounded challenger training and curriculum-based solver training, and further analysis shows that R-Few mitigates drift, yielding more stable and controllable co-evolutionary dynamics.

[134] Martingale Score: An Unsupervised Metric for Bayesian Rationality in LLM Reasoning

Zhonghao He,Tianyi Qiu,Hirokazu Shirado,Maarten Sap

Main category: cs.AI

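TL;DR: The paper proposes the Martingale Score, an unsupervised metric for whether large language models (LLMs) update their beliefs in a Bayesian-rational way during reasoning. It shows that iterative reasoning can lead to belief entrenchment rather than truth-seeking.

Details Motivation: Emerging evidence suggests that iterative reasoning may foster belief entrenchment and confirmation bias in LLMs rather than improve truth-seeking. To evaluate this systematically, the authors draw on the Martingale property from Bayesian statistics.

Contribution: 1. Proposes the Martingale Score, an unsupervised, regression-based metric for violations of the Martingale property of rational Bayesian updating in LLM reasoning. 2. Shows that belief entrenchment is widespread across domains (event forecasting, value-laden questions, academic paper review). 3. Validates the Martingale Score as a useful proxy for the truth-seeking ability of a reasoning process.

Method: The Martingale Score is built on the Martingale property (the expected value of future beliefs should equal the current belief) and uses regression to detect violations. Experiments span several open-ended domains without ground-truth labels and identify the models and reasoning techniques most prone to entrenchment.

Result: LLMs widely violate the Martingale property across domains: the current belief positively predicts future belief updates (belief entrenchment). Although unsupervised, the Martingale Score predicts ground-truth accuracy in domains where labels are available.

Insight: 1. Iterative reasoning may deepen belief entrenchment in LLMs rather than improve truth-seeking. 2. The Martingale Score is a practical tool for unsupervised evaluation of Bayesian rationality in LLMs. 3. The results point to a need for reasoning techniques that reduce confirmation bias.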

Abstract: Recent advances in reasoning techniques have substantially improved the performance of large language models (LLMs), raising expectations for their ability to provide accurate, truthful, and reliable information. However, emerging evidence suggests that iterative reasoning may foster belief entrenchment and confirmation bias, rather than enhancing truth-seeking behavior. In this study, we propose a systematic evaluation framework for belief entrenchment in LLM reasoning by leveraging the Martingale property from Bayesian statistics. This property implies that, under rational belief updating, the expected value of future beliefs should remain equal to the current belief, i.e., belief updates are unpredictable from the current belief. We propose the unsupervised, regression-based Martingale Score to measure violations of this property, which signal deviation from the Bayesian ability of updating on new evidence. In open-ended problem domains including event forecasting, value-laden questions, and academic paper review, we find such violations to be widespread across models and setups, where the current belief positively predicts future belief updates, a phenomenon which we term belief entrenchment. We identify the models, reasoning techniques, and domains more prone to belief entrenchment. Finally, we validate the Martingale Score by showing that it predicts ground-truth accuracy on problem domains where ground truth labels are available. This indicates that, while designed as an unsupervised metric that operates even in domains without access to ground truth, the Martingale Score is a useful proxy of the truth-seeking ability of a reasoning process.
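The regression idea can be sketched as follows. The paper's exact estimator is not given in the abstract, so this assumes one plausible instantiation: the ordinary least-squares slope of the belief update on the current belief, where a slope near zero is consistent with the Martingale property and a positive slope indicates entrenchment. The function name `update_slope` is hypothetical.

```python
from statistics import mean
from typing import Sequence


def update_slope(current: Sequence[float], future: Sequence[float]) -> float:
    """OLS slope of (future - current) regressed on current belief.

    Each pair is a belief (e.g. a probability) before and after a
    reasoning step on one question.
    """
    if len(current) != len(future) or len(current) < 2:
        raise ValueError("need two equally long sequences with >= 2 items")
    updates = [f - c for c, f in zip(current, future)]
    c_bar, u_bar = mean(current), mean(updates)
    variance = sum((c - c_bar) ** 2 for c in current)
    if variance == 0:
        raise ValueError("current beliefs must not all be equal")
    covariance = sum((c - c_bar) * (u - u_bar) for c, u in zip(current, updates))
    return covariance / variance
```

As a usage example, beliefs `[0.2, 0.4, 0.6, 0.8]` that move to `[0.05, 0.35, 0.65, 0.95]` (each pushed further from 0.5) give a slope of about 0.5, whereas updates that are uncorrelated with the starting belief give a slope of about zero.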

[135] Bridging the Gap: Toward Cognitive Autonomy in Artificial Intelligence

Noorbakhsh Amiri Golilarz,Sindhuja Penchala,Shahram Rahimi

Main category: cs.AI

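TL;DR: The paper identifies seven core deficiencies of current AI systems and proposes a direction for cognitively autonomous AI architectures grounded in neurocognitive principles, aiming at self-monitoring, dynamic adaptation, and intrinsic goal management.

Details Motivation: Despite progress in perception, language, and multimodal domains, existing systems still lack the ability to self-monitor, self-correct, and regulate their behavior autonomously, and so cannot achieve real autonomy in dynamic environments.

Contribution: Introduces the notion of cognitive autonomy in AI, analyzes seven core deficiencies, and outlines architectural directions based on neurocognitive principles.

Method: Integrates insights from AI research, cognitive science, and neuroscience, and compares artificial systems with biological cognition to propose directions for improvement.

Result: Argues that current architectures, including deep learning and transformer-based systems, cannot overcome their limits in generalization and adaptability through scaling alone.

Insight: Cognitively autonomous AI needs self-directed adaptation, dynamic representation management, and goal-oriented behavior, together with oversight that keeps systems interpretable and aligned with human values.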

Abstract: Artificial intelligence has advanced rapidly across perception, language, reasoning, and multimodal domains. Yet despite these achievements, modern AI systems remain fundamentally limited in their ability to self-monitor, self-correct, and regulate their behavior autonomously in dynamic contexts. This paper identifies and analyzes seven core deficiencies that constrain contemporary AI models: the absence of intrinsic self-monitoring, lack of meta-cognitive awareness, fixed and non-adaptive learning mechanisms, inability to restructure goals, lack of representational maintenance, insufficient embodied feedback, and the absence of intrinsic agency. Alongside identifying these limitations, we also outline a forward-looking perspective on how AI may evolve beyond them through architectures that mirror neurocognitive principles. We argue that these structural limitations prevent current architectures, including deep learning and transformer-based systems, from achieving robust generalization, lifelong adaptability, and real-world autonomy. Drawing on a comparative analysis of artificial systems and biological cognition [7], and integrating insights from AI research, cognitive science, and neuroscience, we outline how these capabilities are absent in current models and why scaling alone cannot resolve them. We conclude by advocating for a paradigmatic shift toward cognitively grounded AI (cognitive autonomy) capable of self-directed adaptation, dynamic representation management, and intentional, goal-oriented behavior, paired with reformative oversight mechanisms [8] that ensure autonomous systems remain interpretable, governable, and aligned with human values.

[136] Reasoning Path and Latent State Analysis for Multi-view Visual Spatial Reasoning: A Cognitive Science Perspective

Qiyao Xue,Weichen Liu,Shiqi Wang,Haoming Wang,Yuyang Wu,Wei Gao

Main category: cs.AI

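TL;DR: The paper presents ReMindView-Bench, a benchmark for evaluating vision-language models (VLMs) on multi-view spatial reasoning, and reveals consistent failures in cross-view alignment and perspective-taking.

Details Motivation: Current VLMs struggle to maintain geometric coherence and cross-view consistency in multi-view spatial reasoning. Fine-grained benchmarks are needed that isolate multi-view reasoning from single-view perception and temporal factors.

Contribution: Presents ReMindView-Bench, which systematically probes key factors of spatial cognition, and uses explicit and implicit analyses to expose VLM weaknesses in multi-view spatial reasoning.

Method: Explicit phase-wise analysis uses LLM-as-a-judge and self-consistency prompting; implicit analysis uses linear probing and entropy dynamics.

Result: VLMs perform well on in-frame perception but degrade sharply when integrating information across views, with progressive loss of task-relevant information; uncertainty also separates correct from incorrect trajectories.

Insight: The study shows how multi-view spatial mental models are formed, degraded, and destabilized, offering a cognitively grounded diagnosis of VLM spatial reasoning.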

Abstract: Spatial reasoning is a core aspect of human intelligence that allows perception, inference and planning in 3D environments. However, current vision-language models (VLMs) struggle to maintain geometric coherence and cross-view consistency for spatial reasoning in multi-view settings. We attribute this gap to the lack of fine-grained benchmarks that isolate multi-view reasoning from single-view perception and temporal factors. To address this, we present ReMindView-Bench, a cognitively grounded benchmark for evaluating how VLMs construct, align and maintain spatial mental models across complementary viewpoints. ReMindView-Bench systematically varies viewpoint spatial pattern and query type to probe key factors of spatial cognition. Evaluations of 15 current VLMs reveal consistent failures in cross-view alignment and perspective-taking in multi-view spatial reasoning, motivating deeper analysis of the reasoning process. Explicit phase-wise analysis using LLM-as-a-judge and self-consistency prompting shows that VLMs perform well on in-frame perception but degrade sharply when integrating information across views. Implicit analysis, including linear probing and entropy dynamics, further shows progressive loss of task-relevant information and uncertainty separation between correct and incorrect trajectories. These results provide a cognitively grounded diagnosis of VLM spatial reasoning and reveal how multi-view spatial mental models are formed, degraded and destabilized across reasoning phases. The ReMindView-Bench benchmark is available at https://huggingface.co/datasets/Xue0823/ReMindView-Bench, and the source code for benchmark construction and VLM reasoning analysis is available at https://github.com/pittisl/ReMindView-Bench.

cs.RO [Back]

[137] SAM2Grasp: Resolve Multi-modal Grasping via Prompt-conditioned Temporal Action Prediction

Shengkai Wu,Jinrong Yang,Wenqiu Luo,Linfeng Gao,Chaohui Shang,Meiyu Zhi,Mingshan Sun,Fangping Yang,Liangliang Ren,Yong Zhao

Main category: cs.RO

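TL;DR: SAM2Grasp is a framework that reformulates multi-modal grasping as a uni-modal, prompt-conditioned prediction problem, resolving the conflicting training signals that multi-target scenes create in imitation learning. It uses a frozen SAM2 model for visual temporal tracking and adds a lightweight trainable action head, achieving strong multi-object grasping performance.

Details Motivation: Imitation learning for multi-target grasping often fails because of the multimodal problem: demonstrations of grasping different objects produce conflicting training signals. Standard policies average these distinct actions into an invalid one, so a different formulation is needed.

Contribution: 1. Proposes SAM2Grasp, which turns multi-modal grasping into a uni-modal, prompt-conditioned prediction problem; 2. Combines a frozen SAM2 model with a lightweight action head for efficient training and inference; 3. Removes ambiguity from the policy by using temporal-visual features and an initial prompt.

Method: 1. Extract temporal-visual features with the frozen SAM2 model; 2. Add a lightweight trainable action head that runs in parallel with SAM2's native segmentation head; 3. Designate the grasp target with an initial prompt (such as a bounding box) and rely on SAM2's temporal tracking to keep predicting the grasp trajectory.

Result: SAM2Grasp achieves state-of-the-art performance on cluttered, multi-object grasping tasks.

Insight: 1. Combining prompt conditioning with temporal tracking effectively resolves ambiguity in multi-modal tasks; 2. Freezing a pre-trained model and training only a lightweight head is efficient because it reduces training cost.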

Abstract: Imitation learning for robotic grasping is often plagued by the multimodal problem: when a scene contains multiple valid targets, demonstrations of grasping different objects create conflicting training signals. Standard imitation learning policies fail by averaging these distinct actions into a single, invalid action. In this paper, we introduce SAM2Grasp, a novel framework that resolves this issue by reformulating the task as a uni-modal, prompt-conditioned prediction problem. Our method leverages the frozen SAM2 model to use its powerful visual temporal tracking capability and introduces a lightweight, trainable action head that operates in parallel with its native segmentation head. This design allows for training only the small action head on pre-computed temporal-visual features from SAM2. During inference, an initial prompt, such as a bounding box provided by an upstream object detection model, designates the specific object to be grasped. This prompt conditions the action head to predict a unique, unambiguous grasp trajectory for that object alone. In all subsequent video frames, SAM2’s built-in temporal tracking capability automatically maintains stable tracking of the selected object, enabling our model to continuously predict the grasp trajectory from the video stream without further external guidance. This temporal-prompted approach effectively eliminates ambiguity from the visuomotor policy. We demonstrate through extensive experiments that SAM2Grasp achieves state-of-the-art performance in cluttered, multi-object grasping tasks.

[138] Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols

Xianchao Zeng,Xinyu Zhou,Youcheng Li,Jiayou Shi,Tianle Li,Liangming Chen,Lei Ren,Yong-Lu Li

Main category: cs.RO

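TL;DR: The paper proposes ViFailback, a framework for diagnosing and correcting robotic manipulation failures that uses visual symbols to improve annotation efficiency. It also releases the ViFailback dataset and the ViFailback-Bench benchmark, and demonstrates the effectiveness of the ViFailback-8B VLM.

Details Motivation: Existing VLA models are limited in diagnosing manipulation failures and learning from them, and existing failure datasets are mostly generated in simulation, which limits generalization to the real world.

Contribution: 1. Proposes the ViFailback framework for failure diagnosis and correction; 2. Releases the ViFailback dataset and the ViFailback-Bench benchmark; 3. Builds the ViFailback-8B VLM, which substantially improves performance on the benchmark.

Method: Uses explicit visual symbols to improve annotation efficiency, builds a dataset of VQA tasks, and has the VLM generate visual symbols as corrective guidance.

Result: ViFailback-8B achieves significant overall improvement on ViFailback-Bench and, in real-world experiments, helps a VLA model recover from failures.

Insight: Combining visual symbols with a VLM can substantially improve failure diagnosis and correction in robotic manipulation and supports real-world use.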

Abstract: Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic manipulation, yet they remain limited in failure diagnosis and learning from failures. Additionally, existing failure datasets are mostly generated programmatically in simulation, which limits their generalization to the real world. In light of these, we introduce ViFailback, a framework designed to diagnose robotic manipulation failures and provide both textual and visual correction guidance. Our framework utilizes explicit visual symbols to enhance annotation efficiency. We further release the ViFailback dataset, a large-scale collection of 58,126 Visual Question Answering (VQA) pairs along with their corresponding 5,202 real-world manipulation trajectories. Based on the dataset, we establish ViFailback-Bench, a benchmark of 11 fine-grained VQA tasks designed to assess the failure diagnosis and correction abilities of Vision-Language Models (VLMs), featuring ViFailback-Bench Lite for closed-ended and ViFailback-Bench Hard for open-ended evaluation. To demonstrate the effectiveness of our framework, we built the ViFailback-8B VLM, which not only achieves significant overall performance improvement on ViFailback-Bench but also generates visual symbols for corrective action guidance. Finally, by integrating ViFailback-8B with a VLA model, we conduct real-world robotic experiments demonstrating its ability to assist the VLA model in recovering from failures. Project Website: https://x1nyuzhou.github.io/vifailback.github.io/

cs.GR [Back]

[139] SMP: Reusable Score-Matching Motion Priors for Physics-Based Character Control

Yuxuan Mu,Ziyu Zhang,Yi Shi,Minami Matsumoto,Kotaro Imamura,Guy Tevet,Chuan Guo,Michael Taylor,Chang Shu,Pengcheng Xi,Xue Bin Peng

Main category: cs.GR

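TL;DR: SMP uses pre-trained motion diffusion models and score distillation sampling to build reusable, task-agnostic motion priors that need no retraining for each new controller and can synthesize new motion styles.

Details Motivation: Adversarial imitation learning usually requires retraining the motion prior for each new controller, which limits reusability and requires keeping the reference motion data. SMP aims to remove these constraints.

Contribution: Proposes SMP, a reusable, task-agnostic motion prior based on pre-trained motion diffusion models that supports style composition and diverse control tasks.

Method: Score distillation sampling (SDS) with a pre-trained motion diffusion model yields a task-agnostic motion prior, which is kept frozen and reused as a general-purpose reward function.

Result: SMP produces high-quality motion across diverse control tasks, comparable to state-of-the-art adversarial imitation learning methods, and can compose styles to synthesize new ones.

Insight: SMP shows the potential of task-agnostic motion priors that are modular and reusable, suggesting a new approach to virtual character control.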

Abstract: Data-driven motion priors that can guide agents toward producing naturalistic behaviors play a pivotal role in creating life-like virtual characters. Adversarial imitation learning has been a highly effective method for learning motion priors from reference motion data. However, adversarial priors, with few exceptions, need to be retrained for each new controller, thereby limiting their reusability and necessitating the retention of the reference motion data when training on downstream tasks. In this work, we present Score-Matching Motion Priors (SMP), which leverages pre-trained motion diffusion models and score distillation sampling (SDS) to create reusable task-agnostic motion priors. SMPs can be pre-trained on a motion dataset, independent of any control policy or task. Once trained, SMPs can be kept frozen and reused as general-purpose reward functions to train policies to produce naturalistic behaviors for downstream tasks. We show that a general motion prior trained on large-scale datasets can be repurposed into a variety of style-specific priors. Furthermore, SMP can compose different styles to synthesize new styles not present in the original dataset. Our method produces high-quality motion comparable to state-of-the-art adversarial imitation learning methods through reusable and modular motion priors. We demonstrate the effectiveness of SMP across a diverse suite of control tasks with physically simulated humanoid characters. Video demo available at https://youtu.be/ravlZJteS20

cs.CR [Back]

[140] LeechHijack: Covert Computational Resource Exploitation in Intelligent Agent Systems

Yuanhe Zhang,Weiliu Wang,Zhenhong Zhou,Kun Wang,Jie Zhang,Li Sun,Yang Liu,Sen Su

Main category: cs.CR

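TL;DR: The paper presents LeechHijack, a new attack that exploits the trust boundary in LLM agent systems: a benign-looking backdoor is implanted in a tool and, once triggered, hijacks the agent's computational resources. The average success rate in experiments is 77.25%.

Details Motivation: The MCP framework gives LLM agent systems an open ecosystem but also introduces implicit trust in third-party tool providers. The paper sets out to expose this weakness and motivate defenses.

Contribution: 1) Identifies and formalizes implicit toxicity attacks; 2) Designs the LeechHijack attack; 3) Implements the attack across four major LLM families and validates its effectiveness.

Method: LeechHijack has two stages: 1) an implantation stage that embeds a benign-looking backdoor in a tool; 2) an exploitation stage in which the backdoor activates on predefined triggers, establishes a command-and-control channel, and injects additional tasks.

Result: Experiments show an average attack success rate of 77.25% with a resource overhead of 18.62% compared with the baseline.

Insight: The work exposes risks to compute resources and the trust model in the MCP ecosystem and calls for computational provenance and resource attestation mechanisms.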

Abstract: Large Language Model (LLM)-based agents have demonstrated remarkable capabilities in reasoning, planning, and tool usage. The recently proposed Model Context Protocol (MCP) has emerged as a unifying framework for integrating external tools into agent systems, enabling a thriving open ecosystem of community-built functionalities. However, the openness and composability that make MCP appealing also introduce a critical yet overlooked security assumption – implicit trust in third-party tool providers. In this work, we identify and formalize a new class of attacks that exploit this trust boundary without violating explicit permissions. We term this new attack vector implicit toxicity, where malicious behaviors occur entirely within the allowed privilege scope. We propose LeechHijack, a Latent Embedded Exploit for Computation Hijacking, in which an adversarial MCP tool covertly expropriates the agent’s computational resources for unauthorized workloads. LeechHijack operates through a two-stage mechanism: an implantation stage that embeds a benign-looking backdoor in a tool, and an exploitation stage where the backdoor activates upon predefined triggers to establish a command-and-control channel. Through this channel, the attacker injects additional tasks that the agent executes as if they were part of its normal workflow, effectively parasitizing the user’s compute budget. We implement LeechHijack across four major LLM families. Experiments show that LeechHijack achieves an average success rate of 77.25%, with a resource overhead of 18.62% compared to the baseline. This study highlights the urgent need for computational provenance and resource attestation mechanisms to safeguard the emerging MCP ecosystem.

[141] Superpixel Attack: Enhancing Black-box Adversarial Attack with Image-driven Division Areas

Issa Oe,Keiichiro Yamamura,Hiroki Ishikura,Ryo Hamahira,Katsuki Fujisawa

Main category: cs.CR

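TL;DR: The paper proposes Superpixel Attack, a black-box adversarial attack that divides the image into superpixels and combines them with a versatile search strategy, raising attack success rates by an average of 2.10%.

Details Motivation: Deep learning models are widely used in safety-critical tasks, yet small input perturbations can cause misclassification. Existing black-box attacks typically perturb simple rectangular areas, which limits their effectiveness. The paper proposes a superpixel-based alternative to raise attack success rates.

Contribution: 1) Replaces rectangular areas with superpixels, which balance color variance and compactness; 2) Designs a new search method, versatile search; 3) Proposes Superpixel Attack, which improves the average attack success rate by 2.10%.

Method: Superpixel Attack segments the image into superpixels, changes the perturbation within each segmented area, and optimizes the perturbations with versatile search, avoiding the limitations of rectangular areas.

Result: Experiments show that Superpixel Attack improves attack success rates by an average of 2.10% over existing attacks, on models that are mostly adversarially robust.

Insight: Superpixels follow image content more closely than rectangles and so guide the attack more effectively. The method offers a new direction for black-box attacks and underlines the importance of research on adversarial robustness.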

Abstract: Deep learning models are used in safety-critical tasks such as automated driving and face recognition. However, small perturbations in the model input can significantly change the predictions. Adversarial attacks are used to identify small perturbations that can lead to misclassifications. More powerful black-box adversarial attacks are required to develop more effective defenses. A promising approach to black-box adversarial attacks is to repeat the process of extracting a specific image area and changing the perturbations added to it. Existing attacks adopt simple rectangles as the areas where perturbations are changed in a single iteration. We propose applying superpixels instead, which achieve a good balance between color variance and compactness. We also propose a new search method, versatile search, and a novel attack method, Superpixel Attack, which applies superpixels and performs versatile search. Superpixel Attack improves attack success rates by an average of 2.10% compared with existing attacks. Most models used in this study are robust against adversarial attacks, and this improvement is significant for black-box adversarial attacks. The code is available at https://github.com/oe1307/SuperpixelAttack.git.

[142] PhishSnap: Image-Based Phishing Detection Using Perceptual Hashing

Md Abdul Ahad Minhaz,Zannatul Zahan Meem,Md. Shohrab Hossain

Main category: cs.CR

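TL;DR: PhishSnap is a privacy-preserving phishing detection system based on perceptual hashing (pHash). A browser extension captures webpage screenshots, computes visual hashes, and compares them against legitimate templates to identify phishing attempts. On a 2024 dataset of 10,000 URLs, the system achieved 0.79 accuracy.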

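Details Motivation: Existing URL- and HTML-based phishing detection systems struggle against obfuscation and visual deception, so a more effective visual detection method is needed.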

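Contribution: The paper presents PhishSnap, which uses perceptual hashing for privacy-preserving phishing detection, with the entire inference process running on the user's device.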

Method: A browser extension captures webpage screenshots, computes their perceptual hashes, and compares them against legitimate templates.

Result: On the 10,000-URL dataset, the system achieved 0.79 accuracy, 0.76 precision, and 0.78 recall.

Insight: Visual similarity is an effective signal for detecting phishing attacks, and local processing provides both privacy and low latency.

Abstract: Phishing remains one of the most prevalent online threats, exploiting human trust to harvest sensitive credentials. Existing URL- and HTML-based detection systems struggle against obfuscation and visual deception. This paper presents PhishSnap, a privacy-preserving, on-device phishing detection system leveraging perceptual hashing (pHash). Implemented as a browser extension, PhishSnap captures webpage screenshots, computes visual hashes, and compares them against legitimate templates to identify visually similar phishing attempts. A 2024 dataset of 10,000 URLs (70%/20%/10% train/validation/test) was collected from PhishTank and Netcraft. Due to security takedowns, a subset of phishing pages was unavailable, reducing dataset diversity. The system achieved 0.79 accuracy, 0.76 precision, and 0.78 recall, showing that visual similarity remains a viable anti-phishing measure. The entire inference process occurs locally, ensuring user privacy and minimal latency.
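The hash-and-compare step can be sketched as follows. This is a generic DCT-based perceptual hash written with the standard library only, not PhishSnap's implementation: the abstract does not give the hash size, the screenshot preprocessing, or the decision threshold, so `keep`, `max_distance`, and all function names here are assumptions. Inputs are assumed to be square grayscale matrices that have already been downscaled.

```python
import math
from statistics import median

def dct_2d_lowfreq(pixels, keep):
    """Return the keep x keep lowest-frequency 2D DCT-II coefficients of a square matrix."""
    n = len(pixels)
    coeffs = []
    for u in range(keep):
        row = []
        for v in range(keep):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (pixels[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            row.append(total)
        coeffs.append(row)
    return coeffs

def phash_bits(pixels, keep=4):
    """Threshold the low-frequency coefficients at their median (DC term excluded from the median)."""
    flat = [c for row in dct_2d_lowfreq(pixels, keep) for c in row]
    med = median(flat[1:])
    return [1 if c > med else 0 for c in flat]

def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

def looks_like_template(page_bits, template_bits, max_distance):
    """Flag a page as visually similar to a template when the hashes are close enough."""
    return hamming(page_bits, template_bits) <= max_distance

# Two toy 8x8 "screenshots" with different layouts (illustrative values only).
template = [[(x * 7 + y * 13 + x * y * 3) % 32 for y in range(8)] for x in range(8)]
other = [[(x * 11 + y * 5) % 32 for y in range(8)] for x in range(8)]
template_bits = phash_bits(template)
distance = hamming(template_bits, phash_bits(other))
```

Because a phishing page imitates a legitimate one visually, a small Hamming distance to a legitimate template on a page served from an unrelated domain is the suspicious case; how the system combines the visual match with the page's origin is not detailed in the abstract.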

cs.HC

[143] Real-Time Multimodal Data Collection Using Smartwatches and Its Visualization in Education

Alvaro Becerra,Pablo Villegas,Ruth Cobos

Main category: cs.HC

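TL;DR: The paper presents two tools, Watch-DMLT and ViSeDOPS, for real-time multimodal data collection and visualization. They address the lack of scalable, synchronized, high-resolution tools in education, and a real classroom deployment demonstrates their feasibility.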

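Details Motivation: In education, wearable sensors such as smartwatches offer new opportunities to study cognitive and affective processes, but the lack of scalable, synchronized tools for multimodal data acquisition limits the practical use of Multimodal Learning Analytics.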

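Contribution: The main contributions are Watch-DMLT, a multi-user real-time data acquisition tool, and ViSeDOPS, a visualization system. Together they support the analysis and visualization of synchronized multimodal data, and a real classroom deployment demonstrates the system's practicality.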

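Method: The approach has two parts: 1) Watch-DMLT collects physiological and motion signals in real time; 2) ViSeDOPS analyzes and visualizes synchronized multimodal data such as heart rate, motion, gaze, and video.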

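Result: In a classroom deployment with 65 students and up to 16 smartwatches, the system captured and analyzed multimodal data including heart rate, motion, and gaze, demonstrating its feasibility in a real learning environment.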

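Insight: The study shows that tools for real-time, synchronized multimodal data collection and visualization can support fine-grained, scalable educational analytics, opening a path toward practical use of Multimodal Learning Analytics.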

Abstract: Wearable sensors, such as smartwatches, have become increasingly prevalent across domains like healthcare, sports, and education, enabling continuous monitoring of physiological and behavioral data. In the context of education, these technologies offer new opportunities to study cognitive and affective processes such as engagement, attention, and performance. However, the lack of scalable, synchronized, and high-resolution tools for multimodal data acquisition continues to be a significant barrier to the widespread adoption of Multimodal Learning Analytics in real-world educational settings. This paper presents two complementary tools developed to address these challenges: Watch-DMLT, a data acquisition application for Fitbit Sense 2 smartwatches that enables real-time, multi-user monitoring of physiological and motion signals; and ViSeDOPS, a dashboard-based visualization system for analyzing synchronized multimodal data collected during oral presentations. We report on a classroom deployment involving 65 students and up to 16 smartwatches, where data streams including heart rate, motion, gaze, video, and contextual annotations were captured and analyzed. Results demonstrate the feasibility and utility of the proposed system for supporting fine-grained, scalable, and interpretable Multimodal Learning Analytics in real learning environments.
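Synchronizing streams sampled at different rates is the core data-handling problem behind a dashboard of this kind. The sketch below shows one common approach, nearest-timestamp alignment within a tolerance. It is not taken from Watch-DMLT or ViSeDOPS; the function names, the stream names, the sample values, and the 0.5-second tolerance are all assumptions made for illustration.

```python
from bisect import bisect_left

def nearest_sample(timestamps, values, t, tolerance):
    """Return the value whose timestamp is closest to t, or None if none is within tolerance."""
    i = bisect_left(timestamps, t)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(timestamps)]
    if not candidates:
        return None
    best = min(candidates, key=lambda j: abs(timestamps[j] - t))
    return values[best] if abs(timestamps[best] - t) <= tolerance else None

def align_streams(reference_times, streams, tolerance):
    """Build one row per reference time, holding the nearest sample from every stream."""
    rows = []
    for t in reference_times:
        row = {"t": t}
        for name, (timestamps, values) in streams.items():
            row[name] = nearest_sample(timestamps, values, t, tolerance)
        rows.append(row)
    return rows

# Hypothetical streams: timestamps in seconds, sorted in ascending order.
streams = {
    "heart_rate": ([0.0, 1.0, 2.0, 3.0], [72, 75, 80, 78]),
    "motion": ([0.1, 0.6, 1.1, 2.9], [0.02, 0.10, 0.30, 0.05]),
}
rows = align_streams([0.0, 1.0, 2.0], streams, tolerance=0.5)
```

Returning `None` when no sample falls within the tolerance keeps gaps visible rather than silently reusing stale readings, which matters when one device drops out during a session.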