cs.CL

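[1] Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR cs.CL

Haobo Xu, Sirui Chen, Ruizhong Qiu, Yuchen Yan, Chen Luo

TL;DR: This paper proposes ARRoL, an online pruning method that accelerates and improves training for Reinforcement Learning with Verifiable Rewards (RLVR). It prunes low-quality rollouts dynamically during generation, steers the surviving rollouts toward a more balanced mix to strengthen the learning signal, and relies on a system design that improves computational efficiency.

Motivation: Existing RLVR methods such as GRPO and DAPO are computationally expensive because they sample many rollouts per prompt. Their reward signal is also sparse (many sample groups are nearly all-correct or all-incorrect), which weakens the learning signal.

Result: On Qwen-3 and LLaMA-3.2 models (1B-8B), ARRoL improves average accuracy by +2.30 to +2.99 under GRPO and DAPO, speeds up training by up to 1.7x, and yields up to +8.33 additional average accuracy in test-time scaling.

Insight: The contributions are a lightweight quality head, trained online, that predicts the success probability of partial rollouts for early pruning; a system design that integrates pruning into the inference engine and re-batches the remaining rollouts; and use of the quality head to weight candidates at test time to improve inference accuracy.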

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, methods such as GRPO and DAPO suffer from substantial computational cost, since they rely on sampling many rollouts for each prompt. Moreover, in RLVR the relative advantage is often sparse: many samples become nearly all-correct or all-incorrect, yielding low within-group reward variance and thus weak learning signals. In this paper, we introduce ARRoL (Accelerating RLVR via online Rollout Pruning), an online rollout pruning method that prunes rollouts during generation while explicitly steering the surviving ones toward a more correctness-balanced mix to enhance learning signals. Specifically, ARRoL trains a lightweight quality head on-the-fly to predict the success probability of partial rollouts and uses it to make early pruning decisions. The learned quality head can further weigh candidates to improve inference accuracy during test-time scaling. To improve efficiency, we present a system design that prunes rollouts inside the inference engine and re-batches the remaining ones for log-probability computation and policy updates. Across GRPO and DAPO on Qwen-3 and LLaMA-3.2 models (1B-8B), ARRoL improves average accuracy by +2.30 to +2.99 while achieving up to 1.7x training speedup, and yields up to +8.33 additional gains in average accuracy in test-time scaling. The code is available at https://github.com/Hsu1023/ARRoL.
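
The sparsity argument in this abstract can be illustrated with a small sketch. It assumes the commonly used GRPO-style group normalization (reward minus group mean, divided by group standard deviation) and a purely hypothetical pruning rule, `prune_balanced`, that alternates between the highest- and lowest-scored partial rollouts. The abstract does not state the actual pruning criterion, so the rule below is an illustration of "correctness-balanced" survivors, not the authors' method.

```python
from statistics import pstdev


def group_advantages(rewards):
    """GRPO-style group-relative advantage for one prompt's rollouts."""
    mean = sum(rewards) / len(rewards)
    std = pstdev(rewards)
    if std == 0:
        # All-correct or all-incorrect group: no learning signal at all.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def prune_balanced(success_probs, keep):
    """Hypothetical rule: keep `keep` rollouts, alternating between the most
    and least promising ones so survivors mix likely-correct and
    likely-incorrect rollouts. `success_probs` stands in for the output of
    a quality head on partial rollouts."""
    order = sorted(range(len(success_probs)), key=lambda i: success_probs[i])
    kept, lo, hi, take_high = [], 0, len(order) - 1, True
    while len(kept) < keep and lo <= hi:
        if take_high:
            kept.append(order[hi])
            hi -= 1
        else:
            kept.append(order[lo])
            lo += 1
        take_high = not take_high
    return sorted(kept)


uniform_group = group_advantages([1, 1, 1, 1])   # zero variance
mixed_group = group_advantages([1, 0])           # balanced group
survivors = prune_balanced([0.9, 0.8, 0.1, 0.95, 0.2, 0.85], keep=4)
```

A group with identical rewards produces all-zero advantages, which is why steering the surviving rollouts toward a mix of outcomes can matter more than simply keeping the rollouts most likely to succeed.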


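[2] LLM-Driven Reasoning for Constraint-Aware Feature Selection in Industrial Systems cs.CL

Yuhang Zhou, Zhuokai Zhao, Ke Li, Spilios Evmorfos, Gökalp Demirci

TL;DR: This paper proposes Model Feature Agent (MoFA), a model-driven framework that uses large language models (LLMs) for reasoning-based feature selection. It combines semantic and quantitative feature information (importance scores, correlations, metadata) in structured prompts to perform interpretable, constraint-aware sequential feature selection, targeting industrial systems where labeled data are limited and multiple operational constraints must be met.

Motivation: Traditional feature selection relies on labeled data and statistical heuristics, which makes it hard to apply in production environments where labeled data are limited and multiple operational constraints (such as feature group complexity and inference efficiency) must be satisfied.

Result: MoFA was evaluated in three real-world industrial applications: True Interest and Time-Worthiness Prediction, Value Model Enhancement, and Notification Behavior Prediction. It improved model accuracy while reducing feature group complexity, discovered high-order interaction terms that yielded substantial engagement gains, and selected compact, high-value feature subsets that improved both accuracy and inference efficiency.

Insight: The novelty is bringing LLM-driven reasoning to feature selection: structured prompts integrate multi-source feature information (semantic and quantitative) into an interpretable sequential decision process that can flexibly handle multiple business constraints. This offers a new approach to automated feature engineering in heavily constrained industrial settings.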

Abstract: Feature selection is a crucial step in large-scale industrial machine learning systems, directly affecting model accuracy, efficiency, and maintainability. Traditional feature selection methods rely on labeled data and statistical heuristics, making them difficult to apply in production environments where labeled data are limited and multiple operational constraints must be satisfied. To address this, we propose Model Feature Agent (MoFA), a model-driven framework that performs sequential, reasoning-based feature selection using both semantic and quantitative feature information. MoFA incorporates feature definitions, importance scores, correlations, and metadata (e.g., feature groups or types) into structured prompts and selects features through interpretable, constraint-aware reasoning. We evaluate MoFA in three real-world industrial applications: (1) True Interest and Time-Worthiness Prediction, where it improves accuracy while reducing feature group complexity, (2) Value Model Enhancement, where it discovers high-order interaction terms that yield substantial engagement gains in online experiments, and (3) Notification Behavior Prediction, where it selects compact, high-value feature subsets that improve both model accuracy and inference efficiency. Together, these results demonstrate the practicality and effectiveness of LLM-based reasoning for feature selection in real production systems.


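[3] Closing the Confidence-Faithfulness Gap in Large Language Models cs.CL | cs.AI

Miranda Muqing Miao, Lyle Ungar

TL;DR: Using mechanistic interpretability, this paper reveals the geometric relationship between verbalized confidence and actual accuracy in large language models: calibration and verbalized confidence signals are both linearly encoded but orthogonal to each other. It also identifies a "Reasoning Contamination Effect", in which the reasoning process disrupts the verbalized confidence direction and worsens miscalibration. Building on this, the authors propose a two-stage adaptive steering pipeline that reads the model's internal accuracy estimate and steers the verbalized output to match it, substantially improving calibration alignment on all evaluated models.

Motivation: The verbalized confidence scores of large language models are often largely detached from their actual accuracy, yet the geometric relationship governing this behavior is unclear. The paper uses mechanistic interpretability to understand and close this confidence-faithfulness gap.

Result: Experiments on three open-weight models and four datasets show that the proposed two-stage adaptive steering pipeline substantially improves calibration alignment, with better confidence-accuracy matching on every evaluated model.

Insight: The novelty lies in showing that calibration and verbalized confidence signals are orthogonal in the model's linear representations, and that reasoning contaminates verbalized confidence. The adaptive steering method adjusts verbalized output directly from the internal accuracy estimate, offering an interpretable intervention for improving LLM calibration.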

Abstract: Large language models (LLMs) tend to verbalize confidence scores that are largely detached from their actual accuracy, yet the geometric relationship governing this behavior remains poorly understood. In this work, we present a mechanistic interpretability analysis of verbalized confidence, using linear probes and contrastive activation addition (CAA) steering to show that calibration and verbalized confidence signals are encoded linearly but are orthogonal to one another, a finding consistent across three open-weight models and four datasets. Interestingly, when models are prompted to simultaneously reason through a problem and verbalize a confidence score, the reasoning process disrupts the verbalized confidence direction, exacerbating miscalibration. We term this the “Reasoning Contamination Effect.” Leveraging this insight, we introduce a two-stage adaptive steering pipeline that reads the model’s internal accuracy estimate and steers verbalized output to match it, substantially improving calibration alignment across all evaluated models.
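
As a rough illustration of the tools named in this abstract, the sketch below builds a contrastive steering direction as the difference between mean activations of two contrasting example sets (the usual construction in contrastive activation addition), measures the cosine similarity between two such directions, and adds a scaled direction to a hidden state. The two-dimensional activations are toy values chosen for illustration, and the function names are assumptions; nothing here reproduces the paper's probes or its two-stage pipeline.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(positive, negative):
    """Contrastive direction: mean(positive) - mean(negative)."""
    return [p - q for p, q in zip(mean_vector(positive), mean_vector(negative))]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def steer(hidden, direction, alpha):
    """Add a scaled steering direction to a hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (hypothetical): dimension 0 tracks stated confidence,
# dimension 1 tracks whether the answer was actually correct.
confidence_dir = steering_vector([[2.0, 1.0], [4.0, 1.0]], [[0.0, 1.0], [2.0, 1.0]])
accuracy_dir = steering_vector([[1.0, 3.0], [1.0, 5.0]], [[1.0, 0.0], [1.0, 2.0]])
similarity = cosine(confidence_dir, accuracy_dir)
steered = steer([1.0, 1.0], confidence_dir, alpha=0.5)
```

A cosine similarity near zero between the two directions is what "orthogonal" means here: moving along the verbalized confidence direction leaves the accuracy readout unchanged, and vice versa.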


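[4] OMIND: Framework for Knowledge Grounded Finetuning and Multi-Turn Dialogue Benchmark for Mental Health LLMs cs.CL

Suraj Racha, Prashant Harish Joshi, Utkarsh Maurya, Nitin Yadav, Mridul Sharma

TL;DR: This paper presents the OMIND framework, which addresses challenges facing mental health LLMs: a lack of high-quality training data, limited training paradigms, and the difficulty of evaluating multi-turn dialogue. The framework includes a generation pipeline based on structured knowledge retrieval and LLM-based pruning, which produces a high-quality multi-task SFT dataset of about 164k examples, and introduces oMind-Chat, an expert-annotated multi-turn dialogue benchmark. Experiments show that oMind-LLM outperforms baselines on both core capabilities and conversational tasks, with markedly better reasoning and a win rate of up to 80%.

Motivation: Mental health is a growing global concern and LLMs have great potential in this area, but they face three challenges: insufficient high-quality, interpretable, knowledge-grounded training data; training paradigms limited to core capabilities; and the difficulty of evaluating multi-turn dialogue.

Result: In experiments on core capabilities and conversational tasks, oMind LLMs consistently outperform baselines, with significantly better reasoning and a win rate of up to 80%.

Insight: The contributions are the OMIND framework, which generates a high-quality multi-task SFT dataset through structured knowledge retrieval, LLM-based pruning, and review; the expert-annotated multi-turn benchmark oMind-Chat, which provides turn-level and conversation-level rubrics; and knowledge-grounded fine-tuning and multi-turn dialogue evaluation for mental health, which improve model adaptability and reasoning.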

Abstract: Large Language Models (LLMs) have shown remarkable capabilities for complex tasks, yet adaptation in the medical domain, specifically mental health, poses specific challenges. Mental health is a rising concern globally, and LLMs have large potential to help address it. We highlight three primary challenges for LLMs in mental health: a lack of high-quality, interpretable, and knowledge-grounded training data; training paradigms restricted to core capabilities; and evaluation of multi-turn dialogue settings. To address these, we present the oMind framework, which includes training and aligning LLM agents for diverse capabilities including conversations, and a high-quality ~164k multi-task SFT dataset produced by our generation pipeline based on structured knowledge retrieval, LLM-based pruning, and review actions. We also introduce oMind-Chat, a novel multi-turn benchmark dataset with expert-annotated turn-level and conversation-level rubrics. Our diverse experiments on both core capabilities and conversations show that oMind LLMs consistently outperform baselines. oMind-LLM also shows significantly better reasoning, with up to an 80% win rate.


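[5] Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models cs.CL

Hieu Xuan Le, Benjamin Goh, Quy Anh Tang

TL;DR: This paper proposes using lightweight, general-purpose large language models (LLMs) as security judges, combined with a Mixture-of-Models (MoM) strategy, to detect prompt attacks (such as jailbreaks and prompt injections) in low-latency production environments. Carefully designed prompts and output structures guide the LLM through structured reasoning: intent decomposition, safety-signal verification, harm assessment, and self-reflection. The method is evaluated on a dataset that combines benign queries from real chatbots with adversarial prompts generated by automated red teaming, and it is deployed as a centralized guardrail service for public service chatbots in Singapore.

Motivation: In production, lightweight classifiers and rule-based systems struggle under distribution shift, while high-capacity LLM-based judges are too slow and costly for real-time deployment. The paper examines whether lightweight, general-purpose LLMs can reliably serve as security judges under real-world production constraints.

Result: The evaluation shows that general-purpose LLMs such as gemini-2.0-flash-lite-001 can serve as effective low-latency judges for live guardrails. The method was tested on a curated dataset combining benign queries from real chatbots with adversarial prompts from automated red teaming. The Mixture-of-Models (MoM) setting brought only modest gains in detection performance.

Insight: The novelty is that careful prompt engineering (a structured reasoning process) lets lightweight LLMs make complex security judgments under low-latency constraints, closing the deployment gap. Viewed objectively, combining the LLM-as-a-Judge paradigm with production-grade latency requirements, and measuring the marginal benefit of model aggregation (MoM), is a system design with practical engineering value.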

Abstract: Prompt attacks, including jailbreaks and prompt injections, pose a critical security risk to Large Language Model (LLM) systems. In production, guardrails must mitigate these attacks under strict low-latency constraints, resulting in a deployment gap in which lightweight classifiers and rule-based systems struggle to generalize under distribution shift, while high-capacity LLM-based judges remain too slow or costly for live enforcement. In this work, we examine whether lightweight, general-purpose LLMs can reliably serve as security judges under real-world production constraints. Through careful prompt and output design, lightweight LLMs are guided through a structured reasoning process involving explicit intent decomposition, safety-signal verification, harm assessment, and self-reflection. We evaluate our method on a curated dataset combining benign queries from real-world chatbots with adversarial prompts generated via automated red teaming (ART), covering diverse and evolving patterns. Our results show that general-purpose LLMs, such as gemini-2.0-flash-lite-001, can serve as effective low-latency judges for live guardrails. This configuration is currently deployed in production as a centralized guardrail service for public service chatbots in Singapore. We additionally evaluate a Mixture-of-Models (MoM) setting to assess whether aggregating multiple LLM judges improves prompt-attack detection performance relative to single-model judges, with only modest gains observed.
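
The abstract does not say how the Mixture-of-Models setting combines judge outputs. One simple possibility, shown here purely as an assumption, is a majority vote over per-judge verdicts with ties resolved toward the stricter label. The label strings and the `aggregate_verdicts` helper are hypothetical.

```python
from collections import Counter


def aggregate_verdicts(verdicts, tie_break="attack"):
    """Majority vote over judge verdicts (assumed aggregation rule).

    `verdicts` holds one label per judge, e.g. "attack" or "benign".
    Ties fall back to `tie_break`, here the stricter label, so an evenly
    split panel blocks the prompt rather than letting it through.
    """
    ranked = Counter(verdicts).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_break
    return ranked[0][0]


clear_attack = aggregate_verdicts(["attack", "benign", "attack"])
split_panel = aggregate_verdicts(["benign", "attack"])
clear_benign = aggregate_verdicts(["benign", "benign", "attack"])
```

Under a rule like this, extra judges only change the outcome when the single best judge is wrong and the others outvote it, which is one plausible reading of why aggregation might add little over a strong single-model judge.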


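[6] Cross-Preference Learning for Sentence-Level and Context-Aware Machine Translation cs.CL

Ying Li, Xinglin Lyu, Junhui Li, Jinlong Yang, Hengchao Shang

TL;DR: This paper proposes Cross-Preference Learning (CPL), a preference-optimization training framework that addresses the unstable performance of context-aware machine translation (MT), which does not consistently outperform sentence-level MT. By integrating intra- and cross-condition preferences, the framework explicitly models when and how contextual information should be used to improve translation quality.

Motivation: Context-aware MT uses document-level information, but because contextual signals benefit sentences unevenly, it does not always beat sentence-level translation. Existing training objectives do not model this variability explicitly, which limits a model's ability to exploit context adaptively.

Result: Validated on several public context-aware MT tasks with models including Qwen3-4B, Qwen3-8B, and Llama-3-8B, the method consistently improves translation quality and robustness under both input conditions without any change to the model architecture.

Insight: The novelty is a unified preference-optimization framework in which intra- and cross-condition preferences give the model explicit supervision about when context helps, so that it adaptively combines the strengths of sentence-level and context-aware translation.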

Abstract: Context-aware machine translation (MT) leverages document-level information, yet it does not consistently outperform sentence-level MT, as contextual signals are unevenly beneficial across sentences. Existing training objectives do not explicitly model this variability, limiting a model’s ability to adaptively exploit context. In this paper, we propose Cross-Preference Learning (CPL), a preference-based training framework that explicitly captures the complementary benefits of sentence-level and context-aware MT. CPL achieves this by integrating both intra- and cross-condition preferences into the preference optimization objective. The introduction of intra- and cross-condition preferences provides explicit supervision on when and how contextual information improves translation quality. We validate the proposed approach on several public context-aware MT tasks using multiple models, including Qwen3-4B, Qwen3-8B, and Llama-3-8B. Experimental results demonstrate consistent improvements in translation quality and robustness across both input conditions, achieved without any architectural modifications.


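[7] A Catalog of Basque Dialectal Resources: Online Collections and Standard-to-Dialectal Adaptations cs.CL

Jaione Bengoetxea, Itziar Gonzalez-Dios, Rodrigo Agerri

TL;DR: This paper builds a catalog of Basque dialectal resources, systematically collecting both data originally written in dialect online and standard-to-dialect adapted data. The catalog covers online resources such as news, tweets, and dictionaries, as well as manually and automatically adapted evaluation datasets.

Motivation: Data scarcity is a key limitation in dialectal natural language processing. The paper aims to ease it by compiling Basque dialectal resources.

Result: The test split of the XNLI dataset was manually adapted, producing a high-quality parallel evaluation dataset for three Basque dialects. The automatically adapted physical commonsense dataset (BasPhyCowest) was manually evaluated to check whether it is a viable silver-data substitute.

Insight: The novelty is the distinction between data originally written in dialect online and standard-to-dialect adapted data, together with manual and automatic adaptation and their evaluation, which gives dialectal NLP a systematic framework for building data.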

Abstract: Recent research on dialectal NLP has identified data scarcity as a primary limitation. To address this limitation, this paper presents a catalog of contemporary Basque dialectal data and resources, offering a systematic and comprehensive compilation of the dialectal data currently available in Basque. Two types of data sources have been distinguished: online data originally written in some dialect, and standard-to-dialect adapted data. The former includes all dialectal data that can be found online, such as news and radio sites, informal tweets, as well as online resources such as dictionaries, atlases, grammar rules, or videos. The latter consists of data that has been adapted from the standard variety to dialectal varieties, either manually or automatically. Regarding the manual adaptation, the test split of the XNLI Natural Language Inference dataset was manually adapted into three Basque dialects: Western, Central, and Navarrese-Lapurdian, yielding a high-quality parallel gold standard evaluation dataset. With respect to the automatic dialectal adaptation, the automatically adapted physical commonsense dataset (BasPhyCowest) underwent additional manual evaluation by native speakers to assess its quality and determine whether it could serve as a viable substitute for full manual adaptation (i.e., silver data creation).


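[8] SafeMath: Inference-time Safety improves Math Accuracy cs.CL | cs.CY

Sagnik Basu, Subhrajit Mitra, Aman Juneja, Somnath Banerjee, Rima Hazra

TL;DR: This paper studies how large language models can propagate harmful content through mathematical word problems. It introduces the ToxicGSM dataset for systematic evaluation and develops SafeMath, a method that improves safety at inference time while preserving mathematical accuracy.

Motivation: LLMs can produce harmful, biased, or policy-violating outputs when given adversarial or seemingly benign inputs, in particular math word problems with harmful context embedded in them. The risk is higher in educational settings involving children.

Result: Evaluating existing LLMs on ToxicGSM reveals a trade-off between safety and mathematical correctness. The proposed SafeMath method reduces harmful outputs and in some cases improves mathematical reasoning, achieving safety alignment without sacrificing accuracy.

Insight: The novelty is decoupling linguistic harm from the mathematical reasoning task: an inference-time safety alignment technique (SafeMath) suppresses harmful content while maintaining, or even improving, the model's math ability, offering a new approach to safe and reliable AI applications.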

Abstract: Recent research points toward LLMs being manipulated through adversarial and seemingly benign inputs, resulting in harmful, biased, or policy-violating outputs. In this paper, we study an underexplored issue concerning harmful and toxic mathematical word problems. We show that math questions, particularly those framed as natural language narratives, can serve as a subtle medium for propagating biased, unethical, or psychologically harmful content, with heightened risks in educational settings involving children. To support a systematic study of this phenomenon, we introduce ToxicGSM, a dataset of 1.9k arithmetic problems in which harmful or sensitive context is embedded while preserving mathematically well-defined reasoning tasks. Using this dataset, we audit the behaviour of existing LLMs and analyse the trade-offs between safety enforcement and mathematical correctness. We further propose SafeMath – a safety alignment technique that reduces harmful outputs while maintaining, and in some cases improving, mathematical reasoning performance. Our results highlight the importance of disentangling linguistic harm from math reasoning and demonstrate that effective safety alignment need not come at the cost of accuracy. We release the source code and dataset at https://github.com/Swagnick99/SafeMath/tree/main.


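[9] MolQuest: A Benchmark for Agentic Evaluation of Abductive Reasoning in Chemical Structure Elucidation cs.CL | cs.AI

Taolin Han, Shuang Wu, Jinghang Wang, Yuhao Zhou, Renquan Lv

TL;DR: This paper presents MolQuest, an evaluation framework for molecular structure elucidation built on real chemical experimental data. It systematically evaluates the abductive reasoning and strategic decision-making of large language models on complex scientific tasks. The framework formalizes structure elucidation as a multi-turn interactive task in which the model must proactively plan experimental steps, integrate heterogeneous spectral sources, and iteratively refine structural hypotheses.

Motivation: Existing scientific benchmarks rely mainly on static, single-turn question answering, which cannot measure performance on complex scientific tasks that require multi-step iteration and experimental interaction. An evaluation framework closer to the dynamics of real research is needed.

Result: Empirical results on MolQuest show that even state-of-the-art models reach only about 50% accuracy, while most other models remain below the 30% threshold, revealing significant limitations of current frontier models in authentic scientific scenarios.

Insight: The novelty is framing molecular structure elucidation as an agent-based, multi-turn interactive evaluation task that systematically measures LLMs' abductive reasoning and strategic decision-making. This gives science-oriented LLM evaluation a reproducible and extensible framework and points to the gap in strategic scientific reasoning that future research needs to close.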

Abstract: Large language models (LLMs) hold considerable potential for advancing scientific discovery, yet systematic assessment of their dynamic reasoning in real-world research remains limited. Current scientific evaluation benchmarks predominantly rely on static, single-turn Question Answering (QA) formats, which are inadequate for measuring model performance in complex scientific tasks that require multi-step iteration and experimental interaction. To address this gap, we introduce MolQuest, a novel agent-based evaluation framework for molecular structure elucidation built upon authentic chemical experimental data. Unlike existing datasets, MolQuest formalizes molecular structure elucidation as a multi-turn interactive task, requiring models to proactively plan experimental steps, integrate heterogeneous spectral sources (e.g., NMR, MS), and iteratively refine structural hypotheses. This framework systematically evaluates LLMs’ abductive reasoning and strategic decision-making abilities within a vast and complex chemical space. Empirical results reveal that contemporary frontier models exhibit significant limitations in authentic scientific scenarios: notably, even state-of-the-art (SOTA) models achieve an accuracy of only approximately 50%, while the performance of most other models remains below the 30% threshold. This work provides a reproducible and extensible framework for science-oriented LLM evaluation. Our findings highlight the critical gap in current LLMs’ strategic scientific reasoning, setting a clear direction for future research toward AI that can actively participate in the scientific process.


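[10] CRAFT: Grounded Multi-Agent Coordination Under Partial Information cs.CL | cs.AI

Abhijnan Nath, Hannah VanderHoeven, Nikhil Krishnaswamy

TL;DR: The paper introduces CRAFT, a benchmark for evaluating pragmatic communication in large language models under strict partial information. Multiple agents with complementary but incomplete views must coordinate through natural language to build a shared 3D structure that no single agent can fully observe. The study finds that stronger reasoning does not always lead to better coordination and that smaller open-weight models sometimes match or outperform frontier systems, which suggests that multi-agent coordination remains a fundamental challenge for current language models.

Motivation: The goal is to evaluate the pragmatic communication ability of large language models in multi-agent coordination tasks, particularly under strict partial information, and to examine how agents can collaborate through natural language toward a shared goal.

Result: The evaluation covers a diverse set of models, including 8 open-weight models and 7 frontier models (reasoning models among them). Stronger reasoning ability does not reliably translate into better coordination; smaller open-weight models often match or outperform frontier systems, and improved individual communication does not guarantee successful collaboration.

Insight: The novelty is a diagnostic framework, formalized as a multi-sender pragmatic reasoning task, that decomposes failures into spatial grounding, belief modeling, and pragmatic communication errors, and that provides a taxonomy of behavioral failure profiles. Viewed objectively, the study shows that multi-agent coordination is not directly tied to individual model capability, which offers a new perspective on the limits of language models in collaborative settings.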

Abstract: We introduce CRAFT, a multi-agent benchmark for evaluating pragmatic communication in large language models under strict partial information. In this setting, multiple agents with complementary but incomplete views must coordinate through natural language to construct a shared 3D structure that no single agent can fully observe. We formalize this problem as a multi-sender pragmatic reasoning task and provide a diagnostic framework that decomposes failures into spatial grounding, belief modeling and pragmatic communication errors, including a taxonomy of behavioral failure profiles in both frontier and open-weight models. Across a diverse set of models, 8 open-weight and 7 frontier (including reasoning models), we find that stronger reasoning ability does not reliably translate to better coordination: smaller open-weight models often match or outperform frontier systems, and improved individual communication does not guarantee successful collaboration. These results suggest that multi-agent coordination remains a fundamentally unsolved challenge for current language models. Our code can be found at https://github.com/csu-signal/CRAFT


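[11] When Hate Meets Facts: LLMs-in-the-Loop for Check-worthiness Detection in Hate Speech cs.CL

Nicolás Benjamín Ocampo, Tommaso Caselli, Davide Ceolin

TL;DR: Online hate speech often carries fact-like misinformation. To address this, the paper releases WSF-ARG+, the first dataset that combines hate speech with check-worthiness information, and introduces a novel LLM-in-the-loop framework to help annotate check-worthy claims. Tests with 12 open-weight LLMs of different sizes and architectures, together with extensive human evaluation, show that the framework reduces human effort without lowering annotation quality. The study also finds that hate speech containing check-worthy claims shows higher levels of harassment and hate, and that adding check-worthiness labels to LLM-based hate speech detection significantly improves performance.

Motivation: Online hate speech often appears as fact-like misinformation, which creates a double challenge for content moderation (harmfulness and veracity must both be assessed) and can deepen prejudice, reinforce harmful stereotypes, and pollute public debate. Existing work does not address hate speech and misinformation jointly, so new datasets and methods are needed.

Result: On the WSF-ARG+ dataset, human evaluation confirms that the LLM-in-the-loop framework reduces human effort without harming annotation quality. Hate speech containing check-worthy claims exhibits significantly higher levels of harassment and hate. Integrating check-worthiness labels into LLM-based hate speech detection raises the macro-F1 of large models by 0.154 on average and by up to 0.213.

Insight: The paper is the first to create a dataset combining hate speech with check-worthiness information, and it proposes an LLM-in-the-loop framework to annotate such data efficiently. Viewed objectively, the study links hate speech detection with fact-checking needs, gives content moderation a finer-grained analysis tool, and shows that modeling the two tasks jointly improves detection, offering a new direction for multimodal, multi-task content safety research.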

Abstract: Hateful content online is often expressed using fact-like, not necessarily correct information, especially in coordinated online harassment campaigns and extremist propaganda. Failing to jointly address hate speech (HS) and misinformation can deepen prejudice, reinforce harmful stereotypes, and expose bystanders to psychological distress, while polluting public debate. Moreover, these messages require more effort from content moderators because they must assess both harmfulness and veracity, i.e., fact-check them. To address this challenge, we release WSF-ARG+, the first dataset that combines hate speech with check-worthiness information. We also introduce a novel LLM-in-the-loop framework to facilitate the annotation of check-worthy claims. We run our framework, testing it with 12 open-weight LLMs of different sizes and architectures. We validate it through extensive human evaluation, and show that our LLM-in-the-loop framework reduces human effort without compromising the annotation quality of the data. Finally, we show that HS messages with check-worthy claims exhibit significantly higher levels of harassment and hate, and that incorporating check-worthiness labels improves LLM-based HS detection by up to 0.213 macro-F1, and by 0.154 macro-F1 on average for large models.
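
Since the reported gains are stated in macro-F1, a short reference computation may help when reading them. Macro-F1 is the unweighted mean of per-class F1 scores, so a minority class counts as much as the majority class. The labels in the example are placeholders, not the dataset's actual label set.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold = ["hate", "hate", "none", "none"]
predicted = ["hate", "none", "none", "none"]
score = macro_f1(gold, predicted)   # per-class F1: hate = 2/3, none = 4/5
```

On this scale, which runs from 0 to 1, the improvement of 0.154 on average reported for large models is a substantial shift.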


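[12] Separate Before You Compress: The WWHO Tokenization Architecture cs.CL

Kusal Darshana

TL;DR: This paper proposes WWHO, a three-layer tokenization architecture, and the SGPE algorithm. They target a weakness of conventional BPE tokenizers on complex Abugida scripts (such as Sinhala and Devanagari): multi-codepoint conjuncts are split into meaningless sub-character units, which lowers reasoning efficiency and raises cost. By separating the script's linguistic rules from the statistical compression process, the method enables seamless multilingual tokenization, substantially reduces token counts, and extends the usable context window.

Motivation: Conventional BPE tokenizers break the conjunct structure of complex Abugida scripts, forcing large language models to learn basic orthography at inference time. This lowers reasoning efficiency and raises cost, imposing a significant "Token Tax" on the Global South.

Result: On Sinhala, SGPE achieves a Token to Word Ratio of 1.274 with 4.83 characters per token, 61.7% fewer tokens than OpenAI's o200k base tokenizer. On Hindi, the Token to Word Ratio is 1.181 (a 27.0% reduction). On the mixed-script dataset, tokens are reduced by 36.7%, 39.6%, and 60.2% relative to o200k base, Llama 4 Scout, and DeepSeek V3, respectively. This extends the usable context window by up to 4.38x while upholding a "Linguistic Zero-Breakage Guarantee".

Insight: The novelty is the WWHO three-layer architecture and the SGPE algorithm, which separate linguistic rules from statistical compression to tokenize complex scripts more efficiently and without loss. The "Linguistic Zero-Breakage Guarantee" ensures that no valid syllable is split across multiple tokens, offering a new approach to tokenizer design for multilingual large language models.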

Abstract: Current Large Language Models (LLMs) mostly use BPE (Byte Pair Encoding) based tokenizers, which are very effective for simple structured Latin scripts such as English. However, standard BPE tokenizers struggle to process complex Abugida scripts due to their structural complexity. The problem is that these tokenizers break complex conjuncts, which are multi-codepoint grapheme clusters, into meaningless sub-character units. This degrades the LLM’s reasoning efficiency by forcing it to learn basic orthographic structures at inference time and raises inference costs, resulting in a significant “Token Tax” for the Global South. We propose a new three-layer architecture, the WWHO (Where-What-How Often), and an algorithm named SGPE (Syllable-aware Grapheme Pair Encoding) that separates the linguistic rules of the script from the statistical compression process while enabling seamless multilingual tokenization. Using Sinhala and Devanagari (Hindi/Sanskrit) as highly complex Abugida scripts, we trained WWHO on a cleaned 30-million-sentence dataset and evaluated on a 1,499,950-sentence test set. For Sinhala, SGPE achieves a Token to Word Ratio (TWR) of 1.274 with 4.83 characters per token, representing a 61.7 percent reduction in tokens compared to OpenAI’s o200k base. For Hindi, it achieves a TWR of 1.181 (27.0 percent reduction vs o200k). On the mixed-script (Sinhala, Devanagari, and English) dataset, SGPE achieves an overall TWR of 1.240, representing token reductions of 36.7 percent, 39.6 percent, and 60.2 percent relative to o200k base, Llama 4 Scout, and DeepSeek V3, respectively. This effectively extends the usable context window by up to 4.38 times for these Abugida languages while ensuring a Linguistic Zero-Breakage Guarantee, which ensures that no valid syllable is ever split across multiple tokens.
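
The metrics quoted in this abstract are simple ratios, sketched below. The helper names are assumptions, and the conversion from token reduction to context-window extension, 1 / (1 - reduction), is the standard arithmetic for a fixed token budget rather than a formula stated in the abstract. The raw counts are made-up values chosen to match the reported Sinhala ratio.

```python
def token_word_ratio(num_tokens, num_words):
    """Token to Word Ratio (TWR): average tokens spent per word."""
    return num_tokens / num_words


def token_reduction(baseline_tokens, new_tokens):
    """Fraction of tokens saved relative to a baseline tokenizer."""
    return 1 - new_tokens / baseline_tokens


def context_extension(reduction):
    """How much more text fits in a fixed token budget after the reduction."""
    return 1 / (1 - reduction)


twr = token_word_ratio(1274, 1000)      # hypothetical counts giving TWR 1.274
saved = token_reduction(1000, 383)      # hypothetical counts giving 61.7% saved
gain = context_extension(0.617)         # about 2.61x more text per window
```

By this arithmetic, the 61.7 percent reduction reported for Sinhala against o200k base corresponds to roughly 2.6 times as much text per context window. The abstract's peak figure of 4.38 times is not reproduced here, since the abstract does not say which baseline it refers to.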


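[13] Large Language Model as Token Compressor and Decompressor cs.CL

Wenbing Li, Zikai Song, Jielei Zhang, Tianhao Zhao, Junkai Lin

TL;DR: This paper uses an off-the-shelf large language model (LLM) as a token compressor and decompressor. A self-expressive autoencoding learning framework converts long text into compact, discrete, variable-length latent codes (Z-tokens) and reconstructs the original text exactly. The method achieves up to 18x token compression on several datasets while preserving reconstruction fidelity and downstream task performance.

Motivation: The paper addresses poor token efficiency in long-text processing and explores the potential of LLMs to compress and decompress text in support of long-context reasoning.

Result: On Wikipedia, CNN/DailyMail, HotpotQA, and Qulac-style long-query datasets, the method achieves up to 18x token compression while preserving reconstruction fidelity and downstream performance, reaching a state-of-the-art level of efficient compression.

Insight: The novelty is using an off-the-shelf LLM as both token compressor and decompressor. Content-adaptive Z-tokens compress semantically dense segments and redundant regions at different rates, and the design supports prompt compression and autoregressive generation in Z-token space, offering a new route to token-efficient long-context reasoning.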

Abstract: In this paper, we establish the novel insight that an off-the-shelf LLM can function as an excellent token compressor and decompressor. To demonstrate, we design a self-expressive autoencoding learning framework that fine-tunes a pretrained LLM to translate long texts into a compact internal language of discrete, variable-length latent codes, termed Z-tokens, and to reconstruct the original text exactly from them. The resulting representation is content-adaptive: semantically dense segments receive more Z-tokens, while redundant or predictable regions are aggressively compressed, via lightweight LoRA-based adapter heads. Empirically, our method achieves up to 18 times token reduction on Wikipedia, CNN/DailyMail, HotpotQA, and Qulac-style long-query datasets, while preserving reconstruction fidelity and downstream performance. This simple yet effective design supports applications including prompt compression and autoregressive generation directly in the Z-token space, offering a potential pathway toward token-efficient long-context reasoning.


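[14] TAPO: Translation Augmented Policy Optimization for Multilingual Mathematical Reasoning cs.CL

Xu Huang, Zhejian Lai, Zixian Huang, Jiajun Chen, Shujian Huang

TL;DR: This paper proposes TAPO (Translation-Augmented Policy Optimization), a reinforcement learning framework for large language models that reason well about mathematics in English but degrade in multilingual settings because of weak language understanding. Built on GRPO, the framework uses English as a pivot, follows an "understand-then-reason" paradigm, and uses translation quality rewards to strengthen multilingual mathematical reasoning.

Motivation: The goal is to close the performance gap between English and multilingual mathematical reasoning in large language models, a gap caused mainly by deficiencies in language understanding.

Result: Across extensive experiments, TAPO outperforms baseline methods on both multilingual mathematical reasoning and translation tasks, and generalizes well to unseen languages and out-of-domain tasks.

Insight: The novelty is an explicit alignment strategy that uses English as a pivot, together with a step-level relative advantage mechanism that decouples understanding from reasoning. This lets translation quality rewards be integrated without introducing optimization conflicts, combines language understanding with reasoning effectively, and is compatible with various models.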

Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in English mathematical reasoning, yet a significant performance disparity persists in multilingual contexts, largely attributed to deficiencies in language understanding. To bridge this gap, we introduce Translation-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework built upon GRPO. TAPO enforces an explicit alignment strategy where the model leverages English as a pivot and follows an understand-then-reason paradigm. Crucially, we employ a step-level relative advantage mechanism that decouples understanding from reasoning, allowing the integration of translation quality rewards without introducing optimization conflicts. Extensive experiments reveal that TAPO effectively synergizes language understanding with reasoning capabilities and is compatible with various models. It outperforms baseline methods in both multilingual mathematical reasoning and translation tasks, while generalizing well to unseen languages and out-of-domain tasks.
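
One way to picture a step-level relative advantage that decouples understanding from reasoning is to normalize each reward within the rollout group separately and credit each advantage only to its own step. The sketch below assumes GRPO-style group normalization and assumes that each rollout has a translation step and a reasoning step with separate scalar rewards. The abstract does not give the exact formulation, so this is an illustration rather than TAPO's definition.

```python
from statistics import pstdev


def normalize(values):
    """Group-relative normalization: (v - mean) / std, or zeros if std is 0."""
    mean = sum(values) / len(values)
    std = pstdev(values)
    return [0.0 if std == 0 else (v - mean) / std for v in values]


def step_level_advantages(translation_rewards, reasoning_rewards):
    """Return one (understanding_advantage, reasoning_advantage) pair per rollout.

    Each reward is normalized only against the same step in the other
    rollouts, so a poor translation score never lowers the advantage
    assigned to the reasoning tokens, and vice versa.
    """
    return list(zip(normalize(translation_rewards), normalize(reasoning_rewards)))


# Two rollouts: translations differ in quality, final answers are both wrong.
advantages = step_level_advantages([1.0, 0.0], [0.0, 0.0])
```

In this toy group the translation step still receives a learning signal even though both final answers earn the same reward. A single summed reward would provide the same signal here but would spread it across reasoning tokens as well.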


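[15] Humans vs Vision-Language Models: A Unified Measure of Narrative Coherence cs.CL

Nikolai Ilinykh, Hyewon Jang, Shalom Lappin, Asad Sayeed, Sharid Loáiciga

TL;DR: This paper studies coherence in visually grounded narratives by comparing stories written by humans with stories generated by vision-language models (VLMs) on the Visual Writing Prompts corpus. The authors compute a narrative coherence score from a set of metrics covering coreference, discourse relation types, topic continuity, character persistence, and multimodal character grounding. VLMs show coherence profiles that differ systematically from those of humans: surface fluency is similar, but discourse organization across a visually grounded story is not.

Motivation: The study evaluates how coherent VLM-generated visually grounded narratives are, and compares them with human narratives to expose weaknesses in the models' deeper discourse organization.

Result: Experiments on the Visual Writing Prompts corpus show that VLM coherence profiles differ systematically from human ones. Differences on individual measures can be subtle, but they become clearer when the measures are considered jointly.

Insight: The novelty is a unified framework for measuring narrative coherence that combines multiple dimensions (such as coreference and discourse relations) and reveals systematic weaknesses in how VLMs organize narratives, giving model evaluation a new perspective.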

Abstract: We study narrative coherence in visually grounded stories by comparing human-written narratives with those generated by vision-language models (VLMs) on the Visual Writing Prompts corpus. Using a set of metrics that capture different aspects of narrative coherence, including coreference, discourse relation types, topic continuity, character persistence, and multimodal character grounding, we compute a narrative coherence score. We find that VLMs show broadly similar coherence profiles that differ systematically from those of humans. In addition, differences for individual measures are often subtle, but they become clearer when considered jointly. Overall, our results indicate that, despite human-like surface fluency, model narratives exhibit systematic differences from those of humans in how they organise discourse across a visually grounded story. Our code is available at https://github.com/GU-CLASP/coherence-driven-humans.


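[16] Self-Improvement of Large Language Models: A Technical Overview and Future Outlook cs.CL

Haoyan Yang, Mario Xerri, Solha Park, Huajian Zhang, Yiyang Feng

TL;DR: This paper offers a systematic technical overview of, and outlook on, self-improvement in large language models (LLMs). The authors argue that improving LLMs through human supervision alone is costly and scales poorly, while growing model autonomy makes it possible to automate parts of the development process. They conceptualize self-improvement as a closed-loop lifecycle of four tightly coupled processes (data acquisition, data selection, model optimization, and inference refinement) and add an autonomous evaluation layer that monitors and guides the whole cycle. Within this framework, the paper reviews and analyzes representative techniques for each component and discusses current limitations and future research directions.

Motivation: Improving LLMs through human supervision alone is expensive, limited in scalability, and may provide feedback that is not informative enough. The paper explores how the growing autonomy of models can be used to automate iterative self-improvement.

Result: This is a survey paper and reports no quantitative experiments or benchmark rankings. Its main outcome is a unified framework for organizing and analyzing existing LLM self-improvement techniques.

Insight: The novelty is a system-level, unified framework that treats LLM self-improvement as a closed-loop lifecycle and spells out its core components (data acquisition, data selection, model optimization, inference refinement, and autonomous evaluation) and how they are coupled, giving the field a structured view. Viewed objectively, the framework helps organize scattered techniques and identify what must still be solved to reach fully self-improving LLMs.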

Abstract: As large language models (LLMs) continue to advance, improving them solely through human supervision is becoming increasingly costly and limited in scalability. As models approach human-level capabilities in certain domains, human feedback may no longer provide sufficiently informative signals for further improvement. At the same time, the growing ability of models to make autonomous decisions and execute complex actions naturally enables abstractions in which components of the model development process can be progressively automated. Together, these challenges and opportunities have driven increasing interest in self-improvement, where models autonomously generate data, evaluate outputs, and iteratively refine their own capabilities. In this paper, we present a system-level perspective on self-improving language models and introduce a unified framework that organizes existing techniques. We conceptualize the self-improvement system as a closed-loop lifecycle, consisting of four tightly coupled processes: data acquisition, data selection, model optimization, and inference refinement, along with an autonomous evaluation layer. Within this framework, the model itself plays a central role in driving each stage: collecting or generating data, selecting informative signals, updating its parameters, and refining outputs, while the autonomous evaluation layer continuously monitors progress and guides the improvement cycle across stages. Following this lifecycle perspective, we systematically review and analyze representative methods for each component from a technical standpoint. We further discuss current limitations and outline our vision for future research toward fully self-improving LLMs.


cs.CV

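[17] MedOpenClaw: Auditable Medical Imaging Agents Reasoning over Uncurated Full Studies cs.CV

Weixiang Shen, Yanzhu Hu, Che Liu, Junde Wu, Jiayuan Zhu

TL;DR: The paper proposes MEDOPENCLAW, an auditable runtime that lets vision-language models (VLMs) operate dynamically and interactively inside standard medical imaging tools such as 3D Slicer. It also introduces MEDFLOWBENCH, a full-study medical imaging benchmark covering multi-sequence brain MRI and lung CT/PET, for systematic evaluation of medical agent capabilities.

Motivation: Current evaluation of VLMs on medical imaging tasks is oversimplified. It relies on manually pre-selected 2D images and ignores the core challenge of real clinical diagnosis: an agent must actively navigate full 3D volumes across sequences or modalities to gather evidence and reach a final decision.

Result: Initial results show that state-of-the-art LLMs/VLMs (such as Gemini 3.1 Pro and GPT-5.4) can navigate the viewer to solve basic study-level tasks, but their performance drops when they are given access to professional support tools, because they lack precise spatial grounding.

Insight: The main contribution is that MEDOPENCLAW and MEDFLOWBENCH bridge static-image perception and interactive clinical workflows, establishing a reproducible foundation for developing auditable, full-study medical imaging agents. Viewed objectively, the work extends agent evaluation from static images to dynamic, multimodal full studies, and its finding that current advanced models lack spatial grounding when using tools is an important research insight.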

Abstract: Currently, evaluating vision-language models (VLMs) in medical imaging tasks oversimplifies clinical reality by relying on pre-selected 2D images that demand significant manual labor to curate. This setup misses the core challenge of real-world diagnostics: a true clinical agent must actively navigate full 3D volumes across multiple sequences or modalities to gather evidence and ultimately support a final decision. To address this, we propose MEDOPENCLAW, an auditable runtime designed to let VLMs operate dynamically within standard medical tools or viewers (e.g., 3D Slicer). On top of this runtime, we introduce MEDFLOWBENCH, a full-study medical imaging benchmark covering multi-sequence brain MRI and lung CT/PET. It systematically evaluates medical agentic capabilities across viewer-only, tool-use, and open-method tracks. Initial results reveal a critical insight: while state-of-the-art LLMs/VLMs (e.g., Gemini 3.1 Pro and GPT-5.4) can successfully navigate the viewer to solve basic study-level tasks, their performance paradoxically degrades when given access to professional support tools due to a lack of precise spatial grounding. By bridging the gap between static-image perception and interactive clinical workflows, MEDOPENCLAW and MEDFLOWBENCH establish a reproducible foundation for developing auditable, full-study medical imaging agents.


[18] From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition cs.CVPDF

Francesco Gentile, Nicola Dall’Asen, Francesco Tonini, Massimiliano Mancini, Lorenzo Vaquero

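TL;DR: The paper proposes SITH, a data-free, training-free interpretability framework that analyzes the weight space of CLIP's vision transformer directly via singular value decomposition. It decomposes each attention head's value-output matrix into singular vectors and uses the COMP algorithm to explain them as sparse, semantically coherent combinations of human-interpretable concepts, enabling fine-grained model explanations and interpretable weight-space edits.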

Details

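Motivation: Existing interpretability methods rely mainly on activations, which makes them dataset-dependent, vulnerable to data bias, and usually limited to coarse head-level explanations. A method is needed that requires no data and analyzes the model's internal mechanisms directly in weight space.

Result: Reconstruction-fidelity and interpretability experiments confirm that SITH yields coherent, faithful intra-head explanations. It also supports precise, interpretable weight-space model edits that improve downstream performance without retraining.

Insight: The novelty lies in SITH, a fully data-free and training-free interpretability framework, together with the COMP algorithm, which yield fine-grained explanations that map weights to concepts. Viewed objectively, the method shows that fine-tuning mainly reweights a stable semantic basis rather than learning entirely new features, offering a new perspective on how models adapt.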

Abstract: As vision-language models are deployed at scale, understanding their internal mechanisms becomes increasingly critical. Existing interpretability methods predominantly rely on activations, making them dataset-dependent, vulnerable to data bias, and often restricted to coarse head-level explanations. We introduce SITH (Semantic Inspection of Transformer Heads), a fully data-free, training-free framework that directly analyzes CLIP’s vision transformer in weight space. For each attention head, we decompose its value-output matrix into singular vectors and interpret each one via COMP (Coherent Orthogonal Matching Pursuit), a new algorithm that explains them as sparse, semantically coherent combinations of human-interpretable concepts. We show that SITH yields coherent, faithful intra-head explanations, validated through reconstruction fidelity and interpretability experiments. This allows us to use SITH for precise, interpretable weight-space model edits that amplify or suppress specific concepts, improving downstream performance without retraining. Furthermore, we use SITH to study model adaptation, showing how fine-tuning primarily reweights a stable semantic basis rather than learning entirely new features.
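The weight-space decomposition described in this abstract can be illustrated with a small sketch. It assumes that the "value-output matrix" of a head is the product of its value and output projections (here the hypothetical `w_v` and `w_o`), and it recovers only the leading singular triplet by power iteration so that the example stays dependency-free; in practice `numpy.linalg.svd` would return all singular vectors at once. This is not the authors' code, and COMP is not reproduced here.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def top_singular_triplet(m, iters=200):
    """Leading singular value and vectors of m via power iteration.

    Assumes the uniform start vector is not orthogonal to the leading
    right singular vector (true for generic matrices).
    """
    mt = [list(col) for col in zip(*m)]          # transpose of m
    v = [1.0 / math.sqrt(len(mt))] * len(mt)     # unit-norm start vector
    sigma, u = 0.0, [0.0] * len(m)
    for _ in range(iters):
        u = matvec(m, v)
        sigma = norm(u)
        if sigma == 0.0:
            break
        u = [x / sigma for x in u]
        v = matvec(mt, u)
        nv = norm(v)
        v = [x / nv for x in v]
    return sigma, u, v

# Hypothetical toy head: w_v maps model space to head space, w_o maps back.
w_v = [[2.0, 0.0], [0.0, 1.0]]
w_o = [[1.0, 0.0], [0.0, 0.5]]
sigma, u, v = top_singular_triplet(matmul(w_v, w_o))
```

Each singular vector obtained this way lives in the model's embedding space, which is what makes it possible to compare it against concept embeddings without running any data through the network.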


[19] ReDiPrune: Relevance-Diversity Pre-Projection Token Pruning for Efficient Multimodal LLMs cs.CVPDF

An Yu, Ting Yu Tsai, Zhenfei Zhang, Weiheng Lu, Felix X. -F. Ye

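TL;DR: The paper proposes ReDiPrune, a training-free visual token pruning method that makes multimodal large language models more efficient. Operating before the vision-language projector, it selects informative visual tokens directly, scoring them by text-conditioned relevance and max-min diversity, to reduce the number of tokens the Transformer must process.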

Details

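Motivation: Current multimodal large language models are computationally expensive, mainly because the Transformer must process a large number of visual tokens. Existing methods usually prune after projection, which loses rich visual features. This paper instead prunes directly on the vision encoder outputs to preserve fine-grained spatial and semantic information.

Result: Across four video and five image benchmarks, the method consistently improves the accuracy-efficiency trade-off. For example, on EgoSchema with LLaVA-NeXT-Video-7B, retaining only 15% of the visual tokens yields a 2.0% absolute accuracy gain while cutting computation by more than 6x in TFLOPs.

Insight: The novelty is a training-free, plug-and-play token pruning method that operates before projection to preserve the discriminative power of the raw visual features. Its scoring rule jointly considers text-conditioned relevance and token diversity, so the selected tokens are both query-relevant and non-redundant.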

Abstract: Recent multimodal large language models are computationally expensive because Transformers must process a large number of visual tokens. We present ReDiPrune, a training-free token pruning method applied before the vision-language projector, where visual features remain rich and discriminative. Unlike post-projection pruning methods that operate on compressed representations, ReDiPrune selects informative tokens directly from vision encoder outputs, preserving fine-grained spatial and semantic cues. Each token is scored by a lightweight rule that jointly considers text-conditioned relevance and max-min diversity, ensuring the selected tokens are both query-relevant and non-redundant. ReDiPrune is fully plug-and-play, requiring no retraining or architectural modifications, and can be seamlessly inserted between the encoder and projector. Across four video and five image benchmarks, it consistently improves the accuracy-efficiency trade-off. For example, on EgoSchema with LLaVA-NeXT-Video-7B, retaining only 15% of visual tokens yields a +2.0% absolute accuracy gain while reducing computation by more than 6x in TFLOPs. Code is available at https://github.com/UA-CVML/ReDiPrune.
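A relevance-plus-diversity selection rule of the kind this abstract describes might be sketched as a greedy loop. The exact scoring rule is not given in the abstract, so the cosine-based relevance, the `alpha` trade-off weight, and the function name `select_tokens` are all assumptions made for illustration rather than the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_tokens(tokens, query, keep, alpha=0.5):
    """Greedily keep `keep` token indices.

    Score = alpha * relevance to the text query
          + (1 - alpha) * distance to the nearest already-selected token
    (the second term is the max-min diversity criterion).
    """
    if keep <= 0 or not tokens:
        return []
    relevance = [cosine(t, query) for t in tokens]
    first = max(range(len(tokens)), key=lambda i: relevance[i])
    selected = [first]
    # Distance from every token to its nearest selected token.
    min_dist = [1.0 - cosine(t, tokens[first]) for t in tokens]
    while len(selected) < min(keep, len(tokens)):
        candidates = [i for i in range(len(tokens)) if i not in selected]
        best = max(candidates,
                   key=lambda i: alpha * relevance[i] + (1 - alpha) * min_dist[i])
        selected.append(best)
        for i, t in enumerate(tokens):
            min_dist[i] = min(min_dist[i], 1.0 - cosine(t, tokens[best]))
    return sorted(selected)

# Toy visual tokens: 0 and 1 are near-duplicates, 2 points elsewhere.
tokens = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.2]
kept = select_tokens(tokens, query, keep=2, alpha=0.5)  # [1, 2]
```

With `alpha=1.0` the rule degenerates to pure relevance ranking and keeps the two near-duplicate tokens, which is the redundancy the diversity term is meant to avoid.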


[20] KitchenTwin: Semantically and Geometrically Grounded 3D Kitchen Digital Twins cs.CVPDF

Quanyun Wu, Kyle Gao, Daniel Long, David A. Clausi, Jonathan Li

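TL;DR: The paper proposes KitchenTwin, a scale-aware 3D fusion framework for building kitchen digital twins with accurate metric geometry and semantic grounding. A VLM-guided geometric anchor mechanism resolves the scale ambiguity and coordinate mismatch between transformer-predicted global point clouds and locally reconstructed object meshes, and a geometry-aware registration pipeline enforces physical plausibility.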

Details

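Motivation: For embodied AI training and evaluation, the global point clouds predicted by existing transformer-based feedforward reconstruction methods suffer from inherent scale ambiguity and inconsistent coordinate conventions. This prevents reliable fusion with locally reconstructed object meshes and thus the construction of metrically consistent, object-centric digital twin environments.

Result: Experiments on real indoor kitchen environments show improved cross-network object alignment and geometric consistency, benefiting downstream tasks such as multi-primitive fitting and metric measurement.

Insight: The novelty lies in a VLM-guided geometric anchor mechanism that recovers real-world metric scale, and a geometry-aware registration pipeline that combines gravity-aligned vertical estimation, Manhattan-world structural constraints, and collision-free local refinement to build metrically consistent digital twins. The authors also contribute an open-source indoor digital twin dataset with metrically scaled scenes and semantically grounded, registered object mesh annotations.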

Abstract: Embodied AI training and evaluation require object-centric digital twin environments with accurate metric geometry and semantic grounding. Recent transformer-based feedforward reconstruction methods can efficiently predict global point clouds from sparse monocular videos, yet these geometries suffer from inherent scale ambiguity and inconsistent coordinate conventions. This mismatch prevents the reliable fusion of these dimensionless point cloud predictions with locally reconstructed object meshes. We propose a novel scale-aware 3D fusion framework that registers visually grounded object meshes with transformer-predicted global point clouds to construct metrically consistent digital twins. Our method introduces a Vision-Language Model (VLM)-guided geometric anchor mechanism that resolves this fundamental coordinate mismatch by recovering an accurate real-world metric scale. To fuse these networks, we propose a geometry-aware registration pipeline that explicitly enforces physical plausibility through gravity-aligned vertical estimation, Manhattan-world structural constraints, and collision-free local refinement. Experiments on real indoor kitchen environments demonstrate improved cross-network object alignment and geometric consistency for downstream tasks, including multi-primitive fitting and metric measurement. We additionally introduce an open-source indoor digital twin dataset with metrically scaled scenes and semantically grounded and registered object-centric mesh annotations.


[21] UniICL: Systematizing Unified Multimodal In-context Learning through a Capability-Oriented Taxonomy cs.CVPDF

Yicheng Xu, Jiangning Zhang, Zhucun Xue, Teng Hu, Ran Yi

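TL;DR: The paper proposes UniICL, a framework that systematizes unified multimodal in-context learning. It introduces a six-level capability-oriented taxonomy to diagnose the functional role of demonstrations, and builds the large-scale dataset UniICL-760K and the evaluation benchmark UniICL-Bench. To address unstable few-shot adaptation, the authors propose the Context-Adaptive Prototype Modulator, a lightweight plug-and-play module. Experiments show that the approach outperforms larger-parameter multimodal LLM baselines on most understanding in-context learning tasks.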

Details

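Motivation: In-context learning enables training-free adaptation but is highly sensitive to example selection and formatting. In unified multimodal models, cross-modal interference and varying cognitive demands exacerbate this sensitivity, making its efficacy non-monotonic and highly task-dependent. The paper aims to diagnose and address this problem.

Result: Evaluations on UniICL-Bench show that the approach yields highly competitive unified results, outperforming larger-parameter multimodal LLM baselines on most understanding in-context learning tasks.

Insight: The contributions are: 1) a six-level, cognitive-capability-oriented taxonomy for systematically analyzing the function of demonstrations; 2) a large-scale, curated multimodal in-context learning dataset and evaluation benchmark; 3) the lightweight Context-Adaptive Prototype Modulator module, which stabilizes few-shot adaptation. Viewed objectively, the work provides a new conceptual framework and practical tools for systematically evaluating and improving multimodal in-context learning.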

Abstract: In-context Learning enables training-free adaptation via demonstrations but remains highly sensitive to example selection and formatting. In unified multimodal models spanning understanding and generation, this sensitivity is exacerbated by cross-modal interference and varying cognitive demands. Consequently, In-context Learning efficacy is often non-monotonic and highly task-dependent. To diagnose these behaviors, we introduce a six-level capability-oriented taxonomy that categorizes the functional role of demonstrations from basic perception to high-order discernment. Guided by this cognitive framework, we construct UniICL-760K, a large-scale corpus featuring curated 8-shot In-context Learning episodes across 15 subtasks, alongside UniICL-Bench for rigorous, controlled evaluation. As an architectural intervention to stabilize few-shot adaptation, we propose the Context-Adaptive Prototype Modulator, a lightweight, plug-and-play module. Evaluations on UniICL-Bench show that our approach yields highly competitive unified results, outperforming larger-parameter multimodal large language model baselines on most understanding In-context Learning tasks. Data and code will be available soon at https://github.com/xuyicheng-zju/UniICL.


[22] LLaVA-LE: Large Language-and-Vision Assistant for Lunar Exploration cs.CVPDF

Gokce Inal, Pouyan Navard, Alper Yilmaz

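TL;DR: The paper proposes LLaVA-LE, a vision-language model specialized for lunar exploration. To address the lack of large-scale multimodal datasets in planetary science, the authors build the LUCID dataset, comprising 96k high-resolution panchromatic images with detailed captions and 81k QA pairs. Using this dataset, they fine-tune the base LLaVA model with two-stage training (concept alignment and instruction-tuned visual question answering) and design evaluation benchmarks for lunar terrain analysis.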

Details

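Motivation: The application of multimodal vision-language models to planetary science remains largely unexplored, chiefly because there are no large-scale datasets that pair real planetary imagery with detailed scientific descriptions. The paper aims to develop a vision-language assistant specialized for characterizing lunar surface and subsurface features.

Result: On evaluation benchmarks designed for lunar terrain analysis, LLaVA-LE achieves a 3.3x overall performance gain over the base LLaVA model and 2.1x over its Stage 1 model. Its reasoning score of 1.070 exceeds the reference score of the judge models (GPT and Gemini), demonstrating its effectiveness in this domain.

Insight: The core contribution is LUCID, a new large-scale multimodal lunar dataset, together with a two-stage fine-tuning recipe tailored to a specific scientific domain (lunar exploration). The results indicate that domain-specific multimodal data and instruction tuning can substantially improve a vision-language model's reasoning on specialized scientific tasks, offering a template for applying large models to other scientific verticals.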

Abstract: Recent advances in multimodal vision-language models (VLMs) have enabled joint reasoning over visual and textual information, yet their application to planetary science remains largely unexplored. A key hindrance is the absence of large-scale datasets that pair real planetary imagery with detailed scientific descriptions. In this work, we introduce LLaVA-LE (Large Language-and-Vision Assistant for Lunar Exploration), a vision-language model specialized for lunar surface and subsurface characterization. To enable this capability, we curate a new large-scale multimodal lunar dataset, LUCID (LUnar Caption Image Dataset) consisting of 96k high-resolution panchromatic images paired with detailed captions describing lunar terrain characteristics, and 81k question-answer (QA) pairs derived from approximately 20k images in the LUCID dataset. Leveraging this dataset, we fine-tune LLaVA using a two-stage training curriculum: (1) concept alignment for domain-specific terrain description, and (2) instruction-tuned visual question answering. We further design evaluation benchmarks spanning multiple levels of reasoning complexity relevant to lunar terrain analysis. Evaluated against GPT and Gemini judges, LLaVA-LE achieves a 3.3x overall performance gain over Base LLaVA and 2.1x over our Stage 1 model, with a reasoning score of 1.070, exceeding the judge’s own reference score, highlighting the effectiveness of domain-specific multimodal data and instruction tuning to advance VLMs in planetary exploration. Code is available at https://github.com/OSUPCVLab/LLaVA-LE.


[23] Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models cs.CV | cs.AI | cs.LG | cs.MMPDF

Shengli Zhou, Minghang Zheng, Feng Zheng, Yang Liu

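TL;DR: The paper proposes QuatRoPE, a new positional embedding method that strengthens the 3D spatial reasoning of large language models (LLMs). Its input length grows linearly with the number of objects while pairwise spatial relations between objects are computed explicitly, and the IGRE mechanism isolates its interference with the LLM's existing positional embeddings, improving scalability and accuracy on object relations in 3D scenes while preserving the model's original capabilities.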

Details

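Motivation: Existing methods for injecting 3D scene representations into LLMs face two main problems: methods that encode absolute positions struggle to extract spatial relations from prematurely fused features, while methods that explicitly encode all inter-object relations (quadratic in the number of objects) scale poorly.

Result: Extensive experiments on 3D spatial reasoning tasks demonstrate the effectiveness of the proposed method. The abstract does not report specific numbers, though it implies an improvement over prior methods.

Insight: The novelty is QuatRoPE, a positional embedding method whose input length is linear in the number of objects and which computes pairwise spatial relations explicitly through the dot product in attention layers, together with the IGRE mechanism, which isolates the new embedding's effect on the LLM's existing positional system. This yields better scalability while maintaining spatial consistency and geometric integrity.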

Abstract: Spatial reasoning focuses on locating target objects based on spatial relations in 3D scenes, which plays a crucial role in developing intelligent embodied agents. Due to the limited availability of 3D scene-language paired data, it is challenging to train models with strong reasoning ability from scratch. Previous approaches have attempted to inject 3D scene representations into the input space of Large Language Models (LLMs) and leverage the pretrained comprehension and reasoning abilities for spatial reasoning. However, models encoding absolute positions struggle to extract spatial relations from prematurely fused features, while methods explicitly encoding all spatial relations (which is quadratic in the number of objects) as input tokens suffer from poor scalability. To address these limitations, we propose QuatRoPE, a novel positional embedding method whose input length is linear in the number of objects and which explicitly calculates pairwise spatial relations through the dot product in attention layers. QuatRoPE’s holistic vector encoding of 3D coordinates guarantees a high degree of spatial consistency, maintaining fidelity to the scene’s geometric integrity. Additionally, we introduce the Isolated Gated RoPE Extension (IGRE), which effectively limits QuatRoPE’s influence to object-related tokens, thereby minimizing interference with the LLM’s existing positional embeddings and maintaining the LLM’s original capabilities. Extensive experiments demonstrate the effectiveness of our approaches. The code and data are available at https://github.com/oceanflowlab/QuatRoPE.
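The idea of obtaining pairwise relations "through the dot product in attention layers" rests on a property of rotary embeddings that can be checked in a few lines. The sketch below uses standard one-dimensional RoPE, not QuatRoPE itself (whose quaternion-based 3D encoding is not specified in the abstract): each token carries only its own position, yet the query-key score depends only on the position difference. The function names are illustrative.

```python
import cmath

def rotate(pairs, position, base=10000.0):
    """Apply standard RoPE to a vector stored as complex pairs.

    Pair k is rotated by the angle position * base**(-k / len(pairs)).
    """
    d = len(pairs)
    return [z * cmath.exp(1j * position * base ** (-k / d))
            for k, z in enumerate(pairs)]

def attention_score(q, k):
    """Real dot product of two vectors stored as complex pairs."""
    return sum((a * b.conjugate()).real for a, b in zip(q, k))

q = [1 + 2j, 0.5 - 1j]
k = [0.3 + 0.1j, -1 + 0.4j]

# Same offset (3) at different absolute positions gives the same score.
score_a = attention_score(rotate(q, 5), rotate(k, 2))
score_b = attention_score(rotate(q, 13), rotate(k, 10))
```

Because the relation emerges inside the attention product, the input needs one token per object rather than one token per object pair, which is the linear-versus-quadratic distinction the abstract draws.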


[24] Confidence-Based Mesh Extraction from 3D Gaussians cs.CV | cs.GRPDF

Lukas Radl, Felix Windisch, Andreas Kurz, Thomas Köhler, Michael Steiner

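TL;DR: The paper proposes a confidence-based method for extracting meshes from 3D Gaussians. It introduces a self-supervised confidence framework into 3D Gaussian Splatting (3DGS) that dynamically balances photometric and geometric supervision and, combined with an improved appearance model, achieves efficient, high-quality mesh extraction.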

Details

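Motivation: Mesh extraction with 3DGS is difficult in scenes with abundant view-dependent effects, and existing methods sacrifice efficiency by relying on multi-view techniques, iterative extraction, or large pre-trained models. The paper aims to avoid these drawbacks.

Result: The method achieves state-of-the-art (SOTA) results on unbounded mesh extraction while remaining highly efficient.

Insight: The contributions are a self-supervised confidence framework that dynamically balances supervision signals, additional losses that penalize per-primitive color and normal variance, and an improved appearance model obtained by decoupling the individual terms of the D-SSIM loss. Together these designs improve the accuracy and robustness of surface extraction.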

Abstract: Recently, 3D Gaussian Splatting (3DGS) greatly accelerated mesh extraction from posed images due to its explicit representation and fast software rasterization. While the addition of geometric losses and other priors has improved the accuracy of extracted surfaces, mesh extraction remains difficult in scenes with abundant view-dependent effects. To resolve the resulting ambiguities, prior works rely on multi-view techniques, iterative mesh extraction, or large pre-trained models, sacrificing the inherent efficiency of 3DGS. In this work, we present a simple and efficient alternative by introducing a self-supervised confidence framework to 3DGS: within this framework, learnable confidence values dynamically balance photometric and geometric supervision. Extending our confidence-driven formulation, we introduce losses which penalize per-primitive color and normal variance and demonstrate their benefits to surface extraction. Finally, we complement the above with an improved appearance model, by decoupling the individual terms of the D-SSIM loss. Our final approach delivers state-of-the-art results for unbounded meshes while remaining highly efficient.


[25] A Framework for Generating Semantically Ambiguous Images to Probe Human and Machine Perception cs.CVPDF

Yuqi Hu, Vasha DuTell, Ahna R. Girshick, Jennifer E. Corbett

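TL;DR: The paper proposes a framework, built on the CLIP embedding space, for generating semantically ambiguous images to probe how human and machine classifiers differ in perception at concept boundaries. Using psychophysical methods, it generates continuous spectra of ambiguous images and precisely measures where humans and models place the boundary between concepts such as “duck” and “rabbit”. It finds that machine classifiers are more biased toward “rabbit”, whereas human perception is closer to the CLIP embedding used for synthesis.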

Details

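Motivation: The paper studies how humans and machine learning models draw concept boundaries when visual evidence is ambiguous, using semantically ambiguous images as interpretability probes to reveal how vision models represent those boundaries.

Result: Tests on spectra of ambiguous images generated in the CLIP embedding space show that machine classifiers are more inclined than humans to label ambiguous images as “rabbit”, that human perception is more consistent with the CLIP embedding used for synthesis, and that the guidance scale appears to affect human sensitivity more strongly than it affects machine classifiers.

Insight: The novelty is the use of controlled ambiguity as a diagnostic tool that bridges human psychophysical analysis, image classification, and generative models, offering a new perspective on human-model alignment, robustness, interpretability, and image synthesis methods. Methodologically, psychophysically informed interpolation in the embedding space enables precise probing of semantic boundaries.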

Abstract: The classic duck-rabbit illusion reveals that when visual evidence is ambiguous, the human brain must decide what it sees. But where exactly do human observers draw the line between “duck” and “rabbit”, and do machine classifiers draw it in the same place? We use semantically ambiguous images as interpretability probes to expose how vision models represent the boundaries between concepts. We present a psychophysically-informed framework that interpolates between concepts in the CLIP embedding space to generate continuous spectra of ambiguous images, allowing us to precisely measure where and how humans and machine classifiers place their semantic boundaries. Using this framework, we show that machine classifiers are more biased towards seeing “rabbit”, whereas humans are more aligned with the CLIP embedding used for synthesis, and the guidance scale seems to affect human sensitivity more strongly than machine classifiers. Our framework demonstrates how controlled ambiguity can serve as a diagnostic tool to bridge the gap between human psychophysical analysis, image classification, and generative image models, offering insight into human-model alignment, robustness, model interpretability, and image synthesis methods.
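Interpolating between two concept embeddings to obtain a continuous spectrum could look like the sketch below. The abstract does not say whether the authors interpolate linearly or on the sphere, so spherical interpolation (slerp) between unit-norm vectors is an assumption here, and the two-dimensional "duck" and "rabbit" vectors are toy stand-ins for real CLIP embeddings.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def slerp(a, b, t):
    """Spherical interpolation between unit vectors a and b, with t in [0, 1]."""
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)               # angle between the two embeddings
    if omega < 1e-8:                     # nearly identical: nothing to interpolate
        return list(a)
    s = math.sin(omega)
    if s < 1e-8:
        raise ValueError("antipodal vectors have no unique interpolation path")
    wa = math.sin((1.0 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]

# Toy stand-ins for the two concept embeddings.
duck = normalize([1.0, 0.0])
rabbit = normalize([0.0, 1.0])

# A five-step spectrum from fully "duck" to fully "rabbit".
spectrum = [slerp(duck, rabbit, i / 4) for i in range(5)]
```

Each point on the spectrum would then condition the image generator, and the fraction of "rabbit" responses at each step traces out a psychometric curve for humans and classifiers alike.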


[26] OpenCap Monocular: 3D Human Kinematics and Musculoskeletal Dynamics from a Single Smartphone Video cs.CV | eess.IV | q-bio.QMPDF

Selim Gilon, Emily Y. Miller, Scott D. Uhlrich

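TL;DR: OpenCap Monocular is an algorithm that estimates 3D skeletal kinematics and musculoskeletal kinetics from a single smartphone video. It refines 3D human pose estimates from a monocular pose estimation model (WHAM) via optimization, computes the kinematics of a biomechanically constrained skeletal model, and estimates kinetics via physics-based simulation and machine learning. The study validates its accuracy against marker-based motion capture and force plate data for walking, squatting, and sit-to-stand tasks. It is deployed via a smartphone app, web app, and secure cloud computing, enabling free, accessible single-smartphone biomechanical assessment.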

Details

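Motivation: Quantifying human movement (kinematics) and musculoskeletal forces (kinetics) is important for predicting, treating, and monitoring mobility-related conditions, but traditional methods require costly, time-intensive analysis in specialized laboratories, which limits clinical translation. Scalable, accurate tools for biomechanical assessment are therefore needed.

Result: In validation, OpenCap Monocular achieved low kinematic error (4.8° mean absolute error for rotational degrees of freedom; 3.4 cm for pelvis translations), outperforming a regression-only computer vision baseline by 48% in rotational accuracy (p = 0.036) and 69% in translational accuracy (p < 0.001). It estimated ground reaction forces during walking with accuracy comparable to, or better than, the earlier two-camera OpenCap system. In applications related to frailty and knee osteoarthritis, the algorithm estimated key kinetic outcomes with clinically meaningful accuracy, such as knee extension moment during sit-to-stand transitions and knee adduction moment during walking.

Insight: The novelty is the combination of monocular pose estimation with optimization, a biomechanically constrained model, and physics-based simulation and machine learning, which enables accurate estimation of 3D kinematics and kinetics from a single smartphone video. This provides a scalable, low-cost approach to biomechanical assessment, and deployment through mobile apps and the cloud makes it more accessible for clinical and research use.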

Abstract: Quantifying human movement (kinematics) and musculoskeletal forces (kinetics) at scale, such as estimating quadriceps force during a sit-to-stand movement, could transform prediction, treatment, and monitoring of mobility-related conditions. However, quantifying kinematics and kinetics traditionally requires costly, time-intensive analysis in specialized laboratories, limiting clinical translation. Scalable, accurate tools for biomechanical assessment are needed. We introduce OpenCap Monocular, an algorithm that estimates 3D skeletal kinematics and kinetics from a single smartphone video. The method refines 3D human pose estimates from a monocular pose estimation model (WHAM) via optimization, computes kinematics of a biomechanically constrained skeletal model, and estimates kinetics via physics-based simulation and machine learning. We validated OpenCap Monocular against marker-based motion capture and force plate data for walking, squatting, and sit-to-stand tasks. OpenCap Monocular achieved low kinematic error (4.8° mean absolute error for rotational degrees of freedom; 3.4 cm for pelvis translations), outperforming a regression-only computer vision baseline by 48% in rotational accuracy (p = 0.036) and 69% in translational accuracy (p < 0.001). OpenCap Monocular also estimated ground reaction forces during walking with accuracy comparable to, or better than, our prior two-camera OpenCap system. We demonstrate that the algorithm estimates important kinetic outcomes with clinically meaningful accuracy in applications related to frailty and knee osteoarthritis, including estimating knee extension moment during sit-to-stand transitions and knee adduction moment during walking. OpenCap Monocular is deployed via a smartphone app, web app, and secure cloud computing (https://opencap.ai), enabling free, accessible single-smartphone biomechanical assessments.


[27] TIGeR: A Unified Framework for Time, Images and Geo-location Retrieval cs.CVPDF

David G. Shatwell, Sirnam Swetha, Mubarak Shah

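TL;DR: The paper proposes TIGeR, a framework for joint reasoning over visual appearance, geolocation, and time, with a focus on geo-time aware image retrieval. The authors build a dataset of 4.5M training triplets and 86k evaluation triplets and design a multimodal-transformer-based model that maps images, geolocation, and time into a unified geo-temporal embedding space, supporting geo-localization, time-of-capture prediction, and geo-time-aware retrieval.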

Details

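Motivation: Real-world applications such as digital forensics, urban monitoring, and environmental analysis need capabilities beyond standard geo-localization and time-of-capture prediction, for example retrieving an image of the same location as a query image but at a specified target time.

Result: On the constructed benchmark, TIGeR outperforms strong baselines and existing SOTA methods by up to 16% on time-of-year prediction, 8% on time-of-day prediction, and 14% in geo-time aware retrieval recall.

Insight: The core idea is to model images, geolocation, and time in a single shared embedding space, so that the model retrieves based on where and when a scene is rather than on visual similarity alone, and therefore better preserves location identity under large appearance changes. This unified multimodal representation supports flexible input configurations and multiple downstream tasks.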

Abstract: Many real-world applications in digital forensics, urban monitoring, and environmental analysis require jointly reasoning about visual appearance, geolocation, and time. Beyond standard geo-localization and time-of-capture prediction, these applications increasingly demand more complex capabilities, such as retrieving an image captured at the same location as a query image but at a specified target time. We formalize this problem as Geo-Time Aware Image Retrieval and curate a diverse benchmark of 4.5M paired image-location-time triplets for training and 86k high-quality triplets for evaluation. We then propose TIGeR, a multi-modal-transformer-based model that maps image, geolocation, and time into a unified geo-temporal embedding space. TIGeR supports flexible input configurations (single-modality and multi-modality queries) and uses the same representation to perform (i) geo-localization, (ii) time-of-capture prediction, and (iii) geo-time-aware retrieval. By better preserving underlying location identity under large appearance changes, TIGeR enables retrieval based on where and when a scene is, rather than purely on visual similarity. Extensive experiments show that TIGeR consistently outperforms strong baselines and state-of-the-art methods by up to 16% on time-of-year prediction, 8% on time-of-day prediction, and 14% in geo-time aware retrieval recall, highlighting the benefits of unified geo-temporal modeling.


[28] DRoPS: Dynamic 3D Reconstruction of Pre-Scanned Objects cs.CVPDF

Narek Tumanyan, Samuel Rota Bulò, Denis Rozumny, Lorenzo Porzi, Adam Harley

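TL;DR: The paper proposes DRoPS, which uses a static pre-scan of a dynamic object as an explicit geometric and appearance prior. It builds a grid-structured, surface-aligned model by organizing Gaussian primitives into pixel grids anchored to the object surface and parameterizes motion with a CNN conditioned on those grids, significantly improving reconstruction quality from extreme novel viewpoints and 3D tracking accuracy for dynamic scenes.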

Details

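Motivation: Existing methods tackle the ill-posedness of dynamic scene reconstruction by distilling priors from 2D foundation models or imposing hand-crafted regularization, but they struggle to reconstruct scenes from extreme novel viewpoints when highly articulated motion is present. The paper uses a static pre-scan as a strong prior to constrain the solution space and ensure geometric consistency throughout the sequence.

Result: Extensive experiments show that the method significantly outperforms the current state of the art in rendering quality and 3D tracking accuracy.

Insight: The contributions are: 1) a grid-structured, surface-aligned model of Gaussian primitives; 2) a CNN-based motion parameterization that exploits the grid structure to inject strong implicit regularization and correlate the motion of nearby points. This offers a new way to use pre-scan priors for dynamic 3D reconstruction.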

Abstract: Dynamic scene reconstruction from casual videos has seen recent remarkable progress. Numerous approaches have attempted to overcome the ill-posedness of the task by distilling priors from 2D foundational models and by imposing hand-crafted regularization on the optimized motion. However, these methods struggle to reconstruct scenes from extreme novel viewpoints, especially when highly articulated motions are present. In this paper, we present DRoPS, a novel approach that leverages a static pre-scan of the dynamic object as an explicit geometric and appearance prior. While existing state-of-the-art methods fail to fully exploit the pre-scan, DRoPS leverages our novel setup to effectively constrain the solution space and ensure geometrical consistency throughout the sequence. The core of our novelty is twofold: first, we establish a grid-structured and surface-aligned model by organizing Gaussian primitives into pixel grids anchored to the object surface. Second, by leveraging the grid structure of our primitives, we parameterize motion using a CNN conditioned on those grids, injecting strong implicit regularization and correlating the motion of nearby points. Extensive experiments demonstrate that our method significantly outperforms the current state of the art in rendering quality and 3D tracking accuracy.


[29] AVControl: Efficient Framework for Training Audio-Visual Controls cs.CV | cs.MM | cs.SDPDF

Matan Ben-Yosef, Tavi Halperin, Naomi Ken Korem, Mohammad Salama, Harel Cain

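TL;DR: AVControl is a lightweight, extendable framework built on LTX-2, a joint audio-visual foundation model, for training controls over video and audio generation across many modalities. It trains a separate LoRA adapter for each control modality (e.g., depth, pose, camera trajectory, audio transformations) on a parallel canvas that supplies the reference signal as additional tokens in the attention layers, with no changes to the base model architecture. The method is both compute- and data-efficient: each modality needs only a small dataset and converges within a few hundred to a few thousand training steps.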

Details

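Motivation: Existing approaches either train a single monolithic model for a fixed set of controls or introduce costly architectural changes for each new modality, so they lack flexibility and efficiency. AVControl aims to provide a lightweight solution that requires no architectural changes and extends easily to new control modalities.

Result: On the VACE Benchmark, the method outperforms all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting, and achieves competitive results on camera control and audio-visual benchmarks.

Insight: The core idea is the parallel canvas, which treats the reference signal as additional tokens and resolves the failure on structural control that occurs when image-based in-context methods are simply extended to video. According to the authors, the framework provides the first modular audio-visual controls for a joint generation model; it supports a range of independently trained spatially aligned and non-spatial controls and is highly extendable and training-efficient.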

Abstract: Controlling video and audio generation requires diverse modalities, from depth and pose to camera trajectories and audio transformations, yet existing approaches either train a single monolithic model for a fixed set of controls or introduce costly architectural changes for each new modality. We introduce AVControl, a lightweight, extendable framework built on LTX-2, a joint audio-visual foundation model, where each control modality is trained as a separate LoRA on a parallel canvas that provides the reference signal as additional tokens in the attention layers, requiring no architectural changes beyond the LoRA adapters themselves. We show that simply extending image-based in-context methods to video fails for structural control, and that our parallel canvas approach resolves this. On the VACE Benchmark, we outperform all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting, and show competitive results on camera control and audio-visual benchmarks. Our framework supports a diverse set of independently trained modalities: spatially-aligned controls such as depth, pose, and edges, camera trajectory with intrinsics, sparse motion control, video editing, and, to our knowledge, the first modular audio-visual controls for a joint generation model. Our method is both compute- and data-efficient: each modality requires only a small dataset and converges within a few hundred to a few thousand training steps, a fraction of the budget of monolithic alternatives. We publicly release our code and trained LoRA checkpoints.


[30] GoldiCLIP: The Goldilocks Approach for Balancing Explicit Supervision for Language-Image Pretraining cs.CV | cs.AI | cs.LGPDF

Deen Dayal Mohan, Hossein Souri, Vitali Petsiuk, Juhong Min, Gopal Sharma

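TL;DR: GoldiCLIP is a vision-language pretraining framework built on the Goldilocks principle of finding just the right balance, which it applies to balancing multiple supervision signals so that data is used efficiently. It combines text-conditioned self-distillation, an encoder-integrated decoder with a visual question answering objective, and an uncertainty-based loss weighting mechanism. Trained on only 30 million images (300x less data than leading methods), the model achieves state-of-the-art results among data-efficient approaches and clearly surpasses comparable baselines on several retrieval tasks.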

Details

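Motivation: Large-scale vision-language models (VLMs) depend heavily on billion-sample datasets. The paper compensates for smaller data volume by improving supervision quality, and addresses several weaknesses of contrastive pretraining together.

Result: Trained on only 30 million images (1/300 of the data used by leading methods), GoldiCLIP achieves state-of-the-art (SOTA) results among data-efficient approaches. Specifically, it improves over the best comparable baseline by 2.2 points on MSCOCO retrieval, 2.0 points on fine-grained retrieval, and 5.9 points on question-based retrieval, while remaining competitive with billion-scale models.

Insight: The claimed novelty is a multifaceted training framework that combines three components: text-conditioned self-distillation to align text-agnostic and text-conditioned features; an encoder-integrated decoder with a VQA objective that lets the encoder generalize beyond caption-like queries; and an uncertainty-based weighting mechanism that automatically balances all heterogeneous losses. Viewed objectively, the core contribution is the systematic integration and balancing of multiple supervision signals rather than any single improvement, which offers a new and effective design pattern for data-efficient vision-language pretraining.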

Abstract: Until recently, the success of large-scale vision-language models (VLMs) has primarily relied on billion-sample datasets, posing a significant barrier to progress. Recent works have begun to close this gap by improving supervision quality, but each addresses only a subset of the weaknesses in contrastive pretraining. We present GoldiCLIP, a framework built on a Goldilocks principle of finding the right balance of supervision signals. Our multifaceted training framework synergistically combines three key innovations: (1) a text-conditioned self-distillation method to align both text-agnostic and text-conditioned features; (2) an encoder-integrated decoder with a Visual Question Answering (VQA) objective that enables the encoder to generalize beyond caption-like queries; and (3) an uncertainty-based weighting mechanism that automatically balances all heterogeneous losses. Trained on just 30 million images, 300x less data than leading methods, GoldiCLIP achieves state-of-the-art results among data-efficient approaches, improving over the best comparable baseline by 2.2 points on MSCOCO retrieval, 2.0 on fine-grained retrieval, and 5.9 on question-based retrieval, while remaining competitive with billion-scale models. Project page: https://petsi.uk/goldiclip.
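To make the third component concrete, here is one common way to balance heterogeneous losses with learned uncertainties. The abstract does not state which formulation GoldiCLIP uses, so the homoscedastic-uncertainty form below (each loss scaled by a learned log-variance `s`, plus a penalty on `s`) is an assumption, and the function name is illustrative.

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    """Combine task losses using learned log-variances.

    Each term is 0.5 * exp(-s) * loss + 0.5 * s. A large s down-weights
    its loss, while the + 0.5 * s penalty stops s from growing without bound.
    """
    if len(losses) != len(log_vars):
        raise ValueError("need one log-variance per loss")
    return sum(0.5 * math.exp(-s) * loss + 0.5 * s
               for loss, s in zip(losses, log_vars))

# Hypothetical loss values for three objectives (contrastive,
# self-distillation, VQA); the log-variances would be trainable parameters.
losses = [2.0, 0.4, 1.1]
log_vars = [0.0, 0.0, 0.0]
total = uncertainty_weighted_loss(losses, log_vars)
```

For a fixed loss value the term is minimized at `s = log(loss)`, so objectives with larger loss magnitudes are automatically given smaller weights, without hand-tuned coefficients.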


[31] DCARL: A Divide-and-Conquer Framework for Autoregressive Long-Trajectory Video Generation cs.CVPDF

Junyi Ouyang, Wenbin Teng, Gonglin Chen, Yajie Zhao, Haiwei Chen

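TL;DR: The paper proposes DCARL, a divide-and-conquer autoregressive framework for long-trajectory video generation. A Keyframe Generator first establishes globally consistent structural anchors, and an Interpolation Generator then synthesizes the dense frames autoregressively. By combining the structural stability of the divide-and-conquer scheme with the high-fidelity generation of video diffusion models, the method produces stable, high-quality long videos of up to 32 seconds.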

Details

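Motivation: Existing video diffusion models scale poorly to long-trajectory video generation, while autoregressive models suffer from visual drift and poor controllability.

Result: Trained on a large-scale internet long-trajectory video dataset, the method outperforms SOTA autoregressive and divide-and-conquer baselines in both visual quality (lower FID and FVD) and camera adherence (lower ATE and ARE).

Insight: The novelty is the combination of a divide-and-conquer strategy with autoregressive generation. Separating keyframe generation (without temporal compression) from frame interpolation (which uses global keyframes and the local preceding frame) balances long-range structural consistency with local coherence and provides a scalable, stable framework for long video generation.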

Abstract: Long-trajectory video generation is a crucial yet challenging task for world modeling primarily due to the limited scalability of existing video diffusion models (VDMs). Autoregressive models, while offering infinite rollout, suffer from visual drift and poor controllability. To address these issues, we propose DCARL, a novel divide-and-conquer, autoregressive framework that effectively combines the structural stability of the divide-and-conquer scheme with the high-fidelity generation of VDMs. Our approach first employs a dedicated Keyframe Generator trained without temporal compression to establish long-range, globally consistent structural anchors. Subsequently, an Interpolation Generator synthesizes the dense frames in an autoregressive manner with overlapping segments, utilizing the keyframes for global context and a single clean preceding frame for local coherence. Trained on a large-scale internet long trajectory video dataset, our method achieves superior performance in both visual quality (lower FID and FVD) and camera adherence (lower ATE and ARE) compared to state-of-the-art autoregressive and divide-and-conquer baselines, demonstrating stable and high-fidelity generation for long trajectory videos up to 32 seconds in length.


[32] NeuroVLM-Bench: Evaluation of Vision-Enabled Large Language Models for Clinical Reasoning in Neurological Disorders cs.CV | cs.AI | cs.LGPDF

Katarina Trojachanec Dineva, Stefan Andonov, Ilinka Ivanoska, Ivan Kitanovski, Sasho Gramatikov

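TL;DR: The paper presents NeuroVLM-Bench, a comprehensive benchmark for evaluating vision-enabled large language models on clinical reasoning tasks in neuroimaging, covering conditions such as multiple sclerosis, stroke, and brain tumors. It evaluates 20 frontier multimodal models along four dimensions: diagnostic classification, calibration, structured-output validity, and computational efficiency.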

Details

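Motivation: Although multimodal large language models show promise for image-based decision support, their reliability, performance trade-offs, and operational characteristics in neuroimaging are not yet well understood, so a standardized evaluation framework is needed.

Result: Among the 20 frontier models, GPT-5-Chat and Gemini-2.5-Pro achieve the strongest overall diagnostic performance, while Gemini-2.5-Flash offers the best efficiency-performance trade-off. Among open-weight models, MedGemma-1.5-4B is the most promising: under few-shot prompting it approaches the zero-shot performance of several proprietary models. Recognition of technical imaging attributes such as modality and plane is nearly solved, but diagnostic reasoning, especially subtype prediction, remains challenging.

Insight: The contribution is a standardized benchmark for clinical reasoning in neuroimaging with multi-dimensional evaluation, including structured-output validity and computational efficiency. Its key finding is the performance gap between recognizing technical attributes and performing complex diagnostic reasoning, and it offers practical guidance for model selection, such as the trade-off between proprietary and open-weight models.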

Abstract: Recent advances in multimodal large language models enable new possibilities for image-based decision support. However, their reliability and operational trade-offs in neuroimaging remain insufficiently understood. We present a comprehensive benchmarking study of vision-enabled large language models for 2D neuroimaging using curated MRI and CT datasets covering multiple sclerosis, stroke, brain tumors, other abnormalities, and normal controls. Models are required to generate multiple outputs simultaneously, including diagnosis, diagnosis subtype, imaging modality, specialized sequence, and anatomical plane. Performance is evaluated across four directions: discriminative classification with abstention, calibration, structured-output validity, and computational efficiency. A multi-phase framework ensures fair comparison while controlling for selection bias. Across twenty frontier multimodal models, the results show that technical imaging attributes such as modality and plane are nearly solved, whereas diagnostic reasoning, especially subtype prediction, remains challenging. Tumor classification emerges as the most reliable task, stroke is moderately solvable, while multiple sclerosis and rare abnormalities remain difficult. Few-shot prompting improves performance for several models but increases token usage, latency, and cost. Gemini-2.5-Pro and GPT-5-Chat achieve the strongest overall diagnostic performance, while Gemini-2.5-Flash offers the best efficiency-performance trade-off. Among open-weight architectures, MedGemma-1.5-4B demonstrates the most promising results, as under few-shot prompting, it approaches the zero-shot performance of several proprietary models, while maintaining perfect structured output. These findings provide practical insights into performance, reliability, and efficiency trade-offs, supporting standardized evaluation of multimodal LLMs in neuroimaging.
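Of the four evaluation directions listed in this abstract, calibration is the easiest to illustrate. The abstract does not name the calibration metric used, so the expected calibration error (ECE) with equal-width confidence bins shown below is an assumption, chosen because it is a widely used measure; the function name and the toy predictions are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between confidence and accuracy per bin.

    confidences: predicted probabilities in [0, 1] for the chosen label.
    correct: booleans, True where the prediction matched the ground truth.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equally sized, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)   # conf == 1.0 goes in the last bin
        bins[idx].append((conf, 1.0 if ok else 0.0))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece

# Toy diagnoses: the 0.9-confidence answers are both right (under-confident),
# the 0.6-confidence answers are right only half the time (over-confident).
ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, True, True, False])
```

A model that abstains on low-confidence cases changes which predictions enter this computation, which is one reason classification with abstention and calibration are reported as separate axes.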


[33] CORA: A Pathology Synthesis Driven Foundation Model for Coronary CT Angiography Analysis and MACE Risk Assessment cs.CVPDF

Jinkui Hao, Gorkem Durak, Halil Ertugrul Aktas, Ulas Bagci, Bradley D. Allen

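TL;DR: This paper presents CORA, a 3D vision foundation model for coronary CT angiography (CCTA) analysis and major adverse cardiac event (MACE) risk assessment. The model learns directly from volumetric data through a pathology-centric, synthesis-driven self-supervised framework, using an anatomy-guided lesion synthesis engine to bias representation learning toward clinically relevant disease features. It was trained on 12,801 unlabeled CCTA volumes and evaluated on multi-center datasets from nine independent hospitals.

Details

Motivation: Coronary artery disease is the leading cause of cardiovascular mortality worldwide and can be assessed non-invasively with CCTA. Clinical translation of deep learning for automated CCTA analysis is limited by the scarcity of expert-annotated data, and common label-free pretraining strategies such as masked image modeling are biased toward global anatomical statistics and struggle to capture the spatially localized pathology of coronary plaques.

Result: Across diagnostic and anatomical tasks, including plaque characterization, stenosis detection, and coronary artery segmentation, CORA consistently outperformed state-of-the-art 3D vision foundation models, with gains of up to 29%. Coupling the imaging encoder with a large language model to form a multimodal framework significantly improved 30-day MACE risk stratification.

Insight: The core contribution is a pathology-centric, synthesis-driven self-supervised pretraining framework. An anatomy-guided lesion synthesis engine explicitly trains the model to detect simulated vascular abnormalities, biasing representation learning toward clinically relevant disease features rather than the dominant background anatomy. This offers a new approach to sparse pathology and scarce annotations in medical imaging, and shows the potential of combining foundation models with large language models for multimodal risk prediction.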

Abstract: Coronary artery disease, the leading cause of cardiovascular mortality worldwide, can be assessed non-invasively by coronary computed tomography angiography (CCTA). Despite progress in automated CCTA analysis using deep learning, clinical translation is constrained by the scarcity of expert-annotated datasets. Furthermore, widely adopted label-free pretraining strategies, such as masked image modeling, are intrinsically biased toward global anatomical statistics, frequently failing to capture the spatially localized pathological features of coronary plaques. Here, we introduce CORA, a 3D vision foundation model for comprehensive cardiovascular risk assessment. CORA learns directly from volumetric CCTA via a pathology-centric, synthesis-driven self-supervised framework. By utilizing an anatomy-guided lesion synthesis engine, the model is explicitly trained to detect simulated vascular abnormalities, biasing representation learning toward clinically relevant disease features rather than dominant background anatomy. We trained CORA on a large-scale cohort of 12,801 unlabeled CCTA volumes and comprehensively evaluated the model across multi-center datasets from nine independent hospitals. Across diagnostic and anatomical tasks, including plaque characterization, stenosis detection, and coronary artery segmentation, CORA consistently outperformed the state-of-the-art 3D vision foundation models, achieving up to a 29% performance gain. Crucially, by coupling the imaging encoder with a large language model, we extended CORA into a multimodal framework that significantly improved 30-day major adverse cardiac event (MACE) risk stratification. Our results establish CORA as a scalable and extensible foundation for unified anatomical assessment and cardiovascular risk prediction.


[34] Towards automatic smoke detector inspection: Recognition of the smoke detectors in industrial facilities and preparation for future drone integration cs.CV | cs.LG | cs.ROPDF

Lukas Kratochvila, Jakub Stefansky, Simon Bilik, Robert Rous, Tomas Zemcik

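TL;DR: This study presents the recognition module of an automatic smoke detector inspection system. It compares the object detectors YOLOv11, SSD, and RT-DETRv2, trained on real and semi-synthetic data with several augmentation strategies, to recognize smoke detectors accurately in industrial facilities and lay the groundwork for future drone-based automatic inspection.

Details

Motivation: Smoke detectors in industrial facilities are often mounted high or in hard-to-reach places, which makes manual inspection difficult, dangerous, and costly. An automatic recognition system would enable fast, safe, and low-cost inspection.

Result: On two test datasets covering expected and difficult conditions (such as motion blur, low resolution, and incomplete objects), the best model, YOLOv11n, reaches an average mAP@0.5 of 0.884.

Insight: The contributions are a systematic comparison of convolutional and transformer-based detectors suited to embedded devices, and an exploration of training strategies that combine real and semi-synthetic data to cope with the difficulty of collecting real-world data. The robustness evaluation for a specific industrial setting and the augmentation schemes are reusable elsewhere.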

Abstract: Fire safety consists of a complex pipeline, and it is a very important topic of concern. One of its front-line components is the smoke detector, which is supposed to raise an alarm before a massive fire breaks out. As smoke detectors are often difficult to reach due to high ceilings or problematic locations, an automatic inspection system would be very beneficial, as it could allow faster inspections, keep workers from dangerous work at heights, and make the whole process cheaper. In this study, we present the smoke detector recognition part of the automatic inspection system, which could easily be integrated into a drone system. As part of our research, we compare two popular convolutional object detectors widely used on embedded devices, YOLOv11 and SSD, with the state-of-the-art transformer-based RT-DETRv2, using backbones of different sizes. Because collecting a sufficient amount of training data in a real-world environment is complicated, we also compare several training strategies using real and semi-synthetic data together with various augmentation methods. To achieve robust testing, all models were evaluated on two test datasets with expected and difficult appearances of the smoke detectors, including motion blur, low resolution, and incomplete objects. The best-performing detector is YOLOv11n, which reaches an average mAP@0.5 score of 0.884. Our code, pretrained models and dataset are publicly available.
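
Illustrative sketch (not from the paper): the reported mAP@0.5 counts a predicted box as a match only when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. The snippet below shows that overlap test for axis-aligned boxes given as `(x1, y1, x2, y2)`. The function name and box format are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle, clamped at zero.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, truth_box, threshold=0.5):
    """A prediction matches at mAP@0.5 when IoU >= 0.5."""
    return iou(pred_box, truth_box) >= threshold
```

Average precision is then computed from the precision-recall curve of matched predictions. That step is omitted here.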


[35] SurgPhase: Time efficient pituitary tumor surgery phase recognition via an interactive web platform cs.CVPDF

Yan Meng, Jack Cook, X. Y. Han, Kaan Duman, Shauna Otto

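TL;DR: This paper presents SurgPhase, a comprehensive framework for phase recognition in pituitary tumor surgery videos. It combines self-supervised representation learning, robust temporal modeling, and scalable data annotation strategies, and provides an interactive web platform where surgeons upload videos, receive automated analysis, and contribute data, supporting large-scale data collection and continuous model improvement.

Details

Motivation: Accurate surgical phase recognition is essential for analyzing procedural workflows, supporting intraoperative decision-making, and driving data-driven improvements in surgical education and performance evaluation. Existing methods are challenged by limited labeled data and variability across surgical cases.

Result: The method achieves 90% accuracy on a held-out test set, outperforming current state-of-the-art methods and generalizing well across surgical cases.

Insight: Contributions include: 1) a collaborative online platform that facilitates data collection and knowledge sharing; 2) self-supervised pretraining of a ResNet-50 on a large set of unlabeled videos to extract high-quality feature representations; 3) a modified training regime with focal loss, gradual layer unfreezing, and dynamic sampling to address class imbalance and procedural variability.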

Abstract: Accurate surgical phase recognition is essential for analyzing procedural workflows, supporting intraoperative decision-making, and enabling data-driven improvements in surgical education and performance evaluation. In this work, we present a comprehensive framework for phase recognition in pituitary tumor surgery (PTS) videos, combining self-supervised representation learning, robust temporal modeling, and scalable data annotation strategies. Our method achieves 90% accuracy on a held-out test set, outperforming current state-of-the-art approaches and demonstrating strong generalization across variable surgical cases. A central contribution of this work is the integration of a collaborative online platform designed for surgeons to upload surgical videos, receive automated phase analysis, and contribute to a growing dataset. This platform not only facilitates large-scale data collection but also fosters knowledge sharing and continuous model improvement. To address the challenge of limited labeled data, we pretrain a ResNet-50 model using the self-supervised framework on 251 unlabeled PTS videos, enabling the extraction of high-quality feature representations. Fine-tuning is performed on a labeled dataset of 81 procedures using a modified training regime that incorporates focal loss, gradual layer unfreezing, and dynamic sampling to address class imbalance and procedural variability.
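
Illustrative sketch (not from the paper): the focal loss mentioned in the training regime down-weights well-classified samples, so rare or hard phases contribute more to the gradient. The per-sample form is `-alpha * (1 - p_t)^gamma * log(p_t)`, where `p_t` is the predicted probability of the true class. The defaults `gamma=2.0` and `alpha=1.0` are assumptions. The abstract does not state the values used.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one sample.

    probs  : predicted class probabilities (must sum to 1)
    target : index of the true class
    gamma, alpha : assumed defaults, not taken from the paper
    """
    p_t = probs[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` and `alpha=1` the expression reduces to ordinary cross-entropy, which makes the effect of the focusing term easy to compare.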


[36] Self-Supervised Learning for Knee Osteoarthritis: Diagnostic Limitations and Prognostic Value of Uncurated Hospital Data cs.CVPDF

Haresh Rengaraj Rajamohan, Yuxuan Chen, Kyunghyun Cho, Cem M. Deniz

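TL;DR: This study evaluates whether self-supervised learning (SSL) improves knee osteoarthritis (OA) diagnosis and prognosis modeling relative to ImageNet-pretrained initialization. It compares image-only SSL (pretrained on knee radiographs from the OAI, MOST, and NYU cohorts) with multimodal image-text SSL (pretrained on uncurated hospital knee radiographs paired with radiologist impressions). SSL gives mixed results for diagnostic Kellgren-Lawrence (KL) grade prediction but substantially improves prognostic modeling.

Details

Motivation: The study asks whether SSL is more effective than conventional ImageNet pretraining in medical imaging, specifically for knee OA diagnosis and prognosis, and examines the value and limitations of uncurated hospital data.

Result: For diagnosis, image-only SSL improved accuracy under linear probing (frozen encoder) but did not outperform ImageNet pretraining under full fine-tuning; multimodal SSL also failed to improve grading. For prognosis, multimodal initialization significantly outperformed ImageNet baselines in predicting 4-year structural incidence and progression, including on external validation (MOST AUROC: 0.701 vs. 0.599 at 10% labeled data).

Insight: The key finding is that uncurated hospital data, being severely biased (93% estimated KL grade 3), may be unsuitable for learning diagnosis, yet provides a strong signal when the pretraining distribution aligns with the downstream task, as in prognostic modeling. This underlines the importance of data bias and task alignment in medical SSL.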

Abstract: This study assesses whether self-supervised learning (SSL) improves knee osteoarthritis (OA) modeling for diagnosis and prognosis relative to ImageNet-pretrained initialization. We compared (i) image-only SSL pretrained on knee radiographs from the OAI, MOST, and NYU cohorts, and (ii) multimodal image-text SSL pretrained on uncurated hospital knee radiographs paired with radiologist impressions. For diagnostic Kellgren-Lawrence (KL) grade prediction, SSL offered mixed results. While image-only SSL improved accuracy during linear probing (frozen encoder), it did not outperform ImageNet pretraining during full fine-tuning. Similarly, multimodal SSL failed to improve grading performance. We attribute this to severe bias in the uncurated hospital pretraining corpus (93% estimated KL grade 3), which limited alignment with the balanced diagnostic task. In contrast, this same multimodal initialization significantly improved prognostic modeling. It outperformed ImageNet baselines in predicting 4-year structural incidence and progression, including on external validation (MOST AUROC: 0.701 vs. 0.599 at 10% labeled data). Overall, while uncurated hospital image-text data may be ineffective for learning diagnosis due to severity bias, it provides a strong signal for prognostic modeling when the downstream task aligns with the pretraining data distribution.


[37] TIGFlow-GRPO: Trajectory Forecasting via Interaction-Aware Flow Matching and Reward-Driven Optimization cs.CV | cs.AIPDF

Xuepeng Jing, Wenhuan Lu, Hao Meng, Zhizhi Yu, Jianguo Wei

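TL;DR: This paper proposes TIGFlow-GRPO, a two-stage generative framework for human trajectory forecasting. The first stage builds a predictor based on Conditional Flow Matching with a Trajectory-Interaction-Graph module that models fine-grained visual-spatial interactions. The second stage applies Flow-GRPO post-training, reformulating deterministic flow rollout as stochastic ODE-to-SDE sampling to explore trajectories, and uses a composite reward combining social compliance and physical feasibility to steer predictions toward behaviorally plausible futures.

Details

Motivation: Existing trajectory forecasting methods based on Conditional Flow Matching focus mainly on supervised fitting, so generated trajectories may not adequately reflect social norms and scene constraints. Flow-based trajectory generation therefore needs to be aligned with behavioral rules.

Result: Experiments on the ETH/UCY and SDD datasets show that TIGFlow-GRPO improves forecasting accuracy and long-horizon stability while producing trajectories that are more socially compliant and physically feasible.

Insight: Contributions include: 1) a Trajectory-Interaction-Graph module that strengthens fine-grained interaction modeling; 2) reformulating flow rollout as stochastic SDE sampling to enable trajectory exploration; 3) a composite reward combining social and physical constraints, with behavioral alignment via GRPO. Together these offer an effective way to connect flow-based trajectory modeling with behavior-aware alignment in dynamic multimedia environments.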

Abstract: Human trajectory forecasting is important for intelligent multimedia systems operating in visually complex environments, such as autonomous driving and crowd surveillance. Although Conditional Flow Matching (CFM) has shown strong ability in modeling trajectory distributions from spatio-temporal observations, existing approaches still focus primarily on supervised fitting, which may leave social norms and scene constraints insufficiently reflected in generated trajectories. To address this issue, we propose TIGFlow-GRPO, a two-stage generative framework that aligns flow-based trajectory generation with behavioral rules. In the first stage, we build a CFM-based predictor with a Trajectory-Interaction-Graph (TIG) module to model fine-grained visual-spatial interactions and strengthen context encoding. This stage captures both agent-agent and agent-scene relations more effectively, providing more informative conditional features for subsequent alignment. In the second stage, we perform Flow-GRPO post-training, where deterministic flow rollout is reformulated as stochastic ODE-to-SDE sampling to enable trajectory exploration, and a composite reward combines view-aware social compliance with map-aware physical feasibility. By evaluating trajectories explored through SDE rollout, GRPO progressively steers multimodal predictions toward behaviorally plausible futures. Experiments on the ETH/UCY and SDD datasets show that TIGFlow-GRPO improves forecasting accuracy and long-horizon stability while generating trajectories that are more socially compliant and physically feasible. These results suggest that the proposed framework provides an effective way to connect flow-based trajectory modeling with behavior-aware alignment in dynamic multimedia environments.


[38] Infinite Gaze Generation for Videos with Autoregressive Diffusion cs.CVPDF

Jenna Kang, Colin Groth, Tong Wu, Finley Torrens, Patsorn Sangkloy

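TL;DR: This paper proposes a framework based on an autoregressive diffusion model for infinite-horizon raw gaze generation in videos. It synthesizes gaze trajectories with continuous spatial coordinates and high-resolution timestamps, and markedly improves long-range spatio-temporal accuracy and trajectory realism.

Details

Motivation: Existing video gaze prediction methods are typically limited to short windows (about 3-5 seconds) and cannot capture long-range behavioral dependencies in real-world content. Conventional abstractions such as saliency maps and scanpaths also tend to lose the fine-grained temporal dynamics of raw gaze.

Result: Quantitative and qualitative evaluations show that the method significantly outperforms existing approaches in long-range spatio-temporal accuracy and trajectory realism.

Insight: The novelty lies in combining an autoregressive diffusion model with a saliency-aware visual latent space to generate raw gaze for videos of arbitrary length, removing the temporal-length limit of earlier methods while preserving fine-grained dynamics.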

Abstract: Predicting human gaze in video is fundamental to advancing scene understanding and multimodal interaction. While traditional saliency maps provide spatial probability distributions and scanpaths offer ordered fixations, both abstractions often collapse the fine-grained temporal dynamics of raw gaze. Furthermore, existing models are typically constrained to short-term windows ($\approx$ 3-5s), failing to capture the long-range behavioral dependencies inherent in real-world content. We present a generative framework for infinite-horizon raw gaze prediction in videos of arbitrary length. By leveraging an autoregressive diffusion model, we synthesize gaze trajectories characterized by continuous spatial coordinates and high-resolution timestamps. Our model is conditioned on a saliency-aware visual latent space. Quantitative and qualitative evaluations demonstrate that our approach significantly outperforms existing approaches in long-range spatio-temporal accuracy and trajectory realism.


[39] Beyond Attention Magnitude: Leveraging Inter-layer Rank Consistency for Efficient Vision-Language-Action Models cs.CV | cs.CLPDF

Peiju Liu, Jinming Liu, Xipeng Qiu, Xuanjing Huang

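TL;DR: This paper proposes TIES, a dynamic token selection framework that addresses the high inference latency of Vision-Language-Action (VLA) models caused by processing dense visual tokens. It guides token selection with inter-layer token ranking consistency rather than relying on static attention magnitude alone, improving policy performance while using fewer tokens.

Details

Motivation: Existing token reduction methods rely mainly on attention magnitude for static selection, but high-attention tokens are task-dependent and can even degrade policy performance. The paper aims to overcome this by dynamically balancing attention magnitude with ranking consistency for more robust token selection.

Result: On the CogACT + SIMPLER benchmark, TIES improves the average success rate by 6% while reducing token usage by 78%, and generalizes well across decoders and benchmarks.

Insight: The paper challenges the common assumption that attention magnitude alone suffices for token selection and introduces inter-layer ranking consistency as dynamic guidance, achieving efficient and robust token reduction without additional training.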

Abstract: Vision-Language-Action (VLA) models excel in robotic manipulation but suffer from significant inference latency due to processing dense visual tokens. Existing token reduction methods predominantly rely on attention magnitude as a static selection criterion. In this work, we challenge this assumption, revealing that high-attention tokens are task-dependent and can even degrade policy performance. To address this, we introduce TIES (Tau-guided Inter-layer Efficient Selection), a dynamic framework guided by inter-layer token ranking consistency. By adaptively balancing attention magnitude with ranking consistency, TIES ensures robust token selection without requiring additional training. On the CogACT + SIMPLER benchmark, TIES improves average success rates by 6% while reducing token usage by 78%, and demonstrates strong generalization across diverse decoders and benchmarks.
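
Illustrative sketch (not the authors' algorithm): the name "Tau-guided" suggests a rank-correlation statistic such as Kendall's tau, which measures how consistently two layers order the same tokens. The snippet computes tau between the attention scores of two layers, then blends attention magnitude with per-token rank stability. The blend rule, the function names, and the use of only two layers are all hypothetical choices made for illustration.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same tokens."""
    n = len(scores_a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = n * (n - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


def ranks(scores):
    """Rank of each token; rank 0 is the highest score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for position, index in enumerate(order):
        result[index] = position
    return result


def select_tokens(attn_a, attn_b, keep):
    """Hypothetical blend: trust magnitude more when the layers agree."""
    n = len(attn_a)
    w = max(kendall_tau(attn_a, attn_b), 0.0)
    rank_a, rank_b = ranks(attn_a), ranks(attn_b)
    top = max(attn_b) or 1.0
    scores = []
    for i in range(n):
        magnitude = attn_b[i] / top
        stability = 1.0 - abs(rank_a[i] - rank_b[i]) / max(n - 1, 1)
        scores.append(w * magnitude + (1.0 - w) * stability)
    kept = sorted(range(n), key=lambda i: -scores[i])[:keep]
    return sorted(kept)
```

When the two layers rank tokens identically, tau is 1 and the selection reduces to plain top-k by attention magnitude.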


[40] BiFM: Bidirectional Flow Matching for Few-Step Image Editing and Generation cs.CVPDF

Yasong Dai, Zeeshan Hayder, David Ahmedt-Aristizabal, Hongdong Li

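TL;DR: This paper proposes BiFM (Bidirectional Flow Matching), a framework that jointly learns generation and inversion so that a single model handles efficient image editing and generation. It directly estimates average velocity fields in both the image-to-noise and noise-to-image directions and introduces a training strategy with continuous time-interval supervision, improving editing quality under few-step sampling.

Details

Motivation: In image editing, existing few-step sampling methods approximate the forward process poorly, which degrades editing quality. They also typically depend on pretrained generators and auxiliary modules, limiting scalability and generalization across architectures.

Result: Across diverse image editing and generation tasks, BiFM consistently outperforms existing few-step methods in performance and editability.

Insight: Contributions include bidirectional average velocity field estimation, a constraint based on a shared instantaneous velocity field, continuous time-interval supervision, and stabilization through a bidirectional consistency objective and a lightweight time-interval embedding. The method integrates into mainstream diffusion and flow matching backbones.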

Abstract: Recent diffusion and flow matching models have demonstrated strong capabilities in image generation and editing by progressively removing noise through iterative sampling. While this enables flexible inversion for semantic-preserving edits, few-step sampling regimes suffer from poor forward process approximation, leading to degraded editing quality. Existing few-step inversion methods often rely on pretrained generators and auxiliary modules, limiting scalability and generalization across different architectures. To address these limitations, we propose BiFM (Bidirectional Flow Matching), a unified framework that jointly learns generation and inversion within a single model. BiFM directly estimates average velocity fields in both "image $\to$ noise" and "noise $\to$ image" directions, constrained by a shared instantaneous velocity field derived from either predefined schedules or pretrained multi-step diffusion models. Additionally, BiFM introduces a novel training strategy using continuous time-interval supervision, stabilized by a bidirectional consistency objective and a lightweight time-interval embedding. This bidirectional formulation also enables one-step inversion and can integrate seamlessly into popular diffusion and flow matching backbones. Across diverse image editing and generation tasks, BiFM consistently outperforms existing few-step approaches, achieving superior performance and editability.
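
Illustrative note (notation assumed, not taken from the paper): an interval-averaged velocity is usually defined as the time average of the instantaneous velocity $v$ along the flow $\frac{dz_\tau}{d\tau} = v(z_\tau, \tau)$. Under that assumption, a single evaluation of the average velocity $u$ moves a sample across the whole interval $[r, t]$ in either direction, which is what makes one-step generation and one-step inversion possible.

```latex
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau
\quad\Longrightarrow\quad
z_t = z_r + (t - r)\, u(z_t, r, t),
\qquad
z_r = z_t - (t - r)\, u(z_t, r, t).
```

The symbols $z$, $r$, $t$, $u$, and $v$ are placeholders chosen for this note. The paper may parameterize the two directions differently.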


[41] Self-Corrected Image Generation with Explainable Latent Rewards cs.CV | cs.AIPDF

Yinyi Luo, Hrishikesh Gokhale, Marios Savvides, Jindong Wang, Shengfeng He

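TL;DR: This paper proposes xLARD, a self-correcting image generation framework in which multimodal large language models guide generation through explainable latent rewards, addressing the difficulty of aligning text-to-image outputs with complex prompts, especially fine-grained semantics and spatial relations.

Details

Motivation: Feed-forward generation cannot fully anticipate how well the output will align with a complex prompt, whereas evaluating a generated image is comparatively easy. This asymmetry motivates a self-correction mechanism that uses evaluation to guide generation.

Result: Experiments on diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors.

Insight: The core contribution is a lightweight corrector that refines latent representations based on feedback from model-generated references, together with a differentiable mapping from latent edits to interpretable reward signals. This turns non-differentiable image-level evaluation into continuous latent-level guidance, letting the model understand, assess, and correct itself during generation.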

Abstract: Despite significant progress in text-to-image generation, aligning outputs with complex prompts remains challenging, particularly for fine-grained semantics and spatial relations. This difficulty stems from the feed-forward nature of generation, which requires anticipating alignment without fully understanding the output. In contrast, evaluating generated images is more tractable. Motivated by this asymmetry, we propose xLARD, a self-correcting framework that uses multimodal large language models to guide generation through Explainable LAtent RewarDs. xLARD introduces a lightweight corrector that refines latent representations based on structured feedback from model-generated references. A key component is a differentiable mapping from latent edits to interpretable reward signals, enabling continuous latent-level guidance from non-differentiable image-level evaluations. This mechanism allows the model to understand, assess, and correct itself during generation. Experiments across diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors. Code is available at https://yinyiluo.github.io/xLARD/.


[42] MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models cs.CVPDF

Dohwan Ko, Jinyoung Park, Seoung Choi, Sanghyeok Lee, Seohyun Lee

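TL;DR: This paper proposes MoE-GRPO, a reinforcement learning framework for optimizing expert routing in Mixture-of-Experts (MoE) vision-language models. It formulates expert selection as a sequential decision-making problem, trains it with Group Relative Policy Optimization (GRPO), and adds modality-aware router guidance to improve stability. Experiments show it outperforms conventional top-K routing and its variants on multiple multimodal benchmarks.

Details

Motivation: The widely used deterministic top-K routing in MoE may overlook better expert combinations and lead to expert overfitting. The paper uses reinforcement learning to increase the diversity of expert selection.

Result: Extensive experiments on multimodal image and video benchmarks show that MoE-GRPO consistently outperforms standard top-K routing and its variants, mitigating expert overfitting and enabling task-level expert specialization by promoting more diverse expert selection.

Insight: The contribution is bringing reinforcement learning (GRPO in particular) to MoE routing optimization, with modality-aware router guidance to stabilize training. This suggests a route to dynamic, adaptive expert selection and may improve how efficiently model capacity is used.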

Abstract: Mixture-of-Experts (MoE) has emerged as an effective approach to reduce the computational overhead of Transformer architectures by sparsely activating a subset of parameters for each token while preserving high model capacity. This paradigm has recently been extended to Vision-Language Models (VLMs), enabling scalable multi-modal understanding with reduced computational cost. However, the widely adopted deterministic top-K routing mechanism may overlook more optimal expert combinations and lead to expert overfitting. To address this limitation and improve the diversity of expert selection, we propose MoE-GRPO, a reinforcement learning (RL)-based framework for optimizing expert routing in MoE-based VLMs. Specifically, we formulate expert selection as a sequential decision-making problem and optimize it using Group Relative Policy Optimization (GRPO), allowing the model to learn adaptive expert routing policies through exploration and reward-based feedback. Furthermore, we introduce a modality-aware router guidance that enhances training stability and efficiency by discouraging the router from exploring experts that are infrequently activated for a given modality. Extensive experiments on multi-modal image and video benchmarks show that MoE-GRPO consistently outperforms standard top-K routing and its variants by promoting more diverse expert selection, thereby mitigating expert overfitting and enabling a task-level expert specialization.
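
Illustrative sketch (not the authors' code): GRPO scores each sampled rollout relative to the other rollouts drawn for the same input, normalizing rewards by the group mean and standard deviation instead of using a learned value function. The snippet shows that normalization only. How MoE-GRPO defines a rollout over routing decisions is not specified in the abstract, and the choice of population standard deviation and of `eps` are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts.

    rewards : scalar rewards for rollouts sampled from the same input
    eps     : small constant (assumed) that avoids division by zero
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

A group in which every rollout earns the same reward yields all-zero advantages, so it contributes no learning signal.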


[43] Towards Video Anomaly Detection from Event Streams: A Baseline and Benchmark Datasets cs.CVPDF

Peng Wu, Yuting Yan, Guansong Pang, Yujia Sun, Qingsen Yan

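TL;DR: This paper is the first to establish video anomaly detection from event streams as a unified research direction. It builds several benchmark datasets with synchronized event and RGB recordings and proposes EWAD, an event-centric spatiotemporal video anomaly detection framework with three key components: event-density-aware dynamic sampling, density-modulated temporal modeling, and RGB-to-event knowledge distillation.

Details

Motivation: Event-based vision has low redundancy, focuses on dynamic motion, and is inherently privacy-preserving, which makes it a natural fit for video anomaly detection. Progress has been held back by the lack of dedicated datasets and effective modeling strategies.

Result: Extensive experiments on three benchmark datasets show that EWAD achieves significant improvements over existing methods, highlighting the potential and effectiveness of event-driven modeling for video anomaly detection.

Insight: Contributions include using event density for dynamic sampling and temporal modeling of sparse event streams, and knowledge distillation under weak supervision to strengthen event representations. Viewed objectively, the main contributions are the first benchmark datasets for event-stream VAD and a framework designed for them.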

Abstract: Event-based vision, characterized by low redundancy, focus on dynamic motion, and inherent privacy-preserving properties, naturally fits the demands of video anomaly detection (VAD). However, the absence of dedicated event-stream anomaly detection datasets and effective modeling strategies has significantly hindered progress in this field. In this work, we take the first major step toward establishing event-based VAD as a unified research direction. We first construct multiple event-stream based benchmarks for video anomaly detection, featuring synchronized event and RGB recordings. Leveraging the unique properties of events, we then propose an EVent-centric spatiotemporal Video Anomaly Detection framework, namely EWAD, with three key innovations: an event density aware dynamic sampling strategy to select temporally informative segments; a density-modulated temporal modeling approach that captures contextual relations from sparse event streams; and an RGB-to-event knowledge distillation mechanism to enhance event-based representations under weak supervision. Extensive experiments on three benchmarks demonstrate that our EWAD achieves significant improvements over existing approaches, highlighting the potential and effectiveness of event-driven modeling for video anomaly detection. The benchmark datasets will be made publicly available.


[44] C2W-Tune: Cavity-to-Wall Transfer Learning for Thin Atrial Wall Segmentation in 3D Late Gadolinium-enhanced Magnetic Resonance cs.CVPDF

Yusri Al-Sanaani, Rebecca Thornhill, Sreeraman Rajan

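TL;DR: This paper proposes C2W-Tune, a two-stage cavity-to-wall transfer learning framework for segmenting the thin left atrial wall in 3D LGE-MRI. The network is first pretrained to segment the left atrial cavity, learning robust anatomical representations. The weights are then transferred and adapted to wall segmentation with a progressive unfreezing schedule, so that a high-accuracy cavity model serves as an anatomical prior for thin-wall delineation.

Details

Motivation: Accurate segmentation of the left atrial wall in 3D LGE-MRI is essential for wall thickness mapping and fibrosis quantification, but the task is very challenging because the wall is thin, the anatomy is complex, and contrast is low.

Result: On the 2018 LA Segmentation Challenge dataset, C2W-Tune substantially outperforms a baseline trained from scratch: wall Dice improves from 0.623 to 0.814, and Surface Dice at 1 mm from 0.553 to 0.731. Boundary errors are also substantially reduced, with HD95 falling from 2.95 mm to 2.55 mm and ASSD from 0.71 mm to 0.63 mm. With reduced supervision (only 70 training volumes), it still reaches a Dice of 0.78 and an HD95 of 3.15 mm, exceeding multi-class benchmarks that typically report Dice values of 0.6-0.7.

Insight: The abstract's claimed contribution is an anatomically grounded task-transfer framework that uses two-stage transfer learning and progressive layer unfreezing to exploit prior knowledge from cavity segmentation and improve boundary accuracy for thin-wall segmentation. Viewed objectively, the core idea is to treat a high-accuracy cavity segmentation model as a strong anatomical prior and to balance feature preservation against task-specific adaptation through controlled fine-tuning (such as progressive unfreezing), a transfer-learning pattern applicable to thin, low-contrast structures in medical images.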

Abstract: Accurate segmentation of the left atrial (LA) wall in 3D late gadolinium-enhanced MRI (LGE-MRI) is essential for wall thickness mapping and fibrosis quantification, yet it remains challenging due to the wall’s thinness, complex anatomy, and low contrast. We propose C2W-Tune, a two-stage cavity-to-wall transfer framework that leverages a high-accuracy LA cavity model as an anatomical prior to improve thin-wall delineation. Using a 3D U-Net with a ResNeXt encoder and instance normalization, Stage 1 pre-trains the network to segment the LA cavity, learning robust atrial representations. Stage 2 transfers these weights and adapts the network to LA wall segmentation using a progressive layer-unfreezing schedule to preserve endocardial features while enabling wall-specific refinement. Experiments on the 2018 LA Segmentation Challenge dataset demonstrate substantial gains over an architecture-matched baseline trained from scratch: wall Dice improves from 0.623 to 0.814, and Surface Dice at 1 mm improves from 0.553 to 0.731. Boundary errors were substantially reduced, with the 95th-percentile Hausdorff distance (HD95) decreasing from 2.95 mm to 2.55 mm and the average symmetric surface distance (ASSD) from 0.71 mm to 0.63 mm. Furthermore, even with reduced supervision (70 training volumes sampled from the same training pool), C2W-Tune achieved a Dice score of 0.78 and an HD95 of 3.15 mm, maintaining competitive performance and exceeding multi-class benchmarks that typically report Dice values around 0.6-0.7. These results show that anatomically grounded task transfer with controlled fine-tuning improves boundary accuracy for thin LA wall segmentation in 3D LGE-MRI.
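
Illustrative sketch (not from the paper): the Dice score reported above is twice the overlap between predicted and reference masks divided by the sum of their sizes. The snippet represents each mask as a set of voxel coordinates, which is an assumption made for brevity. Volumetric arrays are the usual representation in practice.

```python
def dice(pred_voxels, truth_voxels):
    """Dice coefficient between two masks given as sets of voxel coordinates."""
    pred, truth = set(pred_voxels), set(truth_voxels)
    if not pred and not truth:
        return 1.0  # two empty masks agree perfectly by convention
    return 2 * len(pred & truth) / (len(pred) + len(truth))
```

For a thin structure such as the atrial wall, a shift of one voxel removes a large share of the overlap, which is one reason the abstract also reports boundary metrics such as HD95 and ASSD.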


[45] Relaxed Rigidity with Ray-based Grouping for Dynamic Gaussian Splatting cs.CVPDF

Junoh Leea, Junmyeong Lee, Yeon-Ji Song, Inhwan Bae, Jisu Shin

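TL;DR: This paper proposes a new method for dynamic 3D Gaussian Splatting reconstruction. It introduces a view-space ray grouping strategy that clusters Gaussians intersected by the same ray whose α-blending weights exceed a threshold, and constrains each group to keep a consistent spatial distribution, explicitly preserving the local geometric structure of Gaussians in 4D scenes. The method achieves more physically plausible motion modeling without external priors such as optical flow or 2D tracks.

Details

Motivation: Existing dynamic 3D Gaussian Splatting methods struggle to model realistic motion, and Gaussian motion is often inconsistent with real physical dynamics. On monocular video datasets in particular, incoherent motion breaks local geometric structure and degrades reconstruction quality. Most state-of-the-art methods rely heavily on external priors to enforce temporal coherence.

Result: Integrated into two different baseline models, the method significantly outperforms existing methods in extensive experiments on challenging monocular datasets, with better temporal consistency and reconstruction quality.

Insight: The core contribution is a view-space ray grouping strategy that clusters and constrains Gaussians along the same ray to explicitly keep local geometric structure consistent over time, enabling more physically plausible dynamic scene modeling without external priors.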

Abstract: The reconstruction of dynamic 3D scenes using 3D Gaussian Splatting has shown significant promise. A key challenge, however, remains in modeling realistic motion, as most methods fail to align the motion of Gaussians with real-world physical dynamics. This misalignment is particularly problematic for monocular video datasets, where failing to maintain coherent motion undermines local geometric structure, ultimately leading to degraded reconstruction quality. Consequently, many state-of-the-art approaches rely heavily on external priors, such as optical flow or 2D tracks, to enforce temporal coherence. In this work, we propose a novel method to explicitly preserve the local geometric structure of Gaussians across time in 4D scenes. Our core idea is to introduce a view-space ray grouping strategy that clusters Gaussians intersected by the same ray, considering only those whose $α$-blending weights exceed a threshold. We then apply constraints to these groups to maintain a consistent spatial distribution, effectively preserving their local geometry. This approach enforces a more physically plausible motion model by ensuring that local geometry remains stable over time, eliminating the reliance on external guidance. We demonstrate the efficacy of our method by integrating it into two distinct baseline models. Extensive experiments on challenging monocular datasets show that our approach significantly outperforms existing methods, achieving superior temporal consistency and reconstruction quality.
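
Illustrative sketch (not the authors' code): in standard front-to-back α-blending, the contribution of the i-th Gaussian along a ray is its opacity times the transmittance left by the Gaussians in front of it, `w_i = alpha_i * prod_{j<i} (1 - alpha_j)`. The snippet computes those weights and keeps the Gaussians above a threshold as one ray group. The threshold value, the function names, and the assumption that opacities arrive depth-sorted are illustrative.

```python
def blending_weights(alphas):
    """Per-Gaussian blending weights for one ray, front to back."""
    weights = []
    transmittance = 1.0
    for alpha in alphas:
        weights.append(alpha * transmittance)
        transmittance *= 1.0 - alpha
    return weights


def ray_group(gaussian_ids, alphas, threshold=0.1):
    """IDs of Gaussians on a ray whose blending weight exceeds the threshold."""
    weights = blending_weights(alphas)
    return [g for g, w in zip(gaussian_ids, weights) if w > threshold]
```

The constraint that keeps each group's spatial distribution consistent over time is the paper's contribution and is not reproduced here.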


[46] Distributed Real-Time Vehicle Control for Emergency Vehicle Transit: A Scalable Cooperative Method cs.CVPDF

WenXi Wang, JunQi Zhang

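TL;DR: This paper proposes a scalable distributed real-time vehicle control method for emergency vehicle transit. Vehicles make distributed online decisions using only local information, which avoids the high computational cost and scalability limits of centralized methods, and a distributed conflict resolution mechanism ensures safety.

Details

Motivation: The goal is rapid transit of emergency vehicles with minimal impact on ordinary vehicles, overcoming the computational cost and scalability limitations of existing centralized solvers and reinforcement learning methods.

Result: Simulation experiments on real-world traffic datasets show that the method makes decisions faster than existing methods, has less impact on ordinary vehicles, and scales better across traffic densities and road configurations.

Insight: The paper proves that a distributed method using only local information is approximately equivalent to one using global information, enabling real-time, approximately optimal decisions without pre-training. The distributed conflict resolution mechanism provides deterministic safety guarantees and avoids the single-point-of-failure risk of centralized methods.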

Abstract: Rapid transit of emergency vehicles is critical for saving lives and reducing property loss but often relies on surrounding ordinary vehicles to cooperatively adjust their driving behaviors. It is important to ensure rapid transit of emergency vehicles while minimizing the impact on ordinary vehicles. Centralized mathematical solvers and reinforcement learning are the state-of-the-art methods. The former obtain optimal solutions but are only practical for small-scale scenarios. The latter learns implicitly through extensive centralized training, but the trained model exhibits limited scalability to different traffic conditions. Hence, existing methods suffer from two fundamental limitations: high computational cost and lack of scalability. To overcome the above limitations, this work proposes a scalable distributed vehicle control method, where vehicles adjust their driving behaviors online in a distributed manner using only local instead of global information. We proved that the proposed distributed method using only local information is approximately equivalent to the one using global information, which enables vehicles to evaluate their candidate states and make approximately optimal decisions in real time without pre-training and with natural adaptability to varying traffic conditions. Then, a distributed conflict resolution mechanism is further proposed to guarantee vehicles’ safety by avoiding their decision conflicts, which eliminates the single-point-of-failure risk of centralized methods and provides deterministic safety guarantees that learned methods cannot offer. Simulation experiments based on real-world traffic datasets demonstrate that, compared with existing methods, the proposed method achieves faster decision-making, has less impact on ordinary vehicles, and maintains much stronger scalability across different traffic densities and road configurations.


[47] Interpretable Zero-shot Referring Expression Comprehension with Query-driven Scene Graphs cs.CV | cs.MMPDF

Yike Wu, Necva Bolucu, Stephen Wan, Dadong Wang, Jiahao Xia

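TL;DR: This paper proposes SGREC, an interpretable zero-shot referring expression comprehension method. It uses query-driven scene graphs as structured intermediaries to compensate for the weakness of existing vision-language models in fine-grained visual detail and complex relationships, and uses a large language model for high-level semantic reasoning, locating target objects in images without task-specific training data.

Details

Motivation: Existing vision-language models such as CLIP handle zero-shot referring expression comprehension by directly measuring feature similarity between text queries and image regions, which struggles to capture fine-grained visual details and complex object relationships. Large language models excel at high-level semantic reasoning but cannot directly abstract visual features into textual semantics.

Result: SGREC achieves the best top-1 accuracy on multiple zero-shot REC benchmarks, including RefCOCO val (66.78%), RefCOCO+ testB (53.43%), and RefCOCOg val (73.28%), demonstrating strong visual scene understanding.

Insight: The contribution is the query-driven scene graph as a structured intermediary. It encodes visual information (spatial relationships, descriptive captions, object interactions) as text, bridging low-level image regions and high-level semantic understanding, and lets an LLM reason over it and explain its decisions, improving both task performance and interpretability.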

Abstract: Zero-shot referring expression comprehension (REC) aims to locate target objects in images given natural language queries without relying on task-specific training data, demanding strong visual understanding capabilities. Existing Vision-Language Models (VLMs), such as CLIP, commonly address zero-shot REC by directly measuring feature similarities between textual queries and image regions. However, these methods struggle to capture fine-grained visual details and understand complex object relationships. Meanwhile, although Large Language Models (LLMs) excel at high-level semantic reasoning, their inability to directly abstract visual features into textual semantics limits their application in REC tasks. To overcome these limitations, we propose SGREC, an interpretable zero-shot REC method leveraging query-driven scene graphs as structured intermediaries. Specifically, we first employ a VLM to construct a query-driven scene graph that explicitly encodes spatial relationships, descriptive captions, and object interactions relevant to the given query. By leveraging this scene graph, we bridge the gap between low-level image regions and higher-level semantic understanding required by LLMs. Finally, an LLM infers the target object from the structured textual representation provided by the scene graph, responding with detailed explanations for its decisions that ensure interpretability in the inference process. Extensive experiments show that SGREC achieves top-1 accuracy on most zero-shot REC benchmarks, including RefCOCO val (66.78%), RefCOCO+ testB (53.43%), and RefCOCOg val (73.28%), highlighting its strong visual scene understanding.


[48] VideoTIR: Accurate Understanding for Long Videos with Efficient Tool-Integrated Reasoning cs.CVPDF

Zhe Gao, Shiyu Shen, Taifeng Chai, Weinong Wang, Haotian Xu

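TL;DR: This paper proposes VideoTIR, a tool-integrated reasoning method based on reinforcement learning (RL) that targets hallucinations of multimodal large language models (MLLMs) in long video understanding (LVU), which stem from the imbalance between textual and visual tokens. It explores both Zero-RL and supervised fine-tuning (SFT) cold-starting to guide the model in calling a multi-level toolkit that retrieves and focuses on key video segments, images, and regions. It also introduces Toolkit Action Grouped Policy Optimization (TAGPO) to reduce redundant tool calls, and a sandbox-based trajectory synthesis framework to generate high-quality training data.

Details

Motivation: Existing MLLMs often hallucinate in long video understanding because of visual information overload, while SFT-based tool-calling methods need large amounts of high-quality data and produce constrained tool-calling trajectories. A more efficient and accurate way to guide models through long videos is needed.

Result: Extensive experiments on three long-video QA benchmarks demonstrate the method's effectiveness and efficiency, indicating improved accuracy in long video understanding.

Insight: Contributions include combining reinforcement learning with multi-level toolkit calling for long video understanding, the TAGPO strategy for reducing redundant calls, and a trajectory synthesis framework for generating training data, offering a new approach to handling long multimodal inputs efficiently.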

Abstract: Existing Multimodal Large Language Models (MLLMs) often suffer from hallucinations in long video understanding (LVU), primarily due to the imbalance between textual and visual tokens. Observing that MLLMs handle short visual inputs well, recent LVU works alleviate hallucinations by automatically parsing the vast visual data into manageable segments that can be effectively processed by MLLMs. SFT-based tool-calling methods can serve this purpose, but they typically require vast amounts of fine-grained, high-quality data and suffer from constrained tool-calling trajectories. We propose VideoTIR, a novel method that leverages Reinforcement Learning (RL) to encourage proper usage of comprehensive multi-level toolkits for efficient long video understanding. VideoTIR explores both Zero-RL and SFT cold-starting to enable MLLMs to retrieve and focus on meaningful video segments, images, and regions, making long video understanding both more accurate and more efficient. To reduce redundant tool-calling, we propose Toolkit Action Grouped Policy Optimization (TAGPO), which enhances the efficiency of the calling process through stepwise reward assignment and reuse of failed rollouts. Additionally, we develop a sandbox-based trajectory synthesis framework to generate high-quality trajectory data. Extensive experiments on three long-video QA benchmarks demonstrate the effectiveness and efficiency of our method.


[49] MoRGS: Efficient Per-Gaussian Motion Reasoning for Streamable Dynamic 3D Scenes cs.CV

Wonjoon Lee, Sungmin Woo, Donghyeong Kim, Jungho Lee, Sangheon Park

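TL;DR: MoRGS is an efficient online framework for streamable dynamic 3D scene reconstruction that improves 4D reconstruction quality by explicitly modeling the motion of each 3D Gaussian. It uses optical flow on a sparse set of key views as a lightweight motion cue and introduces a per-Gaussian motion offset field and a confidence mechanism to improve motion learning, achieving more realistic scene dynamics while preserving real-time performance.

Details

Motivation: In existing online dynamic scene reconstruction methods based on 3D Gaussian Splatting, the optimized Gaussian motion tends to chase pixel residuals rather than true 3D motion, so the methods fail to learn per-Gaussian motion that reflects real scene dynamics.

Result: Extensive experiments show that MoRGS achieves state-of-the-art reconstruction quality and motion fidelity among online methods while maintaining streamable performance.

Insight: The novelty lies in using sparse optical flow as explicit motion supervision, designing a per-Gaussian motion offset field to reconcile discrepancies between projected 3D motion and observed flow, and introducing a per-Gaussian motion confidence that separates dynamic from static regions and weights updates, which improves temporal consistency and speeds up the modeling of large motions.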

Abstract: Online reconstruction of dynamic scenes aims to learn from streaming multi-view inputs under low-latency constraints. The fast training and real-time rendering capabilities of 3D Gaussian Splatting have made on-the-fly reconstruction practically feasible, enabling online 4D reconstruction. However, existing online approaches, despite their efficiency and visual quality, fail to learn per-Gaussian motion that reflects true scene dynamics. Without explicit motion cues, appearance and motion are optimized solely under photometric loss, causing per-Gaussian motion to chase pixel residuals rather than true 3D motion. To address this, we propose MoRGS, an efficient online per-Gaussian motion reasoning framework that explicitly models per-Gaussian motion to improve 4D reconstruction quality. Specifically, we leverage optical flow on a sparse set of key views as lightweight motion cues that regularize per-Gaussian motion beyond photometric supervision. To compensate for the sparsity of flow supervision, we learn a per-Gaussian motion offset field that reconciles discrepancies between projected 3D motion and observed flow across views and time. In addition, we introduce a per-Gaussian motion confidence that separates dynamic from static Gaussians and weights Gaussian attribute residual updates, thereby suppressing redundant motion in static regions for better temporal consistency and accelerating the modeling of large motions. Extensive experiments demonstrate that MoRGS achieves state-of-the-art reconstruction quality and motion fidelity among online methods, while maintaining streamable performance.


[50] GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator cs.CV

Liyuan Zhu, Manjunath Narayana, Michal Stary, Will Hutchcroft, Gordon Wetzstein

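TL;DR: GaussFusion is a new method that improves in-the-wild 3D Gaussian splatting (3DGS) reconstruction through geometry-informed video generation. A geometry-informed video-to-video generator refines 3DGS renderings and mitigates common artifacts such as floaters, flickering, and blur. The method also introduces an artifact synthesis pipeline that simulates diverse degradation patterns, achieves state-of-the-art performance on novel-view synthesis benchmarks, and has an efficient variant that runs in real time at 21 FPS.

Details

Motivation: To address common artifacts in in-the-wild 3DGS reconstruction, including floaters, flickering, and blur, which are caused by camera pose errors, incomplete coverage, and noisy geometry initialization.

Result: The method achieves state-of-the-art (SOTA) performance on novel-view synthesis benchmarks, and its efficient variant runs in real time at 21 FPS while maintaining similar performance.

Insight: The core innovation is a geometry-informed video-to-video generator that uses a Gaussian primitive video buffer encoding depth, normals, opacity, and covariance to refine the rendered output of multiple 3DGS reconstruction methods, together with an artifact synthesis pipeline that simulates degradation patterns to improve robustness and generalization.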

Abstract: We present GaussFusion, a novel approach for improving 3D Gaussian splatting (3DGS) reconstructions in the wild through geometry-informed video generation. GaussFusion mitigates common 3DGS artifacts, including floaters, flickering, and blur caused by camera pose errors, incomplete coverage, and noisy geometry initialization. Unlike prior RGB-based approaches limited to a single reconstruction pipeline, our method introduces a geometry-informed video-to-video generator that refines 3DGS renderings across both optimization-based and feed-forward methods. Given an existing reconstruction, we render a Gaussian primitive video buffer encoding depth, normals, opacity, and covariance, which the generator refines to produce temporally coherent, artifact-free frames. We further introduce an artifact synthesis pipeline that simulates diverse degradation patterns, ensuring robustness and generalization. GaussFusion achieves state-of-the-art performance on novel-view synthesis benchmarks, and an efficient variant runs in real time at 21 FPS while maintaining similar performance, enabling interactive 3D applications.


[51] Synergistic Event-SVE Imaging for Quantitative Propellant Combustion Diagnostics cs.CV

Jing Tao, Taihang Lei, Banglei Guan, Ying Qu, Xudong Na

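TL;DR: This paper proposes a closed-loop Event-SVE imaging system for quantitative diagnostics of propellant combustion under high-dynamic-range, smoke-obscured conditions. The system couples a spatially variant exposure (SVE) camera with a pair of neuromorphic event cameras, generates HDR intensity maps through a smoke-aware fusion strategy, and uses these maps as a reference to suppress smoke artifacts in the event cameras. On that basis, it estimates 3D particle trajectories, separation height, and equivalent particle size from stereo events.

Details

Motivation: To address the difficulty of real-time monitoring of high-energy propellant combustion. When extreme high dynamic range, microsecond-scale particle motion, and heavy smoke occur together, conventional imaging suffers from saturation, motion blur, and unstable particle extraction.

Result: In experiments on boron-based propellants, the system produces multimodal equivalent-radius statistics with a maximum calibration error of 0.56%, and captures fast separation transients that are difficult to observe with conventional sensors.

Insight: The novelty lies in coupling an SVE camera with event cameras, using the absolute-intensity reference provided by the SVE branch to correct smoke artifacts in the event stream, and achieving microsecond-resolved 3D measurement through stereo event streams. This provides a calibration-consistent and practical solution for HDR combustion diagnostics under smoke occlusion.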

Abstract: Real-time monitoring of high-energy propellant combustion is difficult. Extreme high dynamic range (HDR), microsecond-scale particle motion, and heavy smoke often occur together. These conditions drive saturation, motion blur, and unstable particle extraction in conventional imaging. We present a closed-loop Event–SVE measurement system that couples a spatially variant exposure (SVE) camera with a stereo pair of neuromorphic event cameras. The SVE branch produces HDR maps with an explicit smoke-aware fusion strategy. A multi-cue smoke-likelihood map is used to separate particle emission from smoke scattering, yielding calibrated intensity maps for downstream analysis. The resulting HDR maps also provide the absolute-intensity reference missing in event cameras. This reference is used to suppress smoke-driven event artifacts and to improve particle-state discrimination. Based on the cleaned event observations, a stereo event-based 3D pipeline estimates separation height and equivalent particle size through feature extraction and triangulation (maximum calibration error 0.56%). Experiments on boron-based propellants show multimodal equivalent-radius statistics. The system also captures fast separation transients that are difficult to observe with conventional sensors. Overall, the proposed framework provides a practical, calibration-consistent route to microsecond-resolved 3D combustion measurement under smoke-obscured HDR conditions.


[52] Probabilistic Concept Graph Reasoning for Multimodal Misinformation Detection cs.CV | cs.CL

Ruichao Yang, Wei Gao, Xiaobin Zhu, Jing Ma, Hongzhan Lin

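TL;DR: This paper proposes Probabilistic Concept Graph Reasoning (PCGR), an interpretable and evolvable framework for multimodal misinformation detection (MMD). The framework recasts MMD as structured, concept-based reasoning and follows a build-then-infer paradigm. It first uses multimodal large language models (MLLMs) to automatically discover and validate novel high-level concepts, building a graph of human-understandable concept nodes, and then applies hierarchical attention over this concept graph to infer claim veracity.

Details

Motivation: To address the escalating challenge of multimodal misinformation. Traditional detectors are black boxes and fragile against new manipulation tactics, so an interpretable detection method that can adapt as tactics evolve is needed.

Result: Experiments show that PCGR achieves state-of-the-art accuracy on multimodal misinformation detection and stronger robustness to emerging manipulation types, outperforming prior methods in both coarse detection and fine-grained manipulation recognition.

Insight: The novelty lies in recasting multimodal misinformation detection as structured reasoning over a concept graph and using MLLMs to automatically discover high-level concepts, which improves interpretability and evolvability. The build-then-infer paradigm and hierarchical attention provide a clear chain of evidence, improving both interpretability and robustness.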

Abstract: Multimodal misinformation poses an escalating challenge that often evades traditional detectors, which are opaque black boxes and fragile against new manipulation tactics. We present Probabilistic Concept Graph Reasoning (PCGR), an interpretable and evolvable framework that reframes multimodal misinformation detection (MMD) as structured and concept-based reasoning. PCGR follows a build-then-infer paradigm, which first constructs a graph of human-understandable concept nodes, including novel high-level concepts automatically discovered and validated by multimodal large language models (MLLMs), and then applies hierarchical attention over this concept graph to infer claim veracity. This design produces interpretable reasoning chains linking evidence to conclusions. Experiments demonstrate that PCGR achieves state-of-the-art MMD accuracy and robustness to emerging manipulation types, outperforming prior methods in both coarse detection and fine-grained manipulation recognition.


[53] Learning Explicit Continuous Motion Representation for Dynamic Gaussian Splatting from Monocular Videos cs.CV

Xuankai Zhang, Junjin Xiao, Shangwei Huang, Wei-shi Zheng, Qing Zhang

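TL;DR: This paper proposes an explicit continuous motion representation for learning dynamic Gaussian Splatting from monocular videos. It models the position and orientation deformation of Gaussians with SE(3) B-spline motion bases and an adaptive control mechanism, and introduces a soft segment reconstruction strategy and a multi-view diffusion model to improve novel view synthesis quality, reaching SOTA results on multiple benchmarks.

Details

Motivation: To address the limitations of existing methods for dynamic Gaussian Splatting from monocular videos: limited ability to model complex continuous motion, low computational efficiency, and a tendency to overfit to training views.

Result: In novel view synthesis experiments on multiple benchmarks, the method surpasses existing state-of-the-art methods.

Insight: The contributions include explicitly modeling continuous deformation with SE(3) B-spline motion bases, an adaptive control mechanism that dynamically adjusts the number of motion bases and control points to improve efficiency and modeling capacity, and the combination of soft segment reconstruction with a multi-view diffusion model to mitigate long-interval motion interference and overfitting.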

Abstract: We present an approach for high-quality dynamic Gaussian Splatting from monocular videos. Going a step beyond previous methods, we explicitly model the continuous position and orientation deformation of dynamic Gaussians using SE(3) B-spline motion bases with a compact set of control points. To improve computational efficiency while enhancing the ability to model complex motions, an adaptive control mechanism is devised to dynamically adjust the number of motion bases and control points. In addition, we develop a soft segment reconstruction strategy to mitigate long-interval motion interference, and employ a multi-view diffusion model to provide multi-view cues that help avoid overfitting to training views. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in novel view synthesis. Our code is available at https://github.com/hhhddddddd/se3bsplinegs.


[54] GIFT: Global Irreplaceability Frame Targeting for Efficient Video Understanding cs.CV

Junpeng Ma, Sashuai Zhou, Guanghao Li, Xin Gao, Yue Cao

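TL;DR: This paper proposes GIFT (Global Irreplaceability Frame Targeting), a framework that addresses the high computational cost Video Large Language Models (VLMs) incur when processing dense frames. GIFT selects keyframes by assessing their intrinsic irreplaceability, combining Directed Diversity with a budget-aware refinement strategy, and thereby avoids the local optima and noisy frame selections that existing methods suffer from because of greedy decisions and decoupled evaluation.

Details

Motivation: Video Large Language Models have achieved remarkable success in video understanding, but the high computational cost of processing dense frames limits their practical use. Existing keyframe selection methods rely on greedy decisions and evaluate relevance and diversity separately, so they often fall into local optima and select irrelevant noise frames.

Result: On long-form video benchmarks with LLaVA-Video-7B, GIFT achieves a maximum average improvement of 12.5% over uniform sampling.

Insight: The contributions include Directed Diversity, which quantifies a frame's uniqueness conditioned on relevance and yields a unified irreplaceability score, and a budget-aware refinement strategy that uses an adaptive iterative process to first secure core frames and then build crucial temporal context as the budget expands. Viewed objectively, the method avoids local optima through global optimization and integrated evaluation, improving both the efficiency and the accuracy of frame selection.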

Abstract: Video Large Language Models (VLMs) have achieved remarkable success in video understanding, but the significant computational cost from processing dense frames severely limits their practical application. Existing methods alleviate this by selecting keyframes, but their greedy decision-making, combined with a decoupled evaluation of relevance and diversity, often falls into local optima and results in erroneously selecting irrelevant noise frames. To address these challenges, we propose GIFT: Global Irreplaceability Frame Targeting, a novel training-free framework that selects frames by assessing their intrinsic irreplaceability. Specifically, we first introduce Directed Diversity to quantify a frame’s uniqueness conditioned on relevance, which allows us to formulate a unified irreplaceability score. Subsequently, our Budget-Aware Refinement strategy employs an adaptive iterative process that first secures a core set of frames with the highest irreplaceability, and then shifts its priority to building crucial temporal context around these selections as the budget expands. Extensive experiments demonstrate that GIFT achieves a maximum average improvement of 12.5% across long-form video benchmarks on LLaVA-Video-7B compared to uniform sampling.


[55] Bridging Perception and Reasoning: Token Reweighting for RLVR in Multimodal LLMs cs.CV

Jinda Lu, Junkang Wu, Jinghan Li, Kexin Huang, Shuo Yang

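TL;DR: This paper proposes Token-Reweighting (ToR), a plug-and-play strategy for the challenge of extending Reinforcement Learning with Verifiable Rewards (RLVR) to multimodal large language models (MLLMs). The challenge arises because MLLM responses interleave perception-related tokens (which ground visual content) with reasoning-related tokens (which build reasoning chains), so optimizing either type in isolation works poorly. ToR identifies critical tokens of both types and dynamically reweights them during RLVR training, explicitly modeling their interdependence and improving on existing methods such as GRPO and DAPO.

Details

Motivation: When RLVR is extended to MLLMs, perception tokens and reasoning tokens in the model's responses are interleaved and interdependent, and optimizing either type alone leads to inferior performance. A method that explicitly models and jointly optimizes both token types is therefore needed.

Result: Across multiple multimodal reasoning benchmarks, ToR delivers consistent gains on top of existing methods such as GRPO and DAPO, achieving state-of-the-art performance with both accurate visual grounding and coherent reasoning.

Insight: The novelty lies in a token-level empirical analysis that reveals the coupling between perception and reasoning tokens, and in a dynamic reweighting strategy that explicitly models this interdependence. Viewed objectively, it is a general, pluggable optimization technique for jointly optimizing multimodal alignment and reasoning in MLLMs that can improve RLVR training.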

Abstract: Extending Reinforcement Learning with Verifiable Rewards (RLVR) to multimodal large language models (MLLMs) faces a fundamental challenge: their responses inherently interleave perception-related tokens, which ground visual content, with reasoning-related tokens, which construct reasoning chains. These token types instantiate distinct yet interdependent capacities – visual grounding and symbolic reasoning – making isolated optimization insufficient. Through token-level empirical analysis, we demonstrate that optimizing either perception- or reasoning-only tokens consistently underperforms full optimization, underscoring their inherent coupling. To address this, we propose a plug-and-play Token-Reweighting (ToR) strategy that explicitly models this interdependence by identifying critical tokens of both types and dynamically reweighting them during RLVR training. Applied on top of existing methods (e.g., GRPO and DAPO), ToR delivers consistent performance gains across multiple multi-modal reasoning benchmarks, achieving state-of-the-art performance with both accurate visual grounding and coherent reasoning.
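
The idea of reweighting selected tokens inside a policy-gradient objective can be illustrated with the toy sketch below. It is not the ToR algorithm: the abstract does not specify how critical tokens are identified or how weights are set, so the boolean `critical_mask`, the constant `boost`, and the function name `reweighted_pg_loss` are all assumptions made for illustration.

```python
import numpy as np

def reweighted_pg_loss(logprobs, advantages, critical_mask, boost=2.0):
    """Weighted mean of per-token policy-gradient losses.

    Tokens flagged in `critical_mask` get weight `boost`; all others get 1.0.
    (Hypothetical weighting scheme, not taken from the paper.)
    """
    logprobs = np.asarray(logprobs, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    weights = np.where(np.asarray(critical_mask, dtype=bool), boost, 1.0)
    per_token = -advantages * logprobs  # REINFORCE-style per-token term
    return float(np.sum(weights * per_token) / np.sum(weights))

# Toy rollout of three tokens with a shared advantage of 1.0.
logprobs = [-1.0, -2.0, -0.5]
advantages = [1.0, 1.0, 1.0]
uniform = reweighted_pg_loss(logprobs, advantages, [False, False, False])
boosted = reweighted_pg_loss(logprobs, advantages, [False, True, False])
```

With the middle token flagged as critical, its term contributes twice as much to the average, so gradients from that token are emphasized relative to the uniform case.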


[56] Visual Attention Drifts, but Anchors Hold: Mitigating Hallucination in Multimodal Large Language Models via Cross-Layer Visual Anchors cs.CV

Chengxu Yang, Jingling Yuan, Chuang Hu, Jiawei Jiang

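TL;DR: This paper targets object hallucination in multimodal large language models and proposes CLVA (Cross-Layer Visual Anchors), a training-free method. The study finds that hallucination stems from deep-layer attention regressing toward visual noise from early layers, while output reliability depends on visual anchors acquired at intermediate layers. CLVA reinforces critical mid-layer features and suppresses regressive noise, pulling deep-layer attention back to the correct visual regions. It performs well across diverse architectures and benchmarks without a significant increase in computation time or GPU memory.

Details

Motivation: To address the widespread problem of object hallucination in multimodal large language models. Existing methods offer little interpretability regarding attention drift, and in particular the mechanism by which attention evolves in the final model stages remains unclear.

Result: Evaluated on diverse model architectures and benchmarks, CLVA shows outstanding performance and effectively mitigates hallucination without significantly increasing computation time or GPU memory consumption.

Insight: The novelty lies in revealing that hallucination stems from deep-layer attention regressing toward early visual noise, and in being the first to propose cross-layer visual anchors (critical mid-layer features captured from attention dynamics) to correct attention. The result is a training-free, interpretable solution.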

Abstract: Multimodal Large Language Models often suffer from object hallucination. While existing research utilizes attention enhancement and visual retracing, we find these works lack sufficient interpretability regarding attention drift in final model stages. In this paper, we investigate the layer-wise evolution of visual features and discover that hallucination stems from deep-layer attention regressing toward initial visual noise from early layers. We observe that output reliability depends on acquiring visual anchors at intermediate layers rather than final layers. Based on these insights, we propose CLVA (Cross-Layer Visual Anchors), a training-free method that reinforces critical mid-layer features while suppressing regressive noise. This approach effectively pulls deep-layer attention back to correct visual regions by utilizing essential anchors captured from attention dynamics. We evaluate our method across diverse architectures and benchmarks, demonstrating outstanding performance without a significant increase in computation time or GPU memory.


[57] THEMIS: Towards Holistic Evaluation of MLLMs for Scientific Paper Fraud Forensics cs.CV

Tzu-Yen Ma, Bo Zhang, Zichen Tang, Junpeng Ding, Haolin Tian

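TL;DR: This paper presents THEMIS, a new multi-task benchmark for comprehensively evaluating the visual fraud reasoning of multimodal large language models in real-world academic scenarios. The benchmark contains over 4,000 questions spanning seven scenarios derived from real retraction cases, and introduces five fraud types and 16 fine-grained image manipulation operations. Experiments on 16 leading MLLMs show that even the best-performing model, GPT-5, reaches an overall accuracy of only 56.15%, demonstrating how stringent the benchmark is.

Details

Motivation: There is a critical gap between existing benchmarks and the complexity of real-world academic fraud, and no comprehensive benchmark systematically evaluates the visual fraud reasoning of MLLMs in complex, realistic academic scenarios.

Result: Experiments on 16 leading MLLMs show that the benchmark is a stringent test. The best-performing model, GPT-5, reaches an overall accuracy of only 56.15%, highlighting the limitations of current models on this kind of task.

Insight: The novelty lies in building a comprehensive benchmark from real retraction cases and synthetic data, characterized by: 1) a high proportion of complex-texture images to reflect real-world complexity; 2) systematic coverage of multiple fraud types with fine-grained, stacked manipulation operations to raise difficulty; and 3) a mapping from fraud types to core reasoning capabilities that supports multi-dimensional evaluation and reveals each model's specific strengths and weaknesses.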

Abstract: We present THEMIS, a novel multi-task benchmark designed to comprehensively evaluate multimodal large language models (MLLMs) on visual fraud reasoning within real-world academic scenarios. Compared to existing benchmarks, THEMIS introduces three major advances. (1) Real-World Scenarios and Complexity: Our benchmark comprises over 4,000 questions spanning seven scenarios, derived from authentic retracted-paper cases and carefully curated multimodal synthetic data. With 60.47% complex-texture images, THEMIS bridges the critical gap between existing benchmarks and the complexity of real-world academic fraud. (2) Fraud-Type Diversity and Granularity: THEMIS systematically covers five challenging fraud types and introduces 16 fine-grained manipulation operations. On average, each sample undergoes multiple stacked manipulation operations, with the diversity and difficulty of these manipulations demanding a high level of visual fraud reasoning from the models. (3) Multi-Dimensional Capability Evaluation: We establish a mapping from fraud types to five core visual fraud reasoning capabilities, thereby enabling an evaluation that reveals the distinct strengths and specific weaknesses of different models across these core capabilities. Experiments on 16 leading MLLMs show that even the best-performing model, GPT-5, achieves an overall performance of only 56.15%, demonstrating that our benchmark presents a stringent test. We expect THEMIS to advance the development of MLLMs for complex, real-world fraud reasoning tasks.


[58] Pixelis: Reasoning in Pixels, from Seeing to Acting cs.CV | cs.AI

Yunpeng Zhou

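TL;DR: Pixelis is an agent that operates directly in pixel space, processing images and videos through a set of executable operations (such as zoom, segment, track, and OCR) and learning from the consequences of its actions. It is trained in three phases: supervised fine-tuning to learn a pixel-tool grammar, curiosity-coherence reward fine-tuning to optimize toolchain structure, and pixel test-time reinforcement learning for label-free adaptation. Across six public image and video benchmarks, Pixelis achieves an average relative gain of 4.08% over the baseline and produces shorter, auditable toolchains.

Details

Motivation: To overcome the limitation of existing vision-language systems as static observers: they only describe pixels without acting and cannot safely improve under distribution shift. Learning through action rather than static description aims at more generalizable, physically grounded visual intelligence.

Result: On six public image and video benchmarks (including VSI-Bench), Pixelis achieves an average relative gain of +4.08% over the same 8B baseline (peaking at +6.03%), while producing shorter, auditable toolchains and keeping KL divergence within a controlled range during test-time learning.

Insight: The contributions include operating directly in pixel space to strengthen physical grounding, a three-phase training recipe (supervised fine-tuning, curiosity-coherence reward, and test-time reinforcement learning) that enables structured toolchain learning and label-free adaptation, and KL anchoring with EMA-based safety control to ensure stability. This offers a new way to link visual reasoning with actionable outcomes.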

Abstract: Most vision-language systems are static observers: they describe pixels, do not act, and cannot safely improve under shift. This passivity limits generalizable, physically grounded visual intelligence. Learning through action, not static description, is essential beyond curated data. We present Pixelis, a pixel-space agent that operates directly on images and videos via a compact set of executable operations (zoom/crop, segment, track, OCR, temporal localization) and learns from its consequences. Pixelis trains in three phases: (1) Supervised Fine-Tuning learns a pixel-tool grammar from Chain-of-Thought-Action traces with a masked imitation loss that upweights operation/argument tokens and auxiliary heads to stabilize pixel-grounded arguments; (2) Curiosity-Coherence Reward Fine-Tuning optimizes a dual-drive objective marrying prediction-error curiosity with adjacent-step coherence and a mild efficiency prior under a KL anchor, yielding short, valid, structured toolchains; (3) Pixel Test-Time RL performs label-free adaptation by retrieving neighbors, voting over complete trajectories rather than answers, and updating toward short, high-fidelity exemplars while constraining drift with a KL-to-EMA safety control. Across six public image and video benchmarks, Pixelis yields consistent improvements: the average relative gain is +4.08% over the same 8B baseline (peaking at +6.03% on VSI-Bench), computed as (ours-baseline)/baseline, while producing shorter, auditable toolchains and maintaining in-corridor KL during test-time learning. Acting within pixels, rather than abstract tokens, grounds multimodal perception in the physical world, linking visual reasoning with actionable outcomes, and enables embodied adaptation without external feedback.
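
The relative-gain metric the abstract defines, (ours-baseline)/baseline, is easy to confuse with an absolute difference in percentage points. The snippet below shows the distinction; the scores are made-up illustrative values, not results from the paper, and `relative_gain` is a hypothetical helper name.

```python
def relative_gain(ours: float, baseline: float) -> float:
    """Relative improvement over the baseline: (ours - baseline) / baseline."""
    return (ours - baseline) / baseline

# Made-up example: a baseline score of 40.0 and a new score of 42.0.
baseline_score = 40.0
our_score = 42.0
absolute_points = our_score - baseline_score          # gain in points
relative = relative_gain(our_score, baseline_score)   # gain as a fraction of the baseline
```

In this toy case the absolute gain is 2.0 points, while the relative gain is 5%, so the same improvement reads as a larger number when the baseline score is low.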


[59] Label What Matters: Modality-Balanced and Difficulty-Aware Multimodal Active Learning cs.CV

Yuqiao Zeng, Xu Wang, Tengfei Liang, Yiqing Hao, Yi Jin

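TL;DR: This paper proposes RL-MBA, a reinforcement learning framework for multimodal active learning. By modeling sample selection as a Markov Decision Process, the framework dynamically adjusts modality contribution weights and estimates sample difficulty through uncertainty-based fusion, achieving modality-balanced, difficulty-aware sample selection under a limited labeling budget.

Details

Motivation: Multimodal learning depends on large-scale labeled data, which is costly. Existing active learning methods for multimodal settings usually assume fixed modality importance and unchanging selection rules, so they cannot adapt to the way the relative value of modalities and the difficulty of samples shift during training.

Result: Experiments on Food101, KineticsSound, and VGGSound show that, under limited labeling budgets, RL-MBA consistently outperforms strong baselines in both classification accuracy and modality fairness.

Insight: The novelty lies in a reinforcement learning framework that models sample selection dynamically, with two key components: Adaptive Modality Contribution Balancing (AMCB) and Evidential Fusion for Difficulty-Aware Policy Adjustment (EFDA), which adaptively adjust modality weights and prioritize informative, difficult samples.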

Abstract: Multimodal learning integrates complementary information from different modalities such as image, text, and audio to improve model performance, but its success relies on large-scale labeled data, which is costly to obtain. Active learning (AL) mitigates this challenge by selectively annotating informative samples. In multimodal settings, many approaches implicitly assume that modality importance is stable across rounds and keep selection rules fixed at the fusion stage, which leaves them insensitive to the dynamic nature of multimodal learning, where the relative value of modalities and the difficulty of instances shift as training proceeds. To address this issue, we propose RL-MBA, a reinforcement-learning framework for modality-balanced, difficulty-aware multimodal active learning. RL-MBA models sample selection as a Markov Decision Process, where the policy adapts to modality contributions, uncertainty, and diversity, and the reward encourages accuracy gains and balance. Two key components drive this adaptability: (1) Adaptive Modality Contribution Balancing (AMCB), which dynamically adjusts modality weights via reinforcement feedback, and (2) Evidential Fusion for Difficulty-Aware Policy Adjustment (EFDA), which estimates sample difficulty via uncertainty-based evidential fusion to prioritize informative samples. Experiments on Food101, KineticsSound, and VGGSound demonstrate that RL-MBA consistently outperforms strong baselines, improving both classification accuracy and modality fairness under limited labeling budgets.


[60] MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning cs.CV

Chenglong Wang, Yifu Huo, Yang Gan, Qiaozhi He, Qi Meng

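TL;DR: This paper proposes a multi-stage reinforcement learning approach to the scarcity of multimodal preference data in training generative multimodal reward models. The approach learns a generalizable reward reasoning capability from large-scale textual preference data and progressively transfers it to multimodal tasks through caption-based and fully multimodal reinforcement learning stages, enabling scalable training without additional multimodal annotations.

Details

Motivation: Current methods that train multimodal reward models with reinforcement learning from verifiable rewards rely heavily on labeled multimodal preference data, which is costly and labor-intensive to obtain and limits how far training can scale.

Result: Experiments show that MSRL improves performance from 66.6% to 75.9% on VL-RewardBench and from 70.2% to 75.7% on GenAI-Bench, substantially improving generative multimodal reward models on visual understanding and visual generation tasks without additional multimodal annotations.

Insight: The core innovation is a staged reinforcement learning framework that addresses the data bottleneck through progressive capability transfer from text to multimodal tasks, together with cross-modal knowledge distillation to improve preference generalization, offering a new approach to data-efficient multimodal reward modeling.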

Abstract: Recent advances in multimodal reward modeling have been largely driven by a paradigm shift from discriminative to generative approaches. Building on this progress, recent studies have further employed reinforcement learning from verifiable rewards (RLVR) to enhance multimodal reward models (MRMs). Despite their success, RLVR-based training typically relies on labeled multimodal preference data, which are costly and labor-intensive to obtain, making it difficult to scale MRM training. To overcome this limitation, we propose a Multi-Stage Reinforcement Learning (MSRL) approach, which can achieve scalable RL for MRMs with limited multimodal data. MSRL replaces the conventional RLVR-based training paradigm by first learning a generalizable reward reasoning capability from large-scale textual preference data, and then progressively transferring this capability to multimodal tasks through caption-based and fully multimodal reinforcement-learning stages. Furthermore, we introduce a cross-modal knowledge distillation approach to improve preference generalization within MSRL. Extensive experiments demonstrate that MSRL effectively scales the RLVR-based training of generative MRMs and substantially improves their performance across both visual understanding and visual generation tasks (e.g., from 66.6% to 75.9% on VL-RewardBench and from 70.2% to 75.7% on GenAI-Bench), without requiring additional multimodal preference annotations. Our code is available at: https://github.com/wangclnlp/MSRL.


[61] MoireMix: A Formula-Based Data Augmentation for Improving Image Classification Robustness cs.CV | cs.AI

Yuto Matsuo, Yoshihiro Fukuhara, Yuki M. Asano, Rintaro Yanagi, Hirokatsu Kataoka

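TL;DR: This paper proposes MoireMix, a lightweight data augmentation method that generates Moire textures from analytic interference patterns to improve the robustness of image classification models. The method synthesizes textures on the fly during training from a closed-form mathematical formula, at very low computational cost (0.0026 seconds per image) and with no external data or storage overhead. Experiments show that it consistently improves the robustness of Vision Transformers on ImageNet-C, ImageNet-R, and adversarial benchmarks, outperforming standard augmentation and existing external-data-free augmentation methods.

Details

Motivation: Existing data augmentation methods, such as diffusion-based synthesis or complex feature mixing, are computationally expensive or depend on external datasets. This work explores procedural augmentation based on analytic interference patterns as an efficient way to generate structured perturbations.

Result: On ImageNet-C, ImageNet-R, and adversarial benchmarks, the method significantly improves the robustness of Vision Transformers, surpassing standard augmentation baselines and other external-data-free augmentation methods and reaching SOTA-level performance.

Insight: The novelty lies in using Moire interference to generate structured perturbations covering a wide range of spatial frequencies, realized through a closed-form formula as lightweight, on-the-fly, storage-free procedural augmentation, which provides an efficient and practical alternative to data-driven generative augmentation.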

Abstract: Data augmentation is a key technique for improving the robustness of image classification models. However, many recent approaches rely on diffusion-based synthesis or complex feature mixing strategies, which introduce substantial computational overhead or require external datasets. In this work, we explore a different direction: procedural augmentation based on analytic interference patterns. Unlike conventional augmentation methods that rely on stochastic noise, feature mixing, or generative models, our approach exploits Moire interference to generate structured perturbations spanning a wide range of spatial frequencies. We propose a lightweight augmentation method that procedurally generates Moire textures on-the-fly using a closed-form mathematical formulation. The patterns are synthesized directly in memory with negligible computational cost (0.0026 seconds per image), mixed with training images during training, and immediately discarded, enabling a storage-free augmentation pipeline without external data. Extensive experiments with Vision Transformers demonstrate that the proposed method consistently improves robustness across multiple benchmarks, including ImageNet-C, ImageNet-R, and adversarial benchmarks, outperforming standard augmentation baselines and existing external-data-free augmentation approaches. These results suggest that analytic interference patterns provide a practical and efficient alternative to data-driven generative augmentation methods.
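
How a closed-form interference pattern can be generated in memory and mixed into a training image is sketched below. The abstract does not give the paper's actual formula or mixing rule, so this sketch assumes the textbook construction of a Moire pattern (the product of two sinusoidal gratings with slightly different frequency and orientation) and a simple convex blend; `grating`, `moire_texture`, `moire_mix`, and all parameter defaults are hypothetical.

```python
import numpy as np

def grating(h, w, freq, theta, phase=0.0):
    """Sinusoidal grating with values in [0, 1], oriented at angle `theta` (radians)."""
    ys, xs = np.mgrid[0:h, 0:w]
    t = xs * np.cos(theta) + ys * np.sin(theta)
    return 0.5 * (1.0 + np.cos(2.0 * np.pi * freq * t + phase))

def moire_texture(h, w, f1=0.10, f2=0.11, theta1=0.0, theta2=0.05):
    """Product of two slightly mismatched gratings, which produces low-frequency beats."""
    return grating(h, w, f1, theta1) * grating(h, w, f2, theta2)

def moire_mix(image, alpha=0.3, **texture_kwargs):
    """Blend a float image of shape (H, W, C) in [0, 1] with a freshly generated texture."""
    h, w = image.shape[:2]
    texture = moire_texture(h, w, **texture_kwargs)[..., None]  # broadcast over channels
    return (1.0 - alpha) * image + alpha * texture

# Usage on a flat gray toy image; the texture is discarded after mixing.
image = np.full((8, 8, 3), 0.5)
augmented = moire_mix(image, alpha=0.3)
```

Because the texture is recomputed from a formula on each call, nothing needs to be stored or loaded from disk, which matches the storage-free property the abstract emphasizes.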


[62] AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization cs.CV

Jiawei Lin, Wanrong Zhu, Vlad I Morariu, Christopher Tensmeyer

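TL;DR: AnyDoc is a unified document generation framework that, through large-scale HTML/CSS data synthesis and height-aware reinforcement learning optimization, handles multiple document generation tasks across 111 categories and 32 styles.

Details

Motivation: To overcome the limited scale and coverage of the human-crafted datasets that existing document generation methods depend on, and to handle the content overflow problem that arises during fine-tuning.

Result: On the three tasks of intention-to-document, document derendering, and element-to-document, AnyDoc outperforms both general-purpose multimodal large language models and task-specific baselines in qualitative and quantitative experiments.

Insight: The contributions include a unified data synthesis pipeline that builds the large-scale synthetic dataset DocHTML, and a height-aware reinforcement learning post-training procedure whose reward function is based on the difference between predicted and target document heights, which mitigates content overflow.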

Abstract: Document generation has gained growing attention in the field of AI-driven content creation. In this work, we push its boundaries by introducing AnyDoc, a framework capable of handling multiple generation tasks across a wide spectrum of document categories, all represented in a unified HTML/CSS format. To overcome the limited coverage and scale of existing human-crafted document datasets, AnyDoc first establishes a scalable data synthesis pipeline to automatically generate documents in HTML/CSS form. This pipeline yields DocHTML, a large-scale dataset containing 265,206 document samples, while spanning 111 categories and 32 distinct styles. Additionally, all documents are equipped with comprehensive metadata, including design intentions, HTML/CSS source code, visual assets, and rendered screenshots. Building on the curated dataset, AnyDoc fine-tunes multi-modal large language models (MLLMs) to achieve three practical document generation tasks: intention-to-document, document derendering, and element-to-document. To address the content overflow issue observed during fine-tuning, AnyDoc further incorporates a height-aware reinforcement learning (HARL) post-training procedure. By defining a reward function based on the difference between predicted and target document heights, overflow is penalized and gradually mitigated during HARL, thereby enhancing overall performance. Qualitative and quantitative experiments demonstrate that AnyDoc outperforms both general-purpose MLLMs and task-specific baselines across all three tasks.
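
A reward based on the difference between predicted and target document heights could take a form like the one below. The abstract does not state the exact reward used in HARL, so the normalization by target height, the linear penalty, the clipping at zero, and the names `height_reward` and `scale` are assumptions for illustration only.

```python
def height_reward(pred_height: float, target_height: float, scale: float = 1.0) -> float:
    """Reward in [0, 1] that decreases linearly with the relative height mismatch.

    Assumes target_height > 0. A perfect match gives 1.0; large mismatches are clipped to 0.0.
    """
    relative_diff = abs(pred_height - target_height) / target_height
    return max(0.0, 1.0 - scale * relative_diff)

# A rendered page that matches, overflows by 50%, or overflows by 200% (toy pixel heights).
exact = height_reward(1000, 1000)
overflow = height_reward(1500, 1000)
severe = height_reward(3000, 1000)
```

This symmetric form also penalizes documents that render too short; a variant that penalizes only overflow would replace the absolute difference with `max(0, pred_height - target_height)`.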


[63] AirSplat: Alignment and Rating for Robust Feed-Forward 3D Gaussian Splatting cs.CV

Minh-Quan Viet Bui, Jaeho Moon, Munchurl Kim

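TL;DR: This paper proposes AirSplat, a framework that adapts the robust geometric priors of 3D Vision Foundation Models to high-fidelity, pose-free novel view synthesis. Its core is Self-Consistent Pose Alignment, which resolves pose-geometry discrepancy, and Rating-based Opacity Matching, which filters out degraded primitives, together improving reconstruction quality.

Details

Motivation: Although 3D Vision Foundation Models perform very well on zero-shot visual geometry estimation, applying them directly to generalizable, pose-free novel view synthesis remains challenging. This paper aims to solve that adaptation problem.

Result: Experiments on large-scale benchmarks show that the method significantly outperforms current state-of-the-art pose-free novel view synthesis methods in reconstruction quality.

Insight: The novelty lies in a training-time Self-Consistent Pose Alignment feedback loop that ensures pixel-aligned supervision, and in Rating-based Opacity Matching, which uses the local 3D geometry consistency knowledge of a sparse-view teacher model to filter out degraded primitives. This offers an effective route to using 3DVFMs for both visual geometry estimation and high-quality view synthesis.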

Abstract: While 3D Vision Foundation Models (3DVFMs) have demonstrated remarkable zero-shot capabilities in visual geometry estimation, their direct application to generalizable novel view synthesis (NVS) remains challenging. In this paper, we propose AirSplat, a novel training framework that effectively adapts the robust geometric priors of 3DVFMs into high-fidelity, pose-free NVS. Our approach introduces two key technical contributions: (1) Self-Consistent Pose Alignment (SCPA), a training-time feedback loop that ensures pixel-aligned supervision to resolve pose-geometry discrepancy; and (2) Rating-based Opacity Matching (ROM), which leverages the local 3D geometry consistency knowledge from a sparse-view NVS teacher model to filter out degraded primitives. Experimental results on large-scale benchmarks demonstrate that our method significantly outperforms state-of-the-art pose-free NVS approaches in reconstruction quality. Our AirSplat highlights the potential of adapting 3DVFMs to enable simultaneous visual geometry estimation and high-quality view synthesis.


[64] Robust Principal Component Completion cs.CV | cs.LG

Yinjian Wang, Wei Li, Yuanyuan Gui, James E. Fowler, Gemine Vivone

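TL;DR: This paper proposes robust principal component completion (RPCC), a new framework for cases in which a sparse foreground replaces or occludes elements of a low-rank background. The method determines the support of the sparse component through variational Bayesian inference, avoiding the thresholding that traditional RPCA methods require, and achieves near-optimal estimates on synthetic data along with robust foreground extraction and anomaly detection on color video and hyperspectral datasets.

Details

Motivation: Traditional robust principal component analysis (RPCA) assumes that the sparse component is added to the low-rank background, but in many practical applications the sparse foreground replaces or occludes background elements, so the model is mismatched. This paper aims to address that limitation.

Result: The method achieves near-optimal estimates on synthetic data, robust foreground extraction on real color video datasets, and effective anomaly detection on hyperspectral datasets.

Insight: The novelty lies in recasting the problem as identifying the sparse component indirectly by determining its support, and solving it with a fully probabilistic Bayesian sparse tensor factorization and variational inference. The approach converges to a hard classifier for the support, so no post-hoc thresholding is needed.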

Abstract: Robust principal component analysis (RPCA) seeks a low-rank component and a sparse component from their summation. Yet, in many applications of interest, the sparse foreground actually replaces, or occludes, elements from the low-rank background. To address this mismatch, a new framework is proposed in which the sparse component is identified indirectly through determining its support. This approach, called robust principal component completion (RPCC), is solved via variational Bayesian inference applied to a fully probabilistic Bayesian sparse tensor factorization. Convergence to a hard classifier for the support is shown, thereby eliminating the post-hoc thresholding required of most prior RPCA-driven approaches. Experimental results reveal that the proposed approach delivers near-optimal estimates on synthetic data as well as robust foreground-extraction and anomaly-detection performance on real color video and hyperspectral datasets, respectively. Source implementation and Appendices are available at https://github.com/WongYinJ/BCP-RPCC.
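
The mismatch between the additive and the replacement observation models described in the abstract can be written out explicitly. The notation below is an assumption chosen for illustration (the symbols $\mathbf{X}$, $\mathbf{L}$, $\mathbf{S}$, $\mathbf{F}$, and the binary support mask $\mathbf{M}$ are not taken from the paper):

```latex
% Classical RPCA: the observation is the sum of a low-rank and a sparse part.
\mathbf{X} = \mathbf{L} + \mathbf{S},
\qquad \operatorname{rank}(\mathbf{L}) \text{ small}, \quad \mathbf{S} \text{ sparse}.

% Replacement (occlusion) model: on the support, the foreground F overwrites
% the background instead of being added to it.
\mathbf{X} = (\mathbf{1} - \mathbf{M}) \odot \mathbf{L} + \mathbf{M} \odot \mathbf{F},
\qquad \mathbf{M} \in \{0, 1\}^{m \times n}.
```

Under the second model the background entries on the support of $\mathbf{M}$ are unobserved, so recovering $\mathbf{L}$ there resembles a completion problem, and estimating $\mathbf{M}$ itself plays the role of identifying the sparse component through its support.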


[65] EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions cs.CVPDF

Taegyoon Yoon, Yegyu Han, Seojin Ji, Jaewoo Park, Sojeong Kim

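TL;DR: This paper presents EgoXtreme, a large-scale dataset dedicated to 6D object pose estimation in egocentric views under extreme conditions. It covers three challenging scenarios (industrial maintenance, sports, and emergency rescue) that feature real-world problems such as severe motion blur, dynamic illumination, and visual obstructions. Experiments show that the generalization of existing state-of-the-art pose estimators degrades markedly under extreme conditions, whereas tracking-based methods gain performance by exploiting temporal information.

Details

Motivation: Existing 6D object pose estimation benchmarks fail to reflect the extreme challenges of real-world egocentric applications (severe motion blur, dynamic illumination, visual obstructions), leaving a significant gap between lab data and real-world use.

Result: Evaluating state-of-the-art generalizable pose estimators on EgoXtreme shows that their generalization breaks down under extreme conditions, especially in low light; simple image restoration (e.g., deblurring) brings no improvement under extreme conditions; tracking-based methods show performance gains, indicating that exploiting temporal information is effective in fast-motion scenarios.

Insight: The novelty is the construction of the first egocentric 6D pose estimation dataset focused on extreme conditions, which exposes the limitations of existing methods under real-world challenges. Viewed objectively, exploiting temporal information may be a key direction for improving robustness under extreme conditions, while image preprocessing alone is not enough to solve such problems.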

Abstract: Smart glasses are emerging as useful devices since they provide plenty of insights under hands-busy, eyes-on-task situations. To understand the context of the wearer, 6D object pose estimation in egocentric view is becoming essential. However, existing 6D object pose estimation benchmarks fail to capture the challenges of real-world egocentric applications, which are often dominated by severe motion blur, dynamic illumination, and visual obstructions. This discrepancy creates a significant gap between controlled lab data and chaotic real-world applications. To bridge this gap, we introduce EgoXtreme, a new large-scale 6D pose estimation dataset captured entirely from an egocentric perspective. EgoXtreme features three challenging scenarios - industrial maintenance, sports, and emergency rescue - designed to introduce severe perceptual ambiguities through extreme lighting, heavy motion blur, and smoke. Evaluations of state-of-the-art generalizable pose estimators on EgoXtreme indicate that their generalization fails to hold in extreme conditions, especially under low light. We further demonstrate that simply applying image restoration (e.g., deblurring) offers no positive improvement for extreme conditions. In contrast, tracking-based approaches show performance gains, implying that using temporal information in fast-motion scenarios is meaningful. We conclude that EgoXtreme is an essential resource for developing and evaluating the next generation of pose estimation models robust enough for real-world egocentric vision. The dataset and code are available at https://taegyoun88.github.io/EgoXtreme/


[66] SAVe: Self-Supervised Audio-visual Deepfake Detection Exploiting Visual Artifacts and Audio-visual Misalignment cs.CV | cs.AI | cs.LG | cs.MM | cs.SDPDF

Sahibzada Adil Shahzad, Ammarah Hashmi, Junichi Yamagishi, Yusuke Yasuda, Yu Tsao

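TL;DR: This paper proposes SAVe, a self-supervised audio-visual deepfake detection framework trained entirely on authentic videos. It generates identity-preserving, region-aware self-blended pseudo-manipulations to emulate tampering artifacts, and uses an audio-visual alignment component to detect temporal misalignment patterns in lip-speech synchronization, capturing cross-modal evidence.

Details

Motivation: To address the challenges that visual artifacts and cross-modal inconsistencies pose for multimodal deepfake detection, along with the generator bias and limited generalization that result from training on synthetic data.

Result: SAVe shows competitive in-domain performance and strong cross-dataset generalization on FakeAVCeleb and AV-LipSync-TIMIT.

Insight: The novelty is a fully self-supervised learning paradigm that emulates visual artifacts through generated pseudo-manipulations and models lip-speech misalignment, which removes the dependence on synthetic data and improves scalability and robustness.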

Abstract: Multimodal deepfakes can exhibit subtle visual artifacts and cross-modal inconsistencies, which remain challenging to detect, especially when detectors are trained primarily on curated synthetic forgeries. Such synthetic dependence can introduce dataset and generator bias, limiting scalability and robustness to unseen manipulations. We propose SAVe, a self-supervised audio-visual deepfake detection framework that learns entirely on authentic videos. SAVe generates on-the-fly, identity-preserving, region-aware self-blended pseudo-manipulations to emulate tampering artifacts, enabling the model to learn complementary visual cues across multiple facial granularities. To capture cross-modal evidence, SAVe also models lip-speech synchronization via an audio-visual alignment component that detects temporal misalignment patterns characteristic of audio-visual forgeries. Experiments on FakeAVCeleb and AV-LipSync-TIMIT demonstrate competitive in-domain performance and strong cross-dataset generalization, highlighting self-supervised learning as a scalable paradigm for multimodal deepfake detection.


[67] FD$^2$: A Dedicated Framework for Fine-Grained Dataset Distillation cs.CV | cs.AIPDF

Hongxu Ma, Guang Li, Shijie Wang, Dongzhan Zhou, Baoli Sun

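TL;DR: This paper proposes FD^2, a framework dedicated to fine-grained dataset distillation. It targets a problem of existing decoupled dataset distillation methods on fine-grained datasets: because they rely on coarse class labels, the distilled samples show too much intra-class variation and too little inter-class difference.

Details

Motivation: Existing decoupled dataset distillation methods rely mainly on coarse class-level supervision and optimize the samples within each class in a nearly identical manner. On fine-grained datasets, where inter-class differences are subtle, this yields distilled samples with excessive intra-class variation and insufficient inter-class difference, which limits localized discriminative cues and hurts recognition performance.

Result: Experiments on multiple fine-grained and general datasets show that FD^2 integrates seamlessly with decoupled dataset distillation methods and improves performance in most settings, indicating good transferability.

Insight: The novelty is to update class prototypes by aggregating discriminative representations through counterfactual attention learning, and to add two constraints during distillation: a fine-grained characteristic constraint that aligns each sample with its class prototype while repelling other prototypes, and a similarity constraint that diversifies attention across same-class samples. Together these localize discriminative regions and build fine-grained representations for distillation.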

Abstract: Dataset distillation (DD) compresses a large training set into a small synthetic set, reducing storage and training cost, and has shown strong results on general benchmarks. Decoupled DD further improves efficiency by splitting the pipeline into pretraining, sample distillation, and soft-label generation. However, existing decoupled methods largely rely on coarse class-label supervision and optimize samples within each class in a nearly identical manner. On fine-grained datasets, this often yields distilled samples that (i) retain large intra-class variation with subtle inter-class differences and (ii) become overly similar within the same class, limiting localized discriminative cues and hurting recognition. To solve the above-mentioned problems, we propose FD$^{2}$, a dedicated framework for Fine-grained Dataset Distillation. FD$^{2}$ localizes discriminative regions and constructs fine-grained representations for distillation. During pretraining, counterfactual attention learning aggregates discriminative representations to update class prototypes. During distillation, a fine-grained characteristic constraint aligns each sample with its class prototype while repelling others, and a similarity constraint diversifies attention across same-class samples. Experiments on multiple fine-grained and general datasets show that FD$^{2}$ integrates seamlessly with decoupled DD and improves performance in most settings, indicating strong transferability.


[68] Learning to Rank Caption Chains for Video-Text Alignment cs.CV | cs.LGPDF

Ansel Blume, Burak Uzkent, Shalini Chaudhuri, Garin Kessler

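TL;DR: This paper proposes a ranking-optimization approach to video-text alignment that generates ordered caption chains in place of conventional binary direct preference optimization (DPO), in order to assess more precisely how faithful a response is to the visual content.

Details

Motivation: The binary "winner-takes-all" approach of conventional DPO falls short for vision-language models because it ignores that even a less preferred response may remain highly faithful to the visual input; a finer-grained optimization strategy is needed.

Result: Ranking optimization outperforms binary DPO on long-form content generation and assessment, and the experiments show that finetuning the vision encoder is essential to its effectiveness, which challenges the view of DPO as purely a language-reweighting process.

Insight: The novelty is to generate ordered caption chains at scale through repeated caption degradation and to replace binary preference learning with ranking optimization, highlighting the key role of vision-encoder finetuning in improving vision-language model alignment.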

Abstract: Direct preference optimization (DPO) is an effective technique to train language models to generate preferred over dispreferred responses. However, this binary “winner-takes-all” approach is suboptimal for vision-language models whose response quality is highly dependent on visual content. In particular, a response may still be faithful to the visual inputs even if it is less preferable than an alternative. The standard Bradley-Terry DPO formulation lacks this nuance, upweighting winning responses without sufficient regard for whether the “losing” response still maintains high visual fidelity. In this work, we investigate ranking optimization as an alternative that more precisely situates responses’ faithfulness to visual inputs. We focus on video-text alignment using detailed video captions, proposing a method to generate challenging, totally ordered caption chains at scale through repeated caption degradation. Our results show ranking optimization outperforms binary DPO for long-form content generation and assessment, and importantly, we find that these approaches require finetuning of the vision encoder to be effective, challenging the view of DPO as purely a language-reweighting process.
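
To make the contrast between a binary preference loss and a ranking loss concrete, here is a minimal sketch. The abstract does not say which ranking objective is used, so the Plackett-Luce listwise loss below is an assumed stand-in, and the function names are hypothetical. The scores stand for whatever scalar a DPO-style method assigns to each caption (for example a scaled log-probability ratio).

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def bradley_terry_loss(score_win, score_lose):
    """Binary preference loss: -log sigmoid(score_win - score_lose)."""
    return math.log1p(math.exp(-(score_win - score_lose)))

def plackett_luce_loss(scores_best_to_worst):
    """Listwise ranking loss for a totally ordered chain of captions.

    At each position the caption ranked there should beat every caption
    ranked after it, so the loss sums one softmax term per position.
    """
    chain = scores_best_to_worst
    return sum(logsumexp(chain[k:]) - chain[k] for k in range(len(chain) - 1))

# A chain of two captions reduces to the binary Bradley-Terry case.
pair_gap = abs(plackett_luce_loss([1.0, 0.0]) - bradley_terry_loss(1.0, 0.0))

# Scores that agree with the chain order give a lower loss than reversed scores.
loss_ordered = plackett_luce_loss([2.0, 1.0, 0.0])
loss_reversed = plackett_luce_loss([0.0, 1.0, 2.0])
```

Unlike the pairwise loss, the listwise form uses every position in a degradation chain, so a mildly degraded caption is penalized less than a heavily degraded one.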


[69] Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models cs.CV | cs.AIPDF

Chengyu Fang, Heng Guo, Zheng Jiang, Chunming He, Xiu Li

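TL;DR: This paper proposes Photon, a framework that represents 3D medical volumes efficiently with variable-length token sequences, addressing the high computational cost that keeps multimodal large language models from scaling to 3D imaging in clinical visual question answering. It introduces instruction-conditioned token scheduling and surrogate gradient propagation to reduce tokens adaptively during training and inference, which lowers computational cost and mitigates the attention dilution caused by redundant tokens, and it improves stability and reliability through gradient restoration and regularization objectives.

Details

Motivation: To address the difficulty of scaling multimodal large language models to 3D medical image analysis: computational cost is high, and existing methods (such as relying on 2D slices or fixed-length token compression) disrupt volumetric continuity and obscure subtle findings.

Result: Experiments on diverse medical visual question answering tasks show that Photon achieves state-of-the-art accuracy while reducing resource usage and accelerating both training and inference.

Insight: The novelties include: variable-length token sequences to represent 3D volumes; instruction-conditioned token scheduling and surrogate gradient propagation for adaptive token reduction; gradient restoration to enable differentiable optimization despite discrete token dropping; and regularization objectives that mitigate language bias and improve reliability. Viewed objectively, its efficient adaptive compression mechanism and stability-oriented design are instructive for 3D multimodal tasks.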

Abstract: Multimodal large language models are promising for clinical visual question answering tasks, but scaling to 3D imaging is hindered by high computational costs. Prior methods often rely on 2D slices or fixed-length token compression, disrupting volumetric continuity and obscuring subtle findings. We present Photon, a framework that represents 3D medical volumes with token sequences of variable length. Photon introduces instruction-conditioned token scheduling and surrogate gradient propagation to adaptively reduce tokens during both training and inference, which lowers computational cost while mitigating the attention dilution caused by redundant tokens. It incorporates a custom backpropagation rule with gradient restoration to enable differentiable optimization despite discrete token drop. To stabilize token compression and ensure reliable use of visual evidence, Photon further applies regularization objectives that mitigate language-only bias and improve reliability. Experiments on diverse medical visual question answering tasks show that Photon achieves state-of-the-art accuracy while reducing resource usage and accelerating both training and inference.


[70] SportSkills: Physical Skill Learning from Sports Instructional Videos cs.CVPDF

Kumar Ashutosh, Chi Hsuan Wu, Kristen Grauman

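TL;DR: This paper introduces SportSkills, the first large-scale dataset of in-the-wild sports instructional videos geared towards physical skill learning, with more than 360k instructional videos and 630k visual demonstrations covering 55 sports. Experiments show that the dataset substantially improves a model's understanding of fine-grained differences between actions. The paper also formulates, for the first time, the task of mistake-conditioned instructional video retrieval, which provides visual feedback for personalized skill improvement.

Details

Motivation: Existing large-scale video datasets focus mainly on general human activity and lack depth of coverage for fine-grained physical skill learning, so a video dataset dedicated to skill learning is needed to support action understanding and feedback generation.

Result: With the same model, representations trained on SportSkills achieve gains of up to 4x over those trained on traditional activity-centric datasets; evaluations by professional coaches indicate that the retrieval approach significantly advances the ability of video models to personalize visual instructions for a user query.

Insight: The novelty is the construction of the first large-scale dataset for sports skill learning and the introduction of the mistake-conditioned instructional video retrieval task, which connects representation learning with actionable feedback generation and provides both a data foundation and a task framework for physical skill learning.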

Abstract: Current large-scale video datasets focus on general human activity, but lack depth of coverage on fine-grained activities needed to address physical skill learning. We introduce SportSkills, the first large-scale sports dataset geared towards physical skill learning with in-the-wild video. SportSkills has more than 360k instructional videos containing more than 630k visual demonstrations paired with instructional narrations explaining the know-how behind the actions from 55 varied sports. Through a suite of experiments, we show that SportSkills unlocks the ability to understand fine-grained differences between physical actions. Our representation achieves gains of up to 4x with the same model trained on traditional activity-centric datasets. Crucially, building on SportSkills, we introduce the first large-scale task formulation of mistake-conditioned instructional video retrieval, bridging representation learning and actionable feedback generation (e.g., “here’s my execution of a skill; which video clip should I watch to improve it?”). Formal evaluations by professional coaches show our retrieval approach significantly advances the ability of video models to personalize visual instructions for a user query.


[71] Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds cs.CVPDF

Bin Yang, Mohamed Abdelsamad, Miao Zhang, Alexandru Paul Condurache

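TL;DR: This paper proposes PointINS, an instance-oriented self-supervised learning framework designed to strengthen instance awareness in point cloud representations. Through geometry-aware learning, it jointly learns high-level semantic understanding and geometric reasoning, making up for the weakness of existing self-supervised methods on instance localization.

Details

Motivation: Existing self-supervised learning methods emphasize semantic awareness but transfer poorly to instance localization, while instance awareness is a fundamental component of 3D perception. Closing this gap is needed to progress toward true 3D foundation models that support all downstream tasks.

Result: Extensive experiments across five datasets show that PointINS improves indoor instance segmentation by 3.5% mAP on average and outdoor panoptic segmentation by 4.1% PQ on average.

Insight: The novelty is an orthogonal offset branch that jointly learns semantic and geometric information, together with two complementary regularization strategies (Offset Distribution Regularization and Spatial Clustering Regularization) that make instance localization more robust. This offers a new direction for building scalable 3D foundation models.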

Abstract: Recent advances in self-supervised learning (SSL) for point clouds have substantially improved 3D scene understanding without human annotations. Existing approaches emphasize semantic awareness by enforcing feature consistency across augmented views or by masked scene modeling. However, the resulting representations transfer poorly to instance localization, and often require full finetuning for strong performance. Instance awareness is a fundamental component of 3D perception, thus bridging this gap is crucial for progressing toward true 3D foundation models that support all downstream tasks on 3D data. In this work, we introduce PointINS, an instance-oriented self-supervised framework that enriches point cloud representations through geometry-aware learning. PointINS employs an orthogonal offset branch to jointly learn high-level semantic understanding and geometric reasoning, yielding instance awareness. We identify two consistent properties essential for robust instance localization and formulate them as complementary regularization strategies, Offset Distribution Regularization (ODR), which aligns predicted offsets with empirically observed geometric priors, and Spatial Clustering Regularization (SCR), which enforces local coherence by regularizing offsets with pseudo-instance masks. Through extensive experiments across five datasets, PointINS achieves on average +3.5% mAP improvement for indoor instance segmentation and +4.1% PQ gain for outdoor panoptic segmentation, paving the way for scalable 3D foundation models.


[72] AG-EgoPose: Leveraging Action-Guided Motion and Kinematic Joint Encoding for Egocentric 3D Pose Estimation cs.CVPDF

Md Mushfiqur Azam, John Quarles, Kevin Desai

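TL;DR: AG-EgoPose is a novel dual-stream framework for egocentric 3D human pose estimation from fisheye camera input. It integrates short- and long-range motion context with fine-grained spatial cues: a spatial stream generates 2D joint heatmaps and spatial features, a temporal stream captures motion dynamics through an action recognition backbone, and a transformer decoder fuses and refines these representations, integrating spatial and temporal evidence at the joint level while maintaining anatomical constraints.

Details

Motivation: To address the challenges of egocentric 3D human pose estimation caused by severe perspective distortion, limited body visibility, and complex camera motion. Existing methods typically rely on single-frame analysis or limited temporal fusion and fail to exploit the rich motion context available in egocentric videos.

Result: Experiments on real-world datasets show that AG-EgoPose achieves state-of-the-art (SOTA) performance in both quantitative and qualitative metrics.

Insight: The novelty is a dual-stream framework that combines short- and long-range motion context with spatial cues and uses learnable joint tokens in a transformer decoder for joint-level spatio-temporal fusion while maintaining anatomical constraints, offering a new way to exploit the rich motion information in egocentric video.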

Abstract: Egocentric 3D human pose estimation remains challenging due to severe perspective distortion, limited body visibility, and complex camera motion inherent in first-person viewpoints. Existing methods typically rely on single-frame analysis or limited temporal fusion, which fails to effectively leverage the rich motion context available in egocentric videos. We introduce AG-EgoPose, a novel dual-stream framework that integrates short- and long-range motion context with fine-grained spatial cues for robust pose estimation from fisheye camera input. Our framework features two parallel streams: A spatial stream uses a weight-sharing ResNet-18 encoder-decoder to generate 2D joint heatmaps and corresponding joint-specific spatial feature tokens. Simultaneously, a temporal stream uses a ResNet-50 backbone to extract visual features, which are then processed by an action recognition backbone to capture the motion dynamics. These complementary representations are fused and refined in a transformer decoder with learnable joint tokens, which allows for the joint-level integration of spatial and temporal evidence while maintaining anatomical constraints. Experiments on real-world datasets demonstrate that AG-EgoPose achieves state-of-the-art performance in both quantitative and qualitative metrics. Code is available at: https://github.com/Mushfiq5647/AG-EgoPose.


[73] AnyID: Ultra-Fidelity Universal Identity-Preserving Video Generation from Any Visual References cs.CVPDF

Jiahao Wang, Hualian Sheng, Sijia Cai, Yuxiao Yang, Weizhan Zhang

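TL;DR: AnyID is an ultra-fidelity, universal identity-preserving video generation framework that can generate high-quality videos from any visual reference (such as faces, portraits, or videos). It unifies heterogeneous inputs through a scalable omni-referenced architecture and uses a primary-referenced generation paradigm with a differential prompt for precise attribute-level control. The model is trained on a large-scale dataset and finetuned with reinforcement learning based on human preferences, achieving excellent identity fidelity and controllability.

Details

Motivation: Existing identity-preserving video generation methods are typically optimized for a single identity reference, which restricts creative flexibility. Relying on a single source is also an ill-posed problem that makes it hard for the model to reproduce an identity faithfully in novel contexts.

Result: Extensive evaluations confirm that AnyID achieves ultra-high identity fidelity and superior attribute-level controllability across different task settings.

Insight: The novelties include a scalable omni-referenced architecture that unifies heterogeneous identity inputs and a primary-referenced generation paradigm that uses a differential prompt for precise attribute-level control. Finetuning with reinforcement learning on human preference data further improves fidelity and controllability, making this a systematic solution for high-quality video generation.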

Abstract: Identity-preserving video generation offers powerful tools for creative expression, allowing users to customize videos featuring their beloved characters. However, prevailing methods are typically designed and optimized for a single identity reference. This underlying assumption restricts creative flexibility by inadequately accommodating diverse real-world input formats. Relying on a single source also constitutes an ill-posed scenario, causing an inherently ambiguous setting that makes it difficult for the model to faithfully reproduce an identity across novel contexts. To address these issues, we present AnyID, an ultra-fidelity identity-preservation video generation framework that features two core contributions. First, we introduce a scalable omni-referenced architecture that effectively unifies heterogeneous identity inputs (e.g., faces, portraits, and videos) into a cohesive representation. Second, we propose a primary-referenced generation paradigm, which designates one reference as a canonical anchor and uses a novel differential prompt to enable precise, attribute-level controllability. We conduct training on a large-scale, meticulously curated dataset to ensure robustness and high fidelity, and then perform a final fine-tuning stage using reinforcement learning. This process leverages a preference dataset constructed from human evaluations, where annotators performed pairwise comparisons of videos based on two key criteria: identity fidelity and prompt controllability. Extensive evaluations validate that AnyID achieves ultra-high identity fidelity as well as superior attribute-level controllability across different task settings.


[74] Free-Lunch Long Video Generation via Layer-Adaptive O.O.D Correction cs.CV | cs.AIPDF

Jiahao Tian, Chenxi Song, Wei Cheng, Chi Zhang

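TL;DR: This paper proposes FreeLOC, a training-free, layer-adaptive framework that addresses the drop in visual quality when video diffusion models pre-trained on short clips are used to generate long videos. It uses Video-based Relative Position Re-encoding (VRPR) and Tiered Sparse Attention (TSA) to handle frame-level relative position out-of-distribution (O.O.D) and context-length O.O.D problems respectively, and applies them selectively through a layer-adaptive probing mechanism, which markedly improves the quality and temporal consistency of long video generation.

Details

Motivation: Pre-trained video diffusion models are typically trained on short clips, and applying them directly to long-video generation causes a notable degradation in visual quality. The paper finds that this stems mainly from two challenges, frame-level relative position O.O.D and context-length O.O.D, and aims to mitigate them without additional training.

Result: Extensive experiments show that the method significantly outperforms existing training-free methods and achieves state-of-the-art (SOTA) results in both temporal consistency and visual quality.

Insight: The novelties are: 1) explicitly identifying and formalizing two specific O.O.D problems in long video generation; 2) proposing targeted techniques, VRPR (multi-granularity hierarchical re-encoding) and TSA (structured attention density); 3) introducing a layer-adaptive probing mechanism that applies the corrections selectively and efficiently according to each transformer layer's sensitivity to the O.O.D problems, yielding a "free lunch" performance gain.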

Abstract: Generating long videos using pre-trained video diffusion models, which are typically trained on short clips, presents a significant challenge. Directly applying these models for long-video inference often leads to a notable degradation in visual quality. This paper identifies that this issue primarily stems from two out-of-distribution (O.O.D) problems: frame-level relative position O.O.D and context-length O.O.D. To address these challenges, we propose FreeLOC, a novel training-free, layer-adaptive framework that introduces two core techniques: Video-based Relative Position Re-encoding (VRPR) for frame-level relative position O.O.D, a multi-granularity strategy that hierarchically re-encodes temporal relative positions to align with the model’s pre-trained distribution, and Tiered Sparse Attention (TSA) for context-length O.O.D, which preserves both local detail and long-range dependencies by structuring attention density across different temporal scales. Crucially, we introduce a layer-adaptive probing mechanism that identifies the sensitivity of each transformer layer to these O.O.D issues, allowing for the selective and efficient application of our methods. Extensive experiments demonstrate that our approach significantly outperforms existing training-free methods, achieving state-of-the-art results in both temporal consistency and visual quality. Code is available at https://github.com/Westlake-AGI-Lab/FreeLOC.


[75] An Image Dataset of Common Skin Diseases of Bangladesh and Benchmarking Performance with Machine Learning Models cs.CV | cs.LGPDF

Sazzad Hossain, Saiful Islam, Muhammad Ibrahim, Md. Rasel Ahmed, Md Shuayb

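TL;DR: This paper builds a publicly available image dataset of common skin diseases in Bangladesh, containing 1612 images across five diseases (contact dermatitis, vitiligo, eczema, scabies, and tinea ringworm), and benchmarks classification performance with several machine learning and deep learning models.

Details

Motivation: To address the shortage of dermatologists and diagnostic equipment in densely populated regions such as Bangladesh by building a public dataset and applying AI techniques to support automated skin disease detection, avoiding the severe health consequences of delayed diagnosis and treatment.

Result: Several machine learning and deep learning models are benchmarked on the self-collected dataset and their classification performance is reported, but the abstract gives no specific quantitative results or comparison with the state of the art.

Insight: The novelty is a public image dataset of skin diseases common in a specific region (Bangladesh and South Asia), which fills a resource gap, together with an initial benchmark of model performance, providing a valuable data resource for research on machine-learning-based automated skin disease diagnosis.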

Abstract: Skin diseases are a major public health concern worldwide, and their detection is often challenging without access to dermatological expertise. In countries like Bangladesh, which is highly populated, the number of qualified skin specialists and diagnostic instruments is insufficient to meet the demand. The lack of proper detection and treatment of skin diseases may lead to severe health consequences, including death. Skin diseases commonly change the color, texture, and pattern of the skin, and in this era of artificial intelligence and machine learning, they can be detected using image processing and computer vision techniques. In response to this challenge, we develop a publicly available dataset focused on common skin disease detection using machine learning techniques. We focus on five prevalent skin diseases in Bangladesh: Contact Dermatitis, Vitiligo, Eczema, Scabies, and Tinea Ringworm. The dataset consists of 1612 images (of which 250 are distinct while the others are augmented), collected directly from patients at the outpatient department of Faridpur Medical College, Faridpur, Bangladesh. The data comprise 302, 381, 301, 316, and 312 images of Dermatitis, Eczema, Scabies, Tinea Ringworm, and Vitiligo, respectively. Although the data are collected regionally, the selected diseases are common across many countries, especially in South Asia, making the dataset potentially valuable for global applications in machine learning-based dermatology. We also apply several machine learning and deep learning models to the dataset and report classification performance. We expect that this research will garner attention from machine learning and deep learning researchers and practitioners working in the field of automated disease diagnosis.


[76] Activation Matters: Test-time Activated Negative Labels for OOD Detection with Vision-Language Models cs.CV | cs.AI | cs.LGPDF

Yabin Zhang, Maya Varma, Yunhe Gao, Jean-Benoit Delbrouck, Jiaming Liu

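TL;DR: This paper proposes TANL (Test-time Activated Negative Labels), a new method for improving out-of-distribution (OOD) detection with vision-language models. It evaluates online how strongly test samples activate labels in the corpus, dynamically selects highly activated negative labels aligned with the test distribution, and designs an activation-aware score function that emphasizes those labels, so that OOD samples are identified more effectively.

Details

Motivation: Existing OOD detection methods typically introduce negative labels that are far from the in-distribution (ID) classes, but these labels may be poorly activated on OOD samples and fail to capture OOD characteristics. This paper aims to solve the under-activation of negative labels in order to improve the accuracy and robustness of OOD detection.

Result: Experiments on diverse backbones and task settings, including the large-scale ImageNet benchmark, validate the effectiveness of TANL. On ImageNet, TANL reduces FPR95 significantly, from 17.5% to 9.8%.

Insight: The core novelty is to mine highly activated negative labels dynamically at test time and to use the activation information for adaptive scoring. This gives a training-free, test-efficient, and theoretically grounded approach to OOD detection and highlights the key role of activation in separating ID from OOD samples.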

Abstract: Out-of-distribution (OOD) detection aims to identify samples that deviate from in-distribution (ID). One popular pipeline addresses this by introducing negative labels distant from ID classes and detecting OOD based on their distance to these labels. However, such labels may present poor activation on OOD samples, failing to capture the OOD characteristics. To address this, we propose Test-time Activated Negative Labels (TANL) by dynamically evaluating activation levels across the corpus dataset and mining candidate labels with high activation responses during the testing process. Specifically, TANL identifies high-confidence test images online and accumulates their assignment probabilities over the corpus to construct a label activation metric. Such a metric leverages historical test samples to adaptively align with the test distribution, enabling the selection of distribution-adaptive activated negative labels. By further exploring the activation information within the current testing batch, we introduce a more fine-grained, batch-adaptive variant. To fully utilize label activation knowledge, we propose an activation-aware score function that emphasizes negative labels with stronger activations, boosting performance and enhancing its robustness to the label number. Our TANL is training-free, test-efficient, and grounded in theoretical justification. Experiments on diverse backbones and wide task settings validate its effectiveness. Notably, on the large-scale ImageNet benchmark, TANL significantly reduces the FPR95 from 17.5% to 9.8%. Codes are available at https://github.com/YBZh/OpenOOD-VLM.
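
For readers unfamiliar with the FPR95 figure quoted above, the following sketch shows how the metric is commonly computed: the false positive rate on OOD samples at the score threshold that still accepts 95% of ID samples. It assumes that a higher score means "more in-distribution"; the function name and the toy scores are hypothetical and unrelated to the paper's experiments.

```python
import math

def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted as ID at the threshold keeping `tpr` of ID samples.

    Assumes higher scores indicate in-distribution samples.
    """
    ranked = sorted(id_scores, reverse=True)
    # Number of ID samples that must be accepted; the small epsilon guards
    # against floating-point round-up in tpr * len(ranked).
    keep = max(1, math.ceil(tpr * len(ranked) - 1e-9))
    threshold = ranked[keep - 1]
    accepted_ood = sum(score >= threshold for score in ood_scores)
    return accepted_ood / len(ood_scores)

# Toy scores (made up for illustration).
id_scores = [0.90, 0.80, 0.85, 0.70, 0.95, 0.60, 0.75, 0.88, 0.92, 0.65]
ood_scores = [0.10, 0.30, 0.62, 0.20, 0.70, 0.05, 0.40, 0.50]

fpr95 = fpr_at_tpr(id_scores, ood_scores)
```

With ten ID samples, keeping 95% requires accepting all ten, so the threshold is the lowest ID score (0.60) and two of the eight OOD scores pass it. Lower FPR95 is better.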


[77] Hyperspectral Trajectory Image for Multi-Month Trajectory Anomaly Detection cs.CV | cs.LGPDF

Md Awsafur Rahman, Chandrakanth Gudavalli, Hardik Prajapati, B. S. Manjunath

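TL;DR: This paper proposes TITAnD, which recasts trajectory anomaly detection as a vision problem by representing trajectory data as a Hyperspectral Trajectory Image (HTI), and introduces the Cyclic Factorized Transformer (CFT) to model the cyclic structure of human daily behavior, enabling anomaly detection over dense multi-month GPS trajectories for the first time.

Details

Motivation: Existing methods cannot detect anomalies over dense multi-month GPS trajectories because their quadratic cost is too high, while scalable sparse stay-point methods discard fine-grained evidence. This paper aims to unify the representation of dense and sparse trajectories and overcome this bottleneck.

Result: TITAnD achieves the best AUC-PR on both sparse and dense benchmarks, surpassing vision models such as UNet while being 11-75x faster than a standard Transformer with comparable memory, and enables dense multi-month anomaly detection for the first time.

Insight: The novelty is to represent trajectories as hyperspectral images (HTI), turning detection into image classification and semantic segmentation, and to design CFT, which factorizes attention along the two temporal axes to encode the cyclic inductive bias of human routines and greatly reduces computational cost.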

Abstract: Trajectory anomaly detection underpins applications from fraud detection to urban mobility analysis. Dense GPS methods preserve fine-grained evidence such as abnormal speeds and short-duration events, but their quadratic cost makes multi-month analysis intractable; consequently, no existing approach detects anomalies over multi-month dense GPS trajectories. The field instead relies on scalable sparse stay-point methods that discard this evidence, forcing separate architectures for each regime and preventing knowledge transfer. We argue this bottleneck is unnecessary: human trajectories, dense or sparse, share a natural two-dimensional cyclic structure along within-day and across-day axes. We therefore propose TITAnD (Trajectory Image Transformer for Anomaly Detection), which reformulates trajectory anomaly detection as a vision problem by representing trajectories as a Hyperspectral Trajectory Image (HTI): a day x time-of-day grid whose channels encode spatial, semantic, temporal, and kinematic information from either modality, unifying both under a single representation. Under this formulation, agent-level detection reduces to image classification and temporal localization to semantic segmentation. To model this representation, we introduce the Cyclic Factorized Transformer (CFT), which factorizes attention along the two temporal axes, encoding the cyclic inductive bias of human routines, while reducing attention cost by orders of magnitude and enabling dense multi-month anomaly detection for the first time. Empirically, TITAnD achieves the best AUC-PR across sparse and dense benchmarks, surpassing vision models like UNet while being 11-75x faster than the Transformer with comparable memory, demonstrating that vision reformulation and structure-aware modeling are jointly essential. Code will be made public soon.
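
The day x time-of-day layout can be sketched in a few lines. This is a simplified illustration, not the paper's pipeline: it keeps only two channels (mean latitude and longitude per cell), whereas the HTI described above also encodes semantic, temporal, and kinematic channels, and the function names are hypothetical. The second function compares the number of attention pairs for full attention over all cells with attention factorized along the two axes, using the standard axial-attention count as an assumption about how the factorization saves cost.

```python
SECONDS_PER_DAY = 86400

def build_trajectory_grid(points, num_days, bins_per_day):
    """Bin (seconds_since_start, lat, lon) points into a day x time-of-day grid.

    Each cell holds the mean (lat, lon) of its points, or None when empty.
    """
    sums = [[[0.0, 0.0, 0] for _ in range(bins_per_day)] for _ in range(num_days)]
    for timestamp, lat, lon in points:
        day, second = divmod(int(timestamp), SECONDS_PER_DAY)
        if not 0 <= day < num_days:
            continue  # outside the observation window
        cell = sums[day][second * bins_per_day // SECONDS_PER_DAY]
        cell[0] += lat
        cell[1] += lon
        cell[2] += 1
    return [
        [(c[0] / c[2], c[1] / c[2]) if c[2] else None for c in row]
        for row in sums
    ]

def attention_pairs(num_days, bins_per_day):
    """Token pairs scored by full attention vs. attention factorized per axis."""
    tokens = num_days * bins_per_day
    full = tokens * tokens
    factorized = tokens * (num_days + bins_per_day)
    return full, factorized

points = [(0, 10.0, 20.0), (1800, 12.0, 22.0), (90000, 1.0, 2.0)]
grid = build_trajectory_grid(points, num_days=2, bins_per_day=24)
full_pairs, factorized_pairs = attention_pairs(num_days=90, bins_per_day=288)
```

For a hypothetical 90-day window at 5-minute resolution, each token attends to 378 others along its row and column instead of all 25,920 cells, which is where the order-of-magnitude saving in this simplified count comes from.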


[78] EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval cs.CVPDF

Yuhan Chen, Pengwen Dai, Chuan Wang, Dayan Wu, Xiaochun Cao

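TL;DR: This paper proposes EagleNet, an energy-aware fine-grained relationship learning network for improving text-video retrieval. Its fine-grained relationship learning mechanism builds a text-frame graph, learns the relationships among texts and frames, and aggregates text candidates into an enriched text embedding that incorporates frame contextual information. It also introduces energy-aware matching, which models the energy of text-frame interactions to capture the distribution of real text-video pairs more accurately.

Details

Motivation: Existing methods focus mainly on video representations or cross-modal alignment, or only enrich text expressiveness to match video semantics, but they ignore the rich interactions among the frames within a video. As a result, the expanded text cannot capture frame contextual information, which leaves a disparity between text and video.

Result: Extensive experiments on several benchmark datasets, including MSRVTT, DiDeMo, MSVD, and VATEX, demonstrate the superiority of EagleNet.

Insight: The novelties include: the Fine-Grained Relationship Learning mechanism (FRL), which builds a text-frame graph and learns relationships to generate context-aware text embeddings; Energy-Aware Matching (EAM), which models interaction energy to improve relationship learning; and replacing the conventional softmax-based contrastive loss with a sigmoid loss for more effective cross-modal alignment and stable training.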

Abstract: Text-video retrieval tasks have seen significant improvements due to the recent development of large-scale vision-language pre-trained models. Traditional methods primarily focus on video representations or cross-modal alignment, while recent works shift toward enriching text expressiveness to better match the rich semantics in videos. However, these methods use only interactions between text and frames/video, and ignore rich interactions among the internal frames within a video, so the final expanded text cannot capture frame contextual information, leading to disparities between text and video. In response, we introduce Energy-Aware Fine-Grained Relationship Learning Network (EagleNet) to generate accurate and context-aware enriched text embeddings. Specifically, the proposed Fine-Grained Relationship Learning mechanism (FRL) first constructs a text-frame graph by the generated text candidates and frames, then learns relationships among texts and frames, which are finally used to aggregate text candidates into an enriched text embedding that incorporates frame contextual information. To further improve fine-grained relationship learning in FRL, we design Energy-Aware Matching (EAM) to model the energy of text-frame interactions and thus accurately capture the distribution of real text-video pairs. Moreover, for more effective cross-modal alignment and stable training, we replace the conventional softmax-based contrastive loss with the sigmoid loss. Extensive experiments have demonstrated the superiority of EagleNet across MSRVTT, DiDeMo, MSVD, and VATEX. Codes are available at https://github.com/draym28/EagleNet.
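
The last design choice, swapping a softmax-based contrastive loss for a sigmoid loss, can be illustrated as follows. The abstract only says "sigmoid loss", so the pairwise form below (each text-video pair scored independently, with a temperature and a bias) is an assumed SigLIP-style formulation; the temperature and bias values are placeholders.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def sigmoid_pairwise_loss(sim, temperature=10.0, bias=-10.0):
    """Each (text i, video j) pair is an independent binary classification."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0  # matched pairs sit on the diagonal
            total -= log_sigmoid(label * (temperature * sim[i][j] + bias))
    return total / n

def softmax_contrastive_loss(sim, temperature=10.0):
    """Text-to-video InfoNCE: each text must pick its video among the batch."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [temperature * s for s in sim[i]]
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[i]
    return total / n

aligned = [[1.0, 0.0], [0.0, 1.0]]   # toy similarity matrix, matches on the diagonal
swapped = [[0.0, 1.0], [1.0, 0.0]]   # every text is closest to the wrong video
```

Both losses are lower for `aligned` than for `swapped`. The practical difference is that the sigmoid form needs no normalization across the batch, since each pair contributes its own term.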


[79] V2U4Real: A Real-world Large-scale Dataset for Vehicle-to-UAV Cooperative Perception cs.CVPDF

Weijia Li, Haoen Xiang, Tianxu Wang, Shuaibing Wu, Qiming Xia

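TL;DR: This paper introduces V2U4Real, the first large-scale real-world multi-modal dataset for Vehicle-to-UAV (V2U) cooperative perception. Collected by a ground vehicle and a UAV equipped with multi-view LiDARs and RGB cameras, it covers urban streets, campuses, rural roads, and other traffic scenarios, and contains over 56K LiDAR frames, 56K multi-view camera images, and 700K annotated 3D bounding boxes across four classes. The paper establishes benchmarks for single-agent 3D object detection, cooperative 3D object detection, and object tracking, and evaluates several state-of-the-art models to confirm that V2U cooperation improves perception robustness and long-range awareness.

Details

Motivation: Existing autonomous driving perception systems are often limited by occlusions, blind spots, and limited sensing range, and the existing Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) cooperative perception paradigms are restricted to ground-level collaboration, so they cannot fully address large-scale occlusions or long-range perception in complex environments. The authors therefore aim to advance cross-view cooperative perception by introducing V2U cooperation to close these gaps.

Result: A comprehensive evaluation of several state-of-the-art models on V2U4Real shows that V2U cooperation effectively enhances perception robustness and long-range awareness. The dataset provides benchmarks for single-agent 3D object detection, cooperative 3D object detection, and object tracking.

Insight: The main novelty is the first large-scale real-world multi-modal dataset for V2U cooperative perception, which fills the gap in cross-view (ground and aerial) cooperative perception datasets. Viewed objectively, its coverage of multiple scenarios, large-scale annotation, and varied task benchmarks make it a valuable resource for research on vehicle-UAV cooperative perception and for exploring how the UAV viewpoint can overcome the inherent limits of ground-level perception.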

Abstract: Modern autonomous vehicle perception systems are often constrained by occlusions, blind spots, and limited sensing range. While existing cooperative perception paradigms, such as Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I), have demonstrated their effectiveness in mitigating these challenges, they remain limited to ground-level collaboration and cannot fully address large-scale occlusions or long-range perception in complex environments. To advance research in cross-view cooperative perception, we present V2U4Real, the first large-scale real-world multi-modal dataset for Vehicle-to-UAV (V2U) cooperative object perception. V2U4Real is collected by a ground vehicle and a UAV equipped with multi-view LiDARs and RGB cameras. The dataset covers urban streets, university campuses, and rural roads under diverse traffic scenarios, comprising over 56K LiDAR frames, 56K multi-view camera images, and 700K annotated 3D bounding boxes across four classes. To support a wide range of research tasks, we establish benchmarks for single-agent 3D object detection, cooperative 3D object detection, and object tracking. Comprehensive evaluations of several state-of-the-art models demonstrate the effectiveness of V2U cooperation in enhancing perception robustness and long-range awareness. The V2U4Real dataset and codebase are available at https://github.com/VjiaLi/V2U4Real.


[80] Towards Controllable Low-Light Image Enhancement: A Continuous Multi-illumination Dataset and Efficient State Space Framework cs.CVPDF

Hongru Han, Tingrui Guo, Liming Zhang, Yan Su, Qiwen Xu

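TL;DR: This paper proposes Controllable Low-light Enhancement (CLE), a new paradigm for low-light image enhancement that addresses the limitations of conventional deterministic-mapping methods on an ill-posed problem. To this end, the authors build Light100, a dataset with continuous real-world illumination transitions, and propose the CLE-RWKV framework, which uses a noise-decoupled supervision strategy in the HVI color space to separate illumination modulation from texture restoration and a Space-to-Depth (S2D) strategy to adapt efficient state space models (SSMs) to dense prediction.

Details

Motivation: Conventional low-light enhancement methods treat the task as a deterministic mapping and cannot handle the multimodal solution space created by unknown ambient conditions and sensor parameters. This often produces luminance discrepancies between predictions and labels and a reliance on "gt-mean" post-processing to align outputs for evaluation. This paper aims to reformulate the task as a well-posed conditional problem that allows controllable enhancement.

Result: Experiments on seven benchmarks show that the method achieves competitive performance and robust controllability, and significantly reduces the reliance on gt-mean post-processing.

Insight: The core novelty is shifting the task paradigm to controllable enhancement and building the continuous multi-illumination dataset Light100 to support it. On the technical side, the noise-decoupled supervision strategy in the HVI color space separates illumination from texture effectively, and the S2D strategy resolves the "scanning gap" that arises when SSMs are used for dense prediction while keeping linear complexity.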

Abstract: Low-light image enhancement (LLIE) has traditionally been formulated as a deterministic mapping. However, this paradigm often struggles to account for the ill-posed nature of the task, where unknown ambient conditions and sensor parameters create a multimodal solution space. Consequently, state-of-the-art methods frequently encounter luminance discrepancies between predictions and labels, often necessitating “gt-mean” post-processing to align output luminance for evaluation. To address this fundamental limitation, we propose a transition toward Controllable Low-light Enhancement (CLE), explicitly reformulating the task as a well-posed conditional problem. To this end, we introduce CLE-RWKV, a holistic framework supported by Light100, a new benchmark featuring continuous real-world illumination transitions. To resolve the conflict between luminance control and chromatic fidelity, a noise-decoupled supervision strategy in the HVI color space is employed, effectively separating illumination modulation from texture restoration. Architecturally, to adapt efficient State Space Models (SSMs) for dense prediction, we leverage a Space-to-Depth (S2D) strategy. By folding spatial neighborhoods into channel dimensions, this design allows the model to recover local inductive biases and effectively bridge the “scanning gap” inherent in flattened visual sequences without sacrificing linear complexity. Experiments across seven benchmarks demonstrate that our approach achieves competitive performance and robust controllability, providing a real-world multi-illumination alternative that significantly reduces the reliance on gt-mean post-processing.
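
The abstract describes S2D as folding spatial neighborhoods into channel dimensions. A minimal sketch of the generic space-to-depth rearrangement follows, in plain Python on nested lists. The block size of 2, the channel ordering, and the function name `space_to_depth` are illustrative assumptions, not details taken from CLE-RWKV.

```python
def space_to_depth(image, block=2):
    """Fold each block x block spatial neighbourhood into the channel axis.

    image: nested list shaped [H][W][C]; H and W must be divisible by block.
    Returns a nested list shaped [H // block][W // block][C * block * block].
    """
    height, width = len(image), len(image[0])
    if height % block or width % block:
        raise ValueError("H and W must be divisible by block")
    out = []
    for i in range(0, height, block):
        row = []
        for j in range(0, width, block):
            cell = []
            # Concatenate the channels of every pixel in this neighbourhood.
            for di in range(block):
                for dj in range(block):
                    cell.extend(image[i + di][j + dj])
            row.append(cell)
        out.append(row)
    return out


# 4x4 single-channel image whose pixel value is its raster index.
image = [[[r * 4 + c] for c in range(4)] for r in range(4)]
folded = space_to_depth(image, block=2)
# folded is 2x2 with 4 channels; folded[0][0] holds pixels 0, 1, 4, 5.
```

After folding, a flattened scan visits 4 positions instead of 16, and pixels that were vertical neighbors (such as 0 and 4) sit in the same token rather than a full row apart in the sequence.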


[81] MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data cs.CVPDF

Zhekai Chen, Yuqing Wang, Manyuan Zhang, Xihui Liu

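TL;DR: This paper proposes the MACRO framework to address the performance degradation of multi-reference image generation as the number of input reference images grows. Its core contributions are MacroData, a large-scale structured long-context dataset of 400K samples with up to 10 reference images each, and MacroBench, a standardized evaluation benchmark of 4,000 samples. Experiments show that fine-tuning on MacroData substantially improves multi-reference generation.

Details

Motivation: Current multi-reference image generation models degrade severely as the number of input references grows. The root cause is that existing datasets lack the structured, long-context supervision needed to learn dense inter-reference dependencies.

Result: Extensive experiments on the proposed MacroBench benchmark show that fine-tuning on MacroData yields substantial improvements in multi-reference image generation. Ablation studies further reveal the benefits of cross-task co-training and effective strategies for handling long-context complexity.

Insight: The novelty lies in the first systematic construction of a large-scale, structured, long-context multi-reference image dataset covering four dimensions (customization, illustration, spatial reasoning, and temporal dynamics), together with a standardized evaluation benchmark, providing key resources for the data bottleneck and evaluation difficulties of multi-reference generation. Viewed objectively, the structured data construction method and the cross-task co-training strategy are worth borrowing.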

Abstract: Generating images conditioned on multiple visual references is critical for real-world applications such as multi-subject composition, narrative illustration, and novel view synthesis, yet current models suffer from severe performance degradation as the number of input references grows. We identify the root cause as a fundamental data bottleneck: existing datasets are dominated by single- or few-reference pairs and lack the structured, long-context supervision needed to learn dense inter-reference dependencies. To address this, we introduce MacroData, a large-scale dataset of 400K samples, each containing up to 10 reference images, systematically organized across four complementary dimensions – Customization, Illustration, Spatial reasoning, and Temporal dynamics – to provide comprehensive coverage of the multi-reference generation space. Recognizing the concurrent absence of standardized evaluation protocols, we further propose MacroBench, a benchmark of 4,000 samples that assesses generative coherence across graded task dimensions and input scales. Extensive experiments show that fine-tuning on MacroData yields substantial improvements in multi-reference generation, and ablation studies further reveal synergistic benefits of cross-task co-training and effective strategies for handling long-context complexity. The dataset and benchmark will be publicly released.


[82] HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT cs.CVPDF

Yongsung Kim, Wooseok Song, Jaihyun Lew, Hun Hwangbo, Jaehoon Lee

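TL;DR: This paper proposes a two-stage sparsification method built on HeSS (Head Sensitivity Score) to optimize the global attention layers of VGGT (Visual Geometry Grounded Transformer). It quantifies each attention head's sensitivity to sparsification and redistributes the attention budget accordingly, reducing computational cost while preserving model performance.

Details

Motivation: The global attention layers of VGGT have quadratic computational complexity, and existing sparsification-based acceleration techniques often cause substantial accuracy degradation. The authors hypothesize that this degradation stems from heterogeneity in how sensitive individual attention heads are to sparsification, whereas existing methods apply a uniform sparsity pattern to all heads; a method is therefore needed that quantifies and exploits this heterogeneity.

Result: Experiments show that HeSS effectively captures head-wise sparsification sensitivity and substantially mitigates performance degradation under high sparsity, demonstrating strong robustness across sparsification levels.

Insight: The novelty is HeSS, a new metric based on a Hessian approximation that quantifies attention-head sensitivity and guides sparsity redistribution. This offers a way to sparsify Transformer models efficiently while accounting for heterogeneity across heads, and it may carry over to other attention optimizations that must balance computational efficiency against accuracy.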

Abstract: Visual Geometry Grounded Transformer (VGGT) has advanced 3D vision, yet its global attention layers suffer from quadratic computational costs that hinder scalability. Several sparsification-based acceleration techniques have been proposed to alleviate this issue, but they often suffer from substantial accuracy degradation. We hypothesize that the accuracy degradation stems from the heterogeneity in head-wise sparsification sensitivity, as the existing methods apply a uniform sparsity pattern across all heads. Motivated by this hypothesis, we present a two-stage sparsification pipeline that effectively quantifies and exploits head-wise sparsification sensitivity. In the first stage, we measure head-wise sparsification sensitivity using a novel metric, the Head Sensitivity Score (HeSS), which approximates the Hessian with respect to two distinct error terms on a small calibration set. In the inference stage, we perform HeSS-Guided Sparsification, leveraging the pre-computed HeSS to reallocate the total attention budget, assigning denser attention to sensitive heads and sparser attention to more robust ones. We demonstrate that HeSS effectively captures head-wise sparsification sensitivity and empirically confirm that attention heads in the global attention layers exhibit heterogeneous sensitivity characteristics. Extensive experiments further show that our method effectively mitigates performance degradation under high sparsity, demonstrating strong robustness across varying sparsification levels. Code is available at https://github.com/libary753/HeSS.
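
The abstract says the total attention budget is reallocated so that sensitive heads get denser attention and robust heads sparser attention. One simple way to do this is sketched below; it is not necessarily the rule used by HeSS. It splits a fixed total in proportion to per-head scores and caps each head at a keep ratio of 1.0. The proportional rule, the function name `reallocate_budget`, and the example scores are all assumptions made for illustration.

```python
def reallocate_budget(scores, mean_keep):
    """Split a fixed total attention budget across heads in proportion to scores.

    scores: per-head sensitivity (higher = more sensitive), all positive.
    mean_keep: average fraction of attention entries kept per head, in (0, 1].
    Returns per-head keep ratios in [0, 1] whose sum is len(scores) * mean_keep.
    """
    n = len(scores)
    total = n * mean_keep
    keep = [0.0] * n
    free = set(range(n))  # heads not yet capped at a keep ratio of 1.0
    while free:
        # Budget left after accounting for heads already fixed at 1.0.
        remaining = total - sum(keep[i] for i in range(n) if i not in free)
        weight = sum(scores[i] for i in free)
        capped = set()
        for i in free:
            keep[i] = remaining * scores[i] / weight
            if keep[i] > 1.0:
                capped.add(i)
        if not capped:
            break
        for i in capped:
            keep[i] = 1.0
        free -= capped
    return keep


# Hypothetical sensitivity scores for four heads, 50% average density.
keep = reallocate_budget([4.0, 1.0, 1.0, 2.0], mean_keep=0.5)
```

With these placeholder scores the most sensitive head stays fully dense while the two least sensitive heads keep a quarter of their entries. The summed budget is unchanged from a uniform 50% pattern.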


[83] Image Rotation Angle Estimation: Comparing Circular-Aware Methods cs.CV | cs.AI | eess.IVPDF

Maximilian Woehrer

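TL;DR: This paper presents a comprehensive comparison of five circular-aware methods for image rotation angle estimation. The task is challenging because angles have circular topology, which creates boundary discontinuities for standard regression. Using transfer learning, the study systematically evaluates the methods on sixteen modern architectures and finds that probabilistic methods (particularly the circular Gaussian distribution) are the most robust, while classification achieves the highest accuracy on well-matched backbones. The best configuration reaches a mean absolute error of 1.23° on the DRC-D dataset and substantially outperforms prior work on COCO.

Details

Motivation: Automatic image rotation estimation is a key preprocessing step in many vision pipelines, but the circular topology of angles creates boundary discontinuities that hinder standard regression methods, so methods that explicitly handle circularity need to be studied.

Result: On DRC-D, the best configuration (classification with EfficientViT-B3) achieves a mean absolute error of 1.23°, while the circular Gaussian distribution with MambaOut Base reaches 1.24° with greater robustness. On COCO 2014, the best configuration reaches 3.71° MAE, substantially better than prior work, improving further to 2.84° MAE on the larger COCO 2017 dataset.

Insight: The contribution is a systematic empirical comparison of five circular-aware methods, revealing the cross-architecture robustness of probabilistic methods (such as the circular Gaussian distribution) and the trade-off between accuracy and training stability for classification. Viewed objectively, explicitly building circular topology into the loss function or output representation is the key insight for periodic regression problems of this kind.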

Abstract: Automatic image rotation estimation is a key preprocessing step in many vision pipelines. This task is challenging because angles have circular topology, creating boundary discontinuities that hinder standard regression methods. We present a comprehensive study of five circular-aware methods for global orientation estimation: direct angle regression with circular loss, classification via angular binning, unit-vector regression, phase-shifting coder, and circular Gaussian distribution. Using transfer learning from ImageNet-pretrained models, we systematically evaluate these methods across sixteen modern architectures by adapting their output heads for rotation-specific predictions. Our results show that probabilistic methods, particularly the circular Gaussian distribution, are the most robust across architectures, while classification achieves the best accuracy on well-matched backbones but suffers training instabilities on others. The best configuration (classification with EfficientViT-B3) achieves a mean absolute error (MAE) of 1.23° (mean across five independent runs) on the DRC-D dataset, while the circular Gaussian distribution with MambaOut Base achieves a virtually identical 1.24° with greater robustness across backbones. Training and evaluating our top-performing method-architecture combinations on COCO 2014, the best configuration reaches 3.71° MAE, improving substantially over prior work, with further improvement to 2.84° on the larger COCO 2017 dataset.
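
The boundary discontinuity mentioned in the abstract is easy to see numerically. The sketch below is illustrative only and is not taken from the paper's code. It shows a wrap-around absolute error in degrees and the encode/decode step behind unit-vector regression; the function names are assumptions.

```python
import math


def circular_abs_error(pred_deg, true_deg):
    """Shortest angular distance between two angles, in degrees (0..180)."""
    diff = (pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def angle_to_unit_vector(angle_deg):
    """Encode an angle as (cos, sin), which is continuous across 0/360."""
    rad = math.radians(angle_deg)
    return math.cos(rad), math.sin(rad)


def unit_vector_to_angle(x, y):
    """Decode (cos, sin) back to an angle in [0, 360)."""
    return math.degrees(math.atan2(y, x)) % 360.0


naive = abs(359.0 - 1.0)                    # 358.0: treats near-identical angles as far apart
wrapped = circular_abs_error(359.0, 1.0)    # 2.0: the true angular gap
decoded = unit_vector_to_angle(*angle_to_unit_vector(350.0))
```

A plain L1 loss would penalize a 359° prediction for a 1° target almost maximally, which is the failure the circular-aware formulations are designed to avoid.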


[84] InstanceAnimator: Multi-Instance Sketch Video Colorization cs.CVPDF

Yinhan Zhang, Yue Ma, Bingyuan Wang, Kunyu Feng, Yeying Jin

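TL;DR: This paper proposes InstanceAnimator, a novel Diffusion Transformer framework for multi-instance sketch video colorization. It addresses the limitations of existing methods in user-control flexibility, multi-instance alignment, and detail fidelity.

Details

Motivation: Existing methods have three core limitations: inflexible user control caused by heavy reliance on a single reference frame, poor instance controllability that leads to misalignment in multi-character scenes, and degraded detail fidelity in fine-grained regions.

Result: Extensive experiments show that InstanceAnimator achieves superior multi-instance colorization with enhanced user control, high visual quality, and strong instance consistency.

Insight: The innovations are: 1) a Canvas Guidance Condition that allows free placement of reference elements and background, eliminating workflow fragmentation; 2) an Instance Matching Mechanism that resolves misalignment by integrating instance features with the sketches; 3) an Adaptive Decoupled Control Module that enhances detail fidelity by injecting semantic features from character, background, and text conditions into the diffusion process.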

Abstract: We propose InstanceAnimator, a novel Diffusion Transformer framework for multi-instance sketch video colorization. Existing methods suffer from three core limitations: inflexible user control due to heavy reliance on single reference frames, poor instance controllability leading to misalignment in multi-character scenarios, and degraded detail fidelity in fine-grained regions. To address these challenges, we introduce three corresponding innovations. First, a Canvas Guidance Condition eliminates workflow fragmentation by allowing free placement of reference elements and background, enabling unprecedented user flexibility. Second, an Instance Matching Mechanism resolves misalignment by integrating instance features with the sketches, ensuring precise control over multiple characters. Third, an Adaptive Decoupled Control Module enhances detail fidelity by injecting semantic features from characters, backgrounds, and text conditions into the diffusion process. Extensive experiments demonstrate that InstanceAnimator achieves superior multi-instance colorization with enhanced user control, high visual quality, and strong instance consistency.


[85] Multimodal Dataset Distillation via Phased Teacher Models cs.CVPDF

Shengbin Guo, Hang Zhao, Senqiao Yang, Chenyang Jiang, Yuhang Cheng

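TL;DR: This paper proposes PTM-ST, a new multimodal dataset distillation framework that uses phased teacher models and a shortcut-based trajectory construction strategy to capture the teacher's dynamic knowledge across training stages more accurately, producing high-quality synthetic datasets for efficient knowledge transfer.

Details

Motivation: Existing methods struggle to capture the complex, dynamically evolving knowledge in the later training stages of teacher models, which degrades student performance and the quality of the distilled data. The paper targets key challenges such as pronounced cross-stage performance gaps and unstable teacher trajectories.

Result: On the Flickr30k and COCO benchmarks, the method significantly surpasses existing SOTA baselines, achieving up to 13.5% absolute improvement and an average gain of 9.53% on Flickr30k, and effectively mitigates optimization oscillations and inter-phase knowledge gaps.

Insight: The core innovation is phased teacher modeling combined with a shortcut-based trajectory construction strategy, which improves the stability and expressiveness of distillation and offers a new approach to compressing large-scale image-text data and transferring knowledge efficiently.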

Abstract: Multimodal dataset distillation aims to construct compact synthetic datasets that enable efficient compression and knowledge transfer from large-scale image-text data. However, existing approaches often fail to capture the complex, dynamically evolving knowledge embedded in the later training stages of teacher models. This limitation leads to degraded student performance and compromises the quality of the distilled data. To address critical challenges such as pronounced cross-stage performance gaps and unstable teacher trajectories, we propose Phased Teacher Model with Shortcut Trajectory (PTM-ST) – a novel phased distillation framework. PTM-ST leverages stage-aware teacher modeling and a shortcut-based trajectory construction strategy to accurately fit the teacher’s learning dynamics across distinct training phases. This enhances both the stability and expressiveness of the distillation process. Through theoretical analysis and comprehensive experiments, we show that PTM-ST significantly mitigates optimization oscillations and inter-phase knowledge gaps, while also reducing storage overhead. Our method consistently surpasses state-of-the-art baselines on Flickr30k and COCO, achieving up to 13.5% absolute improvement and an average gain of 9.53% on Flickr30k. Code: https://github.com/Previsior/PTM-ST.


[86] PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders cs.CVPDF

Niccolò Cavagnero, Narges Norouzi, Gijs Dubbelman, Daan de Geus

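TL;DR: This paper proposes the Plain Mask Transformer (PMT), a model for image and video segmentation. Its core is the Plain Mask Decoder (PMD), a fast Transformer-based segmentation decoder that operates directly on frozen Vision Foundation Model (VFM) features. PMT keeps the simplicity and low latency of encoder-only designs without fine-tuning the encoder, so encoder features can be shared across tasks.

Details

Motivation: Existing VFM-based encoder-only segmentation models such as EoMT and VidEoMT are accurate and fast but require fine-tuning the encoder, sacrificing the multi-task sharing advantage of VFMs and hindering large-scale deployment. The paper aims to combine the simplicity and speed of encoder-only designs with frozen VFM features to obtain an efficient, shareable segmentation model.

Result: On standard image segmentation benchmarks, PMT matches the frozen-encoder SOTA while running up to about 3x faster. On video segmentation, PMT performs on par with fully fine-tuned methods and is up to 8x faster than SOTA frozen-encoder models.

Insight: The novelty is the Plain Mask Decoder (PMD), a lightweight, fast Transformer decoder that segments directly from frozen VFM features without fine-tuning the encoder. It balances speed and accuracy, applies to both image and video segmentation, and improves generality and deployment efficiency.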

Abstract: Vision Foundation Models (VFMs) pre-trained at scale enable a single frozen encoder to serve multiple downstream tasks simultaneously. Recent VFM-based encoder-only models for image and video segmentation, such as EoMT and VidEoMT, achieve competitive accuracy with remarkably low latency, yet they require finetuning the encoder, sacrificing the multi-task encoder sharing that makes VFMs practically attractive for large-scale deployment. To reconcile encoder-only simplicity and speed with frozen VFM features, we propose the Plain Mask Decoder (PMD), a fast Transformer-based segmentation decoder that operates on top of frozen VFM features. The resulting model, the Plain Mask Transformer (PMT), preserves the architectural simplicity and low latency of encoder-only designs while keeping the encoder representation unchanged and shareable. The design seamlessly applies to both image and video segmentation, inheriting the generality of the encoder-only framework. On standard image segmentation benchmarks, PMT matches the frozen-encoder state of the art while running up to ~3x faster. For video segmentation, it even performs on par with fully finetuned methods, while being up to 8x faster than state-of-the-art frozen-encoder models. Code: https://github.com/tue-mps/pmt.


[87] LaMP: Learning Vision-Language-Action Policies with 3D Scene Flow as Latent Motion Prior cs.CV | cs.ROPDF

Xinkai Wang, Chenyi Wang, Yifu Xu, Mingzhe Ye, Fu-Cheng Zhang

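TL;DR: This paper proposes LaMP, a dual-expert Vision-Language-Action framework that embeds dense 3D scene flow as a latent motion prior to improve robotic manipulation. It comprises a flow-matching Motion Expert and a policy-predicting Action Expert, aligned through gated cross-attention.

Details

Motivation: Existing VLA models regress actions directly from 2D semantic visual features, forcing them to learn complex 3D physical interactions implicitly; this strategy degrades under unfamiliar spatial dynamics. LaMP addresses the limitation by explicitly introducing a 3D scene flow prior.

Result: On the LIBERO, LIBERO-Plus, and SimplerEnv-WidowX simulation benchmarks and in real-world experiments, LaMP consistently outperforms the evaluated VLA baselines, achieving the highest average success rates under the same training budgets. On LIBERO-Plus out-of-distribution perturbations, LaMP is more robust, with an average 9.7% gain over the strongest prior baseline.

Insight: The novelty is using dense 3D scene flow as an explicit latent motion prior and decoupling and aligning it through a dual-expert architecture (Motion Expert and Action Expert), which avoids full multi-step reconstruction and thereby models 3D physical interactions more effectively and improves generalization.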

Abstract: We introduce LaMP, a dual-expert Vision-Language-Action framework that embeds dense 3D scene flow as a latent motion prior for robotic manipulation. Existing VLA models regress actions directly from 2D semantic visual features, forcing them to learn complex 3D physical interactions implicitly. This implicit learning strategy degrades under unfamiliar spatial dynamics. LaMP addresses this limitation by aligning a flow-matching Motion Expert with a policy-predicting Action Expert through gated cross-attention. Specifically, the Motion Expert generates a one-step partially denoised 3D scene flow, and its hidden states condition the Action Expert without full multi-step reconstruction. We evaluate LaMP on the LIBERO, LIBERO-Plus, and SimplerEnv-WidowX simulation benchmarks as well as real-world experiments. LaMP consistently outperforms evaluated VLA baselines across LIBERO, LIBERO-Plus, and SimplerEnv-WidowX benchmarks, achieving the highest reported average success rates under the same training budgets. On LIBERO-Plus OOD perturbations, LaMP shows improved robustness with an average 9.7% gain over the strongest prior baseline. Our project page is available at https://summerwxk.github.io/lamp-project-page/.


[88] HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models cs.CVPDF

Huizhi Liang, Yichao Shen, Yu Deng, Sicheng Xu, Zhiyuan Feng

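TL;DR: This paper proposes the HiSpatial framework, which hierarchically decomposes 3D spatial understanding, builds a large-scale 3D spatial visual question answering dataset, and develops an RGB-D vision-language model that takes point maps as auxiliary input, achieving state-of-the-art performance on multiple spatial understanding benchmarks.

Details

Motivation: Human-like spatial intelligence in vision-language models requires inferring 3D structure from 2D observations, recognizing object properties and relations in 3D space, and performing high-level spatial reasoning; the paper addresses these challenges.

Result: The approach achieves state-of-the-art performance on multiple spatial understanding and reasoning benchmarks, surpassing specialized spatial models and large proprietary systems such as Gemini-2.5-pro and GPT-5.

Insight: The paper proposes a principled hierarchical framework that decomposes 3D spatial understanding into four progressively complex levels and reveals dependencies among the hierarchical task levels, offering new insight into how multi-level task design facilitates the emergence of 3D spatial intelligence.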

Abstract: Achieving human-like spatial intelligence for vision-language models (VLMs) requires inferring 3D structures from 2D observations, recognizing object properties and relations in 3D space, and performing high-level spatial reasoning. In this paper, we propose a principled hierarchical framework that decomposes the learning of 3D spatial understanding in VLMs into four progressively complex levels, from geometric perception to abstract spatial reasoning. Guided by this framework, we construct an automated pipeline that processes approximately 5M images with over 45M objects to generate 3D spatial VQA pairs across diverse tasks and scenes for VLM supervised fine-tuning. We also develop an RGB-D VLM incorporating metric-scale point maps as auxiliary inputs to further enhance spatial understanding. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple spatial understanding and reasoning benchmarks, surpassing specialized spatial models and large proprietary systems such as Gemini-2.5-pro and GPT-5. Moreover, our analysis reveals clear dependencies among hierarchical task levels, offering new insights into how multi-level task design facilitates the emergence of 3D spatial intelligence.


[89] VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents cs.CVPDF

George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Yang Bai

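TL;DR: VideoWeaver is the first multimodal multi-view video-to-video (V2V) translation framework, designed to solve the consistency problem of multi-view video translation in embodied AI tasks. It is initialized from a flow-based single-view model and uses the Pi3 spatial foundation model to ground multiple views in a shared 4D latent space, ensuring consistent appearance across views. By training views at distinct diffusion timesteps, the model supports autoregressive synthesis of new views, accommodating dynamic camera motion and heterogeneous camera setups.

Details

Motivation: Existing V2V translation methods handle only a single view, whereas embodied AI tasks usually rely on multiple synchronized views for policy learning. Applying single-view models independently produces inconsistent appearance across views, and standard Transformer architectures do not scale to multi-view settings because of the quadratic cost of cross-view attention.

Result: On single-view translation benchmarks, VideoWeaver matches or exceeds the current state of the art (SOTA). It also achieves, for the first time, physically and stylistically consistent multi-view translation, including the challenging egocentric and heterogeneous-camera setups that matter for robot learning.

Insight: The innovations are: 1) using a pretrained spatial foundation model (Pi3) to build a shared 4D latent space for cross-view consistency; 2) training views at distinct diffusion timesteps to learn joint and conditional view distributions, which supports autoregressive view synthesis and scales to an arbitrary number of cameras.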

Abstract: Recent progress in video-to-video (V2V) translation has enabled realistic resimulation of embodied AI demonstrations, a capability that allows pretrained robot policies to be transferable to new environments without additional data collection. However, prior works can only operate on a single view at a time, while embodied AI tasks are commonly captured from multiple synchronized cameras to support policy learning. Naively applying single-view models independently to each camera leads to inconsistent appearance across views, and standard transformer architectures do not scale to multi-view settings due to the quadratic cost of cross-view attention. We present VideoWeaver, the first multimodal multi-view V2V translation framework. VideoWeaver is initially trained as a single-view flow-based V2V model. To achieve an extension to the multi-view regime, we propose to ground all views in a shared 4D latent space derived from a feed-forward spatial foundation model, namely, Pi3. This encourages view-consistent appearance even under wide baselines and dynamic camera motion. To scale beyond a fixed number of cameras, we train views at distinct diffusion timesteps, enabling the model to learn both joint and conditional view distributions. This in turn allows autoregressive synthesis of new viewpoints conditioned on existing ones. Experiments show superior or similar performance to the state-of-the-art on the single-view translation benchmarks and, for the first time, physically and stylistically consistent multi-view translations, including challenging egocentric and heterogeneous-camera setups central to world randomization for robot learning.


[90] GridVAD: Open-Set Video Anomaly Detection via Spatial Reasoning over Stratified Frame Grids cs.CVPDF

Mohamed Eltahir, Ahmed O. Ibrahim, Obada Siralkhatim, Tabarak Abdallah, Sondos Mohamed

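TL;DR: GridVAD is a training-free, open-set video anomaly detection method. Following a propose-ground-propagate principle, it uses a vision-language model to generate candidate anomaly descriptions, grounds and tracks them with spatial and temporal modules, and outputs pixel-level anomaly masks.

Details

Motivation: Using vision-language models directly for anomaly detection in video surveillance leads to missed detections and false alarms; the root cause is how the model is used, not the model itself. The paper therefore designs a robust pipeline in which the VLM acts as an anomaly proposer and purpose-built modules handle the subsequent processing.

Result: On UCSD Ped2, GridVAD achieves the highest Pixel-AUROC (77.59), surpassing the partially fine-tuned TAO (75.11), and leads other zero-shot methods on object-level RBDC by more than 5x. Efficiency experiments show it is 2.7x more call-efficient than uniform per-frame VLM querying.

Insight: The core innovation is the modular propose-ground-propagate design together with a controllable mechanism, Self-Consistency Consolidation, that filters VLM hallucinations. By decoupling open-set reasoning from dedicated spatial-temporal processing, the method delivers training-free, efficient, and strong zero-shot anomaly detection.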

Abstract: Vision-Language Models (VLMs) are powerful open-set reasoners, yet their direct use as anomaly detectors in video surveillance is fragile: without calibrated anomaly priors, they alternate between missed detections and hallucinated false alarms. We argue the problem is not the VLM itself but how it is used. VLMs should function as anomaly proposers, generating open-set candidate descriptions that are then grounded and tracked by purpose-built spatial and temporal modules. We instantiate this propose-ground-propagate principle in GridVAD, a training-free pipeline that produces pixel-level anomaly masks without any domain-specific training. A VLM reasons over stratified grid representations of video clips to generate natural-language anomaly proposals. Self-Consistency Consolidation (SCC) filters hallucinations by retaining only proposals that recur across multiple independent samplings. Grounding DINO anchors each surviving proposal to a bounding box, and SAM2 propagates it as a dense mask through the anomaly interval. The per-clip VLM budget is fixed at M+1 calls regardless of video length, where M can be set according to the proposals needed. On UCSD Ped2, GridVAD achieves the highest Pixel-AUROC (77.59) among all compared methods, surpassing even the partially fine-tuned TAO (75.11), and outperforms other zero-shot approaches on object-level RBDC by over 5x. Ablations reveal that SCC provides a controllable precision-recall tradeoff: filtering improves all pixel-level metrics at a modest cost in object-level recall. Efficiency experiments show GridVAD is 2.7x more call-efficient than uniform per-frame VLM querying while additionally producing dense segmentation masks. Code and qualitative video results are available at https://gridvad.github.io.
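
SCC is described as retaining only proposals that recur across multiple independent samplings. A minimal sketch of that voting idea follows. It assumes proposals are matched by case-insensitive exact string, a simplification chosen here because the abstract does not state the actual matching rule. The `min_votes` threshold and the example proposals are hypothetical.

```python
from collections import Counter


def consolidate(samplings, min_votes=2):
    """Keep proposals that appear in at least min_votes independent samplings.

    samplings: list of lists of proposal strings, one inner list per VLM call.
    Matching is by normalised exact string (an assumption for this sketch).
    """
    votes = Counter()
    for proposals in samplings:
        # A set ensures each proposal counts at most once per sampling.
        votes.update({p.strip().lower() for p in proposals})
    return sorted(p for p, n in votes.items() if n >= min_votes)


# Three hypothetical samplings for the same clip.
samplings = [
    ["person riding a bicycle", "golf cart on walkway"],
    ["Person riding a bicycle"],
    ["person riding a bicycle", "flying car"],
]
kept = consolidate(samplings, min_votes=2)
```

Raising `min_votes` discards more one-off proposals at the risk of dropping real anomalies, which is consistent with the precision-recall tradeoff the ablations report.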


[91] Challenges in Hyperspectral Imaging for Autonomous Driving: The HSI-Drive Case cs.CV | cs.AI | cs.LG | eess.IVPDF

Koldo Basterretxea, Jon Gutiérrez-Zaballa, Javier Echanobe

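TL;DR: This paper examines the challenges of applying hyperspectral imaging (HSI) to autonomous driving (AD), including uncontrolled lighting conditions, wide depth-of-field ranges, dynamic scenes, real-time requirements, and the limited computational resources of embedded platforms. Using experimental results from the most recent version of the HSI-Drive dataset, it analyzes techniques for HSI-based vision systems suited to AD.

Details

Motivation: To address the technical challenges of applying hyperspectral imaging to autonomous driving, such as environmental variability and real-time requirements, and thereby advance practical deployment of HSI in AD.

Result: Experiments on the most recent version of the HSI-Drive dataset illustrate results for the relevant HSI vision techniques, but the abstract reports no specific quantitative metrics or baseline comparisons.

Insight: The contribution is a systematic analysis of the challenges specific to HSI in AD, emphasizing the importance of selecting HSI technologies and developing custom algorithms according to application requirements, which provides practical guidance for the field.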

Abstract: The use of hyperspectral imaging (HSI) in autonomous driving (AD), while promising, faces many challenges related to the specifics and requirements of this application domain. On the one hand, non-controlled and variable lighting conditions, the wide depth-of-field ranges, and dynamic scenes with fast-moving objects. On the other hand, the requirements for real-time operation and the limited computational resources of embedded platforms. The combination of these factors determines both the criteria for selecting appropriate HSI technologies and the development of custom vision algorithms that leverage the spectral and spatial information obtained from the sensors. In this article, we analyse several techniques explored in the research of HSI-based vision systems with application to AD, using as an example results obtained from experiments using data from the most recent version of the HSI-Drive dataset.


[92] CHIRP dataset: towards long-term, individual-level, behavioral monitoring of bird populations in the wild cs.CV | cs.AIPDF

Alex Hoi Hang Chan, Neha Singhal, Onur Kocahan, Andrea Meltzer, Saverio Lubrano

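TL;DR: This paper introduces the CHIRP dataset and the CORVID method for long-term, individual-level behavioral monitoring of wild bird populations with computer vision. CHIRP is built from a wild Siberian jay population in Swedish Lapland, supports tasks including re-identification, action recognition, and keypoint estimation, and introduces application-specific benchmarking with biologically relevant metrics. CORVID is a novel re-identification pipeline based on segmenting and classifying colored leg rings; it identifies individuals through probabilistic matching and outperforms existing SOTA methods on the application benchmarks.

Details

Motivation: Automated behavioral monitoring of wild populations remains challenging because there are no multi-task computer vision datasets from which biologically meaningful measurements of individual animals can be extracted; such datasets are needed to support conservation and evolutionary biology research.

Result: On application-specific benchmarks (such as feeding rates and co-occurrence rates), CORVID outperforms existing state-of-the-art (SOTA) methods on individual re-identification.

Insight: The innovations are: 1) CHIRP, a long-term wild-bird dataset supporting multiple tasks (re-identification, action recognition, and others); 2) application-specific benchmarking with biologically relevant metrics to evaluate models in real use cases; 3) CORVID, a probabilistic-matching re-identification pipeline based on segmenting and classifying colored leg rings. Viewed objectively, the work offers a blueprint for curating real-world datasets from ethically approved biological studies, helping bridge the gap between computer vision research and biological applications.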

Abstract: Long-term behavioral monitoring of individual animals is crucial for studying behavioral changes that occur over different time scales, especially for conservation and evolutionary biology. Computer vision methods have proven to benefit biodiversity monitoring, but automated behavior monitoring in wild populations remains challenging. This stems from the lack of datasets that cover a range of computer vision tasks necessary to extract biologically meaningful measurements of individual animals. Here, we introduce such a dataset (CHIRP) with a new method (CORVID) for individual re-identification of wild birds. The CHIRP (Combining beHaviour, Individual Re-identification and Postures) dataset is curated from a long-term population of wild Siberian jays studied in Swedish Lapland, supporting re-identification (re-id), action recognition, 2D keypoint estimation, object detection, and instance segmentation. In addition to traditional task-specific benchmarking, we introduce application-specific benchmarking with biologically relevant metrics (feeding rates, co-occurrence rates) to evaluate the performance of models in real-world use cases. Finally, we present CORVID (COlouR-based Video re-ID), a novel pipeline for individual identification of birds based on the segmentation and classification of colored leg rings, a widespread approach for visual identification of individual birds. CORVID offers a probability-based id tracking method by matching the detected combination of color rings with a database. We use application-specific benchmarking to show that CORVID outperforms state-of-the-art re-id methods. We hope this work offers the community a blueprint for curating real-world datasets from ethically approved biological studies to bridge the gap between computer vision research and biological applications.
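
CORVID is described as matching the detected combination of color rings against a database to produce probability-based identities. The sketch below illustrates one plausible form of such matching and should not be read as the CORVID implementation. It assumes a classifier yields a color distribution per ring position, treats positions as independent, and normalizes over the database. All names and numbers are hypothetical.

```python
def match_individual(ring_probs, database):
    """Score each known bird by the probability of its ring colour combination.

    ring_probs: list of dicts, one per ring position, mapping colour -> probability.
    database: dict mapping bird id -> tuple of colours, one per ring position.
    Returns a dict of bird id -> normalised probability (empty if nothing matches).
    """
    scores = {}
    for bird_id, colours in database.items():
        score = 1.0
        # Independence across ring positions is an assumption of this sketch.
        for probs, colour in zip(ring_probs, colours):
            score *= probs.get(colour, 0.0)
        scores[bird_id] = score
    total = sum(scores.values())
    if total == 0.0:
        return {}
    return {bird_id: s / total for bird_id, s in scores.items()}


# Hypothetical classifier output for two ring positions.
ring_probs = [
    {"red": 0.7, "blue": 0.3},
    {"white": 0.6, "green": 0.4},
]
database = {
    "A": ("red", "white"),
    "B": ("blue", "green"),
    "C": ("red", "green"),
}
posterior = match_individual(ring_probs, database)
best = max(posterior, key=posterior.get)  # "A" for these numbers
```

Keeping the full distribution rather than only the top match lets a tracker accumulate evidence over frames before committing to an identity.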


[93] Beyond the Golden Data: Resolving the Motion-Vision Quality Dilemma via Timestep Selective Training cs.CVPDF

Xiangyang Luo, Qingyu Li, Yuming Li, Guanbo Huang, Yongjie Zhu

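TL;DR: This paper targets the “Motion-Vision Quality Dilemma” in training video generation models, namely that high visual quality and high motion intensity are hard to obtain in the same data, and proposes TQD, a training method based on timestep selection. By analyzing the hierarchical learning dynamics of video diffusion models, the authors find that quality-imbalanced data can produce gradients similar to “golden data” at specific timesteps. TQD adjusts the sampling distribution for data of different quality so the model can train on separated imbalanced data and still outperform training on conventional high-quality data.

Details

Motivation: Visual quality and motion intensity are inherently negatively correlated in video data collection, so “golden data” with both high visual quality and high motion quality is hard to obtain. The paper seeks to resolve this conflict to improve the efficiency and effectiveness of training video generation models.

Result: Extensive experiments across data scenarios show that TQD, trained only on separated imbalanced data, outperforms conventional training on better data; when trained on high-quality data, the method further improves model performance.

Insight: The paper introduces the concept of timestep selection in the training process and the Timestep-aware Quality Decoupling (TQD) method. The key insight is that quality-imbalanced data has different utility at different timesteps (noise levels) of diffusion training; adjusting the sampling distribution in a timestep-aware way uses existing data more effectively and challenges the assumption that video generation requires perfect data.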

Abstract: Recent advances in video generation models have achieved impressive results. However, these models heavily rely on the use of high-quality data that combines both high visual quality and high motion quality. In this paper, we identify a key challenge in video data curation: the Motion-Vision Quality Dilemma. We discovered that visual quality and motion intensity inherently exhibit a negative correlation, making it hard to obtain golden data that excels in both aspects. To address this challenge, we first examine the hierarchical learning dynamics of video diffusion models and conduct gradient-based analysis on quality-degraded samples. We discover that quality-imbalanced data can produce gradients similar to golden data at appropriate timesteps. Based on this, we introduce the novel concept of Timestep selection in Training Process. We propose Timestep-aware Quality Decoupling (TQD), which modifies the data sampling distribution to better match the model’s learning process. For certain types of data, the sampling distribution is skewed toward higher timesteps for motion-rich data, while high visual quality data is more likely to be sampled during lower timesteps. Through extensive experiments, we demonstrate that TQD enables training exclusively on separated imbalanced data to achieve performance surpassing conventional training with better data, challenging the necessity of perfect data in video generation. Moreover, our method also boosts model performance when trained on high-quality data, showcasing its effectiveness across different data scenarios.
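
The abstract states that motion-rich data is sampled more often at higher timesteps and high-visual-quality data at lower timesteps. The sketch below shows one way to express such a skew. The linear weighting, the data-type labels, and the function names are assumptions for illustration; the abstract does not give the actual distribution used by TQD.

```python
import random


def timestep_weights(num_steps, data_type):
    """Unnormalised sampling weights over timesteps 0..num_steps-1."""
    if data_type == "motion_rich":
        # Favour high timesteps (assumed linear ramp).
        return [t + 1 for t in range(num_steps)]
    if data_type == "high_visual":
        # Favour low timesteps (assumed linear ramp).
        return [num_steps - t for t in range(num_steps)]
    # Fall back to uniform sampling for any other data type.
    return [1] * num_steps


def sample_timestep(num_steps, data_type, rng):
    """Draw one training timestep for a sample of the given data type."""
    weights = timestep_weights(num_steps, data_type)
    return rng.choices(range(num_steps), weights=weights, k=1)[0]


rng = random.Random(0)
motion_draws = [sample_timestep(1000, "motion_rich", rng) for _ in range(2000)]
visual_draws = [sample_timestep(1000, "high_visual", rng) for _ in range(2000)]
```

Under these assumed weights the motion-rich draws average roughly two thirds of the way up the timestep range and the high-visual draws roughly one third, so each data type mostly contributes gradients at the noise levels where it is expected to be useful.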


[94] BFMD: A Full-Match Badminton Dense Dataset for Dense Shot Captioning cs.CVPDF

Ning Ding, Keisuke Fujii, Toru Tamaki

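TL;DR: This paper presents BFMD, the first full-match badminton dense dataset, comprising 19 complete matches (singles and doubles) with dense multimodal annotations covering shot types, shuttle trajectories, player poses, and shot captions. The authors develop a VideoMAE-based multimodal captioning framework with a Semantic Feedback mechanism to improve caption quality. Experiments show that multimodal modeling and semantic feedback outperform RGB-only baselines, and the paper demonstrates BFMD's potential for analyzing tactical patterns across full matches.

Details

Motivation: Existing badminton datasets mostly consist of short clips or task-specific annotations and lack dense multimodal annotations of complete matches, which limits shot caption generation and match-level analysis.

Result: Experiments on BFMD show that multimodal modeling and the Semantic Feedback mechanism improve shot caption quality over RGB-only baselines.

Insight: The novelty is the first full-match badminton dataset with dense multimodal annotations and a multimodal captioning framework with semantic feedback, which helps advance match-level tactical analysis.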

Abstract: Understanding tactical dynamics in badminton requires analyzing entire matches rather than isolated clips. However, existing badminton datasets mainly focus on short clips or task-specific annotations and rarely provide full-match data with dense multimodal annotations. This limitation makes it difficult to generate accurate shot captions and perform match-level analysis. To address this limitation, we introduce the first Badminton Full Match Dense (BFMD) dataset, with 19 broadcast matches (including both singles and doubles) covering over 20 hours of play, comprising 1,687 rallies and 16,751 hit events, each annotated with a shot caption. The dataset provides hierarchical annotations including match segments, rally events, and dense rally-level multimodal annotations such as shot types, shuttle trajectories, player pose keypoints, and shot captions. We develop a VideoMAE-based multimodal captioning framework with a Semantic Feedback mechanism that leverages shot semantics to guide caption generation and improve semantic consistency. Experimental results demonstrate that multimodal modeling and semantic feedback improve shot caption quality over RGB-only baselines. We further showcase the potential of BFMD by analyzing the temporal evolution of tactical patterns across full matches.


[95] PAWS: Perception of Articulation in the Wild at Scale from Egocentric Videos cs.CVPDF

Yihao Wang, Yang Miao, Wenshuai Zhao, Wenyan Yang, Zihan Wang

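TL;DR: PAWS is a method that extracts object articulations directly from large-scale in-the-wild egocentric videos, using hand-object interactions to recover the motion and structure of articulated objects (such as drawers and cupboards) without relying on high-quality 3D data or manual annotations.

Details

Motivation: Existing learning-based methods rely heavily on supervised training with high-quality 3D data and manual annotations, which limits scalability and diversity, so a method is needed that learns articulation automatically from large-scale in-the-wild videos.

Result: Evaluated on public datasets including HD-EPIC and Arti4D, PAWS achieves significant improvements over baselines and shows that the extracted articulations benefit downstream tasks such as fine-tuning 3D articulation prediction models and robot manipulation.

Insight: The novelty is using hand-object interactions in large-scale in-the-wild egocentric videos as a self-supervised signal, learning articulation directly from real-world data and avoiding expensive manual annotation, which improves scalability and practicality.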

Abstract: Articulation perception aims to recover the motion and structure of articulated objects (e.g., drawers and cupboards), and is fundamental to 3D scene understanding in robotics, simulation, and animation. Existing learning-based methods rely heavily on supervised training with high-quality 3D data and manual annotations, limiting scalability and diversity. To address this limitation, we propose PAWS, a method that directly extracts object articulations from hand-object interactions in large-scale in-the-wild egocentric videos. We evaluate our method on public datasets, including HD-EPIC and Arti4D, achieving significant improvements over baselines. We further demonstrate that the extracted articulations benefit downstream tasks, including fine-tuning 3D articulation prediction models and enabling robot manipulation. See the project website at https://aaltoml.github.io/PAWS/.


[96] Towards Comprehensive Real-Time Scene Understanding in Ophthalmic Surgery through Multimodal Image Fusion cs.CVPDF

Nikolo Rohrmoser, Ghazal Ghazaei, Michael Sommersperger, Nassir Navab

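TL;DR: This work proposes a multimodal, temporal, real-time capable network architecture for ophthalmic surgery. It fuses operating microscope (OPMI) and intraoperative optical coherence tomography (iOCT) images to jointly perform instrument detection, keypoint localization, and tool-tissue distance estimation, toward more comprehensive surgical scene understanding.

Details

Motivation: A single imaging modality (e.g., OPMI alone) is limited for precise instrument tracking and distance estimation in ophthalmic surgery. Fusing the complementary OPMI and iOCT modalities enables more comprehensive real-time surgical scene understanding.

Result: In vitreoretinal surgery, the method achieves reliable instrument localization and keypoint detection (95.79% mAP50). Incorporating iOCT significantly improves tool-tissue distance estimation, especially within 1 mm of the retina, where the error drops from 284 μm (OPMI only) to 33 μm (multimodal), at a real-time processing rate of 22.5 ms per frame.

Insight: Contributions include a cross-attention fusion module that merges multimodal features and a region-based recurrent module that exploits temporal coherence. The analysis shows that multimodal feature fusion can improve multi-task prediction accuracy and that tailored network design can deliver real-time performance, offering a new direction for image-guided surgery.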

Abstract: Purpose: The integration of multimodal imaging into operating rooms paves the way for comprehensive surgical scene understanding. In ophthalmic surgery, by now, two complementary imaging modalities are available: operating microscope (OPMI) imaging and real-time intraoperative optical coherence tomography (iOCT). This first work toward temporal OPMI and iOCT feature fusion demonstrates the potential of multimodal image processing for multi-head prediction through the example of precise instrument tracking in vitreoretinal surgery. Methods: We propose a multimodal, temporal, real-time capable network architecture to perform joint instrument detection, keypoint localization, and tool-tissue distance estimation. Our network design integrates a cross-attention fusion module to merge OPMI and iOCT image features, which are efficiently extracted via a YoloNAS and a CNN encoder, respectively. Furthermore, a region-based recurrent module leverages temporal coherence. Results: Our experiments demonstrate reliable instrument localization and keypoint detection (95.79% mAP50) and show that the incorporation of iOCT significantly improves tool-tissue distance estimation, while achieving real-time processing rates of 22.5 ms per frame. Especially for close distances to the retina (below 1 mm), the distance estimation accuracy improved from 284 μm (OPMI only) to 33 μm (multimodal). Conclusion: Feature fusion of multimodal imaging can enhance multi-task prediction accuracy compared to single-modality processing and real-time processing performance can be achieved through tailored network design. While our results demonstrate the potential of multi-modal processing for image-guided vitreoretinal surgery, they also underline key challenges that motivate future research toward more reliable, consistent, and comprehensive surgical scene understanding.


[97] GeoHeight-Bench: Towards Height-Aware Multimodal Reasoning in Remote Sensing cs.CVPDF

Xuran Hu, Zhitong Xiong, Zhongcheng Hong, Yifang Ban, Xiaoxiang Zhu

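TL;DR: Addressing the neglect of the vertical dimension by current large models in Earth Observation, this paper proposes an evaluation framework dedicated to height-aware remote sensing understanding. By constructing two complementary benchmarks (GeoHeight-Bench and GeoHeight-Bench+) and the first height-aware remote sensing LMM baseline (GeoHeightChat), it shows that combining visual semantics with implicitly injected height geometric features effectively mitigates the "vertical blind spot" and opens a new paradigm of interactive height reasoning for optical models.

Details

Motivation: Current large models in Earth Observation typically neglect the critical "vertical" dimension, which limits their reasoning in complex remote sensing geometries and disaster scenarios, where physical spatial structure often matters more than planar visual texture.

Result: The paper constructs two benchmarks for evaluation and proposes the GeoHeightChat baseline. As a strong proof of concept, the baseline shows that combining visual semantics with implicitly injected height geometric features effectively mitigates the "vertical blind spot" and unlocks a new paradigm of interactive height reasoning in existing optical models.

Insight: The main contributions are: 1) a comprehensive evaluation framework dedicated to height-aware remote sensing understanding; 2) a scalable, VLM-driven data generation pipeline that uses systematic prompt engineering and metadata extraction to overcome the scarcity of annotated data; 3) two complementary benchmarks (relative height analysis and more challenging terrain-aware reasoning); 4) the first height-aware remote sensing LMM baseline, which validates the necessity of height perception and shows the benefit of combining visual semantics with implicit height geometric features.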

Abstract: Current Large Multimodal Models (LMMs) in Earth Observation typically neglect the critical “vertical” dimension, limiting their reasoning capabilities in complex remote sensing geometries and disaster scenarios where physical spatial structures often outweigh planar visual textures. To bridge this gap, we introduce a comprehensive evaluation framework dedicated to height-aware remote sensing understanding. First, to overcome the severe scarcity of annotated data, we develop a scalable, VLM-driven data generation pipeline utilizing systematic prompt engineering and metadata extraction. This pipeline constructs two complementary benchmarks: GeoHeight-Bench for relative height analysis, and a more challenging GeoHeight-Bench+ for holistic, terrain-aware reasoning. Furthermore, to validate the necessity of height perception, we propose GeoHeightChat, the first height-aware remote sensing LMM baseline. Serving as a strong proof of concept, our baseline demonstrates that synergizing visual semantics with implicitly injected height geometric features effectively mitigates the “vertical blind spot”, successfully unlocking a new paradigm of interactive height reasoning in existing optical models.


[98] Hierarchy-Guided Multimodal Representation Learning for Taxonomic Inference cs.CV | cs.LGPDF

Sk Miraj Ahmed, Xi Yu, Yunqi Li, Yuewei Lin, Wei Xu

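TL;DR: This paper presents two hierarchy-aware multimodal learning methods, CLiBD-HiR and CLiBD-HiR-Fuse, for taxonomic inference in biodiversity classification. They introduce Hierarchical Information Regularization to shape the geometry of the embedding space and train a lightweight fusion predictor to handle imperfect image and DNA barcode inputs, substantially improving classification accuracy on large-scale biodiversity benchmarks.

Details

Motivation: Accurate biodiversity identification from large-scale field data requires taxonomic prediction from noisy or incomplete multimodal inputs such as images and DNA barcodes. Existing methods treat taxonomy as a flat label space and fail to encode the hierarchical structure of biological classification, which limits their robustness under noise and missing modalities.

Result: On large-scale biodiversity benchmarks, the approach improves taxonomic classification accuracy by over 14% compared to strong multimodal baselines, with particularly large gains under partial and corrupted DNA conditions.

Insight: The novelty lies in explicitly encoding the hierarchy of biological classification: hierarchical information regularization yields structured, noise-robust representations, and a flexible lightweight fusion predictor supports single-modality or joint inference and is resilient to modality corruption. Together these offer a key idea for practical biodiversity foundation models.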

Abstract: Accurate biodiversity identification from large-scale field data is a foundational problem with direct impact on ecology, conservation, and environmental monitoring. In practice, the core task is taxonomic prediction - inferring order, family, genus, or species from imperfect inputs such as specimen images, DNA barcodes, or both. Existing multimodal methods often treat taxonomy as a flat label space and therefore fail to encode the hierarchical structure of biological classification, which is critical for robustness under noise and missing modalities. We present two end-to-end variants for hierarchy-aware multimodal learning: CLiBD-HiR, which introduces Hierarchical Information Regularization (HiR) to shape embedding geometry across taxonomic levels, yielding structured and noise-robust representations; and CLiBD-HiR-Fuse, which additionally trains a lightweight fusion predictor that supports image-only, DNA-only, or joint inference and is resilient to modality corruption. Across large-scale biodiversity benchmarks, our approach improves taxonomic classification accuracy by over 14 percent compared to strong multimodal baselines, with particularly large gains under partial and corrupted DNA conditions. These results highlight that explicitly encoding biological hierarchy, together with flexible fusion, is key for practical biodiversity foundation models.


[99] UNIC: Neural Garment Deformation Field for Real-time Clothed Character Animation cs.CVPDF

Chengfeng Zhao, Junbo Qi, Yulou Liu, Zhiyang Dou, Minchen Li

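TL;DR: This paper proposes UNIC, a neural deformation field-based method for animating an avatar's garments in real time. Through an instance-specific learning scheme, it maps 3D points to deformation offsets, which avoids handling complex garment topologies and improves deformation quality. Experiments show that UNIC outperforms baseline methods on various garment meshes and is efficient and practical for real-time interactive applications such as video games.

Details

Motivation: Traditional physics simulation is time-consuming and hardware-demanding, which makes it unsuitable for real-time applications. Existing learning-based methods (e.g., graph neural networks) struggle to capture the intricate deformation of garment meshes with complex topologies.

Result: Extensive experiments on various garment meshes show that UNIC outperforms baseline methods in both effectiveness and efficiency, making it suitable for real-time interactive scenarios.

Insight: The novelty is an instance-specific neural deformation field learning scheme that does not need to generalize to new garments, only to new motion sequences, which reduces training difficulty and improves deformation quality. Mapping 3D points to deformation offsets also avoids topology handling and introduces a natural smoothness constraint.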

Abstract: Simulating physically realistic garment deformations is an essential task for virtual immersive experience, which is often achieved by physics simulation methods. However, these methods are typically time-consuming, computationally demanding, and require costly hardware, which is not suitable for real-time applications. Recent learning-based methods tried to resolve this problem by training graph neural networks to learn the garment deformation on vertices, which, however, fail to capture the intricate deformation of complex garment meshes with complex topologies. In this paper, we introduce a novel neural deformation field-based method, named UNIC, to animate the garments of an avatar in real time, given the motion sequences. Our key idea is to learn the instance-specific neural deformation field to animate the garment meshes. Such an instance-specific learning scheme does not require UNIC to generalize to new garments but only to new motion sequences, which greatly reduces the difficulty in training and improves the deformation quality. Moreover, neural deformation fields map the 3D points to their deformation offsets, which not only avoids handling topologies of the complex garments but also injects a natural smoothness constraint in the deformation learning. Extensive experiments have been conducted on various kinds of garment meshes to demonstrate the effectiveness and efficiency of UNIC over baseline methods, making it potentially practical and useful in real-world interactive applications like video games.


[100] Demographic Fairness in Multimodal LLMs: A Benchmark of Gender and Ethnicity Bias in Face Verification cs.CV | cs.AIPDF

Ünsal Öztürk, Hatef Otroshi Shahreza, Sébastien Marcel

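TL;DR: This paper benchmarks gender and ethnicity bias in nine open-source multimodal large language models (MLLMs) on face verification, evaluating them on the IJB-C and RFW protocols. It finds that the face-specialised FaceLLM-8B substantially outperforms general-purpose MLLMs and that the bias patterns differ from those of traditional face recognition systems.

Details

Motivation: MLLMs have been explored for face verification, but their demographic fairness remains largely unexplored. This paper aims to fill that gap by evaluating how MLLMs behave across gender and ethnicity groups.

Result: On the IJB-C and RFW benchmarks, FaceLLM-8B (a face-specialised model) substantially outperforms general-purpose MLLMs in accuracy. Bias patterns vary by benchmark and model: the most accurate models are not necessarily the fairest, and models with poor overall accuracy can appear fair because their error rates are uniformly high across all groups.

Insight: The novelty lies in the first systematic evaluation of demographic bias in MLLMs for face verification and in the use of FMR-based fairness metrics. The analysis shows that MLLM bias patterns differ from those of traditional face recognition and stresses that accuracy and fairness must be assessed together, rather than judging fairness from overall error rates alone.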

Abstract: Multimodal Large Language Models (MLLMs) have recently been explored as face verification systems that determine whether two face images are of the same person. Unlike dedicated face recognition systems, MLLMs approach this task through visual prompting and rely on general visual and reasoning abilities. However, the demographic fairness of these models remains largely unexplored. In this paper, we present a benchmarking study that evaluates nine open-source MLLMs from six model families, ranging from 2B to 8B parameters, on the IJB-C and RFW face verification protocols across four ethnicity groups and two gender groups. We measure verification accuracy with the Equal Error Rate and True Match Rate at multiple operating points per demographic group, and we quantify demographic disparity with four FMR-based fairness metrics. Our results show that FaceLLM-8B, the only face-specialised model in our study, substantially outperforms general-purpose MLLMs on both benchmarks. The bias patterns we observe differ from those commonly reported for traditional face recognition, with different groups being most affected depending on the benchmark and the model. We also note that the most accurate models are not necessarily the fairest and that models with poor overall accuracy can appear fair simply because they produce uniformly high error rates across all demographic groups.
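The point that uniformly poor models can look fair is easy to see with a toy calculation. The sketch below is illustrative only: the abstract does not name its four FMR-based metrics, so the max/min FMR ratio used here is an assumed stand-in, and the group names, scores, and threshold are hypothetical.

```python
def fmr_fnmr(genuine, impostor, threshold):
    """Return (false match rate, false non-match rate) at a similarity threshold."""
    fmr = sum(s >= threshold for s in impostor) / len(impostor)
    fnmr = sum(s < threshold for s in genuine) / len(genuine)
    return fmr, fnmr


def fmr_disparity(groups, threshold):
    """Per-group FMR plus the max/min FMR ratio (an assumed disparity measure)."""
    rates = {name: fmr_fnmr(gen, imp, threshold)[0]
             for name, (gen, imp) in groups.items()}
    worst, best = max(rates.values()), min(rates.values())
    ratio = worst / best if best > 0 else float("inf")
    return rates, ratio


# Hypothetical similarity scores: (genuine pairs, impostor pairs) per group.
reasonable_model = {
    "group_a": ([0.9, 0.8, 0.7, 0.6], [0.1, 0.2, 0.3, 0.55]),
    "group_b": ([0.9, 0.85, 0.4, 0.75], [0.6, 0.7, 0.2, 0.1]),
}
# A weak model that accepts most impostors in every group.
weak_model = {
    "group_a": ([0.9, 0.8, 0.7, 0.6], [0.6, 0.7, 0.8, 0.1]),
    "group_b": ([0.9, 0.85, 0.4, 0.75], [0.9, 0.6, 0.7, 0.2]),
}

rates, ratio = fmr_disparity(reasonable_model, threshold=0.5)
weak_rates, weak_ratio = fmr_disparity(weak_model, threshold=0.5)
print(rates, ratio)            # group_b has twice the FMR of group_a
print(weak_rates, weak_ratio)  # equally bad everywhere, so the ratio is 1.0
```

In this toy setup the weak model has a disparity ratio of 1.0 despite a 75% false match rate in both groups, which is why a disparity metric needs to be read alongside accuracy.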


[101] LanteRn: Latent Visual Structured Reasoning cs.CV | cs.LGPDF

André G. Viveiros, Nuno Gonçalves, Matthias Lindemann, André Martins

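TL;DR: LanteRn is a framework that lets large multimodal models interleave language with compact latent visual representations during reasoning. It augments a vision-language transformer so that it can generate and attend to continuous visual thought embeddings at inference time, enabling visual reasoning directly in latent space and avoiding the overhead of external modules or pixel-space reasoning.

Details

Motivation: Current large multimodal models still struggle with visual reasoning and typically verbalize perceptual content into text, which limits tasks that require fine-grained spatial and visual understanding. Existing approaches either rely on external tools or incur unnecessary computation by reasoning directly in pixel space. This paper aims for more efficient multimodal reasoning through latent visual representations.

Result: Evaluated on three perception-centric benchmarks (VisCoT, V*, and Blink), LanteRn shows consistent improvements in visual grounding and fine-grained reasoning.

Insight: The main novelty is a framework in which the model internally generates and uses continuous latent visual embeddings during reasoning, with two-stage training (supervised fine-tuning and reinforcement learning) aligning visual features with latent states. This points to a promising direction for more efficient multimodal reasoning: internal latent representations rather than external tools or pixel-level operations.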

Abstract: While language reasoning models excel in many tasks, visual reasoning remains challenging for current large multimodal models (LMMs). As a result, most LMMs default to verbalizing perceptual content into text, a strong limitation for tasks requiring fine-grained spatial and visual understanding. While recent approaches take steps toward thinking with images by invoking tools or generating intermediate images, they either rely on external modules, or incur unnecessary computation by reasoning directly in pixel space. In this paper, we introduce LanteRn, a framework that enables LMMs to interleave language with compact latent visual representations, allowing visual reasoning to occur directly in latent space. LanteRn augments a vision-language transformer with the ability to generate and attend to continuous visual thought embeddings during inference. We train the model in two stages: supervised fine-tuning to ground visual features in latent states, followed by reinforcement learning to align latent reasoning with task-level utility. We evaluate LanteRn on three perception-centric benchmarks (VisCoT, V*, and Blink), observing consistent improvements in visual grounding and fine-grained reasoning. These results suggest that internal latent representations provide a promising direction for more efficient multimodal reasoning.


[102] Just Zoom In: Cross-View Geo-Localization via Autoregressive Zooming cs.CV | cs.AIPDF

Yunus Talha Erzurumlu, Jiyong Kwag, Alper Yilmaz

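TL;DR: This paper proposes Just Zoom In, a new approach to cross-view geo-localization. It drops the conventional contrastive image-retrieval paradigm in favor of autoregressive zooming: a sequence of zoom-in decisions over a city-scale satellite map progressively localizes the target area.

Details

Motivation: Existing methods treat cross-view geo-localization as image retrieval in a contrastively trained embedding space. This ties performance to large batches and hard negative mining, and it ignores both the geometric structure of maps and the coverage mismatch between street-view and satellite imagery.

Result: On the authors' realistic benchmark of crowd-sourced street views and high-resolution satellite imagery, Just Zoom In achieves state-of-the-art performance, improving Recall@1 within 50 m by 5.5% and Recall@1 within 100 m by 9.6% over the strongest contrastive-retrieval baseline.

Insight: The core idea is to recast cross-view geo-localization as an autoregressive, coarse-to-fine spatial reasoning process, avoiding the complexity of contrastive learning. This sequential decision framework more naturally mirrors how a person searches for a location on a map and makes effective use of the map's multi-scale geometric information.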

Abstract: Cross-view geo-localization (CVGL) estimates a camera’s location by matching a street-view image to geo-referenced overhead imagery, enabling GPS-denied localization and navigation. Existing methods almost universally formulate CVGL as an image-retrieval problem in a contrastively trained embedding space. This ties performance to large batches and hard negative mining, and it ignores both the geometric structure of maps and the coverage mismatch between street-view and overhead imagery. In particular, salient landmarks visible from the street view can fall outside a fixed satellite crop, making retrieval targets ambiguous and limiting explicit spatial inference over the map. We propose Just Zoom In, an alternative formulation that performs CVGL via autoregressive zooming over a city-scale overhead map. Starting from a coarse satellite view, the model takes a short sequence of zoom-in decisions to select a terminal satellite cell at a target resolution, without contrastive losses or hard negative mining. We further introduce a realistic benchmark with crowd-sourced street views and high-resolution satellite imagery that reflects real capture conditions. On this benchmark, Just Zoom In achieves state-of-the-art performance, improving Recall@1 within 50 m by 5.5% and Recall@1 within 100 m by 9.6% over the strongest contrastive-retrieval baseline. These results demonstrate the effectiveness of sequential coarse-to-fine spatial reasoning for cross-view geo-localization.
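The coarse-to-fine decision loop described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the 2x2 split per zoom step, the map bounds, and the `oracle_for` chooser (which stands in for the learned model and simply knows the target location) are all assumptions made for the example.

```python
def split(bounds):
    """Split a map cell (x0, y0, x1, y1) into four quadrants (assumed 2x2 zoom)."""
    x0, y0, x1, y1 = bounds
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    return [(x0, y0, xm, ym), (xm, y0, x1, ym),
            (x0, ym, xm, y1), (xm, ym, x1, y1)]


def zoom(bounds, choose, steps):
    """Take `steps` zoom-in decisions; each one conditions on the path so far."""
    path = []
    for _ in range(steps):
        children = split(bounds)
        k = choose(bounds, children, path)  # index of the chosen child cell
        path.append(k)
        bounds = children[k]
    return bounds, path


def oracle_for(target):
    """Hypothetical stand-in for the model: picks the child containing `target`."""
    tx, ty = target

    def choose(bounds, children, path):
        for k, (x0, y0, x1, y1) in enumerate(children):
            if x0 <= tx < x1 and y0 <= ty < y1:
                return k
        raise ValueError("target lies outside the current cell")

    return choose


# A 1024 x 1024 map, three zoom steps, target at (700, 100).
cell, path = zoom((0.0, 0.0, 1024.0, 1024.0), oracle_for((700.0, 100.0)), steps=3)
print(path)  # [1, 0, 1]
print(cell)  # (640.0, 0.0, 768.0, 128.0)
```

With a branching factor of 4, `steps` decisions select one of 4**steps terminal cells, so the output space grows exponentially while each individual decision stays a small classification problem.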


[103] Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training cs.CVPDF

Jinbo Xing, Zeyinzi Jiang, Yuxiang Tuo, Chaojie Mao, Xiaotang Gai

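TL;DR: This paper proposes Wan-Weaver, a framework for interleaved multi-modal generation via decoupled training. It decomposes the task into textual planning and visual consistency modeling, handled by a planner and a visualizer respectively. The planner is trained on large-scale textual-proxy interleaved data and the visualizer on reference-guided image data, giving the model interleaved generation with long-range textual coherence and visual consistency, along with robust reasoning and generation across tasks.

Details

Motivation: Existing unified models accept multi-modal inputs but typically produce only single-modality outputs. Interleaved content generation is hampered by scarce training data and the difficulty of modeling long-range cross-modal context.

Result: On a newly constructed benchmark spanning use cases across multiple dimensions, Wan-Weaver outperforms existing methods even without using any real interleaved data.

Insight: The novelty lies in decoupling interleaved generation into textual planning and visual consistency modeling, and training each part on textual-proxy data and reference-guided data respectively, which enables long-range coherent interleaved generation under data scarcity. The decoupling strategy and the proxy-data construction offer a new way around the data bottleneck in multi-modal generation.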

Abstract: Recent unified models have made unprecedented progress in both understanding and generation. However, while most of them accept multi-modal inputs, they typically produce only single-modality outputs. This challenge of producing interleaved content is mainly due to training data scarcity and the difficulty of modeling long-range cross-modal context. To address this issue, we decompose interleaved generation into textual planning and visual consistency modeling, and introduce a framework consisting of a planner and a visualizer. The planner produces dense textual descriptions for visual content, while the visualizer synthesizes images accordingly. Under this guidance, we construct large-scale textual-proxy interleaved data (where visual content is represented in text) to train the planner, and curate reference-guided image data to train the visualizer. These designs give rise to Wan-Weaver, which exhibits emergent interleaved generation ability with long-range textual coherence and visual consistency. Meanwhile, the integration of diverse understanding and generation data into planner training enables Wan-Weaver to achieve robust task reasoning and generation proficiency. To assess the model’s capability in interleaved generation, we further construct a benchmark that spans a wide range of use cases across multiple dimensions. Extensive experiments demonstrate that, even without access to any real interleaved data, Wan-Weaver achieves superior performance over existing methods.


[104] TRACE: Object Motion Editing in Videos with First-Frame Trajectory Guidance cs.CVPDF

Quynh Phung, Long Mai, Cusuh Ham, Feng Liu, Jia-Bin Huang

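TL;DR: This paper presents TRACE, a framework for object motion path editing in videos, where the goal is to alter a target object's trajectory while preserving the original scene content. Users design the desired trajectory in a single anchor frame, and the method then synthesizes a temporally consistent edited video.

Details

Motivation: Existing video editing methods mainly manipulate appearance or rely on point-track-based trajectory control, which is often hard for users to provide at inference time, especially in videos with camera motion. The paper aims to offer a practical, easy-to-use approach to controllable object-centric motion editing.

Result: Experiments on diverse real-world videos show that the method produces more coherent, realistic, and controllable motion edits than recent image-to-video and video-to-video methods.

Insight: The novelty is a two-stage pipeline: 1) a cross-view motion transformation module that maps the first-frame path design to frame-aligned box trajectories under camera motion; 2) a motion-conditioned video re-synthesis module that follows these trajectories to regenerate the object while preserving the rest of the input video. This gives users a more intuitive interaction (single-frame trajectory design) for editing object motion in complex scenes with camera motion.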

Abstract: We study object motion path editing in videos, where the goal is to alter a target object’s trajectory while preserving the original scene content. Unlike prior video editing methods that primarily manipulate appearance or rely on point-track-based trajectory control, which is often challenging for users to provide during inference, especially in videos with camera motion, we offer a practical, easy-to-use approach to controllable object-centric motion editing. We present Trace, a framework that enables users to design the desired trajectory in a single anchor frame and then synthesizes a temporally consistent edited video. Our approach addresses this task with a two-stage pipeline: a cross-view motion transformation module that maps first-frame path design to frame-aligned box trajectories under camera motion, and a motion-conditioned video re-synthesis module that follows these trajectories to regenerate the object while preserving the remaining content of the input video. Experiments on diverse real-world videos show that our method produces more coherent, realistic, and controllable motion edits than recent image-to-video and video-to-video methods.


[105] Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLMs cs.CVPDF

Vishal Narnaware, Animesh Gupta, Kevin Zhai, Zhenyi Wang, Mubarak Shah

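TL;DR: To address the multimodal hallucinations common in Multimodal Diffusion Large Language Models (MDLLMs), this paper proposes VISAGE, a training-free, inference-time decoding framework. It estimates the objective mismatch by quantifying the spatial entropy of cross-attention distributions and penalizes spatially uniform distributions by enforcing a localization consensus across attention heads, re-ranking candidate tokens to favor visually grounded outcomes.

Details

Motivation: MDLLMs achieve high-concurrency generation through parallel masked decoding, but the architecture is prone to multimodal hallucinations. This structural weakness stems from an algorithmic flaw: the decoder ranks candidate tokens by textual likelihood alone, without verifying localized visual support. The result is an objective mismatch in which language probability mass becomes a misspecified proxy for the multimodal task.

Result: Evaluations on hallucination-sensitive and general-purpose benchmarks demonstrate the framework's robustness, with relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench.

Insight: The core idea is to reinterpret hallucination as a localized optimization error and to calibrate the objective at inference time, without training, by analyzing the spatial entropy of cross-attention distributions. The method comes with an analytical stability guarantee and uses intrinsic properties of the attention mechanism to strengthen visual grounding, offering a lightweight and effective way to mitigate hallucinations in MDLLMs.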

Abstract: Multimodal Diffusion Large Language Models (MDLLMs) achieve high-concurrency generation through parallel masked decoding, yet the architectures remain prone to multimodal hallucinations. This structural vulnerability stems from an algorithmic flaw: the decoder ranks candidate tokens based on textual likelihood without verifying localized visual support. We establish that this language-only ranking induces an objective mismatch, where language probability mass acts as a misspecified proxy for the intended multimodal task. Consequently, we reinterpret hallucination as a localized optimization error, a phenomenon where the decoder exploits language shortcuts to maximize a proxy score at the expense of visual grounding. To address this objective mismatch, we introduce VISAGE, a training-free decoding framework that calibrates the objective at inference time. VISAGE estimates the proxy discrepancy by quantifying the spatial entropy of cross-attention distributions. By enforcing a localization consensus across attention heads, the method penalizes spatially uniform distributions and re-ranks token commitments to favor visually grounded outcomes. We provide an analytical stability guarantee establishing that VISAGE maintains a bounded objective loss under estimation error. Evaluations across hallucination-sensitive and general-purpose benchmarks demonstrate the robustness of the framework, yielding relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench.
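A rough sketch of entropy-penalized re-ranking may help make the idea concrete. It is not the VISAGE algorithm: the abstract does not give the scoring rule, so the normalized-entropy penalty, the averaging over heads as a crude "consensus", the `penalty` weight, and the toy attention maps are all assumptions made for illustration.

```python
import math


def normalized_entropy(attn):
    """Shannon entropy of an attention map over image patches, scaled to [0, 1].

    1.0 means spatially uniform attention; values near 0 mean sharply localized.
    """
    total = sum(attn)
    probs = [a / total for a in attn]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def rerank(candidates, penalty=1.0):
    """Re-rank (token, log_prob, per_head_attention) candidates.

    Assumed score: text log-likelihood minus `penalty` times the mean
    normalized entropy across attention heads.
    """
    scored = []
    for token, log_prob, heads in candidates:
        mean_entropy = sum(normalized_entropy(h) for h in heads) / len(heads)
        scored.append((log_prob - penalty * mean_entropy, token))
    return [token for _, token in sorted(scored, reverse=True)]


# Two hypothetical candidates over four image patches and two heads.
candidates = [
    # Likelier under the language prior, but attention is spread uniformly.
    ("dog", math.log(0.5), [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]),
    # Slightly less likely, but both heads focus on the same patch.
    ("cat", math.log(0.4), [[0.97, 0.01, 0.01, 0.01], [0.94, 0.02, 0.02, 0.02]]),
]

print(rerank(candidates, penalty=0.0))  # language-only ranking: ['dog', 'cat']
print(rerank(candidates, penalty=1.0))  # entropy-penalized:     ['cat', 'dog']
```

The penalty weight trades off language likelihood against localization; setting it to zero recovers the language-only ranking that the abstract identifies as the source of the mismatch.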


[106] Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models cs.CV | cs.AIPDF

Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu

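TL;DR: Existing video world models perform poorly when dynamic subjects are occluded and later reappear. This paper proposes a new paradigm, Hybrid Memory, which requires the model to act simultaneously as a precise archivist for static backgrounds and a vigilant tracker for dynamic subjects. It constructs HM-World, the first large-scale video dataset dedicated to hybrid memory, and proposes HyDRA, a specialized memory architecture that compresses memory into tokens and uses a spatiotemporal relevance-driven retrieval mechanism to preserve the identity and motion continuity of hidden subjects.

Details

Motivation: Existing video world models mainly treat the environment as a static canvas. When dynamic subjects move out of view and later re-emerge, models often struggle, producing frozen, distorted, or vanishing subjects. The paper targets subject continuity during out-of-view intervals.

Result: Extensive experiments on the purpose-built HM-World dataset show that HyDRA significantly outperforms state-of-the-art (SOTA) methods in both dynamic subject consistency and overall generation quality.

Insight: The core contributions are the Hybrid Memory paradigm, which requires the model to handle static and dynamic information together; HM-World, the first large-scale, high-quality dataset with decoupled trajectories for evaluating this task; and the HyDRA architecture, which preserves the identity and motion of hidden subjects through tokenized compression and spatiotemporal relevance-driven retrieval. Explicitly separating memory into "archiving" static backgrounds and "tracking" dynamic subjects, with mechanisms designed for each, is an instructive research direction.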

Abstract: Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.


[107] No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models cs.CV | cs.LGPDF

Hai X. Pham, David T. Hoffmann, Ricardo Guerrero, Brais Martinez

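TL;DR: This paper proposes a concept-centric learning approach that requires no hard negatives. It uses short concept-centric captions and cross-modal attention pooling to strengthen the compositionality of contrastive vision-language models, reaching SOTA performance on standard compositionality benchmarks without degrading zero-shot and retrieval capabilities.

Details

Motivation: Contrastive vision-language models are limited in learning compositional representations, and existing remedies depend on benchmark-specific hard negative generation, which generalizes poorly and can degrade basic model capabilities.

Result: The method reaches SOTA performance on standard compositionality benchmarks while maintaining or improving zero-shot and retrieval capabilities, with no increase in inference cost.

Insight: Contributions include aligning short concept-centric captions with the image to encourage compositional learning, and a parameter-free cross-modal attention pooling that yields concept-centric visual embeddings. By improving both the training data and the representation aggregation mechanism, the method addresses the information loss that hinders compositional learning, and it appears general and practical.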

Abstract: Contrastive vision-language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard negative samples. Hard negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial degradation of basic V&L capabilities such as zero-shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit compositionality performance of V&Ls: 1) Long training captions do not require a compositional representation; and 2) The final global pooling in the text and image encoders leads to a complete loss of the necessary information to learn binding in the first place. As a remedy, we propose two simple solutions: 1) We obtain short concept centric caption parts using standard NLP software and align those with the image; and 2) We introduce a parameter-free cross-modal attention-pooling to obtain concept centric visual embeddings from the image encoder. With these two changes and simple auxiliary contrastive losses, we obtain SOTA performance on standard compositionality benchmarks, while maintaining or improving strong zero-shot and retrieval capabilities. This is achieved without increasing inference cost. We release the code for this work at https://github.com/SamsungLabs/concept_centric_clip.
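One plausible reading of parameter-free cross-modal attention pooling is sketched below: the text embedding of a short concept acts as the query, patch embeddings act as both keys and values, and no learned projections are involved. The abstract does not specify the formulation, so the dot-product scoring, the `temperature` argument, and the toy two-dimensional embeddings are assumptions.

```python
import math


def attention_pool(text_query, patches, temperature=1.0):
    """Pool image patch embeddings with a concept text embedding as the query.

    Weights are a softmax over query-patch dot products; there are no learned
    parameters. Returns (pooled_embedding, attention_weights).
    """
    scores = [sum(q * p for q, p in zip(text_query, patch)) / temperature
              for patch in patches]
    top = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(patches[0])
    pooled = [sum(w * patch[d] for w, patch in zip(weights, patches))
              for d in range(dim)]
    return pooled, weights


# Toy example: patches 0 and 2 match the concept direction, patch 1 does not.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
concept = [4.0, 0.0]
pooled, weights = attention_pool(concept, patches)
print(weights)  # most of the mass falls on patches 0 and 2
print(pooled)   # close to [1.0, 0.0], unlike the global mean [0.67, 0.33]
```

Compared with global average pooling, the pooled vector here depends on the concept being queried, which is the property the abstract points to when it says global pooling discards the information needed for binding.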


[108] PixelSmile: Toward Fine-Grained Facial Expression Editing cs.CV | cs.AIPDF

Jiabin Hua, Hengyuan Xu, Aojie Li, Wei Cheng, Gang Yu

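TL;DR: This paper proposes PixelSmile, a diffusion-based framework for fine-grained facial expression editing. To address the intrinsic overlap of expression semantics, the authors construct the FFE dataset with continuous affective annotations and the FFE-Bench evaluation benchmark. PixelSmile disentangles expression semantics via fully symmetric joint training and combines intensity supervision with contrastive learning to achieve precise, stable, linear expression control, with support for smooth expression blending.

Details

Motivation: Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. The paper aims for expression editing that is more precise and controllable while preserving identity.

Result: Extensive experiments on FFE-Bench show that PixelSmile performs well on disentanglement, editing accuracy, linear controllability, and the trade-off between expression editing and identity preservation, achieving superior disentanglement and robust identity preservation.

Insight: The main contributions are: 1) the FFE dataset with continuous annotations and the comprehensive FFE-Bench benchmark; 2) a diffusion framework based on fully symmetric joint training that disentangles expression semantics; 3) the combination of intensity supervision and contrastive learning, which makes expressions more distinguishable and their intensity more controllable; 4) stable, linear expression control through textual latent interpolation. The dataset construction and systematic evaluation methodology are valuable for advancing the field.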

Abstract: Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) dataset with continuous affective annotations and establish FFE-Bench to evaluate structural confusion, editing accuracy, linear controllability, and the trade-off between expression editing and identity preservation. We propose PixelSmile, a diffusion framework that disentangles expression semantics via fully symmetric joint training. PixelSmile combines intensity supervision with contrastive learning to produce stronger and more distinguishable expressions, achieving precise and stable linear expression control through textual latent interpolation. Extensive experiments demonstrate that PixelSmile achieves superior disentanglement and robust identity preservation, confirming its effectiveness for continuous, controllable, and fine-grained expression editing, while naturally supporting smooth expression blending.


[109] PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference cs.CV | cs.AIPDF

Xiaofeng Mao, Shaohao Rui, Kaining Ying, Bo Zheng, Chuanhao Li

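TL;DR: PackForcing is a unified framework for autoregressive video diffusion models that targets linear KV-cache growth, temporal repetition, and error accumulation in long-video generation. It manages the generation history with a novel three-partition KV-cache strategy comprising Sink, Mid, and Recent tokens, combined with dynamic context selection and a continuous position-encoding adjustment, to produce high-quality long videos under bounded memory.

Details

Motivation: Existing autoregressive video diffusion models are bottlenecked in long-video generation by intractable linear growth of KV-cache memory, temporal repetition, and compounding errors.

Result: On VBench, the method achieves state-of-the-art (SOTA) temporal consistency (26.07) and dynamic degree (56.25). It generates coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU with the KV cache bounded at 4 GB, and achieves 24x temporal extrapolation (5 s to 120 s).

Insight: The novelty is a hierarchical context compression framework that splits historical tokens into three classes handled differently, in particular a dual-branch network that applies massive spatiotemporal compression to the Mid tokens, together with dynamic top-k selection and a continuous Temporal RoPE Adjustment. This shows that training on short videos (5-second clips) alone suffices for high-quality long-video synthesis and long-context inference.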

Abstract: Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-$k$ context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that seamlessly re-aligns position gaps caused by dropped tokens with negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables a remarkable 24x temporal extrapolation (5s to 120s), operating effectively either zero-shot or trained on merely 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), proving that short-video supervision is sufficient for high-quality, long-video synthesis. https://github.com/ShandaAI/PackForcing
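The three-partition bookkeeping can be illustrated at the level of frame indices. This is a simplified sketch under several assumptions: it tracks frames rather than KV tensors, omits the 32x compression of mid tokens, takes relevance scores as given (`mid_scores` is hypothetical), and approximates the Temporal RoPE Adjustment by re-indexing the retained frames with contiguous positions.

```python
def build_context(num_frames, n_sink, n_recent, mid_scores, k):
    """Partition history into sink / mid / recent frames and bound the mid part.

    mid_scores maps a mid frame index to a relevance score (hypothetical);
    only the top-k mid frames are kept. Returns the three partitions and a
    map from retained frame index to a contiguous position.
    """
    sink = list(range(min(n_sink, num_frames)))
    recent_start = max(len(sink), num_frames - n_recent)
    recent = list(range(recent_start, num_frames))
    mid = list(range(len(sink), recent_start))

    # Keep the k most relevant mid frames, restored to temporal order.
    top = sorted(mid, key=lambda i: mid_scores[i], reverse=True)[:k]
    kept_mid = sorted(top)

    # Close the position gaps left by dropped frames (assumed simplification).
    kept = sink + kept_mid + recent
    positions = {frame: pos for pos, frame in enumerate(kept)}
    return sink, kept_mid, recent, positions


scores = {2: 0.1, 3: 0.9, 4: 0.3, 5: 0.8, 6: 0.2}
sink, kept_mid, recent, positions = build_context(
    num_frames=10, n_sink=2, n_recent=3, mid_scores=scores, k=2)
print(sink, kept_mid, recent)  # [0, 1] [3, 5] [7, 8, 9]
print(positions)               # {0: 0, 1: 1, 3: 2, 5: 3, 7: 4, 8: 5, 9: 6}
```

Because the retained set never exceeds `n_sink + k + n_recent` frames, the context size stays constant as `num_frames` grows, which is the property that bounds the cache.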


[110] BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation cs.CVPDF

Yan Li, Zezi Zeng, Ziwei Zhou, Xin Gao, Muzhao Tian

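TL;DR: This paper introduces BizGenEval, a benchmark for systematically evaluating commercial visual content generation models. It covers five document types (slides, charts, webpages, posters, and scientific figures) and defines 20 evaluation tasks across four dimensions (text rendering, layout control, attribute binding, and knowledge-based reasoning), with 400 prompts and 8000 human-verified questions. Large-scale testing of 26 popular image generation systems reveals a substantial gap between current models and the needs of professional commercial design.

Details

Motivation: Existing image generation benchmarks focus mainly on natural image synthesis and lack a systematic evaluation of the structured, multi-constraint requirements of real-world commercial design tasks, so a dedicated benchmark is needed to fill this gap.

Result: Evaluating 26 state-of-the-art commercial APIs and open-source models on BizGenEval shows that current generative models have substantial capability gaps in satisfying complex visual and semantic constraints and do not yet meet the requirements of professional visual content creation.

Insight: The novelty is the first systematic benchmark for commercial visual content generation. It extends evaluation from natural images to structured documents and quantifies model capability in real commercial scenarios through multi-dimensional, fine-grained task design and human-verified checklists, offering a new paradigm for evaluating models in specific vertical domains.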

Abstract: Recent advances in image generation models have expanded their applications beyond aesthetic imagery toward practical visual content creation. However, existing benchmarks mainly focus on natural image synthesis and fail to systematically evaluate models under the structured and multi-constraint requirements of real-world commercial design tasks. In this work, we introduce BizGenEval, a systematic benchmark for commercial visual content generation. The benchmark spans five representative document types: slides, charts, webpages, posters, and scientific figures, and evaluates four key capability dimensions: text rendering, layout control, attribute binding, and knowledge-based reasoning, forming 20 diverse evaluation tasks. BizGenEval contains 400 carefully curated prompts and 8000 human-verified checklist questions to rigorously assess whether generated images satisfy complex visual and semantic constraints. We conduct large-scale benchmarking on 26 popular image generation systems, including state-of-the-art commercial APIs and leading open-source models. The results reveal substantial capability gaps between current generative models and the requirements of professional visual content creation. We hope BizGenEval serves as a standardized benchmark for real-world commercial visual content generation.


[111] SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding cs.CVPDF

Jiwook Han, Geo Ahn, Youngrae Kim, Jinwoo Choi

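TL;DR: This paper proposes SlotVTG, a framework that introduces a lightweight slot adapter to decompose the visual tokens of Multimodal Large Language Models (MLLMs) into abstract slots for object-centric visual reasoning, improving cross-domain generalization on Video Temporal Grounding (VTG) while preserving in-domain performance.

Details

Motivation: Because the coarse recognition capabilities of existing MLLMs are insufficient for VTG, task-specific fine-tuning is required. This leads models to memorize dataset-specific shortcuts rather than ground in the actual visual content, resulting in poor cross-domain generalization.

Result: Cross-domain evaluation on standard VTG benchmarks shows that SlotVTG significantly improves OOD robustness while maintaining competitive ID performance with minimal overhead.

Insight: The novelty is a lightweight slot adapter that uses objectness priors from a self-supervised vision model to encourage semantically coherent slot formation, enabling low-cost object-centric visual reasoning without retraining a multi-stage pipeline from scratch.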

Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance on Video Temporal Grounding (VTG). However, their coarse recognition capabilities are insufficient for fine-grained temporal understanding, making task-specific fine-tuning indispensable. This fine-tuning causes models to memorize dataset-specific shortcuts rather than faithfully grounding in the actual visual content, leading to poor Out-of-Domain (OOD) generalization. Object-centric learning offers a promising remedy by decomposing scenes into entity-level representations, but existing approaches require re-running the entire multi-stage training pipeline from scratch. We propose SlotVTG, a framework that steers MLLMs toward object-centric, input-grounded visual reasoning at minimal cost. SlotVTG introduces a lightweight slot adapter that decomposes visual tokens into abstract slots via slot attention and reconstructs the original sequence, where objectness priors from a self-supervised vision model encourage semantically coherent slot formation. Cross-domain evaluation on standard VTG benchmarks demonstrates that our approach significantly improves OOD robustness while maintaining competitive In-Domain (ID) performance with minimal overhead.


[112] PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow cs.CVPDF

Xincheng Shuai, Song Tang, Yutong Huang, Henghui Ding, Dacheng Tao

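TL;DR: PSDesigner is an automated graphic design system that emulates the creative workflow of human designers. It collects theme-related assets based on user instructions and autonomously infers and executes tool calls to manipulate design files, such as integrating new assets or refining inferior elements. The system learns expert design procedures from CreativePSD, a dataset of high-quality PSD files annotated with operation traces. It outperforms existing methods across a range of graphic design tasks, letting non-specialists conveniently create production-quality designs.

Details

Motivation: Current automated design systems typically simplify professional workflows, which limits flexibility and intuitiveness and makes it hard to translate user intent faithfully into editable design files. The paper addresses this with a system that emulates the creative process of human designers.

Result: Extensive experiments show that PSDesigner outperforms existing methods across diverse graphic design tasks, enabling non-specialists to conveniently create production-quality designs.

Insight: The novelty is an automated system that emulates the human designer's creative workflow. Its tool-use capabilities come from a high-quality PSD dataset annotated with operation traces (CreativePSD), which enables more flexible and intuitive design generation and editing.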

Abstract: Graphic design is a creative and innovative process that plays a crucial role in applications such as e-commerce and advertising. However, developing an automated design system that can faithfully translate user intentions into editable design files remains an open challenge. Although recent studies have leveraged powerful text-to-image models and MLLMs to assist graphic design, they typically simplify professional workflows, resulting in limited flexibility and intuitiveness. To address these limitations, we propose PSDesigner, an automated graphic design system that emulates the creative workflow of human designers. Building upon multiple specialized components, PSDesigner collects theme-related assets based on user instructions, and autonomously infers and executes tool calls to manipulate design files, such as integrating new assets or refining inferior elements. To endow the system with strong tool-use capabilities, we construct a design dataset, CreativePSD, which contains a large amount of high-quality PSD design files annotated with operation traces across a wide range of design scenarios and artistic styles, enabling models to learn expert design procedures. Extensive experiments demonstrate that PSDesigner outperforms existing methods across diverse graphic design tasks, empowering non-specialists to conveniently create production-quality designs.


[113] MegaFlow: Zero-Shot Large Displacement Optical Flow cs.CVPDF

Dingxi Zhang, Fangjinhua Wang, Marc Pollefeys, Haofei Xu

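TL;DR: MegaFlow is a model for zero-shot large-displacement optical flow estimation. It formulates flow estimation as a global matching problem over pre-trained global Vision Transformer features, adds lightweight iterative refinement for sub-pixel accuracy, and achieves state-of-the-art zero-shot performance on multiple optical flow benchmarks.

Details

Motivation: Existing methods usually rely on iterative local search or domain-specific fine-tuning, which limits their performance in zero-shot generalization and large-displacement scenarios.

Result: The model reaches state-of-the-art zero-shot performance on multiple optical flow benchmarks and highly competitive zero-shot performance on long-range point tracking benchmarks.

Insight: The novelty is the use of pre-trained global visual priors (such as Vision Transformer features) for global matching, which naturally captures large displacements, combined with lightweight iterative refinement. Together these suggest a unified paradigm for generalizable motion estimation.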

Abstract: Accurate estimation of large displacement optical flow remains a critical challenge. Existing methods typically rely on iterative local search and/or domain-specific fine-tuning, which severely limits their performance in large displacement and zero-shot generalization scenarios. To overcome this, we introduce MegaFlow, a simple yet powerful model for zero-shot large displacement optical flow. Rather than relying on highly complex, task-specific architectural designs, MegaFlow adapts powerful pre-trained vision priors to produce temporally consistent motion fields. In particular, we formulate flow estimation as a global matching problem by leveraging pre-trained global Vision Transformer features, which naturally capture large displacements. This is followed by a few lightweight iterative refinements to further improve the sub-pixel accuracy. Extensive experiments demonstrate that MegaFlow achieves state-of-the-art zero-shot performance across multiple optical flow benchmarks. Moreover, our model also delivers highly competitive zero-shot performance on long-range point tracking benchmarks, demonstrating its robust transferability and suggesting a unified paradigm for generalizable motion estimation. Our project page is at: https://kristen-z.github.io/projects/megaflow.
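
To make "flow estimation as a global matching problem" concrete, here is a minimal sketch of softmax-based global matching. It assumes a common formulation (correlate each source feature with every target feature, take the softmax-weighted expected target coordinate, subtract the source coordinate); MegaFlow's actual matching layer and refinement steps may differ. The function name, the temperature, and the one-hot toy features are hypothetical.

```python
import math

def global_matching_flow(feat_src, feat_tgt, coords, temperature=0.1):
    """Flow per source pixel from all-pairs feature similarity (sketch)."""
    flows = []
    for f, (x, y) in zip(feat_src, coords):
        # Similarity of this source pixel to every target pixel.
        logits = [sum(a * b for a, b in zip(f, g)) / temperature
                  for g in feat_tgt]
        m = max(logits)
        w = [math.exp(l - m) for l in logits]
        z = sum(w)
        # Expected matching position under the softmax distribution.
        ex = sum(wi * cx for wi, (cx, _) in zip(w, coords)) / z
        ey = sum(wi * cy for wi, (_, cy) in zip(w, coords)) / z
        flows.append((ex - x, ey - y))
    return flows

# A 1x4 image whose content moves two pixels to the right.
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
feat_src = [[1.0 if d == i else 0.0 for d in range(4)] for i in range(4)]
feat_tgt = [feat_src[(j - 2) % 4] for j in range(4)]
flows = global_matching_flow(feat_src, feat_tgt, coords)
```

Because every source pixel is compared with every target pixel, the size of the displacement does not limit the search, unlike a local search window.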


[114] Vega: Learning to Drive with Natural Language Instructions cs.CV | cs.AI | cs.ROPDF

Sicheng Zuo, Yuxuan Li, Wenzhao Zheng, Zheng Zhu, Jie Zhou

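TL;DR: This paper proposes Vega, a unified vision-language-world-action model for autonomous driving planning from natural language instructions. It builds a large-scale instruction-driving dataset, InstructScene, and combines autoregressive and diffusion paradigms to handle multimodal inputs, producing personalized driving responses to diverse user instructions.

Details

Motivation: Existing vision-language-action models for autonomous driving mainly use language for scene description or reasoning and lack the flexibility to follow diverse user instructions for personalized driving.

Result: Extensive experiments on the constructed dataset show that the method achieves superior planning performance and also exhibits strong instruction-following ability.

Insight: The contributions are a large-scale instruction-driving dataset, a unified multimodal architecture that combines autoregressive and diffusion paradigms, and the use of joint attention with separate per-modality projection layers to promote cross-modal interaction.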

Abstract: Vision-language-action models have reshaped autonomous driving to incorporate languages into the decision-making process. However, most existing pipelines only utilize the language modality for scene descriptions or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions with the corresponding trajectories. We then propose a unified Vision-Language-World-Action model, Vega, for instruction-based generation and planning. We employ the autoregressive paradigm to process visual inputs (vision) and language instructions (language) and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). We perform joint attention to enable interactions between the modalities and use individual projection layers for different modalities for more capabilities. Extensive experiments demonstrate that our method not only achieves superior planning performance but also exhibits strong instruction-following abilities, paving the way for more intelligent and personalized driving systems.


[115] RefAlign: Representation Alignment for Reference-to-Video Generation cs.CVPDF

Lei Wang, YuXin Song, Ge Wu, Haocheng Feng, Hang Zhou

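TL;DR: This paper proposes RefAlign, a framework that explicitly aligns reference-image features inside the diffusion Transformer with the semantic space of a visual foundation model. It targets copy-paste artifacts and multi-subject confusion in reference-to-video generation, adds a reference alignment loss during training to improve identity consistency and semantic discriminability, and reaches SOTA on the OpenS2V-Eval benchmark.

Details

Motivation: Existing reference-to-video methods introduce high-level semantic or cross-modal features as implicit alignment signals, but these struggle with the copy-paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features.

Result: On the OpenS2V-Eval benchmark, RefAlign outperforms current state-of-the-art methods in TotalScore.

Insight: The novelty is an explicit representation alignment framework: a reference alignment loss aligns the DiT reference-branch features with the VFM semantic space during training only, balancing text controllability and reference fidelity with no inference overhead. Viewed objectively, the core contribution is replacing implicit alignment with explicit semantic-space alignment and using a contrastive mechanism to make features more discriminative.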

Abstract: Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and jointly feed them into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle to address copy–paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features. In this paper, we propose RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls the reference features and VFM features of the same subject closer to improve identity consistency, while pushing apart the corresponding features of different subjects to enhance semantic discriminability. This simple yet effective strategy is applied only during training, incurring no inference-time overhead, and achieves a better balance between text controllability and reference fidelity. Extensive experiments on the OpenS2V-Eval benchmark demonstrate that RefAlign outperforms current state-of-the-art methods in TotalScore, validating the effectiveness of explicit reference alignment for R2V tasks.
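
The pull-together, push-apart behaviour of the reference alignment loss can be sketched with an InfoNCE-style objective. This is an assumption for illustration only: the abstract does not give the exact loss, and the function name, temperature, and toy features below are hypothetical.

```python
import math

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def reference_alignment_loss(ref_feats, vfm_feats, temperature=0.1):
    """InfoNCE-style sketch: row i of each list belongs to subject i."""
    n = len(ref_feats)
    loss = 0.0
    for i in range(n):
        logits = [cosine(ref_feats[i], vfm_feats[j]) / temperature
                  for j in range(n)]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Same-subject pair is the positive; other subjects are negatives.
        loss += log_z - logits[i]
    return loss / n

ref = [[1.0, 0.0], [0.0, 1.0]]
aligned = reference_alignment_loss(ref, [[1.0, 0.0], [0.0, 1.0]])
confused = reference_alignment_loss(ref, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each reference feature matches its own subject's VFM feature and large when subjects are swapped, which is the multi-subject confusion the loss is meant to penalize.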


[116] MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models cs.CVPDF

Bocheng Zou, Mu Cai, Mark Stanley, Dingfu Lu, Yong Jae Lee

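TL;DR: This paper proposes MuRF (Multi-Resolution Fusion), a strategy that addresses the fact that vision foundation models are usually restricted to a single fixed scale at inference time. It processes an image at multiple resolutions and fuses the resulting features into one unified representation, exploiting the complementary inductive biases that different resolutions provide.

Details

Motivation: Vision foundation models typically run inference at a single scale, ignoring the complementary roles of different resolutions in visual perception: low-resolution views excel at global semantic recognition, while high-resolution views are essential for fine-grained refinement.

Result: Experiments validate MuRF on a range of key computer vision tasks and on several distinct VFM families, primarily DINOv2, with successful generalization to contrastive models such as SigLIP2.

Insight: MuRF's novelty lies in being simple, general, and training-free. It is not tied to a specific architecture and acts as a basic enhancement to visual representations that exploits the synergy of multi-scale information.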

Abstract: Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale. This prevalent single-scale paradigm overlooks a fundamental property of visual perception: varying resolutions offer complementary inductive biases, where low-resolution views excel at global semantic recognition and high-resolution views are essential for fine-grained refinement. In this work, we propose Multi-Resolution Fusion (MuRF), a simple yet universally effective strategy to harness this synergy at inference time. Instead of relying on a single view, MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. The universality of MuRF is its most compelling attribute. It is not tied to a specific architecture, serving instead as a fundamental, training-free enhancement to visual representation. We empirically validate this by applying MuRF to a broad spectrum of critical computer vision tasks across multiple distinct VFM families - primarily DINOv2, but also demonstrating successful generalization to contrastive models like SigLIP2.
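
A minimal sketch of the multi-resolution idea follows. Everything here is a stand-in chosen for illustration: `toy_encoder` is a hypothetical placeholder for a frozen VFM, and fusing by nearest-neighbour upsampling plus channel concatenation is an assumption, since the abstract does not specify the fusion operator.

```python
def downsample(img, factor):
    """Average-pool a 2D list of floats by an integer factor."""
    h, w = len(img) // factor, len(img[0]) // factor
    return [
        [
            sum(img[y * factor + dy][x * factor + dx]
                for dy in range(factor) for dx in range(factor)) / factor ** 2
            for x in range(w)
        ]
        for y in range(h)
    ]

def toy_encoder(img):
    """Hypothetical frozen encoder: [intensity, horizontal difference]."""
    return [
        [[v, row[min(x + 1, len(row) - 1)] - v] for x, v in enumerate(row)]
        for row in img
    ]

def upsample_nearest(feat, out_h, out_w):
    in_h, in_w = len(feat), len(feat[0])
    return [[feat[y * in_h // out_h][x * in_w // out_w]
             for x in range(out_w)] for y in range(out_h)]

def fuse_multi_resolution(img, factors=(1, 2)):
    """Encode the image at each resolution and concatenate channels."""
    out_h, out_w = len(img), len(img[0])
    maps = [upsample_nearest(toy_encoder(downsample(img, f)), out_h, out_w)
            for f in factors]
    return [[sum((m[y][x] for m in maps), []) for x in range(out_w)]
            for y in range(out_h)]

image = [[0.0, 1.0, 2.0, 3.0] for _ in range(4)]
fused = fuse_multi_resolution(image)
```

Each output location carries features from both the full-resolution and the coarse view, and the encoder itself is never updated, which matches the training-free framing of the abstract.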


[117] ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling cs.CVPDF

Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu

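TL;DR: This paper proposes ShotStream, a novel causal multi-shot video generation architecture for interactive storytelling and efficient real-time frame generation. It reformulates the task as next-shot generation conditioned on historical context, so users can steer an ongoing narrative through streaming prompts. A text-to-video model is fine-tuned into a bidirectional next-shot generator and then distilled into a causal student via Distribution Matching Distillation. A dual-cache memory mechanism and a two-stage distillation strategy address the inter-shot consistency and error accumulation problems inherent in autoregressive generation.

Details

Motivation: Current bidirectional architectures for multi-shot video generation have limited interactivity and high latency, which restricts long narrative storytelling. The paper aims to enable real-time, interactive multi-shot video generation.

Result: Extensive experiments show that ShotStream generates coherent multi-shot videos with sub-second latency, reaching 16 FPS on a single GPU. Its quality matches or exceeds that of slower bidirectional models, paving the way for real-time interactive storytelling.

Insight: The main contributions are: 1) reformulating multi-shot video generation as a causal next-shot generation task that supports streaming interaction; 2) a dual-cache memory mechanism (a global context cache and a local context cache) with a RoPE discontinuity indicator to maintain visual coherence; 3) a two-stage distillation strategy, moving from intra-shot self-forcing on ground-truth historical shots to inter-shot self-forcing on self-generated histories, which reduces error accumulation and bridges the train-test gap.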

Abstract: Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation. By reformulating the task as next-shot generation conditioned on historical context, ShotStream allows users to dynamically instruct ongoing narratives via streaming prompts. We achieve this by first fine-tuning a text-to-video model into a bidirectional next-shot generator, which is then distilled into a causal student via Distribution Matching Distillation. To overcome the challenges of inter-shot consistency and error accumulation inherent in autoregressive generation, we introduce two key innovations. First, a dual-cache memory mechanism preserves visual coherence: a global context cache retains conditional frames for inter-shot consistency, while a local context cache holds generated frames within the current shot for intra-shot consistency. A RoPE discontinuity indicator explicitly distinguishes the two caches to eliminate ambiguity. Second, to mitigate error accumulation, we propose a two-stage distillation strategy. This begins with intra-shot self-forcing conditioned on ground-truth historical shots and progressively extends to inter-shot self-forcing using self-generated histories, effectively bridging the train-test gap. Extensive experiments demonstrate that ShotStream generates coherent multi-shot videos with sub-second latency, achieving 16 FPS on a single GPU. It matches or exceeds the quality of slower bidirectional models, paving the way for real-time interactive storytelling. Training and inference code, as well as the models, are available on our


cs.LG [Back]

[118] Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards cs.LG | cs.CLPDF

Cheng Jiayang, Xin Liu, Zhihan Zhang, Haoyang Wen, Zixuan Zhang

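TL;DR: This paper proposes a framework for training large language models for multi-step tool orchestration. It combines a data synthesis pipeline backed by a cache of real API responses with a graduated reward design, addressing the difficulty existing methods have with full-sequence execution.

Details

Motivation: In multi-step tool orchestration, LLMs must call multiple dependent APIs in the correct order and pass intermediate outputs along. Existing training environments are limited to simple per-turn function calls with simulated data, and binary rewards give no signal for partial correctness.

Result: On the ComplexFuncBench benchmark, the method substantially improves turn accuracy. Ablation studies confirm that both components of the reward design (atomic validity and orchestration correctness) are essential.

Insight: The novelty lies in building a reinforcement learning environment on a large-scale cache of real API responses, which enables efficient data synthesis with controllable complexity, and in a graduated reward that decomposes correctness into atomic validity and orchestration correctness.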

Abstract: Multi-step tool orchestration, where LLMs must invoke multiple dependent APIs in the correct order while propagating intermediate outputs, remains challenging. State-of-the-art models frequently fail on full sequence execution, with parameter value errors accounting for a significant portion of failures. Training models to handle such workflows faces two obstacles: existing environments focus on simple per-turn function calls with simulated data, and binary rewards provide no signal for partial correctness. We present a framework addressing both challenges. First, we construct a reinforcement learning environment backed by a large-scale cache of real API responses, enabling a data synthesis pipeline that samples valid multi-step orchestration traces with controllable complexity and significantly higher generation efficiency than unconstrained methods. Second, we propose a graduated reward design that decomposes correctness into atomic validity (individual function call correctness at increasing granularity) and orchestration (correct tool sequencing with dependency respect). On ComplexFuncBench, our approach demonstrates substantial improvements in turn accuracy. Ablation studies confirm both reward components are essential: using either alone significantly degrades performance.
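
The graduated reward can be sketched as follows. The granularity levels (function name, then argument keys, then argument values), the LCS-based ordering score, the equal weights, and the tool names are all assumptions made for illustration; the abstract only states that correctness is decomposed into atomic validity and orchestration.

```python
def atomic_validity(pred, gold):
    """Partial credit at increasing granularity (assumed levels)."""
    levels = 0
    if pred["name"] == gold["name"]:
        levels = 1                                   # right function
        if set(pred["args"]) == set(gold["args"]):
            levels = 2                               # right argument keys
            if pred["args"] == gold["args"]:
                levels = 3                           # right argument values
    return levels / 3

def lcs_len(a, b):
    """Length of the longest common subsequence of two lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if x == y
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[-1][-1]

def graduated_reward(pred_calls, gold_calls, w_atomic=0.5, w_orch=0.5):
    atomic = sum(atomic_validity(p, g)
                 for p, g in zip(pred_calls, gold_calls)) / len(gold_calls)
    order = lcs_len([c["name"] for c in pred_calls],
                    [c["name"] for c in gold_calls]) / len(gold_calls)
    return w_atomic * atomic + w_orch * order

gold = [{"name": "search_flights", "args": {"dest": "PAR"}},
        {"name": "book_hotel", "args": {"city": "PAR"}}]
partial = [{"name": "search_flights", "args": {"dest": "PAR"}},
           {"name": "book_hotel", "args": {"city": "LON"}}]
perfect_reward = graduated_reward(gold, gold)
partial_reward = graduated_reward(partial, gold)
```

A binary reward would score the `partial` trace as zero; the graduated version still credits the correct tool order and the mostly correct calls, which is the partial-correctness signal the abstract argues for.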


[119] Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models cs.LG | cs.AI | cs.CLPDF

Isha Puri, Mehul Damani, Idan Shenfeld, Marzyeh Ghassemi, Jacob Andreas

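TL;DR: This paper proposes a multi-answer reinforcement learning approach that trains language models to reason distributionally over multiple answers at inference time. By modifying the RL objective, the model explicitly generates multiple candidate answers in a single forward pass, internalizing inference-time search into its generative process. On question-answering, medical diagnosis, and code generation benchmarks, the approach improves answer diversity, coverage, and set-level calibration over single-answer baselines, needs fewer tokens to produce multiple answers, and is substantially more accurate on coding tasks.

Details

Motivation: Language models usually collapse the answer distribution onto a single dominant mode, which is a poor fit for real-world tasks with multiple valid answers or irreducible uncertainty, such as medical diagnosis, ambiguous question answering, and settings with incomplete information. Models should be able to produce several plausible hypotheses with confidence estimates, without computationally intensive repeated sampling to reach non-modal answers.

Result: On question-answering, medical diagnosis, and code generation benchmarks, the approach improves diversity, coverage, and set-level calibration scores over single-answer baselines, requires fewer tokens to generate multiple answers, and is substantially more accurate on coding tasks.

Insight: The novelty is internalizing inference-time search into the model's generative process through multi-answer RL, offering a principled and compute-efficient alternative to inference-time scaling methods such as best-of-k.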

Abstract: Given a question, a language model (LM) implicitly encodes a distribution over possible answers. In practice, post-training procedures for LMs often collapse this distribution onto a single dominant mode. While this is generally not a problem for benchmark-style evaluations that assume one correct answer, many real-world tasks inherently involve multiple valid answers or irreducible uncertainty. Examples include medical diagnosis, ambiguous question answering, and settings with incomplete information. In these cases, we would like LMs to generate multiple plausible hypotheses, ideally with confidence estimates for each one, and without computationally intensive repeated sampling to generate non-modal answers. This paper describes a multi-answer reinforcement learning approach for training LMs to perform distributional reasoning over multiple answers during inference. We modify the RL objective to enable models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference-time search into the model’s generative process. Across question-answering, medical diagnostic, and coding benchmarks, we observe improved diversity, coverage, and set-level calibration scores compared to single answer trained baselines. Models trained with our approach require fewer tokens to generate multiple answers than competing approaches. On coding tasks, they are also substantially more accurate. These results position multi-answer RL as a principled and compute-efficient alternative to inference-time scaling procedures such as best-of-k. Code and more information can be found at https://multi-answer-rl.github.io/.


[120] Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale cs.LG | cs.CL | cs.CVPDF

Yicheng Zou, Dongsheng Zhu, Lin Zhu, Tong Zhu, Yunhua Zhou

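TL;DR: Intern-S1-Pro is the first trillion-parameter scientific multimodal foundation model. Scaling to this size brings comprehensive gains in both general and scientific domains. Beyond stronger reasoning and image-text understanding, it integrates advanced agent capabilities and greatly expands its scientific expertise, handling more than 100 specialized tasks across key fields including chemistry, materials, life sciences, and earth sciences.

Details

Motivation: Build a large-scale multimodal foundation model that handles both general tasks and deep scientific tasks, bridging the gap between general intelligence and specialized scientific intelligence.

Result: The model ranks in the top tier of open-source models for general capabilities and outperforms proprietary models in the depth of specialized scientific tasks.

Insight: By scaling to a trillion parameters and relying on infrastructure such as XTuner and LMDeploy for efficient RL training with strict training-inference precision consistency, the work produces a "Specializable Generalist" model that combines generality with deep expertise.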

Abstract: We introduce Intern-S1-Pro, the first one-trillion-parameter scientific multimodal foundation model. Scaling to this unprecedented size, the model delivers a comprehensive enhancement across both general and scientific domains. Beyond stronger reasoning and image-text understanding capabilities, its intelligence is augmented with advanced agent capabilities. Simultaneously, its scientific expertise has been vastly expanded to master over 100 specialized tasks across critical science fields, including chemistry, materials, life sciences, and earth sciences. Achieving this massive scale is made possible by the robust infrastructure support of XTuner and LMDeploy, which facilitates highly efficient Reinforcement Learning (RL) training at the 1-trillion parameter level while ensuring strict precision consistency between training and inference. By seamlessly integrating these advancements, Intern-S1-Pro further fortifies the fusion of general and specialized intelligence, working as a Specializable Generalist, demonstrating its position in the top tier of open-source models for general capabilities, while outperforming proprietary models in the depth of specialized scientific tasks.


[121] Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes cs.LG | cs.AI | cs.CLPDF

Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, Dongbin Zhao

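TL;DR: This paper revisits on-policy distillation (OPD) and shows that the sampled-token variant is fragile in long-sequence generation, suffering from an imbalanced one-token signal, unreliable teacher guidance, and tokenizer mismatch. The authors analyze OPD from the estimator and implementation sides and propose a fix: teacher top-K local support matching, implemented with truncated reverse KL, top-p sampling, and special-token masking. Experiments show more stable optimization and better performance on math reasoning and mixed-task training.

Details

Motivation: OPD is attractive for LLM post-training because it evaluates teacher feedback on student-generated sequences rather than fixed teacher traces. In long-sequence settings, however, the common sampled-token variant is fragile: it reduces distribution matching to a one-token signal and becomes unreliable as sequences drift away from prefixes the teacher commonly visits. The paper revisits OPD, identifies its failure modes, and proposes simple fixes.

Result: In single-task math reasoning and multi-task (agentic plus math) training, the proposed teacher top-K local support matching objective yields more stable optimization and better downstream performance than sampled-token OPD.

Insight: The contributions are a theoretical analysis of the bias-variance tradeoff between token-level OPD and sequence-level reverse KL, an empirical identification of three failure modes of sampled-token OPD, and a teacher local support matching method that combines truncated reverse KL, top-p sampling, and special-token masking. Viewed objectively, matching on the teacher's local support makes teacher feedback more reliable and gives a sturdier optimization framework for distillation in long-sequence generation.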

Abstract: On-policy distillation (OPD) is appealing for large language model (LLM) post-training because it evaluates teacher feedback on student-generated rollouts rather than fixed teacher traces. In long-horizon settings, however, the common sampled-token variant is fragile: it reduces distribution matching to a one-token signal and becomes increasingly unreliable as rollouts drift away from prefixes the teacher commonly visits. We revisit OPD from the estimator and implementation sides. Theoretically, token-level OPD is biased relative to sequence-level reverse-KL, but it has a much tighter worst-case variance bound; our toy study shows the same tradeoff empirically, with stronger future-reward coupling producing higher gradient variance and less stable learning. Empirically, we identify three failure modes of sampled-token OPD: an imbalanced one-token signal, unreliable teacher guidance on student-generated prefixes, and distortions caused by tokenizer or special-token mismatch. We address these issues with teacher top-K local support matching, implemented as truncated reverse-KL with top-p rollout sampling and special-token masking. Across single-task math reasoning and multi-task agentic-plus-math training, this objective yields more stable optimization and better downstream performance than sampled-token OPD.
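
The contrast between a one-token signal and local support matching can be illustrated at a single position. The sketch below restricts both distributions to the teacher's top-K tokens, renormalizes them on that support, and computes the reverse KL from student to teacher. The renormalization, the `eps` floor, and the toy vocabularies are assumptions; the abstract names the technique but not these details.

```python
import math

def truncated_reverse_kl(student_probs, teacher_probs, k, eps=1e-12):
    """Reverse KL(student || teacher) on the teacher's top-k tokens."""
    support = sorted(teacher_probs, key=teacher_probs.get, reverse=True)[:k]
    q_raw = [max(student_probs.get(t, 0.0), eps) for t in support]
    p_raw = [max(teacher_probs[t], eps) for t in support]
    zq, zp = sum(q_raw), sum(p_raw)
    kl = 0.0
    for q, p in zip(q_raw, p_raw):
        q, p = q / zq, p / zp          # renormalize on the local support
        kl += q * math.log(q / p)
    return kl

teacher = {"4": 0.70, "four": 0.20, "5": 0.05, "<pad>": 0.05}
student_close = {"4": 0.70, "four": 0.20, "5": 0.05, "<pad>": 0.05}
student_far = {"4": 0.10, "four": 0.10, "5": 0.75, "<pad>": 0.05}

kl_close = truncated_reverse_kl(student_close, teacher, k=3)
kl_far = truncated_reverse_kl(student_far, teacher, k=3)
```

Every token in the support contributes to the signal, instead of only the single token the student happened to sample.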


[122] Amplified Patch-Level Differential Privacy for Free via Random Cropping cs.LG | cs.CR | cs.CVPDF

Kaan Durmaz, Jan Schuchardt, Sebastian Schmidt, Stephan Günnemann

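TL;DR: This paper studies the role of random cropping, a standard data augmentation in computer vision, in differentially private training. It finds that random cropping strengthens privacy protection by probabilistically excluding sensitive localized content (such as faces or license plates) from the image, without changing the model architecture or training procedure.

Details

Motivation: The goal is to use the inherent randomness of random cropping to strengthen the privacy protection of differentially private models. When sensitive content is spatially localized, cropping naturally adds randomness that amplifies the privacy guarantee.

Result: Empirical validation across multiple segmentation architectures and datasets shows that patch-level privacy amplification improves the privacy-utility trade-off, giving stronger guarantees at no additional cost.

Insight: The novelty is formalizing a patch-level neighboring relation for vision data and deriving tight privacy bounds for differentially private stochastic gradient descent (DP-SGD) combined with random cropping, showing how domain structure and existing sources of randomness can improve privacy protection.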

Abstract: Random cropping is one of the most common data augmentation techniques in computer vision, yet the role of its inherent randomness in training differentially private machine learning models has thus far gone unexplored. We observe that when sensitive content in an image is spatially localized, such as a face or license plate, random cropping can probabilistically exclude that content from the model’s input. This introduces a third source of stochasticity in differentially private training with stochastic gradient descent, in addition to gradient noise and minibatch sampling. This additional randomness amplifies differential privacy without requiring changes to model architecture or training procedure. We formalize this effect by introducing a patch-level neighboring relation for vision data and deriving tight privacy bounds for differentially private stochastic gradient descent (DP-SGD) when combined with random cropping. Our analysis quantifies the patch inclusion probability and shows how it composes with minibatch sampling to yield a lower effective sampling rate. Empirically, we validate that patch-level amplification improves the privacy-utility trade-off across multiple segmentation architectures and datasets. Our results demonstrate that aligning privacy accounting with domain structure and additional existing sources of randomness can yield stronger guarantees at no additional cost.
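
A small worked example shows how a patch inclusion probability can lower the effective sampling rate. It assumes square crops placed uniformly at integer offsets, counts a crop as "including" the patch if it overlaps it at all, and treats cropping as independent of minibatch sampling so the two probabilities multiply. These are simplifying assumptions for illustration; the paper's tight bounds may rest on a more careful analysis. Function names and numbers are hypothetical.

```python
def axis_overlap_count(length, crop, lo, hi):
    """Integer offsets o whose window [o, o + crop) meets [lo, hi)."""
    first = max(0, lo - crop + 1)
    last = min(length - crop, hi - 1)
    return max(0, last - first + 1)

def patch_inclusion_probability(width, height, crop, box):
    """Chance that a uniform random crop overlaps box = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    total = (width - crop + 1) * (height - crop + 1)
    hits = (axis_overlap_count(width, crop, x0, x1)
            * axis_overlap_count(height, crop, y0, y1))
    return hits / total

def effective_sampling_rate(batch_rate, inclusion_prob):
    """Probability the patch enters a step, assuming independence."""
    return batch_rate * inclusion_prob

# 32x32 image, 16x16 crops, a 4x4 sensitive patch in the top-left corner.
p_patch = patch_inclusion_probability(32, 32, 16, (0, 0, 4, 4))
q_eff = effective_sampling_rate(0.01, p_patch)
```

Here only 4 of the 17 offsets per axis reach the patch, so `p_patch` is 16/289 (about 0.055) and the patch-level sampling rate is roughly eighteen times smaller than the image-level rate of 0.01.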


[123] Light Cones For Vision: Simple Causal Priors For Visual Hierarchy cs.LG | cs.CVPDF

Manglam Kartik, Neel Tushar Shah

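TL;DR: This paper proposes Worldline Slot Attention, which models objects as persistent trajectories along spacetime worldlines and uses Lorentzian geometry to encode asymmetric causality, addressing the inability of standard vision models to capture part-whole hierarchy.

Details

Motivation: Standard vision models treat objects as independent points in Euclidean space and cannot capture the hierarchy between parts and wholes, so a geometric structure that encodes asymmetric causality is needed.

Result: On three datasets, Lorentzian worldlines reach 0.479-0.661 accuracy, a 6x improvement over the 0.078 of Euclidean worldlines (below the random-chance level of 0.33). They also outperform hyperbolic embeddings, indicating that visual hierarchy needs causal structure rather than tree structure.

Insight: The novelty is modeling objects as spacetime worldlines with slots at different hierarchy levels, and showing for the first time that visual hierarchy discovery requires the causal geometry of Lorentzian light cones. This is achieved with only 11K parameters and offers a new geometric inductive bias for visual representation learning.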

Abstract: Standard vision models treat objects as independent points in Euclidean space, unable to capture hierarchical structure like parts within wholes. We introduce Worldline Slot Attention, which models objects as persistent trajectories through spacetime worldlines, where each object has multiple slots at different hierarchy levels sharing the same spatial position but differing in temporal coordinates. This architecture consistently fails without geometric structure: Euclidean worldlines achieve 0.078 level accuracy, below random chance (0.33), while Lorentzian worldlines achieve 0.479-0.661 across three datasets: a 6x improvement replicated over 20+ independent runs. Lorentzian geometry also outperforms hyperbolic embeddings, showing that visual hierarchies require causal structure (temporal dependency) rather than tree structure (radial branching). Our results demonstrate that hierarchical object discovery requires geometric structure encoding asymmetric causality, an inductive bias absent from Euclidean space but natural to Lorentzian light cones, achieved with only 11K parameters. The code is available at: https://github.com/iclrsubmissiongram/loco.
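
The asymmetry that a light cone provides, and that a Euclidean distance lacks, can be shown in a few lines. The sketch assumes the (-, +, +) sign convention and places the "whole" at an earlier time coordinate than its "part"; both choices, and the event values, are hypothetical and not taken from the paper.

```python
def interval_squared(a, b):
    """Squared Lorentzian interval between events (t, x, y)."""
    dt = b[0] - a[0]
    dx2 = sum((q - p) ** 2 for p, q in zip(a[1:], b[1:]))
    return -dt * dt + dx2

def in_future_cone(a, b):
    """True if event b lies in or on the future light cone of event a."""
    return b[0] > a[0] and interval_squared(a, b) <= 0.0

# Slots of one object share a spatial position and differ in time.
whole = (0.0, 1.0, 1.0)
part = (1.0, 1.0, 1.0)
elsewhere = (0.5, 4.0, 1.0)   # a slot belonging to a distant object
```

`in_future_cone(whole, part)` holds while `in_future_cone(part, whole)` does not, so the relation is directional. A Euclidean distance between the same two points is symmetric and cannot express which slot is the parent.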


[124] CVA: Context-aware Video-text Alignment for Video Temporal Grounding cs.LG | cs.AI | cs.CVPDF

Sungho Moon, Seunghun Lee, Jiwan Seo, Sunghoon Im

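TL;DR: This paper proposes CVA (Context-aware Video-text Alignment), a framework for a key challenge in video temporal grounding: video-text alignment that is temporally sensitive yet robust to irrelevant background context. It has three core components: Query-aware Context Diversification (QCD), a data augmentation strategy; Context-invariant Boundary Discrimination (CBD), a contrastive loss; and the Context-enhanced Transformer Encoder (CTE). Together they achieve state-of-the-art performance on major benchmarks.

Details

Motivation: In video temporal grounding, existing methods that aim for temporally sensitive video-text alignment are easily disturbed by irrelevant background context. The paper aims to make models robust to contextual changes.

Result: CVA reaches state-of-the-art performance on major video temporal grounding benchmarks including QVHighlights and Charades-STA, with an improvement of about 5 points in Recall@1 (R1) over the best existing methods.

Insight: The novelty is strengthening context robustness from both the data and the architecture side: 1) QCD augments data using video-text similarity, avoiding semantically related "false negatives"; 2) the CBD loss enforces semantic consistency at challenging temporal boundaries; 3) the CTE encoder combines windowed self-attention and bidirectional cross-attention with learnable queries to capture multi-scale temporal context. Together these mitigate the false-negative problem.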

Abstract: We propose Context-aware Video-text Alignment (CVA), a novel framework to address a significant challenge in video temporal grounding: achieving temporally sensitive video-text alignment that remains robust to irrelevant background context. Our framework is built on three key components. First, we propose Query-aware Context Diversification (QCD), a new data augmentation strategy that ensures only semantically unrelated content is mixed in. It builds a video-text similarity-based pool of replacement clips to simulate diverse contexts while preventing the "false negative" caused by query-agnostic mixing. Second, we introduce the Context-invariant Boundary Discrimination (CBD) loss, a contrastive loss that enforces semantic consistency at challenging temporal boundaries, making their representations robust to contextual shifts and hard negatives. Third, we introduce the Context-enhanced Transformer Encoder (CTE), a hierarchical architecture that combines windowed self-attention and bidirectional cross-attention with learnable queries to capture multi-scale temporal context. Through the synergy of these data-centric and architectural enhancements, CVA achieves state-of-the-art performance on major VTG benchmarks, including QVHighlights and Charades-STA. Notably, our method achieves a significant improvement of approximately 5 points in Recall@1 (R1) scores over state-of-the-art methods, highlighting its effectiveness in mitigating false negatives.


[125] Vision Hopfield Memory Networks cs.LG | cs.AI | cs.CV | stat.MLPDF

Jianfeng Wang, Amine M’Charrak, Luk Koska, Xiangtao Wang, Daniel Petriceanu

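TL;DR: This paper proposes the Vision Hopfield Memory Network (V-HMN), a brain-inspired vision foundation backbone that integrates hierarchical memory mechanisms with iterative refinement updates to improve interpretability and data efficiency. V-HMN contains local Hopfield modules (associative memory dynamics at the image patch level), global Hopfield modules (episodic memory for contextual modulation), and a predictive-coding-inspired refinement rule (iterative error correction).

Details

Motivation: Current vision and multimodal foundation backbones, such as the Transformer family and state-space models like Mamba, have made remarkable progress, but their computational principles are far from those of the human brain. They usually need large amounts of training data and offer limited interpretability. The paper aims for an architecture closer to the brain's computational principles, with better interpretability and data efficiency.

Result: In extensive experiments on public computer vision benchmarks, V-HMN is competitive with widely adopted backbone architectures while offering better interpretability, higher data efficiency, and stronger biological plausibility.

Insight: The novelty is combining hierarchical memory (local and global Hopfield modules) with an iterative refinement rule in a brain-inspired design that captures local and global dynamics together. Memory retrieval exposes the relationship between inputs and stored patterns, which improves interpretability, and reusing stored patterns improves data efficiency. The design offers a generalizable blueprint for next-generation vision and multimodal foundation models.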

Abstract: Recent vision and multimodal foundation backbones, such as Transformer families and state-space models like Mamba, have achieved remarkable progress, enabling unified modeling across images, text, and beyond. Despite their empirical success, these architectures remain far from the computational principles of the human brain, often demanding enormous amounts of training data while offering limited interpretability. In this work, we propose the Vision Hopfield Memory Network (V-HMN), a brain-inspired foundation backbone that integrates hierarchical memory mechanisms with iterative refinement updates. Specifically, V-HMN incorporates local Hopfield modules that provide associative memory dynamics at the image patch level, global Hopfield modules that function as episodic memory for contextual modulation, and a predictive-coding-inspired refinement rule for iterative error correction. By organizing these memory-based modules hierarchically, V-HMN captures both local and global dynamics in a unified framework. Memory retrieval exposes the relationship between inputs and stored patterns, making decisions more interpretable, while the reuse of stored patterns improves data efficiency. This brain-inspired design therefore enhances interpretability and data efficiency beyond existing self-attention- or state-space-based approaches. We conducted extensive experiments on public computer vision benchmarks, and V-HMN achieved competitive results against widely adopted backbone architectures, while offering better interpretability, higher data efficiency, and stronger biological plausibility. These findings highlight the potential of V-HMN to serve as a next-generation vision foundation model, while also providing a generalizable blueprint for multimodal backbones in domains such as text and audio, thereby bridging brain-inspired computation with large-scale machine learning.
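
The associative-memory retrieval that Hopfield modules perform can be sketched with the standard modern (continuous) Hopfield update, where the new state is a softmax-weighted combination of the stored patterns. This is the generic textbook rule, used here as an assumption; V-HMN's local and global modules and its refinement rule may be parameterized differently. The patterns, query, and `beta` value are hypothetical.

```python
import math

def hopfield_retrieve(patterns, query, beta=8.0, steps=1):
    """Modern Hopfield retrieval: state <- sum_i softmax(beta * <p_i, state>)_i * p_i."""
    state = list(query)
    weights = []
    for _ in range(steps):
        logits = [beta * sum(p * s for p, s in zip(pat, state))
                  for pat in patterns]
        m = max(logits)
        weights = [math.exp(l - m) for l in logits]
        z = sum(weights)
        weights = [w / z for w in weights]
        # New state is a convex combination of the stored patterns.
        state = [sum(w * pat[d] for w, pat in zip(weights, patterns))
                 for d in range(len(state))]
    return state, weights

stored = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
noisy_patch = [0.9, 0.2, 0.1]
retrieved, weights = hopfield_retrieve(stored, noisy_patch)
```

The returned `weights` show which stored pattern explains the input, which is the kind of input-to-memory relationship the abstract credits for interpretability.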


eess.AS [Back]

[126] X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs eess.AS | cs.AI | cs.CLPDF

Di Cao, Dongjie Fu, Hai Yu, Siqi Zheng, Xu Tan

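TL;DR: This paper proposes X-OPD, a cross-modal on-policy distillation framework that addresses the significant performance drop of end-to-end speech LLMs relative to their text-only counterparts. The speech model explores through on-policy rollouts and a text-based teacher model provides fine-grained feedback, distilling the teacher's capabilities into the student's multimodal representations.

Details

Motivation: End-to-end speech LLMs beat cascaded systems on latency and paralinguistic modeling, but their performance lags well behind text-only models, and standard supervised fine-tuning and reinforcement learning do not close the gap.

Result: Extensive experiments on multiple benchmarks show that X-OPD significantly narrows the performance gap on complex tasks while preserving the model's original capabilities.

Insight: The novelty is a cross-modal, on-policy distillation framework: the speech model explores on-policy and receives fine-grained evaluation from a text teacher model, aligning capabilities from the text modality to the speech modality.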

Abstract: While the shift from cascaded dialogue systems to end-to-end (E2E) speech Large Language Models (LLMs) improves latency and paralinguistic modeling, E2E models often exhibit a significant performance degradation compared to their text-based counterparts. The standard Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training methods fail to close this gap. To address this, we propose X-OPD, a novel Cross-Modal On-Policy Distillation framework designed to systematically align the capabilities of Speech LLMs to their text-based counterparts. X-OPD enables the Speech LLM to explore its own distribution via on-policy rollouts, where a text-based teacher model evaluates these trajectories and provides token-level feedback, effectively distilling the teacher’s capabilities into the student’s multi-modal representations. Extensive experiments across multiple benchmarks demonstrate that X-OPD significantly narrows the gap in complex tasks while preserving the model’s inherent capabilities.


cs.AI [Back]

[127] How Far Are Vision-Language Models from Constructing the Real World? A Benchmark for Physical Generative Reasoning cs.AI | cs.CL | cs.CVPDF

Luyu Yang, Yutong Dai, An Yan, Viraj Prabhu, Ran Xu

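TL;DR: This paper proposes DreamHouse, a new benchmark for evaluating the physical generative reasoning of vision-language models (VLMs): generating artifacts that simultaneously satisfy geometric, structural, constructability, and code-compliance constraints. The benchmark is grounded in residential timber-frame construction, contains more than 26,000 verified structures, and supports iterative agentic interaction to evaluate planning, structural reasoning, and self-correction. Experiments show substantial capability gaps in current state-of-the-art VLMs, highlighting physical validity as a key evaluation dimension orthogonal to visual realism.

Details

Motivation: Current VLM evaluation is heavily skewed toward perceptual realism, focusing on generating visually plausible 3D layouts, shapes, and appearances. It rarely tests whether models understand the step-by-step processes and physical dependencies needed to actually build these artifacts, a capability essential for automating design-to-construction pipelines.

Result: Extensive experiments on DreamHouse reveal substantial capability gaps in current state-of-the-art VLMs that are largely invisible on existing leaderboards, indicating that physical generative reasoning is a distinct and underdeveloped frontier in multimodal intelligence.

Insight: The novelty is establishing physical validity as a key evaluation axis orthogonal to visual realism. By supporting iterative interaction, DreamHouse allows fine-grained evaluation of planning, structural reasoning, and self-correction.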

Abstract: The physical world is not merely visual; it is governed by rigorous structural and procedural constraints. Yet, the evaluation of vision-language models (VLMs) remains heavily skewed toward perceptual realism, prioritizing the generation of visually plausible 3D layouts, shapes, and appearances. Current benchmarks rarely test whether models grasp the step-by-step processes and physical dependencies required to actually build these artifacts, a capability essential for automating design-to-construction pipelines. To address this, we introduce DreamHouse, a novel benchmark for physical generative reasoning: the capacity to synthesize artifacts that concurrently satisfy geometric, structural, constructability, and code-compliance constraints. We ground this benchmark in residential timber-frame construction, a domain with fully codified engineering standards and objectively verifiable correctness. We curate over 26,000 structures spanning 13 architectural styles, each verified to construction-document standards (LOD 350), and develop a deterministic 10-test structural validation framework. Unlike static benchmarks that assess only final outputs, DreamHouse supports iterative agentic interaction. Models observe intermediate build states, generate construction actions, and receive structured environmental feedback, enabling a fine-grained evaluation of planning, structural reasoning, and self-correction. Extensive experiments with state-of-the-art VLMs reveal substantial capability gaps that are largely invisible on existing leaderboards. These findings establish physical validity as a critical evaluation axis orthogonal to visual realism, highlighting physical generative reasoning as a distinct and underdeveloped frontier in multimodal intelligence. Available at https://luluyuyuyang.github.io/dreamhouse


[128] FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol cs.AI | cs.CLPDF

Jie Zhu, Yimin Tian, Boyang Li, Kehao Wu, Zhongzhi Liang

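TL;DR: This paper presents FinMCP-Bench, a new benchmark for evaluating how well large language models solve problems in real-world financial scenarios by invoking financial Model Context Protocol (MCP) tools. The benchmark contains 613 samples covering 10 main scenarios and 33 sub-scenarios, mixes real and synthetic user queries, and includes three task types: single-tool, multi-tool, and multi-turn.

Details

Motivation: The goal is to create a standardized, practical, and challenging testbed that advances research on financial LLM agents and addresses the shortcomings of existing benchmarks in evaluating real financial tool invocation and complex reasoning.

Result: Using the benchmark, the authors systematically evaluate a range of mainstream LLMs and propose metrics that specifically measure tool-invocation accuracy and reasoning capability.

Insight: The innovation lies in building a comprehensive benchmark that focuses on the financial domain, is grounded in real financial MCPs, and stratifies tasks by complexity. It offers a new direction for evaluating and improving LLM tool use and multi-step reasoning in specialized domains.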

Abstract: This paper introduces FinMCP-Bench, a novel benchmark for evaluating large language models (LLMs) in solving real-world financial problems through tool invocation of financial model context protocols. FinMCP-Bench contains 613 samples spanning 10 main scenarios and 33 sub-scenarios, featuring both real and synthetic user queries to ensure diversity and authenticity. It incorporates 65 real financial MCPs and three types of samples, single-tool, multi-tool, and multi-turn, allowing evaluation of models across different levels of task complexity. Using this benchmark, we systematically assess a range of mainstream LLMs and propose metrics that explicitly measure tool invocation accuracy and reasoning capabilities. FinMCP-Bench provides a standardized, practical, and challenging testbed for advancing research on financial LLM agents.


[129] Can MLLMs Read Students’ Minds? Unpacking Multimodal Error Analysis in Handwritten Math cs.AI | cs.CL | cs.CVPDF

Dingjie Song, Tianlong Xu, Yi-Fan Zhang, Hang Li, Zhiling Yan

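TL;DR: This paper introduces ScratchMath, a new benchmark designed for explaining and classifying errors in students' handwritten mathematics scratchwork. The dataset contains 1,720 handwritten math samples from Chinese primary and middle school students, supports two tasks, Error Cause Explanation (ECE) and Error Cause Classification (ECC), and defines seven error types. The study systematically evaluates 16 leading multimodal large language models (MLLMs) and finds significant performance gaps relative to human experts in visual recognition and logical reasoning. Proprietary models outperform open-source models, and large reasoning models show potential for error explanation.

Details

Motivation: Existing educational NLP focuses mainly on textual responses and neglects the complexity and multimodality of authentic handwritten scratchwork. Current MLLMs typically adopt an "examinee perspective", prioritizing correct answers over diagnosing student errors. A benchmark and methodology for evaluating multimodal errors in handwritten scratchwork is needed to fill this gap.

Result: Sixteen leading MLLMs were evaluated on the ScratchMath benchmark. All models show significant performance gaps relative to human experts, especially in visual recognition and logical reasoning. Proprietary models (such as GPT-4V) clearly outperform open-source models, and large reasoning models perform relatively well on the error explanation task.

Insight: The innovation is the first multimodal benchmark (ScratchMath) focused on error analysis in handwritten math scratchwork, together with a systematic evaluation of MLLMs on this task. Viewed objectively, the study underscores the importance of shifting MLLMs from an "answerer" to a "diagnostician" perspective, and it provides a new research direction and data resource for personalized educational feedback.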

Abstract: Assessing student handwritten scratchwork is crucial for personalized educational feedback but presents unique challenges due to diverse handwriting, complex layouts, and varied problem-solving approaches. Existing educational NLP primarily focuses on textual responses and neglects the complexity and multimodality inherent in authentic handwritten scratchwork. Current multimodal large language models (MLLMs) excel at visual reasoning but typically adopt an “examinee perspective”, prioritizing generating correct answers rather than diagnosing student errors. To bridge these gaps, we introduce ScratchMath, a novel benchmark specifically designed for explaining and classifying errors in authentic handwritten mathematics scratchwork. Our dataset comprises 1,720 mathematics samples from Chinese primary and middle school students, supporting two key tasks: Error Cause Explanation (ECE) and Error Cause Classification (ECC), with seven defined error types. The dataset is meticulously annotated through rigorous human-machine collaborative approaches involving multiple stages of expert labeling, review, and verification. We systematically evaluate 16 leading MLLMs on ScratchMath, revealing significant performance gaps relative to human experts, especially in visual recognition and logical reasoning. Proprietary models notably outperform open-source models, with large reasoning models showing strong potential for error explanation. All evaluation data and frameworks are publicly available to facilitate further research.


[130] DAGverse: Building Document-Grounded Semantic DAGs from Scientific Papers cs.AI | cs.CLPDF

Shu Wan, Saketh Vishnubhatla, Iskander Kushbay, Tom Heffernan, Aaron Belikoff

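TL;DR: DAGverse is a framework for constructing document-grounded semantic directed acyclic graphs (DAGs) from scientific papers. Its core component, DAGverse-Pipeline, is a semi-automatic system that produces high-precision semantic DAG examples through figure classification, graph reconstruction, semantic grounding, and validation. Using causal DAGs as a case study, the authors release DAGverse-1, a dataset of 108 expert-validated semantic DAGs, and show that the pipeline outperforms existing vision-language models on DAG classification and annotation.

Details

Motivation: Real-world DAG datasets are scarce because constructing them traditionally requires expert interpretation of domain documents. Specific challenges include that a document may admit multiple abstractions, the structure is often implicit, and the evidence is scattered across prose, equations, captions, and figures.

Result: DAGverse-Pipeline outperforms existing vision-language models on DAG classification and annotation tasks. The authors release the DAGverse-1 dataset, which contains 108 expert-validated semantic DAGs with graph-level, node-level, and edge-level evidence.

Insight: Scientific papers that contain explicit DAG figures serve as a natural source of supervision: the figure provides the structure, and the text provides context and explanation. The proposed semi-automatic pipeline combines multimodal information (images and text) to build document-grounded semantic DAGs, opening new directions for research on structured reasoning grounded in real-world evidence.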

Abstract: Directed Acyclic Graphs (DAGs) are widely used to represent structured knowledge in scientific and technical domains. However, datasets for real-world DAGs remain scarce because constructing them typically requires expert interpretation of domain documents. We study Doc2SemDAG construction: recovering a preferred semantic DAG from a document together with the cited evidence and context that explain it. This problem is challenging because a document may admit multiple plausible abstractions, the intended structure is often implicit, and the supporting evidence is scattered across prose, equations, captions, and figures. To address these challenges, we leverage scientific papers containing explicit DAG figures as a natural source of supervision. In this setting, the DAG figure provides the DAG structure, while the accompanying text provides context and explanation. We introduce DAGverse, a framework for constructing document-grounded semantic DAGs from online scientific papers. Its core component, DAGverse-Pipeline, is a semi-automatic system designed to produce high-precision semantic DAG examples through figure classification, graph reconstruction, semantic grounding, and validation. As a case study, we test the framework for causal DAGs and release DAGverse-1, a dataset of 108 expert-validated semantic DAGs with graph-level, node-level, and edge-level evidence. Experiments show that DAGverse-Pipeline outperforms existing Vision-Language Models on DAG classification and annotation. DAGverse provides a foundation for document-grounded DAG benchmarks and opens new directions for studying structured reasoning grounded in real-world evidence.


[131] R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning cs.AI | cs.CVPDF

Zirui Zhang, Haoyu Dong, Kexin Pei, Chengzhi Mao

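TL;DR: This paper proposes R-C2, a framework that improves multimodal reasoning by using reinforcement learning to enforce cross-modal cycle consistency. The method treats contradictions between a model's predictions on visual and textual inputs as a learning signal. By requiring the model to perform backward inference, switch modalities, and reliably reconstruct the answer, it obtains a dense, label-free reward and aligns its internal representations autonomously.

Details

Motivation: Current multimodal models often produce contradictory predictions for visual and textual representations of the same concept, and standard voting mechanisms can amplify systematic biases. The paper aims to use cross-modal inconsistency, a naturally occurring signal, as a source of learning to strengthen robust perception and reasoning.

Result: Optimizing for this structure mitigates modality-specific errors and improves reasoning accuracy by up to 7.6 percentage points, indicating substantial gains on multimodal benchmarks.

Insight: The innovation is converting cross-modal inconsistency into a reinforcement learning reward and using a cycle-consistency constraint to push the model to align its internal representations autonomously. This gives multimodal learning a form of structured supervision that needs no human annotation, and it emphasizes that a structurally consistent understanding of the world matters for advanced reasoning, not only scaling data.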

Abstract: Robust perception and reasoning require consistency across sensory modalities. Yet current multimodal models often violate this principle, yielding contradictory predictions for visual and textual representations of the same concept. Rather than masking these failures with standard voting mechanisms, which can amplify systematic biases, we show that cross-modal inconsistency provides a rich and natural signal for learning. We introduce R-C2, a reinforcement learning framework that resolves internal conflicts by enforcing cross-modal cycle consistency. By requiring a model to perform backward inference, switch modalities, and reliably reconstruct the answer through forward inference, we obtain a dense, label-free reward. This cyclic constraint encourages the model to align its internal representations autonomously. Optimizing for this structure mitigates modality-specific errors and improves reasoning accuracy by up to 7.6 points. Our results suggest that advanced reasoning emerges not only from scaling data, but also from enforcing a structurally consistent understanding of the world.
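
The backward-inference, modality-switch, and forward-inference cycle described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the function names (`cycle_consistency_reward`, `toy_backward`, `toy_switch`, `toy_forward_consistent`, `toy_forward_inconsistent`), the string stand-ins for modalities, and the binary exact-match reward are hypothetical and are not taken from the paper, whose actual reward may be shaped differently.

```python
from typing import Callable


def cycle_consistency_reward(
    answer: str,
    backward: Callable[[str], str],
    switch_modality: Callable[[str], str],
    forward: Callable[[str], str],
) -> float:
    """Return 1.0 if the answer survives the backward -> switch -> forward cycle.

    No ground-truth label is used: the reward depends only on whether the
    model's own answer is reconstructed after the modality switch.
    """
    query = backward(answer)               # infer a query that leads to the answer
    query_other = switch_modality(query)   # re-express it in the other modality
    reconstructed = forward(query_other)   # answer the switched query
    return 1.0 if reconstructed == answer else 0.0


# Toy stand-ins (hypothetical): strings play the role of both modalities.
def toy_backward(answer: str) -> str:
    return {"4": "2 + 2", "9": "3 * 3"}[answer]


def toy_switch(query: str) -> str:
    return "[rendered image] " + query


def toy_forward_consistent(query_other: str) -> str:
    expression = query_other.replace("[rendered image] ", "")
    return {"2 + 2": "4", "3 * 3": "9"}[expression]


def toy_forward_inconsistent(query_other: str) -> str:
    return "5"  # a model that contradicts itself after the modality switch
```

In an RL loop, a reward of this kind would be computed per sampled answer and fed to the policy update; the sketch only shows how the cycle itself yields a label-free signal.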


cs.RO [Back]

[132] Integrating Deep RL and Bayesian Inference for ObjectNav in Mobile Robotics cs.RO | cs.AI | cs.CVPDF

João Castelo-Branco, José Santos-Victor, Alexandre Bernardino

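TL;DR: This paper proposes a hybrid object-search framework that combines Bayesian inference with deep reinforcement learning for autonomous object navigation by mobile robots in indoor environments. The method updates a spatial belief map over target locations online through Bayesian inference and trains a reinforcement learning policy to select navigation actions directly from this probabilistic representation, addressing partial observability, perceptual uncertainty, and the trade-off between exploration and navigation efficiency.

Details

Motivation: Autonomous object search by indoor mobile robots is difficult because of partial observability, perceptual uncertainty, and the trade-off between exploration and navigation efficiency. The paper seeks to overcome the reliance of classical probabilistic methods on handcrafted heuristics, as well as the slow convergence and limited interpretability of deep reinforcement learning methods.

Result: Evaluated in the Habitat 3.0 simulator, the method improves success rate and reduces search effort relative to baseline strategies across two indoor environments, indicating more efficient and reliable object search under partial observability.

Insight: The innovation is combining Bayesian belief estimation with learned action selection, letting the probabilistic representation drive the reinforcement learning policy directly. This enables adaptive and efficient navigation decisions under uncertainty and offers a reference for applying hybrid reasoning-learning frameworks to robotic tasks.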

Abstract: Autonomous object search is challenging for mobile robots operating in indoor environments due to partial observability, perceptual uncertainty, and the need to trade off exploration and navigation efficiency. Classical probabilistic approaches explicitly represent uncertainty but typically rely on handcrafted action-selection heuristics, while deep reinforcement learning enables adaptive policies but often suffers from slow convergence and limited interpretability. This paper proposes a hybrid object-search framework that integrates Bayesian inference with deep reinforcement learning. The method maintains a spatial belief map over target locations, updated online through Bayesian inference from calibrated object detections, and trains a reinforcement learning policy to select navigation actions directly from this probabilistic representation. The approach is evaluated in realistic indoor simulation using Habitat 3.0 and compared against developed baseline strategies. Across two indoor environments, the proposed method improves success rate while reducing search effort. Overall, the results support the value of combining Bayesian belief estimation with learned action selection to achieve more efficient and reliable object-search behavior under partial observability.
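
As a rough illustration of the belief-map update the abstract describes, the sketch below applies Bayes' rule on a small grid. The observation model is an assumption chosen for simplicity (the function name `update_belief` and the parameters `p_hit` and `p_false` are hypothetical, not the paper's): a visible cell that contains the target yields a detection with probability `p_hit`, and a detection reported at a cell that does not contain the target has probability `p_false`.

```python
import numpy as np


def update_belief(belief, visible, detected_cell=None, p_hit=0.9, p_false=0.05):
    """One Bayesian update of a grid belief over the target's location.

    belief:        2D array summing to 1, the prior P(target in cell).
    visible:       boolean 2D mask of cells in the current field of view.
    detected_cell: (row, col) of a reported detection, or None if none.
    p_hit:         assumed P(detection at cell | target in that visible cell).
    p_false:       assumed P(detection at cell | target is elsewhere).
    """
    belief = np.asarray(belief, dtype=float)
    visible = np.asarray(visible, dtype=bool)

    if detected_cell is None:
        # A miss is unlikely if the target sits in a visible cell,
        # and uninformative for cells outside the field of view.
        likelihood = np.where(visible, 1.0 - p_hit, 1.0)
    else:
        # A detection supports the reported cell; every other cell can only
        # explain it as a false positive.
        likelihood = np.full(belief.shape, p_false)
        likelihood[detected_cell] = p_hit

    posterior = likelihood * belief
    return posterior / posterior.sum()  # normalise so the belief sums to 1
```

A policy in the spirit of the paper would take the returned map as (part of) its observation; here the map is only updated, with no claim about how the authors encode it for the RL agent.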


[133] Can Users Specify Driving Speed? Bench2Drive-Speed: Benchmark and Baselines for Desired-Speed Conditioned Autonomous Driving cs.RO | cs.CVPDF

Yuqian Shao, Xiaosong Jia, Langechuan Liu, Junchi Yan

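TL;DR: This paper presents Bench2Drive-Speed, a benchmark for evaluating and training end-to-end autonomous driving policies that are conditioned on a user-specified desired speed (and on whether overtaking is allowed). It provides a complete framework of quantitative metrics, a dataset, and baseline models, and shows experimentally that re-annotating regular driving data is effective for speed-conditioned training, while noting that executing overtaking commands remains challenging.

Details

Motivation: Current end-to-end autonomous driving research overlooks the practical ability for users to customize the desired speed or overtaking preference. The paper aims to fill this gap so that driving policies can respond to users' personalized driving-style instructions.

Result: Experiments on the constructed CustomizedSpeedDataset (2,100 clips annotated with expert demonstrations) show that, with proper re-annotation, models trained on regular driving data follow speed instructions about as well as models trained on expert demonstrations. Following a target speed does not degrade regular driving performance, but performance on overtaking commands still needs improvement.

Insight: The innovation is the first systematic benchmark, dedicated dataset, and evaluation metrics (such as Speed-Adherence Score and Overtake Score) for autonomous driving conditioned on a user's desired speed. A key practical insight is that speed supervision can be introduced without costly expert-demonstration collection by relabeling existing regular driving data with the speed observed in future frames, which offers a scalable recipe for training conditioned policies. The study also shows that conditioned control of interactive behaviors such as overtaking is harder than simple speed following.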

Abstract: End-to-end autonomous driving (E2E-AD) has achieved remarkable progress. However, one practical and useful function has been long overlooked: users may wish to customize the desired speed of the policy or specify whether to allow the autonomous vehicle to overtake. To bridge this gap, we present Bench2Drive-Speed, a benchmark with metrics, dataset, and baselines for desired-speed conditioned autonomous driving. We introduce explicit inputs of users’ desired target-speed and overtake/follow instructions to driving policy models. We design quantitative metrics, including Speed-Adherence Score and Overtake Score, to measure how faithfully policies follow user specifications, while remaining compatible with standard autonomous driving metrics. To enable training of speed-conditioned policies, one approach is to collect expert demonstrations that strictly follow speed requirements, an expensive and unscalable process in the real world. An alternative is to adapt existing regular driving data by treating the speed observed in future frames as the target speed for training. To investigate this, we construct CustomizedSpeedDataset, composed of 2,100 clips annotated with expert demonstrations, enabling systematic investigation of supervision strategies. Our experiments show that, under proper re-annotation, models trained on regular driving data perform comparably to those trained on expert demonstrations, suggesting that speed supervision can be introduced without additional complex real-world data collection. Furthermore, we find that while target-speed following can be achieved without degrading regular driving performance, executing overtaking commands remains challenging due to the inherent difficulty of interactive behaviors. All code, datasets and baselines are available at https://github.com/Thinklab-SJTU/Bench2Drive-Speed
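
The re-annotation idea in the abstract, treating the speed observed in future frames as the training target, can be sketched as follows. The specific rule used here (the mean speed over the next `horizon` frames) and the function name `relabel_target_speed` are assumptions for illustration; the abstract does not state how the authors aggregate future speeds.

```python
import numpy as np


def relabel_target_speed(speeds, horizon):
    """Derive a pseudo target speed per frame from logged driving data.

    speeds:  1D sequence of ego speeds, one per frame.
    horizon: number of future frames to average over (an assumed choice).

    Frame t gets the mean speed of frames t+1 .. t+horizon. Frames without a
    full future window are left as NaN so they can be dropped from training.
    """
    speeds = np.asarray(speeds, dtype=float)
    targets = np.full(len(speeds), np.nan)
    for t in range(len(speeds) - horizon):
        targets[t] = speeds[t + 1 : t + 1 + horizon].mean()
    return targets
```

For example, `relabel_target_speed([0, 2, 4, 6, 8], horizon=2)` labels the first three frames with 3.0, 5.0, and 7.0 and leaves the last two unlabeled.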


[134] Persistent Robot World Models: Stabilizing Multi-Step Rollouts via Reinforcement Learning cs.RO | cs.CVPDF

Jai Bardhan, Patrik Drozdik, Josef Sivic, Vladimir Petrik

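TL;DR: This paper proposes a reinforcement learning post-training scheme that stabilizes multi-step autoregressive rollouts of robot world models, addressing the rapid degradation of visual quality that existing models suffer as errors accumulate during autoregressive rollout.

Details

Motivation: Action-conditioned robot world models are unstable under multi-step autoregressive rollout: prediction errors accumulate and visual quality collapses quickly.

Result: On the DROID dataset, the method sets a new state of the art for rollout fidelity, outperforming the strongest baseline on all metrics (e.g., LPIPS reduced by 14% on external cameras, SSIM improved by 9.1% on the wrist camera), winning 98% of paired comparisons, and achieving an 80% preference rate in a blind human study.

Insight: The innovations are: 1) adapting a contrastive reinforcement learning objective to diffusion models so the world model can be post-trained on its own autoregressive rollouts; 2) a training protocol that generates and compares multiple candidate variable-length futures from the same rollout state; and 3) efficient multi-view visual fidelity rewards that combine complementary perceptual metrics across camera views.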

Abstract: Action-conditioned robot world models generate future video frames of the manipulated scene given a robot action sequence, offering a promising alternative for simulating tasks that are difficult to model with traditional physics engines. However, these models are optimized for short-term prediction and break down when deployed autoregressively: each predicted clip feeds back as context for the next, causing errors to compound and visual quality to rapidly degrade. We address this through the following contributions. First, we introduce a reinforcement learning (RL) post-training scheme that trains the world model on its own autoregressive rollouts rather than on ground-truth histories. We achieve this by adapting a recent contrastive RL objective for diffusion models to our setting and show that its convergence guarantees carry over exactly. Second, we design a training protocol that generates and compares multiple candidate variable-length futures from the same rollout state, reinforcing higher-fidelity predictions over lower-fidelity ones. Third, we develop efficient, multi-view visual fidelity rewards that combine complementary perceptual metrics across camera views and are aggregated at the clip level for dense, low-variance training signal. Fourth, we show that our approach establishes a new state-of-the-art for rollout fidelity on the DROID dataset, outperforming the strongest baseline on all metrics (e.g., LPIPS reduced by 14% on external cameras, SSIM improved by 9.1% on the wrist camera), winning 98% of paired comparisons, and achieving an 80% preference rate in a blind human study.


[135] Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving cs.RO | cs.AI | cs.CV | cs.LG | cs.MAPDF

Zehao Wang, Huaide Jiang, Shuaiwu Dong, Yuping Wang, Hang Qiu

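TL;DR: This paper proposes Drive My Way (DMW), a personalized Vision-Language-Action (VLA) driving framework that uses user embeddings and natural language instructions so that an autonomous driving system can adapt to both a driver's long-term habits and real-time intent.

Details

Motivation: Existing end-to-end autonomous driving systems either optimize for generic objectives or rely on fixed driving modes, and they lack the ability to adapt to individual preferences or interpret natural language intent, even though human driving behavior is inherently personal.

Result: Closed-loop evaluation on the Bench2Drive benchmark shows that DMW improves style instruction adaptation, and user studies show that its generated behaviors are recognizable as each driver's own style.

Insight: The innovation is combining a user embedding (learned from real data collected across multiple drivers) with natural language instructions to align with both long-term habits and short-term intent, providing a key personalization capability for human-centered autonomous driving.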

Abstract: Human driving behavior is inherently personal, which is shaped by long-term habits and influenced by short-term intentions. Individuals differ in how they accelerate, brake, merge, yield, and overtake across diverse situations. However, existing end-to-end autonomous driving systems either optimize for generic objectives or rely on fixed driving modes, lacking the ability to adapt to individual preferences or interpret natural language intent. To address this gap, we propose Drive My Way (DMW), a personalized Vision-Language-Action (VLA) driving framework that aligns with users’ long-term driving habits and adapts to real-time user instructions. DMW learns a user embedding from our personalized driving dataset collected across multiple real drivers and conditions the policy on this embedding during planning, while natural language instructions provide additional short-term guidance. Closed-loop evaluation on the Bench2Drive benchmark demonstrates that DMW improves style instruction adaptation, and user studies show that its generated behaviors are recognizable as each driver’s own style, highlighting personalization as a key capability for human-centered autonomous driving. Our data and code are available at https://dmw-cvpr.github.io/.


eess.IV [Back]

[136] Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos eess.IV | cs.CV | cs.HCPDF

Abdullah Hamdi, Changchun Yang, Xin Gao

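TL;DR: This paper presents Colon-Bench, a densely annotated lesion dataset for full-procedure colonoscopy videos, generated by a novel multi-stage agentic workflow to address the lack of densely annotated long-sequence video data in this field. The dataset is unprecedented in scale, covers multiple lesion categories with rich annotations, and is used to evaluate multimodal large language models on lesion classification, open-vocabulary video object segmentation, and video visual question answering.

Details

Motivation: Early colonoscopy screening is critical for preventing colon cancer, but the field lacks densely annotated long-sequence video datasets, which hinders the development of robust AI systems. Existing datasets focus mainly on single-class polyp detection and lack the spatial, temporal, and linguistic annotations needed to evaluate modern multimodal large language models.

Result: Evaluating state-of-the-art multimodal large language models on Colon-Bench shows surprisingly high localization performance in the medical domain compared with SAM-3. From an analysis of common VQA errors, the authors propose a novel "colon-skill" prompting strategy that improves zero-shot MLLM performance by up to 9.7% across most models.

Insight: The innovation is a scalable multi-stage agentic workflow for generating large-scale, multi-class, densely annotated medical video datasets. The workflow combines temporal proposals, bounding-box tracking, AI-driven visual confirmation, and human-in-the-loop review, offering a new data-construction paradigm for medical video analysis. The proposed "colon-skill" prompting strategy also demonstrates the value of optimizing MLLM prompts for a specific domain.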

Abstract: Early screening via colonoscopy is critical for colon cancer prevention, yet developing robust AI systems for this domain is hindered by the lack of densely annotated, long-sequence video datasets. Existing datasets predominantly focus on single-class polyp detection and lack the rich spatial, temporal, and linguistic annotations required to evaluate modern Multimodal Large Language Models (MLLMs). To address this critical gap, we introduce Colon-Bench, generated via a novel multi-stage agentic workflow. Our pipeline seamlessly integrates temporal proposals, bounding-box tracking, AI-driven visual confirmation, and human-in-the-loop review to scalably annotate full-procedure videos. The resulting verified benchmark is unprecedented in scope, encompassing 528 videos, 14 distinct lesion categories (including polyps, ulcers, and bleeding), over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical descriptions. We utilize Colon-Bench to rigorously evaluate state-of-the-art MLLMs across lesion classification, Open-Vocabulary Video Object Segmentation (OV-VOS), and video Visual Question Answering (VQA). The MLLM results demonstrate surprisingly high localization performance in medical domains compared to SAM-3. Finally, we analyze common VQA errors from MLLMs to introduce a novel “colon-skill” prompting strategy, improving zero-shot MLLM performance by up to 9.7% across most MLLMs. The dataset and the code are available at https://abdullahamdi.com/colon-bench .


cs.HC [Back]

[137] Gaze patterns predict preference and confidence in pairwise AI image evaluation cs.HC | cs.AI | cs.CV | cs.CYPDF

Nikolas Papadopoulos, Shreenithi Navaneethan, Sheng Bai, Ankur Samanta, Paul Sajda

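TL;DR: This study uses eye-tracking to investigate how human preferences form during pairwise evaluation of AI-generated images, and finds that gaze patterns predict both the choice made and the confidence of the decision.

Details

Motivation: The study aims to reveal the cognitive processes behind human preference judgments in pairwise evaluation of AI-generated images, in order to understand the mechanisms underlying the human judgments that preference learning methods such as RLHF and DPO rely on.

Result: Across 1,800 trials, gaze features predicted binary choice with 68% accuracy and distinguished high-confidence from uncertain decisions with 66% accuracy. The study also replicated the gaze cascade effect, in which gaze shifts toward the chosen image about one second before the decision.

Insight: Eye-tracking can serve as a tool for obtaining implicit signals related to the quality of preference annotations. Gaze dynamics (dwell time, fixation count, revisits, and the rate of switching between images) are closely tied to preference and confidence, offering a new perspective on improving data collection and annotation for preference learning.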

Abstract: Preference learning methods, such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), rely on pairwise human judgments, yet little is known about the cognitive processes underlying these judgments. We investigate whether eye-tracking can reveal preference formation during pairwise AI-generated image evaluation. Thirty participants completed 1,800 trials while their gaze was recorded. We replicated the gaze cascade effect, with gaze shifting toward chosen images approximately one second before the decision. Cascade dynamics were consistent across confidence levels. Gaze features predicted binary choice (68% accuracy), with chosen images receiving more dwell time, fixations, and revisits. Gaze transitions distinguished high-confidence from uncertain decisions (66% accuracy), with low-confidence trials showing more image switches per second. These results show that gaze patterns predict both choice and confidence in pairwise image evaluations, suggesting that eye-tracking provides implicit signals relevant to the quality of preference annotations.
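
The gaze features named in the abstract (dwell time, fixations, revisits, and image switches per second) can be computed from a fixation sequence roughly as sketched below. The input format, the function name `gaze_features`, and the definition of a revisit as a return to an image after looking away are assumptions for illustration; the study's exact feature definitions are not given in the abstract.

```python
def gaze_features(fixations, trial_duration):
    """Summarise one trial's gaze on two side-by-side images.

    fixations:      list of (image_label, duration_seconds) in temporal order.
    trial_duration: total trial length in seconds.

    Returns per-image dwell time, fixation count, and revisit count, plus the
    number of image-to-image switches per second.
    """
    dwell, counts, visits = {}, {}, {}
    switches = 0
    previous = None

    for label, duration in fixations:
        dwell[label] = dwell.get(label, 0.0) + duration
        counts[label] = counts.get(label, 0) + 1
        if label != previous:
            # A new visit starts whenever gaze lands on a different image.
            visits[label] = visits.get(label, 0) + 1
            if previous is not None:
                switches += 1
            previous = label

    # Revisits are the visits to an image beyond the first one.
    revisits = {label: n - 1 for label, n in visits.items()}
    return {
        "dwell": dwell,
        "fixations": counts,
        "revisits": revisits,
        "switches_per_second": switches / trial_duration,
    }
```

Features like these could then be passed to any standard classifier to predict choice or confidence; the abstract does not specify which model the authors used.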