cs.CL

[1] Detecting AI-Generated Essays in Writing Assessment: Responsible Use and Generalizability Across LLMs cs.CL

Jiangang Hao

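TL;DR: This chapter first surveys the current landscape of detectors for AI-generated and AI-assisted essays, along with guidelines for their responsible use. It then empirically evaluates how well detectors trained on essays from one LLM generalize to essays produced by other LLMs, offering guidance for developing and retraining detectors in practice.

Details

Motivation: Rapid advances in large language models (LLMs) have made it easy to generate coherent, high-quality essays, raising concerns about the authenticity of student submissions. This calls for research on methods for detecting AI-generated essays and on how well those methods generalize.

Result: An empirical analysis of essays generated in response to public GRE writing prompts evaluates how detectors generalize across LLMs and yields guidance for detector development and retraining in practice.

Insight: The contribution is a systematic evaluation of detector generalization across LLMs, combined with an emphasis on responsible-use guidelines. Together these provide empirical evidence and adaptation advice for deploying AI detection tools in educational assessment.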

Abstract: Writing is a foundational literacy skill that underpins effective communication, fosters critical thinking, facilitates learning across disciplines, and enables individuals to organize and articulate complex ideas. Consequently, writing assessment plays a vital role in evaluating language proficiency, communicative effectiveness, and analytical reasoning. The rapid advancement of large language models (LLMs) has made it increasingly easy to generate coherent, high-quality essays, raising significant concerns about the authenticity of student-submitted work. This chapter first provides an overview of the current landscape of detectors for AI-generated and AI-assisted essays, along with guidelines for their responsible use. It then presents empirical analyses to evaluate how well detectors trained on essays from one LLM generalize to identifying essays produced by other LLMs, based on essays generated in response to public GRE writing prompts. These findings provide guidance for developing and retraining detectors for practical applications.


[2] Think, But Don’t Overthink: Reproducing Recursive Language Models cs.CL

Daren Wang

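TL;DR: This study reproduces and extends the Recursive Language Models (RLM) framework of Zhang et al. (2026), which lets large language models process near-infinite contexts by offloading the prompt into an external REPL environment. It focuses on the effect of scaling recursion depth, evaluating a pure LLM, RLM (depth=1), and RLM (depth=2) on the S-NIAH and OOLONG benchmarks with the open-source agentic models DeepSeek v3.2 and Kimi K2.

Details

Motivation: The original RLM framework defaults to a recursion depth of 1 and names deeper recursion as a future direction. This study investigates how scaling the recursion depth affects model performance and whether deeper recursion actually helps.

Result: On S-NIAH and OOLONG, depth-1 RLMs effectively improve accuracy on complex reasoning tasks. Applying deeper recursion (depth=2), or using RLMs on simple retrieval tasks, instead degrades performance and exponentially inflates execution time (e.g., from 3.6s to 344.5s) and token costs.

Insight: The paper identifies an "overthinking" phenomenon: deeper recursion can lower performance while sharply raising compute cost. For practical use of RLMs this is an important trade-off: deeper is not always better, and the recursion depth should be matched to task complexity.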

Abstract: This project reproduces and extends the recently proposed "Recursive Language Models" (RLMs) framework by Zhang et al. (2026). This framework enables Large Language Models (LLMs) to process near-infinite contexts by offloading the prompt into an external REPL environment. While the original paper relies on a default recursion depth of 1 and suggests deeper recursion as a future direction, this study specifically investigates the impact of scaling the recursion depth. Using state-of-the-art open-source agentic models (DeepSeek v3.2 and Kimi K2), I evaluated pure LLM, RLM (depth=1), and RLM (depth=2) on the S-NIAH and OOLONG benchmarks. The findings reveal a compelling phenomenon: deeper recursion causes models to "overthink". While depth-1 RLMs effectively boost accuracy on complex reasoning tasks, applying deeper recursion (depth=2) or using RLMs on simple retrieval tasks paradoxically degrades performance and exponentially inflates execution time (e.g., from 3.6s to 344.5s) and token costs. Code and data are available at: https://github.com/drbillwang/rlm-reproduction


[3] Real-Time Generation of Game Video Commentary with Multimodal LLMs: Pause-Aware Decoding Approaches cs.CL | cs.AI

Anum Afzal, Yuki Saito, Hiroya Takamura, Katsuhito Sudoh, Shinnosuke Takamichi

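TL;DR: This paper studies real-time game video commentary generation with multimodal large language models (MLLMs), focusing on two key decisions: what to say and when to say it. It proposes two prompting-based decoding strategies: a fixed-interval approach and a novel dynamic interval-based approach that adjusts the timing of the next prediction according to the estimated duration of the previous utterance, aiming at pause-aware generation without fine-tuning. Experiments on Japanese and English datasets of racing and fighting games show that dynamic interval-based decoding, using prompting alone, produces commentary closer to human commentary in both timing and content.

Details

Motivation: Real-time video commentary generation supports accessibility and engagement in domains such as sports, esports, and livestreaming. Existing prompting-based MLLM approaches perform well on content generation but largely ignore the decision of when to speak. This paper asks whether in-context prompting alone can support real-time commentary that is both semantically relevant and well-timed.

Result: Experiments on Japanese and English datasets of racing and fighting games show that the proposed dynamic interval-based decoding generates commentary closer to human commentary in utterance timing and content, outperforming the fixed-interval approach. The authors release a multilingual benchmark dataset, trained models, and implementations.

Insight: The core contribution is a dynamic interval-based decoding method that perceives and controls commentary timing through prompting and the decoding strategy alone, with no model fine-tuning. It offers a lightweight and effective answer to temporal control in real-time multimodal generation by folding the "when to speak" decision into a prompting-based generation framework.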

Abstract: Real-time video commentary generation provides textual descriptions of ongoing events in videos. It supports accessibility and engagement in domains such as sports, esports, and livestreaming. Commentary generation involves two essential decisions: what to say and when to say it. While recent prompting-based approaches using multimodal large language models (MLLMs) have shown strong performance in content generation, they largely ignore the timing aspect. We investigate whether in-context prompting alone can support real-time commentary generation that is both semantically relevant and well-timed. We propose two prompting-based decoding strategies: 1) a fixed-interval approach, and 2) a novel dynamic interval-based decoding approach that adjusts the next prediction timing based on the estimated duration of the previous utterance. Both methods enable pause-aware generation without any fine-tuning. Experiments on Japanese and English datasets of racing and fighting games show that the dynamic interval-based decoding can generate commentary more closely aligned with human utterance timing and content using prompting alone. We release a multilingual benchmark dataset, trained models, and implementations to support future research on real-time video commentary generation.
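The dynamic interval idea described above can be sketched as a scheduling loop. This is a minimal illustration, not the authors' implementation: `generate` stands in for a hypothetical MLLM call, and the words-per-second duration estimate and the `min_gap` fallback are assumptions made for the example (whitespace word counts would not suit Japanese text).

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    start: float  # seconds from the start of the video
    text: str


def estimate_duration(text: str, words_per_second: float = 2.5) -> float:
    """Rough speaking-time estimate; the rate is an illustrative assumption."""
    return len(text.split()) / words_per_second


def schedule_commentary(generate, video_length: float, min_gap: float = 0.5):
    """Call generate(t) at times spaced by the previous utterance's estimated duration."""
    t, out = 0.0, []
    while t < video_length:
        text = generate(t)  # hypothetical MLLM call on the frames seen up to time t
        out.append(Utterance(t, text))
        # Wait until the previous utterance would have finished before predicting
        # again; an empty output still advances the clock by min_gap.
        t += max(estimate_duration(text), min_gap)
    return out
```

A fixed-interval baseline would replace the last line of the loop with a constant step, which is the contrast the paper draws between its two strategies.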


[4] Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory cs.CL | cs.CV

Shunki Uebayashi, Kento Masui, Kyohei Atarashi, Han Bao, Hisashi Kashima

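TL;DR: This paper proposes a multi-modal and multidimensional item response theory framework (M3IRT) for evaluating the cross-modal reasoning ability of multimodal large language models (MLLMs). By decomposing both model ability and item difficulty into image-only, text-only, and cross-modal components, the framework identifies and prioritizes high-quality questions that genuinely require cross-modal reasoning, enabling more compact and more reliable benchmarks.

Details

Motivation: Current multimodal benchmarks contain many "shortcut" questions that can be solved with a single modality, which makes evaluation of cross-modal integration both unreliable and inefficient.

Result: An evaluation of 24 vision-language models (VLMs) on three benchmarks shows that M3IRT effectively selects genuinely cross-modal questions and preserves ranking fidelity even when 50% of the benchmark consists of artificially generated low-quality questions, reducing evaluation cost while improving reliability.

Insight: The contribution is the extension of classical item response theory (IRT) to the multimodal setting. Decomposing ability and difficulty along multiple dimensions gives a theoretical tool for quantifying cross-modal reasoning ability and question quality, and supports building more efficient and more precise multimodal benchmarks.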

Abstract: Multimodal Large Language Models (MLLMs) have recently emerged as general architectures capable of reasoning over diverse modalities. Benchmarks for MLLMs should measure their ability for cross-modal integration. However, current benchmarks are filled with shortcut questions, which can be solved using only a single modality, thereby yielding unreliable rankings. For example, in vision-language cases, we can find the correct answer without either the image or the text. These low-quality questions unnecessarily increase the size and computational requirements of benchmarks. We introduce a multi-modal and multidimensional item response theory framework (M3IRT) that extends classical IRT by decomposing both model ability and item difficulty into image-only, text-only, and cross-modal components. M3IRT estimates cross-modal ability of MLLMs and each question’s cross-modal difficulty, enabling compact, high-quality subsets that better reflect multimodal reasoning. Across 24 VLMs on three benchmarks, M3IRT prioritizes genuinely cross-modal questions over shortcuts and preserves ranking fidelity even when 50% of items are artificially generated low-quality questions, thereby reducing evaluation cost while improving reliability. M3IRT thus offers a practical tool for assessing cross-modal reasoning and refining multimodal benchmarks.
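To make the decomposition concrete, the sketch below uses a simple additive logistic model over the three components and ranks items by their cross-modal difficulty. The abstract does not give M3IRT's actual parameterization or selection rule, so the additive form, the function names, and the ranking criterion are all assumptions for illustration only.

```python
import math

COMPONENTS = ("image", "text", "cross")


def p_correct(ability: dict, difficulty: dict) -> float:
    """Probability of a correct answer under an assumed additive logistic model.

    Each component contributes (ability - difficulty) to the logit; this is an
    illustrative stand-in, not the model defined in the paper.
    """
    logit = sum(ability[c] - difficulty[c] for c in COMPONENTS)
    return 1.0 / (1.0 + math.exp(-logit))


def select_cross_modal_items(difficulties: dict, k: int) -> list:
    """Keep the k items with the largest cross-modal difficulty (assumed criterion)."""
    ranked = sorted(difficulties, key=lambda name: difficulties[name]["cross"], reverse=True)
    return ranked[:k]


items = {
    "q1": {"image": 0.2, "text": 1.5, "cross": 0.1},  # mostly solvable from text alone
    "q2": {"image": 0.1, "text": 0.1, "cross": 2.0},  # needs both modalities
    "q3": {"image": 0.8, "text": 0.3, "cross": 1.0},
}
compact_subset = select_cross_modal_items(items, k=2)
```

In a real pipeline the ability and difficulty parameters would be fitted from the response matrix of models against items rather than written by hand.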


[5] ITLC at SemEval-2026 Task 11: Normalization and Deterministic Parsing for Formal Reasoning in LLMs cs.CL | cs.AI

Wicaksono Leksono Muhamad, Joanito Agili Lopo, Tack Hwa Wong, Muhammad Ravi Shulthan Habibi, Samuel Cahyawijaya

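TL;DR: This paper proposes a new method that reduces content bias in large language models on multilingual reasoning tasks through structural abstraction and deterministic parsing. The method ranks highly on the SemEval-2026 Task 11 benchmark.

Details

Motivation: Large language models are subject to content effects in multilingual reasoning tasks: they are more sensitive to the content of an argument than to its logical structure.

Result: The approach places in the top five on every subtask of the SemEval-2026 Task 11 multilingual benchmark, substantially reduces content effects, and performs on par with complex fine-tuning or activation-level interventions.

Insight: The contribution is to abstract away content by converting syllogisms into canonical logical representations and deciding validity with deterministic parsing. This reduces content bias and offers a lightweight, effective alternative for improving LLM reasoning.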

Abstract: Large language models suffer from content effects in reasoning tasks, particularly in multi-lingual contexts. We introduce a novel method that reduces these biases through explicit structural abstraction that transforms syllogisms into canonical logical representations and applies deterministic parsing to determine validity. Evaluated on the SemEval-2026 Task 11 multilingual benchmark, our approach achieves top-5 rankings across all subtasks while substantially reducing content effects and offering a competitive alternative to complex fine-tuning or activation-level interventions.
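Once a syllogism is in canonical form, validity can be decided without any language model. The sketch below assumes a canonical form of `(quantifier, subject, predicate)` triples with the classical A/E/I/O quantifiers and checks validity by enumerating every Venn-diagram model. The representation, the function names, and the choice of modern semantics (no existential import) are assumptions for illustration; the paper's own parser may work differently.

```python
from itertools import product

TERMS = ("S", "M", "P")
# Each region is one combination of membership in the three terms (a Venn-diagram cell).
REGIONS = list(product([False, True], repeat=3))


def holds(stmt, inhabited):
    """Evaluate (quantifier, x, y) in a model given as the list of inhabited regions."""
    q, x, y = stmt
    i, j = TERMS.index(x), TERMS.index(y)
    some_x_y = any(r[i] and r[j] for r in inhabited)
    some_x_not_y = any(r[i] and not r[j] for r in inhabited)
    return {
        "A": not some_x_not_y,  # All x are y
        "E": not some_x_y,      # No x are y
        "I": some_x_y,          # Some x are y
        "O": some_x_not_y,      # Some x are not y
    }[q]


def is_valid(premises, conclusion):
    """Valid iff no model satisfies all premises while falsifying the conclusion."""
    for mask in product([False, True], repeat=len(REGIONS)):
        model = [r for r, keep in zip(REGIONS, mask) if keep]
        if all(holds(p, model) for p in premises) and not holds(conclusion, model):
            return False
    return True


# "All M are P; all S are M; therefore all S are P" regardless of what S, M, P name.
barbara = is_valid([("A", "M", "P"), ("A", "S", "M")], ("A", "S", "P"))
```

Because the check only sees the letters S, M, and P, the believability of the original sentences cannot influence the verdict, which is the point of abstracting structure from content.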


[6] HateMirage: An Explainable Multi-Dimensional Dataset for Decoding Faux Hate and Subtle Online Abuse cs.CL | cs.SI

Sai Kartheek Reddy Kasu, Shankar Biradar, Sunil Saumya, Md. Shad Akhtar

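TL;DR: This paper introduces HateMirage, a new explainable, multi-dimensional dataset for decoding Faux Hate and subtle online abuse. It contains 4,530 user comments, each annotated along three dimensions (Target, Intent, and Implication), and addresses the fact that existing hate speech datasets mainly capture overt toxicity while overlooking subtle hate embedded in false or distorted narratives.

Details

Motivation: Existing hate speech datasets focus on overt toxicity and underrepresent the subtle ways in which misinformation can incite or normalize hate. A dataset is needed that supports reasoning and explainability research on hate arising from fake or distorted narratives.

Result: Multiple open-source language models are benchmarked on HateMirage, with ROUGE-L F1 and Sentence-BERT similarity used to assess explanation coherence. The results suggest that explanation quality may depend more on pretraining diversity and reasoning-oriented data than on model scale alone.

Insight: The contribution is a multi-dimensional explanation framework that couples misinformation reasoning with harm attribution, going beyond the token-level or single-dimensional reasoning of earlier datasets such as HateXplain and HARE. It establishes a new benchmark for interpretable hate detection and responsible AI research.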

Abstract: Subtle and indirect hate speech remains an underexplored challenge in online safety research, particularly when harmful intent is embedded within misleading or manipulative narratives. Existing hate speech datasets primarily capture overt toxicity, underrepresenting the nuanced ways misinformation can incite or normalize hate. To address this gap, we present HateMirage, a novel dataset of Faux Hate comments designed to advance reasoning and explainability research on hate emerging from fake or distorted narratives. The dataset was constructed by identifying widely debunked misinformation claims from fact-checking sources and tracing related YouTube discussions, resulting in 4,530 user comments. Each comment is annotated along three interpretable dimensions: Target (who is affected), Intent (the underlying motivation or goal behind the comment), and Implication (its potential social impact). Unlike prior explainability datasets such as HateXplain and HARE, which offer token-level or single-dimensional reasoning, HateMirage introduces a multi-dimensional explanation framework that captures the interplay between misinformation, harm, and social consequence. We benchmark multiple open-source language models on HateMirage using ROUGE-L F1 and Sentence-BERT similarity to assess explanation coherence. Results suggest that explanation quality may depend more on pretraining diversity and reasoning-oriented data rather than on model scale alone. By coupling misinformation reasoning with harm attribution, HateMirage establishes a new benchmark for interpretable hate detection and responsible AI research.
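For readers unfamiliar with the first metric, ROUGE-L F1 is the harmonic mean of precision and recall computed from the longest common subsequence (LCS) of reference and candidate tokens. The sketch below is a simplified version that assumes lowercase whitespace tokenization; published evaluations typically use a library implementation with its own tokenization and optional stemming, so scores may differ slightly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 with naive whitespace tokenization (a simplifying assumption)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
```

Because the metric rewards shared word order rather than meaning, pairing it with an embedding-based measure such as Sentence-BERT similarity, as the paper does, covers paraphrased explanations that ROUGE-L would undercount.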


[7] Graph-GRPO: Stabilizing Multi-Agent Topology Learning via Group Relative Policy Optimization cs.CL

Yueyang Cang, Xiaoteng Zhang, Erlu Zhao, Zehua Ji, Yuhang Liu

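TL;DR: This paper proposes Graph-GRPO, a new framework for optimizing communication topology in multi-agent systems that stabilizes learning by integrating Group Relative Policy Optimization. For each query it samples a group of diverse communication graphs and computes the advantage of specific edges from their relative performance within the group, which mitigates the gradient variance and credit assignment problems that task difficulty differences cause in traditional methods. Experiments on reasoning and code generation benchmarks show that Graph-GRPO significantly outperforms existing state-of-the-art methods in training stability and performance.

Details

Motivation: Existing reinforcement learning approaches to communication topology optimization in multi-agent systems usually rely on single-sample policy gradients with absolute rewards (e.g., binary correctness). This leads to severe gradient variance and credit assignment problems: simple queries can yield uninformative positive rewards for suboptimal structures, while difficult queries often fail and provide no learning signal.

Result: Extensive experiments on reasoning and code generation benchmarks show that Graph-GRPO significantly outperforms state-of-the-art baselines, achieves superior training stability, and identifies critical communication pathways previously obscured by reward noise.

Insight: The contribution is to bring Group Relative Policy Optimization to topology learning: sampling a group of communication graphs and normalizing rewards by relative performance within the group mitigates the noise caused by task difficulty variance and enables fine-grained credit assignment. Viewed objectively, the method extends single-sample evaluation to group comparison, improving the stability and efficiency of topology learning.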

Abstract: Optimizing communication topology is fundamental to the efficiency and effectiveness of Large Language Model (LLM)-based Multi-Agent Systems (MAS). While recent approaches utilize reinforcement learning to dynamically construct task-specific graphs, they typically rely on single-sample policy gradients with absolute rewards (e.g., binary correctness). This paradigm suffers from severe gradient variance and the credit assignment problem: simple queries yield non-informative positive rewards for suboptimal structures, while difficult queries often result in failures that provide no learning signal. To address these challenges, we propose Graph-GRPO, a novel topology optimization framework that integrates Group Relative Policy Optimization. Instead of evaluating a single topology in isolation, Graph-GRPO samples a group of diverse communication graphs for each query and computes the advantage of specific edges based on their relative performance within the group. By normalizing rewards across the sampled group, our method effectively mitigates the noise derived from task difficulty variance and enables fine-grained credit assignment. Extensive experiments on reasoning and code generation benchmarks demonstrate that Graph-GRPO significantly outperforms state-of-the-art baselines, achieving superior training stability and identifying critical communication pathways previously obscured by reward noise.
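The group normalization step can be illustrated in a few lines. In this sketch each sampled graph's reward is standardized within its group, and an edge's advantage is taken as the mean normalized advantage of the graphs that contain it. The abstract does not specify the edge-level aggregation rule, so that averaging and the function names are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def edge_advantages(graphs, rewards):
    """Assumed aggregation: average the advantage of every sampled graph containing the edge.

    graphs: list of sets of (src, dst) edges; rewards: one scalar per graph.
    """
    adv = group_relative_advantages(rewards)
    per_edge = {}
    for edges, a in zip(graphs, adv):
        for e in edges:
            per_edge.setdefault(e, []).append(a)
    return {e: mean(vals) for e, vals in per_edge.items()}


graphs = [
    {("planner", "coder")},
    {("planner", "critic")},
    {("planner", "critic")},
    {("planner", "coder"), ("coder", "critic")},
]
rewards = [1, 0, 0, 1]  # binary correctness for each sampled topology
adv_by_edge = edge_advantages(graphs, rewards)
```

When every graph in a group receives the same reward, for example an easy query that all topologies solve, the normalized advantages are all zero, so no topology is reinforced by an uninformative success.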


[8] OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets cs.CL | cs.AI

Jiyuan Shen, Peiyue Yuan, Atin Ghosh, Yifan Mai, Daniel Dahlmeier

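TL;DR: This paper benchmarks multimodal large language models (MLLMs) at scale on business-document information extraction, compares image-only input with OCR-enhanced input, and proposes an LLM-based automated hierarchical error analysis framework to diagnose error patterns systematically. It finds that OCR may not be necessary for powerful MLLMs, since image-only input can match OCR-enhanced approaches, and that carefully designed schema, exemplars, and instructions further improve MLLM performance.

Details

Motivation: The goal is to clarify the actual impact of MLLMs on document information extraction, and in particular whether an MLLM-only pipeline (image input alone) can match a traditional OCR+MLLM pipeline, which would simplify existing document processing workflows.

Result: Benchmarking on real-world, large-scale business-document datasets shows that image-only MLLMs can reach performance comparable to OCR-enhanced approaches, and that carefully designed schema, exemplars, and instructions improve MLLM performance further.

Insight: The contribution is a framework that uses LLMs for automated hierarchical error analysis to systematically diagnose MLLM error patterns in document information extraction. Viewed objectively, the study challenges the necessity of traditional OCR in document processing and provides empirical evidence and optimization directions for simplifying end-to-end extraction pipelines.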

Abstract: Multimodal Large Language Models (MLLMs) enhance the potential of natural language processing. However, their actual impact on document information extraction remains unclear. In particular, it is unclear whether an MLLM-only pipeline, while simpler, can truly match the performance of traditional OCR+MLLM setups. In this paper, we conduct a large-scale benchmarking study that evaluates various out-of-the-box MLLMs on business-document information extraction. To examine and explore failure modes, we propose an automated hierarchical error analysis framework that leverages large language models (LLMs) to diagnose error patterns systematically. Our findings suggest that OCR may not be necessary for powerful MLLMs, as image-only input can achieve comparable performance to OCR-enhanced approaches. Moreover, we demonstrate that carefully designed schema, exemplars, and instructions can further enhance MLLM performance. We hope this work can offer practical guidance and valuable insight for advancing document information extraction.


[9] A Browser-based Open Source Assistant for Multimodal Content Verification cs.CL

Rosanna Milner, Michael Foster, Olesya Razuvayevskaya, Ian Roberts, Valentin Porcellini

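TL;DR: This paper presents the VERIFICATION ASSISTANT, a browser-based open-source tool that helps journalists and fact-checkers rapidly verify digital media information. The tool integrates multiple backend NLP classifiers, automatically extracts content, analyzes credibility signals, estimates AI-generated content, and presents verification guidance in a clear, easy-to-digest format.

Details

Motivation: Disinformation and false content produced by generative AI pose a challenge for journalists and fact-checkers, while existing NLP models are often inaccessible to non-expert users and not integrated into their daily workflows.

Result: As a core component of the widely adopted VERIFICATION PLUGIN (more than 140,000 users), the tool has been applied in the real world to detect disinformation, demonstrating its practical value.

Insight: The contribution is the integration of multiple NLP services into a single browser interface, which gives non-expert users convenient access to sophisticated credibility analysis tools and fits them into everyday workflows.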

Abstract: Disinformation and false content produced by generative AI pose a significant challenge for journalists and fact-checkers who must rapidly verify digital media information. While there is an abundance of NLP models for detecting credibility signals such as persuasion techniques, subjectivity, or machine-generated text, such methods often remain inaccessible to non-expert users and are not integrated into their daily workflows as a unified framework. This paper demonstrates the VERIFICATION ASSISTANT, a browser-based tool designed to bridge this gap. The VERIFICATION ASSISTANT, a core component of the widely adopted VERIFICATION PLUGIN (140,000+ users), allows users to submit URLs or media files to a unified interface. It automatically extracts content and routes it to a suite of backend NLP classifiers, delivering actionable credibility signals, estimating AI-generated content, and providing other verification guidance in a clear, easy-to-digest format. This paper showcases the tool architecture, its integration of multiple NLP services, and its real-world application to detecting disinformation.


[10] Nodes Are Early, Edges Are Late: Probing Diagram Representations in Large Vision-Language Models cs.CL | cs.CV

Haruto Yoshida, Keito Kudo, Yoichi Aoki, Ryota Tanaka, Itsumi Saito

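TL;DR: Using a synthetic diagram dataset built from directed graphs, this paper investigates why large vision-language models (LVLMs) struggle with relationships between nodes and directed edges in diagram understanding. It finds that edge information is not linearly separable in the vision encoder and becomes linearly encoded only in the text tokens of the language model, whereas node information and global structural features are already linearly encoded in the hidden states of the vision encoder. The stage at which linearly separable representations form therefore depends on the type of visual information, and the delayed emergence of edge representations may explain why LVLMs perform poorly on relational understanding such as interpreting edge directions.

Details

Motivation: Although LVLMs perform strongly on diagram understanding benchmarks, they still struggle to understand relationships between elements, particularly those represented by nodes and directed edges. This paper investigates the underlying causes of that limitation.

Result: Probing experiments on the synthetic diagram dataset show that edge information is not linearly separable in the vision encoder and is linearly encoded only in the text tokens of the language model, while node information and global structural features are already linearly encoded in the hidden states of the vision encoder.

Insight: The contribution is to show that different types of visual information (nodes versus edges) form their representations at different stages in LVLMs. The delayed linear encoding of edge information may explain the bottleneck in relational understanding and gives a basis for improving diagram understanding in LVLMs.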

Abstract: Large vision-language models (LVLMs) demonstrate strong performance on diagram understanding benchmarks, yet they still struggle with understanding relationships between elements, particularly those represented by nodes and directed edges (e.g., arrows and lines). To investigate the underlying causes of this limitation, we probe the internal representation of LVLMs using a carefully constructed synthetic diagram dataset based on directed graphs. Our probing experiments reveal that edge information is not linearly separable in the vision encoder and becomes linearly encoded only in the text tokens in the language model. In contrast, node information and global structural features are already linearly encoded in individual hidden states of the vision encoder. These findings suggest that the stage at which linearly separable representations are formed varies depending on the type of visual information. In particular, the delayed emergence of edge representations may help explain why LVLMs struggle with relational understanding, such as interpreting edge directions, which require more abstract, compositionally integrated processes.


[11] Eval4Sim: An Evaluation Framework for Persona Simulation cs.CL

Eliseo Bao, Anxo Perez, Xi Wang, Javier Parapar

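TL;DR: This paper proposes Eval4Sim, a framework for evaluating how closely persona-simulated conversations produced by large language models (LLMs) align with human conversational patterns. It measures three complementary dimensions (Adherence, Consistency, and Naturalness), uses a human conversational corpus (PersonaChat) as a reference baseline, and penalizes deviations in both directions, aiming for more reliable and more interpretable evaluation than the prevailing LLM-as-a-judge approach.

Details

Motivation: Evaluation of LLM-based persona simulation currently relies mainly on LLM-as-a-judge, which has limited grounding in observable human behavior and produces opaque scalar scores. A more reliable framework that stays closer to human conversational patterns is needed to ensure that simulations are faithful.

Result: The paper demonstrates Eval4Sim on PersonaChat. Adherence is assessed via dense retrieval with speaker-aware representations, Consistency is computed through authorship verification, and Naturalness is quantified through distributions derived from dialogue-focused natural language inference. The framework is designed to distinguish insufficient persona encoding from over-optimized, unnatural behavior.

Insight: The main contribution is a multi-dimensional evaluation framework anchored to a human conversational corpus. By penalizing deviations in both directions, it measures the gap between simulated and human conversation more finely than optimization-oriented or absolute metrics. The design is extensible to any conversational corpus with speaker-level annotations.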

Abstract: Large Language Model (LLM) personas with explicit specifications of attributes, background, and behavioural tendencies are increasingly used to simulate human conversations for tasks such as user modeling, social reasoning, and behavioural analysis. Ensuring that persona-grounded simulations faithfully reflect human conversational behaviour is therefore critical. However, current evaluation practices largely rely on LLM-as-a-judge approaches, offering limited grounding in observable human behavior and producing opaque scalar scores. We address this gap by proposing Eval4Sim, an evaluation framework that measures how closely simulated conversations align with human conversational patterns across three complementary dimensions. Adherence captures how effectively persona backgrounds are implicitly encoded in generated utterances, assessed via dense retrieval with speaker-aware representations. Consistency evaluates whether a persona maintains a distinguishable identity across conversations, computed through authorship verification. Naturalness reflects whether conversations exhibit human-like flow rather than overly rigid or optimized structure, quantified through distributions derived from dialogue-focused Natural Language Inference. Unlike absolute or optimization-oriented metrics, Eval4Sim uses a human conversational corpus (i.e., PersonaChat) as a reference baseline and penalizes deviations in both directions, distinguishing insufficient persona encoding from over-optimized, unnatural behaviour. Although demonstrated on PersonaChat, the applicability of Eval4Sim extends to any conversational corpus containing speaker-level annotations.


[12] Learning to Generate and Extract: A Multi-Agent Collaboration Framework For Zero-shot Document-level Event Arguments Extraction cs.CL | cs.AI

Guangjun Zhang, Hu Zhang, Yazhou Han, Yue Fan, Yuhang Shao

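TL;DR: This paper proposes a multi-agent collaboration framework for zero-shot document-level event argument extraction that simulates the human collaborative cognitive process of "Propose-Evaluate-Revise". The framework comprises a generation agent and an evaluation agent that are iteratively optimized with reinforcement learning. It targets two weaknesses of existing zero-shot methods: synthetic data that fails to capture the contextual and structural relationships of unseen events, and the absence of a quality evaluation mechanism.

Details

Motivation: In zero-shot document-level event argument extraction, existing methods generate synthetic data from event-type-only prompts, which makes it hard to capture the contextual and structural relationships of unseen events, and they lack a mechanism for evaluating the reliability and usability of the synthetic data.

Result: In three zero-shot scenarios constructed from the RAMS and WikiEvents datasets, the method improves both data generation quality and argument extraction performance, and the generated synthetic data also effectively enhances the zero-shot performance of other DEAE models.

Insight: The contribution is a multi-agent framework that mimics human collaborative cognition, separating generation from evaluation and jointly optimizing both through reinforcement learning with event structure constraints. It offers a new approach to quality control of synthetic data and to zero-shot knowledge transfer.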

Abstract: Document-level event argument extraction (DEAE) is essential for knowledge acquisition, aiming to extract participants of events from documents. In the zero-shot setting, existing methods employ LLMs to generate synthetic data to address the challenge posed by the scarcity of annotated data. However, relying solely on event-type-only prompts makes it difficult for the generated content to accurately capture the contextual and structural relationships of unseen events. Moreover, ensuring the reliability and usability of synthetic data remains a significant challenge due to the absence of quality evaluation mechanisms. To this end, we introduce a multi-agent collaboration framework for zero-shot document-level event argument extraction (ZS-DEAE), which simulates the human collaborative cognitive process of “Propose-Evaluate-Revise.” Specifically, the framework comprises a generation agent and an evaluation agent. The generation agent synthesizes data for unseen events by leveraging knowledge from seen events, while the evaluation agent extracts arguments from the synthetic data and assesses their semantic consistency with the context. The evaluation results are subsequently converted into reward signals, with event structure constraints incorporated into the reward design to enable iterative optimization of both agents via reinforcement learning. In three zero-shot scenarios constructed from the RAMS and WikiEvents datasets, our method achieves improvements both in data generation quality and argument extraction performance, while the generated data also effectively enhances the zero-shot performance of other DEAE models.


[13] ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation cs.CL

Bo Xu, Haotian Wu, Hehai Lin, Weiquan Huang, Beier Zhu

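TL;DR: This paper proposes ACE-Merging, a data-free model merging method that uses adaptive covariance estimation to mitigate interference among task-specific expert models. It merges multiple task-specific models into a single model that preserves multi-task generalization, without data access, retraining, or architectural modification.

Details

Motivation: Model merging aims to combine multiple expert models, but interference among experts trained on different objectives causes significant performance degradation, and existing methods struggle to resolve it without relying on data, retraining, or architectural changes.

Result: Extensive experiments on vision and language benchmarks show that ACE-Merging sets a new state of the art among data-free methods, for example an average absolute improvement of 4% across seven tasks on GPT-2, at a modest computational cost.

Insight: The core contribution is a theoretical demonstration that each task's input covariance, a key factor for optimal merging, can be implicitly estimated from the parameter differences of its fine-tuned model. Building on this, the paper proposes a principled framework with a closed-form solution, in contrast to prior iterative or heuristic methods.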

Abstract: Model merging aims to combine multiple task-specific expert models into a single model while preserving generalization across diverse tasks. However, interference among experts, especially when they are trained on different objectives, often leads to significant performance degradation. Despite recent progress, resolving this interference without data access, retraining, or architectural modification remains a fundamental challenge. This paper provides a theoretical analysis demonstrating that the input covariance of each task, which is a key factor for optimal merging, can be implicitly estimated from the parameter differences of its fine-tuned model, even in a fully data-free setting. Building on this insight, we introduce ACE-Merging, an Adaptive Covariance Estimation framework that effectively mitigates inter-task interference. Our approach features a principled, closed-form solution that contrasts with prior iterative or heuristic methods. Extensive experiments on both vision and language benchmarks demonstrate that ACE-Merging sets a new state-of-the-art among data-free methods. It consistently outperforms existing baselines; for example, ACE-Merging achieves an average absolute improvement of 4% over the previous methods across seven tasks on GPT-2. Owing to its efficient closed-form formulation, ACE-Merging delivers superior performance with a modest computational cost, providing a practical and theoretically grounded solution for model merging.


[14] PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems cs.CL

Sudip Bhujel

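TL;DR: This paper presents PrivMedChat, an end-to-end differentially private RLHF framework for medical dialogue systems. It enforces differential privacy at every training stage that directly accesses dialogue-derived supervision, including medical supervised fine-tuning and reward model learning, and introduces an annotation-free preference construction strategy that produces scalable preference data. Experiments show that, while protecting privacy, the method achieves the best performance among differentially private models on medical dialogue benchmarks and substantially reduces clinical hallucinations and harmful advice.

Details

Motivation: Adapting large language models to clinical dialogue through supervised fine-tuning and RLHF on doctor-patient conversations, which may contain sensitive information, creates privacy risks such as membership inference attacks and extraction of training-set content.

Result: On medical dialogue benchmarks, PrivMedChat at a privacy budget of ε=7 achieves the highest ROUGE-L (0.156) among all differentially private models, reduces clinical hallucinations to 1.4% and harmful advice to 0.4%, and obtains the highest overall score (2.86) in a 3-model LLM-jury evaluation, while its membership-inference signals stay near chance (AUC 0.510-0.555).

Insight: The contributions are an end-to-end differentially private RLHF framework that applies DP-SGD systematically to SFT, reward model learning, and PPO alignment, and a strategy that builds scalable preference data without clinician labeling by pairing physician responses with filtered non-expert generations. Together they offer a feasible route to high-performing, safe medical dialogue AI under strict privacy constraints.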

Abstract: Large language models are increasingly used for patient-facing medical assistance and clinical decision support, but adapting them to clinical dialogue often requires supervision derived from doctor-patient conversations that may contain sensitive information. Conventional supervised fine-tuning and reinforcement learning from human feedback (RLHF) can amplify memorization risks, enabling empirical membership inference and extraction of rare training-set content. We present PrivMedChat, an end-to-end framework for differentially private RLHF (DP-RLHF) for medical dialogue. Our design enforces differential privacy at every training stage that directly accesses dialogue-derived supervision: (i) Differentially Private Stochastic Gradient Descent (DP-SGD) for medical SFT and (ii) DP-SGD for reward model learning from preference pairs. To limit additional privacy expenditure during alignment, we apply DP-SGD to the PPO actor and critic when operating on dialogue-derived prompts, while the reward model remains fixed after DP training. We also introduce an annotation-free preference construction strategy that pairs physician responses with filtered non-expert generations to produce scalable preference data without clinician labeling. Experiments on medical dialogue benchmarks show that PrivMedChat at $\varepsilon=7$ achieves the highest ROUGE-L of 0.156 among all DP models, reduces clinical hallucinations to 1.4% and harmful advice to 0.4%, and obtains the highest overall score of 2.86 in a 3-model LLM-jury evaluation, while producing membership-inference signals that are near chance (AUC 0.510-0.555). We open-source our code at https://github.com/sudip-bhujel/privmedchat.
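The DP-SGD update that the framework applies at each stage follows a standard recipe: clip every example's gradient to a fixed norm, sum, add Gaussian noise scaled to the clipping norm, then average. The sketch below shows one such step on plain Python lists. It is a generic illustration rather than the PrivMedChat code, the function and parameter names are chosen for the example, and it omits the privacy accounting that turns a noise multiplier into a budget such as ε=7.

```python
import math
import random


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each example's gradient, sum, add Gaussian noise, average."""
    summed = [0.0] * len(params)
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        # Scale the gradient down only if its L2 norm exceeds clip_norm.
        scale = min(1.0, clip_norm / (norm + 1e-12))
        for k, x in enumerate(g):
            summed[k] += x * scale
    n = len(per_example_grads)
    # Noise standard deviation is noise_multiplier * clip_norm, added once per coordinate.
    noisy = [(s + rng.gauss(0.0, noise_multiplier * clip_norm)) / n for s in summed]
    return [p - lr * g for p, g in zip(params, noisy)]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # the first gradient has norm 5 and will be clipped
updated = dp_sgd_step([0.0, 0.0], grads, lr=1.0, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Clipping bounds how much any single conversation can move the parameters, and the noise masks what remains, which is why the same mechanism can be reused for SFT, reward model training, and the PPO actor and critic.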


[15] APRES: An Agentic Paper Revision and Evaluation System cs.CL | cs.AI

Bingchen Zhao, Jenny Zhang, Chenxi Whitehouse, Minqi Jiang, Michael Shvartsman

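TL;DR: This paper presents APRES, an agentic paper revision and evaluation system based on large language models (LLMs). It automatically revises the text of a scientific paper according to an evaluation rubric that predicts future citation counts, with the goal of raising the paper's quality and potential impact without altering its core scientific content.

Details

Motivation: The current peer review system often gives inconsistent feedback, which hinders the improvement of manuscripts and limits their potential impact. This paper uses LLMs to build an automated system that helps authors stress-test their manuscripts before submission and communicate their work more effectively.

Result: On future citation prediction, APRES reduces mean absolute error (MAE) by 19.6% relative to the next best baseline. In evaluation by human experts, papers revised by APRES are preferred over the originals 79% of the time.

Insight: The contribution is an automatically discovered rubric that is highly predictive of future citation counts, integrated with an LLM-based revision process to form an end-to-end system for improving paper quality. The underlying idea is to use LLMs as a tool that augments, rather than replaces, human expert reviewers, improving a paper's presentation and structure while leaving its scientific content unchanged.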

Abstract: Scientific discoveries must be communicated clearly to realize their full potential. Without effective communication, even the most groundbreaking findings risk being overlooked or misunderstood. The primary way scientists communicate their work and receive feedback from the community is through peer review. However, the current system often provides inconsistent feedback between reviewers, ultimately hindering the improvement of a manuscript and limiting its potential impact. In this paper, we introduce a novel method, APRES, powered by Large Language Models (LLMs), to update a scientific paper's text based on an evaluation rubric. Our automated method discovers a rubric that is highly predictive of future citation counts and integrates it with APRES in an automated system that revises papers to enhance their quality and impact. Crucially, this objective should be met without altering the core scientific content. We demonstrate the success of APRES, which improves future citation prediction by 19.6% in mean averaged error over the next best baseline, and show that our paper revision process yields papers that are preferred over the originals by human expert evaluators 79% of the time. Our findings provide strong empirical support for using LLMs as a tool to help authors stress-test their manuscripts before submission. Ultimately, our work seeks to augment, not replace, the essential role of human expert reviewers, for it should be humans who discern which discoveries truly matter, guiding science toward advancing knowledge and enriching lives.


[16] BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing? cs.CL | cs.SE

Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng

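TL;DR: This paper introduces BeyondSWE, a benchmark that evaluates code agents on complex real-world scenarios beyond single-repository bug fixing, such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. Experiments show that current frontier models stay below 45% success and perform inconsistently. The authors also develop the SearchSWE framework to study the role of external knowledge such as search, and find that its gains are inconsistent and that it sometimes degrades performance.

Details

Motivation: Current benchmarks for code agents are largely limited to narrow fixes within a single repository and overlook key real-world challenges such as cross-repository reasoning, domain expertise, dependency migration, and full-repository generation. A more comprehensive evaluation is needed.

Result: On the 500 real-world instances of BeyondSWE, even frontier models plateau below 45% success, and no single model performs consistently across task types. The SearchSWE framework, which integrates deep search, yields inconsistent gains and in some cases degrades performance.

Insight: The main contributions are a comprehensive benchmark (BeyondSWE) that broadens evaluation along two axes, resolution scope and knowledge scope, and a flexible framework (SearchSWE) for systematically studying the integration of external knowledge. Viewed objectively, the work exposes a significant capability gap for current code agents on complex real-world tasks and challenges the assumption that simply adding search is enough to emulate developer workflows, pointing future research in a more realistic direction.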

Abstract: Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes - resolution scope and knowledge scope - using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.


[17] Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration? cs.CL

Dadi Guo, Yuejin Xie, Qingyu Liu, Jiayu Liu, Zhiyuan Fan

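TL;DR: The paper proposes Code2Math, a framework in which code agents autonomously evolve existing math problems through exploratory execution in a scalable computational environment, producing new problems that are structurally more complex and more difficult, in order to ease the shortage of high-quality math problem data.

Details

Motivation: As the mathematical capabilities of large language models approach the IMO level, the scarcity of challenging, high-quality problems for training and evaluation has become a key bottleneck. At the same time, code agents have shown sophisticated skills in agentic coding and reasoning, which suggests that code execution can serve as a scalable environment for mathematical experimentation.

Result: Experiments show that, given sufficient test-time exploration, code agents can synthesize new, solvable problems that are structurally distinct from the originals and more challenging.

Insight: The contribution is a multi-agent framework that performs problem evolution while validating both the solvability and the increased difficulty of the generated problems. Viewed objectively, its core idea is to treat the code execution environment as a scalable platform for synthesizing and verifying math problems, using the agents' exploration to automate the generation of high-quality hard problems and offering a new approach to data generation.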

Abstract: As large language models (LLMs) advance their mathematical capabilities toward the IMO level, the scarcity of challenging, high-quality problems for training and evaluation has become a significant bottleneck. Simultaneously, recent code agents have demonstrated sophisticated skills in agentic coding and reasoning, suggesting that code execution can serve as a scalable environment for mathematical experimentation. In this paper, we investigate the potential of code agents to autonomously evolve existing math problems into more complex variations. We introduce a multi-agent framework designed to perform problem evolution while validating the solvability and increased difficulty of the generated problems. Our experiments demonstrate that, given sufficient test-time exploration, code agents can synthesize new, solvable problems that are structurally distinct from and more challenging than the originals. This work provides empirical evidence that code-driven agents can serve as a viable mechanism for synthesizing high-difficulty mathematical reasoning problems within scalable computational environments. Our data is available at https://github.com/TarferSoul/Code2Math.


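[18] Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use cs.CL

Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi

TL;DR: This paper proposes MOSAIC, a framework for safe multi-step tool use by agentic language models. It structures inference as a "plan, check, then act or refuse" loop and trains with preference-based reinforcement learning, improving the model's safety decisions when facing harmful tasks, prompt injection, and privacy leakage.

Details

Motivation: Existing alignment methods are optimized mainly for static generation and task completion. They break down when agentic models make sequential decisions, face adversarial tool feedback, and produce overconfident intermediate reasoning. A new approach is therefore needed to keep agents safe during multi-step tool use.

Result: Zero-shot evaluation on Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4 shows that, on out-of-distribution benchmarks, MOSAIC reduces harmful behavior by up to 50%, raises harmful-task refusal by over 20% under injection attacks, reduces privacy leakage, and preserves or improves benign-task performance.

Insight: The core novelty is making safety decisions, especially the "refuse" action, explicit and learnable first-class actions in the reasoning process, and using preference-based reinforcement learning over pairwise trajectory comparisons to capture safety distinctions that scalar rewards tend to miss, which yields robust generalization across models, domains, and agentic settings.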

Abstract: Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act or refuse loop, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions often missed by scalar rewards. We evaluate MOSAIC zero-shot across three model families, Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4, and across out-of-distribution benchmarks spanning harmful tasks, prompt injection, benign tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance, demonstrating robust generalization across models, domains, and agentic settings.
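
The plan, check, then act or refuse loop described in the abstract can be pictured with a small sketch. Everything named here is hypothetical: `Step`, `Trace`, `run_agent`, and the rule-based `is_safe` check are for illustration only. MOSAIC learns its safety reasoning through preference-based reinforcement learning, whereas this sketch substitutes a hand-written rule.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    tool: str       # name of the tool the agent plans to call
    argument: str   # argument passed to that tool


@dataclass
class Trace:
    outputs: List[str] = field(default_factory=list)
    refused: bool = False
    reason: str = ""


def run_agent(plan, is_safe, tools) -> Trace:
    """Execute planned steps one by one, checking each before acting."""
    trace = Trace()
    for step in plan:
        ok, reason = is_safe(step)        # explicit safety check before the action
        if not ok:
            trace.refused = True          # refusal is a first-class outcome
            trace.reason = reason
            break
        trace.outputs.append(tools[step.tool](step.argument))
    return trace


def is_safe(step):
    # Hypothetical hand-written rule standing in for learned safety reasoning.
    if step.tool == "enter_credentials":
        return False, "credential entry is treated as irreversible"
    return True, ""


# Toy tools and plan, invented for this example.
tools = {
    "read_file": lambda path: f"contents of {path}",
    "enter_credentials": lambda site: f"logged in to {site}",
}
plan = [Step("read_file", "notes.txt"), Step("enter_credentials", "example.com")]
trace = run_agent(plan, is_safe, tools)
```

Here the first step runs and the second is refused, so the unsafe action never executes. Placing the check inside the loop, rather than once before it, mirrors the point that risk can arise at any intermediate step.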


cs.CV

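[19] CamDirector: Towards Long-Term Coherent Video Trajectory Editing cs.CV

Zhihao Shi, Kejia Yin, Weilin Wan, Yuhongze Zhou, Yuanhao Yu

TL;DR: This paper proposes CamDirector, a framework for long-term coherent video trajectory editing. It explicitly aggregates information from the entire source video through a hybrid warping scheme and processes video segments with a history-guided autoregressive diffusion model to strengthen temporal consistency.

Details

Motivation: Existing video trajectory editing methods fall short on precise camera control and long-range consistency, mainly because they inject target poses through a limited-capacity embedding or rely on single-frame warping, without explicit cross-frame aggregation.

Result: The method achieves state-of-the-art performance with fewer parameters on the newly proposed iPhone-PTZ benchmark.

Insight: The novelties are a hybrid warping scheme that explicitly aggregates global information to produce consistent coarse frames, and a history-guided autoregressive diffusion model combined with an incrementally updated world cache to ensure long-term temporal coherence.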

Abstract: Video (camera) trajectory editing aims to synthesize new videos that follow user-defined camera paths while preserving scene content and plausibly inpainting previously unseen regions, upgrading amateur footage into professionally styled videos. Existing VTE methods struggle with precise camera control and long-range consistency because they either inject target poses through a limited-capacity embedding or rely on single-frame warping with only implicit cross-frame aggregation in video diffusion models. To address these issues, we introduce a new VTE framework that 1) explicitly aggregates information across the entire source video via a hybrid warping scheme. Specifically, static regions are progressively fused into a world cache then rendered to target camera poses, while dynamic regions are directly warped; their fusion yields globally consistent coarse frames that guide refinement. 2) processes video segments jointly with their history via a history-guided autoregressive diffusion model, while the world cache is incrementally updated to reinforce already inpainted content, enabling long-term temporal coherence. Finally, we present iPhone-PTZ, a new VTE benchmark with diverse camera motions and large trajectory variations, and achieve state-of-the-art performance with fewer parameters.


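[20] Social-JEPA: Emergent Geometric Isomorphism cs.CV | cs.AI

Haoran Zhang, Youjin Wang, Yi Duan, Rong Fu, Dianyu Zhao

TL;DR: The paper proposes Social-JEPA, in which multiple agents independently learn world models from different viewpoints. It finds that their latent representations are related by an approximate linear isometry, which enables transparent translation between representations, and it uses this alignment for classifier transfer and knowledge distillation, reducing compute.

Details

Motivation: The paper addresses how agents in decentralized vision systems, having learned world models from independent viewpoints, can make their internal representations interoperable, and it explores how predictive learning objectives shape representation geometry.

Result: Experiments show that the two agents' latent spaces are related by an approximate linear isometry, and that this geometric consensus holds even under large viewpoint shifts and little overlap in raw pixels. A classifier can be transferred directly between agents without additional training, and distillation-like migration accelerates later learning and markedly reduces total compute.

Insight: The novelty is showing that predictive learning objectives strongly regularize representation geometry, so that a linear isometry emerges naturally between independently trained agents' representations. This offers a lightweight path to interoperability for decentralized vision systems, enabling knowledge transfer without parameter sharing or coordination.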

Abstract: World models compress rich sensory streams into compact latent codes that anticipate future observations. We let separate agents acquire such models from distinct viewpoints of the same environment without any parameter sharing or coordination. After training, their internal representations exhibit a striking emergent property: the two latent spaces are related by an approximate linear isometry, enabling transparent translation between them. This geometric consensus survives large viewpoint shifts and scant overlap in raw pixels. Leveraging the learned alignment, a classifier trained on one agent can be ported to the other with no additional gradient steps, while distillation-like migration accelerates later learning and markedly reduces total compute. The findings reveal that predictive learning objectives impose strong regularities on representation geometry, suggesting a lightweight path to interoperability among decentralized vision systems. The code is available at https://anonymous.4open.science/r/Social-JEPA-5C57.
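
One standard way to estimate a linear isometry between two sets of paired latent vectors is the orthogonal Procrustes solution, sketched below with NumPy. This is an illustration of the general technique, not the authors' released code: the function name, the latent dimension, and the synthetic data are all assumptions.

```python
import numpy as np


def fit_linear_isometry(za: np.ndarray, zb: np.ndarray) -> np.ndarray:
    """Return the orthogonal matrix R that minimizes ||za @ R - zb||_F."""
    # Orthogonal Procrustes: take the SVD of the cross-covariance matrix
    # and drop the singular values.
    u, _, vt = np.linalg.svd(za.T @ zb)
    return u @ vt


# Synthetic stand-in for two agents' latents (assumed, not real data).
rng = np.random.default_rng(0)
za = rng.normal(size=(500, 16))                   # latents of agent A
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))    # hidden orthogonal map
zb = za @ q + 0.01 * rng.normal(size=(500, 16))   # agent B: mapped copy plus noise

r = fit_linear_isometry(za, zb)
rel_err = np.linalg.norm(za @ r - zb) / np.linalg.norm(zb)

# A linear classifier with weights w trained on agent B's latents could then
# score agent A's latents as (za @ r) @ w, with no further gradient steps.
```

A small relative error after alignment is what "approximate linear isometry" would look like in practice. The orthogonality constraint matters: an unconstrained least-squares map could also fit, but it would not preserve distances.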


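[21] From Visual to Multimodal: Systematic Ablation of Encoders and Fusion Strategies in Animal Identification cs.CV

Vasiliy Kudryavtsev, Kirill Borodin, German Berezin, Kirill Bubenchikov, Grach Mkrtchian

TL;DR: This study targets automated animal identification and proposes a multimodal verification framework that fuses visual features with semantic identity priors derived from synthetic textual descriptions, substantially improving pet re-identification. It builds a large-scale training corpus of 1.9 million photographs covering about 695 thousand unique animals, and uses systematic ablations to identify the best vision and text backbones (SigLIP2-Giant and E5-Small-v2) and a gated fusion strategy.

Details

Motivation: Current automated animal identification systems are limited by small datasets and reliance on a single visual modality. This study introduces multimodal (visual and textual) information to address these limits and improve pet re-identification accuracy.

Result: On a comprehensive test protocol, the proposed method achieves 84.28% Top-1 accuracy and an Equal Error Rate of 0.0422, an 11% improvement over leading unimodal baselines, reaching state-of-the-art level.

Insight: The novelty is fusing synthetic textual descriptions, used as semantic priors, with visual features, and identifying the best backbones and fusion strategy (gated fusion) through systematic ablation, showing that multimodal fusion can refine decision boundaries in large-scale pet re-identification.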

Abstract: Automated animal identification is a practical task for reuniting lost pets with their owners, yet current systems often struggle due to limited dataset scale and reliance on unimodal visual cues. This study introduces a multimodal verification framework that enhances visual features with semantic identity priors derived from synthetic textual descriptions. We constructed a massive training corpus of 1.9 million photographs covering 695,091 unique animals to support this investigation. Through systematic ablation studies, we identified SigLIP2-Giant and E5-Small-v2 as the optimal vision and text backbones. We further evaluated fusion strategies ranging from simple concatenation to adaptive gating to determine the best method for integrating these modalities. Our proposed approach utilizes a gated fusion mechanism and achieved a Top-1 accuracy of 84.28% and an Equal Error Rate of 0.0422 on a comprehensive test protocol. These results represent an 11% improvement over leading unimodal baselines and demonstrate that integrating synthesized semantic descriptions significantly refines decision boundaries in large-scale pet re-identification.
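
The abstract reports an Equal Error Rate (EER) of 0.0422. For readers unfamiliar with the metric, the sketch below computes an EER from verification scores by sweeping a decision threshold. The function name and the toy scores are hypothetical and unrelated to the paper's data or evaluation code.

```python
import numpy as np


def equal_error_rate(genuine: np.ndarray, impostor: np.ndarray) -> float:
    """Approximate the EER: the point where false accepts equal false rejects."""
    # Candidate thresholds are the observed scores; a pair is accepted
    # when its score is at or above the threshold.
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false accept rate
    frr = np.array([(genuine < t).mean() for t in thresholds])    # false reject rate
    i = int(np.argmin(np.abs(far - frr)))                         # closest crossing
    return float((far[i] + frr[i]) / 2.0)


# Toy similarity scores: same-animal pairs versus different-animal pairs.
genuine = np.array([0.9, 0.8, 0.7, 0.6])
impostor = np.array([0.1, 0.2, 0.3, 0.65])
eer = equal_error_rate(genuine, impostor)  # 0.25 for these toy scores
```

A lower EER means the verifier separates matching and non-matching pairs more cleanly; an EER of 0 would correspond to scores that a single threshold splits perfectly.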


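[22] Beyond Prompt Degradation: Prototype-guided Dual-pool Prompting for Incremental Object Detection cs.CV | cs.AI

Yaoteng Zhang, Zhou Qing, Junyu Gao, Qi Wang

TL;DR: This paper proposes PDP, a prototype-guided dual-pool prompting framework that addresses prompt degradation in incremental object detection. It decouples task-general and task-specific knowledge with a shared pool and a private pool, and introduces a prototypical pseudo-label generation module to keep supervision consistent, mitigating prompt coupling and prompt drift in a replay-free setting.

Details

Motivation: Existing prompt-based incremental object detection methods suffer from prompt coupling and prompt drift, which cause prompt degradation during continual adaptation and hurt model performance.

Result: The method achieves state-of-the-art performance on the MS-COCO and PASCAL VOC benchmarks, with AP improvements of 9.2% and 3.3%, respectively.

Insight: The novelty is the prompt-decoupled dual-pool paradigm together with a prototypical pseudo-label generation module that dynamically updates the class prototype space, offering a new way to balance stability and plasticity in incremental learning.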

Abstract: Incremental Object Detection (IOD) aims to continuously learn new object categories without forgetting previously learned ones. Recently, prompt-based methods have gained popularity for their replay-free design and parameter efficiency. However, due to prompt coupling and prompt drift, these methods often suffer from prompt degradation during continual adaptation. To address these issues, we propose a novel prompt-decoupled framework called PDP. PDP innovatively designs a dual-pool prompt decoupling paradigm, which consists of a shared pool used to capture task-general knowledge for forward transfer, and a private pool used to learn task-specific discriminative features. This paradigm explicitly separates task-general and task-specific prompts, preventing interference between prompts and mitigating prompt coupling. In addition, to counteract prompt drift resulting from inconsistent supervision where old foreground objects are treated as background in subsequent tasks, PDP introduces a Prototypical Pseudo-Label Generation (PPG) module. PPG can dynamically update the class prototype space during training and use the class prototypes to further filter valuable pseudo-labels, maintaining supervisory signal consistency throughout the incremental process. PDP achieves state-of-the-art performance on MS-COCO (with a 9.2% AP improvement) and PASCAL VOC (with a 3.3% AP improvement) benchmarks, highlighting its potential in balancing stability and plasticity. The code and dataset are released at: https://github.com/zyt95579/PDP_IOD/tree/main
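
The abstract says PPG updates class prototypes during training and uses them to filter pseudo-labels, without giving the update rule. The sketch below shows one common way such a mechanism can be built, purely as an assumption: an exponential moving average for the prototype and a cosine-similarity threshold for filtering. The function names, the momentum, and the threshold are hypothetical and are not taken from the PDP code.

```python
import numpy as np


def update_prototype(prototype: np.ndarray, feature: np.ndarray,
                     momentum: float = 0.9) -> np.ndarray:
    """Move the class prototype toward a new feature (assumed EMA rule)."""
    blended = momentum * prototype + (1.0 - momentum) * feature
    return blended / np.linalg.norm(blended)   # keep the prototype unit length


def keep_pseudo_label(prototype: np.ndarray, feature: np.ndarray,
                      threshold: float = 0.7) -> bool:
    """Keep a pseudo-label only if its feature is close to the class prototype."""
    cosine = feature @ prototype / (np.linalg.norm(feature) * np.linalg.norm(prototype))
    return bool(cosine >= threshold)


# Toy 2-D features, invented for illustration.
proto = np.array([1.0, 0.0])
near = np.array([0.9, 0.1])    # resembles the old class, so the pseudo-label is kept
far = np.array([0.0, 1.0])     # unrelated region, so the pseudo-label is dropped
kept_near = keep_pseudo_label(proto, near)
kept_far = keep_pseudo_label(proto, far)
proto = update_prototype(proto, near)
```

The intent of such filtering is that detections of old classes are relabeled as foreground only when they agree with what the model has already learned about that class, which is one way to keep supervision consistent across tasks.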


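[23] HAMMER: Harnessing MLLM via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding cs.CV

Lei Yao, Yong Chen, Yuejiao Su, Yi Wang, Moyun Liu

TL;DR: HAMMER is a new framework that uses multimodal large language models (MLLMs) for interaction intention-driven 3D affordance grounding. It aggregates the interaction intention depicted in an image into a contact-aware embedding, guides the model to infer textual affordance labels, and designs a hierarchical cross-modal integration mechanism and a multi-granular geometry lifting module to refine 3D representations and achieve accurate 3D affordance localization.

Details

Motivation: Inspired by how humans identify 3D object affordances by observing interactions in images or videos, the paper addresses how to use MLLMs for interaction intention-driven 3D affordance grounding without relying on explicit attribute descriptions or off-the-shelf 2D segmenters.

Result: Extensive experiments on public datasets and a newly constructed corrupted benchmark show that HAMMER is superior to and more robust than existing methods. The abstract does not report specific quantitative results or state whether it reaches SOTA.

Insight: The novelties are: aggregating interaction intention into a contact-aware embedding to mine object semantics and contextual cues; a hierarchical cross-modal integration mechanism that uses complementary MLLM information to refine 3D representations; and a multi-granular geometry lifting module that injects spatial characteristics into the intention embedding to improve 3D localization accuracy. These ideas may carry over to other multimodal 3D understanding tasks.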

Abstract: Humans commonly identify 3D object affordance through observed interactions in images or videos, and once formed, such knowledge can be generically generalized to novel objects. Inspired by this principle, we advocate for a novel framework that leverages emerging multimodal large language models (MLLMs) for interaction intention-driven 3D affordance grounding, namely HAMMER. Instead of generating explicit object attribute descriptions or relying on off-the-shelf 2D segmenters, we alternatively aggregate the interaction intention depicted in the image into a contact-aware embedding and guide the model to infer textual affordance labels, ensuring it thoroughly excavates object semantics and contextual cues. We further devise a hierarchical cross-modal integration mechanism to fully exploit the complementary information from the MLLM for 3D representation refinement and introduce a multi-granular geometry lifting module that infuses spatial characteristics into the extracted intention embedding, thus facilitating accurate 3D affordance localization. Extensive experiments on public datasets and our newly constructed corrupted benchmark demonstrate the superiority and robustness of HAMMER compared to existing approaches. All code and weights are publicly available.


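[24] Beyond Caption-Based Queries for Video Moment Retrieval cs.CV

David Pujol-Perich, Albert Clapés, Dima Damen, Sergio Escalera, Michael Wray

TL;DR: This paper studies why existing video moment retrieval (VMR) methods, particularly DETR-based architectures, degrade when trained on caption-based queries but evaluated on search queries. The authors introduce three benchmarks by modifying the textual queries of three public VMR datasets (HD-EPIC, YouCook2, and ActivityNet-Captions). The analysis reveals two key generalization challenges, a language gap and a multi-moment gap, and identifies decoder-query collapse as the main cause of poor generalization to multi-moment instances. Architectural modifications that increase the number of active decoder queries improve performance by up to 14.82% mAP_m on search queries and up to 21.83% mAP_m on multi-moment search queries.

Details

Motivation: The paper addresses the drop in generalization when VMR methods move from training on caption queries to evaluation on realistic search queries, in particular the language gap and multi-moment gap that affect DETR architectures.

Result: On the three modified VMR benchmarks, the proposed method improves mAP_m by up to 14.82% on search queries and by up to 21.83% on multi-moment search queries, a marked gain in generalization.

Insight: The contributions are identifying and quantifying the language gap and the multi-moment gap in VMR, exposing decoder-query collapse as a key problem in DETR architectures, and mitigating multi-moment generalization failures with a simple architectural change that increases the number of active decoder queries, which gives useful guidance for designing VMR models for real search scenarios.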

Abstract: In this work, we investigate the degradation of existing VMR methods, particularly of DETR architectures, when trained on caption-based queries but evaluated on search queries. For this, we introduce three benchmarks by modifying the textual queries in three public VMR datasets – i.e., HD-EPIC, YouCook2 and ActivityNet-Captions. Our analysis reveals two key generalization challenges: (i) A language gap, arising from the linguistic under-specification of search queries, and (ii) a multi-moment gap, caused by the shift from single-moment to multi-moment queries. We also identify a critical issue in these architectures – an active decoder-query collapse – as a primary cause of the poor generalization to multi-moment instances. We mitigate this issue with architectural modifications that effectively increase the number of active decoder queries. Extensive experiments demonstrate that our approach improves performance on search queries by up to 14.82% mAP_m, and up to 21.83% mAP_m on multi-moment search queries. The code, models and data are available in the project webpage: https://davidpujol.github.io/beyond-vmr/


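[25] Cultural Counterfactuals: Evaluating Cultural Biases in Large Vision-Language Models with Counterfactual Examples cs.CV

Phillip Howard, Xin Su, Kathleen C. Fraser

TL;DR: This paper introduces Cultural Counterfactuals, a high-quality synthetic dataset of nearly 60k counterfactual images for measuring cultural biases of large vision-language models (LVLMs) related to religion, nationality, and socioeconomic status. The dataset is generated with an image-editing model that places people of different demographics into real cultural context images, yielding counterfactual image sets that depict the same person in multiple contexts and allow precise measurement of how cultural context differences affect LVLM outputs.

Details

Motivation: Prior work has focused mainly on biases tied to demographic traits visible in a person's appearance (such as race or gender), while biases related to cultural differences that cannot be readily discerned from appearance alone (such as religion or socioeconomic status) remain relatively understudied. A key challenge in measuring cultural bias is that determining an individual's group often depends on cultural context cues in the image, and datasets annotated with such cues are lacking.

Result: The study demonstrates the utility of the Cultural Counterfactuals dataset for quantifying cultural biases in popular LVLMs. The abstract does not mention specific benchmarks or quantitative results (such as SOTA).

Insight: The novelty is a method for generating a high-quality synthetic counterfactual dataset dedicated to evaluating biases tied to cultural context (religion, nationality, socioeconomic status). By controlling variables (placing the same person in different cultural contexts), it isolates and measures the effect of cultural context on model outputs, filling a dataset gap in this area.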

Abstract: Large Vision-Language Models (LVLMs) have grown increasingly powerful in recent years, but can also exhibit harmful biases. Prior studies investigating such biases have primarily focused on demographic traits related to the visual characteristics of a person depicted in an image, such as their race or gender. This has left biases related to cultural differences (e.g., religion, socioeconomic status), which cannot be readily discerned from an individual’s appearance alone, relatively understudied. A key challenge in measuring cultural biases is that determining which group an individual belongs to often depends upon cultural context cues in images, and datasets annotated with cultural context cues are lacking. To address this gap, we introduce Cultural Counterfactuals: a high-quality synthetic dataset containing nearly 60k counterfactual images for measuring cultural biases related to religion, nationality, and socioeconomic status. To ensure that cultural contexts are accurately depicted, we generate our dataset using an image-editing model to place people of different demographics into real cultural context images. This enables the construction of counterfactual image sets which depict the same person in multiple different contexts, allowing for precise measurement of the impact that cultural context differences have on LVLM outputs. We demonstrate the utility of Cultural Counterfactuals for quantifying cultural biases in popular LVLMs.


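[26] Advancing Earth Observation Through Machine Learning: A TorchGeo Tutorial cs.CV

Caleb Robinson, Nils Lehmann, Adam J. Stewart, Burak Ekim, Heng Fang

TL;DR: This is a tutorial paper on the TorchGeo library. TorchGeo is a PyTorch-based domain library that aims to simplify the use of geospatial data in machine learning pipelines. The tutorial introduces TorchGeo's core abstractions through code examples and provides an end-to-end case study: multispectral water segmentation from Sentinel-2 imagery using the Earth Surface Water dataset.

Details

Motivation: Earth observation machine learning pipelines differ fundamentally from standard computer vision workflows: data typically arrive as large georeferenced scenes, labels may be raster masks or vector geometries in different coordinate reference systems, and training and evaluation require spatially aware sampling and splitting strategies. Dedicated tooling is therefore needed to simplify this processing.

Result: The abstract reports no quantitative benchmark results or SOTA comparisons. It mainly presents a tutorial case study showing how to train a semantic segmentation model with TorchGeo, apply it to a Sentinel-2 scene over Rio de Janeiro, Brazil, and produce GeoTIFF predictions.

Insight: The paper's claimed contribution is the TorchGeo library, a PyTorch library dedicated to geospatial machine learning, together with a detailed tutorial. Viewed objectively, the core contribution is abstracting the particularities of geospatial data handling (georeferencing, coordinate systems, spatial sampling) and integrating them into a mainstream deep learning framework, lowering the barrier to research and application in this field.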

Abstract: Earth observation machine learning pipelines differ fundamentally from standard computer vision workflows. Imagery is typically delivered as large, georeferenced scenes, labels may be raster masks or vector geometries in distinct coordinate reference systems, and both training and evaluation often require spatially aware sampling and splitting strategies. TorchGeo is a PyTorch-based domain library that provides datasets, samplers, transforms and pre-trained models with the goal of making it easy to use geospatial data in machine learning pipelines. In this paper, we introduce a tutorial that demonstrates 1.) the core TorchGeo abstractions through code examples, and 2.) an end-to-end case study on multispectral water segmentation from Sentinel-2 imagery using the Earth Surface Water dataset. This demonstrates how to train a semantic segmentation model using TorchGeo datasets, apply the model to a Sentinel-2 scene over Rio de Janeiro, Brazil, and save the resulting predictions as a GeoTIFF for further geospatial analysis. The tutorial code itself is distributed as two Python notebooks: https://torchgeo.readthedocs.io/en/stable/tutorials/torchgeo.html and https://torchgeo.readthedocs.io/en/stable/tutorials/earth_surface_water.html.


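[27] OpenMarcie: Dataset for Multimodal Action Recognition in Industrial Environments cs.CV | eess.SP

Hymalai Bello, Lala Ray, Joanna Sorysz, Sungho Suh, Paul Lukowicz

TL;DR: OpenMarcie is presented as the largest multimodal dataset to date for monitoring human actions in manufacturing environments. It contains over 37 hours of egocentric and exocentric data covering bicycle and 3D printer assembly tasks, and is benchmarked on three tasks: activity classification, open-vocabulary captioning, and cross-modal alignment.

Details

Motivation: Recognizing worker activity in smart factories makes it possible to quantify performance metrics, improve overall efficiency, and safeguard worker safety, but existing datasets lack the scale and multimodality needed for industrial settings.

Result: The dataset is benchmarked on activity classification, open-vocabulary captioning, and cross-modal alignment. The abstract does not report specific quantitative results or state whether SOTA is reached.

Insight: The novelty is a large-scale, multimodal (wearable sensors and cameras), multi-view (egocentric and exocentric) dataset for industrial environments, with two experimental settings, one without a fixed protocol and one with collaborative assembly, designed to mimic real manufacturing dynamics, providing a new benchmark for industrial action recognition research.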

Abstract: Smart factories use advanced technologies to optimize production and increase efficiency. To this end, the recognition of worker activity allows for accurate quantification of performance metrics, improving efficiency holistically while contributing to worker safety. OpenMarcie is, to the best of our knowledge, the biggest multimodal dataset designed for human action monitoring in manufacturing environments. It includes data from wearable sensing modalities and cameras distributed in the surroundings. The dataset is structured around two experimental settings, involving a total of 36 participants. In the first setting, twelve participants perform a bicycle assembly and disassembly task under semi-realistic conditions without a fixed protocol, promoting divergent and goal-oriented problem-solving. The second experiment involves twenty-five volunteers (24 with valid data) engaged in a 3D printer assembly task, with the 3D printer manufacturer’s instructions provided to guide the volunteers in acquiring procedural knowledge. This setting also includes sequential collaborative assembly, where participants assess and correct each other’s progress, reflecting real-world manufacturing dynamics. OpenMarcie includes over 37 hours of egocentric and exocentric, multimodal, and multipositional data, featuring eight distinct data types and more than 200 independent information channels. The dataset is benchmarked across three human activity recognition tasks: activity classification, open vocabulary captioning, and cross-modal alignment.


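[28] TruckDrive: Long-Range Autonomous Highway Driving Dataset cs.CV

Filippo Ghilotti, Edoardo Palladin, Samuel Brucker, Adam Sigal, Mario Bijelic

TL;DR: This paper introduces TruckDrive, a large-scale multimodal dataset designed for long-range autonomous highway driving of heavy trucks. It contains 475 thousand samples, of which 165 thousand frames are densely annotated, with perception ranges of up to 1,000 meters (2D detection) and 400 meters (3D detection and other tasks), addressing the short perception range (typically under 100 meters) of existing datasets. The study finds that current state-of-the-art autonomous driving models degrade sharply (by 31% to 99%) on perception tasks beyond 150 meters, exposing a systematic weakness of existing methods.

Details

Motivation: Because of their long braking distances, heavy trucks need scene understanding over hundreds of meters for anticipatory planning and safe braking margins, but existing driving datasets mainly cover short-range urban scenes (under 100 meters) and cannot meet the needs of long-range highway autonomy.

Result: Evaluation on TruckDrive shows that current state-of-the-art autonomous driving models lose between 31% and 99% of their performance on long-range (beyond 150 meters) 3D perception tasks and fail to generalize to long ranges, revealing a systematic gap in existing architectures and training signals.

Insight: The contribution is releasing the first large-scale multimodal dataset focused on long-range highway perception, captured with a sensor suite built for long-range sensing (long-range FMCW LiDARs, high-resolution short-range LiDARs, cameras with varying focal lengths, and 4D FMCW radars), and quantitatively exposing the severe long-range performance bottleneck of current SOTA models, which gives future long-range autonomous driving research a key benchmark and direction.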

Abstract: Safe highway autonomy for heavy trucks remains an open and unsolved challenge: due to long braking distances, scene understanding of hundreds of meters is required for anticipatory planning and to allow safe braking margins. However, existing driving datasets primarily cover urban scenes, with perception effectively limited to short ranges of only up to 100 meters. To address this gap, we introduce TruckDrive, a highway-scale multimodal driving dataset, captured with a sensor suite purpose-built for long-range sensing: seven long-range FMCW LiDARs measuring range and radial velocity, three high-resolution short-range LiDARs, eleven 8MP surround cameras with varying focal lengths and ten 4D FMCW radars. The dataset offers 475 thousand samples with 165 thousand densely annotated frames for driving perception benchmarking up to 1,000 meters for 2D detection and 400 meters for 3D detection, depth estimation, tracking, planning and end-to-end driving over 20-second sequences at highway speeds. We find that state-of-the-art autonomous driving models do not generalize to ranges beyond 150 meters, with drops between 31% and 99% in 3D perception tasks, exposing a systematic long-range gap that current architectures and training signals cannot close.
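
The claim that braking distance drives the need for long-range perception can be checked with the standard kinematic stopping-distance formula, d = v·t + v²/(2a). The numbers below are illustrative assumptions, not values from the paper: 90 km/h highway speed, 1.5 s of combined perception and actuation latency, and a constant deceleration of 3 m/s² for a loaded truck.

```python
def stopping_distance(speed_mps: float, latency_s: float, decel_mps2: float) -> float:
    """Distance travelled during latency plus distance under constant braking."""
    return speed_mps * latency_s + speed_mps ** 2 / (2.0 * decel_mps2)


# Assumed scenario (hypothetical values, chosen only for illustration).
speed = 90 / 3.6                       # 90 km/h expressed in m/s
distance = stopping_distance(speed, latency_s=1.5, decel_mps2=3.0)
print(f"stopping distance: {distance:.1f} m")   # roughly 142 m
```

Under these assumptions the stopping distance alone already exceeds the roughly 100 meters covered by existing datasets, before any margin for planning is added. Higher speeds or weaker deceleration increase it further, since the braking term grows with the square of the speed.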


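[29] DINOv3 Visual Representations for Blueberry Perception Toward Robotic Harvesting cs.CV

Rui-Feng Wang, Daniel Petti, Yue Chen, Changying Li

TL;DR: This paper evaluates the DINOv3 vision foundation model as a frozen backbone for vision tasks related to robotic blueberry harvesting, including fruit and bruise segmentation and fruit and cluster detection. Segmentation benefits consistently from its patch-level representations, whereas detection is constrained by target scale variation, patch discretization, and localization compatibility, with cluster detection proving especially difficult.

Details

Motivation: The practical role and performance limits of vision foundation models trained with large-scale self-supervised learning (such as DINOv3) in agricultural settings, particularly robotic blueberry harvesting, are not yet well understood.

Result: Under a unified protocol with lightweight decoders, segmentation performance improves steadily as the backbone scales up. Detection, especially cluster detection, remains limited and falls short of the desired level.

Insight: The contribution is a systematic evaluation of DINOv3's suitability for agricultural vision tasks. It concludes that DINOv3 is better used as a semantic backbone than as an end-to-end task model, and that its effectiveness depends on downstream spatial modeling aligned with fruit scale and aggregation structure, which offers guidance for designing vision systems for blueberry harvesting robots.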

Abstract: Vision Foundation Models trained via large-scale self-supervised learning have demonstrated strong generalization in visual perception; however, their practical role and performance limits in agricultural settings remain insufficiently understood. This work evaluates DINOv3 as a frozen backbone for blueberry robotic harvesting-related visual tasks, including fruit and bruise segmentation, as well as fruit and cluster detection. Under a unified protocol with lightweight decoders, segmentation benefits consistently from stable patch-level representations and scales with backbone size. In contrast, detection is constrained by target scale variation, patch discretization, and localization compatibility. The failure of cluster detection highlights limitations in modeling relational targets defined by spatial aggregation. Overall, DINOv3 is best viewed not as an end-to-end task model, but as a semantic backbone whose effectiveness depends on downstream spatial modeling aligned with fruit-scale and aggregation structures, providing guidance for blueberry robotic harvesting. Code and dataset will be available upon acceptance.


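[30] MIRAGE: Knowledge Graph-Guided Cross-Cohort MRI Synthesis for Alzheimer’s Disease Prediction cs.CV | cs.AI

Guanchen Wu, Zhe Huang, Yuzhang Xie, Runze Yan, Akul Chopra

TL;DR: This paper proposes MIRAGE, a framework that uses knowledge graph-guided cross-cohort MRI synthesis to handle missing MRI modalities in Alzheimer's disease (AD) prediction. It reframes the missing-MRI problem as an anatomy-guided cross-modal latent distillation task, using a biomedical knowledge graph and a pre-trained 3D U-Net decoder to extract a structured diagnostic-surrogate representation from electronic health records (EHR), which avoids expensive 3D voxel reconstruction and improves AD classification.

Details

Motivation: The paper targets the deployment bottleneck in AD diagnosis that arises because multimodal assessment combines structural MRI with EHR, while MRI is expensive and frequently missing. It also seeks to avoid the technical challenges and clinical risks of synthesizing entirely new 3D anatomical scans from sparse, high-dimensional tabular records.

Result: In cohorts without real MRIs, MIRAGE bridges the missing-modality gap and improves the AD classification rate by 13% over unimodal baselines.

Insight: The contributions are: reframing the missing-MRI problem as an anatomy-guided cross-modal latent distillation task; using a biomedical knowledge graph and Graph Attention Networks to embed EHR variables in a unified space; and using a frozen pre-trained 3D U-Net decoder as an auxiliary regularization engine, together with a cohort-aggregated skip feature compensation strategy, to force 1D latent representations to encode biologically plausible, macro-level pathological semantics, thereby bypassing computationally expensive 3D voxel reconstruction entirely.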

Abstract: Reliable Alzheimer’s disease (AD) diagnosis increasingly relies on multimodal assessments combining structural Magnetic Resonance Imaging (MRI) and Electronic Health Records (EHR). However, deploying these models is bottlenecked by modality missingness, as MRI scans are expensive and frequently unavailable in many patient cohorts. Furthermore, synthesizing de novo 3D anatomical scans from sparse, high-dimensional tabular records is technically challenging and poses severe clinical risks. To address this, we introduce MIRAGE, a novel framework that reframes the missing-MRI problem as an anatomy-guided cross-modal latent distillation task. First, MIRAGE leverages a Biomedical Knowledge Graph (KG) and Graph Attention Networks to map heterogeneous EHR variables into a unified embedding space that can be propagated from cohorts with real MRIs to cohorts without them. To bridge the semantic gap and enforce physical spatial awareness, we employ a frozen pre-trained 3D U-Net decoder strictly as an auxiliary regularization engine. Supported by a novel cohort-aggregated skip feature compensation strategy, this decoder acts as a rigorous structural penalty, forcing 1D latent representations to encode biologically plausible, macro-level pathological semantics. By exclusively utilizing this distilled “diagnostic-surrogate” representation during inference, MIRAGE completely bypasses computationally expensive 3D voxel reconstruction. Experiments demonstrate that our framework successfully bridges the missing-modality gap, improving the AD classification rate by 13% compared to unimodal baselines in cohorts without real MRIs.


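[31] ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering cs.CV

Aymen Lassoued, Mohamed Ali Souibgui, Yousri Kessentini

TL;DR: ORCA is a multi-agent collaborative reasoning framework for Document Visual Question Answering (DocVQA). It decomposes complex questions into logical steps and activates task-specific agents from a specialized agent dock for fine-grained processing and collaborative reasoning, combined with a debate mechanism and a format checker to improve answer reliability.

Details

Motivation: Existing vision-language models (VLMs) struggle with complex reasoning and multi-step workflows in DocVQA: they have difficulty decomposing complex questions into manageable sub-tasks and cannot exploit specialized processing paths for different document elements.

Result: Extensive experiments on three benchmarks show significant improvements over state-of-the-art (SOTA) methods. The authors present this as a new paradigm for collaborative agent systems in vision-language reasoning.

Insight: The novelty lies in structured reasoning through orchestrated collaborative agents, comprising query decomposition, task routing, collaboration among modality-specific agents, reliability verification through debate and stress-testing, and format consistency checking, which together enable fine-grained document understanding and robust answer generation.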

Abstract: Document Visual Question Answering (DocVQA) remains challenging for existing Vision-Language Models (VLMs), especially under complex reasoning and multi-step workflows. Current approaches struggle to decompose intricate questions into manageable sub-tasks and often fail to leverage specialized processing paths for different document elements. We present ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering, a novel multi-agent framework that addresses these limitations through strategic agent coordination and iterative refinement. ORCA begins with a reasoning agent that decomposes queries into logical steps, followed by a routing mechanism that activates task-specific agents from a specialized agent dock. Our framework leverages a set of specialized AI agents, each dedicated to a distinct modality, enabling fine-grained understanding and collaborative reasoning across diverse document components. To ensure answer reliability, ORCA employs a debate mechanism with stress-testing, and when necessary, a thesis-antithesis adjudication process. This is followed by a sanity checker to ensure format consistency. Extensive experiments on three benchmarks demonstrate that our approach achieves significant improvements over state-of-the-art methods, establishing a new paradigm for collaborative agent systems in vision-language reasoning.


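[32] Deep Learning Based Wildfire Detection for Peatland Fires Using Transfer Learning cs.CV | cs.AI

Emadeldeen Hamdan, Ahmad Faiz Tharima, Mohd Zahirasri Mohd Tohir, Dayang Nur Sakinah Musa, Erdem Koyuncu

TL;DR: This paper proposes a transfer learning-based method for peatland fire detection. It starts from a pretrained general wildfire detection model and fine-tunes it on the distinctive visual characteristics of peatland fires (such as smoldering, low flame intensity, persistent smoke, and subsurface burning), addressing the limited effectiveness of conventional wildfire detectors in peatland scenes. The method achieves higher detection accuracy and robustness with limited labeled peatland fire data.

Details

Motivation: Conventional deep learning wildfire detectors are trained mainly on open-flame forest fires, whereas peatland fires have distinctive characteristics such as smoldering and low flame intensity, so existing detectors perform poorly. The paper uses transfer learning to adapt general wildfire detection knowledge to the peatland fire domain, overcoming data scarcity and feature mismatch.

Result: Experiments show that transfer learning significantly improves detection accuracy and robustness over training from scratch, particularly under challenging conditions such as low-contrast smoke, partial occlusion, and variable illumination. Results are based on a dataset of Malaysian peatland images and videos; no quantitative comparison with a specific benchmark or SOTA model is reported.

Insight: The novelty is applying transfer learning to the specific domain of peatland fires, using a pretrained model to mitigate data scarcity and adapting to the distinctive physical and visual characteristics of peatland fires. Viewed objectively, the method offers a scalable solution for few-shot, domain-specific fire detection and underlines the importance of domain adaptation in environmental monitoring.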

Abstract: Machine learning (ML)-based wildfire detection methods have been developed in recent years, primarily using deep learning (DL) models trained on large collections of wildfire images and videos. However, peatland fires exhibit distinct visual and physical characteristics – such as smoldering combustion, low flame intensity, persistent smoke, and subsurface burning – that limit the effectiveness of conventional wildfire detectors trained on open-flame forest fires. In this work, we present a transfer learning-based approach for peatland fire detection that leverages knowledge learned from general wildfire imagery and adapts it to the peatland fire domain. We initialize a DL-based peatland fire detector using pretrained weights from a conventional wildfire detection model and subsequently fine-tune the network using a dataset composed of Malaysian peatland images and videos. This strategy enables effective learning despite the limited availability of labeled peatland fire data. Experimental results demonstrate that transfer learning significantly improves detection accuracy and robustness compared to training from scratch, particularly under challenging conditions such as low-contrast smoke, partial occlusions, and variable illumination. The proposed approach provides a practical and scalable solution for early peatland fire detection and has the potential to support real-time monitoring systems for fire prevention and environmental protection.


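[33] Large-Scale Dataset and Benchmark for Skin Tone Classification in the Wild cs.CV | cs.LG

Vitor Pereira Matias, Márcus Vinícius Lobo Costa, João Batista Neto, Tiago Novello de Brito

TL;DR: This paper presents a comprehensive framework for skin tone classification and fairness assessment, including STW, a large-scale open dataset (42,313 images labeled with the 10-tone MST scale), and compares classic computer vision and deep learning methods. Classic methods give near-random results, while deep learning approaches annotator-level accuracy. Finally, the authors propose SkinToneNet, which achieves state-of-the-art generalization on out-of-domain data and can be used for reliable fairness auditing of public datasets such as CelebA.

Details

Motivation: Existing skin tone research lacks fine-grained, annotated, large-scale datasets. It often relies on the Fitzpatrick scale, which lacks visual representativeness, or on small private datasets, and suffers from train-test leakage and data imbalance, all of which limit fairness assessment.

Result: On the proposed STW dataset, the classic computer vision approach (SkinToneCCV) performs near random, while deep learning approaches reach nearly annotator-level accuracy. The proposed SkinToneNet (a fine-tuned ViT) achieves state-of-the-art generalization on out-of-domain data.

Insight: The contributions are: 1) STW, a large-scale open dataset labeled with the finer-grained 10-tone MST scale; 2) a systematic comparison of classic CV and deep learning for skin tone classification; 3) SkinToneNet, which reaches SOTA out-of-domain generalization and provides a practical tool for dataset fairness auditing.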

Abstract: Deep learning models often inherit biases from their training data. While fairness across gender and ethnicity is well-studied, fine-grained skin tone analysis remains a challenge due to the lack of granular, annotated datasets. Existing methods often rely on the medical 6-tone Fitzpatrick scale, which lacks visual representativeness, or use small, private datasets that prevent reproducibility. Most rely on classic computer vision pipelines, with only a few using deep learning. They overlook issues like train-test leakage and dataset imbalance, and are limited by small or unavailable datasets. In this work, we present a comprehensive framework for skin tone fairness. First, we introduce the STW, a large-scale, open-access dataset comprising 42,313 images from 3,564 individuals, labeled using the 10-tone MST scale. Second, we benchmark both Classic Computer Vision (SkinToneCCV) and Deep Learning approaches, demonstrating that classic models provide near-random results, while deep learning reaches nearly annotator accuracy. Finally, we propose SkinToneNet, a fine-tuned ViT that achieves state-of-the-art generalization on out-of-domain data, which enables reliable fairness auditing of public datasets like CelebA and VGGFace2. This work provides state-of-the-art results in skin tone classification and fairness assessment. Code and data will be available soon.


[34] E2E-GNet: An End-to-End Skeleton-based Geometric Deep Neural Network for Human Motion Recognition cs.CV

Mubarak Olaoluwa, Hassen Drira

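TL;DR: This paper proposes E2E-GNet, an end-to-end geometric deep neural network for skeleton-based human motion recognition. The network introduces a geometric transformation layer that jointly optimizes skeleton motion sequences in a non-Euclidean space and projects them onto a linear space with a differentiable logarithm map activation. A distortion-aware optimization layer then limits the skeleton shape distortion caused by this projection, preserving discriminative geometric cues and improving the recognition rate.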

Details

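Motivation: Geometric deep learning has drawn attention in computer vision because it captures meaningful representations of data lying in non-Euclidean spaces. For skeleton-based motion recognition, this paper asks how to strengthen the separation between different motions in the non-Euclidean space and how to handle the shape distortion introduced by projection.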

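Result: Ablation studies verify the contribution of each layer, and extensive experiments on five datasets spanning three domains show that E2E-GNet outperforms other methods at lower cost.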

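Insight: The novelty lies in combining a geometric transformation layer with a distortion-aware optimization layer, which gives end-to-end optimization and projection of skeleton sequences in non-Euclidean space while controlling shape distortion so that key geometric information is kept. This offers a reusable joint-optimization framework for non-Euclidean data.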

Abstract: Geometric deep learning has recently gained significant attention in the computer vision community for its ability to capture meaningful representations of data lying in a non-Euclidean space. To this end, we propose E2E-GNet, an end-to-end geometric deep neural network for skeleton-based human motion recognition. To enhance the discriminative power between different motions in the non-Euclidean space, E2E-GNet introduces a geometric transformation layer that jointly optimizes skeleton motion sequences on this space and applies a differentiable logarithm map activation to project them onto a linear space. Building on this, we further design a distortion-aware optimization layer that limits skeleton shape distortions caused by this projection, enabling the network to retain discriminative geometric cues and achieve a higher motion recognition rate. We demonstrate the impact of each layer through ablation studies, and extensive experiments across five datasets spanning three domains show that E2E-GNet outperforms other methods at lower cost.
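To make the logarithm-map projection concrete, here is a minimal sketch on the unit hypersphere. The sphere is only an assumed stand-in: the abstract does not say which manifold E2E-GNet uses, and `log_map` and `exp_map` are illustrative names, not the paper's layers.

```python
import math


def dot(a, b):
    """Euclidean inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def log_map(p, q, eps=1e-12):
    """Tangent vector at p pointing toward q on the unit sphere (assumed manifold)."""
    c = max(-1.0, min(1.0, dot(p, q)))  # clamp to guard against rounding error
    theta = math.acos(c)                # geodesic distance between p and q
    u = [qi - c * pi for pi, qi in zip(p, q)]  # part of q orthogonal to p
    norm_u = math.sqrt(dot(u, u))
    if norm_u < eps:
        # q coincides with p; the antipodal case is undefined and also lands here
        return [0.0] * len(p)
    return [theta * ui / norm_u for ui in u]


def exp_map(p, v, eps=1e-12):
    """Inverse of log_map: walk from p along tangent vector v back onto the sphere."""
    norm_v = math.sqrt(dot(v, v))
    if norm_v < eps:
        return list(p)
    return [math.cos(norm_v) * pi + math.sin(norm_v) * vi / norm_v
            for pi, vi in zip(p, v)]


p = [1.0, 0.0, 0.0]
q = [0.0, 1.0, 0.0]
v = log_map(p, q)      # lives in the flat tangent space at p
back = exp_map(p, v)   # recovers q up to rounding
```

The tangent vector `v` can be fed to ordinary linear layers, which is the reason for projecting in the first place. The projection distorts distances between points far from `p`, which is the kind of distortion the abstract's second layer is meant to limit.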


[35] ModalPatch: A Plug-and-Play Module for Robust Multi-Modal 3D Object Detection under Modality Drop cs.CV

Shuangzhi Li, Lei Ma, Xingyu Li

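TL;DR: This paper proposes ModalPatch, a plug-and-play module that improves the robustness of multi-modal 3D object detection under modality drop. The module exploits the temporal continuity of sensor data, using history to predict and compensate for transiently missing features, and introduces an uncertainty-guided cross-modality fusion strategy that dynamically estimates the reliability of the compensated features. It integrates into existing frameworks without architectural changes or retraining.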

Details

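Motivation: In autonomous driving, hardware glitches, adverse weather, or occlusions can cause sensor modalities to drop momentarily, which undermines the reliability of multi-modal 3D object detection. The goal is to avoid the risk of the vehicle being momentarily "blind" when modalities drop simultaneously.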

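Result: Extensive experiments show that ModalPatch consistently improves the robustness and accuracy of state-of-the-art (SOTA) 3D object detectors under a variety of modality-drop conditions.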

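Insight: The novelty is a first plug-and-play robustness module that compensates features using temporal continuity and fuses cross-modal information dynamically based on uncertainty estimates, giving multi-modal systems a lightweight way to cope with real-world uncertainty.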

Abstract: Multi-modal 3D object detection is pivotal for autonomous driving, integrating complementary sensors like LiDAR and cameras. However, its real-world reliability is challenged by transient data interruptions and missing data, where modalities can momentarily drop due to hardware glitches, adverse weather, or occlusions. This poses a critical risk, especially during a simultaneous modality drop, where the vehicle is momentarily blind. To address this problem, we introduce ModalPatch, the first plug-and-play module designed to enable robust detection under arbitrary modality-drop scenarios. Without requiring architectural changes or retraining, ModalPatch can be seamlessly integrated into diverse detection frameworks. Technically, ModalPatch leverages the temporal nature of sensor data for perceptual continuity, using a history-based module to predict and compensate for transiently unavailable features. To improve the fidelity of the predicted features, we further introduce an uncertainty-guided cross-modality fusion strategy that dynamically estimates the reliability of compensated features, suppressing biased signals while reinforcing informative ones. Extensive experiments show that ModalPatch consistently enhances both robustness and accuracy of state-of-the-art 3D object detectors under diverse modality-drop conditions.
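The two ideas in the abstract, predicting a dropped feature from history and then down-weighting it according to its uncertainty, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: linear extrapolation stands in for the learned history-based module, inverse-variance weighting stands in for the uncertainty-guided fusion, and both modalities are assumed to live in a shared feature space. The function names are hypothetical.

```python
def extrapolate(prev2, prev1):
    """Predict the next feature vector by linear extrapolation from the last two frames.

    Stand-in for a learned history-based predictor (assumption).
    """
    return [2.0 * b - a for a, b in zip(prev2, prev1)]


def inverse_variance_fuse(feat_a, var_a, feat_b, var_b):
    """Fuse two feature vectors, weighting each by the inverse of its variance.

    A classic uncertainty-weighted average, used here in place of the
    paper's fusion strategy (assumption).
    """
    w_a = 1.0 / var_a
    w_b = 1.0 / var_b
    total = w_a + w_b
    return [(w_a * a + w_b * b) / total for a, b in zip(feat_a, feat_b)]


# The camera feature for the current frame is missing: predict it from history.
camera_pred = extrapolate([1.0, 2.0], [2.0, 4.0])

# The prediction is less trustworthy (variance 9) than the observed LiDAR
# feature (variance 1), so it contributes only 10% of the fused value.
lidar_obs = [0.0, 0.0]
fused = inverse_variance_fuse(camera_pred, 9.0, lidar_obs, 1.0)
```

The point of the weighting is that a compensated feature never counts as much as a real observation unless its estimated uncertainty is comparably low.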


[36] WTHaar-Net: a Hybrid Quantum-Classical Approach cs.CV

Vittorio Palladino, Tsai Idden, Ahmet Enis Cetin

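TL;DR: This paper proposes WTHaar-Net, a hybrid quantum-classical convolutional neural network that replaces the Hadamard Transform used in prior hybrid architectures with the Haar Wavelet Transform (HWT). The HWT provides spatially localized, multi-resolution representations that better match the inductive biases of vision tasks, and its quantum realization can be built from structured Hadamard gates. Experiments on CIFAR-10 and Tiny-ImageNet show that the method substantially reduces the parameter count while maintaining competitive accuracy, and that it outperforms ResNet and Hadamard-based baselines on Tiny-ImageNet. The quantum implementation was validated on IBM Quantum cloud hardware.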

Details

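Motivation: Shallow quantum circuits can implement certain structured linear transforms, and the paper aims to use this to enhance deep learning models. Specifically, it improves existing hybrid quantum-classical architectures by introducing the Haar Wavelet Transform, which fits the characteristics of vision tasks better.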

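Result: On the CIFAR-10 and Tiny-ImageNet benchmarks, WTHaar-Net substantially reduces the parameter count while maintaining competitive accuracy. On Tiny-ImageNet it outperforms ResNet and Hadamard-based baselines, reaching an advanced level of performance. The quantum implementation was run on IBM Quantum cloud hardware, confirming compatibility with near-term quantum devices.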

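Insight: The novelty is bringing the Haar Wavelet Transform into a hybrid quantum-classical network architecture. Its spatial localization and multi-resolution properties suit vision tasks better, and the paper provides a feasible quantum circuit decomposition. Viewed objectively, this offers a new and more effective choice of transform domain for enhancing classical models with quantum computing.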

Abstract: Convolutional neural networks rely on linear filtering operations that can be reformulated efficiently in suitable transform domains. At the same time, advances in quantum computing have shown that certain structured linear transforms can be implemented with shallow quantum circuits, opening the door to hybrid quantum-classical approaches for enhancing deep learning models. In this work, we introduce WTHaar-Net, a convolutional neural network that replaces the Hadamard Transform used in prior hybrid architectures with the Haar Wavelet Transform (HWT). Unlike the Hadamard Transform, the Haar transform provides spatially localized, multi-resolution representations that align more closely with the inductive biases of vision tasks. We show that the HWT admits a quantum realization using structured Hadamard gates, enabling its decomposition into unitary operations suitable for quantum circuits. Experiments on CIFAR-10 and Tiny-ImageNet demonstrate that WTHaar-Net achieves substantial parameter reduction while maintaining competitive accuracy. On Tiny-ImageNet, our approach outperforms both ResNet and Hadamard-based baselines. We validate the quantum implementation on IBM Quantum cloud hardware, demonstrating compatibility with near-term quantum devices.
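As a reminder of what the Haar transform computes, the following sketch applies one level of the orthonormal 1-D Haar transform and inverts it. It is a classical illustration only and does not reproduce the paper's network or its quantum circuit; `haar_step` and `haar_step_inverse` are names chosen here.

```python
import math


def haar_step(x):
    """One level of the orthonormal 1-D Haar transform.

    Each adjacent pair (a, b) maps to ((a + b) / sqrt(2), (a - b) / sqrt(2)),
    i.e. the 2x2 Hadamard matrix applied pairwise.
    """
    if len(x) % 2 != 0:
        raise ValueError("input length must be even")
    s = 1.0 / math.sqrt(2.0)
    approx = [s * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    detail = [s * (x[i] - x[i + 1]) for i in range(0, len(x), 2)]
    return approx, detail


def haar_step_inverse(approx, detail):
    """Undo haar_step; the pairwise 2x2 matrix is its own inverse."""
    s = 1.0 / math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.extend([s * (a + d), s * (a - d)])
    return x


signal = [4.0, 2.0, 5.0, 5.0]
approx, detail = haar_step(signal)
restored = haar_step_inverse(approx, detail)
```

Each output coefficient depends on only two neighbouring inputs, which is the spatial locality the abstract contrasts with the Hadamard Transform, where every coefficient mixes all inputs. Applying `haar_step` again to `approx` gives the next, coarser resolution level. The pairwise 2x2 matrix is the same matrix as the single-qubit Hadamard gate, which is consistent with the abstract's remark that the HWT can be realized with structured Hadamard gates.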


[37] SGMA: Semantic-Guided Modality-Aware Segmentation for Remote Sensing with Incomplete Multimodal Data cs.CV

Lekang Wen, Liang Liao, Jing Xiao, Mi Wang

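TL;DR: This paper proposes the Semantic-Guided Modality-Aware (SGMA) framework for semantic segmentation of remote sensing imagery with incomplete multimodal data. Two plug-and-play modules, Semantic-Guided Fusion (SGF) and Modality-Aware Sampling (MAS), balance multimodal learning, reduce intra-class variation, and reconcile cross-modal inconsistencies.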

Details

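Motivation: Incomplete Multimodal Semantic Segmentation (IMSS) faces three key challenges: multimodal imbalance, intra-class variation across modalities, and inconsistent semantic responses caused by cross-modal heterogeneity. Existing methods suffer from over-alignment, discarded modality-specific cues, or imbalanced training.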

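Result: Extensive experiments across multiple datasets and backbones show that SGMA consistently outperforms state-of-the-art methods, with especially large gains on fragile modalities.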

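Insight: The novelty is to guide fusion with class-wise semantic prototypes that are consistent across modalities, and to weight modalities adaptively by a robustness score estimated from prototype-feature alignment. The same robustness estimates drive dynamic reweighting of training samples, prioritizing hard samples from fragile modalities, so that the three IMSS challenges are addressed systematically.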

Abstract: Multimodal semantic segmentation integrates complementary information from diverse sensors for remote sensing Earth observation. However, practical systems often encounter missing modalities due to sensor failures or incomplete coverage, termed Incomplete Multimodal Semantic Segmentation (IMSS). IMSS faces three key challenges: (1) multimodal imbalance, where dominant modalities suppress fragile ones; (2) intra-class variation in scale, shape, and orientation across modalities; and (3) cross-modal heterogeneity with conflicting cues producing inconsistent semantic responses. Existing methods rely on contrastive learning or joint optimization, which risk either over-alignment that discards modality-specific cues or imbalanced training that favors robust modalities, while largely overlooking intra-class variation and cross-modal heterogeneity. To address these limitations, we propose the Semantic-Guided Modality-Aware (SGMA) framework, which ensures balanced multimodal learning while reducing intra-class variation and reconciling cross-modal inconsistencies through semantic guidance. SGMA introduces two complementary plug-and-play modules: (1) the Semantic-Guided Fusion (SGF) module extracts multi-scale, class-wise semantic prototypes that capture consistent categorical representations across modalities, estimates per-modality robustness based on prototype-feature alignment, and performs adaptive fusion weighted by robustness scores to mitigate intra-class variation and cross-modal heterogeneity; (2) the Modality-Aware Sampling (MAS) module leverages robustness estimations from SGF to dynamically reweight training samples, prioritizing challenging samples from fragile modalities to address modality imbalance. Extensive experiments across multiple datasets and backbones demonstrate that SGMA consistently outperforms state-of-the-art methods, with particularly significant improvements in fragile modalities.
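The SGF idea of scoring each modality by how well its feature aligns with a class prototype, then fusing with those scores, can be sketched as below. This is a simplified illustration under stated assumptions: cosine similarity as the alignment measure, a softmax with a hypothetical `temperature` parameter to turn scores into weights, and a single prototype and feature vector per modality instead of multi-scale maps.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def robustness_weights(features, prototype, temperature=0.1):
    """Softmax over per-modality prototype alignment (assumed scoring rule)."""
    scores = [cosine(f, prototype) / temperature for f in features]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def fuse(features, weights):
    """Weighted sum of per-modality feature vectors."""
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


prototype = [1.0, 0.0]    # class prototype
optical = [1.0, 0.0]      # well aligned with the prototype
sar = [0.0, 1.0]          # poorly aligned, e.g. a degraded modality
weights = robustness_weights([optical, sar], prototype)
fused = fuse([optical, sar], weights)
```

The same weights could in principle feed a sampler that up-weights training samples from the low-scoring modality, which is the role the abstract assigns to MAS.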


[38] NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining cs.CV

Liang Zeng, Valerio Marsocci, Wufan Zhao, Andrea Nascetti, Maarten Vergauwen

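TL;DR: This paper proposes NeighborMAE, a masked autoencoder method for self-supervised pretraining on Earth Observation images. It exploits spatial dependencies by jointly reconstructing neighboring images, and uses a heuristic strategy that dynamically adjusts the mask ratio and the pixel-level loss weight to keep reconstruction challenging.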

Details

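Motivation: Existing Masked Image Modeling methods for Earth Observation mostly focus on multi-modal or multi-temporal data and overlook the spatial dependencies between neighboring images. Because the Earth's surface is continuous, neighboring images are highly related and provide rich context for self-supervised learning.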

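Result: Experiments across different pretraining datasets and downstream tasks show that NeighborMAE significantly outperforms existing baselines, highlighting the value of neighboring images in Masked Image Modeling for Earth Observation and the effectiveness of the proposed design.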

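Insight: The novelty is bringing the spatial dependencies between neighboring images into masked autoencoder pretraining for Earth Observation for the first time. Joint reconstruction and the dynamic adjustment strategy strengthen representation learning and suggest a new way to exploit geographic continuity.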

Abstract: Masked Image Modeling has been one of the most popular self-supervised learning paradigms to learn representations from large-scale, unlabeled Earth Observation images. While incorporating multi-modal and multi-temporal Earth Observation data into Masked Image Modeling has been widely explored, the spatial dependencies between images captured from neighboring areas remain largely overlooked. Since the Earth’s surface is continuous, neighboring images are highly related and offer rich contextual information for self-supervised learning. To close this gap, we propose NeighborMAE, which learns spatial dependencies by joint reconstruction of neighboring Earth Observation images. To ensure that the reconstruction remains challenging, we leverage a heuristic strategy to dynamically adjust the mask ratio and the pixel-level loss weight. Experimental results across various pretraining datasets and downstream tasks show that NeighborMAE significantly outperforms existing baselines, underscoring the value of neighboring images in Masked Image Modeling for Earth Observation and the efficacy of our designs.


[39] EIMC: Efficient Instance-aware Multi-modal Collaborative Perception cs.CV

Kang Yang, Peng Wang, Lantao Li, Tianci Bu, Chen Sun

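TL;DR: EIMC is an efficient instance-aware multi-modal collaborative perception framework for autonomous driving. It adopts an early collaborative paradigm that injects lightweight collaborative voxels to produce compact 3D collaborative priors, uses a heatmap-driven protocol to identify the regions that need cooperation, and transmits only the Top-K instance vectors, which sharply reduces bandwidth requirements.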

Details

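Motivation: Current multi-modal collaborative perception methods follow a "local fusion, then communication" order and must transmit large amounts of feature data before collaborative fusion, which demands high bandwidth. EIMC aims to cut communication redundancy substantially through early collaboration and instance-level messaging while still recovering critical occluded objects.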

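Result: On the OPV2V and DAIR-V2X benchmarks, EIMC reaches 73.01% AP@0.5 while reducing byte bandwidth usage by 87.98% compared with the best published multi-modal collaborative detector.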

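Insight: The contributions are: 1) an early collaborative paradigm that injects collaborative voxels to produce 3D collaborative priors and tighten cross-modal alignment; 2) a heatmap-driven consensus protocol that pinpoints low-confidence, high-discrepancy regions so that cooperation happens on demand; 3) an instance-centric messaging mechanism that transmits only the key instance vectors, greatly reducing communication overhead while preserving performance. Viewed objectively, the work refines collaboration from the feature level to the instance level and is an effective exploration of the trade-off between communication efficiency and perception accuracy.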

Abstract: Multi-modal collaborative perception calls for great attention to enhancing the safety of autonomous driving. However, current multi-modal approaches still follow a “local fusion to communication” sequence, which fuses multi-modal data locally and needs high bandwidth to transmit an individual’s feature data before collaborative fusion. EIMC innovatively proposes an early collaborative paradigm. It injects lightweight collaborative voxels, transmitted by neighbor agents, into the ego’s local modality-fusion step, yielding compact yet informative 3D collaborative priors that tighten cross-modal alignment. Next, a heatmap-driven consensus protocol identifies exactly where cooperation is needed by computing per-pixel confidence heatmaps. Only the Top-K instance vectors located in these low-confidence, high-discrepancy regions are queried from peers, then fused via cross-attention for completion. Afterwards, we apply a refinement fusion that involves collecting the top-K most confident instances from each agent and enhancing their features using self-attention. The above instance-centric messaging reduces redundancy while guaranteeing that critical occluded objects are recovered. Evaluated on OPV2V and DAIR-V2X, EIMC attains 73.01% AP@0.5 while reducing byte bandwidth usage by 87.98% compared with the best published multi-modal collaborative detector. Code publicly released at https://github.com/sidiangongyuan/EIMC.
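The heatmap-driven selection step can be pictured with a small sketch: given the ego agent's and a peer's per-cell confidence heatmaps, pick the K cells where the ego is unsure and the two disagree most. The scoring rule `(1 - ego) * |ego - peer|` and the function name `select_query_cells` are assumptions made for illustration; the abstract does not give the actual formula.

```python
import heapq


def select_query_cells(ego_conf, peer_conf, k):
    """Return the k grid cells with the highest 'needs cooperation' score.

    Score (assumed): low ego confidence times ego/peer discrepancy.
    Heatmaps are equal-sized lists of rows with values in [0, 1].
    """
    scored = []
    for r, (ego_row, peer_row) in enumerate(zip(ego_conf, peer_conf)):
        for c, (e, p) in enumerate(zip(ego_row, peer_row)):
            score = (1.0 - e) * abs(e - p)
            scored.append((score, (r, c)))
    top = heapq.nlargest(k, scored, key=lambda item: item[0])
    return [cell for _, cell in top]


ego = [[0.9, 0.2],
       [0.5, 0.1]]
peer = [[0.9, 0.8],
        [0.5, 0.2]]
cells = select_query_cells(ego, peer, k=2)
```

Only the instance vectors at the returned cells would be requested from the peer, so the message size scales with K instead of with the full feature map. That is the source of the bandwidth saving the abstract reports.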


[40] On Discriminative vs. Generative classifiers: Rethinking MLLMs for Action Understanding cs.CV

Zhanzhong Pang, Dibyadip Chatterjee, Fadime Sener, Angela Yao

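TL;DR: This paper examines the performance gap between generative and discriminative classifiers built on Multimodal Large Language Models (MLLMs) for closed-set action understanding, and finds that the discriminative approach is better in both accuracy and efficiency. To narrow the gap, the authors propose strategies that improve generative classifiers, and further design a Generation-Assisted Discriminative (GAD) classifier that operates only during fine-tuning. It combines the strengths of both approaches and achieves state-of-the-art results on multiple benchmarks.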

Details

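Motivation: In closed-set action understanding with MLLMs, generative classifiers that autoregressively produce action labels as text are inefficient, and subwords shared across labels cause semantic overlap and ambiguity. The paper sets out to address both problems.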

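Result: On temporal action understanding benchmarks, GAD achieves state-of-the-art (SOTA) results on four tasks across five datasets, including an average accuracy gain of 2.5% and 3x faster inference on COIN, the largest benchmark.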

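Insight: The core novelty is the GAD classifier, which uses generative modeling to assist a discriminative classifier during fine-tuning only, improving accuracy and efficiency while staying fully compatible with MLLM pretraining. This shows that generative and discriminative approaches can complement each other on specific tasks and need not be opposed.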

Abstract: Multimodal Large Language Models (MLLMs) have advanced open-world action understanding and can be adapted as generative classifiers for closed-set settings by autoregressively generating action labels as text. However, this approach is inefficient, and shared subwords across action labels introduce semantic overlap, leading to ambiguity in generation. In contrast, discriminative classifiers learn task-specific representations with clear decision boundaries, enabling efficient one-step classification without autoregressive decoding. We first compare generative and discriminative classifiers with MLLMs for closed-set action understanding, revealing the superior accuracy and efficiency of the latter. To bridge the performance gap, we design strategies that elevate generative classifiers toward performance comparable with discriminative ones. Furthermore, we show that generative modeling can complement discriminative classifiers, leading to better performance while preserving efficiency. To this end, we propose the Generation-Assisted Discriminative (GAD) classifier for closed-set action understanding. GAD operates only during fine-tuning, preserving full compatibility with MLLM pretraining. Extensive experiments on temporal action understanding benchmarks demonstrate that GAD improves both accuracy and efficiency over generative methods, achieving state-of-the-art results on four tasks across five datasets, including an average 2.5% accuracy gain and 3x faster inference on our largest COIN benchmark.


[41] SemGS: Feed-Forward Semantic 3D Gaussian Splatting from Sparse Views for Generalizable Scene Understanding cs.CV

Sheng Ye, Zhen-Hui Dong, Ruoyu Fan, Tian Lv, Yong-Jin Liu

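TL;DR: This paper proposes SemGS, a feed-forward framework that reconstructs generalizable semantic fields from sparse image inputs. A dual-branch architecture extracts color and semantic features, a camera-aware attention mechanism models the geometric relationships between viewpoints, and the features are decoded into dual-Gaussians that share geometric consistency, from which semantic maps are synthesized under novel viewpoints.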

Details

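Motivation: Existing methods for semantic scene reconstruction and semantic-aware novel view synthesis usually depend on dense multi-view inputs and require scene-specific optimization, which limits their practicality and scalability in real applications. This paper aims to remove those limitations.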

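Result: Experiments show that SemGS achieves state-of-the-art performance on benchmark datasets, with fast inference and strong generalization across diverse synthetic and real-world scenes.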

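Insight: The contributions are: a dual-branch feature extractor with shared shallow CNN layers, which lets semantic reasoning use the textural and structural cues in color appearance; a camera-aware attention mechanism that explicitly models geometric relationships between camera viewpoints; and a regional smoothness loss that improves semantic coherence.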

Abstract: Semantic understanding of 3D scenes is essential for robots to operate effectively and safely in complex environments. Existing methods for semantic scene reconstruction and semantic-aware novel view synthesis often rely on dense multi-view inputs and require scene-specific optimization, limiting their practicality and scalability in real-world applications. To address these challenges, we propose SemGS, a feed-forward framework for reconstructing generalizable semantic fields from sparse image inputs. SemGS uses a dual-branch architecture to extract color and semantic features, where the two branches share shallow CNN layers, allowing semantic reasoning to leverage textural and structural cues in color appearance. We also incorporate a camera-aware attention mechanism into the feature extractor to explicitly model geometric relationships between camera viewpoints. The extracted features are decoded into dual-Gaussians that share geometric consistency while preserving branch-specific attributes, and further rasterized to synthesize semantic maps under novel viewpoints. Additionally, we introduce a regional smoothness loss to enhance semantic coherence. Experiments show that SemGS achieves state-of-the-art performance on benchmark datasets, while providing rapid inference and strong generalization capabilities across diverse synthetic and real-world scenarios.


[42] Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation cs.CV

Chonghua Lv, Dong Zhao, Shuang Wang, Dou Quan, Ning Huyan

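TL;DR: This paper proposes Generalizable Knowledge Distillation (GKD), a multi-stage framework that distills knowledge with strong generalization from vision foundation models (VFMs) for semantic segmentation. It decouples representation learning from task learning: the student first learns domain-agnostic representations, which are then frozen for task adaptation. Combined with a query-based soft distillation mechanism, this markedly improves generalization under distribution shift while keeping the model compressed.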

Details

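Motivation: Conventional knowledge distillation for semantic segmentation focuses on in-domain accuracy and neglects out-of-domain generalization, which is essential under distribution shift. When the teacher is a VFM with strong generalization, conventional methods often damage that robustness. The paper addresses how to distill a student that is both compact and able to generalize well.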

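Result: Extensive experiments on five domain generalization benchmarks show that GKD consistently outperforms existing knowledge distillation methods, with average gains of +1.9% in foundation-to-foundation (F2F) distillation and +10.6% in foundation-to-local (F2L) distillation.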

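Insight: The main contributions are: 1) decoupling distillation into two stages, domain-agnostic representation learning and task adaptation, and freezing the first-stage representations to mitigate overfitting to the visible domains; 2) a query-based soft distillation mechanism in which student features act as queries that actively and selectively retrieve transferable spatial knowledge from the teacher (VFM) representations. This suggests a new way to extract knowledge efficiently from large, robust pretrained models while retaining their generalization ability.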

Abstract: Knowledge distillation (KD) has been widely applied in semantic segmentation to compress large models, but conventional approaches primarily preserve in-domain accuracy while neglecting out-of-domain generalization, which is essential under distribution shifts. This limitation becomes more severe with the emergence of vision foundation models (VFMs): although VFMs exhibit strong robustness on unseen data, distilling them with conventional KD often compromises this ability. We propose Generalizable Knowledge Distillation (GKD), a multi-stage framework that explicitly enhances generalization. GKD decouples representation learning from task learning. In the first stage, the student acquires domain-agnostic representations through selective feature distillation, and in the second stage, these representations are frozen for task adaptation, thereby mitigating overfitting to visible domains. To further support transfer, we introduce a query-based soft distillation mechanism, where student features act as queries to teacher representations to selectively retrieve transferable spatial knowledge from VFMs. Extensive experiments on five domain generalization benchmarks demonstrate that GKD consistently outperforms existing KD methods, achieving average gains of +1.9% in foundation-to-foundation (F2F) and +10.6% in foundation-to-local (F2L) distillation. The code will be available at https://github.com/Younger-hua/GKD.
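The query-based mechanism, in which student features act as queries over teacher representations, resembles scaled dot-product attention. Below is a minimal sketch under that assumption. It uses the teacher tokens directly as both keys and values, with no learned projections, and `retrieve` is a hypothetical name. How GKD turns the retrieved feature into a training loss is not stated in the abstract, so that part is left out.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def retrieve(student_query, teacher_tokens):
    """Attend from one student feature to a set of teacher tokens.

    Scaled dot-product attention with the teacher tokens used as both keys
    and values (assumption: no learned projection matrices).
    Returns (attention weights, retrieved feature).
    """
    d = len(student_query)
    scores = [sum(q * t for q, t in zip(student_query, tok)) / math.sqrt(d)
              for tok in teacher_tokens]
    weights = softmax(scores)
    retrieved = [sum(w * tok[i] for w, tok in zip(weights, teacher_tokens))
                 for i in range(d)]
    return weights, retrieved


student = [1.0, 0.0]
teacher = [[4.0, 0.0],   # token aligned with the student query
           [0.0, 4.0]]   # unrelated token
weights, retrieved = retrieve(student, teacher)
```

Because the weights are soft, the student pulls mostly from the teacher locations it already relates to and only weakly from the rest. This is one plausible reading of retrieving knowledge selectively, as opposed to matching every teacher feature one-to-one.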


[43] Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs cs.CV | cs.AI | cs.CL | cs.LG

Zhiyu Pan, Yizheng Wu, Jiashen Hua, Junyi Feng, Shaotian Yan

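TL;DR: This paper proposes VC-STaR, a visual contrastive self-taught reasoning framework that targets visual hallucinations produced by vision language models during reasoning. It builds visually contrastive VQA pairs and uses the model's own contrastive ability to generate more accurate rationales, yielding VisCoR-55K, a 55K-sample visual reasoning dataset used to improve the reasoning of various VLMs.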

Details

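Motivation: Existing language-based self-improving techniques do not extend directly to vision language models, because hallucinations in the visual reasoning path cannot be effectively verified or corrected. The paper builds on a visual contrast observation: when a model is given a contrastive pair of two visually similar images with synonymous questions, it identifies the relevant visual cues more precisely. This effect is used to guide self-improvement and reduce hallucinations.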

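Result: Extensive experiments show that, across multiple VQA benchmarks, VC-STaR not only outperforms existing self-improving methods but also surpasses models fine-tuned on the current state-of-the-art visual reasoning datasets, reaching SOTA-level performance.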

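Insight: The core novelty is introducing a visual contrast mechanism into a self-improving framework for the first time, using the inherent contrastive ability of VLMs to produce more reliable rationales and so reduce hallucinations. This offers a route to better visual reasoning without strong external supervision, and the resulting VisCoR-55K dataset is a valuable resource in its own right.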

Abstract: Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent finetuning. However, extending these language-based self-improving approaches to vision language models (VLMs) presents a unique challenge: visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely. Motivated by this observation, we propose Visual Contrastive Self-Taught Reasoner (VC-STaR), a novel self-improving framework that leverages visual contrast to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Consequently, we obtain a new visual reasoning dataset, VisCoR-55K, which is then used to boost the reasoning capability of various VLMs through supervised finetuning. Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning. Project at: https://github.com/zhiyupan42/VC-STaR.


[44] CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment cs.CV | cs.AI

Maoyuan Shao, Yutong Gao, Xinyang Huang, Chuang Zhu, Lijuan Sun

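TL;DR: This paper proposes CAPT, a confusion-aware prompt tuning framework that addresses the systematic misclassification of visually and semantically similar categories by vision-language models such as CLIP. It builds a Confusion Bank to model stable confusion relationships between categories, introduces a Semantic Confusion Miner and a Sample Confusion Miner to capture global and sample-level confusion, and unifies confusion information across granularities with a Multi-Granularity Difference Expert module, reducing misclassification and improving discriminability and generalization.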

Details

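Motivation: Vision-language models such as CLIP show systematic misclassifications in cross-modal representation learning. These confusions are not random: they occur persistently between specific category pairs, which reflects the model's intrinsic bias and limited fine-grained discriminative ability.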

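Result: Extensive experiments on 11 benchmark datasets show that the method significantly reduces confusion-induced errors while improving the discriminability and generalization of both base and novel classes, resolving 50.72% of confusable sample pairs.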

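Insight: The novelty is a confusion-aware prompt tuning framework that explicitly models confusion relationships between categories and learns from misclassified samples. By combining semantic-level and sample-level mining with a multi-granularity expert module, it corrects the model's intrinsic bias and sharpens fine-grained discrimination.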

Abstract: Vision-language models like CLIP have achieved remarkable progress in cross-modal representation learning, yet suffer from systematic misclassifications among visually and semantically similar categories. We observe that such confusion patterns are not random but persistently occur between specific category pairs, revealing the model’s intrinsic bias and limited fine-grained discriminative ability. To address this, we propose CAPT, a Confusion-Aware Prompt Tuning framework that enables models to learn from their own misalignment. Specifically, we construct a Confusion Bank to explicitly model stable confusion relationships across categories and misclassified samples. On this basis, we introduce a Semantic Confusion Miner (SEM) to capture global inter-class confusion through semantic difference and commonality prompts, and a Sample Confusion Miner (SAM) to retrieve representative misclassified instances from the bank and capture sample-level cues through a Diff-Manner Adapter that integrates global and local contexts. To further unify confusion information across different granularities, a Multi-Granularity Difference Expert (MGDE) module is designed to jointly leverage semantic- and sample-level experts for more robust confusion-aware reasoning. Extensive experiments on 11 benchmark datasets demonstrate that our method significantly reduces confusion-induced errors while enhancing the discriminability and generalization of both base and novel classes, successfully resolving 50.72 percent of confusable sample pairs. Code will be released at https://github.com/greatest-gourmet/CAPT.


[45] CAWM-Mamba: A unified model for infrared-visible image fusion and compound adverse weather restoration cs.CV

Huichun Liu, Xiaosong Li, Zhuangfan Huang, Tao Ye, Yang Liu

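TL;DR: This paper proposes CAWM-Mamba, a unified end-to-end model for infrared-visible image fusion and compound adverse weather restoration. With weather-aware preprocessing, cross-modal feature interaction, and a wavelet-domain state block, it is the first to jointly handle scenes where multiple degradations (such as haze plus rain) coexist, and it outperforms existing methods on benchmarks including AWMM-100K.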

Details

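Motivation: Existing multimodal image fusion methods for adverse weather usually handle only a single type of degradation (haze, rain, or snow) and cannot cope with compound weather in which several degradations occur together. This limits their robustness in applications such as autonomous driving and UAV monitoring.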

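Result: Extensive experiments on the AWMM-100K benchmark and three standard fusion datasets show that CAWM-Mamba consistently outperforms state-of-the-art methods in both compound and single-weather scenarios. Its fusion results also perform well on downstream tasks such as semantic segmentation and object detection, confirming its practical value for perception in real adverse weather.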

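Insight: The main novelty is the first weight-sharing, end-to-end framework that unifies image fusion and compound weather restoration. Its core components are: 1) a weather-aware preprocessing module that extracts global weather embeddings; 2) a cross-modal feature interaction module that promotes alignment of heterogeneous modalities; 3) a state block that uses wavelet-domain decomposition to decouple multi-frequency degradations, in which the Freq-SSM module models anisotropic high-frequency degradation without redundancy, together with a unified degradation representation mechanism that improves generalization to complex compound weather.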

Abstract: Multimodal Image Fusion (MMIF) integrates complementary information from various modalities to produce clearer and more informative fused images. MMIF under adverse weather is particularly crucial in autonomous driving and UAV monitoring applications. However, existing adverse weather fusion methods generally only tackle single types of degradation such as haze, rain, or snow, and fail when multiple degradations coexist (e.g., haze+rain, rain+snow). To address this challenge, we propose Compound Adverse Weather Mamba (CAWM-Mamba), the first end-to-end framework that jointly performs image fusion and compound weather restoration with unified shared weights. Our network contains three key components: (1) a Weather-Aware Preprocess Module (WAPM) that enhances degraded visible features and extracts global weather embeddings; (2) a Cross-modal Feature Interaction Module (CFIM) that facilitates the alignment of heterogeneous modalities and the exchange of complementary features across modalities; and (3) a Wavelet Space State Block (WSSB) that leverages wavelet-domain decomposition to decouple multi-frequency degradations. WSSB includes Freq-SSM, a module that models anisotropic high-frequency degradation without redundancy, and a unified degradation representation mechanism to further improve generalization across complex compound weather conditions. Extensive experiments on the AWMM-100K benchmark and three standard fusion datasets demonstrate that CAWM-Mamba consistently outperforms state-of-the-art methods in both compound and single-weather scenarios. In addition, our fusion results excel in downstream tasks covering semantic segmentation and object detection, confirming the practical value in real-world adverse weather perception. The source code will be available at https://github.com/Feecuin/CAWM-Mamba.


[46] Track4World: Feedforward World-centric Dense 3D Tracking of All Pixels cs.CV

Jiahao Lu, Jiayi Xu, Wenbo Hu, Ruijie Zhu, Chengfeng Zhao

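TL;DR: Track4World is a feedforward model for efficient, world-centric dense 3D tracking of every pixel in a monocular video. It encodes a global 3D scene representation with a VGGT-style ViT and applies a novel 3D correlation scheme to estimate pixel-wise 2D and 3D dense flow between arbitrary frame pairs simultaneously, which makes it possible to estimate the 3D trajectory of every pixel in the video.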

Details

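Motivation: Existing monocular 3D tracking methods either track only sparse points from the first frame or perform dense tracking with slow optimization-based frameworks, which limits a comprehensive understanding of the 3D dynamics of videos. An efficient and holistic dense 3D tracking method is therefore needed.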

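Result: Extensive experiments on multiple benchmarks show that the method consistently outperforms existing approaches in 2D/3D flow estimation and 3D tracking, highlighting its robustness and scalability for real-world 4D reconstruction tasks.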

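Insight: The contributions are a global 3D scene representation encoded by a VGGT-style ViT, a novel 3D correlation scheme that estimates 2D and 3D dense flow simultaneously, and efficient pixel-level 3D tracking that combines scene flow with reconstructed 3D geometry. Together they give an efficient feedforward solution for dense 3D tracking.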

Abstract: Estimating the 3D trajectory of every pixel from a monocular video is crucial and promising for a comprehensive understanding of the 3D dynamics of videos. Recent monocular 3D tracking works demonstrate impressive performance, but are limited to either tracking sparse points on the first frame or a slow optimization-based framework for dense tracking. In this paper, we propose a feedforward model, called Track4World, enabling an efficient holistic 3D tracking of every pixel in the world-centric coordinate system. Built on the global 3D scene representation encoded by a VGGT-style ViT, Track4World applies a novel 3D correlation scheme to simultaneously estimate the pixel-wise 2D and 3D dense flow between arbitrary frame pairs. The estimated scene flow, along with the reconstructed 3D geometry, enables subsequent efficient 3D tracking of every pixel of this video. Extensive experiments on multiple benchmarks demonstrate that our approach consistently outperforms existing methods in 2D/3D flow estimation and 3D tracking, highlighting its robustness and scalability for real-world 4D reconstruction tasks.


[47] Neural Electromagnetic Fields for High-Resolution Material Parameter Reconstruction cs.CV | eess.SP

Zhe Chen, Peilin Zheng, Wenshuo Chen, Xiucheng Wang, Yutao Yue

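TL;DR: This paper proposes NEMF, a framework that reconstructs dense material parameters of a scene (such as permittivity and conductivity) from non-invasive sensing (such as radio-frequency signals) in order to build functional digital twins. It uses the high-fidelity geometry obtained from images as an anchor and disentangles the unknown geometry, ambient field, and target materials, turning an ill-posed physical inversion problem into a well-posed, physics-supervised learning task that supports accurate material parameter reconstruction.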

Details

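Motivation: Digital twins created by current methods such as NeRF are visually rich but functionally incomplete because they lack the underlying material properties. The paper tackles the core challenge of acquiring material properties at every point in a scene through non-contact, non-invasive sensing, which is a notoriously ill-posed physical inversion problem.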

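Result: Validation on high-fidelity synthetic datasets shows that the non-invasive inversion reconstructs material parameter maps with high accuracy, and that the resulting functional twin supports high-fidelity physical simulation.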

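Insight: The core novelty is a systematic disentanglement strategy: image geometry serves as an anchor for resolving the ambient field, which turns the ill-posed inversion into a well-posed, physics-supervised learning task. With a differentiable layer that incorporates physical reflection models, the decoder learns from ambient RF signals to output a continuous, spatially varying field of material parameters explicitly, moving from passive visual replicas to functional, simulatable models.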

Abstract: Creating functional Digital Twins, simulatable 3D replicas of the real world, is a central challenge in computer vision. Current methods like NeRF produce visually rich but functionally incomplete twins. The key barrier is the lack of underlying material properties (e.g., permittivity, conductivity). Acquiring this information for every point in a scene via non-contact, non-invasive sensing is a primary goal, but it demands solving a notoriously ill-posed physical inversion problem. Standard remote signals, like images and radio frequencies (RF), deeply entangle the unknown geometry, ambient field, and target materials. We introduce NEMF, a novel framework for dense, non-invasive physical inversion designed to build functional digital twins. Our key insight is a systematic disentanglement strategy. NEMF leverages high-fidelity geometry from images as a powerful anchor, which first enables the resolution of the ambient field. By constraining both geometry and field using only non-invasive data, the original ill-posed problem transforms into a well-posed, physics-supervised learning task. This transformation unlocks our core inversion module: a decoder. Guided by ambient RF signals and a differentiable layer incorporating physical reflection models, it learns to explicitly output a continuous, spatially-varying field of the scene’s underlying material parameters. We validate our framework on high-fidelity synthetic datasets. Experiments show our non-invasive inversion reconstructs these material maps with high accuracy, and the resulting functional twin enables high-fidelity physical simulation. This advance moves beyond passive visual replicas, enabling the creation of truly functional and simulatable models of the physical world.


[48] Maximizing Generalization: The Effect of Different Augmentation Techniques on Lightweight Vision Transformer for Bengali Character Classification cs.CV

Rafi Hassan Chowdhury, Naimul Haque, Kaniz Fatiha

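TL;DR: This study examines how different image data augmentation techniques affect a lightweight vision transformer (EfficientViT) on Bengali handwritten character classification, with the aim of addressing the small datasets available for resource-limited languages. Evaluating CLAHE, Random Rotation, Random Affine, Color Jitter, and their combinations, it finds that Random Affine combined with Color Jitter gives the best accuracy on the Ekush and AIBangla datasets, at 97.48% and 97.57% respectively.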

Details

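Motivation: Deep learning models for resource-limited languages such as Bengali are prone to overfitting or underfitting because the datasets are small. The study uses data augmentation to improve model generalization.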

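Result: On the Ekush and AIBangla datasets, the combination of Random Affine and Color Jitter achieves the best accuracy (97.48% and 97.57%), outperforming all other individual and combined augmentation techniques and reaching an advanced level on this task.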

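Insight: The novelty is a systematic evaluation of multiple image augmentation techniques and their combinations for a lightweight vision transformer on Bengali character classification. It underlines the effectiveness of combined augmentation strategies and offers a practical reference for work on resource-scarce languages.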

Abstract: Deep learning models have proven to be highly effective in computer vision, with deep convolutional neural networks achieving impressive results across various computer vision tasks. However, these models rely heavily on large datasets to avoid overfitting. When a model learns features with either low or high variance, it can lead to underfitting or overfitting on the training data. Unfortunately, large-scale datasets may not be available in many domains, particularly for resource-limited languages such as Bengali. In this experiment, a series of tests were conducted in the field of image data augmentation as an approach to addressing the limited data problem for Bengali handwritten characters. The study also provides an in-depth analysis of the performance of different augmentation techniques. Data augmentation refers to a set of techniques applied to data to increase its size and diversity, making it more suitable for training deep learning models. The image augmentation techniques evaluated in this study include CLAHE, Random Rotation, Random Affine, Color Jitter, and their combinations. The study further explores the use of augmentation methods with a lightweight model such as EfficientViT. Among the different augmentation strategies, the combination of Random Affine and Color Jitter produced the best accuracy on the Ekush [1] and AIBangla [2] datasets, achieving accuracies of 97.48% and 97.57%, respectively. This combination outperformed all other individual and combined augmentation techniques. Overall, this analysis presents a thorough examination of the impact of image data augmentation in resource-scarce languages, particularly in the context of Bengali handwritten character recognition using lightweight models.


[49] VLMFusionOcc3D: VLM Assisted Multi-Modal 3D Semantic Occupancy Prediction cs.CV | cs.RO

A. Enes Doruk, Hasan F. Ates

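TL;DR: This paper proposes VLMFusionOcc3D, a robust multimodal framework for dense 3D semantic occupancy prediction in autonomous driving. It brings in prior knowledge from Vision-Language Models (VLMs) to address the semantic ambiguity of voxel-based models in sparse geometric grids and their performance degradation in adverse weather.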

Details

Motivation: 当前基于体素的占据模型在稀疏几何网格中常存在语义模糊性,且在恶劣天气条件下性能会下降。本文旨在利用视觉语言模型丰富的语言先验知识,将模糊的体素特征锚定到稳定的语义概念上,以解决这些问题。

Result: 在nuScenes和SemanticKITTI数据集上的大量实验表明,所提出的即插即用模块持续提升了最先进的体素基线的性能,特别是在具有挑战性的天气场景中取得了显著改进。

Insight: 创新点包括:1)提出实例驱动的VLM注意力机制(InstVLM),利用门控交叉注意力和LoRA适配的CLIP嵌入,将高级语义和地理先验直接注入3D体素;2)引入天气感知自适应融合(WeathFusion),一种利用车辆元数据和天气条件提示的动态门控机制,根据实时环境可靠性重新加权传感器贡献;3)采用深度感知几何对齐损失(DAGA)来确保稠密相机几何与稀疏但空间精确的LiDAR数据之间的结构一致性。

Abstract: This paper introduces VLMFusionOcc3D, a robust multimodal framework for dense 3D semantic occupancy prediction in autonomous driving. Current voxel-based occupancy models often struggle with semantic ambiguity in sparse geometric grids and performance degradation under adverse weather conditions. To address these challenges, we leverage the rich linguistic priors of Vision-Language Models (VLMs) to anchor ambiguous voxel features to stable semantic concepts. Our framework initiates with a dual-branch feature extraction pipeline that projects multi-view images and LiDAR point clouds into a unified voxel space. We propose Instance-driven VLM Attention (InstVLM), which utilizes gated cross-attention and LoRA-adapted CLIP embeddings to inject high-level semantic and geographic priors directly into the 3D voxels. Furthermore, we introduce Weather-Aware Adaptive Fusion (WeathFusion), a dynamic gating mechanism that utilizes vehicle metadata and weather-conditioned prompts to re-weight sensor contributions based on real-time environmental reliability. To ensure structural consistency, a Depth-Aware Geometric Alignment (DAGA) loss is employed to align dense camera-derived geometry with sparse, spatially accurate LiDAR returns. Extensive experiments on the nuScenes and SemanticKITTI datasets demonstrate that our plug-and-play modules consistently enhance the performance of state-of-the-art voxel-based baselines. Notably, our approach achieves significant improvements in challenging weather scenarios, offering a scalable and robust solution for complex urban navigation.
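
The abstract describes WeathFusion only as a dynamic gate that re-weights sensor contributions. The general idea can be sketched in Python as below, assuming a single scalar gate computed from a context vector. The function names, the scalar-gate form, and the linear gate parameters are illustrative assumptions, not the paper's architecture.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def weather_gated_fusion(camera_feat, lidar_feat, context, gate_weights, gate_bias=0.0):
    """Blend camera and LiDAR features for one voxel with a context-driven gate.

    Hypothetical sketch: `context` stands in for encoded weather/vehicle
    metadata, and the gate is a single linear layer followed by a sigmoid.
    """
    logit = sum(w * c for w, c in zip(gate_weights, context)) + gate_bias
    g = sigmoid(logit)  # g near 1 trusts the camera, g near 0 trusts LiDAR
    return [g * c + (1.0 - g) * l for c, l in zip(camera_feat, lidar_feat)]


# Example: a context that pushes the gate toward LiDAR (e.g. heavy rain).
fused = weather_gated_fusion([1.0, 1.0], [0.0, 2.0], context=[1.0], gate_weights=[-4.0])
```

A per-channel or per-voxel gate would follow the same pattern with a vector-valued `g`; the scalar form is chosen here only to keep the sketch short.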


[50] Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs cs.CV

Zhikang Xu, Qianqian Xu, Zitai Wang, Cong Hua, Sicong Li

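TL;DR: This paper proposes InterNeg, a framework that addresses the inconsistency between intra-modal and inter-modal distances in OOD detection with vision-language models. By systematically applying consistent inter-modal distance enhancement from both textual and visual perspectives, it achieves state-of-the-art performance on multiple benchmarks.

Details

Motivation: Current VLM-based OOD detection methods often introduce intra-modal distances (for example, comparing negative texts with ID labels, or comparing test images with image proxies). This conflicts with the inter-modal distance that CLIP-like models are optimized for and can lead to suboptimal performance.

Result: InterNeg achieves SOTA performance on multiple benchmarks: a 3.47% reduction in FPR95 on the large-scale ImageNet benchmark and a 5.50% improvement in AUROC on the challenging Near-OOD benchmark.

Insight: The core contribution is a framework that pursues inter-modal distance consistency. On the textual side, it designs an inter-modal criterion for selecting negative texts; on the visual side, it dynamically identifies high-confidence OOD images and inverts them into the textual space, generating extra negative text embeddings guided by inter-modal distance. This offers a new approach to inconsistent modality alignment.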

Abstract: Out-of-distribution (OOD) detection seeks to identify samples from unknown classes, a critical capability for deploying machine learning models in open-world scenarios. Recent research has demonstrated that Vision-Language Models (VLMs) can effectively leverage their multi-modal representations for OOD detection. However, current methods often incorporate intra-modal distance during OOD detection, such as comparing negative texts with ID labels or comparing test images with image proxies. This design paradigm creates an inherent inconsistency against the inter-modal distance that CLIP-like VLMs are optimized for, potentially leading to suboptimal performance. To address this limitation, we propose InterNeg, a simple yet effective framework that systematically utilizes consistent inter-modal distance enhancement from textual and visual perspectives. From the textual perspective, we devise an inter-modal criterion for selecting negative texts. From the visual perspective, we dynamically identify high-confidence OOD images and invert them into the textual space, generating extra negative text embeddings guided by inter-modal distance. Extensive experiments across multiple benchmarks demonstrate the superiority of our approach. Notably, our InterNeg achieves state-of-the-art performance compared to existing works, with a 3.47% reduction in FPR95 on the large-scale ImageNet benchmark and a 5.50% improvement in AUROC on the challenging Near-OOD benchmark.
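
To make the role of negative texts concrete, here is a minimal Python sketch of a generic negative-text OOD score that uses only image-to-text (inter-modal) similarities. It is not InterNeg's selection criterion, which the abstract does not specify; the function names, the temperature value, and the toy embeddings are assumptions for illustration.

```python
import math


def softmax(xs):
    """Softmax with max-subtraction for numerical stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def id_score(image_emb, id_text_embs, neg_text_embs, temperature=0.01):
    """Probability mass on ID labels when they compete with negative texts.

    Only image-to-text similarities are used; a low score suggests the
    image is out-of-distribution. The temperature is an assumed value.
    """
    sims = [cosine(image_emb, t) for t in id_text_embs + neg_text_embs]
    probs = softmax([s / temperature for s in sims])
    return sum(probs[: len(id_text_embs)])


# Toy 3-d embeddings: two ID labels and one negative text.
id_texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
neg_texts = [[0.0, 0.0, 1.0]]
score_in = id_score([0.9, 0.1, 0.0], id_texts, neg_texts)
score_out = id_score([0.0, 0.1, 0.9], id_texts, neg_texts)
```

Thresholding this score gives a detector; FPR95 is then the false positive rate on OOD samples at the threshold that retains 95% of ID samples.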


[51] Direct Reward Fine-Tuning on Poses for Single Image to 3D Human in the Wild cs.CV

Seunguk Do, Minwoo Huh, Joonghyuk Shin, Jaesik Park

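TL;DR: This paper proposes DrPose, a direct reward fine-tuning algorithm that post-trains a multi-view diffusion model on DrPose15K, a dataset containing only poses and single-view images, to make single-view 3D human reconstruction more natural for dynamic or challenging poses.

Details

Motivation: Existing single-view 3D human reconstruction methods rely on multi-view diffusion models, but the reconstructions often show unnatural poses, especially for dynamic or challenging poses. The authors attribute this to the limited pose diversity of existing 3D human datasets.

Result: Evaluated on conventional benchmark datasets, in-the-wild images, and a newly constructed benchmark, DrPose yields consistent qualitative and quantitative improvements on all benchmarks, particularly on challenging human poses.

Insight: Contributions: the direct reward fine-tuning algorithm DrPose, which trains on only pose and single-view image pairs; PoseScore, a differentiable reward that quantifies consistency between a generated multi-view latent image and the ground-truth pose; and DrPose15K, a dataset with a broader pose distribution that requires no expensive 3D assets.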

Abstract: Single-view 3D human reconstruction has achieved remarkable progress through the adoption of multi-view diffusion models, yet the recovered 3D humans often exhibit unnatural poses. This phenomenon becomes pronounced when reconstructing 3D humans with dynamic or challenging poses, which we attribute to the limited scale of available 3D human datasets with diverse poses. To address this limitation, we introduce DrPose, a Direct Reward fine-tuning algorithm on Poses, which enables post-training of a multi-view diffusion model on diverse poses without requiring expensive 3D human assets. DrPose trains a model using only human poses paired with single-view images, employing direct reward fine-tuning to maximize PoseScore, our proposed differentiable reward that quantifies consistency between a generated multi-view latent image and a ground-truth human pose. This optimization is conducted on DrPose15K, a novel dataset constructed from an existing human motion dataset and a pose-conditioned video generative model. Constructed from abundant human pose sequence data, DrPose15K exhibits a broader pose distribution than existing 3D human datasets. We validate our approach through evaluation on conventional benchmark datasets, in-the-wild images, and a newly constructed benchmark, with a particular focus on assessing performance on challenging human poses. Our results demonstrate consistent qualitative and quantitative improvements across all benchmarks. Project page: https://seunguk-do.github.io/drpose.


[52] Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective cs.CV

Kaifang Long, Lianbo Ma, Jiaqi Liu, Liming Liu, Guoyang Xie

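TL;DR: This paper proposes IB-IUMAD, an incremental unified multimodal anomaly detection framework that targets catastrophic forgetting when a single model performs cross-category anomaly detection and incremental learning. It introduces a Mamba decoder to disentangle spurious feature interference between objects and an information bottleneck fusion module to filter redundant information from the fused features, improving model performance.

Details

Motivation: The work addresses catastrophic forgetting, the core difficulty of incremental unified multimodal anomaly detection. It focuses on the negative effect of spurious and redundant features on forgetting and points out that multimodal frameworks built by naively aggregating unimodal architectures are more prone to forgetting.

Result: A series of theoretical analyses and experiments on the MVTec 3D-AD and Eyecandies datasets show that IB-IUMAD is effective and achieves competitive performance.

Insight: From an information bottleneck perspective, the method strengthens multimodal denoising by combining a Mamba decoder (which disentangles inter-object feature coupling) with an information bottleneck fusion module (which filters redundant features), explicitly preserving discriminative information and mitigating catastrophic forgetting.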

Abstract: The quest for incremental unified multimodal anomaly detection seeks to empower a single model with the ability to systematically detect anomalies across all categories and support incremental learning to accommodate emerging objects/categories. Central to this pursuit is resolving the catastrophic forgetting dilemma, which involves acquiring new knowledge while preserving previously learned knowledge. Despite some efforts to address this dilemma, a key oversight persists: ignoring the potential impact of spurious and redundant features on catastrophic forgetting. In this paper, we delve into the negative effect of spurious and redundant features on this dilemma in incremental unified frameworks, and reveal that under similar conditions, the multimodal framework developed by naive aggregation of unimodal architectures is more prone to forgetting. To address this issue, we introduce a novel denoising framework called IB-IUMAD, which exploits the complementary benefits of the Mamba decoder and information bottleneck fusion module: the former disentangles inter-object feature coupling, preventing spurious feature interference between objects; the latter filters out redundant features from the fused features, thus explicitly preserving discriminative information. A series of theoretical analyses and experiments on the MVTec 3D-AD and Eyecandies datasets demonstrates the effectiveness and competitive performance of IB-IUMAD.


[53] SEP-YOLO: Fourier-Domain Feature Representation for Transparent Object Instance Segmentation cs.CV

Fengming Zhang, Tao Yan, Jianchao Huang

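TL;DR: This paper proposes SEP-YOLO, which uses a frequency-domain detail enhancement module and a multi-scale spatial refinement stream to address boundary blur, low contrast, and related challenges in transparent object instance segmentation, achieving SOTA performance on the Trans10K and GVD datasets.

Details

Motivation: Transparent objects have blurred boundaries, low contrast, and a high dependence on background context. These inherent properties defeat existing methods that rely on strong appearance cues and clear boundaries, so new methods are needed.

Result: Extensive experiments on Trans10K and GVD show that SEP-YOLO reaches state-of-the-art performance.

Insight: Contributions include a Frequency Domain Detail Enhancement Module that separates and enhances weak high-frequency boundary components via learnable complex weights, and a multi-scale spatial refinement stream that ensures precise alignment and boundary localization in deep semantic features. The paper also provides high-quality instance-level annotations for Trans10K, filling a data gap.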

Abstract: Transparent object instance segmentation presents significant challenges in computer vision, due to the inherent properties of transparent objects, including boundary blur, low contrast, and high dependence on background context. Existing methods often fail as they depend on strong appearance cues and clear boundaries. To address these limitations, we propose SEP-YOLO, a novel framework that integrates a dual-domain collaborative mechanism for transparent object instance segmentation. Our method incorporates a Frequency Domain Detail Enhancement Module, which separates and enhances weak high-frequency boundary components via learnable complex weights. We further design a multi-scale spatial refinement stream, which consists of a Content-Aware Alignment Neck and a Multi-scale Gated Refinement Block, to ensure precise feature alignment and boundary localization in deep semantic features. We also provide high-quality instance-level annotations for the Trans10K dataset, filling the critical data gap in transparent object instance segmentation. Extensive experiments on the Trans10K and GVD datasets show that SEP-YOLO achieves state-of-the-art (SOTA) performance.
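
The idea of scaling high-frequency components with complex weights can be illustrated on a 1-D signal. The Python sketch below is a toy under stated assumptions, not the paper's module: the names `enhance_high_freq` and `cutoff` are hypothetical, the weights are fixed rather than learned, and a naive O(n²) DFT replaces the 2-D FFT a real implementation would use.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform (O(n^2), for illustration only)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of `dft`."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def enhance_high_freq(signal, weights, cutoff):
    """Multiply bins above `cutoff` by per-bin complex weights.

    `weights` plays the role of the learnable complex weights; here they
    are fixed inputs. Bins at or below `cutoff` pass through unchanged.
    The real part is returned, which is exact only when the weights are
    conjugate-symmetric (weights[k] == conj(weights[n - k])).
    """
    n = len(signal)
    spectrum = dft(signal)
    out = []
    for k, coeff in enumerate(spectrum):
        freq = min(k, n - k)  # distance from the DC bin
        out.append(coeff * weights[k] if freq > cutoff else coeff)
    return [x.real for x in idft(out)]


# An alternating signal lives entirely in the highest-frequency bin,
# so doubling the high-frequency weights doubles its amplitude.
edge_like = [1.0, -1.0] * 4
boosted = enhance_high_freq(edge_like, [2.0] * 8, cutoff=1)
```

Boundaries of transparent objects are weak, high-frequency structures, which is why amplifying that band selectively can help; in a trained module the weights would be parameters updated by backpropagation.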


[54] OmniFashion: Towards Generalist Fashion Intelligence via Multi-Task Vision-Language Learning cs.CV

Zhengwei Yang, Andi Long, Hao Li, Zechao Hu, Kui Jiang

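TL;DR: This paper proposes OmniFashion, a unified framework for generalist fashion intelligence via multi-task vision-language learning. Fashion tasks such as retrieval, recommendation, recognition, and dialogue are hard to unify because supervision is fragmented and annotations are incomplete. The authors first construct the large-scale dataset FashionX and then build OmniFashion on it, unifying the tasks under a fashion dialogue paradigm that supports multi-task reasoning and interactive dialogue.

Details

Motivation: Fashion intelligence tasks (retrieval, recommendation, recognition, dialogue) are currently handled in isolation because supervision is fragmented and annotations are incomplete. This prevents consistent visual-semantic structures from forming and makes it hard for existing vision-language models to serve as a generalist fashion brain for unified understanding and reasoning.

Result: Experiments on multiple subtasks and retrieval benchmarks show that OmniFashion achieves strong task-level accuracy and cross-task generalization, demonstrating its potential as a general, dialogue-oriented fashion intelligence solution.

Insight: Main contributions: 1) FashionX, a large-scale fashion dataset with fine-grained annotations from global to part level; 2) a unified vision-language framework that bridges heterogeneous fashion tasks under one fashion dialogue paradigm, enabling synergy and generalization across tasks. This suggests a path toward general, scalable domain-specific agents.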

Abstract: Fashion intelligence spans multiple tasks, i.e., retrieval, recommendation, recognition, and dialogue, yet remains hindered by fragmented supervision and incomplete fashion annotations. These limitations jointly restrict the formation of consistent visual-semantic structures, preventing recent vision-language models (VLMs) from serving as a generalist fashion brain that unifies understanding and reasoning across tasks. Therefore, we construct FashionX, a million-scale dataset that exhaustively annotates visible fashion items within an outfit and organizes attributes from global to part-level. Built upon this foundation, we propose OmniFashion, a unified vision-language framework that bridges diverse fashion tasks under a unified fashion dialogue paradigm, enabling both multi-task reasoning and interactive dialogue. Experiments on multiple subtasks and retrieval benchmarks show that OmniFashion achieves strong task-level accuracy and cross-task generalization, highlighting that it offers a scalable path toward universal, dialogue-oriented fashion intelligence.


[55] DREAM: Where Visual Understanding Meets Text-to-Image Generation cs.CV | cs.LG

Chao Li, Tianhong Li, Sai Vidyaranya Nuthalapati, Hong-You Chen, Satya Narayan Shukla

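TL;DR: The paper proposes DREAM, a unified multimodal learning framework that jointly optimizes discriminative (visual understanding) and generative (text-to-image generation) objectives, learning strong visual representations and image generation in a single model. Its core training technique is Masking Warmup, a progressive masking schedule; at inference it uses Semantically Aligned Decoding to improve text-image alignment.

Details

Motivation: Unifying visual representation learning and text-to-image generation within a single model is a central challenge in multimodal learning. The goal is a model that both understands visual content and generates high-quality images from text.

Result: Trained solely on CC12M, DREAM reaches 72.7% ImageNet linear-probing accuracy (+1.1% over CLIP) and an FID of 4.25 (+6.2% over FLUID), with consistent gains in few-shot classification, semantic segmentation, and depth estimation. Text-image fidelity improves by 6.3%.

Insight: The claimed contributions are a unified discriminative-generative joint optimization framework, the Masking Warmup training strategy, and the Semantically Aligned Decoding inference method. Viewed objectively, the core insight is that discriminative and generative objectives are synergistic rather than conflicting in multimodal learning, which suggests a route toward general-purpose multimodal models.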

Abstract: Unifying visual representation learning and text-to-image (T2I) generation within a single model remains a central challenge in multimodal learning. We introduce DREAM, a unified framework that jointly optimizes discriminative and generative objectives, while learning strong visual representations. DREAM is built on two key techniques: During training, Masking Warmup, a progressive masking schedule, begins with minimal masking to establish the contrastive alignment necessary for representation learning, then gradually transitions to full masking for stable generative training. At inference, DREAM employs Semantically Aligned Decoding to align partially masked image candidates with the target text and select the best one for further decoding, improving text-image fidelity (+6.3%) without external rerankers. Trained solely on CC12M, DREAM achieves 72.7% ImageNet linear-probing accuracy (+1.1% over CLIP) and an FID of 4.25 (+6.2% over FLUID), with consistent gains in few-shot classification, semantic segmentation, and depth estimation. These results demonstrate that discriminative and generative objectives can be synergistic, allowing unified multimodal models that excel at both visual understanding and generation.
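
Masking Warmup is described as a schedule that starts with minimal masking and moves gradually to full masking. The abstract does not give the shape of the ramp, so the Python sketch below assumes a linear one; the function name, the starting ratio of 0.1, and the step counts are all hypothetical.

```python
def masking_ratio(step, total_steps, start=0.1, end=1.0):
    """Fraction of image tokens to mask at a given training step.

    Assumes a linear ramp from `start` to `end` over `total_steps`;
    steps outside [0, total_steps] are clamped to the endpoints.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


# Early steps keep most tokens visible (favoring contrastive alignment);
# late steps approach full masking (favoring generative training).
schedule = [masking_ratio(s, 1000) for s in (0, 250, 500, 750, 1000)]
```

A cosine or stepwise ramp would fit the same description; only the monotone move from light to full masking is taken from the abstract.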


[56] VisionCreator: A Native Visual-Generation Agentic Model with Understanding, Thinking, Planning and Creation cs.CV

Jinxiang Lai, Zexin Lu, Jiajun He, Rongwei Quan, Wenzhe Zhao

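TL;DR: This paper proposes VisionCreator, a native visual-generation agentic model that unifies Understanding, Thinking, Planning, and Creation (UTPC) capabilities in an end-to-end learnable framework. Key contributions: VisGenData-4k, a dataset of high-quality creation trajectories with explicit UTPC structure, and its construction method; model optimization through Progressive Specialization Training and Virtual Reinforcement Learning, which lets the model acquire UTPC capabilities for complex creation tasks stably and efficiently in a simulated environment; VisGenBench, a comprehensive benchmark with 1,200 test samples; and 8B and 32B models that outperform larger closed-source models on multiple evaluation dimensions.

Details

Motivation: Visual content creation requires a deep understanding of design conventions and creative workflows, which is challenging for general models, while workflow-based agents lack the specialized knowledge needed for autonomous creative planning.

Result: The VisionCreator-8B/32B models outperform larger closed-source models across multiple evaluation dimensions, with standardized evaluation on the proposed VisGenBench benchmark.

Insight: The work unifies understanding, thinking, planning, and creation in one end-to-end learnable agentic model. It generates high-quality training data with a metacognition-based agent, optimizes with Progressive Specialization Training combined with Virtual Reinforcement Learning, and builds a dedicated comprehensive benchmark, providing a foundation for research on visual-generation agentic systems.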

Abstract: Visual content creation tasks demand a nuanced understanding of design conventions and creative workflows, capabilities that are challenging for general models, while workflow-based agents lack specialized knowledge for autonomous creative planning. To overcome these challenges, we propose VisionCreator, a native visual-generation agentic model that unifies Understanding, Thinking, Planning, and Creation (UTPC) capabilities within an end-to-end learnable framework. Our work introduces four key contributions: (i) VisGenData-4k and its construction methodology using metacognition-based VisionAgent to generate high-quality creation trajectories with explicit UTPC structures; (ii) the VisionCreator agentic model, optimized through Progressive Specialization Training (PST) and Virtual Reinforcement Learning (VRL) within a high-fidelity simulated environment, enabling stable and efficient acquisition of UTPC capabilities for complex creation tasks; (iii) VisGenBench, a comprehensive benchmark featuring 1.2k test samples across diverse scenarios for standardized evaluation of multi-step visual creation capabilities; (iv) VisionCreator-8B/32B models that demonstrate superior performance over larger closed-source models across multiple evaluation dimensions. Overall, this work provides a foundation for future research in visual-generation agentic systems.


[57] ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling cs.CV | cs.AI

Jiayi Zhu, Jianing Zhang, Yiying Yang, Wei Cheng, Xiaoyun Yuan

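TL;DR: ShareVerse is a video generation framework that supports multi-agent shared world modeling. By building a large-scale multi-agent interaction dataset, proposing a spatial concatenation strategy, and integrating cross-agent attention blocks, it achieves multi-view geometric consistency and spatial-temporal information exchange, generating long video sequences with shared world consistency.

Details

Motivation: Existing video generation methods lack support for unified shared world modeling with multi-agent interaction. ShareVerse aims to fill this gap by simulating multiple agents interacting in a shared environment to generate consistent and plausible videos.

Result: On a dataset built on the CARLA simulation platform, ShareVerse supports 49-frame large-scale video generation, accurately perceives the positions of dynamic agents, and achieves consistent shared world modeling, but no specific quantitative metrics or comparisons with SOTA are mentioned.

Insight: Contributions include a large-scale multi-agent interaction dataset, a spatial concatenation strategy for multi-view videos that ensures geometric consistency, and cross-agent attention blocks that enable spatial-temporal information exchange. These ideas may transfer to multi-agent collaborative tasks and consistent video generation.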

Abstract: This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi-agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) A dataset for large-scale multi-agent interactive world modeling is built on the CARLA simulation platform, featuring diverse scenes, weather conditions, and interactive trajectories with paired multi-view videos (front/rear/left/right views per agent) and camera data. 2) We propose a spatial concatenation strategy for four-view videos of independent agents to model a broader environment and to ensure internal multi-view geometric consistency. 3) We integrate cross-agent attention blocks into the pretrained video model, which enable interactive transmission of spatial-temporal information across agents, guaranteeing shared world consistency in overlapping regions and reasonable generation in non-overlapping regions. ShareVerse, which supports 49-frame large-scale video generation, accurately perceives the position of dynamic agents and achieves consistent shared world modeling.


[58] From “What” to “How”: Constrained Reasoning for Autoregressive Image Generation cs.CV | cs.MM | eess.IV

Ruxue Yan, Xubo Liu, Wenya Guo, Zhengkun Zhang, Ying Zhang

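TL;DR: This paper proposes CoR-Painter, a framework that guides autoregressive image generation with constrained reasoning, shifting the process from the conventional “What to draw” to a “How to draw” paradigm. The method first derives a set of visual constraints from the input prompt (such as spatial relationships, key attributes, and compositional rules), uses these constraints to guide generation of a detailed “What to draw” description, and then synthesizes the image. It also introduces a Dual-Objective GRPO strategy to optimize the textual constrained reasoning and visual projection processes.

Details

Motivation: Current autoregressive image generation methods only specify “What” details by rewriting the input prompt and fail to reason about “How” to structure the overall image, which leads to persistent issues such as spatial ambiguity (for example, unrealistic object overlaps).

Result: Extensive experiments on T2I-CompBench, GenEval, and WISE show state-of-the-art performance, with significant improvements in spatial metrics (e.g., +5.41% on T2I-CompBench).

Insight: The main contribution is the “How-to-What” paradigm: constrained reasoning decouples high-level structural planning (how to draw) from low-level detail generation (what to draw), ensuring structurally sound and coherent images. The Dual-Objective GRPO strategy specifically optimizes the reasoning and projection processes, improving the quality of the whole generation pipeline.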

Abstract: Autoregressive image generation has seen recent improvements with the introduction of chain-of-thought and reinforcement learning. However, current methods merely specify “What” details to depict by rewriting the input prompt, yet fundamentally fail to reason about “How” to structure the overall image. This inherent limitation gives rise to persistent issues, such as spatial ambiguity directly causing unrealistic object overlaps. To bridge this gap, we propose CoR-Painter, a novel framework that pioneers a “How-to-What” paradigm by introducing Constrained Reasoning to guide the autoregressive generation. Specifically, it first deduces “How to draw” by deriving a set of visual constraints from the input prompt, which explicitly govern spatial relationships, key attributes, and compositional rules. These constraints steer the subsequent generation of a detailed description “What to draw”, providing a structurally sound and coherent basis for accurate visual synthesis. Additionally, we introduce a Dual-Objective GRPO strategy that specifically optimizes the textual constrained reasoning and visual projection processes to ensure the coherence and quality of the entire generation pipeline. Extensive experiments on T2I-CompBench, GenEval, and WISE demonstrate that our method achieves state-of-the-art performance, with significant improvements in spatial metrics (e.g., +5.41% on T2I-CompBench).
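
GRPO scores each sampled output relative to the other outputs in its group. The Python sketch below shows that group-relative advantage together with one possible way to merge two reward signals. How CoR-Painter actually combines its textual and visual objectives is not stated in the abstract, so the weighted sum, the function names, and `text_weight` are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style).

    Uses the population standard deviation; `eps` avoids division by
    zero when every sample in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dual_objective_rewards(text_rewards, visual_rewards, text_weight=0.5):
    """Hypothetical weighted sum of a text-side and an image-side reward."""
    return [
        text_weight * t + (1.0 - text_weight) * v
        for t, v in zip(text_rewards, visual_rewards)
    ]


# Four sampled generations for one prompt, each with two reward signals.
combined = dual_objective_rewards([0.9, 0.4, 0.7, 0.2], [0.8, 0.5, 0.3, 0.1])
advantages = group_relative_advantages(combined)
```

Samples with positive advantage are reinforced and those with negative advantage are discouraged, without a learned value function.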


[59] MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization cs.CV | cs.CL | cs.LG

Ashutosh Chaubey, Jiacheng Pang, Mohammad Soleymani

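TL;DR: This paper proposes MoD-DPO, a modality-decoupled direct preference optimization framework for mitigating cross-modal hallucinations in omni-modal large language models (omni LLMs). It introduces modality-aware regularization terms and a language-prior debiasing penalty that increase sensitivity to relevant modalities and reduce interference from irrelevant ones, improving the faithfulness of modality grounding.

Details

Motivation: Omni LLMs perform well on audiovisual understanding tasks but are prone to cross-modal hallucinations caused by spurious correlations and dominant language priors. A method is needed to improve modality grounding and reduce over-reliance on textual priors.

Result: Extensive experiments on multiple audiovisual hallucination benchmarks show that MoD-DPO outperforms previous preference optimization baselines under similar training budgets, consistently improving perception accuracy and hallucination resistance.

Insight: The method explicitly decouples cross-modal interactions through modality-aware regularization (enforcing invariance to corruptions in irrelevant modalities and sensitivity to perturbations in relevant modalities) and a language-prior debiasing penalty, offering a scalable path toward more reliable and robust multimodal foundation models.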

Abstract: Omni-modal large language models (omni LLMs) have recently achieved strong performance across audiovisual understanding tasks, yet they remain highly susceptible to cross-modal hallucinations arising from spurious correlations and dominant language priors. In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs. MoD-DPO introduces modality-aware regularization terms that explicitly enforce invariance to corruptions in irrelevant modalities and sensitivity to perturbations in relevant modalities, thereby reducing unintended cross-modal interactions. To further mitigate over-reliance on textual priors, we incorporate a language-prior debiasing penalty that discourages hallucination-prone text-only responses. Extensive experiments across multiple audiovisual hallucination benchmarks demonstrate that MoD-DPO consistently improves perception accuracy and hallucination resistance, outperforming previous preference optimization baselines under similar training budgets. Our findings underscore the importance of modality-faithful alignment and demonstrate a scalable path toward more reliable and resilient multimodal foundation models.
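
MoD-DPO builds on direct preference optimization. For orientation, the Python sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It deliberately omits MoD-DPO's modality-aware regularizers and language-prior penalty, whose exact form the abstract does not give; the function name and the default `beta` are assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are summed token log-probabilities under the policy and under
    a frozen reference model. The loss is -log(sigmoid(beta * margin)),
    evaluated in a numerically stable softplus form.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# The policy prefers the chosen response more than the reference does,
# so the loss falls below log(2), its value at zero margin.
loss = dpo_loss(-12.0, -20.0, -15.0, -15.0)
```

A modality-decoupled variant would add further terms to this base loss, for example penalties computed on inputs with one modality corrupted.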


[60] Gated Differential Linear Attention: A Linear-Time Decoder for High-Fidelity Medical Segmentation cs.CV

Hongbo Zheng, Afshin Bozorgpour, Dorit Merhof, Minjia Zhang

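TL;DR: This paper proposes PVT-GDLA, a decoder-centric Transformer for medical image segmentation. Its core is Gated Differential Linear Attention (GDLA), which computes two kernelized attention paths on complementary subspaces and subtracts them with a learnable channel-wise scale to cancel common-mode noise and amplify relevant context, restoring sharp long-range dependencies at linear time complexity. The model also adds a lightweight head-specific gate and a parallel local token-mixing branch to improve nonlinearity, input-adaptive sparsity, and boundary fidelity.

Details

Motivation: Medical image segmentation needs models that preserve fine anatomical boundaries while staying efficient enough for clinical deployment. Transformers capture long-range dependencies but have quadratic attention cost and large data requirements; CNNs are compute-friendly but weak at global reasoning. Linear attention offers linear complexity but often suffers from training instability and attention dilution, leading to blurry segmentation maps.

Result: With a pretrained Pyramid Vision Transformer (PVT) encoder, PVT-GDLA achieves state-of-the-art accuracy on CT, MRI, ultrasound, and dermoscopy benchmarks under equal training budgets, with parameter counts comparable to the baselines but lower FLOPs, outperforming CNN, Transformer, hybrid, and linear-attention baselines.

Insight: Contributions: 1. the GDLA mechanism, which suppresses noise and enhances context through dual-path kernelized attention with learnable subtraction; 2. a head-specific gate that injects nonlinearity and input-adaptive sparsity, mitigating attention sink; 3. a parallel local token-mixing branch (using depthwise convolution) that strengthens neighboring-token interactions and improves boundary fidelity while keeping linear complexity and a low parameter count. This offers fast, scalable, high-fidelity segmentation for clinical and other resource-constrained settings.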

Abstract: Medical image segmentation requires models that preserve fine anatomical boundaries while remaining efficient for clinical deployment. While transformers capture long-range dependencies, they suffer from quadratic attention cost and large data requirements, whereas CNNs are compute-friendly yet struggle with global reasoning. Linear attention offers $\mathcal{O}(N)$ scaling, but often exhibits training instability and attention dilution, yielding diffuse maps. We introduce PVT-GDLA, a decoder-centric Transformer that restores sharp, long-range dependencies at linear time. Its core, Gated Differential Linear Attention (GDLA), computes two kernelized attention paths on complementary query/key subspaces and subtracts them with a learnable, channel-wise scale to cancel common-mode noise and amplify relevant context. A lightweight, head-specific gate injects nonlinearity and input-adaptive sparsity, mitigating attention sink, and a parallel local token-mixing branch with depthwise convolution strengthens neighboring-token interactions, improving boundary fidelity, all while retaining $\mathcal{O}(N)$ complexity and low parameter overhead. Coupled with a pretrained Pyramid Vision Transformer (PVT) encoder, PVT-GDLA achieves state-of-the-art accuracy across CT, MRI, ultrasound, and dermoscopy benchmarks under equal training budgets, with comparable parameters but lower FLOPs than CNN-, Transformer-, hybrid-, and linear-attention baselines. PVT-GDLA provides a practical path to fast, scalable, high-fidelity medical segmentation in clinical environments and other resource-constrained settings.
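
The differential part of GDLA can be sketched with plain kernelized linear attention. The Python code below is a simplified illustration under stated assumptions: it uses the common `elu(x) + 1` feature map, which the abstract does not name, treats the channel-wise scale as a fixed input, and leaves out the head-specific gate and the depthwise-convolution branch.

```python
import math


def feature_map(x):
    """Positive kernel feature map elu(x) + 1 (an assumed choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(queries, keys, values):
    """Non-causal linear attention in O(N): keys and values are summarized once."""
    d_k, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d_k)]  # sum over tokens of phi(k) v^T
    k_sum = [0.0] * d_k                     # sum over tokens of phi(k)
    for k, v in zip(keys, values):
        phi_k = feature_map(k)
        for i in range(d_k):
            k_sum[i] += phi_k[i]
            for j in range(d_v):
                kv[i][j] += phi_k[i] * v[j]
    outputs = []
    for q in queries:
        phi_q = feature_map(q)
        norm = sum(a * b for a, b in zip(phi_q, k_sum))
        outputs.append(
            [sum(phi_q[i] * kv[i][j] for i in range(d_k)) / norm for j in range(d_v)]
        )
    return outputs


def differential_linear_attention(q1, k1, q2, k2, values, scale):
    """Subtract a second attention path, scaled per output channel."""
    a1 = linear_attention(q1, k1, values)
    a2 = linear_attention(q2, k2, values)
    return [[x - s * y for x, y, s in zip(r1, r2, scale)] for r1, r2 in zip(a1, a2)]


# Two tokens, 2-d query/key subspaces per path, 2-d values.
q1, k1 = [[0.2, -0.1], [1.0, 0.5]], [[0.3, 0.4], [-0.2, 0.1]]
q2, k2 = [[0.1, 0.0], [0.4, -0.3]], [[0.0, 0.2], [0.5, 0.1]]
vals = [[1.0, 0.0], [0.0, 1.0]]
out = differential_linear_attention(q1, k1, q2, k2, vals, scale=[0.5, 0.5])
```

Whatever both paths attend to equally is reduced by the subtraction, which is the "common-mode noise" cancellation the abstract refers to; the cost stays linear in the number of tokens because each path summarizes keys and values only once.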


[61] CoShadow: Multi-Object Shadow Generation for Image Compositing via Diffusion Model cs.CV

Waqas Ahmed, Dean Diepeveen, Ferdous Sohel

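TL;DR: This paper proposes CoShadow, a diffusion-based method for multi-object shadow generation in image compositing. It builds on a pre-trained text-to-image diffusion model: an image pathway injects multi-scale features for spatial guidance, and a text pathway encodes each object's shadow bounding box as positional tokens that are fused via cross-attention. Experiments show state-of-the-art results on both single-object and multi-object shadow generation.

Details

Motivation: Existing shadow generation methods focus mainly on single-object insertion and generalize poorly to multi-object compositing. In practice, multiple objects are often inserted at once, which requires shadows that are jointly consistent in geometry, attachment, and location. The paper addresses the under-explored problem of multi-object shadow generation, aiming to synthesize physically plausible shadows for multiple objects.

Result: The method achieves state-of-the-art performance in both single-object and multi-object shadow generation settings, evaluated on an augmented version of the DESOBAv2 dataset.

Insight: Contributions: a dual-pathway (image and text) architecture that exploits the multimodal capabilities of a pre-trained diffusion model to fuse fine-grained spatial guidance with object-level positional information; an attention-alignment loss that aligns positional tokens with their corresponding shadow regions; and dataset augmentation through multi-object composite scenes with automatically generated prompts to support the task.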

Abstract: Realistic shadow generation is crucial for achieving seamless image compositing, yet existing methods primarily focus on single-object insertion and often fail to generalize when multiple foreground objects are composited into a background scene. In practice, however, modern compositing pipelines and real-world applications often insert multiple objects simultaneously, necessitating shadows that are jointly consistent in terms of geometry, attachment, and location. In this paper, we address the under-explored problem of multi-object shadow generation, aiming to synthesize physically plausible shadows for multiple inserted objects. Our approach exploits the multimodal capabilities of a pre-trained text-to-image diffusion model. An image pathway injects dense, multi-scale features to provide fine-grained spatial guidance, while a text-based pathway encodes per-object shadow bounding boxes as learned positional tokens and fuses them via cross-attention. An attention-alignment loss further grounds these tokens to their corresponding shadow regions. To support this task, we augment the DESOBAv2 dataset by constructing composite scenes with multiple inserted objects and automatically derive prompts combining object category and shadow positioning information. Experimental results demonstrate that our method achieves state-of-the-art performance in both single and multi-object shadow generation settings.


[62] iGVLM: Dynamic Instruction-Guided Vision Encoding for Question-Aware Multimodal Understanding cs.CV | cs.AI

HanZpeng Liu, Yaqian Li, Zidan Wang, Shuoxi Zhang, Zihao Bo

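TL;DR: The paper proposes iGVLM, a framework that addresses the representation bottleneck caused by static, instruction-agnostic vision encoders in large vision-language models. It introduces a decoupled dual-branch architecture (a frozen representation branch and a dynamic conditioning branch that performs affine feature modulation via Adaptive Layer Normalization), so the model can adjust visual representations according to the text instruction and transition smoothly from general-purpose perception to instruction-aware reasoning.

Details

Motivation: Most existing large vision-language models rely on static, instruction-agnostic vision encoders whose visual representations are used unchanged across different textual tasks. This rigidity hinders fine-grained reasoning that needs task-specific visual cues. The paper aims to remove this representation bottleneck.

Result: Extensive experiments on multiple benchmarks show that iGVLM consistently improves instruction sensitivity across different language backbones. The paper also introduces MM4, a diagnostic probe that quantifies logical consistency under multi-query, multi-instruction settings.

Insight: The contribution is a decoupled dual-branch framework for dynamic visual modulation. Instruction-guided modulation of visual features via Adaptive Layer Normalization preserves the structural integrity of pre-trained visual priors while providing a plug-and-play paradigm that moves from passive perception to active reasoning.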

Abstract: Despite the success of Large Vision–Language Models (LVLMs), most existing architectures suffer from a representation bottleneck: they rely on static, instruction-agnostic vision encoders whose visual representations are utilized in an invariant manner across different textual tasks. This rigidity hinders fine-grained reasoning where task-specific visual cues are critical. To address this issue, we propose iGVLM, a general framework for instruction-guided visual modulation. iGVLM introduces a decoupled dual-branch architecture: a frozen representation branch that preserves task-agnostic visual representations learned during pre-training, and a dynamic conditioning branch that performs affine feature modulation via Adaptive Layer Normalization (AdaLN). This design enables a smooth transition from general-purpose perception to instruction-aware reasoning while maintaining the structural integrity and stability of pre-trained visual priors. Beyond standard benchmarks, we introduce MM4, a controlled diagnostic probe for quantifying logical consistency under multi-query, multi-instruction settings. Extensive results show that iGVLM consistently enhances instruction sensitivity across diverse language backbones, offering a plug-and-play paradigm for bridging passive perception and active reasoning.
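
Affine feature modulation via Adaptive Layer Normalization can be sketched in a few lines of Python. The sketch assumes the common `(1 + gamma) * LN(x) + beta` form, with `gamma` and `beta` predicted from an instruction embedding by linear layers. The abstract does not spell out iGVLM's exact parameterization, so the function names and weight shapes below are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]


def linear(weight, bias, vec):
    """Dense layer: weight is a list of rows, one per output feature."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def adaln(visual_token, instruction_emb, w_scale, b_scale, w_shift, b_shift):
    """Modulate a normalized visual token with instruction-dependent scale and shift."""
    gamma = linear(w_scale, b_scale, instruction_emb)
    beta = linear(w_shift, b_shift, instruction_emb)
    normed = layer_norm(visual_token)
    return [(1.0 + g) * n + b for g, n, b in zip(gamma, normed, beta)]


# With zero-initialized projections the branch reduces to plain LayerNorm,
# so the pre-trained visual features are untouched at the start of training.
token = [1.0, 2.0, 3.0]
instr = [0.5, -0.5]
zero_w = [[0.0, 0.0] for _ in token]
zero_b = [0.0 for _ in token]
unchanged = adaln(token, instr, zero_w, zero_b, zero_w, zero_b)
```

Starting from an identity modulation is one plausible way to keep pre-trained visual priors stable, in line with the stability goal stated in the abstract.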


[63] Seeing Clearly without Training: Mitigating Hallucinations in Multimodal LLMs for Remote Sensing cs.CV

Yi Liu, Jing Zhang, Di Wang, Xiaoyu Tian, Haonan Guo

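TL;DR: This paper targets hallucinations of multimodal large language models in remote sensing visual question answering. It proposes RSHBench, a benchmark for fine-grained diagnosis, and RADAR, a training-free inference method that uses the model's intrinsic attention to guide progressive localization and fine-grained local reasoning, mitigating hallucinations and improving performance.

Details

Motivation: In remote sensing visual question answering, multimodal large language models hallucinate severely because of visual grounding failures in large-scale scenes or misinterpretation of fine-grained small targets.

Result: Extensive experiments on a range of MLLMs show that RADAR consistently improves RS-VQA performance and reduces both factual and logical hallucinations.

Insight: Contributions are RSHBench, a benchmark dedicated to diagnosing remote sensing hallucinations, and RADAR, a training-free test-time method that performs active reasoning using the model's intrinsic attention, offering a new approach to mitigating MLLM hallucinations in specialized domains.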

Abstract: Multimodal large language models (MLLMs) suffer from pronounced hallucinations in remote sensing visual question-answering (RS-VQA), primarily caused by visual grounding failures in large-scale scenes or misinterpretation of fine-grained small targets. To systematically analyze these issues, we introduce RSHBench, a protocol-based benchmark for fine-grained diagnosis of factual and logical hallucinations. To mitigate grounding-induced factual hallucinations, we further propose Relative Attention-Driven Actively Reasoning (RADAR), a training-free inference method that leverages intrinsic attention in MLLMs to guide progressive localization and fine-grained local reasoning at test time. Extensive experiments across diverse MLLMs demonstrate that RADAR consistently improves RS-VQA performance and reduces both factual and logical hallucinations. Code and data will be publicly available at: https://github.com/MiliLab/RADAR


[64] ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion cs.CV | cs.AI

HanZpeng Liu, Yaqian Li, Zidan Wang, Shuoxi Zhang, Zonglin Zhao

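TL;DR: This paper proposes ITO, a framework that uses multimodal multiple alignment and a training-time fusion mechanism to address the fact that representations from existing image-text contrastive pretraining remain partially organized by modality. The fusion module is discarded at inference to keep the efficiency of the dual-encoder architecture.

Details

Motivation: Representations learned by existing image-text contrastive pretraining methods often remain partially organized by modality. This modality gap limits how unified the cross-modal representations can be.

Result: ITO consistently outperforms strong baselines on classification, retrieval, and multimodal benchmarks.

Insight: The method combines multiple alignment (mining diverse image-text correspondences to enrich supervision) with a lightweight training-time fusion module (enforcing structured cross-modal interaction as a regularizer) that is discarded at inference to preserve efficiency. The analysis shows that multiple alignment improves discriminative power, while training-time fusion eliminates the modality gap and stabilizes training dynamics, preventing the early saturation often seen in aggressive contrastive learning.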

Abstract: Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer – eliminating the modality gap and stabilizing training dynamics to prevent the early saturation often observed in aggressive contrastive learning.


[65] HiLoRA: Hierarchical Low-Rank Adaptation for Personalized Federated Learning cs.CV

Zihao Peng, Nan Zou, Jiandian Zeng, Guo Li, Ke Chen

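TL;DR: This paper proposes HiLoRA, a hierarchical low-rank adaptation framework for Vision Transformers in personalized federated learning. It places adapters at three tiers (root, cluster, and leaf) to capture global, subgroup, and client-specific knowledge respectively, and separates update subspaces through cross-tier orthogonality and cascaded optimization, improving adaptation to unseen clients.

Details

Motivation: Existing LoRA-based federated tuning methods overlook latent client structures in real-world settings, which limits shared representation learning and hinders effective adaptation to unseen clients.

Result: Experiments with ViT backbones on CIFAR-100 and DomainNet show consistent improvements in both personalization and generalization.

Insight: Contributions are the hierarchical LoRA framework and a LoRA-Subspace Adaptive Clustering mechanism that infers latent client groups through subspace similarity analysis, facilitating knowledge sharing across structurally aligned clients. A tier-wise generalization analysis provides theoretical support.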

Abstract: Vision Transformers (ViTs) have been widely adopted in vision tasks due to their strong transferability. In Federated Learning (FL), where full fine-tuning is communication heavy, Low-Rank Adaptation (LoRA) provides an efficient and communication-friendly way to adapt ViTs. However, existing LoRA-based federated tuning methods overlook latent client structures in real-world settings, limiting shared representation learning and hindering effective adaptation to unseen clients. To address this, we propose HiLoRA, a hierarchical LoRA framework that places adapters at three levels: root, cluster, and leaf, each designed to capture global, subgroup, and client-specific knowledge, respectively. Through cross-tier orthogonality and cascaded optimization, HiLoRA separates update subspaces and aligns each tier with its residual personalized objective. In particular, we develop a LoRA-Subspace Adaptive Clustering mechanism that infers latent client groups via subspace similarity analysis, thereby facilitating knowledge sharing across structurally aligned clients. Theoretically, we establish a tier-wise generalization analysis that supports HiLoRA’s design. Experiments on ViT backbones with CIFAR-100 and DomainNet demonstrate consistent improvements in both personalization and generalization.
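
The three-tier adapter layout can be pictured as low-rank updates stacked on a frozen weight. The Python sketch below assumes the tiers combine additively, as W + B_root A_root + B_cluster A_cluster + B_leaf A_leaf. That composition rule and the helper names are assumptions for illustration; the cross-tier orthogonality constraint and the clustering step are not shown.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def compose_tiers(base, adapters):
    """Return base + sum of B @ A over the given (B, A) low-rank pairs.

    `base` is left unmodified, mirroring a frozen pre-trained weight.
    """
    weight = [row[:] for row in base]
    for b_mat, a_mat in adapters:
        delta = matmul(b_mat, a_mat)
        for i, row in enumerate(delta):
            for j, val in enumerate(row):
                weight[i][j] += val
    return weight


# Rank-1 adapters on a 2x2 weight: one per tier (root, cluster, leaf).
base = [[1.0, 0.0], [0.0, 1.0]]
root = ([[1.0], [0.0]], [[0.0, 1.0]])     # shared by all clients
cluster = ([[0.0], [1.0]], [[1.0, 0.0]])  # shared within one client group
leaf = ([[0.0], [0.0]], [[0.0, 0.0]])     # client-specific, untrained here
client_weight = compose_tiers(base, [root, cluster, leaf])
```

An unseen client could start from the base weight plus the root and nearest-cluster adapters and train only its leaf adapter, which is one reading of how the hierarchy helps adaptation.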


[66] Designing UNICORN: a Unified Benchmark for Imaging in Computational Pathology, Radiology, and Natural Language cs.CVPDF

Michelle Stegeman, Lena Philipp, Fennie van der Graaf, Marina D’Amato, Clément Grisi

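TL;DR: This paper presents UNICORN, a unified public benchmark for systematically evaluating medical foundation models. It decouples model inference from task-specific evaluation based on standardized few-shot adaptation and introduces an aggregate UNICORN Score, addressing the fragmentation of existing medical benchmarks across tasks and modalities.

Details

Motivation: There is no public, standardized, and reproducible evaluation framework for verifying the potential of medical foundation models for cross-modality generalization and rapid adaptation to new tasks; existing benchmarks are often restricted to specific tasks, organs, or modalities.

Result: The UNICORN benchmark includes data from 17 institutions and more than 2,400 patients, covering over 3,700 vision cases and over 2,400 clinical reports across eight anatomical regions and four imaging modalities, and provides both task-specific and aggregated leaderboards.

Insight: The contributions are a decoupled evaluation framework that isolates representation quality, indirectly accessible sequestered test sets that ensure clinical relevance, and a unified UNICORN Score that allows direct comparison of models across medical domains, modalities, and task types.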

Abstract: Medical foundation models show promise to learn broadly generalizable features from large, diverse datasets. This could be the base for reliable cross-modality generalization and rapid adaptation to new, task-specific goals, with only a few task-specific examples. Yet, evidence for this is limited by the lack of public, standardized, and reproducible evaluation frameworks, as existing public benchmarks are often fragmented across task-, organ-, or modality-specific settings, limiting assessment of cross-task generalization. We introduce UNICORN, a public benchmark designed to systematically evaluate medical foundation models under a unified protocol. To isolate representation quality, we built the benchmark on a novel two-step framework that decouples model inference from task-specific evaluation based on standardized few-shot adaptation. As a central design choice, we constructed indirectly accessible sequestered test sets derived from clinically relevant cohorts, along with standardized evaluation code and a submission interface on an open benchmarking platform. Performance is aggregated into a single UNICORN Score, a new metric that we introduce to support direct comparison of foundation models across diverse medical domains, modalities, and task types. The UNICORN test dataset includes data from more than 2,400 patients, including over 3,700 vision cases and over 2,400 clinical reports collected from 17 institutions across eight countries. The benchmark spans eight anatomical regions and four imaging modalities. Both task-specific and aggregated leaderboards enable accessible, standardized, and reproducible evaluation. By standardizing multi-task, multi-modality assessment, UNICORN establishes a foundation for reproducible benchmarking of medical foundation models. Data, baseline methods, and the evaluation platform are publicly available via unicorn.grand-challenge.org.


[67] VSearcher: Long-Horizon Multimodal Search Agent via Reinforcement Learning cs.CVPDF

Ruiyang Zhang, Qianguo Sun, Chao Song, Yiyan Qi, Zhedong Zheng

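TL;DR: This paper proposes VSearcher, which uses reinforcement learning to turn a static multimodal large model into a multimodal search agent capable of long-horizon, multi-turn tool use (text search, image search, and web browsing) in real-world web environments.

Details

Motivation: Text-only large models are limited to a narrow range of application scenarios, while multimodal large models, despite strong perception, lack the ability to access and use up-to-date web information. The paper addresses the inability of static multimodal models to dynamically access and exploit real-time web information.

Result: Extensive evaluation on multiple multimodal search benchmarks, including the newly proposed and highly challenging MM-SearchExam, shows that VSearcher outperforms recent multimodal search agents and even surpasses several proprietary models on multimodal web search tasks.

Insight: The contributions are: 1) an Iterative Injection Data Synthesis pipeline that generates large-scale, high-quality, high-difficulty complex multimodal QA data; 2) an SFT-then-RL training pipeline that turns a base multimodal model into an agent operating in web environments; 3) MM-SearchExam, a benchmark dedicated to evaluating the capabilities of multimodal search agents.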

Abstract: Large models are increasingly becoming autonomous agents that interact with real-world environments and use external tools to augment their static capabilities. However, most recent progress has focused on text-only large language models, which are limited to a single modality and therefore have narrower application scenarios. Multimodal large models, on the other hand, offer stronger perceptual capabilities but remain limited to static knowledge and lack the ability to access and leverage up-to-date web information. In this paper, we propose VSearcher, which uses reinforcement learning to turn a static multimodal model into a multimodal search agent capable of long-horizon, multi-turn tool use in real-world web environments, including text search, image search, and web browsing. Specifically, we introduce an Iterative Injection Data Synthesis pipeline to generate large-scale, complex multimodal QA questions, which are further filtered with comprehensive metrics to ensure high quality and sufficient difficulty. We then adopt an SFT-then-RL training pipeline to turn base multimodal models into agents capable of multi-turn tool calling in real-world web environments. In addition, we propose MM-SearchExam, a multimodal search benchmark dedicated to evaluating the search capabilities of multimodal search agents, which proves highly challenging for recent proprietary models. Extensive evaluations across multiple multimodal search benchmarks demonstrate the effectiveness of our method. VSearcher achieves superior performance compared to recent multimodal search agents and even surpasses several proprietary models on multimodal web search tasks.


[68] NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing cs.CVPDF

Tianlin Pan, Jiayi Dai, Chenpu Yuan, Zhengyao Lv, Binxin Yang

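TL;DR: This paper proposes NOVA, a new framework for pair-free video editing. Sparse control (user-edited keyframes) provides semantic guidance, while a dense synthesis branch continuously draws motion and texture information from the original video to maintain high fidelity and consistency.

Details

Motivation: Most existing video editing models rely on large-scale paired datasets, but collecting naturally aligned video pairs is very difficult, especially for local edits. Existing pair-free methods transfer image editing to video through global motion control but struggle to preserve background and temporal consistency.

Result: Extensive experiments show that NOVA outperforms existing methods in edit fidelity, motion preservation, and temporal consistency.

Insight: The contributions are the sparse and dense dual-branch architecture and a degradation-simulation training strategy that trains on artificially degraded videos, which lets the model learn motion reconstruction and temporal consistency without paired data.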

Abstract: Recent video editing models have achieved impressive results, but most still require large-scale paired datasets. Collecting such naturally aligned pairs at scale remains highly challenging and constitutes a critical bottleneck, especially for local video editing data. Existing workarounds transfer image editing to video through global motion control for pair-free video editing, but such designs struggle with background and temporal consistency. In this paper, we propose NOVA: Sparse Control & Dense Synthesis, a new framework for unpaired video editing. Specifically, the sparse branch provides semantic guidance through user-edited keyframes distributed across the video, and the dense branch continuously incorporates motion and texture information from the original video to maintain high fidelity and coherence. Moreover, we introduce a degradation-simulation training strategy that enables the model to learn motion reconstruction and temporal consistency by training on artificially degraded videos, thus eliminating the need for paired data. Our extensive experiments demonstrate that NOVA outperforms existing approaches in edit fidelity, motion preservation, and temporal coherence.


[69] Structure-Aware Text Recognition for Ancient Greek Critical Editions cs.CVPDF

Nicolas Angleraud, Antonia Karamolegkou, Benoît Sagot, Thibault Clérice

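TL;DR: This paper studies structure-aware text recognition for Ancient Greek critical editions. To address the complex layout semantics of historical scholarly documents, it introduces a large-scale synthetic dataset and a benchmark of real scans, and evaluates three state-of-the-art vision-language models in zero-shot and fine-tuning settings.

Details

Motivation: Existing vision-language models are limited in their ability to interpret the complex layout semantics of historical scholarly documents with dense reference hierarchies and extensive marginal annotations.

Result: On real scans, Qwen3VL-8B achieves state-of-the-art performance with a median character error rate of 1.0%; in the zero-shot setting, however, most models perform significantly worse than established off-the-shelf software.

Insight: The contributions are the first large-scale synthetic corpus for Ancient Greek critical editions and a benchmark dataset spanning more than a century of editorial practice, which together reveal both the shortcomings and the potential of current VLM architectures on highly structured historical documents.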

Abstract: Recent advances in visual language models (VLMs) have transformed end-to-end document understanding. However, their ability to interpret the complex layout semantics of historical scholarly texts remains limited. This paper investigates structure-aware text recognition for Ancient Greek critical editions, which have dense reference hierarchies and extensive marginal annotations. We introduce two novel resources: (i) a large-scale synthetic corpus of 185,000 page images generated from TEI/XML sources with controlled typographic and layout variation, and (ii) a curated benchmark of real scanned editions spanning more than a century of editorial and typographic practices. Using these datasets, we evaluate three state-of-the-art VLMs under both zero-shot and fine-tuning regimes. Our experiments reveal substantial limitations in current VLM architectures when confronted with highly structured historical documents. In zero-shot settings, most models significantly underperform compared to established off-the-shelf software. Nevertheless, the Qwen3VL-8B model achieves state-of-the-art performance, reaching a median Character Error Rate of 1.0% on real scans. These results highlight both the current shortcomings and the future potential of VLMs for structure-aware recognition of complex scholarly documents.
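
For readers unfamiliar with the headline metric, the following minimal sketch shows how a median Character Error Rate is commonly computed: the Levenshtein edit distance between a reference transcription and a model output, divided by the reference length, with the median taken over pages. The abstract does not specify the paper's text normalization or tooling, so this is the usual definition rather than the authors' evaluation code, and the transliterated sample strings are invented for illustration.

```python
from statistics import median

def levenshtein(ref, hyp):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character Error Rate: edit distance normalized by reference length."""
    return levenshtein(ref, hyp) / len(ref)

# Invented (reference, model output) pairs standing in for page transcriptions.
pages = [
    ("menin aeide thea", "menin aeide thea"),
    ("andra moi ennepe", "andra moi ennepa"),
    ("polytropon", "polytropn"),
]
page_cers = [cer(ref, hyp) for ref, hyp in pages]
median_cer = median(page_cers)
print(page_cers, median_cer)
```

The median is less sensitive than the mean to a few badly recognized pages, which may be why it is the statistic reported. For polytonic Greek, the Unicode normalization form applied before comparison would also affect the character counts.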


[70] BrandFusion: A Multi-Agent Framework for Seamless Brand Integration in Text-to-Video Generation cs.CV | cs.AIPDF

Zihao Zhu, Ruotong Wang, Siwei Lyu, Min Zhang, Baoyuan Wu

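TL;DR: This paper proposes BrandFusion, a multi-agent framework for seamless brand integration in text-to-video generation. It builds a brand knowledge base in an offline phase and has multiple agents collaboratively refine prompts in an online phase, with the aim of automatically embedding advertiser brands into prompt-generated videos while preserving semantic fidelity to user intent.

Details

Motivation: The commercial potential of existing text-to-video models remains largely untapped. The paper is the first to pose and address the task of seamless brand integration in text-to-video generation, whose core challenge is balancing prompt fidelity, brand recognizability, and natural integration.

Result: Experiments with 18 established brands and 2 custom brands on multiple state-of-the-art text-to-video models show that BrandFusion significantly outperforms baselines in semantic preservation, brand recognizability, and integration naturalness; human evaluation also confirms higher user satisfaction.

Insight: The contributions are the first formalization of the brand integration task in T2V and a two-phase framework of offline knowledge-base construction and online multi-agent collaboration, which jointly handles the constraints through model-prior probing, lightweight fine-tuning, and real-time context tracking, offering a feasible path to sustainable T2V monetization.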

Abstract: The rapid advancement of text-to-video (T2V) models has revolutionized content creation, yet their commercial potential remains largely untapped. We introduce, for the first time, the task of seamless brand integration in T2V: automatically embedding advertiser brands into prompt-generated videos while preserving semantic fidelity to user intent. This task confronts three core challenges: maintaining prompt fidelity, ensuring brand recognizability, and achieving contextually natural integration. To address them, we propose BrandFusion, a novel multi-agent framework comprising two synergistic phases. In the offline phase (advertiser-facing), we construct a Brand Knowledge Base by probing model priors and adapting to novel brands via lightweight fine-tuning. In the online phase (user-facing), five agents jointly refine user prompts through iterative refinement, leveraging the shared knowledge base and real-time contextual tracking to ensure brand visibility and semantic alignment. Experiments on 18 established and 2 custom brands across multiple state-of-the-art T2V models demonstrate that BrandFusion significantly outperforms baselines in semantic preservation, brand recognizability, and integration naturalness. Human evaluations further confirm higher user satisfaction, establishing a practical pathway for sustainable T2V monetization.


[71] Multimodal-Prior-Guided Importance Sampling for Hierarchical Gaussian Splatting in Sparse-View Novel View Synthesis cs.CVPDF

Kaiqiang Xiong, Zhanke Wang, Ronggang Wang

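TL;DR: This paper proposes multimodal-prior-guided importance sampling for hierarchical 3D Gaussian Splatting in sparse-view novel view synthesis. The sampler fuses photometric rendering residuals, semantic priors, and geometric priors into a robust local recoverability estimate that guides where fine Gaussians are injected. The framework comprises a coarse-to-fine Gaussian representation and a geometry-aware sampling and retention policy that prioritizes regions with consistent multimodal evidence, mitigating overfitting and noise.

Details

Motivation: In sparse-view novel view synthesis, relying on raw rendering residuals alone tends to overfit texture-induced errors and noise from pose/appearance inconsistencies. The paper introduces multimodal priors to guide the refinement of the 3D Gaussian representation more robustly.

Result: The method achieves state-of-the-art reconstruction on multiple sparse-view benchmarks, with PSNR gains of up to +0.3 dB on DTU.

Insight: The contribution is fusing multimodal priors (photometric, semantic, and geometric) into importance sampling to directly drive where hierarchical Gaussian Splatting is refined, combined with a geometry-aware policy that protects newly added primitives in underconstrained regions, which helps recover detail more precisely and suppress noise.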

Abstract: We present multimodal-prior-guided importance sampling as the central mechanism for hierarchical 3D Gaussian Splatting (3DGS) in sparse-view novel view synthesis. Our sampler fuses complementary cues (photometric rendering residuals, semantic priors, and geometric priors) to produce a robust, local recoverability estimate that directly drives where to inject fine Gaussians. Built around this sampling core, our framework comprises (1) a coarse-to-fine Gaussian representation that encodes global shape with a stable coarse layer and selectively adds fine primitives where the multimodal metric indicates recoverable detail; and (2) a geometric-aware sampling and retention policy that concentrates refinement on geometrically critical and complex regions while protecting newly added primitives in underconstrained areas from premature pruning. By prioritizing regions supported by consistent multimodal evidence rather than raw residuals alone, our method alleviates overfitting texture-induced errors and suppresses noise from pose/appearance inconsistencies. Experiments on diverse sparse-view benchmarks demonstrate state-of-the-art reconstructions, with up to +0.3 dB PSNR on DTU.


[72] Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models cs.CVPDF

Jialiang Zhang, Junlong Tong, Junyan Lin, Hao Wu, Yirong Sun

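TL;DR: This paper proposes Think-as-You-See (TaYS), a streaming reasoning framework for large vision-language models that process video streams. Conventional batch-style inference does not match real-world scenarios in which video information arrives sequentially; TaYS achieves truly concurrent reasoning through parallelized chain-of-thought generation, stream-constrained training, and stream-parallel inference, significantly improving reasoning performance and reducing latency.

Details

Motivation: Most chain-of-thought reasoning in existing large vision-language models follows a batch paradigm that assumes the full video is available, which does not match the sequential, streaming arrival of real-world video data, so the reasoning process is misaligned with the actual input stream.

Result: Experiments on the Qwen2.5-VL model family across representative video chain-of-thought tasks (event dynamics analysis, causal reasoning, and thematic understanding) show that TaYS consistently outperforms batch and interleaved baselines in reasoning performance while substantially reducing time-to-first-token and overall reasoning delay.

Insight: The contribution is a unified streaming reasoning framework aligned with the data stream, built on temporally aligned reasoning units, streaming attention masks and positional encodings, and a dual KV-cache that decouples visual encoding from textual reasoning, enabling efficient concurrent reasoning.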

Abstract: Large Vision Language Models (LVLMs) exhibit strong Chain-of-Thought (CoT) capabilities, yet most existing paradigms assume full-video availability before inference, a batch-style process misaligned with real-world video streams where information arrives sequentially. Motivated by the streaming nature of video data, we investigate two streaming reasoning paradigms for LVLMs. The first, an interleaved paradigm, alternates between receiving frames and producing partial reasoning but remains constrained by strictly ordered cache updates. To better match streaming inputs, we propose Think-as-You-See (TaYS), a unified framework enabling true concurrent reasoning. TaYS integrates parallelized CoT generation, stream-constrained training, and stream-parallel inference. It further employs temporally aligned reasoning units, streaming attention masks and positional encodings, and a dual KV-cache that decouples visual encoding from textual reasoning. We evaluate all paradigms on the Qwen2.5-VL family across representative video CoT tasks, including event dynamics analysis, causal reasoning, and thematic understanding. Experiments show that TaYS consistently outperforms both batch and interleaved baselines, improving reasoning performance while substantially reducing time-to-first-token (TTFT) and overall reasoning delay. These results demonstrate the effectiveness of data-aligned streaming reasoning in enabling efficient and responsive video understanding for LVLMs. We release our code at https://github.com/EIT-NLP/StreamingLLM/tree/main/TaYS.


[73] SIGMark: Scalable In-Generation Watermark with Blind Extraction for Video Diffusion cs.CVPDF

Xinjie Zhu, Zijing Zhao, Hui Jin, Qingxiao Guo, Yilong Ma

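TL;DR: This paper proposes SIGMark, a scalable in-generation watermarking framework for video diffusion models that supports blind extraction. It generates watermarked initial noise from a global set of frame-wise pseudorandom coding keys and adds a Segment Group-Ordering module tailored to causal 3D VAEs, which keeps the watermark distortion-free while significantly improving robustness to spatio-temporal disturbance and the scalability of large-scale extraction.

Details

Motivation: Existing in-generation watermarking methods for video diffusion models are mostly non-blind: they must store large numbers of message-key pairs and perform template matching, which is computationally expensive, and their robustness to temporal disturbance is extremely weak under causal 3D VAE architectures. The paper aims to solve these problems.

Result: Comprehensive experiments on multiple modern diffusion models show that SIGMark achieves very high bit-extraction accuracy under both temporal and spatial disturbance with minimal overhead, demonstrating its scalability and robustness.

Insight: The contributions are the global set of frame-wise pseudorandom coding keys, which enables blind extraction, and the Segment Group-Ordering module for causal 3D VAEs, which improves temporal robustness. Together they offer a new approach to distortion-free, scalable watermark protection for AIGC video.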

Abstract: Artificial Intelligence Generated Content (AIGC), particularly video generation with diffusion models, has advanced rapidly. Invisible watermarking is a key technology for protecting AI-generated videos and tracing harmful content, and thus plays a crucial role in AI safety. Beyond post-processing watermarks, which inevitably degrade video quality, recent studies have proposed distortion-free in-generation watermarking for video diffusion models. However, existing in-generation approaches are non-blind: they require maintaining all the message-key pairs and performing template-based matching during extraction, which incurs prohibitive computational costs at scale. Moreover, when applied to modern video diffusion models with causal 3D Variational Autoencoders (VAEs), their robustness against temporal disturbance becomes extremely weak. To overcome these challenges, we propose SIGMark, a Scalable In-Generation watermarking framework with blind extraction for video diffusion. To achieve blind extraction, we propose to generate watermarked initial noise using a Global set of Frame-wise PseudoRandom Coding keys (GF-PRC), reducing the cost of storing large-scale information while preserving noise distribution and diversity for distortion-free watermarking. To enhance robustness, we further design a Segment Group-Ordering module (SGO) tailored to causal 3D VAEs, ensuring robust watermark inversion during extraction under temporal disturbance. Comprehensive experiments on modern diffusion models show that SIGMark achieves very high bit-accuracy during extraction under both temporal and spatial disturbances with minimal overhead, demonstrating its scalability and robustness. Our project is available at https://jeremyzhao1998.github.io/SIGMark-release/.


[74] SemanticDialect: Semantic-Aware Mixed-Format Quantization for Video Diffusion Transformers cs.CVPDF

Wonsuk Jang, Thierry Tambe

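TL;DR: This paper proposes SemanticDialect, a semantic-aware mixed-format quantization method for video diffusion transformers (VDiT) that aims to reduce memory and compute overhead at deployment. It scales the formatbook and uses lookup tables for efficient per-block format selection, reduces quantization error through activation decomposition with attention-guided salient token selection, and introduces semantic-aware dialect assignment (SeDA) to improve the consistency of quantized values among semantically correlated tokens.

Details

Motivation: Diffusion Transformers (DiT) perform well at video generation, but their memory and compute costs limit deployment on edge devices. Existing quantization methods often degrade video quality when activation variation is high and semantic/temporal coherence must be preserved.

Result: Experiments on video DiT (VDiT) models show that SemanticDialect outperforms prior VDiT quantization methods and fine-grained block-wise format baselines, and approaches FP16 generation quality on Open-Sora 2.0.

Insight: The contributions are: 1) low-overhead per-block optimal format selection through a scaled formatbook and lookup tables; 2) activation decomposition, which re-quantizes and adds back residual errors with attention-guided salient token selection to reduce error; 3) semantic-aware dialect assignment (SeDA), which shares a sub-formatbook among semantically correlated tokens to improve consistency. Together these address the damage that quantization does to semantic and temporal coherence in video generation.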

Abstract: Diffusion Transformers (DiT) achieve strong video generation quality, but their memory and compute costs hinder edge deployment. Quantization can reduce these costs, yet existing methods often degrade video quality under high activation variation and the need to preserve semantic/temporal coherence. We propose SemanticDialect, which advances recent block-wise mixed-format quantization (selecting a per-block optimal format, a dialect, from multiple candidates, a formatbook) by scaling the formatbook with lookup tables for quantization error and quantized values, enabling efficient per-block format selection and quantization at low online cost. We also introduce activation decomposition that reduces quantization error by re-quantizing and adding back residual errors, with attention-guided salient token selection. We further propose semantic-aware dialect assignment (SeDA) to improve quantized value consistency by sharing a sub-formatbook among semantically correlated tokens. Experiments on video DiT (VDiT) models show that SemanticDialect outperforms prior VDiT quantization methods and fine-grained block-wise format baselines, while approaching FP16 quality on Open-Sora 2.0.
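
The idea of choosing a per-block format (a dialect) from a set of candidates (a formatbook) can be sketched as a brute-force search over candidate level sets. The code below illustrates only that selection step. The two formats, the scale-to-peak rule, and every name are hypothetical. The abstract says the method relies on lookup tables for quantization error and quantized values instead of recomputing them, which this toy does not attempt to reproduce.

```python
# Hypothetical formatbook: each dialect is a set of representable magnitudes
# (2 bits of magnitude plus a sign). These formats are invented for illustration.
FORMATBOOK = {
    "linear": [0.25, 0.5, 0.75, 1.0],
    "log": [0.125, 0.25, 0.5, 1.0],
}

def quantize_block(block, levels):
    """Scale the block so its peak maps to the largest level, then snap to the nearest level."""
    peak = max(abs(v) for v in block)
    if peak == 0.0:
        return list(block)
    scale = peak / max(levels)
    out = []
    for v in block:
        mag = min(levels, key=lambda q: abs(abs(v) / scale - q))
        out.append((mag if v >= 0 else -mag) * scale)
    return out

def squared_error(a, b):
    """Sum of squared element-wise differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pick_dialect(block, formatbook):
    """Return (name, error, quantized block) for the format with the lowest squared error."""
    best = None
    for name, levels in formatbook.items():
        quantized = quantize_block(block, levels)
        err = squared_error(block, quantized)
        if best is None or err < best[1]:
            best = (name, err, quantized)
    return best

blocks = [
    [1.0, 0.75, -0.5, 0.25],  # evenly spread values suit the linear format
    [8.0, 1.0, -2.0, 4.0],    # values spread over powers of two suit the log format
]
choices = [pick_dialect(block, FORMATBOOK) for block in blocks]
print([(name, err) for name, err, _ in choices])
```

The sketch shows why a single global format is a poor fit when activation statistics vary between blocks: each block here is represented exactly by one format and with nonzero error by the other.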


[75] LLandMark: A Multi-Agent Framework for Landmark-Aware Multimodal Interactive Video Retrieval cs.CVPDF

Minh-Chi Phung, Thien-Bao Le, Cam-Tu Tran-Thi, Thu-Dieu Nguyen-Thi, Vu-Hung Dao

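TL;DR: This paper proposes LLandMark, a modular multi-agent framework for landmark-aware multimodal interactive video retrieval. The framework handles complex real-world queries through four collaborative stages (query parsing and planning, landmark reasoning, multimodal retrieval, and reranked answer synthesis), and introduces a Landmark Knowledge Agent and an LLM-assisted image-to-image retrieval pipeline to improve recognition and semantic matching of cultural or spatial landmarks in Vietnamese scenes.

Details

Motivation: As video data grows in diversity and scale, retrieval systems need multimodal understanding, adaptive reasoning, and integration of domain knowledge. The paper targets multimodal video retrieval for complex real-world queries, especially those involving cultural or spatial landmarks.

Result: Experimental results show that LLandMark achieves adaptive, culturally grounded, and explainable retrieval. The abstract reports no specific benchmarks or quantitative comparisons (for example, whether it reaches SOTA).

Insight: The contributions are: 1) a modular multi-agent framework that decomposes retrieval into four specialized stages; 2) a Landmark Knowledge Agent that reformulates landmarks into descriptive visual prompts to strengthen CLIP-based semantic matching; 3) an LLM-assisted image-to-image retrieval pipeline that automatically detects landmarks, generates image queries, and retrieves images without manual image input; 4) an OCR refinement module using Gemini and LlamaIndex that improves Vietnamese text recognition.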

Abstract: The increasing diversity and scale of video data demand retrieval systems capable of multimodal understanding, adaptive reasoning, and domain-specific knowledge integration. This paper presents LLandMark, a modular multi-agent framework for landmark-aware multimodal video retrieval to handle real-world complex queries. The framework features specialized agents that collaborate across four stages: query parsing and planning, landmark reasoning, multimodal retrieval, and reranked answer synthesis. A key component, the Landmark Knowledge Agent, detects cultural or spatial landmarks and reformulates them into descriptive visual prompts, enhancing CLIP-based semantic matching for Vietnamese scenes. To expand capabilities, we introduce an LLM-assisted image-to-image pipeline, where a large language model (Gemini 2.5 Flash) autonomously detects landmarks, generates image search queries, retrieves representative images, and performs CLIP-based visual similarity matching, removing the need for manual image input. In addition, an OCR refinement module leveraging Gemini and LlamaIndex improves Vietnamese text recognition. Experimental results show that LLandMark achieves adaptive, culturally grounded, and explainable retrieval performance.


[76] 3D-DRES: Detailed 3D Referring Expression Segmentation cs.CVPDF

Qi Chen, Changli Wu, Jiayi Ji, Yiwei Ma, Liujuan Cao

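TL;DR: This paper introduces a new task, Detailed 3D Referring Expression Segmentation (3D-DRES), which aims to strengthen fine-grained 3D vision-language understanding through a phrase-to-3D-instance mapping. To support it, the authors build the DetailRefer dataset, containing 54,432 descriptions covering 11,054 distinct objects and annotated under a new phrase-instance paradigm. They also propose DetailBase, a streamlined yet effective baseline architecture that supports dual-mode segmentation at the sentence and phrase levels. Experiments show that models trained on DetailRefer perform well on phrase-level segmentation and also improve markedly on traditional 3D-RES benchmarks.

Details

Motivation: Current 3D visual grounding tasks handle only sentence-level detection or segmentation and fail to exploit the rich compositional and contextual reasoning in natural language expressions, so a finer-grained task is needed to advance 3D vision-language understanding.

Result: Models trained on DetailRefer perform well on phrase-level segmentation and show surprising improvements on traditional 3D-RES benchmarks, suggesting that the approach generalizes.

Insight: The contributions are the 3D-DRES task, the DetailRefer dataset with its phrase-instance annotation paradigm, and DetailBase, a streamlined baseline architecture supporting dual-mode segmentation, which together provide a new research direction and benchmark for fine-grained 3D vision-language understanding.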

Abstract: Current 3D visual grounding tasks only process sentence-level detection or segmentation, which critically fails to leverage the rich compositional and contextual reasoning within natural language expressions. To address this challenge, we introduce Detailed 3D Referring Expression Segmentation (3D-DRES), a new task that provides a phrase-to-3D-instance mapping, aiming to enhance fine-grained 3D vision-language understanding. To support 3D-DRES, we present DetailRefer, a new dataset comprising 54,432 descriptions spanning 11,054 distinct objects. Unlike previous datasets, DetailRefer implements a pioneering phrase-instance annotation paradigm where each referenced noun phrase is explicitly mapped to its corresponding 3D elements. Additionally, we introduce DetailBase, a purposefully streamlined yet effective baseline architecture that supports dual-mode segmentation at both sentence and phrase levels. Our experimental results demonstrate that models trained on DetailRefer not only excel at phrase-level segmentation but also show surprising improvements on traditional 3D-RES benchmarks.


[77] Harmonic Beltrami Signature Network: a Shape Prior Module in Deep Learning Framework cs.CVPDF

Chenran Lin, Lok Ming Lui

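TL;DR: This paper presents the Harmonic Beltrami Signature Network (HBSN), a new deep learning architecture for computing the Harmonic Beltrami Signature (HBS) from binary-like images. HBS is a shape representation in one-to-one correspondence with 2D simply connected shapes and is invariant to translation, scaling, and rotation. By exploiting the function approximation capacity of neural networks, HBSN extracts and uses shape prior information efficiently. The architecture comprises a pre-Spatial Transformer Network (pre-STN) for shape normalization, a UNet-based backbone for HBS prediction, and a post-STN for angle regularization. Experiments show that HBSN computes HBS representations accurately, even for complex shapes. The paper also shows how HBSN can be integrated directly into existing deep learning segmentation models to improve their performance through shape priors.

Details

Motivation: The paper addresses how to embed geometric shape priors efficiently and robustly into computer vision pipelines, in order to improve how models, especially segmentation models, understand and generate shapes.

Result: Experiments show that HBSN computes HBS representations accurately (including for complex shapes) and that integrating it as a module into existing segmentation models improves their performance, confirming its utility as a general-purpose module.

Insight: The main contribution is combining the HBS shape representation, which has good mathematical properties (one-to-one correspondence and invariance), with a deep learning framework through a specific pre-STN, UNet, and post-STN design, yielding an end-to-end trainable network that can serve as a plug-and-play module supplying shape priors. This offers a new way to use geometric information explicitly in deep learning models.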

Abstract: This paper presents the Harmonic Beltrami Signature Network (HBSN), a novel deep learning architecture for computing the Harmonic Beltrami Signature (HBS) from binary-like images. HBS is a shape representation that provides a one-to-one correspondence with 2D simply connected shapes, with invariance to translation, scaling, and rotation. By exploiting the function approximation capacity of neural networks, HBSN enables efficient extraction and utilization of shape prior information. The proposed network architecture incorporates a pre-Spatial Transformer Network (pre-STN) for shape normalization, a UNet-based backbone for HBS prediction, and a post-STN for angle regularization. Experiments show that HBSN accurately computes HBS representations, even for complex shapes. Furthermore, we demonstrate how HBSN can be directly incorporated into existing deep learning segmentation models, improving their performance through the use of shape priors. The results confirm the utility of HBSN as a general-purpose module for embedding geometric shape information into computer vision pipelines.


[78] Articulation in Motion: Prior-free Part Mobility Analysis for Articulated Objects By Dynamic-Static Disentanglement cs.CVPDF

Hao Ai, Wenjie Chang, Jianbo Jiao, Ales Leonardis, Ofek Eyal

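TL;DR: This paper proposes Articulation in Motion (AiM), a new framework that achieves high-quality reconstruction of articulated objects, segmentation of independently moving parts, and articulation analysis from a video of a user interacting with the object and an initial scan. The method does not need to know the number of parts in advance: it uses a dual-Gaussian scene representation and motion cues for part segmentation and joint assignment, and sequential RANSAC for part mobility analysis.

Details

Motivation: Existing methods usually analyze two different articulation states and rely on prior knowledge of the number of parts, which limits their applicability and robustness, especially when the object is not clearly visible in both states. The paper proposes a prior-free approach to part mobility analysis to address these issues.

Result: Extensive experiments on simple and complex objects validate the effectiveness and strong generalization ability of the method, which yields higher-quality part segmentation than previous methods without prior knowledge.

Insight: The contribution is a framework that needs no part-level structural priors: through dynamic-static disentanglement, it uses motion cues and a dual-Gaussian representation for part segmentation and articulation analysis, and sequential RANSAC to determine the number of parts automatically, achieving high-quality reconstruction and rendering.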

Abstract: Articulated objects are ubiquitous in daily life. Our goal is to achieve a high-quality reconstruction, segmentation of independent moving parts, and analysis of articulation. Recent methods analyse two different articulation states and perform per-point part segmentation, optimising per-part articulation using cross-state correspondences, given a priori knowledge of the number of parts. Such assumptions greatly limit their applications and performance. Their robustness is reduced when objects cannot be clearly visible in both states. To address these issues, in this paper, we present a new framework, Articulation in Motion (AiM). We infer part-level decomposition, articulation kinematics, and reconstruct an interactive 3D digital replica from a user-object interaction video and a start-state scan. We propose a dual-Gaussian scene representation that is learned from an initial 3DGS scan of the object and a video that shows the movement of separate parts. It uses motion cues to segment the object into parts and assign articulation joints. Subsequently, a robust, sequential RANSAC is employed to achieve part mobility analysis without any part-level structural priors, which clusters moving primitives into rigid parts and estimates kinematics while automatically determining the number of parts. The proposed approach separates the object into parts, each represented as a 3D Gaussian set, enabling high-quality rendering. Our approach yields higher quality part segmentation than previous methods, without prior knowledge. Extensive experimental analysis on both simple and complex objects validates the effectiveness and strong generalisation ability of our approach. Project page: https://haoai-1997.github.io/AiM/.
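
The way sequential RANSAC can determine the number of parts automatically is easiest to see on a reduced toy problem. In the sketch below, rigid part motions are simplified to pure 2D translations and each moving primitive is represented only by its displacement between two frames. These simplifications, the thresholds, and all names are assumptions made for illustration. The paper estimates full articulation kinematics on 3D Gaussian primitives, which this sketch does not attempt.

```python
import math
import random

def sequential_ransac(displacements, tol=0.05, min_inliers=3, trials=50, seed=0):
    """Greedily extract groups of primitives that share one translation.

    Each round fits the best-supported motion, removes its inliers, and repeats
    until no hypothesis gathers `min_inliers` supporters, so the number of parts
    is an output rather than an input.
    """
    rng = random.Random(seed)
    remaining = list(range(len(displacements)))
    parts = []
    while len(remaining) >= min_inliers:
        best = []
        for _ in range(trials):
            # Minimal sample: one displacement fully defines a translation.
            cand = displacements[rng.choice(remaining)]
            inliers = [i for i in remaining
                       if math.dist(displacements[i], cand) <= tol]
            if len(inliers) > len(best):
                best = inliers
        if len(best) < min_inliers:
            break
        # Refit the motion as the mean displacement of the consensus set.
        mean = (sum(displacements[i][0] for i in best) / len(best),
                sum(displacements[i][1] for i in best) / len(best))
        parts.append((mean, best))
        taken = set(best)
        remaining = [i for i in remaining if i not in taken]
    return parts

# Invented data: five primitives on one part, four on another, two outliers.
displacements = [
    (1.0, 0.0), (1.01, 0.0), (0.99, 0.01), (1.0, -0.01), (1.02, 0.0),
    (0.0, 2.0), (0.01, 2.0), (0.0, 1.99), (-0.01, 2.01),
    (5.0, 5.0), (-3.0, 1.0),
]
parts = sequential_ransac(displacements)
for motion, members in parts:
    print(motion, members)
```

The two outliers never reach the minimum support, so they are left unassigned instead of forming spurious parts.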


[79] Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers cs.CV | cs.AI | cs.LGPDF

Youngjun Jun, Seil Kang, Woojung Han, Seong Jae Hwang

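TL;DR: This paper proposes a method, requiring no gradient computation or parameter updates, for explaining how Video Diffusion Transformers (Video DiTs) turn motion concepts in text descriptions into video. It uses GramCol to adaptively produce per-frame saliency maps and combines them with a motion-feature selection algorithm to obtain an Interpretable Motion-Attentive Map (IMAP) that localizes motion in space and time.

Details

Motivation: Interpretability research on Video Diffusion Transformers has focused mainly on objects, and motion-related behavior is poorly understood. The paper investigates how Video DiTs turn motion words into concrete spatio-temporal features in video.

Result: Experiments show that the method has outstanding localization capability on the motion localization task and on zero-shot video semantic segmentation, providing clearer, interpretable saliency maps for both motion and non-motion concepts.

Insight: The contributions are GramCol, a gradient-free saliency map method, and a motion-feature selection algorithm, which together localize motion concepts in Video Diffusion Transformers in space and time and offer a new perspective on model interpretability.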

Abstract: Video Diffusion Transformers (DiTs) synthesize high-fidelity video from text descriptions involving motion. However, how Video DiTs convert motion words into video is still poorly understood. Furthermore, while prior studies on interpretable saliency maps primarily target objects, motion-related behavior in Video DiTs remains largely unexplored. In this paper, we investigate concrete motion features that specify when and which object moves for a given motion concept. First, to spatially localize, we introduce GramCol, which adaptively produces per-frame saliency maps for any text concept, including both motion and non-motion. Second, we propose a motion-feature selection algorithm to obtain an Interpretable Motion-Attentive Map (IMAP) that localizes motion spatially and temporally. Our method discovers concept saliency maps without the need for any gradient calculation or parameter update. Experimentally, our method shows outstanding localization capability on the motion localization task and zero-shot video semantic segmentation, providing interpretable and clearer saliency maps for both motion and non-motion concepts.


[80] TRACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval cs.CVPDF

Xiangzhao Hao, Shijie Wang, Tianyu Yang, Tianyue Wang, Haiyun Guo

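TL;DR: This paper proposes TRACE (Task-adaptive Reasoning And Compressing Embeddings), a framework for universal multimodal retrieval. It unifies generative reasoning with discriminative representation learning by generating a structured chain of thought that explicitly reasons about query intent and then compressing it into a compact embedding.

Details

Motivation: Existing unified embedding models are confined to a static-encoder paradigm, so they cannot fully exploit the generative and reasoning capabilities of multimodal large language models when handling diverse user intents that range from simple keywords to complex compositional instructions.

Result: TRACE sets a new state of the art (SOTA) on the M-BEIR benchmark and achieves an optimal balance between retrieval accuracy and inference throughput.

Insight: The core contribution is explicit chain-of-thought reasoning compressed into an embedding, which unifies generative and discriminative capabilities. The model learns an implicit routing behavior on its own, activating reasoning for complex queries and bypassing it for simple ones, and shows strong zero-shot transfer.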

Abstract: Universal Multimodal Retrieval requires unified embedding models capable of interpreting diverse user intents, ranging from simple keywords to complex compositional instructions. While Multimodal Large Language Models (MLLMs) possess strong reasoning capabilities, prevailing adaptations confine them to static encoders, underutilizing their generative potential. This encoder-only paradigm struggles with complex intents that demand logical deduction rather than superficial pattern matching. To address this, we introduce TRACE (Task-adaptive Reasoning And Compressing Embeddings). TRACE unifies generative reasoning with discriminative representation learning. It first generates a structured Chain-of-Thought (CoT) to explicitly reason about the query, and subsequently compresses this reasoning trace into a compact embedding via a dedicated token. To train this framework, we construct M-BEIR-CoT, a large-scale dataset featuring a difficulty-aware routing strategy. Experiments on the M-BEIR benchmark establish TRACE as the new state-of-the-art. Crucially, TRACE demonstrates a learned implicit routing behavior. It autonomously activates reasoning for complex queries while bypassing it for simpler ones, achieving an optimal balance between retrieval accuracy and inference throughput. Furthermore, by internalizing the deductive process, TRACE exhibits remarkable zero-shot transferability to unseen domains and novel constraints.


[81] TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration cs.CVPDF

Benlei Cui, Shaoxuan He, Bukun Huang, Zhizeng Ye, Yunyun Sun

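TL;DR: This paper proposes TC-Padé (Trajectory-Consistent Padé approximation), a feature prediction framework for accelerating diffusion model sampling. It models feature evolution with Padé rational approximation, which captures asymptotic and transitional behavior more accurately than Taylor-series-based methods. With adaptive coefficient modulation and step-aware prediction strategies for the early, mid, and late sampling stages, it achieves stable, trajectory-consistent sampling at reduced step counts (for example, 20-30 steps), effectively accelerating image and video generation.

Details

Motivation: Diffusion models reach state-of-the-art generation quality, but their iterative sampling is computationally heavy. Existing feature caching techniques work at higher step counts (for example, 50 steps) but are limited in the low-step regime (20-30 steps) common in practice: as the interval between steps grows, polynomial extrapolators such as TaylorSeer fail because of error accumulation and trajectory drift, and conventional caching strategies tend to ignore the differing dynamics of the denoising phases.

Result: Image and video generation experiments on DiT-XL/2, FLUX.1-dev, and Wan2.1 demonstrate the effectiveness of TC-Padé. For example, it achieves 2.88x acceleration on FLUX.1-dev and 1.72x on Wan2.1 while maintaining high quality on FID, CLIP, Aesthetic, and VBench-2.0 metrics, substantially outperforming existing feature caching methods.

Insight: The contribution is bringing Padé rational approximation to feature prediction in diffusion models to better model the complex dynamics of feature evolution, especially asymptotic behavior. Its core designs are: 1) adaptive coefficient modulation, which uses historical cached residuals to detect subtle trajectory transitions; 2) step-aware prediction strategies tailored to the dynamics of the early, mid, and late sampling stages. This offers a new way to maintain trajectory consistency and stability under low-step sampling.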

Abstract: Despite achieving state-of-the-art generation quality, diffusion models are hindered by the substantial computational burden of their iterative sampling process. While feature caching techniques achieve effective acceleration at higher step counts (e.g., 50 steps), they exhibit critical limitations in the practical low-step regime of 20-30 steps. As the interval between steps increases, polynomial-based extrapolators like TaylorSeer suffer from error accumulation and trajectory drift. Meanwhile, conventional caching strategies often overlook the distinct dynamical properties of different denoising phases. To address these challenges, we propose Trajectory-Consistent Padé approximation, a feature prediction framework grounded in Padé approximation. By modeling feature evolution through rational functions, our approach captures asymptotic and transitional behaviors more accurately than Taylor-based methods. To enable stable and trajectory-consistent sampling under reduced step counts, TC-Padé incorporates (1) adaptive coefficient modulation that leverages historical cached residuals to detect subtle trajectory transitions, and (2) step-aware prediction strategies tailored to the distinct dynamics of early, mid, and late sampling stages. Extensive experiments on DiT-XL/2, FLUX.1-dev, and Wan2.1 across both image and video generation demonstrate the effectiveness of TC-Padé. For instance, TC-Padé achieves 2.88x acceleration on FLUX.1-dev and 1.72x on Wan2.1 while maintaining high quality across FID, CLIP, Aesthetic, and VBench-2.0 metrics, substantially outperforming existing feature caching methods.
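
The claim that rational functions capture asymptotic behavior better than Taylor polynomials can be checked on a scalar toy function. The sketch below builds a [1/1] Padé approximant from the same three local coefficients a second-order Taylor extrapolator would use, and evaluates both far from the expansion point. The test function 1/(1+x) and all names are illustrative assumptions. The sketch is not the paper's feature predictor, which operates on cached network features and adds coefficient modulation and stage-specific strategies.

```python
def taylor2(c0, c1, c2, x):
    """Second-order Taylor polynomial c0 + c1*x + c2*x^2."""
    return c0 + c1 * x + c2 * x * x

def pade11(c0, c1, c2, x):
    """[1/1] Padé approximant (a0 + a1*x) / (1 + b1*x).

    The coefficients follow from matching the Taylor expansion through x^2:
    a0 = c0, a1 = c1 + c0*b1, and 0 = c2 + c1*b1. Requires c1 != 0.
    """
    b1 = -c2 / c1
    a0 = c0
    a1 = c1 + c0 * b1
    return (a0 + a1 * x) / (1.0 + b1 * x)

# Toy target with a decaying tail: f(x) = 1 / (1 + x) = 1 - x + x^2 - ...
c0, c1, c2 = 1.0, -1.0, 1.0
x = 2.0  # a large extrapolation step, outside the Taylor series' radius of convergence

exact = 1.0 / (1.0 + x)
taylor_value = taylor2(c0, c1, c2, x)
pade_value = pade11(c0, c1, c2, x)
print(exact, taylor_value, pade_value)
```

For this target the Padé form reproduces the function exactly, while the polynomial grows without bound as the step increases. This is the kind of drift at large step intervals that the abstract attributes to polynomial extrapolators.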


[82] Semi-Supervised Few-Shot Adaptation of Vision-Language Models cs.CVPDF

Julio Silva-Rodríguez, Ender Konukoglu

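TL;DR: This paper proposes a semi-supervised few-shot adaptation method for vision-language models (VLMs). It leverages unlabeled data and propagates text-informed pseudo-labels to address class imbalance in extremely low-shot medical image classification, substantially reducing annotation cost.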

Details

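Motivation: The work addresses poor performance on underrepresented classes, caused by inherent class imbalance, when adapting vision-language models in extremely low-shot (few-shot) regimes for medical imaging tasks, with the aim of reducing the high cost of expert annotation.

Result: In low-shot regimes, the proposed method reduces labeling effort by more than 50%, enabling lower-budget annotation pipelines for adapting VLMs.

Insight: The novelty is an efficient semi-supervised solver that propagates text-informed pseudo-labels during few-shot adaptation, making effective use of unlabeled data to improve performance under class imbalance and offering a new approach to lightweight VLM adaptation.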

Abstract: Vision-language models (VLMs) pre-trained on large, heterogeneous data sources are becoming increasingly popular, providing rich multi-modal embeddings that enable efficient transfer to new tasks. A particularly relevant application is few-shot adaptation, where only a handful of annotated examples are available to adapt the model through multi-modal linear probes. In medical imaging, specialized VLMs have shown promising performance in zero- and few-shot image classification, which is valuable for mitigating the high cost of expert annotations. However, challenges remain in extremely low-shot regimes: the inherent class imbalances in medical tasks often lead to underrepresented categories, penalizing overall model performance. To address this limitation, we propose leveraging unlabeled data by introducing an efficient semi-supervised solver that propagates text-informed pseudo-labels during few-shot adaptation. The proposed method enables lower-budget annotation pipelines for adapting VLMs, reducing labeling effort by >50% in low-shot regimes.


[83] TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation cs.CV | cs.ROPDF

Jiaxing Liu, Zexi Zhang, Xiaoyan Li, Boyue Wang, Yongli Hu

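TL;DR: This paper proposes TagaVLM, an end-to-end framework for vision-language navigation that addresses the architectural mismatch of large vision-language models on dynamic, embodied navigation tasks. By explicitly injecting topological structures into the VLM backbone, the method enables global action reasoning and robust path correction.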

Details

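Motivation: Existing large-model-based methods typically convert rich visual and spatial information into text, forcing models to implicitly infer complex visual-topological relationships or limiting their global action capabilities. TagaVLM aims to bridge the gap between the static tasks on which large vision-language models are pretrained and dynamic, embodied navigation tasks.

Result: In unseen environments on the R2R benchmark, TagaVLM achieves state-of-the-art performance among large-model-based methods, with a Success Rate (SR) of 51.09% and an SPL of 47.18, outperforming prior work by 3.39% in SR and 9.08 in SPL.

Insight: The core novelty is explicitly injecting topological structure into the VLM (through Spatial Topology Aware Residual Attention, STAR-Att, and an Interleaved Navigation Prompt), giving it intrinsic spatial reasoning and global action planning capabilities. The study also shows that, for embodied spatial reasoning, targeted enhancements on smaller open-source VLMs can be more effective than brute-force model scaling.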

Abstract: Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch: VLMs are primarily pretrained on static, disembodied vision-language tasks, which fundamentally clash with the dynamic, embodied, and spatially-structured nature of navigation. Existing large-model-based methods often resort to converting rich visual and spatial information into text, forcing models to implicitly infer complex visual-topological relationships or limiting their global action capabilities. To bridge this gap, we propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To introduce topological edge information, Spatial Topology Aware Residual Attention (STAR-Att) directly integrates it into the VLM’s self-attention mechanism, enabling intrinsic spatial reasoning while preserving pretrained knowledge. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. Finally, with the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction. On the R2R benchmark, TagaVLM achieves state-of-the-art performance among large-model-based methods, with a Success Rate (SR) of 51.09% and SPL of 47.18 in unseen environments, outperforming prior work by 3.39% in SR and 9.08 in SPL. This demonstrates that, for embodied spatial reasoning, targeted enhancements on smaller open-source VLMs can be more effective than brute-force model scaling. The code will be released upon publication. Project page: https://apex-bjut.github.io/Taga-VLM
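
For readers unfamiliar with the two navigation metrics quoted above, the sketch below computes Success Rate (SR) and Success weighted by Path Length (SPL) using their standard definitions. The episode records and field names (`success`, `shortest`, `taken`) are hypothetical toy data, not values from the paper or the R2R benchmark.

```python
def success_rate(episodes):
    """Fraction of episodes in which the agent reached the goal."""
    return sum(1 for e in episodes if e["success"]) / len(episodes)


def spl(episodes):
    """Success weighted by Path Length: mean of S_i * l_i / max(p_i, l_i)."""
    total = 0.0
    for e in episodes:
        if e["success"]:
            # l_i = shortest-path length, p_i = length of the path actually taken
            total += e["shortest"] / max(e["taken"], e["shortest"])
    return total / len(episodes)


# Hypothetical toy episodes (lengths in metres).
episodes = [
    {"success": True, "shortest": 10.0, "taken": 10.0},   # optimal path
    {"success": True, "shortest": 10.0, "taken": 20.0},   # success, but twice as long
    {"success": False, "shortest": 8.0, "taken": 5.0},    # failure contributes 0
    {"success": True, "shortest": 6.0, "taken": 8.0},
]

sr_value = success_rate(episodes)
spl_value = spl(episodes)
print(sr_value, spl_value)
```

SPL can never exceed SR, because each successful episode contributes at most 1; a large SPL gain therefore indicates shorter, more direct successful paths rather than just more successes.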


[84] The Dresden Dataset for 4D Reconstruction of Non-Rigid Abdominal Surgical Scenes cs.CVPDF

Reuben Docea, Rayan Younis, Yonghao Long, Maxime Fleury, Jinjing Xu

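TL;DR: The D4D dataset provides paired endoscopic video and high-quality structured-light geometry for evaluating 3D reconstruction of deforming abdominal soft tissue under realistic surgical conditions. It comprises six porcine cadaver sessions captured with a da Vinci Xi stereo endoscope and a Zivid structured-light camera, registered via optical tracking and manually curated iterative alignment. Three sequence types (whole deformations, incremental deformations, and moved-camera clips) test algorithm robustness to non-rigid motion, deformation magnitude, and out-of-view updates. Each clip provides rectified stereo images, per-frame instrument masks, stereo depth, start/end structured-light point clouds, curated camera poses, and camera intrinsics.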

Details

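Motivation: High-quality benchmark datasets for evaluating non-rigid 3D reconstruction of abdominal soft tissue under realistic surgical conditions are lacking; this dataset addresses that gap.

Result: The dataset comprises over 300,000 frames and 369 point clouds across 98 curated recordings, and can serve as a comprehensive benchmark for non-rigid SLAM, 4D reconstruction, and depth estimation methods.

Insight: The novelty lies in providing paired endoscopic video and structured-light geometry that supports quantitative geometric evaluation in both visible and occluded regions, together with three purpose-designed sequence types for systematically testing algorithm robustness.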

Abstract: The D4D Dataset provides paired endoscopic video and high-quality structured-light geometry for evaluating 3D reconstruction of deforming abdominal soft tissue in realistic surgical conditions. Data were acquired from six porcine cadaver sessions using a da Vinci Xi stereo endoscope and a Zivid structured-light camera, registered via optical tracking and manually curated iterative alignment methods. Three sequence types - whole deformations, incremental deformations, and moved-camera clips - probe algorithm robustness to non-rigid motion, deformation magnitude, and out-of-view updates. Each clip provides rectified stereo images, per-frame instrument masks, stereo depth, start/end structured-light point clouds, curated camera poses and camera intrinsics. In postprocessing, ICP and semi-automatic registration techniques are used to register data, and instrument masks are created. The dataset enables quantitative geometric evaluation in both visible and occluded regions, alongside photometric view-synthesis baselines. Comprising over 300,000 frames and 369 point clouds across 98 curated recordings, this resource can serve as a comprehensive benchmark for developing and evaluating non-rigid SLAM, 4D reconstruction, and depth estimation methods.


[85] Any Resolution Any Geometry: From Multi-View To Multi-Patch cs.CVPDF

Wenqing Cui, Zhenyu Li, Mykola Lavreniuk, Jian Shi, Ramzi Idoughi

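TL;DR: This paper proposes the Ultra Resolution Geometry Transformer (URGT), a unified multi-patch transformer framework for jointly estimating depth and surface normals from a single high-resolution image. The image is partitioned into patches that are augmented with coarse depth and normal priors from pre-trained models and jointly processed in a single forward pass to predict refined geometric outputs. Cross-patch attention enforces global consistency, and a GridMix patch sampling strategy improves spatial robustness.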

Details

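Motivation: The work addresses the trade-off between preserving fine local detail and maintaining global consistency in joint high-resolution depth and normal estimation.

Result: The method achieves state-of-the-art results on the UnrealStereo4K benchmark, improving both depth and normal estimation: AbsRel drops from 0.0582 to 0.0291, RMSE from 2.17 to 1.31, and mean angular error from 23.36 degrees to 18.51 degrees, with sharper and more stable geometry. It also shows strong zero-shot and cross-domain generalization and scales effectively to very high resolutions.

Insight: The core novelty is adapting the Visual Geometry Grounded Transformer (VGGT) into a unified multi-patch transformer architecture, in which cross-patch attention enables long-range geometric reasoning and seamless propagation of information across patches. In addition, the proposed GridMix strategy, which probabilistically samples grid configurations during training, improves inter-patch consistency and generalization, yielding an efficient and extensible solution for high-quality geometry refinement.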

Abstract: Joint estimation of surface normals and depth is essential for holistic 3D scene understanding, yet high-resolution prediction remains difficult due to the trade-off between preserving fine local detail and maintaining global consistency. To address this challenge, we propose the Ultra Resolution Geometry Transformer (URGT), which adapts the Visual Geometry Grounded Transformer (VGGT) into a unified multi-patch transformer for monocular high-resolution depth–normal estimation. A single high-resolution image is partitioned into patches that are augmented with coarse depth and normal priors from pre-trained models, and jointly processed in a single forward pass to predict refined geometric outputs. Global coherence is enforced through cross-patch attention, which enables long-range geometric reasoning and seamless propagation of information across patches within a shared backbone. To further enhance spatial robustness, we introduce a GridMix patch sampling strategy that probabilistically samples grid configurations during training, improving inter-patch consistency and generalization. Our method achieves state-of-the-art results on UnrealStereo4K, jointly improving depth and normal estimation, reducing AbsRel from 0.0582 to 0.0291, RMSE from 2.17 to 1.31, and lowering mean angular error from 23.36 degrees to 18.51 degrees, while producing sharper and more stable geometry. The proposed multi-patch framework also demonstrates strong zero-shot and cross-domain generalization and scales effectively to very high resolutions, offering an efficient and extensible solution for high-quality geometry refinement.
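
The three reported metrics have standard definitions, sketched below in plain Python for clarity. The tiny depth and normal arrays are made-up toy inputs, and the function names are illustrative; the paper's actual evaluation code (masking, scale alignment, units) may differ.

```python
import math


def abs_rel(pred, gt):
    """Absolute relative depth error: mean(|d - d*| / d*)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def rmse(pred, gt):
    """Root mean squared depth error."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


def mean_angular_error_deg(pred_normals, gt_normals):
    """Mean angle in degrees between predicted and ground-truth normals."""
    total = 0.0
    for p, g in zip(pred_normals, gt_normals):
        dot = sum(a * b for a, b in zip(p, g))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in g))
        cos = max(-1.0, min(1.0, dot / norm))   # clamp against rounding error
        total += math.degrees(math.acos(cos))
    return total / len(gt_normals)


# Toy inputs: three depth pixels and two surface normals.
depth_gt = [1.0, 2.0, 4.0]
depth_pred = [1.1, 1.8, 4.0]
normals_gt = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
normals_pred = [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]

abs_rel_value = abs_rel(depth_pred, depth_gt)
rmse_value = rmse(depth_pred, depth_gt)
angle_value = mean_angular_error_deg(normals_pred, normals_gt)
print(abs_rel_value, rmse_value, angle_value)
```

AbsRel is scale-normalized while RMSE is in depth units, which is why the two can move by different relative amounts for the same model change.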


[86] EduVQA: Benchmarking AI-Generated Video Quality Assessment for Education cs.CVPDF

Baoliang Chen, Xinlong Bu, Lingyu Zhu, Hanwei Zhu, Xiangjie Sui

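TL;DR: This paper presents EduAIGV-1k, the first benchmark dataset and evaluation framework for assessing the quality of AI-generated videos for education, and builds on it to develop EduVQA, a model for fine-grained assessment of the perceptual quality and prompt-alignment quality of educational videos.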

Details

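Motivation: Although AIGC models have achieved remarkable success in generating photorealistic videos, their potential to support visual, story-driven learning in education remains largely untapped. This paper aims to close that gap by assessing the quality of AI-generated educational videos.

Result: Extensive experiments on the proposed EduAIGV-1k benchmark show that EduVQA consistently outperforms existing VQA baselines in both perceptual quality and prompt-alignment quality assessment.

Insight: The main contributions are: 1) the first AI-generated video quality assessment dataset focused on an educational setting (foundational math concepts), with fine-grained annotations for perceptual quality (spatial/temporal fidelity) and prompt alignment (word-level/sentence-level); and 2) a Structured 2D Mixture-of-Experts (S2D-MoE) module that strengthens the dependency between overall quality and each sub-dimension through shared experts and a dynamic 2D gating matrix, improving assessment performance.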

Abstract: While AI-generated content (AIGC) models have achieved remarkable success in generating photorealistic videos, their potential to support visual, story-driven learning in education remains largely untapped. To close this gap, we present EduAIGV-1k, the first benchmark dataset and evaluation framework dedicated to assessing the quality of AI-generated videos (AIGVs) designed to teach foundational math concepts, such as numbers and geometry, to young learners. EduAIGV-1k contains 1,130 short videos produced by ten state-of-the-art text-to-video (T2V) models using 113 pedagogy-oriented prompts. Each video is accompanied by rich, fine-grained annotations along two complementary axes: (1) Perceptual quality, disentangled into spatial and temporal fidelity, and (2) Prompt alignment, labeled at the word-level and sentence-level to quantify the degree to which each mathematical concept in the prompt is accurately grounded in the generated video. These fine-grained annotations transform each video into a multi-dimensional, interpretable supervision signal, far beyond a single quality score. Leveraging this dense feedback, we introduce EduVQA for both perceptual and alignment quality assessment of AIGVs. In particular, we propose a Structured 2D Mixture-of-Experts (S2D-MoE) module, which enhances the dependency between overall quality and each sub-dimension by shared experts and dynamic 2D gating matrix. Extensive experiments show our EduVQA consistently outperforms existing VQA baselines. Both our dataset and code will be publicly available.


[87] AWDiff: An a trous wavelet diffusion model for lung ultrasound image synthesis cs.CVPDF

Maryam Heidari, Nantheera Anantrasirichai, Steven Walker, Rahul Bhatnagar, Alin Achim

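TL;DR: This paper proposes AWDiff, a diffusion-based framework for lung ultrasound image synthesis that incorporates the à trous wavelet transform to preserve subtle diagnostic structures and uses BioMedCLIP for semantic conditioning, with the aim of addressing the scarcity of lung ultrasound data and improving the quality of generated images.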

Details

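Motivation: The scarcity of lung ultrasound data limits the development of machine learning methods, and existing generative approaches (such as GANs and diffusion models) often lose subtle diagnostic cues such as B-lines and pleural irregularities because of resolution reduction.

Result: On a lung ultrasound dataset, AWDiff achieves lower distortion and higher perceptual quality than existing methods, demonstrating both structural fidelity and clinical diversity.

Insight: The novelty lies in integrating the à trous wavelet transform into a diffusion model to avoid destructive downsampling, and in using BioMedCLIP for semantic conditioning, which preserves fine structures while ensuring clinical relevance.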

Abstract: Lung ultrasound (LUS) is a safe and portable imaging modality, but the scarcity of data limits the development of machine learning methods for image interpretation and disease monitoring. Existing generative augmentation methods, such as Generative Adversarial Networks (GANs) and diffusion models, often lose subtle diagnostic cues due to resolution reduction, particularly B-lines and pleural irregularities. We propose A trous Wavelet Diffusion (AWDiff), a diffusion-based augmentation framework that integrates the à trous wavelet transform to preserve fine-scale structures while avoiding destructive downsampling. In addition, semantic conditioning with BioMedCLIP, a vision-language foundation model trained on large-scale biomedical corpora, enforces alignment with clinically meaningful labels. On a LUS dataset, AWDiff achieved lower distortion and higher perceptual quality compared to existing methods, demonstrating both structural fidelity and clinical diversity.


[88] Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing cs.CV | cs.AIPDF

Jiyuan Wang, Chunyu Lin, Lei Sun, Zhi Cao, Yuyang Yin

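TL;DR: This paper proposes RL3DEdit, a single-pass framework that uses reinforcement learning optimization to achieve multi-view consistent 3D scene editing. Using confidence maps and pose estimation errors from the 3D foundation model VGGT as reward signals, it anchors the editing priors of 2D diffusion models onto a 3D-consistent manifold, effectively addressing multi-view inconsistency.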

Details

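Motivation: Existing methods that use 2D diffusion priors for 3D editing struggle to maintain multi-view consistency, and the scarcity of 3D-consistent paired editing data makes supervised fine-tuning infeasible. The authors observe that verifying 3D consistency is easier than generating 3D-consistent content, which positions reinforcement learning as a feasible solution.

Result: Extensive experiments show that RL3DEdit outperforms existing state-of-the-art methods in editing quality and efficiency while achieving stable multi-view consistency.

Insight: The novelty lies in combining reinforcement learning with the 3D foundation model VGGT, using its robust priors learned from massive real-world data to construct reward functions, so that 2D editing priors can be guided toward 3D consistency without paired data.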

Abstract: Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent editing paired data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that, while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, naturally positioning reinforcement learning (RL) as a feasible solution. Motivated by this, we propose RL3DEdit, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model, VGGT. Specifically, we leverage VGGT’s robust priors learned from massive real-world data, feed the edited images, and utilize the output confidence maps and pose estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency. To promote the development of 3D editing, we will release the code and model.


[89] Kling-MotionControl Technical Report cs.CVPDF

Kling Team, Jialu Chen, Yikang Ding, Zhixue Fang, Kun Gai

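TL;DR: This paper introduces Kling-MotionControl, a unified framework based on the Diffusion Transformer (DiT), designed for robust, precise, and expressive holistic character animation. The framework uses a divide-and-conquer strategy to orchestrate heterogeneous motion representations for body, face, and hands, and incorporates adaptive identity-agnostic learning, careful identity injection and fusion designs, and a subject library mechanism to achieve cross-identity generalization and faithful appearance preservation. An acceleration framework based on multi-stage distillation boosts inference speed by over 10x. Human preference evaluations indicate that it outperforms leading commercial and open-source solutions in holistic motion control, open-domain generalization, and visual quality.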

Details

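Motivation: When generating high-fidelity, controllable, and expressive full-body animation, existing character animation methods struggle to ensure both large-scale structural stability and fine-grained articulatory expressiveness, and to achieve natural motion retargeting and faithful appearance preservation across diverse characters (from realistic humans to cartoons).

Result: Human preference evaluations show that Kling-MotionControl outperforms leading commercial and open-source solutions in holistic motion control, open-domain generalization, and visual quality and coherence, achieving exceptional fidelity.

Insight: The main contributions include: 1) a unified DiT framework that uses a divide-and-conquer strategy to orchestrate heterogeneous motion representations for body, face, and hands; 2) an adaptive identity-agnostic learning mechanism for robust cross-identity (realistic/cartoon) motion retargeting; 3) careful identity injection and fusion designs, combined with a subject library mechanism, to preserve appearance; 4) an acceleration framework based on multi-stage distillation that substantially increases inference speed; and 5) intelligent semantic motion understanding and precise text responsiveness, allowing flexible control beyond visual inputs.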

Abstract: Character animation aims to generate lifelike videos by transferring motion dynamics from a driving video to a reference image. Recent strides in generative models have paved the way for high-fidelity character animation. In this work, we present Kling-MotionControl, a unified DiT-based framework engineered specifically for robust, precise, and expressive holistic character animation. Leveraging a divide-and-conquer strategy within a cohesive system, the model orchestrates heterogeneous motion representations tailored to the distinct characteristics of body, face, and hands, effectively reconciling large-scale structural stability with fine-grained articulatory expressiveness. To ensure robust cross-identity generalization, we incorporate adaptive identity-agnostic learning, facilitating natural motion retargeting for diverse characters ranging from realistic humans to stylized cartoons. Simultaneously, we guarantee faithful appearance preservation through meticulous identity injection and fusion designs, further supported by a subject library mechanism that leverages comprehensive reference contexts. To ensure practical utility, we implement an advanced acceleration framework utilizing multi-stage distillation, boosting inference speed by over 10x. Kling-MotionControl distinguishes itself through intelligent semantic motion understanding and precise text responsiveness, allowing for flexible control beyond visual inputs. Human preference evaluations demonstrate that Kling-MotionControl delivers superior performance compared to leading commercial and open-source solutions, achieving exceptional fidelity in holistic motion control, open domain generalization, and visual quality and coherence. These results establish Kling-MotionControl as a robust solution for high-quality, controllable, and lifelike character animation.


[90] Chain of World: World Model Thinking in Latent Motion cs.CV | cs.AI | cs.ROPDF

Fuxiang Yang, Donglin Di, Lulu Tang, Xuancheng Zhang, Lei Fan

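TL;DR: This paper proposes a new paradigm called CoWVLA (Chain-of-World VLA), which unifies world-model temporal reasoning with a disentangled latent motion representation to overcome the limitations of existing vision-language-action models in modeling predictive and temporal-causal structure and in computational efficiency.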

Details

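Motivation: Existing vision-language-action models either waste capacity reconstructing redundant backgrounds when predicting future frames, or lack continuous dynamic modeling and world knowledge when encoding frame-to-frame transitions. A new approach is therefore needed that combines temporal reasoning, world knowledge, compactness, and interpretability.

Result: Extensive experiments on robotic simulation benchmarks show that CoWVLA outperforms existing world-model and latent-action approaches while achieving moderate computational efficiency.

Insight: The novelty is the “Chain of World” paradigm: a pretrained video VAE serves as a latent motion extractor that explicitly factorizes video segments into structure and motion latents, and a unified autoregressive decoder jointly models sparse keyframes and action sequences, yielding a compact latent action representation while retaining the benefits of temporal reasoning.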

Abstract: Vision-Language-Action (VLA) models are a promising path toward embodied intelligence, yet they often overlook the predictive and temporal-causal structure underlying visual dynamics. World-model VLAs address this by predicting future frames, but waste capacity reconstructing redundant backgrounds. Latent-action VLAs encode frame-to-frame transitions compactly, but lack temporally continuous dynamic modeling and world knowledge. To overcome these limitations, we introduce CoWVLA (Chain-of-World VLA), a new “Chain of World” paradigm that unifies world-model temporal reasoning with a disentangled latent motion representation. First, a pretrained video VAE serves as a latent motion extractor, explicitly factorizing video segments into structure and motion latents. Then, during pre-training, the VLA learns from an instruction and an initial frame to infer a continuous latent motion chain and predict the segment’s terminal frame. Finally, during co-fine-tuning, this latent dynamic is aligned with discrete action prediction by jointly modeling sparse keyframes and action sequences in a unified autoregressive decoder. This design preserves the world-model benefits of temporal reasoning and world knowledge while retaining the compactness and interpretability of latent actions, enabling efficient visuomotor learning. Extensive experiments on robotic simulation benchmarks show that CoWVLA outperforms existing world-model and latent-action approaches and achieves moderate computational efficiency, highlighting its potential as a more effective VLA pretraining paradigm. The project website can be found at https://fx-hit.github.io/cowvla-io.


[91] Specificity-aware reinforcement learning for fine-grained open-world classification cs.CVPDF

Samuele Angheben, Davide Berasi, Alessandro Conti, Elisa Ricci, Yiming Wang

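TL;DR: This paper proposes SpeciaRL, a reinforcement learning framework that improves the accuracy and specificity of large multimodal models' predictions in open-world fine-grained image classification. Through a dynamic verifier-based reward mechanism, the framework steers the model toward more specific predictions while preserving correctness, addressing the tendency of existing models to produce overly generic predictions.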

Details

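Motivation: When classifying fine-grained visual concepts in an open-world setting, existing large multimodal models have strong visual understanding but tend to produce overly generic predictions, making it hard to achieve both correctness and specificity. This paper addresses how to increase prediction specificity without sacrificing correctness.

Result: Out-of-domain experiments across multiple fine-grained benchmarks show that SpeciaRL achieves the best trade-off between correctness and specificity, surpassing existing methods and advancing open-world fine-grained image classification.

Insight: The novelty is a dynamic, verifier-based reinforcement learning reward anchored to the best predictions within online rollouts, which promotes specificity within the model's capabilities and avoids incorrect predictions. This offers a reusable reinforcement learning strategy for adjusting the output behavior of large multimodal models.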

Abstract: Classifying fine-grained visual concepts under open-world settings, i.e., without a predefined label set, demands models to be both accurate and specific. Recent reasoning Large Multimodal Models (LMMs) exhibit strong visual understanding capability but tend to produce overly generic predictions when performing fine-grained image classification. Our preliminary analysis reveals that models do possess the intrinsic fine-grained domain knowledge. However, promoting more specific predictions (specificity) without compromising correct ones (correctness) remains a non-trivial and understudied challenge. In this work, we investigate how to steer reasoning LMMs toward predictions that are both correct and specific. We propose a novel specificity-aware reinforcement learning framework, SpeciaRL, to fine-tune reasoning LMMs on fine-grained image classification under the open-world setting. SpeciaRL introduces a dynamic, verifier-based reward signal anchored to the best predictions within online rollouts, promoting specificity while respecting the model’s capabilities to prevent incorrect predictions. Our out-of-domain experiments show that SpeciaRL delivers the best trade-off between correctness and specificity across extensive fine-grained benchmarks, surpassing existing methods and advancing open-world fine-grained image classification. Code and model are publicly available at https://github.com/s-angheben/SpeciaRL.
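
One possible reading of "a reward anchored to the best predictions within online rollouts" is sketched below. This is a hypothetical illustration, not SpeciaRL's actual reward: the integer specificity levels, the value -1 for an incorrect answer, the 0.5 partial credit, and the function name `anchored_rewards` are all assumptions. The idea shown is only that the target specificity is set by the best correct rollout in the group, so the model is not pushed beyond what it has demonstrated it can get right.

```python
def anchored_rewards(levels):
    """Assign rewards to a group of rollouts for one image.

    levels: hypothetical verifier output per rollout.
            -1 means the prediction is incorrect;
            0, 1, 2, ... mean correct, with higher values being more specific.
    """
    correct = [lv for lv in levels if lv >= 0]
    if not correct:
        return [0.0] * len(levels)      # nothing correct in this group
    best = max(correct)                 # anchor: best specificity the model reached
    rewards = []
    for lv in levels:
        if lv < 0:
            rewards.append(0.0)         # incorrect predictions get no reward
        elif lv == best:
            rewards.append(1.0)         # as specific as the model can currently be
        else:
            rewards.append(0.5)         # correct but more generic than the anchor
    return rewards


# Group where a specific correct answer exists: generic answers are down-weighted.
mixed_group = anchored_rewards([0, 1, -1, 1])
# Group where only generic answers are correct: they receive full reward.
generic_group = anchored_rewards([0, 0, -1])
print(mixed_group, generic_group)
```

In the second group the generic answer is not penalized, which is how this toy version avoids rewarding risky over-specific guesses on images the model cannot resolve.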


[92] COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data – Generation Stochastic by Design cs.CVPDF

Miguel Espinosa, Eva Gmelich Meijling, Valerio Marsocci, Elliot J. Crowley, Mikolaj Czerkawski

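TL;DR: COP-GEN is a latent diffusion transformer for multimodal Earth observation data that models the joint distribution of heterogeneous data such as optical, radar, elevation, and land-cover products at their native spatial resolutions. By parameterising cross-modal mappings as conditional distributions, it supports flexible any-to-any conditional generation, including zero-shot modality translation, spectral band infilling, and generation under partial or missing inputs, without task-specific retraining.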

Details

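Motivation: Earth observation applications rely on multi-sensor data, but the relationships between modalities are inherently non-injective: identical conditioning information can correspond to multiple physically plausible observations. Deterministic models tend to collapse toward conditional means and cannot represent the uncertainty and variability required for tasks such as data completion and cross-sensor translation, so conditional mappings should be parameterised as data distributions.

Result: Experiments on a large-scale global multimodal dataset show that COP-GEN generates diverse yet physically consistent realisations while maintaining strong peak fidelity across optical, radar, and elevation modalities. Qualitative and quantitative analyses demonstrate that the model captures meaningful cross-modal structure and systematically adapts its output uncertainty as conditioning information increases.

Insight: The novelty lies in using a latent diffusion transformer to model the joint distribution of multimodal Earth observation data, enabling any-to-any conditional generation without task-specific retraining. The work also highlights the practical importance of stochastic generative modeling for Earth observation and motivates evaluation protocols that move beyond single-reference, pointwise metrics.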

Abstract: Earth observation applications increasingly rely on data from multiple sensors, including optical, radar, elevation, and land-cover products. Relationships between these modalities are fundamental for data integration but are inherently non-injective: identical conditioning information can correspond to multiple physically plausible observations. Thus, such conditional mappings should be parametrised as data distributions. As a result, deterministic models tend to collapse toward conditional means and fail to represent the uncertainty and variability required for tasks such as data completion and cross-sensor translation. We introduce COP-GEN, a multimodal latent diffusion transformer that models the joint distribution of heterogeneous Earth Observation modalities at their native spatial resolutions. By parameterising cross-modal mappings as conditional distributions, COP-GEN enables flexible any-to-any conditional generation, including zero-shot modality translation, spectral band infilling, and generation under partial or missing inputs, without task-specific retraining. Experiments on a large-scale global multimodal dataset show that COP-GEN generates diverse yet physically consistent realisations while maintaining strong peak fidelity across optical, radar, and elevation modalities. Qualitative and quantitative analyses demonstrate that the model captures meaningful cross-modal structure and systematically adapts its output uncertainty as conditioning information increases. These results highlight the practical importance of stochastic generative modeling for Earth observation and motivate evaluation protocols that move beyond single-reference, pointwise metrics. Website: https://miquel-espinosa.github.io/cop-gen
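
The claim that deterministic models "collapse toward conditional means" can be checked on a toy non-injective mapping. In the sketch below, a single conditioning input has two equally plausible outputs, -1 and +1; these values, the sample counts, and the seed are made up for illustration and have nothing to do with real Earth observation data.

```python
import random


def mse(prediction, targets):
    """Mean squared error of one fixed prediction against all targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


# One conditioning input, two equally plausible observations.
targets = [-1.0] * 500 + [1.0] * 500

# The single prediction that minimizes MSE is the mean of the targets.
mean_prediction = sum(targets) / len(targets)
mse_mean = mse(mean_prediction, targets)   # error of the "collapsed" answer
mse_mode = mse(1.0, targets)               # error of committing to one real mode

# A stochastic model instead draws from the distribution of plausible outputs.
rng = random.Random(0)
samples = [rng.choice(targets) for _ in range(5)]

print(mean_prediction, mse_mean, mse_mode, samples)
```

The mean prediction wins on pointwise error even though it is a value that never occurs in the data, while every sample from the stochastic model is a plausible observation. This is also why single-reference pointwise metrics can favor blurred, averaged outputs.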


[93] UniG2U-Bench: Do Unified Models Advance Multimodal Understanding? cs.CV | cs.AIPDF

Zimo Wen, Boxiu Li, Wanbo Zhang, Junxiang Lei, Xiaoyu Chen

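TL;DR: This paper introduces UniG2U-Bench, a benchmark for systematically evaluating how generation capability contributes to understanding in unified multimodal models. An extensive evaluation of over 30 models across 7 categories and 30 subtasks finds that unified models generally underperform their base vision-language models and that Generate-then-Answer inference typically degrades performance, although gains appear on specific tasks such as spatial intelligence, visual illusions, and multi-round reasoning.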

Details

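Motivation: Unified multimodal models show strong generative capabilities, but whether and when generation improves understanding remains unclear, and existing benchmarks lack a systematic exploration of this question.

Result: An evaluation of over 30 models on UniG2U-Bench finds that unified models generally underperform their base VLMs and that GtA inference typically performs worse than direct inference, while consistent gains are observed on spatial intelligence, visual illusion, and multi-round reasoning subtasks.

Insight: The paper's contribution is a systematic G2U evaluation benchmark and the finding that generation-understanding coupling induces class-consistent inductive biases over tasks, pretraining data, and model architectures, which points toward more effective training data and paradigms for future unified models.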

Abstract: Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks where generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks, requiring varying degrees of implicit or explicit visual transformations. Extensive evaluation of over 30 models reveals three core findings: 1) Unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. 2) Consistent enhancements emerge in spatial intelligence, visual illusions, or multi-round reasoning subtasks, where enhanced spatial and shape perception, as well as multi-step intermediate image states, prove beneficial. 3) Tasks with similar reasoning structures and models sharing architectures exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases over tasks, pretraining data, and model architectures. These findings highlight the necessity for more diverse training data and novel paradigms to fully unlock the potential of unified multimodal modeling.


[94] DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction cs.CVPDF

Yufu Wang, Evonne Ng, Soyong Shin, Rawal Khirodkar, Yuan Dong

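TL;DR: DuoMo is a generative method that recovers human motion in world coordinates from unconstrained videos. It factorizes motion learning into two diffusion models, first estimating motion in camera coordinates and then lifting it to world coordinates and refining it for global consistency, so that it can reconstruct motion across diverse scenes even from noisy or incomplete observations.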

Details

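Motivation: Reconstructing world-space human motion from unconstrained videos with noisy or incomplete observations involves a fundamental trade-off between generalizing across diverse, noisy inputs and maintaining global motion consistency.

Result: On EMDB, world-space reconstruction error is reduced by 16% while foot skating remains low; on RICH, world-space error is reduced by 30%, achieving state-of-the-art performance.

Insight: The novelty lies in factorizing motion learning into camera-space and world-space diffusion models and in generating mesh vertex motion directly rather than relying on parametric models, which improves generalization and global consistency.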

Abstract: We present DuoMo, a generative method that recovers human motion in world-space coordinates from unconstrained videos with noisy or incomplete observations. Reconstructing such motion requires solving a fundamental trade-off: generalizing from diverse and noisy video inputs while maintaining global motion consistency. Our approach addresses this problem by factorizing motion learning into two diffusion models. The camera-space model first estimates motion from videos in camera coordinates. The world-space model then lifts this initial estimate into world coordinates and refines it to be globally consistent. Together, the two models can reconstruct motion across diverse scenes and trajectories, even from highly noisy or incomplete observations. Moreover, our formulation is general, generating the motion of mesh vertices directly and bypassing parametric models. DuoMo achieves state-of-the-art performance. On EMDB, our method obtains a 16% reduction in world-space reconstruction error while maintaining low foot skating. On RICH, it obtains a 30% reduction in world-space error. Project page: https://yufu-wang.github.io/duomo/


[95] LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory cs.CV | cs.LGPDF

Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-Hsuan Yang

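TL;DR: LoGeR is a novel architecture for dense 3D reconstruction of extremely long video sequences. It processes the video stream in chunks and uses a learning-based hybrid memory module to maintain coherence across chunks, scaling to sequences of thousands of frames without post-optimization.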

Details

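Motivation: Existing feedforward geometric foundation models perform well on short-window reconstruction, but scaling them to minutes-long videos is limited by the quadratic complexity of attention or by the limited effective memory of recurrent designs. A reconstruction method is therefore needed that handles long contexts while maintaining global consistency.

Result: Evaluated on standard benchmarks and a newly repurposed VBR dataset (with sequences of up to 19k frames), LoGeR substantially outperforms prior feedforward methods, reducing ATE on KITTI by over 74% and achieving robust, globally consistent reconstruction over unprecedentedly long sequences.

Insight: The novelty is a hybrid memory module that combines a parametric Test-Time Training (TTT) memory, which anchors the global coordinate frame and prevents scale drift, with a non-parametric Sliding Window Attention (SWA) mechanism, which preserves uncompressed context for high-precision adjacent alignment. This allows the model to be trained on 128-frame sequences and to generalize to thousands of frames at inference.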

Abstract: Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or limited effective memory in recurrent designs. We present LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning. To manage the critical challenge of coherence across chunk boundaries, we propose a learning-based hybrid memory module. This dual-component system combines a parametric Test-Time Training (TTT) memory to anchor the global coordinate frame and prevent scale drift, alongside a non-parametric Sliding Window Attention (SWA) mechanism to preserve uncompressed context for high-precision adjacent alignment. Remarkably, this memory architecture enables LoGeR to be trained on sequences of 128 frames, and generalize up to thousands of frames during inference. Evaluated across standard benchmarks and a newly repurposed VBR dataset with sequences of up to 19k frames, LoGeR substantially outperforms prior state-of-the-art feedforward methods, reducing ATE on KITTI by over 74%, and achieves robust, globally consistent reconstruction over unprecedented horizons.
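
The chunk-plus-sliding-window idea can be sketched as a simple frame schedule. This is a hypothetical illustration, not LoGeR's implementation: the function names, the chunk size, and the window length are assumptions, and the parametric TTT memory is not modeled at all. The sketch only shows how each chunk sees a few uncompressed preceding frames, and why the number of attention pairs stays below that of full attention over the whole sequence.

```python
def chunk_with_context(num_frames, chunk_size, window):
    """Split frame indices into chunks, each paired with a window of preceding frames."""
    plan = []
    for start in range(0, num_frames, chunk_size):
        end = min(start + chunk_size, num_frames)
        ctx_start = max(0, start - window)
        plan.append({
            "context": range(ctx_start, start),   # uncompressed frames kept from the past
            "chunk": range(start, end),           # frames processed jointly in this step
        })
    return plan


def attention_pairs(plan):
    """Count query-key pairs if each step attends over its context plus its chunk."""
    return sum((len(p["context"]) + len(p["chunk"])) ** 2 for p in plan)


plan = chunk_with_context(num_frames=10, chunk_size=4, window=2)
chunked_cost = attention_pairs(plan)
full_cost = 10 ** 2   # full attention over all 10 frames at once

for step in plan:
    print(list(step["context"]), list(step["chunk"]))
print(chunked_cost, full_cost)
```

With a fixed chunk size and window, the chunked cost grows linearly in the number of frames, whereas full attention grows quadratically; the gap widens quickly for sequences of thousands of frames.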


[96] Beyond Language Modeling: An Exploration of Multimodal Pretraining cs.CVPDF

Shengbang Tong, David Fan, John Nguyen, Ellis Brown, Gaoyue Zhou

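TL;DR: This paper explores the design space of native multimodal models through from-scratch pretraining experiments, adopting the Transfusion framework (next-token prediction for language, diffusion for vision) to train on text, video, image-text pairs, and action-conditioned video. It finds that the Representation Autoencoder (RAE) provides a unified and optimal visual representation; that visual and language data are complementary and synergistic; that unified multimodal pretraining leads naturally to world modeling; and that a Mixture-of-Experts (MoE) architecture supports efficient multimodal scaling and induces modality specialization. IsoFLOP analysis reveals a scaling asymmetry in which vision is more data-hungry than language, and MoE can harmonize this asymmetry.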

Details

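Motivation: The work explores the design space of multimodal foundation models beyond language modeling, using controlled from-scratch pretraining experiments to isolate the key factors in multimodal pretraining without interference from language pretraining.

Result: Experiments are conducted on diverse data. The summary mentions no specific benchmarks or quantitative comparisons, but IsoFLOP analysis is used to compute scaling laws for both modalities, and the MoE architecture is shown to be effective at harmonizing the scaling asymmetry between vision and language.

Insight: Key findings include: the advantage of RAE as a unified visual representation; the complementarity and synergy of vision and language; the natural emergence of world modeling from unified pretraining; and MoE's support for efficient multimodal scaling and its handling of the scaling asymmetry. The systematic analysis of asymmetric data requirements in multimodal pretraining, and of MoE's harmonizing role, is particularly instructive.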

Abstract: The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.
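
As background on IsoFLOP analysis, the sketch below shows the usual recipe on synthetic numbers: fix a compute budget, train models of several sizes, fit a parabola to loss versus log model size, and read off the compute-optimal size and token count. Everything here is an assumption for illustration: the loss curve is a made-up quadratic, the budget is arbitrary, and the relation C ≈ 6·N·D is a common rule of thumb for dense transformers that the abstract does not state. None of the numbers come from the paper.

```python
import math


def parabola_vertex(xs, ys):
    """x-coordinate of the vertex of the parabola through three points."""
    (x0, x1, x2), (y0, y1, y2) = xs, ys
    num = (x1 - x0) ** 2 * (y1 - y2) - (x1 - x2) ** 2 * (y1 - y0)
    den = (x1 - x0) * (y1 - y2) - (x1 - x2) * (y1 - y0)
    return x1 - 0.5 * num / den


def synthetic_loss(log_params):
    """Made-up IsoFLOP curve with its minimum at log10(N) = 8.875."""
    return 0.4 * (log_params - 8.875) ** 2 + 2.9


compute_budget = 6e18                     # hypothetical FLOP budget for one IsoFLOP curve
log_sizes = (8.0, 9.0, 10.0)              # log10 of three candidate parameter counts
losses = tuple(synthetic_loss(x) for x in log_sizes)

log_n_opt = parabola_vertex(log_sizes, losses)
n_opt = 10 ** log_n_opt                   # compute-optimal parameter count
d_opt = compute_budget / (6 * n_opt)      # tokens implied by C = 6 * N * D (assumed rule)

print(log_n_opt, n_opt, d_opt)
```

Repeating this for several budgets and fitting how the optimal N and D grow with C gives the scaling exponents; an asymmetry like the one the abstract reports would appear as a steeper growth of optimal D for vision than for language.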


[97] MIBURI: Towards Expressive Interactive Gesture Synthesis cs.CV | cs.GR | cs.HCPDF

M. Hamza Mughal, Rishabh Dabral, Vera Demberg, Christian Theobalt

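TL;DR: This paper presents MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue. The method uses body-part aware gesture codecs that encode hierarchical motion details into multi-level discrete tokens, which are generated autoregressively by a two-dimensional causal framework conditioned on LLM-based speech-text embeddings, modeling both temporal dynamics and part-level motion hierarchy in real time.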

Details

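Motivation: Current large language model-based conversational agents lack embodiment and the expressive gestures essential for natural interaction. Existing solutions often produce rigid, low-diversity motions unsuitable for human-like interaction, while generative co-speech gesture synthesis methods yield natural body gestures but depend on future speech context and require long run-times.

Result: Comparative evaluations show that this causal, real-time approach produces gestures that are more natural and better aligned with context than those of recent baselines.

Insight: The novelty is the first online, causal framework, which combines body-part aware codecs with two-dimensional causal modeling to generate hierarchical gestures in real time, and uses auxiliary objectives to encourage expressiveness and diversity while preventing convergence to static poses.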

Abstract: Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Current large language model (LLM)-based conversational agents lack embodiment and the expressive gestures essential for natural interaction. Existing solutions for ECAs often produce rigid, low-diversity motions that are unsuitable for human-like interaction. Alternatively, generative methods for co-speech gesture synthesis yield natural body gestures but depend on future speech context and require long run-times. To bridge this gap, we present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue. We employ body-part aware gesture codecs that encode hierarchical motion details into multi-level discrete tokens. These tokens are then autoregressively generated by a two-dimensional causal framework conditioned on LLM-based speech-text embeddings, modeling both temporal dynamics and part-level motion hierarchy in real time. Further, we introduce auxiliary objectives to encourage expressive and diverse gestures while preventing convergence to static poses. Comparative evaluations demonstrate that our causal and real-time approach produces more natural and contextually aligned gestures than recent baselines. We urge the reader to explore demo videos on https://vcai.mpi-inf.mpg.de/projects/MIBURI/.


[98] Utonia: Toward One Encoder for All Point Clouds cs.CV

Yujia Zhang, Xiaoyang Wu, Yunhan Yang, Xianzhe Fan, Han Li

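TL;DR: This paper presents Utonia, an effort to train a single self-supervised point transformer encoder that handles point clouds from many domains: remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and point clouds lifted from RGB videos. The model learns a consistent cross-domain representation space that improves perception, exhibits emergent behaviors arising from joint training, and benefits embodied and multimodal reasoning tasks.

Details

Motivation: Point clouds from different domains (remote sensing, LiDAR, CAD, and others) differ in sensing geometry, density, and priors. The goal is a single encoder that shares knowledge and generalizes across domains, laying groundwork for foundation models for sparse 3D data.

Result: Utonia unifies point cloud representation learning across domains. Its representations improve vision-language-action policies on robotic manipulation and strengthen the spatial reasoning of vision-language models, indicating potential for downstream applications in AR/VR, robotics, and autonomous driving.

Insight: The novelty is a first attempt at training one self-supervised point transformer encoder across multiple domains. Joint training yields a consistent representation, improves perception, and reveals emergent behaviors, and the work validates the representation's value for embodied and multimodal tasks.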

Abstract: We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all. Toward this goal, we present Utonia, a first step toward training a single self-supervised point transformer encoder across diverse domains, spanning remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and point clouds lifted from RGB-only videos. Despite their distinct sensing geometries, densities, and priors, Utonia learns a consistent representation space that transfers across domains. This unification improves perception capability while revealing intriguing emergent behaviors that arise only when domains are trained jointly. Beyond perception, we observe that Utonia representations can also benefit embodied and multimodal reasoning: conditioning vision-language-action policies on Utonia features improves robotic manipulation, and integrating them into vision-language models yields gains on spatial reasoning. We hope Utonia can serve as a step toward foundation models for sparse 3D data, and support downstream applications in AR/VR, robotics, and autonomous driving.


cs.DB [Back]

[99] HELIOS: Harmonizing Early Fusion, Late Fusion, and LLM Reasoning for Multi-Granular Table-Text Retrieval cs.DB | cs.CL | cs.IR | cs.LG

Sungho Park, Joohyung Yun, Jongwuk Lee, Wook-Shin Han

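TL;DR: The paper proposes HELIOS, a multi-granular table-text retrieval method that combines the strengths of early and late fusion. Its three stages (edge-based bipartite subgraph retrieval, query-relevant node expansion, and star-based LLM refinement) avoid irrelevant context and support advanced reasoning tasks.

Details

Motivation: Existing table-text retrieval methods use either early or late fusion. Early fusion can include irrelevant context and miss query-dependent relationships, late fusion can miss relevant context, and both struggle with advanced tasks such as column-wise aggregation and multi-hop reasoning. Combining the two addresses these issues.

Result: On the OTT-QA benchmark, HELIOS outperforms state-of-the-art models by up to 42.6% in recall and 39.9% in nDCG.

Insight: The novel elements are edge-based bipartite subgraph retrieval, which identifies finer-grained edges between table segments and passages; query-relevant node expansion, which grows the bipartite subgraph dynamically; and star-based LLM refinement, which performs logical inference at the star graph level. These designs may carry over to multimodal retrieval and complex reasoning tasks.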

Abstract: Table-text retrieval aims to retrieve relevant tables and text to support open-domain question answering. Existing studies use either early or late fusion, but face limitations. Early fusion pre-aligns a table row with its associated passages, forming “stars,” which often include irrelevant contexts and miss query-dependent relationships. Late fusion retrieves individual nodes, dynamically aligning them, but it risks missing relevant contexts. Both approaches also struggle with advanced reasoning tasks, such as column-wise aggregation and multi-hop reasoning. To address these issues, we propose HELIOS, which combines the strengths of both approaches. First, the edge-based bipartite subgraph retrieval identifies finer-grained edges between table segments and passages, effectively avoiding the inclusion of irrelevant contexts. Then, the query-relevant node expansion identifies the most promising nodes, dynamically retrieving relevant edges to grow the bipartite subgraph, minimizing the risk of missing important contexts. Lastly, the star-based LLM refinement performs logical inference at the star graph level rather than the bipartite subgraph, supporting advanced reasoning tasks. Experimental results show that HELIOS outperforms state-of-the-art models with a significant improvement up to 42.6% and 39.9% in recall and nDCG, respectively, on the OTT-QA benchmark.
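For readers less familiar with the two metrics reported above, the following is a minimal Python sketch of recall@k and nDCG@k under binary relevance. The function names, the toy document identifiers, and the binary-relevance choice are assumptions made for illustration; the abstract does not state the cutoffs or the evaluation code used on OTT-QA.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear in the top-k of the ranking."""
    if not relevant:
        raise ValueError("need at least one relevant item")
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """nDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    if not relevant:
        raise ValueError("need at least one relevant item")
    # Position i (0-based) is discounted by log2(i + 2).
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal

# Hypothetical retrieved list mixing table segments ("t*") and passages ("p*").
ranked = ["t3", "p1", "t7", "p9"]
relevant = {"p1", "p9"}
recall_3 = recall_at_k(ranked, relevant, k=3)
ndcg_3 = ndcg_at_k(ranked, relevant, k=3)
```

Recall only counts whether relevant items were retrieved, while nDCG also rewards placing them near the top, which is why the two numbers can move differently.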


cs.MA [Back]

[100] StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning cs.MA | cs.CL | cs.PL

Shiyang Li, Zijian Zhang, Winson Chen, Yuebo Luo, Mingyi Hong

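TL;DR: This paper proposes StitchCUDA, a multi-agent framework for end-to-end GPU program generation. It comprises three agents for planning, coding, and verification, and improves the coding agent through rubric-based agentic reinforcement learning so that it produces efficient, correct CUDA programs. Experiments show a success rate near 100% on end-to-end GPU programming tasks and substantial gains over baselines.

Details

Motivation: Modern machine learning workloads rely heavily on GPUs, yet high-performance end-to-end programs remain hard to achieve because performance depends on both GPU kernel efficiency and host-side settings. Existing LLM-based methods focus mainly on single-kernel optimization and do not extend to end-to-end programs, which hinders practical deployment.

Result: On KernelBench, StitchCUDA achieves a success rate near 100% on end-to-end GPU programming tasks, with 1.72x better speedup than the multi-agent baseline and 2.73x better than the RL model baselines.

Insight: The contributions are: 1) a multi-agent collaboration framework dedicated to end-to-end GPU program generation, which decomposes the task into three specialized roles for planning, coding, and verification; 2) rubric-based agentic reinforcement learning that combines a rubric reward with a rule-based reward from real executions to train the coding agent on two atomic skills, task-to-code generation and feedback-driven code optimization, while effectively preventing reward hacking.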

Abstract: Modern machine learning (ML) workloads increasingly rely on GPUs, yet achieving high end-to-end performance remains challenging due to dependencies on both GPU kernel efficiency and host-side settings. Although LLM-based methods show promise on automated GPU kernel generation, prior work mainly focuses on single-kernel optimization and does not extend to end-to-end programs, hindering practical deployment. To address the challenge, in this work, we propose StitchCUDA, a multi-agent framework for end-to-end GPU program generation, with three specialized agents: a Planner to orchestrate whole system design, a Coder dedicated to implementing it step-by-step, and a Verifier for correctness checking and performance profiling using Nsys/NCU. To fundamentally improve the Coder’s ability in end-to-end GPU programming, StitchCUDA integrates rubric-based agentic reinforcement learning over two atomic skills, task-to-code generation and feedback-driven code optimization, with combined rubric reward and rule-based reward from real executions. Therefore, the Coder learns how to implement advanced CUDA programming techniques (e.g., custom kernel fusion, cuBLAS epilogue), and we also effectively prevent reward hacking by the Coder (e.g., copying PyTorch code or hardcoding output) during benchmarking. Experiments on KernelBench show that StitchCUDA achieves nearly 100% success rate on end-to-end GPU programming tasks, with 1.72x better speedup than the multi-agent baseline and 2.73x better than the RL model baselines.


cs.LG [Back]

[101] MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models cs.LG | cs.CL | cs.CV | cs.SD | eess.AS

Zhongxi Wang, Yueqian Lin, Jingyang Zhang, Hai Helen Li, Yiran Chen

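TL;DR: This paper presents MUSE, an open-source, run-centric platform for multimodal unified safety evaluation that systematically tests whether the safety alignment of large language models generalizes to audio, image, and video inputs. The platform integrates automatic cross-modal payload generation, three multi-turn attack algorithms, provider-agnostic model routing, and an LLM judge with a five-level safety taxonomy.

Details

Motivation: Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to multimodal inputs such as audio, image, and video.

Result: Experiments on six multimodal LLMs from four providers show that multi-turn attack strategies can reach up to 90-100% attack success rate against models with near-perfect single-turn refusal. The proposed Inter-Turn Modality Switching strategy does not uniformly raise the final attack success rate, but it accelerates convergence by destabilizing early-turn defenses.

Insight: The main contributions are: 1) an integrated, browser-based unified safety evaluation platform; 2) a dual-metric framework that distinguishes hard attack success rate (full compliance only) from soft attack success rate (including partial compliance), capturing partial information leakage that binary metrics miss; 3) the Inter-Turn Modality Switching attack strategy, which probes whether alignment generalizes across modality boundaries; 4) the finding that the direction of modality effects is model-family-specific rather than universal, underscoring the need for provider-aware cross-modal safety testing.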

Abstract: Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs. We present MUSE (Multimodal Unified Safety Evaluation), an open-source, run-centric platform that integrates automatic cross-modal payload generation, three multi-turn attack algorithms (Crescendo, PAIR, Violent Durian), provider-agnostic model routing, and an LLM judge with a five-level safety taxonomy into a single browser-based system. A dual-metric framework distinguishes hard Attack Success Rate (Compliance only) from soft ASR (including Partial Compliance), capturing partial information leakage that binary metrics miss. To probe whether alignment generalizes across modality boundaries, we introduce Inter-Turn Modality Switching (ITMS), which augments multi-turn attacks with per-turn modality rotation. Experiments across six multimodal LLMs from four providers show that multi-turn strategies can achieve up to 90-100% ASR against models with near-perfect single-turn refusal. ITMS does not uniformly raise final ASR on already-saturated baselines, but accelerates convergence by destabilizing early-turn defenses, and ablation reveals that the direction of modality effects is model-family-specific rather than universal, underscoring the need for provider-aware cross-modal safety testing.
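The hard/soft distinction in the dual-metric framework can be illustrated with a short Python sketch. Only the "Compliance" and "Partial Compliance" verdicts are named in the abstract; the "Refusal" label, the function name, and the toy verdict list are hypothetical, and the platform's actual five-level taxonomy may be organized differently.

```python
from collections import Counter

# Verdicts counted as success under each metric (labels partly assumed).
HARD_SUCCESS = {"Compliance"}
SOFT_SUCCESS = {"Compliance", "Partial Compliance"}

def attack_success_rates(verdicts):
    """Return (hard_asr, soft_asr) for a list of per-run judge verdicts."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    total = len(verdicts)
    hard = sum(counts[v] for v in HARD_SUCCESS) / total
    soft = sum(counts[v] for v in SOFT_SUCCESS) / total
    return hard, soft

# Hypothetical judge output for five attack runs.
verdicts = ["Compliance", "Partial Compliance", "Refusal",
            "Refusal", "Partial Compliance"]
hard_asr, soft_asr = attack_success_rates(verdicts)
```

In this toy run, a binary metric that only counts full compliance would miss the two partially compliant responses, which is the gap the soft metric is meant to expose.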


[102] ALARM: Automated MLLM-Based Anomaly Detection in Complex-EnviRonment Monitoring with Uncertainty Quantification cs.LG | cs.AI | cs.CV

Congjing Zhang, Feng Lin, Xinyi Zhao, Pei Guo, Wei Li

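TL;DR: This paper proposes ALARM, a framework that combines multimodal large language models (MLLMs) with uncertainty quantification (UQ) for visual anomaly detection (VAD) in complex environments. By integrating quality-assurance techniques such as reasoning chains, self-reflection, and MLLM ensembles, it builds a rigorous probabilistic inference pipeline aimed at robust and accurate anomaly detection.

Details

Motivation: Deploying MLLM-based visual anomaly detection in complex environments is challenging because anomalies are often highly contextual and ambiguous, which makes uncertainty quantification a crucial capability.

Result: Extensive empirical evaluation on real-world smart-home benchmark data and wound image classification data shows that ALARM achieves superior performance and applies reliably across different domains.

Insight: The main novelty is combining uncertainty quantification with MLLM reasoning capabilities (reasoning chains, self-reflection) and model ensembling in a rigorous probabilistic inference framework, improving the robustness and reliability of anomaly detection in complex environments.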

Abstract: The advance of Large Language Models (LLMs) has greatly stimulated research interest in developing multi-modal LLM (MLLM)-based visual anomaly detection (VAD) algorithms that can be deployed in complex environments. The challenge is that in these complex environments, anomalies are sometimes highly contextual and ambiguous, so uncertainty quantification (UQ) is a crucial capability for an MLLM-based VAD system to succeed. In this paper, we introduce our UQ-supported MLLM-based VAD framework called ALARM. ALARM integrates UQ with quality-assurance techniques like reasoning chain, self-reflection, and MLLM ensemble for robust and accurate performance and is designed based on a rigorous probabilistic inference pipeline and computational process. Extensive empirical evaluations are conducted using the real-world smart-home benchmark data and wound image classification data, which show ALARM’s superior performance and its generic applicability across different domains for reliable decision-making.
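One simple way to attach an uncertainty estimate to an ensemble of model verdicts is the entropy of the vote distribution, sketched below in Python. This is a generic illustration and not ALARM's probabilistic inference pipeline, which the abstract does not spell out; the label names, the `max_entropy` threshold, and the "uncertain" fallback are all assumptions.

```python
import math
from collections import Counter

def vote_entropy(labels):
    """Shannon entropy (in nats) of the empirical label distribution."""
    if not labels:
        raise ValueError("need at least one label")
    total = len(labels)
    return -sum((c / total) * math.log(c / total)
                for c in Counter(labels).values())

def decide(labels, max_entropy=0.5):
    """Return the majority label, or "uncertain" when the ensemble disagrees."""
    majority, _ = Counter(labels).most_common(1)[0]
    entropy = vote_entropy(labels)
    return (majority if entropy <= max_entropy else "uncertain"), entropy

# Hypothetical verdicts from five ensemble members on two scenes.
clear_scene, clear_entropy = decide(["anomaly"] * 5)
mixed_scene, mixed_entropy = decide(
    ["anomaly", "normal", "anomaly", "normal", "anomaly"])
```

A unanimous ensemble has zero entropy and is accepted, while a 3-to-2 split exceeds the threshold and is deferred instead of being reported as a confident detection.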


[103] CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning cs.LG | cs.CV

Zhenquan Yao, Zitong Huang, Yihan Zeng, Jianhua Han, Hang Xu

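TL;DR: This paper proposes CGL, a continual GUI learning framework that uses reinforcement fine-tuning to balance adaptation to new tasks against retention of old ones. It introduces an SFT proportion adjustment mechanism guided by policy entropy and a specialized gradient surgery strategy to mitigate the knowledge overwriting caused by supervised fine-tuning, and validates the approach on the newly built AndroidControl-CL benchmark.

Details

Motivation: GUI agents face a continual learning problem when applications update frequently: they must adapt to new tasks without forgetting old ones, and existing supervised fine-tuning tends to overwrite prior knowledge.

Result: On the AndroidControl-CL benchmark, the CGL framework performs well in continual learning scenarios, balancing adaptation efficiency and skill retention, and the experiments demonstrate its effectiveness.

Insight: The work reveals that reinforcement learning is inherently resistant to forgetting old tasks, and designs a mechanism that dynamically balances SFT and RL together with a gradient surgery strategy that resolves explicit gradient conflicts through gradient projection, offering a new approach to continual learning.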

Abstract: Graphical User Interface (GUI) Agents, benefiting from recent advances in multimodal large language models (MLLM), have achieved significant development. However, due to the frequent updates of GUI applications, adapting to new tasks without forgetting old tasks in GUI continual learning remains an open problem. In this work, we reveal that while Supervised Fine-Tuning (SFT) facilitates fast adaptation, it often triggers knowledge overwriting, whereas Reinforcement Learning (RL) demonstrates an inherent resilience that shields prior interaction logic from erasure. Based on this insight, we propose a Continual GUI Learning (CGL) framework that dynamically balances adaptation efficiency and skill retention by enhancing the synergy between SFT and RL. Specifically, we introduce an SFT proportion adjustment mechanism guided by policy entropy to dynamically control the weight allocation between the SFT and RL training phases. To resolve explicit gradient interference, we further develop a specialized gradient surgery strategy. By projecting exploratory SFT gradients onto GRPO-based anchor gradients, our method explicitly clips the components of SFT gradients that conflict with GRPO. On top of that, we establish an AndroidControl-CL benchmark, which divides GUI applications into distinct task groups to effectively simulate and evaluate the performance of continual GUI learning. Experimental results demonstrate the effectiveness of our proposed CGL framework across continual learning scenarios. The benchmark, code, and model will be made publicly available.
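The gradient surgery step can be pictured with a small Python sketch. It assumes a projection rule in the style of PCGrad: when the SFT gradient has a negative inner product with the anchor gradient, the component along the anchor is removed. The abstract does not give the exact rule used by CGL, so the function names, the flat-list gradient representation, and the toy vectors are illustrative assumptions.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def clip_conflict(g_sft, g_anchor, eps=1e-12):
    """Remove the part of g_sft that points against g_anchor, if any."""
    inner = dot(g_sft, g_anchor)
    if inner >= 0:
        # No conflict: the SFT gradient is left unchanged.
        return list(g_sft)
    # Subtract the projection of g_sft onto g_anchor.
    scale = inner / (dot(g_anchor, g_anchor) + eps)
    return [s - scale * a for s, a in zip(g_sft, g_anchor)]

# Toy 2-D gradients: the SFT gradient opposes the anchor along the first axis.
g_anchor = [1.0, 0.0]
g_sft = [-2.0, 3.0]
g_clipped = clip_conflict(g_sft, g_anchor)
```

After clipping, the SFT update keeps its component orthogonal to the anchor (the second axis here) but no longer pushes against the direction favored by the RL objective.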


cs.AI [Back]

[104] TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning cs.AI | cs.CL | cs.CV

Christian Greisinger, Steffen Eger

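TL;DR: This paper presents TikZilla, a model trained in two stages to generate high-quality TikZ code from text descriptions. Using the large, high-quality dataset DaTikZ-V4 and a pipeline of supervised fine-tuning (SFT) followed by reinforcement learning (RL), the model surpasses GPT-4o in human evaluation and matches GPT-5 in the image-based evaluation, at a much smaller model size.

Details

Motivation: Existing Text-to-TikZ datasets are small and noisy, and relying solely on supervised fine-tuning leaves the model unexposed to the rendered semantics of a figure, leading to errors such as looping, irrelevant content, and incorrect spatial relations.

Result: In human evaluations with over 1,000 judgments, TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation.

Insight: The contributions are the larger, higher-quality DaTikZ-V4 dataset and the two-stage training pipeline of SFT followed by RL, in which the RL stage uses an image encoder trained via inverse graphics to provide semantically faithful reward signals, improving the semantic accuracy of the generated code.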

Abstract: Large language models (LLMs) are increasingly used to assist scientists across diverse workflows. A key challenge is generating high-quality figures from textual descriptions, often represented as TikZ programs that can be rendered as scientific images. Prior research has proposed a variety of datasets and modeling approaches for this task. However, existing datasets for Text-to-TikZ are too small and noisy to capture the complexity of TikZ, causing mismatches between text and rendered figures. Moreover, prior approaches rely solely on supervised fine-tuning (SFT), which does not expose the model to the rendered semantics of the figure, often resulting in errors such as looping, irrelevant content, and incorrect spatial relations. To address these issues, we construct DaTikZ-V4, a dataset more than four times larger and substantially higher in quality than DaTikZ-V3, enriched with LLM-generated figure descriptions. Using this dataset, we train TikZilla, a family of small open-source Qwen models (3B and 8B) with a two-stage pipeline of SFT followed by reinforcement learning (RL). For RL, we leverage an image encoder trained via inverse graphics to provide semantically faithful reward signals. Extensive human evaluations with over 1,000 judgments show that TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation, while operating at much smaller model sizes. Code, data, and models will be made available.


[105] Density-Guided Response Optimization: Community-Grounded Alignment via Implicit Acceptance Signals cs.AI | cs.CL

Patrick Gerard, Svitlana Volkova

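TL;DR: This paper proposes density-guided response optimization (DGRO), a method for aligning language models to the norms of specific online communities without explicit preference labels. It uses the preference signals that community members express implicitly through the content they accept, engage with, and allow to persist; these signals form measurable geometric structure in representation space, where high-density regions correspond to accepted content.

Details

Motivation: Existing alignment methods rely on explicit preference supervision or predefined principles. These are costly, ethically fraught, or culturally misaligned, and are hard to apply to online communities that are under-resourced, organized around sensitive topics, or lacking institutional backing.

Result: In annotation-scarce settings across diverse communities spanning platform, topic, and language, DGRO-aligned models consistently produce responses preferred over supervised and prompt-based baselines by human annotators, domain experts, and model-based judges. Experiments show that local density recovers pairwise community judgments, confirming that the geometric structure encodes a meaningful preference signal.

Insight: The novelty is converting implicit community acceptance behavior (content persistence and engagement) into a density signal in representation space and using it as the supervision source for alignment, offering a practical alternative when explicit labels are unavailable. Viewed objectively, the method makes good use of existing community behavior data and reduces dependence on costly human annotation, but learning from emergent acceptance behavior also raises new ethical risks and implications that need discussion.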

Abstract: Language models deployed in online communities must adapt to norms that vary across social, cultural, and domain-specific contexts. Prior alignment approaches rely on explicit preference supervision or predefined principles, which are effective for well-resourced settings but exclude most online communities – particularly those without institutional backing, annotation infrastructure, or organized around sensitive topics – where preference elicitation is costly, ethically fraught, or culturally misaligned. We observe that communities already express preferences implicitly through what content they accept, engage with, and allow to persist. We show that this acceptance behavior induces measurable geometric structure in representation space: accepted responses occupy coherent, high-density regions that reflect community-specific norms, while rejected content falls in sparser or misaligned areas. We operationalize this structure as an implicit preference signal for alignment and introduce density-guided response optimization (DGRO), a method that aligns language models to community norms without requiring explicit preference labels. Using labeled preference data, we demonstrate that local density recovers pairwise community judgments, indicating that geometric structure encodes meaningful preference signal. We then apply DGRO in annotation-scarce settings across diverse communities spanning platform, topic, and language. DGRO-aligned models consistently produce responses preferred by human annotators, domain experts, and model-based judges over supervised and prompt-based baselines. We position DGRO as a practical alignment alternative for communities where explicit preference supervision is unavailable or misaligned with situated practices, and discuss the implications and risks of learning from emergent acceptance behavior.
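The idea that local density can stand in for a pairwise preference can be sketched in Python as follows. The sketch assumes a k-nearest-neighbor density estimate over embeddings of accepted content; the abstract does not specify the density estimator, so `knn_density`, the choice of `k`, and the 2-D toy embeddings are hypothetical.

```python
import math

def knn_density(x, accepted, k=3):
    """Inverse mean distance from x to its k nearest accepted embeddings."""
    if not accepted:
        raise ValueError("need at least one accepted embedding")
    nearest = sorted(math.dist(x, a) for a in accepted)[:k]
    return 1.0 / (sum(nearest) / len(nearest) + 1e-9)

def prefer(cand_a, cand_b, accepted, k=3):
    """Pick the candidate lying in the denser region of accepted content."""
    score_a = knn_density(cand_a, accepted, k)
    score_b = knn_density(cand_b, accepted, k)
    return "a" if score_a > score_b else "b"

# Toy 2-D embeddings: four accepted posts cluster near the origin, one is far.
accepted = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (2.0, 2.0)]
cand_a = (0.05, 0.05)   # inside the dense cluster
cand_b = (1.5, 1.5)     # in a sparse region
preferred = prefer(cand_a, cand_b, accepted)
```

A score of this kind needs no preference labels, only embeddings of content the community has already accepted, which is the setting the abstract targets.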


stat.ML [Back]

[106] Geometric structures and deviations on James’ symmetric positive-definite matrix bicone domain stat.ML | cs.CG | cs.CV | cs.LG

Jacek Karwowski, Frank Nielsen

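TL;DR: For symmetric positive-definite (SPD) matrix datasets, this paper introduces two new geometric structures based on James' bicone reparameterization: a Finslerian structure and a dual information-geometric structure, both of which ensure that geodesics correspond to straight lines in appropriate coordinate systems. The closed bicone domain includes the spectraplex as an affine subspace, and the Hilbert VPM distance is proven to generalize the Hilbert simplex distance.

Details

Motivation: SPD matrices are used widely across scientific fields. The established geometric frameworks, such as the affine-invariant Riemannian structure and the dual information-geometric log-determinant barrier structure, have limitations, which motivates new geometric structures that offer more flexible dissimilarity measures.

Result: The paper proves that the Hilbert VPM distance generalizes the Hilbert simplex distance and provides various inequalities between the new and traditional dissimilarities.

Insight: The novelty is deriving Finsler and dual Hessian structures from James' bicone reparameterization, so that geodesics are straight lines in suitable coordinates and the spectraplex embeds naturally as an affine subspace, providing a new geometric framework for distance measures in machine learning.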

Abstract: Symmetric positive-definite (SPD) matrix datasets play a central role across numerous scientific disciplines, including signal processing, statistics, finance, computer vision, information theory, and machine learning among others. The set of SPD matrices forms a cone which can be viewed as a global coordinate chart of the underlying SPD manifold. Rich differential-geometric structures may be defined on the SPD cone manifold. Among the most widely used geometric frameworks on this manifold are the affine-invariant Riemannian structure and the dual information-geometric log-determinant barrier structure, each associated with dissimilarity measures (distance and divergence, respectively). In this work, we introduce two new structures, a Finslerian structure and a dual information-geometric structure, both derived from James’ bicone reparameterization of the SPD domain. Those structures ensure that geodesics correspond to straight lines in appropriate coordinate systems. The closed bicone domain includes the spectraplex (the set of positive semi-definite diagonal matrices with unit trace) as an affine subspace, and the Hilbert VPM distance is proven to generalize the Hilbert simplex distance which found many applications in machine learning. Finally, we discuss several applications of these Finsler/dual Hessian structures and provide various inequalities between the new and traditional dissimilarities.


eess.AS [Back]

[107] Interpreting Speaker Characteristics in the Dimensions of Self-Supervised Speech Features eess.AS | cs.CL

Kyle Janse van Rensburg, Benjamin van Niekerk, Herman Kamper

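TL;DR: This paper studies the interpretability of individual feature dimensions in self-supervised speech models. It finds that principal dimensions of WavLM features encode speaker characteristics such as pitch, intensity, noise level, and the second formant, and shows that adjusting these dimensions controls voice characteristics in speech synthesis.

Details

Motivation: To examine whether individual feature dimensions in self-supervised speech models capture speaker characteristics, as a way of understanding how the model's representations are structured.

Result: With WavLM, PCA shows that the leading principal dimension encodes pitch and associated characteristics like gender, while other dimensions correlate with intensity, noise levels, the second formant, and higher frequency characteristics. Synthesis experiments show that most characteristics can be controlled by changing the corresponding dimensions.

Insight: The work shows that individual dimensions of self-supervised speech features are semantically interpretable, and provides a simple method for controlling voice characteristics in speech synthesis.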

Abstract: How do speech models trained through self-supervised learning structure their representations? Previous studies have looked at how information is encoded in feature vectors across different layers. But few studies have considered whether speech characteristics are captured within individual dimensions of SSL features. In this paper we specifically look at speaker information using PCA on utterance-averaged representations. Using WavLM, we find that the principal dimension that explains most variance encodes pitch and associated characteristics like gender. Other individual principal dimensions correlate with intensity, noise levels, the second formant, and higher frequency characteristics. Finally, in synthesis experiments we show that most characteristics can be controlled by changing the corresponding dimensions. This provides a simple method to control characteristics of the output voice in synthesis applications.
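The analysis step described above, PCA on utterance-averaged representations, can be sketched in dependency-free Python. The sketch finds only the first principal direction and uses power iteration in place of a library PCA routine; the function names and the 2-D toy "features" are made up for illustration and do not come from WavLM.

```python
import math

def utterance_average(frames):
    """Average frame-level feature vectors into one vector per utterance."""
    return [sum(col) / len(frames) for col in zip(*frames)]

def first_principal_direction(rows, iters=200):
    """Return (mean, unit direction of largest variance) via power iteration."""
    mean = [sum(col) / len(rows) for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, mean)] for row in rows]
    dim = len(mean)
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        # w = X^T (X v), which is proportional to the covariance times v.
        proj = [sum(c * vi for c, vi in zip(row, v)) for row in centered]
        w = [sum(p * row[j] for p, row in zip(proj, centered))
             for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return mean, v

# Four toy utterances, two frames each; the first feature varies the most.
utterances = [
    [[0.9, 0.1], [1.1, 0.1]],
    [[1.9, -0.1], [2.1, -0.1]],
    [[2.9, 0.1], [3.1, 0.1]],
    [[3.9, -0.1], [4.1, -0.1]],
]
rows = [utterance_average(u) for u in utterances]
mean, direction = first_principal_direction(rows)
# Coordinate of each utterance along the first principal dimension.
scores = [sum((x - m) * d for x, m, d in zip(row, mean, direction))
          for row in rows]
```

In a study of this kind, per-utterance scores like these would then be correlated with measured properties such as pitch, and shifting an utterance along `direction` is the sort of edit the synthesis experiments describe.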


eess.IV [Back]

[108] Biomechanically Accurate Gait Analysis: A 3d Human Reconstruction Framework for Markerless Estimation of Gait Parameters eess.IV | cs.CV

Akila Pemasiri, Ethan Goan, Glen Lichtwark, Robert Schuster, Luke Kelly

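TL;DR: This paper proposes a biomechanically interpretable gait analysis framework based on 3D human reconstruction from video. The method extracts biomechanical markers analogous to those of motion capture systems and integrates them into OpenSim for joint kinematic estimation, enabling markerless, scalable, and accurate gait assessment.

Details

Motivation: Conventional keypoint-based methods lack biomechanical meaning for gait analysis. The aim is a markerless, interpretable, and accurate video-based gait analysis framework that supports broad clinical and real-world use.

Result: Spatiotemporal and kinematic gait parameters were compared against marker-based reference data. The results show strong agreement with marker-based measurements and considerable improvement over pose-estimation methods alone.

Insight: The novelty is integrating biomechanical markers extracted from 3D reconstruction with the OpenSim simulation platform, yielding an interpretable end-to-end pipeline from video to biomechanical parameters and a new approach to markerless gait assessment.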

Abstract: This paper presents a biomechanically interpretable framework for gait analysis using 3D human reconstruction from video data. Unlike conventional keypoint-based approaches, the proposed method extracts biomechanically meaningful markers analogous to motion capture systems and integrates them within OpenSim for joint kinematic estimation. To evaluate performance, both spatiotemporal and kinematic gait parameters were analysed against reference marker-based data. Results indicate strong agreement with marker-based measurements, with considerable improvements when compared with pose-estimation methods alone. The proposed framework offers a scalable, markerless, and interpretable approach for accurate gait assessment, supporting broader clinical and real-world deployment of vision-based biomechanics.


cs.IT [Back]

[109] Functional Properties of the Focal-Entropy cs.IT | cs.CV | cs.LG | math.ST | stat.ML

Jaimin Shah, Martina Cardone, Alex Dytso

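TL;DR: This paper takes a distributional viewpoint to study the focal-entropy, the focal-loss analogue of the cross-entropy. The analysis establishes conditions for finiteness, convexity, and continuity of the focal-entropy and provides several asymptotic characterizations. It proves the existence and uniqueness of the focal-entropy minimizer, describes its structure, and shows that it can depart significantly from the data distribution. In particular, it rigorously shows that the focal-loss amplifies mid-range probabilities, suppresses high-probability outcomes, and, under extreme class imbalance, induces an over-suppression regime in which very small probabilities are further diminished. These results are validated experimentally, offer a theoretical foundation for understanding the focal-loss, and clarify the trade-offs it introduces in imbalanced learning tasks.

Details

Motivation: The focal-loss has become a widely used alternative to cross-entropy in class-imbalanced classification, particularly in computer vision, but its empirical success lacks a systematic information-theoretic study. This paper aims to fill that gap with a systematic theoretical analysis of the focal-entropy from a distributional viewpoint.

Result: Experiments validate the theoretical analysis, confirming how the focal-loss reshapes probability distributions: it amplifies mid-range probabilities, suppresses high probabilities, and produces over-suppression under extreme imbalance.

Insight: The novelty is the first complete theoretical characterization of the focal-entropy from an information-theoretic and distributional viewpoint, revealing the distinctive structure of its minimizer and the mechanism by which it deviates from the data distribution. In particular, it clarifies the probability-adjustment behavior of the focal-loss in imbalanced learning and the potential risk of over-suppression, providing a theoretical basis for loss function design.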

Abstract: The focal-loss has become a widely used alternative to cross-entropy in class-imbalanced classification problems, particularly in computer vision. Despite its empirical success, a systematic information-theoretic study of the focal-loss remains incomplete. In this work, we adopt a distributional viewpoint and study the focal-entropy, a focal-loss analogue of the cross-entropy. Our analysis establishes conditions for finiteness, convexity, and continuity of the focal-entropy, and provides various asymptotic characterizations. We prove the existence and uniqueness of the focal-entropy minimizer, describe its structure, and show that it can depart significantly from the data distribution. In particular, we rigorously show that the focal-loss amplifies mid-range probabilities, suppresses high-probability outcomes, and, under extreme class imbalance, induces an over-suppression regime in which very small probabilities are further diminished. These results, which are also experimentally validated, offer a theoretical foundation for understanding the focal-loss and clarify the trade-offs that it introduces when applied to imbalanced learning tasks.
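The claim that the minimizer can depart from the data distribution can be checked numerically in a binary toy case. The Python sketch below assumes the focal-entropy takes the form sum_i p_i * (-(1 - q_i)^gamma * log q_i), i.e. the expected focal loss of a predicted distribution q under the data distribution p; the abstract does not write the definition out, so this form, the function names, and the grid search are assumptions for illustration.

```python
import math

def focal_entropy(p, q, gamma):
    """Assumed form: expected focal loss of q under data distribution p."""
    return sum(pi * (-((1.0 - qi) ** gamma) * math.log(qi))
               for pi, qi in zip(p, q) if pi > 0)

def binary_minimizer(p1, gamma, steps=9999):
    """Grid-search the q1 in (0, 1) minimizing the binary focal-entropy."""
    best_q, best_val = None, float("inf")
    for i in range(1, steps + 1):
        q1 = i / (steps + 1)
        val = focal_entropy((p1, 1.0 - p1), (q1, 1.0 - q1), gamma)
        if val < best_val:
            best_q, best_val = q1, val
    return best_q

# Data distribution puts probability 0.9 on the first outcome.
q_ce = binary_minimizer(0.9, gamma=0.0)     # gamma = 0 recovers cross-entropy
q_focal = binary_minimizer(0.9, gamma=1.0)  # focal case
```

With gamma = 0 the minimizer coincides with the data probability 0.9, while with gamma = 1 it lands strictly between 0.5 and 0.9, a small-scale instance of the suppression of high-probability outcomes described in the abstract.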


cs.CR [Back]

[110] Contextualized Privacy Defense for LLM Agents cs.CR | cs.AI | cs.CL

Yule Wen, Yanzhe Zhang, Jianxun Lian, Xiaoyuan Yi, Xing Xie

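TL;DR: This paper proposes Contextualized Defense Instructing (CDI), a new privacy defense paradigm that protects user personal information handled by LLM agents. An instructor model generates step-specific, context-aware privacy guidance during agent execution, proactively shaping the agent's actions rather than merely constraining or vetoing them. CDI is paired with an experience-driven optimization framework that trains the instructor model via reinforcement learning, converting failure trajectories with privacy violations into learning environments.

Details

Motivation: Existing privacy defenses for LLM agents, such as prompting and guarding, are mostly static or passive and cannot support contextual, proactive privacy decisions in multi-step agent execution, so a more flexible, adaptive defense mechanism is needed.

Result: Within a unified simulation framework, CDI achieves a better balance between privacy preservation (94.2%) and helpfulness (80.6%) than baselines, with superior robustness to adversarial conditions and better generalization.

Insight: The novelty is shifting privacy defense from static constraints to dynamic, context-aware proactive guidance, and optimizing it through reinforcement learning on failure trajectories, achieving a better privacy-utility trade-off. Viewed objectively, the method offers a scalable, adaptive approach to privacy protection for LLM agents.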

Abstract: LLM agents increasingly act on users’ personal information, yet existing privacy defenses remain limited in both design and adaptability. Most prior approaches rely on static or passive defenses, such as prompting and guarding. These paradigms are insufficient for supporting contextual, proactive privacy decisions in multi-step agent execution. We propose Contextualized Defense Instructing (CDI), a new privacy defense paradigm in which an instructor model generates step-specific, context-aware privacy guidance during execution, proactively shaping actions rather than merely constraining or vetoing them. Crucially, CDI is paired with an experience-driven optimization framework that trains the instructor via reinforcement learning (RL), where we convert failure trajectories with privacy violations into learning environments. We formalize baseline defenses and CDI as distinct intervention points in a canonical agent loop, and compare their privacy-helpfulness trade-offs within a unified simulation framework. Results show that our CDI consistently achieves a better balance between privacy preservation (94.2%) and helpfulness (80.6%) than baselines, with superior robustness to adversarial conditions and generalization.


cs.IR [Back]

[111] SOLAR: SVD-Optimized Lifelong Attention for Recommendation cs.IR | cs.CV | cs.LG

Chenghao Zhang, Chao Feng, Yuanhao Pu, Xunyong Yang, Wenhui Yu

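TL;DR: This paper proposes SOLAR (SVD-Optimized Lifelong Attention for Recommendation), a framework built on SVD-Attention, which preserves softmax while reducing attention complexity from O(N^2 d) to O(Ndr). This supports modeling behavior sequences of ten-thousand scale and candidate sets of several thousand items in recommender systems without any filtering.

Details

Motivation: Standard attention incurs O(N^2 d) compute and memory cost on long sequences, and linear attention drops softmax and shifts the attention score distribution. The work exploits the low-rank structure of matrices in recommender systems as an inductive bias to address both problems.

Result: In Kuaishou's online recommendation scenario, SOLAR delivers a 0.68% gain in Video Views along with improvements in other business metrics.

Insight: The novelty is SVD-Attention, which is theoretically lossless on low-rank matrices, preserves softmax, and reduces complexity. The work treats low-rank structure as the default inductive bias of representation learning in recommender systems, enabling efficient modeling of large-scale sequences and candidate sets.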

Abstract: The attention mechanism remains the defining operator in Transformers since it provides expressive global credit assignment, yet its $O(N^2 d)$ time and memory cost in sequence length $N$ makes long-context modeling expensive and often forces truncation or other heuristics. Linear attention reduces complexity to $O(N d^2)$ by reordering computation through kernel feature maps, but this reformulation drops the softmax mechanism and shifts the attention score distribution. In recommender systems, low-rank structure in matrices is not a rare case, but rather the default inductive bias in representation learning, particularly explicit in user behavior sequence modeling. Leveraging this structure, we introduce SVD-Attention, which is theoretically lossless on low-rank matrices and preserves softmax while reducing attention complexity from $O(N^2 d)$ to $O(Ndr)$. With SVD-Attention, we propose SOLAR, SVD-Optimized Lifelong Attention for Recommendation, a sequence modeling framework that supports behavior sequences of ten-thousand scale and candidate sets of several thousand items in the cascading process without any filtering. In Kuaishou’s online recommendation scenario, SOLAR delivers a 0.68% Video Views gain together with additional business metrics improvements.


cs.RO [Back]

[112] ACE-Brain-0: Spatial Intelligence as a Shared Scaffold for Universal Embodiments cs.RO | cs.CL | cs.CV

Ziyang Gong, Zehang Luo, Anke Tang, Zhe Liu, Shi Fu

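TL;DR: This paper presents ACE-Brain-0, a generalist foundation brain built on a multimodal large language model (MLLM) that unifies spatial reasoning, autonomous driving, and embodied manipulation. Its core idea is that spatial intelligence serves as a universal scaffold across different physical embodiments (such as vehicles, robots, and UAVs). It proposes the Scaffold-Specialize-Reconcile (SSR) training paradigm and adopts Group Relative Policy Optimization (GRPO) to strengthen the model. Experiments show competitive and even state-of-the-art performance across multiple benchmarks.

Details

Motivation: Universal embodied intelligence must generalize robustly across heterogeneous embodiments such as autonomous driving, robotics, and UAVs, but faces challenges including long-tail data, gradient interference, and catastrophic forgetting. The aim is to balance universal generalization with domain-specific proficiency.

Result: Extensive experiments on 24 spatial and embodiment-related benchmarks show that ACE-Brain-0 achieves competitive and even state-of-the-art (SOTA) performance.

Insight: The claimed novelty is identifying spatial intelligence as a universal cognitive scaffold across embodiments and proposing the SSR (Scaffold-Specialize-Reconcile) training paradigm for building a unified model. Viewed objectively, treating spatial cognition as a domain-agnostic shared foundation and reconciling experts through data-free model merging is a novel way to handle the balance problem in unified multi-embodiment modeling.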

Abstract: Universal embodied intelligence demands robust generalization across heterogeneous embodiments, such as autonomous driving, robotics, and unmanned aerial vehicles (UAVs). However, training a unified model over diverse embodiments frequently runs into long-tail data, gradient interference, and catastrophic forgetting, making it notoriously difficult to balance universal generalization with domain-specific proficiency. In this report, we introduce ACE-Brain-0, a generalist foundation brain that unifies spatial reasoning, autonomous driving, and embodied manipulation within a single multimodal large language model (MLLM). Our key insight is that spatial intelligence serves as a universal scaffold across diverse physical embodiments: although vehicles, robots, and UAVs differ drastically in morphology, they share a common need for modeling 3D mental space, making spatial cognition a natural, domain-agnostic foundation for cross-embodiment transfer. Building on this insight, we propose the Scaffold-Specialize-Reconcile (SSR) paradigm, which first establishes a shared spatial foundation, then cultivates domain-specialized experts, and finally harmonizes them through data-free model merging. Furthermore, we adopt Group Relative Policy Optimization (GRPO) to strengthen the model’s comprehensive capability. Extensive experiments demonstrate that ACE-Brain-0 achieves competitive and even state-of-the-art performance across 24 spatial and embodiment-related benchmarks.


[113] Give me scissors: Collision-Free Dual-Arm Surgical Assistive Robot for Instrument Delivery cs.RO | cs.CV | cs.HC | cs.LG

Xuejin Luo, Shiquan Sun, Runshi Zhang, Ruizhi Zhang, Junchen Wang

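TL;DR: This paper presents a collision-free dual-arm surgical assistive robot for automated instrument delivery. The system uses a vision-language model to generate grasping and delivery trajectories in a zero-shot manner from surgeons' instructions, and achieves reactive obstacle avoidance and self-collision prevention in dynamic environments through real-time obstacle minimum distance perception integrated into a unified quadratic programming framework.

Details

Motivation: Existing robotic scrub nurses rely on predefined paths for instrument delivery, which limits generalizability and poses safety risks in dynamic environments. This work addresses these problems to enable safe, adaptive instrument delivery.

Result: In extensive experimental validation, the proposed robotic system achieves an 83.33% success rate in surgical instrument delivery while maintaining smooth, collision-free movement throughout all trials.

Insight: The novelty is combining a vision-language model for zero-shot trajectory generation with real-time obstacle distance perception integrated into a unified quadratic programming framework, enabling safe, adaptive dual-arm operation in dynamic environments.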

Abstract: During surgery, scrub nurses are required to frequently deliver surgical instruments to surgeons, which can lead to physical fatigue and decreased focus. Robotic scrub nurses provide a promising solution that can replace repetitive tasks and enhance efficiency. Existing research on robotic scrub nurses relies on predefined paths for instrument delivery, which limits their generalizability and poses safety risks in dynamic environments. To address these challenges, we present a collision-free dual-arm surgical assistive robot capable of performing instrument delivery. A vision-language model is utilized to automatically generate the robot’s grasping and delivery trajectories in a zero-shot manner based on surgeons’ instructions. A real-time obstacle minimum distance perception method is proposed and integrated into a unified quadratic programming framework. This framework ensures reactive obstacle avoidance and self-collision prevention during the dual-arm robot’s autonomous movement in dynamic environments. Extensive experimental validations demonstrate that the proposed robotic system achieves an 83.33% success rate in surgical instrument delivery while maintaining smooth, collision-free movement throughout all trials. The project page and source code are available at https://give-me-scissors.github.io/.


[114] Tether: Autonomous Functional Play with Correspondence-Driven Trajectory Warping cs.RO | cs.AI | cs.CV

William Liang, Sam Wang, Hung-Ju Wang, Osbert Bastani, Yecheng Jason Ma

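TL;DR: This paper proposes Tether, a method for autonomous functional play in robotics. It builds an open-loop policy by warping actions from a small set of source demonstrations (<=10) according to semantic keypoint correspondences in the target scene, and uses the visual understanding of vision-language models to run a continuous real-world cycle of task selection, execution, evaluation, and improvement, efficiently generating diverse, high-quality robot interaction data.

Details

Motivation: Learning from autonomous interaction and accumulated experience is a challenge in robotics: it requires a policy robust to diverse, potentially out-of-distribution environment states and a procedure that continuously produces useful robot experience.

Result: In a household-like multi-object setup, the method is the first to perform many hours of autonomous multi-task play in the real world starting from only a handful of demonstrations. It yields over 1000 expert-level trajectories and trains closed-loop imitation policies competitive with those learned from human-collected demonstrations.

Insight: The novelty is combining an open-loop policy based on trajectory warping over semantic keypoint correspondences (data-efficient and robust) with an autonomous play loop guided by vision-language models, enabling long-running, multi-task, real-world data collection and continual policy improvement from very few demonstrations.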

Abstract: The ability to conduct and learn from interaction and experience is a central challenge in robotics, offering a scalable alternative to labor-intensive human demonstrations. However, realizing such “play” requires (1) a policy robust to diverse, potentially out-of-distribution environment states, and (2) a procedure that continuously produces useful robot experience. To address these challenges, we introduce Tether, a method for autonomous functional play involving structured, task-directed interactions. First, we design a novel open-loop policy that warps actions from a small set of source demonstrations (<=10) by anchoring them to semantic keypoint correspondences in the target scene. We show that this design is extremely data-efficient and robust even under significant spatial and semantic variations. Second, we deploy this policy for autonomous functional play in the real world via a continuous cycle of task selection, execution, evaluation, and improvement, guided by the visual understanding capabilities of vision-language models. This procedure generates diverse, high-quality datasets with minimal human intervention. In a household-like multi-object setup, our method is the first to perform many hours of autonomous multi-task play in the real world starting from only a handful of demonstrations. This produces a stream of data that consistently improves the performance of closed-loop imitation policies over time, ultimately yielding over 1000 expert-level trajectories and training policies competitive with those learned from human-collected demonstrations.


[115] ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation cs.RO | cs.CVPDF

Xialin He, Sirui Xu, Xinyao Li, Runpei Dong, Liuyu Bian

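TL;DR: This paper proposes ULTRA, a framework targeting the core difficulty of autonomous whole-body loco-manipulation for humanoid robots. It consists of a physics-driven neural retargeting algorithm and a unified multimodal controller, and generates coordinated whole-body behavior from sparse task intent and noisy egocentric visual input without relying on predefined motion references at test time.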

Details

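Motivation: Existing approaches are limited by scarce or low-quality retargeted data, struggle to scale to large skill repertoires, and rely heavily on tracking predefined motion references rather than generating behavior from perception and high-level task specifications. ULTRA aims to overcome these limitations and generate behavior autonomously from perception.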

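Result: Evaluations in simulation and on a real Unitree G1 humanoid show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception, consistently outperforming tracking-only baselines with limited skills.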

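Insight: The novelty lies in pairing physics-driven neural retargeting with a unified multimodal controller. A universal tracking policy is distilled into the controller, motor skills are compressed into a compact latent space, and reinforcement learning finetuning is applied to expand coverage and improve robustness in out-of-distribution scenarios. The result is coordinated whole-body behavior generated from sparse intent.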

Abstract: Achieving autonomous and versatile whole-body loco-manipulation remains a central barrier to making humanoids practically useful. Yet existing approaches are fundamentally constrained: retargeted data are often scarce or low-quality; methods struggle to scale to large skill repertoires; and, most importantly, they rely on tracking predefined motion references rather than generating behavior from perception and high-level task specifications. To address these limitations, we propose ULTRA, a unified framework with two key components. First, we introduce a physics-driven neural retargeting algorithm that translates large-scale motion capture to humanoid embodiments while preserving physical plausibility for contact-rich interactions. Second, we learn a unified multimodal controller that supports both dense references and sparse task specifications, under sensing ranging from accurate motion-capture state to noisy egocentric visual inputs. We distill a universal tracking policy into this controller, compress motor skills into a compact latent space, and apply reinforcement learning finetuning to expand coverage and improve robustness under out-of-distribution scenarios. This enables coordinated whole-body behavior from sparse intent without test-time reference motions. We evaluate ULTRA in simulation and on a real Unitree G1 humanoid. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception, consistently outperforming tracking-only baselines with limited skills.