cs.CL

[1] FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition cs.CL

Jonas Golde, Patrick Haller, Alan Akbik

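TL;DR: This paper introduces FiNERweb, a scalable pipeline for creating multilingual named entity recognition (NER) datasets. Building on FineWeb-Edu, the pipeline trains regression models to select NER-relevant passages and annotates them with multilingual large language models (LLMs), producing a dataset that covers 91 languages and 25 scripts, with about 225k passages and 235k distinct entity labels. Experiments show that models trained on FiNERweb match or exceed strong baselines in zero-shot transfer to English, Thai, and Swahili while using 19x less training data.

Details

Motivation: Recent multilingual NER work shows that LLMs can provide effective synthetic supervision, but the resulting datasets have mostly been by-products of broader experiments rather than systematic, reusable resources. The paper aims to build a scalable, high-quality pipeline for creating multilingual NER datasets that supports more effective teacher-student training.

Result: The regression model achieves more than 84 F1 at identifying NER-relevant passages. In zero-shot transfer to English, Thai, and Swahili, models trained on FiNERweb perform comparably to or better than strong baselines while using 1/19 of the training data. Annotation quality, assessed with LLM-as-a-judge, scores highly on both faithfulness (3.99 out of 5) and completeness (4.05 out of 5). The authors also observe that the F1 of current state-of-the-art models drops by 0.02 to 0.09 when they are evaluated with target-language labels instead of English ones.

Insight: The contribution is a systematic, scalable pipeline for generating multilingual NER data (FiNERweb) that applies the teacher-student paradigm to 91 languages. Its core is the combination of regression-based passage filtering with multilingual LLM annotation, which yields high-quality synthetic data at low resource cost. Beyond releasing the dataset, the work highlights how target-language labels affect evaluation and releases all accompanying artifacts, giving multilingual NER research a reusable resource.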

Abstract: Recent multilingual named entity recognition (NER) work has shown that large language models (LLMs) can provide effective synthetic supervision, yet such datasets have mostly appeared as by-products of broader experiments rather than as systematic, reusable resources. We introduce FiNERweb, a dataset-creation pipeline that scales the teacher-student paradigm to 91 languages and 25 scripts. Building on FineWeb-Edu, our approach trains regression models to identify NER-relevant passages and annotates them with multilingual LLMs, resulting in about 225k passages with 235k distinct entity labels. Our experiments show that the regression model achieves more than 84 F1, and that models trained on FiNERweb obtain comparable or improved performance in zero shot transfer settings on English, Thai, and Swahili, despite being trained on 19x less data than strong baselines. In addition, we assess annotation quality using LLM-as-a-judge and observe consistently high scores for both faithfulness (3.99 out of 5) and completeness (4.05 out of 5), indicating reliable and informative annotations. Further, we release the dataset with both English labels and translated label sets in the respective target languages because we observe that the performance of current state-of-the-art models drops by 0.02 to 0.09 F1 when evaluated using target language labels instead of English ones. We release FiNERweb together with all accompanying artifacts to the research community in order to facilitate more effective student-teacher training for multilingual named entity recognition.
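The passage-filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `toy_relevance_score` is a hypothetical stand-in for the trained regression model, and the 0.5 threshold is an arbitrary assumption.

```python
def toy_relevance_score(passage):
    """Hypothetical stand-in for the trained regression model.

    Scores a passage by the fraction of whitespace tokens that start with an
    uppercase letter, a crude proxy for how entity-dense the text is.
    """
    tokens = passage.split()
    if not tokens:
        return 0.0
    return sum(token[0].isupper() for token in tokens) / len(tokens)


def filter_passages(passages, score_fn, threshold):
    """Keep only passages whose predicted relevance meets the threshold."""
    return [p for p in passages if score_fn(p) >= threshold]


passages = [
    "Alan Turing worked at Bletchley Park in England",
    "the weather was mild and it rained a little",
]
# Only passages that pass the filter would be sent on to the LLM annotator.
selected = filter_passages(passages, toy_relevance_score, threshold=0.5)
```

Filtering before annotation means the expensive LLM call is spent only on passages likely to contain entities, which is presumably where much of the pipeline's cost saving comes from.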


[2] Olmo 3 cs.CL | cs.LG

Team Olmo: Allyson Ettinger, Amanda Bertsch, Bailey Kuehl

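TL;DR: Olmo 3 is a family of fully open language models at the 7B and 32B parameter scales that the authors describe as state of the art, targeting long-context reasoning, function calling, coding, instruction following, general chat, and knowledge recall. The release covers the full lifecycle of the model family, including every stage, checkpoint, data point, and dependency. The flagship model, Olmo 3 Think 32B, is presented as the strongest fully open thinking model released to date.

Details

Motivation: To build a fully open, high-performing family of language models that supports demanding tasks such as long-context reasoning and function calling, with a fully transparent and reproducible construction process.

Result: The paper claims that its flagship model, Olmo 3 Think 32B, is the strongest fully open thinking model released to date and that the family reaches state-of-the-art (SOTA) performance on its target capabilities.

Insight: The contribution is a complete, transparent model flow from data to final model, combining several advanced capabilities (such as thinking and function calling) in a fully open model, which supports open-source development and research on model trustworthiness.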

Abstract: We introduce Olmo 3, a family of state-of-the-art, fully-open language models at the 7B and 32B parameter scales. Olmo 3 model construction targets long-context reasoning, function calling, coding, instruction following, general chat, and knowledge recall. This release includes the entire model flow, i.e., the full lifecycle of the family of models, including every stage, checkpoint, data point, and dependency used to build it. Our flagship model, Olmo 3 Think 32B, is the strongest fully-open thinking model released to-date.


[3] What Affects the Effective Depth of Large Language Models? cs.CL

Yi Hu, Cai Zhou, Muhan Zhang

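TL;DR: This paper systematically studies what affects the effective depth of large language models (LLMs): model scale, training type, and task difficulty. An analysis of the Qwen-2.5 family (1.5B-32B) finds that the number of effective layers grows with model size while the effective depth ratio stays stable; that long chain-of-thought (CoT) training does not increase effective depth, so its gains come from longer context rather than deeper per-token computation; and that models do not dynamically use more layers for harder tasks.

Details

Motivation: Performance gains diminish as LLMs are scaled in depth. Building on the notion of "effective depth", the paper investigates why current LLMs fail to use all of their layers for meaningful computation, in order to uncover the root causes of this underuse.

Result: Experiments on the Qwen-2.5 family show that the number of effective layers grows with model size while the effective depth ratio remains stable; long-CoT models show no increase in effective depth over their base models; and evaluations across tasks of varying difficulty show that models do not dynamically use more layers for harder problems.

Insight: The paper systematically quantifies LLM effective depth under different conditions and shows that current models underuse their available depth, pointing to research directions in raising layer utilization, model pruning, and early exiting.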

Abstract: The scaling of large language models (LLMs) emphasizes increasing depth, yet performance gains diminish with added layers. Prior work introduces the concept of “effective depth”, arguing that deeper models fail to fully utilize their layers for meaningful computation. Building on this, we systematically study how effective depth varies with model scale, training type, and task difficulty. First, we analyze the model behavior of Qwen-2.5 family (1.5B-32B) and find that while the number of effective layers grows with model size, the effective depth ratio remains stable. Besides, comparisons between base and corresponding long-CoT models show no increase in effective depth, suggesting that improved reasoning stems from longer context rather than deeper per-token computation. Furthermore, evaluations across tasks of varying difficulty indicate that models do not dynamically use more layers for harder problems. Our results suggest that current LLMs underuse available depth across scales, training paradigms and tasks of varying difficulties, pointing out research opportunities on increasing the layer utilization rate of LLMs, model pruning, and early exiting. Our code is released at https://github.com/AheadOFpotato/what_affects_effective_depth.


[4] A Unified Sparse Attention via Multi-Granularity Compression cs.CL

Siran Liu, Zane Cao, Yongchao He

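TL;DR: This paper proposes UniSparse, a unified sparse attention mechanism that introduces composite tokens to aggregate multi-granularity contextual information and builds sparse attention dynamically on top of them, addressing the high computational cost of self-attention in long-context LLM workloads. Across multiple modalities and tasks it outperforms existing sparse attention methods, improving efficiency substantially while retaining high accuracy.

Details

Motivation: In long-context LLM applications such as multi-turn dialogue and program analysis, self-attention cost grows quadratically with sequence length, which is a fundamental bottleneck. Existing sparse attention methods involve trade-offs: training-based methods are costly and cannot be applied directly as acceleration plugins for other models, while inference-time methods often compromise efficiency or cross-modal generality.

Result: Across modalities and tasks ranging from synthetic benchmarks to real-world applications, UniSparse consistently surpasses state-of-the-art sparse attention methods (e.g., MInference, XAttention, FlexPrefill) in both accuracy and efficiency, reaching at least 99% of full-attention accuracy with attention computation up to 2.61x faster than FlashAttention.

Insight: The key idea is the composite-token abstraction: sparse attention is constructed dynamically through multi-granularity compression and block-level selection, enabling efficient, hardware-friendly GPU execution. The unified design could serve as a plug-and-play accelerator that balances efficiency, accuracy, and cross-modal generality for long-context modeling.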

Abstract: Efficient long-context understanding and reasoning are increasingly vital for large language model (LLM) applications such as multi-turn dialogue and program analysis. However, the core self-attention mechanism scales quadratically with sequence length, creating a fundamental computational bottleneck. Existing sparse attention methods alleviate this issue but face trade-offs: training-based methods are costly and cannot be directly applied as acceleration plugins for other models, while inference-time methods often compromise efficiency or cross-modal generality. To address these limitations, we present UniSparse, a unified mechanism that introduces the notion of composite tokens: compact representations that aggregate multi-granularity contextual information. Building on this abstraction, UniSparse dynamically constructs sparse attention through multi-granularity compression and block-level selection, enabling efficient and hardware-friendly execution on GPU. Across multiple modalities and tasks ranging from synthetic benchmarks to real-world applications, UniSparse consistently surpasses state-of-the-art sparse attention methods (e.g., MInference, XAttention, FlexPrefill) in both accuracy and efficiency, achieving ≥ 99% of full-attention accuracy and up to 2.61× faster attention computation than FlashAttention.
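To make "compression plus block-level selection" concrete, here is a generic sketch of block-sparse selection. It is not UniSparse itself: the abstract does not specify how composite tokens are built, so this example assumes plain mean-pooling as the compression step, and all function names are hypothetical.

```python
def mean_pool(vectors):
    """Compress a block of vectors into one representative by averaging."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_key_blocks(query_block, keys, block_size, top_k):
    """Return sorted indices of the top_k key blocks for one query block.

    Each key block is compressed to a single vector, scored against the
    compressed query block, and only the best-scoring blocks are kept.
    """
    q = mean_pool(query_block)
    blocks = [keys[i:i + block_size] for i in range(0, len(keys), block_size)]
    scores = [dot(q, mean_pool(block)) for block in blocks]
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


keys = [[1, 0], [1, 0], [0, 1], [0, 1], [-1, 0], [-1, 0]]
query_block = [[1, 0], [1, 0]]
kept = select_key_blocks(query_block, keys, block_size=2, top_k=2)
```

Full attention would then be computed only between the query block and the kept key blocks, so the cost scales with `top_k * block_size` rather than with the full sequence length.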


[5] CogMem: A Cognitive Memory Architecture for Sustained Multi-Turn Reasoning in Large Language Models cs.CL

Yiran Zhang, Jincheng Hu, Mark Dras, Usman Naseem

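TL;DR: This paper proposes CogMem, a cognitively inspired, memory-augmented LLM architecture that addresses the loss of accuracy and coherence LLMs show in multi-turn reasoning. It maintains structured, persistent memory through a layered memory system (long-term memory, direct-access memory, and a focus-of-attention mechanism) to support sustained iterative reasoning.

Details

Motivation: LLMs excel at single-turn reasoning but are prone to reasoning bias, task drift, hallucination, overconfidence, and memory decay over multi-turn interactions. Current approaches typically append the full conversation history, which causes unbounded context growth, higher computational cost, and degraded reasoning efficiency.

Result: Experiments on TurnBench show that CogMem's layered design mitigates reasoning failures, controls context growth, and improves consistency across long reasoning chains, moving LLMs toward more reliable, human-like reasoning.

Insight: Drawing on human cognitive structure, the paper builds a three-layer memory architecture of long-term memory, direct-access memory, and a dynamic focus of attention. It consolidates strategies across sessions, maintains session-level notes, and dynamically reconstructs task-relevant context, keeping the context concise while improving the persistence and consistency of multi-turn reasoning.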

Abstract: Large language models (LLMs) excel at single-turn reasoning but often lose accuracy and coherence over extended, multi-turn interactions. Recent evaluations such as TurnBench highlight recurring failure modes: reasoning bias, task drift, hallucination, overconfidence, and memory decay. Current approaches typically append full conversational histories, causing unbounded context growth, higher computational costs, and degraded reasoning efficiency. We introduce CogMem, a cognitively inspired, memory-augmented LLM architecture that supports sustained iterative reasoning through structured, persistent memory. CogMem incorporates three layers: a Long-Term Memory (LTM) that consolidates cross-session reasoning strategies; a Direct Access (DA) memory that maintains session-level notes and retrieves relevant long-term memories; and a Focus of Attention (FoA) mechanism that dynamically reconstructs concise, task-relevant context at each turn. Experiments on TurnBench show that this layered design mitigates reasoning failures, controls context growth, and improves consistency across extended reasoning chains, moving toward more reliable, human-like reasoning in LLMs.
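The three layers can be pictured with a small data structure. This is a loose sketch under stated assumptions, not CogMem's design: the class name, the word-overlap relevance score, and the `max_items` bound are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LayeredMemory:
    long_term: List[str] = field(default_factory=list)      # cross-session strategies
    session_notes: List[str] = field(default_factory=list)  # direct-access notes
    max_items: int = 3                                       # bound on per-turn context

    def add_note(self, note: str) -> None:
        self.session_notes.append(note)

    def consolidate(self) -> None:
        """At session end, move new notes into long-term memory."""
        for note in self.session_notes:
            if note not in self.long_term:
                self.long_term.append(note)
        self.session_notes.clear()

    def focus(self, query: str) -> List[str]:
        """Rebuild a bounded context from the most query-relevant memories.

        Relevance is plain word overlap here, which is an assumption; a real
        system would likely use a learned retriever.
        """
        words = set(query.lower().split())
        candidates = self.session_notes + self.long_term
        ranked = sorted(
            candidates,
            key=lambda m: len(words & set(m.lower().split())),
            reverse=True,
        )
        return ranked[: self.max_items]


memory = LayeredMemory(max_items=2)
memory.add_note("guess digits one at a time")
memory.add_note("the code has four digits")
memory.add_note("user prefers short answers")
context = memory.focus("how many digits in the code")
```

Because `focus` returns at most `max_items` entries, the prompt size stays bounded no matter how many turns have passed, in contrast to appending the full history.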


[6] Ladder Up, Memory Down: Low-Cost Fine-Tuning With Side Nets cs.CL | cs.LG

Estelle Zheng, Nathan Cerisara, Sébastien Warichet, Emmanuel Helbert, Christophe Cerisara

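TL;DR: This paper revisits Ladder Side Tuning (LST), a parameter-efficient fine-tuning method that adapts an LLM by adding a lightweight side network. LST cuts peak memory by 50% relative to QLoRA while matching QLoRA's compute scaling and average performance, which makes it possible to fine-tune a 7B-parameter model on a single 12 GB consumer GPU. The paper also introduces xLadder, a variant that increases effective depth via cross-connections and shortens chain-of-thought (CoT) at a fixed parameter count.

Details

Motivation: Fine-tuning LLMs on consumer GPUs is limited by memory, and existing parameter-efficient methods such as QLoRA still incur high memory usage from the backward pass through the full model.

Result: Across downstream benchmarks spanning natural language understanding, mathematics, and LLM-critic tasks, LST performs on par with QLoRA on average while cutting peak memory by 50%. It can fine-tune a 7B-parameter model on a single 12 GB GPU with 2k-token contexts and no gradient checkpointing, conditions under which QLoRA runs out of memory. Scaling laws show that LST scales with compute similarly to QLoRA.

Insight: The core contribution is revisiting and refining the Ladder Side Tuning side-network architecture and showing that it balances memory efficiency and performance. The xLadder variant increases effective depth through cross-connections at a fixed parameter count, adding architectural flexibility for reasoning (for example, shorter chains of thought). The approach is promising for memory-constrained fine-tuning.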

Abstract: Fine-tuning large language models (LLMs) is often limited by the memory available on commodity GPUs. Parameter-efficient fine-tuning (PEFT) methods such as QLoRA reduce the number of trainable parameters, yet still incur high memory usage induced by the backward pass in the full model. We revisit Ladder Side Tuning (LST), a rarely explored PEFT technique that adds a lightweight side network, and show that it matches QLoRA’s compute scaling slope while cutting peak memory by 50%. Across different downstream benchmarks spanning natural language understanding, mathematical and LLM-critic tasks, LST has competitive performance with QLoRA’s accuracy on average while being much more memory-efficient. This efficiency enables fine-tuning of 7B-parameter models on a single 12 GB consumer GPU with 2k-token contexts, requiring no gradient checkpointing (conditions under which QLoRA exhausts memory). Beyond memory efficiency, we also establish scaling laws showing that LST scales similarly to QLoRA. We exploit Ladder’s architectural flexibility by introducing xLadder, a depth-extended variant that increases effective depth via cross-connections and shortens chain-of-thought (CoT) at fixed parameter count. Ladder is strong when memory is the bottleneck; xLadder builds on this by enabling deeper reasoning without additional memory overhead.


[7] Step-Tagging: Toward controlling the generation of Language Reasoning Models through step monitoring cs.CL | cs.AI

Yannis Belkhiter, Seshu Tirupathi, Giulio Zizzo, John D. Kelleher

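TL;DR: This paper proposes the Step-Tagging framework, which uses a lightweight sentence classifier to annotate, in real time, the type of each reasoning step generated by a Language Reasoning Model (LRM), and introduces ReasonType, a taxonomy of reasoning steps. By monitoring the count of specific step types online, the framework yields effective, interpretable early-stopping criteria for LRM inference, reducing generated tokens by 20% to 50% while keeping accuracy comparable to standard generation.

Details

Motivation: Current LRMs are inefficient at inference time and over-generate verification and reflection steps, so a way to monitor and control their generation is needed.

Result: Evaluated on three open-source reasoning models across MATH500, GSM8K, AIME, GPQA, and MMLU-Pro, the framework reduces tokens by 20% to 50% while maintaining accuracy comparable to standard generation, with the largest gains on more computation-heavy tasks.

Insight: The paper contributes a lightweight, real-time step-classification framework (Step-Tagging) and a new taxonomy of reasoning steps (ReasonType), and turns step-type monitoring into interpretable early-stopping criteria, offering a new tool for controlling LRM generation and studying LRM behavior. The core idea is to monitor the structure of the reasoning process rather than rely only on the final output.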

Abstract: The field of Language Reasoning Models (LRMs) has been very active over the past few years with advances in training and inference techniques enabling LRMs to reason longer, and more accurately. However, a growing body of studies show that LRMs are still inefficient, over-generating verification and reflection steps. To address this challenge, we introduce the Step-Tagging framework, a lightweight sentence-classifier enabling real-time annotation of the type of reasoning steps that an LRM is generating. To monitor reasoning behaviors, we introduced ReasonType: a novel taxonomy of reasoning steps. Building on this framework, we demonstrated that online monitoring of the count of specific steps can produce effective interpretable early stopping criteria of LRM inferences. We evaluate the Step-tagging framework on three open-source reasoning models across standard benchmark datasets: MATH500, GSM8K, AIME and non-mathematical tasks (GPQA and MMLU-Pro). We achieve 20 to 50% token reduction while maintaining comparable accuracy to standard generation, with largest gains observed on more computation-heavy tasks. This work offers a novel way to increase control over the generation of LRMs, and a new tool to study behaviors of LRMs.
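The count-based early-stopping idea can be sketched in a few lines. The example below is illustrative only: `toy_tagger` is a hypothetical keyword rule standing in for the paper's sentence classifier, and the label names are assumptions, since the abstract does not list the ReasonType categories.

```python
def toy_tagger(step):
    """Hypothetical stand-in for the lightweight sentence classifier."""
    if step.lower().startswith("wait"):
        return "verification"
    return "other"


def generate_with_early_stop(step_stream, tagger, watched_tag, max_count):
    """Consume reasoning steps until a watched step type occurs max_count times.

    step_stream may be any iterable, including a generator that wraps a
    model's streaming output, so stopping here also stops generation.
    """
    kept = []
    count = 0
    for step in step_stream:
        kept.append(step)
        if tagger(step) == watched_tag:
            count += 1
            if count >= max_count:
                break
    return kept


steps = [
    "Compute 12 * 7 = 84.",
    "Wait, let me verify the product.",
    "So the answer is 84.",
    "Wait, is that right?",
    "Yes, 84.",
    "Wait, one more check.",
]
kept = generate_with_early_stop(steps, toy_tagger, "verification", max_count=2)
```

The stopping rule is interpretable by construction: generation ends because a named step type reached a stated budget, not because of an opaque confidence score.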


[8] Effect of Document Packing on the Latent Multi-Hop Reasoning Capabilities of Large Language Models cs.CL | cs.AI | cs.LG

Gabriele Prato, Shagun Sodhani, Alessandro Sordoni, Sarath Chandar

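TL;DR: This paper studies how the standard practice of packing multiple documents together during LLM training, done for computational efficiency, affects a model's latent multi-hop reasoning ability. It finds that packing can improve model performance compared with training on individual documents, at the cost of more compute. An ablation study identifies the key factors behind the advantage of packing.

Details

Motivation: Standard training packs multiple documents together for computational efficiency, but the effect of this process on model capabilities is largely unexplored. The paper investigates how different document-packing strategies influence the latent multi-hop reasoning ability of LLMs.

Result: Packing can improve model performance compared with training on individual documents, at the expense of more compute. The abstract does not name specific benchmarks or report SOTA comparisons.

Insight: The paper systematically studies how document-packing strategies affect latent multi-hop reasoning in LLMs and uses an ablation study to explain where the advantage comes from, providing practical insight into LLM training dynamics and model development.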

Abstract: The standard practice for training large language models involves packing multiple documents together to optimize computational efficiency. However, the impact of this process on the models’ capabilities remains largely unexplored. To address this gap, we investigate how different document-packing strategies influence the latent multi-hop reasoning abilities of LLMs. Our findings indicate that packing can improve model performance compared to training on individual documents, at the expense of more compute. To further understand the underlying mechanisms, we conduct an ablation study, identifying key factors that explain the advantages of packing. Ultimately, our research deepens the understanding of LLM training dynamics and provides practical insights for optimizing model development.
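For readers unfamiliar with packing, the sketch below contrasts it with one-document-per-sequence training. It shows the common concatenate-and-chunk scheme only; the abstract does not describe the specific packing strategies the paper compares, and the separator and padding IDs are arbitrary assumptions.

```python
def pack_documents(docs, seq_len, sep_id):
    """Concatenate tokenized documents with a separator, then cut the stream
    into fixed-length sequences. Trailing tokens that do not fill a whole
    sequence are dropped."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(sep_id)
    return [
        stream[i:i + seq_len]
        for i in range(0, len(stream) - seq_len + 1, seq_len)
    ]


def pad_documents(docs, seq_len, pad_id):
    """Baseline: one document per sequence, truncated or right-padded."""
    return [(doc + [pad_id] * seq_len)[:seq_len] for doc in docs]


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_documents(docs, seq_len=4, sep_id=0)
padded = pad_documents(docs, seq_len=4, pad_id=-1)
```

In the packed output a single training sequence can span several documents (the second sequence here holds the end of one document and the start of the next), which is exactly the cross-document context whose effect on reasoning the paper investigates.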


Nguyen Tien Dong, Minh-Anh Nguyen, Thanh Dat Hoang, Nguyen Tuan Ngoc, Dao Xuan Quang Minh

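TL;DR: This paper introduces VLegal-Bench, the first comprehensive benchmark for evaluating large language models (LLMs) on Vietnamese legal tasks. Designed around Bloom's cognitive taxonomy, it contains 10,450 samples covering tasks from general legal question answering to retrieval-augmented generation, multi-step reasoning, and scenario-based problem solving tailored to Vietnamese law, and aims to systematically assess how well LLMs understand and apply Vietnamese law.

Details

Motivation: The complexity, hierarchical organization, and frequent revision of Vietnamese legislation make it hard to evaluate how well LLMs interpret and use legal knowledge. Existing benchmarks do not meet this need, so a systematic evaluation framework specific to the Vietnamese legal setting is required.

Result: The paper builds a benchmark of 10,450 samples in which legal experts label and cross-validate each instance, ensuring that every sample is grounded in authoritative legal documents and mirrors real-world legal assistant workflows. This provides a standardized, transparent, and cognitively informed basis for evaluating LLMs in Vietnamese legal contexts.

Insight: The benchmark design incorporates Bloom's cognitive taxonomy to evaluate multiple levels of legal understanding (from remembering to creating). Construction is tied closely to the specifics of Vietnamese law and to practical scenarios such as retrieval-augmented generation, and a rigorous expert annotation pipeline ensures data quality and authority, offering a template for domain-specific LLM evaluation.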

Abstract: The rapid advancement of large language models (LLMs) has enabled new possibilities for applying artificial intelligence within the legal domain. Nonetheless, the complexity, hierarchical organization, and frequent revisions of Vietnamese legislation pose considerable challenges for evaluating how well these models interpret and utilize legal knowledge. To address this gap, Vietnamese Legal Benchmark (VLegal-Bench) is introduced, the first comprehensive benchmark designed to systematically assess LLMs on Vietnamese legal tasks. Informed by Bloom’s cognitive taxonomy, VLegal-Bench encompasses multiple levels of legal understanding through tasks designed to reflect practical usage scenarios. The benchmark comprises 10,450 samples generated through a rigorous annotation pipeline, where legal experts label and cross-validate each instance using our annotation system to ensure every sample is grounded in authoritative legal documents and mirrors real-world legal assistant workflows, including general legal questions and answers, retrieval-augmented generation, multi-step reasoning, and scenario-based problem solving tailored to Vietnamese law. By providing a standardized, transparent, and cognitively informed evaluation framework, VLegal-Bench establishes a solid foundation for assessing LLM performance in Vietnamese legal contexts and supports the development of more reliable, interpretable, and ethically aligned AI-assisted legal systems.


[10] Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies cs.CL | cs.AI

Ekaterina Artemova, Laurie Burchell, Daryna Dementieva, Shu Okabe, Mariya Shmatova

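TL;DR: This tutorial is aimed at NLP practitioners, researchers, and developers working with multilingual and low-resource languages. It provides a practical toolkit for building inclusive language technologies, covering end-to-end NLP pipelines from data collection and web crawling to parallel sentence mining, machine translation, and downstream applications such as text classification and multimodal reasoning.

Details

Motivation: To address the challenges that data scarcity and cultural variance pose for low-resource languages, and so to promote more equitable and socially impactful language technologies.

Result: The tutorial presents diverse use cases covering more than 10 languages from different language families and geopolitical contexts, including both digitally resource-rich and severely underrepresented languages. It reports no benchmarks or quantitative results.

Insight: The tutorial offers fair, reproducible, and community-informed development approaches, together with hands-on strategies and modeling frameworks for dealing with data scarcity, emphasizing end-to-end pipelines and real-world scenarios.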

Abstract: This tutorial (https://tum-nlp.github.io/low-resource-tutorial) is designed for NLP practitioners, researchers, and developers working with multilingual and low-resource languages who seek to create more equitable and socially impactful language technologies. Participants will walk away with a practical toolkit for building end-to-end NLP pipelines for underrepresented languages – from data collection and web crawling to parallel sentence mining, machine translation, and downstream applications such as text classification and multimodal reasoning. The tutorial presents strategies for tackling the challenges of data scarcity and cultural variance, offering hands-on methods and modeling frameworks. We will focus on fair, reproducible, and community-informed development approaches, grounded in real-world scenarios. We will showcase a diverse set of use cases covering over 10 languages from different language families and geopolitical contexts, including both digitally resource-rich and severely underrepresented languages.


[11] JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction cs.CL | cs.AI | cs.CV

Atsuyuki Miyai, Shota Onohara, Jeonghun Baek, Kiyoharu Aizawa

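TL;DR: This paper presents JMMMU-Pro, an image-based Japanese multi-discipline multimodal understanding benchmark, and Vibe Benchmark Construction, a scalable construction method. By composing the question image and question text into a single image, it creates a benchmark that requires integrated visual-textual understanding.

Details

Motivation: To evaluate the Japanese capabilities of large multimodal models (LMMs) more rigorously and to provide a benchmark that requires integrated visual perception, the paper extends the existing JMMMU benchmark into JMMMU-Pro.

Result: All open-source LMMs struggle substantially on JMMMU-Pro, which underscores the benchmark's value for guiding future work in the open-source community.

Insight: The paper proposes Vibe Benchmark Construction, in which an image generation model (e.g., Nano Banana Pro) produces candidate visual questions and humans verify the outputs, regenerating with adjusted prompts when necessary to ensure quality. The method produces a high-quality benchmark at low cost and offers an efficient guideline for building future image-based visual question answering (VQA) benchmarks.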

Abstract: This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, thereby creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate with adjusted prompts to ensure quality. By leveraging Nano Banana Pro’s highly realistic image generation capabilities and its ability to embed clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range of background and layout designs. Experimental results show that all open-source LMMs struggle substantially with JMMMU-Pro, underscoring JMMMU-Pro as an important benchmark for guiding future efforts in the open-source community. We believe that JMMMU-Pro provides a more rigorous evaluation tool for assessing the Japanese capabilities of LMMs and that our Vibe Benchmark Construction also offers an efficient guideline for future development of image-based VQA benchmarks.


[12] MMGR: Multi-Modal Generative Reasoning cs.CL | cs.CV

Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou

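TL;DR: This paper proposes MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), an evaluation framework built on five reasoning abilities (physical, logical, 3D spatial, 2D spatial, and temporal) that assesses the reasoning of video and image generation models in three domains: abstract reasoning, embodied navigation, and physical commonsense.

Details

Motivation: Existing metrics for video generation models, such as FVD, emphasize perceptual quality and overlook reasoning failures involving causality, physics, and global consistency. A new evaluation framework is needed to measure systematically how reliable generative models are as world simulators.

Result: Benchmarking leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image) on MMGR reveals large performance gaps across domains: models show moderate success on physical commonsense tasks, reach below 10 percent accuracy on ARC-AGI in abstract reasoning, and struggle with long-horizon spatial planning in embodied navigation.

Insight: The paper proposes a systematic evaluation framework for multimodal generative reasoning that requires generated content to be holistically correct with respect to logical, physical, and spatial constraints, not just visually realistic. Its analysis identifies key limitations of current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness, and it offers a diagnostic benchmark and a path toward reasoning-aware generative world models.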

Abstract: Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing strong performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.


cs.CV

[13] Complex Mathematical Expression Recognition: Benchmark, Large-Scale Dataset and Strong Baseline cs.CV | cs.AI

Weikang Bai, Yongkun Du, Yuchen Su, Yazhen Xie, Zhineng Chen

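TL;DR: This paper addresses complex mathematical expression recognition (MER). It introduces the CMER-Bench benchmark, which shows that existing models degrade significantly on complex expressions. In response, the authors build the large-scale datasets MER-17M and CMER-3M, design a new representation called Structured Mathematical Language, and propose CMERNet, a model that significantly outperforms existing methods on CMER-Bench.

Details

Motivation: Existing MER methods perform well on simple expressions, but their performance degrades severely on complex expressions with many tokens and multiple lines, mainly because existing public training datasets consist primarily of simple samples and do not cover complex expressions well.

Result: On CMER-Bench, CMERNet, with only 125 million parameters, significantly outperforms existing MER models and multimodal large language models (MLLMs).

Insight: The contributions are: (1) CMER-Bench, a benchmark graded by difficulty (easy, moderate, complex); (2) the large-scale datasets MER-17M and CMER-3M, which emphasize complex expressions; (3) Structured Mathematical Language, a representation that explicitly models the hierarchical and spatial structure of expressions beyond LaTeX format; and (4) CMERNet, a specialized model built on an encoder-decoder architecture.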

Abstract: Mathematical Expression Recognition (MER) has made significant progress in recognizing simple expressions, but the robust recognition of complex mathematical expressions with many tokens and multiple lines remains a formidable challenge. In this paper, we first introduce CMER-Bench, a carefully constructed benchmark that categorizes expressions into three difficulty levels: easy, moderate, and complex. Leveraging CMER-Bench, we conduct a comprehensive evaluation of existing MER models and general-purpose multimodal large language models (MLLMs). The results reveal that while current methods perform well on easy and moderate expressions, their performance degrades significantly when handling complex mathematical expressions, mainly because existing public training datasets are primarily composed of simple samples. In response, we propose MER-17M and CMER-3M that are large-scale datasets emphasizing the recognition of complex mathematical expressions. The datasets provide rich and diverse samples to support the development of accurate and robust complex MER models. Furthermore, to address the challenges posed by the complicated spatial layout of complex expressions, we introduce a novel expression tokenizer, and a new representation called Structured Mathematical Language, which explicitly models the hierarchical and spatial structure of expressions beyond LaTeX format. Based on these, we propose a specialized model named CMERNet, built upon an encoder-decoder architecture and trained on CMER-3M. Experimental results show that CMERNet, with only 125 million parameters, significantly outperforms existing MER models and MLLMs on CMER-Bench.


[14] DL$^3$M: A Vision-to-Language Framework for Expert-Level Medical Reasoning through Deep Learning and Large Language Models cs.CV | cs.AI

Md. Najib Hasan, Imran Ahmad, Sourav Basak Shuvo, Md. Mahadi Hasan Ankon, Sunanda Das

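TL;DR: This paper proposes the DL$^3$M framework, which combines deep learning (DL) and large language models (LLMs) to generate clinical reasoning for endoscopic medical images. A purpose-built model, MobileCoAtNet, first classifies the images with high accuracy; its outputs then drive several LLMs to produce structured clinical reasoning covering causes, symptoms, treatment, and more. The study builds two expert-verified benchmarks to assess the quality of this reasoning and finds that, although strong classification improves explanation quality, current LLMs reason unstably and below human level, making them unreliable for high-stakes medical decisions.

Details

Motivation: Medical image classifiers can detect disease but do not explain their decisions, while LLMs can generate clinical text but struggle with visual reasoning and often produce unstable or incorrect explanations. This leaves a gap between what a model sees and the kind of reasoning a clinician expects.

Result: MobileCoAtNet achieves high accuracy on endoscopic image classification across eight stomach-related classes. Thirty-two LLMs are evaluated against the two expert-verified benchmarks: strong classification improves the quality of their explanations, but none reaches human-level stability, and even the best LLMs change their reasoning when prompts vary.

Insight: The paper proposes a framework that links visual classification to structured clinical reasoning and designs the hybrid model MobileCoAtNet for endoscopic images. Its systematic evaluation of LLMs on medical reasoning, backed by expert-verified benchmarks, gives a clearer view of the limits of current LLMs in medicine and a path toward safer reasoning systems.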

Abstract: Medical image classifiers detect gastrointestinal diseases well, but they do not explain their decisions. Large language models can generate clinical text, yet they struggle with visual reasoning and often produce unstable or incorrect explanations. This leaves a gap between what a model sees and the type of reasoning a clinician expects. We introduce a framework that links image classification with structured clinical reasoning. A new hybrid model, MobileCoAtNet, is designed for endoscopic images and achieves high accuracy across eight stomach-related classes. Its outputs are then used to drive reasoning by several LLMs. To judge this reasoning, we build two expert-verified benchmarks covering causes, symptoms, treatment, lifestyle, and follow-up care. Thirty-two LLMs are evaluated against these gold standards. Strong classification improves the quality of their explanations, but none of the models reach human-level stability. Even the best LLMs change their reasoning when prompts vary. Our study shows that combining DL with LLMs can produce useful clinical narratives, but current LLMs remain unreliable for high-stakes medical decisions. The framework provides a clearer view of their limits and a path for building safer reasoning systems. The complete source code and datasets used in this study are available at https://github.com/souravbasakshuvo/DL3M.


[15] Why Text Prevails: Vision May Undermine Multimodal Medical Decision Making cs.CV | cs.AI

Siyuan Dai, Lunxiao Li, Kun Zhao, Eardi Lila, Paul K. Crane

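TL;DR: This paper studies the limitations of multimodal large language models (MLLMs) on medical decision-making tasks and finds that text-only reasoning outperforms vision-only and vision-text settings, with multimodal inputs often performing worse than text alone. The authors run an empirical analysis on two datasets, three-stage Alzheimer's disease classification and MIMIC-CXR chest radiograph classification, and explore three mitigation strategies: in-context learning with reason-annotated exemplars, vision captioning followed by text-only inference, and few-shot fine-tuning of the vision tower.

Details

Motivation: Advanced MLLMs show strong zero-shot capabilities on general vision-language tasks, yet in the biomedical domain even state-of-the-art models struggle with basic medical decision-making tasks. The paper investigates why and looks for ways to improve.

Result: On the two challenging datasets (Alzheimer's disease classification and MIMIC-CXR chest radiograph classification), text-only reasoning consistently outperforms vision-only and vision-text settings, and multimodal inputs often perform worse than text alone.

Insight: The paper shows that current MLLMs lack grounded visual understanding. Its empirical comparison highlights the dominance of the text modality on these medical tasks, and it explores in-context learning, vision captioning, and vision-tower fine-tuning as mitigation strategies, pointing to directions for improving multimodal decision making in healthcare.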

Abstract: With the rapid progress of large language models (LLMs), advanced multimodal large language models (MLLMs) have demonstrated impressive zero-shot capabilities on vision-language tasks. In the biomedical domain, however, even state-of-the-art MLLMs struggle with basic Medical Decision Making (MDM) tasks. We investigate this limitation using two challenging datasets: (1) three-stage Alzheimer’s disease (AD) classification (normal, mild cognitive impairment, dementia), where category differences are visually subtle, and (2) MIMIC-CXR chest radiograph classification with 14 non-mutually exclusive conditions. Our empirical study shows that text-only reasoning consistently outperforms vision-only or vision-text settings, with multimodal inputs often performing worse than text alone. To mitigate this, we explore three strategies: (1) in-context learning with reason-annotated exemplars, (2) vision captioning followed by text-only inference, and (3) few-shot fine-tuning of the vision tower with classification supervision. These findings reveal that current MLLMs lack grounded visual understanding and point to promising directions for improving multimodal decision making in healthcare.


[16] STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning cs.CV | cs.AI

Jie Qin, Jiancheng Huang, Limeng Qiao, Lin Ma

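TL;DR: This paper proposes STAR (STacked AutoRegressive), a task-progressive scheme for unified multimodal learning that targets the optimization conflicts and performance trade-offs between understanding and generation in multimodal large language models (MLLMs). It decomposes multimodal learning into understanding, generation, and editing stages; freezes the parameters of the base autoregressive model and progressively stacks isomorphic modules to avoid cross-task interference; introduces a high-capacity VQ to increase the granularity of image representations; and uses an implicit reasoning mechanism to improve generation quality under complex conditions.

Details

Motivation: Current MLLMs face optimization conflicts and performance trade-offs when pursuing a unified objective for understanding and generation, making it hard to improve generation while preserving existing comprehension capabilities.

Result: STAR achieves state-of-the-art (SOTA) performance on GenEval (0.91), DPG-Bench (87.44), and ImgEdit (4.34), validating its effectiveness for unified multimodal learning.

Insight: The approach decomposes multimodal learning into progressive stages and uses parameter freezing with module stacking to avoid task interference. The high-capacity VQ and the implicit reasoning mechanism improve the granularity of generation and the handling of complex conditions, suggesting a new design direction for unified multimodal models.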

Abstract: Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, achieving unified target for multimodal understanding and generation remains challenging due to optimization conflicts and performance trade-offs. To effectively enhance generative performance while preserving existing comprehension capabilities, we introduce STAR: a STacked AutoRegressive scheme for task-progressive unified multimodal learning. This approach decomposes multimodal learning into multiple stages: understanding, generation, and editing. By freezing the parameters of the fundamental autoregressive (AR) model and progressively stacking isomorphic AR modules, it avoids cross-task interference while expanding the model’s capabilities. Concurrently, we introduce a high-capacity VQ to enhance the granularity of image representations and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (0.91), DPG-Bench (87.44), and ImgEdit (4.34), validating its efficacy for unified multimodal learning.


[17] Improvise, Adapt, Overcome – Telescopic Adapters for Efficient Fine-tuning of Vision Language Models in Medical Imaging cs.CV | cs.AI

Ujjwal Mishra, Vinita Shukla, Praful Hambarde, Amit Shukla

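TL;DR: This paper proposes Telescopic Adapters, a parameter-efficient fine-tuning (PEFT) framework for adapting vision-language segmentation models to medical imaging. Using depth-aware scaling, it adjusts adapter capacity in CLIPSeg's vision and text encoders, increasing it progressively from shallow to deep layers. With only 613k trainable parameters, it achieves superior performance on five medical imaging datasets.

Details

Motivation: Adapting vision-language segmentation models (VLSMs) to medical imaging with conventional fine-tuning incurs significant computational overhead, while existing PEFT methods use uniform adapter dimensions across all transformer layers, leading to suboptimal parameter allocation and reduced adaptation efficiency.

Result: Across five medical datasets spanning polyp segmentation, skin lesion detection, and breast ultrasound imaging, Telescopic Adapters achieve superior performance with only 613k trainable parameters, 244x fewer than end-to-end fine-tuning. Ablation studies show that deeper layers need substantially more adaptation capacity than shallow layers, supporting the telescopic scaling hypothesis.

Insight: The core idea is depth-aware telescopic adapters that scale adapter capacity with transformer layer depth and semantic relevance, giving a better parameter allocation. The authors position this as a new paradigm for efficient fine-tuning of medical VLSMs in resource-constrained clinical environments.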

Abstract: Adapting Vision Language Segmentation Models (VLSMs) to medical imaging domains requires significant computational overhead when using conventional fine-tuning approaches. Existing Parameter-Efficient Fine-Tuning (PEFT) methods apply uniform adapter dimensions across all transformer layers, leading to suboptimal parameter allocation and reduced adaptation efficiency. We introduce Telescopic Adapters, a novel PEFT framework that employs depth-aware scaling to progressively increase adapter capacity from shallow to deep transformer layers. Our method integrates lightweight bottleneck modules within CLIPSeg’s vision and text encoders, with adapter dimensions dynamically scaled based on layer depth and semantic relevance. Using only 613k trainable parameters (244x fewer than end-to-end fine-tuning), Telescopic Adapters achieve superior performance across five diverse medical datasets spanning polyp segmentation, skin lesion detection, and breast ultrasound imaging. Comprehensive ablation studies demonstrate that deeper layers require substantially more adaptation capacity than shallow layers, validating our telescopic scaling hypothesis. Our approach establishes a new paradigm for efficient medical VLSM fine-tuning, enabling deployment in resource-constrained clinical environments while maintaining competitive segmentation accuracy.
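Depth-aware scaling can be illustrated with a simple schedule for the adapter bottleneck width. The sketch below assumes a linear schedule and a standard two-projection bottleneck adapter with biases; both are assumptions for illustration, and the semantic-relevance component mentioned in the abstract is not modeled.

```python
def telescopic_dims(num_layers, min_dim, max_dim):
    """Bottleneck width per layer, growing linearly from shallow to deep.

    The linear schedule is an assumed example, not the paper's rule.
    """
    if num_layers == 1:
        return [max_dim]
    step = (max_dim - min_dim) / (num_layers - 1)
    return [round(min_dim + i * step) for i in range(num_layers)]


def adapter_params(d_model, bottleneck):
    """Parameter count of one bottleneck adapter: a down-projection and an
    up-projection, each with a bias vector."""
    down = d_model * bottleneck + bottleneck
    up = bottleneck * d_model + d_model
    return down + up


dims = telescopic_dims(num_layers=4, min_dim=4, max_dim=16)
telescopic_total = sum(adapter_params(8, r) for r in dims)
uniform_total = sum(adapter_params(8, 16) for _ in dims)
```

Comparing `telescopic_total` with `uniform_total` shows the intended trade: shallow layers get small adapters, so the overall budget is lower than giving every layer the largest width.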


[18] SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning cs.CVPDF

Jitesh Jain, Jialuo Li, Zixian Ma, Jieyu Zhang, Chris Dongjoo Kim

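TL;DR: This paper proposes SAGE, an agent system that performs multi-turn reasoning over long videos. It mimics the human ability to handle videos of different durations flexibly, using an orchestrator called SAGE-MM to decide when to skim and when to watch in detail. The authors also build a synthetic data generation pipeline based on Gemini-2.5-Flash to train the orchestrator and propose an effective reinforcement learning post-training recipe. For evaluation, they curate SAGE-Bench, a benchmark whose videos average more than 700 seconds. Experiments show notable gains on open-ended video reasoning tasks.

Details

Motivation: Current state-of-the-art video reasoning models typically process a large number of frames in a single turn (akin to watching an entire long video), which is resource-intensive and inflexible. The paper aims to reproduce the human "any-horizon" reasoning ability, that is, deciding per task whether to skim a long video or watch a short one in full, and thereby build a performant reasoning system that handles videos of different durations.

Result: On the authors' SAGE-Bench (average duration > 700 seconds), SAGE improves open-ended video reasoning by up to 6.1%, and by 8.2% on videos longer than 10 minutes.

Insight: The main contributions are: 1) an agent architecture (SAGE) that mimics human any-horizon reasoning and switches flexibly between multi-turn and single-turn reasoning; 2) an efficient synthetic data generation pipeline built on a large language model (Gemini-2.5-Flash) to train the core orchestrator (SAGE-MM); 3) an effective reinforcement learning post-training recipe, which is essential for instilling any-horizon reasoning; 4) a dedicated long-video reasoning benchmark (SAGE-Bench) for real-world entertainment use cases. Viewed objectively, combining reinforcement learning with LLM-driven data generation to train a meta-controller that decides how much to watch is an effective and novel way to reduce the resource cost of long-video processing.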

Abstract: As humans, we are natural any-horizon reasoners, i.e., we can decide whether to iteratively skim long videos or watch short ones in full when necessary for a given task. With this in mind, one would expect video reasoning models to reason flexibly across different durations. However, SOTA models are still trained to predict answers in a single turn while processing a large number of frames, akin to watching an entire long video, requiring significant resources. This raises the question: Is it possible to develop performant any-horizon video reasoning systems? Inspired by human behavior, we first propose SAGE, an agent system that performs multi-turn reasoning on long videos while handling simpler problems in a single turn. Secondly, we introduce an easy synthetic data generation pipeline using Gemini-2.5-Flash to train the orchestrator, SAGE-MM, which lies at the core of SAGE. We further propose an effective RL post-training recipe essential for instilling any-horizon reasoning ability in SAGE-MM. Thirdly, we curate SAGE-Bench with an average duration of greater than 700 seconds for evaluating video reasoning ability in real-world entertainment use cases. Lastly, we empirically validate the effectiveness of our system, data, and RL recipe, observing notable improvements of up to 6.1% on open-ended video reasoning tasks, as well as an impressive 8.2% improvement on videos longer than 10 minutes.


[19] From Unlearning to UNBRANDING: A Benchmark for Trademark-Safe Text-to-Image Generation cs.CVPDF

Dawid Malarz, Artur Kasymov, Filip Manjak, Maciej Zięba, Przemysław Spurek

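TL;DR: This paper introduces "unbranding", a new task that aims at fine-grained removal of trademarked content and subtle structural brand features from text-to-image generation models while preserving semantic coherence. To support it, the authors build a comprehensive benchmark dataset and introduce a new evaluation metric based on vision-language models that detects both explicit and implicit brand features. They find that brand identifiers are generated more readily as model fidelity improves, which confirms that unbranding is a distinct and practically relevant problem.

Details

Motivation: To address unauthorized reproduction of trademarked content by text-to-image diffusion models, in particular the multi-dimensional brand identifiers that go beyond explicit logos (such as structural design) and that existing methods fail to handle.

Result: The results, validated with the proposed VLM metric, confirm that unbranding is a distinct, practically relevant problem that requires specialized techniques.

Insight: The contributions are the definition of the fine-grained unbranding task together with a matching benchmark dataset, and a new evaluation metric built on a VLM question-answering framework that probes for both explicit logos and implicit, holistic brand characteristics, filling a gap left by existing brand detectors.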

Abstract: The rapid progress of text-to-image diffusion models raises significant concerns regarding the unauthorized reproduction of trademarked content. While prior work targets general concepts (e.g., styles, celebrities), it fails to address specific brand identifiers. Crucially, we note that brand recognition is multi-dimensional, extending beyond explicit logos to encompass distinctive structural features (e.g., a car’s front grille). To tackle this, we introduce unbranding, a novel task for the fine-grained removal of both trademarks and subtle structural brand features, while preserving semantic coherence. To facilitate research, we construct a comprehensive benchmark dataset. Recognizing that existing brand detectors are limited to logos and fail to capture abstract trade dress (e.g., the shape of a Coca-Cola bottle), we introduce a novel evaluation metric based on Vision Language Models (VLMs). This VLM-based metric uses a question-answering framework to probe images for both explicit logos and implicit, holistic brand characteristics. Furthermore, we observe that as model fidelity increases, with newer systems (SDXL, FLUX) synthesizing brand identifiers more readily than older models (Stable Diffusion), the urgency of the unbranding challenge is starkly highlighted. Our results, validated by our VLM metric, confirm unbranding is a distinct, practically relevant problem requiring specialized techniques. Project Page: https://gmum.github.io/UNBRANDING/.


[20] Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models cs.CVPDF

Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei

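TL;DR: This paper proposes Sparse-LaViDa, a framework that speeds up sampling in masked discrete diffusion models (MDMs) by dynamically truncating unnecessary masked tokens at each inference step. It introduces specialized register tokens to preserve generation quality and uses a matching attention mask during training to keep training and inference consistent.

Details

Motivation: To address the slow inference of MDMs, which results from repeatedly processing redundant masked tokens at every sampling step.

Result: Built on the state-of-the-art model LaViDa-O, Sparse-LaViDa achieves up to a 2x speedup on tasks including text-to-image generation, image editing, and mathematical reasoning while maintaining generation quality.

Insight: The contributions are dynamic truncation of redundant tokens to speed up inference, plus register tokens and a matching attention mask to preserve quality and training-inference consistency. It is an efficient inference optimization strategy that could generalize to other mask-based generative models.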

Abstract: Masked Discrete Diffusion Models (MDMs) have achieved strong performance across a wide range of multimodal tasks, including image understanding, generation, and editing. However, their inference speed remains suboptimal due to the need to repeatedly process redundant masked tokens at every sampling step. In this work, we propose Sparse-LaViDa, a novel modeling framework that dynamically truncates unnecessary masked tokens at each inference step to accelerate MDM sampling. To preserve generation quality, we introduce specialized register tokens that serve as compact representations for the truncated tokens. Furthermore, to ensure consistency between training and inference, we design a specialized attention mask that faithfully matches the truncated sampling procedure during training. Built upon the state-of-the-art unified MDM LaViDa-O, Sparse-LaViDa achieves up to a 2x speedup across diverse tasks including text-to-image generation, image editing, and mathematical reasoning, while maintaining generation quality.


[21] KFS-Bench: Comprehensive Evaluation of Key Frame Sampling in Long Video Understanding cs.CV | cs.AIPDF

Zongyao Li, Kengo Ishida, Satoshi Yamazaki, Xiaotong Ji, Jianquan Liu

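TL;DR: This paper presents KFS-Bench, the first benchmark for key frame sampling in long video question answering, with multi-scene annotations that enable direct and robust evaluation of sampling strategies. Using the benchmark, the authors conduct a comprehensive study of key frame sampling methods, find that sampling precision, scene coverage, and sampling balance are the key factors affecting QA performance, and design a new sampling quality metric accordingly. They also develop a new key frame sampling method that uses question-video relevance to balance sampling diversity against question-frame similarity, improving coverage of relevant scenes.

Details

Motivation: Key frame sampling is crucial for efficient long video understanding, but prior work assesses frame selection quality only indirectly through QA accuracy, which is limiting. By providing ground-truth annotations of the multiple disjoint scenes required per question, KFS-Bench makes it possible to analyze directly how different sampling methods capture the essential content across an entire long video.

Result: On KFS-Bench, the proposed adaptively balanced sampling method achieves superior performance in both key frame sampling and QA.

Insight: The contributions are: 1) the first long-video QA key frame sampling benchmark with multi-scene annotations that supports direct evaluation; 2) identification of the sampling factors that drive QA performance (precision, coverage, balance) and a sampling quality metric built on them; 3) an adaptively balanced sampling method based on question-video relevance that trades off diversity and relevance to improve scene coverage.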

Abstract: We propose KFS-Bench, the first benchmark for key frame sampling in long video question answering (QA), featuring multi-scene annotations to enable direct and robust evaluation of sampling strategies. Key frame sampling is crucial for efficient long-form video understanding. In long video QA, selecting informative frames enables multimodal large language models (MLLMs) to improve both accuracy and efficiency. KFS-Bench addresses the limitation of prior works that only indirectly assess frame selection quality via QA accuracy. By providing ground-truth annotations of multiple disjoint scenes required per question, KFS-Bench allows us to directly analyze how different sampling approaches capture essential content across an entire long video. Using KFS-Bench, we conduct a comprehensive study of key frame sampling methods and identify that not only sampling precision but also scene coverage and sampling balance are the key factors influencing QA performance. Regarding all the factors, we design a novel sampling quality metric that correlates with QA accuracy. Furthermore, we develop a novel key frame sampling method that leverages question-video relevance to balance sampling diversity against question-frame similarity, thereby improving coverage of relevant scenes. Our adaptively balanced sampling approach achieves superior performance in both key frame sampling and QA performance. The benchmark is available at https://github.com/NEC-VID/KFS-Bench.
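The three factors named in the abstract (sampling precision, scene coverage, and sampling balance) can be made concrete with a small scoring sketch. The definitions below are assumptions chosen for illustration, since the abstract does not define the benchmark's metric: precision is the share of sampled frames inside any annotated scene, coverage is the share of scenes hit at least once, and balance is the normalized entropy of hits across scenes. The function name and the example frame indices are hypothetical.

```python
import math


def sampling_quality(frames, scenes):
    """Score sampled frame indices against disjoint (start, end) scenes.

    Returns (precision, coverage, balance) under assumed definitions,
    not the benchmark's official metric.
    """
    # Number of sampled frames that land inside each annotated scene.
    hits = [sum(1 for f in frames if s <= f <= e) for s, e in scenes]
    in_scene = sum(1 for f in frames if any(s <= f <= e for s, e in scenes))

    precision = in_scene / len(frames) if frames else 0.0
    coverage = sum(1 for h in hits if h > 0) / len(scenes) if scenes else 0.0

    total = sum(hits)
    if total == 0:
        balance = 0.0
    elif len(scenes) < 2:
        balance = 1.0
    else:
        # Normalized entropy: 1.0 when hits are spread evenly over scenes.
        probs = [h / total for h in hits if h > 0]
        entropy = -sum(p * math.log(p) for p in probs)
        balance = entropy / math.log(len(scenes))
    return precision, coverage, balance


scenes = [(10, 20), (100, 120), (300, 310)]  # hypothetical annotations
frames = [12, 15, 105, 200]                  # hypothetical sampled frames
precision, coverage, balance = sampling_quality(frames, scenes)
print(precision, coverage, balance)
```

In this toy case three of four frames are relevant, but one required scene is missed entirely, which is the kind of failure that QA accuracy alone would not expose.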


[22] Deep Learning Perspective of Scene Understanding in Autonomous Robots cs.CVPDF

Afia Maham, Dur E Nayab Tashfa

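TL;DR: This paper reviews deep learning for scene understanding in autonomous robots, covering innovations in object detection, semantic and instance segmentation, depth estimation, 3D reconstruction, and visual SLAM. It emphasizes how these techniques overcome the limitations of traditional geometric models, improve real-time depth perception under occlusion and on textureless surfaces, and strengthen semantic reasoning for better understanding of the environment. When these perception modules are integrated in dynamic and unstructured environments, they become more effective for decision-making, navigation, and interaction. Finally, the review outlines open problems and research directions for advancing learning-based scene understanding in autonomous robots.

Details

Motivation: To address the limitations of traditional geometric models in scene understanding for autonomous robots, especially occlusion, textureless surfaces, and real-time depth perception, and to strengthen semantic reasoning for dynamic and unstructured environments.

Result: As a review, the paper reports no specific quantitative experiments or benchmarks. It summarizes the progress of deep learning techniques on tasks such as object detection, segmentation, depth estimation, and visual SLAM, where these techniques generally reach or approach state-of-the-art performance.

Insight: The contribution lies in integrating deep learning modules (such as object detection, semantic segmentation, and visual SLAM) into autonomous robot systems to improve perception and decision-making in complex environments. Viewed objectively, the paper stresses the importance of multimodal perception fusion and real-time processing and points to directions for future research, such as data scarcity and model generalization.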

Abstract: This paper provides a review of deep learning applications in scene understanding in autonomous robots, including innovations in object detection, semantic and instance segmentation, depth estimation, 3D reconstruction, and visual SLAM. It emphasizes how these techniques address limitations of traditional geometric models, improve depth perception in real time despite occlusions and textureless surfaces, and enhance semantic reasoning to understand the environment better. When these perception modules are integrated into dynamic and unstructured environments, they become more effective in decision-making, navigation and interaction. Lastly, the review outlines the existing problems and research directions to advance learning-based scene understanding of autonomous robots.


[23] Unleashing the Power of Image-Tabular Self-Supervised Learning via Breaking Cross-Tabular Barriers cs.CVPDF

Yibing Fu, Yunpeng Zhao, Zhitao Zeng, Cheng Chen, Yueming Jin

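TL;DR: This paper proposes CITab, a new self-supervised learning framework that breaks the barriers between tabular datasets and learns multimodal representations from images and tabular data. It designs the tabular modeling mechanism from a semantic-awareness perspective, using column headers as semantic cues, and proposes a prototype-guided mixture-of-linear layer module to handle the heterogeneity of tabular data.

Details

Motivation: Existing self-supervised methods for image-tabular data are usually confined to specific data cohorts, mainly because their rigid tabular modeling mechanisms create cross-tabular barriers when handling heterogeneous tables, which prevents learning medical knowledge that transfers across cohorts.

Result: A comprehensive evaluation on three public Alzheimer's disease diagnosis cohorts with 4,461 subjects shows that CITab outperforms existing state-of-the-art methods on the diagnosis task.

Insight: The contributions are a semantic-aware tabular modeling mechanism that integrates column headers and a prototype-guided mixture-of-linear layer module. These designs support transferable knowledge learning and scalable pretraining on multiple data sources, handle the heterogeneity of tabular data effectively, and explore the underlying medical concepts.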

Abstract: Multi-modal learning integrating medical images and tabular data has significantly advanced clinical decision-making in recent years. Self-Supervised Learning (SSL) has emerged as a powerful paradigm for pretraining these models on large-scale unlabeled image-tabular data, aiming to learn discriminative representations. However, existing SSL methods for image-tabular representation learning are often confined to specific data cohorts, mainly due to their rigid tabular modeling mechanisms when modeling heterogeneous tabular data. This inter-tabular barrier hinders multi-modal SSL methods from effectively learning transferable medical knowledge shared across diverse cohorts. In this paper, we propose a novel SSL framework, namely CITab, designed to learn powerful multi-modal feature representations in a cross-tabular manner. We design the tabular modeling mechanism from a semantic-awareness perspective by integrating column headers as semantic cues, which facilitates transferable knowledge learning and scalability in utilizing multiple data sources for pretraining. Additionally, we propose a prototype-guided mixture-of-linear layer (P-MoLin) module for tabular feature specialization, empowering the model to effectively handle the heterogeneity of tabular data and explore the underlying medical concepts. We conduct comprehensive evaluations on the Alzheimer’s disease diagnosis task across three publicly available data cohorts containing 4,461 subjects. Experimental results demonstrate that CITab outperforms state-of-the-art approaches, paving the way for effective and scalable cross-tabular multi-modal learning.


[24] ChartAgent: A Chart Understanding Framework with Tool Integrated Reasoning cs.CV | cs.LGPDF

Boran Wang, Xinming Wang, Yi Chen, Xiang Li, Jian Xu

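TL;DR: This paper proposes ChartAgent, a chart understanding framework based on Tool-Integrated Reasoning (TIR). It decomposes complex chart analysis into a sequence of observable, replayable steps and dynamically calls a modular tool library, with core tools such as key-element detection, instance segmentation, and OCR, to parse diverse charts systematically. By producing a structured Evidence Package, it provides traceable and reproducible support for its conclusions and moves beyond the black-box paradigm.

Details

Motivation: Existing multimodal large language models (MLLMs) have made notable progress in automated chart understanding, but they depend heavily on explicit textual annotations and degrade markedly when key numerals are missing. ChartAgent targets this limitation and aims to improve robustness under sparse annotation.

Result: Experiments show that ChartAgent substantially improves robustness under sparse annotation settings, offering a practical path toward trustworthy and extensible chart understanding systems.

Insight: The main contribution is the Tool-Integrated Reasoning (TIR) framework, which breaks a human-like cognitive process into standardized steps and orchestrates an extensible tool library dynamically. Its key advantage is that the structured Evidence Package makes the reasoning transparent, traceable, and reproducible, which improves trustworthiness and performance when annotations are missing. This suggests a new direction for interpretable, modular multimodal understanding systems.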

Abstract: With their high information density and intuitive readability, charts have become the de facto medium for data analysis and communication across disciplines. Recent multimodal large language models (MLLMs) have made notable progress in automated chart understanding, yet they remain heavily dependent on explicit textual annotations, and their performance degrades markedly when key numerals are absent. To address this limitation, we introduce ChartAgent, a chart understanding framework grounded in Tool-Integrated Reasoning (TIR). Inspired by human cognition, ChartAgent decomposes complex chart analysis into a sequence of observable, replayable steps. Supporting this architecture is an extensible, modular tool library comprising more than a dozen core tools, such as key-element detection, instance segmentation, and optical character recognition (OCR), which the agent dynamically orchestrates to achieve systematic visual parsing across diverse chart types. Leveraging TIR’s transparency and verifiability, ChartAgent moves beyond the black-box paradigm by standardizing and consolidating intermediate outputs into a structured Evidence Package, providing traceable and reproducible support for final conclusions. Experiments show that ChartAgent substantially improves robustness under sparse annotation settings, offering a practical path toward trustworthy and extensible systems for chart understanding.


[25] OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving cs.CV | cs.AIPDF

Zhenguo Zhang, Haohan Zhen, Yishen Wang, Le Xu, Tianchen Deng

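TL;DR: This paper proposes OmniDrive-R1, an end-to-end vision-language model framework for autonomous driving. It unifies perception and reasoning through an interleaved multi-modal chain-of-thought mechanism. Its core innovation is a reinforcement-driven visual grounding capability that lets the model focus on critical regions on its own for fine-grained analysis, which mitigates the object hallucination seen in conventional methods.

Details

Motivation: When deployed in safety-critical domains such as autonomous driving, existing vision-language models suffer from reliability problems (such as object hallucination) because they rely on ungrounded, text-based chain-of-thought reasoning. Existing multi-modal chain-of-thought methods have two flaws: decoupled perception and reasoning stages, and reliance on expensive, dense localization labels.

Result: Experiments on the DriveLMM-o1 benchmark show that, compared with the Qwen2.5VL-7B baseline, OmniDrive-R1 raises the overall reasoning score from 51.77% to 80.35% and the final answer accuracy from 37.81% to 73.62%.

Insight: The main contributions are: 1) end-to-end joint optimization of perception and reasoning through an interleaved multi-modal chain of thought; 2) a reinforcement-driven visual grounding capability that needs no dense annotations; 3) the Clip-GRPO algorithm, which uses an annotation-free, process-based grounding reward and ensures stability by enforcing real-time cross-modal consistency between the visual focus and the textual reasoning.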

Abstract: The deployment of Vision-Language Models (VLMs) in safety-critical domains like autonomous driving (AD) is critically hindered by reliability failures, most notably object hallucination. This failure stems from their reliance on ungrounded, text-based Chain-of-Thought (CoT) reasoning. While existing multi-modal CoT approaches attempt mitigation, they suffer from two fundamental flaws: (1) decoupled perception and reasoning stages that prevent end-to-end joint optimization, and (2) reliance on expensive, dense localization labels. Thus we introduce OmniDrive-R1, an end-to-end VLM framework designed for autonomous driving, which unifies perception and reasoning through an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Our core innovation is a reinforcement-driven visual grounding capability, enabling the model to autonomously direct its attention and “zoom in” on critical regions for fine-grained analysis. This capability is enabled by our pure two-stage reinforcement learning training pipeline and Clip-GRPO algorithm. Crucially, Clip-GRPO introduces an annotation-free, process-based grounding reward. This reward not only eliminates the need for dense labels but also circumvents the instability of external tool calls by enforcing real-time cross-modal consistency between the visual focus and the textual reasoning. Extensive experiments on DriveLMM-o1 demonstrate our model’s significant improvements. Compared to the baseline Qwen2.5VL-7B, OmniDrive-R1 improves the overall reasoning score from 51.77% to 80.35%, and the final answer accuracy from 37.81% to 73.62%.


[26] SELECT: Detecting Label Errors in Real-world Scene Text Data cs.CVPDF

Wenjun Liu, Qian Wu, Yifeng Hu, Yuke Li

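TL;DR: This paper proposes SELECT, a new method that uses multi-modal training to detect label errors in real-world scene text datasets. It uses an image-text encoder and a character-level tokenizer to handle variable-length sequence labels, label sequence misalignment, and character-level errors, and introduces SSLC (Similarity-based Sequence Label Corruption) to mimic real-world error scenarios during training. Experiments demonstrate that SELECT is effective at detecting label errors and at improving scene text recognition (STR) accuracy on real-world text datasets.

Details

Motivation: To address label errors in real-world scene text datasets, in particular the challenges of variable-length sequence labels, label sequence misalignment, and character-level errors.

Result: Experiments show that SELECT outperforms existing methods at detecting label errors and improves scene text recognition (STR) accuracy on real-world text datasets, demonstrating its practical utility.

Insight: The contributions are the first successful detection of variable-length label errors in real-world scene text datasets, and the SSLC process, which takes the visual similarity between characters into account when corrupting labels and so mimics real error scenarios more faithfully.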

Abstract: We introduce SELECT (Scene tExt Label Errors deteCTion), a novel approach that leverages multi-modal training to detect label errors in real-world scene text datasets. Utilizing an image-text encoder and a character-level tokenizer, SELECT addresses the issues of variable-length sequence labels, label sequence misalignment, and character-level errors, outperforming existing methods in accuracy and practical utility. In addition, we introduce Similarity-based Sequence Label Corruption (SSLC), a process that intentionally introduces errors into the training labels to mimic real-world error scenarios during training. SSLC not only can cause a change in the sequence length but also takes into account the visual similarity between characters during corruption. Our method is the first to detect label errors in real-world scene text datasets successfully accounting for variable-length labels. Experimental results demonstrate the effectiveness of SELECT in detecting label errors and improving STR accuracy on real-world text datasets, showcasing its practical utility.


[27] HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices cs.CV | cs.CLPDF

HyperAI Team, Yuchen Liu, Kaiyang Han, Zhiqiang Xia, Yuhang Dong

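TL;DR: This paper proposes HyperVL, an efficient multimodal large language model designed for edge devices. It caps peak memory usage with an image-tiling strategy and introduces a Visual Resolution Compressor (VRC) that adaptively predicts the optimal encoding resolution to reduce redundant computation, together with Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders so that visual branches can be switched dynamically under a shared LLM.

Details

Motivation: Large multimodal language models have high compute and memory requirements that make direct on-device deployment difficult. Small-parameter models are becoming more capable, but standard ViT encoders remain a bottleneck, with high latency and memory consumption on high-resolution inputs.

Result: HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks, and it significantly reduces latency and power consumption on real mobile devices.

Insight: The contributions are the VRC for adaptive resolution prediction and DCL for aligning multi-scale encoders within a unified framework, which together enable dynamic switching between visual branches and provide an efficient, practical solution for on-device multimodal inference.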

Abstract: Current multimodal large language models possess strong perceptual and reasoning capabilities; however, high computational and memory requirements make them difficult to deploy directly in on-device environments. While small-parameter models are progressively endowed with strong general capabilities, standard Vision Transformer (ViT) encoders remain a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution inputs. To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM. Extensive experiments demonstrate that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Furthermore, it significantly reduces latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.


[28] Real-time prediction of workplane illuminance distribution for daylight-linked controls using non-intrusive multimodal deep learning cs.CV | cs.AIPDF

Zulin Zhuang, Yu Bian

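TL;DR: This study proposes a non-intrusive multimodal deep learning framework that predicts indoor workplane illuminance distributions in real time to support daylight-linked controls (DLCs). By extracting image features only from the side-lit window areas, the method avoids interference from dynamically occupied indoor space. A field experiment in Guangzhou collected 17,344 samples for training and validation.

Details

Motivation: Daylight-linked controls (DLCs) have great potential for building energy savings, but most existing indoor daylight prediction studies target static scenes and do not carry over to dynamically occupied indoor spaces, so a real-time, non-intrusive prediction method is needed.

Result: The model reaches R² > 0.98 and RMSE < 0.14 on the same-distribution test set, and R² > 0.82 and RMSE < 0.17 on an unseen-day test set, indicating high accuracy and acceptable temporal generalization.

Insight: The contribution is to extract temporal-spatial features only from the side-lit window areas (rather than from all interior pixels), which keeps the method applicable in dynamically occupied spaces. Viewed objectively, this non-intrusive multimodal framework offers a scalable solution for real-time daylight prediction.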

Abstract: Daylight-linked controls (DLCs) have significant potential for energy savings in buildings, especially when abundant daylight is available and indoor workplane illuminance can be accurately predicted in real time. Most existing studies on indoor daylight predictions were developed and tested for static scenes. This study proposes a multimodal deep learning framework that predicts indoor workplane illuminance distributions in real time from non-intrusive images with temporal-spatial features. By extracting image features only from the side-lit window areas rather than interior pixels, the approach remains applicable in dynamically occupied indoor spaces. A field experiment was conducted in a test room in Guangzhou (China), where 17,344 samples were collected for model training and validation. The model achieved R² > 0.98 with RMSE < 0.14 on the same-distribution test set and R² > 0.82 with RMSE < 0.17 on an unseen-day test set, indicating high accuracy and acceptable temporal generalization.
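For readers less familiar with the two reported metrics, the sketch below computes the coefficient of determination and the root mean squared error with their standard definitions. The sample values are made up for illustration and are not data from the study; the abstract also does not state the unit or normalization of the reported RMSE, so no unit is assumed here.

```python
import math


def r2_and_rmse(y_true, y_pred):
    """Return (R^2, RMSE) for paired measured and predicted values."""
    n = len(y_true)
    mean = sum(y_true) / n
    # Residual sum of squares: error left after prediction.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the measurements around their mean.
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot, math.sqrt(ss_res / n)


# Hypothetical values, chosen only to make the arithmetic easy to follow.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2, rmse = r2_and_rmse(measured, predicted)
print(round(r2, 4), round(rmse, 4))
```

R² measures the share of variance explained, so the drop from above 0.98 to above 0.82 on unseen days means noticeably more unexplained variance even though the RMSE bound rises only from 0.14 to 0.17.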


[29] SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding cs.CV | cs.AIPDF

Shuang Cheng, Yuhua Jiang, Zineng Zhou, Dawei Liu, Wang Tao

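TL;DR: SDAR-VL is the first framework to apply block-wise discrete diffusion systematically to large-scale vision-language understanding (VLU). By integrating three components (asynchronous block-wise noise scheduling, effective mask ratio scaling, and a progressive Beta noise curriculum), it addresses the high training cost, slow convergence, and instability of conventional block diffusion and significantly improves training efficiency, convergence stability, and task performance.

Details

Motivation: Block-wise discrete diffusion offers a good balance between parallel generation and causal dependency modeling, but its practical use has been limited by high training cost, slow convergence, and instability, leaving it behind strong autoregressive baselines. The paper aims to make block-wise diffusion a practical backbone for VLU through an efficient and stable training framework.

Result: Across 21 single-image, multi-image, and video benchmarks, SDAR-VL consistently improves training efficiency, convergence stability, and task performance over conventional block diffusion. On this evaluation suite, SDAR-VL sets a new state of the art among diffusion-based vision-language models and, under matched settings, matches or surpasses strong autoregressive baselines such as LLaVA-OneVision as well as the global diffusion baseline LLaDA-V.

Insight: The contributions are asynchronous block-wise noise scheduling to diversify supervision within each batch, effective mask ratio scaling for unbiased loss normalization under stochastic masking, and a progressive Beta noise curriculum that increases effective mask coverage while preserving corruption diversity. Viewed objectively, by systematically optimizing the training process, the framework turns block-wise diffusion from a theoretical advantage into a usable VLU backbone and suggests new directions for diffusion models on complex multimodal tasks.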

Abstract: Block-wise discrete diffusion offers an attractive balance between parallel generation and causal dependency modeling, making it a promising backbone for vision-language modeling. However, its practical adoption has been limited by high training cost, slow convergence, and instability, which have so far kept it behind strong autoregressive (AR) baselines. We present SDAR-VL, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding (VLU), together with an integrated framework for efficient and stable training. This framework unifies three components: (1) Asynchronous Block-wise Noise Scheduling to diversify supervision within each batch; (2) Effective Mask Ratio Scaling for unbiased loss normalization under stochastic masking; and (3) a Progressive Beta Noise Curriculum that increases effective mask coverage while preserving corruption diversity. Experiments on 21 single-image, multi-image, and video benchmarks show that SDAR-VL consistently improves training efficiency, convergence stability, and task performance over conventional block diffusion. On this evaluation suite, SDAR-VL sets a new state of the art among diffusion-based vision-language models and, under matched settings, matches or surpasses strong AR baselines such as LLaVA-OneVision as well as the global diffusion baseline LLaDA-V, establishing block-wise diffusion as a practical backbone for VLU.


[30] ProtoFlow: Interpretable and Robust Surgical Workflow Modeling with Learned Dynamic Scene Graph Prototypes cs.CV | cs.AIPDF

Felix Holm, Ghazal Ghazaei, Nassir Navab

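TL;DR: This paper proposes ProtoFlow, a framework that models complex surgical workflows by learning dynamic scene graph prototypes, giving interpretable and robust recognition. The method combines self-supervised pretraining with prototype-based fine-tuning, outperforms standard GNN baselines on the CAT-SG dataset, and performs well in few-shot settings.

Details

Motivation: To address high annotation cost, data scarcity, and limited model interpretability in surgical recognition by exploiting the structured representation of surgical events that scene graphs provide.

Result: On the CAT-SG dataset, overall accuracy exceeds that of GNN baselines, and the method stays robust in few-shot settings (for example, when trained on a single surgical video). The learned prototypes identify surgical sub-techniques and give interpretable analyses of workflow deviations.

Insight: The contribution is the combination of dynamic scene graph prototypes with self-supervised learning for interpretable surgical workflow modeling. The prototype learning mechanism suggests new directions for clinical anomaly detection and few-shot learning.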

Abstract: Purpose: Detailed surgical recognition is critical for advancing AI-assisted surgery, yet progress is hampered by high annotation costs, data scarcity, and a lack of interpretable models. While scene graphs offer a structured abstraction of surgical events, their full potential remains untapped. In this work, we introduce ProtoFlow, a novel framework that learns dynamic scene graph prototypes to model complex surgical workflows in an interpretable and robust manner. Methods: ProtoFlow leverages a graph neural network (GNN) encoder-decoder architecture that combines self-supervised pretraining for rich representation learning with a prototype-based fine-tuning stage. This process discovers and refines core prototypes that encapsulate recurring, clinically meaningful patterns of surgical interaction, forming an explainable foundation for workflow analysis. Results: We evaluate our approach on the fine-grained CAT-SG dataset. ProtoFlow not only outperforms standard GNN baselines in overall accuracy but also demonstrates exceptional robustness in limited-data, few-shot scenarios, maintaining strong performance when trained on as few as one surgical video. Our qualitative analyses further show that the learned prototypes successfully identify distinct surgical sub-techniques and provide clear, interpretable insights into workflow deviations and rare complications. Conclusion: By uniting robust representation learning with inherent explainability, ProtoFlow represents a significant step toward developing more transparent, reliable, and data-efficient AI systems, accelerating their potential for clinical adoption in surgical training, real-time decision support, and workflow optimization.


[31] Quality-Aware Framework for Video-Derived Respiratory Signals cs.CV | eess.SPPDF

Nhi Nguyen, Constantino Álvarez Casado, Le Nguyen, Manuel Lage Cañellas, Miguel Bordallo López

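TL;DR: This paper proposes a quality-aware predictive framework for estimating respiratory rate from video. It integrates ten heterogeneous signals extracted from facial remote photoplethysmography, upper-body motion, and deep learning pipelines, and analyzes them with four spectral estimators. Machine learning models are trained to predict the accuracy of signal segments or to select the most reliable signal, enabling adaptive signal fusion and quality-based segment filtering.

Details

Motivation: Video-based respiratory rate estimation is often unreliable because signal quality varies across extraction methods. The paper aims to address these quality fluctuations and make respiratory monitoring more robust with a framework that integrates multiple signal sources and assesses their reliability dynamically.

Result: Experiments on three public datasets (OMuSense-23, COHFACE, MAHNOB-HCI) show that the framework achieves lower respiratory rate estimation errors than individual methods in most cases, with gains depending on dataset characteristics.

Insight: The contribution is a quality-driven predictive modeling framework that uses machine learning to assess and fuse multi-source signals dynamically instead of relying on a single method. This suggests a path toward scalable and generalizable video-based respiratory monitoring, and its adaptive signal selection and fusion strategy is worth borrowing.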

Abstract: Video-based respiratory rate (RR) estimation is often unreliable due to inconsistent signal quality across extraction methods. We present a predictive, quality-aware framework that integrates heterogeneous signal sources with dynamic assessment of reliability. Ten signals are extracted from facial remote photoplethysmography (rPPG), upper-body motion, and deep learning pipelines, and analyzed using four spectral estimators: Welch’s method, Multiple Signal Classification (MUSIC), Fast Fourier Transform (FFT), and peak detection. Segment-level quality indices are then used to train machine learning models that predict accuracy or select the most reliable signal. This enables adaptive signal fusion and quality-based segment filtering. Experiments on three public datasets (OMuSense-23, COHFACE, MAHNOB-HCI) show that the proposed framework achieves lower RR estimation errors than individual methods in most cases, with performance gains depending on dataset characteristics. These findings highlight the potential of quality-driven predictive modeling to deliver scalable and generalizable video-based respiratory monitoring solutions.
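As a rough illustration of how a spectral estimator turns an extracted signal into a respiratory rate, the sketch below picks the strongest frequency inside a breathing band and converts it to breaths per minute. It is a plain discrete Fourier transform peak picker written with the standard library, a simplified stand-in for the FFT estimator named in the abstract rather than the authors' pipeline. The band limits (0.1 to 0.5 Hz), the sampling rate, the function name, and the synthetic test signal are all assumptions.

```python
import cmath
import math


def respiratory_rate_bpm(signal, fs, band=(0.1, 0.5)):
    """Estimate breaths per minute from the dominant in-band frequency."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the DC offset
    best_freq, best_power = 0.0, -1.0
    for k in range(1, n // 2 + 1):
        freq = k * fs / n
        if not band[0] <= freq <= band[1]:
            continue  # skip bins outside the assumed breathing band
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_freq, best_power = freq, power
    return best_freq * 60.0


# Synthetic 60 s signal at 10 Hz: breathing at 0.25 Hz plus a weaker
# 1.2 Hz component standing in for cardiac pulsation.
fs = 10.0
signal = [
    math.sin(2 * math.pi * 0.25 * i / fs)
    + 0.3 * math.sin(2 * math.pi * 1.2 * i / fs)
    for i in range(600)
]
rate = respiratory_rate_bpm(signal, fs)
print(rate)
```

A quality-aware framework of the kind described would run several such estimators over several signals and then learn which segment-level estimates to trust, instead of committing to a single one.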


[32] AnchorHOI: Zero-shot Generation of 4D Human-Object Interaction via Anchor-based Prior Distillation cs.CVPDF

Sisi Dai, Kai Xu

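TL;DR: This paper proposes AnchorHOI, a framework that uses an anchor-based prior distillation strategy and hybrid priors from video and image diffusion models to generate 4D human-object interactions zero-shot, addressing the insufficient distillation of interaction cues and the generation difficulties of existing methods.

Details

Motivation: Existing supervised methods are limited by the scarcity of large-scale 4D HOI datasets, while zero-shot methods distill too few interaction cues during generation, which limits generalization. A new method is therefore needed that fully exploits the priors and simplifies optimization.

Result: In extensive experiments, AnchorHOI outperforms previous methods in diversity and generalization.

Insight: The contributions are the anchor-based prior distillation strategy and two tailored anchors, anchor Neural Radiance Fields for interaction composition and anchor keypoints for motion synthesis, which guide generation in a tractable two-step process and improve the expressiveness and realism of 4D HOI generation.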

Abstract: Despite significant progress in text-driven 4D human-object interaction (HOI) generation with supervised methods, the scalability remains limited by the scarcity of large-scale 4D HOI datasets. To overcome this, recent approaches attempt zero-shot 4D HOI generation with pre-trained image diffusion models. However, interaction cues are minimally distilled during the generation process, restricting their applicability across diverse scenarios. In this paper, we propose AnchorHOI, a novel framework that thoroughly exploits hybrid priors by incorporating video diffusion models beyond image diffusion models, advancing 4D HOI generation. Nevertheless, directly optimizing high-dimensional 4D HOI with such priors remains challenging, particularly for human pose and compositional motion. To address this challenge, AnchorHOI introduces an anchor-based prior distillation strategy, which constructs interaction-aware anchors and then leverages them to guide generation in a tractable two-step process. Specifically, two tailored anchors are designed for 4D HOI generation: anchor Neural Radiance Fields (NeRFs) for expressive interaction composition, and anchor keypoints for realistic motion synthesis. Extensive experiments demonstrate that AnchorHOI outperforms previous methods with superior diversity and generalization.


[33] ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Diffusion Models cs.CVPDF

Ruishu Zhu, Zhihao Huang, Jiacheng Sun, Ping Luo, Hongyuan Zhang

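TL;DR: ViewMask-1-to-3 is a multi-view image generation method based on discrete diffusion models. It recasts multi-view synthesis as a discrete sequence modeling task and, using MAGVIT-v2 visual tokenization and masked token prediction, generates geometrically consistent multi-view images from a single image and a text description.

Details

Motivation: To address the difficulty of maintaining geometric consistency when generating multi-view images from a single image and a text description, while avoiding the dependence of existing methods on extensive multi-view training data, complex 3D-aware architectures, or geometric priors.

Result: On the GSO and 3D-FUTURE datasets, the method ranks first on average in PSNR, SSIM, and LPIPS, reaching state-of-the-art performance.

Insight: The paper applies discrete diffusion models to multi-view generation, unifying the language and vision modalities through visual tokenization and masked token prediction. It achieves cross-view consistency with only random masking and self-attention, without complex 3D constraints, in a simple and effective architecture.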

Abstract: Multi-view image generation from a single image and text description remains challenging due to the difficulty of maintaining geometric consistency across different viewpoints. Existing approaches typically rely on 3D-aware architectures or specialized diffusion models that require extensive multi-view training data and complex geometric priors. In this work, we introduce ViewMask-1-to-3, a pioneering approach to apply discrete diffusion models to multi-view image generation. Unlike continuous diffusion methods that operate in latent spaces, ViewMask-1-to-3 formulates multi-view synthesis as a discrete sequence modeling problem, where each viewpoint is represented as visual tokens obtained through MAGVIT-v2 tokenization. By unifying language and vision through masked token prediction, our approach enables progressive generation of multiple viewpoints through iterative token unmasking with text input. ViewMask-1-to-3 achieves cross-view consistency through simple random masking combined with self-attention, eliminating the requirement for complex 3D geometric constraints or specialized attention architectures. Our approach demonstrates that discrete diffusion provides a viable and simple alternative to existing multi-view generation methods, ranking first on average across GSO and 3D-FUTURE datasets in terms of PSNR, SSIM, and LPIPS, while maintaining architectural simplicity.


[34] Neurosymbolic Inference On Foundation Models For Remote Sensing Text-to-image Retrieval With Complex Queries cs.CV | cs.AI | cs.IRPDF

Emanuele Mezzi, Gertjan Burghouts, Maarten Kruithof

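TL;DR: This paper proposes RUNE, a remote sensing text-to-image retrieval method that combines large language models (LLMs) with neurosymbolic AI. It retrieves images by reasoning over the compatibility between detected entities and First-Order Logic (FOL) expressions derived from text queries. The method targets the limited explainability and weak handling of complex spatial relations in existing remote sensing large vision-language models (RS-LVLMs).

Details

Motivation: In text-to-image retrieval, existing remote sensing large vision-language models (RS-LVLMs) offer limited explainability and handle complex spatial relations poorly, which limits their effectiveness in real-world use.

Result: On a benchmark built from the DOTA dataset with more complex queries, RUNE surpasses state-of-the-art RS-LVLMs on complex remote sensing retrieval, with better performance, robustness, and explainability. The evaluation uses the newly proposed RRQC and RRIU metrics.

Insight: The core idea is to decouple the retrieval task: the foundation model (LLM) is used only to generate First-Order Logic (FOL) expressions, while reasoning is delegated to an explicit neurosymbolic inference module, combined with a logic decomposition strategy for scalability. This improves the system's explainability and its handling of complex queries.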

Abstract: Text-to-image retrieval in remote sensing (RS) has advanced rapidly with the rise of large vision-language models (LVLMs) tailored for aerial and satellite imagery, culminating in remote sensing large vision-language models (RS-LVLMs). However, limited explainability and poor handling of complex spatial relations remain key challenges for real-world use. To address these issues, we introduce RUNE (Reasoning Using Neurosymbolic Entities), an approach that combines Large Language Models (LLMs) with neurosymbolic AI to retrieve images by reasoning over the compatibility between detected entities and First-Order Logic (FOL) expressions derived from text queries. Unlike RS-LVLMs that rely on implicit joint embeddings, RUNE performs explicit reasoning, enhancing performance and interpretability. For scalability, we propose a logic decomposition strategy that operates on conditioned subsets of detected entities, guaranteeing shorter execution time compared to neural approaches. Rather than using foundation models for end-to-end retrieval, we leverage them only to generate FOL expressions, delegating reasoning to a neurosymbolic inference module. For evaluation we repurpose the DOTA dataset, originally designed for object detection, by augmenting it with more complex queries than in existing benchmarks. We show the LLM’s effectiveness in text-to-logic translation and compare RUNE with state-of-the-art RS-LVLMs, demonstrating superior performance. We introduce two metrics, Retrieval Robustness to Query Complexity (RRQC) and Retrieval Robustness to Image Uncertainty (RRIU), which evaluate performance relative to query complexity and image uncertainty. RUNE outperforms joint-embedding models in complex RS retrieval tasks, offering gains in performance, robustness, and explainability. We show RUNE’s potential for real-world RS applications through a use case on post-flood satellite image retrieval.
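To make "reasoning over the compatibility between detected entities and FOL expressions" concrete, here is a minimal sketch under stated assumptions. It hard-codes one hypothetical query, "a ship near a harbor", as the formula ∃s ∃h. ship(s) ∧ harbor(h) ∧ near(s, h). The `Entity` fields, the `near` predicate, and both thresholds are illustrative choices, not RUNE's actual interface. Filtering by class before pairing entities loosely mirrors the idea of evaluating logic on conditioned subsets of detections.

```python
from dataclasses import dataclass
from itertools import product
from math import hypot


@dataclass(frozen=True)
class Entity:
    label: str    # detector class name
    x: float      # box centre in pixels
    y: float
    score: float  # detector confidence


def near(a, b, max_dist=100.0):
    """Hypothetical spatial predicate: centres within max_dist pixels."""
    return hypot(a.x - b.x, a.y - b.y) <= max_dist


def image_satisfies(entities, min_score=0.5):
    """Evaluate: exists s, h. ship(s) AND harbor(h) AND near(s, h)."""
    confident = [e for e in entities if e.score >= min_score]
    # Restrict each quantifier to the entities that can satisfy its unary predicate.
    ships = [e for e in confident if e.label == "ship"]
    harbors = [e for e in confident if e.label == "harbor"]
    return any(near(s, h) for s, h in product(ships, harbors))


def retrieve(images):
    """Return the names of images whose detections satisfy the formula."""
    return [name for name, ents in images.items() if image_satisfies(ents)]
```

Because the decision is an explicit formula over named detections, a retrieved image can be explained by pointing at the specific ship and harbor that satisfied it.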


[35] Selective, Controlled and Domain-Agnostic Unlearning in Pretrained CLIP: A Training- and Data-Free Approach cs.CV

Ashish Mishra, Gyanaranjan Nayak, Tarun Kumar, Arpit Shah, Suparna Bhattacharya

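TL;DR: This paper proposes a selective unlearning framework for CLIP that needs neither additional data nor retraining. It supports three forgetting paradigms (global, domain-specific, and complete unlearning in selected domains) and removes knowledge efficiently through a multimodal nullspace built by combining text prompts with synthesized visual prototypes.

Details

Motivation: Real-world deployments of pretrained CLIP models need to remove knowledge of specific object classes without relying on extra data or retraining, and without degrading performance on unrelated tasks.

Result: The method achieves controlled forgetting across several visual domains (natural images, artistic renderings, and others) at low computational cost, overcoming the limitations of existing retraining-based methods.

Insight: The novelty is a training- and data-free forgetting mechanism built on a multimodal nullspace and synthesized visual prototypes derived from CLIP's joint embedding space, which offers a new direction for model editing.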

Abstract: Pretrained models like CLIP have demonstrated impressive zero-shot classification capabilities across diverse visual domains, spanning natural images, artistic renderings, and abstract representations. However, real-world applications often demand the removal (or “unlearning”) of specific object classes without requiring additional data or retraining, or affecting the model’s performance on unrelated tasks. In this paper, we propose a novel training- and data-free unlearning framework that enables three distinct forgetting paradigms: (1) global unlearning of selected objects across all domains, (2) domain-specific knowledge removal (e.g., eliminating sketch representations while preserving photo recognition), and (3) complete unlearning in selective domains. By leveraging a multimodal nullspace through synergistic integration of text prompts and synthesized visual prototypes derived from CLIP’s joint embedding space, our method efficiently removes undesired class information while preserving the remaining knowledge. This approach overcomes the limitations of existing retraining-based methods and offers a flexible and computationally efficient solution for controlled model forgetting.


[36] TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs cs.CV | cs.AI | cs.CL | cs.MM

Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li

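TL;DR: This paper presents TimeLens, a systematic baseline study of video temporal grounding (VTG) rather than a new method. It builds the high-quality datasets TimeLens-Bench and TimeLens-100K, explores algorithmic design principles (such as interleaved textual encoding and an RLVR training paradigm), and produces a family of MLLMs with state-of-the-art performance among open-source models.

Details

Motivation: Recipes for optimizing multimodal large language models (MLLMs) for VTG remain under-explored, and existing benchmarks and training data have serious quality problems that make evaluation unreliable.

Result: On the re-annotated TimeLens-Bench, model rankings change dramatically, confirming that earlier evaluation standards were unreliable. The resulting TimeLens models reach state-of-the-art performance among open-source models and even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash.

Insight: The contribution is a systematic demonstration that data quality (removing benchmark noise through strict re-annotation) and algorithmic design (interleaved textual encoding for time representation, a thinking-free RLVR training paradigm) are both critical for VTG, yielding a reliable benchmark and effective practices for future work.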

Abstract: This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability, along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on our data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, a thinking-free reinforcement learning with verifiable rewards (RLVR) approach as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in TimeLens models, a family of MLLMs that achieve state-of-the-art VTG performance among open-source models and even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. All code, data, and models will be released to facilitate future research.
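The abstract names "interleaved textual encoding for time representation" without spelling it out. One plausible reading, sketched here purely as an assumption, is that each sampled frame is preceded in the prompt by its timestamp written as plain text. The `<frame>` placeholder and the `0.5s` format are hypothetical and not taken from the paper.

```python
def interleave_timestamps(num_frames, fps, frame_token="<frame>"):
    """Build a prompt skeleton where each frame slot follows its timestamp text.

    `fps` is the sampling rate of the frames fed to the model, so frame i
    is assumed to sit at i / fps seconds.
    """
    parts = []
    for i in range(num_frames):
        seconds = i / fps
        parts.append(f"{seconds:.1f}s")  # time written as ordinary text tokens
        parts.append(frame_token)        # slot later filled with visual tokens
    return " ".join(parts)


print(interleave_timestamps(3, fps=2.0))
# 0.0s <frame> 0.5s <frame> 1.0s <frame>
```

Writing time as text lets the model answer a grounding query with the same vocabulary it saw next to the frames, rather than learning a separate positional scheme.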


[37] Erasing CLIP Memories: Non-Destructive, Data-Free Zero-Shot class Unlearning in CLIP Models cs.CV

Ashish Mishra, Tarun Kumar, Gyanaranjan Nayak, Arpit Shah, Suparna Bhattacharya

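TL;DR: This paper proposes a selective unlearning method for pretrained multimodal models such as CLIP. Using nullspace projection, it erases target-class information from the final projection layer without retraining or images from the forget set. The method computes an orthonormal basis for the subspace spanned by the target text embeddings and applies the corresponding projection, sharply reducing the alignment between image features and the classes to be forgotten while remaining computationally efficient and preserving the model's overall multimodal knowledge.

Details

Motivation: Traditional unlearning techniques rely on iterative fine-tuning and extensive data curation. The paper aims for data-free, non-destructive zero-shot class unlearning to address key challenges in model decontamination and privacy preservation.

Result: Experiments show a pronounced drop in zero-shot performance on the target classes while the model's useful information is retained; even a partial projection can balance complete unlearning against information retention.

Insight: The novelty is a closed-form, precise unlearning procedure based on nullspace projection that needs no training or data, giving an efficient and tunable way to selectively erase knowledge from multimodal models.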

Abstract: We introduce a novel, closed-form approach for selective unlearning in multimodal models, specifically targeting pretrained models such as CLIP. Our method leverages nullspace projection to erase the target class information embedded in the final projection layer, without requiring any retraining or the use of images from the forget set. By computing an orthonormal basis for the subspace spanned by target text embeddings and projecting these directions, we dramatically reduce the alignment between image features and undesired classes. Unlike traditional unlearning techniques that rely on iterative fine-tuning and extensive data curation, our approach is both computationally efficient and surgically precise. This leads to a pronounced drop in zero-shot performance for the target classes while preserving the overall multimodal knowledge of the model. Our experiments demonstrate that even a partial projection can balance between complete unlearning and retaining useful information, addressing key challenges in model decontamination and privacy preservation.
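The nullspace projection idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code. It assumes the projector takes the form P = I − αQQᵀ, where the columns of Q are an orthonormal basis of the target text embeddings, and it applies P to a single feature vector rather than folding it into the final projection layer's weights. The `alpha` parameter is a hypothetical knob for the "partial projection" the abstract mentions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, eps=1e-10):
    """Modified Gram-Schmidt; linearly dependent inputs are dropped."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:
            basis.append([wi / norm for wi in w])
    return basis


def project_out(x, basis, alpha=1.0):
    """Return (I - alpha * Q Q^T) x for an orthonormal basis Q.

    alpha=1.0 removes the target subspace completely; 0 < alpha < 1
    removes it only partially.
    """
    out = list(x)
    for q in basis:
        c = dot(x, q)
        out = [oi - alpha * c * qi for oi, qi in zip(out, q)]
    return out
```

With `alpha=1.0` the projected feature has zero dot product with every target embedding, so its similarity to the forgotten class prompts collapses, while components orthogonal to the target subspace are left unchanged.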


[38] SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing cs.CV

Han Zou, Yan Zhang, Ruiqi Yu, Cong Xie, Jie Huang

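TL;DR: This paper presents SketchAssist, an interactive sketch drawing assistant that speeds up creation by unifying instruction-guided global edits with line-guided region redrawing while keeping unrelated regions and the overall composition intact. To build it, the authors introduce a controllable data generation pipeline for constructing a large-scale dataset and, on top of that data, a unified sketch editing framework based on DiT editors, with a task-guided mixture-of-experts (MoE) mechanism to improve control.

Details

Motivation: Existing image editing systems struggle to preserve the sparse, style-sensitive structure of line art while supporting both high-level semantic changes and precise local redrawing. The paper targets this core difficulty in sketch editing.

Result: Extensive experiments show state-of-the-art (SOTA) results on both instruction-guided editing and line-guided redrawing, with better instruction adherence and style and structure preservation than recent baselines.

Insight: The main contributions are: 1) a controllable data generation pipeline that constructs attribute-addition sequences, forms multi-step edit chains via cross-sequence sampling, and expands stylistic coverage with a style-preserving attribute-removal model; 2) a unified sketch editing framework that repurposes the RGB channels to encode inputs, allowing mode switching within a single interface; 3) a task-guided MoE integrated into LoRA layers, routed by text and visual cues, to improve semantic controllability, structural fidelity, and style preservation.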

Abstract: Sketch editing is central to digital illustration, yet existing image editing systems struggle to preserve the sparse, style-sensitive structure of line art while supporting both high-level semantic changes and precise local redrawing. We present SketchAssist, an interactive sketch drawing assistant that accelerates creation by unifying instruction-guided global edits with line-guided region redrawing, while keeping unrelated regions and overall composition intact. To enable this assistant at scale, we introduce a controllable data generation pipeline that (i) constructs attribute-addition sequences from attribute-free base sketches, (ii) forms multi-step edit chains via cross-sequence sampling, and (iii) expands stylistic coverage with a style-preserving attribute-removal model applied to diverse sketches. Building on this data, SketchAssist employs a unified sketch editing framework with minimal changes to DiT-based editors. We repurpose the RGB channels to encode the inputs, enabling seamless switching between instruction-guided edits and line-guided redrawing within a single input interface. To further specialize behavior across modes, we integrate a task-guided mixture-of-experts into LoRA layers, routing by text and visual cues to improve semantic controllability, structural fidelity, and style preservation. Extensive experiments show state-of-the-art results on both tasks, with superior instruction adherence and style/structure preservation compared to recent baselines. Together, our dataset and SketchAssist provide a practical, controllable assistant for sketch creation and revision.


[39] TorchTraceAP: A New Benchmark Dataset for Detecting Performance Anti-Patterns in Computer Vision Models cs.CV | cs.AI

Hanning Chen, Keyu Man, Kevin Zhu, Chenguang Zhu, Haonan Li

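TL;DR: This paper introduces TorchTraceAP, a new benchmark dataset for detecting performance anti-patterns in computer vision models. It contains over 600 PyTorch execution traces from a range of CV tasks, and the paper also proposes an iterative approach that combines a lightweight ML model with an LLM to automatically identify and classify performance problems in traces.

Details

Motivation: Identifying and fixing performance anti-patterns in machine learning models currently requires deep expertise across systems, models, and kernel development, and existing methods struggle to automatically locate problematic segments within lengthy execution traces, which slows down computer vision researchers.

Result: Experiments show that the method significantly outperforms unsupervised clustering and rule-based statistical techniques at detecting anti-pattern regions, and that it effectively compensates for the LLM's limited context length and reasoning inefficiency.

Insight: The novelty is the first benchmark dataset dedicated to performance anti-pattern detection, together with an iterative approach in which a lightweight ML model performs coarse-grained detection and an LLM performs fine-grained classification, pointing to a new route for automated performance analysis.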

Abstract: Identifying and addressing performance anti-patterns in machine learning (ML) models is critical for efficient training and inference, but it typically demands deep expertise spanning system infrastructure, ML models and kernel development. While large tech companies rely on dedicated ML infrastructure engineers to analyze torch traces and benchmarks, such resource-intensive workflows are largely inaccessible to computer vision researchers in general. Among the challenges, pinpointing problematic trace segments within lengthy execution traces remains the most time-consuming task, and is difficult to automate with current ML models, including LLMs. In this work, we present the first benchmark dataset specifically designed to evaluate and improve ML models’ ability to detect anti-patterns in traces. Our dataset contains over 600 PyTorch traces from diverse computer vision models (classification, detection, segmentation, and generation) collected across multiple hardware platforms. We also propose a novel iterative approach: a lightweight ML model first detects trace segments with anti-patterns, followed by a large language model (LLM) for fine-grained classification and targeted feedback. Experimental results demonstrate that our method significantly outperforms unsupervised clustering and rule-based statistical techniques for detecting anti-pattern regions. Our method also effectively compensates for the LLM’s limited context length and reasoning inefficiencies.


[40] CIS-BA: Continuous Interaction Space Based Backdoor Attack for Object Detection in the Real-World cs.CV | cs.CR

Shuxin Zhao, Bo Lang, Nan Xiao, Yilang Zhang

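TL;DR: This paper proposes CIS-BA, a new backdoor attack on real-world object detection models. Unlike existing methods that rely on single-trigger-single-object mappings, CIS-BA shifts trigger design from static object features to continuous inter-object interaction patterns, builds a continuous interaction space, and introduces space triggers. This enables a multi-trigger-multi-object attack mechanism for the first time, with robustness ensured by invariant geometric relations.

Details

Motivation: Existing backdoor attacks depend on single-trigger-single-object mappings and fragile pixel-level cues, which inherently limits their capability and robustness. The paper addresses this limitation, particularly for real-world applications such as autonomous driving, where object detectors face serious security threats.

Result: Experiments on MS-COCO and real-world videos show that CIS-BA achieves over 97% attack success in complex environments, maintains over 95% effectiveness under dynamic multi-trigger conditions, and evades three state-of-the-art defenses.

Insight: The core idea is to redefine the trigger design paradigm from static object features to continuous inter-object interaction patterns, modeled as a continuous interaction space. This enables the first multi-trigger-multi-object attack, with robustness ensured by geometric constraints, and offers a new perspective on backdoor attacks and security research in interaction-intensive scenes.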

Abstract: Object detection models deployed in real-world applications such as autonomous driving face serious threats from backdoor attacks. Despite their practical effectiveness, existing methods are inherently limited in both capability and robustness due to their dependence on single-trigger-single-object mappings and fragile pixel-level cues. We propose CIS-BA, a novel backdoor attack paradigm that redefines trigger design by shifting from static object features to continuous inter-object interaction patterns that describe how objects co-occur and interact in a scene. By modeling these patterns as a continuous interaction space, CIS-BA introduces space triggers that, for the first time, enable a multi-trigger-multi-object attack mechanism while achieving robustness through invariant geometric relations. To implement this paradigm, we design CIS-Frame, which constructs space triggers via interaction analysis, formalizes them as class-geometry constraints for sample poisoning, and embeds the backdoor during detector training. CIS-Frame supports both single-object attacks (object misclassification and disappearance) and multi-object simultaneous attacks, enabling complex and coordinated effects across diverse interaction states. Experiments on MS-COCO and real-world videos show that CIS-BA achieves over 97% attack success under complex environments and maintains over 95% effectiveness under dynamic multi-trigger conditions, while evading three state-of-the-art defenses. In summary, CIS-BA extends the landscape of backdoor attacks in interaction-intensive scenarios and provides new insights into the security of object detection systems.


[41] Improving Semantic Uncertainty Quantification in LVLMs with Semantic Gaussian Processes cs.CV

Joseph Hoche, Andrei Bursuc, David Brellmann, Gilles Louppe, Pavel Izmailov

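TL;DR: This paper proposes Semantic Gaussian Process Uncertainty (SGPU), a Bayesian framework for improving semantic uncertainty quantification in large vision-language models (LVLMs). It quantifies semantic uncertainty by analyzing the geometric structure of answer embeddings, avoiding the fragility of clustering-based methods, and achieves state-of-the-art calibration and discriminative performance across multiple models and datasets.

Details

Motivation: LVLMs often produce plausible but unreliable outputs, so robust uncertainty estimation is needed. Existing semantic uncertainty methods rely on external models to cluster multiple sampled responses and measure their semantic consistency, but such clustering is often fragile and highly sensitive to minor phrasing variations, which can make the uncertainty estimates unreliable.

Result: Evaluated on six LLMs and LVLMs across eight datasets spanning VQA, image classification, and textual QA, SGPU achieves state-of-the-art calibration (ECE) and discriminative (AUROC, AUARC) performance.

Insight: The novelty is a Bayesian approach that maps generated answers into a dense semantic space, computes the Gram matrix of their embeddings, and summarizes their semantic configuration via the eigenspectrum, avoiding brittle clustering. The spectral representation captures general patterns of semantic uncertainty and transfers across models and modalities.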

Abstract: Large Vision-Language Models (LVLMs) often produce plausible but unreliable outputs, making robust uncertainty estimation essential. Recent work on semantic uncertainty estimates relies on external models to cluster multiple sampled responses and measure their semantic consistency. However, these clustering methods are often fragile, highly sensitive to minor phrasing variations, and can incorrectly group or separate semantically similar answers, leading to unreliable uncertainty estimates. We propose Semantic Gaussian Process Uncertainty (SGPU), a Bayesian framework that quantifies semantic uncertainty by analyzing the geometric structure of answer embeddings, avoiding brittle clustering. SGPU maps generated answers into a dense semantic space, computes the Gram matrix of their embeddings, and summarizes their semantic configuration via the eigenspectrum. This spectral representation is then fed into a Gaussian Process Classifier that learns to map patterns of semantic consistency to predictive uncertainty, and that can be applied in both black-box and white-box settings. Across six LLMs and LVLMs on eight datasets spanning VQA, image classification, and textual QA, SGPU consistently achieves state-of-the-art calibration (ECE) and discriminative (AUROC, AUARC) performance. We further show that SGPU transfers across models and modalities, indicating that its spectral representation captures general patterns of semantic uncertainty.
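The Gram-matrix and eigenspectrum step can be sketched without any external library. This is an illustrative sketch under assumptions, not the SGPU implementation. Embeddings are unit-normalized so the Gram matrix holds cosine similarities, eigenvalues come from a basic Jacobi rotation routine, and `spectral_entropy` is a hypothetical one-number summary added here for intuition. In the paper the spectrum is fed to a Gaussian Process classifier instead.

```python
import math


def gram_matrix(embeddings):
    """Cosine-similarity Gram matrix of the sampled answers' embeddings."""
    unit = []
    for e in embeddings:
        norm = math.sqrt(sum(x * x for x in e))  # assumes non-zero vectors
        unit.append([x / norm for x in e])
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-20):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # A <- A J (update columns p and q)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # A <- J^T A (update rows p and q)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def spectral_entropy(eigenvalues):
    """Entropy of the normalized spectrum (hypothetical summary statistic)."""
    positive = [max(v, 0.0) for v in eigenvalues]
    total = sum(positive)
    probs = [v / total for v in positive if v > 1e-12]
    return -sum(p * math.log(p) for p in probs)
```

When all sampled answers mean the same thing, the spectrum concentrates in one eigenvalue and the entropy is near zero. When the answers are semantically unrelated, the eigenvalues are spread evenly and the entropy approaches log n for n samples.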


[42] DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos cs.CV | cs.RO

Yang Bai, Liudi Yang, George Eskandar, Fengyi Shen, Mohammad Altillawi

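TL;DR: DRAW2ACT is a depth-aware, trajectory-conditioned video generation framework for producing controllable and consistent robotic demonstration videos. It extracts several orthogonal representations from the input trajectory (depth, semantics, shape, and motion), injects them into a diffusion model, and jointly generates spatially aligned RGB and depth videos. It then uses the generated multimodal sequences to regress the robot's joint angles for manipulation tasks.

Details

Motivation: Existing video diffusion models offer limited controllability for robotic manipulation. In particular, relying on 2D trajectories or single-modality conditioning restricts the controllability and consistency of the generated demonstration videos.

Result: Experiments on Bridge V2, Berkeley Autolab, and simulation benchmarks show that DRAW2ACT surpasses existing baselines in visual fidelity and consistency and achieves higher manipulation success rates.

Insight: The novelty lies in extracting multiple orthogonal representations from the trajectory and injecting them into the diffusion model, and in jointly generating aligned RGB-D videos with cross-modality attention and depth supervision to improve spatio-temporal consistency. This gives embodied AI simulation and policy learning a more controllable and realistic generative simulator.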

Abstract: Video diffusion models provide powerful real-world simulators for embodied AI but remain limited in controllability for robotic manipulation. Recent works on trajectory-conditioned video generation address this gap but often rely on 2D trajectories or single modality conditioning, which restricts their ability to produce controllable and consistent robotic demonstrations. We present DRAW2ACT, a depth-aware trajectory-conditioned video generation framework that extracts multiple orthogonal representations from the input trajectory, capturing depth, semantics, shape and motion, and injects them into the diffusion model. Moreover, we propose to jointly generate spatially aligned RGB and depth videos, leveraging cross-modality attention mechanisms and depth supervision to enhance the spatio-temporal consistency. Finally, we introduce a multimodal policy model conditioned on the generated RGB and depth sequences to regress the robot’s joint angles. Experiments on Bridge V2, Berkeley Autolab, and simulation benchmarks show that DRAW2ACT achieves superior visual fidelity and consistency while yielding higher manipulation success rates compared to existing baselines.


[43] History-Enhanced Two-Stage Transformer for Aerial Vision-and-Language Navigation cs.CV | cs.RO

Xichen Ding, Jianzhe Gao, Cong Pan, Wenguan Wang, Jie Qin

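TL;DR: This paper proposes a History-Enhanced Two-Stage Transformer (HETT) framework for aerial vision-and-language navigation. A coarse-to-fine navigation pipeline fuses global environmental reasoning with local scene understanding: it first predicts a coarse-grained target position and then refines actions through fine-grained visual analysis. Experiments on a refined CityNav dataset validate its effectiveness.

Details

Motivation: Existing UAV agents typically adopt mono-granularity frameworks that struggle to balance global environmental reasoning and local scene understanding, which limits their performance when navigating large urban environments from language instructions.

Result: Experiments on the refined CityNav dataset show significant performance gains from HETT, and extensive ablation studies verify the effectiveness of each component.

Insight: The contributions are: 1) a coarse-to-fine two-stage navigation pipeline that integrates information at different granularities; 2) a historical grid map that dynamically aggregates visual features into a structured spatial memory to improve scene awareness; 3) manually refined dataset annotations that improve data quality.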

Abstract: Aerial Vision-and-Language Navigation (AVLN) requires Unmanned Aerial Vehicle (UAV) agents to localize targets in large-scale urban environments based on linguistic instructions. While successful navigation demands both global environmental reasoning and local scene comprehension, existing UAV agents typically adopt mono-granularity frameworks that struggle to balance these two aspects. To address this limitation, this work proposes a History-Enhanced Two-Stage Transformer (HETT) framework, which integrates the two aspects through a coarse-to-fine navigation pipeline. Specifically, HETT first predicts coarse-grained target positions by fusing spatial landmarks and historical context, then refines actions via fine-grained visual analysis. In addition, a historical grid map is designed to dynamically aggregate visual features into a structured spatial memory, enhancing comprehensive scene awareness. Additionally, the CityNav dataset annotations are manually refined to enhance data quality. Experiments on the refined CityNav dataset show that HETT delivers significant performance gains, while extensive ablation studies further verify the effectiveness of each component.
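The historical grid map can be pictured as a spatial memory keyed by ground cell. The sketch below is an assumption-laden illustration, not HETT's design: it uses a fixed square cell size, keys cells by the UAV's ground position, and aggregates features with a running mean. The class and method names are hypothetical.

```python
class HistoryGridMap:
    """Running-mean visual features per ground cell (illustrative only)."""

    def __init__(self, cell_size, feat_dim):
        self.cell_size = cell_size
        self.feat_dim = feat_dim
        self.sums = {}    # (ix, iy) -> element-wise sum of features
        self.counts = {}  # (ix, iy) -> number of observations

    def cell_of(self, x, y):
        """Map a ground position in metres to an integer grid index."""
        return (int(x // self.cell_size), int(y // self.cell_size))

    def update(self, x, y, feature):
        """Fold a new observation's feature vector into its cell."""
        key = self.cell_of(x, y)
        acc = self.sums.setdefault(key, [0.0] * self.feat_dim)
        for i, value in enumerate(feature):
            acc[i] += value
        self.counts[key] = self.counts.get(key, 0) + 1

    def read(self, x, y):
        """Mean feature of the cell, or zeros if the cell was never visited."""
        key = self.cell_of(x, y)
        if key not in self.counts:
            return [0.0] * self.feat_dim
        n = self.counts[key]
        return [v / n for v in self.sums[key]]
```

A navigation policy could then read a neighbourhood of cells around a candidate target to combine what the agent has already seen with the current view.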


[44] OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving cs.CV

Tao Tang, Enhui Ma, xia zhou, Letian Wang, Tianyi Yan

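TL;DR: This paper proposes OmniGen, a unified multimodal sensor generation framework for autonomous driving that generates aligned LiDAR and multi-view camera data together. It uses a shared bird's-eye-view (BEV) space to unify multimodal features and designs UAE, a novel generalizable multimodal reconstruction method that decodes multimodal sensor data through volume rendering. A Diffusion Transformer with a ControlNet branch adds controllable multimodal sensor generation.

Details

Motivation: Existing generative methods focus mainly on single-modality generation, which makes multimodal sensor data generation inefficient and prone to misalignment. The paper aims to develop a unified framework that generates aligned multimodal sensor data.

Result: Comprehensive experiments show that OmniGen achieves the desired performance in unified multimodal sensor data generation, with multimodal consistency and flexible sensor adjustments.

Insight: The novelty is a unified generation framework that aligns multimodal features in a shared BEV space and uses the volume-rendering-based UAE method for joint decoding. Integrating a Diffusion Transformer with a ControlNet branch makes the generation process controllable, which is a new approach in multimodal sensor generation.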

Abstract: Autonomous driving has seen remarkable advancements, largely driven by extensive real-world data collection. However, acquiring diverse and corner-case data remains costly and inefficient. Generative models have emerged as a promising solution by synthesizing realistic sensor data. However, existing approaches primarily focus on single-modality generation, leading to inefficiencies and misalignment in multimodal sensor data. To address these challenges, we propose OmniGen, which generates aligned multimodal sensor data in a unified framework. Our approach leverages a shared Bird’s Eye View (BEV) space to unify multimodal features and designs a novel generalizable multimodal reconstruction method, UAE, to jointly decode LiDAR and multi-view camera data. UAE achieves multimodal sensor decoding through volume rendering, enabling accurate and flexible reconstruction. Furthermore, we incorporate a Diffusion Transformer (DiT) with a ControlNet branch to enable controllable multimodal sensor generation. Our comprehensive experiments demonstrate that OmniGen achieves the desired performance in unified multimodal sensor data generation with multimodal consistency and flexible sensor adjustments.


[45] ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body cs.CV

Juze Zhang, Changan Chen, Xin Chen, Heng Yu, Tiange Xiang

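TL;DR: This paper presents ViBES, a conversational agent with a behaviorally intelligent 3D virtual body. A mixture-of-modality-experts (MoME) backbone jointly plans language and movement, supports mixed-initiative interaction through speech, text, and body-action directives, and generates dialogue-conditioned body behavior.

Details

Motivation: Existing systems usually model human behavior as a translation task from a fixed utterance to motion clips, with no agentic decision-making about when to move, what to do, or how to adapt across multi-turn dialogue. This leads to brittle timing, weak social grounding, and fragmented stacks in which speech, text, and motion are trained or inferred in isolation.

Result: On multi-turn conversation benchmarks, evaluated with automatic metrics of dialogue-motion alignment and behavior quality, ViBES shows consistent gains over strong co-speech and text-to-motion baselines.

Insight: The novelty is an agentic virtual-body framework that jointly generates language, prosody, and movement, using modality-partitioned transformer experts with cross-expert attention to share multimodal information and produce controllable behavior, going beyond conventional "speech-conditioned motion generation".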

Abstract: Human communication is inherently multimodal and social: words, prosody, and body language jointly carry intent. Yet most prior systems model human behavior as a translation task (co-speech gesture or text-to-motion) that maps a fixed utterance to motion clips, without requiring agentic decision-making about when to move, what to do, or how to adapt across multi-turn dialogue. This leads to brittle timing, weak social grounding, and fragmented stacks where speech, text, and motion are trained or inferred in isolation. We introduce ViBES (Voice in Behavioral Expression and Synchrony), a conversational 3D agent that jointly plans language and movement and executes dialogue-conditioned body actions. Concretely, ViBES is a speech-language-behavior (SLB) model with a mixture-of-modality-experts (MoME) backbone: modality-partitioned transformer experts for speech, facial expression, and body motion. The model processes interleaved multimodal token streams with hard routing by modality (parameters are split per expert), while sharing information through cross-expert attention. By leveraging strong pretrained speech-language models, the agent supports mixed-initiative interaction: users can speak, type, or issue body-action directives mid-conversation, and the system exposes controllable behavior hooks for streaming responses. We further benchmark on multi-turn conversation with automatic metrics of dialogue-motion alignment and behavior quality, and observe consistent gains over strong co-speech and text-to-motion baselines. ViBES goes beyond “speech-conditioned motion generation” toward agentic virtual bodies where language, prosody, and movement are jointly generated, enabling controllable, socially competent 3D interaction. Code and data will be made available at: ai.stanford.edu/~juze/ViBES/


[46] Elastic3D: Controllable Stereo Video Conversion with Guided Latent Decoding cs.CV

Nando Metzger, Prune Truong, Goutam Bhat, Konrad Schindler, Federico Tombari

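TL;DR: This paper presents Elastic3D, a controllable end-to-end method based on conditional latent diffusion for upgrading monocular video to binocular stereo video. A novel guided VAE decoder avoids the artifacts caused by explicit depth estimation and warping, and a scalar knob lets users control the strength of the stereo effect (the disparity range) at inference time.

Details

Motivation: The paper addresses the need for automated monocular-to-stereo video conversion, aiming to avoid the artifacts of traditional methods based on explicit depth estimation and warping and to give users control over the strength of the stereo effect.

Result: Experiments on three real-world stereo video datasets show that the method outperforms both traditional warping-based methods and recent warping-free baselines, setting a new standard for reliable, controllable stereo video conversion.

Insight: The novelty is a guided VAE decoder that ensures sharp, epipolar-consistent stereo output, combined with an intuitive scalar control parameter that adjusts the disparity range at inference time, yielding an end-to-end framework that pairs generation quality with user control.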

Abstract: The growing demand for immersive 3D content calls for automated monocular-to-stereo video conversion. We present Elastic3D, a controllable, direct end-to-end method for upgrading a conventional video to a binocular one. Our approach, based on (conditional) latent diffusion, avoids artifacts due to explicit depth estimation and warping. The key to its high-quality stereo video output is a novel, guided VAE decoder that ensures sharp and epipolar-consistent stereo video output. Moreover, our method gives the user control over the strength of the stereo effect (more precisely, the disparity range) at inference time, via an intuitive, scalar tuning knob. Experiments on three different datasets of real-world stereo videos show that our method outperforms both traditional warping-based and recent warping-free baselines and sets a new standard for reliable, controllable stereo video conversion. Please check the project page for the video samples https://elastic3d.github.io.


[47] Enhancing Visual Programming for Visual Reasoning via Probabilistic Graphs cs.CV

Wentao Wan, Kaiyu Wu, Qingyang Ma, Nan Kang, Yunjie Chen

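TL;DR: This paper proposes EVPG, which builds a directed probabilistic graph to turn the non-differentiable execution process of visual programming (VP) into a differentiable exact probabilistic inference process. This allows end-to-end supervised learning from final task labels and significantly improves performance on visual reasoning tasks.

Details

Motivation: Existing VP methods focus mainly on improving the quality of LLM-generated visual programs and neglect optimizing the pre-trained models that VP invokes. The non-differentiable nature of VP also prevents gradient-based optimization from final labels.

Result: On three classical complex visual reasoning tasks (GQA, NLVRv2, and Open Images), EVPG significantly improves VP performance.

Insight: The novelty is converting VP's non-differentiable execution into differentiable probabilistic inference over a directed probabilistic graph, enabling end-to-end gradient-based optimization. Viewed objectively, the approach neatly sidesteps both the missing sub-task labels and the non-differentiability problem, and suggests a new way to jointly optimize VP frameworks.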

Abstract: Recently, Visual Programming (VP) based on large language models (LLMs) has rapidly developed and demonstrated significant potential in complex Visual Reasoning (VR) tasks. Previous works to enhance VP have primarily focused on improving the quality of LLM-generated visual programs. However, they have neglected to optimize the VP-invoked pre-trained models, which serve as modules for the visual sub-tasks decomposed from the targeted tasks by VP. The difficulty is that there are only final labels of targeted VR tasks rather than labels of sub-tasks. Besides, the non-differentiable nature of VP impedes the direct use of efficient gradient-based optimization methods to leverage final labels for end-to-end learning of the entire VP framework. To overcome these issues, we propose EVPG, a method to Enhance Visual Programming for visual reasoning via Probabilistic Graphs. Specifically, we creatively build a directed probabilistic graph according to the variable dependency relationships during the VP executing process, which reconstructs the non-differentiable VP executing process into a differentiable exact probability inference process on this directed probabilistic graph. As a result, this enables the VP framework to utilize the final labels for efficient, gradient-based optimization in end-to-end supervised learning on targeted VR tasks. Extensive and comprehensive experiments demonstrate the effectiveness and advantages of our EVPG, showing significant performance improvements for VP on three classical complex VR tasks: GQA, NLVRv2, and Open Images.


[48] Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in cs.CV

Xiaoqian Shen, Min-Hung Chen, Yu-Chiang Frank Wang, Mohamed Elhoseiny, Ryo Hachiuma

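TL;DR: This paper presents Zoom-Zero, a framework that addresses weak temporal grounding in video question answering. It follows a coarse-to-fine strategy: it first localizes the video segments relevant to the question, then temporally zooms into the most salient frames for finer-grained visual verification.

Details

Motivation: Large video-language models have limited temporal awareness, and methods based on Group Relative Policy Optimization (GRPO) still struggle to faithfully ground answers in the relevant video evidence, leading to temporal mislocalization and hallucinations.

Result: The method improves temporal grounding by 5.2% on NExT-GQA and 4.6% on ReXTime, and raises average answer accuracy by 2.4%. On long-video benchmarks, coarse-to-fine zoom-in at inference yields an average improvement of 6.4%.

Insight: The main innovations are a zoom-in accuracy reward, which validates the fidelity of temporal grounding predictions and encourages fine-grained visual verification, and token-selective credit assignment, which attributes rewards to the tokens responsible for temporal localization or answer generation and so mitigates GRPO's difficulty with multi-faceted reward signals. By preserving critical visual details without compromising global context, the framework improves long-video understanding.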

Abstract: Grounded video question answering (GVQA) aims to localize relevant temporal segments in videos and generate accurate answers to a given question; however, large video-language models (LVLMs) exhibit limited temporal awareness. Although existing approaches based on Group Relative Policy Optimization (GRPO) attempt to improve temporal grounding, they still struggle to faithfully ground their answers in the relevant video evidence, leading to temporal mislocalization and hallucinations. In this work, we present Zoom-Zero, a coarse-to-fine framework that first localizes query-relevant segments and then temporally zooms into the most salient frames for finer-grained visual verification. Our method addresses the limits of GRPO for the GVQA task with two key innovations: (i) a zoom-in accuracy reward that validates the fidelity of temporal grounding prediction and facilitates fine-grained visual verification on grounded frames; (ii) token-selective credit assignment, which attributes rewards to the tokens responsible for temporal localization or answer generation, mitigating GRPO’s issue in handling multi-faceted reward signals. Our proposed method advances grounded video question answering, improving temporal grounding by 5.2% on NExT-GQA and 4.6% on ReXTime, while also enhancing average answer accuracy by 2.4%. Additionally, the coarse-to-fine zoom-in during inference further benefits long-form video understanding by preserving critical visual details without compromising global context, yielding an average improvement of 6.4% on long-video benchmarks.


[49] TUN: Detecting Significant Points in Persistence Diagrams with Deep Learning cs.CV | cs.LG | math.AT

Yu Chen, Hongwei Lin

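TL;DR: This paper proposes TUN (Topology Understanding Net), a multi-modal network that automatically detects significant points in one-dimensional persistence diagrams (PDs), addressing the challenge of automated interpretation in topological data analysis.

Details

Motivation: PDs are a powerful tool for understanding the topology of the shape underlying a point cloud, but identifying which points encode genuine signal remains difficult. This hinders practical adoption of topological data analysis, especially where automated, reliable interpretation is needed for downstream decision-making.

Result: Experiments show that TUN outperforms classic methods at detecting significant points in PDs, demonstrating its effectiveness in real-world applications.

Insight: The contributions include combining enhanced PD descriptors with self-attention, a PointNet-style point cloud encoder, learned fusion, and per-point classification, along with stable preprocessing and imbalance-aware training, giving an effective solution for automated significance detection in PDs.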

Abstract: Persistence diagrams (PDs) provide a powerful tool for understanding the topology of the underlying shape of a point cloud. However, identifying which points in PDs encode genuine signals remains challenging. This challenge directly hinders the practical adoption of topological data analysis in many applications, where automated and reliable interpretation of persistence diagrams is essential for downstream decision-making. In this paper, we study automatic significance detection for one-dimensional persistence diagrams. Specifically, we propose Topology Understanding Net (TUN), a multi-modal network that combines enhanced PD descriptors with self-attention, a PointNet-style point cloud encoder, learned fusion, and per-point classification, alongside stable preprocessing and imbalance-aware training. It provides an automated and effective solution for identifying significant points in PDs, which are critical for downstream applications. Experiments show that TUN outperforms classic methods in detecting significant points in PDs, illustrating its effectiveness in real-world applications.
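For readers unfamiliar with the task, a persistence diagram is a set of (birth, death) pairs, and the persistence of a point is death minus birth. The abstract does not say which "classic methods" TUN is compared against, so the sketch below is only an assumed example of the kind of heuristic a learned detector would replace: rank points by persistence and cut at the largest gap. The function names are hypothetical.

```python
def persistence(point):
    """Lifetime of a topological feature given as a (birth, death) pair."""
    birth, death = point
    return death - birth


def significant_by_largest_gap(diagram):
    """Keep the points above the largest gap in sorted persistence values.

    A simple thresholding heuristic, not the method proposed in the paper.
    """
    if len(diagram) < 2:
        return list(diagram)
    ranked = sorted(diagram, key=persistence, reverse=True)
    values = [persistence(p) for p in ranked]
    gaps = [values[i] - values[i + 1] for i in range(len(values) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    return ranked[: cut + 1]
```

Such a rule looks only at lifetimes, whereas the abstract describes TUN as also encoding the underlying point cloud, which is one reason a learned per-point classifier may separate signal from noise more reliably.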


[50] SS4D: Native 4D Generative Model via Structured Spacetime Latents cs.CV

Zhibing Li, Mengchen Zhang, Tong Wu, Jing Tan, Jiaqi Wang

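TL;DR: SS4D is a native 4D generative model that synthesizes dynamic 3D objects directly from monocular video. Using structured spacetime latents, it trains a generator directly on 4D data and achieves high fidelity, temporal coherence, and structural consistency.

Details

Motivation: To overcome the limitations of existing methods that construct 4D representations by optimizing over 3D or video generative models, and the scarcity of 4D training data, with the goal of directly generating high-quality, coherent dynamic 3D content.

Result: The paper claims high fidelity and temporal consistency on 4D generation, but the abstract gives no specific quantitative results or benchmark comparisons (for example against SOTA).

Insight: The contributions include building on a pre-trained single-image-to-3D model to strengthen spatial consistency, introducing dedicated temporal layers to ensure temporal coherence, and compressing the latent sequence with factorized 4D convolutions and temporal downsampling blocks to support efficient training and inference on long videos.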

Abstract: We present SS4D, a native 4D generative model that synthesizes dynamic 3D objects directly from monocular video. Unlike prior approaches that construct 4D representations by optimizing over 3D or video generative models, we train a generator directly on 4D data, achieving high fidelity, temporal coherence, and structural consistency. At the core of our method is a compressed set of structured spacetime latents. Specifically, (1) To address the scarcity of 4D training data, we build on a pre-trained single-image-to-3D model, preserving strong spatial consistency. (2) Temporal consistency is enforced by introducing dedicated temporal layers that reason across frames. (3) To support efficient training and inference over long video sequences, we compress the latent sequence along the temporal axis using factorized 4D convolutions and temporal downsampling blocks. In addition, we employ a carefully designed training strategy to enhance robustness against occlusion


[51] PSMamba: Progressive Self-supervised Vision Mamba for Plant Disease Recognition cs.CV

Abdullah Al Mamun, Miaohua Zhang, David Ahmedt-Aristizabal, Zeeshan Hayder, Mohammad Awrangjeb

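TL;DR: PSMamba is a progressive self-supervised Vision Mamba framework for plant disease recognition. It combines the efficient sequence modelling of Vision Mamba with a dual-student hierarchical distillation strategy to address the difficulty existing self-supervised methods have in capturing the hierarchical, multi-scale lesion patterns in plant disease images.

Details

Motivation: Existing self-supervised frameworks focus mainly on global alignment and struggle to capture the hierarchical, multi-scale lesion patterns characteristic of plant disease imagery, so a method that jointly learns contextual and detailed representations is needed.

Result: Experiments on three benchmark datasets show that PSMamba outperforms state-of-the-art self-supervised methods in both domain-shifted and fine-grained scenarios, with higher accuracy and robustness.

Insight: The contribution is combining Vision Mamba with a dual-student hierarchical distillation strategy: a shared global teacher and two specialised students focused on different scales (mid-scale and local) provide multi-granular supervision and cross-scale consistency alignment, which helps the model learn features such as lesion distributions and texture irregularities.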

Abstract: Self-supervised Learning (SSL) has become a powerful paradigm for representation learning without manual annotations. However, most existing frameworks focus on global alignment and struggle to capture the hierarchical, multi-scale lesion patterns characteristic of plant disease imagery. To address this gap, we propose PSMamba, a progressive self-supervised framework that integrates the efficient sequence modelling of Vision Mamba (VM) with a dual-student hierarchical distillation strategy. Unlike conventional single teacher-student designs, PSMamba employs a shared global teacher and two specialised students: one processes mid-scale views to capture lesion distributions and vein structures, while the other focuses on local views to capture fine-grained cues such as texture irregularities and early-stage lesions. This multi-granular supervision facilitates the joint learning of contextual and detailed representations, with consistency losses ensuring coherent cross-scale alignment. Experiments on three benchmark datasets show that PSMamba consistently outperforms state-of-the-art SSL methods, delivering superior accuracy and robustness in both domain-shifted and fine-grained scenarios.


[52] From YOLO to VLMs: Advancing Zero-Shot and Few-Shot Detection of Wastewater Treatment Plants Using Satellite Imagery in MENA Region cs.CV | cs.AI

Akila Premarathna, Kanishka Hewageegana, Garcia Andarcia Mariangel

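TL;DR: This study proposes a structured approach based on vision-language models (VLMs) for zero-shot and few-shot detection of wastewater treatment plants (WWTPs) from satellite imagery in the Middle East and North Africa (MENA) region. It compares several VLMs, including LLaMA 3.2 Vision, Qwen 2.5 VL, DeepSeek-VL2, Gemma 3, Gemini, and Pixtral 12B, against a YOLOv8 segmentation model trained on 83,566 high-resolution satellite images. In the zero-shot setting, several VLMs (Gemma-3 in particular) exceed YOLOv8's true positive rate, which the authors take as confirmation that VLMs can serve as an efficient alternative that needs no extensive annotation and enables scalable remote-sensing monitoring.

Details

Motivation: WWTPs are crucial for sustainable water management in the MENA region, but traditional segmentation approaches based on models such as YOLOv8 require extensive manual labeling. The study explores whether the inherent reasoning and annotation abilities of VLMs offer a more efficient alternative, addressing the high annotation cost of identifying WWTPs in satellite imagery.

Result: Evaluation used a dataset of 1,207 validated WWTP locations and an equal number of non-WWTP sites (600 m × 600 m Geo-TIFF images). Zero-shot evaluation on WWTP images showed several VLMs exceeding YOLOv8's true positive rate, with Gemma-3 performing best. The authors conclude that VLMs, particularly in the zero-shot setting, can replace YOLOv8 for efficient, annotation-free WWTP classification.

Insight: The paper's contribution is a structured VLM comparison methodology designed specifically for WWTP detection, split into zero-shot and few-shot streams, that uses expert prompts to guide models to identify specific components (such as circular/rectangular tanks and aeration basins) and to distinguish confounders, producing JSON outputs with confidence and descriptions. Viewed objectively, the approach shows the potential of VLMs to replace traditional supervised models for a specific remote-sensing detection task: by drawing on their internal knowledge they reduce dependence on large annotated datasets, which suggests a path toward scalable environmental monitoring.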

Abstract: In regions of the Middle East and North Africa (MENA), there is a high demand for wastewater treatment plants (WWTPs), crucial for sustainable water management. Precise identification of WWTPs from satellite images enables environmental monitoring. Traditional methods like YOLOv8 segmentation require extensive manual labeling. However, studies indicate that vision-language models (VLMs) offer an efficient alternative, achieving equivalent or superior results through inherent reasoning and annotation. This study presents a structured methodology for VLM comparison, divided into zero-shot and few-shot streams specifically to identify WWTPs. The YOLOv8 model was trained on a governmental dataset of 83,566 high-resolution satellite images from Egypt, Saudi Arabia, and UAE: ~85% WWTPs (positives), 15% non-WWTPs (negatives). Evaluated VLMs include LLaMA 3.2 Vision, Qwen 2.5 VL, DeepSeek-VL2, Gemma 3, Gemini, and Pixtral 12B (Mistral), used to identify WWTP components such as circular/rectangular tanks and aeration basins, and to distinguish confounders via expert prompts producing JSON outputs with confidence and descriptions. The dataset comprises 1,207 validated WWTP locations (198 UAE, 354 KSA, 655 Egypt) and equal non-WWTP sites from field/AI data, as 600 m × 600 m Geo-TIFF images (Zoom 18, EPSG:4326). Zero-shot evaluations on WWTP images showed several VLMs outperforming YOLOv8’s true positive rate, with Gemma-3 highest. Results confirm that VLMs, particularly with zero-shot, can replace YOLOv8 for efficient, annotation-free WWTP classification, enabling scalable remote sensing.
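
As a rough illustration of how the zero-shot comparison could be scored, the sketch below parses JSON replies and computes a true positive rate on images known to contain a WWTP. The field names `is_wwtp`, `confidence`, and `description`, the 0.5 threshold, and the sample replies are all hypothetical; the abstract states only that the prompts produce JSON outputs with confidence and descriptions.

```python
import json


def true_positive_rate(raw_outputs, threshold=0.5):
    """Fraction of known-WWTP images that a model flags as WWTP.

    Each item is assumed to be the model's raw JSON reply for an image that
    truly contains a WWTP. The reply schema is a hypothetical one.
    """
    hits = 0
    total = 0
    for raw in raw_outputs:
        total += 1
        try:
            reply = json.loads(raw)
        except json.JSONDecodeError:
            continue  # an unparseable reply is counted as a miss
        if not isinstance(reply, dict):
            continue  # valid JSON but not an object: also a miss
        if reply.get("is_wwtp") is True and reply.get("confidence", 0.0) >= threshold:
            hits += 1
    return hits / total if total else 0.0


# Made-up replies for four positive images.
sample_replies = [
    '{"is_wwtp": true, "confidence": 0.91, "description": "two circular tanks"}',
    '{"is_wwtp": true, "confidence": 0.40, "description": "possible basin"}',
    '{"is_wwtp": false, "confidence": 0.80, "description": "farm ponds"}',
    "not json",
]
tpr = true_positive_rate(sample_replies)
```

Counting malformed replies as misses is a conservative design choice: a model that fails to follow the output format gets no credit for that image.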


[53] Semantic Mismatch and Perceptual Degradation: A New Perspective on Image Editing Immunity cs.CV | cs.AI | cs.CY | cs.LG

Shuai Dong, Jie Zhang, Guoying Zhao, Shiguang Shan, Xilin Chen

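TL;DR: This paper addresses the risk that diffusion-based, text-guided image editing is maliciously misused, and proposes a new perspective on and method for image immunization. The authors argue that existing evaluation metrics are fundamentally flawed: they only measure the visual difference between the output generated from a protected image and the output generated from the unprotected original, ignoring that the core goal of immunization is to disrupt the semantic alignment between the edited result and the attacker's intent. The paper proposes Synergistic Intermediate Feature Manipulation (SIFM), which perturbs a diffusion model's intermediate features through two synergistic objectives, maximizing feature divergence and minimizing feature norms, to induce semantic mismatch or severe perceptual degradation in the edited result. It also introduces the Immunization Success Rate (ISR), a new metric that uses multimodal large language models (MLLMs) to quantify true immunization efficacy for the first time. Experiments show that SIFM achieves state-of-the-art performance in protecting visual content against malicious diffusion-based manipulation.

Details

Motivation: Diffusion-based, text-guided image editing can be misused, which has prompted work on immunizing images against unauthorized edits with imperceptible perturbations. However, the prevailing metrics for immunization success are fundamentally flawed: they rely on the visual dissimilarity between the output generated from a protected image and the output generated from the unprotected original. This overlooks the core requirement of immunization, which is to disrupt semantic alignment between the edited result and the attacker's intent rather than to deviate from one particular output.

Result: Extensive experiments show that the proposed SIFM achieves state-of-the-art (SOTA) performance in protecting visual content against malicious diffusion-based manipulation.

Insight: The core contribution is a rethinking of how immunization success is defined: a successful immunization should cause the edited output either to semantically mismatch the prompt or to suffer severe perceptual degradation. On that basis, SIFM perturbs intermediate diffusion features by jointly optimizing two objectives (maximizing feature divergence to disrupt semantic alignment, and minimizing feature norms to induce perceptual degradation). The paper also proposes, for the first time, the Immunization Success Rate (ISR), a quantitative metric that uses MLLMs to assess semantic failure or perceptual degradation, giving the field a new evaluation yardstick.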

Abstract: Text-guided image editing via diffusion models, while powerful, raises significant concerns about misuse, motivating efforts to immunize images against unauthorized edits using imperceptible perturbations. Prevailing metrics for evaluating immunization success typically rely on measuring the visual dissimilarity between the output generated from a protected image and a reference output generated from the unprotected original. This approach fundamentally overlooks the core requirement of image immunization, which is to disrupt semantic alignment with attacker intent, regardless of deviation from any specific output. We argue that immunization success should instead be defined by the edited output either semantically mismatching the prompt or suffering substantial perceptual degradations, both of which thwart malicious intent. To operationalize this principle, we propose Synergistic Intermediate Feature Manipulation (SIFM), a method that strategically perturbs intermediate diffusion features through dual synergistic objectives: (1) maximizing feature divergence from the original edit trajectory to disrupt semantic alignment with the expected edit, and (2) minimizing feature norms to induce perceptual degradations. Furthermore, we introduce the Immunization Success Rate (ISR), a novel metric designed to rigorously quantify true immunization efficacy for the first time. ISR quantifies the proportion of edits where immunization induces either semantic failure relative to the prompt or significant perceptual degradations, assessed via Multimodal Large Language Models (MLLMs). Extensive experiments show our SIFM achieves the state-of-the-art performance for safeguarding visual content against malicious diffusion-based manipulation.
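
The ISR definition in the abstract (the proportion of edits showing either semantic failure or significant perceptual degradation) can be sketched as a simple ratio. In the minimal example below, the `EditJudgment` record and its two boolean fields are hypothetical stand-ins for whatever verdicts the MLLM-based assessment returns; how the paper elicits those verdicts is not covered here.

```python
from dataclasses import dataclass


@dataclass
class EditJudgment:
    """Hypothetical per-edit verdict from an MLLM judge."""
    semantic_mismatch: bool       # edited output fails to match the prompt
    perceptual_degradation: bool  # edited output is substantially degraded


def immunization_success_rate(judgments):
    """Share of edits where immunization caused either failure mode."""
    if not judgments:
        raise ValueError("need at least one judged edit")
    successes = sum(
        1 for j in judgments if j.semantic_mismatch or j.perceptual_degradation
    )
    return successes / len(judgments)


# Made-up verdicts: three of the four edits were thwarted.
judgments = [
    EditJudgment(True, False),
    EditJudgment(False, True),
    EditJudgment(False, False),
    EditJudgment(True, True),
]
isr = immunization_success_rate(judgments)
```

Note that an edit exhibiting both failure modes still counts once, since the criterion is a logical OR.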


[54] Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure cs.CV

Jooyeol Yun, Jaegul Choo

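TL;DR: This paper proposes Vector Prism, a framework that recovers the semantic grouping of SVG (Scalable Vector Graphics) files by stratifying their semantic structure. It addresses the difficulty current vision-language models (VLMs) have in automating SVG animation because graphic elements are fragmented, and it markedly improves animation coherence.

Details

Motivation: When animating SVGs, current VLMs often fail because visually coherent parts are broken into low-level graphic elements that carry no semantic guidance, producing incoherent animations. The paper aims to solve this by recovering the SVG's semantic structure.

Result: Experiments show substantial gains over existing approaches on SVG animation. Semantic recovery yields more robust animation generation, suggesting it is the key step toward reliable SVG animation.

Insight: The contribution is to infer semantics stably by statistically aggregating multiple weak part predictions and to reorganize the SVG into semantic groups, giving VLMs an interpretable structure to interact with and supplying the semantic layer that current systems overlook.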

Abstract: Scalable Vector Graphics (SVG) are central to modern web design, and the demand to animate them continues to grow as web environments become increasingly dynamic. Yet automating the animation of vector graphics remains challenging for vision-language models (VLMs) despite recent progress in code generation and motion planning. VLMs routinely mishandle SVGs, since visually coherent parts are often fragmented into low-level shapes that offer little guidance on which elements should move together. In this paper, we introduce a framework that recovers the semantic structure required for reliable SVG animation and reveals the missing layer that current VLM systems overlook. This is achieved through a statistical aggregation of multiple weak part predictions, allowing the system to stably infer semantics from noisy predictions. By reorganizing SVGs into semantic groups, our approach enables VLMs to produce animations with far greater coherence. Our experiments demonstrate substantial gains over existing approaches, suggesting that semantic recovery is the key step that unlocks robust SVG animation and supports more interpretable interactions between VLMs and vector graphics.
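
To make the idea of aggregating multiple weak part predictions concrete, here is a minimal sketch that uses plain majority voting and then regroups elements by their winning label. Majority voting is a simple stand-in chosen for illustration; the abstract does not specify the paper's statistical aggregation, and the element ids and part labels are invented.

```python
from collections import Counter


def aggregate_part_labels(prediction_runs):
    """Majority-vote a part label per SVG element across noisy runs.

    prediction_runs: list of dicts mapping element id -> predicted label.
    Ties resolve to the label seen first, since Counter.most_common keeps
    insertion order among equal counts.
    """
    votes = {}
    for run in prediction_runs:
        for element_id, label in run.items():
            votes.setdefault(element_id, Counter())[label] += 1
    return {eid: counts.most_common(1)[0][0] for eid, counts in votes.items()}


def group_by_label(labels):
    """Reorganize elements into semantic groups keyed by part label."""
    groups = {}
    for element_id, label in labels.items():
        groups.setdefault(label, []).append(element_id)
    return {label: sorted(ids) for label, ids in groups.items()}


# Three noisy runs over three hypothetical SVG elements.
runs = [
    {"path1": "wheel", "path2": "wheel", "path3": "body"},
    {"path1": "wheel", "path2": "body", "path3": "body"},
    {"path1": "wheel", "path2": "wheel", "path3": "body"},
]
labels = aggregate_part_labels(runs)
groups = group_by_label(labels)
```

Here the second run mislabels `path2`, but the vote across runs still places it with `path1`, so the two elements would be animated together.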


[55] Enhancing Interpretability for Vision Models via Shapley Value Optimization cs.CV | cs.AI

Kanglong Fan, Yunqiao Yang, Chen Ma

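TL;DR: This paper proposes a novel self-explaining framework that improves the interpretability of vision models by adding Shapley value estimation as an auxiliary task during training. The method fairly allocates the model's prediction score across image patches, so explanations are inherently consistent with the model's decision logic, and it needs only minor structural modifications, preserving model performance and compatibility.

Details

Motivation: To address the opacity of deep neural network decision-making, and to overcome two limitations of existing work: post-hoc explanation methods struggle to faithfully reflect model behavior, and self-explaining neural networks sacrifice performance and compatibility because of their specialized architectures.

Result: Extensive experiments on multiple benchmarks show that the method achieves state-of-the-art interpretability.

Insight: The contribution is integrating Shapley value estimation as an auxiliary training task, which aligns explanations with the model's decision logic and improves interpretability with minimal structural change while preserving performance. Viewed objectively, this is an effective way to combine a game-theoretic concept (the Shapley value) with model training to produce trustworthy self-explanations.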

Abstract: Deep neural networks have demonstrated remarkable performance across various domains, yet their decision-making processes remain opaque. Although many explanation methods are dedicated to bringing the obscurity of DNNs to light, they exhibit significant limitations: post-hoc explanation methods often struggle to faithfully reflect model behaviors, while self-explaining neural networks sacrifice performance and compatibility due to their specialized architectural designs. To address these challenges, we propose a novel self-explaining framework that integrates Shapley value estimation as an auxiliary task during training, which achieves two key advancements: 1) a fair allocation of the model prediction scores to image patches, ensuring explanations inherently align with the model’s decision logic, and 2) enhanced interpretability with minor structural modifications, preserving model performance and compatibility. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art interpretability.
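
For readers unfamiliar with the "fair allocation" property mentioned above, the sketch below computes exact Shapley values for a toy three-patch scoring function by averaging marginal contributions over all orderings. This is the textbook definition, not the paper's training-time estimator; `toy_score` and the patch names are invented for illustration, and exact enumeration is only feasible for a handful of players.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: mean marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for player in order:
            with_player = coalition | {player}
            totals[player] += value(with_player) - value(coalition)
            coalition = with_player
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_score(coalition):
    """Hypothetical prediction score given the set of visible patches."""
    score = 0.0
    if "patch_a" in coalition:
        score += 0.5
    if "patch_b" in coalition:
        score += 0.2
    if "patch_a" in coalition and "patch_b" in coalition:
        score += 0.2  # interaction bonus, split evenly by the Shapley value
    return score  # patch_c never changes the score


patches = ["patch_a", "patch_b", "patch_c"]
phi = shapley_values(patches, toy_score)
full_score = toy_score(frozenset(patches))
```

The attributions sum to the full prediction score minus the empty-input score (the efficiency property), and a patch that never changes the score receives zero, which is what makes the allocation "fair" in the game-theoretic sense.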


[56] Mimicking Human Visual Development for Learning Robust Image Representations cs.CV

Ankita Raj, Kaashika Prajaapat, Tapan Kumar Gandhi, Chetan Arora

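TL;DR: Inspired by human visual development, this paper proposes a progressive blurring curriculum to improve the generalization and robustness of convolutional neural networks (CNNs). Training starts on highly blurred images and the blur is gradually reduced as training proceeds, which pushes the network to learn global structure before high-frequency artifacts.

Details

Motivation: Modern CNNs still fall short of the human visual system at adapting to changes in the input distribution. The paper mimics how human infants progress from low to high visual acuity in order to address CNNs' lack of robustness to distribution shifts and noisy inputs.

Result: On CIFAR-10-C and ImageNet-100-C, the method reduces mean corruption error (mCE) by up to 8.30% and 4.43% respectively compared with standard training without blurring. It also complements augmentation techniques such as CutMix and MixUp and improves robustness to both natural perturbations and adversarial attacks.

Insight: The contribution is formalizing human visual development as a structured, progressive blurring curriculum rather than applying blur augmentation at random. This challenges the earlier view that blurring early in training harms model performance, showing instead that it improves generalization with minimal impact on in-domain accuracy.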

Abstract: The human visual system is remarkably adept at adapting to changes in the input distribution, a capability modern convolutional neural networks (CNNs) still struggle to match. Drawing inspiration from the developmental trajectory of human vision, we propose a progressive blurring curriculum to improve the generalization and robustness of CNNs. Human infants are born with poor visual acuity, gradually refining their ability to perceive fine details. Mimicking this process, we begin training CNNs on highly blurred images during the initial epochs and progressively reduce the blur as training advances. This approach encourages the network to prioritize global structures over high-frequency artifacts, improving robustness against distribution shifts and noisy inputs. Challenging prior claims that blurring in the initial training epochs imposes a stimulus deficit and irreversibly harms model performance, we reveal that early-stage blurring enhances generalization with minimal impact on in-domain accuracy. Our experiments demonstrate that the proposed curriculum reduces mean corruption error (mCE) by up to 8.30% on CIFAR-10-C and 4.43% on ImageNet-100-C datasets, compared to standard training without blurring. Unlike static blur-based augmentation, which applies blurred images randomly throughout training, our method follows a structured progression, yielding consistent gains across various datasets. Furthermore, our approach complements other augmentation techniques, such as CutMix and MixUp, and enhances both natural and adversarial robustness against common attack methods. Code is available at https://github.com/rajankita/Visual_Acuity_Curriculum.
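
One way to picture a progressive blurring curriculum is as a function from the training epoch to a Gaussian blur strength. The sketch below assumes a linear decay from `max_sigma` to zero over the first `blur_epochs` epochs; the schedule shape, both parameter names, and their default values are assumptions, since the abstract says only that blur starts high and is progressively reduced. The authors' repository is the place to look for the schedule they actually use.

```python
def blur_sigma(epoch, blur_epochs=10, max_sigma=4.0):
    """Gaussian blur standard deviation to apply at a given epoch.

    Assumed schedule: linear decay from max_sigma at epoch 0 down to 0.0 at
    blur_epochs, after which training continues on unblurred images.
    """
    epoch = max(epoch, 0)
    if blur_epochs <= 0 or epoch >= blur_epochs:
        return 0.0
    return max_sigma * (1.0 - epoch / blur_epochs)


# Blur strength for a hypothetical 12-epoch run.
schedule = [blur_sigma(epoch) for epoch in range(12)]
```

In a training loop, the returned value would parameterize a Gaussian blur applied to each batch before the forward pass, in contrast to static augmentation that samples blur at random throughout training.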


[57] Unified Semantic Transformer for 3D Scene Understanding cs.CV

Sebastian Koch, Johanna Wald, Hide Matsuki, Pedro Hermosilla, Timo Ropinski

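TL;DR: This paper presents UNITE, a Unified Semantic Transformer for 3D scene understanding. It is a feed-forward neural network that handles multiple 3D semantic tasks within a single model, including scene segmentation, instance embeddings, open-vocabulary features, and affordance and articulation prediction. It runs end to end from RGB images alone, is fast, and reaches state-of-the-art performance on several tasks.

Details

Motivation: Existing 3D scene understanding models are typically developed for specific tasks, which limits them. The paper addresses this fragmentation with a single model that handles diverse 3D semantic tasks, to cope with the inherent complexity of real-world scenes.

Result: UNITE achieves state-of-the-art performance on several different 3D semantic tasks, in many cases surpassing task-specific models as well as methods that rely on ground-truth 3D geometry.

Insight: The main contribution is a unified Transformer architecture for multiple 3D semantic tasks, trained with a combination of 2D knowledge distillation, self-supervision, and novel multi-view losses that enforce 3D view consistency. Viewed objectively, combining multi-task learning with 3D scene understanding, and efficiently predicting 3D semantic attributes from 2D RGB images alone, is a promising direction.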

Abstract: Holistic 3D scene understanding involves capturing and parsing unstructured 3D environments. Due to the inherent complexity of the real world, existing models have predominantly been developed as, and limited to, task-specific solutions. We introduce UNITE, a Unified Semantic Transformer for 3D scene understanding, a novel feed-forward neural network that unifies a diverse set of 3D semantic tasks within a single model. Our model operates on unseen scenes in a fully end-to-end manner and only takes a few seconds to infer the full 3D semantic geometry. Our approach is capable of directly predicting multiple semantic attributes, including 3D scene segmentation, instance embeddings, open-vocabulary features, as well as affordance and articulations, solely from RGB images. The method is trained using 2D distillation, relies heavily on self-supervision, and leverages novel multi-view losses designed to ensure 3D view consistency. We demonstrate that UNITE achieves state-of-the-art performance on several different semantic tasks and even outperforms task-specific models, in many cases surpassing methods that operate on ground truth 3D geometry. See the project website at unite-page.github.io


[58] Broadening View Synthesis of Dynamic Scenes from Constrained Monocular Videos cs.CV

Le Jiang, Shaotong Zhu, Yedi Luo, Shayda Moezzi, Sarah Ostadabbas

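TL;DR: This paper proposes ExpanDyNeRF, a dynamic Neural Radiance Field (NeRF) framework that addresses the unstable and unrealistic renderings existing dynamic NeRF methods produce under significant viewpoint deviations. It uses Gaussian splatting priors and a pseudo-ground-truth generation strategy to achieve realistic view synthesis under large-angle rotations, and it introduces SynDM, the first synthetic multiview dataset for dynamic scenes, for validation.

Details

Motivation: Existing dynamic NeRF methods degrade when the viewpoint deviates significantly, producing unstable and unrealistic outputs, so a robust view synthesis method that handles large-angle rotations is needed.

Result: On the synthetic SynDM dataset and on real-world datasets, ExpanDyNeRF significantly outperforms existing dynamic NeRF methods in rendering fidelity under extreme viewpoint shifts, reaching state-of-the-art (SOTA) results.

Insight: The contributions include Gaussian splatting priors and a pseudo-ground-truth generation strategy to strengthen large-angle view synthesis, and SynDM, the first synthetic dynamic multiview dataset with explicit side-view supervision, which provides a new benchmark and training data for dynamic scene reconstruction.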

Abstract: In dynamic Neural Radiance Fields (NeRF) systems, state-of-the-art novel view synthesis methods often fail under significant viewpoint deviations, producing unstable and unrealistic renderings. To address this, we introduce Expanded Dynamic NeRF (ExpanDyNeRF), a monocular NeRF framework that leverages Gaussian splatting priors and a pseudo-ground-truth generation strategy to enable realistic synthesis under large-angle rotations. ExpanDyNeRF optimizes density and color features to improve scene reconstruction from challenging perspectives. We also present the Synthetic Dynamic Multiview (SynDM) dataset, the first synthetic multiview dataset for dynamic scenes with explicit side-view supervision, created using a custom GTA V-based rendering pipeline. Quantitative and qualitative results on SynDM and real-world datasets demonstrate that ExpanDyNeRF significantly outperforms existing dynamic NeRF methods in rendering fidelity under extreme viewpoint shifts. Further details are provided in the supplementary materials.


[59] DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning cs.CV | cs.AI

Nakamasa Inoue, Kanoko Goto, Masanari Oi, Martyna Gruszka, Mahiro Ukai

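TL;DR: This paper proposes DISCODE, a Distribution-Aware Score Decoder that makes automatic evaluation of image captioning more robust. The method requires no finetuning: it introduces an Adaptive Test-Time loss with a Gaussian prior distribution and optimizes the evaluation score at test time so that it agrees better with human judgments across domains. The authors also build MCEval, a multi-domain caption evaluation benchmark covering six domains, to test the robustness of evaluation metrics.

Details

Motivation: Although large vision-language models perform well on multimodal tasks, robust automatic evaluation of image captions under domain shift remains challenging, and existing methods struggle to stay consistent with human judgments.

Result: Experiments show that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric on MCEval and on four representative existing benchmarks.

Insight: The contributions include a test-time adaptive evaluation approach, an Adaptive Test-Time loss based on a Gaussian prior together with its analytical solution, and a multi-domain evaluation benchmark; together these improve the domain robustness and efficiency of evaluation.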

Abstract: Large vision-language models (LVLMs) have shown impressive performance across a broad range of multimodal tasks. However, robust image caption evaluation using LVLMs remains challenging, particularly under domain-shift scenarios. To address this issue, we introduce the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores better aligned with human judgments across diverse domains. The core idea behind DISCODE lies in its test-time adaptive evaluation approach, which introduces the Adaptive Test-Time (ATT) loss, leveraging a Gaussian prior distribution to improve robustness in evaluation score estimation. This loss is efficiently minimized at test time using an analytical solution that we derive. Furthermore, we introduce the Multi-domain Caption Evaluation (MCEval) benchmark, a new image captioning evaluation benchmark covering six distinct domains, designed to assess the robustness of evaluation metrics. In our experiments, we demonstrate that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric across MCEval and four representative existing benchmarks.


[60] S2D: Sparse-To-Dense Keymask Distillation for Unsupervised Video Instance Segmentation cs.CV

Leon Sick, Lukas Hoyer, Dominik Engel, Pedro Hermosilla, Timo Ropinski

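TL;DR: This paper proposes S2D (Sparse-To-Dense Keymask Distillation), an unsupervised video instance segmentation method trained entirely on real video data. It uses deep motion priors to identify high-quality keymasks, trains a segmentation model on the sparse keymask pseudo-annotations for implicit mask propagation with the help of a Temporal DropLoss, and finally trains the model on the resulting dense label set, outperforming the current state of the art on multiple benchmarks.

Details

Motivation: Current state-of-the-art unsupervised video instance segmentation relies heavily on synthetic video data generated from object-centric image datasets such as ImageNet. Synthesizing video by artificially shifting and scaling image instance masks cannot accurately model realistic motion (perspective changes, movement of parts of one or multiple instances, or camera motion). The paper addresses this with an unsupervised model trained entirely on real video data.

Result: On multiple benchmarks the method outperforms current state-of-the-art unsupervised video instance segmentation methods.

Insight: The contribution is unsupervised training entirely on real video data: deep motion priors identify high-quality keymasks to establish temporal coherence, and a sparse-to-dense distillation approach with a Temporal DropLoss trains a segmentation model for implicit mask propagation, avoiding the limitations of synthetic data and improving segmentation quality.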

Abstract: In recent years, the state-of-the-art in unsupervised video instance segmentation has heavily relied on synthetic video data, generated from object-centric image datasets such as ImageNet. However, video synthesis by artificially shifting and scaling image instance masks fails to accurately model realistic motion in videos, such as perspective changes, movement by parts of one or multiple instances, or camera motion. To tackle this issue, we propose an unsupervised video instance segmentation model trained exclusively on real video data. We start from unsupervised instance segmentation masks on individual video frames. However, these single-frame segmentations exhibit temporal noise and their quality varies through the video. Therefore, we establish temporal coherence by identifying high-quality keymasks in the video by leveraging deep motion priors. The sparse keymask pseudo-annotations are then used to train a segmentation model for implicit mask propagation, for which we propose a Sparse-To-Dense Distillation approach aided by a Temporal DropLoss. After training the final model on the resulting dense labelset, our approach outperforms the current state-of-the-art across various benchmarks.


[61] A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning cs.CV | cs.RO

Zixin Zhang, Kanghao Chen, Hanqing Wang, Hongfei Zhang, Harold Haodong Chen

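TL;DR: This paper proposes A4-Agent, an agentic framework for zero-shot affordance reasoning. It decouples affordance prediction into three stages coordinated across specialized foundation models: a Dreamer (visualizes the interaction), a Thinker (decides which object part to interact with), and a Spotter (precisely locates the interaction region). It is a training-free, zero-shot approach that exploits the complementary strengths of pre-trained models.

Details

Motivation: Existing end-to-end models couple high-level reasoning and low-level grounding in a single pipeline and depend on training over annotated datasets, which leads to poor generalization to novel objects and unseen environments. The paper moves beyond this paradigm to address the generalization problem.

Result: The zero-shot framework significantly outperforms state-of-the-art supervised methods on multiple benchmarks and generalizes robustly to real-world scenes.

Insight: The main contribution is decoupling the complex affordance prediction task into three interpretable stages, each handled by a different foundation model (how, what, where), and coordinating them agentically for zero-shot inference, which avoids task-specific fine-tuning and dependence on annotated data.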

Abstract: Affordance prediction, which identifies interaction regions on objects based on language instructions, is critical for embodied AI. Prevailing end-to-end models couple high-level reasoning and low-level grounding into a single monolithic pipeline and rely on training over annotated datasets, which leads to poor generalization on novel objects and unseen environments. In this paper, we move beyond this paradigm by proposing A4-Agent, a training-free agentic framework that decouples affordance prediction into a three-stage pipeline. Our framework coordinates specialized foundation models at test time: (1) a $\textbf{Dreamer}$ that employs generative models to visualize $\textit{how}$ an interaction would look; (2) a $\textbf{Thinker}$ that utilizes large vision-language models to decide $\textit{what}$ object part to interact with; and (3) a $\textbf{Spotter}$ that orchestrates vision foundation models to precisely locate $\textit{where}$ the interaction area is. By leveraging the complementary strengths of pre-trained models without any task-specific fine-tuning, our zero-shot framework significantly outperforms state-of-the-art supervised methods across multiple benchmarks and demonstrates robust generalization to real-world settings.


[62] SuperCLIP: CLIP with Simple Classification Supervision cs.CV

Weiheng Zhao, Zilong Huang, Jiashi Feng, Xinggang Wang

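TL;DR: SuperCLIP augments CLIP with classification-based supervision. On top of contrastive learning it adds a lightweight linear layer that uses token-level cues to improve fine-grained visual-textual alignment, improving zero-shot classification, image-text retrieval, and purely visual tasks without additional annotated data or a significant increase in compute.

Details

Motivation: Models such as CLIP underutilize the fine-grained semantic signals in text, especially with long, detailed captions. Their training objective optimizes only global image-text similarity and lacks token-level supervision, which limits fine-grained visual-textual alignment.

Result: Experiments show that SuperCLIP consistently improves zero-shot classification, image-text retrieval, and purely visual tasks. The gains hold whether the model is trained on original web data or on rich re-captioned data, indicating that it recovers textual supervision in both cases. It also alleviates CLIP's performance drop under small-batch training.

Insight: The contribution is combining classification supervision with contrastive learning: a lightweight linear layer introduces token-level supervision and achieves fine-grained alignment at very low compute cost (total FLOPs increase by only 0.077%) without relying on large batch sizes, strengthening the model's use of textual detail.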

Abstract: Contrastive Language-Image Pretraining (CLIP) achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. However, recent findings show that CLIP-like models still underutilize fine-grained semantic signals in text, and this issue becomes even more pronounced when dealing with long and detailed captions. This stems from CLIP’s training objective, which optimizes only global image-text similarity and overlooks token-level supervision - limiting its ability to achieve fine-grained visual-text alignment. To address this, we propose SuperCLIP, a simple yet effective framework that augments contrastive learning with classification-based supervision. By adding only a lightweight linear layer to the vision encoder, SuperCLIP leverages token-level cues to enhance visual-textual alignment - with just a 0.077% increase in total FLOPs, and no need for additional annotated data. Experiments show that SuperCLIP consistently improves zero-shot classification, image-text retrieval, and purely visual tasks. These gains hold regardless of whether the model is trained on original web data or rich re-captioned data, demonstrating SuperCLIP’s ability to recover textual supervision in both cases. Furthermore, SuperCLIP alleviates CLIP’s small-batch performance drop through classification-based supervision that avoids reliance on large batch sizes. Code and models will be made open source.


[63] SignIT: A Comprehensive Dataset and Multimodal Analysis for Italian Sign Language Recognition cs.CV

Alessia Micieli, Giovanni Maria Farinella, Francesco Ragusa

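TL;DR: This paper presents SignIT, a new dataset for research on Italian Sign Language (LIS) recognition. It contains 644 videos (3.33 hours in total), manually annotated with 94 distinct sign classes belonging to 5 macro-categories (Animals, Food, Colors, Emotions, and Family), along with extracted 2D keypoints for the users' hands, face, and body. The paper also establishes a sign recognition benchmark, evaluates several state-of-the-art models, and analyzes how temporal information, 2D keypoints, and RGB frames affect model performance.

Details

Motivation: To address the lack of a comprehensive dataset and benchmark for Italian Sign Language recognition and so encourage research in this direction.

Result: Several state-of-the-art models were evaluated on SignIT. The results show their limitations on this challenging LIS dataset; the abstract does not give specific performance figures.

Insight: The contribution is a multimodal (video and keypoints) Italian Sign Language dataset with a benchmark analysis. Viewed objectively, grouping sign classes into macro-categories and considering both RGB frames and 2D keypoints are useful reference points for follow-up work.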

Abstract: In this work, we present SignIT, a new dataset to study the task of Italian Sign Language (LIS) recognition. The dataset is composed of 644 videos covering 3.33 hours. We manually annotated videos considering a taxonomy of 94 distinct sign classes belonging to 5 macro-categories: Animals, Food, Colors, Emotions and Family. We also extracted 2D keypoints related to the hands, face and body of the users. With the dataset, we propose a benchmark for the sign recognition task, adopting several state-of-the-art models and showing how temporal information, 2D keypoints and RGB frames can influence the performance of these models. Results show the limitations of these models on this challenging LIS dataset. We release data and annotations at the following link: https://fpv-iplab.github.io/SignIT/.


[64] Native Intelligence Emerges from Large-Scale Clinical Practice: A Retinal Foundation Model with Deployment Efficiency cs.CV

Jia Guo, Jiawei Du, Shengzhu Yang, Shuai Lu, Wenquan Cheng

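TL;DR: This paper presents ReVision, a retinal foundation model that learns from large-scale real-world clinical practice. It is trained on 485,980 color fundus photographs and their corresponding diagnostic reports, collected through a decade-long telemedicine program spanning 162 medical institutions in China, and achieves efficient zero-shot disease detection and cross-task generalization without task-specific optimization.

Details

Motivation: Current retinal foundation models are constrained by curated research datasets that lack authentic clinical context, and each application requires extensive task-specific optimization, which limits deployment efficiency in low-resource settings. The paper aims to overcome these barriers by building clinical native intelligence directly from real-world medical practice.

Result: Across 27 ophthalmic benchmarks, ReVision demonstrates deployment efficiency. In the zero-shot setting it reaches an average AUROC of 0.946 on 12 public benchmarks and 0.952 on 3 independent clinical cohorts. With minimal adaptation it matches extensively fine-tuned alternatives while requiring orders of magnitude fewer trainable parameters and labeled examples. In a prospective reader study with 33 ophthalmologists, its zero-shot assistance improved diagnostic accuracy by 14.8% across all experience levels.

Insight: The core contribution is learning directly from a large-scale telemedicine program, treated as a natural reservoir of clinical image interpretation, and using the natural alignment between images and diagnostic reports to build a foundation model without extra annotation. This shows that clinical native intelligence can be extracted directly from clinical archives to build medical AI systems suited to a range of low-resource settings.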

Abstract: Current retinal foundation models remain constrained by curated research datasets that lack authentic clinical context, and require extensive task-specific optimization for each application, limiting their deployment efficiency in low-resource settings. Here, we show that these barriers can be overcome by building clinical native intelligence directly from real-world medical practice. Our key insight is that large-scale telemedicine programs, where expert centers provide remote consultations across distributed facilities, represent a natural reservoir for learning clinical image interpretation. We present ReVision, a retinal foundation model that learns from the natural alignment between 485,980 color fundus photographs and their corresponding diagnostic reports, accumulated through a decade-long telemedicine program spanning 162 medical institutions across China. Through extensive evaluation across 27 ophthalmic benchmarks, we demonstrate that ReVision enables deployment efficiency with minimal local resources. Without any task-specific training, ReVision achieves zero-shot disease detection with an average AUROC of 0.946 across 12 public benchmarks and 0.952 on 3 independent clinical cohorts. When minimal adaptation is feasible, ReVision matches extensively fine-tuned alternatives while requiring orders of magnitude fewer trainable parameters and labeled examples. The learned representations also transfer effectively to new clinical sites, imaging domains, imaging modalities, and systemic health prediction tasks. In a prospective reader study with 33 ophthalmologists, ReVision’s zero-shot assistance improved diagnostic accuracy by 14.8% across all experience levels. These results demonstrate that clinical native intelligence can be directly extracted from clinical archives without any further annotation to build medical AI systems suited to various low-resource settings.
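
Since the headline numbers above are AUROC values, a small reference implementation may help in reading them. The sketch below computes AUROC through its pairwise interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. The labels and scores are made up, and this quadratic-time version is meant for illustration rather than for large cohorts.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise positive/negative comparison."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up disease labels (1 = disease present) and model scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.1]
example_auroc = auroc(labels, scores)
```

In this toy case one diseased image (score 0.6) ranks below one healthy image (score 0.7), so 8 of the 9 positive/negative pairs are ordered correctly.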


[65] CAPRMIL: Context-Aware Patch Representations for Multiple Instance Learning cs.CV | cs.AI

Andreas Lolos, Theofilos Christodoulou, Aris L. Moustakas, Stergios Christodoulidis, Maria Vakalopoulou

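TL;DR: This paper proposes CAPRMIL, a new multiple instance learning (MIL) framework designed for whole-slide image (WSI) analysis in computational pathology. It projects patch features into a small set of global context-aware tokens and applies multi-head self-attention, injecting global context at linear computational complexity to produce rich context-aware patch embeddings. Paired with a simple mean aggregator, it matches state-of-the-art performance on multiple public pathology benchmarks.

Details

Motivation: Because whole-slide images are enormous and pixel-level annotations are scarce, MIL has become the principal weakly supervised framework in computational pathology. Existing methods usually rely on complex attention-based aggregators to learn correlations between instances, which raises computational cost and parameter count. The paper's motivation is a more efficient, aggregator-agnostic framework that removes the complexity of correlation learning from the aggregator.

Result: On multiple public pathology benchmarks, CAPRMIL paired with a simple mean aggregator matches state-of-the-art slide-level performance. Compared with state-of-the-art MIL methods, it reduces the total number of trainable parameters by 48%-92.8% and inference FLOPs by 52%-99%, and it ranks among the best models on GPU memory efficiency and training time.

Insight: The core contribution is an effective and scalable alternative to complex pooling: learning rich, context-aware instance representations before aggregation. Concretely, features are extracted with a frozen patch encoder, projected into a small set of global context/morphology-aware tokens, and enriched with global context through multi-head self-attention at linear complexity, producing high-quality patch embeddings. Moving correlation learning into the representation stage lets the downstream aggregator be very simple, which sharply reduces model complexity and compute.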

Abstract: In computational pathology, weak supervision has become the standard for deep learning due to the gigapixel scale of WSIs and the scarcity of pixel-level annotations, with Multiple Instance Learning (MIL) established as the principal framework for slide-level model training. In this paper, we introduce a novel setting for MIL methods, inspired by proceedings in Neural Partial Differential Equation (PDE) Solvers. Instead of relying on complex attention-based aggregation, we propose an efficient, aggregator-agnostic framework that removes the complexity of correlation learning from the MIL aggregator. CAPRMIL produces rich context-aware patch embeddings that promote effective correlation learning on downstream tasks. By projecting patch features – extracted using a frozen patch encoder – into a small set of global context/morphology-aware tokens and utilizing multi-head self-attention, CAPRMIL injects global context with linear computational complexity with respect to the bag size. Paired with a simple Mean MIL aggregator, CAPRMIL matches state-of-the-art slide-level performance across multiple public pathology benchmarks, while reducing the total number of trainable parameters by 48%-92.8% versus SOTA MILs, lowering FLOPs during inference by 52%-99%, and ranking among the best models on GPU memory efficiency and training time. Our results indicate that learning rich, context-aware instance representations before aggregation is an effective and scalable alternative to complex pooling for whole-slide analysis. Our code is available at https://github.com/mandlos/CAPRMIL
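
The claim of linear complexity with respect to bag size can be illustrated by counting attention score entries. The sketch below compares full self-attention over N patches with a hypothetical token-bottleneck layout in which patches are summarized into K tokens, the tokens attend to each other, and the result is written back to the patches. That three-step layout, the function names, and K = 16 are assumptions for illustration and are not confirmed as CAPRMIL's exact design.

```python
def full_attention_pairs(n_patches):
    """Score entries when every patch attends to every other patch."""
    return n_patches * n_patches


def bottleneck_attention_pairs(n_patches, n_tokens):
    """Score entries when patches communicate only through a few tokens.

    Assumed layout: patches -> tokens, tokens <-> tokens, tokens -> patches.
    """
    patch_to_token = n_patches * n_tokens
    token_to_token = n_tokens * n_tokens
    token_to_patch = n_tokens * n_patches
    return patch_to_token + token_to_token + token_to_patch


# Compare costs for growing bag sizes with a fixed, hypothetical token count.
N_TOKENS = 16
comparison = {
    n: (full_attention_pairs(n), bottleneck_attention_pairs(n, N_TOKENS))
    for n in (1_000, 10_000, 100_000)
}
```

With K fixed, the bottleneck count grows as 2NK plus a constant, so a tenfold larger bag costs roughly ten times more, whereas the full-attention count grows a hundredfold.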


[66] FoodLogAthl-218: Constructing a Real-World Food Image Dataset Using Dietary Management Applications cs.CV | cs.MM

Mitsuki Watanabe, Sosuke Amano, Kiyoharu Aizawa, Yoko Yamakata

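TL;DR: This paper introduces FoodLogAthl-218, a food image dataset collected from a real-world dietary management application. It contains 6,925 images, 218 food categories, and 14,349 bounding boxes, with rich metadata. The paper proposes three benchmark tasks (standard classification, incremental fine-tuning, and context-aware classification) and evaluates them with large multimodal models.

Details

Motivation: Most existing food image classification models rely on web-crawled images, which differ from the meal photos users actually take and therefore support dietary management applications poorly. The paper builds a food image dataset from real-world usage to address this data bias.

Result: The paper evaluates large multimodal models on the three tasks using FoodLogAthl-218, but the abstract gives no specific quantitative results or SOTA comparisons.

Insight: The contribution lies in how the dataset is built: annotation starts from real user-submitted meal photos rather than web collection guided by predefined categories, which yields greater intra-class diversity, a natural frequency distribution of meal types, and more realistic image styles. The paper also introduces two new tasks aimed at practical use, incremental fine-tuning and context-aware classification.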

Abstract: Food image classification models are crucial for dietary management applications because they reduce the burden of manual meal logging. However, most publicly available datasets for training such models rely on web-crawled images, which often differ from users’ real-world meal photos. In this work, we present FoodLogAthl-218, a food image dataset constructed from real-world meal records collected through the dietary management application FoodLog Athl. The dataset contains 6,925 images across 218 food categories, with a total of 14,349 bounding boxes. Rich metadata, including meal date and time, anonymized user IDs, and meal-level context, accompany each image. Unlike conventional datasets, where a predefined class set guides web-based image collection, our data begins with user-submitted photos, and labels are applied afterward. This yields greater intra-class diversity, a natural frequency distribution of meal types, and casual, unfiltered images intended for personal use rather than public sharing. In addition to (1) a standard classification benchmark, we introduce two FoodLog-specific tasks: (2) an incremental fine-tuning protocol that follows the temporal stream of users’ logs, and (3) a context-aware classification task where each image contains multiple dishes, and the model must classify each dish by leveraging the overall meal context. We evaluate these tasks using large multimodal models (LMMs). The dataset is publicly available at https://huggingface.co/datasets/FoodLog/FoodLogAthl-218.


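[67] LLM-driven Knowledge Enhancement for Multimodal Cancer Survival Prediction cs.CV [PDF]

Chenyu Zhao, Yingxue Xu, Fengtao Zhou, Yihui Wang, Hao Chen

TL;DR: This paper proposes KEMM, an LLM-driven knowledge-enhanced multimodal model for cancer survival prediction. The model integrates expert reports provided by pathologists and prognostic background knowledge generated by an LLM to improve the extraction of discriminative features from high-dimensional, redundant pathology images and genomic data, and it uses a knowledge-enhanced cross-modal attention module to align the modalities effectively, thereby improving prediction performance.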

Details

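Motivation: Current multimodal survival prediction methods rely on high-dimensional, redundant pathology images and genomic data, which makes it difficult to extract discriminative features and align the modalities. In addition, a simple survival follow-up label alone is insufficient to supervise such a complex task.

Result: Extensive experiments on five datasets show that KEMM achieves state-of-the-art performance.

Insight: The novelty lies in using an LLM to refine expert reports into concise clinical diagnostic statements and to generate prognostic background knowledge as additional supervision. The paper also designs a knowledge-enhanced cross-modal attention module that guides the network toward discriminative, survival-relevant features. This offers a new approach to enhancing multimodal learning with external knowledge.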

Abstract: Current multimodal survival prediction methods typically rely on pathology images (WSIs) and genomic data, both of which are high-dimensional and redundant, making it difficult to extract discriminative features from them and align different modalities. Moreover, using a simple survival follow-up label is insufficient to supervise such a complex task. To address these challenges, we propose KEMM, an LLM-driven Knowledge-Enhanced Multimodal Model for cancer survival prediction, which integrates expert reports and prognostic background knowledge. 1) Expert reports, provided by pathologists on a case-by-case basis and refined by a large language model (LLM), offer succinct and clinically focused diagnostic statements. This information may typically suggest different survival outcomes. 2) Prognostic background knowledge (PBK), generated concisely by the LLM, provides valuable prognostic background knowledge on different cancer types, which also enhances survival prediction. To leverage this knowledge, we introduce the knowledge-enhanced cross-modal (KECM) attention module. KECM can effectively guide the network to focus on discriminative and survival-relevant features from highly redundant modalities. Extensive experiments on five datasets demonstrate that KEMM achieves state-of-the-art performance. The code will be released upon acceptance.


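[68] TUMTraf EMOT: Event-Based Multi-Object Tracking Dataset and Baseline for Traffic Scenarios cs.CV [PDF]

Mengyu Li, Xingcheng Zhou, Guang Chen, Alois Knoll, Hu Cao

TL;DR: This paper introduces TUMTraf EMOT, the first event-camera-based multi-object tracking dataset designed for intelligent transportation systems, covering vehicle and pedestrian detection and tracking. The authors also establish a tracking-by-detection benchmark on the dataset and develop a specialized feature extractor, achieving excellent performance.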

Details

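Motivation: Conventional frame-based cameras perform poorly under dim lighting and high-speed motion. The work exploits the low latency, high dynamic range, and high temporal resolution of event cameras to fill the dataset gap in event-based vision research for intelligent transportation systems.

Result: The tracking-by-detection benchmark established on the proposed TUMTraf EMOT dataset achieves excellent performance with the specialized feature extractor.

Insight: The novelty lies in releasing the first event-camera-based multi-object tracking dataset for intelligent transportation systems, together with a corresponding benchmark and feature extractor, laying a foundation for research on event-based vision in dynamic traffic scenes.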

Abstract: In Intelligent Transportation Systems (ITS), multi-object tracking is primarily based on frame-based cameras. However, these cameras tend to perform poorly under dim lighting and high-speed motion conditions. Event cameras, characterized by low latency, high dynamic range and high temporal resolution, have considerable potential to mitigate these issues. Compared to frame-based vision, there are far fewer studies on event-based vision. To address this research gap, we introduce an initial pilot dataset tailored for event-based ITS, covering vehicle and pedestrian detection and tracking. We establish a tracking-by-detection benchmark with a specialized feature extractor based on this dataset, achieving excellent performance.


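[69] FakeRadar: Probing Forgery Outliers to Detect Unknown Deepfake Videos cs.CV | cs.AI [PDF]

Zhaolun Li, Jichang Li, Yinqi Cai, Junye Chen, Xiaonan Luo

TL;DR: This paper proposes FakeRadar, a novel deepfake video detection framework that addresses the challenge of cross-domain generalization in real-world scenarios. The framework proactively explores the feature space through Forgery Outlier Probing and optimizes the detector with Outlier-Guided Tri-Training to distinguish real videos, known forgeries, and unknown forgeries.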

Details

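Motivation: Existing deepfake detection methods typically rely on manipulation-specific cues. They perform well on known forgery types but generalize poorly to emerging forgery techniques. The paper aims to overcome this limitation and improve the detector's ability to adapt to unseen forgery patterns.

Result: Experiments show that FakeRadar outperforms existing methods on multiple deepfake video detection benchmarks, performing especially well in cross-domain evaluations and handling a variety of emerging forgery techniques.

Insight: The novelties are Forgery Outlier Probing, which simulates unknown forgery artifacts through dynamic subcluster modeling and cluster-conditional outlier generation, and Outlier-Guided Tri-Training, which combines outlier-driven contrastive learning with an outlier-conditioned cross-entropy loss. These methods proactively explore differences in feature distributions and strengthen generalization to unknown forgeries.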

Abstract: In this paper, we propose FakeRadar, a novel deepfake video detection framework designed to address the challenges of cross-domain generalization in real-world scenarios. Existing detection methods typically rely on manipulation-specific cues, performing well on known forgery types but exhibiting severe limitations against emerging manipulation techniques. This poor generalization stems from their inability to adapt effectively to unseen forgery patterns. To overcome this, we leverage large-scale pretrained models (e.g. CLIP) to proactively probe the feature space, explicitly highlighting distributional gaps between real videos, known forgeries, and unseen manipulations. Specifically, FakeRadar introduces Forgery Outlier Probing, which employs dynamic subcluster modeling and cluster-conditional outlier generation to synthesize outlier samples near boundaries of estimated subclusters, simulating novel forgery artifacts beyond known manipulation types. Additionally, we design Outlier-Guided Tri-Training, which optimizes the detector to distinguish real, fake, and outlier samples using proposed outlier-driven contrastive learning and outlier-conditioned cross-entropy losses. Experiments show that FakeRadar outperforms existing methods across various benchmark datasets for deepfake video detection, particularly in cross-domain evaluations, by handling the variety of emerging manipulation techniques.


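[70] WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling cs.CV | cs.GR [PDF]

Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang

TL;DR: WorldPlay is a streaming video diffusion model for real-time, interactive world modeling with long-term geometric consistency. Through three key innovations (Dual Action Representation, Reconstituted Context Memory, and Context Forcing distillation), it resolves the trade-off between speed and memory in existing methods and generates long-horizon 720p video at 24 FPS.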

Details

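Motivation: Existing methods for real-time interactive world modeling struggle to balance generation speed with long-term geometric consistency. They typically suffer from memory attenuation and error drift, which limits practical use.

Result: WorldPlay performs well across diverse scenes, generates 720p video at 24 FPS, surpasses existing techniques in consistency, and shows strong generalization.

Insight: The novelties are: a Dual Action Representation for robust control from user actions; Reconstituted Context Memory, which alleviates memory attenuation by dynamically rebuilding context from past frames and applying temporal reframing; and Context Forcing, a distillation method that aligns memory context between teacher and student models, preserving the ability to use long-range information and preventing error drift while achieving real-time performance.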

Abstract: This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key innovations. 1) We use a Dual Action Representation to enable robust action control in response to the user’s keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware models. Aligning memory context between the teacher and student preserves the student’s capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. The project page and online demo can be found at https://3d-models.hunyuan.tencent.com/world/ and https://3d.hunyuan.tencent.com/sceneTo3D.


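[71] Distill Video Datasets into Images cs.CV [PDF]

Zhenghao Zhao, Haoxuan Wang, Kai Wang, Yuzhang Shang, Yuan Hong

TL;DR: This paper proposes Single-Frame Video set Distillation (SFVD), a new method that distills video datasets into highly informative single frames. The method converts the distilled frames into video sequences through differentiable interpolation, matches them against the original dataset, and incorporates temporal information from real videos, substantially outperforming existing methods on multiple benchmarks.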

Details

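Motivation: The main challenge in video dataset distillation is the large number of learnable parameters introduced by the temporal dimension of video, which complicates optimization and hinders convergence. The paper observes that a single frame is often sufficient to capture the discriminative semantics of a video, and therefore distills videos into single frames to address this problem.

Result: Experiments on multiple benchmarks (such as MiniUCF) show that SFVD substantially outperforms prior methods, with improvements of up to 5.3%, offering a more effective solution for video dataset distillation.

Insight: The novelty lies in capturing video semantics with single frames and incorporating temporal information through differentiable interpolation and a channel reshaping layer, which simplifies optimization and improves distillation efficiency. Viewed objectively, the method offers a new approach to video dataset compression by reducing the parameter count while preserving information.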

Abstract: Dataset distillation aims to synthesize compact yet informative datasets that allow models trained on them to achieve performance comparable to training on the full dataset. While this approach has shown promising results for image data, extending dataset distillation methods to video data has proven challenging and often leads to suboptimal performance. In this work, we first identify the core challenge in video set distillation as the substantial increase in learnable parameters introduced by the temporal dimension of video, which complicates optimization and hinders convergence. To address this issue, we observe that a single frame is often sufficient to capture the discriminative semantics of a video. Leveraging this insight, we propose Single-Frame Video set Distillation (SFVD), a framework that distills videos into highly informative frames for each class. Using differentiable interpolation, these frames are transformed into video sequences and matched with the original dataset, while updates are restricted to the frames themselves for improved optimization efficiency. To further incorporate temporal information, the distilled frames are combined with sampled real videos during the matching process through a channel reshaping layer. Extensive experiments on multiple benchmarks demonstrate that SFVD substantially outperforms prior methods, achieving improvements of up to 5.3% on MiniUCF, thereby offering a more effective solution.


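[72] AMD-HookNet++: Evolution of AMD-HookNet with Hybrid CNN-Transformer Feature Enhancement for Glacier Calving Front Segmentation cs.CV [PDF]

Fei Wu, Marcel Dreier, Nora Gourmelon, Sebastian Wind, Jianlin Zhang

TL;DR: AMD-HookNet++ is a hybrid CNN-Transformer feature enhancement method for glacier calving front segmentation. It combines a Transformer branch that captures long-range dependencies with a CNN branch that preserves local details, and introduces an enhanced spatial-channel attention module and pixel-to-pixel contrastive deep supervision, achieving state-of-the-art performance on the CaFFe benchmark dataset.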

Details

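Motivation: Pure convolutional neural networks (CNNs) struggle to maintain long-range dependencies in glacier segmentation because of their locality and translational invariance, while pure Transformer methods can produce jagged edges. The work aims to monitor glacier dynamics and positional shifts of calving fronts more accurately.

Result: On the CaFFe benchmark dataset, AMD-HookNet++ reaches an IoU of 78.2 and an HD95 of 1,318 m while maintaining a competitive MDE of 367 m, setting a new state of the art (SOTA) and producing smoother calving front delineations.

Insight: The novelties are a hybrid CNN-Transformer architecture that balances global context and local detail, an enhanced spatial-channel attention module that dynamically adjusts token relationships along the spatial and channel dimensions, and pixel-to-pixel contrastive deep supervision that integrates metric learning to optimize segmentation performance. These ideas may transfer to other remote sensing image segmentation tasks.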

Abstract: The dynamics of glaciers and ice shelf fronts significantly impact the mass balance of ice sheets and coastal sea levels. To effectively monitor glacier conditions, it is crucial to consistently estimate positional shifts of glacier calving fronts. AMD-HookNet first introduced a pure two-branch convolutional neural network (CNN) for glacier segmentation. Yet, the local nature and translational invariance of convolution operations, while beneficial for capturing low-level details, restrict the model's ability to maintain long-range dependencies. In this study, we propose AMD-HookNet++, a novel advanced hybrid CNN-Transformer feature enhancement method for segmenting glaciers and delineating calving fronts in synthetic aperture radar images. Our hybrid structure consists of two branches: a Transformer-based context branch to capture long-range dependencies, which provides global contextual information in a larger view, and a CNN-based target branch to preserve local details. To strengthen the representation of the connected hybrid features, we devise an enhanced spatial-channel attention module to foster interactions between the hybrid CNN-Transformer branches through dynamically adjusting the token relationships from both spatial and channel perspectives. Additionally, we develop a pixel-to-pixel contrastive deep supervision to optimize our hybrid model by integrating pixelwise metric learning into glacier segmentation. Through extensive experiments and comprehensive quantitative and qualitative analyses on the challenging glacier segmentation benchmark dataset CaFFe, we show that AMD-HookNet++ sets a new state of the art with an IoU of 78.2 and an HD95 of 1,318 m, while maintaining a competitive MDE of 367 m. More importantly, our hybrid model produces smoother delineations of calving fronts, resolving the issue of jagged edges typically seen in pure Transformer-based approaches.


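[73] ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking cs.CV [PDF]

Lihong Wang, Liangqi Li, Weiwei Feng, Jiamin Wu, Changtao Miao

TL;DR: This paper proposes the ViRC framework, which introduces a Reason Chunking mechanism that decomposes multimodal mathematical chain-of-thought into consecutive Critical Reasoning Units (CRUs) to simulate the step-by-step reasoning of human experts. It also builds the CRUX dataset and a progressive training strategy to strengthen multimodal mathematical reasoning.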

Details

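Motivation: Existing MLLMs typically perform textual reasoning on mathematical tasks from a static image only, overlooking dynamic visual acquisition during reasoning. Humans, by contrast, repeatedly examine the image and reason step by step to verify intermediate propositions, so this chunked reasoning pattern needs to be simulated to improve multimodal mathematical reasoning.

Result: The ViRC-7B model trained on the CRUX dataset improves on baseline models by an average of 18.8% across multiple mathematical benchmarks, reaching SOTA level.

Insight: The novelty lies in bringing Miller's Law (chunked processing) from cognitive science into multimodal reasoning. The CRU structure integrates visual information across units while keeping textual reasoning coherent within each unit, and a progressive training strategy (Instructional SFT, Practice SFT, Strategic RL) simulates human cognitive learning. The structured reasoning-unit design and the multi-stage training method are both reusable ideas.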

Abstract: CoT has significantly enhanced the reasoning ability of LLMs, but it faces challenges when extended to multimodal domains, particularly in mathematical tasks. Existing MLLMs typically perform textual reasoning solely from a single static mathematical image, overlooking dynamic visual acquisition during reasoning. In contrast, humans repeatedly examine the visual image and employ step-by-step reasoning to prove intermediate propositions. This strategy of decomposing the problem-solving process into key logical nodes adheres to Miller’s Law in cognitive science. Inspired by this insight, we propose the ViRC framework for multimodal mathematical tasks, introducing a Reason Chunking mechanism that structures multimodal mathematical CoT into consecutive Critical Reasoning Units (CRUs) to simulate human expert problem-solving patterns. CRUs ensure intra-unit textual coherence for intermediate proposition verification while integrating visual information across units to generate subsequent propositions and support structured reasoning. To this end, we present the CRUX dataset by using three visual tools and four reasoning patterns to provide explicitly annotated CRUs across multiple reasoning paths for each mathematical problem. Leveraging the CRUX dataset, we propose a progressive training strategy inspired by human cognitive learning, which includes Instructional SFT, Practice SFT, and Strategic RL, aimed at further strengthening the Reason Chunking ability of the model. The resulting ViRC-7B model achieves an 18.8% average improvement over baselines across multiple mathematical benchmarks. Code is available at https://github.com/Leon-LihongWang/ViRC.


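[74] ART: Articulated Reconstruction Transformer cs.CV [PDF]

Zizhang Li, Cheng Zhang, Zhengqin Li, Henry Howard-Jenkins, Zhaoyang Lv

TL;DR: ART is a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images. It treats articulated objects as assemblies of rigid parts, performs reconstruction through part-based prediction, and outputs interpretable physical parameters.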

Details

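Motivation: Existing articulated object reconstruction methods either depend on slow optimization or are limited to specific categories. The work aims for efficient, general-purpose reconstruction.

Result: Across multiple benchmarks, ART significantly outperforms existing baselines and sets a new SOTA for articulated object reconstruction from image inputs.

Insight: The novelties are modeling articulated objects as assemblies of parts and designing a Transformer architecture that maps sparse images to learnable part slots, from which geometry, texture, and articulation parameters are decoded jointly. The resulting reconstructions are physically interpretable and exportable.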

Abstract: We introduce ART, Articulated Reconstruction Transformer – a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images. Previous methods for articulated object reconstruction either rely on slow optimization with fragile cross-state correspondences or use feed-forward models limited to specific object categories. In contrast, ART treats articulated objects as assemblies of rigid parts, formulating reconstruction as part-based prediction. Our newly designed transformer architecture maps sparse image inputs to a set of learnable part slots, from which ART jointly decodes unified representations for individual parts, including their 3D geometry, texture, and explicit articulation parameters. The resulting reconstructions are physically interpretable and readily exportable for simulation. Trained on a large-scale, diverse dataset with per-part supervision, and evaluated across diverse benchmarks, ART achieves significant improvements over existing baselines and establishes a new state of the art for articulated object reconstruction from image inputs.


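[75] VASA-3D: Lifelike Audio-Driven Gaussian Head Avatars from a Single Image cs.CV | cs.AI [PDF]

Sicheng Xu, Guojun Chen, Jiaolong Yang, Yizhong Zhang, Yu Deng

TL;DR: VASA-3D generates an audio-driven 3D head avatar from a single portrait image. It uses the motion latent space of VASA-1 to capture subtle expression details and designs a 3D head model conditioned on that motion latent. An optimization framework customizes the model using reference head video frames synthesized from the input image, reconstructing a detailed 3D head avatar from a single image. The method produces realistic 3D talking heads and supports online generation of 512x512 free-viewpoint video at up to 75 FPS.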

Details

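Motivation: The work tackles two main challenges: capturing the subtle expression details of real human faces, and reconstructing an intricate 3D head avatar from a single portrait image.

Result: Experiments show that VASA-3D produces realistic 3D talking heads that prior art could not achieve, and supports online generation of 512x512 free-viewpoint video at up to 75 FPS.

Insight: The novelty lies in transferring the 2D motion latent space of VASA-1 to 3D by designing a 3D head model conditioned on the motion latent, combined with an optimization framework that is robust to artifacts and limited pose coverage in the generated training data. Together these enable high-quality, customizable audio-driven 3D avatars from a single image.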

Abstract: We propose VASA-3D, an audio-driven, single-shot 3D head avatar generator. This research tackles two major challenges: capturing the subtle expression details present in real human faces, and reconstructing an intricate 3D head avatar from a single portrait image. To accurately model expression details, VASA-3D leverages the motion latent of VASA-1, a method that yields exceptional realism and vividness in 2D talking heads. A critical element of our work is translating this motion latent to 3D, which is accomplished by devising a 3D head model that is conditioned on the motion latent. Customization of this model to a single image is achieved through an optimization framework that employs numerous video frames of the reference head synthesized from the input image. The optimization takes various training losses robust to artifacts and limited pose coverage in the generated training data. Our experiment shows that VASA-3D produces realistic 3D talking heads that cannot be achieved by prior art, and it supports the online generation of 512x512 free-viewpoint videos at up to 75 FPS, facilitating more immersive engagements with lifelike 3D avatars.


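[76] CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives cs.CV | cs.GR | cs.RO [PDF]

Zihan Wang, Jiashun Wang, Jeff Tan, Yiwen Zhao, Jessica Hodgins

TL;DR: CRISP recovers simulatable human motion and scene geometry from monocular video. It fits planar primitives to a point cloud reconstruction to obtain clean, convex, simulation-ready geometry, uses human-scene contact modeling to reconstruct occluded parts of the geometry, and finally ensures physical plausibility by driving a humanoid controller with reinforcement learning.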

Details

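Motivation: Existing joint human-scene reconstruction methods rely on data-driven priors, lack physical constraints, or reconstruct noisy geometry that causes motion tracking to fail. The work aims to recover physically simulatable human motion and scenes from monocular video.

Result: On human-centric video benchmarks such as EMDB and PROX, the method reduces the motion tracking failure rate from 55.2% to 6.9% while increasing RL simulation throughput by 43%. It is also validated on in-the-wild videos, including casually captured videos, Internet videos, and Sora-generated videos.

Insight: The novelty lies in reconstructing simulation-ready geometry by fitting planar primitives through a simple clustering pipeline over depth, normals, and optical flow, handling occlusion with human-scene contact modeling, and using reinforcement learning to ensure physical plausibility. This provides a scalable solution for real-to-sim applications in robotics and AR/VR.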

Abstract: We introduce CRISP, a method that recovers simulatable human motion and scene geometry from monocular video. Prior work on joint human-scene reconstruction relies on data-driven priors and joint optimization with no physics in the loop, or recovers noisy geometry with artifacts that cause motion tracking policies with scene interactions to fail. In contrast, our key insight is to recover convex, clean, and simulation-ready geometry by fitting planar primitives to a point cloud reconstruction of the scene, via a simple clustering pipeline over depth, normals, and flow. To reconstruct scene geometry that might be occluded during interactions, we make use of human-scene contact modeling (e.g., we use human posture to reconstruct the occluded seat of a chair). Finally, we ensure that human and scene reconstructions are physically-plausible by using them to drive a humanoid controller via reinforcement learning. Our approach reduces motion tracking failure rates from 55.2% to 6.9% on human-centric video benchmarks (EMDB, PROX), while delivering a 43% faster RL simulation throughput. We further validate it on in-the-wild videos including casually-captured videos, Internet videos, and even Sora-generated videos. This demonstrates CRISP’s ability to generate physically-valid human motion and interaction environments at scale, greatly advancing real-to-sim applications for robotics and AR/VR.


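[77] MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives cs.CV [PDF]

Sihui Ji, Xi Chen, Shuai Yang, Xin Tao, Pengfei Wan

TL;DR: This paper proposes MemFlow, which updates the memory bank by dynamically retrieving the historical frames most relevant to the current text prompt and activates only the most relevant memory tokens in the attention layers, achieving both content consistency and efficiency in long video generation.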

Details

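Motivation: The core challenge in streaming video generation is keeping content consistent over long contexts. Existing methods compress historical frames with fixed strategies, which cannot accommodate the differing needs that different video chunks have for historical cues.

Result: MemFlow achieves excellent long-context consistency at a small computational cost, only a 7.9% speed reduction compared with the memory-free baseline, and is compatible with any streaming video generation model that supports a KV cache.

Insight: The novelty lies in the retrieval-driven dynamic memory update and the sparse memory activation in the attention layers, which improve the consistency and efficiency of long video narratives and may transfer to other generation tasks that require long-sequence modeling.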

Abstract: The core challenge for streaming video generation is maintaining content consistency over long contexts, which places high demands on memory design. Most existing solutions maintain the memory by compressing historical frames with predefined strategies. However, different video chunks to be generated should refer to different historical cues, which is hard to satisfy with fixed strategies. In this work, we propose MemFlow to address this problem. Specifically, before generating the coming chunk, we dynamically update the memory bank by retrieving the most relevant historical frames with the text prompt of this chunk. This design enables narrative coherence even when a new event occurs or the scenario switches in future frames. In addition, during generation, we only activate the most relevant tokens in the memory bank for each query in the attention layers, which effectively guarantees the generation efficiency. In this way, MemFlow achieves outstanding long-context consistency with negligible computation burden (7.9% speed reduction compared with the memory-free baseline) and remains compatible with any streaming video generation model with KV cache.
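The retrieval step described in the abstract can be sketched as a top-k similarity search. This is a minimal illustration under stated assumptions: cosine similarity as the relevance score, plain Python lists as embeddings, and the helper names `cosine` and `retrieve_memory` are all hypothetical; the abstract does not specify how MemFlow scores or stores frames.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_memory(prompt_emb, frame_embs, k):
    """Return indices of the k historical frames most similar to the prompt.

    Assumed scoring rule for illustration; indices are returned in temporal
    order so the selected frames keep their original sequence.
    """
    ranked = sorted(
        range(len(frame_embs)),
        key=lambda i: cosine(prompt_emb, frame_embs[i]),
        reverse=True,
    )
    return sorted(ranked[:k])


if __name__ == "__main__":
    history = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    print(retrieve_memory([1.0, 0.0], history, k=2))
```

Re-running the retrieval with the next chunk's prompt embedding would refresh the memory bank before that chunk is generated.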


cs.MM [Back]

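[78] Generative AI for Video Translation: A Scalable Architecture for Multilingual Video Conferencing cs.MM | cs.AI | cs.CL | cs.CV [PDF]

Amirkia Rafiei Oskooei, Eren Caglar, Ibrahim Sahin, Ayse Kayabay, Mehmet S. Aktas

TL;DR: This paper proposes and evaluates a system-level generative AI framework for real-time video translation, targeting two key bottlenecks: the cumulative latency of cascaded model inference and the quadratic growth of computational complexity in multi-user scenarios. The architecture introduces a turn-taking mechanism that reduces computational complexity to linear, and a segmented processing protocol that manages inference latency to deliver a perceptually real-time experience.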

Details

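Motivation: Deploying cascaded generative AI pipelines for applications such as video translation faces system-level challenges, including the cumulative latency of sequential model inference and the quadratic computational complexity that makes multi-user video conferencing hard to scale.

Result: Performance analysis on a multi-tiered hardware setup spanning commodity (NVIDIA RTX 4060), cloud (NVIDIA T4), and enterprise (NVIDIA A100) GPUs shows that the system achieves real-time throughput (τ < 1.0) on modern hardware. A subjective user study further validates the approach, showing that a predictable initial processing delay is highly acceptable in exchange for smooth, uninterrupted playback.

Insight: The system-level novelty lies in the turn-taking mechanism, which reduces computational complexity in multi-user scenarios from O(N²) to linear, and in the segmented processing protocol for managing latency. Together they provide a validated end-to-end system design roadmap for deploying scalable, real-time generative AI applications.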

Abstract: The real-time deployment of cascaded generative AI pipelines for applications like video translation is constrained by significant system-level challenges. These include the cumulative latency of sequential model inference and the quadratic ($\mathcal{O}(N^2)$) computational complexity that renders multi-user video conferencing applications unscalable. This paper proposes and evaluates a practical system-level framework designed to mitigate these critical bottlenecks. The proposed architecture incorporates a turn-taking mechanism to reduce computational complexity from quadratic to linear in multi-user scenarios, and a segmented processing protocol to manage inference latency for a perceptually real-time experience. We implement a proof-of-concept pipeline and conduct a rigorous performance analysis across a multi-tiered hardware setup, including commodity (NVIDIA RTX 4060), cloud (NVIDIA T4), and enterprise (NVIDIA A100) GPUs. Our objective evaluation demonstrates that the system achieves real-time throughput ($\tau < 1.0$) on modern hardware. A subjective user study further validates the approach, showing that a predictable, initial processing delay is highly acceptable to users in exchange for a smooth, uninterrupted playback experience. The work presents a validated, end-to-end system design that offers a practical roadmap for deploying scalable, real-time generative AI applications in multilingual communication platforms.
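The scaling argument can be illustrated with a small worked calculation. The sketch assumes that, without turn-taking, every participant's video is translated for every other participant, and that with turn-taking only the current speaker's video is translated for the listeners. It also assumes that τ denotes processing time divided by segment duration. Both assumptions and all function names are illustrative; the abstract does not define them.

```python
def streams_all_to_all(num_participants: int) -> int:
    """Translation streams if every participant is translated for every other one."""
    return num_participants * (num_participants - 1)


def streams_turn_taking(num_participants: int) -> int:
    """Translation streams if only the active speaker is translated for the listeners."""
    return max(num_participants - 1, 0)


def real_time_factor(processing_seconds: float, segment_seconds: float) -> float:
    """Assumed definition of tau: values below 1.0 mean processing keeps up with playback."""
    if segment_seconds <= 0:
        raise ValueError("segment_seconds must be positive")
    return processing_seconds / segment_seconds


if __name__ == "__main__":
    for n in (2, 5, 10):
        print(n, streams_all_to_all(n), streams_turn_taking(n))
    print(real_time_factor(processing_seconds=4.0, segment_seconds=5.0))
```

With ten participants this gives 90 streams in the all-to-all case versus 9 with turn-taking, which is the quadratic-to-linear reduction the abstract describes.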


cs.RO [Back]

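[79] WAM-Flow: Parallel Coarse-to-Fine Motion Planning via Discrete Flow Matching for Autonomous Driving cs.RO | cs.AI | cs.CV [PDF]

Yifang Xu, Jiahao Cui, Feipeng Cai, Zhihao Zhu, Hanlin Shang

TL;DR: WAM-Flow is a vision-language-action model for autonomous driving that casts ego-trajectory planning as discrete flow matching over a structured token space. It uses parallel bidirectional denoising instead of autoregressive decoding, supports coarse-to-fine trajectory refinement, and allows a trade-off between compute cost and accuracy. A multi-stage adaptation converts a pre-trained autoregressive backbone into a non-causal flow model, combined with a metric-aligned numerical tokenizer, a geometry-aware flow objective, and a simulator-guided GRPO alignment that integrates safety, progress, and comfort rewards.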

Details

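Motivation: Conventional autoregressive decoders generate trajectories sequentially, which creates an efficiency bottleneck, and diffusion models need many inference steps. The work aims to improve the closed-loop performance and efficiency of end-to-end autonomous driving through a parallel, tunable coarse-to-fine planning paradigm.

Result: On the NAVSIM v1 benchmark, WAM-Flow reaches 89.1 PDMS with 1-step inference and 90.3 PDMS with 5-step inference, outperforming autoregressive and diffusion-based VLA baselines and showing its potential as a new paradigm for end-to-end autonomous driving.

Insight: The novelties are: introducing discrete flow matching to trajectory planning, enabling parallel bidirectional denoising and tunable coarse-to-fine refinement; a metric-aligned numerical tokenizer that preserves scalar geometry; and a simulator-guided GRPO alignment that integrates multi-objective rewards while retaining parallel generation. This offers a new approach to efficient, scalable planning models.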

Abstract: We introduce WAM-Flow, a vision-language-action (VLA) model that casts ego-trajectory planning as discrete flow matching over a structured token space. In contrast to autoregressive decoders, WAM-Flow performs fully parallel, bidirectional denoising, enabling coarse-to-fine refinement with a tunable compute-accuracy trade-off. Specifically, the approach combines a metric-aligned numerical tokenizer that preserves scalar geometry via triplet-margin learning, a geometry-aware flow objective, and a simulator-guided GRPO alignment that integrates safety, ego progress, and comfort rewards while retaining parallel generation. A multi-stage adaptation converts a pre-trained autoregressive backbone (Janus-1.5B) from causal decoding to a non-causal flow model and strengthens road-scene competence through continued multimodal pretraining. Thanks to the inherent nature of consistency model training and parallel decoding inference, WAM-Flow achieves superior closed-loop performance against autoregressive and diffusion-based VLA baselines, with 1-step inference attaining 89.1 PDMS and 5-step inference reaching 90.3 PDMS on the NAVSIM v1 benchmark. These results establish discrete flow matching as a promising new paradigm for end-to-end autonomous driving. The code will be publicly available soon.


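[80] WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving cs.RO | cs.AI | cs.CV [PDF]

Mingwang Xu, Jiahao Cui, Feipeng Cai, Hanlin Shang, Zhihao Zhu

TL;DR: This paper proposes WAM-Diff, a vision-language-action (VLA) framework for autonomous driving. The framework uses a masked diffusion model to iteratively refine a discrete sequence representing future ego-trajectories, and adds a sparse mixture-of-experts (MoE) architecture and online reinforcement learning for optimization.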

Details

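Motivation: Existing end-to-end autonomous driving systems mostly use autoregressive large language models or continuous diffusion policies, while the potential of discrete masked diffusion for trajectory generation remains largely unexplored. The paper explores masked diffusion for trajectory generation in autonomous driving as an alternative.

Result: The model scores 91.0 PDMS on NAVSIM-v1 and 89.7 EPDMS on NAVSIM-v2, demonstrating the effectiveness of masked diffusion for autonomous driving.

Insight: The main novelties are: 1) a systematic adaptation of masked diffusion to autonomous driving that supports flexible, non-causal decoding orders; 2) scalable model capacity through a sparse MoE architecture trained jointly on motion prediction and driving-oriented visual question answering; and 3) online reinforcement learning with Group Sequence Policy Optimization (GSPO) to optimize sequence-level driving rewards. This gives trajectory generation a new method that supports scenario-aware decoding strategies.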

Abstract: End-to-end autonomous driving systems based on vision-language-action (VLA) models integrate multimodal sensor inputs and language instructions to generate planning and control signals. While autoregressive large language models and continuous diffusion policies are prevalent, the potential of discrete masked diffusion for trajectory generation remains largely unexplored. This paper presents WAM-Diff, a VLA framework that employs masked diffusion to iteratively refine a discrete sequence representing future ego-trajectories. Our approach features three key innovations: a systematic adaptation of masked diffusion for autonomous driving that supports flexible, non-causal decoding orders; scalable model capacity via a sparse MoE architecture trained jointly on motion prediction and driving-oriented visual question answering (VQA); and online reinforcement learning using Group Sequence Policy Optimization (GSPO) to optimize sequence-level driving rewards. Remarkably, our model achieves 91.0 PDMS on NAVSIM-v1 and 89.7 EPDMS on NAVSIM-v2, demonstrating the effectiveness of masked diffusion for autonomous driving. The approach provides a promising alternative to autoregressive and diffusion-based policies, supporting scenario-aware decoding strategies for trajectory generation. The code for this paper will be released publicly at: https://github.com/fudan-generative-vision/WAM-Diff


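[81] Expert Switching for Robust AAV Landing: A Dual-Detector Framework in Simulation cs.RO | cs.CV [PDF]

Humaira Tasnim, Ashik E Rasul, Bruce Jo, Hyung-Jin Yoon

TL;DR: This paper proposes a scale-adaptive dual-expert perception framework for Autonomous Aerial Vehicle (AAV) landing. It runs two YOLOv8 expert models in parallel, one specialized for far-range and one for close-range detection, and uses a geometric gating mechanism to select the prediction most consistent with the current viewpoint. This addresses the performance degradation that single detectors suffer when the target scale changes drastically during landing.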

Details

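Motivation: When an AAV lands under GPS-denied or visually degraded conditions, a single-detector model struggles to stay robust across the extreme scale change of the target (such as a helipad), which appears small at high altitude and large near the ground.

Result: Evaluated in a closed-loop landing environment that integrates CARLA's photorealistic rendering with NASA's GUAM flight-dynamics engine, the method shows substantial improvements in alignment stability, landing accuracy, and overall robustness compared with single-detector baselines.

Insight: The core novelty lies in decomposing the detection task by scale into far-range and close-range subtasks, training a specialized expert model for each, and routing adaptively between them with a gating mechanism based on geometric consistency. This offers a reusable framework for building multi-expert perception systems for specific tasks such as landing.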

Abstract: Reliable helipad detection is essential for Autonomous Aerial Vehicle (AAV) landing, especially under GPS-denied or visually degraded conditions. While modern detectors such as YOLOv8 offer strong baseline performance, single-model pipelines struggle to remain robust across the extreme scale transitions that occur during descent, where helipads appear small at high altitude and large near touchdown. To address this limitation, we propose a scale-adaptive dual-expert perception framework that decomposes the detection task into far-range and close-range regimes. Two YOLOv8 experts are trained on scale-specialized versions of the HelipadCat dataset, enabling one model to excel at detecting small, low-resolution helipads and the other to provide high-precision localization when the target dominates the field of view. During inference, both experts operate in parallel, and a geometric gating mechanism selects the expert whose prediction is most consistent with the AAV’s viewpoint. This adaptive routing prevents the degradation commonly observed in single-detector systems when operating across wide altitude ranges. The dual-expert perception module is evaluated in a closed-loop landing environment that integrates CARLA’s photorealistic rendering with NASA’s GUAM flight-dynamics engine. Results show substantial improvements in alignment stability, landing accuracy, and overall robustness compared to single-detector baselines. By introducing a scale-aware expert routing strategy tailored to the landing problem, this work advances resilient vision-based perception for autonomous descent and provides a foundation for future multi-expert AAV frameworks.
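One way to picture the geometric gating is to compare each expert's predicted helipad size with the size expected from the vehicle's altitude. The sketch below is an assumption-laden illustration: the pinhole-camera size estimate, the relative-error criterion, and the names `expected_pixel_size` and `select_expert` are hypothetical, since the abstract only states that the gate picks the expert most consistent with the AAV's viewpoint.

```python
def expected_pixel_size(focal_px: float, pad_diameter_m: float, altitude_m: float) -> float:
    """Apparent helipad diameter in pixels under an assumed nadir-looking pinhole camera."""
    if altitude_m <= 0:
        raise ValueError("altitude_m must be positive")
    return focal_px * pad_diameter_m / altitude_m


def select_expert(predicted_px: dict, focal_px: float, pad_diameter_m: float,
                  altitude_m: float) -> str:
    """Pick the expert whose predicted box width best matches the expected size.

    predicted_px maps an expert name to its predicted box width in pixels.
    The relative-error rule is an illustrative stand-in for the paper's gate.
    """
    expected = expected_pixel_size(focal_px, pad_diameter_m, altitude_m)
    return min(predicted_px, key=lambda name: abs(predicted_px[name] - expected) / expected)


if __name__ == "__main__":
    # High altitude: the far-range expert's small box is the plausible one.
    print(select_expert({"far": 76.0, "close": 300.0}, 800.0, 10.0, 100.0))
    # Near touchdown: the close-range expert's large box is the plausible one.
    print(select_expert({"far": 120.0, "close": 760.0}, 800.0, 10.0, 10.0))
```

Because both experts run on every frame, a gate of this kind only decides which output to trust and adds negligible cost on top of the two detector passes.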


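[82] EVOLVE-VLA: Test-Time Training from Environment Feedback for Vision-Language-Action Models cs.RO | cs.CV [PDF]

Zechen Bai, Chen Gao, Mike Zheng Shou

TL;DR: EVOLVE-VLA is a test-time training framework for vision-language-action (VLA) models that lets robotic agents adapt continuously through interaction with the environment, without relying on large numbers of task-specific demonstrations. A learned progress estimator provides autonomous feedback, and accumulative progress estimation and progressive horizon extension tame the noisy signal, yielding substantial gains in long-horizon tasks, few-shot learning, and cross-task generalization.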

Details

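Motivation: Existing VLA models based on supervised fine-tuning need many task demonstrations, memorize trajectories rigidly, and cannot adapt when deployment conditions change, so they fall short of truly adaptive embodied intelligence. The paper addresses the lack of supervised reward signals at test time so that VLA models can keep improving through environment interaction.

Result: Performance improves by +8.6% on long-horizon tasks and +22.0% in 1-shot learning. Without task-specific demonstration training, the method reaches a 20.8% success rate on unseen tasks (versus 0% for pure supervised fine-tuning), achieving cross-task generalization.

Insight: The novelty lies in replacing oracle reward signals, which are unavailable at test time, with a learned progress estimator, and in handling the noisy feedback through accumulative progress estimation and progressive horizon extension. This shifts VLA models from static imitation to continuous adaptation, and capabilities such as error recovery and novel strategies emerge.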

Abstract: Achieving truly adaptive embodied intelligence requires agents that learn not just by imitating static demonstrations, but by continuously improving through environmental interaction, which is akin to how humans master skills through practice. Vision-Language-Action (VLA) models have advanced robotic manipulation by leveraging large language models, yet remain fundamentally limited by Supervised Finetuning (SFT): requiring hundreds of demonstrations per task, rigidly memorizing trajectories, and failing to adapt when deployment conditions deviate from training. We introduce EVOLVE-VLA, a test-time training framework enabling VLAs to continuously adapt through environment interaction with minimal or zero task-specific demonstrations. The key technical challenge is replacing oracle reward signals (unavailable at test time) with autonomous feedback. We address this through a learned progress estimator providing dense feedback, and critically, we design our framework to “tame” this inherently noisy signal via two mechanisms: (1) an accumulative progress estimation mechanism smoothing noisy point-wise estimates, and (2) a progressive horizon extension strategy enabling gradual policy evolution. EVOLVE-VLA achieves substantial gains: +8.6% on long-horizon tasks, +22.0% in 1-shot learning, and cross-task generalization, achieving 20.8% success on unseen tasks without training on task-specific demonstrations (vs. 0% for pure SFT). Qualitative analysis reveals emergent capabilities absent in demonstrations, including error recovery and novel strategies. This work represents a critical step toward VLAs that truly learn and adapt, moving beyond static imitation toward continuous self-improvement.


eess.AS [Back]

[83] Scalable Frameworks for Real-World Audio-Visual Speech Recognition eess.AS | cs.CL | cs.LGPDF

Sungnyun Kim

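TL;DR: This dissertation proposes a systematic, hierarchical approach to the performance degradation of audio-visual speech recognition (AVSR) systems in real-world environments, pursuing robust scalability at three levels: representation, architecture, and system.

Details

Motivation: AVSR systems degrade significantly in real-world deployment, where acoustic noise and visual interference are unpredictable.

Result: The abstract reports no quantitative results or benchmark numbers; it states the goal of building a next-generation, robust, and scalable AVSR system that is highly reliable in real-world applications.

Insight: The contribution is a layered framework. At the representation level, it learns unified audio-visual features that are inherently robust to diverse real-world corruptions. At the architecture level, it scales the model adaptively by allocating computation according to input characteristics. At the system level, it extends functionality through modular integration with large-scale foundation models. Together these offer a systems-engineering perspective on building robust multimodal systems.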

Abstract: The practical deployment of Audio-Visual Speech Recognition (AVSR) systems is fundamentally challenged by significant performance degradation in real-world environments, characterized by unpredictable acoustic noise and visual interference. This dissertation posits that a systematic, hierarchical approach is essential to overcome these challenges, achieving robust scalability at the representation, architecture, and system levels. At the representation level, we investigate methods for building a unified model that learns audio-visual features inherently robust to diverse real-world corruptions, thereby enabling generalization to new environments without specialized modules. To address architectural scalability, we explore how to efficiently expand model capacity while ensuring the adaptive and reliable use of multimodal inputs, developing a framework that intelligently allocates computational resources based on the input characteristics. Finally, at the system level, we present methods to expand the system’s functionality through modular integration with large-scale foundation models, leveraging their powerful cognitive and generative capabilities to maximize final recognition accuracy. By systematically providing solutions at each of these three levels, this dissertation aims to build a next-generation, robust, and scalable AVSR system with high reliability in real-world applications.


cs.IR [Back]

[84] RecGPT-V2 Technical Report cs.IR | cs.CLPDF

Chao Yi, Dian Chen, Gaoyang Guo, Jiakai Tang, Jian Wu

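TL;DR: RecGPT-V2 is a large language model framework for recommender systems that improves recommendation through explicit intent reasoning. It introduces a hierarchical multi-agent system, a meta-prompting framework, constrained reinforcement learning, and an Agent-as-a-Judge evaluation framework to address its predecessor's limitations in computational efficiency, explanation diversity, generalization, and evaluation standards.

Details

Motivation: RecGPT-V1 has four core problems: computational inefficiency and cognitive redundancy across multiple reasoning routes, insufficient explanation diversity caused by fixed-template generation, limited generalization under supervised learning, and simplistic outcome-focused evaluation that fails to match human standards.

Result: Online A/B tests on Taobao show significant improvements: +2.98% CTR, +3.71% IPV, +2.19% TV, and +11.46% NER. The framework also reduces GPU consumption by 60% and raises exclusive recall from 9.39% to 10.99%.

Insight: The core innovation is recasting intent reasoning as coordinated collaboration within a hierarchical multi-agent system, combined with Hybrid Representation Inference that compresses context, which greatly improves efficiency. In addition, meta-prompting generates prompts dynamically, constrained reinforcement learning resolves multi-reward conflicts, and a multi-step Agent-as-a-Judge framework handles evaluation; these improve diversity, task performance, and alignment with human preferences respectively. The work serves as an example of both the technical and the commercial feasibility of deploying LLM-based intent reasoning at scale.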

Abstract: Large language models (LLMs) have demonstrated remarkable potential in transforming recommender systems from implicit behavioral pattern matching to explicit intent reasoning. While RecGPT-V1 successfully pioneered this paradigm by integrating LLM-based reasoning into user interest mining and item tag prediction, it suffers from four fundamental limitations: (1) computational inefficiency and cognitive redundancy across multiple reasoning routes; (2) insufficient explanation diversity in fixed-template generation; (3) limited generalization under supervised learning paradigms; and (4) simplistic outcome-focused evaluation that fails to match human standards. To address these challenges, we present RecGPT-V2 with four key innovations. First, a Hierarchical Multi-Agent System restructures intent reasoning through coordinated collaboration, eliminating cognitive duplication while enabling diverse intent coverage. Combined with Hybrid Representation Inference that compresses user-behavior contexts, our framework reduces GPU consumption by 60% and improves exclusive recall from 9.39% to 10.99%. Second, a Meta-Prompting framework dynamically generates contextually adaptive prompts, improving explanation diversity by +7.3%. Third, constrained reinforcement learning mitigates multi-reward conflicts, achieving +24.1% improvement in tag prediction and +13.0% in explanation acceptance. Fourth, an Agent-as-a-Judge framework decomposes assessment into multi-step reasoning, improving human preference alignment. Online A/B tests on Taobao demonstrate significant improvements: +2.98% CTR, +3.71% IPV, +2.19% TV, and +11.46% NER. RecGPT-V2 establishes both the technical feasibility and commercial viability of deploying LLM-powered intent reasoning at scale, bridging the gap between cognitive exploration and industrial utility.


cs.AI [Back]

[85] Incentivizing Tool-augmented Thinking with Images for Medical Image Analysis cs.AI | cs.CVPDF

Yankai Jiang, Yujie Zhang, Peng Zhang, Yichen Li, Jintai Chen

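TL;DR: This paper presents Ophiuchus, a tool-augmented multimodal large language model framework for complex medical image analysis tasks that require dynamic, iterative focus on fine-grained visual regions. A three-stage training strategy teaches the model to decide when additional visual evidence is needed, where to probe the image, and how to weave the relevant content into a multimodal chain of thought.

Details

Motivation: Reasoning-based medical MLLMs have made progress in generating step-by-step textual reasoning chains, but they still struggle with complex tasks that require dynamic, iterative attention to fine-grained visual regions for precise grounding and diagnosis.

Result: Across diverse medical benchmarks, including VQA, detection, and reasoning-based segmentation, Ophiuchus consistently outperforms closed-source and open-source state-of-the-art methods.

Insight: The novelty is integrating the model's inherent grounding and perception capabilities with external tools. A three-stage strategy (cold-start training, self-reflection fine-tuning, and Agentic Tool Reinforcement Learning) directly optimizes task-specific rewards and emulates expert diagnostic behavior, fostering higher-level reasoning and genuine "thinking with images".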

Abstract: Recent reasoning-based medical MLLMs have made progress in generating step-by-step textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on fine-grained visual regions to achieve precise grounding and diagnosis. We introduce Ophiuchus, a versatile, tool-augmented framework that equips an MLLM to (i) decide when additional visual evidence is needed, (ii) determine where to probe and ground within the medical image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved, multimodal chain of thought. In contrast to prior approaches limited by the performance ceiling of specialized tools, Ophiuchus integrates the model’s inherent grounding and perception capabilities with external tools, thereby fostering higher-level reasoning. The core of our method is a three-stage training strategy: cold-start training with tool-integrated reasoning data to achieve basic tool selection and adaptation for inspecting key regions; self-reflection fine-tuning to strengthen reflective reasoning and encourage revisiting tool outputs; and Agentic Tool Reinforcement Learning to directly optimize task-specific rewards and emulate expert-like diagnostic behavior. Extensive experiments show that Ophiuchus consistently outperforms both closed-source and open-source SOTA methods across diverse medical benchmarks, including VQA, detection, and reasoning-based segmentation. Our approach illuminates a path toward medical AI agents that can genuinely “think with images” through tool-integrated reasoning. Datasets, code, and trained models will be released publicly.


cond-mat.mtrl-sci [Back]

[86] Hierarchical Multi-agent Large Language Model Reasoning for Autonomous Functional Materials Discovery cond-mat.mtrl-sci | cs.AI | cs.CL | cs.LG | cs.MAPDF

Samuel Rothfarb, Megan C. Davis, Ivana Matanovic, Baikun Li, Edward F. Holby

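TL;DR: This paper presents MASTER, a hierarchical multi-agent active learning system built on large language models that autonomously designs and executes atomistic simulations to accelerate functional materials discovery. The system translates natural language into density functional theory workflows and applies several multi-agent reasoning strategies (peer review, triage-ranking, and triage-forms). In two chemical applications, CO adsorption on Cu-surface transition metal adatoms and on M-N-C catalysts, it reduces the number of required atomistic simulations by up to 90%.

Details

Motivation: Most current AI methods automate procedural tasks without scientific reasoning, which limits autonomy in scientific discovery. The paper aims to build a framework that can reason scientifically and design and interpret atomistic simulations on its own, in order to accelerate the discovery of functional materials.

Result: In both chemical applications (CO adsorption on Cu-surface transition metal adatoms and on M-N-C catalysts), reasoning-driven exploration reduces the number of required atomistic simulations by up to 90% relative to trial-and-error selection. The reasoning trajectories show decisions grounded in chemical principles rather than in stochastic sampling or semantic bias.

Insight: The novelty is a hierarchical multi-agent LLM reasoning framework that turns natural-language instructions directly into computational workflows and steers discovery through multi-agent collaboration strategies such as peer review. It marks a shift from automated task execution to autonomous scientific reasoning and opens a new route for autonomous scientific exploration.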

Abstract: Artificial intelligence is reshaping scientific exploration, but most methods automate procedural tasks without engaging in scientific reasoning, limiting autonomy in discovery. We introduce Materials Agents for Simulation and Theory in Electronic-structure Reasoning (MASTER), an active learning framework where large language models autonomously design, execute, and interpret atomistic simulations. In MASTER, a multimodal system translates natural language into density functional theory workflows, while higher-level reasoning agents guide discovery through a hierarchy of strategies, including a single agent baseline and three multi-agent approaches: peer review, triage-ranking, and triage-forms. Across two chemical applications, CO adsorption on Cu-surface transition metal (M) adatoms and on M-N-C catalysts, reasoning-driven exploration reduces required atomistic simulations by up to 90% relative to trial-and-error selection. Reasoning trajectories reveal chemically grounded decisions that cannot be explained by stochastic sampling or semantic bias. Altogether, multi-agent collaboration accelerates materials discovery and marks a new paradigm for autonomous scientific exploration.


eess.IV [Back]

[87] Improving the Plausibility of Pressure Distributions Synthesized from Depth through Generative Modeling eess.IV | cs.CV | cs.LGPDF

Neevkumar Manavar, Hanno Gerd Meyer, Joachim Waßmuth, Barbara Hammer, Axel Schneider

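TL;DR: This paper proposes a generative modeling framework that improves the physical plausibility of pressure distribution maps synthesized from depth images. It combines an Informed Latent Space (ILS) with a Weight Optimization Loss (WOL) and applies a conditional Brownian Bridge Diffusion Model (BBDM) and its latent variant LBBDM. The goal is high-fidelity, physically consistent pressure estimates for hospital bed monitoring, in support of non-invasive, vision-based, real-time patient monitoring.

Details

Motivation: Current methods for predicting pressure distributions from depth images lack physical plausibility, which limits their clinical reliability.

Result: Experiments show that the proposed method improves on baselines in physical plausibility and performance. BBDM with ILS produces highly detailed pressure maps but at high computational cost and long inference time, while LBBDM offers faster inference with competitive performance.

Insight: The novelty is using ILS and WOL to make the generative model more physically plausible, and tailoring BBDM and its training strategy to pressure synthesis in lying postures. The latent variant LBBDM balances speed against performance.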

Abstract: Monitoring contact pressure in hospital beds is essential for preventing pressure ulcers and enabling real-time patient assessment. Current methods can predict pressure maps but often lack physical plausibility, limiting clinical reliability. This work proposes a framework that enhances plausibility via Informed Latent Space (ILS) and Weight Optimization Loss (WOL) with generative modeling to produce high-fidelity, physically consistent pressure estimates. This study also applies a diffusion-based conditional Brownian Bridge Diffusion Model (BBDM) and proposes a training strategy for its latent counterpart, the Latent Brownian Bridge Diffusion Model (LBBDM), tailored for pressure synthesis in lying postures. Experimental results show that the proposed method improves physical plausibility and performance over baselines: BBDM with ILS delivers highly detailed maps at higher computational cost and longer inference time, whereas LBBDM provides faster inference with competitive performance. Overall, the approach supports non-invasive, vision-based, real-time patient monitoring in clinical environments.


cs.LG [Back]

[88] Mitigating Catastrophic Forgetting in Mathematical Reasoning Finetuning through Mixed Training cs.LG | cs.CLPDF

John Graham Reynolds

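TL;DR: This paper studies catastrophic forgetting when finetuning large language models on mathematical reasoning and proposes a mixed training strategy to mitigate it. By interleaving mathematical and natural language inference (NLI) examples during training, the method preserves mathematical reasoning performance while completely avoiding the collapse in NLI performance.

Details

Motivation: When large language models are finetuned for specialized tasks such as mathematical reasoning, they exhibit catastrophic forgetting and lose previously learned general capabilities such as natural language inference. The paper aims to address this degradation.

Result: On Flan-T5-Base, math-only training raises math accuracy from 3.1% to 12.0% but causes NLI accuracy to collapse from 81.0% to 16.5%. The proposed mixed training (for example, at a 1:1 ratio) keeps math accuracy at 12.0%, matching math-only training, while holding NLI accuracy at 86.2%, completely eliminating forgetting. The study also systematically explores mixing ratios from 1:1 to 15:1 and finds that even 6.2% NLI examples provide effective regularization.

Insight: The core contribution is proposing and systematically evaluating mixed training, showing that interleaving a small amount of general-task data into specialized finetuning is a simple and effective regularizer that completely prevents catastrophic forgetting without sacrificing target-task performance. This challenges the assumption that specialization necessarily entails forgetting and offers a practical recipe for continual learning in large models.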

Abstract: When finetuning large language models for specialized tasks such as mathematical reasoning, models exhibit catastrophic forgetting, losing previously learned capabilities. We investigate this by finetuning Flan-T5-Base (250M parameters) on the DeepMind Mathematics dataset and measuring forgetting on MultiNLI. Math-only training improves mathematical accuracy from 3.1% to 12.0% but causes NLI accuracy to collapse from 81.0% to 16.5%–a 64.5 percentage point drop occurring within the first 1,000 training steps. We propose mixed training strategies that interleave mathematical and NLI examples during training. Our results demonstrate that mixed training completely eliminates catastrophic forgetting while maintaining equivalent mathematical performance: the balanced 1:1 ratio achieves 12.0% math accuracy (matching math-only) while preserving 86.2% NLI accuracy. We systematically explore mixing ratios from 1:1 to 15:1, finding that even minimal NLI exposure (6.2%) provides effective regularization. These findings demonstrate that specialization need not require forgetting general capabilities, with implications for scaling to larger models where mixed training may confer additional benefits beyond forgetting prevention.
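The mixing ratios above translate directly into a data schedule, and the 6.2% figure follows from the 15:1 ratio: one NLI example per 16 examples is 1/16 = 6.25%. The sketch below illustrates one way to interleave the two sources. The function names and the round-robin schedule are assumptions made for illustration; the abstract does not say how the author's data loader mixes examples.

```python
from itertools import cycle


def nli_fraction(math_per_nli: int) -> float:
    """Share of NLI examples in a stream mixed at math_per_nli : 1."""
    return 1.0 / (math_per_nli + 1)


def interleave(math_examples, nli_examples, math_per_nli: int):
    """Yield (source, example) pairs: math_per_nli math examples, then one NLI example.

    The (non-empty) NLI pool is cycled, so the stream ends when the math
    examples run out. This schedule is a hypothetical stand-in for whatever
    sampler the paper actually uses.
    """
    nli_iter = cycle(nli_examples)
    for count, example in enumerate(math_examples, start=1):
        yield ("math", example)
        if count % math_per_nli == 0:
            yield ("nli", next(nli_iter))


# Toy data: 30 math items mixed 15:1 with a small NLI pool.
stream = list(interleave(range(30), ["premise/hypothesis A", "premise/hypothesis B"], 15))
observed = sum(1 for source, _ in stream if source == "nli") / len(stream)
```

Here `observed` equals `nli_fraction(15)`, that is 0.0625, and `nli_fraction(1)` gives the 50% share of the balanced 1:1 setting.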


[89] Let’s (not) just put things in Context: Test-Time Training for Long-Context LLMs cs.LG | cs.CLPDF

Rachit Bansal, Aston Zhang, Rishabh Tiwari, Lovish Madaan, Sai Surya Duvvuri

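TL;DR: This paper examines how inference-time compute should be spent for long-context large language models (LLMs). It finds that popular inference-time scaling strategies, such as generating more thinking tokens, show diminishing returns and can fail on long-context tasks, which the authors attribute to score dilution inherent in static self-attention. They propose a simple method that performs targeted gradient updates on the given context, provably overcomes the limitations of static self-attention, and delivers consistent, large gains across models and long-context benchmarks.

Details

Motivation: Advances in training and architecture let LLMs handle contexts of millions of tokens, yet empirical evidence suggests these models can consume far more text than they can reliably use. Meanwhile, inference-time compute, for example generating thinking tokens, has been shown to improve LLM performance on challenging multi-step reasoning tasks. How well these strategies work on long-context tasks has been unclear.

Result: On subsets of the LongBench-v2 and ZeroScrolls benchmarks, the proposed method yields large average gains of 12.6 and 14.1 percentage points for Qwen3-4B. Improvements are consistent and large across multiple models and long-context benchmarks.

Insight: The core contribution is a different way to spend inference-time compute: for long-context tasks, devoting a small amount of it to context-specific gradient updates (test-time training) is more effective than current inference-time scaling strategies such as generating more thinking tokens. This provably overcomes the score dilution inherent in static self-attention and offers a practical, efficient way to improve the real-world performance of long-context LLMs.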

Abstract: Progress on training and architecture strategies has enabled LLMs with millions of tokens in context length. However, empirical evidence suggests that such long-context LLMs can consume far more text than they can reliably use. On the other hand, it has been shown that inference-time compute can be used to scale performance of LLMs, often by generating thinking tokens, on challenging tasks involving multi-step reasoning. Through controlled experiments on sandbox long-context tasks, we find that such inference-time strategies show rapidly diminishing returns and fail at long context. We attribute these failures to score dilution, a phenomenon inherent to static self-attention. Further, we show that current inference-time strategies cannot retrieve relevant long-context signals under certain conditions. We propose a simple method that, through targeted gradient updates on the given context, provably overcomes limitations of static self-attention. We find that this shift in how inference-time compute is spent leads to consistently large performance improvements across models and long-context benchmarks. Our method leads to large 12.6 and 14.1 percentage point improvements for Qwen3-4B on average across subsets of LongBench-v2 and ZeroScrolls benchmarks. The takeaway is practical: for long context, a small amount of context-specific training is a better use of inference compute than current inference-time scaling strategies like producing more thinking tokens.
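To give a rough intuition for why attention scores can dilute as context grows, consider a toy setting that is an assumption of this note rather than the paper's formal definition: one relevant token whose attention logit exceeds that of every other token by a fixed gap, with all distractors scored equally. The softmax weight on the relevant token is then e^gap / (e^gap + n - 1), which shrinks toward zero as the context length n grows unless the gap grows like log n.

```python
import math


def relevant_attention_weight(context_len: int, logit_gap: float) -> float:
    """Softmax weight on one relevant token that outscores each of the
    (context_len - 1) equally scored distractors by logit_gap."""
    boosted = math.exp(logit_gap)
    return boosted / (boosted + context_len - 1)


def gap_needed(context_len: int, target_weight: float) -> float:
    """Logit gap required for the relevant token to receive target_weight.

    Obtained by solving w = e^gap / (e^gap + n - 1) for gap; requires
    context_len >= 2 and 0 < target_weight < 1.
    """
    return math.log(target_weight * (context_len - 1) / (1.0 - target_weight))


# With a fixed gap, the relevant token's share of attention falls as context grows.
weights = [relevant_attention_weight(n, logit_gap=5.0) for n in (10, 1_000, 100_000)]
```

In this toy model, holding half of the attention mass on the relevant token requires a gap of log(n - 1), so a fixed scoring function loses the signal as n increases. That is the kind of limitation the abstract says context-specific gradient updates are meant to overcome.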


[90] RePo: Language Models with Context Re-Positioning cs.LG | cs.AI | cs.CLPDF

Huayang Li, Tianyu Zhao, Richard Sproat

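TL;DR: This paper proposes RePo, a mechanism that reduces the extraneous cognitive load of large language models (LLMs) during in-context learning by re-positioning the context. It uses a differentiable module $f_φ$ to assign token positions that capture contextual dependencies, rather than relying on predefined linear or constant position indices. With continual pre-training on the OLMo-2 1B backbone, RePo significantly improves performance on tasks with noisy contexts, structured data, and longer contexts while remaining competitive on general short-context tasks.

Details

Motivation: Prevailing LLM architectures impose a rigid, fixed contextual structure by assigning linear or constant position indices. Drawing on Cognitive Load Theory (CLT), the authors argue that this uninformative structure increases extraneous cognitive load and consumes limited working memory capacity that should go to deep reasoning and attention allocation. A mechanism is therefore needed to reduce this load and improve how the model processes contextual information.

Result: With continual pre-training on the OLMo-2 1B backbone, RePo significantly improves performance on tasks involving noisy contexts, structured data, and longer contexts, while staying competitive on general short-context tasks. Detailed analysis shows that RePo allocates higher attention to distant but relevant information, assigns positions in a dense, non-linear space, and captures the intrinsic structure of the input context.

Insight: The main contribution is a differentiable context re-positioning module that assigns positions dynamically and non-linearly to better capture contextual dependencies and thereby reduce extraneous cognitive load. Viewed objectively, the method challenges the fixed positional encoding paradigm and suggests a more flexible and efficient way for models to use long-range and structured context, which may help LLMs on complex reasoning and long-text tasks.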

Abstract: In-context learning is fundamental to modern Large Language Models (LLMs); however, prevailing architectures impose a rigid and fixed contextual structure by assigning linear or constant positional indices. Drawing on Cognitive Load Theory (CLT), we argue that this uninformative structure increases extraneous cognitive load, consuming finite working memory capacity that should be allocated to deep reasoning and attention allocation. To address this, we propose RePo, a novel mechanism that reduces extraneous load via context re-positioning. Unlike standard approaches, RePo utilizes a differentiable module, $f_φ$, to assign token positions that capture contextual dependencies, rather than relying on a pre-defined integer range. By continually pre-training on the OLMo-2 1B backbone, we demonstrate that RePo significantly enhances performance on tasks involving noisy contexts, structured data, and longer context length, while maintaining competitive performance on general short-context tasks. Detailed analysis reveals that RePo successfully allocates higher attention to distant but relevant information, assigns positions in dense and non-linear space, and captures the intrinsic structure of the input context. Our code is available at https://github.com/SakanaAI/repo.


[91] Composite Classifier-Free Guidance for Multi-Modal Conditioning in Wind Dynamics Super-Resolution cs.LG | cs.AI | cs.CVPDF

Jacob Schnell, Aditya Makkar, Gunadi Gani, Aniket Srinivasan Ashok, Darren Lo

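TL;DR: This paper proposes composite classifier-free guidance (CCFG), a method that improves multi-modal conditional diffusion models for wind dynamics super-resolution. It extends standard classifier-free guidance (CFG) to multiple input channels and is used in a diffusion model called WindDM, which reconstructs wind data with high fidelity at greatly reduced cost.

Details

Motivation: High-resolution wind data is expensive and difficult to acquire, and traditional methods struggle to be both cheap and accurate. Existing deep learning approaches such as diffusion models face many input channels on wind data (more than 10) and need to use multi-modal conditioning inputs more effectively.

Result: On wind super-resolution tasks, CCFG outputs are higher-fidelity than those of standard CFG. WindDM achieves state-of-the-art (SOTA) reconstruction quality among deep learning models at up to 1000x lower cost than classical methods.

Insight: The novelty is generalizing classifier-free guidance to multiple conditioning inputs; the resulting CCFG method can be dropped into pre-trained diffusion models. Viewed objectively, the method uses multi-channel wind data efficiently to address the cost-accuracy trade-off in industrial-scale wind dynamics reconstruction.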

Abstract: Various weather modelling problems (e.g., weather forecasting, optimizing turbine placements, etc.) require ample access to high-resolution, highly accurate wind data. Acquiring such high-resolution wind data, however, remains a challenging and expensive endeavour. Traditional reconstruction approaches are typically either cost-effective or accurate, but not both. Deep learning methods, including diffusion models, have been proposed to resolve this trade-off by leveraging advances in natural image super-resolution. Wind data, however, is distinct from natural images, and wind super-resolvers often use upwards of 10 input channels, significantly more than the usual 3-channel RGB inputs in natural images. To better leverage a large number of conditioning variables in diffusion models, we present a generalization of classifier-free guidance (CFG) to multiple conditioning inputs. Our novel composite classifier-free guidance (CCFG) can be dropped into any pre-trained diffusion model trained with standard CFG dropout. We demonstrate that CCFG outputs are higher-fidelity than those from CFG on wind super-resolution tasks. We present WindDM, a diffusion model trained for industrial-scale wind dynamics reconstruction and leveraging CCFG. WindDM achieves state-of-the-art reconstruction quality among deep learning models and costs up to $1000\times$ less than classical methods.
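The abstract does not give the CCFG formula, so the following is only a sketch of the general idea under stated assumptions. Standard CFG, in one common convention, extrapolates from the unconditional noise prediction toward the conditional one: eps = eps_uncond + w * (eps_cond - eps_uncond). A natural multi-condition extension, used here purely for illustration and not claimed to be the paper's CCFG, adds one such guidance term per conditioning input, each with its own scale. Function names and the list-based tensors are hypothetical.

```python
def cfg(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance on flat lists of noise predictions."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def composite_cfg(eps_uncond, eps_per_condition, scales):
    """Illustrative multi-condition guidance (an assumed form, not the paper's).

    eps_per_condition[i] is the model's prediction with only conditioning
    input i kept (the others dropped), and scales[i] is its guidance weight.
    Each condition contributes its own (conditional - unconditional) direction.
    """
    if len(eps_per_condition) != len(scales):
        raise ValueError("need exactly one guidance scale per condition")
    guided = list(eps_uncond)
    for eps_cond, scale in zip(eps_per_condition, scales):
        for k, (u, c) in enumerate(zip(eps_uncond, eps_cond)):
            guided[k] += scale * (c - u)
    return guided


# Toy two-element predictions with two conditioning inputs.
eps_uncond = [0.0, 1.0]
eps_a = [1.0, 3.0]
eps_b = [0.5, 1.0]
single = cfg(eps_uncond, eps_a, scale=2.0)
combined = composite_cfg(eps_uncond, [eps_a, eps_b], scales=[2.0, 1.0])
```

With a single condition the composite form reduces to standard CFG, which is consistent with the abstract's claim that CCFG can be used with models trained with ordinary CFG dropout; how the paper weights or groups the conditions is not specified here.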


[92] Enhancing Semi-Supervised Multi-View Graph Convolutional Networks via Supervised Contrastive Learning and Self-Training cs.LG | cs.CVPDF

Huaiyuan Xiao, Fadi Dornaika, Jingjun Bi

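TL;DR: This paper proposes MV-SupGCN, a semi-supervised multi-view graph convolutional network that strengthens feature representations for multi-view data by integrating supervised contrastive learning with self-training. The model uses a joint loss that combines cross-entropy with supervised contrastive loss, fuses KNN-based and semi-supervised graph construction, and applies contrastive learning with pseudo-labeling to exploit unlabeled data and improve semantic alignment across views.

Details

Motivation: Existing GCN-based multi-view learning methods do not fully exploit the complementary information across views, which leads to suboptimal feature representations and limited performance.

Result: Extensive experiments on multiple benchmarks show that MV-SupGCN consistently surpasses state-of-the-art methods, validating the integrated approach.

Insight: The novelty is threefold: jointly optimizing supervised contrastive loss with cross-entropy loss to improve intra-class compactness and inter-class separability; fusing different graph construction methods to make the structural representation more robust; and a unified framework of contrastive learning and pseudo-labeling that exploits unlabeled data and enforces consistency among multi-view embeddings.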

Abstract: The advent of graph convolutional network (GCN)-based multi-view learning provides a powerful framework for integrating structural information from heterogeneous views, enabling effective modeling of complex multi-view data. However, existing methods often fail to fully exploit the complementary information across views, leading to suboptimal feature representations and limited performance. To address this, we propose MV-SupGCN, a semi-supervised GCN model that integrates several complementary components with clear motivations and mutual reinforcement. First, to better capture discriminative features and improve model generalization, we design a joint loss function that combines Cross-Entropy loss with Supervised Contrastive loss, encouraging the model to simultaneously minimize intra-class variance and maximize inter-class separability in the latent space. Second, recognizing the instability and incompleteness of single graph construction methods, we combine both KNN-based and semi-supervised graph construction approaches on each view, thereby enhancing the robustness of the data structure representation and reducing generalization error. Third, to effectively utilize abundant unlabeled data and enhance semantic alignment across multiple views, we propose a unified framework that integrates contrastive learning in order to enforce consistency among multi-view embeddings and capture meaningful inter-view relationships, together with pseudo-labeling, which provides additional supervision applied to both the cross-entropy and contrastive loss functions to enhance model generalization. Extensive experiments demonstrate that MV-SupGCN consistently surpasses state-of-the-art methods across multiple benchmarks, validating the effectiveness of our integrated approach. The source code is available at https://github.com/HuaiyuanXiao/MVSupGCN
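The joint objective described above can be sketched as cross-entropy plus a weighted supervised contrastive term. The code below is a minimal pure-Python illustration, not the authors' implementation (which is in the linked repository): the supervised contrastive term follows the usual formulation in which, for each anchor, same-label samples are positives and all other samples form the softmax denominator. The `weight` and `temperature` parameters and their defaults are assumptions.

```python
import math


def cross_entropy(logits, label):
    """Negative log-softmax of the true class, computed stably."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_z - logits[label]


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid dividing by zero
    return [x / norm for x in vec]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss averaged over anchors that have a positive."""
    z = [_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity to every other sample.
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n) if j != i}
        top = max(sims.values())
        log_denom = top + math.log(sum(math.exp(s - top) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def joint_loss(logits_batch, embeddings, labels, weight=1.0, temperature=0.1):
    """Mean cross-entropy plus `weight` times the supervised contrastive loss."""
    ce = sum(cross_entropy(l, y) for l, y in zip(logits_batch, labels)) / len(labels)
    return ce + weight * supcon_loss(embeddings, labels, temperature)


# Toy batch: two tight clusters in embedding space.
emb = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
aligned = supcon_loss(emb, [0, 0, 1, 1])   # labels agree with the clusters
mixed = supcon_loss(emb, [0, 1, 0, 1])     # labels cut across the clusters
```

On this toy batch `aligned` is far smaller than `mixed`, which reflects the behavior the abstract describes: the contrastive term rewards low intra-class variance and high inter-class separation, while the cross-entropy term supervises the classifier output.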


cs.CR [Back]

Quan Yuan, Zhikun Zhang, Linkang Du, Min Chen, Mingyang Sun

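TL;DR: This paper proposes VICTOR, the first dataset copyright auditing method for video recognition systems. Using a general and stealthy sample modification strategy that alters only a small share of samples (for example, 1%), it amplifies the effect of the published modified samples on the target model's predictions. The difference in the model's behavior on modified versus original samples then serves as the key evidence for auditing whether the dataset was used without authorization.

Details

Motivation: As video recognition systems spread through content recommendation, security monitoring, and other areas, high-quality public datasets are used to train advanced models, but these datasets are also open to misuse and infringement. Existing dataset copyright solutions focus mainly on images; copyright auditing for video data, with its complex temporal characteristics, has not been explored.

Result: Extensive experiments on multiple models and datasets demonstrate the superiority of VICTOR. The study also shows that VICTOR is robust to several perturbation mechanisms applied to the training videos or the target models.

Insight: The main contribution is bringing dataset copyright auditing to the video domain for the first time, with a general, stealthy strategy that amplifies differences in model behavior by slightly modifying a small number of samples. Viewed objectively, the method takes the challenge posed by the temporal dimension of video and turns it into a detectable model-behavior signal, offering a new approach to copyright protection for video data.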

Abstract: Video recognition systems are increasingly being deployed in daily life, such as content recommendation and security monitoring. To enhance video recognition development, many institutions have released high-quality public datasets with open-source licenses for training advanced models. At the same time, these datasets are also susceptible to misuse and infringement. Dataset copyright auditing is an effective solution to identify such unauthorized use. However, existing dataset copyright solutions primarily focus on the image domain; the complex nature of video data leaves dataset copyright auditing in the video domain unexplored. Specifically, video data introduces an additional temporal dimension, which poses significant challenges to the effectiveness and stealthiness of existing methods. In this paper, we propose VICTOR, the first dataset copyright auditing approach for video recognition systems. We develop a general and stealthy sample modification strategy that enhances the output discrepancy of the target model. By modifying only a small proportion of samples (e.g., 1%), VICTOR amplifies the impact of published modified samples on the prediction behavior of the target models. Then, the difference in the model’s behavior for published modified and unpublished original samples can serve as a key basis for dataset auditing. Extensive experiments on multiple models and datasets highlight the superiority of VICTOR. Finally, we show that VICTOR is robust in the presence of several perturbation mechanisms to the training videos or the target models.