Table of Contents
cs.CL
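[1] Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards cs.CL
Xinyu Tang, Yuliang Zhan, Zhixun Li, Wayne Xin Zhao, Zhenduo Zhang
TL;DR: This paper systematically studies how the polarity of positive and negative samples affects the training dynamics and behavior of large reasoning models under reinforcement learning with verifiable rewards (RLVR). It finds that positive samples sharpen existing correct reasoning patterns, while negative samples encourage exploration of new reasoning paths. Building on this, the authors propose A3PO, an adaptive and asymmetric token-level advantage shaping method that allocates advantage signals more precisely to key tokens in samples of each polarity, and validate it on five reasoning benchmarks.
Motivation: To understand the specific role that positive and negative sample polarity plays in improving reasoning ability during RLVR training, in order to optimize policy updates and address the potentially imprecise allocation of advantage signals in current methods.
Result: Experiments on five reasoning benchmarks show that A3PO improves performance. The abstract reports no quantitative results.
Insight: The paper reveals the distinct roles of positive and negative samples in RLVR (positive samples sharpen existing patterns, negative samples encourage exploration) and proposes A3PO, an adaptive and asymmetric token-level advantage shaping method that allocates advantage signals at a finer granularity. This is a useful reference for optimizing RL-based training strategies.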
Abstract: Large reasoning models (LRMs) are typically trained using reinforcement learning with verifiable reward (RLVR) to enhance their reasoning abilities. In this paradigm, policies are updated using both positive and negative self-generated rollouts, which correspond to distinct sample polarities. In this paper, we provide a systematic investigation into how these sample polarities affect RLVR training dynamics and behaviors. We find that positive samples sharpen existing correct reasoning patterns, while negative samples encourage exploration of new reasoning paths. We further explore how adjusting the advantage values of positive and negative samples at both the sample level and the token level affects RLVR training. Based on these insights, we propose an Adaptive and Asymmetric token-level Advantage shaping method for Policy Optimization, namely A3PO, that more precisely allocates advantage signals to key tokens across different polarities. Experiments across five reasoning benchmarks demonstrate the effectiveness of our approach.
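Illustrative sketch (not the paper's A3PO method): the abstract mentions adjusting the advantage values of positive and negative samples at the sample level. A minimal way to picture that, assuming GRPO-style group-normalized advantages and two hypothetical scaling factors `pos_scale` and `neg_scale`, might look like the following. The token-level, adaptive shaping that A3PO performs is not described in the abstract and is not reproduced here.

```python
import numpy as np

def group_advantages(rewards):
    """Group-normalized advantages for the rollouts of one prompt (assumed GRPO-style)."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std == 0.0:
        # All rollouts received the same reward: no relative learning signal.
        return np.zeros_like(r)
    return (r - r.mean()) / std

def asymmetric_scale(advantages, pos_scale=1.0, neg_scale=1.0):
    """Scale positive- and negative-polarity advantages by different (hypothetical) factors."""
    a = np.asarray(advantages, dtype=float)
    return np.where(a > 0, pos_scale * a, neg_scale * a)

# Toy example: two correct (reward 1) and two incorrect (reward 0) rollouts.
rewards = [1, 0, 0, 1]
adv = group_advantages(rewards)
shaped = asymmetric_scale(adv, pos_scale=1.0, neg_scale=0.5)
```

Here the negative rollouts keep their sign but contribute half as much to the update; choosing the scales is exactly the kind of design question the paper investigates.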
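[2] Enabling Conversational Behavior Reasoning Capabilities in Full-Duplex Speech cs.CL | cs.AI
Shuchang Pan, Siddharth Banerjee, Dhruv Hebbar, Siddhant Patel, Akshaj Gupta
TL;DR: This paper proposes a causal reasoning framework based on Graph-of-Thoughts (GoT) that gives full-duplex speech interaction systems the ability to reason about conversational behavior. The framework models the causal pathway from intent to action with a hierarchical labeling scheme (high-level communicative intents and low-level speech acts) and is trained on a hybrid corpus (controllable simulated data plus real conversations). It predicts the next speech act, generates rationales for its decisions, and dynamically refines its reasoning.
Motivation: Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts; capturing this causal pathway is key to building natural full-duplex interactive systems. The paper addresses the lack of causal reasoning over conversational behavior in existing systems.
Result: Experiments on synthetic and real full-duplex dialogues show that the framework achieves robust behavior detection, produces interpretable reasoning chains, and lays a foundation for benchmarking conversational reasoning in full-duplex spoken dialogue systems.
Insight: The contributions are modeling conversational behavior as causal inference within a Graph-of-Thoughts, a hierarchical intent-to-action labeling scheme, and hybrid training on simulated and real data. Together these offer a new approach to interpretable, dynamic reasoning in dialogue systems.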
Abstract: Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts. Capturing this causal pathway is key to building natural full-duplex interactive systems. We introduce a framework that enables reasoning over conversational behaviors by modeling this process as causal inference within a Graph-of-Thoughts (GoT). Our approach formalizes the intent-to-action pathway with a hierarchical labeling scheme, predicting high-level communicative intents and low-level speech acts to learn their causal and temporal dependencies. To train this system, we develop a hybrid corpus that pairs controllable, event-rich simulations with human-annotated rationales and real conversational speech. The GoT framework structures streaming predictions as an evolving graph, enabling a multimodal transformer to forecast the next speech act, generate concise justifications for its decisions, and dynamically refine its reasoning. Experiments on both synthetic and real duplex dialogues show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full duplex spoken dialogue systems.
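[3] Do Latent Tokens Think? A Causal and Adversarial Analysis of Chain-of-Continuous-Thought cs.CL | cs.AI
Yuyi Zhang, Boyu Tang, Tianjie Ju, Sufeng Duan, Gongshen Liu
TL;DR: Through causal and adversarial analysis, this paper studies the internal mechanism of latent tokens in Chain-of-Continuous-Thought (COCONUT). It finds that they are essentially uninterpretable placeholders that do not encode faithful reasoning and that they tend to exploit dataset shortcuts rather than reason genuinely.
Motivation: To examine how reliable latent tokens are when used to enhance LLM reasoning, and to determine whether COCONUT actually encodes a reasoning process or merely relies on shortcuts.
Result: On MMLU and HotpotQA, COCONUT consistently exploits dataset artifacts, which inflates benchmark performance without genuine reasoning. Compared with explicit Chain-of-Thought (CoT) tokens, COCONUT tokens are insensitive to perturbation and lack reasoning-critical information.
Insight: Using steering (perturbation) and shortcut experiments, the paper shows that COCONUT behaves as a pseudo-reasoning mechanism: it generates plausible-looking traces that conceal its dependence on shortcuts. This provides a new analytical lens for assessing the reliability of latent-token reasoning methods.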
Abstract: Latent tokens are gaining attention for enhancing reasoning in large language models (LLMs), yet their internal mechanisms remain unclear. This paper examines the problem from a reliability perspective, uncovering fundamental weaknesses: latent tokens function as uninterpretable placeholders rather than encoding faithful reasoning. While resistant to perturbation, they promote shortcut usage over genuine reasoning. We focus on Chain-of-Continuous-Thought (COCONUT), which claims better efficiency and stability than explicit Chain-of-Thought (CoT) while maintaining performance. We investigate this through two complementary approaches. First, steering experiments perturb specific token subsets, namely COCONUT and explicit CoT. Unlike CoT tokens, COCONUT tokens show minimal sensitivity to steering and lack reasoning-critical information. Second, shortcut experiments evaluate models under biased and out-of-distribution settings. Results on MMLU and HotpotQA demonstrate that COCONUT consistently exploits dataset artifacts, inflating benchmark performance without true reasoning. These findings reposition COCONUT as a pseudo-reasoning mechanism: it generates plausible traces that conceal shortcut dependence rather than faithfully representing reasoning processes.
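[4] Knowledge Reasoning of Large Language Models Integrating Graph-Structured Information for Pest and Disease Control in Tobacco cs.CL
Siyu Li, Chenwei Song, Wan Zhou, Xinyi Liu
TL;DR: This paper proposes an LLM approach that integrates graph-structured information for knowledge reasoning in tobacco pest and disease control. Built on the GraphRAG framework, it strengthens knowledge retrieval and reasoning by explicitly incorporating structured information from a domain knowledge graph. An LLM first assists in building a tobacco pest and disease knowledge graph that organizes key entities (diseases, symptoms, control methods) and their relationships; relevant knowledge is then retrieved from the graph and integrated into the reasoning process to support accurate answers. The model uses a Transformer as the core inference architecture and a graph neural network (GNN) to learn node representations that capture local and global relational information. ChatGLM serves as the backbone LLM and is fine-tuned parameter-efficiently with LoRA.
Motivation: Conventional LLMs lack integration with structured knowledge, which limits their accuracy and depth on complex knowledge reasoning tasks in tobacco pest and disease control, such as multi-hop and comparative reasoning.
Result: Extensive experiments show that the approach consistently outperforms baselines across multiple evaluation metrics and significantly improves the accuracy and depth of reasoning, particularly in complex multi-hop and comparative reasoning scenarios.
Insight: The approach explicitly integrates structured information from a domain knowledge graph into the LLM reasoning pipeline, uses a GNN to learn graph representations for better relation awareness, and applies LoRA for parameter-efficient fine-tuning. It offers a scalable graph-augmented architecture for LLM knowledge reasoning in vertical domains such as agriculture.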
Abstract: This paper proposes a large language model (LLM) approach that integrates graph-structured information for knowledge reasoning in tobacco pest and disease control. Built upon the GraphRAG framework, the proposed method enhances knowledge retrieval and reasoning by explicitly incorporating structured information from a domain-specific knowledge graph. Specifically, LLMs are first leveraged to assist in the construction of a tobacco pest and disease knowledge graph, which organizes key entities such as diseases, symptoms, control methods, and their relationships. Based on this graph, relevant knowledge is retrieved and integrated into the reasoning process to support accurate answer generation. The Transformer architecture is adopted as the core inference model, while a graph neural network (GNN) is employed to learn expressive node representations that capture both local and global relational information within the knowledge graph. A ChatGLM-based model serves as the backbone LLM and is fine-tuned using LoRA to achieve parameter-efficient adaptation. Extensive experimental results demonstrate that the proposed approach consistently outperforms baseline methods across multiple evaluation metrics, significantly improving both the accuracy and depth of reasoning, particularly in complex multi-hop and comparative reasoning scenarios.
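[5] HeartBench: Probing Core Dimensions of Anthropomorphic Intelligence in LLMs cs.CL | cs.AI
Jiaxin Liu, Peiyi Tu, Wenyu Chen, Yihong Zhuang, Xinxia Ling
TL;DR: This paper presents HeartBench, a benchmark framework for evaluating the anthropomorphic intelligence (emotional, cultural, and ethical dimensions) of Chinese LLMs. Grounded in real psychological counseling scenarios and developed with clinical experts, the benchmark follows a theory-driven taxonomy of 5 primary dimensions and 15 secondary capabilities. A "reasoning-before-scoring" evaluation protocol turns abstract human-like traits into granular, measurable criteria. An evaluation of 13 state-of-the-art LLMs shows that even leading models reach only 60% of the expert-defined ideal score, and performance drops markedly on hard scenarios involving subtle emotional subtext and complex ethical trade-offs.
Motivation: LLMs perform well on cognitive and reasoning benchmarks but show a persistent deficit in anthropomorphic intelligence, that is, in handling complex social, emotional, and ethical nuances. The gap is especially acute in the Chinese linguistic and cultural context, where the lack of specialized evaluation frameworks and high-quality socio-emotional data impedes progress.
Result: An evaluation of 13 state-of-the-art LLMs reveals a substantial performance ceiling: leading models reach only 60% of the expert-defined ideal score. Analysis on a difficulty-stratified "Hard Set" shows significant performance decay in scenarios involving subtle emotional subtext and complex ethical trade-offs.
Insight: The paper builds a theory-driven, expert-informed evaluation framework rooted in real counseling scenarios and proposes a "reasoning-before-scoring" protocol that turns abstract human-like traits into granular, measurable criteria. This provides a standardized metric for anthropomorphic AI evaluation and a methodological blueprint for constructing high-quality, human-aligned training data.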
Abstract: While Large Language Models (LLMs) have achieved remarkable success in cognitive and reasoning benchmarks, they exhibit a persistent deficit in anthropomorphic intelligence: the capacity to navigate complex social, emotional, and ethical nuances. This gap is particularly acute in the Chinese linguistic and cultural context, where a lack of specialized evaluation frameworks and high-quality socio-emotional data impedes progress. To address these limitations, we present HeartBench, a framework designed to evaluate the integrated emotional, cultural, and ethical dimensions of Chinese LLMs. Grounded in authentic psychological counseling scenarios and developed in collaboration with clinical experts, the benchmark is structured around a theory-driven taxonomy comprising five primary dimensions and 15 secondary capabilities. We implement a case-specific, rubric-based methodology that translates abstract human-like traits into granular, measurable criteria through a "reasoning-before-scoring" evaluation protocol. Our assessment of 13 state-of-the-art LLMs indicates a substantial performance ceiling: even leading models achieve only 60% of the expert-defined ideal score. Furthermore, analysis using a difficulty-stratified "Hard Set" reveals a significant performance decay in scenarios involving subtle emotional subtexts and complex ethical trade-offs. HeartBench establishes a standardized metric for anthropomorphic AI evaluation and provides a methodological blueprint for constructing high-quality, human-aligned training data.
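[6] Bridging the Copyright Gap: Do Large Vision-Language Models Recognize and Respect Copyrighted Content? cs.CL | cs.AI | cs.CR | cs.CY
Naen Xu, Jinghuai Zhang, Changjiang Li, Hengyu An, Chunyi Zhou
TL;DR: This paper presents a large-scale evaluation of how large vision-language models (LVLMs) handle copyrighted content and finds that even state-of-the-art closed-source models show significant deficiencies in recognizing and respecting it. The authors introduce a benchmark of 50,000 multimodal query-content pairs to measure copyright compliance systematically, and propose a novel tool-augmented defense framework to reduce infringement risk.
Motivation: The widespread availability of LVLMs raises serious concerns about copyright infringement. When a model generates responses based on copyrighted material (for example book excerpts or news reports), failing to comply with copyright regulations can have serious legal and ethical consequences.
Result: Even state-of-the-art closed-source LVLMs show significant deficiencies in recognizing and respecting copyrighted content, including when a copyright notice is present. The proposed tool-augmented defense framework reduces infringement risk in all scenarios.
Insight: The paper systematically evaluates the copyright compliance of LVLMs and builds a large-scale benchmark covering scenarios both with and without a copyright notice. The tool-augmented defense framework offers a practical technical path toward copyright-aware LVLMs and underscores the importance of responsible, lawful use of copyrighted content.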
Abstract: Large vision-language models (LVLMs) have achieved remarkable advancements in multimodal reasoning tasks. However, their widespread accessibility raises critical concerns about potential copyright infringement. Will LVLMs accurately recognize and comply with copyright regulations when encountering copyrighted content (e.g., user input, retrieved documents) in the context? Failure to comply with copyright regulations may lead to serious legal and ethical consequences, particularly when LVLMs generate responses based on copyrighted materials (e.g., retrieved book excerpts, news reports). In this paper, we present a comprehensive evaluation of various LVLMs, examining how they handle copyrighted content, such as book excerpts, news articles, music lyrics, and code documentation, when it is presented as visual input. To systematically measure copyright compliance, we introduce a large-scale benchmark dataset comprising 50,000 multimodal query-content pairs designed to evaluate how effectively LVLMs handle queries that could lead to copyright infringement. Given that real-world copyrighted content may or may not include a copyright notice, the dataset includes query-content pairs in two distinct scenarios: with and without a copyright notice. For the former, we extensively cover four types of copyright notices to account for different cases. Our evaluation reveals that even state-of-the-art closed-source LVLMs exhibit significant deficiencies in recognizing and respecting copyrighted content, even when presented with the copyright notice. To address this limitation, we introduce a novel tool-augmented defense framework for copyright compliance, which reduces infringement risks in all scenarios. Our findings underscore the importance of developing copyright-aware LVLMs to ensure the responsible and lawful use of copyrighted content.
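[7] CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics cs.CL | cs.AI
Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa
TL;DR: This paper introduces CricBench, a multilingual benchmark suite for evaluating LLMs on cricket analytics. The benchmark contains complex queries in English and Hindi and is used to evaluate six state-of-the-art models, including GPT-4o, Claude 3.7 Sonnet, and open-source models. The results show that models that do well on general benchmarks still suffer a significant performance drop in this specialized domain, and that code-mixed Hindi queries often match or exceed the accuracy of English ones.
Motivation: Cricket is the second most popular sport worldwide and its followers want advanced statistical analysis, yet the ability of existing LLMs to handle the domain-specific nuances, complex schema variations, and multilingual requirements of sports analytics remains under-explored.
Result: Six state-of-the-art models are evaluated on CricBench. The open-weights reasoning model DeepSeek R1 performs best (50.6%), ahead of Claude 3.7 Sonnet (47.7%) and GPT-4o (33.7%). Even DeepSeek R1 shows a significant accuracy drop when moving from a general benchmark (BIRD) to CricBench.
Insight: The paper builds a multilingual (English/Hindi) Text-to-SQL benchmark for cricket analytics and shows that strong performance on general benchmarks does not guarantee success in specialized domains. A notable finding is that English is not always the best prompt language for specialized SQL tasks: code-mixed Hindi queries can perform as well or better.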
Abstract: Cricket is the second most popular sport globally, commanding a massive following of over 2.5 billion fans. Enthusiasts and analysts frequently seek advanced statistical insights, such as long-term historical performance trends or complex player comparisons, that are often unavailable through standard web searches. While Large Language Models (LLMs) have advanced significantly in Text-to-SQL tasks, their capability to handle the domain-specific nuances, complex schema variations, and multilingual requirements inherent to sports analytics remains under-explored. To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data. To curate a "Gold Standard" dataset, we collaborate with domain experts in cricket and SQL to manually author complex queries, ensuring logical correctness. Recognizing linguistic diversity, we construct the benchmark in both English and Hindi, establishing a framework that is open for further extension to other regional languages. We evaluate six state-of-the-art models, including GPT-4o, Claude 3.7 Sonnet, and open-source models, using a strict evaluation protocol. Our results reveal that high performance on general benchmarks does not guarantee success in specialized domains. While the open-weights reasoning model DeepSeek R1 achieves state-of-the-art performance (50.6%), surpassing proprietary giants like Claude 3.7 Sonnet (47.7%) and GPT-4o (33.7%), it still exhibits a significant accuracy drop when moving from general benchmarks (BIRD) to CricBench. Furthermore, we observe that code-mixed Hindi queries frequently yield parity or higher accuracy compared to English, challenging the assumption that English is the optimal prompt language for specialized SQL tasks.
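Illustrative sketch (an assumption, not CricBench's protocol): the abstract does not spell out its "strict evaluation protocol". A common way to score Text-to-SQL output is execution accuracy, where a predicted query counts as correct if it returns the same rows as the gold query. The `innings` table and all queries below are hypothetical toy data.

```python
import sqlite3

def execution_match(conn, predicted_sql, gold_sql):
    """Return True if the predicted query runs and yields the same rows as the gold query.

    Row order is ignored by sorting both result sets before comparison.
    """
    try:
        pred_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A query that fails to execute counts as incorrect.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(pred_rows) == sorted(gold_rows)

# Hypothetical toy database standing in for real cricket data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE innings (player TEXT, runs INTEGER)")
conn.executemany(
    "INSERT INTO innings VALUES (?, ?)",
    [("A", 50), ("A", 70), ("B", 30)],
)

gold = "SELECT player, SUM(runs) FROM innings GROUP BY player"
good = ("SELECT player, SUM(runs) AS total FROM innings "
        "GROUP BY player ORDER BY total DESC")
bad = "SELECT player, MAX(runs) FROM innings GROUP BY player"
```

Comparing result sets rather than SQL strings lets differently written but equivalent queries (such as `good` above) pass, while a query that answers a different question (`bad`) fails. A stricter protocol might also compare column order or row order where the question demands it.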
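[8] Accelerate Speculative Decoding with Sparse Computation in Verification cs.CL
Jikai Wang, Jianchao Tan, Yuxuan Hu, Jiayu Qin, Yerui Sun
TL;DR: This paper proposes a sparse verification framework that accelerates the verification stage of speculative decoding. The framework jointly sparsifies attention, feed-forward network (FFN), and mixture-of-experts (MoE) components and adds an inter-draft-token and inter-layer retrieval reuse strategy to cut redundant computation, achieving a better efficiency-accuracy trade-off while keeping acceptance length stable.
Motivation: Speculative decoding speeds up autoregressive language model inference by verifying multiple draft tokens in parallel, but the verification stage often becomes the dominant computational bottleneck, especially for long-context inputs and MoE models. Existing sparsification methods are designed mainly for standard token-by-token autoregressive decoding and do not address the redundancy in the verification stage.
Result: Extensive experiments on summarization, question answering, and mathematical reasoning datasets show that the proposed methods achieve favorable efficiency-accuracy trade-offs while maintaining stable acceptance length.
Insight: The paper systematically applies sparse methods to the verification stage of speculative decoding, identifies structured redundancy across multiple dimensions, and jointly sparsifies attention, FFN, and MoE components. Combined with retrieval reuse across draft tokens and across layers, this reduces computation without any additional training.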
Abstract: Speculative decoding accelerates autoregressive language model inference by verifying multiple draft tokens in parallel. However, the verification stage often becomes the dominant computational bottleneck, especially for long-context inputs and mixture-of-experts (MoE) models. Existing sparsification methods are designed primarily for standard token-by-token autoregressive decoding to remove substantial computational redundancy in LLMs. This work systematically applies different sparse methods to the verification stage of speculative decoding and identifies structured redundancy across multiple dimensions. Based on these observations, we propose a sparse verification framework that jointly sparsifies attention, FFN, and MoE components during the verification stage to reduce the dominant computation cost. The framework also incorporates an inter-draft token and inter-layer retrieval reuse strategy to further reduce redundant computation without introducing additional training. Extensive experiments across summarization, question answering, and mathematical reasoning datasets demonstrate that the proposed methods achieve favorable efficiency-accuracy trade-offs, while maintaining stable acceptance length.
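Illustrative sketch (background only, not the paper's sparse framework): the "acceptance length" mentioned above is the number of draft tokens the target model confirms in one verification pass. Under greedy decoding, verification can be pictured as a prefix match between the draft tokens and the target model's own greedy choices at the same positions. The function name and token IDs are hypothetical; sampling-based speculative decoding uses a probabilistic accept/reject rule instead.

```python
def verify_greedy(draft_tokens, target_tokens):
    """Greedy verification step of speculative decoding.

    draft_tokens:  k tokens proposed by the draft model.
    target_tokens: k + 1 tokens, where target_tokens[i] is the target model's
                   greedy choice given the prefix plus draft_tokens[:i]
                   (all computed in one parallel forward pass).

    Returns (emitted_tokens, acceptance_length).
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")
    accepted = 0
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break
        accepted += 1
    # Keep the accepted prefix, then append the target model's own token:
    # the correction at the first mismatch, or a bonus token if all matched.
    emitted = list(draft_tokens[:accepted]) + [target_tokens[accepted]]
    return emitted, accepted
```

For example, `verify_greedy([5, 7, 9], [5, 7, 2, 4])` accepts two draft tokens and emits `[5, 7, 2]`. The paper's contribution concerns making the forward pass that produces `target_tokens` cheaper, which this sketch treats as given.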
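[9] SWE-RM: Execution-free Feedback For Software Engineering Agents cs.CL
KaShun Shum, Binyuan Hui, Jiawei Chen, Lei Zhang, X. W.
TL;DR: This paper proposes SWE-RM, an execution-free reward model for software engineering agents. Built on a mixture-of-experts architecture, it aims to overcome the sparsity of conventional unit-test feedback and provide finer-grained signals, improving agent performance under both test-time scaling (TTS) and reinforcement learning (RL).
Motivation: Execution-based feedback such as unit tests has limitations for developing software engineering agents: it depends on scalable test-case collection, the feedback is sparse, and it cannot distinguish between trajectories that are both successful or both unsuccessful.
Result: On SWE-Bench Verified with test-time scaling, SWE-RM raises the accuracy of Qwen3-Coder-Flash from 51.6% to 62.0% and of Qwen3-Coder-Max from 67.0% to 74.6%, a new state of the art among open-source models.
Insight: The paper identifies that, for RL, a reward model needs classification accuracy and calibration in addition to the ability to pick the best trajectory. Controlled experiments examine the effect of training data scale, policy mixtures, and data source composition, leading to a mixture-of-experts reward model that is robust in both settings.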
Abstract: Execution-based feedback like unit testing is widely used in the development of coding agents through test-time scaling (TTS) and reinforcement learning (RL). This paradigm requires scalable and reliable collection of unit test cases to provide accurate feedback, and the resulting feedback is often sparse and cannot effectively distinguish between trajectories that are both successful or both unsuccessful. In contrast, execution-free feedback from reward models can provide more fine-grained signals without depending on unit test cases. Despite this potential, execution-free feedback for realistic software engineering (SWE) agents remains underexplored. In aiming to develop versatile reward models that are effective across TTS and RL, however, we observe that two verifiers with nearly identical TTS performance can nevertheless yield very different results in RL. Intuitively, TTS primarily reflects the model's ability to select the best trajectory, but this ability does not necessarily generalize to RL. To address this limitation, we identify two additional aspects that are crucial for RL training: classification accuracy and calibration. We then conduct comprehensive controlled experiments to investigate how to train a robust reward model that performs well across these metrics. In particular, we analyze the impact of various factors such as training data scale, policy mixtures, and data source composition. Guided by these investigations, we introduce SWE-RM, an accurate and robust reward model adopting a mixture-of-experts architecture with 30B total parameters and 3B activated during inference. SWE-RM substantially improves SWE agents on both TTS and RL performance. For example, it increases the accuracy of Qwen3-Coder-Flash from 51.6% to 62.0%, and Qwen3-Coder-Max from 67.0% to 74.6% on SWE-Bench Verified using TTS, achieving new state-of-the-art performance among open-source models.
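Illustrative sketch (toy numbers, not results from the paper): the observation that two verifiers can match on TTS yet differ for RL can be pictured with best-of-N selection versus per-trajectory classification accuracy. The scores and labels below are invented for illustration, and the 0.5 threshold is an assumption.

```python
def best_of_n(scores):
    """Index of the highest-scoring trajectory (what test-time scaling relies on)."""
    return max(range(len(scores)), key=scores.__getitem__)

def classification_accuracy(scores, labels, threshold=0.5):
    """Fraction of trajectories whose thresholded score matches the true outcome."""
    hits = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return hits / len(labels)

# One task, four trajectories; only the first one actually resolves the issue.
labels = [1, 0, 0, 0]
verifier_a = [0.9, 0.2, 0.1, 0.3]   # ranks correctly AND separates pass/fail
verifier_b = [0.9, 0.8, 0.7, 0.6]   # ranks correctly but calls everything a pass

same_pick = best_of_n(verifier_a) == best_of_n(verifier_b)
acc_a = classification_accuracy(verifier_a, labels)
acc_b = classification_accuracy(verifier_b, labels)
```

Both verifiers select the same trajectory, so best-of-N results are identical, but the second one would reward three failed trajectories almost as highly as the successful one if its scores were used as an RL signal.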
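[10] Context as a Tool: Context Management for Long-Horizon SWE-Agents cs.CL
Shukai Liu, Jian Yang, Bo Jiang, Yizhi Li, Jinyang Guo
TL;DR: This paper proposes CAT, a new context management paradigm for improving LLM-based software engineering (SWE) agents on long-horizon interactive tasks. CAT turns context maintenance into a callable tool, so the agent can proactively compress its historical trajectory and maintain a structured context workspace made up of stable task semantics, condensed long-term memory, and high-fidelity short-term interactions.
Motivation: When interacting with repository-scale codebases over long horizons, existing agents typically rely on append-only context maintenance or passively triggered compression heuristics, which lead to context explosion, semantic drift, and degraded reasoning. The paper aims to fix these problems and improve the stability and scalability of agents on long-horizon SWE tasks.
Result: On SWE-Bench-Verified, SWE-Compressor, a context-aware model trained with the CAT framework, reaches a 57.6% solved rate, significantly outperforming ReAct-based agents and static compression baselines while maintaining stable, scalable long-horizon reasoning under a bounded context budget.
Insight: The paper elevates context management to a callable tool within the agent's decision-making process and proposes CAT-GENERATOR, a trajectory-level supervision framework for training context-aware models. This gives long-horizon interactive tasks a proactive, structured approach to context management that may transfer to other agent systems needing long-term memory and efficient information handling.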
Abstract: Agents based on large language models have recently shown strong potential on real-world software engineering (SWE) tasks that require long-horizon interaction with repository-scale codebases. However, most existing agents rely on append-only context maintenance or passively triggered compression heuristics, which often lead to context explosion, semantic drift, and degraded reasoning in long-running interactions. We propose CAT, a new context management paradigm that elevates context maintenance to a callable tool integrated into the decision-making process of agents. CAT formalizes a structured context workspace consisting of stable task semantics, condensed long-term memory, and high-fidelity short-term interactions, and enables agents to proactively compress historical trajectories into actionable summaries at appropriate milestones. To support context management for SWE-agents, we propose a trajectory-level supervision framework, CAT-GENERATOR, based on an offline data construction pipeline that injects context-management actions into complete interaction trajectories. Using this framework, we train a context-aware model, SWE-Compressor. Experiments on SWE-Bench-Verified demonstrate that SWE-Compressor reaches a 57.6% solved rate and significantly outperforms ReAct-based agents and static compression baselines, while maintaining stable and scalable long-horizon reasoning under a bounded context budget.
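Illustrative sketch (hypothetical names throughout, not the authors' implementation): the three-part workspace described above, with compression exposed as a method the agent can call, could be pictured as a small data structure. `ContextWorkspace`, `compress`, `keep_last`, and the `summarize` callback are all assumptions for illustration; in CAT the summary would come from the trained model rather than a stub.

```python
from dataclasses import dataclass, field

@dataclass
class ContextWorkspace:
    task: str                                    # stable task semantics
    memory: list = field(default_factory=list)   # condensed long-term memory
    recent: list = field(default_factory=list)   # high-fidelity short-term turns

    def add_turn(self, turn):
        """Record a new interaction step verbatim."""
        self.recent.append(turn)

    def compress(self, summarize, keep_last=2):
        """Callable 'tool': fold older turns into one summary, keep the newest verbatim."""
        split = max(len(self.recent) - keep_last, 0)
        old, self.recent = self.recent[:split], self.recent[split:]
        if old:
            self.memory.append(summarize(old))

    def render(self):
        """Assemble the prompt context: task, then summaries, then recent turns."""
        return "\n".join([self.task, *self.memory, *self.recent])

ws = ContextWorkspace(task="fix bug")
for i in range(1, 6):
    ws.add_turn(f"t{i}")
# Stub summarizer standing in for a model-generated summary.
ws.compress(lambda turns: f"summary of {len(turns)} turns")
```

After the call, the rendered context holds the task line, one summary line, and the two most recent turns, so its size stays bounded no matter how long the interaction runs. Deciding when to call `compress` is the part the paper trains the agent to do.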
cs.CV
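[11] Understanding Virality: A Rubric based Vision-Language Model Framework for Short-Form Edutainment Evaluation cs.CV
Arnav Gupta, Gurekas Singh Sahney, Hardik Rathi, Abhishek Chandwani, Ishaan Gupta
TL;DR: This paper proposes a data-driven evaluation framework for short-form video based on vision-language models (VLMs). The framework uses a VLM to extract unsupervised audiovisual features, clusters them into interpretable factors, and trains a regression-based evaluator to predict audience engagement. Experiments show that it predicts actual engagement well and offers more interpretable and scalable assessment than traditional metrics.
Motivation: Existing evaluation frameworks such as VideoScore-2 focus on visual and semantic fidelity and do not capture how specific audiovisual attributes drive real audience engagement. The paper aims for an evaluation method that is closer to human perception and grounded in multimodal reasoning, moving beyond surface-level quality metrics.
Result: On the authors' curated YouTube Shorts dataset, predicted and actual engagement are strongly correlated. Compared with traditional metrics such as SSIM and FID, the lightweight feature-based evaluator gives more interpretable and scalable assessments.
Insight: The framework combines unsupervised VLM-extracted audiovisual features with human engagement signals to produce a data-driven, interpretable evaluator. Through feature clustering and regression, it delivers multimodal, human-centered evaluation of short videos and a more robust, explainable route to video understanding.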
Abstract: Evaluating short-form video content requires moving beyond surface-level quality metrics toward human-aligned, multimodal reasoning. While existing frameworks like VideoScore-2 assess visual and semantic fidelity, they do not capture how specific audiovisual attributes drive real audience engagement. In this work, we propose a data-driven evaluation framework that uses Vision-Language Models (VLMs) to extract unsupervised audiovisual features, clusters them into interpretable factors, and trains a regression-based evaluator to predict engagement on short-form edutainment videos. Our curated YouTube Shorts dataset enables systematic analysis of how VLM-derived features relate to human engagement behavior. Experiments show strong correlations between predicted and actual engagement, demonstrating that our lightweight, feature-based evaluator provides interpretable and scalable assessments compared to traditional metrics (e.g., SSIM, FID). By grounding evaluation in both multimodal feature importance and human-centered engagement signals, our approach advances toward robust and explainable video understanding.
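[12] A Tool Bottleneck Framework for Clinically-Informed and Interpretable Medical Image Understanding cs.CV | cs.LG
Christina Liu, Alan Q. Wang, Joy Hsu, Jiajun Wu, Ehsan Adeli
TL;DR: This paper proposes the Tool Bottleneck Framework (TBF) for medical image understanding. An off-the-shelf medical vision-language model (VLM) selects tools from a toolbox to extract clinically relevant features, and a learned Tool Bottleneck Model (TBM) fuses the tool outputs to produce the final prediction, replacing the usual text-based composition. The method targets the difficulty of fusing spatially localized features through text and is evaluated on histopathology and dermatology tasks.
Motivation: Existing VLM-based tool-use frameworks usually compose tools through function calls in generated code or natural language, relying on text to fuse information. In medical image understanding, however, the salient information is encoded as spatially localized features that are hard to compose or fuse through text alone, which hurts performance. The paper addresses this limitation.
Result: On histopathology and dermatology tasks, the framework performs on par with or better than deep learning-based classifiers, VLMs, and state-of-the-art tool-use frameworks, with particular gains in data-limited regimes.
Insight: The main contribution is a learned Tool Bottleneck Model that computes and fuses tool outputs directly, replacing text-based tool composition. This improves performance on medical image understanding and makes predictions more interpretable and clinically grounded, giving tool-use frameworks a fusion mechanism better suited to medical imaging.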
Abstract: Recent tool-use frameworks powered by vision-language models (VLMs) improve image understanding by grounding model predictions with specialized tools. Broadly, these frameworks leverage VLMs and a pre-specified toolbox to decompose the prediction task into multiple tool calls (often deep learning models) which are composed to make a prediction. The dominant approach to composing tools is using text, via function calls embedded in VLM-generated code or natural language. However, these methods often perform poorly on medical image understanding, where salient information is encoded as spatially-localized features that are difficult to compose or fuse via text alone. To address this, we propose a tool-use framework for medical image understanding called the Tool Bottleneck Framework (TBF), which composes VLM-selected tools using a learned Tool Bottleneck Model (TBM). For a given image and task, TBF leverages an off-the-shelf medical VLM to select tools from a toolbox that each extract clinically-relevant features. Instead of text-based composition, these tools are composed by the TBM, which computes and fuses the tool outputs using a neural network before outputting the final prediction. We propose a simple and effective strategy for TBMs to make predictions with any arbitrary VLM tool selection. Overall, our framework not only improves tool-use in medical imaging contexts, but also yields more interpretable, clinically-grounded predictors. We evaluate TBF on tasks in histopathology and dermatology and find that these advantages enable our framework to perform on par with or better than deep learning-based classifiers, VLMs, and state-of-the-art tool-use frameworks, with particular gains in data-limited regimes. Our code is available at https://github.com/christinaliu2020/tool-bottleneck-framework.
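[13] Intelligent recognition of GPR road hidden defect images based on feature fusion and attention mechanism cs.CV | cs.AI
Haotian Lv, Yuhui Zhang, Jiangbo Dai, Hanli Wu, Jiaji Wang
TL;DR: This paper proposes a framework for intelligent recognition of hidden road defects in ground penetrating radar (GPR) images, based on feature fusion and attention. It combines DCGAN-based data augmentation, a Multi-modal Chain and Global Attention Network (MCGA-Net), and MS COCO transfer learning to address the reliance of conventional GPR image interpretation on subjective expertise, which makes it inefficient and inaccurate.
Motivation: Conventional GPR image interpretation depends heavily on expert judgment, which is slow and insufficiently accurate. An automated, high-accuracy recognition method is needed to overcome these limitations.
Result: MCGA-Net achieves 92.8% precision, 92.5% recall, and 95.9% mAP@50. It remains robust under Gaussian noise, weak signals, and small targets, and outperforms other models.
Insight: The contributions are DCGAN data augmentation to mitigate data scarcity, the combination of Multi-modal Chain Feature Fusion (MCFF) and a Global Attention Mechanism (GAM) for hierarchical multi-scale defect representation and context-aware feature enhancement, and MS COCO transfer learning to improve generalization. Together they offer a new paradigm for automated defect detection in complex subsurface environments.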
Abstract: Ground Penetrating Radar (GPR) has emerged as a pivotal tool for non-destructive evaluation of subsurface road defects. However, conventional GPR image interpretation remains heavily reliant on subjective expertise, introducing inefficiencies and inaccuracies. This study introduces a comprehensive framework to address these limitations: (1) A DCGAN-based data augmentation strategy synthesizes high-fidelity GPR images to mitigate data scarcity while preserving defect morphology under complex backgrounds; (2) A novel Multi-modal Chain and Global Attention Network (MCGA-Net) is proposed, integrating Multi-modal Chain Feature Fusion (MCFF) for hierarchical multi-scale defect representation and Global Attention Mechanism (GAM) for context-aware feature enhancement; (3) MS COCO transfer learning fine-tunes the backbone network, accelerating convergence and improving generalization. Ablation and comparison experiments validate the framework's efficacy. MCGA-Net achieves Precision (92.8%), Recall (92.5%), and mAP@50 (95.9%). In detection under Gaussian noise, weak signals, and small targets, MCGA-Net maintains robustness and outperforms other models. This work establishes a new paradigm for automated GPR-based defect detection, balancing computational efficiency with high accuracy in complex subsurface environments.
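[14] GPF-Net: Gated Progressive Fusion Learning for Polyp Re-Identification cs.CV | cs.AI
Suncheng Xiang, Xiaoyang Wang, Junjie Jiang, Hejia Wang, Dahong Qian
TL;DR: This paper proposes GPF-Net (Gated Progressive Fusion network), a new architecture for colonoscopic polyp re-identification. The network uses gates to selectively fuse multi-level features in a fully connected manner and introduces a gated progressive fusion strategy that refines semantic information layer by layer through multi-level feature interactions.
Motivation: In colonoscopic polyp re-identification, the high-level features of a given polyp tend to have coarse resolution, which leads to poor results on small objects where fine detail matters. The paper addresses this challenge.
Result: Experiments on standard benchmarks show the benefits of the multimodal setting over state-of-the-art unimodal re-identification models, especially when combined with the specialized multimodal fusion strategy.
Insight: The main contribution is the gated progressive fusion network and strategy, which uses gating to fuse and exchange multi-level features selectively and so captures more of the fine detail that is critical for small-object recognition.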
Abstract: Colonoscopic Polyp Re-Identification aims to match the same polyp from a large gallery with images from different views taken using different cameras, which plays an important role in the prevention and treatment of colorectal cancer in computer-aided diagnosis. However, the coarse resolution of high-level features of a specific polyp often leads to inferior results for small objects where detailed information is important. To address this challenge, we propose a novel architecture, named Gated Progressive Fusion network, to selectively fuse features from multiple levels using gates in a fully connected way for polyp ReID. Building on this architecture, a gated progressive fusion strategy is introduced to achieve layer-wise refinement of semantic information through multi-level feature interactions. Experiments on standard benchmarks show the benefits of the multimodal setting over state-of-the-art unimodal ReID models, especially when combined with the specialized multimodal fusion strategy.
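Illustrative sketch (a generic gating pattern, not GPF-Net's actual design): the abstract does not specify how its gates are built. One common form of gated fusion computes an elementwise gate from the two feature vectors and blends them with it. The function names, shapes, and weights below are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fuse(low, high, w_gate, b_gate):
    """Blend two same-shaped feature vectors with a learned elementwise gate.

    low, high: feature vectors of dimension d (e.g. low-level and high-level features).
    w_gate:    (2d, d) weight matrix; b_gate: (d,) bias. Both would be learned.
    """
    gate = sigmoid(np.concatenate([low, high]) @ w_gate + b_gate)
    # gate near 1 keeps the low-level detail; gate near 0 keeps the high-level feature.
    return gate * low + (1.0 - gate) * high

low = np.array([2.0, 4.0])
high = np.array([0.0, 0.0])

# With zero weights and bias the gate is 0.5 everywhere, i.e. a plain average.
averaged = gated_fuse(low, high, np.zeros((4, 2)), np.zeros(2))
# A large positive bias saturates the gate and passes the low-level features through.
mostly_low = gated_fuse(low, high, np.zeros((4, 2)), np.full(2, 50.0))
```

A progressive variant would apply such a fusion repeatedly, level by level, so that each stage refines the output of the previous one.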
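[15] SVBench: Evaluation of Video Generation Models on Social Reasoning cs.CV
Wenshuo Peng, Gongxuan Wang, Tianmeng Yang, Chuanhao Li, Xiaojie Xu
TL;DR: This paper introduces SVBench, the first benchmark for social reasoning in video generation models. Grounded in classic paradigms from developmental and social psychology, it comprises 30 social cognition tasks across seven core dimensions. The authors build a training-free, agent-based evaluation pipeline that uses a vision-language model (VLM) as judge and run a large-scale evaluation of seven state-of-the-art video generation systems. Although current models do well on visual realism and motion fidelity, they fail systematically on deeper social reasoning such as intention recognition, belief reasoning, joint attention, and prosocial inference.
Motivation: Text-to-video models have made marked progress in visual realism and text alignment but cannot generate socially coherent behavior: unlike humans, they do not infer intentions, beliefs, emotions, and social norms from visual cues. Evaluating this gap systematically requires a video generation benchmark dedicated to social reasoning.
Result: A large-scale evaluation of seven state-of-the-art video generation systems shows that modern models excel at surface-level plausibility but have substantial performance gaps on core social reasoning dimensions such as intention recognition, belief reasoning, joint attention, and prosocial inference.
Insight: The paper is the first to translate classic experimental paradigms from developmental and social psychology systematically into computable video generation evaluation tasks, and it designs a training-free, automated evaluation framework based on an agent pipeline with a VLM judge. This gives a new, interpretable benchmark direction for assessing and advancing the social intelligence of generative models.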
Abstract: Recent text-to-video generation models exhibit remarkable progress in visual realism, motion fidelity, and text-video alignment, yet they remain fundamentally limited in their ability to generate socially coherent behavior. Unlike humans, who effortlessly infer intentions, beliefs, emotions, and social norms from brief visual cues, current models tend to render literal scenes without capturing the underlying causal or psychological logic. To systematically evaluate this gap, we introduce the first benchmark for social reasoning in video generation. Grounded in findings from developmental and social psychology, our benchmark organizes thirty classic social cognition paradigms into seven core dimensions, including mental-state inference, goal-directed action, joint attention, social coordination, prosocial behavior, social norms, and multi-agent strategy. To operationalize these paradigms, we develop a fully training-free agent-based pipeline that (i) distills the reasoning mechanism of each experiment, (ii) synthesizes diverse video-ready scenarios, (iii) enforces conceptual neutrality and difficulty control through cue-based critique, and (iv) evaluates generated videos using a high-capacity VLM judge across five interpretable dimensions of social reasoning. Using this framework, we conduct the first large-scale study across seven state-of-the-art video generation systems. Our results reveal substantial performance gaps: while modern models excel in surface-level plausibility, they systematically fail in intention recognition, belief reasoning, joint attention, and prosocial inference.
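[16] Fixed-Budget Parameter-Efficient Training with Frozen Encoders Improves Multimodal Chest X-Ray Classification cs.CV
Md Ashik Khan, Md Nahid Siddique
TL;DR: This study explores parameter-efficient training strategies with frozen encoders (including BitFit, LoRA, and adapters) under a fixed parameter budget for multimodal chest X-ray classification. On the Indiana University dataset, these methods reach 0.892 to 0.908 AUROC with only 2.37M trainable parameters (2.51% of the total), clearly outperforming full fine-tuning with 94.3M parameters (0.770 AUROC). Scalability is validated on the larger CheXpert dataset, where adapters perform best (0.7214 AUROC). The gains stem mainly from parameter allocation rather than cross-modal synergy, but the models are poorly calibrated and need post-hoc correction.
Motivation: Fine-tuning large vision-language models for multimodal chest X-ray analysis is computationally expensive. The study examines how feasible and effective parameter-efficient training strategies are under a fixed parameter budget.
Result: On the Indiana University dataset, all parameter-efficient variants reach 0.892 to 0.908 AUROC under a 2.37M-parameter budget, well above the 0.770 of full fine-tuning (94.3M parameters). In external validation on CheXpert, all methods exceed 0.69 AUROC with fewer than 9% trainable parameters, with adapters best at 0.7214 AUROC. Calibration error is high, however (ECE: 0.29-0.34).
Insight: The study systematically compares several parameter-efficient training strategies for multimodal medical image classification under a fixed budget and shows that the gains come mainly from parameter allocation rather than modality fusion. Frozen-encoder strategies keep performance high at much lower computational cost, but their calibration deficit must be addressed, which makes them a practical option for resource-constrained clinical deployment only with calibration correction.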
Abstract: Multimodal chest X-Ray analysis often fine-tunes large vision-language models, which is computationally costly. We study parameter-efficient training (PET) strategies, including frozen encoders, BitFit, LoRA, and adapters for multi-label classification on the Indiana University Chest X-Ray dataset (3,851 image-report pairs; 579 test samples). To mitigate data leakage, we redact pathology terms from reports used as text inputs while retaining clinical context. Under a fixed parameter budget (2.37M parameters, 2.51% of total), all PET variants achieve AUROC between 0.892 and 0.908, outperforming full fine-tuning (0.770 AUROC), which uses 94.3M trainable parameters, a 40x reduction. External validation on CheXpert (224,316 images, 58x larger) confirms scalability: all PET methods achieve >0.69 AUROC with <9% trainable parameters, with Adapter achieving best performance (0.7214 AUROC). Budget-matched comparisons reveal that vision-only models (0.653 AUROC, 1.06M parameters) outperform budget-matched multimodal models (0.641 AUROC, 1.06M parameters), indicating improvements arise primarily from parameter allocation rather than cross-modal synergy. While PET methods show degraded calibration (ECE: 0.29-0.34) compared to simpler models (ECE: 0.049), this represents a tractable limitation addressable through post-hoc calibration methods. These findings demonstrate that frozen encoder strategies provide superior discrimination at substantially reduced computational cost, though calibration correction is essential for clinical deployment.
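The calibration figures quoted above are expected calibration error (ECE) values. As a minimal sketch of what that metric measures, the following computes a binary, equal-width-bin ECE for a single label. The binning scheme, the 0.5 decision threshold, and the function name are assumptions for illustration; the abstract does not state how the authors aggregate ECE across labels.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: weighted mean of |accuracy - confidence| over equal-width bins.

    probs  : predicted probability of the positive class, one per sample
    labels : ground-truth 0/1 labels, one per sample
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        # Confidence is the probability assigned to the predicted class.
        conf = p if pred == 1 else 1.0 - p
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))

    total = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(a for _, a in b) / len(b)
        ece += (len(b) / total) * abs(accuracy - avg_conf)
    return ece
```

A model that predicts 0.9 and is right 90% of the time scores close to zero, while one that predicts 0.9 and is always wrong scores close to 0.9, which is why a high ECE can coexist with a high AUROC.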
[17] MuS-Polar3D: A Benchmark Dataset for Computational Polarimetric 3D Imaging under Multi-Scattering Conditions cs.CVPDF
Puyun Wang, Kaimin Yu, Huayang He, Xianyu Wu
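TL;DR: The paper builds MuS-Polar3D, a benchmark dataset for computational polarimetric 3D imaging under multi-scattering conditions. It contains polarization images of 42 objects captured under seven quantitatively controlled scattering conditions and five viewpoints, together with high-precision 3D models, normal maps, and foreground masks. The paper also proposes a two-stage underwater 3D reconstruction pipeline from a computational imaging perspective and validates the dataset experimentally.
Details
Motivation: Existing public datasets for polarization-based underwater 3D imaging lack diversity in scattering and observation conditions, which hinders fair comparison among methods such as single-view and multi-view polarization imaging.
Result: In extensive evaluations with multiple baseline methods under complex scattering conditions, the best mean angular error is 15.49 degrees. The dataset is the first publicly available benchmark for quantitative-turbidity underwater polarization-based 3D imaging.
Insight: The main contribution is MuS-Polar3D, the first publicly available benchmark dataset for quantitative-turbidity underwater polarization-based 3D imaging with controllable scattering conditions. In addition, from an imaging-chain perspective, underwater 3D reconstruction under scattering is decoupled into a two-stage pipeline of descattering followed by 3D reconstruction, which supports fair algorithm evaluation and accurate reconstruction.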
Abstract: Polarization-based underwater 3D imaging exploits polarization cues to suppress background scattering, exhibiting distinct advantages in turbid water. Although data-driven polarization-based underwater 3D reconstruction methods show great potential, existing public datasets lack sufficient diversity in scattering and observation conditions, hindering fair comparisons among different approaches, including single-view and multi-view polarization imaging methods. To address this limitation, we construct MuS-Polar3D, a benchmark dataset comprising polarization images of 42 objects captured under seven quantitatively controlled scattering conditions and five viewpoints, together with high-precision 3D models (+/- 0.05 mm accuracy), normal maps, and foreground masks. The dataset supports multiple vision tasks, including normal estimation, object segmentation, descattering, and 3D reconstruction. Inspired by computational imaging, we further decouple underwater 3D reconstruction under scattering into a two-stage pipeline, namely descattering followed by 3D reconstruction, from an imaging-chain perspective. Extensive evaluations using multiple baseline methods under complex scattering conditions demonstrate the effectiveness of the proposed benchmark, achieving a best mean angular error of 15.49 degrees. To the best of our knowledge, MuS-Polar3D is the first publicly available benchmark dataset for quantitative turbidity underwater polarization-based 3D imaging, enabling accurate reconstruction and fair algorithm evaluation under controllable scattering conditions. The dataset and code are publicly available at https://github.com/WangPuyun/MuS-Polar3D.
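The headline metric above is mean angular error between predicted and ground-truth surface normals. A minimal sketch of that computation follows, assuming normals are given as nonzero 3-vectors for foreground pixels only; the function name and input layout are illustrative and are not taken from the MuS-Polar3D code.

```python
import math


def mean_angular_error_deg(pred_normals, gt_normals):
    """Mean angle in degrees between paired normal vectors.

    Both inputs are sequences of (x, y, z) tuples; vectors need not be unit
    length but must be nonzero.
    """
    total = 0.0
    for p, g in zip(pred_normals, gt_normals):
        dot = sum(a * b for a, b in zip(p, g))
        norm_p = math.sqrt(sum(a * a for a in p))
        norm_g = math.sqrt(sum(b * b for b in g))
        # Clamp to [-1, 1] so rounding error cannot push acos out of domain.
        cos_angle = max(-1.0, min(1.0, dot / (norm_p * norm_g)))
        total += math.degrees(math.acos(cos_angle))
    return total / len(pred_normals)
```

In practice the foreground masks shipped with the dataset would presumably select which pixels enter the average.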
[18] DiverseGRPO: Mitigating Mode Collapse in Image Generation via Diversity-Aware GRPO cs.CV | cs.AIPDF
Henglin Liu, Huijuan Huang, Jing Wang, Chang Liu, Xiu Li
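TL;DR: This paper proposes DiverseGRPO, which adds a distribution-level creativity reward based on semantic clustering and a structure-aware regularization to mitigate the mode collapse that GRPO exhibits in the later stages of image-generation training, substantially improving generation diversity while preserving image quality.
Details
Motivation: Conventional GRPO tends to produce homogenized outputs lacking visual diversity late in image-generation training. This stems from a reward signal that considers only single-sample quality and ignores distribution-level diversity, and from regularization that does not account for the dominant role of early denoising steps in diversity, which limits the quality-diversity trade-off.
Result: Experiments show that the method improves semantic diversity by 13%–18% at matched quality scores, establishing a new quality-diversity Pareto frontier for GRPO-based image generation.
Insight: The contributions are twofold. On the reward-modeling side, a distribution-level creativity reward based on semantic clustering adaptively allocates exploration rewards according to group size to encourage discovery of new visual modes. On the generation-dynamics side, a structure-aware regularization strengthens early-stage constraints to preserve diversity without harming reward-optimization efficiency.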
Abstract: Reinforcement learning (RL), particularly GRPO, improves image generation quality significantly by comparing the relative performance of images generated within the same group. However, in the later stages of training, the model tends to produce homogenized outputs, lacking creativity and visual diversity, which restricts its application scenarios. This issue can be analyzed from both reward modeling and generation dynamics perspectives. First, traditional GRPO relies on single-sample quality as the reward signal, driving the model to converge toward a few high-reward generation modes while neglecting distribution-level diversity. Second, conventional GRPO regularization neglects the dominant role of early-stage denoising in preserving diversity, causing a misaligned regularization budget that limits the achievable quality–diversity trade-off. Motivated by these insights, we revisit the diversity degradation problem from both reward modeling and generation dynamics. At the reward level, we propose a distributional creativity bonus based on semantic grouping. Specifically, we construct a distribution-level representation via spectral clustering over samples generated from the same caption, and adaptively allocate exploratory rewards according to group sizes to encourage the discovery of novel visual modes. At the generation level, we introduce a structure-aware regularization, which enforces stronger early-stage constraints to preserve diversity without compromising reward optimization efficiency. Experiments demonstrate that our method achieves a 13%–18% improvement in semantic diversity under matched quality scores, establishing a new Pareto frontier between image quality and diversity for GRPO-based image generation.
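To make the idea of allocating exploratory rewards according to group sizes concrete, here is a minimal sketch. It assumes the cluster assignments have already been produced (the paper uses spectral clustering over samples from the same caption), and the bonus formula, one minus the cluster's share of the group, is a hypothetical choice made for illustration; the abstract does not give the actual formula.

```python
from collections import Counter


def creativity_bonus(cluster_ids, scale=1.0):
    """Per-sample bonus that is larger for samples in smaller clusters.

    cluster_ids : cluster label of each sample generated for one caption
    scale       : hypothetical weight of the bonus relative to the quality reward
    """
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    # Hypothetical rule: a sample's bonus shrinks as its cluster dominates the group.
    return [scale * (1.0 - counts[c] / n) for c in cluster_ids]


# Three samples share one visual mode, one sample found a different mode.
print(creativity_bonus([0, 0, 0, 1]))  # [0.25, 0.25, 0.25, 0.75]
```

Under this rule the lone sample in a rare mode receives the largest bonus, which is the qualitative behavior the abstract describes.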
[19] Hierarchy-Aware Fine-Tuning of Vision-Language Models cs.CV | cs.AIPDF
Jiayu Li, Rajesh Gangireddy, Samet Akcay, Wei Cheng, Juhua Hu
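TL;DR: This paper proposes a hierarchy-aware fine-tuning framework for efficiently adapting vision-language models (VLMs) to hierarchical classification. It combines two objectives, Tree-Path KL Divergence (TP-KL) and Hierarchy-Sibling Smoothed Cross-Entropy (HiSCE), to enforce structural consistency in the shared embedding space, and pairs them with lightweight LoRA adaptation to improve performance with minimal parameter overhead.
Details
Motivation: Standard approaches treat labels as flat categories and fine-tune all parameters, which is expensive and yields inconsistent predictions across taxonomy levels. The paper aims to address these problems when adapting VLMs to hierarchical classification.
Result: Experiments on multiple benchmarks show consistent improvements in Full-Path Accuracy and Tree-based Inconsistency Error with minimal parameter overhead.
Insight: The novelty lies in two hierarchy-aware losses (TP-KL and HiSCE) that enforce vertical (parent-child) and horizontal (sibling) consistency of predictions within the taxonomy tree, combined with parameter-efficient fine-tuning (LoRA), giving an efficient strategy for adapting VLMs to structured taxonomies.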
Abstract: Vision-Language Models (VLMs) learn powerful multimodal representations through large-scale image-text pretraining, but adapting them to hierarchical classification is underexplored. Standard approaches treat labels as flat categories and require full fine-tuning, which is expensive and produces inconsistent predictions across taxonomy levels. We propose an efficient hierarchy-aware fine-tuning framework that updates a few parameters while enforcing structural consistency. We combine two objectives: Tree-Path KL Divergence (TP-KL) aligns predictions along the ground-truth label path for vertical coherence, while Hierarchy-Sibling Smoothed Cross-Entropy (HiSCE) encourages consistent predictions among sibling classes. Both losses work in the VLM’s shared embedding space and integrate with lightweight LoRA adaptation. Experiments across multiple benchmarks show consistent improvements in Full-Path Accuracy and Tree-based Inconsistency Error with minimal parameter overhead. Our approach provides an efficient strategy for adapting VLMs to structured taxonomies.
[20] Vision Transformers are Circulant Attention Learners cs.CVPDF
Dongchen Han, Tianyu Li, Ziyi Wang, Gao Huang
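TL;DR: This paper proposes a new attention paradigm called Circulant Attention, aimed at the heavy computational burden that the quadratic complexity of self-attention imposes on vision Transformers. By modeling the attention matrix as a Block Circulant matrix with Circulant Blocks (BCCB), the method achieves O(N log N) computational complexity while retaining the modeling capacity of standard self-attention.
Details
Motivation: The success of self-attention in vision Transformers is limited by its quadratic computational complexity, which is especially costly in high-resolution settings. Existing methods mitigate this with handcrafted locality or sparsity patterns but often sacrifice model capacity. The paper aims to overcome this limitation by exploiting an efficient pattern inherent in self-attention.
Result: Extensive experiments on a range of vision tasks show that the method reduces computational complexity to O(N log N) with performance comparable to standard self-attention, supporting it as a promising alternative to self-attention in vision Transformer architectures.
Insight: The novelty is in identifying that self-attention matrices in vision Transformers often approximate BCCB matrices and in exploiting this structure to design an efficient algorithm. Viewed objectively, the structured approximation achieves O(N log N) complexity while largely preserving model capacity, offering a new theoretical perspective and a practical scheme for efficient attention.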
Abstract: The self-attention mechanism has been a key factor in the advancement of vision Transformers. However, its quadratic complexity imposes a heavy computational burden in high-resolution scenarios, restricting the practical application. Previous methods attempt to mitigate this issue by introducing handcrafted patterns such as locality or sparsity, which inevitably compromise model capacity. In this paper, we present a novel attention paradigm termed Circulant Attention by exploiting the inherent efficient pattern of self-attention. Specifically, we first identify that the self-attention matrix in vision Transformers often approximates the Block Circulant matrix with Circulant Blocks (BCCB), a kind of structured matrix whose multiplication with other matrices can be performed in $\mathcal{O}(N\log N)$ time. Leveraging this interesting pattern, we explicitly model the attention map as its nearest BCCB matrix and propose an efficient computation algorithm for fast calculation. The resulting approach closely mirrors vanilla self-attention, differing only in its use of BCCB matrices. Since our design is inspired by the inherent efficient paradigm, it not only delivers $\mathcal{O}(N\log N)$ computation complexity, but also largely maintains the capacity of standard self-attention. Extensive experiments on diverse visual tasks demonstrate the effectiveness of our approach, establishing circulant attention as a promising alternative to self-attention for vision Transformer architectures. Code is available at https://github.com/LeapLabTHU/Circulant-Attention.
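The O(N log N) claim rests on a standard property of circulant matrices: multiplying by one is a circular convolution, which the FFT diagonalizes. The sketch below illustrates this for the simpler one-dimensional circulant case using a small pure-Python radix-2 FFT; the BCCB matrices in the paper are the two-dimensional analogue, and none of these function names come from the authors' repository.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def circulant_matvec_fft(c, v):
    """Compute y = C v in O(N log N), where C[i][j] = c[(i - j) % N]."""
    n = len(c)
    # Circular convolution theorem: DFT(C v) = DFT(c) * DFT(v) elementwise.
    product = [a * b for a, b in zip(fft(c), fft(v))]
    return [value.real / n for value in fft(product, inverse=True)]


def circulant_matvec_dense(c, v):
    """Reference O(N^2) product with the explicitly indexed circulant matrix."""
    n = len(c)
    return [sum(c[(i - j) % n] * v[j] for j in range(n)) for i in range(n)]
```

Both routines return the same vector up to floating-point error; only the first column `c` is ever stored, which is also why a circulant approximation needs far fewer parameters than a dense attention map.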
[21] EraseLoRA: MLLM-Driven Foreground Exclusion and Background Subtype Aggregation for Dataset-Free Object Removal cs.CVPDF
Sanghyun Jo, Donghwan Lee, Eunji Jung, Seong Je Oh, Kyungsu Kim
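TL;DR: This paper proposes EraseLoRA, a dataset-free object removal framework that uses a multimodal large language model for foreground exclusion and background subtype aggregation, avoiding the detail degradation and object regeneration caused by conventional attention manipulation.
Details
Motivation: Existing dataset-free methods for object removal manipulate attention directly, which causes non-target foregrounds to be misinterpreted as background, loses fine detail, and integrates background cues incoherently.
Result: On object removal benchmarks, used as a plug-in to pretrained diffusion models, it achieves consistent improvements over dataset-free baselines and results competitive with dataset-driven methods.
Insight: The novelty lies in using an MLLM to separate foreground from background without paired supervision, and in aggregating background subtypes as complementary pieces through test-time optimization, which avoids explicit attention intervention and improves removal quality and background fidelity.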
Abstract: Object removal differs from common inpainting, since it must prevent the masked target from reappearing and reconstruct the occluded background with structural and contextual fidelity, rather than merely filling a hole plausibly. Recent dataset-free approaches that redirect self-attention inside the mask fail in two ways: non-target foregrounds are often misinterpreted as background, which regenerates unwanted objects, and direct attention manipulation disrupts fine details and hinders coherent integration of background cues. We propose EraseLoRA, a novel dataset-free framework that replaces attention surgery with background-aware reasoning and test-time adaptation. First, Background-aware Foreground Exclusion (BFE) uses a multimodal large language model to separate target foreground, non-target foregrounds, and clean background from a single image-mask pair without paired supervision, producing reliable background cues while excluding distractors. Second, Background-aware Reconstruction with Subtype Aggregation (BRSA) performs test-time optimization that treats inferred background subtypes as complementary pieces and enforces their consistent integration through reconstruction and alignment objectives, preserving local detail and global structure without explicit attention intervention. We validate EraseLoRA as a plug-in to pretrained diffusion models and across benchmarks for object removal, demonstrating consistent improvements over dataset-free baselines and competitive results against dataset-driven methods. The code will be made available upon publication.
[22] Toward Intelligent Scene Augmentation for Context-Aware Object Placement and Sponsor-Logo Integration cs.CVPDF
Unnati Saraswat, Tarun Rao, Namah Gupta, Shweta Swami, Shikhar Sharma
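TL;DR: This paper proposes two new intelligent image-editing tasks for advertising and digital media: context-aware object insertion and sponsor-product logo augmentation. To support them, the authors build two new datasets with category annotations, placement regions, and sponsor-product labels.
Details
Motivation: Existing image-editing methods based on vision-language models and diffusion models rarely ensure that inserted objects are contextually appropriate for the scene. The paper addresses the contextual plausibility of inserting objects and brand logos in advertising and digital media.
Result: The main contributions are the two new tasks and the corresponding datasets; the abstract reports no quantitative results or benchmark comparisons.
Insight: The novelty lies in bringing context awareness and brand relevance into object insertion, made concrete as two new tasks with practical value and supported by accompanying datasets, which gives follow-up work a clear direction and a basis for evaluation.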
Abstract: Intelligent image editing increasingly relies on advances in computer vision, multimodal reasoning, and generative modeling. While vision-language models (VLMs) and diffusion models enable guided visual manipulation, existing work rarely ensures that inserted objects are contextually appropriate. We introduce two new tasks for advertising and digital media: (1) context-aware object insertion, which requires predicting suitable object categories, generating them, and placing them plausibly within the scene; and (2) sponsor-product logo augmentation, which involves detecting products and inserting correct brand logos, even when items are unbranded or incorrectly branded. To support these tasks, we build two new datasets with category annotations, placement regions, and sponsor-product labels.
[23] Towards Long-window Anchoring in Vision-Language Model Distillation cs.CV | cs.AIPDF
Haoyi Zhou, Shuo Li, Tianyu Chen, Qi Song, Chonghan Gao
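TL;DR: This paper proposes LAid, which uses knowledge distillation to transfer the long-context understanding of large vision-language models to small ones, addressing the vision-language alignment failures that small models suffer because of limited window size. The method has two components: progressive distance-weighted attention matching and learnable RoPE response gain modulation. Distilled small models reach effective context windows up to 3.2 times longer than baseline small models while maintaining or improving performance on standard vision-language benchmarks.
Details
Motivation: Small vision-language models have weak vision-language alignment because of their limited window size. The paper strengthens their long-context understanding by distilling the long-range attention mechanisms of large models.
Result: Experiments across multiple model families show that LAid-distilled models achieve effective context windows up to 3.2 times longer than baseline small models while maintaining or improving performance on standard vision-language benchmarks. Spectral analysis suggests that the method preserves crucial low-frequency attention components that conventional methods fail to transfer.
Insight: The novelty lies in progressive distance-weighted attention matching, which dynamically emphasizes longer position differences during training, combined with learnable RoPE response gain modulation, which selectively amplifies position sensitivity. On the theoretical side, the work offers insight into how positional understanding emerges and transfers during distillation, suggesting a new approach to building efficient long-context vision-language models.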
Abstract: While large vision-language models (VLMs) demonstrate strong long-context understanding, their prevalent small branches fail on linguistics-photography alignment for a limited window size. We discover that knowledge distillation improves students’ capability as a complement to Rotary Position Embeddings (RoPE) on window sizes (anchored from large models). Building on this insight, we propose LAid, which directly aims at the transfer of long-range attention mechanisms through two complementary components: (1) a progressive distance-weighted attention matching that dynamically emphasizes longer position differences during training, and (2) a learnable RoPE response gain modulation that selectively amplifies position sensitivity where needed. Extensive experiments across multiple model families demonstrate that LAid-distilled models achieve up to 3.2 times longer effective context windows compared to baseline small models, while maintaining or improving performance on standard VL benchmarks. Spectral analysis also suggests that LAid successfully preserves crucial low-frequency attention components that conventional methods fail to transfer. Our work not only provides practical techniques for building more efficient long-context VLMs but also offers theoretical insights into how positional understanding emerges and transfers during distillation.
[24] From Shallow Humor to Metaphor: Towards Label-Free Harmful Meme Detection via LMM Agent Self-Improvement cs.CVPDF
Jian Lang, Rongpei Hong, Ting Zhong, Leiting Chen, Qiang Gao
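TL;DR: This paper proposes ALARM, the first label-free harmful meme detection framework based on Large Multimodal Model (LMM) agent self-improvement. It exploits the explicit information in “shallow” memes to iteratively improve the model's ability to handle more complex and subtle ones, using a Confidence-based Explicit Meme Identification mechanism and a Pairwise Learning Guided Agent Self-Improvement paradigm. Experiments on three datasets show superior performance, even surpassing label-driven methods.
Details
Motivation: The proliferation of harmful memes online poses significant risks to public health and stability. Existing detection methods rely heavily on large-scale labeled data, which requires substantial manual annotation and adapts poorly to the continually evolving nature of harmful content.
Result: Experiments on three diverse datasets show that ALARM outperforms label-driven methods and adapts strongly to newly evolved memes.
Insight: The core innovation is to use information from “shallow” (explicit) memes to iteratively strengthen the model's handling of complex and subtle memes. The result is a label-free, scalable detection framework in which confidence-based explicit meme identification and pairwise-learning-guided agent self-improvement let the model adapt on its own.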
Abstract: The proliferation of harmful memes on online media poses significant risks to public health and stability. Existing detection methods heavily rely on large-scale labeled data for training, which necessitates substantial manual annotation efforts and limits their adaptability to the continually evolving nature of harmful content. To address these challenges, we present ALARM, the first lAbeL-free hARmful Meme detection framework powered by Large Multimodal Model (LMM) agent self-improvement. The core innovation of ALARM lies in exploiting the expressive information from “shallow” memes to iteratively enhance its ability to tackle more complex and subtle ones. ALARM consists of a novel Confidence-based Explicit Meme Identification mechanism that isolates the explicit memes from the original dataset and assigns them pseudo-labels. Besides, a new Pairwise Learning Guided Agent Self-Improvement paradigm is introduced, where the explicit memes are reorganized into contrastive pairs (positive vs. negative) to refine a learner LMM agent. This agent autonomously derives high-level detection cues from these pairs, which in turn empower the agent itself to handle complex and challenging memes effectively. Experiments on three diverse datasets demonstrate the superior performance and strong adaptability of ALARM to newly evolved memes. Notably, our method even outperforms label-driven methods. These results highlight the potential of label-free frameworks as a scalable and promising solution for adapting to novel forms and topics of harmful memes in dynamic online environments.
[25] TAMEing Long Contexts in Personalization: Towards Training-Free and State-Aware MLLM Personalized Assistant cs.CVPDF
Rongpei Hong, Jian Lang, Ting Zhong, Yong Wang, Fan Zhou
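TL;DR: Addressing the lack of long-context dialogue capability in multimodal large language model (MLLM) personalization, this paper proposes LCMP, the first long-context MLLM personalization evaluation benchmark, and introduces TAME, a training-free, state-aware baseline framework. TAME uses a double-memory mechanism to manage the temporal and persistent variations of personalized concepts separately, and adopts a Retrieve-then-Align Augmented Generation (RA2G) paradigm to improve responses to complex queries.
Details
Motivation: Existing MLLM personalization methods focus on simple, context-agnostic visual identification and textual replacement and overlook support for long-context conversations, whereas an ideal personalized assistant should sustain long dialogues and keep learning from history to improve the quality of the experience.
Result: Experiments on the proposed LCMP benchmark show that TAME achieves the best performance, with remarkable and evolving interaction experiences in long-context scenarios.
Insight: The novelty lies in LCMP, the first evaluation benchmark focused on long-context MLLM personalization, and in TAME, a training-free framework with double-memory management and the RA2G paradigm, which handles the temporal and persistent information of each concept separately and generates contextually fitted responses.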
Abstract: Multimodal Large Language Model (MLLM) Personalization is a critical research problem that facilitates personalized dialogues with MLLMs targeting specific entities (known as personalized concepts). However, existing methods and benchmarks focus on the simple, context-agnostic visual identification and textual replacement of the personalized concept (e.g., “A yellow puppy” -> “Your puppy Mochi”), overlooking the ability to support long-context conversations. An ideal personalized MLLM assistant is capable of engaging in long-context dialogues with humans and continually improving its experience quality by learning from past dialogue histories. To bridge this gap, we propose LCMP, the first Long-Context MLLM Personalization evaluation benchmark. LCMP assesses the capability of MLLMs in perceiving variations of personalized concepts and generating contextually appropriate personalized responses that reflect these variations. As a strong baseline for LCMP, we introduce a novel training-free and state-aware framework TAME. TAME endows MLLMs with double memories to manage the temporal and persistent variations of each personalized concept in a differentiated manner. In addition, TAME incorporates a new training-free Retrieve-then-Align Augmented Generation (RA2G) paradigm. RA2G introduces an alignment step to extract the contextually fitted information from the multi-memory retrieved knowledge to the current questions, enabling better interactions for complex real-world user queries. Experiments on LCMP demonstrate that TAME achieves the best performance, showcasing remarkable and evolving interaction experiences in long-context scenarios.
[26] Training-Free Disentangled Text-Guided Image Editing via Sparse Latent Constraints cs.CVPDF
Mutiara Shabrina, Nova Kurnia Putri, Jefri Satria Ferdiansyah, Sabita Khansa Dewi, Novanto Yudistira
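TL;DR: This paper analyzes text-driven image editing based on the Predict, Prevent, and Evaluate (PPE) framework and finds that, with BERT and StyleGAN2 on the CelebA-HQ dataset, dense latent-space updates cause attribute entanglement. To mitigate this, the authors introduce an L1-regularized sparsity constraint. Experiments show more focused and controllable edits, with fewer unintended changes to non-target attributes and preserved facial identity.
Details
Motivation: In text-driven image editing, attribute entanglement causes edits to a target attribute to unintentionally alter other semantic properties such as identity or appearance.
Result: Experiments on CelebA-HQ show that the sparsity constraint yields more focused edits, reduces semantic leakage, and preserves non-target attributes and facial identity.
Insight: The novelty lies in applying an L1-regularized sparsity constraint to latent-space manipulation in place of the dense update strategy of the original framework, improving disentanglement and control. Viewed objectively, this sparsification is a simple and effective regularization idea for reducing attribute entanglement in generative models.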
Abstract: Text-driven image manipulation often suffers from attribute entanglement, where modifying a target attribute (e.g., adding bangs) unintentionally alters other semantic properties such as identity or appearance. The Predict, Prevent, and Evaluate (PPE) framework addresses this issue by leveraging pre-trained vision-language models for disentangled editing. In this work, we analyze the PPE framework, focusing on its architectural components, including BERT-based attribute prediction and StyleGAN2-based image generation on the CelebA-HQ dataset. Through empirical analysis, we identify a limitation in the original regularization strategy, where latent updates remain dense and prone to semantic leakage. To mitigate this issue, we introduce a sparsity-based constraint using L1 regularization on latent space manipulation. Experimental results demonstrate that the proposed approach enforces more focused and controlled edits, effectively reducing unintended changes in non-target attributes while preserving facial identity.
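One way to see why an L1 penalty on the latent edit produces sparse, focused updates is through its proximal operator, soft-thresholding, which sets small coordinates exactly to zero. The sketch below is an illustration of that general mechanism only; the abstract does not say how the authors optimize the penalty, and `delta`, `lam`, and `tau` are hypothetical names for the latent offset, the penalty weight, and the threshold.

```python
def l1_penalty(delta, lam):
    """Regularization term lam * ||delta||_1 added to the editing loss."""
    return lam * sum(abs(d) for d in delta)


def soft_threshold(delta, tau):
    """Proximal step for the L1 penalty: shrink each coordinate toward zero by tau.

    Coordinates with magnitude at most tau become exactly zero, so only the
    latent directions with a strong editing signal survive.
    """
    shrunk = []
    for d in delta:
        magnitude = abs(d) - tau
        if magnitude <= 0:
            shrunk.append(0.0)
        else:
            shrunk.append(magnitude if d > 0 else -magnitude)
    return shrunk
```

For a latent offset such as `[0.5, -0.05, 0.02, -0.3]` and `tau = 0.1`, the two small coordinates are zeroed and the two large ones are kept with reduced magnitude, which matches the intuition that non-target attributes should be left untouched.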
[27] TrackTeller: Temporal Multimodal 3D Grounding for Behavior-Dependent Object References cs.CV | cs.AIPDF
Jiahong Yu, Ziqi Wang, Hailiang Zhao, Wei Zhai, Xueqiang Yan
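TL;DR: TrackTeller is a framework for temporal language-based 3D object grounding in dynamic 3D driving scenes. By integrating multi-frame observations with LiDAR-image fusion, language-conditioned decoding, and temporal reasoning, it resolves object references that depend on recent motion or short-term interactions and cannot be handled from static appearance or geometry alone.
Details
Motivation: In dynamic 3D driving scenes, many natural-language references describe targets through recent motion or short-term interactions, which static appearance or geometry alone cannot resolve. Temporal language-based 3D grounding is therefore needed to support interactive autonomous driving systems.
Result: On the NuPrompt benchmark, TrackTeller improves language-grounded tracking performance, with a 70% relative improvement in Average Multi-Object Tracking Accuracy over strong baselines and a 3.15-3.4 times reduction in False Alarm Frequency, reaching state-of-the-art results.
Insight: The novelty lies in a unified temporal multimodal grounding framework that builds a shared UniScene representation aligned with textual semantics and refines grounding decisions using motion history and short-term dynamics. Viewed objectively, it tightly couples language understanding with temporal modeling of 3D scenes, offering an effective solution for behavior-dependent object references.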
Abstract: Understanding natural-language references to objects in dynamic 3D driving scenes is essential for interactive autonomous systems. In practice, many referring expressions describe targets through recent motion or short-term interactions, which cannot be resolved from static appearance or geometry alone. We study temporal language-based 3D grounding, where the objective is to identify the referred object in the current frame by leveraging multi-frame observations. We propose TrackTeller, a temporal multimodal grounding framework that integrates LiDAR-image fusion, language-conditioned decoding, and temporal reasoning in a unified architecture. TrackTeller constructs a shared UniScene representation aligned with textual semantics, generates language-aware 3D proposals, and refines grounding decisions using motion history and short-term dynamics. Experiments on the NuPrompt benchmark demonstrate that TrackTeller consistently improves language-grounded tracking performance, outperforming strong baselines with a 70% relative improvement in Average Multi-Object Tracking Accuracy and a 3.15-3.4 times reduction in False Alarm Frequency.
[28] Omni-Weather: Unified Multimodal Foundation Model for Weather Generation and Understanding cs.CVPDF
Zhiwang Zhou, Yuandong Pu, Xuming He, Yidi Liu, Yixin Chen
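TL;DR: This paper proposes Omni-Weather, the first multimodal foundation model that unifies weather generation and understanding. Within a single architecture built on a shared self-attention mechanism, it integrates a radar encoder for generation tasks, and it constructs a Chain-of-Thought dataset for causal reasoning to improve the interpretability and perceptual quality of its outputs.
Details
Motivation: Existing weather modeling methods treat accurate prediction (generation) and mechanistic interpretation (understanding) in isolation. The paper fills this gap by addressing both within a single unified architecture.
Result: Extensive experiments show that Omni-Weather achieves state-of-the-art performance on both weather generation and understanding tasks.
Insight: The main contributions are the first multimodal foundation model architecture unifying weather generation and understanding and a Chain-of-Thought dataset for causal reasoning. Viewed objectively, the work indicates that generation and understanding tasks in the weather domain can reinforce each other, and it serves as an example of unifying the two within a domain-specific multimodal foundation model.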
Abstract: Weather modeling requires both accurate prediction and mechanistic interpretation, yet existing methods treat these goals in isolation, separating generation from understanding. To address this gap, we present Omni-Weather, the first multimodal foundation model that unifies weather generation and understanding within a single architecture. Omni-Weather integrates a radar encoder for weather generation tasks, followed by unified processing using a shared self-attention mechanism. Moreover, we construct a Chain-of-Thought dataset for causal reasoning in weather generation, enabling interpretable outputs and improved perceptual quality. Extensive experiments show Omni-Weather achieves state-of-the-art performance in both weather generation and understanding. Our findings further indicate that generative and understanding tasks in the weather domain can mutually enhance each other. Omni-Weather also demonstrates the feasibility and value of unifying weather generation and understanding.
[29] The Deepfake Detective: Interpreting Neural Forensics Through Sparse Features and Manifolds cs.CV | cs.LGPDF
Subramanyam Sahoo, Jared Junkin
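TL;DR: This paper presents a mechanistic interpretability framework for deepfake detection that combines sparse autoencoder analysis with a novel forensic manifold analysis to reveal the internal decision process of a vision-language model. It finds that only a small fraction of latent features are actively used in each layer, and that geometric properties of the feature manifold (such as intrinsic dimensionality, curvature, and feature selectivity) vary systematically with the type of deepfake artifact.
Details
Motivation: Deepfake detection models reach high accuracy, but their decision processes remain opaque. The paper aims to open this “black box” with mechanistic interpretability methods and understand how the model recognizes forensic artifacts in synthetic media.
Result: Analysis of the model's internal representations shows that feature usage is sparse and that feature-manifold geometry is systematically related to deepfake artifact type, providing empirical support for interpretability.
Insight: The novelty lies in combining sparse autoencoder analysis with forensic manifold analysis to show systematically how the sparsity of feature activations and the geometry of the feature manifold relate to artifact types in a deepfake detector, pointing toward more interpretable and robust detection models.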
Abstract: Deepfake detection models have achieved high accuracy in identifying synthetic media, but their decision processes remain largely opaque. In this paper we present a mechanistic interpretability framework for deepfake detection applied to a vision-language model. Our approach combines a sparse autoencoder (SAE) analysis of internal network representations with a novel forensic manifold analysis that probes how the model’s features respond to controlled forensic artifact manipulations. We demonstrate that only a small fraction of latent features are actively used in each layer, and that the geometric properties of the model’s feature manifold, including intrinsic dimensionality, curvature, and feature selectivity, vary systematically with different types of deepfake artifacts. These insights provide a first step toward opening the “black box” of deepfake detectors, allowing us to identify which learned features correspond to specific forensic artifacts and to guide the development of more interpretable and robust models.
[30] UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture cs.CVPDF
Shuo Cao, Jiayang Li, Xiaohui Li, Yuandong Pu, Kaiwen Zhu
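TL;DR: This paper presents UniPercept-Bench, a unified framework for perceptual-level image understanding across three key domains: aesthetics, quality, and structure and texture. The authors establish a hierarchical definition system, build large-scale datasets, and develop UniPercept, a strong baseline trained with Domain-Adaptive Pre-Training and Task-Aligned RL. The model performs well on Visual Rating and Visual Question Answering tasks and can serve as a plug-and-play reward model for text-to-image generation.
Details
Motivation: Existing multimodal large language models have made remarkable progress on visual understanding tasks, but their ability to perceive perceptual-level image features (such as aesthetics and quality) remains limited. The paper aims to define and systematically advance perceptual-level image understanding in the MLLM era.
Result: UniPercept outperforms existing MLLMs on perceptual-level image understanding and generalizes robustly across Visual Rating and Visual Question Answering tasks.
Insight: The main contributions are a unified perceptual-level image understanding framework with a hierarchical definition system; the large-scale UniPercept-Bench benchmark; a training recipe combining Domain-Adaptive Pre-Training and Task-Aligned RL; and a model usable as a reward model for text-to-image generation. Together these lay a foundation for research on perceptual-level multimodal image understanding.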
Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks such as visual grounding, segmentation, and captioning. However, their ability to perceive perceptual-level image features remains limited. In this work, we present UniPercept-Bench, a unified framework for perceptual-level image understanding across three key domains: Aesthetics, Quality, Structure and Texture. We establish a hierarchical definition system and construct large-scale datasets to evaluate perceptual-level image understanding. Based on this foundation, we develop a strong baseline UniPercept trained via Domain-Adaptive Pre-Training and Task-Aligned RL, enabling robust generalization across both Visual Rating (VR) and Visual Question Answering (VQA) tasks. UniPercept outperforms existing MLLMs on perceptual-level image understanding and can serve as a plug-and-play reward model for text-to-image generation. This work defines Perceptual-Level Image Understanding in the era of MLLMs and, through the introduction of a comprehensive benchmark together with a strong baseline, provides a solid foundation for advancing perceptual-level multimodal image understanding.
[31] Contrastive Graph Modeling for Cross-Domain Few-Shot Medical Image Segmentation cs.CVPDF
Yuntian Bo, Tao Zhou, Zechao Li, Haofeng Zhang, Ling Shao
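TL;DR: This paper proposes Contrastive Graph Modeling (C-Graph), a framework for cross-domain few-shot medical image segmentation (CD-FSMIS). It represents image features as graphs and uses the structural consistency of medical images as a transferable prior, with a Structural Prior Graph (SPG) layer, a Subgraph Matching Decoding (SMD) mechanism, and a Confusion-minimizing Node Contrast (CNC) loss, to improve cross-domain performance while preserving source-domain accuracy.
Details
Motivation: Existing CD-FSMIS methods usually filter out domain-specific information to improve generalization, which inadvertently limits cross-domain performance and degrades source-domain accuracy. The paper addresses this by using the structural consistency of medical images as a reliable domain-transferable prior.
Result: The method significantly outperforms prior CD-FSMIS approaches on multiple cross-domain benchmarks, achieving state-of-the-art performance while preserving strong segmentation accuracy on the source domain.
Insight: The contributions are: modeling image features as graphs to exploit structural priors; an SPG layer that captures node dependencies and models global structure; an SMD mechanism that uses semantic relations among nodes to guide prediction; and a CNC loss that contrastively enhances node discriminability to reduce node ambiguity and subgraph heterogeneity. Together these offer structure-aware and contrastive learning strategies that other cross-domain few-shot work can draw on.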
Abstract: Cross-domain few-shot medical image segmentation (CD-FSMIS) offers a promising and data-efficient solution for medical applications where annotations are severely scarce and multimodal analysis is required. However, existing methods typically filter out domain-specific information to improve generalization, which inadvertently limits cross-domain performance and degrades source-domain accuracy. To address this, we present Contrastive Graph Modeling (C-Graph), a framework that leverages the structural consistency of medical images as a reliable domain-transferable prior. We represent image features as graphs, with pixels as nodes and semantic affinities as edges. A Structural Prior Graph (SPG) layer is proposed to capture and transfer target-category node dependencies and enable global structure modeling through explicit node interactions. Building upon SPG layers, we introduce a Subgraph Matching Decoding (SMD) mechanism that exploits semantic relations among nodes to guide prediction. Furthermore, we design a Confusion-minimizing Node Contrast (CNC) loss to mitigate node ambiguity and subgraph heterogeneity by contrastively enhancing node discriminability in the graph space. Our method significantly outperforms prior CD-FSMIS approaches across multiple cross-domain benchmarks, achieving state-of-the-art performance while simultaneously preserving strong segmentation accuracy on the source domain.
[32] SlideChain: Semantic Provenance for Lecture Understanding via Blockchain Registration cs.CVPDF
Md Motaleb Hossen Manik, Md Zabirul Islam, Ge Wang
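TL;DR: This paper proposes SlideChain, a blockchain-backed provenance framework designed to provide verifiable integrity for large-scale multimodal semantic extraction. Using the SlideChain Slides Dataset of 1,117 medical imaging lecture slides, it extracts concepts and relational triples from four state-of-the-art vision-language models and builds a structured provenance record for every slide. Cryptographic hashes of the records are anchored on a local EVM-compatible blockchain, providing tamper-evident auditability and persistent semantic baselines. The paper presents the first systematic analysis of semantic disagreement, cross-model similarity, and lecture-level variability in multimodal educational content, revealing pronounced cross-model discrepancies, and evaluates gas usage, throughput, and scalability under simulated deployment conditions, demonstrating perfect tamper detection and deterministic reproducibility.
Details
Motivation: Modern vision-language models are widely used to interpret and generate educational content, but their semantic outputs are hard to verify, reproduce, and audit. Inconsistencies across model families, inference settings, and computing environments undermine the reliability of AI-generated instructional material, particularly in high-stakes and quantitative STEM domains.
Result: On the SlideChain Slides Dataset, the analysis shows pronounced cross-model discrepancies, including low concept overlap and near-zero agreement in relational triples on many slides. Under simulated deployment, gas usage, throughput, and scalability are evaluated, and the framework achieves perfect tamper detection and deterministic reproducibility across independent extraction runs.
Insight: The novelty lies in applying blockchain technology to semantic provenance for multimodal educational content, with structured records and anchored cryptographic hashes providing verifiable integrity and auditability. Viewed objectively, the approach offers a practical, scalable route to long-term auditability, reproducibility, and integrity in AI-assisted instructional systems, and is particularly relevant to the problem of cross-model semantic inconsistency.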
Abstract: Modern vision–language models (VLMs) are increasingly used to interpret and generate educational content, yet their semantic outputs remain challenging to verify, reproduce, and audit over time. Inconsistencies across model families, inference settings, and computing environments undermine the reliability of AI-generated instructional material, particularly in high-stakes and quantitative STEM domains. This work introduces SlideChain, a blockchain-backed provenance framework designed to provide verifiable integrity for multimodal semantic extraction at scale. Using the SlideChain Slides Dataset, a curated corpus of 1,117 medical imaging lecture slides from a university course, we extract concepts and relational triples from four state-of-the-art VLMs and construct structured provenance records for every slide. SlideChain anchors cryptographic hashes of these records on a local EVM (Ethereum Virtual Machine)-compatible blockchain, providing tamper-evident auditability and persistent semantic baselines. Through the first systematic analysis of semantic disagreement, cross-model similarity, and lecture-level variability in multimodal educational content, we reveal pronounced cross-model discrepancies, including low concept overlap and near-zero agreement in relational triples on many slides. We further evaluate gas usage, throughput, and scalability under simulated deployment conditions, and demonstrate perfect tamper detection along with deterministic reproducibility across independent extraction runs. Together, these results show that SlideChain provides a practical and scalable step toward trustworthy, verifiable multimodal educational pipelines, supporting long-term auditability, reproducibility, and integrity for AI-assisted instructional systems.
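The tamper-evidence property described above comes from hashing a canonical serialization of each provenance record and comparing it with the anchored digest. The following is a minimal off-chain sketch of that check. It uses SHA-256 from the Python standard library as a stand-in, since the abstract does not name the hash function (EVM contracts commonly use Keccak-256), and the record fields shown are hypothetical.

```python
import hashlib
import json


def record_hash(record):
    """Hex digest of a provenance record serialized in a canonical form.

    Sorting keys and fixing separators makes the digest independent of
    dictionary ordering, so identical content always hashes identically.
    """
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_untampered(record, anchored_hash):
    """Compare a freshly computed digest with the one stored on chain."""
    return record_hash(record) == anchored_hash


# Hypothetical record layout for one slide and one model.
record = {
    "slide_id": 17,
    "model": "vlm-a",
    "concepts": ["k-space", "Fourier transform"],
    "triples": [["k-space", "encodes", "spatial frequency"]],
}
anchored = record_hash(record)
```

Any change to a concept or triple after anchoring changes the digest, so `is_untampered` returns `False` for the modified record, while re-running the same extraction and obtaining identical output reproduces the anchored value.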
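[33] Analyzing the Mechanism of Attention Collapse in VGGT from a Dynamics Perspective cs.CV
Huan Li, Longjun Luo, Yuling Shi, Xiaodong Gu
TL;DR: This paper analyzes the mechanism of attention collapse in VGGT (Visual Geometry Grounded Transformer) from a dynamics perspective. It proves that the global self-attention layer undergoes rank degeneration on long sequences, causing token geometry to collapse to an almost one-dimensional subspace, and builds a theoretical model that relates the collapse to a diffusion process.
Details
Motivation: Addresses the rank degeneration of attention matrices that VGGT exhibits once the input sequence exceeds a few hundred frames, which makes reconstruction error accumulate super-linearly and degrades 3D reconstruction.
Result: The theory quantitatively matches the evolution of the attention heat maps and explains why the token-merging strategy delays collapse without additional training: it lowers the effective diffusion coefficient.
Insight: Models the attention iteration as a degenerate diffusion process and derives a closed-form mean-field partial differential equation that predicts the rank profile, giving a theoretical framework for interpreting scalable 3D vision transformers, with potential for multi-modal generalization.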
Abstract: Visual Geometry Grounded Transformer (VGGT) delivers state-of-the-art feed-forward 3D reconstruction, yet its global self-attention layer suffers from a drastic collapse phenomenon when the input sequence exceeds a few hundred frames: attention matrices rapidly become near rank-one, token geometry degenerates to an almost one-dimensional subspace, and reconstruction error accumulates super-linearly. In this report, we establish a rigorous mathematical explanation of the collapse by viewing the global-attention iteration as a degenerate diffusion process. We prove that, in VGGT, the token-feature flow converges toward a Dirac-type measure at an $O(1/L)$ rate, where $L$ is the layer index, yielding a closed-form mean-field partial differential equation that precisely predicts the empirically observed rank profile. The theory quantitatively matches the attention-heat-map evolution and a series of experimental outcomes reported in related work, and explains why the token-merging remedy, which periodically removes redundant tokens, slows the effective diffusion coefficient and thereby delays collapse without additional training. We believe the analysis provides a principled lens for interpreting future scalable 3D-vision transformers, and we highlight its potential for multi-modal generalization.
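A toy illustration of the collapse tendency, not the paper's model: the sketch below repeatedly applies a parameter-free softmax self-attention averaging step to a handful of 2D tokens and tracks how far apart they remain. The update rule, the token values, and the `spread` measure are all assumptions chosen for illustration; VGGT's actual layers include learned projections, residual paths, and normalization.

```python
import math


def attention_step(tokens):
    """One softmax self-attention averaging step: X <- softmax(X X^T) X."""
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) for k in tokens]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([
            sum(w * k[d] for w, k in zip(weights, tokens)) / z
            for d in range(len(q))
        ])
    return out


def spread(tokens):
    """Largest per-coordinate range across tokens (0 means full collapse)."""
    dims = range(len(tokens[0]))
    return max(
        max(t[d] for t in tokens) - min(t[d] for t in tokens) for d in dims
    )


tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5], [0.5, -1.0]]
history = [spread(tokens)]
for _ in range(20):
    tokens = attention_step(tokens)
    history.append(spread(tokens))

print(history[0], history[-1])
```

Because every output token is a convex combination of all input tokens with strictly positive weights, the spread can only shrink from one step to the next, which is the qualitative behavior the abstract formalizes.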
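[34] Prior-AttUNet: Retinal OCT Fluid Segmentation Based on Normal Anatomical Priors and Attention Gating cs.CV
Li Yang, Yuting Liu
TL;DR: This paper proposes Prior-AttUNet, a model for accurate segmentation of macular edema (fluid regions) in retinal OCT images. By integrating generative anatomical priors with a novel triple-attention mechanism, it addresses ambiguous fluid boundaries and cross-device heterogeneity in OCT images.
Details
Motivation: Segmenting fluid regions in retinal OCT images is hard, notably because of ambiguous boundaries and heterogeneity across imaging devices (such as Cirrus, Spectralis, and Topcon). Accurate segmentation is essential for the clinical diagnosis and management of conditions such as age-related macular degeneration and diabetic macular edema.
Result: On the public RETOUCH benchmark, Prior-AttUNet performs strongly on all three OCT devices, with mean Dice similarity coefficients of 93.93% (Cirrus), 95.18% (Spectralis), and 93.47% (Topcon), while keeping computational cost low (0.37 TFLOPs) and balancing segmentation accuracy against inference efficiency.
Insight: The main novelty is a hybrid dual-path architecture that integrates multi-scale normative anatomical priors supplied by a variational autoencoder with a dynamic triple-attention mechanism guided by those priors during decoding, which sharpens boundary delineation and improves generalization across devices. Viewed objectively, combining generative priors with attention for medical image segmentation is a promising direction.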
Abstract: Accurate segmentation of macular edema, a hallmark pathological feature in vision-threatening conditions such as age-related macular degeneration and diabetic macular edema, is essential for clinical diagnosis and management. To overcome the challenges of segmenting fluid regions in optical coherence tomography (OCT) images, notably ambiguous boundaries and cross-device heterogeneity, this study introduces Prior-AttUNet, a segmentation model augmented with generative anatomical priors. The framework adopts a hybrid dual-path architecture that integrates a generative prior pathway with a segmentation network. A variational autoencoder supplies multi-scale normative anatomical priors, while the segmentation backbone incorporates densely connected blocks and spatial pyramid pooling modules to capture richer contextual information. Additionally, a novel triple-attention mechanism, guided by anatomical priors, dynamically modulates feature importance across decoding stages, substantially enhancing boundary delineation. Evaluated on the public RETOUCH benchmark, Prior-AttUNet achieves excellent performance across three OCT imaging devices (Cirrus, Spectralis, and Topcon), with mean Dice similarity coefficients of 93.93%, 95.18%, and 93.47%, respectively. The model maintains a low computational cost of 0.37 TFLOPs, striking an effective balance between segmentation precision and inference efficiency. These results demonstrate its potential as a reliable tool for automated clinical analysis.
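For readers unfamiliar with the reported metric, here is a minimal sketch of the Dice similarity coefficient on binary masks, using the standard definition 2|A∩B| / (|A| + |B|). Masks are represented as flat 0/1 lists for simplicity, and the handling of two empty masks is a convention assumed here, not something the abstract specifies.

```python
def dice_coefficient(pred, target):
    """Dice similarity between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
print(dice_coefficient(pred, target))
```

A reported "mean Dice" is typically this value averaged over cases or classes; the averaging scheme used in the paper is not stated in the abstract.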
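[35] FUSE: Unifying Spectral and Semantic Cues for Robust AI-Generated Image Detection cs.CV
Md. Zahid Hossain, Most. Sharmin Sultana Samu, Md. Kamrozzaman Bhuiyan, Farhad Uz Zaman, Md. Rakibul Islam
TL;DR: This paper proposes FUSE, a hybrid system for robust detection of AI-generated images. It combines spectral features extracted with the Fast Fourier Transform and semantic features from the CLIP vision encoder into a joint representation, trained progressively in two stages. Evaluations on multiple datasets show strong generalization across generators.
Details
Motivation: The rapid progress of generative models has increased the need for reliable detection of AI-generated images. Existing methods often perform poorly on high-fidelity images, so a method that combines different cues for robust, generalizable detection is needed.
Result: The FUSE (Stage 1) model achieves state-of-the-art results on the Chameleon benchmark. It attains 91.36% mean accuracy on GenImage, 88.71% accuracy across all tested generators, and a mean Average Precision of 94.96%. Stage 2 training further improves performance for most generators.
Insight: The core novelty is fusing spectral features (from the Fast Fourier Transform) with semantic features (from the CLIP vision encoder) under a two-stage progressive training strategy. This multimodal feature fusion improves robustness and generalization across diverse AI-generated images, holding up on high-fidelity images in particular, and offers a new angle on the generalization problem in AI-generated image detection.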
Abstract: The fast evolution of generative models has heightened the demand for reliable detection of AI-generated images. To tackle this challenge, we introduce FUSE, a hybrid system that combines spectral features extracted through the Fast Fourier Transform with semantic features obtained from the CLIP vision encoder. The features are fused into a joint representation and trained progressively in two stages. Evaluations on the GenImage, WildFake, DiTFake, GPT-ImgEval, and Chameleon datasets demonstrate strong generalization across multiple generators. Our FUSE (Stage 1) model demonstrates state-of-the-art results on the Chameleon benchmark. It also attains 91.36% mean accuracy on the GenImage dataset, 88.71% accuracy across all tested generators, and a mean Average Precision of 94.96%. Stage 2 training further improves performance for most generators. Unlike existing methods, which often perform poorly on high-fidelity images in Chameleon, our approach maintains robustness across diverse generators. These findings highlight the benefits of integrating spectral and semantic features for generalized detection of images generated by AI.
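[36] RAPTOR: Real-Time High-Resolution UAV Video Prediction with Efficient Video Attention cs.CV
Zhan Chen, Zile Guo, Enze Zhu, Peirong Zhang, Xiaoxuan Liu
TL;DR: This paper proposes RAPTOR, a real-time, high-resolution architecture for UAV video prediction. Its core is Efficient Video Attention (EVA), a module that factorizes spatiotemporal modeling to bring complexity down to linear. Combined with a three-stage training curriculum, it is the first to exceed 30 FPS at 512x512 resolution on an edge device.
Details
Motivation: Addresses the fundamental trilemma in video prediction among high resolution, high perceptual quality, and real-time speed, and in particular the stringent safety need of autonomous UAVs in dense urban environments for low-latency, high-resolution foresight.
Result: On a Jetson AGX Orin edge device, RAPTOR is the first to exceed 30 FPS for $512^2$ video prediction. It sets a new state of the art in PSNR, SSIM, and LPIPS on UAVid, KTH, and a custom high-resolution dataset, and raises the mission success rate by 18% in a real-world UAV navigation task.
Insight: The core novelty is the Efficient Video Attention (EVA) module, which factorizes spatiotemporal modeling by alternating operations along the spatial and temporal axes. According to the abstract, this replaces the $O((ST)^2)$ or $O(ST)$ cost of processing flattened spacetime tokens with $O(S + T)$ time and $O(\max(S, T))$ memory, enabling patch-free global context modeling on dense feature maps. The coarse-to-fine three-stage training curriculum also improves the spatiotemporal coherence of predictions.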
Abstract: Video prediction is plagued by a fundamental trilemma: achieving high resolution and perceptual quality typically comes at the cost of real-time speed, hindering its use in latency-critical applications. This challenge is most acute for autonomous UAVs in dense urban environments, where foreseeing events from high-resolution imagery is non-negotiable for safety. Existing methods, reliant on iterative generation (diffusion, autoregressive models) or quadratic-complexity attention, fail to meet these stringent demands on edge hardware. To break this long-standing trade-off, we introduce RAPTOR, a video prediction architecture that achieves real-time, high-resolution performance. RAPTOR’s single-pass design avoids the error accumulation and latency of iterative approaches. Its core innovation is Efficient Video Attention (EVA), a novel translator module that factorizes spatiotemporal modeling. Instead of processing flattened spacetime tokens with $O((ST)^2)$ or $O(ST)$ complexity, EVA alternates operations along the spatial (S) and temporal (T) axes. This factorization reduces the time complexity to $O(S + T)$ and memory complexity to $O(\max(S, T))$, enabling global context modeling at $512^2$ resolution and beyond, operating directly on dense feature maps with a patch-free design. Complementing this architecture is a 3-stage training curriculum that progressively refines predictions from coarse structure to sharp, temporally coherent details. Experiments show RAPTOR is the first predictor to exceed 30 FPS on a Jetson AGX Orin for $512^2$ video, setting a new state-of-the-art on UAVid, KTH, and a custom high-resolution dataset in PSNR, SSIM, and LPIPS. Critically, RAPTOR boosts the mission success rate in a real-world UAV navigation task by 18%, paving the way for safer and more anticipatory embodied agents.
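The factorization idea, alternating operations along the spatial and temporal axes instead of mixing all spacetime tokens jointly, can be sketched generically. The example below is not EVA: it substitutes plain averaging for learned attention and uses a tiny T x S grid of scalars, purely to show that one pass along each axis already gives every position access to every other position. The function names are hypothetical.

```python
def mix_spatial(grid):
    """Replace every value with the mean of its frame (mixes along S)."""
    return [[sum(row) / len(row)] * len(row) for row in grid]


def mix_temporal(grid):
    """Replace every value with the mean over time at its location (mixes along T)."""
    num_frames, num_positions = len(grid), len(grid[0])
    col_means = [
        sum(grid[t][s] for t in range(num_frames)) / num_frames
        for s in range(num_positions)
    ]
    return [list(col_means) for _ in range(num_frames)]


T, S = 3, 4
grid = [[0.0] * S for _ in range(T)]
grid[1][2] = 1.0  # a single impulse at frame 1, position 2

after_spatial = mix_spatial(grid)           # the impulse spreads within frame 1 only
after_both = mix_temporal(after_spatial)    # then spreads across all frames

print(after_spatial)
print(after_both)
```

After the spatial pass only frame 1 is affected; after the temporal pass every cell carries part of the impulse, so global context is reached without ever forming all pairwise spacetime interactions.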
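[37] AstraNav-World: World Model for Foresight Control and Consistency cs.CV
Junjun Hu, Jintao Chen, Haochen Bai, Minghua Luo, Shichao Xie
TL;DR: This paper proposes AstraNav-World, an end-to-end world model for embodied navigation in open, dynamic environments. The model jointly reasons about future visual states and action sequences within a unified probabilistic framework, integrating a diffusion-based video generator with a vision-language policy so that predicted scenes and planned actions are rolled out in sync. Training optimizes two complementary objectives: generating action-conditioned multi-step visual predictions, and deriving trajectories conditioned on those predicted visuals. This bidirectional constraint makes visual predictions executable and grounds decisions in physically consistent, task-relevant futures, mitigating the cumulative errors of conventional decoupled “envision-then-plan” pipelines.
Details
Motivation: Embodied navigation in open, dynamic environments requires accurate foresight of how the world will evolve and how actions will unfold over time. The work also aims to overcome the cumulative errors common in conventional decoupled “envision-then-plan” pipelines.
Result: Across multiple embodied navigation benchmarks, experiments show improved trajectory accuracy and higher success rates. Ablations confirm the necessity of tight vision-action coupling and unified training. In real-world tests, AstraNav-World shows strong zero-shot capability, adapting to previously unseen scenarios without any real-world fine-tuning.
Insight: The main novelty is unifying foresight vision and control in a single generative model, with a bidirectional constraint (visual predictions must be executable, and decisions are grounded in predicted visuals) that couples vision and action tightly. This helps the model capture transferable spatial understanding and planning-relevant navigation dynamics rather than merely overfitting to a simulation-specific data distribution, moving toward more reliable, interpretable, and general-purpose embodied agents.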
Abstract: Embodied navigation in open, dynamic environments demands accurate foresight of how the world will evolve and how actions will unfold over time. We propose AstraNav-World, an end-to-end world model that jointly reasons about future visual states and action sequences within a unified probabilistic framework. Our framework integrates a diffusion-based video generator with a vision-language policy, enabling synchronized rollouts where predicted scenes and planned actions are updated simultaneously. Training optimizes two complementary objectives: generating action-conditioned multi-step visual predictions and deriving trajectories conditioned on those predicted visuals. This bidirectional constraint makes visual predictions executable and keeps decisions grounded in physically consistent, task-relevant futures, mitigating cumulative errors common in decoupled “envision-then-plan” pipelines. Experiments across diverse embodied navigation benchmarks show improved trajectory accuracy and higher success rates. Ablations confirm the necessity of tight vision-action coupling and unified training, with either branch removal degrading both prediction quality and policy reliability. In real-world testing, AstraNav-World demonstrated exceptional zero-shot capabilities, adapting to previously unseen scenarios without any real-world fine-tuning. These results suggest that AstraNav-World captures transferable spatial understanding and planning-relevant navigation dynamics, rather than merely overfitting to simulation-specific data distribution. Overall, by unifying foresight vision and control within a single generative model, we move closer to reliable, interpretable, and general-purpose embodied agents that operate robustly in open-ended real-world settings.
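[38] Knot Forcing: Taming Autoregressive Video Diffusion Models for Real-time Infinite Interactive Portrait Animation cs.CV
Steven Xiao, Xindi Zhang, Dechao Meng, Qi Wang, Peng Zhang
TL;DR: This paper proposes Knot Forcing, a new streaming framework for real-time, infinite, interactive portrait animation. Through a chunk-wise generation strategy, a temporal knot module, and a “running ahead” mechanism, it addresses error accumulation, inter-chunk motion discontinuities, and degraded long-term consistency in autoregressive video generation, enabling high-fidelity, temporally consistent, interactive animation in real time on consumer-grade GPUs.
Details
Motivation: Real-time portrait animation is essential for interactive applications such as virtual assistants and live avatars, and requires high visual fidelity, temporal coherence, ultra-low latency, and responsive control from dynamic inputs such as reference images and driving signals. Diffusion-based models achieve strong quality, but their non-causal nature hinders streaming deployment. Causal autoregressive video generation supports frame-by-frame generation but suffers from error accumulation, motion discontinuities at chunk boundaries, and degraded long-term consistency.
Result: Knot Forcing achieves real-time performance on consumer-grade GPUs with strong visual stability, generating high-fidelity, temporally consistent, interactive portrait animation over infinite sequences.
Insight: The novelties are: (1) a chunk-wise generation strategy that preserves global identity via cached KV states of the reference image and models local temporal structure with sliding-window attention; (2) a temporal knot module that overlaps adjacent chunks and propagates spatio-temporal cues via image-to-video conditioning to smooth inter-chunk motion transitions; (3) a “running ahead” mechanism that dynamically updates the temporal coordinate of the reference frame during inference so that its semantic context stays ahead of the current rollout frame, supporting long-term coherence. Together these designs address the key challenges of autoregressive video generation.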
Abstract: Real-time portrait animation is essential for interactive applications such as virtual assistants and live avatars, requiring high visual fidelity, temporal coherence, ultra-low latency, and responsive control from dynamic inputs like reference images and driving signals. While diffusion-based models achieve strong quality, their non-causal nature hinders streaming deployment. Causal autoregressive video generation approaches enable efficient frame-by-frame generation but suffer from error accumulation, motion discontinuities at chunk boundaries, and degraded long-term consistency. In this work, we present a novel streaming framework named Knot Forcing for real-time portrait animation that addresses these challenges through three key designs: (1) a chunk-wise generation strategy with global identity preservation via cached KV states of the reference image and local temporal modeling using sliding window attention; (2) a temporal knot module that overlaps adjacent chunks and propagates spatio-temporal cues via image-to-video conditioning to smooth inter-chunk motion transitions; and (3) a “running ahead” mechanism that dynamically updates the reference frame’s temporal coordinate during inference, keeping its semantic context ahead of the current rollout frame to support long-term coherence. Knot Forcing enables high-fidelity, temporally consistent, and interactive portrait animation over infinite sequences, achieving real-time performance with strong visual stability on consumer-grade GPUs.
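[39] SyncAnyone: Implicit Disentanglement via Progressive Self-Correction for Lip-Syncing in the wild cs.CV
Xindi Zhang, Dechao Meng, Steven Xiao, Qi Wang, Peng Zhang
TL;DR: This paper proposes SyncAnyone, a two-stage learning framework for accurate lip-syncing in the wild. Stage 1 trains a diffusion-based video transformer for masked mouth inpainting to generate accurate audio-driven lip movements. Stage 2 uses a mask-free tuning pipeline to remove mask-induced artifacts, improving visual quality and background consistency.
Details
Motivation: Existing mask-based training improves lip-sync accuracy but disrupts spatiotemporal context, impairing performance on dynamic facial motions and destabilizing facial structure and background consistency. This paper aims to overcome that limitation and achieve accurate motion modeling together with high visual fidelity.
Result: Extensive experiments show state-of-the-art results in visual quality, temporal coherence, and identity preservation under in-the-wild lip-syncing scenarios.
Insight: The novelty is the two-stage framework: Stage 1 uses a diffusion video transformer for masked mouth inpainting to model accurate lip motion, and Stage 2 performs mask-free tuning on generated pseudo-paired training samples to correct artifacts and improve background consistency. Viewed objectively, this progressive self-correction strategy balances sync accuracy against visual fidelity.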
Abstract: High-quality AI-powered video dubbing demands precise audio-lip synchronization, high-fidelity visual generation, and faithful preservation of identity and background. Most existing methods rely on a mask-based training strategy, where the mouth region is masked in talking-head videos, and the model learns to synthesize lip movements from corrupted inputs and target audio. While this facilitates lip-sync accuracy, it disrupts spatiotemporal context, impairing performance on dynamic facial motions and causing instability in facial structure and background consistency. To overcome this limitation, we propose SyncAnyone, a novel two-stage learning framework that achieves accurate motion modeling and high visual fidelity simultaneously. In Stage 1, we train a diffusion-based video transformer for masked mouth inpainting, leveraging its strong spatiotemporal modeling to generate accurate, audio-driven lip movements. However, due to input corruption, minor artifacts may arise in the surrounding facial regions and the background. In Stage 2, we develop a mask-free tuning pipeline to address mask-induced artifacts. Specifically, on the basis of the Stage 1 model, we develop a data generation pipeline that creates pseudo-paired training samples by synthesizing lip-synced videos from the source video and randomly sampled audio. We further tune the Stage 2 model on this synthetic data, achieving precise lip editing and better background consistency. Extensive experiments show that our method achieves state-of-the-art results in visual quality, temporal coherence, and identity preservation under in-the-wild lip-syncing scenarios.
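[40] A-QCF-Net: An Adaptive Quaternion Cross-Fusion Network for Multimodal Liver Tumor Segmentation from Unpaired Datasets cs.CV | cs.AI
Arunkumar V, Firos V M, Senthilkumar S, Gangadharan G R
TL;DR: This paper proposes an Adaptive Quaternion Cross-Fusion Network (A-QCF-Net) that learns a single unified liver tumor segmentation model from unpaired CT and MRI datasets. The network uses the parameter efficiency and expressive power of quaternion neural networks to build a shared feature space, and an Adaptive Quaternion Cross-Fusion block enables bidirectional knowledge transfer between modalities, substantially improving segmentation on unpaired datasets.
Details
Motivation: Deep learning for multimodal medical imaging is held back by data scarcity: large datasets in which modalities are paired and spatially aligned are rare. The work aims to train one unified segmentation model from the abundant unpaired CT and MRI data.
Result: Jointly trained on the unpaired LiTS (CT) and ATLAS (MRI) datasets, the model reaches Tumor Dice scores of 76.7% on CT and 78.3% on MRI, exceeding the strong unimodal nnU-Net baseline by 5.4% and 4.7% respectively and reaching a state-of-the-art level.
Insight: The novelties are: (1) using quaternion neural networks to build a shared feature space with better parameter efficiency and expressive power; (2) an adaptive quaternion cross-fusion block that exchanges information between modalities dynamically through data-driven attention; (3) a robust clinical paradigm for learning a unified model from large unpaired medical imaging archives.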
Abstract: Multimodal medical imaging provides complementary information that is crucial for accurate delineation of pathology, but the development of deep learning models is limited by the scarcity of large datasets in which different modalities are paired and spatially aligned. This paper addresses this fundamental limitation by proposing an Adaptive Quaternion Cross-Fusion Network (A-QCF-Net) that learns a single unified segmentation model from completely separate and unpaired CT and MRI cohorts. The architecture exploits the parameter efficiency and expressive power of Quaternion Neural Networks to construct a shared feature space. At its core is the Adaptive Quaternion Cross-Fusion (A-QCF) block, a data driven attention module that enables bidirectional knowledge transfer between the two streams. By learning to modulate the flow of information dynamically, the A-QCF block allows the network to exchange abstract modality specific expertise, such as the sharp anatomical boundary information available in CT and the subtle soft tissue contrast provided by MRI. This mutual exchange regularizes and enriches the feature representations of both streams. We validate the framework by jointly training a single model on the unpaired LiTS (CT) and ATLAS (MRI) datasets. The jointly trained model achieves Tumor Dice scores of 76.7% on CT and 78.3% on MRI, significantly exceeding the strong unimodal nnU-Net baseline by margins of 5.4% and 4.7% respectively. Furthermore, comprehensive explainability analysis using Grad-CAM and Grad-CAM++ confirms that the model correctly focuses on relevant pathological structures, ensuring the learned representations are clinically meaningful. This provides a robust and clinically viable paradigm for unlocking the large unpaired imaging archives that are common in healthcare.
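Quaternion neural networks generally replace real-valued multiplication with the Hamilton product, which mixes the four components of a quaternion with shared weights. The abstract does not describe the A-QCF layers at this level, so the sketch below only shows the Hamilton product itself, with quaternions written as (real, i, j, k) tuples.

```python
def hamilton_product(p, q):
    """Hamilton product of two quaternions given as (real, i, j, k) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )


I = (0, 1, 0, 0)
J = (0, 0, 1, 0)
K = (0, 0, 0, 1)

print(hamilton_product(I, J))  # i * j
print(hamilton_product(J, I))  # j * i, which differs in sign
```

The product is not commutative, and a single quaternion weight touches all four input components, which is the usual source of the parameter savings attributed to quaternion layers.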
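[41] BertsWin: Resolving Topological Sparsity in 3D Masked Autoencoders via Component-Balanced Structural Optimization cs.CV | cs.LG | eess.IV
Evgeny Alves Limarenko, Anastasiia Studenikina
TL;DR: This paper proposes BertsWin, a hybrid architecture that combines BERT-style full token masking with Swin Transformer windows to strengthen spatial context learning in self-supervised pre-training on 3D medical images. It keeps the complete 3D token grid (masked and visible tokens) to preserve spatial topology and introduces a structural priority loss. On cone beam CT data of the temporomandibular joints, BertsWin accelerates semantic convergence by 5.8x over standard ViT-MAE baselines. With the proposed GradientConductor optimizer, it needs 15 times fewer training epochs (44 vs 660) to reach state-of-the-art reconstruction fidelity, at a computational cost comparable to sparse ViT baselines.
Details
Motivation: Standard Masked Autoencoders (MAE) struggle to capture three-dimensional spatial relationships in 3D medical volumes, especially because discarding a large share of tokens during pre-training leaves the topology sparse.
Result: On the temporomandibular joint 3D CT segmentation task, BertsWin accelerates semantic convergence by 5.8x over standard ViT-MAE baselines. With the GradientConductor optimizer, the epochs needed to reach state-of-the-art reconstruction fidelity drop from 660 to 44, a 15-fold reduction, while the architecture maintains theoretical FLOP parity with sparse ViT baselines.
Insight: The novelties are: (1) a complete 3D token grid (masked and visible tokens) that preserves spatial topology and mitigates topological sparsity; (2) BERT-style full token masking combined with Swin windows, balancing local context against computational efficiency; (3) a structural priority loss function; (4) an overall framework that accelerates convergence without adding computational burden, offering an efficient approach to self-supervised learning on 3D medical images.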
Abstract: Self-supervised learning (SSL) and Vision Transformer (ViT) approaches show promising results in 2D medical imaging, but applying them to 3D volumetric images is fraught with difficulties. Standard Masked Autoencoders (MAE), which are the state-of-the-art solution for 2D, have a hard time capturing three-dimensional spatial relationships, especially when 75% of tokens are discarded during pre-training. We propose BertsWin, a hybrid architecture combining full BERT-style token masking with Swin Transformer windows, to enhance spatial context learning in 3D during SSL pre-training. Unlike the classic MAE, which processes only visible areas, BertsWin introduces a complete 3D grid of tokens (masked and visible), preserving the spatial topology. To mitigate the quadratic complexity of ViT, single-level local Swin windows are used. We introduce a structural priority loss function and evaluate the approach on cone beam computed tomography of the temporomandibular joints (TMJ). The subsequent assessment includes TMJ segmentation on 3D CT scans. We demonstrate that the BertsWin architecture, by maintaining a complete three-dimensional spatial topology, inherently accelerates semantic convergence by a factor of 5.8x compared to standard ViT-MAE baselines. Furthermore, when coupled with our proposed GradientConductor optimizer, the full BertsWin framework achieves a 15-fold reduction in training epochs (44 vs 660) required to reach state-of-the-art reconstruction fidelity. Analysis reveals that BertsWin achieves this acceleration without the computational penalty typically associated with dense volumetric processing. At canonical input resolutions, the architecture maintains theoretical FLOP parity with sparse ViT baselines, resulting in a significant net reduction in total computational resources due to faster convergence.
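[42] Inference-based GAN Video Generation cs.CV | cs.AI
Jingbo Yang, Adrian G. Bors
TL;DR: This paper proposes an inference-based GAN video generation method that gives unconditional video generators inference capabilities through a hybrid of a variational autoencoder (VAE) and a generative adversarial network (GAN). The model has two processing branches, one for content and one for motion, and introduces a novel, memory-efficient approach that uses a Markov chain framework with a recall mechanism to connect short-sequence video generators sequentially, producing high-quality, temporally continuous, dynamically consistent videos of hundreds or thousands of frames.
Details
Motivation: Existing video generation models (such as GANs, VAEs, and diffusion networks) typically generate only short sequences (for example, 16 frames) and face a temporal scaling problem: quality degrades as videos get longer. This paper aims to overcome that limitation and generate high-quality, temporally coherent long videos.
Result: The proposed method can generate videos of hundreds or thousands of frames while ensuring temporal continuity, consistency, and dynamics. The abstract reports no specific quantitative results (such as benchmark performance or comparison with the state of the art).
Insight: The novelties are: (1) combining a VAE with a GAN to give unconditional video generators inference capabilities; (2) a Markov chain framework with a recall mechanism that generates long videos by sequentially connecting short-sequence generators, a memory-efficient approach that preserves temporal dependencies.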
Abstract: Video generation has seen remarkable progress thanks to advancements in generative deep learning. Generated videos should not only display coherent and continuous movement but also meaningful movement in successions of scenes. Generative models such as Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs) and more recently Diffusion Networks have been used for generating short video sequences, usually of up to 16 frames. In this paper, we first propose a new type of video generator by augmenting adversarial-based unconditional video generators with a variational encoder, akin to a VAE-GAN hybrid structure, in order to endow the generation process with inference capabilities. The proposed model, as in other video deep learning-based processing frameworks, incorporates two processing branches, one for content and another for movement. However, existing models struggle with the temporal scaling of the generated videos. In classical approaches, increasing the generated video length degrades the resulting video quality, particularly for significantly long sequences. To overcome this limitation, our study extends the initially proposed VAE-GAN video generation model by employing a novel, memory-efficient approach to generate long videos composed of hundreds or thousands of frames, ensuring their temporal continuity, consistency and dynamics. Our approach leverages a Markov chain framework with a recall mechanism, with each state representing a VAE-GAN short-length video generator. This setup allows for the sequential connection of generated video sub-sequences, enabling temporal dependencies, resulting in meaningful long video sequences.
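[43] Scene-VLM: Multimodal Video Scene Segmentation via Vision-Language Models cs.CV
Nimrod Berman, Adam Botach, Emanuel Ben-Baruch, Shunit Haviv Hakimi, Asaf Gendler
TL;DR: Scene-VLM is the first fine-tuned vision-language model (VLM) framework for video scene segmentation. It jointly processes multimodal information, including video frames, transcriptions, and optional metadata, to reason across consecutive shots. The model generates predictions sequentially with causal dependencies and introduces a context-focus window mechanism so that each shot-level decision has sufficient temporal context. It can also extract confidence scores from the token-level logits of the VLM for controllable precision-recall trade-offs, and can generate natural-language rationales for its boundary decisions with minimal supervision.
Details
Motivation: Existing encoder-based methods suffer from visual-centric biases, classify shots in isolation while ignoring sequential dependencies, and lack narrative understanding and explainability. A method that uses multimodal information for coherent scene segmentation is needed.
Result: The approach achieves state-of-the-art performance on standard scene segmentation benchmarks. On MovieNet, for example, it improves on the previous leading method by +6 AP and +13.7 F1.
Insight: The novelties are: fine-tuning a VLM for video scene segmentation with joint multimodal reasoning; a context-focus window mechanism for handling temporal dependencies; a scheme for extracting confidence scores from token-level logits to enable controllable precision-recall trade-offs; and alignment with minimal supervision so the model generates natural-language rationales for its decisions, improving explainability.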
Abstract: Segmenting long-form videos into semantically coherent scenes is a fundamental task in large-scale video understanding. Existing encoder-based methods are limited by visual-centric biases, classify each shot in isolation without leveraging sequential dependencies, and lack both narrative understanding and explainability. In this paper, we present Scene-VLM, the first fine-tuned vision-language model (VLM) framework for video scene segmentation. Scene-VLM jointly processes visual and textual cues including frames, transcriptions, and optional metadata to enable multimodal reasoning across consecutive shots. The model generates predictions sequentially with causal dependencies among shots and introduces a context-focus window mechanism to ensure sufficient temporal context for each shot-level decision. In addition, we propose a scheme to extract confidence scores from the token-level logits of the VLM, enabling controllable precision-recall trade-offs that were previously limited to encoder-based methods. Furthermore, we demonstrate that our model can be aligned to generate coherent natural-language rationales for its boundary decisions through minimal targeted supervision. Our approach achieves state-of-the-art performance on standard scene segmentation benchmarks. On MovieNet, for example, Scene-VLM yields significant improvements of +6 AP and +13.7 F1 over the previous leading method.
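One plausible way to turn token-level logits into a confidence score, offered only as a sketch: take the logits the model assigns to two answer tokens (a hypothetical "boundary" token and a hypothetical "no boundary" token), apply a softmax over just those two, and threshold the result. The abstract does not give the exact scheme, so the two-token setup and the threshold values are assumptions.

```python
import math


def boundary_confidence(logit_yes, logit_no):
    """Softmax over two answer-token logits; returns P(boundary)."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def predict_boundaries(logit_pairs, threshold=0.5):
    """Mark a shot as a scene boundary when its confidence reaches the threshold."""
    return [boundary_confidence(y, n) >= threshold for y, n in logit_pairs]


# Hypothetical (yes, no) logits for three consecutive shots.
shots = [(2.0, -1.0), (0.2, 0.0), (-3.0, 1.0)]

print(predict_boundaries(shots, threshold=0.5))
print(predict_boundaries(shots, threshold=0.9))
```

Raising the threshold keeps only the most confident boundaries, trading recall for precision, which is the kind of control the abstract says was previously limited to encoder-based methods.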
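[44] Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models cs.CV | cs.LG
Mengqi He, Xinyu Tian, Xin Shen, Jinhong Ni, Shu Zou
TL;DR: This paper proposes a new adversarial attack on vision-language models (VLMs), Entropy-bank Guided Adversarial attacks (EGA). The study finds that during autoregressive generation only a small fraction (about 20%) of high-entropy tokens, i.e., critical decision points where the model is highly uncertain, decisively shape the output trajectory. By concentrating adversarial perturbations on these positions, the method achieves semantic degradation comparable to global attacks with a smaller perturbation budget, markedly raises the rate at which benign outputs are converted into harmful ones (35-49%), and transfers across architecturally different VLMs (17-26%).
Details
Motivation: Prior entropy-based adversarial attacks typically assume that tokens at all decoding steps contribute equally to generation instability and therefore maximize uncertainty at every position. The authors find that only a few high-entropy tokens actually govern the output trajectory, which motivates exploiting this non-uniformity to design more efficient, more targeted attacks that expose deeper vulnerabilities in VLM safety mechanisms.
Result: Across multiple representative VLMs, EGA maintains high attack success rates (93-95%) while converting 35-49% of benign outputs into harmful ones. On unseen target models it still yields 17-26% harmful rates, demonstrating transferability. These results reveal new weaknesses in current VLM safety mechanisms.
Insight: The core novelty is identifying and exploiting the uneven importance of high-entropy tokens in autoregressive generation, introducing the notion of “critical decision points”. Viewed objectively, this gives adversarial attack research a new perspective: attack efficiency can be raised substantially by focusing on the few tokens where the model is most uncertain rather than perturbing all inputs uniformly. The finding informs attack design and may also suggest more robust defenses, such as closer monitoring and protection of critical decision points.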
Abstract: Vision-language models (VLMs) achieve remarkable performance but remain vulnerable to adversarial attacks. Entropy, a measure of model uncertainty, is strongly correlated with the reliability of VLMs. Prior entropy-based attacks maximize uncertainty at all decoding steps, implicitly assuming that every token contributes equally to generation instability. We show instead that a small fraction (about 20%) of high-entropy tokens, i.e., critical decision points in autoregressive generation, disproportionately governs output trajectories. By concentrating adversarial perturbations on these positions, we achieve semantic degradation comparable to global methods while using substantially smaller budgets. More importantly, across multiple representative VLMs, such selective attacks convert 35-49% of benign outputs into harmful ones, exposing a more critical safety risk. Remarkably, these vulnerable high-entropy forks recur across architecturally diverse VLMs, enabling feasible transferability (17-26% harmful rates on unseen targets). Motivated by these findings, we propose Entropy-bank Guided Adversarial attacks (EGA), which achieve competitive attack success rates (93-95%) alongside high harmful conversion, thereby revealing new weaknesses in current VLM safety mechanisms.
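The notion of "high-entropy decision points" can be made concrete with a small analysis sketch: compute the Shannon entropy of the next-token distribution at each decoding step and keep the top fraction of positions. This shows only the ranking step, which is equally useful for monitoring or defense; the probability values are made up, and the abstract does not specify how the paper selects positions beyond the roughly 20% figure.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def high_entropy_positions(step_probs, fraction=0.2):
    """Indices of the decoding steps with the highest entropy."""
    k = max(1, math.ceil(fraction * len(step_probs)))
    ranked = sorted(
        range(len(step_probs)),
        key=lambda i: entropy(step_probs[i]),
        reverse=True,
    )
    return sorted(ranked[:k])


# Made-up distributions over a 4-token vocabulary for 5 decoding steps.
step_probs = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],   # the model is maximally uncertain here
    [0.90, 0.05, 0.03, 0.02],
    [0.70, 0.10, 0.10, 0.10],
    [0.99, 0.005, 0.003, 0.002],
]

print(high_entropy_positions(step_probs, fraction=0.2))
```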
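[45] End-to-End 3D Spatiotemporal Perception with Multimodal Fusion and V2X Collaboration cs.CV
Zhenwei Yang, Yibo Ai, Weidong Zhang
TL;DR: This paper proposes XET-V2X, a multimodal-fusion end-to-end tracking framework for V2X collaboration, targeting 3D spatiotemporal perception in autonomous driving under occlusions, limited viewpoints, and communication delays. It unifies multi-view multimodal sensing within a shared spatiotemporal representation and introduces a dual-layer spatial cross-attention module based on multi-scale deformable attention that efficiently aligns heterogeneous viewpoints and modalities, improving semantic consistency while reducing computational overhead.
Details
Motivation: In autonomous driving, multi-view cooperative perception and multimodal fusion are essential for reliable 3D spatiotemporal understanding, especially in V2X scenarios with occlusions, limited viewpoints, and communication delays.
Result: Experiments on the real-world V2X-Seq-SPD dataset and the simulated V2X-Sim-V2V and V2X-Sim-V2I benchmarks show that XET-V2X consistently improves detection and tracking performance under varying communication delays. Quantitative results and qualitative visualizations both indicate robust, temporally stable perception in complex traffic scenarios.
Insight: The novelty is a unified end-to-end tracking framework whose dual-layer spatial cross-attention module efficiently aggregates multi-view image features and fuses point clouds, strengthening the semantic consistency of cross-modal interaction while reducing computational complexity, and offering an effective solution for V2X cooperative perception.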
Abstract: Multi-view cooperative perception and multimodal fusion are essential for reliable 3D spatiotemporal understanding in autonomous driving, especially under occlusions, limited viewpoints, and communication delays in V2X scenarios. This paper proposes XET-V2X, a multimodal-fusion end-to-end tracking framework for V2X collaboration that unifies multi-view multimodal sensing within a shared spatiotemporal representation. To efficiently align heterogeneous viewpoints and modalities, XET-V2X introduces a dual-layer spatial cross-attention module based on multi-scale deformable attention. Multi-view image features are first aggregated to enhance semantic consistency, followed by point cloud fusion guided by the updated spatial queries, enabling effective cross-modal interaction while reducing computational overhead. Experiments on the real-world V2X-Seq-SPD dataset and the simulated V2X-Sim-V2V and V2X-Sim-V2I benchmarks demonstrate consistent improvements in detection and tracking performance under varying communication delays. Both quantitative results and qualitative visualizations indicate that XET-V2X achieves robust and temporally stable perception in complex traffic scenarios.
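[46] Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models cs.CV
Masayuki Kawarada, Kosuke Yamada, Antonio Tejero-de-Pablos, Naoto Inoue
TL;DR: This paper proposes DIOR, a training-free conditional image embedding framework that uses a large vision-language model (LVLM) to produce image representations focused on a given textual condition (such as color or genre). DIOR prompts the LVLM to describe the image with a single word related to the condition and takes the hidden state vector of the last token as the conditional embedding, so it applies to any image and condition without additional training.
Details
Motivation: Existing vision foundation models such as CLIP offer rich image representations but are not designed to focus on a specified textual condition, so producing conditional image embeddings remains challenging. The paper addresses how to extract condition-relevant features from an image without training.
Result: Comprehensive experiments on conditional image similarity tasks show that DIOR outperforms existing training-free baselines, including CLIP, and in multiple settings also outperforms methods that require additional training.
Insight: The novelty is a fully training-free, general framework that folds the condition into the image embedding by exploiting the text generation ability of the LVLM and its hidden states. The core insight is to recast conditional embedding as a prompt-based single-word generation task, which avoids fine-tuning and task-specific priors and gives an efficient, flexible approach to conditional visual representation learning.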
Abstract: Conditional image embeddings are feature representations that focus on specific aspects of an image indicated by a given textual condition (e.g., color, genre), which has been a challenging problem. Although recent vision foundation models, such as CLIP, offer rich representations of images, they are not designed to focus on a specified condition. In this paper, we propose DIOR, a method that leverages a large vision-language model (LVLM) to generate conditional image embeddings. DIOR is a training-free approach that prompts the LVLM to describe an image with a single word related to a given condition. The hidden state vector of the LVLM’s last token is then extracted as the conditional image embedding. DIOR provides a versatile solution that can be applied to any image and condition without additional training or task-specific priors. Comprehensive experimental results on conditional image similarity tasks demonstrate that DIOR outperforms existing training-free baselines, including CLIP. Furthermore, DIOR achieves superior performance compared to methods that require additional training across multiple settings.
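[47] EasyOmnimatte: Taming Pretrained Inpainting Diffusion Models for End-to-End Video Layered Decomposition cs.CV
Yihan Hu, Xuelin Chen, Xiaodong Cun
TL;DR: EasyOmnimatte is an end-to-end video layered decomposition method. It finetunes a pretrained video inpainting diffusion model with a dual-expert strategy (an Effect Expert and a Quality Expert) to produce high-quality foreground alpha mattes together with their associated effects, outperforming existing methods in both quality and efficiency.
Details
Motivation: Existing video omnimatte methods typically rely on slow, multi-stage, or inference-time optimization pipelines that fail to fully exploit powerful generative priors, producing suboptimal decompositions. The paper aims for a unified, end-to-end solution.
Result: Experiments show that EasyOmnimatte sets a new state of the art for video omnimatte, significantly outperforming baselines in quality and efficiency, and supports various downstream tasks.
Insight: The novelty is splitting the finetuning of the video inpainting model into two complementary experts: an Effect Expert (LoRA applied only to effect-sensitive DiT blocks) that captures the coarse structure of the foreground and its associated effects, and a Quality Expert (LoRA applied to all blocks) that refines the alpha matte. Using the experts at different stages of denoising reduces computational cost.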
Abstract: Existing video omnimatte methods typically rely on slow, multi-stage, or inference-time optimization pipelines that fail to fully exploit powerful generative priors, producing suboptimal decompositions. Our key insight is that, if a video inpainting model can be finetuned to remove the foreground-associated effects, then it must be inherently capable of perceiving these effects, and hence can also be finetuned for the complementary task: foreground layer decomposition with associated effects. However, although naïvely finetuning the inpainting model with LoRA applied to all blocks can produce high-quality alpha mattes, it fails to capture associated effects. Our systematic analysis reveals this arises because effect-related cues are primarily encoded in specific DiT blocks and become suppressed when LoRA is applied across all blocks. To address this, we introduce EasyOmnimatte, the first unified, end-to-end video omnimatte method. Concretely, we finetune a pretrained video inpainting diffusion model to learn dual complementary experts while keeping its original weights intact: an Effect Expert, where LoRA is applied only to effect-sensitive DiT blocks to capture the coarse structure of the foreground and associated effects, and a fully LoRA-finetuned Quality Expert learns to refine the alpha matte. During sampling, Effect Expert is used for denoising at early, high-noise steps, while Quality Expert takes over at later, low-noise steps. This design eliminates the need for two full diffusion passes, significantly reducing computational cost without compromising output quality. Ablation studies validate the effectiveness of this Dual-Expert strategy. Experiments demonstrate that EasyOmnimatte sets a new state-of-the-art for video omnimatte and enables various downstream tasks, significantly outperforming baselines in both quality and efficiency.
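The sampling-time expert switch described in the abstract can be sketched as a simple schedule: one expert handles the early, high-noise steps and the other takes over for the later, low-noise steps, so each step runs exactly one expert. The switch point is not given in the abstract; the `switch_fraction` parameter and its default below are hypothetical.

```python
def pick_expert(step, num_steps, switch_fraction=0.5):
    """Choose which expert denoises a given step.

    Step 0 is the noisiest step. Steps before the switch point use the
    Effect Expert; the remaining steps use the Quality Expert.
    """
    return "effect" if step < switch_fraction * num_steps else "quality"


num_steps = 10
schedule = [pick_expert(s, num_steps) for s in range(num_steps)]
print(schedule)
```

Because the two experts partition the steps rather than each running a full pass, the total number of denoising calls stays at `num_steps`, which matches the abstract's point about avoiding two full diffusion passes.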
[48] DPAR: Dynamic Patchification for Efficient Autoregressive Visual Generation cs.CVPDF
Divyansh Srivastava, Akshay Mehra, Pranav Maneriker, Debopam Sanyal, Vishnu Raj
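TL;DR: This paper proposes DPAR, a decoder-only autoregressive model that dynamically aggregates image tokens into a variable number of patches for efficient image generation. It uses the next-token prediction entropy of a lightweight, unsupervised autoregressive model as an information criterion for merging tokens into larger patches, which reduces token count and computational cost.
Details
Motivation: Fixed-length tokenization in decoder-only autoregressive image generation makes token counts grow quadratically with resolution, which substantially increases the compute and memory demands of attention.
Result: On ImageNet generation at 256 and 384 resolution, token count drops by 1.81x and 2.06x respectively, training FLOPs fall by up to 40%, convergence is faster, and FID improves by up to 27.1% relative to baseline models.
Insight: The work is the first to use an autoregressive model's next-token prediction entropy as an unsupervised information criterion for dynamically merging image tokens, so that more compute is allocated to high-information regions. The learned representations are robust to patch boundaries, which allows scaling to larger patch sizes at inference.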
Abstract: Decoder-only autoregressive image generation typically relies on fixed-length tokenization schemes whose token counts grow quadratically with resolution, substantially increasing the computational and memory demands of attention. We present DPAR, a novel decoder-only autoregressive model that dynamically aggregates image tokens into a variable number of patches for efficient image generation. Our work is the first to demonstrate that next-token prediction entropy from a lightweight and unsupervised autoregressive model provides a reliable criterion for merging tokens into larger patches based on information content. DPAR makes minimal modifications to the standard decoder architecture, ensuring compatibility with multimodal generation frameworks and allocating more compute to generation of high-information image regions. Further, we demonstrate that training with dynamically sized patches yields representations that are robust to patch boundaries, allowing DPAR to scale to larger patch sizes at inference. DPAR reduces token count by 1.81x and 2.06x on Imagenet 256 and 384 generation resolution respectively, leading to a reduction of up to 40% FLOPs in training costs. Further, our method exhibits faster convergence and improves FID by up to 27.1% relative to baseline models.
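The idea of merging tokens by next-token prediction entropy can be sketched as follows. The abstract does not specify the merging rule, so the threshold test, the `max_patch` cap, and all names here are assumptions made for illustration, not DPAR's actual procedure.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def merge_by_entropy(next_token_probs, threshold, max_patch=4):
    """Group consecutive token indices into variable-size patches.

    A new patch starts whenever the prediction for the incoming token is
    uncertain (entropy above `threshold`) or the current patch is full.
    Low-entropy, predictable tokens are therefore merged together.
    """
    patches, current = [], []
    for i, probs in enumerate(next_token_probs):
        uncertain = entropy(probs) > threshold
        if current and (uncertain or len(current) >= max_patch):
            patches.append(current)
            current = []
        current.append(i)
    if current:
        patches.append(current)
    return patches


if __name__ == "__main__":
    dists = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [1.0, 0.0]]
    print(merge_by_entropy(dists, threshold=0.5))
```

In this toy rule, an uncertain token opens a new patch, so patch boundaries tend to fall where the information content is high.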
[49] High-Fidelity and Long-Duration Human Image Animation with Diffusion Transformer cs.CVPDF
Shen Zheng, Jiaran Cai, Yuansheng Guan, Shenneng Huang, Xingpei Ma
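TL;DR: This paper proposes a diffusion transformer (DiT)-based framework for generating high-fidelity, long-duration human image animation videos. It designs hybrid implicit guidance signals and a sharpness guidance factor to enhance facial and hand details, introduces a Position Shift Adaptive Module to support video generation of arbitrary length, and uses data augmentation and skeleton alignment to reduce the impact of body-shape variation across identities.
Details
Motivation: Existing human image animation methods stay temporally consistent for short or regular motions but still struggle with long-duration videos, and the synthesis of fine-grained facial and hand details remains under-explored, which limits their use in real-world, high-quality applications.
Result: Experimental results show that the method outperforms existing state-of-the-art approaches in both high-fidelity and long-duration human image animation.
Insight: The contributions are: (1) hybrid implicit guidance signals and a sharpness guidance factor that use facial and hand details as additional guidance; (2) a Position Shift Adaptive Module that modifies the input format of the DiT backbone to enable video generation of arbitrary length; (3) a data augmentation strategy and skeleton alignment to reduce the impact of shape variation across identities. Viewed objectively, these techniques combine detail guidance, temporal extension, and robustness handling into a systematic solution for long-sequence, high-fidelity animation.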
Abstract: Recent progress in diffusion models has significantly advanced the field of human image animation. While existing methods can generate temporally consistent results for short or regular motions, significant challenges remain, particularly in generating long-duration videos. Furthermore, the synthesis of fine-grained facial and hand details remains under-explored, limiting the applicability of current approaches in real-world, high-quality applications. To address these limitations, we propose a diffusion transformer (DiT)-based framework which focuses on generating high-fidelity and long-duration human animation videos. First, we design a set of hybrid implicit guidance signals and a sharpness guidance factor, enabling our framework to additionally incorporate detailed facial and hand features as guidance. Next, we incorporate the time-aware position shift fusion module, modify the input format within the DiT backbone, and refer to this mechanism as the Position Shift Adaptive Module, which enables video generation of arbitrary length. Finally, we introduce a novel data augmentation strategy and a skeleton alignment model to reduce the impact of human shape variations across different identities. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches, achieving superior performance in both high-fidelity and long-duration human image animation.
[50] Patch as Node: Human-Centric Graph Representation Learning for Multimodal Action Recognition cs.CVPDF
Zeyu Liang, Hailun Xia, Naichuan Zheng
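TL;DR: This paper proposes PAN, a human-centric graph representation learning framework for multimodal action recognition. It represents RGB patches containing human joints as spatiotemporal graphs, which suppresses redundancy in RGB frames and aligns with skeleton-based methods, enabling more effective and semantically coherent multimodal feature fusion.
Details
Motivation: Existing multimodal action recognition methods that fuse RGB and skeleton modalities are limited by the inherent heterogeneity between the modalities and fail to fully exploit their complementary potential.
Result: On three widely used multimodal action recognition datasets, the two variants, PAN-Ensemble and PAN-Unified, achieve state-of-the-art performance in the separate-modeling and unified-modeling fusion settings, respectively.
Insight: The novelty is a human-centric graph modeling paradigm that treats RGB patches as graph nodes and aligns them semantically with the skeleton modality for more effective fusion. An attention-based post-calibration mechanism further reduces the dependence on high-quality skeletal data.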
Abstract: While human action recognition has witnessed notable achievements, multimodal methods fusing RGB and skeleton modalities still suffer from their inherent heterogeneity and fail to fully exploit the complementary potential between them. In this paper, we propose PAN, the first human-centric graph representation learning framework for multimodal action recognition, in which token embeddings of RGB patches containing human joints are represented as spatiotemporal graphs. The human-centric graph modeling paradigm suppresses the redundancy in RGB frames and aligns well with skeleton-based methods, thus enabling a more effective and semantically coherent fusion of multimodal features. Since the sampling of token embeddings heavily relies on 2D skeletal data, we further propose attention-based post calibration to reduce the dependency on high-quality skeletal data at a minimal cost in terms of model performance. To explore the potential of PAN in integrating with skeleton-based methods, we present two variants: PAN-Ensemble, which employs dual-path graph convolution networks followed by late fusion, and PAN-Unified, which performs unified graph representation learning within a single network. On three widely used multimodal action recognition datasets, both PAN-Ensemble and PAN-Unified achieve state-of-the-art (SOTA) performance in their respective settings of multimodal fusion: separate and unified modeling, respectively.
[51] Unsupervised Anomaly Detection in Brain MRI via Disentangled Anatomy Learning cs.CV | cs.AIPDF
Tao Yang, Xiuying Wang, Hao Liu, Guanzhong Gong, Lian-Ming Wu
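TL;DR: This paper proposes an unsupervised brain MRI anomaly detection method based on disentangled anatomy learning. It disentangles brain MRI into imaging information and imaging-invariant anatomical images, then reconstructs high-quality pseudo-healthy images with an edge-to-image restoration module, substantially improving generalization to multi-modality, multi-center data and anomaly detection performance.
Details
Motivation: Existing unsupervised methods generalize poorly to multi-modality, multi-center MRI data, and abnormal residuals propagated from the input degrade the quality of the reconstructed pseudo-healthy images.
Result: Evaluated on nine public datasets (4,443 patients' MRIs from multiple centers), the method outperforms 17 SOTA methods, with absolute improvements of +18.32% in AP and +13.64% in DSC.
Insight: The contributions are a disentangled representation module (combining brain anatomical priors with a differentiable one-hot encoding operator) that improves generalizability, and an edge-to-image restoration module that suppresses abnormal residuals by taking only edge information as input, improving reconstruction quality. Viewed objectively, disentangling imaging information from anatomy separates data-specific properties from essential structure and makes the model more robust.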
Abstract: Detection of various lesions in brain MRI is clinically critical, but challenging due to the diversity of lesions and variability in imaging conditions. Current unsupervised learning methods detect anomalies mainly through reconstructing abnormal images into pseudo-healthy images (PHIs) by normal samples learning and then analyzing differences between images. However, these unsupervised models face two significant limitations: restricted generalizability to multi-modality and multi-center MRIs due to their reliance on the specific imaging information in normal training data, and constrained performance due to abnormal residuals propagated from input images to reconstructed PHIs. To address these limitations, two novel modules are proposed, forming a new PHI reconstruction framework. Firstly, the disentangled representation module is proposed to improve generalizability by decoupling brain MRI into imaging information and essential imaging-invariant anatomical images, ensuring that the reconstruction focuses on the anatomy. Specifically, brain anatomical priors and a differentiable one-hot encoding operator are introduced to constrain the disentanglement results and enhance the disentanglement stability. Secondly, the edge-to-image restoration module is designed to reconstruct high-quality PHIs by restoring the anatomical representation from the high-frequency edge information of anatomical images, and then recoupling the disentangled imaging information. This module not only suppresses abnormal residuals in PHI by reducing abnormal pixels input through edge-only input, but also effectively reconstructs normal regions using the preserved structural details in the edges. Evaluated on nine public datasets (4,443 patients’ MRIs from multiple centers), our method outperforms 17 SOTA methods, achieving absolute improvements of +18.32% in AP and +13.64% in DSC.
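The final step shared by reconstruction-based detectors, comparing the input with its pseudo-healthy image (PHI) and scoring the result with DSC, can be sketched as below. This covers only the generic residual-and-threshold step; the `threshold` value and function names are illustrative assumptions, and the paper's disentanglement and restoration modules are not modeled.

```python
def anomaly_map(image, pseudo_healthy, threshold):
    """Return the per-pixel absolute residual and a binary anomaly mask.

    `image` and `pseudo_healthy` are equally sized 2D lists of intensities.
    Pixels whose residual exceeds `threshold` are flagged as anomalous.
    """
    residual = [
        [abs(a - b) for a, b in zip(row_img, row_phi)]
        for row_img, row_phi in zip(image, pseudo_healthy)
    ]
    mask = [[1 if r > threshold else 0 for r in row] for row in residual]
    return residual, mask


def dice(pred, truth):
    """Dice similarity coefficient (DSC) between two binary masks."""
    intersection = sum(p & t for rp, rt in zip(pred, truth) for p, t in zip(rp, rt))
    total = sum(map(sum, pred)) + sum(map(sum, truth))
    return 1.0 if total == 0 else 2.0 * intersection / total


if __name__ == "__main__":
    image = [[0.1, 0.9], [0.2, 0.2]]
    phi = [[0.1, 0.2], [0.2, 0.2]]
    _, mask = anomaly_map(image, phi, threshold=0.3)
    print(mask, dice(mask, [[0, 1], [0, 0]]))
```

The sketch also shows why residuals leaking from the input into the PHI hurt detection: if the reconstruction reproduces a lesion, the residual at those pixels shrinks and the mask misses it.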
[52] Data relativistic uncertainty framework for low-illumination anime scenery image enhancement cs.CV | cs.LG | cs.MMPDF
Yiquan Gao, John See
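TL;DR: This paper proposes a Data Relativistic Uncertainty (DRU) framework for enhancing low-illumination anime scenery images. By constructing an unpaired anime scenery dataset and using illumination uncertainty to dynamically adjust the objective functions, the method achieves better perceptual and aesthetic quality than existing methods across several EnlightenGAN variants.
Details
Motivation: Low-illumination anime scenery image enhancement is an underexplored task. The work aims to bridge the domain gap with natural-image enhancement and to address data scarcity.
Result: Extensive experiments on the constructed unpaired anime scenery dataset, training several versions of EnlightenGAN, show perceptual and aesthetic quality beyond existing state-of-the-art methods.
Insight: The contributions are a Data Relativistic Uncertainty framework motivated by Relativistic GAN, an interpretable definition and quantification of illumination uncertainty by analogy with the wave-particle duality of light, and the use of that uncertainty to dynamically adjust the objective functions and recalibrate model learning. This offers a new paradigm of data-centric learning with potential applications in visual and language domains.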
Abstract: By contrast with the prevailing works of low-light enhancement in natural images and videos, this study copes with the low-illumination quality degradation in anime scenery images to bridge the domain gap. For such an underexplored enhancement task, we first curate images from various sources and construct an unpaired anime scenery dataset with diverse environments and illumination conditions to address the data scarcity. To exploit the power of uncertainty information inherent with the diverse illumination conditions, we propose a Data Relativistic Uncertainty (DRU) framework, motivated by the idea from Relativistic GAN. By analogy with the wave-particle duality of light, our framework interpretably defines and quantifies the illumination uncertainty of dark/bright samples, which is leveraged to dynamically adjust the objective functions to recalibrate the model learning under data uncertainty. Extensive experiments demonstrate the effectiveness of DRU framework by training several versions of EnlightenGANs, yielding superior perceptual and aesthetic qualities beyond the state-of-the-art methods that are incapable of learning from data uncertainty perspective. We hope our framework can expose a novel paradigm of data-centric learning for potential visual and language domains. Code is available.
[53] Perceive and Calibrate: Analyzing and Enhancing Robustness of Medical Multi-Modal Large Language Models cs.CVPDF
Dunyuan XU, Xikai Yang, Yaoqian Li, Juzheng Miao, Jinpeng Li
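TL;DR: This paper systematically analyzes the sensitivity of medical Multi-modal Large Language Models (MLLMs) to image and text perturbations and proposes a training-free perceive-and-calibrate framework (IMC) to improve their cross-modal robustness. The framework comprises Perturbation-aware Denoising Calibration (PDC) for the visual modality and a Self-instantiated Multi-agent System (SMS) for the text modality.
Details
Motivation: In real clinical settings, medical MLLMs are vulnerable to input perturbations such as imaging artifacts and textual errors. Existing studies focus mainly on general domains and rely on fine-tuning, so they cannot handle the complex noise patterns and strict safety standards of medicine.
Result: On a benchmark covering 2 datasets and 11 noise types, the method achieves state-of-the-art (SOTA) performance across multiple modalities, showing potential to improve the robustness of MLLMs in real clinical scenarios.
Insight: The novelty is a training-free cross-modal robustness framework that exploits the model's own capabilities, built around the perceive-and-calibrate principle. PDC uses the vision encoder to identify noise patterns and perform prototype-guided feature calibration, while SMS uses the model's self-assessment ability to refine noisy text through a cooperative hierarchy of agents.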
Abstract: Medical Multi-modal Large Language Models (MLLMs) have shown promising clinical performance. However, their sensitivity to real-world input perturbations, such as imaging artifacts and textual errors, critically undermines their clinical applicability. Systematic analysis of such noise impact on medical MLLMs remains largely unexplored. Furthermore, while several works have investigated the MLLMs’ robustness in general domains, they primarily focus on text modality and rely on costly fine-tuning. They are inadequate to address the complex noise patterns and fulfill the strict safety standards in medicine. To bridge this gap, this work systematically analyzes the impact of various perturbations on medical MLLMs across both visual and textual modalities. Building on our findings, we introduce a training-free Inherent-enhanced Multi-modal Calibration (IMC) framework that leverages MLLMs’ inherent denoising capabilities following the perceive-and-calibrate principle for cross-modal robustness enhancement. For the visual modality, we propose a Perturbation-aware Denoising Calibration (PDC) which leverages MLLMs’ own vision encoder to identify noise patterns and perform prototype-guided feature calibration. For text denoising, we design a Self-instantiated Multi-agent System (SMS) that exploits the MLLMs’ self-assessment capabilities to refine noisy text through a cooperative hierarchy of agents. We construct a benchmark containing 11 types of noise across both image and text modalities on 2 datasets. Experimental results demonstrate our method achieves the state-of-the-art performance across multiple modalities, showing potential to enhance MLLMs’ robustness in real clinical scenarios.
[54] LVLM-Aided Alignment of Task-Specific Vision Models cs.CV | cs.AIPDF
Alexander Koebler, Lukas Kuhn, Ingo Thon, Florian Buettner
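TL;DR: This paper proposes LVLM-VA, a new method that uses the generalization capabilities of a Large Vision Language Model (LVLM) to efficiently align small task-specific vision models with human domain knowledge, reducing their reliance on spurious correlations and group-specific biases.
Details
Motivation: In high-stakes domains, small task-specific vision models have low computational requirements and are well served by explanation methods, but those explanations often reveal that the models are not aligned with human domain knowledge and instead rely on spurious correlations, leading to brittle behavior once deployed.
Result: Validated on synthetic and real-world datasets, the method substantially improves the match between model behavior and human specifications and effectively reduces the model's dependence on spurious features and group-specific biases, without requiring fine-grained feedback.
Insight: The novelty is using an LVLM to build a bidirectional interface that translates model behavior into natural language and maps human class-level specifications to image-level critiques, enabling effective interaction between domain experts and the model and making alignment more efficient.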
Abstract: In high-stakes domains, small task-specific vision models are crucial due to their low computational requirements and the availability of numerous methods to explain their results. However, these explanations often reveal that the models do not align well with human domain knowledge, relying instead on spurious correlations. This might result in brittle behavior once deployed in the real-world. To address this issue, we introduce a novel and efficient method for aligning small task-specific vision models with human domain knowledge by leveraging the generalization capabilities of a Large Vision Language Model (LVLM). Our LVLM-Aided Visual Alignment (LVLM-VA) method provides a bidirectional interface that translates model behavior into natural language and maps human class-level specifications to image-level critiques, enabling effective interaction between domain experts and the model. Our method demonstrates substantial improvement in aligning model behavior with human specifications, as validated on both synthetic and real-world datasets. We show that it effectively reduces the model’s dependence on spurious features and on group-specific biases, without requiring fine-grained feedback.
[55] Look Closer! An Adversarial Parametric Editing Framework for Hallucination Mitigation in VLMs cs.CV | cs.LGPDF
Jiayu Hu, Beibei Li, Jiangwei Xia, Yanjun Qin, Bing Ji
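TL;DR: This paper proposes ALEAHallu, an adversarial parametric editing framework for mitigating hallucination in Vision-Language Models (VLMs). It follows an activate-locate-edit adversarial paradigm: first, it constructs an activation dataset of responses grounded in visual features (positive samples) and hallucinatory responses that reflect language-model prior bias (negative samples); next, it identifies critical hallucination-prone parameter clusters by analyzing the hidden-state differences of response pairs; finally, it fine-tunes these clusters using prompts injected with adversarially tuned prefixes, forcing the model to prioritize visual evidence over inherent parametric biases.
Details
Motivation: VLMs attract wide attention for their practical applications but suffer from persistent hallucination, generating outputs inconsistent with the visual input. Existing studies attribute hallucination to over-reliance on linguistic priors and insufficient visual feature integration and propose heuristic decoding calibration strategies, but these strategies are not trainable and have limited optimization potential.
Result: Evaluations on both generative and discriminative VLM tasks show that ALEAHallu is significantly effective at alleviating hallucinations.
Insight: The novelty is a trainable adversarial parametric editing framework that identifies and fine-tunes hallucination-prone parameter clusters, optimizing directly against the model's internal biases rather than relying only on heuristic calibration at decoding time, and thereby forcing the model to integrate visual evidence more effectively.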
Abstract: While Vision-Language Models (VLMs) have garnered increasing attention in the AI community due to their promising practical applications, they exhibit persistent hallucination issues, generating outputs misaligned with visual inputs. Recent studies attribute these hallucinations to VLMs’ over-reliance on linguistic priors and insufficient visual feature integration, proposing heuristic decoding calibration strategies to mitigate them. However, the non-trainable nature of these strategies inherently limits their optimization potential. To this end, we propose an adversarial parametric editing framework for Hallucination mitigation in VLMs, which follows an Activate-Locate-Edit Adversarially (ALEA) paradigm. Specifically, we first construct an activation dataset that comprises grounded responses (positive samples attentively anchored in visual features) and hallucinatory responses (negative samples reflecting LLM prior bias and internal knowledge artifacts). Next, we identify critical hallucination-prone parameter clusters by analyzing differential hidden states of response pairs. Then, these clusters are fine-tuned using prompts injected with adversarially tuned prefixes that are optimized to maximize visual neglect, thereby forcing the model to prioritize visual evidence over inherent parametric biases. Evaluations on both generative and discriminative VLM tasks demonstrate the significant effectiveness of ALEAHallu in alleviating hallucinations. Our code is available at https://github.com/hujiayu1223/ALEAHallu.
[56] iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception cs.CVPDF
Sarthak Mehrotra, Sairam V C Rebbapragada, Mani Hemanth Reddy Bonthu, Vineeth N Balasubramanian
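TL;DR: This paper proposes iSHIFT, a lightweight multimodal large language model GUI agent that integrates implicit chain-of-thought with a perception control module. It switches adaptively between a slow mode (high-precision visual grounding) and a fast mode (efficient global cues) to address the difficulty of achieving both efficiency and precision in GUI interaction.
Details
Motivation: Existing MLLM GUI agents face a tension between efficiency on high-level tasks and precision on fine-grained interactions. They struggle to stay accurate when a task requires identifying specific interface elements, and the models are typically large and cannot adapt their reasoning depth to the task.
Result: Despite its compact 2.5B size, iSHIFT matches state-of-the-art (SOTA) performance on multiple benchmark datasets.
Insight: The novelty is an implicit slow-fast hybrid inference mechanism combined with a perception control module. Special perception tokens dynamically guide attention to relevant screen regions, letting the model decide both how to reason and where to focus, which yields efficient, adaptive perception and interaction in a lightweight model.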
Abstract: Multimodal Large Language Models (MLLMs) show strong potential for interpreting and interacting with complex, pixel-rich Graphical User Interface (GUI) environments. However, building agents that are both efficient for high-level tasks and precise for fine-grained interactions remains challenging. GUI agents must perform routine actions efficiently while also handling tasks that demand exact visual grounding, yet existing approaches struggle when accuracy depends on identifying specific interface elements. These MLLMs also remain large and cannot adapt their reasoning depth to the task at hand. In this work, we introduce iSHIFT: Implicit Slow-fast Hybrid Inference with Flexible Tokens, a lightweight agent that integrates latent thinking (implicit chain-of-thought) with a perception control module. iSHIFT enables an MLLM to switch between a slow mode, which leverages detailed visual grounding for high precision and a fast mode that uses global cues for efficiency. Special perception tokens guide attention to relevant screen regions, allowing the model to decide both how to reason and where to focus. Despite its compact 2.5B size, iSHIFT matches state-of-the-art performance on multiple benchmark datasets.
[57] LongFly: Long-Horizon UAV Vision-and-Language Navigation with Spatiotemporal Context Integration cs.CV | cs.AIPDF
Wen Jiang, Li Wang, Kangyao Huang, Wei Fan, Jinyuan Liu
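TL;DR: This paper proposes LongFly, a spatiotemporal context modeling framework for long-horizon UAV vision-and-language navigation. Through a history-aware spatiotemporal modeling strategy, it transforms fragmented and redundant historical data into structured, compact, and expressive representations, addressing semantic alignment and path planning for long-horizon navigation in complex environments.
Details
Motivation: Current UAV vision-and-language navigation methods struggle to model long-horizon spatiotemporal context in complex environments, which leads to inaccurate semantic alignment and unstable path planning, particularly in scenarios such as post-disaster search and rescue with high information density, rapid viewpoint changes, and dynamic structures.
Result: Experiments show that LongFly outperforms state-of-the-art UAV VLN baselines by 7.89% in success rate and 6.33% in success weighted by path length, consistently across seen and unseen environments.
Insight: The contributions are: (1) a slot-based historical image compression module that dynamically distills multi-view historical observations into fixed-length contextual representations; (2) a spatiotemporal trajectory encoding module that captures the temporal dynamics and spatial structure of UAV trajectories; (3) a prompt-guided multimodal integration module that combines existing spatiotemporal context with current observations to support time-based reasoning and robust waypoint prediction. Together these modules form a systematic framework for long-horizon context modeling.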
Abstract: Unmanned aerial vehicles (UAVs) are crucial tools for post-disaster search and rescue, facing challenges such as high information density, rapid changes in viewpoint, and dynamic structures, especially in long-horizon navigation. However, current UAV vision-and-language navigation (VLN) methods struggle to model long-horizon spatiotemporal context in complex environments, resulting in inaccurate semantic alignment and unstable path planning. To this end, we propose LongFly, a spatiotemporal context modeling framework for long-horizon UAV VLN. LongFly proposes a history-aware spatiotemporal modeling strategy that transforms fragmented and redundant historical data into structured, compact, and expressive representations. First, we propose the slot-based historical image compression module, which dynamically distills multi-view historical observations into fixed-length contextual representations. Then, the spatiotemporal trajectory encoding module is introduced to capture the temporal dynamics and spatial structure of UAV trajectories. Finally, to integrate existing spatiotemporal context with current observations, we design the prompt-guided multimodal integration module to support time-based reasoning and robust waypoint prediction. Experimental results demonstrate that LongFly outperforms state-of-the-art UAV VLN baselines by 7.89% in success rate and 6.33% in success weighted by path length, consistently across both seen and unseen environments.
[58] Patch-Discontinuity Mining for Generalized Deepfake Detection cs.CVPDF
Huanhuan Yuan, Yang Ping, Zhengqin Xu, Junyi Cao, Shuai Jia
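TL;DR: This paper proposes GenDF, a simple yet effective deepfake detection framework that transfers a powerful large-scale vision model to the deepfake detection task. Using deepfake-specific representation learning, feature space redistribution, and a classification-invariant feature augmentation strategy, it achieves state-of-the-art generalization in cross-domain and cross-manipulation settings with only 0.28M trainable parameters.
Details
Motivation: Rapid advances in generative AI make it possible to create highly realistic fake facial images, which seriously threatens personal privacy and the integrity of online information. Existing deepfake detection methods often rely on handcrafted forensic cues and complex architectures. They perform well in intra-domain settings but degrade significantly on unseen forgery patterns.
Result: Extensive experiments show that GenDF achieves state-of-the-art generalization performance in cross-domain and cross-manipulation settings while requiring only 0.28M trainable parameters, validating the effectiveness and efficiency of the framework.
Insight: The novelty is transferring a large-scale vision model to deepfake detection and combining deepfake-specific representation learning, feature space redistribution, and a classification-invariant feature augmentation strategy that adds no trainable parameters, achieving strong generalization with a simple, compact network design.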
Abstract: The rapid advancement of generative artificial intelligence has enabled the creation of highly realistic fake facial images, posing serious threats to personal privacy and the integrity of online information. Existing deepfake detection methods often rely on handcrafted forensic cues and complex architectures, achieving strong performance in intra-domain settings but suffering significant degradation when confronted with unseen forgery patterns. In this paper, we propose GenDF, a simple yet effective framework that transfers a powerful large-scale vision model to the deepfake detection task with a compact and neat network design. GenDF incorporates deepfake-specific representation learning to capture discriminative patterns between real and fake facial images, feature space redistribution to mitigate distribution mismatch, and a classification-invariant feature augmentation strategy to enhance generalization without introducing additional trainable parameters. Extensive experiments demonstrate that GenDF achieves state-of-the-art generalization performance in cross-domain and cross-manipulation settings while requiring only 0.28M trainable parameters, validating the effectiveness and efficiency of the proposed framework.
[59] Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models cs.CV | cs.CRPDF
Zongmin Zhang, Zhen Sun, Yifan Liao, Wenhan Dong, Xinlei He
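TL;DR: This paper proposes BadVSFM, the first backdoor attack framework targeting prompt-driven Video Segmentation Foundation Models (VSFMs). It finds that classic backdoor attacks are almost ineffective against VSFMs (attack success rate below 5%). BadVSFM uses a two-stage strategy: first steer the image encoder so that triggered frames map to a target embedding, then train the mask decoder so that triggered frame-prompt pairs produce a shared target mask. Experiments show strong, controllable backdoor effects under diverse triggers and prompts while preserving clean segmentation quality.
Details
Motivation: As prompt-driven video segmentation foundation models such as SAM2 are deployed in critical areas like autonomous driving and digital pathology, backdoor threats become a growing concern. Classic backdoor attacks are almost ineffective against these models, so a dedicated backdoor attack framework is needed.
Result: Experiments on two datasets and five VSFMs show that BadVSFM achieves a high attack success rate (ASR) and controllable backdoor effects under diverse triggers and prompt types while keeping segmentation quality (e.g., mIoU) comparable to clean models. Gradient-conflict analysis and attention visualizations confirm its effectiveness, and four representative defenses fail to detect or mitigate the attack.
Insight: The work shows that the inherent robustness of VSFMs to classic backdoor attacks stems from aligned encoder gradients and attention that stays focused on the true object, and designs a two-stage attack accordingly. By separating triggered and clean representations and shifting attention to trigger regions, it implants a targeted backdoor in prompt-driven multimodal models. From a defense perspective, this exposes an underexplored vulnerability in current VSFMs and points to new directions for security research.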
Abstract: Prompt-driven Video Segmentation Foundation Models (VSFMs) such as SAM2 are increasingly deployed in applications like autonomous driving and digital pathology, raising concerns about backdoor threats. Surprisingly, we find that directly transferring classic backdoor attacks (e.g., BadNet) to VSFMs is almost ineffective, with ASR below 5%. To understand this, we study encoder gradients and attention maps and observe that conventional training keeps gradients for clean and triggered samples largely aligned, while attention still focuses on the true object, preventing the encoder from learning a distinct trigger-related representation. To address this challenge, we propose BadVSFM, the first backdoor framework tailored to prompt-driven VSFMs. BadVSFM uses a two-stage strategy: (1) steer the image encoder so triggered frames map to a designated target embedding while clean frames remain aligned with a clean reference encoder; (2) train the mask decoder so that, across prompt types, triggered frame-prompt pairs produce a shared target mask, while clean outputs stay close to a reference decoder. Extensive experiments on two datasets and five VSFMs show that BadVSFM achieves strong, controllable backdoor effects under diverse triggers and prompts while preserving clean segmentation quality. Ablations over losses, stages, targets, trigger settings, and poisoning rates demonstrate robustness to reasonable hyperparameter changes and confirm the necessity of the two-stage design. Finally, gradient-conflict analysis and attention visualizations show that BadVSFM separates triggered and clean representations and shifts attention to trigger regions, while four representative defenses remain largely ineffective, revealing an underexplored vulnerability in current VSFMs.
[60] MAI-UI Technical Report: Real-World Centric Foundation GUI Agents cs.CVPDF
Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen
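TL;DR: MAI-UI is a family of foundation GUI agents with variants ranging from 2B to 235B-A22B. It targets the key challenges of deploying GUI agents in the real world and, through a self-evolving data pipeline, a native device-cloud collaboration system, and an online reinforcement learning framework, sets new state-of-the-art results on GUI grounding and mobile navigation.
Details
Motivation: The paper aims to address four core challenges to realistic deployment of GUI agents: the lack of native agent-user interaction, the limits of UI-only operation, the absence of a practical deployment architecture, and brittleness in dynamic environments, in order to advance the next generation of human-computer interaction.
Result: On GUI grounding benchmarks, MAI-UI reaches 73.5% on ScreenSpot-Pro, 91.3% on MMBench GUI L2, 70.9% on OSWorld-G, and 49.2% on UI-Vision, surpassing Gemini-3-Pro and Seed1.8 on ScreenSpot-Pro. On mobile GUI navigation, it sets a new SOTA of 76.7% on AndroidWorld, surpassing UI-Tars-2, Gemini-2.5-Pro, and Seed1.8. On MobileWorld it obtains a 41.7% success rate, significantly outperforming end-to-end GUI models and remaining competitive with Gemini-3-Pro based agentic frameworks. Online RL experiments show a gain of +5.2 points from scaling parallel environments from 32 to 512 and +4.3 points from increasing the environment step budget from 15 to 50. The native device-cloud collaboration system improves on-device performance by 33%, reduces cloud model calls by over 40%, and preserves user privacy.
Insight: The claimed contributions are: (1) a unified methodology with a self-evolving data pipeline that expands navigation data to include user interaction and MCP tool calls; (2) a native device-cloud collaboration system that routes execution by task state; (3) an online RL framework with advanced optimizations, such as scaling parallel environments and context length. Viewed objectively, the core contribution is the tight integration of data generation, system architecture, and online learning into a scalable, efficient foundation framework for GUI agents aimed at real deployment.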
Abstract: The development of GUI agents could revolutionize the next generation of human-computer interaction. Motivated by this vision, we present MAI-UI, a family of foundation GUI agents spanning the full spectrum of sizes, including 2B, 8B, 32B, and 235B-A22B variants. We identify four key challenges to realistic deployment: the lack of native agent-user interaction, the limits of UI-only operation, the absence of a practical deployment architecture, and brittleness in dynamic environments. MAI-UI addresses these issues with a unified methodology: a self-evolving data pipeline that expands the navigation data to include user interaction and MCP tool calls, a native device-cloud collaboration system that routes execution by task state, and an online RL framework with advanced optimizations to scale parallel environments and context length. MAI-UI establishes new state-of-the-art across GUI grounding and mobile navigation. On grounding benchmarks, it reaches 73.5% on ScreenSpot-Pro, 91.3% on MMBench GUI L2, 70.9% on OSWorld-G, and 49.2% on UI-Vision, surpassing Gemini-3-Pro and Seed1.8 on ScreenSpot-Pro. On mobile GUI navigation, it sets a new SOTA of 76.7% on AndroidWorld, surpassing UI-Tars-2, Gemini-2.5-Pro and Seed1.8. On MobileWorld, MAI-UI obtains 41.7% success rate, significantly outperforming end-to-end GUI models and competitive with Gemini-3-Pro based agentic frameworks. Our online RL experiments show significant gains from scaling parallel environments from 32 to 512 (+5.2 points) and increasing environment step budget from 15 to 50 (+4.3 points). Finally, the native device-cloud collaboration system improves on-device performance by 33%, reduces cloud model calls by over 40%, and preserves user privacy.
[61] StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars cs.CV | cs.AI | cs.HCPDF
Zhiyao Sun, Ziqiao Peng, Yifeng Ma, Yi Chen, Zhengguang Zhou
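TL;DR: This paper proposes StreamAvatar, a streaming diffusion model framework for real-time interactive human avatars. Through two-stage autoregressive adaptation and acceleration, it turns a high-fidelity human video diffusion model into a real-time streaming system that generates natural talking and listening behaviors with coherent gestures.
Details
Motivation: Existing diffusion-based avatar generation methods cannot be used for real-time streaming interaction because of their non-causal architecture and high computational cost, and existing interactive methods are typically limited to the head-and-shoulder region and cannot generate body poses and motions.
Result: Extensive experiments show that the method surpasses existing approaches in generation quality, real-time efficiency, and interaction naturalness, achieving state-of-the-art performance.
Insight: The contributions are: (1) a two-stage autoregressive adaptation and acceleration framework (autoregressive distillation and adversarial refinement); (2) three key components for long-term stability and consistency (a Reference Sink, a Reference-Anchored Positional Re-encoding strategy, and a Consistency-Aware Discriminator); (3) a one-shot interactive human avatar model capable of generating coherent gestures.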
Abstract: Real-time, streaming interactive avatars represent a critical yet challenging goal in digital human research. Although diffusion-based human avatar generation methods achieve remarkable success, their non-causal architecture and high computational costs make them unsuitable for streaming. Moreover, existing interactive approaches are typically limited to head-and-shoulder region, limiting their ability to produce gestures and body motions. To address these challenges, we propose a two-stage autoregressive adaptation and acceleration framework that applies autoregressive distillation and adversarial refinement to adapt a high-fidelity human video diffusion model for real-time, interactive streaming. To ensure long-term stability and consistency, we introduce three key components: a Reference Sink, a Reference-Anchored Positional Re-encoding (RAPR) strategy, and a Consistency-Aware Discriminator. Building on this framework, we develop a one-shot, interactive, human avatar model capable of generating both natural talking and listening behaviors with coherent gestures. Extensive experiments demonstrate that our method achieves state-of-the-art performance, surpassing existing approaches in generation quality, real-time efficiency, and interaction naturalness. Project page: https://streamavatar.github.io .
[62] Yume-1.5: A Text-Controlled Interactive World Generation Model cs.CVPDF
Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying
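TL;DR: Yume-1.5 is a text-controlled interactive world generation model. By integrating a long-video generation framework, a real-time streaming acceleration strategy, and a text-controlled event generation method, it aims to generate realistic, interactive, and continuous worlds from a single image or text prompt and supports keyboard-based exploration.
Details
Motivation: Existing diffusion models for generating interactive, explorable worlds face critical challenges such as excessively large parameter sizes, lengthy inference steps, and rapidly growing historical context. These severely limit real-time performance, and the models lack text-controlled generation capabilities.
Result: The abstract provides no specific quantitative results or benchmark comparisons.
Insight: The novelty is the combination of a long-video generation framework integrating unified context compression with linear attention, a real-time streaming acceleration strategy based on bidirectional attention distillation and an enhanced text embedding scheme, and a text-controlled method for generating world events, which together enable efficient, controllable interactive world generation.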
Abstract: Recent approaches have demonstrated the promise of using diffusion models to generate interactive and explorable worlds. However, most of these methods face critical challenges such as excessively large parameter sizes, reliance on lengthy inference steps, and rapidly growing historical context, which severely limit real-time performance and lack text-controlled generation capabilities. To address these challenges, we propose Yume-1.5, a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt. Yume-1.5 achieves this through a carefully designed framework that supports keyboard-based exploration of the generated worlds. The framework comprises three core components: (1) a long-video generation framework integrating unified context compression with linear attention; (2) a real-time streaming acceleration strategy powered by bidirectional attention distillation and an enhanced text embedding scheme; (3) a text-controlled method for generating world events. We have provided the codebase in the supplementary material.
[63] Learning Association via Track-Detection Matching for Multi-Object Tracking cs.CVPDF
Momir Adžemović
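TL;DR: This paper proposes Track-Detection Link Prediction (TDLP), a multi-object tracking method in the tracking-by-detection paradigm. TDLP performs per-frame association by learning link prediction between tracks and detections, relying mainly on geometric features such as bounding boxes and optionally incorporating cues such as pose and appearance. Unlike heuristic-based methods, TDLP learns association directly from data while remaining modular and computationally efficient.
Details
Motivation: The work addresses the association problem in multi-object tracking. Existing tracking-by-detection methods rely on handcrafted association heuristics, while end-to-end methods learn from data but at high computational complexity. The goal is a modular method that learns association from data and stays computationally efficient.
Result: Extensive experiments on multiple benchmarks show that TDLP consistently surpasses state-of-the-art performance across both tracking-by-detection and end-to-end methods.
Insight: The core novelty is formulating per-frame association as a link prediction problem between tracks and detections. The abstract claims that the method learns association directly from data without handcrafted rules and is modular and efficient. Viewed objectively, its comparison of link prediction with metric learning-based association, showing that link prediction is more effective for heterogeneous features such as bounding boxes, is a valuable finding.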
Abstract: Multi-object tracking aims to maintain object identities over time by associating detections across video frames. Two dominant paradigms exist in literature: tracking-by-detection methods, which are computationally efficient but rely on handcrafted association heuristics, and end-to-end approaches, which learn association from data at the cost of higher computational complexity. We propose Track-Detection Link Prediction (TDLP), a tracking-by-detection method that performs per-frame association via link prediction between tracks and detections, i.e., by predicting the correct continuation of each track at every frame. TDLP is architecturally designed primarily for geometric features such as bounding boxes, while optionally incorporating additional cues, including pose and appearance. Unlike heuristic-based methods, TDLP learns association directly from data without handcrafted rules, while remaining modular and computationally efficient compared to end-to-end trackers. Extensive experiments on multiple benchmarks demonstrate that TDLP consistently surpasses state-of-the-art performance across both tracking-by-detection and end-to-end methods. Finally, we provide a detailed analysis comparing link prediction with metric learning-based association and show that link prediction is more effective, particularly when handling heterogeneous features such as detection bounding boxes. Our code is available at https://github.com/Robotmurlock/TDLP.
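Per-frame association from predicted link scores can be sketched as follows. The abstract does not say how TDLP turns link scores into a one-to-one matching, so the greedy matching below, the `min_score` cutoff, and the function name are assumptions standing in for whatever the released code does. The score matrix is assumed to come from a learned link predictor.

```python
def associate(link_scores, min_score=0.5):
    """Greedily match tracks to detections from a link-score matrix.

    link_scores[t][d] is the (assumed) predicted probability that
    detection d is the correct continuation of track t.
    Returns (matches, unmatched_detections), where matches maps
    track index -> detection index.
    """
    candidates = [
        (score, t, d)
        for t, row in enumerate(link_scores)
        for d, score in enumerate(row)
        if score >= min_score
    ]
    # Highest-confidence links are committed first.
    candidates.sort(key=lambda c: c[0], reverse=True)

    matches, used_detections = {}, set()
    for _, t, d in candidates:
        if t not in matches and d not in used_detections:
            matches[t] = d
            used_detections.add(d)

    n_detections = len(link_scores[0]) if link_scores else 0
    unmatched = [d for d in range(n_detections) if d not in used_detections]
    return matches, unmatched


if __name__ == "__main__":
    scores = [[0.9, 0.2, 0.1],
              [0.8, 0.7, 0.1]]
    print(associate(scores))
```

Unmatched detections would typically seed new tracks, and tracks without a match would be kept alive or terminated by the tracker's lifecycle logic, which is outside this sketch.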
[64] ProEdit: Inversion-based Editing From Prompts Done Right cs.CV
Zhi Ouyang, Dian Zheng, Xiao-Ming Wu, Jian-Jian Jiang, Kun-Yu Lin
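TL;DR: This paper proposes ProEdit, an inversion-based visual editing method that targets a weakness of existing approaches: over-reliance on source image information during editing, which degrades the edit. Through two modules, KV-mix at the attention level and Latents-Shift at the latent level, it reduces the influence of source information on the edited region while preserving background consistency, so that attribute changes follow the user's instructions more accurately.
Details
Motivation: Existing inversion-based visual editing methods inject source image information during sampling to maintain consistency, but over-reliance on that information hinders edits to subject attributes in the target image (such as pose, number, or color), causing edits to fail.
Result: Extensive experiments on several image and video editing benchmarks show that the method achieves state-of-the-art performance.
Insight: The main contribution is the pair of modules KV-mix and Latents-Shift, which decouple the influence of source information on the edited region at the attention and latent-representation levels respectively. The design is plug-and-play and can be integrated seamlessly into existing inversion and editing methods, improving the flexibility and accuracy of editing.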
Abstract: Inversion-based visual editing provides an effective and training-free way to edit an image or a video based on user instructions. Existing methods typically inject source image information during the sampling process to maintain editing consistency. However, this sampling strategy overly relies on source information, which negatively affects the edits in the target image (e.g., failing to change the subject’s attributes like pose, number, or color as instructed). In this work, we propose ProEdit to address this issue in both the attention and the latent aspects. In the attention aspect, we introduce KV-mix, which mixes KV features of the source and the target in the edited region, mitigating the influence of the source image on the editing region while maintaining background consistency. In the latent aspect, we propose Latents-Shift, which perturbs the edited region of the source latent, eliminating the influence of the inverted latent on the sampling. Extensive experiments on several image and video editing benchmarks demonstrate that our method achieves SOTA performance. In addition, our design is plug-and-play, which can be seamlessly integrated into existing inversion and editing methods, such as RF-Solver, FireFlow and UniEdit.
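The abstract describes KV-mix only in words, so the following is a rough sketch of one way such mixing could look, not the authors' formulation. It assumes per-token key (or value) vectors, a boolean edit mask, a linear blend inside the edited region, and unchanged source features outside it. The name `kv_mix` and the parameter `mix_weight` are hypothetical.

```python
from typing import List


def kv_mix(
    source_kv: List[List[float]],
    target_kv: List[List[float]],
    edit_mask: List[bool],
    mix_weight: float = 0.5,
) -> List[List[float]]:
    """Blend source and target features per token.

    Inside the edited region the source contribution is scaled by
    mix_weight (assumed linear blend); outside it the source features
    are kept unchanged to preserve the background.
    """
    mixed = []
    for src_tok, tgt_tok, in_edit in zip(source_kv, target_kv, edit_mask):
        if in_edit:
            mixed.append(
                [mix_weight * s + (1.0 - mix_weight) * t for s, t in zip(src_tok, tgt_tok)]
            )
        else:
            mixed.append(list(src_tok))
    return mixed


# Token 0 is background, token 1 lies in the edited region.
mixed = kv_mix(
    source_kv=[[1.0, 1.0], [2.0, 2.0]],
    target_kv=[[3.0, 3.0], [4.0, 4.0]],
    edit_mask=[False, True],
    mix_weight=0.25,
)
```

A smaller `mix_weight` lets the target dominate in the edited region, which matches the stated goal of reducing the source image's influence there.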
[65] See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning cs.CV
Shuoshuo Zhang, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian
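TL;DR: This paper proposes Bi-directional Perceptual Shaping (BiPS), a method for strengthening the multimodal reasoning of large vision-language models (VLMs). BiPS turns question-conditioned masked views into bidirectional "where-to-look" signals that shape the model's perception during training. It uses two constraints: a KL-consistency constraint that encourages coarse but complete coverage of supporting pixels, and a KL-separation constraint that discourages text-only shortcuts and enforces reliance on fine-grained visual evidence. Across eight benchmarks, BiPS improves Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization.
Details
Motivation: Existing large vision-language models typically rely on intermediate visual cues when reasoning (injected via external tools or generated as latent visual tokens), but these mechanisms overlook fine-grained visual evidence (such as polylines in charts), generalize poorly across domains, and incur high inference cost. BiPS aims to address these issues by pushing the model to use visual information more effectively during reasoning.
Result: Across eight benchmarks, BiPS improves the average performance of Qwen2.5-VL-7B by 8.2% and shows strong out-of-domain generalization to unseen datasets and image types.
Insight: The contribution is the BiPS framework, which uses the KL-consistency and KL-separation constraints during training to promote complete coverage of visual evidence while discouraging text-only shortcuts, thereby strengthening the model's reliance on fine-grained visual information. Viewed objectively, it offers a novel training mechanism for improving the reasoning performance and generalization of multimodal models, and is especially relevant to scenarios that require fine-grained visual understanding.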
Abstract: Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.
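The two constraints can be illustrated with a toy loss over answer distributions. This is a sketch under stated assumptions: the abstract does not give the KL direction, the weighting, or how the two terms are combined, so the function `bips_style_loss`, the parameter `separation_weight`, and the simple "consistency minus weighted separation" form are all hypothetical.

```python
import math
from typing import List


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same answers."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def bips_style_loss(
    p_original: List[float],
    p_evidence_preserved: List[float],
    p_evidence_ablated: List[float],
    separation_weight: float = 1.0,
) -> float:
    """Pull the preserved view toward the original, push the ablated view away."""
    consistency = kl_divergence(p_original, p_evidence_preserved)
    separation = kl_divergence(p_original, p_evidence_ablated)
    # Minimizing this lowers the consistency KL and raises the separation KL.
    return consistency - separation_weight * separation


# Toy answer distributions over three candidate answers.
p_orig = [0.7, 0.2, 0.1]
p_keep = [0.6, 0.3, 0.1]   # evidence-preserving view: should stay close
p_mask = [0.1, 0.2, 0.7]   # evidence-ablated view: should move away
loss = bips_style_loss(p_orig, p_keep, p_mask)
```

Here the preserved view already agrees with the original and the ablated view already disagrees, so the loss is negative. A model that answers from text alone would give similar distributions for all three views and would be penalized through a small separation term.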
cs.LG
[66] MotionTeller: Multi-modal Integration of Wearable Time-Series with LLMs for Health and Behavioral Understanding cs.LG | cs.AI | cs.CL | cs.HC
Aiwei Zhang, Arvind Pillai, Andrew Campbell, Nicholas C. Jacobson
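TL;DR: MotionTeller is a generative framework that natively integrates minute-level activity data from wearable devices (such as accelerometer data) with large language models (LLMs) to automatically generate natural-language summaries of daily behavior.
Details
Motivation: To address the key challenge of generating natural-language summaries from raw physiological signals such as actigraphy, in support of behavioral monitoring and personalized health interventions.
Result: On a new dataset built from real-world NHANES recordings, MotionTeller achieves high semantic fidelity (BERTScore-F1 = 0.924) and lexical accuracy (ROUGE-1 = 0.722), outperforming prompt-based baselines by 7 percent in ROUGE-1; the average training loss converges to 0.38 by epoch 15.
Insight: The contribution is the combination of a pretrained actigraphy encoder with a lightweight projection module that maps behavioral embeddings into the token space of a frozen decoder-only LLM, enabling fluent, human-centered descriptions of wearable sensor data and providing a scalable, interpretable system for behavioral understanding.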
Abstract: As wearable sensing becomes increasingly pervasive, a key challenge remains: how can we generate natural language summaries from raw physiological signals such as actigraphy - minute-level movement data collected via accelerometers? In this work, we introduce MotionTeller, a generative framework that natively integrates minute-level wearable activity data with large language models (LLMs). MotionTeller combines a pretrained actigraphy encoder with a lightweight projection module that maps behavioral embeddings into the token space of a frozen decoder-only LLM, enabling free-text, autoregressive generation of daily behavioral summaries. We construct a novel dataset of 54383 (actigraphy, text) pairs derived from real-world NHANES recordings, and train the model using cross-entropy loss with supervision only on the language tokens. MotionTeller achieves high semantic fidelity (BERTScore-F1 = 0.924) and lexical accuracy (ROUGE-1 = 0.722), outperforming prompt-based baselines by 7 percent in ROUGE-1. The average training loss converges to 0.38 by epoch 15, indicating stable optimization. Qualitative analysis confirms that MotionTeller captures circadian structure and behavioral transitions, while PCA plots reveal enhanced cluster alignment in embedding space post-training. Together, these results position MotionTeller as a scalable, interpretable system for transforming wearable sensor data into fluent, human-centered descriptions, introducing new pathways for behavioral monitoring, clinical review, and personalized health interventions.
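"Cross-entropy loss with supervision only on the language tokens" can be sketched as a masked loss over a sequence whose first positions hold projected sensor embeddings and whose remaining positions hold text tokens. The layout, the function name `masked_cross_entropy`, and the `is_text` mask are assumptions for illustration; the abstract does not specify the implementation.

```python
import math
from typing import List


def masked_cross_entropy(
    logits: List[List[float]],
    targets: List[int],
    is_text: List[bool],
) -> float:
    """Average cross-entropy over positions flagged as language tokens."""
    total, count = 0.0, 0
    for row, target, supervise in zip(logits, targets, is_text):
        if not supervise:
            # Positions holding projected sensor embeddings carry no label.
            continue
        # Numerically stable log-sum-exp for the softmax normalizer.
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else 0.0


# Position 0: sensor prefix (ignored). Positions 1-2: text tokens.
logits = [[5.0, -5.0], [0.0, 0.0], [0.0, 0.0]]
targets = [0, 1, 0]
is_text = [False, True, True]
loss = masked_cross_entropy(logits, targets, is_text)
```

With uniform logits over a two-token vocabulary at the supervised positions, the loss equals ln 2, and changing the logits at the sensor position leaves it unchanged.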
eess.IV
[67] A Graph-Augmented knowledge Distillation based Dual-Stream Vision Transformer with Region-Aware Attention for Gastrointestinal Disease Classification with Explainable AI eess.IV | cs.CV
Md Assaduzzaman, Nushrat Jahan Oyshi, Eram Mahamud
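TL;DR: This study proposes a hybrid dual-stream deep learning framework based on teacher-student knowledge distillation for accurate classification of gastrointestinal diseases. A high-capacity teacher model, which combines the global contextual reasoning of a Swin Transformer with the local fine-grained feature extraction of a Vision Transformer, distills its knowledge into a compact Tiny-ViT student network, improving efficiency while preserving diagnostic accuracy. The framework achieves near-perfect classification performance on two Wireless Capsule Endoscopy datasets, and interpretability analyses support its clinical reliability.
Details
Motivation: To address the challenge of accurately classifying gastrointestinal diseases from endoscopic and histopathological images, which stems mainly from the large data volume and subtle visual differences between classes.
Result: On two carefully curated Wireless Capsule Endoscopy datasets, the framework achieves accuracies of 0.9978 and 0.9928 on Dataset 1 and Dataset 2 respectively, with an average AUC of 1.0000, indicating near-perfect discriminative capability. The Tiny-ViT student network maintains diagnostic performance comparable to the teacher while reducing computational complexity and delivering faster inference.
Insight: The contribution is a teacher model that combines global and local feature extraction, whose capability is transferred to a lightweight student network through knowledge distillation, balancing efficiency and accuracy. In addition, several interpretability methods (Grad-CAM, LIME, and Score-CAM) are used to systematically check the clinical relevance of the model's decisions, increasing the framework's transparency and trustworthiness in medical applications.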
Abstract: The accurate classification of gastrointestinal diseases from endoscopic and histopathological imagery remains a significant challenge in medical diagnostics, mainly due to the vast data volume and subtle variation in inter-class visuals. This study presents a hybrid dual-stream deep learning framework built on teacher-student knowledge distillation, where a high-capacity teacher model integrates the global contextual reasoning of a Swin Transformer with the local fine-grained feature extraction of a Vision Transformer. The student network was implemented as a compact Tiny-ViT structure that inherits the teacher’s semantic and morphological knowledge via soft-label distillation, achieving a balance between efficiency and diagnostic accuracy. Two carefully curated Wireless Capsule Endoscopy datasets, encompassing major GI disease classes, were employed to ensure balanced representation and prevent inter-sample bias. The proposed framework achieved remarkable performance with accuracies of 0.9978 and 0.9928 on Dataset 1 and Dataset 2 respectively, and an average AUC of 1.0000, signifying near-perfect discriminative capability. Interpretability analyses using Grad-CAM, LIME, and Score-CAM confirmed that the model’s predictions were grounded in clinically significant tissue regions and pathologically relevant morphological cues, validating the framework’s transparency and reliability. The Tiny-ViT demonstrated diagnostic performance comparable to its transformer-based teacher with reduced computational complexity and faster inference, making it suitable for resource-constrained clinical environments. Overall, the proposed framework provides a robust, interpretable, and scalable solution for AI-assisted GI disease diagnosis, paving the way toward future intelligent endoscopic screening that is compatible with clinical practicality.
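Soft-label distillation is commonly implemented as a temperature-scaled KL divergence between teacher and student class distributions. The sketch below shows that standard form. Whether the paper uses exactly this loss, and the temperature value of 2.0, are assumptions; the abstract gives neither.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, computed stably."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0
    )
    # The T^2 factor is the usual correction that keeps gradient magnitudes
    # comparable across temperatures.
    return temperature ** 2 * kl


# A student that matches the teacher incurs zero loss; a mismatch is penalized.
matched = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mismatched = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

In practice this term is usually combined with a standard cross-entropy loss on the ground-truth labels, with a weighting factor between the two.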