cs.CL

[1] Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning cs.CL

Shan Yang

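TL;DR: This paper audits the multimodal-physics evaluation pipeline end to end and identifies three previously undetected construction practices (train-eval contamination, translation drift, and MCQ saturation) that distort how the field measures vision-language reasoning. It releases four artifacts: three datasets (PhysCorp-A, PhysR1Corp, PhysOlym-A) and Physics-R1, a reference training recipe cold-started from Qwen3-VL-8B-Thinking that substantially improves performance on several physics reasoning benchmarks.

Details

Motivation: To expose and address systematic biases in current multimodal physics reasoning evaluation. These biases stem from flaws in dataset construction and distort measurements of model capability.

Result: Physics-R1 improves over the 8B base model by +18.3 pp on PhysOlym-A (open-ended), +15.7 pp on PhysReason (ahead of Qwen3-VL-32B and Gemini 2.5 Pro), +6.9 pp on OlympiadBench-Physics, and +4.1 pp on PhyX MCQ.

Insight: The contribution is a systematic audit that quantifies the key sources of bias in the evaluation pipeline, together with a rigorously audited, novel-source evaluation set (PhysOlym-A) and a training recipe (Physics-R1) that address these gaps, giving the field a more reliable benchmark and method for evaluating visual physics reasoning.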

Abstract: We audit the multimodal-physics evaluation pipeline end-to-end and document three undetected construction practices that distort how the field measures vision-language reasoning: train-eval contamination, translation drift, and MCQ saturation. (1) Public training pools (UGPhysics-Train, SciInstruct, MMK12) pass single-stage 5-gram-Jaccard audits with zero hits across all six public physics evals; a three-stage audit (Jaccard -> mxbai-embed-large cosine -> Haiku-4.5 LLM-judge) surfaces 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct alone. (2) A 17-pp Sonnet 4.5 delta on 59 paired Estonian-English olympiad problems (30.5% vs. 13.6%; sign test p=0.011, McNemar p=0.021, paired bootstrap 95% CI [+5.1, +28.9] pp). (3) A 46-pp format-and-novelty gradient on identical Sonnet weights between MCQ (79.7% on PhyX) and open-ended olympiad evaluation (33.4% on PhysOlym-A). We release four artifacts addressing these gaps: PhysCorp-A (6,432-record three-stage-audited multimodal corpus), PhysR1Corp (2,268-record closed-form RL pool), PhysOlym-A (500-problem, 99.8% novel-source held-out olympiad eval with native difficulty labels and an EN/ET bilingual subset), and Physics-R1, a reference GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking. Across 3 seeds, Physics-R1 lifts the audited corpus over the 8B base by +18.3 pp on PhysOlym-A liberal (8.0 -> 26.3 +/- 1.7; 7.1 pp behind Sonnet 4.5), +15.7 pp on PhysReason (23.9 -> 39.6 +/- 6.4; ahead of Qwen3-VL-32B and Gemini 2.5 Pro), +6.9 pp on OlympiadBench-Physics (46.2 +/- 1.5), and +4.1 pp on PhyX MCQ (77.8 +/- 0.3).
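The first audit stage named in the abstract, a 5-gram Jaccard check, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say whether the n-grams are word-level or character-level, so word-level 5-grams are an assumption here, and the example problems are made up.

```python
import re


def word_ngrams(text: str, n: int = 5) -> set:
    """Lowercase the text, split into word tokens, and return the set of n-grams."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: str, b: str, n: int = 5) -> float:
    """Jaccard similarity between the n-gram sets of two texts (0.0 if either is empty)."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical training item, a verbatim copy, and a paraphrase of the same problem.
train = "A block of mass 2 kg slides down a frictionless incline of angle 30 degrees."
verbatim = "A block of mass 2 kg slides down a frictionless incline of angle 30 degrees."
paraphrase = "On a smooth ramp tilted at 30 degrees, a 2 kg block slides downward."

print(jaccard(train, verbatim))    # identical text: all 5-grams shared
print(jaccard(train, paraphrase))  # paraphrase: no shared 5-gram
```

A paraphrase can share no 5-gram with its source, which is consistent with the abstract's point that a single-stage lexical audit reports zero hits while embedding and LLM-judge stages surface near-duplicates and paraphrase candidates.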


[2] Derivation Prompting: A Logic-Based Method for Improving Retrieval-Augmented Generation cs.CL | cs.AI

Ignacio Sastre, Guillermo Moncecchi, Aiala Rosá

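TL;DR: This paper proposes Derivation Prompting, a new prompting technique for the generation step of the Retrieval-Augmented Generation (RAG) framework. Inspired by logic derivations, the method derives conclusions from initial hypotheses by systematically applying predefined rules, and builds an interpretable derivation tree that adds control over generation. In a specific case study, it significantly reduced the number of unacceptable answers.

Details

Motivation: To address the hallucinations and erroneous reasoning that large language models exhibit in knowledge-intensive, domain-specific question answering, particularly in the generation stage of RAG.

Result: In a specific case study, the method significantly reduced unacceptable answers compared with traditional RAG and long-context-window methods.

Insight: The novelty lies in bringing the formalism of logic derivation into prompt engineering. Building an interpretable derivation tree makes generation more transparent and controllable, and combines symbolic logic with neural generation.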

Abstract: The application of Large Language Models to Question Answering has shown great promise, but important challenges such as hallucinations and erroneous reasoning arise when using these models, particularly in knowledge-intensive, domain-specific tasks. To address these issues, we introduce Derivation Prompting, a novel prompting technique for the generation step of the Retrieval-Augmented Generation framework. Inspired by logic derivations, this method involves deriving conclusions from initial hypotheses through the systematic application of predefined rules. It constructs a derivation tree that is interpretable and adds control over the generation process. We applied this method in a specific case study, significantly reducing unacceptable answers compared to traditional RAG and long-context window methods.


[3] PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts cs.CL | cs.AI

Anjir Ahmed Chowdhury, Syed Zawad, Xiaolong Ma, Xu Dong, Feng Yan

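TL;DR: This paper proposes PEML, a parameter-efficient multi-task learning method that optimizes continuous prompts through neural architecture engineering and adapts model weights with low-rank adaptation, enabling efficient multi-task fine-tuning of a single large language model.

Details

Motivation: Existing parameter-efficient fine-tuning methods such as LoRA and Prefix Tuning are designed mainly for single tasks and do not co-optimize prompts and model adaptation in multi-task learning, which limits their adaptation capability. Resource consolidation and efficient fine-tuning in multi-task settings call for optimizing prompts and model weights together.

Result: On GLUE, SuperGLUE, MMLU, and commonsense reasoning benchmarks, PEML improves average accuracy by up to 6.67% over state-of-the-art methods including MTL-LoRA, MultiLoRa, C-Poly, and MoE, with peak gains of up to 10.75% on individual tasks.

Insight: The novelty lies in co-designing continuous prompt optimization and low-rank weight adaptation, with prompt structure optimized automatically through neural architecture engineering to strengthen multi-task adaptation. Viewed objectively, this joint optimization strategy offers a new direction for parameter-efficient multi-task learning by balancing prompt tuning and model fine-tuning.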

Abstract: Parameter-Efficient Fine-Tuning (PEFT) is widely used for adapting Large Language Models (LLMs) to various tasks. Recently, there has been an increasing demand for fine-tuning a single LLM for multiple tasks, because doing so requires less data overall thanks to the common features shared among tasks. More importantly, LLMs are resource-demanding, and deploying a single model for multiple tasks facilitates resource consolidation and consumes significantly fewer resources than deploying an individual large model for each task. Existing PEFT methods like LoRA and Prefix Tuning are designed to adapt LLMs to a specific task. LoRA and its variants focus on aligning the model itself for tasks, overlooking the importance of prompt tuning in multi-task learning, while Prefix Tuning adopts only a simple architecture to optimize prompts, which limits its adaptation capabilities for multi-task learning. To enable efficient fine-tuning for multi-task learning, it is important to co-optimize prompts and model adaptation. In this work, we propose Parameter-Efficient Multi-task Learning (PEML), which employs a neural architecture engineering method for optimizing the continuous prompts while also performing low-rank adaptation for model weights. We prototype PEML by creating an automated framework for optimizing the continuous prompts and adapting model weights. We evaluate PEML against the state-of-the-art multi-task learning methods MTL-LoRA, MultiLoRa, C-Poly, and MoE on the GLUE, SuperGLUE, Massive Multitask Language Understanding, and commonsense reasoning benchmarks. The evaluation results show an average accuracy improvement of up to 6.67%, with individual tasks showing peak gains of up to 10.75%.


Xubo Lin, Zezhii Deng, Shihao Wang, Grace Hui Yang, Yang Deng

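TL;DR: This paper proposes an Inquisitive Conversational Agent (ICA) for legal questioning dialogue, tailored to U.S. Supreme Court oral arguments. It uses a Dual Hierarchical Reinforcement Learning framework in which two cooperating RL agents handle strategic dialogue management and fine-grained utterance generation respectively, so that the agent proactively extracts key information to achieve its legal objectives.

Details

Motivation: Most existing dialogue systems are user-driven, yet many critical real-world scenarios require a conversational agent to proactively extract information in pursuit of its own objectives. The paper introduces Inquisitive Conversational Agents (ICAs) to fill this gap.

Result: Evaluations on a U.S. Supreme Court dataset show that the method outperforms various baselines across multiple metrics, an important step toward high-stakes, domain-specific applications.

Insight: The novelty lies in the dual hierarchical reinforcement learning framework, which coordinates strategic dialogue management with fine-grained utterance generation to emulate judicial questioning patterns and systematically uncover crucial information. Viewed objectively, the method combines proactive information extraction with a domain-specific task and offers a new direction for high-stakes dialogue systems.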

Abstract: Most existing dialogue systems are user-driven, primarily designed to fulfill user requests. However, in many critical real-world scenarios, a conversational agent must proactively extract information to achieve its own objectives rather than merely respond. To address this gap, we introduce Inquisitive Conversational Agents (ICAs) and develop an ICA specifically tailored to U.S. Supreme Court oral arguments. We propose a Dual Hierarchical Reinforcement Learning framework featuring two cooperating RL agents, each with its own policy, to coordinate strategic dialogue management and fine-grained utterance generation. By learning when and how to ask probing questions, the agent emulates judicial questioning patterns and systematically uncovers crucial information to fulfill its legal objectives. Evaluations on a U.S. Supreme Court dataset show that our method outperforms various baselines across multiple metrics. It represents an important first step toward broader high-stakes, domain-specific applications.


[5] Distribution Corrected Offline Data Distillation for Large Language Models cs.CL

Yumeng Zhang, Zhengbang Yang, Yevin Nikhel Goonatilake, Zhuangdi Zhu

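TL;DR: This paper proposes a distribution-corrected offline data distillation framework for improving the reasoning ability of large language models. By adaptively emphasizing teacher supervision that better matches the student’s on-policy distribution, it corrects the teacher-student distribution drift of offline distillation. It thereby keeps the efficiency and quality of offline data while mitigating the compounding errors that arise when the student generates autoregressively at inference time.

Details

Motivation: Existing approaches face a fundamental trade-off. Offline distillation from teacher-generated traces gives high-quality, sample-efficient supervision but suffers from distribution drift. On-policy or self-distillation methods better match the student’s inference-time distribution but require costly online sampling and produce low-quality traces early in training.

Result: Evaluations on the mathematical reasoning benchmarks GSM8K, MATH, and MATH500, and on harder competition-style tasks including AMC, AIME, and OlympiadBench, show that the method improves reasoning accuracy over prior offline distillation algorithms and yields more stable reasoning traces while preserving instruction-following capabilities.

Insight: The core contribution is a lightweight, distribution-correction-aware training framework that substantially strengthens offline reasoning distillation without online rollouts. It does so by adaptively reweighting teacher supervision to align with the student’s inference-time on-policy distribution, striking a better balance between efficiency and supervision quality.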

Abstract: Distilling reasoning traces from strong large language models into smaller ones is a promising route to improve intelligence in resource-constrained settings. Existing approaches face a fundamental trade-off: offline distillation from teacher-generated traces provides high-quality, sample-efficient supervision but suffers from distributional drift: during training, the student model conditions on teacher-generated prefixes, whereas during inference the student autoregresses on self-generated prefixes, leading to compounding errors over long reasoning trajectories. Meanwhile, on-policy or self-distillation methods better match the student’s inference-time distribution, but require costly online sampling and often produce low-quality traces in early training. We propose a principled offline reasoning distillation framework that preserves the efficiency and supervision quality of offline teacher-generated data while correcting teacher-student distribution drift. It adaptively emphasizes teacher supervision that is better aligned with the student’s on-policy distribution. Evaluations on mathematical reasoning benchmarks of GSM8K, MATH, MATH500, and harder held-out competition-style tasks, including AMC, AIME, and OlympiadBench, show that our method improves reasoning accuracy over prior offline distillation algorithms and yields more stable reasoning traces while preserving instruction-following capabilities. Our work shows that lightweight, distribution-correction-aware training can substantially strengthen offline reasoning distillation without online rollouts.


[6] Generative Floor Plan Design with LLMs via Reinforcement Learning with Verifiable Rewards cs.CL | cs.AI

Luis Lara, Aristides Milios, Zhi Hao Luo, Aditya Sharma, Ge Ya Luo

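TL;DR: This paper proposes a text-driven floor plan generation method based on a large language model (LLM). It combines fine-tuning on real floor plans with reinforcement learning with verifiable rewards (RLVR) to generate plans that satisfy both user-specified room connectivity and precise numerical constraints such as room dimensions. The method outperforms existing approaches on realism, compatibility, and diversity metrics.

Details

Motivation: Existing floor plan generation methods focus mainly on the connectivity between rooms and cannot handle numerical requirements such as room dimensions and areas. The paper aims to remove this limitation with an AI floor plan design system that controls numerical constraints precisely while respecting the requested connectivity.

Result: The method surpasses existing methods on the Realism, Compatibility, and Diversity metrics. On Compatibility in particular, it achieves at least a 94% relative reduction compared with existing methods across all tasks.

Insight: The core contribution is combining LLM fine-tuning with RLVR to jointly improve adherence to topological and numerical constraints while discouraging invalid or overlapping outputs. This shows that LLMs can handle complex, verifiable constraints in text-based generation and suggests broader applications for text-based generative modeling.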

Abstract: An AI system for professional floor plan design must precisely control room dimensions and areas while respecting the desired connectivity between rooms and maintaining functional and aesthetic quality. Existing generative approaches focus primarily on respecting the requested connectivity between rooms, but do not support generating floor plans that respect numerical constraints. We introduce a text-based floor plan generation approach that fine-tunes a large language model (LLM) on real plans and then applies reinforcement learning with verifiable rewards (RLVR) to improve adherence to topological and numerical constraints while discouraging invalid or overlapping outputs. Furthermore, we design a set of constraint adherence metrics to systematically measure how generated floor plans align with user-defined constraints. Our model generates floor plans that satisfy user-defined connectivity and numerical constraints and outperforms existing methods on Realism, Compatibility, and Diversity metrics. Across all tasks, our approach achieves at least a 94% relative reduction in Compatibility compared with existing methods. Our results demonstrate that LLMs can effectively handle constraints in this setting, suggesting broader applications for text-based generative modeling.


[7] Why Retrieval-Augmented Generation Fails: A Graph Perspective cs.CL | cs.AI

Kai Guo, Xinnan Dai, Zhibo Zhang, Nuohan Lin, Shenglai Zeng

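TL;DR: This paper studies why Retrieval-Augmented Generation (RAG) fails from a graph perspective. Using circuit tracing, it builds attribution graphs that model how retrieved evidence flows through the model during decoding, finds systematic structural differences between the graphs of correct and incorrect predictions, and develops an error detection framework driven by graph-topology features together with an intervention method.

Details

Motivation: Although RAG grounds generation in external evidence, it still often produces wrong answers, and the internal failure mechanism is poorly understood. The paper examines, from inside the model, how retrieved evidence influences answer generation in order to understand the limits of RAG.

Result: Across multiple question answering benchmarks, correct predictions exhibit deeper reasoning paths, more distributed evidence flow, and more structured local connectivity, while failed predictions show shallow, fragmented, and overly concentrated evidence flow. An error detection framework based on attribution-graph topology features identifies errors effectively, and an intervention that reinforces question-constrained evidence grounding reduces errors.

Insight: The novelty lies in modeling the internal information flow of RAG as an attribution graph and explaining failures through differences in graph structure. Viewed objectively, the method provides an interpretable circuit-level view and a new direction for topology-based error diagnosis and targeted intervention.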

Abstract: Retrieval-Augmented Generation (RAG) has become a powerful and widely used approach for improving large language models by grounding generation in retrieved evidence. However, RAG systems still produce incorrect answers in many cases. Why RAG fails despite having access to external information remains poorly understood. We present a model-internal study of retrieval-augmented generation that examines how retrieved evidence influences answer generation. Using circuit tracing, we construct attribution graphs that model the flow of information through transformer layers during decoding. These graphs represent interactions among retrieved context, intermediate model activations, and generated tokens, providing a graph-based, circuit-level view of how external evidence is integrated into the model’s reasoning process. Across multiple question answering benchmarks, we observe consistent structural differences: correct predictions exhibit deeper reasoning paths, more distributed evidence flow, and a more structured pattern of local connectivity, while failed predictions show shallower, fragmented, and overly concentrated evidence flow. Building on these findings, we develop a graph-based error detection framework that uses attribution-graph topology features. Furthermore, we show that attribution graphs enable targeted interventions. By reinforcing question-constrained evidence grounding, we reshape internal routing so that answer generation remains guided by the question, leading to more effective integration of retrieved information and fewer errors.


[8] LLM-based Detection of Manipulative Political Narratives cs.CL

Sinclair Schneider, Florian Steuber, Gabi Dreo Rodosek

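TL;DR: This paper proposes an LLM-based computational framework for detecting and structuring manipulative political narratives on social media. A reasoning model first filters manipulative posts using a carefully designed few-shot prompt, UMAP dimensionality reduction and HDBSCAN clustering then uncover narrative groups, and a reasoning model is used again to describe the narrative behind each cluster. Applied to more than 1.2 million social media posts, the method identified 41 distinct manipulative narrative clusters.

Details

Motivation: As political discussion moves to social media, detecting manipulative political narratives becomes increasingly important. The core challenges are distinguishing manipulative narratives from legitimate criticism and recognizing posts that place real events in a manipulative context.

Result: Applied to more than 1.2 million social media posts, the method identified 41 distinct manipulative narrative clusters.

Insight: The novelty lies in combining prompt-based LLM reasoning (for precise filtering) with unsupervised clustering (HDBSCAN). This hybrid uses the semantic understanding of the LLM for fine-grained classification while letting unsupervised clustering discover new narrative categories, avoiding dependence on a predefined list of targets.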

Abstract: We present a new computational framework for detecting and structuring manipulative political narratives, a task that has become more important due to the shift of political discussions to social media. One of the primary challenges is differentiating between manipulative political narratives and legitimate critiques. Some posts may also reframe actual events within a manipulative context. To achieve good clustering results, we filter manipulative posts beforehand using a detailed few-shot prompt that combines documented campaign narratives with legitimate criticisms to differentiate them. This prompt enables a reasoning model to assign labels, retaining only manipulative narrative posts for further processing. The remaining posts are subsequently embedded and dimensionality-reduced using UMAP, before HDBSCAN is applied to uncover narrative groups. A key advantage of this unsupervised approach is its independence from a predefined list of target categories, enabling it to uncover new narrative clusters. Finally, a reasoning model is employed to uncover the narrative behind each cluster. This approach, applied to over 1.2 million social media posts, effectively identified 41 distinct manipulative narrative clusters by integrating prompt-based filtering with unsupervised clustering.


[9] Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax cs.CL | cs.LG

Zeli Su, Ziyin Zhang, Zhou Liu, Xuexian Song, Zhankai Xu

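TL;DR: This paper proposes a new reinforcement learning paradigm based on semantic rewards. The model is optimized with Group Relative Policy Optimization (GRPO) using embedding-level semantic rewards instead of the usual likelihood-maximization objective. The goal is to address the “alignment tax” of low-resource language expansion, where gains in the target language come with catastrophic forgetting of general capabilities.

Details

Motivation: Extending large language models to low-resource languages incurs an alignment tax. The paper attributes this to the rigidity of supervised fine-tuning (SFT), which enforces token-level surface imitation on narrow and biased data distributions and so trades target-language gains against forgetting of general capabilities.

Result: Experiments on Tibetan-Chinese machine translation and Tibetan headline generation show that the method acquires low-resource language capabilities while markedly mitigating the alignment tax, preserving general competence more effectively than SFT. Despite producing less surface overlap, semantic RL achieves higher semantic quality and preference in open-ended generation, and few-shot transfer results indicate that it learns more transferable and robust representations under limited supervision.

Insight: The claimed contribution is a semantic-space alignment paradigm that optimizes with embedding-level semantic rewards, encouraging meaning preservation through flexible realizations and thereby enabling controlled updates that reduce destructive interference with pretrained knowledge. Viewed objectively, the core idea is to move the optimization target from token-level surface imitation to semantic-level alignment, a new direction for addressing catastrophic forgetting in low-resource expansion.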

Abstract: Extending large language models (LLMs) to low-resource languages often incurs an “alignment tax”: improvements in the target language come at the cost of catastrophic forgetting in general capabilities. We argue that this trade-off arises from the rigidity of supervised fine-tuning (SFT), which enforces token-level surface imitation on narrow and biased data distributions. To address this limitation, we propose a semantic-space alignment paradigm powered by Group Relative Policy Optimization (GRPO), where the model is optimized using embedding-level semantic rewards rather than likelihood maximization. This objective encourages meaning preservation through flexible realizations, enabling controlled updates that reduce destructive interference with pretrained knowledge. We evaluate our approach on Tibetan-Chinese machine translation and Tibetan headline generation. Experiments show that our method acquires low-resource capabilities while markedly mitigating alignment tax, preserving general competence more effectively than SFT. Despite producing less rigid surface overlap, semantic RL yields higher semantic quality and preference in open-ended generation, and few-shot transfer results indicate that it learns more transferable and robust representations under limited supervision. Overall, our study demonstrates that reinforcement learning with semantic rewards provides a safer and more reliable pathway for inclusive low-resource language expansion.
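The combination described above, an embedding-level reward fed into GRPO, can be illustrated with a small sketch. It assumes the reward is the cosine similarity between a candidate's embedding and a reference embedding, and it uses the commonly described GRPO advantage (reward minus group mean, divided by group standard deviation). The toy three-dimensional embeddings are made up, and the abstract does not specify the paper's actual reward model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical embeddings: one reference translation and four sampled candidates.
reference = [1.0, 0.0, 0.0]
candidates = [
    [0.9, 0.1, 0.0],  # close in meaning to the reference
    [0.2, 0.9, 0.1],
    [0.7, 0.6, 0.0],
    [0.0, 0.0, 1.0],  # unrelated to the reference
]

rewards = [cosine(c, reference) for c in candidates]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.3f} advantage={a:+.3f}")
```

Because the reward depends only on the embedding, two candidates with different wording but similar meaning receive similar advantages, which is the flexibility the abstract contrasts with token-level imitation under SFT.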


[10] Agentic Recommender System with Hierarchical Belief-State Memory cs.CL | cs.AI

Xiang Shen, Yuhang Zhou, Yifan Wu, Zhuokai Zhao, Siyu Lin

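TL;DR: This paper proposes MARS (Memory-Augmented Agentic Recommender System), a framework that treats recommendation as a partially observable problem and maintains a structured belief state. Three memory tiers (event, preference, profile) progressively abstract noisy behavioral observations into a compact estimate of user preferences, and an LLM-based planner adaptively schedules a complete memory lifecycle of extraction, reinforcement, weakening, consolidation, forgetting, and resynthesis.

Details

Motivation: Existing personalized recommendation methods built on memory-augmented LLM agents universally use flat memory representations that conflate ephemeral signals with stable preferences, and none provides a complete lifecycle governing how memory should evolve.

Result: Experiments on four InstructRec benchmark domains show that MARS achieves state-of-the-art performance, with average improvements of 26.4% in HR@1 and 10.3% in NDCG@10 over the strongest baselines, and further gains from agentic scheduling in evolving settings.

Insight: The core contribution is modeling recommendation as a partially observable problem and introducing a hierarchical belief-state memory (event, preference, profile) with an adaptive memory lifecycle of six operations driven by an LLM planner, which gives a systematic way to abstract user preferences progressively and let them evolve.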

Abstract: Memory-augmented LLM agents have advanced personalized recommendation, yet existing approaches universally adopt flat memory representations that conflate ephemeral signals with stable preferences, and none provides a complete lifecycle governing how memory should evolve. We propose MARS (Memory-Augmented Agentic Recommender System), a framework that treats recommendation as a partially observable problem and maintains a structured belief state that progressively abstracts noisy behavioral observations into a compact estimate of user preferences. MARS organizes this belief state into three tiers: event memory buffers raw signals, preference memory maintains fine-grained mutable chunks with explicit strength and evidence tracking, and profile memory distills all preferences into a coherent natural language narrative. A complete lifecycle of six operations (extraction, reinforcement, weakening, consolidation, forgetting, and resynthesis) is adaptively scheduled by an LLM-based planner rather than fixed-interval heuristics. Experiments on four InstructRec benchmark domains show that MARS achieves state-of-the-art performance with average improvements of 26.4% in HR@1 and 10.3% in NDCG@10 over the strongest baselines, with further gains from agentic scheduling in evolving settings.


[11] GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations cs.CL

Jingbo Yang, Kwei-Herng Lai, Xiaowen Wang, Shiyu Chang, Yaar Harari

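TL;DR: This paper introduces GroupMemBench, a benchmark for evaluating the memory of LLM agents in multi-user group conversations. A graph-grounded synthesis pipeline generates controllable multi-party conversations, and an adversarial query pipeline tests six categories of memory tasks, exposing severe performance deficits of existing memory systems in group settings.

Details

Motivation: Existing LLM agent memory systems and benchmarks are built mainly around single-user conversations, whereas real deployments often involve multi-user group interaction. As a result, three key properties of group memory (group dynamics, speaker-grounded belief tracking, and audience-adapted language) go unmeasured, and a dedicated benchmark is needed to evaluate and advance multi-user memory.

Result: Leading memory systems collapse on GroupMemBench. The strongest system reaches only 46.0% average accuracy, with 27.1% on knowledge update and 37.7% on term ambiguity, while a simple BM25 baseline matches or exceeds most agent memory systems. This indicates that current memory ingestion fails to preserve the structural and lexical features that group memory depends on.

Insight: The novelty lies in the first systematic benchmark for multi-user conversational memory, which uses graph-grounded synthesis to simulate group dynamics controllably and an adversarial query pipeline to cover six categories of memory tasks. Viewed objectively, the study exposes fundamental limitations of existing memory systems in group settings and offers a new evaluation paradigm and direction for multi-user memory modeling.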

Abstract: Large Language Model (LLM) agents increasingly serve as personal assistants and workplace collaborators, where their utility depends on memory systems that extract, retrieve, and apply information across long-running conversations. However, both existing memory systems and benchmarks are built around the dyadic, single-user setup, even though real deployments routinely span groups and channels with multiple users interacting with the agent and with each other. This mismatch leaves three properties of group memory unmeasured: (i) group dynamics that go beyond concatenated one-on-one chats, (ii) speaker-grounded belief tracking, where per-user memory modeling is needed, and (iii) audience-adapted language, where Theory-of-Mind shifts produce role-specific vocabulary. We introduce GroupMemBench, a benchmark that exposes all three. A graph-grounded synthesis pipeline produces multi-party conversations with controllable reply structure and conditions each message on per-user personas and target audiences. An adversarial query pipeline then binds every question to a specific asker across six categories, spanning multi-hop reasoning, knowledge update, term ambiguity, user-implicit reasoning, temporal reasoning, and abstention, and iteratively searches for challenging, realistic queries that reflect comprehensive memory capability. Benchmarking leading memory systems exposes a sharp collapse: the strongest one reaches only 46.0% average accuracy, with knowledge update at 27.1% and term ambiguity at 37.7%, while a simple BM25 baseline matches or exceeds most agent memory systems. This indicates that current memory ingestion erases the structural and lexical features group memory depends on, leaving multi-user memory far from solved.
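For readers unfamiliar with the BM25 baseline mentioned above, the following is a minimal sketch of Okapi BM25 ranking over raw chat messages. It is not the benchmark's code: the tokenizer, the parameter values k1=1.5 and b=0.75, the smoothed IDF variant, and the example messages are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer; speaker names stay in the text as ordinary tokens."""
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))  # count each term once per document
    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical group-chat messages, stored verbatim with their speaker prefix.
messages = [
    "alice: the launch date moved to friday",
    "bob: lunch is at noon today",
    "carol: please review the budget sheet",
]
scores = bm25_scores("launch date", messages)
best = max(range(len(messages)), key=lambda i: scores[i])
print(messages[best])
```

Because the messages are indexed verbatim, speaker names and exact vocabulary remain searchable, which may be part of why such a baseline stays competitive when learned memory ingestion discards those features.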


[12] Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards cs.CL

Mengjie Ren, Jie Lou, Boxi Cao, Xueru Wen, Hongyu Lin

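TL;DR: This paper proposes CIPO (Correction-Oriented Policy Optimization), a method that strengthens reinforcement learning with verifiable rewards (RLVR). It converts the model’s own failed trajectories into correction-oriented supervision and optimizes them jointly with the standard RLVR objective, making better use of failure information and improving the reasoning and self-correction abilities of large language models.

Details

Motivation: RLVR training is often hindered by sparse binary rewards and weak credit assignment, which produce ambiguous optimization signals and leave the useful information in failed trajectories underused.

Result: Extensive experiments on 11 benchmarks spanning mathematical reasoning and code generation show that CIPO consistently and significantly outperforms strong baselines in both reasoning and correction performance, and yields stronger pass@K gains.

Insight: The core contribution is turning the model’s own failed attempts into correction samples for supervised learning. This gives an effective way to learn from failures without external signals, and it aims to improve the model’s intrinsic reasoning capacity rather than merely redistribute probability mass over existing correct answers.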

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective paradigm for improving the reasoning capabilities of large language models. However, RLVR training is often hindered by sparse binary rewards and weak credit assignment, resulting in ambiguous optimization signals and underutilization of the useful information embedded in failed trajectories. To address this challenge, we propose Correction-Oriented Policy Optimization (CIPO), a simple and effective extension to RLVR that converts on-policy failed trajectories into correction-oriented supervision, without relying on any external signals. By jointly optimizing correction samples derived from the model’s own failed attempts together with the standard RLVR objective, CIPO improves learning effectiveness while explicitly enhancing the model’s ability to correct its own errors. Extensive experiments across 11 benchmarks spanning mathematical reasoning and code generation demonstrate that CIPO consistently and significantly outperforms strong baselines in both reasoning and correction performance. Moreover, CIPO yields stronger pass@K gains, indicating that it improves the model’s intrinsic reasoning capacity rather than merely redistributing probability mass over existing correct answers.
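The pass@K metric cited above is often computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The abstract does not state which estimator the authors use, so the sketch below is an assumption about the metric and not a description of their evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples of which c are correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case every draw of k samples contains a correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples generated, 3 verified correct.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

Under this reading, a gain at large K means that more problems have at least one correct sample among many draws, which is the sense in which the abstract argues that the method adds reasoning capacity instead of only sharpening the distribution over answers the model could already produce.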


[13] EndPrompt: Efficient Long-Context Extension via Terminal Anchoring cs.CL

Han Tian, Luxuan Chen, Xinran Chen, Rui Kong, Fang Wang

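TL;DR: This paper proposes EndPrompt, a method that extends the context window of large language models effectively using only short training sequences. It keeps the original short context as a first segment and appends a brief terminal prompt as a second segment whose positional indices lie near the target context length, so that a short physical sequence contains both local and long-range relative positions.

Details

Motivation: Extending the context window of a large language model usually requires training on sequences at the target length, which incurs quadratic memory and compute costs and makes long-context adaptation expensive and hard to reproduce. The paper addresses this by achieving efficient long-context extension from sparse positional supervision.

Result: When extending the context window of LLaMA-family models from 8K to 64K, EndPrompt achieves an average RULER score of 76.03 and the highest average on LongBench, surpassing LCEG (72.24), LongLoRA (72.95), and full-length fine-tuning (69.23) while requiring substantially less computation.

Insight: The core contribution is the two-segment construction (intact short context plus terminal prompt), which simulates long-range positional relations within a short sequence. A theoretical analysis based on RoPE and the Bernstein inequality shows that position interpolation imposes a rigorous smoothness constraint on the attention function, and that shared Transformer parameters further suppress unstable extrapolation to unobserved intermediate distances. This challenges the prevailing assumption that dense long-sequence training is necessary for reliable context-window extension.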

Abstract: Extending the context window of large language models typically requires training on sequences at the target length, incurring quadratic memory and computational costs that make long-context adaptation expensive and difficult to reproduce. We propose EndPrompt, a method that achieves effective context extension using only short training sequences. The core insight is that exposing a model to long-range relative positional distances does not require constructing full-length inputs: we preserve the original short context as an intact first segment and append a brief terminal prompt as a second segment, assigning it positional indices near the target context length. This two-segment construction introduces both local and long-range relative distances within a short physical sequence while maintaining the semantic continuity of the training text, a property absent in chunk-based simulation approaches that split contiguous context. We provide a theoretical analysis grounded in Rotary Position Embedding and the Bernstein inequality, showing that position interpolation induces a rigorous smoothness constraint over the attention function, with shared Transformer parameters further suppressing unstable extrapolation to unobserved intermediate distances. Applied to LLaMA-family models extending the context window from 8K to 64K, EndPrompt achieves an average RULER score of 76.03 and the highest average on LongBench, surpassing LCEG (72.24), LongLoRA (72.95), and full-length fine-tuning (69.23) while requiring substantially less computation. These results demonstrate that long-context generalization can be induced from sparse positional supervision, challenging the prevailing assumption that dense long-sequence training is necessary for reliable context-window extension. The code is available at https://github.com/clx1415926/EndPrompt.
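The two-segment construction can be made concrete with a sketch of how position indices might be assigned. This is an illustration under stated assumptions and not the released EndPrompt code: the function name is hypothetical, and placing the terminal prompt at the very last indices of the target window is one reading of "near the target context length".

```python
def endprompt_position_ids(context_len: int, prompt_len: int, target_len: int) -> list:
    """Position indices for a short context followed by a terminal prompt.

    The context keeps its natural indices 0..context_len-1. The terminal
    prompt is assigned the last prompt_len indices of the target window
    (an assumption made for this sketch).
    """
    if context_len + prompt_len > target_len:
        raise ValueError("segments must fit inside the target context length")
    first_segment = list(range(context_len))
    second_segment = list(range(target_len - prompt_len, target_len))
    return first_segment + second_segment


# Toy sizes: 6 context tokens, 2 terminal-prompt tokens, target window of 64.
ids = endprompt_position_ids(context_len=6, prompt_len=2, target_len=64)
print(ids)
print("physical length:", len(ids), "max relative distance:", ids[-1] - ids[0])
```

The physical sequence has only 8 tokens, yet the last token sees relative distances up to 63, while tokens inside the first segment still see ordinary local distances. Intermediate distances are never observed, which is the gap the abstract's smoothness analysis is meant to cover.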


[14] SciPaths: Forecasting Pathways to Scientific Discovery cs.CL

Eric Chamoun, Yizhou Chi, Yulong Chen, Rui Cao, Zifeng Ding

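TL;DR: This paper introduces SciPaths, a benchmark for discovery pathway forecasting. Given a target scientific contribution and the literature available before a specified time, the task is to identify the enabling contributions needed to realize the target and to ground each of them in prior work. The benchmark contains 262 expert-annotated gold pathways and 2,444 silver pathways built from machine learning and natural language processing papers.

Details

Motivation: Existing AI4Science benchmarks focus mainly on citation prediction, literature retrieval, or idea generation and overlook the dependencies among contributions on which scientific progress relies. A new benchmark is needed to evaluate how well models identify discovery pathways and their dependencies.

Result: Across frontier and open-weight language models, the best model reaches only 0.189 F1 under strict semantic matching, with core methodological dependencies the hardest to recover. Prior-work grounding improves substantially when gold enabling contributions are provided, indicating that decomposition quality is a major bottleneck for end-to-end pathway recovery.

Insight: The novelty lies in shifting the evaluation of scientific forecasting from citation or generation tasks toward reasoning backward from a target contribution to its enabling building blocks and prior-work dependencies. Viewed objectively, the study highlights the importance of decomposition and dependency modeling in scientific AI and gives future models a finer-grained evaluation framework.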

Abstract: Scientific progress depends on sequences of enabling contributions, yet existing AI4Science benchmarks largely focus on citation prediction, literature retrieval, or idea generation rather than the dependencies that make progress possible. In this paper, we introduce discovery pathway forecasting: given a target scientific contribution and the prior literature available at a specified time, the task is to (1) identify the enabling contributions required to realize it and (2) ground each in prior work when such prior work exists. We present SciPaths, a benchmark of 262 expert-annotated gold pathways and 2,444 silver pathways constructed from machine learning and natural language processing papers, where each pathway records enabling contributions, roles, rationales, and prior-work groundings or unmapped decisions. Evaluating frontier and open-weight language models, we find that the best model reaches only 0.189 F1 under strict semantic matching, with core methodological dependencies hardest to recover. Prior-work grounding improves substantially when gold enabling contributions are provided, showing that decomposition quality is a major bottleneck for end-to-end pathway recovery. SciPaths therefore shifts evaluation toward a missing capability in scientific forecasting: reasoning backward from a target contribution to the enabling scientific building blocks and prior-work dependencies that make it feasible.
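To make the reported F1 figure concrete, the sketch below computes precision, recall, and F1 from counts of predicted, gold, and matched enabling contributions. How a prediction counts as a match under "strict semantic matching" is defined by the benchmark and not reproduced here, so the match count is taken as an input, and the example numbers are made up.

```python
def f1_from_counts(num_predicted: int, num_gold: int, num_matched: int) -> float:
    """F1 given how many predicted items were matched to gold items."""
    if num_predicted == 0 or num_gold == 0 or num_matched == 0:
        return 0.0
    precision = num_matched / num_predicted
    recall = num_matched / num_gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical pathway: 4 predicted contributions, 5 gold, 2 judged as matches.
score = f1_from_counts(num_predicted=4, num_gold=5, num_matched=2)
print(round(score, 3))
```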


[15] Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining cs.CL | cs.AI | cs.CV | cs.LG

Weimin Xiong, Shuhao Gu, Bowen Ye, Zihao Yue, Lei Li

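TL;DR: This paper proposes Video2GUI, a framework that automatically extracts GUI interaction trajectories from unlabeled Internet videos. It uses the framework to build WildGUI, a large-scale dataset for pretraining GUI agents, and reports significant improvements on multiple benchmarks.

Details

Motivation: The generalization of current GUI agents is limited by the lack of large-scale, diverse training data. Existing datasets rely on costly manual annotation and cover narrow domains.

Result: Pretraining Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5-20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance.

Insight: The novelty lies in a fully automated framework that extracts structured GUI interaction trajectories from massive amounts of Internet video without manual annotation, greatly expanding the scale and diversity of training data.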

Abstract: Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of large-scale training data spanning diverse real-world applications. Existing datasets rely heavily on costly manual annotations and are typically confined to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and convert them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5-20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance. We will release both the WildGUI dataset and the Video2GUI pipeline to support future research of GUI agents.


[16] Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation cs.CL | cs.AI

Songyang Gao, Yinghui Xia, Siyi Liu, Hui Xiong

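TL;DR: This paper proposes Graphs of Research (GoR), a supervised fine-tuning method for automated research idea generation. For each seed paper it extracts the 2-hop reference neighborhood and builds a paper-evolution directed acyclic graph (DAG) from citation position, frequency, predecessor links, and publication time. The graph serves as supervision for fine-tuning a large language model (Qwen2.5-7B-Instruct-1M) to predict the core research idea of the seed paper.

Details

Motivation: Existing LLM-based idea generation methods rely mainly on static literature retrieval or complex prompt engineering and ignore the structural relations among references. The paper uses the structured information in citation evolution graphs as a supervision signal to improve the quality of automated idea generation.

Result: On a dataset built from five major ML/NLP venues (498/50/50 train/validation/test seed papers and about 7,600 cited references), GoR-SFT achieves state-of-the-art results in head-to-head LLM-judge tournaments against GPT-4o-driven baselines.

Insight: The main contribution is using the citation evolution graph (a DAG built from citation position, frequency, predecessor links, and temporal relations) as a structured supervision signal for fine-tuning LLMs. This offers a reusable supervision paradigm for automated scientific innovation and may lower the barrier to using citation graphs as supervision.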

Abstract: Research idea generation is the innovation-driving step of automated scientific research. Recently, large language models (LLMs) have shown potential for automating idea generation at scale. However, existing methods mainly elicit idea generation through static retrieval of relevant literature or complex prompt engineering, disregarding the structural relations among references. We propose Graphs of Research (GoR), a supervised fine-tuning method that extracts a 2-hop reference neighborhood for each seed paper, derives the relations among those references from citation position, frequency, predecessor links, and publication time, and organizes them into a paper-evolution directed acyclic graph (DAG). We construct an automated extraction pipeline that draws data from five major ML/NLP venues, comprising 498/50/50 train/validation/test seed papers and approximately 7,600 cited references. Qwen2.5-7B-Instruct-1M is fine-tuned on a structured-text prompt that includes the citation graph, edge signals, reference information, and task definition to predict the idea for the seed paper. Across head-to-head LLM-judge tournaments against gpt-4o-driven baselines, GoR-SFT achieves SOTA, demonstrating the effectiveness of citation-evolution graphs as a supervision signal for LLM-based idea generation. We hope that this lowers the barrier to using citation evolution graphs as supervision, accelerating automated scientific innovation.
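The graph construction described above can be sketched in a few lines. This is a simplified illustration and not the authors' pipeline: it uses only citation links and publication year, leaving out citation position and frequency, and the citation map, years, and function names are hypothetical.

```python
from collections import deque


def two_hop_references(seed, cites):
    """Papers reachable from the seed within two citation hops (seed excluded)."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    found = []
    while frontier:
        paper, depth = frontier.popleft()
        if depth == 2:
            continue  # do not expand beyond the second hop
        for ref in cites.get(paper, []):
            if ref not in seen:
                seen.add(ref)
                found.append(ref)
                frontier.append((ref, depth + 1))
    return found


def evolution_edges(papers, cites, year):
    """Citation links among the given papers, oriented older -> newer.

    Keeping only strictly older -> newer edges guarantees the result is acyclic.
    """
    keep = set(papers)
    edges = []
    for paper in papers:
        for ref in cites.get(paper, []):
            if ref in keep and year[ref] < year[paper]:
                edges.append((ref, paper))
    return sorted(edges)


# Hypothetical citation map (paper -> papers it cites) and publication years.
cites = {"seed": ["A", "B"], "A": ["C"], "B": ["C", "A"], "C": ["D"]}
year = {"seed": 2024, "A": 2022, "B": 2023, "C": 2020, "D": 2018}

neighborhood = two_hop_references("seed", cites)
edges = evolution_edges(neighborhood, cites, year)
print(neighborhood)
print(edges)
```

In this toy example paper D is three hops from the seed and is left out, and the remaining edges trace how C led to A and B, which is the kind of evolution structure the abstract says is serialized into the fine-tuning prompt.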


[17] Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA cs.CLPDF

Guanhua Chen, Yutong Yao, Shenghe Sun, Ci-Jun Gao, Shudong Liu

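TL;DR: This paper proposes Chain-of-Procedure (CoP), a hierarchical reasoning framework for visual procedure question answering (VP-QA), a task in which the model predicts the next step from user-uploaded images of intermediate states in a complex procedure. To evaluate vision-language models (VLMs) on this task systematically, the authors also build a new multimodal benchmark, ProcedureVQA. CoP retrieves relevant instructions using visual cues, refines the steps through semantic decomposition, and then generates the next action, achieving up to 13% absolute improvement across six VLMs.

Details

Motivation: Current VLMs perform well on standard image-text tasks, but their potential on the practical task of VP-QA remains largely unexplored. VP-QA poses distinct challenges, including inadequate cross-modal retrieval of structured procedures given visual states and a mismatch between image-sequence granularity and textual step decomposition.

Result: Experiments with six VLMs on the proposed ProcedureVQA benchmark show that CoP achieves up to 13% absolute improvement over standard baselines.

Insight: The contribution is a hierarchical reasoning framework for VP-QA that bridges visual states and procedural text through a retrieve, decompose, and generate pipeline. Also of value are the dedicated ProcedureVQA benchmark for systematic evaluation and the idea of breaking procedural reasoning into manageable sub-problems (retrieval and refinement).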

Abstract: Recent advances in vision-language models (VLMs) have achieved impressive results on standard image-text tasks, yet their potential for visual procedure question answering (VP-QA) remains largely unexplored. VP-QA presents unique challenges where users query next-step actions by uploading images for intermediate states of complex procedures. To systematically evaluate VLMs on this practical task, we propose ProcedureVQA, a novel multimodal benchmark specifically designed for visual procedural reasoning. Through comprehensive analysis, we identify two critical limitations in current VLMs: inadequate cross-modal retrieval of structured procedures given visual states, and misalignment between image sequence granularity and textual step decomposition. To address these issues, we present Chain-of-Procedure (CoP), a hierarchical reasoning framework that first retrieves relevant instructions using visual cues, then performs step refinement through semantic decomposition, and finally generates the next step. Experiments across six VLMs demonstrate CoP’s effectiveness, achieving up to 13% absolute improvement over standard baselines.


[18] Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing cs.CLPDF

Jie Jiang, Xing Sun

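TL;DR: This paper proposes PPOW, a reinforcement learning framework for optimizing the draft model in speculative decoding. It shifts the objective from conventional token-level imitation to window-level optimization, combining a cost-aware speedup reward, a distribution-based proximity reward, and adaptive divergence-aware windowing, which prioritizes informative windows with high confidence-weighted divergence to improve speculative decoding efficiency.

Details

Motivation: Speculative decoding accelerates LLM inference by having a lightweight draft model propose windows of candidate tokens that a larger target model verifies in parallel, but its efficiency is often bottlenecked by hard-to-draft positions, where an early mismatch truncates the accepted prefix and invalidates the rest of the window. Existing learning-based drafters are still optimized with token-level supervised objectives, even though speculative utility is inherently window-level and prefix-sensitive, so a more direct window-level optimization method is needed.

Result: Under a unified decoding protocol, PPOW achieves average acceptance lengths of 6.29-6.52 and speedups of 3.39-4.36x across multiple model families and benchmarks, showing that performance-driven window-level optimization is a practical way to improve speculative decoding efficiency.

Insight: The contribution is shifting drafter optimization from token-level imitation to window-level reinforcement learning and introducing adaptive divergence-aware windowing to dynamically select informative windows. Because the framework optimizes directly for the end goal of speculative decoding (speedup) rather than an indirect imitation loss, it may address the hard-to-draft bottleneck more effectively.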

Abstract: Speculative decoding accelerates LLM inference by having a lightweight draft model propose speculative windows of candidate tokens for parallel verification by a larger target model. In practice, speculative efficiency is often bottlenecked by hard-to-draft positions, where an early mismatch truncates the accepted prefix and invalidates the rest of the speculative window. Most learning-based drafters are still optimized with token-level supervised objectives, even though speculative utility is inherently window-level and prefix-sensitive. We propose PPOW (Performance-Driven Policy Optimization with Adaptive Windowing), a reinforcement learning framework that shifts drafter optimization from token-level imitation to window-level optimization. PPOW combines a Cost-Aware Speedup Reward, a Distribution-Based Proximity Reward, and Adaptive Divergence-Aware Windowing, which prioritizes informative windows with high confidence-weighted draft-target divergence. PPOW achieves average acceptance lengths of 6.29-6.52 and speedups of 3.39-4.36$\times$ across multiple model families and benchmarks under a unified decoding protocol. These results show that performance-driven window-level optimization is a practical approach to improving speculative decoding efficiency.
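
The prefix sensitivity described in the abstract can be made concrete with a small sketch. It assumes greedy verification, where a drafted token is accepted only if it equals the token the target model would have chosen at that position; the function names and token IDs are hypothetical, and the paper's exact definition of acceptance length may differ.

```python
def accepted_length(draft_tokens, target_tokens):
    """Count the leading draft tokens that match the target's choices."""
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break  # the first mismatch invalidates the rest of the window
        n += 1
    return n


def tokens_per_target_pass(windows):
    """Average number of tokens emitted per target-model verification pass.

    Each pass emits the accepted prefix plus one token supplied by the
    target model itself (the correction, or a bonus token when the whole
    window is accepted).
    """
    emitted = [accepted_length(d, t) + 1 for d, t in windows]
    return sum(emitted) / len(emitted)


# Three hypothetical windows of four drafted tokens each.
windows = [
    ([5, 9, 2, 7], [5, 9, 2, 7]),  # all 4 drafted tokens accepted
    ([5, 9, 2, 7], [4, 9, 2, 7]),  # mismatch at position 0: nothing accepted
    ([5, 9, 2, 7], [5, 9, 3, 7]),  # mismatch at position 2: 2 accepted
]
average = tokens_per_target_pass(windows)
```

The second window shows why token-level accuracy understates the cost of an early error: three of its four drafted tokens are correct, yet none is accepted.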


[19] COTCAgent: Preventive Consultation via Probabilistic Chain-of-Thought Completion cs.CL | cs.AIPDF

Zihan Deng, Xiaozhen Zhong, Chuanzhi Xu

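TL;DR: This paper proposes COTCAgent, a hierarchical reasoning framework for longitudinal electronic health records (EHR) that targets two problems of large language models in longitudinal EHR reasoning: statistical hallucination and difficulty capturing long-range dependencies. By decoupling statistical computation, feature matching, and language generation, the framework enables efficient and rigorous clinical reasoning.

Details

Motivation: Current LLMs have two key flaws in longitudinal EHR reasoning. First, lacking fine-grained statistical reasoning, they tend to hallucinate clinical trends and metrics when quantitative evidence is only implied in text. Second, non-uniform time series and scarce labels hinder them from capturing long-range temporal dependencies, limiting reliable clinical reasoning.

Result: COTCAgent powered by Baichuan-M2 achieves 90.47% Top-1 accuracy on a self-built dataset and 70.41% on the HealthBench benchmark, outperforming existing medical agents and mainstream LLMs.

Insight: The contribution is a hierarchical reasoning framework in which the Temporal-Statistics Adapter converts analytical plans into executable code for standardized trend output, and the Chain-of-Thought Completion layer evaluates disease risk through weighted scoring over a symptom-trend-disease knowledge base. The core idea is to decouple a complex clinical reasoning task into independently optimizable modules, reducing reliance on complex multimodal inputs and improving computational efficiency.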

Abstract: As large language models empower healthcare, intelligent clinical decision support has developed rapidly. Longitudinal electronic health records (EHR) provide essential temporal evidence for accurate clinical diagnosis and analysis. However, current large language models have critical flaws in longitudinal EHR reasoning. First, lacking fine-grained statistical reasoning, they often hallucinate clinical trends and metrics when quantitative evidence is textually implied, biasing diagnostic inference. Second, non-uniform time series and scarce labels in longitudinal EHR hinder models from capturing long-range temporal dependencies, limiting reliable clinical reasoning. To address the above limitations, this work presents the Probabilistic Chain-of-Thought Completion Agent (COTCAgent), a hierarchical reasoning framework for longitudinal electronic health records. It consists of three core modules. The Temporal-Statistics Adapter (TSA) converts analytical plans into executable code for standardized trend output. The Chain-of-Thought Completion (COTC) layer leverages a symptom-trend-disease knowledge base with weighted scoring to evaluate disease risk, while the bounded completion module acquires structured evidence through standardized inquiries and iterative scoring constraints to ensure rigorous reasoning. By decoupling statistical computation, feature matching, and language generation, the framework eliminates reliance on complex multi-modal inputs and enables efficient longitudinal record analysis with lower computational overhead. Experimental results show that COTCAgent powered by Baichuan-M2 achieves 90.47% Top-1 accuracy on the self-built dataset and 70.41% on HealthBench, outperforming existing medical agents and mainstream large language models. The code is available at https://github.com/FrankDengAI/COTCAgent/.


[20] From Scenes to Elements: Multi-Granularity Evidence Retrieval for Verifiable Multimodal RAG cs.CLPDF

Guanhua Chen, Chuyue Huang, Yutong Yao, Shudong Liu, Xueqing Song

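TL;DR: Multimodal retrieval-augmented generation (RAG) systems retrieve evidence at coarse granularity (entire images or scenes), which mismatches fine-grained user queries and makes errors hard to verify. This paper introduces GranuVistaVQA, a dataset of real-world landmarks with multi-view element-level annotations, and GranuRAG, a multi-granularity framework that treats visual elements as first-class retrieval units. Through three stages (element-level detection and classification, multi-granularity cross-modal alignment for retrieval, and attribution-constrained generation), the framework enables transparent error diagnosis.

Details

Motivation: Existing multimodal RAG systems retrieve evidence at coarse granularity (image or scene level), so they cannot effectively match fine-grained user queries, and the causes of system failures are hard to trace and verify.

Result: On the proposed GranuVistaVQA benchmark, GranuRAG achieves up to 29.2% improvement over six strong baselines.

Insight: The core contribution is treating visual elements, rather than whole images, as the basic retrieval unit, together with a benchmark with multi-view element-level annotations that captures the partial-observation challenge. The explicit element-level detection, alignment, and generation pipeline makes retrieval transparent and verifiable, offering a new direction for trustworthy multimodal RAG systems.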

Abstract: Multimodal Retrieval-Augmented Generation (RAG) systems retrieve evidence at coarse granularities (entire images or scenes), creating a mismatch with fine-grained user queries and making failures unverifiable. We introduce GranuVistaVQA, a multimodal benchmark featuring real-world landmarks with element-level annotations across multiple viewpoints, capturing the partial observation challenge where individual images contain only subsets of entities. We further propose GranuRAG, a multi-granularity framework that treats visual elements as first-class retrieval units through three stages: element-level detection and classification, multi-granularity cross-modal alignment for evidence retrieval, and attribution-constrained generation. By grounding retrieval at the element level rather than relying on implicit attention, our approach enables transparent error diagnosis. Experiments demonstrate that GranuRAG achieves up to 29.2% improvement over six strong baselines for this task.


[21] Improving Multi-turn Dialogue Consistency with Self-Recall Thinking cs.CL | cs.AIPDF

Renning Pang, Tian Lan, Leyuan Liu, Xiaoming Huang, Piao Tong

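TL;DR: This paper proposes Self-Recall Thinking (SRT), a framework that addresses long-range contextual dependency and sparse informative signals in LLM-based multi-turn dialogue systems. SRT identifies helpful historical turns and uses them to generate contextually appropriate responses, enabling the model to selectively recall and reason over context during inference with interpretable recall steps and no external modules.

Details

Motivation: Existing LLM-based multi-turn dialogue systems struggle to track dependencies across non-adjacent turns, which undermines consistency and scalability. As conversations lengthen, key information becomes sparse and is buried in irrelevant context, while processing the entire dialogue history creates severe efficiency bottlenecks. Existing solutions either rely on high-latency external memory or lose fine-grained details through iterative summarization.

Result: Experiments on multiple datasets show that SRT improves F1 score by 4.7% and reduces end-to-end latency by 14.7%, balancing inference latency and accuracy and outperforming state-of-the-art baselines.

Insight: SRT's contribution is an endogenous reasoning process built in three steps (dependency construction, capability initialization, and reasoning improvement) that delivers interpretable recall and reasoning without external modules. This offers an efficient and accurate way to handle long-range dependencies in dialogue context while balancing latency and performance.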

Abstract: Large language model (LLM) based multi-turn dialogue systems often struggle to track dependencies across non-adjacent turns, undermining both consistency and scalability. As conversations lengthen, essential information becomes sparse and is buried in irrelevant context, while processing the entire dialogue history incurs severe efficiency bottlenecks. Existing solutions either rely on high-latency external memory or lose fine-grained details through iterative summarization. In this paper, we propose Self-Recall Thinking (SRT), a framework designed to address long-range contextual dependency and sparse informative signals in multi-turn dialogue. SRT identifies helpful historical turns and uses them to generate contextually appropriate responses, enabling the model to selectively recall and reason over context during inference. This process yields an endogenous reasoning process that integrates interpretable recall steps without external modules. SRT incorporates: (1) Dependency Construction: Generating and converting it into self-recall chains; (2) Capability Initialization: Training to enable reasoning chains with recall tokens capability; (3) Reasoning Improvement: Refining accuracy via verifiable rewards to optimize recall and reasoning for correct answers. Experiments on multiple datasets demonstrate that SRT improves F1 score by 4.7% and reduces end-to-end latency by 14.7% over prior methods, achieving a balance between reasoning latency and accuracy, and outperforming state-of-the-art baselines.


[22] Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment cs.CL | cs.AI | cs.LG | stat.MLPDF

Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim, Jeremy C. Weiss

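TL;DR: This paper proposes a retrieval-augmented multimodal alignment framework for reconstructing precise clinical timelines from clinical text and structured electronic health record (EHR) data. Using a graph-based multistep process, the method builds a temporal scaffold from central events in the text and calibrates it with retrieved structured data as external temporal evidence, combining the semantic richness of text with the temporal precision of tabular data.

Details

Motivation: In clinical timeline reconstruction, unstructured text lacks temporal precision while structured data misses many clinical events. The goal is to combine the strengths of both modalities for a more precise and more complete reconstruction of patient trajectories.

Result: On the i2m4 benchmark spanning MIMIC-III and MIMIC-IV, evaluated with instruction-tuned LLMs, the multimodal pipeline consistently improves absolute timestamp accuracy (AULTC) and improves temporal concordance across nearly all evaluated models over text-only unimodal reconstruction, without compromising event match rates.

Insight: The contribution is formulating timeline reconstruction as a graph-based multistep process with a retrieval-augmented mechanism that uses structured data to calibrate the timing of text events. The multimodal alignment framework compensates for the gaps of each single data source; the empirical gap analysis shows that 34.8% of text-derived events are entirely absent from the tabular records, underscoring the need for fusion.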

Abstract: Reconstructing precise clinical timelines is essential for modeling patient trajectories and forecasting risk in complex, heterogeneous conditions like sepsis. While unstructured clinical narratives offer semantically rich and contextually complete descriptions of a patient’s course, they often lack temporal precision and contain ambiguous event timing. Conversely, structured electronic health record (EHR) data provides precise temporal anchors but misses a substantial portion of clinically meaningful events. We introduce a retrieval-augmented multimodal alignment framework that bridges this gap to improve the temporal precision of absolute clinical timelines extracted from text. Our approach formulates timeline reconstruction as a graph-based multistep process: it first extracts central anchor events from narratives to build an initial temporal scaffold, places non-central events relative to this backbone, and then calibrates the timeline using retrieved structured EHR rows as external temporal evidence. Evaluated using instruction-tuned large language models on the i2m4 benchmark spanning MIMIC-III and MIMIC-IV, our multimodal pipeline consistently improves absolute timestamp accuracy (AULTC) and improves temporal concordance across nearly all evaluated models over unimodal text-only reconstruction, without compromising event match rates. Furthermore, our empirical gap analysis reveals that 34.8% of text-derived events are entirely absent from tabular records, demonstrating that aligning these modalities can produce a more temporally faithful and clinically informative reconstruction of patient trajectories than either source alone.


cs.CV [Back]

[23] Contrastive Multi-Modal Hypergraph Reasoning for 3D Crowd Mesh Recovery cs.CV | cs.GR | cs.MM | eess.IVPDF

Minghao Sun, Chongyang Xu, Yitao Xie, Buzhen Huang, Kun Li

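TL;DR: This paper proposes Contrastive Multi-modal Hypergraph Reasoning (CoMHR) for multi-person 3D mesh reconstruction in crowded scenes. The method fuses RGB semantic features, geometric priors, and occlusion-aware incomplete poses, introduces a pelvis depth indicator as a global spatial anchor, and builds a shared-topology hypergraph to model higher-order crowd dynamics. A hypergraph-based contrastive learning scheme enhances intra-modal discriminability and cross-modal orthogonality, so that global context propagates effectively and missing information can be inferred under severe occlusion.

Details

Motivation: Current multi-person 3D reconstruction methods typically rely on single-modality inputs that lack geometric guidance, and they often reconstruct individuals in isolation, neglecting the collective group context needed to resolve ambiguities in crowded scenes. This paper aims to overcome these limitations by jointly exploiting semantic, geometric, and pose cues.

Result: Extensive experiments on the Panoptic and GigaCrowd benchmarks confirm that the method achieves new state-of-the-art (SOTA) performance.

Insight: Contributions include: 1) fusing multimodal cues (RGB, geometry, pose) and introducing a pelvis depth indicator as a global spatial anchor; 2) building a shared-topology hypergraph to model higher-order crowd dynamics beyond pairwise constraints; 3) designing a hypergraph-based contrastive learning scheme that jointly enhances intra-modal discriminability and cross-modal orthogonality, which helps with occlusion and depth ambiguity.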

Abstract: Multi-person 3D reconstruction is pivotal for real-world interaction analysis, yet remains challenging due to severe occlusions and depth ambiguity. Current approaches typically rely on single-modality inputs, which inherently lack geometric guidance. Furthermore, these methods often reconstruct subjects in isolation, neglecting the collective group context essential for resolving ambiguities in crowded scenes. To address these limitations, we propose Contrastive Multi-modal Hypergraph Reasoning to synergize semantic, geometric, and pose cues for crowd reconstruction. We first initialize robust node representations by combining RGB features, geometric priors, and occlusion-aware incomplete poses. Additionally, we introduce a pelvis depth indicator as a global spatial anchor, aligning visual features with a metric-scale-agnostic depth ordering. Subsequently, we construct a shared-topology hypergraph that moves beyond pairwise constraints to model higher-order crowd dynamics. To improve feature fusion, we design a hypergraph-based contrastive learning scheme that jointly enhances intra-modal discriminability and enforces cross-modal orthogonality. This mechanism enables the network to propagate global context effectively, allowing it to infer missing information even under severe occlusion. Extensive experiments on the Panoptic and GigaCrowd benchmarks confirm that our method achieves new state-of-the-art performance. Code and pre-trained models are available at https://github.com/SunMH-try/CoMHR.


[24] CineMesh4D: Personalized 4D Whole Heart Reconstruction from Sparse Cine MRI cs.CV | cs.AIPDF

Xiaoyue Liu, Xiaohan Yuan, Mark Y Chan, Ching-Hui Sia, Lei Li

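TL;DR: This paper proposes CineMesh4D, a novel end-to-end 4D (3D+t) reconstruction pipeline that reconstructs patient-specific whole-heart meshes directly from multi-view 2D cine magnetic resonance imaging (cine MRI). The method introduces a differentiable rendering loss for supervision from multi-view sparse contours and a dual-context temporal block that fuses global and local cardiac temporal information to capture high-dimensional sequential patterns.

Details

Motivation: Accurate 3D+t whole-heart mesh reconstruction from cine MRI is clinically crucial but technically challenging. The difficulty comes from two coupled factors: the inherently sparse sampling of 3D cardiac anatomy by 2D image slices, and the tight coupling between cardiac shape and motion. Existing methods typically reconstruct only a subset of cardiac chambers or a single phase of the cardiac cycle, so they cannot deliver personalized whole-heart, full-cycle reconstruction.

Result: In quantitative and qualitative evaluations, CineMesh4D outperforms existing methods in reconstruction quality and motion consistency, providing a practical pathway for personalized real-time cardiac assessment.

Insight: The contribution is an end-to-end 4D whole-heart mesh reconstruction framework built around a differentiable rendering loss, which allows 3D+t meshes to be supervised directly from sparse 2D contours, and a dual-context temporal block that fuses global and local temporal information about cardiac motion, addressing the joint reconstruction of shape and motion from sparsely sampled data.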

Abstract: Accurate 3D+t whole-heart mesh reconstruction from cine MRI is a clinically crucial yet technically challenging task. The difficulty of this task arises from two coupled factors: inherently sparse sampling of 3D cardiac anatomy by 2D image slices and the tight coupling between cardiac shape and motion. Current cardiac image-to-mesh approaches typically reconstruct only a subset of cardiac chambers or a single phase of the cardiac cycle. In this work, we propose CineMesh4D, a novel end-to-end 4D (3D+t) pipeline that directly reconstructs patient-specific whole-heart mesh from multi-view 2D cine MRI via cross-domain mapping. Specifically, we introduce a differentiable rendering loss that enables supervision of 3D+t whole-heart mesh from multi-view sparse contours of cine MRI. Furthermore, we develop a dual-context temporal block that fuses global and local cardiac temporal information to capture high-dimensional sequential patterns. In quantitative and qualitative evaluations, CineMesh4D outperforms existing approaches in terms of reconstruction quality and motion consistency, providing a practical pathway for personalized real-time cardiac assessment. The code will be publicly released once the manuscript is accepted.


[25] Unified Pix Token And Word Token Generative Language Model cs.CVPDF

Haun Leung, ZiNan Wang

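TL;DR: This paper proposes a generative language model that unifies pixel tokens and word tokens, aiming to address the limitations of current ViT-based multimodal models in detailed visual understanding, such as recognizing small text or numbers in images. The model introduces per-pixel token embeddings, color folding, global conditional attention approximation, and unsupervised image pretraining.

Details

Motivation: ViT vision encoders trained with CLIP or SigLIP give multimodal models limited ability to understand visual detail, in particular making it hard to recognize small text or numbers in images.

Result: Unsupervised image pretraining experiments show that the model performs well even at small model size and with limited training data. The authors expect it to follow scaling laws, with performance continuing to improve as parameters and data increase.

Insight: The contribution is unifying pixel tokens and word tokens within a generative language model and introducing mechanisms such as per-pixel token embeddings, color folding, and global conditional attention approximation, offering a new direction for improving fine-grained visual understanding in multimodal models.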

Abstract: Since the emergence of the Vision Transformer (ViT), it has been widely used in generative language models and generative visual models. In particular, in current state-of-the-art open-source multimodal models, a ViT trained with the CLIP or SigLIP method serves as the vision encoder backbone that provides visual understanding capabilities. However, this approach limits the understanding of visual details, for example making it difficult to recognize small text or numbers in images. To address these issues, we propose a new model that unifies pix tokens and word tokens in a generative language model. The new model also features a token embedding for each pix of the image, color folding, global conditional attention approximation, and image unsupervised pretraining. We conducted image unsupervised pretraining experiments using our new model to explore its potential. The experimental results show that it performs well even with a small model and limited training data. We believe our model also conforms to the scaling law: as model parameters and training data increase, its performance will continue to improve.


[26] PVRF: All-in-one Adverse Weather Removal via Prior-modulated and Velocity-constrained Rectified Flow cs.CVPDF

Wei Dong, Han Zhou, Terry Ji, Guanhua Zhao, Shahab Asoodeh

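TL;DR: PVRF is a unified framework for adverse weather image restoration. Using a prior-modulated, velocity-constrained rectified flow, it integrates zero-shot soft weather perception with refinement to handle heterogeneous and unseen weather degradations in real-world images.

Details

Motivation: Real-world adverse weather image restoration is challenging because of heterogeneous and unseen degradations, and conventional distortion-driven training often yields overly smooth results.

Result: Across multiple benchmarks, PVRF improves both fidelity and perceptual quality over state-of-the-art baselines and shows strong cross-dataset generalization on single and combined degradations.

Insight: Contributions include: an AWR-QA module that uses frozen vision-language models to estimate soft weather probabilities and low-level attribute scores; conditioning the restoration networks through attribute-modulated normalization and weather-weighted adapters; and a residual rectified flow with perception-adaptive source perturbation and a terminal-consistent velocity parameterization to stabilize learning.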

Abstract: Adverse weather removal (AWR) in real-world images remains challenging due to heterogeneous and unseen degradations, while distortion-driven training often yields overly smooth results. We propose PVRF, a unified framework that integrates zero-shot soft weather perceptions with velocity-constrained rectified-flow refinement. PVRF introduces an AWR-specific question answering module (AWR-QA) that uses frozen vision–language models (VLMs) to estimate soft probabilities of weather types and low-level attribute scores. These perceptions condition restoration networks via attribute-modulated normalization (AMN) and weather-weighted adapters (WWA), producing an anchor estimate for refinement. We then learn a terminal-consistent residual rectified flow with perception-adaptive source perturbation and a terminal-consistent velocity parameterization to stabilize learning near the terminal regime. Extensive experiments show that PVRF improves both fidelity and perceptual quality over state-of-the-art baselines, with strong cross-dataset generalization on single and combined degradations. Code will be released at https://github.com/dongw22/PVRF.


[27] Evolving Layer-Specific Scalar Functions for Hardware-Aware Transformer Adaptation cs.CV | cs.ARPDF

Kieran Carrigg, Sigur de Vries, Amirhossein Sadough, Marcel van Gerven

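TL;DR: This paper proposes a hardware-aware framework that uses genetic programming (GP) to evolve layer-specific scalar functions for pre-trained Vision Transformer (ViT) models, replacing layer normalization layers, which are computationally complex and impose a global reduction bottleneck. Combined with a novel post-training re-alignment strategy, the method recovers high accuracy on ImageNet-1K quickly without retraining from scratch and significantly improves deployment efficiency on edge devices.

Details

Motivation: Deployment of ViTs on edge devices is severely hindered by the computational complexity and global reduction bottleneck of layer normalization. Existing methods substitute homogeneous, hardware-friendly scalar approximations, but these do not fit every layer's behavior optimally and rely on expensive model retraining.

Result: The evolved expressions approximate the target normalization behavior more accurately, capturing 91.6% of the variance (R²) versus 70.2% for homogeneous baselines. The modified architecture recovers 84.25% Top-1 accuracy on ImageNet-1K in only 20 epochs and establishes a favorable trade-off between arithmetic complexity and off-chip memory traffic.

Insight: The core contribution is using genetic programming to evolve heterogeneous, layer-specific scalar functions and pairing them with a post-training re-alignment strategy, enabling efficient hardware-aware adaptation without retraining from scratch. This suggests a route to efficient ViT deployment on edge accelerators: remove a specific hardware bottleneck through customized, lightweight layer replacements.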

Abstract: Vision Transformers (ViTs) achieve state-of-the-art performance on challenging vision tasks, but their deployment on edge devices is severely hindered by the computational complexity and global reduction bottleneck imposed by layer normalization. Recent methods attempt to bypass this by replacing normalization layers with hardware-friendly scalar approximations. However, these homogeneous replacements do not optimally fit to all layers’ behaviour and rely on expensive model retraining. In this work, we propose a highly efficient, hardware-aware framework that utilizes genetic programming (GP) to evolve heterogeneous, layer-specific scalar functions directly from pre-trained weights. Coupled with a novel post-training re-alignment strategy, our approach eliminates the need to retrain models from scratch entirely. Our evolved expressions accurately approximate the target normalization behaviours, capturing $91.6\%$ of the variance ($R^2$) compared to only $70.2\%$ for homogeneous baselines, allowing our modified architecture to recover $84.25\%$ Top-1 ImageNet-1K accuracy in only 20 epochs. By preserving this performance while eliminating the global reduction bottleneck, our approach establishes a highly favourable trade-off between arithmetic complexity and off-chip memory traffic, removing a primary barrier to the efficient deployment of ViTs on edge accelerators.
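
To make the variance-explained figure concrete, the sketch below measures how well an element-wise scalar function reproduces layer normalization on one vector. The `tanh(alpha * x)` candidate and the sample input are illustrative assumptions only; they are not the expressions evolved in the paper, nor its baseline.

```python
import math


def layer_norm(x, eps=1e-5):
    """Standard layer normalization without learned scale or shift.

    The mean and variance are global reductions over the feature vector.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def scalar_candidate(x, alpha=0.5):
    """Hypothetical element-wise replacement: no reduction over features."""
    return [math.tanh(alpha * v) for v in x]


def r_squared(target, approx):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(target) / len(target)
    ss_res = sum((t - a) ** 2 for t, a in zip(target, approx))
    ss_tot = sum((t - mean_t) ** 2 for t in target)
    return 1.0 - ss_res / ss_tot


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
target = layer_norm(x)
r2 = r_squared(target, scalar_candidate(x))
```

A per-layer search would try many candidate expressions and keep the one with the highest R² for that layer's activations, which is the quantity the abstract reports.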


[28] CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves cs.CV | cs.LGPDF

Amirreza Mohseni, Mona Mohammadi, Morteza Saghafian, Naser Talebizadeh Saradari

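TL;DR: This paper introduces CurveBench, a benchmark for hierarchical topological reasoning from visual input. It contains 756 images of pairwise non-intersecting Jordan curves across easy, polygonal, topographic-inspired, maze-like, and dense counting configurations. Each image is annotated with a rooted tree encoding the containment relations between planar regions. The task is framed as structured prediction: given an image, the model must recover the full rooted containment tree induced by the curves. Despite the task's visual simplicity, the strongest evaluated model, Gemini 3.1 Pro, reaches only 71.1% tree-generation accuracy on CurveBench-Easy and 19.1% on CurveBench-Hard. The authors also demonstrate the benchmark's utility through RLVR-style fine-tuning: the trained Qwen3-VL-8B model improves from 2.8% to 33.3% on CurveBench-Easy, exceeding GPT-5.4 and Claude Opus 4.5.

Details

Motivation: Existing models fall short at exact, hierarchical topological reasoning from visual input, in particular at understanding containment relations among nested Jordan curves.

Result: On CurveBench, Gemini 3.1 Pro reaches 71.1% tree-generation accuracy on the Easy subset and only 19.1% on the Hard subset. With RLVR fine-tuning, Qwen3-VL-8B improves from 2.8% to 33.3% on the Easy subset, exceeding GPT-5.4 and Claude Opus 4.5, but a large gap remains on the Hard subset.

Insight: The contribution is CurveBench, a visual benchmark focused on exact topological reasoning (specifically containment among nested curves), with the task formalized as structured prediction of a rooted tree. The benchmark shows that current advanced vision-language models have significant weaknesses on a seemingly simple geometric-topological reasoning task, and it offers a new direction for evaluating and improving geometric and topological perception. RLVR-style fine-tuning is also shown to improve performance on this kind of task.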

Abstract: We introduce CurveBench, a benchmark for hierarchical topological reasoning from visual input. CurveBench consists of 756 images of pairwise non-intersecting Jordan curves across easy, polygonal, topographic-inspired, maze-like, and dense counting configurations. Each image is annotated with a rooted tree encoding the containment relations between planar regions. We formulate the task as structured prediction: given an image, a model must recover the full rooted containment tree induced by the curves. Despite the visual simplicity of the task, the strongest evaluated model, Gemini 3.1 Pro, achieves only 71.1% tree-generation accuracy on CurveBench-Easy and 19.1% on CurveBench-Hard. We further demonstrate benchmark utility through RLVR-style fine-tuning of open-weight vision-language models. Our trained Qwen3-VL-8B model improves over Qwen-3-VL-8B-Thinking from 2.8% to 33.3% tree-generation accuracy on CurveBench-Easy, exceeding GPT-5.4 and Claude Opus 4.5 under our evaluation protocol. The remaining gap, especially on CurveBench-Hard, shows that exact topology-aware visual reasoning remains far from solved.
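
The target structure can be illustrated with a small geometric sketch. It assumes each curve is available as a simple polygon (a list of vertices), which is not how the benchmark presents its images, and all names are hypothetical. Because the curves are pairwise non-intersecting, testing a single vertex of one curve against another curve is enough to decide containment, and each curve's parent is the smallest curve that encloses it.

```python
def point_in_polygon(pt, poly):
    """Ray casting: count edge crossings of a ray going right from pt."""
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside


def area(poly):
    """Absolute polygon area via the shoelace formula."""
    s = sum(x1 * y2 - x2 * y1
            for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]))
    return abs(s) / 2


def containment_tree(curves):
    """Map each curve name to its parent; 'root' is the unbounded region."""
    parents = {}
    for name, poly in curves.items():
        enclosing = [other for other, other_poly in curves.items()
                     if other != name and point_in_polygon(poly[0], other_poly)]
        # The innermost enclosing curve is the one with the smallest area.
        parents[name] = min(enclosing, key=lambda n: area(curves[n]),
                            default="root")
    return parents


def square(cx, cy, half):
    """Axis-aligned square centered at (cx, cy), used as a stand-in curve."""
    return [(cx - half, cy - half), (cx + half, cy - half),
            (cx + half, cy + half), (cx - half, cy + half)]


curves = {
    "A": square(0, 0, 10),   # outer curve
    "B": square(-4, 0, 3),   # inside A
    "C": square(-4, 0, 1),   # inside B
    "D": square(5, 0, 2),    # inside A, sibling of B
    "E": square(30, 0, 2),   # separate top-level curve
}
tree = containment_tree(curves)
```

With exact geometry the tree follows mechanically; the benchmark's difficulty lies in recovering the same structure from pixels alone.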


[29] Venus-DeFakerOne: Unified Fake Image Detection & Localization cs.CVPDF

GuangJian Team

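TL;DR: This paper proposes DeFakerOne, a data-centric, unified foundation model for fake image detection and localization built on InternVL2 and SAM2. It targets the fragmentation of cross-domain fake image detection in the era of generative AI and provides both image-level detection and pixel-level localization.

Details

Motivation: Generative AI has made image forgery paradigms increasingly unified, yet research on fake image detection and localization remains fragmented and mismatched with these unified generation mechanisms. This paper aims to close the gap by addressing two challenges: understanding cross-domain artifact transfer and interference, and building a high-capacity unified foundation model.

Result: DeFakerOne achieves state-of-the-art performance, outperforming baselines on 39 forgery detection benchmarks and 9 localization benchmarks, and shows superior robustness against real-world perturbations and advanced generators such as GPT-Image-2.

Insight: The contribution is a unified foundation model architecture for cross-domain forgery detection and localization, together with a systematic analysis of data scaling laws, cross-domain artifact transfer-interference patterns, the necessity of fine-grained supervision, and preservation of original-resolution artifacts, which provides design guidance for scalable, robust, and unified FIDL systems.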

Abstract: In recent years, the rapid evolution of generative AI has fundamentally reshaped the paradigm of image forgery, breaking the traditional boundaries between document editing, natural image manipulation, DeepFake generation, and full-image AIGC synthesis. Despite this shift toward unified forgery generation, existing research in Fake Image Detection and Localization (FIDL) remains fragmented. This creates a mismatch between increasingly unified forgery generation mechanisms and the domain-specific detection paradigm. Bridging this mismatch poses two key challenges for FIDL: understanding cross-domain artifacts transfer and interference, and building a high-capacity unified foundation model for joint detection and localization. To address these challenges, we propose DeFakerOne, a data-centric, unified FIDL foundation model integrating InternVL2 and SAM2. DeFakerOne enables simultaneous image-level detection and pixel-level forgery localization across diverse scenarios. Extensive experiments demonstrate that DeFakerOne achieves state-of-the-art performance, outperforming baselines on 39 forgery detection benchmarks and 9 localization benchmarks. Furthermore, the model exhibits superior robustness against real-world perturbations and state-of-the-art generators such as GPT-Image-2. Finally, we provide a systematic analysis of data scaling laws, cross-domain artifacts transfer-interference patterns, the necessity of fine-grained supervision, and the original resolution artifacts preservation, highlighting the design principles for scalable, robust, and unified FIDL.


[30] DUET: Dual-Paradigm Adaptive Expert Triage with Single-cell Inductive Prior for Spatial Transcriptomics Prediction cs.CVPDF

Junchao Zhu, Ruining Deng, Junlin Guo, Tianyuan Yao, Chongyu Qu

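TL;DR: DUET is a dual-paradigm adaptive expert triage framework for predicting spatial transcriptomics from histology images. By combining parametric prediction with memory-based retrieval and using single-cell data as a prior constraint, it achieves state-of-the-art performance on datasets across varied gene scales.

Details

Motivation: Existing methods reduce spatial transcriptomics prediction to a simple morphology-to-expression mapping, overlooking that visual similarity does not guarantee molecular consistency, and they leave large-scale single-cell data resources underused. A single paradigm also struggles to balance expressive flexibility with biological fidelity.

Result: Extensive experiments on three public datasets show that DUET achieves state-of-the-art (SOTA) performance, with each proposed component contributing to the gains.

Insight: Contributions include: a dual paradigm in which regression and retrieval work in parallel, large-scale single-cell data used as a biological constraint to reduce visual ambiguity, and a lightweight adapter that dynamically adjusts branch preference to optimize performance.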

Abstract: Inferring spatially resolved gene expression from histology images offers a cost-effective complement to spatial transcriptomics (ST). However, existing methods reduce this task to a simple morphology-to-expression mapping, where visual similarity does not guarantee molecular consistency. Meanwhile, single-cell data has amassed rich resources far surpassing the scale of ST data, yet it remains underexplored in vision-omics modeling. Furthermore, current approaches commit to a monolithic paradigm with bottlenecks, unable to balance expressive flexibility with biological fidelity. To bridge these gaps, we propose DUET, a novel dual-paradigm framework that synergizes parametric prediction and memory-based retrieval under cellular inductive priors. DUET implements a parallel regression-retrieval paradigm, adaptively reconciling the outputs of its complementary pathways. To mitigate aleatoric vision ambiguity, we incorporate large-scale single-cell references to impose molecular states as biological constraints for faithful learning. Building upon structural refinement, we further design a lightweight adapter to dynamically assign branch preference across spatial contexts to achieve optimal performance. Extensive experiments on three public datasets across varied gene scales demonstrate that DUET achieves SOTA performance, with consistent gains contributed by each proposed component. Code is available at https://github.com/Junchao-Zhu/DUET


[31] SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection cs.CV | cs.ROPDF

Sandro Papais, Lezhou Feng, Charles Cossette, Lingting Ge

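TL;DR: SToRe3D is a sparsity framework for Vision Transformer (ViT)-based multi-view 3D object detection that targets the high inference latency caused by dense processing of 2D image tokens and 3D object queries. It jointly selects relevant 2D tokens and 3D queries and stores the filtered features for later reactivation, enabling efficient sparse computation.

Details

Motivation: Existing sparsity methods designed for 2D vision (such as pruning or merging image tokens) do not extend to full-model sparsity and do not handle 3D object queries, leaving ViT-based multi-view 3D detection with high inference latency. This paper aims to improve efficiency by jointly sparsifying 2D and 3D representations.

Result: Evaluated on nuScenes and the newly proposed nuScenes-Relevance benchmark, SToRe3D achieves up to 3x faster inference with marginal accuracy loss, establishing real-time large-scale ViT-based 3D detection while maintaining accuracy on planning-critical agents.

Insight: The contribution is a joint 2D-3D sparsity framework in which mutual 2D-3D relevance heads allocate compute to driving-critical content and preserve the remaining embeddings for reactivation, extending sparsity from 2D image tokens to 3D object queries across the full model.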

Abstract: Vision Transformers (ViTs) enable strong multi-view 3D detection but are limited by high inference latency from dense token and query processing across multiple views and large 3D regions. Existing sparsity methods, designed mainly for 2D vision, prune or merge image tokens but do not extend to full-model sparsity or address 3D object queries. We introduce SToRe3D, a relevance-aligned sparsity framework that jointly selects 2D image tokens and 3D object queries while storing filtered features for reactivation. Mutual 2D-3D relevance heads allocate compute to driving-critical content and preserve other embeddings. Evaluated on nuScenes and our new nuScenes-Relevance benchmark, SToRe3D achieves up to 3x faster inference with marginal accuracy loss, establishing real-time large-scale ViT-based 3D detection while maintaining accuracy on planning-critical agents.


[32] ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows cs.CV | cs.AI | cs.LG | cs.MAPDF

Alvaro Lopez Pellicer, Plamen Angelov, Marwan Bukhari, Yi Li, Eduardo Soares

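TL;DR: ProtoMedAgent is a framework for multimodal clinical report generation that achieves clinical interpretability through privacy-aware agentic workflows. Operating on a frozen prototype network backbone, it distills visual and tabular features into a discrete semantic memory and constrains generation with a strict neuro-symbolic bottleneck and a reflective Scribe-Critic loop to prevent hallucination. It also introduces a semantic privacy gate based on k-anonymity and ℓ-diversity to limit data disclosure risk.

Details

Motivation: The continuous outputs of interpretable prototype networks lack the semantic structure that medical documentation requires, and standard retrieval-augmented generation (RAG) triggers "retrieval sycophancy", in which LLMs hallucinate post-hoc rationalizations to align with visual predictions.

Result: On a clinical cohort of 4,160 patients, ProtoMedAgent achieves 91.2% Comparison Set Faithfulness, well above the 46.2% of standard RAG. In addition, by leveraging a binding ℓ-diversity phase transition, it reduces artifact-level membership inference risk by an absolute 9.8%.

Insight: Contributions include: formalizing multimodal clinical reporting as an iterative, zero-gradient test-time optimization problem over a strict neuro-symbolic bottleneck; constraining generation with exact set-theoretic differentials and a reflective Scribe-Critic loop so that unsupported narrative claims are precluded; and introducing a semantic privacy gate governed by k-anonymity and ℓ-diversity to bound data disclosure. Together these improve generation faithfulness while strengthening privacy protection.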

Abstract: While interpretable prototype networks offer compelling case-based reasoning for clinical diagnostics, their raw continuous outputs lack the semantic structure required for medical documentation. Bridging this gap via standard Retrieval-Augmented Generation (RAG) routinely triggers "retrieval sycophancy," where Large Language Models (LLMs) hallucinate post-hoc rationalizations to align with visual predictions. We introduce ProtoMedAgent, a framework that formalizes multimodal clinical reporting as an iterative, zero-gradient test-time optimization problem over a strict neuro-symbolic bottleneck. Operating on a frozen prototype backbone, we distill latent visual and tabular features into a discrete semantic memory. Online generation is strictly constrained by exact set-theoretic differentials and a reflective Scribe-Critic loop, mathematically precluding unsupported narrative claims. To safely bound data disclosure, we introduce a semantic privacy gate governed by $k$-anonymity and $\ell$-diversity. Evaluated on a 4,160-patient clinical cohort, ProtoMedAgent achieves 91.2% Comparison Set Faithfulness where it fundamentally outperforms standard RAG (46.2%). ProtoMedAgent additionally leverages a binding $\ell$-diversity phase transition to systematically reduce artifact-level membership inference risks by an absolute 9.8%.
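
The two privacy criteria named in the abstract have standard textbook definitions that are easy to check on a small table. The sketch below is a generic gate under those definitions (with the simple "distinct values" form of ℓ-diversity), not the paper's semantic privacy gate; the field names, labels, and thresholds are hypothetical.

```python
from collections import defaultdict


def privacy_gate(records, quasi_identifiers, sensitive, k, l):
    """Return True if every equivalence class is k-anonymous and l-diverse.

    An equivalence class is the set of records that share the same values
    for all quasi-identifier fields. k-anonymity requires at least k records
    per class; distinct l-diversity requires at least l different sensitive
    values per class.
    """
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[q] for q in quasi_identifiers)
        groups[key].append(rec[sensitive])
    return all(len(vals) >= k and len(set(vals)) >= l
               for vals in groups.values())


# Hypothetical comparison set of five patients.
cohort = [
    {"age_band": "60-69", "sex": "F", "label": "A"},
    {"age_band": "60-69", "sex": "F", "label": "B"},
    {"age_band": "60-69", "sex": "F", "label": "C"},
    {"age_band": "70-79", "sex": "M", "label": "A"},
    {"age_band": "70-79", "sex": "M", "label": "A"},
]
quasi = ["age_band", "sex"]
passes_k_only = privacy_gate(cohort, quasi, "label", k=2, l=1)
passes_k_and_l = privacy_gate(cohort, quasi, "label", k=2, l=2)
```

The second class satisfies 2-anonymity but holds a single label, so anyone who can place a patient in that class learns the label; requiring ℓ = 2 blocks disclosure in that case.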


[33] PanoPlane: Plane-Aware Panoramic Completion for Sparse-View Indoor 3D Gaussian Splatting cs.CVPDF

Adil Qureshi, Dongki Jung, Jaehoon Choi, Dinesh Manocha

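TL;DR: PanoPlane is a high-fidelity method for sparse-view indoor novel view synthesis that reconstructs closed room geometry through panoramic scene completion. It uses 360° panoramic completion to condition generation on the full spatial layout, and proposes a training-free attention steering mechanism that directs the diffusion model's internal attention toward the planar surfaces detected in the scene, replacing unconstrained hallucination with geometrically consistent surface extrapolation. The resulting panoramic completions supervise 3D Gaussian Splatting, enabling accurate novel-view synthesis in unobserved regions from as few as three input views.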

Details

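Motivation: Perspective-based methods with a limited field of view struggle to reconstruct complete, closed room geometry in sparse-view indoor novel view synthesis, because the training views they generate cover only part of the scene.

Result: On Replica, ScanNet++, and Matterport3D with 3, 6, and 9 input views, the method achieves state-of-the-art novel view synthesis quality, improving PSNR by up to +17.8% over the current best baseline without any training or fine-tuning of the diffusion model.

Insight: The main contribution is Layout Anchored Attention Steering, a training-free mechanism that steers the diffusion model's attention toward geometric planes during generation, so that surfaces are extrapolated from observed content on a geometric basis rather than hallucinated without constraint, which gives 3D reconstruction a more accurate supervision signal. Combining panoramic completion with planar geometric constraints, and steering the internal attention of a pretrained diffusion model to do so, is an efficient approach that needs no additional training.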

Abstract: We present PanoPlane, an approach for high-fidelity sparse-view indoor novel view synthesis that reconstructs closed room geometry via panoramic scene completion. Unlike perspective-based methods that generate training views from limited fields of view, PanoPlane leverages $360^{\circ}$ panoramic completion to condition the generative process on the full spatial layout. We propose Layout Anchored Attention Steering, a training-free mechanism that steers attention within the diffusion model's internal representation toward the scene's detected planar surfaces at inference time. By directing each unobserved region's attention toward geometrically consistent observed content, our method replaces unconstrained hallucination with grounded surface extrapolation. The resulting panoramic completions provide supervision for 3D Gaussian Splatting, enabling accurate novel-view synthesis across unobserved regions from as few as three input views. Experiments on Replica, ScanNet++, and Matterport3D demonstrate state-of-the-art novel view synthesis quality across 3, 6, and 9 input views, achieving up to $+17.8\%$ improvement in PSNR over the current state-of-the-art baseline without any training or fine-tuning of the diffusion model.


[34] TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion cs.CVPDF

Nurislam Tursynbek, Zhiqiang Lao, Heather Yu, Gedas Bertasius, Marc Niethammer

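TL;DR: This paper proposes TeDiO, a training-free, inference-time optimization method that addresses temporal inconsistency in videos generated by text-to-video diffusion models, such as flickering, drifting, or unstable motion. It analyzes the regularity of temporal diagonals in the model's internal self-attention maps, identifies unstable regions, and applies lightweight latent updates, strengthening temporal consistency without modifying model weights or using external motion supervision.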

Details

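Motivation: Existing text-to-video diffusion transformers generate visually compelling frames but often lack temporal coherence, producing flickering, drifting, or unstable motion. The authors find that these failures leave a clear imprint in the model's internal self-attention maps: incoherent videos show irregular, fragmented temporal diagonals, whereas stable motion corresponds to smooth, band-diagonal patterns.

Result: Across multiple video diffusion models (e.g., Wan2.1, CogVideoX), TeDiO markedly improves motion smoothness while preserving per-frame visual quality, offering an efficient plug-and-play way to improve dynamic realism in modern video generation systems.

Insight: The contribution is to link temporal-coherence failures to temporal diagonal patterns in the model's internal self-attention maps, and to propose a training-free, inference-time method that reinforces temporal consistency by regularizing those attention patterns. The method is lightweight and leaves model weights unchanged, which makes it broadly applicable.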

Abstract: Recent text-to-video diffusion transformers generate visually compelling frames, yet still struggle with temporal coherence, often producing flickering, drifting, or unstable motion. We show that these failures leave a clear imprint inside the model: incoherent videos consistently exhibit irregular, fragmented temporal diagonals in their intermediate self-attention maps, whereas stable motion corresponds to smooth, band-diagonal patterns. Building on this observation, we introduce TeDiO, a training-free, inference-time method that reinforces temporal consistency by regularizing these internal attention patterns. TeDiO estimates diagonal smoothness, identifies unstable regions, and performs lightweight latent updates that promote coherent frame-to-frame dynamics, without modifying model weights or using external motion supervision. Across multiple video diffusion models (e.g., Wan2.1, CogVideoX), TeDiO delivers markedly smoother motion while preserving per-frame visual quality, offering an efficient plug-and-play approach to improving dynamic realism in modern video generation systems.
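
The abstract says TeDiO "estimates diagonal smoothness" but does not give the estimator. The sketch below is one plausible stand-in, not the authors' formula: it reads a temporal diagonal out of a frame-by-frame attention matrix and scores its roughness as the mean squared difference between consecutive entries. The names `temporal_diagonal` and `diagonal_roughness` and the toy matrices are assumptions made for illustration.

```python
def temporal_diagonal(attn, offset=1):
    """Entries attn[i][i + offset]: how strongly frame i attends to frame i + offset."""
    return [attn[i][i + offset] for i in range(len(attn) - offset)]

def diagonal_roughness(attn, offset=1):
    """Mean squared change along one temporal diagonal (0.0 for a perfectly even band)."""
    diag = temporal_diagonal(attn, offset)
    if len(diag) < 2:
        return 0.0
    return sum((b - a) ** 2 for a, b in zip(diag, diag[1:])) / (len(diag) - 1)

# Toy 4-frame attention maps: an even band versus a fragmented one.
smooth = [[0.5 if j == i + 1 else 0.0 for j in range(4)] for i in range(4)]
fragmented = [row[:] for row in smooth]
fragmented[0][1], fragmented[1][2], fragmented[2][3] = 0.9, 0.1, 0.8

smooth_score = diagonal_roughness(smooth)
fragmented_score = diagonal_roughness(fragmented)
```

A score like this could flag which frame transitions look unstable; how TeDiO turns such a signal into latent updates is not described in the abstract and is not modeled here.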


[35] You Only Landmark Once: Lightweight U-Net Face Super Resolution with YOLO-World Landmark Heatmaps cs.CVPDF

Riccardo Carraro, Anna Briotto, Endi Hysa, Marco Fiorucci, Lamberto Ballan

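TL;DR: This paper proposes a lightweight U-Net architecture for face image super-resolution that reconstructs 128×128 face images from severely degraded 16×16 inputs, an 8× magnification. The core contribution is to use landmark heatmaps generated by the YOLO-World open-vocabulary object detector as a supervision signal, forming a heatmap-guided loss that needs no auxiliary training or alignment network.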

Details

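Motivation: Under extreme upscaling factors, existing methods rely on heavy network architectures, adversarial training, or separate alignment networks, which raises model complexity and computational cost. The goal is lightweight, efficient face super-resolution.

Result: Experiments on the aligned CelebA dataset show that the proposed heatmap-guided loss consistently improves quantitative metrics and produces sharper, more realistic reconstructions, demonstrating that lightweight networks can exploit detection-driven priors for perceptually convincing extreme upscaling.

Insight: The contribution is an auxiliary-training-free supervision strategy that directly reuses the outputs of an open-vocabulary detector (YOLO-World) as weights on reconstruction error in semantically important regions, improving the reconstruction quality of a lightweight network while keeping training and inference efficient and avoiding adversarial training or extra computational overhead.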

Abstract: Face image super-resolution aims to recover high-resolution facial images from severely degraded inputs. Under extreme upscaling factors, fine facial details are often lost, making accurate reconstruction challenging. Existing methods typically rely on heavy network architectures, adversarial training schemes, or separate alignment networks, increasing model complexity and computational cost. To address these issues, we propose a lightweight U-Net-based architecture designed to reconstruct $128 \times 128$ facial images from severely degraded $16 \times 16$ inputs, achieving an $8\times$ magnification. A key contribution is a novel auxiliary-training-free supervision strategy that leverages heatmaps generated by YOLO-World, an open-vocabulary object detector, to localize key facial features such as eyes, nose, and mouth. These heatmaps are converted into spatial weights to form a heatmap-guided loss that emphasizes reconstruction errors in semantically important regions. Unlike prior methods that require dedicated landmark or alignment networks, our approach directly reuses detector outputs as supervision, maintaining an efficient training and inference pipeline. Experiments on the aligned CelebA dataset demonstrate that the proposed loss consistently improves quantitative metrics and produces sharper, more realistic reconstructions. Overall, our results show that lightweight networks can effectively exploit detection-driven priors for perceptually convincing extreme upscaling, without adversarial training or increased computational cost.
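
To make the idea of converting heatmaps into spatial weights concrete, here is a minimal sketch of a heatmap-weighted L1 loss on plain nested lists. The weighting rule `1 + alpha * heat` and the normalization by the weight sum are assumptions chosen for illustration; the abstract does not state the exact form the authors use.

```python
def heatmap_weighted_l1(pred, target, heatmap, alpha=1.0):
    """Weighted mean absolute error; pixels with higher heat count more.

    pred, target, heatmap: 2D lists of equal shape, heat values assumed in [0, 1].
    The weight 1 + alpha * heat is a hypothetical choice, not the paper's formula.
    """
    total, weight_sum = 0.0, 0.0
    for p_row, t_row, h_row in zip(pred, target, heatmap):
        for p, t, h in zip(p_row, t_row, h_row):
            w = 1.0 + alpha * h
            total += w * abs(p - t)
            weight_sum += w
    return total / weight_sum

target = [[0.0, 0.0], [0.0, 0.0]]
heatmap = [[1.0, 0.0], [0.0, 0.0]]          # only the top-left pixel is a "landmark"
error_on_landmark = [[1.0, 0.0], [0.0, 0.0]]
error_elsewhere = [[0.0, 1.0], [0.0, 0.0]]

loss_landmark = heatmap_weighted_l1(error_on_landmark, target, heatmap)
loss_elsewhere = heatmap_weighted_l1(error_elsewhere, target, heatmap)
```

The same unit error costs twice as much when it falls on the landmark pixel, which is the behavior the abstract describes as emphasizing semantically important regions.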


[36] CoReDiT: Spatial Coherence-Guided Token Pruning and Reconstruction for Efficient Diffusion Transformers cs.CVPDF

Zhuojin Li, Hsin-Pai Cheng, Hong Cai, Shizhong Han, Fatih Porikli

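TL;DR: This paper proposes CoReDiT, a structured token pruning framework for Diffusion Transformers (DiTs) that aims to reduce their high computational cost. It uses a linear-time spatial coherence score to estimate local redundancy in the latent token lattice and skips highly coherent (redundant) tokens in self-attention. To maintain a dense representation and avoid visual discontinuities, it reconstructs the skipped attention outputs through coherence-guided aggregation of spatially neighboring retained tokens. It also introduces a progressive, block-adaptive pruning schedule.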

Details

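Motivation: Diffusion Transformers (DiTs) generate high-quality images and videos, but their high computational cost limits scalability and on-device deployment. This paper targets the computational efficiency of DiTs.

Result: On state-of-the-art diffusion backbones including PixArt-α and MagicDrive-V2, CoReDiT reduces self-attention FLOPs by up to 55% and speeds up inference by 1.33x on cloud GPUs and 1.72x on mobile NPUs while maintaining high visual quality. It also increases on-device memory headroom, enabling higher-resolution generation.

Insight: The main contributions are: 1) a spatial coherence score for efficiently identifying and pruning redundant tokens; 2) coherence-guided aggregation of neighboring tokens to reconstruct the outputs of pruned tokens and preserve generation quality; 3) a progressive, block-adaptive pruning schedule that allocates the pruning budget according to the degree of redundancy. This offers a structured-pruning perspective on efficient diffusion model design.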

Abstract: Diffusion Transformers (DiTs) deliver remarkable image and video generation quality but incur high computational cost, limiting scalability and on-device deployment. We introduce CoReDiT, a structured token pruning framework for DiTs across vision tasks. CoReDiT uses a linear-time spatial coherence score to estimate local redundancy in the latent token lattice and skips high coherence (redundant) tokens in self-attention. To maintain a dense representation and avoid visual discontinuities, we reconstruct skipped attention outputs via coherence-guided aggregation of spatially neighboring retained tokens. We further introduce a progressive, block-adaptive pruning schedule that increases pruning gradually and allocates larger budgets to blocks and denoising steps with higher redundancy. Across state-of-the-art diffusion backbones including PixArt-α and MagicDrive-V2, CoReDiT achieves up to 55% self-attention FLOPs reduction and inference speedups of 1.33x on cloud GPUs and 1.72x on mobile NPUs, while maintaining high visual quality. Notably, CoReDiT also increases on-device memory head-room, enabling higher-resolution generation.
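
The abstract describes the spatial coherence score only as linear-time and local. One simple score with those properties, offered here purely as an assumed illustration, is the mean cosine similarity between each token and its 4-connected grid neighbors; tokens that score high resemble their surroundings and are candidates for skipping. The function names and the toy grid are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def coherence_scores(grid):
    """Mean cosine similarity of each token to its up/down/left/right neighbors.

    grid: H x W nested list of token vectors. Each token touches at most four
    neighbors, so the cost is linear in the number of tokens.
    """
    h, w = len(grid), len(grid[0])
    scores = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sims = [
                cosine(grid[y][x], grid[ny][nx])
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                if 0 <= ny < h and 0 <= nx < w
            ]
            scores[y][x] = sum(sims) / len(sims) if sims else 0.0
    return scores

# Three identical tokens and one outlier in the bottom-right corner.
grid = [[[1.0, 0.0], [1.0, 0.0]],
        [[1.0, 0.0], [0.0, 1.0]]]
scores = coherence_scores(grid)
```

In this toy grid the top-left token is fully coherent with its neighbors and the outlier is not, so a threshold on the score would skip the former and retain the latter.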


[37] Towards Real-Time Autonomous Navigation: Transformer-Based Catheter Tip Tracking in Fluoroscopy cs.CV | cs.LGPDF

Harry Robertshaw, Yanghe Hao, Weiyuan Deng, Benjamin Jackson, S. M. Hadi Sadati

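TL;DR: This paper develops and evaluates a real-time catheter tip tracking system for fluoroscopic imaging, intended to support autonomous navigation by reinforcement-learning-based robotic systems. The authors design a multi-threaded processing pipeline and compare several deep learning segmentation models; the two-class SegFormer model performs best on moderate-complexity fluoroscopic video data.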

Details

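Motivation: Mechanical thrombectomy improves stroke outcomes, but its use is limited by a lack of local treatment access. Reinforcement-learning-based robotic systems could alleviate this through autonomous navigation, but they require live tracking of device tip coordinates. Fluoroscopic imaging poses challenges such as low contrast, noise, and device occlusion, so a robust real-time tracking method is needed.

Result: On manually labeled moderate-complexity fluoroscopic video data, the two-class SegFormer achieves a mean absolute error of 4.44 mm, outperforming U-Net (4.60 mm), U-Net+Transformer (6.20 mm), and all three-class models (5.19-7.74 mm). On segmentation benchmarks, the system exceeds existing CathAction results, improving Dice scores for three-class segmentation by up to 5% and reaching state-of-the-art performance.

Insight: The contribution is a complete real-time tracking framework that combines multi-threaded processing, deep learning segmentation (in particular the SegFormer architecture), and a post-processing pipeline of two-step component filtering, one-pixel medial skeletonization, and greedy arc-length path following. The framework maintains stable performance under challenging imaging conditions and provides a reliable foundation for RL-based autonomous navigation.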

Abstract: Purpose: Mechanical thrombectomy (MT) improves stroke outcomes, but is limited by a lack of local treatment access. Widespread distribution of reinforcement learning (RL)-based robotic systems can be used to alleviate this challenge through autonomous navigation, but current RL methods require live device tip coordinate tracking to function. This paper aims to develop and evaluate a real-time catheter tip tracking pipeline under fluoroscopy, addressing challenges such as low contrast, noise, and device occlusion. Methods: A multi-threaded pipeline was designed, incorporating frame reading, preprocessing, inference, and post-processing. Deep learning segmentation models, including U-Net, U-Net+Transformer, and SegFormer, were trained and benchmarked using two-class and three-class formulations. Post-processing involved two-step component filtering, one-pixel medial skeletonization, and greedy arc-length path following with contour fall-back. Results: On manually labeled moderate-complexity fluoroscopic video data, the two-class SegFormer achieved a mean absolute error of 4.44 mm, outperforming U-Net (4.60 mm), U-Net+Transformer (6.20 mm) and all three-class models (5.19-7.74 mm). On segmentation benchmarks, the system exceeded state-of-the-art CathAction results with improvements of up to +5% in Dice scores for three-class segmentation. Conclusion: The results demonstrate that the proposed multi-threaded tracking framework maintains stable performance under challenging imaging conditions, outperforming prior benchmarks, while providing a reliable and efficient foundation for RL-based autonomous MT navigation.
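
The post-processing step named "greedy arc-length path following" is not detailed in the abstract. A simplified guess at its core is sketched below: starting from a known base pixel on a one-pixel-wide skeleton, repeatedly step to the nearest unvisited 8-connected neighbor, accumulate arc length, and report the last pixel reached as the tip. The function `follow_skeleton`, its tie-breaking, and the omission of the contour fall-back are all assumptions.

```python
import math

def follow_skeleton(pixels, start):
    """Greedily walk a one-pixel-wide skeleton from `start`.

    pixels: iterable of (row, col) skeleton coordinates, including `start`.
    Returns (tip, arc_length_in_pixels). Axis-aligned steps (length 1) are
    preferred over diagonal steps (length sqrt(2)) when both are available.
    """
    remaining = set(pixels)
    remaining.discard(start)
    current, length = start, 0.0
    while True:
        candidates = [
            (current[0] + dy, current[1] + dx)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0) and (current[0] + dy, current[1] + dx) in remaining
        ]
        if not candidates:
            return current, length
        nxt = min(candidates, key=lambda p: math.dist(p, current))
        length += math.dist(nxt, current)
        remaining.discard(nxt)
        current = nxt

# Toy skeleton: three horizontal steps, then one diagonal step to the tip.
skeleton = [(0, 0), (0, 1), (0, 2), (1, 3)]
tip, arc_length = follow_skeleton(skeleton, start=(0, 0))
```

Converting the pixel arc length or tip position to millimetres would need the detector's pixel spacing, which is not given in the abstract.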


[38] PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation cs.CV | cs.AIPDF

Yidong Huang, Zun Wang, Han Lin, Dong-Ki Kim, Shayegan Omidshafiei

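TL;DR: This paper proposes PhyMotion, a structured 3D motion reward for physics-grounded human video generation. It recovers SMPL body meshes from generated videos, retargets them onto a humanoid in the MuJoCo physics simulator, and evaluates motion quality along three axes (kinematic plausibility, contact and balance consistency, and dynamic feasibility), providing a more reliable reward signal for physical feasibility.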

Details

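Motivation: Existing video generation rewards rely mainly on 2D perceptual signals, cannot reliably assess the realism of human motion, and often assign high scores to floating bodies or physically impossible movements. The paper addresses the lack of a reliable reward signal that explicitly models 3D body state, contact, and dynamics.

Result: Experiments show that PhyMotion correlates more strongly with human judgments than existing reward formulations. In RL-based post-training, optimizing PhyMotion yields larger and more consistent improvements than optimizing existing rewards, improving motion realism for both autoregressive and bidirectional video generators under automatic metrics and blind human evaluation (+68 Elo gain).

Insight: The contribution is a structured, fine-grained 3D motion reward that grounds recovered 3D human trajectories in a physics simulator and evaluates them along multiple dimensions of physical feasibility. Its combination of 3D reconstruction, physics simulation, and interpretable multi-axis evaluation offers a new way to supervise the physical plausibility of generative models.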

Abstract: Generating realistic human motion is a central yet unsolved challenge in video generation. While reinforcement learning (RL)-based post-training has driven recent gains in general video quality, extending it to human motion remains bottlenecked by a reward signal that cannot reliably score motion realism. Existing video rewards primarily rely on 2D perceptual signals, without explicitly modeling the 3D body state, contact, and dynamics underlying articulated human motion, and often assign high scores to videos with floating bodies or physically implausible movements. To address this, we propose PhyMotion, a structured, fine-grained motion reward that grounds recovered 3D human trajectories in a physics simulator and evaluates motion quality along multiple dimensions of physical feasibility. Concretely, we recover SMPL body meshes from generated videos, retarget them onto a humanoid in the MuJoCo physics simulator, and evaluate the resulting motion along three axes: kinematic plausibility, contact and balance consistency, and dynamic feasibility. Each component provides a continuous and interpretable signal tied to a specific aspect of motion quality, allowing the reward to capture which aspects of motion are physically correct or violated. Experiments show that PhyMotion achieves stronger correlation with human judgments than existing reward formulations. These gains carry over to RL-based post-training, where optimizing PhyMotion leads to larger and more consistent improvements than optimizing existing rewards, improving motion realism across both autoregressive and bidirectional video generators under both automatic metrics and blind human evaluation (+68 Elo gain). Ablations show that the three axes provide complementary supervision signals, while the reward preserves overall video generation quality with only modest training overhead.


[39] Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers cs.CVPDF

Kanghyun Baek, Jaihyun Lew, Chaehun Shin, Jungbeom Lee, Sungroh Yoon

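TL;DR: This paper addresses concept omission, a frequent failure of Multimodal Diffusion Transformers (MM-DiTs) in text-to-image generation, and proposes a method to diagnose and correct it. By linearly probing text tokens, the authors find an "omission signal" in the text embeddings that represents the absence of target concepts, and on that basis propose Omission Signal Intervention (OSI), which amplifies the signal to actively catalyze generation of the missing concepts. Experiments on FLUX.1-Dev and SD3.5-Medium show that OSI significantly alleviates concept omission, even in extreme scenarios.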

Details

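Motivation: Multimodal Diffusion Transformers frequently suffer from concept omission in text-to-image generation: specified objects or attributes fail to appear in the generated image.

Result: Comprehensive experiments on FLUX.1-Dev and SD3.5-Medium show that OSI significantly alleviates concept omission and performs well even in extreme scenarios; the abstract does not state whether it reaches state-of-the-art performance.

Insight: The contribution is to identify an "omission signal" in text embeddings through linear probing and to use that signal for active intervention that corrects missing content. The approach provides an interpretable diagnosis-and-correction mechanism that may apply to concept control in other generative models.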

Abstract: Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-to-image generation, yet they frequently suffer from concept omission, where specified objects or attributes fail to emerge in the generated image. By performing linear probing on text tokens, we demonstrate that text embeddings can distinguish a characteristic "omission signal" representing the absence of target concepts. Leveraging this insight, we propose Omission Signal Intervention (OSI), which amplifies the omission signal to actively catalyze the generation of missing concepts. Comprehensive experiments on FLUX.1-Dev and SD3.5-Medium demonstrate that OSI significantly alleviates concept omission even in extreme scenarios.
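
The abstract states that OSI amplifies the omission signal but not how. A generic way to amplify a signal found by a linear probe, shown here only as an assumed illustration, is to rescale the component of an embedding that lies along the probe's direction while leaving the orthogonal part untouched. The function `amplify_along`, the direction vector, and the scale are hypothetical.

```python
import math

def amplify_along(embedding, direction, scale):
    """Multiply the component of `embedding` along `direction` by `scale`.

    direction: a nonzero vector, e.g. the weight vector of a linear probe (assumed).
    The part of the embedding orthogonal to `direction` is left unchanged.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    coeff = sum(e * u for e, u in zip(embedding, unit))
    return [e + (scale - 1.0) * coeff * u for e, u in zip(embedding, unit)]

token_embedding = [2.0, 3.0]
probe_direction = [1.0, 0.0]   # hypothetical "omission signal" direction
steered = amplify_along(token_embedding, probe_direction, scale=3.0)
```

With `scale=1.0` the embedding is returned unchanged, which makes the intervention strength easy to sweep.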


[40] CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL cs.CVPDF

Zhenyang Ni, Yijiang Li, Ruochen Jiao, Simon Sinong Zhan, Sipeng Chen

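TL;DR: This paper proposes the CreFlow framework, which uses a compositional constraint-based reward model and online reinforcement-learning post-training to correct video generation models that violate physical constraints in embodied manipulation tasks, improving the logical consistency of generated videos and the task execution success rate.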

Details

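Motivation: Video generation models trained on heterogeneous data can produce visually plausible sequences that violate physical constraints, and the reward functions of conventional video RL cannot logically verify compositional task specifications.

Result: Across eight bimanual manipulation tasks, CreFlow yields reward judgments better aligned with human and simulator success labels than existing methods and improves downstream execution success by 23.8 percentage points.

Insight: The contributions are: a compositional reward model that automatically formalizes task requirements as Linear Temporal Logic constraints; a credit-aware NFT loss that confines the update to reward-relevant regions; and a corrective reflow loss that uses within-group positive samples as an explicit estimate of the correction direction to stabilize and accelerate training.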

Abstract: Video generation models trained on heterogeneous data with likelihood-surrogate objectives can produce visually plausible rollouts that violate physical constraints in embodied manipulation. Although reinforcement-learning post-training offers a natural route to adapting VGMs, existing video-RL rewards often reduce each rollout to a low-level visual metric, whereas manipulation video evaluation requires logic-based verification of whether the rollout satisfies a compositional task specification. To fill this gap, we introduce a compositional constraint-based reward model for post-training embodied video generation models, which automatically formulates task requirements as a composition of Linear Temporal Logic constraints, providing faithful rewards and localized error information in generated videos. To achieve effective improvement in high-dimensional video generation using these reward signals, we further propose CreFlow, a novel online RL framework with two key designs: i) a credit-aware NFT loss that confines the RL update to reward-relevant regions, preventing perturbations to unrelated regions during post-training; and ii) a corrective reflow loss that leverages within-group positive samples as an explicit estimate of the correction direction, stabilizing and accelerating training. Experiments show that CreFlow yields reward judgments better aligned with human and simulator success labels than existing methods and improves downstream execution success by 23.8 percentage points across eight bimanual manipulation tasks.
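
To illustrate what checking a rollout against a composition of Linear Temporal Logic constraints can look like, the sketch below evaluates the standard operators "always", "eventually", and "until" over a finite trace of per-frame predicates. The finite-trace semantics, the pick-and-lift specification, and the state fields are assumptions made for this example; the abstract does not describe the paper's constraint language or how frame-level predicates are extracted.

```python
def always(trace, pred):
    """G pred: pred holds in every state of the finite trace."""
    return all(pred(s) for s in trace)

def eventually(trace, pred):
    """F pred: pred holds in at least one state."""
    return any(pred(s) for s in trace)

def until(trace, hold, goal):
    """hold U goal: goal is reached, and hold is true in every state before it."""
    for state in trace:
        if goal(state):
            return True
        if not hold(state):
            return False
    return False

def first_violation(trace, pred):
    """Index of the first state where pred fails, or None (localized error info)."""
    for i, state in enumerate(trace):
        if not pred(state):
            return i
    return None

def pick_and_lift_ok(trace):
    """Hypothetical spec: never drop, eventually lift, and do not lift before grasping."""
    return (always(trace, lambda s: not s["dropped"])
            and eventually(trace, lambda s: s["lifted"])
            and until(trace, lambda s: not s["lifted"], lambda s: s["grasped"]))

good = [
    {"grasped": False, "lifted": False, "dropped": False},
    {"grasped": True, "lifted": False, "dropped": False},
    {"grasped": True, "lifted": True, "dropped": False},
]
bad = good + [{"grasped": False, "lifted": False, "dropped": True}]

good_ok = pick_and_lift_ok(good)
bad_ok = pick_and_lift_ok(bad)
bad_frame = first_violation(bad, lambda s: not s["dropped"])
```

The boolean verdict plays the role of a sparse reward, and the violation index shows how a logic-based checker can also point at where in the rollout a constraint broke.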


[41] KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration cs.CVPDF

Ruicheng Zhang, Kaixi Cong, Jun Zhou, Zhizhou Zhong, Zunnan Xu

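TL;DR: This paper proposes KVPO, an ODE-native online Group Relative Policy Optimization (GRPO) framework designed for autoregressive video generators. By moving the source of exploration from stochastic noise to the historical KV cache and introducing a velocity-field surrogate policy based on Trajectory Velocity Energy, it resolves the mismatch between existing reinforcement learning methods and deterministic ODE dynamics, enabling semantic-level diversity exploration and policy optimization while staying on the data manifold.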

Details

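Motivation: Existing reinforcement learning methods rely mainly on noise-based exploration and SDE-based surrogate policies, which are mismatched to the deterministic ODE dynamics of distilled autoregressive models and tend to perturb low-level appearance rather than the high-level semantic storyline progression that is critical for long-horizon coherence. KVPO addresses these limitations to better align streaming autoregressive video generators with human preferences.

Result: Experiments on multiple distilled autoregressive video generators show consistent gains in visual quality, motion quality, and text-video alignment in both single-prompt short-video and multi-prompt long-video settings.

Insight: The contributions are: 1) a causal-semantic exploration paradigm that stochastically routes historical KV cache entries to construct semantically diverse generation branches that remain strictly on the data manifold; 2) a velocity-field surrogate policy based on Trajectory Velocity Energy, which quantifies branch likelihood in flow-matching velocity space and yields a reward-weighted contrastive objective fully consistent with the native ODE formulation. This offers a way to perform policy optimization and diversity exploration within a deterministic ODE framework.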

Abstract: Aligning streaming autoregressive (AR) video generators with human preferences is challenging. Existing reinforcement learning methods predominantly rely on noise-based exploration and SDE-based surrogate policies that are mismatched to the deterministic ODE dynamics of distilled AR models, and tend to perturb low-level appearance rather than the high-level semantic storyline progression critical for long-horizon coherence. To address these limitations, we present KVPO, an ODE-native online Group Relative Policy Optimization (GRPO) framework for aligning streaming video generators. For diversity exploration, KVPO introduces a causal-semantic exploration paradigm that relocates the source of variation from stochastic noise to the historical KV cache. By stochastically routing historical KV entries, it constructs semantically diverse generation branches that remain strictly on the data manifold. For policy modeling, KVPO introduces a velocity-field surrogate policy based on Trajectory Velocity Energy (TVE), which quantifies branch likelihood in flow-matching velocity space and yields a reward-weighted contrastive objective fully consistent with the native ODE formulation. Experiments on multiple distilled AR video generators demonstrate consistent gains in visual quality, motion quality, and text-video alignment across both single-prompt short-video and multi-prompt long-video settings.
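
KVPO builds on GRPO, whose defining step is to score each sample relative to the other samples generated for the same prompt. The sketch below shows that generic group-relative advantage computation (reward minus group mean, divided by group standard deviation); it does not model KVPO's KV-cache routing or its Trajectory Velocity Energy surrogate, and the reward values are made up for illustration.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of branches generated for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for three generation branches of one prompt.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
flat_group = group_relative_advantages([0.5, 0.5, 0.5])
```

Branches that beat their group average get a positive advantage and the rest a negative one; a group whose rewards are all equal contributes no learning signal, which is why diverse branches matter for this family of methods.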


[42] ICED: Concept-level Machine Unlearning via Interpretable Concept Decomposition cs.CV | cs.AI | cs.LGPDF

Shen Lin, Jing Lin, Junhao Dong, Piotr Koniusz, Li Xu

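TL;DR: This paper proposes ICED, an interpretable concept-level machine unlearning framework for Vision-Language Models (VLMs). It uses a multimodal large language model to build a compact, task-specific concept vocabulary from the forgetting set and decomposes visual representations into sparse, nonnegative combinations of semantic concepts, providing an explicit interface for fine-grained knowledge manipulation. Unlearning is formulated as concept-level optimization that selectively suppresses target concepts while preserving intra-instance non-target semantics and global cross-modal knowledge.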

Details

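Motivation: Machine unlearning in existing VLMs is typically performed at the image or instance level, which makes it hard to remove target knowledge precisely without affecting unrelated semantics, because a single image usually contains multiple entangled concepts, including both target concepts to be forgotten and contextual information that should be preserved.

Result: Extensive experiments in in-domain and out-of-domain forgetting settings show that the method achieves more comprehensive target forgetting, better preserves non-target knowledge within the same image, and maintains competitive model utility compared with existing VLM unlearning methods.

Insight: The contribution is an interpretable concept-level unlearning framework: by building a task-specific concept vocabulary and decomposing visual representations into sparse, nonnegative concept combinations, it gives VLMs an explicit interface for fine-grained knowledge manipulation and turns unlearning into concept-level optimization, enabling more precise knowledge editing.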

Abstract: Machine unlearning in Vision-Language Models (VLMs) is typically performed at the image or instance level, making it difficult to precisely remove target knowledge without affecting unrelated semantics. This issue is especially pronounced since a single image often contains multiple entangled concepts, including both target concepts to be forgotten and contextual information that should be preserved. In this paper, we propose an interpretable concept-level unlearning framework for VLMs, which constructs a compact task-specific concept vocabulary from the forgetting set using a multimodal large language model. In addition to modality alignment, visual representations are decomposed into sparse, nonnegative combinations of semantic concepts, providing an explicit interface for fine-grained knowledge manipulation. Based on this decomposition, our method formulates unlearning as concept-level optimization, where target concepts are selectively suppressed while intra-instance non-target semantics and global cross-modal knowledge are preserved. Extensive experiments across both in-domain and out-of-domain forgetting settings demonstrate that our method enables more comprehensive target forgetting, better preserves non-target knowledge within the same image, and maintains competitive model utility compared with existing VLM unlearning methods.
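
The decomposition of a visual representation into a sparse, nonnegative combination of concept vectors can be illustrated with a small projected-gradient solver. This is a generic sketch under stated assumptions, not the paper's procedure: it minimizes the squared reconstruction error plus an L1 penalty `lam`, clips the codes at zero after every step, and uses a fixed step size that suits the toy orthonormal concept vectors below.

```python
def nonneg_sparse_codes(x, concepts, lam=0.0, lr=0.25, steps=200):
    """Approximate argmin_c ||x - sum_i c_i * concepts[i]||^2 + lam * sum_i c_i, c >= 0.

    x: representation vector; concepts: list of concept vectors of the same length.
    Projected gradient descent with a fixed step size (an illustrative choice).
    """
    codes = [0.0] * len(concepts)
    for _ in range(steps):
        recon = [sum(c * v[j] for c, v in zip(codes, concepts)) for j in range(len(x))]
        resid = [r - xj for r, xj in zip(recon, x)]
        for i, v in enumerate(concepts):
            grad = 2.0 * sum(rj * vj for rj, vj in zip(resid, v)) + lam
            codes[i] = max(0.0, codes[i] - lr * grad)   # project onto c_i >= 0
    return codes

# Hypothetical two-concept vocabulary and a representation with a negative part.
concepts = [[1.0, 0.0], [0.0, 1.0]]
codes = nonneg_sparse_codes([2.0, -1.0], concepts)
```

The first concept receives a weight close to 2 and the second is clipped to 0, giving an explicit per-concept coefficient that a concept-level method could then target.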


[43] CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding cs.CVPDF

Ailar Mahdizadeh, Puria Azadi, Muchen Li, Xiangteng He, Leonid Sigal

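TL;DR: This paper proposes CoRDS, which treats KV-cache compression in streaming video understanding as a coreset selection problem. By jointly optimizing coverage in key and value spaces together with an orthogonality-driven diversity criterion, it selects a small subset that represents the accumulated visual history, improving the long-horizon reasoning of large vision-language models under memory constraints.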

Details

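Motivation: Existing streaming video understanding methods usually compress the KV cache with local token-wise heuristics (such as recency, temporal redundancy, or saliency), which do not explicitly optimize whether the retained cache is representative of the accumulated history, so a more effective compression principle is needed.

Result: Across four open-source vision-language models and five long-video and streaming-video benchmarks, the method outperforms heuristic streaming compression baselines under a fixed cache budget, supporting coreset selection as a compression principle.

Insight: The contribution is to formalize KV-cache compression as a coreset selection problem and to introduce a bicriteria objective that balances coverage in key and value spaces together with an orthogonality-driven diversity criterion (connected to log-determinant subset selection), so that the selected subset is representative and diverse while preserving retrieval structure and output-relevant information.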

Abstract: Streaming video understanding with large vision-language models (VLMs) requires a compact memory that can support future reasoning over an ever-growing visual history. A common solution is to compress the key-value (KV) cache, but existing streaming methods typically rely on local token-wise heuristics, such as recency, temporal redundancy, or saliency, which do not explicitly optimize whether the retained cache is representative of the accumulated history. We propose to view KV-cache compression as a coreset selection problem: rather than scoring tokens independently for retention, we select a small subset that covers the geometry of the accumulated visual cache. Our method operates in a joint KV representation and introduces a bicriteria objective that balances coverage in key and value spaces, preserving both retrieval structure and output-relevant information. To encourage a more diverse retained subset, we further introduce an orthogonality-driven diversity criterion that favors candidates contributing new directions beyond the current selection, and connect this criterion to log-determinant subset selection. Across four open-source VLMs and five long-video and streaming-video benchmarks, our method improves over heuristic streaming compression baselines under a fixed cache budget. These results highlight that representative coreset selection offers a more effective principle than token-wise pruning for memory-constrained streaming video understanding.
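
The difference between scoring tokens independently and selecting a subset that covers the cache can be seen with the classic greedy k-center (farthest-point) heuristic, used here as a stand-in. It is not the paper's bicriteria objective or its orthogonality-driven criterion; the function name, the choice of the first point as seed, and the toy 2D "tokens" are assumptions.

```python
import math

def greedy_k_center(points, budget):
    """Farthest-point selection: repeatedly keep the point least covered so far.

    points: list of equal-length coordinate tuples (stand-ins for cached tokens).
    Returns the indices of the retained points, seeded with index 0.
    """
    selected = [0]
    # Distance from every point to its nearest selected point.
    cover = [math.dist(p, points[0]) for p in points]
    while len(selected) < min(budget, len(points)):
        nxt = max(range(len(points)), key=cover.__getitem__)
        selected.append(nxt)
        cover = [min(c, math.dist(p, points[nxt])) for c, p in zip(cover, points)]
    return selected

# Two tight clusters plus one isolated token; a budget of 3 should span all three regions.
tokens = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.0, 0.1), (5.0, 5.0)]
kept = greedy_k_center(tokens, budget=3)
```

Each pick depends on what has already been kept, so near-duplicates of a retained token are never chosen ahead of an uncovered region, which a purely per-token score cannot guarantee.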


[44] InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation cs.CVPDF

Yang Yue, Fangyun Wei, Tianyu He, Jinjing Zhao, Zanlin Ni

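TL;DR: This paper proposes InsightTok, an improved discrete visual tokenization framework that enhances text and face fidelity through localized, content-aware perceptual losses, addressing the loss of text and facial detail in autoregressive image generation.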

Details

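Motivation: Standard discrete-tokenizer objectives are only weakly aligned with text legibility and facial fidelity, so text and facial details are hard to preserve in autoregressive image generation. This paper improves tokenizer training through specialized supervision.

Result: With a 16k codebook and a 16x downsampling rate, InsightTok significantly outperforms prior tokenizers in text and face reconstruction without compromising overall reconstruction quality, and these gains transfer to autoregressive image generation in InsightAR, producing clearer text and more faithful facial details.

Insight: The contribution is the use of localized, content-aware perceptual losses to specifically optimize fidelity in text and face regions, which demonstrates the potential of specialized supervision in tokenizer training for improving discrete image generation.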

Abstract: Text and faces are among the most perceptually salient and practically important patterns in visual generation, yet they remain challenging for autoregressive generators built on discrete tokenization. A central bottleneck is the tokenizer: aggressive downsampling and quantization often discard the fine-grained structures needed to preserve readable glyphs and distinctive facial features. We attribute this gap to standard discrete-tokenizer objectives being weakly aligned with text legibility and facial fidelity, as these objectives typically optimize generic reconstruction while compressing diverse content uniformly. To address this, we propose InsightTok, a simple yet effective discrete visual tokenization framework that enhances text and face fidelity through localized, content-aware perceptual losses. With a compact 16k codebook and a 16x downsampling rate, InsightTok significantly outperforms prior tokenizers in text and face reconstruction without compromising general reconstruction quality. These gains consistently transfer to autoregressive image generation in InsightAR, producing images with clearer text and more faithful facial details. Overall, our results highlight the potential of specialized supervision in tokenizer training for advancing discrete image generation.


[45] Learning with Semantic Priors: Stabilizing Point-Supervised Infrared Small Target Detection via Hierarchical Knowledge Distillation cs.CVPDF

Yuanhang Yao, Ping Qian, Zhu Liu, Long Ma, Weimin Wang

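TL;DR: This paper proposes a hierarchical knowledge distillation framework for stabilizing point-supervised infrared small target detection. It uses a frozen Vision Foundation Model (VFM) as a semantic prior and mitigates pseudo-label noise and training-set bias through a bilevel optimization process: the inner loop trains a VFM-embedded teacher, and the outer loop transfers validation-guided knowledge to a lightweight student. It also introduces Semantic-Conditioned Affine Modulation (SCAM) with multi-layer feature injection and a cluster-based dynamic collaborative learning strategy to improve robustness to imperfect pseudo-masks.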

Details

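Motivation: Point supervision reduces the annotation cost of infrared small target detection, but lightweight CNN detectors lack sufficient semantics, which leads to noisy pseudo-masks and unstable optimization. This paper uses the rich semantic priors of a Vision Foundation Model to stabilize point-supervised learning.

Result: Experiments across multiple ISTD backbones and diverse challenging scenarios show consistent improvements in detection accuracy and training stability.

Insight: The contribution is to formulate point-supervised learning as a bilevel optimization problem and to use a frozen VFM as a stable source of semantic priors. The SCAM module injects VFM semantics into CNN features at multiple layers, and a cluster-based dynamic sample reweighting strategy mitigates pseudo-label noise and training bias, giving a stabilization framework that other weakly supervised detection work can draw on.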

Abstract: Single-frame Infrared Small Target Detection (ISTD) aims to localize weak targets under heavy background clutter, yet dense pixel-wise annotations are expensive. Point supervision with online label evolution reduces annotation cost; however, lightweight CNN detectors often lack sufficient semantics, leading to noisy pseudo-masks and unstable optimization. To address this, we propose a hierarchical VFM-driven knowledge distillation framework that uses a frozen Vision Foundation Model (VFM) during training. We formulate point-supervised learning as a bilevel optimization process: the inner loop adapts a VFM-embedded teacher on reweighted training samples, while the outer loop transfers validation-guided knowledge to a lightweight student to mitigate pseudo-label noise and training-set bias. We further introduce Semantic-Conditioned Affine Modulation (SCAM) to inject VFM semantics into CNN features at multiple layers. In addition, a dynamic collaborative learning strategy with cluster-level sample reweighting enhances robustness to imperfect pseudo-masks. Experiments on diverse challenging cases across multiple ISTD backbones demonstrate consistent improvements in detection accuracy and training stability. Our code is available at https://github.com/yuanhang-yao/semantic-prior.


[46] Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation cs.CV | cs.GR | cs.MMPDF

Yuheng Wu, Xiangbo Gao, Tianhao Chen, Xinghao Chen, Qing Yin

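TL;DR: This paper proposes Delta Forcing, a framework that addresses the core challenge of balancing reactivity and stability in interactive autoregressive video generation. It builds an adaptive trust region by estimating the latent delta between teacher and generator trajectories and uses it to constrain unreliable teacher supervision, maintaining long-horizon temporal consistency while still responding to new events.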

Details

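Motivation: Existing methods, which distill bidirectional models into autoregressive generators and adapt them through streaming long tuning, often exhibit persistent drift after condition changes. The root cause is conditional bias: the teacher may provide guidance that is aligned with the condition but agnostic to the trajectory, so generation is locally valid yet globally inconsistent.

Result: Extensive experiments show that Delta Forcing significantly improves the consistency of generated videos while maintaining event reactivity.

Insight: Inspired by Trust Region Policy Optimization (TRPO), the core contribution is to estimate transition consistency from the latent delta and use it to balance teacher supervision against a monotonic continuity objective, suppressing unreliable teacher-induced shifts. This offers a new way to mitigate distillation bias in autoregressive generation.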

Abstract: Interactive real-time autoregressive video generation is essential for applications such as content creation and world modeling, where visual content must adapt to dynamically evolving event conditions. A fundamental challenge lies in balancing reactivity and stability: models must respond promptly to new events while maintaining temporal coherence over long horizons. Existing approaches distill bidirectional models into autoregressive generators and further adapt them via streaming long tuning, yet often exhibit persistent drift after condition changes. We identify the cause as conditional bias, where the teacher may provide condition-aligned but trajectory-agnostic guidance, biasing generation toward locally valid yet globally inconsistent modes. Inspired by Trust Region Policy Optimization, we propose Delta Forcing, a simple yet effective framework that constrains unreliable teacher supervision within an adaptive trust region. Specifically, Delta Forcing estimates transition consistency from the latent delta between teacher and generator trajectories, and uses it to balance teacher supervision with a monotonic continuity objective. This suppresses unreliable teacher-induced shifts while preserving responsiveness to new events. Extensive experiments demonstrate that Delta Forcing significantly improves consistency while maintaining event reactivity.


[47] Analogical Trajectory Transfer cs.CVPDF

Junho Kim, Eun Sun Lee, Gwangtak Bae, Seunggu Kang, Young Min Kim

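TL;DR: This paper proposes an analogical trajectory transfer method that transfers a motion trajectory in one 3D environment to a semantically analogous location in another, supporting analogical spatial reasoning by machines with applications in AR/VR co-presence, content creation, and robotics.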

Details

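Motivation: When transferring motion trajectories between 3D scenes that are semantically similar but differ substantially in object placement, scale, and layout, naive semantic matching leads to collisions or geometric distortions.

Result: The method requires no training, runs in about 0.6 seconds, outperforms baselines based on LLMs, VLMs, and scene graph matching, and is shown to be effective in applications such as virtual co-presence, multi-trajectory transfer, camera transfer, and human-to-robot motion transfer.

Insight: The contribution is to decompose the problem into spatially segregated subproblems and to achieve semantically consistent, spatially coherent trajectory transfer through hierarchical smooth map prediction on 3D foundation model features followed by combinatorial assembly and refinement, which avoids tearing the trajectory apart or causing collisions.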

Abstract: We study analogical trajectory transfer, where the goal is to translate motion trajectories in one 3D environment to a semantically analogous location in another. Such a capacity would enable machines to perform analogical spatial reasoning, with applications in AR/VR co-presence, content creation, and robotics. However, even semantically similar scenes can still differ substantially in object placement, scale, and layout, so naively matching semantics leads to collisions or geometric distortions. Furthermore, finding where each trajectory point should transfer to has a large search space, as the mapping must preserve semantics and functionality without tearing the trajectory apart or causing collisions. Our key insight is to decompose the problem into spatially segregated subproblems and merge their solutions to produce semantically consistent and spatially coherent transfers. Specifically, we partition scenes into object-centric clusters and estimate cross-scene mappings via hierarchical smooth map prediction, using 3D foundation model features that encode contextual information from object and open-space arrangements. We then combinatorially assemble the per-cluster maps into an initial transfer and refine the result to remove collisions and distortions, yielding a spatially coherent trajectory. Our method does not require training, attains a fast runtime around 0.6 seconds, and outperforms baselines based on LLMs, VLMs, and scene graph matching. We further showcase applications in virtual co-presence, multi-trajectory transfer, camera transfer, and human-to-robot motion transfer, which indicates the broad applicability of our work to AR/VR and robotics.


[48] Systematic Discovery of Semantic Attacks in Online Map Construction through Conditional Diffusion cs.CV | cs.CR | cs.LG | cs.ROPDF

Chenyi Wang, Ruoyu Song, Raymond Muller, Jean-Philippe Monteuuis, Jonathan Petit

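TL;DR: This paper proposes MIRAGE, a framework for systematically discovering semantic attacks that generate plausible environmental variation (e.g., shadows, wet roads) to bypass adversarial defenses and degrade the predictions of online HD map construction.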

Details

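Motivation: Existing pixel-level perturbation attacks can be neutralized by standard adversarial defenses, so it is necessary to find vulnerabilities that bypass those defenses and mislead map construction through semantic-level environmental change.

Result: Evaluated on nuScenes, MIRAGE achieves boundary removal (suppressing 57.7% of detections and corrupting 96% of planned trajectories) and boundary injection (the only method that succeeds), and remains effective under various adversarial defenses; VLM judges rate its generated scenes as realistic 80–84% of the time.

Insight: MIRAGE exploits the real-data manifold learned by diffusion models to search for scenes that are semantically mutated but keep the same road topology, exposing a categorical gap in current adversarial defenses: semantic-level perturbations that manifest as legitimate environmental variation are substantially harder to mitigate.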

Abstract: Autonomous vehicles depend on online HD map construction to perceive lane boundaries, dividers, and pedestrian crossings, safety-critical road elements that directly govern motion planning. While existing pixel perturbation attacks can disrupt the mapping, they can be neutralized by standard adversarial defenses. We present MIRAGE, a framework for systematic discovery of semantic attacks that bypass adversarial defenses and degrade mapping predictions by finding plausible environmental variation (e.g., shadows, wet roads). MIRAGE exploits the latent manifold of real-world data learned by diffusion models, and searches for semantically mutated scenes that neighbor the ground truth and share its road topology yet mislead the mapping predictions. We evaluate MIRAGE on nuScenes and demonstrate two attacks: (1) boundary removal, suppressing 57.7% of detections and corrupting 96% of planned trajectories; and (2) boundary injection, the only method that successfully injects fictitious boundaries, while pixel PGD and AdvPatch fail entirely. Both attacks remain potent under various adversarial defenses. We use two independent VLM judges to quantify realism, where MIRAGE passes as realistic 80–84% of the time (vs. 97–99% for clean nuScenes), while AdvPatch passes only 0–9% of the time. Our findings expose a categorical gap in current adversarial defenses: semantic-level perturbations that manifest as legitimate environmental variation are substantially harder to mitigate than pixel-level perturbations.


[49] SceneForge: Structured World Supervision from 3D Interventions cs.CV | cs.GR

Jizhizi Li, Jiayang Ao, Danny Wicks, Petru-Daniel Tudosiu

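TL;DR: SceneForge is a framework that generates structured supervision through 3D world interventions. It represents each scene as a persistent world with semantic, geometric, and physical dependencies. By applying explicit interventions (e.g., object removal or camera variation) and propagating their effects, it produces supervision consistent with object structure and scene-level effects, including counterfactual observations, multi-view observations, and effect-aware signals such as shadows and reflections.

Details

Motivation: Many multimodal learning tasks need supervision that stays consistent across edits, viewpoints, and scene-level interventions. Such supervision is hard to obtain from observation-level datasets, which do not expose the underlying scene state or how changes propagate through it.

Result: Under matched training budgets, adding SceneForge supervision improves object removal and scene removal performance across multiple benchmarks, both quantitatively and qualitatively.

Insight: Modeling supervision as structured state transitions in editable worlds offers a practical, scalable foundation for intervention-consistent multimodal learning. The novelty lies in generating aligned supervision directly from a shared 3D world state rather than from post hoc image-space processing, which keeps the supervision internally consistent.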

Abstract: Many multimodal learning tasks require supervision that remains consistent across edits, viewpoints, and scene-level interventions. However, such supervision is difficult to obtain from observation-level datasets, which do not expose the underlying scene state or how changes propagate through it. We present SceneForge, an intervention-driven framework that generates structured supervision from editable 3D world states. SceneForge represents each scene as a persistent world with semantic, geometric, and physical dependencies. By applying explicit interventions (e.g., object removal or camera variation) and propagating their effects through scene dependencies, SceneForge renders supervision that remains consistent with object structure and scene-level effects. This produces aligned outputs including counterfactual observations, multi-view observations, and effect-aware signals such as shadows and reflections, all derived from a shared world state rather than post hoc image-space processing. We instantiate SceneForge using Infinigen and Blender to construct a licensing-clean indoor supervision resource with a large number of counterfactual pairs and aligned annotations from over 2K scenes, covering both diverse single-view and registered multi-view settings. Under matched training budgets, incorporating SceneForge supervision improves both object removal and scene removal performance across multiple benchmarks in both quantitative and qualitative evaluation. These results indicate that modeling supervision as structured state transitions in editable worlds provides a practical and scalable foundation for intervention-consistent multimodal learning.


[50] DermAgent: A Self-Reflective Agentic System for Dermatological Image Analysis with Multi-Tool Reasoning and Traceable Decision-Making cs.CV

Yize Liu, Siyuan Yan, Ming Hu, Lie Ju, Xieji Li

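TL;DR: This paper proposes DermAgent, a self-reflective multi-tool agent system for dermatological image analysis. Through a collaborative framework of seven specialized vision and language modules, combined with external knowledge retrieval and a deterministic critic module, it delivers traceable, stepwise diagnostic reasoning. It targets the insufficient domain grounding and hallucinations of multimodal large language models (MLLMs) in dermatology.

Details

Motivation: Dermatological diagnosis requires combining fine-grained visual perception with expert clinical knowledge. Although MLLMs have advanced interactive medical image analysis, their use in dermatology is hindered by insufficient domain-specific grounding and hallucinations.

Result: Extensive experiments on five dermatology benchmarks show that DermAgent consistently outperforms state-of-the-art MLLMs and medical agent baselines on zero-shot fine-grained disease diagnosis, concept annotation, and clinical captioning, exceeding GPT-4o by 17.6% in skin disease diagnostic accuracy and by 3.15% in captioning ROUGE-L.

Insight: The novelty is a collaborative agent framework that integrates complementary visual perception tools, a dual-modality retrieval module, and a deterministic critic module. At its core is a Plan-Execute-Reflect paradigm: predictions are anchored in a retrieved external evidence base, and strict post-hoc auditing through confidence, coverage, and conflict gates triggers self-correction, improving the interpretability and reliability of the diagnosis.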

Abstract: Dermatological diagnosis requires integrating fine-grained visual perception with expert clinical knowledge. Although Multimodal Large Language Models (MLLMs) facilitate interactive medical image analysis, their application in dermatology is hindered by insufficient domain-specific grounding and hallucinations. To address these issues, we propose DermAgent, a collaborative multi-tool agent that orchestrates seven specialized vision and language modules within a Plan-Execute-Reflect framework. DermAgent delivers stepwise, traceable diagnostic reasoning through three core components. First, it employs complementary visual perception tools for comprehensive morphological description, dermoscopic concept annotation, and disease diagnosis. Second, to overcome the lack of domain prior, a dual-modality retrieval module anchors every prediction in external evidence by cross-referencing 413,210 diagnosed image cases and 3,199 clinical guideline chunks. To further mitigate hallucinations, a deterministic critic module conducts strict post-hoc auditing via confidence, coverage, and conflict gates, automatically detecting inter-source disagreements to trigger targeted self-correction. Extensive experiments on five dermatology benchmarks demonstrate that DermAgent consistently outperforms state-of-the-art MLLMs and medical agent baselines across zero-shot fine-grained disease diagnosis, concept annotation, and clinical captioning tasks, exceeding GPT-4o by 17.6% in skin disease diagnostic accuracy and 3.15% in captioning ROUGE-L. Our code is available at https://github.com/YizeezLiu/DermAgent.


[51] Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture cs.CV | cs.CL | cs.IR

Longxiang Zhang, Weilong Dai, Guanghao Zhang, Hao Jiang, Pipei Huang

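TL;DR: This paper proposes Think When Needed (TWN), a unified multimodal embedding framework with adaptive reasoning. It uses a dual-LoRA architecture that attaches reasoning and embedding adapters to a shared frozen backbone, and a self-supervised routing gate decides per input whether to generate a chain of thought, cutting unnecessary reasoning overhead and improving retrieval quality.

Details

Motivation: Existing methods add chain-of-thought (CoT) reasoning to the multimodal embedding pipeline to improve retrieval quality, but they are costly in model size and inference, and they generate CoT indiscriminately for every input. For simple inputs, redundant reasoning can even mislead the model and degrade performance.

Result: On the 78 tasks of MMEB-V2, TWN achieves state-of-the-art embedding quality while being substantially more efficient than existing generative methods, requiring only 3-5% additional parameters relative to the backbone and up to 50% fewer reasoning tokens than the full generative mode.

Insight: The novelty lies in the dual-LoRA architecture, which mitigates gradient conflicts from joint optimization, and in the adaptive think mechanism, whose self-supervised routing gate enables reasoning on demand and so improves performance and efficiency together. The paper also explores embedding-guided reinforcement learning to optimize CoT quality.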

Abstract: Multimodal large language models (MLLMs) have emerged as a powerful backbone for multimodal embeddings. Recent methods introduce chain-of-thought (CoT) reasoning into the embedding pipeline to improve retrieval quality, but remain costly in both model size and inference cost. They typically employ separate reasoner and embedder with substantial parameter overhead, and generate CoT indiscriminately for every input. However, we observe that for simple inputs, discriminative embeddings already perform well, and redundant reasoning can even mislead the model, degrading performance. To address these limitations, we propose Think When Needed (TWN), a unified multimodal embedding framework with adaptive reasoning. TWN introduces a dual-LoRA architecture that attaches reasoning and embedding adapters to a shared frozen backbone, detaching gradients at their interface to mitigate gradient conflicts introduced by joint optimization while keeping parameters close to a single model. Building on this, an adaptive think mechanism uses a self-supervised routing gate to decide per input whether to generate CoT, skipping unnecessary reasoning to reduce inference overhead and even improve retrieval quality. We further explore embedding-guided RL to optimize CoT quality beyond supervised training. On the 78 tasks of MMEB-V2, TWN achieves state-of-the-art embedding quality while being substantially more efficient than existing generative methods, requiring only 3-5% additional parameters relative to the backbone and up to 50% fewer reasoning tokens compared to the full generative mode.


[52] Real2Sim in HOI: Toward Physically Plausible HOI Reconstruction from Monocular Videos cs.CV

Yubo Zhao, Yujin Chai, Yunao Dong, Chengfeng Zhao, Zijiao Zeng

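TL;DR: This paper proposes HA-HOI, a framework for reconstructing physically plausible 4D human-object interaction (HOI) animation from monocular video. It follows a "human-first, object-follow" strategy: human motion is recovered first as the interaction anchor, the object trajectory is then reconstructed, aligned, and refined relative to the human action, and the kinematic trajectory is finally projected into a physics-based humanoid-object simulation to produce a stable physical motion sequence.

Details

Motivation: Trajectories reconstructed by existing methods are temporally coherent but often remain visual artifacts. When used as reference motions for simulation, they fail to preserve stable contact, functional manipulation, or physical plausibility. This reveals a fundamental gap: reconstruction should not stop at tracking the human and the object separately, but should recover the relation that makes their motion a coherent interaction.

Result: On benchmark and in-the-wild videos, HA-HOI improves on prior monocular HOI reconstruction methods in human-object alignment, contact consistency, temporal stability, and simulation readiness.

Insight: The main novelty is the "human-first, object-follow" formulation, which treats human motion as the interaction anchor, together with physics-based simulation as a final step that turns kinematic trajectories into physically plausible interaction animation. This moves beyond visually plausible trajectory recovery toward physically grounded interaction animation.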

Abstract: Recovering 4D human-object interaction (HOI) from monocular video is a key step toward scalable 3D content creation, embodied AI, and simulation-based learning. Recent methods can reconstruct temporally coherent human and object trajectories, but these trajectories often remain visual artifacts while failing to preserve stable contact, functional manipulation, or physical plausibility when used as reference motions for humanoid-object simulation. This reveals a fundamental interaction gap: HOI reconstruction should not stop at tracking a human and an object, but should recover the relation that makes their motion a coherent interaction. We introduce HA-HOI, a framework for reconstructing physically plausible 4D HOI animation from in-the-wild monocular videos. Instead of treating the human and object as independent entities in an ambiguous monocular 3D space, we propose a human-first, object-follow formulation. The human motion is recovered as the interaction anchor, and the object is reconstructed, aligned, and refined relative to the human action. The resulting kinematic trajectory is then projected into a physics-based humanoid-object simulation, where it acts as a teacher trajectory for stable physical rollout. Across benchmark and in-the-wild videos, HA-HOI improves human-object alignment, contact consistency, temporal stability, and simulation readiness over prior monocular HOI reconstruction methods. By moving beyond visually plausible trajectory recovery toward physically grounded interaction animation, our work takes a step toward turning general monocular HOI videos into scalable demonstrations for humanoid-object behavior. Project page: https://knoxzhao.github.io/real2sim_in_HOI/


[53] GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding cs.CV

Jiashun Zhu, Ronghao Fu, Jiasen Hu, Nachuan Xing, Xu Na

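TL;DR: This paper proposes GeoVista, a planning-driven active perception framework for understanding ultra-high-resolution (UHR) remote sensing images. By building a global exploration plan, verifying multiple candidate regions through branch-wise local inspection, and maintaining an explicit evidence state, it addresses the loss of global context, missed scattered regions, and double counting seen in existing single-path exploration methods.

Details

Motivation: When interpreting UHR images, existing remote sensing vision-language models typically explore with a one-shot focus or a single sequential trajectory, which can lose global context, miss regions, and count the same evidence more than once.

Result: GeoVista achieves state-of-the-art performance on RSHR-Bench, XLRS-Bench, and LRS-VQA.

Insight: The contributions are: 1) a planning-driven active perception framework that combines a global plan with branch-wise local inspection; 2) APEX-GRO, a cold-start supervised trajectory corpus that reformulates diverse tasks as Global-Region-Object interactive reasoning processes; 3) an Observe-Plan-Track mechanism and a GRPO-based alignment strategy with step-wise rewards, enabling cross-region evidence aggregation and de-duplication.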

Abstract: Interpreting ultra-high-resolution (UHR) remote sensing images requires models to search for sparse and tiny visual evidence across large-scale scenes. Existing remote sensing vision-language models can inspect local regions with zooming and cropping tools, but most exploration strategies follow either a one-shot focus or a single sequential trajectory. Such single-path exploration can lose global context, leave scattered regions unvisited, and revisit or count the same evidence multiple times. To this end, we propose GeoVista, a planning-driven active perception framework for UHR remote sensing interpretation. Instead of committing to one zooming path, GeoVista first builds a global exploration plan, then verifies multiple candidate regions through branch-wise local inspection, while maintaining an explicit evidence state for cross-region aggregation and de-duplication. To enable this behavior, we introduce APEX-GRO, a cold-start supervised trajectory corpus that reformulates diverse UHR tasks as Global-Region-Object interactive reasoning processes with a unified, scale-invariant spatial representation. We further design an Observe-Plan-Track mechanism for global observation, adaptive region inspection, and evidence tracking, and align the model with a GRPO-based strategy using step-wise rewards for planning, localization, and final answer correctness. Experiments on RSHR-Bench, XLRS-Bench, and LRS-VQA show that GeoVista achieves state-of-the-art performance. Code and dataset are available at https://github.com/ryan6073/GeoVista


[54] Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity cs.CV | cs.AI

Jiahao Tian, Yiwei Wang, Gang Yu, Chi Zhang

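TL;DR: This paper proposes Head Forcing, a training-free framework that addresses error accumulation and context loss when autoregressive video diffusion models generate long sequences. By analyzing the functional heterogeneity of attention heads, it assigns a tailored KV cache strategy to each head type (local, anchor, and memory heads) and introduces a hierarchical memory system and a head-wise RoPE re-encoding scheme, extending generation from 5 seconds to minute-level duration and supporting multi-prompt interactive synthesis.

Details

Motivation: Autoregressive video diffusion models support real-time synthesis, but existing methods treat attention heads uniformly. This leads to suboptimal KV cache allocation and, over long horizons, to error accumulation and loss of long-range context.

Result: Without additional training, Head Forcing extends generation from 5 seconds to minute-level duration, supports multi-prompt interactive synthesis, and consistently outperforms existing baselines.

Insight: The core novelty is identifying and exploiting the functional heterogeneity of attention heads in autoregressive video diffusion transformers (local, anchor, and memory heads). On that basis it designs targeted, training-free KV cache strategies (such as a hierarchical dynamic memory system for memory heads) and a scheme that keeps positional encodings within the pretrained range, which stabilizes long-sequence generation.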

Abstract: Autoregressive video diffusion models support real-time synthesis but suffer from error accumulation and context loss over long horizons. We discover that attention heads in AR video diffusion transformers serve functionally distinct roles as local heads for detail refinement, anchor heads for structural stabilization, and memory heads for long-range context aggregation, yet existing methods treat them uniformly, leading to suboptimal KV cache allocation. We propose Head Forcing, a training-free framework that assigns each head type a tailored KV cache strategy: local and anchor heads retain only essential tokens, while memory heads employ a hierarchical memory system with dynamic episodic updates for long-range consistency. A head-wise RoPE re-encoding scheme further ensures positional encodings remain within the pretrained range. Without additional training, Head Forcing extends generation from 5 seconds to minute-level duration, supports multi-prompt interactive synthesis, and consistently outperforms existing baselines. Project Page: https://jiahaotian-sjtu.github.io/headforcing.github.io/.


[55] HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention cs.CV | cs.AI

Xuzhe Zheng, Yuexiao Ma, Jing Xu, Xiawu Zheng, Rongrong Ji

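TL;DR: This paper proposes HASTE, a training-free acceleration method for video diffusion that improves inference efficiency through head-wise adaptive sparse attention. It has two core components: Temporal Mask Reuse, which skips unnecessary mask prediction based on query-key drift, and Error-guided Budgeted Calibration, which assigns a sparsity threshold to each attention head by minimizing model-output error.

Details

Motivation: Diffusion-based video generation has advanced substantially in visual fidelity and temporal coherence, but practical deployment is limited by the quadratic complexity of full attention. Existing training-free sparse attention methods (such as online top-p sparse attention) still spend non-negligible cost on mask prediction and apply a shared threshold to all attention heads, ignoring head-level heterogeneity. This limits their practical speed-quality trade-off in video diffusion transformers.

Result: On Wan2.1-1.3B and Wan2.1-14B, the method consistently improves the XAttention and SVG2 baselines, achieving up to 1.93x speedup at 720P while maintaining competitive video quality and similarity metrics.

Insight: The main novelty is a head-wise adaptive framework whose two plug-in components, Temporal Mask Reuse and Error-guided Budgeted Calibration, address mask prediction overhead and neglected head-level heterogeneity respectively. The key observation is that these two overlooked factors limit existing training-free sparse attention. The paper designs a targeted, measurable remedy for each, yielding a better speedup at comparable quality.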

Abstract: Diffusion-based video generation has advanced substantially in visual fidelity and temporal coherence, but practical deployment remains limited by the quadratic complexity of full attention. Training-free sparse attention is attractive because it accelerates pretrained models without retraining, yet existing online top-$p$ sparse attention still spends non-negligible cost on mask prediction and applies shared thresholds despite strong head-level heterogeneity. We show that these two overlooked factors limit the practical speed-quality trade-off of training-free sparse attention in Video DiTs. To address them, we introduce a head-wise adaptive framework with two plug-in components: Temporal Mask Reuse, which skips unnecessary mask prediction based on query-key drift, and Error-guided Budgeted Calibration, which assigns per-head top-$p$ thresholds by minimizing measured model-output error under a global sparsity budget. On Wan2.1-1.3B and Wan2.1-14B, our method consistently improves XAttention and SVG2, achieving up to 1.93 times speedup at 720P while maintaining competitive video quality and similarity metrics.
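
To make the shared-threshold issue concrete, here is a minimal sketch of top-p selection over key blocks. It is not the HASTE implementation: the block scores, head names, and thresholds are hypothetical, and the per-head thresholds are hand-picked stand-ins for values that the paper obtains by calibration. The sketch does not cover Temporal Mask Reuse.

```python
def top_p_keep(scores, p):
    """Indices of the smallest set of entries whose normalized mass reaches p."""
    total = sum(scores)
    # Visit entries from largest to smallest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += scores[i] / total
        if mass >= p:
            break
    return sorted(kept)


# Hypothetical attention mass of two heads over six key blocks.
peaked_head = [0.70, 0.15, 0.05, 0.04, 0.03, 0.03]
diffuse_head = [0.20, 0.18, 0.17, 0.16, 0.15, 0.14]

# A shared threshold keeps very different numbers of blocks per head.
shared = {
    "peaked": top_p_keep(peaked_head, 0.8),
    "diffuse": top_p_keep(diffuse_head, 0.8),
}

# Hypothetical per-head thresholds, standing in for calibrated values.
per_head = {
    "peaked": top_p_keep(peaked_head, 0.8),
    "diffuse": top_p_keep(diffuse_head, 0.5),
}

print(shared)
print(per_head)
```

With the shared threshold, the peaked head keeps two blocks while the diffuse head keeps five of six, so the diffuse head gains almost no sparsity. A per-head threshold lets the two heads be traded off against each other under a common budget.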


[56] ArcGate: Adaptive Arctangent Gated Activation cs.CV | cs.LG

Avik Bhattacharya, Siddhant Dnyanesh Gole, Subhasis Chaudhuri, Alejandro C. Frery, Biplab Banerjee

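TL;DR: This paper proposes ArcGate, an adaptive arctangent gated activation function that generates a wide range of activation shapes through a three-stage non-linear transformation. With seven learnable parameters per layer, it lets the network optimize its non-linearity to the feature hierarchy and data distribution. Evaluated with ResNet-50 and Vision Transformer architectures on three remote sensing benchmarks (PatternNet, UC Merced Land Use, and EuroSAT MSI), ArcGate shows strong structural resilience in noisy environments.

Details

Motivation: To overcome the limited non-linearity and adaptability of conventional fixed-shape activations (such as ReLU, GELU, and SiLU), so that networks can adapt more flexibly to different feature hierarchies and data distributions.

Result: ArcGate reaches a peak overall accuracy of 99.67% on PatternNet, maintains a 26.65% performance lead over ReLU under moderate Gaussian noise (standard deviation 0.1), and outperforms standard baselines on all three remote sensing benchmarks.

Insight: The novelty is the use of learnable parameters to obtain adaptive activation shapes, which improves robustness in noisy environments. The depth-dependent functional evolution it reveals (for example, stronger gating in deeper layers to enhance signal propagation) suggests a new direction for activation function design, suited to high-resolution earth observation tasks.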

Abstract: Activation functions are central to deep networks, influencing non-linearity, feature learning, convergence, and robustness. This paper proposes the Adaptive Arctangent Gated Activation (ArcGate) function, a flexible formulation that generates a broad spectrum of activation shapes via a three-stage non-linear transformation. Unlike conventional fixed-shape activations such as ReLU, GELU, or SiLU, ArcGate uses seven learnable parameters per layer, allowing the neural network to autonomously optimize its non-linearity to the specific requirements of the feature hierarchy and data distribution. We evaluate ArcGate using ResNet-50 and Vision Transformer (ViT-B/16) architectures on three widely used remote sensing benchmarks: PatternNet, UC Merced Land Use, and the 13-band EuroSAT MSI multispectral dataset. Experimental results show that ArcGate consistently outperforms standard baselines, achieving a peak overall accuracy of 99.67% on PatternNet. Most notably, ArcGate exhibits superior structural resilience in noisy environments, maintaining a 26.65% performance lead over ReLU under moderate Gaussian noise (standard deviation 0.1). Analysis of the learned parameters reveals a depth-dependent functional evolution, where the model increases gating strength in deeper layers to enhance signal propagation. These findings suggest that ArcGate is a robust and adaptive general node activation function for high-resolution earth observation tasks.


[57] Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models cs.CV

Sujung Hong, Chanyong Yoon, Seongjae Hwang

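TL;DR: This paper targets the repetitive generation and degraded visual grounding that large diffusion vision-language models exhibit in long-form generation. It proposes a lightweight, training-free approach that mitigates mask prior drift and positional attention collapse, and validates it on multiple benchmarks.

Details

Motivation: Existing large diffusion vision-language models suffer from repetitive generation and degraded visual grounding in long-form generation. The paper aims to identify the root causes and propose a solution that needs no additional training.

Result: Experiments on general multimodal benchmarks and visual grounding tasks show improvements over baseline models, with robust gains on long-form description benchmarks.

Insight: The novelty is identifying that repetitive generation stems from mask prior drift and that degraded visual grounding stems from a mismatch between the positional attention bias and the iterative unmasking process. The paper proposes two training-free plug-in strategies, Mask Prior Suppression and Monotonic RoPE Scaling, to mitigate them.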

Abstract: Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite these advances, their behavior under long-form generation remains underexplored. In this work, we show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes. First, repetitive generation originates from a mask token prior: since generation tokens are initialized as mask tokens, their hidden representations progressively drift toward a shared prior direction over generation steps. Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process suppresses attention toward informative visual tokens, degrading visual grounding. Based on these insights, we propose a training-free approach, introducing Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding. Experiments on general multimodal benchmarks and visual grounding tasks demonstrate improvements over baseline LDVLMs, with robust gains on long-form description benchmarks. Our results show that these failures can be effectively addressed with a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.


[58] PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media cs.CV | cs.AI | cs.MM

Fuhao Li, Shaofeng You, Jiagao Hu, Yu Liu, Yuxuan Chen

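TL;DR: This paper proposes the PROVE evaluation framework for assessing object removal quality in images and videos. It comprises a pair of perception-aligned metrics, RC-S (spatial coherence) and RC-T (temporal consistency), and PROVE-Bench, a two-tier real-world benchmark with two subsets, PROVE-M and PROVE-H. Experiments show that the RC metrics align with human judgments better than existing evaluation protocols.

Details

Motivation: Existing ways of evaluating object removal are flawed: full-reference metrics reward copy-paste behavior over genuine erasure; no-reference metrics have systematic biases (such as favoring blurry results); and global temporal metrics are insensitive to localized artifacts within edited regions. An evaluation scheme that better matches human perception is needed.

Result: Experiments across diverse image and video benchmarks show that the RC metrics align with human judgments substantially better than existing evaluation protocols.

Insight: The novelty is a pair of perception-aligned metrics (RC-S and RC-T) that quantify spatial coherence and temporal consistency respectively, plus a two-tier real-world benchmark with and without ground truth, giving the community a comprehensive evaluation tool and benchmark.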

Abstract: Evaluating object removal in images and videos remains challenging because the task is inherently one-to-many, yet existing metrics frequently disagree with human perception. Full-reference metrics reward copy-paste behaviors over genuine erasure; no-reference metrics suffer from systematic biases such as favoring blurry results; and global temporal metrics are insensitive to localized artifacts within edited regions. To address these limitations, we propose RC (Removal Coherence), a pair of perception-aligned metrics: RC-S, which measures spatial coherence via sliding-window feature comparison between masked and background regions, and RC-T, which measures temporal consistency via distribution tracking within shared restored regions across adjacent frames. To validate RC and support community benchmarking, we further introduce PROVE-Bench, a two-tier real-world benchmark comprising PROVE-M, an 80-video paired dataset with motion augmentation, and PROVE-H, a 100-video challenging subset without ground truth. Together, RC metrics and PROVE-Bench form the PROVE (Perceptual RemOVal cohErence) evaluation framework for visual media. Experiments across diverse image and video benchmarks demonstrate that RC achieves substantially stronger alignment with human judgments than existing evaluation protocols. The code for the RC metrics and PROVE-Bench is publicly available at: https://github.com/xiaomi-research/prove/.


[59] Local Spatiotemporal Convolutional Network for Robust Gait Recognition cs.CV

Xiaoyun Wang, Cunrong Li, Wu Wang

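TL;DR: This paper proposes a Local Spatiotemporal Convolutional Network (LSTCN) to address the challenge of capturing intrinsic motion patterns from consecutive video frames in gait recognition. By introducing a Global Bidirectional Spatial Pooling (GBSP) mechanism and a Local Spatiotemporal Convolutional (LSTC) layer, it gives standard 2D convolutional networks the ability to extract spatiotemporal information, learning strip-based gait motion patterns in a structurally simple, computationally efficient way.

Details

Motivation: Gait recognition is challenged by the complexity of video data and by external covariates such as viewpoint changes, clothing, and carried objects. Existing methods either rely on static appearance features or use computationally heavy sequential models. The paper aims to design a structurally simple yet effective network that extracts spatiotemporal information at low cost.

Result: The abstract does not report specific quantitative results, benchmarks, or comparisons with the state of the art.

Insight: The main contributions are: 1) the GBSP mechanism, which decomposes spatial features into horizontal and vertical strip representations, reducing the dimensionality of gait tensors so that the temporal dimension can take part in standard 2D convolution; 2) the LSTC layer, which jointly processes the temporal and spatial dimensions and adaptively learns strip-based motion patterns; 3) asymmetric convolution kernels that attend independently to the temporal, spatial, and joint spatiotemporal domains, enriching the feature representation. The core idea is to fold spatiotemporal feature extraction into a standard 2D CNN framework for efficient modeling.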

Abstract: Gait recognition, as a promising biometric technology, identifies individuals through their unique walking patterns and offers distinctive advantages including non-invasiveness, long-range applicability, and resistance to deliberate disguise. Despite these merits, capturing the intrinsic motion patterns concealed within consecutive video frames remains challenging due to the complexity of video data and the interference of external covariates such as viewpoint changes, clothing variations, and carrying conditions. Existing approaches predominantly either rely on static appearance features extracted from individual silhouette frames or employ complex sequential models (e.g., LSTM, 3D convolutions) that demand substantial computational resources and sophisticated training strategies. To address these limitations, we propose a Local Spatiotemporal Convolutional Network (LSTCN), a structurally simple yet highly effective dual-branch architecture that endows standard two-dimensional convolutional networks with the capacity to extract temporal information. Specifically, we introduce a Global Bidirectional Spatial Pooling (GBSP) mechanism that reduces the dimensionality of gait tensors by decomposing spatial features into horizontal and vertical strip-based local representations, enabling the temporal dimension to participate in standard 2D convolution operations. Building upon this, we design a Local Spatiotemporal Convolutional (LSTC) layer that jointly processes temporal and spatial dimensions, allowing the network to adaptively learn strip-based gait motion patterns. We further extend this formulation with asymmetric convolution kernels that independently attend to the temporal, spatial, and joint spatiotemporal domains, thereby enriching the extracted feature representations.
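
The strip decomposition described above can be sketched for a single channel as follows. This is an illustration under assumptions, not the authors' GBSP: the abstract does not state the pooling operator, so mean pooling is assumed, and the function name `strip_pool` and the toy values are hypothetical.

```python
def strip_pool(frames):
    """Pool a T x H x W single-channel sequence into strip representations.

    Returns (horizontal, vertical):
      horizontal is T x H, each row of a frame pooled over its width;
      vertical is T x W, each column of a frame pooled over its height.
    Mean pooling is an assumption made for this sketch.
    """
    horizontal = [[sum(row) / len(row) for row in frame] for frame in frames]
    vertical = [[sum(col) / len(col) for col in zip(*frame)] for frame in frames]
    return horizontal, vertical


# Toy sequence: T=2 frames, each H=3 rows by W=2 columns.
frames = [
    [[1.0, 3.0], [5.0, 7.0], [0.0, 2.0]],
    [[2.0, 4.0], [6.0, 8.0], [1.0, 3.0]],
]
horizontal, vertical = strip_pool(frames)
print(horizontal)  # T x H map: time is now one axis of a 2D array
print(vertical)    # T x W map
```

Each output is a 2D map with time as one axis, so an ordinary 2D convolution over it mixes temporal and strip-wise spatial information, which is the property the abstract attributes to GBSP.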


[60] LiWi: Layering in the Wild cs.CV

Yu He, Fang Li, Haoyang Tong, Lichen Ma, Xinyuan Shan

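TL;DR: This paper proposes the LiWi framework for layered decomposition of in-the-wild images. It builds a large-scale dataset, LiWi-100k, through an agent-driven data synthesis pipeline and designs a model that jointly optimizes photometric fidelity and boundary accuracy, achieving SOTA performance.

Details

Motivation: Layered image generation with existing generative models is largely confined to graphic design, and layered decomposition of natural images is underexplored, limiting fine-grained editing and real-world applications. The open challenges are scalable layered data and the modeling of object interaction in natural images, such as illumination effects and structural boundaries.

Result: On natural image decomposition, the framework achieves state-of-the-art results on both RGB L1 and Alpha IoU, outperforming existing models.

Insight: The contributions are: 1) an Agent-driven Data Decomposition (ADD) pipeline that synthesizes a large-scale layered dataset without manual intervention; 2) a framework that jointly models illumination effects (through shadow-guided learning) and boundary accuracy (through a degradation-restoration objective), improving the fidelity and accuracy of the decomposition.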

Abstract: Recent advances in generative models have empowered impressive layered image generation, yet their success is largely confined to graphic design domains. The layering of in-the-wild images remains an underexplored problem, limiting fine-grained editing and applications of images in real-world scenarios. Specifically, challenges remain in scalable layered data and the modeling of object interaction in natural images, such as illumination effects and structural boundaries. To address these bottlenecks, we propose a novel framework for high-fidelity natural image decomposition. First, we introduce an Agent-driven Data Decomposition (ADD) pipeline that orchestrates agents and tools to synthesize layered data without manual intervention. Utilizing this pipeline, we construct a large-scale dataset, named LiWi-100k, with over 100,000 high-quality layered in-the-wild images. Second, we present a novel framework that jointly improves photometric fidelity and alpha boundary accuracy. Specifically, shadow-guided learning explicitly models the illumination effects, and a degradation-restoration objective provides boundary-correction supervision by recovering a clean foreground image from a degraded one. Extensive experiments demonstrate that our framework achieves state-of-the-art (SoTA) performance in natural image decomposition, outperforming existing models in RGB L1 and Alpha IoU metrics. We will soon release our code and dataset.
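
For readers unfamiliar with the two reported metrics, the sketch below shows one common way to compute them. The exact definitions used in the paper are not given in the abstract, so everything here is an assumption: RGB L1 is taken as the mean absolute difference between predicted and reference color values in [0, 1], and Alpha IoU as intersection-over-union of alpha mattes binarized at a hypothetical threshold of 0.5.

```python
def rgb_l1(pred, target):
    """Mean absolute difference over flat lists of color values in [0, 1]."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def alpha_iou(pred_alpha, target_alpha, threshold=0.5):
    """IoU of two alpha mattes after binarizing at `threshold` (assumed)."""
    pred_mask = [a >= threshold for a in pred_alpha]
    target_mask = [a >= threshold for a in target_alpha]
    inter = sum(p and t for p, t in zip(pred_mask, target_mask))
    union = sum(p or t for p, t in zip(pred_mask, target_mask))
    # Two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


# Toy layer with four values per list (hypothetical numbers).
l1 = rgb_l1([0.5, 0.5, 0.5, 1.0], [0.5, 0.25, 0.5, 0.5])
iou = alpha_iou([0.9, 0.8, 0.2, 0.6], [1.0, 1.0, 0.0, 0.0])
print(l1, iou)
```

Lower RGB L1 indicates closer color reconstruction of a layer, while higher Alpha IoU indicates better agreement on where the layer's boundary lies.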


[61] SpectraFlow: Unifying Structural Pretraining and Frequency Adaptation for Medical Image Segmentation cs.CV

Zhiquan Chen, Haitao Wang, Guowei Zou, Hejun Wu

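TL;DR: This paper proposes SpectraFlow, a two-stage framework for medical image segmentation in low-data regimes. Stage 1 uses Mixed-Domain MeanFlow Pretraining to learn structure-aware representations; Stage 2 combines attentional fusion with frequency-directional dynamic convolution for boundary refinement. It achieves SOTA performance on multiple datasets.

Details

Motivation: In low-data regimes, medical image segmentation suffers from poor generalization, ambiguous boundaries, and missing fine structures. Existing self-supervised pretraining exhibits a texture bias, whereas accurate segmentation requires geometric awareness and topological consistency.

Result: Experiments on ISIC-2016, Kvasir-SEG, and GlaS show that the method outperforms existing SOTA methods, with improved robustness in low-data settings and sharper boundary delineation.

Insight: The contributions are: Mixed-Domain MeanFlow Pretraining, which uses masks as conditional structural guidance rather than prediction targets; a lightweight Dispersive Loss that prevents representation collapse; and a boundary-refinement decoder that combines Direct Attentional Fusion with Frequency-Directional Dynamic Convolution. Together they unify structural pretraining and frequency adaptation.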

Abstract: Medical image segmentation remains challenging in low-data regimes, where scarce annotations often yield poor generalization and ambiguous boundaries with missing fine structures. Recent self-supervised pretraining has improved transferability, but it often exhibits a texture bias. In contrast, accurate segmentation is inherently geometry-aware and depends on both topological consistency and precise boundary preservation. To address this problem, we propose a two-stage framework that couples structure-aware encoder pretraining with boundary-oriented decoding. In Stage-1, we aim to learn structure-aware representations for downstream segmentation in low-data regimes. To this end, we propose Mixed-Domain MeanFlow Pretraining, which aligns images and binary masks in a shared latent space through latent transport regression, where masks act as conditional structural guidance rather than prediction targets, making the pretraining task-agnostic. To further improve training stability under scarce supervision, we incorporate a lightweight Dispersive Loss to prevent representation collapse. In Stage-2, we fine-tune the pretrained encoder with a lightweight decoder that combines Direct Attentional Fusion for adaptive cross-scale gating and Frequency-Directional Dynamic Convolution for high-frequency boundary refinement under appearance variation. Experiments on ISIC-2016, Kvasir-SEG, and GlaS demonstrate consistent gains over state-of-the-art methods, with improved robustness in low-data settings and sharper boundary delineation.


[62] Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction cs.CV

Yujie Wei, Chenglong Ma, Jianxiong Gao, Chenhui Wang, Shiwei Zhang

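TL;DR: This paper proposes CineNeuron, a hierarchical framework for reconstructing videos from functional magnetic resonance imaging (fMRI) signals. Inspired by the brain's dual-pathway processing mechanism, it has two synergistic stages. A bottom-up semantic enrichment stage maps fMRI signals to a rich embedding space that captures textual semantics, image contents, action concepts, and object categories. A top-down memory integration stage uses the proposed Mixture-of-Memories method to dynamically select relevant "memories" from previously seen data and fuse them with the fMRI embedding to refine the reconstruction. Experiments on two fMRI-to-video benchmarks show that it surpasses existing state-of-the-art methods.

Details

Motivation: Current fMRI-to-video reconstruction methods are limited by a semantic gap between noisy fMRI signals and the rich content of videos. The gap stems from a reliance on incomplete semantic embeddings that neither capture video-specific cues (such as actions) nor integrate prior knowledge.

Result: Extensive experiments on two fMRI-to-video benchmarks show that CineNeuron surpasses state-of-the-art (SOTA) methods across various metrics.

Insight: The main contributions are: 1) a hierarchical framework design inspired by the brain's dual-pathway mechanism; 2) a bottom-up semantic enrichment stage that builds an embedding space capturing multimodal semantics comprehensively; 3) a top-down memory integration stage that uses the Mixture-of-Memories method to draw on prior knowledge dynamically. By combining multimodal semantics with external memory, the work narrows the semantic gap between neural signals and visual content and suggests a new direction for neural decoding.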

Abstract: Reconstructing dynamic visual experiences as videos from functional magnetic resonance imaging (fMRI) is pivotal for advancing the understanding of neural processes. However, current fMRI-to-video reconstruction methods are hindered by a semantic gap between noisy fMRI signals and the rich content of videos, stemming from a reliance on incomplete semantic embeddings that neither capture video-specific cues (e.g., actions) nor integrate prior knowledge. To this end, we draw inspiration from the dual-pathway processing mechanism in the human brain and introduce CineNeuron, a novel hierarchical framework for semantically enhanced video reconstruction from fMRI signals with two synergistic stages. First, a bottom-up semantic enrichment stage maps fMRI signals to a rich embedding space that comprehensively captures textual semantics, image contents, action concepts, and object categories. Second, a top-down memory integration stage utilizes the proposed Mixture-of-Memories method to dynamically select relevant “memories” from previously seen data and fuse them with the fMRI embedding to refine the video reconstruction. Extensive experimental results on two fMRI-to-video benchmarks demonstrate that CineNeuron surpasses state-of-the-art methods across various metrics.


[63] A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval cs.CV | cs.AI | cs.IR

Ho Hung Lim, Yi Yang

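TL;DR: This paper is an empirical study of aggregation strategies for visual financial document retrieval, asking whether aggregating vision patch tokens into a single vector loses key information. It finds that single-vector aggregation gives semantically different documents almost identical vectors, hiding the small but critical semantic changes found in financial documents.

Details

Motivation: Visual RAG treats documents as images and uses vision encoders to obtain patch tokens, but hundreds of patch tokens per document create retrieval and storage challenges in a vector database. Practical deployment requires aggregating them into a single vector, which raises a critical question: does single-vector aggregation lose key information in financial documents?

Result: Experiments on a diagnostic benchmark built from financial documents show that single-vector aggregation gives different documents almost identical vectors, whereas the patch level detects the semantic changes. The finding holds across model scales, retrieval-optimized embeddings, and multiple mitigation strategies.

Insight: The study identifies global texture dominance as the root cause of the single-vector aggregation problem, highlights the significant risk of single-vector visual document retrieval in financial applications, and points to the need for finer-grained aggregation or retrieval strategies that preserve critical details.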

Abstract: Visual RAG has offered an alternative to traditional RAG. It treats documents as images and uses vision encoders to obtain vision patch tokens. However, hundreds of patch tokens per document create retrieval and storage challenges in a vector database. Practical deployment requires aggregating them into a single vector. This raises a critical question: does single-vector aggregation lose key information in financial documents? We develop a diagnostic benchmark using financial documents where changes in single digits can lead to significant semantic shifts. Our experiments show that single-vector aggregation collapses different documents into almost identical vectors. Metrics show that the patch level detects semantic changes, and confirm that aggregation obscures these details. We identify global texture dominance as the root cause. Our findings are consistent across model scales, retrieval-optimized embeddings, and multiple mitigation strategies, highlighting significant risks for single-vector visual document retrieval in financial applications.
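
The collapse described above can be reproduced on toy data. The sketch below is not the paper's benchmark: the 3-dimensional "patch embeddings", the count of 200 patches, and the choice of mean pooling as the aggregation are all assumptions made for illustration. Two synthetic documents share the same background patches and differ in a single patch, standing in for a changed digit.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pool(patches):
    """Aggregate patch vectors into one vector by averaging each dimension."""
    n = len(patches)
    return [sum(col) / n for col in zip(*patches)]


# 200 identical "texture" patches per document (hypothetical embeddings).
doc_a = [[1.0, 1.0, 0.0] for _ in range(200)]
doc_b = [patch[:] for patch in doc_a]
# One patch differs between the documents, e.g. a single changed digit.
doc_a[17] = [1.0, 0.0, 1.0]
doc_b[17] = [1.0, 0.0, -1.0]

pooled_sim = cosine(mean_pool(doc_a), mean_pool(doc_b))
min_patch_sim = min(cosine(p, q) for p, q in zip(doc_a, doc_b))
print(pooled_sim)     # very close to 1: the pooled vectors barely differ
print(min_patch_sim)  # the changed patch pair is orthogonal
```

The pooled similarity stays above 0.999 because 199 shared patches dominate the average, while the patch-level comparison exposes the one pair that changed. This mirrors the abstract's claim that the shared global content dominates the aggregated vector.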


[64] TOPOS: High-Fidelity and Efficient Industry-Grade 3D Head Generation cs.CV | cs.GR

Bojun Xiong, Zoubin Bi, Xinghui Peng, Yunmu Wang, Junchen Deng

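TL;DR: This paper presents TOPOS, a framework designed for single-image-conditioned 3D head generation. It jointly recovers geometry and appearance under an industry-standard topology, generating head meshes with a fixed, clean topology and consistent vertex-level correspondence across all generated heads. It has three core modules: TOPOS-VAE, a variational autoencoder that models heads under a unified topology; TOPOS-DiT, a rectified flow transformer that efficiently generates high-quality meshes from a single image; and TOPOS-Texture, which generates relightable UV texture maps from the same portrait image.

Details

Motivation: In film, animation, and video game pipelines, high-fidelity 3D head generation needs a fixed, clean, uniform reference topology, which is a prerequisite for production-level rigging, skinning, and animation. Existing general 3D generative models produce meshes with inconsistent topology and numerous vertices, hindering semantic correspondence and asset-level reuse.

Result: Extensive experiments show that TOPOS achieves state-of-the-art performance on 3D head generation, surpassing both classical 3D face reconstruction methods and general 3D object generative models.

Insight: The main contributions are: 1) a framework, driven by industrial requirements, that generates 3D heads with a fixed, uniform studio-style topology and vertex-level semantic correspondence; 2) TOPOS-VAE, inspired by multimodal large language models, which uses the Perceiver Resampler to convert input point clouds of diverse topologies into the target reference topology and builds a structured latent space; 3) TOPOS-Texture, an end-to-end texture module that fine-tunes a multimodal image generative model to produce relightable UV textures that are spatially aligned with the mesh geometry and preserve high-frequency detail.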

Abstract: High-fidelity 3D head generation plays a crucial role in the film, animation and video game industries. In industrial pipelines, studios typically enforce a fixed reference topology across all head assets, as such a clean and uniform topology is a prerequisite for production-level rigging, skinning and animation. In this paper, we present TOPOS, a framework tailored for single-image-conditioned 3D head generation that jointly recovers geometry and appearance under such an industry-standard topology. In contrast to general 3D generative models which produce triangle meshes with inconsistent topology and numerous vertices, hindering semantic correspondence and asset-level reuse, TOPOS generates head meshes with a fixed, studio-style topology, enabling consistent vertex-level correspondence across all generated heads. To model heads under this unified topology, we propose a novel variational autoencoder structure, termed TOPOS-VAE. Inspired by multimodal large language models (MLLMs), our TOPOS-VAE leverages the Perceiver Resampler to convert input point clouds sampled from head meshes of diverse topologies into the target reference topology. Building upon TOPOS-VAE’s structured latent space, we train a rectified flow transformer, TOPOS-DiT, to efficiently generate high-fidelity head meshes from a single image. We further present TOPOS-Texture, an end-to-end module that produces relightable UV texture maps from the same portrait image via fine-tuning a multimodal image generative model. The generated textures are spatially aligned with the underlying mesh geometry and faithfully preserve high-frequency appearance details. Extensive experiments demonstrate that TOPOS achieves state-of-the-art performance on 3D head generation, surpassing both classical face reconstruction methods and general 3D object generative models, highlighting its effectiveness for digital human creation.


[65] VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting cs.CV | cs.CE | cs.MMPDF

Chunlei Shi, Hao Li, Yufeng Zhu, Boyu Liu, Yongchao Feng

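TL;DR: This paper proposes VMU-Diff, a coarse-to-fine multi-source data fusion framework for precipitation nowcasting. It uses a two-stage pipeline: a coarse stage based on a deterministic model fuses radar and multi-band satellite data to predict global motion trends, and a fine stage based on a probabilistic model (a residual conditional diffusion model) generates fine prediction details.

Details

Motivation: Existing precipitation nowcasting methods rely mainly on single-source radar data. Deterministic models produce blurry predictions because of MSE convergence, while probabilistic models such as diffusion models generate fine details but suffer from artifacts and low computational efficiency. This paper aims to address these problems and to use multi-source data to improve forecasting performance.

Result: Experiments on the Jiangsu SWAN dataset show that the method improves on existing state-of-the-art methods, particularly for short-term forecasts.

Insight: The novelty lies in a two-stage framework (coarse deterministic prediction plus fine probabilistic generation) that fuses multi-source data, and in using Vision Mamba state-space modules for spatio-temporal feature fusion and residual reconstruction, balancing global trends against detail generation and reducing artifacts.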

Abstract: Precipitation nowcasting is a vital spatio-temporal prediction task for meteorological applications but faces challenges due to the chaotic property of precipitation systems. Existing methods predominantly rely on single-source radar data to build either deterministic or probabilistic models for extrapolation. However, the single deterministic model suffers from blurring due to MSE convergence. The single probabilistic model, typically represented by diffusion models, can generate fine details but suffers from spurious artifacts that compromise accuracy and from computational inefficiency. To address these challenges, this paper proposes a novel coarse-to-fine Vision Mamba Unet and residual Diffusion (VMU-Diff) based precipitation nowcasting framework. It realizes precipitation nowcasting through a two-stage process, i.e., a deterministic model-based coarse stage to predict global motion trends and a probabilistic model-based fine stage to generate fine prediction details. In the coarse prediction stage, rather than single-source radar data, both radar and multi-band satellite data are taken as input. A spatial-temporal attention block and several Vision Mamba state-space blocks realize multi-source data fusion and predict the future global echo dynamics. The fine-grained stage is realized by a spatio-temporal refinement generator based on residual conditional diffusion models. It first obtains spatio-temporal residual features based on the coarse prediction and the ground truth, and then reconstructs the residual via a conditional Mamba state-space module. Experiments on Jiangsu SWAN datasets demonstrate the improvements of our method over state-of-the-art methods, particularly in short-term forecasts.


[66] ViMU: Benchmarking Video Metaphorical Understanding cs.CV | cs.CYPDF

Qi Li, Xinchao Wang

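TL;DR: This paper introduces ViMU, the first benchmark to systematically evaluate how well frontier models understand subtext in videos, such as metaphor, irony, and social meaning. Using hint-free questions grounded in multimodal evidence, ViMU evaluates whether models can go beyond literal perception to infer the implicit meaning of a video.

Details

Motivation: Existing video understanding models focus mainly on literal visual comprehension (recognizing objects, actions, or temporal relations) and lack a systematic ability to understand subtext in videos such as metaphor, irony, and social meaning. ViMU aims to fill this gap.

Result: The abstract reports no specific quantitative results or benchmark comparisons. It presents ViMU as the first benchmark for systematically evaluating video subtext understanding, with questions designed to be hint-free so that model ability is tested rigorously.

Insight: The novelty lies in proposing the first benchmark for video metaphorical understanding, with an emphasis on grounding in multimodal evidence and a combination of open-ended and multiple-choice questions. It pushes models from literal understanding toward inferring deeper social and cultural meaning, and its hint-free design is a useful reference for evaluating genuine reasoning ability.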

Abstract: Any new medium, once it emerges, is used for more than the transmission of overt content alone. The information it carries typically operates on two levels: one is the content directly presented, while the other is the subtext beneath it, namely the implicit ideas and intentions the creator seeks to convey through the medium. Likewise, since video technologies became widely adopted, video has served not only as a powerful tool for recording and communicating visual information, but also as a vehicle for emotions, attitudes, and social meanings that are often difficult to articulate explicitly. Thus, the true meaning of many videos does not reside solely in what is shown on screen; it is often embedded in context, style of expression, and the viewer’s social experience. Some forms of such video subtext are humorous, while others carry irony, mockery, or criticism. These implicit meanings can also be interpreted very differently across cultural backgrounds and social groups. However, most existing video understanding models still focus primarily on literal visual comprehension, such as recognizing objects, actions, or temporal relations, and lack a systematic ability to understand the metaphorical, ironic, and social meanings embedded in videos. To bridge this gap, we introduce ViMU, the first benchmark designed to systematically evaluate the subtext understanding capabilities of frontier models in videos. ViMU assesses whether video understanding models can go beyond literal perception to infer implicit meaning while grounding their interpretations in multimodal evidence and answering both open-ended and multiple-choice questions. Importantly, all questions are designed to be hint-free, ensuring that no key evidence is disclosed to models before answering.


[67] CalibAnyView: Beyond Single-View Camera Calibration in the Wild cs.CVPDF

Boying Li, Cheng Zhang, Weirong Chen, Daniel Cremers, Ian Reid

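TL;DR: This paper proposes CalibAnyView, a unified camera calibration method that supports an arbitrary number of input views (N≥1) and jointly estimates camera intrinsics and gravity direction by explicitly modeling cross-view geometric consistency. The authors construct a large-scale multi-view video dataset and develop a multi-view transformer that predicts dense perspective fields, which are then passed to a geometric optimization framework for joint estimation. Experiments show strong robustness in the single-view setting and further gains with multi-view inference, providing a reliable foundation for downstream tasks such as in-the-wild 3D reconstruction and robotic perception.

Details

Motivation: Classical camera calibration methods rely on controlled acquisition setups and do not suit in-the-wild imagery, while existing learning-based methods address only single-view calibration and neglect geometric consistency across views. This paper targets in-the-wild calibration for an arbitrary number of views, modeling cross-view geometric consistency to improve accuracy and robustness.

Result: In extensive experiments, CalibAnyView consistently outperforms state-of-the-art methods, shows strong robustness in the single-view setting, and improves further with multi-view inference, providing a reliable foundation for downstream tasks.

Insight: The contributions are: 1) a unified calibration formulation that supports an arbitrary number of views and explicitly models cross-view geometric consistency; 2) a large-scale multi-view video dataset covering diverse real-world scenarios; 3) a multi-view transformer that predicts dense perspective fields, combined with a geometric optimization framework that jointly estimates camera intrinsics and gravity direction. Viewed objectively, combining learning with geometric optimization improves the practicality and generalization of in-the-wild calibration.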

Abstract: Camera calibration is a fundamental prerequisite for reliable geometric perception, yet classical approaches rely on controlled acquisition setups that are impractical for in-the-wild imagery. Recent learning-based methods have shown promising results for single-view calibration, but inherently neglect geometric consistency across multiple views. We introduce CalibAnyView, a unified formulation that supports an arbitrary number of input views ($N \geq 1$) by explicitly modeling cross-view geometric consistency. To facilitate this, we construct a large-scale multi-view video dataset covering diverse real-world scenarios, including multiple camera models, dynamic scenes, realistic motion trajectories, and heterogeneous lens distortions. Building on this dataset, we develop a multi-view transformer that predicts dense perspective fields, which are further integrated into a geometric optimization framework to jointly estimate camera intrinsics and gravity direction. Extensive experiments demonstrate that CalibAnyView consistently outperforms state-of-the-art methods, achieves strong robustness under single-view settings, and further improves with multi-view inference, providing a reliable foundation for downstream tasks such as 3D reconstruction and robotic perception in the wild.


[68] Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution cs.CV | cs.AI | cs.CLPDF

Tian Qin, Junzhe Chen, Yuqing Shi, Tianshu Zhang, Qiang Ju

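TL;DR: This paper proposes SIRA, a training-free internal contrastive decoding framework for mitigating hallucinations in large vision-language models (LVLMs). SIRA exploits the staged information flow of multimodal transformers to build a counterfactual reference branch inside the same model, without relying on externally perturbed inputs or extra forward passes. During decoding, it suppresses tokens that remain strongly predicted without late access to visual information and favors predictions that depend on the full visual pathway.

Details

Motivation: Existing contrastive decoding methods mitigate hallucinations by comparing predictions from the original image with predictions from externally perturbed visual inputs, which introduces off-manifold artifacts and requires costly extra forward passes. The paper aims to develop an internal contrastive decoding method that needs no training, external tools, or perturbed inputs, reducing LVLM hallucinations more efficiently.

Result: On the POPE, CHAIR, and AMBER benchmarks with Qwen2.5-VL and LLaVA-v1.5, SIRA consistently reduces hallucinations while preserving descriptive coverage, with lower overhead than two-pass contrastive decoding.

Insight: The novelty is an internal contrastive decoding framework that forms an aligned multimodal state through a shared prefix and then forks a counterfactual branch in later transformer layers, where attention to image tokens is masked. This yields a language-prior-dominated reference inside the same model, avoids dependence on external perturbations or extra forward passes, and offers an efficient and interpretable route to hallucination mitigation.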

Abstract: Large vision-language models (LVLMs) often hallucinate when language priors dominate weak or ambiguous visual evidence. Existing contrastive decoding methods mitigate this problem by comparing predictions from the original image with those from externally perturbed visual inputs, but such references can introduce off-manifold artifacts and require costly extra forward passes. We propose SIRA, a training-free internal contrastive decoding framework that constructs a counterfactual reference inside the same LVLM by exploiting the staged information flow of multimodal transformers. Instead of removing visual information from the input, SIRA first lets image and text tokens interact through a shared prefix, forming an aligned multimodal state that preserves prompt interpretation, decoding history, positional structure, and early visual grounding. It then forks a counterfactual branch in later transformer layers, where attention to image-token positions is masked. This branch retains the shared multimodal context but lacks continued access to fine-grained visual evidence, yielding a language-prior-dominated internal reference for token-level contrast. During decoding, SIRA suppresses tokens that remain strong without late visual access and favors predictions whose advantage depends on the full visual pathway. Experiments on POPE, CHAIR, and AMBER with Qwen2.5-VL and LLaVA-v1.5 show that SIRA consistently reduces hallucinations while preserving descriptive coverage and incurring lower overhead than two-pass contrastive decoding. SIRA requires no training, external verifier, or perturbed input, and applies to open-weight LVLMs with white-box inference access.
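
The token-level contrast in the abstract can be sketched as follows. The scoring rule `(1 + alpha) * log p_full - alpha * log p_ref` is the generic contrastive decoding form and is an assumption here, not SIRA's confirmed formula; `alpha`, the toy vocabulary, and the logits are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def contrast_scores(full_logits, ref_logits, alpha=1.0):
    """Assumed contrast rule: boost tokens whose probability drops when
    late visual access is removed, penalize tokens that stay strong."""
    full = log_softmax(full_logits)
    ref = log_softmax(ref_logits)
    return [(1 + alpha) * f - alpha * r for f, r in zip(full, ref)]

def argmax_token(vocab, scores):
    return vocab[max(range(len(vocab)), key=lambda i: scores[i])]

# Hypothetical next-token candidates.
vocab = ["dog", "frisbee", "cat"]
full_logits = [2.0, 1.8, 0.1]  # branch with the full visual pathway
ref_logits = [2.0, 0.2, 0.1]   # counterfactual branch, image attention masked

greedy_pick = argmax_token(vocab, full_logits)
contrast_pick = argmax_token(vocab, contrast_scores(full_logits, ref_logits))
print(greedy_pick, contrast_pick)
```

Here "dog" is equally likely with or without late visual access, so it behaves like a language-prior token and is down-weighted, while "frisbee" depends on the visual pathway and is promoted.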


[69] MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models cs.CV | cs.AIPDF

Tianwei Chen, Takuya Furusawa, Yuki Hirakawa, Ryotaro Shimizu, Mo Fan

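TL;DR: This paper presents MultiEmo-Bench, a multi-label visual emotion analysis benchmark dataset for comprehensively evaluating how well multimodal large language models (MLLMs) predict the emotions evoked by images. Multiple annotators select every emotion they feel for each image, and the votes are aggregated into more reliable and representative emotion-distribution labels. Using the dataset, the authors evaluate several recent MLLMs on dominant emotion prediction and emotion distribution prediction. The results show progress but substantial room for improvement, and the LLM-as-a-judge method shows limitations on this subjective task.

Details

Motivation: The annotation scheme of existing visual emotion datasets is flawed: each annotator judges only whether a single candidate emotion is evoked by an image. This limits their value for evaluation and may underestimate the true capability of MLLMs, so a new and more suitable benchmark is needed to evaluate multi-label emotion analysis in MLLMs.

Result: Recent MLLMs, including Qwen3-VL, GPT, Gemini, and Claude, are evaluated on MultiEmo-Bench (10,344 images with 236,998 valid votes across eight emotions) on both dominant emotion prediction and emotion distribution prediction. The results show that recent MLLMs have made progress but that substantial room for improvement remains.

Insight: The novelty is a multi-label visual emotion analysis benchmark built by aggregating votes from multiple annotators. Its annotation scheme better reflects the fact that one image can evoke several emotions and gives a more reliable basis for evaluating MLLMs. Viewed objectively, the work exposes the limits of single-label or simplistic annotation schemes and offers a new approach to evaluating models on complex, subjective multimodal understanding tasks. The paper also notes that LLM-as-a-judge does not perform consistently on subjective tasks such as visual emotion analysis, which is a useful finding in its own right.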

Abstract: This paper introduces a multi-label visual emotion analysis benchmark dataset for comprehensively evaluating the ability of multimodal large language models (MLLMs) to predict the emotions evoked by images. Recent user studies report an unintuitive finding: humans may prefer the predictions of MLLMs over the labels in existing datasets. We argue that this phenomenon stems from the suboptimal annotation scheme used in existing datasets, where each annotator is shown a single candidate emotion for each image and judges whether it is evoked or not. This approach is clearly limited because a single image can evoke multiple emotions with varying intensities. As a result, evaluations based on these datasets may underestimate the capabilities of MLLMs, yet an appropriate benchmark for evaluating such models remains lacking. To address this issue, we introduce a new multi-label benchmark dataset for visual emotion analysis toward MLLMs evaluation. We hire $20$ annotators per image and ask them to select all emotions they feel from an image. Then, we aggregate the votes across all annotators, providing a more reliable and representative dataset labeled with a distribution of emotions. The resulting dataset contains $10,344$ images with $236,998$ valid votes across eight emotions. Based on this benchmark dataset, we evaluate several recent models, including Qwen3-VL, OpenAI’s GPT, Gemini, and Claude. We assess model performance on both dominant emotion prediction and emotion distribution prediction. Our results demonstrate the progress achieved by recent MLLMs while also indicating that substantial room for improvement remains. Furthermore, our experiments with LLM-as-a-judge show that the method does not consistently improve MLLMs’ performance, indicating its limitations for the subjective task of visual emotion analysis.
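
The vote aggregation step can be sketched as below. The normalization (each emotion's share of all votes cast for an image) is an assumption, since the abstract does not specify how the distribution is computed; the emotion names and selections are hypothetical.

```python
from collections import Counter

def emotion_distribution(selections):
    """Turn per-annotator multi-label selections into a vote distribution.

    selections: list of sets, one set of chosen emotions per annotator.
    Returns each emotion's share of all votes (assumed normalization).
    """
    votes = Counter(emotion for chosen in selections for emotion in chosen)
    total = sum(votes.values())
    if total == 0:
        return {}
    return {emotion: count / total for emotion, count in votes.items()}

def dominant_emotion(distribution):
    """Emotion with the largest share of votes."""
    return max(distribution, key=distribution.get)

# Hypothetical selections from four annotators for one image.
selections = [
    {"joy", "awe"},
    {"joy"},
    {"joy", "sadness"},
    {"awe"},
]

dist = emotion_distribution(selections)
top = dominant_emotion(dist)
print(dist, top)
```

The same distribution supports both evaluation tasks named in the abstract: the arg-max gives the dominant emotion, and the full dictionary is the target for distribution prediction.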


[70] Vision-Based Water Level and Flow Estimation cs.CV | cs.AIPDF

ZhiXin Sun

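TL;DR: This paper proposes an integrated framework that combines advanced vision models with statistical modeling for vision-based estimation of water level and river surface velocity. By using physical priors and robust filtering strategies, the framework improves the accuracy of water level detection and flow estimation.

Details

Motivation: Vision-based water level and velocity estimation offers better interpretability, automated data archiving, and greater system robustness than traditional sensing, but it still faces environmental sensitivity, limited precision, and complex site calibration. This paper aims to address these problems.

Result: The paper improves the accuracy of water level detection and flow estimation by combining physical priors with robust filtering strategies, but the abstract gives no specific benchmarks, quantitative results, or comparisons with SOTA models.

Insight: The main novelty is the synergistic integration of SOTA vision models with statistical modeling, together with physical priors and robust filtering to improve system performance. This offers a reusable approach for improving robustness and accuracy in environment-sensitive vision tasks.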

Abstract: With the rapid evolution of computer vision, vision-based methodologies for water level and river surface velocity estimation have reached significant maturity. Compared to traditional sensing, these techniques offer superior interpretability, automated data archiving, and enhanced system robustness. However, challenges such as environmental sensitivity, limited precision, and complex site calibration persist. This work proposes an integrated framework that synergizes state-of-the-art (SOTA) vision models with statistical modeling. By leveraging physical priors and robust filtering strategies, we improve the accuracy of water level detection and flow estimation. Code will be available at https://github.com/sunzx97/Vision_Based_Water_Level_and_Flow_Estimation.git


[71] Beyond Instance-Level Self-Supervision in 3D Multi-Modal Medical Imaging cs.CVPDF

Tan Pan, Shuhao Mei, Yixuan Sun, Kaiyu Guo, Chen Jiang

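TL;DR: This paper proposes a pre-training method for 3D multi-modal medical imaging that goes beyond instance-level self-supervision. It uses the cross-instance topological consistency of anatomical structures as a supervisory signal, addressing the tendency of conventional methods to ignore consistent spatial relationships across individuals. The method uses two alignment mechanisms: a cross-modal triplet objective when pixel-level correspondences are available, and partial neighborhood alignment driven by pseudo-correspondences when they are not, which prevents topology collapse.

Details

Motivation: Conventional self-supervised pre-training methods for medical imaging treat each individual as an isolated instance and fail to exploit a key physiological property: anatomical structures keep stable spatial relationships across individuals (for example, the thalamus is always medial to the basal ganglia).

Result: Validated on 7 downstream multi-modal tasks, the method achieves average improvements of 1.1% in segmentation and 5.94% in classification, and shows significantly better robustness when modalities are missing at test time.

Insight: The novelty is to use cross-instance topological consistency as a supervisory signal and to design two alignment mechanisms that handle variability across modalities and instances in medical imaging. Viewed objectively, the method brings anatomical prior knowledge into self-supervised learning and provides a more robust representation learning framework for multi-modal medical image analysis.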

Abstract: Self-supervised pre-training methods in medical imaging typically treat each individual as an isolated instance, learning representations through augmentation-based objectives or masked reconstruction. They often do not adequately capitalize on a key characteristic of physiological features: anatomical structures maintain consistent spatial relationships across individuals (instances), such as the thalamus being medial to the basal ganglia, regardless of variations in brain size, shape, or pathology. We propose leveraging this cross-instance topological consistency as a supervisory signal. The challenge arises from the inherent variability in medical imaging, which can differ significantly across instances and modalities. To tackle this, we focus on two alignment regimes. (i) Intra-instance: with pixel-level correspondences available, a cross-modal triplet objective explicitly preserves local neighborhood topology. (ii) Inter-instance: without such supervision, we derive pseudo-correspondences to control partial neighborhood alignment and prevent topology collapse across modalities. We validate our approach across 7 downstream multi-modal tasks, achieving average improvements of 1.1% and 5.94% in segmentation and classification tasks, respectively, and demonstrating significantly better robustness when modalities are missing at test time.
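
For readers unfamiliar with triplet objectives, the intra-instance regime can be sketched with the standard margin-based triplet loss. This is the textbook formulation, assumed here for illustration; the paper's exact loss, distance, and `margin` value are not given in the abstract, and the embeddings are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Standard triplet margin loss: pull the positive closer to the
    anchor than the negative by at least `margin`."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)

# Hypothetical embeddings: the anchor is a voxel in modality A, the positive
# is the corresponding voxel in modality B, negatives are other locations.
anchor = [0.0, 0.0]
positive = [0.1, 0.0]
negative_far = [2.0, 0.0]
negative_near = [0.3, 0.0]

loss_satisfied = triplet_loss(anchor, positive, negative_far)
loss_violated = triplet_loss(anchor, positive, negative_near)
print(loss_satisfied, loss_violated)
```

The loss is zero once a non-corresponding location is farther from the anchor than the true correspondence by the margin, which is one way a local neighborhood ordering can be preserved across modalities.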


[72] MiVE: Multiscale Vision-language features for reference-guided video Editing cs.CVPDF

Tong Wang, Meng Zou, Chengjing Wu, Xiaochao Qu, Luoqi Liu

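TL;DR: This paper proposes MiVE, a framework that uses multiscale vision-language model features for reference-guided video editing, addressing the limitations of existing methods in modality alignment and detail preservation.

Details

Motivation: Existing reference-guided video editing methods suffer from either a modality gap or a loss of spatial detail. This paper aims to achieve more precise editing by using the hierarchical features of a VLM.

Result: MiVE achieves SOTA performance in human preference evaluation, outperforming both academic methods and commercial systems.

Insight: The novelty is to exploit the fact that early VLM layers retain spatial detail while deeper layers encode global semantics, and to use a unified self-attention architecture that avoids the modality mismatch of cross-attention designs.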

Abstract: Reference-guided video editing takes a source video, a text instruction, and a reference image as inputs, requiring the model to faithfully apply the instructed edits while preserving original motion and unedited content. Existing methods fall into two paradigms, each with inherent limitations: decoupled encoders suffer from modality gaps when processing instructions and visual content independently, while unified vision-language encoders lose fine-grained spatial details by relying solely on final-layer representations. We observe that VLM layers encode complementary information hierarchically – early layers capture localized spatial details essential for precise editing, while deeper layers encode global semantics for instruction comprehension. Building on this insight, we present MiVE (Multiscale Vision-language features for reference-guided video Editing), a framework that repurposes VLMs as multiscale feature extractors. MiVE extracts hierarchical features from Qwen3-VL and integrates them into a unified self-attention Diffusion Transformer, eliminating the modality mismatch inherent in cross-attention designs. Experiments demonstrate that MiVE achieves state-of-the-art performance by ranking highest in human preference, outperforming both academic methods and commercial systems.


[73] Are Candidate Models Really Needed for Active Learning? cs.CVPDF

Harshini Mridula Mohan, Maanya Manjunath, Vipul Arya, S. H. Shabbeer Basha, Nitin Cheekatla

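TL;DR: This paper asks whether candidate models are really needed for active learning and proposes a simplified active learning approach that requires no initial candidate model. Using randomly initialized CNN and transformer models with confidence-based sampling strategies (high confidence, low confidence, and a combination of the two), it selects the most informative samples for annotation to reduce the labeling burden.

Details

Motivation: Conventional active learning frameworks depend on initial candidate models for iterative sample selection, which is time-consuming and resource-intensive. This paper investigates whether models with randomly initialized weights can achieve comparable performance, simplifying the pipeline and improving efficiency.

Result: Experiments show that the low-confidence (LC) sampling strategy performs best in most cases and, without candidate models, achieves results comparable to conventional active learning frameworks that depend on them, supporting the robustness of the approach.

Insight: The novelty is to challenge the dependence of conventional active learning on candidate models and to propose a more efficient and flexible simplified framework. Viewed objectively, combining randomly initialized models with confidence-based sampling offers a new way to reduce computation and annotation costs, with particular potential in resource-constrained settings.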

Abstract: Deep learning has profoundly impacted domains such as computer vision and natural language processing by uncovering complex patterns in vast datasets. However, the reliance on extensive labeled data poses significant challenges, including resource constraints and annotation errors, particularly in training Convolutional Neural Networks (CNNs) and transformers because of their large number of parameters. Active learning offers a promising solution to reduce labeling burdens by strategically selecting the most informative samples for annotation. However, current active learning frameworks are time-intensive because they select samples iteratively with the help of initial candidate models. This study investigates the feasibility of using CNNs and transformers with randomly initialized weights, eliminating the need for initial candidate models while achieving results comparable to active learning frameworks that depend on such candidate models. We evaluate three confidence-based sampling strategies: high confidence (HC), low confidence (LC), and a combination of high confidence in the early stages of training and low confidence at later stages of training (HCLC). Among these, LC demonstrated the best performance in most of our experiments, showcasing its effectiveness as an active learning strategy without the need for candidate models. Further, extensive experiments verify the robustness of the proposed active learning methods. By challenging traditional frameworks, the proposed work introduces a streamlined approach to active learning, advancing efficiency and flexibility across diverse datasets and domains.
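
The three sampling strategies can be sketched in a few lines. This assumes "confidence" means the model's maximum softmax probability per unlabeled sample, which the abstract does not state; `switch_round`, the sample IDs, and the scores are hypothetical.

```python
def select_for_labeling(confidences, budget, strategy="LC"):
    """Pick `budget` sample IDs to send for annotation.

    confidences: dict mapping sample ID -> max softmax probability (assumed).
    strategy: "LC" picks the least confident samples, "HC" the most confident.
    """
    most_confident_first = strategy == "HC"
    ranked = sorted(confidences, key=confidences.get, reverse=most_confident_first)
    return ranked[:budget]

def hclc_strategy(round_index, switch_round):
    """HCLC: high confidence in early rounds, low confidence afterwards.
    `switch_round` is a hypothetical hyperparameter."""
    return "HC" if round_index < switch_round else "LC"

# Hypothetical confidences from a randomly initialized model.
confidences = {"a": 0.95, "b": 0.40, "c": 0.70, "d": 0.55}

lc_batch = select_for_labeling(confidences, budget=2, strategy="LC")
hc_batch = select_for_labeling(confidences, budget=2, strategy="HC")
early = hclc_strategy(round_index=0, switch_round=3)
late = hclc_strategy(round_index=3, switch_round=3)
print(lc_batch, hc_batch, early, late)
```

In a full loop, the selected batch would be labeled, added to the training set, and the model retrained before confidences are recomputed for the next round.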


[74] EponaV2: Driving World Model with Comprehensive Future Reasoning cs.CVPDF

Jiawei Xu, Zhizhou Zhong, Zhijian Shu, Mingkai Jia, Mingxiao Li

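TL;DR: This paper proposes EponaV2, a new paradigm for driving world models that achieves high-quality trajectory planning through comprehensive future reasoning. Inspired by how human drivers anticipate 3D geometry and semantics, the model is trained to forecast more comprehensive future representations that can be decoded into future geometry and semantic maps. The 3D and semantic modalities give the model a deeper understanding of its environment, and the future prediction task significantly strengthens its real-world reasoning. In addition, inspired by LLM training recipes, the authors introduce a flow matching group relative policy optimization mechanism to further improve planning accuracy.

Details

Motivation: The prevailing perception-planning paradigm in autonomous driving relies heavily on expensive manual annotations to supervise trajectory planning, which limits scalability. Existing perception-free driving world models perform well, but their planning reasoning is built only on next-frame image prediction and lacks sufficient supervision, leading to incomplete scene understanding and unsatisfactory trajectory planning.

Result: On three NAVSIM benchmarks, EponaV2 achieves state-of-the-art performance among perception-free models (+1.3 PDMS, +5.5 EPDMS).

Insight: The novelty is to extend planning reasoning in driving world models from image prediction alone to comprehensive prediction of multimodal future representations, including 3D geometry and semantics, which strengthens scene understanding and reasoning. The work also borrows from LLM training by introducing a flow matching group relative policy optimization mechanism to improve planning accuracy. Viewed objectively, it is an effective paradigm that combines multimodal future prediction with reinforcement learning policy optimization to improve end-to-end planning.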

Abstract: Data scaling plays a pivotal role in the pursuit of general intelligence. However, the prevailing perception-planning paradigm in autonomous driving relies heavily on expensive manual annotations to supervise trajectory planning, which severely limits its scalability. Conversely, although existing perception-free driving world models achieve impressive driving performance, their real-world reasoning ability for planning is solely built on next-frame image forecasting. Lacking sufficient supervision, these models often struggle with comprehensive scene understanding, resulting in unsatisfactory trajectory planning. In this paper, we propose EponaV2, a novel paradigm of driving world models, which achieves high-quality planning with comprehensive future reasoning. Inspired by how human drivers anticipate 3D geometry and semantics, we train our model to forecast more comprehensive future representations, which can be additionally decoded to future geometry and semantic maps. Extracting the 3D and semantic modalities enables our model to deeply understand the surrounding environment, and the future prediction task significantly enhances the real-world reasoning capabilities of EponaV2, ultimately leading to improved trajectory planning. Moreover, inspired by the training recipe of Large Language Models (LLMs), we introduce a flow matching group relative policy optimization mechanism to further improve planning accuracy. The state-of-the-art (SOTA) performance of EponaV2 among perception-free models on three NAVSIM benchmarks (+1.3 PDMS, +5.5 EPDMS) demonstrates the effectiveness of our methods.


[75] Generating HDR Video from SDR Video cs.CVPDF

SaiKiran Tedla, Francesco Banterle, Trevor Canham, Karanpreet Raja, David B. Lindell

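TL;DR: This paper proposes a framework for generating high dynamic range (HDR) video from standard dynamic range (SDR) video. Building on large-scale generative video models, a Multi-Exposure Video Model (MEVM) predicts exposure-bracketed linear SDR video sequences from a single nonlinear SDR video, and a learnable Video Merging Model (VMM) merges those sequences into a high-quality HDR video that preserves detail in both shadows and highlights.

Details

Motivation: The HDR video ecosystem is maturing, but upconverting legacy SDR video to HDR still lacks a convincing solution. This paper addresses the challenge of synthesizing HDR video from casual SDR footage.

Result: Extensive experiments, quantitative and qualitative evaluation, and a user study show that the method enables robust HDR conversion for in-the-wild examples from casual consumer videos and even iconic films.

Insight: The main novelty is a two-stage generative framework (MEVM plus VMM) that decomposes SDR-to-HDR conversion into exposure prediction and high-quality merging, and that can be combined with existing SDR generative video models to extend HDR synthesis pipelines. Viewed objectively, using generative models to handle the complex, nonlinear tone mapping problem while focusing on detail in extreme-luminance regions is a promising technical direction.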

Abstract: The high dynamic range (HDR) video ecosystem is approaching maturity, but the problem of upconverting legacy standard dynamic range (SDR) videos persists without a convincing solution. We propose a framework for HDR video synthesis from casual SDR footage by leveraging large-scale generative video models. We introduce a Multi-Exposure Video Model (MEVM) that can predict exposure-bracketed linear SDR video sequences from a single nonlinear SDR video input. We further propose a learnable Video Merging Model (VMM) that merges the predicted exposure-bracketed video into a high-quality HDR sequence while preserving detail in both shadows and highlights. Extensive experiments, quantitative and qualitative evaluation, and a user study demonstrate that our approach enables robust HDR conversion for in-the-wild examples from casual consumer videos and even iconic films. Finally, our model can support HDR synthesis pipelines built upon existing SDR generative video models. Output HDR videos can be viewed on our supplementary webpage: sdr2hdrvideo.github.io


[76] SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization cs.CV | cs.AI | cs.ROPDF

Posheng Chen, Powen Cheng, Gueter Josmy Faure, Hung-Ting Su, Winston H. Hsu

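TL;DR: The paper introduces SceneFunRI, a benchmark for evaluating whether vision-language models can reason about the locations of functional objects in invisible regions. Built on the SceneFun3D dataset, the benchmark comprises 855 instances constructed through a semi-automatic pipeline and formulates the task as a 2D spatial reasoning problem. The strongest current baseline (Gemini 3 Flash) performs poorly, indicating that invisible-region reasoning remains a challenge for VLMs.

Details

Motivation: In real-world scenes, target objects may lie in regions that are not visible, and vision-language models struggle to infer their locations from context and commonsense as humans do.

Result: The strongest baseline, Gemini 3 Flash, achieves only a CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65 on SceneFunRI, showing that current models perform poorly on this task.

Insight: The novelty is a benchmark focused on invisible-region reasoning, together with an analysis of prompting strategies grouped into Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination. Viewed objectively, the study points future models toward tighter integration of task intent, commonsense priors, spatial grounding, and uncertainty-aware search.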

Abstract: In real-world scenes, target objects may reside in regions that are not visible. While humans can often infer the locations of occluded objects from context and commonsense knowledge, this capability remains a major challenge for vision-language models (VLMs). To address this gap, we introduce SceneFunRI, a benchmark for Reasoning the Invisible. Based on the SceneFun3D dataset, SceneFunRI formulates the task as a 2D spatial reasoning problem via a semi-automatic pipeline and comprises 855 instances. It requires models to infer the locations of invisible functional objects from task instructions and commonsense reasoning. The strongest baseline model (Gemini 3 Flash) only achieves a CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65. We group our prompting analysis into three categories: Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination (SPoE). These findings indicate that invisible-region reasoning remains an unstable capability in current VLMs, motivating future work on models that more tightly integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.


[77] Towards Continuous Sign Language Conversation from Isolated Signs cs.CVPDF

Youngmin Kim, Kyobin Choo, Jiwoo Park, Minseo Kim, Chanyoung Kim

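TL;DR: This paper proposes a method for constructing continuous sign language conversations from isolated signs. It collects large-scale labeled isolated sign clips as lexically grounded motion primitives and recomposes them into sign-language-ordered utterances derived from existing dialogue corpora, addressing the scarcity of sentence-level sign video data.

Details

Motivation: Current conversational AI systems are centered on spoken or written language, which limits access for Deaf and Hard-of-Hearing signers whose primary language is sign language. Sentence-level sign data are also expensive to collect and annotate, leaving existing models with limited vocabulary coverage and weak open-domain generalization.

Result: Quantitative and qualitative evaluations show improved isolated-to-continuous motion quality, stronger response-level semantic alignment, and scalable signer-centered interaction that better supports visual-spatial articulation.

Insight: The contributions include a large-scale isolated-sign vocabulary and a continuous 3D sign conversation dataset; a retrieval-guided spoken-to-gloss translator that bridges the structural mismatch between spoken and signed languages; and BRAID, a diffusion Transformer that performs duration alignment and co-articulatory boundary inpainting. Together they enable direct sign-to-sign conversation generation without spoken-language text or externally provided glosses.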

Abstract: Sign language is the primary language for many Deaf and Hard-of-Hearing (DHH) signers, yet most conversational AI systems still mediate interaction through spoken or written language. This spoken-language-centered interface can limit access for signers for whom spoken or written language is not the most accessible medium, motivating direct sign-to-sign conversational modeling. However, sentence-level sign video data are expensive to collect and annotate, leaving existing sign translation and production models with limited vocabulary coverage and weak open-domain generalization. We address this bottleneck by constructing continuous sign conversations from isolated signs: large-scale labeled isolated clips are collected as lexically grounded motion primitives and recomposed into sign-language-ordered utterances derived from existing dialogue corpora. We introduce SignaVox-W, which provides, to our knowledge, the largest labeled isolated-sign vocabulary to date, and SignaVox-U, a continuous 3D sign conversation dataset built from SignaVox-W. To bridge structural mismatch between spoken and signed languages, we use a retrieval-guided spoken-to-gloss translator; to bridge independently collected isolated clips, we propose BRAID, a diffusion Transformer that performs duration alignment and co-articulatory boundary inpainting. With the resulting data, we train SignaVox, a direct sign-to-sign conversational model that generates 3D body, hand, and facial motion responses from prior signing context without spoken-language text or externally provided glosses at inference time. Quantitative and qualitative evaluations show improved isolated-to-continuous motion quality, stronger response-level semantic alignment, and scalable signer-centered interaction that better supports visual-spatial articulation.


[78] Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners cs.CVPDF

Qingyang Liu, Bingjie Gao, Canmiao Fu, Zhipeng Huang, Chen Li

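TL;DR: This paper proposes a new framework that addresses the "understanding-generation gap" of unified multimodal models on anything-to-image (X2I) tasks. A hierarchical data pipeline lets the model switch autonomously among three adaptive modes (direct generation, self-reflection, and multi-step planning) according to instruction complexity and its own capability. Combined with a two-stage training strategy (SFT and RL) and purpose-built rewards, the framework improves generation quality.

Details

Motivation: Unified models integrate multimodal understanding and generation, but an "understanding-generation gap" persists: the model captures user intent yet struggles to turn semantic knowledge into precise pixel-level manipulation. In X2I tasks this produces an attention entanglement bottleneck (planning fails on complex prompts) and a visual refinement bottleneck (unstructured feedback cannot correct defects efficiently).

Result: Extensive experiments on X2I tasks show that the method outperforms existing baselines and achieves better generation fidelity on instructions ranging from simple to complex.

Insight: The core novelty is a framework that turns unified models into self-adaptive interleaved visual reasoners. Through hierarchical adaptive execution modes (direct generation, self-reflection, multi-step planning) and a two-stage training strategy that combines step-wise reasoning rewards with an intra-group complexity penalty, it systematically addresses the bottlenecks between understanding and generation and improves both the handling of complex scenarios and generation efficiency.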

Abstract: Recent unified models integrate multimodal understanding and generation within a single framework. However, an “understanding-generation gap” persists, where models can capture user intent but often fail to translate this semantic knowledge into precise pixel-level manipulation. This gap results in two bottlenecks in the anything-to-image (X2I) task: the attention entanglement bottleneck, where blind planning struggles with complex prompts, and the visual refinement bottleneck, where unstructured feedback fails to correct imperfections efficiently. In this paper, we propose a novel framework that empowers unified models to autonomously switch between generation strategies based on instruction complexity and model capability. To achieve this, we build a hierarchical data pipeline that constructs execution paths across three adaptive modes: direct generation for simple cases, self-reflection for quality refinement, and multi-step planning for decomposing complex scenarios. Building on this pipeline, we contribute a high-quality dataset with over 50,000 samples and implement a two-stage training strategy comprising SFT and RL. Specifically, we design step-wise reasoning rewards to ensure logical consistency and an intra-group complexity penalty to prevent redundant computational overhead. Extensive experiments demonstrate that our method outperforms existing baselines on X2I, achieving superior generation fidelity across simple-to-complex instructions. The code is released at https://github.com/WeChatCV/Interleaved_Visual_Reasoner.


[79] Vision-Core Guided Contrastive Learning for Balanced Multi-modal Prognosis Prediction of Stroke cs.CV | cs.AIPDF

Liren Chen, Lidong Sun, Mingyan Huang, Junzhe Tang, Yinghui Zhu

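TL;DR: This paper proposes a new tri-modal fusion model for ischemic stroke prognosis prediction. It enriches the data representation by using a large language model to automatically generate semi-structured diagnostic text from brain MRIs, and it introduces a vision-conditioned dual alignment fusion module that uses visual features as a conditional prior to guide fine-grained interaction with the generated text, addressing the limited modalities and insufficient interaction of existing methods.

Details

Motivation: Existing multi-modal methods are mostly confined to dual-modal fusion, lack a framework that effectively integrates medical images, structured clinical data, and unstructured text, and fail to establish deep bidirectional interaction between modalities, which limits the accuracy of ischemic stroke prognosis prediction.

Result: Extensive experiments on a real-world clinical dataset show that the model achieves state-of-the-art performance.

Insight: The contributions include using an LLM to generate text from images, which eases the scarcity of expert annotations and acts as semantic regularization, and a vision-conditioned dual alignment fusion module that achieves dynamic, deep fusion through a dual semantic alignment loss and mitigates modality heterogeneity.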

Abstract: Deep learning and multi-modal fusion have demonstrated transformative potential in medical diagnosis by integrating diverse data sources. However, accurate prognosis for ischemic stroke remains challenging due to limitations in existing multi-modal approaches. First, current methods are predominantly confined to dual-modal fusion, lacking a framework that effectively integrates the trifecta of medical images, structured clinical data, and unstructured text. Second, they often fail to establish deep bidirectional interactions between modalities. To address these critical gaps, this paper proposes a novel tri-modal fusion model for ischemic stroke prognosis. Our approach first enriches the data representation by employing a Large Language Model (LLM) to automatically generate semi-structured diagnostic text from brain MRIs. This process not only addresses the scarcity of expert annotations but also serves as a regularized semantic enhancement, improving multimodal fusion robustness. Furthermore, we design a core component termed the Vision-Conditioned Dual Alignment Fusion Module (VDAFM), which strategically uses visual features as a conditional prior to guide fine-grained interaction with the generated text. This module achieves a dynamic and profound fusion through a dual semantic alignment loss, effectively mitigating modal heterogeneity. Extensive experiments on a real-world clinical dataset demonstrate that our model achieves state-of-the-art performance.


[80] Video-Zero: Self-Evolution Video Understanding cs.CV

Ruixu Zhang, Deyi Ji, Lanyun Zhu, Xuanyi Liu, Yuxin Meng

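TL;DR: This paper proposes Video-Zero, an annotation-free Questioner-Solver co-evolution framework that targets the core bottleneck of video self-evolution: evidence grounding. The Questioner discovers informative evidence segments and generates evidence-grounded questions, while the Solver learns to answer and to align its predictions with the supporting evidence, forming an evidence-centered self-evolution loop.

Details

Motivation: Extending the self-evolution paradigm to video understanding is challenging because videos are long, dynamic, and redundant, while the evidence needed for reasoning is often sparse and temporally localized. Generating difficult question-answer pairs directly from full videos yields supervision that looks challenging but is weakly grounded, relying on static cues or language priors rather than temporal evidence.

Result: Across 13 benchmarks spanning temporal grounding, long-video understanding, and video reasoning, Video-Zero consistently improves multiple video vision-language model (VLM) backbones, demonstrating the effectiveness and transferability of evidence-centered self-evolution.

Insight: The core innovation is to reframe the bottleneck of video self-evolution from “difficulty” alone to “evidence grounding”, and to propose a co-evolution framework that enables more reliable self-evolution of video understanding models through an iterative loop of evidence discovery, grounded supervision, and evidence-aligned learning. This offers a new route to reducing reliance on human annotation.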

Abstract: Self-evolution offers a promising path for improving reasoning models without relying on intensive human annotation. However, extending this paradigm to video understanding remains underexplored and challenging: videos are long, dynamic, and redundant, while the evidence needed for reasoning is often sparse and temporally localized. Naively generating difficult question-answer pairs from full videos can therefore produce supervision that appears challenging but is weakly grounded, relying on static cues or language priors rather than temporal evidence. In this work, we argue that the key bottleneck of video self-evolution is not difficulty alone, but grounding. We propose Video-Zero, an annotation-free Questioner–Solver co-evolution framework that centers self-evolution on temporally localized evidence. The Questioner discovers informative evidence segments and generates evidence-grounded questions, while the Solver learns to answer and align its predictions with the supporting evidence. This closes an iterative loop of evidence discovery, grounded supervision, and evidence-aligned learning. Across 13 benchmarks spanning temporal grounding, long-video understanding, and video reasoning, Video-Zero consistently improves multiple video VLM backbones, demonstrating the effectiveness and transferability of evidence-centered self-evolution.


[81] EARL: Towards a Unified Analysis-Guided Reinforcement Learning Framework for Egocentric Interaction Reasoning and Pixel Grounding cs.CV | cs.RO

Yuejiao Su, Xinshen Zhang, Zhen Ye, Lei Yao, Lap-Pui Chau

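TL;DR: This paper proposes EARL, a framework that uses two-stage parsing (coarse-grained interpretation and fine-grained response) and reinforcement learning to achieve accurate reasoning about, and pixel-level grounding of, human-environment interactions in egocentric video.

Details

Motivation: Existing multimodal large language models fall short in egocentric interaction reasoning and fine-grained pixel grounding. EARL addresses this by explicitly transferring coarse interaction semantics to query-oriented answering and grounding.

Result: On the Ego-IRGBench benchmark, EARL reaches 65.48% cIoU for pixel grounding, outperforming previous RL-based methods by 8.37%. OOD grounding results on EgoHOS also indicate strong transferability to unseen egocentric grounding scenarios.

Insight: The innovations include: 1) a two-stage parsing framework that decouples holistic interaction understanding from query-oriented fine-grained response; 2) integration of a global interaction descriptor as a semantic prior via the Analysis-guided Feature Synthesizer (AFS); 3) a multi-faceted reward function, trained with GRPO, that jointly optimizes heterogeneous outputs such as text, bounding boxes, and masks.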

Abstract: Understanding human–environment interactions from egocentric vision is essential for assistive robotics and embodied intelligent agents, yet existing multimodal large language models (MLLMs) still struggle with accurate interaction reasoning and fine-grained pixel grounding. To this end, this paper introduces EARL, an Egocentric Analysis-guided Reinforcement Learning framework that explicitly transfers coarse interaction semantics to query-oriented answering and grounding. Specifically, EARL adopts a two-stage parsing framework including coarse-grained interpretation and fine-grained response. The first stage holistically interprets egocentric interactions and generates a structured textual description. The second stage produces the textual answer and pixel-level mask in response to the user query. To bridge the two stages, we extract a global interaction descriptor as a semantic prior, which is integrated via a novel Analysis-guided Feature Synthesizer (AFS) for query-oriented reasoning. To optimize heterogeneous outputs, including textual answers, bounding boxes, and grounding masks, we design a multi-faceted reward function and train the response stage with GRPO. Experiments on Ego-IRGBench show that EARL achieves 65.48% cIoU for pixel grounding, outperforming previous RL-based methods by 8.37%, while OOD grounding results on EgoHOS indicate strong transferability to unseen egocentric grounding scenarios.
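
The cIoU figure above is commonly read as cumulative IoU: total intersection divided by total union over the whole evaluation set, rather than a per-sample average. The abstract does not spell out the definition, so the sketch below assumes that reading; the function name `cumulative_iou` and the flat 0/1 mask layout are hypothetical choices for illustration.

```python
from typing import Sequence


def cumulative_iou(preds: Sequence[Sequence[int]], gts: Sequence[Sequence[int]]) -> float:
    """Cumulative IoU: summed intersection over summed union across all samples.

    Assumption: each mask is a flat sequence of 0/1 values, and each
    prediction has the same length as its ground-truth mask.
    """
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        for p, g in zip(pred, gt):
            total_inter += 1 if (p and g) else 0
            total_union += 1 if (p or g) else 0
    # An empty union means there is nothing to score; return 0.0 by convention.
    return total_inter / total_union if total_union else 0.0


# Two toy samples with four pixels each.
preds = [[1, 1, 0, 0], [0, 1, 1, 1]]
gts = [[1, 0, 0, 0], [0, 1, 1, 0]]
score = cumulative_iou(preds, gts)  # 3 shared pixels over a union of 5
```

Because the sums are pooled before dividing, large masks weigh more than small ones, which is the main practical difference from averaging per-sample IoU.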


[82] BioHuman: Learning Biomechanical Human Representations from Video cs.CV | cs.GR | cs.LG

Yujun Huo, He Zhang, Chentao Song, Honglin Song, Zongyu Zuo

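TL;DR: This paper proposes the BioHuman framework. It uses simulation to estimate muscle activation signals from existing motion capture datasets, builds BioHuman10M, a large-scale dataset with synchronized video, motion, and muscle activations, and on top of it develops BioHuman, an end-to-end model that jointly predicts human motion and muscle activations from monocular video, linking visual observations to internal biomechanical states.

Details

Motivation: Progress in understanding human motion beyond surface kinematics is limited, mainly by the lack of large-scale datasets with biomechanical annotations and by the inability of existing methods to infer internal biomechanical states directly from visual observations.

Result: Extensive experiments show that BioHuman accurately reconstructs both kinematic motion and muscle activity and generalizes well across subjects and motions, establishing a new benchmark for video-based biomechanical understanding.

Insight: The innovations are a simulation-based framework for generating a large-scale biomechanically annotated dataset, and the first end-to-end model to jointly predict motion and muscle activations directly from monocular video. Together they bridge visual observations and internal biomechanical states and open new possibilities for physically grounded human modeling.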

Abstract: Understanding human motion beyond surface kinematics is crucial for motion analysis, rehabilitation, and injury risk assessment. However, progress in this domain is limited by the lack of large-scale datasets with biomechanical annotations, and by existing approaches that cannot directly infer internal biomechanical states from visual observations. In this paper, we introduce a simulation-based framework for estimating muscle activations from existing motion capture datasets, resulting in BioHuman10M, a large-scale dataset with synchronized video, motion, and activations. Building on BioHuman10M, we propose BioHuman, an end-to-end model that takes monocular video as input and jointly predicts human motion and muscle activations, effectively bridging visual observations and internal biomechanical states. Extensive experiments demonstrate that BioHuman enables accurate reconstruction of both kinematic motion and muscle activity, and generalizes across diverse subjects and motions. We believe our approach establishes a new benchmark for video-based biomechanical understanding and opens up new possibilities for physically grounded human modeling.


[83] MonoPRIO: Adaptive Prior Conditioning for Unified Monocular 3D Object Detection cs.CV

Leon Davies, Qinggang Meng, Mohamad Saada, Baihua Li, Simon Sølvsten

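TL;DR: MonoPRIO is a unified monocular 3D object detector that uses adaptive prior conditioning to address the underdetermination of metric size and depth in monocular images, particularly under occlusion, truncation, and projection-induced scale-depth ambiguity. It achieves leading multi-class and single-class detection results on the KITTI benchmark.

Details

Motivation: In monocular 3D object detection, metric size and depth are hard to determine because single-view evidence is insufficient, especially under occlusion, truncation, and scale-depth ambiguity. In unified multi-class settings, class variability and partial visibility further broaden the plausible size modes, making metric size unstable.

Result: On the KITTI test server, MonoPRIO achieves the strongest unified result among methods reporting complete Car, Pedestrian, and Cyclist multi-class metrics. In the car-only setting, it obtains the strongest 3D bounding-box AP across the Easy/Moderate/Hard categories among compared methods without extra data, while using substantially less compute than MonoCLUE.

Insight: The innovations include constructing class-aware size prototypes offline, routing each decoder query to a soft mixture prior, applying uncertainty-aware log-space conditioning, and using Cluster-Aligned Prior (CAP) regularisation on matched positives during training. Adaptive priors are most effective when image evidence underdetermines metric size.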

Abstract: Monocular 3D object detection remains challenging because metric size and depth are underdetermined by single-view evidence, particularly under occlusion, truncation, and projection-induced scale-depth ambiguity. Although recent methods improve depth and geometric reasoning, metric size remains unstable in unified multi-class settings, where class variability and partial visibility broaden plausible size modes. We propose MonoPRIO, a unified monocular 3D detector that targets this bottleneck through adaptive prior conditioning in the size pathway. MonoPRIO constructs class-aware size prototypes offline, routes each decoder query to a soft mixture prior, applies uncertainty-aware log-space conditioning, and uses Cluster-Aligned Prior (CAP) regularisation on matched positives during training. On the official KITTI test server, MonoPRIO achieves the strongest fully reported unified multi-class result among methods reporting complete Car, Pedestrian, and Cyclist metrics. In the car-only setting, it also achieves the strongest 3D bounding-box AP across Easy/Moderate/Hard categories among compared methods without extra data, while using substantially less compute than MonoCLUE. Ablations and diagnostics show complementary gains from routed injection and CAP, with the largest benefits in ambiguity-prone, partially occluded, and low-data regimes. These findings indicate that adaptive priors are most effective when image evidence underdetermines metric size, while atypical geometry or extreme visibility loss can still cause mismatch between routed priors and true instance geometry. Code, trained models, result logs, and reproducibility material are available at https://github.com/bigggs/MonoPRIO.
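
The phrase "routes each decoder query to a soft mixture prior" with "log-space conditioning" can be illustrated with a small sketch. It is not MonoPRIO's implementation: the softmax routing, the residual added in log space, the prototype values, and every function name below are assumptions made for illustration, and the uncertainty-aware weighting and CAP regularisation are left out.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over routing logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_log_prior(routing_logits: Sequence[float],
                      log_prototypes: Sequence[Sequence[float]]) -> List[float]:
    """Blend log-size prototypes with soft routing weights (one weight per prototype)."""
    weights = softmax(routing_logits)
    dims = len(log_prototypes[0])
    return [sum(w * proto[d] for w, proto in zip(weights, log_prototypes))
            for d in range(dims)]


def predict_size(routing_logits: Sequence[float],
                 log_prototypes: Sequence[Sequence[float]],
                 log_residual: Sequence[float]) -> List[float]:
    """Add a log-space residual to the routed prior, then map back to metres."""
    prior = mixture_log_prior(routing_logits, log_prototypes)
    return [math.exp(p + r) for p, r in zip(prior, log_residual)]


# Illustrative (height, width, length) prototypes in metres; the numbers are made up.
PROTOTYPES = [[1.5, 1.6, 3.9], [1.7, 0.6, 0.8]]
LOG_PROTOTYPES = [[math.log(v) for v in proto] for proto in PROTOTYPES]

# A query routed almost entirely to the first prototype, with zero residual.
size = predict_size([20.0, -20.0], LOG_PROTOTYPES, [0.0, 0.0, 0.0])
```

Working in log space keeps predicted sizes positive and makes the residual a multiplicative correction, so a weak image cue nudges the prior instead of replacing it.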


[84] Do Composed Image Retrieval Benchmarks Require Multimodal Composition? cs.CV | cs.CL

Matteo Attimonelli, Alessandro De Bellis, Aryo Pradipta Gema, Rohit Saxena, Monica Sekoyan

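TL;DR: By analyzing four widely used composed image retrieval (CIR) benchmarks and eleven generalist multimodal embedding models, this paper finds that a large fraction of queries can be solved with a single modality (image or text), revealing pervasive unimodal shortcuts. A two-stage audit (identifying shortcut-solvable queries, then human validation) yields a validated subset of shortcut-free, noise-free queries. Re-evaluating models on it shows that current CIR benchmarks conflate shortcut-solvable, noisy, and genuinely compositional queries, and therefore overestimate model capability in multimodal composition.

Details

Motivation: Current CIR benchmarks contain pervasive unimodal shortcuts that let models score highly without genuine multimodal composition, so they cannot accurately assess that capability. The paper sets out to address this.

Result: On four widely used CIR benchmarks (CIRR, FashionIQ, CIRCO, GeneCIS), 32.2% to 83.6% of queries can be solved with a single modality. Re-evaluating models on the human-validated subset of 1,689 shortcut-free, well-formed queries shows that accuracy decreases while reliance on multimodal information increases, indicating that this subset better reflects genuine multimodal composition.

Insight: The contribution is to systematically expose unimodal shortcuts in CIR benchmarks and to build a cleaner evaluation subset through auditing. It reminds the community that benchmark design must carefully exclude shortcuts in order to measure multimodal composition accurately and avoid overestimating model performance.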

Abstract: Composed Image Retrieval (CIR) is a multimodal retrieval task where a query consists of a reference image and a textual modification, and the goal is to retrieve a target image satisfying both. In principle, strong performance on CIR benchmarks is assumed to require multimodal composition, i.e., combining complementary information from reference image and textual modification. In this work, we show that this assumption does not always hold. Across four widely used CIR benchmarks and eleven Generalist Multimodal Embedding models, a large fraction of queries can be solved using a single modality (from 32.2% to 83.6%), revealing pervasive unimodal shortcuts. Thus, high CIR performance can arise from unimodal signals rather than true multimodal composition. To better understand this issue, we perform a two-stage audit. First, we identify shortcut-solvable queries through cross-model analysis. Second, we conduct human validation on 4,741 shortcut-free queries, of which only 1,689 are well-formed, with common issues including ambiguous edits and mismatched targets. Re-evaluating models on this validated subset reveals qualitatively different behaviour: queries can no longer be solved with a single modality, and successful retrieval requires combining both inputs. While accuracy decreases, reliance on multimodal information increases. Overall, current CIR benchmarks conflate shortcut-solvable, noisy, and genuinely compositional queries, leading to an overestimation of model capability in multimodal composition.
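
To make the notion of a unimodal shortcut concrete, here is a minimal sketch of one way to flag a query as shortcut-solvable. The abstract only says the audit uses cross-model analysis, so the top-1 cosine criterion, the single-model setting, and the names `top1` and `is_shortcut_solvable` are assumptions for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top1(query: Sequence[float], gallery: Sequence[Sequence[float]]) -> int:
    """Index of the gallery item most similar to the query."""
    sims: List[float] = [cosine(query, item) for item in gallery]
    return sims.index(max(sims))


def is_shortcut_solvable(image_emb: Sequence[float], text_emb: Sequence[float],
                         gallery: Sequence[Sequence[float]], target_idx: int) -> bool:
    """True if the reference image alone or the text alone already retrieves the target."""
    return (top1(image_emb, gallery) == target_idx
            or top1(text_emb, gallery) == target_idx)


# Toy 2-D embeddings: the target (index 2) mixes both directions.
gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
needs_both = is_shortcut_solvable([1.0, 0.0], [0.0, 1.0], gallery, target_idx=2)
text_shortcut = is_shortcut_solvable([1.0, 0.0], [0.9, 1.0], gallery, target_idx=2)
```

In the first query neither modality reaches the target on its own, so it would survive the filter; in the second the text embedding alone suffices, which is the pattern the audit removes.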


[85] COAL: Counterfactual and Observation-Enhanced Alignment Learning for Discriminative Referring Multi-Object Tracking cs.CV

Shukun Jia, Shiyu Hu, Yipei Wang, Ximeng Cheng, Yichao Cao

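TL;DR: This paper proposes COAL, a framework that uses knowledge regularization to resolve the structural contradiction in Referring Multi-Object Tracking (RMOT) between the demand for high discriminability and sparse semantic supervision. It combines explicit semantic injection with counterfactual learning, using a VLM and an LLM to enrich the observation space and the supervision signal, and distills external knowledge into domain-specific discriminative representations through a Hierarchical Multi-Stream Integration architecture.

Details

Motivation: Sparse semantic supervision in RMOT causes overfitting, shortcut learning, and semantic collapse, especially in highly homogeneous scenarios that require fine-grained discrimination over compositional semantics. The paper sets out to address this.

Result: Effectiveness is validated on the Refer-KITTI and Refer-KITTI-V2 benchmarks. On the highly challenging Refer-KITTI-V2, COAL surpasses the previous best method by 7.28% HOTA, reaching state-of-the-art performance.

Insight: The innovations are explicit semantic injection via a VLM to densify the observation space, and counterfactual learning driven by LLM reasoning to augment supervision and enforce strict attribute verification. Viewed objectively, bringing external large-model knowledge (VLM/LLM) into RMOT through knowledge regularization to resolve the sparsity-discriminability paradox is an effective approach to cross-modal knowledge distillation and compositional semantic enhancement.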

Abstract: Referring Multi-Object Tracking (RMOT) faces a fundamental structural contradiction between the high-discriminability demand and the sparse semantic supervision. This mismatch is particularly acute in highly homogeneous scenarios that require fine-grained discrimination over complex compositional semantics. However, under sparse supervision, models overfit to salient yet insufficient cues, thereby encouraging shortcut learning and semantic collapse. To resolve this, we propose COAL (Counterfactual and Observation-enhanced Alignment Learning), a framework that advances RMOT beyond isolated structural optimization through knowledge regularization. First, we introduce Explicit Semantic Injection (ESI) via a VLM to densify the observation space and enhance instance discriminability. Second, leveraging LLM reasoning, we propose Counterfactual Learning (CFL) to augment supervision, enforcing strict attribute verification for robust compositional recognition. These strategies are unified within a Hierarchical Multi-Stream Integration (HMSI) architecture, which distills external knowledge into domain-specific discriminative representations. Experiments on Refer-KITTI and Refer-KITTI-V2 benchmarks validate COAL’s efficacy. Notably, it surpasses the state-of-the-art by 7.28% HOTA on the highly challenging Refer-KITTI-V2. These results demonstrate the effectiveness of knowledge regularization for resolving the sparsity-discriminability paradox in RMOT.


[86] Can Visual Mamba Improve AI-Generated Image Detection? An In-Depth Investigation cs.CV | cs.CR | cs.SI

Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene, Abdelmalik Taleb-Ahmed, Xianxun Zhu

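TL;DR: This study systematically evaluates and compares Vision Mamba models on AI-generated image detection, benchmarking them against representative CNN-, ViT-, and VLM-based detectors across multiple datasets and generative models, with a focus on key metrics such as accuracy, efficiency, and generalizability.

Details

Motivation: As AI image generation techniques (such as GANs and diffusion models) advance, generated content is becoming increasingly realistic, raising concerns about misuse such as misinformation and identity theft. Although Mamba architectures have shown promise on many image analysis tasks, their ability to detect AI-generated images remains underexplored.

Result: The study benchmarks Vision Mamba variants across multiple datasets and generation sources, evaluating accuracy, efficiency, and generalizability across image types and generative models, and compares them with existing methods.

Insight: The contribution is the first systematic evaluation and comparative analysis of Vision Mamba models on AI-generated image detection. It aims to clarify their strengths and limitations relative to established methods and informs component choices for systems that distinguish authentic from AI-generated content.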

Abstract: In recent years, computer vision has witnessed remarkable progress, fueled by the development of innovative architectures such as Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), diffusion-based architectures, Vision Transformers (ViTs), and, more recently, Vision-Language Models (VLMs). This progress has undeniably contributed to creating increasingly realistic and diverse visual content. However, such advancements in image generation also raise concerns about potential misuse in areas such as misinformation, identity theft, and threats to privacy and security. In parallel, Mamba-based architectures have emerged as versatile tools for a range of image analysis tasks, including classification, segmentation, medical imaging, object detection, and image restoration, in this rapidly evolving field. However, their potential for identifying AI-generated images remains relatively unexplored compared to established techniques. This study provides a systematic evaluation and comparative analysis of Vision Mamba models for AI-generated image detection. We benchmark multiple Vision Mamba variants against representative CNNs, ViTs, and VLM-based detectors across diverse datasets and synthetic image sources, focusing on key metrics such as accuracy, efficiency, and generalizability across diverse image types and generative models. Through this comprehensive analysis, we aim to elucidate Vision Mamba’s strengths and limitations relative to established methodologies in terms of applicability, accuracy, and efficiency in detecting AI-generated images. Overall, our findings highlight both the promise and current limitations of Vision Mamba as a component in systems designed to distinguish authentic from AI-generated visual content. This research is crucial for enhancing detection in an age where distinguishing between real and AI-generated content is a major challenge.


[87] Probing into Camera Control of Video Models cs.CV

Chen Hou, Christian Rupprecht

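TL;DR: This paper proposes a training-free method for camera control in video generation models. It recasts camera control as geometry-guided displacement fields and applies them through differentiable resampling of latent features during denoising. The method achieves effective camera control with minimal degradation across diverse quality metrics, and it can serve as a probe of base models’ camera control capabilities, revealing universal biases shared by representative video models and disparities in their responses to camera control.

Details

Motivation: Existing methods typically learn a mapping from camera motion to video using additional camera modules and paired data, but such datasets are limited in scale, diversity, and scene dynamics, which can narrow the output distribution and compromise the strong prior learned by the base model. This motivates a different perspective on camera control.

Result: Compared with fine-tuned baselines, the method achieves effective camera control with minimal degradation across diverse quality metrics. The paper also benchmarks base models on multi-view generation, offering insights into their potential for 3D/4D tasks.

Insight: The innovation is to recast camera control as geometry-guided displacement fields rather than an implicit mapping problem, to apply them without training through differentiable resampling, and to use the method as a probe for analyzing base models’ camera control capabilities and biases.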

Abstract: Video is a rich and scalable source of 3D/4D visual observations, and camera control is a key capability for video generation models to produce geometrically meaningful content. Existing approaches typically learn a mapping from camera motion to video using additional camera modules and paired data. However, such datasets are often limited in scale, diversity, and scene dynamics, which can bias the model toward a narrow output distribution and compromise the strong prior learned by the base model. These limitations motivate a different perspective on camera control. In this paper, we show that camera control need not be modeled as an implicit mapping problem, but can instead be treated as a form of geometric guidance that induces displacements across frames. Specifically, we reformulate camera control into a set of displacement fields and apply them via differentiable resampling of latent features during denoising. Our simple approach achieves effective camera control with minimal degradation across diverse quality metrics compared to fine-tuned baselines. Since our method is applicable to most video diffusion models without training, it can also serve as a probe to study the camera control capabilities of base models. Using this probe, we identify universal biases shared by representative video models, as well as disparities in their responses to camera control. Finally, we benchmark their performance in multi-view generation, offering insights into their potential for 3D/4D tasks.


[88] Multi-proposal Collaboration and Multi-task Training for Weakly-supervised Video Moment Retrieval cs.CV | cs.MM

Bolin Zhang, Chao Yang, Bin Jiang, Takahiro Komamizu, Ichiro Ide

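TL;DR: This paper proposes MCMT (Multi-proposal Collaboration and Multi-task Training), a weakly-supervised video moment retrieval method. It generates multiple proposals, derives learnable Gaussian masks from them, and combines the masks into a high-quality positive sample mask. It also introduces easy and hard negative samples and trains with forward and inverse masked query reconstruction tasks to improve the robustness and stability of retrieval.

Details

Motivation: Existing weakly-supervised video moment retrieval methods produce low-quality proposals, struggle to distinguish misaligned moments within the same video, and lack stability because they rely on a single auxiliary task. This paper aims to design a more effective weakly-supervised method.

Result: Extensive experiments on two standard benchmarks confirm the effectiveness of the proposed method, indicating competitive performance on weakly-supervised video moment retrieval.

Insight: The innovation is to generate a high-quality positive sample mask through multi-proposal collaboration, and to combine easy/hard negative sample classification with multi-task training on forward and inverse masked query reconstruction. This makes better use of weak supervision and strengthens the model’s discriminability and stability.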

Abstract: This study focuses on weakly-supervised Video Moment Retrieval (VMR), aiming to identify a moment semantically similar to the given query within an untrimmed video using only video-level correspondences, without relying on temporal annotations during training. Previous methods either aggregate predictions for all instances in the video, or indirectly address the task by proposing reconstructions for the query. However, these methods often produce low-quality temporal proposals, struggle with distinguishing misaligned moments in the same video, or lack stability due to a reliance on a single auxiliary task. To address these limitations, we present a novel weakly-supervised method called Multi-proposal Collaboration and Multi-task Training (MCMT). Initially, we generate multiple proposals and derive corresponding learnable Gaussian masks from them. These masks are then combined to create a high-quality positive sample mask, highlighting video clips most relevant to the query. Concurrently, we classify other clips in the same video as the easy negative sample and the entire video as the hard negative sample. During training, we introduce forward and inverse masked query reconstruction tasks to impose more substantial constraints on the network, promoting more robust and stable retrieval performance. Extensive experiments on two standard benchmarks affirm the effectiveness of the proposed method in VMR.
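
The mask construction described above can be sketched in a few lines. This is an illustration under stated assumptions, not the MCMT implementation: proposals are taken as (center, width) pairs in normalized time, the width is read as two standard deviations, the masks are combined by a plain average, and the easy negative is modeled as the complement of the positive mask. In the paper the mask parameters are learnable, which is omitted here.

```python
import math
from typing import List, Sequence, Tuple


def gaussian_mask(center: float, width: float, num_clips: int) -> List[float]:
    """Gaussian weight per clip; center and width are fractions of the video length."""
    sigma = width / 2.0  # assumption: width is read as two standard deviations
    mask = []
    for i in range(num_clips):
        t = (i + 0.5) / num_clips  # clip midpoint in [0, 1]
        mask.append(math.exp(-((t - center) ** 2) / (2.0 * sigma ** 2)))
    return mask


def positive_mask(proposals: Sequence[Tuple[float, float]], num_clips: int) -> List[float]:
    """Average the per-proposal masks (an assumed combination rule)."""
    masks = [gaussian_mask(c, w, num_clips) for c, w in proposals]
    return [sum(column) / len(masks) for column in zip(*masks)]


# Two hypothetical proposals over a 10-clip video.
proposals = [(0.5, 0.2), (0.55, 0.3)]
positive = positive_mask(proposals, num_clips=10)
easy_negative = [1.0 - p for p in positive]  # other clips in the same video
hard_negative = [1.0] * 10                   # the entire video
```

Averaging several overlapping proposals smooths out a single badly placed one, which is one plausible reason to combine masks rather than pick the best proposal.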


[89] Editor’s Choice: Evaluating Abstract Intent in Image Editing through Atomic Entity Analysis cs.CV

Mor Ventura, Roy Hirsch, Yonatan Bitton, Regev Cohen, Roi Reichart

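TL;DR: This paper introduces AbstractEdit, a benchmark for evaluating how well image editing models understand abstract instructions, and Entity-Rubrics, an evaluation framework that quantifies model performance by decomposing abstract edits into entity-level assessments. It finds that existing models struggle to balance editing intent with image preservation, and that integrating advanced LLM text encoders and iterative thinking significantly improves performance.

Details

Motivation: Current image editing benchmarks focus mainly on explicit, literal instructions, whereas humans naturally communicate through abstract concepts such as “mood”, so abstract instructions are under-evaluated. This paper aims to fill that gap by formalizing the definition and taxonomy of abstract image editing.

Result: Evaluating 11 leading models on AbstractEdit shows that standard architectures commonly under-edit or over-edit and struggle to balance intent and preservation. The Entity-Rubrics framework correlates strongly with human judgment.

Insight: The innovations include the first formal definition of abstract image editing, an entity-based evaluation framework (Entity-Rubrics), and the first benchmark dedicated to abstract editing (AbstractEdit). Viewed objectively, decomposing abstract instructions into atomic entity assessments is an effective approach, and integrating LLM encoders with iterative reasoning is a key direction for improving performance. The framework can also generalize to serve as a reward model or to support test-time critique loops.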

Abstract: Humans naturally communicate through abstract concepts like “mood”. However, current image editing benchmarks focus primarily on explicit, literal commands, leaving abstract instructions largely underexplored. In this work, we first formalize the definition and taxonomy of abstract image editing. To measure instruction-following in this challenging domain, we introduce Entity-Rubrics, a framework that breaks down abstract edits into individual, entity-level assessments and achieves strong correlation with human judgment. Alongside this framework, we contribute AbstractEdit, the first benchmark dedicated to abstract image editing across diverse real-world scenes. Evaluating 11 leading models on this dataset reveals a fundamental challenge: standard architectures struggle to balance intent and preservation, commonly defaulting to under-editing or over-editing. Our analysis demonstrates that driving meaningful improvements relies heavily on integrating advanced LLM text encoders and iterative thinking. Looking forward, our entity-based paradigm can generalize beyond assessment to serve as a reward model, enable models to correctly interpret abstract communication, or highlight specific failures in test-time critique loops. Ultimately, we hope this work serves as a stepping stone toward seamless multimodal interaction, closing the gap between rigid machine execution and the natural, open-ended way humans communicate.


[90] MechVerse: Evaluating Physical Motion Consistency in Video Generation Models cs.CV

Rahul Jain, Mayank Patel, Asim Unmesh, Karthik Ramani

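TL;DR: This paper introduces MechVerse, a benchmark for evaluating the physical motion consistency of video generation models, particularly for the motion of mechanical assemblies. It contains 21,156 synthetic video clips from 1,357 mechanical assemblies across 141 categories, organized into three tiers of kinematic complexity. The study finds that existing models often fail to generate motion that satisfies kinematic constraints even while preserving appearance and smoothness.

Details

Motivation: Current text- and image-conditioned video generation models perform well on visual fidelity and temporal coherence but often fail to generate motion that obeys kinematic and geometric constraints. In mechanical assemblies especially, they are prone to physically implausible results such as deformed rigid parts and broken coupling between parts.

Result: The paper evaluates proprietary, open-source, and fine-tuned image-to-video models using standard video metrics, instruction-following scores, and human judgments of motion correctness. Existing models preserve appearance and smoothness but fail to generate motion that satisfies mechanical constraints, and error rates rise as coupling complexity increases.

Insight: The innovation is MechVerse, the first video generation benchmark focused on mechanical motion consistency. Its structured prompts and tiered complexity provide a new tool for measuring and improving models’ understanding of physical constraints. Viewed objectively, the work extends video generation evaluation from appearance quality to physical plausibility, an important broadening of the research direction.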

Abstract: Text- and image-conditioned video generation models have achieved strong visual fidelity and temporal coherence, but they often fail to generate motion governed by kinematic and geometric constraints. In these settings, object parts must remain rigid, maintain contact or coupling with neighboring components, and transfer motion consistently across connected parts. These requirements are especially explicit in articulated mechanical assemblies, where motion is constrained by rigid-link geometry, contact/coupling relations, and transmission through kinematic chains. A generated video may therefore appear plausible while violating the intended mechanism, such as rotating a part that should translate, deforming a rigid component, breaking coupling between parts, or failing to move downstream components. To evaluate this gap, we introduce MechVerse, a benchmark for mechanically consistent image-to-video generation. MechVerse contains 21,156 synthetic clips from 1,357 mechanical assemblies across 141 categories, organized into three tiers of increasing kinematic complexity: independent articulation, pairwise coupling, and densely coupled multi-part mechanisms. Each clip is paired with a structured prompt describing part identities, stationary supports, moving components, motion primitives, direction, speed/extent, and inter-part dependencies. We evaluate proprietary, open-source, and fine-tuned image-to-video models using standard video metrics, instruction-following scores, and human judgments of motion correctness and kinematic coupling. Results show that current models can preserve appearance and smoothness while failing to generate mechanically admissible motion, with errors increasing as coupling complexity grows. MechVerse provides a benchmark for measuring and improving mechanism-aware video generation from image and language inputs.


[91] Exploring Vision-Language Models for Online Signature Verification: A Zero-Shot Capability Study cs.CV

Marta Robledo-Moreno, Ruben Vera-Rodriguez, Ruben Tolosana, Javier Ortega-Garcia

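TL;DR: This study explores the zero-shot capability of vision-language models (VLMs) on online signature verification. It converts raw kinematic time series into static images, introduces a scoring protocol based on latent token probabilities, and evaluates two state-of-the-art models, GPT-5.2 and Gemini 2.5 Pro, on the Signature Verification Challenge benchmark. In random forgery scenarios the zero-shot VLMs perform strongly, even surpassing supervised state-of-the-art systems, but in skilled forgery scenarios performance drops sharply and a “Rationalization Trap” caused by chain-of-thought reasoning emerges.

Details

Motivation: Although VLMs show strong general visual reasoning, their applicability to rigorous biometric tasks such as online signature verification has not been explored. This study aims to fill that gap by evaluating the zero-shot performance of VLMs on the task.

Result: On the SVC benchmark, in random forgery scenarios GPT-5.2 reaches an Equal Error Rate of 0.32% in mobile tasks, outperforming supervised state-of-the-art systems. In skilled forgery scenarios performance deteriorates significantly, and chain-of-thought reasoning degrades it further by producing kinematic hallucinations.

Insight: The innovations include encoding kinematic time series as static images so that VLMs can process them, and a robust biometric scoring protocol based on latent token probabilities. Viewed objectively, the study reveals how sensitive VLM performance on biometric tasks is to data quality and forgery type, and how chain-of-thought reasoning can trigger a “Rationalization Trap” on highly similar samples, with important implications for deploying VLMs in security-critical applications.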

Abstract: Recent advancements in Vision-Language Models (VLMs) have demonstrated strong capabilities in general visual reasoning, yet their applicability to rigorous biometric tasks remains unexplored. This work presents an exploratory study evaluating the zero-shot performance of state-of-the-art VLMs (GPT-5.2 and Gemini 2.5 Pro) on the Signature Verification Challenge (SVC) benchmark. To enable visual processing, raw kinematic time-series are converted into static images, encoding pressure information into stroke opacity whenever available in the source data. Furthermore, we introduce a scoring protocol that extracts latent token probabilities to compute robust biometric scores. Experimental results reveal a significant performance dichotomy dependent on signal quality and forgery type. In random forgery scenarios, the zero-shot VLM achieves exceptional discrimination, with GPT-5.2 reaching an Equal Error Rate of 0.32% in mobile tasks, outperforming supervised state-of-the-art systems. Conversely, in skilled forgery scenarios, where the task is more challenging because both signatures are almost identical, the results are significantly worse, and a critical “Rationalization Trap” emerges: chain-of-thought (CoT) reasoning degrades performance as the model produces kinematic hallucinations to justify forgery artifacts as natural variability.
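
For readers unfamiliar with the Equal Error Rate quoted above, the sketch below shows one simple way to estimate it from two score lists. It is a generic illustration rather than the paper's protocol: how scores are obtained from token probabilities is not modeled, the threshold sweep over observed scores is an assumption, and the function name and toy scores are hypothetical.

```python
from typing import Sequence


def equal_error_rate(genuine: Sequence[float], impostor: Sequence[float]) -> float:
    """Approximate the EER by trying every observed score as a decision threshold.

    A comparison is accepted when its score is >= the threshold. The EER is
    taken at the threshold where false acceptance and false rejection rates
    are closest, as the mean of the two rates.
    """
    best_gap = float("inf")
    eer = 1.0
    for threshold in sorted(set(genuine) | set(impostor)):
        far = sum(s >= threshold for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < threshold for s in genuine) / len(genuine)     # false rejects
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2.0
    return eer


# Toy scores: one impostor (0.65) scores above one genuine signature (0.6).
genuine_scores = [0.9, 0.8, 0.7, 0.6]
impostor_scores = [0.1, 0.2, 0.3, 0.65]
eer = equal_error_rate(genuine_scores, impostor_scores)
```

With only a handful of scores the estimate moves in coarse steps; evaluations on real benchmarks typically interpolate between thresholds instead.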


[92] FactorizedHMR: A Hybrid Framework for Video Human Mesh Recovery cs.CV | cs.AI

Patrick Kwon, Chen Chen

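TL;DR: FactorizedHMR is a two-stage hybrid framework for video human mesh recovery: a deterministic regression module recovers a stable torso-root anchor, and a probabilistic flow-matching module then completes the ambiguity-prone distal articulation (such as the limbs).

Details

Motivation: Human mesh recovery is fundamentally ambiguous, especially under occlusion or weak depth cues. Torso pose and root structure are relatively well constrained, whereas distal articulations such as the arms and legs are more uncertain. The paper proposes a solution tailored to this non-uniform ambiguity.

Result: Across camera-space and world-space benchmarks, FactorizedHMR remains competitive with strong baselines, with the clearest gains in occlusion-heavy recovery and drift-sensitive world-space metrics.

Insight: The innovation is to factorize human mesh recovery into deterministic torso-root anchor recovery and probabilistic completion of non-torso articulation, combined with a composite target representation, geometry-aware supervision, and feature-aware classifier-free guidance, plus a synthetic data pipeline that provides paired supervision under diverse viewpoints.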

Abstract: Human Mesh Recovery (HMR) is fundamentally ambiguous: under occlusion or weak depth cues, multiple 3D bodies can explain the same image evidence. This ambiguity is not uniform across the body, as torso pose and root structure are often relatively well constrained, whereas distal articulations such as the arms and legs are more uncertain. Building on this observation, we propose FactorizedHMR, a two-stage framework that treats these two regimes differently. A deterministic regression module first recovers a stable torso-root anchor, and a probabilistic flow-matching module then completes the remaining non-torso articulation. To make this completion reliable, we combine a composite target representation with geometry-aware supervision and feature-aware classifier-free guidance, preserving the torso-root anchor while improving single-reference recovery of ambiguity-prone articulation. We also introduce a synthetic data pipeline that provides the paired image-camera-motion supervision under diverse viewpoints. Across camera-space and world-space benchmarks, FactorizedHMR remains competitive with strong baselines, with the clearest gains in occlusion-heavy recovery and drift-sensitive world-space metrics.


[93] Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning cs.CV | cs.AI

Hanbo Cheng, Limin Lin, Ruo Zhang, Yicheng Pan, Jun Du

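TL;DR: This paper proposes Closed-Loop Visual Reasoning (CLVR), a framework that targets the difficulties current text-to-image (T2I) models face with complex semantics. It deeply couples visual-language logical planning with pixel-level diffusion generation and introduces an automated data engine, Proxy Prompt Reinforcement Learning (PPRL), and Δ-Space Weight Merge (DSWM), substantially improving generation quality and efficiency.

Details

Motivation: Current T2I models rely mainly on a single-step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling. Existing multi-step reasoning methods are hindered by planning hallucinations, monolithic post-hoc reflection, long-context optimization instabilities, and high inference latency.

Result: Across multiple benchmarks, CLVR outperforms existing open-source baselines and approaches the performance of proprietary commercial models, while DSWM reduces the per-step inference cost to just 4 NFEs, enabling efficient complex visual generation.

Insight: The innovations include: 1) an automated data engine with step-level visual verification that synthesizes reliable reasoning trajectories; 2) PPRL, which resolves long-context optimization instabilities by distilling interleaved multimodal histories into explicit reward signals; 3) DSWM, which fuses alignment weights with off-the-shelf distillation priors to greatly reduce inference latency without expensive re-distillation.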

Abstract: Despite rapid advancements, current text-to-image (T2I) models predominantly rely on a single-step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling. While recent multi-step reasoning approaches show promise, they are hindered by ungrounded planning hallucinations lacking verification, monolithic post-hoc reflection, long-context optimization instabilities, and prohibitive inference latency. To overcome these bottlenecks, we propose the Closed-Loop Visual Reasoning (CLVR) framework, a comprehensive system that deeply couples visual-language logical planning with pixel-level diffusion generation. CLVR introduces an automated data engine with step-level visual verification to synthesize reliable reasoning trajectories, and proposes Proxy Prompt Reinforcement Learning (PPRL) to resolve long-context optimization instabilities by distilling interleaved multimodal histories into explicit reward signals for accurate causal attribution. Furthermore, to mitigate the severe latency bottleneck caused by iterative denoising, we propose $Δ$-Space Weight Merge (DSWM), a theoretically grounded method that fuses alignment weights with off-the-shelf distillation priors, reducing the per-step inference cost to just 4 NFEs without requiring expensive re-distillation. Extensive experiments demonstrate that CLVR outperforms existing open-source baselines across multiple benchmarks and approaches the performance of proprietary commercial models, unlocking general test-time scaling capabilities for complex visual generation.


[94] SurgicalMamba: Dual-Path SSD with State Regramming for Online Surgical Phase Recognition cs.CV | cs.AI

Sukju Oh, Sukkyu Sun

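TL;DR: This paper presents SurgicalMamba, a causal model for online surgical phase recognition (SPR). Built on Mamba2’s structured state-space duality (SSD), it holds per-frame cost at O(d). It introduces three SSD-compatible components that target the specific demands of surgical video: a dual-path SSD block that separates long- and short-term state, intensity-modulated stepping that adapts the processing rate to information intensity, and state regramming that enables cross-channel mixing. The model reaches state-of-the-art accuracy and Jaccard scores across seven public SPR benchmarks and runs in real time at 119 fps on a single GPU.

Details

Motivation: Online SPR is key to context-aware operating-room systems and requires a prediction at every frame from past context alone. Surgical video poses three demands that natural-video recognizers do not jointly address: procedures span tens of thousands of frames, time flows non-uniformly (long routine stretches punctuated by brief phase transitions), and the visual domain is narrow, so backbone features are strongly correlated across channels. Existing recognizers either let per-frame cost grow with length, or keep cost bounded but update state at a uniform rate with channel-independent dynamics, leaving the latter two demands unaddressed.

Result: Across seven public SPR benchmarks, SurgicalMamba reaches state-of-the-art accuracy and phase-level Jaccard under strict online evaluation: 94.6%/82.7% on Cholec80 (+0.7 pp/+2.2 pp over the strongest prior method) and 89.5%/68.9% on AutoLaparo (+1.7 pp/+2.0 pp), running at 119 fps on a single GPU. Ablations confirm the contribution of each component.

Insight: The innovations include: 1) a dual-path SSD block that separates long- and short-term regimes at the level of recurrent state, to handle very long sequences; 2) intensity-modulated stepping, a continuous-time time-warp that adapts the slow path’s effective rate to phase-relevant information, to handle non-uniform time flow; 3) state regramming, a per-chunk Cayley rotation that opens cross-channel mixing in the otherwise axis-aligned SSM recurrence, to handle strong cross-channel correlation. The learned rotation planes inherit a phase-aligned structure without any direct supervision, offering an interpretable internal signature of surgical workflow. The architecture decouples computational cost from sequence length, enabling efficient online inference.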

Abstract: Online surgical phase recognition (SPR) underpins context-aware operating-room systems and requires committing to a prediction at every frame from past context alone. Surgical video poses three demands that natural-video recognizers do not jointly address: procedures span tens of thousands of frames, time flows non-uniformly as long routine stretches are punctuated by brief phase-defining transitions, and the visual domain is narrow so backbone features are strongly correlated across channels. Existing recognizers either let per-frame cost grow with elapsed length, or hold cost bounded but advance state at a uniform rate with channel-independent dynamics, leaving the latter two demands unaddressed. We present SurgicalMamba, a causal SPR model built on Mamba2’s structured state-space duality (SSD) that holds per-frame cost at O(d). It introduces three SSD-compatible components, each targeting one demand: a dual-path SSD block that separates long- and short-term regimes at the level of recurrent state; intensity-modulated stepping, a continuous-time time-warp that adapts the slow path’s effective rate to phase-relevant information; and state regramming, a per-chunk Cayley rotation that opens cross-channel mixing in the otherwise axis-aligned SSM recurrence. The learned rotation planes inherit a phase-aligned structure without any direct supervision, offering an interpretable internal signature of surgical workflow. Across seven public SPR benchmarks, SurgicalMamba reaches state-of-the-art accuracy and phase-level Jaccard under strict online evaluation: 94.6%/82.7% on Cholec80 (+0.7 pp/+2.2 pp over the strongest prior) and 89.5%/68.9% on AutoLaparo (+1.7 pp/+2.0 pp), at 119 fps on a single GPU. Ablations isolate the contribution of each component. The code is publicly available at https://github.com/sukjuoh/Surgical-Mamba.
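
To make the "Cayley rotation" in state regramming concrete, here is a minimal sketch of the Cayley transform in a single 2D plane. For a skew-symmetric matrix A, Q = (I - A)^{-1}(I + A) is always orthogonal, so it mixes two channels without changing the norm of the state. The function names and the single-plane setup are illustrative assumptions; the abstract does not specify how the paper parameterizes or applies its per-chunk rotations.

```python
def cayley_rotation_2d(a):
    """Return Q = (I - A)^{-1} (I + A) for the skew-symmetric A = [[0, a], [-a, 0]].

    Closed form for the 2x2 case; Q is a rotation for every real a.
    """
    scale = 1.0 / (1.0 + a * a)
    c = (1.0 - a * a) * scale
    s = 2.0 * a * scale
    return [[c, s], [-s, c]]


def rotate_pair(q, x, y):
    """Mix two state channels (x, y) with the 2x2 rotation q."""
    return q[0][0] * x + q[0][1] * y, q[1][0] * x + q[1][1] * y


q = cayley_rotation_2d(0.5)          # unconstrained parameter a = 0.5
x_new, y_new = rotate_pair(q, 3.0, 4.0)
norm_before = 3.0 ** 2 + 4.0 ** 2
norm_after = x_new ** 2 + y_new ** 2  # preserved because q is orthogonal
```

A practical reason to use the Cayley form is that the free parameter `a` is unconstrained, so orthogonality holds by construction rather than through a penalty term.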


[95] On the Cultural Anachronism and Temporal Reasoning in Vision Language Models cs.CV | cs.AI | cs.CL

Mukul Ranjan, Prince Jha, Khushboo Kumari, Zhiqiang Shen

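TL;DR: This paper studies cultural anachronism in vision-language models applied to cultural heritage materials: the tendency to misinterpret historical artifacts using temporally inappropriate concepts, materials, or cultural frameworks. To quantify it, the authors build the TAB-VLM benchmark, with 600 questions covering 1,600 Indian artifacts from prehistoric to modern periods, and evaluate 10 state-of-the-art models. All perform poorly; the best model, GPT-5.2, reaches only 58.7% accuracy.

Details

Motivation: To address cultural anachronism in how vision-language models interpret historical artifacts: non-Western visual cultures are underrepresented in training data, and models fail to reason accurately about time for such material.

Result: On TAB-VLM, all 10 SOTA models perform poorly. The best model, GPT-5.2, reaches only 58.7% overall accuracy, and the performance gap persists across architectures and scales, indicating that cultural anachronism is a significant limitation of visual AI systems.

Insight: The paper defines cultural anachronism and builds the TAB-VLM benchmark to quantify it, exposing deficiencies in current VLM temporal reasoning, particularly for non-Western cultural heritage, and providing a foundation for improving temporal cognition in multimodal AI systems.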

Abstract: Vision-Language Models (VLMs) are increasingly applied to cultural heritage materials, from digital archives to educational platforms. This work identifies a fundamental issue in how these models interpret historical artifacts. We define this phenomenon as cultural anachronism, the tendency to misinterpret historical objects using temporally inappropriate concepts, materials, or cultural frameworks. To quantify this phenomenon, we introduce the Temporal Anachronism Benchmark for Vision-Language Models (TAB-VLM), a dataset of 600 questions across six categories, designed to evaluate temporal reasoning on 1,600 Indian cultural artifacts spanning prehistoric to modern periods. Systematic evaluations of ten state-of-the-art models reveal significant deficiencies on our benchmark, and even the best model (GPT-5.2) achieves only 58.7% overall accuracy. The performance gap persists across varying architectures and scales, suggesting that cultural anachronism represents a significant limitation in visual AI systems, regardless of model size. These findings highlight the disparity between current VLM capabilities and the requirements for accurately interpreting cultural heritage materials, particularly for non-Western visual cultures underrepresented in training data. Our benchmark provides a foundation for enhancing temporal cognition in multimodal AI systems that interact with historical artifacts. The dataset and code are available in our project page.


[96] Your CLIP has 164 dimensions of noise: Exploring the embeddings covariance eigenspectrum of contrastively pretrained vision-language transformers cs.CV | cs.AI | cs.LG

Jakub Grzywaczewski, Dawid Płudowski, Przemysław Biecek

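TL;DR: Using spectral decomposition of covariance matrices, this paper splits the latent space of contrastively pre-trained vision-language models (VLMs) into a multimodal semantic signal component and a shared noise subspace. The noise geometry shows strong subgroup invariance across data subsets, and pruning the shared noise dimensions is mostly harmless and can even improve downstream performance.

Details

Motivation: To address the structural anomalies and non-semantic multimodal noise in the shared latent space of contrastively pre-trained VLMs, and so better understand their representational structure.

Result: Pruning the shared noise dimensions preserves or actively improves downstream task performance, indicating that a substantial fraction of VLM latent geometry is governed by shared, architecture-level noise rather than task-relevant semantics alone.

Insight: The contribution is the use of covariance spectral decomposition to systematically separate semantic signal from noise in VLMs, and the finding that the noise subspace is strongly invariant. This offers mechanistic insight into VLM representation learning and suggests directions for model compression and optimization.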

Abstract: Contrastively pre-trained Vision-Language Models (VLMs) serve as powerful feature extractors. Yet, their shared latent spaces are prone to structural anomalies and act as repositories for non-semantic, multi-modal noise. To address this phenomenon, we employ spectral decomposition of covariance matrices to decompose the VLM latent space into a multi-modal semantic signal component and a shared noise subspace. We observe that this noise geometry exhibits strong subgroup invariance across distinct data subsets. Crucially, pruning these shared noise dimensions is mainly harmless, preserving or actively improving downstream task performance. By isolating true semantic signals from artifactual noise, this work provides new mechanistic insights into the representational structure of modern VLMs, suggesting that a substantial fraction of their latent geometry is governed by shared, architecture-level noise rather than task-relevant semantics alone.
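
The decomposition described above can be sketched with NumPy as follows: eigendecompose the embedding covariance, then project out a chosen set of eigen-directions. This is an illustrative sketch on random data, not the authors' pipeline. In particular, which directions count as "noise" is the paper's criterion and is not given in the abstract, so the sketch takes the indices to drop (`drop_idx`) as an input; the function names are hypothetical.

```python
import numpy as np


def covariance_spectrum(embeddings):
    """Eigendecompose the sample covariance of row-wise embeddings.

    Returns eigenvalues in ascending order and matching eigenvectors (columns).
    """
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (embeddings.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric input, ascending order
    return eigvals, eigvecs


def prune_directions(embeddings, eigvecs, drop_idx):
    """Remove the component of every embedding along the selected eigen-directions."""
    basis = eigvecs[:, drop_idx]                     # shape (d, k), orthonormal columns
    return embeddings - (embeddings @ basis) @ basis.T


rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 8))                      # stand-in for VLM embeddings
eigvals, eigvecs = covariance_spectrum(emb)
drop_idx = [0, 1]                                    # assumption: indices chosen by some external criterion
pruned = prune_directions(emb, eigvecs, drop_idx)
residual = np.abs(pruned @ eigvecs[:, drop_idx]).max()  # leftover mass along dropped directions
```

The pruned embeddings keep their original dimensionality, so they can be passed to an existing downstream head unchanged, which is what makes a "prune and re-evaluate" comparison straightforward.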


[97] SEDiT: Mask-Free Video Subtitle Erasure via One-step Diffusion Transformer cs.CV

Zheng Hui, Yunlong Bai

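TL;DR: This paper proposes SEDiT, a mask-free video subtitle erasure method based on a one-step diffusion Transformer. Its single-stage framework erases the target subtitle directly, avoiding the dependence of conventional two-stage methods on pre-extracted masks. It uses rectified flow for one-step generation, and combines a hybrid training strategy with chunk-wise streaming inference to keep results temporally consistent and process long videos efficiently.

Details

Motivation: Existing video editing methods usually rely on mask-based inpainting, which requires extracting the target mask in advance; segmentation precision then directly affects completion quality, and the two-stage pipeline is sub-optimal. This paper targets the mask dependence and efficiency problems in video subtitle erasure.

Result: The method handles video subtitle erasure efficiently, processing native 1440p video of unbounded length, with one-step and chunk-wise streaming inference maintaining temporal consistency.

Insight: The contributions are: a single-stage framework with mask-free inference that erases subtitles directly; one-step generation under rectified flow, justified by the observation that subtitle removal is a localized edit with small distribution shift, together with a theoretical argument; a hybrid training strategy that occasionally conditions on a clean first-frame latent to improve temporal continuity; and feeding the original video directly to the model to avoid visible seams from cropping and reinsertion.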

Abstract: Recent breakthroughs in video diffusion models have significantly accelerated the development of video editing techniques. However, existing methods often rely on inpainting video frames based on masked input, which requires extracting the target video mask in advance, and the precision of the segmentation directly affects the quality of the completion. In this paper, we present SEDiT, a novel one-stage video Subtitle Erasure approach via One-step Diffusion Transformer. We introduce a mask-free inference approach that enables direct erasure of the targeted subtitle. The proposed one-stage framework mitigates the sub-optimality inherent in the two-stage processing of prior models. Since subtitle removal is a localized editing task in which most pixels remain unchanged, the underlying distribution shift is minimal, making it well-suited to one-step generation under rectified flow. We empirically validate the reliability of one-step denoising and further provide a formal theoretical justification. Under the localized-editing structure of subtitle removal, the conditional optimal transport (OT) map and its induced rectified flow velocity field are Lipschitz continuous with respect to the latent variable, which underpins the theoretical feasibility of one-step sampling. To address the challenge of long-term temporal consistency, we adopt a hybrid training strategy by occasionally conditioning the model with a clean first-frame latent. This facilitates temporal continuity, allowing each segment during inference to leverage the output of its predecessor. To avoid visible seams caused by cropping and reinserting processed targets, particularly in scenarios involving substantial motion, we feed the original video directly into SEDiT. Thanks to one-step and chunk-wise streaming inference, our method can efficiently handle native 1440p video with infinite length.
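
The reason a well-behaved rectified-flow velocity field permits one-step sampling can be shown with a toy example: when trajectories are straight lines, a single Euler step from t = 0 to t = 1 lands exactly on the endpoint. The sketch below uses a hand-written velocity field in place of a trained network; `one_step_sample` and `make_straight_velocity` are hypothetical names, and nothing here reproduces SEDiT's model.

```python
def one_step_sample(x0, velocity_fn):
    """Single Euler step across t in [0, 1]: x1 = x0 + v(x0, t=0)."""
    v = velocity_fn(x0, 0.0)
    return [xi + vi for xi, vi in zip(x0, v)]


def make_straight_velocity(target):
    """Toy velocity field whose trajectories are straight lines ending at `target`."""
    def velocity_fn(x, t):
        # On a straight path to `target`, the remaining displacement is
        # covered in the remaining time (1 - t).
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity_fn


target = [0.2, -1.0, 3.0]   # stand-in for the clean latent
x0 = [1.0, 0.0, -2.0]       # stand-in for the starting latent
x1 = one_step_sample(x0, make_straight_velocity(target))
```

With a learned velocity field the paths are only approximately straight, so one-step sampling carries discretization error; the abstract's Lipschitz-continuity argument concerns why that error stays manageable for localized edits.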


[98] MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory cs.CV | cs.CL | cs.IR

Minghao Guo, Qingyue Jiao, Zeru Shi, Yihao Quan, Boxuan Zhang

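TL;DR: This paper proposes MemEye, a visual-centric framework for evaluating the long-term memory of multimodal agents. It measures memory along two dimensions: the granularity of the decisive visual evidence (from scene-level to pixel-level), and how retrieved evidence must be used (from single evidence to evolutionary synthesis). On this basis the authors build a new benchmark of 8 life-scenario tasks, with ablation-driven validation gates that assess answerability, shortcut resistance, visual necessity, and reasoning structure. Evaluating 13 memory methods on 4 vision-language model backbones shows that current architectures still struggle to preserve fine-grained visual details and to reason about state changes over time.

Details

Motivation: Existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. Many visually grounded questions can be answered from captions or textual traces alone, without retaining fine-grained visual evidence, and harder cases that require reasoning over changing visual states are largely absent.

Result: Evaluating 13 memory methods across 4 VLM backbones on the new benchmark shows that current architectures perform poorly at preserving fine-grained visual details and reasoning about state changes over time.

Insight: The paper proposes a systematic framework that evaluates multimodal memory along two dimensions, visual evidence granularity and evidence usage; builds a new benchmark with ablation-based validation gates for stricter evaluation; and finds that long-term multimodal memory depends on evidence routing, temporal tracking, and detail extraction.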

Abstract: Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. In prior work, many visually grounded questions can be answered using only captions or textual traces, allowing answers to be inferred without preserving the fine-grained visual evidence. Meanwhile, harder cases that require reasoning over changing visual states are largely absent. Therefore, we introduce MemEye, a framework that evaluates memory capabilities from two dimensions: one measures the granularity of decisive visual evidence (from scene-level to pixel-level evidence), and the other measures how retrieved evidence must be used (from single evidence to evolutionary synthesis). Under this framework, we construct a new benchmark across 8 life-scenario tasks, with ablation-driven validation gates for assessing answerability, shortcut resistance, visual necessity, and reasoning structure. By evaluating 13 memory methods across 4 VLM backbones, we show that current architectures still struggle to preserve fine-grained visual details and reason about state changes over time. Our findings show that long-term multimodal memory depends on evidence routing, temporal tracking, and detail extraction.


[99] MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models cs.CV

Xiyu Ren, Zhaowei Wang, Yiming Du, Zhongwei Xie, Chi Liu

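TL;DR: This paper proposes MEMLENS, a benchmark for systematically evaluating the long-term memory of large vision-language models (LVLMs) in long-context, multimodal, multi-session conversations. It contains 789 questions covering five memory abilities, tested at four standard context lengths. Evaluating 27 LVLMs and 7 memory-augmented agents shows that long-context LVLMs do well at short context but degrade as conversations grow, whereas memory agents are length-stable but lose visual fidelity under storage-time compression. Multi-session reasoning challenges most systems, and the results point to hybrid architectures that combine long-context attention with structured multimodal retrieval.

Details

Motivation: No existing benchmark systematically compares long-context LVLMs and memory-augmented agents on questions that genuinely require multimodal evidence, so a comprehensive benchmark for multimodal long-term memory is needed.

Result: Across the 34 systems evaluated on MEMLENS, long-context LVLMs are accurate at short context but degrade with length, while memory agents are length-stable but lose visual fidelity. On multi-session reasoning most systems score below 30%, and neither approach alone solves the task. An image-ablation study confirms that visual evidence is necessary: removing evidence images drops two frontier LVLMs below 2% accuracy on the 80.4% of questions whose evidence includes images.

Insight: The contribution is MEMLENS, a comprehensive benchmark for systematically evaluating multimodal long-term memory, together with experiments that expose the respective limitations of long-context LVLMs and memory-augmented agents, giving empirical grounding and a clear direction for hybrid architectures that combine the strengths of both.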

Abstract: Memory is essential for large vision-language models (LVLMs) to handle long, multimodal interactions, with two method directions providing this capability: long-context LVLMs and memory-augmented agents. However, no existing benchmark conducts a systematic comparison of the two on questions that genuinely require multimodal evidence. To close this gap, we introduce MEMLENS, a comprehensive benchmark for memory in multimodal multi-session conversations, comprising 789 questions across five memory abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge update, and answer refusal) at four standard context lengths (32K-256K tokens) under a cross-modal token-counting scheme. An image-ablation study confirms that solving MEMLENS requires visual evidence: removing evidence images drops two frontier LVLMs below 2% accuracy on the 80.4% of questions whose evidence includes images. Evaluating 27 LVLMs and 7 memory-augmented agents, we find that long-context LVLMs achieve high short-context accuracy through direct visual grounding but degrade as conversations grow, whereas memory agents are length-stable but lose visual fidelity under storage-time compression. Multi-session reasoning caps most systems below 30%, and neither approach alone solves the task. These results motivate hybrid architectures that combine long-context attention with structured multimodal retrieval. Our code is available at https://github.com/xrenaf/MEMLENS.


[100] SteerSeg: Attention Steering for Reasoning Video Segmentation cs.CV

Ali Cheraghian, Hamidreza Dastmalchi, Abdelwahed Khamis, Morteza Saberi, Aijun An

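TL;DR: This paper proposes SteerSeg, a lightweight framework for video reasoning segmentation built on the attention of frozen large vision-language models (LVLMs). Because LVLM attention maps are optimized for text generation rather than spatial localization, the grounding signal they provide is diffuse and ambiguous. SteerSeg steers attention at its source through input-level conditioning, combining learnable soft prompts with reasoning-guided chain-of-thought prompting to produce more concentrated spatial priors. These are converted into point prompts that guide a segmentation model, and candidate tracklets are ranked and selected using correlation-based scoring.

Details

Motivation: Existing training-free grounding methods use attention maps from frozen LVLMs as spatial priors, but those maps are optimized for text generation, so the localization signal used for segmentation is diffuse and ambiguous. This is the key bottleneck of attention-based grounding.

Result: Trained only on Ref-YouTube-VOS, SteerSeg generalizes well across multiple benchmarks and significantly improves the spatial grounding capability of LVLMs.

Insight: The core contribution is identifying attention misalignment as the key bottleneck and steering attention at the input through soft prompts and chain-of-thought prompting. Only a small set of soft prompts is learned, with the LVLM and segmentation model parameters left frozen, which improves spatial grounding while preserving the model's pretrained reasoning capabilities.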

Abstract: Video reasoning segmentation requires localizing objects across video frames from natural language expressions, often involving spatial reasoning and implicit references. Recent approaches leverage frozen large vision-language models (LVLMs) by extracting attention maps and using them as spatial priors for segmentation, enabling training-free grounding. However, these attention maps are optimized for text generation rather than spatial localization, often resulting in diffuse and ambiguous grounding signals. In this work, we introduce SteerSeg, a lightweight framework that identifies attention misalignment as the key bottleneck in attention-based grounding and proposes to steer attention at its source through input-level conditioning. SteerSeg combines learnable soft prompts with reasoning-guided Chain-of-Thought (CoT) prompting. The soft prompts reshape the attention distribution to produce more spatially concentrated maps, while CoT-derived attributes resolve ambiguity among similar objects by guiding attention toward the correct instance. The resulting attention maps are converted into point prompts across keyframes to guide a segmentation model, while candidate tracklets are ranked and selected using correlation-based scoring. Our approach freezes the LVLM and segmentation model parameters and learns only a small set of soft prompts, preserving the model’s pretrained reasoning capabilities while significantly improving grounding. Despite being trained only on Ref-YouTube-VOS, SteerSeg generalizes well across diverse benchmarks, significantly improving the spatial grounding capability of LVLMs. Project page: https://steerseg.github.io
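
One step in this pipeline, converting an attention map into point prompts, can be sketched as picking the highest-attention cells of a coarse grid and mapping their centers to pixel coordinates. The abstract does not describe SteerSeg's actual selection rule, so the top-k choice, the function name `attention_to_points`, and the toy grid are assumptions for illustration.

```python
def attention_to_points(attn, image_w, image_h, k=1):
    """Return the k highest-attention grid cells as (x, y) pixel centers.

    attn is a 2D list of attention weights over a coarse patch grid.
    """
    rows, cols = len(attn), len(attn[0])
    cells = [(attn[r][c], r, c) for r in range(rows) for c in range(cols)]
    cells.sort(key=lambda item: item[0], reverse=True)
    points = []
    for _, r, c in cells[:k]:
        # Map the cell center from grid coordinates to pixel coordinates.
        x = (c + 0.5) * image_w / cols
        y = (r + 0.5) * image_h / rows
        points.append((x, y))
    return points


# Toy 3x4 attention map with a clear peak at row 1, column 1.
attn = [
    [0.01, 0.02, 0.01, 0.00],
    [0.02, 0.60, 0.10, 0.01],
    [0.01, 0.15, 0.05, 0.01],
]
points = attention_to_points(attn, image_w=640, image_h=480, k=2)
```

This also shows why diffuse maps are a problem: if attention mass is spread evenly, the top-k cells are nearly arbitrary, and the resulting point prompts give the segmentation model little to work with.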


[101] Representative Attention For Vision Transformers cs.CV

Yuntong Li, Hainuo Wang, Hengxing Liu, Mingjia Li, Xiaojie Guo

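TL;DR: This paper proposes Representative Attention (RPAttention), a linear global attention mechanism that addresses the quadratic cost of dense self-attention in Vision Transformers. It replaces conventional compression based on fixed spatial layouts with dynamic, semantic token compression in representation space, so that semantically related regions can exchange information regardless of their spatial distance.

Details

Motivation: Existing linear attention methods usually compress tokens according to predefined spatial layouts, which ties compression to image coordinates rather than to the semantic organization of visual content and limits their effectiveness. This paper aims to remove that limitation and enable global information exchange that follows content structure.

Result: Extensive experiments across multiple Vision Transformer backbones on image classification, object detection, and semantic segmentation demonstrate the effectiveness of the design. The method reduces the dominant token-interaction complexity from quadratic to linear while retaining strong global context modeling.

Insight: The core contribution is a dynamic token compression mechanism in representation space (the Gather-Interact-Distribute paradigm), which replaces coordinate-driven aggregation with representation-driven compression so that token communication adapts to the content structure of each input, lowering compute cost while improving semantic awareness.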

Abstract: Linear attention has emerged as a promising direction for scaling Vision Transformers beyond the quadratic cost of dense self-attention. A prevalent strategy is to compress spatial tokens into a compact set of intermediate proxies that mediate global information exchange. However, existing methods typically derive these proxy tokens from predefined spatial layouts, causing token compression to remain anchored to image coordinates rather than the semantic organization of visual content. To overcome this limitation, we propose Representative Attention (RPAttention), a linear global attention mechanism that performs token compression directly in representation space. Instead of constructing intermediate tokens from fixed spatial partitions, it dynamically forms a compact set of learned representative tokens to enable semantically related regions to communicate regardless of their spatial distance, by following a lightweight Gather-Interact-Distribute paradigm. Spatial tokens are first softly gathered into representative tokens through competitive similarity-based routing. The representatives then perform global interaction within a compact latent space, before broadcasting the refined information back to all spatial tokens via query-driven cross-attention. By replacing coordinate-driven aggregation with representation-driven compression, RPAttention preserves global receptive fields while adaptively aligning token communication with the content structure of each input. RPAttention reduces the dominant token interaction complexity from quadratic to linear scaling with respect to the number of spatial tokens, while maintaining expressive global context modeling. Extensive experiments across diverse vision transformer backbones on image classification, object detection, and semantic segmentation demonstrate the effectiveness of our design.
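
The Gather-Interact-Distribute flow can be sketched in NumPy as three small attention steps. This is a simplified illustration under stated assumptions: learned query/key/value projections, multi-head structure, and normalization layers are omitted, the representative vectors `reps` are random stand-ins for learned parameters, and the function names are hypothetical. With N spatial tokens and M representatives (M much smaller than N), the token-level steps cost O(N·M·d) rather than O(N²·d).

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gather_interact_distribute(tokens, reps):
    """tokens: (N, d) spatial tokens; reps: (M, d) representative vectors, M << N."""
    d = tokens.shape[1]
    # Gather: each token distributes itself over representatives (softmax over M),
    # then each representative takes a weighted mean of the tokens routed to it.
    route = softmax(tokens @ reps.T / np.sqrt(d), axis=1)         # (N, M)
    weights = route / (route.sum(axis=0, keepdims=True) + 1e-9)   # columns sum to ~1
    gathered = weights.T @ tokens                                 # (M, d)
    # Interact: full attention among the M representatives only.
    mix = softmax(gathered @ gathered.T / np.sqrt(d), axis=1)     # (M, M)
    refined = mix @ gathered                                      # (M, d)
    # Distribute: every token queries the refined representatives.
    attn = softmax(tokens @ refined.T / np.sqrt(d), axis=1)       # (N, M)
    return attn @ refined                                         # (N, d)


rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 16))   # e.g. a 14x14 patch grid with 16-dim features
reps = rng.normal(size=(8, 16))       # 8 representatives (assumed count)
out = gather_interact_distribute(tokens, reps)
```

Because routing depends on feature similarity rather than position, two distant patches with similar features end up contributing to, and reading from, the same representative.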


[102] SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding cs.CV

Pengxin Xu, Xincheng Lin, Luping Xiao, Qing Jiang, Meishan Zhang

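TL;DR: This paper proposes an interaction-oriented hierarchical scene parsing task that represents physical scenes as explicit scene -> object -> part -> affordance hierarchies, and introduces SceneParser, a VLM-based parser that generates the hierarchy in a unified way. It also builds SceneParser-Bench, a large-scale benchmark for training and evaluation.

Details

Motivation: General scene perception capabilities such as open-vocabulary grounding, part localization, and affordance prediction are usually realized as isolated predictions that do not capture the structured dependencies needed for interaction-oriented scene understanding. This paper aims to fill that gap.

Result: Experiments show that existing MLLMs and perception-stitching pipelines struggle with hierarchical parsing on SceneParser-Bench, while SceneParser achieves stronger structure-aware performance. Evaluations on COCO and AGD20K show that it remains compatible with conventional tasks.

Insight: The core contribution is the hierarchical scene parsing task and its structured representation, together with a unified training approach that combines structural-completion pseudo labels and curriculum learning. The large-scale benchmark and the level-conditional metrics (such as ParseRate) are useful tools for this direction, and the actionable output representation has practical value for downstream tasks such as planning.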

Abstract: General scene perception has progressed from object recognition toward open-vocabulary grounding, part localization, and affordance prediction. Yet these capabilities are often realized as isolated predictions that localize objects, parts, or interaction points without capturing the structured dependencies needed for interaction-oriented scene understanding. To address this gap, we introduce Hierarchical Scene Parsing, an interaction-oriented parsing task that represents physical scenes as explicit scene -> object -> part -> affordance hierarchies with cross-level bindings. We instantiate this task with SceneParser, a VLM-based parser trained for unified hierarchical generation with structural-completion pseudo labels and curriculum learning. To support training and evaluation, we construct SceneParser-Bench, a large-scale benchmark built with a scalable hierarchical data engine, containing 110K training images, a 5K validation split, 777K objects, 1.14M parts, 1.74M affordance annotations, and 1.74M valid object-part-affordance chain instances. We further introduce Level-1 to Level-3 conditional metrics and ParseRate to evaluate localization, cross-level binding, and hierarchical completeness. Experiments show that existing MLLMs and perception-stitching pipelines struggle with hierarchical parsing on our SceneParser-Bench, while SceneParser achieves stronger structure-aware performance. Besides, ablations, evaluations on COCO and AGD20K, and a downstream planning probe demonstrate that our SceneParser is compatible with conventional tasks and provides an actionable representation for visual understanding.
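
The scene -> object -> part -> affordance hierarchy can be pictured as nested records, with an object-part-affordance chain being one root-to-leaf path. The sketch below uses Python dataclasses; all class names, field names, and the `chains` helper are hypothetical, since the abstract does not specify SceneParser's output schema or how its metrics are computed.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed convention


@dataclass
class Affordance:
    name: str
    point: Tuple[float, float]           # interaction point in pixels


@dataclass
class Part:
    name: str
    box: Box
    affordances: List[Affordance] = field(default_factory=list)


@dataclass
class SceneObject:
    name: str
    box: Box
    parts: List[Part] = field(default_factory=list)


@dataclass
class Scene:
    objects: List[SceneObject] = field(default_factory=list)


def chains(scene: Scene) -> List[Tuple[str, str, str]]:
    """List every complete object -> part -> affordance path in the scene."""
    return [
        (obj.name, part.name, aff.name)
        for obj in scene.objects
        for part in obj.parts
        for aff in part.affordances
    ]


scene = Scene(objects=[
    SceneObject("mug", (10, 10, 60, 80), parts=[
        Part("handle", (50, 30, 60, 60), [Affordance("grasp", (55.0, 45.0))]),
    ]),
    SceneObject("door", (100, 0, 200, 300), parts=[
        Part("knob", (180, 150, 195, 165), [Affordance("turn", (187.0, 157.0))]),
        Part("panel", (100, 0, 200, 300)),   # a part with no affordance yields no chain
    ]),
])
all_chains = chains(scene)
```

Nesting makes the cross-level bindings explicit: an affordance cannot exist without a parent part, which is the kind of dependency that isolated per-task predictions do not enforce.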


[103] ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both cs.CV | cs.AI | cs.CL

Ziyu Guo, Rain Liu, Xinyan Chen, Pheng-Ann Heng

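TL;DR: ATLAS is a visual reasoning framework that combines the strengths of agentic and latent reasoning through a discrete "word" called a functional token. The token acts both as an agentic operation that performs an internalized visual operation and as a latent visual reasoning unit, avoiding the compute cost of generating intermediate visual content while staying compatible with standard SFT and RL training.

Details

Motivation: Among existing visual reasoning methods, agentic approaches incur context-switching latency, while latent approaches lack task generalization and are hard to train. ATLAS aims to combine the strengths of both while mitigating their limitations.

Result: ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability.

Insight: The core contribution is a single discrete functional token that unifies agentic operations and latent representations, avoiding intermediate visual generation. In addition, the proposed LA-GRPO stabilizes RL training by anchoring functional tokens with a statically weighted auxiliary objective. Together these offer a new paradigm for visual reasoning research.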

Abstract: Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete ‘word’, termed as a functional token, serves both as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet requires no visual supervision and remains a standard token in the tokenizer vocabulary, which can be generated via next-token prediction. This design avoids verbose intermediate visual content generation, while preserving compatibility with the vanilla scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes the training by anchoring functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm inspiring future visual reasoning research.


[104] SCRWKV: Ultra-Compact Structure-Calibrated Vision-RWKV for Topological Crack Segmentation cs.CV

Hanxu Zhang, Chen Jia, Hui Liu, Xu Cheng, Fan Shi

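TL;DR: This paper proposes the Ultra-Compact Structure-Calibrated Vision RWKV (SCRWKV) network for topological crack segmentation. Through a novel Structure-Field Encoder (SFE) backbone, combined with core components such as the Adaptive Multi-scale Cascaded Modulator (AMCM) and the Structure-Calibrated Insight Unit (SCIU), it achieves high-precision modeling at linear complexity. On multiple benchmarks with complex textures and severe interference, SCRWKV significantly outperforms existing state-of-the-art methods with only 1.22M parameters.

Details

Motivation: To address the bottleneck existing methods face in balancing crack topology modeling against computational efficiency, namely the difficulty of achieving high segmentation quality and low resource demands at the same time.

Result: The model reaches an F1 score of 0.8428 and an mIoU of 0.8512 on the TUT dataset, and with only 1.22M parameters significantly outperforms SOTA methods on multiple benchmarks, demonstrating its potential for efficient deployment.

Insight: The main contributions are: (1) the Structure-Field Encoder (SFE) backbone; (2) the Adaptive Multi-scale Cascaded Modulator (AMCM), which enhances texture representation; (3) the Structure-Calibrated Insight Unit (SCIU), which uses the Geometry-guided Bidirectional Structure Transformation (GBST) to capture topological correlations and integrates Dynamic Self-Calibrating Decay (DSCD) into Dy-WKV to suppress noise propagation; and (4) a lightweight Cross-Scale Harmonic Fusion (CSHF) decoder for precise feature aggregation. Overall, the design achieves high-precision segmentation at linear complexity.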

Abstract: Achieving pixel-level accurate segmentation of structural cracks across diverse scenarios remains a formidable challenge. Existing methods face significant bottlenecks in balancing crack topology modeling with computational efficiency, often failing to reconcile high segmentation quality with low resource demands. To address these limitations, we propose the Ultra-Compact Structure-Calibrated Vision RWKV (SCRWKV), a network that achieves high-precision modeling via a novel Structure-Field Encoder (SFE) backbone while maintaining linear complexity. The SFE integrates the Adaptive Multi-scale Cascaded Modulator (AMCM) to enhance texture representation and utilizes the Structure-Calibrated Insight Unit (SCIU) as its core engine. Specifically, the SCIU employs the Geometry-guided Bidirectional Structure Transformation (GBST) to capture topological correlations and integrates the Dynamic Self-Calibrating Decay (DSCD) into Dy-WKV to suppress noise propagation. Furthermore, we introduce a lightweight Cross-Scale Harmonic Fusion (CSHF) decoder to achieve precise feature aggregation. Systematic evaluations on multiple benchmarks characterized by complex textures and severe interference demonstrate that SCRWKV, with only 1.22M parameters, significantly outperforms SOTA methods. Achieving an F1 score of 0.8428 and mIoU of 0.8512 on the TUT dataset, the model confirms its robust potential for efficient real-world deployment. The code is available at https://github.com/zhxhzy/SCRWKV.
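
For readers unfamiliar with the two reported metrics, the sketch below computes F1 for the crack class and a two-class mean IoU (crack and background) from binary masks. This follows the standard definitions; the paper's exact evaluation protocol (thresholding, tolerance margins, per-image versus dataset-level averaging) is not stated in the abstract, and the function names and toy masks are illustrative.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN over two same-shaped binary masks (nested lists of 0/1)."""
    tp = fp = fn = tn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def f1_and_miou(pred, truth):
    """Return (F1 of the crack class, mean IoU over crack and background)."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 1.0
    crack_den = tp + fp + fn
    iou_crack = tp / crack_den if crack_den else 1.0
    bg_den = tn + fp + fn
    iou_bg = tn / bg_den if bg_den else 1.0
    return f1, (iou_crack + iou_bg) / 2


# Toy 2x4 masks: 2 true positives, 1 false positive, 1 false negative, 4 true negatives.
pred = [[1, 1, 0, 0],
        [0, 1, 0, 0]]
truth = [[1, 0, 0, 0],
         [0, 1, 1, 0]]
f1, miou = f1_and_miou(pred, truth)
```

Cracks occupy few pixels, so background IoU is usually high; this is one reason a mean IoU can exceed the crack-class F1, as in the figures reported above.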


[105] Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model cs.CV | cs.RO

Tao Lin, Yuxin Du, Jiting Liu, Nuobei Zhu, Yunhe Li

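TL;DR: Evo-Depth is a lightweight depth-enhanced vision-language-action (VLA) model that aims to improve spatial understanding in robotic manipulation. A lightweight Implicit Depth Encoding Module extracts depth features from multi-view RGB images, and a Spatial Enhancement Module injects them into the vision-language representation, with no extra sensors and without sacrificing deployment efficiency.

Details

Motivation: Current VLA models rely mainly on 2D visual representations that lack depth information and detailed spatial relationships, so they perform poorly in scenarios that require precise spatial understanding. Methods that add explicit 3D inputs increase system complexity and cost, while implicit 3D modeling methods depend on large geometry foundation models, raising training and deployment overhead.

Result: With only 0.9B parameters, Evo-Depth achieves superior performance on four simulation benchmarks. In real-world experiments it attains the highest average success rate, with the smallest model size, lowest GPU memory usage, and highest inference frequency among the compared methods.

Insight: The contribution is a lightweight implicit depth encoding and depth-aware modulation mechanism that provides spatial-semantic enhancement without additional hardware, plus a Progressive Alignment Training strategy that aligns the depth-enhanced representation with downstream action learning, improving spatial understanding while keeping deployment efficient.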

Abstract: Vision-Language-Action models have emerged as a promising paradigm for robotic manipulation by unifying perception, language grounding, and action generation. However, they often struggle in scenarios requiring precise spatial understanding, as current VLA models primarily rely on 2D visual representations that lack depth information and detailed spatial relationships. While recent approaches incorporate explicit 3D inputs such as depth maps or point clouds to address this issue, they often increase system complexity, require additional sensors, and remain vulnerable to sensing noise and reconstruction errors. Another line of work explores implicit 3D-aware spatial modeling directly from RGB observations without extra sensors, but it often relies on large geometry foundation models, resulting in higher training and deployment costs. To address these challenges, we propose Evo-Depth, a lightweight depth-enhanced VLA framework that enhances spatially grounded manipulation without relying on additional sensing hardware or compromising deployment efficiency. Evo-Depth employs a lightweight Implicit Depth Encoding Module to extract compact depth features from multi-view RGB images. These features are incorporated into vision-language representations through a Spatial Enhancement Module via depth-aware modulation, enabling efficient spatial-semantic enhancement. A Progressive Alignment Training strategy is further introduced to align the resulting depth-enhanced representations with downstream action learning. With only 0.9B parameters, Evo-Depth achieves superior performance across four simulation benchmarks. In real-world experiments, Evo-Depth attains the highest average success rate while also exhibiting the smallest model size, lowest GPU memory usage, and highest inference frequency among compared methods.


[106] MHSA: A Lightweight Framework for Mitigating Hallucinations via Steered Attention in LVLMs cs.CV | cs.AI

Wei Ding, Yilin Li, Yudong Zhang, Ruobing Xie, Xingwu Sun

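TL;DR: This paper proposes MHSA, a lightweight framework for mitigating hallucinations in large vision-language models (LVLMs). It reduces hallucinations by learning to correct cross-modal attention patterns; at inference time it simply replaces the original attention with the corrected one, without modifying any model parameters.

Details

Motivation: LVLMs perform well on multimodal tasks but still hallucinate, generating content that is inconsistent with the visual input. Prior work such as DHCP explored hallucination detection from the perspective of cross-modal attention but did not address mitigation.

Result: MHSA mitigates both discriminative and generative hallucinations across multiple datasets and LVLMs by simply replacing the attention, without modifying model parameters, showing that it is lightweight and general.

Insight: The contribution is extending cross-modal attention mechanisms from hallucination detection to hallucination mitigation by training a simple MLP generator to correct attention patterns, which offers a new perspective on LVLM hallucination research and improves model reliability.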

Abstract: Large vision-language models (LVLMs) have achieved remarkable performance across diverse multimodal tasks, yet they continue to suffer from hallucinations, generating content that is inconsistent with the visual input. Prior work DHCP (Detecting Hallucinations by Cross-modal Attention Pattern) has explored hallucination detection from the perspective of cross-modal attention, but does not address hallucination mitigation. In this paper, we propose MHSA (Mitigating Hallucinations via Steered Attention), a lightweight framework that mitigates hallucinations by learning to correct cross-modal attention patterns in LVLMs. MHSA trains a simple three-layer MLP generator to produce corrected attention, guided by supervisory signals from the DHCP discriminator and the LVLM itself. During inference, MHSA mitigates both discriminative and generative hallucinations across various datasets and LVLMs by simply replacing the original cross-modal attention with the corrected one, without modifying any LVLM parameters. By extending cross-modal attention mechanisms from hallucination detection to hallucination mitigation, MHSA offers a novel perspective on hallucination research in LVLMs and helps enhance their reliability.


[107] Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image cs.CV | cs.AI

Ming Qian, Zimin Xia, Changkun Liu, Shuailei Ma, Wen Wang

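TL;DR: Sat3DGen is a new method for generating street-level 3D scenes from a single satellite image. Its core is a "geometry-first" strategy that integrates novel geometric constraints with a perspective-view training strategy, significantly improving the geometric accuracy and realism of the generated scenes.

Details

Motivation: Existing methods for generating street-level 3D from satellite images face a trade-off: geometry-colorization models have high geometric fidelity but lack semantic diversity, while proxy-based feed-forward models generate rich content but coarse, unstable geometry. The paper targets the underlying geometric challenges caused by the extreme viewpoint gap and by sparse, inconsistent supervision.

Result: On a new benchmark built from the VIGOR-OOD test set and high-resolution DSM data, the method reduces geometric RMSE from 6.76 m to 5.20 m and lowers the Fréchet Inception Distance (FID) from about 40 to 19, surpassing the leading method, Sat2Density++.

Insight: The contribution is the "geometry-first" methodology, which integrates geometric constraints and a perspective-view training strategy into a feed-forward framework and directly targets the main sources of geometric error. Besides improving geometric accuracy, this also substantially improves photorealism without any extra image-quality module, and the resulting high-quality 3D assets are shown to be useful in a range of downstream applications.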

Abstract: Generating a street-level 3D scene from a single satellite image is a crucial yet challenging task. Current methods present a stark trade-off: geometry-colorization models achieve high geometric fidelity but are typically building-focused and lack semantic diversity. In contrast, proxy-based models use feed-forward image-to-3D frameworks to generate holistic scenes by jointly learning geometry and texture, a process that yields rich content but coarse and unstable geometry. We attribute these geometric failures to the extreme viewpoint gap and sparse, inconsistent supervision inherent in satellite-to-street data. We introduce Sat3DGen to address these fundamental challenges, which embodies a geometry-first methodology. This methodology enhances the feed-forward paradigm by integrating novel geometric constraints with a perspective-view training strategy, explicitly countering the primary sources of geometric error. This geometry-centric strategy yields a dramatic leap in both 3D accuracy and photorealism. For validation, we first constructed a new benchmark by pairing the VIGOR-OOD test set with high-resolution DSM data. On this benchmark, our method improves geometric RMSE from 6.76m to 5.20m. Crucially, this geometric leap also boosts photorealism, reducing the Fréchet Inception Distance (FID) from $\sim$40 to 19 against the leading method, Sat2Density++, despite using no extra tailored image-quality modules. We demonstrate the versatility of our high-quality 3D assets through diverse downstream applications, including semantic-map-to-3D synthesis, multi-camera video generation, large-scale meshing, and unsupervised single-image Digital Surface Model (DSM) estimation. The code has been released on https://github.com/qianmingduowan/Sat3DGen.
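
As a worked example of the geometric metric quoted above, RMSE between a predicted height map and a reference DSM is the square root of the mean squared per-pixel height difference. The sketch below uses flattened toy height lists in meters; the function name and the values are illustrative, and the benchmark's actual masking or alignment steps are not described in the abstract.

```python
import math


def height_rmse(pred_heights, ref_heights):
    """Root-mean-square error between two equal-length lists of heights (meters)."""
    if len(pred_heights) != len(ref_heights):
        raise ValueError("height lists must have the same length")
    squared = [(p - r) ** 2 for p, r in zip(pred_heights, ref_heights)]
    return math.sqrt(sum(squared) / len(squared))


# Toy example: errors of -3 m, +4 m, 0 m, 0 m.
pred_heights = [10.0, 12.0, 8.0, 5.0]
ref_heights = [13.0, 8.0, 8.0, 5.0]
rmse = height_rmse(pred_heights, ref_heights)
```

Because errors are squared before averaging, a few badly misplaced structures (for example a missing building) dominate the score, which is why RMSE is sensitive to the coarse geometric failures the abstract describes.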


[108] Compositional Video Generation via Inference-Time Guidance cs.CV

Ariel Shaulov, Eitan Shaar, Amit Edenzon, Gal Chechik, Lior Wolf

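TL;DR: This paper proposes CVG, an inference-time guidance method that improves the compositional faithfulness of frozen text-to-video diffusion models. It trains a lightweight classifier on the model's internal cross-attention features and uses it to steer the latent trajectory during early denoising steps, improving fine-grained compositional understanding of entity relations, attributes, and actions without retraining or changing the model architecture.

Details

Motivation: Existing text-to-video diffusion models perform poorly on prompts that require fine-grained compositional understanding, such as relations between entities, attributes, actions, and motion directions. The authors argue that this can be mitigated by steering the denoising process rather than by retraining the generator.

Result: On compositional text-to-video benchmarks, CVG improves prompt faithfulness while preserving the visual quality of the underlying generator.

Insight: The contribution is using the model's internal cross-attention maps as grounding signals for concepts across space and time, and training a transferable lightweight classifier on them for inference-time guidance, which improves compositional faithfulness without extra user controls or model fine-tuning.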

Abstract: Text-to-video diffusion models generate realistic videos, but often fail on prompts requiring fine-grained compositional understanding, such as relations between entities, attributes, actions, and motion directions. We hypothesize that these failures need not be addressed by retraining the generator, but can instead be mitigated by steering the denoising process using the model’s own internal grounding signals. We propose CVG, an inference-time guidance method for improving compositional faithfulness in frozen text-to-video models. Our key observation is that cross-attention maps already encode how prompt concepts are grounded across space and time. We train a lightweight compositional classifier on these attention features and use its gradients during early denoising steps to steer the latent trajectory toward the desired composition. Built on a frozen VLM backbone, the classifier transfers across semantically related composition labels rather than relying only on narrow category-specific features. CVG improves compositional generation without modifying the model architecture, fine-tuning the generator, or requiring layouts, boxes, or other user-supplied controls. Experiments on compositional text-to-video benchmarks show improved prompt faithfulness while preserving the visual quality of the underlying generator.
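
The idea of steering a latent with classifier gradients can be sketched with a toy logistic classifier, for which the gradient of log p(y=1 | z) has the closed form (1 - p)·w. This stands in for CVG's classifier, which per the abstract operates on cross-attention features of a video model; the linear classifier, the `guidance_step` name, the step size, and the loop without any denoising update are all simplifying assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def guidance_step(z, w, b, scale):
    """One gradient-ascent step on log p(y=1 | z) for a logistic classifier.

    Returns the updated latent and the classifier probability before the update.
    """
    logit = sum(wi * zi for wi, zi in zip(w, z)) + b
    p = sigmoid(logit)
    # d/dz log sigmoid(w.z + b) = (1 - p) * w
    grad = [(1.0 - p) * wi for wi in w]
    return [zi + scale * gi for zi, gi in zip(z, grad)], p


w, b = [1.0, -2.0, 0.5], 0.0   # toy classifier parameters
z = [0.0, 0.5, 0.0]            # toy latent that the classifier initially rejects
probs = []
for step in range(5):          # stand-in for the early denoising steps
    z, p = guidance_step(z, w, b, scale=0.5)
    probs.append(p)
```

In a real sampler each guidance update would be interleaved with the generator's own denoising step, and restricting guidance to early steps reflects the common observation that coarse layout is decided early in the trajectory.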


[109] Characterizing the visual representation of objects from the child’s view cs.CV

Jane Yang, Tarun Sepuri, Alvin Wei Ming Tan, Khai Loong Aw, Michael C. Frank

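TL;DR: By analyzing first-person video of young children, this paper characterizes the distribution and structure of object categories in children's visual experience. Exposure is highly skewed, with a few common categories (e.g., cups, chairs) dominating, and objects often appear from unusual angles, in cluttered scenes, and partially occluded. Despite this variability, detected categories group more strongly within superordinate categories (e.g., animals, food) than canonical photographs do, and the same pattern appears in high-dimensional embeddings from self-supervised visual and multimodal models.

Details

Motivation: To characterize the input to early object category learning and to understand how children efficiently learn category representations from unstructured everyday visual experience.

Result: Analysis of the BabyView dataset (31 participants, 868 hours of video, more than 3 million frames) shows that object category exposure is highly skewed. In the embedding spaces of self-supervised visual and multimodal models, category exemplars group more strongly at the superordinate level than canonical photographs do, and the pattern is reproduced in densely sampled data from individual children.

Insight: The contribution is a large-scale quantitative analysis of object-category statistics in children's real visual experience. It reveals strong superordinate structure among non-canonical, sparse, and variable exemplars, which informs the development of visual category models that can exploit such structure and learn efficiently from non-canonical data.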

Abstract: Children acquire object category representations from their everyday experiences in the first few years of life. What do the inputs to this learning process look like? We analyzed first-person videos of young children’s visual experience at home from the BabyView dataset ($N$ = 31 participants, 868 hours, ages 5–36 months), using a supervised object detection model to extract common object categories from more than 3 million frames. We found that children’s object category exposure was highly skewed: a few categories (e.g., cups, chairs) dominated children’s visual experiences while most categories appeared rarely, replicating previous findings from a more restricted set of contexts. Category exemplars were highly variable: children encountered objects from unusual angles, in highly cluttered scenes, and partially occluded views; many categories (especially animals) were most frequently viewed as depictions. Surprisingly, despite this variability, detected categories (e.g., giraffes, apples) showed stronger groupings within superordinate categories (e.g., animals, food) relative to groupings derived from canonical photographs of these categories. We found this same pattern when using high-dimensional embeddings from both self-supervised visual and multimodal models; this effect was also recapitulated in densely sampled data from individual children. Understanding the robustness and efficiency of visual category learning will require the development of models that can exploit strong superordinate structure and learn from non-canonical, sparse, and variable exemplars.


[110] EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration cs.CV | cs.AI

Wuyang Li, Yang Gao, Mariam Hassan, Lan Feng, Wentao Pan

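TL;DR: EverAnimate is an efficient post-training method for long-horizon animated video generation that preserves visual quality and character identity. It anchors generation to a persistent latent context memory through two mechanisms, Persistent Latent Propagation and Restorative Flow Matching, to counter the accumulated drift common in chunk-based generation. With only lightweight LoRA tuning, it outperforms state-of-the-art methods on both short- and long-horizon animation.

Details

Motivation: To address accumulated drift in long-horizon animation, including low-level quality drift (e.g., progressive degradation of static backgrounds) and high-level semantic drift (e.g., inconsistent character identity and view-dependent attributes).

Result: EverAnimate reaches state-of-the-art results at both 10 and 90 seconds. At 10 seconds, PSNR/SSIM improve by 8%/7% and LPIPS/FID fall by 22%/11%; at 90 seconds, the gains grow to 15%/15% and 32%/27%, respectively.

Insight: The core idea is to maintain coherence across chunks through a persistent latent context memory that combines two complementary mechanisms: Persistent Latent Propagation, which propagates identity and motion in latent space to mitigate temporal forgetting, and Restorative Flow Matching, which introduces an implicit restoration objective through velocity adjustment to improve within-chunk fidelity. As a post-training strategy, it needs only lightweight LoRA tuning to improve the stability and quality of long-sequence generation.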

Abstract: We propose EverAnimate, an efficient post-training method for long-horizon animated video generation that preserves visual quality and character identity. Long-form animation remains challenging because highly dynamic human motion must be synthesized against relatively static environments, making chunk-based generation prone to accumulated drift: (i) low-level quality drift, such as progressive degradation of static backgrounds, and (ii) high-level semantic drift, such as inconsistent character identity and view-dependent attributes. To address this issue, EverAnimate restores drifted flow trajectories by anchoring generation to a persistent latent context memory, consisting of two complementary mechanisms. (i) Persistent Latent Propagation maintains a context memory across chunks to propagate identity and motion in latent space while mitigating temporal forgetting. (ii) Restorative Flow Matching introduces an implicit restoration objective during sampling through velocity adjustment, improving within-chunk fidelity. With only lightweight LoRA tuning, EverAnimate outperforms state-of-the-art long-animation methods in both short- and long-horizon settings: at 10 seconds, it improves PSNR/SSIM by 8%/7% and reduces LPIPS/FID by 22%/11%; at 90 seconds, the gains increase to 15%/15% and 32%/27%, respectively.


[111] LATERN: Test-Time Context-Aware Explainable Video Anomaly Detection cs.CV

Mitchell Piehl, Muchao Ye

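TL;DR: LATERN is a test-time, context-aware, explainable framework for video anomaly detection (VAD). It targets existing vision-language model (VLM) methods that, because of token limits, run segment-level inference independently and lack structured temporal context. Through two modules, Context-Aware Anomaly Scoring (CEA) and Recursive Evidence Aggregation (REA), it reformulates VAD as temporal evidence aggregation and produces temporally coherent event-level decisions and explanations grounded in visual-textual evidence.

Details

Motivation: Existing VLM-based VAD methods process video segments independently because of token limits and lack a structured temporal view of how the video evolves, which leads to fragmented predictions and explanations. The paper aims to let VLMs interpret anomalies as deviations from evolving video dynamics rather than produce isolated judgments.

Result: Extensive experiments on challenging benchmarks, including UCF-Crime and XD-Violence, show that LATERN improves the detection accuracy and explanation consistency of frozen VLMs at test time while producing temporally coherent and semantically grounded event-level explanations.

Insight: The main contribution is reformulating VAD as temporal evidence aggregation with two core modules. CEA introduces an image-grounded memory mechanism that selectively expands context using frame diversity and visual-textual alignment to produce reliable anomaly scores. REA then aggregates these scores recursively over time to identify coherent anomaly intervals. Together they suggest a way for VLMs to integrate temporal context and produce explainable, coherent event-level outputs in video understanding.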

Abstract: Vision-language models (VLMs) have recently emerged as a promising paradigm for video anomaly detection (VAD) due to their strong visual reasoning ability and natural language-based explainability. In this paper, we aim to address a key limitation of such pipelines: owing to token constraints, they perform segment-level inference independently and reason without structured temporal context. Our goal is to let VLMs interpret anomalies as deviations from evolving video dynamics rather than produce fragmented predictions and explanations. Specifically, we propose a context-aware framework named LATERN, which reformulates VAD as a temporal evidence aggregation process. LATERN consists of two complementary modules: Context-Aware Anomaly Scoring (CEA) and Recursive Evidence Aggregation (REA). CEA introduces a novel image-grounded memory mechanism, which selectively chooses historical content via frame diversity and visual-textual alignment as expanded context to help generate reliable anomaly scores. Building upon these scores, REA performs recursive temporal aggregation to identify coherent anomaly intervals and produce event-level decisions and explanations grounded in visual-textual evidence. Extensive experiments on challenging benchmarks, including UCF-Crime and XD-Violence, show that LATERN enhances detection accuracy and explanation consistency for frozen VLMs during test time, while generating temporally coherent and semantically grounded event-level explanations.


[112] Computational Imaging Priors for Wireless Capsule Endoscopy: Monte Carlo-Guided Hemoglobin Mapping for Rare-Anomaly Detection cs.CV

Chengshuai Yang, Lei Xing, Gregory Entin, Roopa Vemulapalli, Lisa Casey

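TL;DR: This paper proposes hemoglobin mapping for wireless capsule endoscopy based on a Monte Carlo-inspired analytic model, with the aim of improving classification of small-vessel anomalies. Two configurations (input fusion and a distillation head) are tested on Kvasir-Capsule. The analytic prior gives a small but direction-consistent macro-AUC gain, with the largest robust per-class lift on Lymphangiectasia.

Details

Motivation: RGB-trained capsule-endoscopy classifiers underperform on small-vessel lesions because they conflate hemoglobin contrast with bile and illumination falloff.

Result: On Kvasir-Capsule (6 random seeds), input fusion raises macro-AUC from 0.760 (RGB-only) to 0.783, and Lymphangiectasia AUC rises from 0.238 to 0.337. The distillation variant, which runs on plain 3-channel RGB, reaches a macro-AUC of 0.773.

Insight: The novelty is using a Monte Carlo-inspired hemoglobin prior either as auxiliary input channels or as a distillation target, so that a physical model strengthens the classifier's sensitivity to vascular features. The method yields an interpretable heatmap and requires no hardware changes, which suits resource-constrained capsule-endoscopy settings.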

Abstract: Background. RGB-trained capsule-endoscopy classifiers underperform on small-vessel vascular findings by conflating hemoglobin contrast with bile and illumination falloff. Here we test whether a Monte Carlo-inspired analytic model that estimates hemoglobin from the RGB signal can improve such a classifier. Methods. On Kvasir-Capsule (47,238 frames, video-level 70/15/15 split, 11 evaluable classes) we evaluate two software-only configurations against RGB-only EfficientNet-B0 across 6 seeds: (i) a prior P_blood = sigma(alpha * (H_norm - 0.5)) * Phi(r) fused as 2 zero-init auxiliary channels; (ii) a distillation head training a 3-channel RGB backbone to predict P_blood. Significance: paired DeLong, McNemar, bootstrap CIs with Bonferroni correction. Results. Across 6 seeds (n=6,423), the analytic prior provides a small but direction-consistent macro-AUC improvement: RGB-only 0.760 +/- 0.027, input-fusion 0.783 +/- 0.024 (paired Delta = +0.023, sign-positive on 5/6 seeds), distillation 0.773 +/- 0.028. The largest robust per-class lift is on Lymphangiectasia, where AUC rises from RGB 0.238 +/- 0.057 to input-fusion 0.337 +/- 0.019, sign-consistent across all 6 seeds. On rare focal-vascular classes (Angiectasia, Blood - fresh) the prior’s per-seed effects are bimodal: seed=42 reaches Angiectasia AUC 0.528 -> 0.916, but the cross-seed mean is 0.646 -> 0.608 with sigma_PI = 0.23, so the seed=42 result is reported as a high-variance per-seed exemplar. Conclusion. A Monte Carlo-inspired analytic prior provides a small, direction-consistent macro-AUC improvement on Kvasir-Capsule across 6 seeds with the largest robust per-class lift on Lymphangiectasia; the distillation variant runs on plain 3-channel RGB and yields a free interpretability heatmap.
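
The prior formula quoted in the abstract, P_blood = sigma(alpha * (H_norm - 0.5)) * Phi(r), can be evaluated per pixel as in the minimal sketch below. It assumes that sigma is the logistic function and that H_norm is a normalized hemoglobin estimate in [0, 1]. The abstract does not define Phi(r) or give a value for alpha, so here Phi(r) is simply passed in as a number and the default alpha is an illustrative placeholder; the function names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def p_blood(h_norm: float, phi_r: float, alpha: float = 10.0) -> float:
    """Evaluate P_blood = sigma(alpha * (H_norm - 0.5)) * Phi(r) for one pixel.

    h_norm: normalized hemoglobin estimate, assumed to lie in [0, 1].
    phi_r:  value of the weighting term Phi(r) at this pixel. The abstract
            does not define Phi, so it is treated as a given input here.
    alpha:  sigmoid steepness; 10.0 is an illustrative value, not the paper's.
    """
    return sigmoid(alpha * (h_norm - 0.5)) * phi_r


# Illustrative inputs only (not data from the paper).
for h in (0.2, 0.5, 0.8):
    print(h, round(p_blood(h, phi_r=1.0), 4))
```

With Phi(r) held at 1, the prior is 0.5 at H_norm = 0.5 and increases monotonically with H_norm; Phi(r) then scales that response.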


[113] SAGE3D: Soft-guided attention and graph excitation for 3D point cloud corner detection cs.CV

Batuhan Arda Bekar, Can Sarı, Hüseyin Can Gülkan, Barış Özcan

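TL;DR: SAGE3D is a hybrid Transformer-based model for corner detection in airborne LiDAR point clouds. Its hierarchical encoder-decoder progressively downsamples the point cloud through Set Abstraction layers and recovers per-point predictions via Feature Propagation. Its two core contributions are Soft-Guided Attention, which injects ground-truth corner labels as a log-prior during training to improve precision, and an Excitatory Graph Neural Network, which uses positive-only message passing at strategic resolution levels to reinforce high-confidence corner predictions and improve recall.

Details

Motivation: Accurate and robust corner detection in airborne LiDAR point clouds is difficult, and conventional approaches may dilute corner signals across scales.

Result: The abstract reports no quantitative results or benchmarks; the two modules are designed to jointly improve precision and recall.

Insight: The contributions are: 1) Soft-Guided Attention, which injects the supervision signal into the attention mechanism as a prior to guide training; and 2) an Excitatory Graph Neural Network, which amplifies corner signals across the multi-scale hierarchy through positive-only message passing and learned boosting. Combining a supervised prior with graph-structured message passing in a hierarchical design offers a reusable hybrid architecture for detecting sparse targets in point clouds.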

Abstract: We present SAGE3D, a hybrid Transformer-based model for corner detection in airborne LiDAR point clouds. We propose a multi-stage solution built on a hierarchical encoder-decoder architecture that progressively downsamples point clouds through Set Abstraction layers and recovers per-point predictions via Feature Propagation. We introduce two innovations: Soft-Guided Attention, which injects ground-truth corner labels as a log-prior into attention logits during training to improve precision; and an Excitatory Graph Neural Network positioned at strategic resolutions in the hierarchy, which employs positive-only message passing where high-confidence corners reinforce predictions through learned boosting, optimizing for recall. The hierarchical design enables multi-scale feature extraction while our guided attention and excitatory modules ensure corner signals are amplified rather than diluted across scales.
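
The idea of injecting labels as a log-prior into attention logits can be illustrated with a small sketch. This is not the authors' implementation: the mapping from labels to prior values (1.0 for corner points, a hypothetical `prior_floor` for all others) and the function names are assumptions made for illustration only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def soft_guided_weights(logits, corner_labels, prior_floor=0.1):
    """Attention weights after adding a label-derived log-prior to the logits.

    Assumed prior: 1.0 for points labeled as corners, prior_floor otherwise.
    Adding log(prior) to a logit multiplies the unnormalized attention
    weight by the prior, so non-corner keys are softly down-weighted.
    """
    biased = [
        logit + math.log(1.0 if is_corner else prior_floor)
        for logit, is_corner in zip(logits, corner_labels)
    ]
    return softmax(biased)


# Three keys with equal raw logits; only the first is a labeled corner.
print(soft_guided_weights([0.0, 0.0, 0.0], [True, False, False]))
```

Because the prior is soft, non-corner keys keep a non-zero weight, and removing the prior at inference time (when labels are unavailable) leaves ordinary softmax attention.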


[114] CoralLite: μCT Reconstruction of Coral Colonies from Individual Corallites cs.CV

Jess Jones, Leonardo Bertini, Kenneth Johnson, Erica Hendy, Tilo Burghardt

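TL;DR: This paper presents CoralLite, an annotated dataset and deep learning baseline for reconstructing individual corallites from μCT scans of coral skeletons. Segmentation uses a hybrid V-Trans-UNet architecture with pre-training and topology-aware fine-tuning, and the work shows for the first time that visual machine learning can effectively support full 3D individual corallite modelling from μCT scans.

Details

Motivation: Studying coral growth requires tracking the skeletal structure deposited around each polyp in order to understand the rate and timing of polyp division and its consequences for colony skeletal growth, yet existing methods struggle to reconstruct 3D models of individual corallites from μCT scans automatically and accurately.

Result: On unseen slices of the same colony, the model reaches 0.94 topological accuracy with a mean Dice score of 0.77; on a different, biologically unrelated specimen, the mean Dice score is 0.63. This is the first demonstration that visual machine learning can effectively support full 3D individual corallite modelling from μCT scans of coral skeletons alone.

Insight: The contributions are the first annotated dataset and deep learning reconstruction baseline for μCT scans of coral skeletons, and a hybrid V-Trans-UNet architecture that combines pre-training with topology-aware fine-tuning and is suited to segmenting large tiled μCT virtual slabs. The method gives coral biology a tool for automatically analyzing corallite-level structure, and its combination of weak supervision with topological constraints may transfer to related 3D biological segmentation tasks.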

Abstract: The life history of an individual coral is archived within the accreting skeleton of the colony. While reef-forming coral colonies (e.g. massive Porites sp.) may live for hundreds of years and deposit calcareous structures many metres in height and width, their living tissue is a thin outer surface layer comprised of asexually-dividing polyps that only survive a few years. To understand the rate and timing of polyp division and the consequences for colony skeletal growth, scientists need to track the skeletal corallite deposited around each polyp. Here we propose CoralLite, an annotated μCT scan dataset of entire calcareous skeletons and an associated, first corallite deep learning reconstruction baseline. CoralLite combines fully quantified volumetric segmentations with cross-slice linking for visualisations of 3D models for each corallite up to colony scale. For segmentation, we propose and evaluate in detail a hybrid V-Trans-UNet architecture applicable to segmenting tiled μCT virtual slabs of Porites sp. colonies. The model is pre-trained on weakly annotated data and topology-aware fine-tuned using fully annotated slice sections with 8k+ manual corallite region annotations. On unseen slices of the same colony, the resulting model reaches 0.94 topological accuracy at mean Dice scores of 0.77 on the same colony and projection axis, and 0.63 mean Dice scores on a different, biologically unrelated specimen. Whilst our experiments are limited in scale and context, our results show for the first time that visual machine learning can effectively support full 3D individual corallite modelling from μCT scans of coral skeletons alone. For reproducibility and as a baseline for future research we publish our full dataset of 697 μCT slices, 37 partial or full slice annotations, and all network weights and source code with this paper.
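
For readers unfamiliar with the Dice scores reported above, the following minimal sketch computes the standard Dice coefficient, 2|A ∩ B| / (|A| + |B|), between two binary masks. It illustrates the metric only; how the paper averages Dice over corallites or slices is not specified in the abstract, and the function name and the empty-mask convention are assumptions.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two flat binary masks.

    pred, target: equal-length sequences of 0/1 (or False/True) values.
    Returns 1.0 when both masks are empty (a common convention, assumed here).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size == 0:
        return 1.0
    return 2.0 * intersection / size


# Toy masks: two predicted foreground pixels, one of which is correct.
print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
```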


[115] DriveCtrl: Conditioned Sim-to-Real Driving Video Generation cs.CV

Haonan Zhao, Yiting Wang, Jingkun Chen, Valentina Donzella, Thomas Bashford-Rogers

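TL;DR: This paper proposes DriveCtrl, a depth-conditioned sim-to-real driving video generation framework. It combines a pretrained video foundation model with a structure-aware adapter to generate temporally consistent, visually realistic driving videos that preserve the layout and motion patterns of the source simulation, narrowing the domain gap between simulated and real driving video.

Details

Motivation: Large-scale labelled driving video is essential for training autonomous driving systems, but simulated data suffers from a domain gap with the real world. Existing video generation methods fail to preserve scene structure, object dynamics, temporal consistency, and visual realism at the same time, which limits the usefulness of simulated data for downstream deployment.

Result: Experiments show that DriveCtrl consistently outperforms the base model and competing methods in realism, temporal quality, and perception task performance, substantially narrowing the sim-to-real gap for driving video generation. The paper also proposes DVRS, a driving-domain-specific metric for assessing the realism of generated videos.

Insight: The contributions are a structure-aware adapter that enables depth-guided generation while preserving the layout and motion patterns of the source simulation; a scalable data generation pipeline that supports three conditioning signals (depth, reference-dataset style, and text prompts) while preserving frame-level annotations; and the Driving Video Realism Score (DVRS), a domain-specific evaluation metric.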

Abstract: Large-scale labelled driving video data is essential for training autonomous driving systems. Although simulation offers scalable and fully annotated data, the domain gap between synthetic and real-world driving videos significantly limits its utility for downstream deployment. Existing video generation methods are not well-suited for this task, as they fail to simultaneously preserve scene structure, object dynamics, temporal consistency, and visual realism, all of which are critical for maintaining annotation validity in generated data. In this paper, we present DriveCtrl, a depth-conditioned controllable sim-to-real video generation framework for realistic driving video synthesis. Built upon a pretrained video foundation model, DriveCtrl introduces a structure-aware adapter that enables depth-guided generation while preserving the scene layout and motion patterns of the source simulation, producing temporally coherent driving videos that remain aligned with the original simulated sequences. We further introduce a scalable data generation pipeline that transforms simulator videos into realistic driving footage matching the visual style of a target real-world dataset. The pipeline supports three conditioning signals: structural depth, reference-dataset style, and text prompts, while preserving frame-level annotations for downstream perception tasks. To better assess this task, we propose a driving-domain-specific knowledge-informed evaluation metric called Driving Video Realism Score (DVRS) that assesses the realism of generated videos. Experiments demonstrate that DriveCtrl consistently outperforms the base model and competing alternatives in realism, temporal quality, and perception task performance, substantially narrowing the sim-to-real gap for driving video generation.


[116] Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation cs.CV

Min Zhao, Hongzhou Zhu, Kaiwen Zheng, Zihan Zhou, Bokai Yan

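TL;DR: This paper proposes Causal Forcing++, a scalable few-step autoregressive diffusion distillation method for real-time interactive video generation. It initializes the few-step autoregressive student with causal consistency distillation (causal CD), which removes the initialization bottleneck that existing methods face in the frame-wise 1-2 step sampling setting and yields lower latency and more efficient training.

Details

Motivation: Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive diffusion distillation methods work well for chunk-wise 4-step generation, but in the more aggressive frame-wise 1-2 step setting the initialization of the student becomes the key bottleneck: existing strategies are target-misaligned, incapable of few-step generation, or too costly to scale.

Result: In the frame-wise 2-step setting, Causal Forcing++ surpasses the state-of-the-art 4-step chunk-wise Causal Forcing by 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward, while reducing first-frame latency by 50% and Stage 2 training cost by about 4x.

Insight: The core contribution is causal consistency distillation (causal CD). It learns the same autoregressive-conditional flow map as causal ODE distillation but obtains supervision from a single online teacher ODE step between adjacent timesteps, which avoids precomputing and storing full PF-ODE trajectories. This makes initialization more efficient and easier to optimize, and gives real-time video generation a scalable few-step autoregressive distillation pipeline.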

Abstract: Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods have achieved strong results in the chunk-wise 4-step regime by distilling bidirectional base models into few-step AR students, but they remain limited by coarse response granularity and non-negligible sampling latency. In this paper, we study a more aggressive setting: frame-wise autoregression with only 1–2 sampling steps. In this regime, we identify the initialization of a few-step AR student as the key bottleneck: existing strategies are either target-misaligned, incapable of few-step generation, or too costly to scale. We propose Causal Forcing++, a principled and scalable pipeline that uses causal consistency distillation (causal CD) for few-step AR initialization. The core idea is that causal CD learns the same AR-conditional flow map as causal ODE distillation, but obtains supervision from a single online teacher ODE step between adjacent timesteps, avoiding the need to precompute and store full PF-ODE trajectories. This makes the initialization both more efficient and easier to optimize. The resulting pipeline, Causal Forcing++, surpasses the SOTA 4-step chunk-wise Causal Forcing under the frame-wise 2-step setting by 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward, while reducing first-frame latency by 50% and Stage 2 training cost by $\sim 4\times$. We further extend the pipeline to action-conditioned world model generation in the spirit of Genie3. Project Page: https://github.com/thu-ml/Causal-Forcing and https://github.com/shengshu-ai/minWM.


[117] Does Synthetic Layered Design Data Benefit Layered Design Decomposition? cs.CV

Kam Man Wu, Haolin Yang, Qingyu Chen, Yihu Tang, Jingye Chen

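TL;DR: This study asks whether purely synthetic layered data can improve graphic design decomposition. Building on CLD, a state-of-the-art layer decomposition framework, the authors construct a synthetic dataset, SynLayers, and use vision-language models to generate textual supervision and bounding boxes. They find that training on purely synthetic data can outperform non-scalable alternatives such as the PrismLayersPro dataset; that performance improves with training data scale but saturates at around 50K samples; and that synthetic data allows balanced control over layer-count distributions, avoiding the layer-count imbalance of real-world data.

Details

Motivation: Generated images are flattened and hard to edit flexibly. Existing methods rely either on scarce proprietary layered assets or on partially synthetic data built from limited structural priors, and both face scalability challenges.

Result: Using the CLD baseline framework, training on purely synthetic data outperforms the widely used PrismLayersPro dataset; performance improves with data scale, with gains saturating at around 50K samples; and synthetic data allows layer-count distributions to be balanced.

Insight: The work rests on the assumption that graphic design decomposition does not require precise modeling of inter-layer dependencies, because design elements are usually modular and semantically separable. Taking a data-centric approach that automates supervision with synthetic data and VLMs, it demonstrates that synthetic data is a viable scalable alternative and identifies both the saturation of performance with data scale and the benefit of balanced layer-count distributions.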

Abstract: Recent advances in image generation have made it easy to produce high-quality images. However, these outputs are inherently flattened, entangling foreground elements, background, and text within a fixed canvas. As a result, flexible post-generation editing remains challenging, revealing a clear last-mile gap toward practical usability. Existing approaches either rely on scarce proprietary layered assets or construct partially synthetic data from limited structural priors, and both strategies face fundamental challenges in scalability. In this work, we investigate whether pure synthetic layered data can improve graphic design decomposition. We make the assumption that, in graphic design, effective decomposition does not require modeling inter-layer dependencies as precisely as in natural-image composition, since design elements are often intentionally arranged as modular and semantically separable components. Concretely, we conduct a data-centric study based on the CLD baseline, a state-of-the-art layer decomposition framework. Building on this baseline, we construct our own synthetic dataset, SynLayers, generate textual supervision using vision language models, and automate inference inputs with VLM-predicted bounding boxes. Our study reveals three key findings: (1) even training with purely synthetic data can outperform non-scalable alternatives such as the widely used PrismLayersPro dataset, demonstrating its viability as a scalable and effective substitute; (2) performance consistently improves with increased training data scale, while gains begin to saturate at around 50K samples; and (3) synthetic data enables balanced control over layer-count distributions, avoiding the layer-count imbalance commonly observed in real-world datasets. We hope this data-centric study encourages broader adoption of synthetic data as a practical foundation for layered design editing systems.


[118] Evidential Reasoning Advances Interpretable Real-World Disease Screening cs.CV | cs.AI | cs.LG

Chenyu Lian, Hong-Yu Zhou, Jing Qin

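TL;DR: This paper proposes EviScreen, an evidential reasoning framework for disease screening that uses region-level evidence from historical cases to improve both the interpretability and the performance of medical image screening. It retrieves regional evidence from dual knowledge banks to provide retrospective explanations, predicts with an evidence-aware reasoning module that combines the current case with historical evidence, and improves localization interpretability with abnormality maps derived from contrastive retrieval.

Details

Motivation: Current screening models for medical images generally suffer from limited interpretability and suboptimal performance, and lack effective mechanisms to reference historical cases or provide transparent reasoning pathways.

Result: The method achieves superior performance on the authors' carefully established real-world disease screening benchmarks, with notably higher specificity at clinical-level recall.

Insight: The novelty is a reasoning framework built on region-level evidence from historical cases, which provides retrospective and localization interpretability without relying on post-hoc saliency maps, and which improves screening performance through evidence-aware reasoning over both current and historical information.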

Abstract: Disease screening is critical for early detection and timely intervention in clinical practice. However, most current screening models for medical images suffer from limited interpretability and suboptimal performance. They often lack effective mechanisms to reference historical cases or provide transparent reasoning pathways. To address these challenges, we introduce EviScreen, an evidential reasoning framework for disease screening that leverages region-level evidence from historical cases. The proposed EviScreen offers retrospection interpretability through regional evidence retrieved from dual knowledge banks. Using this evidential mechanism, the subsequent evidence-aware reasoning module makes predictions using both the current case and evidence from historical cases, thereby enhancing disease screening performance. Furthermore, rather than relying on post-hoc saliency maps, EviScreen enhances localization interpretability by leveraging abnormality maps derived from contrastive retrieval. Our method achieves superior performance on our carefully established benchmarks for real-world disease screening, yielding notably higher specificity at clinical-level recall. Code is publicly available at https://github.com/DopamineLcy/EviScreen.


[119] SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer cs.CV

Haoyi Zhu, Haozhe Liu, Yuyang Zhao, Tian Ye, Junsong Chen

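TL;DR: SANA-WM is an efficient 2.6B-parameter open-source world model designed for one-minute video generation, synthesizing 720p minute-scale videos with precise camera control. Its core designs are hybrid linear attention, dual-branch camera control, a two-stage generation pipeline, and a robust annotation pipeline, and it is notably efficient in data, training compute, and inference hardware.

Details

Motivation: Existing world models generating minute-scale videos suffer from low computational efficiency, high memory consumption, and insufficiently precise camera control. The goal is efficient, high-quality, and controllable long-video synthesis.

Result: On the authors' one-minute world-model benchmark, SANA-WM follows actions more accurately than prior open-source baselines and matches the visual quality of large-scale industrial baselines such as LingBot-World and HY-WorldPlay at 36x higher throughput. Its distilled variant can run on a single RTX 5090 GPU with NVFP4 quantization and denoise a 60s 720p clip in 34s.

Insight: The contributions are: 1) Hybrid Linear Attention, which combines frame-wise Gated DeltaNet with softmax attention for memory-efficient long-context modeling; 2) Dual-Branch Camera Control, which ensures precise 6-DoF trajectory adherence; 3) a Two-Stage Generation Pipeline, in which a long-video refiner improves sequence quality and consistency; and 4) a Robust Annotation Pipeline, which extracts accurate metric-scale 6-DoF camera poses from public videos to produce high-quality, spatiotemporally consistent action labels. Together these designs greatly improve training and inference efficiency while maintaining output quality.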

Abstract: We introduce SANA-WM, an efficient 2.6B-parameter open-source world model natively trained for one-minute generation, synthesizing high-fidelity, 720p, minute-scale videos with precise camera control. SANA-WM achieves visual quality comparable to large-scale industrial baselines such as LingBot-World and HY-WorldPlay, while significantly improving efficiency. Four core designs drive our architecture: (1) Hybrid Linear Attention combines frame-wise Gated DeltaNet (GDN) with softmax attention for memory-efficient long-context modeling. (2) Dual-Branch Camera Control ensures precise 6-DoF trajectory adherence. (3) Two-Stage Generation Pipeline applies a long-video refiner to stage-1 outputs, improving quality and consistency across sequences. (4) Robust Annotation Pipeline extracts accurate metric-scale 6-DoF camera poses from public videos to yield high-quality, spatiotemporally consistent action labels. Driven by these designs, SANA-WM demonstrates remarkable efficiency across data, training compute, and inference hardware: it uses only $\sim$213K public video clips with metric-scale pose supervision, completes training in 15 days on 64 H100s, and generates each 60s clip on a single GPU; its distilled variant can be deployed on a single RTX 5090 with NVFP4 quantization to denoise a 60s 720p clip in 34s. On our one-minute world-model benchmark, SANA-WM demonstrates stronger action-following accuracy than prior open-source baselines and achieves comparable visual quality at $36\times$ higher throughput for scalable world modeling.


[120] From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing cs.CV

Anirudh Sundara Rajan, Krishna Kumar Singh, Yong Jae Lee

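TL;DR: This paper proposes an experiential framework for open-ended image editing. A planner generates structured atomic decompositions, an orchestrator selects the tools and regions for each step, and a vision-language judge provides outcome-based rewards, so that the editing process is optimized toward more coherent and reliable multi-step edits.

Details

Motivation: Existing image editing models handle abstract, multi-step instructions poorly, while agent-based methods rely on handcrafted pipelines or teacher imitation, which limits flexibility and learning. A framework is needed that tightly couples planning with execution.

Result: On open-ended image editing tasks, the method produces more coherent and reliable edits than single-step or rule-based multi-step baselines; the abstract does not mention specific benchmarks or quantitative comparisons.

Insight: The novelty is the tight coupling of planning with reward-driven execution: the orchestrator is optimized through experiential learning, and successful trajectories are fed back to refine the planner, improving the adaptability and quality of multi-step editing.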

Abstract: Modern image editing models produce realistic results but struggle with abstract, multi-step instructions (e.g., "make this advertisement more vegetarian-friendly"). Prior agent-based methods decompose such tasks but rely on handcrafted pipelines or teacher imitation, limiting flexibility and decoupling learning from actual editing outcomes. We propose an experiential framework for long-horizon image editing, where a planner generates structured atomic decompositions and an orchestrator selects tools and regions to execute each step. A vision-language judge provides outcome-based rewards for instruction adherence and visual quality. The orchestrator is trained to maximize these rewards, and successful trajectories are used to refine the planner. By tightly coupling planning with reward-driven execution, our approach yields more coherent and reliable edits than single-step or rule-based multi-step baselines.


[121] Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video cs.CV

Yifan Wang, Tong He

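TL;DR: This paper proposes Warp-as-History, a method for camera-controlled video generation that needs no training, architectural modification, or test-time optimization. It converts camera-induced warps into camera-warped pseudo-history with target-frame positional alignment and visible-token selection, and feeds that pseudo-history through the visual-history pathway of a frozen video generation model, which then follows camera trajectories zero-shot. Lightweight offline LoRA finetuning on a single camera-annotated video further improves performance and generalizes to unseen videos.

Details

Motivation: Existing camera-controlled video generation methods usually require post-training on large-scale camera-annotated videos, or rely on test-time optimization and extra denoising-time guidance, all of which are costly. The paper explores the zero-shot capability of frozen video generation models, using a simple interface to follow camera trajectories without training or optimization.

Result: Extensive experiments on diverse datasets confirm the method's effectiveness. Camera trajectories are followed without any training or optimization, and single-video LoRA finetuning further improves camera adherence, visual quality, and motion dynamics without test-time optimization or target-video adaptation.

Insight: The novelty is converting camera warps directly into pseudo-history input that the model can already process, combined with positional-encoding alignment and token selection, which reveals an inherent zero-shot camera-control capability in frozen video generation models. The approach avoids complex model modifications and large-scale training data.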

Abstract: Camera-controlled video generation has made substantial progress, enabling generated videos to follow prescribed viewpoint trajectories. However, existing methods usually learn camera-specific conditioning through camera encoders, control branches, or attention and positional-encoding modifications, which often require post-training on large-scale camera-annotated videos. Training-free alternatives avoid such post-training, but often shift the cost to test-time optimization or extra denoising-time guidance. We propose Warp-as-History, a simple interface that turns camera-induced warps into camera-warped pseudo-history with target-frame positional alignment and visible-token selection. Given a target camera trajectory, we construct camera-warped pseudo-history from past observations and feed it through the model’s visual-history pathway. Crucially, we align its positional encoding with the target frames being denoised and remove warped-history tokens without valid source observations. Without any training, architectural modification, or test-time optimization, this interface reveals a non-trivial zero-shot capability of a frozen video generation model to follow camera trajectories. Moreover, lightweight offline LoRA finetuning on only one camera-annotated video further improves this capability and generalizes to unseen videos, improving camera adherence, visual quality, and motion dynamics without test-time optimization or target-video adaptation. Extensive experiments on diverse datasets confirm the effectiveness of our method.


[122] Quantitative Video World Model Evaluation for Geometric-Consistency cs.CV | cs.AI

Jiaxin Wu, Yihao Pi, Yinling Zhang, Yuheng Li, Xueyan Zou

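TL;DR: This paper proposes PDI-Bench (Perspective Distortion Index), a framework for quantitatively evaluating the geometric consistency of generated videos. It obtains object-centric observations via segmentation and point tracking, lifts them to 3D world-space coordinates via monocular reconstruction, and computes a set of projective-geometry residuals covering three dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. A companion PDI-Dataset supports systematic evaluation.

Details

Motivation: Generative video models are increasingly used as implicit world models, but it remains hard to evaluate whether they produce physically plausible 3D structure and motion. Existing evaluation relies mainly on human judgment or learned graders, which are subjective and weakly diagnostic for geometric failures.

Result: Across several state-of-the-art video generators, PDI reveals consistent geometry-specific failure modes that common perceptual metrics miss, providing a diagnostic signal for progress toward physically grounded video generation and physical world models.

Insight: The contribution is an objective, quantitative framework for evaluating geometric consistency: it defines computable residual metrics from projective-geometry principles and builds a dataset designed to stress those geometric constraints. This addresses the weaknesses of subjective or learned evaluation and offers a new tool for assessing the physical plausibility of generative models.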

Abstract: Generative video models are increasingly studied as implicit world models, yet evaluating whether they produce physically plausible 3D structure and motion remains challenging. Most existing video evaluation pipelines rely heavily on human judgment or learned graders, which can be subjective and weakly diagnostic for geometric failures. We introduce PDI-Bench (Perspective Distortion Index), a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, we obtain object-centric observations via segmentation and point tracking (e.g., SAM 2, MegaSaM, and CoTracker3), lift them to 3D world-space coordinates via monocular reconstruction, and compute a set of projective-geometry residuals capturing three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. To support systematic evaluation, we build PDI-Dataset, covering diverse scenarios designed to stress these geometric constraints. Across state-of-the-art video generators, PDI reveals consistent geometry-specific failure modes that are not captured by common perceptual metrics, and provides a diagnostic signal for progress toward physically grounded video generation and physical world models. Our code and dataset can be found at https://pdi-bench.github.io/.
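
To make the notion of a scale-depth alignment residual concrete, the sketch below uses the pinhole-camera relation that a rigid object's projected size is inversely proportional to its depth, so size times depth should stay constant across frames. This is an illustrative residual under that assumption, not PDI-Bench's actual definition (which the abstract does not give); the function name and inputs are hypothetical.

```python
import math


def scale_depth_residual(sizes_px, depths):
    """Relative spread of (projected size * depth) across frames.

    Under a pinhole camera, a rigid object of fixed physical size satisfies
    size_px * depth = constant. The coefficient of variation of that product
    is therefore 0 for a geometrically consistent track and grows as
    apparent size and depth drift out of agreement.
    """
    if len(sizes_px) != len(depths) or not sizes_px:
        raise ValueError("need two non-empty sequences of equal length")
    products = [s * z for s, z in zip(sizes_px, depths)]
    mean = sum(products) / len(products)
    variance = sum((p - mean) ** 2 for p in products) / len(products)
    return math.sqrt(variance) / mean


# Consistent track: size halves each time depth doubles.
print(scale_depth_residual([100.0, 50.0, 25.0], [1.0, 2.0, 4.0]))
# Inconsistent track: object recedes but its apparent size does not change.
print(scale_depth_residual([100.0, 100.0, 100.0], [1.0, 2.0, 4.0]))
```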


[123] RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO cs.CV

Yanzuo Lu, Ronglai Zuo, Jiankang Deng

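TL;DR: This paper proposes RAVEN, a framework for real-time autoregressive video extrapolation. It reorganizes training data by repacking each self rollout into a sequence that interleaves clean historical endpoints with noisy denoising states, which aligns the history distribution seen in training with that seen at inference. The authors also propose CM-GRPO, which reformulates a consistency sampling step as a conditional Gaussian transition and applies online reinforcement learning directly to it. Experiments show that RAVEN surpasses existing causal video distillation baselines on quality, semantic, and dynamic degree evaluations, with further gains when combined with CM-GRPO.

Details

Motivation: Causal autoregressive video diffusion models support real-time streaming generation by extrapolating future chunks from previously generated content, but a persistent gap between the history distributions encountered during training and those arising at inference limits generation quality over long horizons.

Result: Experiments show that RAVEN surpasses recent causal video distillation baselines on quality, semantic, and dynamic degree evaluations, and that CM-GRPO brings further gains when combined with RAVEN.

Insight: The contributions are: 1) the RAVEN framework, which repacks self-rollout sequences so that training attention matches inference-time extrapolation and downstream chunk losses can supervise the history representations; and 2) CM-GRPO, which reformulates a consistency sampling step as a conditional Gaussian transition and applies online reinforcement learning to it directly, avoiding the Euler-Maruyama auxiliary process used in prior flow-model RL formulations.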

Abstract: Causal autoregressive video diffusion models support real-time streaming generation by extrapolating future chunks from previously generated content. Distilling such generators from high-fidelity bidirectional teachers yields competitive few-step models, yet a persistent gap between the history distributions encountered during training and those arising at inference constrains generation quality over long horizons. We introduce the Real-time Autoregressive Video Extrapolation Network (RAVEN), a training-time test framework that repacks each self rollout into an interleaved sequence of clean historical endpoints and noisy denoising states. This formulation aligns training attention with inference-time extrapolation and allows downstream chunk losses to supervise the history representations on which future predictions depend. We further propose Consistency-model Group Relative Policy Optimization (CM-GRPO), which reformulates a consistency sampling step as a conditional Gaussian transition and applies online Reinforcement Learning (RL) directly to this kernel, avoiding the Euler-Maruyama auxiliary process adopted in prior flow-model RL formulations. Experiments demonstrate that RAVEN surpasses recent causal video distillation baselines across quality, semantic, and dynamic degree evaluations, and that CM-GRPO provides further gains when combined with RAVEN.
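
Two generic ingredients behind a GRPO-style update on a Gaussian transition can be sketched as follows: group-relative advantages (each sample's reward standardized within its group) and the closed-form log-likelihood of a Gaussian, which a policy-gradient method needs for each sampled transition. This is a sketch of the general technique under those assumptions, not the paper's CM-GRPO objective; the function names and the scalar (one-dimensional) Gaussian are simplifications for illustration.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def gaussian_log_prob(x, mean, std):
    """Log-density of a scalar Gaussian N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


# Illustrative rewards for a group of three rollouts of the same prompt.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print([round(a, 4) for a in advantages])
print(round(gaussian_log_prob(0.0, 0.0, 1.0), 4))
```

Samples that score above their group mean receive positive advantages, so a policy-gradient step raises the log-probability of the transitions that produced them; no separate value network is needed.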


[124] VGGT-$Ω$ cs.CV

Jianyuan Wang, Minghao Chen, Shangzhan Zhang, Nikita Karaev, Johannes Schönberger

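TL;DR: This paper presents VGGT-Ω, an improved feed-forward reconstruction model that, through architectural optimization, efficient data annotation, and self-supervised learning, substantially improves reconstruction accuracy, efficiency, and capabilities for static and dynamic scenes, and shows potential for spatial understanding tasks.

Details

Motivation: Existing feed-forward reconstruction models fall short in reconstruction quality, efficiency, and dynamic-scene handling as model and data scale grow.

Result: Across multiple benchmarks, VGGT-Ω achieves strong results on static and dynamic scene reconstruction; for example, it improves over the previous best camera estimation accuracy on Sintel by 77%, setting a new state of the art.

Insight: Innovations include a simplified architecture (a single dense prediction head with multi-task supervision), registers and register attention that aggregate scene information compactly and restrict inter-frame information exchange, and an efficient training protocol that combines large-scale supervised data with unlabeled video. Together these designs sharply reduce GPU memory use and improve scalability.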

Abstract: Recent feed-forward reconstruction models, such as VGGT, have proven competitive with traditional optimization-based reconstructors while also providing geometry-aware features useful for other tasks. Here, we show that the quality of these models scales predictably with model and data size. We do so by introducing VGGT-$Ω$, which substantially improves reconstruction accuracy, efficiency, and capabilities for both static and dynamic scenes. To enable training this model at an unprecedented scale, we introduce architectural changes that improve training efficiency, a high-quality data annotation pipeline that supports dynamic scenes, and a self-supervised learning protocol. We simplify VGGT’s architecture by using a single dense prediction head with multi-task supervision and removing the expensive high-resolution convolutional layers. We also use registers to aggregate scene information into a compact representation and introduce register attention, which restricts inter-frame information exchange to these registers, in part replacing global attention. In this way, during training, VGGT-$Ω$ uses only about 30% of the GPU memory of its predecessor, allowing us to train with 15x more supervised data than prior work and to leverage vast amounts of unlabeled video data. VGGT-$Ω$ achieves strong results for reconstruction of static and dynamic scenes across multiple benchmarks, for example, improving over the previous best camera estimation accuracy on Sintel by 77%. We also show that the learned registers can improve vision-language-action models and support alignment with language, suggesting that reconstruction can be a powerful and scalable proxy task for spatial understanding. Project Page: http://vggt-omega.github.io/


[125] RefDecoder: Enhancing Visual Generation with Conditional Video Decoding cs.CV | cs.LG

Xiang Fan, Yuheng Wang, Bohan Fang, Zhongzheng Ren, Ranjay Krishna

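TL;DR: This paper proposes RefDecoder, a reference-image-conditioned video VAE decoder that uses reference attention to inject high-fidelity reference image signal directly into decoding, addressing the loss of detail and inconsistency caused by unconditional decoders in existing latent diffusion models.

Details

Motivation: Existing video generation models (such as latent diffusion models) typically use heavily conditioned denoising networks while leaving the decoder unconditional; this architectural asymmetry causes significant loss of detail and inconsistency relative to the input image.

Result: On the Inter4K, WebVid, and Large Motion reconstruction benchmarks, PSNR improves by up to +2.1 dB over unconditional baselines. On the VBench I2V benchmark, subject consistency, background consistency, and overall quality scores all improve, and the decoder can be swapped into existing video generation systems without fine-tuning.

Insight: The core idea is that the decoder needs the same degree of conditioning as the denoising network to preserve structural integrity. A lightweight image encoder maps the reference frame into detail-rich high-dimensional tokens, which are co-processed with the denoised video latent tokens at each decoder up-sampling stage. The approach generalizes to other visual generation tasks such as style transfer and video editing refinement.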

Abstract: Video generation powers a vast array of downstream applications. However, while the de facto standard, i.e., latent diffusion models, typically employ heavily conditioned denoising networks, their decoders often remain unconditional. We observe that this architectural asymmetry leads to significant loss of detail and inconsistency relative to the input image. To address this, we argue that the decoder requires equal conditioning to preserve structural integrity. We introduce RefDecoder, a reference-conditioned video VAE decoder that injects high-fidelity reference image signal directly into the decoding process via reference attention. Specifically, a lightweight image encoder maps the reference frame into detail-rich high-dimensional tokens, which are co-processed with the denoised video latent tokens at each decoder up-sampling stage. We demonstrate consistent improvements across several distinct decoder backbones (e.g., Wan 2.1 and VideoVAE+), achieving up to +2.1dB PSNR over the unconditional baselines on the Inter4K, WebVid, and Large Motion reconstruction benchmarks. Notably, RefDecoder can be directly swapped into existing video generation systems without additional fine-tuning, and we report across-the-board improvements in subject consistency, background consistency, and overall quality scores on the VBench I2V benchmark. Beyond I2V, RefDecoder generalizes well to a wide range of visual generation tasks such as style transfer and video editing refinement.


[126] EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation cs.CV | cs.AI

Ruozhen He, Meng Wei, Ziyan Yang, Vicente Ordonez

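TL;DR: This paper presents EntityBench, a new benchmark for evaluating entity consistency in long-range multi-shot video generation, comprising 140 episodes (2,491 shots) and a three-pillar evaluation suite covering intra-shot quality, prompt alignment, and cross-shot consistency. The authors also propose EntityMem, a memory-augmented generation system that stores verified per-entity visual references before generation to improve consistency. Experiments show that cross-shot entity consistency in existing methods degrades sharply as recurrence distance grows, while EntityMem achieves the best character fidelity.

Details

Motivation: Multi-shot video generation struggles to keep entities such as characters, objects, and locations consistent over long sequences, and existing evaluations have limited entity coverage and simple consistency metrics, which makes standardized comparison difficult.

Result: On EntityBench, cross-shot entity consistency in existing methods degrades sharply with recurrence distance. EntityMem attains the highest character fidelity (Cohen's d = +2.33) and the best entity presence among the methods evaluated.

Insight: The contribution is a standardized long-range multi-shot video generation benchmark derived from real narrative media, with explicit entity schedules and difficulty tiers, paired with a fine-grained evaluation suite. On the method side, storing verified per-entity visual references in a persistent memory bank before generation is an effective way to improve long-range entity consistency.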

Abstract: Multi-shot video generation extends single-shot generation to coherent visual narratives, yet maintaining consistent characters, objects, and locations across shots remains a challenge over long sequences. Existing evaluations typically use independently generated prompt sets with limited entity coverage and simple consistency metrics, making standardized comparison difficult. We introduce EntityBench, a benchmark of 140 episodes (2,491 shots) derived from real narrative media, with explicit per-shot entity schedules tracking characters, objects, and locations simultaneously across easy / medium / hard tiers of up to 50 shots, 13 cross-shot characters, 8 cross-shot locations, 22 cross-shot objects, and recurrence gaps spanning up to 48 shots. It is paired with a three-pillar evaluation suite that disentangles intra-shot quality, prompt-following alignment, and cross-shot consistency, with a fidelity gate that admits only accurate entity appearances into cross-shot scoring. As a baseline, we propose EntityMem, a memory-augmented generation system that stores verified per-entity visual references in a persistent memory bank before generation begins. Experiments show that cross-shot entity consistency degrades sharply with recurrence distance in existing methods, and that explicit per-entity memory yields the highest character fidelity (Cohen’s d = +2.33) and presence among methods evaluated. Code and data are available at https://github.com/Catherine-R-He/EntityBench/.


cs.GR

[127] FaceParts: Segmentation and Editing of Gaussian Splatting cs.GR | cs.AI | cs.CV

Tymoteusz Zapała, Julia Farganus, Dominik Galus, Mikołaj Czachorowski, Piotr Syga

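TL;DR: This paper proposes FaceParts, a framework for unsupervised segmentation and editing of Gaussian Splatting avatars. It operates directly in the 3D Gaussian domain and uses feature disentanglement, density-based clustering, and FLAME-anchored part transfer to decompose avatars into semantically coherent facial parts (such as beards and eyebrows), supporting precise editing and cross-avatar part swapping.

Details

Motivation: Most existing facial editing methods rely on generative models in the 2D image domain, while 3D editing usually requires labor-intensive manual work. The paper aims to perform unsupervised semantic segmentation and editing directly in the 3D Gaussian representation to simplify manipulation of 3D avatars.

Result: Experiments on the NeRSemble dataset (11 subjects) show that the method robustly isolates features such as beards, eyebrows, and eyes. Quantitative evaluation shows that transferred parts adapt to pose and expression while maintaining identity consistency (ID = 0.943), with low Average Expression Distance (AED = 0.021) and low Average Pose Distance (APD = 0.004).

Insight: The novelty lies in performing unsupervised semantic segmentation directly in the 3D Gaussian domain and combining feature disentanglement with FLAME anchoring for part transfer, which avoids the complexity of conventional 2D or mesh-assisted methods and offers a new route to controllable editing of 3D Gaussian representations.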

Abstract: Facial editing is an important task with applications in entertainment, virtual reality, and digital avatars. Most existing approaches rely on generative models in the 2D image domain, while in 3D the task is typically performed through labor-intensive manual editing. We propose FaceParts, a framework for unsupervised segmentation and editing of Gaussian Splatting avatars. Unlike existing 2D or mesh-assisted methods, our approach operates directly in the Gaussian domain, decomposing avatars into semantically coherent facial parts without supervision. The method integrates feature disentanglement, density-based clustering, and FLAME-anchored part transfer, enabling precise editing and cross-avatar part swapping. Experiments on the NeRSemble dataset with 11 subjects demonstrate robust isolation of features such as beards, eyebrows, eyes and mustaches. Quantitative evaluation confirms that transferred segments adapt to pose and expression, while maintaining identity consistency (ID = 0.943), low Average Expression Distance (AED = 0.021) and low Average Pose Distance (APD = 0.004).


[128] MoZoo: Unleashing Video Diffusion power in animal fur and muscle simulation cs.GR | cs.CV | cs.LG

Dongxia Liu, Jie Ma, Xiaochen Yang, Jiancheng Zhang, Bin Xia

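TL;DR: MoZoo is a generative dynamics solver that synthesizes high-fidelity animal videos from coarse meshes, targeting the high cost of simulating animal muscle and fur dynamics in traditional film production. It uses Role-Aware RoPE and Asymmetric Decoupled Attention for motion alignment and information decoupling, is trained on the MoZoo-Data dataset, and demonstrates superior fur simulation on the MoZooBench benchmark.

Details

Motivation: Simulating animal muscle and fur dynamics in traditional film production is labor-intensive and computationally expensive, and the potential of generative diffusion models for high-quality animal simulation remains largely untapped. The paper therefore aims to develop an efficient method for generating high-fidelity animal videos.

Result: On MoZooBench (120 mesh-video pairs), MoZoo achieves high-fidelity fur simulation with superior temporal and structural consistency, and performs well across diverse animal skeletons and layouts.

Insight: Innovations include Role-Aware RoPE, which uses role-based index remapping to synchronize motion alignment while decoupling reference information; Asymmetric Decoupled Attention, which partitions the latent sequence to enforce unidirectional information flow and improve efficiency; and the MoZoo-Data synthetic-to-real pipeline, which addresses the scarcity of high-quality training data.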

Abstract: The creation of cinematic-quality animal effects necessitates the precise modeling of muscle and fur dynamics, a process that remains both labor-intensive and computationally expensive within traditional production workflows. While generative diffusion models have shown promise in diverse artistic workflows, their capacity for high-fidelity animal simulation remains largely unexploited. We present MoZoo, a generative dynamics solver that bypasses conventional refinement to synthesize high-fidelity animal videos from coarse meshes under multimodal guidance. We propose Role-Aware RoPE (RAR-RoPE) which employs role-based index remapping to synchronize motion alignment while decoupling reference information via fixed temporal offsets. Complementing this, Asymmetric Decoupled Attention partitions the latent sequence to enforce a unidirectional information flow, effectively preventing feature interference and improving computational efficiency. To address the scarcity of high-quality training data, we introduce MoZoo-Data, a synthetic-to-real pipeline that leverages a rendering engine and an inverse mapping approach to construct a large-scale dataset of paired sequences. Furthermore, we establish MoZooBench, a comprehensive benchmark with 120 mesh-video pairs. Experimental results demonstrate that MoZoo achieves high-fidelity fur simulation across diverse animal skeletons and layouts, preserving superior temporal and structural consistency.


[129] Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Generation cs.GR | cs.CV | eess.IV

Diandian Gu, Jing Lin, Gaohong Liu, Jiahang Liu, Su Ma

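TL;DR: Seed3D 2.0 is an advanced 3D content generation system built on Seed3D 1.0, with substantial improvements in generation fidelity, simulation-ready capabilities, and application coverage. It uses a coarse-to-fine two-stage geometry pipeline with a locality-aware VAE and introduces a unified PBR model for texture and material generation. It also provides a simulation-ready model suite comprising scene layout planning, part-aware decomposition, and training-free articulation generation, supporting coherent scene construction and part-level physical interaction.

Details

Motivation: Existing 3D content generation systems fall short in generation fidelity, material precision, and support for complex scene construction and physical interaction; the goal is to generate high-fidelity 3D content that can be used directly in physics simulation.

Result: In a large-scale human preference study against five recent commercial models, Seed3D 2.0 achieves consistent win rates of 69.0% to 89.9% on textured 3D asset generation.

Insight: The main innovations are: (1) a decoupled coarse-to-fine two-stage geometry pipeline with a locality-aware VAE; (2) a unified PBR model for texture and material generation, combined with Mixture-of-Experts scaling and VLM-based semantic conditioning; (3) new simulation-oriented capabilities, namely scene layout planning, part-aware decomposition, and training-free articulation generation.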

Abstract: We present Seed3D 2.0, an advanced 3D content generation system built on Seed3D 1.0, with substantial improvements across generation fidelity, simulation-ready capabilities, and application coverage. For geometry, a coarse-to-fine two-stage pipeline decouples global structure learning from high-frequency detail recovery, while a locality-aware VAE achieves higher spatial compression and more efficient decoding. For texture and material generation, we replace the cascaded pipeline of Seed3D 1.0 with a unified PBR model that directly generates multi-view albedo and metallic-roughness maps, enhanced by Mixture-of-Experts scaling and VLM-based semantic conditioning for improved material precision and visual fidelity. Beyond single-object generation, Seed3D 2.0 introduces a simulation-ready model suite comprising scene layout planning, part-aware decomposition, and training-free articulation generation, enabling coherent scene construction and part-level physical interaction across physics and graphics engines. A large-scale human preference study against five recent commercial models shows that Seed3D 2.0 achieves consistent win rates of 69.0% to 89.9% in textured 3D asset generation. Seed3D 2.0 is available on https://exp.volcengine.com/ark/vision?_vtm_=0.0.c70961.d701978.0&mode=vision&modelId=doubao-seed3d-2-0-260328&tab=Gen3D


[130] UMo: Unified Sparse Motion Modeling for Real-Time Co-Speech Avatars cs.GR | cs.CV | cs.SD

Xiaoyu Zhan, Xinyu Fu, Chenghao Yang, Xiaohong Zhang, Dongjie Fu

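TL;DR: This paper proposes UMo, a unified sparse motion modeling architecture for real-time co-speech avatars. It processes text, audio, and motion tokens in a unified framework and combines a spatially sparse Mixture-of-Experts with a temporally sparse keyframe design to generate high-fidelity facial expression and gesture animation efficiently in real time.

Details

Motivation: Existing speech-driven animation methods are either limited to single-modality alignment and cannot fully exploit large-scale human motion data, or are constrained by the representation ability and throughput of multimodal models, making it hard to achieve high-quality motion generation and real-time performance together.

Result: Extensive quantitative and qualitative evaluations show that UMo achieves better output quality under low-latency, real-time constraints, offering a practical solution for high-fidelity real-time co-speech avatars.

Insight: Innovations include the unified sparse motion modeling architecture, the spatially sparse Mixture-of-Experts and temporally sparse keyframe design, and a multi-stage training strategy that enhances acoustic diversity and semantic consistency, preserving fine-grained speech-motion alignment even under strict latency constraints.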

Abstract: Speech-driven gestures and facial animations are fundamental to expressive digital avatars in games, virtual production, and interactive media. However, existing methods are either limited to a single modality for audio motion alignment, failing to fully utilize the potential of massive human motion data, or are constrained by the representation ability and throughput of multimodal models, which makes it difficult to achieve high-quality motion generation or real-time performance. We present UMo, a unified sparse motion modeling architecture for real-time co-speech avatars, which processes text, audio, and motion tokens within a unified formulation. Leveraging a spatially sparse Mixture-of-Experts framework and a temporally sparse, keyframe-centric design, UMo efficiently performs real-time dense reconstruction, enabling temporally coherent and high-fidelity animation generation for both facial expressions and gestures. Furthermore, we implement a multi-stage training strategy with targeted audio augmentation to enhance acoustic diversity and semantic consistency. Consequently, UMo preserves fine-grained speech-motion alignment even under strict latency constraints. Extensive quantitative and qualitative evaluations show that UMo achieves better output quality under low latency and real-time performance constraints, offering a practical solution for high-fidelity real-time co-speech avatars.


cs.LG

[131] Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models cs.LG | cs.CL

Saba Ahmadi, Prasanna Parthasarathi, Yufei Cui

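TL;DR: This paper proposes TraFL (Trajectory Flow baLancing), a trajectory-balance post-training method for diffusion language models. It targets trajectory locking in existing reward-maximizing post-training, where updates over-concentrate on a few denoising paths and reduce coverage under repeated sampling. TraFL trains the policy toward a reward-tilted target distribution anchored to a frozen reference model. On mathematical reasoning and code generation benchmarks, it is the only evaluated post-training method that improves over the base model in every benchmark-length setting, with gains that persist as the sampling budget increases.

Details

Motivation: Diffusion language models are a promising alternative to autoregressive models, but their post-training methods mostly reuse reward-maximizing objectives and suffer from a central failure mode, trajectory locking: sampled reward-driven updates over-concentrate probability mass on a narrow set of denoising paths, reducing coverage of alternative correct solutions under repeated sampling.

Result: On mathematical reasoning and code generation benchmarks, TraFL is the only evaluated post-training method that improves over the base model in every benchmark-length setting, with gains that persist as the sampling budget increases. It stays above the base model on Minerva Math and is the strongest method on every LiveCodeBench difficulty split.

Insight: The novelty is the trajectory-balance objective TraFL, which uses a diffusion-compatible sequence-level surrogate and a learned prompt-dependent normalization to train the policy toward a reward-tilted target distribution anchored to a frozen reference model, mitigating trajectory locking and improving sampling diversity and performance. Viewed objectively, the method adapts the trajectory-balance idea from reinforcement learning to diffusion language model post-training, offering a new optimization perspective.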

Abstract: Diffusion language models are a promising alternative to autoregressive models, yet post-training methods for them largely adapt reward-maximizing objectives. We identify a central failure mode in this setting we call trajectory locking: sampled reward-driven updates over-concentrate probability mass onto a narrow set of denoising paths, reducing coverage of alternative correct solutions under repeated sampling. To address this, we propose TraFL (Trajectory Flow baLancing), a trajectory-balance objective that trains the policy toward a reward-tilted target distribution anchored to a frozen reference model. We make this practical for diffusion language models with a diffusion-compatible sequence-level surrogate and a learned prompt-dependent normalization. Across mathematical reasoning and code generation benchmarks, TraFL is the only evaluated post-training method that improves over the base model in every benchmark-length setting, with gains that persist as the sampling budget increases. The improvements transfer to held-out evaluations: TraFL stays above the base model on Minerva Math and is the strongest method on every LiveCodeBench difficulty split.
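
For orientation, one common way to write a trajectory-balance objective for a reward-tilted target anchored to a reference model is sketched below. The temperature $\beta$, the reward $r(x, y)$, and the squared-log-ratio form are assumptions borrowed from the general trajectory-balance literature, not taken from the paper; $Z_{\phi}(x)$ stands in for the learned prompt-dependent normalization the abstract mentions, and for a diffusion language model $\log \pi_{\theta}(y \mid x)$ would presumably be replaced by the sequence-level surrogate.

```latex
% Target: reference model tilted by reward (assumed form)
p^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big(r(x, y)/\beta\big)

% Trajectory-balance residual, squared (assumed form)
\mathcal{L}_{\mathrm{TB}}(\theta, \phi)
  = \Big(\log Z_{\phi}(x) + \log \pi_{\theta}(y \mid x)
         - \log \pi_{\mathrm{ref}}(y \mid x) - r(x, y)/\beta\Big)^{2}
```

The loss is zero exactly when $Z_{\phi}(x)\,\pi_{\theta}(y \mid x) = \pi_{\mathrm{ref}}(y \mid x)\exp(r(x, y)/\beta)$, so the policy matches the tilted distribution rather than collapsing onto its highest-reward mode.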


[132] Mini-JEPA Foundation Model Fleet Enables Agentic Hydrologic Intelligence cs.LG | cs.CL

Mashrekur Rahman

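TL;DR: The paper proposes Mini-JEPAs, a fleet of lightweight specialized foundation models for hydrologic intelligence. Several small, sensor-specific Joint Embedding Predictive Architecture models, consulted by a routing agent, provide precise embeddings for specific hydrologic questions, compensating for the weaknesses of generalist planetary-scale models on specialized signals while reducing compute cost.

Details

Motivation: Generalist geospatial foundation models (such as Google AlphaEarth) may compromise accuracy on specialized hydrologic signals, and they are often inaccessible, expensive, and dependent on large-scale compute.

Result: Five 22M-parameter Mini-JEPA models, pretrained on five sensor inputs including Sentinel-2 optical and Sentinel-1 SAR, achieve high cross-validated R² on their matched variables (for example, 0.97 for elevation, 0.97 for temperature, and 0.81 for precipitation). Combined with AlphaEarth, they add predictive value for soil moisture, aridity, and precipitation (ΔR² up to 0.031). The router LLM achieves a perfect hit rate on a curated question set, and dual retrieval outperforms AlphaEarth alone on physics-matched questions (Cohen's d = 1.10, p = 0.031).

Insight: The novelty is a fleet of small, sensor-specialized models working with a routing agent, balancing high accuracy on specialized tasks against compute efficiency. Viewed objectively, this modular, operationalizable design offers a lightweight alternative for domain-specific foundation models and makes hydrologic intelligence more accessible and practical.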

Abstract: Geospatial foundation models compress multispectral observations into dense embeddings increasingly used in natural-language environmental reasoning systems. A single planetary-scale model, e.g. Google AlphaEarth, handles broad characterization well but may compromise on specialized hydrologic signals. Such generalist models are also often inaccessible, expensive, and require large-scale compute. We propose Mini-JEPAs: a fleet of small sensor-specialized Joint Embedding Predictive Architecture (JEPA) foundation models consulted by a routing agent for specialized questions. We pretrained five 22M-parameter Mini-JEPAs sharing an identical Vision Transformer backbone, JEPA recipe, and 64-d output space, using Sentinel-2 optical, Sentinel-1 SAR, MODIS thermal, multi-temporal Sentinel-2 phenology, and a topography-soil stack. Each Mini-JEPA reconstructs the variable matched to its sensor, with cross-validated $R^2$ reaching 0.97 for elevation, 0.97 for temperature, and 0.81 for precipitation. The five manifolds differ in geometric structure, with global participation ratios from 8.9 to 20.2 and local intrinsic dimensionalities from 2.3 to 9.0. Joint topography-soil and phenology models add predictive value beyond AlphaEarth alone for soil moisture, aridity, and precipitation ($ΔR^2$ up to 0.031). A router LLM reads per-modality references and selects appropriate sensors with a perfect hit rate over a curated question set. In paired LLM-as-Judge evaluation, dual retrieval over AlphaEarth and the routed fleet outperforms AlphaEarth alone on physics-matched questions (Cohen’s $d = 1.10$, $p = 0.031$). Locally-trained Mini-JEPAs can be operationalized for hydrologic intelligence with modest compute.


[133] PreFT: Prefill-only finetuning for efficient inference cs.LG | cs.AI | cs.CL | eess.SY

Andrew Lanpouthakoun, Aryaman Arora, Zhengxuan Wu, Dhruv Pai, Ben Keigwin

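TL;DR: This paper proposes PreFT (Prefill-only Finetuning), a parameter-efficient finetuning method that addresses the drop in inference throughput when serving many user adapters at once in personalized LLM serving. The adapter is applied only during prefill and discarded during decoding, which substantially raises serving throughput without a significant effect on model performance.

Details

Motivation: When serving user-personalized models at scale, existing parameter-efficient finetuning methods severely hurt inference throughput because of the mismatch between the compute patterns of prefill and decode. The paper aims to optimize serving throughput rather than parameter efficiency alone, enabling efficient multi-adapter serving.

Result: When serving 512 adapters on Llama 3.1 70B, PreFT achieves 1.9x the throughput of traditional PEFT methods. On supervised finetuning tasks, PreFT's evaluation loss is higher than PEFT's but can be compensated by increasing rank with almost no throughput cost; on reinforcement learning tasks, PreFT approaches parity with standard PEFT.

Insight: The core idea is to restrict adapter computation to the prefill stage, achieving a better accuracy-throughput tradeoff. This offers a new, more efficient finetuning paradigm for large-scale personalized LLM serving, and an implementation is provided on the vLLM inference engine.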

Abstract: Large language models can now be personalised efficiently at scale using parameter efficient finetuning methods (PEFTs), but serving user-specific PEFTs harms throughput, even with specialised kernels and memory management techniques. This is because, theoretically and empirically, a mismatch exists between prefill (processing a large number of tokens at once) and decode (generating a single token autoregressively): the latter has far lower throughput when serving multiple adapters. Rather than optimising performance relative to parameter count, for efficient multi-adapter serving, we instead ought to optimise performance relative to serving throughput. We therefore propose PreFT (Prefill-only Finetuning), wherein we only apply the adapter to prefill tokens and discard it afterwards. PreFT significantly increases throughput with minimal effect on performance. We develop and release an efficient implementation of two prefill-only PEFTs, LoRA and ReFT, on the vLLM inference engine. We first show that serving multi-user PreFTs is more efficient than traditional PEFTs ($1.9\times$ the throughput when serving $512$ adapters on Llama 3.1 70B). Then, we compare the performance of prefill-only vs. all-token adapters on a variety of supervised finetuning and reinforcement learning tasks with LMs at varying scales. On SFT, we observe that the evaluation loss of PreFTs is higher than PEFTs, but can be compensated by increasing rank with nearly no reduction in throughput. On RL, we consistently find that PreFTs approach parity with standard PEFTs. Together, this work validates prefill-only adaptation of LLMs as a more favourable accuracy-throughput tradeoff than existing PEFTs for personalised serving.
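
The prefill-only idea can be illustrated with a minimal NumPy sketch of a single LoRA-augmented linear layer. This is not the paper's vLLM implementation: the function `lora_linear`, the `use_adapter` flag, and all shapes are hypothetical, chosen only to show the adapter being applied to prompt tokens and skipped for decode-step tokens.

```python
import numpy as np

def lora_linear(x, W, A, B, use_adapter):
    """Base projection plus an optional low-rank LoRA update."""
    y = x @ W
    if use_adapter:
        # Low-rank correction: project down with A, back up with B.
        y = y + (x @ A) @ B
    return y

rng = np.random.default_rng(0)
d, r = 8, 2                       # hidden size and LoRA rank (illustrative)
W = rng.normal(size=(d, d))       # frozen base weight
A = rng.normal(size=(d, r))       # per-user adapter factors
B = rng.normal(size=(r, d))

prompt = rng.normal(size=(5, d))  # five prefill tokens, processed at once
step = rng.normal(size=(1, d))    # one token generated during decode

prefill_out = lora_linear(prompt, W, A, B, use_adapter=True)   # adapter on
decode_out = lora_linear(step, W, A, B, use_adapter=False)     # adapter dropped
```

In a full transformer the adapted prefill pass would also shape the cached keys and values, so decode steps could still be influenced by the adapter indirectly even though they run only the base weights; that reading is an inference from the abstract rather than something it states.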


[134] Diagnosing Training Inference Mismatch in LLM Reinforcement Learning cs.LG | cs.AI | cs.CL

Tianle Zhong, Neiwen Ling, Yifan Pi, Zijun Wei, Tianshu Yu

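TL;DR: This paper studies training-inference mismatch (TIM) in LLM reinforcement learning. By building VeXact, a zero-mismatch diagnostic setting, it shows that even small numerical differences can cause training collapse, and it proposes mitigations.

Details

Motivation: In LLM reinforcement learning, implementation differences cause the training and inference stages to assign inconsistent probabilities; this issue is often entangled with off-policy drift and undermines training stability.

Result: In the diagnostic setting, small token-level numerical disagreements are shown to independently cause training collapse. TIM changes the effective optimization problem, and the proposed remedies can mitigate its impact.

Insight: The novelty is isolating TIM from confounding factors and quantifying its harm, showing that TIM is a systems-level perturbation rather than benign noise and should be treated as a first-order factor in stability analysis.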

Abstract: Modern LLM RL systems separate rollout generation from policy optimization. These two stages are expected to produce token probabilities that match exactly. However, implementation differences can make them assign different values to the same sequence under the same model weights, inducing Training-Inference Mismatch (TIM). TIM is difficult to inspect because it is entangled with off-policy drift and common stabilization mechanisms. In this work, we isolate TIM in a zero-mismatch diagnostic setting (VeXact), and show that small token-level numerical disagreements can independently cause training collapse. We further show that TIM changes the effective optimization problem, and identify a set of remedies that could mitigate TIM. Our results suggest that TIM is not benign numerical noise, but a systems-level perturbation that should be treated as a first-order factor in analyzing LLM RL stability.
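
A simple way to see what such a mismatch looks like numerically is to compare the log-probabilities that the rollout engine and the trainer assign to the same sampled tokens. The sketch below is a generic diagnostic, not the VeXact setup; the function name `mismatch_report` and the example values are hypothetical.

```python
import numpy as np

def mismatch_report(logp_rollout, logp_train):
    """Summarize per-token disagreement between two log-prob vectors for the same tokens."""
    logp_rollout = np.asarray(logp_rollout, dtype=np.float64)
    logp_train = np.asarray(logp_train, dtype=np.float64)
    diff = logp_train - logp_rollout
    ratio = np.exp(diff)  # per-token ratio pi_train / pi_rollout
    return {
        "max_abs_logp_diff": float(np.max(np.abs(diff))),
        "mean_ratio": float(np.mean(ratio)),
        # Per-token differences add up, so the sequence-level ratio can drift
        # further from 1 than any single token suggests.
        "seq_ratio": float(np.exp(np.sum(diff))),
    }

# Same tokens, same weights, slightly different numerics (made-up values).
report = mismatch_report([-1.20, -0.35, -2.10], [-1.21, -0.35, -2.08])
same = mismatch_report([-1.20, -0.35], [-1.20, -0.35])
```

With exactly matching stages every ratio is 1; any deviation means the policy being optimized is not quite the policy that generated the rollouts.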


[135] MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification cs.LG | cs.AI | cs.CL | cs.CR

Weisen Jiang, Shuhao Chen, Sinno Jialin Pan

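TL;DR: MetaMoE is a privacy-preserving framework for Mixture-of-Experts unification. It uses public proxy data as a surrogate for private data and, through diversity-aware proxy selection, unifies independently trained domain experts into a single MoE model.

Details

Motivation: In practice data are distributed and cannot be shared because of privacy constraints, which makes unified training of Mixture-of-Experts models challenging.

Result: On computer vision and natural language processing benchmarks, MetaMoE consistently outperforms recent privacy-preserving MoE unification methods.

Insight: The novelty is diversity-aware proxy selection, which picks client-domain-relevant and diverse samples from public data to approximate private data distributions and supervise router learning, together with a context-aware router that improves expert selection across heterogeneous inputs.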

Abstract: Mixture-of-Experts (MoE) models scale capacity by combining specialized experts, but most existing approaches assume centralized access to training data. In practice, data are distributed across clients and cannot be shared due to privacy constraints, making unified MoE training challenging. We propose MetaMoE, a privacy-preserving framework that unifies independently trained, domain-specialized experts into a single MoE using public proxy data as surrogates for inaccessible private data. Central to MetaMoE is diversity-aware proxy selection, which selects client-domain-relevant and diverse samples from public data to effectively approximate private data distributions and supervise router learning. These proxies are further used to align expert training, improving expert coordination at unification time, while a context-aware router enhances expert selection across heterogeneous inputs. Experiments on computer vision and natural language processing benchmarks demonstrate that MetaMoE consistently outperforms recent privacy-preserving MoE unification methods. Code is available at https://github.com/ws-jiang/MetaMoE.


[136] Minimal-Intervention KV Retention: A Design-Space Study and a Diversity-Penalty Survivor cs.LG | cs.CL

Libo Sun, Po-wei Harn, Peixiong He, Xiao Qin

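TL;DR: This paper studies the design space of KV-cache compression at very small budgets (64 and 128) for long-form mathematical reasoning (MATH-500). It evaluates seven mechanisms from five design families and finds that none passes. The authors then propose α, a one-function modification to the TriAttention retention scorer that replaces argmax-top-k selection with greedy facility-location-inspired selection under a V-space redundancy penalty. A pre-registered protocol tunes the penalty weight λ on a development split and confirms on a disjoint held-out split. With λ = 0.5, α significantly outperforms the baseline on two of the four (model, budget) cells (Qwen-7B at budget 128 and Llama-8B at budget 64), no cell is significantly worse, and the pre-registered Branch A conclusion triggers. The core finding is asymmetric: in this small-budget regime a minimal scoring modification beats heavier structural redesigns, and the combined matched-memory, sympy-graded, held-out confirmation protocol is the evidence standard that makes the asymmetry visible.

Details

Motivation: At very small KV-cache budgets on long-form mathematical reasoning, existing complex KV-cache compression mechanisms (spanning cache representation, head-wise routing, compression cadence, decoding behavior, and within-budget scoring) may perform poorly; the paper explores simpler, more effective modifications.

Result: On MATH-500 with distilled reasoning models (the Qwen-7B and Llama-8B variants of DeepSeek-R1-Distill) at budgets b ∈ {64, 128}, all seven evaluated mechanisms were rejected. The proposed α (λ = 0.5) clears Bonferroni-corrected significance on two of the four (model, budget) cells (Qwen b=128 and Llama b=64), and no cell is significantly worse.

Insight: The claimed novelty is α, a minimal modification to the existing TriAttention retention scorer (greedy facility-location-style selection with a V-space redundancy penalty), shown to be effective at very small budgets. Viewed objectively, the core insight is that in this regime a lightweight, diversity-penalized scoring change can be more effective than heavier redesigns of cache representation, routing, and other structure. The paper also stresses and practices a strict evaluation protocol (matched memory, automatic sympy grading, pre-registration, held-out confirmation) as the evidence standard for reliable conclusions, and this methodology is itself worth borrowing.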

Abstract: KV-cache compression at small budgets is a crowded design space spanning cache representation, head-wise routing, compression cadence, decoding behavior, and within-budget scoring. We study seven mechanisms across these five families under matched mean cache on long-form mathematical reasoning (MATH-500\cite{hendrycks2021math}) with two distilled-reasoning models (Qwen-7B and Llama-8B variants of DeepSeek-R1-Distill\cite{deepseek2025r1}) at budgets $b \in \{64, 128\}$. All seven were rejected. We then propose $α$, a one-function modification to the TriAttention\cite{mao2026triattention} retention scorer that replaces argmax-top-$k$ with greedy facility-location-inspired selection under a V-space redundancy penalty controlled by a single weight $λ$. A pre-registered protocol tunes $λ$ on a frozen development split and confirms on a disjoint held-out split; with $λ = 0.5$, $α$ clears Bonferroni on two of the four (model, budget) cells (Qwen $b{=}128$ and Llama $b{=}64$), no cell is significantly negative, and the pre-registered Branch A triggers. The finding is asymmetric: a minimal scoring modification beat heavier structural redesigns in this regime, and the combined matched-memory, sympy-graded, held-out confirmation protocol is the evidence standard that made the asymmetry visible.
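
To make the contrast with plain top-$k$ concrete, here is a minimal sketch of greedy selection with a redundancy penalty in value space. It is an illustration under stated assumptions, not the paper's scorer: the penalty form (score minus $λ$ times the maximum cosine similarity to any already selected value vector, clipped at zero) and the function name `greedy_diverse_select` are hypothetical, and the input scores stand in for whatever the TriAttention retention scorer would produce.

```python
import numpy as np

def greedy_diverse_select(scores, values, k, lam):
    """Pick k indices greedily, penalizing similarity to already selected value vectors."""
    scores = np.asarray(scores, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    norms = np.maximum(np.linalg.norm(values, axis=1, keepdims=True), 1e-12)
    unit = values / norms
    sim = unit @ unit.T                  # pairwise cosine similarity in V-space
    max_sim = np.zeros(len(scores))      # redundancy vs. selected set (floor of 0)
    available = np.ones(len(scores), dtype=bool)
    selected = []
    for _ in range(min(k, len(scores))):
        # Penalized gain; already chosen entries are masked out.
        gain = np.where(available, scores - lam * max_sim, -np.inf)
        i = int(np.argmax(gain))
        selected.append(i)
        available[i] = False
        max_sim = np.maximum(max_sim, sim[i])
    return selected

scores = [1.0, 0.9, 0.5]
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # entries 0 and 1 are duplicates

top_k = greedy_diverse_select(scores, values, k=2, lam=0.0)    # [0, 1]
diverse = greedy_diverse_select(scores, values, k=2, lam=0.5)  # [0, 2]
```

With `lam=0.0` the procedure reduces to ordinary top-$k$ and keeps both duplicates; with `lam=0.5` the second slot goes to the lower-scored but non-redundant entry, which is the behavior a diversity penalty is meant to buy at a tiny budget.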


[137] Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy cs.LG | cs.AI | cs.CL

Langzhou He, Junyou Zhu, Yue Zhou, Zhengyao Gu, Junhua Liu

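TL;DR: This paper proposes ActFocus, a simple token reweighting method that addresses the action bottleneck in agentic reinforcement learning. It downweights gradients on reasoning tokens and increases the weights on high-uncertainty action tokens, and it consistently outperforms PPO and GRPO across multiple environments and model sizes.

Details

Motivation: Existing policy-gradient methods such as PPO and GRPO assign credit uniformly to all tokens in a trajectory when training large language models, which misallocates training signal. From an energy-based modeling perspective, the paper finds that training signal concentrates on action tokens, which make up only a small fraction of the trajectory; this is the Action Bottleneck.

Result: Across four environments and different model sizes, ActFocus consistently outperforms PPO and GRPO, with final-step gains of up to 65.2 and 63.7 percentage points respectively, at no additional runtime or memory cost.

Insight: The novelty is revealing the Action Bottleneck from an energy-based modeling perspective and proposing ActFocus, a simple token reweighting method that combines gradient reweighting with an energy-based redistribution mechanism to allocate training signal more effectively and improve reinforcement learning efficiency.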

Abstract: Agentic reinforcement learning trains large language models using multi-turn trajectories that interleave long reasoning traces with short environment-facing actions. Common policy-gradient methods, such as PPO and GRPO, treat each token in a trajectory equally, leading to uniform credit assignment. In this paper, we critically demonstrate that such uniform credit assignment largely misallocates token-level training signals. From an energy-based modeling perspective, we show that token-level training signals, quantified by their correlations with reward variance of different rollouts sampled from a given prompt, concentrate sharply on action tokens rather than reasoning tokens, even though action tokens account for only a small fraction of the trajectory. We refer to this phenomenon as the Action Bottleneck. Motivated by this observation, we propose an embarrassingly simple token reweighting approach, ActFocus, that downweights gradients on reasoning tokens, along with an additional energy-based redistribution mechanism that further increases the weights on action tokens with higher uncertainty. Across four environments and different model sizes, ActFocus consistently outperforms PPO and GRPO, yielding final-step gains of up to 65.2 and 63.7 percentage points, respectively, without any additional runtime or memory cost.
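
The reweighting idea can be sketched as a per-token weight vector that multiplies the usual policy-gradient loss terms. The sketch below is an assumption-laden illustration, not the paper's formula: `reasoning_weight`, `boost`, and the use of a generic nonnegative uncertainty score in place of the paper's token-level energy are all hypothetical choices.

```python
import numpy as np

def token_weights(is_action, uncertainty, reasoning_weight=0.1, boost=1.0):
    """Per-token loss weights: small on reasoning tokens, larger on uncertain action tokens."""
    is_action = np.asarray(is_action, dtype=bool)
    uncertainty = np.asarray(uncertainty, dtype=np.float64)
    weights = np.full(is_action.shape, reasoning_weight, dtype=np.float64)
    if is_action.any():
        u = uncertainty[is_action]
        total = u.sum()
        # Share of the extra weight each action token receives; uniform if all are zero.
        share = u / total if total > 0 else np.full(u.shape, 1.0 / u.size)
        # Every action token starts at 1; an extra budget of boost * n_action
        # is redistributed in proportion to uncertainty.
        weights[is_action] = 1.0 + boost * u.size * share
    return weights

is_action = [0, 0, 0, 1, 1]                 # three reasoning tokens, two action tokens
uncertainty = [0.2, 0.1, 0.3, 0.5, 1.5]     # made-up per-token uncertainty scores
w = token_weights(is_action, uncertainty)
```

These weights would multiply each token's advantage-weighted log-probability term before averaging, so most of the gradient mass lands on the short action spans instead of being spread across long reasoning traces.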


[138] Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance cs.LG | cs.AI | cs.CL

Kai Yan, Alexander G. Schwing, Yu-Xiong Wang

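TL;DR: This paper proposes FEST, a few-shot demonstration-guided algorithm for Reinforcement Learning with Verifiable Rewards (RLVR), targeting the low sample efficiency of standard RLVR on hard problems. It needs only a small number of demonstrations randomly selected from an SFT dataset and combines a supervised signal, an on-policy signal, and decaying weights to prevent overfitting, achieving strong results on several benchmarks.

Details

Motivation: Standard RLVR is sample-inefficient on hard tasks where correct chain-of-thought rollouts are difficult to generate, and existing demonstration-guided methods usually require large amounts of costly supervised finetuning data. The paper aims for a method that improves RLVR with very few demonstrations.

Result: On several benchmarks, FEST outperforms baselines using only 128 randomly selected demonstrations, and even matches the performance of methods that use the full dataset.

Insight: The core finding is that a very small set of randomly selected demonstrations, combined with a carefully designed training scheme (supervised signal, on-policy signal, and decaying weights), is enough to guide RLVR effectively. This suggests a route to data-efficient reinforcement learning and indicates that decaying the weight on the few-shot SFT data is key to preventing overfitting under multi-epoch training.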

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has achieved great success in developing Large Language Models (LLMs) with chain-of-thought rollouts for many tasks such as math and coding. Nevertheless, RLVR struggles with sample efficiency on difficult problems where correct rollouts are hard to generate. Prior works propose to address this issue via demonstration-guided RLVR, i.e., to conduct Supervised FineTuning (SFT) when RL fails; however, SFT often requires a lot of data, which can be expensive to acquire. In this paper, we propose FEST, a FEw-ShoT demonstration-guided RLVR algorithm. It attains compelling results with only 128 demonstrations randomly selected from an SFT dataset. We find that three components are vital for the success: supervised signal, on-policy signal, and decaying weights on the few-shot SFT dataset to prevent overfitting from multiple-epoch training. On several benchmarks, FEST outperforms baselines with orders of magnitude less SFT data, even matching their performance with the full dataset.


[139] Self-Distilled Agentic Reinforcement Learning cs.LG | cs.AI | cs.CL

Zhengxi Lu, Zhiyuan Yao, Zhuowen Han, Zi-Han Wang, Jinyang Wu

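TL;DR: This paper proposes SDAR (Self-Distilled Agentic Reinforcement Learning) to improve RL post-training of LLM agents. It adds a gated auxiliary objective that combines the dense token-level supervision of On-Policy Self-Distillation (OPSD) with the conventional trajectory-level RL reward signal, addressing unstable supervision in multi-turn interaction and negative teacher rejections.

Details

Motivation: The trajectory-level reward signal that conventional RL provides to LLM agents is too coarse for long-horizon interaction. OPSD offers dense token-level guidance, but applying it directly in multi-turn agent settings destabilizes supervision, and negative teacher rejections caused by imperfect skill retrieval or utilization call for asymmetric treatment.

Result: On ALFWorld, WebShop, and Search-QA with the Qwen2.5 and Qwen3 model families, SDAR substantially improves over the GRPO baseline (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL-OPSD baselines across model scales.

Insight: The core idea is to treat OPSD as a sigmoid-gated auxiliary objective rather than an equal partner to RL. The gate maps detached token-level signals to gate values, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections, which yields stable and effective multi-turn agent training. The approach offers a new way to combine dense supervision with sparse rewards.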

Abstract: Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi-turn agents proves problematic: compounding multi-turn instability destabilizes supervision, while skill-conditioned privileged guidance requires asymmetric treatment, because negative teacher rejections may arise from imperfect skill retrieval or utilization. We introduce SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token-level signals into a sigmoid gate, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. Across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA, SDAR substantially improves over GRPO (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL–OPSD baselines across model scales.
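
A minimal sketch of a sigmoid-gated auxiliary loss is given below to illustrate the mechanism. Several pieces are assumptions rather than the paper's definitions: the gap is taken as teacher log-probability minus student log-probability on the sampled tokens, the per-token distillation term is a plain negative student log-likelihood, and `temperature` and `aux_weight` are hypothetical hyperparameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_distill_loss(teacher_logp, student_logp, temperature=1.0):
    """Gate a per-token distillation term by how strongly the teacher endorses each token."""
    teacher_logp = np.asarray(teacher_logp, dtype=np.float64)
    student_logp = np.asarray(student_logp, dtype=np.float64)
    gap = teacher_logp - student_logp     # > 0: teacher endorses the token more than the student
    # In an autograd framework the gate would be computed from detached values,
    # so it scales the gradient without receiving one itself.
    gate = sigmoid(gap / temperature)
    per_token = -student_logp             # assumed distillation term
    return float(np.mean(gate * per_token)), gate

def total_loss(rl_loss, distill_loss, aux_weight=0.1):
    """RL stays the primary objective; distillation enters as a weighted auxiliary term."""
    return rl_loss + aux_weight * distill_loss

teacher = [-0.5, -1.0, -3.0]   # made-up token log-probs under the teacher branch
student = [-1.5, -1.0, -1.0]   # same tokens under the student policy
distill, gate = gated_distill_loss(teacher, student)
loss = total_loss(rl_loss=0.8, distill_loss=distill)
```

The gate exceeds 0.5 where the teacher is more confident than the student and falls toward 0 where the teacher rejects the token, so a rejection that stems from imperfect skill retrieval is softened rather than pushed back hard into the policy.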


[140] FutureSim: Replaying World Events to Evaluate Adaptive Agents cs.LG | cs.AI | cs.CL

Shashwat Goel, Nikhil Chandak, Arvindh Arun, Ameya Prabhu, Steffen Staab

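TL;DR: The paper proposes FutureSim, a benchmark that evaluates how well AI agents adapt in dynamic, open-ended environments by replaying real-world events (news articles arriving and questions resolving) in chronological order. Agents must forecast future events during the simulated period. Frontier agents were tested on January to March 2026 and showed limited accuracy, with the best agent reaching only 25%.

Details

Motivation: AI agents deployed in dynamic, open-ended environments must adapt to new information as it arrives. To measure this capability efficiently for realistic use cases, the authors build grounded simulations that replay real events.

Result: Frontier agents separate clearly on FutureSim: the best agent reaches 25% accuracy, and many have a Brier skill score worse than making no prediction at all. The benchmark offers a realistic setting for research on long-horizon test-time adaptation, search, memory, and reasoning about uncertainty.

Insight: The novelty is a simulation benchmark based on chronological replay of real events for evaluating agent adaptation in long-running open-ended environments. Viewed objectively, the grounded simulation design offers a new way to measure AI progress on long-horizon real-world adaptation.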

Abstract: AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arriving and questions resolving over the simulated period. We evaluate frontier agents in their native harness, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities, with the best agent’s accuracy being 25%, and many having worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim offers a realistic setting to study emerging research directions like long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way to measure AI progress on open-ended adaptation spanning long time-horizons in the real world.
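
For readers unfamiliar with the metric, the following is a minimal sketch of a Brier skill score for binary questions. It assumes a constant 0.5 forecast as the reference that stands in for "making no prediction"; the benchmark's exact reference forecast and question format are not stated in the abstract and may differ.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs, outcomes, reference_prob=0.5):
    """1 - BS / BS_ref; positive beats the reference, negative is worse than it."""
    reference = brier_score([reference_prob] * len(outcomes), outcomes)
    return 1.0 - brier_score(probs, outcomes) / reference


outcomes = [1, 0, 1, 0]
calibrated = brier_skill_score([0.9, 0.1, 0.8, 0.2], outcomes)
overconfident_wrong = brier_skill_score([0.1, 0.9, 0.1, 0.9], outcomes)
```

A confidently wrong forecaster gets a negative skill score, which is how an agent can end up worse than an uninformative baseline even while committing to answers.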


[141] Vision-Based Runtime Monitoring under Varying Specifications using Semantic Latent Representations cs.LG | cs.CV | cs.RO | eess.SY

Bardh Hoxha, Oliver Schön, Hideki Okamoto, Lars Lindemann, Georgios Fainekos

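TL;DR: This paper studies certified runtime monitoring of past-time signal temporal logic (ptSTL) from visual observations under partial observability. The monitor must infer safety-relevant quantities from images and provide finite-sample guarantees while being reusable: once trained and calibrated, it should certify any formula in a target fragment without per-formula retraining. For fragments induced by a finite dictionary of temporal atoms, the authors prove that the semantic basis (the vector of atom robustness scores) is the minimum prediction target within the class of monotone, 1-Lipschitz reusable interfaces: any formula is evaluated by a deterministic decoder derived from the parse tree, and a single conformal calibration pass certifies the entire fragment with no union bound. They also introduce a rolling prediction monitor that predicts only current predicate values and reconstructs temporal history online; it is easier to learn but grows conservative at long horizons. On a pedestrian-crossroad benchmark, the rolling approach gives tighter certified bounds at short horizons, while the semantic-basis monitor is up to 4 times tighter at long horizons. Both monitors are validated on real-world Waymo driving data, where they empirically satisfy the conformal coverage guarantee.

Details

Motivation: Monitor temporal-logic safety specifications at runtime from visual input under partial observability, with a monitor that is reusable (no retraining for each new formula) and comes with statistical certification guarantees.

Result: On the pedestrian-crossroad benchmark, the rolling monitor gives tighter certified bounds at short horizons and the semantic-basis monitor is up to 4 times tighter at long horizons. On real-world Waymo driving data, both monitors empirically satisfy the conformal coverage guarantee.

Insight: Contributions: the semantic basis as the minimum prediction target for reusable monitoring, with a proof that a single conformal calibration pass certifies the whole formula fragment; a rolling prediction monitor that is easier to learn but more conservative at long horizons; and a theoretical framing in terms of monotone, 1-Lipschitz reusable interfaces that underpins the certification.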

Abstract: We study certified runtime monitoring of past-time signal temporal logic (ptSTL) from visual observations under partial observability. The monitor must infer safety-relevant quantities from images and provide finite-sample guarantees, while being reusable: once trained and calibrated, it should certify any formula in a target fragment without per-formula retraining. For fragments induced by a finite dictionary of temporal atoms, we prove that the semantic basis, the vector of atom robustness scores, is the minimum prediction target within the class of monotone, 1-Lipschitz reusable interfaces: any formula is evaluated by a deterministic decoder derived from the parse tree, and a single conformal calibration pass certifies the entire fragment with no union bound. We also introduce a rolling prediction monitor that predicts only current predicate values and reconstructs temporal history online; this is easier to learn but grows conservative at long horizons. On a pedestrian-crossroad benchmark, rolling achieves tighter certified bounds at short horizons while the semantic-basis monitor is up to 4 times tighter at long horizons. We validate the presented monitors on real-world Waymo driving data, where both monitors satisfy the conformal coverage guarantee empirically.
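
The reuse argument can be made concrete with a small sketch. It is an illustration under stated assumptions, not the authors' implementation: formulas are restricted to conjunction and disjunction over named atoms, the formula encoding as nested tuples is hypothetical, and calibration is plain split conformal prediction on the sup-norm error of the predicted atom scores. Because `min` and `max` are 1-Lipschitz in the sup norm, one calibrated radius bounds the decoding error of every formula built from them, which is why no per-formula calibration or union bound is needed in this simplified setting.

```python
import math


def decode(formula, scores):
    """Evaluate robustness from atom scores by walking the parse tree.

    formula: ("atom", name) | ("and", f, g) | ("or", f, g)  (hypothetical encoding)
    scores:  dict mapping atom name -> robustness score
    """
    kind = formula[0]
    if kind == "atom":
        return scores[formula[1]]
    left, right = decode(formula[1], scores), decode(formula[2], scores)
    if kind == "and":
        return min(left, right)
    if kind == "or":
        return max(left, right)
    raise ValueError(f"unknown operator: {kind}")


def conformal_radius(predicted, actual, alpha):
    """Split-conformal quantile of the sup-norm error over a calibration set."""
    errors = sorted(
        max(abs(p[name] - a[name]) for name in a) for p, a in zip(predicted, actual)
    )
    n = len(errors)
    rank = math.ceil((n + 1) * (1 - alpha))
    return math.inf if rank > n else errors[rank - 1]


formula = ("or", ("and", ("atom", "a"), ("atom", "b")), ("atom", "c"))
radius = conformal_radius(
    [{"a": 0.3}, {"a": 0.1}, {"a": 0.2}], [{"a": 0.0}] * 3, alpha=0.5
)
```

Given a new prediction `scores_hat`, the value `decode(formula, scores_hat) - radius` would serve as a certified lower bound on the true robustness for any formula in this fragment.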


[142] GeoViSTA: Geospatial Vision-Tabular Transformer for Multimodal Environment Representation cs.LG | cs.CV

Yuhao Liu, Sadeer Al-Kindi, Ashok Veeraraghavan, Guha Balakrishnan

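TL;DR: This paper proposes GeoViSTA, a vision-tabular transformer that fuses geospatial imagery with tabular data. Using bilateral cross-attention and a geography-aware attention mechanism, it learns unified geospatial embeddings that jointly model the physical environment and its socioeconomic context, improving geospatial inference.

Details

Motivation: Existing geospatial foundation models are mostly pretrained on remote-sensing imagery and do not directly model structured socioeconomic covariates stored in tabular form. They therefore cannot capture the complete environment, which limits reasoning about complex environmental, social, and health-related outcomes.

Result: Under linear probing, GeoViSTA's unified embeddings outperform baselines on downstream tasks such as predicting disease-specific mortality and fire hazard frequency across held-out regions.

Insight: Contributions: (1) a vision-tabular bimodal architecture that exchanges information across modalities through bilateral cross-attention; (2) a geography-aware attention mechanism that aligns continuous image patches with irregular census-tract tokens; (3) a self-supervised joint masked-autoencoding objective that recovers missing data from local spatial context and cross-modal cues, yielding transferable unified representations.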

Abstract: Large-scale pretraining on Earth observation imagery has yielded powerful representations of the natural and built environment. However, most existing geospatial foundation models do not directly model the structured socioeconomic covariates typically stored in tabular form. This modality gap limits their ability to capture the complete total environment, which is critical for reasoning about complex environmental, social, and health-related outcomes. In this work, we propose GeoViSTA (Geospatial Vision-Tabular Transformer), a vision-tabular architecture that learns unified geospatial embeddings from co-registered gridded imagery and tabular data. GeoViSTA utilizes bilateral cross-attention to exchange spatial and semantic information across modalities, guided by a geography-aware attention mechanism that aligns continuous image patches with irregular census-tract tokens. We train GeoViSTA with a self-supervised joint masked-autoencoding objective, forcing it to recover missing image patches and tabular rows using local spatial context and cross-modal cues. Empirically, GeoViSTA’s unified embeddings improve linear probing performance on high-impact downstream tasks, outperforming baselines in predicting disease-specific mortality and fire hazard frequency across held-out regions. These results demonstrate that jointly modeling the physical environment alongside structured socioeconomic context yields highly transferable representations for holistic geospatial inference.


[143] Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models cs.LG | cs.CV

Yuehao Liu, Shanyan Guan, Weijia Zhang, Xuanming Shang, Yanhao Ge

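TL;DR: This paper proposes Octopus, a two-stage continual learning framework built on History-Free Gradient Orthogonalization (HiFGO), to mitigate catastrophic forgetting in continual learning for multimodal large language models. It reports state-of-the-art performance on the UCIT benchmark.

Details

Motivation: Existing continual learning methods for MLLMs have limitations: architecture-based methods add computational overhead and often generalize poorly, rehearsal-based methods depend on stored historical data and raise privacy and storage concerns, and conventional regularization alone cannot fully prevent parameter interference.

Result: On UCIT, Octopus achieves state-of-the-art performance, surpassing the previous SOTA by 2.14% in average accuracy (Avg) and 6.82% in final accuracy (Last).

Insight: The main contributions are History-Free Gradient Orthogonalization (HiFGO) and a two-stage fine-tuning strategy that decouples task adaptation from regularization, giving a principled balance between plasticity and stability.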

Abstract: Continual learning in multimodal large language models (MLLMs) aims to sequentially acquire knowledge while mitigating catastrophic forgetting, yet existing methods face inherent limitations: architecture-based approaches incur additional computational overhead and often generalize poorly to new tasks, rehearsal-based methods rely on storing historical data, raising privacy and storage concerns, and conventional regularization-based strategies alone are insufficient to fully prevent parameter interference. We propose Octopus, a two-stage continual learning framework based on History-Free Gradient Orthogonalization (HiFGO), which enforces gradient-level orthogonality without historical task data. Our proposed two-stage finetuning strategy decouples task adaptation from regularization, achieving a principled balance between plasticity and stability. Experiments on UCIT show that Octopus establishes state-of-the-art performance, surpassing prior SOTA by 2.14% and 6.82% in terms of Avg and Last.
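
Gradient-level orthogonality can be sketched generically as projecting a new-task gradient onto the orthogonal complement of a set of protected directions. The abstract does not say how HiFGO obtains those directions without historical data, so `protected` below is a hypothetical input and the functions are illustrative names, not the paper's algorithm.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: turn arbitrary directions into an orthonormal basis."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip directions already spanned by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_out(grad, protected):
    """Remove from grad every component lying in span(protected)."""
    g = list(grad)
    for b in orthonormalize(protected):
        coeff = dot(g, b)
        g = [gi - coeff * bi for gi, bi in zip(g, b)]
    return g


update = project_out([1.0, 2.0, 3.0], [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
```

The resulting update has zero inner product with each protected direction, so a step along it leaves those directions untouched to first order.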


[144] DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models cs.LG | cs.CV

Quanhao Li, Junqiu Yu, Kaixun Jiang, Yujie Wei, Zhen Xing

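TL;DR: This paper proposes DiffusionOPD, a multi-task training paradigm for diffusion models based on Online Policy Distillation (OPD). It first trains task-specific teachers independently, then distills their capabilities into a unified student along the student's own rollout trajectories. This decouples single-task exploration from multi-task integration and avoids the burden of optimizing all tasks jointly.

Details

Motivation: Existing RL methods for diffusion models are largely limited to single-task optimization. Extending them to multiple tasks runs into cross-task interference and imbalance under joint optimization, while cascade RL is cumbersome and prone to catastrophic forgetting, so an effective multi-task training framework is needed.

Result: Across benchmarks, DiffusionOPD surpasses multi-reward RL and cascade RL baselines in both training efficiency and final performance, and achieves state-of-the-art results on all evaluated benchmarks.

Insight: The novelty is lifting the OPD framework from discrete tokens to continuous-state Markov processes and deriving a closed-form per-step KL objective that unifies stochastic SDE and deterministic ODE refinement via mean-matching. Compared with conventional PPO-style policy gradients, it offers lower variance and better generality.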

Abstract: Reinforcement learning has emerged as a powerful tool for improving diffusion-based text-to-image models, but existing methods are largely limited to single-task optimization. Extending RL to multiple tasks is challenging: joint optimization suffers from cross-task interference and imbalance, while cascade RL is cumbersome and prone to catastrophic forgetting. We propose DiffusionOPD, a new multi-task training paradigm for diffusion models based on Online Policy Distillation (OPD). DiffusionOPD first trains task-specific teachers independently, then distills their capabilities into a unified student along the student's own rollout trajectories. This decouples single-task exploration from multi-task integration and avoids the optimization burden of solving all tasks jointly from scratch. Theoretically, we lift the OPD framework from discrete tokens to continuous-state Markov processes, deriving a closed-form per-step KL objective that unifies both stochastic SDE and deterministic ODE refinement via mean-matching. We formally and empirically demonstrate that this analytic gradient provides lower variance and better generality compared to conventional PPO-style policy gradients. Extensive experiments show that DiffusionOPD consistently surpasses both multi-reward RL and cascade RL baselines in training efficiency and final performance, while achieving state-of-the-art results on all evaluated benchmarks.
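
One way to see why a per-step KL can reduce to mean-matching: for two Gaussians that share the same isotropic variance, the KL divergence is exactly the squared distance between their means divided by twice the variance. The sketch below assumes the student's and teacher's per-step transitions take that form; this is an assumption for illustration, and the paper's actual objective may be parameterized differently.

```python
def gaussian_kl_equal_variance(mu_student, mu_teacher, sigma):
    """KL( N(mu_student, sigma^2 I) || N(mu_teacher, sigma^2 I) ).

    With a shared isotropic variance the log-determinant and trace terms
    cancel, leaving ||mu_student - mu_teacher||^2 / (2 * sigma^2).
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    squared_distance = sum((s - t) ** 2 for s, t in zip(mu_student, mu_teacher))
    return squared_distance / (2.0 * sigma ** 2)


step_loss = gaussian_kl_equal_variance([0.0, 0.0], [1.0, 2.0], sigma=1.0)
```

Under this assumption, minimizing the per-step KL is the same as regressing the student's predicted mean onto the teacher's, with no sampling-based gradient estimator involved.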


cs.AI

[145] GraphBit: A Graph-based Agentic Framework for Non-Linear Agent Orchestration cs.AI | cs.CL | cs.DC

Yeahia Sarker, Md Rahmat Ullah, Musa Molla, Shafiq Joty

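TL;DR: This paper proposes GraphBit, a deterministic orchestration framework based on directed acyclic graphs (DAGs) that targets the hallucinated routing, infinite loops, and non-reproducible execution seen in agent frameworks that rely on prompted orchestration. A Rust engine governs routing, state transitions, and tool invocation, and a three-tier memory architecture isolates context, improving performance, reproducibility, and auditability.

Details

Motivation: Agentic LLM frameworks that rely on prompted orchestration suffer from hallucinated routing, infinite loops, and non-reproducible execution; explicit, deterministic workflow orchestration is needed for reliability and performance.

Result: On GAIA tasks spanning zero-tool, document-augmented, and web-enabled workflows, GraphBit outperforms six existing frameworks with the highest accuracy (67.6%), zero framework-induced hallucinations, the lowest latency (11.9 ms overhead), and the highest throughput. Ablations show that each of the three memory tiers contributes to performance and that deterministic execution yields the largest gains on tool-intensive tasks.

Insight: The contributions are defining workflows explicitly as DAGs, using engine orchestration instead of prompted orchestration, and a three-tier memory architecture (ephemeral scratch space, structured state, and external connectors) that prevents context bloat. Together these give deterministic, reproducible, high-performance agent orchestration, particularly for complex tasks in real deployments.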

Abstract: Agentic LLM frameworks that rely on prompted orchestration, where the model itself determines workflow transitions, often suffer from hallucinated routing, infinite loops, and non-reproducible execution. We introduce GraphBit, an engine-orchestrated framework that defines workflows explicitly and deterministically as a directed acyclic graph (DAG). Unlike prompted orchestration, agents in GraphBit operate as typed functions, while a Rust-based engine governs routing, state transitions, and tool invocation, ensuring reproducibility and auditability. The engine supports parallel branch execution, conditional control flow over structured state predicates, and configurable error recovery. A three-tier memory architecture consisting of ephemeral scratch space, structured state, and external connectors isolates context across stages, preventing cascading context bloat that degrades reasoning in long-running pipelines. Across GAIA benchmark tasks spanning zero-tool, document-augmented, and web-enabled workflows, GraphBit outperforms six existing frameworks, achieving the highest accuracy (67.6 percent), zero framework-induced hallucinations, the lowest latency (11.9 ms overhead), and the highest throughput. Ablation studies demonstrate that each memory tier contributes measurably to performance, with deterministic execution providing the greatest gains on tool-intensive tasks representative of real-world deployments.
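
The contrast with prompted orchestration can be shown in a few lines. The sketch below is not GraphBit's API (its engine is written in Rust and its interfaces are not described in the abstract); `run_workflow`, the node names, and the state layout are hypothetical. It only illustrates the general idea of an engine, rather than the model, deciding execution order from an explicit DAG, using Python's standard `graphlib`.

```python
from graphlib import TopologicalSorter


def run_workflow(nodes, deps, state):
    """Run agent functions in an order fixed by the DAG, not by the model.

    nodes: name -> function(state) -> dict of state updates
    deps:  name -> list of predecessor names (every node must appear as a key)
    Raises graphlib.CycleError if the graph is not acyclic.
    """
    order = list(TopologicalSorter(deps).static_order())
    trace = []
    for name in order:
        state = {**state, **nodes[name](state)}  # each node sees upstream results
        trace.append(name)                       # audit log of what ran, in order
    return state, trace


nodes = {
    "load": lambda s: {"x": 2},
    "double": lambda s: {"y": s["x"] * 2},
    "report": lambda s: {"out": f"result={s['y']}"},
}
deps = {"load": [], "double": ["load"], "report": ["double"]}
final_state, trace = run_workflow(nodes, deps, {})
```

Because the order comes from the graph, a cyclic definition fails up front instead of looping at run time, and the trace gives a reproducible record of execution.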


[146] Enhanced and Efficient Reasoning in Large Learning Models cs.AI | cs.CC | cs.CL | cs.LG

Leslie G. Valiant

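TL;DR: This paper proposes an efficient, principled method of reasoning to strengthen large language models. A preprocessing stage recodes the data into a Unary Relational Integracode that makes the relationships among the objects described in the text more explicit, and a standard machine learning process then learns to predict those relationships. The method can be viewed as realizing a world model and applies beyond natural language, for example to vision and actions.

Details

Motivation: Current large language models produce fluent text but lack a principled basis for trusting its content. Conventional wisdom holds that adding principled reasoning is not computationally affordable; this paper addresses that issue.

Result: The recoding makes a core subset of the relational rules that hold in the training data polynomial-time learnable (the polynomial depending on rule complexity), supporting sound reasoning both within a single call of the learned classifier and across multiple calls.

Insight: The novelty is recoding data into a Unary Relational Integracode that explicitly brings together the multiple properties of each object, combined with Robust Logic, a system for principled chaining on uncertain learned information, to obtain efficient and scalable reasoning.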

Abstract: In current Large Language Models we can trust the production of smoothly flowing prose on the basis of the principles of machine learning. However, there is no comparably principled basis to justify trust in the content of the text produced. It appears to be conventional wisdom that addressing this issue by adding more principled reasoning is not computationally affordable. Here we propose a principled method of reasoning that is efficient enough to be practical for large language models. Further, the method allows the retention of much of the currently used software and hardware base. Our method for improving the functioning of large language models consists of a first stage of preprocessing that recodes the data to a Unary Relational Integracode that is more explicit about the relationships among the objects described in the text, followed as a second stage by a standard but possibly streamlined machine learning process that then also learns to predict these relationships. The method may be viewed as realizing a world model and applying beyond natural language, to vision and actions, for example, where the multiple properties of an object referred to in an input are brought together explicitly, rather than remaining distributed in the various references to it in the input. We articulate its advantages in terms of Robust Logic, a system for performing principled chaining on learned, and hence uncertain, information. We show that this recoding has the surprising and fortuitous property that, while succinct, it makes the task of learning a core subset of relational rules that hold in the world described in the training data polynomial time learnable in a defined sense, the polynomial depending on the complexity of the rule. This gives support for sound reasoning within each single call of the learned classifier as well as between multiple calls.


Olivia Peiyu Wang, Leilani H. Gilpin

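TL;DR: This paper proposes a neuro-symbolic approach that combines the expressive power of large language models with the rigor of formal verification. The aim is AI-assisted legal reasoning that is both capable and trustworthy, reducing the burden of manual verification without sacrificing the accountability that legal practice demands.

Details

Motivation: Current AI systems used in legal practice systematically draw assumption-laden inferences that go beyond what the source text supports, while legal work demands a higher level of rigor.

Result: The abstract reports no quantitative results or benchmarks; it proposes a methodological framework aimed at greater trustworthiness and rigor.

Insight: The core idea is a neuro-symbolic architecture that pairs the generative capability of LLMs with formal verification to constrain and explain the model's reasoning, so that conclusions stay faithful to the source text. This is relevant to other high-stakes AI applications.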

Abstract: The growing adoption of large language models in legal practice brings both significant promise and serious risk. Legal professionals stand to benefit from AI that can reason over contracts, draft documents, and analyze sources at scale, yet the high-stakes nature of legal work demands a level of rigor that current AI systems do not provide. The central problem is not simply that LLMs hallucinate facts and references; it is that they systematically draw inferences that go beyond what the source text actually supports, presenting assumption-laden conclusions as if they were logically grounded. This proposal presents a neuro-symbolic approach to legal AI that combines the expressive power of large language models with the rigor of formal verification, aiming to make AI-assisted legal reasoning both capable and trustworthy, thus reducing the burden of manual verification without sacrificing the accountability that legal practice demands.


[148] Know When To Fold ‘Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection cs.AI | cs.CL

Anjir Ahmed Chowdhury, Syed Zawad, Feng Yan

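TL;DR: This paper proposes Multi-Stage In-Flight Rejection (MSIFR), a lightweight, training-free method for making LLM-based synthetic data generation more efficient. It detects and terminates low-quality generation trajectories at intermediate checkpoints, avoiding the token waste of generating full outputs before filtering.

Details

Motivation: Existing LLM synthetic data generation typically applies quality filters only after full outputs are generated, so samples that are eventually discarded waste many tokens. The paper aims to remove this waste.

Result: Across five instruction-tuned models and seven reasoning benchmarks, MSIFR cuts token consumption by 11%-77% as a standalone method and by up to 78.2% when combined with early-exit methods, while preserving or improving evaluation accuracy.

Insight: The novelty is formalizing generation as a sequential decision process and applying fast rule-based validators at intermediate stages for early rejection. Viewed objectively, the key idea is to move the rejection decision into the generation process; the martingale property of conditional utility estimates ensures this does not bias the expected utility of retained samples, so the efficiency gain needs no extra training.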

Abstract: While synthetic data generation with large language models (LLMs) is widely used in post-training pipelines, existing approaches typically generate full outputs before applying quality filters, leading to substantial token waste on samples that are ultimately discarded. To address this, we propose Multi-Stage In-Flight Rejection (MSIFR), a lightweight, training-free framework that detects and terminates low-quality generation trajectories at intermediate checkpoints before they reach full completion. MSIFR decomposes the generation process into sequential stages and applies fast rule-based validators to identify arithmetic inconsistencies, hallucination patterns, and formatting violations, enabling early rejection of faulty samples. We formalize in-flight rejection as a sequential decision process and show that any non-trivial discard policy reduces expected token consumption, with stage-wise savings increasing when rejection occurs earlier in the generation pipeline. We further demonstrate that conditional utility estimates form a martingale, ensuring that early, in-flight rejection does not bias the expected utility of retained samples. Across five instruction-tuned models and seven reasoning benchmarks, MSIFR reduces token consumption by 11%-77% as a standalone method, and up to 78.2% when combined with early-exit methods, while preserving or improving evaluation accuracy. These results confirm that MSIFR provides a practical mechanism for improving the efficiency of LLM-based synthetic data generation without additional training or architectural changes.
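
The staged rejection loop described above might look roughly like the following. This is a toy sketch under assumptions: pre-split token chunks stand in for incremental LLM output, and the single validator is a hypothetical arithmetic-consistency rule; the paper's actual stages, validators, and checkpoints are not specified in the abstract.

```python
import re


def arithmetic_ok(tokens):
    """Rule-based check: every token shaped like 'a+b=c' must be true."""
    for tok in tokens:
        match = re.fullmatch(r"(\d+)\+(\d+)=(\d+)", tok)
        if match and int(match[1]) + int(match[2]) != int(match[3]):
            return False
    return True


def generate_with_rejection(stage_chunks, validators):
    """Consume output stage by stage and stop at the first failed check.

    stage_chunks: list of token lists, one per stage (stand-in for an LLM)
    validators:   functions(tokens_so_far) -> bool
    Returns (accepted, tokens_consumed).
    """
    produced = []
    for chunk in stage_chunks:
        produced.extend(chunk)
        if not all(check(produced) for check in validators):
            return False, len(produced)  # rejected in flight: later stages never run
    return True, len(produced)


bad = [["2+2=5"], ["so", "the"], ["answer", "is", "5"]]
accepted, used = generate_with_rejection(bad, [arithmetic_ok])
```

Here the faulty sample is discarded after one token instead of six, which is the kind of saving the method targets; the earlier the failing stage, the larger the saving.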


[149] Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems cs.AI | cs.CL

Ling Wang, Songnan Liu, Jianan Wang, Cheng Cheng, Xin Liu

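TL;DR: This paper proposes HEAR, an enterprise agentic reasoner built on a Stratified Hypergraph Ontology, to address hallucinations and failures in multi-hop, n-ary reasoning when LLMs are applied to heterogeneous enterprise systems. An evidence-driven reasoning loop dynamically orchestrates ontology tools for structured multi-hop analysis without retraining the LLM. On supply-chain tasks such as root cause analysis of order fulfillment blockages, HEAR reaches up to 94.7% accuracy and shows adaptive efficiency: procedural hyperedges minimize token cost, while topological exploration secures correctness on complex queries.

Details

Motivation: Applying LLMs to heterogeneous enterprise systems is hindered by hallucinations and failures in multi-hop, n-ary reasoning. Existing paradigms (e.g., GraphRAG, NL2SQL) lack the semantic grounding and auditable execution these complex environments require.

Result: On supply-chain tasks, including root cause analysis of order fulfillment blockages, HEAR reaches up to 94.7% accuracy, matching proprietary model performance with open-weight backbones and automating manual diagnostics.

Insight: Contributions: a Stratified Hypergraph Ontology (a base Graph Layer that virtualizes provenance-aware data interfaces, and a Hyperedge Layer that encodes n-ary business rules and procedural protocols) and an evidence-driven reasoning loop with adaptive efficiency (procedural hyperedges reduce cost; topological exploration secures correctness), laying a foundation for scalable, auditable enterprise intelligence.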

Abstract: Applying Large Language Models (LLMs) to heterogeneous enterprise systems is hindered by hallucinations and failures in multi-hop, n-ary reasoning. Existing paradigms (e.g., GraphRAG, NL2SQL) lack the semantic grounding and auditable execution required for these complex environments. We introduce HEAR, an enterprise agentic reasoner built on a Stratified Hypergraph Ontology. Its base Graph Layer virtualizes provenance-aware data interfaces, while the Hyperedge Layer encodes n-ary business rules and procedural protocols. Operating an evidence-driven reasoning loop, HEAR dynamically orchestrates ontology tools for structured multi-hop analysis without requiring LLM retraining. Evaluations on supply-chain tasks, including order fulfillment blockage root cause analysis (RCA), show HEAR achieves up to 94.7% accuracy. Crucially, HEAR demonstrates adaptive efficiency: utilizing procedural hyperedges to minimize token costs, while leveraging topological exploration for rigorous correctness on complex queries. By matching proprietary model performance with open-weight backbones and automating manual diagnostics, HEAR establishes a scalable, auditable foundation for enterprise intelligence.


[150] Herculean: An Agentic Benchmark for Financial Intelligence cs.AI | cs.CL

Xueqing Peng, Zhuohan Xie, Yupeng Cao, Haohang Li, Lingfei Qian

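TL;DR: This paper introduces Herculean, presented as the first comprehensive benchmark for agentic financial intelligence, covering four representative workflows: Trading, Hedging, Market Insights, and Auditing. Standardized skill environments allow end-to-end evaluation of heterogeneous agent systems, and the results show that current frontier agents struggle on tasks requiring long-horizon coordination and structured verification.

Details

Motivation: Existing financial benchmarks mainly evaluate static competencies (such as question answering and retrieval) and cannot fully measure whether agents reliably carry out professional financial workflows. A new benchmark is needed to evaluate end-to-end agent performance in complex, high-stakes financial workflows.

Result: On Herculean, frontier agents perform relatively well on Trading and Market Insights but struggle substantially on Hedging and Auditing, which require long-horizon coordination, state consistency, and structured verification.

Insight: The contribution is the first skill benchmark for financial agents: standardized, MCP-based multi-workflow environments enable end-to-end evaluation of heterogeneous agent systems and expose a key gap in turning financial reasoning into dependable workflow execution.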

Abstract: As AI agents improve, the central question is no longer whether they can solve isolated well-defined financial tasks, but whether they can reliably carry out financial professional work. Existing financial benchmarks offer only a partial view of this ability, as they primarily evaluate static competencies such as question answering, retrieval, summarization, and classification. We introduce Herculean, the first skilled benchmark for agentic financial intelligence spanning four representative workflows, including Trading, Hedging, Market Insights, and Auditing. Each workflow is instantiated as a standardized MCP-based skill environment with its own tools, interaction dynamics, constraints, and success criteria, enabling consistent end-to-end assessment of heterogeneous agent systems. Across frontier agents, we find agents perform relatively well on Trading and Market Insights, but struggle substantially on Hedging and Auditing, where long-horizon coordination, state consistency, and structured verification are critical. Overall, our results point to a key gap in current agents in turning financial reasoning into dependable workflow execution in high-stakes financial workflows.


[151] Nexus: An Agentic Framework for Time Series Forecasting cs.AI | cs.CL | cs.LG

Sarkar Snigdha Sarathi Das, Palash Goyal, Mihir Parmar, Nanyun Peng, Vishy Tirumalashetty

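TL;DR: This paper proposes Nexus, a multi-agent framework for time series forecasting that decomposes prediction into specialized stages: isolating macro-level and micro-level temporal fluctuations, integrating contextual information when available, and synthesizing a final forecast. It aims to bridge the gap between specialized Time Series Foundation Models (TSFMs) and LLMs, exploiting the stronger-than-recognized intrinsic forecasting ability of LLMs and producing high-quality, interpretable reasoning traces.

Details

Motivation: Time series forecasting requires more than numerical extrapolation; it often involves reasoning over unstructured context such as news or events. TSFMs excel at forecasting from numerical patterns but are unaware of real-world textual signals, while LLMs are emerging as zero-shot forecasters whose performance is uneven across domains and contextual grounding. Nexus addresses this gap with a decomposed, multi-agent framework.

Result: Evaluated on data strictly after LLM knowledge cutoffs, spanning Zillow real estate metrics and volatile stock market equities, Nexus consistently matches or outperforms state-of-the-art TSFMs and strong LLM baselines.

Insight: The claimed novelty is a multi-agent framework (Nexus) that decomposes forecasting into specialized stages and thereby integrates numerical and contextual reasoning. Viewed objectively, the key finding is that current-generation LLMs have stronger intrinsic forecasting ability than previously recognized, and that this ability depends critically on how numerical and contextual reasoning are organized. The paper frames real-world forecasting as an agentic reasoning problem that goes beyond sequence modeling, and Nexus provides high-quality, interpretable reasoning traces.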

Abstract: Time series forecasting is not just numerical extrapolation, but often requires reasoning with unstructured contextual data such as news or events. While specialized Time Series Foundation Models (TSFMs) excel at forecasting based on numerical patterns, they remain unaware of real-world textual signals. Conversely, while LLMs are emerging as zero-shot forecasters, their performance remains uneven across domains and contextual grounding. To bridge this gap, we introduce Nexus, a multi-agent forecasting framework that decomposes prediction into specialized stages: isolating macro-level and micro-level temporal fluctuations, and integrating contextual information when available before synthesizing a final forecast. This decomposition enables Nexus to adapt from seasonal signals to volatile, event-driven information without relying on external statistical anchors or monolithic prompting. We show that current-generation LLMs possess substantially stronger intrinsic forecasting ability than previously recognized, depending critically on how numerical and contextual reasoning are organized. Evaluated on data strictly succeeding LLM knowledge cutoffs spanning Zillow real estate metrics and volatile stock market equities, Nexus consistently matches or outperforms state-of-the-art TSFMs and strong LLM baselines. Beyond numerical accuracy, Nexus produces high-quality reasoning traces that explicitly show the fundamental drivers behind each forecast. Our results establish that real-world forecasting is an agentic reasoning problem extending well beyond sequence modeling alone.


Joy Bose

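TL;DR: This paper presents Falkor-IRAC, a graph-constrained generation framework for Indian legal AI. It represents legal reasoning in an IRAC knowledge graph (Issue, Rule, Analysis, and Conclusion nodes) and adds a Verifier Agent that accepts an LLM-generated answer only if a valid supporting path exists in the graph, reducing hallucinations and erroneous citations.

Details

Motivation: Legal AI systems based on vector retrieval-augmented generation (RAG) show persistent failure modes (hallucinated precedents, outdated statute citations, and unsupported reasoning chains) with real consequences for access to justice in high-caseload jurisdictions such as India.

Result: On a proof-of-concept corpus of 51 Supreme Court judgments, the Verifier Agent correctly validated citations on completed queries and correctly rejected fabricated citations. Evaluation used graph-native metrics: citation grounding accuracy, path validity rate, hallucinated precedent rate, and conflict detection rate.

Insight: The novelty is casting legal reasoning as graph-constrained generation, enforcing structured reasoning through an IRAC knowledge graph and a Verifier Agent, and surfacing doctrinal conflicts rather than silently resolving them. Viewed objectively, the approach highlights that symbolic reasoning and vector retrieval are complementary, and proposes graph-native metrics better suited to evaluating legal reasoning.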

Abstract: Legal reasoning is not semantic similarity search. A court judgment encodes constrained symbolic reasoning: precedent propagation, procedural state transitions, and statute-bound inference. These are properties that vector-based retrieval-augmented generation (RAG) cannot faithfully represent. Hallucinated precedents, outdated statute citations, and unsupported reasoning chains remain persistent failure modes in LLM-based legal AI, with real consequences for access to justice in high-caseload jurisdictions such as India. This paper presents Falkor-IRAC, a graph-constrained generation framework for Indian legal AI that grounds generation in structured reasoning over an IRAC (Issue, Rule, Analysis, Conclusion) knowledge graph. Judgments from the Supreme Court and High Courts of India are ingested as IRAC node structures enriched with procedural state transitions, precedent relationships, and statutory references, stored in FalkorDB for low-latency agentic traversal. At inference time, LLM-generated answers are accepted only if a valid supporting path can be traced through the graph, a check performed by a falsifiability oracle called the Verifier Agent. The system also detects doctrinal conflicts as a first-class output rather than silently resolving them. Falkor-IRAC is evaluated using graph-native metrics: citation grounding accuracy, path validity rate, hallucinated precedent rate, and conflict detection rate. These metrics are argued to be more appropriate for legal reasoning evaluation than BLEU and ROUGE. On a proof-of-concept corpus of 51 Supreme Court judgments, the Verifier Agent correctly validated citations on completed queries and correctly rejected fabricated citations. Evaluation against vector-only RAG baselines is left for future work, as is GPU-accelerated inference to address current timeout rates on CPU hardware.
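
The acceptance rule ("accepted only if a valid supporting path can be traced through the graph") can be sketched with a plain reachability check. The code below uses an in-memory adjacency dictionary rather than FalkorDB, and the node identifiers and the `verify_answer` function are hypothetical; the real Verifier Agent presumably also checks edge types and procedural state, which the abstract does not detail.

```python
from collections import deque


def has_path(graph, start, goal):
    """Breadth-first search over a directed adjacency dict."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def verify_answer(graph, issue, cited):
    """Accept only if at least one citation is given and every cited node
    is reachable from the issue node; unknown nodes are never reachable."""
    return bool(cited) and all(has_path(graph, issue, c) for c in cited)


graph = {
    "issue:I1": ["rule:R1"],
    "rule:R1": ["analysis:A1"],
    "analysis:A1": ["case:C1"],
    "case:C1": [],
}
grounded = verify_answer(graph, "issue:I1", ["case:C1"])
fabricated = verify_answer(graph, "issue:I1", ["case:C999"])
```

A citation that does not exist in the graph can never be reached, so a fabricated precedent is rejected by construction rather than by a similarity threshold.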


[153] Agentifying Patient Dynamics within LLMs through Interacting with Clinical World Model cs.AI | cs.CL | cs.LG

Minghao Wu, Yuting Yan, Zhenyang Cai, Ke Ji, Chuangsen Fang

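TL;DR: The paper proposes SepsisAgent, a world model-augmented LLM agent for sequential treatment recommendation for sepsis in the ICU. It uses a learned Clinical World Model to simulate patient responses under candidate interventions and follows a propose-simulate-refine workflow to produce a prescription.

Details

Motivation: LLMs have clinical knowledge and reasoning ability for sepsis management but are not inherently grounded in action-conditioned patient dynamics. The paper couples LLMs with a patient dynamics model for more reliable decisions.

Result: On MIMIC-IV sepsis trajectories, SepsisAgent outperforms all traditional RL and LLM-based baselines in off-policy value while achieving the best safety profile under guideline adherence and unsafe-action metrics.

Insight: The core contributions are the propose-simulate-refine agent workflow and a three-stage curriculum (patient-dynamics supervised fine-tuning, propose-simulate-refine behavior cloning, and world-model-based agentic reinforcement learning). Through interaction with the Clinical World Model, the agent learns regularities in patient evolution that remain useful even when simulator access is removed.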

Abstract: Sepsis management in the ICU requires sequential treatment decisions under rapidly evolving patient physiology. Although large language models (LLMs) encode broad clinical knowledge and can reason over guidelines, they are not inherently grounded in action-conditioned patient dynamics. We introduce SepsisAgent, a world model-augmented LLM agent for sepsis treatment recommendation. SepsisAgent uses a learned Clinical World Model to simulate patient responses under candidate fluid–vasopressor interventions, and follows a propose–simulate–refine workflow before committing to a prescription. We first show that world-model access alone yields inconsistent LLM decision performance, motivating agent-specific training. We then train SepsisAgent through a three-stage curriculum: patient-dynamics supervised fine-tuning, propose–simulate–refine behavior cloning, and world-model-based agentic reinforcement learning. On MIMIC-IV sepsis trajectories, SepsisAgent outperforms all traditional RL and LLM-based baselines in off-policy value while achieving the best safety profile under guideline adherence and unsafe-action metrics. Further analysis shows that repeated interaction with the Clinical World Model enables the agent to learn regularities in patient evolution, which remain useful even when simulator access is removed.


[154] Orchard: An Open-Source Agentic Modeling Framework cs.AI | cs.CL

Baolin Peng, Wenlin Yao, Qianhui Wu, Hao Cheng, Xiao Yu

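TL;DR: This paper introduces Orchard, an open-source framework for scalable agentic modeling. At its core is Orchard Env, an environment service providing primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of it, the authors build three agentic modeling recipes: Orchard-SWE for coding agents, Orchard-GUI for vision-language computer-use (GUI) agents, and Orchard-Claw for personal assistant agents, each reporting leading performance among open-source models on its benchmarks.

Details

Motivation: Many high-performing agentic systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and evaluation rather than scalable agent training. Orchard is an open-source framework meant to close these infrastructure and training gaps and support open research.

Result: Orchard-SWE reaches 64.3% on SWE-bench Verified after SFT and 67.5% after SFT+RL, a new state of the art among open-source models of comparable size. Orchard-GUI achieves success rates of 74.1%, 67.0%, and 64.0% on WebVoyager, Online-Mind2Web, and DeepShop, respectively, making it the strongest open-source model while remaining competitive with proprietary systems. Orchard-Claw achieves 59.6% pass@3 on Claw-Eval, and 73.9% when paired with a stronger ZeroClaw harness.

Insight: The core contribution is a lightweight, open, harness-agnostic environment layer (Orchard Env) that enables reusable agentic data, training recipes, and evaluations across domains. Specific technical contributions include credit-assignment SFT, which learns from productive segments of unresolved trajectories, and Balanced Adaptive Rollout for RL on coding agents; the work also shows that vision-language and assistant agents can be trained effectively from a small number of distilled trajectories and synthetic tasks.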

Abstract: Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments. Despite major investment, open research remains constrained by infrastructure and training gaps. Many high-performing systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and evaluation rather than scalable agent training. We present Orchard, an open-source framework for scalable agentic modeling. At its core is Orchard Env, a lightweight environment service providing reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of Orchard Env, we build three agentic modeling recipes. Orchard-SWE targets coding agents. We distill 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, introduce credit-assignment SFT to learn from productive segments of unresolved trajectories, and apply Balanced Adaptive Rollout for RL. Starting from Qwen3-30B-A3B-Thinking, Orchard-SWE achieves 64.3% on SWE-bench Verified after SFT and 67.5% after SFT+RL, setting a new state of the art among open-source models of comparable size. Orchard-GUI trains a 4B vision-language computer-use agent using only 0.4K distilled trajectories and 2.2K open-ended tasks. It achieves 74.1%, 67.0%, and 64.0% success rates on WebVoyager, Online-Mind2Web, and DeepShop, respectively, making it the strongest open-source model while remaining competitive with proprietary systems. Orchard-Claw targets personal assistant agents. Trained with only 0.2K synthetic tasks, it achieves 59.6% pass@3 on Claw-Eval and 73.9% when paired with a stronger ZeroClaw harness. Collectively, these results show that a lightweight, open, harness-agnostic environment layer enables reusable agentic data, training recipes, and evaluations across domains.


[155] Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use cs.AI | cs.CL

Renning Pang, Tian Lan, Leyuan Liu, Piao Tong, Sheng Cao

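TL;DR: The paper proposes CAST, a case-driven framework for calibrating LLM tool use. It treats historical execution trajectories as structured cases, extracts signals from them to estimate optimal reasoning strategies and anticipate likely structural failures, and uses fine-grained reward design and adaptive reasoning so the model internalizes case-based strategies during reinforcement learning. Experiments show higher execution accuracy and task success with less unnecessary reasoning.

Details

Motivation: Reliable LLM tool use requires balancing appropriate reasoning depth with strict structural validity, avoiding execution failures or inefficiency caused by too little or too much reasoning.

Result: On BFCLv2 and ToolBench, CAST improves overall execution accuracy by up to 5.85 percentage points and reduces average reasoning length by 26%, significantly mitigating high-impact structural errors while improving schema-faithful execution and task-level tool-use success.

Insight: The novelty is the case-based perspective: historical execution trajectories are structured into reusable knowledge that guides adaptive reasoning and reward design, thereby calibrating tool use. This turns past experience into systematic adaptation knowledge rather than simply reusing raw exemplar outputs.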

Abstract: Tool use extends large language models beyond parametric knowledge, but reliable execution requires balancing appropriate reasoning depth with strict structural validity. We approach this problem from a case-based perspective to present CAST, a case-driven framework that treats historical execution trajectories as structured cases. Instead of reusing raw exemplar outputs, CAST extracts case-derived signals to identify complexity profiles for estimating optimal reasoning strategies, alongside failure profiles to map likely structural breakdowns. The framework translates this knowledge into a fine-grained reward design and adaptive reasoning, enabling the model to autonomously internalize case-based strategies during reinforcement learning. Experiments on BFCLv2 and ToolBench demonstrate that CAST improves both schema-faithful execution and task-level tool-use success while reducing unnecessary deliberation. The approach achieves up to 5.85 percentage points gain in overall execution accuracy and reduces average reasoning length by 26%, significantly mitigating high-impact structural errors. Ultimately, this demonstrates how historical execution cases can provide reusable adaptation knowledge for calibrated tool use.


[156] Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning cs.AI | cs.CV

Haozhe Wang, Qixin Xu, Changpeng Wang, Taofeng Xue, Chong Peng

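TL;DR: This paper proposes a reinforcement learning framework that introduces Perception Verification (PV) and Structured Verbal Verification to resolve ambiguous credit assignment between perception and reasoning in vision-language models, jointly improving perceptual fidelity and reasoning ability.

Details

Motivation: Current vision-language models that pursue perception-reasoning synergy often hit a “seesaw effect” with limited gains. The root cause is ambiguous modality credit assignment: when the model fails, it is hard to tell whether “bad seeing” or “bad thinking” is responsible.

Result: By decoupling perception and reasoning steps and rewarding them in a targeted way, the method lets a single model improve perception and reasoning performance simultaneously across a wide range of vision-language tasks.

Insight: The core innovation is the Perception Verification mechanism, which uses a “blindfolded reasoning” proxy to assess perceptual fidelity independently. Combined with structured verification that replaces high-variance LLM judging, it gives efficient and scalable modality-aware credit assignment.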

Abstract: Achieving robust perception-reasoning synergy is a central goal for advanced Vision-Language Models (VLMs). Recent advancements have pursued this goal via architectural designs or agentic workflows. However, these approaches are often limited by static textual reasoning or complicated by the significant compute and engineering burden of external agentic complexity. Worse, this heavy investment does not yield proportional gains and often produces a “seesaw effect” between perception and reasoning. This motivates a fundamental rethinking of the true bottleneck. In this paper, we argue that the root cause of this trade-off is an ambiguity in modality credit assignment: when a VLM fails, is it due to flawed perception (“bad seeing”) or flawed logic (“bad thinking”)? To resolve this, we introduce a reinforcement learning framework that improves perception-reasoning synergy by reliably rewarding perception fidelity. We explicitly decompose the generation process into interleaved perception and reasoning steps. This decoupling enables targeted supervision on perception. Crucially, we introduce Perception Verification (PV), leveraging a “blindfolded reasoning” proxy to reward perceptual fidelity independently of reasoning outcomes. Furthermore, to scale training across free-form VL tasks, we propose Structured Verbal Verification, which replaces high-variance LLM judging with structured algorithmic execution. These techniques are integrated into a Modality-Aware Credit Assignment (MoCA) mechanism, which routes rewards to the specific source of error – either bad seeing or bad thinking – enabling a single VLM to achieve simultaneous performance gains across a wide task spectrum.


eess.IV

[157] Efficient Dense Matching for Enhanced Gaussian Splatting Using AV1 Motion Vectors eess.IV | cs.CV

Julien Zouein, Vibhoothi Vibhoothi, François Pitié, Anil Kokaram

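TL;DR: This paper proposes a feature detection and matching pipeline based on AV1 video-codec motion vectors, aimed at improving the initialization quality of 3D Gaussian Splatting (3DGS) scene reconstruction. By exploiting the motion information inherent in the AV1 bitstream, the method significantly reduces the computational overhead of conventional Structure-from-Motion (SfM) and produces denser point clouds, improving 3DGS reconstruction accuracy and training efficiency.

Details

Motivation: 3DGS delivers real-time, photorealistic scene reconstruction, but the quality of its representation depends heavily on the initial point cloud. Conventional COLMAP-based SfM pipelines are computationally expensive and yield sparse point clouds in texture-poor regions, which hurts subsequent reconstruction accuracy and convergence speed.

Result: The initial point clouds are up to eight times as dense as those from classical SfM. With this enhanced initialization, 3DGS performance improves directly: VMAF rises by 9 points, and the training time needed to reach baseline quality drops by 63% on average.

Insight: The core innovation is applying motion-vector information from video compression (the AV1 coding standard) to the SfM initialization stage of 3D reconstruction, replacing compute-intensive exhaustive matching and balancing efficiency with robustness. It suggests a new direction for combining video coding and computer vision techniques to address specific bottlenecks.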

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a prominent framework for real-time, photorealistic scene reconstruction, offering significant speed-ups over Neural Radiance Fields (NeRF). However, the fidelity of 3DGS representations remains heavily dependent on the quality of the initial point cloud. While standard Structure-from-Motion (SfM) pipelines using COLMAP provide adequate initialisation, they often suffer from high computational costs and sparsity in textureless regions, which degrades subsequent reconstruction accuracy and convergence speed. In this work, we introduce an AV1-based feature detection and matching pipeline that significantly reduces SfM processing overhead. By leveraging motion vectors inherent to the AV1 video codec, we bypass computationally expensive exhaustive matching while maintaining geometric robustness. Our pipeline produces substantially denser point clouds, with up to eight times as many points as classical SfM. We demonstrate that this enhanced initialisation directly improves 3DGS performance, yielding a 9-point increase in VMAF and a 63% average reduction in training time required to reach baseline quality. The project page: https://sigmedia.tv/AV1-3DGS.github.io/


cs.CR

[158] To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model cs.CR | cs.AI | cs.CL | cs.CV | cs.LG

Chengshuai Zhao, Zhen Tan, Dawei Li, Zhiyuan Yu, Huan Liu

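TL;DR: This paper proposes MMGuard, a proactive defense that protects multimodal data from unauthorized fine-tuning of large vision-language models (LVLMs). It injects human-imperceptible perturbations into the data that exploit LVLM learning dynamics, causing the model to overfit to the noise during fine-tuning so that its performance degrades at inference time, when the perturbation is absent.

Details

Motivation: Unauthorized scraping of and training on multimodal web data by LVLMs poses copyright and privacy risks, and existing post-hoc remedies such as machine unlearning and watermarking fall short. The paper aims to give data owners a proactive, preemptive protection mechanism.

Result: Evaluation on nine open-source LVLMs across six datasets shows that MMGuard provides effective, stealthy, and robust protection under white-box, gray-box, and black-box threat models, demonstrating a mechanistic advantage in proactively defending against aggressive fine-tuning.

Insight: The novelty lies in a proactive perturbation-injection method based on learning dynamics, together with a cross-modal binding disruption strategy that provides theoretical guarantees for establishing a spurious correlation between the noise and the training target. An ensemble learning strategy further improves cross-model transferability, offering a new technical route for multimodal data protection.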

Abstract: The rapid advancement of Large Vision-Language Models (LVLMs) is increasingly accompanied by unauthorized scraping and training on multimodal web data, posing severe copyright and privacy risks to data owners. Existing countermeasures, such as machine unlearning and watermarks, are inherently post-hoc approaches that act only after intellectual property infringement has already occurred. In this work, we propose MMGuard to empower data owners to proactively protect their multimodal data against unauthorized LVLM fine-tuning. MMGuard generates unlearnable examples by injecting human-imperceptible perturbations that actively exploit the learning dynamics of LVLMs. By minimizing the training loss, the perturbation creates an optimization shortcut, causing the model to overfit to the noise and thereby degrading downstream performance when the perturbation is absent during inference. To further strengthen this defense, MMGuard introduces a cross-modal binding disruption, strategically shifting LVLM attention to enforce a spurious correlation between the noise and the training target with theoretical guarantees. Enhanced by an ensemble learning strategy for cross-model transferability, MMGuard is evaluated against nine open-source LVLMs across six datasets. Our comprehensive results demonstrate effective, stealthy, and robust protection under white-box, gray-box, and black-box threat models, establishing a mechanistic advantage in proactively defending against aggressive fine-tuning exploitation.


cs.SE

[159] CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing cs.SE | cs.AI | cs.CL

Mingzhi Zhu, Michele Merler, Raju Pavuluri, Stacy Patterson

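TL;DR: CRANE is a training-free parameter-editing method for improving code agents. Using nullspace editing, it treats the difference between paired Instruct (instruction-following) and Thinking (reasoning) checkpoints as a directional pool of candidate reasoning edits. After denoising, a Conservative Taylor Gate, and Graduated Sigmoidal Projection, it injects the Thinking model's reasoning ability into the Instruct model while preserving the latter's tool-use discipline and efficiency.

Details

Motivation: Code agents must both reason over long-horizon repository state and strictly obey tool-use protocols. In existing paired Instruct and Thinking checkpoints, these capabilities are complementary but misaligned: the Instruct model is concise and follows tool protocols but reasons less well, whereas the Thinking model plans well but tends to over-deliberate, degrading agent performance. CRANE aims to combine the strengths of both.

Result: On Roo-Eval, CRANE brings Qwen3-30B-A3B to a pass1 of 66.2% (+19.5%) and Qwen3-Next-80B-A3B to 81.5% (+8.7%). On SWE-bench-Verified, it resolves up to 14 additional instances at both scales (122/500 and 180/500). On Terminal-Bench v2, pass1/pass5 improve by up to 2.3%/7.8%, reaching 7.6%/17.9% and 14.8%/30.3%, respectively. Across all three benchmarks, CRANE consistently outperforms alternative merging strategies.

Insight: The paper contributes a training-free parameter-editing framework that treats the capability gap between models as editable direction vectors and, through carefully designed denoising, gating, and projection, achieves targeted injection of reasoning ability while preserving tool-use discipline. Its core insight is to recast model merging as editing and suppressing specific directions in parameter space, an efficient and controllable approach to combining model capabilities.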

Abstract: Code agents must both reason over long-horizon repository state and obey strict tool-use protocols. In paired Instruct/Thinking checkpoints, these capabilities are complementary but misaligned. The Instruct model is concise and tool-disciplined, whereas the Thinking model offers stronger planning and recovery behavior but often over-deliberates and degrades agent performance. We present CRANE (Constrained Reasoning Injection for Code Agents via Nullspace Editing), a training-free parameter-editing method that treats the Thinking-Instruct delta as a directional pool of candidate reasoning edits for the Instruct backbone. CRANE combines magnitude thresholding to denoise the delta, a Conservative Taylor Gate to retain edits that are jointly beneficial for reasoning transfer and tool-use preservation, and Graduated Sigmoidal Projection to suppress format-critical update directions. By merging paired Instruct and Thinking checkpoints, CRANE delivers strong gains over either individual model while preserving Instruct-level efficiency: on Roo-Eval it achieves pass1 of 66.2% (+19.5%) for Qwen3-30B-A3B and 81.5% (+8.7%) for Qwen3-Next-80B-A3B; on SWE-bench-Verified it resolves up to 14 additional instances at both scales (122/500 and 180/500); and on Terminal-Bench v2 it improves pass1/pass5 by up to 2.3%/7.8%, reaching 7.6%/17.9% and 14.8%/30.3%, respectively, consistently outperforming alternative merging strategies across all three benchmarks.
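The first stage named in the abstract, magnitude thresholding of the Thinking-Instruct delta, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `merge_thresholded_delta`, the top-fraction rule for choosing which entries survive, and the `scale` factor are hypothetical, weights are flat Python lists, and the Conservative Taylor Gate and Graduated Sigmoidal Projection stages are not reproduced.

```python
def merge_thresholded_delta(instruct, thinking, keep_fraction=0.5, scale=1.0):
    """Add only the largest-magnitude entries of (thinking - instruct) to instruct.

    keep_fraction and scale are illustrative knobs, not values from the paper.
    """
    # Element-wise delta between the two checkpoints.
    delta = [t - i for i, t in zip(instruct, thinking)]
    # Number of delta entries that survive the threshold (at least one).
    k = max(1, int(round(keep_fraction * len(delta))))
    # Indices of the k entries with the largest absolute delta.
    ranked = sorted(range(len(delta)), key=lambda j: abs(delta[j]), reverse=True)
    keep = set(ranked[:k])
    # Small-magnitude entries are treated as noise and left at the Instruct value.
    return [w + scale * delta[j] if j in keep else w
            for j, w in enumerate(instruct)]


if __name__ == "__main__":
    instruct = [0.0, 1.0, 2.0, 3.0]
    thinking = [0.1, 3.0, 2.05, 1.0]
    print(merge_thresholded_delta(instruct, thinking, keep_fraction=0.5))
```

In a real checkpoint the same idea would be applied per tensor; a flat list keeps the sketch dependency-free.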


stat.ML

[160] Pause and Reflect: Conformal Aggregation for Chain-of-Thought Reasoning stat.ML | cs.CL | cs.LG

Yu Gu, Zijun Yu, Vahid Partovi Nia, Masoud Asgharian

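TL;DR: This paper proposes a conformal-prediction-based aggregation method for chain-of-thought reasoning. It addresses aggregation uncertainty directly through weighted score aggregation and a calibrated abstention rule, improving selective accuracy while guaranteeing finite-sample control of the confident-error rate.

Details

Motivation: The work targets the uncertainty of self-consistency aggregation in chain-of-thought reasoning, particularly in settings where a wrong answer is far more costly than an abstention and an aggregation method with statistical guarantees is needed.

Result: Across four benchmarks, four open-source models, and three score classes, realized confident-error rates are consistent with the prescribed targets. On GSM8K, the method reaches 90.1% selective accuracy while abstaining on fewer than 5% of problems, compared with 82% accuracy for the majority-voting baseline.

Insight: The contributions include replacing majority voting with weighted score aggregation and calibrating the abstention rule with conformal risk control, which yields finite-sample guarantees. The paper also identifies score separability as the key condition under which abstention provably improves selective accuracy and derives closed-form expressions that predict accuracy gains from calibration data alone.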

Abstract: Chain-of-thought (CoT) reasoning with self-consistency improves performance by aggregating multiple sampled reasoning paths. In this setting, correctness is no longer tied to a single reasoning trace but to the aggregation rule over a pool of candidate paths, making aggregation uncertainty the central challenge. This issue is critical where confidently incorrect answers are far more costly than abstentions. We introduce a conformal procedure for CoT reasoning that directly addresses aggregation uncertainty. Our approach replaces majority voting with weighted score aggregation over reasoning paths and calibrates an abstention rule using conformal risk control. This approach leads to finite-sample guarantees on the confident-error rate, that is, the probability that the system answers and is wrong. We further identify score separability as the key condition under which abstention provably improves selective accuracy, and derive closed-form expressions that predict accuracy gains from calibration data alone. The method operates entirely at inference time and requires no retraining. Across four benchmarks, four open-source models, and three score classes, realized confident-error rates are consistent with the prescribed targets up to calibration-split and test-set variability. Our method achieves $90.1\%$ selective accuracy on GSM8K by abstaining on less than $5\%$ of problems, compared with $82\%$ accuracy under the majority-voting baseline.
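As a rough illustration of the two ingredients named in the abstract, the sketch below aggregates weighted scores over sampled reasoning paths and calibrates an abstention threshold with the generic conformal risk control rule for a loss bounded by 1. It is a sketch under stated assumptions, not the authors' procedure: the function names, the normalized-score confidence, and the choice of candidate thresholds are all hypothetical.

```python
from collections import defaultdict


def aggregate(paths):
    """paths: list of (answer, score) with positive scores.

    Returns the answer with the largest summed score and its share of the
    total score, used here as an illustrative confidence in [0, 1].
    """
    totals = defaultdict(float)
    for answer, score in paths:
        totals[answer] += score
    best = max(totals, key=totals.get)
    return best, totals[best] / sum(totals.values())


def calibrate_threshold(calib, alpha):
    """calib: list of (confidence, is_correct) pairs from held-out problems.

    Returns the smallest threshold lam such that the adjusted empirical
    confident-error rate (errors + 1) / (n + 1) is at most alpha, where an
    error means "answered (confidence >= lam) and wrong". This is the generic
    conformal risk control bound n/(n+1) * R_hat + 1/(n+1) for a 0/1 loss.
    """
    n = len(calib)
    candidates = sorted({c for c, _ in calib}) + [float("inf")]
    for lam in candidates:
        errors = sum(1 for c, ok in calib if c >= lam and not ok)
        if (errors + 1) / (n + 1) <= alpha:
            return lam
    return float("inf")  # no threshold qualifies: always abstain


def answer_or_abstain(paths, lam):
    """Return the aggregated answer, or None to signal abstention."""
    best, conf = aggregate(paths)
    return best if conf >= lam else None
```

The threshold search relies on the confident-error loss being non-increasing in the threshold: raising it can only turn answers into abstentions.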


cs.RO

[161] IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation cs.RO | cs.AI | cs.CL | cs.CV

Shijie Lian, Bin Yu, Xiaopeng Lin, Zhaolong Shen, Laurence Tianruo Yang

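TL;DR: The paper proposes IntentVLA, a history-conditioned vision-language-action (VLA) framework that addresses inter-chunk conflict and unstable execution caused by partial observability in robot imitation learning. It encodes recent visual observations to model short-horizon intent and conditions action-chunk generation on that intent representation.

Details

Motivation: Existing frame-conditioned VLA policies infer each action chunk from the current observation and instruction alone. Under partial observability, adjacent replanning steps may sample different intents, causing inter-chunk conflict and unstable execution. The paper targets the multimodality in robot imitation data that arises from differences in demonstrators' short-horizon intents, task phases, or recent context.

Result: Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves execution stability and outperforms strong VLA baselines.

Insight: The core innovation is a history-conditioning mechanism that encodes recent visual observations into a compact short-horizon intent representation and uses it to guide action-chunk generation, mitigating observation aliasing. The paper also contributes AliasBench, a 12-task robot manipulation benchmark designed to isolate and evaluate short-horizon observation aliasing.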

Abstract: Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines.


[162] MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving cs.RO | cs.CV

Rajeev Yasarla, Deepti Hegde, Hsin-Pai Cheng, Shizhong Han, Yunxiao Shi

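TL;DR: The paper proposes MAPLE, a framework for closed-loop training of end-to-end autonomous driving. It performs reactive multi-agent rollouts in the latent space of a VLA model and combines supervised fine-tuning with reinforcement learning to improve the robustness and interaction realism of the driving system.

Details

Motivation: Existing end-to-end motion-planning models are brittle in closed-loop evaluation, and traditional closed-loop supervision approaches lack scalability and cannot fully model a reactive environment. A scalable closed-loop training framework that needs no external simulator is therefore required.

Result: MAPLE achieves state-of-the-art driving performance on the Bench2Drive benchmark and demonstrates scalable closed-loop multi-agent rollout.

Insight: The contributions include multi-agent closed-loop rollout in latent space to simulate a reactive environment, training that combines supervised fine-tuning with reinforcement learning, and diversity rewards that elicit planning behaviors beyond those in logged data. Because no external simulator is required, scalability and realism improve.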

Abstract: Vision-language-action (VLA) models are effective as end-to-end motion planners, but can be brittle when evaluated in closed-loop settings because they are trained under a traditional imitation learning framework. Existing closed-loop supervision approaches lack scalability and fail to completely model a reactive environment. We propose MAPLE, a novel framework for reactive, multi-agent rollout of a dynamic driving scenario in the latent space of the VLA model. The ego vehicle and nearby traffic agents are independently controlled over multi-step horizons, while being reactive to other agents in the scene, enabling closed-loop training. MAPLE consists of two training stages: (1) supervised fine-tuning on the latent rollouts based on ground-truth trajectories, followed by (2) reinforcement learning with global and agent-specific rewards that encourage safety, progress, and interaction realism. We further propose diversity rewards that encourage the model to generate planning behaviors that may not be present in logged driving data. Notably, our closed-loop training framework is scalable and does not require external simulators, which can be computationally expensive to run and have limited visual fidelity to the real world. MAPLE achieves state-of-the-art driving performance on Bench2Drive and demonstrates scalable, closed-loop multi-agent play for robust E2E autonomous driving systems.


[163] Learning Direct Control Policies with Flow Matching for Autonomous Driving cs.RO | cs.CV

Marcello Ceresini, Federico Pirazzoli, Andrea Bertogalli, Lorenzo Cipelli, Filippo D’Addeo

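TL;DR: This paper proposes a flow-matching planner for autonomous driving that generates acceleration and curvature control trajectories directly from bird's-eye-view scene input. It achieves low-latency inference with a small number of ODE integration steps, making it suitable for real-time closed-loop replanning.

Details

Motivation: Conventional approaches in autonomous driving are sensitive to distribution shift and generalize poorly to unseen scenarios. The goal is a real-time planner that outputs control policies directly, is sensitive to geometric information, and generalizes well.

Result: Trained on simulated data of urban streets in Parma, Italy, and evaluated in closed loop in both in-distribution and out-of-distribution environments (such as multi-lane highways and unseen urban scenarios), the model generalizes reliably, maintains stable control, and completes scenarios that differ substantially from the training distribution.

Insight: The novelty lies in combining a bird's-eye-view representation (a geometry-centric view that is less sensitive to distribution shift) with a flow-matching formulation (which learns a smooth vector field that degrades gracefully under distribution shift), yielding strong generalization to unseen scenarios and real-time control.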

Abstract: We present a flow-matching planner for autonomous driving that directly outputs actionable control trajectories defined by acceleration and curvature profiles. The model is conditioned on a bird’s-eye-view (BEV) raster of the surrounding scene and generates control sequences in a small number of ordinary differential equation (ODE) integration steps, enabling low-latency inference suitable for real-time closed-loop re-planning. We train exclusively on urban scenarios (real urban city streets, intersections and roundabouts of the city of Parma, Italy) collected from a 2D traffic simulator with reactive agents, and evaluate in closed-loop on both in-distribution and markedly out-of-distribution environments, including multi-lane highways and unseen urban scenarios. Our results show that the model generalizes reliably to these unseen conditions, maintaining stable closed-loop control and successfully completing scenarios that differ substantially from the training distribution. We attribute this to the BEV representation, which provides a geometry-centric view of the scene that is inherently less sensitive to distributional shifts, and to the flow-matching formulation, which learns a smooth vector field that degrades gracefully under distribution shift. We provide video demonstrations of closed-loop behavior at https://marcelloceresini.github.io/DirectControlFlowMatching.
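Sampling from a flow-matching model in "a small number of ODE integration steps" amounts to integrating a velocity field from noise at t = 0 to a sample at t = 1. The sketch below uses fixed-step Euler integration and, in place of the learned BEV-conditioned network, a toy closed-form velocity field that points at a fixed target control vector. The function names, the step count, and the (acceleration, curvature) target are illustrative assumptions, not details from the paper.

```python
def euler_sample(velocity, x0, num_steps=4):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(target):
    """Stand-in for a learned network: the velocity of the straight-line path
    that reaches `target` at t=1. A real planner would condition on the scene."""
    def velocity(x, t):
        return [(g - xi) / (1.0 - t) for g, xi in zip(target, x)]
    return velocity


if __name__ == "__main__":
    noise = [0.3, -1.2]      # initial sample, e.g. drawn from a Gaussian
    target = [1.5, 0.02]     # hypothetical (acceleration, curvature) pair
    print(euler_sample(toy_velocity(target), noise, num_steps=4))
```

With this toy field the final Euler step lands on the target, which makes the sketch easy to check; a learned field would only approximate its target distribution, and the step count trades latency against integration error.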


[164] CLOVER: Closed-Loop Value Estimation & Ranking for End-to-End Autonomous Driving Planning cs.RO | cs.AI | cs.CV

Sining Ang, Yuguang Yang, Canyu Chen, Yan Wang

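TL;DR: CLOVER is a closed-loop value estimation and ranking framework for end-to-end autonomous driving planning. Its generator-scorer architecture generates diverse candidate trajectories and ranks them by predicted planning-metric scores, addressing the mismatch between imitation learning and rule-based evaluation.

Details

Motivation: Existing end-to-end driving planners are usually trained by imitating a single logged trajectory but evaluated with rule-based planning metrics (such as safety, feasibility, progress, and comfort). This creates a training-evaluation mismatch: paths close to the logged trajectory may violate planning rules, while alternatives farther from the demonstration can be valid and score higher.

Result: On NAVSIM, CLOVER reaches 94.5 PDMS and 90.4 EPDMS, a new state of the art. On the more challenging NavHard split it obtains 48.3 EPDMS, matching the strongest reported result. In nuScenes open-loop evaluation it achieves the lowest L2 error and collision rate among the compared methods.

Insight: CLOVER's contributions include expanding the generator's support with evaluator-filtered pseudo-expert trajectories, training the scorer and generator with conservative closed-loop self-distillation, and a theoretical analysis of when an imperfect scorer can still reliably improve the generator, namely when scorer-selected targets are enriched under the true evaluator and updates remain conservative.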

Abstract: End-to-end autonomous driving planners are commonly trained by imitating a single logged trajectory, yet evaluated by rule-based planning metrics that measure safety, feasibility, progress, and comfort. This creates a training–evaluation mismatch: trajectories close to the logged path may violate planning rules, while alternatives farther from the demonstration can remain valid and high-scoring. The mismatch is especially limiting for proposal-selection planners, whose performance depends on candidate-set coverage and scorer ranking quality. We propose CLOVER, a Closed-LOop Value Estimation and Ranking framework for end-to-end autonomous driving planning. CLOVER follows a lightweight generator–scorer formulation: a generator produces diverse candidate trajectories, and a scorer predicts planning-metric sub-scores to rank them at inference time. To expand proposal support beyond single-trajectory imitation, CLOVER constructs evaluator-filtered pseudo-expert trajectories and trains the generator with set-level coverage supervision. It then performs conservative closed-loop self-distillation: the scorer is fitted to true evaluator sub-scores on generated proposals, while the generator is refined toward teacher-selected top-$k$ and vector-Pareto targets with stability regularization. We analyze when an imperfect scorer can improve the generator, showing that scorer-mediated refinement is reliable when scorer-selected targets are enriched under the true evaluator and updates remain conservative. On NAVSIM, CLOVER achieves 94.5 PDMS and 90.4 EPDMS, establishing a new state of the art. On the more challenging NavHard split, it obtains 48.3 EPDMS, matching the strongest reported result. On supplementary nuScenes open-loop evaluation, CLOVER achieves the lowest L2 error and collision rate among compared methods. Code and data will be released at https://github.com/WilliamXuanYu/CLOVER.
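The generator-scorer selection step and the two kinds of refinement targets mentioned in the abstract (top-$k$ and vector-Pareto) can be sketched as follows. This is an illustration under assumptions, not CLOVER's implementation: the sub-score names, the multiplicative gating, and the 5/5/2 weights in `aggregate_score` are hypothetical choices, and the sub-scores are plain dictionaries rather than network outputs.

```python
def aggregate_score(sub):
    """Collapse sub-scores into one scalar (illustrative weights only).

    Hard constraints gate the score multiplicatively; soft terms are averaged.
    """
    gate = sub["no_collision"] * sub["drivable"]
    soft = (5 * sub["progress"] + 5 * sub["ttc"] + 2 * sub["comfort"]) / 12
    return gate * soft


def top_k(candidates, k):
    """Return the k candidates with the highest aggregate score."""
    return sorted(candidates, key=lambda c: aggregate_score(c["sub"]),
                  reverse=True)[:k]


def pareto_front(candidates):
    """Return candidates whose sub-score vector no other candidate dominates."""
    def dominates(a, b):
        # a dominates b if it is at least as good everywhere and better somewhere.
        return all(a[m] >= b[m] for m in a) and any(a[m] > b[m] for m in a)
    return [c for c in candidates
            if not any(dominates(o["sub"], c["sub"]) for o in candidates)]


if __name__ == "__main__":
    candidates = [
        {"name": "A", "sub": {"no_collision": 1, "drivable": 1,
                              "progress": 0.9, "ttc": 0.8, "comfort": 1.0}},
        {"name": "B", "sub": {"no_collision": 0, "drivable": 1,
                              "progress": 1.0, "ttc": 1.0, "comfort": 1.0}},
        {"name": "C", "sub": {"no_collision": 1, "drivable": 1,
                              "progress": 0.5, "ttc": 0.7, "comfort": 0.9}},
    ]
    print([c["name"] for c in top_k(candidates, 2)])
    print([c["name"] for c in pareto_front(candidates)])
```

The two selections can disagree: a candidate that fails a gated constraint scores zero in the aggregate yet can still sit on the Pareto front if it is best on another sub-score, so a practical pipeline would likely filter on hard constraints first.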


cs.SD

[165] Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study cs.SD | cs.CV | cs.LG

Wuao Liu, Mustafa Chasmai, Subhransu Maji, Grant Van Horn

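TL;DR: This paper systematically studies masked autoencoder (MAE) pretraining under limited data for fine-grained bioacoustic recognition on the iNatSounds dataset. Unlike on large-scale audio benchmarks, at moderate data scale the size of the pretraining data matters more than objective design, and additional masked-reconstruction pretraining on domain-specific data brings limited gains or even degrades performance.

Details

Motivation: Bioacoustic recognition requires fine-grained acoustic understanding, but large repositories such as iNaturalist are usually weakly annotated (a single positive species label per recording), which makes supervised learning challenging. Although MAEs transfer well on massive audio corpora, their effectiveness in data-limited bioacoustic settings remains underexplored.

Result: On iNatSounds species classification, models pretrained on diverse general audio data achieve the best transfer performance. Contrary to observations from large-scale audio benchmarks, additional masked-reconstruction pretraining on domain-specific data provides limited benefit and may even degrade performance, and selective data filtering offers negligible advantage when the overall data scale is limited.

Insight: In moderate-sized fine-grained bioacoustic settings, pretraining data scale, rather than elaborate pretraining objective design, dominates model performance. This gives practical guidance for model selection under limited supervision and clarifies when MAE-based pretraining is effective.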

Abstract: Bioacoustic recognition requires fine-grained acoustic understanding to distinguish similar-sounding species. However, many large-scale data repositories such as iNaturalist are weakly annotated, often with only a single positive species label per recording, making supervised learning particularly challenging. Inspired by advances in computer vision, recent approaches have shifted toward self-supervised learning to capture the underlying structure of audio without relying on exhaustive annotations. In particular, masked autoencoders (MAE) have shown strong transferability on massive audio corpora, yet their effectiveness in more modest bioacoustic settings remains underexplored. In this work, we conduct a systematic study of MAE pretraining for species classification on iNatSounds, analyzing the impacts of pretraining data scale, domain specificity, data curation, and transfer strategies. Consistent with prior work, we find that models pretrained on diverse general audio data achieve the best transfer performance on iNatSounds. Contrary to observations from large-scale audio benchmarks, we find that (1) additional masked reconstruction pretraining on domain-specific data provides limited benefits and may even degrade performance relative to off-the-shelf models, and (2) selective data filtering offers a negligible advantage when the overall data scale is limited. Our results indicate that, in moderate-sized fine-grained bioacoustic settings, pretraining scale dominates objective design. These findings further clarify when MAE-based pretraining is effective and provide practical guidance for model selection under limited supervision.