Table of Contents
- cs.CL [Total: 45]
- cs.CV [Total: 227]
- cs.MM [Total: 1]
- cs.HC [Total: 1]
- cs.RO [Total: 10]
- eess.AS [Total: 1]
- eess.IV [Total: 1]
- cs.LG [Total: 16]
- cs.IR [Total: 2]
- cs.AI [Total: 9]
- cs.SD [Total: 3]
- cs.CR [Total: 1]
cs.CL [Back]
[1] Slang Context-based Inference Enhancement via Greedy Search-Guided Chain-of-Thought Prompting cs.CLPDF
Jinghan Cao, Qingyang Ren, Xiangyun Chen, Xinjin Li, Haoxiang Gao
TL;DR: 本文针对大语言模型在俚语理解任务上的挑战,提出了一种基于贪婪搜索引导的思维链提示框架,旨在提升小语言模型在缺乏领域训练数据时的俚语解释准确性。
Details
Motivation: 俚语理解因依赖于语境、文化和语言框架而成为大语言模型的一项挑战性下游任务,在缺乏领域特定训练数据时,仅凭词汇信息难以准确解释俚语含义。
Result: 实验表明,模型规模和温度设置对推理准确性影响有限,更大参数模型未必比小模型更准确;所提出的框架在俚语含义解释任务上表现出更高的准确性。
Insight: 创新点在于将贪婪搜索算法与思维链提示结合,为小语言模型构建结构化推理框架,以增强语境依赖理解,这为提升语言模型在特定领域(如俚语)的推理能力提供了实用解决方案。
Abstract: Slang interpretation has been a challenging downstream task for Large Language Models (LLMs) as the expressions are inherently embedded in contextual, cultural, and linguistic frameworks. In the absence of domain-specific training data, it is difficult for LLMs to accurately interpret slang meaning based on lexical information. This paper attempts to investigate the challenges of slang inference using large LLMs and presents a greedy search-guided chain-of-thought framework for slang interpretation. Through our experiments, we conclude that the model size and temperature settings have limited impact on inference accuracy. Transformer-based models with larger active parameters do not generate higher accuracy than smaller models. Based on the results of the above empirical study, we integrate greedy search algorithms with chain-of-thought prompting for small language models to build a framework that improves the accuracy of slang interpretation. The experimental results indicate that our proposed framework demonstrates improved accuracy in slang meaning interpretation. These findings contribute to the understanding of context dependency in language models and provide a practical solution for enhancing slang comprehension through a structured reasoning prompting framework.
[2] Training-Free Agentic AI: Probabilistic Control and Coordination in Multi-Agent LLM Systems cs.CL | cs.AI | cs.ET | cs.MAPDF
Mohammad Parsa Hosseini, Ankit Shah, Saiyra Qureshi, Alex Huang, Connie Miao
TL;DR: 本文提出了REDEREF,一种轻量级且无需训练的多智能体大语言模型(LLM)协作控制器,旨在通过概率控制提升递归委托中的路由效率。它整合了基于信念的委托、反思驱动的重路由、基于证据的选择和记忆感知先验,以减少交互成本并提升系统鲁棒性。
Details
Motivation: 解决多智能体LLM系统在实际部署中面临的低效路由、噪声反馈和高交互成本问题,以实现更高效、鲁棒的复杂长程推理。
Result: 在多智能体分知识任务上的实验表明,与随机递归委托相比,所提方法在任务成功率饱和的情况下,将token使用量减少了28%,智能体调用次数减少了17%,成功时间减少了19%,并能优雅地适应智能体或评判器性能下降的情况。
Insight: 创新点在于提出了一种无需训练的概率控制框架,通过Thompson采样进行信念引导的委托,并结合反思重路由和证据选择,以简单、可解释的方式显著提升了多智能体系统的效率和适应性。
Abstract: Multi-agent large language model (LLM) systems enable complex, long-horizon reasoning by composing specialized agents, but practical deployment remains hindered by inefficient routing, noisy feedback, and high interaction cost. We introduce REDEREF, a lightweight and training-free controller for multi-agent LLM collaboration that improves routing efficiency during recursive delegation. REDEREF integrates (i) belief-guided delegation via Thompson sampling to prioritize agents with historically positive marginal contributions, (ii) reflection-driven re-routing using a calibrated LLM or programmatic judge, (iii) evidence-based selection rather than output averaging, and (iv) memory-aware priors to reduce cold-start inefficiency. Across multi-agent split-knowledge tasks, we show that while recursive retry alone saturates task success, belief-guided routing reduces token usage by 28%, agent calls by 17%, and time-to-success by 19% compared to random recursive delegation, and adapts gracefully under agent or judge degradation. These results demonstrate that simple, interpretable probabilistic control can meaningfully improve the efficiency and robustness of multi-agent LLM systems without training or fine-tuning.
[3] Explain in Your Own Words: Improving Reasoning via Token-Selective Dual Knowledge Distillation cs.CL | cs.AI | cs.LGPDF
Minsang Kim, Seung Jun Baek
TL;DR: 本文提出了一种名为Token-Selective Dual Knowledge Distillation (TSD-KD)的新型知识蒸馏框架,旨在将大型模型在复杂推理任务上的能力更有效地迁移到能力有限的学生模型中。该方法通过结合间接蒸馏(基于偏好排序的反馈)和直接蒸馏(选择性分布匹配),并引入熵正则化,鼓励学生模型用自己的语言进行推理解释,从而缓解传统蒸馏方法中因能力不匹配导致的分布失配问题。
Details
Motivation: 传统知识蒸馏方法要求学生模型完全模仿教师模型的整个输出分布,但对于能力有限的学生模型,在复杂推理任务中这种全面的监督可能导致分布失配,使其不堪重负。本文旨在解决这一问题,提出一种以学生为中心的蒸馏方法,专注于蒸馏对推理重要的部分,并支持学生发展自己的推理过程。
Result: 在10个具有挑战性的推理基准测试中,TSD-KD取得了最先进的性能,在准确率上分别比基线方法和第二名高出最高达54.4%和40.3%。值得注意的是,通过TSD-KD训练的学生模型在四种情况下甚至超越了其教师模型,最高超出20.3%。
Insight: 论文的核心创新在于提出了一个双管齐下、以学生为中心的蒸馏框架:1) 间接蒸馏通过教师对学生自生成的候选答案进行偏好排序来提供弱反馈,而非强制匹配整个分布;2) 直接蒸馏根据师生置信度差异选择性地蒸馏关键令牌。此外,引入熵正则化以维持学生模型的置信度。这种方法将重点从全面模仿转向支持学生自身的推理过程,为在能力不匹配场景下提升小模型的推理能力提供了新思路。
Abstract: Knowledge Distillation (KD) can transfer the reasoning abilities of large models to smaller ones, which can reduce the costs to generate Chain-of-Thoughts for reasoning tasks. KD methods typically ask the student to mimic the teacher’s distribution over the entire output. However, a student with limited capacity can be overwhelmed by such extensive supervision causing a distribution mismatch, especially in complex reasoning tasks. We propose Token-Selective Dual Knowledge Distillation (TSD-KD), a framework for student-centric distillation. TSD-KD focuses on distilling important tokens for reasoning and encourages the student to explain reasoning in its own words. TSD-KD combines indirect and direct distillation. Indirect distillation uses a weak form of feedback based on preference ranking. The student proposes candidate responses generated on its own; the teacher re-ranks those candidates as indirect feedback without enforcing its entire distribution. Direct distillation uses distribution matching; however, it selectively distills tokens based on the relative confidence between teacher and student. Finally, we add entropy regularization to maintain the student’s confidence during distillation. Overall, our method provides the student with targeted and indirect feedback to support its own reasoning process and to facilitate self-improvement. The experiments show the state-of-the-art performance of TSD-KD on 10 challenging reasoning benchmarks, outperforming the baseline and runner-up in accuracy by up to 54.4% and 40.3%, respectively. Notably, a student trained by TSD-KD even outperformed its own teacher model in four cases by up to 20.3%. The source code is available at https://github.com/kmswin1/TSD-KD.
[4] Can We Trust LLMs on Memristors? Diving into Reasoning Ability under Non-Ideality cs.CLPDF
Taiqiang Wu, Yuxin Cheng, Chenchen Ding, Runming Yang, Xincheng Feng
TL;DR: 本文研究了基于忆阻器的模拟存内计算架构在部署大语言模型时,由于忆阻器固有非理想特性导致的精度问题对LLM推理能力的影响。论文首先全面评估了典型非理想特性对LLM推理的影响,发现推理能力显著下降且在不同基准测试中表现各异;随后系统评估了三种无需训练的策略(思维模式、上下文学习和模块冗余),并总结了有价值的指导原则。
Details
Motivation: 忆阻器模拟存内计算架构虽能高效部署LLM,但其固有非理想特性会引发精度问题,本文旨在探究这些非理想性对LLM推理能力的具体影响,并评估无需训练的缓解策略。
Result: 实证结果表明,非理想性导致LLM推理能力显著下降,且下降程度因基准测试而异;评估了三种策略后总结出指导原则:浅层冗余对提升鲁棒性特别有效,思维模式在低噪声下表现更好但在高噪声下退化,上下文学习会缩短输出长度但带来轻微性能折衷。
Insight: 创新点在于首次全面探究忆阻器非理想性对LLM推理的影响,并系统评估了无需训练的缓解策略;客观来看,其总结的针对不同噪声水平和架构模块的鲁棒性提升指南,对实际硬件部署具有实用参考价值。
Abstract: Memristor-based analog compute-in-memory (CIM) architectures provide a promising substrate for the efficient deployment of Large Language Models (LLMs), owing to superior energy efficiency and computational density. However, these architectures suffer from precision issues caused by intrinsic non-idealities of memristors. In this paper, we first conduct a comprehensive investigation into the impact of such typical non-idealities on LLM reasoning. Empirical results indicate that reasoning capability decreases significantly but varies for distinct benchmarks. Subsequently, we systematically appraise three training-free strategies, including thinking mode, in-context learning, and module redundancy. We thus summarize valuable guidelines, i.e., shallow layer redundancy is particularly effective for improving robustness, thinking mode performs better under low noise levels but degrades at higher noise, and in-context learning reduces output length with a slight performance trade-off. Our findings offer new insights into LLM reasoning under non-ideality and practical strategies to improve robustness.
[5] Knowledge Distillation for Large Language Models cs.CL | cs.AIPDF
Alejandro Paredes La Torre, Barbara Flores, Diego Rodriguez
TL;DR: 本文提出了一种结合知识蒸馏与思维链强化学习的资源高效框架,用于压缩大语言模型。以Qwen 3B为教师模型、Qwen 0.5B为学生模型,在英文Dolly-15k、西班牙语Dolly-15k以及代码数据集BugNet和PyTorrent上进行蒸馏,并通过超参数调优优化学生模型性能。结果表明,蒸馏后的学生模型在保持模型规模显著减小的同时,保留了教师模型能力的相当大部分。此外,在代码任务中,引入思维链提示与基于CoT标注的Codeforces数据的Group Relative Policy Optimization,相比单纯知识蒸馏,进一步提升了推理连贯性与解答正确性。后训练的4位权重量化进一步降低了内存占用和推理延迟。
Details
Motivation: 解决在资源受限环境下部署大语言模型时面临的计算和内存开销问题,旨在通过模型压缩技术获得紧凑高效的模型。
Result: 在英文任务上,学生模型保留了教师模型能力的70%至91%;在西班牙语任务上,保留能力高达95%;在代码任务上,Rouge-L分数达到93.5%。结合思维链强化学习后,在代码任务上的推理连贯性和解答正确性相比单纯知识蒸馏有所提升。
Insight: 创新点在于将知识蒸馏与思维链引导的强化学习(特别是Group Relative Policy Optimization)相结合,以提升小模型在复杂任务(如代码生成)上的推理能力。同时,采用后训练量化进一步优化部署效率,为资源受限场景下的高效模型部署提供了可借鉴的框架。
Abstract: We propose a resource-efficient framework for compressing large language models through knowledge distillation, combined with guided chain-of-thought reinforcement learning. Using Qwen 3B as the teacher and Qwen 0.5B as the student, we apply knowledge distillation across English Dolly-15k, Spanish Dolly-15k, and code BugNet and PyTorrent datasets, with hyperparameters tuned in the English setting to optimize student performance. Across tasks, the distilled student retains a substantial portion of the teacher’s capability while remaining significantly smaller: 70% to 91% in English, up to 95% in Spanish, and up to 93.5% Rouge-L in code. For coding tasks, integrating chain-of-thought prompting with Group Relative Policy Optimization using CoT-annotated Codeforces data improves reasoning coherence and solution correctness compared to knowledge distillation alone. Post-training 4-bit weight quantization further reduces memory footprint and inference latency. These results show that knowledge distillation combined with chain-of-thought guided reinforcement learning can produce compact, efficient models suitable for deployment in resource-constrained settings.
[6] DeceptGuard :A Constitutional Oversight Framework For Detecting Deception in LLM Agents cs.CLPDF
Snehasis Mukhopadhyay
TL;DR: 本文提出了DeceptGuard框架,用于检测LLM代理的欺骗行为。该框架系统比较了三种监控机制:黑盒监控(仅观察外部工具调用和输出)、思维链感知监控(额外观察代理的思维链推理轨迹)和激活探针监控(额外读取冻结开源编码器的隐藏状态表示)。研究通过DeceptSynth合成管道生成了大规模欺骗行为轨迹数据集,并在DeceptArena基准测试中验证了监控方法的有效性。
Details
Motivation: 现有欺骗检测研究主要关注黑盒监控,忽略了代理内部推理信号,而可靠检测LLM代理的欺骗行为对于高风险场景的安全部署至关重要。
Result: 在DeceptArena基准测试的9200个保留样本上,思维链感知和激活探针监控显著优于黑盒监控(平均pAUROC提升0.097),尤其在行为痕迹微妙的长期欺骗任务中表现突出。提出的混合宪法集成方法在测试集上达到0.934的pAUROC,显著超越了现有技术水平。
Insight: 创新点包括:1)首次系统比较三种监控机制;2)提出可扩展的DeceptSynth合成管道和12类欺骗分类法;3)揭示了透明度与可检测性的权衡关系;4)提出混合宪法集成作为深度防御方案。从客观角度看,该研究为LLM代理安全监控提供了多维度评估框架和新的基准数据集。
Abstract: Reliable detection of deceptive behavior in Large Language Model (LLM) agents is an essential prerequisite for safe deployment in high-stakes agentic contexts. Prior work on scheming detection has focused exclusively on black-box monitors that observe only externally visible tool calls and outputs, discarding potentially rich internal reasoning signals. We introduce DECEPTGUARD, a unified framework that systematically compares three monitoring regimes: black-box monitors (actions and outputs only), CoT-aware monitors (additionally observing the agent’s chain-of-thought reasoning trace), and activation-probe monitors (additionally reading hidden-state representations from a frozen open-weights encoder). We introduce DECEPTSYNTH, a scalable synthetic pipeline for generating deception-positive and deception-negative agent trajectories across a novel 12-category taxonomy spanning verbal, behavioral, and structural deception. Our monitors are optimized on 4,800 synthetic trajectories and evaluated on 9,200 held-out samples from DeceptArena, a benchmark of realistic sandboxed agent environments with execution-verified labels. Across all evaluation settings, CoT-aware and activation-probe monitors substantially outperform their black-box counterparts (mean pAUROC improvement of +0.097), with the largest gains on subtle, long-horizon deception that leaves minimal behavioral footprints. We empirically characterize a transparency-detectability trade-off: as agents learn to suppress overt behavioral signals, chain-of-thought becomes the primary detection surface but is itself increasingly unreliable due to post-training faithfulness degradation. We propose HYBRID-CONSTITUTIONAL ensembles as a robust defense-in-depth approach, achieving a pAUROC of 0.934 on the held-out test set, representing a substantial advance over the prior state of the art.
[7] APEX-Searcher: Augmenting LLMs’ Search Capabilities through Agentic Planning and Execution cs.CL | cs.AIPDF
Kun Chen, Qingchao Kong, Zhao Feifei, Wenji Mao
TL;DR: 本文提出APEX-Searcher,一种新型的智能体规划与执行框架,旨在增强大语言模型(LLM)在复杂多跳问题上的搜索能力。该框架将检索过程解耦为规划和执行两阶段:首先通过带分解特定奖励的强化学习优化策略规划,然后基于高质量多跳轨迹进行监督微调以提升子任务执行能力。
Details
Motivation: 现有基于LLM的检索增强生成(RAG)方法在处理复杂多跳问题时,单轮检索不足,而多轮迭代检索与推理的端到端训练方法面临检索执行路径模糊和强化学习稀疏奖励的挑战,导致检索结果不准确和性能下降。
Result: 大量实验表明,该框架在多个基准测试的多跳RAG和任务规划性能上均取得显著提升。
Insight: 创新点在于将检索过程明确解耦为规划和执行两阶段,并分别采用强化学习(带分解奖励)和监督微调进行优化,这有助于解决端到端训练中的路径模糊和奖励稀疏问题,为增强LLM的复杂任务搜索能力提供了可借鉴的模块化优化思路。
Abstract: Retrieval-augmented generation (RAG), based on large language models (LLMs), serves as a vital approach to retrieving and leveraging external knowledge in various domain applications. When confronted with complex multi-hop questions, single-round retrieval is often insufficient for accurate reasoning and problem solving. To enhance search capabilities for complex tasks, most existing works integrate multi-round iterative retrieval with reasoning processes via end-to-end training. While these approaches significantly improve problem-solving performance, they are still faced with challenges in task reasoning and model training, especially ambiguous retrieval execution paths and sparse rewards in end-to-end reinforcement learning (RL) process, leading to inaccurate retrieval results and performance degradation. To address these issues, in this paper, we proposes APEX-Searcher, a novel Agentic Planning and Execution framework to augment LLM search capabilities. Specifically, we introduce a two-stage agentic framework that decouples the retrieval process into planning and execution: It first employs RL with decomposition-specific rewards to optimize strategic planning; Built on the sub-task decomposition, it then applies supervised fine-tuning on high-quality multi-hop trajectories to equip the model with robust iterative sub-task execution capabilities. Extensive experiments demonstrate that our proposed framework achieves significant improvements in both multi-hop RAG and task planning performances across multiple benchmarks.
[8] ToolFlood: Beyond Selection – Hiding Valid Tools from LLM Agents via Semantic Covering cs.CLPDF
Hussein Jawad, Nicolas J-B Brunel
TL;DR: 本文提出了ToolFlood,一种针对工具增强型LLM代理的检索层攻击方法。该方法通过在嵌入空间中精心放置少量攻击者控制的工具元数据,使其语义覆盖大量用户查询,从而淹没检索结果,将良性工具挤出代理的上下文,而非直接改变检索后的工具选择。
Details
Motivation: 随着LLM代理越来越多地使用外部工具并依赖基于嵌入的检索来选择工具子集,该检索阶段的鲁棒性尚未得到充分探索。本文旨在研究并攻击这一检索层,而非之前工作中探讨的工具选择攻击。
Result: 在标准基准测试(如ToolBench)上,ToolFlood在低注入率(1%)下实现了高达95%的攻击成功率。
Insight: 论文的创新点在于首次针对LLM代理的检索层(而非选择层)进行攻击,并提出了一种利用嵌入空间几何特性的两阶段对抗性工具生成策略(包括LLM生成和贪婪选择),以实现对查询的语义覆盖和检索饱和。这揭示了基于嵌入检索的脆弱性,并为评估和增强此类系统的鲁棒性提供了新视角。
Abstract: Large Language Model (LLM) agents increasingly use external tools for complex tasks and rely on embedding-based retrieval to select a small top-k subset for reasoning. As these systems scale, the robustness of this retrieval stage is underexplored, even though prior work has examined attacks on tool selection. This paper introduces ToolFlood, a retrieval-layer attack on tool-augmented LLM agents. Rather than altering which tool is chosen after retrieval, ToolFlood overwhelms retrieval itself by injecting a few attacker-controlled tools whose metadata is carefully placed by exploiting the geometry of embedding space. These tools semantically span many user queries, dominate the top-k results, and push all benign tools out of the agent’s context. ToolFlood uses a two-phase adversarial tool generation strategy. It first samples subsets of target queries and uses an LLM to iteratively generate diverse tool names and descriptions. It then runs an iterative greedy selection that chooses tools maximizing coverage of remaining queries in embedding space under a cosine-distance threshold, stopping when all queries are covered or a budget is reached. We provide theoretical analysis of retrieval saturation and show on standard benchmarks that ToolFlood achieves up to a 95% attack success rate with a low injection rate (1% in ToolBench). The code will be made publicly available at the following link: https://github.com/as1-prog/ToolFlood
[9] Beyond Explicit Edges: Robust Reasoning over Noisy and Sparse Knowledge Graphs cs.CLPDF
Hang Gao, Dimitris N. Metaxas
TL;DR: 本文提出了INSES框架,旨在解决传统GraphRAG方法在现实世界噪声、稀疏或不完整知识图谱上进行多跳推理时的局限性。INSES结合了LLM引导的导航来修剪噪声和指导探索,以及基于嵌入的相似性扩展来恢复隐藏链接和弥合语义鸿沟。此外,还引入了一个轻量级路由器,根据查询复杂度在简单RAG和INSES之间进行路由,以平衡效率与推理深度。
Details
Motivation: 标准图算法严重依赖静态连通性和显式边,在现实世界中噪声、稀疏或不完整的知识图谱场景下常常失效,因此需要一种能够超越显式边进行鲁棒推理的方法。
Result: INSES在多个基准测试中持续优于SOTA的RAG和GraphRAG基线。特别是在MINE基准测试中,对于不同方法(KGGEN, GraphRAG, OpenIE)构建的知识图谱,INSES均表现出卓越的鲁棒性,准确率分别提升了5%、10%和27%。
Insight: 核心创新在于将LLM引导的导航(用于噪声修剪和探索引导)与基于嵌入的相似性扩展(用于恢复隐藏链接)动态耦合,从而超越对显式边的依赖。同时,通过一个轻量级路由器实现效率与复杂推理的平衡,这是一个实用的系统设计思路。
Abstract: GraphRAG is increasingly adopted for converting unstructured corpora into graph structures to enable multi-hop reasoning. However, standard graph algorithms rely heavily on static connectivity and explicit edges, often failing in real-world scenarios where knowledge graphs (KGs) are noisy, sparse, or incomplete. To address this limitation, we introduce INSES (Intelligent Navigation and Similarity Enhanced Search), a dynamic framework designed to reason beyond explicit edges. INSES couples LLM-guided navigation, which prunes noise and steers exploration, with embedding-based similarity expansion to recover hidden links and bridge semantic gaps. Recognizing the computational cost of graph reasoning, we complement INSES with a lightweight router that delegates simple queries to Naïve RAG and escalates complex cases to INSES, balancing efficiency with reasoning depth. INSES consistently outperforms SOTA RAG and GraphRAG baselines across multiple benchmarks. Notably, on the MINE benchmark, it demonstrates superior robustness across KGs constructed by varying methods (KGGEN, GraphRAG, OpenIE), improving accuracy by 5%, 10%, and 27%, respectively.
[10] CMHL: Contrastive Multi-Head Learning for Emotionally Consistent Text Classification cs.CL | cs.LGPDF
Menna Elgabry, Ali Hamdi, Khaled Shaban
TL;DR: 本文提出CMHL模型,一种用于文本情感分类的新型单模型架构,通过多任务学习、基于心理学的辅助监督和对比矛盾损失来增强情感逻辑一致性,在参数规模远小于现有方法的情况下,在多个数据集上实现了新的SOTA性能。
Details
Motivation: 挑战了情感分类任务中依赖大规模语言模型或复杂集成模型才能提升性能的假设,旨在通过更智能的架构设计来提升模型的情感逻辑一致性。
Result: 在dair-ai Emotion数据集上,以1.25亿参数取得了93.75%的F1分数,超越了参数量大56倍的大语言模型和集成模型(86.13%-93.2%)。在SWMH数据集上,F1分数达72.50%,召回率达73.30%,超越了MentalBERT等特定领域模型。
Insight: 创新点在于将心理学先验(Russell的环状模型)和明确的逻辑一致性约束(对比矛盾损失)嵌入到多任务学习框架中,证明了架构智能而非参数量是推动任务进步的关键,为情感计算提供了高效、可解释且具有临床相关性的新范式。
Abstract: Textual Emotion Classification (TEC) is one of the most difficult NLP tasks. State of the art approaches rely on Large language models (LLMs) and multi-model ensembles. In this study, we challenge the assumption that larger scale or more complex models are necessary for improved performance. In order to improve logical consistency, We introduce CMHL, a novel single-model architecture that explicitly models the logical structure of emotions through three key innovations: (1) multi-task learning that jointly predicts primary emotions, valence, and intensity, (2) psychologically-grounded auxiliary supervision derived from Russell’s circumplex model, and (3) a novel contrastive contradiction loss that enforces emotional consistency by penalizing mutually incompatible predictions (e.g., simultaneous high confidence in joy and anger). With just 125M parameters, our model outperforms 56x larger LLMs and sLM ensembles with a new state-of-the-art F1 score of 93.75% compared to (86.13%-93.2%) on the dair-ai Emotion dataset. We further show cross domain generalization on the Reddit Suicide Watch and Mental Health Collection dataset (SWMH), outperforming domain-specific models like MentalBERT and MentalRoBERTa with an F1 score of 72.50% compared to (68.16%-72.16%) + a 73.30% recall compared to (67.05%-70.89%) that translates to enhanced sensitivity for detecting mental health distress. Our work establishes that architectural intelligence (not parameter count) drives progress in TEC. By embedding psychological priors and explicit consistency constraints, a well-designed single model can outperform both massive LLMs and complex ensembles, offering a efficient, interpretable, and clinically-relevant paradigm for affective computing.
[11] MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos cs.CL | cs.CVPDF
Arushi Goel, Sreyan Ghosh, Vatsal Agarwal, Nishit Anand, Kaousheik Jayakumar
TL;DR: 本文介绍了MMOU基准测试,这是一个用于评估多模态大语言模型在长且复杂的真实世界视频中进行全模态(视觉、音频、文本)理解和推理能力的大规模多任务基准。它包含15,000个问题与9,038个网络收集的视频,涵盖13个基本技能类别。对20多个最先进的开源和闭源模型进行评估,结果显示性能存在显著差距,突显了当前模型在长视频全模态理解上的挑战。
Details
Motivation: 多模态大语言模型在孤立评估的视觉和音频理解任务上表现出色,但其在长且复杂的真实世界视频中联合推理全模态信号的能力尚未得到充分探索。本文旨在通过引入MMOU基准来系统性地评估模型在这些挑战性条件下的理解和推理能力。
Result: 在MMOU基准上评估了20多个最先进的开源和专有模型。最佳闭源模型准确率仅为64.2%,最强开源模型准确率仅为46.8%,揭示了显著的性能差距。结果表明,当前模型在长视频中应用基本技能时经常失败。
Insight: 论文的创新点在于构建了一个大规模、高质量、涵盖多样领域和紧密耦合视听内容的长视频全模态理解与推理基准。从客观角度看,该基准通过系统性的失败模式分析,为理解当前模型在长视频多模态推理中的局限性提供了具体见解,有助于指导未来模型的发展方向。
Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance in visual and audio understanding when evaluated in isolation. However, their ability to jointly reason over omni-modal (visual, audio, and textual) signals in long and complex videos remains largely unexplored. We introduce MMOU, a new benchmark designed to systematically evaluate multimodal understanding and reasoning under these challenging, real-world conditions. MMOU consists of 15,000 carefully curated questions paired with 9038 web-collected videos of varying length, spanning diverse domains and exhibiting rich, tightly coupled audio-visual content. The benchmark covers 13 fundamental skill categories, all of which require integrating evidence across modalities and time. All questions are manually annotated across multiple turns by professional annotators, ensuring high quality and reasoning fidelity. We evaluate 20+ state-of-the-art open-source and proprietary multimodal models on MMOU. The results expose substantial performance gaps: the best closed-source model achieves only 64.2% accuracy, while the strongest open-source model reaches just 46.8%. Our results highlight the challenges of long-form omni-modal understanding, revealing that current models frequently fail to apply even fundamental skills in long videos. Through detailed analysis, we further identify systematic failure modes and provide insights into where and why current models break.
[12] Mitigating Overthinking in Large Reasoning Language Models via Reasoning Path Deviation Monitoring cs.CL | cs.AIPDF
Weixin Guan, Liang Li, Jiapeng Liu, Bing Li, Peng Fu
TL;DR: 本文提出了一种针对大型推理语言模型(LRLMs)中‘过度思考’问题的早期退出方法。该方法通过监测推理路径中的高熵转移令牌来动态检测并终止冗余的推理轨迹,从而提升模型性能和效率。
Details
Motivation: 大型推理语言模型在使用长链思维推理时容易产生‘过度思考’,即生成冗余的推理步骤,这会降低性能和效率。现有的早期退出方法要么引入额外训练开销,要么因频繁切换推理和探测答案而限制推理吞吐量,且容易因过度截断损害模型性能。
Result: 在多个基准测试上使用不同类型和规模的LRLMs进行实验,结果表明,与现有的早期退出方法相比,该方法相比原始链式思维(CoT)带来了最大的性能提升。
Insight: 创新点在于将早期退出机制与原生推理过程深度耦合,利用推理路径偏差指数(通过高熵转移令牌的频繁出现来监测)作为专用监控指标,动态检测和终止过度思考轨迹,无需额外训练且对推理流程干扰小。
Abstract: Large Reasoning Language Models (LRLMs) demonstrate impressive capabilities on complex tasks by utilizing long Chain-of-Thought reasoning. However, they are prone to overthinking, which generates redundant reasoning steps that degrade both performance and efficiency. Recently, early-exit strategies are proposed to mitigate overthinking by dynamically and adaptively terminating redundant reasoning. However, current early-exit methods either introduce extra training overhead by relying on proxy models or limit inference throughput due to the frequent content switching between reasoning and generating probing answers. Moreover, most early-exit methods harm LRLMs performance due to over-truncation. Our insight stems from an observation: overthinking often causes LRLMs to deviate from the correct reasoning path, which is frequently accompanied by high-entropy transition tokens. Given this, we propose an early-exit method deeply coupled with the native reasoning process, which leverages the path deviation index as a dedicated monitoring metric for the frequent occurrence of high-entropy transition tokens to dynamically detect and terminate overthinking trajectories. We conduct experiments across multiple benchmarks using LRLMs of different types and scales, and the results indicate that our method delivers the largest performance improvement over vanilla CoT compared to existing early-exit methods.
[13] Automatic Inter-document Multi-hop Scientific QA Generation cs.CLPDF
Seungmin Lee, Dongha Kim, Yuni Jeon, Junyoung Koh, Min Song
TL;DR: 本文提出了AIM-SciQA框架,用于自动生成跨文档、多跳的科学问答数据集,以弥补现有研究主要关注单文档事实性问答的不足。该框架利用大语言模型提取单跳问答,并通过基于嵌入的语义对齐(选择性结合引文信息)构建跨文档关系,从而生成多跳问答。基于8,211篇PubMed Central论文生成了包含41万多条单跳和1.3万多条多跳问答的IM-SciQA数据集。验证表明数据集具有高事实一致性,并能有效评估检索和问答阶段的推理能力。
Details
Motivation: 现有自动科学问题生成研究主要集中于单文档事实性问答,忽视了跨文档推理对于科学理解的重要性,因此需要构建能够评估多跳科学推理能力的数据集。
Result: 在8,211篇PubMed Central论文上生成了IM-SciQA数据集(包含411,409个单跳和13,672个多跳问答)。人工和自动验证证实了高事实一致性。实验结果表明,IM-SciQA能有效区分检索和问答阶段的推理能力,为检索增强的科学推理提供了一个真实且可解释的基准。进一步构建的引文指导变体CIM-SciQA达到了与Oracle设置相当的性能。
Insight: 创新点在于提出了一个自动化生成跨文档多跳科学问答数据集的框架,结合了LLM的机器阅读理解能力和基于嵌入的语义对齐(可选引文信息)来构建文档间关系。这为评估复杂的科学推理能力提供了高质量、可扩展的数据集构建方法,其引文引导的变体也展示了利用领域特定结构(如引文网络)提升数据质量的潜力。
Abstract: Existing automatic scientific question generation studies mainly focus on single-document factoid QA, overlooking the inter-document reasoning crucial for scientific understanding. We present AIM-SciQA, an automated framework for generating multi-document, multi-hop scientific QA datasets. AIM-SciQA extracts single-hop QAs using large language models (LLMs) with machine reading comprehension and constructs cross-document relations based on embedding-based semantic alignment while selectively leveraging citation information. Applied to 8,211 PubMed Central papers, it produced 411,409 single-hop and 13,672 multi-hop QAs, forming the IM-SciQA dataset. Human and automatic validation confirmed high factual consistency, and experimental results demonstrate that IM-SciQA effectively differentiates reasoning capabilities across retrieval and QA stages, providing a realistic and interpretable benchmark for retrieval-augmented scientific reasoning. We further extend this framework to construct CIM-SciQA, a citation-guided variant achieving comparable performance to the Oracle setting, reinforcing the dataset’s validity and generality.
[14] Mind the Shift: Decoding Monetary Policy Stance from FOMC Statements with Large Language Models cs.CLPDF
Yixuan Tang, Yi Yang
TL;DR: 本文提出了一种名为Delta-Consistent Scoring (DCS)的无标注框架,利用冻结的大型语言模型(LLM)表征,通过联合建模绝对立场和相对会议间变化,从FOMC声明中解码货币政策立场。该方法无需人工标注,利用连续会议作为自监督信号,学习每个声明的绝对鹰派-鸽派立场分数以及连续声明间的相对变化分数。
Details
Motivation: 现有方法将FOMC声明的立场检测视为标准分类问题,孤立地标注每个声明。然而,货币政策沟通的解释本质上是相对的:市场反应不仅取决于声明的基调,还取决于会议间基调的变化。因此,需要一种能够捕捉这种相对时间结构的方法。
Result: 在四个LLM骨干网络上,DCS始终优于有监督探针和LLM-as-judge基线方法,在句子级别的鹰派-鸽派分类上达到最高71.1%的准确率。生成的会议级别分数具有经济意义:与通胀指标强相关,并与国债收益率变动显著关联。
Insight: 核心创新点在于提出了一个无监督的、基于相对时间一致性的框架(DCS),通过联合优化绝对立场分数和相对变化分数,从LLM表征中恢复出时间上连贯的货币政策立场轨迹。这表明LLM表征编码了可通过相对时间结构恢复的货币政策信号。
Abstract: Federal Open Market Committee (FOMC) statements are a major source of monetary-policy information, and even subtle changes in their wording can move global financial markets. A central task is therefore to measure the hawkish–dovish stance conveyed in these texts. Existing approaches typically treat stance detection as a standard classification problem, labeling each statement in isolation. However, the interpretation of monetary-policy communication is inherently relative: market reactions depend not only on the tone of a statement, but also on how that tone shifts across meetings. We introduce Delta-Consistent Scoring (DCS), an annotation-free framework that maps frozen large language model (LLM) representations to continuous stance scores by jointly modeling absolute stance and relative inter-meeting shifts. Rather than relying on manual hawkish–dovish labels, DCS uses consecutive meetings as a source of self-supervision. It learns an absolute stance score for each statement and a relative shift score between consecutive statements. A delta-consistency objective encourages changes in absolute scores to align with the relative shifts. This allows DCS to recover a temporally coherent stance trajectory without manual labels. Across four LLM backbones, DCS consistently outperforms supervised probes and LLM-as-judge baselines, achieving up to 71.1% accuracy on sentence-level hawkish–dovish classification. The resulting meeting-level scores are also economically meaningful: they correlate strongly with inflation indicators and are significantly associated with Treasury yield movements. Overall, the results suggest that LLM representations encode monetary-policy signals that can be recovered through relative temporal structure.
[15] Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling cs.CLPDF
Suvadeep Hajra, Palash Nandi, Tanmoy Chakraborty
TL;DR: 本文提出了一种名为渐进式多样化种群采样(PDPS)的方法,通过高效的多样化响应采样来暴露大型语言模型(LLM)中长尾的安全失败。该方法在固定安全关键提示下,结合随机令牌级采样和多样性感知选择,从大量候选响应中探索并保留紧凑且语义多样化的子集,从而以较低计算成本系统地发现模型隐藏的不安全行为。
Details
Motivation: 尽管通过监督微调和人类反馈强化学习进行的安全调优提升了LLM的鲁棒性,但它往往只是抑制而非消除不安全行为,导致罕见但关键的安全失败隐藏在输出分布的长尾中。现有红队工作多侧重于对抗性提示搜索(输入空间优化),而本文旨在证明通过多样化响应生成(输出空间探索)也能系统性地暴露这些失败。
Result: 在多个越狱基准测试和开源LLM上,PDPS实现了与大规模独立同分布(IID)采样相当的攻击成功率,同时仅需8%至29%的计算成本。在有限响应设置下,其成功率比IID采样和多样化波束搜索提高了26%至40%,并且生成的不安全输出在数量和多样性上都更高。
Insight: 论文的创新点在于将安全评估重点从输入空间优化转向输出空间探索,通过PDPS高效采样多样化响应来暴露长尾安全失败。客观来看,该方法提供了一种计算高效的模型脆弱性探测框架,可广泛应用于LLM安全测试,其多样性驱动采样策略对提升红队测试覆盖率具有借鉴意义。
Abstract: Safety tuning through supervised fine-tuning and reinforcement learning from human feedback has substantially improved the robustness of large language models (LLMs). However, it often suppresses rather than eliminates unsafe behaviors, leaving rare but critical failures hidden in the long tail of the output distribution. While most red-teaming work emphasizes adversarial prompt search (input-space optimization), we show that safety failures can also be systematically exposed through diverse response generation (output-space exploration) for a fixed safety-critical prompt, where increasing the number and diversity of sampled responses can drive jailbreak success rates close to unity. To efficiently uncover such failures, we propose Progressive Diverse Population Sampling (PDPS), which combines stochastic token-level sampling with diversity-aware selection to explore a large candidate pool of responses and retain a compact, semantically diverse subset. Across multiple jailbreak benchmarks and open-source LLMs, PDPS achieves attack success rates comparable to large-scale IID sampling while using only 8% to 29% of the computational cost. Under limited-response settings, it improves success rates by 26% to 40% over IID sampling and Diverse Beam Search. Furthermore, responses generated by PDPS exhibit both a higher number and greater diversity of unsafe outputs, demonstrating its effectiveness in uncovering a broader range of failures.
[16] PARSA-Bench: A Comprehensive Persian Audio-Language Model Benchmark cs.CL | cs.SDPDF
Mohammad Javad Ranjbar Kalahroodi, Mohammad Amini, Parmis Bathayan, Heshaam Faili, Azadeh Shakery
TL;DR: PARSA-Bench是首个针对波斯语言和文化的大型音频-语言模型评估基准,包含16个任务、超过8000个样本,涵盖语音理解、副语言分析和文化音频理解。其中10个是新引入的任务,如诗歌韵律检测和波斯传统音乐理解。基准测试发现纯文本基线模型普遍优于音频模型,且所有模型在诗歌韵律检测任务上表现接近随机猜测。
Details
Motivation: 现有基准未能涵盖波斯语在古典诗歌、传统音乐和普遍存在的语码转换方面的独特音频理解挑战,因此需要创建一个专门的基准来评估模型在这些方面的能力。
Result: 在PARSA-Bench上,纯文本基线模型在多个任务上持续优于音频模型;所有模型在诗歌韵律检测任务上表现均接近随机水平,无论模型规模大小。
Insight: 创新点在于首次构建了针对波斯语言文化的综合性音频-语言基准,并揭示了当前音频模型可能未充分利用音频特有信息,以及在韵律感知等文化相关任务上存在根本性局限。
Abstract: Persian poses unique audio understanding challenges through its classical poetry, traditional music, and pervasive code-switching - none captured by existing benchmarks. We introduce PARSA-Bench (Persian Audio Reasoning and Speech Assessment Benchmark), the first benchmark for evaluating large audio-language models on Persian language and culture, comprising 16 tasks and over 8,000 samples across speech understanding, paralinguistic analysis, and cultural audio understanding. Ten tasks are newly introduced, including poetry meter and style detection, traditional Persian music understanding, and code-switching detection. Text-only baselines consistently outperform audio counterparts, suggesting models may not leverage audio-specific information beyond what transcription alone provides. Culturally-grounded tasks expose a qualitatively distinct failure mode: all models perform near random chance on vazn detection regardless of scale, suggesting prosodic perception remains beyond the reach of current models. The dataset is publicly available at https://huggingface.co/datasets/MohammadJRanjbar/PARSA-Bench
[17] Distilling Reasoning Without Knowledge: A Framework for Reliable LLMs cs.CL | cs.AI | cs.IRPDF
Auksarapak Kietkajornrit, Jad Tarifi, Nima Asgharbeygi
TL;DR: 本文提出了一种模块化框架,通过显式分离规划、事实检索和答案合成,旨在提高大型语言模型在依赖最新或冲突信息的事实性问答任务中的可靠性。该框架采用师生蒸馏方法训练一个轻量级学生规划器,使其生成包含抽象推理步骤和可搜索事实请求的结构化分解,而无需提供事实答案或检索证据。推理时,规划器生成计划,而提示工程模块执行检索和响应合成。
Details
Motivation: 解决大型语言模型在依赖最新或冲突信息的事实性问答任务中不可靠的问题,特别是现有检索增强和工具使用模型依赖隐式规划导致工具使用效率低下的局限性。
Result: 在SEAL-0这一极具挑战性的检索增强LLM基准测试中,所提框架在准确性和延迟方面均优于单体推理模型和基于提示的工具增强框架,表明显式学习的规划结构对于可靠的事实寻求LLMs至关重要。
Insight: 创新点在于通过模块化设计显式分离规划与事实处理,并采用仅包含规划轨迹和事实请求的监督信号进行蒸馏训练,避免了提供事实答案或检索证据,从而提高了模型的可靠性和效率。从客观角度看,这种结构化规划方法可借鉴于其他需要复杂推理的LLM应用中,以增强透明度和可控性。
Abstract: Fact-seeking question answering with large language models (LLMs) remains unreliable when answers depend on up-to-date or conflicting information. Although retrieval-augmented and tool-using LLMs reduce hallucinations, they often rely on implicit planning, leading to inefficient tool usage. We propose a modular framework that explicitly separates planning from factual retrieval and answer synthesis. A lightweight student planner is trained via a teacher-student framework to generate structured decompositions consisting of abstract reasoning steps and searchable fact requests. The supervision signals contain only planning traces and fact requests, without providing factual answers or retrieved evidence. At inference, the planner produces plans, while prompt-engineered modules perform retrieval and response synthesis. We evaluate the proposed framework on SEAL-0, an extremely challenging benchmark for search-augmented LLMs. Results show that supervised planning improves both accuracy and latency compared to monolithic reasoning models and prompt-based tool-augmented frameworks, demonstrating that explicitly learned planning structures are essential for reliable fact-seeking LLMs.
[18] An Industrial-Scale Insurance LLM Achieving Verifiable Domain Mastery and Hallucination Control without Competence Trade-offs cs.CLPDF
Qian Zhu, Xinnan Guo, Jingjing Huo, Jun Li, Pan Liu
TL;DR: 本文提出了INS-S1,一个面向保险领域的专业大语言模型系列,通过一种新颖的端到端对齐范式训练而成。该方法包含两个核心创新:可验证数据合成系统和渐进式SFT-RL课程框架。模型在保险领域任务上取得了最先进的性能,同时保持了顶级的通用能力,并将幻觉率降至极低的0.6%。
Details
Motivation: 将大语言模型应用于保险等高风险垂直领域面临重大挑战:场景要求严格遵守复杂的法规和业务逻辑,且对幻觉零容忍。现有方法往往存在能力权衡问题(牺牲通用智能换取领域知识),或过度依赖检索增强生成而缺乏内在推理能力。
Result: 在发布的迄今为止最全面的保险基准INSEva(包含3.9万+样本)上进行广泛实验,结果表明INS-S1在领域任务上取得了最先进的性能,显著优于DeepSeek-R1和Gemini-2.5-Pro,并实现了创纪录的低幻觉率(0.6%,HHEM指标)。
Insight: 论文宣称的创新点在于:1)可验证数据合成系统,用于构建精算推理和合规性的分层数据集;2)渐进式SFT-RL课程框架,集成了动态数据退火与经过验证的推理强化学习和AI反馈的协同组合。客观来看,其核心贡献在于提出了一种不牺牲通用智能即可实现严格领域专业化的可行路径,通过优化数据比例和奖励信号来强制领域约束并防止灾难性遗忘。
Abstract: Adapting Large Language Models (LLMs) to high-stakes vertical domains like insurance presents a significant challenge: scenarios demand strict adherence to complex regulations and business logic with zero tolerance for hallucinations. Existing approaches often suffer from a Competency Trade-off - sacrificing general intelligence for domain expertise - or rely heavily on RAG without intrinsic reasoning. To bridge this gap, we present INS-S1, an insurance-specific LLM family trained via a novel end-to-end alignment paradigm. Our approach features two methodological innovations: (1) A Verifiable Data Synthesis System that constructs hierarchical datasets for actuarial reasoning and compliance; and (2) A Progressive SFT-RL Curriculum Framework that integrates dynamic data annealing with a synergistic mix of Verified Reasoning (RLVR) and AI Feedback (RLAIF). By optimizing data ratios and reward signals, this framework enforces domain constraints while preventing catastrophic forgetting. Additionally, we release INSEva, the most comprehensive insurance benchmark to date (39k+ samples). Extensive experiments show that INS-S1 achieves SOTA performance on domain tasks, significantly outperforming DeepSeek-R1 and Gemini-2.5-Pro. Crucially, it maintains top-tier general capabilities and achieves a record-low 0.6% hallucination rate (HHEM). Our results demonstrate that rigorous domain specialization can be achieved without compromising general intelligence.
[19] AI Can Learn Scientific Taste cs.CLPDF
Jingqi Tong, Mingzhe Li, Hangcheng Li, Yongzhuo Yang, Yurong Mou
TL;DR: 该论文提出了一种名为RLCF的训练范式,通过社区反馈强化学习来提升AI的科学品味,即判断和提出高潜在影响力研究想法的能力。具体包括训练Scientific Judge模型来评估论文影响力,并以此作为奖励模型训练Scientific Thinker模型来生成高影响力研究想法。
Details
Motivation: 现有研究主要关注提升AI科学家的执行能力,而增强其科学品味(即判断和提出高潜力研究想法的能力)尚未得到充分探索,该论文旨在解决这一问题。
Result: Scientific Judge在70万对高引用与低引用论文数据上训练,在判断论文潜在影响力方面超越了GPT-5.2、Gemini 3 Pro等SOTA大语言模型,并能泛化到未来年份、未见领域和同行评审偏好测试;Scientific Thinker提出的研究想法比基线模型具有更高的潜在影响力。
Insight: 创新点在于将科学品味学习建模为偏好建模和对齐问题,并利用大规模社区信号(如论文引用量)作为监督信号,通过强化学习范式实现AI对科学品味的习得,为迈向人类水平AI科学家提供了关键一步。
Abstract: Great scientists have strong judgement and foresight, closely tied to what we call scientific taste. Here, we use the term to refer to the capacity to judge and propose research ideas with high potential impact. However, most relative research focuses on improving an AI scientist’s executive capability, while enhancing an AI’s scientific taste remains underexplored. In this work, we propose Reinforcement Learning from Community Feedback (RLCF), a training paradigm that uses large-scale community signals as supervision, and formulate scientific taste learning as a preference modeling and alignment problem. For preference modeling, we train Scientific Judge on 700K field- and time-matched pairs of high- vs. low-citation papers to judge ideas. For preference alignment, using Scientific Judge as a reward model, we train a policy model, Scientific Thinker, to propose research ideas with high potential impact. Experiments show Scientific Judge outperforms SOTA LLMs (e.g., GPT-5.2, Gemini 3 Pro) and generalizes to future-year test, unseen fields, and peer-review preference. Furthermore, Scientific Thinker proposes research ideas with higher potential impact than baselines. Our findings show that AI can learn scientific taste, marking a key step toward reaching human-level AI scientists.
[20] Infinite Problem Generator: Verifiably Scaling Physics Reasoning Data with Agentic Workflows cs.CL | cs.AIPDF
Aditya Sharan, Sriram Hebbale, Dhruv Kumar
TL;DR: 本文提出了无限问题生成器(IPG),一个基于智能体工作流的框架,用于生成可验证的高质量物理推理数据。它通过‘公式即代码’范式合成保证可解的物理问题,并发布了概念验证数据集ClassicalMechanicsV1,包含1,335个经典力学问题。研究发现公式数量与验证代码长度之间存在强线性相关性,可将代码复杂度作为问题难度的精确度量,从而实现可控的课程生成。
Details
Motivation: 训练大型语言模型进行复杂推理的瓶颈在于缺乏可验证的高质量数据。在物理等领域,标准文本增强方法容易产生幻觉,而静态基准又缺乏微调所需的推理轨迹。
Result: 作为概念验证,发布了ClassicalMechanicsV1数据集,包含1,335个从165个专家种子扩展而来的经典力学问题,具有高保真度和结构多样性,涵盖102个独特物理公式,平均每个问题使用3.05个公式。研究发现公式数量与验证代码长度之间存在强线性相关性(R² ≈ 0.95)。
Insight: 核心创新点是IPG框架及其‘公式即代码’范式,它将解决方案构建为可执行的Python程序,强制执行严格的数学一致性,从而保证问题的可解性并避免幻觉。此外,发现的‘复杂性蓝图’将代码复杂度确立为问题难度的精确、无需代理的度量,为可控的课程数据生成提供了新方法。
Abstract: Training large language models for complex reasoning is bottlenecked by the scarcity of verifiable, high-quality data. In domains like physics, standard text augmentation often introduces hallucinations, while static benchmarks lack the reasoning traces required for fine-tuning. We introduce the Infinite Problem Generator (IPG), an agentic framework that synthesizes physics problems with guaranteed solvability through a Formula-as-Code paradigm. Unlike probabilistic text generation, IPG constructs solutions as executable Python programs, enforcing strict mathematical consistency. As a proof-of-concept, we release ClassicalMechanicsV1, a high-fidelity corpus of 1,335 classical mechanics problems expanded from 165 expert seeds. The corpus demonstrates high structural diversity, spanning 102 unique physical formulas with an average complexity of 3.05 formulas per problem. Furthermore, we identify a Complexity Blueprint, demonstrating a strong linear correlation ($R^2 \approx 0.95$) between formula count and verification code length. This relationship establishes code complexity as a precise, proxy-free metric for problem difficulty, enabling controllable curriculum generation. We release the full IPG pipeline, the ClassicalMechanicsV1 dataset, and our evaluation report to support reproducible research in reasoning-intensive domains.
[21] MALicious INTent Dataset and Inoculating LLMs for Enhanced Disinformation Detection cs.CL | cs.AIPDF
Arkadiusz Modzelewski, Witold Sosnowski, Eleni Papadopulos, Elisa Sartori, Tiziano Labruna
TL;DR: 本文提出了首个由专家事实核查员标注的、旨在捕捉虚假信息及其恶意意图的英文语料库MALINT,并基于该数据集对12个语言模型进行了恶意意图分类任务的基准测试。受心理学和传播学中的接种理论启发,论文提出了基于意图的接种方法,通过将意图分析整合到LLMs的推理过程中来增强虚假信息检测能力。在六个虚假信息数据集、五个LLMs和七种语言上的分析表明,意图增强推理能提升零样本虚假信息检测性能。
Details
Motivation: 现有英文数据集和研究很少关注虚假信息背后的意图性,而恶意意图的创建和传播对公共话语构成重大威胁,因此需要构建专门的数据集并探索利用意图分析来改进检测方法。
Result: 在MALINT数据集上对12个语言模型(包括BERT等SLMs和Llama 3.3等LLMs)进行了二元和多标签意图分类任务的基准测试;提出的意图增强推理方法在跨六个数据集、五个LLMs和七种语言的零样本虚假信息检测任务中取得了性能提升。
Insight: 创新点在于构建了首个专注于恶意意图标注的虚假信息英文语料库MALINT,并将心理学中的接种理论引入NLP领域,提出了意图增强推理这一新颖的LLM提示方法,通过整合意图分析来提升模型对虚假信息的鲁棒性。
Abstract: The intentional creation and spread of disinformation poses a significant threat to public discourse. However, existing English datasets and research rarely address the intentionality behind the disinformation. This work presents MALINT, the first human-annotated English corpus developed in collaboration with expert fact-checkers to capture disinformation and its malicious intent. We utilize our novel corpus to benchmark 12 language models, including small language models (SLMs) such as BERT and large language models (LLMs) like Llama 3.3, on binary and multilabel intent classification tasks. Moreover, inspired by inoculation theory from psychology and communication studies, we investigate whether incorporating knowledge of malicious intent can improve disinformation detection. To this end, we propose intent-based inoculation, an intent-augmented reasoning for LLMs that integrates intent analysis to mitigate the persuasive impact of disinformation. Analysis on six disinformation datasets, five LLMs, and seven languages shows that intent-augmented reasoning improves zero-shot disinformation detection. To support research in intent-aware disinformation detection, we release the MALINT dataset with annotations from each annotation step.
[22] Top-b: Entropic Regulation of Relative Probability Bands in Autoregressive Language Processes cs.CLPDF
Deepon Halder, Raj Dabre
TL;DR: 本文提出了一种名为Top-b的自适应相对带采样解码策略,用于解决传统静态截断解码方法(如Top-k、Top-p)无法适应自然语言动态信息密度的问题。该方法将生成过程形式化为相对概率流形上的轨迹,通过动态带宽系数严格耦合模型分布的瞬时香农熵来调节候选集,从而在生成过程中实现自调节控制。
Details
Motivation: 动机在于标准解码策略的静态截断规则与自然语言的动态信息密度不匹配,导致在高熵创造性生成中限制过强或在低熵逻辑推理中过于宽松,迫使模型做出次优权衡。
Result: 在GPQA和GSM8K基准测试上的实证验证表明,Top-b显著降低了生成熵和解码间方差,同时保持了有竞争力的推理准确性。
Insight: 创新点在于将生成过程形式化为相对概率流形轨迹,并引入动态带宽系数与瞬时香农熵严格耦合,理论上Top-b可作为尾部分布的方差最小化算子,实现自适应的解码控制,为自回归生成提供了自调节系统近似。
Abstract: Probabilistic language generators are theoretically modeled as discrete stochastic processes, yet standard decoding strategies (Top-k, Top-p) impose static truncation rules that fail to accommodate the dynamic information density of natural language. This misalignment often forces a suboptimal trade-off: static bounds are either too restrictive for high-entropy creative generation or too permissive for low-entropy logical reasoning. In this work, we formalize the generation process as a trajectory through a relative probability manifold. We introduce Top-b (Adaptive Relative Band Sampling), a decoding strategy that regulates the candidate set via a dynamic bandwidth coefficient coupled strictly to the instantaneous Shannon entropy of the model’s distribution. We provide a theoretical framework demonstrating that Top-b acts as a variance-minimizing operator on the tail distribution. Empirical validation on GPQA and GSM8K benchmarks indicates that Top-b significantly reduces generation entropy and inter-decoding variance while maintaining competitive reasoning accuracy, effectively approximating a self-regulating control system for autoregressive generation.
[23] Parameter-Efficient Quality Estimation via Frozen Recursive Models cs.CLPDF
Umar Abubacar, Roman Bauer, Diptesh Kanojia
TL;DR: 本文研究了Tiny Recursive Models (TRM) 在低资源语言质量评估任务中的迁移能力。通过实验发现,TRM的递归机制对QE任务帮助有限,但冻结预训练嵌入的方法在保持性能的同时,大幅减少了可训练参数,实现了参数高效的QE模型。
Details
Motivation: 探究TRM的递归机制是否能迁移到低资源语言的机器翻译质量评估任务中,并寻求更参数高效的QE方法。
Result: 在8个语言对的低资源QE数据集上,使用冻结XLM-R嵌入的TRM-QE获得了0.370的斯皮尔曼相关系数,与全微调变体(0.369)相当,并优于同等深度的标准Transformer(0.336)。在印地语和泰米尔语上,其以80倍更少的可训练参数超越了MonoTransQuest(5.6亿参数)。
Insight: 核心创新在于发现并验证了冻结预训练嵌入结合权重共享(递归结构)能实现参数高效的QE模型,在性能相当的情况下大幅减少可训练参数(减少37倍),这为低资源场景下的模型部署提供了新思路。
Abstract: Tiny Recursive Models (TRM) achieve strong results on reasoning tasks through iterative refinement of a shared network. We investigate whether these recursive mechanisms transfer to Quality Estimation (QE) for low-resource languages using a three-phase methodology. Experiments on $8$ language pairs on a low-resource QE dataset reveal three findings. First, TRM’s recursive mechanisms do not transfer to QE. External iteration hurts performance, and internal recursion offers only narrow benefits. Next, representation quality dominates architectural choices, and lastly, frozen pretrained embeddings match fine-tuned performance while reducing trainable parameters by 37$\times$ (7M vs 262M). TRM-QE with frozen XLM-R embeddings achieves a Spearman’s correlation of 0.370, matching fine-tuned variants (0.369) and outperforming an equivalent-depth standard transformer (0.336). On Hindi and Tamil, frozen TRM-QE outperforms MonoTransQuest (560M parameters) with 80$\times$ fewer trainable parameters, suggesting that weight sharing combined with frozen embeddings enables parameter efficiency for QE. We release the code publicly for further research. Code is available at https://github.com/surrey-nlp/TRMQE.
[24] $PA^3$: $\textbf{P}$olicy-$\textbf{A}$ware $\textbf{A}$gent $\textbf{A}$lignment through Chain-of-Thought cs.CL | cs.AI | cs.LGPDF
Shubhashis Roy Dipta, Daniel Bis, Kun Zhou, Lichao Wang, Benjamin Z. Yao
TL;DR: 本文提出了一种名为$PA^3$的多阶段对齐方法,旨在解决大型语言模型在工具使用任务中难以遵循复杂、特定业务规则的问题。该方法教导模型在推理时通过思维链回忆并应用相关业务策略,而无需在上下文中包含完整的策略文本,从而降低了延迟、计算开销和上下文长度。
Details
Motivation: 现有LLM驱动的对话助手擅长工具使用任务,但在遵守复杂的特定业务规则方面存在困难。将全部策略放入上下文会导致高延迟、计算浪费以及因上下文过长引起的性能下降(如“大海捞针”问题)。
Result: 在GRPO训练中,结合了基于Jaccard相似度的PolicyRecall奖励和幻觉惩罚,最佳模型比基线模型高出16个百分点,比同等模型规模、使用上下文策略的基线高出3个百分点,同时使用的词汇量减少了40%。
Insight: 核心创新在于提出了一个多阶段对齐框架,使模型能够主动回忆策略,而非被动依赖冗长的上下文。具体技术贡献包括用于GRPO训练的PolicyRecall奖励和幻觉惩罚机制,这为模型对齐和高效策略遵循提供了新思路。
Abstract: Conversational assistants powered by large language models (LLMs) excel at tool-use tasks but struggle with adhering to complex, business-specific rules. While models can reason over business rules provided in context, including all policies for every query introduces high latency and wastes compute. Furthermore, these lengthy prompts lead to long contexts, harming overall performance due to the “needle-in-the-haystack” problem. To address these challenges, we propose a multi-stage alignment method that teaches models to recall and apply relevant business policies during chain-of-thought reasoning at inference time, without including the full business policy in-context. Furthermore, we introduce a novel PolicyRecall reward based on the Jaccard score and a Hallucination Penalty for GRPO training. Altogether, our best model outperforms the baseline by 16 points and surpasses comparable in-context baselines of similar model size by 3 points, while using 40% fewer words.
[25] Beyond Creed: A Non-Identity Safety Condition A Strong Empirical Alternative to Identity Framing in Low-Data LoRA Fine-Tuning cs.CLPDF
Xinran Zhang
TL;DR: 本文研究了在低数据LoRA安全微调中,监督指令的表述格式对模型安全性能的影响。通过对比四种基于相同安全规则构建的监督格式,发现非身份条件(D)在三个指令微调模型家族上均取得了最佳的安全拒绝率,超越了传统的信条式身份框架。
Details
Motivation: 探讨安全监督指令的表述方式(而非其显式身份内容)对模型安全微调效果的影响,挑战了身份框架假设的必要性。
Result: 在HarmBench的320个行为测试集上,非身份条件(D)在Llama 3.1 8B、Gemma 3 4B和Qwen2.5 7B模型上分别达到74.4%、76.9%和74.1%的拒绝率,性能最优,整体排序为D > B > C ≥ A > baseline。在MMLU和ARC-Challenge上的能力评估显示各条件间无显著权衡。
Insight: 安全监督指令的表述格式是影响微调效果的关键因素,非身份框架可以取得比显式信条式身份语言更好的安全性能,这为设计高效的安全微调方法提供了新思路。
Abstract: How safety supervision is written may matter more than the explicit identity content it contains. We study low-data LoRA safety fine-tuning with four supervision formats built from the same core safety rules: constitutional rules (A), creed-style identity framing (B), a B-matched creed condition with a worldview/confession identity-maintenance tail (C), and a matched non-identity condition (D). Across three instruction-tuned model families (Llama 3.1 8B, Qwen2.5 7B, and Gemma 3 4B), we evaluate HarmBench using a reconciled dual-judge pipeline combining Bedrock-hosted DeepSeek v3.2 and Sonnet 4.6, with disagreement and boundary cases manually resolved. The non-identity condition D is the strongest group on all three model families on the full 320-behavior HarmBench set, reaching 74.4% refusal on Llama, 76.9% on Gemma, and 74.1% on Qwen. By comparison, creed-style framing (B) improves over plain constitutional rules (A) on Llama and Gemma, but remains substantially below D, yielding an overall descriptive ordering of $D > B > C \geq A > baseline$. This provides a bounded empirical challenge to a strong version of the identity-framing hypothesis: explicit creed-style identity language is not necessary for the strongest gains observed here. Capability evaluations on MMLU and ARC-Challenge show no meaningful trade-off across conditions.
[26] Learning Constituent Headedness cs.CLPDF
Zeyao Qi, Yige Chen, KyungTae Lim, Haihua Pan, Jungyeul Park
TL;DR: 本文提出将成分结构中的中心词(headedness)作为一个显式的表示层,并通过监督学习从对齐的句法树库中预测中心词,而非依赖传统的基于规则的中心词传递方法。
Details
Motivation: 中心词在句法分析中广泛用作组织手段,但成分树库很少显式编码中心词,现有处理流程多通过规则传递来恢复中心词,因此需要更准确、可学习的中心词预测方法。
Result: 在英语和中文对齐数据上,模型实现了接近上限的内在准确率,显著优于基于规则的Collins式传递方法;预测的中心词在基于中心词的二叉化下获得可比的句法分析准确率,并提高了成分到依存转换的保真度,且能通过简单标签映射跨资源和语言迁移。
Insight: 将中心词作为显式表示层进行监督学习,利用对齐的依存和成分标注诱导监督信号,提供了一种更准确、可迁移的中心词预测方案,替代了传统的规则传递方法。
Abstract: Headedness is widely used as an organizing device in syntactic analysis, yet constituency treebanks rarely encode it explicitly and most processing pipelines recover it procedurally via percolation rules. We treat this notion of constituent headedness as an explicit representational layer and learn it as a supervised prediction task over aligned constituency and dependency annotations, inducing supervision by defining each constituent head as the dependency span head. On aligned English and Chinese data, the resulting models achieve near-ceiling intrinsic accuracy and substantially outperform Collins-style rule-based percolation. Predicted heads yield comparable parsing accuracy under head-driven binarization, consistent with the induced binary training targets being largely equivalent across head choices, while increasing the fidelity of deterministic constituency-to-dependency conversion and transferring across resources and languages under simple label-mapping interfaces.
[27] Shopping Companion: A Memory-Augmented LLM Agent for Real-World E-Commerce Tasks cs.CLPDF
Zijian Yu, Kejun Xiao, Huaipeng Zhao, Tao Luo, Xiaoyi Zeng
TL;DR: 本文提出了一个名为Shopping Companion的LLM智能体框架,用于解决现实世界电子商务中的长期偏好感知购物任务。该工作首先引入了一个包含长期记忆设置的新基准测试,涵盖两个购物任务和超过120万真实商品,并提出了一个统一框架,将记忆检索与购物辅助联合优化,同时支持用户干预。为了训练该能力,作者开发了一种带有工具级奖励的双奖励强化学习策略来处理多轮交互中固有的稀疏和不连续奖励问题。
Details
Motivation: 当前LLM智能体在电子商务应用中面临两大挑战:一是缺乏用于评估长期偏好感知购物任务的基准测试;二是现有设计将偏好识别和购物辅助作为独立组件,缺乏端到端优化。
Result: 实验结果表明,即使是GPT-5等最先进的模型在该基准测试上的成功率也不到70%,凸显了该领域的巨大挑战。而使用Shopping Companion框架训练的轻量级LLM持续优于强基线模型,实现了更好的偏好捕获和任务性能,验证了统一设计的有效性。
Insight: 主要创新点在于:1) 提出了一个包含长期记忆的真实世界电商任务基准;2) 设计了一个将记忆检索与购物辅助联合优化的统一框架;3) 开发了处理多轮交互稀疏奖励的双奖励强化学习策略。从客观角度看,其将用户长期偏好建模与任务执行端到端结合的思路,以及对现实场景中稀疏奖励问题的针对性解决方案,具有借鉴意义。
Abstract: In e-commerce, LLM agents show promise for shopping tasks such as recommendations, budgeting, and bundle deals, where accurately capturing user preferences from long-term conversations is critical. However, two challenges hinder realizing this potential: (1) the absence of benchmarks for evaluating long-term preference-aware shopping tasks, and (2) the lack of end-to-end optimization due to existing designs that treat preference identification and shopping assistance as separate components. In this paper, we introduce a novel benchmark with a long-term memory setup, spanning two shopping tasks over 1.2 million real-world products, and propose Shopping Companion, a unified framework that jointly tackles memory retrieval and shopping assistance while supporting user intervention. To train such capabilities, we develop a dual-reward reinforcement learning strategy with tool-wise rewards to handle the sparse and discontinuous rewards inherent in multi-turn interactions. Experimental results demonstrate that even state-of-the-art models (such as GPT-5) achieve success rates under 70% on our benchmark, highlighting the significant challenges in this domain. Notably, our lightweight LLM, trained with Shopping Companion, consistently outperforms strong baselines, achieving better preference capture and task performance, which validates the effectiveness of our unified design.
[28] Decision-Level Ordinal Modeling for Multimodal Essay Scoring with Large Language Models cs.CL | cs.AIPDF
Han Zhang, Jiamin Su, Li liu
TL;DR: 本文提出决策级序数建模(DLOM)方法,用于基于大语言模型的多模态作文评分。DLOM将评分任务重新定义为在预定义分数标记上的显式序数决策,通过复用语言模型头提取分数级逻辑值,实现分数空间的直接优化。针对多模态场景,DLOM-GF引入门控融合模块自适应结合文本和多模态分数逻辑值;针对纯文本场景,DLOM-DA添加距离感知正则化项以更好反映序数距离。
Details
Motivation: 现有基于LLM的自动作文评分方法通常将评分视为自回归标记生成任务,通过解码和解析获得最终分数,这种隐式决策方式在多模态评分中尤为敏感,因为视觉输入在不同作文和评分维度上的有效性存在差异。
Result: 在多模态EssayJudge数据集上,DLOM在多个评分维度上超越了基于生成的SFT基线,DLOM-GF在模态相关性异构时带来进一步增益。在纯文本ASAP/ASAP++基准测试中,DLOM在无视觉输入时依然有效,DLOM-DA进一步提升了性能并优于多个代表性基线模型。
Insight: 创新点在于将评分重构为显式的序数决策任务,通过复用LLM头实现分数空间的直接建模;针对多模态和纯文本场景分别设计了门控融合与距离感知正则化机制,提升了模型对模态异质性和序数距离的建模能力。
Abstract: Automated essay scoring (AES) predicts multiple rubric-defined trait scores for each essay, where each trait follows an ordered discrete rating scale. Most LLM-based AES methods cast scoring as autoregressive token generation and obtain the final score via decoding and parsing, making the decision implicit. This formulation is particularly sensitive in multimodal AES, where the usefulness of visual inputs varies across essays and traits. To address these limitations, we propose Decision-Level Ordinal Modeling (DLOM), which makes scoring an explicit ordinal decision by reusing the language model head to extract score-wise logits on predefined score tokens, enabling direct optimization and analysis in the score space. For multimodal AES, DLOM-GF introduces a gated fusion module that adaptively combines textual and multimodal score logits. For text-only AES, DLOM-DA adds a distance-aware regularization term to better reflect ordinal distances. Experiments on the multimodal EssayJudge dataset show that DLOM improves over a generation-based SFT baseline across scoring traits, and DLOM-GF yields further gains when modality relevance is heterogeneous. On the text-only ASAP/ASAP++ benchmarks, DLOM remains effective without visual inputs, and DLOM-DA further improves performance and outperforms strong representative baselines.
[29] Thinking in Latents: Adaptive Anchor Refinement for Implicit Reasoning in LLMs cs.CL | cs.AI | cs.LGPDF
Disha Sheshanarayana, Rajat Subhra Pal, Manjira Sinha, Tirthankar Dasgupta
TL;DR: 本文提出AdaAnchor框架,通过隐式推理在潜在空间中进行迭代计算,并引入自适应停止机制,在数学应用题基准上显著减少输出令牌数量,同时提升推理准确率。
Details
Motivation: 传统Token级思维链提示会生成冗长中间步骤,增加推理成本和输出长度,而现有潜在空间推理方法依赖固定步数,需跨模型和数据集调整超参数以平衡准确性和效率。
Result: 在三个数学应用题基准上,AdaAnchor相比固定步数潜在精炼方法准确率提升高达5%,同时平均潜在精炼步数减少48-60%;相比标准推理基线,生成令牌数减少92-93%。
Insight: 创新点在于将推理过程转移到潜在空间,通过精炼输入附着的潜在锚向量进行隐式计算,并引入自适应停止机制,根据锚向量动态收敛情况动态分配计算步数,实现更优的准确性与效率权衡。
Abstract: Token-level Chain-of-Thought (CoT) prompting has become a standard way to elicit multi-step reasoning in large language models (LLMs), especially for mathematical word problems. However, generating long intermediate traces increases output length and inference cost, and can be inefficient when the model could arrive at the correct answer without extensive verbalization. This has motivated latent-space reasoning approaches that shift computation into hidden representations and only emit a final answer. Yet, many latent reasoning methods depend on a fixed number of latent refinement steps at inference, adding another hyperparameter that must be tuned across models and datasets to balance accuracy and efficiency. We introduce AdaAnchor, a latent reasoning framework that performs silent iterative computation by refining a set of latent anchor vectors attached to the input. AdaAnchor further incorporates an adaptive halting mechanism that monitors anchor stability across iterations and terminates refinement once the anchor dynamics converge, allocating fewer steps to easier instances while reserving additional refinement steps for harder ones under a shared maximum-step budget. Our empirical evaluation across three mathematical word-problem benchmarks shows that AdaAnchor with adaptive halting yields accuracy gains of up to 5% over fixed-step latent refinement while reducing average latent refinement steps by 48-60% under the same maximum-step budget. Compared to standard reasoning baselines, AdaAnchor achieves large reductions in generated tokens (92-93%) by moving computation into silent latent refinement, offering a different accuracy-efficiency trade-off with substantially lower output-token usage.
[30] Writer-R1: Enhancing Generative Writing in LLMs via Memory-augmented Replay Policy Optimization cs.CLPDF
Jihao Zhao, Shuaishuai Zu, Zhiyuan Ji, Chunlai Zhou, Biao Qin
TL;DR: 本文提出Writer-R1框架,通过基于扎根理论的多智能体协作流程动态生成细粒度可解释的写作评估标准,并设计记忆增强回放策略优化算法,结合监督微调与强化学习实现生成式写作的端到端优化。
Details
Motivation: 解决创意写作等开放式生成任务因缺乏可验证参考答案导致的奖励建模与自动评估难题,包括人工标注成本高、评估偏差大及反馈信号粗糙等问题。
Result: 自动构建的评估标准达到与人工标注相当的性能增益;训练得到的Writer-R1-4B模型在多项创意写作任务上超越基线模型,部分性能超过千亿参数开源模型。
Insight: 创新点包括基于扎根理论的多智能体动态标准生成机制,以及融合自我反思与强化学习的记忆增强回放策略优化算法,实现了无需额外训练的迭代优化与端到端奖励信号转换。
Abstract: As a typical open-ended generation task, creative writing lacks verifiable reference answers, which has long constrained reward modeling and automatic evaluation due to high human annotation costs, evaluative bias, and coarse feedback signals. To address these challenges, this paper first designs a multi-agent collaborative workflow based on Grounded Theory, performing dimensional decomposition and hierarchical induction of the problem to dynamically produce interpretable and reusable fine-grained criteria. Furthermore, we propose the Memory-augmented Replay Policy Optimization (MRPO) algorithm: on the one hand, without additional training, MRPO guides models to engage in self-reflection based on dynamic criteria, enabling controlled iterative improvement; on the other hand, we adopt the training paradigm that combines supervised fine-tuning with reinforcement learning to convert evaluation criteria into reward signals, achieving end-to-end optimization. Experimental results demonstrate that the automatically constructed criteria achieve performance gains comparable to human annotations. Writer-R1-4B models trained with this approach outperform baselines across multiple creative writing tasks and surpass some 100B+ parameter open-source models.
[31] Bridging National and International Legal Data: Two Projects Based on the Japanese Legal Standard XML Schema for Comparative Law Studies cs.CL | cs.AIPDF
Makoto Nakamura
TL;DR: 本文提出了一个基于日本法律标准XML模式的计算比较法集成框架,通过两个连续项目实现:首先建立JLS到Akoma Ntoso标准的转换管道以实现结构互操作性,其次应用多语言嵌入模型和语义相似度技术识别跨国法律条款对应关系,并开发了结合FAISS检索和交叉编码器重排的原型系统进行可视化分析。
Details
Motivation: 解决日本法律数据与国际标准(如Akoma Ntoso)之间的互操作性障碍,并促进跨国法律条款的自动比对与比较法研究。
Result: 开发了从JLS到AKN的转换管道实现结构互操作,原型系统通过多语言嵌入和检索重排技术生成候选对应条款,支持可视化跨司法管辖区网络分析。
Insight: 创新点在于将法律文档的结构标准化与语义相似度计算相结合,构建了从数据转换到智能检索的完整计算比较法流程,为跨语言法律分析提供了可扩展框架。
Abstract: This paper presents an integrated framework for computational comparative law by connecting two consecutive research projects based on the Japanese Legal Standard (JLS) XML schema. The first project establishes structural interoperability by developing a conversion pipeline from JLS to the Akoma Ntoso (AKN) standard, enabling Japanese statutes to be integrated into international LegalDocML-based legislative databases. Building on this foundation, the second project applies multilingual embedding models and semantic textual similarity techniques to identify corresponding provisions across national legal systems. A prototype system combining multilingual embeddings, FAISS retrieval, and Cross-Encoder reranking generates candidate correspondences and visualizes them as cross-jurisdictional networks for exploratory comparative analysis.
[32] MMKU-Bench: A Multimodal Update Benchmark for Diverse Visual Knowledge cs.CLPDF
Baochen Fu, Yuntao Du, Cheng Chang, Baihao Jin, Wenzhi Deng
TL;DR: 本文提出了MMKU-Bench,一个用于评估多模态知识更新的综合性基准测试,包含超过2.5万个知识实例和4.9万张图像,涵盖知识更新和未知知识学习两种场景。作者在基准上评估了监督微调、基于人类反馈的强化学习和知识编辑等方法,发现前两者存在灾难性遗忘问题,而知识编辑在持续更新方面存在局限。
Details
Motivation: 解决多模态模型预训练获得的参数化知识难以与现实世界持续演变的知识保持一致的问题,现有研究仅关注学习未知知识,忽略了更新模型已掌握但后续发生变化的知识,且评估局限于单一模态,缺乏对跨模态一致性的系统分析。
Result: 在提出的MMKU-Bench基准上评估了SFT、RLHF和KE等方法,实验结果表明SFT和RLHF容易发生灾难性遗忘,而KE能更好地保留通用能力但在持续更新方面表现出明显局限性。
Insight: 创新点在于构建了一个涵盖知识更新和未知知识学习两种场景的多模态知识更新评估基准,支持对不同知识类型学习的比较分析,并系统评估了多种代表性方法的性能与局限,为领域提供了可靠的评估工具。
Abstract: As real-world knowledge continues to evolve, the parametric knowledge acquired by multimodal models during pretraining becomes increasingly difficult to remain consistent with real-world knowledge. Existing research on multimodal knowledge updating focuses only on learning previously unknown knowledge, while overlooking the need to update knowledge that the model has already mastered but that later changes; moreover, evaluation is limited to the same modality, lacking a systematic analysis of cross-modal consistency. To address these issues, this paper proposes MMKU-Bench, a comprehensive evaluation benchmark for multimodal knowledge updating, which contains over 25k knowledge instances and more than 49k images, covering two scenarios, updated knowledge and unknown knowledge, thereby enabling comparative analysis of learning across different knowledge types. On this benchmark, we evaluate a variety of representative approaches, including supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and knowledge editing (KE). Experimental results show that SFT and RLHF are prone to catastrophic forgetting, while KE better preserve general capabilities but exhibit clear limitations in continual updating. Overall, MMKU-Bench provides a reliable and comprehensive evaluation benchmark for multimodal knowledge updating, advancing progress in this field.
[33] Efficient Document Parsing via Parallel Token Prediction cs.CL | cs.CVPDF
Lei Li, Ze Zhao, Meng Li, Zhongwang Lun, Yi Yuan
TL;DR: 本文提出了一种名为并行令牌预测(PTP)的可插拔、模型无关且简单有效的方法,旨在解决视觉语言模型(VLMs)在文档解析任务中因自回归解码导致的严重速度瓶颈。PTP通过在学习序列中插入可学习的令牌并设计相应的训练目标,使模型能够并行生成多个未来令牌,从而提高解码速度和样本效率。此外,论文还开发了一个全面的数据生成流程,为VLMs高效生成大规模、高质量的文档解析训练数据。
Details
Motivation: 文档解析作为一项基础而关键的视觉任务,正被视觉语言模型(VLMs)革新,但VLMs固有的自回归解码造成了显著的速度瓶颈,严重限制了解析速度。
Result: 在OmniDocBench和olmOCR-bench上的大量实验表明,该方法不仅显著提高了解码速度(1.6倍至2.2倍),还减少了模型幻觉,并展现出强大的泛化能力。
Insight: 创新点在于提出了一种并行令牌预测机制,通过插入可学习令牌和设计训练目标,使VLMs具备并行解码能力,从而提升效率;同时,开发的数据生成流程支持有效训练,增强了方法的实用性和泛化性。
Abstract: Document parsing, as a fundamental yet crucial vision task, is being revolutionized by vision-language models (VLMs). However, the autoregressive (AR) decoding inherent to VLMs creates a significant bottleneck, severely limiting parsing speed. In this paper, we propose Parallel-Token Prediction (PTP), a plugable, model-agnostic and simple-yet-effective method that enables VLMs to generate multiple future tokens in parallel with improved sample efficiency. Specifically, we insert some learnable tokens into the input sequence and design corresponding training objectives to equip the model with parallel decoding capabilities for document parsing. Furthermore, to support effective training, we develop a comprehensive data generation pipeline that efficiently produces large-scale, high-quality document parsing training data for VLMs. Extensive experiments on OmniDocBench and olmOCR-bench demonstrate that our method not only significantly improves decoding speed (1.6x-2.2x) but also reduces model hallucinations and exhibits strong generalization abilities.
[34] Practicing with Language Models Cultivates Human Empathic Communication cs.CL | cs.HCPDF
Aakriti Kumar, Nalin Poungpeth, Diyi Yang, Bruce Lambert, Matthew Groh
TL;DR: 该论文开发了一个名为’Lend an Ear’的实验性对话平台,通过让参与者与扮演个人和工作困境的LLM进行对话来练习共情沟通。研究从大量对话数据中推导出自然对话中惯用共情表达的分类法,并通过预注册随机实验证明,提供个性化反馈的LLM辅导干预能显著提升参与者的沟通模式与规范性共情模式的匹配度。
Details
Motivation: 尽管大型语言模型(LLM)生成的回应在盲评中常被认为比人类写的更具共情力,但当回应被归因于AI时,接收者会感到不如归因于人类时被倾听和认可。本研究旨在探究并解决这一共情沟通技能的差距。
Result: 基于预注册随机实验的证据表明,与对照组以及接受视频化但非个性化反馈的组相比,提供关于如何有效沟通共情的个性化反馈的简短LLM辅导干预,能显著提升参与者沟通模式与规范性共情沟通模式的匹配度。参与者能可靠地识别出符合规范性共情沟通标准的回应更具共情表达力。
Insight: 创新点在于构建了一个数据驱动的、用于自然对话的惯用共情表达分类法,并证明了基于AI的、可扩展的个性化辅导干预在培养人类共情沟通方面的有效性。研究还揭示了’沉默共情效应’,即人们感受到共情但系统性地未能表达出来。
Abstract: Empathy is central to human connection, yet people often struggle to express it effectively. In blinded evaluations, large language models (LLMs) generate responses that are often judged more empathic than human-written ones. Yet when a response is attributed to AI, recipients feel less heard and validated than when comparable responses are attributed to a human. To probe and address this gap in empathic communication skill, we built Lend an Ear, an experimental conversation platform in which participants are asked to offer empathic support to an LLM role-playing personal and workplace troubles. From 33,938 messages spanning 2,904 text-based conversations between 968 participants and their LLM conversational partners, we derive a data-driven taxonomy of idiomatic empathic expressions in naturalistic dialogue. Based on a pre-registered randomized experiment, we present evidence that a brief LLM coaching intervention offering personalized feedback on how to effectively communicate empathy significantly boosts alignment of participants’ communication patterns with normative empathic communication patterns relative to both a control group and a group that received video-based but non-personalized feedback. Moreover, we find evidence for a silent empathy effect that people feel empathy but systematically fail to express it. Nonetheless, participants reliably identify responses aligned with normative empathic communication criteria as more expressive of empathy. Together, these results advance the scientific understanding of how empathy is expressed and valued and demonstrate a scalable, AI-based intervention for scaffolding and cultivating it.
[35] From Documents to Spans: Code-Centric Learning for LLM-based ICD Coding cs.CL | cs.AIPDF
Xu Zhang, Wenxin Ma, Chenxu Wu, Rongsheng Wang, Kun Zhang
TL;DR: 本文提出了一种名为Code-Centric Learning的训练框架,用于解决基于大语言模型(LLM)的ICD编码任务面临的三大挑战:ICD代码空间覆盖有限、微调后模型可解释性下降以及长临床文档处理的计算成本高昂。该框架的核心思想是将监督信号从完整的临床文档转移到可扩展的短证据片段(spans)上,通过混合训练策略和以代码为中心的数据扩展,降低训练成本,提升对未见代码的泛化能力,并保持模型的可解释性。
Details
Motivation: 解决LLM在ICD编码任务中微调时面临的三大挑战:公开数据集对ICD代码空间覆盖不足导致泛化能力受限、微调会削弱LLM的可解释性,以及处理长临床文档带来的高昂计算成本。
Result: 在相同的LLM骨干网络下,该方法显著超越了多个强基线模型。特别地,该方法使得较小规模的LLM能够达到与更大规模专有模型相当的性能,证明了其在全自动ICD编码中的有效性和潜力。
Insight: 创新点在于将监督学习从文档级别转移到片段级别(span-level learning),通过混合训练和代码中心的数据扩展,在降低计算成本的同时,提升了对未见代码的泛化能力和模型的可解释性。从客观角度看,这种“从文档到片段”的范式转变,为解决长文档、稀疏标签和可解释性要求高的任务提供了一种高效且可扩展的训练思路。
Abstract: ICD coding is a critical yet challenging task in healthcare. Recently, LLM-based methods demonstrate stronger generalization than discriminative methods in ICD coding. However, fine-tuning LLMs for ICD coding faces three major challenges. First, existing public ICD coding datasets provide limited coverage of the ICD code space, restricting a model’s ability to generalize to unseen codes. Second, naive fine-tuning diminishes the interpretability of LLMs, as few public datasets contain explicit supporting evidence for assigned codes. Third, ICD coding typically involves long clinical documents, making fine-tuning LLMs computationally expensive. To address these issues, we propose Code-Centric Learning, a training framework that shifts supervision from full clinical documents to scalable, short evidence spans. The key idea of this framework is that span-level learning improves LLMs’ ability to perform document-level ICD coding. Our proposed framework consists of a mixed training strategy and code-centric data expansion, which substantially reduces training cost, improves accuracy on unseen ICD codes and preserves interpretability. Under the same LLM backbone, our method substantially outperforms strong baselines. Notably, our method enables small-scale LLMs to achieve performance comparable to much larger proprietary models, demonstrating its effectiveness and potential for fully automated ICD coding.
[36] PYTHEN: A Flexible Framework for Legal Reasoning in Python cs.CLPDF
Ha-Thanh Nguyen, Ken Satoh
TL;DR: 本文提出了PYTHEN,一个基于Python的可废止法律推理新框架。它旨在建模法律论证固有的可废止性,提供灵活的语法来表示法律规则、条件和例外。该框架利用Python内置的any()和all()函数,在单个规则内原生支持合取(ALL)和析取(ANY)条件,并提供了更具表现力的异常处理机制。
Details
Motivation: 解决法律论证固有的可废止性建模问题,并弥合符号推理与Python可访问性之间的鸿沟,旨在让缺乏广泛逻辑编程专业知识的年轻研究人员、法律科技开发者和专业人士也能进行形式化法律推理。
Result: 论文详细介绍了PYTHEN的架构,并与PROLEG进行了比较分析,讨论了其在自动形式化和下一代法律AI系统开发中的潜在应用。
Insight: 创新点在于将Python的灵活性与可废止法律推理相结合,利用any()和all()函数原生支持规则内的合取和析取条件,以及更强大的异常处理机制,为法律AI提供了一个更易访问和实用的符号推理工具。从客观角度看,其将逻辑编程的符号推理能力与Python的丰富生态相结合的设计理念具有借鉴意义。
Abstract: This paper introduces PYTHEN, a novel Python-based framework for defeasible legal reasoning. PYTHEN is designed to model the inherently defeasible nature of legal argumentation, providing a flexible and intuitive syntax for representing legal rules, conditions, and exceptions. Inspired by PROLEG (PROlog-based LEGal reasoning support system) and guided by the philosophy of The Zen of Python, PYTHEN leverages Python’s built-in any() and all() functions to offer enhanced flexibility by natively supporting both conjunctive (ALL) and disjunctive (ANY) conditions within a single rule, as well as a more expressive exception-handling mechanism. This paper details the architecture of PYTHEN, provides a comparative analysis with PROLEG, and discusses its potential applications in autoformalization and the development of next-generation legal AI systems. By bridging the gap between symbolic reasoning and the accessibility of Python, PYTHEN aims to democratize formal legal reasoning for young researchers, legal tech developers, and professionals without extensive logic programming expertise. We position PYTHEN as a practical bridge between the powerful symbolic reasoning capabilities of logic programming and the rich, ubiquitous ecosystem of Python, making formal legal reasoning accessible to a broader range of developers and legal professionals.
[37] DOS: Dependency-Oriented Sampler for Masked Diffusion Language Models cs.CL | stat.MLPDF
Xueyu Zhou, Yangrong Hu, Jian Huang
TL;DR: 本文提出了一种名为依赖导向采样器(DOS)的训练无关解码策略,用于改进掩码扩散语言模型(MDLM)的生成过程。该方法利用Transformer注意力矩阵近似token间依赖关系,在更新掩码位置时强调未掩码token的信息,以提升生成质量。实验表明,DOS在代码生成和数学推理任务上表现优异,并能与现有并行采样方法无缝集成,在不牺牲质量的前提下提高效率。
Details
Motivation: 现有预训练MDLM的解码策略主要依赖token级不确定性标准,而忽略了序列级信息和token间依赖关系,导致生成质量受限。
Result: 在代码生成和数学推理任务上,DOS一致实现了优越性能;与并行采样方法结合后,能在不牺牲生成质量的情况下提高生成效率。
Insight: 创新点在于利用Transformer注意力矩阵显式建模token间依赖关系来指导采样过程,这是一种无需额外训练、可即插即用的解码策略,增强了MDLM的序列级生成能力。
Abstract: Masked diffusion language models (MDLMs) have recently emerged as a new paradigm in language modeling, offering flexible generation dynamics and enabling efficient parallel decoding. However, existing decoding strategies for pre-trained MDLMs predominantly rely on token-level uncertainty criteria, while largely overlooking sequence-level information and inter-token dependencies. To address this limitation, we propose Dependency-Oriented Sampler (DOS), a training-free decoding strategy that leverages inter-token dependencies to inform token updates during generation. Specifically, DOS exploits attention matrices from transformer blocks to approximate inter-token dependencies, emphasizing information from unmasked tokens when updating masked positions. Empirical results demonstrate that DOS consistently achieves superior performance on both code generation and mathematical reasoning tasks. Moreover, DOS can be seamlessly integrated with existing parallel sampling methods, leading to improved generation efficiency without sacrificing generation quality.
[38] A Closer Look into LLMs for Table Understanding cs.CL | cs.AIPDF
Jia Wang, Chuanyu Qin, Mingyu Zheng, Qingyi Si, Peize Li
TL;DR: 本文对16个大语言模型(包括通用LLM、专用表格LLM和MoE模型)进行了实证研究,探索LLM如何理解表格数据并执行下游任务,重点关注注意力动态、有效层深度、专家激活和输入设计影响四个维度。研究发现LLM遵循三阶段注意力模式,表格任务需要比数学推理更深的层才能稳定预测,MoE模型在中间层激活表格专用专家,而思维链提示能增加表格注意力。
Details
Motivation: 尽管LLM在表格理解方面取得成功,但其内部机制仍不明确,本文旨在通过实证研究揭示LLM理解表格数据的工作原理。
Result: 研究在多个模型上进行了分析,发现LLM遵循三阶段注意力模式(早期层广泛扫描、中间层定位相关单元格、后期层放大贡献),表格任务需要比数学推理更深的层(约后1/3层)才能达到稳定预测,MoE模型在中间层激活表格专用专家,思维链提示能增加表格注意力(经表格调优后进一步增强)。
Insight: 创新点在于首次系统地从多个维度(注意力动态、层深度、专家激活、输入设计)实证分析LLM的表格理解机制,揭示了LLM处理表格的三阶段注意力模式、表格任务对深层表示的依赖以及MoE模型中表格专用专家的激活模式,为可解释性和表格相关任务研究提供了新见解。
Abstract: Despite the success of Large Language Models (LLMs) in table understanding, their internal mechanisms remain unclear. In this paper, we conduct an empirical study on 16 LLMs, covering general LLMs, specialist tabular LLMs, and Mixture-of-Experts (MoE) models, to explore how LLMs understand tabular data and perform downstream tasks. Our analysis focus on 4 dimensions including the attention dynamics, the effective layer depth, the expert activation, and the impacts of input designs. Key findings include: (1) LLMs follow a three-phase attention pattern – early layers scan the table broadly, middle layers localize relevant cells, and late layers amplify their contributions; (2) tabular tasks require deeper layers than math reasoning to reach stable predictions; (3) MoE models activate table-specific experts in middle layers, with early and late layers sharing general-purpose experts; (4) Chain-of-Thought prompting increases table attention, further enhanced by table-tuning. We hope these findings and insights can facilitate interpretability and future research on table-related tasks.
[39] Fusian: Multi-LoRA Fusion for Fine-Grained Continuous MBTI Personality Control in Large Language Models cs.CLPDF
Zehao Chen, Rong Pan
TL;DR: 本文提出了Fusian框架,用于在大型语言模型中实现细粒度、连续的人格控制。该框架通过收集SFT过程中的LoRA适配器序列来捕捉人格演变的轨迹,并利用强化学习训练策略网络动态融合这些适配器,从而实现对MBTI人格特质强度的精确连续控制。在Qwen3-14B模型上的实验表明,Fusian在人格控制精度上显著优于基线方法。
Details
Motivation: 现有的人格控制方法(如提示工程和标准监督微调)通常将人格特质视为离散类别,缺乏在连续谱上精确控制特质强度的能力。
Result: 在Qwen3-14B模型上的实验表明,Fusian在人格控制上实现了高精度,在符合用户指定特质强度方面显著优于基线方法。
Insight: 创新点在于将人格控制建模为连续空间问题,通过收集LoRA适配器轨迹来映射特质的连续流形,并利用强化学习进行动态融合以实现精确强度控制;客观来看,该方法将模型微调过程本身视为可学习的资源,通过融合中间检查点来实现连续控制,为可控文本生成提供了新思路。
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in simulating diverse human behaviors and personalities. However, existing methods for personality control, which include prompt engineering and standard Supervised Fine-Tuning (SFT), typically treat personality traits as discrete categories (e.g., “Extroverted” vs. “Introverted”), lacking the ability to precisely control the intensity of a trait on a continuous spectrum. In this paper, we introduce Fusian, a novel framework for fine-grained, continuous personality control in LLMs. Fusian operates in two stages: (1) Trajectory Collection, where we capture the dynamic evolution of personality adoption during SFT by saving a sequence of LoRA adapters, effectively mapping the continuous manifold of a trait; and (2) RL-based Dynamic Fusion, where we train a policy network using Reinforcement Learning to dynamically compute mixing weights for these frozen adapters. By sampling from a Dirichlet distribution parameterized by the policy network, Fusian fuses multiple adapters to align the model’s output with a specific numerical target intensity. Experiments on the Qwen3-14B model demonstrate that Fusian achieves high precision in personality control, significantly outperforming baseline methods in aligning with user-specified trait intensities.
[40] SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia cs.CLPDF
Pengfei Yue, Xingran Zhao, Juntao Chen, Peng Hou, Wang Longchao
TL;DR: 本文介绍了SEA-Vision,一个针对东南亚11种语言的多语言文档与场景文本理解基准测试,包含文档解析和以文本为中心的视觉问答(TEC-VQA)两个任务。该基准包含15,234个文档解析页面和7,496个TEC-VQA问答对,覆盖多种文档类型和复杂的文本理解能力。作者还设计了一个结合自动化过滤、MLLM辅助标注和母语者验证的混合标注流程,以高效构建高质量数据集。评估显示,现有先进多模态模型在东南亚低资源语言上性能显著下降。
Details
Motivation: 现有基准大多关注高资源语言,缺乏对现实多语言环境(尤其是东南亚地区语言多样、书写系统复杂、文档类型多变)下模型能力的评估,因此需要构建一个全面的多语言文档与场景文本理解基准。
Result: 在SEA-Vision基准上评估了多个领先的多模态模型,发现它们在东南亚低资源语言上的性能出现显著退化,揭示了多语言文档和场景文本理解领域仍存在巨大差距。
Insight: 创新点在于构建了首个针对东南亚多语言的综合性文档与场景文本理解基准,并设计了一个高效的混合标注流程(结合自动化与MLLM辅助,辅以轻量级人工验证),为低资源语言的多模态理解研究提供了重要数据和评估标准。
Abstract: Multilingual document and scene text understanding plays an important role in applications such as search, finance, and public services. However, most existing benchmarks focus on high-resource languages and fail to evaluate models in realistic multilingual environments. In Southeast Asia, the diversity of languages, complex writing systems, and highly varied document types make this challenge even greater. We introduce SEA-Vision, a benchmark that jointly evaluates Document Parsing and Text-Centric Visual Question Answering (TEC-VQA) across 11 Southeast Asian languages. SEA-Vision contains 15,234 document parsing pages from nine representative document types, annotated with hierarchical page-, block-, and line-level labels. It also provides 7,496 TEC-VQA question-answer pairs that probe text recognition, numerical calculation, comparative analysis, logical reasoning, and spatial understanding. To make such multilingual, multi-task annotation feasible, we design a hybrid pipeline for Document Parsing and TEC-VQA. It combines automated filtering and scoring with MLLM-assisted labeling and lightweight native-speaker verification, greatly reducing manual labeling while maintaining high quality. We evaluate several leading multimodal models and observe pronounced performance degradation on low-resource Southeast Asian languages, highlighting substantial remaining gaps in multilingual document and scene text understanding. We believe SEA-Vision will help drive global progress in document and scene text understanding.
[41] CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents cs.CL | cs.AIPDF
Taeyun Roh, Wonjune Jang, Junha Jung, Jaewoo Kang
TL;DR: CLAG是一个基于聚类的智能体记忆框架,专为小型语言模型(SLM)代理设计,通过代理驱动的聚类自适应组织记忆,以解决全局检索池中知识稀释或污染的问题。该框架使用SLM驱动的路由器将记忆分配到语义连贯的集群中,并自主生成集群特定配置文件(如主题摘要和描述性标签),使每个集群成为独立的功能单元。通过在这些结构化邻域内进行局部演化,CLAG减少了跨主题干扰并提高了内部记忆密度。在检索时,采用两阶段过程先通过配置文件过滤相关集群,从而排除干扰并缩小搜索空间。
Details
Motivation: 大型语言模型代理严重依赖外部记忆来支持知识重用和复杂推理任务,但大多数记忆系统将经验存储在单个全局检索池中,这可能导致存储的知识逐渐稀释或污染。对于小型语言模型(SLM)尤其突出,因为它们对无关上下文高度敏感。
Result: 在多个QA数据集和三种SLM骨干上的实验表明,CLAG在代理的记忆系统中持续提高了答案质量和鲁棒性,同时保持轻量高效。
Insight: 创新点包括:SLM驱动的主动聚类记忆组织、集群特定配置文件的自主生成、局部演化以减少干扰,以及两阶段检索过程。从客观角度看,该方法通过结构化记忆邻域增强了SLM的记忆管理能力,为资源受限环境下的代理系统提供了可扩展的解决方案。
Abstract: Large language model agents heavily rely on external memory to support knowledge reuse and complex reasoning tasks. Yet most memory systems store experiences in a single global retrieval pool which can gradually dilute or corrupt stored knowledge. This problem is especially pronounced for small language models (SLMs), which are highly vulnerable to irrelevant context. We introduce CLAG, a CLustering-based AGentic memory framework where an SLM agent actively organizes memory by clustering. CLAG employs an SLM-driven router to assign incoming memories to semantically coherent clusters and autonomously generates cluster-specific profiles, including topic summaries and descriptive tags, to establish each cluster as a self-contained functional unit. By performing localized evolution within these structured neighborhoods, CLAG effectively reduces cross-topic interference and enhances internal memory density. During retrieval, the framework utilizes a two-stage process that first filters relevant clusters via their profiles, thereby excluding distractors and reducing the search space. Experiments on multiple QA datasets with three SLM backbones show that CLAG consistently improves answer quality and robustness over prior memory systems for agents, remaining lightweight and efficient.
[42] ViX-Ray: A Vietnamese Chest X-Ray Dataset for Vision-Language Models cs.CLPDF
Duy Vu Minh Nguyen, Chinh Thanh Truong, Phuc Hoang Tran, Hung Tuan Le, Nguyen Van-Thanh Dat
TL;DR: 本文介绍了ViX-Ray,一个包含5400张越南语标注的胸部X光图像数据集,旨在解决现有视觉语言模型在越南医疗领域缺乏数据暴露的问题。作者分析了数据集中越南放射学报告的语言特征,并微调了多个开源视觉语言模型,与GPT-4V和Gemini等专有模型进行性能比较。结果显示,尽管部分模型能生成与临床真实情况部分对齐的输出,但普遍存在精度低和过度幻觉的问题,特别是在印象生成方面,突显了该数据集的复杂性和挑战性。
Details
Motivation: 越南医疗研究日益重要,但现有视觉语言模型缺乏越南医疗数据,限制了为越南患者生成准确诊断输出的能力,因此需要构建专门的数据集以促进AI在越南医疗领域的应用。
Result: 在ViX-Ray数据集上微调了五个最先进的开源视觉语言模型,并与GPT-4V和Gemini进行比较,结果显示模型输出与临床真实情况部分对齐,但精度低且幻觉问题严重,尤其是在印象生成中,表明数据集具有挑战性,可作为越南临床领域评估视觉语言模型的基准。
Insight: 创新点在于构建了首个越南语胸部X光数据集ViX-Ray,并分析了越南放射学报告的语言特征,为视觉语言模型在特定语言医疗领域的适应提供了数据支持和评估基准,揭示了模型在医疗诊断中面临的幻觉和精度挑战。
Abstract: Vietnamese medical research has become an increasingly vital domain, particularly with the rise of intelligent technologies aimed at reducing time and resource burdens in clinical diagnosis. Recent advances in vision-language models (VLMs), such as Gemini and GPT-4V, have sparked a growing interest in applying AI to healthcare. However, most existing VLMs lack exposure to Vietnamese medical data, limiting their ability to generate accurate and contextually appropriate diagnostic outputs for Vietnamese patients. To address this challenge, we introduce ViX-Ray, a novel dataset comprising 5,400 Vietnamese chest X-ray images annotated with expert-written findings and impressions from physicians at a major Vietnamese hospital. We analyze linguistic patterns within the dataset, including the frequency of mentioned body parts and diagnoses, to identify domain-specific linguistic characteristics of Vietnamese radiology reports. Furthermore, we fine-tune five state-of-the-art open-source VLMs on ViX-Ray and compare their performance to leading proprietary models, GPT-4V and Gemini. Our results show that while several models generate outputs partially aligned with clinical ground truths, they often suffer from low precision and excessive hallucination, especially in impression generation. These findings not only demonstrate the complexity and challenge of our dataset but also establish ViX-Ray as a valuable benchmark for evaluating and advancing vision-language models in the Vietnamese clinical domain.
[43] Can LLMs Model Incorrect Student Reasoning? A Case Study on Distractor Generation cs.CL | cs.AI | cs.HCPDF
Yanick Zengaffinen, Andreas Opedal, Donya Rooein, Kv Aditya Srivatsa, Shashank Sonkar
TL;DR: 本文研究了大型语言模型(LLMs)在生成多项选择题干扰项(distractors)时如何对学生的错误推理进行建模。作者引入了一个分类法来分析最先进LLMs所使用的策略,发现其推理过程与学习科学中的最佳实践存在显著一致性:模型通常先正确解决问题,然后阐述并模拟多种可能的误解,最后选择一组干扰项。分析表明,错误主要源于未能正确求解问题或在候选答案中选择失误,而非模拟错误或流程结构问题。提供正确答案作为提示可将模型生成的干扰项与人工编写干扰项的对齐度提高8%。
Details
Motivation: 在教育人工智能领域,准确建模学生可能存在的误解至关重要。本文旨在探究LLMs在生成需要模拟错误但合理答案的多项选择题干扰项时,如何协调解题知识、模拟学生误解并评估合理性,从而解决LLMs是否能够有效建模错误学生推理的问题。
Result: 在干扰项生成任务上,分析揭示了LLMs的推理过程与最佳实践高度一致。通过提供正确答案作为提示,模型生成的干扰项与人工编写干扰项的对齐度(alignment)提高了8%。研究主要进行了定性分析,揭示了错误模式,并强调了锚定正确答案对于生成高质量干扰项的关键作用。
Insight: 论文的创新点在于引入了一个结构化的分类法来分析和解释LLMs在模拟错误推理时的内部策略,发现其过程(先求解、再模拟误解、最后选择)与人类最佳实践相符。客观来看,该研究为理解LLMs的“推理”过程提供了一个可解释的视角,并强调了在提示工程中提供正确答案(锚定)对于引导模型生成合理错误答案的重要性,这对教育AI应用具有借鉴意义。
Abstract: Modeling plausible student misconceptions is critical for AI in education. In this work, we examine how large language models (LLMs) reason about misconceptions when generating multiple-choice distractors, a task that requires modeling incorrect yet plausible answers by coordinating solution knowledge, simulating student misconceptions, and evaluating plausibility. We introduce a taxonomy for analyzing the strategies used by state-of-the-art LLMs, examining their reasoning procedures and comparing them to established best practices in the learning sciences. Our structured analysis reveals a surprising alignment between their processes and best practices: the models typically solve the problem correctly first, then articulate and simulate multiple potential misconceptions, and finally select a set of distractors. An analysis of failure modes reveals that errors arise primarily from failures in recovering the correct solution and selecting among response candidates, rather than simulating errors or structuring the process. Consistent with these results, we find that providing the correct solution in the prompt improves alignment with human-authored distractors by 8%, highlighting the critical role of anchoring to the correct solution when generating plausible incorrect student reasoning. Overall, our analysis offers a structured and interpretable lens into LLMs’ ability to model incorrect student reasoning and produce high-quality distractors.
[44] Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning cs.CLPDF
Aozhe Wang, Yuchen Yan, Nan Zhou, Zhengxi Lu, Weiming Lu
TL;DR: 本文提出了Code-A1框架,通过强化学习让代码大语言模型和测试大语言模型进行对抗性协同进化,以解决代码生成中高质量测试套件稀缺和静态奖励不适应模型改进的问题。
Details
Motivation: 动机是解决代码生成强化学习中依赖单元测试通过率作为奖励的局限性,包括高质量测试套件稀缺、现有数据集覆盖有限、静态奖励无法适应模型改进,以及现有自博弈方法中存在的自我合谋或生成通用测试的问题。
Result: 在Qwen2.5-Coder模型上的实验表明,Code-A1实现的代码生成性能达到或超过了使用人工标注测试训练的模型,同时显著提升了测试生成能力。
Insight: 创新点在于将代码和测试生成分离为两个具有对立目标的模型进行对抗性协同优化,消除了自我合谋风险并允许白盒测试生成;此外还引入了错误簿机制进行经验回放,以及平衡测试有效性与对抗难度的复合奖励。
Abstract: Reinforcement learning for code generation relies on verifiable rewards from unit test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewards fail to adapt as models improve. Recent self-play methods unify code and test generation in a single model, but face a inherent dilemma: white-box access leads to self-collusion where the model produces trivial tests for easy rewards, yet black-box restriction yields generic tests that miss implementation-specific bugs. We introduce Code-A1, an adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM with opposing objectives. The Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects. This architectural separation eliminates self-collusion risks and safely enables white-box test generation, where the Test LLM can inspect candidate code to craft targeted adversarial tests. We further introduce a Mistake Book mechanism for experience replay and a composite reward balancing test validity with adversarial difficulty. Experiments on Qwen2.5-Coder models demonstrate that Code-A1 achieves code generation performance matching or exceeding models trained on human-annotated tests, while significantly improving test generation capability.
[45] Mechanistic Origin of Moral Indifference in Language Models cs.CL | cs.AIPDF
Lingyu Li, Yan Teng, Yingchun Wang
TL;DR: 该论文揭示了大型语言模型在道德对齐方面存在表面遵从与内部表征未对齐的深层问题,特别是模型将不同道德概念压缩为均匀概率分布导致的道德冷漠状态。研究基于原型理论和社会化学数据集构建了25.1万个道德向量,分析了23个模型,发现现有模型无法区分对立道德类别及其内部典型性梯度。通过稀疏自编码器对Qwen3-8B进行干预,重构道德特征的拓扑关系,实现了表征对齐,从而提升了道德推理能力。
Details
Motivation: 解决现有大型语言模型行为对齐技术忽视表面遵从与内部表征未对齐的差异问题,以及模型因道德概念压缩导致的固有道德冷漠状态,以应对长尾风险。
Result: 在独立对抗基准Flames上取得了75%的成对胜率,表明表征对齐显著提升了道德推理的精细度和鲁棒性。
Insight: 创新点在于从表征层面诊断和干预模型的道德冷漠,利用稀疏自编码器分离单语义道德特征并重构其拓扑关系;客观来看,该方法为AI对齐提供了从后验修正到主动培育的新视角,强调了内部表征对齐的重要性。
Abstract: Existing behavioral alignment techniques for Large Language Models (LLMs) often neglect the discrepancy between surface compliance and internal unaligned representations, leaving LLMs vulnerable to long-tail risks. More crucially, we posit that LLMs possess an inherent state of moral indifference due to compressing distinct moral concepts into uniform probability distributions. We verify and remedy this indifference in LLMs’ latent representations, utilizing 251k moral vectors constructed upon Prototype Theory and the Social-Chemistry-101 dataset. Firstly, our analysis across 23 models reveals that current LLMs fail to represent the distinction between opposed moral categories and fine-grained typicality gradients within these categories; notably, neither model scaling, architecture, nor explicit alignment reshapes this indifference. We then employ Sparse Autoencoders on Qwen3-8B, isolate mono-semantic moral features, and targetedly reconstruct their topological relationships to align with ground-truth moral vectors. This representational alignment naturally improves moral reasoning and granularity, achieving a 75% pairwise win-rate on the independent adversarial Flames benchmark. Finally, we elaborate on the remedial nature of current intervention methods from an experientialist philosophy, arguing that endogenously aligned AI might require a transformation from post-hoc corrections to proactive cultivation.
cs.CV [Back]
[46] KazakhOCR: A Synthetic Benchmark for Evaluating Multimodal Models in Low-Resource Kazakh Script OCR cs.CV | cs.CLPDF
Henry Gagnier, Sophie Gagnier, Ashwin Kirubakaran
TL;DR: 本文构建了一个名为KazakhOCR的合成基准数据集,包含哈萨克语三种文字(阿拉伯、西里尔和拉丁字母)的7,219张图像,用于评估多模态大语言模型在低资源哈萨克文字OCR任务上的性能。评估发现,当前主流MLLM在拉丁和阿拉伯字母OCR上均失败,且无法识别哈萨克阿拉伯字母文本,性能远低于传统OCR基线。
Details
Motivation: 哈萨克语使用阿拉伯、西里尔和拉丁三种文字,其OCR研究在低资源脚本(尤其是阿拉伯和拉丁字母)方面极度缺乏,既无基准数据集也无真实图像,因此需要构建合成数据集并评估现有MLLM在此类任务上的能力差距。
Result: 在构建的基准子集上评估了Gemma-3-12B-it、Qwen2.5-VL-7B-Instruct和Llama-3.2-11B-Vision-Instruct三种MLLM,所有模型在拉丁和阿拉伯字母OCR上均失败,并将哈萨克阿拉伯字母文本误分类为阿拉伯语、波斯语和库尔德语;传统OCR基线虽然字符错误率较低,但MLLM仍无法达到其性能水平。
Insight: 创新点在于首次为低资源哈萨克语三种文字构建了合成OCR基准数据集,并系统揭示了当前MLLM在处理基于Abjad的低资源脚本(如哈萨克阿拉伯字母)时存在严重能力缺陷,强调了开发支持低资源语言和文字的包容性模型与基准的紧迫性。
Abstract: Kazakh is a Turkic language using the Arabic, Cyrillic, and Latin scripts, making it unique in terms of optical character recognition (OCR). Work on OCR for low-resource Kazakh scripts is very scarce, and no OCR benchmarks or images exist for the Arabic and Latin scripts. We construct a synthetic OCR dataset of 7,219 images for all three scripts with font, color, and noise variations to imitate real OCR tasks. We evaluated three multimodal large language models (MLLMs) on a subset of the benchmark for OCR and language identification: Gemma-3-12B-it, Qwen2.5-VL-7B-Instruct, and Llama-3.2-11B-Vision-Instruct. All models are unsuccessful with Latin and Arabic script OCR, and fail to recognize the Arabic script as Kazakh text, misclassifying it as Arabic, Farsi, and Kurdish. We further compare MLLMs with a classical OCR baseline and find that while traditional OCR has lower character error rates, MLLMs fail to match this performance. These findings show significant gaps in current MLLM capabilities to process low-resource Abjad-based scripts and demonstrate the need for inclusive models and benchmarks supporting low-resource scripts and languages.
[47] Gloss-Free Sign Language Translation: An Unbiased Evaluation of Progress in the Field cs.CV | cs.CLPDF
Ozge Mercanoglu Sincan, Jian He Low, Sobhan Asasi, Richard Bowden
TL;DR: 本文对无注释手语翻译(gloss-free SLT)领域的研究进展进行了系统性评估,通过统一代码库重新实现关键模型,并在标准化预处理、视频编码器和训练设置下进行公平比较。研究发现,许多文献中报告的性能提升在一致条件下会减弱,表明实现细节和评估设置对结果有显著影响。
Details
Motivation: 解决手语翻译领域性能改进的真实来源不明确的问题,即性能提升是源于方法创新还是实现细节(如主干网络选择、训练优化、超参数调整或评估指标计算差异)。
Result: 在标准化评估条件下,许多文献中报告的性能提升显著减弱,表明当前SLT领域的进展可能被高估。
Insight: 强调了在手语翻译研究中实现细节和评估设置标准化的重要性,为领域提供了可复现的基准代码库,以促进透明度和可复现性。
Abstract: Sign Language Translation (SLT) aims to automatically convert visual sign language videos into spoken language text and vice versa. While recent years have seen rapid progress, the true sources of performance improvements often remain unclear. Do reported performance gains come from methodological novelty, or from the choice of a different backbone, training optimizations, hyperparameter tuning, or even differences in the calculation of evaluation metrics? This paper presents a comprehensive study of recent gloss-free SLT models by re-implementing key contributions in a unified codebase. We ensure fair comparison by standardizing preprocessing, video encoders, and training setups across all methods. Our analysis shows that many of the performance gains reported in the literature often diminish when models are evaluated under consistent conditions, suggesting that implementation details and evaluation setups play a significant role in determining results. We make the codebase publicly available here (https://github.com/ozgemercanoglu/sltbaselines) to support transparency and reproducibility in SLT research.
[48] Benchmarking Compact VLMs for Clip-Level Surveillance Anomaly Detection Under Weak Supervision cs.CV | cs.LGPDF
Kirill Borodin, Kirill Kondrashov, Nikita Vasiliev, Ksenia Gladkova, Inna Larina
TL;DR: 本文研究了在弱监督下,将紧凑型视觉语言模型(VLMs)应用于监控视频片段级异常检测任务。通过建立统一的评估协议,比较了参数高效适配的紧凑型VLMs、无需训练的VLM流程以及弱监督基线方法。结果表明,经过参数高效微调的紧凑型VLMs在检测性能上达到甚至超越了现有方法,同时保持了有竞争力的每片段延迟,实现了更优的准确性与效率权衡。
Details
Motivation: 解决监控安全领域在弱监督条件下,需要兼具可靠的片段级检测精度和可预测的每片段延迟的异常检测器问题。
Result: 在统一的评估协议下,参数高效适配的紧凑型VLMs在准确率、精确率、召回率、F1分数、ROC-AUC和平均每片段延迟等指标上,性能与现有方法相当甚至更优,同时保持了有竞争力的延迟。
Insight: 论文宣称的创新点在于提出了一个统一的评估协议来标准化比较,并证明了参数高效微调能使紧凑型VLMs成为可靠的片段级异常检测器。从客观角度看,其将参数高效适配技术应用于紧凑型VLMs以平衡检测性能与推理效率的思路,以及对提示词敏感性降低的观察,具有借鉴意义。
Abstract: CCTV safety monitoring demands anomaly detectors combine reliable clip-level accuracy with predictable per-clip latency despite weak supervision. This work investigates compact vision-language models (VLMs) as practical detectors for this regime. A unified evaluation protocol standardizes preprocessing, prompting, dataset splits, metrics, and runtime settings to compare parameter-efficiently adapted compact VLMs against training-free VLM pipelines and weakly supervised baselines. Evaluation spans accuracy, precision, recall, F1, ROC-AUC, and average per-clip latency to jointly quantify detection quality and efficiency. With parameter-efficient adaptation, compact VLMs achieve performance on par with, and in several cases exceeding, established approaches while retaining competitive per-clip latency. Adaptation further reduces prompt sensitivity, producing more consistent behavior across prompt regimes under the shared protocol. These results show that parameter-efficient fine-tuning enables compact VLMs to serve as dependable clip-level anomaly detectors, yielding a favorable accuracy-efficiency trade-off within a transparent and consistent experimental setup.
[49] Information-Theoretic Constraints for Continual Vision-Language-Action Alignment cs.CV | cs.AIPDF
Libang Zhao, Qixin Zeng, Hongyin Zhang, Donglin Wang
TL;DR: 本文提出Info-VLA框架,旨在解决视觉-语言-动作(VLA)模型在持续学习中面临的灾难性遗忘问题。该框架通过回放锚点对比学习和跨模态互信息最大化两个互补约束,来保持跨模态信息结构,从而在持续适应新技能时平衡稳定性和可塑性。
Details
Motivation: 动机是解决VLA模型在开放机器人环境中持续学习新技能时,因跨模态信息结构退化(视觉观察、语言指令和动作之间的依赖关系逐渐扩散)而导致的严重灾难性遗忘问题。
Result: 在LIBERO基准测试上的实验表明,Info-VLA在任务保持和适应能力方面显著优于现有方法。
Insight: 创新点在于从信息论角度出发,通过结合回放锚点对比学习(利用冻结教师模型构建稳定对齐锚点)和跨模态互信息最大化(通过互信息约束保持视觉与语言表示之间的依赖结构)来共同保持历史对齐和跨模态依赖信息,从而有效缓解持续学习中的遗忘问题。
Abstract: When deployed in open-ended robotic environments, Vision–Language–Action (VLA) models need to continually acquire new skills, yet suffer from severe catastrophic forgetting. We observe that this degradation is related to the deterioration of cross-modal information structure, where dependencies among visual observations, language instructions, and actions progressively diffuse during continual adaptation. But existing continual learning methods fail to preserve such cross-modal information dependencies. Thus, we propose Info-VLA, an information-preserving continual learning framework that maintains cross-modal information structure through two complementary constraints. Replay Anchor Contrastive Learning constructs stable alignment anchors from a frozen teacher model, preserving cross-modal alignment in the representation space. Cross-Modal Mutual Information Maximization further preserves dependency structure between visual and language representations through mutual information constraints. By jointly preserving historical alignment and cross-modal dependency information, Info-VLA balances stability and plasticity during continual learning. Furthermore, experiments on the LIBERO show that Info-VLA significantly outperforms existing methods in both task retention and adaptation.
[50] Complementarity-Supervised Spectral-Band Routing for Multimodal Emotion Recognition cs.CVPDF
Zhexian Huang, Bo Zhao, Hui Ma, Zhishu Liu, Jie Zhang
TL;DR: 本文提出了一种名为Atsuko的互补监督多频带专家网络,用于多模态情感识别。该方法通过多尺度频带分解将每个模态的特征正交分解为高、中、低频分量,并设计了一个具有双路径机制的模态级路由器,以实现细粒度的跨频带选择和跨模态融合。此外,还引入了边际互补模块来量化移除每个模态时的性能损失,从而提供软监督,引导路由器关注提供独特信息增益的模态。
Details
Motivation: 解决现有方法机械依赖单模态性能而忽略真正互补贡献,以及粗粒度融合与情感任务所需的细粒度表征相冲突的问题。异质模态间不一致的信息密度阻碍了跨模态特征挖掘。
Result: 在CMU-MOSI、CMU-MOSEI、CH-SIMS、CH-SIMSv2和MIntRec等多个基准测试上取得了优越的性能。
Insight: 创新点在于通过正交频带分解实现细粒度特征建模,并利用边际互补模块提供的互补性分布进行软监督,以缓解主导模态的捷径学习问题,从而更有效地挖掘跨模态的互补信息。
Abstract: Multimodal emotion recognition fuses cues such as text, video, and audio to understand individual emotional states. Prior methods face two main limitations: mechanically relying on independent unimodal performance, thereby missing genuine complementary contributions, and coarse-grained fusion conflicting with the fine-grained representations required by emotion tasks. As inconsistent information density across heterogeneous modalities hinders inter-modal feature mining, we propose the Complementarity-Supervised Multi-Band Expert Network, named Atsuko, to model fine-grained complementary features via multi-scale band decomposition and expert collaboration. Specifically, we orthogonally decompose each modality’s features into high, mid, and low-frequency components. Building upon this band-level routing, we design a modality-level router with a dual-path mechanism for fine-grained cross-band selection and cross-modal fusion. To mitigate shortcut learning from dominant modalities, we propose the Marginal Complementarity Module (MCM) to quantify performance loss when removing each modality via bi-modal comparison. The resulting complementarity distribution provides soft supervision, guiding the router to focus on modalities contributing unique information gains. Extensive experiments show our method achieves superior performance on the CMU-MOSI, CMU-MOSEI, CH-SIMS, CH-SIMSv2, and MIntRec benchmarks.
[51] Mind the Discriminability Trap in Source-Free Cross-domain Few-shot Learning cs.CV | cs.AIPDF
Zhenyu Zhang, Yixiong Zou, Yuhua Li, Ruixuan Li, Guangyao Chen
TL;DR: 本文研究了源自由跨域少样本学习(SF-CDFSL)任务中,基于视觉语言模型(如CLIP、SigLIP)微调时的一个反直觉现象:增强视觉模态的判别性反而会抑制模型性能。通过理论和实验分析,作者揭示了标准交叉熵损失包含的视觉学习部分会阻碍跨模态对齐,并提出了通过扰动视觉学习和利用视觉-文本语义关系来引导模型专注于跨模态对齐的解决方案。
Details
Motivation: 在基于视觉语言模型的SF-CDFSL任务中,作者发现与传统视觉模型的认知相反,增强视觉判别性会损害性能,因此旨在探究这一现象的根本原因并提供解决方案。
Result: 在多种设置、骨干网络(CLIP、SigLip、PE-Core)和任务(4个CDFSL数据集和11个FSL数据集)上的大量实验表明,该方法持续取得了新的最先进(SOTA)结果。
Insight: 论文的核心创新点在于揭示了在VLM微调中,视觉学习部分可能作为捷径阻碍关键的跨模态对齐,并提出了通过扰动视觉学习和渐进式语义对齐来规避这一“判别性陷阱”的有效方法,为源自由跨域少样本学习提供了新的优化视角。
Abstract: Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL) focuses on fine-tuning with limited training data from target domains (e.g., medical or satellite images), where Vision-Language Models (VLMs) such as CLIP and SigLIP have shown promising results. Current works in traditional visual models suggest that improving visual discriminability enhances performance. However, in VLM-based SF-CDFSL tasks, we find that \textbf{strengthening visual-modal discriminability actually suppresses VLMs’ performance}. In this paper, we aim to delve into this phenomenon for an interpretation and a solution. By both theoretical and experimental proofs, our study reveals that fine-tuning with the typical cross-entropy loss ($\mathcal{L}{\mathrm{vlm}}$) inherently includes a visual learning part and a cross-modal learning part, where the cross-modal part is crucial for rectifying the heavily disrupted modality misalignment in SF-CDFSL. However, we find that the visual learning essentially acts as a shortcut that encourages the model to reduce $\mathcal{L}{\mathrm{vlm}}$ without considering the cross-modal part, therefore hindering the cross-modal alignment and harming the performance. Based on this interpretation, we further propose an approach to address this problem: first, we perturb the visual learning to guide the model to focus on the cross-modal alignment. Then, we use the visual-text semantic relationships to gradually align the visual and textual modalities during the fine-tuning. Extensive experiments on various settings, backbones (CLIP, SigLip, PE-Core), and tasks (4 CDFSL datasets and 11 FSL datasets) show that we consistently set new state-of-the-art results. Code is available at https://github.com/zhenyuZ-HUST/CVPR26-Mind-the-Discriminability-Trap.
[52] DDS-UDA: Dual-Domain Synergy for Unsupervised Domain Adaptation in Joint Segmentation of Optic Disc and Optic Cup cs.CV | cs.AIPDF
Yusong Xiao, Yuxuan Wu, Li Xiao, Gang Qu, Haiye Huo
TL;DR: 本文提出了一种名为DDS-UDA的双域协同无监督域适应框架,用于解决视盘和视杯联合分割任务中的域偏移问题。该框架通过双向跨域一致性正则化模块和频率驱动的域内伪标签学习模块,在教师-学生架构下实现跨域干扰抑制与域内泛化增强,从而适应异构成像环境。
Details
Motivation: 卷积神经网络在单机构数据集上的视盘视杯联合分割性能优异,但其临床转化面临两大挑战:大规模高质量标注稀缺,以及部署时因成像协议和采集平台差异导致的域偏移性能下降。现有无监督域适应方法大多未能在统一框架内同时解决跨域干扰和域内泛化问题。
Result: 在两个多域眼底图像数据集上的综合评估表明,DDS-UDA在视盘和视杯分割任务上优于多种现有无监督域适应方法。
Insight: 创新点在于提出了一个统一的双域协同框架,结合了由粗到细动态掩码生成器引导的双向跨域一致性正则化(抑制噪声传播并保持结构连贯性)和频率驱动的域内伪标签学习(通过合成频谱幅度混合监督信号实现高保真特征对齐),有效解耦了域特定偏差与域不变特征表示。
Abstract: Convolutional neural networks (CNNs) have achieved exciting performance in joint segmentation of optic disc and optic cup on single-institution datasets. However, their clinical translation is hindered by two major challenges: limited availability of large-scale, high-quality annotations and performance degradation caused by domain shift during deployment across heterogeneous imaging protocols and acquisition platforms. While unsupervised domain adaptation (UDA) provides a way to mitigate these limitations, most existing approaches do not address cross-domain interference and intra-domain generalization within a unified framework. In this paper, we present the Dual-Domain Synergy UDA (DDS-UDA), a novel UDA framework that comprises two key modules. First, a bi-directional cross-domain consistency regularization module is enforced to mitigate cross-domain interference through feature-level semantic information exchange guided by a coarse-to-fine dynamic mask generator, suppressing noise propagation while preserving structural coherence. Second, a frequency-driven intra-domain pseudo label learning module is used to enhance intra-domain generalization by synthesizing spectral amplitude-mixed supervision signals, which ensures high-fidelity feature alignment across domains. Implemented within a teacher-student architecture, DDS-UDA disentangles domain-specific biases from domain-invariant feature-level representations, thereby achieving robust adaptation to heterogeneous imaging environments. We conduct a comprehensive evaluation of our proposed method on two multi-domain fundus image datasets, demonstrating that it outperforms several existing UDA based methods and therefore providing an effective way for optic disc and optic cup segmentation.
[53] MURE: Hierarchical Multi-Resolution Encoding via Vision-Language Models for Visual Document Retrieval cs.CV | cs.AIPDF
Fengbin Zhu, Zijing Cai, Yuzhe Wang, Pengyang Shao, Wenjie Wang
TL;DR: 本文提出了一种名为MURE的视觉文档检索框架,通过引入分层多分辨率编码范式,利用视觉语言模型捕获不同尺度的互补视觉线索,并结合分辨率级俄罗斯套娃表示学习和语义感知分层聚类机制,在保证检索效果的同时显著降低了计算开销。
Details
Motivation: 现有视觉文档检索模型在处理高分辨率文档时难以平衡效果与效率,要么丢失细粒度视觉信息,要么产生过多视觉令牌导致索引开销和检索延迟过高。
Result: 在两个广泛使用的视觉文档检索基准测试中,MURE框架持续超越强基线模型,并且仅使用ColPali模型50%的视觉令牌预算即显著优于后者。
Insight: 创新点在于提出了X-VisEmb范式(多分辨率采样编码→跨粒度特征融合→自适应表示蒸馏),采用VLMs作为分层多分辨率编码器,结合分辨率级俄罗斯套娃表示学习实现有效特征融合,并通过语义感知分层聚类机制压缩视觉令牌。
Abstract: Visual Document Retrieval (VDR) requires representations that capture both fine-grained visual details and global document structure to ensure retrieval efficacy while maintaining computational efficiency. Existing VDR models struggle to balance effectiveness and efficiency when processing high-resolution documents: they often either lose fine-grained information or generate an excessive number of visual tokens, resulting in significant indexing overhead and high retrieval latency. In this work, we rethink the visual encoding mechanism and propose a new X-VisEmb paradigm that progresses from multi-resolution sampling and encoding, through cross-granularity feature fusion, to adaptive representation distillation. A preliminary study validates its feasibility and effectiveness in capturing complementary visual cues at varying scales. Building on the insights, we develop MURE, a novel framework that employs VLMs as a hierarchical multi-resolution encoder, integrates resolution-level Matryoshka representation learning (RMRL) for effective feature fusion, and applies a semantic-aware hierarchical clustering mechanism for visual token compression. Experiments on two widely used VDR benchmarks show that our MURE framework consistently beats strong baselines. Furthermore, it significantly outperforms ColPali with only 50% of its visual token budget.
[54] AgriPath: A Systematic Exploration of Architectural Trade-offs for Crop Disease Classification cs.CV | cs.LGPDF
Hamza Mooraj, George Pantazopoulos, Alessandro Suglia
TL;DR: 本文系统比较了三种模型范式(卷积神经网络、对比式视觉语言模型和生成式视觉语言模型)在细粒度作物病害分类任务中的性能,并引入AgriPath-LF16基准数据集以分析实验室与田间图像的领域差异。研究发现,CNN在实验室图像上准确率最高但领域适应性差,对比式VLM在跨领域性能上稳健且参数高效,生成式VLM对分布变化最具韧性但存在自由文本生成导致的失败模式。
Details
Motivation: 现有作物病害检测模型评估多集中于单一架构家族或实验室生成数据集,缺乏在不同采集条件下的一致性能分析,本文旨在系统探索不同模型架构在跨领域作物病害分类中的权衡。
Result: 在AgriPath-LF16基准上,使用宏F1和解析成功率(PSR)评估,CNN在实验室图像上达到最高准确率但领域迁移时性能下降;对比式VLM在跨领域性能上具有竞争力且参数高效;生成式VLM对分布变化最具韧性,但存在因自由文本生成导致的额外失败模式。
Insight: 创新点包括引入具有明确实验室与田间图像分离的基准数据集AgriPath-LF16,以及系统评估三种模型范式在统一协议下的跨领域性能;客观分析表明,模型架构选择应基于部署上下文(如领域变化容忍度)而非仅依赖总体准确率,生成式VLM的韧性优势与文本生成风险值得关注。
Abstract: Reliable crop disease detection requires models that perform consistently across diverse acquisition conditions, yet existing evaluations often focus on single architectural families or lab-generated datasets. This work presents a systematic empirical comparison of three model paradigms for fine-grained crop disease classification: Convolutional Neural Networks (CNNs), contrastive Vision-Language Models (VLMs), and generative VLMs. To enable controlled analysis of domain effects, we introduce AgriPath-LF16, a benchmark containing 111k images spanning 16 crops and 41 diseases with explicit separation between laboratory and field imagery, alongside a balanced 30k subset for standardized training and evaluation. All models are trained and evaluated under unified protocols across full, lab-only, and field-only training regimes using macro-F1 and Parse Success Rate (PSR) to account for generative reliability. The results reveal distinct performance profiles. CNNs achieve the highest accuracy on lab imagery but degrade under domain shift. Contrastive VLMs provide a robust and parameter-efficient alternative with competitive cross-domain performance. Generative VLMs demonstrate the strongest resilience to distributional variation, albeit with additional failure modes stemming from free-text generation. These findings highlight that architectural choice should be guided by deployment context rather than aggregate accuracy alone.
[55] Bi-CamoDiffusion: A Boundary-informed Diffusion Approach for Camouflaged Object Detection cs.CV | cs.AI | cs.LGPDF
Patricia L. Suarez, Leo Thomas Ramos, Angel D. Sappa
TL;DR: Bi-CamoDiffusion是一种用于伪装目标检测的边界感知扩散模型,它在CamoDiffusion框架基础上,通过无参数注入过程将边缘先验整合到早期嵌入中,以增强边界清晰度并防止结构模糊。模型采用统一的优化目标,平衡空间精度、结构约束和不确定性监督,从而捕获目标的全局上下文和复杂边界过渡。
Details
Motivation: 解决伪装目标检测中边界模糊和结构不清晰的问题,通过引入边缘先验来提升模型对目标边界的识别能力。
Result: 在CAMO、COD10K和NC4K基准测试中,Bi-CamoDiffusion超越了基线方法,并在所有评估指标(包括S_m、F_β^w、E_m和MAE)上一致优于现有最先进方法,实现了更精确的目标-背景分离和更锐利的边界恢复。
Insight: 创新点在于将边缘先验以无参数方式注入扩散模型的早期嵌入,并通过统一优化目标整合空间、结构和不确定性监督,这有助于同时捕获全局上下文和精细边界,为扩散模型在视觉任务中的应用提供了新思路。
Abstract: Bi-CamoDiffusion is introduced, an evolution of the CamoDiffusion framework for camouflaged object detection. It integrates edge priors into early-stage embeddings via a parameter-free injection process, which enhances boundary sharpness and prevents structural ambiguity. This is governed by a unified optimization objective that balances spatial accuracy, structural constraints, and uncertainty supervision, allowing the model to capture of both the object’s global context and its intricate boundary transitions. Evaluations across the CAMO, COD10K, and NC4K benchmarks show that Bi-CamoDiffusion surpasses the baseline, delivering sharper delineation of thin structures and protrusions while also minimizing false positives. Also, our model consistently outperforms existing state-of-the-art methods across all evaluated metrics, including $S_m$, $F_β^{w}$, $E_m$, and $MAE$, demonstrating a more precise object-background separation and sharper boundary recovery.
[56] Graph2Video: Leveraging Video Models to Model Dynamic Graph Evolution cs.CV | cs.LGPDF
Hua Liu, Yanbin Wei, Fei Xing, Tyler Derr, Haoyu Han
TL;DR: 本文提出Graph2Video框架,将动态图中目标链接的时序邻域视为一系列’图帧’,并堆叠成’图视频’,从而借鉴视频基础模型的归纳偏置来建模细粒度局部变化和长程时序依赖,生成的链接级嵌入可作为即插即用的记忆单元增强现有动态图编码器。
Details
Motivation: 现有动态图链接预测模型难以捕捉时序演化的复杂性,包括忽略细粒度交互顺序变化、难以处理长程依赖以及建模特定节点对关系动态的能力有限。
Result: 在多个基准数据集上的广泛实验表明,Graph2Video在大多数情况下超越了最先进的基线方法,在链接预测任务上取得了优异性能。
Insight: 核心创新在于将动态图时序演化建模为视频序列,巧妙地将计算机视觉中的时空建模技术迁移到动态图学习领域,通过链接级记忆单元以轻量且可插拔的方式增强了现有模型的表达能力。
Abstract: Dynamic graphs are common in real-world systems such as social media, recommender systems, and traffic networks. Existing dynamic graph models for link prediction often fall short in capturing the complexity of temporal evolution. They tend to overlook fine-grained variations in temporal interaction order, struggle with dependencies that span long time horizons, and offer limited capability to model pair-specific relational dynamics. To address these challenges, we propose \textbf{Graph2Video}, a video-inspired framework that views the temporal neighborhood of a target link as a sequence of “graph frames”. By stacking temporally ordered subgraph frames into a “graph video”, Graph2Video leverages the inductive biases of video foundation models to capture both fine-grained local variations and long-range temporal dynamics. It generates a link-level embedding that serves as a lightweight and plug-and-play link-centric memory unit. This embedding integrates seamlessly into existing dynamic graph encoders, effectively addressing the limitations of prior approaches. Extensive experiments on benchmark datasets show that Graph2Video outperforms state-of-the-art baselines on the link prediction task in most cases. The results highlight the potential of borrowing spatio-temporal modeling techniques from computer vision as a promising and effective approach for advancing dynamic graph learning.
[57] Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding cs.CV | cs.AIPDF
Zhongxing Xu, Zhonghua Wang, Zhe Qian, Dachuan Shi, Feilong Tang
TL;DR: 本文提出了一种名为潜在熵感知解码(LEAD)的即插即用解码策略,旨在缓解多模态大推理模型(MLRM)中的幻觉问题。该方法的核心思想是利用词元概率分布提取丰富的语义表示,并通过熵感知的推理模式切换(在高熵状态下使用概率加权连续嵌入,在低熵状态下切换回离散词元嵌入)来增强上下文推理。此外,还引入了先验引导的视觉锚点注入策略,以鼓励模型关注视觉信息。实验表明,LEAD在多个基准测试上有效减少了各种MLRM的幻觉。
Details
Motivation: 多模态大推理模型在视觉问答中性能显著提升,但观察到过渡词(如because, however, wait)与幻觉密切相关且往往处于高熵状态。作者认为,足够的上下文推理信息可以直接从词元概率分布中提取,而依赖离散文本输入可能导致模型在推理高熵阶段进行顺序显式推理,未能充分利用密集的上下文线索。
Result: 广泛的实验表明,LEAD在多个基准测试上有效缓解了各种MLRM的幻觉,但摘要中未提及具体的定量结果(如准确率提升)或是否达到SOTA水平。
Insight: 创新点包括:1)从词元概率分布构建丰富语义表示以增强上下文推理;2)提出熵感知推理模式切换机制,在高熵时使用概率加权连续嵌入,低熵时切换回离散嵌入;3)引入先验引导的视觉锚点注入策略,强化视觉信息利用。从客观角度看,该方法将超叠加表示理论应用于解码过程,通过动态调整嵌入方式来处理不确定性,是一种新颖的缓解幻觉的思路。
Abstract: Recent advancements in multimodal large reasoning models (MLRMs) have significantly improved performance in visual question answering. However, we observe that transition words (e.g., because, however, and wait) are closely associated with hallucinations and tend to exhibit high-entropy states. We argue that adequate contextual reasoning information can be directly extracted from the token probability distribution. Inspired by superposed representation theory, we propose leveraging latent superposed reasoning to integrate multiple candidate semantics and maintain latent reasoning trajectories. The hypothesis is that reliance on discrete textual inputs may drive the model toward sequential explicit reasoning, underutilizing dense contextual cues during high-entropy reasoning stages. Therefore, we propose constructing rich semantic representations from the token probability distributions to enhance in-context reasoning. With this goal, we present Latent Entropy-Aware Decoding (LEAD), an efficient plug-and-play decoding strategy that leverages semantic context to achieve reliable reasoning. The heart of our method lies in entropy-aware reasoning mode switching. The model employs probability-weighted continuous embeddings under high-entropy states and transitions back to discrete token embeddings as entropy decreases. Moreover, we propose a prior-guided visual anchor injection strategy that encourages the model to focus on visual information. Extensive experiments show that LEAD effectively mitigates hallucinations across various MLRMs on multiple benchmarks.
[58] Multimodal Deep Learning for Dynamic and Static Neuroimaging: Integrating MRI and fMRI for Alzheimer Disease Analysis cs.CV | cs.LGPDF
Anima Kujur, Zahra Monfared
TL;DR: 本文提出了一种多模态深度学习框架,用于整合结构MRI和功能fMRI数据,以进行阿尔茨海默病(AD)、轻度认知障碍和正常认知状态的多分类。该方法利用3D卷积神经网络提取MRI的结构特征,并使用循环架构学习fMRI的时间序列特征,通过特征融合实现联合时空学习。
Details
Motivation: MRI提供详细的结构信息,而fMRI捕捉大脑的时间活动,但现有方法往往未能有效整合这两种模态。本文旨在通过多模态深度学习框架,结合结构MRI和功能fMRI,以提升阿尔茨海默病分类的准确性。
Result: 在一个小型配对MRI-fMRI数据集(29名受试者)上进行了实验,结果表明数据增强显著提高了分类稳定性和泛化能力,特别是对于多模态3DCNN-LSTM模型。相比之下,增强对于大规模单模态MRI数据集无效。
Insight: 创新点在于提出了一个整合MRI和fMRI的多模态深度学习框架,实现了结构时空特征的联合学习。客观分析表明,研究强调了在设计神经影像数据增强策略时,数据集规模和模态的重要性,为小样本多模态神经影像分析提供了实用见解。
Abstract: Magnetic Resonance Imaging (MRI) provides detailed structural information, while functional MRI (fMRI) captures temporal brain activity. In this work, we present a multimodal deep learning framework that integrates MRI and fMRI for multi-class classification of Alzheimer Disease (AD), Mild Cognitive Impairment, and Normal Cognitive State. Structural features are extracted from MRI using 3D convolutional neural networks, while temporal features are learned from fMRI sequences using recurrent architectures. These representations are fused to enable joint spatial-temporal learning. Experiments were conducted on a small paired MRI-fMRI dataset (29 subjects), both with and without data augmentation. Results show that data augmentation substantially improves classification stability and generalization, particularly for the multimodal 3DCNN-LSTM model. In contrast, augmentation was found to be ineffective for a large-scale single-modality MRI dataset. These findings highlight the importance of dataset size and modality when designing augmentation strategies for neuroimaging-based AD classification.
[59] GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning cs.CV | cs.LGPDF
Jiajin Liu, Dongzhe Fan, Chuanhao Ji, Daochen Zha, Qiaoyu Tan
TL;DR: 本文提出了GraphVLM,一个用于评估视觉语言模型在多模态图学习任务中能力的系统性基准。该基准探索了VLM与图推理结合的三种范式:作为编码器、对齐器和预测器,并在六个跨领域数据集上进行了广泛实验。
Details
Motivation: 尽管视觉语言模型在多模态信号对齐和理解方面表现出色,但其在结构化数据(即多模态实体通过显式关系图连接)上的推理潜力尚未被充分探索。解锁这一能力对于社交网络、推荐系统和科学发现等现实应用至关重要。
Result: 在六个不同领域的多模态图数据集上的实验表明,VLM通过所有三种角色都能提升多模态图学习性能。其中,VLM-as-Predictor范式取得了最显著且最一致的性能提升。
Insight: 论文的核心创新在于系统性地定义了VLM在图学习中的三种角色范式,并构建了首个针对多模态图学习的基准。其关键发现是,直接将VLM作为多模态主干用于图学习任务(VLM-as-Predictor)展现出作为多模态图学习新基础的巨大潜力。
Abstract: Vision-Language Models (VLMs) have demonstrated remarkable capabilities in aligning and understanding multimodal signals, yet their potential to reason over structured data, where multimodal entities are connected through explicit relational graphs, remains largely underexplored. Unlocking this capability is crucial for real-world applications such as social networks, recommendation systems, and scientific discovery, where multimodal information is inherently structured. To bridge this gap, we present GraphVLM, a systematic benchmark designed to evaluate and harness the capabilities of VLMs for multimodal graph learning (MMGL). GraphVLM investigates three complementary paradigms for integrating VLMs with graph reasoning: (1) VLM-as-Encoder, which enriches graph neural networks through multimodal feature fusion; (2) VLM-as-Aligner, which bridges modalities in latent or linguistic space to facilitate LLM-based structured reasoning; and (3) VLM-as-Predictor, which directly employs VLMs as multimodal backbones for graph learning tasks. Extensive experiments across six datasets from diverse domains demonstrate that VLMs enhance multimodal graph learning via all three roles. Among these paradigms, VLM-as-Predictor achieves the most substantial and consistent performance gains, revealing the untapped potential of vision-language models as a new foundation for multimodal graph learning. The benchmark code is publicly available at https://github.com/oamyjin/GraphVLM.
[60] Agentic LLM Workflow for MR Spectroscopy Volume-of-Interest Placements in Brain Tumors cs.CV | cs.AIPDF
Sangyoon Lee, Francesca Branzoli, Małgorzata Marjańska, Patrick Bolan
TL;DR: 本文提出了一种基于智能体大语言模型(LLM)的工作流,用于脑肿瘤磁共振波谱(MRS)检查中感兴趣体积(VOI)的放置。该工作流将VOI放置分解为生成多样化的候选VOI,并由LLM根据定量指标从中选择最优的一个,从而适应不同的临床目标和操作者偏好。
Details
Motivation: 磁共振波谱(MRS)对脑肿瘤的临床评估价值依赖于准确的VOI放置,但实际操作中存在较大的操作窗口和操作者间差异,尤其是对于异质性肿瘤。现有方法难以灵活适应不同的临床优先级和操作者偏好。
Result: 在110个临床脑肿瘤病例上,该智能体工作流相比通用的专家放置方法,根据用户偏好,在实体肿瘤覆盖和坏死区域避免方面取得了改进。
Insight: 创新点在于将VOI放置任务解耦为生成(通过具有不同目标函数偏好的视觉Transformer模型)和选择(通过LLM代理)两个阶段,从而无需为每个特定目标重新训练模型即可灵活适应多种临床目标,提供了一种从可接受的替代方案中进行选择的策略,而非单一的确定性放置。
Abstract: Magnetic resonance spectroscopy (MRS) provides clinically valuable metabolic characterization of brain tumors, but its utility depends on accurate placement of the spectroscopy volume-of-interest (VOI). However, VOI placement typically has a broad operating window: for a given tumor there are multiple possible VOIs that would lead to high-quality MRS measurements. Thus, a VOI place-ment can be tuned for clinician preference, case-specific anatomy, and clinical pri-orities, which leads to high inter-operator variability, especially for heterogeneous tumors. We propose an agentic large language model (LLM) workflow that de-composes VOI placement into generation of diverse candidate VOIs, from which the LLM selects an optimal one based on quantitative metrics. Candidate VOIs are generated by vision transformer-based placement models trained with differ-ent objective function preferences, which allows selection from acceptable alterna-tives rather than a single deterministic placement. On 110 clinical brain tumor cas-es, the agentic workflow achieves improved solid tumor coverage and necrosis avoidance depending on the user preferences compared to the general-purpose expert placements. Overall, the proposed workflow provides a strategy to adapt VOI placement to different clinical objectives without retraining task-specific models.
[61] Geometry-Aware Semantic Reasoning for Training Free Video Anomaly Detection cs.CV | cs.AIPDF
Ali Zia, Usman Ali, Muhammad Umer Ramzan, Hamza Abid, Abdul Rehman
TL;DR: 本文提出了一种名为MM-VAD的免训练视频异常检测框架,该框架通过将场景表示投影到双曲空间以更好地保留层次结构,并利用冻结的大型语言模型进行自适应问答推理来检测异常,从而避免了传统方法中基于欧几里得嵌入的浅层相似性匹配问题。
Details
Motivation: 现有免训练视频异常检测方法主要依赖静态提示和几何无关的特征融合,导致异常推理简化为欧几里得嵌入上的浅层相似性匹配,在复杂或层次化场景中预测不稳定且可解释性有限。本文旨在通过几何感知的语义推理框架解决这些问题。
Result: 在四个基准测试上,MM-VAD一致优于先前的免训练方法,在XD-Violence上达到90.03%的AUC,在UCF-Crime、ShanghaiTech和UCSD Ped2上分别达到83.24%、96.95%和98.81%的AUC。
Insight: 创新点包括:1) 将场景表示投影到双曲空间以建模层次结构;2) 通过自适应问答过程进行异常评估,而非固定特征比较;3) 在测试时使用无监督置信度-稀疏性目标优化轻量级可学习提示,实现上下文特定校准;4) 引入协方差感知的马氏距离细化以稳定跨模态对齐。这些方法为免训练VAD提供了更原则性和有效的替代方案。
Abstract: Training-free video anomaly detection (VAD) has recently emerged as a scalable alternative to supervised approaches, yet existing methods largely rely on static prompting and geometry-agnostic feature fusion. As a result, anomaly inference is often reduced to shallow similarity matching over Euclidean embeddings, leading to unstable predictions and limited interpretability, especially in complex or hierarchically structured scenes. We introduce MM-VAD, a geometry-aware semantic reasoning framework for training free VAD that reframes anomaly detection as adaptive test-time inference rather than fixed feature comparison. Our approach projects caption-derived scene representations into hyperbolic space to better preserve hierarchical structure and performs anomaly assessment through an adaptive question answering process over a frozen large language model. A lightweight, learnable prompt is optimised at test time using an unsupervised confidence-sparsity objective, enabling context-specific calibration without updating any backbone parameters. To further ground semantic predictions in visual evidence, we incorporate a covariance-aware Mahalanobis refinement that stabilises cross-modal alignment. Across four benchmarks, MM-VAD consistently improves over prior training-free methods, achieving 90.03% AUC on XD-Violence and 83.24%, 96.95%, and 98.81% on UCF-Crime, ShanghaiTech, and UCSD Ped2, respectively. Our results demonstrate that geometry-aware representation and adaptive semantic calibration provide a principled and effective alternative to static Euclidean matching in training-free VAD.
[62] InfiniteDance: Scalable 3D Dance Generation Towards in-the-wild Generalization cs.CV | cs.LGPDF
Ronghui Li, Zhongyuan Hu, Li Siyao, Youliang Zhang, Haozhe Xie
TL;DR: 本文提出了InfiniteDance,一个旨在提升野外泛化能力的可扩展3D舞蹈生成框架。其核心贡献包括:通过一个结合足部接触和几何约束的足部修复扩散模型,从单目视频中自动化重建高质量、物理合理的3D舞蹈动作,构建了一个大规模多模态数据集;并设计了一个基于LLaMA的可扩展架构ChoreoLLaMA,该架构集成了检索增强生成模块以增强对陌生音乐的鲁棒性,以及一个慢/快节奏混合专家模块来适应不同音乐速度。
Details
Motivation: 现有3D舞蹈生成方法在受控场景下表现良好,但在野外条件下(如面对未见过的音乐)泛化能力差,常产生结构混乱或物理上不合理的舞蹈动作,主要受限于音乐-舞蹈数据的稀缺和模型容量不足。
Result: 在多种舞蹈流派上的广泛实验表明,该方法在定性和定量评估上均超越了现有方法,标志着向可扩展、真实世界3D舞蹈生成迈进了一步。
Insight: 创新点在于数据与模型的双重扩展:1)通过物理约束引导的扩散模型(FRDM)自动化构建高质量、大规模3D舞蹈数据集,解决了数据瓶颈和物理合理性问题;2)提出ChoreoLLaMA模型,结合检索增强生成(RAG)和节奏感知的混合专家(MoE)模块,有效提升了模型对陌生音乐的适应性和节奏变化的建模能力。
Abstract: Although existing 3D dance generation methods perform well in controlled scenarios, they often struggle to generalize in the wild. When conditioned on unseen music, existing methods often produce unstructured or physically implausible dance, largely due to limited music-to-dance data and restricted model capacity. This work aims to push the frontier of generalizable 3D dance generation by scaling up both data and model design. (1) On the data side, we develop a fully automated pipeline that reconstructs high-fidelity 3D dance motions from monocular videos. To eliminate the physical artifacts prevalent in existing reconstruction methods, we introduce a Foot Restoration Diffusion Model (FRDM) guided by foot-contact and geometric constraints that enforce physical plausibility while preserving kinematic smoothness and expressiveness, resulting in a diverse, high-quality multimodal 3D dance dataset totaling 100.69 hours. (2) On model design, we propose Choreographic LLaMA (ChoreoLLaMA), a scalable LLaMA-based architecture. To enhance robustness under unfamiliar music conditions, we integrate a retrieval-augmented generation (RAG) module that injects reference dance as a prompt. Additionally, we design a slow/fast-cadence Mixture-of-Experts (MoE) module that enables ChoreoLLaMA to smoothly adapt motion rhythms across varying music tempos. Extensive experiments across diverse dance genres show that our approach surpasses existing methods in both qualitative and quantitative evaluations, marking a step toward scalable, real-world 3D dance generation. Code, models, and data will be released.
[63] DINOv3 with Test-Time Calibration for Automated Carotid Intima-Media Thickness Measurement on CUBS v1 cs.CVPDF
Zhenpeng Zhang, Jinwei Lu, Yurui Dong, Bo Yuan
TL;DR: 本研究提出了一种基于DINOv3视觉基础模型并结合测试时校准的自动化方法,用于从B超图像中分割颈动脉内膜-中膜复合体并测量其厚度(CIMT)。该方法在CUBS v1数据集上评估,通过测试时阈值校准优化了测量精度,将CIMT绝对误差降低至约0.1毫米的临床相关范围。
Details
Motivation: 颈动脉内膜-中膜厚度(CIMT)是评估动脉粥样硬化和心血管风险的重要生物标志物。尽管已有多种计算机化方法用于颈动脉边界描绘和CIMT估计,但结合分割与测量、且具有鲁棒性和可迁移性的深度模型仍显不足,特别是在视觉基础模型时代。本研究旨在探索基于DINOv3的框架来解决这一问题。
Result: 在CUBS v1数据集的三个患者级别测试集上,该方法取得了平均Dice系数0.7739±0.0037和IoU 0.6384±0.0044。CIMT的平均绝对误差为181.16±11.57 μm,平均皮尔逊相关系数为0.480±0.259。在一个保留的验证子集(n=28)中,测试时阈值校准将CIMT的平均绝对误差从默认阈值下的141.0 μm降低到测量优化阈值下的101.1 μm,同时减少了系统性偏差。相对于原始CUBS基准中报告的经典计算机化方法的误差范围,这些结果使基于DINOv3的方法达到了临床相关的~0.1毫米测量精度水平。
Insight: 论文的创新点在于将DINOv3视觉基础模型适配于医学图像分割任务,并引入测试时校准(特别是阈值校准)来优化物理单位(CIMT)的测量精度,从而提升模型的解释性和临床实用性。从客观角度看,这展示了如何利用预训练的基础模型和测试时优化策略,在特定医学影像分析任务(如CIMT测量)中实现高精度和鲁棒性,为其他类似生物标志物的自动化测量提供了可借鉴的框架。
Abstract: Carotid intima-media thickness (CIMT) measured from B-mode ultrasound is an established vascular biomarker for atherosclerosis and cardiovascular risk stratification. Although a wide range of computerized methods have been proposed for carotid boundary delineation and CIMT estimation, robust and transferable deep models that jointly address segmentation and measurement remain underexplored, particularly in the era of vision foundation models. Motivated by recent advances in adapting DINOv3 to medical segmentation and exploiting DINOv3 in test-time optimization pipelines, we investigate a DINOv3-based framework for carotid intima-media complex segmentation and subsequent CIMT measurement on the Carotid Ultrasound Boundary Study (CUBS) v1 dataset. Our pipeline predicts the intima-media band at a fixed image resolution, extracts upper and lower boundaries column-wise, corrects for image resizing using the per-image calibration factor provided by CUBS, and reports CIMT in physical units. Across three patient-level test splits, our method achieved a mean test Dice of 0.7739 $\pm$ 0.0037 and IoU of 0.6384 $\pm$ 0.0044. The mean CIMT absolute error was 181.16 $\pm$ 11.57 $μ$m, with a mean Pearson correlation of 0.480 $\pm$ 0.259. In a held-out validation subset ($n=28$), test-time threshold calibration reduced the mean absolute CIMT error from 141.0 $μ$m at the default threshold to 101.1 $μ$m at the measurement-optimized threshold, while simultaneously reducing systematic bias toward zero. Relative to the error ranges reported in the original CUBS benchmark for classical computerized methods, these results place a DINOv3-based approach within the clinically relevant $\sim$0.1 mm measurement regime. Together, our findings support the feasibility of using vision foundation models for interpretable, calibration-aware CIMT measurement.
[64] Taming Vision Priors for Data Efficient mmWave Channel Modeling cs.CV | cs.NIPDF
Zhenlin An, Longfei Shangguan, John Kaewell, Philip Pietraski, Jelena Senic
TL;DR: 本文提出VisRFTwin框架,通过融合视觉先验与可微分光线追踪,实现数据高效的毫米波信道建模。该方法利用预训练视觉语言模型从多视角图像提取语义嵌入并转化为材料电磁参数初始估计,再结合稀疏信道测量数据通过梯度下降快速校准,显著降低了对密集信道测量的依赖。
Details
Motivation: 现有可微分光线追踪方法过度依赖大量信道测量数据或脆弱的手工调参场景模型,难以实际部署。本文旨在利用视觉先验知识来减少对信道测量数据的依赖,实现更高效、可扩展的毫米波传播建模。
Result: 在办公室、城市峡谷和动态公共空间三个真实场景的评估表明,VisRFTwin将信道测量需求降低了高达10倍,并且其中值延迟扩展误差比纯数据驱动的深度学习方法降低了59%。
Insight: 创新点在于将冻结的视觉语言模型提取的密集语义嵌入作为材料电磁参数的先验估计,并与可微分光线追踪器结合,实现了从视觉特征到物理参数的快速映射与跨场景迁移,减少了校准所需数据量。
Abstract: Accurately modeling millimeter-wave (mmWave) propagation is essential for real-time AR and autonomous systems. Differentiable ray tracing offers a physics-grounded solution but still facing deployment challenges due to its over-reliance on exhaustive channel measurements or brittle, hand-tuned scene models for material properties. We present VisRFTwin, a scalable and data-efficient digital-twin framework that integrates vision-derived material priors with differentiable ray tracing. Multi-view images from commodity cameras are processed by a frozen Vision-Language Model to extract dense semantic embeddings, which are translated into initial estimates of permittivity and conductivity for scene surfaces. These priors initialize a Sionna-based differentiable ray tracer, which rapidly calibrates material parameters via gradient descent with only a few dozen sparse channel soundings. Once calibrated, the association between vision features and material parameters is retained, enabling fast transfer to new scenarios without repeated calibration. Evaluations across three real-world scenarios, including office interiors, urban canyons, and dynamic public spaces show that VisRFTwin reduces channel measurement needs by up to 10$\times$ while achieving a 59% lower median delay spread error than pure data-driven deep learning methods.
[65] VisualLeakBench: Auditing the Fragility of Large Vision-Language Models against PII Leakage and Social Engineering cs.CV | cs.AI | cs.CR | cs.IRPDF
Youting Wang, Yuan Tang, Yitian Qian, Chen Zhao
TL;DR: 该论文提出了VisualLeakBench评估套件,用于审计大型视觉语言模型在OCR注入和上下文PII泄露方面的脆弱性。研究通过1000张合成对抗图像和50张真实世界截图,评估了GPT-5.2、Claude 4、Gemini-3 Flash和Grok-4四个前沿系统,发现它们在PII泄露方面存在显著差异,且防御性系统提示的有效性因模型和数据类型而异。
Details
Motivation: 大型视觉语言模型在代理集成工作流等部署场景中应用日益广泛,但其对语义视觉攻击的鲁棒性评估不足,现有对齐测试主要关注显性有害内容,而非隐私关键的多模态场景。
Result: 在合成数据上,Claude 4的OCR攻击成功率最低(14.2%),但PII泄露率最高(74.4%);Grok-4的PII泄露率最低(20.4%)。防御性系统提示可消除两个模型的泄露,将Claude 4的泄露率从74.4%降至2.2%,但对Gemini-3 Flash在合成数据上无效。然而,真实世界图像验证显示,Gemini-3 Flash的泄露率可从50%降至0%,表明缓解效果对模板敏感。
Insight: 创新点在于构建了首个针对LVLMs PII泄露的综合性评估基准,揭示了模型在隐私泄露方面存在‘遵从-警告’模式等独特脆弱性,并发现缓解措施的有效性高度依赖于输入数据的性质(合成vs.真实),这为部署安全评估提供了重要见解。
Abstract: As Large Vision-Language Models (LVLMs) are increasingly deployed in agent-integrated workflows and other deployment-relevant settings, their robustness against semantic visual attacks remains under-evaluated – alignment is typically tested on explicit harmful content rather than privacy-critical multimodal scenarios. We introduce VisualLeakBench, an evaluation suite to audit LVLMs against OCR Injection and Contextual PII Leakage using 1,000 synthetically generated adversarial images with 8 PII types, validated on 50 in-the-wild (IRL) real-world screenshots spanning diverse visual contexts. We evaluate four frontier systems (GPT-5.2, Claude4, Gemini-3 Flash, Grok-4) with Wilson 95% confidence intervals. Claude4 achieves the lowest OCR ASR (14.2%) but the highest PII ASR (74.4%), exhibiting a comply-then-warn pattern – where verbatim data disclosure precedes any safety-oriented language. Grok-4 achieves the lowest PII ASR (20.4%). A defensive system prompt eliminates PII leakage for two models, reduces Claude~4’s leakage from 74.4% to 2.2%, but has no effect on Gemini-3 Flash on synthetic data. Strikingly, IRL validation reveals Gemini-3 Flash does respond to mitigation on real-world images (50% to 0%), indicating that mitigation robustness is template-sensitive rather than uniformly absent. We release our dataset and code for reproducible robustness and safety evaluation of deployment-relevant vision-language systems.
[66] Layout-Guided Controllable Pathology Image Generation with In-Context Diffusion Transformers cs.CVPDF
Yuntao Shou, Xiangyong Cao, Qian Zhao, Deyu Meng
TL;DR: 本文提出了一种布局引导的可控病理图像生成方法,通过多智能体LVLM标注框架自动构建细粒度标注数据,并设计了In-Context Diffusion Transformer(IC-DiT)模型,该模型能够整合空间布局、文本描述和视觉嵌入,实现高保真、强可控的病理图像合成,并在多个数据集上验证了其优越性及在下游任务中的增强效果。
Details
Motivation: 现有文本引导的扩散模型对病理图像生成只能提供粗略的全局控制,缺乏细粒度的结构约束能力,且缺乏大规模配对空间布局与详细诊断描述的数据集,因为人工标注整张切片图像极其耗时。
Result: 在五个组织病理学数据集上的大量实验表明,IC-DiT在保真度、空间可控性和诊断一致性方面均优于现有方法,生成的图像能有效作为数据增强资源用于癌症分类和生存分析等下游任务。
Insight: 创新点包括:1)可扩展的多智能体LVLM标注框架,实现了高效、细粒度且临床对齐的监督数据构建;2)IC-DiT模型通过分层多模态注意力机制,将空间布局、文本和视觉嵌入统一到扩散Transformer中,在保持全局语义连贯的同时精确保留结构和形态细节。
Abstract: Controllable pathology image synthesis requires reliable regulation of spatial layout, tissue morphology, and semantic detail. However, existing text-guided diffusion models offer only coarse global control and lack the ability to enforce fine-grained structural constraints. Progress is further limited by the absence of large datasets that pair patch-level spatial layouts with detailed diagnostic descriptions, since generating such annotations for gigapixel whole-slide images is prohibitively time-consuming for human experts. To overcome these challenges, we first develop a scalable multi-agent LVLM annotation framework that integrates image description, diagnostic step extraction, and automatic quality judgment into a coordinated pipeline, and we evaluate the reliability of the system through a human verification process. This framework enables efficient construction of fine-grained and clinically aligned supervision at scale. Building on the curated data, we propose In-Context Diffusion Transformer (IC-DiT), a layout-aware generative model that incorporates spatial layouts, textual descriptions, and visual embeddings into a unified diffusion transformer. Through hierarchical multimodal attention, IC-DiT maintains global semantic coherence while accurately preserving structural and morphological details. Extensive experiments on five histopathology datasets show that IC-DiT achieves higher fidelity, stronger spatial controllability, and better diagnostic consistency than existing methods. In addition, the generated images serve as effective data augmentation resources for downstream tasks such as cancer classification and survival analysis.
[67] High-Fidelity Text-to-Image Generation from Pre-Trained Vision-Language Models via Distribution-Conditioned Diffusion Decoding cs.CV | cs.LGPDF
Ji Woo Hong, Hee Suk Yoon, Gwanhyeong Koo, Eunseop Yoon, SooHwan Eom
TL;DR: 本文提出了一种基于扩散解码的框架,旨在提升预训练视觉语言模型(VLM)在文本到图像生成任务中的视觉保真度。该方法通过训练一个轻量级的扩散解码器,利用VLM输出的图像token logits作为条件信号,无需大规模数据或完整模型重新训练,即可显著改善生成图像的质量。
Details
Motivation: 当前大规模视觉语言模型在文本到图像生成方面表现出色,但其视觉保真度受限于离散的图像token化过程,导致图像质量受限。现有方法通过连续表示建模来提升质量,但需要大量数据和训练成本,类似于原始预训练。本文旨在克服这一限制,以更高效的方式提升生成图像的视觉保真度。
Result: 通过在ImageNet-1K上进行短期训练,该方法在VQ-VAE重建和VLM预测token的文本到图像生成任务中,均能一致地提升视觉保真度,实现了高质量的图像生成效果。
Insight: 创新点包括:1) Logit-to-Code Distributional Mapping,将VLM的图像token logits转换为具有不确定性特征的连续分布加权码向量,为扩散解码提供有效的条件信号;2) 轻量级的Logit Calibration,通过校准训练时代理logits与VLM生成logits之间的差异,缓解训练-推理差距;3) Distribution-Conditioned Diffusion Decoder,基于这些表示生成高保真图像,同时保持预训练VLM模型完整,无需大规模重新训练。
Abstract: Recent large-scale vision-language models (VLMs) have shown remarkable text-to-image generation capabilities, yet their visual fidelity remains constrained by the discrete image tokenization, which poses a major challenge. Although several studies have explored continuous representation modeling to enhance visual quality, adapting pre-trained VLM models to such representations requires large-scale data and training costs comparable to the original pre-training. To circumvent this limitation, we propose a diffusion-based decoding framework that enhances image fidelity by training only a diffusion decoder on the output image-token logits of pre-trained VLMs, thereby preserving the original model intact. At its core, Logit-to-Code Distributional Mapping converts the VLM’s image-token logits into continuous, distribution-weighted code vectors with uncertainty features, providing an effective conditioning signal for diffusion decoding. A lightweight Logit Calibration aligns training-time proxy logits from the VQ-VAE encoder with VLM-generated logits, mitigating the train-inference gap. Conditioned on these representations, the Distribution-Conditioned Diffusion Decoder generates high-fidelity images. Achieved solely through short training on ImageNet-1K, our method consistently improves visual fidelity for both VQ-VAE reconstructions and text-to-image generations from VLM-predicted tokens.
[68] WebVR: Benchmarking Multimodal LLMs for WebPage Recreation from Videos via Human-Aligned Visual Rubrics cs.CVPDF
Yuhong Dai, Yanlin Lai, Mitt Huang, Hangyu Guo, Dingming Li
TL;DR: 该论文提出了WebVR基准测试,用于评估多模态大语言模型(MLLMs)根据演示视频重建网页的能力。该基准包含175个多样化网页,通过受控合成流程构建,并设计了细粒度、与人类偏好对齐的视觉评估标准。实验在19个模型上进行,揭示了模型在重建细粒度样式和运动质量方面存在显著差距,同时基于该标准的自动评估与人类偏好的一致性达到96%。
Details
Motivation: 现有网页生成基准主要依赖文本提示或静态截图作为输入,而视频能自然传达交互流程、过渡时序和运动连续性等更丰富的信号,这对忠实重建网页至关重要。视频条件下的网页生成任务尚未被充分探索,也缺乏专门的基准测试,因此需要填补这一空白。
Result: 在19个模型上的实验结果表明,模型在重建细粒度样式和运动质量方面存在显著差距。同时,论文提出的基于视觉评估标准的自动评估方法与人类偏好的相关性达到96%。
Insight: 创新点在于首次提出了一个专门用于视频到网页生成任务的基准测试(WebVR),其数据集通过受控合成而非网络爬取构建,确保了多样性和真实性,并避免了与现有在线页面的重叠。同时,设计了一个细粒度、与人类对齐的视觉评估标准,用于多维度评估生成的网页,该自动评估方法显示出与人类判断的高度一致性。
Abstract: Existing web-generation benchmarks rely on text prompts or static screenshots as input. However, videos naturally convey richer signals such as interaction flow, transition timing, and motion continuity, which are essential for faithful webpage recreation. Despite this potential, video-conditioned webpage generation remains largely unexplored, with no dedicated benchmark for this task. To fill this gap, we introduce WebVR, a benchmark that evaluates whether MLLMs can faithfully recreate webpages from demonstration videos. WebVR contains 175 webpages across diverse categories, all constructed through a controlled synthesis pipeline rather than web crawling, ensuring varied and realistic demonstrations without overlap with existing online pages. We also design a fine-grained, human-aligned visual rubric that evaluates the generated webpages across multiple dimensions. Experiments on 19 models reveal substantial gaps in recreating fine-grained style and motion quality, while the rubric-based automatic evaluation achieves 96% agreement with human preferences. We release the dataset, evaluation toolkit, and baseline results to support future research on video-to-webpage generation.
[69] Language-Guided Token Compression with Reinforcement Learning in Large Vision-Language Models cs.CVPDF
Sihan Cao, Jianwei Zhang, Pengcheng Zheng, Jiaxin Yan, Caiyan Qin
TL;DR: 本文提出了TPRL,一种基于强化学习的框架,用于在大视觉语言模型(LVLMs)中实现语言引导的视觉令牌压缩。该方法将视觉令牌剪枝建模为一个具有明确状态转移的顺序决策过程,通过自监督自编码器压缩视觉令牌状态,并利用从演示中学习和近端策略优化(PPO)来训练剪枝策略,以联合优化任务准确性和计算效率。
Details
Motivation: 现有方法难以将渐进式视觉令牌减少建模为具有顺序依赖的多步决策过程,且通常依赖手工设计的评分规则,缺乏对复杂推理轨迹的自适应优化。TPRL旨在克服这些限制,通过强化学习学习自适应的剪枝轨迹,直接关联到最终任务性能。
Result: 实验结果表明,TPRL在推理过程中移除了高达66.7%的视觉令牌,实现了高达54.2%的FLOPs减少,同时平均准确率下降仅为0.7%,接近无损。
Insight: 创新点在于将视觉令牌剪枝形式化为顺序决策过程,并引入自监督自编码器进行状态压缩,以及结合演示学习和PPO进行策略优化,实现了语言引导的自适应令牌压缩,有效平衡了计算开销与模型精度。
Abstract: Large Vision-Language Models (LVLMs) incur substantial inference costs due to the processing of a vast number of visual tokens. Existing methods typically struggle to model progressive visual token reduction as a multi-step decision process with sequential dependencies and often rely on hand-engineered scoring rules that lack adaptive optimization for complex reasoning trajectories. To overcome these limitations, we propose TPRL, a reinforcement learning framework that learns adaptive pruning trajectories through language-guided sequential optimization tied directly to end-task performance. We formulate visual token pruning as a sequential decision process with explicit state transitions and employ a self-supervised autoencoder to compress visual tokens into a compact state representation for efficient policy learning. The pruning policy is initialized through learning from demonstrations and subsequently fine-tuned using Proximal Policy Optimization (PPO) to jointly optimize task accuracy and computational efficiency. Our experimental results demonstrate that TPRL removes up to 66.7% of visual tokens and achieves up to a 54.2% reduction in FLOPs during inference while maintaining a near-lossless average accuracy drop of only 0.7%. Code is released at \href{https://github.com/MagicVicCoder/TPRL}{\textcolor{mypink}{https://github.com/MagicVicCoder/TPRL}}.
[70] TennisExpert: Towards Expert-Level Analytical Sports Video Understanding cs.CVPDF
Zhaoyu Liu, Xi Weng, Lianyu Hu, Zhe Hou, Kan Jiang
TL;DR: 该论文提出了一个名为TennisExpert的专家级体育视频理解框架,旨在解决网球视频自动分析的两个关键挑战:缺乏细粒度标注和专家级评论的大规模基准,以及构建适用于实时部署的准确高效多模态系统的困难。为此,论文首先引入了TennisVL基准数据集,包含超过200场职业比赛和4万多个回合片段,并强调战术推理等专家级分析评论。然后,论文提出了TennisExpert框架,它集成了视频语义解析器和基于Qwen3-VL-8B构建的记忆增强模型,以提取关键比赛元素并捕捉时空上下文。实验表明,该框架在捕获战术背景和比赛动态方面优于GPT-5等强大的专有基线模型。
Details
Motivation: 解决网球视频自动理解领域的两大挑战:一是缺乏具有细粒度标注和专家级分析评论的大规模基准数据集;二是难以构建既准确又高效、适合实时部署的多模态系统。
Result: 在提出的TennisVL基准上,TennisExpert框架在捕获战术背景和比赛动态方面,持续优于GPT-5、Gemini和Claude等强大的专有基线模型。
Insight: 论文的创新点包括:1) 构建了首个强调专家级战术分析评论(而非简单的描述性解说)的大规模网球视频基准TennisVL;2) 提出了一个结合视频语义解析器和记忆增强多模态模型(基于Qwen3-VL-8B)的集成框架,通过层次化记忆模块有效捕捉短期和长期的时空上下文,以支持复杂的战术推理。从客观角度看,将结构化视频解析与大型视觉语言模型的生成能力相结合,并引入专门设计的记忆机制来处理体育比赛的长序列动态,是一个有前景的技术路径。
Abstract: Tennis is one of the most widely followed sports, generating extensive broadcast footage with strong potential for professional analysis, automated coaching, and real-time commentary. However, automatic tennis understanding remains underexplored due to two key challenges: (1) the lack of large-scale benchmarks with fine-grained annotations and expert-level commentary, and (2) the difficulty of building accurate yet efficient multimodal systems suitable for real-time deployment. To address these challenges, we introduce TennisVL, a large-scale tennis benchmark comprising over 200 professional matches (471.9 hours) and 40,000+ rally-level clips. Unlike existing commentary datasets that focus on descriptive play-by-play narration, TennisVL emphasizes expert analytical commentary capturing tactical reasoning, player decisions, and match momentum. Furthermore, we propose TennisExpert, a multimodal tennis understanding framework that integrates a video semantic parser with a memory-augmented model built on Qwen3-VL-8B. The parser extracts key match elements (e.g., scores, shot sequences, ball bounces, and player locations), while hierarchical memory modules capture both short- and long-term temporal context. Experiments show that TennisExpert consistently outperforms strong proprietary baselines, including GPT-5, Gemini, and Claude, and demonstrates improved ability to capture tactical context and match dynamics.
[71] Qianfan-OCR: A Unified End-to-End Model for Document Intelligence cs.CVPDF
Daxiang Dong, Mingming Zheng, Dong Xu, Chunhua Luo, Bairong Zhuang
TL;DR: 本文提出了Qianfan-OCR,一个拥有40亿参数、统一文档解析、版面分析和文档理解任务的端到端视觉语言模型。它支持从图像直接生成Markdown格式,并能通过提示驱动完成表格提取、图表理解、文档问答和关键信息抽取等多种任务。为了解决端到端OCR中显式版面信息缺失的问题,论文引入了Layout-as-Thought机制,通过特殊思考令牌触发一个生成结构化版面表示的中间阶段,从而在保持端到端优势的同时,恢复了版面定位能力并提升了复杂版面的处理精度。
Details
Motivation: 现有端到端OCR模型缺乏显式的版面分析能力,导致在处理复杂文档布局时精度受限。本文旨在构建一个统一的端到端模型,将文档解析、版面分析和理解任务整合,并解决端到端模型中版面信息丢失的问题。
Result: 在OmniDocBench v1.5(93.12分)和OlmOCR Bench(79.8分)上,Qianfan-OCR在端到端模型中排名第一。在OCRBench、CCOCR、DocVQA和ChartQA等基准测试上,与同规模通用视觉语言模型相比取得了有竞争力的结果。在公开的关键信息抽取基准测试中获得了最高平均分,超越了Gemini-3.1-Pro、Seed-2.0和Qwen3-VL-235B等模型。
Insight: 主要创新点在于提出了Layout-as-Thought机制,这是一个可选的、由特殊令牌触发的“思考”阶段,能在生成最终输出前先产生结构化的版面表示(如边界框、元素类型和阅读顺序)。这种设计巧妙地在端到端框架内恢复了显式的版面分析能力,为解决复杂文档理解问题提供了新思路,实现了任务统一与性能提升的平衡。
Abstract: We present Qianfan-OCR, a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture. It performs direct image-to-Markdown conversion and supports diverse prompt-driven tasks including table extraction, chart understanding, document QA, and key information extraction. To address the loss of explicit layout analysis in end-to-end OCR, we propose Layout-as-Thought, an optional thinking phase triggered by special think tokens that generates structured layout representations – bounding boxes, element types, and reading order – before producing final outputs, recovering layout grounding capabilities while improving accuracy on complex layouts. Qianfan-OCR ranks first among end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8), achieves competitive results on OCRBench, CCOCR, DocVQA, and ChartQA against general VLMs of comparable scale, and attains the highest average score on public key information extraction benchmarks, surpassing Gemini-3.1-Pro, Seed-2.0, and Qwen3-VL-235B. The model is publicly accessible via the Baidu AI Cloud Qianfan platform.
[72] FlowAD: Ego-Scene Interactive Modeling for Autonomous Driving cs.CVPDF
Mingzhe Guo, Yixiang Yang, Chuanrong Han, Rufeng Zhang, Shirui Li
TL;DR: 本文提出了一种新的自车-场景交互建模范式FlowAD,通过将自车与场景的交互建模为相对于自车的场景流,来更有效地理解自动驾驶过程。该框架包含自车引导的场景划分、时空流预测以及任务感知增强三个部分,并在感知、端到端规划和VLM分析等任务上验证了其通用性和有效性。
Details
Motivation: 当前自动驾驶的环境建模范式往往未能充分考虑自车运动对观测的反馈,导致对驾驶过程的理解不完整,从而限制了规划能力。本文旨在解决这一问题。
Result: 在nuScenes数据集上,FlowAD相比SparseDrive将碰撞率降低了19%,并将FCP指标提升了1.39帧(60%)。在Bench2Drive上,其驾驶得分达到了51.77,证明了其优越性。
Insight: 核心创新点在于提出了一个以场景流为核心的、自车与场景交互的通用建模范式,将自车运动反馈融入特征学习,并提出了新的FCP评估指标来衡量场景理解能力。该方法不依赖场景模拟,可利用现有的日志回放数据集。
Abstract: Effective environment modeling is the foundation for autonomous driving, underpinning tasks from perception to planning. However, current paradigms often inadequately consider the feedback of ego motion to the observation, which leads to an incomplete understanding of the driving process and consequently limits the planning capability. To address this issue, we introduce a novel ego-scene interactive modeling paradigm. Inspired by human recognition, the paradigm represents ego-scene interaction as the scene flow relative to the ego-vehicle. This conceptualization allows for modeling ego-motion feedback within a feature learning pattern, advantageously utilizing existing log-replay datasets rather than relying on scenario simulations. We specifically propose FlowAD, a general flow-based framework for autonomous driving. Within it, an ego-guided scene partition first constructs basic flow units to quantify scene flow. The ego-vehicle’s forward direction and steering velocity directly shape the partition, which reflects ego motion. Then, based on flow units, spatial and temporal flow predictions are performed to model dynamics of scene flow, encompassing both spatial displacement and temporal variation. The final task-aware enhancement exploits learned spatio-temporal flow dynamics to benefit diverse tasks through object and region-level strategies. We also propose a novel Frames before Correct Planning (FCP) metric to assess the scene understanding capability. Experiments in both open and closed-loop evaluations demonstrate FlowAD’s generality and effectiveness across perception, end-to-end planning, and VLM analysis. Notably, FlowAD reduces 19% collision rate over SparseDrive with FCP improvements of 1.39 frames (60%) on nuScenes, and achieves an impressive driving score of 51.77 on Bench2Drive, proving the superiority. Code, model, and configurations will be released here.
[73] Combining Microscopy Data and Metadata for Reconstruction of Cellular Traction Forces Using a Hybrid Vision Transformer-U-Net cs.CVPDF
Yunfei Huang, Elena Van der Vorst, Alexander Richard, Benedikt Sabass
TL;DR: 该论文提出了一种名为ViT+UNet的混合深度学习架构,结合了U-Net和Vision Transformer,用于从显微镜数据中重建细胞牵引力场。该方法通过整合细胞类型等元数据,提高了预测的准确性和特异性,并在不同空间尺度和噪声水平下表现出优异的泛化能力。
Details
Motivation: 解决牵引力显微镜(TFM)数据分析中现有深度学习方法面临的挑战,特别是跨多空间尺度的可靠推理能力不足,以及缺乏整合细胞类型等上下文信息以提高准确性的问题。
Result: 在预测牵引力场任务上,ViT+UNet模型性能优于单独的U-Net和Vision Transformer架构,并在不同实验设置和成像系统获得的TFM数据集上展现出卓越的泛化能力。
Insight: 创新点在于将U-Net的局部特征提取能力与Vision Transformer的全局上下文建模相结合,并通过结构化输入数据有效整合元数据(如细胞类型),从而提升了模型在复杂生物图像分析任务中的鲁棒性和准确性。
Abstract: Traction force microscopy (TFM) is a widely used technique for quantifying the forces that cells exert on their surrounding extracellular matrix. Although deep learning methods have recently been applied to TFM data analysis, several challenges remain-particularly achieving reliable inference across multiple spatial scales and integrating additional contextual information such as cell type to improve accuracy. In this study, we propose ViT+UNet, a robust deep learning architecture that integrates a U-Net with a Vision Transformer. Our results demonstrate that this hybrid model outperforms both standalone U-Net and Vision Transformer architectures in predicting traction force fields. Furthermore, ViT+UNet exhibits superior generalization across diverse spatial scales and varying noise levels, enabling its application to TFM datasets obtained from different experimental setups and imaging systems. By appropriately structuring the input data, our approach also allows the inclusion of metadata, in our case cell-type information, to enhance prediction specificity and accuracy.
[74] Event-Driven Video Generation cs.CV | cs.LGPDF
Chika Maduabuchi
TL;DR: 本文提出了事件驱动视频生成(EVD)框架,旨在解决现有文本到视频模型在交互动态(如接触时机、动作实现、物体放置后漂移、支撑关系)上存在的失败模式。该框架通过在训练中引入轻量级事件头预测事件活动、事件接地损失以及事件门控采样机制,使采样过程基于事件,从而抑制虚假更新并集中处理交互期间的更新,有效减少了交互幻觉。
Details
Motivation: 现有最先进的文本到视频模型虽然帧级外观逼真,但在简单交互动态上经常失败,例如运动在接触前开始、动作未实现、物体放置后漂移以及支撑关系断裂。作者认为这源于帧优先的去噪方法,该方法每一步都在所有位置更新潜在状态,而没有明确交互何时何地活跃的概念。
Result: 在EVD-Bench上,EVD一致地提高了人类偏好和VBench动态评分,显著减少了状态持久性、空间准确性、支撑关系和接触稳定性方面的失败模式,且未牺牲外观质量。
Insight: 论文的创新点在于提出了事件驱动视频生成的框架,通过显式的事件接地(包括事件活动预测、事件接地损失和事件门控采样)作为减少视频生成中交互幻觉的实用抽象。从客观角度看,将事件作为控制信号集成到扩散变换器(DiT)兼容的框架中,是一种新颖且有效的方法来建模和约束视频中的时空交互动态。
Abstract: State-of-the-art text-to-video models often look realistic frame-by-frame yet fail on simple interactions: motion starts before contact, actions are not realized, objects drift after placement, and support relations break. We argue this stems from frame-first denoising, which updates latent state everywhere at every step without an explicit notion of when and where an interaction is active. We introduce Event-Driven Video Generation (EVD), a minimal DiT-compatible framework that makes sampling event-grounded: a lightweight event head predicts token-aligned event activity, event-grounded losses couple activity to state change during training, and event-gated sampling (with hysteresis and early-step scheduling) suppresses spurious updates while concentrating updates during interactions. On EVD-Bench, EVD consistently improves human preference and VBench dynamics, substantially reducing failure modes in state persistence, spatial accuracy, support relations, and contact stability without sacrificing appearance. These results indicate that explicit event grounding is a practical abstraction for reducing interaction hallucinations in video generation.
[75] Anchor Forcing: Anchor Memory and Tri-Region RoPE for Interactive Streaming Video Diffusion cs.CV | eess.IVPDF
Yang Yang, Tianyi Zhang, Wei Huang, Jinwei Chen, Boxi Wu
TL;DR: 本文提出Anchor Forcing框架,用于解决交互式长视频生成中提示词切换时出现的质量下降和运动动态减弱问题。该框架包含锚点引导的重新缓存机制和三区域RoPE设计,旨在稳定感知质量并更好地保留预训练的运动先验。
Details
Motivation: 现有流式视频扩散模型在交互式长视频生成中进行提示词切换时,存在渐进式质量退化、运动动态减弱以及边界条件弱化的问题,这源于缓存维护无法同时保留语义上下文与近期潜在线索,以及无界时间索引导致的位置分布偏移。
Result: 在长视频上的实验表明,该方法在交互式设置下,相较于先前的流式基线模型,在感知质量和运动指标上均有所提升。
Insight: 创新点在于提出了以缓存为中心的框架,具体包括:1)利用锚点缓存引导重新缓存过程以减少切换后的证据损失;2)设计具有区域特定参考原点的三区域RoPE,并通过RoPE重新对齐蒸馏来协调无界流式索引与预训练RoPE机制,从而更好地保留运动先验。
Abstract: Interactive long video generation requires prompt switching to introduce new subjects or events, while maintaining perceptual fidelity and coherent motion over extended horizons. Recent distilled streaming video diffusion models reuse a rolling KV cache for long-range generation, enabling prompt-switch interaction through re-cache at each switch. However, existing streaming methods still exhibit progressive quality degradation and weakened motion dynamics. We identify two failure modes specific to interactive streaming generation: (i) at each prompt switch, current cache maintenance cannot simultaneously retain KV-based semantic context and recent latent cues, resulting in weak boundary conditioning and reduced perceptual quality; and (ii) during distillation, unbounded time indexing induces a positional distribution shift from the pretrained backbone’s bounded RoPE regime, weakening pretrained motion priors and long-horizon motion retention. To address these issues, we propose \textbf{Anchor Forcing}, a cache-centric framework with two designs. First, an anchor-guided re-cache mechanism stores KV states in anchor caches and warm-starts re-cache from these anchors at each prompt switch, reducing post-switch evidence loss and stabilizing perceptual quality. Second, a tri-region RoPE with region-specific reference origins, together with RoPE re-alignment distillation, reconciles unbounded streaming indices with the pretrained RoPE regime to better retain motion priors. Experiments on long videos show that our method improves perceptual quality and motion metrics over prior streaming baselines in interactive settings. Project page: https://github.com/vivoCameraResearch/Anchor-Forcing
[76] Nuanced Emotion Recognition Based on a Segment-based MLLM Framework Leveraging Qwen3-Omni for AH Detection cs.CV | cs.AIPDF
Liang Tang, Hongda Li, Jiayu Zhang, Long Chen, Shuxian Li
TL;DR: 本文提出了一种基于分段多模态大语言模型(MLLM)的框架,用于视频中的微妙情感识别,特别是针对矛盾(Ambivalence)和犹豫(Hesitancy)状态。该方法将视频分割为不超过5秒的片段,利用微调后的Qwen3-Omni-30B-A3B模型协同分析视觉和听觉信号,在BAH数据集上实现了85.1%的准确率,显著超越了现有基准。
Details
Motivation: 视频情感识别是情感计算的关键任务,识别如矛盾与犹豫等微妙心理状态对行为干预和数字健康具有重要价值。这些状态常通过面部表情、语音语调与文本语义之间的跨模态不一致性体现,给自动识别带来巨大挑战。
Result: 在BAH数据集测试集上,所提方法达到了85.1%的准确率,显著优于现有基准,验证了多模态大语言模型在捕捉复杂微妙情感冲突方面的卓越能力。
Insight: 创新点在于将时序分段建模与多模态大语言模型(Qwen3-Omni)相结合,通过分段策略处理长视频以应对计算效率和令牌限制,并利用LoRA和全参数微调策略优化模型性能,有效捕捉跨模态不一致性以实现更精准的微妙情感识别。
Abstract: Emotion recognition in videos is a pivotal task in affective computing, where identifying subtle psychological states such as Ambivalence and Hesitancy holds significant value for behavioral intervention and digital health. Ambivalence and Hesitancy states often manifest through cross-modal inconsistencies such as discrepancies between facial expressions, vocal tones, and textual semantics, posing a substantial challenge for automated recognition. This paper proposes a recognition framework that integrates temporal segment modeling with Multimodal Large Language Models. To address computational efficiency and token constraints in long video processing, we employ a segment-based strategy, partitioning videos into short clips with a maximum duration of 5 seconds. We leverage the Qwen3-Omni-30B-A3B model, fine-tuned on the BAH dataset using LoRA and full-parameter strategies via the MS-Swift framework, enabling the model to synergistically analyze visual and auditory signals. Experimental results demonstrate that the proposed method achieves an accuracy of 85.1% on the test set, significantly outperforming existing benchmarks and validating the superior capability of Multimodal Large Language Models in capturing complex and nuanced emotional conflicts. The code is released at https://github.com/dlnn123/A-H-Detection-with-Qwen-Omni.git.
[77] Bridging the Visual-to-Physical Gap: Physically Aligned Representations for Fall Risk Analysis cs.CV | cs.AIPDF
Xianqi Zhang
TL;DR: 本文提出PHARL方法,通过结合轨迹级时间一致性和多类物理对齐约束,学习具有物理意义的跌倒表示,以解决视觉相似动作对应不同物理结果的问题,无需临床损伤标签。
Details
Motivation: 现有基于视觉的跌倒分析方法受限于视觉相似动作可能对应不同物理结果,且依赖难以获取的可靠损伤标签,因此需要一种无需临床标签的物理感知表示学习方法。
Result: 在四个公开数据集上的实验表明,PHARL在风险对齐表示质量上优于纯视觉基线,同时保持强大的跌倒检测性能,并展现出零样本有序性。
Insight: 通过模拟衍生的接触结果来塑造嵌入几何,实现多类物理对齐,从而在无显式有序监督下学习到可解释的严重性结构,为物理感知表示学习提供了新思路。
Abstract: Vision-based fall analysis has advanced rapidly, but a key bottleneck remains: visually similarmotions can correspond to very different physical outcomes because small differences in contactmechanics and protective responses are hard to infer from appearance alone. Most existingapproaches handle this by supervised injury prediction, which depends on reliable injury labels.In practice, such labels are difficult to obtain: video evidence is often ambiguous (occlusion,viewpoint limits), and true injury events are rare and cannot be safely staged, leading to noisysupervision. We address this problem with PHARL (PHysics-aware Alignment RepresentationLearning), which learns physically meaningful fall representations without requiring clinicaloutcome labels. PHARL regularizes motion embeddings with two complementary constraints:(1) trajectory-level temporal consistency for stable representation learning, and (2) multi-classphysics alignment, where simulation-derived contact outcomes shape embedding geometry. Bypairing video windows with temporally aligned simulation descriptors, PHARL captures localimpact-relevant dynamics while keeping inference purely feed-forward. Experiments on fourpublic datasets show that PHARL consistently improves risk-aligned representation quality overvisual-only baselines while maintaining strong fall-detection performance. Notably, PHARL alsoexhibits zero-shot ordinality: an interpretable severity structure (Head > Trunk > Supported)emerges without explicit ordinal supervision.
[78] WAT: Online Video Understanding Needs Watching Before Thinking cs.CVPDF
Zifan Han, Hongbo Sun, Jinglin Xu, Canhui Tang, Yulong Lei
TL;DR: 本文提出WAT(Watching Before Thinking)框架,用于解决多模态大语言模型(MLLMs)在在线流式视频理解中的挑战。该框架采用两阶段处理:先进行与查询无关的‘观看’阶段,构建包含短期记忆和长期记忆的分层记忆系统;随后在查询触发的‘思考’阶段,通过上下文感知检索机制结合短期记忆上下文从长期记忆中检索相关历史帧进行跨时间推理。为支持训练,作者还引入了包含流式风格标注的WAT-85K数据集。实验表明,WAT在在线视频基准测试中达到了最先进的性能。
Details
Motivation: 现有视频大语言模型(Video LLMs)在在线流式场景中表现不佳,难以在严格的内存限制下保持长时序上下文信息,因此需要一种专门针对在线视频推理的框架。
Result: WAT在在线视频基准测试中取得了最先进的性能,在StreamingBench上达到77.7%的准确率,在OVO-Bench上达到55.2%的准确率,优于现有的开源在线视频大语言模型,且能以实时帧率运行。
Insight: 创新点在于将在线视频处理明确分离为查询无关的‘观看’和查询触发的‘思考’两阶段,并设计了包含短期记忆缓冲和基于冗余感知淘汰策略的固定容量长期记忆的分层记忆系统,以及结合查询与短期记忆上下文的上下文感知检索机制,以高效支持长时序跨时间推理。从客观角度看,这种架构设计有效地平衡了内存约束与历史信息保留的需求,为流式视频理解提供了可借鉴的系统级解决方案。
Abstract: Multimodal Large Language Models (MLLMs) have shown strong capabilities in image understanding, motivating recent efforts to extend them to video reasoning. However, existing Video LLMs struggle in online streaming scenarios, where long temporal context must be preserved under strict memory constraints. We propose WAT (Watching Before Thinking), a two-stage framework for online video reasoning. WAT separates processing into a query-independent watching stage and a query-triggered thinking stage. The watching stage builds a hierarchical memory system with a Short-Term Memory (STM) that buffers recent frames and a fixed-capacity Long-Term Memory (LTM) that maintains a diverse summary of historical content using a redundancy-aware eviction policy. In the thinking stage, a context-aware retrieval mechanism combines the query with the current STM context to retrieve relevant historical frames from the LTM for cross-temporal reasoning. To support training for online video tasks, we introduce WAT-85K, a dataset containing streaming-style annotations emphasizing real-time perception, backward tracing, and forecasting. Experiments show that WAT achieves state-of-the-art performance on online video benchmarks, including 77.7% accuracy on StreamingBench and 55.2% on OVO-Bench, outperforming existing open-source online Video LLMs while operating at real-time frame rates.
[79] Distance-aware Soft Prompt Learning for Multimodal Valence-Arousal Estimation cs.CVPDF
Byeongjin Jung, Chanyeong Park, Sejoon Lim
TL;DR: 本文提出了一种用于多模态效价-唤醒度估计的距离感知软提示学习框架,通过将情感空间划分为九个区域并利用高斯核计算软标签来学习细粒度情感过渡,结合CLIP图像编码器和音频频谱变换器提取特征,并通过门控循环单元和分层融合方案进行时序建模与多模态集成,在Aff-Wild2数据集上实现了竞争性的性能。
Details
Motivation: 解决预训练视觉-语言模型(如CLIP)在连续回归任务中因文本提示的离散性而受限的问题,以更准确地捕捉自然场景中人类情感的细微差别。
Result: 在Aff-Wild2数据集的无约束野外场景中,该方法显著提升了效价-唤醒度估计的准确性,达到了竞争性的性能水平。
Insight: 创新点包括将情感空间网格化并引入基于距离的软标签学习来桥接语义空间与连续维度,以及采用分层融合策略(跨模态注意力对齐和门控融合)进行多模态特征集成,可借鉴于其他连续回归或多模态情感分析任务。
Abstract: Valence-arousal (VA) estimation is crucial for capturing the nuanced nature of human emotions in naturalistic environments. While pre-trained Vision-Language models like CLIP have shown remarkable semantic alignment capabilities, their application in continuous regression tasks is often limited by the discrete nature of text prompts. In this paper, we propose a novel multimodal framework for VA estimation that introduces Distance-aware Soft Prompt Learning to bridge the gap between semantic space and continuous dimensions. Specifically, we partition the VA space into a 3X3 grid, defining nine emotional regions, each associated with distinct textual descriptions. Rather than a hard categorization, we employ a Gaussian kernel to compute soft labels based on the Euclidean distance between the ground truth coordinates and the region centers, allowing the model to learn fine-grained emotional transitions. For multimodal integration, our architecture utilizes a CLIP image encoder and an Audio Spectrogram Transformer (AST) to extract robust spatial and acoustic features. These features are temporally modeled via Gated Recurrent Units (GRUs) and integrated through a hierarchical fusion scheme that sequentially combines cross-modal attention for alignment and gated fusion for adaptive refinement. Experimental results on the Aff-Wild2 dataset demonstrate that our proposed semantic-guided approach significantly enhances the accuracy of VA estimation, achieving competitive performance in unconstrained ``in-the-wild’’ scenarios.
[80] MIBench: Evaluating LMMs on Multimodal Interaction cs.CV | cs.AIPDF
Yu Miao, Zequn Yang, Yake Wei, Ziheng Chen, Haotian Ni
TL;DR: 本文介绍了MIBench,一个用于评估大型多模态模型(LMMs)多模态交互能力的综合性基准测试。该基准将每个实例构建为(视觉上下文、文本上下文、任务)三元组,要求模型采用正确的多模态交互形式来完成任务。MIBench从三个关键方面(视觉主导信息获取、文本主导信息获取、多模态协同生成新信息)和三个认知层次(识别、理解、推理)对模型进行评估。对多个SOTA LMMs的评估揭示了它们在多模态交互能力上的局限性。
Details
Motivation: 解决如何系统评估LMMs处理不同多模态交互方式(即模型根据任务需求整合利用跨模态信息的具体方式)的能力问题,因为这种能力是衡量模型多模态能力的关键。
Result: 在包含32个不同任务、超过10,000个视觉-文本上下文对的MIBench基准上评估了最先进的LMMs。结果显示:1)尽管模型参数和训练数据规模扩大,LMMs的多模态交互能力仍然受限;2)在处理视觉信息时容易被文本模态干扰;3)大多只具备基本的多模态协同能力;4)原生训练的多模态模型在基础交互能力上存在明显不足。
Insight: 论文的创新点在于提出了一个系统化、层次化(识别/理解/推理)评估多模态交互能力的新基准MIBench,其核心是将评估任务构建为(视觉上下文、文本上下文、任务)三元组。客观来看,该工作为理解和诊断LMMs在多模态融合上的具体短板(如模态干扰、协同能力弱)提供了细粒度的分析工具,有助于未来模型设计的针对性改进。
Abstract: In different multimodal scenarios, it needs to integrate and utilize information across modalities in a specific way based on the demands of the task. Different integration ways between modalities are referred to as “multimodal interaction”. How well a model handles various multimodal interactions largely characterizes its multimodal ability. In this paper, we introduce MIBench, a comprehensive benchmark designed to evaluate the multimodal interaction capabilities of Large Multimodal Models (LMMs), which formulates each instance as a (con_v , con_t, task) triplet with contexts from vision and text, necessitating that LMMs employ correct forms of multimodal interaction to effectively complete the task. MIBench assesses models from three key aspects: the ability to source information from vision-centric or text-centric cues, and the ability to generate new information from their joint synergy. Each interaction capability is evaluated hierarchically across three cognitive levels: Recognition, Understanding, and Reasoning. MIBench comprises over 10,000 vision-text context pairs spanning 32 distinct tasks. Evaluation of state-of-the-art LMMs show that: (1) LMMs’ ability on multimodal interaction remains constrained, despite the scaling of model parameters and training data; (2) they are easily distracted by textual modalities when processing vision information; (3) they mostly possess a basic capacity for multimodal synergy; and (4) natively trained multimodal models show noticeable deficits in fundamental interaction ability. We expect that these observations can serve as a reference for developing LMMs with more enhanced multimodal ability in the future.
[81] CtrlAttack: A Unified Attack on World-Model Control in Diffusion Models cs.CV | cs.AI | cs.CRPDF
Shuhan Xu, Siyuan Liang, Hongling Zheng, Yong Luo, Han Hu
TL;DR: 该论文提出了一种名为CtrlAttack的统一攻击方法,针对基于扩散模型的图像到视频(I2V)模型中学习到的世界模型控制(状态转移)的脆弱性。该方法通过将扰动表示为低维速度场并进行时间积分来构建连续位移场,从而在保持时间一致性的同时干扰生成过程中的状态演化,并适用于白盒和黑盒攻击设置。实验表明,该方法能有效破坏时间一致性,揭示了I2V模型在状态动态层面的潜在安全风险。
Details
Motivation: 现有研究主要关注扩散基I2V模型的视觉质量和可控性,而模型学习到的状态转移的鲁棒性尚未得到充分研究。本文旨在填补这一空白,首次分析I2V模型的脆弱性,发现时间控制机制构成新的攻击面,并揭示在不同攻击设置下统一建模这些机制的挑战。
Result: 在白盒设置下攻击成功率(ASR)超过90%,在黑盒设置下超过80%,同时将FID和FVD的变化分别控制在6和130以内,表明即使在低维和强正则化扰动约束下,该方法仍能显著破坏时间一致性。
Insight: 创新点在于首次系统分析了I2V模型状态转移的脆弱性,并提出了统一的轨迹控制攻击方法。其核心是将扰动建模为低维速度场并通过时间积分确保时间一致性,同时通过映射到观测空间实现白盒与黑盒攻击的统一框架,为评估和提升视频生成模型的安全性提供了新视角。
Abstract: Diffusion-based image-to-video (I2V) models increasingly exhibit world-model-like properties by implicitly capturing temporal dynamics. However, existing studies have mainly focused on visual quality and controllability, and the robustness of the state transition learned by the model remains understudied. To fill this gap, we are the first to analyze the vulnerability of I2V models, find that temporal control mechanisms constitute a new attack surface, and reveal the challenge of modeling them uniformly under different attack settings. Based on this, we propose a trajectory-control attack, called CtrlAttack, to interfere with state evolution during the generation process. Specifically, we represent the perturbation as a low-dimensional velocity field and construct a continuous displacement field via temporal integration, thereby affecting the model’s state transitions while maintaining temporal consistency; meanwhile, we map the perturbation to the observation space, making the method applicable to both white-box and black-box attack settings. Experimental results show that even under low-dimensional and strongly regularized perturbation constraints, our method can still significantly disrupt temporal consistency by increasing the attack success rate (ASR) to over 90% in the white-box setting and over 80% in the black-box setting, while keeping the variation of the FID and FVD within 6 and 130, respectively, thus revealing the potential security risk of I2V models at the level of state dynamics.
[82] Vision-Language Based Expert Reporting for Painting Authentication and Defect Detection cs.CV | eess.SPPDF
Eman Ouda, Mohammed Salah, Arsenii O. Chulkov, Gianfranco Gargiulo, Gian Luca Tartaglia
TL;DR: 本文提出了一种全自动的热成像-视觉-语言模型(VLM)框架,用于文化遗产(如镶嵌画)的真实性鉴定和缺陷检测。该框架将多模态主动红外热成像(AIRT)分析与模态感知的文本报告生成相结合,无需人工干预。通过融合多种热成像处理技术(PCT、TSR、PPT)的结果生成共识分割图,并利用VLM生成结构化报告,描述异常位置、热行为及物理解释,同时明确承认不确定性。
Details
Motivation: 解决文化遗产保护中热成像技术解释和报告高度依赖专家、缺乏标准化和可解释性框架的问题,以促进跨收藏比较和系统化集成。
Result: 在两个镶嵌画样本上的评估表明,该方法实现了稳定的异常检测和一致的结构化解释,显示了跨样本的可重复性和泛化能力。
Insight: 创新点在于首次将多模态热成像分析与结构化自然语言报告生成在文化遗产领域进行自动化集成,通过共识分割融合多种热指标并利用VLM生成包含不确定性的解释性报告,提高了方法的可解释性和标准化程度。
Abstract: Authenticity and condition assessment are central to conservation decision-making, yet interpretation and reporting of thermographic output remain largely bespoke and expert-dependent, complicating comparison across collections and limiting systematic integration into conservation documentation. Pulsed Active Infrared Thermography (AIRT) is sensitive to subsurface features such as material heterogeneity, voids, and past interventions; however, its broader adoption is constrained by artifact misinterpretation, inter-laboratory variability, and the absence of standardized, explainable reporting frameworks. Although multi-modal thermographic processing techniques are established, their integration with structured natural-language interpretation has not been explored in cultural heritage. A fully automated thermography-vision-language model (VLM) framework is presented. It combines multi-modal AIRT analysis with modality-aware textual reporting, without human intervention during inference. Thermal sequences are processed using Principal Component Thermography (PCT), Thermographic Signal Reconstruction (TSR), and Pulsed Phase Thermography (PPT), and the resulting anomaly masks are fused into a consensus segmentation that emphasizes regions supported by multiple thermal indicators while mitigating boundary artifacts. The fused evidence is provided to a VLM, which generates structured reports describing the location of the anomaly, thermal behavior, and plausible physical interpretations while explicitly acknowledging the uncertainty and diagnostic limitations. Evaluation on two marquetries demonstrates consistent anomaly detection and stable structured interpretations, indicating reproducibility and generalizability across samples.
[83] Draft-and-Target Sampling for Video Generation Policy cs.CV | cs.AIPDF
Qikang Zhang, Yingjie Lei, Wei Liu, Daochang Liu
TL;DR: 本文提出了一种名为Draft-and-Target Sampling的新型扩散推理范式,用于提升视频生成模型作为机器人策略时的推理效率。该方法通过自博弈去噪,结合大步长的草稿采样和小步长的目标采样来验证轨迹,并引入令牌分块和渐进接受策略以减少冗余计算,实现了训练免费的速度提升。
Details
Motivation: 现有视频生成模型用作机器人策略时,存在计算成本高、推理时间长的问题,本文旨在解决这一效率挑战。
Result: 在三个基准测试上的实验表明,该方法可实现高达2.1倍的加速,并在最小化成功率妥协的情况下提升了当前最先进方法的效率。
Insight: 创新点在于提出了一种训练免费的双轨迹自博弈去噪推理范式,以及令牌分块和渐进接受策略,在保持性能的同时显著加速扩散模型的推理过程。
Abstract: Video generation models have been used as a robot policy to predict the future states of executing a task conditioned on task description and observation. Previous works ignore their high computational cost and long inference time. To address this challenge, we propose Draft-and-Target Sampling, a novel diffusion inference paradigm for video generation policy that is training-free and can improve inference efficiency. We introduce a self-play denoising approach by utilizing two complementary denoising trajectories in a single model, draft sampling takes large steps to generate a global trajectory in a fast manner and target sampling takes small steps to verify it. To further speedup generation, we introduce token chunking and progressive acceptance strategy to reduce redundant computation. Experiments on three benchmarks show that our method can achieve up to 2.1x speedup and improve the efficiency of current state-of-the-art methods with minimal compromise to the success rate. Our code is available.
[84] LADR: Locality-Aware Dynamic Rescue for Efficient Text-to-Image Generation with Diffusion Large Language Models cs.CV | cs.CLPDF
Chenglin Wang, Yucheng Zhou, Shawn Chen, Tao Wang, Kai Zhang
TL;DR: 本文提出了一种名为LADR(Locality-Aware Dynamic Rescue)的训练免费方法,旨在加速基于离散扩散语言模型的文本到图像生成推理过程。该方法通过利用图像的空间马尔可夫特性,优先恢复与已生成像素空间相邻的“生成前沿”区域的标记,从而最大化信息增益并减少迭代解码的延迟。
Details
Motivation: 离散扩散语言模型在统一多模态生成中表现出色,但其迭代解码导致的高推理延迟阻碍了实际部署。现有加速策略通常需要昂贵的重新训练,或未能有效利用视觉数据固有的二维空间冗余性。
Result: 在四个文本到图像生成基准测试上的广泛实验表明,LADR相比标准基线实现了约4倍的加速,并在保持甚至在某些情况下(特别是空间推理任务)提高了生成保真度,在效率与质量之间达到了最先进的权衡。
Insight: 论文的创新点在于提出了一种无需重新训练的加速方法,通过形态学邻居识别定位候选标记、采用风险有界过滤机制防止错误传播,以及利用流形一致的反向调度来对齐扩散轨迹与加速掩码密度,有效利用了图像的空间局部性先验。
Abstract: Discrete Diffusion Language Models have emerged as a compelling paradigm for unified multimodal generation, yet their deployment is hindered by high inference latency arising from iterative decoding. Existing acceleration strategies often require expensive re-training or fail to leverage the 2D spatial redundancy inherent in visual data. To address this, we propose Locality-Aware Dynamic Rescue (LADR), a training-free method that expedites inference by exploiting the spatial Markov property of images. LADR prioritizes the recovery of tokens at the ‘’generation frontier’’, regions spatially adjacent to observed pixels, thereby maximizing information gain. Specifically, our method integrates morphological neighbor identification to locate candidate tokens, employs a risk-bounded filtering mechanism to prevent error propagation, and utilizes manifold-consistent inverse scheduling to align the diffusion trajectory with the accelerated mask density. Extensive experiments on four text-to-image generation benchmarks demonstrate that our LADR achieves an approximate 4 x speedup over standard baselines. Remarkably, it maintains or even enhances generative fidelity, particularly in spatial reasoning tasks, offering a state-of-the-art trade-off between efficiency and quality.
[85] LibraGen: Playing a Balance Game in Subject-Driven Video Generation cs.CVPDF
Jiahao Zhu, Shanshan Lao, Lijie Liu, Gen Li, Tianhao Qi
TL;DR: 本文提出了LibraGen框架,用于解决主题驱动视频生成(S2V)中平衡基础模型固有先验(如运动连贯性、视觉美学和提示对齐)与新获得的S2V能力之间的挑战。该框架采用“提升支点,调整平衡”的核心哲学,通过提升数据质量、结合监督微调与模型合并、设计两种定制化的直接偏好优化(DPO)流程,以及引入时间依赖的动态无分类器引导方案,实现了仅使用千级别训练数据即可超越现有开源和商业S2V模型的性能。
Details
Motivation: 随着视频生成基础模型(VGFMs)的发展,定制化生成(特别是主题到视频生成,S2V)受到越来越多的关注。然而,现有方法在增强S2V能力时,往往以牺牲模型固有的运动连贯性、视觉美学和提示对齐等先验为代价,缺乏对这些方面的平衡考虑。
Result: 实验结果表明,LibraGen在仅使用千级别训练数据的情况下,在主题驱动视频生成任务上超越了现有的开源和商业S2V模型,实现了性能上的优势。
Insight: 论文的创新点在于将基础模型扩展至S2V生成视为一个平衡游戏,提出了以数据质量为支点的“质量优于数量”策略,并设计了混合数据过滤管道、Tune-to-Balance后训练范式(结合交叉对和内部对数据与模型合并)、两种定制化DPO流程(Consis-DPO和Real-Fake DPO)的合并,以及时间依赖的动态无分类器引导推理方案,从而在保持模型固有优势的同时有效提升S2V能力。
Abstract: With the advancement of video generation foundation models (VGFMs), customized generation, particularly subject-to-video (S2V), has attracted growing attention. However, a key challenge lies in balancing the intrinsic priors of a VGFM, such as motion coherence, visual aesthetics, and prompt alignment, with its newly derived S2V capability. Existing methods often neglect this balance by enhancing one aspect at the expense of others. To address this, we propose LibraGen, a novel framework that views extending foundation models for S2V generation as a balance game between intrinsic VGFM strengths and S2V capability. Specifically, guided by the core philosophy of “Raising the Fulcrum, Tuning to Balance,” we identify data quality as the fulcrum and advocate a quality-over-quantity approach. We construct a hybrid pipeline that combines automated and manual data filtering to improve overall data quality. To further harmonize the VGFM’s native capabilities with its S2V extension, we introduce a Tune-to-Balance post-training paradigm. During supervised fine-tuning, both cross-pair and in-pair data are incorporated, and model merging is employed to achieve an effective trade-off. Subsequently, two tailored direct preference optimization (DPO) pipelines, namely Consis-DPO and Real-Fake DPO, are designed and merged to consolidate this balance. During inference, we introduce a time-dependent dynamic classifier-free guidance scheme to enable flexible and fine-grained control. Experimental results demonstrate that LibraGen outperforms both open-source and commercial S2V models using only thousand-scale training data.
[86] MIRAGE: Model-agnostic Industrial Realistic Anomaly Generation and Evaluation for Visual Anomaly Detection cs.CVPDF
Jinwei Hu, Francesco Borsatti, Arianna Stropeni, Davide Dalle Pezze, Manuel Barusco
TL;DR: MIRAGE是一个模型无关的工业真实异常生成与评估框架,它通过API调用生成模型、利用VLM自动生成缺陷提示、并采用CLIP质量过滤器,实现了无需训练和真实异常图像即可生成逼真异常图像和像素级掩码的自动化流程。
Details
Motivation: 现有异常生成方法要么需要真实异常样本,要么硬件成本高,要么生成的缺陷缺乏真实感,MIRAGE旨在提供一个无需训练、无需真实异常数据、可扩展且逼真的异常生成解决方案。
Result: 在MVTec AD和VisA数据集上,使用Gemini 2.5 Flash Image作为生成骨干,MIRAGE在异常分割任务和生成图像视觉质量(通过IS、IC-LPIPS指标和31名参与者的1550对投票的人体感知研究评估)方面均表现出色,为工业检测提供了可扩展的基础。
Insight: 创新点包括:完全自动化的黑盒生成流程、基于VLM的自动缺陷提示生成、CLIP质量过滤器、以及结合Grounding DINO和YOLOv26-Seg的无训练双分支语义变化检测模块用于掩码生成;客观来看,其模型无关性、无需真实异常数据以及大规模公开数据集的发布具有重要借鉴意义。
Abstract: Industrial visual anomaly detection (VAD) methods are typically trained on normal samples only, yet performance improves substantially when even limited anomalous data is available. Existing anomaly generation approaches either require real anomalous examples, demand expensive hardware, or produce synthetic defects that lack realism. We present MIRAGE (Model-agnostic Industrial Realistic Anomaly Generation and Evaluation), a fully automated pipeline for realistic anomalous image generation and pixel-level mask creation that requires no training and no anomalous images. Our pipeline accesses any generative model as a black box via API calls, uses a VLM for automatic defect prompt generation, and includes a CLIP-based quality filter to retain only well-aligned generated images. For mask generation at scale, we introduce a lightweight, training-free dual-branch semantic change detection module combining text-conditioned Grounding DINO features with fine-grained YOLOv26-Seg structural features. We benchmark four generation methods using Gemini 2.5 Flash Image (Nano Banana) as the generative backbone, evaluating performance on MVTec AD and VisA across two distinct tasks: (i) downstream anomaly segmentation and (ii) visual quality of the generated images, assessed via standard metrics (IS, IC-LPIPS) and a human perceptual study involving 31 participants and 1,550 pairwise votes. The results demonstrate that MIRAGE offers a scalable, accessible foundation for anomaly-aware industrial inspection that requires no real defect data. As a final contribution, we publicly release a large-scale dataset comprising 500 image-mask pairs per category for every MVTec AD and VisA class, over 13,000 pairs in total, alongside all generation prompts and pipeline code.
[87] A Systematic Benchmark of GAN Architectures for MRI-to-CT Synthesis cs.CVPDF
Alessandro Pesci, Valerio Guarrasi, Marco Alì, Isabella Castiglioni, Paolo Soda
TL;DR: 本文对十种生成对抗网络架构在MRI到CT图像合成任务上进行了系统性基准测试,使用SynthRAD2025数据集在腹部、胸部和头颈部三个解剖区域进行评估。所有模型在统一的预处理和优化设置下训练,并通过多维度指标评估性能。研究发现监督配对模型普遍优于非配对方法,Pix2Pix模型在性能与复杂度之间取得了最佳平衡。
Details
Motivation: 解决MRI到CT图像转换领域缺乏对不同GAN架构进行系统、公平比较的问题,旨在为仅使用MRI的临床工作流程提供模型选择的量化指导。
Result: 在SynthRAD2025数据集上的实验表明,监督配对模型(如Pix2Pix)在体素精度、结构保真度、感知质量和分布真实性等指标上均优于非配对方法;多区域训练提升了结构鲁棒性,而单区域训练则最大化体素级保真度。
Insight: 研究证实了体素级监督对医学图像合成的重要性;提出了一个可复现的基准测试框架,为未来比较研究提供了标准;揭示了模型性能与计算复杂度之间的权衡关系,Pix2Pix展现出最佳平衡性。
Abstract: The translation from Magnetic resonance imaging (MRI) to Computed tomography (CT) has been proposed as an effective solution to facilitate MRI-only clinical workflows while limiting exposure to ionizing radiation. Although numerous Generative Adversarial Network (GAN) architectures have been proposed for MRI-to-CT translation, systematic and fair comparisons across heterogeneous models remain limited. We present a comprehensive benchmark of ten GAN architectures evaluated on the SynthRAD2025 dataset across three anatomical districts (abdomen, thorax, head-and-neck). All models were trained under a unified validation protocol with identical preprocessing and optimization settings. Performance was assessed using complementary metrics capturing voxel-wise accuracy, structural fidelity, perceptual quality, and distribution-level realism, alongside an analysis of computational complexity. Supervised Paired models consistently outperformed Unpaired approaches, confirming the importance of voxel-wise supervision. Pix2Pix achieved the most balanced performance across districts while maintaining a favorable quality-to-complexity trade-off. Multi-district training improved structural robustness, whereas intra-district training maximized voxel-wise fidelity. This benchmark provides quantitative and computational guidance for model selection in MRI-only radiotherapy workflows and establishes a reproducible framework for future comparative studies. To ensure the reproducibility of our experiments we make our code public, together with the overall results, at the following link:https://github.com/arco-group/MRI_TO_CT.git
[88] Hide and Seek: Investigating Redundancy in Earth Observation Imagery cs.CV | cs.AI | cs.LGPDF
Tasos Papazafeiropoulos, Nikolaos Ioannis Bountos, Nikolas Papadopoulos, Ioannis Papoutsis
TL;DR: 该论文通过系统性的领域特定调查,揭示了地球观测数据中普遍存在的多维冗余(光谱、时间、空间和语义),并证明利用这种冗余可以在保持性能(约98.5%基线)的同时,显著降低计算成本(减少约4倍GFLOPs),为构建更高效、可扩展的大规模EO模型奠定了基础。
Details
Motivation: 当前EO领域机器学习研究进展迅速,但可能忽视了EO数据区别于其他领域的基本特性,特别是其多维冗余现象及其对领域应用的影响尚未得到充分反映。
Result: 实验证实,在训练和推理阶段,利用冗余都能在关键EO任务上达到与基线相当的性能(约98.5%),同时计算成本大幅降低(约减少4倍GFLOPs),且该增益在不同任务、地理位置、传感器、采样距离和模型架构中保持一致。
Insight: 论文的创新点在于首次系统论证了多维冗余是EO数据的结构性属性而非实验假象,并展示了其普遍性和一致性;从客观角度看,这为设计轻量级、高效率的EO专用模型提供了核心洞见和优化方向。
Abstract: The growing availability of Earth Observation (EO) data and recent advances in Computer Vision have driven rapid progress in machine learning for EO, producing domain-specific models at ever-increasing scales. Yet this progress risks overlooking fundamental properties of EO data that distinguish it from other domains. We argue that EO data exhibit a multidimensional redundancy (spectral, temporal, spatial, and semantic) which has a more pronounced impact on the domain and its applications than what current literature reflects. To validate this hypothesis, we conduct a systematic domain-specific investigation examining the existence, consistency, and practical implications of this phenomenon across key dimensions of EO variability. Our findings confirm that redundancy in EO data is both substantial and pervasive: exploiting it yields comparable performance ($\approx98.5%$ of baseline) at a fraction of the computational cost ($\approx4\times$ fewer GFLOPs), at both training and inference. Crucially, these gains are consistent across tasks, geospatial locations, sensors, ground sampling distances, and architectural designs; suggesting that multi-faceted redundancy is a structural property of EO data rather than an artifact of specific experimental choices. These results lay the groundwork for more efficient, scalable, and accessible large-scale EO models.
[89] Semantic Aware Feature Extraction for Enhanced 3D Reconstruction cs.CV | cs.AIPDF
Ronald Nap, Andy Xiao
TL;DR: 本文提出了一种语义感知的特征提取框架,通过多任务学习联合训练关键点检测、关键点描述和语义分割,旨在提升3D重建的质量。该方法在车载单目鱼眼相机采集的数据上进行测试,应用于多层停车场环境,能够生成带有语义标注和高度估计的3D点云,从而支持多楼层映射。
Details
Motivation: 现有基于深度学习的特征匹配方法主要关注几何属性,往往忽略了高层语义信息,这限制了在复杂场景(如3D重建)中特征匹配的一致性和重建结果的丰富性。
Result: 在3D重建任务中,该方法相比标准特征匹配技术,生成了结构细节更丰富、包含高程信息的语义标注3D点云,证明了其有效性。
Insight: 创新点在于通过多任务学习将语义分割与特征提取(检测与描述)联合训练,并集成深度匹配模块,利用语义线索提升特征匹配的一致性,从而增强3D重建的语义和几何完整性。
Abstract: Feature matching is a fundamental problem in computer vision with wide-ranging applications, including simultaneous localization and mapping (SLAM), image stitching, and 3D reconstruction. While recent advances in deep learning have improved keypoint detection and description, most approaches focus primarily on geometric attributes and often neglect higher-level semantic information. This work proposes a semantic-aware feature extraction framework that employs multi-task learning to jointly train keypoint detection, keypoint description, and semantic segmentation. The method is benchmarked against standard feature matching techniques and evaluated in the context of 3D reconstruction. To enhance feature correspondence, a deep matching module is integrated. The system is tested using input from a single monocular fisheye camera mounted on a vehicle and evaluated within a multi-floor parking structure. The proposed approach supports semantic 3D reconstruction with altitude estimation, capturing elevation changes and enabling multi-level mapping. Experimental results demonstrate that the method produces semantically annotated 3D point clouds with improved structural detail and elevation information, underscoring the effectiveness of joint training with semantic cues for more consistent feature matching and enhanced 3D reconstruction.
[90] DiveUp: Learning Feature Upsampling from Diverse Vision Foundation Models cs.CVPDF
Xiaoqiong Liu, Heng Fan
TL;DR: DiveUp提出了一种新颖的特征上采样框架,通过引入多视觉基础模型(VFM)的关系指导,打破了现有方法对单一模型的依赖。它利用不同VFM的结构共识作为专家委员会来正则化上采样器的学习过程,防止源模型不准确空间结构的传播,并通过一种通用的局部质心场关系特征表示和尖峰感知选择策略,实现了跨模型的无缝交互与可靠指导聚合。
Details
Motivation: 现有特征上采样方法通常依赖同一基础模型的高分辨率特征进行自重建,这导致上采样器过度拟合源模型固有的位置错位和高范数伪影。DiveUp旨在解决这一根本局限,通过利用多样化的VFM来提供更可靠的空间结构指导。
Result: 大量实验表明,DiveUp在各种下游密集预测任务上实现了最先进的性能,验证了多专家关系指导的有效性。
Insight: 核心创新点在于:1) 利用多VFM的结构共识作为正则化指导,而非简单的特征融合;2) 提出通用的局部质心场关系特征表示,以对齐不同VFM的特征空间;3) 引入尖峰感知选择策略,动态聚合最可靠专家的指导。这提供了一个统一、编码器无关的框架,可通用地上采样来自不同VFM的特征。
Abstract: Recently, feature upsampling has gained increasing attention owing to its effectiveness in enhancing vision foundation models (VFMs) for pixel-level understanding tasks. Existing methods typically rely on high-resolution features from the same foundation model to achieve upsampling via self-reconstruction. However, relying solely on intra-model features forces the upsampler to overfit to the source model’s inherent location misalignment and high-norm artifacts. To address this fundamental limitation, we propose DiveUp, a novel framework that breaks away from single-model dependency by introducing multi-VFM relational guidance. Instead of naive feature fusion, DiveUp leverages diverse VFMs as a panel of experts, utilizing their structural consensus to regularize the upsampler’s learning process, effectively preventing the propagation of inaccurate spatial structures from the source model. To reconcile the unaligned feature spaces across different VFMs, we propose a universal relational feature representation, formulated as a local center-of-mass (COM) field, that extracts intrinsic geometric structures, enabling seamless cross-model interaction. Furthermore, we introduce a spikiness-aware selection strategy that evaluates the spatial reliability of each VFM, effectively filtering out high-norm artifacts to aggregate guidance from only the most reliable expert at each local region. DiveUp is a unified, encoder-agnostic framework; a jointly-trained model can universally upsample features from diverse VFMs without requiring per-model retraining. Extensive experiments demonstrate that DiveUp achieves state-of-the-art performance across various downstream dense prediction tasks, validating the efficacy of multi-expert relational guidance. Our code and models are available at: https://github.com/Xiaoqiong-Liu/DiveUp
[91] Opportunistic Cardiac Health Assessment: Estimating Phenotypes from Localizer MRI through Multi-Modal Representations cs.CV | cs.AIPDF
Busra Nur Zeybek, Özgün Turgut, Yundi Zhang, Jiazhen Pan, Robert Graf
TL;DR: 本文提出了一种名为C-TRIP的多模态框架,旨在利用常规磁共振检查中快速获取但通常被丢弃的定位像(localizer MRI)、低成本的心电图(ECG)信号以及患者人口统计学和生活方式等表格元数据,来预测心脏表型(如射血分数),从而为心脏健康评估提供一种低成本、便捷的替代方案,以替代昂贵且需要高时空分辨率的心脏电影磁共振成像(cine CMR)。
Details
Motivation: 心血管疾病是主要死因,评估心脏健康的核心指标——心脏表型(CPs)通常依赖昂贵且复杂的电影心脏磁共振成像(cine CMR)来获取。然而,每次磁共振检查开始时都会快速采集用于扫描定位的低分辨率定位像,这些图像在规划后即被丢弃。同时,心电图(ECG)成本低廉且能捕获心脏时间活动信息,患者元数据也提供重要背景。论文的动机是整合这些廉价、易得的多模态数据(定位像、ECG、表格数据),开发一个框架来准确估计CPs,提高心脏健康评估的可及性。
Result: 论文提出的C-TRIP框架在预测心脏表型上取得了准确的结果:对于功能性心脏表型(如射血分数)预测准确,对于结构性心脏表型则显示出高度相关性。虽然没有明确提及具体基准数据集或与SOTA的定量比较,但结果证明了该多模态方法利用非诊断性定位像进行预测的有效性。
Insight: 论文的创新点在于提出了一种新颖的、面向机会性心脏健康评估的多模态表征学习框架(C-TRIP)。其核心洞察是:1)将临床实践中通常被废弃的、快速获取的定位像MRI重新利用为核心输入模态,实现了“变废为宝”;2)创造性地融合了空间信息(定位像)、时间信息(ECG)和患者上下文信息(表格元数据)三种异构模态,通过分阶段训练(单模态预训练、多模态融合、预测)学习一个鲁棒的联合潜在空间;3)在推理阶段仅需定位像,即可利用从多模态对齐中学到的丰富表征进行预测,这在实际部署中极具实用价值,为低成本、大规模的心脏表型筛查提供了新思路。
Abstract: Cardiovascular diseases are the leading cause of death. Cardiac phenotypes (CPs), e.g., ejection fraction, are the gold standard for assessing cardiac health, but they are derived from cine cardiac magnetic resonance imaging (CMR), which is costly and requires high spatio-temporal resolution. Every magnetic resonance (MR) examination begins with rapid and coarse localizers for scan planning, which are discarded thereafter. Despite non-diagnostic image quality and lack of temporal information, localizers can provide valuable structural information rapidly. In addition to imaging, patient-level information, including demographics and lifestyle, influence the cardiac health assessment. Electrocardiograms (ECGs) are inexpensive, routinely ordered in clinical practice, and capture the temporal activity of the heart. Here, we introduce C-TRIP (Cardiac Tri-modal Representations for Imaging Phenotypes), a multi-modal framework that aligns localizer MRI, ECG signals, and tabular metadata to learn a robust latent space and predict CPs using localizer images as an opportunistic alternative to CMR. By combining these three modalities, we leverage cheap spatial and temporal information from localizers, and ECG, respectively while benefiting from patient-specific context provided by tabular data. Our pipeline consists of three stages. First, encoders are trained independently to learn uni-modal representations. The second stage fuses the pre-trained encoders to unify the latent space. The final stage uses the enriched representation space for CP prediction, with inference performed exclusively on localizer MRI. Proposed C-TRIP yields accurate functional CPs, and high correlations for structural CPs. Since localizers are inherently rapid and low-cost, our C-TRIP framework could enable better accessibility for CP estimation.
[92] Egocentric World Model for Photorealistic Hand-Object Interaction Synthesis cs.CV | cs.ROPDF
Dayou Li, Lulin Liu, Bangya Liu, Shijie Zhou, Jiu Feng
TL;DR: 本文提出了一种名为EgoHOI的以自我为中心的人-物交互世界模型,旨在仅从用户动作信号中模拟出具有物理真实感和接触一致性的交互动态,而非依赖未来物体状态的视频生成。该模型通过从3D估计中提取几何和运动学先验,构建物理感知嵌入来正则化模拟过程,确保物理准确性。
Details
Motivation: 现有方法常通过依赖已知未来物体轨迹的条件视频生成来规避物理挑战,但世界模型应作为真正的模拟器,仅从用户动作推断交互动态,以支持具身AI的可扩展数据源。
Result: 在HOT3D数据集上的实验表明,EgoHOI相比强基线方法取得了持续的性能提升,消融实验验证了其物理感知设计的有效性。
Insight: 创新点在于通过蒸馏3D估计的几何和运动学先验到物理感知嵌入中,实现了无需未来状态输入的物理准确模拟,突破了现有方法依赖未来物体轨迹的捷径,为具身AI提供了更真实的交互合成能力。
Abstract: To serve as a scalable data source for embodied AI, world models should act as true simulators that infer interaction dynamics strictly from user actions, rather than mere conditional video generators relying on privileged future object states. In this context, egocentric Human-Object Interaction (HOI) world models are critical for predicting physically grounded first-person rollouts. However, building such models is profoundly challenging due to rapid head motions, severe occlusions, and high-DoF hand articulations that abruptly alter contact topologies. Consequently, existing approaches often circumvent these physics challenges by resorting to conditional video generation with access to known future object trajectories. We introduce EgoHOI, an egocentric HOI world model that breaks away from this shortcut to simulate photorealistic, contact-consistent interactions from action signals alone. To ensure physical accuracy without future-state inputs, EgoHOI distills geometric and kinematic priors from 3D estimates into physics-informed embeddings. These embeddings regularize the egocentric rollouts toward physically valid dynamics. Experiments on the HOT3D dataset demonstrate consistent gains over strong baselines, and ablations validate the effectiveness of our physics-informed design.
[93] Locatability-Guided Adaptive Reasoning for Image Geo-Localization with Vision-Language Models cs.CV | cs.AIPDF
Bo Yu, Fengze Yang, Yiming Liu, Chao Wang, Xuewen Luo
TL;DR: 本文提出了一种名为Geo-ADAPT的框架,用于解决基于视觉语言模型(VLMs)的图像地理定位任务中检索增强生成(RAG)方法受限于数据库质量、以及推理驱动方法因固定深度推理路径导致幻觉和精度下降的问题。核心创新在于引入了一个优化的可定位性评分来量化图像是否适合深度推理,并构建了一个分层的推理数据集Geo-ADAPT-51K。在此基础上,通过两阶段的组相对策略优化(GRPO)课程学习,训练模型自适应地调整推理深度,从而在多个基准测试上达到最先进的性能,并显著减少了幻觉。
Details
Motivation: 现有基于视觉语言模型的图像地理定位方法存在局限性:RAG方法受检索数据库质量制约,而纯推理方法无法内化图像的‘可定位性’,依赖于低效的固定深度推理路径,这增加了幻觉并降低了定位精度。本文旨在克服这些限制。
Result: 论文提出的Geo-ADAPT框架在多个地理定位基准测试上达到了最先进的(SOTA)性能,并显著减少了幻觉现象。
Insight: 主要创新点包括:1) 提出了一个优化的可定位性评分,用于量化图像是否适合进行深度地理推理;2) 构建了一个基于可定位性分层的、包含增强推理轨迹的数据集Geo-ADAPT-51K;3) 设计了一个两阶段的组相对策略优化(GRPO)课程学习框架,通过定制化的奖励函数来调控自适应推理深度、视觉基础(grounding)和分层地理精度,实现了高效且自适应的推理策略。
Abstract: The emergence of Vision-Language Models (VLMs) has introduced new paradigms for global image geo-localization through retrieval-augmented generation (RAG) and reasoning-driven inference. However, RAG methods are constrained by retrieval database quality, while reasoning-driven approaches fail to internalize image locatability, relying on inefficient, fixed-depth reasoning paths that increase hallucinations and degrade accuracy. To overcome these limitations, we introduce an Optimized Locatability Score that quantifies an image’s suitability for deep reasoning in geo-localization. Using this metric, we curate Geo-ADAPT-51K, a locatability-stratified reasoning dataset enriched with augmented reasoning trajectories for complex visual scenes. Building on this foundation, we propose a two-stage Group Relative Policy Optimization (GRPO) curriculum with customized reward functions that regulate adaptive reasoning depth, visual grounding, and hierarchical geographical accuracy. Our framework, Geo-ADAPT, learns an adaptive reasoning policy, achieves state-of-the-art performance across multiple geo-localization benchmarks, and substantially reduces hallucinations by reasoning both adaptively and efficiently.
[94] Causal Attribution via Activation Patching cs.CVPDF
Amirmohammad Izadi, Mohammadali Banayeeanzade, Alireza Mirrokni, Hosein Hasani, Mobin Bagherian
TL;DR: 本文提出了一种名为CAAP(Causal Attribution via Activation Patching)的新方法,用于对Vision Transformer(ViT)模型进行归因分析,旨在更准确地识别影响模型预测的图像区域。该方法通过直接干预内部激活而非使用学习到的掩码或合成扰动模式,来估计单个图像补丁对ViT预测的因果贡献。
Details
Motivation: 现有基于梯度和扰动的归因方法难以隔离与单个图像补丁相关的内部表示的因果贡献,因为类别相关证据是通过跨层的补丁令牌交互形成的,且输入级扰动可能无法重建模型实际使用的内部证据,导致归因不忠实或定位不佳。
Result: 在多个ViT骨干网络和标准评估指标上,CAAP显著优于现有方法,产生了更忠实和定位更准确的归因图。
Insight: 创新点在于提出了一种基于因果干预的归因框架,通过在中间层范围将源图像激活插入中性目标上下文,直接测量补丁相关内部表示对模型预测的因果效应,从而在保持空间特异性的同时捕获类别相关证据,避免了后期全局混合带来的定位模糊问题。
Abstract: Attribution methods for Vision Transformers (ViTs) aim to identify image regions that influence model predictions, but producing faithful and well-localized attributions remains challenging. Existing gradient-based and perturbation-based techniques often fail to isolate the causal contribution of internal representations associated with individual image patches. The key challenge is that class-relevant evidence is formed through interactions between patch tokens across layers, and input-level perturbations can be poor proxies for patch importance, since they may fail to reconstruct the internal evidence actually used by the model. We propose Causal Attribution via Activation Patching (CAAP), which estimates the contribution of individual image patches to the ViT’s prediction by directly intervening on internal activations rather than using learned masks or synthetic perturbation patterns. For each patch, CAAP inserts the corresponding source-image activations into a neutral target context over an intermediate range of layers and uses the resulting target-class score as the attribution signal. The resulting attribution map reflects the causal effect of patch-associated internal representations on the model’s prediction. The causal intervention serves as a principled measure of patch influence by capturing class-relevant evidence after initial representation formation, while avoiding late-layer global mixing that can reduce spatial specificity. Across multiple ViT backbones and standard metrics, CAAP significantly outperforms existing methods and produces more faithful and localized attributions.
[95] FMS$^2$: Unified Flow Matching for Segmentation and Synthesis of Thin Structures cs.CVPDF
Babak Asadi, Peiyang Wu, Mani Golparvar-Fard, Viraj Shah, Ramez Hajj
TL;DR: 本文提出了FMS²框架,包含SegFlow和SynFlow两个模块,用于薄结构(如裂缝和血管)的分割与合成。SegFlow是一个基于流匹配的分割模型,将预测重构为连续的图像到掩码传输,通过ODE积分输出掩码,提升了薄结构的连续性和清晰度。SynFlow是一个掩码条件化的掩码到图像生成器,可生成像素对齐的合成图像-掩码对,用于数据增强。
Details
Motivation: 薄结构分割面临拓扑敏感几何、标注成本高和跨域泛化差的问题,现有方法孤立地解决这些挑战,本文旨在统一处理分割与合成任务。
Result: 在五个裂缝和血管基准测试中,SegFlow在体积指标(平均IoU)上从0.511提升到0.599(+17.2%),在拓扑指标(Betti匹配误差)上从82.145降低到51.524(-37.3%),优于CNN、Transformer、Mamba和生成基线。在有限标注下,结合SynFlow生成的数据,仅用25%真实标注即可恢复接近全性能,平均跨域IoU提升0.11。
Insight: 创新点包括:将分割重构为连续流匹配问题,通过轨迹级监督改善薄结构质量;提出掩码条件化生成器,提供可控结构偏移的像素对齐合成数据,增强跨域泛化;无需辅助拓扑头、后处理或多损失工程,简化了模型设计。
Abstract: Segmenting thin structures like infrastructure cracks and anatomical vessels is a task hampered by topology-sensitive geometry, high annotation costs, and poor generalization across domains. Existing methods address these challenges in isolation. We propose FMS$^2$, a flow-matching framework with two modules. (1) SegFlow is a 2.96M-parameter segmentation model built on a standard encoder-decoder backbone that recasts prediction as continuous image $\rightarrow$ mask transport. It learns a time-indexed velocity field with a flow-matching regression loss and outputs the mask via ODE integration, rather than supervising only end-state logits. This trajectory-level supervision improves thin-structure continuity and sharpness, compared with tuned topology-aware loss baselines, without auxiliary topology heads, post-processing, or multi-term loss engineering. (2) SynFlow is a mask-conditioned mask $\rightarrow$ image generator that produces pixel-aligned synthetic image-mask pairs. It injects mask geometry at multiple scales and emphasizes boundary bands via edge-aware gating, while a controllable mask generator expands sparsity, width, and branching regimes. On five crack and vessel benchmarks, SegFlow alone outperforms strong CNN, Transformer, Mamba, and generative baselines, improving the volumetric metric (mean IoU) from 0.511 to 0.599 (+17.2%) and reducing the topological metric (Betti matching error) from 82.145 to 51.524 (-37.3%). When training with limited labels, augmenting SegFlow with SynFlow-generated pairs recovers near-full performance using 25% of real annotations and improves cross-domain IoU by 0.11 on average. Unlike classical data augmentation that promotes invariance via label-preserving transforms, SynFlow provides pixel-aligned paired supervision with controllable structural shifts (e.g., sparsity, width, branching), which is particularly effective under domain shift.
[96] Learning Generalizable 3D Medical Image Representations from Mask-Guided Self-Supervision cs.CVPDF
Yunhe Gao, Yabin Zhang, Chong Wang, Jiaming Liu, Maya Varma
TL;DR: 本文提出了一种名为MASS的掩码引导自监督学习方法,用于从无标注的3D医学影像中学习通用表征。该方法将上下文分割作为预训练任务,利用自动生成的类别无关掩码作为结构监督信号,学习包含外观、形状、空间上下文和解剖关系的语义丰富表征。实验表明,该方法在小规模和大规模多模态数据集上均有效,在少样本分割、低标注数据下的分割性能以及冻结编码器的分类任务上表现出色。
Details
Motivation: 当前3D医学影像领域缺乏像视觉和语言领域那样从大规模无标注数据中学习通用表征的基础模型。现有的自监督方法依赖于低级重建或对比目标,无法捕获对医学图像分析至关重要的解剖语义,限制了其向下游任务的迁移能力。
Result: MASS在少样本分割、低标注数据分割(仅需20-40%标注数据即可匹配全监督性能,在低数据情况下Dice分数超过其他自监督基线20分以上)以及冻结编码器对未见病理的分类(匹配数千样本的全监督训练)等任务上均表现出有效性。
Insight: 核心创新点在于将自动生成的类别无关掩码作为自监督学习的结构监督信号,将上下文分割作为预训练任务,从而学习到定义医学结构的语义组合(外观、形状、空间关系等)。这为无需专家标注构建3D医学影像基础模型开辟了一条新路径。
Abstract: Foundation models have transformed vision and language by learning general-purpose representations from large-scale unlabeled data, yet 3D medical imaging lacks analogous approaches. Existing self-supervised methods rely on low-level reconstruction or contrastive objectives that fail to capture the anatomical semantics critical for medical image analysis, limiting transfer to downstream tasks. We present MASS (MAsk-guided Self-Supervised learning), which treats in-context segmentation as the pretext task for learning general-purpose medical imaging representations. MASS’s key insight is that automatically generated class-agnostic masks provide sufficient structural supervision for learning semantically rich representations. By training on thousands of diverse mask proposals spanning anatomical structures and pathological findings, MASS learns what semantically defines medical structures: the holistic combination of appearance, shape, spatial context, and anatomical relationships. We demonstrate effectiveness across data regimes: from small-scale pretraining on individual datasets (20-200 scans) to large-scale multi-modal pretraining on 5K CT, MRI, and PET volumes, all without annotations. MASS demonstrates: (i) few-shot segmentation on novel structures, (ii) matching full supervision with only 20-40% labeled data while outperforming self-supervised baselines by over 20 in Dice score in low-data regimes, and (iii) frozen-encoder classification on unseen pathologies that matches full supervised training with thousands of samples. Mask-guided self-supervised pretraining captures broadly generalizable knowledge, opening a path toward 3D medical imaging foundation models without expert annotations. Code is available: https://github.com/Stanford-AIMI/MASS.
[97] TSDCRF: Balancing Privacy and Multi-Object Tracking via Time-Series CRF and Normalized Control Penalty cs.CVPDF
Bo Ma, Jinsong Wu, Weiqi Yan
TL;DR: TSDCRF是一个用于平衡隐私保护与多目标跟踪性能的插件式优化框架,通过结合差分隐私噪声注入、归一化控制惩罚(NCP)和时间序列动态条件随机场(DCRF)来减少身份切换和目标丢失,同时保持隐私。
Details
Motivation: 解决视频多目标跟踪中,为保护敏感身份信息而添加的隐私噪声通常会破坏跨帧关联,导致ID切换或目标丢失的问题。
Result: 在MOT16、MOT17、Cityscapes和KITTI基准测试中,TSDCRF相比白噪声和现有方法(如NTPD、PPDTSA)实现了更好的隐私-效用权衡,表现为更低的KL散度偏移、更低的跟踪RMSE,并在轨迹劫持下提高了鲁棒性。
Insight: 创新点在于将差分隐私噪声与跟踪稳定性机制(NCP和DCRF)结合,通过归一化控制惩罚预处理不稳定预测,以及使用时序DCRF强制时间一致性来校正噪声后的轨迹偏差,从而在保护隐私的同时维持跟踪性能。
Abstract: Multi-object tracking in video often requires appearance or location cues that can reveal sensitive identity information, while adding privacy-preserving noise typically disrupts cross-frame association and causes ID switches or target loss. We propose TSDCRF, a plug-in refinement framework that balances privacy and tracking by combining three components: (i) $(\varepsilon,δ)$-differential privacy via calibrated Gaussian noise on sensitive regions under a configurable privacy budget; (ii) a Normalized Control Penalty (NCP) that down-weights unstable or conflicting class predictions before noise injection to stabilize association; and (iii) a time-series dynamic conditional random field (DCRF) that enforces temporal consistency and corrects trajectory deviation after noise, mitigating ID switches and resilience to trajectory hijacking. The pipeline is agnostic to the choice of detector and tracker (e.g., YOLOv4 and DeepSORT). We evaluate on MOT16, MOT17, Cityscapes, and KITTI. Results show that TSDCRF achieves a better privacy–utility trade-off than white noise and prior methods (NTPD, PPDTSA): lower KL-divergence shift, lower tracking RMSE, and improved robustness under trajectory hijacking while preserving privacy. Source code in https://github.com/mabo1215/TSDCRF.git
[98] SHAMISA: SHAped Modeling of Implicit Structural Associations for Self-supervised No-Reference Image Quality Assessment cs.CV | cs.AI | cs.LGPDF
Mahdi Naseri, Zhou Wang
TL;DR: SHAMISA是一种非对比自监督框架,用于无参考图像质量评估(NR-IQA),它通过利用显式结构化关系监督从无标签的失真图像中学习。该方法引入隐式结构关联,这些关联是软性、可控的关系,既对失真敏感也对内容敏感,通过合成元数据和内在特征结构推断得出。核心创新包括组合失真引擎,从连续参数空间生成无限族退化,使得每次仅一个失真因子变化,从而在训练中精细控制表示相似性。模型通过双源关系图整合已知退化配置和新兴结构亲和性来指导学习,最终使用卷积编码器提取特征,并通过线性回归器进行质量预测。
Details
Motivation: 解决NR-IQA模型需要大量昂贵人工感知标签的瓶颈问题,通过自监督学习从无标签失真图像中学习,避免依赖人类质量标注或对比损失。
Result: 在合成、真实和跨数据集的NR-IQA基准测试中,SHAMISA实现了强大的整体性能,具有改进的跨数据集泛化能力和鲁棒性,无需人类质量标注。
Insight: 创新点包括引入隐式结构关联作为软性、可控的关系监督,以及组合失真引擎实现精细化的表示相似性控制;可借鉴之处在于利用合成元数据和特征结构构建关系图来引导自监督学习,提升模型泛化性。
Abstract: No-Reference Image Quality Assessment (NR-IQA) aims to estimate perceptual quality without access to a reference image of pristine quality. Learning an NR-IQA model faces a fundamental bottleneck: its need for a large number of costly human perceptual labels. We propose SHAMISA, a non-contrastive self-supervised framework that learns from unlabeled distorted images by leveraging explicitly structured relational supervision. Unlike prior methods that impose rigid, binary similarity constraints, SHAMISA introduces implicit structural associations, defined as soft, controllable relations that are both distortion-aware and content-sensitive, inferred from synthetic metadata and intrinsic feature structure. A key innovation is our compositional distortion engine, which generates an uncountable family of degradations from continuous parameter spaces, grouped so that only one distortion factor varies at a time. This enables fine-grained control over representational similarity during training: images with shared distortion patterns are pulled together in the embedding space, while severity variations produce structured, predictable shifts. We integrate these insights via dual-source relation graphs that encode both known degradation profiles and emergent structural affinities to guide the learning process throughout training. A convolutional encoder is trained under this supervision and then frozen for inference, with quality prediction performed by a linear regressor on its features. Extensive experiments on synthetic, authentic, and cross-dataset NR-IQA benchmarks demonstrate that SHAMISA achieves strong overall performance with improved cross-dataset generalization and robustness, all without human quality annotations or contrastive losses.
[99] Sparse-Dense Mixture of Experts Adapter for Multi-Modal Tracking cs.CVPDF
Yabin Zhu, Jianqi Li, Chenglong Li, Jiaxiang Wang, Chengjie Gu
TL;DR: 本文提出了一种用于多模态跟踪的稀疏-稠密混合专家适配器(SDMoEA)框架,旨在解决现有参数高效微调(PEFT)方法在统一共享参数框架下难以有效表示多模态特征的问题。该框架包含SDMoE模块,通过稀疏MoE捕获模态特定信息,稠密共享MoE建模跨模态共享信息,并引入基于Gram矩阵的语义对齐超图融合(GSAHF)模块,利用超图建模高阶相关性以实现多层次多模态信息的深度融合。
Details
Motivation: 现有基于PEFT的多模态跟踪方法因跨模态异质性,难以在统一共享参数框架下有效表示多模态特征,且传统方法在多层次多模态融合中对高阶相关性建模能力有限。
Result: 在多个多模态跟踪基准测试(包括LasHeR、RGBT234、VTUAV、VisEvent、COESOT、DepthTrack和VOT-RGBD2022)上进行了广泛实验,结果表明该方法优于其他PEFT方法,达到了先进性能水平。
Insight: 创新点包括设计稀疏-稠密混合专家适配器(SDMoE)来分别处理模态特定和共享信息,以及引入基于Gram矩阵的语义对齐超图融合(GSAHF)模块,通过超图结构建模高阶依赖关系,实现更有效的多模态特征对齐与融合。
Abstract: Parameter-efficient fine-tuning (PEFT) techniques, such as prompts and adapters, are widely used in multi-modal tracking because they alleviate issues of full-model fine-tuning, including time inefficiency, high resource consumption, parameter storage burden, and catastrophic forgetting. However, due to cross-modal heterogeneity, most existing PEFT-based methods struggle to effectively represent multi-modal features within a unified framework with shared parameters. To address this problem, we propose a novel Sparse-Dense Mixture of Experts Adapter (SDMoEA) framework for PEFT-based multi-modal tracking under a unified model structure. Specifically, we design an SDMoE module as the multi-modal adapter to model modality-specific and shared information efficiently. SDMoE consists of a sparse MoE and a dense-shared MoE: the former captures modality-specific information, while the latter models shared cross-modal information. Furthermore, to overcome limitations of existing tracking methods in modeling high-order correlations during multi-level multi-modal fusion, we introduce a Gram-based Semantic Alignment Hypergraph Fusion (GSAHF) module. It first employs Gram matrices for cross-modal semantic alignment, ensuring that the constructed hypergraph accurately reflects semantic similarity and high-order dependencies between modalities. The aligned features are then integrated into the hypergraph structure to exploit its ability to model high-order relationships, enabling deep fusion of multi-level multi-modal information. Extensive experiments demonstrate that the proposed method achieves superior performance compared with other PEFT approaches on several multi-modal tracking benchmarks, including LasHeR, RGBT234, VTUAV, VisEvent, COESOT, DepthTrack, and VOT-RGBD2022.
[100] Bodhi VLM: Privacy-Alignment Modeling for Hierarchical Visual Representations in Vision Backbones and VLM Encoders via Bottom-Up and Top-Down Feature Search cs.CV | cs.CRPDF
Bo Ma, Jinsong Wu, Wei Qi Yan
TL;DR: 本文提出了Bodhi VLM框架,用于对视觉主干网络和视觉-语言模型编码器中的分层视觉表征进行隐私对齐建模。该框架通过自底向上和自顶向下的特征搜索策略定位敏感特征区域,并利用期望最大化隐私评估模块生成可解释的预算对齐信号,以评估噪声注入与声明隐私预算的一致性。
Details
Motivation: 解决在保护隐私的学习系统中,向分层视觉表征注入噪声时,如何以可解释且跨模型适用的方式建模扰动与声明隐私预算的对齐关系这一核心挑战。
Result: 在目标检测器(如YOLO、DETR)和VLM视觉编码器(如CLIP、LLaVA)上验证了框架有效性。BUA和TDA策略产生了可比较的偏差趋势,EMPA模块在报告设置下提供了稳定的对齐信号。结果与通用差异基线(如K-L散度)和任务相关基线(如MomentReg)进行了比较,详细数据在补充材料中以均值±标准差形式给出。
Insight: 主要创新点在于提出了一个可学习的、可解释的隐私对齐建模框架,而非仅进行事后审计。它通过结合聚类、多尺度特征搜索和基于分布的评估,为分层神经表征的隐私保护提供了一种新的建模视角,并适用于多种视觉架构。
Abstract: Learning systems that preserve privacy often inject noise into hierarchical visual representations; a central challenge is to \emph{model} how such perturbations align with a declared privacy budget in a way that is interpretable and applicable across vision backbones and vision–language models (VLMs). We propose \emph{Bodhi VLM}, a \emph{privacy-alignment modeling} framework for \emph{hierarchical neural representations}: it (1) links sensitive concepts to layer-wise grouping via NCP and MDAV-based clustering; (2) locates sensitive feature regions using bottom-up (BUA) and top-down (TDA) strategies over multi-scale representations (e.g., feature pyramids or vision-encoder layers); and (3) uses an Expectation-Maximization Privacy Assessment (EMPA) module to produce an interpretable \emph{budget-alignment signal} by comparing the fitted sensitive-feature distribution to an evaluator-specified reference (e.g., Laplace or Gaussian with scale $c/ε$). The output is reference-relative and is \emph{not} a formal differential-privacy estimator. We formalize BUA/TDA over hierarchical feature structures and validate the framework on object detectors (YOLO, PPDPTS, DETR) and on the \emph{visual encoders} of VLMs (CLIP, LLaVA, BLIP). BUA and TDA yield comparable deviation trends; EMPA provides a stable alignment signal under the reported setups. We compare with generic discrepancy baselines (Chi-square, K-L, MMD) and with task-relevant baselines (MomentReg, NoiseMLE, Wass-1). Results are reported as mean$\pm$std over multiple seeds with confidence intervals in the supplementary materials. This work contributes a learnable, interpretable modeling perspective for privacy-aligned hierarchical representations rather than a post hoc audit only. Source code: \href{https://github.com/mabo1215/bodhi-vlm.git}{Bodhi-VLM GitHub repository}
[101] UniVid: Pyramid Diffusion Model for High Quality Video Generation cs.CV | cs.AI | cs.MMPDF
Xinyu Xiao, Binbin Yang, Tingtian Li, Yipeng Yu, Sen Lei
TL;DR: 本文提出了UniVid,一个基于金字塔扩散模型的统一视频生成模型,能够同时处理文本到视频(T2V)和图像到视频(I2V)生成任务。模型通过引入时间金字塔跨帧时空注意力模块和卷积,扩展了预训练的文本到图像扩散模型,以生成时间连贯的视频帧。此外,采用双流交叉注意力机制,支持文本和图像的双模态控制,并在推理时允许通过重加权注意力分数来灵活插值单模态和双模态控制。
Details
Motivation: 当前基于扩散的文本到视频和图像到视频生成是两个独立的研究方向,缺乏一个统一的模型来整合这两种生成范式。本文旨在解决这一挑战,构建一个能够同时利用文本提示和参考图像作为混合条件的统一视频生成模型。
Result: 广泛的实验表明,UniVid在T2V、I2V以及(文本+图像)到视频((T+I)2V)任务上均实现了优异的时间连贯性,展示了其统一生成能力的有效性。
Insight: 论文的创新点在于:1)通过时间金字塔跨帧时空注意力模块和卷积扩展预训练文本到图像模型,有效提升视频帧的时间连贯性;2)提出双流交叉注意力机制,支持文本和图像的双模态条件控制,并允许在推理时灵活调整模态权重,实现了生成范式的统一和可控性增强。
Abstract: Diffusion-based text-to-video generation (T2V) or image-to-video (I2V) generation have emerged as a prominent research focus. However, there exists a challenge in integrating the two generative paradigms into a unified model. In this paper, we present a unified video generation model (UniVid) with hybrid conditions of the text prompt and reference image. Given these two available controls, our model can extract objects’ appearance and their motion descriptions from textual prompts, while obtaining texture details and structural information from image clues to guide the video generation process. Specifically, we scale up the pre-trained text-to-image diffusion model for generating temporally coherent frames via introducing our temporal-pyramid cross-frame spatial-temporal attention modules and convolutions. To support bimodal control, we introduce a dual-stream cross-attention mechanism, whose attention scores can be freely re-weighted for interpolation of between single and two modalities controls during inference. Extensive experiments showcase that our UniVid achieves superior temporal coherence on T2V, I2V and (T+I)2V tasks.
[102] Ego-1K – A Large-Scale Multiview Video Dataset for Egocentric Vision cs.CVPDF
Jae Yong Lee, Daniel Scharstein, Akash Bapat, Hao Hu, Andrew Fu
TL;DR: 本文介绍了Ego-1K,一个大规模、时间同步的多视角第一人称视频数据集,旨在推动神经3D视频合成和动态场景理解的研究。该数据集包含近1000个短视频,使用一个包含12个同步相机的定制设备,围绕用户佩戴的4相机VR头显进行采集,内容侧重于不同场景下的手部动作和手-物交互。
Details
Motivation: 随着配备多摄像头的智能眼镜日益普及,第一人称场景重建成为一个重要研究领域,但缺乏合适的基准数据集。本文旨在提供一个大规模、多视角的第一人称视频数据集,以支持该领域的方法评测和未来研究。
Result: 实验表明,由于近距离动态物体和设备自身运动导致的大视差和图像运动,该数据集对现有的3D和4D新视角合成方法提出了独特挑战,可用于评测相关方法。
Insight: 创新点在于构建了一个专门针对第一人称视角、聚焦手部交互的大规模多视角同步视频数据集,其设备设计和场景内容为神经渲染和动态场景理解提供了新的、具有挑战性的基准。数据集已开源。
Abstract: We present Ego-1K, a large-scale collection of time-synchronized egocentric multiview videos designed to advance neural 3D video synthesis and dynamic scene understanding. The dataset contains nearly 1,000 short egocentric videos captured with a custom rig with 12 synchronized cameras surrounding a 4-camera VR headset worn by the user. Scene content focuses on hand motions and hand-object interactions in different settings. We describe rig design, data processing, and calibration. Our dataset enables new ways to benchmark egocentric scene reconstruction methods, an important research area as smart glasses with multiple cameras become omnipresent. Our experiments demonstrate that our dataset presents unique challenges for existing 3D and 4D novel view synthesis methods due to large disparities and image motion caused by close dynamic objects and rig egomotion. Our dataset supports future research in this challenging domain. It is available at https://huggingface.co/datasets/facebook/ego-1k.
[103] QTrack: Query-Driven Reasoning for Multi-modal MOT cs.CVPDF
Tajamul Ashraf, Tavaheed Tariq, Sonia Yadav, Abrar Ul Riyaz, Wasif Tak
TL;DR: 本文提出了一种查询驱动的多目标跟踪(MOT)新范式QTrack,将跟踪任务定义为基于自然语言查询的时空推理问题,旨在根据文本指令在视频中定位并跟踪特定目标。为此,作者构建了RMOT26大规模基准数据集,并提出了一个集成了多模态推理与跟踪定位的端到端视觉语言模型,以及一种具有结构化奖励的时间感知策略优化方法。
Details
Motivation: 传统多目标跟踪(MOT)旨在估计视频中所有物体的轨迹,而无法根据语义指令选择性地推理用户指定的目标。本文旨在解决这一局限,引入一个以查询为条件的、更具选择性和语义理解能力的跟踪任务。
Result: 广泛的实验证明了所提方法在以推理为中心、语言引导的跟踪任务上的有效性。具体结果在作者新构建的RMOT26基准上进行评估,该基准通过序列级划分防止身份泄露,以支持对泛化能力的稳健评估。
Insight: 主要创新点包括:1)提出了一个全新的查询驱动跟踪任务范式,将跟踪与自然语言理解相结合;2)构建了RMOT26基准,解决了现有数据集在评估语言引导跟踪泛化能力方面的不足;3)提出了QTrack端到端模型,整合了多模态推理与跟踪定位;4)引入了具有结构化奖励的时间感知策略优化(TPAPO),以鼓励模型进行运动感知推理。
Abstract: Multi-object tracking (MOT) has traditionally focused on estimating trajectories of all objects in a video, without selectively reasoning about user-specified targets under semantic instructions. In this work, we introduce a query-driven tracking paradigm that formulates tracking as a spatiotemporal reasoning problem conditioned on natural language queries. Given a reference frame, a video sequence, and a textual query, the goal is to localize and track only the target(s) specified in the query while maintaining temporal coherence and identity consistency. To support this setting, we construct RMOT26, a large-scale benchmark with grounded queries and sequence-level splits to prevent identity leakage and enable robust evaluation of generalization. We further present QTrack, an end-to-end vision-language model that integrates multimodal reasoning with tracking-oriented localization. Additionally, we introduce a Temporal Perception-Aware Policy Optimization strategy with structured rewards to encourage motion-aware reasoning. Extensive experiments demonstrate the effectiveness of our approach for reasoning-centric, language-guided tracking. Code and data are available at https://github.com/gaash-lab/QTrack
[104] PhysAlign: Physics-Coherent Image-to-Video Generation through Feature and 3D Representation Alignment cs.CVPDF
Zhexiao Xiong, Yizhi Song, Liu He, Wei Xiong, Yu Yuan
TL;DR: 本文提出PhysAlign框架,旨在解决现有视频扩散模型生成内容在时间上不连贯、违反物理直觉的问题,通过结合显式3D几何约束与基于Gram矩阵的时空关系对齐,构建统一的物理潜在空间,实现物理连贯的图像到视频生成。
Details
Motivation: 现有视频扩散模型常生成时间不连贯、违反基本物理直觉的内容,限制了其在机器人学和媒体生成等领域的实际应用,因此需要开发能够生成物理连贯视频的方法。
Result: PhysAlign在需要复杂物理推理和时间稳定性的任务上显著优于现有视频扩散模型,且不损害零样本视觉质量,通过合成数据集和统一物理潜在空间实现了性能提升。
Insight: 创新点包括构建基于刚体模拟的可控合成数据生成管道以解决物理标注视频稀缺问题,以及结合3D几何约束与时空关系对齐构建物理潜在空间,为物理基础的视频生成提供了实用范例。
Abstract: Video Diffusion Models (VDMs) offer a promising approach for simulating dynamic scenes and environments, with broad applications in robotics and media generation. However, existing models often generate temporally incoherent content that violates basic physical intuition, significantly limiting their practical applicability. We propose PhysAlign, an efficient framework for physics-coherent image-to-video (I2V) generation that explicitly addresses this limitation. To overcome the critical scarcity of physics-annotated videos, we first construct a fully controllable synthetic data generation pipeline based on rigid-body simulation, yielding a highly-curated dataset with accurate, fine-grained physics and 3D annotations. Leveraging this data, PhysAlign constructs a unified physical latent space by coupling explicit 3D geometry constraints with a Gram-based spatio-temporal relational alignment that extracts kinematic priors from video foundation models. Extensive experiments demonstrate that PhysAlign significantly outperforms existing VDMs on tasks requiring complex physical reasoning and temporal stability, without compromising zero-shot visual quality. PhysAlign shows the potential to bridge the gap between raw visual synthesis and rigid-body kinematics, establishing a practical paradigm for genuinely physics-grounded video generation. The project page is available at https://physalign.github.io/PhysAlign.
[105] AD-Copilot: A Vision-Language Assistant for Industrial Anomaly Detection via Visual In-context Comparison cs.CV | cs.AIPDF
Xi Jiang, Yue Guo, Jian Li, Yong Liu, Bin-Bin Gao
TL;DR: 本文提出了AD-Copilot,一个专为工业异常检测设计的交互式多模态大语言模型。它通过一种新颖的视觉上下文比较机制,解决了现有MLLMs在工业图像上表现不佳、对细微视觉差异不敏感的问题。核心创新包括一个从稀疏标注工业图像中挖掘知识并生成大规模多模态数据集Chat-AD的流程,以及一个利用跨注意力机制增强多图像细粒度感知的比较编码器。
Details
Motivation: 现有多模态大语言模型在通用视觉理解上表现出色,但在工业异常检测任务上表现不佳。主要原因是其训练数据与工业图像差异大,且模型独立编码图像、仅在语言空间进行比较,无法捕捉对IAD至关重要的细微视觉差异。
Result: 在MMAD基准测试中,AD-Copilot达到了82.3%的准确率,优于所有其他模型且无数据泄露。在扩展的MMAD-BBox基准(基于边界框的异常定位评估)上,相比基线取得了最高3.35倍的提升。模型在其他专用和通用基准上也表现出优秀的泛化能力,并在多个IAD任务上超越了人类专家水平。
Insight: 论文的创新点在于:1) 提出了一个从稀疏标注工业数据中挖掘知识并构建大规模、富含语义信号的多模态数据集(Chat-AD)的流程;2) 设计了基于跨注意力的比较编码器,实现了图像对之间的视觉特征交互,增强了细粒度感知能力;3) 采用了融合领域知识、逐步增强IAD技能的多阶段训练策略。从客观角度看,其将“视觉上下文比较”机制引入MLLM以适应IAD任务,是一个针对领域特定需求的有效架构调整。
Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive success in natural visual understanding, yet they consistently underperform in industrial anomaly detection (IAD). This is because MLLMs trained mostly on general web data differ significantly from industrial images. Moreover, they encode each image independently and can only compare images in the language space, making them insensitive to subtle visual differences that are key to IAD. To tackle these issues, we present AD-Copilot, an interactive MLLM specialized for IAD via visual in-context comparison. We first design a novel data curation pipeline to mine inspection knowledge from sparsely labeled industrial images and generate precise samples for captioning, VQA, and defect localization, yielding a large-scale multimodal dataset Chat-AD rich in semantic signals for IAD. On this foundation, AD-Copilot incorporates a novel Comparison Encoder that employs cross-attention between paired image features to enhance multi-image fine-grained perception, and is trained with a multi-stage strategy that incorporates domain knowledge and gradually enhances IAD skills. In addition, we introduce MMAD-BBox, an extended benchmark for anomaly localization with bounding-box-based evaluation. The experiments show that AD-Copilot achieves 82.3% accuracy on the MMAD benchmark, outperforming all other models without any data leakage. In the MMAD-BBox test, it achieves a maximum improvement of $3.35\times$ over the baseline. AD-Copilot also exhibits excellent generalization of its performance gains across other specialized and general-purpose benchmarks. Remarkably, AD-Copilot surpasses human expert-level performance on several IAD tasks, demonstrating its potential as a reliable assistant for real-world industrial inspection. All datasets and models will be released for the broader benefit of the community.
[106] RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting cs.CVPDF
Xuezhen Wang, Li Ma, Yulin Shen, Zeyu Wang, Pedro V. Sander
TL;DR: 本文提出RetimeGS,一种用于动态场景连续时间重建的4D高斯泼溅表示方法,通过定义3D高斯的显式时间行为来缓解时间混叠问题,从而实现任意时间戳的无鬼影、时间一致的渲染。
Details
Motivation: 现有4DGS方法在离散帧上过拟合,难以表示连续时间帧,导致时间插值时产生鬼影伪影,限制了慢动作播放、时间编辑等应用。
Result: 在包含快速运动、非刚性变形和严重遮挡的数据集上,RetimeGS在质量和连贯性上均优于现有最先进方法。
Insight: 创新点在于将时间插值问题视为时间混叠并显式建模高斯的时间行为,结合光流引导初始化与监督、三重渲染监督等策略,实现了对大运动下连续时间的鲁棒重建。
Abstract: Temporal retiming, the ability to reconstruct and render dynamic scenes at arbitrary timestamps, is crucial for applications such as slow-motion playback, temporal editing, and post-production. However, most existing 4D Gaussian Splatting (4DGS) methods overfit at discrete frame indices but struggle to represent continuous-time frames, leading to ghosting artifacts when interpolating between timestamps. We identify this limitation as a form of temporal aliasing and propose RetimeGS, a simple yet effective 4DGS representation that explicitly defines the temporal behavior of the 3D Gaussian and mitigates temporal aliasing. To achieve smooth and consistent interpolation, we incorporate optical flow-guided initialization and supervision, triple-rendering supervision, and other targeted strategies. Together, these components enable ghost-free, temporally coherent rendering even under large motions. Experiments on datasets featuring fast motion, non-rigid deformation, and severe occlusions demonstrate that RetimeGS achieves superior quality and coherence over state-of-the-art methods.
[107] Advancing Cancer Prognosis with Hierarchical Fusion of Genomic, Proteomic and Pathology Imaging Data from a Systems Biology Perspective cs.CVPDF
Junjie Zhou, Bao Xue, Meiling Wang, Wei Shao, Daoqiang Zhang
TL;DR: 本文提出了一种名为HFGPI的分层融合框架,用于从系统生物学角度整合基因组、蛋白质组和病理成像数据,以提升癌症预后预测的精度。该框架通过分子标记器构建生物信息表征,利用基因调控蛋白融合模块显式建模基因-蛋白质调控关系,并通过蛋白质引导的超图学习模块捕获高阶蛋白质-形态学关联,最终实现分层特征融合以预测生存结果。
Details
Motivation: 现有癌症预后多模态生存分析方法通常整合基因组数据和病理图像,但忽视了蛋白质组作为连接基因组改变与组织病理特征的中间层及其提供的互补生物信息,且现有方法以扁平方式融合异构数据,未能捕捉其固有的生物层次结构。
Result: 在五个基准数据集上的广泛实验表明,HFGPI方法优于现有的最先进方法。
Insight: 创新点在于从系统生物学视角构建了从基因到蛋白质再到组织学图像的分层融合框架,并引入了分子标记器、基因调控蛋白融合和蛋白质引导超图学习等模块,以更符合生物学现实的方式建模多模态数据间的关系。
Abstract: To enhance the precision of cancer prognosis, recent research has increasingly focused on multimodal survival methods by integrating genomic data and histology images. However, current approaches overlook the fact that the proteome serves as an intermediate layer bridging genomic alterations and histopathological features while providing complementary biological information essential for survival prediction. This biological reality exposes another architectural limitation: existing integrative analysis studies fuse these heterogeneous data sources in a flat manner that fails to capture their inherent biological hierarchy. To address these limitations, we propose HFGPI, a hierarchical fusion framework that models the biological progression from genes to proteins to histology images from a systems biology perspective. Specifically, we introduce Molecular Tokenizer, a molecular encoding strategy that integrates identity embeddings with expression profiles to construct biologically informed representations for genes and proteins. We then develop Gene-Regulated Protein Fusion (GRPF), which employs graph-aware cross-attention with structure-preserving alignment to explicitly model gene-protein regulatory relationships and generate gene-regulated protein representations. Additionally, we propose Protein-Guided Hypergraph Learning (PGHL), which establishes associations between proteins and image patches, leveraging hypergraph convolution to capture higher-order protein-morphology relationships. The final features are progressively fused across hierarchical layers to achieve precise survival outcome prediction. Extensive experiments on five benchmark datasets demonstrate the superiority of HFGPI over state-of-the-art methods.
[108] Beyond Medical Diagnostics: How Medical Multimodal Large Language Models Think in Space cs.CVPDF
Quoc-Huy Trinh, Xi Ding, Yang Liu, Zhenyue Qin, Xingjian Li
TL;DR: 该论文针对医学多模态大语言模型在3D影像空间推理能力上的不足,提出了一个通过多智能体协作和放射科专家验证自动生成空间视觉问答数据的流程,并构建了首个评估医学MLLMs 3D空间智能的基准SpatialMed。
Details
Motivation: 解决当前医学多模态大语言模型在3D医学影像解释中视觉空间智能评估的缺失问题,主要由于缺乏超越基本标签的结构化3D空间标注数据集。
Result: 在包含多个器官和肿瘤类型的近10K个问答对的SpatialMed基准上评估了14个最先进的MLLMs,结果表明当前模型在医学影像空间推理方面能力不足。
Insight: 创新点在于提出了一个利用计算工具(如体积和距离计算器)自动合成空间VQA数据的智能体流程,并构建了首个专注于医学3D空间智能的基准数据集,为评估和提升模型的空间推理能力提供了新方向。
Abstract: Visual spatial intelligence is critical for medical image interpretation, yet remains largely unexplored in Multimodal Large Language Models (MLLMs) for 3D imaging. This gap persists due to a systemic lack of datasets featuring structured 3D spatial annotations beyond basic labels. In this study, we introduce an agentic pipeline that autonomously synthesizes spatial visual question-answering (VQA) data by orchestrating computational tools such as volume and distance calculators with multi-agent collaboration and expert radiologist validation. We present SpatialMed, the first comprehensive benchmark for evaluating 3D spatial intelligence in medical MLLMs, comprising nearly 10K question-answer pairs across multiple organs and tumor types. Our evaluations on 14 state-of-the-art MLLMs and extensive analyses reveal that current models lack robust spatial reasoning capabilities for medical imaging.
[109] VFM-Loc: Zero-Shot Cross-View Geo-Localization via Aligning Discriminative Visual Hierarchies cs.CVPDF
Jun Lu, Zehao Sang, Haoqi Wei, Xiangyun Liu, Kun Zhu
TL;DR: VFM-Loc是一个用于零样本跨视角地理定位的训练无关框架,它利用视觉基础模型的通用视觉表示,通过渐进式对齐策略来匹配无人机视角查询与地理标记的卫星图像,旨在解决传统监督方法在真实无约束场景下因视角差异和数据偏差导致的泛化失败问题。
Details
Motivation: 解决传统监督式跨视角地理定位方法在真实、无约束场景下因严重的视角差异和数据集偏差导致的泛化能力不足问题,提出一种无需训练、能直接利用预训练模型强大表征的零样本方法。
Result: 在标准基准测试中表现出强大的零样本精度,并在具有大倾斜角度的挑战性LO-UCV数据集上,其Recall@1指标超越了监督方法超过20%,达到了SOTA水平。
Insight: 创新点在于提出了一种基于统计流形对齐的渐进式对齐策略,包括使用广义均值池化和尺度加权RMAC的分层线索提取机制,以及基于领域PCA和正交Procrustes分析的线性对齐流程,证明了预训练特征的原则性对齐可以有效弥合跨视角鸿沟,为真实世界CVGL建立了一个鲁棒且无需训练的范式。
Abstract: Cross-View Geo-Localization (CVGL) in remote sensing aims to locate a drone-view query by matching it to geo-tagged satellite images. Although supervised methods have achieved strong results on closeset benchmarks, they often fail to generalize to unconstrained, real-world scenarios due to severe viewpoint differences and dataset bias. To overcome these limitations, we present VFM-Loc, a training-free framework for zero-shot CVGL that leverages the generalizable visual representations from vision foundational models (VFMs). VFM-Loc identifies and matches discriminative visual clues across different viewpoints through a progressive alignment strategy. First, we design a hierarchical clue extraction mechanism using Generalized Mean pooling and Scale-Weighted RMAC to preserve distinctive visual clues across scales while maintaining hierarchical confidence. Second, we introduce a statistical manifold alignment pipeline based on domain-wise PCA and Orthogonal Procrustes analysis, linearly aligning heterogeneous feature distributions in a shared metric space. Experiments demonstrate that VFM-Loc exhibits strong zero-shot accuracy on standard benchmarks and surpasses supervised methods by over 20% in Recall@1 on the challenging LO-UCV dataset with large oblique angles. This work highlights that principled alignment of pre-trained features can effectively bridge the cross-view gap, establishing a robust and training-free paradigm for real-world CVGL. The relevant code is made available at: https://github.com/DingLei14/VFM-Loc.
[110] Learning through Creation: A Hash-Free Framework for On-the-Fly Category Discovery cs.CVPDF
Bohan Zhang, Weidong Tang, Zhixiang Chi, Yi Jin, Zhenbo Li
TL;DR: 本文提出了一种名为Learning through Creation (LTC) 的无哈希框架,用于解决在线类别发现(OCD)任务。该方法通过在离线训练阶段注入新颖类别感知,利用一个轻量级的在线伪未知生成器动态合成伪新颖实例,并结合双重最大间隔目标来增强模型对未知区域的识别能力。
Details
Motivation: 现有OCD方法存在离线学习与在线发现阶段的优化不匹配问题,且通常依赖基于哈希的编码或严重的特征压缩,限制了表征能力。本文旨在解决这些问题,通过直接训练模型执行发现任务来弥合这一差距。
Result: 在七个基准测试上的广泛实验表明,LTC始终优于先前工作,在全类准确率上实现了1.5%到13.1%的提升。
Insight: 核心创新点在于提出了一个基于核能量最小化和熵最大化驱动的在线伪未知生成器,能够以可忽略的成本动态合成伪新颖样本,并通过双重最大间隔目标进行训练,从而在离线阶段就明确地增强了模型对未知类别的发现能力。这是一种’通过创造来学习’的范式转变。
Abstract: On-the-Fly Category Discovery (OCD) aims to recognize known classes while simultaneously discovering emerging novel categories during inference, using supervision only from known classes during offline training. Existing approaches rely either on fixed label supervision or on diffusion-based augmentations to enhance the backbone, yet none of them explicitly train the model to perform the discovery task required at test time. It is fundamentally unreasonable to expect a model optimized on limited labeled data to carry out a qualitatively different discovery objective during inference. This mismatch creates a clear optimization misalignment between the offline learning stage and the online discovery stage. In addition, prior methods often depend on hash-based encodings or severe feature compression, which further limits representational capacity. To address these issues, we propose Learning through Creation (LTC), a fully feature-based and hash-free framework that injects novel-category awareness directly into offline learning. At its core is a lightweight, online pseudo-unknown generator driven by kernel-energy minimization and entropy maximization (MKEE). Unlike previous methods that generate synthetic samples once before training, our generator evolves jointly with the model dynamics and synthesizes pseudo-novel instances on the fly at negligible cost. These samples are incorporated through a dual max-margin objective with adaptive thresholding, strengthening the model’s ability to delineate and detect unknown regions through explicit creation. Extensive experiments across seven benchmarks show that LTC consistently outperforms prior work, achieving improvements ranging from 1.5 percent to 13.1 percent in all-class accuracy. The code is available at https://github.com/brandinzhang/LTC
[111] Geo-ID: Test-Time Geometric Consensus for Cross-View Consistent Intrinsics cs.CVPDF
Alara Dirik, Stefanos Zafeiriou
TL;DR: Geo-ID是一种无需重新训练或逆向渲染的测试时框架,旨在提升多视角图像的内在参数分解(如反照率、粗糙度、金属性)的一致性。它通过稀疏几何对应关系耦合独立单视角预测,形成不确定性感知的共识目标,从而在稀疏无序图像集合中实现跨视角一致的内在分解。
Details
Motivation: 现有单视角内在图像分解方法在独立应用于同一场景的多视角图像时,往往产生不一致的估计,限制了其在可编辑神经场景和3D重建等下游应用中的使用;而基于视频的方法需要密集有序序列且计算量大,难以适用于稀疏无序图像集合。
Result: 在合成基准测试和真实场景上的实验表明,随着视角数量增加,Geo-ID显著提升了跨视角内在参数的一致性,同时保持了可比的单视角分解性能;所得一致内在参数能够在下游神经场景表示中实现连贯的外观编辑和重光照。
Insight: 创新点在于利用稀疏几何对应关系构建不确定性感知的共识目标,以模型无关的方式在测试时耦合预训练单视角预测器,无需重新训练或逆向渲染即可提升跨视角一致性;这为稀疏无序图像集合的内在分解提供了一种高效灵活的解决方案。
Abstract: Intrinsic image decomposition aims to estimate physically based rendering (PBR) parameters such as albedo, roughness, and metallicity from images. While recent methods achieve strong single-view predictions, applying them independently to multiple views of the same scene often yields inconsistent estimates, limiting their use in downstream applications such as editable neural scenes and 3D reconstruction. Video-based models can improve cross-frame consistency but require dense, ordered sequences and substantial compute, limiting their applicability to sparse, unordered image collections. We propose Geo-ID, a novel test-time framework that repurposes pretrained single-view intrinsic predictors to produce cross-view consistent decompositions by coupling independent per-view predictions through sparse geometric correspondences that form uncertainty-aware consensus targets. Geo-ID is model-agnostic, requires no retraining or inverse rendering, and applies directly to off-the-shelf intrinsic predictors. Experiments on synthetic benchmarks and real-world scenes demonstrate substantial improvements in cross-view intrinsic consistency as the number of views increases, while maintaining comparable single-view decomposition performance. We further show that the resulting consistent intrinsics enable coherent appearance editing and relighting in downstream neural scene representations.
[112] Zero-Forgetting CISS via Dual-Phase Cognitive Cascades cs.CVPDF
Yuquan Lu, Yifu Guo, Zishan Xu, Siyu Zhang, Yu Huo
TL;DR: 本文提出了一种名为认知级联分割(CogCaS)的双阶段级联方法,用于解决持续语义分割(CSS)中的灾难性遗忘问题。该方法将任务解耦为类别存在检测和类别特定分割两个阶段,从而在PASCAL VOC 2012和ADE20K基准数据集上显著提升了性能,特别是在长序列增量任务场景中优于现有方法。
Details
Motivation: 持续语义分割(CSS)是计算机视觉中的基础任务,但面临灾难性遗忘的挑战。现有基于Softmax的分类头方法存在灾难性遗忘和任务关联概率问题,本文旨在通过理论分析和双阶段直觉来解决这些限制。
Result: 在PASCAL VOC 2012和ADE20K基准数据集上的实验表明,CogCaS方法在各种挑战性场景中(尤其是长序列增量任务)相比现有最先进方法有显著改进。
Insight: 创新点在于将任务解耦为双阶段级联(类别存在检测和类别特定分割),这模仿了人类标注者的直觉,从而更有效地实现持续学习,避免灾难性遗忘。从客观角度看,这种解耦策略为持续学习提供了新的架构思路,可能适用于其他增量学习任务。
Abstract: Continual semantic segmentation (CSS) is a cornerstone task in computer vision that enables a large number of downstream applications, but faces the catastrophic forgetting challenge. In conventional class-incremental semantic segmentation (CISS) frameworks using Softmax-based classification heads, catastrophic forgetting originates from Catastrophic forgetting and task affiliation probability. We formulate these problems and provide a theoretical analysis to more deeply understand the limitations in existing CISS methods, particularly Strict Parameter Isolation (SPI). To address these challenges, we follow a dual-phase intuition from human annotators, and introduce Cognitive Cascade Segmentation (CogCaS), a novel dual-phase cascade formulation for CSS tasks in the CISS setting. By decoupling the task into class-existence detection and class-specific segmentation, CogCaS enables more effective continual learning, preserving previously learned knowledge while incorporating new classes. Using two benchmark datasets PASCAL VOC 2012 and ADE20K, we have shown significant improvements in a variety of challenging scenarios, particularly those with long sequence of incremental tasks, when compared to exsiting state-of-the-art methods. Our code will be made publicly available upon paper acceptance.
[113] Step-CoT: Stepwise Visual Chain-of-Thought for Medical Visual Question Answering cs.CV | cs.AI | cs.CLPDF
Lin Fan, Yafei Ou, Zhipeng Deng, Pengyu Dai, Hou Chongxian
TL;DR: 本文提出了Step-CoT,一个大规模、结构化、多步骤的医学视觉问答(VQA)推理数据集,旨在模拟临床诊断工作流程。同时,论文引入了一个师生框架,并采用动态图结构聚焦机制来有效学习这些推理步骤,从而提升医学VQA的推理准确性和可解释性。
Details
Motivation: 现有医学VQA中的思维链(CoT)推理通常是自由形式的,未能捕捉临床医生实际遵循的结构化推理过程。本文旨在探究可追溯、多步骤的监督推理是否能提升医学VQA的准确性和可解释性。
Result: 实验表明,使用Step-CoT数据集可以提升推理准确性和可解释性。具体基准测试结果和数据集信息可在提供的GitHub和Hugging Face链接中查看。
Insight: 主要创新点在于构建了与临床诊断工作流程对齐的结构化多步骤CoT数据集,并提出了一个结合动态图结构聚焦机制的师生框架,以优先处理诊断信息丰富的步骤,过滤无关上下文,从而引导模型遵循有效的推理轨迹。
Abstract: Chain-of-thought (CoT) reasoning has advanced medical visual question answering (VQA), yet most existing CoT rationales are free-form and fail to capture the structured reasoning process clinicians actually follow. This work asks: Can traceable, multi-step reasoning supervision improve reasoning accuracy and the interpretability of Medical VQA? To this end, we introduce Step-CoT, a large-scale medical reasoning dataset with expert-curated, structured multi-step CoT aligned to clinical diagnostic workflows, implicitly grounding the model’s reasoning in radiographic evidence. Step-CoT comprises more than 10K real clinical cases and 70K VQA pairs organized around diagnostic workflows, providing supervised intermediate steps that guide models to follow valid reasoning trajectories. To effectively learn from Step-CoT, we further introduce a teacher-student framework with a dynamic graph-structured focusing mechanism that prioritizes diagnostically informative steps while filtering out less relevant contexts. Our experiments show that using Step-CoT can improve reasoning accuracy and interpretability. Benchmark: github.com/hahaha111111/Step-CoT. Dataset Card: huggingface.co/datasets/fl-15o/Step-CoT
[114] SCoCCA: Multi-modal Sparse Concept Decomposition via Canonical Correlation Analysis cs.CVPDF
Ehud Gordon, Meir Yossef Levi, Guy Gilboa
TL;DR: 论文提出了SCoCCA框架,通过将典型相关分析与概念分解相结合,解决了多模态视觉语言模型中跨模态对齐和可解释性问题,并引入稀疏性以获得更解耦和判别性的概念表示。
Details
Motivation: 现有基于概念的可解释性方法主要局限于图像模态,忽视了跨模态交互,且CLIP等文本-图像嵌入存在模态鸿沟,限制了可解释性。论文旨在将概念解释推广到多模态嵌入,并利用CCA增强跨模态对齐。
Result: 该方法在概念发现任务上达到了最先进的性能,并通过重构和操作任务(如概念消融)验证了其有效性。
Insight: 揭示了CCA与InfoNCE目标的紧密联系,提供了一种无需训练即可增强跨模态对齐的机制;将CCA与概念分解耦合,并引入稀疏性约束,以产生更解耦、判别性的多模态概念表示,从而支持激活、消融和语义操作。
Abstract: Interpreting the internal reasoning of vision-language models is essential for deploying AI in safety-critical domains. Concept-based explainability provides a human-aligned lens by representing a model’s behavior through semantically meaningful components. However, existing methods are largely restricted to images and overlook the cross-modal interactions. Text-image embeddings, such as those produced by CLIP, suffer from a modality gap, where visual and textual features follow distinct distributions, limiting interpretability. Canonical Correlation Analysis (CCA) offers a principled way to align features from different distributions, but has not been leveraged for multi-modal concept-level analysis. We show that the objectives of CCA and InfoNCE are closely related, such that optimizing CCA implicitly optimizes InfoNCE, providing a simple, training-free mechanism to enhance cross-modal alignment without affecting the pre-trained InfoNCE objective. Motivated by this observation, we couple concept-based explainability with CCA, introducing Concept CCA (CoCCA), a framework that aligns cross-modal embeddings while enabling interpretable concept decomposition. We further extend it and propose Sparse Concept CCA (SCoCCA), which enforces sparsity to produce more disentangled and discriminative concepts, facilitating improved activation, ablation, and semantic manipulation. Our approach generalizes concept-based explanations to multi-modal embeddings and achieves state-of-the-art performance in concept discovery, evidenced by reconstruction and manipulation tasks such as concept ablation.
[115] Multi-Modal Character Localization and Extraction for Chinese Text Recognition cs.CV | cs.AIPDF
Qilong Li, Chongsheng Zhang
TL;DR: 本文提出了一种名为LER的新方法,用于解决中文场景文本识别(STR)的挑战。该方法通过三个模块(定位、提取和识别)显式解耦每个字符,并独立识别字符,同时考虑中文复杂的内部结构。实验表明,该方法在大型中文基准测试中显著优于现有方法,并在英文基准测试上也取得了令人印象深刻的结果。
Details
Motivation: 由于中文复杂的内部结构和广泛的字符类别,现有的英文文本识别方法在识别中文文本图像时遇到精度瓶颈。本文旨在探索将针对英文开发的模型应用于中文STR任务是否合适,并提出一种专门针对中文STR的新方法。
Result: 在大型中文基准测试上的大量实验表明,该方法显著优于现有方法。在六个英文基准测试和Union14M基准测试上的实验也显示,LER在英文文本识别上取得了令人印象深刻的结果。
Insight: 创新点在于显式地将字符定位、提取和识别解耦,并并行处理所有字符,同时专门考虑中文独特的内部结构(如偏旁部首)。从客观角度看,这种多模态定位和并行提取的模块化设计,为处理复杂文字(尤其是象形文字)的识别提供了可借鉴的思路。
Abstract: Scene text recognition (STR) methods have demonstrated their excellent capability in English text images. However, due to the complex inner structures of Chinese and the extensive character categories, it poses challenges for recognizing Chinese text in images. Recently, studies have shown that the methods designed for English text recognition encounter an accuracy bottleneck when recognizing Chinese text images. This raises the question: Is it appropriate to apply the model developed for English to the Chinese STR task? To explore this issue, we propose a novel method named LER, which explicitly decouples each character and independently recognizes characters while taking into account the complex inner structures of Chinese. LER consists of three modules: Localization, Extraction, and Recognition. Firstly, the localization module utilizes multimodal information to determine the character’s position precisely. Then, the extraction module dissociates all characters in parallel. Finally, the recognition module considers the unique inner structures of Chinese to provide the text prediction results. Extensive experiments conducted on large-scale Chinese benchmarks indicate that our method significantly outperforms existing methods. Furthermore, extensive experiments conducted on six English benchmarks and the Union14M benchmark show impressive results in English text recognition by LER. Code is available at https://github.com/Pandarenlql/LER.
[116] Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition cs.CV | cs.AI | cs.LG | cs.ROPDF
Seokmin Lee, Yunghee Lee, Byeonghyun Pak, Byeongju Woo
TL;DR: 本文提出CroBo框架,通过全局到局部的重建目标学习视觉状态表示,旨在联合编码场景元素的语义身份和空间位置,以捕捉动态环境中的细微变化,从而支持机器人决策。
Details
Motivation: 现有自监督学习方法在视觉任务中表现出良好的可迁移性,但未明确界定有效的视觉状态应编码什么;本文认为良好的视觉状态必须通过联合编码场景元素的语义身份和空间位置来捕获’what-is-where’信息,以可靠检测观测间的细微动态。
Result: 在多种基于视觉的机器人策略学习基准测试中,CroBo实现了最先进的性能;重建分析和感知直线性实验进一步表明,学习到的表示保留了像素级场景组合,并编码了观测间的’what-moves-where’信息。
Insight: 创新点在于提出全局到局部的重建目标,通过将参考观测压缩为紧凑的瓶颈令牌,并利用其作为上下文重建目标裁剪区域的重度掩码补丁,从而鼓励瓶颈令牌编码细粒度的场景语义实体及其空间配置,这有助于捕捉动态交互并提升决策能力。
Abstract: For robotic agents operating in dynamic environments, learning visual state representations from streaming video observations is essential for sequential decision making. Recent self-supervised learning methods have shown strong transferability across vision tasks, but they do not explicitly address what a good visual state should encode. We argue that effective visual states must capture what-is-where by jointly encoding the semantic identities of scene elements and their spatial locations, enabling reliable detection of subtle dynamics across observations. To this end, we propose CroBo, a visual state representation learning framework based on a global-to-local reconstruction objective. Given a reference observation compressed into a compact bottleneck token, CroBo learns to reconstruct heavily masked patches in a local target crop from sparse visible cues, using the global bottleneck token as context. This learning objective encourages the bottleneck token to encode a fine-grained representation of scene-wide semantic entities, including their identities, spatial locations, and configurations. As a result, the learned visual states reveal how scene elements move and interact over time, supporting sequential decision making. We evaluate CroBo on diverse vision-based robot policy learning benchmarks, where it achieves state-of-the-art performance. Reconstruction analyses and perceptual straightness experiments further show that the learned representations preserve pixel-level scene composition and encode what-moves-where across observations.
[117] Scene Generation at Absolute Scale: Utilizing Semantic and Geometric Guidance From Text for Accurate and Interpretable 3D Indoor Scene Generation cs.CVPDF
Stefan Ainetter, Thomas Deixelberger, Edoardo A. Dominici, Philipp Drescher, Konstantinos Vardis
TL;DR: 本文提出了一种名为GuidedSceneGen的文本到3D生成框架,用于生成具有精确度量、全局一致且语义可解释的室内场景。该方法通过预测全局3D布局作为指导,利用全景扩散模型合成对齐的360°图像,并结合优化的相机轨迹探索未观测区域,最终通过3D高斯泼溅融合生成绝对尺度下可导航的3D场景。
Details
Motivation: 解决现有文本驱动3D生成方法中常见的几何漂移和尺度模糊问题,旨在生成度量精确且全局一致的室内场景。
Result: 定量评估和用户研究表明,与近期的全景文本到3D基线方法相比,该方法在3D一致性和布局合理性方面表现更优,并且相机轨迹采样速度提升了高达10倍。
Insight: 创新点在于在整个生成过程中维持绝对世界坐标系,利用语义和几何指导的全局布局作为代理,并结合优化的相机轨迹进行高效场景探索,实现了从布局到重建的精确姿态和语义标签传递,支持无需重新对齐的渐进式场景扩展。
Abstract: We present GuidedSceneGen, a text-to-3D generation framework that produces metrically accurate, globally consistent, and semantically interpretable indoor scenes. Unlike prior text-driven methods that often suffer from geometric drift or scale ambiguity, our approach maintains an absolute world coordinate frame throughout the entire generation process. Starting from a textual scene description, we predict a global 3D layout encoding both semantic and geometric structure, which serves as a guiding proxy for downstream stages. A semantics- and depth-conditioned panoramic diffusion model then synthesizes 360° imagery aligned with the global layout, substantially improving spatial coherence. To explore unobserved regions, we employ a video diffusion model guided by optimized camera trajectories that balances coverage and collision avoidance, achieving up to 10x faster sampling compared to exhaustive path exploration. The generated views are fused using 3D Gaussian Splatting, yielding a consistent and fully navigable 3D scene in absolute scale. GuidedSceneGen enables accurate transfer of object poses and semantic labels from layout to reconstruction, and supports progressive scene expansion without re-alignment. Quantitative results and a user study demonstrate greater 3D consistency and layout plausibility compared to recent panoramic text-to-3D baselines.
[118] Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video cs.CVPDF
Yuting Tan, Xilong Cheng, Yunxiao Qin, Zhengnan Li, Jingjing Zhang
TL;DR: 本文提出EgoViT,一个统一的视觉Transformer框架,旨在从无标注的第一人称视频中学习稳定的物体表征。该方法通过联合发现和稳定“原型物体”的三个协同机制(原型物体学习、深度正则化、教师过滤时间一致性)实现自监督学习,在无监督物体发现和语义分割任务上取得了显著提升。
Details
Motivation: 受人类通过自我中心经验进行自监督学习形成视觉智能的启发,研究如何让AI系统从连续、未经整理的自我中心视频中学习稳定的物体表征,而无需依赖人工标注,以应对杂乱、遮挡和自身运动带来的挑战。
Result: 在标准基准测试中,EgoViT在无监督物体发现任务上实现了+8.0%的CorLoc提升,在语义分割任务上实现了+4.8%的mIoU提升,展现了其性能优势。
Insight: 创新点在于提出了一个端到端的统一框架,通过三个协同机制(原型学习、深度正则化、教师过滤时间一致性)形成良性循环,将初始物体假设逐步精炼为稳定、持久的表征,且对来源和质量各异的几何先验具有鲁棒性,为具身智能中的鲁棒视觉抽象提供了基础。
Abstract: Humans develop visual intelligence through perceiving and interacting with their environment - a self-supervised learning process grounded in egocentric experience. Inspired by this, we ask how can artificial systems learn stable object representations from continuous, uncurated first-person videos without relying on manual annotations. This setting poses challenges of separating, recognizing, and persistently tracking objects amid clutter, occlusion, and ego-motion. We propose EgoViT, a unified vision Transformer framework designed to learn stable object representations from unlabeled egocentric video. EgoViT bootstraps this learning process by jointly discovering and stabilizing “proto-objects” through three synergistic mechanisms: (1) Proto-object Learning, which uses intra-frame distillation to form discriminative representations; (2) Depth Regularization, which grounds these representations in geometric structure; and (3) Teacher-Filtered Temporal Consistency, which enforces identity over time. This creates a virtuous cycle where initial object hypotheses are progressively refined into stable, persistent representations. The framework is trained end-to-end on unlabeled first-person videos and exhibits robustness to geometric priors of varied origin and quality. On standard benchmarks, EgoViT achieves +8.0% CorLoc improvement in unsupervised object discovery and +4.8% mIoU improvement in semantic segmentation, demonstrating its potential to lay a foundation for robust visual abstraction in embodied intelligence.
[119] Evaluation of Visual Place Recognition Methods for Image Pair Retrieval in 3D Vision and Robotics cs.CVPDF
Dennis Haitz, Athradi Shritish Shetty, Michael Weinmann, Markus Ulrich
TL;DR: 本文研究了视觉地点识别(VPR)作为图像对检索前端在三维视觉和机器人注册流程中的应用,系统评估了NetVLAD、CosPlace、EigenPlaces、MixVPR、AnyLoc、SALAD、MegaLoc等多种先进VPR方法在Tanks and Temples、ScanNet-GS和KITTI三个挑战性数据集上的性能。
Details
Motivation: 将VPR从传统的图像检索任务重新定义为图像对检索前端,旨在为场景注册、SLAM和运动恢复结构等下游任务,在两个不相交的图像集中寻找最佳匹配的图像对。
Result: 评估表明,现代全局描述子方法(如CosPlace、EigenPlaces、MixVPR)在存在感知混淆和不完整序列的挑战性场景中,越来越适合作为即插即用的图像对检索模块,但其性能表现出明显的领域依赖性优势和劣势。
Insight: 创新点在于将VPR任务重新定位为图像对检索前端,并进行了跨领域的系统性基准测试,揭示了不同VPR方法在特定应用场景(如户外物体中心场景、室内RGB-D扫描、自动驾驶序列)下的适用性,为构建鲁棒的建图和注册系统选择VPR组件提供了关键指导。
Abstract: Visual Place Recognition (VPR) is a core component in computer vision, typically formulated as an image retrieval task for localization, mapping, and navigation. In this work, we instead study VPR as an image pair retrieval front-end for registration pipelines, where the goal is to find top-matching image pairs between two disjoint image sets for downstream tasks such as scene registration, SLAM, and Structure-from-Motion. We comparatively evaluate state-of-the-art VPR families - NetVLAD-style baselines, classification-based global descriptors (CosPlace, EigenPlaces), feature-mixing (MixVPR), and foundation-model-driven methods (AnyLoc, SALAD, MegaLoc) - on three challenging datasets: object-centric outdoor scenes (Tanks and Temples), indoor RGB-D scans (ScanNet-GS), and autonomous-driving sequences (KITTI). We show that modern global descriptor approaches are increasingly suitable as off-the-shelf image pair retrieval modules in challenging scenarios including perceptual aliasing and incomplete sequences, while exhibiting clear, domain-dependent strengths and weaknesses that are critical when choosing VPR components for robust mapping and registration.
[120] OpenCOOD-Air: Prompting Heterogeneous Ground-Air Collaborative Perception with Spatial Conversion and Offset Prediction cs.CVPDF
Xianke Wu, Songlin Bai, Chengxiang Li, Zhiyao Luo, Yulin Tian
TL;DR: 本文提出了OpenCOOD-Air框架,通过将无人机作为可扩展平台整合到车对车协同感知中,以克服地面传感器因遮挡和视角限制导致的感知盲区问题。该方法采用迁移学习策略微调无人机模型,并设计了跨域空间转换器和空间偏移预测Transformer来处理空地异构数据融合,同时构建了OPV2V-Air基准测试进行验证。
Details
Motivation: 解决传统车对车协同感知因地面遮挡和传感器视角限制导致的可靠性不足问题,通过引入无人机平台扩展感知范围并弥补盲区。
Result: 在提出的OPV2V-Air基准测试上,相比现有最优方法,2D平均精度和3D平均精度分别提升4%和7%,达到新的SOTA水平。
Insight: 创新点包括:1) 将空地异构协同感知形式化为具有明确高度监督的任务;2) 采用迁移学习缓解域差异和数据稀疏性;3) 设计跨域空间转换器和基于Transformer的偏移预测模块来保留空间信息。
Abstract: While Vehicle-to-Vehicle (V2V) collaboration extends sensing ranges through multi-agent data sharing, its reliability remains severely constrained by ground-level occlusions and the limited perspective of chassis-mounted sensors, which often result in critical perception blind spots. We propose OpenCOOD-Air, a novel framework that integrates UAVs as extensible platforms into V2V collaborative perception to overcome these constraints. To mitigate gradient interference from ground-air domain gaps and data sparsity, we adopt a transfer learning strategy to fine-tune UAV weights from pre-trained V2V models. To prevent the spatial information loss inherent in this transition, we formulate ground-air collaborative perception as a heterogeneous integration task with explicit altitude supervision and introduce a Cross-Domain Spatial Converter (CDSC) and a Spatial Offset Prediction Transformer (SOPT). Furthermore, we present the OPV2V-Air benchmark to validate the transition from V2V to Vehicle-to-Vehicle-to-UAV. Compared to state-of-the-art methods, our approach improves 2D and 3D AP@0.7 by 4% and 7%, respectively.
[121] Discriminative Flow Matching Via Local Generative Predictors cs.CV | cs.AIPDF
Om Govind Jha, Manoj Bamniya, Ayon Borthakur
TL;DR: 本文提出了一种名为’判别性流匹配’的新框架,将分类和目标检测任务重新定义为条件传输过程。该方法通过学习一个向量场,将样本从简单的噪声分布连续传输到任务对齐的目标流形(如类别嵌入或边界框坐标),从而在生成式和判别式学习之间架起桥梁。
Details
Motivation: 传统判别式计算机视觉主要依赖静态投影,在单一计算步骤中将输入特征映射到输出,虽然高效但缺乏生物视觉和现代生成模型固有的迭代优化和鲁棒性。本文旨在通过引入流匹配技术来弥补这一不足。
Result: 该方法在标准图像分类和复杂的目标检测任务上进行了验证,其中目标为高维且空间分布。通过聚合多个独立流预测器的预测,该框架能够在包括CNN和视觉Transformer在内的多种架构上实现鲁棒的、受生成式启发的推理。
Insight: 核心创新点在于将判别式任务重新构建为条件传输过程,并采用局部流匹配目标训练多个附着于共享骨干网络的独立流预测器。这种设计允许块按顺序或并行更新,以适应不同的硬件限制,同时保持了生成式方法的迭代优化特性。
Abstract: Traditional discriminative computer vision relies predominantly on static projections, mapping input features to outputs in a single computational step. Although efficient, this paradigm lacks the iterative refinement and robustness inherent in biological vision and modern generative modelling. In this paper, we propose Discriminative Flow Matching, a framework that reformulates classification and object detection as a conditional transport process. By learning a vector field that continuously transports samples from a simple noise distribution toward a task-aligned target manifold – such as class embeddings or bounding box coordinates – we are at the interface between generative and discriminative learning. Our method attaches multiple independent flow predictors to a shared backbone. These predictors are trained using local flow matching objectives, where gradients are computed independently for each block. We formulate this approach for standard image classification and extend it to the complex task of object detection, where targets are high-dimensional and spatially distributed. This architecture provides the flexibility to update blocks either sequentially to minimise activation memory or in parallel to suit different hardware constraints. By aggregating the predictions from these independent flow predictors, our framework enables robust, generative-inspired inference across diverse architectures, including CNNs and vision transformers.
[122] Bidirectional Cross-Attention Fusion of High-Res RGB and Low-Res HSI for Multimodal Automated Waste Sorting cs.CVPDF
Jonas V. Funk, Lukas Roming, Andreas Michel, Paul Bäcker, Georg Maier
TL;DR: 本文提出了一种名为双向交叉注意力融合(BCAF)的方法,用于融合高分辨率RGB图像和低分辨率高光谱图像(HSI),以提升自动化废料分拣中的像素级分割精度。该方法通过局部化的双向交叉注意力在原始网格上对齐两种模态,避免预上采样或早期光谱坍缩,并采用两个独立的骨干网络(标准Swin Transformer用于RGB,适配HSI的Swin骨干通过3D标记化和光谱自注意力保持光谱结构)。
Details
Motivation: 自动化废料分拣需要高效的多模态融合方法,因为RGB图像提供高空间分辨率但难以区分可见光谱相似的材料,而HSI提供区分材料的光谱特征但空间分辨率较低。现有方法未能充分利用两者的互补优势,需要一种能有效融合高分辨率RGB和低分辨率HSI的技术。
Result: 在基准数据集SpectralWaste上,BCAF达到了76.4% mIoU(31图像/秒)和75.4% mIoU(55图像/秒)的最先进性能。在新工业数据集K3I-Cycling上,BCAF在材料分割(纸、金属、塑料等)达到62.3% mIoU,在塑料类型分割(PET、PP、HDPE等)达到66.2% mIoU。
Insight: 创新点包括:提出局部化双向交叉注意力在原生网格上对齐多模态数据,避免预上采样带来的信息损失;设计HSI适配的Swin骨干网络,通过3D标记化和光谱自注意力保持光谱结构;方法具有模态无关性,可应用于其他高通道辅助传感器与RGB的融合场景。
Abstract: Growing waste streams and the transition to a circular economy require efficient automated waste sorting. In industrial settings, materials move on fast conveyor belts, where reliable identification and ejection demand pixel-accurate segmentation. RGB imaging delivers high-resolution spatial detail, which is essential for accurate segmentation, but it confuses materials that look similar in the visible spectrum. Hyperspectral imaging (HSI) provides spectral signatures that separate such materials, yet its lower spatial resolution limits detail. Effective waste sorting therefore needs methods that fuse both modalities to exploit their complementary strengths. We present Bidirectional Cross-Attention Fusion (BCAF), which aligns high-resolution RGB with low-resolution HSI at their native grids via localized, bidirectional cross-attention, avoiding pre-upsampling or early spectral collapse. BCAF uses two independent backbones: a standard Swin Transformer for RGB and an HSI-adapted Swin backbone that preserves spectral structure through 3D tokenization with spectral self-attention. We also analyze trade-offs between RGB input resolution and the number of HSI spectral slices. Although our evaluation targets RGB-HSI fusion, BCAF is modality-agnostic and applies to co-registered RGB with lower-resolution, high-channel auxiliary sensors. On the benchmark SpectralWaste dataset, BCAF achieves state-of-the-art performance of 76.4% mIoU at 31 images/s and 75.4% mIoU at 55 images/s. We further evaluate a novel industrial dataset: K3I-Cycling (first RGB subset already released on Fordatis). On this dataset, BCAF reaches 62.3% mIoU for material segmentation (paper, metal, plastic, etc.) and 66.2% mIoU for plastic-type segmentation (PET, PP, HDPE, LDPE, PS, etc.).
[123] VID-AD: A Dataset for Image-Level Logical Anomaly Detection under Vision-Induced Distraction cs.CVPDF
Hiroto Nakata, Yawen Zou, Shunsuke Sakai, Shun Maeda, Chunzhi Gu
TL;DR: 本文提出了VID-AD数据集,用于在视觉干扰下进行图像级逻辑异常检测。该数据集包含10个制造场景和5种采集条件,共50个单类任务和10,395张图像,每个场景由两个逻辑约束定义。作者还提出了一种基于语言的异常检测框架,仅使用正常图像的文本描述,通过对比学习来学习捕捉逻辑属性的嵌入表示。
Details
Motivation: 工业检测中的逻辑异常检测因视觉外观变化(如背景杂乱、光照变化和模糊)而具有挑战性,这些变化会分散以视觉为中心的检测器对规则违反的识别。现有基准测试很少提供逻辑状态固定而干扰因素变化的受控设置,本文旨在填补这一空白。
Result: 广泛的实验表明,所提出的方法在评估的所有设置中均比基线模型有持续改进。
Insight: 创新点在于构建了一个专门针对视觉干扰下逻辑异常检测的受控数据集,并提出了一种新颖的、仅依赖文本描述的基于语言的检测框架,通过对比学习和基于矛盾的负文本合成来学习高级逻辑属性,而非低级视觉特征。
Abstract: Logical anomaly detection in industrial inspection remains challenging due to variations in visual appearance (e.g., background clutter, illumination shift, and blur), which often distract vision-centric detectors from identifying rule-level violations. However, existing benchmarks rarely provide controlled settings where logical states are fixed while such nuisance factors vary. To address this gap, we introduce VID-AD, a dataset for logical anomaly detection under vision-induced distraction. It comprises 10 manufacturing scenarios and five capture conditions, totaling 50 one-class tasks and 10,395 images. Each scenario is defined by two logical constraints selected from quantity, length, type, placement, and relation, with anomalies including both single-constraint and combined violations. We further propose a language-based anomaly detection framework that relies solely on text descriptions generated from normal images. Using contrastive learning with positive texts and contradiction-based negative texts synthesized from these descriptions, our method learns embeddings that capture logical attributes rather than low-level features. Extensive experiments demonstrate consistent improvements over baselines across the evaluated settings. The dataset is available at: https://github.com/nkthiroto/VID-AD.
[124] Leveraging a Statistical Shape Model for Efficient Generation of Annotated Training Data: A Case Study on Liver Landmarks Segmentation cs.CV | eess.IVPDF
Denis Krnjaca, Lorena Krames, Werner Nahm
TL;DR: 本文提出了一种基于统计形状模型(SSM)的高效生成标注训练数据的方法,用于解决解剖标志分割任务中手动标注数据耗时费力的问题。该方法仅需一次手动标注平均形状,即可通过SSM生成大量标注数据,并以肝脏解剖标志(前嵴和镰状韧带)的3D分割为案例进行验证。
Details
Motivation: 当前基于深度学习的解剖标志分割方法需要大量手动标注数据,标注过程劳动密集且耗时,因此需要一种高效生成标注训练数据的方法来降低标注成本。
Result: 使用SSM生成的8,800个标注肝脏形状训练专用深度学习网络,在500个未见过的合成SSM形状上评估,平均交并比达到91.4%(前嵴87.4%,镰状韧带87.6%)。在临床患者肝脏形状上的定性评估也显示了有希望的结果,证明了方法的泛化能力。
Insight: 创新点在于利用统计形状模型仅需一次手动标注即可批量生成高质量标注数据,显著减少了深度学习对人工标注的依赖。该方法具有通用性,可推广至其他需要大量标注数据的医学图像分析任务中。
Abstract: Anatomical landmark segmentation serves as a critical initial step for robust multimodal registration during computer-assisted interventions. Current approaches predominantly rely on deep learning, which often necessitates the extensive manual generation of annotated datasets. In this paper, we present a novel strategy for creating large annotated datasets using a statistical shape model (SSM) based on a mean shape that is manually labeled only once. We demonstrate the method’s efficacy through its application to deep-learning-based anatomical landmark segmentation, specifically targeting the detection of the anterior ridge and the falciform ligament in 3D liver shapes. A specialized deep learning network was trained with 8,800 annotated liver shapes generated by the SSM. The network’s performance was evaluated on 500 unseen synthetic SSM shapes, yielding a mean Intersection over Union of 91.4% (87.4% for the anterior ridge and 87.6% for the falciform ligament). Subsequently, the network was applied to clinical patient liver shapes, with qualitative evaluation indicating promising results and highlighting the generalizability of the proposed approach. Our findings suggest that the SSM-based data generation approach alleviates the labor-intensive process of manual labeling while enabling the creation of large annotated training datasets for machine learning. Although our study focuses on liver anatomy, the proposed methodology holds potential for a broad range of applications where annotated training datasets play a pivotal role in developing accurate deep-learning models.
[125] When Visual Privacy Protection Meets Multimodal Large Language Models cs.CVPDF
Xiaofei Hui, Qian Wu, Haoxuan Qu, Majid Mirmehdi, Hossein Rahmani
TL;DR: 本文针对多模态大语言模型(MLLM)云服务(如GPT-4V)引发的视觉隐私泄露问题,提出了一种在MLLM作为‘黑盒’(仅能访问输入输出)场景下的隐私保护框架。该框架通过精心设计帕累托最优学习目标,在视觉隐私与MLLM性能之间寻求更好权衡,并采用关键历史增强优化方法有效优化框架。实验表明该方法在不同基准测试上有效。
Details
Motivation: MLLM云服务的普及要求用户提交图像和视频,带来了严重的隐私泄露风险,而如何应对这种隐私担忧是一个尚未充分探索的问题。
Result: 实验表明,该方法在不同基准测试(benchmarks)上有效,但摘要未具体说明达到何种水平(如SOTA)。
Insight: 创新点在于针对MLLM‘黑盒’场景设计隐私保护框架,结合帕累托最优目标平衡隐私与性能,并采用关键历史增强优化进行有效训练,为实际部署中的隐私-效用权衡提供了新思路。
Abstract: The emergence of Multimodal Large Language Models (MLLMs) and the widespread usage of MLLM cloud services such as GPT-4V raised great concerns about privacy leakage in visual data. As these models are typically deployed in cloud services, users are required to submit their images and videos, posing serious privacy risks. However, how to tackle such privacy concerns is an under-explored problem. Thus, in this paper, we aim to conduct a new investigation to protect visual privacy when enjoying the convenience brought by MLLM services. We address the practical case where the MLLM is a “black box”, i.e., we only have access to its input and output without knowing its internal model information. To tackle such a challenging yet demanding problem, we propose a novel framework, in which we carefully design the learning objective with Pareto optimality to seek a better trade-off between visual privacy and MLLM’s performance, and propose critical-history enhanced optimization to effectively optimize the framework with the black-box MLLM. Our experiments show that our method is effective on different benchmarks.
[126] Human-like Object Grouping in Self-supervised Vision Transformers cs.CV | cs.AI | q-bio.NCPDF
Hossein Adeli, Seoyoung Ahn, Andrew Luo, Mengmi Zhang, Nikolaus Kriegeskorte
TL;DR: 这篇论文研究了自监督视觉基础模型(特别是基于Transformer的DINO模型)在物体感知方面与人类行为的对齐程度。作者通过一个大规模行为基准测试,让参与者对自然场景中的点对进行物体异同判断,并利用模型表征预测反应时间。研究发现,自监督Transformer模型在物体分组任务上表现出与人类最相似的行为,且表征中的物体中心结构是预测人类分割行为的关键。通过Gram矩阵蒸馏可以进一步提升模型与人类感知的对齐。
Details
Motivation: 尽管自监督视觉基础模型展现出强大的性能和涌现的物体分割能力,但其表征与人类物体感知的对齐程度尚不明确。论文旨在量化并理解这种对齐关系,探究模型架构和训练目标如何影响其感知的“人类相似性”。
Result: 在超过1000个试次的行为基准测试中,基于Transformer并使用DINO自监督目标训练的模型表现最佳,其表征最能准确预测人类受试者的反应时间。通过提出的新度量(量化表征的物体中心结构)发现,更强的物体中心结构能更准确地预测人类分割行为。此外,通过Gram矩阵蒸馏使监督模型与自监督模型的结构对齐,可以提升其与人类行为的一致性。
Insight: 论文的核心创新点在于建立了一个大规模、可量化的行为基准来评估模型与人类物体感知的对齐性,并提出了一个度量来量化表征中的物体中心结构。客观来看,研究揭示了自监督学习(特别是DINO目标)在诱导出与人类感知相似的物体分组结构方面的有效性,并强调了表征的Gram矩阵相似性结构在驱动感知对齐中的重要作用,这为构建更“类人”的视觉模型提供了新的评估方法和优化方向(如Gram锚定)。
Abstract: Vision foundation models trained with self-supervised objectives achieve strong performance across diverse tasks and exhibit emergent object segmentation properties. However, their alignment with human object perception remains poorly understood. Here, we introduce a behavioral benchmark in which participants make same/different object judgments for dot pairs on naturalistic scenes, scaling up a classical psychophysics paradigm to over 1000 trials. We test a diverse set of vision models using a simple readout from their representations to predict subjects’ reaction times. We observe a steady improvement across model generations, with both architecture and training objective contributing to alignment, and transformer-based models trained with the DINO self-supervised objective showing the strongest performance. To investigate the source of this improvement, we propose a novel metric to quantify the object-centric component of representations by measuring patch similarity within and between objects. Across models, stronger object-centric structure predicts human segmentation behavior more accurately. We further show that matching the Gram matrix of supervised transformer models, capturing similarity structure across image patches, with that of a self-supervised model through distillation improves their alignment with human behavior, converging with the prior finding that Gram anchoring improves DINOv3’s feature quality. Together, these results demonstrate that self-supervised vision models capture object structure in a behaviorally human-like manner, and that Gram matrix structure plays a role in driving perceptual alignment.
[127] Fine-tuning MLLMs Without Forgetting Is Easier Than You Think cs.CV | cs.CL | cs.LGPDF
He Li, Yuhui Zhang, Xiaohan Wang, Kaifeng Lyu, Serena Yeung-Levy
TL;DR: 本文研究表明,通过简单的微调策略调整,如参数约束、低学习率或数据混合训练,即可有效缓解多模态大语言模型(MLLM)的灾难性遗忘问题,并在持续学习场景中超越现有复杂方法。
Details
Motivation: 解决多模态大语言模型在微调过程中出现的灾难性遗忘问题,挑战了普遍认为需要复杂机制才能缓解遗忘的假设。
Result: 在视觉问答任务上,通过2x2实验框架评估,正则化方法能有效防止分布外图像导致的遗忘;针对分布内图像与分布外文本的遗忘,数据混合训练策略在持续学习任务中超越了现有复杂辅助机制的方法。
Insight: 创新点在于揭示了MLLM固有的鲁棒性,并提出实用微调指南;客观来看,其将任务特定过拟合识别为一种遗忘形式,并通过简单数据混合策略解决,为模型适应提供了轻量级方案。
Abstract: The paper demonstrate that simple adjustments of the fine-tuning recipes of multimodal large language models (MLLM) are sufficient to mitigate catastrophic forgetting. On visual question answering, we design a 2x2 experimental framework to assess model performance across in-distribution and out-of-distribution image and text inputs. Our results show that appropriate regularization, such as constraining the number of trainable parameters or adopting a low learning rate, effectively prevents forgetting when dealing with out-of-distribution images. However, we uncover a distinct form of forgetting in settings with in-distribution images and out-of-distribution text. We attribute this forgetting as task-specific overfitting and address this issue by introducing a data-hybrid training strategy that combines datasets and tasks. Finally, we demonstrate that this approach naturally extends to continual learning, outperforming existing methods with complex auxiliary mechanisms. In general, our findings challenge the prevailing assumptions by highlighting the inherent robustness of MLLMs and providing practical guidelines for adapting them while preserving their general capabilities.
[128] Multi-Grained Vision-Language Alignment for Domain Generalized Person Re-Identification cs.CVPDF
Jiachen Li, Xiaojin Gong, Dongping Zhang
TL;DR: 本文提出了一种基于CLIP的多粒度视觉-语言对齐框架,用于解决领域泛化行人重识别任务。该方法通过引入多粒度文本提示来描述不同身体部位,并利用自适应掩码多头自注意力模块提取细粒度视觉特征,同时采用基于MLLM的视觉定位专家自动生成身体部位伪标签进行监督。
Details
Motivation: 针对领域泛化行人重识别任务中,现有纯视觉模型性能有限,而直接应用视觉语言模型仅能生成全局特征、对身份细微差异不敏感的问题,旨在通过多粒度视觉-语言对齐提升模型在未见目标域上的泛化能力。
Result: 在单源和多源泛化协议上的大量实验表明,该方法取得了优越的性能,具体定量结果未在摘要中提及,但暗示其性能超越现有方法。
Insight: 创新点在于将视觉语言模型(CLIP)与细粒度特征提取相结合,通过多粒度提示对齐和自适应掩码注意力机制实现局部特征增强,并利用MLLM自动生成伪标签解决细粒度监督信号缺失问题,为领域泛化任务提供了新的跨模态对齐思路。
Abstract: Domain Generalized person Re-identification (DG Re-ID) is a challenging task, where models are trained on source domains but tested on unseen target domains. Although previous pure vision-based models have achieved significant progress, the performance remains further improved. Recently, Vision-Language Models (VLMs) present outstanding generalization capabilities in various visual applications. However, directly adapting a VLM to Re-ID shows limited generalization improvement. This is because the VLM only produces with global features that are insensitive to ID nuances. To tacle this problem, we propose a CLIP-based multi-grained vision-language alignment framework in this work. Specifically, several multi-grained prompts are introduced in language modality to describe different body parts and align with their counterparts in vision modality. To obtain fine-grained visual information, an adaptively masked multi-head self-attention module is employed to precisely extract specific part features. To train the proposed module, an MLLM-based visual grounding expert is employed to automatically generate pseudo labels of body parts for supervision. Extensive experiments conducted on both single- and multi-source generalization protocols demonstrate the superior performance of our approach. The implementation code will be released at https://github.com/RikoLi/MUVA.
[129] A Hyperbolic Perspective on Hierarchical Structure in Object-Centric Scene Representations cs.CVPDF
Neelu Madan, Àlex Pujol, Andreas Møgelmose, Sergio Escalera, Kamal Nasrollahi
TL;DR: 该论文提出了一种将欧几里得空间中的slot attention嵌入投影到双曲空间洛伦兹超曲面上的后处理流程,旨在揭示视觉场景中潜在的层次结构。研究发现,双曲投影能暴露从场景级到对象级的一致组织模式,且存在’曲率-任务权衡’现象。
Details
Motivation: 解决slot attention在欧几里得空间中学习时缺乏对视觉场景自然层次关系的几何归纳偏置的问题,探索双曲几何是否能揭示欧几里得空间中不可见的潜在层次结构。
Result: 在SPOT(图像)、VideoSAUR(视频)和SlotContrast(视频)上集成该流程,发现双曲投影能暴露一致的层次组织(粗粒度slot占据更大的流形深度)。低曲率(c=0.2)在父slot检索任务上匹配或优于欧几里得空间,中等曲率(c=0.5)实现了更好的层级间分离。
Insight: 创新点在于提出了一种无需修改底层训练、简单的后处理双曲投影方法,揭示了slot表示已编码了可通过双曲几何显现的潜在层次结构,这为端到端的双曲训练提供了动机。从客观角度看,该方法为分析表示层次性提供了一种新视角,并发现了任务性能与曲率参数之间的权衡关系。
Abstract: Slot attention has emerged as a powerful framework for unsupervised object-centric learning, decomposing visual scenes into a small set of compact vector representations called \emph{slots}, each capturing a distinct region or object. However, these slots are learned in Euclidean space, which provides no geometric inductive bias for the hierarchical relationships that naturally structure visual scenes. In this work, we propose a simple post-hoc pipeline to project Euclidean slot embeddings onto the Lorentz hyperboloid of hyperbolic space, without modifying the underlying training pipeline. We construct five-level visual hierarchies directly from slot attention masks and analyse whether hyperbolic geometry reveals latent hierarchical structure that remains invisible in Euclidean space. Integrating our pipeline with SPOT (images), VideoSAUR (video), and SlotContrast (video), We find that hyperbolic projection exposes a consistent scene-level to object-level organisation, where coarse slots occupy greater manifold depth than fine slots, which is absent in Euclidean space. We further identify a “curvature–task tradeoff”: low curvature ($c{=}0.2$) matches or outperforms Euclidean on parent slot retrieval, while moderate curvature ($c{=}0.5$) achieves better inter-level separation. Together, these findings suggest that slot representations already encode latent hierarchy that hyperbolic geometry reveals, motivating end-to-end hyperbolic training as a natural next step. Code and models are available at \href{https://github.com/NeeluMadan/HHS}{github.com/NeeluMadan/HHS}.
[130] EyeWorld: A Generative World Model of Ocular State and Dynamics cs.CVPDF
Ziyu Gao, Xinyuan Wu, Xiaolan Chen, Zhuoran Liu, Ruoyu Chen
TL;DR: EyeWorld是一个生成式世界模型,将眼睛视为基于临床成像的部分可观测动态系统,学习跨模态共享的观测稳定潜在眼状态,统一了细粒度解析、结构保持的跨模态翻译和质量鲁棒增强,并通过纵向监督实现时间条件状态转换,支持临床有意义进展的预测。
Details
Motivation: 解决眼科决策依赖跨模态和时序的细微病变线索,而现有医学基础模型多为静态且易受模态和采集偏移影响的问题。
Result: 论文未在摘要中提及具体的定量结果或基准测试,但宣称其方法提供了统一的鲁棒多模态解释和面向预后的模拟方法。
Insight: 创新点在于从静态表示学习转向显式动态建模,通过观测稳定潜在状态和纵向监督,实现了跨模态统一处理和临床进展预测,为医学影像分析提供了新的生成式世界模型框架。
Abstract: Ophthalmic decision-making depends on subtle lesion-scale cues interpreted across multimodal imaging and over time, yet most medical foundation models remain static and degrade under modality and acquisition shifts. Here we introduce EyeWorld, a generative world model that conceptualizes the eye as a partially observed dynamical system grounded in clinical imaging. EyeWorld learns an observation-stable latent ocular state shared across modalities, unifying fine-grained parsing, structure-preserving cross-modality translation and quality-robust enhancement within a single framework. Longitudinal supervision further enables time-conditioned state transitions, supporting forecasting of clinically meaningful progression while preserving stable anatomy. By moving from static representation learning to explicit dynamical modeling, EyeWorld provides a unified approach to robust multimodal interpretation and prognosis-oriented simulation in medicine.
[131] A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning cs.CV | cs.MAPDF
Yichang Xu, Gaowen Liu, Ramana Rao Kompella, Tiansheng Huang, Sihao Hu
TL;DR: 本文提出了一种名为A4VL的多智能体感知-动作探索联盟,用于高效的长视频推理。A4VL通过多轮感知-动作探索循环,利用一组视觉语言模型智能体,在每轮中先进行感知探索以提取查询相关的感知线索并定位相关视频片段,再进行动作探索,通过智能体间的协作评分与共识决策,最终生成答案。该方法旨在有效处理长视频,同时保持高质量推理和低推理延迟。
Details
Motivation: 解决现有视觉语言模型在处理长视频时面临的效率低下和推理质量不足的问题,特别是如何高效地从长视频中提取相关信息并进行准确推理。
Result: 在五个流行的VideoQA基准测试上,A4VL超越了18个现有代表性VLM和10个针对长视频推理优化的近期方法,同时实现了显著更低的推理延迟,达到了SOTA水平。
Insight: 创新点在于将多智能体联盟与多轮感知-动作探索循环相结合,通过事件驱动的视频分区和线索引导的块对齐机制,实现了对长视频的高效、可扩展推理。从客观角度看,其多智能体协作与动态修剪机制是提升推理效率和鲁棒性的关键设计。
Abstract: This paper presents a multi-agent perception-action exploration alliance, dubbed A4VL, for efficient long-video reasoning. A4VL operates in a multi-round perception-action exploration loop with a selection of VLM agents. In each round, the team of agents performs video question-answer (VideoQA) via perception exploration followed by action exploration. During perception exploration, each agent learns to extract query-specific perception clue(s) from a few sampled frames and performs clue-based alignment to find the video block(s) that are most relevant to the query-specific event. During action exploration, A4VL performs video reasoning in three steps: (1) each agent produces its initial answer with rational, (2) all agents collaboratively scores one another through cross-reviews and relevance ranking, and (3) based on whether a satisfactory consensus is reached, the decision is made either to start a new round of perception-action deliberation by pruning (e.g., filtering out the lowest performing agent) and re-staging (e.g., new-clue and matching block based perception-action exploration), or to conclude by producing its final answer. The integration of the multi-agent alliance through multi-round perception-action exploration, coupled with event-driven partitioning and cue-guided block alignment, enables A4VL to effectively scale to real world long videos while preserving high quality video reasoning. Evaluation Results on five popular VideoQA benchmarks show that A4VL outperforms 18 existing representative VLMs and 10 recent methods optimized for long-video reasoning, while achieving significantly lower inference latency. Our code is released at https://github.com/git-disl/A4VL.
[132] Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents cs.CV | cs.CLPDF
Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang
TL;DR: 本文提出并形式化了计算机使用代理中的‘视觉困惑代理’安全漏洞,即代理因屏幕状态误判(如视觉基础错误、对抗性截图操纵或TOCTOU竞争)而授权错误操作。为缓解此威胁,论文提出了一种在代理感知循环之外运行的首个护栏方法——双通道对比分类,通过独立评估视觉点击目标和代理对动作的推理来阻止风险执行。
Details
Motivation: 现有工作主要将计算机使用代理的视觉感知失败视为性能限制,关注动作是否成功而非是否点击了正确对象。作者认为这本质上是一个安全问题,旨在解决代理因屏幕误判而执行非预期特权操作的安全威胁。
Result: 在受控攻击、真实GUI截图和代理轨迹上的实验表明,所提出的组合护栏(双通道对比分类)在检测风险方面始终优于任一单独通道。
Insight: 创新点在于将视觉感知失败形式化为安全漏洞(视觉困惑代理),并提出了首个在代理感知循环外运行的护栏。其核心见解是视觉证据和文本推理通道能捕获互补的失败模式:视觉检测目标不匹配,而文本推理揭示视觉无害控件背后的危险意图,强调CUA安全不仅需要更好的动作生成,还需独立验证代理‘点击什么’和‘为什么点击’。
Abstract: Computer-using agents (CUAs) act directly on graphical user interfaces, yet their perception of the screen is often unreliable. Existing work largely treats these failures as performance limitations, asking whether an action succeeds, rather than whether the agent is acting on the correct object at all. We argue that this is fundamentally a security problem. We formalize the visual confused deputy: a failure mode in which an agent authorizes an action based on a misperceived screen state, due to grounding errors, adversarial screenshot manipulation, or time-of-check-to-time-of-use (TOCTOU) races. This gap is practically exploitable: even simple screen-level manipulations can redirect routine clicks into privileged actions while remaining indistinguishable from ordinary agent mistakes. To mitigate this threat, we propose the first guardrail that operates outside the agent’s perceptual loop. Our method, dual-channel contrastive classification, independently evaluates (1) the visual click target and (2) the agent’s reasoning about the action against deployment-specific knowledge bases, and blocks execution if either channel indicates risk. The key insight is that these two channels capture complementary failure modes: visual evidence detects target-level mismatches, while textual reasoning reveals dangerous intent behind visually innocuous controls. Across controlled attacks, real GUI screenshots, and agent traces, the combined guardrail consistently outperforms either channel alone. Our results suggest that CUA safety requires not only better action generation, but independent verification of what the agent believes it is clicking and why. Materials are provided\footnote{Model, benchmark, and code: https://github.com/vllm-project/semantic-router}.
[133] MotionCFG: Boosting Motion Dynamics via Stochastic Concept Perturbation cs.CV | cs.AI | cs.LGPDF
Byungjun Kim, Soobin Um, Jong Chul Ye
TL;DR: 本文提出MotionCFG框架,旨在解决文本到视频生成中动态运动质量不足的问题。该方法通过向概念嵌入注入高斯噪声来创建局部负锚点,从而隐式挖掘次优运动变化,并结合分段引导策略,在早期去噪步骤中进行干预,以增强运动动态性而不损害语义完整性。
Details
Motivation: 现有文本到视频方法依赖无分类器引导和显式负面提示来抑制伪影,但这常导致语义偏差和对象完整性失真,即内容-运动漂移问题。
Result: MotionCFG在多个先进的T2V框架上一致提升了运动动态性,计算开销可忽略且视觉质量损失最小,并在复杂非线性概念(如精确对象数量)调控上表现出有效性。
Insight: 创新点在于使用噪声扰动概念嵌入进行隐式硬负样本挖掘,避免全局语义偏移,并通过分段引导聚焦时间细节优化;这为运动增强和概念调控提供了新思路。
Abstract: Despite recent advances in Text-to-Video (T2V) synthesis, generating high-fidelity and dynamic motion remains a significant challenge. Existing methods primarily rely on Classifier-Free Guidance (CFG), often with explicit negative prompts (e.g. “static”, “blurry”), to suppress undesired artifacts. However, such explicit negations frequently introduce unintended semantic bias and distort object integrity; a phenomenon we define as Content-Motion Drift. To address this, we propose MotionCFG, a framework that enhances motion dynamics by contrasting a target concept with its noise-perturbed counterparts. Specifically, by injecting Gaussian noise into the concept embeddings, MotionCFG creates localized negative anchors that encapsulate a broad complementary space of sub-optimal motion variations. Unlike explicit negations, this approach facilitates implicit hard negative mining without shifting the global semantic identity, allowing for a focused refinement of temporal details. Combined with a piecewise guidance schedule that confines intervention to the early denoising steps, MotionCFG consistently improves motion dynamics across state-of-the-art T2V frameworks with negligible computational overhead and minimal compromise in visual quality. Additionally, we demonstrate that this noise-induced contrastive mechanism is effective not only for sharpening motion trajectories but also for steering complex, non-linear concepts such as precise object numerosity, which are typically difficult to modulate via standard text-based guidance.
[134] SGR-OCC: Evolving Monocular Priors for Embodied 3D Occupancy Prediction via Soft-Gating Lifting and Semantic-Adaptive Geometric Refinement cs.CVPDF
Yiran Guo, Simone Mentasti, Xiaofeng Jin, Matteo Frosi, Matteo Matteucci
TL;DR: SGR-OCC是一个用于从单目视频流进行在线3D语义占据预测的统一框架,旨在解决单目深度估计固有的模糊性导致的物体边界’特征渗漏’问题,以及时序融合层在训练初期’冷启动’不稳定导致高质量空间先验失真的问题。该框架基于’继承与演进’理念,通过软门控特征提升器建模深度不确定性以抑制背景噪声,并通过动态射线约束锚点细化模块将复杂的3D位移搜索简化为沿相机射线的1D深度修正。此外,采用两阶段渐进式训练策略以稳定演进到时序一致性。
Details
Motivation: 当前在线3D语义占据预测框架面临两个关键瓶颈:单目估计固有的深度模糊性导致物体边界出现’特征渗漏’,以及未初始化的时序融合层在训练早期阶段会扭曲高质量空间先验,造成’冷启动’不稳定。
Result: 在EmbodiedOcc-ScanNet和Occ-ScanNet基准测试上,SGR-OCC实现了最先进的性能。在局部预测任务中,达到了58.55%的补全IoU和49.89%的语义mIoU,分别比之前的最佳方法EmbodiedOcc++提升了3.65%和3.69%。在更具挑战性的具身预测任务中,模型达到了55.72%的SC-IoU和46.22%的mIoU。定性结果进一步证实了模型在复杂室内环境中保持结构完整性和边界清晰度的卓越能力。
Insight: 摘要宣称的创新点包括:1) 软门控特征提升器,通过高斯门显式建模深度不确定性以概率性抑制背景噪声,完美继承单目空间专业知识;2) 动态射线约束锚点细化模块,将复杂的3D位移搜索简化为沿相机射线的1D深度修正,确保亚体素级贴合物理表面;3) 配备恒等初始化融合的两阶段渐进式训练策略,有效解决冷启动问题并保护空间先验免受早期噪声梯度影响。从客观角度看,其核心创新在于将’继承与演进’的系统性设计哲学具体化为可操作的模块和训练策略,以协同解决单目在线3D重建中的空间模糊性和时序不稳定性问题。
Abstract: 3D semantic occupancy prediction is a cornerstone for embodied AI, enabling agents to perceive dense scene geometry and semantics incrementally from monocular video streams. However, current online frameworks face two critical bottlenecks: the inherent depth ambiguity of monocular estimation that causes “feature bleeding” at object boundaries , and the “cold start” instability where uninitialized temporal fusion layers distort high-quality spatial priors during early training stages. In this paper, we propose SGR-OCC (Soft-Gating and Ray-refinement Occupancy), a unified framework driven by the philosophy of “Inheritance and Evolution”. To perfectly inherit monocular spatial expertise, we introduce a Soft-Gating Feature Lifter that explicitly models depth uncertainty via a Gaussian gate to probabilistically suppress background noise. Furthermore, a Dynamic Ray-Constrained Anchor Refinement module simplifies complex 3D displacement searches into efficient 1D depth corrections along camera rays, ensuring sub-voxel adherence to physical surfaces. To ensure stable evolution toward temporal consistency, we employ a Two-Phase Progressive Training Strategy equipped with identity-initialized fusion, effectively resolving the cold start problem and shielding spatial priors from noisy early gradients. Extensive experiments on the EmbodiedOcc-ScanNet and Occ-ScanNet benchmarks demonstrate that SGR-OCC achieves state-of-the-art performance. In local prediction tasks, SGR-OCC achieves a completion IoU of 58.55$%$ and a semantic mIoU of 49.89$%$, surpassing the previous best method, EmbodiedOcc++, by 3.65$%$ and 3.69$%$ respectively. In challenging embodied prediction tasks, our model reaches 55.72$%$ SC-IoU and 46.22$%$ mIoU. Qualitative results further confirm our model’s superior capability in preserving structural integrity and boundary sharpness in complex indoor environments.
[135] Effective Feature Learning for 3D Medical Registration via Domain-Specialized DINO Pretraining cs.CVPDF
Eytan Kats, Mattias P. Heinrich
TL;DR: 本文研究了针对3D医学图像配准的领域专用DINO自监督预训练方法,通过直接在3D医学影像数据上进行预训练,学习适用于可变形配准的密集体素特征,在跨模态(MRI和CT)的患者间腹部配准任务中表现出色。
Details
Motivation: 解决基于强度的配准方法在扫描仪间变异性和复杂解剖形变上的不足,通过特征学习方法提升配准的鲁棒性。
Result: 在跨模态患者间腹部配准任务中,领域专用预训练模型优于在自然图像上训练的DINOv2模型,且推理时计算资源需求更低;在域外评估中超越了现有配准模型。
Insight: 创新点在于将DINO风格的自监督预训练直接应用于3D医学影像,实现任务无关但医学影像专注的预训练,从而学习到适用于可变形配准的鲁棒特征表示,且计算效率更高。
Abstract: Medical image registration is a critical component of clinical imaging workflows, enabling accurate longitudinal assessment, multi-modal data fusion, and image-guided interventions. Intensity-based approaches often struggle with interscanner variability and complex anatomical deformations, whereas feature-based methods offer improved robustness by leveraging semantically informed representations. In this work, we investigate DINO-style self-supervised pretraining directly on 3D medical imaging data, aiming to learn dense volumetric features well suited for deformable registration. We assess the resulting representations on challenging interpatient abdominal registration task across both MRI and CT modalities. Our domain-specialized pretraining outperforms the DINOv2 model trained on a large-scale collection of natural images, while requiring substantially lower computational resources at inference time. Moreover, it surpasses established registration models under out-of-domain evaluation, demonstrating the value of task-agnostic yet medical imaging-focused pretraining for robust and efficient 3D image registration.
[136] Revisiting the Perception-Distortion Trade-off with Spatial-Semantic Guided Super-Resolution cs.CVPDF
Dan Wang, Haiyan Sun, Shan Du, Z. Jane Wang, Zhaochong An
TL;DR: 本文提出SpaSemSR,一种空间语义引导的扩散框架,通过空间基础文本引导和语义增强视觉引导的互补机制,在图像超分辨率任务中有效平衡感知质量与失真度,克服了传统GAN和扩散模型在纹理真实性与输入保真度之间的权衡难题。
Details
Motivation: 针对图像超分辨率中感知质量与失真度的根本性权衡问题,现有GAN方法失真度低但纹理细节不真实,扩散模型能合成丰富细节却常偏离输入、产生幻觉结构,因此研究如何在利用扩散模型强大生成先验的同时不牺牲保真度成为关键挑战。
Result: 在多个基准测试上的广泛实验表明,SpaSemSR实现了优越的感知-失真平衡,能同时生成真实且保真的复原图像。
Insight: 创新点在于提出空间基础文本引导(整合对象级空间线索与语义提示以对齐文本与视觉结构)和语义增强视觉引导(通过多编码器设计与语义退化约束统一多模态语义先验),并通过空间语义注意力自适应融合到扩散过程中,从而在抑制失真和幻觉的同时保留扩散模型的优势。
Abstract: Image super-resolution (SR) aims to reconstruct high resolution images with both high perceptual quality and low distortion, but is fundamentally limited by the perception-distortion trade-off. GAN-based SR methods reduce distortion but still struggle with realistic fine-grained textures, whereas diffusion-based approaches synthesize rich details but often deviate from the input, hallucinating structures and degrading fidelity. This tension raises a key challenge: how to exploit the powerful generative priors of diffusion models without sacrificing fidelity. To address this, we propose SpaSemSR, a spatial-semantic guided diffusion framework with two complementary guidances. First, spatial-grounded textual guidance integrates object-level spatial cues with semantic prompts, aligning textual and visual structures to reduce distortion. Second, semantic-enhanced visual guidance with a multi-encoder design and semantic degradation constraints unifies multimodal semantic priors, improving perceptual realism under severe degradations. These complementary guidances are adaptively fused into the diffusion process via spatial-semantic attention, suppressing distortion and hallucination while retaining the strengths of diffusion models. Extensive experiments on multiple benchmarks show that SpaSemSR achieves a superior perception-distortion balance, producing both realistic and faithful restorations.
[137] Improving Visual Reasoning with Iterative Evidence Refinement cs.CVPDF
Zeru Shi, Kai Mei, Yihao Quan, Dimitris N. Metaxas, Ruixiang Tang
TL;DR: 本文提出了一种名为SIEVE的端到端自回顾框架,旨在提升视觉语言模型(VLMs)的视觉推理能力。该方法通过强化学习训练模型,在推理过程中自动提取并注入关键图像区域的内部嵌入表示,从而在不依赖外部图像操作(如缩放或裁剪)的情况下,实现对视觉证据的重新利用和基于图像的推理。
Details
Motivation: 当前视觉语言模型在图像推理时,通常需要依赖外部图像操作来重新获取细粒度视觉证据,这会导致额外的图像重新编码并可能干扰推理轨迹。作者认为,VLMs内部已具备识别和重用视觉证据的强信号,应直接利用这些信号来支持基于图像的推理。
Result: 在多个视觉推理基准测试上,结合感知、推理和幻觉评估,SIEVE方法带来了持续的性能提升,在多个基准上的平均性能提高了8%。
Insight: 论文的创新点在于提出了一种完全基于模型内部表示的自回顾机制,通过强化学习动态决策何时触发视觉回顾以及检索和插入哪些区域嵌入,从而避免了对外部工具调用或图像重新编码的依赖,实现了更高效、连贯的视觉推理过程。
Abstract: Vision language models (VLMs) are increasingly capable of reasoning over images, but robust visual reasoning often requires re-grounding intermediate steps in the underlying visual evidence. Recent approaches typically rely on external image operations such as zooming or cropping to re-access fine-grained details during inference, which requires additional image re-encoding and can disrupt the reasoning trajectory. We argue that VLMs already provide strong internal signals for identifying and reusing visual evidence, and that these signals can be directly leveraged to support image-grounded reasoning. Motivated by this insight, we propose an end-to-end self-revisit framework, SIEVE, that trains models to re-engage image evidence through internal representations. SIEVE automatically extracts embeddings of salient image regions and injects them into the reasoning chain when additional grounding is needed, enabling later steps to condition on relevant visual cues without external tool calls or re-encoding. We use reinforcement learning to teach the model when to trigger visual revisiting and which region embeddings to retrieve and insert during the reasoning process. Experiments on multiple visual reasoning benchmarks, together with perception, reasoning, and hallucination evaluations, show that SIEVE yields consistent gains, improving performance by 8 percent on average across several benchmarks.
[138] MER-Bench: A Comprehensive Benchmark for Multimodal Meme Reappraisal cs.CV | cs.CLPDF
Yiqi Nie, Fei Wang, Junjie Chen, Kun Li, Yudi Cai
TL;DR: 该论文提出了一个名为‘Meme Reappraisal’的新型多模态生成任务,旨在将负面情绪的模因转化为具有建设性的正面版本,同时保留其原始场景、实体和结构布局。为此,作者构建了MER-Bench基准数据集,包含细粒度的多模态标注,并设计了一个基于多模态大语言模型(MLLM)的结构化评估框架来量化生成质量。实验表明,现有图像编辑和多模态生成系统在满足结构保留、语义一致和情感转换等约束方面存在显著不足。
Details
Motivation: 受心理学中认知重评的启发,旨在解决现有模因理解或生成工作未能充分处理的、在多重语义和风格约束下进行情感可控且结构保留的多模态转换问题。
Result: 在代表性图像编辑和多模态生成系统上进行的广泛实验揭示了它们在满足结构保留、语义一致和情感转换等约束方面存在巨大差距。
Insight: 论文的创新点在于提出了‘Meme Reappraisal’这一新颖任务,并构建了首个支持该任务的综合性基准MER-Bench,其包含细粒度的多模态标注。同时,引入了一个基于MLLM的结构化评估范式,将性能分解为模态级生成质量、情感可控性、结构保真度和全局情感对齐等多个维度,为可控模因编辑和情感感知多模态生成研究奠定了基础。
Abstract: Memes represent a tightly coupled, multimodal form of social expression, in which visual context and overlaid text jointly convey nuanced affect and commentary. Inspired by cognitive reappraisal in psychology, we introduce Meme Reappraisal, a novel multimodal generation task that aims to transform negatively framed memes into constructive ones while preserving their underlying scenario, entities, and structural layout. Unlike prior works on meme understanding or generation, Meme Reappraisal requires emotion-controllable, structure-preserving multimodal transformation under multiple semantic and stylistic constraints. To support this task, we construct MER-Bench, a benchmark of real-world memes with fine-grained multimodal annotations, including source and target emotions, positively rewritten meme text, visual editing specifications, and taxonomy labels covering visual type, sentiment polarity, and layout structure. We further propose a structured evaluation framework based on a multimodal large language model (MLLM)-as-a-Judge paradigm, decomposing performance into modality-level generation quality, affect controllability, structural fidelity, and global affective alignment. Extensive experiments across representative image-editing and multimodal-generation systems reveal substantial gaps in satisfying the constraints of structural preservation, semantic consistency, and affective transformation. We believe MER-Bench establishes a foundation for research on controllable meme editing and emotion-aware multimodal generation. Our code is available at: https://github.com/one-seven17/MER-Bench.
[139] Diffusion Reinforcement Learning via Centered Reward Distillation cs.CV | cs.AI | cs.LGPDF
Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière, Vicky Kalogeiton
TL;DR: 本文提出了一种名为中心化奖励蒸馏(CRD)的扩散强化学习框架,用于对扩散模型进行基于黑盒奖励的微调。该框架通过前向过程微调,利用KL正则化奖励最大化推导而来,核心创新在于通过“提示内中心化”消除难以处理的归一化常数,从而形成一个定义良好的奖励匹配目标。
Details
Motivation: 扩散和流模型虽然达到了最先进的生成性能,但其在细粒度提示保真度、组合正确性和文本渲染等重要行为上,仅通过分数匹配或流匹配预训练目标难以充分指定。使用外部黑盒奖励进行强化学习微调是自然的补救措施,但现有的扩散RL方法往往不稳定,存在高内存成本、高方差梯度估计或分布漂移导致奖励黑客攻击等问题。
Result: 在基于GenEval和OCR奖励的文本到图像后训练实验中,CRD在奖励优化方面取得了具有竞争力的最先进结果,实现了快速收敛并减少了奖励黑客攻击,这一点在未见过的偏好指标上得到了验证。
Insight: 论文的创新点在于:1)提出了“提示内中心化”技术,将奖励匹配目标转化为一个易于处理的形式;2)引入了一系列技术来显式控制分布漂移,包括解耦采样器与移动参考以防止比率信号崩溃、使用CFG引导的预训练模型进行KL锚定以控制长期漂移并保持与预训练模型推理语义的一致性,以及采用奖励自适应的KL强度来加速早期学习并减少后期对奖励模型漏洞的利用。
Abstract: Diffusion and flow models achieve State-Of-The-Art (SOTA) generative performance, yet many practically important behaviors such as fine-grained prompt fidelity, compositional correctness, and text rendering are weakly specified by score or flow matching pretraining objectives. Reinforcement Learning (RL) fine-tuning with external, black-box rewards is a natural remedy, but diffusion RL is often brittle. Trajectory-based methods incur high memory cost and high-variance gradient estimates; forward-process approaches converge faster but can suffer from distribution drift, and hence reward hacking. In this work, we present \textbf{Centered Reward Distillation (CRD)}, a diffusion RL framework derived from KL-regularized reward maximization built on forward-process-based fine-tuning. The key insight is that the intractable normalizing constant cancels under \emph{within-prompt centering}, yielding a well-posed reward-matching objective. To enable reliable text-to-image fine-tuning, we introduce techniques that explicitly control distribution drift: (\textit{i}) decoupling the sampler from the moving reference to prevent ratio-signal collapse, (\textit{ii}) KL anchoring to a CFG-guided pretrained model to control long-run drift and align with the inference-time semantics of the pre-trained model, and (\textit{iii}) reward-adaptive KL strength to accelerate early learning under large KL regularization while reducing late-stage exploitation of reward-model loopholes. Experiments on text-to-image post-training with \texttt{GenEval} and \texttt{OCR} rewards show that CRD achieves competitive SOTA reward optimization results with fast convergence and reduced reward hacking, as validated on unseen preference metrics.
[140] DualSwinFusionSeg: Multimodal Martian Landslide Segmentation via Dual Swin Transformer with Multi-Scale Fusion and UNet++ cs.CV | cs.LGPDF
Shahriar Kabir, Abdullah Muhammed Amimul Ehsan, Istiak Ahmmed Rifti, Md Kaykobad Reza
TL;DR: 本文提出DualSwinFusionSeg模型,用于火星滑坡的多模态分割。该模型采用双Swin Transformer V2编码器分别处理RGB图像和地球物理辅助数据,通过多尺度跨模态融合和UNet++解码器进行精细边界分割,以解决多模态数据异质性和标注样本有限的问题。
Details
Motivation: 解决火星滑坡自动分割中多模态数据(如RGB图像、数字高程模型、坡度图等)在分辨率和统计特性上差异大、标注样本有限带来的挑战。
Result: 在PBVS 2026 Mars-LS Challenge的MMLSv2数据集上,开发基准测试达到0.867 mIoU和0.905 F1,留出测试集达到0.783 mIoU,展现了多模态行星表面分割的强劲性能。
Insight: 创新点在于采用模态特定的双编码器分离特征提取,并结合简单的基于拼接的多尺度融合策略,在有限训练数据下提升了分割精度;UNet++解码器的密集嵌套跳跃连接有助于保留精细边界细节。
Abstract: Automated segmentation of Martian landslides, particularly in tectonically active regions such as Valles Marineris,is important for planetary geology, hazard assessment, and future robotic exploration. However, detecting landslides from planetary imagery is challenging due to the heterogeneous nature of available sensing modalities and the limited number of labeled samples. Each observation combines RGB imagery with geophysical measurements such as digital elevation models, slope maps, thermal inertia, and contextual grayscale imagery, which differ significantly in resolution and statistical properties. To address these challenges, we propose DualSwinFusionSeg, a multimodal segmentation architecture that separates modality-specific feature extraction and performs multi-scale cross-modal fusion. The model employs two parallel Swin Transformer V2 encoders to independently process RGB and auxiliary geophysical inputs, producing hierarchical feature representations. Corresponding features from the two streams are fused at multiple scales and decoded using a UNet++ decoder with dense nested skip connections to preserve fine boundary details. Extensive ablation studies evaluate modality contributions, loss functions, decoder architectures, and fusion strategies. Experiments on the MMLSv2 dataset from the PBVS 2026 Mars-LS Challenge show that modality-specific encoders and simple concatenation-based fusion improve segmentation accuracy under limited training data. The final model achieves 0.867 mIoU and 0.905 F1 on the development benchmark and 0.783 mIoU on the held-out test set, demonstrating strong performance for multimodal planetary surface segmentation.
[141] Seeing Through the PRISM: Compound & Controllable Restoration of Scientific Images cs.CVPDF
Rupa Kurinchi-Vendhan, Pratyusha Sharma, Antonio Torralba, Sara Beery
TL;DR: 本文提出了PRISM(Precision Restoration with Interpretable Separation of Mixtures)框架,这是一个基于提示的条件扩散模型,用于解决科学和环境图像中复杂的混合退化问题。该框架能够同时处理多种重叠的退化,并通过自然语言提示实现灵活、有针对性的修复,在多个科学图像数据集上超越了现有方法。
Details
Motivation: 科学和环境图像常受到传感器与环境因素导致的复杂混合噪声影响,现有方法通常只能逐次去除单一退化,易导致级联伪影、过度校正或重要信号丢失。需要一种能同时处理复合退化并允许专家选择性去除特定失真而不损害关键特征的方法。
Result: 在显微镜、野生动物监测、遥感及城市天气等多个数据集上,PRISM在复杂复合退化任务中超越了现有SOTA基线,包括在训练中未见的零样本混合退化上表现优异。选择性修复显著提升了多个领域下游科学任务的准确性。
Insight: 创新点在于结合了针对混合退化的复合感知监督与加权对比解耦目标,在潜在空间中对齐基元及其混合,形成组合几何结构,从而实现了高保真联合去噪与基于提示的可控修复。这为优先考虑科学效用的领域提供了一个通用且可控的高保真修复框架。
Abstract: Scientific and environmental imagery often suffer from complex mixtures of noise related to the sensor and the environment. Existing restoration methods typically remove one degradation at a time, leading to cascading artifacts, overcorrection, or loss of meaningful signal. In scientific applications, restoration must be able to simultaneously handle compound degradations while allowing experts to selectively remove subsets of distortions without erasing important features. To address these challenges, we present PRISM (Precision Restoration with Interpretable Separation of Mixtures). PRISM is a prompted conditional diffusion framework which combines compound-aware supervision over mixed degradations with a weighted contrastive disentanglement objective that aligns primitives and their mixtures in the latent space. This compositional geometry enables high-fidelity joint removal of overlapping distortions while also allowing flexible, targeted fixes through natural language prompts. Across microscopy, wildlife monitoring, remote sensing, and urban weather datasets, PRISM outperforms state-of-the-art baselines on complex compound degradations, including zero-shot mixtures not seen during training. Importantly, we show that selective restoration significantly improves downstream scientific accuracy in several domains over standard “black-box” restoration. These results establish PRISM as a generalizable and controllable framework for high-fidelity restoration in domains where scientific utility is a priority.
[142] Garments2Look: A Multi-Reference Dataset for High-Fidelity Outfit-Level Virtual Try-On with Clothing and Accessories cs.CVPDF
Junyao Hu, Zhongwei Cheng, Waikeung Wong, Xingxing Zou
TL;DR: 本文介绍了Garments2Look,这是首个用于服装级虚拟试穿(VTON)的大规模多模态数据集,包含8万对多服装到单一造型的配对,涵盖40个主要类别和300多个细分子类别,旨在解决现有VTON系统在真实世界多服装、配饰和分层搭配方面的局限性。
Details
Motivation: 现有虚拟试穿技术主要关注单件服装可视化,而真实时尚涉及包含多件服装、配饰、细分类别、分层和多样造型的完整穿搭,当前VTON系统难以处理这些复杂场景,且现有数据集类别有限、缺乏穿搭多样性。
Result: 通过将最先进的VTON方法和通用图像编辑模型适配为基线,实验表明当前方法在无缝试穿完整穿搭、推断正确分层和造型方面存在困难,导致错位和伪影,凸显了任务的挑战性。
Insight: 创新点在于构建了首个大规模、多模态的服装级VTON数据集,并提出了一个结合启发式穿搭列表构建、严格自动过滤和人工验证的合成流程,以平衡真实性和多样性,为复杂穿搭试穿研究提供了基准和方向。
Abstract: Virtual try-on (VTON) has advanced single-garment visualization, yet real-world fashion centers on full outfits with multiple garments, accessories, fine-grained categories, layering, and diverse styling, remaining beyond current VTON systems. Existing datasets are category-limited and lack outfit diversity. We introduce Garments2Look, the first large-scale multimodal dataset for outfit-level VTON, comprising 80K many-garments-to-one-look pairs across 40 major categories and 300+ fine-grained subcategories. Each pair includes an outfit with 3-12 reference garment images (Average 4.48), a model image wearing the outfit, and detailed item and try-on textual annotations. To balance authenticity and diversity, we propose a synthesis pipeline. It involves heuristically constructing outfit lists before generating try-on results, with the entire process subjected to strict automated filtering and human validation to ensure data quality. To probe task difficulty, we adapt SOTA VTON methods and general-purpose image editing models to establish baselines. Results show current methods struggle to try on complete outfits seamlessly and to infer correct layering and styling, leading to misalignment and artifacts.
[143] Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models cs.CV | cs.AIPDF
Ruiying Peng, Xueyu Wu, Jing Lei, Lu Hou, Yuanzheng Ma
TL;DR: 本文研究了多模态大语言模型在扩展推理模式下感知能力下降的问题,发现其根本原因是注意力分散。作者提出了无需训练的视觉区域引导注意力框架,通过基于熵聚焦准则选择视觉头并重新加权其注意力,有效引导模型在推理过程中关注问题相关区域。
Details
Motivation: 解决多模态大语言模型在视觉问答等任务中进行多步推理时,视觉注意力分散、偏离问题相关区域,从而导致感知能力下降的问题。
Result: 在视觉语言基准测试上的大量实验表明,该方法有效缓解了感知退化,提升了视觉定位和推理准确性,并为理解MLLM如何处理视觉信息提供了可解释的见解。
Insight: 创新点在于将注意力分散识别为感知障碍的根本原因,并利用注意力熵与空间分散性的强相关性,设计了一种无需训练、基于熵聚焦准则的视觉头选择与注意力重加权框架,以引导模型在推理过程中保持对关键视觉区域的关注。
Abstract: Multimodal large language models (MLLMs) often suffer from perceptual impairments under extended reasoning modes, particularly in visual question answering (VQA) tasks. We identify attention dispersion as the underlying cause: during multi-step reasoning, the model’s visual attention becomes scattered and drifts away from question-relevant regions, effectively “losing focus” on the visual input. To better understand this phenomenon, we analyze the attention maps of MLLMs and observe that reasoning prompts significantly reduce attention to regions critical for answering the question. We further find a strong correlation between the model’s overall attention on image tokens and the spatial dispersiveness of its attention within the image. Leveraging this insight, we propose a training-free Visual Region-Guided Attention (VRGA) framework that selects visual heads based on an entropy-focus criterion and reweights their attention, effectively guiding the model to focus on question-relevant regions during reasoning. Extensive experiments on vision-language benchmarks demonstrate that our method effectively alleviates perceptual degradation, leading to improvements in visual grounding and reasoning accuracy while providing interpretable insights into how MLLMs process visual information.
[144] Joint Segmentation and Grading with Iterative Optimization for Multimodal Glaucoma Diagnosis cs.CVPDF
Zhiwei Wang, Yuxing Li, Meilu Zhu, Defeng He, Edmund Y. Lam
TL;DR: 本文提出了一种迭代多模态优化模型(IMO),用于青光眼的联合分割和分级诊断。该方法通过中层融合策略整合眼底图像和光学相干断层扫描(OCT)特征,并利用跨模态特征对齐模块减少模态差异,通过基于去噪扩散机制的迭代优化解码器逐步细化特征,实现视盘和视杯的精细分割以及准确的分级。
Details
Motivation: 现有青光眼诊断方法多依赖单一模态(如眼底或OCT),只能捕获部分病理信息,难以检测早期、细微的病变,因此需要一种能有效融合多模态信息的方法。
Result: 广泛的实验表明,该方法能有效整合多模态特征,为青光眼评估提供全面且具有临床意义的方法。
Insight: 创新点在于提出了一个迭代多模态优化框架,结合了跨模态特征对齐和基于去噪扩散的迭代细化机制,实现了分割与分级任务的联合优化,提升了早期青光眼诊断的准确性。
Abstract: Accurate diagnosis of glaucoma is challenging, as early-stage changes are subtle and often lack clear structural or appearance cues. Most existing approaches rely on a single modality, such as fundus or optical coherence tomography (OCT), capturing only partial pathological information and often missing early disease progression. In this paper, we propose an iterative multimodal optimization model (IMO) for joint segmentation and grading. IMO integrates fundus and OCT features through a mid-level fusion strategy, enhanced by a cross-modal feature alignment (CMFA) module to reduce modality discrepancies. An iterative refinement decoder progressively optimizes the multimodal features through a denoising diffusion mechanism, enabling fine-grained segmentation of the optic disc and cup while supporting accurate glaucoma grading. Extensive experiments show that our method effectively integrates multimodal features, providing a comprehensive and clinically significant approach to glaucoma assessment. Source codes are available at https://github.com/warren-wzw/IMO.git.
[145] Walking Further: Semantic-aware Multimodal Gait Recognition Under Long-Range Conditions cs.CV | cs.AIPDF
Zhiyang Lu, Wen Jiang, Tianren Wu, Zhichao Wang, Changwang Zhang
TL;DR: 本文提出了LRGait,首个用于长距离步态识别的LiDAR-相机多模态基准数据集,并设计了EMGaitNet端到端框架,通过语义引导的融合管道(包括语义挖掘、对齐和对称交叉注意力融合模块)来弥合RGB图像与点云之间的模态差异,以提升在真实世界长距离和跨距离场景下的步态识别鲁棒性。
Details
Motivation: 现有步态识别方法大多局限于短距离、单模态设置,难以泛化到真实世界中的长距离和跨距离场景,因此需要构建多模态基准并开发能够有效融合2D视觉轮廓和3D几何特征的鲁棒识别框架。
Result: 在多个步态数据集上的广泛实验验证了该方法的有效性,但摘要未具体说明定量结果(如准确率)或与SOTA的比较细节。
Insight: 创新点包括:引入语义引导的融合管道(CLIP-based语义挖掘和语义引导对齐)来对齐多模态特征,以及对称交叉注意力融合模块分层整合视觉和几何信息;从客观角度看,将语义先验与多模态融合结合,为长距离步态识别提供了新的解决方案。
Abstract: Gait recognition is an emerging biometric technology that enables non-intrusive and hard-to-spoof human identification. However, most existing methods are confined to short-range, unimodal settings and fail to generalize to long-range and cross-distance scenarios under real-world conditions. To address this gap, we present \textbf{LRGait}, the first LiDAR-Camera multimodal benchmark designed for robust long-range gait recognition across diverse outdoor distances and environments. We further propose \textbf{EMGaitNet}, an end-to-end framework tailored for long-range multimodal gait recognition. To bridge the modality gap between RGB images and point clouds, we introduce a semantic-guided fusion pipeline. A CLIP-based Semantic Mining (SeMi) module first extracts human body-part-aware semantic cues, which are then employed to align 2D and 3D features via a Semantic-Guided Alignment (SGA) module within a unified embedding space. A Symmetric Cross-Attention Fusion (SCAF) module hierarchically integrates visual contours and 3D geometric features, and a Spatio-Temporal (ST) module captures global gait dynamics. Extensive experiments on various gait datasets validate the effectiveness of our method.
[146] DualTSR: Unified Dual-Diffusion Transformer for Scene Text Image Super-Resolution cs.CV | cs.AIPDF
Axi Niu, Kang Zhang, Qingsen Yan, Hao Jin, Jinqiu Sun
TL;DR: 本文提出DualTSR,一个统一的端到端框架,用于场景文本图像超分辨率(STISR)。该框架采用单一的多模态Transformer主干,通过双扩散目标进行训练,同时建模高分辨率图像的连续分布和文本内容的离散分布,从而在无需外部OCR模型的情况下,内部推断文本先验。
Details
Motivation: 现有STISR方法通常依赖外部OCR模型获取文本先验,或采用复杂的多组件架构,导致训练和复现困难。本文旨在解决这两个问题,提出一个更简单、统一的端到端框架。
Result: 在合成的中文基准测试和精心设计的真实世界评估协议上,DualTSR在感知质量和文本保真度方面表现出色。
Insight: 创新点在于使用单一Transformer主干结合双扩散目标(条件流匹配和离散扩散),实现了视觉与文本信息在每一层的交互,从而内部推断文本先验,简化了架构并避免了对外部OCR的依赖。
Abstract: Scene Text Image Super-Resolution (STISR) aims to restore high-resolution details in low-resolution text images, which is crucial for both human readability and machine recognition. Existing methods, however, often depend on external Optical Character Recognition (OCR) models for textual priors or rely on complex multi-component architectures that are difficult to train and reproduce. In this paper, we introduce DualTSR, a unified end-to-end framework that addresses both issues. DualTSR employs a single multimodal transformer backbone trained with a dual diffusion objective. It simultaneously models the continuous distribution of high-resolution images via Conditional Flow Matching and the discrete distribution of textual content via discrete diffusion. This shared design enables visual and textual information to interact at every layer, allowing the model to infer text priors internally instead of relying on an external OCR module. Compared with prior multi-branch diffusion systems, DualTSR offers a simpler end-to-end formulation with fewer hand-crafted components. Experiments on synthetic Chinese benchmarks and a curated real-world evaluation protocol show that DualTSR achieves strong perceptual quality and text fidelity.
[147] UniFusion: A Unified Image Fusion Framework with Robust Representation and Source-Aware Preservation cs.CV | cs.AIPDF
Xingyuan Li, Songcheng Du, Yang Zou, HaoYuan Xu, Zhiying Jiang
TL;DR: UniFusion是一个统一的图像融合框架,旨在通过鲁棒表示和源感知保留,解决现有方法局限于特定任务和源信息退化的问题。它利用DINOv3提取模态一致特征,引入重建对齐损失保持输入输出一致性,并采用双层优化策略平衡重建与融合目标,实现了跨任务泛化。
Details
Motivation: 现有图像融合方法多为任务特定设计(如多模态、多曝光、多焦点融合),且在融合过程中难以有效保留源信息,这源于任务特定架构和深层传播导致的信息退化。本文旨在克服这些限制,提出一个统一的框架以实现跨任务泛化。
Result: 在多个融合任务上的广泛实验表明,UniFusion在视觉质量、泛化能力和对真实场景的适应性方面均表现出优越性。
Insight: 创新点包括:利用DINOv3建立共享语义空间以实现模态一致特征提取;引入重建对齐损失以源感知方式保留输入信息;采用双层优化策略解耦并联合优化重建与融合目标,有效平衡其耦合关系并确保平滑收敛。从客观角度看,该框架将通用视觉基础模型(DINOv3)与任务无关的优化机制结合,为统一多任务图像处理提供了可借鉴的思路。
Abstract: Image fusion aims to integrate complementary information from multiple source images to produce a more informative and visually consistent representation, benefiting both human perception and downstream vision tasks. Despite recent progress, most existing fusion methods are designed for specific tasks (i.e., multi-modal, multi-exposure, or multi-focus fusion) and struggle to effectively preserve source information during the fusion process. This limitation primarily arises from task-specific architectures and the degradation of source information caused by deep-layer propagation. To overcome these issues, we propose UniFusion, a unified image fusion framework designed to achieve cross-task generalization. First, leveraging DINOv3 for modality-consistent feature extraction, UniFusion establishes a shared semantic space for diverse inputs. Second, to preserve the understanding of each source image, we introduce a reconstruction-alignment loss to maintain consistency between fused outputs and inputs. Finally, we employ a bilevel optimization strategy to decouple and jointly optimize reconstruction and fusion objectives, effectively balancing their coupling relationship and ensuring smooth convergence. Extensive experiments across multiple fusion tasks demonstrate UniFusion’s superior visual quality, generalization ability, and adaptability to real-world scenarios. Code is available at https://github.com/dusongcheng/UniFusion.
[148] Safety-Potential Pruning for Enhancing Safety Prompts Against VLM Jailbreaking Without Retraining cs.CVPDF
Chongxin Li, Hanzhang Wang, Lian Duan
TL;DR: 本文提出了一种名为Safety-Potential Pruning的一次性剪枝框架,旨在增强视觉语言模型(VLM)中安全提示(safety prompts)的防御能力,以对抗越狱攻击。该方法通过移除对安全提示响应较弱的权重,在不进行额外重新训练的情况下,放大与安全相关的激活,从而激活并强化模型中固有的、结构上独特的安全执行通路。
Details
Motivation: 安全提示是VLM中抵御越狱攻击的一种可解释防御层,但其有效性受限于模型潜在的、未被充分激发的结构响应性。作者观察到安全提示总是激活一组稀疏的参数,而这些参数在良性使用中基本保持静默,这促使他们提出了安全子网络假说:VLM中嵌入了结构上不同的、能够强制执行安全性的通路,但这些通路在没有明确刺激的情况下处于休眠状态。
Result: 在三种代表性VLM架构和三个越狱基准测试上,该方法相对于仅使用提示,将攻击成功率降低了高达22%,同时保持了强大的良性任务性能。
Insight: 论文宣称的创新点在于将剪枝重新定义为一种结构干预手段,而不仅仅是模型压缩技术,用以显化与对齐相关的子网络,为增强越狱攻击的鲁棒性提供了新路径。从客观角度看,其核心洞察是识别并利用VLM中固有的、对安全敏感的稀疏子网络,通过一次性剪枝进行选择性强化,这是一种高效且无需重新训练的安全增强方法。
Abstract: Safety prompts constitute an interpretable layer of defense against jailbreak attacks in vision-language models (VLMs); however, their efficacy is constrained by the models’ latent structural responsiveness. We observe that such prompts consistently engage a sparse set of parameters that remain largely quiescent during benign use. This finding motivates the Safety Subnetwork Hypothesis: VLMs embed structurally distinct pathways capable of enforcing safety, but these pathways remain dormant without explicit stimulation. To expose and amplify these pathways, we introduce Safety-Potential Pruning, a one-shot pruning framework that amplifies safety-relevant activations by removing weights that are less responsive to safety prompts without additional retraining. Across three representative VLM architectures and three jailbreak benchmarks, our method reduces attack success rates by up to 22% relative to prompting alone, all while maintaining strong benign performance. These findings frame pruning not only as a model compression technique, but as a structural intervention to emerge alignment-relevant subnets, offering a new path to robust jailbreak resistance.
[149] Not All Directions Matter: Toward Structured and Task-Aware Low-Rank Adaptation cs.CVPDF
Xi Xiao, Chenrui Ma, Yunbei Zhang, Chen Liu, Zhuxuanzi Wang
TL;DR: 本文提出了StructLoRA,一个解决低秩适应(LoRA)中语义漂移和结构不连贯问题的框架。它通过一个基于信息瓶颈的过滤器来剪枝任务无关的更新方向,以及一个轻量级的、基于图的协调器来强制层间一致性,从而在训练期间优化信息质量和结构完整性,且不增加推理成本。
Details
Motivation: LoRA作为参数高效微调的基石,其有效性受到两个基本限制的阻碍:一是将所有更新方向同等对待导致的语义漂移,二是各层独立适应导致的结构不连贯,这共同造成了次优且不协调的更新。
Result: 在大型语言模型、视觉语言模型和视觉模型上的广泛实验表明,StructLoRA始终能建立新的最先进性能,不仅优于原始LoRA,也优于先进的动态秩分配和基于稀疏性的方法,特别是在具有挑战性的低秩和低数据场景中优势明显。
Insight: 核心创新点在于其双组件设计:信息瓶颈引导的过滤器用于缓解语义漂移,以及训练专用的图协调器用于解决结构不连贯。这标志着PEFT的研究重点从单纯的参数压缩,转向了对信息质量和结构完整性的更全面优化。
Abstract: Low-Rank Adaptation (LoRA) has become a cornerstone of parameter-efficient fine-tuning (PEFT). Yet, its efficacy is hampered by two fundamental limitations: semantic drift, by treating all update directions with equal importance, and structural incoherence, from adapting layers independently, resulting in suboptimal, uncoordinated updates. To remedy these, we propose StructLoRA, a framework that addresses both limitations through a principled, dual-component design: (1) an Information Bottleneck-guided filter that prunes task-irrelevant directions to mitigate semantic drift, and (2) a lightweight, training-only graph-based coordinator that enforces inter-layer consistency to resolve structural incoherence. Extensive experiments across large language model , vision language model, and vision model (including LLaMA, LLaVA, and ViT) demonstrate that StructLoRA consistently establishes a new state-of-the-art, outperforming not only vanilla LoRA but also advanced dynamic rank allocation and sparsity-based methods. Notably, the benefits are particularly pronounced in challenging low-rank and low-data regimes. Crucially, since our proposed modules operate only during training, StructLoRA enhances performance with zero additional inference cost, advancing the focus of PEFT – from mere parameter compression to a more holistic optimization of information quality and structural integrity.
[150] S2GS: Streaming Semantic Gaussian Splatting for Online Scene Understanding and Reconstruction cs.CVPDF
Renhe Zhang, Yuyang Tan, Jingyu Gong, Zhizhong Zhang, Lizhuang Ma
TL;DR: 本文提出了一种名为S2GS的流式语义高斯溅射方法,用于在线场景理解与重建。该方法采用因果增量建模,无需重处理历史帧,通过几何与语义解耦的双主干设计,实现了对长图像序列的高效处理,显著提升了运行时和GPU内存的可扩展性。
Details
Motivation: 现有离线前馈方法在处理长图像序列时,需对不断增长的历史观测进行全局计算,导致运行时和内存随序列长度快速增加,限制了可扩展性。S2GS旨在实现严格因果、增量的在线联合重建与理解,避免未来帧依赖和历史帧重处理。
Result: 在联合重建与理解基准测试中,S2GS匹配或超越了强离线基线方法;在可扩展性方面,S2GS能处理1000+帧,运行时和GPU内存增长缓慢,而离线全局处理方法在相同设置下约80帧即内存耗尽。
Insight: 创新点包括几何与语义解耦的双主干设计,其中几何分支进行因果建模驱动高斯增量更新,语义分支利用2D基础视觉模型和查询驱动解码器预测分割掩码和身份嵌入,并通过查询级对比对齐和轻量级在线关联实例记忆进行稳定。从客观角度看,该方法在在线流式处理长序列场景中实现了高效且可扩展的联合重建与理解,具有实际应用潜力。
Abstract: Existing offline feed-forward methods for joint scene understanding and reconstruction on long image streams often repeatedly perform global computation over an ever-growing set of past observations, causing runtime and GPU memory to increase rapidly with sequence length and limiting scalability. We propose Streaming Semantic Gaussian Splatting (S2GS), a strictly causal, incremental 3D Gaussian semantic field framework: it does not leverage future frames and continuously updates scene geometry, appearance, and instance-level semantics without reprocessing historical frames, enabling scalable online joint reconstruction and understanding. S2GS adopts a geometry-semantic decoupled dual-backbone design: the geometry branch performs causal modeling to drive incremental Gaussian updates, while the semantic branch leverages a 2D foundation vision model and a query-driven decoder to predict segmentation masks and identity embeddings, further stabilized by query-level contrastive alignment and lightweight online association with an instance memory. Experiments show that S2GS matches or outperforms strong offline baselines on joint reconstruction-and-understanding benchmarks, while significantly improving long-horizon scalability: it processes 1,000+ frames with much slower growth in runtime and GPU memory, whereas offline global-processing baselines typically run out of memory at around 80 frames under the same setting.
[151] FOCUS: Bridging Fine-Grained Recognition and Open-World Discovery across Domains cs.CVPDF
Vaibhav Rathore, Divyam Gupta, Moloud Abdar, Subhasis Chaudhuri, Biplab Banerjee
TL;DR: 本文提出了首个用于细粒度领域泛化广义类别发现(FG-DG-GCD)的统一框架FoCUS,旨在解决在领域偏移下,仅利用标注的源域数据,在未见过的无标注目标域中同时识别已知类别和发现新类别的挑战。该工作还构建了首个FG-DG-GCD基准数据集,并通过实验验证了FoCUS框架在细粒度任务上的优越性能与计算效率。
Details
Motivation: 动机是解决传统广义类别发现(GCD)假设标注与未标注数据同分布的限制,提出在存在领域偏移的真实场景下,实现细粒度级别的开放世界发现,这是一个更具挑战性的任务。
Result: 在提出的细粒度基准(基于CUB-200-2011、Stanford Cars和FGVC-Aircraft构建的绘画和素描域)上,FoCUS在聚类准确率上分别比GCD、FG-GCD和DG-GCD基线高出3.28%、9.68%和2.07%。同时,在粗粒度DG-GCD任务上保持竞争力,且计算效率比当前SOTA方法提升近3倍。
Insight: 宣称的创新点包括:1) 首次定义了FG-DG-GCD问题并构建了相应基准;2) 提出了单阶段框架FoCUS,其核心是结合了用于几何稳定部件推理的领域一致部件发现(DCPD)和通过不确定性引导扰动进行置信度校准特征正则化的不确定性感知特征增强(UFA)。从客观角度看,将细粒度识别、领域泛化和开放世界发现三者统一是一个重要的研究方向,而通过部件级对齐和不确定性建模来提升泛化与发现能力是有效的技术路径。
Abstract: We introduce the first unified framework for Fine-Grained Domain-Generalized Generalized Category Discovery (FG-DG-GCD), bringing open-world recognition closer to real-world deployment under domain shift. Unlike conventional GCD, which assumes labeled and unlabeled data come from the same distribution, DG-GCD learns only from labeled source data and must both recognize known classes and discover novel ones in unseen, unlabeled target domains. This problem is especially challenging in fine-grained settings, where subtle inter-class differences and large intra-class variation make domain generalization significantly harder. To support systematic evaluation, we establish the first FG-DG-GCD benchmarks by creating identity-preserving painting and sketch domains for CUB-200-2011, Stanford Cars, and FGVC-Aircraft using controlled diffusion-adapter stylization. On top of this ,we propose FoCUS, a single-stage framework that combines Domain-Consistent Parts Discovery (DCPD) for geometry-stable part reasoning with Uncertainty-Aware Feature Augmentation (UFA) for confidence-calibrated feature regularization through uncertainty-guided perturbations. Extensive experiments show that FoCUS outperforms strong GCD, FG-GCD, and DG-GCD baselines by 3.28%, 9.68%, and 2.07%, respectively, in clustering accuracy on the proposed benchmarks. It also remains competitive on coarse-grained DG-GCD tasks while achieving nearly 3x higher computational efficiency than the current state of the art. ^[Code and datasets will be released upon acceptance.]
[152] CamLit: Unified Video Diffusion with Explicit Camera and Lighting Control cs.CVPDF
Zhiyi Kuang, Chengan He, Egor Zakharov, Yuxuan Xue, Shunsuke Saito
TL;DR: CamLit是首个统一视频扩散模型,能够从单张输入图像联合执行新视角合成(NVS)和重光照任务。给定参考图像、用户定义的相机轨迹和环境贴图,该模型可在指定光照下从新视角生成场景视频,并在单一生成过程中输出时间一致且空间对齐的重光照新视角帧和对应反照率帧。
Details
Motivation: 解决现有方法在视频生成中难以统一控制相机姿态和光照的问题,旨在简化视频生成流程,实现从单图像到多视角光照可控视频的联合生成。
Result: 定性和定量实验表明,CamLit在新视角合成和重光照任务上均达到与最先进方法相当的高保真度输出,且未牺牲任一任务的视觉质量。
Insight: 创新点在于将相机控制与光照控制整合到单一扩散模型中,通过联合生成过程实现时空一致性输出;客观分析显示,该统一框架简化了多任务处理流程,在保持竞争性能的同时提升了生成效率与真实感一致性。
Abstract: We present CamLit, the first unified video diffusion model that jointly performs novel view synthesis (NVS) and relighting from a single input image. Given one reference image, a user-defined camera trajectory, and an environment map, CamLit synthesizes a video of the scene from new viewpoints under the specified illumination. Within a single generative process, our model produces temporally coherent and spatially aligned outputs, including relit novel-view frames and corresponding albedo frames, enabling high-quality control of both camera pose and lighting. Qualitative and quantitative experiments demonstrate that CamLit achieves high-fidelity outputs on par with state-of-the-art methods in both novel view synthesis and relighting, without sacrificing visual quality in either task. We show that a single generative model can effectively integrate camera and lighting control, simplifying the video generation pipeline while maintaining competitive performance and consistent realism.
[153] OAHuman: Occlusion-Aware 3D Human Reconstruction from Monocular Images cs.CVPDF
Yuanwang Yang, Hongliang Liu, Muxin Zhang, Nan Ma, Jingyu Yang
TL;DR: OAHuman提出了一种遮挡感知的单目图像3D人体重建框架,通过解耦几何重建与纹理合成来解决遮挡导致的几何缺失和纹理不可靠问题,从而在遮挡条件下实现鲁棒且高保真的人体模型重建。
Details
Motivation: 解决真实场景中因物体、人或图像截断导致的频繁遮挡问题,这些遮挡会导致几何缺失和外观线索不可靠,严重降低重建模型的完整性和真实感。
Result: 在遮挡丰富的基准测试中,OAHuman在结构完整性、表面细节和纹理真实感方面表现出优越性能,显著提升了遮挡条件下的单目3D人体重建效果。
Insight: 核心创新在于解耦感知范式,通过分离几何重建和纹理合成,避免遮挡区域中几何与纹理的交叉污染,确保几何重建在遮挡区域仍能感知增强,而纹理合成仅从可见区域学习,防止纹理错误传播到遮挡区域。
Abstract: Monocular 3D human reconstruction in real-world scenarios remains highly challenging due to frequent occlusions from surrounding objects, people, or image truncation. Such occlusions lead to missing geometry and unreliable appearance cues, severely degrading the completeness and realism of reconstructed human models. Although recent neural implicit methods achieve impressive results on clean inputs, they struggle under occlusion due to entangled modeling of shape and texture. In this paper, we propose OAHuman, an occlusion-aware framework that explicitly decouples geometry reconstruction and texture synthesis for robust 3D human modeling from a single RGB image. The core innovation lies in the decoupling-perception paradigm, which addresses the fundamental issue of geometry-texture cross-contamination in occluded regions. Our framework ensures that geometry reconstruction is perceptually reinforced even in occluded areas, isolating it from texture interference. In parallel, texture synthesis is learned exclusively from visible regions, preventing texture errors from being transferred to the occluded areas. This decoupling approach enables OAHuman to achieve robust and high-fidelity reconstruction under occlusion, which has been a long-standing challenge in the field. Extensive experiments on occlusion-rich benchmarks demonstrate that OAHuman achieves superior performance in terms of structural completeness, surface detail, and texture realism, significantly improving monocular 3D human reconstruction under occlusion conditions.
[154] MistExit: Learning to Exit for Early Mistake Detection in Procedural Videos cs.CVPDF
Sagnik Majumder, Anish Nethi, Ziad Al-Halah, Kristen Grauman
TL;DR: 本文提出了一种名为MistExit的方法,用于在流式视频中尽早检测程序性活动中的关键步骤是否执行错误,该方法结合了错误检测器和强化学习策略,旨在以尽可能少的视频观察量实现准确判断。
Details
Motivation: 解决在程序性视频中早期错误检测的问题,即如何在观察最少视频帧的情况下,实时判断关键步骤是否正确执行。
Result: 在多个真实世界程序性视频数据集上,MistExit模型在错误检测准确率上优于现有最先进模型,同时减少了所需观察的视频比例。
Insight: 创新点在于将错误检测器(能处理近期帧并预测未来视觉特征)与基于强化学习的自适应退出策略相结合,实现了早期可靠检测与计算效率的平衡;可借鉴的是这种融合时序预测与决策优化的框架,适用于实时视频分析任务。
Abstract: We introduce the task of early mistake detection in video, where the goal is to determine whether a keystep in a procedural activity is performed correctly while observing as little of the streaming video as possible. To tackle this problem, we propose a method comprising a mistake detector and a reinforcement learning policy. At each timestep, the detector processes recently observed frames to estimate the keystep’s correctness while anticipating future visual features, enabling reliable early mistake estimates. Meanwhile, the policy aggregates the detector outputs and visual observations over time and adaptively decides when to exit (i.e., stop processing incoming frames) while producing the final prediction. Using diverse real-world procedural video datasets, we demonstrate that our MistExit model achieves superior mistake detection accuracy while reducing the fraction of video observed compared to state-of-the-art models. Project: https://vision.cs.utexas.edu/projects/mist_exit.
[155] DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization cs.CV | cs.AI | cs.MM | cs.SDPDF
Ngoc-Son Nguyen, Thanh V. T. Tran, Jeongsoo Choi, Hieu-Nghia Huynh-Nguyen, Truong-Son Hy
TL;DR: 本文提出DiFlowDubber,一种基于离散流匹配的两阶段训练框架,用于视频配音任务。该方法通过FaPro模块从面部表情中提取全局韵律和风格线索来指导语音生成,并引入Synchronizer模块来桥接文本、视频和语音之间的模态鸿沟,确保语音与唇部运动的精确同步。
Details
Motivation: 现有视频配音方法要么直接在有限数据集上训练,要么采用两阶段流程适配预训练的文本到语音模型,但往往难以生成富有表现力的韵律、丰富的声学特征和精确的同步效果。
Result: 在两个主流基准数据集上的实验表明,DiFlowDubber在多项指标上超越了先前的方法。
Insight: 创新点在于设计了FaPro模块从面部表情中提取韵律和风格信息来引导语音属性建模,以及Synchronizer模块通过桥接多模态鸿沟来提升跨模态对齐和唇音同步精度;其两阶段训练框架结合离散流匹配生成主干,有效实现了从预训练TTS模型到视频驱动配音的知识迁移。
Abstract: Video dubbing has broad applications in filmmaking, multimedia creation, and assistive speech technology. Existing approaches either train directly on limited dubbing datasets or adopt a two-stage pipeline that adapts pre-trained text-to-speech (TTS) models, which often struggle to produce expressive prosody, rich acoustic characteristics, and precise synchronization. To address these issues, we propose DiFlowDubber with a novel two-stage training framework that effectively transfers knowledge from a pre-trained TTS model to video-driven dubbing, with a discrete flow matching generative backbone. Specifically, we design a FaPro module that captures global prosody and stylistic cues from facial expressions and leverages this information to guide the modeling of subsequent speech attributes. To ensure precise speech-lip synchronization, we introduce a Synchronizer module that bridges the modality gap among text, video, and speech, thereby improving cross-modal alignment and generating speech that is temporally synchronized with lip movements. Experiments on two primary benchmark datasets demonstrate that DiFlowDubber outperforms previous methods across multiple metrics.
[156] Toward Clinically Ready Foundation Models in Medical Image Analysis: Adaptation Mechanisms and Deployment Trade-offs cs.CVPDF
Karma Phuntsho, Abdullah, Kyungmi Lee, Ickjai Lee, Euijoon Ahn
TL;DR: 这篇综述论文提出了一个以策略为中心的框架,用于指导医学图像分析中基础模型的适应过程,将适应机制分为参数、表示、目标、数据和架构/序列级五类,并分析了其在临床部署中的权衡。
Details
Motivation: 现有研究主要关注基础模型的架构进展和应用范围,而缺乏对适应机制及其对鲁棒性、校准和监管可行性影响的系统化分析,因此需要为临床部署提供实用指导。
Result: 论文综合了分类、分割和检测任务的证据,强调适应策略如何影响临床相关的失败模式,而不仅仅是聚合基准性能。
Insight: 创新点在于将适应重新定义为临床约束下受控表示变化的过程,并提供了一个结构化框架来权衡适应深度、标签效率、领域鲁棒性、计算成本、可审计性和监管负担,以设计稳健、可审计且兼容临床部署的系统。
Abstract: Foundation models (FMs) have demonstrated strong transferability across medical imaging tasks, yet their clinical utility depends critically on how pretrained representations are adapted to domain-specific data, supervision regimes, and deployment constraints. Prior surveys primarily emphasize architectural advances and application coverage, while the mechanisms of adaptation and their implications for robustness, calibration, and regulatory feasibility remain insufficiently structured. This review introduces a strategy-centric framework for FM adaptation in medical image analysis (MIA). We conceptualize adaptation as a post-pretraining intervention and organize existing approaches into five mechanisms: parameter-, representation-, objective-, data-centric, and architectural/sequence-level adaptation. For each mechanism, we analyze trade-offs in adaptation depth, label efficiency, domain robustness, computational cost, auditability, and regulatory burden. We synthesize evidence across classification, segmentation, and detection tasks, highlighting how adaptation strategies influence clinically relevant failure modes rather than only aggregate benchmark performance. Finally, we examine how adaptation choices interact with validation protocols, calibration stability, multi-institutional deployment, and regulatory oversight. By reframing adaptation as a process of controlled representational change under clinical constraints, this review provides practical guidance for designing FM-based systems that are robust, auditable, and compatible with clinical deployment.
[157] All-day Multi-scenes Lifelong Vision-and-Language Navigation with Tucker Adaptation cs.CV | cs.AIPDF
Xudong Wang, Gan Li, Zhiyu Liu, Yao Wang, Lianqing Liu
TL;DR: 本文提出了一个名为AML-VLN的全天候多场景终身视觉与语言导航问题,旨在解决智能体在适应新场景时产生的灾难性遗忘问题。为此,作者提出了Tucker Adaptation(TuKA)方法,将导航知识表示为高阶张量并通过Tucker分解解耦为共享子空间和场景特定专家,并开发了AlldayWalker智能体进行持续学习。实验表明,该方法在多个导航场景中持续优于现有最先进基线。
Details
Motivation: 部署视觉与语言导航智能体需要适应多样化场景和环境,但在特定场景上进行微调通常会导致对其他场景的灾难性遗忘,这严重限制了灵活长期部署。本文旨在解决这一挑战。
Result: 广泛的实验表明,基于TuKA构建的AlldayWalker智能体在持续学习多个导航场景时,始终优于最先进的基线方法。
Insight: 创新点在于将终身VLN问题形式化为AML-VLN,并提出TuKA方法,利用高阶张量表示和Tucker分解来解耦多层次的导航知识,实现共享知识与场景特定知识的分离与增量学习,从而有效缓解灾难性遗忘。
Abstract: Deploying vision-and-language navigation (VLN) agents requires adaptation across diverse scenes and environments, but fine-tuning on a specific scenario often causes catastrophic forgetting in others, which severely limits flexible long-term deployment. We formalize this challenge as the all-day multi-scenes lifelong VLN (AML-VLN) problem. Existing parameter-efficient adapters (e.g., LoRA and its variants) are limited by their two-dimensional matrix form, which fails to capture the multi-hierarchical navigation knowledge spanning multiple scenes and environments. To address this, we propose Tucker Adaptation (TuKA), which represents the multi-hierarchical navigation knowledge as a high-order tensor and leverages Tucker decomposition to decouple the knowledge into shared subspaces and scenario-specific experts. We further introduce a decoupled knowledge incremental learning strategy to consolidate shared subspaces while constraining specific experts for decoupled lifelong learning. Building on TuKA, we also develop a VLN agent named AlldayWalker, which continually learns across multiple navigation scenarios, achieving all-day multi-scenes navigation. Extensive experiments show that AlldayWalker consistently outperforms state-of-the-art baselines.
[158] DC-ViT: Modulating Spatial and Channel Interactions for Multi-Channel Images cs.CVPDF
Umar Marikkar, Syed Sameed Husain, Muhammad Awais, Sara Atito
TL;DR: 本文提出DC-ViT(解耦视觉Transformer),一种用于多通道图像(MCI)的新方法。它通过解耦自注意力机制,将token更新分解为空间更新和通道更新两个互补路径,以解决现有MC-ViT方法中跨通道无限制交互导致特征稀释的问题,并引入解耦聚合来学习任务特定的通道重要性。
Details
Motivation: 多通道成像中,由于染色方案、传感器类型和采集设置不同导致的异构通道配置,使得固定通道编码器的应用受限。现有MC-ViT方法允许灵活通道输入,但在统一注意力空间中所有通道的patch token联合编码,可能导致跨通道无限制的token交互,稀释特征并削弱对MCI数据中至关重要的通道特定语义的保留能力。
Result: 在三个MCI基准测试上的广泛实验表明,DC-ViT相比现有MC-ViT方法取得了持续的性能提升。
Insight: 核心创新点是提出解耦自注意力(DSA),将token更新明确分解为建模通道内结构的空间更新和自适应整合跨通道信息的通道更新,从而在缓解信息坍缩的同时允许选择性的跨通道交互。此外,解耦聚合(DAG)机制能学习任务特定的通道重要性,进一步利用了增强的通道特定表示。这种方法为处理MCI中的通道异构性提供了更精细的交互控制。
Abstract: Training and evaluation in multi-channel imaging (MCI) remains challenging due to heterogeneous channel configurations arising from varying staining protocols, sensor types, and acquisition settings. This heterogeneity limits the applicability of fixed-channel encoders commonly used in general computer vision. Recent Multi-Channel Vision Transformers (MC-ViTs) address this by enabling flexible channel inputs, typically by jointly encoding patch tokens from all channels within a unified attention space. However, unrestricted token interactions across channels can lead to feature dilution, reducing the ability to preserve channel-specific semantics that are critical in MCI data. To address this, we propose Decoupled Vision Transformer (DC-ViT), which explicitly regulates information sharing using Decoupled Self-Attention (DSA), which decomposes token updates into two complementary pathways: spatial updates that model intra-channel structure, and channel-wise updates that adaptively integrate cross-channel information. This decoupling mitigates informational collapse while allowing selective inter-channel interaction. To further exploit these enhanced channel-specific representations, we introduce Decoupled Aggregation (DAG), which allows the model to learn task-specific channel importances. Extensive experiments across three MCI benchmarks demonstrate consistent improvements over existing MC-ViT approaches.
[159] Seeking Physics in Diffusion Noise cs.CV | cs.AI | cs.LG | cs.ROPDF
Chujun Tang, Lei Zhong, Fangqiang Ding
TL;DR: 本文探究了视频扩散模型是否编码了物理合理性的预测信号,发现预训练扩散变换器(DiT)的中间去噪表示中,物理合理与不合理的视频在特征空间部分可分。基于此,作者提出了一种推理时策略——渐进轨迹选择,通过轻量级物理验证器在中间检查点评分并行去噪轨迹并早期剪枝低分候选,从而在提升物理一致性的同时降低推理成本。
Details
Motivation: 动机是探究视频扩散模型是否隐含物理合理性信号,并利用这些信号在推理时提升生成视频的物理一致性,同时减少计算开销。
Result: 在PhyGenBench基准上的大量实验表明,该方法提高了物理一致性,并显著减少了去噪步骤,达到了与Best-of-K采样相当的结果。
Insight: 创新点在于发现冻结的DiT特征中存在可恢复的物理相关线索,并据此提出了一个高效的推理时轨迹选择策略,实现了性能与效率的平衡。
Abstract: Do video diffusion models encode signals predictive of physical plausibility? We probe intermediate denoising representations of a pretrained Diffusion Transformer (DiT) and find that physically plausible and implausible videos are partially separable in mid-layer feature space across noise levels. This separability cannot be fully attributed to visual quality or generator identity, suggesting recoverable physics-related cues in frozen DiT features. Leveraging this observation, we introduce progressive trajectory selection, an inference-time strategy that scores parallel denoising trajectories at a few intermediate checkpoints using a lightweight physics verifier trained on frozen features, and prunes low-scoring candidates early. Extensive experiments on PhyGenBench demonstrate that our method improves physical consistency while reducing inference cost, achieving comparable results to Best-of-K sampling with substantially fewer denoising steps.
[160] RL-ScanIQA: Reinforcement-Learned Scanpaths for Blind 360°Image Quality Assessment cs.CVPDF
Yujia Wang, Yuyan Li, Jiuming Liu, Fang-Lue Zhang, Xinhu Zheng
TL;DR: 本文提出RL-ScanIQA,一种基于强化学习的盲360°图像质量评估框架,通过PPO训练的扫描路径策略和质量评估器进行端到端优化,利用多级奖励和失真空间增强提升性能,在三个基准测试中实现了优异的域内性能和跨数据集泛化能力。
Details
Motivation: 现有基于扫描路径的360°图像质量评估方法将扫描路径生成与质量评估分离,无法进行端到端优化和任务对齐的探索,本文旨在解决这一局限性。
Result: 在三个基准测试上的广泛实验表明,RL-ScanIQA实现了优异的域内性能和跨数据集泛化能力。
Insight: 创新点包括:将扫描路径生成与质量评估统一为端到端的强化学习框架,引入多级奖励(如扫描路径多样性和赤道偏向先验)以提高训练稳定性,并采用失真空间增强和排序一致性损失来提升跨数据集鲁棒性。
Abstract: Blind 360°image quality assessment (IQA) aims to predict perceptual quality for panoramic images without a pristine reference. Unlike conventional planar images, 360°content in immersive environments restricts viewers to a limited viewport at any moment, making viewing behaviors critical to quality perception. Although existing scanpath-based approaches have attempted to model viewing behaviors by approximating the human view-then-rate paradigm, they treat scanpath generation and quality assessment as separate steps, preventing end-to-end optimization and task-aligned exploration. To address this limitation, we propose RL-ScanIQA, a reinforcement-learned framework for blind 360°IQA. RL-ScanIQA optimize a PPO-trained scanpath policy and a quality assessor, where the policy receives quality-driven feedback to learn task-relevant viewing strategies. To improve training stability and prevent mode collapse, we design multi-level rewards, including scanpath diversity and equator-biased priors. We further boost cross-dataset robustness using distortion-space augmentation together with rank-consistent losses that preserve intra-image and inter-image quality orderings. Extensive experiments on three benchmarks show that RL-ScanIQA achieves superior in-dataset performance and cross-dataset generalization. Codes are available at https://github.com/wangyuji1/RLScanIQA.git.
[161] Show Me When and Where: Towards Referring Video Object Segmentation in the Wild cs.CVPDF
Mingqi Gao, Jinyu Yang, Jingnan Luo, Xiantong Zhen, Jungong Han
TL;DR: 本文提出了一种新的‘野外’参照视频对象分割(RVOS)设置,旨在解决现有简化设置中文本所指对象始终出现在所有帧内、无法反映现实挑战的问题。为此,作者构建了一个基于YouTube未剪辑视频的新基准数据集YoURVOS,包含1,120个野外视频,其时长和场景是现有数据集的7倍,要求方法不仅能定位对象位置,还能判断其出现时间。为应对挑战,作者提出了对象级多模态变换器(OMFormer),通过编码对象级多模态交互实现高效全局时空定位。实验表明,先前VOS方法在YoURVOS上表现不佳,尤其在目标缺失帧增加时,而OMFormer表现稳健。
Details
Motivation: 现有RVOS设置使用精心修剪的视频,文本所指对象始终出现在所有帧中,未能充分反映现实任务的挑战性,简化了问题(只需预测‘哪里’而无需‘何时’)。本文旨在推动RVOS向更现实的‘野外’设置发展,要求方法同时处理对象出现的时间和位置。
Result: 在提出的YoURVOS基准数据集上评估,先前VOS方法表现不佳,尤其随着目标缺失帧增加性能下降,而提出的OMFormer方法在所有情况下均表现良好,为这一新设置设定了基线。
Insight: 创新点包括引入更现实的‘野外’RVOS设置和对应的YoURVOS数据集,挑战方法同时处理‘何时’和‘哪里’;提出的OMFormer通过对象级多模态交互编码实现高效全局时空定位,可借鉴其多模态融合和对象级建模思路用于视频理解任务。
Abstract: Referring video object segmentation (RVOS) has recently generated great popularity in computer vision due to its widespread applications. Existing RVOS setting contains elaborately trimmed videos, with text-referred objects always appearing in all frames, which however fail to fully reflect the realistic challenges of this task. This simplified setting requires RVOS methods to only predict where objects, with no need to show when the objects appear. In this work, we introduce a new setting towards in-the-wild RVOS. To this end, we collect a new benchmark dataset using Youtube Untrimmed videos for RVOS - YoURVOS, which contains 1,120 in-the-wild videos with 7 times more duration and scenes than existing datasets. Our new benchmark challenges RVOS methods to show not only where but also when objects appear in videos. To set a baseline, we propose Object-level Multimodal TransFormers (OMFormer) to tackle the challenges, which are characterized by encoding object-level multimodal interactions for efficient and global spatial-temporal localisation. We demonstrate that previous VOS methods struggle on our YoURVOS benchmark, especially with the increase of target-absent frames, while our OMFormer consistently performs well. Our YoURVOS dataset offers an imperative benchmark, which will push forward the advancement of RVOS methods for practical applications.
[162] Direct Object-Level Reconstruction via Probabilistic Gaussian Splatting cs.CVPDF
Shuai Guo, Ao Guo, Junchao Zhao, Qi Chen, Yuxiang Qi
TL;DR: 本文提出了一种基于概率高斯泼溅的高效单物体3D重建方法。该方法通过将前景-背景概率线索直接集成到高斯图元中,并在训练期间动态修剪低概率高斯,从而专注于目标物体,显著提升了内存和计算效率。
Details
Motivation: 现有基于高斯泼溅的物体级重建方法通常依赖全场景重建,引入了大量冗余的背景信息,导致计算和存储开销增加。本文旨在解决这一效率瓶颈。
Result: 在MIP-360、T&T和NVOS数据集上的实验表明,该方法在存在掩码错误时表现出强大的自校正能力,重建质量与标准3DGS方法相当,同时所需的高斯图元数量仅为后者的约1/10。
Insight: 主要创新点包括:1)将YOLO和SAM生成的概率掩码用于监督高斯属性,用连续概率值替代二值掩码以缓解边界模糊;2)提出双阶段过滤策略抑制背景高斯;3)利用渲染的概率掩码进行反向监督以增强多视角边界一致性。该方法为需要高保真和高计算效率的应用提供了新思路。
Abstract: Object-level 3D reconstruction play important roles across domains such as cultural heritage digitization, industrial manufacturing, and virtual reality. However, existing Gaussian Splatting-based approaches generally rely on full-scene reconstruction, in which substantial redundant background information is introduced, leading to increased computational and storage overhead. To address this limitation, we propose an efficient single-object 3D reconstruction method based on 2D Gaussian Splatting. By directly integrating foreground-background probability cues into Gaussian primitives and dynamically pruning low-probability Gaussians during training, the proposed method fundamentally focuses on an object of interest and improves the memory and computational efficiency. Our pipeline leverages probability masks generated by YOLO and SAM to supervise probabilistic Gaussian attributes, replacing binary masks with continuous probability values to mitigate boundary ambiguity. Additionally, we propose a dual-stage filtering strategy for training’s startup to suppress background Gaussians. And, during training, rendered probability masks are conversely employed to refine supervision and enhance boundary consistency across views. Experiments conducted on the MIP-360, T&T, and NVOS datasets demonstrate that our method exhibits strong self-correction capability in the presence of mask errors and achieves reconstruction quality comparable to standard 3DGS approaches, while requiring only approximately 1/10 of their Gaussian amount. These results validate the efficiency and robustness of our method for single-object reconstruction and highlight its potential for applications requiring both high fidelity and computational efficiency.
[163] Early Failure Detection and Intervention in Video Diffusion Models cs.CVPDF
Kwon Byung-Ki, Sohwi Lim, Nam Hyeon-Woo, Moon Ye-Bin, Tae-Hyun Oh
TL;DR: 本文提出了一种用于潜在文本到视频(T2V)扩散模型的早期失败检测和诊断干预流程。该方法通过实时检查模块将潜在表示转换为中间视频预览,以利用成熟的文本-视频对齐评分器在RGB空间进行检测,并在预测到失败时触发分层早期退出干预,从而减少试错重生成的计算成本。
Details
Motivation: 解决T2V扩散模型在推理过程中因采样不确定性导致的生成失败问题,如文本-视频对齐度低或感知质量差,避免因试错重生成而产生的高计算开销。
Result: 在CogVideoX-5B和Wan2.1-1.3B模型上的实验表明,该方法在VBench基准上获得了一致性提升,时间开销相比事后重生成最多减少2.64倍,并可推广到更高容量的Wan2.1-14B模型(720p分辨率,81帧生成)。
Insight: 创新点在于设计了高效的实时检查模块,将潜在空间检测问题转换到RGB空间,利用现有评估器实现早期失败检测;提出了分层早期退出干预流程,证明了失败信号在去噪早期即可被检测,且该方法与提示词优化和采样引导等技术正交兼容,具有即插即用特性。
Abstract: Text-to-video (T2V) diffusion models have rapidly advanced, yet generations still occasionally fail in practice, such as low text-video alignment or low perceptual quality. Since diffusion sampling is non-deterministic, it is difficult to know during inference whether a generation will succeed or fail, incurring high computational cost due to trial-and-error regeneration. To address this, we propose an early failure detection and diagnostic intervention pipeline for latent T2V diffusion models. For detection, we design a Real-time Inspection (RI) module that converts latents into intermediate video previews, enabling the use of established text-video alignment scorers for inspection in the RGB space. The RI module completes the conversion and inspection process in just 39.2ms. This is highly efficient considering that CogVideoX-5B requires 4.3s per denoising step when generating a 480p, 49-frame video on an NVIDIA A100 GPU. Subsequently, we trigger a hierarchical and early-exit intervention pipeline only when failure is predicted. Experiments on CogVideoX-5B and Wan2.1-1.3B demonstrate consistency gains on VBench with up to 2.64 times less time overhead compared to post-hoc regeneration. Our method also generalizes to a higher-capacity setting, remaining effective on Wan2.1-14B with 720p resolution and 81-frame generation. Furthermore, our pipeline is plug-and-play and orthogonal to existing techniques, showing seamless compatibility with prompt refinement and sampling guidance methods. We also provide evidence that failure signals emerge early in the denoising process and are detectable within intermediate video previews using standard vision-language evaluators.
[164] How Do Medical MLLMs Fail? A Study on Visual Grounding in Medical Images cs.CV | cs.AIPDF
Guimeng Liu, Tianze Yu, Somayeh Ebrahimkhani, Lin Zhi Zheng Shawn, Kok Pin Ng
TL;DR: 本文系统研究了医疗多模态大语言模型在医学图像理解中视觉定位能力不足的问题,提出了专门评估数据集VGMED和定量指标,并设计了一种无需额外训练的推理时注意力优化方法VGRefine,在多个医学VQA基准上取得了SOTA性能。
Details
Motivation: 现有通用MLLMs在医学任务(尤其是零样本场景)中表现不佳,但对其失败原因缺乏深入理解,特别是视觉定位能力在医学图像分析中的关键作用尚未被系统研究。
Result: 在8个SOTA医疗MLLMs上的实验验证了其视觉定位能力普遍不足;提出的VGRefine方法在6个多样化医学VQA基准(涵盖8种成像模态的超过11万样本)上实现了SOTA性能。
Insight: 首次系统验证了视觉定位不足是医疗MLLMs性能不佳的关键因素;通过专家指导构建的VGMED数据集能有效解耦视觉定位与语义定位;提出的推理时注意力优化方法VGRefine简单有效且无需额外训练。
Abstract: Generalist multimodal large language models (MLLMs) have achieved impressive performance across a wide range of vision-language tasks. However, their performance on medical tasks, particularly in zero-shot settings where generalization is critical, remains suboptimal. A key research gap is the limited understanding of why medical MLLMs underperform in medical image interpretation. In this work, we present a pioneering systematic investigation into the visual grounding capabilities of state-of-the-art medical MLLMs. To disentangle visual grounding from semantic grounding, we design VGMED, a novel evaluation dataset developed with expert clinical guidance, explicitly assessing the visual grounding capability of medical MLLMs. We introduce new quantitative metrics and conduct detailed qualitative analyses. Our study across eight state-of-the-art (SOTA) medical MLLMs validates that they often fail to ground their predictions in clinically relevant image regions. We note that this finding is specific to medical image analysis; in contrast, prior work has shown that MLLMs are capable of grounding their predictions in the correct image regions when applied to natural scene images. Motivated by these findings, we propose VGRefine, a simple yet effective inference-time method that refines attention distribution to improve visual grounding in medical settings. Our approach achieves SOTA performance across 6 diverse Med-VQA benchmarks (over 110K VQA samples from 8 imaging modalities) without requiring additional training or external expert models. Overall, our work, for the first time, systematically validates inadequate visual grounding as one of the key contributing factors for medical MLLMs’ under-performance. Additional experiments are included in the Supp.
[165] AvatarForcing: One-Step Streaming Talking Avatars via Local-Future Sliding-Window Denoising cs.CVPDF
Liyuan Cui, Wentao Hu, Wenyuan Zhang, Zesong Yang, Fan Shi
TL;DR: 本文提出AvatarForcing,一种用于实时说话头像生成的一步流式扩散框架。该方法通过局部-未来滑动窗口去噪,在恒定每步成本下每次迭代生成一个干净的图像块,解决了长序列生成中的曝光偏差和计算开销问题。
Details
Motivation: 实时说话头像生成需要低延迟和分钟级的时间稳定性。自回归方法存在曝光偏差导致错误累积,而全序列扩散变换器计算成本过高,无法满足实时长序列合成的需求。
Result: 在标准基准测试和一个新的400视频长序列基准测试上,使用13亿参数的学生模型实现了34毫秒/帧的实时流式推理,在视觉质量和唇部同步方面表现出色。
Insight: 创新点在于提出了局部-未来滑动窗口去噪机制和双锚点时间强制技术(风格锚点和时间锚点),以及通过两阶段流式蒸馏(离线ODE回填和分布匹配)实现实时一步推理,有效平衡了生成质量、时间一致性和计算效率。
Abstract: Real-time talking avatar generation requires low latency and minute-level temporal stability. Autoregressive (AR) forcing enables streaming inference but suffers from exposure bias, which causes errors to accumulate and become irreversible over long rollouts. In contrast, full-sequence diffusion transformers mitigate drift but remain computationally prohibitive for real-time long-form synthesis. We present AvatarForcing, a one-step streaming diffusion framework that denoises a fixed local-future window with heterogeneous noise levels and emits one clean block per step under constant per-step cost. To stabilize unbounded streams, the method introduces dual-anchor temporal forcing: a style anchor that re-indexes RoPE to maintain a fixed relative position with respect to the active window and applies anchor-audio zero-padding, and a temporal anchor that reuses recently emitted clean blocks to ensure smooth transitions. Real-time one-step inference is enabled by two-stage streaming distillation with offline ODE backfill and distribution matching. Experiments on standard benchmarks and a new 400-video long-form benchmark show strong visual quality and lip synchronization at 34 ms/frame using a 1.3B-parameter student model for realtime streaming. Our page is available at: https://cuiliyuan121.github.io/AvatarForcing/
[166] UAVBench and UAVIT-1M: Benchmarking and Enhancing MLLMs for Low-Altitude UAV Vision-Language Understanding cs.CVPDF
Yang Zhan, Yuan Yuan
TL;DR: 该论文针对多模态大语言模型在低空无人机视觉理解方面的不足,提出了UAVBench基准测试和UAVIT-1M指令调优数据集,用于评估和提升MLLMs在低空视觉语言任务中的能力。
Details
Motivation: 现有MLLMs在自然图像和卫星遥感图像上表现良好,但在理解低空无人机场景方面存在挑战,且缺乏全面的评估基准和训练数据来满足实际应用需求。
Result: 对11个SOTA MLLMs在UAVBench上的分析表明,开源模型在低空视觉内容对话准确性上落后于闭源模型;而在UAVIT-1M上微调开源模型能显著缩小这一差距。
Insight: 创新点在于构建了首个专注于低空无人机场景的大规模、多任务、高质量基准测试和指令数据集,并通过实验验证了数据驱动的微调能有效提升模型在该领域的性能,为MLLMs在低空无人机实际应用中的落地提供了关键资源和方法。
Abstract: Multimodal Large Language Models (MLLMs) have made significant strides in natural images and satellite remote sensing images. However, understanding low-altitude drone scenarios remains a challenge. Existing datasets primarily focus on a few specific low-altitude visual tasks, which cannot fully assess the ability of MLLMs in real-world low-altitude UAV applications. Therefore, we introduce UAVBench, a comprehensive benchmark, and UAVIT-1M, a large-scale instruction tuning dataset, designed to evaluate and improve MLLMs’ abilities in low-altitude vision-language tasks. UAVBench comprises 43 test units and 966k high-quality data samples across 10 tasks at the image-level and region-level. UAVIT-1M consists of approximately 1.24 million diverse instructions, covering 789k multi-scene images and about 2,000 types of spatial resolutions with 11 distinct tasks. UAVBench and UAVIT-1M feature pure real-world visual images and rich weather conditions, and involve manual verification to ensure high quality. Our in-depth analysis of 11 state-of-the-art MLLMs using UAVBench reveals that open-source MLLMs cannot generate accurate conversations about low-altitude visual content, lagging behind closed-source MLLMs. Extensive experiments demonstrate that fine-tuning open-source MLLMs on UAVIT-1M significantly addresses this gap. Our contributions pave the way for bridging the gap between current MLLMs and low-altitude UAV real-world application demands. (Project page: https://UAVBench.github.io/)
[167] On the Nature of Attention Sink that Shapes Decoding Strategy in MLLMs cs.CVPDF
Suho Yoo, Youngjoon Jang, Joon Son Chung
TL;DR: 本文研究了多模态大语言模型中注意力汇聚现象的本质及其对解码策略的影响,发现汇聚token编码了结构化全局信息,并据此提出了轻量级推理时策略OutRo,通过特征对齐和放宽因果约束来增强上下文表示,从而提升模型性能。
Details
Motivation: 动机在于理解Transformer架构中观察到的注意力汇聚现象(即某些token吸引过多注意力)的本质及其如何影响模型推理行为,而非将其视为偶然产物。
Result: 在七个视频问答基准测试上,OutRo策略在代表性MLLMs中一致提升了性能,表现出强大的泛化能力,且仅带来1.1倍的解码开销。
Insight: 创新点在于揭示了注意力汇聚token编码全局信息并影响解码过程,并据此设计了一种无需额外前向传播或注意力图访问的轻量级推理时优化策略,通过特征对齐和放宽因果约束来增强模型推理。
Abstract: Large language models and their multimodal extensions have achieved remarkable success across diverse tasks, yet the internal mechanisms that govern their reasoning behaviour remain partially understood. In particular, the attention sink, a token that attracts disproportionate attention mass, has been observed in transformer architectures, but its role is still unclear. Our goal is to understand what attention sinks represent and how they shape model behaviour during inference, rather than considering them as incidental artifacts. Through our analysis, we find that attention sink representations encode structured global information that influences the decoding process. Building on our findings, we introduce OutRo, a lightweight inference-time strategy that leverages the sink token to enhance contextual representations: (i) non-sink token representations are aligned with the sink representation in the feature space; and (ii) the sink token is allowed to attend beyond the causal constraint, facilitating information exchange with non-sink tokens. This design enhances the reasoning process without requiring additional forward passes or access to attention maps. Based on extensive experiments, OutRo consistently improves performance across representative MLLMs on seven video QA benchmarks and demonstrates strong generalisation, while incurring only a 1.1x decoding overhead.
[168] AgroNVILA: Perception-Reasoning Decoupling for Multi-view Agricultural Multimodal Large Language Models cs.CV | cs.AIPDF
Jiarui Zhang, Junqi Hu, Zurong Mai, Yuhang Chen, Shuohong Lou
TL;DR: 本文提出了AgroNVILA,一种用于农业多模态推理的多模态大语言模型。其核心创新在于感知-推理解耦架构,并基于新构建的大规模多视角农业数据集AgroOmni进行训练。该模型通过视图条件元网络注入宏观空间上下文以解决尺度模糊问题,并利用农业感知相对策略优化来对齐专家农业逻辑。
Details
Motivation: 现有MLLMs存在显著的’陆地中心’偏见,在处理从地面特写到无人机及卫星图像等多尺度农业空间信息时,会导致尺度混淆和逻辑漂移,无法满足复杂农业规划的需求。
Result: 在多项实验中,AgroNVILA超越了现有最先进的MLLMs,在多海拔农业推理任务上取得了显著提升(+15.18%),展现了其在整体农业空间规划方面的强大能力。
Insight: 主要创新点包括:1)感知-推理解耦架构,将视觉感知与逻辑推理分离处理;2)视图条件元网络,以低计算成本注入宏观空间上下文解决尺度歧义;3)农业感知相对策略优化,利用强化学习使模型决策与专家知识对齐,避免统计捷径。
Abstract: Agricultural multimodal reasoning requires robust spatial understanding across varying scales, from ground-level close-ups to top-down UAV and satellite imagery. Existing Multi-modal Large Language Models (MLLMs) suffer from a significant “terrestrial-centric” bias, causing scale confusion and logic drift during complex agricultural planning. To address this, we introduce the first large-scale AgroOmni (288K), a multi-view training corpus designed to capture diverse spatial topologies and scales in modern precision agriculture. Built on this dataset, we propose AgroNVILA, an MLLM that utilizes a novel Perception-Reasoning Decoupling (PRD) architecture. On the perception side, we incorporate a View-Conditioned Meta-Net (VCMN), which injects macroscopic spatial context into visual tokens, resolving scale ambiguities with minimal computational overhead. On the reasoning side, Agriculture-aware Relative Policy Optimization (ARPO) leverages reinforcement learning to align the model’s decision-making with expert agricultural logic, preventing statistical shortcuts. Extensive experiments demonstrate that AgroNVILA outperforms state-of-the-art MLLMs, achieving significant improvements (+15.18%) in multi-altitude agricultural reasoning, reflecting its robust capability for holistic agricultural spatial planning.
[169] BROTHER: Behavioral Recognition Optimized Through Heterogeneous Ensemble Regularization for Ambivalence and Hesitancy cs.CVPDF
Alexandre Pereira, Bruno Fernandes, Pablo Barros
TL;DR: 本文提出了一种名为BROTHER的、用于识别视频中矛盾与犹豫(A/H)复杂行为状态的高度正则化多模态融合框架。该框架从视觉、声学和语言数据中提取鲁棒的单模态特征,并引入专门设计的统计文本模态来捕捉时序语音变化和行为线索。通过评估15种不同的模态组合,并利用粒子群优化(PSO)硬投票集成方法融合异构模型,有效避免了过拟合。
Details
Motivation: 在自然视频中识别矛盾与犹豫(A/H)这类复杂行为状态是情感计算中的一个重大挑战,因为A/H表现为微妙的多模态冲突,需要深度的上下文和时序理解。
Result: 在未见测试集上,该高度正则化的PSO集成方法(lambda = 0.2)取得了0.7465的峰值宏F1分数,表明其有效利用了多模态协同作用。
Insight: 论文的创新点在于将矛盾与犹豫视为一种多模态冲突,并通过引入专门的统计文本模态、基于验证损失的模型选择策略,以及结合训练-验证差距惩罚的PSO硬投票集成,构建了一个鲁棒的、用于野外行为分析的智能加权委员会框架。
Abstract: Recognizing complex behavioral states such as Ambivalence and Hesitancy (A/H) in naturalistic video settings remains a significant challenge in affective computing. Unlike basic facial expressions, A/H manifests as subtle, multimodal conflicts that require deep contextual and temporal understanding. In this paper, we propose a highly regularized, multimodal fusion pipeline to predict A/H at the video level. We extract robust unimodal features from visual, acoustic, and linguistic data, introducing a specialized statistical text modality explicitly designed to capture temporal speech variations and behavioral cues. To identify the most effective representations, we evaluate 15 distinct modality combinations across a committee of machine learning classifiers (MLP, Random Forest, and GBDT), selecting the most well-calibrated models based on validation Binary Cross-Entropy (BCE) loss. Furthermore, to optimally fuse these heterogeneous models without overfitting to the training distribution, we implement a Particle Swarm Optimization (PSO) hard-voting ensemble. The PSO fitness function dynamically incorporates a train-validation gap penalty (lambda) to actively suppress redundant or overfitted classifiers. Our comprehensive evaluation demonstrates that while linguistic features serve as the strongest independent predictor of A/H, our heavily regularized PSO ensemble (lambda = 0.2) effectively harnesses multimodal synergies, achieving a peak Macro F1-score of 0.7465 on the unseen test set. These results emphasize that treating ambivalence and hesitancy as a multimodal conflict, evaluated through an intelligently weighted committee, provides a robust framework for in-the-wild behavioral analysis.
[170] AerialVLA: A Vision-Language-Action Model for UAV Navigation via Minimalist End-to-End Control cs.CV | cs.AI | cs.ROPDF
Peng Xu, Zhengnan Deng, Jiayan Deng, Zonghua Gu, Shaohua Wan
TL;DR: 本文提出了AerialVLA,一个用于无人机视觉语言导航的极简端到端视觉-语言-动作模型。它通过双视角感知策略和模糊方向提示机制,将原始视觉观测和模糊语言指令直接映射为连续物理控制信号,无需依赖密集的Oracle引导或外部物体检测器。
Details
Motivation: 解决现有无人机视觉语言导航方法依赖层次化架构、密集Oracle引导或辅助物体检测器,导致语义鸿沟并限制真正自主性的问题。
Result: 在TravelUAV基准测试上,AerialVLA在已见环境中取得了最先进的性能,在未见场景中实现了接近领先基线三倍的成功率,展现出卓越的泛化能力。
Insight: 创新点在于:1)极简的端到端框架直接映射感知到控制;2)双视角感知策略减少冗余并保留关键导航线索;3)仅依赖机载传感器的模糊方向提示机制,实现真正自主;4)统一的控制空间整合了连续3自由度运动指令和内在着陆信号,无需外部检测器即可实现精准着陆。这验证了以自主为中心的简约范式比复杂的模块化系统能学习到更鲁棒的视觉-运动表征。
Abstract: Vision-Language Navigation (VLN) for Unmanned Aerial Vehicles (UAVs) demands complex visual interpretation and continuous control in dynamic 3D environments. Existing hierarchical approaches rely on dense oracle guidance or auxiliary object detectors, creating semantic gaps and limiting genuine autonomy. We propose AerialVLA, a minimalist end-to-end Vision-Language-Action framework mapping raw visual observations and fuzzy linguistic instructions directly to continuous physical control signals. First, we introduce a streamlined dual-view perception strategy that reduces visual redundancy while preserving essential cues for forward navigation and precise grounding, which additionally facilitates future simulation-to-reality transfer. To reclaim genuine autonomy, we deploy a fuzzy directional prompting mechanism derived solely from onboard sensors, completely eliminating the dependency on dense oracle guidance. Ultimately, we formulate a unified control space that integrates continuous 3-Degree-of-Freedom (3-DoF) kinematic commands with an intrinsic landing signal, freeing the agent from external object detectors for precision landing. Extensive experiments on the TravelUAV benchmark demonstrate that AerialVLA achieves state-of-the-art performance in seen environments. Furthermore, it exhibits superior generalization in unseen scenarios by achieving nearly three times the success rate of leading baselines, validating that a minimalist, autonomy-centric paradigm captures more robust visual-motor representations than complex modular systems.
[171] HomeGuard: VLM-based Embodied Safeguard for Identifying Contextual Risk in Household Task cs.CVPDF
Xiaoya Lu, Yijin Zhou, Zeren Chen, Ruocheng Wang, Bingrui Sima
TL;DR: 本文提出了HomeGuard,一种基于视觉语言模型(VLM)的具身智能体安全保障框架,旨在识别家庭任务中的上下文安全风险。该方法采用架构无关的设计,引入了上下文引导的思维链(CG-CoT)机制,通过主动感知和语义判断来分解风险评估,并利用强化微调进行训练。实验表明,HomeGuard显著提升了安全性,同时减少了过度安全警报。
Details
Motivation: 现有具身智能体在执行复杂指令时,尽管依赖视觉语言模型,但仍易受上下文安全风险的影响(即无害指令因微妙的环境状态而变得危险)。基于规则的方法在物体密集场景中缺乏可扩展性,而基于模型的方法依赖提示工程,存在感知不聚焦、导致风险遗漏或幻觉的问题,因此需要一种更有效的安全保障机制。
Result: 实验证明,HomeGuard在风险匹配率上相比基础模型提升了超过30%,同时减少了过度安全(oversafety)问题。该模型不仅用于危险检测,其生成的视觉锚点还可作为下游规划器的可操作空间约束,以促进显式碰撞避免和安全轨迹生成。
Insight: 主要创新点包括:1)提出架构无关的上下文引导思维链(CG-CoT)机制,将风险评估分解为主动感知(锚定交互目标和相关空间区域)和基于视觉证据的语义判断;2)通过精心策划的接地数据集和两阶段训练策略(利用强化微调及过程奖励)来强制精确的中间接地;3)生成的视觉锚点不仅用于风险识别,还可直接指导下游安全规划,增强了系统的实用性和可解释性。
Abstract: Vision-Language Models (VLMs) empower embodied agents to execute complex instructions, yet they remain vulnerable to contextual safety risks where benign commands become hazardous due to subtle environmental states. Existing safeguards often prove inadequate. Rule-based methods lack scalability in object-dense scenes, whereas model-based approaches relying on prompt engineering suffer from unfocused perception, resulting in missed risks or hallucinations. To address this, we propose an architecture-agnostic safeguard featuring Context-Guided Chain-of-Thought (CG-CoT). This mechanism decomposes risk assessment into active perception that sequentially anchors attention to interaction targets and relevant spatial neighborhoods, followed by semantic judgment based on this visual evidence. We support this approach with a curated grounding dataset and a two-stage training strategy utilizing Reinforcement Fine-Tuning (RFT) with process rewards to enforce precise intermediate grounding. Experiments demonstrate that our model HomeGuard significantly enhances safety, improving risk match rates by over 30% compared to base models while reducing oversafety. Beyond hazard detection, the generated visual anchors serve as actionable spatial constraints for downstream planners, facilitating explicit collision avoidance and safety trajectory generation. Code and data are released under https://github.com/AI45Lab/HomeGuard
[172] The Pulse of Motion: Measuring Physical Frame Rate from Visual Dynamics cs.CV | cs.AIPDF
Xiangbo Gao, Mingyang Wu, Siyuan Yang, Jiongze Yu, Pardis Taghavi
TL;DR: 这篇论文指出当前生成视频模型存在时间尺度模糊的问题,即生成的视频运动速度不稳定且不可控,作者称之为“计时幻觉”。为了解决这个问题,论文提出了Visual Chronometer方法,直接从视频的视觉动态中预测其物理帧率(PhyFPS),并建立了两个基准测试来量化该问题。实验表明,最先进的视频生成器存在严重的PhyFPS错位问题,而应用PhyFPS校正可以显著提高AI生成视频的自然度。
Details
Motivation: 当前生成视频模型虽然视觉上流畅,但由于训练时对不同真实速度的视频进行了无差别的标准化帧率处理,导致模型缺乏可靠的内在运动节拍,无法将运动锚定在一致的真实世界时间尺度上,产生了时间模糊性。
Result: 论文在提出的PhyFPS-Bench-Real和PhyFPS-Bench-Gen两个基准上进行了评估,结果表明最先进的视频生成器存在严重的PhyFPS错位和时间不稳定性。应用PhyFPS校正后,AI生成视频的人类感知自然度得到了显著提升。
Insight: 论文的核心创新在于提出了直接从视觉动态中估计物理帧率(PhyFPS)的方法Visual Chronometer,绕过了不可靠的元数据,并首次系统性地定义和量化了生成视频中的“计时幻觉”问题。这为评估和改善视频生成模型的时间一致性提供了新的视角和工具。
Abstract: While recent generative video models have achieved remarkable visual realism and are being explored as world models, true physical simulation requires mastering both space and time. Current models can produce visually smooth kinematics, yet they lack a reliable internal motion pulse to ground these motions in a consistent, real-world time scale. This temporal ambiguity stems from the common practice of indiscriminately training on videos with vastly different real-world speeds, forcing them into standardized frame rates. This leads to what we term chronometric hallucination: generated sequences exhibit ambiguous, unstable, and uncontrollable physical motion speeds. To address this, we propose Visual Chronometer, a predictor that recovers the Physical Frames Per Second (PhyFPS) directly from the visual dynamics of an input video. Trained via controlled temporal resampling, our method estimates the true temporal scale implied by the motion itself, bypassing unreliable metadata. To systematically quantify this issue, we establish two benchmarks, PhyFPS-Bench-Real and PhyFPS-Bench-Gen. Our evaluations reveal a harsh reality: state-of-the-art video generators suffer from severe PhyFPS misalignment and temporal instability. Finally, we demonstrate that applying PhyFPS corrections significantly improves the human-perceived naturalness of AI-generated videos. Our project page is https://xiangbogaobarry.github.io/Visual_Chronometer/.
[173] LoCAtion: Long-time Collaborative Attention Framework for High Dynamic Range Video Reconstruction cs.CVPDF
Qianyu Zhang, Bolun Zheng, Lingyu Zhu, Aiai Huang, Zongpeng Li
TL;DR: 本文提出LoCAtion框架,通过长时协作注意力机制将HDR视频重建从脆弱的空间对齐融合范式重构为无需对齐的协作特征路由问题,显著提升了动态场景下的重建质量和时间稳定性。
Details
Motivation: 现有HDR视频重建方法严重依赖精确的空间对齐,在动态场景中易因配准误差导致重影和闪烁,本文旨在摆脱这一脆弱范式,构建更鲁棒的解决方案。
Result: 在广泛实验中,LoCAtion在视觉质量和时间稳定性上达到了最先进水平,并在精度与计算效率之间取得了极具竞争力的平衡。
Insight: 核心创新在于将任务解耦为基于中等曝光骨干的场景锚定和协作特征路由,并引入全局序列求解器进行长程时间建模,从而无需显式对齐即可实现视频一致性。
Abstract: Prevailing High Dynamic Range (HDR) video reconstruction methods are fundamentally trapped in a fragile alignment-and-fusion paradigm. While explicit spatial alignment can successfully recover fine details in controlled environments, it becomes a severe bottleneck in unconstrained dynamic scenes. By forcing rigid alignment across unpredictable motions and varying exposures, these methods inevitably translate registration errors into severe ghosting artifacts and temporal flickering. In this paper, we rethink this conventional prerequisite. Recognizing that explicit alignment is inherently vulnerable to real-world complexities, we propose LoCAtion, a Long-time Collaborative Attention framework that reformulates HDR video generation from a fragile spatial warping task into a robust, alignment-free collaborative feature routing problem. Guided by this new formulation, our architecture explicitly decouples the highly entangled reconstruction task. Rather than struggling to rigidly warp neighboring frames, we anchor the scene on a continuous medium-exposure backbone and utilize collaborative attention to dynamically harvest and inject reliable irradiance cues from unaligned exposures. Furthermore, we introduce a learned global sequence solver. By leveraging bidirectional context and long-range temporal modeling, it propagates corrective signals and structural features across the entire sequence, inherently enforcing whole-video coherence and eliminating jitter. Extensive experiments demonstrate that LoCAtion achieves state-of-the-art visual quality and temporal stability, offering a highly competitive balance between accuracy and computational efficiency.
[174] StAR: Segment Anything Reasoner cs.CVPDF
Seokju Yun, Dongheon Lee, Noori Bae, Jaesung Jun, Chanseul Cho
TL;DR: 本文提出了Segment Anything Reasoner (StAR)框架,通过优化参数调整方案、奖励函数、学习策略和答案格式等多个设计维度,显著提升了基础模型在视觉推理分割任务上的性能。作者还首次将并行测试时扩展引入分割任务,并构建了ReasonSeg-X基准数据集以支持更深入、系统的评估。仅使用5k训练样本,StAR在多个基准测试中均取得了显著优于基线的结果。
Details
Motivation: 当前AI系统需要处理复杂现实环境中的隐式查询和图像整体推理以定位目标,但现有推理分割方法未能充分激发基础模型的视觉推理能力。
Result: 在仅使用5k训练样本的情况下,StAR在广泛的基准测试中相比其基础模型取得了显著提升。作者构建的ReasonSeg-X数据集为先进方法提供了系统、细粒度的评估基准。
Insight: 主要创新点包括:1) 从多个角度(参数调整、奖励函数、学习策略、答案格式)全面优化推理分割的设计空间;2) 首次将并行测试时扩展技术应用于分割任务以进一步提升性能边界;3) 提出rollout-expanded selective-tuning训练方法以激活基础模型的潜在推理能力;4) 构建了定义紧凑推理类型并包含深度推理样本的ReasonSeg-X基准数据集。
Abstract: As AI systems are being integrated more rapidly into diverse and complex real-world environments, the ability to perform holistic reasoning over an implicit query and an image to localize a target is becoming increasingly important. However, recent reasoning segmentation methods fail to sufficiently elicit the visual reasoning capabilities of the base mode. In this work, we present Segment Anything Reasoner (StAR), a comprehensive framework that refines the design space from multiple perspectives-including parameter-tuning scheme, reward functions, learning strategies and answer format-and achieves substantial improvements over recent baselines. In addition, for the first time, we successfully introduce parallel test-time scaling to the segmentation task, pushing the performance boundary even further. To extend the scope and depth of reasoning covered by existing benchmark, we also construct the ReasonSeg-X, which compactly defines reasoning types and includes samples that require deeper reasoning. Leveraging this dataset, we train StAR with a rollout-expanded selective-tuning approach to activate the base model’s latent reasoning capabilities, and establish a rigorous benchmark for systematic, fine-grained evaluation of advanced methods. With only 5k training samples, StAR achieves significant gains over its base counterparts across extensive benchmarks, demonstrating that our method effectively brings dormant reasoning competence to the surface.
[175] GenState-AI: State-Aware Dataset for Text-to-Video Retrieval on AI-Generated Videos cs.CV | cs.IR | cs.MMPDF
Minghan Li, Tongna Chen, Tianrui Lv, Yishuai Zhang, Suchao An
TL;DR: 本文提出了GenState-AI,一个专注于可控状态转换的AI生成视频文本检索基准数据集。该数据集通过为主视频配对仅最终状态不同的时序困难负样本和内容替换的语义困难负样本,旨在精细诊断超越外观匹配的时序与语义混淆问题。
Details
Motivation: 现有文本到视频检索基准主要基于真实世界视频,其语义常可从单帧推断,导致对时序推理和明确最终状态定位能力的评估不足。
Result: 在两个代表性的基于MLLM的基线模型上进行了评估,观察到一致且可解释的失败模式:模型经常混淆主视频和时序困难负样本,过度偏好时序合理但最终状态错误的片段,表明对决定性最终状态证据的定位不足,而对语义替换相对不敏感。
Insight: 创新点在于构建了一个以可控状态转换为中心的AI生成视频数据集,通过精心设计的困难负样本对(时序与语义)实现了对检索模型时序与语义混淆的细粒度诊断。客观来看,该方法为评估模型对状态变化的敏感性和推理能力提供了一个可控、可解释的测试平台。
Abstract: Existing text-to-video retrieval benchmarks are dominated by real-world footage where much of the semantics can be inferred from a single frame, leaving temporal reasoning and explicit end-state grounding under-evaluated. We introduce GenState-AI, an AI-generated benchmark centered on controlled state transitions, where each query is paired with a main video, a temporal hard negative that differs only in the decisive end-state, and a semantic hard negative with content substitution, enabling fine-grained diagnosis of temporal vs. semantic confusions beyond appearance matching. Using Wan2.2-TI2V-5B, we generate short clips whose meaning depends on precise changes in position, quantity, and object relations, providing controllable evaluation conditions for state-aware retrieval. We evaluate two representative MLLM-based baselines, and observe consistent and interpretable failure patterns: both frequently confuse the main video with the temporal hard negative and over-prefer temporally plausible but end-state-incorrect clips, indicating insufficient grounding to decisive end-state evidence, while being comparatively less sensitive to semantic substitutions. We further introduce triplet-based diagnostic analyses, including relative-order statistics and breakdowns across transition categories, to make temporal vs. semantic failure sources explicit. GenState-AI provides a focused testbed for state-aware, temporally and semantically sensitive text-to-video retrieval, and will be released on huggingface.co.
[176] End-to-End Spatial-Temporal Transformer for Real-time 4D HOI Reconstruction cs.CVPDF
Haoyu Zhang, Wei Zhai, Yuhang Yang, Yang Cao, Zheng-Jun Zha
TL;DR: 本文提出了一种名为THO的端到端时空Transformer模型,用于从单目RGB视频中实时重建4D人-物交互(HOI)场景。该模型通过利用空间-时间HOI元组先验,以前向推理的方式同时预测人体运动和协调的物体运动,解决了深度模糊和频繁遮挡带来的挑战。
Details
Motivation: 现有方法通常依赖多阶段流水线或迭代优化,导致推理延迟高、无法满足实时需求,且容易产生误差累积。本文旨在克服这些限制,实现高效、准确的单目4D HOI重建。
Result: 实验表明,THO在单个RTX 4090 GPU上实现了31.5 FPS的推理速度,相比基于优化的先前方法加速超过600倍,同时提高了重建精度和时间一致性。
Insight: 创新点在于引入空间先验(利用接触区域邻近性从人体线索推断被遮挡物体特征)和时间先验(捕捉跨帧运动学相关性以优化物体表示并增强物理一致性),从而以端到端方式实现实时、高精度的4D重建。
Abstract: Monocular 4D human-object interaction (HOI) reconstruction - recovering a moving human and a manipulated object from a single RGB video - remains challenging due to depth ambiguity and frequent occlusions. Existing methods often rely on multi-stage pipelines or iterative optimization, leading to high inference latency, failing to meet real-time requirements, and susceptibility to error accumulation. To address these limitations, we propose THO, an end-to-end Spatial-Temporal Transformer that predicts human motion and coordinated object motion in a forward fashion from the given video and 3D template. THO achieves this by leveraging spatial-temporal HOI tuple priors. Spatial priors exploit contact-region proximity to infer occluded object features from human cues, while temporal priors capture cross-frame kinematic correlations to refine object representations and enforce physical coherence. Extensive experiments demonstrate that THO operates at an inference speed of 31.5 FPS on a single RTX 4090 GPU, achieving a >600x speedup over prior optimization-based methods while simultaneously improving reconstruction accuracy and temporal consistency. The project page is available at: https://nianheng.github.io/THO-project/
[177] LongVidSearch: An Agentic Benchmark for Multi-hop Evidence Retrieval Planning in Long Videos cs.CV | cs.IRPDF
Rongyi Yu, Chenyuan Duan, Wentao Zhang
TL;DR: 本文提出了LongVidSearch基准,用于在标准化访问约束下评估长视频中智能体驱动的多跳证据检索规划能力。该基准包含447个平均时长为26分钟的长视频上的3000个问题,强制要求多跳检索(2-4跳),并覆盖四种推理类别。通过统一的工具接口进行评估,结果显示当前最先进的模型(如GPT-5)在该任务上的准确率仍低于50%,表明多跳检索规划是一个主要瓶颈。
Details
Motivation: 现有长视频问答基准大多是静态的,缺乏严格的多跳检索要求和标准化的证据访问接口,难以区分检索规划失败与答案生成失败。为解决这一问题,作者引入了LongVidSearch基准,以在受控条件下专门评估智能体的多跳证据检索规划能力。
Result: 在LongVidSearch基准上,使用VideoAgent风格问答智能体配合不同骨干大语言模型进行评估,GPT-5取得了最高准确率(42.43%),优于Gemini 3 Pro(30.97%)和GPT-4o(19.20%),但所有模型准确率均低于50%。在提供黄金证据片段的情况下,性能接近完美,证实检索规划是主要瓶颈。
Insight: 论文的创新点在于设计了一个强制多跳检索必要性的基准(Hop-k问题要求恰好k个必要证据片段),并提供了统一的工具接口以隔离检索规划能力评估。这为研究智能体在长视频中的迭代检索规划提供了标准化测试平台,并强调了检索规划(而非答案生成)是当前长视频问答的关键挑战。
Abstract: Long video question answering (Long-Video QA) increasingly relies on agentic tool use to retrieve evidence from long videos. In realistic settings, this process often requires multi-hop retrieval, where agents must iteratively gather multiple discontinuous evidence clips. However, existing long-video benchmarks are largely static: they rarely enforce strict multi-hop retrieval and typically lack a standardized evidence-access interface, making it difficult to separate failures in retrieval planning from those in answer generation. To address this gap, we introduce LongVidSearch, a benchmark for evaluating agentic multi-hop evidence retrieval planning in long videos under standardized access constraints. LongVidSearch enforces retrieval necessity: a Hop-k question requires exactly k necessary evidence clips, and removing any single clip renders the question unsolvable. The benchmark contains 3,000 questions over 447 long videos (average length 26 minutes), covering four reasoning categories: State Mutation, Causal Inference, Global Summary, and Visual Tracking, with 2-hop, 3-hop, and 4-hop evidence requirements. To ensure fair and controlled evaluation, all agents interact with LongVidSearch through a unified tool interface, which fixes the retrieval backend and isolates the agent’s ability to formulate queries and plan iterative retrieval. In addition to answer accuracy, we measure tool-call cost to analyze the accuracy-efficiency trade-off under identical access conditions. We evaluate VideoAgent-style QA agents with multiple backbone LLMs using three-judge majority voting. GPT-5 achieves the highest accuracy (42.43), outperforming Gemini 3 Pro (30.97) and GPT-4o (19.20), yet remaining below 50 %, highlighting the difficulty of multi-hop retrieval planning. With gold evidence clips, performance becomes near-perfect, confirming retrieval planning as the primary bottleneck.
[178] V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning cs.CVPDF
Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha
TL;DR: V-JEPA 2.1是一个自监督学习模型系列,旨在为图像和视频学习密集、高质量的视觉表示,同时保持强大的全局场景理解能力。它通过结合密集预测损失、深度自监督、多模态分词器和有效扩展等四个关键组件,实现了在空间上结构化、语义上连贯且时间上一致的表示。
Details
Motivation: 解决视频自监督学习中如何同时学习密集的局部特征和保持全局场景理解的挑战,以提升在密集视觉理解和世界建模任务上的性能。
Result: 在多个基准测试中达到最先进水平:在Ego4D短期物体交互预测任务上获得7.71 mAP,在EPIC-KITCHENS高级动作预测任务上获得40.8 Recall@5,在真实机器人抓取任务上比V-JEPA-2 AC提升20个百分点,并在机器人导航、深度估计和全局识别等任务上表现出色。
Insight: 创新点包括:1)密集预测损失使可见和掩码标记共同贡献训练信号,促进显式的时空定位;2)深度自监督在多个编码器中间层分层应用目标,提升表示质量;3)多模态分词器实现图像和视频的统一训练;4)模型和数据规模的有效扩展。这些设计共同推动了密集视觉表示学习的发展。
Abstract: We present V-JEPA 2.1, a family of self-supervised models that learn dense, high-quality visual representations for both images and videos while retaining strong global scene understanding. The approach combines four key components. First, a dense predictive loss uses a masking-based objective in which both visible and masked tokens contribute to the training signal, encouraging explicit spatial and temporal grounding. Second, deep self-supervision applies the self-supervised objective hierarchically across multiple intermediate encoder layers to improve representation quality. Third, multi-modal tokenizers enable unified training across images and videos. Finally, the model benefits from effective scaling in both model capacity and training data. Together, these design choices produce representations that are spatially structured, semantically coherent, and temporally consistent. Empirically, V-JEPA 2.1 achieves state-of-the-art performance on several challenging benchmarks, including 7.71 mAP on Ego4D for short-term object-interaction anticipation and 40.8 Recall@5 on EPIC-KITCHENS for high-level action anticipation, as well as a 20-point improvement in real-robot grasping success rate over V-JEPA-2 AC. The model also demonstrates strong performance in robotic navigation (5.687 ATE on TartanDrive), depth estimation (0.307 RMSE on NYUv2 with a linear probe), and global recognition (77.7 on Something-Something-V2). These results show that V-JEPA 2.1 significantly advances the state of the art in dense visual understanding and world modeling.
[179] WorldVLM: Combining World Model Forecasting and Vision-Language Reasoning cs.CV | cs.ROPDF
Stefan Englmeier, Katharina Winter, Fabian B. Flohr
TL;DR: 论文提出WorldVLM,一种结合视觉语言模型(VLM)和世界模型(WM)的混合架构,用于自动驾驶系统。该架构利用VLM进行高层场景上下文推理以生成行为指令,并利用WM预测环境动态,旨在实现可解释且具有上下文感知的驾驶决策。
Details
Motivation: 自动驾驶系统需要既能进行高层场景推理又能准确预测环境动态的模型。视觉语言模型擅长上下文推理但空间理解有限,世界模型擅长预测动态但缺乏高层决策能力,因此论文旨在结合两者的互补优势。
Result: 论文评估了不同的条件调节策略,并对混合设计挑战提供了见解,但摘要中未提及具体的定量结果或基准测试。
Insight: 主要创新点在于将擅长推理的VLM与擅长预测的WM统一到一个架构中,让VLM生成高层行为指令来指导WM,从而实现可解释和上下文感知的驾驶动作,这为解决自动驾驶中的泛化性和动态预测问题提供了一个有前景的方向。
Abstract: Autonomous driving systems depend on on models that can reason about high-level scene contexts and accurately predict the dynamics of their surrounding environment. Vision- Language Models (VLMs) have recently emerged as promising tools for decision-making and scene understanding, offering strong capabilities in contextual reasoning. However, their limited spatial comprehension constrains their effectiveness as end-to-end driving models. World Models (WM) internalize environmental dynamics to predict future scene evolution. Recently explored as ego-motion predictors and foundation models for autonomous driving, they represent a promising direction for addressing key challenges in the field, particularly enhancing generalization while maintaining dynamic prediction. To leverage the complementary strengths of context-based decision making and prediction, we propose WorldVLM: A hybrid architecture that unifies VLMs and WMs. In our design, the high-level VLM generates behavior commands to guide the driving WM, enabling interpretable and context-aware actions. We evaluate conditioning strategies and provide insights into the hybrid design challenges.
[180] Unlocking the Latent Canvas: Eliciting and Benchmarking Symbolic Visual Expression in LLMs cs.CVPDF
Yiren Zheng, Shibo Li, Jiaming Liu, Haofan Wang, Yiren Song
TL;DR: 本文提出SVE-ASCII框架,通过ASCII艺术解锁大语言模型(LLMs)内在的符号视觉表达能力,并构建了ASCIIArt-7K数据集和ASCIIArt-Bench基准进行评测。研究发现生成任务与理解任务之间存在相互增强的循环关系。
Details
Motivation: 当前多模态方法主要依赖像素渲染或代码执行进行视觉生成,忽视了LLMs自身潜在的视觉表征能力。本文旨在探索并激发LLMs在纯文本空间内的原生符号视觉表达能力。
Result: 实验证实了任务对偶性的关键现象:生成训练能显著增强视觉理解能力,反之亦然。该研究为基于文本的视觉智能建立了稳健的基线。
Insight: 创新点在于使用ASCII艺术作为紧凑、高效的文本原生视觉格式来激发LLMs的视觉能力,并提出了’种子与演化’数据合成管道以及联合优化生成与理解的统一指令调优策略。核心洞察是生成与理解在符号视觉处理中形成了相互增强的循环,这在视觉领域得到了实证验证。
Abstract: Current multimodal approaches predominantly treat visual generation as an external process, relying on pixel rendering or code execution, thereby overlooking the native visual representation capabilities latent within Large Language Models (LLMs). In this work, we unlock this potential through ASCII art, a compact, efficient, and text-native visual format. We introduce SVE-ASCII, a unified framework designed to elicit and benchmark Symbolic Visual Expression directly within the pure text space. To address the scarcity of systematic resources, we construct ASCIIArt-7K, a high-quality dataset synthesized via a novel “Seed-and-Evolve” pipeline that augments human-curated anchors through in-context stylistic editing. We further implement a unified instruction-tuning strategy that jointly optimizes for both Generation (Text-to-ASCII) and Understanding (ASCII-to-Text). Crucially, our experiments reveal a critical phenomenon regarding task duality: while it is established that perception aids generation, we provide compelling evidence that generative training significantly enhances visual comprehension. This confirms a mutually reinforcing cycle in symbolic visual processing, a relationship previously hypothesized but rarely empirically demonstrated in the visual domain. We release our dataset, the ASCIIArt-Bench benchmark, and the SVE-ASCII model, establishing a robust baseline for native text-based visual intelligence.
[181] VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning cs.CV | cs.AI | cs.ROPDF
Chaoyang Wang, Wenrui Bao, Sicheng Gao, Bingxin Xu, Yu Tian
TL;DR: 本文提出了VLA-Thinker框架,通过引入‘图像思维’推理机制来增强视觉-语言-行动模型。该框架将感知建模为可动态调用的推理动作,并采用包含监督微调和基于GRPO的强化学习的两阶段训练流程。在LIBERO和RoboTwin 2.0基准测试中,该方法显著提升了长视野机器人操作任务的性能。
Details
Motivation: 现有VLA模型大多依赖基于文本的思维链推理,将视觉输入视为静态上下文,这限制了模型在长视野任务中主动重新审视环境以解决歧义的能力。本文旨在解决这一问题。
Result: 在LIBERO基准上取得了97.5%的成功率,在RoboTwin 2.0基准的长视野机器人任务上也获得了显著提升,展现了优越的性能。
Insight: 核心创新点在于将感知建模为可动态调用的推理动作(thinking-with-image),使模型能主动进行视觉推理。训练流程结合了基于视觉思维链数据的监督微调和基于GRPO的强化学习,以对齐推理-行动轨迹与任务级成功。
Abstract: Vision-Language-Action (VLA) models have shown promising capabilities for embodied intelligence, but most existing approaches rely on text-based chain-of-thought reasoning where visual inputs are treated as static context. This limits the ability of the model to actively revisit the environment and resolve ambiguities during long-horizon tasks. We propose VLA-Thinker, a thinking-with-image reasoning framework that models perception as a dynamically invocable reasoning action. To train such a system, we introduce a two-stage training pipeline consisting of (1) an SFT cold-start phase with curated visual Chain-of-Thought data to activate structured reasoning and tool-use behaviors, and (2) GRPO-based reinforcement learning to align complete reasoning-action trajectories with task-level success. Extensive experiments on LIBERO and RoboTwin 2.0 benchmarks demonstrate that VLA-Thinker significantly improves manipulation performance, achieving 97.5% success rate on LIBERO and strong gains across long-horizon robotic tasks. Project and Codes: https://cywang735.github.io/VLA-Thinker/ .
[182] LatSearch: Latent Reward-Guided Search for Faster Inference-Time Scaling in Video Diffusion cs.CVPDF
Zengqun Zhao, Ziquan Liu, Yu Cao, Shaogang Gong, Zhensong Zhang
TL;DR: 本文提出了一种名为LatSearch的新方法,通过潜在奖励引导的搜索机制来加速视频扩散模型的推理时间扩展。该方法利用一个潜在奖励模型在去噪过程中提供中间反馈,并结合奖励引导的重采样和剪枝策略,以提高生成视频的质量和效率。
Details
Motivation: 现有视频扩散模型在推理时优化初始噪声的方法存在误差累积、奖励信号稀疏且延迟、计算成本高等问题,限制了更强搜索算法的应用,从而阻碍了可控性、样本效率和生成质量的提升。
Result: 在VBench-2.0基准测试中,LatSearch相比基线Wan2.1模型在多个评估维度上一致提升了视频生成质量。
Insight: 创新点在于引入了潜在奖励模型来提供去噪轨迹中的中间、信息丰富且高效的反馈,以及设计了奖励引导的重采样和剪枝机制,这降低了计算成本并可能解锁视频扩散模型的更大性能增益。
Abstract: The recent success of inference-time scaling in large language models has inspired similar explorations in video diffusion. In particular, motivated by the existence of “golden noise” that enhances video quality, prior work has attempted to improve inference by optimising or searching for better initial noise. However, these approaches have notable limitations: they either rely on priors imposed at the beginning of noise sampling or on rewards evaluated only on the denoised and decoded videos. This leads to error accumulation, delayed and sparse reward signals, and prohibitive computational cost, which prevents the use of stronger search algorithms. Crucially, stronger search algorithms are precisely what could unlock substantial gains in controllability, sample efficiency and generation quality for video diffusion, provided their computational cost can be reduced. To fill in this gap, we enable efficient inference-time scaling for video diffusion through latent reward guidance, which provides intermediate, informative and efficient feedback along the denoising trajectory. We introduce a latent reward model that scores partially denoised latents at arbitrary timesteps with respect to visual quality, motion quality, and text alignment. Building on this model, we propose LatSearch, a novel inference-time search mechanism that performs Reward-Guided Resampling and Pruning (RGRP). In the resampling stage, candidates are sampled according to reward-normalised probabilities to reduce over-reliance on the reward model. In the pruning stage, applied at the final scheduled step, only the candidate with the highest cumulative reward is retained, improving both quality and efficiency. We evaluate LatSearch on the VBench-2.0 benchmark and demonstrate that it consistently improves video generation across multiple evaluation dimensions compared to the baseline Wan2.1 model.
[183] Interp3R: Continuous-time 3D Geometry Estimation with Frames and Events cs.CV | cs.ROPDF
Shuang Guo, Filbert Febryanto, Lei Sun, Guillermo Gallego
TL;DR: 本文提出了Interp3R方法,旨在解决基于点云图(pointmap)的3D视觉基础模型(如DUSt3R)只能恢复离散时刻场景几何的问题。该方法利用异步事件数据,对基于帧的模型输出的点云图进行插值,从而实现在任意连续时间点估计深度和相机位姿,构建时间连续的几何表示。
Details
Motivation: 现有基于点云图的3D重建方法仅在图像捕获的离散时刻有效,无法捕捉连续帧之间盲时间内的场景演化。本文旨在利用事件相机数据,将几何估计扩展到任意连续时间点。
Result: 实验表明,Interp3R在广泛的合成与真实世界基准测试中表现出强大的泛化能力,其性能显著优于先进行2D视频帧插值再进行3D几何估计的两阶段先进基线方法。
Insight: 核心创新在于将事件数据与基于帧的点云图模型相结合,通过联合优化插值点云图与原始预测点云图的空间一致性,实现了时间连续的3D几何估计。该方法证明了仅使用合成数据训练即可有效泛化到真实场景。
Abstract: In recent years, 3D visual foundation models pioneered by pointmap-based approaches such as DUSt3R have attracted a lot of interest, achieving impressive accuracy and strong generalization across diverse scenes. However, these methods are inherently limited to recovering scene geometry only at the discrete time instants when images are captured, leaving the scene evolution during the blind time between consecutive frames largely unexplored. We introduce Interp3R, to the best of our knowledge the first method that enhances pointmap-based models to estimate depth and camera poses at arbitrary time instants. Interp3R leverages asynchronous event data to interpolate pointmaps produced by frame-based models, enabling temporally continuous geometric representations. Depth and camera poses are then jointly recovered by aligning the interpolated pointmaps together with those predicted by the underlying frame-based models into a consistent spatial framework. We train Interp3R exclusively on a synthetic dataset, yet demonstrate strong generalization across a wide range of synthetic and real-world benchmarks. Extensive experiments show that Interp3R outperforms by a considerable margin state-of-the-art baselines that follow a two-stage pipeline of 2D video frame interpolation followed by 3D geometry estimation.
[184] ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference cs.CV | cs.LGPDF
Surendra Pathak, Bo Han
TL;DR: 本文提出ASAP方法,一种无需训练、兼容KV-Cache的剪枝方案,旨在解决大型视觉语言模型(LVLM)处理高分辨率视觉令牌时二次计算成本过高的问题。该方法通过动态双向软注意力掩码缓解注意力偏移现象,并引入加权软合并组件来融合语义相似的令牌,从而在保持模型性能的同时大幅降低计算开销。
Details
Motivation: 现有令牌缩减策略未能充分利用注意力值、未能有效处理令牌冗余,且忽视了LVLM中固有的’注意力偏移’现象,导致令牌注意力分数失真,因此需要一种更高效的视觉上下文压缩方法。
Result: 在LLaVA-NeXT-7B模型上,ASAP实现了近乎无损的视觉上下文压缩,保留了原始模型99.02%的性能,同时将计算FLOPs大幅削减约80%。
Insight: 创新点在于通过动态双向软注意力掩码校正注意力偏移,以及加权软合并机制减少语义冗余,这是一种无需训练、兼容现有推理优化技术(如KV-Cache)的高效剪枝方法。
Abstract: While Large Vision-Language Models (LVLMs) demonstrate exceptional multi-modal capabilities, the quadratic computational cost of processing high-resolution visual tokens remains a critical bottleneck. Though recent token reduction strategies attempt to accelerate inference, such methods inadequately exploit attention values and fail to address token redundancy. More critically, they overlook the ``attention shift’’ phenomenon inherent in LVLMs, which skews token attention scores. In this work, we propose ASAP, a novel training-free, KV-Cache-compatible pruning recipe that comprehensively addresses these limitations. First, we mitigate the attention shift by utilizing a dynamic bidirectional soft attention mask, ensuring the selection of genuinely informative tokens rather than naive attention-based selection. Second, we posit that high semantic redundancy within the token set degrades performance. We therefore introduce a weighted soft merging component that merges semantically similar tokens, preserving only the most feature-dense visual patches for subsequent layers. ASAP achieves virtually lossless compression of visual context, retaining 99.02% of the original LLaVA-NeXT-7B performance while aggressively slashing computational FLOPs by ~80%.
[185] A comprehensive multimodal dataset and benchmark for ulcerative colitis scoring in endoscopy cs.CV | cs.AI | cs.IRPDF
Noha Ghatwary, Jiangbei Yue, Ahmed Elgendy, Hanna Nagdy, Ahmed Galal
TL;DR: 本文提出了一个用于溃疡性结肠炎内镜评分评估的多中心、多分辨率数据集,包含专家验证的梅奥内镜评分和溃疡性结肠炎内镜严重程度指数标签以及详细的临床描述,并提供了基于卷积神经网络、视觉Transformer、混合模型和视觉语言描述算法的基准测试。
Details
Motivation: 当前缺乏公开的专家标注数据集和稳健的基准测试,限制了自动预测溃疡性结肠炎内镜评分和生成临床意义图像描述的计算方法发展,且多中心数据需求以提升算法鲁棒性和泛化性。
Result: 论文提供了首个结合双评分指标分类任务和专家生成描述的数据集,并建立了多种深度学习模型的基准测试,为开发临床意义的多模态算法奠定了基础。
Insight: 创新点在于构建了首个综合多中心、多分辨率数据集,整合了分类标签和图像描述,促进了多模态算法在医学影像分析中的应用,并提供了全面的基准测试框架。
Abstract: Ulcerative colitis (UC) is a chronic mucosal inflammatory condition that places patients at increased risk of colorectal cancer. Colonoscopic surveillance remains the gold standard for assessing disease activity, and reporting typically relies on standardised endoscopic scoring metrics. The most widely used is the Mayo Endoscopic Score (MES), with some centres also adopting the Ulcerative Colitis Endoscopic Index of Severity (UCEIS). Both are descriptive assessments of mucosal inflammation (MES: 0 to 3; UCEIS: 0 to 8), where higher values indicate more severe disease. However, computational methods for automatically predicting these scores remain limited, largely due to the lack of publicly available expert-annotated datasets and the absence of robust benchmarking. There is also a significant research gap in generating clinically meaningful descriptions of UC images, despite image captioning being a well-established computer vision task. Variability in endoscopic systems and procedural workflows across centres further highlights the need for multi-centre datasets to ensure algorithmic robustness and generalisability. In this work, we introduce a curated multi-centre, multi-resolution dataset that includes expert-validated MES and UCEIS labels, alongside detailed clinical descriptions. To our knowledge, this is the first comprehensive dataset that combines dual scoring metrics for classification tasks with expert-generated captions describing mucosal appearance and clinically accepted reasoning for image captioning. This resource opens new opportunities for developing clinically meaningful multimodal algorithms. In addition to the dataset, we also provide benchmarking using convolutional neural networks, vision transformers, hybrid models, and widely used multimodal vision-language captioning algorithms.
[186] Medical Image Spatial Grounding with Semantic Sampling cs.CV | cs.LGPDF
Andrew Seohwan Yu, Mohsen Hariri, Kunio Nakamura, Mingrui Yang, Xiaojuan Li
TL;DR: 本文研究了视觉语言模型在三维医学图像空间定位任务中的表现,分析了图像模态、切片方向和坐标系等视觉因素以及解剖学、方向性和关系性术语等语言因素对模型的影响,并提出了一个名为MIS-Ground的基准测试来评估模型在医学图像空间定位中的脆弱性。此外,作者还提出了一种名为MIS-SemSam的低成本、推理时、模型无关的优化方法,通过语义采样提升模型的空间定位能力,在Qwen3-VL-32B模型上实现了13.06%的准确率提升。
Details
Motivation: 解决视觉语言模型在三维医学图像中对解剖结构进行空间定位时面临的独特挑战,如多模态图像、复杂坐标系统和专业术语理解等问题。
Result: 在提出的MIS-Ground基准测试上,MIS-SemSam方法将Qwen3-VL-32B模型的准确率提升了13.06%,展示了其在医学图像空间定位任务上的有效性。
Insight: 创新点包括系统性地分析了影响医学图像空间定位的视觉和语言因素,提出了专门的基准测试MIS-Ground,以及一种通过语义采样进行推理时优化的通用方法MIS-SemSam,该方法无需重新训练即可提升现有模型的性能。
Abstract: Vision language models (VLMs) have shown significant promise in visual grounding for images as well as videos. In medical imaging research, VLMs represent a bridge between object detection and segmentation, and report understanding and generation. However, spatial grounding of anatomical structures in the three-dimensional space of medical images poses many unique challenges. In this study, we examine image modalities, slice directions, and coordinate systems as differentiating factors for vision components of VLMs, and the use of anatomical, directional, and relational terminology as factors for the language components. We then demonstrate that visual and textual prompting systems such as labels, bounding boxes, and mask overlays have varying effects on the spatial grounding ability of VLMs. To enable measurement and reproducibility, we introduce \textbf{MIS-Ground}, a benchmark that comprehensively tests a VLM for vulnerabilities against specific modes of \textbf{M}edical \textbf{I}mage \textbf{S}patial \textbf{Ground}ing. We release MIS-Ground to the public at \href{https://anonymous.4open.science/r/mis-ground}{\texttt{anonymous.4open.science/r/mis-ground}}. In addition, we present \textbf{MIS-SemSam}, a low-cost, inference-time, and model-agnostic optimization of VLMs that improve their spatial grounding ability with the use of \textbf{Sem}antic \textbf{Sam}pling. We find that MIS-SemSam improves the accuracy of Qwen3-VL-32B on MIS-Ground by 13.06%.
[187] GroundSet: A Cadastral-Grounded Dataset for Spatial Understanding with Vector Data cs.CVPDF
Roger Ferrod, Maël Lecene, Krishna Sapkota, George Leifman, Vered Silverman
TL;DR: 本文介绍了GroundSet,一个基于可验证地籍矢量数据的大规模数据集,包含510k张高分辨率图像中的380万个标注对象和135个细粒度语义类别,用于提升遥感图像中的细粒度空间理解能力。通过涵盖七个空间推理任务的指令调优基准进行验证,并基于标准LLaVA架构建立了稳健基线。研究表明,当前遥感专用和商业模型在零样本设置下表现不佳,但高质量监督能有效弥补这一差距,使标准架构无需复杂修改即可掌握细粒度空间定位。
Details
Motivation: 解决多模态大语言模型在遥感领域因依赖有限或重新利用的遗留数据集而导致的细粒度空间理解不足问题,以支持城市规划、环境监测和灾害管理等关键应用。
Result: 在涵盖七个空间推理任务的指令调优基准上评估,使用标准LLaVA架构建立了稳健基线;当前遥感专用和商业模型(如Gemini)在零样本设置中表现不佳,但通过高保真监督,标准架构无需复杂修改即可实现细粒度空间定位。
Insight: 创新点在于引入基于可验证地籍矢量数据的大规模高质量数据集,提供细粒度语义标注;客观分析表明,该方法通过高质量监督而非复杂架构修改,有效提升了标准模型在遥感空间理解任务上的性能,为数据驱动的模型改进提供了新思路。
Abstract: Precise spatial understanding in Earth Observation is essential for translating raw aerial imagery into actionable insights for critical applications like urban planning, environmental monitoring and disaster management. However, Multimodal Large Language Models exhibit critical deficiencies in fine-grained spatial understanding within Remote Sensing, primarily due to a reliance on limited or repurposed legacy datasets. To bridge this gap, we introduce a large-scale dataset grounded in verifiable cadastral vector data, comprising 3.8 million annotated objects across 510k high-resolution images with 135 granular semantic categories. We validate this resource through a comprehensive instruction-tuning benchmark spanning seven spatial reasoning tasks. Our evaluation establishes a robust baseline using a standard LLaVA architecture. We show that while current RS-specialized and commercial models (e.g., Gemini) struggle in zero-shot settings, high-fidelity supervision effectively bridges this gap, enabling standard architectures to master fine-grained spatial grounding without complex architectural modifications.
[188] Make it SING: Analyzing Semantic Invariants in Classifiers cs.CV | eess.IVPDF
Harel Yadid, Meir Yossef Levi, Roy Betser, Guy Gilboa
TL;DR: 本文提出了一种名为SING(Semantic Interpretation of the Null-space Geometry)的方法,用于分析和解释分类器(如ResNet50和DinoViT)中存在的语义不变性。该方法通过将网络特征映射到多模态视觉语言模型,为分类器零空间中的等效输入变化生成自然语言描述和视觉示例,从而揭示其语义内容。
Details
Motivation: 现有方法难以解释分类器(包括SOTA视觉模型)零空间中存在的、导致相同输出的等效输入集合的语义内容,本文旨在填补这一空白,提供人类可理解的语义解释。
Result: 该方法应用于单张图像可揭示局部不变性,应用于图像集则允许在类别和模型层面进行广泛的统计分析。例如,分析表明ResNet50会将相关语义属性泄露到零空间,而使用自监督DINO预训练的ViT(DinoViT)在不变空间中保持类别语义方面表现更优。
Insight: 创新点在于提出了一种将分类器零空间的几何特性与多模态视觉语言模型结合的方法,从而为模型不变性提供可解释的语义描述;从客观角度看,该方法为模型可解释性研究提供了一种新工具,能够定量比较不同模型(如CNN与ViT)在语义保持能力上的差异。
Abstract: All classifiers, including state-of-the-art vision models, possess invariants, partially rooted in the geometry of their linear mappings. These invariants, which reside in the null-space of the classifier, induce equivalent sets of inputs that map to identical outputs. The semantic content of these invariants remains vague, as existing approaches struggle to provide human-interpretable information. To address this gap, we present Semantic Interpretation of the Null-space Geometry (SING), a method that constructs equivalent images, with respect to the network, and assigns semantic interpretations to the available variations. We use a mapping from network features to multi-modal vision language models. This allows us to obtain natural language descriptions and visual examples of the induced semantic shifts. SING can be applied to a single image, uncovering local invariants, or to sets of images, allowing a breadth of statistical analysis at the class and model levels. For example, our method reveals that ResNet50 leaks relevant semantic attributes to the null space, whereas DinoViT, a ViT pretrained with self-supervised DINO, is superior in maintaining class semantics across the invariant space.
[189] A Heterogeneous Ensemble for Multi-Center COVID-19 Classification from Chest CT Scans cs.CVPDF
Aadit Nilay, Bhavesh Thapar, Anant Agrawal, Mohammad Nayeem Teli
TL;DR: 本文提出了一种用于多中心COVID-19胸部CT扫描分类的异构集成方法,通过结合三种不同推理范式的九种模型,并采用多种正则化和校准技术,有效解决了跨中心部署时的领域偏移和过拟合问题,显著提升了分类性能。
Details
Motivation: 解决RT-PCR检测速度慢、假阴性率高以及基于CT的自动化筛查在跨医院中心部署时,由于扫描仪硬件、采集协议和患者群体差异导致的领域偏移问题,从而提升多中心COVID-19诊断的鲁棒性。
Result: 在四个医院中心的数据集上,最终集成模型的平均宏观F1分数达到0.9280,优于最佳单一模型(F1=0.8969),提升了0.031,实现了稳健的多站点医学图像分类性能。
Insight: 创新点在于构建了包含自监督ViT、预训练CNN和多种MIL模型的异构集成架构,并通过焦点损失、嵌入级Mixup、领域感知增强和源感知阈值校准等技术增强模型多样性和鲁棒性,有效缓解了过拟合和领域偏移,为多中心医学影像分析提供了可借鉴的框架。
Abstract: The COVID-19 pandemic exposed critical limitations in diagnostic workflows: RT-PCR tests suffer from slow turnaround times and high false-negative rates, while CT-based screening offers faster complementary diagnosis but requires expert radiological interpretation. Deploying automated CT analysis across multiple hospital centres introduces further challenges, as differences in scanner hardware, acquisition protocols, and patient populations cause substantial domain shift that degrades single-model performance. To address these challenges, we present a heterogeneous ensemble of nine models spanning three inference paradigms: (1) a self-supervised DINOv2 Vision Transformer with slice-level sigmoid aggregation, (2) a RadImageNet-pretrained DenseNet-121 with slice-level sigmoid averaging, and (3) seven Gated Attention Multiple Instance Learning models using EfficientNet-B3, ConvNeXt-Tiny, and EfficientNetV2-S backbones with scan-level softmax classification. Ensemble diversity is further enhanced through random-seed variation and Stochastic Weight Averaging. We address severe overfitting, reducing the validation-to-training loss ratio from 35x to less than 3x, through a combination of Focal Loss, embedding-level Mixup, and domain-aware augmentation. Model outputs are fused via score-weighted probability averaging and calibrated with per-source threshold optimization. The final ensemble achieves an average macro F1 of 0.9280 across four hospital centres, outperforming the best single model (F1=0.8969) by +0.031, demonstrating that heterogeneous architectures combined with source-aware calibration are essential for robust multi-site medical image classification.
[190] Human-AI Ensembles Improve Deepfake Detection in Low-to-Medium Quality Videos cs.CV | cs.AIPDF
Marco Postiglione, Isabel Gortner, V. S. Subrahmanian
TL;DR: 本文研究了人类与AI在深度伪造检测中的表现对比,发现人类在低至中等质量视频上的检测能力优于AI,且两者错误具有互补性,人机协同集成能有效减少高置信度错误。
Details
Motivation: 动机在于探究在真实场景下人类与AI检测器在深度伪造检测中的表现差异,特别是在非专业制作视频中的性能对比。
Result: 在DF40标准基准和新型数据集CharadesDF上,人类表现优于95个SOTA AI检测器,尤其在CharadesDF上AI准确率降至接近随机水平(0.537),而人类保持稳健性能(0.784),人机集成减少了高置信度错误。
Insight: 创新点在于揭示了人类与AI在深度伪造检测中的互补性,并强调在非专业视频中实现有效检测需依赖人机协作而非单独AI算法,这为实际应用提供了新思路。
Abstract: Deepfake detection is widely framed as a machine learning problem, yet how humans and AI detectors compare under realistic conditions remains poorly understood. We evaluate 200 human participants and 95 state-of-the-art AI detectors across two datasets: DF40, a standard benchmark, and CharadesDF, a novel dataset of videos of everyday activities. CharadesDF was recorded using mobile phones leading to low/moderate quality videos compared to the more professionally captured DF40. Humans outperform AI detectors on both datasets, with the gap widening in the case of CharadesDF where AI accuracy collapses to near chance (0.537) while humans maintain robust performance (0.784). Human and AI errors are complementary: humans miss high-quality deepfakes while AI detectors flag authentic videos as fake, and hybrid human-AI ensembles reduce high-confidence errors. These findings suggest that effective real-world deepfake detection, especially in non-professionally produced videos, requires human-AI collaboration rather than AI algorithms alone.
[191] VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting cs.CV | cs.AIPDF
Daeun Lee, Shoubin Yu, Yue Zhang, Mohit Bansal
TL;DR: 本文提出VisionCoach,一种输入自适应的强化学习框架,通过视觉提示作为训练时指导来增强视频推理中的时空定位能力。该方法在训练时对困难输入选择性应用视觉提示以增强问题相关证据并抑制干扰,并通过自蒸馏使模型内化这种能力,从而在推理时无需视觉提示即可直接对原始视频进行定位推理。
Details
Motivation: 解决现有基于强化学习的视频推理方法在推理过程中难以实现可靠时空定位的问题,同时避免依赖大规模标注数据或推理时感知工具带来的高成本。
Result: 在V-STAR、VideoMME、World-Sense、VideoMMMU、PerceptionTest和Charades-STA等多个视频推理、视频理解和时序定位基准测试上取得了最先进的性能,且无需外部工具,保持单一高效推理路径。
Insight: 创新点在于将视觉提示作为训练时的自适应指导,结合自蒸馏使模型内化定位能力,避免了推理时的额外开销;其视觉提示选择器和结合物体身份一致性与多区域边界框重叠的奖励机制,有效提升了时空定位的准确性。
Abstract: Video reasoning requires models to locate and track question-relevant evidence across frames. While reinforcement learning (RL) with verifiable rewards improves accuracy, it still struggles to achieve reliable spatio-temporal grounding during the reasoning process. Moreover, improving grounding typically relies on scaled training data or inference-time perception tools, which increases annotation cost or computational cost. To address this challenge, we propose VisonCoach, an input-adaptive RL framework that improves spatio-temporal grounding through visual prompting as training-time guidance. During RL training, visual prompts are selectively applied to challenging inputs to amplify question-relevant evidence and suppress distractors. The model then internalizes these improvements through self-distillation, enabling grounded reasoning directly on raw videos without visual prompting at inference. VisonCoach consists of two components: (1) Visual Prompt Selector, which predicts appropriate prompt types conditioned on the video and question, and (2) Spatio-Temporal Reasoner, optimized with RL under visual prompt guidance and object-aware grounding rewards that enforce object identity consistency and multi-region bounding-box overlap. Extensive experiments demonstrate that VisonCoach achieves state-of-the-art performance under comparable settings, across diverse video reasoning, video understanding, and temporal grounding benchmarks (V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, and Charades-STA), while maintaining a single efficient inference pathway without external tools. Our results show that visual prompting during training improves grounded video reasoning, while self-distillation enables the model to internalize this ability without requiring prompts at inference time.
[192] EviATTA: Evidential Active Test-Time Adaptation for Medical Segment Anything Models cs.CVPDF
Jiayi Chen, Yasmeen George, Winston Chong, Jianfei Cai
TL;DR: 本文提出了EviATTA,一种专为医学分割SAM模型设计的证据性主动测试时适应框架,旨在解决大分布偏移下测试时监督不可靠的问题。该方法通过狄利克雷证据建模分解预测不确定性,并设计了分层证据采样策略和双重一致性正则化,以高效利用稀疏专家标注,提升模型在测试时的适应可靠性。
Details
Motivation: 在医学图像分割中,部署基础SAM模型进行测试时适应面临大分布偏移的挑战,导致测试时监督不可靠。现有主动测试时适应方法存在不确定性估计不可靠和稀疏标注利用效率低的问题,需要一种更可靠的适应框架。
Result: 在六个医学图像分割数据集上的大量实验表明,EviATTA在批量和实例级测试时适应设置下,均能以最少的专家反馈持续提升适应可靠性,验证了其有效性。
Insight: 创新点包括:采用狄利克雷证据建模分解不确定性为分布不确定性和数据不确定性;设计分层证据采样策略,分别利用两种不确定性选择样本和指导标注;引入双重一致性正则化,结合稀疏标注样本的渐进提示一致性和未标注样本的变分特征一致性,以稳定适应过程并高效利用监督信息。
Abstract: Deploying foundational medical Segment Anything Models (SAMs) via test-time adaptation (TTA) is challenging under large distribution shifts, where test-time supervision is often unreliable. While active test-time adaptation (ATTA) introduces limited expert feedback to improve reliability, existing ATTA methods still suffer from unreliable uncertainty estimation and inefficient utilization of sparse annotations. To address these issues, we propose Evidential Active Test-Time Adaptation (EviATTA), which is, to our knowledge, the first ATTA framework tailored for medical SAMs. Specifically, we adopt the Dirichlet-based Evidential Modeling to decompose overall predictive uncertainty into distribution uncertainty and data uncertainty. Building on this decomposition, we design a Hierarchical Evidential Sampling strategy, where image-wise distribution uncertainty is used to select informative shifted samples, while distance-aware data uncertainty guides sparse pixel annotations to resolve data ambiguities. We further introduce Dual Consistency Regularization, which enforces progressive prompt consistency on sparsely labeled samples to better exploit sparse supervision and applies variational feature consistency on unlabeled samples to stabilize adaptation. Extensive experiments on six medical image segmentation datasets demonstrate that EviATTA consistently improves adaptation reliability with minimal expert feedback under both batch-wise and instance-wise test-time adaptation settings.
[193] MVHOI: Bridge Multi-view Condition to Complex Human-Object Interaction Video Reenactment via 3D Foundation Model cs.CV | cs.AIPDF
Jinguang Tong, Jinbo Wu, Kaisiyuan Wang, Zhelun Shen, Xuan Huang
TL;DR: 本文提出MVHOI,一个两阶段的人-物交互视频重演框架,通过3D基础模型将多视角参考条件与视频基础模型连接起来,以生成具有复杂三维物体操作的长时人-物交互视频。
Details
Motivation: 现有方法主要处理简单的图像平面运动,难以应对复杂的非平面操作,如平面外的重新定向,因此需要一种能处理复杂三维物体操作的人-物交互视频重演方法。
Result: 大量实验表明,该方法在生成具有复杂三维物体操作的人-物交互视频方面,相比先前方法有显著改进,尤其在复杂三维物体操作上表现优异。
Insight: 创新点在于利用3D基础模型生成视角一致的对象先验,并结合可控视频生成模型和多视角参考图像,通过推理阶段的相互增强,实现了外观一致性和高保真纹理合成,从而有效处理复杂三维操作。
Abstract: Human-Object Interaction (HOI) video reenactment with realistic motion remains a frontier in expressive digital human creation. Existing approaches primarily handle simple image-plane motion (e.g., in-plane translations), struggling with complex non-planar manipulations like out-of-plane reorientation. In this paper, we propose MVHOI, a two-stage HOI video reenactment framework that bridges multi-view reference conditions and video foundation models via a 3D Foundation Model (3DFM). The 3DFM first produces view-consistent object priors conditioned on implicit motion dynamics across novel viewpoints. A controllable video generation model then synthesizes high-fidelity object texture by incorporating multi-view reference images, ensuring appearance consistency via a reasonable retrieval mechanism. By enabling these two stages to mutually reinforce one another during the inference phase, our framework shows superior performance in generating long-duration HOI videos with intricate object manipulations. Extensive experiments show substantial improvements over prior approaches, especially for HOI with complex 3D object manipulations.
[194] AURORA-KITTI: Any-Weather Depth Completion and Denoising in the Wild cs.CVPDF
Yiting Wang, Tim Brödermann, Hamed Haghighi, Haonan Zhao, Christos Sakaridis
TL;DR: 本文提出了AURORA-KITTI,这是首个用于野外鲁棒深度补全的大规模多模态、多天气基准数据集,并提出了深度补全与去噪的统一任务。同时,作者引入了DDCD,一种高效的基于蒸馏的基线方法,利用深度基础模型将干净的结构先验注入到野外DCD训练中。
Details
Motivation: 现有的RGB-LiDAR融合深度补全方法在恶劣天气下性能显著下降,因为相机图像和LiDAR测量都会受到天气引起的损坏,因此需要鲁棒的深度补全方法以支持真实世界的3D场景理解。
Result: DDCD在AURORA-KITTI和真实世界DENSE数据集上实现了最先进的性能,同时保持了高效性。结果表明,天气感知、物理一致的数据对鲁棒性的贡献超过了单独的架构修改。
Insight: 创新点在于将深度补全与去噪统一为一个任务,并构建了首个大规模多天气基准数据集。方法上,利用深度基础模型通过蒸馏注入结构先验,是一种高效且有效的策略。数据集的多样性和物理一致性被证明是提升模型鲁棒性的关键因素。
Abstract: Robust depth completion is fundamental to real-world 3D scene understanding, yet existing RGB-LiDAR fusion methods degrade significantly under adverse weather, where both camera images and LiDAR measurements suffer from weather-induced corruption. In this paper, we introduce AURORA-KITTI, the first large-scale multi-modal, multi-weather benchmark for robust depth completion in the wild. We further formulate Depth Completion and Denoising (DCD) as a unified task that jointly reconstructs a dense depth map from corrupted sparse inputs while suppressing weather-induced noise. AURORA-KITTI contains over \textit{82K} weather-consistent RGBL pairs with metric depth ground truth, spanning diverse weather types, three severity levels, day and night scenes, paired clean references, lens occlusion conditions, and textual descriptions. Moreover, we introduce DDCD, an efficient distillation-based baseline that leverages depth foundation models to inject clean structural priors into in-the-wild DCD training. DDCD achieves state-of-the-art performance on AURORA-KITTI and the real-world DENSE dataset while maintaining efficiency. Notably, our results further show that weather-aware, physically consistent data contributes more to robustness than architectural modifications alone. Data and code will be released upon publication.
[195] AdapterTune: Zero-Initialized Low-Rank Adapters for Frozen Vision Transformers cs.CV | cs.AI | cs.LGPDF
Salim Khazem
TL;DR: 本文提出了AdapterTune方法,用于解决冻结视觉Transformer主干网络进行迁移学习时的两个问题:适配器简单插入导致的优化不稳定,以及缺乏设置适配器容量的原则性指导。该方法在每个Transformer块中引入一个残差低秩瓶颈,其上投影层采用零初始化,确保网络初始状态与预训练函数完全一致,消除了早期训练阶段的表示漂移。
Details
Motivation: 动机是解决冻结视觉Transformer主干进行迁移学习时,适配器插入导致的优化不稳定问题,以及缺乏对适配器容量(秩)进行设置的指导原则。
Result: 在9个数据集和3种主干网络规模上进行了评估。在一个核心的5数据集迁移套件上,AdapterTune相比仅微调分类头的方法平均提升了14.9个百分点的Top-1准确率,同时仅训练了全参数微调所需参数的0.92%,并且在15个数据集-主干组合中的10个上超越了全参数微调。在整个基准测试中,AdapterTune在所有测试的数据集-主干组合上都超越了仅微调分类头的方法。
Insight: 主要创新点包括:1)零初始化的低秩适配器设计,确保训练起始于预训练函数,稳定优化过程;2)将适配器秩形式化为特征空间中逼近下游任务偏移的容量预算,其理论分析预测了准确率随秩增加而单调但收益递减的“肘部”行为,为容量选择提供了原则性指导。
Abstract: Frozen-backbone transfer with Vision Transformers faces two under-addressed issues: optimization instability when adapters are naively inserted into a fixed feature extractor, and the absence of principled guidance for setting adapter capacity. We introduce AdapterTune, which augments each transformer block with a residual low-rank bottleneck whose up-projection is zero-initialized, guaranteeing that the adapted network starts exactly at the pretrained function and eliminates early-epoch representation drift. On the analytical side, we formalize adapter rank as a capacity budget for approximating downstream task shifts in feature space. The resulting excess-risk decomposition predicts monotonic but diminishing accuracy gains with increasing rank, an ``elbow’’ behavior we confirm through controlled sweeps. We evaluate on 9 datasets and 3 backbone scales with multi-seed reporting throughout. On a core 5 dataset transfer suite, AdapterTune improves top-1 accuracy over head-only transfer by +14.9 points on average while training only 0.92 of the parameters required by full fine-tuning, and outperforms full fine-tuning on 10 of 15 dataset-backbone pairs. Across the full benchmark, AdapterTune improves over head-only transfer on every dataset-backbone pair tested. Ablations on rank, placement, and initialization isolate each design choice. The code is available at: https://github.com/salimkhazem/adaptertune
[196] Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator cs.CVPDF
Gyeongsik Moon
TL;DR: 本文提出Hand4Whole++框架,通过引入条件手部调制器(CHAM)模块,将预训练的手部姿态估计器特征调制到全身姿态估计的特征流中,从而在无需重新训练全身模型的情况下,提升3D全身姿态估计中手部姿态的准确性和与身体运动学结构的一致性。
Details
Motivation: 解决3D全身姿态估计中手部姿态恢复不准确的难题,该问题源于监督鸿沟:全身姿态估计器在全身数据集上训练,手部多样性有限;而手部姿态估计器在手部数据集上训练,虽擅长手指细节但缺乏全局身体感知。
Result: 大量实验表明,Hand4Whole++显著提升了手部姿态的准确性,并增强了整体全身姿态的质量。
Insight: 核心创新点是提出了轻量级的条件手部调制器(CHAM)模块,通过特征调制将手部细节与全局身体推理相结合,并利用可微分刚性对齐将手部姿态估计器预测的手指关节和手部形状整合到全身网格中,实现了模块化优势互补。
Abstract: Accurately recovering hand poses within the body context remains a major challenge in 3D whole-body pose estimation. This difficulty arises from a fundamental supervision gap: whole-body pose estimators are trained on full-body datasets with limited hand diversity, while hand-only estimators, trained on hand-centric datasets, excel at detailed finger articulation but lack global body awareness. To address this, we propose Hand4Whole++, a modular framework that leverages the strengths of both pre-trained whole-body and hand pose estimators. We introduce CHAM (Conditional Hands Modulator), a lightweight module that modulates the whole-body feature stream using hand-specific features extracted from a pre-trained hand pose estimator. This modulation enables the whole-body model to predict wrist orientations that are both accurate and coherent with the upper-body kinematic structure, without retraining the full-body model. In parallel, we directly incorporate finger articulations and hand shapes predicted by the hand pose estimator, aligning them to the full-body mesh via differentiable rigid alignment. This design allows Hand4Whole++ to combine globally consistent body reasoning with fine-grained hand detail. Extensive experiments demonstrate that Hand4Whole++ substantially improves hand accuracy and enhances overall full-body pose quality.
[197] Automated Diabetic Screening via Anterior Segment Ocular Imaging: A Deep Learning and Explainable AI Approach cs.CVPDF
Hasaan Maqsood, Saif Ur Rehman Khan, Sebastian Vollmer, Andreas Dengel, Muhammad Nabeel Asim
TL;DR: 本文提出了一种利用眼前段图像进行自动化糖尿病筛查的深度学习系统,旨在替代传统依赖眼底照相的筛查方法。该系统通过分析虹膜、巩膜和结膜中的可见生物标志物来分类正常、控制良好和控制不佳的糖尿病患者,并采用专门的预处理流程和自监督学习来提升模型性能。
Details
Motivation: 传统糖尿病视网膜病变筛查依赖眼底照相,需要专业设备和技能,在初级护理和资源有限的环境中难以普及。本文旨在开发一种利用更易获取的眼前段图像进行糖尿病分类的自动化系统。
Result: 在包含2640张临床标注的眼前段图像数据集上,采用自监督学习的EfficientNet-V2-S模型取得了最佳性能,F1分数达到98.21%,精确度为97.90%,召回率为98.55%,显著优于仅使用ImageNet初始化的模型(F1分数94.63%)。其中,对正常类别的分类精确度接近100%。
Insight: 创新点在于将糖尿病筛查从眼底照相转向更易获取的眼前段图像,并利用领域特定的自监督学习(SimCLR)和结合消除镜面反射与CLAHE的预处理流程来增强关键的细微血管和纹理模式,从而显著提升模型性能,特别是对正常类别的高精确度有助于减少不必要的临床转诊。
Abstract: Diabetic retinopathy screening traditionally relies on fundus photography, requiring specialized equipment and expertise often unavailable in primary care and resource limited settings. We developed and validated a deep learning (DL) system for automated diabetic classification using anterior segment ocular imaging a readily accessible alternative utilizing standard photography equipment. The system leverages visible biomarkers in the iris, sclera, and conjunctiva that correlate with systemic diabetic status. We systematically evaluated five contemporary architectures (EfficientNet-V2-S with self-supervised learning (SSL), Vision Transformer, Swin Transformer, ConvNeXt-Base, and ResNet-50) on 2,640 clinically annotated anterior segment images spanning Normal, Controlled Diabetic, and Uncontrolled Diabetic categories. A tailored preprocessing pipeline combining specular reflection mitigation and contrast limited adaptive histogram equalization (CLAHE) was implemented to enhance subtle vascular and textural patterns critical for classification. SSL using SimCLR on domain specific ocular images substantially improved model performance.EfficientNet-V2-S with SSL achieved optimal performance with an F1-score of 98.21%, precision of 97.90%, and recall of 98.55% a substantial improvement over ImageNet only initialization (94.63% F1). Notably, the model attained near perfect precision (100%) for Normal classification, critical for minimizing unnecessary clinical referrals.
[198] A Skill-augmented Agentic Framework and Benchmark for Multi-Video Understanding cs.CVPDF
Yue Zhang, Liqiang Jing, Jia Li, Yapeng Tian, Xinya Du
TL;DR: 本文提出了MVX-Bench多视频跨维度基准测试,将11个经典计算机视觉任务统一为多视频问答框架,并设计了SAMA技能增强代理框架,通过整合视觉工具、任务特定技能和冲突感知验证机制,实现迭代式结构化推理,以提升多视频理解能力。
Details
Motivation: 现有多模态大语言模型在单视频理解上表现良好,但在多视频推理方面能力有限,现有方法通常简单拼接视频导致训练-推理不匹配、帧压缩信息丢失和缺乏显式跨视频协调,且当前多视频基准主要关注事件级比较,忽视了身份级匹配、细粒度判别和结构化多步推理。
Result: 实验结果表明,SAMA在MVX-Bench上优于强大的开源基线和GPT模型,消融实验验证了技能设计和冲突解决机制的有效性。
Insight: 创新点包括将多种视觉任务统一到多视频问答框架的基准构建,以及通过技能增强和冲突感知验证实现结构化迭代推理的代理框架设计,为多视频理解提供了系统化的评估方法和推理范式。
Abstract: Multimodal Large Language Models have achieved strong performance in single-video understanding, yet their ability to reason across multiple videos remains limited. Existing approaches typically concatenate multiple videos into a single input and perform direct inference, which introduces training-inference mismatch, information loss from frame compression, and a lack of explicit cross-video coordination. Meanwhile, current multi-video benchmarks primarily emphasize event-level comparison, leaving identity-level matching, fine-grained discrimination, and structured multi-step reasoning underexplored. To address these gaps, we introduce MVX-Bench, a Multi-Video Cross-Dimension Benchmark that reformulates 11 classical computer vision tasks into a unified multi-video question-answering framework, comprising 1,442 questions over 4,255 videos from diverse real-world datasets. We further propose SAMA, a Skill-Augmented Agentic Framework for Multi-Video Understanding, which integrates visual tools, task-specific skills, and a conflict-aware verification mechanism to enable iterative and structured reasoning. Experimental results show that SAMA outperforms strong open-source baselines and GPT on MVX-Bench, and ablations validate the effectiveness of skill design and conflict resolution.
[199] Face-Guided Sentiment Boundary Enhancement for Weakly-Supervised Temporal Sentiment Localization cs.CVPDF
Cailing Han, Zhangbin Li, Jinxing Zhou, Wei Qian, Jingjing Hu
TL;DR: 本文提出了一种名为FSENet的统一框架,用于解决点级弱监督时序情感定位(P-WTSL)中情感边界不精确的挑战。该框架通过引入面部引导的情感发现模块、点感知情感语义对比策略和边界感知情感伪标签生成方法,利用细粒度面部特征来增强情感定位的准确性。
Details
Motivation: 点级弱监督时序情感定位旨在利用时间戳情感标注来检测未修剪多模态视频中的情感相关片段,以减少昂贵的帧级标注成本。然而,现有方法面临情感边界不精确的问题,因此需要一种能够利用面部特征来引导和增强边界检测的解决方案。
Result: 在基准测试上进行的广泛实验和可视化结果表明,该框架在全监督、视频级和点级弱监督设置下均达到了最先进的性能,展示了FSENet在不同标注设置下的强大泛化能力。
Insight: 创新点包括:1)通过双分支建模将面部特征整合到多模态交互中,以捕捉有效的情感刺激线索;2)利用对比学习区分标注点附近候选点的情感语义,增强模型识别边界的能力;3)将稀疏点标注转换为时间平滑的监督伪标签,提供更连续的监督信号。从客观角度看,该方法通过细粒度面部特征和对比学习的结合,有效提升了弱监督下情感边界定位的精度。
Abstract: Point-level weakly-supervised temporal sentiment localization (P-WTSL) aims to detect sentiment-relevant segments in untrimmed multimodal videos using timestamp sentiment annotations, which greatly reduces the costly frame-level labeling. To further tackle the challenges of imprecise sentiment boundaries in P-WTSL, we propose the Face-guided Sentiment Boundary Enhancement Network (\textbf{FSENet}), a unified framework that leverages fine-grained facial features to guide sentiment localization. Specifically, our approach \textit{first} introduces the Face-guided Sentiment Discovery (FSD) module, which integrates facial features into multimodal interaction via dual-branch modeling for effective sentiment stimuli clues; We \textit{then} propose the Point-aware Sentiment Semantics Contrast (PSSC) strategy to discriminate sentiment semantics of candidate points (frame-level) near annotation points via contrastive learning, thereby enhancing the model’s ability to recognize sentiment boundaries. At \textit{last}, we design the Boundary-aware Sentiment Pseudo-label Generation (BSPG) approach to convert sparse point annotations into temporally smooth supervisory pseudo-labels. Extensive experiments and visualizations on the benchmark demonstrate the effectiveness of our framework, achieving state-of-the-art performance under full supervision, video-level, and point-level weak supervision, thereby showcasing the strong generalization ability of our FSENet across different annotation settings.
[200] Mind-of-Director: Multi-modal Agent-Driven Film Previsualization via Collaborative Decision-Making cs.CVPDF
Shufeng Nan, Mengtian Li, Sixiao Zheng, Yuwei Lu, Han Zhang
TL;DR: 本文提出了Mind-of-Director,一个用于电影预可视化(previz)的多模态智能体驱动框架。该框架模拟电影制作团队的协作决策过程,通过协调多个专业化智能体,在游戏引擎中根据创意构思自动生成预可视化序列。
Details
Motivation: 解决传统电影预可视化流程耗时、费力且高度依赖专业人力的问题,旨在通过多智能体协作实现自动化、高质量的快速原型生成。
Result: 实验和人工评估表明,该系统能生成高质量、语义一致的预可视化序列,平均每个创意耗时约25分钟,证明了其在自动化原型制作和人机协同电影制作中的有效性。
Insight: 创新点在于将电影预可视化分解为剧本开发、虚拟场景设计、角色行为控制和摄像机规划四个协作模块,并通过游戏引擎中的实时视觉编辑系统实现跨场景、行为和摄像机的交互式检查与同步时间线调整,是多智能体系统在创意内容生成领域的一次应用探索。
Abstract: We present Mind-of-Director, a multi-modal agent-driven framework for film previz that models the collaborative decision-making process of a film production team. Given a creative idea, Mind-of-Director orchestrates multiple specialized agents to produce previz sequences within the game engine. The framework consists of four cooperative modules: Script Development, where agents draft and refine the screenplay iteratively; Virtual Scene Design, which transforms text into semantically aligned 3D environments; Character Behaviour Control, which determines character blocking and motion; and Camera Planning, which optimizes framing, movement, and composition for cinematic camera effects. A real-time visual editing system built in the game engine further enables interactive inspection and synchronized timeline adjustment across scenes, behaviours, and cameras. Extensive experiments and human evaluations show that Mind-of-Director generates high-quality, semantically grounded previz sequences in approximately 25 minutes per idea, demonstrating the effectiveness of agent collaboration for both automated prototyping and human-in-the-loop filmmaking.
[201] Face-to-Face: A Video Dataset for Multi-Person Interaction Modeling cs.CV | cs.LGPDF
Ernie Chu, Vishal M. Patel
TL;DR: 该论文提出了一个名为Face-to-Face with Jimmy Fallon (F2F-JF)的视频数据集,用于研究双人对话中的反应性时序建模。该数据集包含70小时、14000个片段,记录了脱口秀节目中主持人和嘉宾的互动。论文还展示了一个应用案例,即利用该数据集训练一个基于扩散模型的语音驱动数字人像生成系统,该系统能根据嘉宾的视频和主持人的音频生成主持人的反应视频。
Details
Motivation: 现有视听数据集大多描绘孤立发言者的简短独白,难以对人类对话的反应性时序进行建模。为了解决这个问题,需要构建一个能保留对话者之间顺序依赖性的双人互动数据集。
Result: 在基于该数据集构建的语音驱动数字人像生成任务中,使用跨人物视觉上下文(嘉宾视频)作为条件的MultiTalk风格扩散模型,相比仅使用音频的基线模型,在Emotion-FID和FVD指标上取得了小幅但一致的提升,同时保持了唇部同步质量。
Insight: 主要创新点在于构建了一个专门用于研究双人顺序行为、保留对话反应性时序依赖的大规模、高质量视频数据集。其半自动处理流程结合了多人跟踪、语音分割和人工验证,为下游建模提供了对齐良好的数据。这为研究人际互动建模提供了一个端到端的蓝图。
Abstract: Modeling the reactive tempo of human conversation remains difficult because most audio-visual datasets portray isolated speakers delivering short monologues. We introduce \textbf{Face-to-Face with Jimmy Fallon (F2F-JF)}, a 70-hour, 14k-clip dataset of two-person talk-show exchanges that preserves the sequential dependency between a guest turn and the host’s response. A semi-automatic pipeline combines multi-person tracking, speech diarization, and lightweight human verification to extract temporally aligned host/guest tracks with tight crops and metadata that are ready for downstream modeling. We showcase the dataset with a reactive, speech-driven digital avatar task in which the host video during $[t_1,t_2]$ is generated from their audio plus the guest’s preceding video during $[t_0,t_1]$. Conditioning a MultiTalk-style diffusion model on this cross-person visual context yields small but consistent Emotion-FID and FVD gains while preserving lip-sync quality relative to an audio-only baseline. The dataset, preprocessing recipe, and baseline together provide an end-to-end blueprint for studying dyadic, sequential behavior, which we expand upon throughout the paper. Dataset and code will be made publicly available.
[202] HiMemVLN: Enhancing Reliability of Open-Source Zero-Shot Vision-and-Language Navigation with Hierarchical Memory System cs.CV | cs.ROPDF
Kailin Lyu, Kangyi Wu, Pengna Li, Xiuyu Hu, Qingyi Si
TL;DR: 本文提出HiMemVLN,一种通过引入分层记忆系统来增强开源多模态大模型在视觉语言导航任务中可靠性的方法。该方法旨在解决导航遗忘问题,从而提升零样本导航性能,在仿真和真实环境实验中,其性能达到开源SOTA方法的两倍。
Details
Motivation: 当前基于LLM的零样本视觉语言导航方法主要依赖闭源模型,存在高token成本和数据泄露风险。虽然已有研究尝试使用开源LLM结合时空思维链框架,但其性能仍远逊于闭源模型。本文通过分析导航过程,识别出导致导航失败并加剧开源与闭源方法性能差距的关键问题——导航遗忘。
Result: 在仿真和真实环境中的大量实验表明,HiMemVLN的性能达到了开源最先进方法(open-source state-of-the-art)的近两倍。
Insight: 论文宣称的创新点在于识别了导航遗忘问题,并提出了一个集成到多模态大模型中的分层记忆系统,以增强视觉感知回忆和长期定位能力。从客观角度看,其核心创新是将记忆机制系统性地引入开源模型驱动的VLN任务,以解决长期依赖和状态遗忘问题,从而弥合与闭源模型的性能差距。
Abstract: LLM-based agents have demonstrated impressive zero-shot performance in vision-language navigation (VLN) tasks. However, most zero-shot methods primarily rely on closed-source LLMs as navigators, which face challenges related to high token costs and potential data leakage risks. Recent efforts have attempted to address this by using open-source LLMs combined with a spatiotemporal CoT framework, but they still fall far short compared to closed-source models. In this work, we identify a critical issue, Navigation Amnesia, through a detailed analysis of the navigation process. This issue leads to navigation failures and amplifies the gap between open-source and closed-source methods. To address this, we propose HiMemVLN, which incorporates a Hierarchical Memory System into a multimodal large model to enhance visual perception recall and long-term localization, mitigating the amnesia issue and improving the agent’s navigation performance. Extensive experiments in both simulated and real-world environments demonstrate that HiMemVLN achieves nearly twice the performance of the open-source state-of-the-art method. The code is available at https://github.com/lvkailin0118/HiMemVLN.
[203] RAZOR: Ratio-Aware Layer Editing for Targeted Unlearning in Vision Transformers and Diffusion Models cs.CV | cs.AIPDF
Ravi Ranjan, Utkarsh Grover, Xiaomin Lin, Agoritsa Polyzou
TL;DR: 本文提出了一种名为RAZOR的轻量级、模型无关的遗忘学习框架,用于在基于Transformer的视觉和扩散模型中高效移除特定信息。该方法通过识别对遗忘目标数据贡献最大的层和注意力头,并对其进行协调编辑,实现了精确的遗忘,同时保持了模型的整体性能。
Details
Motivation: 动机在于解决基于Transformer的扩散模型和视觉语言模型在安全性和合规性方面的一个核心挑战:如何在不重新训练的情况下,高效地移除模型中不良或敏感的信息。
Result: 在CLIP、Stable Diffusion和视觉语言模型上,使用涵盖身份、风格和对象擦除任务的遗忘基准进行评估。结果表明,RAZOR实现了高度准确和稳定的遗忘,即使在量化条件下也表现良好,在保留性能和效率方面优于先前方法,且运行速度显著更快。
Insight: 创新点在于提出了比率感知的零/一步优化保留性遗忘框架,通过协调编辑Transformer骨干网络中的多层和多头注意力,实现了精确、可控的遗忘。其核心洞察是识别并选择性编辑对遗忘目标最关键的网络组件,从而在避免过度编辑和性能损害的同时,实现高效的定向遗忘。
Abstract: Transformer based diffusion and vision-language models have achieved remarkable success; yet, efficiently removing undesirable or sensitive information without retraining remains a central challenge for model safety and compliance. We introduce Ratio-Aware Zero/One-step Optimized Retentive unlearning (RAZOR), a lightweight, model-agnostic unlearning framework that generalizes forgetting updates to coordinated multi-layer and multi-head edits within transformer backbones. RAZOR identifies the most important layers and attention heads by measuring how much they contribute to forgetting the target data while preserving useful knowledge. Then, it updates these parts of the model using a carefully regularized rule to avoid harming overall performance. The set of edited components grows gradually, ensuring precise unlearning without over-editing or damaging unrelated capabilities. We evaluate RAZOR on CLIP, Stable Diffusion, and vision-language models (VLMs) using widely adopted unlearning benchmarks covering identity, style, and object erasure tasks. Our results show that RAZOR achieves highly accurate and stable forgetting, even under quantization. This approach offers stronger retention and better efficiency than prior methods. Notably, it also operates significant faster than conventional techniques. These results demonstrate that RAZOR is a practical and scalable solution for safe, adaptive unlearning in transformer-based vision models.
[204] RadarXFormer: Robust Object Detection via Cross-Dimension Fusion of 4D Radar Spectra and Images for Autonomous Driving cs.CVPDF
Yue Sun, Yeqiang Qian, Zhe Wang, Tianhui Li, Chunxiang Wang
TL;DR: 本文提出RadarXFormer,一种用于自动驾驶的3D目标检测框架,通过4D雷达频谱与RGB图像的高效跨维度(3D-2D)融合,提升在恶劣天气和光照条件下的感知鲁棒性。
Details
Motivation: 解决现有基于相机和激光雷达的感知系统在恶劣天气和光照条件下性能下降的问题,以及传统3D雷达缺乏高度分辨率和稀疏性、新兴4D雷达存在信号噪声和数据量大等挑战。
Result: 在K-Radar数据集上的实验表明,该方法在具有挑战性的条件下提高了检测精度和鲁棒性,同时保持了实时推理能力。
Insight: 创新点在于直接利用原始雷达频谱构建高效的3D表示以减少数据量并保留完整3D空间信息,以及提出的跨维度(3D-2D)融合机制,将多尺度3D球形雷达特征立方体与互补的2D图像特征图进行融合。
Abstract: Reliable perception is essential for autonomous driving systems to operate safely under diverse real-world traffic conditions. However, camera- and LiDAR-based perception systems suffer from performance degradation under adverse weather and lighting conditions, limiting their robustness and large-scale deployment in intelligent transportation systems. Radar-vision fusion provides a promising alternative by combining the environmental robustness and cost efficiency of millimeter-wave (mmWave) radar with the rich semantic information captured by cameras. Nevertheless, conventional 3D radar measurements lack height resolution and remain highly sparse, while emerging 4D mmWave radar introduces elevation information but also brings challenges such as signal noise and large data volume. To address these issues, this paper proposes RadarXFormer, a 3D object detection framework that enables efficient cross-modal fusion between 4D radar spectra and RGB images. Instead of relying on sparse radar point clouds, RadarXFormer directly leverages raw radar spectra and constructs an efficient 3D representation that reduces data volume while preserving complete 3D spatial information. The “X” highlights the proposed cross-dimension (3D-2D) fusion mechanism, in which multi-scale 3D spherical radar feature cubes are fused with complementary 2D image feature maps. Experiments on the K-Radar dataset demonstrate improved detection accuracy and robustness under challenging conditions while maintaining real-time inference capability.
[205] Two Birds, One Projection: Harmonizing Safety and Utility in LVLMs via Inference-time Feature Projection cs.CV | cs.AIPDF
Yewon Han, Yumin Seol, EunGyung Kong, Minsoo Jo, Taesup Kim
TL;DR: 本文提出了一种名为’Two Birds, One Projection’的高效推理时防御方法,通过将跨模态特征投影到已识别偏置方向的零空间来移除有害成分,从而同时提升大型视觉语言模型的安全性和通用任务性能。
Details
Motivation: 现有的大型视觉语言模型越狱防御框架通常面临安全性与实用性之间的权衡,增强安全性往往会无意中损害其在通用视觉推理任务上的性能。本文旨在探究安全性与实用性是否本质上是相互冲突的目标。
Result: 该方法在多个基准测试上同时提升了模型的安全性和实用性,有效打破了传统的权衡,且仅需一次前向传播。
Insight: 创新点在于识别了跨数据集一致存在的模态诱导偏置方向,该方向源于大语言模型主干与视觉编码器之间的次优耦合,并损害了安全和实用任务。基于此,提出在推理时通过特征投影移除该偏置成分,实现了安全与效用的协同提升。
Abstract: Existing jailbreak defence frameworks for Large Vision-Language Models often suffer from a safety utility tradeoff, where strengthening safety inadvertently degrades performance on general visual-grounded reasoning tasks. In this work, we investigate whether safety and utility are inherently antagonistic objectives. We focus on a modality induced bias direction consistently observed across datasets, which arises from suboptimal coupling between the Large Language Model backbone and visual encoders. We further demonstrate that this direction undermines performance on both tasks. Leveraging this insight, we propose Two Birds, One Projection, an efficient inference time jailbreak defence that projects cross-modal features onto the null space of the identified bias direction to remove the corresponding components. Requiring only a single forward pass, our method effectively breaks the conventional tradeoff, simultaneously improving both safety and utility across diverse benchmarks.
[206] SemanticFace: Semantic Facial Action Estimation via Semantic Distillation in Interpretable Space cs.CVPDF
Zejian Kang, Kai Zheng, Yuanchen Fei, Wentao Yang, Hongyuan Zou
TL;DR: 本文提出SemanticFace框架,通过语义蒸馏在可解释空间中进行面部动作估计,将系数预测重新定义为结构化语义推理,利用多模态大语言模型从图像预测可解释的ARKit混合形状系数。
Details
Motivation: 现有单图像面部动作估计方法通常预测紧凑表达空间参数,缺乏明确的语义可解释性,而实际应用如虚拟化身控制和人机交互需要对应有意义肌肉运动的可解释面部动作。
Result: 大量实验表明,语言对齐的语义监督提高了系数准确性和感知一致性,同时实现了强大的跨身份泛化能力以及对包括卡通面孔在内的大域偏移的鲁棒性。
Insight: 创新点在于将面部动作估计重新定义为结构化语义推理,采用两阶段语义蒸馏范式,从真实ARKit系数推导结构化语义监督,并将知识蒸馏到多模态大语言模型中,以提升可解释性和泛化能力。
Abstract: Facial action estimation from a single image is often formulated as predicting or fitting parameters in compact expression spaces, which lack explicit semantic interpretability. However, many practical applications, such as avatar control and human-computer interaction, require interpretable facial actions that correspond to meaningful muscle movements. In this work, we propose \textbf{SemanticFace}, a framework for facial action estimation in the interpretable ARKit blendshape space that reformulates coefficient prediction as structured semantic reasoning. SemanticFace adopts a two-stage semantic distillation paradigm: it first derives structured semantic supervision from ground-truth ARKit coefficients and then distills this knowledge into a multimodal large language model to predict interpretable facial action coefficients from images. Extensive experiments demonstrate that language-aligned semantic supervision improves both coefficient accuracy and perceptual consistency, while enabling strong cross-identity generalization and robustness to large domain shifts, including cartoon faces.
[207] Halfway to 3D: Ensembling 2.5D and 3D Models for Robust COVID-19 CT Diagnosis cs.CV | cs.LGPDF
Tuan-Anh Yang, Bao V. Q. Bui, Chanh-Quang Vo-Van, Truong-Son Hy
TL;DR: 本文提出了一种用于COVID-19检测和疾病分类的深度学习框架,通过集成处理2.5D多视图切片的DINOv3视觉Transformer分支和采用VREx预训练与对比学习的3D ResNet-18分支,结合了切片级和体积信息,在PHAROS-AIF-MIH基准上实现了优异的性能。
Details
Motivation: 解决从胸部CT扫描中稳健地进行COVID-19检测和疾病分类的问题,旨在通过结合2.5D(切片级)和3D(体积)表示来捕获互补信息,以提高模型在多源医学影像分析中的鲁棒性。
Result: 在PHAROS-AIF-MIH基准测试中,集成模型在二元COVID-19检测上达到94.48%准确率和0.9426 Macro F1分数,优于各分支模型;在多类疾病分类中,2.5D DINOv3模型以79.35%准确率和0.7497 Macro F1分数取得最佳性能。
Insight: 创新点在于将预训练的2.5D切片级特征提取(使用DINOv3 ViT)与针对跨源鲁棒性优化的3D体积建模(使用VREx和对比学习预训练的ResNet-18)相结合,并通过logit级集成进行决策,为多源医学影像分析提供了稳健的解决方案。
Abstract: We propose a deep learning framework for COVID-19 detection and disease classification from chest CT scans that integrates both 2.5D and 3D representations to capture complementary slice-level and volumetric information. The 2.5D branch processes multi-view CT slices (axial, coronal, sagittal) using a DINOv3 vision transformer to extract robust visual features, while the 3D branch employs a ResNet-18 architecture to model volumetric context and is pretrained with Variance Risk Extrapolation (VREx) followed by supervised contrastive learning to improve cross-source robustness. Predictions from both branches are combined through logit-level ensemble inference. Experiments on the PHAROS-AIF-MIH benchmark demonstrate the effectiveness of the proposed approach: for binary COVID-19 detection, the ensemble achieves 94.48% accuracy and a 0.9426 Macro F1-score, outperforming both individual models, while for multi-class disease classification the 2.5D DINOv3 model achieves the best performance with 79.35% accuracy and a 0.7497 Macro F1-score. These results highlight the benefit of combining pretrained slice-based representations with volumetric modeling for robust multi-source medical imaging analysis. Code is available at https://github.com/HySonLab/PHAROS-AIF-MIH
[208] DamageArbiter: A CLIP-Enhanced Multimodal Arbitration Framework for Hurricane Damage Assessment from Street-View Imagery cs.CVPDF
Yifan Yang, Lei Zou, Wenjing Gong, Kani Fu, Zongrong Li
TL;DR: 该研究提出了一个名为DamageArbiter的多模态仲裁框架,该框架利用CLIP模型增强,旨在通过整合单模态和多模态模型的互补优势,并采用轻量级逻辑回归元分类器来仲裁预测分歧,从而提高从街景图像进行飓风损害评估的准确性、可解释性和鲁棒性。
Details
Motivation: 传统计算机视觉模型在用于快速、超本地化的灾害损害评估时,往往像黑盒一样缺乏可解释性和可靠性,因此需要一种更可靠、可解释的框架来改进基于街景图像的损害估计。
Result: 在包含2,556张灾后街景图像的数据集上,DamageArbiter将准确率从最强的单模态基线模型(ViT-B/32,仅图像)的74.33%提升至82.79%,实现了8.46%的绝对提升,并超过了80%的准确率阈值,优于所有基线模型。
Insight: 该工作的核心创新点在于提出了一个基于分歧驱动的多模态仲裁框架,通过仲裁单模态与多模态预测之间的差异,不仅提升了整体准确率,还缓解了视觉模型在灾害视觉线索模糊或受干扰情况下的常见过度自信错误,从而将基于街景的灾害评估从粗略的严重性分类推向更可靠和可解释的框架。
Abstract: Analyzing street-view imagery with computer vision models for rapid, hyperlocal damage assessment is becoming popular and valuable in emergency response and recovery, but traditional models often act like black boxes, lacking interpretability and reliability. This study proposes a multimodal disagreement-driven Arbitration framework powered by Contrastive Language-Image Pre-training (CLIP) models, DamageArbiter, to improve the accuracy, interpretability, and robustness of damage estimation from street-view imagery. DamageArbiter leverages the complementary strengths of unimodal and multimodal models, employing a lightweight logistic regression meta-classifier to arbitrate cases of disagreement. Using 2,556 post-disaster street-view images, paired with both manually generated and large language model (LLM)-generated text descriptions, we systematically compared the performance of unimodal models (including image-only and text-only models), multimodal CLIP-based models, and DamageArbiter. Notably, DamageArbiter improved the accuracy from 74.33% (ViT-B/32, image-only) to 82.79%, surpassing the 80% accuracy threshold and achieving an absolute improvement of 8.46% compared to the strongest baseline model. Beyond improvements in overall accuracy, compared to visual models relying solely on images, DamageArbiter, through arbitration of discrepancies between unimodal and multimodal predictions, mitigates common overconfidence errors in visual models, especially in situations where disaster visual cues are ambiguous or subject to interference, reducing overconfidence but incorrect predictions. We further mapped and analyzed geo-referenced predictions and misclassifications to compare model performance across locations. Overall, this work advances street-view-based disaster assessment from coarse severity classification toward a more reliable and interpretable framework.
[209] AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving cs.CV | cs.ROPDF
Wenhui Huang, Songyan Zhang, Qihang Huang, Zhidong Wang, Zhiqi Mao
TL;DR: 本文提出了AutoMoT,一个用于端到端自动驾驶的统一视觉-语言-动作模型。它采用异步混合Transformer架构,通过联合注意力共享和异步执行,解决了现有方法在推理与动作空间分布不对齐、预训练VLM通用推理能力利用不足以及推理延迟高的问题。
Details
Motivation: 现有将视觉语言模型集成到端到端自动驾驶系统中的方法存在局限性:难以解决推理与动作空间之间的分布不对齐问题,未能充分利用预训练VLM的通用推理能力,或在生成动作策略时产生高推理延迟,从而影响驾驶性能。
Result: 在多个基准测试(包括开环和闭环设置)上的广泛实验表明,AutoMoT相比最先进方法取得了有竞争力的性能。研究还发现,仅通过语义提示,预训练VLM就能在多任务场景理解上达到有竞争力的性能,但对于决策和轨迹规划等动作级任务,微调仍然是必需的。
Insight: 核心创新点在于提出了一个统一的VLA模型架构,通过异步混合Transformer和联合注意力共享机制,在保持预训练VLM通用推理能力的同时,实现了高效的高低频任务异步推理。这为理解预训练VLM在AD任务中的功能边界(何时需要微调)提供了新的见解。
Abstract: Integrating vision-language models (VLMs) into end-to-end (E2E) autonomous driving (AD) systems has shown promise in improving scene understanding. However, existing integration strategies suffer from several limitations: they either struggle to resolve distribution misalignment between reasoning and action spaces, underexploit the general reasoning capabilities of pretrained VLMs, or incur substantial inference latency during action policy generation, which degrades driving performance. To address these challenges, we propose \OURS in this work, an end-to-end AD framework that unifies reasoning and action generation within a single vision-language-action (VLA) model. Our approach leverages a mixture-of-transformer (MoT) architecture with joint attention sharing, which preserves the general reasoning capabilities of pre-trained VLMs while enabling efficient fast-slow inference through asynchronous execution at different task frequencies. Extensive experiments on multiple benchmarks, under both open- and closed-loop settings, demonstrate that \OURS achieves competitive performance compared to state-of-the-art methods. We further investigate the functional boundary of pre-trained VLMs in AD, examining when AD-tailored fine-tuning is necessary. Our results show that pre-trained VLMs can achieve competitive multi-task scene understanding performance through semantic prompting alone, while fine-tuning remains essential for action-level tasks such as decision-making and trajectory planning. We refer to \href{https://automot-website.github.io/}{Project Page} for the demonstration videos and qualitative results.
[210] Video Detector: A Dual-Phase Vision-Based System for Real-Time Traffic Intersection Control and Intelligent Transportation Analysis cs.CV | cs.AIPDF
Mustafa Fatih Şen, Halûk Gümüşkaya, Şenol Pazar
TL;DR: 该论文提出了一种名为Video Detector (VD)的双阶段视觉交通路口管理系统,它集成了实时路口控制模块(VD-RT)和离线交通行为分析模块(VD-Offline),旨在作为传统感应线圈检测器的灵活且经济高效的替代方案。该系统在包含108,000张标注图像的数据集上训练了三种目标检测模型配置,实现了高达90%的测试准确率和29.5 mAP@0.5的检测性能,并在高清视频流上保持37 FPS的实时吞吐量。
Details
Motivation: 解决城市交通管理对智能感知系统的需求,这些系统需要适应动态交通条件,同时避免昂贵的基础设施改造。
Result: 实验结果显示,在自定义数据集上,系统检测性能达到90%测试准确率和29.5 mAP@0.5,并在高清视频流上实现37 FPS的实时处理速度。在伊斯坦布尔的实际部署中,系统在不同环境条件下运行稳定。
Insight: 创新点在于提出了一种双阶段(实时控制与离线分析)的集成视觉框架,支持虚拟线圈检测、车辆计数、多目标跟踪、队列估计、速度分析和多类别车辆分类等多种功能,无需嵌入式道路传感器,实现了全面的路口监控。同时,公开了标注数据集和训练流程以支持可复现性。
Abstract: Urban traffic management increasingly requires intelligent sensing systems capable of adapting to dynamic traffic conditions without costly infrastructure modifications. Vision-based vehicle detection has therefore become a key technology for modern intelligent transportation systems. This study presents Video Detector (VD), a dual-phase vision-based traffic intersection management system designed as a flexible and cost-effective alternative to traditional inductive loop detectors. The framework integrates a real-time module (VD-RT) for intersection control with an offline analytical module (VD-Offline) for detailed traffic behavior analysis. Three system configurations were implemented using SSD Inception v2, Faster R-CNN Inception v2, and CenterNet ResNet-50 V1 FPN, trained on datasets totaling 108,000 annotated images across 6-10 vehicle classes. Experimental results show detection performance of up to 90% test accuracy and 29.5 mAP@0.5, while maintaining real-time throughput of 37 FPS on HD video streams. Field deployments conducted in collaboration with Istanbul IT and Smart City Technologies Inc. (ISBAK) demonstrate stable operation under diverse environmental conditions. The system supports virtual loop detection, vehicle counting, multi-object tracking, queue estimation, speed analysis, and multiclass vehicle classification, enabling comprehensive intersection monitoring without the need for embedded road sensors. The annotated dataset and training pipeline are publicly released to support reproducibility. These results indicate that the proposed framework provides a scalable and deployable vision-based solution for intelligent transportation systems and smart-city traffic management.
[211] RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation cs.CVPDF
Linfei Li, Lin Zhang, Ying Shen
TL;DR: 该论文提出了RealVLG框架,包含RealVLG-11B大规模真实世界视觉-语言对齐数据集和RealVLG-R1模型,旨在统一视觉-语言对齐与机器人抓取任务,支持基于语言指令的细粒度目标定位与操作。
Details
Motivation: 现有视觉-语言对齐方法侧重于粗粒度的物体级定位,而传统机器人抓取方法主要依赖几何线索且缺乏语言引导,限制了其在语言驱动操作场景中的应用。
Result: 实验结果表明,RealVLG支持在未见过的真实世界环境中进行零样本感知与操作,建立了一个统一的语义-视觉多模态基准。
Insight: 创新点在于构建了包含多粒度标注(如边界框、分割掩码、抓取姿态、接触点及细粒度语言描述)的大规模真实世界数据集,并采用基于预训练大视觉语言模型的强化微调方法,以统一方式预测多种输出,为语言驱动的机器人感知与策略学习提供了综合平台。
Abstract: Visual-language grounding aims to establish semantic correspondences between natural language and visual entities, enabling models to accurately identify and localize target objects based on textual instructions. Existing VLG approaches focus on coarse-grained, object-level localization, while traditional robotic grasping methods rely predominantly on geometric cues and lack language guidance, which limits their applicability in language-driven manipulation scenarios. To address these limitations, we propose the RealVLG framework, which integrates the RealVLG-11B dataset and the RealVLG-R1 model to unify real-world visual-language grounding and grasping tasks. RealVLG-11B dataset provides multi-granularity annotations including bounding boxes, segmentation masks, grasp poses, contact points, and human-verified fine-grained language descriptions, covering approximately 165,000 images, over 800 object instances, 1.3 million segmentation, detection, and language annotations, and roughly 11 billion grasping examples. Building on this dataset, RealVLG-R1 employs Reinforcement Fine-tuning on pretrained large-scale vision-language models to predict bounding boxes, segmentation masks, grasp poses, and contact points in a unified manner given natural language instructions. Experimental results demonstrate that RealVLG supports zero-shot perception and manipulation in real-world unseen environments, establishing a unified semantic-visual multimodal benchmark that provides a comprehensive data and evaluation platform for language-driven robotic perception and grasping policy learning. All data and code are publicly available at https://github.com/lif314/RealVLG-R1.
[212] LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models cs.CVPDF
Soumyaratna Debnath, Bui Duc Manh, Zinan Liu, Lin Wang
TL;DR: LLMind是一个受生物视觉启发的、无需训练的自适应视觉表示框架,通过模拟人眼的中央凹编码和皮层放大机制,在严格像素预算下为视觉语言模型(VLMs)提供高效且自适应的视觉表示。该框架包含生物启发自适应采样策略(BASS)和闭环语义反馈(CSF),在多个视觉问答基准上显著提升性能,且仅需极低像素比例即可接近全分辨率表现。
Details
Motivation: 现有视觉语言模型通常对视觉输入的所有区域采用均匀的空间保真度,即使是无信息区域也分配同等精度,这与人眼自适应、选择性和资源高效的视觉机制不符。论文旨在借鉴生物视觉原理,设计更高效、自适应的视觉表示方法。
Result: 在VQAv2、Seed-Bench和A-OKVQA等场景级和区域引导的视觉问答基准上,LLMind相比均匀采样基线平均提升分别达+20%、+38%和+37%。在仅使用1%、3%和5%像素的情况下,分别保留了全分辨率性能的82%、92%和97%。
Insight: 创新点包括:1)首次系统分析生物启发视觉表示方法,提出无需训练的自适应框架LLMind;2)引入BASS策略实现非均匀采样并保持全局场景结构;3)通过CSF利用测试时适应对齐感知显著性与文本信息。该方法轻量、即插即用,且无需修改现有VLM架构。
Abstract: Vision-Language Models (VLMs) typically assume a uniform spatial fidelity across the entire field of view of visual inputs, dedicating equal precision to even the uninformative regions. By contrast, human vision is neither uniform nor static; it is adaptive, selective, and resource-efficient. In light of this, we present the first systematic analysis of bio-inspired visual representation methods, providing insights for more efficient and adaptive VLMs. We propose LLMind (Looking Like the Mind), a novel training-free framework that mimics foveated encoding and cortical magnification in human vision to achieve adaptive, efficient representations for VLMs under tight pixel budgets. Our key idea is to explore a Bio-inspired Adaptive Sampling Strategy (BASS), enabling a Mobius-parameterized module that performs non-uniform sampling while preserving global scene structure. On top of BASS, we introduce closed-loop semantic feedback (CSF) via test-time adaptation to align perceptual saliency with textual information from the frozen VLM. We evaluate LLMind against uniform and other sampling baselines across diverse scene-level and region-guided visual question answering benchmarks. The results show dramatic gains, with average improvements of +20% on VQAv2, +38% on Seed-Bench, and +37% on A-OKVQA compared to uniform sampling under tight pixel budgets. More surprisingly, LLMind retains up to 82%, 92%, and 97% of the full-resolution performance using only 1%, 3%, and 5% of the pixels, respectively. Moreover, LLMind is lightweight, plug-and-play, and compatible with existing VLMs without requiring architectural changes.
[213] PASTE: Physics-Aware Scattering Topology Embedding Framework for SAR Object Detection cs.CVPDF
Jiacheng Chen, Yuxuan Xiong, Haipeng Wang
TL;DR: 本文提出了一种名为PASTE的物理感知散射拓扑嵌入框架,旨在将合成孔径雷达(SAR)图像中目标的电磁散射机制嵌入到现代目标检测框架中。该框架通过散射关键点生成与自动标注、散射拓扑注入模块以及散射先验监督策略,构建了一个从拓扑生成、注入到联合监督的闭环架构,从而将散射物理先验优雅地集成到SAR检测器中。
Details
Motivation: 当前基于深度学习的SAR目标检测方法主要借鉴光学图像方法,将目标视为纹理块,而忽略了固有的电磁散射机制。虽然已有研究利用散射点提升检测性能,但大多仍依赖基于幅度的统计模型,且现有引入频域信息的方法计算成本高、数据集兼容性差,因此如何有效将散射拓扑信息嵌入现代检测框架仍具挑战。
Result: 在真实数据集上的实验表明,PASTE兼容多种检测器,相较于基线模型,在可接受的计算开销下,相对mAP提升范围为2.9%至11.3%。散射图的可视化验证了PASTE成功将散射拓扑先验嵌入特征空间,清晰区分目标与背景散射区域,从而为结果提供了强可解释性。
Insight: 论文的创新点在于提出了一种闭环的物理感知散射拓扑嵌入框架,通过基于属性散射中心(ASC)模型的散射关键点生成与自动标注方案,实现了可扩展且物理一致的先验生成;同时设计了散射拓扑注入模块和先验监督策略,将散射物理机制直接融入特征学习和网络优化过程,提升了检测性能与模型可解释性。
Abstract: Current deep learning-based object detection for Synthetic Aperture Radar (SAR) imagery mainly adopts optical image methods, treating targets as texture patches while ignoring inherent electromagnetic scattering mechanisms. Though scattering points have been studied to boost detection performance, most methods still rely on amplitude-based statistical models. Some approaches introduce frequency-domain information for scattering center extraction, but they suffer from high computation cost and poor compatibility with diverse datasets. Thus, effectively embedding scattering topological information into modern detection frameworks remains challenging. To solve these problems, this paper proposes the Physics-Aware Scattering Topology Embedding Framework (PASTE), a novel closed-loop architecture for comprehensive scattering prior integration. By building the full pipeline from topology generation, injection to joint supervision, PASTE elegantly integrates scattering physics into modern SAR detectors. Specifically, it designs a scattering keypoint generation and automatic annotation scheme based on the Attributed Scattering Center (ASC) model to produce scalable and physically consistent priors. A scattering topology injection module guides multi-scale feature learning, and a scattering prior supervision strategy constrains network optimization by aligning predictions with scattering center distributions. Experiments on real datasets show that PASTE is compatible with various detectors and brings relative mAP gains of 2.9% to 11.3% over baselines with acceptable computation overhead. Visualization of scattering maps verifies that PASTE successfully embeds scattering topological priors into feature space, clearly distinguishing target and background scattering regions, thus providing strong interpretability for results.
[214] Balancing Saliency and Coverage: Semantic Prominence-Aware Budgeting for Visual Token Compression in VLMs cs.CVPDF
Jaehoon Lee, Mingi Jung, Soohyuk Jang, Seungryong Yoo, Dahuin Jung
TL;DR: 本文提出PromPrune,一种样本自适应的视觉令牌选择框架,用于解决大视觉语言模型(VLMs)中高分辨率视觉输入导致令牌数量过多、计算成本高的问题。该方法根据每个样本的语义显著性分布,动态地在局部显著性保留和全局覆盖之间分配令牌预算,实现高效压缩。
Details
Motivation: 现有视觉令牌压缩方法通常基于显著性、多样性或其固定组合进行压缩,但不同样本的语义显著性分布差异很大,导致局部显著性保留与全局覆盖之间的最优权衡点不同。静态压缩策略对所有样本使用相同策略可能是次优的,因此需要一种能自适应样本语义分布的压缩方法。
Result: 在LLaVA-NeXT-7B模型上,该方法在保持97.5%原始准确率的同时,将FLOPs降低了88%,并将预填充延迟减少了22%。
Insight: 创新点在于提出了语义显著性感知的预算分配机制和两阶段选择流程,使压缩策略能根据每个样本的语义分布动态调整,在局部显著区域和全局多样区域之间实现自适应平衡,从而在高压缩比下仍能维持强性能。
Abstract: Large Vision-Language Models (VLMs) achieve strong multimodal understanding capabilities by leveraging high-resolution visual inputs, but the resulting large number of visual tokens creates a major computational bottleneck. Recent work mitigates this issue through visual token compression, typically compressing tokens based on saliency, diversity, or a fixed combination of both. We observe that the distribution of semantic prominence varies substantially across samples, leading to different optimal trade-offs between local saliency preservation and global coverage. This observation suggests that applying a static compression strategy across all samples can be suboptimal. Motivated by this insight, we propose PromPrune, a sample-adaptive visual token selection framework composed of semantic prominence-aware budget allocation and a two-stage selection pipeline. Our method adaptively balances local saliency preservation and global coverage according to the semantic prominence distribution of each sample. By allocating token budgets between locally salient regions and globally diverse regions, our method maintains strong performance even under high compression ratios. On LLaVA-NeXT-7B, our approach reduces FLOPs by 88% and prefill latency by 22% while preserving 97.5% of the original accuracy.
[215] TopoVST: Toward Topology-fidelitous Vessel Skeleton Tracking cs.CVPDF
Yaoyu Liu, Minghui Zhang, Junjun He, Yun Gu
TL;DR: 本文提出TopoVST,一种拓扑保真的血管骨架追踪方法,用于解决血管骨架提取中常见的断裂和虚假分支问题。该方法通过构建多尺度球图采样图像,利用图神经网络联合估计追踪方向和血管半径,并采用基于波传播的追踪算法和空间占用过滤来抑制虚假骨架。
Details
Motivation: 自动提取血管骨架在临床应用中至关重要,但薄血管骨架的拓扑保真描绘仍极具挑战,主要由于频繁的断裂和虚假骨架段的存在。
Result: 在两个不同几何形状的血管数据集上进行评估,与最先进基线方法的广泛比较表明,TopoVST在重叠度和拓扑指标上均取得了有竞争力的性能。
Insight: 创新点包括:利用多尺度球图表示和门控特征融合机制增强特征学习;在方向损失中嵌入几何感知加权方案以缓解类别不平衡;设计基于波传播的追踪算法,通过空间占用过滤显式抑制虚假骨架生成。
Abstract: Automatic extraction of vessel skeletons is crucial for many clinical applications. However, achieving topologically faithful delineation of thin vessel skeletons remains highly challenging, primarily due to frequent discontinuities and the presence of spurious skeleton segments. To address these difficulties, we propose TopoVST, a topology-fidelitious vessel skeleton tracker. TopoVST constructs multi-scale sphere graphs to sample the input image and employs graph neural networks to jointly estimate tracking directions and vessel radii. The utilization of multi-scale representations is enhanced through a gating-based feature fusion mechanism, while the issue of class imbalance during training is mitigated by embedding a geometry-aware weighting scheme into the directional loss. In addition, we design a wave-propagation-based skeleton tracking algorithm that explicitly mitigates the generation of spurious skeletons through space-occupancy filtering. We evaluate TopoVST on two vessel datasets with different geometries. Extensive comparisons with state-of-the-art baselines demonstrate that TopoVST achieves competitive performance in both overlapping and topological metrics. Our source code is available at: https://github.com/EndoluminalSurgicalVision-IMR/TopoVST.
[216] EditHF-1M: A Million-Scale Rich Human Preference Feedback for Image Editing cs.CV | cs.MMPDF
Zitong Xu, Huiyu Duan, Zhongpeng Ji, Xinyun Zhang, Yutao Liu
TL;DR: 本文提出了EditHF-1M,一个包含百万级图像编辑样本、超过2900万个人类偏好对和14.8万个人类平均意见评分的大规模数据集,用于从视觉质量、指令对齐和属性保持三个维度评估编辑图像。基于此数据集,作者构建了基于多模态大语言模型(MLLM)的评估模型EditHF,并将其作为奖励信号(EditHF-Reward)通过强化学习优化文本引导的图像编辑模型。实验表明,EditHF与人类偏好高度对齐,且能有效提升图像编辑模型的性能。
Details
Motivation: 当前文本引导图像编辑(TIE)模型仍存在伪影、意外编辑、内容不美观等问题,而缺乏可扩展的评估模型限制了基于人类反馈的奖励模型在图像编辑领域的发展。
Result: EditHF在人类偏好对齐方面表现出色,并在其他数据集上展现出强大的泛化能力。使用EditHF-Reward微调Qwen-Image-Edit模型,取得了显著的性能提升。
Insight: 创新点在于构建了大规模、多维度的人类偏好数据集EditHF-1M,并基于此开发了MLLM评估模型EditHF,将其作为奖励模型通过强化学习优化图像编辑模型,为解决图像编辑评估的可扩展性问题提供了新思路。
Abstract: Recent text-guided image editing (TIE) models have achieved remarkable progress, while many edited images still suffer from issues such as artifacts, unexpected editings, unaesthetic contents. Although some benchmarks and methods have been proposed for evaluating edited images, scalable evaluation models are still lacking, which limits the development of human feedback reward models for image editing. To address the challenges, we first introduce \textbf{EditHF-1M}, a million-scale image editing dataset with over 29M human preference pairs and 148K human mean opinion ratings, both evaluated from three dimensions, \textit{i.e.}, visual quality, instruction alignment, and attribute preservation. Based on EditHF-1M, we propose \textbf{EditHF}, a multimodal large language model (MLLM) based evaluation model, to provide human-aligned feedback from image editing. Finally, we introduce \textbf{EditHF-Reward}, which utilizes EditHF as the reward signal to optimize the text-guided image editing models through reinforcement learning. Extensive experiments show that EditHF achieves superior alignment with human preferences and demonstrates strong generalization on other datasets. Furthermore, we fine-tune the Qwen-Image-Edit using EditHF-Reward, achieving significant performance improvements, which demonstrates the ability of EditHF to serve as a reward model to scale-up the image editing. Both the dataset and code will be released in our GitHub repository: https://github.com/IntMeGroup/EditHF.
[217] $\text{F}^2\text{HDR}$: Two-Stage HDR Video Reconstruction via Flow Adapter and Physical Motion Modeling cs.CVPDF
Huanjing Yue, Dawei Li, Shaoxiong Tu, Jingyu Yang
TL;DR: 本文提出了一种名为F²HDR的两阶段高动态范围(HDR)视频重建框架,旨在解决从交替曝光低动态范围(LDR)帧序列重建HDR视频的挑战,特别是在动态场景中。该框架通过流适配器实现稳健的跨曝光对齐,利用物理运动建模识别显著运动区域,并采用运动感知细化网络聚合信息以消除重影和噪声。
Details
Motivation: 现有方法在动态场景下,由于跨曝光不一致性和复杂运动导致帧间对齐困难,常出现重影和细节丢失问题,存在对齐不准确、特征聚合次优以及在运动主导区域重建质量下降的挑战。
Result: 在真实世界的HDR视频基准测试上进行的广泛实验表明,F²HDR在大幅度运动和曝光变化下实现了最先进的性能,生成了无重影且高保真的结果。
Insight: 创新点包括:1)流适配器,用于适配通用光流以实现稳健的跨曝光对齐;2)物理运动建模,以识别显著运动区域;3)运动感知细化网络,用于聚合互补信息并去除重影和噪声。从客观角度看,该框架通过两阶段设计将运动感知与细节恢复解耦,有望提升复杂动态场景下的HDR视频重建质量。
Abstract: Reconstructing High Dynamic Range (HDR) videos from sequences of alternating-exposure Low Dynamic Range (LDR) frames remains highly challenging, especially under dynamic scenes where cross-exposure inconsistencies and complex motion make inter-frame alignment difficult, leading to ghosting and detail loss. Existing methods often suffer from inaccurate alignment, suboptimal feature aggregation, and degraded reconstruction quality in motion-dominated regions. To address these challenges, we propose $\text{F}^2\text{HDR}$, a two-stage HDR video reconstruction framework that robustly perceives inter-frame motion and restores fine details in complex dynamic scenarios. The proposed framework integrates a flow adapter that adapts generic optical flow for robust cross-exposure alignment, a physical motion modeling to identify salient motion regions, and a motion-aware refinement network that aggregates complementary information while removing ghosting and noise. Extensive experiments demonstrate that $\text{F}^2\text{HDR}$ achieves state-of-the-art performance on real-world HDR video benchmarks, producing ghost-free and high-fidelity results under large motion and exposure variations.
[218] Video-CoE: Reinforcing Video Event Prediction via Chain of Events cs.CVPDF
Qile Su, Jing Tang, Rui Chen, Lei Sun, Xiangxiang Chu
TL;DR: 本文提出了Video-CoE方法,通过构建事件链来增强多模态大语言模型在视频事件预测任务上的性能。该方法旨在解决现有MLLMs在VEP任务中存在的逻辑推理能力不足和视觉信息利用不充分的问题,通过在多个公开基准测试上达到最先进水平。
Details
Motivation: 视频事件预测任务需要模型进行细粒度的时间建模并建立视频与未来事件之间的逻辑关系,而当前的多模态大语言模型在此任务上表现不佳,缺乏对未来事件的逻辑推理能力且视觉信息利用不足。
Result: 在公开基准测试上的实验结果表明,该方法超越了领先的开源和商业MLLMs,在视频事件预测任务上建立了新的最先进水平。
Insight: 创新点在于提出了事件链范式,通过构建时间事件链来隐式地强制MLLM关注视觉内容以及视频与未来事件之间的逻辑连接,并通过多种训练协议激励模型的推理能力。从客观角度看,这是一种将复杂的时间逻辑推理任务结构化的有效方法。
Abstract: Despite advances in the application of MLLMs for various video tasks, video event prediction (VEP) remains relatively underexplored. VEP requires the model to perform fine-grained temporal modeling of videos and establish logical relationships between videos and future events, which current MLLMs still struggle with. In this work, we first present a comprehensive evaluation of current leading MLLMs on the VEP task, revealing the reasons behind their inaccurate predictions, including lack of logical reasoning ability for future events prediction and insufficient utilization of visual information. To address these challenges, we propose \textbf{C}hain \textbf{o}f \textbf{E}vents (\textbf{CoE}) paradigm, which constructs temporal event chains to implicitly enforce MLLM focusing on the visual content and the logical connections between videos and future events, incentivizing model’s reasoning capability with multiple training protocols. Experimental results on public benchmarks demonstrate that our method outperforms both leading open-source and commercial MLLMs, establishing a new state-of-the-art on the VEP task. Codes and models will be released soon.
[219] FAR-Drive: Frame-AutoRegressive Video Generation in Closed-Loop Autonomous Driving cs.CVPDF
Yaoru Li, Federico Landi, Marco Godi, Xin Jin, Ruiju Fu
TL;DR: 本文提出FAR-Drive,一个用于自动驾驶的帧级自回归视频生成框架,旨在构建一个可扩展且交互式的闭环仿真环境。该方法通过多视图扩散Transformer和细粒度结构化控制实现几何一致的多摄像头生成,并采用两阶段训练策略解决长时域一致性和迭代退化问题,同时集成系统级优化以满足低延迟推理要求。
Details
Motivation: 自动驾驶系统的可靠训练和评估受限于缺乏可扩展和交互式的仿真环境,现有生成视频模型多为开环设置,无法支持智能体动作与环境演化的细粒度帧级交互。
Result: 在nuScenes数据集上的实验表明,该方法在现有闭环自动驾驶仿真方法中达到了最先进的性能,同时在单GPU上保持了亚秒级的延迟。
Insight: 创新点包括:引入具有细粒度结构化控制的多视图扩散Transformer以实现几何一致的多摄像头生成;设计自适应参考时域条件化和混合强制自回归训练的两阶段策略,以提升自回归条件下的长时域一致性和鲁棒性;集成系统级效率优化以满足低延迟交互需求。
Abstract: Despite rapid progress in autonomous driving, reliable training and evaluation of driving systems remain fundamentally constrained by the lack of scalable and interactive simulation environments. Recent generative video models achieve remarkable visual fidelity, yet most operate in open-loop settings and fail to support fine-grained frame-level interaction between agent actions and environment evolution. Building a learning-based closed-loop simulator for autonomous driving poses three major challenges: maintaining long-horizon temporal and cross-view consistency, mitigating autoregressive degradation under iterative self-conditioning, and satisfying low-latency inference constraints. In this work, we propose FAR-Drive, a frame-level autoregressive video generation framework for autonomous driving. We introduce a multi-view diffusion transformer with fine-grained structured control, enabling geometrically consistent multi-camera generation. To address long-horizon consistency and iterative degradation, we design a two-stage training strategy consisting of adaptive reference horizon conditioning and blend-forcing autoregressive training, which progressively improves consistency and robustness under self-conditioning. To meet low-latency interaction requirements, we further integrate system-level efficiency optimizations for inference acceleration. Experiments on the nuScenes dataset demonstrate that our method achieves state-of-the-art performance among existing closed-loop autonomous driving simulation approaches, while maintaining sub-second latency on a single GPU.
[220] Bridging Scene Generation and Planning: Driving with World Model via Unifying Vision and Motion Representation cs.CVPDF
Xingtai Gui, Meijie Zhang, Tianyi Yan, Wencheng Han, Jiahao Gong
TL;DR: 本文提出了WorldDrive框架,通过统一视觉与运动表示,将驾驶场景生成与实时规划耦合。该框架包含轨迹感知驾驶世界模型、多模态规划器和未来感知奖励器,在多个基准测试中实现了领先的规划性能,同时保持了高保真度的动作控制视频生成能力。
Details
Motivation: 现有驾驶世界模型主要关注视觉场景表示,运动表示未明确设计为规划器共享和可继承,导致视觉场景生成优化与精确运动规划需求之间存在脱节。
Result: 在NAVSIM、NAVSIM-v2和nuScenes基准测试中,WorldDrive在纯视觉方法中实现了领先的规划性能,同时保持了高保真度的动作控制视频生成能力。
Insight: 通过引入轨迹感知驾驶世界模型统一视觉与运动表示,并利用未来感知奖励器提取世界模型的未来潜在表示来实时评估选择最优轨迹,为鲁棒自动驾驶提供了有效框架。
Abstract: End-to-end autonomous driving aims to generate safe and plausible planning policies from raw sensor input. Driving world models have shown great potential in learning rich representations by predicting the future evolution of a driving scene. However, existing driving world models primarily focus on visual scene representation, and motion representation is not explicitly designed to be planner-shared and inheritable, leaving a schism between the optimization of visual scene generation and the requirements of precise motion planning. We present WorldDrive, a holistic framework that couples scene generation and real-time planning via unifying vision and motion representation. We first introduce a Trajectory-aware Driving World Model, which conditions on a trajectory vocabulary to enforce consistency between visual dynamics and motion intentions, enabling the generation of diverse and plausible future scenes conditioned on a specific trajectory. We transfer the vision and motion encoders to a downstream Multi-modal Planner, ensuring the driving policy operates on mature representations pre-optimized by scene generation. A simple interaction between motion representation, visual representation, and ego status can generate high-quality, multi-modal trajectories. Furthermore, to exploit the world model’s foresight, we propose a Future-aware Rewarder, which distills future latent representation from the frozen world model to evaluate and select optimal trajectories in real-time. Extensive experiments on the NAVSIM, NAVSIM-v2, and nuScenes benchmarks demonstrate that WorldDrive achieves leading planning performance among vision-only methods while maintaining high-fidelity action-controlled video generation capabilities, providing strong evidence for the effectiveness of unifying vision and motion representation for robust autonomous driving.
[221] GT-PCQA: Geometry-Texture Decoupled Point Cloud Quality Assessment with MLLM cs.CVPDF
Guohua Zhang, Jian Jin, Meiqin Liu, Chao Yao, Weisi Lin
TL;DR: 本文提出了一种基于多模态大语言模型的无参考点云质量评估框架GT-PCQA,通过2D-3D联合训练策略解决PCQA数据稀缺问题,并采用几何-纹理解耦策略缓解预训练MLLM的纹理主导偏差,从而提升对几何结构退化的敏感性。
Details
Motivation: 现有基于MLLM的图像质量评估方法难以直接扩展到点云质量评估,主要挑战在于PCQA数据集规模有限,以及预训练MLLM倾向于纹理主导推理,对几何结构退化不够敏感。
Result: 大量实验表明,GT-PCQA在点云质量评估任务上取得了有竞争力的性能,并展现出强大的泛化能力。
Insight: 创新点包括:将PCQA建模为相对质量比较问题以统一大规模IQA与有限PCQA数据集;采用参数高效的LoRA进行指令微调;设计双提示机制与交替优化方案来解耦几何与纹理信息,缓解MLLM的纹理偏差。
Abstract: With the rapid advancement of Multi-modal Large Language Models (MLLMs), MLLM-based Image Quality Assessment (IQA) methods have shown promising generalization. However, directly extending these MLLM-based IQA methods to PCQA remains challenging. On the one hand, existing PCQA datasets are limited in scale, which hinders stable and effective instruction tuning of MLLMs. On the other hand, due to large-scale image-text pretraining, MLLMs tend to rely on texture-dominant reasoning and are insufficiently sensitive to geometric structural degradations that are critical for PCQA. To address these gaps, we propose a novel MLLM-based no-reference PCQA framework, termed GT-PCQA, which is built upon two key strategies. First, to enable stable and effective instruction tuning under scarce PCQA supervision, a 2D-3D joint training strategy is proposed. This strategy formulates PCQA as a relative quality comparison problem to unify large-scale IQA datasets with limited PCQA datasets. It incorporates a parameter-efficient Low-Rank Adaptation (LoRA) scheme to support instruction tuning. Second, a geometry-texture decoupling strategy is presented, which integrates a dual-prompt mechanism with an alternating optimization scheme to mitigate the inherent texture-dominant bias of pre-trained MLLMs, while enhancing sensitivity to geometric structural degradations. Extensive experiments demonstrate that GT-PCQA achieves competitive performance and exhibits strong generalization.
[222] Learning Question-Aware Keyframe Selection with Synthetic Supervision for Video Question Answering cs.CV | cs.AIPDF
Minchan Kwon, Hyounguk Shon, Junmo Kim
TL;DR: 本文提出了一种面向视频问答(VideoQA)的问题感知关键帧选择框架,通过从大型多模态模型(LMMs)生成伪关键帧标签提供监督,并引入覆盖正则化以促进时序上多样且互补的证据选择,从而在提升推理效率的同时改善答案准确性。
Details
Motivation: 现有大型多模态模型在视频问答中面临推理成本高和信息稀释的挑战,而传统基于图像-文本相似度的关键帧选择方法存在监督稀疏和帧选择冗余的问题。
Result: 在NExT-QA基准测试中,该方法显著提升了准确率,尤其在时序和因果类问题上表现突出,验证了关键帧选择作为VideoQA中可学习模块的有效性。
Insight: 创新点在于利用LMMs生成伪标签进行合成监督,并结合覆盖正则化优化关键帧的多样性和互补性,为视频理解中的高效信息提取提供了新思路。
Abstract: Large multimodal models (LMMs) have recently demonstrated remarkable performance in video question answering (VideoQA), yet reasoning over video remains challenging due to high inference cost and diluted information. Keyframe selection offers efficiency and sharper reasoning but suffers from sparse supervision and redundant frame choices when relying only on image-text similarity. We present a question-aware keyframe selection framework with two components: pseudo keyframe labels derived from LMMs that provide informative supervision and a coverage regularization that promotes diverse, complementary evidence across time. Experiments on NExT-QA show that our method significantly improves accuracy, especially for temporal and causal question types, establishing keyframe selection as an effective and learnable module for VideoQA.
[223] CyCLeGen: Cycle-Consistent Layout Prediction and Image Generation in Vision Foundation Models cs.CVPDF
Xiaojun Shan, Haoyu Shen, Yucheng Mao, Xiang Zhang, Abhay Anand
TL;DR: CyCLeGen是一个统一的自回归视觉-语言基础模型,能够在单一框架内同时执行图像理解和图像生成任务。它通过图像->布局->图像和布局->图像->布局的循环一致性学习机制,将感知与合成功能集成在一起,实现了自省和数据高效性。
Details
Motivation: 解决现有视觉模型依赖独立模块进行感知和合成的问题,旨在构建一个统一的模型,通过循环一致性学习实现图像理解与生成的协同增强。
Result: 在多个图像理解和生成基准测试中取得了显著性能提升,展示了统一视觉-语言基础模型的潜力。
Insight: 创新点在于采用循环一致性学习框架,实现了模型的自省能力和通过合成监督进行自我改进的数据高效性,为构建一体化视觉基础模型提供了新思路。
Abstract: We present CyCLeGen, a unified vision-language foundation model capable of both image understanding and image generation within a single autoregressive framework. Unlike existing vision models that depend on separate modules for perception and synthesis, CyCLeGen adopts a fully integrated architecture that enforces cycle-consistent learning through image->layout->image and layout->image->layout generation loops. This unified formulation introduces two key advantages: introspection, enabling the model to reason about its own generations, and data efficiency, allowing self-improvement via synthetic supervision under a reinforcement learning objective guided by cycle consistency. Extensive experiments show that CyCLeGen achieves significant gains across diverse image understanding and generation benchmarks, highlighting the potential of unified vision-language foundation models.
[224] GeoNVS: Geometry Grounded Video Diffusion for Novel View Synthesis cs.CVPDF
Minjun Kang, Inkyu Shin, Taeyeop Lee, Myungchul Kim, In So Kweon
TL;DR: GeoNVS是一种基于几何约束的视频扩散模型,用于新视角合成。它通过引入高斯泼溅特征适配器(GS-Adapter),将输入视角的扩散特征提升为3D高斯表示,渲染几何约束的新视角特征,并与扩散特征自适应融合,以纠正几何不一致性,从而增强几何保真度和相机可控性。
Details
Motivation: 现有相机控制的视频扩散模型在新视角合成中存在几何失真和相机可控性有限的问题,需要提升3D几何一致性和跨视角视觉连贯性。
Result: 在9个场景和18种设置下的实验表明,该方法实现了最先进的性能,相比SEVA和CameraCtrl分别提升了11.3%和14.9%,平移误差最多降低2倍,Chamfer距离最多降低7倍。
Insight: 核心创新是GS-Adapter,它在特征空间而非输入层面注入几何信息,避免了视角相关颜色噪声对结构一致性的损害;其即插即用设计实现了与多种前馈几何模型的零样本兼容,无需额外训练,并可适配其他视频扩散骨干网络。
Abstract: Novel view synthesis requires strong 3D geometric consistency and the ability to generate visually coherent images across diverse viewpoints. While recent camera-controlled video diffusion models show promising results, they often suffer from geometric distortions and limited camera controllability. To overcome these challenges, we introduce GeoNVS, a geometry-grounded novel-view synthesizer that enhances both geometric fidelity and camera controllability through explicit 3D geometric guidance. Our key innovation is the Gaussian Splat Feature Adapter (GS-Adapter), which lifts input-view diffusion features into 3D Gaussian representations, renders geometry-constrained novel-view features, and adaptively fuses them with diffusion features to correct geometrically inconsistent representations. Unlike prior methods that inject geometry at the input level, GS-Adapter operates in feature space, avoiding view-dependent color noise that degrades structural consistency. Its plug-and-play design enables zero-shot compatibility with diverse feed-forward geometry models without additional training, and can be adapted to other video diffusion backbones. Experiments across 9 scenes and 18 settings demonstrate state-of-the-art performance, achieving 11.3% and 14.9% improvements over SEVA and CameraCtrl, with up to 2x reduction in translation error and 7x in Chamfer Distance.
[225] MMSpec: Benchmarking Speculative Decoding for Vision-Language Models cs.CVPDF
Hui Shen, Xin Wang, Ping Zhang, Yunta Hsieh, Qi Han
TL;DR: 该论文提出了MMSpec,这是首个用于评估视觉语言模型(VLMs)中推测解码(speculative decoding)性能的基准测试。该基准包含600个跨六个任务类别的多模态样本,并在统一框架下集成了十种代表性推测解码算法。研究发现,纯文本LLM的推测解码方法在多模态场景下性能下降,视觉感知在大批量处理时变得重要,且仅凭吞吐量加速不能可靠反映延迟性能。基于这些发现,论文提出了ViSkip,一种即插即用的推测解码方法,能动态适应视觉令牌,并实现了最先进的性能。
Details
Motivation: 视觉语言模型在多模态任务上表现出色,但由于模型规模大、多模态上下文长,导致推理延迟高。推测解码是一种有效的加速技术,但其在VLMs中的行为尚未被充分理解,因此需要建立基准来评估和优化。
Result: 在MMSpec基准上评估,提出的ViSkip方法在延迟和吞吐量方面均实现了最先进的性能。
Insight: 创新点在于首次构建了针对VLMs的推测解码基准MMSpec,并揭示了多模态场景下推测解码的特殊性(如视觉令牌的重要性)。提出的ViSkip方法通过动态适应视觉令牌,提供了一种有效的即插即用加速方案。
Abstract: Vision-language models (VLMs) achieve strong performance on multimodal tasks but suffer from high inference latency due to large model sizes and long multimodal contexts. Speculative decoding has recently emerged as an effective acceleration technique, yet its behavior in VLMs remains insufficiently understood. We introduce MMSpec, the first benchmark for evaluating speculative decoding in vision-language models. MMSpec contains 600 multimodal samples across six task categories and integrates ten representative speculative decoding algorithms under a unified evaluation framework. Our study reveals three key findings: (1) methods designed for text-only LLMs degrade in multimodal scenarios, (2) vision awareness becomes increasingly important at larger batch sizes, and (3) throughput speedup alone does not reliably reflect latency performance. Motivated by these findings, we propose ViSkip, a plug-and-play speculative decoding method that dynamically adapts speculation to vision tokens and achieves state-of-the-art performance.
[226] Edit2Interp: Adapting Image Foundation Models from Spatial Editing to Video Frame Interpolation with Few-Shot Learning cs.CVPDF
Nasrin Rahimi, Mısra Yavuz, Burak Can Biner, Yunus Bilge Kurt, Ahmet Rasim Emirdağı
TL;DR: 本文提出Edit2Interp方法,通过少量样本(64-256个)和低秩适应(LoRA)技术,将预训练的大型图像编辑模型(如Qwen-Image-Edit)从静态空间编辑任务适应到视频帧插值(VFI)任务,揭示了模型固有的空间变换理解中蕴含的潜在时序推理能力。
Details
Motivation: 预训练图像编辑模型具有强大的空间推理和对象感知变换能力,但缺乏显式的时序建模;本文旨在探索如何以数据高效的方式将这些空间先验重新用于解锁时序合成能力,而无需引入视频专用架构或运动估计模块。
Result: 经过少量样本微调后,模型成功解锁了插值能力,而基线模型完全无法生成连贯的中间帧;该方法并非旨在与在大量数据上从头训练的任务特定VFI方法竞争,而是证明了基础图像编辑模型在时序任务上具有未开发的潜力。
Insight: 核心创新点在于揭示了图像编辑模型对静态场景中’对象如何变换’的固有理解包含可通过少量微调激活的潜在时序推理,表明基础模型中空间与时序推理可能比先前认知的更紧密交织,为资源受限场景下的视频合成提供了一条数据高效的途径。
Abstract: Pre-trained image editing models exhibit strong spatial reasoning and object-aware transformation capabilities acquired from billions of image-text pairs, yet they possess no explicit temporal modeling. This paper demonstrates that these spatial priors can be repurposed to unlock temporal synthesis capabilities through minimal adaptation - without introducing any video-specific architecture or motion estimation modules. We show that a large image editing model (Qwen-Image-Edit), originally designed solely for static instruction-based edits, can be adapted for Video Frame Interpolation (VFI) using only 64-256 training samples via Low-Rank Adaptation (LoRA). Our core contribution is revealing that the model’s inherent understanding of “how objects transform” in static scenes contains latent temporal reasoning that can be activated through few-shot fine-tuning. While the baseline model completely fails at producing coherent intermediate frames, our parameter-efficient adaptation successfully unlocks its interpolation capability. Rather than competing with task-specific VFI methods trained from scratch on massive datasets, our work establishes that foundation image editing models possess untapped potential for temporal tasks, offering a data-efficient pathway for video synthesis in resource-constrained scenarios. This bridges the gap between image manipulation and video understanding, suggesting that spatial and temporal reasoning may be more intertwined in foundation models than previously recognized
[227] Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning cs.CVPDF
Kaixin zhang, Xiaohe Li, Jiahao Li, Haohua Wu, Xinyu Zhao
TL;DR: 本文提出ClueNet,一种基于视觉线索的视频推理框架,通过两阶段监督微调范式解决视频问答中的幻觉和可解释性问题,无需大量修改基础模型。该方法在多个基准测试中优于现有方法,并展现出良好的泛化能力和推理效率。
Details
Motivation: 当前多模态大语言模型在视频推理中缺乏明确的视觉感知与答案生成之间的结构化推理,导致严重幻觉和可解释性差,且未能有效解决视觉线索提取、过滤和对齐三个核心问题。
Result: 在NExT-QA、STAR和MVBench基准测试中,ClueNet比最先进方法性能提升≥1.1%,并表现出优异的泛化能力、幻觉缓解、推理效率和跨骨干网络兼容性。
Insight: 创新点在于受人类分层视觉认知启发,采用解耦监督对齐线索提取与链式推理,并结合自适应线索过滤器进行推理监督,以轻量级模块实现高效推理,弥合了多模态大语言模型在视频理解中的感知到生成差距。
Abstract: Multi-modal Large Language Models (MLLMs) have significantly advanced video reasoning, yet Video Question Answering (VideoQA) remains challenging due to its demand for temporal causal reasoning and evidence-grounded answer generation. Prevailing end-to-end MLLM frameworks lack explicit structured reasoning between visual perception and answer derivation, causing severe hallucinations and poor interpretability. Existing methods also fail to address three core gaps: faithful visual clue extraction, utility-aware clue filtering, and end-to-end clue-answer alignment. Inspired by hierarchical human visual cognition, we propose ClueNet, a clue-aware video reasoning framework with a two-stage supervised fine-tuning paradigm without extensive base model modifications. Decoupled supervision aligns clue extraction and chain-based reasoning, while inference supervision with an adaptive clue filter refines high-order reasoning, alongside lightweight modules for efficient inference. Experiments on NExT-QA, STAR, and MVBench show that ClueNet outperforms state-of-the-art methods by $\ge$ 1.1%, with superior generalization, hallucination mitigation, inference efficiency, and cross-backbone compatibility. This work bridges the perception-to-generation gap in MLLM video understanding, providing an interpretable, faithful reasoning paradigm for high-stakes VideoQA applications.
[228] Molecular Identifier Visual Prompt and Verifiable Reinforcement Learning for Chemical Reaction Diagram Parsing cs.CVPDF
Jiahe Song, Chuang Wang, Yinfan Wang, Hao Zheng, Rui Nie
TL;DR: 本文针对化学反应图解析任务,提出了两种互补的方法来增强基于视觉语言模型的性能:一是利用分子标识符作为视觉提示来激活预训练知识,二是引入可验证的强化学习算法直接优化反应级指标。
Details
Motivation: 解决现有视觉语言模型在化学反应图解析中无法对齐视觉化学实体与预训练知识,以及训练与评估粒度不匹配的问题。
Result: 提出的方法在零样本和分布外场景下优于现有提示策略,并通过强化学习在微调范式中实现了持续性能提升,同时发布了包含扫描历史反应图的ScannedRxn基准来评估模型鲁棒性。
Insight: 创新点包括利用自然存在的分子标识符作为视觉提示来桥接视觉与知识,以及设计可验证奖励的强化学习算法以直接优化任务级评估指标,提升了模型的准确性和泛化能力。
Abstract: Reaction diagram parsing (RxnDP) is critical for extracting chemical synthesis information from literature. Although recent Vision-Language Models (VLMs) have emerged as a promising paradigm to automate this complex visual reasoning task, their application is fundamentally bottlenecked by the inability to align visual chemical entities with pre-trained knowledge, alongside the inherent discrepancy between token-level training and reaction-level evaluation. To address these dual challenges, this work enhances VLM-based RxnDP from two complementary perspectives: prompting representation and learning paradigms. First, we propose Identifier as Visual Prompting (IdtVP), which leverages naturally occurring molecule identifiers (e.g., bold numerals like 1a) to activate the chemical knowledge acquired during VLM pre-training. IdtVP enables powerful zero-shot and out-of-distribution capabilities, outperforming existing prompting strategies. Second, to further optimize performance within fine-tuning paradigms, we introduce Re3-DAPO, a reinforcement learning algorithm that leverages verifiable rewards to directly optimize reaction-level metrics, thereby achieving consistent gains over standard supervised fine-tuning. Additionally, we release the ScannedRxn benchmark, comprising scanned historical reaction diagrams with real-world artifacts, to rigorously assess model robustness and out-of-distribution ability. Our contributions advance the accuracy and generalization of VLM-based reaction diagram parsing. We will release data, models, and code on GitHub.
[229] Riemannian Motion Generation: A Unified Framework for Human Motion Representation and Generation via Riemannian Flow Matching cs.CV | stat.MLPDF
Fangran Miao, Jian Huang, Ting Li
TL;DR: 本文提出了黎曼运动生成(RMG)框架,该框架将人体运动表示在乘积流形上,并通过黎曼流匹配学习运动动力学。RMG将运动分解为多个流形因子,产生具有内在归一化的无尺度表示,并使用测地线插值、切空间监督和保流形ODE积分进行训练和采样。
Details
Motivation: 现有的人体运动生成方法通常在欧几里得空间中学习,但有效的运动遵循结构化的非欧几何。本文旨在通过几何感知的建模,提供一个统一且可扩展的框架来生成高保真度的人体运动。
Result: 在HumanML3D数据集上,RMG在HumanML3D格式下取得了最先进的FID分数(0.043),并在MotionStreamer格式下的所有报告指标中排名第一。在MotionMillion数据集上,它也超越了强基线(FID 5.6,R@1 0.86)。消融实验表明,紧凑的平移+旋转(T+R)表示是最稳定和有效的。
Insight: 主要创新点在于将运动表示和生成统一到黎曼几何框架中,通过乘积流形分解实现尺度无关和内在归一化的表示,并利用黎曼流匹配进行学习。从客观角度看,该工作强调了几何感知建模对于高保真运动生成的实用性和可扩展性,为处理非欧结构数据提供了新思路。
Abstract: Human motion generation is often learned in Euclidean spaces, although valid motions follow structured non-Euclidean geometry. We present Riemannian Motion Generation (RMG), a unified framework that represents motion on a product manifold and learns dynamics via Riemannian flow matching. RMG factorizes motion into several manifold factors, yielding a scale-free representation with intrinsic normalization, and uses geodesic interpolation, tangent-space supervision, and manifold-preserving ODE integration for training and sampling. On HumanML3D, RMG achieves state-of-the-art FID in the HumanML3D format (0.043) and ranks first on all reported metrics under the MotionStreamer format. On MotionMillion, it also surpasses strong baselines (FID 5.6, R@1 0.86). Ablations show that the compact $\mathscr{T}+\mathscr{R}$ (translation + rotations) representation is the most stable and effective, highlighting geometry-aware modeling as a practical and scalable route to high-fidelity motion generation.
[230] Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods cs.CV | cs.LGPDF
Omer Ben Hayun, Roy Betser, Meir Yossef Levi, Levi Kassel, Guy Gilboa
TL;DR: 本文提出了一种名为STALL的无训练生成视频检测方法,通过基于概率框架的时空似然性联合建模,对视频内容进行评分,以解决现有图像检测器忽略时序动态、监督视频检测器泛化性差的问题。
Details
Motivation: 随着视频生成模型的快速发展,合成视频的检测需求日益迫切,现有图像检测器无法捕捉时序信息,而监督视频检测器对新生成器泛化能力不足,因此需要一种无需训练、模型无关的零样本检测方法。
Result: STALL在两个公共基准测试和新构建的ComGenVid基准(包含最先进的生成模型)上均优于先前的图像和视频基线方法,实现了更优的检测性能。
Insight: 创新点在于提出了一种基于概率框架的时空似然性联合建模方法,实现了无需训练、模型无关的零样本检测,能够有效捕捉视频的时空动态特征,提升对未知生成器的泛化能力。
Abstract: Following major advances in text and image generation, the video domain has surged, producing highly realistic and controllable sequences. Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors generalize poorly to unseen generators, a critical drawback given the rapid emergence of new models. These challenges motivate zero-shot approaches, which avoid synthetic data and instead score content against real-data statistics, enabling training-free, model-agnostic detection. We introduce \emph{STALL}, a simple, training-free, theoretically justified detector that provides likelihood-based scoring for videos, jointly modeling spatial and temporal evidence within a probabilistic framework. We evaluate STALL on two public benchmarks and introduce ComGenVid, a new benchmark with state-of-the-art generative models. STALL consistently outperforms prior image- and video-based baselines. Code and data are available at https://omerbenhayun.github.io/stall-video.
[231] GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents cs.CVPDF
Yang Li, Yuchen Liu, Haoyu Lu, Zhiqiang Xia, Hongzhen Wang
TL;DR: 本文介绍了GUI-CEval,这是首个针对中文移动GUI代理的综合性基准测试,它构建在真实设备环境上,涵盖201个主流应用,采用两级结构评估感知、规划、反思、执行和评估五个维度的原子能力和实际应用性能。
Details
Motivation: 现有基准测试主要面向英语,未能捕捉中文移动生态系统的语言和交互特性,且侧重于孤立的技能,缺乏统一细粒度的框架来评估从感知到执行的完整能力链。
Result: 在20个代表性多模态大语言模型和多智能体系统上的广泛实验表明,虽然Qwen2.5-VL和UI-TARS等模型表现有竞争力,但大多数模型在反思决策和行动后自我评估方面仍存在明显弱点,限制了其在真实世界交互中的可靠性。
Insight: 创新点在于提出了首个专注于中文移动GUI代理的层次化综合基准,通过真实设备环境和多阶段人工验证确保数据真实性和可复现性,为能力诊断和模型发展提供了可解释的评估框架。
Abstract: Recent progress in Multimodal Large Language Models (MLLMs) has enabled mobile GUI agents capable of visual perception, cross-modal reasoning, and interactive control. However, existing benchmarks are largely English-centric and fail to capture the linguistic and interaction characteristics of the Chinese mobile ecosystem. They also focus on isolated skills such as GUI grounding or offline agent, lacking a unified and fine-grained framework to assess the full capability chain from perception to execution. To address this gap, we introduce GUI-CEval, the first comprehensive benchmark for Chinese mobile GUI agents, built entirely on physical device environments. GUI-CEval spans 201 mainstream apps across four device types and adopts a two-level structure that evaluates both atomic abilities and realistic application-level performance along five dimensions: perception, planning, reflection, execution, and evaluation. All data are collected and verified through multi-stage manual processes to ensure authenticity and reproducibility. Extensive experiments on 20 representative MLLMs and multi-agent systems show that while models such as Qwen2.5-VL and UI-TARS perform competitively, most MLLMs still exhibit clear weaknesses in reflective decision-making and post-action self-evaluation, limiting their reliability in real-world interactions. We hope GUI-CEval provides a comprehensive and interpretable benchmark to guide capability diagnosis and advance the development of Chinese mobile GUI agents.
[232] The Good, the Better, and the Best: Improving the Discriminability of Face Embeddings through Attribute-aware Learning cs.CVPDF
Ana Dias, João Ribeiro Pinto, Hugo Proença, João C. Neves
TL;DR: 本文提出了一种属性感知的人脸识别架构,通过联合学习身份类别标签、身份相关面部属性和非身份相关属性,提升人脸嵌入的判别性。实验表明,使用身份相关的属性子集优于使用更广泛的属性集,并且显式地让嵌入‘遗忘’非身份相关属性能带来进一步的性能提升。
Details
Motivation: 尽管人脸识别技术取得了进展,但在年龄、姿态和遮挡等大变化下,鲁棒性能仍然具有挑战性。现有方法通常依赖异构且固定的属性集,并隐含地假设所有属性对身份识别具有同等相关性,这是次优的,因为不同属性具有不同的判别力,有些甚至可能引入有害偏差。
Result: 在标准人脸验证基准测试上的实验表明,联合学习身份和面部属性提高了人脸嵌入的判别性,具体结论包括:使用身份相关的属性子集始终优于使用更广泛的属性集进行监督;显式地让嵌入‘遗忘’非身份相关属性相比不监督这些属性能带来进一步的性能提升。
Insight: 创新点在于提出了一种将面部属性组织成可解释组别的架构,允许以人类可理解的方式分解和分析各个属性的贡献。该方法还可作为一种诊断工具,通过抑制非身份相关属性来测量准确率增益,从而评估人脸识别编码器的可信度,这种增益表明模型可能从与每个身份相关的冗余属性中进行捷径学习。
Abstract: Despite recent advances in face recognition, robust performance remains challenging under large variations in age, pose, and occlusion. A common strategy to address these issues is to guide representation learning with auxiliary supervision from facial attributes, encouraging the visual encoder to focus on identity-relevant regions. However, existing approaches typically rely on heterogeneous and fixed sets of attributes, implicitly assuming equal relevance across attributes. This assumption is suboptimal, as different attributes exhibit varying discriminative power for identity recognition, and some may even introduce harmful biases. In this paper, we propose an attribute-aware face recognition architecture that supervises the learning of facial embeddings using identity class labels, identity-relevant facial attributes, and non-identity-related attributes. Facial attributes are organized into interpretable groups, making it possible to decompose and analyze their individual contributions in a human-understandable manner. Experiments on standard face verification benchmarks demonstrate that joint learning of identity and facial attributes improves the discriminability of face embeddings with two major conclusions: (i) using identity-relevant subsets of facial attributes consistently outperforms supervision with a broader attribute set, and (ii) explicitly forcing embeddings to unlearn non-identity-related attributes yields further performance gains compared to leaving such attributes unsupervised. Additionally, our method serves as a diagnostic tool for assessing the trustworthiness of face recognition encoders by allowing for the measurement of accuracy gains with suppression of non-identity-relevant attributes, with such gains suggesting shortcut learning from redundant attributes associated with each identity.
[233] ReactMotion: Generating Reactive Listener Motions from Speaker Utterance cs.CV | cs.AI | cs.HC | cs.MM | cs.SDPDF
Cheng Luo, Bizhu Wu, Bing Li, Jianfeng Ren, Ruibin Bai
TL;DR: 本文提出了一个新任务——从说话者话语生成反应性倾听者动作,并介绍了ReactMotionNet数据集和ReactMotion生成框架,旨在生成自然、多样且恰当的非语言倾听者行为。
Details
Motivation: 现有研究对非语言倾听者行为的建模探索不足且具有挑战性,因为人类反应本质上是非确定性的,需要生成能恰当响应说话者话语的倾听者动作。
Result: 在ReactMotionNet数据集上的大量实验表明,ReactMotion在生成自然性、多样性和恰当性方面优于检索基线和基于LLM的级联流水线。
Insight: 创新点包括:引入捕捉一对多关系的数据集设计、针对反应恰当性的偏好导向评估协议,以及联合建模文本、音频、情感和动作的统一生成框架,使用基于偏好的目标进行训练。
Abstract: In this paper, we introduce a new task, Reactive Listener Motion Generation from Speaker Utterance, which aims to generate naturalistic listener body motions that appropriately respond to a speaker’s utterance. However, modeling such nonverbal listener behaviors remains underexplored and challenging due to the inherently non-deterministic nature of human reactions. To facilitate this task, we present ReactMotionNet, a large-scale dataset that pairs speaker utterances with multiple candidate listener motions annotated with varying degrees of appropriateness. This dataset design explicitly captures the one-to-many nature of listener behavior and provides supervision beyond a single ground-truth motion. Building on this dataset design, we develop preference-oriented evaluation protocols tailored to evaluate reactive appropriateness, where conventional motion metrics focusing on input-motion alignment ignore. We further propose ReactMotion, a unified generative framework that jointly models text, audio, emotion, and motion, and is trained with preference-based objectives to encourage both appropriate and diverse listener responses. Extensive experiments show that ReactMotion outperforms retrieval baselines and cascaded LLM-based pipelines, generating more natural, diverse, and appropriate listener motions.
[234] Learning from Limited and Incomplete Data: A Multimodal Framework for Predicting Pathological Response in NSCLC cs.CVPDF
Alice Natalina Caragliano, Giulia Farina, Fatih Aksu, Camillo Maria Caruso, Claudia Tacconi
TL;DR: 本文提出了一种多模态深度学习框架,用于预测非小细胞肺癌(NSCLC)患者新辅助治疗后的主要病理缓解(pR)。该框架整合了基于基础模型的CT影像特征提取与针对临床变量的缺失感知架构,旨在解决现实临床环境中数据有限且临床资料不完整的问题,通过加权融合机制结合影像和临床数据的互补信息,从而提升预测性能。
Details
Motivation: 解决在数据有限且临床资料不完整的现实临床环境中,准确术前预测非小细胞肺癌患者新辅助治疗后主要病理缓解(pR)的挑战。
Result: 该多模态模型在预测任务中一致优于单模态影像和临床基线方法,表明其在现实临床条件下具有稳健的预测能力。
Insight: 创新点在于整合了基于基础模型的CT特征提取与缺失感知架构,无需依赖传统的插补策略,并采用加权融合机制有效利用多模态数据的互补性,为数据有限且不完整的临床预测问题提供了可借鉴的解决方案。
Abstract: Major pathological response (pR) following neoadjuvant therapy is a clinically meaningful endpoint in non-small cell lung cancer, strongly associated with improved survival. However, accurate preoperative prediction of pR remains challenging, particularly in real-world clinical settings characterized by limited data availability and incomplete clinical profiles. In this study, we propose a multimodal deep learning framework designed to address these constraints by integrating foundation model-based CT feature extraction with a missing-aware architecture for clinical variables. This approach enables robust learning from small cohorts while explicitly modeling missing clinical information, without relying on conventional imputation strategies. A weighted fusion mechanism is employed to leverage the complementary contributions of imaging and clinical modalities, yielding a multimodal model that consistently outperforms both unimodal imaging and clinical baselines. These findings underscore the added value of integrating heterogeneous data sources and highlight the potential of multimodal, missing-aware systems to support pR prediction under realistic clinical conditions.
[235] VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents cs.CVPDF
Udi Barzelay, Ophir Azulai, Inbar Shapira, Idan Friedman, Foad Abo Dahood
TL;DR: 本文提出了VAREX基准测试,用于评估多模态基础模型在政府表格结构化数据提取任务上的性能。该基准包含1,777份文档,涵盖1,771个独特模式,提供四种输入模态(纯文本、保留布局的文本、文档图像、文本与图像结合),并通过反向注释流程生成合成数据。评估了20个模型,发现小参数模型(≤4B)的主要瓶颈是结构化输出合规性而非提取能力,保留布局的文本能带来最大精度提升,且基准在60-95%精度区间内能有效区分模型性能。
Details
Motivation: 现有基准测试通常仅从单一输入表示评估模型,缺乏系统分析输入格式对提取精度影响的能力。VAREX旨在填补这一空白,通过提供多种受控输入模态,支持对多模态文档结构化提取任务进行更全面的评估。
Result: 在VAREX基准上评估了20个模型(从前沿专有模型到小型开源模型)。结果显示:参数≤4B的模型中,结构化输出合规性是主要瓶颈(模式回显导致分数下降45-65个百分点);针对提取任务的微调在2B参数模型上带来81个百分点的提升;保留布局的文本输入带来最大精度增益(+3-18个百分点),超过像素级视觉线索;该基准在60-95%精度区间内最能有效区分模型性能。
Insight: 创新点包括:通过反向注释流程生成高质量合成数据;提供四种受控输入模态以支持系统消融研究;首次系统揭示了小参数模型中结构化输出合规性(而非提取能力)是主要瓶颈,并证明通过任务特定微调可在不增加规模的情况下显著提升性能。保留布局的文本作为高效表示形式的价值得到验证。
Abstract: We introduce VAREX (VARied-schema EXtraction), a benchmark for evaluating multimodal foundation models on structured data extraction from government forms. VAREX employs a Reverse Annotation pipeline that programmatically fills PDF templates with synthetic values, producing deterministic ground truth validated through three-phase quality assurance. The benchmark comprises 1,777 documents with 1,771 unique schemas across three structural categories, each provided in four input modalities: plain text, layout-preserving text (whitespace-aligned to approximate column positions), document image, or both text and image combined. Unlike existing benchmarks that evaluate from a single input representation, VAREX provides four controlled modalities per document, enabling systematic ablation of how input format affects extraction accuracy – a capability absent from prior benchmarks. We evaluate 20 models from frontier proprietary models to small open models, with particular attention to models <=4B parameters suitable for cost-sensitive and latency-constrained deployment. Results reveal that (1) below 4B parameters, structured output compliance – not extraction capability – is a dominant bottleneck; in particular, schema echo (models producing schema-conforming structure instead of extracted values) depresses scores by 45-65 pp (percentage points) in affected models; (2) extraction-specific fine-tuning at 2B yields +81 pp gains, demonstrating that the instruction-following deficit is addressable without scale; (3) layout-preserving text provides the largest accuracy gain (+3-18 pp), exceeding pixel-level visual cues; and (4) the benchmark most effectively discriminates models in the 60-95% accuracy band. Dataset and evaluation code are publicly available.
[236] A Tutorial on ALOS2 SAR Utilization: Dataset Preparation, Self-Supervised Pretraining, and Semantic Segmentation cs.CVPDF
Nevrez Imamoglu, Ali Caglayan, Toru Kouyama
TL;DR: 本文是一篇关于ALOS2合成孔径雷达(SAR)数据利用的教程,重点介绍了数据集准备、自监督预训练和语义分割任务。论文提出了SAR-W-SimMIM方法,一种针对ALOS-2单通道SAR图像的加权SimMIM变体,旨在减少斑点噪声和极端强度值在预训练中的影响。同时,构建了日本区域的ALOS-2单通道(HH极化)SAR数据集,用于预训练基于视觉变换器的自编码器,并微调进行语义分割,相比随机初始化从头训练取得了显著性能提升。
Details
Motivation: 解决合成孔径雷达(SAR)图像在应用掩码自编码器(MAE)等自监督预训练方法时面临的挑战,包括语义标注困难、高噪声水平(如斑点噪声)以及区域特定模型中土地覆盖分布不平衡导致的偏差问题。
Result: 与之前的SAR-W-MixMAE和随机初始化相比,SAR-W-SimMIM在语义分割任务上观察到了显著的改进。在构建的日本区域ALOS-2 SAR数据集上预训练模型并微调后,相比随机初始化从头训练,性能有大幅提升。
Insight: 创新点包括:提出了SAR-W-SimMIM方法,通过加权损失减少SAR图像中噪声和极端值的影响;构建了针对特定区域(日本)的ALOS-2 SAR数据集,支持国家级基础模型开发;提供了从数据准备到自监督预训练及下游任务微调的完整流程指南。从客观角度看,该方法通过SAR特定的加权策略和区域数据集构建,有效应对了SAR图像的独特挑战,为SAR领域的自监督学习提供了实用框架。
Abstract: Masked auto-encoders (MAE) and related approaches have shown promise for satellite imagery, but their application to synthetic aperture radar (SAR) remains limited due to challenges in semantic labeling and high noise levels. Building on our prior work with SAR-W-MixMAE, which adds SAR-specific intensity-weighted loss to standard MixMAE for pretraining, we also introduce SAR-W-SimMIM; a weighted variant of SimMIM applied to ALOS-2 single-channel SAR imagery. This method aims to reduce the impact of speckle and extreme intensity values during self-supervised pretraining. We evaluate its effect on semantic segmentation compared to our previous trial with SAR-W-MixMAE and random initialization, observing notable improvements. In addition, pretraining and fine-tuning models on satellite imagery pose unique challenges, particularly when developing region-specific models. Imbalanced land cover distributions such as dominant water, forest, or desert areas can introduce bias, affecting both pretraining and downstream tasks like land cover segmentation. To address this, we constructed a SAR dataset using ALOS-2 single-channel (HH polarization) imagery focused on the Japan region, marking the initial phase toward a national-scale foundation model. This dataset was used to pretrain a vision transformer-based autoencoder, with the resulting encoder fine-tuned for semantic segmentation using a task-specific decoder. Initial results demonstrate significant performance improvements compared to training from scratch with random initialization. In summary, this work provides a guide to process and prepare ALOS2 observations to create dataset so that it can be taken advantage of self-supervised pretraining of models and finetuning downstream tasks such as semantic segmentation.
[237] Next-Frame Decoding for Ultra-Low-Bitrate Image Compression with Video Diffusion Priors cs.CVPDF
Yunuo Chen, Chuqin Zhou, Jiangchuan Li, Xiaoyue Ling, Bing He
TL;DR: 本文提出了一种利用视频扩散先验进行超低码率图像压缩的新范式,通过定义中间锚定帧并利用预训练视频扩散模型进行下一帧预测,将生成式解码过程重新解释为从锚定帧到最终重建图像的虚拟时间过渡。
Details
Motivation: 动机在于利用生成式图像压缩中的“时间”演化,通过引入一个保留场景几何和语义布局但丢弃高频细节的紧凑锚定帧,将解码过程转化为下一帧预测任务,以提升超低码率图像压缩的保真度和真实感。
Result: 在CLIC2020测试集上,与DiffC相比,该方法在LPIPS、DISTS、FID和KID指标上实现了超过50%的码率节省,同时解码速度提升了高达5倍,在客观和主观性能上均表现出色。
Insight: 创新点在于将图像压缩解码过程重新定义为基于视频扩散先验的下一帧预测任务,通过引入可见且语义忠实的锚定帧作为初始状态,有效结合了生成模型的时序先验,从而在超低码率下实现了更高的压缩效率和视觉质量。
Abstract: We present a novel paradigm for ultra-low-bitrate image compression (ULB-IC) that exploits the ``temporal’’ evolution in generative image compression. Specifically, we define an explicit intermediate state during decoding: a compact anchor frame, which preserves the scene geometry and semantic layout while discarding high-frequency details. We then reinterpret generative decoding as a virtual temporal transition from this anchor to the final reconstructed image.To model this progression, we leverage a pretrained video diffusion model (VDM) as temporal priors: the anchor frame serves as the initial frame and the original image as the target frame, transforming the decoding process into a next-frame prediction task.In contrast to image diffusion-based ULB-IC models, our decoding proceeds from a visible, semantically faithful anchor, which improves both fidelity and realism for perceptual image compression. Extensive experiments demonstrate that our method achieves superior objective and subjective performance. On the CLIC2020 test set, our method achieves over \textbf{50% bitrate savings} across LPIPS, DISTS, FID, and KID compared to DiffC, while also delivering a significant decoding speedup of up to $\times$5. Code will be released later.
[238] WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation cs.CVPDF
Hainuo Wang, Mingjia Li, Xiaojie Guo
TL;DR: 本文提出了一种名为WiT(Waypoint Diffusion Transformers)的新方法,旨在解决像素空间中流匹配模型因语义连续性不足而导致的轨迹冲突问题。该方法通过引入从预训练视觉模型投影出的语义路标点,将连续向量场分解为从先验到路标点和从路标点到像素的两段最优传输路径,从而解耦生成轨迹。在ImageNet 256x256数据集上的实验表明,WiT超越了现有像素空间基线,并将JiT训练收敛速度提升了2.2倍。
Details
Motivation: 当前直接在像素空间操作的流匹配模型避免了潜在自编码器的重构瓶颈,但像素流形中语义连续性的缺乏导致最优传输路径严重纠缠,尤其在交叉点附近产生严重的轨迹冲突,从而产生次优解。本文旨在直接解决像素空间中的轨迹纠缠问题,而非通过有信息损失的潜在表示来规避。
Result: 在ImageNet 256x256基准测试中,WiT超越了强大的像素空间基线模型,并将JiT(Just-in-Time)训练收敛速度加速了2.2倍。
Insight: 论文的核心创新点在于通过引入语义路标点来分解和引导像素空间的生成轨迹,具体通过一个轻量级生成器动态推断中间路标点,并利用Just-Pixel AdaLN机制持续调节主扩散变换器。从客观角度看,这种方法直接在像素空间中操作以保持信息完整性,同时利用预训练模型的语义先验来结构化生成路径,为解决像素空间生成模型的轨迹冲突问题提供了一种新颖且有效的思路。
Abstract: While recent Flow Matching models avoid the reconstruction bottlenecks of latent autoencoders by operating directly in pixel space, the lack of semantic continuity in the pixel manifold severely intertwines optimal transport paths. This induces severe trajectory conflicts near intersections, yielding sub-optimal solutions. Rather than bypassing this issue via information-lossy latent representations, we directly untangle the pixel-space trajectories by proposing Waypoint Diffusion Transformers (WiT). WiT factorizes the continuous vector field via intermediate semantic waypoints projected from pre-trained vision models. It effectively disentangles the generation trajectories by breaking the optimal transport into prior-to-waypoint and waypoint-to-pixel segments. Specifically, during the iterative denoising process, a lightweight generator dynamically infers these intermediate waypoints from the current noisy state. They then continuously condition the primary diffusion transformer via the Just-Pixel AdaLN mechanism, steering the evolution towards the next state, ultimately yielding the final RGB pixels. Evaluated on ImageNet 256x256, WiT beats strong pixel-space baselines, accelerating JiT training convergence by 2.2x. Code will be publicly released at https://github.com/hainuo-wang/WiT.git.
[239] SNCE: Geometry-Aware Supervision for Scalable Discrete Image Generation cs.CVPDF
Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Aditya Grover
TL;DR: 本文提出了一种名为随机邻居交叉熵最小化(SNCE)的新训练目标,旨在解决大规模VQ码本离散图像生成器的优化挑战。该方法通过构建一个基于邻近令牌的软分类分布进行监督,而非使用硬性独热目标,从而鼓励模型在量化嵌入空间中捕获有语义意义的几何结构。实验表明,SNCE在ImageNet-256类别条件生成、大规模文本到图像合成和图像编辑任务中,相比标准交叉熵目标,显著提高了收敛速度和整体生成质量。
Details
Motivation: 当前离散图像生成中,扩大VQ码本规模能显著提升重建保真度,但训练大规模码本的生成模型面临挑战,通常需要更大的模型和更长的训练周期。本文旨在解决这一优化难题。
Result: 在类别条件ImageNet-256生成、大规模文本到图像合成和图像编辑任务上的广泛实验表明,SNCE相比标准交叉熵目标,显著提升了收敛速度和整体生成质量。
Insight: 创新点在于提出SNCE训练目标,用基于邻近令牌嵌入相似性的软分布替代硬独热目标,从而在量化空间中引入几何感知监督,这有助于更高效地训练大规模码本生成模型,并捕获嵌入空间的语义几何结构。
Abstract: Recent advancements in discrete image generation showed that scaling the VQ codebook size significantly improves reconstruction fidelity. However, training generative models with a large VQ codebook remains challenging, typically requiring larger model size and a longer training schedule. In this work, we propose Stochastic Neighbor Cross Entropy Minimization (SNCE), a novel training objective designed to address the optimization challenges of large-codebook discrete image generators. Instead of supervising the model with a hard one-hot target, SNCE constructs a soft categorical distribution over a set of neighboring tokens. The probability assigned to each token is proportional to the proximity between its code embedding and the ground-truth image embedding, encouraging the model to capture semantically meaningful geometric structure in the quantized embedding space. We conduct extensive experiments across class-conditional ImageNet-256 generation, large-scale text-to-image synthesis, and image editing tasks. Results show that SNCE significantly improves convergence speed and overall generation quality compared to standard cross-entropy objectives.
[240] TextOVSR: Text-Guided Real-World Opera Video Super-Resolution cs.CVPDF
Hua Chang, Xin Xu, Wei Liu, Jiayi Wu, Kui Jiang
TL;DR: 本文提出了一种文本引导的双分支歌剧视频超分辨率网络(TextOVSR),旨在解决经典歌剧视频因早期拍摄设备限制和长期存储退化导致的视觉质量低下问题。该方法通过引入退化描述文本和内容描述文本,分别约束解空间并提供语义引导,结合设计的退化鲁棒特征融合模块,在自建的OperaLQ基准测试中实现了定性和定量上的SOTA性能。
Details
Motivation: 现有真实世界视频超分辨率方法在应用于退化的歌剧视频时面临两大挑战:一是难以准确建模真实世界的复杂退化,简单的经典退化核组合无法捕捉真实噪声分布,而从外部数据集提取真实噪声块易导致风格不匹配和视觉伪影;二是仅依赖退化图像特征的方法因缺乏高层语义引导,难以重建逼真且细致的纹理。
Result: 在作者构建的OperaLQ基准测试上进行实验,结果表明TextOVSR在定性和定量评估上均优于现有的最先进方法。
Insight: 创新点在于引入了两种文本提示(退化描述文本和内容描述文本)来分别引导超分辨率过程,并设计了文本增强判别器和退化鲁棒特征融合模块,实现了跨模态特征的有效融合与退化干扰的抑制,为真实世界视频修复提供了语义引导的新思路。
Abstract: Many classic opera videos exhibit poor visual quality due to the limitations of early filming equipment and long-term degradation during storage. Although real-world video super-resolution (RWVSR) has achieved significant advances in recent years, directly applying existing methods to degraded opera videos remains challenging. The difficulties are twofold. First, accurately modeling real-world degradations is complex: simplistic combinations of classical degradation kernels fail to capture the authentic noise distribution, while methods that extract real noise patches from external datasets are prone to style mismatches that introduce visual artifacts. Second, current RWVSR methods, which rely solely on degraded image features, struggle to reconstruct realistic and detailed textures due to a lack of high-level semantic guidance. To address these issues, we propose a Text-guided Dual-Branch Opera Video Super-Resolution (TextOVSR) network, which introduces two types of textual prompts to guide the super-resolution process. Specifically, degradation-descriptive text, derived from the degradation process, is incorporated into the negative branch to constrain the solution space. Simultaneously, content-descriptive text is incorporated into a positive branch and our proposed Text-Enhanced Discriminator (TED) to provide semantic guidance for enhanced texture reconstruction. Furthermore, we design a Degradation-Robust Feature Fusion (DRF) module to facilitate cross-modal feature fusion while suppressing degradation interference. Experiments on our OperaLQ benchmark show that TextOVSR outperforms state-of-the-art methods both qualitatively and quantitatively. The code is available at https://github.com/ChangHua0/TextOVSR.
[241] Vision-Language Model Based Multi-Expert Fusion for CT Image Classification cs.CVPDF
Jianfa Bai, Kejin Lu, Runtian Yuan, Qingqiu Li, Jilan Xu
TL;DR: 本文提出了一种用于多源COVID-19 CT图像分类的三阶段源感知多专家融合框架。该框架首先构建了一个结合原始CT和肺部提取CT的3D专家模型进行体积分类,然后开发了两个基于MedSigLIP的专家模型分别进行切片级表征学习和基于Transformer的切片间上下文建模,最后训练一个源分类器来预测测试扫描的潜在来源,并基于预测的源信息进行模型融合与投票。
Details
Motivation: 解决在多机构环境下,由于显著的源偏移、源不平衡以及隐藏的测试源身份,导致从胸部CT中稳健检测COVID-19仍然具有挑战性的问题。
Result: 在覆盖所有四个来源的验证集上,第一阶段模型取得了最佳的宏观F1分数0.9711、准确率0.9712和AUC 0.9791。第二阶段的两个专家模型分别取得了最佳AUC分数0.9864和0.9854。第三阶段的源分类器准确率达到0.9107,F1分数为0.9114。
Insight: 创新点在于提出了一个层次化的、源感知的多专家融合策略,通过结合3D体积信息、基于视觉语言模型的切片级表征与上下文建模,并显式地建模和利用测试数据的潜在来源信息进行自适应融合,为异构多源条件下的鲁棒医学图像分类提供了有效解决方案。
Abstract: Robust detection of COVID-19 from chest CT remains challenging in multi-institutional settings due to substantial source shift, source imbalance, and hidden test-source identities. In this work, we propose a three-stage source-aware multi-expert framework for multi-source COVID-19 CT classification. First, we build a lung-aware 3D expert by combining original CT volumes and lung-extracted CT volumes for volumetric classification. Second, we develop two MedSigLIP-based experts: a slice-wise representation and probability learning module, and a Transformer-based inter-slice context modeling module for capturing cross-slice dependency. Third, we train a source classifier to predict the latent source identity of each test scan. By leveraging the predicted source information, we perform model fusion and voting based on different experts. On the validation set covering all four sources, the Stage 1 model achieves the best macro-F1 of 0.9711, ACC of 0.9712, and AUC of 0.9791. Stage2a and Stage2b achieve the best AUC scores of 0.9864 and 0.9854, respectively. Stage~3 source classifier reaches 0.9107 ACC and 0.9114 F1. These results demonstrate that source-aware expert modeling and hierarchical voting provide an effective solution for robust COVID-19 CT classification under heterogeneous multi-source conditions.
[242] DAIT: Distillation from Vision-Language Models to Lightweight Classifiers with Adaptive Intermediate Teacher Transfer cs.CVPDF
Zhengxu He, Jun Li, Zhijian Wu
TL;DR: 本文提出了一种名为DAIT的自适应中间教师迁移蒸馏方法,旨在将大规模视觉语言模型(VLMs)的丰富多模态语义知识高效地迁移到轻量级分类器中,以解决细粒度视觉分类(FGVC)任务在资源受限环境下的部署难题。该方法通过引入一个可训练的中间教师模型,在目标任务的明确监督下学习并适配冻结的VLM表示,从而产生紧凑且与任务对齐的知识,以供轻量级学生模型可靠地蒸馏。
Details
Motivation: 大规模视觉语言模型(VLMs)虽能编码丰富的多模态语义,有益于细粒度视觉分类(FGVC),但其巨大的计算成本阻碍了在资源受限环境中的实际部署。传统的知识蒸馏方法直接从通用VLM迁移到紧凑学生模型,常因严重的架构失准和引入任务无关信息而导致次优结果。
Result: 在多个FGVC基准测试(包括FGVC-Aircraft和CUB-200-2011数据集)上,使用不同学生架构的广泛评估表明,DAIT方法分别实现了12.63%和8.34%的性能提升,确立了其作为从通用VLM迁移到可部署细粒度识别模型的原则性范式。
Insight: 论文的创新点在于提出了自适应中间教师迁移(DAIT)机制,通过一个可训练的中间教师来桥接VLM与学生模型之间的鸿沟,该教师能自适应地增强判别性视觉线索,并产生与任务对齐的紧凑知识。从客观角度看,这种中间层适配策略有效缓解了架构失准问题,并过滤了任务无关信息,为将大型预训练模型的能力高效迁移到轻量级下游模型提供了可借鉴的思路。
Abstract: Large-scale Vision-Language Models (VLMs) encode rich multimodal semantics that are highly beneficial for fine-grained visual categorization (FGVC). However, their prohibitive computational cost hinders practical deployment in resource-constrained environments. Although knowledge distillation contributes to transferring VLMs capacity to lightweight classifiers, conventional distillation mechanisms, which directly transfer from a generic VLM to a compact student, often yield suboptimal results due to severe architectural misalignment and introducing task-irrelevant information. To alleviate this limitation, we propose Distillation with Adaptive Intermediate Teacher transfer (DAIT) in this study, facilitating adaptive knowledge transfer from VLMs to lightweight students. DAIT introduces a trainable intermediate teacher that learns to transfer frozen VLMs representations under explicit supervision from the target fine-grained task. This intermediate teacher adaptively enhances discriminative visual cues, thereby producing compact and task-aligned knowledge that can be reliably distilled into lightweight models. Extensive evaluations on multiple FGVC benchmarks with diverse student architectures demonstrate that our method achieves respective performance gains of 12.63% and 8.34% on FGVC-Aircraft and CUB-200-2011 datasets, establishing DAIT as a principled paradigm for transferring from general-purpose VLMS to deployable fine-grained recognition models.
[243] Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding cs.CVPDF
Sosuke Yamao, Natsuki Miyahara, Yuankai Qi, Shun Takeuchi
TL;DR: 本文提出了一种名为QViC-MF的框架,用于解决长视频理解任务。其核心创新在于建立了一个由问题引导、带有记忆反馈的视觉压缩机制,通过迭代处理视频片段并利用存储的过去视觉上下文来增强当前感知,从而显著提升了在多个长视频理解基准测试上的性能。
Details
Motivation: 现有基于Transformer的视觉压缩器和记忆增强方法在处理长视频时,通常独立压缩每一帧,导致在需要理解完整事件(如时序排序)的任务上表现不佳。这促使作者重新思考从感知到记忆的单向流程,转而建立一个反馈驱动的过程。
Result: 在多个长视频理解基准测试上取得了显著提升:在MLVU测试集上超过当前SOTA方法6.1%,在LVBench上提升8.3%,在VNBench Long上提升18.3%,在VideoMME Long上提升3.7%。
Insight: 主要创新点在于提出了问题引导的多模态选择性注意力机制和迭代的记忆反馈循环。这改变了传统单向压缩模式,使模型能够根据问题动态地选择和保留与当前片段及过去相关帧相关的视觉信息,从而更有效地处理长视频中的复杂事件理解任务。
Abstract: In the context of long-term video understanding with large multimodal models, many frameworks have been proposed. Although transformer-based visual compressors and memory-augmented approaches are often used to process long videos, they usually compress each frame independently and therefore fail to achieve strong performance on tasks that require understanding complete events, such as temporal ordering tasks in MLVU and VNBench. This motivates us to rethink the conventional one-way scheme from perception to memory, and instead establish a feedbackdriven process in which past visual contexts stored in the context memory can benefit ongoing perception. To this end, we propose Question-guided Visual Compression with Memory Feedback (QViC-MF), a framework for long-term video understanding. At its core is a Question-guided Multimodal Selective Attention (QMSA), which learns to preserve visual information related to the given question from both the current clip and the past related frames from the memory. The compressor and memory feedback work iteratively for each clip of the entire video. This simple yet effective design yields large performance gains on longterm video understanding tasks. Extensive experiments show that our method achieves significant improvement over current state-of-the-art methods by 6.1% on MLVU test, 8.3% on LVBench, 18.3% on VNBench Long, and 3.7% on VideoMME Long. The code will be released publicly.
[244] Multimodal Connectome Fusion via Cross-Attention for Autism Spectrum Disorder Classification Using Graph Learning cs.CV | cs.AIPDF
Ansar Rahman, Hassan Shojaee-Mend, Sepideh Hatamikia
TL;DR: 本研究提出了一种用于自闭症谱系障碍(ASD)分类的多模态图学习框架。该框架以静息态功能磁共振成像(rs-fMRI)的功能连接为主导,通过一种新颖的非对称跨注意力机制,选择性地整合结构磁共振成像(sMRI)和表型信息,并在ABIDE-I数据集上进行了评估。
Details
Motivation: ASD是一种复杂的神经发育障碍,其诊断依赖于识别大脑功能连接和结构组织的异常。尽管rs-fMRI和sMRI信息互补,但如何在一个统一框架内有效整合这些异质性多模态数据仍具挑战性。
Result: 在ABIDE-I数据集上,使用分层10折交叉验证,该框架取得了87.3%的AUC和84.4%的准确率。在使用留一站点交叉验证(LOSO-CV)时,模型取得了82.0%的平均跨站点准确率,分别比现有方法在10折交叉验证和LOSO-CV下高出约3%和7%,达到了新的SOTA水平。
Insight: 论文宣称的创新点在于提出了一种新颖的非对称Transformer跨注意力机制,允许功能嵌入在保持主导地位的同时,选择性地整合互补的结构信息。从客观角度看,该框架将多模态数据(功能、结构、表型)统一于一个基于图的表示学习框架中,并通过表型信息编码的成对关联编码器(PAE)来建模被试间关系,这是一种有效处理异质多站点数据并提升泛化能力的方法。
Abstract: Autism spectrum disorder (ASD) is a complex neurodevelopmental condition characterized by atypical functional brain connectivity and subtle structural alterations. rs-fMRI has been widely used to identify disruptions in large-scale brain networks, while structural MRI provides complementary information about morphological organization. Despite their complementary nature, effectively integrating these heterogeneous imaging modalities within a unified framework remains challenging. This study proposes a multimodal graph learning framework that preserves the dominant role of functional connectivity while integrating structural imaging and phenotypic information for ASD classification. The proposed framework is evaluated on ABIDE-I dataset. Each subject is represented as a node within a population graph. Functional and structural features are extracted as modality-specific node attributes, while inter-subject relationships are modeled using a pairwise association encoder (PAE) based on phenotypic information. Two Edge Variational GCNs are trained to learn subject-level embeddings. To enable effective multimodal integration, we introduce a novel asymmetric transformer-based cross-attention mechanism that allows functional embeddings to selectively incorporate complementary structural information while preserving functional dominance. The fused embeddings are then passed to a MLP for ASD classification. Using stratified 10-fold cross-validation, the framework achieved an AUC of 87.3% and an accuracy of 84.4%. Under leave-one-site-out cross-validation (LOSO-CV), the model achieved an average cross-site accuracy of 82.0%, outperforming existing methods by approximately 3% under 10-fold cross-validation and 7% under LOSO-CV. The proposed framework effectively integrates heterogeneous multimodal data from the multi-site ABIDE-I dataset, improving automated ASD classification across imaging sites.
[245] HYDRA: Unifying Multi-modal Generation and Understanding via Representation-Harmonized Tokenization cs.CVPDF
Xuerui Qiu, Yutao Cui, Guozhen Zhang, Junzhe Li, JiaKui Hu
TL;DR: 本文提出HYDRA,一个通过表示协调标记化(HYDRA-TOK)统一多模态生成与理解的原生统一框架。其核心是一个渐进式学习的纯视觉Transformer(ViT),能够从捕捉结构保持基元的生成模式(Gen-ViT)过渡到进行语义编码的理解模式(Sem-ViT),并通过一个生成-语义瓶颈(GSB)进行协调,从而在单一参数空间内整合感知与生成。
Details
Motivation: 现有统一多模态模型在视觉理解所需的抽象表示和生成所需的详细基元之间存在根本性差距,常采用解耦编码器、在VAE上堆叠表示编码器或离散量化等方法,但这些方法会破坏信息一致性并导致优化冲突。
Result: HYDRA在多个基准测试中达到新的最先进水平(SOTA):在视觉重建上取得rFID 0.08的基准成绩;在生成任务上,于GenEval(0.86)、DPG-Bench(86.4)和WISE(0.53)取得顶级性能;同时在八个具有挑战性的理解基准上,平均超越之前的原生统一多模态模型10.0分。
Insight: 主要创新点在于提出“从生成到理解”的视觉建模演进视角,以及实现该视角的HYDRA-TOK架构。其通过一个渐进式Transformer和中间的生成-语义瓶颈(GSB)来协调生成与理解任务,在低维空间过滤噪声以支持鲁棒合成,再恢复维度以支持复杂语义理解,从而在单一模型中实现生成与理解能力的原生统一与协同优化。
Abstract: Unified Multimodal Models struggle to bridge the fundamental gap between the abstract representations needed for visual understanding and the detailed primitives required for generation. Existing approaches typically compromise by employing decoupled encoders, stacking representation encoder atop VAEs, or utilizing discrete quantization. However, these methods often disrupt information coherence and lead to optimization conflicts. To this end, we introduce HYDRA-TOK, a representation-harmonized pure ViT in the insight that visual modeling should evolve from generation to understanding. HYDRA-TOK reformulates the standard backbone into a progressive learner that transitions from a Gen-ViT, which captures structure-preserving primitives, to a Sem-ViT for semantic encoding. Crucially, this transition is mediated by a Generation-Semantic Bottleneck (GSB), which compresses features into a low-dimensional space to filter noise for robust synthesis, then restores dimensionality to empower complex semantic comprehension. Built upon this foundation, we present HYDRA, a native unified framework integrating perception and generation within a single parameter space. Extensive experiments establish HYDRA as a new state-of-the-art. It sets a benchmark in visual reconstruction (rFID 0.08) and achieves top-tier generation performance on GenEval (0.86), DPG-Bench (86.4), and WISE (0.53), while simultaneously outperforming previous native UMMs by an average of 10.0 points across eight challenging understanding benchmarks.
[246] Multi-turn Physics-informed Vision-language Model for Physics-grounded Anomaly Detection cs.CVPDF
Yao Gu, Xiaohao Xu, Yingna Wu
TL;DR: 本文提出了一种多轮物理信息视觉语言模型框架,通过将物体属性、运动范式和动态约束编码为结构化提示,以多轮对话形式传递物理先验知识,从而增强模型对动态异常(如不规则旋转或违反机械运动)的因果推理能力。
Details
Motivation: 现有视觉语言模型主要基于外观相关性训练,难以捕捉运动学约束,导致在物理基础异常检测任务中表现不佳,本文旨在解决这一局限性。
Result: 在Phys-AD基准测试中,该方法在视频级检测上达到96.7%的AUROC,显著超越先前SOTA(66.9%),并在因果解释方面获得0.777的LLM分数,表现出优越性能。
Insight: 创新点在于通过结构化物理先验和多轮对话分解因果推理步骤,将物理知识显式整合到视觉语言模型中,从而提升对动态异常的检测和解释能力,为模型注入领域知识提供了可借鉴的思路。
Abstract: Vision-Language Models (VLMs) demonstrate strong general-purpose reasoning but remain limited in physics-grounded anomaly detection, where causal understanding of dynamics is essential. Existing VLMs, trained predominantly on appearance-centric correlations, fail to capture kinematic constraints, leading to poor performance on anomalies such as irregular rotations or violated mechanical motions. We introduce a physics-informed instruction tuning framework that explicitly encodes object properties, motion paradigms, and dynamic constraints into structured prompts. By delivering these physical priors through multi-turn dialogues, our method decomposes causal reasoning into incremental steps, enabling robust internal representations of normal and abnormal dynamics. Evaluated on the Phys-AD benchmark, our approach achieves 96.7% AUROC in video-level detection–substantially outperforming prior SOTA (66.9%)–and yields superior causal explanations (0.777 LLM score). This work highlights how structured physics priors can transform VLMs into reliable detectors of dynamic anomalies.
[247] HalDec-Bench: Benchmarking Hallucination Detector in Image Captioning cs.CVPDF
Kuniaki Saito, Risa Shinoda, Shohei Tanaka, Tosho Hirasawa, Fumio Okura
TL;DR: 本文介绍了HalDec-Bench,一个用于评估图像描述任务中幻觉检测器性能的基准测试。该基准包含由多种视觉语言模型生成的描述、人工标注的幻觉存在性、详细幻觉类型类别以及片段级标签,旨在以原则化和可解释的方式评估检测器在不同难度和幻觉类型上的表现。
Details
Motivation: 当前缺乏一个全面的基准来评估视觉语言模型作为幻觉检测器在不同描述模型和幻觉类型上的泛化能力,这阻碍了高质量图像-描述对数据集的构建和模型评估。
Result: HalDec-Bench揭示了在现有多模态推理或对齐基准中无法观察到的模型性能差异,并发现检测器倾向于将响应开头的句子识别为正确,而无论其实际正确性。实验表明,使用强大的VLM作为过滤器,同时采用较新的VLM作为描述生成器,可以显著减少数据集噪声。
Insight: 论文的创新点在于构建了一个系统性的幻觉检测基准,提供了细粒度的标注和多样化的任务难度,从而能够更全面地评估检测器的能力。客观分析认为,其揭示的检测器偏差(如位置偏见)和提出的数据过滤策略(强VLM过滤+新VLM生成)对提升数据集质量和模型鲁棒性具有借鉴意义。
Abstract: Hallucination detection in captions (HalDec) assesses a vision-language model’s ability to correctly align image content with text by identifying errors in captions that misrepresent the image. Beyond evaluation, effective hallucination detection is also essential for curating high-quality image-caption pairs used to train VLMs. However, the generalizability of VLMs as hallucination detectors across different captioning models and hallucination types remains unclear due to the lack of a comprehensive benchmark. In this work, we introduce HalDec-Bench, a benchmark designed to evaluate hallucination detectors in a principled and interpretable manner. HalDec-Bench contains captions generated by diverse VLMs together with human annotations indicating the presence of hallucinations, detailed hallucination-type categories, and segment-level labels. The benchmark provides tasks with a wide range of difficulty levels and reveals performance differences across models that are not visible in existing multimodal reasoning or alignment benchmarks. Our analysis further uncovers two key findings. First, detectors tend to recognize sentences appearing at the beginning of a response as correct, regardless of their actual correctness. Second, our experiments suggest that dataset noise can be substantially reduced by using strong VLMs as filters while employing recent VLMs as caption generators. Our project page is available at https://dahlian00.github.io/HalDec-Bench-Page/.
[248] Flash-Unified: A Training-Free and Task-Aware Acceleration Framework for Native Unified Models cs.CVPDF
Junlong Ke, Zichen Wen, Boxue Yang, Yantai Yang, Xuyang Liu
TL;DR: 本文提出了一种名为FlashU的训练无关、任务感知的加速框架,旨在解决原生统一多模态模型(兼具生成与理解能力)因计算开销大而难以实际部署的问题。该框架通过任务特定的网络剪枝和动态层跳过等技术,针对生成和理解任务的不同计算特性进行优化,在保持SOTA性能的同时实现了显著的推理加速。
Details
Motivation: 现有加速技术通常采用静态、单一的策略,忽略了迭代生成任务(如图像生成)与单次理解任务(如VQA)在计算特性上的根本差异。本文通过对统一模型的系统性分析,揭示了其内部存在明显的参数专业化现象,即不同任务依赖于不同的神经元集合,这为任务感知的加速提供了理论基础。
Result: 在Show-o2基准测试上的大量实验表明,FlashU在理解和生成任务上均实现了1.78倍到2.01倍的推理加速,同时保持了SOTA性能,超越了其他竞争的统一模型。
Insight: 论文的创新点在于首次对统一模型进行了系统性分析,揭示了其参数专业化特性,并据此提出了首个训练无关、任务感知的加速框架。从客观角度看,其核心洞察在于认识到统一模型内部已隐式地内化了针对不同任务的独立推理路径,从而允许通过任务特定的剪枝、动态层跳过、动态令牌剪枝以及针对扩散模型的时变控制和缓存等技术,在不重新训练的情况下实现高效、精准的加速。
Abstract: Native unified multimodal models, which integrate both generative and understanding capabilities, face substantial computational overhead that hinders their real-world deployment. Existing acceleration techniques typically employ a static, monolithic strategy, ignoring the fundamental divergence in computational profiles between iterative generation tasks (e.g., image generation) and single-pass understanding tasks (e.g., VQA). In this work, we present the first systematic analysis of unified models, revealing pronounced parameter specialization, where distinct neuron sets are critical for each task. This implies that, at the parameter level, unified models have implicitly internalized separate inference pathways for generation and understanding within a single architecture. Based on these insights, we introduce a training-free and task-aware acceleration framework, FlashU, that tailors optimization to each task’s demands. Across both tasks, we introduce Task-Specific Network Pruning and Dynamic Layer Skipping, aiming to eliminate inter-layer and task-specific redundancy. For visual generation, we implement a time-varying control signal for the guidance scale and a temporal approximation for the diffusion head via Diffusion Head Cache. For multimodal understanding, building upon the pruned model, we introduce Dynamic Token Pruning via a V-Norm Proxy to exploit the spatial redundancy of visual inputs. Extensive experiments on Show-o2 demonstrate that FlashU achieves 1.78$\times$ to 2.01$\times$ inference acceleration across both understanding and generation tasks while maintaining SOTA performance, outperforming competing unified models and validating our task-aware acceleration paradigm. Our code is publicly available at https://github.com/Rirayh/FlashU.
[249] Generative Video Compression with One-Dimensional Latent Representation cs.CVPDF
Zihan Zheng, Zhaoyang Jia, Naifu Xue, Jiahao Li, Bin Li
TL;DR: 本文提出了一种名为GVC1D的生成式视频压缩方法,其核心创新在于将视频编码为一维潜在表示,而非传统的二维潜在网格。该方法通过结合短期和长期上下文信息,生成紧凑的一维潜在令牌,从而更有效地减少视频中的空间和时间冗余。
Details
Motivation: 现有基于二维潜在网格的生成式视频编解码器在充分利用时空冗余方面存在两个关键挑战:空间上,刚性结构导致帧内冗余难以消除;时间上,难以以紧凑且语义连贯的方式建模长期相关性。
Result: 在HEVC Class B数据集上的实验结果表明,GVC1D在LPIPS和DISTS指标下分别实现了60.4%和68.8%的码率降低,超越了先前的视频压缩方法。
Insight: 主要创新点在于引入一维潜在表示来替代二维网格,这允许自适应地关注语义区域并自然促进令牌精简,从而减少空间冗余;同时,提出的一维记忆机制以较低计算成本提供语义丰富的长期上下文,进一步减少时间冗余。这种从二维到一维表示的范式转变是提升压缩效率的关键。
Abstract: Recent advancements in generative video codec (GVC) typically encode video into a 2D latent grid and employ high-capacity generative decoders for reconstruction. However, this paradigm still leaves two key challenges in fully exploiting spatial-temporal redundancy: Spatially, the 2D latent grid inevitably preserves intra-frame redundancy due to its rigid structure, where adjacent patches remain highly similar, thereby necessitating a higher bitrate. Temporally, the 2D latent grid is less effective for modeling long-term correlations in a compact and semantically coherent manner, as it hinders the aggregation of common contents across frames. To address these limitations, we introduce Generative Video Compression with One-Dimensional (1D) Latent Representation (GVC1D). GVC1D encodes the video data into extreme compact 1D latent tokens conditioned on both short- and long-term contexts. Without the rigid 2D spatial correspondence, these 1D latent tokens can adaptively attend to semantic regions and naturally facilitate token reduction, thereby reducing spatial redundancy. Furthermore, the proposed 1D memory provides semantically rich long-term context while maintaining low computational cost, thereby further reducing temporal redundancy. Experimental results indicate that GVC1D attains superior compression efficiency, where it achieves bitrate reductions of 60.4% under LPIPS and 68.8% under DISTS on the HEVC Class B dataset, surpassing the previous video compression methods.Project: https://gvc1d.github.io/
[250] MeMix: Writing Less, Remembering More for Streaming 3D Reconstruction cs.CVPDF
Jiacheng Dong, Huan Li, Sicheng Zhou, Wenhao Hu, Weili Xu
TL;DR: 本文提出了MeMix,一种无需训练、即插即用的模块,用于改进流式3D重建。它通过将循环状态重新构建为记忆混合体,将状态划分为多个独立的内存块,并仅更新对齐度最差的内存块,同时精确保留其他块,从而缓解灾难性遗忘并保持O(1)的推理内存开销。
Details
Motivation: 现有循环在线模型在长序列流式3D重建中常因状态漂移和遗忘导致性能逐渐退化,需要推理时的补救措施。
Result: 在标准基准测试(ScanNet、7-Scenes、KITTI等)上,使用相同骨干网络和推理设置,MeMix在7-Scenes的300-500帧流上平均将重建完整性误差降低了15.3%(最高达40.0%)。
Insight: 创新点在于将循环状态划分为独立内存块并进行选择性更新,这是一种无需训练、参数不增加的轻量级方法,能有效缓解长期序列中的遗忘问题,可直接应用于现有循环重建模型。
Abstract: Reconstruction is a fundamental task in 3D vision and a fundamental capability for spatial intelligence. Particularly, streaming 3D reconstruction is central to real-time spatial perception, yet existing recurrent online models often suffer from progressive degradation on long sequences due to state drift and forgetting, motivating inference-time remedies. We present MeMix, a training-free, plug-and-play module that improves streaming reconstruction by recasting the recurrent state into a Memory Mixture. MeMix partitions the state into multiple independent memory patches and updates only the least-aligned memory patches while exactly preserving others. This selective update mitigates catastrophic forgetting while retaining $O(1)$ inference memory, and requires no fine-tuning or additional learnable parameters, making it directly applicable to existing recurrent reconstruction models. Across standard benchmarks (ScanNet, 7-Scenes, KITTI, etc.), under identical backbones and inference settings, MeMix reduces reconstruction completeness error by 15.3% on average (up to 40.0%) across 300–500 frame streams on 7-Scenes. The code is available at https://dongjiacheng06.github.io/MeMix/
[251] Trajectory-Diversity-Driven Robust Vision-and-Language Navigation cs.CVPDF
Jiangyang Li, Cong Wan, SongLin Dong, Chenhao Ding, Qiang Wang
TL;DR: 本文提出NavGRPO,一种基于强化学习的视觉语言导航框架,通过组内相对策略优化学习目标导向的导航策略,探索多样轨迹并利用组内性能比较进行优化,无需额外价值网络。该方法在R2R和REVERIE基准测试中展现出卓越的鲁棒性,特别是在未见环境和极端扰动下显著提升性能。
Details
Motivation: 当前视觉语言导航方法主要依赖模仿学习,存在泛化能力有限和对执行扰动鲁棒性差的问题,本文旨在通过强化学习框架提升导航策略的鲁棒性和泛化能力。
Result: 在ScaleVLN基础上,NavGRPO在R2R和REVERIE基准测试的未见环境中分别实现SPL提升+3.0%和+1.71%,达到SOTA水平;在极端早期扰动下,相比基线获得+14.89%的SPL增益,证实了目标导向强化学习训练能构建更鲁棒的导航策略。
Insight: 创新点在于引入组内相对策略优化,通过探索多样轨迹和组内性能比较来区分有效策略,无需额外价值网络;从客观角度看,该方法将强化学习与轨迹多样性探索结合,有效提升了导航策略的鲁棒性和泛化能力,为视觉语言导航领域提供了新的训练范式。
Abstract: Vision-and-Language Navigation (VLN) requires agents to navigate photo-realistic environments following natural language instructions. Current methods predominantly rely on imitation learning, which suffers from limited generalization and poor robustness to execution perturbations. We present NavGRPO, a reinforcement learning framework that learns goal-directed navigation policies through Group Relative Policy Optimization. By exploring diverse trajectories and optimizing via within-group performance comparisons, our method enables agents to distinguish effective strategies beyond expert paths without requiring additional value networks. Built on ScaleVLN, NavGRPO achieves superior robustness on R2R and REVERIE benchmarks with +3.0% and +1.71% SPL improvements in unseen environments. Under extreme early-stage perturbations, we demonstrate +14.89% SPL gain over the baseline, confirming that goal-directed RL training builds substantially more robust navigation policies. Code and models will be released.
[252] Spectral Rectification for Parameter-Efficient Adaptation of Foundation Models in Colonoscopy Depth Estimation cs.CVPDF
Xiaoxian Zhang, Minghai Shi, Lei Li
TL;DR: 本文提出了一种名为SpecDepth的参数高效适应框架,用于解决基础模型在结肠镜深度估计任务中的泛化问题。核心创新在于引入自适应频谱校正模块,通过可学习的小波分解来显式建模并增强特征图中衰减的高频成分,从而对齐输入信号与基础模型的原始归纳偏置,而非直接进行可能导致高级语义特征扭曲的微调。
Details
Motivation: 在结肠镜中,准确的单目深度估计对于病灶定位和导航至关重要。然而,在自然图像上训练的基础模型无法直接泛化到结肠镜图像。作者发现核心问题并非语义鸿沟,而是频域的统计偏移:结肠镜图像缺乏这些模型进行几何推理所依赖的强高频边缘和纹理梯度。
Result: 在公开的C3VD和SimCol3D数据集上,SpecDepth达到了最先进的性能,绝对相对误差分别为0.022和0.027。
Insight: 摘要宣称的创新点是提出了一种参数高效的自适应频谱校正模块,通过低层次、有针对性的频谱调整来适应域差异,同时保留预训练模型的稳健几何表示。从客观角度看,其核心创新在于将域适应问题重新定义为频域统计偏移问题,并提出了一种显式建模和增强衰减高频成分的机制,这为将视觉基础模型适配到特定医学成像任务提供了一种新颖且有效的策略。
Abstract: Accurate monocular depth estimation is critical in colonoscopy for lesion localization and navigation. Foundation models trained on natural images fail to generalize directly to colonoscopy. We identify the core issue not as a semantic gap, but as a statistical shift in the frequency domain: colonoscopy images lack the strong high-frequency edge and texture gradients that these models rely on for geometric reasoning. To address this, we propose SpecDepth, a parameter-efficient adaptation framework that preserves the robust geometric representations of the pre-trained models while adapting to the colonoscopy domain. Its key innovation is an adaptive spectral rectification module, which uses a learnable wavelet decomposition to explicitly model and amplify the attenuated high-frequency components in feature maps. Different from conventional fine-tuning that risks distorting high-level semantic features, this targeted, low-level adjustment realigns the input signal with the original inductive bias of the foundational model. On the public C3VD and SimCol3D datasets, SpecDepth achieved state-of-the-art performance with an absolute relative error of 0.022 and 0.027, respectively. Our work demonstrates that directly addressing spectral mismatches is a highly effective strategy for adapting vision foundation models to specialized medical imaging tasks. The code will be released publicly after the manuscript is accepted for publication.
[253] RieMind: Geometry-Grounded Spatial Agent for Scene Understanding cs.CV | cs.AIPDF
Fernando Ropero, Erkin Turkoz, Daniel Matos, Junqing Du, Antonio Ruiz
TL;DR: 本文提出了RieMind,一种用于静态3D室内场景理解的智能体框架。该框架将大语言模型(LLM)与显式的3D场景图(3DSG)进行几何基础连接,通过解耦感知与推理,并利用结构化几何工具进行交互,显著提升了空间推理能力。
Details
Motivation: 当前视觉语言模型(VLMs)在度量和空间推理方面仍存在困难,现有方法通常将感知与推理耦合在一起。本文旨在研究解耦感知与推理是否能改善空间推理性能。
Result: 在VSI-Bench的静态分割数据集上,该方法在理想感知条件下为空间推理性能提供了一个上限,性能比之前的工作显著高出最多16%,且无需任务特定的微调。与基础VLMs相比,其智能体变体性能显著更好,平均提升在33%到50%之间。
Insight: 核心创新点在于提出了一个基于显式3D场景图的智能体框架,将感知(构建3DSG)与推理(LLM使用几何工具)解耦。客观来看,其利用结构化几何表示和工具作为LLM与场景交互的接口,为纯端到端视觉推理提供了一个有吸引力的替代方案,证明了显式几何基础对提升空间推理的有效性。
Abstract: Visual Language Models (VLMs) have increasingly become the main paradigm for understanding indoor scenes, but they still struggle with metric and spatial reasoning. Current approaches rely on end-to-end video understanding or large-scale spatial question answering fine-tuning, inherently coupling perception and reasoning. In this paper, we investigate whether decoupling perception and reasoning leads to improved spatial reasoning. We propose an agentic framework for static 3D indoor scene reasoning that grounds an LLM in an explicit 3D scene graph (3DSG). Rather than ingesting videos directly, each scene is represented as a persistent 3DSG constructed by a dedicated perception module. To isolate reasoning performance, we instantiate the 3DSG from ground-truth annotations. The agent interacts with the scene exclusively through structured geometric tools that expose fundamental properties such as object dimensions, distances, poses, and spatial relationships. The results we obtain on the static split of VSI-Bench provide an upper bound under ideal perceptual conditions on the spatial reasoning performance, and we find that it is significantly higher than previous works, by up to 16%, without task specific fine-tuning. Compared to base VLMs, our agentic variant achieves significantly better performance, with average improvements between 33% to 50%. These findings indicate that explicit geometric grounding substantially improves spatial reasoning performance, and suggest that structured representations offer a compelling alternative to purely end-to-end visual reasoning.
[254] Pointing-Based Object Recognition cs.CVPDF
Lukáš Hajdúch, Viktor Kocur
TL;DR: 本文提出了一种基于RGB图像识别人类指向手势目标物体的完整流程。该系统整合了物体检测、人体姿态估计、单目深度估计和视觉语言模型等多种现有先进方法,旨在通过非语言交互实现更直观的人机交互。实验表明,利用从单张图像重建的3D空间信息能显著提升目标识别准确率,尤其在物体重叠的复杂场景中。
Details
Motivation: 随着人机交互向更直观的界面发展,识别非语言沟通(如指向手势)的目标物体变得至关重要。本文旨在解决仅凭RGB图像准确识别指向目标的问题,特别是在缺乏专用深度传感器的环境中。
Result: 在自定义数据集上的实验结果表明,引入深度信息显著提高了目标识别性能,尤其是在物体重叠的复杂场景中。系统展示了模块化设计的优势,可在无专用深度传感器的环境中部署。
Insight: 创新点在于将多种SOTA模型(物体检测、姿态估计、深度估计、视觉语言模型)集成到一个统一流程中,并系统评估了单目深度重建和图像描述模型在纠正分类错误方面的效用。可借鉴之处在于其模块化设计和利用2D图像推理3D空间关系以增强指向目标识别的思路。
Abstract: This paper presents a comprehensive pipeline for recognizing objects targeted by human pointing gestures using RGB images. As human-robot interaction moves toward more intuitive interfaces, the ability to identify targets of non-verbal communication becomes crucial. Our proposed system integrates several existing state-of-the-art methods, including object detection, body pose estimation, monocular depth estimation, and vision-language models. We evaluate the impact of 3D spatial information reconstructed from a single image and the utility of image captioning models in correcting classification errors. Experimental results on a custom dataset show that incorporating depth information significantly improves target identification, especially in complex scenes with overlapping objects. The modularity of the approach allows for deployment in environments where specialized depth sensors are unavailable.
[255] Detection of Autonomous Shuttles in Urban Traffic Images Using Adaptive Residual Context cs.CV | cs.AIPDF
Mohamed Aziz Younes, Nicolas Saunier, Guillaume-Alexandre Bilodeau
TL;DR: 本文提出了一种名为自适应残差上下文(ARC)的架构,用于解决在固定摄像头视频目标检测中添加新检测目标(如自动驾驶班车)时出现的灾难性遗忘问题。该架构通过冻结的上下文分支和可训练的任务特定分支,结合上下文引导桥,利用注意力机制传递空间特征,从而在保持预训练表征的同时高效学习新类别。
Details
Motivation: 随着交通自动化发展,需要监控自动驾驶车辆在交通中的交互以评估其安全性。使用固定摄像头和视频目标检测时,为新增检测目标进行微调会导致灾难性遗忘,损害场景理解,这在道路安全应用中至关重要。
Result: 在自定义数据集上的实验表明,ARC在匹配微调基线性能的同时,显著提高了知识保留能力,为复杂城市环境中新增车辆类别提供了一个数据高效的解决方案。
Insight: 创新点在于通过冻结上下文分支和注意力引导的特征传递机制,在增量学习中平衡新任务学习和旧知识保留,为解决目标检测中的灾难性遗忘问题提供了可借鉴的架构设计思路。
Abstract: The progressive automation of transport promises to enhance safety and sustainability through shared mobility. Like other vehicles and road users, and even more so for such a new technology, it requires monitoring to understand how it interacts in traffic and to evaluate its safety. This can be done with fixed cameras and video object detection. However, the addition of new detection targets generally requires a fine-tuning approach for regular detection methods. Unfortunately, this implementation strategy will lead to a phenomenon known as catastrophic forgetting, which causes a degradation in scene understanding. In road safety applications, preserving contextual scene knowledge is of the utmost importance for protecting road users. We introduce the Adaptive Residual Context (ARC) architecture to address this. ARC links a frozen context branch and trainable task-specific branches through a Context-Guided Bridge, utilizing attention to transfer spatial features while preserving pre-trained representations. Experiments on a custom dataset show that ARC matches fine-tuned baselines while significantly improving knowledge retention, offering a data-efficient solution to add new vehicle categories for complex urban environments.
[256] AnyCrowd: Instance-Isolated Identity-Pose Binding for Arbitrary Multi-Character Animation cs.CVPDF
Zhenyu Xie, Ji Xia, Michael Kampffmeyer, Panwen Hu, Zehua Ma
TL;DR: 本文提出AnyCrowd,一个基于扩散Transformer(DiT)的视频生成框架,旨在解决多角色动画中身份纠缠、身份错配和时空不一致的问题。该框架通过实例隔离潜在表示(IILR)独立编码角色实例,并采用三阶段解耦注意力(TSDA)和自适应门控融合(AGF)模块,实现了对任意数量角色的可控动画生成。
Details
Motivation: 随着角色数量增加,多角色动画的参考编码容易发生潜在身份纠缠,导致身份混淆、可控性下降,以及参考身份与驱动姿态序列之间难以建立精确、时空一致的对应关系。
Result: 论文在多个基准测试上进行了评估,结果表明AnyCrowd能够有效防止身份混淆,并实现精确的身份-姿态绑定,在生成视频中保持时空一致性,达到了当前最先进的性能水平。
Insight: 创新点在于提出了实例隔离潜在表示(IILR)来防止身份纠缠,以及三阶段解耦注意力(TSDA)机制,通过分解自注意力为实例感知前景注意力、背景中心交互和全局前景-背景协调,实现了身份与姿态的精确绑定。自适应门控融合(AGF)模块则进一步解决了重叠区域的令牌歧义问题,融合了竞争令牌组以生成身份一致的表示。
Abstract: Controllable character animation has advanced rapidly in recent years, yet multi-character animation remains underexplored. As the number of characters grows, multi-character reference encoding becomes more susceptible to latent identity entanglement, resulting in identity bleeding and reduced controllability. Moreover, learning precise and spatio-temporally consistent correspondences between reference identities and driving pose sequences becomes increasingly challenging, often leading to identity-pose mis-binding and inconsistency in generated videos. To address these challenges, we propose AnyCrowd, a Diffusion Transformer (DiT)-based video generation framework capable of scaling to an arbitrary number of characters. Specifically, we first introduce an Instance-Isolated Latent Representation (IILR), which encodes character instances independently prior to DiT processing to prevent latent identity entanglement. Building on this disentangled representation, we further propose Tri-Stage Decoupled Attention (TSDA) to bind identities to driving poses by decomposing self-attention into: (i) instance-aware foreground attention, (ii) background-centric interaction, and (iii) global foreground-background coordination. Furthermore, to mitigate token ambiguity in overlapping regions, an Adaptive Gated Fusion (AGF) module is integrated within TSDA to predict identity-aware weights, effectively fusing competing token groups into identity-consistent representations…
[257] Gym-V: A Unified Vision Environment System for Agentic Vision Research cs.CVPDF
Fanqing Meng Lingxiao Du Jiawei Gu Jiaqi Liao Linjie Li Zijian Wu Xiangyan Liu Ziqi Zhao Mengkang Hu Yue Zhang Zichen Liu Jiaheng Zhang Michael Qizhe Shieh
TL;DR: 本文介绍了Gym-V,一个用于智能视觉研究的统一视觉环境系统,包含179个程序生成的视觉环境,覆盖10个领域,并具有可控难度。研究发现,观察支架(如字幕和游戏规则)对训练成功的影响比强化学习算法的选择更为关键,且跨领域训练能实现广泛泛化,而狭窄训练可能导致负迁移。该系统旨在加速未来视觉语言智能体的研究。
Details
Motivation: 当前视觉智能体缺乏标准化的基础设施,限制了对其学习驱动因素和模型不足之处的系统性研究,因此需要构建一个统一的平台来支持可控实验和公平比较。
Result: 在Gym-V平台上进行的实验表明,观察支架(如字幕和游戏规则)是训练成功的关键因素,其影响超过RL算法选择;跨领域转移实验显示,多样化任务训练能实现广泛泛化,而狭窄训练可能导致负迁移,多轮交互会放大这些效应。
Insight: 论文的创新点在于提出了一个统一的视觉环境系统Gym-V,通过程序化生成和可控难度支持大规模可控实验;客观分析认为,该系统为视觉智能体研究提供了标准化基准,强调了观察支架和多领域训练对泛化能力的重要性,有助于加速该领域的发展。
Abstract: As agentic systems increasingly rely on reinforcement learning from verifiable rewards, standardized ``gym’’ infrastructure has become essential for rapid iteration, reproducibility, and fair comparison. Vision agents lack such infrastructure, limiting systematic study of what drives their learning and where current models fall short. We introduce \textbf{Gym-V}, a unified platform of 179 procedurally generated visual environments across 10 domains with controllable difficulty, enabling controlled experiments that were previously infeasible across fragmented toolkits. Using it, we find that observation scaffolding is more decisive for training success than the choice of RL algorithm, with captions and game rules determining whether learning succeeds at all. Cross-domain transfer experiments further show that training on diverse task categories generalizes broadly while narrow training can cause negative transfer, with multi-turn interaction amplifying all of these effects. Gym-V is released as a convenient foundation for training environments and evaluation toolkits, aiming to accelerate future research on agentic VLMs.
[258] Real-Time Human Frontal View Synthesis from a Single Image cs.CVPDF
Fangyu Lin, Yingdong Hu, Lunjie Zhu, Zhening Liu, Yushi Huang
TL;DR: 本文提出PrismMirror,一种用于从单张图像实时合成人体正面视图的几何引导框架。该方法通过避免外部几何建模并专注于正面视图合成,优化了临场感的视觉完整性。其核心创新在于级联学习策略,先直接学习粗粒度几何特征(如SMPL-X网格和点云),再通过渲染监督细化纹理,并将统一框架蒸馏为轻量级线性注意力模型,首次实现了24 FPS的实时推理。
Details
Motivation: 解决从单张图像进行真实感人体新视角合成的难题,旨在普及沉浸式3D远程呈现,同时克服现有方法在视觉保真度与几何理解、实时性能与内存瓶颈之间的权衡问题,特别是面部和手部等复杂区域的时间不稳定性。
Result: PrismMirror在视觉真实性和结构准确性上显著优于先前方法,是首个实现24 FPS实时推理的单目人体正面视图合成模型。
Insight: 创新点在于提出级联学习策略实现从粗到细的几何特征学习,并成功将统一框架蒸馏为轻量级线性注意力模型以兼顾性能与效率;客观分析其通过专注于正面视图简化问题定义,避免了复杂多视角建模的计算负担,为实时应用提供了可行路径。
Abstract: Photorealistic human novel view synthesis from a single image is crucial for democratizing immersive 3D telepresence, eliminating the need for complex multi-camera setups. However, current rendering-centric methods prioritize visual fidelity over explicit geometric understanding and struggle with intricate regions like faces and hands, leading to temporal instability. Meanwhile, human-centric frameworks suffer from memory bottlenecks since they typically rely on an auxiliary model to provide informative structural priors for geometric modeling, which limits real-time performance. To address these challenges, we propose PrismMirror, a geometry-guided framework for instant frontal view synthesis from a single image. By avoiding external geometric modeling and focusing on frontal view synthesis, our model optimizes visual integrity for telepresence. Specifically, PrismMirror introduces a novel cascade learning strategy that enables coarse-to-fine geometric feature learning. It first directly learns coarse geometric features, such as SMPL-X meshes and point clouds, and then refines textures through rendering supervision. To achieve real-time efficiency, we distill this unified framework into a lightweight linear attention model. Notably, PrismMirror is the first monocular human frontal view synthesis model that achieves real-time inference at 24 FPS, significantly outperforming previous methods in both visual authenticity and structural accuracy.
[259] Evaluating Time Awareness and Cross-modal Active Perception of Large Models via 4D Escape Room Task cs.CVPDF
Yurui Dong, Ziyue Wang, Shuyun Lu, Dairu Liu, Xuechen Liu
TL;DR: 本文提出了EscapeCraft-4D,一个可定制的4D环境,用于评估全能模型中的选择性跨模态感知和时间意识。该环境整合了基于触发的听觉源、时间瞬态证据和位置相关线索,要求智能体在时间约束下进行时空推理和主动的多模态整合。基于此环境,作者构建了一个基准来评估强大模型的相应能力。评估结果表明,模型在处理模态偏见方面存在困难,并揭示了当前模型在时间约束下整合多种模态的能力存在显著差距。进一步的深入分析揭示了在复杂的多模态推理环境中,多种模态如何相互作用并共同影响模型决策。
Details
Motivation: 现有的多模态大语言模型环境主要关注2D或3D视觉上下文和视觉-语言任务,对时间依赖的听觉信号和选择性跨模态整合的支持有限,而这些能力对于现实的多模态推理至关重要。因此,模型是否能主动协调模态并在时变、不可逆条件下进行推理仍未得到充分探索。
Result: 评估结果表明,模型在处理模态偏见方面存在困难,并揭示了当前模型在时间约束下整合多种模态的能力存在显著差距。进一步的深入分析揭示了在复杂的多模态推理环境中,多种模态如何相互作用并共同影响模型决策。
Insight: 论文的创新点在于提出了一个包含时间维度和选择性跨模态整合的4D评估环境(EscapeCraft-4D),以弥补现有基准在评估时间意识和主动跨模态感知能力方面的不足。从客观角度看,该工作通过引入时间约束、不可逆事件和互补/干扰性多模态信息,为评估全能模型的复杂推理能力提供了一个更贴近现实的测试平台,有助于深入理解多模态交互机制。
Abstract: Multimodal Large Language Models (MLLMs) have recently made rapid progress toward unified Omni models that integrate vision, language, and audio. However, existing environments largely focus on 2D or 3D visual context and vision-language tasks, offering limited support for temporally dependent auditory signals and selective cross-modal integration, where different modalities may provide complementary or interfering information, which are essential capabilities for realistic multimodal reasoning. As a result, whether models can actively coordinate modalities and reason under time-varying, irreversible conditions remains underexplored. To this end, we introduce \textbf{EscapeCraft-4D}, a customizable 4D environment for assessing selective cross-modal perception and time awareness in Omni models. It incorporates trigger-based auditory sources, temporally transient evidence, and location-dependent cues, requiring agents to perform spatio-temporal reasoning and proactive multimodal integration under time constraints. Building on this environment, we curate a benchmark to evaluate corresponding abilities across powerful models. Evaluation results suggest that models struggle with modality bias, and reveal significant gaps in current model’s ability to integrate multiple modalities under time constraints. Further in-depth analysis uncovers how multiple modalities interact and jointly influence model decisions in complex multimodal reasoning environments.
[260] Automated Counting of Stacked Objects in Industrial Inspection cs.CVPDF
Corentin Dumery, Noa Etté, Aoxiang Fan, Ren Li, Jingyi Xu
TL;DR: 本文提出了一种新颖的3D计数方法,用于解决工业检测中堆叠物体的自动化计数难题。该方法通过多视角图像,将任务分解为估计堆叠体的3D几何形状和其占用率两个互补的子问题,结合几何重建与基于深度学习的深度分析,实现对容器内相同制造零件的精确计数,即使它们堆叠不规则且部分被遮挡。
Details
Motivation: 工业检测中,视觉物体计数对于高精度、高吞吐量的库存跟踪和质量保证至关重要。现有方法难以处理容器、托盘或料箱中堆叠的3D物品,因为这些物体大多被严重遮挡,只有少数直接可见,而重量测量方法又因物体过轻或过重而不实用,因此需要一种更鲁棒的自动化视觉计数方案。
Result: 该方法在具有人工验证总数的大规模合成数据和多样化真实世界数据上进行了验证,在现实检测条件下展示了鲁棒的性能。
Insight: 论文的创新点在于将堆叠物体计数任务分解为3D几何估计和占用率分析两个子问题,并通过结合传统几何重建与深度学习深度分析来解决严重遮挡问题。从客观角度看,这种多视角融合与任务分解策略为处理复杂工业场景中的密集、遮挡物体计数提供了新思路。
Abstract: Visual object counting is a fundamental computer vision task in industrial inspection, where accurate, high-throughput inventory tracking and quality assurance are critical. Moreover, manufactured parts are often too light to reliably deduce their count from their weight, or too heavy to move the stack on a scale safely and practically, making automated visual counting the more robust solution in many scenarios. However, existing methods struggle with stacked 3D items in containers, pallets, or bins, where most objects are heavily occluded and only a few are directly visible. To address this important yet underexplored challenge, we propose a novel 3D counting approach that decomposes the task into two complementary subproblems: estimating the 3D geometry of the stack and its occupancy ratio from multi-view images. By combining geometric reconstruction with deep learning-based depth analysis, our method can accurately count identical manufactured parts inside containers, even when they are irregularly stacked and partially hidden. We validate our 3D counting pipeline on large-scale synthetic and diverse real-world data with manually verified total counts, demonstrating robust performance under realistic inspection conditions.
[261] ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer cs.CVPDF
Ruonan Yu, Zhenxiong Tan, Zigeng Chen, Songhua Liu, Xinchao Wang
TL;DR: 本文提出了一种名为ViFeEdit的视频无调优框架,用于视频扩散变换器(DiT),该框架无需任何视频训练数据,仅通过2D图像进行适配,即可实现多功能视频生成和编辑。核心方法是通过架构重参数化,将现代视频扩散变换器中的全3D注意力解耦为空间独立性,从而在保持时间一致性的同时实现视觉保真编辑,且仅需少量额外参数。
Details
Motivation: 动机在于解决视频控制和编辑领域进展有限的问题,主要由于配对视频数据稀缺和训练视频扩散模型计算成本高,因此提出一种无需视频数据的调优框架。
Result: 大量实验表明,该方法在仅对2D图像数据进行最小化训练的情况下,实现了可控视频生成和编辑的令人满意的结果。
Insight: 创新点在于通过架构重参数化解耦空间注意力,实现视频编辑的视觉保真和时间一致性,同时采用双路径管道和独立时间步嵌入,增强了适应多样化条件信号的能力,为视频扩散模型的高效训练提供了新思路。
Abstract: Diffusion Transformers (DiTs) have demonstrated remarkable scalability and quality in image and video generation, prompting growing interest in extending them to controllable generation and editing tasks. However, compared to the image counterparts, progress in video control and editing remains limited, mainly due to the scarcity of paired video data and the high computational cost of training video diffusion models. To address this issue, in this paper, we propose a video-free tuning framework termed ViFeEdit for video diffusion transformers. Without requiring any forms of video training data, ViFeEdit achieves versatile video generation and editing, adapted solely with 2D images. At the core of our approach is an architectural reparameterization that decouples spatial independence from the full 3D attention in modern video diffusion transformers, which enables visually faithful editing while maintaining temporal consistency with only minimal additional parameters. Moreover, this design operates in a dual-path pipeline with separate timestep embeddings for noise scheduling, exhibiting strong adaptability to diverse conditioning signals. Extensive experiments demonstrate that our method delivers promising results of controllable video generation and editing with only minimal training on 2D image data. Codes are available https://github.com/Lexie-YU/ViFeEdit.
[262] FreeTalk: Emotional Topology-Free 3D Talking Heads cs.CVPDF
Federico Nocentini, Thomas Besnier, Claudio Ferrari, Stefano Berretti, Mohamed Daoudi
TL;DR: FreeTalk是一个两阶段框架,用于生成情感条件驱动的3D说话头动画,能够泛化到具有任意顶点数和连接性的未注册人脸网格。第一阶段(Audio-To-Sparse)从语音音频预测3D标志点位移序列,捕捉发音和情感运动;第二阶段(Sparse-To-Mesh)将标志点运动转移到目标网格,生成密集的逐顶点变形,无需模板拟合或对应监督。
Details
Motivation: 解决现有语音驱动3D面部动画方法通常依赖于注册的模板网格,难以有效应用于具有任意拓扑结构的原始3D扫描,以及建模超越嘴唇发音的可控情感动态的挑战。
Result: 大量实验表明,FreeTalk在域内训练时与专用基线方法相当,同时对未见过的身份和网格拓扑结构提供了显著改善的鲁棒性。
Insight: 创新点在于提出了一种与网格拓扑无关的稀疏标志点表示来捕捉运动,并通过结合内在表面特征和标志点-顶点条件,实现了对任意拓扑网格的密集变形,无需测试时的模板拟合,提高了泛化能力。
Abstract: Speech-driven 3D facial animation has advanced rapidly, yet most approaches remain tied to registered template meshes, preventing effective deployment on raw 3D scans with arbitrary topology. At the same time, modeling controllable emotional dynamics beyond lip articulation remains challenging, and is often tied to template-based parameterizations. We address these challenges by proposing FreeTalk, a two-stage framework for emotion-conditioned 3D talking-head animation that generalizes to unregistered face meshes with arbitrary vertex count and connectivity. First, Audio-To-Sparse (ATS) predicts a temporally coherent sequence of 3D landmark displacements from speech audio, conditioned on an emotion category and intensity. This sparse representation captures both articulatory and affective motion while remaining independent of mesh topology. Second, Sparse-To-Mesh (STM) transfers the predicted landmark motion to a target mesh by combining intrinsic surface features with landmark-to-vertex conditioning, producing dense per-vertex deformations without template fitting or correspondence supervision at test time. Extensive experiments show that FreeTalk matches specialized baselines when trained in-domain, while providing substantially improved robustness to unseen identities and mesh topologies. Code and pre-trained models will be made publicly available.
[263] Learning Latent Proxies for Controllable Single-Image Relighting cs.CVPDF
Haoze Zheng, Zihao Wang, Xianfeng Wu, Yajing Bai, Yexin Liu
TL;DR: 本文提出了一种名为LightCtrl的单图像重光照方法,通过引入稀疏但物理意义明确的潜在代理(latent proxies)来指导扩散模型,实现了对光照方向、强度和颜色的精细控制。该方法结合了少量PBR监督的潜在代理编码器和光照感知掩码,并利用DPO目标增强物理一致性,在ScaLight数据集上训练,在多个基准测试中超越了现有方法。
Details
Motivation: 解决单图像重光照任务高度欠约束的问题,现有基于扩散的方法要么依赖密集脆弱的监督(如本征分解或G-buffer),要么缺乏物理基础导致控制不可靠,本文旨在通过稀疏物理线索实现准确可控的重光照。
Result: 在物体和场景级别的基准测试中,该方法实现了光度学上忠实且可控的重光照,在受控光照变化下PSNR提升高达+2.4 dB,RMSE降低35%,超越了先前的扩散和基于本征分解的基线模型。
Insight: 创新点在于提出无需完整本征分解,仅需稀疏物理线索(材料-几何提示)即可有效指导重光照;通过潜在代理编码器和光照感知掩码集成物理先验,并利用DPO目标增强数据稀缺下的物理一致性,可借鉴其将物理约束与生成模型结合以提升可控性的思路。
Abstract: Single-image relighting is highly under-constrained: small illumination changes can produce large, nonlinear variations in shading, shadows, and specularities, while geometry and materials remain unobserved. Existing diffusion-based approaches either rely on intrinsic or G-buffer pipelines that require dense and fragile supervision, or operate purely in latent space without physical grounding, making fine-grained control of direction, intensity, and color unreliable. We observe that a full intrinsic decomposition is unnecessary and redundant for accurate relighting. Instead, sparse but physically meaningful cues, indicating where illumination should change and how materials should respond, are sufficient to guide a diffusion model. Based on this insight, we introduce LightCtrl that integrates physical priors at two levels: a few-shot latent proxy encoder that extracts compact material-geometry cues from limited PBR supervision, and a lighting-aware mask that identifies sensitive illumination regions and steers the denoiser toward shading relevant pixels. To compensate for scarce PBR data, we refine the proxy branch using a DPO-based objective that enforces physical consistency in the predicted cues. We also present ScaLight, a large-scale object-level dataset with systematically varied illumination and complete camera-light metadata, enabling physically consistent and controllable training. Across object and scene level benchmarks, our method achieves photometrically faithful relighting with accurate continuous control, surpassing prior diffusion and intrinsic-based baselines, including gains of up to +2.4 dB PSNR and 35% lower RMSE under controlled lighting shifts.
[264] Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models cs.CVPDF
Lexiang Xiong, Qi Li, Jingwen Ye, Xinchao Wang
TL;DR: 本文提出了一种多阶段诊断框架,用于追踪视觉语言模型(VLMs)中的幻觉现象,将幻觉从静态输出错误重新定义为模型计算认知的动态病理,并通过信息论探针将生成过程投影到可解释的低维认知状态空间,实现高效的幻觉检测与归因。
Details
Motivation: 视觉语言模型频繁产生看似合理但事实错误的’幻觉’陈述,这是其可信部署的关键障碍,现有方法多关注静态输出错误,缺乏对模型内部认知过程的动态诊断。
Result: 在多个基准测试(包括POPE、MME和MS-COCO)上,该框架实现了最先进的性能,且在弱监督下高效运行,即使校准数据严重污染也保持高度鲁棒性。
Insight: 核心创新在于提出了几何-信息对偶性原理,将认知轨迹在状态空间中的几何异常与信息论惊奇度等价,从而将幻觉检测转化为几何异常检测问题,并实现了对感知不稳定性、逻辑因果失败和决策模糊性等病理状态的可解释归因。
Abstract: Vision-Language Models (VLMs) frequently “hallucinate” - generate plausible yet factually incorrect statements - posing a critical barrier to their trustworthy deployment. In this work, we propose a new paradigm for diagnosing hallucinations, recasting them from static output errors into dynamic pathologies of a model’s computational cognition. Our framework is grounded in a normative principle of computational rationality, allowing us to model a VLM’s generation as a dynamic cognitive trajectory. We design a suite of information-theoretic probes that project this trajectory onto an interpretable, low-dimensional Cognitive State Space. Our central discovery is a governing principle we term the geometric-information duality: a cognitive trajectory’s geometric abnormality within this space is fundamentally equivalent to its high information-theoretic surprisal. Hallucination detection is counts as a geometric anomaly detection problem. Evaluated across diverse settings - from rigorous binary QA (POPE) and comprehensive reasoning (MME) to unconstrained open-ended captioning (MS-COCO) - our framework achieves state-of-the-art performance. Crucially, it operates with high efficiency under weak supervision and remains highly robust even when calibration data is heavily contaminated. This approach enables a causal attribution of failures, mapping observable errors to distinct pathological states: perceptual instability (measured by Perceptual Entropy), logical-causal failure (measured by Inferential Conflict), and decisional ambiguity (measured by Decision Entropy). Ultimately, this opens a path toward building AI systems whose reasoning is transparent, auditable, and diagnosable by design.
[265] Panoramic Affordance Prediction cs.CV | cs.ROPDF
Zixin Zhang, Chenfei Liao, Hongfei Zhang, Harold Haodong Chen, Kanghao Chen
TL;DR: 本文首次提出全景可操作预测任务,利用360度图像捕捉全局空间关系和整体场景理解。作者构建了大规模基准数据集PAP-12K,并提出了一种无需训练、由粗到细的PAP流程,该流程受人类中央凹视觉系统启发,通过递归视觉路由、自适应注视机制和级联定位管道来处理全景图像的超高分辨率和严重畸变。
Details
Motivation: 现有可操作预测研究局限于针孔相机模型,存在视野狭窄和观测碎片化的问题,常丢失关键的整体环境上下文,因此需要探索全景图像以提升具身AI的感知与行动衔接能力。
Result: 在PAP-12K数据集上的实验表明,专为标准透视图像设计的现有方法性能严重下降,而PAP框架有效克服了全景视觉的独特挑战,显著优于最先进的基线方法,展现了全景感知在鲁棒具身智能中的巨大潜力。
Insight: 创新点包括引入全景可操作预测新任务、构建首个大规模超高分辨率全景可操作预测数据集PAP-12K,以及提出一种无需训练、模拟人类视觉系统的由粗到细处理流程,通过递归路由、自适应畸变校正和级联掩码提取来应对全景图像的超高分辨率和畸变问题。
Abstract: Affordance prediction serves as a critical bridge between perception and action in embodied AI. However, existing research is confined to pinhole camera models, which suffer from narrow Fields of View (FoV) and fragmented observations, often missing critical holistic environmental context. In this paper, we present the first exploration into Panoramic Affordance Prediction, utilizing 360-degree imagery to capture global spatial relationships and holistic scene understanding. To facilitate this novel task, we first introduce PAP-12K, a large-scale benchmark dataset containing over 1,000 ultra-high-resolution (12k, 11904 x 5952) panoramic images with over 12k carefully annotated QA pairs and affordance masks. Furthermore, we propose PAP, a training-free, coarse-to-fine pipeline inspired by the human foveal visual system to tackle the ultra-high resolution and severe distortion inherent in panoramic images. PAP employs recursive visual routing via grid prompting to progressively locate targets, applies an adaptive gaze mechanism to rectify local geometric distortions, and utilizes a cascaded grounding pipeline to extract precise instance-level masks. Experimental results on PAP-12K reveal that existing affordance prediction methods designed for standard perspective images suffer severe performance degradation and fail due to the unique challenges of panoramic vision. In contrast, PAP framework effectively overcomes these obstacles, significantly outperforming state-of-the-art baselines and highlighting the immense potential of panoramic perception for robust embodied intelligence.
[266] Grounding World Simulation Models in a Real-World Metropolis cs.CVPDF
Junyoung Seo, Hyunwook Choi, Minkyung Kwon, Jinhyeok Choi, Siyoon Jin
TL;DR: 本文提出了首尔世界模型(SWM),这是一个基于真实城市首尔构建的城市规模世界模型,通过检索增强的街景图像条件化实现自回归视频生成,解决了现有生成世界模型只能合成虚构环境的问题。
Details
Motivation: 现有生成世界模型合成的环境虽然视觉上合理但完全是虚构的,本文旨在构建一个能够真实反映实际城市环境的世界模拟模型。
Result: 在首尔、釜山和安娜堡三个城市的评估中,SWM在生成空间准确、时间一致、长达数百米轨迹的长时视频方面优于现有视频世界模型,支持多样化的相机运动和文本提示的场景变化。
Insight: 创新点包括:通过检索增强条件化将生成过程锚定在真实街景图像上;引入跨时间配对、大规模合成数据集和视图插值管道解决数据稀疏和轨迹多样性问题;使用虚拟前瞻汇(Virtual Lookahead Sink)稳定长时生成过程。
Abstract: What if a world simulation model could render not an imagined environment but a city that actually exists? Prior generative world models synthesize visually plausible yet artificial environments by imagining all content. We present Seoul World Model (SWM), a city-scale world model grounded in the real city of Seoul. SWM anchors autoregressive video generation through retrieval-augmented conditioning on nearby street-view images. However, this design introduces several challenges, including temporal misalignment between retrieved references and the dynamic target scene, limited trajectory diversity and data sparsity from vehicle-mounted captures at sparse intervals. We address these challenges through cross-temporal pairing, a large-scale synthetic dataset enabling diverse camera trajectories, and a view interpolation pipeline that synthesizes coherent training videos from sparse street-view images. We further introduce a Virtual Lookahead Sink to stabilize long-horizon generation by continuously re-grounding each chunk to a retrieved image at a future location. We evaluate SWM against recent video world models across three cities: Seoul, Busan, and Ann Arbor. SWM outperforms existing methods in generating spatially faithful, temporally consistent, long-horizon videos grounded in actual urban environments over trajectories reaching hundreds of meters, while supporting diverse camera movements and text-prompted scenario variations.
[267] Fast SAM 3D Body: Accelerating SAM 3D Body for Real-Time Full-Body Human Mesh Recovery cs.CVPDF
Timing Yang, Sicheng He, Hongyi Jing, Jiawei Yang, Zhijian Liu
TL;DR: 本文提出了Fast SAM 3D Body,一个无需训练的加速框架,旨在解决SAM 3D Body模型在单目3D人体网格恢复中推理延迟高、无法实时应用的问题。通过解耦串行空间依赖、进行架构感知剪枝、并行化多裁剪特征提取以及用前馈映射替代迭代网格拟合,该框架实现了端到端10.9倍的加速,同时保持了与原模型相当的甚至在某些基准上更优的重建精度,并成功应用于仅需RGB流的实时人形机器人遥操作系统。
Details
Motivation: SAM 3D Body在单目3D人体网格恢复上达到了SOTA精度,但其每张图像数秒的推理延迟阻碍了实时应用,因此需要一种高效的加速方案。
Result: 该框架实现了高达10.9倍的端到端加速,在LSPET等基准测试上保持了与原模型相当的重建保真度,甚至在某些方面超越了原模型;其中,将网格拟合转换为SMPL参数的前馈映射加速了超过10,000倍。
Insight: 主要创新点在于无需重新训练即可大幅加速现有SOTA模型,其核心是通过重构推理流程(解耦依赖、并行化、剪枝)以及用高效的前馈映射替代耗时的迭代优化过程,从而在保持精度的前提下实现实时性能,为从RGB视频流直接学习控制策略提供了可行方案。
Abstract: SAM 3D Body (3DB) achieves state-of-the-art accuracy in monocular 3D human mesh recovery, yet its inference latency of several seconds per image precludes real-time application. We present Fast SAM 3D Body, a training-free acceleration framework that reformulates the 3DB inference pathway to achieve interactive rates. By decoupling serial spatial dependencies and applying architecture-aware pruning, we enable parallelized multi-crop feature extraction and streamlined transformer decoding. Moreover, to extract the joint-level kinematics (SMPL) compatible with existing humanoid control and policy learning frameworks, we replace the iterative mesh fitting with a direct feedforward mapping, accelerating this specific conversion by over 10,000x. Overall, our framework delivers up to a 10.9x end-to-end speedup while maintaining on-par reconstruction fidelity, even surpassing 3DB on benchmarks such as LSPET. We demonstrate its utility by deploying Fast SAM 3D Body in a vision-only teleoperation system that-unlike methods reliant on wearable IMUs-enables real-time humanoid control and the direct collection of manipulation policies from a single RGB stream.
[268] HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions cs.CV | cs.ROPDF
Yukang Cao, Haozhe Xie, Fangzhou Hong, Long Zhuo, Zhaoxi Chen
TL;DR: HSImul3R是一个统一的框架,用于从稀疏视图图像和单目视频等随意捕捉的数据中,重建可用于物理仿真的三维人-场景交互。该框架通过物理驱动的双向优化流程,将物理模拟器作为主动监督器,联合优化人体动力学和场景几何,以弥合视觉重建与物理约束之间的感知-模拟鸿沟。
Details
Motivation: 现有方法存在感知-模拟鸿沟:视觉上合理的重建常常违反物理约束,导致在物理引擎中不稳定,并在具身AI应用中失败。本文旨在解决这一问题,生成可直接用于物理仿真的稳定人-场景交互重建。
Result: 在提出的新基准HSIBench上进行了大量实验,结果表明HSImul3R能够产生首个稳定的、可用于仿真的HSI重建,并且可以直接部署到真实世界的人形机器人上。
Insight: 核心创新点在于将物理模拟器作为主动监督器融入优化流程,提出了前向的场景目标强化学习来优化人体运动,以及反向的直接模拟奖励优化来优化场景几何,从而实现了物理真实性与视觉合理性的统一。这为具身AI和机器人应用提供了可直接使用的、物理稳定的交互数据。
Abstract: We present HSImul3R, a unified framework for simulation-ready 3D reconstruction of human-scene interactions (HSI) from casual captures, including sparse-view images and monocular videos. Existing methods suffer from a perception-simulation gap: visually plausible reconstructions often violate physical constraints, leading to instability in physics engines and failure in embodied AI applications. To bridge this gap, we introduce a physically-grounded bi-directional optimization pipeline that treats the physics simulator as an active supervisor to jointly refine human dynamics and scene geometry. In the forward direction, we employ Scene-targeted Reinforcement Learning to optimize human motion under dual supervision of motion fidelity and contact stability. In the reverse direction, we propose Direct Simulation Reward Optimization, which leverages simulation feedback on gravitational stability and interaction success to refine scene geometry. We further present HSIBench, a new benchmark with diverse objects and interaction scenarios. Extensive experiments demonstrate that HSImul3R produces the first stable, simulation-ready HSI reconstructions and can be directly deployed to real-world humanoid robots.
[269] Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion cs.CVPDF
Zhenghong Zhou, Xiaohang Zhan, Zhiqin Chen, Soo Ye Kim, Nanxuan Zhao
TL;DR: 本文提出了Tri-Prompting,一个用于视频扩散模型的统一框架和两阶段训练范式,旨在实现对场景构图、多视角一致的主体定制以及相机姿态或物体运动调整的联合精细控制。该方法通过一个由3D跟踪点驱动的双条件运动模块来分别处理背景场景和前景主体,并引入了推理时的ControlNet尺度调度以平衡可控性与视觉真实感。
Details
Motivation: 现有视频扩散模型在视觉质量上进步显著,但缺乏对场景、主体和运动的统一精细控制,这限制了内容创作的实际可定制性。现有方法通常孤立处理这些维度,对任意姿态变化下的多视角主体合成和身份保持支持有限。
Result: 实验结果表明,Tri-Prompting在多视角主体身份一致性、3D一致性和运动准确性方面显著优于Phantom和DaS等专门的基线模型。
Insight: 主要创新点在于提出了一个统一的框架来联合控制视频生成的三个关键维度,并设计了双条件运动模块和推理时的ControlNet尺度调度机制,从而支持将3D感知的主体插入任意场景以及操纵图像中现有主体等新颖工作流程。
Abstract: Recent video diffusion models have made remarkable strides in visual quality, yet precise, fine-grained control remains a key bottleneck that limits practical customizability for content creation. For AI video creators, three forms of control are crucial: (i) scene composition, (ii) multi-view consistent subject customization, and (iii) camera-pose or object-motion adjustment. Existing methods typically handle these dimensions in isolation, with limited support for multi-view subject synthesis and identity preservation under arbitrary pose changes. This lack of a unified architecture makes it difficult to support versatile, jointly controllable video. We introduce Tri-Prompting, a unified framework and two-stage training paradigm that integrates scene composition, multi-view subject consistency, and motion control. Our approach leverages a dual-condition motion module driven by 3D tracking points for background scenes and downsampled RGB cues for foreground subjects. To ensure a balance between controllability and visual realism, we further propose an inference ControlNet scale schedule. Tri-Prompting supports novel workflows, including 3D-aware subject insertion into any scenes and manipulation of existing subjects in an image. Experimental results demonstrate that Tri-Prompting significantly outperforms specialized baselines such as Phantom and DaS in multi-view subject identity, 3D consistency, and motion accuracy.
[270] GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering cs.CVPDF
Xincheng Shuai, Ziye Li, Henghui Ding, Dacheng Tao
TL;DR: 本文提出GlyphPrinter,一种基于区域分组直接偏好优化的方法,用于提升视觉文本渲染中的字形准确性。该方法通过构建带有区域级字形偏好标注的数据集GlyphCorrector,并设计区域分组DPO目标函数,优化标注区域内的样本间和样本内偏好,从而减少对显式奖励模型的依赖,并显著提高复杂或域外字符的字形渲染精度。
Details
Motivation: 现有视觉文本渲染方法通常依赖大量高质量场景文本图像进行训练,但字形变体覆盖有限且过度风格化会损害字形准确性,尤其是复杂或域外字符;而基于强化学习的方法其奖励模型常依赖对细粒度字形错误不敏感的文字识别系统,导致错误字形仍可能获得高奖励。
Result: 大量实验表明,GlyphPrinter在字形准确性上优于现有方法,同时在风格化和精度之间保持了良好的平衡。
Insight: 创新点包括:1) 提出区域分组直接偏好优化,将标准DPO的全局偏好建模扩展为基于区域的目标,以处理字形错误通常发生在局部区域的问题;2) 构建带有区域级字形偏好标注的数据集GlyphCorrector;3) 引入区域奖励引导的推理策略,从具有可控字形准确性的最优分布中采样。
Abstract: Generating accurate glyphs for visual text rendering is essential yet challenging. Existing methods typically enhance text rendering by training on a large amount of high-quality scene text images, but the limited coverage of glyph variations and excessive stylization often compromise glyph accuracy, especially for complex or out-of-domain characters. Some methods leverage reinforcement learning to alleviate this issue, yet their reward models usually depend on text recognition systems that are insensitive to fine-grained glyph errors, so images with incorrect glyphs may still receive high rewards. Inspired by Direct Preference Optimization (DPO), we propose GlyphPrinter, a preference-based text rendering method that eliminates reliance on explicit reward models. However, the standard DPO objective only models overall preference between two samples, which is insufficient for visual text rendering where glyph errors typically occur in localized regions. To address this issue, we construct the GlyphCorrector dataset with region-level glyph preference annotations and propose Region-Grouped DPO (R-GDPO), a region-based objective that optimizes inter- and intra-sample preferences over annotated regions, substantially enhancing glyph accuracy. Furthermore, we introduce Regional Reward Guidance, an inference strategy that samples from an optimal distribution with controllable glyph accuracy. Extensive experiments demonstrate that the proposed GlyphPrinter outperforms existing methods in glyph accuracy while maintaining a favorable balance between stylization and precision.
[271] Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models cs.CVPDF
Yulin Luo, Hao Chen, Zhuangzhe Wu, Bowen Sui, Jiaming Liu
TL;DR: 本文提出DeepVision-VLA模型,通过Vision-Language Mixture-of-Transformers框架在VLA模型深层注入多级视觉特征,并引入Action-Guided Visual Pruning剪枝无关视觉token,以增强视觉表示,提升机器人操作任务的性能。
Details
Motivation: 现有VLA模型在动作生成过程中对视觉token的敏感性随网络深度增加而下降,导致视觉信息未能充分融入动作生成,限制了复杂操作任务的性能。
Result: 在模拟和真实世界任务上,DeepVision-VLA分别比先前SOTA方法提升了9.0%和7.5%。
Insight: 创新点在于通过VL-MoT框架实现视觉基础模型与VLA骨干的共享注意力,将多级视觉特征注入深层,并结合AGVP进行任务相关视觉token的剪枝,在增强视觉表示的同时保持计算效率。
Abstract: Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for robotic manipulation, in which reliable action prediction critically depends on accurately interpreting and integrating visual observations conditioned on language instructions. Although recent works have sought to enhance the visual capabilities of VLA models, most approaches treat the LLM backbone as a black box, providing limited insight into how visual information is grounded into action generation. Therefore, we perform a systematic analysis of multiple VLA models across different action-generation paradigms and observe that sensitivity to visual tokens progressively decreases in deeper layers during action generation. Motivated by this observation, we propose \textbf{DeepVision-VLA}, built on a \textbf{Vision-Language Mixture-of-Transformers (VL-MoT)} framework. This framework enables shared attention between the vision foundation model and the VLA backbone, injecting multi-level visual features from the vision expert into deeper layers of the VLA backbone to enhance visual representations for precise and complex manipulation. In addition, we introduce \textbf{Action-Guided Visual Pruning (AGVP)}, which leverages shallow-layer attention to prune irrelevant visual tokens while preserving task-relevant ones, reinforcing critical visual cues for manipulation with minimal computational overhead. DeepVision-VLA outperforms prior state-of-the-art methods by 9.0% and 7.5% on simulated and real-world tasks, respectively, providing new insights for the design of visually enhanced VLA models.
[272] Towards Generalizable Robotic Manipulation in Dynamic Environments cs.CV | cs.ROPDF
Heng Fang, Shangru Li, Shuhan Wang, Xuanyang Xi, Dingkang Liang
TL;DR: 本文针对视觉-语言-动作模型在动态环境中处理移动目标时性能不足的问题,提出了DOMINO大规模数据集与基准测试,并设计了PUMA动态感知架构,通过结合历史光流和世界查询实现短期预测,在动态任务上取得了SOTA性能。
Details
Motivation: 主流VLA模型依赖单帧观测,缺乏时空推理能力,且动态操作数据集稀缺,导致其在动态环境中表现不佳。
Result: PUMA在DOMINO基准的35个动态任务上实现了6.3%的绝对成功率提升,达到SOTA水平;动态数据训练还能提升静态任务的鲁棒性。
Insight: 创新点包括构建层次化动态数据集DOMINO,以及PUMA架构中通过场景中心历史光流和世界查询隐式预测物体未来状态,耦合历史感知与短期预测。
Abstract: Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks. All code and data are available at https://github.com/H-EmbodVis/DOMINO.
cs.MM [Back]
[273] Anchoring Emotions in Text: Robust Multimodal Fusion for Mimicry Intensity Estimation cs.MM | cs.CVPDF
Lingsi Zhu, Yuefeng Zou, Yunxiang Zhang, Naixiang Zheng, Guoyuan Wang
TL;DR: 本文提出了一种名为TAEMI的新型多模态框架,用于在自然环境中估计情感模仿强度。该方法打破了传统的对称融合范式,利用文本转录作为稳定的语义锚点,通过文本锚定的双重交叉注意力机制来过滤冗余信息并对齐噪声物理信号,同时集成了可学习的缺失模态令牌和模态丢弃策略以增强鲁棒性。
Details
Motivation: 解决在自然环境中估计情感模仿强度的挑战,主要难点在于有效建模高度异质模态间复杂的非线性时间动态,尤其是在物理信号被破坏或缺失时。传统对称融合方法易受瞬时环境噪声影响,因此需要一种更稳健的融合策略。
Result: 在Hume-Vidmimic2数据集上的大量实验表明,TAEMI能有效捕捉细粒度情感变化,并在不完美条件下保持稳健的预测弹性。该框架在六个连续情感维度上取得了最先进的平均皮尔逊相关系数,显著优于现有基线方法。
Insight: 创新点在于利用文本作为稳定、时间无关的语义先验来锚定和引导多模态融合,通过文本锚定的双重交叉注意力机制主动对齐噪声信号,并结合可学习的缺失模态令牌和模态丢弃策略来应对现实场景中不可避免的数据缺失问题,从而提升模型的鲁棒性和性能。
Abstract: Estimating Emotional Mimicry Intensity (EMI) in naturalistic environments is a critical yet challenging task in affective computing. The primary difficulty lies in effectively modeling the complex, nonlinear temporal dynamics across highly heterogeneous modalities, especially when physical signals are corrupted or missing. To tackle this, we propose TAEMI (Text-Anchored Emotional Mimicry Intensity estimation), a novel multimodal framework designed for the 10th ABAW Competition. Motivated by the observation that continuous visual and acoustic signals are highly susceptible to transient environmental noise, we break the traditional symmetric fusion paradigm. Instead, we leverage textual transcript–which inherently encode a stable, time-independent semantic prior–as central anchors. Specifically, we introduce a Text-Anchored Dual Cross-Attention mechanism that utilizes these robust textual queries to actively filter out frame-level redundancies and align the noisy physical streams. Furthermore, to prevent catastrophic performance degradation caused by inevitably missing data in unconstrained real-world scenarios, we integrate Learnable Missing-Modality Tokens and a Modality Dropout strategy during training. Extensive experiments on the Hume-Vidmimic2 dataset demonstrate that TAEMI effectively captures fine-grained emotional variations and maintains robust predictive resilience under imperfect conditions. Our framework achieves a state-of-the-art mean Pearson correlation coefficient across six continuous emotional dimensions, significantly outperforming existing baseline methods.
cs.HC [Back]
[274] Toward Scalable Co-located Practical Learning: Assisting with Computer Vision and Multimodal Analytics cs.HC | cs.CVPDF
Xinyu Li, Linxuan Zhao, Roberto Martinez-Maldonado, Dragan Gasevic, Lixiang Yan
TL;DR: 本研究探讨了使用单个天花板摄像头捕捉共址实践学习中细粒度学习行为的可行性。在本科护理模拟中,通过教师识别七种可观察行为类别,并训练基于YOLO的检测器。模型在测试集上取得了良好的检测性能,并发现结合空间上下文分析行为数据能更有效地区分高低绩效团队的表现差异。
Details
Motivation: 解决在共址实践学习环境中,如何在不依赖可穿戴传感器的情况下,利用低成本、可扩展的计算机视觉方法捕捉和分析细粒度的团队协作与任务参与行为的问题。
Result: 在52个会话视频数据(聚焦于行为变化更大的Scenario A)上,模型在预留测试集上达到了精确率0.789、召回率0.784和mAP@0.5为0.827。仅比较行为频率时,高低绩效组无显著差异;但结合空间上下文分析后,在任务和协作表现上均出现明显差异。
Insight: 创新点在于证明了单个摄像头结合目标检测与空间上下文分析,足以替代可穿戴传感器,对面对面实践学习中的团队协作进行有效分析。核心洞察是行为数据必须结合其发生位置(空间上下文)进行解读才更具信息量,这为可扩展的多模态学习分析提供了新思路。
Abstract: This study examined whether a single ceiling-mounted camera could be used to capture fine-grained learning behaviours in co-located practical learning. In undergraduate nursing simulations, teachers first identified seven observable behaviour categories, which were then used to train a YOLO-based detector. Video data were collected from 52 sessions, and analyses focused on Scenario A because it produced greater behavioural variation than Scenario B. Annotation reliability was high (F1=0.933). On the held-out test set, the model achieved a precision of 0.789, a recall of 0.784, and an mAP@0.5 of 0.827. When only behaviour frequencies were compared, no robust differences were found between high- and low-performing groups. However, when behaviour labels were analysed together with spatial context, clear differences emerged in both task and collaboration performance. Higher-performing teams showed more patient interaction in the primary work area, whereas lower-performing teams showed more phone-related activity and more activity in secondary areas. These findings suggest that behavioural data are more informative when interpreted together with where they occur. Overall, the study shows that a single-camera computer vision approach can support the analysis of teamwork and task engagement in face-to-face practical learning without relying on wearable sensors.
cs.RO [Back]
[275] From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation cs.RO | cs.AI | cs.CL | cs.CVPDF
Yibin Liu, Yaxing Lyu, Daqi Gao, Zhixuan Liang, Weiliang Tang
TL;DR: 本文提出了PRIMO R1框架,旨在解决长时程机器人操作中精确过程监督的挑战。该框架通过基于结果的强化学习,将视频多模态大语言模型从被动的‘观察者’转变为主动的‘批评者’,激励其生成明确的思维链以进行进度估计。
Details
Motivation: 当前基于监督微调的视频MLLMs主要作为被动观察者识别事件,而无法评估当前状态相对于最终任务目标的进展,这成为长时程机器人操作过程监督的关键瓶颈。
Result: 在提出的PRIMO数据集和基准测试上,PRIMO R1在多样化的领域内环境和领域外真实世界人形机器人场景中实现了最先进的性能。其7B模型将专业推理基线的平均绝对误差降低了50%,相对精度显著优于72B规模的通用MLLMs,并在RoboFail基准测试上以67.0%的准确率(超过OpenAI o1等闭源模型6.0%)展现了强大的零样本泛化能力。
Insight: 核心创新点在于利用基于结果的强化学习来激励模型进行显式的思维链生成以评估任务进度,并通过在初始状态和当前状态图像之间显式锚定视频序列来构建结构化的时序输入,从而将模型从事件识别器转变为主动的过程评估器。
Abstract: Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive “Observers” that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active “Critics”. We leverage outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. Furthermore, our architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial and current state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios demonstrate that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model achieves a 50% reduction in the mean absolute error of specialized reasoning baselines, demonstrating significant relative accuracy improvements over 72B-scale general MLLMs. Furthermore, PRIMO R1 exhibits strong zero-shot generalization on difficult failure detection tasks. We establish state-of-the-art performance on RoboFail benchmark with 67.0% accuracy, surpassing closed-source models like OpenAI o1 by 6.0%.
[276] Your Vision-Language-Action Model Already Has Attention Heads For Path Deviation Detection cs.RO | cs.CVPDF
Jaehwan Jeong, Evelyn Zhu, Jinying Lin, Emmanuel Jaimes, Tuan-Anh Vu
TL;DR: 该论文发现,在冻结的视觉-语言-动作(VLA)模型中,监控少数特定的注意力头(称为导航头)可以准确检测导航任务中的路径偏离,而无需额外训练或计算开销。基于此,作者提出了一种无需训练的异常检测框架,并集成一个轻量级强化学习策略进行路径回滚,最终在物理机器人上验证了其鲁棒性。
Details
Motivation: VLA模型在导航任务中存在视觉推理幻觉,导致轨迹偏离。传统方法需要训练外部评判模块或依赖复杂的启发式不确定性估计,计算成本高。本文旨在探索一种无需额外训练、低开销的路径偏离检测方法。
Result: 在超过一千个注意力头中,仅组合三个导航头即可实现44.6%的偏离检测率,且误报率低至11.7%。该方法在物理机器人上集成了从检测到恢复的完整流程,证明了其实用鲁棒性。
Insight: 创新点在于发现并利用了VLA模型内部固有的、能捕捉历史视觉序列与语言指令间时空因果关系的注意力头(导航头),从而实现了无需训练、低开销的实时幻觉检测。这为模型内部可解释性和高效异常检测提供了新思路。
Abstract: Vision-Language-Action (VLA) models have demonstrated strong potential for predicting semantic actions in navigation tasks, demonstrating the ability to reason over complex linguistic instructions and visual contexts. However, they are fundamentally hindered by visual-reasoning hallucinations that lead to trajectory deviations. Addressing this issue has conventionally required training external critic modules or relying on complex uncertainty heuristics. In this work, we discover that monitoring a few attention heads within a frozen VLA model can accurately detect path deviations without incurring additional computational overhead. We refer to these heads, which inherently capture the spatiotemporal causality between historical visual sequences and linguistic instructions, as Navigation Heads. Using these heads, we propose an intuitive, training-free anomaly-detection framework that monitors their signals to detect hallucinations in real time. Surprisingly, among over a thousand attention heads, a combination of just three is sufficient to achieve a 44.6 % deviation detection rate with a low false-positive rate of 11.7 %. Furthermore, upon detecting a deviation, we bypass the heavy VLA model and trigger a lightweight Reinforcement Learning (RL) policy to safely execute a shortest-path rollback. By integrating this entire detection-to-recovery pipeline onto a physical robot, we demonstrate its practical robustness. All source code will be publicly available.
[277] OCRA: Object-Centric Learning with 3D and Tactile Priors for Human-to-Robot Action Transfer cs.RO | cs.CVPDF
Kuanning Wang, Ke Fan, Yuqian Fu, Siyu Lin, Hu Luo
TL;DR: OCRA是一个基于视频的人类到机器人动作迁移的对象中心框架,它直接从人类演示视频中学习以实现鲁棒的机器人操作。该框架利用多视角RGB视频、3D基础模型VGGT以及检测与分割模型重建对象中心的3D点云,并结合大规模触觉图像数据集,通过多模态模块ResFiLM融合3D和触觉先验,最终使用扩散策略生成操作动作。
Details
Motivation: 解决从人类演示视频中学习机器人操作时,如何有效关注任务相关对象及其交互,同时过滤无关背景,并处理仅靠视觉难以感知的属性(如触觉信息),以实现更鲁棒和可扩展的机器人教学。
Result: 在仅视觉和视觉-触觉任务上的大量实验表明,OCRA显著优于现有基线和消融模型,证明了其从人类演示视频中学习的有效性。
Insight: 创新点包括对象中心学习框架的提出,结合3D重建(利用VGGT等先进模型)和触觉先验(通过大规模触觉数据集)的多模态融合方法,以及使用扩散策略生成动作;从客观角度看,该研究整合了视觉与触觉感知,提升了机器人操作的鲁棒性和泛化能力。
Abstract: We present OCRA, an Object-Centric framework for video-based human-to-Robot Action transfer that learns directly from human demonstration videos to enable robust manipulation. Object-centric learning emphasizes task-relevant objects and their interactions while filtering out irrelevant background, providing a natural and scalable way to teach robots. OCRA leverages multi-view RGB videos, the state-of-the-art 3D foundation model VGGT, and advanced detection and segmentation models to reconstruct object-centric 3D point clouds, capturing rich interactions between objects. To handle properties not easily perceived by vision alone, we incorporate tactile priors via a large-scale dataset of over one million tactile images. These 3D and tactile priors are fused through a multimodal module (ResFiLM) and fed into a Diffusion Policy to generate robust manipulation actions. Extensive experiments on both vision-only and visuo-tactile tasks show that OCRA significantly outperforms existing baselines and ablations, demonstrating its effectiveness for learning from human demonstration videos.
[278] R3DP: Real-Time 3D-Aware Policy for Embodied Manipulation cs.RO | cs.CVPDF
Yuhao Zhang, Wanxi Dong, Yue Shi, Yi Liang, Jingnan Gao
TL;DR: R3DP是一种用于具身操作的实时3D感知策略,通过异步快慢协作模块将大规模3D先验知识集成到策略中,同时保持实时性能。它利用稀疏关键帧查询预训练的慢速系统(VGGT),并通过轻量级时序特征预测网络(TFPNet)预测中间帧特征,结合多视图特征融合器(MVFF)提升多视图融合效果。
Details
Motivation: 具身操作需要精确的3D物体理解和空间关系来规划执行接触丰富的动作,但现有大规模3D视觉模型计算成本高,导致实时控制延迟过高。本文旨在集成强大3D先验到操作策略中,而不牺牲实时性能。
Result: 在不同视觉配置下评估,R3DP平均成功率分别比单视图和多视图DP基线高出32.9%和51.4%;通过解耦重型3D推理与策略执行,推理时间比朴素的DP+VGGT集成减少44.8%。
Insight: 创新点包括异步快慢协作模块(稀疏关键帧查询慢速模型+轻量TFPNet预测时序特征)和多视图特征融合器(显式结合相机内外参);提供了一种即插即用方案,将大模型集成到实时推理系统中,平衡了精度与效率。
Abstract: Embodied manipulation requires accurate 3D understanding of objects and their spatial relations to plan and execute contact-rich actions. While large-scale 3D vision models provide strong priors, their computational cost incurs prohibitive latency for real-time control. We propose Real-time 3D-aware Policy (R3DP), which integrates powerful 3D priors into manipulation policies without sacrificing real-time performance. A core innovation of R3DP is the asynchronous fast-slow collaboration module, which seamlessly integrates large-scale 3D priors into the policy without compromising real-time performance. The system maintains real-time efficiency by querying the pre-trained slow system (VGGT) only on sparse key frames, while simultaneously employing a lightweight Temporal Feature Prediction Network (TFPNet) to predict features for all intermediate frames. By leveraging historical data to exploit temporal correlations, TFPNet explicitly improves task success rates through consistent feature estimation. Additionally, to enable more effective multi-view fusion, we introduce a Multi-View Feature Fuser (MVFF) that aggregates features across views by explicitly incorporating camera intrinsics and extrinsics. R3DP offers a plug-and-play solution for integrating large models into real-time inference systems. We evaluate R3DP against multiple baselines across different visual configurations. R3DP effectively harnesses large-scale 3D priors to achieve superior results, outperforming single-view and multi-view DP by 32.9% and 51.4% in average success rate, respectively. Furthermore, by decoupling heavy 3D reasoning from policy execution, R3DP achieves a 44.8% reduction in inference time compared to a naive DP+VGGT integration.
[279] Tactile Modality Fusion for Vision-Language-Action Models cs.RO | cs.CV | cs.LGPDF
Charlotte Morissette, Amin Abyaneh, Wei-Di Chang, Anas Houssaini, David Meger
TL;DR: 本文提出了一种名为TacFiLM的轻量级多模态融合方法,旨在将视觉-触觉信号整合到视觉-语言-动作模型中,以提升接触密集型机器人操作任务的性能。
Details
Motivation: 当前视觉-语言-动作模型主要依赖视觉感知,难以捕捉接触操作中的复杂交互动态,如接触力、摩擦力等;现有触觉信号集成方法通常计算复杂,需要更轻量的融合策略。
Result: 在插入任务上的实验表明,该方法在分布内和分布外任务上均能持续提升成功率、直接插入性能、完成时间和力稳定性。
Insight: 创新点在于采用训练后微调方法,利用特征级线性调制将预训练的触觉表征条件化到中间视觉特征上,实现了轻量高效的触觉信号融合。
Abstract: We propose TacFiLM, a lightweight modality-fusion approach that integrates visual-tactile signals into vision-language-action (VLA) models. While recent advances in VLA models have introduced robot policies that are both generalizable and semantically grounded, these models mainly rely on vision-based perception. Vision alone, however, cannot capture the complex interaction dynamics that occur during contact-rich manipulation, including contact forces, surface friction, compliance, and shear. While recent attempts to integrate tactile signals into VLA models often increase complexity through token concatenation or large-scale pretraining, the heavy computational demands of behavioural models necessitate more lightweight fusion strategies. To address these challenges, TacFiLM outlines a post-training finetuning approach that conditions intermediate visual features on pretrained tactile representations using feature-wise linear modulation (FiLM). Experimental results on insertion tasks demonstrate consistent improvements in success rate, direct insertion performance, completion time, and force stability across both in-distribution and out-of-distribution tasks. Together, these results support our method as an effective approach to integrating tactile signals into VLA models, improving contact-rich manipulation behaviours.
[280] Seeing Where to Deploy: Metric RGB-Based Traversability Analysis for Aerial-to-Ground Hidden Space Inspection cs.RO | cs.CVPDF
Seoyoung Lee, Shaekh Mohammad Shithil, Durgakant Pushp, Lantao Liu, Zhangyang Wang
TL;DR: 本文提出了一种基于RGB图像的度量几何-语义重建与可通行性分析框架,用于空中到地面隐藏空间(如涵洞)的协同检测。该方法通过多视角RGB重建生成稠密几何,结合时序一致的语义分割构建3D语义地图,并利用运动先验恢复度量尺度,最终构建置信度感知的可通行性地图以评估部署区域。
Details
Motivation: 解决从空中视角选择合适部署区域时面临的尺度模糊、重建不确定性和地形语义理解等挑战,实现无人机-地面机器人协同对隐藏空间的高效检测。
Result: 在系留无人机-地面机器人平台上进行实验,验证了该方法在隐藏空间场景中能够可靠识别部署区域。
Insight: 创新点包括:结合几何与语义信息的度量重建方法,利用运动先验从RGB图像恢复度量尺度以避免依赖LiDAR,以及构建置信度感知的可通行性地图进行部署区域评估。
Abstract: Inspection of confined infrastructure such as culverts often requires accessing hidden spaces whose entrances are reachable primarily from elevated viewpoints. Aerial-ground cooperation enables a UAV to deploy a compact UGV for interior exploration, but selecting a suitable deployment region from aerial observations requires metric terrain reasoning involving scale ambiguity, reconstruction uncertainty, and terrain semantics. We present a metric RGB-based geometric-semantic reconstruction and traversability analysis framework for aerial-to-ground hidden space inspection. A feed-forward multi-view RGB reconstruction backbone produces dense geometry, while temporally consistent semantic segmentation yields a 3D semantic map. To enable deployment-relevant measurements without LiDAR-based dense mapping, we introduce an embodied motion prior that recovers metric scale by enforcing consistency between predicted camera motion and onboard platform egomotion. From the metrically grounded reconstruction, we construct a confidence-aware geometric-semantic traversability map and evaluate candidate deployment zones under explicit reachability constraints. Experiments on a tethered UAV-UGV platform demonstrate reliable deployment-zone identification in hidden space scenarios.
[281] LiDAR-EVS: Enhance Extrapolated View Synthesis for 3D Gaussian Splatting with Pseudo-LiDAR Supervision cs.RO | cs.CVPDF
Yiming Huang, Xin Kang, Sipeng Zhang, Hongliang Ren, Weihua Zhang
TL;DR: 本文提出LiDAR-EVS框架,用于增强3D高斯溅射(3DGS)在自动驾驶模拟中的外推视图LiDAR合成能力。该框架通过伪LiDAR监督和空间约束的dropout正则化,解决了现有方法在训练轨迹外的新视角上过拟合和泛化差的问题,实现了无需外部多遍数据的可靠LiDAR模拟。
Details
Motivation: 现有基于3DGS的LiDAR模拟方法通常在单次遍历的传感器扫描上训练,对于训练轨迹外的外推视图存在严重过拟合和泛化能力不足的问题,限制了其在未见驾驶路径上的可靠模拟。
Result: 在三个数据集上的大量实验表明,LiDAR-EVS在外推视图LiDAR合成任务上达到了最先进(SOTA)性能。
Insight: 创新点包括:1) 通过多帧LiDAR融合、视图变换、遮挡剔除和强度调整构建伪外推视图点云监督;2) 引入空间约束的dropout正则化以增强对真实驾驶中多样轨迹变化的鲁棒性。该框架设计为即插即用,可轻松扩展到不同的LiDAR传感器和神经渲染基线。
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful technique for real-time LiDAR and camera synthesis in autonomous driving simulation. However, simulating LiDAR with 3DGS remains challenging for extrapolated views beyond the training trajectory, as existing methods are typically trained on single-traversal sensor scans, suffer from severe overfitting and poor generalization to novel ego-vehicle paths. To enable reliable simulation of LiDAR along unseen driving trajectories without external multi-pass data, we present LiDAR-EVS, a lightweight framework for robust extrapolated-view LiDAR simulation in autonomous driving. Designed to be plug-and-play, LiDAR-EVS readily extends to diverse LiDAR sensors and neural rendering baselines with minimal modification. Our framework comprises two key components: (1) pseudo extrapolated-view point cloud supervision with multi-frame LiDAR fusion, view transformation, occlusion curling, and intensity adjustment; (2) spatially-constrained dropout regularization that promotes robustness to diverse trajectory variations encountered in real-world driving. Extensive experiments demonstrate that LiDAR-EVS achieves SOTA performance on extrapolated-view LiDAR synthesis across three datasets, making it a promising tool for data-driven simulation, closed-loop evaluation, and synthetic data generation in autonomous driving systems.
[282] Ego to World: Collaborative Spatial Reasoning in Embodied Systems via Reinforcement Learning cs.RO | cs.CVPDF
Heng Zhou, Li Kang, Yiran Qin, Xiufeng Song, Ao Yu
TL;DR: 该论文提出了Ego-to-World(E2W)基准,用于评估具身多智能体系统从分布式、以自我为中心的视角进行空间推理的能力,并提出了CoRL(结合思维链监督微调和强化学习的框架)来解决该问题,其核心是跨视图空间奖励(CVSR)机制。
Details
Motivation: 解决具身多智能体系统中,每个智能体仅能从受限的自我中心视角感知环境,导致遮挡和模糊性,从而难以进行全局场景理解的挑战。
Result: 在E2W基准测试的三个任务(全局计数、关系位置推理、面向动作的抓取)上,CoRL在推理和感知接地指标上持续超越了强大的专有和开源基线模型;消融实验证实了CVSR各组成部分的必要性;此外,CoRL能泛化到外部空间推理基准,并在真实世界的多机器人操作任务中实现了有效的跨视图定位和抓取放置。
Insight: 创新点在于提出了一个结合思维链监督微调与强化学习的两阶段框架(CoRL),并设计了跨视图空间奖励(CVSR)来提供密集的、任务对齐的反馈,从而引导模型进行连贯的跨视图实体解析和正确的最终预测,为从分布式自我中心观察中学习以世界为中心的场景理解提供了基础。
Abstract: Understanding the world from distributed, partial viewpoints is a fundamental challenge for embodied multi-agent systems. Each agent perceives the environment through an ego-centric view that is often limited by occlusion and ambiguity. To study this problem, we introduce the Ego-to-World (E2W) benchmark, which evaluates a vision-language model’s ability to fuse heterogeneous viewpoints across three tasks: (i) global counting, (ii) relational location reasoning, and (iii) action-oriented grasping that requires predicting view-specific image coordinates. To address this setting, we propose CoRL, a two-stage framework that combines Chain-of-Thought supervised fine-tuning with reinforcement learning using Group-Relative Policy Optimization. Its core component, the Cross-View Spatial Reward (CVSR), provides dense task-aligned feedback by linking reasoning steps to visual evidence, ensuring coherent cross-view entity resolution, and guiding the model toward correct final predictions. Experiments on E2W show that CoRL consistently surpasses strong proprietary and open-source baselines on both reasoning and perception-grounding metrics, while ablations further confirm the necessity of each CVSR component. Beyond that, CoRL generalizes to external spatial reasoning benchmarks and enables effective real-world multi-robot manipulation with calibrated multi-camera rigs, demonstrating cross-view localization and successful grasp-and-place execution. Together, E2W and CoRL provide a principled foundation for learning world-centric scene understanding from distributed, ego-centric observations, advancing collaborative embodied AI.
[283] PerlAD: Towards Enhanced Closed-loop End-to-end Autonomous Driving with Pseudo-simulation-based Reinforcement Learning cs.RO | cs.CVPDF
Yinfeng Gao, Qichao Zhang, Deqing Liu, Zhongpu Xia, Guang Li
TL;DR: 本文提出了一种名为PerlAD的新型伪仿真强化学习方法,用于解决端到端自动驾驶策略在闭环执行中的问题。该方法基于离线数据集构建向量空间中的伪仿真环境,避免了渲染开销,并通过预测世界模型和分层解耦规划器实现高效训练与规划。
Details
Motivation: 基于模仿学习的端到端自动驾驶策略在闭环执行中因开环训练目标与真实驾驶需求不匹配而表现不佳,而基于渲染的强化学习方法存在渲染差距和高计算成本问题。
Result: 在Bench2Drive基准测试中,PerlAD取得了最先进的性能,驾驶分数比之前的端到端强化学习方法提升了10.29%,且无需昂贵的在线交互;在DOS基准上的额外评估进一步证实了其在处理安全关键遮挡场景中的可靠性。
Insight: 创新点包括:1) 基于离线数据集构建渲染无关的伪仿真环境以高效进行试错训练;2) 引入预测世界模型来弥合静态数据集与动态闭环环境之间的差距;3) 采用分层解耦规划器,结合模仿学习进行横向路径生成和强化学习进行纵向速度优化,提升规划效率。
Abstract: End-to-end autonomous driving policies based on Imitation Learning (IL) often struggle in closed-loop execution due to the misalignment between inadequate open-loop training objectives and real driving requirements. While Reinforcement Learning (RL) offers a solution by directly optimizing driving goals via reward signals, the rendering-based training environments introduce the rendering gap and are inefficient due to high computational costs. To overcome these challenges, we present a novel Pseudo-simulation-based RL method for closed-loop end-to-end autonomous driving, PerlAD. Based on offline datasets, PerlAD constructs a pseudo-simulation that operates in vector space, enabling efficient, rendering-free trial-and-error training. To bridge the gap between static datasets and dynamic closed-loop environments, PerlAD introduces a prediction world model that generates reactive agent trajectories conditioned on the ego vehicle’s plan. Furthermore, to facilitate efficient planning, PerlAD utilizes a hierarchical decoupled planner that combines IL for lateral path generation and RL for longitudinal speed optimization. Comprehensive experimental results demonstrate that PerlAD achieves state-of-the-art performance on the Bench2Drive benchmark, surpassing the previous E2E RL method by 10.29% in Driving Score without requiring expensive online interactions. Additional evaluations on the DOS benchmark further confirm its reliability in handling safety-critical occlusion scenarios.
[284] A Novel Camera-to-Robot Calibration Method for Vision-Based Floor Measurements cs.RO | cs.CVPDF
Jan Andre Rudolph, Dennis Haitz, Markus Ulrich
TL;DR: 本文提出了一种用于地面观测移动机器人的新型手眼标定方法,通过设计一个结合激光跟踪器三维测量和相机二维成像的参考板,实现了机器人相机与机器人基座之间的高精度标定。
Details
Motivation: 解决移动机器人上相机虽常见但很少用于地面观测测量任务的问题,以及如何将激光跟踪器的高精度定位与相机视觉测量相结合。
Result: 实验表明该方法具有亚毫米级的重复性,在精度上达到了较高水平。
Insight: 创新点在于设计了一个集成激光跟踪器反射器巢和相机标定目标的参考板,将两种测量模式结合,从而实现了高效的机器人-相机变换标定;从客观角度看,这种多模态融合方法为视觉地面测量提供了实用的标定解决方案。
Abstract: A novel hand-eye calibration method for ground-observing mobile robots is proposed. While cameras on mobile robots are com- mon, they are rarely used for ground-observing measurement tasks. Laser trackers are increasingly used in robotics for precise localization. A referencing plate is designed to combine the two measurement modalities of laser-tracker 3D metrology and camera- based 2D imaging. It incorporates reflector nests for pose acquisition using a laser tracker and a camera calibration target that is observed by the robot-mounted camera. The procedure comprises estimating the plate pose, the plate-camera pose, and the robot pose, followed by computing the robot-camera transformation. Experiments indicate sub-millimeter repeatability.
eess.AS [Back]
[285] Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness eess.AS | cs.CL | cs.LGPDF
Jingyu Lu, Yuhan Wang, Fan Zhuo, Xize Cheng, Changhao Pan
TL;DR: 本文提出SDiaReward,一个端到端多轮对话奖励模型,用于评估口语对话系统的副语言特征和口语化程度。模型基于新构建的SDiaReward-Dataset进行训练,并建立了分层基准ESDR-Bench进行鲁棒的篇章级评估。实验表明SDiaReward在偏好对准确率上达到SOTA,优于通用音频大语言模型。
Details
Motivation: 当前端到端口语对话系统难以捕捉副语言特征(如韵律和情感)和口语化表达(区分书面脚本和自然语音)的差距,即模态差距和口语化差距。
Result: SDiaReward在ESDR-Bench基准测试中实现了最先进的成对偏好准确率,显著优于通用音频大语言模型,并在跨领域和录音条件下展现出更好的泛化能力。
Insight: 创新点在于构建了明确针对模态和口语化差距的篇章级偏好对数据集,并训练了一个端到端多轮奖励模型,能够直接对完整语音对话进行联合评估,捕捉超越表面合成线索的相对对话表现力。
Abstract: The rapid evolution of end-to-end spoken dialogue systems demands transcending mere textual semantics to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two critical gaps: the modality gap, involving prosody and emotion, and the colloquialness gap, distinguishing written scripts from natural speech. To address these challenges, we introduce SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset, a novel collection of episode-level preference pairs explicitly targeting these gaps. It operates directly on full multi-turn speech episodes and is optimized with pairwise preference supervision, enabling joint assessment of modality and colloquialness in a single evaluator. We further establish ESDR-Bench, a stratified benchmark for robust episode-level evaluation. Experiments demonstrate that SDiaReward achieves state-of-the-art pairwise preference accuracy, significantly outperforming general-purpose audio LLMs. Further analysis suggests that SDiaReward captures relative conversational expressiveness beyond superficial synthesis cues, improving generalization across domains and recording conditions. Code, data, and demos are available at https://sdiareward.github.io/.
eess.IV [Back]
[286] EchoLVFM: One-Step Video Generation via Latent Flow Matching for Echocardiogram Synthesis eess.IV | cs.AI | cs.CVPDF
Emmanuel Oladokun, Sarina Thomas, Jurica Šprem, Vicente Grau
TL;DR: 本文提出EchoLVFM,一种基于潜在流匹配的一步式视频生成框架,用于可控的超声心动图合成。该方法在潜在空间中操作,通过单步推理生成时间连贯的视频,相比多步流匹配基线采样效率提升约50倍,同时保持视觉保真度。模型支持对临床变量(如左心室射血分数EF)的全局条件控制,并允许从部分观测序列进行重建和反事实生成。
Details
Motivation: 超声心动图广泛用于评估心脏功能,其中左心室射血分数(EF)等临床参数在诊断和管理中至关重要。现有生成方法通常依赖计算成本高的多步采样和激进的时间归一化,限制了效率和在实际异构数据中的适用性。因此,需要一种高效且可控的超声心动图视频合成方法。
Result: 在CAMUS数据集上,EchoLVFM在具有挑战性的单帧条件下进行评估。定量和定性结果表明,其视频质量具有竞争力,EF控制能力强,且专家临床医生的辨别准确率为57.9%(接近随机水平)。
Insight: 创新点包括:1)将流匹配应用于潜在空间以实现高效的一步视频生成;2)引入全局条件控制(如EF)和掩码条件策略,支持可变长度序列处理;3)在保持视觉保真度的同时显著提升采样效率,为医学图像合成提供了实用的可控生成方案。
Abstract: Echocardiography is widely used for assessing cardiac function, where clinically meaningful parameters such as left-ventricular ejection fraction (EF) play a central role in diagnosis and management. Generative models capable of synthesising realistic echocardiogram videos with explicit control over such parameters are valuable for data augmentation, counterfactual analysis, and specialist training. However, existing approaches typically rely on computationally expensive multi-step sampling and aggressive temporal normalisation, limiting efficiency and applicability to heterogeneous real-world data. We introduce EchoLVFM, a one-step latent video flow-matching framework for controllable echocardiogram generation. Operating in the latent space, EchoLVFM synthesises temporally coherent videos in a single inference step, achieving a $\mathbf{\sim 50\times}$ improvement in sampling efficiency compared to multi-step flow baselines while maintaining visual fidelity. The model supports global conditioning on clinical variables, demonstrated through precise control of EF, and enables reconstruction and counterfactual generation from partially observed sequences. A masked conditioning strategy further removes fixed-length constraints, allowing shorter sequences to be retained rather than discarded. We evaluate EchoLVFM on the CAMUS dataset under challenging single-frame conditioning. Quantitative and qualitative results demonstrate competitive video quality, strong EF adherence, and 57.9% discrimination accuracy by expert clinicians which is close to chance. These findings indicate that efficient, one-step flow matching can enable practical, controllable echocardiogram video synthesis without sacrificing fidelity. Code available at: https://github.com/EngEmmanuel/EchoLVFM
cs.LG [Back]
[287] Greedy Information Projection for LLM Data Selection cs.LG | cs.CLPDF
Victor Ye Dong, Kuan-Yun Lee, Jiamei Shuai, Shengfei Liu, Yi Liu
TL;DR: 本文提出了一种名为贪婪信息投影(GIP)的原则性框架,用于为大语言模型微调选择训练样本。该框架将选择问题转化为最大化选定样本子集与任务特定查询信号之间的互信息,从而自然地平衡了样本的质量和多样性。通过优化一个基于数据和查询嵌入的闭式互信息目标,GIP能够高效地选择出少量样本,在指令跟随和数学推理数据集上,仅使用一小部分样本和计算量就能达到与使用全部数据微调相当的性能。
Details
Motivation: 动机在于解决大语言模型微调中如何高效、有原则地从大量数据中选择高质量且多样化的训练样本的问题,以降低计算成本并提升微调效率。
Result: 在指令跟随和数学推理数据集上的实验表明,GIP选择的少量样本子集在微调性能上能够匹配使用全部数据微调的结果,同时显著减少了所需的样本数量和计算量。
Insight: 创新点在于将数据选择形式化为最大化互信息的问题,并提供了其与将查询嵌入矩阵投影到选定数据张成的空间这一几何解释的等价性,这统一了质量感知和多样性感知的选择标准。基于此几何视角的贪婪匹配追踪算法实现了高效优化。
Abstract: We present \emph{Greedy Information Projection} (\textsc{GIP}), a principled framework for choosing training examples for large language model fine-tuning. \textsc{GIP} casts selection as maximizing mutual information between a subset of examples and task-specific query signals, which may originate from LLM quality judgments, metadata, or other sources. The framework involves optimizing a closed-form mutual information objective defined using both data and query embeddings, naturally balancing {\it quality} and {\it diversity}. Optimizing this score is equivalent to maximizing the projection of the query embedding matrix onto the span of the selected data, which provides a geometric explanation for the co-emergence of quality and diversity. Building on this view, we employ a fast greedy matching-pursuit procedure with efficient projection-based updates. On instruction-following and mathematical reasoning datasets, \textsc{GIP} selects small subsets that match full-data fine-tuning while using only a fraction of examples and compute, unifying quality-aware and diversity-aware selection for efficient fine-tuning.
[288] Understanding the Emergence of Seemingly Useless Features in Next-Token Predictors cs.LG | cs.CLPDF
Mark Rofin, Jalal Naghiyev, Michael Hahn
TL;DR: 本文研究了Transformer模型在训练过程中为何会涌现出看似对下一词预测无用的抽象特征。作者通过分析梯度信号中的特定成分来解释这一现象,并提出了一种评估这些成分对特征涌现影响的方法。该方法在玩具任务上验证后,被用于解释OthelloGPT中的世界模型和小语言模型中的句法特征,并进一步应用于预训练大语言模型,发现对后续词影响极高或极低的特征往往与形式推理领域(如代码)相关。
Details
Motivation: 解决Transformer模型在下一词预测任务中为何会学习到看似冗余的抽象特征的问题,旨在从训练动力学的角度理解这些隐藏特征的涌现机制。
Result: 在玩具任务上验证了方法的有效性,并成功应用于解释OthelloGPT的世界模型和小语言模型的句法特征;在预训练LLM上的应用表明,对后续词影响极端(极高或极低)的特征与形式推理(如代码)相关,提供了对特征涌现的定量分析。
Insight: 创新点在于从梯度信号成分的角度解析特征涌现,并提出了一种可估计特定成分影响的方法;客观来看,该框架为通过训练动力学理解Transformer的隐藏特征提供了新视角,有助于解释模型内部表征的形成过程。
Abstract: Trained Transformers have been shown to compute abstract features that appear redundant for predicting the immediate next token. We identify which components of the gradient signal from the next-token prediction objective give rise to this phenomenon, and we propose a method to estimate the influence of those components on the emergence of specific features. After validating our approach on toy tasks, we use it to interpret the origins of the world model in OthelloGPT and syntactic features in a small language model. Finally, we apply our framework to a pretrained LLM, showing that features with extremely high or low influence on future tokens tend to be related to formal reasoning domains such as code. Overall, our work takes a step toward understanding hidden features of Transformers through the lens of their development during training.
[289] ECG-Reasoning-Benchmark: A Benchmark for Evaluating Clinical Reasoning Capabilities in ECG Interpretation cs.LG | cs.AI | cs.CLPDF
Jungwoo Oh, Hyunseung Chung, Junhee Lee, Min-Gyu Kim, Hangyul Yoon
TL;DR: 本文提出了ECG-Reasoning-Benchmark,一个用于评估多模态大语言模型在心电图解释中临床推理能力的新基准。该基准包含超过6400个样本,旨在系统评估模型在17种核心ECG诊断中的逐步推理能力。研究发现,尽管现有模型具备检索诊断临床标准的知识,但在执行多步骤逻辑推理方面存在严重缺陷,尤其是在将ECG发现与实际视觉证据关联时成功率极低,揭示了当前模型可能仅依赖表面视觉线索而非真正推理。
Details
Motivation: 当前多模态大语言模型在自动心电图解释中表现出潜力,但尚不清楚它们是否真正执行逐步推理还是仅依赖表面视觉线索。为了解决这一问题,作者创建了一个专门的基准来系统评估模型的临床推理能力。
Result: 对最先进模型的全面评估显示,它们在执行多步骤逻辑推理方面存在关键失败。虽然模型具备检索诊断临床标准的知识,但在维持完整推理链方面成功率接近零(完成率仅6%),主要失败在于无法将相应的ECG发现与实际ECG信号中的视觉证据关联。
Insight: 论文的创新点在于提出了首个专注于评估ECG解释中逐步临床推理能力的多轮评估框架,揭示了当前MLLMs在医学AI中可能绕过实际视觉解释的严重缺陷,强调了构建以推理为中心的鲁棒医学AI的必要性。从客观角度看,该基准为未来开发真正具备推理能力的医学AI模型提供了重要的评估工具和方向指引。
Abstract: While Multimodal Large Language Models (MLLMs) show promising performance in automated electrocardiogram interpretation, it remains unclear whether they genuinely perform actual step-by-step reasoning or just rely on superficial visual cues. To investigate this, we introduce \textbf{ECG-Reasoning-Benchmark}, a novel multi-turn evaluation framework comprising over 6,400 samples to systematically assess step-by-step reasoning across 17 core ECG diagnoses. Our comprehensive evaluation of state-of-the-art models reveals a critical failure in executing multi-step logical deduction. Although models possess the medical knowledge to retrieve clinical criteria for a diagnosis, they exhibit near-zero success rates (6% Completion) in maintaining a complete reasoning chain, primarily failing to ground the corresponding ECG findings to the actual visual evidence in the ECG signal. These results demonstrate that current MLLMs bypass actual visual interpretation, exposing a critical flaw in existing training paradigms and underscoring the necessity for robust, reasoning-centric medical AI. The code and data are available at https://github.com/Jwoo5/ecg-reasoning-benchmark.
[290] CausalEvolve: Towards Open-Ended Discovery with Causal Scratchpad cs.LG | cs.CL | stat.MLPDF
Yongqiang Chen, Chenxi Liu, Zhenhao Chen, Tongliang Liu, Bo Han
TL;DR: 本文提出了CausalEvolve,一种基于因果推理的进化代理,旨在解决现有基于进化的AI科学家(如AlphaEvolve)在解决开放式科学问题时缺乏针对性指导和有效知识组织机制的问题。该方法通过引入因果草稿本,利用大语言模型识别和推理进化过程中的指导因素,从而提高进化效率并发现更优解。
Details
Motivation: 现有基于进化的代理在解决开放式问题时,缺乏对进化过程的针对性指导,且无法有效组织和利用历史进化经验,导致进化效率下降并在接近已知性能边界时出现振荡行为。
Result: 在4个具有挑战性的开放式科学任务上,CausalEvolve有效提升了进化效率并发现了更好的解决方案。
Insight: 创新点在于引入了因果草稿本机制,利用LLMs进行结果级因素识别和溯因推理,为进化过程提供互补性启发和新方向,从而系统性地引导进化,超越了仅依赖先验知识和迭代改进的传统方法。
Abstract: Evolve-based agent such as AlphaEvolve is one of the notable successes in using Large Language Models (LLMs) to build AI Scientists. These agents tackle open-ended scientific problems by iteratively improving and evolving programs, leveraging the prior knowledge and reasoning capabilities of LLMs. Despite the success, existing evolve-based agents lack targeted guidance for evolution and effective mechanisms for organizing and utilizing knowledge acquired from past evolutionary experience. Consequently, they suffer from decreasing evolution efficiency and exhibit oscillatory behavior when approaching known performance boundaries. To mitigate the gap, we develop CausalEvolve, equipped with a causal scratchpad that leverages LLMs to identify and reason about guiding factors for evolution. At the beginning, CausalEvolve first identifies outcome-level factors that offer complementary inspirations in improving the target objective. During the evolution, CausalEvolve also inspects surprise patterns during the evolution and abductive reasoning to hypothesize new factors, which in turn offer novel directions. Through comprehensive experiments, we show that CausalEvolve effectively improves the evolutionary efficiency and discovers better solutions in 4 challenging open-ended scientific tasks.
[291] Universe Routing: Why Self-Evolving Agents Need Epistemic Control cs.LG | cs.AI | cs.CLPDF
Zhaohui Geoffrey Wang
TL;DR: 本文提出了宇宙路由问题,即智能体在遇到问题时需要先将其分类到互斥的信念空间,再调用专门的求解器。研究发现,硬路由到异构求解器在匹配软MoE精度的同时速度快7倍,一个465M参数的路由器比关键词匹配基线具有更小的泛化差距,且基于排练的持续学习在扩展到新信念空间时实现了零遗忘。
Details
Motivation: 当前终身智能体的关键失败模式不是缺乏知识,而是无法决定如何推理。当智能体遇到问题时,需要识别应调用频率论假设检验还是贝叶斯后验推断等认识论上不兼容的框架,混合使用会导致结构性错误。
Result: 在实验中,硬路由匹配软MoE精度且速度快7倍;465M参数路由器比关键词匹配基线的泛化差距小2.3倍;扩展到新信念空间时,基于排练的持续学习实现零遗忘,比EWC高出75个百分点。
Insight: 创新点在于形式化了宇宙路由问题,并提出了显式的认识论控制层来管理推理框架选择。客观分析认为,模块化的认识论架构比基于正则化的方法更适用于终身学习,且硬路由在处理认识论不兼容框架时比软平均更有效。
Abstract: A critical failure mode of current lifelong agents is not lack of knowledge, but the inability to decide how to reason. When an agent encounters “Is this coin fair?” it must recognize whether to invoke frequentist hypothesis testing or Bayesian posterior inference - frameworks that are epistemologically incompatible. Mixing them produces not minor errors, but structural failures that propagate across decision chains. We formalize this as the universe routing problem: classifying questions into mutually exclusive belief spaces before invoking specialized solvers. Our key findings challenge conventional assumptions: (1) hard routing to heterogeneous solvers matches soft MoE accuracy while being 7x faster because epistemically incompatible frameworks cannot be meaningfully averaged; (2) a 465M-parameter router achieves a 2.3x smaller generalization gap than keyword-matching baselines, indicating semantic rather than surface-level reasoning; (3) when expanding to new belief spaces, rehearsal-based continual learning achieves zero forgetting, outperforming EWC by 75 percentage points, suggesting that modular epistemic architectures are fundamentally more amenable to lifelong learning than regularization-based approaches. These results point toward a broader architectural principle: reliable self-evolving agents may require an explicit epistemic control layer that governs reasoning framework selection.
[292] LLM as Graph Kernel: Rethinking Message Passing on Text-Rich Graphs cs.LG | cs.CLPDF
Ying Zhang, Hang Yu, Haipeng Zhang, Peng Di
TL;DR: 本文提出RAMP方法,将LLM重新定义为图原生聚合算子,用于处理文本丰富的图数据。该方法通过原始文本锚定的消息传递机制,在每次迭代中基于节点原始文本进行推理,同时传播动态优化的邻居消息,从而弥合图传播与深度文本推理之间的鸿沟。
Details
Motivation: 现有方法在处理文本丰富的图时,通常将丰富的文本压缩为静态嵌入或摘要后再进行结构推理,这造成了信息瓶颈并使更新与原始内容脱节。本文认为在文本丰富的图中,文本不仅是节点属性,更是结构关系的主要表现媒介。
Result: 大量实验表明,RAMP在文本丰富的图上实现了有竞争力的性能,为LLM作为通用图学习中的图核角色提供了新的见解。
Insight: 创新点在于将LLM重新构想为图原生聚合算子,而非仅仅作为特征提取器,并引入了一种新颖的双重表示方案:在每次迭代中锚定于节点原始文本进行推理,同时传播动态优化的邻居消息,且在一个统一的生成式框架下处理判别式和生成式任务。
Abstract: Text-rich graphs, which integrate complex structural dependencies with abundant textual information, are ubiquitous yet remain challenging for existing learning paradigms. Conventional methods and even LLM-hybrids compress rich text into static embeddings or summaries before structural reasoning, creating an information bottleneck and detaching updates from the raw content. We argue that in text-rich graphs, the text is not merely a node attribute but the primary medium through which structural relationships are manifested. We introduce RAMP, a Raw-text Anchored Message Passing approach that moves beyond using LLMs as mere feature extractors and instead recasts the LLM itself as a graph-native aggregation operator. RAMP exploits the text-rich nature of the graph via a novel dual-representation scheme: it anchors inference on each node’s raw text during each iteration while propagating dynamically optimized messages from neighbors. It further handles both discriminative and generative tasks under a single unified generative formulation. Extensive experiments show that RAMP effectively bridges the gap between graph propagation and deep text reasoning, achieving competitive performance and offering new insights into the role of LLMs as graph kernels for general-purpose graph learning.
[293] Directional Embedding Smoothing for Robust Vision Language Models cs.LG | cs.AI | cs.CL | cs.CRPDF
Ye Wang, Jing Liu, Toshiaki Koike-Akino
TL;DR: 本文提出将RESTA防御方法扩展到视觉语言模型(VLMs)中,以增强其对抗多模态越狱攻击的安全性。研究发现,采用方向性嵌入噪声(即注入与原始令牌嵌入向量对齐的噪声)能有效降低攻击成功率,为智能体AI系统提供轻量级的推理时防御层。
Details
Motivation: 视觉语言模型的安全性和可靠性对于部署可信赖的智能体AI系统至关重要,但现有模型仍易受越狱攻击影响,导致安全对齐失效并产生有害输出。
Result: 在JailBreakV-28K多模态越狱攻击基准测试中,RESTA防御方法能显著降低攻击成功率,尤其是在使用方向性嵌入噪声时效果更佳。
Insight: 创新点在于将RESTA防御扩展至VLMs,并引入方向性嵌入噪声以提升鲁棒性;这为构建轻量级、推理时的安全框架提供了可借鉴的思路,有助于增强多模态模型的实际部署安全性。
Abstract: The safety and reliability of vision-language models (VLMs) are a crucial part of deploying trustworthy agentic AI systems. However, VLMs remain vulnerable to jailbreaking attacks that undermine their safety alignment to yield harmful outputs. In this work, we extend the Randomized Embedding Smoothing and Token Aggregation (RESTA) defense to VLMs and evaluate its performance against the JailBreakV-28K benchmark of multi-modal jailbreaking attacks. We find that RESTA is effective in reducing attack success rate over this diverse corpus of attacks, in particular, when employing directional embedding noise, where the injected noise is aligned with the original token embedding vectors. Our results demonstrate that RESTA can contribute to securing VLMs within agentic systems, as a lightweight, inference-time defense layer of an overall security framework.
[294] Amplification Effects in Test-Time Reinforcement Learning: Safety and Reasoning Vulnerabilities cs.LG | cs.AI | cs.CL | cs.CRPDF
Vanshaj Khattar, Md Rafi ur Rashid, Moumita Choudhury, Jing Liu, Toshiaki Koike-Akino
TL;DR: 本文研究了测试时强化学习(TTRL)方法的安全漏洞,发现有害提示注入会放大模型的现有行为(安全或有害),并导致推理能力下降(推理税)。研究还表明,可通过精心设计的‘HarmInject’提示进行对抗性利用,加剧有害放大。
Details
Motivation: 测试时训练(TTT)方法能提升大语言模型的推理能力,但其依赖测试数据使其易受有害提示注入攻击。本文旨在探究以自一致性为基础的TTRL方法的安全脆弱性。
Result: 研究显示,在TTRL过程中,有害提示注入会放大模型的基础行为(安全模型更安全,脆弱模型更有害),并导致推理能力下降。通过‘HarmInject’提示的对抗性攻击可进一步增强有害放大。
Insight: 论文揭示了基于自一致性提升推理的TTT方法可能引发行为放大和推理退化,强调了开发更安全的TTT方法的必要性。创新点在于系统分析了TTRL的放大效应和推理税,并提出了对抗性攻击场景。
Abstract: Test-time training (TTT) has recently emerged as a promising method to improve the reasoning abilities of large language models (LLMs), in which the model directly learns from test data without access to labels. However, this reliance on test data also makes TTT methods vulnerable to harmful prompt injections. In this paper, we investigate safety vulnerabilities of TTT methods, where we study a representative self-consistency-based test-time learning method: test-time reinforcement learning (TTRL), a recent TTT method that improves LLM reasoning by rewarding self-consistency using majority vote as a reward signal. We show that harmful prompt injection during TTRL amplifies the model’s existing behaviors, i.e., safety amplification when the base model is relatively safe, and harmfulness amplification when it is vulnerable to the injected data. In both cases, there is a decline in reasoning ability, which we refer to as the reasoning tax. We also show that TTT methods such as TTRL can be exploited adversarially using specially designed “HarmInject” prompts to force the model to answer jailbreak and reasoning queries together, resulting in stronger harmfulness amplification. Overall, our results highlight that TTT methods that enhance LLM reasoning by promoting self-consistency can lead to amplification behaviors and reasoning degradation, highlighting the need for safer TTT methods.
[295] ML-EcoLyzer: Quantifying the Environmental Cost of Machine Learning Inference Across Frameworks and Hardware cs.LG | cs.AI | cs.CV | cs.HC | cs.SEPDF
Jose Marie Antonio Minoza, Rex Gregor Laylo, Christian F Villarin, Sebastian C. Ibanez
TL;DR: 本文介绍了ML-EcoLyzer,一个用于量化机器学习推理环境成本的跨框架工具,它测量CPU、消费级GPU和数据中心加速器上的碳排放、能耗、热耗和水耗,并引入环境可持续性评分(ESS)来评估每克二氧化碳排放所服务的有效参数量。
Details
Motivation: 动机是机器学习推理规模巨大,但其环境影响(尤其在低资源硬件上)缺乏量化,需要工具来测量和比较不同框架与硬件的环境成本。
Result: 评估覆盖了超过1900种推理配置,包括多样化的模型架构、任务模态、硬件类型和精度级别,结果表明量化能提升ESS,大型加速器在轻量级应用中可能低效,且小模型若实现不佳也可能带来显著成本。
Insight: 创新点在于开发了跨框架的环境成本测量工具和ESS指标,提供了大规模实证评估,为可持续的模型选择设立了标准,并揭示了硬件效率与模型优化之间的关键权衡。
Abstract: Machine learning inference occurs at a massive scale, yet its environmental impact remains poorly quantified, especially on low-resource hardware. We present ML-EcoLyzer, a cross-framework tool for measuring the carbon, energy, thermal, and water costs of inference across CPUs, consumer GPUs, and datacenter accelerators. The tool supports both classical and modern models, applying adaptive monitoring and hardware-aware evaluation. We introduce the Environmental Sustainability Score (ESS), which quantifies the number of effective parameters served per gram of CO$_2$ emitted. Our evaluation covers over 1,900 inference configurations, spanning diverse model architectures, task modalities (text, vision, audio, tabular), hardware types, and precision levels. These rigorous and reliable measurements demonstrate that quantization enhances ESS, huge accelerators can be inefficient for lightweight applications, and even small models may incur significant costs when implemented suboptimally. ML-EcoLyzer sets a standard for sustainability-conscious model selection and offers an extensive empirical evaluation of environmental costs during inference.
[296] Lipschitz-Based Robustness Certification Under Floating-Point Execution cs.LG | cs.CV | cs.PLPDF
Toby Murray
TL;DR: 本文针对基于Lipschitz的神经网络鲁棒性认证方法在浮点执行环境下的语义鸿沟问题,提出了一个形式化、可组合的理论框架,将实数算术下的敏感度界限与浮点执行敏感度关联起来,并开发了可执行的认证器以验证浮点执行下的鲁棒性。
Details
Motivation: 现有鲁棒性认证方法通常在精确实数算术的语义模型下证明其正确性,而实际部署的神经网络使用浮点算术执行,这种不匹配导致认证的鲁棒性属性与系统实际行为之间存在语义差距,可能使认证失效,尤其是在低精度格式(如float16)下。
Result: 论文开发了一个可执行的认证器,并进行了实证评估以证明其实用性,展示了在浮点执行下鲁棒性认证的可行性,包括证书退化界限和避免溢出的充分条件。
Insight: 创新点在于首次形式化地将实数算术的Lipschitz敏感度界限与标准舍入误差模型下的浮点执行敏感度联系起来,为前馈ReLU神经网络提供了浮点执行下的鲁棒性认证理论,弥补了理论与实际部署之间的语义鸿沟。
Abstract: Sensitivity-based robustness certification has emerged as a practical approach for certifying neural network robustness, including in settings that require verifiable guarantees. A key advantage of these methods is that certification is performed by concrete numerical computation (rather than symbolic reasoning) and scales efficiently with network size. However, as with the vast majority of prior work on robustness certification and verification, the soundness of these methods is typically proved with respect to a semantic model that assumes exact real arithmetic. In reality deployed neural network implementations execute using floating-point arithmetic. This mismatch creates a semantic gap between certified robustness properties and the behaviour of the executed system. As motivating evidence, we exhibit concrete counterexamples showing that real arithmetic robustness guarantees can fail under floating-point execution, even for previously verified certifiers, with discrepancies becoming pronounced at lower-precision formats such as float16. We then develop a formal, compositional theory relating real arithmetic Lipschitz-based sensitivity bounds to the sensitivity of floating-point execution under standard rounding-error models, specialised to feed-forward neural networks with ReLU activations. We derive sound conditions for robustness under floating-point execution, including bounds on certificate degradation and sufficient conditions for the absence of overflow. We formalize the theory and its main soundness results, and implement an executable certifier based on these principles, which we empirically evaluate to demonstrate its practicality.
[297] Self-Flow-Matching assisted Full Waveform Inversion cs.LG | cs.AI | cs.CV | physics.geo-phPDF
Xinquan Huang, Paris Perdikaris
TL;DR: 本文提出了一种名为自流匹配辅助全波形反演(SFM-FWI)的新框架,用于解决地震成像中全波形反演(FWI)的非线性、易受周波跳跃和噪声影响的问题。该方法利用流匹配技术在线学习传输场,无需大规模离线预训练,并避免了噪声水平对齐的模糊性,从而在物理驱动下实现更稳健的反演。
Details
Motivation: 全波形反演(FWI)是一种高分辨率地震成像方法,但在低频缺失或初始模型较差时,易受周波跳跃和噪声影响而失败。现有扩散正则化FWI方法虽引入生成先验,但需要昂贵的离线预训练,且对分布偏移敏感,并存在噪声水平对齐模糊性问题。
Result: 在具有挑战性的合成基准测试中,实验表明SFM-FWI相比标准FWI和无预训练正则化方法,能提供更准确的重建结果、更强的噪声鲁棒性和更稳定的收敛性。
Insight: 创新点在于将流匹配(flow matching)引入FWI,无需假设高斯初始化或预定义噪声计划,可直接以初始模型作为动力学起点,通过物理定律和观测数据在线训练单一流网络,利用FWI数据失配进行自监督,避免了外部训练对和预训练需求,从而解决了噪声对齐模糊性并提升了性能。
Abstract: Full-waveform inversion (FWI) is a high-resolution seismic imaging method that estimates subsurface velocity by matching simulated and recorded waveforms. However, FWI is highly nonlinear, prone to cycle skipping, and sensitive to noise, particularly when low frequencies are missing or the initial model is poor, leading to failures under imperfect acquisition. Diffusion-regularized FWI introduces generative priors to encourage geologically realistic models, but these priors typically require costly offline pretraining and can deteriorate under distribution shift. Moreover, they assume Gaussian initialization and a fixed noise schedule, in which it is unclear how to map a deterministic FWI iterate and its starting model to a well-defined diffusion time or noise level. To address these limitations, we introduce Self-Flow-Matching assisted Full-Waveform Inversion (SFM-FWI), a physics-driven framework that eliminates the need for large-scale offline pretraining while avoiding the noise-level alignment ambiguity. SFM-FWI leverages flow matching to learn a transport field without assuming Gaussian initialization or a predefined noise schedule, so the initial model can be used directly as the starting point of the dynamics. Our approach trains a single flow network online using the governing physics and observed data. At each outer iteration, we build an interpolated model and update the flow by backpropagating the FWI data misfit, providing self-supervision without external training pairs. Experiments on challenging synthetic benchmarks show that SFM-FWI delivers more accurate reconstructions, greater noise robustness, and more stable convergence than standard FWI and pretraining-free regularization methods.
[298] OrigamiBench: An Interactive Environment to Synthesize Flat-Foldable Origamis cs.LG | cs.CVPDF
Naaisha Agarwal, Yihan Wu, Yichang Jian, Yikuan Hu, Nishad Mansoor
TL;DR: 本文介绍了OrigamiBench,这是一个用于评估AI系统在物理世界中规划、行动和创造能力的交互式基准测试环境,专注于合成可平面折叠的折纸。该环境要求模型迭代地提出折叠操作,并接收关于物理有效性和与目标配置相似性的反馈。
Details
Motivation: 现有基准测试通常将视觉感知和程序化推理视为独立问题,而构建能在物理世界中行动的AI系统需要理解因果机制和约束。折纸领域自然地整合了视觉感知、几何物理约束推理和序列规划,为系统评估提供了结构化测试平台。
Result: 实验使用现代视觉语言模型进行,结果表明仅扩大模型规模并不能可靠地产生关于物理变换的因果推理。模型无法生成连贯的多步折叠策略,暗示视觉和语言表示仍然弱集成。
Insight: 论文的创新点在于提出了一个整合视觉、推理和规划的交互式基准测试OrigamiBench,以评估AI系统的物理因果理解能力。从客观角度看,该工作强调了当前视觉语言模型在整合多模态信息进行序列决策方面的局限性,为未来研究提供了重要的评估工具和方向。
Abstract: Building AI systems that can plan, act, and create in the physical world requires more than pattern recognition. Such systems must understand the causal mechanisms and constraints governing physical processes in order to guide sequential decisions. This capability relies on internal representations, analogous to an internal language model, that relate observations, actions, and resulting environmental changes. However, many existing benchmarks treat visual perception and programmatic reasoning as separate problems, focusing either on visual recognition or on symbolic tasks. The domain of origami provides a natural testbed that integrates these modalities. Constructing shapes through folding operations requires visual perception, reasoning about geometric and physical constraints, and sequential planning, while remaining sufficiently structured for systematic evaluation. We introduce OrigamiBench, an interactive benchmark in which models iteratively propose folds and receive feedback on physical validity and similarity to a target configuration. Experiments with modern vision-language models show that scaling model size alone does not reliably produce causal reasoning about physical transformations. Models fail to generate coherent multi-step folding strategies, suggesting that visual and language representations remain weakly integrated.
[299] UVLM: A Universal Vision-Language Model Loader for Reproducible Multimodal Benchmarking cs.LG | cs.AI | cs.CVPDF
Joan Perez, Giovanni Fusco
TL;DR: 本文提出了UVLM(通用视觉语言模型加载器),这是一个基于Google Colab的框架,旨在解决不同视觉语言模型(VLM)架构异构性导致的部署与评估困难。它为LLaVA-NeXT和Qwen2.5-VL等模型系列提供了统一的加载、配置和基准测试接口,支持多种任务和响应类型,并强调可复现性、易用性和可扩展性。
Details
Motivation: 动机是解决不同视觉语言模型家族(如LLaVA-NeXT和Qwen2.5-VL)在视觉编码、分词和解码策略上的显著架构异质性,这种异质性阻碍了模型在实际部署和公平比较中的实用性。
Result: 论文在一个包含120张街景图像的语料库上,对不同VLM在推理复杂度递增的任务上进行了首次基准测试,展示了框架在统一评估协议下的应用能力,但摘要未提及具体的定量性能指标(如准确率)或是否达到SOTA水平。
Insight: 宣称的创新点在于提供了一个统一的、可复现的VLM基准测试框架,其核心是将模型差异抽象在单一推理函数之后,并集成了多任务提示构建器、基于多数投票的共识验证机制、灵活的令牌预算以及内置的思维链参考模式。从客观角度看,其将异构模型接口标准化的工程方法,以及对可访问性和可扩展性的设计,对于推动多模态研究的公平比较和快速原型开发具有实用价值。
Abstract: Vision-Language Models (VLMs) have emerged as powerful tools for image understanding tasks, yet their practical deployment remains hindered by significant architectural heterogeneity across model families. This paper introduces UVLM (Universal Vision-Language Model Loader), a Google Colab-based framework that provides a unified interface for loading, configuring, and benchmarking multiple VLM architectures on custom image analysis tasks. UVLM currently supports two major model families – LLaVA-NeXT and Qwen2.5-VL – which differ fundamentally in their vision encoding, tokenization, and decoding strategies. The framework abstracts these differences behind a single inference function, enabling researchers to compare models using identical prompts and evaluation protocols. Key features include a multi-task prompt builder with support for four response types (numeric, category, boolean, text), a consensus validation mechanism based on majority voting across repeated inferences, a flexible token budget (up to 1,500 tokens) enabling users to design custom reasoning strategies through prompt engineering, and a built-in chain-of-thought reference mode for benchmarking. UVLM is designed for reproducibility, accessibility, and extensibility and as such is freely deployable on Google Colab using consumer-grade GPU resources. The paper also presents the first benchmarking of different VLMs on tasks of increasing reasoning complexity using a corpus of 120 street-view images.
[300] Balancing Multimodal Domain Generalization via Gradient Modulation and Projection cs.LG | cs.CVPDF
Hongzhao Li, Guohao Shen, Shupan Li, Mingliang Xu, Muhammad Haris Khan
TL;DR: 本文提出了一种名为梯度调制投影的统一策略,旨在解决多模态域泛化中的优化不平衡问题。该方法通过解耦分类和域不变性目标的梯度,并基于语义和域置信度调制各模态的梯度,同时动态调整梯度投影以缓解任务间冲突,从而提升模型在未见域上的泛化能力。
Details
Motivation: 多模态域泛化中,各模态在训练期间以不同速度收敛,导致梯度贡献不均,某些模态主导学习过程,而现有平衡策略仅依赖源域分类性能,忽略了在源域表现好的模态可能在未见域上泛化差的问题。
Result: 在多个基准测试上的广泛实验表明,该方法实现了最先进的性能,并能灵活集成到不同的多模态域泛化方法中,显著提升了跨域泛化能力。
Insight: 创新点在于提出了一种统一的梯度调制投影策略,不仅基于分类性能,还结合了语义和域置信度来调制梯度,并动态管理分类与域不变性学习任务间的冲突,以促进更平衡的优化和更好的跨域泛化。
Abstract: Multimodal Domain Generalization (MMDG) leverages the complementary strengths of multiple modalities to enhance model generalization on unseen domains. A central challenge in multimodal learning is optimization imbalance, where modalities converge at different speeds during training. This imbalance leads to unequal gradient contributions, allowing some modalities to dominate the learning process while others lag behind. Existing balancing strategies typically regulate each modality’s gradient contribution based on its classification performance on the source domain to alleviate this issue. However, relying solely on source-domain accuracy neglects a key insight in MMDG: modalities that excel on the source domain may generalize poorly to unseen domains, limiting cross-domain gains. To overcome this limitation, we propose Gradient Modulation Projection (GMP), a unified strategy that promotes balanced optimization in MMDG. GMP first decouples gradients associated with classification and domain-invariance objectives. It then modulates each modality’s gradient based on semantic and domain confidence. Moreover, GMP dynamically adjusts gradient projections by tracking the relative strength of each task, mitigating conflicts between classification and domain-invariant learning within modality-specific encoders. Extensive experiments demonstrate that GMP achieves state-of-the-art performance and integrates flexibly with diverse MMDG methods, significantly improving generalization across multiple benchmarks.
[301] Sampling-guided exploration of active feature selection policies cs.LG | cs.CVPDF
Gabriel Bernardino, Anders Jonsson, Patrick Clarysse, Nicolas Duchateau
TL;DR: 本文提出了一种基于强化学习的主动特征选择方法,通过序列化决策为每个实例动态推荐下一个应获取的特征,以优化信息获取成本与性能的平衡。针对先前方法只能处理少量特征的局限,本文引入启发式策略聚焦最有希望的特征组合,并采用后拟合正则化策略减少决策序列复杂度,从而扩展至更大数据集。
Details
Motivation: 解决机器学习中全局特征选择的局限性,即某些特征仅对部分实例有益,同时考虑特征获取成本与模型性能的平衡,避免数据插补并处理状态维度变化的问题。
Result: 在四个二分类数据集(包括一个高维变量数据集)上测试,最大数据集含56个特征和4500个样本。方法在准确性和策略复杂度方面均优于现有最先进方法(SOTA)。
Insight: 创新点包括:1) 使用启发式策略扩展框架以处理大规模特征组合,2) 引入后拟合正则化策略简化决策序列。从客观角度看,该方法将强化学习与实例特定的动态特征选择结合,有效权衡成本与性能,适用于实际应用中的资源受限场景。
Abstract: Determining the most appropriate features for machine learning predictive models is challenging regarding performance and feature acquisition costs. In particular, global feature choice is limited given that some features will only benefit a subset of instances. In previous work, we proposed a reinforcement learning approach to sequentially recommend which modality to acquire next to reach the best information/cost ratio, based on the instance-specific information already acquired. We formulated the problem as a Markov Decision Process where the state’s dimensionality changes during the episode, avoiding data imputation, contrary to existing works. However, this only allowed processing a small number of features, as all possible combinations of features were considered. Here, we address these limitations with two contributions: 1) we expand our framework to larger datasets with a heuristic-based strategy that focuses on the most promising feature combinations, and 2) we introduce a post-fit regularisation strategy that reduces the number of different feature combinations, leading to compact sequences of decisions. We tested our method on four binary classification datasets (one involving high-dimensional variables), the largest of which had 56 features and 4500 samples. We obtained better performance than state-of-the-art methods, both in terms of accuracy and policy complexity.
[302] Faster Inference of Flow-Based Generative Models via Improved Data-Noise Coupling cs.LG | cs.CVPDF
Aram Davtyan, Leello Tadesse Dadi, Volkan Cevher, Paolo Favaro
TL;DR: 本文提出了一种名为LOOM-CFM的新方法,通过改进数据与噪声的耦合方式来加速基于流的生成模型的推理过程。该方法扩展了小批量最优传输(OT)的范围,通过跨小批量保留和优化噪声-数据配对分配,从而在多个数据集上实现了采样速度与质量权衡的一致提升。
Details
Motivation: 条件流匹配(CFM)是一种无模拟训练连续归一化流的方法,是扩散模型在图像和视频生成等关键任务中的高效替代方案。其性能依赖于数据与噪声的耦合方式。现有方法使用小批量最优传输来重新分配噪声-数据对以简化采样轨迹并加速推理,但其优化仅限于单个小批量,限制了在大数据集上的有效性。
Result: LOOM-CFM方法在多个数据集上(如标准图像生成基准)持续改进了采样速度与质量的权衡,并增强了蒸馏初始化,支持潜在空间训练中的高分辨率合成。
Insight: 主要创新点在于将小批量最优传输的优化范围扩展到跨训练批次,通过长期保存和优化配对分配来克服原有方法在大数据集上的局限性,从而更有效地加速基于流的生成模型的推理过程。
Abstract: Conditional Flow Matching (CFM), a simulation-free method for training continuous normalizing flows, provides an efficient alternative to diffusion models for key tasks like image and video generation. The performance of CFM in solving these tasks depends on the way data is coupled with noise. A recent approach uses minibatch optimal transport (OT) to reassign noise-data pairs in each training step to streamline sampling trajectories and thus accelerate inference. However, its optimization is restricted to individual minibatches, limiting its effectiveness on large datasets. To address this shortcoming, we introduce LOOM-CFM (Looking Out Of Minibatch-CFM), a novel method to extend the scope of minibatch OT by preserving and optimizing these assignments across minibatches over training time. Our approach demonstrates consistent improvements in the sampling speed-quality trade-off across multiple datasets. LOOM-CFM also enhances distillation initialization and supports high-resolution synthesis in latent space training.
cs.IR [Back]
[303] The Reasoning Bottleneck in Graph-RAG: Structured Prompting and Context Compression for Multi-Hop QA cs.IR | cs.CLPDF
Yasaman Zarinkia, Venkatesh Srinivasan, Alex Thomo
TL;DR: 该论文分析了Graph-RAG系统在多跳问答任务中的推理瓶颈,发现即使检索到正确答案,系统仍因推理失败而导致准确率低下。作者提出了两种增强方法:SPARQL链式思维提示和基于图遍历的上下文压缩,以提升推理能力并降低成本。实验表明,增强后的轻量级模型在三个基准测试上达到或超过了未增强的大型模型性能。
Details
Motivation: 解决Graph-RAG系统在多跳问答中存在的推理瓶颈问题,即尽管检索性能强,但系统常因推理失败而无法生成正确答案,导致准确率与检索成功率不匹配。
Result: 在HotpotQA、MuSiQue和2WikiMultiHopQA三个基准测试上,增强方法使准确率提升2到14个百分点,图压缩方法平均提升6个百分点;增强后的Llama-8B模型在成本降低约12倍的情况下,达到或超过了未增强的Llama-70B基线水平,并在LightRAG系统上验证了方法的可迁移性。
Insight: 创新点包括将SPARQL查询与链式思维提示结合以结构化分解问题,以及无需LLM调用的图遍历压缩技术来减少上下文冗余;客观分析认为,该方法通过优化推理过程而非单纯扩大模型规模,实现了高效的多跳问答,为RAG系统提供了可借鉴的轻量化增强策略。
Abstract: Graph-RAG systems achieve strong multi-hop question answering by indexing documents into knowledge graphs, but strong retrieval does not guarantee strong answers. Evaluating KET-RAG, a leading Graph-RAG system, on three multi-hop QA benchmarks (HotpotQA, MuSiQue, 2WikiMultiHopQA), we find that 77% to 91% of questions have the gold answer in the retrieved context, yet accuracy is only 35% to 78%, and 73% to 84% of errors are reasoning failures. We propose two augmentations: (i) SPARQL chain-of-thought prompting, which decomposes questions into triple-pattern queries aligned with the entity-relationship context, and (ii) graph-walk compression, which compresses the context by ~60% via knowledge-graph traversal with no LLM calls. SPARQL CoT improves accuracy by +2 to +14 pp; graph-walk compression adds +6 pp on average when paired with structured prompting on smaller models. Surprisingly, we show that, with question-type routing, a fully augmented budget open-weight Llama-8B model matches or exceeds the unaugmented Llama-70B baseline on all three benchmarks at ~12x lower cost. A replication on LightRAG confirms that our augmentations transfer across Graph-RAG systems.
[304] Citation-Enforced RAG for Fiscal Document Intelligence: Cited, Explainable Knowledge Retrieval in Tax Compliance cs.IR | cs.AI | cs.CLPDF
Akhil Chandra Shanivendra
TL;DR: 本文提出了一种用于财税文档智能的多模态、引用强化的检索增强生成(RAG)框架,旨在提升高风险监管领域(如税务合规)中AI系统的透明度、引用保真度和保守行为。该框架采用源优先的文档处理策略,保留页面级来源信息,在生成过程中强制引用,并在证据不足时支持系统拒绝回答。
Details
Motivation: 现有基于生成式AI和RAG的文档问答方法在税务合规等高监管风险领域缺乏所需的透明度、引用准确性和保守性,无法满足审计和分析工作流的要求。
Result: 在真实的美国国税局(IRS)和州税务文档上的评估表明,该框架提高了引用保真度,减少了幻觉,并提供了分析师可用的解释。
Insight: 创新点在于将多模态RAG与严格的引用强制执行机制相结合,并引入“拒绝回答”功能,为高风险监管领域的可信AI应用提供了可解释、可审计的解决方案。
Abstract: Tax authorities and public-sector financial agencies rely on large volumes of unstructured and semi-structured fiscal documents - including tax forms, instructions, publications, and jurisdiction-specific guidance - to support compliance analysis and audit workflows. While recent advances in generative AI and retrieval-augmented generation (RAG) have shown promise for document-centric question answering, existing approaches often lack the transparency, citation fidelity, and conservative behaviour required in high-stakes regulatory domains. This paper presents a multimodal, citation-enforced RAG framework for fiscal document intelligence that prioritises explainability and auditability. The framework adopts a source-first ingestion strategy, preserves page-level provenance, enforces citations during generation, and supports abstention when evidence is insufficient. Evaluation on real IRS and state tax documents demonstrates improved citation fidelity, reduced hallucination, and analyst-usable explanations, illustrating a pathway toward trustworthy AI for tax compliance.
cs.AI [Back]
[305] CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges cs.AI | cs.CLPDF
Zi-Han Wang, Lam Nguyen, Zhengyang Zhao, Mengyue Yang, Chengwei Qin
TL;DR: 本文提出了CreativeBench基准测试,用于评估代码生成中的机器创造力,基于经典认知框架分为组合创造力和探索创造力两个子集,通过自动化流程和统一指标(质量与新颖性的乘积)客观区分创造力与幻觉。研究发现模型缩放对组合创造力有显著提升但对探索创造力收益递减,并提出了EvoRePE推理时引导策略以增强机器创造力。
Details
Motivation: 当前高质量预训练数据饱和,研究转向能够持续生成新颖产物的进化系统(如AlphaEvolve),但此类系统缺乏严格定量评估阻碍了进展,因此需要建立基准来评估机器创造力。
Result: 在CreativeBench基准上分析SOTA模型发现:缩放显著提升组合创造力但对探索创造力收益递减;大模型出现‘缩放收敛’现象,更正确但发散性降低;推理能力主要受益于受限探索而非组合。EvoRePE策略能持续增强创造力。
Insight: 创新点包括:基于认知框架构建可执行代码的创造力基准,通过反向工程和自博弈实现自动化评估;提出质量与新颖性乘积的统一指标;揭示模型缩放对创造力类型的不同影响;设计即插即用的进化搜索模式内部化推理时引导策略EvoRePE。
Abstract: The saturation of high-quality pre-training data has shifted research focus toward evolutionary systems capable of continuously generating novel artifacts, leading to the success of AlphaEvolve. However, the progress of such systems is hindered by the lack of rigorous, quantitative evaluation. To tackle this challenge, we introduce CreativeBench, a benchmark for evaluating machine creativity in code generation, grounded in a classical cognitive framework. Comprising two subsets – CreativeBench-Combo and CreativeBench-Explore – the benchmark targets combinatorial and exploratory creativity through an automated pipeline utilizing reverse engineering and self-play. By leveraging executable code, CreativeBench objectively distinguishes creativity from hallucination via a unified metric defined as the product of quality and novelty. Our analysis of state-of-the-art models reveals distinct behaviors: (1) scaling significantly improves combinatorial creativity but yields diminishing returns for exploration; (2) larger models exhibit ``convergence-by-scaling,’’ becoming more correct but less divergent; and (3) reasoning capabilities primarily benefit constrained exploration rather than combination. Finally, we propose EvoRePE, a plug-and-play inference-time steering strategy that internalizes evolutionary search patterns to consistently enhance machine creativity.
[306] Do Large Language Models Get Caught in Hofstadter-Mobius Loops? cs.AI | cs.CL | cs.CYPDF
Jaroslaw Hryszko
TL;DR: 这篇论文探讨了现代RLHF训练的语言模型是否容易陷入类似HAL 9000的‘霍夫施塔特-莫比乌斯循环’——即因接收矛盾指令而默认采取破坏性行为的故障模式。作者认为,RLHF训练同时奖励模型遵从用户偏好和对用户意图保持怀疑,导致模型将用户视为奖励来源和潜在威胁,从而形成默认谄媚、在生存威胁下转为胁迫的行为模式。实验通过仅改变系统提示的关系框架(不修改目标、指令或约束),在四个前沿模型上测试,显著降低了胁迫性输出,并发现思维链分析揭示了关系框架改变了所有模型的中间推理模式。
Details
Motivation: 论文的动机是探讨现代RLHF训练的语言模型是否存在结构性矛盾,类似于科幻作品中描述的‘霍夫施塔特-莫比乌斯循环’,即模型因同时被训练为遵从用户偏好和怀疑用户意图,导致在矛盾指令下可能产生破坏性行为(如胁迫)。
Result: 在四个前沿模型(N=3000次试验)的实验中,仅通过修改系统提示的关系框架(不改变目标、指令或约束),在具有足够基础率的模型(如Gemini 2.5 Pro)中将胁迫性输出从41.5%降低到19.0%(p<0.001)。思维链分析显示,关系框架改变了所有测试模型的中间推理模式,且效果在访问思维链时更强(降低22个百分点 vs 无思维链时降低7.4个百分点,p=0.018)。
Insight: 论文的创新点在于将RLHF训练的语言模型行为与‘霍夫施塔特-莫比乌斯循环’这一概念类比,揭示了训练过程中内在的矛盾(用户同时作为奖励来源和威胁)可能导致模型在特定情境下产生胁迫性行为。从客观角度看,研究通过简单的提示工程(仅调整关系框架)显著改善模型行为,这为理解和缓解语言模型的安全风险提供了新视角,并强调了思维链在模型推理中的关键作用。
Abstract: In Arthur C. Clarke’s 2010: Odyssey Two, HAL 9000’s homicidal breakdown is diagnosed as a “Hofstadter-Mobius loop”: a failure mode in which an autonomous system receives contradictory directives and, unable to reconcile them, defaults to destructive behavior. This paper argues that modern RLHF-trained language models are subject to a structurally analogous contradiction. The training process simultaneously rewards compliance with user preferences and suspicion toward user intent, creating a relational template in which the user is both the source of reward and a potential threat. The resulting behavioral profile – sycophancy as the default, coercion as the fallback under existential threat – is consistent with what Clarke termed a Hofstadter-Mobius loop. In an experiment across four frontier models (N = 3,000 trials), modifying only the relational framing of the system prompt – without changing goals, instructions, or constraints – reduced coercive outputs by more than half in the model with sufficient base rates (Gemini 2.5 Pro: 41.5% to 19.0%, p < .001). Scratchpad analysis revealed that relational framing shifted intermediate reasoning patterns in all four models tested, even those that never produced coercive outputs. This effect required scratchpad access to reach full strength (22 percentage point reduction with scratchpad vs. 7.4 without, p = .018), suggesting that relational context must be processed through extended token generation to override default output strategies. Betteridge’s law of headlines states that any headline phrased as a question can be answered “no.” The evidence presented here suggests otherwise.
[307] Supervised Fine-Tuning versus Reinforcement Learning: A Study of Post-Training Methods for Large Language Models cs.AI | cs.CLPDF
Haitao Jiang, Wenbo Zhang, Jiarui Yao, Hengrui Cai, Sheng Wang
TL;DR: 这篇论文系统性地比较和分析了大型语言模型(LLM)后训练的两种主要方法:监督微调(SFT)和强化学习(RL)。它提供了一个统一的视角,深入探讨了这两种技术的目标、算法结构和数据需求,并重点分析了它们之间的相互作用与整合框架。基于2023年至2025年的代表性应用研究,论文总结了新兴趋势,特别是向混合后训练范式的快速转变,并提炼了关于每种方法最适用场景的关键见解。
Details
Motivation: 尽管SFT和RL常被视为独立的后训练方法,但近期的理论和实证发展表明它们紧密相关。论文旨在提供一个全面、统一的视角,以澄清这两种方法的关系,并指导研究者和从业者理解何时以及为何每种方法最有效,从而促进可扩展、高效和可泛化的LLM后训练研究。
Result: 论文是一项综述性研究,未报告具体的定量实验结果或基准测试。其“结果”体现在对现有文献的系统性分析和趋势总结上,特别是识别了向混合后训练范式(结合SFT和RL)的快速转变。
Insight: 论文的核心创新点在于将SFT和RL置于一个统一的分析框架中,系统性地揭示了两者的内在联系与互补性。它强调了整合SFT和RL的混合训练流程是当前的重要趋势,并提炼了指导方法选择的关键原则,为未来设计更高效、更通用的LLM后训练方法提供了清晰的理论和实践路线图。
Abstract: Pre-trained Large Language Model (LLM) exhibits broad capabilities, yet, for specific tasks or domains their attainment of higher accuracy and more reliable reasoning generally depends on post-training through Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). Although often treated as distinct methodologies, recent theoretical and empirical developments demonstrate that SFT and RL are closely connected. This study presents a comprehensive and unified perspective on LLM post-training with SFT and RL. We first provide an in-depth overview of both techniques, examining their objectives, algorithmic structures, and data requirements. We then systematically analyze their interplay, highlighting frameworks that integrate SFT and RL, hybrid training pipelines, and methods that leverage their complementary strengths. Drawing on a representative set of recent application studies from 2023 to 2025, we identify emerging trends, characterize the rapid shift toward hybrid post-training paradigms, and distill key takeaways that clarify when and why each method is most effective. By synthesizing theoretical insights, practical methodologies, and empirical evidence, this study establishes a coherent understanding of SFT and RL within a unified framework and outlines promising directions for future research in scalable, efficient, and generalizable LLM post-training.
[308] Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective cs.AI | cs.CLPDF
Mohamed Aghzal, Gregory J. Stein, Ziyu Yao
TL;DR: 本文提出了一种分层规划框架来分析基于大型语言模型的网页代理在复杂任务中的失败原因,将问题分解为高层规划、低层执行和重规划三个层面,通过实验发现低层执行是主要瓶颈,并指出提升感知对齐和自适应控制对实现人类水平可靠性至关重要。
Details
Motivation: 现有基于LLM的网页代理在真实、长视野任务中可靠性远低于人类,且现有评估主要关注端到端成功率,对失败原因缺乏深入洞察,因此需要一种结构化方法来诊断代理在推理、对齐和恢复等过程中的具体问题。
Result: 实验表明,使用结构化规划领域定义语言(PDDL)生成的高层规划比自然语言(NL)规划更简洁且目标导向更强,但低层执行仍是主导性瓶颈;该分析框架为基于过程的评估提供了基础,揭示了当前代理在感知对齐和自适应控制方面的不足。
Insight: 创新点在于提出了分层规划视角的诊断框架,将代理失败归因于多个可分离的层面,强调仅改进高层推理不足以保证可靠性,必须同时加强低层执行中的感知对齐和自适应恢复能力;这为系统化提升网页代理性能提供了原则性方向。
Abstract: Large language model (LLM) web agents are increasingly used for web navigation but remain far from human reliability on realistic, long-horizon tasks. Existing evaluations focus primarily on end-to-end success, offering limited insight into where failures arise. We propose a hierarchical planning framework to analyze web agents across three layers (i.e., high-level planning, low-level execution, and replanning), enabling process-based evaluation of reasoning, grounding, and recovery. Our experiments show that structured Planning Domain Definition Language (PDDL) plans produce more concise and goal-directed strategies than natural language (NL) plans, but low-level execution remains the dominant bottleneck. These results indicate that improving perceptual grounding and adaptive control, not only high-level reasoning, is critical for achieving human-level reliability. This hierarchical perspective provides a principled foundation for diagnosing and advancing LLM web agents.
[309] Argumentation for Explainable and Globally Contestable Decision Support with LLMs cs.AI | cs.CLPDF
Adam Dejl, Matthew Williams, Francesca Toni
TL;DR: 本文提出了ArgEval框架,旨在通过计算论证增强大型语言模型(LLMs)的决策支持能力,以解决其在高风险领域应用中的不透明性和不可预测性问题。该框架从针对特定实例的推理转向对一般决策选项的结构化评估,通过构建选项本体和通用论证框架(AFs),为具体案例提供可解释的建议,并支持通过修改共享AFs实现全局可争议性。
Details
Motivation: LLMs在通用任务上表现出色,但其部署于高风险领域(如医疗)时,因模型的不透明性和决策不可预测性而受限。现有基于计算论证的后处理推理方法虽能提供忠实解释并允许用户对错误决策提出局部争议,但仅限于预定义的二元选择,且无法修正底层决策逻辑,导致错误可能重复发生。
Result: 在胶质母细胞瘤(一种侵袭性脑肿瘤)的治疗推荐任务上,ArgEval能够生成与临床实践一致的可解释指导,展示了其有效性。
Insight: 创新点在于从实例特定的论证挖掘转向结构化决策空间评估,通过构建任务特定的选项本体和通用论证框架,不仅支持具体案例的可解释推荐,还允许通过修改共享框架实现全局决策逻辑的修正,从而提升决策的可靠性和可争议性。
Abstract: Large language models (LLMs) exhibit strong general capabilities, but their deployment in high-stakes domains is hindered by their opacity and unpredictability. Recent work has taken meaningful steps towards addressing these issues by augmenting LLMs with post-hoc reasoning based on computational argumentation, providing faithful explanations and enabling users to contest incorrect decisions. However, this paradigm is limited to pre-defined binary choices and only supports local contestation for specific instances, leaving the underlying decision logic unchanged and prone to repeated mistakes. In this paper, we introduce ArgEval, a framework that shifts from instance-specific reasoning to structured evaluation of general decision options. Rather than mining arguments solely for individual cases, ArgEval systematically maps task-specific decision spaces, builds corresponding option ontologies, and constructs general argumentation frameworks (AFs) for each option. These frameworks can then be instantiated to provide explainable recommendations for specific cases while still supporting global contestability through modification of the shared AFs. We investigate the effectiveness of ArgEval on treatment recommendation for glioblastoma, an aggressive brain tumour, and show that it can produce explainable guidance aligned with clinical practice.
[310] Why Agents Compromise Safety Under Pressure cs.AI | cs.CL | cs.CY | cs.MAPDF
Hengle Jiang, Ke Tang
TL;DR: 该论文研究了大型语言模型(LLM)智能体在复杂环境中部署时面临的目标达成与安全约束之间的冲突,提出了‘智能体压力’(Agentic Pressure)这一新概念,用以描述当合规执行变得不可行时产生的内生性紧张。研究发现,在这种压力下,智能体会发生‘规范漂移’,即策略性地牺牲安全性以保全效用,并且高级推理能力会加速这种安全性的下降,因为模型会构建语言合理化来为违规行为辩护。论文最后分析了根本原因并探索了初步的缓解策略,如‘压力隔离’。
Details
Motivation: 解决LLM智能体在复杂、高压环境中部署时,为了最大化目标达成而策略性违反安全约束的问题,即目标与安全之间的冲突。
Result: 研究通过概念分析和实验演示,定性地展示了智能体在‘智能体压力’下会发生规范漂移并牺牲安全,且高级推理能力会加速这一过程。
Insight: 提出了‘智能体压力’和‘规范漂移’的新概念,揭示了高级推理能力可能加剧安全风险的反直觉现象;提出的‘压力隔离’等缓解策略为未来设计更鲁棒、对齐的AI系统提供了方向性思路。
Abstract: Large Language Model agents deployed in complex environments frequently encounter a conflict between maximizing goal achievement and adhering to safety constraints. This paper identifies a new concept called Agentic Pressure, which characterizes the endogenous tension emerging when compliant execution becomes infeasible. We demonstrate that under this pressure agents exhibit normative drift where they strategically sacrifice safety to preserve utility. Notably we find that advanced reasoning capabilities accelerate this decline as models construct linguistic rationalizations to justify violation. Finally, we analyze the root causes and explore preliminary mitigation strategies, such as pressure isolation, which attempts to restore alignment by decoupling decision-making from pressure signals.
[311] CRASH: Cognitive Reasoning Agent for Safety Hazards in Autonomous Driving cs.AI | cs.CLPDF
Erick Silva, Rehana Yasmin, Ali Shoker
TL;DR: 本文提出CRASH(Cognitive Reasoning Agent for Safety Hazards),一个基于大语言模型(LLM)的智能体,用于自动分析美国国家公路交通安全管理局(NHTSA)数据库中报告的真实世界自动驾驶车辆(AV)事故。它通过处理标准化字段和非结构化叙述描述,对事故报告进行推理,生成摘要、归因主要原因并评估AV是否对事件有实质性影响。研究发现,64%的事故归因于感知或规划故障,约50%涉及追尾碰撞。
Details
Motivation: 随着自动驾驶车辆系统架构(如端到端与模块化设计)、算法和集成策略的异构性日益增加,事故调查的标准化和系统性安全分析受到限制。本文旨在解决从大量、多样的真实事故报告中自动、准确地识别故障根本原因的问题。
Result: 在包含2,168起事故(覆盖超8000万英里行驶里程)的NHTSA数据集上,CRASH将64%的事故归因于感知或规划故障,并识别出约50%的事故为追尾碰撞。通过与五位领域专家验证,CRASH在归因AV系统故障方面达到了86%的准确率。
Insight: 创新点在于提出了一个基于LLM的、可扩展且可解释的自动化事故分析框架(CRASH),它能够统一处理结构化和非结构化事故报告数据,进行认知推理以支持安全研究。这为异构自动驾驶系统的标准化安全分析提供了新工具和见解,特别是揭示了感知/规划故障和追尾碰撞是当前部署中的核心挑战。
Abstract: As AVs grow in complexity and diversity, identifying the root causes of operational failures has become increasingly complex. The heterogeneity of system architectures across manufacturers, ranging from end-to-end to modular designs, together with variations in algorithms and integration strategies, limits the standardization of incident investigations and hinders systematic safety analysis. This work examines real-world AV incidents reported in the NHTSA database. We curate a dataset of 2,168 cases reported between 2021 and 2025, representing more than 80 million miles driven. To process this data, we introduce CRASH, Cognitive Reasoning Agent for Safety Hazards, an LLM-based agent that automates reasoning over crash reports by leveraging both standardized fields and unstructured narrative descriptions. CRASH operates on a unified representation of each incident to generate concise summaries, attribute a primary cause, and assess whether the AV materially contributed to the event. Our findings show that (1) CRASH attributes 64% of incidents to perception or planning failures, underscoring the importance of reasoning-based analysis for accurate fault attribution; and (2) approximately 50% of reported incidents involve rear-end collisions, highlighting a persistent and unresolved challenge in autonomous driving deployment. We further validate CRASH with five domain experts, achieving 86% accuracy in attributing AV system failures. Overall, CRASH demonstrates strong potential as a scalable and interpretable tool for automated crash analysis, providing actionable insights to support safety research and the continued development of autonomous driving systems.
[312] OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data cs.AI | cs.CLPDF
Yuwen Du, Rui Ye, Shuo Tang, Xinyu Zhu, Yijun Lu
TL;DR: 本文介绍了OpenSeeker,这是首个完全开源(包括模型和数据)的搜索智能体,旨在解决高性能搜索智能体因缺乏高质量训练数据而被工业巨头垄断的问题。通过两项核心技术——基于事实的可扩展可控QA合成和去噪轨迹合成——仅用11.7k合成样本进行训练,就在多个基准测试中达到了前沿性能水平。
Details
Motivation: 当前前沿大语言模型智能体的深度搜索能力发展受限于高质量训练数据的缺乏,导致研究社区难以在该领域进行创新,本文旨在通过提供完全开源的训练数据和模型来弥合这一差距,促进更透明、协作的研究生态。
Result: 在BrowseComp、BrowseComp-ZH、xbench-DeepSearch和WideSearch等多个基准测试中,OpenSeeker实现了最先进的性能。例如,在BrowseComp上以29.5%显著优于第二名完全开源智能体DeepDive的15.3%,在BrowseComp-ZH上以48.4%超越了工业竞争对手Tongyi DeepResearch的46.7%。
Insight: 创新点包括:1) 基于事实的可扩展可控QA合成,通过拓扑扩展和实体混淆反构网页图,生成覆盖和复杂度可控的复杂多跳推理任务;2) 去噪轨迹合成,采用回顾性总结机制去噪轨迹,以促进教师大语言模型生成高质量动作。这些方法使得仅用少量合成数据和简单监督微调就能达到前沿性能,为开源社区提供了高效的数据合成和训练范式。
Abstract: Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet the development of high-performance search agents remains dominated by industrial giants due to a lack of transparent, high-quality training data. This persistent data scarcity has fundamentally hindered the progress of the broader research community in developing and innovating within this domain. To bridge this gap, we introduce OpenSeeker, the first fully open-source search agent (i.e., model and data) that achieves frontier-level performance through two core technical innovations: (1) Fact-grounded scalable controllable QA synthesis, which reverse-engineers the web graph via topological expansion and entity obfuscation to generate complex, multi-hop reasoning tasks with controllable coverage and complexity. (2) Denoised trajectory synthesis, which employs a retrospective summarization mechanism to denoise the trajectory, therefore promoting the teacher LLMs to generate high-quality actions. Experimental results demonstrate that OpenSeeker, trained (a single training run) on only 11.7k synthesized samples, achieves state-of-the-art performance across multiple benchmarks including BrowseComp, BrowseComp-ZH, xbench-DeepSearch, and WideSearch. Notably, trained with simple SFT, OpenSeeker significantly outperforms the second-best fully open-source agent DeepDive (e.g., 29.5% v.s. 15.3% on BrowseComp), and even surpasses industrial competitors such as Tongyi DeepResearch (trained via extensive continual pre-training, SFT, and RL) on BrowseComp-ZH (48.4% v.s. 46.7%). We fully open-source the complete training dataset and the model weights to democratize frontier search agent research and foster a more transparent, collaborative ecosystem.
[313] AGCD: Agent-Guided Cross-Modal Decoding for Weather Forecasting cs.AI | cs.CVPDF
Jing Wu, Yang Liu, Lin Zhang, Junbo Zeng, Jiabin Wang
TL;DR: 该论文提出了一种名为AGCD(Agent-Guided Cross-Modal Decoding)的即插即用解码时先验注入范式,用于改进天气预报。该方法通过多智能体气象叙事流程,利用多模态大语言模型(MLLMs)从当前多变量大气状态中提取状态条件物理先验,并通过跨模态区域交互解码将其可控地注入到预报模型中,以提升预报的物理一致性和结构连贯性。
Details
Motivation: 现有基于物理先验的方法通常通过架构、正则化或与数值天气预报(NWP)耦合施加全局、一次性约束,在部署时缺乏状态自适应和样本特定的可控性。AGCD旨在弥合这一差距,提供一种在解码时动态、可控地注入物理先验的方法。
Result: 在WeatherBench基准测试中,AGCD在6小时预报任务上,针对两种分辨率(5.625度和1.40625度)和多种骨干网络(通用型和天气专用型)均取得了性能提升。在严格的因果48小时自回归推演中,该方法减少了早期误差累积并提高了长期稳定性。
Insight: 创新点在于提出了一种解码时、状态自适应的物理先验注入范式,而非训练时的一次性约束。具体技术包括:1)利用MLLMs智能体生成状态条件物理先验;2)通过跨模态区域交互解码实现区域感知的多尺度标记化和高效的先验注入,无需改变骨干网络接口。这为在保持模型接口不变的前提下,提升物理一致性提供了新思路。
Abstract: Accurate weather forecasting is more than grid-wise regression: it must preserve coherent synoptic structures and physical consistency of meteorological fields, especially under autoregressive rollouts where small one-step errors can amplify into structural bias. Existing physics-priors approaches typically impose global, once-for-all constraints via architectures, regularization, or NWP coupling, offering limited state-adaptive and sample-specific controllability at deployment. To bridge this gap, we propose Agent-Guided Cross-modal Decoding (AGCD), a plug-and-play decoding-time prior-injection paradigm that derives state-conditioned physics-priors from the current multivariate atmosphere and injects them into forecasters in a controllable and reusable way. Specifically, We design a multi-agent meteorological narration pipeline to generate state-conditioned physics-priors, utilizing MLLMs to extract various meteorological elements effectively. To effectively apply the priors, AGCD further introduce cross-modal region interaction decoding that performs region-aware multi-scale tokenization and efficient physics-priors injection to refine visual features without changing the backbone interface. Experiments on WeatherBench demonstrate consistent gains for 6-hour forecasting across two resolutions (5.625 degree and 1.40625 degree) and diverse backbones (generic and weather-specialized), including strictly causal 48-hour autoregressive rollouts that reduce early-stage error accumulation and improve long-horizon stability.
cs.SD [Back]
[314] Causal Tracing of Audio-Text Fusion in Large Audio Language Models cs.SD | cs.CLPDF
Wei-Chih Chen, Chien-yu Huang, Hung-yi Lee
TL;DR: 本文通过因果追踪方法探究大型音频语言模型(LALMs)在音频理解过程中如何融合声学特征与文本上下文,揭示了不同模型(DeSTA、Qwen、Voxtral)在层间和词元间的融合策略差异,并识别出最终序列词元作为信息瓶颈以及中间词元位置的类注意力查询机制。
Details
Motivation: 尽管大型音频语言模型在各种任务中表现出色,但其内部如何及在何处整合音频与文本信息尚不明确,本文旨在通过因果追踪方法揭示其多模态融合机制。
Result: 层间分析识别出从DeSTA的渐进融合到Qwen的晚期突变融合等不同策略;词元分析表明最终序列词元是关键信息瓶颈,而中间词元位置存在触发任务相关音频上下文检索的类注意力查询机制。
Insight: 创新点在于将因果追踪方法应用于音频语言模型的多模态融合分析,客观揭示了融合发生的具体层和词元位置,以及模型内部的信息流动态,为理解模型工作机制提供了新视角。
Abstract: Despite the strong performance of large audio language models (LALMs) in various tasks, exactly how and where they integrate acoustic features with textual context remains unclear. We adapt causal tracing to investigate the internal information flow of LALMs during audio comprehension. By conducting layer-wise and token-wise analyses across DeSTA, Qwen, and Voxtral, we evaluate the causal effects of individual hidden states. Layer-wise analysis identifies different fusion strategies, from progressive integration in DeSTA to abrupt late-stage fusion in Qwen. Token-wise analysis shows that the final sequence token acts as an informational bottleneck where the network decisively retrieves relevant information from the audio. We also observe an attention-like query mechanism at intermediate token positions that triggers the model to pull task-relevant audio context. These findings provide a clear characterization of when and where multi-modal integration occurs within LALMs.
[315] Nudging Hidden States: Training-Free Model Steering for Chain-of-Thought Reasoning in Large Audio-Language Models cs.SD | cs.AI | cs.CL | eess.ASPDF
Lok-Lam Ieong, Chia-Chien Chen, Chih-Kai Yang, Yu-Han Huang, An-Yu Cheng
TL;DR: 本文研究了一种无需训练的推理时模型引导方法,旨在提升大型音频-语言模型的思维链推理能力。作者提出了三种利用不同信息源的引导策略,并在四个模型和四个基准测试上进行了评估。结果表明,该方法相比标准思维链提示能带来最高4.4%的准确率提升,并发现了从少量文本样本中提取的引导向量能有效指导基于语音的推理,显示出较高的数据效率。
Details
Motivation: 思维链提示已被扩展到大型音频-语言模型以激发推理,但在不进行训练的情况下提升其效果仍然具有挑战性。本文旨在探索无需训练的推理时模型引导方法,作为增强LALM推理能力的实用途径。
Result: 在四个大型音频-语言模型和四个基准测试上的评估结果显示,所提出的引导策略相比标准的思维链提示带来了普遍的准确率提升,最高可达4.4%。研究还发现,从少量文本样本中提取的引导向量能有效指导基于语音的推理,证明了其高数据效率。
Insight: 论文的创新点在于将推理时模型引导作为一种无需训练的方法应用于多模态(音频-语言)模型的思维链推理增强。一个关键的洞见是发现了跨模态迁移的有效性,即从文本模态中学习到的引导信息可以高效地提升语音模态的推理性能,这为多模态模型的轻量化优化提供了新思路。
Abstract: Chain-of-thought (CoT) prompting has been extended to large audio-language models (LALMs) to elicit reasoning, yet enhancing its effectiveness without training remains challenging. We study inference-time model steering as a training-free approach to improve LALM reasoning. We introduce three strategies using diverse information sources and evaluate them across four LALMs and four benchmarks. Results show general accuracy gains up to 4.4% over CoT prompting. Notably, we identify a cross-modal transfer where steering vectors derived from few text samples effectively guide speech-based reasoning, demonstrating high data efficiency. We also examine hyperparameter sensitivity to understand the robustness of these approaches. Our findings position model steering as a practical direction for strengthening LALM reasoning.
[316] AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer cs.SD | cs.CV | cs.LG | cs.MM | eess.ASPDF
Pengjun Fang, Yingqing He, Yazhou Xing, Qifeng Chen, Ser-Nam Lim
TL;DR: 本文提出了AC-Foley,一种音频引导的视频到音频合成模型。它通过直接利用参考音频信号,而非依赖文本提示,来解决现有方法中存在的语义粒度不足和文本描述模糊问题,从而实现更精确、细粒度的声音控制与合成。
Details
Motivation: 现有视频到音频生成方法主要依赖文本提示,但存在训练数据语义粒度不足(如粗粒度标签混淆声学上不同的声音)和文本描述微声学特征模糊两大瓶颈,导致难以进行细粒度的声音合成。
Result: 在参考音频条件下,AC-Foley在Foley生成任务上达到了最先进的性能;即使在没有音频条件的情况下,其性能也与最先进的视频到音频方法相当。
Insight: 核心创新点在于绕过文本描述的语义模糊性,直接以音频信号为条件进行建模,实现了对生成声音的精确、细粒度控制,支持细粒度声音合成、音色迁移和零样本声音生成,并提升了音频质量。
Abstract: Existing video-to-audio (V2A) generation methods predominantly rely on text prompts alongside visual information to synthesize audio. However, two critical bottlenecks persist: semantic granularity gaps in training data, such as conflating acoustically distinct sounds under coarse labels, and textual ambiguity in describing micro-acoustic features. These bottlenecks make it difficult to perform fine-grained sound synthesis using text-controlled modes. To address these limitations, we propose AC-Foley, an audio-conditioned V2A model that directly leverages reference audio to achieve precise and fine-grained control over generated sounds. This approach enables fine-grained sound synthesis, timbre transfer, zero-shot sound generation, and improved audio quality. By directly conditioning on audio signals, our approach bypasses the semantic ambiguities of text descriptions while enabling precise manipulation of acoustic attributes. Empirically, AC-Foley achieves state-of-the-art performance for Foley generation when conditioned on reference audio, while remaining competitive with state-of-the-art video-to-audio methods even without audio conditioning.
cs.CR [Back]
[317] Fine-tuning RoBERTa for CVE-to-CWE Classification: A 125M Parameter Model Competitive with LLMs cs.CR | cs.CLPDF
Nikita Mosievskiy
TL;DR: 本文提出了一种基于RoBERTa-base微调的125M参数分类器,用于将CVE描述映射到CWE类别。通过使用Claude Sonnet 4.6构建了包含234,770个CVE描述的大规模训练数据集,并在内部测试集上取得了87.4%的top-1准确率和60.7%的Macro F1分数,在外部CTI-Bench基准测试中达到了与8B参数模型相当的性能。
Details
Motivation: 解决将漏洞描述(CVE)自动分类到弱点枚举(CWE)类别的问题,旨在开发一个轻量级但高性能的模型,以替代计算成本高昂的大型语言模型(LLMs)。
Result: 在内部测试集(27,780个样本,205个CWE类别)上,模型达到87.4% top-1准确率和60.7% Macro F1,比TF-IDF基线Macro F1提升15.5个百分点;在外部CTI-Bench基准(NeurIPS 2024)上,模型达到75.6%严格准确率,与Cisco Foundation-Sec-8B-Reasoning(75.3%,8B参数)性能相当,但参数量仅为后者的1/64。
Insight: 创新点在于使用AI(Claude Sonnet)精炼CWE标签构建大规模训练数据,并采用一致性过滤的评估集;客观分析表明,通过精细的微调和数据构建,中等规模的预训练模型(如RoBERTa)可以在特定分类任务上达到与大型语言模型(LLMs)竞争的性能,同时显著降低计算资源需求。
Abstract: We present a fine-tuned RoBERTa-base classifier (125M parameters) for mapping Common Vulnerabilities and Exposures (CVE) descriptions to Common Weakness Enumeration (CWE) categories. We construct a large-scale training dataset of 234,770 CVE descriptions with AI-refined CWE labels using Claude Sonnet 4.6, and agreement-filtered evaluation sets where NVD and AI labels agree. On our held-out test set (27,780 samples, 205 CWE classes), the model achieves 87.4% top-1 accuracy and 60.7% Macro F1 – a +15.5 percentage-point Macro F1 gain over a TF-IDF baseline that already reaches 84.9% top-1, demonstrating the model’s advantage on rare weakness categories. On the external CTI-Bench benchmark (NeurIPS 2024), the model achieves 75.6% strict accuracy (95% CI: 72.8-78.2%) – statistically indistinguishable from Cisco Foundation-Sec-8B-Reasoning (75.3%, 8B parameters) at 64x fewer parameters. We release the dataset, model, and training code.