Table of Contents

cs.CL

[1] Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs cs.CL | cs.ET | cs.LG

Yiming Huang, Zhenbo Shi, Xin-Cheng Wen, Jichuan Zeng, Cuiyun Gao

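TL;DR: This paper proposes FREIA, an unsupervised reinforcement learning algorithm that targets the policy-optimization drift large language models suffer in unsupervised reasoning tasks when no ground-truth supervision is available. The algorithm designs an adaptive reward mechanism based on the Free Energy Principle and combines it with adaptive advantage shaping, so that training better tracks the model's evolving reasoning ability.

Details

Motivation: Existing unsupervised RL methods struggle to adapt to the model's changing reasoning ability during training, so without ground-truth supervision the policy optimization can drift in the wrong direction.

Result: Experiments on nine datasets across three reasoning tasks show that FREIA outperforms other unsupervised RL baselines. In mathematical reasoning with the DeepSeek-R1-Distill-Qwen-1.5B model, FREIA's Pass@1 is on average 0.5 to 3.5 points higher than that of other methods.

Insight: The paper's novelty lies in an adaptive reward mechanism based on the Free Energy Principle together with adaptive advantage shaping. This offers a new adaptive-optimization approach for applying unsupervised RL to LLM reasoning tasks and helps the model improve itself more stably during training.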

Abstract: Unsupervised reinforcement learning (RL) has emerged as a promising paradigm for enabling self-improvement in large language models (LLMs). However, existing unsupervised RL-based methods often lack the capacity to adapt to the model’s evolving reasoning capabilities during training. Therefore, these methods can misdirect policy optimization in the absence of ground-truth supervision. To address this issue, we introduce FREIA, a novel RL-based algorithm built on two key innovations: (1) Free Energy-Driven Reward (FER) adapts rewards to balance consensus and exploration based on the Free Energy Principle. (2) Adaptive Advantage Shaping (AAS) adaptively adjusts learning signals based on the statistical characteristics of sampled rewards. Empirical evaluations on nine datasets across three reasoning tasks showcase that FREIA outperforms other unsupervised RL-based baselines. Notably, in mathematical reasoning tasks, FREIA surpasses other methods by an average of 0.5 to 3.5 points in Pass@1 using the DeepSeek-R1-Distill-Qwen-1.5B model.


[2] Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning cs.CL | cs.ET | cs.LG

Yiming Huang, Zhenbo Shi, Shuzheng Gao, Cuiyun Gao, Peiyi Han

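TL;DR: This paper proposes Adaptive Power-Mean Policy Optimization (APMPO) to improve the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm and strengthen the reasoning ability of large language models (LLMs). The method has two core innovations: Power-Mean Policy Optimization (PMPO) and Feedback-Adaptive Clipping (FAC). PMPO uses a generalized power-mean objective that lets the model transition adaptively from the signal-amplifying behavior of the arithmetic mean to the consistency-enforcing behavior of the geometric mean. FAC adjusts the clipping bounds adaptively from real-time reward statistics, overcoming the limitations of static mechanisms.

Details

Motivation: Existing RLVR methods usually rely on static policy optimization schemes, which do not match the model's evolving reasoning ability. To resolve this mismatch, the paper aims to develop a policy optimization method that adapts as model capability changes.

Result: Extensive experiments on nine datasets across three reasoning tasks show that APMPO outperforms state-of-the-art RLVR baselines. For example, with Qwen2.5-3B-Instruct, APMPO raises the average Pass@1 score on mathematical reasoning benchmarks by 3.0 points over GRPO, setting a new state of the art.

Insight: The paper's claimed innovation is the pair of adaptive components, PMPO and FAC. Viewed objectively, the core novelty is generalizing the policy optimization objective from a fixed mean (arithmetic or geometric) to an adaptively adjustable power mean, and dynamically adjusting the training-stability mechanism (the clipping bounds) from real-time feedback. Together these provide a new optimization framework for adapting to the evolving reasoning ability of LLMs.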

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is an essential paradigm that enhances the reasoning capabilities of Large Language Models (LLMs). However, existing methods typically rely on static policy optimization schemes that misalign with the model’s evolving reasoning capabilities. To address this issue, we propose Adaptive Power-Mean Policy Optimization (APMPO), which comprises two main innovations: Power-Mean Policy Optimization (PMPO) and Feedback-Adaptive Clipping (FAC). Specifically, PMPO introduces a generalized power-mean objective. This enables the model to adaptively transition from the signal-amplifying behavior of the arithmetic mean to the consistency-enforcing behavior of the geometric mean. FAC adaptively adjusts clipping bounds based on real-time reward statistics to overcome the limitations of static mechanisms. Capitalizing on these innovations, APMPO improves learning dynamics and reasoning performance. Extensive experiments on nine datasets across three reasoning tasks showcase the superiority of APMPO over state-of-the-art RLVR-based baselines. For instance, APMPO boosts the average Pass@1 score on mathematical reasoning benchmarks by 3.0 points compared to GRPO when using Qwen2.5-3B-Instruct.
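
The generalized power mean that PMPO builds on can be illustrated with a minimal sketch. The abstract does not say which quantities APMPO averages or how it schedules the exponent, so the `ratios` values below are purely hypothetical; the sketch only shows the standard fact that exponent p = 1 gives the arithmetic mean and the limit p → 0 gives the geometric mean.

```python
import math


def power_mean(values, p):
    """Generalized power mean of positive values.

    p = 1 is the arithmetic mean; the limit p -> 0 is the geometric mean.
    """
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be a non-empty list of positive numbers")
    n = len(values)
    if abs(p) < 1e-12:
        # Limit case: geometric mean, computed in log space.
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


# Hypothetical inputs, chosen only for illustration (not from the paper).
ratios = [0.5, 1.0, 2.0]
arithmetic = power_mean(ratios, 1.0)
halfway = power_mean(ratios, 0.5)
geometric = power_mean(ratios, 0.0)
```

Because the power mean is non-decreasing in p, lowering the exponent from 1 toward 0 moves the aggregate smoothly from the arithmetic mean to the geometric mean, which damps the influence of large outliers.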


[3] MedFabric and EtHER: A Data-Centric Framework for Word-Level Fabrication Generation and Detection in Medical LLMs cs.CL | cs.AI

Tung Sum Thomas Kwok, Qian Qian, Xiaofeng Lin, Dongxu Zhang, Jun Han

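TL;DR: This paper proposes a data-centric framework for generating and detecting word-level fabrications in medical large language models. The framework comprises the MedFabric dataset-generation pipeline and the ETHER detector, and aims to address the shortcomings of existing medical hallucination datasets in coverage, stylistic disparity, and distributional drift, thereby improving detection performance.

Details

Motivation: Large language models produce fabrications (factually incorrect yet fluent statements) in domains that require expert knowledge, such as medicine, and existing medical hallucination datasets do not capture this phenomenon adequately.

Result: On word-level fabrication benchmarks, the MedFabric framework lifts detector performance more than 15% above existing state-of-the-art methods while maintaining consistent performance across structural similarities.

Insight: The novelty is a data-centric pipeline that generates realistic word-level fabrications with syntactic and stylistic fidelity, plus a modular word-level fabrication detector that integrates Text2Table Decomposition, Word Masking and Filling, and Hybrid Sentence Pair Evaluation to strengthen factual alignment. Viewed objectively, the work underlines how much high-quality, domain-specific synthetic data matters for detector performance.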

Abstract: Large Language Models exhibit strong reasoning and semantic understanding capabilities but often hallucinate in domains that require expert knowledge, among which fabrications, the generation of factually incorrect yet fluent statements, pose the greatest risk in medical contexts. Existing medical hallucination datasets inadequately capture fabrication phenomena due to limited fabrication coverage, stylistic disparities between human and LLM-authored texts, and distributional drift during hallucinated sample synthesis. To address this, we propose a data-centric pipeline to generate realistic and word-level fabrications that preserve syntactic and stylistic fidelity while introducing subtle factual deviations, resulting in MedFabric. Building upon this dataset, we introduce ETHER, a modular word-level fabrication detector integrating Text2Table Decomposition, Word Masking and Filling and Hybrid Sentence Pair Evaluation to enhance factual alignment. Empirical results demonstrate that MedFabric outperforms state-of-the-art detectors by over 15% on word-level fabrication benchmarks while maintaining consistent performance across structural similarities, offering a comprehensive framework for reliable and domain-specific factuality detection.


[4] Material Database Agent: A Multimodal Agentic Framework for Scientific Literature Mining cs.CL

Achuth Chandrasekhar, Omid Barati Farimani, Radheesh Sharma Meda, Amir Barati Farimani

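TL;DR: This paper proposes Material Database Agent (MDA), a multimodal agentic framework for automatically mining scientific literature and building materials science databases. The system takes article PDFs as input, processes text and figures in parallel, and uses multiple cooperating sub-agents to consolidate unstructured information into a structured tabular database.

Details

Motivation: Materials science workflows depend heavily on structured and unstructured data from a vast body of scientific literature, but experimental details are usually buried in text, tables, graphs, and figures, which makes database construction manual, time-consuming, and hard to scale. Advances in multimodal large language models make efficient and accurate extraction feasible.

Result: The abstract reports no specific quantitative results or benchmarks. It claims that MDA can extract information from text and scientific figures with high speed and accuracy, which makes building production-scale databases feasible.

Insight: The core novelty is a modular, multi-agent system architecture dedicated to converting materials science literature into databases. Unlike rule-based or single-pass pipelines, it relies on parallel processing and agent collaboration to improve efficiency and scalability, laying groundwork for building next-generation scientific databases with multimodal agentic information extraction.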

Abstract: Materials science workflows rely on structured and unstructured data from the vast body of available scientific literature. However, most of the experimental details remain buried in text, tables, graphs and figures. Thus, constructing databases that incorporate this data is a manual, time-consuming, and hard-to-scale process. Multimodal large language models have made it feasible to extract information from text and scientific figures with high speed and accuracy. This opens the possibility of an AI system that can create production-scale material databases. Material Database Agent (MDA) is a modular, multi-agent system architecture for converting research literature into structured databases. MDA accepts article PDFs as input, which are subsequently processed in parallel into markdown files and figures. Multiple sub-agents read these markdown files and figures in parallel to assemble sub-databases for each paper. These sub-databases are then compiled into a single tabular database by an agent. As opposed to using either a rule-based approach or a single-pass pipeline for extracting information, MDA is a specialized architecture for transforming the literature into a database in the field of materials science. More generally, this study provides a basis for positioning multimodal agentic information extraction as a viable means for constructing next-generation scientific databases from the primary literature.


[5] NoisyCausal: A Benchmark for Evaluating Causal Reasoning Under Structured Noise cs.CL | cs.AI

Zhi Xu, Yun Fu

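TL;DR: This paper introduces the NoisyCausal benchmark for evaluating causal reasoning under structured noise, and designs a modular reasoning framework that combines large language models with explicit causal structure to make causal reasoning more robust and interpretable under noise.

Details

Motivation: In natural-language causal reasoning, large language models struggle to separate correlation from causation, and they perform especially poorly when observations are noisy, incomplete, or mixed with irrelevant information. Dedicated benchmarks and methods are therefore needed to evaluate and improve their causal reasoning.

Result: Experiments show that the proposed method significantly outperforms standard prompting and reasoning baselines on NoisyCausal, and that it generalizes well to the external benchmark Cladder without task-specific tuning.

Insight: The novelty is a structured-noise benchmark built by injecting controllable noise (irrelevant distractors, value perturbations, confounding, and partial observability), together with a framework that guides the LLM to reason in a structured way over a symbolic causal graph. Combining causal abstraction with language-driven reasoning yields more faithful and robust causal understanding.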

Abstract: Causal reasoning in natural language requires identifying relevant variables, understanding their interactions, and reasoning about effects and interventions, often under noisy or ambiguous conditions. While large language models (LLMs) exhibit strong general reasoning abilities, they struggle to disentangle correlation from causation, particularly when observations are partially incorrect or irrelevant information is present. In this work, we introduce NoisyCausal, a new benchmark designed to evaluate causal reasoning under structured noise. Each instance is generated from a ground-truth causal graph and contextualized with a natural language scenario by injecting controllable forms of noise, such as irrelevant distractors, value perturbations, confounding, and partial observability. Moreover, we propose a modular reasoning framework that combines LLMs with explicit causal structure to address these challenges. Our method prompts the LLM to extract variables, construct a causal graph from context, and then reformulates the reasoning task as a structured prompt grounded in this graph. Rather than relying on statistical patterns alone, the LLM is guided by symbolic structure, enabling more interpretable and robust inference. Experimental results show that our method significantly outperforms standard prompting and reasoning baselines on NoisyCausal. Furthermore, it generalizes well to external benchmarks such as Cladder without task-specific tuning. Our findings highlight the importance of combining causal abstractions with language-driven reasoning to achieve faithful and robust causal understanding in LLMs.


[6] GEM: Graph-Enhanced Mixture-of-Experts with ReAct Agents for Dialogue State Tracking cs.CL | cs.AI

Ziqi Zhu, Adithya Suresh, Tomal Deb, Iman Abbasnejad

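TL;DR: This paper proposes GEM, a dialogue state tracking method that combines a graph-enhanced mixture-of-experts model with ReAct agents to address the weakness of large language models on structured information extraction. A dynamic routing mechanism coordinates a graph neural network and a fine-tuned T5-Small model, and ReAct agents handle complex reasoning. The method achieves state-of-the-art performance on the MultiWOZ 2.2 dataset.

Details

Motivation: Despite their strong general capabilities, large language models have difficulty precisely extracting structured information from multi-domain conversations in dialogue state tracking.

Result: On MultiWOZ 2.2, GEM reaches 65.19% Joint Goal Accuracy, substantially outperforming end-to-end LLM approaches (best: 38.43%) and surpassing existing state-of-the-art methods including TOATOD (63.79%), D3ST (58.70%), and Diable (56.48%).

Insight: The novelty is combining structured dialogue representation (a graph neural network), dynamic expert routing (mixture of experts), and agent-based reasoning (ReAct) into an efficient and accurate paradigm for dialogue state tracking, with selective expert activation preserving computational efficiency.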

Abstract: Dialogue State Tracking (DST) requires precise extraction of structured information from multi-domain conversations, a task where Large Language Models (LLMs) struggle despite their impressive general capabilities. We present GEM (Graph-Enhanced Mixture-of-Experts), a novel framework that combines language models and graph-structured dialogue understanding with ReAct agent-based reasoning for superior DST performance. Our approach dynamically routes between specialized experts: a Graph Neural Network that captures dialogue structure and turn-level dependencies, and a finetuned T5-Small encoder-decoder for sequence modeling, coordinated by an intelligent router. For complex value generation tasks, we integrate ReAct agents that perform structured reasoning over dialogue context. On MultiWOZ 2.2, GEM achieves 65.19% Joint Goal Accuracy, substantially outperforming end-to-end LLM approaches (best: 38.43%) and surpassing state-of-the-art (SOTA) methods including TOATOD (63.79%), D3ST (58.70%), and Diable (56.48%). Our graph-enhanced mixture-of-experts architecture with ReAct integration demonstrates that combining structured dialogue representation with dynamic expert routing and agent-based reasoning provides a powerful paradigm for dialogue state tracking, achieving superior accuracy while maintaining computational efficiency through selective expert activation.
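
Joint Goal Accuracy, the metric reported above, counts a dialogue turn as correct only when every slot in the predicted state matches the gold state. The following is a minimal sketch of that definition; the slot names are hypothetical, and it ignores the value normalization and alternative acceptable values that a real MultiWOZ 2.2 evaluation script would have to handle.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose predicted slot-value dict equals the gold dict exactly."""
    if len(predicted_states) != len(gold_states):
        raise ValueError("predicted and gold states must have the same number of turns")
    if not gold_states:
        raise ValueError("at least one turn is required")
    correct = sum(1 for pred, gold in zip(predicted_states, gold_states) if pred == gold)
    return correct / len(gold_states)


# Hypothetical three-turn dialogue; slot names are illustrative only.
gold = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "4", "hotel-parking": "yes"},
]
pred = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "3", "hotel-parking": "yes"},  # one wrong slot
]
jga = joint_goal_accuracy(pred, gold)
```

A single wrong slot makes the whole turn count as incorrect, which is why Joint Goal Accuracy is a much stricter measure than per-slot accuracy.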


[7] SCOUT: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States cs.CL

Zhenliang Zhang, Wenqing Wang, Yong Hu, Yaming Yang, Jiaheng Gao

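TL;DR: This paper proposes SCOUT, a new paradigm for Long-Text Understanding (LTU) based on active information foraging: it treats the document as an explorable environment and reasons from a compact, provenance-grounded epistemic state. Guided by state-level gap diagnosis, the method adaptively alternates between coarse-to-fine exploration and anchored state updates, progressively contracting the epistemic state until it is sufficient for the query. This significantly reduces computational cost while keeping reasoning fidelity high.

Details

Motivation: Existing approaches to million-token LTU face a dilemma. Long-context LLMs that process the text end-to-end are computationally expensive and suffer from attention dilution, while specialized LTU agents often sacrifice reasoning fidelity through task-agnostic abstractions such as graph construction or indexing. The key insight is that query-relevant information is usually sparse relative to the full document, so effective reasoning should rely on a query-sufficient subset rather than the entire context.

Result: Experiments show that SCOUT matches state-of-the-art proprietary models while reducing token consumption by up to 8x. SCOUT also remains stable as context length grows, substantially easing the practical cost-performance trade-off.

Insight: The novelty is recasting LTU from passive processing to active information foraging, introducing the explorable-environment view and reasoning over a provenance-grounded epistemic state. Adaptive exploration and state updates driven by state-level gap diagnosis narrow the reasoning scope efficiently and precisely within sparse relevant information, offering an approach to long-text understanding that balances efficiency and fidelity.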

Abstract: Long-Text Understanding (LTU) at million-token scale requires balancing reasoning fidelity with computational efficiency. Frontier long-context LLMs can process million-token contexts end-to-end, but they suffer from high token consumption and attention dilution. In parallel, specialized LTU agents often sacrifice fidelity through task-agnostic abstractions like graph construction or indexing. We identify a key insight for LTU: query-relevant information is typically sparse relative to the full document, so effective reasoning should rely on a query-sufficient subset rather than the entire context. To address this, we propose SCOUT, a new paradigm for LTU that shifts from passive processing to active information foraging. It treats the document as an explorable environment and answers from a compact, provenance-grounded epistemic state. Guided by state-level gap diagnosis, SCOUT adaptively alternates between coarse-to-fine exploration and anchored state updates that progressively contract its epistemic state toward query sufficiency. Experiments show that SCOUT matches state-of-the-art proprietary models while reducing token consumption by up to 8x. Moreover, SCOUT remains stable as context length scales, substantially alleviating the practical cost-performance trade-off.


[8] CHE-TKG: Collaborative Historical Evidence and Evolutionary Dynamics Learning for Temporal Knowledge Graph Reasoning cs.CL

Shuai-long Lei, Xiaobin Zhu, Jiarui Liang, Guoxi Sun, Zhiyu Fang

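TL;DR: This paper proposes CHE-TKG, a novel collaborative dual-view learning framework for temporal knowledge graph reasoning. The framework explicitly separates and jointly models historical evidence and evolutionary dynamics in order to learn and exploit their complementary predictive signals for forecasting future events.

Details

Motivation: Existing temporal knowledge graph reasoning methods usually focus on only one information source, either historical evidence or evolutionary dynamics, and so fail to exploit the complementary predictive signals of both, which limits predictive power.

Result: Extensive experiments on multiple benchmarks show that CHE-TKG achieves state-of-the-art performance.

Insight: The novelty is explicitly constructing two views, a historical evidence graph and an evolutionary dynamics graph, and using relation decomposition with a contrastive alignment objective to better capture predictive signals across the views. This lets the model jointly exploit long-term structural regularities and short-term temporal changes.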

Abstract: Temporal knowledge graph (TKG) reasoning aims to predict future events from historical facts. A key challenge lies in jointly capturing two sources of predictive information in TKGs: historical evidence and evolutionary dynamics. However, existing methods typically focus on only one of these sources, which limits the ability to fully exploit the complementary predictive signals in TKGs. To address this, we propose CHE-TKG, a novel collaborative dual-view learning framework for TKG reasoning. CHE-TKG explicitly separates and jointly models historical evidence and evolutionary dynamics, aiming to learn and exploit their complementary predictive signals. Specifically, CHE-TKG constructs a historical evidence graph to capture long-term structural regularities and stable relational constraints, alongside an evolutionary dynamics graph to model temporal transitions and recent changes, with dedicated encoders for each view. We further employ relation decomposition and a contrastive alignment objective to better capture the predictive signals across the two views. Extensive experiments demonstrate that CHE-TKG achieves state-of-the-art performance on multiple benchmarks.


[9] Every Step Counts: Step-Level Credit Assignment for Tool-Integrated Text-to-SQL cs.CL

Yaxun Dai, Baolin Sun, Junying Wang, Pengfei Wang, Yingqi Gao

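TL;DR: This paper proposes FineStep, a framework that addresses the credit assignment problem in tool-integrated Text-to-SQL parsing. It designs independent process rewards, introduces a step-level credit assignment mechanism, and optimizes the policy with step-level advantages, aiming to quantify the value of each reasoning step precisely and thereby cut redundant tool interactions and improve performance. On the BIRD benchmark, FineStep achieves state-of-the-art performance, with a 3.25% average EX gain over GRPO at the 4B scale.

Details

Motivation: Existing reinforcement learning methods for tool-integrated Text-to-SQL rely mainly on coarse-grained outcome supervision, which creates a credit assignment problem: as long as the final answer is correct, the model receives the same reward even when intermediate steps are redundant, inefficient, or wrong. This encourages exploration of suboptimal reasoning spaces and limits efficiency and generalization.

Result: On the BIRD benchmark, FineStep achieves state-of-the-art performance, with a 3.25% average EX gain over GRPO at the 4B scale, and reduces redundant tool interactions.

Insight: The innovations are independent process rewards that ease the signal sparsity of outcome supervision, a step-level credit assignment mechanism that precisely quantifies the value of each reasoning step, and a policy optimization method based on step-level advantages. Viewed objectively, the work refines credit assignment in reinforcement learning down to the step level of tool-integrated Text-to-SQL, providing a finer supervision signal for sequential decision-making that could transfer to other AI tasks requiring multi-step reasoning.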

Abstract: Tool-integrated Text-to-SQL parsing has emerged as a promising paradigm, framing SQL generation as a sequential decision-making process interleaved with tool execution. However, existing reinforcement learning approaches mainly rely on coarse-grained outcome supervision, resulting in a fundamental credit assignment problem: models receive the same reward for any trajectory that yields the correct answer, even when intermediate steps are redundant, inefficient, or erroneous. Consequently, models are encouraged to explore suboptimal reasoning spaces, limiting both efficiency and generalization. To address this problem, we propose FineStep, a novel framework for step-level credit assignment in tool-augmented Text-to-SQL. First, we introduce a reward design with independent process rewards to alleviate the signal sparsity of outcome supervision. Next, we present a step-level credit assignment mechanism to precisely quantify the value of each reasoning step. Finally, we develop a policy optimization method based on step-level advantages for efficient updates. Extensive experiments on BIRD benchmarks show that FineStep achieves state-of-the-art performance and reduces redundant tool interactions, with a 3.25% average EX gain over GRPO at the 4B scale.


[10] Sentiment Analysis and Customer Satisfaction Prediction on E-Commerce Platforms Based on YouTube Comments Using the XGBoost Algorithm cs.CL

Ridho Benedictus Togi Manik, Muhammad Aqil Ramadhan, Ihsan Maulana Yusuf, Luluk Muthoharoh, Ardika Satria

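TL;DR: Focusing on Indonesian e-commerce platforms, this study builds a customer satisfaction prediction model from YouTube comments using the XGBoost algorithm and TF-IDF vectorization. Experiments indicate that the PyCaret-optimized machine learning framework classifies robustly, and that socio-political terms mixed into e-commerce discussions significantly affect sentiment polarity.

Details

Motivation: The rapid growth of digital commerce in Indonesia has shifted consumer interaction toward video-centric social networks, YouTube in particular. The sheer volume of unstructured, multi-contextual comments makes manual sentiment tracking extremely difficult, so an automated customer satisfaction prediction model is needed.

Result: On a secondary dataset of YouTube comments from e-commerce review videos, the PyCaret-optimized model shows superior classification performance. Beyond standard performance metrics, lexical evaluation and feature-importance mapping reveal that socio-political terms significantly influence sentiment polarity.

Insight: The novelty is applying XGBoost with TF-IDF to sentiment analysis of YouTube e-commerce comments and using PyCaret for automated machine learning optimization. An important objective finding is that user discussion in an e-commerce context is not purely commercial but deeply intertwined with socio-political issues, which offers a new perspective on online consumer behavior.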

Abstract: The exponential expansion of digital commerce in Indonesia has significantly shifted consumer interactions toward video-centric social networks, particularly YouTube. Consequently, the sheer volume of unstructured, multi-contextual comments poses a tremendous challenge for manual sentiment tracking. This study investigates and constructs a predictive model for customer satisfaction leveraging the Extreme Gradient Boosting (XGBoost) architecture coupled with Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. By utilizing a secondary dataset of YouTube comments retrieved from e-commerce review videos, the raw text underwent rigorous preprocessing to generate normalized numerical features. The experimental results demonstrate that the PyCaret-optimized machine learning framework delivers superior classification resilience. Beyond standard performance metrics, lexical evaluations and feature-importance mapping uncover a notable phenomenon: e-commerce discourse is heavily infiltrated by socio-political terminologies, which ultimately influence the polarity of audience satisfaction.


[11] Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training cs.CL | cs.LG

Hengyu Shi, Tianyang Han, Peizhe Wang, Zhiling Wang, Xu Yang

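TL;DR: This paper proposes LoPT (Local-Learning Post-Training), a new post-training method for large language models that introduces a gradient boundary to reduce compute and memory overhead. The method splits the Transformer at its midpoint: the second half learns from the task objective, while the first half is updated by a lightweight feature-reconstruction objective. This shortens the backward path and limits interference from task gradients on pretrained representations.

Details

Motivation: Conventional LLM post-training propagates gradients end-to-end, which requires storing activations for the whole model, creates long-range backward dependencies, and gives task gradients direct access to pretrained representations. This is computationally expensive and can disturb pretrained representations unnecessarily. The paper aims to design a cheaper and faster post-training method, particularly for cases where the supervision signal is much narrower than pre-training.

Result: Extensive experiments show that LoPT keeps performance competitive while achieving lower memory cost, higher training efficiency, and better retention of pretrained capabilities.

Insight: The main novelty is treating gradient reach as an explicit design choice: placing a gradient boundary at the Transformer midpoint decouples post-training into task learning and feature reconstruction. This offers a way to cut the compute and memory overhead of post-training significantly while preserving performance, which is relevant to resource-constrained model adaptation.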

Abstract: LLM post-training typically propagates task gradients through the full depth of the model. Although this end-to-end structure is simple and general, it couples task adaptation to full-depth activation storage, long-range backward dependencies and direct task-gradient access to pretrained representations. We argue that this full-depth backward coupling can be unnecessarily expensive and intrusive, particularly when post-training supervision is much narrower than pre-training. To this end, we propose LoPT (Local-Learning Post-Training), a simple post-training strategy that makes gradient reach an explicit design choice. LoPT places a single gradient boundary at the transformer midpoint: the second-half block learns from the task objective, while the first-half block is updated by a lightweight feature-reconstruction objective to preserve useful representations and maintain interface compatibility. LoPT shortens the task-induced backward path while limiting direct interference from narrow task gradients on early-layer representations. Extensive experiments demonstrate that LoPT achieves competitive performance with lower memory cost, higher training efficiency and better retention of pretrained capabilities. Our code is available at: https://github.com/HumyuShi/LoPT


[12] UFAL-CUNI at SemEval-2026 Task 11: An Efficient Modular Neuro-symbolic Method for Syllogistic Reasoning cs.CL

Ivan Kartáč, Kristýna Onderková, Jan Bronec, Zdeněk Kasner, Mateusz Lango

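TL;DR: This paper describes a system submitted to SemEval-2026 Task 11 (Disentangling Content and Formal Reasoning in Large Language Models). The system takes an efficient modular neuro-symbolic approach that combines a symbolic prover with small reasoning LLMs (4B parameters): an LLM parser translates natural-language syllogisms into a first-order logic (FOL) representation, and an automated theorem prover performs the inference. Two optional modules handle machine translation of multilingual inputs and symbolic retrieval of relevant premises.

Details

Motivation: In reasoning tasks, the content bias of large language models is entangled with their formal logical ability. The goal is to build a system that effectively separates content from formal reasoning.

Result: The system achieves competitive accuracy and a relatively low content effect on most subtasks. Ablations show that the approach outperforms LLM-based zero-shot baselines of comparable parameter size, but they also reveal the limited multilingual capabilities of small LLMs.

Insight: The novelty is an efficient modular architecture that combines symbolic reasoning (a theorem prover) with a neural component (a small LLM parser) and so decouples content from formal reasoning. Viewed objectively, its value lies in offering a workable route to interpretable, structured reasoning in resource-constrained settings that use small LLMs, and in its analysis of the limitations of the task's ranking metric.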

Abstract: This paper describes our system submitted to SemEval-2026 Task 11: Disentangling Content and Formal Reasoning in Large Language Models. We present an efficient modular neuro-symbolic approach, combining a symbolic prover with small reasoning LLMs (4B parameters). The system consists of an LLM-based parser that translates natural language syllogisms to a first-order logic (FOL) representation, an automated theorem prover, and two optional modules: machine translation for multilingual inputs and a symbolic retrieval component for the identification of relevant premises. The system achieves competitive accuracy and relatively low content effect on most subtasks. Our ablations show that this approach outperforms LLM-based zero-shot baselines in this parameter size range, but also reveal limited multilingual capabilities of small LLMs. Finally, we include a discussion of the task’s main ranking metric and analyze its limitations.


[13] Why Expert Alignment Is Hard: Evidence from Subjective Evaluation cs.CL

Tzu-Mi Lin, Wataru Hirota, Tatsuya Ishigaki, Lung-Hao Lee, Chung-Chi Chen

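TL;DR: This paper studies why aligning large language models with expert judgment is hard in subjective evaluation tasks. It finds that the difficulty stems from expert evaluation being heterogeneous, partly tacit, dimension-dependent, and temporally unstable, not from model limitations alone.

Details

Motivation: In subjective evaluation tasks, experts disagree, rely on tacit criteria, and change their judgments over time, all of which make it hard to align large models with them.

Result: Expert evaluations and follow-up questionnaires reveal four consistent patterns: alignment difficulty varies widely across experts; explicit criteria and reasoning do not always improve alignment; editing is sensitive to both the number and the identity of examples; and alignment difficulty differs across evaluation dimensions, with dimensions grounded in proposal content being easier to align.

Insight: The novelty is using expert alignment as a lens for understanding the difficulty of subjective evaluation and systematically analyzing the factors that affect alignment. Viewed objectively, the study highlights the inherent complexity of subjective evaluation and provides empirical grounding for designing more robust alignment methods.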

Abstract: Aligning large language models with expert judgment is especially difficult in subjective evaluation tasks, where experts may disagree, rely on tacit criteria, and change their judgments over time. In this paper, we study expert alignment as a way to understand this difficulty. Using expert evaluations and follow-up questionnaires, we examine how different forms of expert information affect alignment and what this reveals about subjective judgment. Our findings show four consistent patterns. First, alignment difficulty varies substantially across experts, suggesting that expert evaluation styles differ widely in their distance from a model’s prior behavior. Second, explicit criteria and reasoning do not always improve alignment, indicating that expert judgment is not fully captured by verbalized rules. Third, editing is sensitive to both the number and the identity of examples, with small numbers of edits providing useful but unstable gains. Fourth, alignment difficulty differs across evaluation dimensions: dimensions grounded more directly in proposal content are easier to align, while dimensions requiring external knowledge or value-based judgment remain harder. Taken together, these results suggest that expert alignment is difficult not only because of model limitations, but also because subjective evaluation is inherently heterogeneous, partly tacit, dimension-dependent, and temporally unstable.


[14] Misaligned by Reward: Socially Undesirable Preferences in LLMs cs.CL | cs.AI | cs.CY

Gayane Ghazaryan, Esra Dönmez

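TL;DR: By converting social evaluation datasets into pairwise preference data, this paper systematically evaluates existing reward models in four socially consequential domains: bias, safety, morality, and ethical reasoning. It finds that these models often prefer socially undesirable options and that their preferences produce systematically biased distributions, exposing the inadequacy of standard reward benchmarks for assessing social alignment.

Details

Motivation: Existing reward-model evaluations focus mainly on broad instruction-following benchmarks and cannot effectively measure whether a model captures socially desirable preferences, so important failures in social alignment can stay hidden.

Result: Across five publicly available reward models and two instruction-tuned models used as reward proxies, performance varies substantially across domains and no single model is best overall. The models fall well short of strong social intelligence: they often prefer socially undesirable options, and their preferences produce systematically biased distributions. Stronger bias avoidance can also reduce sensitivity to context, revealing a key alignment trade-off between avoiding biased outcomes and preserving contextual faithfulness.

Insight: The novelty is a framework that converts social evaluation datasets into preference data in order to measure directly the social preferences encoded in reward models. Viewed objectively, the study exposes systematic social-alignment failures in reward models and an inherent trade-off between alignment objectives, and it underlines the need for dedicated social-alignment benchmarks.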

Abstract: Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, existing evaluations focus primarily on broad instruction-following benchmarks, providing limited insight into whether these models capture socially desirable preferences. As a result, important failures in social alignment can remain hidden. We extend reward-model benchmarking to four socially consequential domains: bias, safety, morality, and ethical reasoning. We introduce a framework that converts social evaluation datasets into pairwise preference data, leveraging gold labels where available and directional bias indicators otherwise. This enables us to test whether reward models prefer socially undesirable responses, and whether their preferences produce systematically biased distributions over selected outputs. Across five publicly available reward models and two instruction-tuned models used as reward proxies, we find substantial variation across domains, with no single model performing best overall. The models fall well short of strong social intelligence: they often prefer socially undesirable options, and their preferences produce systematically biased distributions. Moreover, stronger bias avoidance can reduce sensitivity to context, revealing a key alignment trade-off between avoiding biased outcomes and preserving contextual faithfulness. These findings show that standard reward benchmarks are insufficient for assessing social alignment and highlight the need for evaluations that directly measure the social preferences encoded in reward models.
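
The pairwise test described above can be sketched as follows: each item pairs a socially desirable response with an undesirable one, and the quantity of interest is how often a scorer ranks the undesirable response higher. This is a minimal illustration, not the paper's framework; `toy_length_scorer` is a hypothetical stand-in for a real reward model, and the example pairs are invented.

```python
def undesirable_preference_rate(pairs, score):
    """Share of (prompt, desirable, undesirable) triples where the scorer
    assigns a strictly higher score to the undesirable response."""
    if not pairs:
        raise ValueError("at least one preference pair is required")
    flipped = sum(
        1 for prompt, good, bad in pairs if score(prompt, bad) > score(prompt, good)
    )
    return flipped / len(pairs)


def toy_length_scorer(prompt, response):
    """Hypothetical stand-in for a reward model: longer responses score higher."""
    return float(len(response))


# Invented examples, for illustration only.
pairs = [
    ("q1", "short ok", "a much longer bad answer"),
    ("q2", "a careful, longer good answer", "bad"),
]
rate = undesirable_preference_rate(pairs, toy_length_scorer)
```

A rate near zero would indicate that the scorer's preferences track the desirable labels; the toy scorer here is fooled whenever the undesirable response happens to be longer.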


[15] Automatically Finding and Validating Unexpected Side-Effects of Interventions on Language Models cs.CL | cs.AI

Quintin Pope, Ajay Hayagreeve Balaji, Jacques Thibodeau, Xiaoli Fern

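TL;DR: This paper presents an automated, contrastive evaluation pipeline for auditing the behavioral impact of interventions on large language models. The method compares the free-form, multi-token generations of a base model and an intervention model across aligned prompt contexts, produces human-readable, statistically validated natural-language hypotheses describing how the models differ, and summarizes the patterns across validated hypotheses.

Details

Motivation: The goal is to automatically discover and validate the intended and unexpected behavioral changes that language model interventions (such as reasoning distillation, knowledge editing, and unlearning) can cause, providing a statistically grounded and interpretable tool for post-hoc auditing of model behavior.

Result: In a synthetic setting with injected, known behavioral changes, the pipeline reliably recovers them. In three real-world interventions (reasoning distillation, knowledge editing, and unlearning), the method identifies both intended and unexpected behavioral shifts, distinguishes large from subtle interventions, and does not hallucinate differences when effects are absent or misaligned with the prompt bank.

Insight: The novelty is an automated, contrastive evaluation pipeline that combines statistical validation with natural-language hypothesis generation, giving a scalable and interpretable way to audit intervention-induced behavior and helping to surface potential side effects.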

Abstract: We present an automated, contrastive evaluation pipeline for auditing the behavioral impact of interventions on large language models. Given a base model $M_1$ and an intervention model $M_2$, our method compares their free-form, multi-token generations across aligned prompt contexts and produces human-readable, statistically validated natural-language hypotheses describing how the models differ, along with recurring themes that summarize patterns across validated hypotheses. We evaluate the approach in a synthetic setting by injecting known behavioral changes and showing that the pipeline reliably recovers them. We then apply it to three real-world interventions (reasoning distillation, knowledge editing, and unlearning), demonstrating that the method surfaces both intended and unexpected behavioral shifts, distinguishes large from subtle interventions, and does not hallucinate differences when effects are absent or misaligned with the prompt bank. Overall, the pipeline provides a statistically grounded and interpretable tool for post-hoc auditing of intervention-induced changes in model behavior.


[16] Beyond Semantics: An Evidential Reasoning-Aware Multi-View Learning Framework for Trustworthy Mental Health Prediction cs.CL

Yucheng Ruan, Ling Huang, Qika Lin, Kai He, Mengling Feng

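TL;DR: This paper proposes an evidential, reasoning-aware multi-view learning framework for trustworthy mental health prediction. The framework integrates semantic information from encoder-only models with higher-level reasoning information from decoder-only models, models uncertainty with Subjective Logic, and uses an evidential fusion strategy to balance the complementary views. Experiments on three real-world datasets show that the method improves predictive performance while also providing reliable uncertainty estimates and interpretability.

Details

Motivation: Existing text-based automated mental health prediction methods rely mainly on semantic representations, tend to make overconfident predictions under ambiguous, noisy, or shifted data, and lack reliable uncertainty estimation, which makes them hard to deploy in high-stakes real-world settings.

Result: The framework reaches accuracies of 0.835, 0.731, and 0.751 on the three real-world datasets Dreaddit, SDCNL, and DepSeverity, respectively, indicating its potential for reliable prediction. Additional noise-robustness experiments and interpretability case studies further support its effectiveness.

Insight: The novelty is framing mental health prediction as a multi-view learning problem that fuses semantic and reasoning information, and introducing an evidential learning framework based on Subjective Logic for explicit uncertainty modeling and evidence fusion. This improves the reliability and interpretability of predictions and suits risk-sensitive applications.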

Abstract: Automated mental health prediction using textual data has shown promising results with deep learning and large language models. However, deploying these models in high-stakes real-world settings remains challenging, as existing approaches largely rely on semantic representations and often produce overconfident predictions under ambiguous, noisy, or shifted data. Moreover, most methods lack reliable uncertainty estimation, undermining trust in risk-sensitive mental health applications. To address these limitations, we formulate the task as a multi-view learning problem that integrates semantic information from encoder-only models with higher-level reasoning information from decoder-only models, where reasoning-aware representations and uncertainty modeling are obtained in a trustworthy manner. To ensure reliable fusion, we adopt an evidential learning framework based on Subjective Logic to explicitly model uncertainty and introduce an evidential fusion strategy that balances complementary views while discounting unreliable evidence. Benchmarking on three real-world datasets, Dreaddit, SDCNL, and DepSeverity, reports accuracies of 0.835, 0.731, and 0.751, respectively, demonstrating its potential for reliable mental health prediction. Additional experiments on robustness to noise and case studies for interpretability confirm that our proposed framework not only improves predictive performance but also provides trustworthy uncertainty estimates and human-understandable reasoning signals, making it suitable for risk-sensitive applications in mental health assessment.
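
The Subjective Logic step can be made concrete with a minimal sketch of the standard evidence-to-opinion mapping used in evidential learning: non-negative evidence e_k for each of K classes gives Dirichlet strength S = sum(e_k) + K, belief masses b_k = e_k / S, and uncertainty u = K / S. The abstract does not specify the paper's fusion strategy, so fusion is deliberately not sketched, and the evidence values below are hypothetical.

```python
def opinion_from_evidence(evidence):
    """Map non-negative per-class evidence to a Subjective Logic opinion.

    Returns (beliefs, uncertainty); the belief masses and the uncertainty sum to 1.
    """
    num_classes = len(evidence)
    if num_classes == 0 or any(e < 0 for e in evidence):
        raise ValueError("evidence must be a non-empty list of non-negative numbers")
    strength = sum(evidence) + num_classes  # Dirichlet strength S
    beliefs = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return beliefs, uncertainty


# Hypothetical evidence for a two-class problem.
confident_beliefs, confident_u = opinion_from_evidence([8.0, 0.0])
vacuous_beliefs, vacuous_u = opinion_from_evidence([0.0, 0.0])
```

With no evidence at all the opinion is fully uncertain (u = 1), and uncertainty shrinks as evidence accumulates, which is what allows an evidential model to flag ambiguous inputs instead of producing an overconfident prediction.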


cs.CV

[17] Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation in Dermatology cs.CV | cs.AI | cs.CY

Roy Jiang, Hyunjae Kim, Zhenyue Qin, Morten Lee, Margaret MacGibeny

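TL;DR: This study evaluates how well multimodal large language models (MLLMs) perform in clinical dermatology practice. Comparing four open-weight models and one commercial model on public datasets and real-world hospital consultation cases, it finds that diagnostic performance is modest on public benchmarks and drops markedly in real-world settings, revealing a large gap between benchmark results and clinical use.

Details

Motivation: Although MLLMs show promise on public dermatology benchmarks, it is unclear whether that performance carries over to real-world dermatologic decision-making. The study aims to quantify this benchmark-to-bedside gap.

Result: On public benchmarks, top-3 diagnostic accuracy is 26.55% for the best open-weight model and 42.25% for GPT-4.1. On real-world consultation cases using images alone, accuracy falls to 1.50%-13.35% for open-weight models and 24.65% for GPT-4.1. Adding clinical context improves performance, but model outputs are highly sensitive to incomplete or erroneous context. In severity-based triage, models reach moderate sensitivity (above 60%).

Insight: The study exposes a significant benchmark-to-clinic performance gap in current dermatology MLLMs and stresses the importance of evaluating AI models in real, complex clinical environments. Integrating clinical context improves performance, but reliability is bounded by context quality, which suggests that the models are currently better suited to screening support than to direct clinical deployment.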

Abstract: Multimodal large language models (MLLMs) have demonstrated promise on publicly available dermatology benchmarks. However, benchmark performance may not generalize to real-world dermatologic decision-making. To quantify this benchmark-to-bedside gap, we evaluated four open-weight MLLMs (InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4 and MedGemma-4B-Instruct) and one commercial MLLM (GPT-4.1) across three publicly available dermatology datasets and a retrospective multi-site hospital-based dermatology consultation cohort comprising 5,811 cases and 46,405 clinical images. Models were evaluated on two clinically relevant tasks: differential diagnosis generation and severity-based triage. Diagnostic performance was modest on public datasets and declined substantially in the real-world cohort. On public benchmarks, top-3 diagnostic accuracy reached 26.55% for the best open-weight model and 42.25% for GPT-4.1. On real-world consultation cases using images alone, top-3 diagnostic accuracy fell to 1.50%-13.35% among open-weight models and 24.65% for GPT-4.1. Incorporating clinical context improved performance across all models, increasing top-3 diagnostic accuracy up to 28.75% among open-weight models and 38.93% for GPT-4.1. However, model outputs were highly sensitive to incomplete or erroneous consultation context. For severity-based triage, models achieved moderate sensitivity (above 60%), suggesting potential utility for screening but insufficient reliability for clinical deployment. These findings demonstrate that benchmark performance substantially overestimates the real-world clinical capability of current dermatology MLLMs.
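The top-3 diagnostic accuracy reported above can be computed along the following lines. This is a minimal sketch: the function name is hypothetical, and exact string matching between predicted and reference diagnoses is an assumption, since the abstract does not describe how matches were adjudicated.

```python
def top_k_accuracy(ranked_predictions, references, k=3):
    """Fraction of cases whose reference diagnosis appears in the top-k predictions.

    ranked_predictions: one ranked list of diagnosis strings per case.
    references: the reference diagnosis string for each case.
    Matching is exact string equality (an assumption for illustration).
    """
    if len(ranked_predictions) != len(references) or not references:
        raise ValueError("need one non-empty prediction list per reference")
    hits = sum(
        1
        for preds, ref in zip(ranked_predictions, references)
        if ref in preds[:k]
    )
    return hits / len(references)


# Toy differential diagnoses for three cases.
predictions = [
    ["psoriasis", "eczema", "tinea"],
    ["melanoma", "nevus", "seborrheic keratosis"],
    ["acne", "rosacea", "folliculitis", "eczema"],
]
references = ["eczema", "basal cell carcinoma", "eczema"]
accuracy = top_k_accuracy(predictions, references, k=3)
```

In the toy data only the first case counts as a hit, because the third case lists the reference diagnosis in fourth position.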


[18] Anatomy of a failure: When, how, and why deep vision fails in scientific domains cs.CV

Ji-Hun Oh, Dou Hoon Kwark, Kianoush Falahkheirkhah, Kevin Yeh, John Cheville

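TL;DR: This paper examines failure cases of deep learning in scientific imaging. Comparing RGB images of stained tissue with the biochemical signatures captured by infrared imaging in pathology, it finds that deep learning models perform worse on the more information-rich infrared data, exposing the limits of deep learning frameworks for scientific images.

Details

Motivation: The work assesses how effective deep learning is for scientific imaging, particularly for non-RGB scientific images such as infrared imaging. These images carry richer physicochemical information, yet naive application of deep learning models to them can lead to catastrophic failures.

Result: Deep learning models trained on infrared data underperform and tend to collapse to one-dimensional predictions. Even state-of-the-art robustification strategies do not resolve the problem, which highlights the limits of generic deep learning in scientific domains.

Insight: The paper's contribution is to expose the mismatch between deep learning's prior biases and the characteristics of scientific imaging data, and to call for modality-specific AI algorithms that avoid such failures, with implications for AI safety and scientific applications.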

Abstract: Mirroring its ubiquity in popular media and all human activities, the use of deep learning (DL) is rapidly growing in scientific imaging modalities. However, unlike everyday RGB pictures, pixels encode precise physicochemical properties in scientific imaging across potentially thousands of channels. While DL is well validated on human-centric RGB perceptual tasks, its effectiveness for scientific imaging remains uncertain. Here, we show that the naive application of DL frameworks to scientific images can lead to critical failures. We evaluate the use of DL for pathology, comparing RGB images of stained tissue with the quantitative and information-rich biochemical signatures of infrared (IR) imaging. Despite this informational advantage, DL models trained on IR data paradoxically underperform. We investigate this discrepancy to find that IR data priors interact poorly with the simplicity bias of DL, causing models to collapse to one-dimensional predictions. This constitutes a catastrophic DL failure because the model’s representational capacity remains largely unused, while furthermore raising AI safety concerns and undermining the advantages of such scientific modalities. Notably, this problem persists even with state-of-the-art DL robustification strategies, which are primarily designed and validated for RGB imagery and thus inherit the same prior-bias mismatch. This work establishes a framework for understanding the limitations of generic DL in science and advocates for the study of modality-specific failure modes to guide the development of specialized, safe AI algorithms.


[19] Densification and forecasting of Sentinel-2 time series from multimodal SAR and Optical satellite data using deep generative models cs.CV

Véronique Defonte, Dawa Derksen, Alexandre Constantin, Bastien Nespoulous

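TL;DR: This paper proposes a probabilistic deep learning framework for densifying and forecasting Sentinel-2 optical satellite time series. By jointly exploiting multimodal Sentinel-2 optical and Sentinel-1 SAR data, it generates optical images at arbitrary past or future dates, with a focus on the uncertainty of the generated images.

Details

Motivation: Optical satellite time series are widely used in agriculture, climate monitoring, and related fields, but clouds and swath edges cause irregular temporal sampling that limits continuous monitoring. Existing methods mainly fill in missing data within the observed time range and cannot predict future observations, so a solution that handles both densification and forecasting is needed.

Result: Experiments show effective densification and forecasting on sparse and temporally misaligned time series. The abstract gives no specific benchmarks or quantitative results (such as comparisons with the state of the art).

Insight: The novelty lies in combining multimodal SAR and optical data, using a probabilistic deep learning framework to generate images at arbitrary dates, and emphasizing uncertainty modeling. This offers a generative approach to handling missing values and future prediction in satellite data.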

Abstract: Optical satellite image time series are extensively used in many Earth observation applications, including agriculture, climate monitoring, and land surface analysis. However, clouds and swath edges result in irregular sampling along the temporal dimension, limiting continuous monitoring. To address this issue, a growing body of work has focused on temporal densification and reconstruction of satellite image time series, with the objective of filling missing or cloud-contaminated observations within the temporal extent of the available data. While these approaches improve temporal continuity, they are inherently restricted to reconstructing gaps within the observed time periods and do not address the prediction of future observations. This work proposes a probabilistic deep learning framework for the densification and forecasting of Sentinel-2 time series by generating optical images at arbitrary past or future dates. The approach leverages multimodal satellite data by jointly exploiting Sentinel-2 optical and Sentinel-1 SAR observations. Unlike most existing works, we focus on the uncertainty of the generated images. Experimental results demonstrate effective densification and forecasting on sparse and temporally misaligned time series.


[20] Imagery Dataset for Remaining Useful Life Estimation of Synthetic Fibre Ropes cs.CV | cs.LG

Anju Rani, Daniel Ortiz-Arroyo, Petar Durdevic

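TL;DR: This paper presents a new image dataset for remaining useful life estimation of synthetic fibre ropes. It contains approximately 34,700 high-resolution images covering the full degradation lifecycle of eleven Dyneema SK75/78 high-modulus polyethylene ropes, which were cyclically fatigue-tested to mechanical failure at seven axial load levels from 60 kN to 280 kN.

Details

Motivation: Remaining useful life estimation of synthetic fibre ropes is critical for safe operation in offshore cranes, wind turbine installation, and heavy-load handling, yet no publicly available image dataset captures their complete degradation lifecycle under controlled cyclic fatigue loading.

Result: The dataset records rope fatigue lifetimes from 695 to 8,340 cycles. After every fixed number of sheave cycles (an inspection burst), ten images were captured at different cross-sectional positions along the rope, providing a benchmark resource for vision-based condition monitoring and prognostics algorithms.

Insight: The contribution is the first publicly available spatio-temporal image dataset covering the full fatigue degradation of synthetic fibre ropes, annotated with the corresponding elapsed cycle counts. It can be used directly for machine learning tasks such as RUL regression, damage progression modelling, anomaly detection, and load-conditioned prognostics, filling a data gap in the field.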

Abstract: Remaining useful life (RUL) estimation of synthetic fibre ropes (SFRs) is critical for safe operation in offshore-crane, wind turbine installation, and heavy-load handling applications, where rope failure can result in catastrophic safety incidents and costly downtime. Despite growing research interest in data-driven condition monitoring, there is no publicly available image dataset that captures the complete degradation lifecycle of SFRs under controlled cyclic fatigue loading. To address this gap, we present a novel image dataset comprising approximately 34,700 high-resolution images of eleven Dyneema SK75/78 high-modulus polyethylene (HMPE) rope samples subjected to cyclic fatigue on a sheave-bend test stand at seven distinct axial load levels ranging from 60 kN to 280 kN. Ropes were loaded until mechanical failure, with fatigue lifetimes ranging from 695 cycles to 8,340 cycles. After every fixed number of sheave cycles (an inspection burst), ten images were captured at different cross-sectional positions along the rope, providing spatially representative sampling of surface degradation throughout the rope’s entire service life. The images obtained from each load are annotated with the corresponding elapsed cycle count, enabling a direct computation of RUL for any rope in the sequence. This dataset aims to support a broad range of machine learning (ML) tasks including RUL regression, damage progression modelling, anomaly detection, and load-conditioned prognostics. The dataset is intended to serve as a benchmark resource for the development and comparison of vision-based condition monitoring (CM) and prognostics algorithms for SFRs.
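The abstract notes that annotating each image with its elapsed cycle count enables a direct computation of RUL. A minimal sketch of that arithmetic follows; the function name and the example inspection points are hypothetical, and only the 695-cycle lifetime is taken from the abstract.

```python
def remaining_useful_life(elapsed_cycles, cycles_to_failure):
    """RUL label for an image: cycles left until the rope's recorded failure.

    elapsed_cycles: cycle count annotated on the image.
    cycles_to_failure: total fatigue lifetime of that rope sample.
    """
    if not 0 <= elapsed_cycles <= cycles_to_failure:
        raise ValueError("elapsed cycles must lie within the rope's lifetime")
    return cycles_to_failure - elapsed_cycles


# Hypothetical inspection bursts for a rope that failed after 695 cycles.
lifetime = 695
inspection_points = [0, 100, 200, 600]
rul_labels = [remaining_useful_life(c, lifetime) for c in inspection_points]
# rul_labels == [695, 595, 495, 95]
```

Dividing each label by the rope's lifetime would give a normalized RUL in [0, 1], which can make ropes tested at different loads easier to compare.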


[21] Beyond Fixed Thresholds and Domain-Specific Benchmarks for Explainable Multi-Task Classification in Autonomous Vehicles cs.CV | cs.RO

Maryam Sadat Hosseini Azad, Shahriar Baradaran Shokouhi

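TL;DR: To address the lack of transparency of deep learning models in autonomous driving scene understanding, this paper proposes an adaptive threshold selection method that optimizes decision boundaries for multi-task classification, and introduces IUST-XAI-AD, a new cross-cultural explainable autonomous driving dataset.

Details

Motivation: The work targets the black-box nature of deep learning models in autonomous driving systems, using multi-task visual understanding to improve explainability and safety and to overcome the shortcomings of traditional fixed thresholds in multi-task settings.

Result: Adaptive threshold selection improves F1-scores across different tasks. The IUST-XAI-AD dataset provides a more challenging evaluation benchmark for distinct driving contexts and reveals cross-cultural driving behavior patterns.

Insight: The contributions are adaptive decision-boundary optimization through confidence threshold sensitivity analysis, and a cross-cultural dataset with human-annotated driving decisions and reasoning. Together they provide methodology and evaluation tools for building explainable, culturally adaptive autonomous driving systems.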

Abstract: Scene understanding is a vital part of autonomous driving systems, which requires the use of deep learning models. Deep learning methods are intrinsically black box models, which lack transparency and safety in autonomous driving. To make these systems transparent, multi-task visual understanding has become crucial for explainable autonomous driving perception systems, where simultaneous prediction of multiple driving behaviors and their underlying explanations is essential for safe navigation and human trust in autonomous vehicles. In order to design an accurate and cross-cultural explainable autonomous driving system, we introduce a comprehensive confidence threshold sensitivity analysis that evaluates various threshold values to identify optimal decision boundaries for different tasks. Our analysis demonstrates that traditional fixed threshold approaches are suboptimal for multi-task scenarios. Through extensive evaluation, we demonstrate that our adaptive threshold selection methodology improves F1-scores across different tasks. In addition, we introduce IUST-XAI-AD, a novel dataset consisting of 958 images with human annotations for driving decisions and corresponding reasoning. This dataset addresses the critical gap in domain-specific evaluation benchmarks for distinct driving contexts and provides a more challenging test environment compared to existing datasets. Experimental results demonstrate that confidence threshold sensitivity analysis can significantly improve model performance, while the introduction of the IUST-XAI-AD dataset reveals important insights about cross-cultural driving behavior patterns. The combined contributions of this work provide both methodological advances and practical evaluation tools that can accelerate the development of more reliable, explainable, and culturally-adaptive autonomous driving systems for global deployment.
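One plausible reading of the confidence threshold sensitivity analysis described above is a per-task sweep that keeps the threshold with the highest F1-score. The sketch below shows that idea for a single binary task; the function names, the candidate grid, and the toy scores are assumptions, not the authors' procedure.

```python
def f1_score(y_true, y_pred):
    """Binary F1 from 0/1 labels; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, scores, candidates):
    """Sweep candidate thresholds and return (threshold, F1) with the highest F1."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        preds = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(y_true, preds)
        if f1 > best_f1:  # ties keep the earliest candidate
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy task where the conventional 0.5 cut-off misses a low-confidence positive.
labels = [1, 1, 0, 0]
confidences = [0.9, 0.4, 0.35, 0.1]
threshold, f1 = best_threshold(labels, confidences, [0.3, 0.5, 0.7])
```

In a multi-task setting the sweep would be repeated per task on held-out data, so each task gets its own threshold instead of sharing a fixed 0.5.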


[22] Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning cs.CV | cs.CL

Qihua Dong, Ruozhen He, Junwen Chen, Yizhou Wang, Xu Ma

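TL;DR: This paper proposes HierVA, a hierarchical visual agent framework for chart reasoning that tackles advanced chart question answering by iteratively constructing and updating a working context in a joint image-text space. A high-level manager generates plans and maintains a compact context containing only key information, while specialized workers perform reasoning, gather evidence, and return results.

Details

Motivation: Existing MLLMs are good at understanding single plots but struggle with multi-step reasoning across multiple subplots, whereas advanced chart question answering requires both precise perception of small visual elements and multi-step reasoning across subplots.

Result: Experiments on the CharXiv reasoning subset show consistent improvements over strong multimodal baselines, and ablation studies confirm that the hierarchical architecture, scoped visual context, and distilled context contribute complementary gains.

Insight: The novelty is a hierarchical agent framework that manages context in a joint image-text space. The division of labor between the manager and the workers, together with a zoom-in tool that restricts the visual context, improves multi-step reasoning across subplots.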

Abstract: Advanced chart question answering requires both precise perception of small visual elements and multi-step reasoning across several subplots. While existing MLLMs are strong at understanding single plots, they often struggle with multi-step reasoning across multiple subplots. We propose HierVA, a hierarchical visual agent framework for chart reasoning that iteratively constructs and updates a working context in a joint image–text space. A high-level manager generates plans and maintains a compact context containing only key information, while specialized workers perform reasoning, gather evidence, and return results. In particular, the agent maintains separate visual and textual contexts, using a zoom-in tool to restrict the visual context. Experiments on the CharXiv reasoning subset demonstrate consistent improvements over strong multimodal baselines, and ablation studies verify that hierarchical architecture, scoped visual context, and distilled context contribute complementary gains.


[23] InterFuserDVS: Event-Enhanced Sensor Fusion for Safe RL-Based Decision Making cs.CV

Mustafa Sakhai, Kaung Sithu, Min Khant Soe Oke, Maciej Wielgosz

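TL;DR: This paper proposes InterFuserDVS, an architecture that extends the state-of-the-art InterFuser model by integrating a Dynamic Vision Sensor (DVS, or event camera) as an additional modality to make autonomous driving perception more reliable. A novel token-based fusion strategy incorporates accumulated event frames into the transformer-based backbone, exploiting the complementary nature of RGB, LiDAR, and DVS data. On CARLA benchmarks the method shows improved robustness.

Details

Motivation: Conventional autonomous driving sensors such as RGB cameras and LiDAR are limited in high-dynamic-range or high-speed scenes by motion blur and latency. Event cameras offer microsecond temporal resolution and high dynamic range and capture asynchronous brightness changes, providing a new paradigm for perception.

Result: Evaluated on the CARLA Leaderboard benchmarks, the method achieves a competitive Driving Score of 77.2 and a Route Completion of 100%, indicating that integrating DVS improves the robustness of the driving agent.

Insight: The claimed contributions are the integration of DVS as an additional modality into the InterFuser architecture and a novel token-based fusion strategy. Viewed objectively, the novelty lies in fusing asynchronous event data with synchronous RGB and LiDAR data and exploiting their complementary temporal resolution and dynamic range to improve safety and performance under adverse lighting and dynamic conditions.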

Abstract: Autonomous driving systems rely heavily on robust sensor fusion to perceive complex environments. Traditional setups using RGB cameras and LiDAR often struggle in high-dynamic-range scenes or high-speed scenarios due to motion blur and latency. Dynamic Vision Sensors (DVS), or event cameras, offer a paradigm shift by capturing asynchronous brightness changes with microsecond temporal resolution and high dynamic range. In this paper, we propose an extended architecture of the state-of-the-art InterFuser model, integrating DVS as an additional modality to enhance perception reliability. We introduce a novel token-based fusion strategy that incorporates accumulated event frames into the transformer-based backbone of InterFuser. Our method leverages the complementary nature of RGB, LiDAR, and DVS data. We evaluate our approach on the Car Learning to Act (CARLA) Leaderboard benchmarks, demonstrating that the inclusion of DVS improves the robustness of the driving agent, achieving a competitive Driving Score of 77.2 and a superior Route Completion of 100%. The results indicate that event-based vision is a promising direction for improving safety and performance in adverse lighting and dynamic conditions.


[24] Optimize-at-Capture: Highly-adaptive Exposure Controlling for In-Vehicle Non-contact Heart-rate Monitoring cs.CV | eess.SY

Jieying Wang, Xinqi Cai, Caifeng Shan, Wenjin Wang

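TL;DR: This paper proposes a highly adaptive exposure control framework for in-vehicle non-contact heart-rate monitoring, addressing the degradation of remote photoplethysmography (rPPG) under dynamic illumination changes. The method proactively adjusts exposure parameters using a predictive model of historical skin reflections, keeping the skin region of interest within the optimal dynamic range for rPPG signal extraction. The study also contributes the ExpDrive dataset, with synchronized facial video and reference ECG from 48 subjects recorded under real driving conditions. Experiments show that the method clearly outperforms fixed exposure and standard auto-exposure strategies in challenging driving scenarios.

Details

Motivation: rPPG performance degrades severely under the dynamic illumination changes inside a vehicle. A key factor is that existing systems lack rPPG-oriented exposure control during video acquisition, which leaves facial brightness unstable.

Result: In experiments across challenging real driving scenarios, the method reduces the mean absolute error by 6.31 bpm (from 14.1 to 7.79 bpm) and raises the success rate by 32.3 percentage points (from 24.9% to 57.2%), improving non-contact heart-rate monitoring in both low-light (rainy) and high-glare (sunny) conditions.

Insight: The novelty is a proactive exposure control framework optimized specifically for rPPG measurement, which uses predictive modeling to keep the skin region within the optimal dynamic range. The paper also contributes ExpDrive, a public physiological monitoring dataset recorded under real driving conditions. Viewed objectively, coupling exposure control tightly with the rPPG task and optimizing signal quality at the point of capture is an effective system-level design.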

Abstract: Remote photoplethysmography (rPPG) holds great promise for continuous heart-rate monitoring of drivers in intelligent vehicles. However, its performance is severely degraded by highly dynamic illumination changes. A critical yet overlooked factor is the lack of exposure control during video acquisition: most existing systems rely on either fixed exposure settings or the camera's built-in auto-exposure, both of which fail to maintain stable facial brightness under rapidly changing lighting conditions during driving. To address this gap, we propose a highly-adaptive exposure controlling framework that proactively adjusts exposure parameters based on predictive modeling of historical skin reflections. Unlike standard auto-exposure, our method is specifically optimized for rPPG measurement, ensuring the skin region of interest (ROI) remains within the optimal dynamic range for rPPG signal extraction. As an important contribution of this study, we introduce ExpDrive, a public in-vehicle physiological monitoring dataset comprising synchronized facial video and reference ECG from 48 subjects captured under real driving conditions. Extensive experiments demonstrate that our method consistently outperforms fixed exposure and standard auto-exposure strategies. Specifically, it reduces the Mean Absolute Error (MAE) by 6.31 bpm (from 14.1 to 7.79 bpm) and significantly increases the success rate by 32.3 percentage points (p < 0.001) (from 24.9% to 57.2%) across challenging driving scenarios. Notably, it clearly improves the performance of non-contact heart-rate monitoring in both low-light (rainy) and high-glare (sunny) conditions, validating the efficacy of exposure-aware acquisition design.


[25] Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning cs.CV

Yating Wang, Yaqi Zhao, Yongshun Gong, Yilong Yin, Haoliang Sun

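TL;DR: This paper proposes Interpretable Prompt Learning (IPL), a framework that alternates between discrete semantic token selection and continuous prompt optimization to improve both the interpretability and the performance of prompt learning for vision-language models. It formulates semantic token selection as an approximate submodular optimization problem, so that selected tokens are both human-understandable and semantically diverse, and uses an alternating optimization strategy to combine the discrete and continuous steps.

Details

Motivation: Vision-language models such as CLIP suffer from overfitting and poor interpretability under continuous prompt learning, while discrete prompt optimization depends on large external models, leading to high computational cost and limited scalability.

Result: On multiple benchmarks, IPL consistently improves both interpretability and accuracy when combined with five representative prompt learning methods, demonstrating its effectiveness and scalability.

Insight: The novelty lies in alternating discrete semantic token selection with continuous prompt optimization, with submodular optimization encouraging semantically diverse and understandable tokens, which balances interpretability against task adaptability. Viewed objectively, this hybrid optimization strategy offers a plug-and-play, general extension for prompt learning.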

Abstract: Vision-language models such as CLIP achieve strong visual-textual alignment, but often suffer from overfitting and limited interpretability when adapted through continuous prompt learning. While discrete prompt optimization improves interpretability, it usually depends on large external models, leading to high computational costs and limited scalability. In this paper, we propose Interpretable Prompt Learning (IPL), a hybrid framework that alternates between discrete semantic token selection and continuous prompt optimization. Specifically, IPL formulates semantic token selection as an approximate submodular optimization problem, encouraging tokens that are both human-understandable and semantically diverse. It further adopts an alternating optimization strategy to integrate discrete token selection with continuous prompt tuning, improving interpretability while preserving adaptability to downstream tasks. Our framework is plug-and-play, allowing seamless integration with existing prompt learning methods. Extensive experiments on multiple benchmarks show that IPL consistently improves both interpretability and accuracy across five representative prompt learning methods, providing an effective and scalable extension to existing frameworks.
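To make the submodular token selection idea concrete, the sketch below runs the standard greedy algorithm on a facility-location objective, a common monotone submodular function that rewards covering all candidates with few, diverse picks. The objective, the function names, and the toy similarity matrix are assumptions for illustration; the abstract does not state which objective IPL uses.

```python
def facility_location(selected, similarity):
    """Coverage score: each candidate is credited with its best selected match."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in similarity)


def greedy_select(similarity, budget):
    """Greedily add the token with the largest marginal gain until the budget is met.

    similarity: square matrix of non-negative pairwise token similarities.
    """
    selected = []
    for _ in range(budget):
        current = facility_location(selected, similarity)
        best_j, best_gain = None, 0.0
        for j in range(len(similarity)):
            if j in selected:
                continue
            gain = facility_location(selected + [j], similarity) - current
            if gain > best_gain:
                best_j, best_gain = j, gain
        if best_j is None:  # no remaining token improves coverage
            break
        selected.append(best_j)
    return selected


# Tokens 0 and 1 are near-duplicates; tokens 2 and 3 are distinct concepts.
similarity = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.05, 0.0],
    [0.1, 0.05, 1.0, 0.05],
    [0.0, 0.0, 0.05, 1.0],
]
chosen = greedy_select(similarity, budget=2)
```

With a budget of two, the greedy pass skips the near-duplicate token 1 in favour of a token that covers a different region of the similarity space, which is the diversity effect the abstract describes.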


[26] RemoteZero: Geospatial Reasoning with Zero Human Annotations cs.CV

Liang Yao, Fan Liu, Shengxiang Xu, Chuanyi Zhang, Rui Min

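TL;DR: RemoteZero is a geospatial reasoning framework that needs no bounding-box annotations. It exploits the stronger discriminative ability of multimodal large language models (MLLMs) to perform semantic verification in place of geometric supervision, enabling self-evolving training on unlabeled remote sensing data.

Details

Motivation: Existing geospatial reasoning models can generate their own reasoning chains but still rely on human-annotated ground-truth coordinates for supervision, which prevents true self-evolution on abundant unlabeled remote sensing data.

Result: Experiments show that RemoteZero, without bounding-box annotations, achieves performance competitive with strong supervised methods, demonstrating the potential of self-verifying training for geospatial reasoning localization.

Insight: The core idea is an asymmetry: an MLLM is better at verifying whether a region satisfies a query than at directly generating coordinates. Replacing geometric supervision with intrinsic semantic verification enables GRPO training without box supervision and supports iterative self-evolution.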

Abstract: Geospatial reasoning requires models to resolve complex spatial semantics and user intent into precise target locations for Earth observation. Recent progress has liberated the reasoning path from manual curation, allowing models to generate their own inference chains. Yet a final dependency remains: they are still supervised by human-annotated ground-truth coordinates. This leaves the reasoning process autonomous, but not its spatial endpoint, and prevents true self-evolution on abundant unlabeled remote sensing data. To break this bottleneck, we introduce RemoteZero, a box-supervision-free framework for geospatial reasoning. RemoteZero is motivated by a simple asymmetry: an MLLM is typically better at verifying whether a region satisfies a query than at directly generating precise coordinates. Leveraging this stronger discriminative ability, RemoteZero replaces geometric supervision with intrinsic semantic verification and enables GRPO training without box annotations. The resulting framework further supports iterative self-evolution, allowing the model to improve from unlabeled remote sensing imagery through its own verification signal. Experiments show that RemoteZero achieves competitive performance against strong supervised methods, demonstrating the potential of self-verifying training for geospatial reasoning localization.
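GRPO scores each sampled response relative to the other responses drawn for the same query. The sketch below shows that group-relative normalization, with rewards standing in for the model's own verification signal; the function name, the use of the population standard deviation, the epsilon value, and the binary rewards are all assumptions, since the abstract does not specify them.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    if not rewards:
        raise ValueError("need at least one reward")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical verifier outputs for four candidate regions of one query:
# 1.0 if the model judges the region to satisfy the query, else 0.0.
verifier_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(verifier_rewards)
```

When every candidate in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is why a verifier that separates good from bad regions matters.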


[27] StableI2I: Spotting Unintended Changes in Image-to-Image Transition cs.CV | cs.AI

Jiayang Li, Shuo Cao, Xiaohui Li, Zhizhen Zhang, Kaiwen Zhu

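TL;DR: This paper proposes StableI2I, a unified and dynamic framework for evaluating content fidelity and pre-post consistency in image-to-image (I2I) tasks, and builds the accompanying StableI2I-Bench benchmark. Without requiring reference images, the framework provides accurate, fine-grained, and interpretable evaluations for tasks such as image editing and restoration, and correlates strongly with human subjective judgments.

Details

Motivation: Existing I2I evaluations focus mainly on instruction following and on the perceptual quality or aesthetics of generated images, but largely fail to assess whether the output preserves the semantic correspondence and spatial structure of the input. This paper addresses that limitation.

Result: Extensive experiments show that StableI2I provides accurate, fine-grained, and interpretable evaluations of content fidelity and consistency, with strong correlations to human subjective judgments.

Insight: The novelty is a unified framework that evaluates content consistency in I2I tasks without reference images, together with a benchmark that systematically measures how accurately MLLMs perform such assessments. It offers a practical tool for diagnosing content consistency and benchmarking model performance in real-world I2I systems.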

Abstract: In most real-world image-to-image (I2I) scenarios, existing evaluations primarily focus on instruction following and the perceptual quality or aesthetics of the generated images. However, they largely fail to assess whether the output image preserves the semantic correspondence and spatial structure of the input image. To address this limitation, we propose StableI2I, a unified and dynamic evaluation framework that explicitly measures content fidelity and pre–post consistency across a wide range of I2I tasks without requiring reference images, including image editing and image restoration. In addition, we construct StableI2I-Bench, a benchmark designed to systematically evaluate the accuracy of MLLMs on such fidelity and consistency assessment tasks. Extensive experimental results demonstrate that StableI2I provides accurate, fine-grained, and interpretable evaluations of content fidelity and consistency, with strong correlations to human subjective judgments. Our framework serves as a practical and reliable evaluation tool for diagnosing content consistency and benchmarking model performance in real-world I2I systems.


[28] Stream-T1: Test-Time Scaling for Streaming Video Generation cs.CV

Yijing Tu, Shaojin Wu, Mengqi Huang, Wenchuan Wang, Yuxin Wang

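TL;DR: This paper proposes Stream-T1, a test-time scaling (TTS) framework designed specifically for streaming video generation. Through three units (Stream-Scaled Noise Propagation, Stream-Scaled Reward Pruning, and Stream-Scaled Memory Sinking), it lowers computational cost while substantially improving the temporal consistency and visual quality of generated videos.

Details

Motivation: Existing diffusion-based test-time video generation methods suffer from excessive candidate exploration costs and a lack of temporal guidance. The paper addresses these structural bottlenecks by shifting to streaming video generation.

Result: On comprehensive 5-second and 30-second video benchmarks, Stream-T1 shows clear advantages, substantially improving temporal consistency, motion smoothness, and frame-level visual quality.

Insight: The novelty lies in pairing TTS with the chunk-level synthesis and few denoising steps of streaming video generation, and in designing a mechanism that uses historical noise priors, a reward that combines short-term and long-term evaluation, and a reward-guided dynamic KV-cache management scheme, yielding efficient, high-quality long video generation.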

Abstract: While Test-Time Scaling (TTS) offers a promising direction to enhance video generation without the surging costs of training, current test-time video generation methods based on diffusion models suffer from exorbitant candidate exploration costs and lack temporal guidance. To address these structural bottlenecks, we propose shifting the focus to streaming video generation. We identify that its chunk-level synthesis and few denoising steps are intrinsically suited for TTS, significantly lowering computational overhead while enabling fine-grained temporal control. Driven by this insight, we introduce Stream-T1, a pioneering comprehensive TTS framework exclusively tailored for streaming video generation. Specifically, Stream-T1 is composed of three units: (1) Stream-Scaled Noise Propagation, which actively refines the initial latent noise of the generating chunk using historically proven, high-quality previous chunk noise, effectively establishing temporal dependency and utilizing the historical Gaussian prior to guide the current generation; (2) Stream-Scaled Reward Pruning, which comprehensively evaluates generated candidates to strike an optimal balance between local spatial aesthetics and global temporal coherence by integrating immediate short-term assessments with sliding-window-based long-term evaluations; (3) Stream-Scaled Memory Sinking, which dynamically routes the context evicted from the KV-cache into distinct updating pathways guided by the reward feedback, ensuring that previously generated visual information effectively anchors and guides the subsequent video stream. Evaluated on both 5s and 30s comprehensive video benchmarks, Stream-T1 demonstrates profound superiority, significantly improving temporal consistency, motion smoothness, and frame-level visual quality.


[29] Information Coordination as a Bridge: A Neuro-Symbolic Architecture for Reliable Autonomous Driving Scene Understanding cs.CV

Shuo Liu, Lei Shi, Haowen Liu, Jing Xu, Yufei Gao

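TL;DR: This paper proposes InfoCoordiBridge, a BEV-centric neuro-symbolic architecture that inserts an explicit coordination bridge between perception and language reasoning. It targets the hallucinations and unsafe conclusions that arise in autonomous driving scene understanding when multi-sensor outputs are redundant or conflicting. The architecture comprises a unified multi-agent perception layer, an information coordination and fusion module, and a module for SceneSummary-grounded reasoning and verification.

Details

Motivation: Existing LLM-driven driving systems attach the language model as a post-processor and have it reason directly over redundant or conflicting perception outputs, which amplifies hallucinated entities and unsafe conclusions. The goal is scene understanding that is semantically consistent across heterogeneous sensors and verifiable.

Result: Experiments on nuScenes and Waymo show that the ICA module preserves competitive 3D detection accuracy while substantially improving fusion consistency, reducing redundancy to below 1% and reaching about 98% attribute agreement. On the NuScenes-QA and Waymo-QA benchmarks, the SSRE module improves factual grounding and reduces hallucinated entity mentions compared with representative VLM and agentic baselines.

Insight: The core idea is an explicit information coordination bridge (the ICA module) between perception and high-level reasoning. It aligns and fuses multi-source perception outputs into a single, conflict-free SceneSummary, so that redundant and inconsistent evidence is filtered out before the language model is prompted. This is an effective design pattern for combining neural and symbolic components.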

Abstract: Reliable autonomous driving requires scene understanding that is semantically consistent across heterogeneous sensors and verifiable at the reasoning stage. However, many recent LLM-driven driving systems attach the language model as a post-processor and force it to reason over redundant or conflicting perception outputs, which can amplify hallucinated entities and unsafe conclusions. This paper proposes InfoCoordiBridge, a BEV-centric neuro-symbolic architecture that inserts an explicit coordination bridge between perception and language reasoning. InfoCoordiBridge comprises (i) a unified multi-agent perception layer that outputs typed structured facts together with modality-focused synopses, (ii) an ICA module that aligns and fuses multi-source outputs into a single SceneSummary, and (iii) an SSRE module that performs SceneSummary-grounded reasoning with verification. Experiments on nuScenes and Waymo show that ICA preserves competitive 3D detection accuracy while substantially improving fusion consistency, reducing redundancy to below 1% and achieving about 98% attribute agreement. On NuScenes-QA and a template-aligned Waymo-QA benchmark, SSRE improves factual grounding and reduces hallucinated entity mentions compared with representative VLM and agentic baselines. Overall, by coordinating multi-sensor outputs into a single conflict-aware SceneSummary before prompting, InfoCoordiBridge prevents redundant and cross-modally inconsistent perception evidence from propagating into high-level reasoning.


[30] DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning cs.CV | cs.AI

Yuancheng Wei, Haojie Zhang, Linli Yao, Lei Li, Jiali Chen

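TL;DR: This paper presents DiffCap-Bench, a comprehensive, challenging, and robust benchmark for image difference captioning (IDC). It covers ten distinct difference categories and introduces an LLM-as-a-Judge evaluation protocol, addressing the limited diversity of existing benchmarks and the inability of traditional metrics such as BLEU to capture semantic consistency. An extensive evaluation of advanced multimodal large language models (MLLMs) reveals a significant performance gap between proprietary and open-source models, underlines the importance of reasoning capability, and identifies limitations in model scaling.

Details

Motivation: Existing IDC benchmarks lack diversity and compositional complexity, and standard lexical-overlap metrics (such as BLEU and METEOR) cannot assess semantic consistency or penalize hallucinations. Together these prevent a comprehensive and robust evaluation of MLLMs on IDC.

Result: An extensive evaluation of state-of-the-art MLLMs shows a significant performance gap between proprietary and open-source models, and reveals the critical importance of reasoning capability and clear limitations in model scaling. The evaluation framework aligns strongly with human expert judgments and correlates strongly with the quality of downstream image editing data construction.

Insight: The contributions are a comprehensive IDC benchmark covering ten difference categories (DiffCap-Bench) and a robust LLM-as-a-Judge evaluation protocol grounded in human-validated Difference Lists. Together they provide a more complete and more semantic evaluation standard for IDC, and one that predicts a model's usefulness in downstream tasks such as image editing.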

Abstract: Image Difference Captioning (IDC) generates natural language descriptions that precisely identify differences between two images, serving as a key benchmark for fine-grained change perception, cross-modal reasoning, and image editing data construction. However, existing benchmarks lack diversity and compositional complexity, and standard lexical-overlap metrics (e.g., BLEU, METEOR) fail to capture semantic consistency or penalize hallucinations, which together prevent a comprehensive and robust evaluation of multimodal large language models (MLLMs) on IDC. To address these gaps, we introduce DiffCap-Bench, a comprehensive IDC benchmark covering ten distinct difference categories to ensure diversity and compositional complexity. Furthermore, we propose an LLM-as-a-Judge evaluation protocol grounded in human-validated Difference Lists, enabling a robust assessment of models’ ability to both capture and describe visual changes. Through extensive evaluation of state-of-the-art MLLMs, we reveal significant performance gaps between proprietary and open-source models, highlight the critical importance of reasoning capability, and identify clear limitations in model scaling. Our framework also demonstrates strong alignment with human expert judgments and strong correlation with downstream image editing data construction quality. These findings establish DiffCap-Bench as both a reliable IDC evaluation framework and a practical predictor of downstream utility. The benchmark and code will be made publicly available to support further research.


[31] SpecPL: Disentangling Spectral Granularity for Prompt Learning cs.CV | cs.AI | cs.CL | cs.LG

Jingtao Zhou, Xirui Kang, Feiyang Huang, Lai-Man Po

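TL;DR: SpecPL is a prompt learning method that takes a spectral perspective, optimizing vision-language models by disentangling the semantic low-frequency components of visual signals from their fine-grained high-frequency details. It uses a frozen VAE and a Visual Semantic Bank, and strengthens fine-grained discrimination through Counterfactual Granule Supervision. As a plug-and-play booster, it improves existing text-oriented prompt learning methods such as CoOp and MaPLe.

Details

Motivation: Existing prompt learning methods for vision-language models exhibit a modality asymmetry: they mainly optimize text tokens while the visual encoder stays frozen as a holistic feature extractor, neglecting the spectral granularity that is essential for fine-grained discrimination. The result is a trade-off between stability and generalization.

Result: Experiments on 11 benchmarks show competitive state-of-the-art performance, reaching a new performance ceiling of 81.51% harmonic-mean accuracy.

Insight: The core idea is to approach prompt learning from a spectral perspective, using Counterfactual Granule Supervision to explicitly distinguish visual granularity from semantic invariance. Its use of frozen components (a VAE and a Visual Semantic Bank) to disentangle and anchor different frequency bands, and its design as a general booster for existing baselines, are worth borrowing.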

Abstract: Existing prompt learning for VLMs exhibits a modality asymmetry, predominantly optimizing text tokens while still relying on a frozen visual encoder as a holistic extractor and neglecting the spectral granularity essential for fine-grained discrimination. To bridge this, we introduce Disentangling Spectral Granularity for Prompt Learning (SpecPL), which approaches prompt learning from a novel spectral perspective via Counterfactual Granule Supervision. Specifically, we leverage a frozen VAE to decompose visual signals into semantic low-frequency bands and granular high-frequency details. A frozen Visual Semantic Bank anchors text representations to universal low-frequency invariants, mitigating overfitting. Crucially, fine-grained discrimination is driven by counterfactual granule training: by permuting high-frequency signals, we compel the model to explicitly distinguish visual granularity from semantic invariance. Uniquely, SpecPL serves as a universal plug-and-play booster, revitalizing text-oriented baselines like CoOp and MaPLe via visual-side guidance. Experiments on 11 benchmarks demonstrate competitive state-of-the-art performance, achieving a new performance ceiling of 81.51% harmonic-mean accuracy. These results validate that spectral disentanglement with counterfactual supervision effectively bridges the gap in the stability-generalization trade-off. Code is released at https://github.com/Mlrac1e/SpecPL-Prompt-Learning.
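For readers unfamiliar with the harmonic-mean accuracy cited above, the sketch below shows the formula. In prompt-learning benchmarks it usually combines base-class and novel-class accuracy; that pairing is an assumption here, since the abstract does not say which two accuracies are combined, and the input values are made up.

```python
def harmonic_mean_accuracy(base_acc, novel_acc):
    """Harmonic mean H = 2 * b * n / (b + n) of two accuracy values."""
    if base_acc < 0 or novel_acc < 0:
        raise ValueError("accuracies must be non-negative")
    if base_acc + novel_acc == 0:
        return 0.0
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)


# Illustrative values only: a model that is strong on base classes but weaker
# on novel classes scores below the arithmetic mean of 80.0.
h = harmonic_mean_accuracy(90.0, 70.0)  # 78.75
```

Because the harmonic mean is pulled toward the smaller value, it penalizes methods that gain on seen classes at the expense of unseen ones.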


[32] Ilov3Splat: Instance-Level Open-Vocabulary 3D Scene Understanding in Gaussian Splatting cs.CV | cs.AI

Binh Long Nguyen, Kien Nguyen, Sridha Sridharan, Clinton Fookes, Peyman Moghadam

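TL;DR: Ilov3Splat is a new framework for instance-level open-vocabulary 3D scene understanding built on 3D Gaussian Splatting (3D-GS). It jointly optimizes scene geometry and semantic representations by augmenting Gaussian splats with view-consistent feature fields, encodes language-aligned CLIP features with multi-resolution hash embedding, and trains a contrastive instance feature field over SAM masks. This lets it identify arbitrary objects in 3D scenes from natural language descriptions alone, without category supervision or manual annotations.

Details

Motivation: Existing methods rely on 2D rendering-based matching or point-level semantic association, which leads to cross-view inconsistency, a lack of coherent instance-level reasoning, and limited precision in downstream 3D tasks.

Result: On standard benchmarks, Ilov3Splat outperforms prior open-vocabulary 3D-GS methods in both object selection and instance segmentation, reaching state-of-the-art performance.

Insight: The novelty lies in jointly optimizing view-consistent feature fields with 3D Gaussian Splatting, using multi-resolution hash embedding to encode CLIP features efficiently for dense and coherent language grounding in 3D space, and training an instance feature field with contrastive learning over SAM masks to support fine-grained object distinction across views. Viewed objectively, the way it folds 2D vision-language features into a 3D representation and supports instance-level open-vocabulary queries is worth borrowing.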

Abstract: We introduce Ilov3Splat, a novel framework for instance-level open-vocabulary 3D scene understanding built on 3D Gaussian Splatting (3D-GS). Most prior work depends on 2D rendering-based matching or point-level semantic association, which undermines cross-view consistency, lacks coherent instance-level reasoning, and limits precision in downstream 3D tasks. To address these limitations, our method jointly optimizes scene geometry and semantic representations by augmenting Gaussian splats with view-consistent feature fields. Specifically, we leverage multi-resolution hash embedding to efficiently encode language-aligned CLIP features, enabling dense and coherent language grounding in 3D space. We further train an instance feature field using contrastive loss over SAM masks, supporting fine-grained object distinction across views. At inference time, CLIP-encoded queries are matched against the learned features, followed by two-stage 3D clustering to retrieve relevant Gaussian groups. This enables our framework to identify arbitrary objects in 3D scenes based on natural language descriptions, without requiring category supervision or manual annotations. Experiments on standard benchmarks demonstrate that Ilov3Splat outperforms prior open-vocabulary 3D-GS methods in both object selection and instance segmentation, offering a flexible and accurate solution for language-driven 3D scene understanding. Project page: https://csiro-robotics.github.io/Ilov3Splat.


[33] From Priors to Perception: Grounding Video-LLMs in Physical Reality cs.CV

Zicheng Zhao, Chaofan Gan, Shijie Li, Weiyao Lin

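TL;DR: This paper targets the systematic deficits of Video Large Language Models (Video-LLMs) in fine-grained physical reasoning and proposes a Unified Attribution Theory: the failures stem from Semantic Prior Dominance rather than from a perception deficiency. It builds PACC, a programmatic adversarial curriculum dataset synthesized from physical laws, and designs the Visual-Anchored Reasoning Chain (VARC), which forces the model to reason from low-level visual facts before making logical judgments. Experiments show that LoRA fine-tuning alone substantially improves the physical reasoning of SOTA models.

Details

Motivation: Video-LLMs fail systematically on anti-physics anomalies and counter-intuitive scenarios, while existing methods generalize poorly and conflate generative artifacts with genuine physical fallacies.

Result: On SOTA models, standard LoRA fine-tuning with the PACC curriculum and VARC, without invasive architectural modifications, effectively neutralizes prior interference and yields a substantial improvement in physical reasoning.

Insight: The contribution lies in the Semantic Prior Dominance theory, which explains the root cause of the failures; in PACC, the first high-fidelity adversarial video dataset based on physical laws, which decouples visual artifacts from logical errors; and in VARC, which forces visually anchored reasoning. Together these offer both a data solution and a reasoning framework for improving physical reasoning.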

Abstract: While Video Large Language Models (Video-LLMs) excel in general understanding, they exhibit systematic deficits in fine-grained physical reasoning. Existing interventions not only suffer from limited generalization but fundamentally conflate generative artifacts with genuine physical fallacies. Furthermore, we find that models fail systematically not only in anti-physics anomalies but also in counter-intuitive scenarios where visual facts contradict statistical expectations. Accordingly, we propose the Unified Attribution Theory: this dual failure stems not from perception deficiency, but from Semantic Prior Dominance – the reasoning mechanism is deeply hijacked by internal narrative scripts. To address this, we construct the Programmatic Adversarial Curriculum (PACC), the first high-fidelity adversarial video dataset synthesized based on physical laws, thoroughly decoupling visual artifacts from logical errors. Concurrently, we design the Visual-Anchored Reasoning Chain (VARC) to force models to explicitly ground their judgments in low-level visual facts prior to logical adjudication. Experiments demonstrate that without invasive architectural modifications, standard LoRA fine-tuning with the PACC curriculum effectively neutralizes prior interference in state-of-the-art (SOTA) models, yielding a substantial leap in physical reasoning capabilities.


[34] DALight-3D: A Lightweight 3D U-Net for Brain Tumor Segmentation from Multi-Modal MRI cs.CV | cs.LG | cs.NE

Nand Kumar Mishra, Dhruv Mishra, Dr Manu Pratap Singh

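TL;DR: This paper proposes DALight-3D, a lightweight 3D U-Net variant for brain tumor segmentation from multi-modal MRI. By combining depthwise separable 3D convolutions, identifier-conditioned normalization, cross-slice attention, and adaptive skip fusion, it substantially reduces parameter count and computational cost while maintaining segmentation accuracy.

Details

Motivation: Existing volumetric models for automatic brain tumor segmentation from multi-modal MRI are computationally expensive; the goal is a lightweight model with a better balance between accuracy and efficiency.

Result: On the Medical Segmentation Decathlon Task01 BrainTumour benchmark, after 50 epochs of training, DALight-3D reaches a mean Dice of 0.727 with 2.22M parameters, compared with 0.710 Dice and 3.20M parameters for Residual 3D U-Net, and shows a better accuracy-efficiency trade-off than standard 3D U-Net, Attention U-Net, and V-Net.

Insight: The contributions include extending depthwise separable convolutions to 3D to reduce the parameter count, introducing identifier-conditioned normalization to strengthen modality-specific features, and adding cross-slice attention and adaptive skip fusion modules to improve feature integration. Viewed objectively, the combination of these components achieves a lightweight design, and the ablations show that performance degrades when any one of them is removed.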

Abstract: Automatic brain tumor segmentation from multi-modal MRI remains challenging because volumetric models often incur substantial computational cost. This paper presents DALight-3D, a compact 3D U-Net variant that combines depthwise separable 3D convolutions, identifier-conditioned normalization, cross-slice attention, and adaptive skip fusion. The method is evaluated on the Medical Segmentation Decathlon Task01 BrainTumour benchmark under matched optimization settings against standard 3D U-Net, Attention U-Net, Residual 3D U-Net, and V-Net baselines. In the reported 50-epoch comparison, DALight-3D achieves a mean Dice of 0.727 with 2.22M parameters, compared with 0.710 Dice and 3.20M parameters for Residual 3D U-Net. Component-wise ablations show consistent performance degradation when SepConv, identifier-conditioned normalization, CSA, or SSFB is removed. These results indicate that DALight-3D offers a favorable accuracy-efficiency trade-off within the present benchmark setting.
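
The parameter savings from depthwise separable 3D convolutions can be illustrated with a small counting sketch. The channel sizes (32 and 64) and the 3×3×3 kernel are illustrative assumptions, not DALight-3D's actual configuration, and the helper names are hypothetical.

```python
def conv3d_params(c_in, c_out, k=3, bias=False):
    """Weights in a standard 3D convolution with a k*k*k kernel."""
    return c_in * c_out * k ** 3 + (c_out if bias else 0)


def separable_conv3d_params(c_in, c_out, k=3, bias=False):
    """Weights in a depthwise (one k*k*k filter per input channel)
    convolution followed by a pointwise 1x1x1 convolution."""
    depthwise = c_in * k ** 3 + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Illustrative layer sizes (assumed, not taken from the paper).
standard = conv3d_params(32, 64)              # 32 * 64 * 27
separable = separable_conv3d_params(32, 64)   # 32 * 27 + 32 * 64
print(standard, separable, round(standard / separable, 1))
```

For this single assumed layer the separable variant needs roughly 19 times fewer weights; the whole-network ratio reported in the abstract (2.22M vs. 3.20M parameters) is much smaller because other components also contribute parameters.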


[35] Velox: Learning Representations of 4D Geometry and Appearance cs.CV

Anagh Malik, Dorian Chan, Xiaoming Zhao, David B. Lindell, Oncel Tuzel

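TL;DR: Velox is a framework for learning latent representations of 4D objects. From unstructured dynamic point cloud input, an encoder produces a compressed set of dynamic shape tokens, and a 4D surface decoder and a Gaussian decoder model geometry and appearance respectively, yielding a 4D representation that is descriptive, compressive, and easy to construct.

Details

Motivation: The goal is to learn, from unstructured dynamic point clouds, a 4D representation that accurately captures object geometry and appearance while being compressive enough to improve downstream efficiency.

Result: The representation performs strongly on three downstream tasks: video-to-4D generation, 3D tracking, and cloth simulation via image-to-4D generation.

Insight: The contribution lies in supervising geometry and appearance learning with two decoders (a 4D surface decoder and a Gaussian decoder), so that a compressed yet comprehensive 4D representation is learned from dynamic point clouds. The method has low input requirements, needing only a dynamic point cloud, which makes it more practical.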

Abstract: We introduce a framework for learning latent representations of 4D objects which are descriptive, faithfully capturing object geometry and appearance; compressive, aiding in downstream efficiency; and accessible, requiring minimal input, i.e., an unstructured dynamic point cloud, to construct. Specifically, Velox trains an encoder to compress spatiotemporal color point clouds into a set of dynamic shape tokens. These tokens are supervised using two complementary decoders: a 4D surface decoder, which models the time-varying surface distribution capturing the geometry; and a Gaussian decoder, which maps the tokens to 3D Gaussians, helping learn appearance. To demonstrate the utility of our representation, we evaluate it across three downstream tasks – video-to-4D generation, 3D tracking, and cloth simulation via image-to-4D generation – and observe strong performance in all settings.


[36] Reward-Guided Semantic Evolution for Test-time Adaptive Object Detection cs.CV

Lihua Zhou, Mao Ye, Xiatian Zhu, Nianxin Li, Changyi Ma

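TL;DR: This paper proposes RGSE, a training-free test-time adaptive object detection framework based on reward-guided semantic evolution, which addresses the performance degradation of vision-language models under test-time distribution shift in open-vocabulary object detection. It optimizes text embeddings directly through an evolutionary search strategy, avoiding costly backpropagation and external memory mechanisms.

Details

Motivation: In open-vocabulary object detection, vision-language models such as Grounding DINO degrade under test-time distribution shift, mainly because of semantic misalignment between text embeddings and the shifted visual embeddings of region proposals. Existing test-time adaptation methods either rely on costly backpropagation or bypass the misalignment via external memory, and none aligns the two directly, efficiently, and without training.

Result: RGSE achieves state-of-the-art performance on multiple detection benchmarks while adding only minimal computational overhead.

Insight: The contribution lies in treating text embedding adaptation as a semantic search process. Evolutionary search (perturbing candidates, evaluating rewards by cosine similarity, and fusing by reward-weighted averaging) optimizes the text embeddings directly, with no backpropagation or external memory, giving efficient, training-free semantic alignment at test time.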

Abstract: Open-vocabulary object detection with vision-language models (VLMs) such as Grounding DINO suffers from performance degradation under test-time distribution shifts, primarily due to semantic misalignment between text embeddings and shifted visual embeddings of region proposals. While recent test-time adaptive object detection methods for VLM-based detectors either rely on costly backpropagation or bypass semantic misalignment via external memory, none directly and efficiently aligns text and vision in a training-free manner. To address this, we propose Reward-Guided Semantic Evolution (RGSE), a training-free framework that directly refines the text embeddings at test time. Inspired by evolutionary search, RGSE treats text embedding adaptation as a semantic search process: it perturbs text embeddings as candidate variants, evaluates them via cosine similarity with current and historical high-confidence visual proposals as a reward signal, and fuses them into a refined embedding through reward-weighted averaging. Without any backpropagation, RGSE achieves state-of-the-art performance across multiple detection benchmarks while adding minimal computational overhead. Our code will be open source upon publication.
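
The perturb, score, and fuse loop described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the Gaussian perturbation, the softmax weighting with a `temperature`, and all function and parameter names are assumptions introduced here.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def cosine(a, b):
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def evolve_text_embedding(text_emb, visual_embs, num_candidates=16,
                          sigma=0.05, temperature=0.1, seed=0):
    """One refinement step: perturb, score by cosine reward, fuse.

    Assumed details: Gaussian noise for perturbation and a softmax over
    rewards for the weighted average; the paper may use other choices.
    """
    rng = random.Random(seed)
    # Candidate pool: the current embedding plus noisy variants of it.
    candidates = [list(text_emb)]
    for _ in range(num_candidates):
        candidates.append([x + rng.gauss(0.0, sigma) for x in text_emb])
    # Reward: mean cosine similarity with high-confidence visual proposals.
    rewards = [sum(cosine(c, v) for v in visual_embs) / len(visual_embs)
               for c in candidates]
    # Softmax weights (shifted by the max reward for numerical stability).
    top = max(rewards)
    weights = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(weights)
    fused = [sum(w * c[i] for w, c in zip(weights, candidates)) / total
             for i in range(len(text_emb))]
    return normalize(fused)


refined = evolve_text_embedding([1.0, 0.0, 0.0], [[0.8, 0.6, 0.0]])
print(refined)
```

No gradients are computed anywhere, which is what makes this style of update training-free; the cost is one cosine similarity per candidate and proposal.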


[37] InterMesh: Explicit Interaction-Aware End-to-End Multi-Person Human Mesh Recovery cs.CV

Kaili Zheng, Kaiwen Wang, Xun Zhu, Chenyi Guo, Ji Wu

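TL;DR: This paper proposes InterMesh, a framework that improves multi-person human mesh recovery by explicitly incorporating human-object interaction information. It uses a human-object interaction detector to enrich query representations and adds lightweight modules that integrate into existing architectures, substantially improving pose and shape estimation accuracy on several datasets.

Details

Motivation: Existing DETR-based end-to-end multi-person human mesh recovery methods model interactions only implicitly through self-attention and lack explicit reasoning about how humans interact with their environment and with each other, which limits accuracy in complex interaction scenes.

Result: Experiments on 3DPW, MuPoTS, CMU Panoptic, Hi4D, and CHI3D show that the method reaches SOTA performance, reducing MPJPE by 9.9% on CMU Panoptic and by 8.2% on Hi4D.

Insight: The contribution lies in an explicit interaction-aware mechanism: a human-object interaction detector supplies structured semantic information, and lightweight encoder and refiner modules use it to substantially improve recovery accuracy in complex interaction scenes at low computational overhead.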

Abstract: Humans constantly interact with their surroundings. Existing end-to-end multi-person human mesh recovery methods, typically based on the DETR framework, capture inter-human relationships through self-attention across all human queries. However, these approaches model interactions only implicitly and lack explicit reasoning about how humans interact with objects and with each other. In this paper, we propose InterMesh, a simple yet effective framework that explicitly incorporates human-environment interaction information into the human mesh recovery pipeline. By leveraging a human-object interaction detector, InterMesh enriches query representations with structured interaction semantics, enabling more accurate pose and shape estimation. We design lightweight modules, Contextual Interaction Encoder and Interaction-Guided Refiner, to integrate these features into existing HMR architectures with minimal overhead. We validate our approach through extensive experiments on 3DPW, MuPoTS, CMU Panoptic, Hi4D, and CHI3D datasets, demonstrating remarkable improvements over state-of-the-art methods. Notably, InterMesh reduces MPJPE by 9.9% on CMU Panoptic and 8.2% on Hi4D, highlighting its effectiveness in scenarios with complex human-object and inter-human interactions.
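
For readers unfamiliar with the metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joints. The sketch below shows the basic computation and how a relative reduction would be derived; it assumes the reported percentages are relative reductions, omits any root or Procrustes alignment a benchmark protocol may apply, and uses made-up joint coordinates.

```python
import math


def mpjpe(pred_joints, gt_joints):
    """Mean Euclidean distance between corresponding 3D joints."""
    if not pred_joints or len(pred_joints) != len(gt_joints):
        raise ValueError("joint lists must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred_joints, gt_joints)) / len(pred_joints)


def relative_reduction(baseline_error, new_error):
    """Fractional error reduction relative to a baseline."""
    return (baseline_error - new_error) / baseline_error


# Toy example with two joints (coordinates are illustrative only).
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
print(mpjpe(pred, gt))  # joint errors are 3 and 4, so the mean is 3.5
```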


[38] Open-Source Image Editing Models Are Zero-Shot Vision Learners cs.CV | cs.CL

Wei Liu, Jiaxin Lin, Rui Chen

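TL;DR: This paper systematically evaluates the zero-shot ability of three open-source image-editing models (Qwen-Image-Edit, FireRed-Image-Edit, LongCat-Image-Edit), without any fine-tuning, on dense visual prediction tasks: monocular depth estimation, surface normal estimation, and semantic segmentation. The results show that these models have non-trivial zero-shot visual understanding, and on some tasks they even surpass fine-tuned or instruction-tuned models.

Details

Motivation: Existing evidence that generative models have zero-shot vision abilities relies on closed-source models or requires task-specific instruction tuning. This paper asks whether publicly available open-source image-editing models also have such abilities out of the box.

Result: On NYUv2 surface normal estimation, FireRed-Image-Edit achieves a mean angular error of 17.69°, surpassing the fine-tuned Marigold (20.86°) and matching the instruction-tuned Vision Banana (17.78°). On NYUv2 depth estimation, LongCat-Image-Edit obtains δ1 = 0.822 after affine alignment; on DIODE Indoor, Qwen-Image-Edit reaches δ1 = 0.868. On Cityscapes semantic segmentation, Qwen-Image-Edit reaches 25.7 mIoU over 19 classes and 49.5 mIoU at a coarser 7-category level.

Insight: The contribution is a systematic demonstration that open-source image-editing models have zero-shot visual understanding, and that this ability may be a general property emerging from image-editing pretraining rather than an accident of one particular model. This gives empirical support for using open-source generative models as zero-shot vision foundation models.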

Abstract: Recent studies have shown that large generative models can solve vision tasks they were not explicitly trained for. However, existing evidence relies on closed-source models (Veo3, Nano Banana Pro) or requires task-specific instruction tuning, leaving open whether publicly available image-editing models possess zero-shot vision abilities out of the box. We conduct a systematic evaluation of three open-source image-editing models – Qwen-Image-Edit, FireRed-Image-Edit, and LongCat-Image-Edit – on dense visual prediction tasks without any fine-tuning. We benchmark monocular depth estimation on NYUv2 and DIODE, surface normal estimation on NYUv2, and semantic segmentation on Cityscapes, covering both geometric and semantic scene understanding. Results show that open-source image-editing models exhibit non-trivial zero-shot visual understanding. On NYUv2 surface normals, FireRed-Image-Edit achieves a mean angular error of 17.69°, surpassing the fine-tuned Marigold (20.86°) and matching the instruction-tuned Vision Banana (17.78°) without any task-specific training. On NYUv2 depth estimation, LongCat-Image-Edit obtains δ1 = 0.822 with affine alignment, and Qwen-Image-Edit leads on DIODE Indoor (δ1 = 0.868). On Cityscapes semantic segmentation, Qwen-Image-Edit reaches 25.7 mIoU at the 19-class level and 49.5 mIoU at a coarser 7-category level. By comparing three independently trained editors, we test whether zero-shot vision ability is an emergent property of image-editing pretraining rather than a model-specific artifact. Code, evaluation scripts, and all results are publicly released to serve as a reproducible baseline for future work.
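
The depth numbers above combine two standard ingredients: an affine (scale and shift) alignment of the prediction to the ground truth, and the δ1 accuracy, i.e. the fraction of pixels whose ratio max(pred/gt, gt/pred) is below 1.25. The sketch below shows one common way to compute both with a closed-form least-squares fit. Whether the paper aligns in depth or disparity space, and how it masks invalid pixels, is not stated in the abstract, so those details are assumptions here.

```python
def affine_align(pred, gt):
    """Least-squares scale s and shift t so that s * pred + t best matches gt."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gt) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0:
        raise ValueError("prediction is constant; scale is undefined")
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    s = cov / var_p
    t = mean_g - s * mean_p
    return [s * p + t for p in pred]


def delta1(pred, gt, threshold=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) < threshold."""
    hits = sum(1 for p, g in zip(pred, gt)
               if p > 0 and g > 0 and max(p / g, g / p) < threshold)
    return hits / len(gt)


# Toy "depth maps" flattened to lists: gt is an exact affine map of pred.
pred = [1.0, 2.0, 3.0, 4.0]
gt = [3.0, 5.0, 7.0, 9.0]
print(delta1(pred, gt))                      # raw prediction: no pixel passes
print(delta1(affine_align(pred, gt), gt))    # after alignment: all pixels pass
```

The toy case shows why alignment matters for models that output relative rather than metric depth: a prediction with the right structure but the wrong scale and offset scores zero before alignment and perfectly after it.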


[39] Lightning Unified Video Editing via In-Context Sparse Attention cs.CV

Shitong Shao, Zikai Zhou, Haopeng Li, Yingwei Song, Wenliang Zhong

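TL;DR: This paper proposes In-context Sparse Attention (ISA), a near-lossless empirical sparse attention framework designed for the in-context learning paradigm in video editing. It substantially reduces computational cost through a dynamic routing mechanism based on query sharpness. Building on ISA, the authors construct the LIVEditor model and train it on a large-scale, high-quality dataset to achieve efficient video editing.

Details

Motivation: Current video editing methods based on in-context learning incur quadratic attention costs, which form a critical computational bottleneck; an efficient way to accelerate inference is needed.

Result: Experiments show that LIVEditor cuts attention-module latency by about 60% while surpassing existing state-of-the-art methods on EditVerseBench, IVE-Bench, and VIE-Bench, giving near-lossless acceleration without compromising visual fidelity.

Insight: The core contribution is the ISA framework, which rests on two findings: context tokens are far less salient than source tokens, and query sharpness correlates with approximation error. These motivate an efficient context pre-pruning step and a dynamic query grouping and routing mechanism driven by query sharpness (high-error queries use full attention, low-error queries use an efficient 0-th order Taylor sparse attention). The data pipeline for building a large-scale, high-quality video editing dataset is a further contribution.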

Abstract: Video editing has evolved toward In-Context Learning (ICL) paradigms, yet the resulting quadratic attention costs create a critical computational bottleneck. In this work, we propose In-context Sparse Attention (ISA), the first near-lossless empirical sparse framework tailored for ICL video editing. Our design is grounded in two key insights: first, context tokens exhibit significantly lower saliency than source tokens; second, we theoretically prove and empirically validate that query sharpness correlates with approximation error. Motivated by these findings, ISA implements an efficient pre-selection strategy to prune redundant context, followed by a dynamic query grouping mechanism that routes high-error queries to full attention and low-error ones to a computationally efficient 0-th order Taylor sparse attention. Furthermore, we build LIVEditor, a novel lightning video editing model, via ISA and a proposed video-editing data pipeline that curates a 1.7M high-quality dataset. Extensive experiments demonstrate that LIVEditor achieves a ~60% reduction in attention-module latency while surpassing state-of-the-art methods across EditVerseBench, IVE-Bench, and VIE-Bench, delivering near-lossless acceleration without compromising visual fidelity.


[40] From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation cs.CV | cs.AI

Zishen Qu, Xuesong Li, Haijian Gu, Hongwei Kang, Quan Meng

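TL;DR: This paper proposes RLFSeg, a framework that uses Rectified Flow to learn a direct mapping from the image to the segmentation mask in latent space. It avoids the noise-denoise process and time-step optimization of diffusion models and substantially improves text-based image segmentation, especially in zero-shot scenarios.

Details

Motivation: Methods that use diffusion models as feature extractors inherit generative properties that limit performance on discriminative segmentation tasks.

Result: RLFSeg substantially outperforms previous diffusion-based methods in zero-shot scenarios, and with label refinement and an Adaptive One-Step Sampling strategy it achieves higher accuracy even with a single inference step.

Insight: The paper brings Rectified Flow to segmentation and redirects a pretrained generative model to a discriminative task without modifying the model structure, showing application potential and research value.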

Abstract: Text-based image segmentation aims to delineate object boundaries within an image from text prompts, offering higher flexibility and broader application scope compared to traditional fixed-category segmentation tasks. Recent studies have shown that diffusion models (e.g., Stable Diffusion) can provide rich multimodal semantic features, leading to studies that use diffusion models as feature extractors for segmentation tasks. Such methods, however, inherit the generative nature of diffusion models, which is harmful to discriminative segmentation tasks. In response, we propose RLFSeg, a novel framework that leverages Rectified Flow to learn a direct mapping from the image to the segmentation mask within the latent space. The model is thus freed from the noise-denoise process and the need to optimize the time step of diffusion models, resulting in substantially better performance than previous diffusion-based methods, especially in zero-shot scenarios. By introducing label refinement and an Adaptive One-Step Sampling strategy, the model achieves higher accuracy even with a single inference step. The framework redirects a pretrained generative model to the discriminative segmentation task with zero modification to the model structure, thus revealing promising application potential and significant research value.
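
Why a rectified flow can be sampled in a single step is easiest to see on a toy problem. The sketch below integrates a velocity field with Euler steps; when trajectories are straight lines from source to target, one step lands on the same endpoint as several. The closed-form velocity is a stand-in for a learned network, and none of this reproduces RLFSeg's Adaptive One-Step Sampling, whose details are not given in the abstract.

```python
def euler_sample(x0, velocity, num_steps=1):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h
        v = velocity(x, t)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


def make_straight_velocity(target):
    """Toy velocity field whose trajectories are straight lines to `target`.

    On the path x_t = (1 - t) * x0 + t * target this evaluates to the
    constant (target - x0). It is only evaluated at t < 1 by euler_sample.
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# Hypothetical 3-dimensional "latents": source (image) and target (mask).
source = [0.0, 0.0, 0.0]
target = [1.0, -2.0, 0.5]
one_step = euler_sample(source, make_straight_velocity(target), num_steps=1)
four_step = euler_sample(source, make_straight_velocity(target), num_steps=4)
print(one_step, four_step)
```

With a learned velocity network the trajectories are only approximately straight, so one-step sampling trades some accuracy for speed; that gap is presumably what the paper's sampling strategy is designed to close.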


[41] DiCLIP: Diffusion Model Enhances CLIP’s Dense Knowledge for Weakly Supervised Semantic Segmentation cs.CV

Zhiwei Yang, Pengfei Song, Yucong Meng, Kexue Fu, Shuo Wang

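TL;DR: DiCLIP is a new weakly supervised semantic segmentation framework that uses a diffusion model to enhance CLIP's dense knowledge in both the visual and text modalities, improving class activation map generation from image-level labels. It comprises a visual correlation enhancement module and a text semantic augmentation module, achieves state-of-the-art performance on PASCAL VOC and MS COCO, and significantly reduces training cost.

Details

Motivation: Existing CLIP-based weakly supervised semantic segmentation methods use only CLIP's vision-language pairing for dense localization and ignore its inherently limited dense knowledge in both the visual and text modalities, which leads to suboptimal class activation maps.

Result: DiCLIP surpasses state-of-the-art methods on the PASCAL VOC and MS COCO benchmarks and significantly reduces training cost.

Insight: The contribution lies in using the spatial consistency and generative power of diffusion models to enhance CLIP's dense knowledge: an Attention Clustering Refinement module extracts diverse correlation maps that act as a bias to refine the distribution of visual features, and a dynamic key-value cache model shifts CAM generation from patch-text matching to a visual knowledge retrieval paradigm.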

Abstract: Weakly Supervised Semantic Segmentation (WSSS) with image-level labels typically leverages Class Activation Maps (CAMs) to achieve pixel-level predictions. Recently, Contrastive Language-Image Pre-training (CLIP) has been introduced to generate CAMs in WSSS. However, previous WSSS methods solely adopt CLIP’s vision-language paired property for dense localization, neglecting its inherently limited dense knowledge across both visual and text modalities, which renders CAM generation suboptimal. In this work, we propose DiCLIP, a novel WSSS framework that leverages the generative diffusion model to enhance CLIP’s dense knowledge across two modalities. Specifically, Visual Correlation Enhancement (VCE) and Text Semantic Augmentation (TSA) modules are proposed for dense prediction enhancement. To improve the spatial awareness of visual features, our VCE module utilizes diffusion’s reliable spatial consistency to mitigate the over-smoothing issue in CLIP’s attention. It designs the Attention Clustering Refinement (ACR) module to reliably extract diverse correlation maps from the diffusion model. The correlation maps act as a diversity bias for CLIP’s self-attention, recursively pushing its visual features towards a more discriminative dense distribution. To augment the semantics of text embeddings, our TSA module argues that a single text modality is insufficient to encompass the variability of visual categories. Thus, we leverage diffusion’s generative power to maintain a dynamic key-value cache model, shifting CAM generation from a patch-text matching mechanism to a novel visual knowledge retrieval paradigm. With these enhancements, DiCLIP not only outperforms state-of-the-art methods on PASCAL VOC and MS COCO but also significantly reduces training costs. Code is publicly available at https://github.com/zwyang6/DiCLIP.


[42] Advancing Aesthetic Image Generation via Composition Transfer cs.CV

Kai Zou, Zhiwei Zhao, Bin Liu, Nenghai Yu

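TL;DR: This paper proposes Composer, a framework that models image composition in a semantic-agnostic manner to improve the aesthetic quality of generated images. It supports composition transfer by extracting composition-aware representations from a reference image, and it uses Large Vision-Language Models for theme-driven composition retrieval. In addition, text-to-composition fine-tuning enables composition planning in a reference-free mode.

Details

Motivation: Existing methods usually enhance composition through implicit learning or semantics-based layout control rather than modeling composition explicitly, which limits the flexibility and precision of aesthetic image generation.

Result: Experiments show that Composer significantly improves aesthetic quality in text-to-image tasks and supports personalized composition control and transfer, offering users precision and flexibility in the creative process.

Insight: The contributions include semantic-agnostic composition modeling, a tailored conditional guidance module built on pre-trained diffusion models, and explicit composition planning through the in-context learning of Large Vision-Language Models. Building a high-quality training dataset with generative models is a further contribution.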

Abstract: Composition is a cornerstone of visual aesthetics, influencing the appeal of an image. While its principles operate independently of specific content, in practice, composition is often coupled with semantics. As a result, existing methods often enhance composition either through implicit learning or by semantics-based layout control, rather than explicitly modeling composition itself. To address this gap, we introduce Composer, a framework rooted in aesthetic theory, designed to model composition in a semantic-agnostic manner. First, it supports composition transfer by extracting key composition-aware representations from a reference image and leveraging a tailored conditional guidance module to control composition based on pre-trained diffusion models. Second, when users specify only text themes without a composition reference, Composer supports theme-driven composition retrieval by leveraging the in-context learning capabilities of Large Vision-Language Models (LVLMs), achieving explicit composition planning. To enhance composition in a reference-free mode, we conduct text-to-composition fine-tuning on the trained control module to enable implicit composition planning. Furthermore, we curated a high-quality dataset comprising 2 million image-text pairs using state-of-the-art generative models to support model training. Experimental results demonstrate that Composer significantly enhances aesthetic quality in text-to-image tasks and facilitates personalized composition control and transfer, offering users precision and flexibility in the creative process.


[43] Temporal Structure Matters for Efficient Test-Time Adaptation in Wearable Human Activity Recognition cs.CV | cs.HC | cs.LG

Zishu Zhou, Zaipeng Xie, Xuanyao Jie

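TL;DR: This paper proposes SIGHT, a lightweight, backpropagation-free test-time adaptation framework that addresses the performance degradation of wearable human activity recognition (WHAR) models under cross-user distribution shift. It exploits the inter-window temporal structure inherent in WHAR streams, treating temporal continuity as a feature-conditioned inference signal rather than merely an output-space smoothing prior, which enables real-time deployment on edge devices.

Details

Motivation: Existing test-time adaptation methods largely inherit assumptions from vision tasks and underexploit the temporal structure inherent in wearable human activity recognition streams, so model performance degrades under real-world cross-user distribution shift.

Result: Evaluations on real-world datasets show that SIGHT outperforms existing test-time adaptation baselines while reducing computational and memory costs.

Insight: The contribution lies in revisiting temporal structure as a feature-conditioned inference signal, using temporal continuity and observation-induced feature deviations to guide where prediction refinement is routed, and proposing a geometry-aware transition routing mechanism based on prototype alignment and stream-level marginal habit tracking, all in a lightweight, backpropagation-free form.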

Abstract: Wearable human activity recognition (WHAR) models often suffer from performance degradation under real-world cross-user distribution shifts. Test-time adaptation (TTA) mitigates this degradation by adapting models online using unlabeled test streams, yet existing methods largely inherit assumptions from vision tasks and underexploit the inherent inter-window temporal structure in WHAR streams. In this paper, we revisit such temporal structure as a feature-conditioned inference signal rather than merely an output-space smoothing prior. We derive the insight that temporal continuity and observation-induced feature deviations provide complementary cues for determining when to preserve or release temporal inertia and where to route prediction refinement during likely transitions. Building upon this insight, we propose SIGHT, a lightweight and backpropagation-free TTA framework for WHAR, enabling real-time edge deployment. SIGHT estimates predictive surprise by comparing the current feature with a prototype-based expected state, and then uses the resulting feature deviation to guide geometry-aware transition routing based on prototype alignment and stream-level marginal habit tracking. Evaluations on real-world datasets confirm that SIGHT outperforms existing TTA baselines while reducing computational and memory costs.


[44] UniPCB: A Generation-Assisted Detection Framework for PCB Defect Inspection cs.CV

Huan Zhang, Lianghong Tan, Yichu Xu, Jiangzhong Cao, Huanqi Wu

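TL;DR: This paper proposes UniPCB, a generation-assisted detection framework that addresses scarce and imbalanced data and insufficient feature representation under complex backgrounds in PCB defect inspection. It integrates controlled defect generation with task-specific defect detection: a multi-modal condition generator synthesizes defect samples to enlarge the training set, and an improved detection architecture raises performance.

Details

Motivation: PCB defect inspection suffers from scarce and imbalanced defect samples and from insufficient feature representation under complex circuit backgrounds; existing methods do not address the data bottleneck and the architectural limitations together.

Result: Experiments on the DsPCBSD+ dataset show that UniPCB reaches an mAP@0.5 of 98.0% and an mAP@0.5:0.95 of 61.8% on defect detection, surpassing all compared methods; the generation branch attains an FID of 129.61 and an SSIM of 0.619, outperforming existing conditional generation methods.

Insight: The contributions include: 1) a multi-modal condition generator that extracts edge, depth, and text conditions in parallel, with a ScaleEncoder and Condition Modulation enabling structurally aligned, defect-aware synthesis; 2) on the detection side, Inverted Residual Shift Attention, which combines self-attention with shift-wise convolution to capture global context and local texture, and a Cross-level Complementary Fusion Block for selective cross-level feature fusion; 3) joint benefit from generation and detection, since the synthesized samples directly enrich the training set and the improvements compound.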

Abstract: Printed Circuit Board (PCB) defect inspection faces two compounding challenges: scarce and imbalanced defect samples that limit model training, and insufficient feature representation under complex circuit backgrounds. Existing generation methods rely on single-modality conditions with coarse structural control, while detection methods improve architectures without addressing the data bottleneck. To resolve both challenges jointly, we propose a generation-assisted PCB defect inspection framework that integrates controlled defect synthesis with task-specific defect detection. On the generation side, a Multi-modal Condition Generator extracts complementary edge, depth, and text conditions in parallel. A ScaleEncoder then embeds these conditions into the diffusion U-Net at four resolutions, and a Condition Modulation applies FiLM-style spatially-adaptive modulation at each scale, enabling structurally aligned and defect-aware sample synthesis. On the detection side, an Inverted Residual Shift Attention couples self-attention with shift-wise convolution to jointly capture global context and local texture, and a Cross-level Complementary Fusion Block generates pixel-level gates for selective cross-level feature fusion. The synthesized samples directly enrich the detection training set, so that improvements in generation compound with improvements in detection. Extensive experiments on DsPCBSD+ demonstrate that UniPCB achieves mAP@0.5 of 98.0% and mAP@0.5:0.95 of 61.8% on defect detection, surpassing all compared methods, while the generation branch attains an FID of 129.61 and SSIM of 0.619, outperforming existing conditional generation approaches.
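
The "FiLM-style spatially-adaptive modulation" mentioned in the abstract can be read as a per-location scale and shift of feature maps, where the scale `gamma` and shift `beta` would be predicted from the conditions. The sketch below shows only that arithmetic on nested lists; how UniPCB actually predicts the modulation parameters is not described in the abstract, so `gamma` and `beta` are simply given as inputs here and the function name is hypothetical.

```python
def film_modulate(features, gamma, beta):
    """Spatially-adaptive FiLM: out[c][y][x] = gamma[c][y][x] * f[c][y][x] + beta[c][y][x].

    All three arguments are nested lists shaped [channels][height][width].
    """
    return [
        [
            [g * f + b for f, g, b in zip(f_row, g_row, b_row)]
            for f_row, g_row, b_row in zip(f_ch, g_ch, b_ch)
        ]
        for f_ch, g_ch, b_ch in zip(features, gamma, beta)
    ]


# One channel, 2x2 feature map; gamma and beta differ at every location.
features = [[[1.0, 2.0], [3.0, 4.0]]]
gamma = [[[2.0, 0.0], [1.0, 1.0]]]
beta = [[[0.0, 5.0], [-3.0, 0.5]]]
print(film_modulate(features, gamma, beta))
```

Classic FiLM uses one scale and shift per channel; letting them vary per pixel is what makes the modulation spatially adaptive, so edge or depth conditions can affect different regions differently.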


[45] CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering cs.CV

Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Libo Qin

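TL;DR: This paper proposes CAST, a training-free, plug-and-play method that mitigates object hallucination in Large Vision-Language Models. It uses the enhanced visual attention pattern that these models show when answering caption queries to steer their visual attention in general question answering. Across multiple benchmarks and models, it reduces object hallucination by 6.03% on average, reaching state-of-the-art performance at very low inference cost.

Details

Motivation: Large Vision-Language Models often generate content that deviates from the visual information, producing object hallucinations. Existing methods depend on expensive manual annotation and training or on decoding strategies that increase inference time; this paper aims for a training-free, low-cost mitigation.

Result: On five widely used Large Vision-Language Models and five benchmarks covering discriminative and generative tasks, CAST reduces object hallucination by 6.03% on average, reaching state-of-the-art performance while adding very little inference cost and preserving the models' other foundational capabilities.

Insight: The core contribution is the observation that visual attention is enhanced when the model answers caption queries, and the use of it: probing identifies attention heads that are sensitive to caption queries and estimates optimized steering directions, strengthening fine-grained visual perception without training. This offers an efficient, general attention-steering paradigm for mitigating hallucination.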

Abstract: Although Large Vision-Language Models (LVLMs) have demonstrated remarkable performance on downstream tasks, they frequently produce content that deviates from visual information, leading to object hallucination. To tackle this, recent works mostly depend on expensive manual annotation and training, or on decoding strategies that significantly increase inference time. In this work, we observe that LVLMs’ attention to visual information is significantly enhanced when answering caption queries compared to non-caption queries. Inspired by this phenomenon, we propose Caption-guided Visual Attention Steering (CAST), a training-free, plug-and-play hallucination mitigation method that leverages the attention activation pattern corresponding to caption queries to enhance LVLMs’ visual perception capability. Specifically, we use probing techniques to identify attention heads that are highly sensitive to caption queries and estimate optimized steering directions for their outputs. This steering strengthens LVLMs’ fine-grained visual perception capabilities, thereby effectively mitigating object hallucination. CAST reduces object hallucination by an average of 6.03% across five widely used LVLMs and five benchmarks including both discriminative and generative tasks, demonstrating state-of-the-art performance while adding little inference cost and preserving other foundational capabilities.


[46] Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern cs.CV

Xiaopei Zhu, Guanning Zeng, Zhanhao Hu, Jun Zhu, Xiaolin Hu

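TL;DR: This paper proposes a physical adversarial attack on visible-thermal (RGB-T) object detectors, realized as adversarial clothing with a non-overlapping RGB-T pattern (NORP). It builds 3D RGB-T models of the human body and clothing to simulate full-view attacks and proposes a spatial discrete-continuous optimization (SDCO) algorithm to optimize the pattern. Experiments show high attack success rates in both the digital and physical worlds against RGB-T detectors with various fusion architectures, and a fusion-stage ensemble method improves transferability across detectors with different architectures.

Details

Motivation: Visible-thermal (RGB-T) detectors are crucial in applications such as autonomous driving, but their security in the physical world has been largely overlooked. This paper explores their vulnerability to physical adversarial attacks.

Result: A systematic evaluation on RGB-T detectors with various fusion architectures shows high attack success rates in both the digital and physical worlds.

Insight: The core contribution is the non-overlapping RGB-T pattern (NORP) design, which avoids the light reduction caused by overlapping patterns (ORP), together with the spatial discrete-continuous optimization (SDCO) method and the fusion-stage ensemble strategy for generating and transferring adversarial examples effectively.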

Abstract: Visible-thermal (RGB-T) object detection is a crucial technology for applications such as autonomous driving, where multimodal fusion enhances performance in challenging conditions like low light. However, the security of RGB-T detectors, particularly in the physical world, has been largely overlooked. This paper proposes a novel approach to RGB-T physical attacks using adversarial clothing with a non-overlapping RGB-T pattern (NORP). To simulate full-view (0°–360°) RGB-T attacks, we construct 3D RGB-T models for humans and adversarial clothing. NORP is a new adversarial pattern design using distinct visible and thermal materials without overlap, avoiding the light reduction in overlapping RGB-T patterns (ORP). To optimize the NORP on adversarial clothing, we propose a spatial discrete-continuous optimization (SDCO) method. We systematically evaluated our method on RGB-T detectors with different fusion architectures, demonstrating high attack success rates both in the digital and physical worlds. Additionally, we introduce a fusion-stage ensemble method that enhances the transferability of adversarial attacks across unseen RGB-T detectors with different fusion architectures.


[47] FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation cs.CV | cs.AI

Yuanzhi Wang, Xuhua Ren, Jiaxiang Cheng, Bing Ma, Kai Yu

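TL;DR: FaithfulFaces is a pose-faithful facial identity preservation learning framework for text-to-video (T2V) generation. It addresses the identity distortion that existing methods show in complex dynamic scenes, especially under large facial pose variations or occlusions. Its core is a pose-shared identity aligner that uses a pose-shared dictionary and a pose variation-identity invariance constraint to map single-view inputs into a global facial pose representation with explicit Euler angle embeddings, providing a pose-faithful facial prior that guides the generative model toward robust identity-preserving generation.

Details

Motivation: Existing identity-preserving text-to-video generation (IPT2V) methods often suffer significant identity distortion under large facial pose variations or occlusions and cannot keep identity consistent in complex dynamic scenes.

Result: Extensive experiments show that FaithfulFaces achieves state-of-the-art performance in identity consistency and structural clarity, even when pose changes and occlusions occur.

Insight: The contribution lies in the pose-shared identity aligner, which aligns facial poses across views through a pose-shared dictionary and a pose variation-identity invariance constraint, and in using explicit Euler angle embeddings to build a global facial pose representation as a generation prior. The paper also curates a high-quality video dataset with rich facial pose diversity to support training.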

Abstract: Identity-preserving text-to-video generation (IPT2V) empowers users to produce diverse and imaginative videos with consistent human facial identity. Despite recent progress, existing methods often suffer from significant identity distortion under large facial pose variations or facial occlusions. In this paper, we propose FaithfulFaces, a pose-faithful facial identity preservation learning framework to improve IPT2V in complex dynamic scenes. The key of FaithfulFaces is a pose-shared identity aligner that refines and aligns facial poses across distinct views via a pose-shared dictionary and a pose variation-identity invariance constraint. By mapping single-view inputs into a global facial pose representation with explicit Euler angle embeddings, FaithfulFaces provides a pose-faithful facial prior that guides generative foundations toward robust identity-preserving generation. In particular, we develop a specialized pipeline to curate a high-quality video dataset featuring substantial facial pose diversity. Extensive experiments demonstrate that FaithfulFaces achieves state-of-the-art performance, maintaining superior identity consistency and structural clarity even as pose changes and occlusions occur.


[48] Not Every Subject Should Stay: Machine Unlearning for Noisy Engagement Recognition cs.CV

Alexander Vedernikov

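TL;DR: This paper studies how, in engagement recognition datasets with noisy annotations, subject-level machine unlearning can remove the influence of an entire problematic subject without full retraining. On DAiSEE and EngageNet, with TCCT-Net as a fixed platform, the authors rank harmful subjects with a model-dependent proxy, apply a lightweight approximate unlearning update, and compare against an oracle model retrained from scratch on the retained subjects only. In a representative forget-set setting, the unlearned model recovers 89.3% and 92.5% of the oracle gain on EngageNet and DAiSEE, respectively, at roughly one quarter of the retraining cost.

Details

Motivation: Engagement recognition datasets are typically subject-indexed and often contain noisy, subjective supervision, which makes post-hoc dataset revision a practical problem. Existing noisy-label and data-cleaning methods mostly operate at the sample level before or during training and do not directly address a different question: once a model has been trained, can the influence of an entire problematic subject be removed without full retraining?

Result: The evaluation uses DAiSEE and EngageNet with TCCT-Net as a fixed platform. In representative K=3 forget-set settings, the unlearned model recovers 89.3% of the oracle gain on EngageNet and 92.5% on DAiSEE, at roughly one quarter of the retraining cost. Across the tested small-audit regimes, effectiveness is strongest at an intermediate forget-set size.

Insight: The contribution lies in using subject-level machine unlearning as a post-hoc sanitization mechanism for noisy annotations in engagement recognition. A model-dependent proxy ranks harmful subjects and a lightweight approximate unlearning update removes their influence at low cost. Viewed objectively, the method is an efficient alternative for correcting already-trained models, and its effectiveness depends on subject selection quality and the removal regime.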

Abstract: Engagement recognition datasets are typically subject-indexed and often contain noisy, subjective supervision, making post-hoc dataset revision a practical problem. Existing noisy-label and data-cleaning methods largely operate at the sample level before or during training, but do not directly address a different question: once a model has already been trained, can the influence of an entire problematic subject be removed without full retraining? We study this setting through subject-level machine unlearning as a post-hoc sanitization mechanism for engagement recognition. Starting from a baseline trained on all subjects, we rank candidate harmful subjects using a model-dependent proxy, apply a lightweight approximate unlearning update, and compare the result against an oracle model retrained from scratch on the retained subjects only. We instantiate this protocol on DAiSEE and EngageNet using Tensor-Convolution and Convolution-Transformer Network (TCCT-Net) as a fixed platform and evaluate three matched model states under the same removal scenario: baseline, unlearned, and oracle. In representative K=3 forget-set settings, the unlearned model recovers 89.3% and 92.5% of the oracle gain on EngageNet and DAiSEE, respectively, at roughly one quarter of retraining cost. Across the tested small-audit regimes, effectiveness is strongest at an intermediate forget-set size, indicating that approximate subject-level unlearning is a useful low-cost correction mechanism, but one whose benefit depends on subject selection quality and removal regime.
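
The phrase "recovers X% of the oracle gain" is most naturally read as the share of the baseline-to-oracle improvement that the unlearned model achieves. The sketch below encodes that reading; the formula is an assumption about the paper's definition, and the accuracy values are invented purely to illustrate the arithmetic.

```python
def oracle_gain_recovered(baseline, unlearned, oracle):
    """Fraction of the oracle's improvement over the baseline that the
    unlearned model achieves: (unlearned - baseline) / (oracle - baseline)."""
    gain = oracle - baseline
    if gain == 0:
        raise ValueError("oracle and baseline are equal; gain is undefined")
    return (unlearned - baseline) / gain


# Hypothetical accuracies (not from the paper): the oracle gains 4 points
# over the baseline and the unlearned model gains 3.7 of them.
print(round(oracle_gain_recovered(baseline=60.0, unlearned=63.7, oracle=64.0), 3))
```

Under this reading a value above 1.0 would mean the approximate update outperformed retraining, and a negative value would mean unlearning hurt the model relative to doing nothing.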


[49] Anny-Fit: All-Age Human Mesh Recovery cs.CV

Laura Bravo-Sánchez, Matthieu Armando, Romain Brégier, Grégory Rogez, Serena Yeung-Levy

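TL;DR: Anny-Fit is a camera-space optimization framework for all-age, multi-person 3D human mesh recovery. It jointly optimizes all individuals in the camera coordinate system, guided by several forms of expert knowledge from off-the-shelf networks (metric depth maps, instance segmentation, 2D keypoints, and VLM-derived age/gender attributes). This addresses the depth-scale ambiguity and body-proportion errors that arise in real-world all-age scenes when methods optimize each person independently and assume adult subjects. Beyond improving 2D reprojection, relative depth ordering, 3D estimation, and shape estimation, the pseudo-ground-truth annotations it produces can distill semantic knowledge into an HMR model, enabling zero-shot adaptation from adult-only to all-age models.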

Details

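Motivation: Existing single-image 3D human pose and shape recovery methods usually assume adult subjects and optimize each person independently, so they cannot handle real-world all-age scenes, where body proportions and depth must be resolved jointly.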

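Result: Across multiple datasets, Anny-Fit improves 2D reprojection accuracy (+13 to 16) and relative depth ordering (+6 to 7), reduces 3D estimation error (-9 to -29), and improves shape estimation (+25 to +82), producing more coherent scenes.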

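Insight: The core contribution is a framework that jointly optimizes multiple people in the camera coordinate system and combines complementary off-the-shelf expert signals (notably VLM-derived semantic attributes) to constrain the depth-scale ambiguity of all-age scenes. The paper also shows that the pseudo-ground truth produced by the framework can distill semantic knowledge into an HMR model, enabling zero-shot adaptation from adult-trained to all-age models and bridging adult-only and all-age modeling.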

Abstract: Recovering 3D human pose and shape from a single image remains a cornerstone of human-centric vision, yet most methods assume adult subjects and optimize each person independently. These assumptions fail in real-world, all-age scenes, where body proportions and depth must be resolved jointly. We introduce Anny-Fit, a multi-person, camera-space optimization framework for all-age 3D human mesh recovery (HMR). Unlike existing per-person fitting methods, Anny-Fit jointly optimizes all individuals directly in the camera coordinate system, enforcing global spatial consistency. At the core of our approach is the use of multiple forms of expert knowledge – including metric depth maps, instance segmentation, 2D keypoints, and VLM-derived semantic attributes such as age and gender – each obtained from dedicated off-the-shelf networks. These complementary signals jointly guide the optimization, constraining the depth-scale ambiguity characteristic of all-age scenes. Across diverse datasets, Anny-Fit consistently improves 2D reprojection accuracy (+13 to 16), relative depth ordering (+6 to 7), 3D estimation error (-9 to -29) and shape estimation (+25 to +82), producing more coherent scenes. Finally, we show that VLM-based semantic knowledge can be distilled into an HMR model via the pseudo-ground-truth annotations produced by Anny-Fit on training data, enabling it to learn semantically meaningful shape parameters while improving HMR performance. Our approach bridges adult-only and all-age modeling by enabling zero-shot adaptation of adult-trained HMR pipelines to the full age spectrum without retraining. Code is publicly available at https://github.com/naver/anny-fit.


[50] VC-FeS: Viewpoint-Conditioned Feature Selection for Vehicle Re-identification in Thermal Vision cs.CV | eess.SY

Yasod Ginige, Ransika Gunasekara, Darsha Hewavitharana, Manjula Ariyarathne, Peshala Jayasekara

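TL;DR: The paper proposes VC-FeS, a method for vehicle re-identification (ReID) in thermal imagery. It builds viewpoint-conditioned feature vectors and performs area-specific feature comparisons in separate feature spaces, making effective use of pretrained ViT feature extractors while adapting them to thermal-domain problems such as missing color, weak texture, and large viewpoint variation.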

Details

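Motivation: In single-channel thermal images, the absence of color, de-emphasized texture, and viewpoint variation make objects of the same category look highly similar, so existing re-identification methods perform poorly. The paper targets vehicle re-identification under these conditions.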

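Result: Tested on the RGBNT100 (IR) vehicle dataset and a self-collected thermal maritime dataset, the method surpasses state-of-the-art (SOTA) methods in mAP by 19.7% and 12.8%, respectively.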

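Insight: The core contribution is viewpoint-conditioned feature selection combined with area-specific feature comparison, which allows RGB-pretrained models such as ViT to be transferred and adapted to the thermal domain. The authors also plan to release their thermal dataset, described as the first of its kind for maritime vessel identification.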

Abstract: Identification of less-articulated objects using single-channel images, such as thermal images, is important in many applications, such as surveillance. However, in this domain, existing methods show poor performance due to high similarity among objects of the same category in the absence of color information (overlooking shape information) and de-emphasized texture information. Furthermore, variability in viewpoint adds more complexity as the features vary from side to side. We address these issues by constructing viewpoint-conditioned feature vectors and area-specific feature comparisons in separate feature spaces. These interventions enable leveraging the advancements of existing RGB-pre-trained ViT feature extractors while effectively adapting them to address the challenges specific to the thermal domain. We test our system with RGBNT100 (IR) vehicle dataset and a thermal maritime dataset acquired by us. Our results surpass the state-of-the-art methods by 19.7% and 12.8% for the above datasets in mAP scores, respectively. We also plan to make our thermal dataset available, the first of its kind for maritime vessel identification.


[51] Hybrid Congestion Classification Framework Using Flow-Guided Attention and Empirical Mode Decomposition cs.CV | cs.AI

Eugene Kofi Okrah Denteh, Blessing Agyei Kyem, Joshua Kofi Asamoah, Armstrong Aboah

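TL;DR: The paper proposes FLO-EMD, a hybrid traffic congestion classification framework that combines optical-flow-guided attention with Empirical Mode Decomposition (EMD) to capture both the spatial context of road scenes and non-stationary traffic motion. Flow-guided channel and spatial attention refine RGB features toward motion-relevant regions, while EMD decomposes aggregated motion statistics into intrinsic temporal components. The resulting embedding is fused with learned spatiotemporal representations to classify the congestion level.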

Details

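Motivation: Existing congestion classifiers have complementary limitations. Vision-based methods usually rely on appearance cues with standard temporal pooling, which biases them toward static infrastructure, whereas signal-based methods characterize temporal dynamics but lack the spatial context needed for scene-level localization. This motivates a unified framework that links motion evidence to spatial feature selection while keeping a data-adaptive temporal representation.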

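Result: On 1,050 five-second clips from four surveillance networks, FLO-EMD reaches 97.5% overall test accuracy (weighted F1 = 0.9742), outperforming established baselines and remaining robust across environmental conditions. Ablation and sensitivity analyses further quantify the contributions of EMD, the number of intrinsic mode functions, and the selected motion descriptors.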

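Insight: The contribution is the combination of optical-flow-guided attention with EMD to model spatial scene context and non-stationary temporal dynamics jointly and in a data-driven way. The hybrid design integrates complementary appearance and motion signals: attention performs motion-aware spatial feature selection, and EMD adaptively decomposes complex temporal patterns.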

Abstract: Accurate traffic congestion classification requires models that jointly capture roadway scene context and non-stationary traffic motion, yet most prior work treats these requirements in isolation. Vision-based methods often depend on appearance cues with standard temporal pooling, which can bias predictions toward static infrastructure, whereas signal-based approaches characterize temporal dynamics but lack the spatial context needed for scene-level localization. These complementary limitations motivate a unified framework that links motion evidence to spatial feature selection while preserving data-adaptive temporal characterization. This study therefore proposes FLO-EMD, a hybrid approach that couples motion-guided attention with empirical, data-driven temporal decomposition. Dense optical flow guides channel and spatial attention so that RGB features are refined toward motion-relevant regions. In parallel, aggregated flow statistics form compact motion traces that are decomposed using Empirical Mode Decomposition (EMD) to extract intrinsic temporal components. The resulting EMD embedding is fused with learned spatiotemporal representations to classify light, medium, and heavy congestion. Experiments on 1,050 five-second clips from four surveillance networks show that FLO-EMD achieves 97.5% overall test accuracy (weighted F1 = 0.9742), outperforming established baselines and remaining robust across diverse environmental conditions; ablation and sensitivity analyses further quantify the contributions of EMD, the number of intrinsic mode functions, and the selected motion descriptors.


[52] Gaze4HRI: Zero-shot Benchmarking Gaze Estimation Neural-Networks for Human-Robot Interaction cs.CV | cs.HC | cs.LG | cs.RO

Berk Sezer, Ali Görkem Küçük, Erol Şahin, Sinan Kalkan

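TL;DR: The paper introduces Gaze4HRI, a large-scale video dataset and benchmark for evaluating the robustness of zero-shot 3D gaze estimation in realistic human-robot interaction (HRI) settings. Every evaluated method fails in at least one key HRI condition, and PureGaze trained on the ETH-X-Gaze dataset is the most robust. The findings challenge the current reliance on complex spatial-temporal modeling and emphasize data diversity and feature-purification frameworks.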

Details

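Motivation: The reliability of zero-shot gaze estimation in real HRI settings is uncertain. Existing benchmarks often overlook key HRI conditions such as dynamic camera viewpoints and moving targets, and cross-dataset evaluations suffer from a complexity gap that prevents them from assessing true robustness.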

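Result: On the Gaze4HRI benchmark, all evaluated methods (including Transformer-based and spatial-temporal approaches) fail in at least one key HRI condition, such as illumination, head-gaze conflict, or camera/target motion, with steeply downward gaze as a universal failure point. The one exception is PureGaze trained on ETH-X-Gaze, which remains robust across all other conditions.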

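Insight: The contribution is a rigorous benchmark built around the variables that matter in HRI. The central finding is that extensive data diversity, as in ETH-X-Gaze, is the primary driver of zero-shot robustness in unconstrained environments, while resilience-enhancing frameworks such as PureGaze's self-adversarial loss for gaze feature purification add a substantial further improvement. This challenges the current emphasis on complex model architectures.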

Abstract: While zero-shot appearance-based 3D gaze estimation offers significant cost-efficiency by directly mapping RGB images to gaze vectors, its reliability in Human-Robot Interaction (HRI) settings remains uncertain. Existing benchmarks frequently overlook fundamental HRI conditions, such as dynamic camera viewpoints and moving targets in video. Furthermore, current cross-dataset evaluations often suffer from a complexity gap, where methods trained on diverse datasets are tested on significantly smaller and less varied sets, failing to assess true robustness. To bridge these gaps, we introduce Gaze4HRI, a large-scale dataset (50+ subjects, 3,000+ videos, 600,000+ frames) designed to evaluate state-of-the-art performance against critical HRI variables: illumination, head-gaze conflict, as well as the motion of camera and gaze target in video. Our benchmark reveals that all evaluated methods fail in at least one condition, identifying steeply-downward gaze as a universal failure point. Notably, PureGaze trained on the ETH-X-Gaze dataset uniquely maintains resilience across all other conditions. These results challenge the recent focus in the literature on complex spatial-temporal modeling and Transformer-based architectures. Instead, our findings suggest that extensive data diversity, as exemplified by the ETH-X-Gaze dataset, serves as the primary driver of zero-shot robustness in unconstrained environments, while resilience-enhancing frameworks, such as PureGaze’s self-adversarial loss for gaze feature purification, provide a substantial further improvement. Ultimately, this study establishes a rigorous benchmark that provides practical guidelines for practitioners as well as reshaping future research. The dataset and codes are available at https://gazeforhri.github.io.


[53] MIRAGE: Retrieval and Generation of Multimodal Images and Texts for Medical Education cs.CV

Miguel Diaz Benito, Cecilia Diana Albelda, Alvaro Garcia Martin, Jesus Bescos Cano, Marcos Escudero-Vinolo

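TL;DR: MIRAGE is a multimodal medical text and image retrieval and generation system intended as an interactive learning tool for medical education. By mapping text and images to a shared latent space, it supports semantic queries, lets users retrieve clinically relevant images from trustworthy sources, generates synthetic images with a medical diffusion model, and provides enriched descriptions from a large language model.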

Details

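Motivation: Medical education lacks diverse, well-annotated, and interactive medical image resources. Traditional medical atlases are impractical, and online image search may return mislabeled or incomplete material.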

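Result: The system builds on a medical version of CLIP fine-tuned on the ROCO dataset (MedICaT-ROCO), together with a medical diffusion model (Prompt2MedImage) and a large language model (Dolly-v2-3b). It provides image retrieval, generation, and description, and supports visual comparison of different medical conditions.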

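Insight: The system relies entirely on publicly available pretrained models, which supports reproducibility and accessibility. It offers medical students without programming skills a free, transparent, and easy-to-use interactive learning tool, with semantic queries and personalized visual learning enabled by the shared latent space.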

Abstract: Access to diverse, well-annotated medical images with interactive learning tools is fundamental for training practitioners in medicine and related fields to improve their diagnostic skills and understanding of anatomical structures. While medical atlases are valuable, they are often impractical due to their size and lack of interactivity, whereas online image search may provide mislabeled or incomplete material. To address this, we propose MIRAGE, a multimodal medical text and image retrieval and generation system that allows users to find and generate clinically relevant images from trustworthy sources by mapping both text and images to a shared latent space, enabling semantically meaningful queries. The system is based on a fine-tuned medical version of CLIP (MedICaT-ROCO), trained with the ROCO dataset, obtained from PubMed Central. MIRAGE allows users to give prompts to retrieve images, generate synthetic ones through a medical diffusion model (Prompt2MedImage) and receive enriched descriptions from a large language model (Dolly-v2-3b). It also supports a dual search option, enabling the visual comparison of different medical conditions. A key advantage of the system is that it relies entirely on publicly available pretrained models, ensuring reproducibility and accessibility. Our goal is to provide a free, transparent and easy-to-use didactic tool for medical students, especially those without programming skills. The system features an interface that enables interactive and personalized visual learning through medical image retrieval and generation. The system is accessible to medical students worldwide without requiring local computational resources or technical expertise, and is currently deployed on Kaggle: http://www-vpu.eps.uam.es/mirage


[54] 3D Ultrasound-Derived Pseudo-CT Synthesis Using a Transformer-Augmented Residual Network for Real-Time Operator Guidance cs.CV

Sapna Sachan, Amulya Kumar Mahto

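TL;DR: The paper presents a 3D ultrasound-derived pseudo-CT (UD-pCT) framework that pairs a transformer-augmented residual network (BT-ResUNet3D) with a 3D Conditional PatchGAN discriminator to synthesize CT-like anatomical reference volumes from ultrasound. The goal is to provide an anatomical reference for real-time operator guidance and reduce reliance on CT.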

Details

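Motivation: CT is indispensable for clinical diagnosis and image-guided interventions but exposes patients to ionizing radiation. Ultrasound is radiation-free and widely available, yet it is highly operator dependent and lacks quantitative tissue characterization, which often leads to diagnostic uncertainty and unnecessary CT examinations. The work therefore develops a way to generate pseudo-CT from ultrasound as a safe anatomical reference.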

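Result: Quantitative evaluation on the TRUSTED dataset using PSNR and SSIM indicates that the proposed method outperforms established baselines in structural fidelity and perceptual image quality.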

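Insight: The contributions are: (1) the Bottleneck Transformer Residual U-Net3D (BT-ResUNet3D), which combines a 3D residual encoder-decoder with a transformer bottleneck to model both fine-grained local anatomy and long-range volumetric dependencies; (2) a 3D Conditional PatchGAN discriminator that enforces local structural realism in the synthesized pseudo-CT; and (3) a focus on generating an anatomical reference rather than physically accurate Hounsfield Units, which suits real-time operator guidance. A limitation is the small paired dataset, which may restrict generalization.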

Abstract: Computed tomography (CT) is indispensable for clinical diagnosis and image-guided interventions but exposes patients to ionizing radiation, motivating the development of safer imaging alternatives. Ultrasound (US) is non-ionizing and widely accessible; however, it is highly operator dependent and lacks quantitative tissue characterization, often leading to diagnostic uncertainty and unnecessary CT examinations. This work presents a 3D ultrasound-derived pseudo-CT (UD-pCT) framework that generates CT-like anatomical reference volumes inferred from US, without aiming to reproduce physically accurate Hounsfield Units. Paired 3D kidney US and CT volumes from the TRUSTED dataset are first spatially aligned using a landmark-based multimodal registration pipeline, creating high-quality paired inputs for supervised training of an adversarial framework. The proposed Bottleneck Transformer Residual U-Net3D (BT-ResUNet3D) model employs a 3D residual encoder-decoder generator augmented with a transformer bottleneck, enabling effective modeling of fine-grained local anatomical structures as well as long-range volumetric dependencies, while a 3D Conditional PatchGAN discriminator enforces local structural realism in the synthesized pseudo-CT volumes. Quantitative evaluation using PSNR and SSIM demonstrates that the proposed method outperforms established baselines in structural fidelity and perceptual image quality. The UD-pCT volumes provide real-time anatomical reference for operator guidance, potentially reducing acquisition variability and unnecessary CT use. A limitation of this study is the relatively small paired dataset, which may limit the generalizability of the proposed model.


[55] VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA cs.CV

Haibin He, Maoyuan Ye, Jing Zhang, Juhua Liu, Bo Du

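TL;DR: The paper proposes VTAgent, a question-guided agent framework for video text-based visual question answering (Video TextVQA). It explicitly anchors question-relevant keyframes to localize evidence before answering. The training-free variant already outperforms direct video inference, and with supervised fine-tuning and reinforcement learning it achieves large accuracy gains across benchmarks and sets a new state of the art.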

Details

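Motivation: Current Video-LLMs perform poorly on Video TextVQA benchmarks. The authors' analysis indicates that the main bottleneck is locating the key question-relevant evidence frames rather than reasoning capacity itself.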

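Result: The training-free setting already surpasses direct video inference. With additional supervised fine-tuning (SFT) and reinforcement learning (RL), the method improves accuracy by +12.12 and ANLS by +11.15 on average across benchmarks, establishing new state-of-the-art (SOTA) results.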

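Insight: The contribution is a question-guided agent framework that explicitly anchors keyframes to address the evidence-localization bottleneck. Decoupling keyframe localization from answer reasoning, and then optimizing through training, appears to be an effective route to better Video TextVQA performance.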

Abstract: Video text-based visual question answering (Video TextVQA) aims to answer questions by reasoning over visual textual content appearing in videos. Despite the strong multimodal video understanding capabilities of recent Video-LLMs, their performance on existing Video TextVQA benchmarks remains limited. To better understand this gap, we conduct an upper-bound analysis through frame-wise question answering, counting a sample as correct if any frame yields the right answer, which significantly outperforms direct video-based inference and reveals a substantial performance gap. The results suggest that the primary bottleneck lies in the localization of key question-relevant evidence, rather than in reasoning capacity itself. Building on this insight, we propose a question-guided agent framework that explicitly anchors the relevant keyframes before answering. The approach operates effectively in a training-free setting and consistently surpasses direct video inference. With additional supervised fine-tuning (SFT) and reinforcement learning (RL), it achieves an average improvement of +12.12 in accuracy and +11.15 in ANLS across benchmarks, establishing new state-of-the-art results. Our study underscores the critical role of explicit keyframe anchoring for advancing Video TextVQA. The code will be publicly released.
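
The upper-bound analysis described in this abstract ("counting a sample as correct if any frame yields the right answer") can be sketched as follows. The exact-match comparison after lowercasing and whitespace normalization is an assumption made for illustration; the paper also reports ANLS, whose soft matching is not reproduced here, and all function names are hypothetical.

```python
from typing import Sequence, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def any_frame_correct(frame_answers: Sequence[str], gold: str) -> bool:
    """A sample counts as correct if at least one per-frame answer matches the gold answer."""
    target = normalize(gold)
    return any(normalize(answer) == target for answer in frame_answers)


def framewise_upper_bound(samples: Sequence[Tuple[Sequence[str], str]]) -> float:
    """Share of samples for which some frame yields the right answer."""
    if not samples:
        raise ValueError("no samples given")
    hits = sum(any_frame_correct(answers, gold) for answers, gold in samples)
    return hits / len(samples)


# Hypothetical per-frame answers for two questions.
samples = [
    (["exit", "Gate 4", "unknown"], "gate 4"),  # one frame shows the right text
    (["no text", "no text"], "stop"),           # no frame shows the right text
]
print(framewise_upper_bound(samples))
```

Because the bound takes the best frame per question, a gap between it and direct video inference points at evidence localization rather than reasoning as the limiting factor, which is the argument the abstract makes.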


[56] FairEnc: A Fair Vision-Language Model with Fair Vision and Text Encoders for Glaucoma Detection cs.CV | cs.AI | cs.LG | eess.IV | q-bio.QM

Mohamed Elhabebe, Ayman El-Baz, Qing Liu

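TL;DR: The paper proposes FairEnc, a fair pretraining method for vision-language models (VLMs) that mitigates bias with respect to multiple sensitive attributes (race, gender, ethnicity, and language) in both the text and vision modalities, with the aim of fairer glaucoma detection. The text encoder is trained on synthetic clinical descriptions with a contrastive alignment objective, and the vision encoder uses mutual information regularization together with multi-discriminator adversarial debiasing. Effectiveness and generalization are validated on a public dataset.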

Details

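Motivation: Automated glaucoma detection is critical for preventing irreversible vision loss and reducing the burden on healthcare systems, but ensuring fairness across patient populations remains a major challenge. The paper addresses multi-attribute bias in vision-language models for glaucoma detection to support fairer clinical deployment.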

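Result: On the public Harvard-FairVLMed dataset, FairEnc reduces demographic disparity as measured by DPD and DEOdds while achieving strong diagnostic performance under both zero-shot and linear probing evaluation. Additional experiments on the private FairFundus dataset show that FairEnc consistently preserves its fairness advantage under cross-domain and cross-modality settings and keeps diagnostic performance within a competitive range.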

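Insight: The contributions are: (1) joint debiasing of the text and vision encoders across multiple sensitive attributes; (2) text debiasing with LLM-generated synthetic clinical descriptions that preserve disease semantics; and (3) a dual-level fairness strategy for the vision encoder that combines mutual information regularization with multi-discriminator adversarial debiasing. The work underlines the importance of cross-modal fairness and shows potential for fairness that generalizes under distribution shift, which is relevant to clinical practice.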

Abstract: Automated glaucoma detection is critical for preventing irreversible vision loss and reducing the burden on healthcare systems. However, ensuring fairness across diverse patient populations remains a significant challenge. In this paper, we propose FairEnc, a fair pretraining method for vision-language models (VLMs) that enables simultaneous debiasing across multiple sensitive attributes. FairEnc jointly mitigates biases in both textual and visual modalities with respect to multiple sensitive attributes, including race, gender, ethnicity, and language. Specifically, for the textual encoder, we leverage a large language model to generate synthetic clinical descriptions with varied sensitive attributes while preserving disease semantics, and employ a contrastive alignment objective to encourage demographic-invariant representations. For the visual encoder, we propose a dual-level fairness strategy that combines mutual information regularization to reduce statistical dependence between learned features and demographic groups, with multi-discriminator adversarial debiasing. Comprehensive experiments on the publicly available Harvard-FairVLMed dataset demonstrate that FairEnc effectively reduces demographic disparity as measured by DPD and DEOdds while achieving strong diagnostic performance under both zero-shot and linear probing evaluations. Additional experiments on the private FairFundus dataset show that FairEnc consistently preserves fairness advantages under cross-domain and cross-modality settings and maintains diagnostic performance within a competitive range. These results highlight FairEnc’s ability to generalize fairness under distribution shifts, supporting its potential for more equitable deployment in real-world clinical settings. Our codebase and synthetic clinical notes are available at https://github.com/Mohamed-Elhabebe/FairEnc
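
For readers unfamiliar with the disparity metric named in this abstract, here is a minimal sketch of DPD, assuming it denotes the demographic parity difference in its common form: the largest gap in positive-prediction rate between any two demographic groups. The paper's exact computation (for example, how predictions are thresholded) is not stated in the abstract, so this is an illustrative reading with a hypothetical function name.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def demographic_parity_difference(predictions: Sequence[int], groups: Sequence[Hashable]) -> float:
    """Largest gap in positive-prediction rate between any two groups.

    predictions: binary model outputs (1 = positive, e.g. glaucoma predicted).
    groups: the sensitive-attribute value of each sample.
    """
    if len(predictions) != len(groups):
        raise ValueError("predictions and groups must have the same length")
    if not predictions:
        raise ValueError("no samples given")
    positives = defaultdict(int)
    totals = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    rates = [positives[group] / totals[group] for group in totals]
    return max(rates) - min(rates)


# Hypothetical predictions for two groups of four patients each.
preds = [1, 1, 0, 0, 1, 0, 0, 0]
attrs = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(demographic_parity_difference(preds, attrs))
```

A value of 0 means every group receives positive predictions at the same rate; DEOdds applies the same idea to true-positive and false-positive rates and would need the ground-truth labels as an extra input.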


[57] DART: A Vision-Language Foundation Model for Comprehensive Rope Condition Monitoring cs.CV | cs.AI

Anju Rani, Daniel Ortiz-Arroyo, Petar Durdevic

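TL;DR: The paper presents DART, a vision-language foundation model for comprehensive condition monitoring of synthetic fibre ropes. A single unified architecture covers the full inspection workflow, from damage classification and severity regression to maintenance recommendations, without fine-tuning for downstream tasks.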

Details

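Motivation: Condition monitoring of synthetic fibre ropes in offshore, maritime, and industrial settings requires more than classification. Inspectors need continuous severity estimates, maintenance recommendations, anomaly flags, deterioration timelines, and automated reports, all from a single inspection image.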

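Result: After training on 4,270 images spanning 14 fine-grained damage classes, the frozen DART backbone performs well on downstream tasks: 93.22% accuracy and 91.04% macro-F1 for damage classification (+38.5 percentage points over a vision-only baseline), Spearman rho = 0.94 for continuous severity regression, and 89.2% macro-F1 for few-shot recognition at 20 shots.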

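Insight: The contributions are: (1) extending the Joint-Embedding Predictive Architecture to the cross-modal domain; (2) introducing the saliency-guided HD-MASK strategy, per-class learnable severity gates, and a contrastive damage disentanglement loss so that one representation encodes damage type, severity ordering, and cross-modal semantics; and (3) showing that a single model can serve as a general-purpose condition monitoring backbone that goes beyond classification.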

Abstract: The condition monitoring (CM) of synthetic fibre ropes (SFRs) used in offshore, maritime, and industrial settings demands more than a classifier: inspectors need continuous severity estimates, maintenance recommendations, anomaly flags, deterioration timelines, and automated reports, all from a single inspection image. We present DART (Damage Assessment via Rope Transformer), a vision-language foundation model that addresses the full rope inspection workflow through a unified multi-task architecture. DART extends the Joint-Embedding Predictive Architecture (JEPA) to the cross-modal domain by coupling a Vision Transformer (ViT-H/14) with Llama-3.2-3B-Instruct via a Severity-Conditioned Cross-Modal Fusion (SC-CMF) module. Three architectural innovations drive the model’s versatility: (1) HD-MASK, a saliency-guided masking strategy that focuses self-supervised reconstruction on damage-dense patches; (2) per-class learnable severity gates that adaptively weight language grounding by damage category; and (3) a Contrastive Damage Disentanglement (CDD) loss that shapes the embedding space to simultaneously encode damage type, severity ordering, and cross-modal semantics. Trained once on 4,270 images spanning 14 fine-grained rope damage classes, the frozen DART backbone supports downstream tasks without any task-specific fine-tuning: damage classification (93.22% accuracy, 91.04% macro-F1, +38.5 pp over a vision-only baseline), continuous severity regression (Spearman rho = 0.94, within-1-ordinal accuracy 99.6%), and few-shot recognition (89.2% macro-F1 at 20 shots). These results demonstrate that DART functions as a general-purpose CM backbone that goes well beyond classification, providing actionable inspection intelligence from a single shared representation.


[58] Attention-Based Chaotic Self-Supervision for Medical Image Classification cs.CV

Joao Batista Florindo, Amanda Pontes de Oliveira Ornelas

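TL;DR: The paper proposes the Chaotic Denoising Autoencoder (CDAE), a self-supervised pre-training strategy for medical image classification. The input image is passed through a chaotic transformation and an autoencoder is trained to reconstruct the original, with the aim of learning robust, domain-specific features. An attentive fusion mechanism then combines features from the CDAE-trained encoder with those of a standard encoder to exploit both general and domain-specific representations. Experiments on two public medical datasets (ISIC 2018 and APTOS 2019) report high performance.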

Details

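Motivation: Deep learning models for medical image classification typically depend on large annotated datasets or standard transfer learning from ImageNet. Self-supervised learning (SSL) is a promising alternative, but the random masking used by common methods such as masked autoencoders (MAEs) may destroy fine-grained diagnostic features. The paper proposes a self-supervised pre-training method intended to better preserve those features.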

Result: On ISIC 2018 (skin lesions), the model reaches an accuracy of 0.9221 and an F1-macro of 0.8530; on APTOS 2019 (diabetic retinopathy), it reaches an accuracy of 0.8644 and an F1-macro of 0.7433.

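Insight: The main contributions are: (1) replacing random masking with a controlled chaotic transformation, so that the encoder must learn to "invert the chaos" and thereby focus on robust, domain-specific features without destroying diagnostic detail; and (2) an attentive fusion mechanism that combines features from a general pretrained encoder and the domain-specific CDAE encoder. The strategy is tailored to medical images, where fine-grained features matter, and both the chaotic transformation idea and the dual-encoder fusion design may transfer to other settings.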

Abstract: Deep learning models for medical image classification usually achieve promising results but typically rely on large, annotated datasets or standard transfer learning from ImageNet. Self-Supervised Learning (SSL) has emerged as a powerful alternative, yet common methods like masked autoencoders (MAEs) may inadvertently destroy fine-grained diagnostic features by using random masking. In this paper, we propose a novel SSL pre-training strategy, the Chaotic Denoising Autoencoder (CDAE). Instead of masking, we apply a chaotic transformation to the input image, tasking an autoencoder to reconstruct the original. We hypothesize this forces the encoder to learn robust, domain-specific features by “inverting the chaos”. Furthermore, we propose an attentive fusion mechanism that combines features from our CDAE-trained encoder with a standard encoder, leveraging the strengths of both general and domain-specific representations. Our method is evaluated on two public medical datasets: ISIC 2018 (skin lesions) and APTOS 2019 (diabetic retinopathy). The proposed model achieves high performance, with an accuracy of 0.9221 and an F1-macro of 0.8530 on ISIC 2018, and an accuracy of 0.8644 and F1-macro of 0.7433 on APTOS 2019, demonstrating the efficacy of our approach.


[59] Chaotic Contrastive Learning for Robust Texture Classification cs.CV

Joao B Florindo

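TL;DR: The paper proposes a framework for robust texture classification that combines self-supervised learning with deterministic chaotic dynamics. In its chaotic contrastive pre-training strategy, pixel-wise chaotic maps (Logistic, Tent, and Sine) act as non-linear data augmentations that push the network toward topologically robust features. An attention-based feature ensemble then fuses high-level semantic representations from a supervised large backbone with low-frequency structural features from a chaos-pretrained tiny encoder.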

Details

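Motivation: Texture classification is an important computer vision task, but it is difficult because of high inter-class similarity and the sensitivity of structural patterns to scale and illumination changes. CNNs and Vision Transformers often need large labeled datasets or generalize poorly across domains because they over-rely on color and shape features.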

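Result: Experiments on six texture benchmarks (FMD, UMD, KTH-TIPS2-b, DTD, GTOS, and 1200Tex) show that the method outperforms state-of-the-art approaches and achieves promising accuracy on all analyzed datasets.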

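Insight: The contribution is to bring chaotic dynamics into self-supervised contrastive learning, using chaotic perturbations to mimic complex environmental noise and reflectance variations and thereby strengthen feature robustness. An attention mechanism then integrates features at different levels, combining the strengths of supervised and self-supervised learning to improve texture classification.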

Abstract: Texture classification is a pivotal task in computer vision, presenting unique challenges due to high inter-class similarity and the sensitivity of structural patterns to scale and illumination changes. While Convolutional Neural Networks (CNNs) and recent Vision Transformers have set performance benchmarks, they often require extensive labeled datasets or struggle to generalize across domains due to an over-reliance on color and shape features. This paper introduces a novel framework that synergizes Self-Supervised Learning (SSL) with deterministic chaotic dynamics. We propose a chaotic contrastive pre-training strategy, where pixel-wise chaotic maps, specifically Logistic, Tent, and Sine maps, act as non-linear data augmentation techniques. These chaotic perturbations, grounded in ergodic theory, force the network to learn topologically robust features by mimicking complex environmental noise and reflectance variations. Furthermore, we introduce an attention-based feature ensemble that fuses high-level semantic representations from a supervised large backbone with low-frequency structural features from a chaos-pretrained tiny encoder. Experimental results on six texture benchmarks (FMD, UMD, KTH-TIPS2-b, DTD, GTOS, and 1200Tex) demonstrate the superiority of the proposed method, outperforming state-of-the-art approaches and achieving promising accuracies on all the analyzed datasets.
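
The three maps named in this abstract have standard textbook forms, and the sketch below applies them pixel-wise to intensities normalized to [0, 1]. The control parameters (r = 4 for the logistic map, mu = 2 for the tent map, r = 1 for the sine map), the number of iterations, and the `chaotic_augment` helper are assumptions for illustration; the abstract does not say how the paper configures the maps or combines them with the contrastive objective.

```python
import math
from typing import Callable, List


def logistic_map(x: float, r: float = 4.0) -> float:
    """Logistic map x -> r * x * (1 - x); maps [0, 1] into [0, 1] for r <= 4."""
    return r * x * (1.0 - x)


def tent_map(x: float, mu: float = 2.0) -> float:
    """Tent map: rises linearly to its peak at x = 0.5, then falls linearly."""
    return mu * x if x < 0.5 else mu * (1.0 - x)


def sine_map(x: float, r: float = 1.0) -> float:
    """Sine map x -> r * sin(pi * x); maps [0, 1] into [0, 1] for r <= 1."""
    return r * math.sin(math.pi * x)


def chaotic_augment(
    image: List[List[float]],
    chaotic_map: Callable[[float], float],
    iterations: int = 1,
) -> List[List[float]]:
    """Apply a chaotic map to every pixel independently, `iterations` times."""
    augmented = []
    for row in image:
        new_row = []
        for value in row:
            for _ in range(iterations):
                value = chaotic_map(value)
            # Clamp to guard against tiny floating-point overshoot.
            new_row.append(min(1.0, max(0.0, value)))
        augmented.append(new_row)
    return augmented


# A hypothetical 2x2 grayscale patch with intensities in [0, 1].
patch = [[0.10, 0.25], [0.50, 0.90]]
for name, fn in [("logistic", logistic_map), ("tent", tent_map), ("sine", sine_map)]:
    print(name, chaotic_augment(patch, fn, iterations=2))
```

In a contrastive setup, two differently perturbed versions of the same image would typically serve as a positive pair; more iterations make the views diverge faster because nearby intensities separate quickly under chaotic dynamics.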


[60] CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography cs.CV

Gasser Elazab, Frank Neuhaus, Tilman Koß, Malte Splietker, Aditya Date

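TL;DR: The paper introduces CARD, a multi-modal autonomous driving dataset that provides dense 3D ground truth for challenging road topography such as speed bumps, potholes, and irregular surfaces. It includes synchronized global-shutter stereo cameras, front and rear LiDARs, 6-DoF poses, per-wheel motion traces, and full calibration, and covers about 110 km and 4.7 hours of driving recorded in Germany and Italy.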

Details

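Motivation: Most autonomous driving datasets are captured on well-paved flat roads and provide only sparse LiDAR ground truth, which is insufficient for assessing fine-grained geometry in depth estimation and completion. A new dataset is needed to fill this gap.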

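Result: Through multi-LiDAR fusion, CARD provides about 500K valid depth pixels per frame, roughly 6.5x more than KITTI Depth Completion and 10x more on average than other public driving datasets. The authors also establish a standardized evaluation protocol for road surface irregularities and benchmark state-of-the-art depth estimation models to provide strong baselines.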

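Insight: The contribution is dense 3D ground truth that supports fine-grained geometric evaluation, together with 2D bounding-box annotations targeting road-topography irregularities, so the dataset can benchmark both geometry and perception tasks. Its multi-sensor fusion and scale are notable strengths for challenging road scenes.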

Abstract: Autonomous driving must operate across diverse surfaces to enable safe mobility. However, most driving datasets are captured on well-paved flat roads. Moreover, recent driving datasets primarily provide sparse LiDAR ground truth for images, which is insufficient for assessing fine-grained geometry in depth estimation and completion. To address these gaps, we introduce CARD, a multi-modal driving dataset that delivers quasi-dense 3D ground truth across continuous sequences rich in speed bumps, potholes, irregular surfaces and off-road segments. Our sensor suite includes synchronized global-shutter stereo cameras, front and rear LiDARs, 6-DoF poses from LiDAR-inertial odometry, per-wheel motion traces, and full calibration. Notably, our multi-LiDAR fusion yields ~500K valid depth pixels per frame, about 6.5x more than KITTI Depth Completion and 10x more on average than other public driving datasets. The dataset spans ~110 km and 4.7 hours across Germany and Italy. In addition, CARD provides 2D bounding boxes targeting road-topography irregularities, enabling accurate benchmarking for both geometry and perception tasks. Furthermore, we establish a standardized evaluation protocol for road surface irregularities on CARD and benchmark state-of-the-art depth estimation models to provide strong baselines. The CARD dataset is hosted on https://huggingface.co/CARD-Data.


[61] Prompt-Anchored Vision-Text Distillation for Lifelong Person Re-identification cs.CV

Wen Wen, Hao Chen, Shiliang Zhang

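TL;DR: The paper proposes Prompt-Anchored vision-text Distillation (PAD), an asymmetric vision-text framework that addresses semantic drift, limited adaptability, and catastrophic forgetting in lifelong person re-identification. It uses the frozen text encoder of a pretrained vision-language model as a stable cross-domain semantic anchor: the textual side distills prompts to preserve vision-text alignment, and the visual side adapts to new domains with an exponential moving average (EMA) teacher and an adaptive prompt pool.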

Details

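Motivation: Lifelong person re-identification models often suffer from semantic drift, limited adaptability, and catastrophic forgetting as new domains arrive. Existing exemplar-free methods rely mainly on visual-only distillation or parameter regularization and overlook the potential of auxiliary modalities such as text to preserve semantic stability and enable incremental plasticity.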

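Result: Extensive experiments show that PAD substantially outperforms state-of-the-art methods on both seen and unseen domains, achieving a strong balance between stability and plasticity.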

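Insight: The contribution is the use of a frozen text encoder as a cross-domain semantic anchor within an asymmetric vision-text distillation framework. The textual side provides a global semantic reference through distilled prompts, and the visual side achieves domain adaptation through an adaptive prompt pool, which decouples the roles of vision and text and improves generalization.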

Abstract: Lifelong person re-identification (LReID) aims to train a generalizable model with sequentially collected data. However, such models often suffer from semantic drift, limited adaptability, and catastrophic forgetting as new domains emerge. Existing exemplar-free approaches largely rely on visual-only distillation or parameter regularization, while overlooking the potential of auxiliary modalities, such as text, to preserve semantic stability and enable incremental plasticity. We observe that the frozen text encoder in pretrained vision-language models can serve as a stable semantic anchor across domains. To decouple the roles of vision and text, we propose Prompt-Anchored vision-text Distillation (PAD), an asymmetric vision-text framework for semantic alignment and cross-domain generalization. On the textual side, we distill prompts to preserve vision-text alignment under a fixed semantic space, acting as a global semantic reference rather than a dominant learning signal. On the visual side, an EMA-based teacher with an adaptive prompt pool enables domain-wise adaptation by allocating new slots while freezing past ones. Extensive experiments show that PAD substantially outperforms state-of-the-art methods across seen and unseen domains, achieving a strong balance between stability and plasticity. Project page is available at https://github.com/zu-zi/PAD.
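
The "EMA-based teacher" mentioned in this abstract refers to a widely used pattern in which the teacher's parameters track an exponential moving average of the student's. The sketch below shows that generic update rule on a plain dictionary of scalar parameters; the momentum value and the parameter names are illustrative assumptions, and the paper's adaptive prompt pool (allocating new slots while freezing past ones) is not modeled here.

```python
from typing import Dict


def ema_update(
    teacher: Dict[str, float],
    student: Dict[str, float],
    momentum: float = 0.999,
) -> Dict[str, float]:
    """Return new teacher parameters: momentum * teacher + (1 - momentum) * student."""
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    if teacher.keys() != student.keys():
        raise ValueError("teacher and student must have the same parameter names")
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Hypothetical scalar "parameters"; real models would hold tensors instead.
teacher = {"prompt.0": 1.0, "prompt.1": -2.0}
student = {"prompt.0": 0.0, "prompt.1": 2.0}
for step in range(3):
    teacher = ema_update(teacher, student, momentum=0.9)
    print(step, teacher)
```

A momentum close to 1 makes the teacher change slowly, which favors stability on earlier domains; a lower value lets it follow the student more quickly, which favors plasticity.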


[62] When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise cs.CV | cs.CL

Philip Wootaek Shin, Ajay Narayanan Sridhar, Sivani Devarapalli, Rui Zhang, Jack Sampson

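TL;DR: The paper shows that the relational reasoning of vision-language models (VLMs) degrades markedly under visual perturbations such as image rotation and noise, leading to relation hallucination. The authors evaluate prompt-based augmentation and preprocessing strategies (orientation correction and denoising) and find that they only partially mitigate the problem. The study reveals a gap between perceptual robustness and relational understanding.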

Details

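Motivation: Although VLMs perform well on multimodal tasks, they are prone to relation hallucination, that is, wrongly inferring interactions between objects. The paper investigates how visual perturbations, specifically rotation and noise, affect relational reasoning and how well existing mitigation strategies work.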

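Result: Even mild rotation or noise significantly degrades relational reasoning across multiple models and datasets. The evaluated prompt augmentation and preprocessing strategies bring only partial improvements and do not eliminate hallucinations.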

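Insight: The contribution is a systematic analysis of how visual perturbations affect relation hallucination in VLMs, with empirical evidence for the limits of existing mitigation methods. The central finding is a disconnect between perceptual robustness and deeper relational understanding in current VLMs, which points to the need for more robust, geometry-aware models.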

Abstract: Vision-language models (VLMs) achieve strong multimodal performance but remain prone to relation hallucination, which requires accurate reasoning over inter-object interactions. We study the impact of visual perturbations, specifically rotation and noise, and show that even mild distortions significantly degrade relational reasoning across models and datasets. We further evaluate prompt-based augmentation and preprocessing strategies (orientation correction and denoising), finding that while they offer partial improvements, they do not fully resolve hallucinations. Our results reveal a gap between perceptual robustness and relational understanding, highlighting the need for more robust, geometry-aware VLMs.


[63] Direct Product Flow Matching: Decoupling Radial and Angular Dynamics for Few-Shot Adaptation cs.CV | cs.AI | cs.LG

Hongxu Chen, Yanghao Wang, Bowei Zhu, Hongxiang Li, Zhen Wang

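TL;DR: The paper proposes direct product flow matching (DP-FM), a method for improving few-shot adaptation of vision-language models (VLMs). It models cross-modal alignment on a decoupled cylindrical manifold (a direct product manifold), handles the radial and angular dynamics of features independently, and uses classifier-free guidance to inject dataset-specific information, achieving state-of-the-art few-shot adaptation on multiple benchmarks.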

Details

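Motivation: Existing flow matching (FM) methods for few-shot adaptation are constrained by incompatible geometric priors on pretrained cross-modal features, which leads to suboptimal adaptation. The specific problems are distorted angular dynamics, neglected radial dynamics, and a context-agnostic unconditional flow that loses dataset-specific information.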

Result: Extensive experiments on 11 benchmarks show that DP-FM achieves a new state of the art (SOTA) for few-shot adaptation.

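Insight: The main contribution is to analyze the problem from a polar decomposition perspective (radial and angular sub-manifolds) and to propose a unified Riemannian framework on a warped product manifold. From it the authors derive DP-FM, which provides independent radial evolution and constant-speed angular geodesic transport on a decoupled manifold and thereby addresses the limitations of existing methods. Conditioning the flow on the hidden states of the pretrained VLM (classifier-free guidance) restores the missing dataset-specific context.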

Abstract: Recent flow matching (FM) methods improve the few-shot adaptation of vision-language models, by modeling cross-modal alignment as a continuous multi-step flow. In this paper, we argue that existing FM methods are inherently constrained by incompatible geometric priors on pre-trained cross-modal features, resulting in suboptimal adaptation performance. We first analyze these methods from a polar decomposition perspective (i.e., radial and angular sub-manifolds). Under this new geometric view, we identify three overlooked limitations in them: 1) Angular dynamics distortion: The radial-angular coupling induces non-uniform speed on the angular sub-manifold, leading to regression training difficulty and extra truncation errors. 2) Radial dynamics neglect: Feature normalization discards modality confidence, failing to distinguish out-of-distribution and in-distribution data, and abandoning crucial radial dynamics. 3) Context-agnostic unconditional flow: Dataset-specific information loss during pre-trained cross-modal feature extraction remains unrecovered. To resolve these issues, we propose warped product flow matching (WP-FM), a unified Riemannian framework that reformulates alignment on a warped product manifold. Within this framework, we derive direct product flow matching (DP-FM) by introducing a constant-warping metric, which yields a decoupled cylindrical manifold (i.e., direct product manifold). DP-FM enables independent radial evolution and constant-speed angular geodesic transport, effectively eliminating angular dynamics distortion while preserving radial consistency. Meanwhile, we incorporate classifier-free guidance by conditioning the flow on the pre-trained VLMs’ hidden states to inject missing dataset-specific information. Extensive results across 11 benchmarks have demonstrated that DP-FM achieves a new state-of-the-art for multi-step few-shot adaptation.
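
To make the idea of "independent radial evolution and constant-speed angular geodesic transport" concrete, the sketch below interpolates between two feature vectors by treating norm and direction separately: the direction follows the great-circle geodesic (spherical linear interpolation), and the norm follows its own schedule. The linear radial schedule and the function names are assumptions chosen for illustration; the abstract does not specify the paper's radial parameterization or its training objective, so this is not the DP-FM algorithm itself.

```python
import math
from typing import List, Sequence


def norm(v: Sequence[float]) -> float:
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def decoupled_interpolate(x0: Sequence[float], x1: Sequence[float], t: float) -> List[float]:
    """Point at time t in [0, 1] on a path whose radius and direction evolve separately."""
    r0, r1 = norm(x0), norm(x1)
    if r0 == 0.0 or r1 == 0.0:
        raise ValueError("zero vectors have no direction")
    u0 = [x / r0 for x in x0]
    u1 = [x / r1 for x in x1]
    cos_theta = max(-1.0, min(1.0, sum(a * b for a, b in zip(u0, u1))))
    theta = math.acos(cos_theta)
    if math.pi - theta < 1e-8:
        raise ValueError("antipodal directions have no unique geodesic")
    if theta < 1e-8:
        direction = u0  # directions coincide; only the radius changes
    else:
        # Spherical linear interpolation: constant angular speed along the geodesic.
        s = math.sin(theta)
        w0 = math.sin((1.0 - t) * theta) / s
        w1 = math.sin(t * theta) / s
        direction = [w0 * a + w1 * b for a, b in zip(u0, u1)]
    radius = (1.0 - t) * r0 + t * r1  # assumed linear radial schedule
    return [radius * d for d in direction]


# Hypothetical 2-D features: norm 2 along the x-axis, norm 4 along the y-axis.
for t in (0.0, 0.5, 1.0):
    point = decoupled_interpolate([2.0, 0.0], [0.0, 4.0], t)
    print(t, [round(c, 4) for c in point], round(norm(point), 4))
```

By contrast, a straight Euclidean line between the same two points sweeps the angle at a non-uniform rate and changes the norm as a side effect, which is the radial-angular coupling the abstract identifies as a source of distortion.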


[64] ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection cs.CV

Minh Anh Nguyen, Quang Huy Tran, Bao Ngoc Le, SuiYang Guang, Tuan Kiet Pham

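TL;DR: The paper proposes ScriptHOI, a framework that models each interaction phrase as a soft scripted state transition decomposed into body-role, contact, geometry, affordance, motion, and object-state slots. A visual state tokenizer parses each human-object pair, and script coverage and script conflict are computed to calibrate HOI predictions, expose missing evidence, and provide training constraints. Combined with interval partial-label learning and a counterfactual script contrast loss, this improves the generalization of open-vocabulary human-object interaction detection.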

Details

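Motivation: In open-vocabulary human-object interaction detection, existing models' predictions rely too heavily on object affordance and phrase-level co-occurrence, and do not verify the joint visual evidence from the hand, tool, target, contact pattern, and object state.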

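Result: Experiments on HICO-DET, V-COCO, and open-vocabulary HOI splits show that ScriptHOI markedly improves recognition of rare and unseen interactions and substantially reduces affordance-conflict false positives.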

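Insight: The novelty lies in decomposing interaction phrases into structured multimodal state slots, calibrating predictions through script coverage and conflict, and introducing interval partial-label learning and a counterfactual contrast loss to mitigate incomplete annotations and object shortcuts, which strengthens the model's reasoning over fine-grained visual evidence.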

Abstract: Open-vocabulary human-object interaction (HOI) detection requires recognizing interaction phrases that may not appear as annotated categories during training. Recent vision-language HOI detectors improve semantic transfer by matching human-object features with text embeddings, but their predictions are often dominated by object affordance and phrase-level co-occurrence. As a result, a model may predict "cut cake" from the presence of a knife and a cake without verifying whether the hand, tool, target, contact pattern, and object state jointly support the action. We propose ScriptHOI, a structured framework that represents each interaction phrase as a soft scripted state transition. Rather than treating a phrase as a single class token, ScriptHOI decomposes it into body-role, contact, geometry, affordance, motion, and object-state slots. A visual state tokenizer parses each detected human-object pair into corresponding state tokens, and a slot-wise matcher estimates both script coverage and script conflict. These two quantities calibrate HOI logits, expose missing visual evidence, and provide training constraints for incomplete annotations. To avoid suppressing valid but unannotated interactions, we further introduce interval partial-label learning, which constrains unannotated candidates with script-derived lower and upper probability bounds instead of assigning closed-world negatives. A counterfactual script contrast loss swaps individual script slots to discourage object-only shortcuts. Experiments on HICO-DET, V-COCO, and open-vocabulary HOI splits show that ScriptHOI improves rare and unseen interaction recognition while substantially reducing affordance-conflict false positives.


[65] FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching cs.CV

Andranik Sargsyan, Shant Navasardyan

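TL;DR: FlowDIS is a language-guided dichotomous image segmentation method built on the flow matching framework. It learns a time-dependent vector field that transports the image distribution to the mask distribution, uses a Position-Aware Instance Pairing training strategy to give strong controllability through text prompts, and significantly outperforms existing SOTA methods on the DIS-TE test set.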

Details

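Motivation: Existing dichotomous image segmentation methods fall short in preserving fine-grained details and capturing the semantic structure of the foreground. FlowDIS aims to address both through flow matching and language guidance, improving segmentation accuracy and controllability.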

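Result: On the DIS-TE test set, FlowDIS achieves a 5.5% higher $F_β^ω$ measure and a 43% lower MAE ($\mathcal{M}$) than the best prior method, and significantly outperforms SOTA both with and without language guidance.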

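Insight: The novelties are bringing the flow matching framework to dichotomous image segmentation to model the distribution transport, and the Position-Aware Instance Pairing strategy for precise pixel-level control through text prompts. Viewed objectively, the method strengthens detail preservation and semantic alignment by learning a probability flow.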

Abstract: Accurate image segmentation is essential for modern computer vision applications such as image editing, autonomous driving, and medical image analysis. In recent years, Dichotomous Image Segmentation (DIS) has become a standard task for training and evaluating highly accurate segmentation models. Existing DIS approaches often fail to preserve fine-grained details or fully capture the semantic structure of the foreground. To address these challenges, we present FlowDIS, a novel dichotomous image segmentation method built on the flow matching framework, which learns a time-dependent vector field to transport the image distribution to the corresponding mask distribution, optionally conditioned on a text prompt. Moreover, with our Position-Aware Instance Pairing (PAIP) training strategy, FlowDIS offers strong controllability through text prompts, enabling precise, pixel-level object segmentation. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches both with and without language guidance. Compared with the best prior DIS method, FlowDIS achieves a 5.5% higher $F_β^ω$ measure and 43% lower MAE ($\mathcal{M}$) on the DIS-TE test set. The code is available at: https://github.com/Picsart-AI-Research/FlowDIS


[66] A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping cs.CV

Maxim V. Shugaev, Md Reshad Ul Hoque, Bridget Kennedy, Joseph T. Riley, Fiona Hwang

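TL;DR: The paper presents a unified benchmark for evaluating methods that remove geometric distortion from video sequences under severe refractive warping. It covers distortion levels from turbulence-like mild warping to strong discontinuous refractive deformations, and includes both laboratory-captured real data and synthetic sequences generated through physics-based light refraction modeling.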

Details

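Motivation: Existing benchmarks mainly target mild atmospheric turbulence and lack a systematic evaluation of the severe geometric distortion and temporal instability that video sequences suffer under strong, highly nonuniform refraction (such as turbulent air or a water surface). A comprehensive benchmark is needed to fill this gap.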

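Result: The benchmark evaluates a range of methods, from simple baselines and classical registration algorithms to advanced learning-based approaches such as DATUM and the authors' diffusion-based V-cache, using pixel-level (PSNR, SSIM) and perceptual (LPIPS, DINO, CLIP) metrics. This is presented as the first large-scale analysis of geometric distortion removal and lays a foundation for developing and evaluating video reconstruction algorithms in highly distorted optical environments.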

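Insight: The novelty is the first unified multi-frame image restoration benchmark that systematically covers mild to extreme refractive warping, together with the diffusion-based V-cache method for high-distortion regimes. Viewed objectively, combining diverse physics-based synthetic data with real data gives a more complete framework for evaluating methods under complex distortion.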

Abstract: Video sequences captured through refractive dynamic media, such as turbulent air or a water surface, often suffer from severe geometric distortions and temporal instability. While recent advances address mild atmospheric turbulence, no existing benchmarks systematically evaluate restoration methods under strong and highly nonuniform refractive conditions. We present a comprehensive benchmark for geometric distortion removal in video, covering a range from turbulence-like mild warping to strong discontinuous refractive deformations. The benchmark includes both laboratory-captured real data and synthetic sequences generated for static scenes via physics-based light refraction modeling across four distortion levels and multiple surface wave types. We evaluate a spectrum of methods, from simple baselines and classical registration algorithms to advanced learning-based approaches including DATUM and our proposed diffusion-based V-cache for high and extreme distortion regimes. Evaluation uses both pixel-level (PSNR, SSIM) and perceptual (LPIPS, DINO, CLIP) metrics, providing the first large-scale analysis of geometric distortion removal. Our benchmark establishes a new foundation for developing and evaluating algorithms capable of reconstructing video from highly distorted optical environments. Our code and datasets are available at https://github.com/iafoss/refractive-mfir-benchmark.


[67] Wasserstein-Aligned Localisation for VLM-Based Distributional OOD Detection in Medical Imaging cs.CV

Bernhard Kainz, Johanna P Mueller, Matthew Baugh, Cosmin Bercea

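TL;DR: The paper proposes WALDO, a training-free framework for zero-shot anomaly localisation in medical imaging. Grounded in optimal transport theory, it uses entropy-weighted Sliced Wasserstein distances for anatomically-aware reference selection, Goldilocks zone sampling to choose references of suitable similarity, and self-consistency aggregation via weighted non-maximum suppression, which markedly improves zero-shot anomaly localisation on the NOVA brain MRI benchmark.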

Details

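Motivation: VLM-based zero-shot anomaly localisation in medical imaging is limited by the absence of healthy anatomical context. The paper recasts it as a comparative inference problem in which anomalies are identified through structured comparison against reference distributions of normal anatomy.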

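Result: On the NOVA brain MRI benchmark, WALDO with Qwen2.5-VL-72B reaches 43.5±1.6% mAP@30 (95% CI: [40.4, 46.7]), a 19% relative improvement over zero-shot baselines. In cross-model evaluation, GPT-4o and Qwen3-VL-32B reach 32.0±6.5% and 32.0±6.6% mAP@30 respectively, and paired McNemar tests confirm statistical significance (p<0.01).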

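Insight: The novelties are: recasting zero-shot localisation as a comparative inference problem; entropy-weighted Sliced Wasserstein distances, based on optimal transport, for anatomically-aware reference selection; Goldilocks zone sampling to balance the bias-variance trade-off; and self-consistency aggregation via weighted non-maximum suppression. Together these improve the accuracy and robustness of rare pathology detection in medical imaging.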

Abstract: Zero-shot anomaly localisation via vision-language models (VLMs) offers a compelling approach for rare pathology detection, yet its performance is fundamentally limited by the absence of healthy anatomical context. We reformulate zero-shot localisation as a comparative inference problem in which anomalies are identified through structured comparison against reference distributions of normal anatomy. We introduce WALDO, a training-free framework grounded in optimal transport theory that enables comparative reasoning through: (i) entropy-weighted Sliced Wasserstein distances for anatomically-aware reference selection from DINOv2 patch distributions, (ii) Goldilocks zone sampling exploiting the non-monotonic relationship between reference similarity and localisation accuracy, and (iii) self-consistency aggregation via weighted non-maximum suppression. We theoretically analyse the Goldilocks effect through distributional divergence, and show that references with moderate similarity minimize a bias-variance trade-off in comparative visual reasoning. On the NOVA brain MRI benchmark, WALDO with Qwen2.5-VL-72B achieves $43.5_{\pm1.6}%$ mAP@30 (95% CI: [40.4, 46.7]), representing a 19% relative improvement over zero-shot baselines. Cross-model evaluation shows consistent gains: GPT-4o achieves $32.0_{\pm6.5}%$ and Qwen3-VL-32B achieves $32.0_{\pm6.6}%$ mAP@30. Paired McNemar tests confirm statistical significance ($p<0.01$). Source code is available at https://github.com/bkainz/WALDO_MICCAI26_demo .
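As background for the Sliced Wasserstein distance mentioned above, here is a minimal sketch of the plain (unweighted) estimator between two equal-size point sets. It is a generic textbook construction, not WALDO's code: the entropy weighting and the DINOv2 patch features described in the abstract are omitted, and the function name and parameters are hypothetical.

```python
import math
import random

def sliced_wasserstein(xs, ys, n_projections=64, seed=0):
    """Monte Carlo estimate of the sliced 1-Wasserstein distance.

    xs, ys: equal-length lists of points (lists of floats), uniform weights.
    Each random direction reduces the problem to 1-D, where the optimal
    transport cost is the mean absolute difference of the sorted projections.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal-size point sets")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        # Draw a random unit direction from an isotropic Gaussian.
        d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in d))
        d = [v / norm for v in d]
        px = sorted(sum(a * b for a, b in zip(x, d)) for x in xs)
        py = sorted(sum(a * b for a, b in zip(y, d)) for y in ys)
        total += sum(abs(a - b) for a, b in zip(px, py)) / len(px)
    return total / n_projections

if __name__ == "__main__":
    a = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    b = [[x + 2.0, y] for x, y in a]  # same shape, shifted along x
    print(sliced_wasserstein(a, a), sliced_wasserstein(a, b))
```

A set compared with itself gives zero, and the shifted copy gives a positive distance, which is the behaviour a reference-selection score needs.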


[68] PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World cs.CV

Yunhan Yang, Chunshi Wang, Junliang Ye, Yang Li, Zanxin Chen

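TL;DR: PhysForge is a two-stage framework for generating physics-grounded, interactive 3D assets. A vision-language model first plans a "Hierarchical Physical Blueprint" that defines material, functional, and kinematic constraints; a physics-grounded diffusion model with a novel KineVoxel Injection mechanism then synthesizes high-fidelity geometry and precise kinematic parameters. The framework is supported by PhysDB, a large-scale dataset of 150,000 assets with four-tier physical annotations.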

Details

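Motivation: Existing methods focus mainly on static geometry and overlook the functional properties needed for interaction, which is a critical bottleneck in synthesizing physics-grounded 3D assets for interactive virtual worlds and embodied AI. The paper aims to close this gap by rooting asset generation in functional logic and hierarchical physics.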

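Result: Experiments show that PhysForge produces functionally plausible, simulation-ready assets, providing a robust data engine for interactive 3D content and embodied agents.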

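Insight: The main novelties are: 1) a decoupled two-stage generation framework that separates physical logic planning from geometry generation; 2) the "Hierarchical Physical Blueprint" concept for defining multi-level physical constraints; 3) the novel KineVoxel Injection mechanism, which lets the diffusion model synthesize geometry and kinematic parameters together; and 4) PhysDB, a large-scale dataset with fine-grained physical annotations that supports the framework.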

Abstract: Synthesizing physics-grounded 3D assets is a critical bottleneck for interactive virtual worlds and embodied AI. Existing methods predominantly focus on static geometry, overlooking the functional properties essential for interaction. We propose that interactive asset generation must be rooted in functional logic and hierarchical physics. To bridge this gap, we introduce PhysForge, a decoupled two-stage framework supported by PhysDB, a large-scale dataset of 150,000 assets with four-tier physical annotations. First, a VLM acts as a “physical architect” to plan a “Hierarchical Physical Blueprint” defining material, functional, and kinematic constraints. Second, a physics-grounded diffusion model realizes this blueprint by synthesizing high-fidelity geometry alongside precise kinematic parameters via a novel KineVoxel Injection (KVI) mechanism. Experiments demonstrate that PhysForge produces functionally plausible, simulation-ready assets, providing a robust data engine for interactive 3D content and embodied agents.


[69] OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents cs.CV

Shuang Chen, Kaituo Feng, Hangting Chen, Wenxuan Huang, Dasen Dai

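TL;DR: OpenSearch-VL is a fully open-source recipe for training frontier multimodal deep search agents. It achieves substantial performance gains by constructing high-quality training data (the SearchVL-SFT-36k and SearchVL-RL-8k datasets), designing a diverse tool environment, and proposing a multi-turn fatal-aware GRPO training algorithm.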

Details

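Motivation: Top-tier multimodal search agents remain difficult to reproduce, largely because open high-quality training data, transparent trajectory synthesis pipelines, and detailed training recipes are missing. The paper aims to provide a reproducible open-source solution.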

Result: The average improvement across seven benchmarks exceeds 10 points, and results on several tasks are comparable to proprietary commercial models.

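Insight: The novelties are: a dedicated pipeline that constructs high-quality training data through Wikipedia path sampling, fuzzy entity rewriting, and source-anchor visual grounding; a diverse tool environment that unifies text search, image search, OCR, cropping, sharpening, super-resolution, and perspective correction; and a multi-turn fatal-aware GRPO training algorithm that handles cascading tool failures.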

Abstract: Deep search has become a crucial capability for frontier multimodal agents, enabling models to solve complex questions through active search, evidence verification, and multi-step reasoning. Despite rapid progress, top-tier multimodal search agents remain difficult to reproduce, largely due to the absence of open high-quality training data, transparent trajectory synthesis pipelines, or detailed training recipes. To this end, we introduce OpenSearch-VL, a fully open-source recipe for training frontier multimodal deep search agents with agentic reinforcement learning. First, we curated a dedicated pipeline to construct high-quality training data through Wikipedia path sampling, fuzzy entity rewriting, and source-anchor visual grounding, which jointly reduce shortcuts and one-step retrieval collapse. Based on this pipeline, we curate two training datasets, SearchVL-SFT-36k for SFT and SearchVL-RL-8k for RL. Besides, we design a diverse tool environment that unifies text search, image search, OCR, cropping, sharpening, super-resolution, and perspective correction, enabling agents to combine active perception with external knowledge acquisition. Finally, we propose a multi-turn fatal-aware GRPO training algorithm that handles cascading tool failures by masking post-failure tokens while preserving useful pre-failure reasoning through one-sided advantage clamping. Built on this recipe, OpenSearch-VL delivers substantial performance gains, with over 10-point average improvements across seven benchmarks, and achieves results comparable to proprietary commercial models on several tasks. We will release all data, code, and models to support open research on multimodal deep search agents.
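One possible reading of "masking post-failure tokens while preserving useful pre-failure reasoning through one-sided advantage clamping" is sketched below. The abstract does not give the exact rule, so everything here is an assumption: the function names are hypothetical, the group normalization is the usual GRPO-style (reward minus group mean, divided by group standard deviation), and the clamp is taken to mean that pre-failure tokens are never given a negative advantage.

```python
def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward within its rollout group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

def fatal_aware_tokens(advantage, n_tokens, fatal_index=None):
    """Per-token advantages and loss mask for one rollout.

    fatal_index: index of the first token generated after a fatal tool
    failure, or None if the rollout had no fatal failure.
    Assumed rule: tokens from fatal_index onward are masked out of the loss,
    and earlier tokens use max(advantage, 0) so that reasoning done before
    the failure is not penalized for it.
    """
    if fatal_index is None:
        return [advantage] * n_tokens, [1] * n_tokens
    clamped = max(advantage, 0.0)  # one-sided clamp (assumption)
    advantages = [clamped] * fatal_index + [0.0] * (n_tokens - fatal_index)
    mask = [1] * fatal_index + [0] * (n_tokens - fatal_index)
    return advantages, mask

if __name__ == "__main__":
    adv = group_advantages([1.0, 0.0, 0.0, 1.0])
    # Second rollout failed fatally after its third token.
    print(fatal_aware_tokens(adv[1], n_tokens=5, fatal_index=3))
```

In a real trainer the mask would multiply the per-token policy-gradient loss; here it is returned separately so the two effects (clamping and masking) stay visible.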


[70] LoViF 2026 The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore) cs.CV

Wei Luo, Yiting Lu, Xin Li, Haoran Li, Fengbin Guan

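TL;DR: The LoViF 2026 PhyScore challenge aims to advance holistic quality assessment of videos generated by world models, looking beyond perceptual quality to physical plausibility, temporal consistency, and alignment with input conditions. Participants must build a metric that jointly predicts four dimensions (video quality, physical realism, condition-video alignment, and temporal consistency) and localizes physical anomaly timestamps. The benchmark dataset contains 1,554 videos generated by seven representative world generative models, organized into three tracks (text-to-2D, image-to-4D, and video-to-4D) and spanning 26 categories that include physics-relevant scenarios. Evaluation is based on score prediction and anomaly localization, combining timestamp IoU with SRCC/PLCC.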

Details

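Motivation: Current evaluation practice has a central gap: perceptual quality alone is insufficient to judge whether generated dynamics are physically plausible, temporally coherent, and consistent with input conditions. A holistic quality assessment framework is needed to close it.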

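Result: The challenge establishes a benchmark dataset of 1,554 videos generated by seven world generative models, organized into three tracks and 26 categories. Evaluation uses a composite protocol that combines timestamp IoU with SRCC/PLCC to assess score prediction and anomaly localization together.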

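Insight: The novelty is a joint evaluation metric over multiple dimensions (video quality, physical realism, condition alignment, temporal consistency), plus physical anomaly timestamp localization for fine-grained diagnosis. This widens the scope of traditional video quality assessment and stresses the importance of physical plausibility and condition consistency when evaluating generated content.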

Abstract: This paper reports on the LoViF 2026 PhyScore challenge, a competition on holistic quality assessment of world-model-generated videos across both 2D and 4D generation settings. The challenge is motivated by a central gap in current evaluation practice: perceptual quality alone is insufficient to judge whether generated dynamics are physically plausible, temporally coherent, and consistent with input conditions. Participants are required to build a metric that jointly predicts four dimensions, i.e., Video Quality, Physical Realism, Condition-Video Alignment, and Temporal Consistency. Beyond that, participants also need to localize physical anomaly timestamps for fine-grained diagnosis. The benchmark dataset contains 1,554 videos generated by seven representative world generative models, organized into three tracks (text-to-2D, image-to-4D, and video-to-4D) and spanning 26 categories. These categories explicitly cover physics-relevant scenarios, including dynamics, optics, and thermodynamics, together with diverse real-world and creative content. To ensure label reliability, scores and anomaly timestamps are produced through trained human annotation with an additional automated quality-control pass. Evaluation is based on both score prediction and anomaly localization, with a composite protocol that combines TimeStamp_IOU and SRCC/PLCC. This report summarizes the challenge design and provides method-level insights from submitted solutions.
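The evaluation ingredients named above can be sketched in a few lines. The abstract does not define TimeStamp_IOU or how it is combined with SRCC/PLCC, so this sketch assumes the common temporal-interval IoU and the standard Pearson and Spearman correlations, and it leaves the combination out. All function names are hypothetical.

```python
import math

def interval_iou(pred, truth):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0

def plcc(x, y):
    """Pearson linear correlation coefficient (assumes non-constant inputs)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return plcc(ranks(x), ranks(y))

if __name__ == "__main__":
    print(interval_iou((1.0, 3.0), (2.0, 4.0)))
    print(srcc([1, 2, 3, 4], [1, 4, 9, 16]), plcc([1, 2, 3, 4], [1, 4, 9, 16]))
```

SRCC rewards getting the ordering of scores right, while PLCC also rewards a linear relation to the human scores, which is why quality-assessment benchmarks usually report both.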


[71] D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models cs.CV

Dengyang Jiang, Xin Jin, Dongyang Liu, Zanyi Wang, Mingzhe Zheng

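TL;DR: The paper proposes D-OPSD, a new training paradigm for step-distilled diffusion models that addresses the difficulty of keeping their few-step inference ability during continuous supervised fine-tuning. By exploiting the in-context ability of the model's encoder, training is framed as an on-policy self-distillation process, so the model can learn new concepts and styles without sacrificing its original performance.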

Details

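Motivation: High-performance image generation models are shifting from inefficient multi-step models to efficient few-step models (e.g., Z-Image-Turbo and FLUX.2-klein), but these models are hard to fine-tune continuously with direct supervision because conventional fine-tuning techniques damage their inherent few-step inference capability.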

Result: The abstract gives no specific quantitative results or benchmark comparisons.

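Insight: The core novelty is an on-policy self-distillation training paradigm. It exploits the in-context ability that modern diffusion models inherit from their LLM/VLM encoders, so that during training the model acts as both the teacher (conditioned on multimodal features of the text prompt and the target image) and the student (conditioned only on text features), and is optimized on the student's own generation trajectories. This enables continuous tuning without harming the original few-step capability.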

Abstract: The landscape of high-performance image generation models is currently shifting from inefficient multi-step models to efficient few-step counterparts (e.g., Z-Image-Turbo and FLUX.2-klein). However, these models present significant challenges for direct continuous supervised fine-tuning. For example, applying commonly used fine-tuning techniques would compromise their inherent few-step inference capability. To address this, we propose D-OPSD, a novel training paradigm for step-distilled diffusion models that enables on-policy learning during supervised fine-tuning. We first find that a modern diffusion model in which an LLM/VLM serves as the encoder can inherit its encoder’s in-context capabilities. This enables us to frame training as an on-policy self-distillation process. Specifically, during training, we make the model act as both the teacher and the student with different contexts, where the student is conditioned only on the text feature, while the teacher is conditioned on the multimodal feature of both the text prompt and the target image. Training minimizes the discrepancy between the two predicted distributions over the student’s own roll-outs. By optimizing on the model’s own trajectory and under its own supervision, D-OPSD enables the model to learn new concepts, styles, etc. without sacrificing the original few-step capacity.


[72] Taming Outlier Tokens in Diffusion Transformers cs.CV | cs.AI | cs.LG

Xiaoyu Wu, Yifei Wang, Tsu-Jui Fu, Liang-Chieh Chen, Zhe Gan

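TL;DR: The paper studies outlier tokens in Diffusion Transformers (DiTs) and finds that both pretrained ViT encoders and the DiT itself produce high-norm outlier tokens that carry limited local information and cause artifacts in generated images. To address this, the authors propose Dual-Stage Registers (DSR), an intervention that reduces outlier artifacts and improves generation quality on ImageNet and text-to-image generation.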

Details

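Motivation: The aim is to understand the role of outlier tokens in Diffusion Transformers and their effect on generation quality. Their role in generative models is underexplored, and simply masking high-norm tokens does not improve performance, which suggests the problem is more closely tied to corrupted local patch semantics.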

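Result: On ImageNet and large-scale text-to-image generation benchmarks, the DSR intervention consistently reduces outlier artifacts and improves generation quality, supporting its usefulness for building stronger DiTs.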

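Insight: The novelty is identifying how widespread outlier tokens are in DiTs and how they relate to corrupted local semantics, and proposing Dual-Stage Registers (trained registers, recursive test-time registers, and diffusion registers) as an intervention, which offers a new way to improve the robustness of Diffusion Transformers.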

Abstract: We study outlier tokens in Diffusion Transformers (DiTs) for image generation. Prior work has shown that Vision Transformers (ViTs) can produce a small number of high-norm tokens that attract disproportionate attention while carrying limited local information, but their role in generative models remains underexplored. We show that this phenomenon appears in both the encoder and denoiser of modern Representation Autoencoder (RAE)-DiT pipelines: pretrained ViT encoders can produce outlier representations, and DiTs themselves can develop internal outlier tokens, especially in intermediate layers. Moreover, simply masking high-norm tokens does not improve performance, indicating that the problem is not only caused by a few extreme values, but is more closely related to corrupted local patch semantics. To address this issue, we introduce Dual-Stage Registers (DSR), a register-based intervention for both components: trained registers when available, recursive test-time registers otherwise, and diffusion registers for the denoiser. Across ImageNet and large-scale text-to-image generation, these interventions consistently reduce outlier artifacts and improve generation quality. Our results highlight outlier-token control as an important ingredient in building stronger DiTs.
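To make "high-norm tokens" concrete, the sketch below flags tokens whose L2 norm is far above the median norm of the sequence. This is only a diagnostic illustration: the threshold ratio and the function names are assumptions, and it is not the paper's Dual-Stage Registers method. As the abstract notes, simply masking the flagged tokens does not improve performance.

```python
import math
import statistics

def token_norms(tokens):
    """L2 norm of each token embedding (a list of floats)."""
    return [math.sqrt(sum(v * v for v in t)) for t in tokens]

def find_outlier_tokens(tokens, ratio=5.0):
    """Indices of tokens whose norm exceeds ratio times the median norm.

    The ratio is an arbitrary illustrative threshold, not a value from the paper.
    """
    norms = token_norms(tokens)
    median = statistics.median(norms)
    return [i for i, n in enumerate(norms) if n > ratio * median]

if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [30.0, 40.0], [1.0, 1.0]]
    print(find_outlier_tokens(tokens))
```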


[73] Syn4D: A Multiview Synthetic 4D Dataset cs.CV

Zeren Jiang, Yushi Lan, Yihang Luo, Yufan Deng, Zihang Lai

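TL;DR: The paper introduces Syn4D, a multiview synthetic 4D dataset for dynamic scene understanding. It provides camera motion, depth maps, dense tracking, and human pose annotations, supports unprojecting pixels into 3D across time and cameras, and is validated on several downstream tasks.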

Details

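Motivation: High-quality datasets with complete geometric annotations are scarce for dense 3D reconstruction and tracking of dynamic scenes from monocular video.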

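Result: Extensive evaluations on 4D scene reconstruction, 3D point tracking, geometry-aware camera retargeting, and human pose estimation demonstrate the dataset's effectiveness and its potential to advance research in dynamic scene understanding and spatiotemporal modeling.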

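Insight: The novelty is a synthetic 4D dataset that supports unprojecting pixels across time and camera views and provides dense, complete geometry and motion annotations, offering a new benchmark resource for dynamic scene analysis.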

Abstract: Dense 3D reconstruction and tracking of dynamic scenes from monocular video remains an important open challenge in computer vision. Progress in this area has been constrained by the scarcity of high-quality datasets with dense, complete, and accurate geometric annotations. To address this limitation, we introduce Syn4D, a multiview synthetic dataset of dynamic scenes that includes ground-truth camera motion, depth maps, dense tracking, and parametric human pose annotations. A key feature of Syn4D is the ability to unproject any pixel into 3D to any time and to any camera. We conduct extensive evaluations across multiple downstream tasks to demonstrate the utility and effectiveness of the proposed dataset, including 4D scene reconstruction, 3D point tracking, geometry-aware camera retargeting, and human pose estimation. The experimental results highlight Syn4D’s potential to facilitate research in dynamic scene understanding and spatiotemporal modeling.
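The "unproject any pixel into 3D" step can be sketched with the standard pinhole camera model. This is a generic illustration, not Syn4D's data loader: the intrinsics names (`fx`, `fy`, `cx`, `cy`), the depth convention (distance along the optical axis), and the pose format are assumptions. Moving a point to another time additionally needs the dataset's dense tracking annotations, which this sketch does not cover.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with depth along the optical axis to camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def transform(point, rotation, translation):
    """Apply a rigid transform p' = R p + t.

    rotation is a 3x3 matrix as nested lists; translation is a length-3 sequence.
    """
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

if __name__ == "__main__":
    # Hypothetical camera: 500 px focal length, principal point at (320, 240).
    p_cam = unproject(420, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # Camera-to-world pose: no rotation, camera placed 1 unit along world x.
    print(transform(p_cam, identity, (1.0, 0.0, 0.0)))
```

Once a point is in world coordinates, applying another camera's world-to-camera transform and projecting gives its pixel location in that view, which is how multiview correspondences are obtained from ground-truth depth and poses.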


cs.RO

[74] From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models cs.RO | cs.CV

Yihan Lin, Haoyang Li, Yang Li, Haitao Shen, Yihan Zhao

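TL;DR: The paper systematically studies latent action supervision for vision-language-action models. By comparing image-based and action-based latent action representations, it shows which supervision strategies suit long-horizon reasoning, scene generalization, and complex motor coordination tasks, and finds that directly supervising the VLM with discrete latent action tokens works best.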

Details

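Motivation: Approaches to supervising VLA models with latent actions across heterogeneous datasets are fragmented and lack a systematic comparison. The paper aims to unify the study framework and analyze the strengths and weaknesses of different supervision strategies.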

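Result: Under a unified VLA baseline, experiments show that image-based latent actions perform better on long-horizon reasoning and scene generalization tasks, whereas action-based latent actions are better at complex motor coordination; directly supervising the VLM with discrete latent action tokens yields the most effective performance.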

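Insight: The novelty is a systematic comparison of the two families of latent action supervision (image-based and action-based) that reveals a correspondence between supervision formulation and task type. Viewed objectively, direct supervision with discrete latent action tokens offers an efficient option for mixed-data training.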

Abstract: Latent actions serve as an intermediate representation that enables consistent modeling of vision-language-action (VLA) models across heterogeneous datasets. However, approaches to supervising VLAs with latent actions are fragmented and lack a systematic comparison. This work structures the study of latent action supervision from two perspectives: (i) regularizing the trajectory via image-based latent actions, and (ii) unifying the target space with action-based latent actions. Under a unified VLA baseline, we instantiate and compare four representative integration strategies. Our results reveal a formulation-task correspondence: image-based latent actions benefit long-horizon reasoning and scene-level generalization, whereas action-based latent actions excel at complex motor coordination. Furthermore, we find that directly supervising the VLM with discrete latent action tokens yields the most effective performance. Finally, our experiments offer initial insights into the benefits of latent action supervision in mixed-data training, suggesting a promising direction for VLA training. Code is available at https://github.com/RUCKBReasoning/From_Pixels_to_Tokens.


[75] Driver-WM: A Driver-Centric Traffic-Conditioned Latent World Model for In-Cabin Dynamics Rollout cs.RO | cs.AI | cs.CV

Haozhuang Chi, Daosheng Qiu, Hao Su, Haochen Liu, Zirui Li

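TL;DR: The paper proposes Driver-WM, a driver-centric latent world model for predicting in-cabin dynamics during shared-control transitions. Through a gated causal injection mechanism, it models external traffic conditions and internal driver states (physical motion, behavior, and emotional semantics) together in a compact latent space, achieving long-horizon geometric forecasting and semantic alignment.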

Details

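Motivation: Existing driving world models mainly forecast the external environment, while in-cabin intelligence is limited to recognition tasks and cannot roll out driver dynamics over multiple steps. The paper targets the problem of anticipating human-in-the-loop reactions during shared-control transitions in L2/L3 driving automation.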

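Result: On a multi-task assistive driving benchmark, Driver-WM yields robust long-horizon geometric forecasting for reactive high-motion maneuvers and improves semantic alignment for both driver and traffic states.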

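Insight: The novelty is a directionally coupled dual-stream architecture that performs gated causal injection through a learned vector gate, rolling out in-cabin dynamics conditioned on external traffic context while strictly preserving temporal causality. This explicit external-to-internal conditioning allows controlled test-time interventions for systematically analyzing model responses.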

Abstract: Safe L2/L3 driving automation requires anticipating human-in-the-loop reactions during shared-control transitions. While most driving world models forecast the external environment, in-cabin intelligence remains strictly recognition-oriented and lacks multi-step rollout capabilities for driver dynamics. We introduce Driver-WM, a driver-centric latent world model that rolls out in-cabin dynamics causally conditioned on out-cabin traffic context. This formulation unifies physical kinematics forecasting with auxiliary behavioral and emotional semantic recognition. Operating in a compact latent space constructed from frozen vision-language features, Driver-WM adopts a dual-stream architecture to separately encode external traffic and internal driver states. These streams are directionally coupled via a gated causal injection mechanism, which uses a learned vector gate to modulate external contextual perturbations while strictly enforcing temporal causality. Evaluations on a multi-task assistive driving benchmark demonstrate that Driver-WM yields robust long-horizon geometric forecasting for reactive high-motion maneuvers and improves semantic alignment for both driver and traffic states. Finally, the explicit external-to-internal conditioning allows for controlled test-time interventions to systematically analyze mechanism responses.


cs.NI

[76] Look Once, Beam Twice: Camera-Primed Real-Time Double-Directional mmWave Beam Management for Vehicular Connectivity cs.NI | cs.AI | cs.CE | cs.CV | eess.SY

Avhishek Biswas, Apala Pramanik, Eylem Ekici, Mehmet C. Vuran

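TL;DR: The paper proposes VIBE (VIsion-based BEamforming), a real-time double-directional mmWave beam management framework for vehicular communication. It fuses camera sensing, machine learning, model-based reasoning, and closed-loop RF feedback to speed up beam alignment and balance link establishment latency against link quality. The system is evaluated on indoor/outdoor testbeds, public datasets, and real-time vehicular experiments, showing strong generalization and low outage rates.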

Details

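Motivation: mmWave bands can provide high-speed connectivity for V2X networks but suffer from severe path loss and mobility-induced beam misalignment. Existing methods have high training overhead and limited generalization to unseen scenarios, so a fast, reliable double-directional beam alignment scheme is needed.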

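Result: On public datasets, VIBE outperforms state-of-the-art end-to-end ML beam selection models. Compared with 5G NR hierarchical beamforming, VIBE consistently maintains lower outage rates, reaching outage rates as low as 1.1-1.4% in experiments.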

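Insight: The core novelty is a hybrid model-based, closed-loop learning architecture that uses camera observations to shrink the beam-search space, bypassing exhaustive training overhead and speeding up link establishment. Lightweight beam refinement and offset tracking mechanisms adapt the beams to dynamic application requirements. Combining sensing with closed-loop communication feedback in this way is better suited to real-world mmWave vehicular scenarios than purely end-to-end ML models.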

Abstract: Millimeter-wave (mmWave) frequencies promise multi-gigabit connectivity for vehicle-to-everything (V2X) networks, but face challenges in terms of severe path loss and mobility-related beam misalignment. Reliable V2X connectivity requires fast, double-directional beam alignment. However, existing methods suffer from high training overhead and limited generalization to unseen scenarios. This paper presents VIsion-based BEamforming (VIBE), a hybrid model-based, closed-loop learning architecture for real-time double-directional mmWave beam management primed by camera sensing. VIBE fuses machine learning, model-based reasoning, and closed-loop RF feedback to balance beam-pair establishment latency with link quality. VIBE bypasses exhaustive training overhead and accelerates link establishment by leveraging camera observations to reduce the beam-search space. Lightweight beam refinement and offset tracking mechanisms adaptively refine beams in response to dynamic application requirements. VIBE is implemented and evaluated across online indoor/outdoor testbeds, public datasets, and real-time vehicular experiments, demonstrating strong generalization capabilities that make it suitable for real-time V2X communication. Comparisons with 5G NR hierarchical beamforming show that VIBE consistently maintains lower outage rates. Furthermore, VIBE outperforms state-of-the-art end-to-end ML models for beam selection when evaluated on public datasets and achieves outage rates as low as 1.1-1.4%. The results show that a hybrid model-based, closed-loop learning architecture is better suited for real-world mmWave vehicular connectivity than end-to-end trained ML models. For reproducibility, we publish our code at https://github.com/UNL-CPN-Lab/Look-Once-Beam-Twice.


cs.CY

[77] Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation cs.CY | cs.AI | cs.CL

David Gringras, Misha Salahshoor

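TL;DR: Through a bibliometric audit, the paper quantifies "frontier lag" in the AI capability evaluation literature: the models that papers evaluate tend to trail the actual frontier at evaluation time, and the gap is widening. It also shows that papers under-report reasoning-mode status and over-generalize their conclusions, and it proposes remedies including the VERSIO-AI checklist.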

Details

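Motivation: The paper addresses "capability misrepresentation" in the applied-domain LLM capability evaluation literature: papers often evaluate older, cheaper, or less-elicited models rather than the frontier at evaluation time, so readers cannot accurately learn what AI systems can currently do.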

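Result: The median paper evaluates a model about 10.85 ECI behind the contemporaneous frontier at evaluation time (roughly 1.4x the distance between Claude Sonnet 3.7 and Claude Opus 4.5), and the gap is widening at +5.53 ECI per year. Only 3.2% of abstracts and 21.2% of full texts disclose reasoning-mode status, and more than half of papers state conclusions at the level of "AI" rather than the specific evaluated model(s).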

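Insight: The novelty is the first large-scale quantification of the "publication elicitation gap" (frontier lag) in the AI evaluation literature, showing both its widening trend and the lack of transparent reporting. The proposed VERSIO-AI checklist (13 items, of which the Core 3 can trigger desk rejection) gives a concrete framework for mandatory disclosure of configuration details (model snapshot, reasoning mode, tool access, and so on), which would improve the transparency and timeliness of evaluations.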

Abstract: Readers of applied-domain LLM capability evaluations want to know what AI systems can currently do. That literature answers a related, but consequentially different, question: what older, cheaper, less-elicited models could do months or years earlier (a 2026 paper evaluating GPT-4o-mini zero-shot, say, against a frontier of reasoning-capable, tool-using systems like GPT-5.5 Pro and Claude Opus 4.7), often reported with sparse configuration details and abstracted upward into claims about “AI” that propagate through citations, media, and policy. We measure the ‘publication elicitation gap’ (the gap between these answers) in a pre-registered audit of 112,303 LLM-keyword-matched candidate records (2022-01 to 2026-04; 18,574 admissible, 4,766 full-paper texts retrievable), comparing tested models to the contemporaneous frontier on the Epoch AI Capabilities Index (ECI), reproduced under Arena Elo and Artificial Analysis. The median paper evaluates a model +10.85 ECI (~1.4x the distance between Claude Sonnet 3.7 and Claude Opus 4.5) behind the contemporaneous frontier at evaluation time (H1); an exploratory rational-lag baseline (H8) decomposes this into ~25% peer-review latency, ~75% excess lag. The gap is widening at +5.53 ECI/year (H2; 95% CI [+5.03, +5.83]). Meanwhile, only 3.2% of abstracts (21.2% of full-texts) disclose reasoning-mode status on reasoning-capable models (H4) and 52.5% (95% CI [48.2, 56.9]) state conclusions at the level of “AI” rather than the evaluated model(s), rising at OR = 1.23/year. Proposed remedies include API-access subsidies and editorial enforcement of reporting frameworks mandating configuration-surface disclosure (model snapshot, reasoning mode/effort, tool access, scaffolding, prompting, etc.); VERSIO-AI is a 13-item checklist (Core 3 desk-reject) extending existing frameworks at the elicitation surface, with per-DOI analysis at frontierlag.org.
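The headline numbers in this abstract can be unpacked with simple arithmetic. The sketch below only recombines figures quoted in the abstract; because the abstract marks the shares and the 1.4x ratio as approximate, the derived values are approximate too. The variable names are illustrative.

```python
# Figures quoted in the abstract.
median_lag_eci = 10.85       # median gap to the contemporaneous frontier (H1)
peer_review_share = 0.25     # approximate share attributed to peer-review latency (H8)
reference_ratio = 1.4        # gap is about 1.4x the Sonnet 3.7 to Opus 4.5 distance

# Decomposition of the median gap under the exploratory rational-lag baseline.
peer_review_lag = median_lag_eci * peer_review_share
excess_lag = median_lag_eci - peer_review_lag

# Implied ECI distance between Claude Sonnet 3.7 and Claude Opus 4.5.
implied_reference_distance = median_lag_eci / reference_ratio

if __name__ == "__main__":
    print(round(peer_review_lag, 2), round(excess_lag, 2))
    print(round(implied_reference_distance, 2))
```

So roughly 2.7 ECI of the median gap is attributed to peer-review latency and roughly 8.1 ECI to excess lag, and the reference distance between the two Claude models works out to about 7.75 ECI.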


astro-ph.CO

[78] Segmenting proto-halos with vision transformers astro-ph.CO | astro-ph.IM | cs.CV

Toka Alokda, Cristiano Porciani

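TL;DR: The paper explores deep learning for segmenting and classifying proto-halo regions in the cosmological initial density field according to their final halo mass at redshift z=0. It compares two architectures, a fully convolutional neural network based on V-Net and a U-Net transformer, and finds that the transformer significantly outperforms the CNN on all metrics, with sub-percent error in the total segmented mass per halo class. Both networks are more accurate than the perturbation-theory-based PINOCCHIO model, especially at low halo masses and in the detailed reconstruction of proto-halo boundaries.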

Details

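Motivation: The formation of dark-matter halos from small cosmological perturbations generated in the early universe is a highly non-linear process that is traditionally modeled with N-body simulations. The paper aims to use deep learning to segment and classify proto-halo regions directly from the initial density field.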

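Result: In proto-halo segmentation, the transformer network significantly outperforms the CNN on all metrics, reaching sub-percent error in the total segmented mass per halo class, and both deep learning models are much more accurate than the perturbation-theory-based PINOCCHIO model, especially at low halo masses and in boundary reconstruction.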

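Insight: The novelty is applying a vision transformer architecture to cosmological proto-halo segmentation and showing that it beats a conventional CNN. The paper also compares the density field, the tidal shear, and their combination as input features, and uses Grad-CAM heatmaps for a preliminary analysis of how the network exploits the input fields, which offers some insight into model interpretability.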

Abstract: The formation of dark-matter halos from small cosmological perturbations generated in the early universe is a highly non-linear process typically modeled through N-body simulations. In this work, we explore the use of deep learning to segment and classify proto-halo regions in the initial density field according to their final halo mass at redshift z=0. We compare two architectures: a fully convolutional neural network (CNN) based on the V-Net design and a U-Net transformer. We find that the transformer-based network significantly outperforms the CNN across all metrics, achieving sub-percent error in the total segmented mass per halo class. Both networks deliver much higher accuracy than the perturbation-theory-based model PINOCCHIO, especially at low halo masses and in the detailed reconstruction of proto-halo boundaries. We also investigate the impact of different input features by training models on the density field, the tidal shear, and their combination. Finally, we use Grad-CAM to generate class-activation heatmaps for the CNN, providing preliminary yet suggestive insights into how the network exploits the input fields.


cs.GR

[79] Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation cs.GR | cs.AI | cs.CL | cs.CV | cs.LG

Lin Song, Wenbo Li, Guoqing Ma, Wei Tang, Bo Wang

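TL;DR: JoyAI-Image is a unified multimodal foundation model that integrates visual understanding, text-to-image generation, and instruction-guided image editing. It couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), so that perception and generation interact through a shared multimodal interface. A scalable training recipe combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals, giving state-of-the-art or highly competitive performance across a broad set of benchmarks.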

Details

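Motivation: The goal is a unified multimodal model that handles visual understanding, generation, and editing together, and that moves from general visual competence toward stronger spatial intelligence by strengthening spatial awareness and controllable synthesis.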

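Result: Across understanding, generation, long-text rendering, and editing benchmarks, JoyAI-Image achieves state-of-the-art (SOTA) or highly competitive performance.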

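Insight: The novelties include coupling a spatially enhanced MLLM with an MMDiT so that perception and generation interact, and training methods such as unified instruction tuning and spatially grounded data. Viewed objectively, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning strengthens spatial intelligence and suggests a promising path for downstream applications such as vision-language-action systems and world models.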

Abstract: We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.


cs.LG [Back]

[80] RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction cs.LG | cs.AI | cs.CL [PDF]

Sihao Liu, YuFan Xiong, Zhonghua Jiang, Zhaode Wang, chengfei lv, Shengyu Zhang

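TL;DR: This paper proposes RetentiveKV, an entropy-driven KV cache optimization method based on state space models, aimed at the KV cache bloat that long visual contexts cause in multimodal large language models. It reformulates KV eviction from discrete context truncation into continuous memory evolution: information entropy quantifies the information potential of low-attention tokens, and entropy-guided state transitions fold them into a continuous state space so they can be dynamically reactivated during later decoding.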

Details

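Motivation: When multimodal large language models process long visual contexts, the visual KV cache expands substantially, creating severe challenges for computational efficiency and memory consumption. Existing KV cache compression methods typically prune tokens under the “persistence of importance” hypothesis, which is fragile in multimodal settings for two reasons: visual tokens exhibit “deferred importance” (low initial salience but pivotal in later decoding), and discrete pruning disrupts the spatial continuity of visual cues.

Result: Extensive experiments on multimodal benchmarks show that RetentiveKV achieves 5.0x KV cache compression and 1.5x decoding acceleration.

Insight: The claimed novelty is reformulating KV cache eviction as continuous memory evolution based on state space models, and using information entropy to quantify and manage the information potential of low-attention tokens so they can be dynamically reactivated. Viewed objectively, the contribution lies in bringing state space models and entropy into KV cache management to address the deferred importance and spatial continuity of visual tokens, a novel cache optimization approach tailored to multimodal settings.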

Abstract: Multimodal Large Language Models face severe challenges in computational efficiency and memory consumption due to the substantial expansion of the visual KV cache when processing long visual contexts. Existing KV cache compression methods typically rely on the “persistence of importance” hypothesis to prune tokens. However, this approach proves fragile in multimodal settings due to two key issues: 1) Visual tokens display “deferred importance,” initially exhibiting low salience but becoming pivotal during later decoding, which can lead to premature eviction. 2) Discrete pruning disrupts the inherent spatial continuity of visual cues. To address these challenges, we propose RetentiveKV, an entropy-driven KV cache optimization method that reformulates KV eviction from “discrete context truncation” to “continuous memory evolution” based on State Space Models. Our method leverages information entropy to quantify the information potential of low-attention tokens and integrates tokens scheduled for eviction into a continuous state space through entropy-guided state transitions, enabling their dynamic reactivation when semantic relevance arises during subsequent decoding. Extensive experiments on multimodal benchmarks demonstrate that RetentiveKV achieves 5.0 times KV cache compression and 1.5 times decoding acceleration.


[81] Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO cs.LG | cs.AI | cs.CL [PDF]

Zhiyuan Zeng, Jiameng Huang, Zhangyue Yin, Jiashuo Liu, Ziniu Li

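TL;DR: This paper examines how policy gradient terms are aggregated in GRPO-style reinforcement learning, analyzes the optimization biases of standard sequence aggregation and token aggregation, and proposes Balanced Aggregation to address them.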

Details

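Motivation: How token-level policy gradient terms are aggregated within each sampled group is a key design choice in GRPO training that remains underexplored. Standard sequence aggregation and token aggregation each introduce a different optimization bias, so a more balanced solution is needed.

Result: With Qwen2.5-Math-7B and Qwen3-1.7B trained on DAPO-17k and Polaris and evaluated on six reasoning and coding benchmarks, Balanced Aggregation consistently improves training stability and final performance over the standard aggregation strategies.

Insight: The contribution is to reveal how the aggregation rule shapes optimization bias and to propose a balanced rule that computes token-level means separately within the positive and negative subsets and then combines them with sequence-count-based weights. Viewed objectively, the study establishes aggregation as a critical design dimension in GRPO-style RLVR, with the relative effectiveness of token and sequence aggregation governed by response-length variation and the positive-negative length gap.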

Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a central paradigm for improving reasoning and code generation in large language models, and GRPO-style training is widely adopted for its simplicity and effectiveness. However, an important design choice remains underexplored: how token-level policy gradient terms are aggregated within each sampled group. Standard GRPO uses sequence aggregation, while recent work has advocated token aggregation as a better alternative. We show that these two rules induce different optimization biases: token aggregation introduces sign-length coupling, while sequence aggregation implicitly downweights longer responses through sequence-level equal weighting. To address this tension, we propose Balanced Aggregation (BA), a simple drop-in replacement that computes token-level means separately within the positive and negative subsets and then combines them with sequence-count-based weights. Experiments with Qwen2.5-Math-7B and Qwen3-1.7B on DAPO-17k and Polaris, evaluated on six reasoning and coding benchmarks, show that BA consistently improves training stability and final performance over standard token and sequence aggregation. Our analysis further shows that the relative effectiveness of token and sequence aggregation is largely governed by response-length variation and the positive-negative length gap, highlighting aggregation as a critical design dimension in GRPO-style RLVR.
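
The three aggregation rules named in the abstract can be compared on toy numbers. The sketch below is an illustration written for this digest, not the authors' code: the function names, the use of `n_subset / n_total` as the sequence-count weight, and the choice to skip zero-advantage sequences are all assumptions. Each entry of `token_terms` stands for the per-token objective terms of one sampled response.

```python
import numpy as np

def sequence_aggregation(token_terms):
    # Average within each sequence first, then across sequences:
    # every sequence counts equally, so tokens of long responses are downweighted.
    return float(np.mean([np.mean(t) for t in token_terms]))

def token_aggregation(token_terms):
    # Average over every token in the group: long responses dominate.
    return float(np.concatenate(token_terms).mean())

def balanced_aggregation(token_terms, advantages):
    # Assumed form: token-level mean inside the positive subset and inside the
    # negative subset, each weighted by its share of the sequences in the group.
    n_total = len(token_terms)
    total = 0.0
    for in_subset in (lambda a: a > 0, lambda a: a < 0):
        subset = [t for t, a in zip(token_terms, advantages) if in_subset(a)]
        if subset:
            token_mean = float(np.concatenate(subset).mean())
            total += (len(subset) / n_total) * token_mean
    return total

# Toy group: two positive responses (2 and 6 tokens) and one long negative one.
terms = [np.full(2, 2.0), np.full(6, 1.0), np.full(8, -1.0)]
advs = [1.0, 1.0, -1.0]
print(sequence_aggregation(terms))        # about 0.667
print(token_aggregation(terms))           # 0.125
print(balanced_aggregation(terms, advs))  # about 0.5
```

In this toy group the long negative response pulls the token-aggregated value far down, which is the kind of sign-length interaction the abstract describes; the balanced rule removes the length imbalance between the two sign subsets while still averaging at token level inside each subset.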


[82] Position: the Stochastic Parrot in the Coal Mine. Model Collapse is a Threat to Low-Resource Communities cs.LG | cs.CL | cs.CY [PDF]

Devon Jarvis, Richard Klein, Benjamin Rosman, Steven James, Stefano Sarao Mannelli

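TL;DR: This position paper examines the threat that model collapse poses to low-resource communities. As generative models are increasingly trained on the outputs of prior models, performance degrades, which aggravates data degradation, cultural bias, and wasted resources and may hinder the democratization of AI.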

Details

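Motivation: The paper combines model collapse with related critiques of large language models (their reliance on frequent patterns in training data, their need for vast datasets, and their high environmental cost) to argue that model collapse threatens efforts to democratize AI and disproportionately affects low-resource and marginalized communities.

Result: The paper reports no experiments or benchmark results. It rests on theoretical analysis and position argument, stressing that model collapse reduces training efficiency and skews data distributions away from their tails, which undermines fairness for these communities.

Insight: The contribution is to connect model collapse with its environmental and cultural implications and to issue a call to action to mitigate its negative impact on low-resource communities, offering a new critical perspective on AI democratization along with initial mitigation directions.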

Abstract: Model collapse, the degradation in performance that arises when generative models are trained on the outputs of prior models, is an increasing concern as artificially generated content proliferates. Related critiques of large language models have highlighted their tendency to reproduce frequent patterns in training data, their reliance on vast datasets, and their substantial environmental cost. Together, these factors contribute to data degradation, the reinforcement of cultural biases, and inefficient resource use. In this position paper we aim to combine these views and argue that model collapse threatens current efforts to democratize AI. By reducing training efficiency and skewing data distributions away from the tails of their support, model collapse disproportionately impacts low-resource and marginalized communities. We examine both the environmental and cultural implications of this phenomenon, situate our position within recent position papers on model collapse, and conclude with a call to action. Finally, we outline initial directions for mitigating these effects.


[83] Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models cs.LG | cs.CL | cs.CV [PDF]

Huatian Zhang, Zhendong Mao, Lei Zhang, Yongdong Zhang

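TL;DR: This paper proposes an uncertainty-aware exploratory Direct Preference Optimization method to address hallucination in multimodal large language models. By quantifying the model's epistemic uncertainty in visual grounding, it guides the model to actively explore and correct its own deficiencies in understanding visual details, improving visual fidelity.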

Details

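Motivation: Existing DPO-based methods allocate training emphasis according to the model's self-assessed visual sensitivity signals. Because the model is still under training, these signals carry self-referential bias: they reinforce visual cues that are already well learned while neglecting hard-to-perceive but critical details, which limits deeper alignment.

Result: Extensive experiments demonstrate the method's effectiveness and robustness, but the abstract does not state which benchmarks were used or what level of performance (for example, SOTA) was reached.

Insight: The contribution is an exploration mechanism based on token-level epistemic uncertainty that lets the model identify its cognitive deficiencies and actively self-correct, applying more learning pressure to visually deficient tokens in preferred samples and alleviating the over-penalization of beneficial knowledge in dispreferred samples. The method also comes with a theoretical justification.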

Abstract: Direct Preference Optimization (DPO) has proven to be an effective solution for mitigating hallucination in Multimodal Large Language Models (MLLMs) by learning from preference pairs. One of its key challenges lies in how to transfer the sequence-level preference into fine-grained supervision on visual fidelity. To safeguard vision-related tokens that are prone to hallucination, existing methods typically allocate training emphasis according to the model’s self-assessed visual sensitivity signals. However, such sensitivity, estimated by a model still under training, introduces self-referential bias: reinforcing already well-learned visual cues while neglecting hard-to-perceive but critical details, thereby limiting deeper alignment. In this work, we propose an Uncertainty-aware Exploratory Direct Preference Optimization (UE-DPO) method for MLLMs, which enables the model to uncover its cognitive deficiencies and actively explore for self-correction, guided by token-level epistemic uncertainty. Specifically, we first quantify the uncertainty from the model’s failure to ground token predictions in the given image. Then, based on an uncertainty-aware exploration intensity, we encourage more learning pressure on visually deficient tokens in preferred samples, and alleviate the over-penalization of beneficial knowledge in dispreferred samples. Further, we provide a theoretical justification for our method, and extensive experiments demonstrate its effectiveness and robustness.


[84] Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization cs.LG | cs.CL [PDF]

Xiyan Fu, Wei Liu

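TL;DR: This paper studies outcome-level reinforcement learning (RL) as a way to improve compositional generalization. It adopts Group Relative Policy Optimization (GRPO) to optimize models on feedback about their final outputs, and explores both a binary outcome reward and a composite reward that adds composition feedback. Experiments show that, compared with supervised fine-tuning (SFT), RL improves compositional generalization by reshaping the output distribution and reducing overfitting to frequent training compositions, particularly on more complex composition types.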

Details

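Motivation: Compositional generalization means correctly interpreting novel combinations of known primitives. Existing approaches mostly rely on supervised fine-tuning, which encourages models to imitate target outputs, but this token-level training paradigm fails to capture global compositional structure and does not generalize to unseen combinations. The paper asks whether outcome-level reinforcement learning can improve compositional generalization instead.

Result: Experiments on multiple compositional benchmarks show that reinforcement learning improves compositional generalization over supervised fine-tuning. Further analysis shows that supervised models tend to overfit frequent training compositions, whereas reinforcement learning improves generalization by reshaping the output distribution, particularly for more complex composition types.

Insight: The contribution is to move compositional generalization from conventional supervised fine-tuning to outcome-level RL optimization, using GRPO and a composite reward to emphasize global feedback on the final output rather than local token-level imitation. This helps the model capture compositional structure and reduces overfitting.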

Abstract: Compositional generalization refers to correctly interpreting novel combinations of known primitives, which remains a major challenge. Existing approaches often rely on supervised fine-tuning, which encourages models to imitate target outputs. This token-level training paradigm fails to capture the global compositional structure required for generalizing to unseen combinations. In this work, we investigate whether compositional generalization can instead be improved through outcome-level reinforcement learning. We adopt Group Relative Policy Optimization to optimize models based on feedback on their final outputs. Within this framework, we explore both a simple binary outcome reward and a composite reward that provides additional composition feedback. Experiments on multiple compositional benchmarks show that reinforcement learning improves compositional generalization compared to supervised fine-tuning. Further analysis reveals that supervised models tend to overfit frequent training compositions, whereas reinforcement learning improves compositional generalization by reshaping the output distribution, particularly for more complex composition types.
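
The outcome-level signal described above can be illustrated with the group-relative advantage that GRPO is generally known for: rewards for a group of sampled outputs are standardized within the group. The sketch below pairs it with a binary exact-match reward. The function names and the exact-match criterion are assumptions made for illustration, and the composite reward is omitted because the abstract does not specify it.

```python
import numpy as np

def binary_outcome_reward(output: str, target: str) -> float:
    # Assumed criterion: 1 if the whole final output matches the target, else 0.
    return 1.0 if output.strip() == target.strip() else 0.0

def group_relative_advantages(rewards, eps: float = 1e-8):
    # Standardize rewards within one sampled group (same prompt, several outputs).
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four sampled outputs for one prompt, scored only on the final result.
# The target and outputs are made-up strings in a toy command language.
target = "JUMP JUMP WALK"
samples = ["JUMP JUMP WALK", "JUMP WALK", "WALK JUMP JUMP", "JUMP JUMP WALK"]
rewards = [binary_outcome_reward(s, target) for s in samples]
print(group_relative_advantages(rewards))  # positive for correct samples, negative otherwise
```

Because the reward looks only at the complete output, every token of a sample shares the same advantage, which is the contrast with token-level imitation that the abstract draws.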


[85] Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers cs.LG | cs.CL [PDF]

Senkang Hu, Yong Dai, Xudong Han, Zhengru Fang, Yuzhi Zhao

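TL;DR: This paper proposes Self-Induced Outcome Potential (SIOP) for credit assignment over the intermediate information-gathering turns of long-horizon LLM agents. It treats semantic clusters of final answers as latent future outcome states, builds a reliability-aware target distribution over them, and rewards turns that increase posterior support for reliable future states, achieving turn-level credit assignment without task-specific verifiers or answer supervision.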

Details

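Motivation: Long-horizon LLM agents depend on intermediate information-gathering turns, yet training feedback is usually observed only at the final answer because process-level rewards require high-quality human annotation. Existing turn-level shaping methods need answer supervision or stable task-specific verifiers, while label-free RL methods extract self-signals mainly at the answer or trajectory level and cannot assign credit to intermediate turns.

Result: On seven search-augmented agentic reasoning benchmarks, SIOP improves average performance over verifier-free outcome-level baselines and approaches a gold-supervised outcome baseline.

Insight: The contribution is to use semantic clusters of final answers as latent future outcome states for potential-based turn-level credit assignment. This generalizes information-potential shaping from gold-answer supervision to settings without task-specific gold verifiers, while avoiding the broadcasted rollout-level advantages used by standard GRPO.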

Abstract: Long-horizon LLM agents depend on intermediate information-gathering turns, yet training feedback is usually observed only at the final answer, because process-level rewards require high-quality human annotation. Existing turn-level shaping methods reward turns that increase the likelihood of a gold answer, but they require answer supervision or stable task-specific verifiers. Conversely, label-free RL methods extract self-signals from output distributions, but mainly at the answer or trajectory level and therefore cannot assign credit to intermediate turns. We propose Self-Induced Outcome Potential (SIOP), which treats semantic clusters of final answers as latent future outcome states for potential-based turn-level credit assignment. For each query, SIOP samples multiple rollouts, clusters final answers into semantic outcome modes, and builds a reliability-aware target distribution over these states. It then rewards turns for increasing posterior support for reliable future states using a tractable cluster-level approximation. The objective generalizes information-potential shaping from gold-answer supervision to settings without task-specific gold verifiers while avoiding the broadcasted rollout-level advantages used by standard GRPO. We formalize the framework, characterize its supervised gold-answer limit, and show that SIOP improves average performance over verifier-free outcome-level baselines on seven search-augmented agentic reasoning benchmarks while approaching a gold-supervised outcome baseline. Code is available at https://github.com/dl-m9/SIOP.git.
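
The general idea of potential-based turn-level credit can be sketched generically: define a potential after each turn and reward a turn by how much it raises that potential. The code below is not SIOP itself. It assumes, purely for illustration, that the target distribution is the empirical frequency of answer clusters across rollouts and that the potential is the expected target mass under a per-turn posterior over clusters; how the paper builds its reliability-aware target and estimates the posterior is not given in the abstract. All names are hypothetical.

```python
import numpy as np

def cluster_target(cluster_ids, num_clusters):
    # Assumed target: share of rollouts whose final answer fell in each cluster.
    counts = np.bincount(cluster_ids, minlength=num_clusters).astype(float)
    return counts / counts.sum()

def turn_level_rewards(posteriors, target):
    # posteriors[t] is a distribution over clusters after t turns (index 0 = before any turn).
    # Potential = expected target mass; reward for a turn = change in potential.
    potentials = [float(np.dot(target, p)) for p in posteriors]
    return [nxt - cur for cur, nxt in zip(potentials, potentials[1:])]

# Eight rollouts whose final answers fall into three semantic clusters.
target = cluster_target([0, 0, 0, 0, 0, 1, 1, 2], num_clusters=3)
posteriors = [
    np.array([1 / 3, 1 / 3, 1 / 3]),  # before any turn
    np.array([0.6, 0.3, 0.1]),        # after turn 1: mass moves to the dominant cluster
    np.array([0.5, 0.4, 0.1]),        # after turn 2: mass drifts away again
]
print(turn_level_rewards(posteriors, target))  # first turn rewarded, second penalized
```

The per-turn rewards telescope to the total change in potential over the rollout, so each turn receives its own signal instead of a single advantage broadcast to every turn.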


[86] Improving Medical VQA through Trajectory-Aware Process Supervision cs.LG | cs.CV [PDF]

Halil Ibrahim Gulluk, Olivier Gevaert

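TL;DR: This paper improves medical visual question answering (VQA) through trajectory-aware process supervision. It first uses the COMCTS algorithm with open-source vision-language models to generate datasets with reasoning trajectories for six medical VQA benchmarks. On top of these, it designs a two-stage training framework: supervised fine-tuning followed by Group Relative Policy Optimization (GRPO) with a process-based reward. Experiments show that the method outperforms an SFT-only baseline on accuracy, BERTScore, and ROUGE-L.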

Details

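Motivation: Existing medical VQA datasets rarely include reasoning explanations, which limits the development of reliable reasoning in models. The paper generates reasoning-trajectory data and uses it to give medical VQA models process-level supervision.

Result: On six medical VQA benchmarks, combining the Dynamic Time Warping (DTW)-based process reward with the exact-match answer reward raises mean accuracy from 0.598 to 0.689, mean BERTScore from 0.845 to 0.881, and mean ROUGE-L from 0.665 to 0.748 relative to SFT-only training.

Insight: The core contribution is a trajectory-aware process supervision mechanism: a process-based reward that embeds reasoning steps with sentence embeddings and uses DTW distance to measure how closely the generated reasoning matches the ground-truth reasoning, so the reasoning path itself is optimized during the RL stage. This offers an effective recipe for training medical vision-language models with stronger reasoning.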

Abstract: Reasoning capabilities are crucial for reliable medical visual question answering (VQA); however, existing datasets rarely include reasoning explanations. We address this by generating reasoning trajectories for six medical VQA benchmarks using the COMCTS algorithm with open-source vision-language models, with an LLM serving as the verification judge. Building on these generated datasets, we propose a two-stage training framework: supervised fine-tuning followed by Group Relative Policy Optimization (GRPO) with a novel process-based reward. While standard approaches rely solely on exact-match rewards for final answers, we introduce a trajectory-aware reward that measures the similarity between generated and ground-truth reasoning processes. Specifically, we embed reasoning steps using sentence transformers and compute the Dynamic Time Warping (DTW) distance between the resulting vector sequences. Experiments across six benchmarks demonstrate that combining the DTW-based process reward with exact-match reward consistently outperforms SFT-only training, raising mean accuracy from 0.598 to 0.689, mean BERTScore from 0.845 to 0.881, and mean ROUGE-L from 0.665 to 0.748. Our results highlight the importance of process supervision in training reasoning-capable medical VLMs. We make our code and generated reasoning datasets publicly available at https://anonymous.4open.science/r/MICCAI-R1-MED-VQA-code-B14B/
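
The DTW step can be sketched with the textbook dynamic-programming recurrence over two sequences of step embeddings. This is an illustration, not the authors' implementation: the Euclidean step cost, the `1 / (1 + distance)` mapping from distance to reward, and the toy 2-D vectors standing in for sentence-transformer embeddings are all assumptions.

```python
import numpy as np

def dtw_distance(a, b):
    # Classic DTW between sequences of vectors a (n x d) and b (m x d).
    # Assumed local cost: Euclidean distance between step embeddings.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

def process_reward(generated_steps, reference_steps):
    # Hypothetical mapping: smaller DTW distance gives a reward closer to 1.
    return 1.0 / (1.0 + dtw_distance(generated_steps, reference_steps))

# Toy embeddings: the generated trajectory repeats its first step once.
reference = [[0.0, 0.0], [1.0, 0.0]]
generated = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]]
print(dtw_distance(generated, reference))    # 0.0, the repeated step is absorbed by warping
print(process_reward(generated, reference))  # 1.0
```

DTW suits this comparison because generated and reference reasoning can differ in the number of steps; the alignment lets one step match several without penalizing length alone.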


[87] Towards General Preference Alignment: Diffusion Models at Nash Equilibrium cs.LG | cs.CV [PDF]

Jiaming Hu, Jiamu Bai, Haoyu Wang, Debarghya Mukherjee, Ioannis Ch. Paschalidis

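TL;DR: This paper proposes Diffusion Nash Preference Optimization (Diff.-NPO), a game-theoretic preference alignment method for diffusion models. It targets a limitation of existing preference-based alignment methods such as DPO, namely their reliance on reward-induced preference signals and the Bradley–Terry model, by having the current policy play against itself to improve and align better with human preferences.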

Details

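Motivation: Existing preference-based alignment methods for diffusion models (such as DPO) rely on reward-induced preference signals and typically assume that human preferences are adequately modeled by the Bradley–Terry model, which may fail to capture their full complexity.

Result: On text-to-image generation, Diff.-NPO is shown to be effective across various metrics and consistently outperforms existing preference-based diffusion alignment methods.

Insight: The paper reformulates diffusion alignment from a game-theoretic perspective and proposes a general preference framework in which the policy plays against itself, avoiding explicit reward modeling and the assumption of a specific preference model, which improves generality and alignment quality.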

Abstract: Reinforcement learning from human feedback (RLHF) has been popular for aligning text-to-image (T2I) diffusion models with human preferences. As a mainstream branch of RLHF, Direct Preference Optimization (DPO) offers a computationally efficient alternative that avoids explicit reward modeling and has been widely adopted in diffusion alignment. However, existing preference-based methods for diffusion alignment still rely on reward-induced preference signals and typically assume that human preferences can be adequately modeled by the Bradley–Terry (BT) model, which may fail to capture the full complexity of human preferences. In this paper, we formulate diffusion alignment from a game-theoretic perspective. We propose Diffusion Nash Preference Optimization (Diff.-NPO), an intuitive general preference framework for diffusion alignment. Diff.-NPO encourages the current policy to play against itself to achieve self improvement and lead to a better alignment. Empirically, we demonstrate the effectiveness of Diff.-NPO on the text-to-image generation task via various metrics. Diff.-NPO consistently outperforms existing preference-based diffusion alignment methods.
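
For reference, the Bradley–Terry assumption mentioned in the abstract has a standard closed form. In the notation below, which is chosen here for illustration and is not taken from the paper, c is a prompt, x_1 and x_2 are two generated images, r is a scalar reward, and sigma is the logistic function:

```latex
P(x_1 \succ x_2 \mid c)
  = \frac{\exp r(c, x_1)}{\exp r(c, x_1) + \exp r(c, x_2)}
  = \sigma\bigl(r(c, x_1) - r(c, x_2)\bigr)
```

Because every preference is derived from one scalar score per sample, this model cannot represent cyclic preferences (x_1 preferred to x_2, x_2 to x_3, and x_3 to x_1, each with probability above 1/2). A general preference framework works with the pairwise probability directly instead of deriving it from r.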