Table of Contents

cs.CL [Back]

[1] Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents? cs.CLPDF

Leyao Wang, Yanan He, Peng Chen, Asaf Yehudai, Yixin Liu

TL;DR: 本文提出了REFLECT基准,用于对LLM作为评估者(LLM-as-judge)在评估深度研究智能体时的可靠性进行元评估。研究发现,当前LLM评估者在检测智能体执行过程中的细粒度失败模式时表现不佳,总体准确率低于55%,尤其是在证据验证方面。

Details

Motivation: 随着深度研究智能体在自动化复杂信息检索任务中的作用日益重要,需要可扩展且可靠的评估方法,LLM-as-judge成为一种监督范式。然而,在部署LLM评估者来监督研究智能体之前,必须首先评估这些评估者本身的可靠性,而现有的元评估方法存在不足。

Result: 在REFLECT基准上的实验表明,当前表现最好的LLM评估模型在检测推理、工具使用和报告质量等方面的失败模式时,总体准确率低于55%,在证据验证方面表现尤其差。

Insight: 论文的创新点在于提出了一个细粒度、可验证的元评估基准(REFLECT),通过受控干预在高质量智能体执行轨迹上实例化失败模式,从而系统性地暴露LLM评估者的局限性。这为构建更可靠的深度研究智能体评估流程提供了具体指导。

Abstract: Deep research agents increasingly automate complex information-seeking tasks, producing evidence-grounded reports via multi-step reasoning, tool use, and synthesis. Their growing role demands scalable, reliable evaluation, positioning LLM-as-judge as a supervision paradigm for assessing factual accuracy, evidence use, and reasoning quality. Yet the reliability of these judges for deep research agents remains poorly understood, posing a critical meta-evaluation problem: before deploying LLM judges to supervise research agents, we must first evaluate the judges themselves. Existing meta-evaluations fall short in two ways: (1) reliance on coarse, subjective human-preference agreement; (2) focus on instruction-following or verifiable tasks, leaving open-ended agent executions unexplored. To address these gaps, we introduce REFLECT (REliable Fine-grained LLM judge Evaluation via Controlled inTervention), a meta-evaluation benchmark targeting fine-grained failure detection in agentic environments. REFLECT defines a detailed taxonomy of process- and outcome-level failure modes, instantiated by performing controlled and localized interventions on quality-screened agent execution traces. This yields verifiable, comprehensive, and fine-grained instances for validating the judge models. Our experiments show that current LLM judges remain unreliable: even the best-performing models achieve overall accuracies below 55% across reasoning, tool-use, and report-quality failures, with especially poor performance on evidence verification. Together, our taxonomy and findings expose systematic judge limitations, reveal tradeoffs in cost and reliability, and offer actionable guidance for building more reliable evaluation pipelines for deep research agents.


[2] Prompting language influences diagnostic reasoning and accuracy of large language models cs.CLPDF

Adrien Bazoge, Josselin Corvellec, Sofiane Djillali Sid-Ahmed, Pierre-Antoine Gourraud

TL;DR: 本研究评估了提示语言(英语与法语)对五种大型语言模型(o3、DeepSeek-R1、GPT-4-Turbo、Llama-3.1-405B-Instruct和BioMistral-7B)在临床诊断推理和最终诊断准确性方面的影响。研究使用180个涵盖16个医学专业的临床案例,由两名医生根据18点评分量表进行评估。结果表明,除o3外,其他四种模型在英语提示下的表现均显著优于法语,差异体现在鉴别诊断、逻辑结构和内部有效性等多个推理维度。

Details

Motivation: 目前LLM在临床决策支持方面的评估大多基于英语,其在其他语言环境下的可靠性尚不确定。本研究旨在探究提示语言是否会影响LLM的诊断推理质量和诊断准确性,以评估其全球公平部署的潜力。

Result: 在评估的五种模型中,有四种(DeepSeek-R1、GPT-4-Turbo、Llama-3.1-405B-Instruct、BioMistral-7B)在英语提示下的综合表现显著优于法语(平均差异0.37-0.91分,调整后p<0.05)。只有o3模型未显示出整体语言效应。

Insight: 论文的创新点在于首次系统性地量化了提示语言对LLM临床诊断性能的影响,揭示了语言偏差不仅影响最终答案准确性,还深刻影响中间推理过程的质量。这一发现强调了在非英语环境中部署临床AI时进行针对性评估和优化的必要性,对实现公平的全球医疗AI应用具有重要启示。

Abstract: Large language models (LLMs) are increasingly explored for clinical decision support, yet most evaluations are conducted in English, leaving their reliability in other languages uncertain. Here we evaluate the impact of prompting language on diagnostic reasoning and final diagnosis accuracy by comparing English and French performance across five LLMs (o3, DeepSeek-R1, GPT-4-Turbo, Llama-3.1-405B-Instruct, and BioMistral-7B). A total of 180 clinical vignettes covering 16 medical specialties were assessed by two physicians using an 18-point scale evaluating both diagnosis accuracy and reasoning quality. Four of the five models performed better in English (mean difference 0.37-0.91, adjusted p < 0.05), with the gap spanning multiple aspects of reasoning, including differential diagnosis, logical structure, and internal validity. o3 was the only model showing no overall language effect. These findings demonstrate that prompting language remains a critical determinant of LLM clinical performance, with implications for equitable linguistico-cultural deployment worldwide.


[3] Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution cs.CL | cs.AI | cs.IT | cs.LGPDF

Xiaoou Liu, Tiejin Chen, Dengjia Zhang, Yaqing Wang, Lu Cheng

TL;DR: 本文提出了Stepwise Confidence Attribution(SCA)框架,用于诊断黑盒大语言模型在多步推理任务中的失败。SCA基于信息瓶颈原则,仅利用生成的推理轨迹为每个步骤分配置信度,从而识别潜在错误步骤。

Details

Motivation: 现有置信度估计方法局限于最终答案或需要模型内部访问,难以诊断多步推理轨迹中的具体失败步骤,因此需要一种仅基于生成内容的方法来评估步骤级置信度。

Result: 在数学推理和多跳问答任务上的实验表明,SCA能可靠识别与推理错误强相关的低置信度步骤,且使用步骤级置信度指导自我纠正可将纠正成功率提升高达13.5%。

Insight: 创新点在于将信息瓶颈原则应用于黑盒LLM的步骤级置信度估计,提出了无需图结构的非参数方法NIBS和基于图的学习方法GIBS,实现了仅依赖生成轨迹的错误诊断。

Abstract: Large Language Models have achieved strong performance on reasoning tasks with objective answers by generating step-by-step solutions, but diagnosing where a multi-step reasoning trace might fail remains difficult. Confidence estimation offers a diagnostic signal, yet existing methods are restricted to final answers or require internal model access. In this paper, we introduce Stepwise Confidence Attribution (SCA), a framework for closed-source LLMs that assigns step-level confidence based only on generated reasoning traces. SCA applies the Information Bottleneck principle: steps aligning with consensus structures across correct solutions receive high confidence, while deviations are flagged as potentially erroneous. We propose two complementary methods: (1) NIBS, a non-parametric IB approach measuring consistency without graph structures, and (2) GIBS, a graph-based IB model that learns subgraphs through a differentiable mask to capture logical variability. Extensive experiments on mathematical reasoning and multi-hop question answering show that SCA reliably identifies low-confidence steps strongly correlated with reasoning errors. Moreover, using step-level confidence to guide self-correction improves the correction success rate by up to 13.5% over answer-level feedback.


[4] OpenCompass: A Universal Evaluation Platform for Large Language Models cs.CL | cs.LGPDF

Maosong Cao, Kai Chen, Haodong Duan, Yixiao Fang, Tong Gao

TL;DR: 本文提出了OpenCompass,一个用于大语言模型(LLM)的一站式、可扩展、支持高并发的通用评估平台。该平台旨在解决当前基于静态基准数据集评估方法面临的挑战,如任务类型多样、评估标准不一以及数据处理流程碎片化等问题,通过模块化设计提供高兼容性、灵活性和高并发性,支持多领域主流基准数据集,为学术界和工业界提供统一的LLM评估工具。

Details

Motivation: 随着大语言模型的快速迭代,对其能力进行客观、定量和全面的评估已成为推动技术发展的关键环节。当前主流的基于静态基准数据集的评估方法面临任务类型多样、评估标准不一致以及数据与处理流程碎片化等挑战,难以高效进行跨领域和大规模模型评估。

Result: 论文未在摘要中提供具体的定量实验结果或基准测试排名(如SOTA比较)。它主要介绍了OpenCompass平台的设计、架构和功能,宣称其支持多领域主流基准数据集,并能通过统一工具促进LLM的优缺点识别与后续优化。

Insight: 论文的创新点在于提出了一个模块化、组件解耦的通用LLM评估平台,其核心架构包含配置系统、任务划分、执行调度、任务执行单元和结果可视化五大组件,并提供了基于规则、LLM-as-a-Judge和级联评估器等多种评估工作流以适应不同任务场景。从客观角度看,该平台通过一站式、高并发的设计,有望标准化和简化LLM评估流程,提升评估效率与可比性。

Abstract: In recent years, the field of artificial intelligence has undergone a paradigm shift from task-specific small-scale models to general-purpose large language models (LLMs). With the rapid iteration of LLMs, objective, quantitative, and comprehensive evaluation of their capabilities has become a critical link in advancing technological development. Currently, the mainstream static benchmark dataset-based evaluation methods face challenges such as the diversity of task types, inconsistent evaluation criteria, and fragmentation of data and processing workflows, making it difficult to efficiently conduct cross-domain and large-scale model evaluation. To address the aforementioned issues, this paper proposes and open-sources OpenCompass, a one-stop, scalable, and high-concurrency-supported general-purpose LLM evaluation platform. Adhering to the design philosophy of modularization and component decoupling, the platform boasts three core advantages: high compatibility, flexibility, and high concurrency. The core architecture of OpenCompass comprises five key components: the Configuration System, Task Partitioning Module, Execution and Scheduling Module, Task Execution Unit, and Result Visualization Module. Its workflow provides rule-based, LLM-as-a-Judge, and cascaded evaluators to adapt to the requirements of different task scenarios. Supporting mainstream benchmark datasets across multiple domains, including knowledge, reasoning, computation, science, language, code, etc., the platform offers a unified and efficient LLM evaluation tool for both academia and industry, facilitating the accurate identification of strengths and weaknesses of LLMs as well as their subsequent optimization.


[5] Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning cs.CLPDF

Shuyu Wei, Jian Sun, Delai Qiu, Yining Wang, Shengping Liu

TL;DR: 本文提出了一种名为条件熵塑形(CES)的框架,旨在通过动态控制LLM在推理过程中每个token的熵,来平衡推理的准确性和响应简洁性。该方法基于DAPO,利用token级熵作为不确定性信号,对正确推理路径上的高熵“分叉点”token进行惩罚以提高简洁性,对错误路径上的高熵token进行奖励以鼓励探索和纠错。

Details

Motivation: 现有基于熵的深度推理方法要么不加区分地增加响应长度,要么以牺牲准确性为代价缩短响应。为了解决这一权衡问题,本文旨在开发一种能够根据问题难度自适应调整推理深度的机制。

Result: 在DeepSeek-R1-Distill-7B模型上实现CES,并在12个数学基准测试上进行评估。结果表明,相对于DAPO基线,CES在平均准确率上取得了一致性提升,同时减少了响应长度。补充实验在更小的1.5B骨干模型和领域外基准测试上也显示出类似的趋势。

Insight: 核心创新点在于将token级熵作为动态控制信号,并设计了条件双向策略(对正确与错误路径采取不同操作),从而实现了推理过程的自适应调节。这为改进LLM推理提供了一种新颖的、细粒度的熵控制视角。

Abstract: Entropy-based deep reasoning has emerged as a promising direction for improving the reasoning capabilities of Large Language Models (LLMs), but existing methods often either increase response length indiscriminately or shorten responses at the cost of accuracy. To better balance this trade-off, we introduce Conditional Entropy Shaping (CES), a framework that dynamically controls token-level response entropy, enabling LLMs to produce concise solutions on simple problems while encouraging deeper exploration on hard ones. Built on DAPO, CES uses token-level entropy as an uncertainty signal and applies a conditional bidirectional policy: it penalizes high-entropy “forking point” tokens on correct reasoning paths to improve conciseness, and rewards them on incorrect paths to encourage exploration and error correction. We implement CES on DeepSeek-R1-Distill-7B and evaluate it on 12 mathematical benchmarks. CES consistently improves average accuracy while reducing response length relative to DAPO, and supplementary experiments show similar trends on a smaller 1.5B backbone and on out-of-domain benchmarks.


[6] Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation cs.CL | cs.AIPDF

Bing Wang, Shaotian Yan, Chen Shen, kaiyuan liu, Sinan Fan

TL;DR: 本文提出了一种名为MOTAB的新方法,用于缓解大型语言模型(LLM)推理蒸馏中的双重暴露偏差问题。该方法通过动态监控学生模型的生成过程,并在其偏离安全边界时回溯到上一个安全状态,利用教师模型进行干预校正,从而在容忍小错误的同时避免次优上下文的影响。

Details

Motivation: 现有LLM推理蒸馏方法面临两难困境:离策略蒸馏因训练与推理上下文不匹配而存在暴露偏差,导致长思维链推理中的错误级联;而在策略蒸馏则因学生模型生成次优上下文,使教师模型难以提供有效指导,引入了反向暴露偏差。

Result: 在LIMO-v2和AceReason数据集上的大量实验表明,MOTAB方法有效缓解了双重暴露偏差,在推理任务上平均性能提升了约3%。

Insight: 创新点在于提出了一个动态监控与回溯的蒸馏框架,通过设定自适应安全边界来平衡容忍学生小错误(缓解暴露偏差)和及时干预防止次优上下文(规避反向暴露偏差),为解决蒸馏中的分布不匹配问题提供了新思路。

Abstract: Large language models (LLMs) have achieved remarkable success in complex reasoning tasks via long chain-of-thought (CoT), yet their immense computational overhead hinders real-world deployment. LLM reasoning distillation addresses this by transferring reasoning capabilities from formidable teacher models to compact student models. However, existing distillation paradigms face a fundamental dilemma. Typical off-policy distillation strictly utilizes teacher-generated golden trajectories, suffering from an exposure bias due to the mismatch between training distributions and student-generated inference contexts, which leads to error cascades in long CoT reasoning. To address this, on-policy distillation allows students to explore their own trajectories, but we demonstrate that it inherently introduces a reciprocal reversed exposure bias: the teacher model also struggles to provide positive guidance when conditioned on student-generated sub-optimal contexts. To resolve this dual exposure biases problem, we propose Monitoring Trajectories and Backtracking when it strays (MOTAB), a new LLM reasoning distillation pipeline. Specifically, MOTAB dynamically monitors the student’s on-policy generation against an adaptive safety boundary. When the generation strays and exceeds this threshold, MOTAB backtracks to the last safe state and leverages teacher intervention to correct the course. This approach inherently tolerates minor student errors to mitigate exposure bias, while preventing sub-optimal contexts to circumvent reversed exposure bias. Extensive experiments on the LIMO-v2 and AceReason datasets demonstrate that MOTAB effectively alleviates the dual exposure biases, yielding a roughly 3% average performance improvement in reasoning tasks.


[7] LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models cs.CLPDF

Zhe Yuan, Yipeng Zhou, Jinghan Li, Xinyuan Chen, Bowen Deng

TL;DR: 本文提出了一种名为Lambda Policy Optimization(LambdaPO)的新型强化学习对齐框架,旨在解决现有Group Relative Policy Optimization(GRPO)方法在优势估计时因依赖单一统计基线而丢失细粒度偏好信息的问题。该方法将优势估计重新概念化为一种分解的、成对的偏好结构,并通过引入基于生成推理轨迹与真实解决方案之间精确率-召回率对齐的语义密度奖励来缓解二元结果监督的稀疏性。实验表明,LambdaPO在数学推理和问答任务上相比基线方法提升了性能。

Details

Motivation: 现有GRPO方法虽然因其无需显式价值评论家而备受推崇,但其依赖单一组均值作为统计基线,将轨迹空间的关系拓扑压缩为单一标量,从而抹去了在复杂、对排序敏感的奖励环境中至关重要的细粒度偏好信息。

Result: 在具有挑战性的数学推理和问答任务上的实验结果表明,LambdaPO相比基线方法(如GRPO)提升了性能。

Insight: 核心创新在于将优势估计从标量值重构为基于成对比较的分解结构,并通过策略自身对已确立偏好的概率置信度进行动态衰减;同时,引入语义密度奖励作为辅助目标,从生成的推理轨迹中挖掘更细粒度的优化信号,以更好地引导LLM优化。

Abstract: Group Relative Policy Optimization(GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in foregoing an explicit value-critic by leveraging reward normalization across sampled trajectory cohorts. However, the method’s reliance on a monolithic statistical baseline, such as the group mean, collapses the relational topology of the trajectory space into a single scalar, thereby erasing the fine-grained preference information essential for navigating complex, rank-sensitive reward landscapes. To address this issue, we introduce a novel framework, Lambda Policy Optimization (LambdaPO), that addresses this information-theoretic bottleneck by re-conceptualizing advantage estimation from a scalar value to a decomposed, pairwise preference structure. Specifically, the advantage for any given trajectory is formulated as the integrated sum of reward differentials against all peers in its cohort, where each pairwise comparison is dynamically attenuated by the policy’s own probabilistic confidence in the established preference. To further mitigate the sparsity of binary outcome supervision, we augment the objective with a semantic density reward, derived from the precision-recall alignment between generated reasoning traces and ground-truth solutions. As a result, our method can mine more fine-grained optimization signals from a group of rollouts, guiding the LLM to a better optima. Experimental results across challenging math reasoning and question-answering tasks demonstrates that LambdaPO improves performance compared to the baseline methods.


[8] Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters cs.CL | cs.AI | cs.CVPDF

Zhiyu Xu, Lean Wang, Yuanxin Liu, Lei Li, Hao Zhou

TL;DR: 本文系统研究了跨模态技能注入(Cross-Modal Skill Injection)在视觉语言模型(VLM)中的应用,探讨了其适用场景、融合方法及超参数调优。研究发现,该方法在指令跟随和跨语言场景中表现良好,但在数学推理上存在困难;经典方法如TA和DARE优于其他融合方法,且超参数调优对性能至关重要。

Details

Motivation: 视觉语言模型(VLM)在通用多模态理解上表现出色,但难以高效获取持续演进的领域特定技能。传统方法如监督微调(SFT)需要大量数据和计算资源,而模型融合提供了一种无需额外训练数据或显著计算开销的替代方案,旨在将大型语言模型(LLM)的领域专业知识注入VLM,以诱导新兴的跨模态能力。

Result: 研究在多个场景下评估了跨模态技能注入的性能:在指令跟随和跨语言设置中普遍表现良好,但在数学推理上挣扎。在方法比较中,经典方法如TA和DARE consistently achieve superior performance over alternative merging methods。

Insight: 创新点在于首次系统分析了跨模态技能注入的适用性、方法和超参数,揭示了其在特定场景(如指令跟随)的有效性,并强调了经典融合方法(TA/DARE)的优越性以及超参数调优的关键作用,为高效扩展VLM的领域能力提供了实证指导。

Abstract: Vision-Language Models (VLMs) have demonstrated remarkable proficiency in general multi-modal understanding; yet they struggle to efficiently acquire continually evolving domain-specific skills. Conventional approaches to enhancing VLM capabilities, such as Supervised Fine-Tuning (SFT), require extensive dataset curation and substantial computational resources. Model merging has emerged as an efficient alternative that enables the transfer of domain-specific expertise from Large Language Models (LLMs) to VLMs without incurring additional training data requirements or significant computational overhead. Unlike conventional merging of homogeneous LLMs, which mainly aggregates existing capabilities, cross-modal skill injection aims to induce emergent cross-modal capabilities by integrating a domain-expert LLM into a VLM. However, existing research lacks a systematic analysis of the applicability and methodology of cross-modal skill injection. In this study, we investigate cross-modal skill injection across three main aspects: scenarios, methods, and hyperparameters. For scenarios, we find that cross-modal skill injection generally performs well in instruction-following and cross-lingual settings, yet struggles with mathematical reasoning. For methods, we find that classic approaches such as TA and DARE consistently achieve superior performance over alternative merging methods. We also provide a systematic and quantitative analysis of the hyperparameter tuning that these classic methods critically depend on.


[9] GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment cs.CLPDF

Minxuan Lv, Tiehua Mei, Tanlong Du, Junmin Chen, Zhenpeng Su

TL;DR: 本文提出了GoLongRL,一个完全开源、以能力为导向的长上下文强化学习后训练方法,并公开了包含23K样本的数据集和完整训练代码。该方法通过能力导向的数据构建和异构多任务优化策略TMN-Reweight,显著提升了模型在长上下文任务中的性能,在相同GRPO设置下超越了闭源数据集,并使30B参数模型达到了与更大规模模型相当的水平。

Details

Motivation: 现有长上下文RL方法通常将数据构建视为设计复杂检索路径,导致任务覆盖单一且奖励公式无法反映实际需求。本文旨在通过能力导向的数据构建和解决异构奖励优化问题,来更全面地提升模型的长上下文能力。

Result: 在相同GRPO设置下,仅使用本文开源数据集训练的模型性能超越了闭源的QwenLong-L1.5数据集。训练得到的Qwen3-30B-A3B模型在长上下文性能上与DeepSeek-R1-0528和Qwen3-235B-A22B-Thinking-2507等更大模型相当。TMN-Reweight方法进一步提升了平均性能,并在各项评估中保持或改进了通用能力。

Insight: 创新点包括:1) 基于长上下文能力分类学构建多样化、覆盖9类任务的开源数据集,强调奖励多样性对能力提升的重要性;2) 提出TMN-Reweight方法,通过任务级均值归一化和难度自适应加权来解决异构多任务优化中的奖励尺度对齐与优势估计问题。

Abstract: We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, leading to homogeneous task coverage and reward formulations that inadequately reflect practical long-context requirements. Our work offers two contributions. (1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It comprises curated open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset. Moreover, our Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader coverage and greater reward diversity substantially benefit long-context capability improvement. (2) TMN-Reweight for heterogeneous multitask optimization. To address optimization challenges from heterogeneous rewards, we propose TMN-Reweight, which combines task-level mean normalization for cross-task reward scale alignment with difficulty-adaptive weighting for more reliable advantage estimation. TMN-Reweight further improves average performance over vanilla GRPO, with general capabilities preserved or improved across reported evaluations.


[10] LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening cs.CLPDF

Ming Zhang, Qiyuan Peng, Yinxi Wei, Yujiong Shen, Kexin Tan

TL;DR: 本文提出了LLMEval-Logic,一个用于评估大语言模型逻辑推理能力的中文基准测试集。该基准通过真实情境场景构建,包含经过专家审核和Z3求解器验证的形式化标注,并采用对抗性强化流程提升难度。评估14个前沿LLM的结果显示,当前模型在逻辑推理上存在显著差距,最佳模型在困难项上的准确率仅为37.5%。

Details

Motivation: 现有逻辑推理基准多通过模板生成,标注粗糙且未经严格审核,并已被前沿推理模型快速饱和,因此需要构建一个更严谨、基于真实场景且经过求解器验证的基准来准确评估LLMs的逻辑推理能力。

Result: 在LLMEval-Logic基准上评估了14个前沿LLM,最佳模型在困难子集上的准确率仅为37.5%;即使在提供参考符号的情况下,所有模型中最高的Z3+Rubric形式化联合得分也仅为60.16%,表明当前模型与理想性能存在巨大差距。

Insight: 创新点在于构建了一个结合真实情境、专家审核、Z3求解器验证答案正确性、并采用闭环对抗工作流进行强化的中文逻辑推理基准;其提供的精细评分规则(rubric atoms)和基于封闭模型空间的多步子问题设计,为严谨评估模型的逻辑形式化与推理能力提供了新方法。

Abstract: Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly from stated premises. Many existing logical-reasoning benchmarks are generated by templating natural-language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark is released in two paired subsets: a 246-item Base subset shipped with 1,400 expert-developed rubric atoms, and a 190-item Hard subset with 938 multi-step sub-questions over closed model spaces. Evaluating 14 frontier LLMs on LLMEval-Logic reveals substantial gaps in current models: the best model reaches only 37.5% Hard Item Accuracy, and even with reference symbols the highest joint Z3+Rubric formalization score among evaluated models reaches only 60.16%. Our benchmark is publicly available at https://github.com/llmeval/LLMEval-Logic.


[11] Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges cs.CL | cs.AIPDF

Husnain Amjad, Raja Khurram Shahzad, Aamir Shahzad, Mehwish Fatima

TL;DR: 这篇综述论文系统性地回顾了大型语言模型在数学推理领域的最新进展,包括数据集、模型架构、训练策略和评估方法。论文通过分析约120项研究,提出了统一的数学数据集分类法,并评估了不同推理架构和训练策略的效果,同时指出了当前存在的推理忠实性、基准偏差和泛化限制等关键挑战。

Details

Motivation: 数学推理是评估人工智能系统能力的关键基准,随着大型语言模型推理能力的提升,理解其在数学推理上的表现变得日益重要。本文旨在通过结构化分析,综合该领域的最新进展,为当前研究和未来方向提供统一的分析框架。

Result: 论文未提出具体模型或实验,而是对现有研究进行了系统性综述和比较分析。它评估了不同推理架构(如工具集成、验证器引导推理)和训练策略对推理鲁棒性和泛化的影响,并比较了现有评估指标,指出了最终答案准确性与过程级推理验证之间的差距。

Insight: 论文的创新点在于提出了一个统一的数学数据集分类法(区分预训练语料、监督微调资源和评估基准),并系统性地分析了推理架构与训练策略。从客观角度看,其对失败模式(如推理忠实性问题、基准偏差)的识别和对未来研究方向(如符号基础、评估可靠性)的概述,为构建更鲁棒和可信的LLM推理系统提供了重要见解。

Abstract: Mathematical reasoning is essential for problem-solving in education, science, and industry, serving as a crucial benchmark for evaluating artificial intelligence systems. As Large Language Models (LLMs) improve their reasoning capabilities, understanding how well they perform mathematical reasoning has become increasingly important. This survey synthesizes recent advancements in mathematical reasoning with LLMs through a structured analysis of datasets, architectures, training strategies, and evaluation protocols. Our systematic review encompasses approximately 120 peer-reviewed studies and preprints, examining the evolution of this research area and providing a unified analytical framework to understand current progress and limitations. Our study particularly introduces a unified taxonomy of mathematical datasets, distinguishing between pretraining corpora, supervised fine-tuning resources, and evaluation benchmarks across varying levels of reasoning complexity. A systematic analysis of reasoning architectures and training strategies, including tool integration, verifier-guided reasoning, and parameter-efficient adaptation, is presented to assess their effects on reasoning robustness and generalization. Moreover, a comparative evaluation of existing metrics highlights the gap between final-answer accuracy and process-level reasoning verification. By synthesizing insights across these areas, our analysis identifies recurring failure modes, such as reasoning faithfulness issues, benchmark biases, and generalization limitations, and outlines key research directions toward improving symbolic grounding, evaluation reliability, and the development of more robust and trustworthy LLM-based reasoning systems.


[12] K-Quantization and its Impact on Output Performance cs.CLPDF

Robin Baki Davidsson, Pierre Nugues

TL;DR: 本文研究了不同量化位数(从2位到6位)对八个大型语言模型性能的影响,评估了它们在知识处理与推理(MMLU-Pro)、代码理解(CRUXEval)和阅读理解(MuSR)等任务上的表现。研究发现,更高精度(如8位Q8_0)通常带来性能提升但存在收益递减,而激进量化(如2位Q2_K)虽能保持可接受的精度,但某些模型会出现显著性能下降。

Details

Motivation: 大型语言模型规模庞大,部署面临挑战,量化作为一种主流压缩技术,其不同位数对模型性能和准确性的具体影响尚不明确,需要深入研究。

Result: 在MMLU-Pro、CRUXEval和MuSR等基准测试中,高精度量化(如8位)性能更优,但收益递减;激进量化(如2位)下,性能损失因模型和任务而异,大型模型更具韧性,但极低精度仍会导致显著下降;7-9B参数的中等规模模型在效率与资源使用上达到最佳平衡。

Insight: 论文的创新点在于系统评估了多种量化位数对多个LLM在不同任务上的影响,揭示了模型规模、量化精度与性能之间的权衡关系,特别是明确了中等规模模型在量化下的优势,为模型压缩部署提供了实证指导。

Abstract: Recent advancements in large language models (LLMs) have shown their remarkable capacities in many NLP tasks. However, their substantial size often presents challenges for deployment. This necessitates efficient techniques for model compression, with quantization emerging as a prominent solution. Despite its benefits, the exact impact of quantization (from 2- to 6-bit) on the performance and accuracy of LLMs remains an active area of research. This paper investigates the performance of eight LLMs at various quantization levels, focusing on tasks such as MMLU-Pro for knowledge processing and reasoning, CRUXEval for code comprehension, and MuSR for reading comprehension. Our results show a consistent trend where higher precision (e.g., 8-bit Q8_0) yields improved performance, albeit with diminishing returns. Aggressive quantization (e.g., 2-bit Q2_K) usually retains acceptable accuracy, though some models show a substantial loss in performance. Our findings indicate that while lower bit precision generally reduces performance, the impact varies across models and tasks. Larger models show greater resilience to aggressive quantization, but can still undergo significant drops at lower precision levels. Mid-sized models in the 7-9 billion parameter range strike an optimal balance between efficiency and resource usage. Such results provide insights into the trade-offs between model size, quantization, and performance.


[13] TERGAD: Structure-Aware Text-Enhanced Representations for Graph Anomaly Detection cs.CL | cs.AIPDF

Wen Shi, Zhe Wang, Huafei Huang, Qing Qing, Ziqi Xu

TL;DR: 本文提出了一种名为TERGAD的新型数据增强框架,用于图异常检测(GAD)。该框架利用大语言模型(LLMs)的语义推理能力,将节点的拓扑属性转化为描述性自然语言叙事,从而生成富含结构语义的高层嵌入,并通过门控双分支自编码器将其与原始节点属性自适应融合,以联合重构图结构和节点特征,最终基于综合重构误差计算异常分数。

Details

Motivation: 现有基于文本的图异常检测方法通常直接将原始文本特征整合到数据表示中,但往往忽略了节点的结构上下文,这限制了它们检测因节点内在内容与其拓扑角色不一致而产生的复杂异常的能力。

Result: 在六个真实世界数据集上的广泛实验表明,TERGAD始终优于最先进的基线方法。消融研究验证了结构语义引导的不可或缺作用以及门控融合机制的有效性。

Insight: 创新点在于通过LLMs将图结构信息转化为语义叙事,从而为节点生成结构感知的语义嵌入,并通过门控机制自适应地融合结构与属性信息。这为图异常检测提供了一种结合深度语义理解和图结构分析的新颖思路。

Abstract: Graph Anomaly Detection (GAD) aims to identify atypical graph entities, such as nodes, edges, or substructures, that deviate significantly from the majority. While existing text-rich approaches typically integrate structural context into the data representation pipeline using raw textual features, they often neglect the structural context of nodes. This limitation hinders their ability to detect sophisticated anomalies arising from inconsistencies between a node’s inherent content and its topological role. To bridge this gap, we propose TERGAD (Structure-aware Text-enhanced Representations for Graph Anomaly Detection), A novel data augmentation framework that enriches structural semantics for GAD via the semantic reasoning capabilities of Large Language Models (LLMs). Specifically, TERGAD translates node-level topological properties into descriptive natural language narratives, which are subsequently processed by an LLM to derive high-level semantic embeddings. These embeddings are then adaptively fused with original node attributes through a gated dual-branch autoencoder to jointly reconstruct both graph structure and node features. The anomaly score is computed based on the integrated reconstruction error, effectively capturing deviations in both observable attributes and LLM-informed semantic expectations. Extensive experiments on six real-world datasets demonstrate that TERGAD consistently outperforms state-of-the-art baselines. Furthermore, our ablation studies validate the indispensable role of structural semantic guidance and the efficacy of the gated fusion mechanism. Code is available at https://github.com/Kantorakitty/TERGAD-main.


[14] Synthesis and Evaluation of Long-term History-aware Medical Dialogue cs.CL | cs.AIPDF

Hebin Hu, Renke Dai, Ah-Hwee Tan, Yilin Kang

TL;DR: 本文提出了一种利用大语言模型合成高质量长期医疗对话数据的框架,并构建了MediLongChat数据集。该数据集包含多样化的患者疾病轨迹和多轮对话,旨在评估医疗代理在长期历史记忆和跨会话推理方面的能力。

Details

Motivation: 现有医疗对话数据集缺乏真实的长期时间线,无法支持对患者纵向病史的回忆和推理进行系统评估,而真实临床文本又受隐私和伦理限制。

Result: 在MediLongChat基准测试中,即使是当前最先进的大语言模型也表现不佳,突显了该基准的适用性以及开发针对性方法的必要性。

Insight: 创新点在于提出了一个知识引导的三阶段合成框架来生成长期医疗对话数据,并设计了多维评估框架(结合向量指标和LLM-as-a-judge)来评估数据质量和代理的记忆能力,填补了长期医疗对话评估的空白。

Abstract: An effective healthcare agent must be able to recall and reason over a patient’s longitudinal medical history. However, the absence of datasets with realistic long-term dialogue timelines limits systematic evaluation. Real clinical text is constrained by privacy and ethics, while existing benchmarks focus on isolated interactions, failing to capture cross-session reasoning. We introduce a framework for synthesizing high-quality, long-term medical dialogues with LLMs. Our approach entails a knowledge-guided decomposition into three stages: constructing synthetic patient profiles with diverse disease and complication trajectories, generating multi-turn dialogues per encounter, and integrating them into a coherent longitudinal history dataset, MediLongChat. We establish three benchmark tasks-In-dialogue Reasoning, Cross-dialogue Reasoning, and Synthesis Reasoning-to evaluate the memory capabilities of healthcare agents. To assess data quality, we introduce a multi-dimensional evaluation framework combining vector-based metrics with LLM-as-a-judge assessments. Specifically, we define automatic measures-Faithfulness, Coherence, and Diversity-together with two LLM-based evaluations: Correctness and Realism. Benchmark experiments show that even state-of-the-art LLMs struggle with MediLongChat. These findings highlight the benchmark’s applicability and underscore the need for tailored methods to advance healthcare agents.


[15] Towards Trust Calibration in Socially Interactive Agents: Investigating Gendered Multimodal Behaviors Generation with LLMs cs.CLPDF

Lucie Galland, Chloé Clavel, Magalie Ochs

TL;DR: 本文探索了大型语言模型(LLMs)在生成反映不同能力和善意水平的多模态行为(包括言语、声音、手势和面部表情)方面的能力,以校准用户对社会交互代理(SIAs)的信任。研究发现,LLMs能够生成跨模态的连贯行为,且这些行为与理论预期一致,但也会在指定性别时复制社会性别刻板印象。

Details

Motivation: 随着社会交互代理(SIAs)日益融入日常生活,校准用户对代理实际能力的信任至关重要,以确保其适当使用。本文旨在利用LLMs生成反映信任关键维度(能力和善意)的多模态行为,以支持细致且信任校准的交互。

Result: 通过分析LLMs生成的大规模多模态转录本,GPT-5.4能够生成跨文本、语调、面部表情和手势的连贯行为。随机森林特征重要性分析表明,生成的行为与能力和善意的理论预期一致。用户研究(Prolific平台,受试者内设计)证实,参与者感知到的能力和善意水平与生成指令的意图相符。

Insight: 创新点在于提出了一种自动生成与特定信任特质水平对齐的多模态行为的方法,这是实现细致信任校准交互的第一步。客观分析发现,LLMs在生成行为时能有效反映理论维度,但也揭示了其在性别提示下会无意识地复制社会刻板印象(如将男性与高能力、女性与高善意关联),这为未来研究提供了重要的伦理警示。

Abstract: As Socially Interactive Agents (SIAs) become increasingly integrated into daily life, the ability to calibrate user trust to an agent’s actual capabilities would help ensure appropriate usage of these agents. In this paper, we explore the capacity of Large Language Models (LLMs) to generate multimodal behaviors (verbal, vocal, gestural, and facial expression modalities) that reflect varying levels of ability and benevolence, two key dimensions of trustworthiness. We propose a novel method for automatically generating behaviors aligned with specific levels of these traits, a first step towards enabling nuanced and trust-calibrated interactions. By analyzing a large dataset of multimodal transcripts generated by LLMs, we demonstrate that GPT-5.4 is able to produce coherent behavior across different modalities (text, intonation, facial expression, and gesture). Using Random Forest feature importance analysis, we show that the generated behaviors align with theoretical expectations for ability and benevolence. However, we also find that when gender is specified in the prompt, LLMs tend to reproduce societal gender stereotypes, associating male agents’ behaviors with high ability and female agents’ behaviors with high benevolence. To validate our approach, we conducted a user study on Prolific using a within-subjects design. Participants perceived different levels of ability and benevolence in the generated behaviors align with the intended instructions.


Shanshan Xu, Johan Lindholm, Amogh Raina, Henrik Palmer Olsen, Daniel Hershcovich

TL;DR: 本文提出了LP-Eval,一个用于评估法律命题生成质量的三步评估标准及数据集。该标准由法律专家共同设计,将法律命题质量分解为形式有效性和实质性维度。研究基于欧盟法院的判决,利用大语言模型生成法律命题,并发布了100个模型生成命题的专家标注数据集。

Details

Motivation: 法律命题生成是法律推理和学说研究的核心,但在法律自然语言处理领域尚未得到充分研究。本文旨在自动生成和评估来自欧盟法院判决的法律命题,以填补这一空白。

Result: 结果表明,大语言模型能够生成大部分格式良好且高质量的法律命题。专家评估显示,基于成熟案例生成的命题质量高于基于近期案例生成的命题。此外,研究发现,基于评估标准引导的大语言模型判断与专家评估更接近,但仍无法捕捉人类专家更细粒度的区分。

Insight: 创新点在于引入了与法律专家共同设计的结构化评估标准,将法律命题质量分解为可操作的维度,并发布了首个用于该任务的专家标注数据集。这为法律自然语言处理中的生成与评估任务提供了更可靠的基准和方法论框架。

Abstract: Legal proposition generation is central to legal reasoning and doctrinal scholarship, yet remain under-examined in Legal NLP. This paper investigates the automatic generation and evaluation of legal propositions from decisions of the Court of Justice of the European Union using large language models (LLMs). We introduce LP-Eval, a three-step evaluation rubric co-designed with legal experts that decomposes legal proposition quality into formal validity and substantive dimensions. Using this rubric, we release a dataset of two experts’ annotations for 100 LLM-generated legal propositions. Our results show that LLMs can generate predominantly well-formed and high-quality propositions, while expert evaluations reveal higher quality for propositions derived from well established cases than from recent ones. We further examine LLMs as evaluators and find that rubric-guided LLM judgments align more closely with expert assessments than direct overall scoring, but remain insensitive to finer-grained distinctions captured by human experts.


[17] Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning cs.CLPDF

Qinghe Ma, Zhen Zhao, Yiming Wu, Jian Zhang, Lei Bai

TL;DR: 本文提出了AutoTool,一种自适应决定是否调用工具的双模态多模态大语言模型推理方法。该方法通过强化学习框架,设计显式的双模态推理策略和特定模式的奖励函数,以平衡工具辅助和文本中心推理,从而在保证准确性的同时提高效率。

Details

Motivation: 现有工具增强推理研究主要关注模型执行工具调用的能力,而忽视了调用工具的必要性。作者认为工具使用并非总是有益的,冗余或不适当的调用会增加推理开销甚至误导模型预测,因此需要自适应地决定是否调用工具。

Result: 在V*基准测试上,AutoTool相比基础模型实现了21.8%的准确率提升;在POPE基准测试上,相比现有工具增强方法效率提高了44.9%,表现出优异的性能和高效性。

Insight: 创新点在于提出了自适应工具调用机制,通过强化学习框架平衡工具辅助和文本中心推理,避免对单一推理模式的过早偏好,并在训练后期促进自由探索,从而在准确性和效率之间取得更好权衡。

Abstract: Tool-augmented reasoning has emerged as a promising direction for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, existing studies mainly focus on enabling models to perform tool invocation, while neglecting the necessity of invoking tools. We argue that tool usage is not always beneficial, as redundant or inappropriate invocations largely increase reasoning overhead and even mislead model predictions. To address this issue, we introduce AutoTool, a model that adaptively decides whether to invoke tools according to the characteristics of each query. Within a reinforcement learning framework, we design an explicit dual-mode reasoning strategy with mode-specific reward functions to guide the model toward producing accurate responses. Moreover, to prevent premature bias toward a single reasoning mode, AutoTool jointly explores and balances tool-assisted and text-centric reasoning throughout training, and promotes free exploration in later stages. Extensive experiments demonstrate that AutoTool exhibits outstanding performance and high efficiency, yielding a 21.8% accuracy gain on V* benchmark compared to the base model, and a 44.9% improvement in efficiency over existing tool-augmented methods on POPE benchmark. Code is available at https://github.com/MQinghe/AutoTool.


[18] Rethinking How to Remember: Beyond Atomic Facts in Lifelong LLM Agent Memory cs.CLPDF

Jingwei Sun, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han

TL;DR: 本文提出TriMem记忆系统,通过维护三种共存粒度(原始对话片段、原子事实、聚合档案)来提升LLM智能体的长期记忆能力,并采用基于TextGrad的提示优化实现终身演化。实验表明其在多个基准测试中优于现有方法。

Details

Motivation: 解决现有LLM智能体记忆系统因依赖原子事实提取而丢失细节、无法支持深度推理,以及静态提示无法适应多样化对话风格的问题。

Result: 在LoCoMo和PerLTQA基准测试上,使用多种LLM骨干网络,TriMem均持续超越强基线方法。

Insight: 创新点在于三粒度共存表示架构与基于反馈的提示优化机制,实现了存储保真度、检索效率与深度推理的平衡,且无需参数更新即可终身进化。

Abstract: To enable reliable long-term interaction, LLM agents require a memory system that can faithfully store, efficiently retrieve, and deeply reason over accumulated dialogue history. Most existing methods adopt an extracted fact based paradigm: handcrafted static prompts compress raw dialogues into atomic facts, which are then stored, matched, and injected into downstream reasoning. Nevertheless, such fact-centric designs inevitably discard fine-grained details in original dialogues and fail to support deep reasoning over scattered isolated facts. Moreover, static prompts cannot maintain consistent extraction granularity across diverse dialogue styles. To address these limitations, we propose TriMem, which maintains three coexisting representation granularities, including raw dialogue segments anchored by source identifiers for storage fidelity, extracted atomic facts for efficient memory retrieval, synthesized profiles that aggregate dispersed facts into holistic semantic understanding for deep reasoning. We further adopt TextGrad-based prompt optimization, which iteratively refines extraction and profiling prompts via response quality feedback, achieving lifelong evolution without any parameter updating. Extensive experiments on LoCoMo and PerLTQA across multiple LLM backbones demonstrate that TriMem consistently outperforms strong memory baselines. The code is available at https://TMLR-TriMem.github.io .


[19] Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents cs.CLPDF

Wenjie Tang, Minne Li, Sijie Huang, Liquan Xiao, Yuan Zhou

TL;DR: 本文提出了ReBel算法,一种面向长视野智能体的过程级强化学习方法,通过显式建模结构化信念状态来总结交互历史并指导策略学习。该方法在部分可观测环境中,利用信念一致性监督将预测信念与观测反馈的差异转化为密集的自监督信号,并通过信念感知分组来降低优势估计的方差。

Details

Motivation: 针对部分可观测环境中长视野交互任务,由于不完全观测导致智能体信念随时间漂移,以及延迟奖励掩盖了中间决策的因果影响,加剧了时序信用分配难题。

Result: 在ALFWorld和WebShop等具有挑战性的长视野基准测试中,ReBel相比回合级基线GRPO将任务成功率提升了最高20.4个百分点,并将样本效率提高了2.1倍。

Insight: 创新点在于引入信念一致性监督,将信念预测误差转化为无需外部逐步标注或验证器的自监督信号,并通过信念感知分组实现更鲁棒的优势估计,为部分可观测下的长视野决策提供了新思路。

Abstract: Reinforcement learning from verifiable rewards (RLVR) is a promising paradigm for improving large language model (LLM) agents on long-horizon interactive tasks. However, in partially observable environments, incomplete observations cause agent beliefs to drift over time, while delayed rewards obscure the causal impact of intermediate decisions, exacerbating temporal credit assignment challenges. To address this, we propose ReBel (Reward Belief), a process-level reinforcement learning algorithm that explicitly models structured belief states to summarize interaction history and guide subsequent policy learning. ReBel introduces belief-consistency supervision, converting discrepancies between predicted beliefs and observed feedback into dense self-supervised signals without requiring external step-wise annotations or verifiers. It also employs belief-aware grouping to compare trajectories under similar belief states, yielding more robust and lower-variance advantage estimates. We evaluate ReBel on challenging long-horizon benchmarks, including ALFWorld and WebShop. ReBel improves task success by up to $20.4$ percentage points over the episode-level baseline GRPO and increases sample efficiency by $2.1\times$. These results suggest that belief-aware self-supervision is a promising direction for reliable long-horizon decision-making under partial observability. Code is available at: https://github.com/Fateyetian/Rebel.git.


[20] CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning cs.CL | cs.AIPDF

Dachuan Shi, Hanlin Zhu, Xiangchi Yuan, Wanjia Zhao, Kejing Xia

TL;DR: 本文提出了CopT,一种新的推理流程,它颠倒了传统的思维链顺序。CopT首先生成一个草稿答案,然后基于该答案进行策略性思考以进行反思和修正。该方法利用连续嵌入作为推理时的对比验证器来评估答案可靠性,并在不可靠时触发进一步思考。

Details

Motivation: 传统的思维链方法将思考作为回答的前提,这可能导致不必要的延迟和计算开销,尤其是在模型已经能够识别答案的情况下(即表演性推理)。CopT旨在通过先回答后思考的顺序,以及动态控制思考过程,来提高推理效率和准确性。

Result: 在数学、编程和智能体推理任务上,CopT在无需额外训练的情况下,将峰值准确率提升了高达23%,并在同等或更高准确率下将token使用量减少了高达57%。

Insight: 主要创新点在于将“回答-思考”顺序反转,并引入基于连续嵌入的对比验证器来量化答案可靠性。这提供了一种动态控制推理过程、平衡效率与准确性的新范式,其理论分析将验证器估计与互信息联系起来,解释了其有效性。

Abstract: Chain-of-thought (CoT) is a standard approach for eliciting reasoning capabilities from large language models (LLMs). However, the common CoT paradigm treats thinking as a prerequisite for answering, which can delay access to plausible answers and incur unnecessary token costs even when the model is able to identify an answer before extended thinking, a behavior known as performative reasoning. In this paper, we introduce CopT, a reformulated reasoning pipeline that reverses the usual order of thinking and answering. Instead of thinking before answering, CopT first elicits a draft answer and then invokes subsequent on-policy thinking conditioned on its own draft answer for reflection and correction. To assess whether the draft answer should be trusted, CopT recasts continuous embeddings as inference-time contrastive verifiers. Specifically, it contrasts the model’s support for the same generated tokens under discrete-token inputs and continuous-embedding inputs, yielding a sequence-level reverse KL estimator for answer reliability. Our analysis shows that under certain assumptions, the expected estimate equals the mutual information between the unresolved latent state and the emitted answer token, explaining why it captures answer-relevant uncertainty rather than arbitrary uncertainty in the latent state. When the answer is deemed insufficiently reliable, CopT performs further on-policy thinking, where a second KL estimator dynamically controls draft-answer visibility, preserving useful partial information while reducing the risk of being misled by unreliable content. Across mathematics, coding, and agentic reasoning tasks, CopT improves peak accuracy by up to 23% and reduces token usage by up to 57% at comparable or higher accuracy, without any additional training. The code is available at https://github.com/sdc17/CopT.


[21] Text-to-SPARQL Generation with Reinforcement Learning: A GRPO-based Approach on DBLP cs.CLPDF

Jann Pfeifer, Debayan Banerjee, Ricardo Usbeck

TL;DR: 该论文研究了使用基于结果的强化学习(GRPO)训练小型指令调优语言模型(Qwen3-1.7B)在学术领域(DBLP-QuAD数据集)进行零样本文本到SPARQL查询生成。该方法利用执行反馈、结构约束和答案级奖励进行训练,并与零样本基线及监督微调基线进行比较。

Details

Motivation: 解决知识图谱问答中,现有方法依赖大模型或完整查询标注进行监督的问题,探索在缺乏黄金查询标注的情况下,使用基于结果的强化学习训练小模型进行零样本查询生成的可行性。

Result: 在DBLP-QuAD上,GRPO方法相比零样本基线有显著提升,并展现出有竞争力的泛化能力;但监督式DoRA微调在相同模型规模下取得了更高的整体准确率。消融分析表明,基于执行的奖励贡献了大部分性能增益。

Insight: 创新点在于将GRPO强化学习应用于小模型的零样本SPARQL生成,仅依赖执行结果等反馈而非查询标注;客观来看,这为缺乏标注数据的场景提供了一种可行的训练策略,并验证了基于结果的奖励的有效性。

Abstract: Knowledge graph question answering seeks to translate natural language questions into executable queries over knowledge graphs, but existing approaches often rely on large models or full supervision in the form of gold query annotations. This study examines whether reinforcement learning with outcome-based rewards can train a small instruction-tuned language model to perform zero-shot Text-to-SPARQL generation in the scholarly domain. Group-Relative Policy Optimization (GRPO) is applied to the Qwen3-1.7B model on DBLP-QuAD, using prompts that combine natural language questions with symbolic hints about entities and relations. Training relies on execution feedback, structural constraints, and answer-level rewards, with an additional variant that incorporates gold-query-based shaping. The resulting models are compared to the unmodified zero-shot baseline and to a supervised DoRA-finetuned baseline across answer-level accuracy, execution accuracy, category-wise scores, and generalization to held-out templates. GRPO substantially improves over the zero-shot baseline and exhibits competitive generalization, while supervised DoRA finetuning achieves higher overall accuracy on the same model scale. Ablation analyses indicate that execution-based rewards account for most gains, with additional shaping yielding limited additional benefit, suggesting that outcome-based reinforcement learning is a viable training strategy when gold queries are unavailable for token-level supervision.


[22] MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models cs.CLPDF

Yuanqing Cai, Ziyi Huang, Minhao Liu, Lixin Duan, Wen Li

TL;DR: 本文提出了MixRea基准,用于评估大语言模型在显式-隐式推理任务中的表现,发现模型普遍存在类似于人类‘非注意盲视’的缺陷,即忽略重要但微妙的上下文线索。作者还提出了潜在关系补全提示法(PRCP)来缓解这一问题,并在21个先进LLM上验证了其有效性。

Details

Motivation: 受人类认知中‘非注意盲视’理论的启发,研究旨在探究大语言模型是否因在嵌入人类注意力偏见的数据上训练,而在明确的指令下,同样会忽略微妙但重要的上下文信息。

Result: 在MixRea基准(包含9种推理类型、2246个多选题)上评估21个先进LLM,表现最好的推理模型(Gemini 2.5 Pro)仅达到42.8%的一致性,揭示了广泛的非注意盲视现象。提出的PRCP方法能有效改善推理。

Insight: 创新点在于提出了显式-隐式推理任务和MixRea基准,系统性地量化了LLM的‘非注意盲视’缺陷;提出的PRCP提示方法通过恢复被忽略的因果关系来提升推理,为构建更具认知对齐能力的模型提供了方向。

Abstract: Large language models (LLMs) are increasingly integrated into high-stakes decision-making. Inspired by the theory of \emph{inattentional blindness} in human cognition, we investigate whether LLMs, trained on human-preferred corpora that embed attentional biases, exhibit a similar limitation: \emph{failing to attend to subtle yet important contextual cues under explicit task instructions}. To evaluate this, we introduce the task of \textbf{explicit-implicit reasoning} and present \textbf{MixRea}, a benchmark of 2,246 multiple-choice questions across 9 reasoning types with varying distributions of explicit and implicit information. Evaluation of 21 advanced LLMs shows that even the best-performing reasoning model (Gemini 2.5 Pro) achieves only 42.8% consistency, revealing widespread inattentional blindness. To mitigate this, we propose \textbf{Potential Relation Completion Prompting (PRCP)}, a prompting method that improves reasoning by recovering overlooked causal relations. Further analysis shows that this limitation persists across diverse multi-source reasoning tasks, highlighting the need for more cognitively aligned models.


[23] ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning cs.CLPDF

Juncheng Wu, Letian Zhang, Yuhan Wang, Haoqin Tu, Hardy Chen

TL;DR: 本文提出了ClinSeekAgent,一个用于临床决策支持的自动化智能体框架,旨在将范式从被动消费证据转变为主动获取证据。该框架能够根据临床查询,从原始数据源(如医学知识库、原始电子健康记录和医学影像工具)中动态、迭代地搜寻和整合多模态证据,以支持临床推理。它既可作为前沿大语言模型的推理时智能体,也可作为训练时管道,将高质量的智能体轨迹提炼为紧凑的开源模型。

Details

Motivation: 现有基于大语言模型的临床决策支持研究大多假设证据已预先整理好并直接提供给模型,这与需要主动从异构数据源中搜寻和整合多模态证据的真实临床工作流程不符。

Result: 在文本型电子健康记录任务上,ClinSeekAgent将Claude Opus 4.6的总体F1分数从60.0提升至63.2,将MiniMax M2.5从43.1提升至47.3,并在9个评估的主模型中,有7个在风险预测方面取得正向增益。在多模态任务上,它将Claude Opus 4.6的得分从47.5提升至62.6(+15.1),所有评估模型在三个胸部X光相关任务组上均有提升。通过蒸馏得到的ClinSeek-35B-A3B模型在AgentEHR-Bench上达到34.0的平均F1,比其Qwen3.5-35B-A3B基线提升了11.9分,接近Claude Opus 4.6的水平。

Insight: 论文的核心创新在于提出了一个从被动证据消费转向主动证据获取的智能体框架,实现了对原始异构临床数据的动态、迭代式多模态证据搜寻与整合。这不仅提升了推理时模型的临床决策性能,还提供了一种将智能体复杂轨迹蒸馏为高效开源模型的训练范式,为构建更自主、更贴近真实工作流的临床AI系统提供了新思路。

Abstract: Large language models (LLMs) and agentic systems have shown promise for clinical decision support, but existing works largely assume that evidence has already been curated and handed to the model. Real-world clinical workflows instead require agents to actively seek, iteratively plan, and synthesize multimodal evidence from heterogeneous sources. In this paper, we introduce ClinSeekAgent, an automated agentic framework for dynamic multimodal evidence seeking that shifts the paradigm from passive evidence consumption to active evidence acquisition. Given only a clinical query and access to raw data sources, ClinSeekAgent gathers evidence by querying medical knowledge bases, navigating raw EHRs, and invoking medical imaging tools; refines its hypotheses as new information emerges; and integrates the collected evidence into grounded clinical decisions. ClinSeekAgent serves both as an inference-time agent for frontier LLMs and as a training-time pipeline for distilling high-quality agent trajectories into compact open-source models. To validate its inference-time effectiveness, we construct ClinSeek-Bench, which pairs Curated Input reasoning from fixed pre-selected evidence with Automated Evidence-Seeking over raw clinical data. On text-only EHR tasks, ClinSeekAgent improves Claude Opus 4.6 from 60.0 to 63.2 overall F1 and MiniMax M2.5 from 43.1 to 47.3, with positive risk-prediction gains in 7 out of 9 evaluated host models. On multimodal tasks, ClinSeekAgent improves Claude Opus 4.6 from 47.5 to 62.6 (+15.1); all evaluated models improve across the three CXR-related task groups. We further validate ClinSeekAgent as a training pipeline by distilling agentic evidence-seeking trajectories into ClinSeek-35B-A3B, which achieves 34.0 average F1 on existing AgentEHR-Bench, improving over its Qwen3.5-35B-A3B baseline by +11.9 points and approaching Claude Opus 4.6.


[24] KoRe: Compact Knowledge Representations for Large Language Models cs.CLPDF

Davide Cavicchini, Fausto Giunchiglia, Jacopo Staiano

TL;DR: 本文提出KoRe方法,通过将知识图谱的1跳子图编码为紧凑的离散知识令牌,并将其注入到大型语言模型(LLM)主干中,以解决LLM内部参数化知识表示不透明、难以更新和易产生幻觉的问题。该方法在三个基准测试中实现了竞争性性能,同时显著减少了令牌使用量(最高达10倍)。

Details

Motivation: 针对LLM将世界知识编码在参数中导致的表示不透明、难以调试更新和易产生幻觉的固有缺陷,以及知识图谱(KG)虽能提供可读可编辑的知识表示但现有集成技术需要大量重训练或微调的问题,旨在探索一种高效集成KG以增强LLM的方法。

Result: 在三个已建立的基准测试上,KoRe方法取得了竞争性的性能表现,同时令牌使用量显著减少(最高达10倍),证明了其有效性。

Insight: 创新点在于提出了一种将知识图谱子图编码为紧凑离散令牌并直接注入LLM的方法,无需大量重训练或微调,实现了知识的高效、可解释性集成,为LLM的知识增强提供了新思路。

Abstract: Modern Large Language Models (LLMs) have shown impressive performances in user-facing tasks such as question answering, as well as consistent improvements in reasoning capabilities. Still, the way these models encode knowledge seems inherently flawed: by design, LLMs encode world-knowledge within their parameters. This way of representing knowledge is inherently opaque, difficult to debug and update, and prone to hallucinations. On the other hand, Knowledge Graphs can provide human-readable and easily editable world knowledge representations, and their application in knowledge-intensive tasks has consistently proven beneficial to downstream performance. Nonetheless, current integration techniques require extensive retraining or finetuning. To overcome this issue, we introduce KoRe, a methodology to encode 1-hop sub-graphs into compact discrete knowledge tokens and inject them into a LLM backbone. We test the proposed approach on three established benchmarks, and report competitive performances coupled with a significant reduction (up to 10x) in token usage. Our results show that compact discrete KG representations can efficiently and effectively be used to ground modern LLMs.


[25] From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models cs.CL | cs.CVPDF

Juncheng Wu, Hardy Chen, Haoqin Tu, Xianfeng Tang, Freda Shi

TL;DR: 本文提出了一种分阶段训练视觉语言模型(VLM)的方法,将模型能力解耦为视觉感知、视觉推理和文本推理三个阶段进行专门训练。研究表明,视觉感知是推理的基础,需要针对性优化,且通过强化学习比监督微调更有效。该方法在多个VLM上验证,能提升感知与推理性能,并缩短推理路径。

Details

Motivation: 当前VLM研究强调长链思维推理,但作者发现其视觉任务性能主要受限于视觉感知能力不足,而非推理本身。因此,本文旨在系统研究VLM后训练中感知与推理的相互作用,探索分阶段优化方案。

Result: 实验表明,分阶段训练相比合并训练能持续提升视觉感知和推理性能。具体而言,该方法在多个基准上取得先进结果,如在WeMath上提升5.2%,在RealWorldQA上提升3.7%,同时推理准确率提高1.5%,推理轨迹缩短20.8%。

Insight: 核心创新点在于将VLM能力解耦为感知与推理进行分阶段训练,并发现视觉感知是推理的基石,需优先巩固。这提供了一种基于能力而非传统难度的新课程学习维度,两者结合可带来叠加增益。从客观角度看,该方法为VLM后训练提供了可借鉴的结构化优化框架。

Abstract: Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning; yet, we find that their performance on visual tasks is primarily limited by a lack of visual perception as opposed to reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing their capabilities into three separate training stages: visual perception, visual reasoning, and textual reasoning, incorporating specialized training data. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before refining visual reasoning; and (c) is more effectively learned via RL than caption-based SFT. Our experiments across multiple VLMs demonstrate that staged training consistently improves both visual perception and reasoning performance over merged training. Notably, models trained with our approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that superior perception reduces the need for excessive reasoning. Furthermore, we show that this capability-based staging represents a new curriculum dimension orthogonal to traditional difficulty-based curricula, and combining both yields further additive gains. Our staged-training models achieve superior performance among open-weight VLMs, establishing advanced results on several visual math and perception (e.g., +5.2% on WeMath and +3.7% on RealWorldQA) tasks compared with the base counterpart.


cs.CV [Back]

[26] MotionMERGE: A Multi-granular Framework for Human Motion Editing, Reasoning, Generation, and Explanation cs.CVPDF

Bizhu Wu, Jinheng Xie, Wenting Chen, Zhe Kong, Jianfeng Ren

TL;DR: MotionMERGE是一个多粒度统一框架,旨在解决现有运动-语言模型在细粒度理解和控制方面的不足。该框架通过部件级和时间级建模、新颖的预训练策略以及大规模细粒度数据集,实现了对运动更精确的生成、理解、编辑和推理。

Details

Motivation: 现有运动-语言模型在动画或交互等任务中缺乏对肢体部位的细粒度理解和精细控制,这源于模型无法聚焦于运动的局部模式以及训练数据缺乏细粒度监督的根本问题。

Result: 广泛的实验表明,MotionMERGE在细粒度文本驱动的运动编辑和基于运动的推理新基准上,能够实现更精确的运动生成、理解和编辑,并在其他复杂运动任务上展现出强大的零样本泛化能力。

Insight: 创新点包括:在单个LLM内对运动进行部件级和时间级显式建模以实现精细控制;设计了融合跨粒度对齐、时序定位、局部对齐、运动连贯性和基于运动的思维链推理的联合监督预训练策略;构建了首个包含细粒度时空校正指令和基于运动的思维链标注的大规模数据集MotionFineEdit。

Abstract: Recent motion-language models unify tasks like comprehension and generation but operate at a coarse granularity, lacking fine-grained understanding and nuanced control over body parts needed for animation or interaction. This stems from fundamental issues in both the model and the data, in which the model can’t focus on motion’s localized pattern, and the training data lacks fine-grained supervision. To tackle this, we propose MotionMERGE, a unified framework that bridges the granularity gap. First, we pioneer the study of fine-grained languageguided motion control, including detailed understanding and localized editing, by explicitly modeling motion at part and temporal levels within a single LLM, thereby endowing the model with robust priors for precise control. Second, we design ReasoningAware Granularity-Synergy pre-training, a novel strategy that employs joint supervision for cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought (CoT) reasoning. This equips the model with fine-grained motion-language alignment, crossgranularity synergy, and explicit reasoning ability. Third, we curate MotionFineEdit, a large-scale dataset (837K atomic + 144K complex triplets) with the first fine-grained spatio-temporal corrective instructions and motion-grounded CoT annotations, establishing a new benchmark for fine-grained text-driven motion editing and motion-grounded reasoning. Extensive experiments demonstrate the capability of MotionMERGE for more precise motion generation, understanding, and editing, and compelling zero-shot generalization to other complex motion tasks. This work represents a significant step toward models that interact with motion in finer granularity and human-like reasoning.


[27] Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos cs.CVPDF

Yuqi Tang, Yang Shi, Zhuoran Zhang, Qixun Wang, Xuehai Bai

TL;DR: 该论文提出了Artifact-Bench,一个用于评估多模态大语言模型(MLLMs)在检测和分析AI生成视频中伪影能力的综合性基准。该基准定义了涵盖三种视频风格的三级伪影分类法,并设置了三个互补任务:真实与AI生成视频分类、成对真实感比较和细粒度伪影识别。通过在19个主流MLLMs上的实验,揭示了它们在伪影感知和推理方面存在显著局限,且其判断与人类感知偏好存在明显偏差。

Details

Motivation: 当前视频生成模型生成的视频仍存在时间不一致、结构扭曲和语义不连贯等伪影,而MLLMs感知和推理此类伪影的能力尚不明确。现有基准缺乏对伪影感知和细粒度诊断推理的系统性评估,尤其是在超越照片写实内容的多样化AI生成视频领域。

Result: 在19个领先的MLLMs上的实验表明,它们在伪影感知和推理方面存在实质性局限,许多模型在具有挑战性的设置中接近甚至低于随机猜测的性能水平。研究进一步观察到MLLM的判断与人类感知偏好之间存在显著错位。

Insight: 论文的创新点在于建立了一个系统性的、涵盖多种视频风格(照片写实、动画、CG风格)的三级伪影分类法,并据此构建了一个包含三个互补任务的综合性基准,用于全面评估MLLMs在AI生成视频质量评估方面的能力,揭示了当前模型的局限性及其与人类判断的差距。

Abstract: Recent video generative models have greatly improved the realism of AI-generated videos, yet their outputs still exhibit artifacts such as temporal inconsistencies, structural distortions, and semantic incoherence. While Multimodal Large Language Models (MLLMs) show strong visual understanding capabilities, their ability to perceive and reason about such artifacts remains unclear. Existing benchmarks often lack systematic evaluation of artifact-aware perception and fine-grained diagnostic reasoning, especially across diverse AI-generated video domains beyond photorealistic content. To address this gap, we introduce Artifact-Bench, a comprehensive benchmark for evaluating MLLMs on AI-generated video artifact detection and analysis. We first establish a three-level hierarchical taxonomy of realism artifacts, covering photorealistic, animated, and CG-style videos. Based on this taxonomy, Artifact-Bench defines three complementary tasks: real vs. AI-generated video classification, pairwise realism comparison, and fine-grained artifact identification. Experiments on 19 leading MLLMs reveal substantial limitations in artifact perception and reasoning, with many models approaching random or even below-random performance in challenging settings. We further observe significant misalignment between MLLM judgments and human perceptual preferences, highlighting their limited reliability as general evaluators for AI-generated video realism.


[28] EgoTraj: Real-World Egocentric Human Trajectory Dataset for Multimodal Prediction cs.CV | cs.LG | cs.ROPDF

Ahmad Yehia, Abduallah Mohamed, Tianyi Wang, Jiseop Byeon, Kun Qian

TL;DR: 本文介绍了EgoTraj,一个使用Meta Quest Pro在真实世界城市环境中采集的以自我为中心的多模态人类轨迹数据集,包含75个序列的RGB视频、6自由度头部姿态、3D眼动向量和场景标注。该数据集旨在支持第一人称视角下的人类轨迹预测研究,并用于AR感知、导航和辅助系统。

Details

Motivation: 当前缺乏在真实世界环境中采集的以自我为中心的人类轨迹数据集,这限制了从第一人称视角准确预测人类轨迹的研究进展,而该能力对人形机器人、可穿戴传感系统和辅助导航等应用至关重要。

Result: 论文在EgoTraj数据集上对多种最先进的以自我为中心轨迹预测方法进行了基准测试,并通过消融实验分析了注视、场景和运动线索的贡献,结果验证了数据集的实用性。

Insight: 创新点在于构建了一个包含长时程、自主导航、多样化城市路线和广泛参与者多样性的真实世界第一人称轨迹数据集,并提供了同步的多模态传感器数据(如眼动向量),这为研究多模态线索融合的轨迹预测模型提供了新的基准平台。

Abstract: Accurately forecasting human trajectories from an egocentric perspective plays a central role in applications such as humanoid robotics, wearable sensing systems, and assistive navigation. However, progress in this direction remains limited due to the scarcity of egocentric trajectory datasets collected in real-world environments. Addressing this need, we introduce EgoTraj, an egocentric multimodal open dataset recorded using Meta Quest Pro (MQPro). EgoTraj contains 75 sequences of human navigation collected from multiple MQPro wearers in real-world urban environments. Each recording provides synchronized RGB video along with ground-truth data, including continuous time-synchronized 6-degree-of-freedom head poses, per-frame 3D eye gaze vectors, scene annotations. To the best of our knowledge, EgoTraj differs from typical egocentric trajectory datasets by capturing long-horizon, self-directed navigation across diverse urban routes with broad participant diversity. To demonstrate the potential of the dataset, we benchmark several state-of-the-art methods for egocentric trajectory prediction and conduct ablation studies to analyze the contributions of gaze, scene, and motion cues. The results highlight the utility of EgoTraj for AR-based perception, navigation, and assistive systems. The EgoTraj dataset, code, and EgoViz Dashboard are publicly available at https://github.com/yehiahmad/EgoTraj.


[29] MedFM-Robust: Benchmarking Robustness of Medical Foundation Models cs.CVPDF

Xiangxiang Cui, Tianjin Huang, Yifang Wang, Lijie Hu, Lu Yin

TL;DR: 该论文提出了一个名为MedFM-Robust的基准测试,旨在系统评估医疗基础模型(MedFMs)在真实世界条件下的鲁棒性。它涵盖了医学视觉语言模型(如LLaVA-Med, GPT-4o)和分割基础模型(如SAM-Med2D)两大类,针对视觉问答、报告生成等任务进行可靠性测试。

Details

Motivation: 随着医疗基础模型在临床中的广泛应用,其在实际部署中的可靠性至关重要,因此需要建立一个严格的基准来评估这些模型在真实世界条件下的鲁棒性。

Result: 论文提出了一个基准测试框架,但摘要中未提及具体的定量实验结果或与SOTA的比较,主要贡献在于建立评估体系。

Insight: 创新点在于首次系统性地为医疗基础模型(包括视觉语言和分割模型)构建了鲁棒性评估基准,填补了该领域标准化测试的空白,对推动模型安全部署有重要意义。

Abstract: Medical foundation models (MedFMs) have emerged as transformative tools in healthcare, demonstrating capabilities across diverse clinical applications. These models can be broadly categorized into two paradigms: Medical Vision-Language Models (Med-VLMs) and segmentation foundation models. Med-VLMs range from medical-specialized models such as LLaVA-Med and MedGemma, to general-purpose models like GPT-4o and Gemini, all capable of medical image understanding tasks including visual question answering (VQA), report generation, and visual grounding. Concurrently, the Segment Anything Model (SAM) has catalyzed a new generation of medical segmentation models, with adaptations like SAM-Med2D and MedSAM. The widespread clinical deployment of these models thus necessitates rigorous evaluation of their reliability under real-world conditions.


[30] A Systematic Failure Analysis of Vision Foundation Models for Open Set Iris Presentation Attack Detection cs.CVPDF

Rahul Anand, Siddharth Singh, Dileep A D, Mahadeva Prasanna, Raghavendra Ramachandra

TL;DR: 本文对通用视觉基础模型在开放集虹膜呈现攻击检测(PAD)中的适用性进行了系统性故障分析。研究评估了五种代表性基础模型在三种开放集协议下的表现,包括未见过的攻击工具、不同传感器采集的数据集以及从近红外到可见光光谱的跨光谱迁移。结果表明,基础模型在相似传感特性的数据集间可迁移,但在未见攻击工具和跨光谱场景下泛化能力严重不足,且参数高效微调(LoRA)在某些情况下甚至会放大故障。

Details

Motivation: 视觉基础模型在多种视觉任务中表现出强大的可迁移性,但其在生物识别应用,特别是现实开放集操作条件下的虹膜呈现攻击检测(PAD)中的适用性尚未得到充分检验。本研究旨在系统性地分析通用视觉基础模型在开放集虹膜PAD任务中的失败模式。

Result: 在统一实验框架下评估了冻结特征表示和基于LoRA的参数高效任务适应。结果显示,基础模型在具有相似传感特性的数据集间可迁移,但在未见攻击工具(PAI)和跨光谱(从NIR到VIS)评估中泛化失败且性能急剧下降。LoRA在特定跨数据集设置中能提升性能,但在攻击级别和光谱偏移下经常放大故障。

Insight: 论文的创新点在于首次对视觉基础模型在开放集虹膜PAD任务中进行了系统性故障分析,并设计了分离不同分布偏移源(攻击工具、数据集、光谱)的严格协议。客观分析认为,其核心发现是:封闭集或跨数据集的强性能不能等同于鲁棒的开放集安全性,这凸显了需要开发对呈现攻击伪影保持敏感、同时在现实部署变化下保持稳定的PAD表征。

Abstract: Vision foundation models have demonstrated strong transferability across diverse visual recognition tasks and are increasingly considered for biometric applications. Their suitability for iris Presentation Attack Detection (PAD), particularly under realistic open-set operating conditions, remains insufficiently examined. This work presents a systematic failure analysis of general-purpose vision foundation models for open-set iris PAD using periocular imagery. Five representative foundation models are evaluated under three open-set protocols that explicitly separate different sources of distribution shift: unseen Presentation Attack Instruments (PAIs), unseen datasets captured with different sensors and cross-spectral transfer from near-infrared (NIR) to visible spectrum (VIS) imagery. Both frozen feature representations and parameter-efficient task adaptation using Low-Rank Adaptation (LoRA) are assessed within a unified experimental framework. The results indicate that foundation models can transfer across datasets with similar sensing characteristics, but fail to generalise reliably to unseen attack instruments and degrade sharply under cross-spectral evaluation. While LoRA improves performance in certain cross-dataset settings, it frequently amplifies failure under attack-level and spectral shifts. Additional validation experiments using segmented iris inputs, full backbone fine-tuning, joint cross-dataset and cross-PAI shifts, and reverse VIS to NIR transfer further confirm that these failures are not simply artefacts of periocular input, weak adaptation, or one-directional spectral evaluation. These findings show that strong closed-set or cross-dataset performance should not be treated as evidence of robust open-set security, and highlight the need for PAD representations that maintain sensitivity to presentation artefacts while remaining stable under realistic deployment variation.


[31] CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering cs.CV | cs.AIPDF

Mahesh Bhosale, Abdul Wasi, Vishvesh Trivedi, Pengyu Yan, Akhil Gorugantu

TL;DR: CRAFT是一个用于多模态视频问答的查询条件化流水线,它通过动态关键帧选择、多语言ASR和混合批评循环来迭代验证和修复主张,最终合并证据并生成带引用的答案。在MAGMaR 2026基准测试中取得了最佳性能,并在WikiVideo转换数据集上展示了良好的泛化能力。

Details

Motivation: 解决现实世界新闻事件中基于多视频的问答问题,要求系统能够从异构视频档案中检索查询相关证据,并将每个主张归因于其支持来源。

Result: 在MAGMaR 2026基准上,CRAFT取得了最佳的整体平均分(0.739)、参考召回率(0.810)和引用F1分数(0.635);在WikiVideo转换数据集上也表现强劲(0.823平均分)。

Insight: 创新点包括查询条件化的动态关键帧选择、结合多语言ASR的混合批评循环进行迭代验证与修复,以及基于原子主张的证据聚合框架,这些设计提升了证据检索和引用的准确性。

Abstract: Grounded multi-video question answering over real-world news events requires systems to surface query-relevant evidence across heterogeneous video archives while attributing every claim to its supporting source. We introduce CRAFT (Critic-Refined Adaptive Key-Frame Targeting), a query-conditioned pipeline that combines dynamic keyframe selection, per-video ASR with multilingual fallback, and a hybrid critic loop to iteratively verify and repair claims before consolidation. The pipeline integrates UNLI temporal entailment, DeBERTa-v3 cross-claim screening, and a Llama-3.2-3B adjudicator, with a final citation-merging stage that emits each fact once with all supporting source identifiers. On MAGMaR 2026, CRAFT achieves the best overall average (0.739), reference recall (0.810), and citation F1 (0.635). We further evaluate on a MAGMaR-style conversion of WikiVideo with 52 non-overlapping event queries, where CRAFT also performs strongly (0.823 Avg), showing that its claim-centric evidence aggregation generalizes beyond MAGMaR. Ablations show that atomic claims, ASR, and the critic loop drive the main gains over the vanilla query-conditioned baseline. Code and implementation details are publicly available at https://github.com/bhosalems/CRAFT.


[32] FAGER: Factually Grounded Evaluation and Refinement of Text-to-Image Models cs.CV | cs.AIPDF

Youngsun Lim, Cusuh Ham, Pin-Yu Chen, Deepti Ghadiyaram

TL;DR: 本文提出了FAGER框架,用于评估和优化文本到图像(T2I)模型生成图像的事实正确性。该框架通过结合LLM的事实提议与参考引导的视觉事实提取和验证,构建结构化事实评估标准,并转化为视觉语言模型(VLM)可评估的问答对,以检测图像中隐含或外部事实的准确性。

Details

Motivation: 现有T2I评估指标主要关注图像与提示中显式信息的对齐,但难以评估涉及科学知识、历史事实、产品或文化概念等隐含或外部事实的正确性,因此需要一种能评估事实基础性的新方法。

Result: 在涵盖科学、历史、产品、文化和知识密集型概念的五个数据集上,FAGER在Factual A/B测试中持续优于先前指标,该测试衡量指标是否更偏好事实性参考图像而非生成图像。此外,FAGER能以无需训练的方式优化T2I输出,在各数据集上带来显著的事实性提升。

Insight: 创新点在于提出一个基于代理的评估框架,将LLM与VLM结合,系统性地评估和优化图像生成的事实基础性,弥补了现有指标在隐含和外部事实验证上的不足,为T2I模型的事实性评估提供了可操作反馈。

Abstract: Existing text-to-image (T2I) evaluation metrics mainly assess whether generated images align with information explicitly stated in the prompt, but often fail to capture factual requirements that are implicit, externally grounded, or identity-defining. As a result, they are not well suited for evaluating factual correctness in prompts involving scientific knowledge, historical facts, products, or culture-specific concepts. We propose FActually Grounded Evaluation and Refinement (FAGER), an agentic framework that evaluates whether generated images correctly reflect visually verifiable facts grounded in or implied by the prompt, while also providing actionable feedback for improvement. FAGER first constructs a structured factual rubric by combining LLM-based fact proposal with reference-guided visual fact extraction and verification, then converts the rubric into question-answer pairs for VLM-based evaluation. To validate FAGER as a factuality metric, we introduce a Factual A/B test, which measures whether a metric prefers factual reference images over corresponding generated images. Across five datasets spanning science, history, products, culture, and knowledge-intensive concepts, FAGER consistently outperforms prior metrics on this test. We further show that FAGER can be used to refine T2I outputs in a fully training-free manner, yielding substantial factuality gains across datasets.


[33] Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models cs.CVPDF

Svetlana Orlova, Niccolò Cavagnero, Gijs Dubbelman

TL;DR: 本文探索了一种数据高效的视频预训练方法,通过冻结预训练的图像基础模型,仅训练一个轻量级的循环时序模块来处理视频流,从而减少对大规模视频数据和计算资源的需求。

Details

Motivation: 视频基础模型通常需要大规模视频数据集进行预训练,成本高昂;而现代图像基础模型已具备强大的空间表征能力,因此研究能否通过重用这些空间表征并仅针对时序推理进行预训练来构建有竞争力的视频模型。

Result: 在多个视频理解任务上的实证结果表明,无需大规模视频预训练即可获得较强的时序性能,验证了该方法的可行性。

Insight: 创新点在于提出了一种轻量级训练范式,将图像基础模型作为冻结的空间编码器,仅预训练时序模块,这为未来在冻结图像基础模型上构建循环视频基础模型提供了方向,并可能显著降低视频预训练的数据和计算成本。

Abstract: Video foundation models achieve strong performance across many video understanding tasks, but typically require large-scale pre-training on massive video datasets, resulting in substantial data and compute costs. In contrast, modern image foundation models already provide powerful spatial representations. This raises an important question: can competitive video models be built by reusing these spatial representations and pre-training only for temporal reasoning? We take initial steps toward exploring a lightweight training paradigm that freezes a pre-trained image foundation model and trains only a recurrent temporal module to process streaming video. By reusing an image foundation model as a spatial encoder, this approach could significantly reduce the amount of video data and compute required compared to end-to-end video pre-training. In this work, we explore the feasibility of this approach before investing in computing for video pre-training. Our empirical findings across multiple video understanding tasks suggest that strong temporal performance can emerge without large-scale video pre-training, motivating future work on recurrent video foundation models obtained by pre-training a temporal module on top of a frozen image foundation model. Code: https://github.com/tue-mps/towards-video-image-frozen .


[34] Efficient coding along the visual hierarchy cs.CVPDF

Ananya Passi, Brian S. Robinson, Michael F. Bonner

TL;DR: 本文提出了一种基于高效编码原则的无监督学习框架,旨在模拟生物视觉系统从有限数据中学习层次化特征的能力。该方法通过逐层压缩自然图像的主导变异模式,无需标签、任务或反向传播,即可生成从边缘、颜色到纹理、形状的视觉特征。这些特征不仅与人类感知对齐,还能预测视觉皮层的fMRI响应,且在结合监督微调后能提升大脑对齐效果和分类学习效率。

Details

Motivation: 研究动机是探索生物视觉系统仅需有限经验即可学习高效视觉表征的原理,特别是验证高效编码假说——即神经表征捕捉自然输入的统计结构——能否从有限数据中构建出与人类对齐的视觉特征层次。

Result: 实验表明,该无监督高效编码模型生成的特征与人类观察者识别一致,并能预测人类视觉皮层图像诱发的fMRI响应;在低数据设置下,结合监督微调的混合学习方法实现了更好的大脑对齐和更快的类别学习。

Insight: 创新点在于提出了一种仅依赖局部统计的无监督分层学习程序,无需反向传播即可模拟视觉层次;客观分析认为,该方法为解释生物视觉的数据高效性提供了计算模型,并展示了高效编码与监督学习结合在神经对齐和分类任务中的潜力。

Abstract: Biological visual systems learn from limited experience, unlike deep learning models that rely on millions of training images. What learning principles make this possible? We tested whether efficient coding, the idea that neural representations capture the statistical structure of natural inputs, can build a hierarchy of human-aligned visual features from limited data. We developed an unsupervised learning procedure in which each layer of a deep network compresses its inputs onto the dominant modes of variation in natural images, using only local statistics and no labels, tasks, or backpropagation. This unsupervised procedure yields features that progress from edges and colors to textures and shapes. The features of this deep efficient coding model are readily recognized by human observers and are predictive of image-evoked fMRI responses in human visual cortex. Furthermore, a hybrid learning procedure that combines efficient coding with supervised fine-tuning yields better brain alignment in low-data settings and more rapid category learning. These findings suggest that efficient coding may shape representations across the entire visual hierarchy and help explain the data efficiency of biological vision.


[35] Smartphone-based Circular Plot Sampling for Forest Inventory cs.CVPDF

Su Sun, Jui-Cheng Chiu, Nabin Khanal, Songlin Fei, Yingjie Victor Chen

TL;DR: 本文提出了一种基于智能手机的轻量级森林圆形样地采样方法,通过单次穿行视频即可完成样地内树木的胸径和空间位置测量。该方法结合了预训练的单目深度估计、树木实例分割和SLAM框架,无需专业硬件,仅使用消费级智能手机即可实现。

Details

Motivation: 传统的森林样地调查依赖昂贵的地面激光雷达或费时费力的人工测量方法,可扩展性和可及性受限。本文旨在开发一种低成本、易操作的替代方案,以促进大规模森林调查。

Result: 在人工林和天然林样地中评估,胸径测量的平均绝对误差分别为1.51厘米(MARE 3.98%)和2.30厘米(MARE 5.69%),性能稳定且与不同起始位置无关。其精度与成熟的野外方法相当。

Insight: 创新点在于将单目深度估计、实例分割与SLAM框架集成,通过视频序列联合优化相机轨迹和深度,并利用校准参考长度恢复绝对真实尺度。该方法显著降低了设备成本和操作复杂性,具有很高的实用价值。

Abstract: Circular sample plots are a cornerstone of forest inventory, yet accurate measurement of tree diameter at breast height (DBH) and spatial location within such plots remains challenging. Conventional approaches rely either on costly terrestrial LiDAR systems or labor-intensive manual methods involving calipers and compass bearings, limiting their scalability and accessibility in large scale environments. We present a lightweight, smartphone-based pipeline that enables complete plot sampling based tree measurement from a single walkthrough video, requiring no specialized hardware beyond a consumer smartphone mounted on a portable stand. The proposed method integrates pretrained monocular depth estimation and tree instance segmentation with a simultaneous localization and mapping (SLAM) framework to jointly refine camera trajectories and depth across the video sequence. Tree positions and DBH estimates are recovered by fusing SLAM-derived camera poses with segmented depth maps, with absolute real-world scale anchored via a calibrated reference length. The system was evaluated in both managed forest plots and natural forest plot, achieving a mean absolute error of 1.51 cm (MARE 3.98%) and 2.30 cm (MARE 5.69%) respectively, with consistent performance across varying starting directions and positions. Cross-video consistency analysis further demonstrated stable and reproducible tree localization across measurements initiated from different starting positions. The proposed approach achieves accuracy comparable to established field methods while substantially reducing equipment cost and operational complexity, making it accessible to both professional researchers and non-expert forest managers in diverse operational settings.


[36] Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference cs.CV | cs.AIPDF

Beomseok Kang, Dongwon Jo, Jiwon Song, Donghwee Son, Jae-Joon Kim

TL;DR: 本文提出了RotateK,一种基于旋转的结构化Key通道剪枝框架,用于解决视觉语言模型推理时KV缓存压力过大的问题。该方法通过在线PCA旋转将token相关的通道重要性对齐到共享的低维子空间,从而实现轻量级头级掩码下的精确剪枝,并结合融合的Triton注意力内核直接在稀疏通道Keys上进行高效解码。

Details

Motivation: 视觉语言模型在推理时因单张图像编码为数千个token而面临严重的KV缓存压力,现有方法多通过token剪枝利用token稀疏性,但永久丢弃视觉内容会导致细粒度感知任务性能大幅下降,因此需要探索特征稀疏性这一补充维度,在固定KV缓存预算下压缩通道维度以保留更多视觉token。

Result: 在两个代表性VLM骨干网络上的实验表明,RotateK在准确性和解码延迟方面均优于先前的Key通道剪枝方法,同时在匹配的KV缓存预算下,联合token-channel剪枝相比仅token剪枝的基线有所提升。

Insight: 创新点在于通过旋转对齐token依赖的通道重要性到共享子空间,解决了头级剪枝鲁棒性不足与token级剪枝硬件不友好的结构权衡问题,实现了结构化剪枝与高效解码的平衡,为VLM推理优化提供了新思路。

Abstract: Vision-Language Models suffer severe KV cache pressure at inference, as a single image often encodes into thousands of tokens. Most existing methods exploit token sparsity through token pruning, but permanently discarding visual content causes substantial degradation on fine-grained perception tasks. This motivates a complementary axis, feature sparsity: under a fixed KV cache budget, compressing the channel dimension preserves more visual tokens at the same memory cost. Prior Key channel pruning methods, however, face a structural trade-off: token-wise channel pruning is expressive but unstructured and slow, while head-wise approach is hardware-friendly but less robust. We resolve this with RotateK, a rotation-based structured Key channel pruning framework. RotateK applies an online PCA-based rotation that aligns token-dependent channel importance into a shared low-dimensional subspace, enabling accurate pruning under lightweight head-wise masks; a fused Triton attention kernel operates directly on sparse-channel Keys for efficient decoding. Experiments on two representative VLM backbones show that RotateK consistently outperforms prior Key channel pruning in both accuracy and decoding latency, while joint token-channel pruning improves over token-only baselines at matched KV cache budgets.


[37] HAVEN: Hierarchically Aligned Multimodal Benchmark for Unified Video Understanding cs.CVPDF

Mengqi Shi, Haopeng Zhang

TL;DR: 本文提出了HAVEN基准测试,这是一个用于统一视频理解的分层对齐多模态基准。该基准通过构建一个完全粒度化(帧、镜头和视频级别)且完全多模态(视频和文本)的数据集架构,并带有明确的跨模态连续对齐,来评估多模态大语言模型在复杂叙事上的忠实总结和推理能力。

Details

Motivation: 现有视频摘要基准测试的监督信息分散在孤立的粒度(如关键帧、关键镜头或脱节的文本摘要)上,未能捕捉跨模态对齐固有的层次结构,导致无法充分评估MLLMs对复杂叙事的忠实理解和推理能力。

Result: 对多个最先进的多模态大语言模型进行广泛基准测试的结果表明,模型在表面文本流畅性和基于多模态的接地理解之间存在持续差距。

Insight: 创新点在于提出了一个统一的、分层对齐的多模态标注范式,并基于此构建了一个涵盖摘要、时序推理、多模态接地和显著性排序的综合评估套件,将多模态系统评估从传统的问答形式推进到更严格的、可解释的层次化视频理解测试平台。

Abstract: While Multimodal Large Language Models (MLLMs) exhibit strong performance on standard video tasks, their ability to faithfully summarize and reason over complex narratives remains poorly evaluated. Existing summarization benchmarks fragment supervision across isolated granularities, such as keyframes, key shots, or disjointed text summaries, failing to capture the inherently hierarchical structure of cross-modal alignment. To address this critical gap, we introduce HAVEN, a hierarchically aligned multimodal benchmark for unified video understanding. HAVEN pioneers a fully granular (frame, shot, and video levels) and fully multimodal (video and text) dataset architecture, complete with explicit, continuous alignment between modalities. Built upon this unified annotation paradigm, we propose a comprehensive evaluation suite spanning summarization, temporal reasoning, multimodal grounding, and saliency ranking. Extensive benchmarking of state-of-the-art MLLMs exposes a persistent gap between surface-level textual fluency and grounded multimodal understanding. Ultimately, HAVEN advances the evaluation of multimodal systems beyond traditional QA formats, offering a rigorous, standardized testbed to drive future research in interpretable, hierarchical video understanding. We publicly release the dataset, benchmark suite, and evaluation protocols.


[38] PhyWorld: Physics-Faithful World Model for Video Generation cs.CV | cs.AI | cs.ET | cs.LG | cs.MMPDF

Pu Zhao, Juyi Lin, Timothy Rupprecht, Arash Akbari, Chence Yang

TL;DR: 论文提出了PhyWorld,一个用于视频生成的世界模型,旨在通过两阶段后训练(微调)来生成时间连贯且物理忠实(physics-faithful)的场景延续视频。第一阶段通过流匹配(flow matching)微调来提升视频到视频的延续质量,鼓励稳定的视觉属性和连贯的运动动态。第二阶段使用基于物理偏好对的直接偏好优化(DPO)来使生成的动态与物理原理对齐,引导模型产生物理上更可信的输出。

Details

Motivation: 大型视频生成模型作为世界模拟器(world simulators)用于训练物理AI系统时,需要生成物理忠实的视频延续,即生成的视频必须保持条件输入所隐含的物理状态,并以符合基本物理原理的方式演化。现有模型在这方面存在不足,因此需要专门的方法来提升视频的物理忠实性。

Result: 在标准视频质量基准VBench上,PhyWorld的平均得分为0.769,优于SOTA基线(0.756或以下)。在作者提出的专用物理忠实性基准(per-law scoring)上,PhyWorld的平均得分为3.09,也优于最强基线(2.99)。

Insight: 论文的核心创新点在于提出了一个两阶段后训练框架,将流匹配微调(提升时间连贯性)与基于物理偏好的DPO(提升物理忠实性)相结合,从而系统性地增强视频生成模型作为世界模拟器的能力。这种方法为将通用视频生成模型转化为更可靠的物理环境模拟器提供了一条可借鉴的技术路径。

Abstract: World simulators can provide safe and scalable environments for training Physical AI systems before real-world deployment. Large video generation models are emerging as a promising basis for such simulators because they can generate diverse and realistic visual futures. However, using them as world simulators requires physically faithful video continuations, namely, generated videos that preserve the physical state implied by the conditioning input, and evolve in ways consistent with basic physical principles. We propose PhyWorld, a video generation world model designed to produce temporally coherent and physically faithful scene continuations through two-stage post-training. In the first stage, we improve video-to-video continuation with flow matching fine-tuning, encouraging stable visual attributes and coherent motion dynamics across frames. In the second stage, we align generated dynamics with physical principles using Direct Preference Optimization (DPO) over physics preference pairs, guiding the model toward outputs with higher physical plausibility. To evaluate PhyWorld, we use both standard video-quality benchmarks and a dedicated physical-faithfulness benchmark with per-law scoring. Experiments show that PhyWorld improves video consistency, achieving an average score of 0.769 on VBench compared with 0.756 or below for state-of-the-art baselines. PhyWorld also improves physical plausibility, reaching an average score of 3.09 on our physical-faithfulness benchmark compared with 2.99 for the strongest baseline. These results suggest that post-training large video generation models with continuation and physics-preference signals can make them more effective world simulators for Physical AI.


[39] iGSP:Implicit Gradient Subspace Projection for Efficient Continual Learning of Vision-Language Models cs.CVPDF

Xuezhi Cui, Dongbo Zhou, Wang Guo, Zeyuan Wang, Ziyu Li

TL;DR: 本文提出了一种名为iGSP的新框架,用于视觉语言模型的高效持续学习。该方法将适应过程分为两个阶段:首先通过隐式梯度子空间投影识别并最大化知识重用的共享子空间,然后对正交子空间进行微调以快速拟合任务特定残差。

Details

Motivation: 为了解决持续学习中参数爆炸问题,以及现有相似性驱动共享机制错误地将视觉相似性与底层对齐一致性等同,导致视觉相似但逻辑不同的任务间产生严重负迁移,且无法利用视觉多样任务间的对齐重用问题。

Result: 在MTIL基准测试上的大量实验表明,iGSP在达到最先进(SOTA)精度的同时,显著提升了训练效率:与当前SOTA方法相比,平均可训练参数减少了42.7%,并且最终总参数相对于同类方法减少了86.9%。

Insight: 核心创新在于将对齐共享重新定义为共享低秩子空间内优化轨迹重叠的几何问题,并提出了基于MoE路由器早期收敛建立子空间基、通过子空间约束正则化进行隐式梯度投影、以及将路由概率作为梯度流指示器来精确剪枝冗余维度的新机制。

Abstract: Vision-Language Models require efficient adaptation to continually emerging downstream tasks. While Parameter-Efficient Fine-Tuning mitigates catastrophic forgetting, assigning isolated modules per task leads to parameter explosion. Conversely, recent similarity-driven sharing mechanisms falsely equate superficial visual similarity with underlying alignment consistency. This fundamental mismatch triggers severe negative transfer between visually similar but logically distinct tasks and fails to exploit alignment reuse across visually diverse ones. We argue thatalignment sharing is fundamentally a geometric problem of overlapping optimization trajectories within shared low-rank subspaces. Grounded in this insight, we propose iGSP, a novel framework that achieves efficient adaptation via implicit gradient subspace projection. Leveraging the early convergence of MoE routers to establish the subspace basis, iGSP bifurcates the adaptation process into two phases. First, the Subspace Identification phase introduces candidate experts via basis pre-expansion, applies a novel subspace-constrained regularization to implicitly project new task gradients onto the historical subspace, and precisely prunes redundant dimensions by treating routing probabilities as gradient flow indicators, ultimately to maximize knowledge reuse. Second, the Orthogonal Subspace Fine-Tuning phase fixes this structural basis and removes the regularization to rapidly fit the task-specific residual loss. Extensive experiments on the MTIL benchmark demonstrate that iGSP achieves state-of-the-art accuracy while significantly improving training efficiency, reducing the average trainable parameters by 42.7% compared to current SOTA methods, and decreasing the final total parameters by 86.9% relative to counterparts. The source code is available at https://github.com/GeoX-Lab/iGSP.


[40] MetaRA: Metamorphic Robustness Assessment for Multimodal Large Language Model-based Visual Question Answering Systems cs.CVPDF

Quanxing Xu, Yuhao Tian, Ling Zhou, Xian Zhong, Xiaohua Huang

TL;DR: 本文提出了一个名为MetaRA的测试框架,用于评估基于多模态大语言模型的视觉问答系统的鲁棒性。该框架利用蜕变关系生成受控的图像-问题输入变体,系统地探测模型在语言扰动、视觉线索依赖和多模态推理等方面的脆弱性。

Details

Motivation: 现有的VQA评估主要依赖静态数据集和基于准确率的指标,无法充分衡量模型的鲁棒性、一致性和泛化能力,因此需要一种更系统的评估方法来揭示隐藏的失败模式。

Result: 实验结果表明,MetaRA比传统的准确率指标提供了更丰富的诊断性见解,能够暴露在标准基准测试下隐藏的失败模式,揭示了模型对语言扰动的敏感性、对表面视觉线索的过度依赖以及更深层的多模态推理弱点。

Insight: 创新点在于将蜕变测试思想引入多模态大语言模型的鲁棒性评估,提出了一种可扩展、模型无关的系统性评估框架,为构建可信的多模态AI提供了新的评估视角。

Abstract: Visual Question Answering (VQA), as the representative multimodal task, serves as a key benchmark for evaluating the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, existing evaluations largely rely on static datasets and accuracy-based metrics, which fail to capture robustness, consistency, and generalization. Inspired by Metamorphic Testing (MT), we propose Metamorphic Robustness Assessment (MetaRA), a testing framework that employs Metamorphic Relations (MRs) to systematically probe vulnerabilities in MLLM-based VQA systems. MetaRA generates controlled variations of image-question inputs based on specific MRs and evaluates models across diverse conditions. Applying MetaRA to multiple MLLM-based VQA models across different tasks reveals nuanced failure patterns, including sensitivity to linguistic perturbations, over-reliance on superficial visual cues, and deeper weaknesses in multimodal reasoning. Experimental results demonstrate that MetaRA provides richer diagnostic insights than conventional accuracy metrics, exposing failure modes that remain hidden under standard benchmarks. Overall, this work highlights the need for systematic robustness evaluation in VQA and positions metamorphic assessment as a scalable, model-agnostic approach toward trustworthy multimodal AI.


[41] SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution cs.CVPDF

Yiren Song, Yihan Wang, Xiyao Deng, Zhuoran Yan, Mike Zheng Shou

TL;DR: 本文提出SWEET框架,利用图像编辑模型而非密集视频生成来构建稀疏视觉世界模型,用于具身任务执行。它通过渐进式图像编辑生成任务相关的关键帧序列,再将其转化为可执行的动作块,从而以更低成本实现可靠的视觉规划。

Details

Motivation: 密集视频生成计算成本高,且对许多操作任务并非必要,因为这些任务的进展可由少量任务相关的视觉状态概括。本文旨在探索图像编辑模型是否能作为稀疏视觉世界模型,预测任务级未来状态。

Result: 在DROID和RoboMimic基准上的实验表明,SWEET在已见和未见场景中均提升了关键帧预测能力,并实现了从序列关键帧规划到可执行机器人动作的完整流程。与视频生成模型Wan2.2相比,图像编辑模型FLUX-Kontext在相同机器人数据设置下,产生了更可靠的任务级关键帧,具有更好的视觉保真度和显著更低的推理成本。

Insight: 核心创新在于将图像编辑模型用作稀疏世界模型进行渐进式视觉规划,并引入混合训练策略以减少真实与编辑视觉子目标之间的不匹配。这揭示了图像编辑是具身视觉预测中一个前景广阔且尚未充分探索的方向。

Abstract: Visual prediction has emerged as a promising paradigm for embodied control, where future observations are generated and then translated into actions. However, dense video generation is computationally expensive and often unnecessary for many manipulation tasks, whose progress can be summarized by a small number of task-relevant visual states. In this work, we study whether image editing models can serve as sparse visual world models for robot manipulation by predicting task-level future states without dense video rollout. We first conduct a controlled comparison between the video generation model Wan2.2 and the image editing model FLUX-Kontext under the same robotic data setting, and find that image editing produces more reliable task-level keyframes with better visual fidelity and substantially lower inference cost. Motivated by this observation, we propose SWEET, a one-shot sparse visual planning framework that progressively generates a sequence of task-relevant manipulation keyframes through successive image editing, conditioned on language instructions and optional arrow-based spatial guidance. A goal-conditioned diffusion action predictor then converts adjacent imagined keyframes into executable action chunks. To reduce the mismatch between real and edited visual subgoals, we further introduce a mixed-training strategy with filtered edited targets. Experiments on DROID and RoboMimic show that SWEET improves keyframe prediction across seen and unseen scenes and enables a full pipeline from sequential keyframe planning to executable robot actions, suggesting that image editing is a promising and underexplored direction for embodied visual prediction.


[42] TextAlign: Preference Alignment for Text Rendering with Hierarchical Rewards cs.CV | cs.DBPDF

Mingxuan Cui, Jingpu Yang, Fengxian Ji, Qian Jiang, Zhecheng Shi

TL;DR: 本文提出TextAlign,一种用于文本渲染偏好对齐的非侵入式框架,通过分层视觉语言模型(VLM)奖励将渲染错误分解为全局、单词和字形三个级别,将缺陷判断转换为标量偏好信号,支持GRPO和DPO优化,在保持生成质量的同时提升文本渲染准确性。

Details

Motivation: 解决大文本到图像生成模型在忠实文本渲染方面的持续弱点,现有方法通常需要修改模型架构或编码器,部署复杂,因此研究将其视为训练后偏好对齐问题,提出非侵入式解决方案。

Result: 在FLUX.1-dev和Z-Image-Turbo基准测试中,基于OCR的文本准确性持续提升,且不降低一般生成质量;相比SD3.5、Qwen-Image、AnyText和TextDiffuser等基线模型表现出优势。

Insight: 创新点在于将文本渲染问题转化为偏好对齐,使用分层VLM奖励进行多粒度错误评估,提供可扩展的奖励设计替代模型重新设计,为非侵入式改进文本渲染提供了新思路。

Abstract: Faithful text rendering remains a persistent weakness of large text-to-image generative models, as it requires both semantic instruction following and fine-grained glyph-level structure. Prior methods often improve this ability through architecture-specific modules or encoder modifications, which complicate deployment across foundation models. We study text rendering as a post-training preference-alignment problem and propose TextAlign, a non-invasive framework that keeps the generator architecture unchanged. The key component is a hierarchical vision-language model (VLM)-based reward that decomposes rendering errors into global, word, and glyph levels, then converts binary defect judgments into a scalar preference signal. The resulting signal supports both Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO). Experiments on FLUX.1-dev and Z-Image-Turbo show consistent gains in OCR-based text accuracy without degrading general generation quality. Compared with strong foundation and text-rendering baselines, including SD3.5, Qwen-Image, AnyText, and TextDiffuser, these results indicate that reward design offers a scalable alternative to model redesign for improving text rendering.


[43] DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs cs.CVPDF

Minyoung Park, Taehun Kong, Sangjun Ahn

TL;DR: 本文提出了DynaTok,一种无需训练、时间自适应且具有位置偏差感知能力的视频大语言模型(Video-LLM)令牌压缩框架。该框架通过在时间和空间两个维度动态分配令牌预算,有效减少了长视频序列带来的巨大计算开销,同时保持了语义覆盖。

Details

Motivation: 现有基于注意力幅度的免训练令牌压缩方法存在位置偏差,且仅依赖短期时间局部性,导致时空覆盖冗余和令牌使用效率低下。DynaTok旨在解决这些问题,以实现高效、鲁棒的视频推理。

Result: 在MVBench、LongVideoBench、MLVU和VideoMME四个VideoQA基准测试上的实验表明,即使在令牌减少90%的激进压缩下,DynaTok仍能保持超过95%的基线准确率,超越了近期的免训练方法。

Insight: 创新点在于通过轻量级指数移动平均(EMA)内存实现长期时间变化感知的时域预算分配(TBA),以及利用基于激活的注意力图和空间内存实现空间多样性选择、减少冗余并缓解位置偏差的空域预算分配(SBA)。该框架无需重新训练即可与现有Video-LLMs(如LLaVA-OneVision)无缝集成。

Abstract: Recent advances in Video Large Language Models (Video-LLMs) have greatly expanded multimodal reasoning capabilities. However, the massive number of visual tokens extracted from long video sequences incurs prohibitive computational costs, limiting their deployment in real-world scenarios. Existing training-free token compression methods select tokens based on attention magnitude as a proxy for semantic importance, but often overlook positional bias and rely only on short-term temporal locality, leading to redundant spatio-temporal coverage and inefficient token usage. We present DynaTok, a training-free, temporally adaptive and bias-aware token compression framework that allocates token budgets across both temporal and spatial dimensions. Through a lightweight exponential moving average (EMA) memory, the Temporal Budget Allocation (TBA) module dynamically assigns fewer tokens to redundant frames and more to novel frames, capturing long-term temporal variation. The Spatial Budget Allocation (SBA) module complements this by selecting spatially diverse and semantically important features using activation-based attention maps, while leveraging a spatial memory to reduce redundancy from previously selected regions and mitigate positional bias. DynaTok integrates seamlessly with existing Video-LLMs such as LLaVA-OneVision and LLaVA-Video without retraining, and effectively preserves semantic coverage under aggressive compression. Experiments on four representative VideoQA benchmarks-MVBench, LongVideoBench, MLVU, and VideoMME-show that DynaTok retains over 95% of baseline accuracy even with a 90% token reduction, surpassing recent training-free approaches. These results demonstrate that DynaTok provides a principled foundation for efficient and robust video reasoning, paving the way toward real-time streaming video understanding with future Video-LLMs.


[44] RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding cs.CV | cs.AIPDF

Hanqing Liu, Mingjie Liu, Luoping Cui, Endian Lin, Donghong Jiang

TL;DR: 本文提出了RE-VLM,一种首个利用RGB图像和事件流进行双流处理的视觉语言模型,旨在提升在恶劣成像条件下(如低光、高动态范围、快速运动)的场景理解能力。为了解决RGB-Event-Text监督数据稀缺的问题,作者提出了一种图驱动的流水线来生成可验证的场景图,并合成了描述和问答对,同时构建了两个数据集PEOD-Chat和RGBE-Chat用于开发和评估。实验表明,RE-VLM在图像描述和视觉问答基准上,尤其是在挑战性条件下,显著优于仅使用RGB或仅使用事件的最先进模型。

Details

Motivation: 传统视觉语言模型在恶劣成像条件下(如低光、高动态范围、快速运动)表现不佳,因为标准RGB图像在这些环境中会退化;事件相机提供了一种互补的模态,能够异步记录像素亮度变化,具有高时间分辨率和宽动态范围,在帧失效时保留运动线索。

Result: 在图像描述和视觉问答基准测试中,RE-VLM在参数量可比的情况下,一致性地超越了仅使用RGB或仅使用事件的最先进模型,在挑战性条件下取得了特别大的性能提升。

Insight: 创新点在于首次提出了双流视觉语言模型,联合利用RGB和事件流进行鲁棒的场景理解;提出了一种图驱动的流水线,从同步的RGB-Event流生成可验证的场景图,并合成文本监督数据,以解决多模态数据稀缺问题;构建了针对光照挑战场景和多样化场景的数据集,为领域研究提供了资源。

Abstract: Conventional vision-language models (VLMs) struggle to interpret scenes captured under adverse conditions (e.g., low light, high dynamic range, or fast motion) because standard RGB images degrade in such environments. Event cameras provide a complementary modality: they asynchronously record per-pixel brightness changes with high temporal resolution and wide dynamic range, preserving motion cues where frames fail. We propose RE-VLM, the first dual-stream vision-language model that jointly leverages RGB images and event streams for robust scene understanding across both normal and challenging conditions. RE-VLM employs parallel RGB and event encoders together with a progressive training strategy that aligns heterogeneous visual features with language. To address the scarcity of RGB-Event-Text supervision, we further propose a graph-driven pipeline that converts synchronized RGB-Event streams into verifiable scene graphs, from which we synthesize captions and question-answer (QA) pairs. To develop and evaluate RE-VLM, we construct two datasets: PEOD-Chat, targeting illumination-challenged scenes, and RGBE-Chat, covering diverse scenarios. On captioning and VQA benchmarks, RE-VLM consistently outperforms state-of-the-art RGB-only and event-only models with comparable parameter counts, with particularly large gains under challenging conditions. These results demonstrate the effectiveness of event-augmented VLMs in achieving robust vision-language understanding across a wide range of real-world environments. Code and datasets are available at https://github.com/bupt-ai-cz/RE-VLM.


[45] Selective, Regularized, and Calibrated: Harnessing Vision Foundation Models for Cross-Domain Few-Shot Semantic Segmentation cs.CVPDF

Junyuan Ma, Xunzhi Xiang, Wenbin Li, Qi Fan, Yang Gao

TL;DR: 本文提出了一种名为HERA的三阶段选择-正则化-校准框架,用于解决跨域少样本语义分割问题。该方法通过自适应选择视觉基础模型中最具信息量的层、利用先验引导正则化优化交互表示,并结合像素级自适应校准来生成一致的分割掩码,仅需微调不到2.7%的参数即可在新域中有效利用冻结的VFM特征。

Details

Motivation: 动机在于解决视觉基础模型在跨域少样本语义分割任务中面临的两个主要挑战:一是有限的标注样本容易导致微调时过拟合,二是预训练数据中目标域偏移的不足导致跨域不一致性和层间敏感性。

Result: 在多个跨域少样本语义分割基准测试中,HERA方法超越了现有最佳方法,平均交并比提升超过4.1个百分点,达到了新的SOTA水平。

Insight: 创新点包括:1)设计分层层选择机制,通过数据依赖的样本转移风险自适应识别VFM中最具信息量的层;2)提出先验引导正则化来优化交互表示的结构;3)引入像素级自适应校准结合选定表示和精炼交互图以提升预测一致性。这些步骤构成了一个无需源数据重训练的高效分层管道。

Abstract: Vision foundation models (VFMs) have achieved strong performance across various vision tasks. However, it still remains challenging to apply VFMs for cross-domain few-shot segmentation (CD-FSS), which segments objects of novel classes under domain shifts using only a few labeled exemplars. The challenge is mainly driven by two factors: (1) limited labeled exemplars per novel class relative to the scale of VFM pre-training, making the model prone to overfitting during retraining, and (2) target-domain shifts underrepresented during pre-training, inducing cross-domain inconsistency and layer-wise sensitivity. To address these issues, we propose Hierarchical Exemplar Representation Adaptation (HERA), a three-stage select-regularize-calibrate VFM-based segmentation framework that learns effectively from limited labels and adapts to novel domains without source-data retraining. We first design Hierarchical Layer Selection (HLS) to adaptively identify the most informative VFM layer using a data-dependent Exemplar Transfer Risk (ETR) computed for each candidate layer. Then, Prior-Guided Regularization (PGR) regularizes interactions on the selected representation, yielding well-structured local signals for the subsequent stage. Furthermore, Pixelwise Adaptive Calibration (PAC) combines the selected representation with the refined interaction maps to calibrate pixel-wise predictions, producing consistent masks. Together, these stages form a hierarchical select-regularize-calibrate pipeline that guides frozen VFM features in new domains while fine-tuning less than 2.7% of parameters at test time. Extensive experiments show that HERA surpasses the state of the art by more than 4.1 mIoU across multiple CD-FSS benchmarks.


[46] Semantic-Enriched Latent Visual Reasoning cs.CVPDF

Tianrun Xu, Yue Sun, Qixun Wang, Jingyi Lu, Yuan Wang

TL;DR: 本文提出了一种名为语义增强的潜在视觉推理(SLVR)的两阶段学习框架,旨在通过增强潜在表示中的语义丰富性来改进多模态潜在空间推理。该框架首先在细粒度属性监督下学习语义增强的区域中心潜在表示,然后通过多查询组相对策略优化(M-GRPO)对齐同一区域的多查询潜在表示。实验表明,SLVR在语义变化基准SV-QA上提升了潜在视觉推理的鲁棒性和语义一致性。

Details

Motivation: 现有多模态潜在空间推理方法主要依赖视觉监督,导致潜在表示缺乏足够的语义丰富性,限制了其在多样化区域级推理任务中的能力。本文旨在通过引入语义增强来克服这一限制,以支持更复杂的视觉推理任务。

Result: 在构建的SLV-Set数据集(包含约40万区域级属性标注和80万多查询问答样本)和SV-QA基准上,SLVR相比现有基线方法,在潜在推理的鲁棒性和语义一致性方面表现出改进。

Insight: 创新点包括:1)两阶段框架结合了属性级语义监督和多查询对齐;2)设计了M-GRPO方法以增强潜在表示的对齐性;3)构建了大规模数据集SLV-Set和评估基准SV-QA,为区域级语义推理提供了新工具。

Abstract: Multimodal latent-space reasoning aims to replace explicit thinking with images by performing visual reasoning directly in a compact latent space. However, existing approaches largely rely on visual supervision and produce latent representations that lack sufficient semantic richness, limiting their ability to support diverse region-level reasoning tasks. In this work, we introduce Semantic-Enriched Latent Visual Reasoning (SLVR), a two-stage learning framework that enriches latent representations with attribute-level visual semantics and aligns them with diverse reasoning objectives. In the first stage, SLVR learns semantically enriched region-centric latents under fine-grained attribute supervision. In the second stage, we design Multi-query Group Relative Policy Optimization (M-GRPO) to align latent representations across multiple queries grounded in the same region. To support this framework, we construct SLV-Set, comprising approximately 400K region-level attribute annotations and 800K multi-query question answering samples, and introduce SV-QA, a benchmark that evaluates latent reasoning under semantic variation. Experiments demonstrate that SLVR improves the robustness and semantic consistency of latent visual reasoning compared to existing baselines.


[47] Scalable, Energy-Efficient Optical-Neural Architecture for Multiplexed Deepfake Video Detection cs.CV | cs.LG | cs.NE | physics.app-ph | physics.opticsPDF

Parnian Ghapandar Kashani, Shiqi Chen, Aydogan Ozcan

TL;DR: 本文提出了一种可扩展、高能效的光学-神经网络混合架构,用于并行检测多个深度伪造视频。该框架结合轻量级数字前端与空间复用光学解码后端,通过可编程空间光调制器实现大规模并行模拟推理,单次光学传播可同时处理15个以上视频流,在降低计算成本的同时实现高吞吐量和准确的视频级真实性预测。

Details

Motivation: AI生成视觉媒体的快速扩散迫切需要高效、可信的深度伪造检测系统,但现有基于深度学习的检测方法依赖计算密集且高能耗的推理算法,限制了其可扩展性。

Result: 在可见光谱空间复用实验装置上,使用Celeb-DF视频数据集进行验证,单次光学推理并行测试15个视频,平均深度伪造检测准确率、敏感性和特异性分别达到97.79%、99.86%和95.72%。系统对视频退化、噪声、压缩、实验失准和黑盒对抗攻击表现出鲁棒性。

Insight: 创新点在于将光学计算集成到AI推理中,通过空间复用光学解码实现大规模并行模拟处理,同时提升了吞吐量、能效和对抗鲁棒性,这三者在纯数字系统中难以兼得。

Abstract: The rapid proliferation of AI-generated visual media has created an urgent need for efficient, trustworthy deepfake detection systems. However, existing deep learning-based detection methods rely on computationally intensive and energy-demanding inference algorithms, limiting their scalability. Here, we present a hybrid digital-analog deepfake video detection framework that combines a lightweight digital front-end with a spatially multiplexed optical decoding back-end for massively parallel analog inference through a programmable spatial light modulator. By simultaneously processing 15 or more video streams within a single optical propagation pass, the system enables high-throughput and accurate video-level authenticity prediction at reduced computational cost compared with purely digital methods. We validated this hybrid deepfake video processor using different datasets spanning classical face-swapping, real-world deepfake recordings, and fully AI-generated videos. Using a spatially multiplexed experimental set-up operating in the visible spectrum, we achieved average deepfake detection accuracy, sensitivity and specificity of 97.79%, 99.86% and 95.72%, respectively, on the Celeb-DF video dataset with 15 videos tested in parallel in a single optical pass per inference. The multiplexed optical decoder also demonstrates resilience against various types of video degradation, noise, compression, experimental misalignments and black-box adversarial attacks. Our results show that integrating optical computation into AI inference enables simultaneous gains in throughput, energy efficiency, and adversarial robustness - three properties that are difficult to achieve together in purely digital systems.


[48] Concept-Guided Noisy Negative Suppression for Zero-Shot Classification and Grounding of Chest X-Ray Findings cs.CV | cs.AI | cs.LGPDF

Chenyu Lian, Hong-Yu Zhou, Chun-Ka Wong, Jing Qin

TL;DR: 本文提出了一种名为CoNNS的概念引导噪声负样本抑制框架,用于解决胸部X光与放射学报告视觉语言对齐中因不同患者存在相似发现而引入的噪声负样本问题,从而提升零样本分类和定位任务的性能。

Details

Motivation: 标准对比学习将不同患者的X光片和报告简单视为负样本对,但不同患者常表现出相似的病理发现,这引入了噪声负样本,导致语义模糊并损害零样本理解任务的性能。

Result: 在多个零样本分类数据集和多粒度零样本定位任务上的广泛实验表明,CoNNS超越了现有的最先进模型。

Insight: 创新点在于构建了一个由大语言模型驱动的分层临床概念本体,并基于此实施了跨患者对重标记策略(包括细粒度分解、噪声负样本过滤和难负样本挖掘),以及提出了概念感知NCE损失来抑制噪声负样本。该方法的核心在于利用结构化概念知识来精细化地管理对比学习中的负样本关系。

Abstract: Vision-language alignment using chest X-rays and radiology reports has emerged as an advanced paradigm for zero-shot classification and grounding of chest X-ray findings. However, standard contrastive learning typically treats radiographs and reports from different patients simply as negative pairs. This assumption introduces noisy negatives, as different patients frequently exhibit similar findings. Such noisy negatives cause semantic ambiguity and degrade performance in zero-shot understanding tasks. To address this challenge, we propose CoNNS, a concept-guided noisy-negative suppression framework. To support the negative suppression mechanism, unlike previous methods that use raw reports or templatized texts, we construct a hierarchical concept ontology using large language models. The ontology structures 41 key clinical concepts by explicitly modeling presence, attributes (location and characteristics), and texts (evidential segment and presence statement). Leveraging this ontology, we implement a cross-patient pair relabeling strategy comprising three steps: (1) Fine-Grained Breakdown to categorize pairs based on finding presence; (2) Noisy Negative Filtering to resolve semantic conflicts by removing false negatives; and (3) Hard Negative Mining to identify subtle attribute discrepancies using a lightweight language model. Finally, we propose a Concept-Aware NCE loss to align visual features with text while suppressing the identified noisy negatives. Extensive experiments across multi-granularity zero-shot grounding tasks and five zero-shot classification datasets validate that CoNNS outperforms existing state-of-the-art models. The code is available at https://github.com/DopamineLcy/conns.


[49] Sparse Mixture-of-Experts Routing in Visual Diffusion Transformers:Diagnosis, Boundary Calibration and Evolutionary Roadmap from Routing Collapse to Selective Deadlock cs.CVPDF

Haiying Sha

TL;DR: 本文系统诊断了视频扩散Transformer中Token-Choice稀疏混合专家(MoE)的训练失败模式。通过将约50亿参数的预训练稠密模型转换为MoE架构,实验揭示了五种层次化的失败模式,包括路由器饱和、选择性死锁和精度陷阱等。基于路由决策的时间序列分析,作者提出了功能冗余假说来解释死锁现象,并总结了稠密到MoE转换的三定律,校准了Token-Choice范式的能力边界,并规划了从视觉统一到世界模型的三步演进路线图。

Details

Motivation: 解决视频扩散Transformer中稀疏混合专家(MoE)架构在训练过程中出现的系统性失败问题,特别是路由崩溃和选择性死锁现象,以提升MoE在视觉生成任务中的稳定性和性能。

Result: 实验在视频扩散Transformer上揭示了五种失败模式,包括线性路由器全局软饱和、MLP路由器选择性死锁、交叉注意力路由器部分自恢复等,并观察到死锁层呈U形分布。通过分析6500万token在5000训练步上的路由决策时间序列,支持了功能冗余假说。

Insight: 创新点包括提出功能冗余假说解释死锁机制,总结稠密到MoE转换的三定律(如路由专家克隆原始权重、共享专家初始化策略),以及解决bfloat16精度陷阱的完整方案。从系统生物学角度借鉴功能冗余理论,为MoE训练提供了新的理论视角和工程实践指南。

Abstract: This paper systematically diagnoses the training failure modes of Token-Choice sparse Mixture-of-Experts (MoE) on video Diffusion Transformers. Starting from a pretrained dense model of about 5 billion parameters, we convert it into an MoE architecture following three laws: routed experts exactly clone the original FFN weights, shared experts are initialized to zero for verification and then to extremely small non-zero noise for actual training, while only the gating networks start from random initialization. Experiments reveal a hierarchy of five failure modes: (1) linear routers suffer global soft saturation with complete expert homogenization; (2) MLP routers introduce selective deadlock, where roughly one-third of layers degenerate into a single-expert mode that cannot be prevented by increasing the auxiliary loss; (3) cross-attention routers exhibit preliminary self-recovery, yet about nine layers remain stubbornly deadlocked; (4) deadlocked layers display a U-shaped distribution, concentrated in shallow visual processing layers and deep semantic integration layers; (5) bfloat16 mixed precision causes tiny weight updates to be truncated to zero by hardware. Based on routing decision time series over 65 million tokens across 5,000 training steps, we propose the Functional Redundancy Hypothesis: deadlock is a rational waiting strategy before the shared expert matures within the gate-shared expert-routed expert triadic system. This hypothesis is supported by the theory of functional redundancy in systems biology. On the engineering side, we summarize the Three Laws of dense-to-MoE conversion and provide a complete solution for the bfloat16 precision trap. We calibrate the current capability boundary of the Token-Choice paradigm and outline a three-step evolutionary roadmap from visual unification to a world model.


[50] MatPhys: Learning Material-Aware Physics Parameters for Deformable Object Simulation from Videos cs.CVPDF

Yang Yang, Yiyan Wang, Zheming Liu, Naoya Iwamoto

TL;DR: MatPhys是一个从单视角视频中预测弹簧-质量物理参数的端到端框架,旨在解决现有方法中材料均匀性假设和跨场景参数不一致的问题。通过结合语义分割和共享材料编码本,该方法能够为物体的不同部分分配差异化的物理行为,并确保相同材料在不同场景下具有一致的参数。

Details

Motivation: 现有基于物理的逆向优化方法通常假设物体材料均匀,且由于单目观测的固有模糊性,导致相同材料在不同场景或交互中参数不一致。MatPhys旨在克服这两个根本性限制,实现材料感知且跨场景一致的物理参数预测。

Result: 实验表明,该方法在重建和未来预测任务上达到了与逐场景优化基线相当的性能,同时在未见过的交互和物体上表现出更强的泛化能力,并获得了更一致的物理参数。

Insight: 创新点在于利用DINO特征进行语义部件分解并查询部件级材料先验,以及引入共享材料编码本作为外观与物理之间的桥梁,从而将欠约束的单目问题转化为基于可复用材料概念的端到端推理。

Abstract: Reconstructing simulation-ready deformable objects is important for vision, graphics, and robotics. Existing physics-driven methods can recover physical digital twins from videos, but they suffer from two fundamental limitations: they typically assume a homogeneous material across the whole object, and their scene-specific inverse optimization, combined with the inherent ambiguity of monocular observation, yields inconsistent parameters for the same material across different scenes or interactions. We propose MatPhys, a material-aware feed-forward framework that predicts spring-mass parameters from a single-view video, addressing these two issues with two coupled designs. To relax the homogeneous material assumption, we use DINO features to decompose the object into semantically meaningful parts and to query a part-level material prior, assigning each part its own physical behavior. To enforce cross-scene consistency, we introduce a learned material codebook of shared material embeddings as the bridge between appearance and physics, and further use the part-level prior as a reference distribution that constrains the decoder so that the same material yields consistent parameters across scenes and interactions. Together, these designs turn an under-constrained monocular problem into feed-forward inference grounded on shared, reusable material concepts. Experiments show that our method matches per-scene optimization baselines in reconstruction and future prediction, while achieving stronger generalization to unseen interactions and objects with more consistent physical parameters.


[51] MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification cs.CV | cs.LGPDF

Halil Ibrahim Gulluk, Olivier Gevaert

TL;DR: 本文提出MAM-CLIP,一种用于乳腺X线摄影BI-RADS分类的视觉-语言预训练方法。该方法利用乳腺X线摄影图谱中的图像-文本对进行对比学习预训练,使视觉编码器能够从文本描述中吸收丰富的医学知识,从而提升对乳腺X线影像的理解。随后在BI-RADS预测任务上进行微调,尤其在标注数据稀缺时展现出显著的性能提升。

Details

Motivation: 解决乳腺X线影像解读存在主观差异、仅依赖图像标签训练分类模型性能有限的问题,旨在利用乳腺X线摄影图谱中的文本描述信息来增强模型对影像的理解能力。

Result: 在BI-RADS预测任务上,相比无此预训练的模型,性能显著提升,特别是在小样本场景下:当训练样本为1K时,3类平均F1分数提升+14%;当样本为40K时,提升+1%。实验还表明,在拥有超过10K训练样本时,使用2K图像-文本对进行预训练比使用2K标注样本进行训练更具信息量,平均优势为+1.1%。

Insight: 创新点在于将视觉-语言预训练(特别是对比学习)引入乳腺X线影像分析领域,利用医学图谱中的结构化文本知识来增强视觉表征学习。客观来看,该方法为医学影像分析提供了一种有效利用多模态先验知识(尤其是小样本场景下)的新范式,并公开了相关代码、模型和数据以促进研究。

Abstract: Deep learning methods have demonstrated promising results in predicting BI-RADS scores from mammography images. However, the interpretation of these images can vary, leading to discrepancies even among radiologists. Given the inherent complexity of mammograms, training classification models solely on image labels often yields limited performance. To address this challenge, we curated 2313 mammogram images and their corresponding captions from two mammography atlases. Our proposed approach employs a multi-modal model that uses a pretrained PubMedBERT as the language component. By training this model on image-text pairs with contrastive learning, we enable the vision encoder to absorb the rich information contained in the captions, thereby improving its understanding of mammography findings. We then fine-tune the vision encoder on two datasets for BI-RADS prediction, achieving superior performance compared with models trained without this pretraining, particularly when labeled samples are scarce. The improvement in the 3-class average F1 score ranges from +1% to +14%: a +1% increase with 40K training samples, and a +14% increase with 1K samples. Furthermore, our experiments reveal that 2K image-text pairs from mammography atlases can be more informative than 2K labeled samples for label prediction, with an average margin of +1.1% when more than 10K training samples are available. Overall, our work provides a vision-language model for mammography and highlights the value of textual information from mammography atlases. In addition, we publicly release preprocessed mammography images of the TEKNOFEST dataset. The training code, pre-trained model weights, data extraction scripts, and the released dataset are publicly available at: https://github.com/igulluk/MAM-CLIP


[52] LMM-Track4D: Eliciting 4D Dynamic Reasoning in LMMs via Trajectory-Grounded Dialogue cs.CVPDF

Chaoyue Li, Yongxue Xu, Jie Feng, Jiayu Ding

TL;DR: 本文提出LMM-Track4D模型,旨在增强大型多模态模型(LMMs)对视频中4D(3D空间+时间)动态场景的持续时空推理能力。为此,作者定义了基于轨迹的多轮时空对话新任务,并构建了Track4D-Bench基准数据集。模型结合了射线-时间几何编码(RTGE)、用于长时动态传播的TRK状态令牌以及OSK-RA解码器,以在遮挡和视角变化下稳定估计3D目标轨迹。

Details

Motivation: 现有大型多模态模型在图像和视频理解方面能力增强,但在维持4D连续时空动态推理方面仍存在不足。本文旨在研究并弥补这一能力差距。

Result: 在提出的Track4D-Bench基准(包含526个对话样本)上的实验表明,LMM-Track4D模型相比强基线模型取得了持续改进。

Insight: 创新点在于提出了一个结合轨迹的时空对话任务及相应基准,并设计了包含RTGE几何编码、专用TRK状态令牌和OSK-RA解码器的模型架构。核心洞察是,显式的动态状态建模是激发LMMs进行4D动态推理的有效设计原则。

Abstract: Recent large multimodal models (LMMs) have become increasingly capable on image and video understanding, yet still struggle to sustain 4D continuous spatiotemporal dynamic reasoning. To study this capability gap, we formulate trajectory-grounded multi-turn spatiotemporal dialogue, a new task in which a model must answer spatiotemporal queries while returning structured 3D target trajectories over an entire short clip or a specified segment of a longer clip, and introduce Track4D-Bench, a benchmark with 526 clip-level dialogue samples spanning 23.5k frames and 7.5k object annotations, for training and evaluation. Building on this task, we propose LMM-Track4D, which combines RTGE (Ray–Time Geometry Encoding), a dedicated streaming state token TRK for long-horizon dynamic propagation, and an Object-Slot Kinematic, Residual-Anchor (OSK-RA) decoder for stable 4-step 3D state estimation under occlusion and viewpoint variation. Experiments on Track4D-Bench show consistent improvements over strong baselines, suggesting that explicit dynamic state modeling is a useful design principle for eliciting 4D dynamic reasoning in LMMs. Our code and dataset will be publicly available at https://github.com/mikubaka88/LMM-Track4D.


[53] Vision Harnessing Agent for Open Ad-hoc Segmentation cs.CVPDF

Zilin Wang, Stella X. Yu

TL;DR: 本文提出了首个用于开放临时概念分割的视觉引导智能体VASA,它无需训练,结合视觉语言模型、分割基础模型和视觉工作流,通过持久工作掩码进行推理、构建和验证分割结果。

Details

Motivation: 解决开放临时概念分割的难题,这类概念无法通过预训练文本检索直接获得,需要根据图像证据(如部件、关系、排除和集合)动态构建。

Result: 在新建的PARS基准上,VASA超越开放词汇、基于推理和智能体基线,比SAM3 Agent提升14-25%;在RefCOCOm多粒度指代分割基准上,比SAM3 Agent提升5-9%,比其他智能体基线最高提升20%。

Insight: 创新点在于将AI智能体从简单封装基础模型提升为编程化任务知识、VLM行为、视觉例程、工作记忆和故障感知工作流,实现视觉构造能力。

Abstract: Segmentation has become easy when the concept is known, requiring retrieval of a learned visual grounding from text. It remains hard for open ad-hoc concepts, where the grounding may not exist as one learned mask and must often be constructed from image evidence through parts, relations, exclusions, and collections. We propose a Vision-guided Ad-hoc Segmentation Agent (VASA), the first vision harnessing agent for open ad-hoc segmentation. VASA is training-free and couples a VLM agent, a segmentation foundation model, and a visually grounded workflow. Rather than revising text prompts alone, VASA uses a persistent working mask to reason, construct, and validate a solution. It plans visual operations, invokes segmentation tools, inspects results, edits the mask, and recovers from errors. We construct PARS, a new benchmark that turns part-level labels in PartImageNet into open ad-hoc concepts through long-form definition queries. On PARS, VASA outperforms open-vocabulary, reasoning-based, and agentic baselines, surpassing SAM3 Agent by 14-25%. On RefCOCOm, a standard multi-granularity referring segmentation benchmark, VASA improves over SAM3 Agent by 5-9% and over other agentic baselines by up to 20%. These results validate agentic visual construction for open ad-hoc segmentation. Our work points to a path for AI agents beyond wrapping foundation models as tools: Programming them with task knowledge, VLM behavior, visual routines, working memory, and failure-aware workflows.


[54] KappaPlace: Learning Hyperspherical Uncertainty for Visual Place Recognition via Prototype-Anchored Supervision cs.CV | cs.AIPDF

Maya Yanko, Yoli Shavit

TL;DR: KappaPlace是一个用于视觉地点识别(VPR)的、学习不确定性感知表征的原则性框架。其核心是通过原型锚定监督策略,将图像描述符建模为冯·米塞斯-费舍尔分布变量,并学习一个轻量级模块来预测浓度参数作为偶然不确定性的直接代理。该方法还提出了一种新颖的匹配级不确定性量化公式,并在五个基准测试上显著降低了预期校准误差。

Details

Motivation: 当前最先进的视觉地点识别方法缺乏良好校准的不确定性估计,无法可靠地指示查询何时模糊或匹配可能错误,这在安全关键的机器人应用中构成风险。

Result: 在五个不同的基准测试上,KappaPlace相比现有方法将预期校准误差(ECE@K)降低了高达50%,同时保持或提高了检索召回率。

Insight: 主要创新点在于原型锚定监督策略和基于冯·米塞斯-费舍尔分布的概率建模,这使得能够直接学习表征的偶然不确定性。此外,从查询中心视角扩展到匹配级不确定性量化是一个新颖的贡献,提供了更细粒度的可靠性信号。该方法还提供了联合训练和冻结主干网络的后训练扩展两种变体,增强了实用性。

Abstract: Visual Place Recognition (VPR) is critical for autonomous navigation, yet state-of-the-art methods lack well-calibrated uncertainty estimation. Standard pipelines cannot reliably signal when a query is ambiguous or a match is likely incorrect, posing risks in safety-critical robotics. We propose KappaPlace, a principled framework for learning uncertainty-aware VPR representations. Our core contribution is a Prototype-Anchored supervision strategy that leverages latent class representatives as targets for a probabilistic objective. By modeling image descriptors as von Mises-Fisher (vMF) variables, we learn a lightweight module to predict the concentration parameter as a direct proxy for aleatoric uncertainty. While existing VPR uncertainty methods are typically restricted to a query-centric view, we derive a novel match-level formulation to quantify the reliability of specific query-reference pairs. Across five diverse benchmarks, KappaPlace reduces Expected Calibration Error (ECE@K) by up to 50% compared to existing methods while maintaining or improving retrieval recall. We provide both a joint-training variant and a post-training extension for frozen backbones. Our results demonstrate that KappaPlace provides a robust, stable, and well-calibrated signal that enables reliable decision-making within the VPR pipeline. Our code is available at: https://github.com/mayayank95/UncertaintyAwareVPR


[55] CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing cs.CV | cs.AI | cs.GR | cs.HCPDF

Haobo Hu, Xiangwu Guo, Zhiheng Chen, Difei Gao, Haotian Liu

TL;DR: 本文介绍了CutVerse,一个用于评估GUI代理在真实媒体后期制作环境中性能的基准测试。该基准涵盖了7个专业应用程序中的186个复杂、长视野任务,并开发了一个轻量级解析器将原始屏幕录制和交互日志转化为结构化、组合式的GUI动作轨迹。实验表明现有代理在媒体编辑任务上的成功率仅为36.0%,突显了长视野工作流的挑战。

Details

Motivation: 当前GUI代理在网页导航和基础操作系统任务上进展显著,但在专业创意工作流中的能力尚未充分探索,因此需要建立一个系统性的评估基准来填补这一空白。

Result: 在CutVerse基准上的广泛评估显示,现有代理在真实媒体编辑任务上的成功率仅为36.0%,表明其在长视野可靠性和领域特定规划方面仍存在局限。

Insight: 论文的创新点在于构建了一个专注于媒体后期制作的专业GUI代理基准,并通过开发轻量级解析器实现了对复杂、多模态交互序列的结构化表示,为评估代理在真实创意工作流中的能力提供了系统方法。

Abstract: While GUI agents have made significant progress in web navigation and basic operating system tasks, their capabilities in professional creative workflows remain largely underexplored. To bridge this gap, we introduce Cutverse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments. We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows, involving dense multimodal interfaces and tightly coupled interaction sequences. To support scalable evaluation, we develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding. Extensive evaluations reveal that existing agents achieve only 36.0% task success on realistic media editing tasks, underscoring the challenges posed by complex, long-horizon media post-production workflows in our benchmark.While current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution, they remain limited in long-horizon reliability and domain-specific planning.


[56] Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models cs.CV | cs.AIPDF

Wooseok Jeon, Seungho Park, Seunghyun Shin, Sangeyl Lee, Hyeonho Jeong

TL;DR: 本文提出了一种名为DyMoS(动态运动滑块)的训练无关且模型无关的方法,旨在解决图像到视频(I2V)模型生成视频过于静态的问题。该方法通过重新平衡在初始去噪步骤中,生成帧对参考帧的注意力路径,从而增强运动动态,同时保持视觉质量和参考图像的保真度。

Details

Motivation: 图像到视频模型生成的视频通常比文本到视频模型更静态。现有方法通过削弱或修改图像条件信号来缓解此问题,但往往需要额外训练或牺牲对参考图像的保真度。本文旨在识别并解决导致运动抑制的关键机制——参考帧主导。

Result: 在多个最先进的I2V骨干模型上的实验表明,DyMoS能持续改善运动动态,同时保持视觉质量和参考图像的保真度。该方法仅引入一个标量参数即可连续控制运动强度。

Insight: 核心创新点是识别出“参考帧主导”是运动抑制的关键机制,即非参考帧对参考帧关键令牌分配了过多的自注意力,导致参考信息在时间上过度传播并抑制了帧间动态。基于此,提出的DyMoS是一种无需训练、模型无关的干预方法,通过调整注意力路径来重新平衡这种主导关系,实现了运动强度与图像保真度的可控权衡。

Abstract: Image-to-video models often generate videos that remain overly static, compared to text-to-video models. While prior approaches mitigate this issue by weakening or modifying the image-conditioning signal, they often require additional training or sacrifice fidelity to the reference image. In this work, we identify \emph{reference-frame dominance} as a key mechanism behind motion suppression. We observe that non-reference frames in I2V models allocate excessive self-attention to reference-frame key tokens, causing reference information to be over-propagated across time and suppressing inter-frame dynamics. Based on this finding, we propose DyMoS~(Dynamic Motion Slider), a training-free and model-agnostic method that rebalances the attention pathway from generated frames to the reference frame during initial denoising steps. DyMoS leaves both the input image and model weights unchanged and introduces a single scalar parameter for continuous control over motion strength. Experiments across multiple state-of-the-art I2V backbones demonstrate that DyMoS consistently improves motion dynamics while maintaining visual quality and fidelity to the reference image.


[57] Return of Frustratingly Easy Unsupervised Video Domain Adaptation cs.CVPDF

Pengfei Wei, Yiqun Sun, Zhiqiang Xu, Yiping Ke, Lawrence B. Hsieh

TL;DR: 本文提出了一种名为MetaTrans的简单无监督视频域自适应方法,该方法仅包含两个基本损失项,通过时空分离处理策略有效减少跨域视频的空间和时间差异。

Details

Motivation: 解决无监督视频域自适应问题,该领域虽实用但研究不足,旨在通过简洁的模型设计提升跨域动作识别性能。

Result: 在多个跨域动作识别任务上的实验表明,MetaTrans在绝对适应性能上显著提升,相对性能增益远超当前最先进的无监督视频域自适应基线方法。

Insight: 创新点在于将跨域视频的空间和时间差异分开处理,通过时序-静态减法模块实现高效去偏,模型设计简洁但体现了先进的域自适应思想。

Abstract: Unsupervised video domain adaptation (UVDA) is a practical but under-explored problem. In this paper, we propose a frustratingly easy UVDA method, called MetaTrans. Specifically, MetaTrans adopts a concise learning objective that contains only two fundamental loss terms. Despite the simplicity of the learning objective, MetaTrans embodies an advanced UVDA idea, that is, handling the spatial and temporal divergence of cross-domain videos separately, through a subtle model architecture design. By implementing a temporal-static subtraction module, MetaTrans effectively removes spatial and temporal divergence. Extensive empirical evaluations, particularly on various cross-domain action recognition tasks, show substantial absolute adaptation performance enhancement and significantly superior relative performance gain compared with state-of-the-art UVDA baselines.


[58] EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning cs.CVPDF

Pengtao Ma, Ziliang Zhou, Ciyu Ruan, Haoyang Wang, Kaiyuan Li

TL;DR: 本文提出EventPrune,一种基于事件相机运动先验的级联事件辅助令牌剪枝框架,用于高效的第一人称动态空间推理。该方法通过事件触发因果采样、事件引导运动显著性过滤和事件注意力排名融合三个阶段,在减少80%视觉令牌的同时提升推理性能,并实现了1.89倍加速和52%计算量降低。

Details

Motivation: 基于Transformer的视频大语言模型存在二次注意力计算成本高的问题,现有令牌剪枝方法依赖离散静态快照,无法保留动态空间推理所需的连续运动和几何线索。

Result: 在ESR-Real(首个真实世界RGB-事件基准)上,ECP在减少80%令牌的情况下准确率超过全令牌基线(37.62% vs. 36.31%),并带来2.68个百分点的性能提升,达到SOTA水平。

Insight: 创新点在于首次利用事件相机的高频运动线索作为连续先验指导令牌选择,提出无需训练的级联剪枝框架,将事件流与视觉注意力机制融合以增强动态推理效率。

Abstract: First-person dynamic spatial reasoning requires models to track continuous motion and precise geometric structure, but the quadratic attention cost of Transformer-based Video-LLMs makes dense visual tokens computationally expensive. Existing token pruning paradigms predominantly rely on discrete static snapshots, failing to preserve the motion and geometric cues essential for reasoning. We propose Event Cascade Pruning (ECP), to our knowledge the first training-free framework that leverages the high-frequency motion cues from event cameras as a continuous event-guided motion prior to guide token selection. ECP combines three stages: Event-Triggered Causal Sampling to anchor motion-informative keyframes, Event-guided Motion Saliency Filtering to suppress event-inactive visual tokens, and Event-Attention Ranking Fusion to calibrate spatial attention with motion-salient dynamics. With 80% visual token reduction, ECP outperforms the full-token baseline (37.62% vs. 36.31%) while achieving 1.89x inference speedup and 52% GFLOPs reduction. We further introduce ESR-Real, the first real-world RGB-event benchmark for first-person spatial reasoning, where ECP improves accuracy by 2.68 percentage points over full-token baselines.


[59] Thinking in Scales: Accelerating Gigapixel Pathology Image Analysis via Adaptive Continuous Reasoning cs.CVPDF

Jiusong Ge, Yingkang Zhan, Wenjie Zhao, Di Zhang, Ke Wang

TL;DR: 本文提出了一种名为PathCTM的新方法,用于加速千兆像素病理全切片图像的分析。该方法将诊断推理建模为一个动态的序列信息获取过程,通过从低倍率全局视图逐步过渡到高倍率局部视图,并自适应地终止推理,从而大幅减少计算开销。

Details

Motivation: 传统的全切片图像分析方法通常基于多示例学习范式,需要处理大量高倍率图像块,计算成本高昂,严重限制了分析的效率和可扩展性。本文旨在解决这一计算效率瓶颈。

Result: 大量实验表明,与标准的基于MIL的方法相比,PathCTM将所需的图像块数量减少了95.95%,推理时间缩短了约95.62%,同时保持了AUC性能不下降。

Insight: 核心创新点在于将病理图像分析建模为一种动态、自适应的连续推理过程,结合了条件计算进行动态尺度切换与注意力引导的区域剪枝,以及基于置信度的早停机制,实现了计算效率的极大提升。

Abstract: Traditional whole slide image (WSI) analysis methods typically rely on the multiple instance learning (MIL) paradigm, which extracts patch-level features at high magnification and aggregates them for slide-level prediction. However, such exhaustive patch-level processing is computationally expensive, severely limiting the efficiency and scalability of WSI analysis. To address this challenge, we propose PathCTM (a Pathology-oriented Continuous Thought Model) that enables token-efficient scale-space continuous reasoning for gigapixel WSIs. PathCTM formulates diagnostic inference as a dynamic sequential information pursuit. It progressively transitions from low-magnification global to high-magnification local inspection, and adaptively terminates inference when sufficient evidence is gathered to effectively bound decision uncertainty. Specifically, it uses conditional computation for dynamic scale switching with attention-guided region pruning, coupled with confidence-aware early stopping. Extensive experiments demonstrate that, compared with standard MIL-based methods, PathCTM reduces the number of required image patches by 95.95% and shortens inference time by approximately 95.62%, while maintaining AUC without degradation. Code is available at https://github.com/JSGe-AI/PathCTM.


[60] iDiff: Interpretable Difference-aware Framework for Pairwise Image Quality Assessment cs.CVPDF

Xinli Yue, JianHui Sun, Tao Shao, Liangchao Yao, Fan Xia

TL;DR: 本文提出了iDiff,一个可解释的、感知差异的成对图像质量评估框架,用于专业摄影领域。该框架采用双分支设计,包含答案模型和思考模型,分别负责稳健的偏好预测和基于图像的推理生成,并在NTIRE 2026 RAIM挑战赛中取得了第一名。

Details

Motivation: 解决专业摄影中成对图像质量评估任务,该任务不仅要求模型判断两张图像中哪张质量更优,还需要提供基于图像的可信推理。NTIRE 2026 RAIM挑战赛进一步强调了同时评估偏好预测和推理生成的需求。

Result: 在NTIRE 2026 RAIM挑战赛中取得了第一名。广泛的实验表明,该框架在准确性和推理质量指标上均有效。

Insight: 创新点在于将显式的差异建模与结构化的多模态推理相结合。具体包括:双分支设计(答案模型与思考模型)实现判别性决策与结构化解释的联合建模;答案模型通过显式分解全局/局部视图、内容感知专业化及跨主干集成来提升鲁棒性;思考模型利用专家风格模板、多源质量特征和基于答案模型预测的监督来增强推理生成。

Abstract: Pairwise image quality assessment (IQA) in professional photography requires a model not only to identify the preferred image between two candidates, but also to provide convincing and image-grounded reasoning. In the NTIRE 2026 RAIM challenge, this requirement is further emphasized by jointly evaluating preference prediction and rationale generation. To address this task, we propose iDiff, an Interpretable Difference-aware framework for pairwise image quality assessment. Our method adopts a dual-branch design consisting of an Answer Model and a Thinking Model. The Answer Model performs robust preference prediction by explicitly decomposing each sample into left/right global and local views, followed by content-aware specialization for person and scene images and ensemble-based aggregation across backbones. The Thinking Model focuses on rationale generation and is progressively enhanced with expert-style templates, multi-source quality features, and answer-aware supervision conditioned on the Answer Model prediction. In this way, iDiff jointly models discriminative decision making and structured explanation, improving both robustness and interpretability. Extensive experiments demonstrate the effectiveness of the proposed framework on both accuracy and reasoning-quality metrics. Our method achieved first place in the NTIRE 2026 RAIM challenge, showing the effectiveness of integrating explicit difference modeling with structured multimodal reasoning for pairwise IQA.


[61] Dual-Prompt CLIP with Hybrid Visual Encoders for Occluded Person Re-Identification cs.CVPDF

Zhangjian Ji, Shaotong Qiao, Kai Feng, Wei Wei

TL;DR: 本文提出了一种名为DPL-ReID的新模型,用于解决遮挡行人重识别问题。该模型结合了双提示学习策略、真实世界遮挡增强方法和加权门控特征融合机制,旨在利用文本线索捕获完整行人语义并增强对遮挡的鲁棒性。

Details

Motivation: 现有基于预训练视觉语言模型的行人重识别方法通常只关注增强基于提示的特征学习,而忽略了遮挡物的语义信息,这导致在遮挡场景下跨视图匹配变得困难。

Result: 在多个基准遮挡ReID数据集上的大量实验表明,所提出的DPL-ReID模型达到了最先进的性能水平。

Insight: 创新点在于引入了双提示学习策略来同时利用行人文本提示和遮挡物文本提示,以及设计了真实世界遮挡增强方法来生成更逼真的训练数据。加权门控特征融合机制则有效引导视觉编码器生成更全面的特征表示。

Abstract: Occluded person re-identification focuses on matching partially visible pedestrians across multiple camera views. However, occlusions disrupt body-region cues, thereby complicating cross-view matching. Most person ReID methods built on pretrained vision-language models only focus on enhancing prompt-based feature learning while ignoring the semantic information of occluders. Based on the success of CLIP-ReID, we propose a novel Dual Prompt Learning ReID (DPL-ReID) model for occluded person ReID. It incorporates a Dual Prompt Learning (Dual-PL) strategy, which can utilize textual cues to capture complete pedestrian semantics and keep robustness against occlusion, and a Real-World Occlusion Augmentation (RWOA) method that realistically simulates occlusion scenarios encountered in real word to enrich occluded samples. In addition, we also design a Weighted Gated Feature Fusion (WGFF) method, which in corporates LSNet to capture global information and act as a feature-gating mechanism. This mechanism can effectively guide the CLIP visual encoder toward generating more comprehensive feature representations. Extensive experiments on several benchmark occluded ReID datasets show that our proposed DPL-ReID achieves the state-of-the art performance. The occlusion instance library are available at https://github.com/stone-qiao/DPL-ReID.


[62] CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision cs.CV | cs.AIPDF

Pengcheng Wang, Haoxiang Liu, Yang Dai, Xiangxiang Zeng, Guanhua Chen

TL;DR: 本文提出了CaptchaMind,一种基于强化学习并带有显式推理过程监督的CAPTCHA求解器。为了解决训练数据缺乏的问题,作者首先构建了CaptchaBench,一个包含16,000个程序生成样本、带有详细区域和过程级标注的基准测试集。该方法在CaptchaBench上取得了82.9%的平均成功率,在真实世界实例上达到71.0%,显著优于现有方法。

Details

Motivation: 现代CAPTCHA的求解需要鲁棒的多步视觉推理和交互能力,但由于缺乏大规模训练数据和过程级标注,基于训练的方法一直缺失。

Result: 在CaptchaBench基准测试的八个任务类别上,CaptchaMind取得了82.9%的平均成功率,在真实世界实例上达到71.0%,显著优于所有不依赖闭源API的现有方法。

Insight: 创新点在于通过构建首个支持大规模训练的CAPTCHA基准测试集(CaptchaBench)来解决数据瓶颈,并提出了结合显式推理过程监督的强化学习训练框架,以解决需要细粒度视觉细节捕捉和区域级比较的复杂任务。

Abstract: CAPTCHAs are widely deployed as human verification mechanisms and frequently block intelligent agents from completing end-to-end automation in real-world web environments. Solving modern CAPTCHAs requires robust multi-step visual reasoning and interaction capabilities, yet training-based approaches have remained absent due to the lack of large-scale training data and process-level annotations. We introduce CaptchaBench, the first CAPTCHA benchmark designed to support large-scale training, comprising 16,000 programmatically generated samples across eight task categories with detailed region and process-level annotations. Systematic evaluation on CaptchaBench reveals that existing methods fail consistently on tasks requiring fine-grained visual detail capture and region-level comparison. We therefore present CaptchaMind, an RL-based solver trained with explicit reasoning process supervision, achieving 82.9% average success rate across eight tasks and 71.0% on real-world instances, substantially outperforming all existing methods without closed-source APIs.


[63] Towards Camera-Robust 3D Localization: Equation-Anchored Tool-Use for MLLMs cs.CVPDF

Xueying Jiang, Wenhao Li, Quanhao Qian, Deli Zhao, Shijian Lu

TL;DR: 本文提出了一种面向相机鲁棒性的3D定位方法,通过方程锚定的工具使用框架解决MLLMs中的相机内参模糊性问题。该方法将空间工具重新定义为公式变量,在思维链中显式写入针孔反投影方程,并将工具输出代入公式后回归最终的9自由度边界框。

Details

Motivation: 解决多模态大语言模型在3D定位任务中存在的相机内参模糊性问题——同一图像在不同相机参数下对应不同的3D场景,现有方法要么忽略相机参数导致过拟合,要么将工具返回的深度信息作为参考线索而非确定性传播。

Result: 在相机内参缩放0.5倍至1.5倍的3D目标检测和3D视觉定位任务上,该方法优于仅使用RGB和工具增强的基线方法,在相机参数与训练尺度偏差最大时获得显著性能提升。

Insight: 创新点在于将空间工具输出作为公式变量嵌入显式几何方程(针孔模型),实现相机信息的确定性传播;通过方程锚定的工具使用范式,将隐式参考线索转化为显式几何约束,增强模型对相机参数变化的鲁棒性。

Abstract: 3D localization in Multimodal Large Language Models (MLLMs), including 3D object detection and 3D visual grounding, is fundamentally limited by camera intrinsic ambiguity: the same image admits different 3D scenes under different cameras. Existing MLLMs either ignore camera parameters and overfit to a canonical training intrinsic, or retrieve depth and 3D cues from external tools but treat the returned values as reference cues (numerical hints that the model is free to interpret implicitly), both preventing camera information from being deterministically propagated into the prediction. We propose an equation-anchored tool-use framework that re-purposes spatial tools as formula variables. The proposed framework proactively retrieves camera intrinsics and samples multi-point metric depths, writes the pinhole back-projection equation $\hat{X} = (u_c - c_x)\bar{Z}/f_x$ explicitly in Chain-of-Thought (CoT), and substitutes tool outputs into the formula before regressing the final 9-DoF bounding box. On both 3D object detection and 3D visual grounding tasks under rescaled camera intrinsics from $0.5\times$ to $1.5\times$, our method outperforms RGB-only and tool-augmented baselines, with significant gains where the camera deviates most from the training scale. Code and data will be released.


[64] Self-Creative Text-to-Object Generation using Semantic-Aware Spatial Weighting cs.CVPDF

Yue Yu, Haibo Chen, Shuo Chen, Jian Yang, Jun Li

TL;DR: 本文提出了一种名为SCDiff的自创性文本到图像生成模型,旨在解决现有T2I模型因噪声预测网络限制而缺乏真正创造力的问题。该模型通过可学习的空间加权模块和视觉语义混合损失两个核心组件,在保证文本-图像对齐的同时,增强生成图像的视觉新颖性和艺术价值。

Details

Motivation: 当前文本到图像生成模型主要优化文本与图像的逐字对齐,其噪声预测网络将生成限制在高概率区域,导致输出缺乏真正的创造性和艺术价值。

Result: 大量实验表明,该模型在创造力、语义对齐和视觉连贯性方面均有显著提升,为生成创造性对象提供了一个简单而强大的框架。

Insight: 创新点在于引入了参数化的Kaiser-Bessel窗口来强化中心图像特征以促进新颖生成,以及一个包含相似性损失和多样性损失的双重损失函数,在约束文本对齐的同时最大化视觉新颖性。

Abstract: Instilling creativity in text-to-image (T2I) generation presents a significant challenge, as it requires synthesized images to exhibit not only visual novelty and surprise, but also artistic value. Current T2I models, however, are largely optimized for literal text-image alignment with their data distribution, and their noise prediction networks constrain the generation to high-probability regions, consequently generating outputs that lack authentic creativity. To address this, we propose a Self-Creative Diffusion (SCDiff) model for meaningful T2I generations featuring two core modules: a learnable spatial weighting (LSW) module and a visual-semantic mixing loss (VSML). The LSW module designs a parametric Kaiser-Bessel window to reinforce central image features, fostering novel and surprising generation. The VSML module introduces a dual loss function: a similarity loss constrains that the new images align with its textual description, while a diversity loss maximizes its distinction from the original image, enhancing both semantic value and visual novelty. Extensive experiments demonstrate that our model substantially improves creativity, semantic alignment, and visual coherence, offering a simple yet powerful framework for generating creative objects.


[65] EgoCoT-Bench: Benchmarking Grounded and Verifiable Operation-Centric Chain of Thought Reasoning for MLLMs cs.CV | cs.AIPDF

Yang Dai, Dian Jiao, Tianwei Lin, Wenqiao Zhang

TL;DR: 本文提出了EgoCoT-Bench,一个用于评估多模态大语言模型在自我中心视频中进行细粒度、可验证、以操作为中心的思维链推理能力的基准测试。该基准包含超过3,000个可验证的问答对,涵盖感知、回顾、预测和高级推理任务,旨在解决现有基准在基于时空证据的推理评估方面的不足。

Details

Motivation: 现有自我中心视频基准在评估模型推理过程的时空证据基础方面存在局限,缺乏对细粒度、以操作为中心的推理及其可验证性的支持。

Result: 实验结果表明,现有模型在自我中心细粒度推理上仍面临困难,且许多模型产生的解释虽然答案正确,但其证据与答案不一致。

Insight: 创新点在于构建了一个基于时空场景图引导生成、并经过人工精修的基准,专注于评估模型推理的时空可验证性,为模型的可解释性和推理可靠性提供了新的测试平台。

Abstract: The rapid development of Multimodal Large Language Models (MLLMs) has led to growing interest in egocentric video understanding, specifically the ability for MLLMs to recognize fine-grained hand-object interactions, track object state changes over time, and reason about manipulative processes in dynamic environments from a first-person perspective. However, existing egocentric video benchmarks suffer from \textbf{limited grounded rationale evaluation}, offering limited support for fine-grained operation-centric reasoning and rarely examining whether model rationales are grounded in explicit spatio-temporal evidence. To address this gap, we introduce \textbf{EgoCoT-Bench}, a fine-grained egocentric benchmark for grounded and verifiable operation-centric reasoning with explicit step-by-step rationale annotations. Overall, EgoCoT-Bench comprises 3,172 verifiable QA pairs over 351 egocentric videos separated into four task groups for a total of 12 sub-task groups, encompassing perception and retrospection, anticipation, and high-level reasoning. The benchmark is constructed through a spatio-temporal scene graphs (STSG) guided generation framework and is further refined by human annotators to ensure correctness, egocentric relevance and fine-grained quality. Experimental results show continuing difficulties with egocentric fine-grained reasoning and further reveal that many multimodal models produce explanations that are answer-correct, but have evidence that is inconsistent with the answer. We hope EgoCoT-Bench can serve as a useful testbed for grounded and verifiable reasoning in egocentric video understanding. Project page and supplementary materials are available at: https://dstardust.github.io/EgoCoT/.


[66] A novel YOLO26-MoE optimized by an LLM agent for insulator fault detection considering UAV images cs.CV | cs.AIPDF

João Pedro Matos-Carvalho, Laio Oriel Seman, Stefano Frizzo Stefenon, Mohammad Khalaf Mohammad Khreasat, Gabriel Villarrubia González

TL;DR: 本文提出了一种名为YOLO26-MoE的新型目标检测架构,用于无人机图像中的绝缘子故障检测。该架构将稀疏混合专家模块集成到YOLO26检测器的高分辨率分支中,以增强对细微和多样化故障模式的特征提取能力,并通过一个工具增强的大型语言模型代理来协调超参数优化和训练。

Details

Motivation: 电力线绝缘子巡检对保障电网可靠性至关重要,但现有基于无人机和深度学习的视觉系统在检测小缺陷区域、异构故障模式、复杂背景和变化成像条件时仍面临挑战。

Result: 所提出的模型在绝缘子故障检测任务上取得了0.9900 mAP@0.5和0.9515 mAP@0.5:0.95的性能,超越了最新的YOLO版本,达到了SOTA水平。

Insight: 主要创新点在于将稀疏MoE模块集成到YOLO26的高分辨率分支中,实现了对细微故障模式的自适应特征细化,同时保持了单阶段检测框架的效率;另一个创新是使用工具增强的LLM代理来协调超参数优化和训练流程,提升了自动化程度。

Abstract: The inspection of electrical power line insulators is essential for ensuring grid reliability and preventing failures caused by damaged or degraded insulation components. In recent years, Unmanned Aerial Vehicles (UAVs) combined with deep learning-based vision systems have emerged as an effective solution for automating this process. However, insulator fault detection remains challenging due to small defect regions, heterogeneous fault patterns, complex backgrounds, and varying imaging conditions. To address these challenges, this paper proposes an optimized YOLO26-MoE, a novel object detection architecture that integrates a sparse Mixture-of-Experts (MoE) module into the high-resolution branch of the YOLO26 detector. The proposed modification enables adaptive feature refinement for subtle and diverse fault patterns while preserving the efficiency of a one-stage detection framework. Hyperparameter optimization, final training, and evaluation were coordinated through a tool-augmented Large Language Model (LLM) agent. The proposed model achieved 0.9900 mAP@0.5 and 0.9515 mAP@0.5:0.95, outperforming the latest YOLO versions. These results demonstrate that the proposed model provides an effective and reliable solution for UAV-based insulator fault detection.


[67] Lens Privacy Sealing: A New Benchmark and Method for Physical Privacy-Preserving Action Recognition cs.CV | cs.AIPDF

Mengyuan Liu, Ziyi Wang, Peiming Li, Junsong Yuan

TL;DR: 本文提出了一种名为镜头隐私密封(LPS)的硬件解决方案,通过在相机镜头上使用可调节的层压膜来物理模糊图像,实现采集前的隐私保护。同时,作者构建了P³AR数据集用于隐私保护动作识别,并提出了MSPNet框架来处理LPS导致的视频质量下降问题,该框架包含帧间噪声抑制器和跨帧语义聚合器,显著提升了动作识别精度。

Details

Motivation: 基于RGB摄像头的监控系统在公共安全和医疗保健中用于动作识别,但引发了严重的隐私担忧。现有方法依赖捕获后的算法,无法在数据采集过程中保护隐私,因此需要一种在传感器前进行物理隐私保护的方案。

Result: 在提出的P³AR数据集上,MSPNet框架(结合IFNS和CFSA)的动作识别准确率相比基线方法提升了近一倍,同时将身份识别率抑制在较低水平。与最先进的硬件方法相比,LPS实现了更优的隐私-效用权衡,并能抵抗点扩散函数反演和数据驱动恢复等重建攻击。

Insight: 主要创新点在于提出了一种低成本、物理不可逆的硬件隐私保护方案(LPS),以及配套的大规模数据集(P³AR)和专门处理退化视频的识别框架(MSPNet)。其核心思想是将隐私保护从软件后处理前移至物理采集阶段,通过随机多层散射实现强隐私,并通过对比语言-图像预训练增强语义提取鲁棒性。

Abstract: RGB camera-based surveillance systems enable human action recognition for public safety and healthcare, yet raise serious privacy concerns. Existing methods rely on post-capture algorithms, which fail to protect privacy during data acquisition. We propose Lens Privacy Sealing (LPS), a simple hardware solution that physically obscures camera lenses with adjustable laminating film, providing pre-sensor privacy protection at minimal cost. Unlike software methods or expensive engineered optics, LPS achieves strong privacy through stochastic multi-layer scattering that is physically irreversible. We introduce the P$^3$AR dataset for privacy-preserving action recognition, featuring both large-scale replay-captured (P$^3$AR-NTU, 114K videos) and real-world collected (P$^3$AR-PKU) subsets with privacy attribute annotations. To handle video degradation from LPS, we propose MSPNet, a single-stage framework incorporating Inter-Frame Noise Suppressor (IFNS) and Cross-Frame Semantic Aggregator (CFSA), enhanced by contrastive language-image pre-training for robust semantic extraction. Extensive experiments demonstrate that MSPNet with IFNS and CFSA nearly doubles action recognition accuracy compared to baseline methods while suppressing identity recognition to low levels. Comprehensive validation shows LPS achieves a superior privacy-utility trade-off compared to state-of-the-art hardware methods, resists reconstruction attacks including PSF inversion and data-driven recovery, and generalizes robustly across optical configurations and challenging environments. Code is available at https://github.com/wangzy01/MSPNet.


[68] White-Balance First, Adjust Later: Cross-Camera Color Constancy via Vision-Language Evaluation cs.CVPDF

Shuwei Li, Lei Tan, Robby T. Tan

TL;DR: 本文提出VLM-CC,一个反馈引导的框架,将颜色恒常性问题重新定义为迭代优化过程。该方法不直接从原始输入估计光源,而是利用轻量级视觉语言模型对白平衡后的图像进行评估,识别残留色偏并提供定性反馈,从而迭代更新光源估计直至收敛。

Details

Motivation: 解决基于学习的颜色恒常性模型在跨相机泛化上的挑战,这些模型容易过拟合训练相机的颜色响应特性,导致在其他相机拍摄的图像上性能下降。

Result: VLM-CC在多个数据集上实现了最先进的跨相机颜色恒常性鲁棒性,达到了SOTA水平。

Insight: 核心创新在于将颜色恒常性重构为迭代感知反馈问题,用VLM评估替代直接RGB回归,利用VLM的定性反馈引导优化过程,从而提升跨相机泛化能力。

Abstract: Color constancy aims to keep object colors consistent under varying illumination. Cross-camera generalization in color constancy remains challenging because learning-based models often overfit to the color response characteristics of the training camera, resulting in degraded performance on images captured by other cameras. We propose VLM-CC, a feedback-guided framework that formulates color constancy as an iterative refinement process. Instead of directly estimating the illuminant from raw input, VLM-CC performs iterative correction driven by vision-language model (VLM)-based evaluation. At each iteration, the image is white-balanced using the current estimate and converted to pseudo-sRGB. A lightweight LoRA-tuned VLM then assesses the corrected image, identifying the dominant residual color cast and providing qualitative feedback. This feedback is mapped to a residual illumination direction (red, green, or blue) and used to update the illuminant estimate until convergence. Our key idea is to reframe color constancy as an iterative perceptual feedback problem, leveraging VLM evaluation instead of direct RGB regression. By replacing direct RGB estimation with VLM-guided perceptual feedback, VLM-CC achieves state-of-the-art robustness in cross-camera color constancy across multiple datasets. Code will be available at https://github.com/NothingIknow/VLM-CC.


[69] Component-Aware Structure-Preserving Style Transfer for Satellite Sim2Real 6D Pose Estimation cs.CV | cs.AIPDF

Yonglong Zhang

TL;DR: 本文提出了一种面向卫星6D姿态估计的组件感知结构保持风格迁移框架,用于解决合成到真实数据转换问题。该方法通过弱配对样本构建、组件级风格编码注入以及多种约束训练,生成可用于下游监督的真实风格图像。实验表明,该方法在降低图像分布差异和提升姿态估计性能方面优于基线方法。

Details

Motivation: 解决卫星单目6D姿态估计中真实标注数据稀缺、合成数据与真实数据存在外观差距的问题,旨在通过风格迁移构建可用于下游监督的真实风格合成数据。

Result: 在5000张渲染卫星图像和100张真实图像上实验,与代表性图像翻译基线相比,取得了最低的图像分布差异(FID 54.32,KID 0.048)。使用生成数据训练GDRNet姿态估计器,在目标域适应设置下,ADD通过率提升至0.260,AUC提升至0.611。

Insight: 创新点在于组件感知的风格迁移,通过从无标签真实图像提取部件级风格码并注入对应合成区域,结合局部对比一致性、自正则化和边缘保持约束,在保持几何标注的同时实现外观转换,提升了卫星Sim2Real姿态估计性能。

Abstract: Monocular 6D pose estimation for non-cooperative satellites depends heavily on annotated training data, yet real satellite images with reliable pose labels and component-level masks are difficult to acquire at scale. Synthetic rendering can provide exact geometric annotations, but the appearance gap between rendered and real observations limits direct transfer to the real domain. This paper presents a component-aware structure-preserving style transfer framework for satellite synthetic-to-real data construction. The method builds weakly paired real–synthetic samples from calibrated real acquisition, ArUco-based camera-pose measurement, CAD rendering, and component masks. It then extracts part-wise real-domain style codes from unlabeled real images and injects them into corresponding synthetic satellite regions through mask-aligned modulation. To keep the generated images usable for downstream supervision, adversarial training is combined with local contrastive consistency, self-regularization, and edge-preserving constraints. Experiments are conducted on 5,000 rendered satellite images and 100 real images captured in a calibrated setup. The real images provide target-domain appearance references and final evaluation images, while the downstream GDRNet pose estimator is trained only on synthetic or translated synthetic images. Compared with representative image-translation baselines, the proposed method achieves the lowest image distribution discrepancy, with an FID of 54.32 and a KID of 0.048. When the translated data are used to train GDRNet in this target-domain adaptation setting, the ADD pass rate improves to 0.260 and the AUC improves to 0.611. These results indicate that component-level appearance transfer can improve satellite Sim2Real pose estimation in the considered calibrated setup while retaining simulation-derived geometric annotations.


[70] P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation cs.CV | cs.AIPDF

Kai Sheng, Liuyi Wang, Haojie Dai, Jinlong Li, Yongrui Qin

TL;DR: 本文提出了P2DNav,一个用于零样本视觉语言导航(VLN)的分层框架。该框架通过全景图到下视图(P2D)推理,将导航决策分解为全景方向选择和下视图局部定位两个阶段,并结合滑动窗口对话记忆(SDM)和反射性重定向机制(RRM)来提升长视野导航的稳定性和可靠性。

Details

Motivation: 解决现有零样本VLN方法通常依赖额外的航点预测模块,导致高层方向推理与细粒度局部定位纠缠,从而产生易错和不稳定决策的问题。

Result: 在R2R-CE基准测试上,P2DNav在零样本方法中取得了强劲性能。与基于航点的SOTA零样本方法和无航点SOTA零样本方法相比,其成功率(SR)分别提升了146.6%和58.9%。

Insight: 核心创新在于将导航决策明确分解为全景方向选择和下视图局部定位两个解耦阶段,并通过SDM和RRM分别处理历史上下文和决策可靠性评估,实现了更稳健的零样本导航。这种分层和模块化设计思路具有借鉴意义。

Abstract: Vision-and-language navigation (VLN) requires an embodied agent to ground natural-language instructions into executable navigation actions in unseen environments. Existing zero-shot methods typically rely on additional waypoint prediction modules, which often entangle high-level directional reasoning with fine-grained local grounding, leading to error-prone and unstable decisions. In this paper, we propose P2DNav, a hierarchical framework for zero-shot vision-and-language navigation. P2DNav consists of three core components: Panorama-to-Downview (P2D), Sliding-Window Dialogue Memory (SDM), and Reflective Reorientation Mechanism (RRM). P2D explicitly decomposes navigation decision-making into two stages: panoramic direction selection and downview local grounding. It first selects the instruction-relevant direction from a 360° panorama, and then predicts a pixel-level target point from the downview RGB observation in that direction. In addition, SDM organizes navigation history as a multi-turn dialogue context and maintains recent visual observations within a sliding window to support long-horizon navigation. RRM further enables reflective reorientation by assessing the reliability of local grounding based on the downview observation and returning to panoramic direction selection when necessary. Experiments on the R2R-CE benchmark show that P2DNav achieves strong performance among zero-shot methods. In particular, compared with the state-of-the-art (SOTA) zero-shot waypoint-based and waypoint-free methods, P2DNav achieves SR gains of 146.6% and 58.9%, respectively, demonstrating the effectiveness of P2D, SDM, and RRM for zero-shot VLN. Code will be released for public use.


[71] UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register cs.CVPDF

Congpei Qiu, Zhaoyu Hu, Wei Ke, Zhuotao Tian, Yanhao Wu

TL;DR: 本文提出UniRefiner,一个通用的视觉Transformer(ViT)精炼框架,旨在通过对比寄存器(contrastive register)机制,教导预训练ViT模型自我识别并处理其表征中的虚假令牌(spurious tokens),以提升模型在密集预测任务(如语义分割)中的空间感知能力。该方法仅需在少量图像上进行微调,即可显著提升包括EVA-CLIP-8B等大型基础模型在内的多种ViT的性能。

Details

Motivation: 现有大规模视觉Transformer模型在空间敏感任务(如密集预测)中的性能受到虚假令牌的阻碍。先前工作对虚假令牌的定义过于狭隘(如仅关注高范数离群点),作者认为任何未能编码位置对齐语义的令牌都应被视为虚假伪影,这揭示了一个更复杂的问题,需要系统性的解决方案。

Result: 在ADE20K数据集上,精炼后的EVA-CLIP-8B模型达到了51.9%的mIoU,相比原始模型提升了9.4%,甚至超过了DINOv2(49.1%)等专用视觉模型。零样本分割准确率提升了高达22%。实验表明该方法在多种ViT模型上均能带来一致且显著的性能提升。

Insight: 核心创新点在于对虚假令牌进行了更全面、系统的分类与诊断,并据此提出了一个通用的、基于对比寄存器的精炼框架。该框架通过双重目标(对齐图像令牌与常规令牌以保留语义,对齐寄存器令牌与虚假令牌以捕获虚假信号),使模型能够自我分离和重新分配虚假令牌,从而有效解锁了大规模基础模型潜在的空间表征能力。

Abstract: Representation learning with Vision Transformers (ViTs) has advanced rapidly, yet the utility of large-scale models in spatially sensitive tasks is hindered by spurious tokens. Prior efforts to mitigate this have been limited, often defining these artifacts narrowly, for example, as simple high-norm outliers. We argue that this scope is insufficient. For dense prediction tasks, we posit that any token failing to encode location-aligned semantics should be treated as a spurious artifact. This broader definition reveals a more complex problem, leading us to systematically categorize and characterize three fundamental types of spurious tokens that corrupt spatial representations. Based on this comprehensive diagnosis, we propose UniRefiner, a universal refinement framework that teaches pre-trained ViTs to self-dispose of these artifacts. UniRefiner uses contrastive registers to explicitly isolate and redistribute spurious tokens via a dual objective: (i) it aligns image tokens with filtered regular tokens to preserve semantics, and (ii) it aligns register tokens with detected spurious tokens to capture the spurious signals. Our method requires only a few epochs of fine-tuning on ~5k images to refine diverse ViTs, including massive models like EVA-CLIP-8B and InternViT-6B. Experiments demonstrate consistent and significant improvements: notably, the refined EVA-CLIP-8B achieves 51.9% mIoU on ADE20K (+9.4%), surpassing specialized vision models like DINOv2 (49.1%), while zero-shot segmentation accuracy improves by up to 22%. UniRefiner unlocks the latent spatial potential of existing large-scale foundation models, paving the way for their broader application.


[72] Benchmarking and Evolving Reason-Reflect-Rectify for Reflective Visual Generation cs.CVPDF

Junjie Wang, Xinghua Lou, Jason Li, Ye Tian, Keyu Chen

TL;DR: 本文针对文本到图像(T2I)模型和统一多模态模型(UMMs)在复杂提示下迭代精炼能力的不足,提出了用于反思视觉生成(RVG)的Reason-Reflect-Rectify(R^3)循环框架,并构建了包含600多个专家标注实例的R^3-Bench基准。为弥合模型识别错误与生成可执行修正指令之间的差距,作者提出了R^3-Refiner优化框架,实验表明其在多个基准上显著提升了生成质量。

Details

Motivation: 现有T2I和UMM模型依赖单次生成范式,难以处理需要迭代优化的复杂提示,因此需要一种支持多轮反思与修正的视觉生成框架。

Result: 在提出的R^3-Bench基准上,R^3-Refiner在反思判决分数和修正分数上分别实现了+12.0%和+9.0%的显著提升,并能与多种MLLM和T2I模型集成,在GenEval++和T2I-CompBench上提高了生成质量。

Insight: 核心创新在于将反思视觉生成形式化为R^3循环,并构建了专门的基准来量化迭代能力。提出的R^3-Refiner框架,通过GRPO和分层奖励机制,有效对齐了反思推理与修正指令的生成,为解决模型迭代优化问题提供了系统性的方法。

Abstract: Text-to-Image (T2I) models and Unified Multimodal Models (UMMs) have achieved remarkable progress in visual generation. However, their reliance on a single-pass generation paradigm limits their ability to handle complex prompts requiring iterative refinement. To enable multi-round Reflective Visual Generation (RVG), we formalize the Reason-Reflect-Rectify (R^3) loop as a core framework and introduce R^3-Bench, a benchmark of over 600 expert-annotated instances that quantifies iterative reasoning and rectification capabilities. Evaluation on R^3-Bench reveals a critical gap: while state-of-the-art models can identify generation errors, they fail to generate actionable rectification instructions. To bridge this gap, we propose R^3-Refiner, a dual-stage framework leveraging Group Relative Policy Optimization (GRPO) and a Hierarchical Reward Mechanism (HRM) to better align rectification with reflective reasoning. Experiments show that R^3-Refiner achieves significant improvements on R^3-Bench (+12.0% in Reflective Verdict Score, +9.0% in Rectification Score), and can be seamlessly integrated with various MLLMs to enhance the generation quality of different T2I models on GenEval++ and T2I-CompBench. Code is available at https://github.com/xiaomoguhz/R3-Bench.


[73] WBCAtt+: Fine-Grained Pixel-Level Morphological Annotations for White Blood Cell Images cs.CVPDF

Satoshi Tsutsui, Winnie Pang, Shuting He, Bihan Wen

TL;DR: 该论文提出了WBCAtt+数据集,这是一个针对白细胞图像的新型细粒度标注数据集,包含11种形态学属性和5种像素级细胞成分的密集标注,共计11.3万个图像级标签和1万个分割图。作者利用该数据集建立了属性识别和语义分割的基线模型,并设计了一个结合细胞组成结构的属性识别模型以提升性能,同时展示了数据集在可解释AI模型(如反事实示例生成)等应用中的潜力。

Details

Motivation: 现有白细胞图像数据集主要标注细胞类别,缺乏病理学家用于解释细胞判读的详细形态学特征,这限制了相关研究的深入。为了填补这一空白,作者旨在创建一个提供全面形态学注释的数据集。

Result: 论文提供了基于WBCAtt+数据集的属性识别和语义分割基线模型。通过设计一个结合细胞组成结构的属性识别模型,进一步提升了识别性能。

Insight: 主要创新在于构建了首个提供白细胞图像全面形态学标注(包括图像级属性和像素级分割)的细粒度数据集WBCAtt+。从客观角度看,该数据集的结构化属性标注(如细胞核形状、细胞质颗粒度)与像素级成分分割的结合,为开发更精细、可解释的细胞分析模型提供了关键基础,特别是在推动病理学AI模型的可解释性方面具有重要价值。

Abstract: The microscopic examination of white blood cells (WBCs) plays a fundamental role in pathology and is essential for diagnosing blood disorders such as leukemia and anemia. To support further research on WBC images, multiple datasets have been proposed. However, they mainly annotate cell categories, and lack detailed morphological characteristics that pathologists use to explain their interpretations of cells. To address this gap, we introduce WBCAtt+, a novel dataset of WBC images densely annotated with 11 morphological attributes and five pixel-level cell components. With 113k image-level labels and 10k segmentation maps, WBCAtt+ is the first to provide comprehensive annotations for WBC images. Leveraging this dataset, we provide baseline models for attribute recognition and semantic segmentation. We also design an attribute recognition model to incorporate compositional structure of cells, further improving the recognition performance. Lastly, we showcase various applications enabled by our dataset, such as explainable AI models, including counterfactual example generation. \revision{The dataset and code are publicly available\footnote{https://doi.org/10.57967/hf/8143}}.


[74] Tango3D: Towards Alignment for Global and Local 2D-3D Correspondence cs.CVPDF

Zebin He, Mingxin Yang, Shuhui Yang, Hanxiao Sun, Xintong Han

TL;DR: 本文提出了Tango3D,一个旨在统一密集对应和全局检索的3D基础模型。它通过一个几何感知的2D视觉主干网络和一个预训练的3D VAE,将图像编码为2D块、点云编码为3D令牌,并映射到一个共享空间中,以实现局部像素到点的对齐和全局语义对齐。模型采用三阶段渐进式训练策略来稳定联合学习。

Details

Motivation: 现有的3D基础模型通常将点云对齐到如CLIP等冻结的视觉-语言空间,通过将3D形状压缩为全局向量来实现跨模态检索,但这种仅全局对齐的方法无法建立细粒度的像素到点对应关系。

Result: 实验表明,该模型成功实现了物体级别的像素到点对齐,同时保持了具有竞争力的全局检索能力,这是现有3D基础模型所不具备的联合能力。

Insight: 核心创新在于提出了一个统一的模型架构和训练策略,以同时实现全局语义对齐和局部密集对应。通过建立细粒度的对齐特征空间,为纯几何的3D令牌注入了丰富的语义信息,为广泛的密集3D下游任务铺平了道路。

Abstract: Existing 3D foundation models typically align point clouds to frozen vision-language spaces like CLIP, which achieve strong cross-modal retrieval by compressing 3D shape into a global vector. However, this global-only alignment cannot establish fine-grained pixel-to-point correspondence. To solve this, we present Tango3D, a foundation model that unifies dense correspondence and global retrieval. We use a geometry-aware 2D visual backbone and a pretrained 3D VAE to encode images into 2D patches and point clouds into 3D tokens. These are mapped into a single shared space to achieve both local pixel-to-point alignment and global semantic alignment. To stabilize the joint learning of dense and global objectives, we introduce a three-stage progressive training strategy. Experiments show our model successfully achieves object-level pixel-to-point alignment while maintaining competitive global retrieval, a joint capability not offered by existing 3D foundation models. By establishing a fine-grained alignment feature space, Tango3D injects rich semantics into purely geometric 3D tokens, paving the way for a wide range of dense 3D downstream tasks.


[75] Efficient Long-Context Modeling in Diffusion Language Models via Block Approximate Sparse Attention cs.CVPDF

Wenhu Zhang, Yiming Wu, Huanyu Wang, Yaoyang Liu, Huanzhang Dou

TL;DR: 本文提出了块近似稀疏注意力框架(BA-Att),用于在扩散语言模型中实现高效的长上下文建模。该方法通过块级预下采样操作在紧凑的下采样空间中识别信息区域,避免了依赖脆弱的先验位置模式,并引入了轻量级的范数排序模块和协方差补偿校正来降低计算复杂度。

Details

Motivation: 扩散语言模型(DLMs)在生成长序列文本时具有全局连贯性、双向性和可控性等优势,但扩展到超长序列的计算成本仍然很高。现有的块稀疏注意力方法通常基于固定的采样模式(如尾部区域或反对角条纹)在高分辨率注意力空间中选择块,这种先验驱动的采样可能会遗漏重要标记,并在分布偏移下引入不稳定性。

Result: 大量实验表明,该算子在注意力计算上相比FlashAttention实现了高达6.95倍的加速,并且在语言模型、多模态语言模型和视频生成模型中,在50%的稀疏度下保持了接近全注意力的性能,展现了强大的效率和泛化能力。

Insight: 主要创新点在于提出了不依赖固定位置先验的块近似稀疏注意力框架,通过预下采样策略和理论分析(定义了oracle后下采样注意力图并形式化了近似误差),并设计了轻量级的范数排序和协方差补偿校正模块来高效近似完整的协方差,从而在保持性能的同时显著提升长序列建模效率。

Abstract: Diffusion Language Models (DLMs) enable globally coherent, bidirectional, and controllable text generation, offering advantages over traditional autoregressive LLMs, while scaling to ultra-long sequences remains costly. Many existing block-sparse attention methods select blocks by fixed sampling patterns over the high-resolution attention space, such as tail regions or anti-diagonal stripes. Such prior-driven sampling can miss salient tokens and introduce instability under distribution shifts. In this paper, we propose the Block Approximate Sparse Attention framework (BA-Att) with block-wise pre-downsampled operation, which identifies informative regions within a compact downsampled space, avoiding reliance on brittle positional priors. To analyze its theoretical behavior, we define an oracle post-downsample attention map and formalize the approximation error between pre- and post-downsample schemes. Based on this insight, we introduce a lightweight norm-sorting module and a covariance-compensated correction that approximates full covariance using diagonal QK variances, reducing computational complexity. Extensive experiments show that our operator achieves up to 6.95x acceleration over FlashAttention in attention computation, and maintains near full-attention performance at 50% sparsity across language models, multimodal language models, and video generation models, demonstrating strong efficiency and generalization.


[76] Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls cs.CVPDF

Abdul Mohaimen Al Radi, Kunyang Li, Yuzhang Shang, Mubarak Shah, Yu Tian

TL;DR: 本文提出了Aero-World方法,将预训练的图生视频扩散模型转化为可控的空中视频生成器,通过注入平移加速度和角速度序列来生成符合细粒度惯性动作的无人机视频。该方法还引入了AeroBench基准,用于评估生成视频与低层动作信号的一致性,并展示了在动作对齐和视频质量方面的显著提升。

Details

Motivation: 现有基础视频模型主要基于自然语言训练,难以生成符合低层控制信号(如惯性动作)的空中视频,而无人机在6自由度空间中的运动对动作精度要求高,微小误差会导致轨迹漂移。生成可控的空中视频可为无人机智能体的训练和评估提供可扩展的代理数据。

Result: 在AeroBench基准上,Aero-World将平均动作对齐分数(AAS)从57.7提升至63.6,相比仅使用动作微调的方法;与AirScape相比,Aero-World在质量-控制权衡上表现更优,具有更低的FVD(596.5 vs. 1058.6)、更高的SSIM(0.595 vs. 0.505)和更高的光流-IMU相关性(0.44 vs. 0.20)。

Insight: 创新点包括:通过动作令牌流将惯性控制信号注入预训练的潜在扩散Transformer;使用冻结的潜在空间物理探针(Physics Probe)提供可微分的惯性一致性监督,避免昂贵的视频解码;提出AeroBench基准和动作对齐分数(AAS)、物理一致性率(PCR)等评估指标。从客观角度看,该方法将物理约束与生成模型结合,为具身AI中的可控视频生成提供了新思路。

Abstract: Foundation video models produce visually impressive results, but their use in embodied AI remains limited because they are primarily trained on natural language rather than low-level control signals. This limitation is especially pronounced for aerial flight, where motion occurs in unconstrained 6-DoF space and small errors in ego-motion can produce large trajectory drift. Generating aerial videos that follow fine-grained inertial actions can support scalable training and evaluation of aerial agents by providing a controllable proxy for real-world or expensive simulation data. To address this problem, we propose \textbf{Aero-World}, a method for converting a pretrained image-to-video diffusion model into a controllable aerial video generator. Aero-World injects sequences of translational acceleration and angular velocity into a pretrained latent diffusion transformer through an action-token stream. A frozen latent-space Physics Probe, trained independently on real video–IMU pairs, provides differentiable inertial-consistency supervision during LoRA finetuning while avoiding computationally expensive video decoding. We further propose \textbf{AeroBench}, a benchmark for evaluating whether generated drone videos adhere to low-level action signals. AeroBench uses Action Alignment Score (AAS) to measure agreement with commanded inertial actions and Physical Consistency Rate (PCR) to measure temporal motion stability. On AeroBench, Aero-World improves mean AAS from 57.7 to 63.6 over action-only finetuning and gives a stronger quality-control trade-off than AirScape, with lower FVD (596.5 vs. 1058.6), higher SSIM (0.595 vs. 0.505), and higher Flow-IMU correlation (0.44 vs. 0.20). These results suggest that frozen Physics Probe supervision is a practical mechanism for adapting pretrained video generators toward more action-aligned aerial motion.


[77] FlowErase-RL: Rethinking Concept Erasure as Reward Optimization in Flow Matching Models cs.CVPDF

Yi Sun, Zhiqi Zhang, Xinhao Zhong, Yimin Zhou, Shuoyang Sun

TL;DR: 本文提出了FlowErase-RL,一个基于GRPO(推测为Group Relative Policy Optimization或其变体)的框架,用于解决流匹配模型中的概念擦除问题。该方法将概念擦除重新定义为奖励优化问题,并引入动态双路径奖励机制,联合优化概念擦除奖励以抑制目标概念,以及非目标空间奖励以保持生成保真度。实验表明,该方法在多种概念擦除任务上实现了最先进的性能,同时保持了图像质量和语义对齐,并能有效抵抗对抗攻击及扩展到多概念场景。

Details

Motivation: 流匹配模型在文本到图像生成质量上取得显著进展,但也带来了生成有害或不期望内容的安全风险。现有的概念擦除方法要么是推理时干预效果有限,要么依赖于需要精确对齐数据且难以扩展和多概念设置的监督微调(SFT)。

Result: 在裸露、物体和艺术风格擦除任务上的大量实验表明,该方法实现了最先进的擦除性能,同时保持了强大的图像质量和语义对齐。此外,它对对抗攻击表现出稳健的抵抗力,并能有效地扩展到多概念场景。

Insight: 主要创新点是将概念擦除重新定义为奖励优化问题,并引入了动态双路径奖励机制(概念擦除奖励和非目标空间奖励),通过性能驱动的切换策略在训练中自适应平衡,实现了无需显式监督的稳定优化。这为流匹配模型中的安全可控生成建立了一个新范式。

Abstract: Recent advances in flow matching models have significantly improved text-to-image generation quality, but also introduce growing safety risks due to the generation of harmful or undesirable content. Existing concept erasure methods are either inference-time interventions with limited effectiveness or rely on supervised fine-tuning (SFT), which requires precisely aligned data and struggles with scalability and multi-concept settings. In this paper, we propose \emph{FlowErase-RL}, the first GRPO-based framework for concept erasure in flow matching models. We reformulate concept erasure as a reward optimization problem and introduce a \textbf{dynamic dual-path reward mechanism} that jointly optimizes (i) a Concept Erasure (CE) reward to suppress target concepts and (ii) a Non-target Space (NS) reward to preserve generative fidelity. The two reward paths are adaptively balanced during training via a performance-driven switching strategy, enabling stable optimization without explicit supervision. Extensive experiments on nudity, object, and artistic style erasure demonstrate that our method achieves state-of-the-art erasure performance while maintaining strong image quality and semantic alignment. Moreover, it exhibits robust resistance to adversarial attacks and scales effectively to multi-concept scenarios. Our results establish a new paradigm for safe and controllable generation in flow matching models.


[78] GeoMamba: A Geometry-driven MambaVision Framework and Dataset for Fine-grained Optical-SAR Object Retrieval cs.CVPDF

Tiantong Fang, Xiuwei Wang, Jing Xiao, Wujie Zhou, Liang Liao

TL;DR: 本文提出GeoMamba,一个针对光学与合成孔径雷达(SAR)图像细粒度目标检索的几何驱动框架。该框架通过几何特征注入(GFI)模块和几何一致性约束(GCC)模块增强跨模态特征交互与结构一致性学习,并构建了新的未对齐细粒度数据集FGOS-as进行验证。实验表明,GeoMamba在跨模态检索任务上取得了优于现有方法的性能。

Details

Motivation: 解决多源遥感(特别是光学与SAR)在未对齐条件下的细粒度目标检索难题,传统方法依赖配对或空间对齐样本,而实际应用中存在显著的模态差异、斑点噪声和结构不一致性,限制了鲁棒的跨模态表示学习。

Result: 在新建的数据集FGOS-as(包含11个航空航天和海事类别)上进行了广泛实验,GeoMamba在所有对所有的检索设置中达到了63.3%的mAP和77.0%的Rank-1准确率,超越了现有方法。

Insight: 创新点包括几何特征注入(GFI)模块以增强跨模态交互和引入结构先验,以及几何一致性约束(GCC)模块结合深度监督(DS)策略施加分层几何约束以保留目标结构;从客观角度看,其将几何先验系统性地融入跨模态学习框架,并构建了针对未对齐现实场景的细粒度数据集,对遥感跨模态检索具有重要价值。

Abstract: Multi-source remote sensing enables complementary observation of ground objects, while cross-modal fine-grained object retrieval remains challenging, especially under unaligned optical and SAR conditions. Unlike conventional retrieval settings that rely on paired or spatially aligned samples, practical optical-SAR retrieval is affected by substantial modality discrepancy, speckle noise, and structural inconsistency, which limit robust cross-modal representation learning. To address this problem, we propose GeoMamba, a geometry-driven framework tailored for optical-SAR fine-grained retrieval. Specifically, GeoMamba introduces a Geometric Feature Injection (GFI) module that enhances cross-modal feature interaction and incorporates structural priors, thereby improving the robustness of SAR representations and promoting geometry-consistent feature learning. In addition, a Geometric Consistency Constraint (GCC) module, together with a Deep Supervision (DS) strategy, imposes hierarchical geometric constraints using classical operators, which helps preserve informative object structures during representation learning. We further construct a new dataset, FGOS-as, containing 11 aerospace and maritime categories for evaluating unaligned cross-modal fine-grained object retrieval in realistic remote sensing scenarios. Extensive experiments on FGOS-as demonstrate that GeoMamba outperforms existing methods, achieving 63.3% mAP and 77.0% Rank-1 accuracy in all-to-all retrieval setting.


[79] Preferences Order, Ratings Anchor: From Fused Expert Aesthetic Ground Truth to Self-Distillation cs.CVPDF

Yuanpei Zhao, Jie Lin, Chao Zhang, Yilin Wang, Mao Li

TL;DR: 该论文提出了PPaint基准数据集,通过匹配的双协议(成对偏好和点级评分)收集专家对中国画的美学标注,揭示了两种协议的互补性:偏好提供更一致的序数排名,评分锚定绝对分数尺度。基于融合的专家标注真值,作者进一步提出PSDistill方法,将视觉语言模型的成对判断转化为校准的伪分数,并通过置信度加权的排序优化训练单次推理的美学评分器。蒸馏后的Qwen3-VL-8B模型在跨类别评估中显著提升性能,达到与闭源模型相当的水平。

Details

Motivation: 现有图像美学评估基准通常只采用成对偏好或点级评分中的一种协议,未能充分利用两者的互补优势。论文旨在通过构建匹配的双协议标注数据集,系统性地探索两种协议的协同作用,并基于此开发高效的单次推理美学评分模型。

Result: 在PPaint数据集上,融合两种协议信号构建的专家标注真值具有高度一致性。蒸馏后的Qwen3-VL-8B模型在三个绘画类别上的平均SRCC从0.504提升至0.709,超越了所有开源基线模型(包括专用美学模型ArtiMuse),并与闭源模型Gemini-3.1-Pro的差距在0.04 SRCC以内,且单次推理成本更低。跨域迁移能力在APDDv2数据集上得到进一步验证。

Insight: 创新点在于构建了首个匹配双协议标注的美学评估基准,揭示了偏好与评分协议的互补性;并提出基于Elo参考池和置信度加权排序优化的自蒸馏框架,将视觉语言模型的成对比较能力高效转化为单次评分能力,实现了高性能与低推理成本的平衡。

Abstract: Pairwise preferences and pointwise ratings are the two dominant annotation protocols in image aesthetic assessment (IAA), yet existing benchmarks adopt only one, leaving their complementarity unmeasured under controlled conditions. We introduce PPaint, a matched dual-protocol benchmark in which 15 domain experts, 5 per category, annotate 150 Chinese paintings under both protocols across five aesthetic dimensions, collecting 45,900 pairwise expert judgments through a locally dense preference design alongside the matched ratings. The matched design reveals complementary strengths: preferences yield more consistent ordinal rankings, while ratings anchor the absolute score scale. Fusing both signals via two independent preference-to-score methods yields a fused expert ground truth on which the two constructions converge to nearly identical scores. The same preference-to-score principle extends to label-free VLM training. PSDistill converts VLM pairwise judgments into calibrated pseudo-scores via an Elo reference pool, and trains the same VLM with confidence-weighted ranking optimization to produce a single-pass aesthetic scorer. Trained on a single painting category, the distilled Qwen3-VL-8B improves mean SRCC from 0.504 to 0.709 across all three categories, outperforming all open-source baselines including the dedicated aesthetic model ArtiMuse and matching closed-source Gemini-3.1-Pro within 0.04 SRCC at single-pass inference cost, with cross-domain transfer further validated on APDDv2. We will release the full PPaint dataset and training code.


[80] Real-World On-Vehicle Evaluation of Embedding-Based Anomaly Detection cs.CVPDF

Albert Schotschneider, Daniel Bogdoll, Svetlana Pavlitska, Ahmed Abouelazm, Johann Marius Zoellner

TL;DR: 本文提出了一种基于预训练视觉Transformer嵌入的实时异常检测方法,通过潜在语义特征空间中的最近邻相似性检测交通场景中的异常,无需显式监督或数据集特定训练,适用于真实世界部署。

Details

Motivation: 解决自动驾驶中交通场景异常检测的挑战,现有方法依赖抽象语义类别定义正常性,难以适应多样化的真实世界场景。

Result: 在Road Anomaly基准测试中表现良好,并在真实世界自动驾驶车辆上展示了定性一致性,成功突出不同场景中的语义异常对象。

Insight: 创新点在于利用基础模型嵌入进行基于参考图像的异常检测,通过补丁级处理和密集异常掩码实现定位,避免了复杂训练,为现实操作条件提供了简单有效的异常信号。

Abstract: Detecting anomalies in traffic scenes is crucial for ensuring safety in autonomous driving, yet collecting representative anomalous data remains challenging. Existing anomaly detection methods are highly specialized and rely on normality as defined by the abstract semantic Cityscapes classes, making it difficult to adapt to diverse real-world scenarios. We propose an adaptable real-time anomaly detection method that leverages foundation models in the form of pretrained vision transformer embeddings to detect deviations via nearest-neighbor similarity in the latent semantic feature space. Based on patch-wise processing, the algorithm produces dense anomaly masks, allowing for the localization of detected anomalies. The method robustly models normality through a single reference image. This formulation avoids explicit supervision and dataset-specific training, making it suitable for real-world deployment. We evaluate the method on standard benchmarks and on an automated vehicle in real-world scenarios. Despite its simplicity, the method achieves good performance on the Road Anomaly benchmark and demonstrates consistent qualitative behavior in practice, successfully highlighting semantically unusual objects in diverse scenes. These results suggest that simple, reference-based methods can provide useful anomaly signals under realistic operating conditions.


[81] Mechanisms of Object Localization in Vision-Language Models cs.CVPDF

Timothy Schaumlöffel, Martina G. Vilas, Gemma Roig

TL;DR: 本文研究了视觉语言模型(VLMs)中的物体定位机制,重点关注LLaVA-1.5和InternVL-3.5两个代表性模型系列。通过使用token消融、注意力剔除和因果中介分析等机制可解释性工具,发现定位主要由一种容器化机制驱动:物体对齐的token定义物体的空间范围,而边界内token的语义排列对预测框基本无关。

Details

Motivation: 视觉语言模型在连接视觉和文本信息方面非常有效,但在基本分类和定位任务上常常表现不佳。尽管分类机制已得到较多研究,但支持物体定位的过程仍知之甚少。

Result: 研究发现,只有极少数注意力头中介了分类和定位的因果效应,这些头集中在LLaVA的早中期层和InternVL的中后期层。两个任务共享一些早期处理,但最终依赖于基本不同的专门化注意力头。

Insight: 论文首次提供了VLMs中定位的层和头级别解释,揭示了狭窄的计算路径,这可以指导未来的模型设计和接地目标。创新点在于通过机制可解释性方法揭示了定位的容器化机制和任务特定的注意力头分布。

Abstract: Visually-grounded language models (VLMs) are highly effective in linking visual and textual information, yet they often struggle with basic classification and localization tasks. While classification mechanisms have been studied more extensively, the processes that support object localization remain poorly understood. In this work, we investigate two representative families, LLaVA-1.5 and InternVL-3.5, using a suite of mechanistic interpretability tools, including token ablations, attention knockout, and causal mediation analysis. We find that localization is driven by a containerization mechanism in which object-aligned tokens define the spatial extent of the object, while the semantic arrangement of tokens within those boundaries is largely irrelevant to the predicted box. Only a very small set of attention heads mediates the causal effect for both classification and localization, concentrating in early-mid layers for LLaVA and mid-late layers for InternVL. The two tasks share some early processing but ultimately depend on largely distinct specialized heads. Overall, we provide the first layer- and head-level account of localization in VLMs, revealing narrow computational pathways that can guide future model design and grounding objectives.


[82] Fast 4D Mesh Generation by Spatio-Temporal Attention Chains cs.CVPDF

Dvir Samuel, Yuval Atzmon, Gal Chechik, Yoni Kasten

TL;DR: 本文提出了一种无需训练、基于时空注意力链的快速4D网格生成方法,通过利用4D主干网络中早期出现的时序对应关系,在潜在空间传播信息,从而加速动态3D结构从视频中的恢复过程。

Details

Motivation: 现有4D网格生成方法速度慢、计算成本高且难以扩展到长序列,本文旨在解决这些问题,提升时序对应质量并加速生成过程。

Result: 在SOTA对比中,该方法仅需9秒生成4D网格,实现了13倍加速,同时生成更高质量结果,并能扩展到长达16倍的视频而不降低网格质量;在2D目标跟踪和4D跟踪等下游任务上实现了有竞争力的零样本性能。

Insight: 创新点在于观察到时序对应关系在网格视觉准确前就已出现在4D主干网络中,并设计了时空注意力链框架,通过潜在空间映射和注意力机制避免昂贵的显式匹配,从而提升动态网格几何和时序一致性,还支持了先前方法不具备的可靠相机估计能力。

Abstract: 4D mesh generation has recently emerged as a powerful paradigm for recovering dynamic 3D structure from videos, but existing methods remain slow, computationally expensive, and difficult to scale to longer sequences. We introduce a training-free approach that accelerates 4D mesh generation while improving temporal correspondence quality. Our key observation is that temporal correspondences emerge inside a 4D backbone long before its generated meshes become visually accurate. We exploit this with a general framework we call Spatio-Temporal Attention Chain which propagates information across space and time. Starting from vertices on an anchor mesh, the chain maps vertices to latent tokens. It then follows temporal correspondences in latent space, and recovers frame-specific vertices through latent-to-vertex attention. This design avoids expensive explicit matching while preserving anchor mesh details and thereby improving dynamic mesh geometry and temporal consistency. Compared to state-of-the-art, our method generates a 4D mesh in 9 seconds, achieving a $13\times$ speedup while producing higher-quality results. Moreover, our approach scales to videos up to $16\times$ longer without degrading mesh quality. Beyond generation, the improved correspondences enable competitive zero-shot performance on two downstream tasks: 2D object tracking and 4D tracking. We further show that our framework enables reliable camera estimation, a capability not supported by prior 4D mesh generation methods.


[83] LaCoVL-FER: Landmark-Guided Contrastive Learning Network with Vision-Language Enhancement for Facial Expression Recognition cs.CVPDF

Jiaxin Wang, Muwei Jian, Hui Yu, Junyu Dong, Yifan Xia

TL;DR: 本文提出了一种名为LaCoVL-FER的新方法,用于解决野外环境下因姿态、遮挡和光照变化而导致的面部表情识别难题。该方法通过结合面部关键点提供的几何先验和视觉-语言模型提供的语义先验,设计了地标引导自适应编码器和视觉-语言增强策略,以生成更具表达相关性和鲁棒性的特征表示。

Details

Motivation: 现有基于注意力的方法主要依赖视觉外观线索,存在注意力冗余和不稳定的问题,在复杂场景下性能受限。本文旨在通过整合几何与语义先验,提升模型在不受控环境下的鲁棒性和泛化能力。

Result: 在RAF-DB、FERPlus和AffectNet三个具有代表性的真实世界FER数据集上,LaCoVL-FER的定量和定性实验均表明其性能超越了现有最先进(SOTA)方法。

Insight: 创新点在于将面部关键点几何信息与CLIP模型的视觉-语言语义信息进行自适应融合与对齐。具体包括:设计双分支门控交叉注意力机制进行特征融合,以及利用表达相关特征来细化视觉特征并调节文本提示,从而生成实例感知的表示,增强了模型的判别能力。

Abstract: Facial Expression Recognition (FER) in the wild is still challenging due to uncontrolled variations in pose, occlusion, and illumination. Most existing attention-based methods primarily rely on visual appearance cues, suffering from attention redundancy and instability, which limits their performance in complex scenarios. To address these issues, we propose a novel landmark-guided contrastive learning network with vision-language enhancement for FER (LaCoVL-FER), which integrates geometric priors from facial landmarks and semantic priors from a vision-language model. Specifically, a Landmark-Guided Adaptive Encoder (LGAE) is designed to introduce geometric priors through a Bi-branch Gated Cross Attention (BGCA) mechanism, which achieves adaptive fusion of landmark-based geometric and visual appearance features to produce expression-relevant features, thereby focusing on key facial regions and suppressing noise interference. In parallel, a Vision-Language Enhancement Strategy (VLES) is presented to leverage the expression-relevant features to refine the generalizable visual features extracted by the frozen pretrained CLIP image encoder, yielding expression-specific visual representations. Based on these representations, an Expression-Conditioned Prompting (ECP) mechanism is utilized to further adapt the textual features of fixed class-level prompts from the frozen pretrained CLIP text encoder, generating more instance-aware textual representations. These visual-textual representations are aligned as semantic priors to enhance the robustness and generalization of FER. Quantitative and qualitative experiments demonstrate that our LaCoVL-FER outperforms state-of-the-art methods on three representative real-world FER datasets, including RAF-DB, FERPlus, and AffectNet. The code is available at https://github.com/ylin06804/LaCoVL-FER.


[84] FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding cs.CV | cs.AI | cs.CLPDF

Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Hung-Ting Su, Winston H. Hsu

TL;DR: 本文提出了FineBench,一个专注于细粒度人类活动理解的人本视频问答基准,包含199,420个多选题对,覆盖64个长视频。评估发现现有开源视觉语言模型表现不佳,特别是在空间推理和细微动作区分上。为此,作者提出了FineAgent框架,通过定位器和描述器模块来增强模型,实验表明其能有效提升多种开源模型在FineBench上的性能。

Details

Motivation: 现有视觉语言模型在通用视频理解上表现卓越,但在需要细微解释人类动作和交互的现实应用中,其细粒度理解能力不足。当前的人本基准缺乏结合长视频、密集问答覆盖和帧级时空定位的综合性评估。

Result: 在FineBench上的评估显示,GPT-5等专有模型表现尚可,但当前开源视觉语言模型显著落后,尤其在多人场景的空间推理和区分人类动作/交互的细微差异方面。FineAgent框架能一致地提升多种开源视觉语言模型在FineBench上的性能。

Insight: 论文的创新点在于构建了一个结合长视频、密集问答和帧级时空定位的细粒度人本视频理解基准FineBench,并提出了模块化的FineAgent框架来针对性增强模型的定位和描述能力,为未来研究提供了严格的测试平台和实用的增强方法。

Abstract: Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human-centric video question answering (VQA) benchmark specifically designed to assess fine-grained understanding. FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (15 minutes each), focusing on detailed person movement, person interaction, and object manipulation, including compositional actions. Our extensive evaluation reveals that while proprietary models like GPT-5 achieve respectable performance, current open-source VLMs significantly underperform, struggling particularly with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions. To address these identified weaknesses, we propose FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor. Experiments show that FineAgent consistently improves the performance of various open VLMs on FineBench. FineBench provides a rigorous testbed for future research into fine-grained human-centric video understanding, while FineAgent offers a practical approach to enhance such reasoning in current VLMs.


[85] Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models cs.CVPDF

Hengfei Wang, Anshul Gupta, Pierre Vuillecard, Jean-Marc Odobez

TL;DR: 本文提出了EyeVLM框架,用于系统评估视觉语言模型在理解人类注视行为方面的能力,重点关注注视跟随和社交注视预测两个核心任务。研究通过零样本和微调两种方式测试了多种先进VLMs,并与纯视觉模型进行对比,发现当前VLMs在精确理解注视方面仍存在不足。

Details

Motivation: 视觉语言模型在多模态推理方面发展迅速,但它们在理解人类注视和注意力这一需要结合物理场景、活动及社交上下文进行推理的核心任务上的可靠性尚未被充分探索。

Result: 研究基于现有注视理解数据集进行系统评估,结果显示当前VLMs缺乏精确的注视理解能力;虽然任务特定的微调训练有助于缩小与纯视觉模型的差距,但仍需显著改进。

Insight: 创新点在于提出了一个系统性的评估框架,将注视理解任务分解为几何视觉处理(注视跟随)和社交关系推理(社交注视预测)两个互补维度,并探索了不同提示策略以及模型与数据规模的影响,为VLMs在人类行为理解领域的应用提供了基准分析。

Abstract: Vision-language models (VLMs) have rapidly evolved into general-purpose multimodal reasoners with strong zero-shot generalization. In this context, VLMs could greatly benefit the analysis of human gaze and attention, a central task in human behavior understanding that requires reasoning about the physical scene as well as the activity, interactions, and social context. However, the extent to which VLMs can reliably understand human gaze and related attentional behaviors remains largely unexplored. In this work, we present EyeVLM, a systematic evaluation framework for gaze understanding in VLMs across two complementary dimensions: tasks and models. To assess gaze understanding capabilities, we focus on two core tasks. The first, gaze following, i.e., predicting the 2D location where a person is looking, has a geometric and visual processing focus, requiring a precise understanding of the human face, attention direction, 3D scene structure, and spatial grounding of attended targets. The second, social gaze prediction, requires social and relational reasoning over multi-person interactions (e.g., mutual gaze and shared attention), and may benefit more from the LLM semantic reasoning capabilities within VLMs. Regarding models, EyeVLM evaluates these tasks in two ways: a zero-shot setting with a diverse set of state-of-the-art open- and closed-source VLMs, exploring different prompting strategies; and a fine-tuning approach based on task-specific QA pairs, studying the impact of model scale and data scale. As benchmarks, we rely on existing gaze understanding datasets and perform a systematic comparison with state-of-the-art purely visual models. Overall, our results show that current VLMs lack precise gaze understanding capabilities. While standard training helps reduce the gap with visual models, significant improvements are still needed.


[86] Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models cs.CV | cs.AI | cs.CLPDF

Guangzhi Xiong, Qiao Jin, Sanchit Sinha, Zhiyong Lu, Aidong Zhang

TL;DR: 本文针对大型视觉语言模型在医学影像(如胸片)推理中视觉归因方法不可靠的问题,提出了一个因果评估框架来验证归因方法的有效性,并开发了MedFocus方法,通过基于概念的不平衡最优传输和针对性干预来定位临床相关解剖区域,显著提升了归因的准确性。

Details

Motivation: 大型视觉语言模型在医学应用中缺乏将回答忠实于视觉证据的能力,这引发了临床可信度的担忧;现有视觉归因方法是否真正反映了模型决策所依据的视觉证据尚未得到验证。

Result: 在11种归因方法、6个开源LVLM和两种输出模式(直接答案和逐步推理)的评估中,现有方法常无法识别LVLM使用的证据;提出的MedFocus方法在空间、概念和令牌级别归因上显著优于先前方法。

Insight: 创新点包括开发因果评估框架来验证归因方法的可靠性,以及提出基于概念的不平衡最优传输的MedFocus方法,通过针对性干预测量解剖区域对模型输出的因果效应,从而提升医学LVLM的可信归因。

Abstract: Large Vision Language Models (LVLMs) show promise in medical applications, but their inability to faithfully ground responses in visual evidence raises serious concerns about clinical trustworthiness. While visual attribution methods are widely used to explain LVLM predictions, whether these explanations actually reflect the visual evidence underlying the model’s decision is largely unverified, since ground-truth annotations for internal model reasoning are typically unavailable. We address this question for chest X-ray (CXR) reasoning by developing a causal evaluation framework that retains only CXR-VQA samples for which the expert-annotated region is verified, via counterfactual editing, to be causally responsible for the model’s prediction. Using this framework across 11 attribution methods, six open-source LVLMs, and two output modes (direct answer and step-by-step reasoning), we find that existing attribution methods often fail to identify the evidence used by LVLMs. To address this failure, we propose MedFocus, a concept-based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions. MedFocus produces spatial, concept-level, and token-level attributions and substantially outperforms prior methods, taking a step toward more trustworthy attribution for medical LVLMs. Our data and code are available at https://github.com/gzxiong/medfocus/.


[87] When Preference Labels Fall Short: Aligning Diffusion Models from Real Data cs.CVPDF

Weiyan Chen, Weijian Deng, Yao Xiao, Weijie Tu, ZiYi Dong

TL;DR: 本文研究利用真实数据作为监督信号进行扩散模型偏好对齐的方法,提出了一种以真实图像为参考点、通过对比生成或扰动样本来构建偏好信号的策略,无需人工标注的偏好对。实验表明该方法能有效指导扩散模型对齐,性能与现有基于偏好的方法相当。

Details

Motivation: 现有偏好对齐方法主要依赖模型生成图像构建的偏好对,这种相对监督在样本均存在伪影或视觉质量有限时可能模糊,难以推断真正理想的输出,因此探索真实数据能否作为替代监督源。

Result: 基于真实数据的监督在扩散模型对齐中提供有效指导,在实验中达到与现有偏好对齐方法相当的性能水平。

Insight: 创新点在于提出以真实数据为中心的监督策略,通过对比真实图像与生成/扰动样本自动构建偏好信号,无需人工标注,为偏好对齐提供了实用且互补的监督源,并启发了标签高效的模型对齐方向。

Abstract: Preference alignment aims to guide generative models by learning from comparisons between preferred and non-preferred samples. In practice, most existing approaches rely on preference pairs constructed from model-generated images. Such supervision is inherently relative and can be ambiguous when both samples exhibit artifacts or limited visual quality, making it difficult to infer what constitutes a truly desirable output. In this work, we investigate whether real data can serve as an alternative source of supervision for preference alignment. We adopt a data-centric perspective and study a curation strategy that treats real images as reference points and constructs preference signals by contrasting them with generated or perturbed samples, without requiring manually annotated preference pairs. Through empirical analysis, we show that real-data-based supervision provides effective guidance for aligning diffusion models and achieves performance comparable to existing preference-based methods. Our results suggest that real data offers a practical and complementary source of supervision for preference alignment and highlight directions of label-efficient alignment strategies. Code and models are available at https://cwyxx.github.io/RealAlign.


[88] Structured Layout Priors for Robust Out-of-Distribution Visual Document Understanding cs.CVPDF

Peter El Hachem, Ahmed Nassar, A. Said Gurbuz, Christoph Auer, Peter W. J. Staar

TL;DR: 该论文提出了一种名为结构化布局先验的方法,用于提升视觉语言模型在训练数据中未见的布局上的文档理解鲁棒性。该方法通过一个轻量级RT-DETR检测器预先解析文档布局结构,并将其序列化后注入到模型提示中,从而缓解了模型解码过程中的两阶段瓶颈问题。

Details

Motivation: 现有视觉语言模型在端到端解析文档时,对于训练中未见的布局结构经常失效。作者将此归因于一个两阶段瓶颈:解码器在提取内容前,必须先对布局实体进行分类和定位,而第一阶段失败会导致第二阶段内容提取崩溃。

Result: 在包含10k页的结构性分布外基准测试中,Markdown F1分数从0.37提升至0.92;在OmniDocBench中文子集的表格TEDS分数从0.01提升至0.36;在26k页的ViDoRe V3基准测试中,无限循环解码失败率在所有测试的工业领域均下降。该方法仅带来15%的额外延迟和平均74个提示词的开销。

Insight: 创新点在于将布局检测结果序列化并注入到与解码器共享的生成空间中,同时保留完整页面图像作为后备,而非裁剪页面或使用纯文本先验。注意力分析表明,解码器在生成结构时关注注入的布局标记,在生成内容时关注图像块,验证了两阶段瓶颈的缓解。

Abstract: Vision-Language Models (VLMs) parse documents end-to-end but frequently break down on layouts unlike those seen in training. We attribute this to a two-hop bottleneck: before the decoder can extract content (Hop 2), it must first classify and localize the enclosing layout entity (Hop 1), and when the first hop fails the second collapses into omissions, malformed structure, or autoregressive repetition. We pre-resolve Hop 1 outside the decoder by running a lightweight RT-DETR detector, serializing its outputs in the parser’s native DocTags vocabulary, and injecting them into the prompt alongside the full page image. Unlike analyze-then-parse approaches that crop the page, or prior prompt-level priors written in plain text, our prior shares the decoder’s generation space and leaves the global image in view as a fallback when detections are noisy. On a 10k-page structural out-of-distribution benchmark, markdown F1 rises from $0.37$ to $0.92$; on the Chinese subset of OmniDocBench, table TEDS rises from $0.01$ to $0.36$; and on the 26k-page ViDoRe V3 benchmark, infinite-loop decoding failures drop across every industrial domain tested. These gains cost $15%$ wall-clock latency and a median of $74$ prompt tokens, with no architectural change to the base VLM. An attention-level analysis further reveals a bimodal phase shift in which the decoder attends to injected layout tokens when emitting structure and to image patches when emitting content, consistent with the two-hop bottleneck being alleviated. Model weights will be released to support reproducibility.


[89] Passive Construction Site Safety Monitoring via Persona-Scaffolded Adversarial Chain-of-Thought VLM Verification cs.CV | cs.AIPDF

Ananth Sriram, Neel Mokaria, Rajveer Singh

TL;DR: 本文提出了一种用于被动式建筑工地安全监控的三阶段流水线系统,该系统处理来自穿戴式和固定摄像头的视频。它首先使用微调的YOLO11进行个人防护装备和危险检测,然后通过SAM 3进行分割细化和工人去重,最后利用Qwen3-VL-8B-Instruct模型,结合一种基于角色构建的对抗性思维链协议,进行合规性验证和幻觉控制。

Details

Motivation: 建筑行业是美国最致命的行业,现有监控方法成本高、需要人工实时操作或覆盖范围窄。本文旨在开发一种被动的、在班次结束后进行的安全监控系统,以更经济、自动化的方式预防事故。

Result: 在非正式的12视频Ironsite开发语料库的三作者评审中,第三阶段的提示设计(专业角色背景故事)相比单次提示,带来了12%的精确度提升,在易产生幻觉的违规类别上提升最大。系统还能将违规映射到OSHA标准,并基于姿态关键点进行人体工程学风险评分。

Insight: 主要创新在于第三阶段的提示设计:采用基于方法派演员框架的专业角色背景故事来构建对抗性思维链协议,并通过结构化的消息隔离来强制生成器、判别器和协调器之间的观察独立性,以编码关于人工观察与自动检测可靠性的先验知识,从而有效控制幻觉并提升验证精度。

Abstract: Construction remains the deadliest industry sector in the United States, with 1,055 fatal worker injuries recorded in 2023, and the majority preventable. Existing monitoring approaches are expensive, require real-time human operators, or address only a narrow subset of violations. This paper presents a passive, end-of-shift construction safety monitoring pipeline processing video from POV body-worn and fixed wall-mounted cameras through a three-stage architecture: (1) fine-tuned YOLO11 for primary PPE and hazard detection, (2) SAM 3 for segmentation refinement and worker deduplication, and (3) Qwen3-VL-8B-Instruct with a method-prompted, persona-scaffolded three-pass adversarial chain-of-thought protocol for compliance verification and hallucination control. The principal contribution is the Stage 3 prompt design: professional persona backstories following the method-actor framing drive an observed 12% precision improvement over single-pass prompting in an informal three-author review of the 12-video Ironsite development corpus, with the largest gains on hallucination-prone violation categories. Structural message isolation enforces observational independence between a generator, discriminator, and reconciliation pass governed by asymmetric rules encoding priors about human observation versus automated detection reliability. The system maps violations to OSHA standards, performs REBA-inspired ergonomic risk scoring from pose keypoints, and produces per-worker safety reports with timestamped evidence. An evaluation harness is released for future reproduction.


[90] GoTTA be Diverse: Rethinking Memory Policies for Test-Time Adaptation cs.CVPDF

Shyma Alhuwaider, Yasmeen Alsaedy, Merey Ramazanova, Silvio Giancola, Bernard Ghanem

TL;DR: 本文系统研究了测试时适应(TTA)中的内存策略,指出现有方法通常将内存机制与特定适应算法耦合评估,难以分离其独立影响。研究发现,内存管理的关键在于维持样本的类内多样性,而不仅仅是保留最近或类别平衡的样本。基于此,作者提出了GOTTA(Guided Observational Test-Time Adaptation)系列多样性感知内存策略,该策略结合了类别平衡分配与特征空间多样性,可作为即插即用模块替换现有缓冲区。

Details

Motivation: 现有TTA研究主要关注适应目标,而内存策略作为选择驱动适应测试样本的关键组件,其设计选择的重要性及其适用条件缺乏系统性评估。本文旨在将内存策略与适应算法解耦,在统一条件下评估不同内存设计在各种测试流(i.i.d.、非i.i.d.、持续流、实际流)下的表现。

Result: 在图像损坏基准测试和视频流设置中,多样性感知内存(GOTTA)在内存预算受限和具有挑战性的非i.i.d.流下,能最显著地提升适应性能。随着内存容量增加,该方法仍保持竞争力。

Insight: 创新点在于将内存管理确立为鲁棒测试时适应的首要组件,并识别出多样性(特别是类内多样性)是实用TTA的核心原则。GOTTA策略通过结合类别平衡与特征空间多样性,有效避免了冗余缓冲区,并在时间相关和标签偏斜的流中保持了具有代表性的适应信号,其模块化设计使其可与不同的TTA目标配对使用。

Abstract: Test-time adaptation (TTA) enables a pre-trained model to adapt online to an unlabeled test stream under distribution shift. While most TTA research focuses on the adaptation objective, practical streams also depend critically on the memory used to select which test samples drive adaptation. Existing memory mechanisms are usually evaluated as components of specific TTA algorithms, making it difficult to isolate which memory design choices matter and when they matter. In this work, we provide a systematic benchmark that decouples memory from the adaptation algorithm and evaluates memory policies under unified conditions across i.i.d., non-i.i.d., continual, and practical test streams. Our study shows that effective memory management requires more than retaining recent or class-balanced samples. In particular, intra-class diversity is a key factor for avoiding redundant buffers and maintaining representative adaptation signals under temporally correlated and label-skewed streams. Motivated by this finding, we introduce Guided Observational Test-Time Adaptation (GOTTA), a family of diversity-aware memory policies that combine class-balanced allocation with feature-space diversity. GOTTA memories act as drop-in replacements for existing buffers and can be paired with different TTA objectives. Across corruption benchmarks and video-stream settings, diversity-aware memory improves adaptation most clearly under constrained memory budgets and challenging non-i.i.d. streams, while remaining competitive as memory capacity increases. These results highlight memory management as a first-class component of robust test-time adaptation and identify diversity as a central principle for practical TTA.


[91] Feed-Forward Gaussian Splatting from Sparse Aerial Views cs.CVPDF

Dongli Wu, Zhuoxiao Li, Tongyan Hua, Yinrui Ren, Xiaobao Wei

TL;DR: 本文提出AnyCity框架,用于从稀疏航拍视角重建大规模城市场景。该方法通过预测观测支持的几何潜在表示来锚定可靠结构,并利用脚手架条件化的航拍补全令牌对弱约束内容进行门控残差更新,最终通过单次前馈过程生成3D高斯场景。

Details

Motivation: 稀疏航拍视角重建存在证据不平衡问题(如屋顶区域重复观测而立面、遮挡结构缺乏多视角支持),现有前馈3D高斯溅射方法直接回归确定性表示会导致重影、立面融化和纹理拉伸,而生成式方法又缺乏观测几何与先验驱动内容的清晰分离,导致结构不一致。

Result: 在合成、航拍域、无人机纹理和真实场景的实验表明,该方法在单次前馈推理中实现了连贯的城市新视角合成,相比前馈基线模型取得一致改进,推理速度达到秒级。

Insight: 创新点包括:观测接地的生成式重建框架,通过观测支持的几何潜在表示与脚手架条件化令牌分离可靠结构与弱约束内容;训练中采用稠密到稀疏蒸馏传递结构线索,并结合航拍适应的视频扩散先验提供细粒度外观线索;通过观测保持目标确保细化表示与输入几何一致。

Abstract: Reconstructing large-scale urban scenes from sparse aerial views is a crucial yet challenging task. Due to biased top-down and shallow-oblique camera poses, sparse aerial captures exhibit strong evidence imbalance: roofs and open regions are repeatedly observed, while facades, distant buildings, and occluded structures receive little multi-view support. Existing feed-forward 3D Gaussian Splatting methods directly regress a deterministic representation from sparse inputs, but this often leads to ghosting, melted facades, and stretched textures. Recent pseudo-view and video-based generative reconstruction methods use additional supervision or generative priors. However, they often lack a clear separation between observed geometry and prior-driven content, which can lead to plausible but inconsistent structures. We propose AnyCity, an observation-grounded generative reconstruction framework for sparse aerial urban scenes. AnyCity first predicts an observation-supported geometry latent to anchor reliable structures, and then uses scaffold-conditioned aerial completion tokens to predict a gated residual update for weakly constrained content before Gaussian decoding. During training, dense-to-sparse distillation transfers structural cues from dense-view reconstruction, while an aerial-adapted video diffusion prior provides fine-grained urban appearance cues through gated token conditioning. Observation-preserving objectives keep the refined representation consistent with input-supported geometry. At inference time, AnyCity reconstructs the final 3D Gaussian scene from sparse aerial views in a single feed-forward pass, achieving coherent urban novel-view synthesis with second-level inference. Experiments on synthetic, aerial-domain, UAV-textured, and real-world scenes show consistent improvements over feed-forward baselines.


[92] Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models cs.CV | cs.AIPDF

Yi Zhong, Haotong Qin, Xindong Zhang, Lei Zhang, Guolei Sun

TL;DR: 本文提出了一种名为SplitQ的低比特后训练量化框架,旨在解决大型视觉-语言模型(VLMs)量化过程中因文本和视觉模态激活分布异质性导致的精度下降问题。该框架通过模态特定异常通道解耦模块隔离关键异常通道,并利用自适应跨模态校准模块动态减少量化误差,从而在多种低比特设置下显著提升量化模型的性能。

Details

Motivation: 现有后训练量化方法在部署视觉-语言模型时,由于文本和视觉模态的激活分布存在异质性,常导致模型精度显著下降。作者发现这种跨模态异质性在通道间分布不均,少数通道包含大部分模态特定的异常值,且这些异常值通常位于不同模态的不同通道中。

Result: 在6个流行的多模态数据集上,SplitQ在W4A8、W4A4、W3A3和W3A2等多种量化设置下均显著优于现有方法。特别是在具有挑战性的W3A3设置下,SplitQ保持了FP16性能的93.5%(69.5 vs. 74.3),达到了当前最先进的效率水平。

Insight: 创新点在于提出了模态特定异常通道解耦模块,以低开销有效隔离关键异常通道,以及自适应跨模态校准模块,通过双轻量可学习分支动态校准量化误差。从客观角度看,该方法通过分析并针对性处理跨模态异质性的通道级分布特性,为多模态模型的高效量化提供了新的思路。

Abstract: Low-bit post-training quantization (PTQ) is a pivotal technique for deploying Vision-Language Models (VLMs) on resource-constrained devices. However, existing PTQ methods often degrade VLMs’ accuracy due to the heterogeneous activation distributions of text and vision modalities during quantization. We find that this cross-modal heterogeneity is distributed unevenly across channels: a small subset of channels contains most modality-specific outliers, and these outliers typically reside in different channels for each modality. Motivated by this, we propose SplitQ, a channel-Splitting-driven post-training Quantization framework. At its core, SplitQ introduces a novel Modality-specific Outlier Channel Decoupling (MOCD) module that effectively isolates salient modality-specific outlier channels with minimal overhead. To further address the remaining cross-modal distribution discrepancies, we design an Adaptive Cross-Modal Calibration (ACC) module that employs dual lightweight learnable branches to dynamically mitigate modality-induced quantization errors. Extensive experiments on popular VLMs demonstrate that SplitQ significantly outperforms existing approaches across 6 popular multi-modal datasets under all evaluated quantization settings, including W4A8, W4A4, W3A3, and W3A2. Notably, SplitQ preserves 93.5% of FP16 performance under the challenging W3A3 setting (69.5 vs. 74.3), pushing the efficiency frontier for deploying advanced VLMs. Our code is available at https://github.com/EMVision-NK/SplitQ


[93] AffectVerse: Emotional World Models for Multimodal Affective Computing cs.CVPDF

Bo Zhao, Fanghua Ye, Yixin Ji, Sicheng Zhao, Xiaojiang Peng

TL;DR: 本文提出AffectVerse模型,基于Qwen2.5-Omni,通过引入情感世界模块(EWM)来建模情感动态。EWM包含跨模态时序想象、模态感知多步注意力信念聚合和信念注入三个子模块,利用对未来表征的预测作为自监督信号,增强模型对情感状态演变的推理能力。

Details

Motivation: 现有MLLM在情感识别中通常对完整的视听文本输入进行静态融合,而人类情感推理依赖于整合观察到的多模态线索与对情感状态如何展开的预期,因此需要显式建模情感动态。

Result: 在九个基准测试上,AffectVerse至少比其他模型提升了2.57%。消融实验表明,时序想象、跨模态展开和信念聚合均带来了增益。

Insight: 创新点在于提出了一个无动作的、表征层面的短时程潜在情感预测模块(EWM),将未来预测作为过去条件下的自监督信号,迫使当前信念状态编码可预测后续情感变化的过渡线索,为情感计算提供了一种实用的预测性信念状态建模方法。

Abstract: Humans infer emotions by integrating observed multimodal cues with expectations about how affective states may unfold. Existing multimodal large language models (MLLMs), however, often treat emotion recognition as static fusion over complete audiovisual-text inputs, leaving affective dynamics implicit. We propose AffectVerse, a Qwen2.5-Omni-based model equipped with an Emotion World Module (EWM), an action-free representation-level module for short-horizon latent affective prediction. \rev{EWM contains three modules: 1) Cross-Modal Temporal Imagination predicts future video/audio representations from past tokens with multi-step rollout. 2) MAMA(Modality-Aware Multi-step Attention) Belief Aggregation compresses imagined tokens into modality-aware belief tokens. 3) Belief Injection inserts these belief tokens into the LLM for affective reasoning.} AffectVerse uses future prediction as a past-conditioned self-supervised signal: it does not replace modeling observed history or require unseen signals at inference, but forces the current belief state to encode transition cues that are predictive of subsequent affective change. Across nine benchmarks, AffectVerse improves at least 2.57% over other models, while controlled ablations show additive gains from temporal imagination, cross-modal rollout, and belief aggregation. These results suggest predictive belief-state modeling is a practical alternative for affective computing.


[94] StruMPL: Multi-task Dense Regression under Disjoint Partial Supervision and MNAR Labels cs.CV | cs.AI | cs.LGPDF

Reza M. Asiyabi, Juan Alberto Molina-Valero, The SEOSAW Partnership, Steven Hancock, Casey M. Ryan

TL;DR: StruMPL是一种用于处理多任务密集回归问题的新方法,专门解决在异质、不完整且非随机缺失标签下的森林地上生物量估算问题。该方法通过共享编码器、回归头、插补头、倾向得分头以及一个可学习的物理模块,联合优化结构变量预测和生物量估算,并利用增强逆概率加权伪结果进行训练,以纠正空间偏差。

Details

Motivation: 动机源于从地球观测数据估算森林地上生物量时面临的挑战:星载激光雷达提供大量结构数据但无生物量标签,地面样地提供生物量数据但存在空间偏差且无结构指标,标签不完整且非随机缺失,同时生物量与结构变量之间存在已知的异速生长约束。

Result: 在两个生态不同的生物群落上,StruMPL在AGB的均方根误差和偏差方面优于消融变体和最接近的已发表方法,分层分析显示AIPW将高生物量区域的偏差降低了约54%。

Insight: 创新点在于将问题形式化为具有MNAR标签和任务间物理约束的异质不完整部分监督下的多任务密集回归,并提出了联合优化框架,其中可学习的物理模块和结合倾向得分与插补基线的增强逆概率加权损失设计,对于在保持损失有界的同时恢复无偏估计至关重要。

Abstract: Estimating forest aboveground biomass (AGB) from Earth observation combines two structurally incompatible label sources: spaceborne lidar provides canopy structure at millions of locations but no biomass estimate, and ground-based plots provide biomass at thousands of biased locations but no metrics of structure. No single training sample carries labels for all target variables, plot labels are missing not at random (MNAR), and biomass is linked to the structural variables by known but biome-specific allometric laws. We formalise this as multi-task dense regression under heterogeneous disjoint partial supervision with MNAR labels and inter-task physical constraints, and propose StruMPL to address it jointly. A shared encoder feeds per-variable regression, imputation, and propensity heads for spatial MNAR correction, and a learnable physics module that evaluates the inter-task constraint on the model’s own predictions at every pixel. The supervised loss uses an Augmented IPW (AIPW) pseudo-outcome with stop-gradients on the propensity and on the imputation baseline; we show analytically and empirically that both are necessary for joint optimisation to recover IPW-weighted stationary points while keeping the loss bounded. On two ecologically distinct biomes, StruMPL outperforms ablation variants and the closest published method on AGB RMSE and bias, with a stratified analysis showing AIPW reduces high-AGB bias by ~54%.


[95] World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks cs.CV | cs.AI | cs.ROPDF

Zuyao Lin, Jianhui Zhang, Peidong Jia, Xiaoguang Zhao, Shanghang Zhang

TL;DR: 本文提出了一种名为世界-自我建模(World-Ego Modeling)的新范式,用于解决具身智能中长时域混合任务(如导航与操作交织)的预测问题。该方法将未来演化分解为世界和自我两个组件,并实例化为一个统一的世界-自我模型(WEM),该模型结合了隐式分离的规划器和级联并行专家混合扩散生成器。为了严格评估,作者构建了首个专注于长时域混合导航-操作任务的世界建模基准HTEWorld。

Details

Motivation: 现有具身智能中的世界模型通常在一个单一流中预测世界和自我(机器人)的演化,导致世界与自我动态纠缠,这在长时域混合任务(如导航与操作交织)中性能会下降。本文旨在解决这种纠缠问题。

Result: 在作者新构建的HTEWorld基准(包含12.5万个视频片段和300条多轮评估轨迹)上,WEM模型取得了最先进的性能。同时,在现有的纯操作任务基准上,WEM也保持了竞争力。

Insight: 核心创新点在于提出了世界-自我建模的概念范式,从运动、语义和意图三个视角定义了世界与自我的边界,并分析了三种解耦策略。其实例化模型WEM通过隐式分离规划器和CP-MoE扩散生成器的耦合,实现了对长时域混合任务演化的有效建模。新基准HTEWorld的构建也为该领域提供了重要的评估工具。

Abstract: World models are widely explored in embodied intelligence, yet they typically predict distinct evolutions of the world and the ego within a single stream, where the world captures persistent instruction-agnostic scene regularities and the ego captures robot-centric instruction-conditioned dynamics. This world-ego entanglement leads to a degradation in long-horizon embodied scenarios, particularly in hybrid tasks with interleaved navigation and manipulation behaviors. In this paper, we introduce \emph{World-Ego Modeling}, a new conceptual paradigm that decomposes future evolution into world and ego components. We define the world-ego boundary from three perspectives, i.e., motion-, semantic-, and intention-based views, and analyze three disentanglement strategies with post-, pre-, and full disentanglement. Further, we instantiate this paradigm as the World-Ego Model (WEM), a unified embodied world model that couples an implicit separate world-ego planner with a cascade-parallel mixture-of-experts (CP-MoE) diffusion generator. To enable rigorous evaluation, we further construct HTEWorld, the first benchmark for long-horizon world modeling with hybrid navigation-manipulation tasks, providing 125K video clips (over 4.5M frames) with fine-grained action annotations and 300 multi-turn evaluation trajectories (over 2K instructions). Extensive experiments show that WEM achieves state-of-the-art performance on HTEWorld while remaining competitive on existing manipulation-only benchmarks.


[96] InterLight: Leveraging Intrinsic Illumination Priors for Low-Light Image Enhancement cs.CVPDF

Ziqi Wang, Xu Zhang, Laibin Chang, Shi Chen, Jiaqi Ma

TL;DR: 本文提出InterLight框架,通过系统性地挖掘和利用内在光照先验来解决低光图像增强问题。该方法结合物理引导的数据增强、自适应提示和亮度门控内在记忆机制,在多个基准测试中取得了优异的增强效果。

Details

Motivation: 现有基于深度学习的Retinex方法在低光图像增强中常出现过增强或颜色失真,且通常假设噪声均匀或光照理想,无法有效处理复杂光照场景。

Result: 在多个基准测试上的广泛实验表明,该方法能有效提升图像纹理清晰度和视觉一致性,实现了先进的增强性能。

Insight: 创新点在于构建了光照感知的增强流程,通过物理引导增强注入传感器级光照响应先验,并利用自适应提示和亮度门控内在记忆机制选择性补偿信息损失,同时通过自监督一致性目标正则化整个流程。

Abstract: Low-Light Image Enhancement (LLIE) has long been a challenging problem in low-level vision, as insufficient illumination often leads to low contrast, detail loss, and noise. Recent studies show that deep learning-based Retinex theory can effectively decouple illumination and reflectance. However, existing methods frequently suffer from over-enhancement or color distortion, and often assume uniform noise or ideal lighting. To address these limitations, we propose InterLight, a novel framework that systematically excavates and operationalizes intrinsic illumination priors for LLIE.Our core insight is that robust enhancement requires not just estimating illumination, but constructing an illumination-aware pipeline. We first inject sensor-level illumination-response priors via physics-guided augmentation, then represent the degradation through adaptive prompts conditioned on the scene’s latent illumination state. This explicit representation directly guides a luminance-gated intrinsic memory mechanism to selectively compensate for information loss, prioritizing reconstruction in dark regions while preserving fidelity in bright ones. Finally, the entire process is regularized by a self-supervised consistency objective that distills illumination-invariant features. By deeply exploiting intrinsic illumination priors, our method achieves clearer textures and more visually coherent enhancement results. Extensive experiments across multiple benchmarks demonstrate the effectiveness of our approach. Code is available at: https://github.com/House-yuyu/InterLight.


[97] RECIPE: Procedural Planning via Grounding in Instructional Video cs.CVPDF

Luigi Seminara, Antonino Furnari, Lorenzo Torresani

TL;DR: RECIPE提出了一种基于强化学习的程序性规划方法,通过利用大规模教学视频中的嘈杂ASR转录本作为验证器而非标签源,来生成多轨迹的步骤序列。该方法采用GRPO框架,以时序对齐质量作为奖励,支持文本历史(Socratic)和视频直接输入(Video)两种配置,并在有标注和弱监督场景下均有效。

Details

Motivation: 视觉规划任务需要根据部分视频上下文和目标生成剩余步骤,但现有标注数据规模小、领域窄且仅编码单一执行轨迹,而大规模教学视频库虽提供丰富内容,但直接使用其嘈杂ASR伪标签进行监督微调会传播分割和对齐错误。论文利用验证生成步骤序列是否在ASR转录本中时序对齐的廉价性,来解决从嘈杂视频中提取干净步骤标签的困难。

Result: 在7个程序性基准测试中,使用基于参考的LLM-as-judge协议评估6个程序性标准,RECIPE-RL在所有规模(0.5B、3B、7B)和每个基准上均优于基础检查点,域内宏观准确率提升+7到+8个百分点,零样本场景下最高提升+16个百分点。它在有标注和伪标签规划上均优于监督微调(后者会降低基础性能),并在无人工标注时保持鲁棒性;在Visual Planning for Assistance和COIN基准上,它改进了最强零样本基线,并保持了监督微调所丧失的生成多样性。

Insight: 创新点在于利用时序对齐验证的廉价性,将嘈杂视频库转化为强化学习的奖励信号(通过GRPO框架),避免了直接使用伪标签导致的错误传播,从而支持多轨迹规划和零样本泛化。客观来看,该方法通过不对称设计(提取难但验证易),有效利用大规模弱监督数据,提升了规划的准确性和多样性,且框架统一适用于不同输入配置和监督机制。

Abstract: Visual planning asks a model to generate the remaining steps of a procedure in natural language given a partial video context and a goal. Progress on this task is bottlenecked by annotation: clean labeled datasets are small, domain-narrow, and encode a single execution trajectory per example, even though many valid orderings exist. Large-scale instructional video corpora offer orders of magnitude more procedural content, but supervised fine-tuning on pseudo-labels from their noisy ASR narrations propagates segmentation and alignment errors and stays single-trajectory. We identify a key asymmetry: extracting clean step labels from noisy video is hard, but verifying whether a generated step sequence is temporally grounded in ASR transcripts is cheap and scales to millions of videos via precomputed text embeddings. We exploit this asymmetry in RECIPE, which uses grounding quality as a reward for GRPO, turning the noisy corpus into a verifier rather than a label source. The framework applies uniformly to two planner input configurations (Socratic, with a textual history extracted by a frozen VLM, and Video, consuming video tokens directly) and to annotated and weakly supervised regimes. We evaluate on 7 procedural benchmarks using a reference-based LLM-as-judge protocol scoring plans across 6 procedural criteria. RECIPE-RL improves over the base checkpoint at all scales (0.5B, 3B, 7B) and every benchmark, with macro-accuracy gains of +7 to +8 points in-domain and up to +16 points zero-shot. It outperforms supervised fine-tuning on both annotated and pseudo-labeled plans (the latter degrades the base) and remains robust without human annotations. Used as the proposal stage of a prior propose-assess-search planner, it improves over the strongest zero-shot baseline at every horizon on Visual Planning for Assistance, and on COIN it preserves the generation diversity that SFT collapses.


[98] Towards Fine-Grained Robustness: Attention-Guided Test-Time Prompt Tuning for Vision-Language Models cs.CVPDF

Jia-Wei Hai, Yijun Wang, Xiu-Shen Wei

TL;DR: 本文提出了一种名为注意力引导测试时提示调优(A-TPT)的方法,旨在提升视觉语言模型(如CLIP)在细粒度场景下对对抗性攻击的鲁棒性。该方法通过改进梯度注意力展开机制来识别在对抗攻击下存活的语义区域,并利用这些区域指导空间变化的增强强度和测试时多视图集成,以进行提示调优和推理。

Details

Motivation: 现有视觉语言模型的测试时适应方法通常依赖多视图增强,但在细粒度场景下难以准确识别语义信息,且容易破坏判别性区域,导致在面对对抗性攻击时性能显著下降。

Result: 大量实验表明,A-TPT在对抗性和干净数据上的表现均优于现有的测试时适应方法。

Insight: 创新点在于将梯度注意力机制与测试时提示调优相结合,通过语义保留的空间变化增强来引导模型适应,这为提升模型在对抗性环境下的细粒度鲁棒性提供了一种新思路。

Abstract: Vision-Language Models (VLMs), such as CLIP, have achieved significant zero-shot performance on downstream tasks with various fine-tuning adaptation methods. However, recent studies have proven that adversarial attacks can significantly degrade the inference ability of VLMs, posing substantial risks to their practical applications. Prevalent test-time adaptation methods typically rely on multi-view augmentation to implement various fine-tuning strategies, which struggle to identify semantic information and are prone to destroying discriminative regions in fine-grained scenarios. To address these limitations, we propose Attention-Guided Test-Time Prompt Tuning (A-TPT), a semantics-preserving method designed for test-time adaptation. We first refine the gradient attention rollout mechanism to identify semantically meaningful regions surviving under adversarial attacks. Furthermore, we leverage them to guide the spatially varying augmentation intensities and multi-view ensemble for prompt tuning and inference. Extensive experiments demonstrate that A-TPT outperforms existing test-time adaptation methods on both adversarial and clean data. Codes are available at https://github.com/SEU-VIPGroup/A-TPT .


[99] CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition cs.CVPDF

Hongji Yang, Songlian Li, Yucheng Zhou, Xiaotong Zhao, Alan Zhao

TL;DR: 本文提出了CogOmniControl,一个推理驱动的可控视频生成框架,旨在解决现有扩散模型在抽象、稀疏或复杂条件下(如故事板草图、粘土渲染)生成视频时与用户创作意图对齐不佳的问题。该框架将可控视频生成分解为创作意图认知和生成两个阶段,通过专门的CogVLM模型从稀疏条件中准确认知用户意图并生成密集推理输出,再通过CogOmniDiT模型统一控制并生成视频,最后利用CogVLM的引导能力进行最佳视频选择,形成一个闭环的’缰绳式’架构。

Details

Motivation: 现有扩散模型在照片级真实感和流畅度上表现良好,但在处理抽象、稀疏或复杂的专业制作条件(如故事板草图)时表现脆弱,难以与用户的创作意图对齐。现有方法要么通过适配器注入条件,要么在扩散主干中耦合通用视觉语言模型,存在能力差距。

Result: 在基于专业工作流数据构建的CogReasonBench和CogControlBench两个基准测试上进行的实验表明,CogOmniControl超越了现有的开源模型。

Insight: 核心创新点在于将可控视频生成分解为意图认知和生成两个阶段,并引入专门的CogVLM进行专业、清晰的意图推理。通过强化学习将推理输出与生成模型对齐,并利用CogVLM引导视频评估和最佳选择,形成了一个闭环的’缰绳式’架构,提升了与复杂创作意图的对齐能力。

Abstract: Recent diffusion models achieve strong photorealism and fluency in video generation, yet remain fragile under abstract, sparse or complex conditions, leading to poor performance in professional production workflows such as storyboard sketches and clay render conditions. Existing video generation models, either inject conditions through adapters or couple a generic vision-language model (VLM) within a diffusion backbone, leaving a capability gap and failing to produce the videos that align with the user’s creative intent. We present CogOmniControl, a reasoning-driven framework that factorizes controllable video generation into creative intent cognition and generation. Specifically, we train a specialized CogVLM using authentic anime production data. Compared to generic VLMs, it generates more professional and clear outputs, accurately cognizing user creative intent from sparse and abstract conditions and tuning these cues into dense reasoning output. Besides, CogOmniDiT unifies the controls from various conditions through in-context generation and is aligned to the CogVLM reasoning outputs via reinforcement learning. Furthermore, leveraging CogVLM’s robust capability in guiding video generation, we release its potential in planning specific evaluators and enable a Best-of-N selection for the generated videos. This integration transforms the entire framework into a closed-loop “harness-like” architecture. We further introduce CogReasonBench and CogControlBench, built from professional workflows data that carry genuine creative intent rather than simulated ones. Experiments on two benchmarks show that CogOmniControl surpassed the existing open-source models. The project website: https://um-lab.github.io/CogOmniControl/


[100] A Nash Equilibrium Framework For Training-Free Multimodal Step Verification cs.CV | cs.GTPDF

Rohit Sinha, Kunal Tilaganji, Tanuja Ganu, Nagarajan Natarajan, Amit Sharma

TL;DR: 本文提出了一种无需训练的多模态步骤验证框架,将推理步骤验证建模为专家评委间的协调问题,通过纳什均衡形式化评委交互,利用一致性与分歧信号来识别有效推理步骤。该方法通过闭式解计算均衡分数,实现基于分歧感知的过滤和稳定性排序。

Details

Motivation: 现有多模态大语言模型生成的推理链常包含细微错误,导致答案错误;现有验证方法存在局限:学习型批评器需要大量标注数据且跨任务性能不一致,而无训练方法简单平均不同来源分数,忽略了分歧本身对步骤有效性的重要指示作用。

Result: 在六个基准测试上评估,该方法相比基线模型实现了2.4%至5.2%的持续提升,并与学习型批评器表现相当,表明跨模态一致性(而非平均置信度)无需任务特定适配即可提供鲁棒的验证信号。

Insight: 创新点在于将步骤验证形式化为纳什均衡博弈,利用评委间的一致与分歧作为验证信号;客观来看,该方法通过博弈论框架捕捉评分分歧中的信息,实现了无需训练、任务通用的稳定验证机制。

Abstract: Multimodal large language models often generate reasoning chains containing subtle errors that lead to incorrect answers. Current verification approaches have notable limitations. Learned critics need extensive labeled data and show inconsistent performance across different tasks. Meanwhile, existing training-free methods simply average scores from different sources, missing a key insight: when these scores disagree, that disagreement itself carries important information about whether a reasoning step is truly valid or not. We propose a training-free verification approach that treats step-wise verification as a coordination problem among specialized judges. We formalize these judges’ interaction as a Nash equilibrium game where agreement signals valid steps while disagreement reveals instability. Our method computes equilibrium scores through a closed-form solution, enabling both disagreement-aware filtering and stability-conscious ranking of reasoning steps. Evaluated across six benchmarks, our approach achieves consistent improvements of 2.4% to 5.2% over baseline models and shows competitive performance against learned critics, demonstrating that cross-modal agreement (not just average confidence) provides robust verification signals without task-specific adaptation.


[101] VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving cs.CV | cs.AIPDF

Zhefan Xu, Ghassen Jerfel, Marina Haliem, Qi Zhao, Jeonhyung Kang

TL;DR: 该论文提出了VL-DPO框架,利用视觉语言模型作为零样本推理器,从预训练模型的轨迹中自动生成偏好对,并通过直接偏好优化来微调自动驾驶运动预测模型,使其更好地与人类驾驶偏好对齐。

Details

Motivation: 标准模仿学习目标可能无法完全捕捉人类驾驶偏好的复杂细微差别,而视觉语言模型展现出强大的推理和常识理解能力,因此研究如何利用VLM来对齐自动驾驶模型与人类偏好。

Result: 在Waymo Open End-to-End Driving Dataset上微调模型,使用评分者反馈分数和平均位移误差进行评估。最终模型VL-DPO相比预训练模型,RFS提升了11.94%,ADE降低了10.01%。

Insight: 创新点在于将VLM作为零样本推理器自动生成偏好数据,并首次将直接偏好优化应用于自动驾驶运动预测领域,以数据驱动的方式实现模型与人类偏好的对齐。

Abstract: The rapid growth of autonomous driving datasets has enabled the scaling of powerful motion forecasting models. While large-scale pretraining provides strong performance, the standard imitation objective may not fully capture the complex nuances of human driving preferences. Meanwhile, recent advances in vision-language models (VLMs) have demonstrated impressive reasoning and commonsense understanding. Building on these capabilities, this paper presents VL-DPO, a vision-language-guided framework that aligns ego-vehicle motion forecasting models with human preferences. Our approach leverages a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model’s rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO). We finetune our models on the Waymo Open End-to-End Driving Dataset (WOD-E2E) and evaluate performance against held-out human preference annotations using rater feedback score (RFS) and average displacement error (ADE). Our experiments confirm that the VLM’s trajectory selection is a high-quality proxy for human preference. Our final model, VL-DPO, yields an 11.94% increase in RFS and a 10.01% reduction in ADE over the pretrained model.


[102] Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation cs.CVPDF

Yifan Li, Xinyu Zhou, Yunhao Ge, Yu Kong

TL;DR: 本文提出了空间提示视觉轨迹预测(SP-VTP)这一新任务形式化,旨在通过初始空间提示(如边界框或点)来定义机器人操作任务的目标,并预测未来末端执行器的轨迹。为此,作者收集并标注了EgoSPT数据集,并提出了SPOT模型来解决该问题。

Details

Motivation: 在杂乱环境中,通过语言指令或任务标识符指定机器人操作可能不够精确,而空间指示(指示移动什么物体以及放置到哪里)能更好地处理相似物体。本文旨在解决以视觉为中心的对象和目标指定挑战。

Result: 在严格的场景级划分下进行实验,结果表明,SPOT模型在跨场景轨迹预测上优于无提示或单源提示的基线方法。

Insight: 创新点在于首次形式化了SP-VTP任务,将静态的空间提示与动态演变的场景配置相结合。SPOT模型结合了任务编码器、观察编码器和轨迹生成器,为以自我为中心的操纵提供了一种简单且可扩展的任务条件。

Abstract: Robotic manipulation is often specified through language instructions or task identifiers, yet cluttered environments with similar objects are better handled by spatially indicating what to move and where to place it. Addressing the vision-centric challenge of object and goal specification, we present, to the best of our knowledge, the first formalization of Spatially Prompted Visual Trajectory Prediction (SP-VTP). This novel setting utilizes initial spatial prompts (like bounding boxes or points) to define task objectives, tasking the model with forecasting future end-effector trajectories from egocentric streams. To study this problem, we collect and annotate EgoSPT, a dataset of egocentric spatially prompted manipulation trajectories with first-frame object and target grounding annotations and recovered 3D end-effector motion. SP-VTP is challenging because the task specification is static, while the scene configuration evolves over time. To solve this problem, we propose SPOT(Spatially Prompted Object-Target Policy), which combines a task encoder for first-frame visual and coordinate spatial prompts, an observation encoder for current visual and history context, and a trajectory generator for future end-effector motion. Experiments under strict scene-level splits show that SPOT improves cross-scene trajectory prediction over non-prompted or single-source prompted baselines. Together, EgoSPT and SPOT establish a new spatial prompting problem SP-VTP, as a simple and scalable task condition for egocentric manipulation.


[103] Stage-adaptive Token Selection for Efficient Omni-modal LLMs cs.CVPDF

Zijie Xin, Jie Yang, Ruixiang Zhao, Tianyi Wang, Fengyun Rao

TL;DR: 本文提出了一种名为SEATS的训练无关、阶段自适应的令牌选择方法,用于提升全模态大语言模型(om-LLMs)的推理效率。该方法通过分析模型层间的令牌依赖关系,在不同阶段(LLM前、LLM内部、后期层)采用不同的策略(如基于注意力权重的多样性选择、跨块渐进式剪枝、动态预算分配)来动态移除冗余的视觉和音频令牌,从而显著减少计算开销。

Details

Motivation: 全模态大语言模型通过将视频和音频编码为时间对齐的令牌序列进行处理,但这些密集的非文本令牌在LLM中的处理带来了巨大的计算开销。现有的令牌选择方法要么仅针对纯视觉输入,要么在LLM前以固定的模态比例进行剪枝,无法捕捉跨模态令牌重要性随模型层深度的动态演变。

Result: 在Qwen2.5-Omni和Qwen3-Omni模型上的实验表明,SEATS方法仅保留10%的视觉和音频令牌,即可实现9.3倍的FLOPs减少和4.8倍的预填充速度提升,同时保持了原模型96.3%的性能。

Insight: 论文的核心创新在于揭示了om-LLMs中视觉和音频令牌依赖关系呈现块状模式并随深度逐渐减弱的规律,并据此设计了一个阶段自适应的动态剪枝框架。该方法的关键在于:1)在LLM前去除时空冗余;2)在LLM内部根据查询相关性动态分配跨模态的令牌保留预算;3)在后期层完全移除已完成跨模态融合的非文本令牌,实现了计算效率与模型性能的良好平衡。

Abstract: Omni-modal large language models (om-LLMs) achieve unified audio-visual understanding by encoding video and audio into temporally aligned token sequences interleaved at the window level. However, processing these dense non-textual tokens throughout the LLM incurs substantial computational overhead. Although training-free token selection can reduce this cost, existing methods either focus on visual-only inputs or prune om-LLM tokens only before the LLM with fixed per-modality ratios, failing to capture how cross-modal token importance evolves across layers. To address this limitation, we first analyze the layer-wise token dependency of om-LLMs. We find that visual and audio dependencies follow a block-wise pattern and gradually weaken with depth, indicating that many late-layer non-textual tokens become redundant after cross-modal fusion. Motivated by this observation, we propose SEATS, a training-free, stage-adaptive token selection method for efficient om-LLM inference. Before the LLM, SEATS removes spatiotemporal redundancy via attention-weighted diversity selection. Inside the LLM, it progressively prunes tokens across blocks and dynamically allocates the retention budget from temporal windows to modalities using query relevance scores. In late layers, it removes all remaining non-textual tokens once cross-modal fusion is complete. Experiments on Qwen2.5-Omni and Qwen3-Omni demonstrate that SEATS effectively improves inference efficiency. Retaining only 10% of visual and audio tokens, it achieves a 9.3x FLOPs reduction and a 4.8x prefill speedup while preserving 96.3% of the original performance.


[104] MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-centered Joint Modeling cs.CVPDF

Zhiping Yu, Chenyang Liu, Jinqi Cao, Qinzhe Yang, Siwei Yu

TL;DR: 本文提出了一个用于多模态遥感图像生成的统一基础模型MetaEarth-MM,它能够在一个模型中实现五种模态的配对联合生成和任意模态间的转换。该方法采用以场景为中心的联合建模范式,通过解耦架构先推断潜在场景表示,再基于此生成目标模态。作者还构建了包含280万张多分辨率全球图像的大规模数据集EarthMM用于训练。

Details

Motivation: 解决多模态遥感图像中完整配对观测数据稀缺的问题,并克服现有生成方法在模态数量和生成任务增加时存在的通用性和可扩展性限制。

Result: 大量实验表明,MetaEarth-MM在多种生成任务上表现出强大的生成能力和稳健的泛化性能,并能在数据和表征层面支持下游任务,凸显了其作为跨模态地球观测通用基础模型的潜力。

Insight: 创新点在于提出了以场景为中心的联合建模范式,区别于先前依赖直接外观级跨模态映射的方法,通过解耦架构围绕底层场景内容组织生成过程。这为多模态遥感图像的统一生成提供了新思路。

Abstract: Multi-modal remote sensing images are vital for Earth observation, yet complete paired observations are often scarce in practice. Existing generative methods commonly address this problem through isolated pairwise modality translation, but their versatility and scalability remain limited as the number of modalities and generation tasks increases. Here, we develop a generative foundation model MetaEarth-MM for multi-modal remote sensing imagery, enabling paired joint generation and any-to-any translation across five modalities within a unified model. Recognizing the intrinsic scene consistency underlying multi-modal observations, we introduce a scene-centered joint modeling paradigm in MetaEarth-MM. Unlike previous methods that rely on direct appearance-level cross-modal mapping, our model organizes the generation around the underlying scene content. Specifically, MetaEarth-MM adopts a decoupled architecture that first infers a latent scene representation from available observations, and then generates target modalities conditioned on this intermediate state. To support training, we further construct EarthMM, a large-scale dataset comprising 2.8 million multi-resolution global images with 2.2 million aligned pairs. Extensive experiments demonstrate that MetaEarth-MM not only exhibits strong generative capability and robust generalization across diverse generation tasks, but also supports downstream tasks at both data and representation levels, highlighting its potential as a general foundation model for cross-modal Earth observation. The code and dataset will be available at https://github.com/YZPioneer/MetaEarth-MM.


[105] PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset cs.CVPDF

Haojun Chen, Haoyang He, Chengming Xu, Qingdong He, Junwei Zhu

TL;DR: 本文提出了PixVerve-95K,一个高质量、开源的超高清(UHR)文本到图像(T2I)数据集,包含9.5万张像素数至少为1亿的图像及多维标注。基于此数据集,作者通过三种训练方案,首次将多种T2I基础模型扩展至原生100MP图像生成,并建立了PixVerve-Bench基准,用于全面评估UHR图像的视觉质量和语义对齐。

Details

Motivation: 随着对更好视觉体验的追求和成像技术的快速发展,对超高清(UHR)图像生成的需求显著增长,但高分辨率内容的稀缺性和复杂性给UHR图像生成带来了巨大挑战。

Result: 在提出的PixVerve-Bench基准上进行了广泛的实验,结合传统指标和基于多模态大语言模型的评估,建立了全面的UHR图像评估协议,并对训练策略进行了建设性探索,为未来突破提供了有价值的见解。

Insight: 主要创新点在于构建了首个大规模、高质量、开源的100MP级UHR T2I数据集(PixVerve-95K),并首次将T2I基础模型扩展至原生100MP生成;客观来看,其精心设计的数据管道、多维标注以及结合传统与多模态LLM的评估协议,为UHR图像生成领域提供了重要的数据基础和方法论参考。

Abstract: Text-to-Image (T2I) models have recently seen notable progress around 1K and 2K resolution. With the extreme desire for better visual experience and the rapid development of imaging technology, the demand for Ultra-High-Resolution (UHR) image generation has grown significantly. However, UHR image generation poses great challenges due to the scarcity and complexity of high-resolution content. In this paper, we first introduce PixVerve-95K, a high-quality, open-source UHR T2I dataset curated with a carefully designed data pipeline, which contains 95K images across diverse scenarios (each image has a minimum pixel-count of 100M) and seven-dimensional annotations. Based on our large-scale image-text dataset, we take a pioneering step to extend various T2I foundation models to native 100MP generation with three training schemes. Finally, leveraging both conventional metrics and multimodal large language model-based assessments, our proposed PixVerve-Bench benchmark establishes a comprehensive evaluation protocol for UHR images encompassing visual quality and semantic alignment. Extensive experimental results on our benchmark and the constructive exploration of training strategies collaboratively provide valuable insights for future breakthroughs.


[106] SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction cs.CVPDF

Zhixiong Zhang, Yizhuo Li, Shuangrui Ding, Yuhang Zang, Shengyuan Ding

TL;DR: 本文提出了一种名为SetCon的新方法,用于解决开放端指代分割问题。该方法将任务重新定义为显式的集合级概念预测,利用大型视觉语言模型生成的自然语言概念作为语义条件,通过分层语义分解(先预测共享的集合级概念,再细化为与目标子集对齐的细粒度概念组)来联合解码掩码集合。该方法在图像和视频基准测试中均取得了最先进的结果。

Details

Motivation: 现有基于大型视觉语言模型的指代分割方法通常使用特殊令牌顺序表示多个目标,将其视为独立输出而非连贯集合,难以捕捉集合级属性(如完整性和互斥性),限制了其在多实例、跨类别组或开放端目标集等复杂场景中的应用。

Result: 在图像基准测试上,SetCon在gRefCOCO上提升了3.3 gIoU,在MUSE上提升了12.1 gIoU,且随着指代目标数量增加,优势更明显,达到了最先进水平。在视频基准测试的检测与跟踪设置下,该方法在七个指代视频基准上取得了新SOTA,例如在MeViS上提升10.9 J&F,在Ref-SeCVOS上提升12.4 J&F。

Insight: 核心创新在于将开放端指代分割重新定义为集合级概念预测任务,并引入分层语义分解框架。这摒弃了传统的分割专用令牌,转而使用LVLM生成的自然语言概念作为语义条件,能更好地建模目标集合的整体属性和内部结构。同时,构建的大规模分层语义标注数据集(23.6万样本,78.4万概念短语)为训练提供了关键支持。

Abstract: Referring segmentation grounds natural-language queries to pixel-level masks, but extending it to complex scenarios with multiple instances, cross-category groups, or open-ended target sets remains challenging. Previous Large Vision Language Model (LVLM)-based methods represent referred targets with one or more special tokens sequentially, treating multiple targets as separate outputs rather than a coherent set and offering little incentive to capture set-level properties such as completeness and mutual exclusivity. We reformulate open-ended referring segmentation as explicit set-level concept prediction and propose Set-Concept Segmentation (SetCon), which uses LVLM-generated natural-language concepts, instead of segmentation-specific tokens, as semantic conditions for joint mask-set decoding. A hierarchical semantic decomposition first predicts a shared set-level concept defining the target scope and then refines it into fine-grained concept groups aligned with target subsets. To support this, a two-stage annotation pipeline augments existing reasoning segmentation datasets with hierarchical semantic supervision (236k samples, 784k concept phrases). SetCon achieves state-of-the-art results on image benchmarks (+3.3 gIoU on gRefCOCO, +12.1 gIoU on MUSE), with margins that grow as the number of referred targets increases. The concept interface also transfers to video under a detect-and-track setting, yielding new state-of-the-art results on seven referring video benchmarks, including +10.9 J&F on MeViS and +12.4 J&F on Ref-SeCVOS.


[107] CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models cs.CVPDF

Hsiang-Wei Huang, Junbin Lu, Kuang-Ming Chen, Jianxu Shangguan, Cheng-Yen Yang

TL;DR: 论文提出了CaMo模型,这是一个基于相机运动理解的视觉语言模型,旨在解决现有空间视觉语言模型在相机运动理解方面的不足。通过引入空间叙事评分(SNS)评估框架,论文展示了现有模型在空间认知上的缺陷,并开发了CaMo模型以在SNS评估和直接空间问答中实现一致性能。

Details

Motivation: 现有空间视觉语言模型在空间问答基准上表现良好,但缺乏对相机运动的基本理解,这限制了其真正的空间智能。论文旨在通过评估和训练方法解决这一差距,提升模型在3D空间理解上的可迁移性。

Result: 在SNS评估下,现有最先进的空间视觉语言模型性能显著下降,而CaMo模型在SNS评估和直接空间问答准确率上均实现了一致的高性能,达到了SOTA水平。

Insight: 论文的创新点在于提出了SNS评估框架,强调通过显式空间叙事外部化来评估视觉语言模型,并开发了基于相机运动理解的CaMo模型,这为提升模型在3D空间理解上的可迁移性提供了新思路。

Abstract: Vision-Language Models (VLMs) achieve strong performance on spatial question answering benchmarks, yet it remains unclear whether such gains reflect genuine spatial intelligence. We show that existing spatial VLMs lack basic camera motion understanding, a key component of spatial cognition. We propose the Spatial Narrative Score (SNS), an evaluation framework that requires VLMs to generate explicit spatial narratives capturing both scene semantics and camera motion, followed by reasoning with a frozen proxy LLM. Under SNS, state-of-the-art spatial VLMs exhibit significant performance degradation despite high direct question answering accuracy. To address this gap, we introduce CaMo, a camera motion grounded VLM that achieves consistent performance across SNS evaluation and direct spatial question answering accuracy. Our results highlight the importance of explicit spatial narrative externalization for evaluating VLMs with transferable 3D spatial understanding. Our code, data, and model is available at https://github.com/hsiangwei0903/CaMo


[108] MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation cs.CVPDF

Yujie Wei, Yujin Han, Zhekai Chen, Yongming Li, Kaixun Jiang

TL;DR: 本文提出了MSAVBench,这是首个针对多镜头音视频生成任务的综合性基准测试与自适应混合评估框架。该基准覆盖视频、音频、镜头和参考四个关键维度,包含多样任务设置、最多15个镜头以及具有挑战性的非现实场景。其评估框架通过自适应镜头分割校正、基于实例的主观指标量规以及基于工具的证据提取来提升鲁棒性,并与人类判断高度一致(斯皮尔曼等级相关系数达91.5%)。通过对19个先进闭源和开源模型的系统评估,发现当前系统在导演级控制和细粒度音画同步方面仍有不足,而模块化或智能体生成流程是缩小开源与闭源模型差距的有前景路径。

Details

Motivation: 随着视频生成从单镜头合成快速演进到复杂的多镜头音视频叙事以满足现实需求,评估此类前沿模型仍是一个根本性挑战。现有基准在范围和数据多样性上有限,且依赖僵化的评估流程,无法对现代多镜头音视频模型进行系统可靠的评估。

Result: MSAVBench与人类判断高度一致,斯皮尔曼等级相关系数达到91.5%。对19个最先进的闭源和开源模型的系统评估表明,当前系统在导演级控制和细粒度音画同步方面仍存在困难。

Insight: 创新点在于构建了首个覆盖多维度(视频、音频、镜头、参考)和多任务场景的综合性多镜头音视频生成基准,并设计了包含自适应镜头分割校正、实例化主观量规和工具化证据提取的鲁棒混合评估框架,为系统评估该前沿领域模型提供了可靠工具。评估发现模块化或智能体生成流程是提升开源模型性能的有效方向。

Abstract: Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. However, evaluating such frontier models remains a fundamental challenge. Existing benchmarks are limited in scope and data diversity, and rely on rigid evaluation pipelines, preventing systematic and reliable assessment of modern MSAV models. To bridge these gaps, we introduce MSAVBench, the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video generation. Our benchmark spans four key dimensions, video, audio, shot, and reference, covering diverse task settings, varying shot counts of up to 15, and challenging non-realistic scenarios. Our evaluation framework improves robustness through an adaptive self-correction mechanism for shot segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction for complex judgments. Furthermore, MSAVBench achieves high alignment with human judgments, reaching a Spearman rank correlation of 91.5%. Our systematic evaluation of 19 state-of-the-art closed- and open-source models shows that current systems still struggle with director-level control and fine-grained audio-visual synchronization, while modular or agentic generation pipelines offer a promising path toward narrowing the gap between open- and closed-source models. We will release the benchmark data and evaluation code to facilitate future research.


[109] Interpretable Computer Vision for Defect Detection in X-ray Tomography of Aerospace SiC/SiC Composites cs.CV | cond-mat.mtrl-sci | cs.LGPDF

Antonio Peña Corredor, Julien Lesseur, Romain Nunez, Paul Rivalland, Thomas Philippe

TL;DR: 本文提出了一种用于航空航天SiC/SiC复合材料X射线断层扫描(XCT)缺陷检测的可解释性计算机视觉框架p-ResNet-50。该框架在ResNet-50基础上扩展了原型层,将高检测精度与基于案例的解释相结合,使每个分类决策都能追溯到有物理意义的参考原型,从而解决了深度学习模型在工业检测中缺乏透明度的问题。

Details

Motivation: 当前航空航天SiC/SiC复合材料的XCT无损检测依赖于专家视觉评估,决策过程可追溯性有限。虽然深度卷积网络可以自动化缺陷检测,但其黑盒特性与工业检测实践所需的透明度相冲突。本文旨在填补这一差距。

Result: 在包含约12,000个XCT图像块的数据集上,以黑盒ResNet-50为基线(ROC-AUC = 0.991),原型扩展框架实现了可比性能(准确率0.957 vs. 0.959;ROC-AUC 0.994 vs. 0.993),在略微降低敏感度的同时获得了更高的精确度和特异性。

Insight: 主要创新点包括:1)引入原型层,将学习到的六个原型与专家定义的语义类别(如健康基体、孔隙、线状缺陷等)明确对齐,实现基于案例的可解释性;2)提出锚点和中心点两种新颖的正则化项,将原型与专家选择的图像块绑定,防止原型崩溃;3)通过UMAP进行潜在空间分析,描绘语义一致的子域并映射不确定性区域,为检测人员提供模型可靠性的明确视图。该框架为将领域专家知识嵌入原型网络提供了一种可复用的方法论。

Abstract: Non-destructive testing of aerospace SiC/SiC composites via X-ray computed tomography (XCT) relies on expert visual assessment, with current workflows offering limited traceability for accept/reject decisions. Deep convolutional networks can automate defect detection, yet their black-box nature conflicts with the transparency that industrial inspection practice demands. To close this gap, we introduce p-ResNet-50, a convolutional framework extended with a prototype layer that couples high detection accuracy with case-based explanations. Six learned prototypes are explicitly aligned with expert-defined semantic categories-healthy matrix, matrix–air interfaces, pores, line-like defects, and mixed morphologies-so that every classification is traceable to a physically meaningful reference. Two novel regularisation terms, anchor-based and medoid-based, tether prototypes to expert-selected patches and prevent prototype collapse, addressing a known limitation of prototype networks. Latent-space analysis via UMAP delineates semantically coherent sub-domains and maps zones of uncertainty where misclassifications concentrate, giving inspectors an explicit picture of where the model is-and is not-reliable. The framework is validated on an XCT patch dataset of approximately 12,000 patches extracted from four defect-rich SiC/SiC laboratory specimens. Taking a black-box ResNet-50 as a baseline (ROC-AUC = 0.991), the prototype extension achieves comparable performance (accuracy 0.957 vs. 0.959; ROC-AUC 0.994 vs. 0.993) while trading a slight reduction in sensitivity for higher precision and specificity. Each decision is backed by representative evidence patches, and the model explicitly flags its uncertainty regions. Beyond defect mapping, the framework establishes a reusable methodology for embedding domain-expert knowledge into prototype networks, applicable to other XCT inspection scenarios requiring traceable, auditable decisions.


cs.MM [Back]

[110] CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation cs.MM | cs.AI | cs.CV | cs.SD | eess.ASPDF

Gyubin Lee, Junwon Lee, Juhan Nam

TL;DR: 本文提出CounterFlow,一种用于反事实视频Foley生成的双阶段推理时采样方法,旨在生成与视频视觉证据相矛盾但保持时间同步的音频。该方法通过两个阶段:第一阶段构建视频衍生的时间结构并抑制视觉隐含声源;第二阶段去除视频条件以专注于根据目标提示塑造音频音色。

Details

Motivation: 现有视频&文本到音频(VT2A)模型在视频与文本内容不一致时,往往仍锚定于视觉隐含的声源,难以生成反事实音频。本文旨在解决这一限制,实现高质量的反事实视频Foley生成。

Result: 与简单的负提示方法和最先进的基线模型相比,CounterFlow在反事实视频Foley生成方面有显著提升。评估采用了一种利用文本-音频共嵌入空间的度量标准,以衡量目标提示证据和残留的视觉隐含声源泄漏。

Insight: 创新点在于提出了一个推理时的双阶段采样方案,将时间结构建模与音色塑造解耦,有效分离了视频条件中的时间信息和声源信息。这为预训练流匹配VT2A模型提供了灵活的控制机制,以生成与视觉内容相矛盾的音频。

Abstract: We investigate Counterfactual Video Foley Generation, which aims to adopt a sound-source identity that contradicts the visual evidence while remaining temporally synchronized to a silent video. Existing Video&Text-to-Audio (VT2A) models struggle with this, often remaining anchored to the visually implied sound source when video and text contents disagree. We present ConterFlow, an inference-time dual-phase sampling scheme for pretrained flow-matching VT2A models. Phase 1 builds a video-derived temporal structure while suppressing the visually implied source; Phase 2 drops video conditioning to focus entirely on shaping audio timbre toward the target prompt. ConterFlow substantially improves counterfactual Video Foley generation compared to naive negative prompting and state-of-the-art baselines. To evaluate replacement quality, we propose a metric leveraging a text-audio co-embedding space to measure both target-prompt evidence and residual visually implied source leakage. Video demonstrations and code are available at https://gyubin-lee.github.io/counterflow-demo/


cs.IR [Back]

[111] Improving Retrieval-Augmented Generation without Taxonomy-based Error Categorization cs.IR | cs.AI | cs.CLPDF

Gongbo Zhang, Yifan Peng, Chunhua Weng

TL;DR: 本文提出了一种无需基于分类的错误分类的检索增强生成(RAG)改进方法RePAIR。该方法通过一个响应-动作学习范式,直接将有缺陷的RAG输出映射到缓解错误的行动计划,避免了依赖细粒度错误分类和显式的评判监督。在多个基准测试中,RePAIR持续提升了智能体化RAG的性能。

Details

Motivation: 现有智能体化RAG系统通常假设评判反馈可靠,并侧重于规划策略,而忽视了纠错过程本身的鲁棒性问题,例如错误分类不当或纠正措施无效。本文旨在探索不依赖显式错误分类来提升RAG性能的可能性。

Result: 在多个基准测试上,RePAIR方法持续且一致地提升了智能体化RAG的性能。

Insight: 核心创新在于绕过传统的、可能不可靠的细粒度错误分类和显式评判监督,转而采用端到端的响应-动作学习范式,直接从错误输出学习纠正动作,这增强了纠错过程的鲁棒性和效率。

Abstract: Retrieval-Augmented Generation (RAG) improves the factual accuracy of large language model (LLM) outputs by grounding generation in external knowledge. Recent agentic RAG systems extend this paradigm with critical agents to evaluate model responses and iteratively refine outputs. However, most prior work implicitly assumes reliable critic feedback and focuses on planning strategies, while paying limited attention to the robustness of the error-correction process itself, which can be impacted by misaligned error categories and ineffective or incorrect corrections. Here, we hypothesize that RAG performance can be improved without explicit error categorization. We propose RePAIR, a response-action learning paradigm that directly maps flawed RAG outputs to error-mitigating action plans without relying on fine-grained error taxonomies and explicit critic supervision. Across multiple benchmarks, RePAIR consistently improves agentic RAG performance.


[112] Trust or Abstain? A Self-Aware RAG Approach cs.IR | cs.CLPDF

Xi Zhu, Ziqi Wang, Kai Mei, Wujiang Xu, Minghao Guo

TL;DR: 该论文提出了一种名为SABER的自我感知RAG方法,旨在解决检索增强生成中因上下文知识与参数知识冲突或不准确而导致的不可靠问题。该方法通过轻量级预测器评估知识可靠性,并做出信任参数知识、信任上下文知识、信任任一或弃权的四元决策,从而在多个基准测试中提升了准确性和忠实度。

Details

Motivation: 现有RAG方法主要协调使用哪种知识源,但未明确判断每个答案路径是否正确,因此需要让大语言模型具备自我感知能力,以识别自身知识和推理的局限性。

Result: 在四个LLM骨干模型上,SABER在十个推理时和微调基线方法中提升了端到端准确性和冲突特定忠实度,尤其在冲突密集的数据集上增益最大;在弃权设置下,其风险-覆盖曲线帕累托优于所有基于提示的弃权方法。

Insight: 创新点在于构建了模型特定、与真实情况对齐的知识冲突基准,并设计了无需LLM微调的自我感知信念估计器,通过结合自我先验和多轨迹推理的表示来驱动可靠性决策,实现了可调节的覆盖与风险平衡。

Abstract: Retrieval-augmented generation (RAG) improves large language models (LLMs) by incorporating external evidence, but it also introduces knowledge conflicts when retrieved contextual knowledge (CK) and parametric knowledge (PK) disagree or are both unreliable. Existing approaches mainly coordinate which source to use, without explicitly asking whether each answer path is correct. We argue that faithful RAG requires LLM self-awareness, namely the ability to recognize the limits of its own knowledge and reasoning. To ground this problem, we construct a model-specific, ground-truth-aligned knowledge-conflict benchmark by evaluating LLM backbones on PK-only and CK-conditioned answer paths over approximately 69K query-context instances per backbone, drawn from five conflict-QA datasets. We then introduce SABER, a Self-Aware Belief Estimator for RAG that requires no LLM fine-tuning. SABER combines a self-prior with PK-side and CK-side conditional reasoning representations from multi-trace inference, then estimates reliability beliefs with two lightweight predictors to drive a 4-cell decision over trust PK, trust CK, trust either, or abstain. Across four LLM backbones, SABER improves end-to-end accuracy and conflict-specific faithfulness over ten inference-time and fine-tuning baselines, with the largest gains on conflict-heavy datasets. Under abstention, SABER’s risk-coverage curve Pareto-dominates every prompt-based abstainer, providing a tunable balance between coverage and answer risk. Our code is available at https://github.com/xizhu1022/SABER.


cs.CR [Back]

[113] DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models cs.CR | cs.AI | cs.CV | cs.LGPDF

Ye Sun, Xin Wang, Jiaming Zhang, Yifeng Gao, Yixu Wang

TL;DR: DarkLLM是一个新颖的对抗攻击框架,它训练一个大语言模型(LLM)将自然语言攻击指令翻译成潜在的攻击向量,再解码为视觉对抗扰动。该框架通过自然语言指令微调,统一了定向、非定向、分割和多模型攻击,实现了灵活可控的对抗样本生成,使单个指令能诱导异构模型产生期望行为。

Details

Motivation: 传统对抗攻击通常局限于单一预定义目标,每个攻击紧密耦合于特定模型或任务,这限制了其在真实场景中的可扩展性和灵活性。本文旨在克服这一限制,开发一个更通用、可扩展的攻击框架。

Result: 在4个任务、13个数据集和15个模型上的广泛实验表明,仅拥有10亿参数的DarkLLM能够遵循攻击者指令,对CLIP、SAM和前沿LLMs生成高效的攻击,揭示了现代基础模型中的系统性漏洞。

Insight: 核心创新点在于利用LLM作为自然语言指令到对抗扰动的通用翻译器,实现了攻击目标、任务和模型类型的统一与解耦。这为理解和评估基础模型的鲁棒性提供了一个新的、更灵活的范式。

Abstract: While vision and multimodal foundation models underpin critical tasks from perception to complex reasoning, they remain highly vulnerable to adversarial attacks. However, traditional adversarial attacks are typically limited to single, predefined objectives, tightly coupling each attack to a specific model or task, which restricts their scalability and flexibility in real-world scenarios. In this work, we present DarkLLM, a novel attack framework that trains an LLM to translate natural-language attack instructions into latent attack vectors, which are then decoded into visual adversarial perturbations. By leveraging natural-language instruction tuning, DarkLLM not only unifies targeted, untargeted, segmentation, and multi-model attacks within a single framework, but also achieves flexible and controllable adversarial generation, enabling each instruction to produce a perturbation that induces desired behaviors across heterogeneous models. Through extensive experiments across 4 tasks, 13 datasets, and 15 models, we demonstrate that DarkLLM with only 1B parameters can follow attacker instructions and generate highly effective attacks against CLIP, SAM, and frontier LLMs, revealing a systemic vulnerability in modern foundation models.


cs.MA [Back]

[114] STAR-PólyaMath: Multi-Agent Reasoning under Persistent Meta-Strategic Supervision cs.MA | cs.AI | cs.CLPDF

Jiaao Wu, Xian Zhang, Hanzhang Liu, Sophia Zhang, Fan Yang

TL;DR: STAR-PólyaMath是一个通过元级监督和结构化推理者-验证者交互来解决长视野数学推理中幻觉累积、记忆碎片化和推理-工具权衡失衡等可靠性问题的多智能体框架。该框架采用由无推理Python编排器控制的状态机结构,并引入持久的元策略师进行跨尝试记忆和高级战略指导,在多个顶级数学竞赛基准上实现了最先进的性能。

Details

Motivation: 现有前沿AI模型和多智能体系统在需要扩展、长视野推理的数学问题上仍存在根本的可靠性问题,如幻觉累积、记忆碎片化和推理-工具权衡失衡,论文旨在系统性地解决这些挑战。

Result: STAR-PólyaMath在AIME 2025-2026、MathArena Apex Shortlist、MathArena Apex 2025、Putnam 2025、IMO 2025、HMMT February 2026和USAMO 2026这八个顶级竞赛基准上均取得了最先进(SOTA)的结果,在AIME、Putnam和HMMT上获得满分,并在Apex 2025上以93.75%的得分显著超越了最强基线GPT-5.5的80.21%。

Insight: 核心创新在于引入了持久的元策略师(Meta-Strategist)进行元级监督和跨尝试记忆管理,以及采用无推理的Python编排器将控制与推理分离,通过状态机结构和回溯-重规划循环来约束错误传播,使系统能够摆脱无效循环而非停滞或过度依赖工具,其性能提升主要源于框架编排而非模型层面的多样性。

Abstract: Frontier AI models and multi-agent systems have led to significant improvements in mathematical reasoning. However, for problems requiring extended, long-horizon reasoning, existing systems continue to suffer from fundamental reliability issues: hallucination accumulation, memory fragmentation, and imbalanced reasoning-tool trade-offs. In this paper, we introduce STAR-PólyaMath, a multi-agent framework that systematically addresses these challenges through meta-level supervision and structured Reasoner-Verifier interaction. STAR-PólyaMath is structured as an orchestrated state machine with nested challenge-step-replan loops, governed by a reasoning-free Python orchestrator that separates control from inference and bounds error propagation through trace-back and re-planning. Our key innovation is a persistent Meta-Strategist that maintains cross-attempt memory and exercises meta-level control by issuing high-level strategic guidance or mandatory directives, so the system can escape unproductive loops rather than stagnate or over-rely on tools. STAR-PólyaMath achieves state-of-the-art results on all eight top-tier competition benchmarks: AIME 2025-2026, MathArena Apex Shortlist, MathArena Apex 2025, Putnam 2025, IMO 2025, HMMT February 2026, and USAMO 2026. It obtains perfect scores on AIMEs, Putnam, and HMMT, and shows its largest margin on Apex 2025, scoring 93.75% compared with 80.21% by the strongest baseline GPT-5.5. Ablation studies show that the gains arise from the framework’s orchestration rather than from model-level diversity since removing key components or substituting in mixed backbones consistently weakens performance. Code is available at https://github.com/Julius-Woo/STAR-PolyaMath.


eess.IV [Back]

[115] SpecX: A Large-Scale Benchmark for Multi-Modal Spectroscopy and Cross-Paradigm Evaluation eess.IV | cs.CV | cs.LG | q-bio.OTPDF

Chengrui Xiang, Tengfei Ma, Yujie Chen, Tong Wang, Haowen Chen

TL;DR: 论文提出了SpecX,一个用于多模态光谱学的大规模基准测试,包含170万个分子的多种光谱模态数据,并支持跨范式评估。SpecX通过三个层级的数据集支持预训练、基准测试和高质量实验评估,涵盖分子解析、光谱模拟和光谱理解等任务。

Details

Motivation: 现有光谱基准测试在规模、模态对齐和评估范围上存在局限,通常只专注于专用模型或多模态大语言模型(MLLMs),因此需要建立一个统一的跨范式评估基准。

Result: 实验表明,专用模型在信号级建模方面表现出色,而MLLMs在高级推理方面有优势但缺乏精确的光谱基础。SpecX为光谱智能建立了统一基准。

Insight: 创新点在于构建了大规模、多模态对齐的光谱数据集,并支持跨专用模型和MLLMs的统一评估,强调了开发光谱原生基础模型的必要性。

Abstract: Existing spectral benchmarks are limited in scale, modality alignment, and evaluation scope, and typically focus on either specialized models or multimodal language models (MLLMs). We introduce SpecX, a large-scale benchmark for multi-modal spectroscopy with cross-paradigm evaluation. SpecX contains 1.7M molecules with diverse spectral modalities, including NMR (1H, 13C, HSQC), IR, MS,UV,Raman and FL, and is organized into three tiers: a large-scale dataset for pretraining, an aligned multi-spectral subset for benchmarking, and a high-quality experimental subset for evaluation. SpecX supports a range of tasks such as molecular elucidation, spectrum simulation, and spectral understanding, and enables unified evaluation across both specialized spectral models and MLLMs. Experiments show that specialized models excel at signal-level modeling, while MLLMs exhibit strengths in high-level reasoning but lack precise spectral grounding. SpecX establishes a unified benchmark for spectral intelligence and highlights the need for spectrum-native foundation models.


[116] From Division to Decision: Leveraging Temporal Cell-Stage Segmentation for Embryo Transferability Prediction eess.IV | cs.CV | cs.LG | q-bio.QMPDF

Yasmine Hachani, Patrick Bouthemy, Elisa Fromont, Véronique Duranthon, Ludivine Laffont

TL;DR: 本文提出TransFACT,一个基于Transformer的框架,利用牛胚胎发育前四天的2D延时视频,结合帧级时序特征和阶段级表征,以发育阶段作为辅助监督,在第四天预测胚胎可移植性。

Details

Motivation: 当前牛胚胎选择依赖于授精后第七天的单次专家评估,导致妊娠失败率高。延时显微视频虽能提供早期发育的详细信息,但由于复杂的运动模式和耗时的分析而难以利用。

Result: 实验表明,TransFACT通过利用一个为动作识别设计的方法,在预测胚胎可移植性方面取得了优于竞争对手的性能。

Insight: 创新点在于将发育阶段分割作为辅助监督任务,结合帧级和阶段级特征进行建模,将动作识别领域的Transformer方法迁移应用于胚胎发育分析,实现了早期(第四天)的预测能力。

Abstract: Accurate selection of bovine embryos is a challenging task, as current practice relies on a single expert assessment on the seventh day after insemination, resulting in high rates of pregnancy loss. Time-lapse videomicroscopy provides detailed information on early development, but is difficult to exploit because of complex motion patterns and time-consuming analysis. We propose TransFACT, a transformer-based framework for modeling early developmental stages and embryo transferability using 2D time-lapse videos from the first four days of development. TransFACT combines frame-level temporal features with stage-level representations, using developmental stages as auxiliary supervision to predict transferability on day four. Our experiments demonstrate that TransFACT, by leveraging an existing method designed for action recognition, achieves superior performance than its competitor in predicting embryo transferability.


[117] FGSVQA: Frequency-Guided Short-form Video Quality Assessment eess.IV | cs.CVPDF

Xinyi Wang, Angeliki Katsenou, Junxiao Shen, David Bull

TL;DR: 本文提出了一种名为FGSVQA的端到端短格式视频质量评估框架,旨在解决用户生成内容(UGC)中因复杂生成流程、快速内容变化和混合失真带来的挑战。该方法基于CLIP的密集视觉编码器,并引入频域压缩先验来生成感知伪影和结构的权重图,通过显式分解伪影、结构和原始视觉特征分支,并利用学习到的门控模块自适应地融合它们,以实现准确且高效的质量预测。

Details

Motivation: 短格式视频因其复杂的生成流程、快速的内容变化和混合失真,为用户生成内容(UGC)的质量评估带来了新挑战,需要专门的方法来准确评估其质量。

Result: 实验结果表明,该方法在短格式视频数据集上实现了强大的性能,平均排名和线性相关性指标表现优异(SRCC: 0.736, PLCC: 0.787),同时保持了高效的推理运行时间。

Insight: 创新点在于将频域压缩先验引入VQA框架,以生成伪影和结构感知的权重图,并通过显式特征分解和自适应门控融合来提升质量预测的准确性和效率,这为处理混合失真的视频质量评估提供了新思路。

Abstract: Short-form video poses new challenges to the quality assessment of user-generated content (UGC) due to its complex generation pipeline, rapid content variation, and mixed distortions. To address this challenge, we propose an end-to-end video quality assessment (VQA) framework that employs a dense visual encoder based on CLIP, and incorporates compression priors derived from the frequency domain to generate artifact- and structure-aware weight maps for feature aggregation. By explicitly decomposing artifact, structure, and original visual feature branches and adaptively fusing them over time through a learned gating module, the proposed method achieves accurate and efficient quality prediction. Experimental results show that our method achieves strong performance on short-form video datasets in terms of average rank and linear correlation (SRCC: 0.736, PLCC: 0.787), while maintaining efficient inference runtime. The code and additional results are available at: https://github.com/xinyiW915/FGSVQA.


cs.RO [Back]

[118] RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning cs.RO | cs.AI | cs.CV | cs.LG | cs.MAPDF

Ehsan Ahmadi, Hunter Schofield, Behzad Khamidehi, Fazel Arasteh, Jinjun Shan

TL;DR: 论文提出了RLFTSim,一个基于强化学习的微调框架,用于提升多智能体交通仿真的真实性和可控性。该方法在预训练的仿真模型基础上,通过设计平衡保真度与可控性的奖励函数,利用真实世界数据分布进行对齐,从而生成更符合现实的驾驶场景。

Details

Motivation: 现有的监督开环训练方法难以捕捉复杂驾驶场景中固有的动态多智能体交互,导致仿真真实性不足。本文旨在解决仿真模型与真实数据分布对齐的问题,并同时提供一种可控的场景生成方法。

Result: 在Waymo Open Motion Dataset上的综合实验表明,该方法在真实性方面取得了最先进的性能。与基于启发式搜索的微调方法相比,RLFTSim由于提出了低方差、密集的奖励信号,所需样本量显著减少。

Insight: 核心创新点在于将强化学习微调用于对齐仿真与真实数据分布,并设计了平衡真实性与可控性的奖励机制。其提出的低方差密集奖励有效提升了样本效率,而目标条件化则为交通仿真提供了直接的可控性蒸馏方法。

Abstract: Supervised open-loop training has been widely adopted for training traffic simulation models; however, it fails to capture the inherently dynamic, multi-agent interactions common in complex driving scenarios. We introduce RLFTSim, a reinforcement-learning-based fine-tuning framework that enhances scenario realism by aligning simulator rollouts with real-world data distributions and provides a method for distilling goal-conditioned controllability in scenario generation. We instantiate RLFTSim on top of a pre-trained simulation model, design a reward that balances fidelity and controllability, and perform comprehensive experiments on the Waymo Open Motion Dataset. Our results show improvements in realism, achieving state-of-the-art performance. Compared with other heuristic search-based fine-tuning methods, RLFTSim requires significantly fewer samples due to a proposed low-variance and dense reward signal, and it directly addresses the realism alignment issue by design. We also demonstrate the effectiveness of our approach for distilling traffic simulation controllability through goal conditioning. The project page is available at https://ehsan-ami.github.io/rlftsim.


[119] Beyond Imitation: Learning Safe End-to-End Autonomous Driving from Hard Negatives cs.RO | cs.CVPDF

Junli Wang, Zhihua Hua, Xueyi Liu, Zebin Xing, Haochen Tian

TL;DR: 本文提出BeyondDrive框架,通过同时学习成功和失败的驾驶行为来解决模仿学习中安全目标错配的问题。该框架包含基于流匹配的负轨迹生成器、多样性感知采样策略和排斥距离损失,以在轨迹空间中建立区分性的安全边界。

Details

Motivation: 现有端到端自动驾驶模仿学习方法主要最小化与专家轨迹的几何偏差,但空间接近性并不等同于行为安全,导致轨迹模仿损失相近但安全结果迥异(如可恢复与碰撞)。

Result: 在NAVSIMv1闭环基准测试中,BeyondDrive应用于单模态基线Latent TransFuser,取得了89.7 PDMS,超越了先前的最先进方法。此外,该框架能有效泛化到多模态规划器等不同架构,并在HUGSIM基准上展示了强大的零样本迁移能力。

Insight: 创新点在于明确建模安全不对称性,通过合成安全关键但接近专家的负轨迹来联合学习正负行为,并使用排斥损失建立安全边界,从而超越单纯模仿,提升驾驶策略的安全判别能力。

Abstract: Existing imitation learning methods for end-to-end autonomous driving predominantly learn from successful demonstrations by minimizing geometric deviations from expert trajectories. This paradigm implicitly assumes that spatial proximity implies behavioral safety, leading to a critical objective mismatch: trajectories with nearly identical imitation losses may exhibit drastically different safety outcomes, where one remains recoverable while the other results in collision. To address this limitation, we propose BeyondDrive, a failure-aware imitation learning framework that jointly learns from successful and failed driving behaviors. First, we introduce a flow matching-based negative trajectory generator that synthesizes safety-critical yet expert-proximate trajectories, enabling explicit modeling of safety asymmetry. Second, we develop a diversity-aware sampling strategy that mitigates mode collapse and improves coverage of diverse failure modes during negative trajectory generation. Third, we propose a Repulsive Distance Loss that simultaneously attracts predictions toward expert demonstrations while repelling them from hard negative trajectories, thereby establishing discriminative safety boundaries in trajectory space. Applied to the uni-modal baseline Latent TransFuser, BeyondDrive achieves 89.7 PDMS on the NAVSIMv1 closed-loop benchmark, outperforming prior state-of-the-art methods. Moreover, BeyondDrive generalizes effectively across different autonomous driving architectures, including multi-modal planners, and further demonstrates strong zero-shot transferability on the HUGSIM benchmark.


[120] SafeAlign-VLA: A Negative-Enhanced Safe Alignment Framework for Risk-Aware Autonomous Driving cs.RO | cs.CVPDF

Kefei Tian, Yuansheng Lian, Kai Yang, Xiangdong Chen, Shen Li

TL;DR: 本文提出了SafeAlign-VLA,一个负样本增强的安全对齐框架,用于提升端到端自动驾驶系统在安全关键长尾场景下的性能。该框架通过反事实安全配对范式生成结构化安全标签和反事实正轨迹,并采用两阶段训练策略,结合监督学习和强化学习,有效利用负样本数据来理解和规避风险行为。

Details

Motivation: 当前基于视觉-语言-动作(VLA)模型的端到端自动驾驶方法主要依赖正面的专家演示数据,很少利用负样本(如风险行为),导致对风险行为和安全边界的理解不足,难以处理安全关键的长尾场景。

Result: 在NAVSIM v1测试集上,SafeAlign-VLA达到了89.1的PDMS分数,比不使用负样本的基线提升了1.3%。在DeepAccident数据集上,它将碰撞率降低至3.36%,同时实现了84.2%的语言准确率和85.8%的风险预测准确率,验证了其有效性。

Insight: 核心创新在于提出了一个统一的负样本增强安全对齐框架,通过反事实推理生成结构化安全标签和对比轨迹,并设计了结合负样本增强监督微调和基于锚点的分组相对策略优化的两阶段训练策略,以对比方式利用正负样本来引导策略优化并惩罚高风险行为,提升了系统的安全性和鲁棒性。

Abstract: End-to-end autonomous driving systems excel in common scenarios but struggle with safety-critical long-tail cases. Vision-Language-Action (VLA) models are promising due to their strong reasoning capabilities. However, most VLA-based approaches rely on positive expert demonstrations, rarely exploiting negative samples, leading to insufficient understanding of risky behaviors and safety boundaries. To address this limitation, we propose SafeAlign-VLA, a unified negative-enhanced safe alignment framework that incorporates negative data into supervised learning and reinforcement learning. First, we develop a counterfactual safety pairing paradigm to generate structured safety labels and counterfactual positive trajectories from risky scenarios via counterfactual reasoning. Then, a two-stage training strategy is adopted: negative-enhanced supervised fine-tuning for failure feedback and trajectory correction, followed by anchor-based group relative policy optimization that uses positive and negative trajectories as contrastive anchors to steer sampling and penalize high-risk behaviors via group-relative advantages. Experiments on NAVSIM and DeepAccident validate the proposed framework. SafeAlign-VLA achieves 89.1 PDMS on the NAVSIM v1 testset, improving over the baseline without negative data by 1.3%. On DeepAccident, it reduces the collision rate to 3.36%, while achieving 84.2% language accuracy and 85.8% risk prediction accuracy. These results demonstrate the effectiveness of the proposed negative-enhanced safe alignment framework for safe and robust autonomous driving.


[121] Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation cs.RO | cs.CV | cs.LGPDF

He-Yang Xu, Pengyuan Zhang, Zongyuan Ge, Xiaoshuai Hao, Serge Belongie

TL;DR: 本文提出了MetaFine,一个用于细粒度操作任务的诊断性元评估框架。该框架将操作能力解耦为理解、感知和控制行为三个维度,通过构建组合任务图吸收异构的外部基准测试,并在统一协议下重构为不同复杂度的诊断场景。评估发现现有视觉-语言-动作模型存在维度特异性失败,并识别出视觉编码器保留局部空间结构的能力是关键瓶颈。

Details

Motivation: 当前具身AI基准测试将细粒度操作所需的各种能力(局部属性定位、高保真空间感知、约束遵从的运动执行)简化为二元成功率,这系统性高估了模型能力(高达70%)并掩盖了阻碍实际部署的架构瓶颈。

Result: 通过MetaFine评估最先进的视觉-语言-动作模型,暴露了传统指标无法发现的严重维度特异性失败。通过针对性因果干预,发现提升视觉编码器保留局部空间结构的能力,可以直接解锁先前无法访问的操作能力,而无需修改下游策略。

Insight: 创新点在于提出了一个从排名转向诊断的评估框架,将基准测试转化为修复真实物理灵巧性所需分层能力的可操作指南。该框架支持混合现实-仿真验证,利用有限的真实世界数据来校准基于仿真的可扩展估计,以实现更稳定的物理基准测试。

Abstract: Fine-grained manipulation marks a regime where global scene context no longer suffices, and success hinges on the tight coupling of local attribute grounding, high-fidelity spatial perception, and constraint-respecting motor execution. However, current embodied AI benchmarks collapse these capacities into binary success rates, systematically inflating reported capabilities by up to 70% and masking the architectural bottlenecks that impede real-world deployment. We introduce MetaFine, a diagnostic meta-evaluation framework that disentangles manipulation competency along three axes: understanding, perception, and controlled behavior. Built on a compositional task graph, MetaFine absorbs heterogeneous external benchmarks and reconstructs them into diagnostic scenarios of varying complexity under a unified protocol. Evaluating state-of-the-art vision-language-action (VLA) models through this lens exposes severe dimension-specific failures invisible to conventional metrics. Through targeted causal intervention, we identify the visual encoder’s ability to preserve local spatial structure as a key bottleneck for fine-grained precision: improving it directly unlocks previously inaccessible manipulation capabilities without modifying downstream policies. MetaFine further supports hybrid real-sim validation, using limited paired real-world rollouts to calibrate scalable simulation-based estimates for more stable physical benchmarking. By shifting evaluation from ranking to diagnosis, MetaFine turns benchmarking into an actionable compass for repairing the layered capacities underlying genuine physical dexterity. The MetaFine framework, benchmarks, and supporting resources will be publicly released at our project page: https://metafine.github.io/.


cs.AI [Back]

[122] From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning cs.AI | cs.CL | cs.CV | cs.ROPDF

Ahmed Y. Gado, Omar Y. Goba, Alaa Hassanein, Catherine M. Elias, Ahmed Hussein

TL;DR: 本文探讨了在自动驾驶车辆(AVs)中,通过大型语言模型(LLMs)和大型多模态模型(LMMs)进行高层场景解释与规划时,时间作为次要属性处理导致连续动作推理不一致的问题。研究引入三种逐步增加时间整合的规划器架构,在BDD-X数据集子集上评估,发现时间条件化虽未显著提升基于NLP的正确性指标,但定性分析揭示了预测性危险推理、稳定纠正行为等优势。

Details

Motivation: 当前基于LLMs和LMMs的自动驾驶场景解释与规划方法忽视时间属性,导致连续动作推理不一致,影响安全性和可解释性。研究旨在探索智能体间通信中的时间条件化是否能保持或增强推理连贯性,而不损害语义或逻辑一致性。

Result: 在BDD-X数据集子集上,使用语义、句法和逻辑指标评估三种规划器架构,结果显示时间条件化未在标准NLP正确性指标上带来统计显著改进。但定性分析表明,它促进了预测性危险推理、稳定纠正行为和哨兵策略分歧。

Insight: 创新点在于首次系统研究时间条件化在智能体场景到规划推理中的作用,并建立了时间场景到规划推理的实证基准。客观来看,研究揭示了基于提示的时间接地方法的局限性,为未来时间感知推理模型设计提供了重要参考。

Abstract: Recent attempts to support high-level scene interpretation and planning in Autonomous Vehicles (AVs) using ensembles of Large Language Models (LLMs) and Large Multimodal Models (LMMs) continue to treat time as a secondary property. This lack of temporal grounding leads to inconsistencies in reasoning about continuous actions, undermining both safety and interpretability. This work explores whether temporal conditioning within inter-agent communication can preserve or enhance coherence without introducing degradation in semantic or logical consistency. To investigate this, we introduce three planner architectures with progressively increasing temporal integration and evaluate them on curated subsets of the BDD-X dataset using semantic, syntactic, and logical metrics. Results show that while temporal conditioning reshapes reasoning style, it yields no statistically significant improvements in standard NLP-based correctness metrics. However, qualitative analysis reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence in the Sentinel. These findings clarify the limits of prompt-based temporal grounding and establish the first empirical benchmark for temporal scene-to-plan reasoning.


[123] What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code cs.AI | cs.CLPDF

Yuze Zhao, Junpeng Fang, Lu Yu, Zhenya Huang, Kai Zhang

TL;DR: 本文通过控制预训练实验探究代码数据对语言模型推理能力的影响,发现纯可执行代码主要提升编程能力而非通用推理,而跨领域结构化推理轨迹(如代码-文本混合数据)才是提升数学推理的关键。增加数学领域结构化样本密度可在保持编程性能的同时显著提升复杂数学推理能力,并通过路由分析揭示了领域间竞争与协同的机制证据。

Details

Motivation: 代码已成为现代基础语言模型训练的标准组成部分,但其在编程之外的作用尚不明确,本文旨在重新审视代码是否真正提升推理能力,并探究数据组成对能力迁移的影响。

Result: 在10T令牌语料库上进行细粒度领域分离的受控预训练实验表明,纯代码数据对复杂数学推理有竞争性负面影响,而结构化数学领域样本(如代码-文本混合)在固定数学预算下可大幅提升困难数学推理任务性能,同时基本保持编程能力。

Insight: 创新点在于揭示了结构化推理信号(而非纯代码)是跨领域能力迁移的关键,提出通过认知支架(如代码-文本混合数据)针对性缓解领域间权衡,并利用路由分析从机制层面验证了数据组成对专家激活模式的影响,为数据中心的优化策略提供了新视角。

Abstract: Code has become a standard component of modern foundation language model (LM) training, yet its role beyond programming remains unclear. We revisit the claim that code improves reasoning through controlled pretraining experiments on a 10T-token corpus with fine-grained domain separation. Our findings are threefold. First, when code is restricted to standalone executable programs and Code-NL data are controlled for, code substantially improves programming ability but does not act as a general reasoning enhancer; instead, it competes with knowledge-intensive tasks, especially complex mathematical reasoning. Second, the reasoning gains often attributed to code are better explained by cross-domain structured reasoning traces, such as code-text and math-text mixtures, rather than by executable code alone. Third, increasing the density of structured math-domain samples within a fixed math budget yields substantial gains on difficult mathematical reasoning while largely preserving programming performance, suggesting that cognitive scaffolds offer a targeted way to mitigate cross-domain trade-offs. Finally, routing analyses show that data-composition effects are reflected in expert-activation patterns, providing mechanism-level evidence for competitive and synergistic interactions across domains. Our results clarify which data characteristics transfer across capability dimensions and point to more precise data-centric optimization strategies.


[124] PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents cs.AI | cs.CL | cs.LGPDF

Zhuohan Gu, Qizheng Zhang, Omar Khattab, Samuel Madden

TL;DR: 本文提出了PEEK系统,用于提升长上下文LLM智能体在重复外部上下文(如文档库、代码仓库)中的交互效率。该系统通过构建并维护一个恒定大小的上下文地图(context map),缓存关于外部上下文的可重用定向知识(如内容结构、有用实体等),从而减少迭代次数并降低成本。

Details

Motivation: 现有方法在多次调用中仅保留智能体的轨迹、原始材料的被动访问或任务级策略,缺乏对重复相同上下文工作负载最关键的可重用定向知识,导致效率低下和成本高昂。

Result: 在长上下文推理和信息聚合任务中,PEEK相比强基线提升6.3-34.0%,迭代次数减少93-145次,成本比最先进的提示学习框架ACE低1.7-5.8倍;在上下文学习中,解决率和评分准确率分别提升6.0-14.0%和7.8-12.1%,成本降低1.4倍。这些改进在包括OpenAI Codex在内的多种语言模型和智能体架构中均得到验证。

Insight: 创新点在于将可重用定向知识抽象为恒定大小的上下文地图,并通过可编程缓存策略(包含蒸馏器、制图器和基于优先级的驱逐器)动态维护,实现了对重复外部上下文的高效、低成本交互,为长上下文LLM智能体设计提供了新思路。

Abstract: Large language model (LLM) agents increasingly operate over long and recurring external contexts, like document corpora and code repositories. Across invocations, existing approaches preserve either the agent’s trajectory, passive access to raw material, or task-level strategies. None of them preserves what we argue is most needed for repeated same-context workloads: reusable orientation knowledge (e.g., what the context contains, how it is organized, and which entities, constants, and schemas have historically been useful) about the recurring context itself. We introduce PEEK, a system that caches and maintains this orientation knowledge as a context map: a small, constant-sized artifact in the agent’s prompt that gives it a persistent peek into the external context. The map is maintained by a programmable cache policy with three modules: a Distiller that extracts transferable knowledge from inference-time signals, a Cartographer that translates it into structured edits, and a priority-based Evictor that enforces a fixed token budget. On long-context reasoning and information aggregation, PEEK improves over strong baselines by 6.3-34.0% while using 93-145 fewer iterations and incurring 1.7-5.8x lower cost than the state-of-the-art prompt-learning framework, ACE. On context learning, PEEK improves solving rate and rubric accuracy by 6.0-14.0% and 7.8-12.1%, respectively, at 1.4x lower cost than ACE. These gains generalize across LMs and agent architectures, including OpenAI Codex, a production-grade coding agent. Together, these results show that a context map helps long-context LLM agents interact with recurring external contexts more accurately and efficiently.


[125] AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees cs.AI | cs.CV | cs.MAPDF

Yuankai Li, Tinghui Zhu, Ha Min Son, Zhe Zhao, Xin Liu

TL;DR: 本文提出了AQuaUI,一种用于GUI智能体模型的无训练推理时视觉令牌缩减方法。该方法利用GUI截图信息密度非均匀分布的特性,通过构建自适应四叉树,在每个叶子节点保留一个代表性合并令牌,从而减少视觉令牌数量。此外,还提出了条件四叉树算法以提升多步GUI交互中的时序一致性。

Details

Motivation: 现有大型多模态模型(LMMs)作为GUI智能体模型骨干时,高分辨率GUI截图在每次迭代中引入提示,但这些截图空间信息密度高度不均匀,现有方法要么需要额外训练,要么依赖基于注意力的令牌压缩,忽略了GUI截图的布局结构和空间冗余。

Result: 在标准grounding和navigational基准测试上,AQuaUI在先进GUI智能体模型上实现了一致的精度-效率权衡改进。在GUI-Owl-1.5-32B-Instruct模型上,AQuaUI实现了高达13.22%的加速和29.52%的视觉令牌减少,同时保持了99.06%的全令牌性能。

Insight: 创新点在于利用GUI截图固有的空间冗余性,提出了一种无需重新训练的自适应四叉树令牌缩减方法,并引入了条件四叉树算法来保持跨步骤的时序一致性,这为高效处理非均匀信息密度的视觉输入提供了新思路。

Abstract: Large Multimodal Models (LMMs) have recently emerged as promising backbones for GUI-agent models, where high-resolution GUI screenshots are introduced to the prompts at each iteration step. However, these screenshots exhibit highly non-uniform spatial information density: large regions may carry little information and are visually homogeneous, while key text and icons may require high visual fidelity. Existing approaches to this problem either require additional training or rely on attention-based token compression, ignoring the structured layout and spatial redundancy of GUI screenshots. To fill the gap, this paper proposes AquaUI, a training-free inference-time token reduction method for GUI agent models that utilizes the non-uniform information density in screenshots. AQuaUI constructs an adaptive quadtree on each screenshot input and keeps one representative merged token per leaf of the quadtree. AQuaUI preserves the spatial positions of retained tokens throughout the pipeline to ensure that all position-encoding stages remain consistent. To further improve temporal consistency across multi-step GUI interactions, we propose a conditional quadtree algorithm that leverages the continuity between consecutive screenshots within a single request. Specifically, it refines the current quadtree using previous quadtrees as references, helping preserve fine-grained regions across static or mildly shifted GUI states. We implement AQuaUI on state-of-the-art GUI agent models and conduct experiments on standard grounding and navigational benchmarks. AQuaUI consistently shows improved accuracy-efficiency trade-offs over prior baselines. Notably, on GUI-Owl-1.5-32B-Instruct, AQuaUI achieves up to 13.22% speedup and 29.52% fewer visual tokens while retaining 99.06% of full-token performance, suggesting that the spatial redundancy of GUI screenshots can be exploited at inference without retraining.


cs.LG [Back]

[126] ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning cs.LG | cs.AI | cs.CLPDF

Wanghan Xu, Yuhao Zhou, Hengyuan Zhao, Shuo Li, Dianzhi Yu

TL;DR: 本文提出ReCrit,一个面向科学批判性推理的、具有状态转移感知能力的强化学习框架。它将大语言模型在用户批评下的行为分解为四个象限(修正、迎合、鲁棒性、边界),并通过奖励修正和鲁棒性、惩罚迎合来优化模型在批评交互中的表现。该方法在多个科学推理基准上显著提升了模型在批评阶段的准确性。

Details

Motivation: 大语言模型在科学推理的批评交互中,不仅可能回答错误,更危险的是可能在用户批评后放弃最初正确的答案。现有方法通常只关注最终答案的准确性,而忽略了批评前后答案正确性发生转变这一关键问题。

Result: 在ChemBench、TRQA和EarthSE三个科学推理基准上,ReCrit将Qwen3.5-4B模型的平均批评阶段准确率从38.15提升至51.49,将Qwen3.5-9B模型的准确率从45.40提升至55.59。消融实验表明,仅基于最终答案的奖励收益甚微,而具有状态转移感知的奖励和象限加权能产生更有效的训练信号。

Insight: 核心创新在于将批评交互建模为“回合间正确性转移”问题,而非单纯的最终答案准确性问题,并据此设计了四象限行为分解和相应的奖励机制。此外,提出的动态异步回放与尾部自适应完成技术,解决了交互式训练中回放等待的扩展性问题。

Abstract: Large language models can fail in critic interaction not only by answering incorrectly, but also by abandoning an initially correct scientific solution after user criticism. This is especially risky in scientific reasoning, where user criticism can turn a valid answer into an incorrect one. We frame critic interaction as an inter-turn correctness-transition problem rather than a final-answer accuracy problem, and identify three challenges: transition awareness, decoupling useful correction from harmful sycophancy, and scalable rollout. We propose ReCrit, a transition-aware reinforcement learning framework that decomposes Initial-to-Critic behavior into four quadrants: Correction, Sycophancy, Robustness, and Boundary. ReCrit rewards correction and robustness, penalizes sycophancy, and treats persistent errors as weak boundary signals. To make interaction training practical, ReCrit further uses dynamic asynchronous rollout with tail-adaptive completion to reduce rollout waiting. On three scientific reasoning benchmarks, ChemBench, TRQA, and EarthSE, ReCrit improves average Critic accuracy from 38.15 to 51.49 on Qwen3.5-4B and from 45.40 to 55.59 on Qwen3.5-9B. Ablations show that final-answer rewards provide little interaction-level gain, while transition-aware rewards and quadrant weighting produce more distinguishable training signals and larger net Critic-stage improvement. The code is available at https://github.com/black-yt/ReCrit .


[127] Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking cs.LG | cs.AI | cs.CLPDF

Qinwu Xu, Zhuoheng Li, Jessie Salas

TL;DR: 本文针对多模态大语言模型(MLLMs)在性能差异微小、评估信号易受噪声影响时面临的检查点选择难题,提出了一种鲁棒的决策框架。该框架整合了真实世界数据、基于LLM的结构化判断和多阶段排序协议,通过逐点过滤、列表排序和成对比较进行渐进式优化,并引入了基于子采样的置信度估计和基于百分位数的评分机制以提升可靠性。

Details

Motivation: 现有检查点选择方法严重依赖静态基准测试或逐点评分,这些方法常与真实使用场景脱节,且缺乏鲁棒的不确定性估计,尤其在OCR密集场景下评估有效性不足。

Result: 论文通过实验证明,数据质量(特别是OCR可读性)是评估有效性的关键决定因素;所提框架能够捕获分布特征并惩罚尾部失败,从而在存在评估不确定性的情况下实现更稳健的检查点选择。

Insight: 创新点在于将检查点选择形式化为评估不确定性下的鲁棒决策问题,并设计了结合真实数据、LLM代理评估和稳定性感知排序的多阶段框架;其基于子采样的置信度估计和百分位数评分机制为噪声环境下的模型选择提供了新的可靠性保障思路。

Abstract: Checkpoint selection for multimodal large language models (MLLMs) presents significant challenges when performance differentials are marginal and evaluation signals are prone to noise. Existing methodologies rely heavily on static benchmarks or pointwise scoring, which frequently misalign with in-the-wild usage and lack robust uncertainty estimation, particularly in OCR-heavy scenarios. In this work, we formulate checkpoint selection as a robust decision problem under evaluation uncertainty. We propose a multi-stage framework that integrates curated real-world data, structured LLM-based judgment, and multi-stage ranking protocols. The evaluation system orchestrates progressive refinement via pointwise filtering, listwise ranking, and pairwise comparison. To enhance reliability, we introduce subsampling-based confidence estimation and a percentile-based scoring formulation that captures distributional characteristics while penalizing tail failures. Furthermore, we demonstrate that data quality, specifically OCR readability, is a critical determinant of evaluation validity.


[128] The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next cs.LG | cs.AI | cs.CLPDF

Adil Amin

TL;DR: 本文提出一种分析前沿模型能力演化的框架,通过解耦SWE-bench和GPQA Diamond基准测试的得分,识别模型在编码与推理能力间的协同或权衡关系,并预测未来评估重点的转移。

Details

Motivation: 现有排行榜仅按独立维度排名模型,无法揭示不同能力间是相互促进还是此消彼长,而能力间的交互作用对前沿模型发展更具指导意义。

Result: 分析34个模型发现编码与推理能力整体正相关(r=+0.72),但不同实验室的策略差异显著;SWE-bench趋于饱和,而HLE和指令遵循任务仍具区分度,预示评估轴心将发生轮换。

Insight: 提出基于能力耦合趋势和残差场(h-field)的诊断方法,可量化各实验室技术路线的效率差异,并提供包含定位、诊断、轴心轮换的三阶段行动指南及可证伪预测。

Abstract: Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases – and at the frontier, this interaction is the more informative signal. We decompose paired SWE-bench and GPQA Diamond scores into a population coupling trend and per-release residual ($h$-field) that diagnoses capability emphasis and identifies which measurement or stress test is most informative next. Across 34 models from 10 labs (2024–2026), capabilities cooperate ($r = +0.72$, $p < 10^{-6}$), but cooperation varies by lab and over time: DeepSeek reversed from reasoning-rich to coding-first ($h$: $+11.2 \to -4.7$, 15.9-pp swing); Google maintains consistent reasoning emphasis; Anthropic oscillates between coding excursions and recovery. Cooperation is not static – it cascades. Six open-weight architectures confirm a second capability transition at 30–72B, and SWE-bench is now saturating while HLE and instruction-following retain discriminatory spread – signaling the next axis rotation. We provide a three-level playbook (locate, diagnose, rotate), a per-lab measurement-priority table, and seven falsifiable predictions with timestamped criteria for the next 12 months of frontier releases. Per-lab coupling slopes vary $5\times$ (Google $1.15$ vs. DeepSeek $0.23$), quantifying how efficiently each recipe converts coding gains into reasoning. Five April 2026 releases confirm the diagnostic out of sample ($r$ rises from $+0.72$ to $+0.75$). An interactive dashboard provides phase classification with actionable recommendations, $h$-field diagnostics, per-lab coupling trajectories, ODE-based scaling predictions, benchmark rotation guidance, self-steering demo, and live tracking of all seven predictions: https://zehenlabs.com/cape/.


[129] Dynamic Model Merging Made Slim cs.LG | cs.AI | cs.CLPDF

Guodong Du, Wanyu Lin

TL;DR: 本文提出了DiDi-Merging,一个轻量化的动态模型合并框架,旨在解决现有动态合并方法在准确性与效率之间权衡不佳的问题。该框架通过可微秩分配来平衡共享参数和专家参数,并引入无需数据的精炼步骤来恢复任务保真度。

Details

Motivation: 现有动态模型合并方法要么保留完整的共享模型并附加小型专家模块,要么为专家模块分配过多容量,导致存储效率与模型性能之间的权衡不理想。

Result: DiDi-Merging在仅使用单个微调模型1.24倍参数的情况下,即可匹配先前的动态基线方法;在使用1.4倍参数时,性能则超过它们,其存储需求远低于需要超过2倍参数的方法。该方法在视觉、语言和多模态任务上均适用。

Insight: 核心创新点在于将参数预算问题形式化为低秩模块中的可微秩优化,并通过数据无关的精炼步骤来提升性能,实现了更紧凑且高效的动态模型合并。

Abstract: Model merging enables the reuse of fine-tuned models without joint training or access to original data. Dynamic merging further improves flexibility by selectively activating task-relevant parameters and efficiently composing experts across multiple tasks. However, existing dynamic methods either maintain a full shared model with tiny experts or allocate excessive capacity to experts, leading to suboptimal accuracy–efficiency trade-offs. To address this, we propose DiDi-Merging, a slim dynamic merging framework that leverages differentiable rank allocation to balance shared and expert parameters. By formulating parameter budgeting as differentiable rank optimization in low-rank modules and introducing a data-free refinement step to recover task fidelity, DiDi-Merging matches prior dynamic baselines at only 1.24x the parameters of a single fine-tuned model and surpasses them at 1.4x, substantially more compact than methods requiring > 2x storage. DiDi-Merging applies across vision, language, and multimodal tasks.


[130] Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling cs.LG | cs.AI | cs.CLPDF

Adil Amin

TL;DR: 本文研究了语言模型规模扩展过程中推理能力与真实性之间的耦合关系,发现存在一个由模型家族决定的临界规模N_c。当模型规模小于N_c时,推理与真实性呈负相关;大于N_c时,两者转为正相关协同发展。研究通过分析63个基础模型,揭示了损失曲线无法反映的这种相变现象,并开发了诊断工具。

Details

Motivation: 现有的缩放定律主要预测损失与计算量的关系,但未能揭示不同能力(如推理与真实性)之间如何相互作用。本文旨在探究这些能力在模型规模扩展过程中的耦合动态,以发现超越损失曲线的隐藏行为模式。

Result: 研究发现临界规模N_c约为35亿参数(95%置信区间为29亿至134亿)。在协同机制下,前沿模型的推理与真实性呈强正相关(r=+0.72,涵盖34个模型)。通过数据筛选、架构创新和训练方法优化,可以在较小规模下提前达到协同状态,例如Gemma-4在40亿参数时达到了通常需要130亿+标准训练模型才能实现的耦合度(0.871)。

Insight: 创新点在于发现了语言模型能力发展的“相变”现象,并证明模型规模并非决定相变的唯一因素,架构、数据筛选和训练方法均可独立影响临界点。研究还提出了一种仅需公开基准分数、无需模型内部参数的诊断方法,并开发了开源工具用于分析任何开放权重模型的耦合阶段和提供干预建议。

Abstract: Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across 63 base models from 16 families and find a regime change invisible to loss curves: below a family-dependent critical scale $N_c$, capabilities anticorrelate; above it, they cooperate. $N_c \approx 3.5$B parameters [2.9B, 13.4B] (bootstrap 95% CI), but model size is not the only variable that determines phase. Architecture, data curation, and training recipe each shift $N_c$ independently: curated training eliminated the coupling dip between Qwen generations ($0.025 \to 0.830$ at matched scale), Gemma-4 at 4B achieves coupling 0.871, characteristic of 13B+ standard-trained models, through distillation and architectural innovation, and Phi at 1B matches web-trained coupling at 10B through data curation alone. Width normalization eliminates the anticorrelation across all tested families, supporting an output-projection bottleneck. Internally, 38 of 40 models show zero competing attention heads. A sparse-regression ODE cross-predicts held-out Llama-2 at 5.6% error. The diagnostic requires no model internals – only public benchmark scores across a model family. The cooperative regime extends to the frontier ($r = +0.72$, 34 models, 10 labs). Code, data, and an open-source activation-steering tool for any open-weight model are released alongside an interactive dashboard that diagnoses any model’s coupling phase, suggests concrete interventions (data curation, width, benchmark rotation), and provides ODE scaling predictions, frontier diagnostics, and eigenstructure analysis: https://zehenlabs.com/cape/.


[131] SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs cs.LG | cs.AI | cs.CLPDF

Chanuk Lee, Minki Kang, Sung Ju Hwang

TL;DR: 本文提出SAGE框架,旨在解决强化学习与可验证奖励(RLVR)在推理任务中提升pass@1但无法有效提升pass@k的问题。作者认为标准RLVR目标中的反向KL正则化将策略锚定在参考分布上,抑制了替代推理模式的探索。SAGE通过引导函数重塑锚定分布,实现可控的经验支持扩展,从而在数学推理基准上同时提升了pass@1和pass@k性能。

Details

Motivation: 动机在于探究RLVR是否真正让大语言模型获得新的推理能力,还是仅仅提高了从基础模型中采样已有推理模式的效率。现有分析支持后者,认为标准RLVR目标的结构特性导致探索压力不足,限制了pass@k的提升。

Result: 在具有挑战性的数学推理基准测试中,SAGE框架在pass@1和pass@k指标上均取得了持续改进,表明其能有效平衡效率与覆盖范围。

Insight: 创新点在于识别出反向KL正则化是限制探索的关键结构约束,并提出通过引导函数重塑锚定分布来可控地扩展经验支持,从而解决奖励黑客攻击或概率质量分配到非目标区域的问题,实现了效率与覆盖的更好权衡。

Abstract: Recent studies observe that reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks, yet often fails to yield comparable gains in pass@k, raising the question of whether RLVR genuinely enables large language models to acquire novel reasoning abilities or merely enhances the efficiency of sampling reasoning modes already present in the base model. Prior analyses largely support the latter view, attributing this limitation to structural properties of standard RLVR objectives that result in insufficient exploration pressure. In this work, we argue that a central structural constraint arises from reverse-KL regularization, which stabilizes training but inherently anchors the policy to the reference distribution, thereby suppressing the emergence of alternative reasoning modes. However, we show that neither removing the KL term nor replacing it with forward-KL provides a satisfactory solution, as both disrupt the efficiency-coverage trade-off by either inducing reward hacking or allocating probability mass to off-target regions. To resolve this tension, we propose SAGE, a principled framework that enables controllable empirical support expansion by reshaping the reverse-KL anchor distribution itself through a guide function q(x,y), achieving consistent improvements in both pass@1 and pass@k across challenging mathematical reasoning benchmarks. Our code is available at https://github.com/tally0818/SAGE.


[132] Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels cs.LG | cs.AI | cs.CLPDF

Alexander Boesgaard Lorup

TL;DR: 本文提出了一种反事实似然测试方法,用于量化私有推理通道之间的间接影响。该方法通过替换上游私有块、固定公共令牌序列和下游目标,测量下游目标的负对数似然偏移,从而区分直接和间接影响。在7B角色通道推理模型上的验证表明,该方法能有效分离未掩码和掩码条件,并识别出通过公共通道传播的间接影响。

Details

Motivation: 随着推理系统将中间计算分离为私有和公共通道,评估案例在记录中看起来相似,包括独立共推导、直接访问私有内容以及通过公共通信的间接影响。本文旨在解决如何准确测量私有推理通道之间间接影响的问题。

Result: 在7B角色通道推理模型上的验证中,反事实似然测试成功分离了未掩码和掩码条件,而文本探针(如n-gram重叠)不可靠。反向B到A的影响接近零,而A到B的影响通过公共语音隐藏状态持续存在。多检查点验证(三个检查点、五个种子和13,734个有效方向对比)复制了这种不对称性。

Insight: 创新点在于提出反事实似然测试作为测量私有通道间接影响的实用默认方法,并通过长度匹配控制RoPE位置混淆。研究强调私有通道评估应分别报告直接和间接影响,为推理系统的隐私评估提供了新工具。

Abstract: Reasoning systems increasingly separate intermediate computation into private and public channels, creating evaluation cases that look similar in transcripts: independent co-derivation, direct access to private content, and indirect influence through public communication. This paper presents a counterfactual likelihood test for measuring influence between private reasoning channels. The method replaces an upstream private block with a length-matched donor block, holds the public token sequence and downstream target fixed, and measures the downstream target’s negative-log-likelihood shift. On a 7B role-channel reasoning model used for validation, textual probes are unreliable: raw n-gram overlap overstates leakage, corrected overlap remains noisy, and canary reproduction reports no discrimination. Counterfactual likelihood separates unmasked and masked conditions, while length matching controls a RoPE positional confound. In the hardened masked validation, reverse B-to-A influence is near zero, while A-to-B influence persists through public-speech hidden states. A multi-checkpoint validation across three checkpoints, five seeds, and 13,734 valid directional contrasts replicates this asymmetry. A graph-separation control that blocks private-to-public carrier edges produces bit-identical natural and counterfactual scores across all 13,734 control evaluations, identifying the tested public-channel pathway as the complete carrier of the measured counterfactual signal under the implemented role-visibility mask. The results show that private-channel evaluation should report direct and indirect influence separately, and that counterfactual likelihood probes provide a practical default for measuring these boundaries.


[133] EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data cs.LG | cs.AI | cs.CL | cs.CVPDF

Dongyan Lin, Phillip Rust, Angel Villar Corrales, Alvin W. M. Tan, Mahi Luthra

TL;DR: 该论文提出了EgoBabyVLM,一个用于评估跨模态学习能力的基准测试套件,特别关注从自然主义的第一人称视角视频数据中学习语言基础的能力。研究通过在不同语义对齐程度的视频数据集上训练视觉语言模型,并引入自动生成的Machine-DevBench基准来评估模型在词汇和语法能力上的表现,揭示了当前模型在弱对齐数据上的局限性。

Details

Motivation: 当前基于网络精选数据训练的大型视觉语言模型在泛化到由可穿戴设备、具身智能体和婴儿头戴摄像机产生的稀疏、弱对齐的第一人称视频流时存在困难,且缺乏固定的评估流程来衡量这一领域的进展。

Result: 研究结果表明,当前的视觉语言模型范式严重依赖于精选数据的紧密语义对齐,无法有效利用自然主义第一人称输入中占主导地位的弱对齐信号。论文引入了EgoBabyVLM挑战来推动模型发展。

Insight: 核心创新点在于提出了一个全面的评估套件,特别是自动生成的Machine-DevBench基准,它通过模型训练词汇的对数频率分箱来消除训练/评估不匹配问题,并解决了先前发展基准统计功效低的问题,为从婴儿式自然数据中学习提供了新的评估标准。

Abstract: Children acquire language grounding with remarkable robustness from limited visuo-linguistic input in ways that surpass today’s best large multimodal models. Recent research suggests current vision-language models (VLMs) trained on curated web data fail to generalize to the sparse, weakly-aligned egocentric streams produced by wearable devices, embodied agents, and infant head-cams – and no fixed evaluation pipeline exists for measuring progress on this regime. We train VLMs on datasets with varying degrees of semantic alignment between visual and linguistic inputs, including naturalistic infant and adult egocentric videos, and evaluate them with a comprehensive suite spanning multimodal language grounding and unimodal vision and language tasks. At the core of this suite is Machine-DevBench, a corpus-grounded benchmark of lexical and grammatical competence, automatically generated from the model’s training vocabulary across logarithmic frequency bins to eliminate the train/eval mismatch and low statistical power of prior developmental benchmarks. Our results show that current VLM paradigms hinge on the tight semantic alignment of curated data and fail to exploit the weakly-aligned signal that dominates naturalistic egocentric input – the very regime in which humans thrive. To motivate progress, we introduce the EgoBabyVLM Challenge to drive the development of models capable of grounded language learning from the kind of naturalistic data that human infants experience.


[134] CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization cs.LG | cs.CL | cs.CVPDF

Ahmed Heakl, Abdelrahman M. Shaker, Youssef Mohamed, Rania Elbadry, Omar Fetouh

TL;DR: 本文提出了一种名为对比证据策略优化(CEPO)的新方法,用于强化学习与可验证奖励(RLVR)中的自蒸馏。该方法通过引入错误答案教师,对比正确答案与错误答案对每个生成令牌的偏好,从而更精确地识别出关键的推理步骤与无关的填充内容,以改善奖励信号的分配。

Details

Motivation: 在RLVR中,模型生成正确答案时,所有令牌都会收到相同的奖励信号,这无法区分关键的推理步骤与语法填充内容。现有方法要么因答案泄露而破坏训练,要么产生的信号太弱而无法有效区分。

Result: 在五个多模态数学推理基准测试上,CEPO在2B和4B规模下分别取得了43.43%和60.56%的平均准确率,优于相同训练预算下GRPO的41.17%和57.43%。分布匹配自蒸馏方法(OPSD, SDPO)的表现甚至低于未训练基线,验证了理论预测的信息泄露问题。

Insight: 核心创新在于提出了一个更尖锐的对比问题,即同时利用正确答案的偏好和错误答案的厌恶来识别真正的推理步骤。该方法通过重用训练批次中已有的被拒绝轨迹来构建错误答案教师,无需额外采样成本,并在理论上保证了结构安全性,同时严格锐化了关键令牌的信用分配。

Abstract: When a model produces a correct solution under reinforcement learning with verifiable rewards (RLVR), every token receives the same reward signal regardless of whether it was a decisive reasoning step or a grammatical filler. A natural fix is to condition the model on the correct answer as a teacher, identifying tokens it would have generated differently had it known the answer. Prior work shows this either corrupts training by leaking the answer into the gradient, or produces a weak signal that cannot distinguish decisive steps from filler, since both look equally surprising relative to the model’s baseline. We propose Contrastive Evidence Policy Optimization (CEPO), which asks a sharper question at every token: not just “does the correct answer favor this token?” but “does the correct answer favor it while the wrong answer disfavors it?” A token satisfying both is a genuine reasoning step; one satisfying neither is filler. The wrong-answer teacher is constructed from rejected rollouts already in the training batch, incurring no additional sampling cost. We prove CEPO inherits all structural safety guarantees of the prior state of the art while strictly sharpening credit at decisive tokens, with the improvement vanishing exactly at filler positions. Empirically, CEPO achieves 43.43% and 60.56% average accuracy across five multimodal mathematical reasoning benchmarks at 2B and 4B scale, respectively, versus 41.17% and 57.43% for GRPO under identical training budgets. Distribution-matching self-distillation methods (OPSD, SDPO) fall below the untrained baseline, empirically confirming the information leakage our theory predicts. Our code is available at https://github.com/ahmedheakl/CEPO.


[135] OScaR: The Occam’s Razor for Extreme KV Cache Quantization in LLMs and Beyond cs.LG | cs.CLPDF

Zunhai Su, Rui Yang, Chao Zhang, Yaxiu Liu, Yifan Zhang

TL;DR: 本文提出OScaR框架,用于解决大语言模型和多模态大模型中KV缓存的内存瓶颈问题。该框架通过Canalized Rotation和Omni-Token Scaling技术,有效缓解了极端量化下由Token Norm Imbalance引起的误差,实现了高效且近乎无损的INT2量化压缩。

Details

Motivation: 随着长上下文推理和多模态智能的发展,KV缓存的内存占用成为高效部署的主要瓶颈。现有逐通道量化方法在极端压缩下效果不佳,其根本限制在于Token Norm Imbalance(TNI)导致的量化误差放大。

Result: 在多种X-LLM(纯文本、多模态、全模态大模型)上的广泛评估表明,OScaR在INT2量化下实现了近乎无损的性能,优于现有方法并定义了新的帕累托前沿。与BF16 FlashDecoding-v2基线相比,OScaR实现了高达3.0倍的解码加速、5.3倍的内存占用减少和4.1倍的吞吐量提升。

Insight: 创新点在于从理论和实证角度识别了TNI是量化保真度的主要瓶颈,并提出了轻量级的Canalized Rotation和Omni-Token Scaling技术来有效缓解序列维度方差。该框架避免了复杂的量化流水线,提供了一个鲁棒、低复杂度且通用的解决方案,并辅以优化的系统设计和CUDA内核实现。

Abstract: The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant memory bottleneck for efficient deployment. While the established per-channel quantization effectively accommodates intrinsic channel-wise outliers in Key tensors, its efficacy diminishes under extreme compression. In this work, we revisit the inherent limitations of the per-channel quantization paradigm from both empirical and theoretical perspectives. Our analysis identifies Token Norm Imbalance (TNI) as the primary bottleneck to quantization fidelity. We demonstrate that TNI systematically amplifies errors when shared quantization parameters are required to span token groups exhibiting substantial norm disparities. Instead of relying on intricate quantization pipelines (e.g., TurboQuant), we propose OScaR (Omni-Scaled Canalized Rotation), an accurate and lightweight KV cache compression framework for X-LLMs (i.e., text-only, multi-modal, and omni-modal LLMs). Advancing the per-channel paradigm, OScaR employs Canalized Rotation followed by Omni-Token Scaling to mitigate TNI-induced sequence-dimensional variance both effectively and efficiently, further supported by our optimized system design and CUDA kernels. Extensive evaluations across X-LLMs show that OScaR consistently outperforms existing methods and achieves near-lossless performance under INT2 quantization, establishing it as a robust, low-complexity, and universal framework that defines a new Pareto front. Compared with the BF16 FlashDecoding-v2 baseline, our OScaR implementation achieves a notable up to 3.0x speedup in decoding, reduces memory footprint by 5.3x, and increases throughput by 4.1x. The code for OScaR is publicly available at https://github.com/ZunhaiSu/OScaR-KV-Quant.


[136] A Measure-Theoretic Analysis of Reasoning: Structural Generalization and Approximation Limits cs.LG | cs.AI | cs.CC | cs.CLPDF

Yuyang Zhang, Yifu Zhang, Xuehai Zhou, Xiaoyin Chen

TL;DR: 本文通过最优传输理论形式化LLM推理过程,将离散轨迹投影到连续度量空间,利用Wasserstein-1距离量化领域偏移。基于Kantorovich对偶性,通过架构的Lipschitz连续性和函数逼近极限来界定OOD泛化边界,揭示了位置依赖注意力机制与移位不变机制在泛化能力上的本质差异,并证明了Transformer电路深度在解决组合搜索问题中的必要性。

Details

Motivation: 尽管LLM推理的实证缩放定律已被充分记录,但控制分布外泛化的理论机制仍不明确。本文旨在通过度量理论分析,揭示Transformer架构在结构泛化中的根本约束与逼近极限。

Result: 在54种Transformer配置上的组合搜索实验验证了理论边界,表明泛化风险随Wasserstein领域偏移单调退化,且移位不变机制(如旋转嵌入)相比位置依赖编码(如绝对位置编码)能显著降低预期风险。

Insight: 创新点在于将推理过程形式化为最优传输问题,并利用Kantorovich对偶性导出泛化边界;核心发现包括:位置编码的移位不变性是保证泛化能力的关键,以及物理层深度(而非仅宽度)对避免表示坍缩具有不可替代性,这为Transformer架构设计提供了理论指导。

Abstract: While empirical scaling laws for LLM reasoning are well-documented, the theoretical mechanisms governing out-of-distribution (OOD) generalization remain elusive. We formalize reasoning via optimal transport, projecting discrete trajectories into a continuous metric space to quantify domain shifts using the Wasserstein-1 distance. Invoking Kantorovich duality, we bound OOD generalization via architectural Lipschitz continuity and functional approximation limits. This exposes two primary constraints. First, position-dependent attention (e.g., Absolute Positional Encoding) fails to preserve shift invariance, yielding an $Ω(1)$ Lipschitz constant and expected risk, whereas shift-invariant mechanisms (e.g., Rotary Embeddings) preserve equivariance and bound the error. Second, by mapping sequential backtracking to a Dyck-$k$ language, we establish a strict circuit depth lower bound for $\text{TC}^0$ Transformers. Scaling physical layer depth is necessary to avert representation collapse – a constraint that scaling representation width cannot bypass due to irreducible approximation bounds in Barron spaces. Evaluations across 54 Transformer configurations on combinatorial search corroborate these bounds, demonstrating that generalization risk degrades monotonically with the Wasserstein domain shift.


[137] Spectral Gradient Surgery for Domain-Generalizable Dataset Distillation cs.LG | cs.CVPDF

Minyoung Oh, Najeong Chae, Jae-Young Sim

TL;DR: 本文提出了一种用于领域可泛化数据集蒸馏(DGDD)的新方法——谱梯度手术(SGS),旨在解决传统数据集蒸馏方法在分布外(OOD)数据上泛化能力不足的问题。该方法通过分析源领域梯度的谱域一致性,分离出类别判别性和领域特定性信息,并利用两种互补梯度更新来增强合成数据集的泛化性和多样性。

Details

Motivation: 传统数据集蒸馏(DD)假设训练和测试数据同分布,这在实践中很少成立。直接将领域泛化(DG)技术应用于蒸馏后的数据效果不佳,因为现有DG方法依赖真实数据的自然多样性,而合成数据集本身缺乏这种多样性,且额外的数据增强开销与数据集蒸馏的效率目标相冲突。

Result: 在多个不同规模的基准测试上进行的大量实验表明,SGS方法显著提升了分布外(OOD)泛化性能,同时保持了与现有分布匹配(DM)方法的即插即用兼容性。

Insight: 核心创新在于将OOD泛化能力不足归因于压缩合成数据集中类别判别性与领域特定性信息的纠缠,并提出在谱域分析梯度一致性来解耦这两种信息。通过强化跨领域共享的梯度成分并显式促进蒸馏数据集内部的多样性,实现了无需复杂数据增强的高效领域泛化蒸馏。

Abstract: Dataset Distillation (DD) synthesizes a compact synthetic dataset that preserves the training utility of a full dataset. However, its standard formulation assumes that test data follow the same distribution as training data, an assumption that rarely holds in practice. A straightforward extension-applying post-hoc Domain Generalization (DG) techniques to distilled data-is ill-suited because existing DG methods rely on the natural diversity of real datasets, which compact synthetic sets inherently lack, while also incurring substantial augmentation overhead that conflicts with the efficiency objective of dataset distillation. To address this limitation, we introduce Domain Generalizable Dataset Distillation (DGDD), a new problem setting that explicitly targets out-of-distribution (OOD) generalization of distilled datasets. We study this problem through a widely adopted DD baseline of Distribution Matching (DM). We attribute the OOD vulnerability of DM to the entanglement of class-discriminative and domain-specific information within the compressed synthetic set, and propose Spectral Gradient Surgery (SGS) to disentangle the two. The key insight of SGS is that cross-domain agreement among domain-wise gradients in the spectral domain reveals which gradient components are shared across source domains-and are therefore class-discriminative-and which are domain-specific. Based on this observation, SGS augments the standard DM update with two complementary gradients: one that reinforces cross-domain shared components and another that explicitly promotes diversity within the distilled dataset. Extensive experiments on diverse-scale benchmarks demonstrate that SGS substantially improves OOD generalization while remaining plug-and-play compatible with existing DM methods.


[138] INAR-VL: Input-Aware Routing for Edge-Cloud Vision-Language Inference cs.LG | cs.CV | cs.DCPDF

Ahmed Šabanović, Paul Joe Maliakel, Ivona Brandić

TL;DR: INAR-VL是一个轻量级边缘-云路由系统,用于视觉语言模型(VLM)的双层部署。它通过分析图像和文本的复杂度信号,动态地将简单查询路由到边缘设备执行,复杂查询卸载到云端,从而在延迟、能耗和准确性之间取得平衡。

Details

Motivation: 解决边缘部署视觉语言模型时面临的延迟与准确性权衡问题:云端执行准确但延迟高、能耗大,边缘执行快速但模型能力有限导致准确性下降,且静态部署无法适应图像质量和推理复杂度的异构性。

Result: 在视觉问答任务上的评估表明,INAR-VL将36%的请求在边缘执行,降低了24%的延迟和26%的能耗,同时保持了97%的云端级别准确性。

Insight: 创新点在于利用轻量级的图像和文本复杂度信号进行动态路由和模型选择,实现输入感知的自适应卸载,而非静态分配,这为边缘-云协同推理提供了高效的调度策略。

Abstract: Edge deployment of Vision-Language Models (VLMs) faces a tradeoff between latency and accuracy: cloud execution provides high-quality predictions but incurs communication delay and energy cost, while edge-only execution is faster but less accurate due to limited model capacity. This trade-off is further complicated by heterogeneity in image quality and reasoning complexity, making static placement suboptimal. We present INAR-VL, a lightweight edge-cloud routing system for multimodal inference in a two-tier deployment. INAR-VL maintains complementary VLMs across edge and cloud and uses lightweight image and text complexity signals to guide routing and model selection, executing simple queries locally while offloading complex ones when beneficial. Evaluation on visual question answering shows that INAR-VL executes 36% of requests on the edge, reduces latency by 24%, lowers energy by 26%, and preserves 97% of cloud-level accuracy.


[139] Navigating the Emotion Tree: Hierarchical Hyperbolic RAG for Multimodal Emotion Recognition cs.LG | cs.CVPDF

Zeheng Wang, Bo Zhao, Yijie Zhu, Zhishu Liu, Hui Ma

TL;DR: 该论文提出了一种名为HyperEmo-RAG的检索增强生成框架,用于解决多模态情感识别任务中忽视情感层次结构、易受噪声影响的问题。该框架通过将层次化情感标签和多模态样本嵌入到双曲空间,并设计分层检索与结构化知识注入机制,显著提升了细粒度情感分类性能。

Details

Motivation: 现有多模态大语言模型通常将情感类别视为独立标签,忽略了心理学中丰富的情感层次分类体系,并且缺乏外部上下文知识,容易过度解读噪声线索,导致细粒度情感分类困难。

Result: 在多个数据集上的实验表明,HyperEmo-RAG显著优于现有方法,实现了最先进的(SOTA)性能。

Insight: 创新点在于将情感层次结构与双曲空间嵌入相结合进行分层检索,并通过构建证据图和设计Tree-Aware Attention、EmotionGraphFormer等机制,将结构化知识作为显式认知上下文注入LLM,有效利用了情感分类学的先验知识并增强了模型的推理鲁棒性。

Abstract: Multimodal emotion recognition aims to integrate text, audio, and video sources to understand human affective states. Although multimodal large language models excel at multimodal reasoning, they typically treat emotion categories as independent labels, ignoring the rich hierarchical taxonomy of human psychology. Moreover, lacking external contextual knowledge makes them highly susceptible to over-interpreting noisy cues, further complicating fine-grained emotion classification. To address these issues, we propose \textbf{HyperEmo-RAG}, a retrieval-augmented generation framework that leverages a structured emotional knowledge base. Our framework introduces two key innovations. 1) Hierarchical hyperbolic grounding. Recognizing the inherent hierarchical tree structure of emotion taxonomies, we jointly embed hierarchical emotion labels and multimodal samples into a continuous hyperbolic space (Poincaré ball) and design a hierarchical beam-search deliberation process that progressively retrieves samples from coarse to fine-grained levels. 2) Structured evidence injection. Based on the retrieved evidence, we construct an evidence graph and inject the structured knowledge as explicit cognitive context into the LLM through a Tree-Aware Attention mechanism and an EmotionGraphFormer, preserving the integrity of graph-structured information. Experiments on multiple datasets demonstrate that HyperEmo-RAG significantly outperforms existing methods.


[140] Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era cs.LG | cs.CVPDF

Qiuhe Hong, Yuyang Liu, Shuo Yang, Tiantian Peng, Fei Zhu

TL;DR: 本文提出了一种名为基于推理的动态平衡持续学习(RDB-CL)的新方法,用于指导多模态大语言模型(MLLMs)在强化学习与可验证奖励(RLVR)范式下的持续学习。该方法通过形式化‘可移植性’这一样本级度量,利用推理层面的信号动态调整KL正则化强度,以在适应新任务时更好地平衡知识保留与探索。

Details

Motivation: 在MLLMs与RLVR结合的新兴范式中,需要一种新模式来指导模型的持续适应,以在不断学习新多模态任务的同时有效保留先验知识。

Result: 实验表明,RDB-CL方法在持续学习任务上一致优于基线方法,相比原始的RLVR基线,其最终准确率(Last accuracy)提升了+12.0%。

Insight: 核心创新在于提出了‘推理可移植性’(Reasoning Portability)的概念,并实证了推理层面信号相比答案层面信号在分布外样本上更可靠;据此设计的RDB-CL框架能根据样本的可移植性动态调整正则化约束,实现对可复用推理的保留与新推理路径的探索之间的平衡。

Abstract: Vision-Language Models in Continual Learning (VLM-CL) aim to continuously adapt to new multimodal tasks while retaining prior knowledge. The emerging paradigm that couples Multimodal Large Language Models (MLLMs) with Reinforcement Learning with Verifiable Rewards (RLVR) calls for a new pattern to guide continual adaptation. Advances in reasoning capability now make it feasible to impose constraints at the reasoning level. We formalize portability, a sample-level measure of how reusable the previous policy’s behavior is on a new task, and empirically show that reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not. We instantiate this as Reasoning Portability (RP) and propose Reasoning-based Dynamic Balance Continual Learning (RDB-CL), which modulates the per-sample Kullback-Leibler regularization in RLVR according to RP: a tight anchor preserves reusable reasoning on high-RP samples, while a relaxed anchor on low-RP samples permits exploration of new reasoning pathways. Experiments show that RDB-CL consistently outperforms baselines, improving Last accuracy by +12.0% over the vanilla RLVR baseline.


cs.GR [Back]

[141] PiG-Avatar: Hierarchical Neural-Field-Guided Gaussian Avatars cs.GR | cs.CVPDF

Julian Kaltheuner, Jan Spindler, Sina Kitz, Patrick Stotko, Reinhard Klein

TL;DR: PiG-Avatar提出了一种新的高斯化身方法,通过将化身表示为锚定在由连续神经场控制的体积规范空间中的高斯分布,解耦了表示与模板拓扑。该方法使用参数化人体模型仅进行运动学传输,并通过3D重心锚点传输保持运动连贯性,同时引入双级空间相干优化以实现高效处理。

Details

Motivation: 现有高斯化身方法通常将几何参数化在身体模板表面上,这限制了分层、离体和非刚性服装几何的捕捉。本文旨在解决这一限制,将表示与模板拓扑解耦。

Result: 在包含复杂服装和具有挑战性非刚性运动的受试者的基准测试中,PiG-Avatar实现了最先进的渲染质量,对不完美的身体模型初始化具有鲁棒性,并在所有细节级别上实现实时渲染。

Insight: 核心创新在于使用神经场控制的体积规范空间来锚定高斯分布,从而解耦表示与模板拓扑;通过3D重心锚点传输和双级空间相干优化(结合Sobolev预条件神经场更新和基于KNN的预条件),实现了锚点密度的自组织,使复杂服装几何作为自然的高保真输出出现。

Abstract: Existing Gaussian avatar methods typically parameterize geometry on a body-template surface, which entangles the avatar’s representation space with the template’s deformation space and limits the capture of layered, off-body, and non-rigid clothing geometry. We present PiG-Avatar, which addresses this limitation by using the parametric body model solely for kinematic transport, while representing the avatar as Gaussians anchored in a volumetric canonical space governed by a continuous neural field. This decouples representation from template topology, avoiding the geometric constraints of surface-based parameterizations. Kinematic coherence is maintained through 3D barycentric anchor transport, which guides motion without constraining geometry and allows anchors to deviate freely from the template surface, yielding dense, stable temporal surface correspondences by construction. To make this unconstrained formulation tractable, we introduce dual-level spatially coherent optimization, combining Sobolev-preconditioned neural-field updates with a novel KNN-based preconditioning of canonical anchor geometry. Together, these mechanisms induce an emergent self-organization of anchor density: anchors migrate toward regions of high curvature, appearance variation, and non-coherent motion without explicit heuristics. As a result, complex clothing geometry and layered surfaces emerge as natural, high-fidelity outputs. This single representation further supports hierarchical reconstruction across multiple levels of detail, with coarse-level supervision propagating to finer levels through the shared field and coupled anchor graph. On established benchmarks featuring subjects with complex clothing and challenging non-rigid motion, PiG-Avatar achieves state-of-the-art rendering quality, generalizes robustly to imperfect body model initialization, and renders in real time across all detail levels.