Table of Contents
- cs.CL [Total: 21]
- cs.CV [Total: 73]
- eess.IV [Total: 2]
- cs.AI [Total: 5]
- cs.IR [Total: 1]
- cs.RO [Total: 6]
- cs.LG [Total: 9]
cs.CL
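[1] DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence cs.CL | cs.AI
DeepSeek-AI, Anyi Xu, Bangcai Lin, Bing Xue, Bingxuan Wang
TL;DR: This paper presents two preview models in the DeepSeek-V4 series: DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated). Both Mixture-of-Experts models support a one-million-token context and achieve high long-context efficiency through a hybrid attention architecture, Manifold-Constrained Hyper-Connections, and the Muon optimizer. The models are pre-trained on more than 32T high-quality tokens and then post-trained to unlock further capabilities; the DeepSeek-V4-Pro-Max mode redefines the state of the art for open models on core tasks.
Details
Motivation: To build models that handle one-million-token contexts efficiently, addressing the heavy compute and memory cost of long-sequence inference in conventional models and making long-horizon tasks and test-time scaling more feasible.
Result: DeepSeek-V4-Pro-Max outperforms its predecessors on core tasks, redefining the state of the art for open models. In the one-million-token setting, DeepSeek-V4-Pro needs only 27% of the single-token inference FLOPs and 10% of the KV cache of DeepSeek-V3.2.
Insight: The key contributions are a hybrid attention architecture that combines Compressed Sparse Attention and Heavily Compressed Attention to improve long-context efficiency, Manifold-Constrained Hyper-Connections that enhance conventional residual connections, and the Muon optimizer for faster convergence and more stable training. Together these designs cut the compute and memory cost of long-context inference while preserving strong performance.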
Abstract: We present a preview version of DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models – DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated) – both supporting a context length of one million tokens. DeepSeek-V4 series incorporate several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold-Constrained Hyper-Connections (mHC) that enhance conventional residual connections; (3) and the Muon optimizer for faster convergence and greater training stability. We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, redefines the state-of-the-art for open models, outperforming its predecessors in core tasks. Meanwhile, DeepSeek-V4 series are highly efficient in long-context scenarios. In the one-million-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2. This enables us to routinely support one-million-token contexts, thereby making long-horizon tasks and further test-time scaling more feasible. The model checkpoints are available at https://huggingface.co/collections/deepseek-ai/deepseek-v4.
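The efficiency figures in the abstract are easier to compare once turned into ratios. The following is a minimal sketch of that arithmetic; only the numbers come from the abstract, while the dictionary layout and function name are illustrative assumptions.

```python
# Parameter counts in billions, as quoted in the abstract.
# The dictionary keys and helper name are illustrative, not from the paper.
MODELS = {
    "DeepSeek-V4-Pro": {"total_b": 1600.0, "active_b": 49.0},
    "DeepSeek-V4-Flash": {"total_b": 284.0, "active_b": 13.0},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters activated per token in a Mixture-of-Experts model."""
    return active_b / total_b


fractions = {name: active_fraction(**m) for name, m in MODELS.items()}

# Cost of DeepSeek-V4-Pro relative to DeepSeek-V3.2 at one million tokens.
flops_ratio = 0.27
kv_ratio = 0.10
flops_reduction = 1 / flops_ratio  # how many times fewer FLOPs per token
kv_reduction = 1 / kv_ratio        # how many times smaller the KV cache
```

Under these figures both models activate only a few percent of their parameters per token, and the 10% KV-cache figure corresponds to a tenfold reduction.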
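[2] How LLMs Fail and Generalize in RTL Coding for Hardware Design? cs.CL | cs.AI | cs.PL
Guan-Ting Liu, Chao-Han Huck Yang, Chenhui Deng, Zhongzhi Yu, Brucek Khailany
TL;DR: This paper studies how LLMs fail and generalize in RTL coding for hardware design, focusing on the bottleneck models hit when translating sequential programming priors into parallel temporal logic. Using an error taxonomy grounded in problem solvability, it finds that frontier models plateau at a 90.8% initial pass rate on the VerilogEval benchmark, and that functional errors which test-time compute scaling cannot resolve are the main limitation.
Details
Motivation: To address the fundamental bottleneck LLMs face when translating sequential programming logic into the parallel temporal logic of hardware design, and to examine their failure modes and generalization limits in RTL coding.
Result: On VerilogEval, frontier models plateau at a 90.8% initial pass rate. Optimization eliminates syntax errors but exacerbates deeper functional errors. Repeated sampling can patch solvable errors, but RTL coding capacity remains strictly bounded by pretraining knowledge.
Insight: The contribution is a solvability-based error taxonomy inspired by cognitive theory (syntactic, semantic, solvable functional, and unsolvable functional errors). The paper argues that alignment techniques merely teach models to compile, and that current LLM-based hardware generation pipelines need more research on model reasoning rather than alignment interventions.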
Abstract: Translating sequential programming priors into the parallel temporal logic of hardware design remains a crucial bottleneck for large language models (LLMs). To investigate this, we introduce a new error taxonomy grounded in problem solvability, inspired by cognitive theory. Our taxonomy categorizes failures into syntactic, semantic, solvable functional, and unsolvable functional types. Evaluations reveal a strict empirical ceiling on the VerilogEval benchmark, as frontier models plateau at a 90.8% initial pass rate. These plateaus are defined by unsolvable functional errors, exposing persistent knowledge gaps immune to test-time compute scaling. Furthermore, we expose a striking surface convergence gap: optimization readily eliminates syntax errors but concurrently exacerbates deeper functional failures. Our findings demonstrate that alignment techniques merely teach models to compile. While repeated sampling strategies can patch solvable errors, register-transfer level (RTL) coding capacity remains strictly bounded by pretraining knowledge. Addressing challenges in the current LLM-based hardware generation pipeline requires more studies in model reasoning rather than alignment interventions.
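[3] Disentangling Linguistic Relatedness from Task Alignment in Cross-Lingual Transfer cs.CL | cs.AI
Ahmed Haj Ahmed, Ruochen Zhang, Alvin Grissom
TL;DR: This paper studies cross-lingual transfer by fine-tuning seven large language models (4B–671B parameters) on Arabic and evaluating zero-shot reading comprehension on Semitic languages and non-Semitic controls. Across dense and Mixture-of-Experts architectures it finds no evidence of Semitic-specific transfer: weak-baseline models improve substantially on all languages, while strong-baseline models show only marginal gains regardless of language family. A chain-of-thought ablation further shows that the models that gain most from fine-tuning gain equally from inference-time chain-of-thought, suggesting that both mechanisms mainly address task-format alignment rather than cross-lingual knowledge transfer.
Details
Motivation: To separate the roles of linguistic relatedness (for example, within the Semitic family) and task alignment (for example, the reading-comprehension format) in cross-lingual transfer, and so clarify what actually drives cross-lingual gains in LLMs.
Result: After fine-tuning on Arabic, models behave similarly on zero-shot reading comprehension in Semitic and non-Semitic languages: weak-baseline models improve dramatically across all languages, while strong-baseline models show only marginal gains. Chain-of-thought reasoning at inference time benefits the same models, following the same pattern as fine-tuning.
Insight: Through systematic experiments and a chain-of-thought ablation, the paper shows that cross-lingual gains may stem mainly from the model learning task format and instruction alignment rather than from knowledge transfer based on linguistic relatedness. This informs both the understanding of cross-lingual mechanisms in LLMs and the design of efficient transfer methods.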
Abstract: We study cross-lingual transfer by fine-tuning seven large language models (4B–671B parameters) on Arabic and evaluating zero-shot reading comprehension on Semitic languages and non-Semitic controls. Across dense and Mixture-of-Experts architectures, we find no evidence of Semitic-specific transfer: models with weak baselines improve dramatically across all languages, while strong-baseline models show only marginal gains regardless of language family. A chain-of-thought ablation reinforces this finding – the same models that benefit most from fine-tuning benefit equally from inference-time reasoning, suggesting both mechanisms address task-format alignment rather than cross-lingual knowledge transfer.
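[4] Where to Place the Query? Unveiling and Mitigating Positional Bias in In-Context Learning for Diffusion LLMs via Decoding Dynamics cs.CL | cs.AI
Zhengheng Li, Panrui Li, Xuyang Liu, Puzhi Xia
TL;DR: This paper systematically studies query-position bias in in-context learning (ICL) for diffusion large language models (dLLMs), showing that query position is a first-order variable whose effect on generation quality is on par with the semantic quality of the examples. An analysis of decoding dynamics traces this sensitivity to a spatial "Recency Effect" in attention flow and to task-dependent shifts in decoding trajectories. The paper proposes Auto-ICL, an adaptive routing strategy that optimizes query placement dynamically without ground-truth labels.
Details
Motivation: Current ICL practice for dLLMs usually inherits the trailing-query templates of autoregressive LLMs and overlooks the structural shift introduced by bidirectional attention, so query-position bias has been neither fully explored nor addressed.
Result: Across heterogeneous reasoning and perception tasks, the proposed Auto-ICL strategy robustly approaches oracle performance and mitigates positional instability.
Insight: The paper establishes query position as a first-order variable in dLLMs, proposes Average Confidence, a new metric that tracks the iterative decoding process, and designs a training-free adaptive routing strategy that optimizes the ICL template dynamically, laying the groundwork for spatial ICL baselines.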
Abstract: While In-Context Learning (ICL) is extensively studied in Autoregressive (AR) LLMs, its mechanism within Diffusion Large Language Models (dLLMs) remains largely unexplored. Unlike AR models restricted by unidirectional causal masking, dLLMs intrinsically utilize bidirectional attention, offering extensive spatial flexibility for query placement. Unfortunately, current practices conventionally inherit AR-style trailing-query templates, often overlooking the structural paradigm shift. This paper presents a comprehensive analysis unveiling that query position is actually a first-order variable in dLLMs. Through empirical decoupling, we demonstrate that positional variance impacts generation quality on par with example semantic quality. Internally, this positional sensitivity stems from a spatial “Recency Effect” in attention flow and task-dependent shifts in decoding trajectories. To mitigate this instability without ground-truth labels, we reveal that traditional single-step confidence ($C_{decoded}$) fails in dLLMs. Instead, we propose Average Confidence ($\overline{C}$), a novel metric tracking the iterative decoding process. By establishing the foundational spatial ICL baselines, we introduce Auto-ICL, a training-free adaptive routing strategy that dynamically optimizes query placement, robustly approaching oracle performance across heterogeneous reasoning and perception tasks.
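The contrast between single-step confidence and Average Confidence can be illustrated with a toy router. This is only a sketch under assumptions: the abstract does not give the exact definition of $\overline{C}$, so it is taken here as the plain mean of per-step confidences, and the placement names and traces are made up.

```python
from statistics import mean


def average_confidence(step_confidences):
    """Mean confidence over all decoding steps (an assumed form of C-bar)."""
    return mean(step_confidences)


def route_query(traces):
    """Pick the placement whose decoding trace has the highest average confidence.

    `traces` maps a hypothetical placement name to its per-step confidences.
    """
    return max(traces, key=lambda name: average_confidence(traces[name]))


# Made-up traces for illustration only; a real trace would come from the dLLM.
traces = {
    "query_first": [0.40, 0.55, 0.70, 0.90],
    "query_last": [0.20, 0.30, 0.45, 0.95],
}

chosen = route_query(traces)
# A single-step criterion looks only at the final step and can disagree.
single_step_choice = max(traces, key=lambda name: traces[name][-1])
```

In this toy case the final-step criterion prefers the trailing query, while the trajectory average prefers the leading one, which is the kind of disagreement that motivates tracking the whole decoding process.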
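[5] Pruning via Causal Attribution Preserves Reasoning Performance in Large Language Models cs.CL
Amogh Sheth, Biruk Assefa, Yi Wen Huang, Andrew Lin, Yuhao Ge
TL;DR: This paper proposes Causal Attribution Pruning (CAP), a training-free method for fine-grained weight pruning in large language models (LLMs). CAP identifies critical attention heads by measuring their causal impact on reasoning tasks over a small calibration set, then converts the head-level scores into weight-level importance values, reducing inference cost while preserving reasoning ability.
Details
Motivation: LLMs excel at multi-step reasoning but are expensive at inference time. Existing magnitude-based or activation-based pruning criteria do not directly capture the functional contribution of attention heads, so reasoning performance drops noticeably after pruning.
Result: CAP is evaluated on GSM8K, StrategyQA, and ARC-Challenge with Llama-3-8B-Instruct and Mistral-7B-Instruct at 10%, 20%, and 50% sparsity. At moderate sparsity (10-20%), CAP outperforms Wanda in most model-benchmark configurations, with relative accuracy gains of up to 61% on ARC-Challenge for Llama-3 at 20% sparsity.
Insight: The contribution is the use of interventional measurement (causal attribution) to assess each attention head's functional contribution to reasoning and to guide pruning, which preserves downstream reasoning performance better than correlational criteria. It offers a new angle on training-free model compression: focus on a component's causal effect rather than on statistical correlation.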
Abstract: Large language models (LLMs) excel at multi-step reasoning but incur substantial inference cost. We introduce Causal Attribution Pruning (CAP), a training-free method that identifies critical attention heads by measuring their causal impact on reasoning tasks and uses these head-level scores to guide fine-grained weight pruning. For each attention head, CAP estimates the expected performance degradation when the head is masked during forward passes on a small calibration set of reasoning problems. These causal scores are then converted into weight-level importance values for the corresponding projection matrices. Unlike magnitude-only or activation-based criteria, CAP’s interventional measurement directly captures each head’s functional contribution, yielding relative accuracy gains of up to 61% over Wanda on ARC-Challenge at 20% sparsity. We evaluate CAP on GSM8K, StrategyQA, and ARC-Challenge using Llama-3-8B-Instruct and Mistral-7B-Instruct at 10%, 20%, and 50% sparsity. At moderate sparsity (10-20%), CAP improves over Wanda in most model-benchmark configurations, with especially large gains on ARC-Challenge for Llama-3. Our results suggest that attention-head-level causal attribution can better preserve reasoning performance on downstream benchmarks than correlational pruning criteria at equivalent sparsity, while remaining limited by coarse MLP attribution at 50% sparsity.
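The two stages described in the abstract, masking each head to score it and then mapping head scores onto weights, can be sketched in a few lines. This is a toy illustration, not the authors' implementation: `toy_evaluate` stands in for calibration-set accuracy, and scaling each weight magnitude by its head's score is an assumed conversion rule that the abstract does not specify.

```python
def head_causal_scores(evaluate, num_heads):
    """Score each head by the drop in the metric when that head alone is masked."""
    baseline = evaluate(masked=frozenset())
    return [baseline - evaluate(masked=frozenset({h})) for h in range(num_heads)]


def weight_importance(weights_per_head, scores):
    """Assumed conversion: scale each weight magnitude by its head's causal score."""
    return [[abs(w) * s for w in ws] for ws, s in zip(weights_per_head, scores)]


# Toy stand-in for calibration accuracy: each head contributes a fixed amount.
CONTRIB = [0.30, 0.05, 0.15]


def toy_evaluate(masked):
    """Hypothetical metric: sum of the contributions of the unmasked heads."""
    return sum(c for h, c in enumerate(CONTRIB) if h not in masked)


scores = head_causal_scores(toy_evaluate, num_heads=3)
importance = weight_importance([[0.5, -2.0], [1.0, 1.0], [-0.2, 0.4]], scores)
```

Weights with the lowest importance would be pruned first; a weight in a causally important head can outrank a larger-magnitude weight in a head that barely affects the metric.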
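[6] Detecting Hallucinations for Large Language Model-based Knowledge Graph Reasoning cs.CL | cs.AI
Xinyan Zhu, Yaoqi Liu, Yue Gao, Huadong Ma, Cheng Yang
TL;DR: This paper proposes LUCID, a hallucination detection method designed for LLM-based knowledge graph reasoning frameworks. LUCID jointly uses LLM attention scores together with the semantic and structural information of the knowledge graph, integrating them with a graph neural network to detect errors in model output. Experiments on nine datasets show that LUCID achieves state-of-the-art performance against 15 baselines.
Details
Motivation: LLM-based knowledge graph (KG) reasoning frameworks are popular, but hallucination remains a serious problem: even with relevant KG information, models may still produce incorrect outputs and unreliable decisions. Existing hallucination detectors either inspect LLM internal states or verify consistency with the retrieved context, and both overlook the structural information in KGs, which leads to suboptimal performance.
Result: Evaluated on nine manually annotated benchmark datasets, LUCID achieves state-of-the-art performance against 15 baselines.
Insight: LUCID is presented as the first hallucination detection method for LLM-based KG reasoning frameworks. It combines LLM attention scores, KG semantics, and KG structure through a graph neural network, filling the gap left by methods that ignore KG structure.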
Abstract: Knowledge graph (KG) reasoning infers new knowledge from existing facts and is widely applied in question answering, recommendation, and decision support. With the rapid development of large language models (LLMs), LLM-based KG reasoning frameworks have become increasingly popular by leveraging retrieved KG information. However, hallucinations in LLMs remain a critical issue. Even when relevant KG knowledge is incorporated, models may still generate incorrect outputs, leading to misinformation and unreliable decisions. Existing hallucination detection methods either focus on LLM internal states or verify consistency with retrieved contexts, but both overlook the structural information in KGs, resulting in suboptimal performance. To address this gap, we propose LUCID, the first halLUcination deteCtIon method for LLM-based knowleDge graph reasoning frameworks. LUCID jointly leverages LLM attention scores, KG semantics, and structural information. Specifically, it extracts node and edge features from attention scores and semantic similarities, and integrates them with KG structure using a graph neural network. We also construct manually annotated benchmark datasets for evaluation. Experiments on nine datasets show that LUCID achieves state of the art performance compared to 15 baselines.
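[7] Granularity-Regulated Adaptive Computational Efficiency for Optimal Verification in Test-Time Scaling cs.CL | cs.LG
Ardit Krasniqi, Luan Vejsiu, Elira Dervishi
TL;DR: This paper proposes GRACE, a unified theoretical framework for the optimal granularity of verification in test-time scaling (TTS). The framework characterizes verification granularity as an explicit function of problem difficulty, verifier accuracy, and compute budget, and proves a phase transition: fine-grained verification dominates when the compute budget is large or the problem is hard, whereas coarse-grained verification is preferred in the low-budget, easy-problem regime. Building on this, the paper proposes an adaptive granularity strategy and shows on several mathematical reasoning benchmarks that it outperforms fixed-granularity baselines.
Details
Motivation: TTS improves LLM reasoning by spending extra compute at inference time, and the verifier is a central component. The fundamental question of which verification granularity is optimal under a given compute budget remains underexplored. Coarse-grained outcome reward models (ORMs) and fine-grained process reward models (PRMs) are two extremes, and neither alone is compute-optimal in every regime.
Result: Empirical results on MATH-500, GSM8K, and AIME corroborate all four theoretical claims. At matched compute, the adaptive strategy outperforms fixed-granularity baselines by up to 3.1% accuracy.
Insight: The core contribution is GRACE, a unified theoretical framework that formalizes the choice of verification granularity and reveals a phase transition tied to problem difficulty and compute budget. It gives a single Pareto-optimality view of Best-of-N, beam search, and step-level Monte Carlo tree search (MCTS), and motivates an adaptive strategy that provably reaches the compute-performance Pareto frontier.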
Abstract: Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning performance of large language models (LLMs) by investing additional compute at inference time. A central component of TTS is the verifier, which selects or scores candidate solutions to guide the search process. While prior work has explored the benefit of verification, a fundamental question remains underexplored: what is the optimal granularity of verification under a given compute budget? Coarse-grained outcome reward models (ORMs) and fine-grained process reward models (PRMs) represent two extremes, yet neither alone achieves compute-optimality across all regimes. In this paper, we establish a unified theoretical framework, called GRACE (Granularity-Regulated Adaptive Computational Efficiency), that characterizes the optimal verification granularity as an explicit function of problem difficulty, verifier accuracy, and compute budget. We prove that there exists a phase transition: fine-grained verification dominates when either the compute budget is large or the problem is hard, whereas coarse-grained verification is preferred in the low-budget, easy-problem regime. Our theory unifies Best-of-$N$, beam search, and step-level MCTS within a single Pareto-optimality framework, and motivates an adaptive granularity strategy that provably achieves the compute-performance Pareto frontier. Empirical results on MATH-500, GSM8K, and AIME benchmarks corroborate all four theoretical claims, with our adaptive strategy outperforming fixed-granularity baselines by up to 3.1% accuracy at matched compute.
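The trade-off between coarse and fine verification can be made concrete with a toy Best-of-N selector. This is a sketch under stated assumptions, not the GRACE framework itself: the step scores are invented, the outcome-level check is approximated by scoring only the final step, and the weakest-step rule for aggregating step scores is one common choice rather than the paper's.

```python
def verifier_calls(candidates, fine_grained):
    """Verifier invocations: one per step (PRM-style) or one per solution (ORM-style)."""
    if fine_grained:
        return sum(len(steps) for steps in candidates)
    return len(candidates)


def select(candidates, score_step, fine_grained):
    """Pick one candidate using step-level or outcome-level scores."""
    def score(steps):
        if fine_grained:
            return min(score_step(s) for s in steps)  # weakest step decides
        return score_step(steps[-1])                  # only the final step is checked
    return max(candidates, key=score)


# Hypothetical step scores in [0, 1]; a real verifier would be a learned model.
STEP_SCORES = {
    "setup": 0.9,
    "bad algebra": 0.1,
    "lucky answer": 0.95,
    "algebra": 0.8,
    "answer": 0.85,
}
candidates = [
    ["setup", "bad algebra", "lucky answer"],
    ["setup", "algebra", "answer"],
]

coarse_pick = select(candidates, STEP_SCORES.get, fine_grained=False)
fine_pick = select(candidates, STEP_SCORES.get, fine_grained=True)
coarse_cost = verifier_calls(candidates, fine_grained=False)
fine_cost = verifier_calls(candidates, fine_grained=True)
```

Here the coarse selector is cheaper but is fooled by a confident final answer built on a flawed step, while the fine selector catches the flaw at three times the verifier cost, which is the budget-versus-accuracy tension the abstract formalizes.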
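[8] Characterizing Narrative Content in Web-scale LLM Pretraining Data cs.CL
Teagan Johnson, Elliott Ash, Andrew Piper, Maria Antoniak
TL;DR: This paper presents the first fine-grained study of narrative features in Dolma, a large-scale LLM pretraining corpus. Drawing on narrative theory, it defines a framework covering three core narrative elements (agency, setting, and events) operationalized as 11 interpretable dimensions. After annotating 400 sampled passages, the authors fine-tune and validate NarraBERT, a RoBERTa-based model for fine-grained narrative prediction, and apply it to 3M passages to create the NarraDolma dataset. They find that narrative structure is measurable at scale across heterogeneous data, that web text has a continuous, multidimensional narrative structure, and that narrative qualities are unevenly distributed across pretraining sources and topics in ways current curation practices neither measure nor account for.
Details
Motivation: Narrative is a fundamental mode of human communication, yet the narrative composition of web-scale LLM pretraining corpora remains largely unexplored. The paper aims to fill this gap by studying how narrative features are distributed in pretraining data.
Result: The study builds the NarraDolma dataset and the NarraBERT model and analyzes 3M passages from Dolma, a 3-trillion-token corpus. It confirms that narrative structure is measurable across heterogeneous data and shows that narrative qualities are unevenly distributed across data sources and topics.
Insight: The paper is the first to apply a systematic narrative-theory framework to large-scale pretraining corpus analysis, and it develops a scalable model for fine-grained narrative prediction. This provides a basis for studying how data composition affects narrative reasoning tasks and exposes the limits of current curation methods in measuring narrative quality.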
Abstract: The narrative composition of web-scale LLM pretraining corpora remains largely unexplored even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token open pretraining corpus. Drawing on narrative theory, we design a framework spanning three core narrative elements (agency, setting, and events) operationalized as 11 interpretable dimensions. After sampling and annotating a diverse set of 400 passages, we finetune and validate NarraBERT, a RoBERTa-based model for fine-grained narrative prediction. We apply NarraBERT to 3M passages, resulting in a new dataset, NarraDolma. We find (i) narrative structure is measurable at scale across extremely heterogeneous data, (ii) we uncover a continuous, multidimensional narrative structure underlying web text, and (iii) narrative qualities are unequally distributed across pretraining sources and topics in ways that current curation practices neither measure nor account for. Our framework, dataset, and analyses provide a foundation for understanding how narrative qualities are distributed in LLM pretraining data and for studying how data composition affects narrative reasoning tasks. We publicly release NarraDolma and NarraBERT.
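[9] LaViSA: A Language and Vision Structural Ambiguity Benchmark cs.CL
Lee Sangmyeong, Shun Inadumi, Koichiro Yoshino
TL;DR: This paper introduces LaViSA, a benchmark for evaluating how well vision-language models (VLMs) use visual scenes to resolve structural ambiguity. The benchmark contains ambiguous sentences, their disambiguated versions, and corresponding images across seven ambiguity categories. An evaluation of diverse VLMs shows that recent models can use visual cues to some extent but still struggle with certain ambiguity types and with visually subtle semantic distinctions.
Details
Motivation: Structural ambiguity, where syntactic structure gives a single sentence several valid interpretations, is a fundamental challenge for language understanding. Visual scenes are useful cues for resolving such ambiguity, and VLMs need to be able to derive the possible semantic interpretations from them.
Result: Recent VLMs can use visual scenes to resolve structural ambiguity to some extent, but they still struggle with certain ambiguity types and with visually subtle semantic distinctions, which reveals the remaining limits of vision-based disambiguation.
Insight: LaViSA is a multimodal benchmark focused on VLMs' ability to resolve structural ambiguity, covering seven ambiguity categories with paired disambiguating images. It systematically exposes the weaknesses of current VLMs in fine-grained semantic reasoning that combines vision and language, giving a clear direction for model improvement.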
Abstract: Structural ambiguity arises when a single sentence admits multiple valid interpretations due to its syntactic structure, posing a fundamental challenge for language understanding. Visual scenes serve as useful cues for resolving such ambiguity, and Vision and Language Models (VLMs) need to be capable of deriving possible semantic interpretations from visual scenes. We introduce Language and Vision Structural Ambiguity (LaViSA), a benchmark designed to evaluate the ability of VLMs to resolve structural ambiguity leveraging visual scenes. LaViSA consists of ambiguous sentences, their disambiguated sentences, and corresponding images of these disambiguated sentences across seven ambiguity categories. Using LaViSA, we conduct a comprehensive evaluation of diverse VLMs, including both proprietary and open-source models with varying parameter scales and reasoning capabilities. Experimental results show that although recent VLMs can leverage visual scenes to resolve structural ambiguity to some extent, they still struggle with certain ambiguity types and visually subtle semantic distinctions, indicating remaining limitations in resolving structural ambiguity using visual scenes.
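[10] SAGE-OPD: Selective Agent-Guided Intervention for Multi-Turn On-Policy Distillation cs.CL
Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang, Bo Peng
TL;DR: This paper proposes SAGE-OPD, a selective intervention framework for multi-turn on-policy distillation (OPD). It observes environment feedback and teacher judgment to decide whether each student response should be skipped or intervened on, and weights token-level distillation by teacher confidence, addressing the compounding of early errors and the brittleness of standard dense supervision in multi-turn interaction.
Details
Motivation: Standard OPD works in single-turn settings, but realistic LLM agents interact over multiple turns. There, early errors alter future observations and compound along the trajectory, so dense token-level supervision becomes brittle: it may over-penalize semantically valid alternatives, reinforce local degeneracies, and propagate unreliable teacher supervision.
Result: On agent tasks, SAGE-OPD consistently outperforms baselines, with up to a 13.3% relative improvement over standard OPD in ALFWorld unseen success rate. Ablations confirm that turn-level intervention, teacher confidence weighting, and loss normalization provide complementary benefits.
Insight: The core contribution is a verifier-free selective intervention framework that allocates teacher supervision to the turns where it is necessary and reliable, and uses confidence weighting and loss normalization to limit error propagation and the effect of uncertain supervision. The central claim is that effective multi-turn OPD should remain on-policy while teacher supervision is allocated selectively.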
Abstract: On-policy distillation (OPD) improves student models by training them on trajectories induced by their own policy, making it a promising approach for mitigating exposure bias in agent training. However, most OPD studies focus on single-turn settings, while realistic LLM agents interact with environments over multiple turns. In this regime, early errors can alter future observations and compound across the trajectory, and standard dense token-level OPD becomes brittle, as it may over-penalize semantically valid alternatives, reinforce local degeneracies such as repeated actions, and propagate unreliable teacher supervision on off-distribution histories. We propose SAGE-OPD, a verifier-free selective intervention framework specifically designed for multi-turn OPD. Instead of applying teacher supervision uniformly across all turns, SAGE-OPD first observes environment feedback and uses teacher judgment to decide whether each student response should be skipped or intervened on. To further address compounding errors, SAGE-OPD weights token-level distillation by teacher confidence, reducing the influence of uncertain teacher distributions on corrupted or ambiguous histories. Finally, SAGE-OPD applies loss normalization to preserve the overall loss scale of standard OPD while retaining selective turn-level weighting. Experiments on agent tasks show that SAGE-OPD consistently improves over baselines, achieving up to a 13.3% relative improvement in ALFWorld unseen success rate over standard OPD. Ablation studies further demonstrate that turn-level intervention, teacher confidence weighting, and loss normalization provide complementary benefits. Our results suggest that effective multi-turn OPD should remain on-policy, but teacher supervision should be selectively allocated to turns where intervention is necessary and reliable.
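The three ingredients named in the abstract (turn-level selection, teacher-confidence weighting, and loss normalization) can be combined in one small loss function. This is a hedged sketch rather than the paper's objective: the weighted-mean form of the normalization, the argument names, and the use of a 0/1 intervention flag per token are all assumptions.

```python
def selective_opd_loss(token_losses, teacher_conf, intervene):
    """Confidence-weighted mean of token-level distillation losses.

    token_losses: per-token distillation losses
    teacher_conf: teacher confidence per token, in [0, 1]
    intervene:    1 if the token's turn was selected for intervention, else 0
    """
    weights = [c * m for c, m in zip(teacher_conf, intervene)]
    total = sum(weights)
    if total == 0:
        return 0.0  # no turn selected: nothing to distill
    # Dividing by the total weight keeps the result on the scale of a plain mean.
    return sum(w * l for w, l in zip(weights, token_losses)) / total


# Illustrative values only: the third token belongs to a skipped turn.
loss = selective_opd_loss([1.0, 3.0, 5.0], [1.0, 0.5, 1.0], [1, 1, 0])
# With full confidence and every turn selected, this reduces to the ordinary mean.
dense_loss = selective_opd_loss([2.0, 4.0], [1.0, 1.0], [1, 1])
```

The skipped turn contributes nothing and the low-confidence token counts half as much, yet the normalization keeps the value comparable to a dense token-level loss.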
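[11] Where Does Social Reasoning Come From? Capability Provenance in Language Models cs.CL | cs.LG
Glenn Matlin, Chandreyi Chakraborty, Saehee Eom, Mika Okamoto, Rayan Castilla
TL;DR: This paper uses training-data attribution as an interpretable tool for capability provenance, mapping which regions of the pretraining corpus support social reasoning versus STEM reasoning in OLMo3-7B. It computes gradient-based document influence, aggregates it into 576 topic-format bins, and contrasts attribution for social and STEM benchmarks (for example, SocialIQA versus ARC-Challenge), finding that the two kinds of reasoning draw on distinct corpus regions. Targeted machine unlearning provides partial causal validation, and the code and data are open-sourced.
Details
Motivation: Existing training-data attribution is noisy at the document level and has focused on factual knowledge rather than reasoning, so it cannot clearly show which parts of the corpus support a specific capability such as social reasoning.
Result: In OLMo3-7B, social and STEM reasoning draw on qualitatively distinct corpus regions, and the contrast is sharper at the reasoning level than at the knowledge level. Forgetting high-attribution topic bins (for example, Literature for SocialIQA) degrades the aligned benchmark more than within-bin random baselines, which partially validates the attribution causally.
Insight: The paper moves training-data attribution from noisy document-level scores to aggregated analysis over topic-format bins, and designs a contrastive framework for social versus STEM reasoning, giving an interpretable and causally testable tool for capability provenance.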
Abstract: We use training-data attribution as an interpretable tool for capability discovery, mapping which regions of the pretraining corpus support social-reasoning versus STEM-reasoning in OLMo3-7B. Training-data attribution measures how strongly each training document influences a model’s predictions on a benchmark, but document-level scores are too noisy to identify which corpus regions support which capabilities, and prior work has emphasized factual knowledge rather than reasoning. We compute gradient-based attribution (TrackStar via Bergson) over a working set drawn from the de-duplicated Dolma3 mix, aggregate influence across WebOrganizer’s 24-format x 24-topic taxonomy (576 bins), and contrast benchmark pairs in a 2x2 design that varies domain (social vs. STEM) and capability type (reasoning vs. knowledge): SocialIQA and MMLU Social Sciences against ARC-Challenge and MMLU STEM. Social and STEM reasoning draw on qualitatively distinct corpus regions, and the contrast is sharper at the reasoning level than at the knowledge level. Targeted machine unlearning provides partial causal validation: forgetting high-attribution topic bins (e.g., Literature for SocialIQA) degrades the aligned benchmark more than within-bin random baselines, and we open-source all code, sampling manifests, the bin-level influence matrix, and unlearning checkpoints.
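The bin-level aggregation and benchmark contrast described in the abstract amount to a group-by followed by a subtraction. The sketch below assumes summation as the aggregation rule and uses invented documents and scores; the abstract does not state whether influence is summed or averaged within a bin, and the record fields are hypothetical.

```python
from collections import defaultdict


def aggregate_by_bin(docs):
    """Sum document-level influence within each (format, topic) bin."""
    bins = defaultdict(float)
    for d in docs:
        bins[(d["format"], d["topic"])] += d["influence"]
    return dict(bins)


def contrast(bins_a, bins_b):
    """Per-bin difference in aggregated influence between two benchmarks."""
    keys = set(bins_a) | set(bins_b)
    return {k: bins_a.get(k, 0.0) - bins_b.get(k, 0.0) for k in keys}


# Invented influence scores for two benchmarks over the same three documents.
social_docs = [
    {"format": "fiction", "topic": "Literature", "influence": 0.5},
    {"format": "fiction", "topic": "Literature", "influence": 0.25},
    {"format": "tutorial", "topic": "Science", "influence": 0.125},
]
stem_docs = [
    {"format": "fiction", "topic": "Literature", "influence": 0.125},
    {"format": "tutorial", "topic": "Science", "influence": 0.75},
]

diff = contrast(aggregate_by_bin(social_docs), aggregate_by_bin(stem_docs))
num_bins = 24 * 24  # size of the format-by-topic taxonomy quoted in the abstract
```

A positive entry in `diff` marks a bin that matters more for the first benchmark than for the second, which is the kind of signal a targeted unlearning experiment could then test.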
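[12] Code-Switching Reveals Language Anchoring in Multilingual LLMs cs.CL
Jeonghyun Park, Seunghyun Yoon, Yonghyun Jun, Hwanhee Lee
TL;DR: This paper studies why multilingual LLMs degrade on code-switched input. It introduces Anchor Bias, a geometric measure that quantifies language anchoring, and proposes CANVAS, an inference-time intervention that mitigates the degradation.
Details
Motivation: Multilingual LLMs often perform worse on code-switched (CS) input. The authors aim to understand the cause of this degradation and to explore ways to mitigate it.
Result: Source-framed CS stays anchored to the source language, whereas target-framed CS shifts toward the target language and shows larger QA degradation. The proposed CANVAS method consistently recovers QA F1 across MLLMs and CS conditions.
Insight: The paper uses grammar-forced CS as a diagnostic tool for locating representations, and proposes CANVAS, an actionable inference-time intervention that uses internal anchoring signals to guide the alignment of hidden states.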
Abstract: Multilingual Large Language Models (MLLMs) are increasingly expected to handle Code-Switched (CS) inputs, yet mixing languages frequently degrades performance relative to source- or target-language monolingual counterparts. To understand this degradation, we use grammar-forced CS as a controlled diagnostic setting for locating CS representations relative to their source and target counterparts. We introduce Anchor Bias, a geometric measure that quantifies language anchoring, whether a CS hidden state aligns closer to its source or target language counterpart. Across diverse MLLMs, Anchor Bias reveals a consistent grammar-frame effect: source-framed CS stays source-anchored, whereas target-framed CS shifts target-ward and shows larger Question Answering (QA) degradation. Motivated by this representational pattern, we propose CANVAS (Contextual Anchor-based Neural Vector Alignment Steering), an inference-time intervention that extracts a source-side canvas from the input and softly steers target-language hidden states toward the source anchor during prefill. CANVAS consistently recovers QA F1 across MLLMs and CS conditions, showing that internal anchoring signals provide an actionable target for mitigating CS inference failures.
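A geometric measure of anchoring of the kind the abstract describes can be sketched as a difference of cosine similarities. This is an assumed form for illustration only: the abstract says Anchor Bias quantifies whether a code-switched hidden state aligns closer to its source or target counterpart but does not give the formula, and the two-dimensional vectors below are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anchor_bias(h_cs, h_src, h_tgt):
    """Positive: closer to the source-language state; negative: closer to the target."""
    return cosine(h_cs, h_src) - cosine(h_cs, h_tgt)


# Made-up hidden states for the source, target, and code-switched inputs.
h_src, h_tgt = [1.0, 0.0], [0.0, 1.0]
source_anchored = anchor_bias([0.9, 0.1], h_src, h_tgt)
target_anchored = anchor_bias([0.2, 0.8], h_src, h_tgt)
balanced = anchor_bias([1.0, 1.0], h_src, h_tgt)
```

Under this assumed definition, a steering method in the spirit of CANVAS would aim to move target-anchored states back toward positive values.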
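[13] NRITYAM: Language Models Meet Art and Heritage of Dance cs.CL | cs.AI
Punit Kumar Singh, Niladri Ghosh, Advait Joshi, Shailee Choudhary, Michael Färber
TL;DR: This paper presents NRITYAM, a comprehensive benchmark for evaluating the cultural comprehension of language models in the context of global dance traditions. The benchmark contains 9,260 carefully curated question-answer pairs spanning 12 languages and was developed in collaboration with native dance artists and native speakers. The study evaluates a broad set of models, including large language models, small language models, multimodal large language models, and small multimodal language models.
Details
Motivation: The global effectiveness of language models depends on a nuanced understanding of local socio-cultural contexts, and existing benchmarks fall short in evaluating cultural comprehension, particularly for traditional performing arts.
Result: As a multilingual and multicultural benchmark, NRITYAM sets a new standard for evaluating how well AI systems understand and reason about traditional performing arts, and it is the largest dataset dedicated to evaluating cultural knowledge in dance.
Insight: The benchmark was built from the ground up through close collaboration with native experts, yielding a large-scale, multilingual evaluation resource focused on one cultural domain (dance) and underlining the importance of evaluating cultural comprehension in AI models.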
Abstract: Language models have become essential tools in shaping modern workflows. However, their global effectiveness hinges on a nuanced understanding of local socio-cultural contexts. To address this gap, we present NRITYAM, a comprehensive benchmark for evaluating the cultural comprehension capabilities of language models in the context of global dance traditions. NRITYAM comprises 9,260 carefully curated question-answer pairs spanning 12 languages, making it the largest dataset dedicated to evaluating cultural knowledge in dance. The dataset has been developed from the ground up through close collaboration with native dance artists and native speakers of the languages, who authored and validated culturally relevant questions specific to their regions. We evaluate a broad set of models, including large language models, small language models, multimodal large language models, and small multimodal language models. As a multilingual and multicultural benchmark, NRITYAM sets a new standard for evaluating the ability of AI systems to understand and reason about traditional performing arts. Detailed dataset samples are available at https://github.com/niladrighosh03/NRITYAM.
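[14] Clusters are All You Need: Pre-Training the Tsetlin Machine with Semantic Clusters from Language Models for Interpretability cs.CL
Jiechao Gao, Rohan Kumar Yadav, Yuangang Li, Yuandong Pan, Jie Wang
TL;DR: This paper proposes a semantic pre-training framework that transfers knowledge from a pre-trained language model such as BERT into a Tsetlin Machine (TM), improving its semantic understanding while keeping the TM fully interpretable. Text samples are grouped into semantically coherent clusters with K-means or Top2Vec, the cluster-sample pairs pre-train the TM with enhanced Type I feedback so that it learns interpretable semantic keywords, and the TM is then fine-tuned on downstream tasks.
Details
Motivation: Pre-trained language models such as BERT classify text well but lack transparency, which limits their use in high-stakes settings. The TM offers fully interpretable, clause-based reasoning but captures little semantic information. Earlier attempts to combine the two relied on static word embeddings that miss contextual meaning, so an embedding-free way to bridge them is needed.
Result: Across five datasets, the method substantially outperforms vanilla and embedding-based TMs and reaches performance competitive with BERT while remaining interpretable.
Insight: The contribution is an embedding-free semantic pre-training framework that transfers contextual semantic knowledge from a language model into an interpretable TM through cluster-sample pairs and enhanced Type I feedback, balancing performance and interpretability. Using clustering to extract semantic information avoids complex embedding representations and offers a new direction for interpretable AI.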
Abstract: Pre-trained language models such as BERT achieve strong text classification performance but lack transparency, limiting their use in high-stakes settings. The Tsetlin Machine (TM) offers fully interpretable, clause-based reasoning but captures little semantic information, and prior attempts to bridge the two rely on static word embeddings that miss contextual meaning. We propose a semantic pre-training framework that transfers knowledge from a pre-trained language model into a TM without using embeddings. Text samples are grouped into semantically coherent clusters with K-means or Top2Vec, and the resulting cluster-sample pairs pre-train a non-negated TM with enhanced Type I feedback. The TM thereby learns interpretable semantic keywords that are fine-tuned on downstream tasks. Across five datasets, our method substantially outperforms vanilla and embedding-based TMs and reaches performance competitive with BERT while remaining interpretable.
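[15] AtomMem: Building Simple and Effective Memory System for LLM Agents via Atomic Facts cs.CL
Yanyu Yao, Shangze Li, Zhi Zheng, Hui Zheng, Qi Liu
TL;DR: AtomMem is a long-term memory system for LLM agents. It extracts high-value atomic facts as an efficient memory representation and organizes them into hierarchical event structures and temporal profiles, supporting information accumulation and reuse across sessions. At retrieval time it activates an associative memory graph to connect fragmented memories, and it achieves state-of-the-art performance on the LoCoMo benchmark.
Details
Motivation: The fixed context window of LLMs limits long-term information accumulation and reuse across sessions, and existing memory-augmented systems build memory in a coarse, unstable, and inefficient way.
Result: On the LoCoMo benchmark, AtomMem achieves state-of-the-art performance across various reasoning tasks.
Insight: The system introduces a Fact Executor that selectively extracts high-value atomic facts as an efficient memory representation, and organizes them into hierarchical event structures and temporal profiles that capture coherent episodic context and dynamically evolving user attributes. Connecting fragmented memories through an associative memory graph offers a scalable and economically viable route to deploying personalized agents.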
Abstract: Large language models (LLMs) demonstrate strong reasoning and generation abilities, but their fixed context windows limit long-term information accumulation and reuse across multi-session interactions. Existing memory-augmented systems often construct memory in a coarse and unstable manner, relying on inefficient memory representations or unstable unconstrained updates. To address these challenges, we propose AtomMem, a long-term memory system designed for value-dense storage and stable memory evolution. AtomMem introduces a Fact Executor, which selectively extracts high value atomic facts from long form interactions to serve as highly efficient memory representations. Subsequently, AtomMem organizes these facts into hierarchical event structures and temporal profiles, capturing coherent episodic contexts and tracking dynamically evolving user attributes over time. During retrieval, the system activates an associative memory graph to connect fragmented memories. Experiments on the LoCoMo benchmark confirm that AtomMem achieves state-of-the-art performance across various reasoning tasks, offering a scalable and economically viable solution for deploying intelligent personalized agents.
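[16] Self-Preference Is Weak or Absent in Verifiable Instruction-Following Revision: A Four-Model Test Under Genuine Authorship cs.CL
William Guey, Pierrick Bougault
TL;DR: This paper tests whether large language models show self-preference bias in verifiable instruction-following revision. Models act either as the genuine author or as a neutral evaluator, and their acceptance of verified fixes is measured on IFEval. Across four mid-tier model families and 85 comparisons, no significant self-preference is detected: authors reject valid fixes at essentially the same rate as neutral models.
Details
Motivation: LLMs acting as judges show a documented self-preference bias, favoring their own generations. The paper asks whether models also resist verified, valid corrections when revising their own text, testing whether self-preference exists in verifiable revision tasks.
Result: On IFEval, the gap between author and fresh models in rejecting verified fixes is -5.1 percentage points (95% CI [-12.9, +2.7]), which is not statistically significant, so no self-preference is detected. Qualitatively, when authors do reject a fix, 97% of their stated reasons concern flaw-catching rather than preference.
Insight: Validity of a fix is defined by a deterministic verifier (the IFEval checker) rather than by another model, allowing an objective test of self-preference in verifiable revision. The finding that mid-tier LLMs may show no significant self-preference in instruction-following revision provides empirical support for their reliability in self-revision tasks.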
Abstract: Large language models (LLMs) increasingly review and revise text, including their own. A documented self-preference bias (models favoring their own generations when acting as judges) raises the question of whether models also resist valid corrections to their own writing. We test this in a setting where “valid” is decided not by another model but by a deterministic verifier: instruction-following revision on IFEval. A model writes a draft; the official IFEval checker confirms the draft violates a constraint and that a candidate edit fixes it; the model then accepts or rejects that edit either as the genuine in-context author or as a fresh model that sees the draft neutrally. Across four mid-tier model families and 85 author-versus-fresh comparisons, we find no detectable self-preference: authors reject verified-good fixes to their own drafts at essentially the same rate as fresh models judging the same drafts (gap -5.1 pp, 95% CI [-12.9, +2.7]). A self-skepticism hint from a smaller pilot did not replicate at scale. The one robust observation is qualitative: when authors do reject a verified-good fix, 97% of their stated reasons are flaw-catching rather than preference, that is, about the character of rejections, not an elevated rate. Effects smaller than ~13 pp cannot be excluded at this sample size.
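The closing caveat about effects smaller than roughly 13 percentage points follows directly from the reported interval. The sketch below reads the interval's center, half-width, and implied standard error from the two endpoints; it assumes a symmetric normal-approximation interval with z = 1.96, which the abstract does not state, and the function name is illustrative.

```python
def ci_summary(low, high, z=1.96):
    """Center, half-width, and implied standard error of a symmetric interval.

    Assumes a normal-approximation interval: center +/- z * standard error.
    """
    center = (low + high) / 2
    half_width = (high - low) / 2
    return center, half_width, half_width / z


# Endpoints reported in the abstract, in percentage points.
LOW, HIGH = -12.9, 2.7
center, half_width, std_err = ci_summary(LOW, HIGH)

excludes_zero = not (LOW <= 0 <= HIGH)     # False: the gap is not significant
largest_compatible_effect = abs(LOW)       # 12.9 pp, hence the "~13 pp" caveat
```

The center reproduces the reported gap of -5.1 pp, and the lower endpoint is what bounds the largest effect the study cannot rule out.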
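[17] MedRLM: Recursive Multimodal Health Intelligence for Long-Context Clinical Reasoning, Sensor-Guided Screening, Evidence-Grounded Decision Support, and Community-to-Tertiary Referral Optimization cs.CL | cs.AI | cs.LG | q-bio.QM
Aueaphum Aueawatthanaphisut
TL;DR: This paper proposes MedRLM, a recursive multimodal health intelligence framework for long-context clinical reasoning, sensor-guided screening, and community-to-tertiary referral support. It treats the patient case as an external clinical environment and coordinates specialized agents for text, EHR, imaging, and sensor signals to recursively inspect, decompose, retrieve, verify, and synthesize. It also introduces a Clinical Evidence Graph Memory that connects patient observations with retrieved evidence.
Details
Motivation: Current medical LLMs and retrieval-augmented generation systems rely on single-step prompting or retrieval, which is fragile when clinical evidence is spread across long electronic health records, medical images, sensor streams, guidelines, and referral constraints. The paper aims to move medical AI from static question answering toward auditable, multimodal, workflow-aware clinical decision support.
Result: The paper outlines a real-data evaluation design using public and credentialed clinical datasets spanning EHR, radiology, ECG, ICU time series, and referral-proxy outcomes. The abstract reports no quantitative results or benchmark comparisons.
Insight: Contributions include modeling the patient case as an external environment that can be explored recursively, a framework that coordinates specialized multimodal agents, the Clinical Evidence Graph Memory, and sensor-guided recursive triggering with uncertainty-gated refinement, all aimed at more robust and interpretable clinical decision support.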
Abstract: Real-world clinical decision support requires reasoning over heterogeneous and longitudinal patient information rather than answering isolated medical questions. However, current medical large language models and retrieval-augmented generation systems often rely on single-step prompting or retrieval, which can be fragile when clinical evidence is distributed across long electronic health records, medical images, sensor streams, guidelines, and referral constraints. This paper proposes MedRLM, a Recursive Multimodal Health Intelligence framework for long-context clinical reasoning, sensor-guided screening, and community-to-tertiary referral support. Instead of compressing all patient information into one prompt, MedRLM treats the patient case as an external clinical environment that can be recursively inspected, decomposed, retrieved, verified, and synthesized. The framework coordinates specialized agents for clinical text, longitudinal EHR, medical imaging, physiological sensor signals, guideline retrieval, uncertainty auditing, and referral planning. It further introduces a Clinical Evidence Graph Memory to connect patient-specific observations with retrieved evidence, standardized definitions, sensor-derived biomarkers, and referral criteria. A sensor-guided recursive triggering mechanism activates deeper reasoning when abnormal physiological or behavioral patterns are detected, while uncertainty-gated refinement supports clinician review for high-risk or low-confidence cases. We also outline a real-data evaluation design using public and credentialed clinical datasets spanning EHR, radiology, ECG, ICU time series, and referral-proxy outcomes. MedRLM aims to move medical AI from static question answering toward auditable, multimodal, and workflow-aware clinical decision support.
[18] HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization cs.CLPDF
Zhentao Tan, Wei Chen, Jingyi Shen, Yao Liu, Xu Shen
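TL;DR: This paper proposes HydraHead, an attention hybridization architecture that targets the quadratic-complexity bottleneck of attention in long-context processing. Its core idea is to mix full attention and linear attention along the attention-head dimension, using an interpretability-driven strategy for selecting critical heads and a scale-normalized fusion module to integrate the heterogeneous attention signals efficiently. With a three-stage transfer pipeline, the method yields high-performance hybrid models at very low training cost and performs well on long-context tasks.
Details
Motivation: The quadratic complexity of attention is a bottleneck for long-context processing, and the design space of attention hybridization remains underexplored beyond the prevailing layer-wise strategy. The authors observe that heads within the same layer are functionally heterogeneous, which makes the head level a natural and principled granularity for hybridizing attention.
Result: Under a unified training setup, HydraHead outperforms other hybrid designs on long-context tasks while retaining strong general reasoning. With interpretability-driven head selection, it matches the long-context performance of a 3:1 layer-wise hybrid at a 7:1 linear-to-full attention ratio. Notably, trained on only 15B tokens, HydraHead improves on the baseline by more than 69% at 512K context length and approaches Qwen3.5, a leading model of comparable size with a native context length of 256K.
Insight: The abstract claims three contributions: (1) hybridizing heterogeneous attention (full and linear) along the head dimension rather than the layer dimension; (2) a selection strategy that uses interpretability analysis to identify retrieval-critical heads and keeps full attention only for them; and (3) a scale-normalized fusion module that reconciles the distributional gap between the outputs of the two head types. Viewed objectively, the work refines hybridization granularity from layers to heads and uses interpretability analysis to guide the design, a promising direction for efficient long-context architectures.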
Abstract: The quadratic complexity of attention poses a critical bottleneck for long-context processing, spurring interest in hybrid attention designs. Most open-source hybrid models adopt a layer-wise strategy. Yet, prior work has noted the inherent difficulty of integrating Linear Attention (LA) with Full Attention (FA), suggesting that the design space of attention hybridization remains underexplored. To probe this space, we conduct interpretability analysis and observe that layers exhibit block-wise functional similarity, while individual heads within the same layer display distinct functional specialization despite sharing input features. This head-level heterogeneity suggests that the head dimension provides a natural and principled granularity for fusing heterogeneous attention signals. Building on this insight, we introduce HydraHead, a novel architecture that hybridizes FA and LA along the head axis. HydraHead features two key innovations: (1) an interpretability-driven selection strategy that identifies retrieval-critical heads and preserves FA only for them, and (2) a scale-normalized fusion module that reconciles the distributional gap between FA and LA head outputs. By leveraging a three-stage transfer pipeline with parameter reuse and distillation, we achieve high-performance hybrid models with minimal training overhead. Under a unified training setup, HydraHead outperforms other hybrid designs in long-context tasks while maintaining strong general reasoning. With interpretability-driven head selection, it matches a 3:1 layer-wise hybrid’s long-context performance at a 7:1 LA-to-FA ratio. Crucially, trained on only 15B tokens, HydraHead achieves over 69% improvement over the baseline at 512K context length, approaching Qwen3.5, a leading model of comparable size with a native context length of 256K. This highlights the significant scaling potential of head-level hybridization.
[19] ReNikud: Audio-Supervised Hebrew Grapheme-to-Phoneme Conversion cs.CLPDF
Maxim Melichov, Yakov Kolani, Morris Alper
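TL;DR: ReNikud is a grapheme-to-phoneme conversion method for Modern Hebrew. It uses audio-derived pseudo-labels and a pseudo-vocalization architecture to resolve the pronunciation ambiguity caused by unwritten vowels, and it surpasses prior state-of-the-art methods on several benchmarks.
Details
Motivation: Modern Hebrew's abjad writing system leaves vowels largely unwritten. Conventional approaches depend on vocalization data that is scarce and does not reflect spoken pronunciation, while direct sequence-to-sequence prediction performs poorly when data is limited.
Result: On existing Hebrew G2P benchmarks and the newly introduced MILIM benchmark for spoken Hebrew, ReNikud surpasses previous state-of-the-art methods.
Insight: Contributions include pseudo-labeling thousands of hours of unlabeled Hebrew audio with phoneme-based automatic speech recognition, and a pseudo-vocalization architecture that predicts phonemes with character-level alignment as an inductive bias, so the model reflects natural spoken norms without manual annotation.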
Abstract: Grapheme-to-phoneme (G2P) conversion for Modern Hebrew is needed for applications like text-to-speech (TTS), but is challenging due to the language’s abjad writing system, which leaves vowels largely unwritten, creating substantial ambiguity. Standard approaches first predict vowel diacritics (nikud) to produce International Phonetic Alphabet (IPA) transcriptions, but this is limited: vocalization data is scarce and laborious to produce, it does not specify features such as lexical stress, and it reflects formal grammatical rules rather than everyday spoken pronunciation. Direct sequence-to-sequence IPA prediction, meanwhile, struggles on limited data and fails to exploit the character-level alignment characteristic of abjads. Our method, ReNikud, overcomes these limitations with two key insights: (1) Weak audio supervision via a phoneme-based automatic speech recognition (ASR) pseudo-labeling pipeline on thousands of hours of unlabeled Hebrew audio, yielding phonemic transcriptions that reflect natural spoken norms without manual annotation. (2) A pseudo-vocalization architecture that predicts IPA phonemes at each character position, enforcing character-level alignment as an inductive bias. Results on existing Hebrew G2P benchmarks and the new targeted MILIM benchmark for spoken Hebrew show that ReNikud surpasses previous state-of-the-art methods. We will release our code and trained models to support further work on Hebrew TTS and speech technologies.
[20] PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback cs.CLPDF
Wei Xia, Jin Wu, Haoran Shi, Xiangyu Wang, Chanjin Zheng
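TL;DR: This paper proposes PsyScore, a framework that combines psychometrics with instructional scaffolding and unifies automated essay scoring and feedback generation through a shared latent ability representation. It has three modules: a neural IRT scorer based on the Graded Partial Credit Model, an adaptive feedback generator based on the zone of proximal development (ZPD), and a multi-perspective feedback evaluation strategy.
Details
Motivation: Existing automated essay scoring systems suffer from poorly interpretable scoring models and LLM feedback that ignores learner proficiency; reliable assessment and personalized instructional feedback need to be integrated.
Result: Experiments on the ASAP++ dataset show that PsyScore maintains competitive scoring performance while generating feedback that is better aligned with pedagogical principles.
Insight: The novelty lies in embedding a psychometric model (GPCM) in a neural network for interpretable ability estimation, and in dynamically adapting multi-agent feedback strategies to the diagnosed ability parameter, giving assessment and instruction a unified representation.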
Abstract: Effective Automated Essay Scoring (AES) systems are expected to support both reliable assessment and actionable instructional feedback. However, existing approaches often treat scoring and feedback as separate components: neural scoring models provide limited interpretability, while Large Language Model (LLM)-based feedback is typically insensitive to learners' proficiency levels. To address this fragmentation, this work proposes PsyScore, a psychometrically-aware framework that integrates diagnostic assessment with instructional scaffolding through a shared latent ability representation. PsyScore comprises three key modules: (1) a Trait-Adaptive Neural IRT Scorer that incorporates the Graded Partial Credit Model (GPCM) into a neural architecture, enabling the precise estimation of student ability while maintaining psychometric interpretability; (2) a ZPD-Scaffolded Feedback Generator, which conditions multi-agent feedback strategies on the diagnosed ability parameter to adapt instructional focus across different proficiency levels; and (3) a Multi-Perspective Feedback Evaluation Strategy that assesses feedback quality via pairwise preference judgements and student revision simulations. Experiments on the ASAP++ dataset demonstrate that PsyScore achieves competitive scoring performance while providing more pedagogically aligned feedback.
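To make the psychometric component concrete, the sketch below computes category probabilities for a single scored trait, assuming that GPCM here follows the standard generalized partial credit formulation from item response theory. It is not the PsyScore scorer: the function names, the ability values, and the item parameters are all hypothetical, and in the paper these quantities would come from a neural network.

```python
import math

def gpcm_probs(theta, discrimination, step_difficulties):
    """Category probabilities for one item under a partial credit model.

    theta: latent ability; discrimination: item slope a;
    step_difficulties: b_1..b_m for an item scored 0..m.
    All parameter values used below are hypothetical.
    """
    # Category k has logit sum_{v=1..k} a * (theta - b_v); category 0 has logit 0.
    logits = [0.0]
    for b in step_difficulties:
        logits.append(logits[-1] + discrimination * (theta - b))
    # Softmax over categories, shifted by the max logit for numerical stability.
    peak = max(logits)
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

def expected_score(probs):
    """Expected item score given the category probabilities."""
    return sum(k * p for k, p in enumerate(probs))

# Hypothetical trait scored 0..3, evaluated for a weaker and a stronger writer.
steps = [-0.5, 0.5, 1.5]
weak = gpcm_probs(theta=-1.0, discrimination=1.2, step_difficulties=steps)
strong = gpcm_probs(theta=2.0, discrimination=1.2, step_difficulties=steps)
```

Because the ability parameter theta is an explicit input, the same value that drives the score distribution could also be passed to a feedback generator, which is the kind of shared representation the abstract describes.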
[21] StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs cs.CL | cs.CVPDF
Shaghayegh Kolli, Timo Cavelius, Nafiseh Nikeghbal, Samantha Dalal, Jana Diesner
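TL;DR: This paper introduces StylisticBias, a benchmark for evaluating attribute-level social bias in multimodal large language models (MLLMs). It generates about 25K photorealistic face images that keep identity fixed and vary a single visual attribute (such as age, body type, or fashion style), and systematically measures how specific visual cues affect the models' social judgments. Age and body type dominate identity-level bias, while fashion style and other visual attributes drive the largest attribute-level shifts; about 15 attributes explain nearly 80% of the variation, indicating that bias is concentrated in a few visual cues.
Details
Motivation: MLLMs are increasingly deployed in personally and societally consequential settings, yet the visual cues that shape their judgments of people remain unclear. Prior work usually compares different individuals or groups, which makes it hard to separate appearance effects from identity differences, so a controlled method for evaluating attribute-level social bias is needed.
Result: Six MLLMs are evaluated across 25 binary social judgment scenarios. Age and body type are the main factors in identity-level bias, while fashion style and other visual attributes cause the largest attribute-level shifts. About 15 attributes account for nearly 80% of the total variation, showing that bias is highly concentrated in a few visual cues. Sensitivity is strongest for judgments semantically aligned with appearance, especially socioeconomic and style-related ones.
Insight: The novelty is a controlled benchmark (StylisticBias) that fixes identity and changes one visual attribute at a time, enabling fine-grained evaluation of attribute-level social bias in MLLMs. The analysis shows that bias is not uniformly distributed but dominated by a few key visual cues (such as fashion style), which offers a new perspective for understanding bias mechanisms and developing debiasing techniques.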
Abstract: Multimodal large language models (MLLMs) are increasingly deployed in personally and societally consequential settings, yet the visual cues that shape how these models judge people remain poorly understood. Prior work often compares different (groups of) individuals, making it difficult to separate appearance effects from identity differences. We introduce StylisticBias, a controlled benchmark for evaluating attribute-level social bias in MLLMs. We generate 500 photorealistic base faces and create about 50 single-attribute variations per face, producing about 25K images. This design keeps identity fixed and changes one visual attribute at a time. It lets us measure how specific cues shift model judgments. We evaluate six MLLMs across 25 binary social judgment scenarios. We find that age and body type dominate identity-level effects, while fashion style and other visual cues drive the largest attribute-level shifts. We further find that about 15 attributes account for nearly 80% of the total variation, showing that bias is concentrated in a small set of visual cues. Sensitivity is strongest in judgments that are semantically aligned with appearance, especially socioeconomic and style-related judgments. We release StylisticBias as a benchmark for fine-grained bias evaluation in multimodal models. Code and dataset: https://github.com/timo-cavelius/StylisticBias and https://hf.co/datasets/shaghayegh/stylistic-bias-dataset.
cs.CV [Back]
[22] LEAP: Layer-skipping Efficiency via Adaptive Progression for Vision Transformer Distillation cs.CVPDF
Jiaqi Zhang, Ashton Lee, Anthony Wong, John Zou, Sami BuGhanem
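TL;DR: This paper proposes LEAP, an adaptive progressive training curriculum for feature-based knowledge distillation of Vision Transformers. It uses the teacher's intermediate feature maps as a sequence of increasingly difficult targets, so the student learns foundational representations before higher-level abstractions, easing the distillation bottleneck caused by the teacher-student capacity gap. Experiments show that LEAP markedly accelerates convergence, reduces training cost, and improves student performance on several datasets.
Details
Motivation: Large vision foundation models built on Vision Transformers are computationally demanding and must be compressed through knowledge distillation for edge deployment. Conventional feature-based distillation, however, suffers from a pronounced teacher-student gap: the student's limited capacity makes the teacher's complex feature maps hard to imitate.
Result: On ImageNet-100, the LEAP-distilled ViT-S reaches 90.1% accuracy, +12.24% over the baseline. On ImageNet-1K, instance retrieval improves by +3.84% on Oxford and +7.75% on Paris. The method also saves 25.1% of training FLOPs and 21% of training time on ImageNet-100.
Insight: The core idea is an adaptive progressive curriculum that treats the teacher's intermediate-layer features as learning targets of increasing difficulty, letting the student build its representation in stages; this lowers distillation difficulty and improves training efficiency. Early-stopping teacher inference during the initial stages of training also cuts computation substantially.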
Abstract: Vision Foundation Models (VFMs) with Vision Transformer (ViT) backbones, such as DINOv2, have become essential for downstream tasks like object recognition and semantic segmentation. The immense computational requirements of these backbones often necessitate distillation into smaller architectures for edge deployment. Feature-based knowledge distillation (KD) often suffers from the teacher-student gap: the student struggles to imitate the teacher's complex feature maps due to its limited capacity. To mitigate this bottleneck, we propose LEAP: Layer-skipping Efficiency via Adaptive Progression, a training curriculum for ViT feature-based knowledge distillation. By utilizing the teacher's intermediate feature maps as a sequence of progressively more difficult targets, our curriculum allows the student to build a foundational representation before tackling higher-level abstractions. Our results demonstrate that this paradigm significantly accelerates convergence through adaptive difficulty selection across various student model sizes and dataset scales. With our curriculum, the LEAP-distilled ViT-S achieves 90.1% accuracy on ImageNet-100, a +12.24% improvement compared with the baseline. On ImageNet-1K, LEAP achieves +3.84% and +7.75% improvement for the instance retrieval task on the Oxford and Paris datasets, respectively. Furthermore, the curriculum enables 25.1% savings in training FLOPs and 21% savings in training time on ImageNet-100 by implementing early-stopping for teacher inference during the initial stages of training. Code is available at https://github.com/KevinZ0217/LEAP
[23] LooseControlVideo: Directorial Video Control using Spatial Blocking cs.CVPDF
Shariq Farooq Bhat, Niloy J. Mitra, Kalyan Sunkavalli
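TL;DR: This paper presents LooseControlVideo, a framework for precise 3D spatial orchestration of multi-object scenes in text-to-video generation. It uses sparse, oriented 3D bounding boxes as a “blocking” proxy, letting users specify high-level layout and trajectory intuitively while a video generative model produces realistic occlusions, dynamics, and interactions.
Details
Motivation: Existing depth-conditioned models achieve good structural fidelity but require dense, frame-accurate guidance, which is labor-intensive to author for dynamic events involving deformable objects. A more intuitive and expressive form of control is needed.
Result: Extensive evaluation on the nuScenes, HO-3D, and BEHAVE benchmarks shows that LooseControlVideo significantly outperforms existing 2D-box and flow-based baselines: a 1.2x to 3x improvement in trajectory error, a 2x improvement in rigid motion consistency, and a 1.5x to 2x increase in occlusion accuracy over current state-of-the-art layout-conditioned models.
Insight: The core idea is to use sparse, oriented 3D boxes as a high-level control proxy and to fine-tune the video generation backbone with DNOCS, a novel encoding of 3D size, orientation, and depth-ordered occlusions. This gives a good geometric prior for complex, multi-agent video authoring and supports localized refinement with minimal disruption to the global scene context.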
Abstract: Precise 3D spatial orchestration in text-to-video generation remains a significant challenge, particularly for multi-object scenes where semantic layout and temporal dynamics are often entangled. While existing depth-conditioned models achieve good structural fidelity, they necessitate dense, frame-accurate guidance that is labor-intensive to author for dynamic events involving deformable objects. We present LooseControlVideo, a framework that enables intuitive and expressive control by using sparse, oriented 3D boxes as a “blocking” proxy. This allows users to author high-level layout and trajectory while leveraging a video generative model to generate realistic occlusions, dynamics, and interactions. We achieve this by fine-tuning a Wan 2.2 backbone on a video dataset annotated with DNOCS, a novel encoding for 3D size, orientation, and depth-ordered occlusions. Furthermore, our method allows for localized refinement, such as adjusting a jump trajectory or adding an interaction, with minimal disruption to the global scene context. Extensive evaluations on the nuScenes, HO-3D, and BEHAVE benchmarks demonstrate that LooseControlVideo significantly outperforms existing 2D-box and flow-based baselines. Our findings indicate a 1.2x to 3x improvement in Trajectory Error, a 2x improvement in Rigid Motion Consistency, and a 1.5x to 2x increase in Occlusion Accuracy over current state-of-the-art layout-conditioned models, demonstrating that oriented 3D primitives provide a good geometric prior for complex, multi-agent video authoring.
[24] ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing? cs.CV | cs.ROPDF
Yuyang Zhang, Wenyao Zhang, Zekun Qi, He Zhang, Haitao Lin
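TL;DR: This paper proposes ImageWAM, a World Action Model (WAM) framework that uses pretrained image editing models rather than video generation for robot action prediction. It models the current-to-target frame transformation through image editing and focuses on action-relevant visual differences, avoiding the high compute cost, attention to irrelevant detail, and long-horizon imagination errors of video generation. Experiments show that ImageWAM outperforms standard vision-language-action (VLA) baselines across simulated and real-world tasks while substantially reducing compute and latency.
Details
Motivation: Video-based world action models have three coupled limitations: inference is costly because of dense multi-frame future tokens, predictive capacity is spent on action-irrelevant temporal and appearance details, and long-horizon future imagination can introduce errors that mislead action prediction. This leads the authors to question whether world action models really need video generation and to explore a more efficient alternative.
Result: Across simulator and real-world experiments, and without additional policy pretraining, ImageWAM outperforms standard VLA baselines and matches competitive world action models. It reduces FLOPs to 1/6 and latency to 1/4 of video-based WAMs. Attention analysis further shows that the editing caches focus on task-relevant change regions.
Insight: The core idea is to repurpose pretrained image editing models as the prior for world-action modeling, which is a better match for the task: only a single target-frame transformation is modeled, the focus is on action-relevant visual differences, and edit pretraining grounds task instructions in localized visual changes. Viewed objectively, conditioning the action expert on the image-editing KV caches as a compact world-action context is an efficient and interpretable alternative to video generation.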
Abstract: World Action Models (WAMs) commonly rely on video generation to bridge visual world modeling and robot control. However, video-based WAMs face three coupled limitations: dense multi-frame future tokens make inference costly, full video prediction spends capacity on action-irrelevant temporal and appearance details, and long-horizon future imagination may introduce errors that mislead action prediction. These issues raise a simple question: Does a world action model really need video generation? We propose ImageWAM, a simple WAM framework that repurposes pretrained image editing models for robot action prediction. In contrast to video generation, image editing provides a better-matched prior: it only needs to model a target-frame transformation, focuses on action-relevant current-to-target visual differences, and grounds task instructions to localized visual changes through edit pretraining. In practice, ImageWAM does not decode the target frame at inference time; instead, it conditions a flow-matching action expert on the KV caches produced by image-editing denoising, using them as a compact world-action context. ImageWAM outperforms standard VLA baselines and matches competitive WAMs without additional policy pretraining across different simulator and real-world experiments. It also reduces FLOPs to 1/6 and latency to 1/4 of video-based WAMs. Attention analysis further shows that editing caches focus on task-relevant change regions, supporting image editing as an effective alternative to video-based world-action modeling.
[25] PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models cs.CV | cs.AI | cs.CLPDF
Yueyi Sun, Yuhao Wang, Jason Li, Ye Tian, Tao Zhang
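TL;DR: This paper proposes PerceptionDLM, a parallel region perception method based on multimodal diffusion language models (DLMs). Through efficient prompting and structured attention masking, it generates descriptions for multiple masked regions of an image in parallel, which significantly improves inference efficiency. The authors also build ParaDLC-Bench to systematically evaluate the parallel visual perception capability of DLMs.
Details
Motivation: Most existing MLLMs rely on autoregressive generation, which is inefficient for tasks that require describing multiple regions. This work exploits the parallel decoding of diffusion language models to remove that efficiency bottleneck in multi-region perception.
Result: PerceptionDLM, built on PerceptionDLM-Base (state of the art among open-source diffusion MLLMs), maintains competitive performance on region captioning while achieving substantial inference speedups on multi-region perception tasks.
Insight: The authors claim to be the first to achieve parallel region captioning and perception by leveraging diffusion language models, with efficient prompting and structured attention masking enabling parallel generation at both the sequence and token levels. This offers a new approach to efficient multimodal visual understanding.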
Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks. However, most existing MLLMs rely on autoregressive generation, which limits their efficiency for perception tasks that require captioning multiple regions. In this work, we propose PerceptionDLM, a multimodal diffusion language model optimized for efficient parallel region perception. Built upon PerceptionDLM-Base, a strong foundational baseline that achieves state-of-the-art performance among open-source diffusion MLLMs, our architecture fully leverages the parallel decoding nature of DLMs. Specifically, we introduce efficient prompting and structured attention masking to enable simultaneous perception of multiple masked regions, allowing the model to generate region descriptions in parallel at both the sequence and token levels. This design significantly improves inference efficiency compared with existing approaches that process regions sequentially. To systematically evaluate the parallelism property of visual perception capability for DLMs, we construct a new Parallel Detailed Localized Captioning Benchmark (ParaDLC-Bench) by scaling the DLC-Bench to include multiple region masks per image, enabling joint evaluation of both caption quality and inference efficiency. Experiments demonstrate that PerceptionDLM maintains competitive performance in region captioning while achieving substantial speed improvements for multi-region perception tasks. Our results highlight the potential of multimodal diffusion language models for efficient, parallel visual perception. To the best of our knowledge, we are the first to achieve parallel region caption and perception by leveraging the advantages of diffusion language models. Code, models, and datasets are released.
[26] Language-Instructed Vision Embeddings for Controllable and Generalizable Perception cs.CVPDF
Chengzhi Mao, Xudong Lin, Wen-Sheng Chu
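TL;DR: This paper proposes a new paradigm, Language-Instructed Vision Embeddings (LIVE), which uses language instructions to dynamically guide the vision encoder at inference time and produce task-centric embeddings, removing the need for task-specific retraining. The method aims to make visual representations more contextually relevant and controllable, for more general and adaptive visual perception.
Details
Motivation: Vision foundation models are typically trained as static feature extractors, which shifts the burden of task adaptation onto large downstream models. This paper explores an alternative that uses language itself as high-level guidance to adjust the vision encoder dynamically, addressing the contextual irrelevance and poor controllability that static feature extraction causes.
Result: Experiments show that LIVE substantially reduces visual hallucinations (+34 points on the MMVP benchmark), surpasses vision-language models with orders of magnitude more parameters on visual question answering, and generalizes to unseen instructions and tasks.
Insight: The core idea is to use language instructions as a dynamic, inference-time control signal for visual encoding, rather than only feeding visual features into a language model. This provides a direct path toward adaptive, instruction-driven visual intelligence, and the shift from static feature extraction to dynamic language guidance is worth noting.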
Abstract: Vision foundation models are typically trained as static feature extractors, placing the burden of task adaptation onto large downstream models. We propose an alternative paradigm: instead of solely feeding visual features into language models, we use language itself to dynamically guide the vision encoder. Our method, Language-Instructed Vision Embeddings (LIVE), leverages language as high-level guidance to produce task-centric embeddings at inference time, removing the need for task-specific retraining. This enables the encoder to focus on contextually relevant aspects of the input, yielding more controllable and generalizable representations. Empirically, LIVE reduces visual hallucinations (+34 points on MMVP), surpasses vision-language models with orders of magnitude more parameters on visual question answering, and generalizes to unseen instructions and tasks – offering a direct path toward adaptive, instruction-driven visual intelligence.
[27] Mix-QVLA: Task-Evidence-Aware Mixed-Precision Quantization of Vision-Language-Action Models cs.CVPDF
Navin Ranjan, Andreas Savakis
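TL;DR: This paper proposes Mix-QVLA, a task-evidence-aware mixed-precision post-training quantization (PTQ) framework for vision-language-action (VLA) models. It anchors each quantized variant to the full-precision action-token reference decision, evaluates whether quantization preserves task-relevant evidence across key functional boundaries, and computes normalized gradient-weighted task-evidence maps to quantify distortion in the strength and distribution of evidence. A soft-bottleneck objective aggregates boundary-level degradation into layer-wise sensitivity scores, and sensitivity is modeled throughout task execution, which guides mixed-precision bit allocation under model-size and BitOps budgets.
Details
Motivation: The goal is to obtain higher storage and compute efficiency for low-bit deployment of VLA models while preserving task performance. Existing methods may overlook how quantization affects the evidence that task decisions depend on, and how layer importance shifts during task execution.
Result: Extensive evaluation on OpenVLA-style policies shows that Mix-QVLA improves the accuracy-efficiency trade-off of low-bit VLA deployment. On the LIBERO benchmark, it reduces OpenVLA-OFT memory from 15.4 GB to 4.1 GB, retains a 96.3% average success rate (97.1% for the BF16 full-precision model), and achieves a 1.52x inference speedup.
Insight: The novelty is a task-evidence-aware way of assessing quantization, which guides bit allocation by analyzing how quantization changes the strength and distribution of decision-supporting evidence, together with modeling how layer sensitivity shifts during task execution instead of assuming a fixed sensitivity profile. This gives finer-grained, more task-specific guidance for mixed-precision quantization of VLA models.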
Abstract: We propose Mix-QVLA, a task-evidence-aware mixed-precision PTQ framework for VLA models. Mix-QVLA anchors each quantized variant to the full-precision action-token reference decision and evaluates whether quantization preserves task-relevant evidence across key VLA functional boundaries. It computes normalized gradient-weighted task-evidence maps from boundary activations and compares full-precision and quantized maps using evidence-mass and attribution-distribution distortion, capturing changes in both the strength and allocation of decision-supporting evidence. A soft-bottleneck objective aggregates boundary-level degradation into layer-wise sensitivity scores. Mix-QVLA further models sensitivity throughout task execution, capturing phase-dependent shifts in layer importance rather than assuming a fixed sensitivity profile. The resulting evidence- and time-aware scores guide mixed-precision bit allocation under model-size and BitOps budgets. Extensive evaluations on OpenVLA-style policies show that Mix-QVLA improves the accuracy-efficiency trade-off of low-bit VLA deployment. On LIBERO, Mix-QVLA reduces OpenVLA-OFT memory from 15.4 GB to 4.1 GB, retains 96.3 average success compared with 97.1 for the BF16 model, and achieves a 1.52x inference speedup.
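The final step described above, turning per-layer sensitivity scores into a bit allocation under a size budget, can be illustrated with a generic greedy heuristic. This is not Mix-QVLA's allocation procedure, which the abstract does not specify: the layer names, parameter counts, sensitivity values, the two candidate bit widths, and the budget are all hypothetical, and the BitOps budget is left out for brevity.

```python
def allocate_bits(layers, budget_bits, low=4, high=8):
    """Greedy mixed-precision allocation under a total size budget.

    layers: list of (name, num_params, sensitivity) tuples (hypothetical data).
    Every layer starts at `low` bits; layers are then upgraded to `high` bits
    in order of sensitivity per parameter while the budget allows.
    """
    bits = {name: low for name, _, _ in layers}
    used = sum(num_params * low for _, num_params, _ in layers)
    if used > budget_bits:
        raise ValueError("budget is too small even at the lowest bit width")
    # Rank layers by sensitivity gained per parameter upgraded.
    ranked = sorted(layers, key=lambda layer: layer[2] / layer[1], reverse=True)
    for name, num_params, _ in ranked:
        extra = num_params * (high - low)
        if used + extra <= budget_bits:
            bits[name] = high
            used += extra
    return bits, used

# Hypothetical three-layer model and a budget of 5400 bits.
example_layers = [
    ("vision_proj", 100, 0.9),
    ("llm_block", 1000, 2.0),
    ("action_head", 50, 0.8),
]
bits, used = allocate_bits(example_layers, budget_bits=5400)
```

In this toy setting the two small, sensitive layers are kept at 8 bits and the large block stays at 4 bits. A time-aware variant, in the spirit of the abstract, would recompute or aggregate the sensitivity scores over task phases before running the allocation.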
[28] TeleMorpher: Toward Robust Simultaneous Motion-Location Editing cs.CV | cs.AIPDF
Haengbok Chung
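TL;DR: This paper proposes TeleMorpher, a one-shot framework for simultaneous motion and location editing in video. It disentangles foreground from background, warps poses using motion priors, and injects the result into a baseline motion editor, giving more controllable and precise editing of the protagonist's motion and location.
Details
Motivation: Diffusion models have been highly successful in image and video generation and editing, but robust simultaneous editing of a protagonist's motion and location, a task of practical importance, remains largely unexplored. This paper analyzes the fundamental factors that degrade editing quality and addresses the challenge.
Result: Experiments on in-the-wild videos and the TaiChi dataset show that TeleMorpher achieves superior performance in both quantitative and qualitative evaluation, including human evaluation.
Insight: Contributions include what the authors describe as one of the first one-shot frameworks for simultaneous motion-location editing; a training-free pose warping method guided by motion priors; and two new LPIPS-based metrics that measure background consistency before and after motion editing and the fidelity of the motion edit.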
Abstract: Diffusion models have achieved remarkable success in image and video generation and editing. While recent studies have extended these efforts toward motion editing, simultaneously transforming both motion and location, despite its practical importance, remains largely unexplored. To better understand robust motion-location editing, we first analyze the fundamental factors that degrade its quality. Based on this analysis, we propose TeleMorpher, to the best of our knowledge one of the first one-shot frameworks for simultaneous motion-location editing. Our approach leverages motion priors, a target motion-centric video generated from an off-the-shelf model as motion-editing guidance, and the ground truth motion to enable more controllable and precise motion-location editing. The framework works as follows: (1) We first disentangle the protagonist and the background via pre-trained segmentation and inpainting models. (2) We then introduce a training-free pose warping that edits the protagonist's motion with the motion prior as guidance. (3) The resulting warped-motion video is directly injected into a baseline motion editor during inference, mitigating the difference between source and target motions while preserving the appearance of the source video. (4) To enhance the reliability of quantitative evaluations, we propose two new LPIPS-based metrics that measure the background consistency before and after motion editing and the fidelity of motion editing, the latter by measuring the difference between the protagonist's skeletons extracted from the source and target videos. Experiments with in-the-wild videos and the TaiChi dataset demonstrate that TeleMorpher achieves superior performance across both quantitative and qualitative measurements (real-human evaluation), underscoring its effectiveness.
[29] Vortex: Multi-Modal Fusion System for Intelligent Video Retrieval cs.CVPDF
Duc-Tho Nguyen, Hieu-Hoc Tran-Minh, Khanh-Hoa Lam, Hoang-Nhut Ly, Huu-Phuc Huynh
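TL;DR: This paper presents Vortex, a multimodal video retrieval system developed for the Ho Chi Minh City AI Challenge 2025. The system integrates adaptive keyframe extraction, multimodal metadata generation from vision-language and speech models, and a hybrid retrieval strategy that fuses CLIP and SigLIP2 embeddings, and it adds Rocchio-based relevance feedback and a multi-stage temporal search mechanism. It performed strongly in both the Preliminary and Final Rounds of the official evaluation.
Details
Motivation: The aim is to advance intelligent multimedia search and temporal reasoning, addressing the balance between global and fine-grained semantics, interactivity, and temporal event alignment in video retrieval.
Result: In the official competition, the system scored 79.6/88 (90.5%) in the Preliminary Round and received an overall 'Excellent' rating in the Final Round, with 'Outstanding' results on the question-answering task, which supports the effectiveness of the hybrid retrieval approach.
Insight: The novelty lies in combining CLIP and SigLIP2 embeddings through Reciprocal Rank Fusion (RRF) to exploit their complementary strengths, and in integrating Rocchio feedback and multi-stage temporal search to improve interactivity and alignment, providing a scalable architectural basis for intelligent, context-aware video retrieval.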
Abstract: This paper presents Vortex, the multimodal video retrieval system developed by our team, FocusOnFun, for the Ho Chi Minh City AI Challenge 2025, designed to advance intelligent multimedia search and temporal reasoning. The system integrates adaptive keyframe extraction, multimodal metadata generation from vision-language and speech models, and a hybrid retrieval strategy that fuses CLIP and SigLIP2 embeddings through Reciprocal Rank Fusion to balance global and fine-grained semantics. To enhance interactivity, Vortex incorporates Rocchio-based relevance feedback and a multi-stage temporal search mechanism for sequential event alignment. Built on Milvus and Elasticsearch, the architecture enables scalable indexing and efficient retrieval. Evaluated in the official competition, our FocusOnFun team's system achieved a score of 79.6/88 (90.5%) in the Preliminary Round and was further evaluated in the Final Round, achieving an 'Excellent' overall performance with 'Outstanding' results in the question-answering (QA) task. This demonstrates the complementary strengths of CLIP and SigLIP2 and confirms the effectiveness of the hybrid retrieval approach. The system establishes a robust foundation for future research in intelligent, context-aware, and interactive video retrieval.
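Reciprocal Rank Fusion, which the abstract names as the way CLIP and SigLIP2 results are combined, is simple enough to sketch. The version below is the standard formulation, in which each item scores the sum of 1 / (k + rank) over the ranked lists that contain it. It is not taken from the Vortex code: the frame identifiers are made up, and k = 60 is a commonly used default rather than a value reported by the authors.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of item ids into a single ranking.

    Each item scores sum(1 / (k + rank)) over the lists that contain it,
    with ranks starting at 1. k=60 is an assumed default, not a value
    taken from the Vortex paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by item id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Hypothetical top-3 keyframes returned by each embedding model for one query.
clip_ranking = ["frame_a", "frame_b", "frame_c"]
siglip_ranking = ["frame_b", "frame_d", "frame_a"]
fused = reciprocal_rank_fusion([clip_ranking, siglip_ranking])
```

Because only ranks are used, the two models' similarity scores never need to be calibrated against each other, which is one reason rank fusion suits combining heterogeneous embedding spaces.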
[30] Exploring Multi-Modal Large Language Models and Two-Stage Fine-Tuning for Fashion Image Retrieval cs.CVPDF
Nguyen Cao Hoang, Hoang Bui Le, Nam Vo Hoang, Trung-Nghia Le
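TL;DR: This paper proposes a framework for composed image retrieval in the fashion domain. It integrates the multimodal large language model LLaVA to generate attribute-aware triplets and introduces a two-stage fine-tuning strategy to strengthen contrastive learning. Experiments show improved compositional reasoning and fine-grained retrieval.
Details
Motivation: The work addresses the limited performance of fashion composed image retrieval models caused by scarce annotated data and overly simplistic negative sampling.
Result: Experimental results show enhanced compositional reasoning and improved fine-grained retrieval behavior, supporting the feasibility and potential of the proposed framework for fashion retrieval.
Insight: The main contributions are using a multimodal large language model to generate training data (attribute-aware triplets) automatically and a two-stage fine-tuning strategy that improves contrastive learning, which mitigates data scarcity and sharpens the model's sensitivity to subtle attribute changes.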
Abstract: Composed image retrieval retrieves a target image using a composed query of a reference image and a modified text description. In the fashion domain, this task requires understanding subtle attribute variations such as color, pattern, and texture. However, existing approaches face limitations due to scarce annotated data and simplistic negative sampling. We propose a novel framework that integrates a multi-modal large language model (LLaVA) to generate attribute-aware triplets and introduces a two-stage fine-tuning strategy to enhance contrastive learning. We leverage pretrained vision-language models, such as CLIP-ViT/B32, to generate and concatenate sentence-level prompts with the relative caption and to scale the number of negatives using static representations. Experimental results demonstrate enhanced compositional reasoning and improved fine-grained retrieval behavior, underscoring the feasibility and potential of the proposed framework for fashion retrieval.
[31] NEST: Narrative Event Structures in Time for Long Video Understanding cs.CV | cs.CLPDF
Ali Asgarov, Kaushik Narasimhan, Najibul Haque Sarker, Hani Alomari, Chia-Wei Tang
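TL;DR: This paper introduces NEST, a new benchmark dataset for narrative understanding in long videos, comprising 1005 full-length movies, each annotated with 102 multimodal narrative events and their relations. It also introduces baseline tasks (event trigger detection, event localization, event argument extraction, and event relation extraction) to evaluate how well models understand narrative structure in long videos.
Details
Motivation: Existing long-video benchmarks focus on needle-in-a-haystack retrieval and neglect narrative structure: how low-level actions form events, how events interact over time, and how narratives progress.
Result: On the NEST benchmark, event trigger detection (ETD) scores below 8%, event localization (EL) below 6%, and event argument extraction (EAE) below 11%, showing that grounded event discovery is highly challenging. Event relation extraction (ERE) is more tractable once events are given, reaching 35.45% F1 zero-shot and 44.42% F1 after fine-tuning.
Insight: The novelty is a multimodal dataset focused on narrative structure in long videos, with structured event relations (temporal ordering, hierarchical composition, and long-range dependencies), providing a new benchmark for evaluating models' understanding of complex narrative logic.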
Abstract: Recent progress in vision-language models has enabled the processing of increasingly long video sequences, but the ability to handle extended token streams does not translate to understanding of narrative structure in long videos. Existing long video benchmarks focus on needle-in-a-haystack retrieval rather than evaluating how low-level actions form events, how events interact across time, and how narratives progress, for example, whether a model can connect an early setback, such as a job loss to a later relationship breakup, despite long gaps, intervening scenes, or flashbacks that reframe what occurred. We introduce NEST (Narrative Event Structures in Time for Long Video Understanding), a dataset of 1005 full-length movies (avg. 98 minutes), each annotated with 102 multimodal narrative events grounded in visual content, dialogue, and audio. NEST captures multimodal narrative events with structured annotations grounded in visual content, dialogue, and audio, and links them through relations that reflect narrative structure, including temporal ordering, hierarchical composition, and long-range dependencies. We introduce baselines for event trigger detection (ETD), event localization (EL), event argument extraction (EAE), and event relation extraction (ERE). The benchmark is highly challenging for grounded event discovery, with ETD below 8%, EL under 6%, and EAE below 11%. In contrast, ERE is more tractable once events are given, reaching 35.45% F1 zero-shot and 44.42% F1 after fine-tuning.
[32] Occ-VLM: Occupancy Grounded Vision Language Model for Indoor Scene Understanding cs.CVPDF
Jianing Li, Zhou Fang, Yijiang Liu, Li Du
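TL;DR: This paper proposes Occ-VLM, a framework for 3D scene understanding that uses only posed RGB images and a single 2D vision encoder. It reconstructs 3D scene occupancy as a geometric prior, associates foreground 2D tokens with 3D space, and decodes them with a large language model for unified understanding.
Details
Motivation: Existing methods typically rely on explicit 3D inputs or an additional 3D geometry encoder, which structurally decouples 3D geometric perception from the rich 2D semantics learned through vision-language pre-training and hinders a unified 3D vision-language representation.
Result: Occ-VLM achieves state-of-the-art performance on multi-view occupancy prediction and performs on par with 3D-input vision-language models on 3D visual question answering and 3D dense captioning benchmarks.
Insight: The core idea is to use reconstructed 3D occupancy as a geometric prior that bridges 2D semantics and 3D space, enabling unified representation learning without explicit 3D inputs or an extra 3D encoder and offering a new approach to 3D scene understanding from 2D images alone.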
Abstract: Recently, vision-language models (VLMs) have made significant progress in 3D scene understanding, driving advances in applications such as embodied intelligence and robotic vision. However, existing approaches typically either rely directly on explicit 3D inputs (e.g., point clouds or RGB-D sequences), or introduce an additional 3D geometry encoder to derive 3D-aware visual tokens from 2D images. Such designs structurally decouple 3D geometric perception from the rich 2D semantics learned via vision-language pre-training, hindering the development of a unified 3D vision-language representation. In this work, we propose Occ-VLM, a novel framework for 3D scene understanding that operates purely on posed RGB images and employs a single 2D vision encoder. Specifically, Occ-VLM reconstructs 3D scene occupancy as an auxiliary geometric prior, which is utilized to spatially associate foreground 2D tokens with 3D space. These tokens are then decoded by a Large Language Model (LLM) for unified scene understanding. Extensive experiments demonstrate that Occ-VLM achieves both accurate geometric perception and robust vision-language reasoning: it attains state-of-the-art performance on multi-view occupancy prediction, while performing on par with 3D-input VLMs on 3D Visual Question Answering (VQA) and 3D dense captioning benchmarks.
[33] QueryGaussian: Scalable and Training-Free Open-Vocabulary 3D Instance Retrieval cs.CV | cs.AI
Xiuyuan Zhu, Ke Lu, Zijie Yang, Chao Yue, Jian Xue
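TL;DR: This paper proposes QueryGaussian, a training-free, scalable framework for open-vocabulary 3D instance retrieval. An instance-level query mechanism decouples semantic understanding from geometric representation: pre-trained 2D vision models interpret the user prompt, and segmentation masks are lifted into 3D through a maximum-weight association strategy and a temporal fusion module. The method matches the accuracy of state-of-the-art approaches while substantially improving efficiency and scalability.
Details
Motivation: Existing 3D instance retrieval methods built on the “scene-level embedding” paradigm face a fundamental architectural bottleneck in city-scale scenes: memory and compute costs grow linearly with scene complexity and readily trigger out-of-memory (OOM) failures.
Result: Experiments show that QueryGaussian matches state-of-the-art (SOTA) accuracy while cutting GPU memory usage by more than 70% and accelerating inference by 180x, enabling fast instance retrieval on city-scale scenes with tens of millions of Gaussians on consumer-grade hardware.
Insight: The core shift is from “scene-level semantic distillation” to “instance-level querying”, which decouples semantics from geometry. Technical highlights include open-vocabulary understanding via pre-trained 2D models, a concurrent maximum-weight association strategy that ensures semantic-visual consistency, and a temporal fusion module with multi-stage adaptive density clustering that mitigates projection ambiguity. This architecture addresses the scalability problem at its root.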
Abstract: Efficiently retrieving specific 3D instances from large-scale scenes via natural language prompts remains a formidable challenge in multimedia analysis. Existing approaches predominantly follow a “scene-level embedding” paradigm, which requires distilling high-dimensional semantic features into every 3D primitive. This strategy suffers from a fundamental architectural bottleneck: memory and computational costs scale linearly with scene complexity, inevitably triggering out-of-memory (OOM) failures in city-scale environments. To address this barrier, we propose QueryGaussian, a training-free framework for expeditious and scalable open-vocabulary 3D instance retrieval. Unlike holistic semantic distillation, QueryGaussian employs an instance-level query mechanism that decouples semantic understanding from geometric representation. Specifically, we leverage pre-trained 2D vision models to interpret user prompts and lift segmentation masks into 3D via a concurrent maximum-weight association strategy, ensuring semantic-visual consistency. To mitigate projection ambiguity, we introduce a temporal fusion module with multi-stage adaptive density clustering. Experimental results demonstrate that QueryGaussian not only matches the accuracy of state-of-the-art methods but also delivers a decisive efficiency leap, reducing GPU memory usage by over 70% and accelerating inference by 180x. Crucially, QueryGaussian enables expeditious instance retrieval on city-scale scenes containing tens of millions of Gaussians using consumer-grade hardware.
[34] Training-Free Metrics for Synthetic Object Detection Data: A Proxy for Detector Performance cs.CV
Myeongseok Nam, Donghoon Yeo, Seungwook Kim
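TL;DR: This paper proposes Conditional-Composition Domain Match (CCDM), a training-free metric for estimating how much synthetic object detection data will improve a downstream detector, avoiding costly model training. On the VisDrone-DET dataset, CCDM reaches a Spearman correlation of 1.0 with YOLOv8 downstream performance, clearly outperforming existing metrics for synthetic image evaluation.
Details
Motivation: Assessing the usefulness of synthetic data for object detection currently depends on time-consuming downstream training, and dense bounding-box annotations make evaluation especially expensive.
Result: On the VisDrone-DET dataset, the CCDM metric family achieves a Spearman correlation of 1.0 with YOLOv8 downstream performance, clearly surpassing existing synthetic image evaluation metrics.
Insight: The contribution is a pre-computable, training-free CCDM metric that serves as a proxy for the relative utility of candidate synthetic training sets. By matching conditional composition, it predicts data utility and reduces the compute needed to evaluate synthetic data.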
Abstract: With the recent advent of image generative models, synthetic data are increasingly being used to supplement limited real datasets for training computer vision models. However, not all synthetic datasets improve performance equally, and their effectiveness can only be assessed by training a downstream model, which is computationally expensive and time-consuming. This problem is pronounced in the task of object detection, where the required annotations are much more dense due to bounding boxes. In this paper, we propose a pre-computable metric family, dubbed Conditional-Composition Domain Match (CCDM), which serves as a proxy for the relative utility of candidate synthetic training sets for downstream detection. Experiments on the VisDrone-DET dataset show that the CCDM metric families achieve a Spearman correlation of 1.0 with the downstream performance of YOLOv8, clearly outperforming existing metrics for synthetic image evaluation.
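To make the "Spearman correlation of 1.0" claim concrete, here is a minimal sketch of how rank correlation between a proxy metric and downstream detector scores could be computed. The numbers are invented for illustration and are not from the paper, and the direction of the metric (higher is better) is an assumption.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    std_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (std_x * std_y)


# Hypothetical values: one proxy score and one detector mAP per synthetic set.
proxy_scores = [0.12, 0.35, 0.50, 0.81]
detector_map = [21.0, 24.5, 25.1, 30.2]
rho = spearman(proxy_scores, detector_map)
```

A value of 1.0 only says that the proxy orders the candidate sets exactly as downstream training does; it says nothing about how large the performance gaps between them are.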
[35] ParaScale: Scale-Calibrated Camera-Motion Transfer via a Gauge-Invariant Parallax Number cs.CV | cs.AI
Zijie Meng
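TL;DR: ParaScale proposes scale-calibrated camera-motion transfer. It introduces the Parallax Number Pi, a gauge-invariant, dimensionless descriptor, to resolve the scale mismatch between reference and target videos. As a plug-and-play module that needs no retraining, it can be integrated into any pose-conditioned generator and substantially reduces scale-mismatch error while preserving visual fidelity.
Details
Motivation: In video generation, directly transferring a reference video’s camera trajectory to a target scene at a different scale distorts perceived motion (either imperceptible or heavily exaggerated) because the depth-scale gauges are inconsistent.
Result: Across scale regimes spanning four orders of magnitude and multiple backbones, ParaScale keeps the realized parallax on the identity line and cuts the Parallax Consistency Error (PCE) by more than 3x relative to uncalibrated transfer, with no loss of visual fidelity.
Insight: Starting from geometric principles, the paper defines the gauge-invariant, dimensionless Parallax Number Pi (Pi = ||ΔT|| / Z̄) as a quantitative measure of how strongly a camera move is felt, and proves that this quantity, rather than the raw trajectory, is what scale-faithful transfer must preserve. It also proposes a new metric, PCE, that exposes scene-scale mismatch.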
Abstract: Transferring the camera motion of a reference video to a freshly generated one lets creators reuse cinematic moves. Yet reference and target often live at incompatible scales – a sweep across a galaxy versus a nudge across a desk – and naively reusing the recovered trajectory yields either imperceptible or violently exaggerated motion. We trace this to a geometric fact: translation-induced image motion scales as ||T||/Z, so a monocular trajectory is meaningful only up to a depth-scale gauge. We distill this into the Parallax Number Pi = ||Delta T|| / Zbar, a dimensionless, gauge-invariant descriptor of how strongly a camera move is felt, and prove that it – not the raw trajectory – is the quantity that scale-faithful transfer must preserve. ParaScale is a plug-and-play module that reads Pi off any reference video and re-realizes it against the target scene’s own depth, per frame, leaving rotation untouched. Sitting between pose extraction and pose injection, it requires no retraining and drops into any pose-conditioned generator. We further introduce the Parallax Consistency Error (PCE), a scale-symmetric metric that – unlike the similarity-aligned TransErr – exposes scene-scale mismatch. Across scale regimes spanning four orders of magnitude and multiple backbones, ParaScale keeps the realized parallax on the identity line and cuts PCE by more than 3x over uncalibrated transfer with no loss of visual fidelity.
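The definition Pi = ||Delta T|| / Zbar suggests a simple way to preserve Pi across scenes: scale each translation step by the ratio of mean depths. The sketch below illustrates that idea only. The function names and the per-step rescaling rule are assumptions made for illustration, not the authors' implementation.

```python
import math


def parallax_number(delta_t, mean_depth):
    """Pi = ||delta_t|| / mean_depth, a dimensionless quantity."""
    return math.sqrt(sum(c * c for c in delta_t)) / mean_depth


def rescale_translation(delta_t_ref, mean_depth_ref, mean_depth_tgt):
    """Scale a reference translation step so Pi is unchanged in the target.

    Assumed rule: multiply the step by Zbar_tgt / Zbar_ref. Rotation is not
    handled here, matching the abstract's note that rotation is untouched.
    """
    scale = mean_depth_tgt / mean_depth_ref
    return [c * scale for c in delta_t_ref]


# Hypothetical numbers: a large-scale reference move re-realized on a desk.
step_ref = [3.0, 0.0, 4.0]   # translation step in reference units
depth_ref = 100.0            # mean scene depth of the reference
depth_tgt = 0.5              # mean scene depth of the target
step_tgt = rescale_translation(step_ref, depth_ref, depth_tgt)
pi_ref = parallax_number(step_ref, depth_ref)
pi_tgt = parallax_number(step_tgt, depth_tgt)
```

Under these assumptions both Pi values agree, whereas reusing `step_ref` directly in the target scene would give a Pi two hundred times larger.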
[36] Neural Events: Discrete Asynchronous Autoencoders for Event-Based Vision cs.CV
Roberto Pellerito, Daniel Gehrig, Shintaro Shiba, Davide Scaramuzza
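TL;DR: This paper proposes “neural events”, a framework for re-encoding the raw event stream from an event camera. It compresses a continuous, information-sparse event stream into a small set of highly informative, discrete, learnable codes, reducing data throughput while maintaining high performance.
Details
Motivation: Event cameras output event streams at microsecond resolution, but each event carries little semantic value, so downstream algorithms must quickly integrate cues from a massive volume of low-information events. Existing architectures struggle to balance capturing fine-grained temporal dynamics with keeping data throughput manageable.
Result: On object detection and classification, networks trained on neural events match or surpass state-of-the-art methods while reducing the event rate by a factor of 2.0.
Insight: The core idea is to re-tokenize the event stream into discrete, asynchronous, learnable “neural events”, each representing a local spatio-temporal context window. This compressed representation preserves temporal dynamics while improving data-processing efficiency.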
Abstract: Event cameras capture dynamic scenes with exceptional temporal fidelity by representing them as a continuous stream of microsecond resolution events. Each individual event, however, only carries minimal semantic value, merely signaling a localized brightness change. To derive meaningful signals, downstream algorithms need to quickly integrate cues from a potentially massive torrent of low-information events. Current architectures, however, are easily overwhelmed, struggling to balance capturing fine-grained temporal dynamics and maintaining a manageable data throughput. This paper proposes a framework to re-tokenize event streams into a small set of highly informative neural events, each representing a local spatio-temporal context window with a discrete learnable code. Every time this code flips, a neural event is triggered, yielding a highly compressed data stream. We demonstrate that, across object detection and classification, networks trained on neural events are on par with or surpass the performance of state-of-the-art approaches while reducing the event rate by a factor of 2.0.
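The "code flips" trigger can be illustrated with a toy change detector over a sequence of discrete codes for one local window. This is only a sketch of the triggering rule as described in the abstract. How the codes are learned is not covered, and treating the very first code as an event is an assumption.

```python
def neural_events(codes):
    """Emit (time_index, code) whenever the discrete code changes."""
    events = []
    previous = None  # assumption: the first observed code counts as a flip
    for t, code in enumerate(codes):
        if code != previous:
            events.append((t, code))
            previous = code
    return events


# Hypothetical code sequence for a single spatio-temporal window.
window_codes = [3, 3, 3, 7, 7, 2]
triggered = neural_events(window_codes)
```

Six code observations collapse to three emitted events here, which is the kind of rate reduction the trigger rule is meant to provide.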
[37] ViCoStream: Streaming VideoLLMs Can Run Beyond 100 FPS with Stage-Wise Coordinated Inference cs.CV
Yang Tan, Junlong Tong, Linan Yue, Hao Wu, Pengfei Fang
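TL;DR: This paper proposes ViCoStream, a stage-wise coordinated inference framework for streaming video LLMs. By combining chunk-wise execution, CUDA-stream overlap, visual token control, bounded visual attention, and query-side retrieval, it achieves 134 FPS video throughput and a time to first token (TTFT) below 50 ms on a single A100 GPU, while keeping accuracy close to full-history baselines.
Details
Motivation: Existing streaming VideoLLM methods mainly accelerate individual modules (such as visual encoding or token pruning) but offer limited insight into whether the system can sustain real-time streaming. This paper uses coordinated pipeline inference to balance video-ingestion throughput against query-response latency.
Result: Across multiple streaming benchmarks with Qwen2.5-VL-3B/7B-Instruct, ViCoStream achieves 134 FPS video throughput and a TTFT below 50 ms on a single A100 GPU, with accuracy close to full-history baselines.
Insight: The paper formalizes streaming VideoLLM inference as a coordinated pipeline spanning visual preprocessing, encoding, token dropping, and LLM prefilling/decoding, and systematically studies how chunk size, token retention, attention locality, and retrieval scope shape the throughput-accuracy trade-off, giving a reusable optimization framework for real-time deployment.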
Abstract: Streaming VideoLLMs must continuously process incoming video while maintaining low query latency, making both video-ingestion throughput and query-time responsiveness critical for real-time deployment. Existing methods largely focus on accelerating individual modules, such as visual encoding, token pruning, or KV-cache compression, but provide limited insight into whether the resulting system can sustain real-time streaming performance. We formulate streaming VideoLLM inference as a coordinated pipeline spanning visual preprocessing, visual encoding, token dropping, and LLM prefilling/decoding. Building on this formulation, we propose ViCoStream (Video Coordinated Streaming), a stage-wise coordinated streaming framework that combines chunk-wise execution, CUDA-stream overlap, visual token control, bounded visual attention, and query-side retrieval to bound per-chunk computation and memory costs. We further provide a systematic study of bottleneck migration, revealing how chunk size, token retention, attention locality, and retrieval scope shape the throughput-accuracy trade-off. Experiments with Qwen2.5-VL-3B/7B-Instruct across multiple streaming benchmarks show that ViCoStream achieves 134 FPS video throughput and less than 50 ms TTFT on a single A100 GPU while maintaining accuracy close to full-history baselines.
[38] 3D-PLOT-LLM: Part-Level Object Tokens for 3D Large Language Models cs.CV
Jintang Xue, Xinyu Wang, Yixing Wu, Jingwen Chen, C. -C. Jay Kuo
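TL;DR: This paper proposes 3D-PLOT-LLM, a 3D multimodal large language model that addresses the inability of existing models to understand and reason about 3D objects at the part level. It reorganizes the input token stream, introducing learnable region markers and reserved vocabulary tokens, so that parts can be addressed and referenced directly through the LLM’s own vocabulary without a heavy segmentation decoder or bounding-box head.
Details
Motivation: Existing 3D multimodal LLMs can only describe a 3D object as a whole and cannot address, name, or reason about its parts. Earlier part-aware approaches add segmentation decoders, heavier 3D encoders, or bounding-box grammars at a substantial parameter cost. This paper explores a more efficient alternative with far fewer parameters.
Result: On the PartVerse-QA benchmark, the model reaches Jaccard 0.459 and exact-match 13.78% on the caption-to-slots task, and a slot-to-caption GPT-4o score of 44.68. On the 3DCoMPaT-GrIn part-aware captioning benchmark, it outperforms PointLLM, Kestrel, PARIS3D, and SegPoint on all text-output metrics and ShapeLLM on 3 of 4 metrics, with a GPT-4o score up to +3.03 above PointLLM. On Objaverse whole-object captioning, adding PartVerse-QA training raises SBERT and GPT-4o scores by +0.65 and +1.85 over PointLLM, respectively.
Insight: The core idea is to reorganize the input token stream (region markers plus reserved vocabulary tokens) rather than modify the architecture, which gives direct addressing and referencing of 3D object parts in a lightweight way. A Marker-Space Refinement module refines the markers using each region’s spatial statistics and adjacency, strengthening local representations. The method adds fewer than 1M trainable parameters, far fewer than prior approaches, and needs no segmentation decoder or bounding-box head.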
Abstract: 3D multimodal large language models (3D MLLMs) describe a 3D object as a whole but cannot address, name, or reason about its parts. Prior part-aware attempts add segmentation decoders, heavier 3D encoders, or bounding-box grammars at substantial parameter cost. We take a fundamentally different path: we reorganize the input token stream so that parts become directly addressable through the LLM’s own vocabulary. Our model, 3D-PLOT-LLM, partitions the frozen point encoder’s patches into K locally coherent regions and inserts, before each region’s patch tokens, a learnable per-region marker and a reserved vocabulary token
[39] Multimodal Concept Bottleneck Models cs.CV | cs.LG
Tongqing Shi, Ge Yan, Tuomas Oikarinen, Tsui-Wei Weng
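TL;DR: This paper proposes the Multimodal Concept Bottleneck Model (MM-CBM), which aligns image and text embeddings to interpretable features. It addresses the limited generalization and non-concept information leakage of conventional concept bottleneck models and extends them to CLIP, enabling interpretable zero-shot classification, image retrieval, and other new vision tasks.
Details
Motivation: Existing concept bottleneck models (CBMs) generalize poorly beyond a predefined set of classes and risk non-concept information leakage, where the model inadvertently exploits predictive signals outside the intended concepts.
Result: Across four standard benchmarks, MM-CBM improves average accuracy by up to 51.26% while staying within about 5% of black-box performance and offering greater interpretability.
Insight: The method introduces dual Concept Bottleneck Layers (CBLs) to align multimodal embeddings and extends CBMs to the CLIP framework for interpretable zero-shot tasks. Viewed objectively, the multimodal alignment mitigates information leakage and improves generalization and transparency.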
Abstract: Concept Bottleneck Models (CBMs) enhance the interpretability of deep learning networks by aligning the features extracted from images with natural concepts. However, existing CBMs are constrained in their ability to generalize beyond a fixed set of predefined classes and the risk of non-concept information leakage, where predictive signals outside the intended concepts are inadvertently exploited. In this paper, we propose Multimodal Concept Bottleneck Model (MM-CBM) to address these issues and extend CBMs into CLIP. MM-CBM utilizes dual Concept Bottleneck Layers (CBLs) to align both the image and text embeddings into interpretable features. This allows us to perform new vision tasks like zero-shot classification or image retrieval in an interpretable way. Compared to existing methods, MM-CBM achieves up to 51.26% accuracy improvement on average across four standard benchmarks. Our method maintains high accuracy, staying within ~5% of black-box performance while offering greater interpretability.
[40] Linear Recurrent Unit with Semantic Modulation for Image Super-Resolution cs.CV
Mingyu Choi, Woo Kyoung Han, Sunghoon Im, Kyong Hwan Jin
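TL;DR: This paper proposes an image super-resolution network based on the linear recurrent unit (LRU), adding a semantic modulating unit (SMU) to balance performance and efficiency. The SMU handles LRU modulation, spatial categorization, and feature enhancement via learned prototypes, improving single-image super-resolution. Experiments show that the method outperforms recent state-of-the-art methods quantitatively and qualitatively at comparable computational complexity.
Details
Motivation: The LRU performs well on long-range dependency tasks, but its static parameterization and single-scan method limit its use in 2D vision tasks such as image super-resolution. The paper introduces a semantic modulating unit to overcome these limits and balance performance with efficiency.
Result: On single-image super-resolution, the method surpasses recent state-of-the-art methods in both quantitative and qualitative evaluation, with computational complexity comparable to existing methods.
Insight: Combining the LRU with the SMU lets the network adapt dynamically to image content through LRU modulation, spatial categorization, and prototype-based feature enhancement, improving 2D vision performance while remaining efficient.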
Abstract: Linear recurrent unit (LRU), designed with a principled formulation for stable linear recurrence, has demonstrated promising accuracy and robustness on long-range dependency tasks. However, its static parameterization and single-scan method limit its applicability to 2D vision tasks. In this study, we propose an LRU-based restoration network with a semantic modulating unit (SMU) to achieve a harmonious balance between performance and efficiency in single-image super-resolution. The SMU plays three key roles: LRU modulation, spatial categorization, and feature enhancement through learned prototypes. Extensive experiments demonstrate that our method quantitatively and qualitatively surpasses recent state-of-the-art methods. Notably, our approach achieves superior performance with computational complexity on par with existing methods. The source code and models are available at https://github.com/MingyuChoi-run/LSM
[41] SurgVista: Long-Horizon Surgical World Modeling with Plausible Instrument-Tissue Dynamics cs.CV
Wentao Pan, Wuyang Li, Shengyuan Liu, Xinyu Liu, Hengyu Liu
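TL;DR: This paper presents SurgVista, a model for long-horizon surgical world modeling that generates future surgical frames with plausible instrument-tissue dynamics. Two training recipes (Deformation Consistency Regularization and Drift Adaptation Training) address the spatial interaction incoherence and temporal fidelity collapse seen in existing methods. The paper also introduces the SurgWorld-Bench benchmark for rigorously evaluating instrument-motion accuracy and tissue-response fidelity.
Details
Motivation: Scaling robot policy learning for surgery is hard because expert demonstrations are expensive and in vivo exploration poses safety risks. Existing surgical world models show two failure modes: spatial interaction incoherence (instrument contact fails to induce spatially consistent tissue deformation) and temporal fidelity collapse (prediction errors accumulate over autoregressive rollouts and degrade visual quality).
Result: Extensive experiments on SurgWorld-Bench show that SurgVista consistently outperforms state-of-the-art methods in visual quality, temporal consistency, and interaction fidelity, with the advantage widening as the prediction horizon grows.
Insight: Two training recipes are proposed: (1) Deformation Consistency Regularization, which extracts scene-point trajectories from training videos and enforces cross-frame coherence through latent contrastive learning to strengthen physically consistent instrument-tissue dynamics; and (2) Drift Adaptation Training, which perturbs conditioning frames with online prediction residuals and photometric augmentations calibrated to long-horizon drift statistics to sustain visual fidelity over long rollouts. In addition, SurgWorld-Bench, with diverse procedure types, long-range rollouts, and decoupled metrics, gives the field a rigorous evaluation framework.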
Abstract: Scaling robot policy learning for autonomous surgery is challenging, as expert demonstrations are expensive and in vivo exploration poses substantial safety risks. Surgical world models address this by generating realistic, action-conditioned future frames from an initial observation, but existing methods exhibit two persistent failure modes: spatial interaction incoherence, where visible instrument contact fails to induce spatially consistent tissue deformation, and temporal fidelity collapse, where prediction errors compound across autoregressive rollouts and progressively corrupt visual quality. We present SurgVista, a surgical world model that mitigates both failures through two training recipes. Deformation Consistency Regularization extracts scene-point trajectories from training videos and enforces cross-frame coherence through latent contrastive learning, strengthening physically consistent instrument-tissue dynamics. Drift Adaptation Training mitigates long-horizon drift by perturbing conditioning frames with online prediction residuals and photometric augmentations calibrated to long-horizon drift statistics, sustaining visual fidelity over extended rollouts. To enable rigorous evaluation, we further introduce SurgWorld-Bench, featuring diverse procedure types, long-range rollouts, and decoupled metrics for instrument-motion accuracy and tissue-response fidelity. Extensive experiments show that SurgVista consistently outperforms state-of-the-art methods across visual quality, temporal consistency, and interaction fidelity, with gains widening as the prediction horizon grows.
[42] SpatialSV: Internalizing Interpretable 3D Spatial Awareness in MLLMs via Task-Oriented Visual Supervision cs.CV
Jiayu Tang, Yuchen Zhou, Chao Gou
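TL;DR: This paper proposes SpatialSV, a framework for internalizing interpretable 3D spatial awareness in multimodal large language models. Task-oriented visual supervision makes the model actively lift its 2D visual features into explicit 3D representations, avoiding reliance on external tools or uninterpretable feature distillation.
Details
Motivation: Injecting spatial priors through external tools adds significant inference overhead, while latent feature distillation is uninterpretable and lacks fine-grained geometric constraints. SpatialSV aims to internalize robust, interpretable 3D spatial awareness within MLLMs.
Result: Extensive experiments across multiple models and benchmarks show that SpatialSV enhances and helps interpret the spatial intelligence of MLLMs. The framework also generalizes strongly in semi-supervised settings, supporting its potential to use unlabeled visual data for scalable, interpretable spatial representation learning.
Insight: Task-oriented visual supervision forces the model to lift 2D features into explicit 3D representations, which opens a transparent, visual diagnostic window onto its intrinsic spatial knowledge. Viewed objectively, using 3D reconstruction as an interpretable proxy gives an intuitive way to assess internal representations and is a novel supervision paradigm.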
Abstract: Unlocking the spatial intelligence of multimodal large language models (MLLMs) is crucial for understanding and interacting with the 3D world. Prevailing approaches typically inject spatial priors via external tools, which impose significant inference overhead, or rely on latent feature distillation, which remains uninterpretable and lacks fine-grained geometric constraints. To address these issues, we propose SpatialSV, a framework designed to internalize robust 3D spatial awareness within MLLMs while simultaneously offering inherent interpretability. Deviating from passive feature imitation, SpatialSV employs task-oriented visual supervision, compelling the model to actively lift its 2D visual features into explicit 3D representations, including depth maps, camera poses, and point clouds. Crucially, this 2D-to-3D lifting process provides a transparent window into the model’s representations: the resulting 3D reconstructions serve as an intuitive proxy for visualizing and diagnosing the quality of the model’s intrinsic spatial knowledge. Extensive experiments across multiple models and benchmarks demonstrate the effectiveness of SpatialSV in enhancing and interpreting MLLMs’ spatial intelligence. Furthermore, the framework exhibits strong generalization in semi-supervised settings, validating its potential to leverage unlabeled visual data for scalable, interpretable spatial representation learning.
[43] Gaussian Process Prior Variational Autoencoder for Endoscopic Videos cs.CV
Ivan De Boi, Xinxing Shi, Xiaoyu Jiang, Tim J. M. Jaspers, Francisco Caetano
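TL;DR: This paper proposes a Gaussian Process Prior Variational Autoencoder (GPVAE) framework for endoscopic video restoration. It replaces the standard factorized latent prior with a temporal Gaussian process prior, exploiting temporal continuity to interpolate missing frames with uncertainty-aware reconstruction.
Details
Motivation: Endoscopic video analysis is essential for gastrointestinal diagnosis and computer-assisted interventions, but sequences are routinely degraded by specular reflections, motion artifacts, and missing frames. These transient corruptions distract clinicians, reduce image interpretability, and disrupt downstream tasks such as 3D reconstruction and navigation, so effective restoration needs to exploit temporal continuity rather than treat frames in isolation.
Result: On the C3VDv2 colonoscopy dataset, the best GPVAE variants reduce image reconstruction RMSE by 21.9% on average, and by up to 26.1%, relative to matched VAE baselines. Downstream trajectory RMSE falls by 12.7% on average across classical visual odometry and a pretrained PoseNet, at an average 27.3% increase in training time per epoch.
Insight: The paper brings a temporal Gaussian process prior into the VAE framework to model dependencies between video frames and enable uncertainty-aware interpolation of missing frames. It also combines endoscopy-specific encoders (EndoVAE and ViT encoders from GastroNet-5M) with scalable GP approximations (HPA and SPA), and handles specular reflections with a DUCKNet-based masking pipeline.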
Abstract: Endoscopic video analysis is essential for gastrointestinal diagnosis and computer-assisted interventions, but video sequences are routinely degraded by specular reflections, motion artifacts, and missing frames. These transient corruptions can distract clinicians, reduce image interpretability, and disrupt downstream tasks such as 3D reconstruction and navigation. Effective restoration therefore requires methods that exploit temporal continuity rather than treating frames in isolation. We introduce a Gaussian Process Prior Variational Autoencoder (GPVAE) framework for endoscopic video restoration that replaces the standard factorized latent prior with a temporal Gaussian process prior, enabling interpolation of missing frames with uncertainty-aware reconstruction. The framework combines endoscopy-specific encoders, including a convolutional EndoVAE backbone and pretrained Vision Transformer encoders from GastroNet-5M, with two scalable GP approximations: Hierarchical Prior Approximation (HPA) and Sparse Precision Approximation (SPA). Specular reflections are handled using a DUCKNet-based masking pipeline that excludes corrupted pixels from the reconstruction objective. On the C3VDv2 colonoscopy dataset, the best GPVAE variants reduced image reconstruction RMSE by 21.9% on average, and by up to 26.1%, relative to matched VAE baselines. Downstream trajectory RMSE was reduced by 12.7% on average across classical visual odometry and a pretrained PoseNet, at an average increase of 27.3% in training time per epoch. Finally, the GP posterior provides per-frame uncertainty estimates that reflect temporal support and offer a confidence signal for restored frames.
[44] CARE: Competence-Aware Reward Shaping for Adaptive Reasoning Length in Video-MLLMs cs.CV
Chengwen Liu, Hao Peng, Jisheng Dang, Hong Peng, Bin Hu
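TL;DR: This paper proposes CARE, a competence-aware reward shaping framework for optimizing adaptive reasoning length in video multimodal large language models. It estimates model competence with an exponential moving average, splits training into progressive stages that steer the model from exploratory long-form reasoning toward efficient, concise reasoning, and adds a posterior amplifier that strengthens reward signals on difficult samples.
Details
Motivation: Reinforcement-learning-based video multimodal reasoning methods typically use simple, rigid reasoning-length control that cannot adapt to the model’s evolving competence, leading to insufficient exploration early on or to redundant reasoning and inefficient decoding later.
Result: Extensive experiments on multiple video reasoning and general video understanding benchmarks show that CARE consistently improves reasoning accuracy, stabilizes reinforcement learning, and significantly improves token efficiency. Reasoning length follows an inverted-U trajectory during training, with shorter yet more informative reasoning traces at convergence.
Insight: The core idea is to fold a dynamic estimate of model competence into reward shaping so that reasoning length adapts during training. Batch-statistic normalization separates redundancy from necessary complexity, and the posterior amplifier rewards strong performance on historically difficult samples. These designs could carry over to other sequential decision tasks that must balance exploration and efficiency.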
Abstract: In multimodal video reasoning, reinforcement learning-based methods typically rely on simplistic and inflexible reasoning-length control strategies that fail to adapt to the model’s evolving competence. This mismatch may suppress necessary exploration at early stages, while encouraging redundant reasoning and inefficient decoding once the model becomes more competent. In this paper, we propose CARE, a competence-aware reward shaping framework for adaptive reasoning length optimization in multimodal reasoning. Specifically, CARE maintains a smoothed competence estimate via an exponential moving average of pass rates, and uses it to route training into progressive stages that shift the reward preference from exploration-oriented long-form reasoning to efficiency-oriented concise reasoning. To avoid conflating verbosity with intrinsic task complexity, CARE further normalizes reasoning effort with batch-level statistics, and introduces a posterior amplifier to strengthen reward signals for unexpectedly strong performance on historically difficult samples. The proposed mechanism is seamlessly integrated into the GRPO training pipeline and incurs no additional inference-time overhead. Extensive experiments on multiple video reasoning and general video understanding benchmarks demonstrate that CARE consistently improves reasoning accuracy, stabilizes reinforcement learning, and significantly enhances token efficiency. Moreover, CARE exhibits a characteristic inverted-U trajectory of reasoning length during training, and yields shorter yet more informative reasoning traces at convergence, indicating effective adaptive allocation of reasoning budget. We provide the source code for our proposed CARE framework and experiments at https://github.com/1Pansy/Video-CARE.
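The competence estimate described above (an exponential moving average of pass rates that routes training into stages) can be sketched in a few lines. The decay value, the thresholds, and the stage names below are hypothetical placeholders chosen for illustration; the paper's actual settings and reward formulas are not given in this digest.

```python
def update_competence(previous, pass_rate, decay=0.9):
    """Exponential moving average of batch pass rates (decay is assumed)."""
    return decay * previous + (1.0 - decay) * pass_rate


def select_stage(competence, thresholds=(0.3, 0.6)):
    """Map smoothed competence to a training stage (names are hypothetical)."""
    low, high = thresholds
    if competence < low:
        return "explore"      # favor long-form, exploratory reasoning
    if competence < high:
        return "transition"   # reduce the preference for length
    return "concise"          # favor short, efficient reasoning


# Hypothetical pass rates observed over successive training batches.
competence = 0.0
stages = []
for rate in [0.2, 0.5, 0.9, 0.9]:
    competence = update_competence(competence, rate, decay=0.5)
    stages.append(select_stage(competence))
```

Smoothing keeps a single lucky or unlucky batch from flipping the stage, which is one plausible reason to use a moving average here rather than the raw pass rate.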
[45] Spatial-Aware Reduction Framework: Towards Efficient and Faithful Visual State Space Models cs.CV | cs.AI
Jindi Lv, Aoyu Li, Yuhao Zhou, Zheng Zhu, Xiaofeng Wang
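TL;DR: Existing token reduction methods cause a performance collapse when applied to visual state space models such as Mamba. This paper proposes STORM, a spatial-aware token reduction framework that recasts reduction as a structured operation on spatial units, preserving structural integrity during compression and substantially improving training-free pruning accuracy across several vision Mamba backbones.
Details
Motivation: Existing token reduction methods cause severe degradation on structurally enhanced Mamba variants. The authors attribute this to the methods being spatially agnostic, which violates the two-dimensional structural premise required by the selective scanning mechanism.
Result: Under training-free settings, STORM achieves state-of-the-art pruning accuracy across diverse vision Mamba backbones. On VMamba, it delivers a substantial accuracy recovery, outperforming prior methods by up to 63.3% in top-1 accuracy; on PlainMamba, it incurs only a 1.0% accuracy drop, with performance comparable to ViT.
Insight: Token reduction is redefined as a structured operation that enforces localized constraints on spatial units to preserve grid topology and neighborhood coherence. As a plug-and-play module, STORM gives existing reduction pipelines explicit spatial awareness without any training.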
Abstract: Mamba demonstrates strong efficiency in modeling long visual sequences. However, when token reduction is applied to structurally enhanced Mamba variants, these models exhibit a severe performance collapse. We attribute this degradation to the spatially agnostic nature of existing reduction methods, which violate the two-dimensional structural premise required by the selective scanning mechanism. In this work, we propose STORM, a spatial-aware token reduction framework designed to maintain structural integrity throughout the compression process. STORM reformulates reduction into a structured operation on spatial units, enforcing localized constraints to maintain both grid topology and neighborhood coherence. As a plug-and-play module, STORM equips existing reduction pipelines with explicit spatial awareness without any training. Empirical results demonstrate that STORM achieves state-of-the-art pruning accuracy across diverse vision Mamba backbones under training-free settings. Notably, STORM delivers a substantial accuracy recovery on VMamba, outperforming prior methods by up to 63.3% in top-1 accuracy. Meanwhile, STORM incurs only a 1.0% accuracy drop on PlainMamba, achieving performance comparable to ViT.
[46] Triangular Consistency as a Universal Constraint for Learning Optical Flow cs.CV | cs.AI
Yi Xiao, Carlos Rodriguez Coronel, Jing Zhan, Haniyeh Ehsani Oskouie, Alex Wong
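TL;DR: This paper proposes triangular consistency as a universal constraint for learning optical flow. It is independent of network architecture, supervision type, and dataset, and applies to both image-pair and multi-frame settings. The constraint composes two flows to produce a third and enforces consistency among the three, covering cycle consistency, multi-frame temporal chaining, and synthetic data augmentation.
Details
Motivation: Optical flow learning lacks a general, lightweight geometric constraint that needs no additional annotation. The goal is to improve model performance across different training settings.
Result: Experiments show consistent improvements in supervised, unsupervised, and transfer learning settings, with negligible computational overhead.
Insight: A universal triangular consistency constraint is derived from the geometry of optical flow and serves as a plug-and-play component for training, without relying on model-specific assumptions.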
Abstract: We propose triangular consistency as a first-principled constraint for optical flow, which is agnostic to network architecture, supervision type, and dataset, and applies to both image-pair and multi-frame settings. This simple but powerful constraint is to compose two flows to induce a third flow and enforce consistency among the three. The composed flows may arise from (i) image pairs, yielding cycle consistency; (ii) multiple video frames, producing longer-range motion through temporal chaining; or (iii) image pairs combined with controlled synthetic transformations, which becomes data augmentation. This triangular consistency introduces negligible computational overhead and requires no additional annotations. Since it is derived directly from the geometry of optical flow, it does not rely on model-specific assumptions and serves as a “universal” plug-and-play component for optical flow training. Experiments show consistent improvement across supervised, unsupervised, and transfer learning settings.
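Composing two flows to induce a third can be illustrated on tiny dense flow fields. The sketch below uses the standard composition F_ac(x) = F_ab(x) + F_bc(x + F_ab(x)) with nearest-neighbor lookup and border clamping. Both of those sampling choices, and the mean endpoint error used as the penalty, are simplifying assumptions rather than the paper's formulation (training code would more typically use differentiable bilinear warping).

```python
import math


def compose_flows(flow_ab, flow_bc):
    """Compose A->B and B->C flows into an induced A->C flow.

    Flows are H x W grids of (dx, dy) tuples. flow_bc is sampled at the
    nearest pixel to x + flow_ab(x), clamped to the image border.
    """
    height, width = len(flow_ab), len(flow_ab[0])
    composed = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = flow_ab[y][x]
            tx = min(max(int(round(x + dx)), 0), width - 1)
            ty = min(max(int(round(y + dy)), 0), height - 1)
            ex, ey = flow_bc[ty][tx]
            row.append((dx + ex, dy + ey))
        composed.append(row)
    return composed


def triangular_error(flow_ab, flow_bc, flow_ac):
    """Mean endpoint error between the composed flow and a direct A->C flow."""
    composed = compose_flows(flow_ab, flow_bc)
    total, count = 0.0, 0
    for composed_row, direct_row in zip(composed, flow_ac):
        for (cx, cy), (px, py) in zip(composed_row, direct_row):
            total += math.hypot(cx - px, cy - py)
            count += 1
    return total / count


def constant_flow(dx, dy, height=3, width=3):
    """Helper for the example: a uniform flow field."""
    return [[(dx, dy) for _ in range(width)] for _ in range(height)]


# Hypothetical example: shift right by 1, then down by 1.
error_consistent = triangular_error(
    constant_flow(1.0, 0.0), constant_flow(0.0, 1.0), constant_flow(1.0, 1.0))
error_inconsistent = triangular_error(
    constant_flow(1.0, 0.0), constant_flow(0.0, 1.0), constant_flow(0.0, 0.0))
```

Setting `flow_ac` to the zero flow (C is the same image as A) turns the same penalty into a cycle-consistency check, which corresponds to case (i) in the abstract.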
[47] Speeding up the annotation process in semantic segmentation industrial applications cs.CV | cs.AI
Marta Fernandez-Moreno, Margarita Guerrero, Rosalia Rementeria, Pablo Mesejo, Raul Moreno
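TL;DR: This paper uses unsupervised computer vision algorithms to speed up data annotation for complex semantic segmentation tasks in industrial materials science. A pre-annotation step cuts annotation time from 170 hours to 37 hours, a reduction of about 78%.
Details
Motivation: Pixel-level annotation of high-resolution images for semantic segmentation is time-consuming and error-prone. The paper quantifies how much unsupervised algorithms actually accelerate the annotation workflow, targeting the data-labeling bottleneck in industrial applications.
Result: On the authors’ own dataset, described as the largest public steel microstructure segmentation dataset to date, unsupervised pre-annotation cuts annotation time from 170 hours to 37 hours. A deep learning model validated by domain experts is provided as a benchmark for the dataset.
Insight: The work is presented as the first to quantify the speed-up that unsupervised algorithms bring to annotation, and it releases a fully annotated high-resolution dataset. The approach could extend to other industrial vision tasks that need fine-grained annotation, reducing manual cost.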
Abstract: Current machine learning models commonly require large and well-annotated datasets. However, the annotation process often becomes a bottleneck, with increased complexity leading to higher chances of human errors. Within this context, our goal in this paper is to leverage unsupervised algorithms to improve data annotation efficiency for complex semantic segmentation problems in industrial materials science. Previous research has quantified labeling time and others explored unsupervised methods. However, to the best of our knowledge, this is the first study to quantify how much unsupervised algorithms accelerate the labeling process. We aim to validate the extent to which this laborious process can be accelerated, focusing on semantic segmentation tasks that involve annotating each pixel of high-resolution images, such as the microstructure characterization challenge in materials science. Specifically, we demonstrate that by using unsupervised computer vision algorithms, the time required for the labeling process can be reduced from 170 hours to 37 hours, achieving an approximate reduction of 78%. The dataset we work with includes large images of dimensions 1280x959 and 960x703, which further increases the complexity of the annotation task. Despite these challenges, we create and share the largest public steel microstructure segmentation dataset to date, available under MIT License with permanent DOI, contributing a fully annotated, high-resolution dataset to the field. Additionally, this is the first work to compare the labeling time from scratch (a common approach in previous studies) to the labeling time when using these unsupervised algorithms as a pre-annotation step. Furthermore, we provide a Deep Learning model trained on this dataset, validated by field experts, and deployed in an industrial setting, serving as an initial benchmark for this public dataset.
[48] Timage: A Generative Text-in-Image Paradigm for Fine-Tuning Vision-Language Models cs.CV
Yifeng Wu, Huimin Huang, Ruiluo Wu, Chunyi Lin, Guanhua Chen
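TL;DR: This paper proposes Timage, a generative text-in-image paradigm for fine-tuning vision-language models (VLMs). A Constrained Schrödinger Bridge (cSB) sampler draws the text query onto the image itself as a typeset overlay, addressing spatial alignment in multimodal understanding at the input level.
Details
Motivation: Multimodal large language models (MLLMs) often fail to locate the correct image regions during fine-grained spatial reasoning because text queries lack explicit geometric anchors. Existing remedies either adjust model weights or add verbose instruction prompts, but neither reliably aligns language with visual coordinates without eroding the model’s general capabilities.
Result: On the VMCBench benchmark, Timage paired with a modest 7B backbone clearly outperforms far larger proprietary systems as well as parameter-tuned baselines.
Insight: Multimodal understanding is recast as an input-level alignment problem, with a generated visual overlay acting as an explicit attention beacon that guides the model toward spatial semantics. The cSB method factorizes layout synthesis into two coupled stochastic stages, region search and appearance shaping, protecting foreground content while keeping the text legible and visually balanced. Deliberate input reconstruction is thus an architecture-neutral lever for strengthening multimodal reasoning.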
Abstract: Multimodal Large Language Models (MLLMs) often lose track of the right image regions during fine-grained spatial reasoning, because a textual query rarely carries any explicit geometric anchor into the pixel domain. Prevailing remedies either rewire the model’s weights or pad the prompt with verbose instructions, yet neither reliably pins the language to the correct visual coordinates without eroding the backbone’s general competence. We introduce Timage, a paradigm that recasts multimodal understanding as an alignment problem solved at the input: the query is drawn, as a typeset overlay, onto the image itself. The placement and appearance of this overlay are produced by a Constrained Schrödinger Bridge (cSB), an entropic optimal-transport sampler that factorizes layout synthesis into two coupled stochastic stages. The first stage, Region Search, transports noise toward query-aligned image zones while obeying a hard occlusion barrier that protects salient foreground content; the second stage, Appearance Shaping, sizes the glyphs through an “ink-budget” regularizer so that the rendered text stays legible and visually balanced. The resulting overlay behaves as an explicit attention beacon that channels the model’s focus along spatial semantics. On the VMCBench suite, Timage paired with a modest 7B backbone clearly overtakes far larger proprietary systems as well as parameter-tuned baselines. The study positions deliberate input reconstruction as a powerful, architecture-neutral lever for strengthening multimodal reasoning.
[49] Confidence Calibration for Multimodal LLMs: An Empirical Study through Medical VQA cs.CV | cs.AIPDF
Yuetian Du, Yucheng Wang, Ming Kong, Tian Liang, Qiang Long
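TL;DR: This paper presents the first comprehensive analysis of confidence calibration in medical multimodal large language models (MLLMs) and proposes a method that combines Multi-Strategy Fusion-Based Interrogation (MS-FBI) with auxiliary expert LLM assessment to make model confidence more reliable in medical visual question answering (VQA). Experiments show that the method reduces Expected Calibration Error (ECE) by an average of 40% across three medical VQA datasets, significantly improving the trustworthiness of MLLMs in medical applications.
Details
Motivation: Medical MLLMs show great potential in medical tasks, but their stated confidence often does not match their actual accuracy, which can lead to misdiagnosis or to correct advice being overlooked. Calibrating their confidence is therefore needed to improve reliability.
Result: Across three medical VQA datasets, the proposed method reduces ECE by an average of 40%, substantially improving confidence calibration and the model's trustworthiness for medical diagnosis.
Insight: The contribution is the first systematic analysis of confidence calibration in medical MLLMs, together with a calibration method that combines multi-strategy fusion-based interrogation and expert LLM assessment. Viewed objectively, the work underscores the importance of domain-specific calibration and offers a more trustworthy solution for AI-assisted diagnosis.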
Abstract: Multimodal Large Language Models (MLLMs) show great potential in medical tasks, but their elicited confidence often misaligns with actual accuracy, potentially leading to misdiagnosis or overlooking correct advice. This study presents the first comprehensive analysis of the relationship between accuracy and confidence in medical MLLMs. It proposes a novel method that combines Multi-Strategy Fusion-Based Interrogation (MS-FBI) with auxiliary expert LLM assessment, aiming to improve confidence calibration in Medical Visual Question Answering (VQA). Experiments demonstrate that our method reduces the Expected Calibration Error (ECE) by an average of 40% across three Medical VQA datasets, significantly enhancing MLLMs’ reliability. The findings highlight the importance of domain-specific calibration for MLLMs in healthcare, offering a more trustworthy solution for AI-assisted diagnosis.
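For readers unfamiliar with the metric, ECE is commonly computed by grouping predictions into confidence bins and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below is a minimal NumPy illustration of that standard binned definition. The function name, the choice of 10 equal-width bins, and the toy inputs are assumptions made for illustration; the abstract does not state the paper's exact evaluation protocol.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin weight) * |accuracy - mean confidence|.

    Hypothetical helper for illustration; not taken from the paper.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Right-closed bins, so a confidence of exactly 1.0 lands in the last bin.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the fraction of samples in this bin
    return float(ece)

# Toy example: four answers at 0.9 confidence, three of them correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

In the toy example all predictions fall into one bin with accuracy 0.75 and mean confidence 0.9, so the ECE is the single gap of 0.15 (up to floating-point rounding). A 40% reduction, as reported above, would be relative to a baseline ECE computed in this way.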
[50] DiffMath: Symbol- and Graph-Aware Latent Diffusion Transformer for Handwritten Mathematical Expression Generation cs.CVPDF
Wei Pan, Xuhan Zheng, Yilin Shi, Huiguo He, Hiuyi Cheng
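TL;DR: This paper proposes DiffMath, a symbol- and graph-aware latent diffusion framework for handwritten mathematical expression generation. It uses the hierarchical structure of LaTeX as a structural prior and needs no positional supervision, generating structurally consistent handwritten expressions by designing a Relational Abstract Syntax Tree (RelAST), learning structure-preserving latent representations (MathVAE), and performing conditional denoising in the structured latent space (MathDiT).
Details
Motivation: Handwritten mathematical expression generation is challenging because of complex two-dimensional layouts and long-range structural dependencies. Existing methods rely on explicit spatial supervision such as symbol-level bounding boxes, which is costly to annotate and limits scalability.
Result: Experiments show that DiffMath generates structurally consistent handwritten expressions, outperforms existing methods, and improves the accuracy of downstream OCR models through synthetic data augmentation.
Insight: The contributions include RelAST, a compact generation-oriented representation; MathVAE, which learns structure-preserving latent representations through symbol-aware and relation-aware perceptual regularization; and the use of Adaptive Layer Normalization (AdaLN) in MathDiT to inject a global symbol-count prior that improves structural coherence. Viewed objectively, treating LaTeX structure as an implicit prior avoids expensive positional annotation and is an effective step toward data-efficient generation.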
Abstract: Handwritten Mathematical Expression Generation (HMEG) is challenging due to the complex two-dimensional layouts and long-range structural dependencies of mathematical expressions. Existing methods typically rely on explicit spatial supervision, such as symbol-level bounding boxes, which incurs high annotation costs and limits scalability. In this work, we propose DiffMath, a symbol- and graph-aware latent diffusion framework that leverages the hierarchical structure inherent in LaTeX as a structural prior, eliminating the need for positional supervision. First, we design a Relational Abstract Syntax Tree (RelAST), a generation-oriented representation that distills MathML trees into compact triplet sequences [S, R, D], where each token directly encodes a symbol identity, spatial relation, or nesting depth. Second, we introduce MathVAE, which learns structure-preserving latent representations through symbol-aware and relation-aware perceptual regularization, ensuring that the latent space captures both character semantics and spatial topology. Third, MathDiT performs conditional denoising in this structured latent space, further guided by a global symbol-count prior via Adaptive Layer Normalization (AdaLN) to improve structural coherence. Experiments show that DiffMath produces structurally consistent handwritten expressions, achieves superior performance over existing methods, and improves the accuracy of downstream OCR models through synthetic data augmentation.
[51] SketchKeyAnime: Reference-anchored Sparse Key-Sketch Animation Synthesis cs.CVPDF
Meixi Li, Xianlin Zhang, Yue Zhang, Xueming Li
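TL;DR: This paper proposes SketchKeyAnime, an animation generation framework built on a video diffusion model. From sparse key-sketch inputs (a single reference RGB image and a few temporally indexed key sketches), it generates structurally controllable, appearance-consistent, and temporally coherent animations.
Details
Motivation: Traditional animation production relies heavily on manual drawing and iterative refinement. Existing animation generation methods typically require RGB boundary frames, dense frame-wise conditions, or complete sketch sequences, which limits their use under low-cost input conditions. This paper targets high-quality animation generation from sparse sketch inputs.
Result: On the Aesthetic subset of Sakuga-42M, SketchKeyAnime outperforms representative animation interpolation and sketch-guided generation baselines. Compared with the best-performing baseline, it reduces EDMD by 31.9% and FVD by 9.5%, and achieves the best overall performance on most quantitative metrics.
Insight: The contributions are: (1) a dual-branch conditioning mechanism that separately encodes local geometric constraints and semantic-temporal context; (2) Sketch Cross Attention, which fuses reference-image and sketch conditions through learnable gating; and (3) an Adaptive Weighted Loss that strengthens supervision on key-sketch frames and line-art regions. Together these give fine-grained control over animation structure and appearance.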
Abstract: Traditional animation production relies heavily on manual drawing and iterative refinement, particularly for key-pose design, in-betweening, and character coloring. While existing animation and video generation methods have made notable progress, they typically depend on RGB boundary frames, dense frame-wise conditions, or complete sketch sequences, limiting their applicability under low-cost input conditions. We present SketchKeyAnime, a video diffusion framework for generating structurally controllable, appearance-consistent, and temporally coherent animations from sparse key-sketch inputs. Given a single reference RGB image and a few temporally indexed key sketches, SketchKeyAnime introduces a dual-branch conditioning mechanism to encode local geometric constraints alongside semantic-temporal context. It leverages Sketch Cross Attention to fuse reference image and sketch conditions with learnable gating, and incorporates an Adaptive Weighted Loss to strengthen supervision on key-sketch frames and line-art regions. Experimental results on the Aesthetic subset of Sakuga-42M show that our approach consistently outperforms representative animation interpolation and sketch-guided generation baselines. Compared to the best-performing baseline, SketchKeyAnime reduces EDMD by 31.9% and FVD by 9.5%, demonstrating superior sketch fidelity and temporal coherence, while achieving the best overall performance across most quantitative metrics. These results validate the proposed framework and highlight its potential for low-cost, highly controllable animation creation.
[52] ROSE: Benchmarking the Perception-to-Action Gap in Multimodal Models cs.CV | cs.AIPDF
Yihao Wang, Zijian He, Jie Ren, Keze Wang
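TL;DR: This paper introduces ROSE, a benchmark for evaluating how well multimodal large language models turn visual perception into action. By holding the visual scene fixed while varying region constraints and required symbolic outputs, it tests how reliably a model converts the same visual evidence into the action each task context demands. The study finds a large performance gap between counting-oriented tasks and region-conditioned action tasks, exposing a bottleneck in turning shared visual evidence into context-specific actions.
Details
Motivation: Multimodal large language models are increasingly expected to act on visual information, yet the same scene may call for different actions under different task contexts. The paper investigates how reliably a model can turn the same visual evidence into the action the current context requires, in order to expose weaknesses in perception-to-action conversion.
Result: Across nine recent MLLMs, performance drops by as much as 44.5 percentage points from counting-oriented tasks to region-conditioned action tasks, while human performance reaches 98.8%. The gap persists on paired scenes and regions even when the same model returns the correct count, and coordinate grounding explains only part of the loss, pointing to a model-dependent bottleneck.
Insight: The contribution is the ROSE benchmark, which couples counting and coordinate-action tasks to systematically evaluate whether models can infer an implicit majority reference and act on fine-grained visual evidence. Viewed objectively, the study reveals an inherent bottleneck in context-specific action generation and gives targeted directions for model improvement.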
Abstract: Multimodal large language models (MLLMs) are increasingly expected to act on visual information, yet the same scene may require different actions under different task contexts. How reliably can a model turn the same visual evidence into the action required by the current context? To answer this question, we introduce ROSE (Reference-conditioned Oddity and Symbolic Execution), a controlled benchmark that holds the visual scene fixed while varying region constraints and required symbolic outputs. Through coupled counting and coordinate-action tasks, ROSE tests whether models can infer an implicit majority reference and act on the resulting fine-grained visual evidence under changing contexts. Across nine recent MLLMs, performance drops by as much as 44.5 percentage points from counting-oriented tasks to region-conditioned action, despite 98.8% human performance. The gap persists on paired scenes and regions for which the same model returns the correct count, while global-click and matched local controls show that coordinate grounding explains only part of the loss, revealing a distinct, model-dependent bottleneck in turning shared visual evidence into context-specific actions.
[53] CrossFlow: One-Step Generation Across Latent and Pixel Spaces cs.CVPDF
Xiyuan Wang, Xiao Zhang, Yang Li, Ruoxi Jiang, Zhao Zhong
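TL;DR: CrossFlow is a cross-space flow-matching generation method that maps noisy latents directly to pixel-space images, enabling one-step generation. Its velocity-free one-step objective trains along a latent-space trajectory but supervises the prediction as an image rather than a latent displacement, unifying the roles of generator and decoder. On class-conditional ImageNet-1k at 256×256, CrossFlow-XL reaches 1.62 FID with a single function evaluation.
Details
Motivation: Conventional latent diffusion models suffer from an objective mismatch: the generator is optimized for latent-space prediction, but final image quality depends on how the decoder handles generated latents that may differ from clean encoder outputs.
Result: On the class-conditional ImageNet-1k 256×256 benchmark, CrossFlow-XL reaches 1.62 FID with one function evaluation. Ablations show that the latent encoder and the pixel-space perceptual and adversarial losses are important for fidelity.
Insight: The contribution is a cross-space flow objective that combines latent-trajectory training with pixel-space supervision, enabling efficient one-step generation without a separate decoder and unifying generation and decoding.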
Abstract: Most diffusion and flow-matching generators define the prior, probability path, and prediction target in the same representation space. Latent diffusion improves efficiency by moving this path into an autoencoder latent space, but the final sample is still produced by a separately trained decoder. This separation creates a mismatch: the generator is optimized for latent-space prediction, while final quality depends on how the decoder handles generated latents that may differ from clean encoder outputs. We introduce CrossFlow, a cross-space flow formulation that maps noisy latent inputs directly to pixel-space images. The key technical step is a velocity-free one-step objective: the latent trajectory defines the training path, but the supervised prediction is an image rather than a latent displacement. This lets one model act both as a one-step latent-to-pixel generator and as a decoder replacement for latent diffusion pipelines. On class-conditional ImageNet-1k at $256\times256$, CrossFlow-XL achieves 1.62 FID with one function evaluation. Ablations show that the latent encoder and pixel-space perceptual and adversarial losses are important for fidelity. These results indicate that cross-space flow objectives can combine the efficiency of latent representations with direct pixel-space supervision, without requiring a separate decoder at inference.
[54] Vision-Reasoning-Guided Occlusion Removal from Light Fields cs.CVPDF
Mohamed Youssef, Oliver Bimber
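TL;DR: This paper proposes a vision-reasoning-guided light field occlusion removal framework that combines light field integration (LFI) with vision-language models (VLMs). LFI first integrates multi-view observations to suppress foreground occlusions and produce an initial visibility-enhanced representation. A VLM is then used as a conditional semantic prior to restore degraded structures and fine details under the guidance of the observed data, and a multi-sample fusion strategy aggregates multiple generated hypotheses to improve consistency and reduce hallucination artifacts.
Details
Motivation: Severe occlusion, such as dense foreground vegetation in natural environments, makes scene recovery difficult for computational imaging. The goal is occlusion-robust scene recovery.
Result: The method achieves state-of-the-art performance on synthetic and real-world datasets, with the highest average SSIM across four synthetic light field benchmark scenes (4-Syn), and generalizes well across structured and unstructured acquisition settings.
Insight: The contribution is combining physical imaging constraints (LFI) with vision-language reasoning (VLM): the VLM's semantic prior guides detail recovery and multi-hypothesis fusion improves consistency, suggesting a new route to robust perception under severe occlusion in settings such as search-and-rescue and robotic navigation.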
Abstract: Occlusion-robust scene recovery remains a major challenge in computational imaging, particularly in natural environments where dense foreground vegetation severely limits visibility. We propose a vision-reasoning-guided light field occlusion removal framework that combines the visibility recovery capability of light field integration (LFI) with the semantic reasoning capacity of vision-language models (VLMs). Multi-view observations are first integrated via LFI to suppress foreground occlusions and produce an initial visibility-enhanced representation. A VLM is then incorporated as a conditional semantic prior to restore degraded structures and recover fine details, guided by the observed measurements. To improve recovery consistency and reduce hallucination artifacts, we introduce a multi-sample fusion strategy that aggregates multiple generated hypotheses into a unified estimate. Experimental results on synthetic and real-world datasets demonstrate state-of-the-art performance, achieving the highest average SSIM across four synthetic light field benchmark scenes (4-Syn) and strong generalization across structured and unstructured acquisition settings. These results highlight the effectiveness of combining physical imaging constraints with vision-language reasoning for robust perception under severe occlusion, with applicability to search-and-rescue and exploratory robotic navigation.
[55] ReA-OVCD: Reliability-Aware Open-Vocabulary Change Detection via Semantic and Spatial Refinement cs.CVPDF
Hongming Zhu, Huaji Chen, Bowen Du, Sicong Liu, Qin Liu
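TL;DR: This paper proposes ReA-OVCD, a training-free, reliability-aware open-vocabulary change detection framework for flexibly identifying land cover changes in remote sensing imagery from arbitrary text prompts. It derives candidate change regions from pixel-wise semantic discrepancies and adds a collaborative semantic and spatial refinement strategy to improve reliability, addressing the trade-off in existing methods between instance-level comparison, which misses fine-grained semantic changes, and pixel-level comparison, which is unreliable.
Details
Motivation: Existing open-vocabulary change detection methods face an inherent trade-off when modeling change: instance-level comparison overlooks fine-grained semantic variations (e.g., partial building extensions), while direct pixel-level comparison yields unstable responses and boundary artifacts because of semantic ambiguity and spatial inconsistency.
Result: Extensive experiments on LEVIR-CD, WHU-CD, DSIFN, and SECOND show that the method consistently outperforms state-of-the-art approaches with higher computational efficiency, improving F1^C by 2.13% to 9.75%.
Insight: The core contribution is a training-free framework that is reliability-aware at two levels. A Semantic Change Reasoning module jointly analyzes distributional divergence and response variation to suppress incidental inconsistencies, and a Boundary-aware Change Refinement module mitigates boundary artifacts by checking whether candidate regions are supported by reliable interior pixels, giving collaborative refinement at the semantic and spatial levels.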
Abstract: Unlike traditional remote sensing change detection that relies on predefined categories, Open-Vocabulary Change Detection (OVCD) identifies land cover changes flexibly using arbitrary text prompts. However, existing methods suffer from an inherent trade-off when modeling changes: instance-level comparison overlooks fine-grained semantic variations (e.g., partial building extensions), while direct pixel comparison proves unreliable, yielding unstable responses and boundary artifacts due to semantic ambiguity and spatial inconsistency. To this end, we propose an efficient training-free Reliability-Aware Open-Vocabulary Change Detection (ReA-OVCD) framework. It first derives candidate change regions from pixel-wise semantic discrepancies to ensure flexible and detailed localization. To ensure reliability, it subsequently introduces a collaborative refinement strategy to explicitly model change validity from both semantic and spatial perspectives. Specifically, we develop a Semantic Change Reasoning (SCR) module that reassesses changes by jointly analyzing distributional divergence and response variation, enabling the suppression of incidental inconsistencies while preserving reliable semantic shifts. In addition, a Boundary-aware Change Refinement (BCR) module is designed to mitigate artifacts stemming from boundary misalignment and uncertainty through validating whether candidate regions are supported by reliable interior pixels. Extensive experiments across multiple datasets (LEVIR-CD, WHU-CD, DSIFN, and SECOND) demonstrate that our method consistently outperforms state-of-the-art approaches, achieving $\mathrm{F}_{1}^{C}$ improvements of 2.13% to 9.75% with higher computational efficiency. The code is publicly available at https://github.com/Funny0101/ReA-OVCD.
[56] FUSE: Frequency-domain Unification and Spectral Energy Alignment for Multi-modal Object Re-Identification cs.CVPDF
Xuanhao Qi, Tom H. Luan, Yukang Zhang, Jinkai Zheng, Zhou Su
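TL;DR: This paper proposes FUSE, a framework that addresses the tendency of existing multi-modal object re-identification methods to over-rely on low-frequency cues and neglect mid- and high-frequency structures. It reformulates multi-modal ReID as a two-stage process of spectral disentanglement and energy alignment, using a spectral decomposition module and a cross-modal alignment module for hierarchical spectral modeling and cross-modal energy alignment.
Details
Motivation: Existing multi-modal ReID methods tend to emphasize low-frequency cues (color, illumination, coarse appearance) and overlook the mid- and high-frequency structures that encode geometric, textural, and identity-discriminative details, leading to incomplete spectral representations and unstable cross-modal alignment.
Result: Extensive experiments on RGBNT201, RGBNT100, and MSVR310 show that FUSE achieves improvements of 9.1% mAP and 9.5% Rank-1, establishing an interpretable frequency-domain paradigm for multi-modal representation learning.
Insight: The contribution is reframing multi-modal ReID as spectral disentanglement and energy alignment in the frequency domain, with an adaptive spectral decomposition module and a cross-modal alignment module based on frequency consistency, plus learnable frequency modulation for robustness under varying illumination and heterogeneous sensor conditions.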
Abstract: Despite significant progress in multi-modal Re-Identification (ReID), existing methods tend to emphasize low-frequency cues. Consequently, they focus on attributes such as color, illumination, and coarse appearance, while overlooking mid and high-frequency structures that encode geometric, textural, and identity-discriminative details. This imbalance leads to incomplete spectral representations and unstable cross-modal alignment. To overcome these limitations, we introduce FUSE, a frequency-domain framework that reformulates multi-modal ReID as a two-stage process of spectral disentanglement and energy alignment. The proposed Spectral Decomposition Module (SDM) adaptively partitions features into low, mid, and high-frequency subspaces, enabling hierarchical spectral modeling. The Cross-Modal Alignment Module (CAM) further enforces energy alignment and subspace complementarity across modalities via frequency-consistency regularization. In addition, FUSE incorporates learnable frequency modulation to enhance robustness under varying illumination and heterogeneous sensor conditions. Extensive experiments on RGBNT201, RGBNT100, and MSVR310 show that FUSE achieves 9.1% mAP and 9.5% Rank-1 improvements, establishing an interpretable frequency-domain paradigm for multi-modal representation learning.
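To make the idea of low, mid, and high-frequency subspaces concrete, the following NumPy sketch splits a 2D map into three bands using fixed radial masks in the Fourier domain. This is only a simplified illustration: the abstract describes SDM as adaptive, whereas the cutoffs `low_cut` and `high_cut` here are fixed, hypothetical values, and the function name is an assumption rather than anything from the paper.

```python
import numpy as np

def split_frequency_bands(feature_map, low_cut=0.15, high_cut=0.35):
    """Split a 2D array into low/mid/high bands with fixed radial FFT masks.

    Illustrative stand-in only; the cutoffs (in cycles per sample) are assumed,
    not learned as in the paper's Spectral Decomposition Module.
    """
    h, w = feature_map.shape
    fy = np.fft.fftfreq(h)[:, None]  # vertical frequencies
    fx = np.fft.fftfreq(w)[None, :]  # horizontal frequencies
    radius = np.sqrt(fx ** 2 + fy ** 2)
    spectrum = np.fft.fft2(feature_map)
    masks = {
        "low": radius < low_cut,
        "mid": (radius >= low_cut) & (radius < high_cut),
        "high": radius >= high_cut,
    }
    # The masks are symmetric in frequency, so each band is real up to rounding error.
    return {name: np.fft.ifft2(spectrum * m).real for name, m in masks.items()}

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
bands = split_frequency_bands(x)
for name, band in bands.items():
    print(name, float((band ** 2).sum()))  # spectral energy per band
```

Because the three masks partition the frequency plane, the bands sum back to the input and their energies add up to the input's total energy. Per-band energies of this kind are the sort of quantity an energy-alignment objective could compare across modalities.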
[57] Holo-World: Unified Camera, Object and Weather Control for Video World Model cs.CVPDF
Xiangchen Yin, Wenzhang Sun, Jiahui Yuan, Zijie Liu, Yinda Chen
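TL;DR: This paper proposes Holo-World, a unified controllable video world model that starts from a single image and generates video under joint camera, object, and weather control, either preserving the source world or transferring it to a target weather state.
Details
Motivation: In existing video world models, camera and object controls remain isolated from each other, and weather generation usually depends on a source video or a reconstructed scene. Unified control starting from a single image is missing.
Result: On weather-state generation, Holo-World outperforms video-to-video weather editing baselines in both quantitative and qualitative experiments while maintaining precise camera and object control and consistent scene structure.
Insight: The contributions include the HoloStateData dataset, which provides unified control samples, and two components, the Unified Scene Adapter and Scene-Weather Decomposed CFG, which model weather-dependent appearance and particle effects through parameter-subspace factorization and separated conditional guidance, respectively.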
Abstract: Video world models are moving toward preserving an observed world under controllable camera and object motion while allowing its environmental state to change. Yet these controls remain isolated, and weather generation typically relies on a source video or reconstructed scene that already specifies future structure. We study a first-frame-anchored source-to-state setting, where the model starts from a single image and follows explicit camera and object controls and an optional weather instruction, then generates a video that either preserves the source world or transfers it to a target weather state. To address these challenges, we first build HoloStateData, a state video dataset that turns diverse videos into unified control samples for camera, object, and weather supervision. Second, we introduce Holo-World, a unified controllable video world model that jointly controls the scene from a single image. Its Unified Scene Adapter factorizes world preservation and weather transfer into distinct parameter subspaces, using rendered background, geometry buffers, and object controls to maintain controlled scene structure while modeling weather-dependent appearance and particle effects. Additionally, Scene-Weather Decomposed CFG guides scene and weather residuals separately, strengthening target weather effects without over-amplifying the full condition. Quantitative and qualitative experiments demonstrate that Holo-World maintains precise camera and object control with consistent scene structure while transferring scenes into diverse target weather states, outperforming video-to-video weather editing baselines on weather-state generation. Our project page is available at https://xiangchenyin.github.io/Holo-World/.
[58] See-and-Reach: Precise Vision-Language Navigation for UAVs within the Field of View cs.CV | cs.AIPDF
Fanfu Xue, En Yu, Yantian Shen, Zhikun Hu, Hongjun Wang
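TL;DR: For UAV vision-language navigation (UAV-VLN), this paper introduces UAV-VLN-FOV, a subtask focused on the see-and-reach stage, to evaluate more precisely how well a UAV reaches a target once it has entered the field of view. The authors also propose 3DG-VLN, a vision-language waypoint prediction framework that uses dynamic 3D direction cues to strengthen fine-grained visual grounding and spatial direction alignment for precise target reaching.
Details
Motivation: Existing UAV-VLN tasks usually optimize and evaluate long-range target discovery and final target approach as a whole, which makes it hard to assess separately the key ability to turn vision-language evidence into precise 3D motion once the target is visible. This paper addresses that limitation.
Result: On the authors' high-resolution benchmark, 3DG-VLN outperforms competitive UAV-VLN baselines, improving success rate by 13.82%. Real-world trials further support its potential for practical see-and-reach navigation.
Insight: The contribution is isolating the see-and-reach stage for diagnostic evaluation and proposing a waypoint prediction framework that updates dynamic 3D direction cues online to stay spatially aligned with the target and reduce direction drift. Viewed objectively, the high-resolution, multi-view (front-view and downward-view) dataset is also a valuable benchmark for the field.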
Abstract: UAV Vision-Language Navigation (UAV-VLN) is typically formulated as a holistic search-and-reach problem, where long-range target discovery and final target approach are optimized and evaluated jointly. This formulation makes it difficult to assess a critical capability of aerial embodied agents, namely whether a UAV can accurately ground a visible target and translate vision-language evidence into precise 3D motion once the target enters its field of view. To address this limitation, we introduce UAV-VLN-FOV, a target-visible navigation task that isolates the see-and-reach stage and enables a more diagnostic evaluation of terminal reaching ability. We further propose 3DG-VLN, a vision-language waypoint prediction framework guided by dynamic 3D direction cues to enhance fine-grained visual grounding and spatial direction alignment for precise target reaching. Specifically, 3DG-VLN adaptively processes high-resolution front-view and downward-view observations to preserve fine-grained visual and geometric details for target grounding. It also updates the target-relative direction online during closed-loop navigation, allowing the agent to maintain spatial alignment with the target and reduce accumulated direction drift. To support this task, we construct a dedicated high-resolution benchmark which contains 2,717 trajectories with target-oriented high-level instructions, high-resolution front-view and downward-view egocentric observations, and continuous 3D waypoint annotations. Experiments show that 3DG-VLN outperforms competitive UAV-VLN baselines, achieving a 13.82% improvement in success rate. Real-world trials further demonstrate the potential of 3DG-VLN for practical see-and-reach navigation. The source code and benchmark are available at https://github.com/xuefanfu/3DG-VLN.
[59] The Hidden Evolution of Disguised Visual Context inside the VLM cs.CV | cs.AIPDF
Wish Suharitdamrong, Tony Alex, Muhammad Awais, Sara Atito
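TL;DR: This paper gives a fair comparison of the two mainstream paradigms for integrating visual information into vision-language models (VLMs), in-context prompting and layer-wise injection, and reveals a hidden evolution of visual tokens inside the large language model (LLM). Visual tokens enter the LLM as raw representations lacking linguistic structure; how they evolve depends on the integration paradigm, and each paradigm captures fundamentally different frequency characteristics of the visual signal, which ultimately determines VLM performance across tasks.
Details
Motivation: Visual tokens enter the LLM as raw signals, and how they are turned into meaningful representations and interact with the language space depends entirely on the integration architecture. Yet controlled comparisons of how these architectural choices affect visual information and its internal transformation remain scarce.
Result: The two paradigms are evaluated under identical training conditions on single-image, multi-image, and video benchmarks. The results show that performance differences are not driven by attention allocation alone but by the quality of the visual representations at each layer.
Insight: The contribution is revealing how visual tokens evolve inside the LLM as "disguised visual context" and systematically comparing how the two paradigms shape that evolution, and with it feature utilization, representation alignment, and final task performance. Viewed objectively, the study offers important empirical analysis and theoretical insight for VLM architecture design.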
Abstract: Visual tokens enter Large Language Models (LLMs) as raw, foreign signals. How they are transformed into meaningful representations and interact with the language space depends entirely on the integration architecture: whether visual tokens are treated as in-context prompts within the input sequence or injected directly into the LLM’s intermediate layers. A controlled comparison of how these architectural choices affect visual information and its internal transformation to integrate with the LLM remains underexplored. We provide a fair comparison by evaluating in-context and layer-wise injection VLM integration paradigms under identical training conditions across single-image, multi-image, and video benchmarks. In doing so, we uncover a hidden evolution where visual tokens enter the LLM as disguised visual context, raw representations lacking linguistic structure, but are progressively reshaped depending on the integration paradigm, each capturing fundamentally different frequency characteristics of the visual signal. We show that this evolution inside the LLM determines what visual features the VLM can utilize effectively, how visual representations align with the language space, and ultimately how each paradigm performs across different tasks. We further demonstrate that attention allocation alone is insufficient, and that performance is driven by the quality of visual representations at each layer.
[60] WeGenBench: A Multidimensional Diagnostic Benchmark towards Text-to-Image Model Optimization cs.CVPDF
Qian Liang, Xiaomin Li, Ying Zhang, Jia Xu, Lihao Ni
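TL;DR: This paper proposes WeGenBench, a new benchmark for comprehensive, multi-dimensional evaluation of text-to-image generation models. It contains 4,000 test prompts balanced between Chinese and English, performs cross-dimensional evaluation using macroscopic scene classification and fine-grained multi-dimensional tags, and introduces new evaluation metrics built on vision-language models to pinpoint model deficiencies in specific generation categories.
Details
Motivation: Existing benchmarks struggle to measure text-to-image models comprehensively and accurately across multiple dimensions and fail to reveal their inherent deficiencies in specific categories.
Result: The authors systematically benchmark current state-of-the-art methods and analyze the limitations of existing models in depth.
Insight: The contribution is a bilingually balanced benchmark with multi-dimensional annotation and a cross-dimensional evaluation mechanism, plus new VLM-integrated metrics that return detailed reasoning trajectories, enabling precise diagnosis of model deficiencies and verifiable evaluation results.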
Abstract: Recent text-to-image generation models have demonstrated remarkable capabilities in synthesizing highly realistic images from text inputs alone. Although existing benchmarks can evaluate the generation capabilities of various models to some extent, they struggle to comprehensively and accurately measure performance across multiple dimensions, often failing to reveal the inherent deficiencies of models in specific categories. To address these limitations, we propose WeGenBench, a novel benchmark designed for the comprehensive, multi-perspective evaluation of text-to-image generation capabilities. Our benchmark comprises a total of 4,000 test prompts across two primary categories, meticulously balanced between Chinese and English to evaluate bilingual and cross-cultural generation capabilities. Beyond macroscopic scene classification, we annotate each prompt with multi-dimensional tags tailored to the distinct content and challenges of each language, thereby refining the generation tasks into more specific sub-categories. Through a cross-dimensional evaluation mechanism leveraging both scene classifications and multi-dimensional tags, WeGenBench can precisely pinpoint model shortcomings in specific generation categories. Furthermore, to measure generation quality more accurately, we design and validate several novel evaluation metrics by integrating Vision-Language Models (VLMs), which assess model performance on domain-specific tasks from three core aspects. Crucially, our approach yields both the assessment outcomes and the detailed reasoning trajectories, facilitating a rigorous verification of the accuracy and soundness of the evaluation results. Finally, we conduct systematic benchmarking on current state-of-the-art methods and provide an in-depth analysis of the limitations present in existing models.
[61] EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies cs.CVPDF
Ganlin Yang, Zhangzheng Tu, Yuqiang Yang, Sitong Mao, Junyi Dong
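TL;DR: This paper proposes EventVLA, an end-to-end vision-language-action (VLA) policy framework for long-horizon robotic manipulation. Its core idea is sparse visual evidence memory: foundational visual anchors plus a dynamic Keyframe Evidence Memory (KEM) module capture and store task-critical visual events ahead of time, addressing cases where relevant visual cues become occluded or disappear during long tasks.
Details
Motivation: Existing memory-augmented VLA policies suffer from information bottlenecks, high latency, or accumulated visual redundancy on long-horizon manipulation tasks, and standard VLA policies often fail when task-relevant cues become unobservable over time.
Result: With the purpose-built diagnostic benchmark RoboTwin-MeM, across 17 memory-requiring simulation tasks and 4 real-world bimanual tasks, EventVLA improves average success rate by 40% over state-of-the-art memory-augmented VLA methods.
Insight: The contribution is the concept of sparse visual evidence memory. In particular, the KEM module predicts future keyframe probabilities directly from the VLA's latent embeddings, giving foresight-driven, autonomous capture of visual events: it dynamically evaluates the future causal utility of current observations and preserves transient visual information before the evidence disappears.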
Abstract: Memory remains a critical bottleneck for long-horizon robotic manipulation, as standard Vision-Language-Action (VLA) policies often fail when task-relevant cues become occluded or unobservable over time. While existing memory-augmented methods utilize historical context, they either suffer from severe information bottlenecks, incur high latency via decoupled dual systems, or rely on unselective buffers that accumulate massive visual redundancies. To address these limitations, we introduce EventVLA, an end-to-end framework founded on the concept of sparse visual evidence memory that comprises two core components: foundational visual anchors to retain initial and short-term contexts, and a dynamic Keyframe Evidence Memory (KEM) module. Specifically, KEM directly predicts future keyframe probabilities from the VLA’s latent embeddings to autonomously capture and store sparse, task-critical visual events. This foresight-driven mechanism empowers the policy to dynamically evaluate the future causal utility of current observations, preserving transient visual evidence before it becomes unobservable. Furthermore, we propose RoboTwin-MeM, a diagnostic benchmark specifically designed to evaluate non-Markovian manipulation tasks with interactive visual evidence. Extensive evaluations show that across 17 memory-requiring simulation tasks and 4 real-world bimanual tasks, EventVLA achieves an average success rate improvement of +40% over state-of-the-art memory-augmented VLAs.
[62] Geometry-Preserving in 3D Gaussian Splatting for LiDAR-Camera Extrinsic Calibration cs.CVPDF
Kyoleen Kwak, Daeho Kim, Jeong Woon Lee, Hyoseok Hwang
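TL;DR: This paper proposes a geometry-preserving 3D Gaussian Splatting (3DGS) framework for LiDAR-camera extrinsic calibration. It aggregates multi-view LiDAR observations for dense depth supervision and blocks photometric gradients from updating the Gaussian spatial parameters, preserving the metric geometry of the Gaussian proxy. Experiments on public driving datasets show that it consistently outperforms existing targetless calibration methods in calibration accuracy.
Details
Motivation: Existing 3DGS-based targetless LiDAR-camera calibration methods prioritize rendering quality, so the proxy geometry drifts from the true LiDAR point cloud structure and accurate metric geometry is lost.
Result: Validation on public driving datasets shows that the method consistently outperforms existing targetless calibration methods, achieving higher extrinsic calibration accuracy.
Insight: The contribution is combining dense depth supervision from multi-view LiDAR observations with blocking photometric gradients from updating spatial parameters. This keeps the advantages of the differentiable 3DGS framework while forcing its geometric representation to stay aligned with the true LiDAR structure, addressing the core problem of proxy geometry drift.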
Abstract: Accurate LiDAR-camera calibration is essential for robust multi-modal perception. Targetless approaches avoid manual setup but remain limited by the scarcity of discriminative cross-modal features. Recent methods address this by reconstructing the scene within a differentiable model, enabling extrinsic optimization through dense photometric supervision. Among these, 3D Gaussian Splatting (3DGS) has been widely adopted as a geometric proxy that bridges LiDAR and camera within a single differentiable framework. However, since 3DGS was originally designed for novel view synthesis, existing methods tend to prioritize rendering quality, causing the proxy geometry to drift from the true LiDAR structure. We propose a framework that preserves the metric geometry of the Gaussian proxy by aggregating multi-view LiDAR observations for dense depth supervision and blocking photometric gradients from updating the Gaussian spatial parameters. We validate our method on public driving datasets, where it consistently outperforms existing targetless methods in calibration accuracy.
[63] EFIQA: Explainable Fundus Image Quality Assessment via Anatomical Priors cs.CV | cs.LGPDF
Pengwei Wang, José Morano, Qian Wan, Hrvoje Bogunović
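TL;DR: This paper proposes EFIQA, a framework for fundus image quality assessment. It needs no quality-related supervision: instead of learning "what is degradation" from human annotations, it uses anatomical priors to learn "what should be there" and produces spatial quality maps. It is implemented in two stages: first, an unsupervised anomaly detector is trained via masked anatomical inpainting to identify regions of missing vasculature; then this prior knowledge is distilled into a shallow adapter that maps features of a frozen foundation model to precise quality maps.
Details
Motivation: Existing deep-learning image quality assessment methods have two limitations: their generalization is tied to the labeling criteria of the training set, and they cannot give spatial feedback on where quality is degraded, so they lack explainability.
Result: In external-dataset evaluation, this label-free approach with minimal adaptation achieves better performance and explainability than supervised methods across benchmarks with different quality criteria.
Insight: The core contribution is unsupervised learning from anatomical priors ("what should be there") rather than degradation labels, together with a two-stage design (unsupervised anomaly detection, then knowledge distillation onto a frozen foundation model) that yields explainable spatial quality maps and improves generalization and practicality.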
Abstract: Image quality control is vital for a wide range of downstream applications. Deep learning-based image quality assessment methods typically train classifiers on dataset-specific quality labels, inheriting two limitations: (1) generalization is tied to the labeling criteria of the training set and (2) these methods cannot provide spatial feedback on where the quality is degraded, lacking explainability. In this work, we propose EFIQA, a framework that requires no quality-related supervision and produces spatial quality maps by design. Rather than learning “what is degradation” from human-annotated labels, EFIQA learns “what should be there” by leveraging anatomical priors. For fundus photography, we instantiate this as a two-stage approach, by first training an unsupervised anomaly detector via masked anatomical inpainting to identify regions of missing vasculature, and then distilling this prior knowledge into a shallow adapter mapping features of a frozen foundation model to precise quality maps. External-dataset evaluation demonstrates that this label-free approach with minimal adaptation achieves better performance and explainability compared with supervised methods across benchmarks with different quality criteria, highlighting its potential for real-world applications.
[64] SA-VIS: Sparse frame Annotations for training Video Instance Segmentation cs.CVPDF
Edoardo Mello Rella, Ajad Chhatkuli, Shipra Jain, Ender Konukoglu, Luc Van Gool
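TL;DR: This paper proposes SA-VIS, a video instance segmentation method designed to train on sparsely annotated video frames. It introduces a Past-frames Feature Propagation module and lightweight frame-specific instance queries, substantially reducing the dense annotation and compute needed for training with only a small drop in performance.
Details
Motivation: Existing online video instance segmentation methods rely on densely annotated video sequences for training, which is expensive in both annotation and compute. The paper argues that modeling instances and their evolution does not require dense annotation and proposes an efficient training method based on sparse annotations.
Result: On YouTube-VIS 2019/2021/2022 and Occluded VIS, SA-VIS improves substantially over its baseline. With annotations for only 1/5 of the images, performance drops by just 0.4%, and in the limited-annotation scenario it improves average precision (AP) by more than 1%, reaching state-of-the-art performance.
Insight: The main contribution is the Past-frames Feature Propagation module, which aggregates low-dimensional image-encoder features from multiple frames and enables strong end-to-end learning from sparse annotations. This simple design effectively closes the accuracy gap between training on sparse and dense annotations and suggests a new way to reduce annotation cost.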
Abstract: Recent online video instance segmentation (VIS) methods have achieved impressive results, thus becoming the preferred approach to segment instances in videos. Despite the resurgence of impressive single-image models, the online (or semi-online) VIS approaches outperform single-image models (e.g., based on SAM) by using long sequences of densely annotated frames during training. However, such a training setup is expensive in terms of both compute and the dense annotations required. To address these major flaws, we argue that effective modeling of the instances and their evolution in videos does not require densely annotated frames. To that end, we propose a simple and effective module, called Past-frames Feature Propagation (PFP), which aggregates low-dimensional features from the image encoder of multiple frames. This simple low-compute module provides tremendous learning capability in using sparse video frame labels for end-to-end training. Combined with lightweight frame-specific Instance Queries, our Sparse frame Annotation VIS (SA-VIS) significantly improves performance over its baseline. Most interestingly, our simple design that avoids complexities effectively bridges the gap in accuracy between training on sparsely and densely annotated video sequences. This translates to a mere 0.4% drop in performance of SA-VIS when using annotations for only 1/5 of the images in the dataset. Empirically, SA-VIS shows strong improvements over the baseline on YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS) and an improvement of over 1% in AP over the state of the art in a limited-annotation scenario.
[65] HEad and neCK TumOR (HECKTOR) 2025: Benchmark of Segmentation, Diagnosis, and Prognosis in Multimodal PET/CT cs.CVPDF
Numan Saeed, Salma Hassan, Shahad Hardan, Lishan Cai, Xinglong Liang
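TL;DR: The HECKTOR 2025 challenge establishes a comprehensive benchmark for automated head and neck cancer analysis using multimodal PET/CT imaging and electronic health records. It comprises three complementary tasks: segmenting primary tumors and metastatic lymph nodes, predicting recurrence-free survival, and classifying HPV status. Submitted algorithms were evaluated on a multi-institutional dataset of more than 1,100 patients from 10 centers worldwide.
Details
Motivation: Accurate tumor delineation in head and neck cancer is essential for radiotherapy planning, but manual segmentation is time-consuming and subject to inter-observer variability. In addition, predicting long-term clinical outcomes (such as recurrence-free survival) and determining HPV status from noninvasive imaging are challenging but clinically valuable.
Result: The top-performing algorithms achieved a mean Dice similarity coefficient of 0.75 for segmentation, a concordance index of 0.66 for survival prediction, and a balanced accuracy of 0.56 for HPV classification, all evaluated on a held-out test set.
Insight: By integrating multimodal imaging (PET/CT) with clinical data, the challenge provides a standardized, large-scale benchmark for automated head and neck cancer analysis (segmentation, diagnosis, and prognosis). Its multi-task setup (segmentation, survival prediction, HPV classification) reflects the practical needs of clinical workflows and helps move algorithms toward clinical decision support systems.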
Abstract: Head and neck cancers (HNC) represent a significant global health burden, with accurate tumor delineation being essential for effective radiotherapy planning. The complexity of the oropharyngeal anatomy, combined with the heterogeneous appearance of tumors on imaging, makes manual segmentation time-intensive and subject to inter-observer variability. Beyond segmentation, predicting long-term clinical outcomes, such as recurrence-free survival (RFS), and determining human papillomavirus (HPV) status from noninvasive imaging, remain challenging yet clinically valuable goals. The HECKTOR 2025 challenge addresses these needs by establishing a comprehensive benchmark for automated HNC analysis using multimodal PET/CT imaging and electronic health records. Building on previous editions (2020-2022), this challenge features an expanded multi-institutional dataset comprising over 1,100 patients from 10 centers worldwide. Participants were tasked with three complementary objectives: (1) segmenting primary gross tumor volumes (GTVp) and metastatic lymph nodes (GTVn), (2) predicting recurrence-free survival, and (3) classifying HPV status. The challenge attracted 35 registered teams, with 15 final submissions evaluated on a held-out test set. Top-performing algorithms achieved a mean Dice similarity coefficient of 0.75 for segmentation, a concordance index of 0.66 for survival prediction, and a balanced accuracy of 0.56 for HPV classification. This paper presents a comprehensive analysis of the submitted methodologies, evaluates their performance across different lesion characteristics, and discusses their implications for clinical translation in automated oncology workflows and decision support systems.
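The three headline metrics in this entry (Dice similarity coefficient, concordance index, balanced accuracy) can be illustrated with a minimal pure-Python sketch. The function names and toy inputs below are assumptions for illustration only; this is not the challenge's official evaluation code, and the concordance index shown is a simplified Harrell-style variant.

```python
from itertools import combinations

def dice(pred, truth):
    """Dice similarity coefficient for two flat binary masks (lists of 0/1)."""
    inter = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 1.0 if total == 0 else 2.0 * inter / total

def concordance_index(times, risks, events):
    """Simplified Harrell-style C-index.

    times  : observed follow-up times
    risks  : predicted risk scores (higher = earlier event expected)
    events : 1 if the event was observed, 0 if censored
    """
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # tied times are skipped in this sketch
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # earlier subject was censored: pair is not comparable
        comparable += 1
        if risks[first] > risks[second]:
            concordant += 1.0
        elif risks[first] == risks[second]:
            concordant += 0.5
    return concordant / comparable

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, which is robust to class imbalance."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [k for k, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[k] == c for k in idx) / len(idx))
    return sum(recalls) / len(recalls)
```

For a binary task, a balanced accuracy of 0.5 corresponds to chance level, which helps put the reported 0.56 for HPV classification in context.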
[66] ARTEMIS: Agent-guided Reliability-aware Temporal Mask Evolution for Imperfectly Supervised Video Polyp Segmentation cs.CVPDF
Tong Wang, Siwen Wang, Yaolei Qi, Jinxing Zhou, Yuting He
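TL;DR: ARTEMIS is a unified framework for imperfectly supervised video polyp segmentation. Through agent-guided, reliability-aware temporal mask evolution, it learns dense and temporally consistent masks from sparse annotations (such as points and scribbles) or from a small number of densely labeled frames. It initializes masks with SAM2, uses a vision-language agent to select reliable temporal anchors and propagate them bidirectionally, and finally trains the segmenter with reliability-aware robust learning.
Details
Motivation: Imperfectly supervised video polyp segmentation is difficult because of weak contrast, ambiguous boundaries, motion blur, and specular highlights, compounded by sparse pixel-level supervision. Direct pseudo-labeling under these conditions produces geometry-degraded masks with boundary leakage, underuses temporal consistency, and ignores reliability.
Result: In experiments on SUN-SEG and CVC-ClinicDB-612 under scribble, point, and limited-label settings, ARTEMIS achieves state-of-the-art performance.
Insight: The contributions are: 1) a vision-language agent that selects reliable temporal anchors; 2) bidirectional temporal propagation combined with SAM2 to refine masks; 3) a reliability-aware robust learning framework comprising reliability-guided reference selection, a Reference Prototype Transport Module, and a reliability-aware robust loss. Together these assess mask reliability, evolve anchors over time, transport target identity across frames, and down-weight noisy supervision instead of discarding difficult samples.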
Abstract: Imperfectly supervised video polyp segmentation (VPS) aims to learn dense, temporally consistent masks from inexpensive supervision, including weak annotations (points, scribbles) and semi-supervision with few densely labeled frames. This setting is clinically valuable but challenging due to weak contrast, ambiguous boundaries, motion blur, and specular highlights, compounded by sparse pixel-level guidance. While SAM2 can generate dense masks from sparse inputs, direct pseudo-labeling often yields geometry-degraded masks with boundary leakage, underutilizes temporal consistency, and ignores reliability. To address these issues, we propose ARTEMIS, a unified framework for imperfectly supervised VPS driven by agent-guided reliability-aware temporal mask evolution. ARTEMIS initializes coarse masks from available supervision: SAM2 converts points/scribbles, while dense labels serve as reliable anchors. A debate-and-judge vision-language agent selects reliable temporal anchors under weak supervision, which are propagated bidirectionally with SAM2 to refine unreliable or unlabeled frames. Finally, ARTEMIS trains the segmenter using temporal reliability-aware robust learning, incorporating reliability-guided reference selection, a Reference Prototype Transport Module, and reliability-aware robust loss. These components assess mask reliability, evolve anchors over time, transport target identity across frames, and down-weight noisy supervision instead of discarding difficult samples. Experiments on SUN-SEG and CVC-ClinicDB-612 under scribble, point, and limited-label settings demonstrate that ARTEMIS achieves state-of-the-art performance. Code will be released at https://github.com/wangtong627/ARTEMIS.
[67] Evaluating and Enhancing Negation Comprehension in Remote Sensing MLLMs cs.CV | cs.AIPDF
Haochen Han, Jue Wang, Alex Jinpeng Wang, Fangming Liu
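TL;DR: This paper addresses the weak negation comprehension of remote sensing multimodal large language models. It introduces RS-Neg, the first benchmark for this ability, and designs NeFo, a test-time learning method that improves model performance. Existing models hallucinate and degrade on negation tasks, whereas NeFo significantly improves comprehension using only a small number of unlabeled test samples and generalizes well.
Details
Motivation: In real-world scenarios, remote sensing MLLMs must explicitly identify what is false or absent (for example, locating non-flooded routes during emergency response), but their negation comprehension remains underexplored, which limits practical deployment.
Result: Evaluation on RS-Neg shows that advanced remote sensing MLLMs perform poorly on negation tasks, exhibiting hallucinations and substantial performance degradation. Using about 5% of the unlabeled test samples, NeFo significantly improves negation understanding and generalizes strongly to unseen tasks.
Insight: The contributions are RS-Neg, the first negation-comprehension benchmark for remote sensing, and NeFo, a test-time learning method that explicitly incorporates the logical role of negation. The benchmark is built with an automated data generation pipeline and verified with a dynamic visual focus module, giving an effective evaluation and enhancement framework for remote sensing applications that need precise negation reasoning.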
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in various Remote Sensing (RS) tasks. However, their ability to comprehend negation remains underexplored, limiting deployment in real-world applications where models must explicitly identify what is false or absent, e.g., emergency responders need to locate non-flooded routes for evacuation. To comprehensively study this limitation, we introduce RS-Neg, the first benchmark to evaluate negation understanding across region-level to scene-level tasks. Specifically, we design an automated data generation pipeline for RS imagery, using LLMs to synthesize diverse negation queries, and introduce a dynamic visual focus module for verification. Our evaluation reveals that advanced RS MLLMs struggle with negation, exhibiting hallucinations and substantial performance degradation. To close this gap, we propose NeFo, a novel test-time learning method that explicitly incorporates the logical role of negation into the model optimization. Remarkably, using about 5% unlabeled test samples, NeFo significantly improves the negation understanding of models and shows strong generalization to unseen tasks. Code and data will be released upon acceptance.
[68] HilDA: Hierarchical Distillation with Diffusion for Advancing Self-Supervised LiDAR Pre-training cs.CV | cs.AI | cs.ROPDF
Maciej Wozniak, Jesper Ericsson, Hariprasath Govindarajan, Truls Nyberg, Thomas Gustafsson
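TL;DR: This paper proposes HilDA, a self-supervised pretraining framework for LiDAR backbones. It combines hierarchical distillation (multi-layer distillation plus global context distillation) with a temporal occupancy diffusion objective to better capture the semantic and geometric information needed for driving tasks.
Details
Motivation: Existing methods treat vision foundation models (VFMs) as black-box teachers and rely only on frame-wise feature similarity. They therefore fail to fully exploit the teacher's layer-wise semantic structure and global context, as well as the rich spatiotemporal information inherent in LiDAR sequences.
Result: HilDA achieves state-of-the-art results on cross-modal distillation benchmarks and outperforms models trained with prior distillation approaches on 3D object detection, scene flow, and semantic occupancy prediction.
Insight: The contribution is the integration of hierarchical distillation (multi-layer and global context distillation) with a temporal occupancy diffusion objective, which promotes spatiotemporal consistency and makes fuller use of both the teacher's semantic structure and the spatiotemporal nature of LiDAR.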
Abstract: Leveraging Vision Foundation Models (VFMs) for camera-to-LiDAR knowledge distillation offers a promising solution to the scarcity of annotated data needed to represent the immense geometric and kinematic diversity of real-world autonomous driving (AD). However, current approaches typically treat VFMs as black-box teachers, relying exclusively on frame-wise feature similarity. Consequently, they do not fully exploit the teacher’s layer-wise semantic structure and global context, as well as the rich spatiotemporal information inherent in LiDAR sequences. We propose HilDA, a self-supervised pretraining framework for LiDAR backbones that better captures the semantic what and geometric where needed for driving tasks. HilDA combines hierarchical distillation comprising multi-layer distillation for progressive semantic alignment and global context distillation for scene-level semantics, with a temporal occupancy diffusion objective promoting spatiotemporal consistency. Models pre-trained with HilDA achieve state-of-the-art results on cross-modal distillation benchmarks and outperform models trained via prior distillation approaches on 3D object detection, scene flow, and semantic occupancy prediction. Code available at: https://maxiuw.github.io/hilda.
[69] Evaluation of Image Matching for Art Skills Assessment cs.CVPDF
Asaad Alghamdi, Michael Poor, Trung-Nghia Le, Tam V. Nguyen
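TL;DR: This paper proposes a method for assessing drawing skill by matching a hand-drawn image against the original template. It compares image similarity with computer vision techniques (SIFT features and a Siamese network) to replace the tedious traditional assessment process.
Details
Motivation: Traditional drawing-skill assessment is complex and time-consuming. The paper uses computer vision to achieve automated, human-like image matching for comprehensive and efficient skill assessment.
Result: Experiments show that the method is feasible for assessing art skill levels, and feature analysis indicates that SIFT-based keypoint matching is the more effective means of detecting drawing skill.
Insight: The contribution is the application of SIFT features and a Siamese network to art skill assessment. The analysis suggests that SIFT is better at capturing local feature differences between a hand-drawn image and its template, which points to a new direction for automated art education assessment.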
Abstract: While some individuals possess a natural talent for drawing, mastering this skill requires dedicated training and practice. Determining one’s skill in the art of drawing requires proper comprehensive assessment. In this paper, we propose a method to measure drawing skill by matching the hand-drawn image with the original template. Existing techniques often involve complex processes. However, advancements in computer vision allow us to train computers to perform these comparisons at a human-like level, thereby resolving the tedious and overwhelming traditional process. Using computer vision applications, determining image similarity involves identifying the level of similarities in an image with a reference image. We have implemented and analyzed the SIFT feature and Siamese network to measure image similarity. Our results indicate that it is feasible to assess art skill levels. Through feature analysis, we found that SIFT-based key point matching provides a more effective means of detecting drawing skills.
[70] Cinematic Compositing Using Character-Environment-Harmonized Video Generation Models cs.CVPDF
Tianyi Xiang, Mingming He, Li Ma, Jing Liao
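TL;DR: This paper proposes an end-to-end video diffusion framework for cinematic compositing, which integrates green-screen characters seamlessly into new environments while maintaining physical and photometric realism. By jointly modeling Character-to-Environment (C2E) physical interaction and Environment-to-Character (E2C) lighting harmonization, and in particular by handling interactive props, the method achieves high-quality dynamic video compositing.
Details
Motivation: Existing methods struggle to capture the complex bidirectional interactions between characters and their environment, including C2E physical interaction and E2C lighting harmonization, so the composites lack realism.
Result: Extensive experiments show that the framework significantly outperforms existing methods on cinematic-quality dynamic video compositing.
Insight: The contributions are: 1) a tri-mask-guided architecture with RGB-D joint denoising that ensures physical consistency among the character, props, and environment; 2) an efficient prior-driven data curation pipeline that builds high-quality relighting pairs without expensive rendering; 3) a reference-conditioned mechanism that enables controllable environment synthesis and precise prop replacement.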
Abstract: Cinematic compositing aims to integrate green-screen characters into novel environments while maintaining physical and photometric realism. Previous methods often fail to capture the complex bidirectional interactions between characters and their surroundings, which we characterize as Character-to-Environment (C2E) physical interaction and Environment-to-Character (E2C) lighting harmonization. To address this, we propose an end-to-end video diffusion framework that jointly models C2E and E2C interactions, specifically handling the challenges of interactive props. Our approach introduces a tri-mask-guided architecture with RGB-D joint denoising to ensure physically consistent interactions among the character, props, and environment. We further develop an efficient prior-driven data curation pipeline to construct high-quality relighting pairs without expensive rendering. Finally, a reference-conditioned mechanism enables controllable environment synthesis and precise prop replacement. Extensive experiments demonstrate that our framework significantly outperforms existing methods in cinematic-quality dynamic video compositing.
[71] DeepForestVisionV2: Ecology-Driven Taxonomy Expansion for Camera-Trap Monitoring in African Tropical Forests cs.CV | q-bio.QMPDF
Hugo Magaldi, Theau d’Audiffret, Etienne Francois Akomo-Okoue, Bala Amarasekaran, Naomi Anderson
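TL;DR: DeepForestVisionV2 is an upgraded classification model for camera-trap monitoring in African tropical forests. It expands the prediction space from 35 to 64 classes to better handle deployment gradients such as vertical stratification, open scenes, and anthropogenic interfaces. The model is trained on a large dataset of photographs and videos and evaluated on several validation sets; results show that it materially improves field utility while preserving cross-site robustness.
Details
Motivation: The existing tool DeepForestVision was designed for closed-canopy, ground-level forest interiors. Its 35-class prediction space is too coarse when deployments encounter arboreal primates, birds, semi-aquatic taxa, or human-associated confounders such as livestock, so it cannot meet monitoring needs in diverse settings such as riverbanks, clearings, and park edges.
Result: On the cross-country cropped-photo validation set, the model reaches 0.86 accuracy, 0.82 macro-F1, and 0.81 balanced accuracy. On the gradient-specific Uganda video benchmarks, it preserves or improves baseline accuracy despite the harder classification task: identified taxa increase from 22 to 29 in forest-interior videos and from 4 to 9 at riverbanks, and in the park-edge use case accuracy rises from 0.62 to 0.86 while false alarms fall from 11 to 0.
Insight: The core contribution is an ecology-driven expansion of the taxonomy: new classes are designed around three gradients encountered in real deployments (vertical stratification, scene openness, and anthropogenic interfaces), extending the model from closed forest interiors to a wider range of habitats. This targeted expansion of the class space, driven by deployment needs rather than by simply adding data, is instructive for improving the practical utility of computer vision models in ecological monitoring and similar fields.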
Abstract: Camera-trap monitoring in African tropical forests increasingly extends beyond closed-canopy interiors to riverbanks, clearings, and park edges. Among available open tools for African forest camera-trap classification, DeepForestVision is the only one providing a matched offline workflow for both photographs and videos, and previous work showed that it outperformed other available baselines on a comparable benchmark. However, it was designed for closed-canopy, ground-level forest interiors and uses a 35-class prediction space that becomes too coarse when deployments encounter arboreal primates, birds, semi-aquatic taxa, or human-associated confounders such as livestock. We present DeepForestVisionV2, an ecology-driven expansion from 35 to 64 prediction classes (61 animal classes plus human, vehicle, and blank) designed to address three recurrent deployment gradients: vertical stratification, scene openness, and anthropogenic interfaces. DeepForestVisionV2 retains the same offline workflow and is trained on 1,535,010 photographs and 243,354 videos from multi-country African tropical-forest projects. Evaluation combines a cross-country cropped-photo validation set, used to assess robustness across sites and camera-trap settings, with three held-out Uganda video benchmarks spanning the targeted gradients. On the validation set, DeepForestVisionV2 reaches 0.86 accuracy, 0.82 macro-F1, and 0.81 balanced accuracy. On the deployment benchmarks, it preserves or improves baseline accuracy despite its harder classification task, while increasing the number of identified taxa from 22 to 29 in forest-interior videos and from 4 to 9 at riverbanks. In the park-edge use case, it raises accuracy from 0.62 to 0.86 and reduces false alarms from 11 to 0. These results show that DeepForestVisionV2 materially improves field utility while preserving robustness across sites, habitats, and camera-trap settings.
[72] SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs cs.CV | cs.AIPDF
Bo Yin, Xiaobin Hu, Chengming Xu, Ruolin Shen, Mo Yang
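TL;DR: SPOT-E is a plug-and-play test-time method that requires no retraining. It targets the performance drop of vision-language models on evidence-intensive tasks, where decisive visual evidence is small and easily overlooked. The method introduces low-entropy anchors and an entropy-shaping objective that reduces answer uncertainty while preserving baseline high-confidence tokens, and it generates question-conditioned visual spotlights through light-weight tuning based on Group Relative Policy Optimization.
Details
Motivation: Vision-language models underperform on evidence-intensive tasks because the decisive visual evidence is usually small, localized, and easy to overlook, so evidence readout fails even when high-level reasoning is intact. Existing inference-time visual interventions lack a feedback mechanism to verify whether the highlighted evidence is actually used.
Result: Across all benchmarks and different VLM families, SPOT-E yields consistent gains and improves robustness under visual corruptions.
Insight: The contribution is to use answer-span prediction entropy as a model-internal feedback signal and to resolve the ambiguity of naive entropy minimization (low entropy may come from evidence-grounded confidence or from shortcut collapse) with low-entropy anchors and an entropy-shaping objective. Light-weight GRPO optimization produces per-instance visual spotlights, making this an efficient test-time intervention strategy.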
Abstract: Vision-language models (VLMs) often underperform on evidence-intensive tasks because decisive visual evidence is small, localized, and easy to overlook, leading to failures in evidence readout even when high-level reasoning is intact. Prior inference-time visual interventions can improve grounding without retraining, but they are largely open-loop and lack a mechanism to verify whether highlighted evidence is actually used. We study answer-span prediction entropy as a model-internal feedback signal and show that naive entropy minimization is ambiguous, since low entropy may arise from evidence-grounded confidence or shortcut collapse. To resolve this ambiguity, we introduce low-entropy anchors and an entropy-shaping objective that reduces answer uncertainty while preserving baseline high-confidence tokens. We instantiate this principle in SPOT-E, a plug-and-play test-time method that produces question-conditioned spotlights, optimized per instance via light-weight tuning based on Group Relative Policy Optimization (GRPO). Across all benchmarks and different VLM families, SPOT-E yields consistent gains and improved robustness under visual corruptions. Code is publicly available at: https://github.com/YinBo0927/SPOT-E
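To make the idea of answer-span prediction entropy concrete, here is a minimal sketch. Only the entropy computation is standard; the `shaped_objective` function, its `anchor_thresh` and `lam` parameters, and the hinge-style penalty are assumptions made for illustration and are not taken from the SPOT-E paper or code.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def span_entropy(dists):
    """Mean token entropy over the tokens of an answer span."""
    return sum(token_entropy(d) for d in dists) / len(dists)

def shaped_objective(base_dists, new_dists, anchor_thresh=0.2, lam=1.0):
    """Hypothetical entropy-shaping loss (an assumption, not the paper's formula).

    Minimizes the span entropy after the intervention, but penalizes any
    entropy increase on "anchor" tokens, i.e. tokens that were already
    confident (entropy below anchor_thresh) before the intervention.
    """
    penalty = 0.0
    for base, new in zip(base_dists, new_dists):
        h_base, h_new = token_entropy(base), token_entropy(new)
        if h_base < anchor_thresh:
            penalty += max(0.0, h_new - h_base)
    return span_entropy(new_dists) + lam * penalty
```

Under this assumed form, an intervention that merely moves uncertainty from one token to a previously confident token is scored worse than one that lowers entropy without disturbing the anchors.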
[73] CMDS-AD: Cross-Modal Dual-Stream Decoupling for Few-Shot Anomaly Detection cs.CVPDF
Junhao Cai, Deyu Zeng, Junhao Pang, Junyu Chen, Qiwei Liang
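TL;DR: This paper proposes CMDS-AD, a cross-modal dual-stream decoupling framework for few-shot anomaly detection. It uses a LoRA-guided diffusion model to generate diverse RGB samples that mitigate data scarcity, and a pre-trained diffusion model as a normal estimator that extracts low-frequency normal representations directly from RGB inputs, forming an auxiliary stream of purely low-frequency information. A Coordinate-Aware Hierarchical Feature Mapper adaptively aligns cross-modal semantics, and a multiplicative scoring mechanism filters modality-specific noise, so micro-defects can be isolated precisely from very few training samples.
Details
Motivation: The paper addresses the limited training data in few-shot anomaly detection. Existing multi-modal anomaly detection methods apply spatially uniform feature processing, which conflates stable macroscopic structures with high-frequency localized defect signals, exacerbates cross-modal misalignment, and leads to high false-positive rates.
Result: Under the extreme 1-shot setting, CMDS-AD achieves absolute gains of 5.7% in I-AUROC and 2.0% in AUPRO on MVTec 3D-AD, and gains of 7.7% and 5.6% respectively on EyeCandies, establishing a new state of the art.
Insight: The core contribution is a dual-stream architecture: a purely low-frequency auxiliary estimated stream anchors robust structural templates, while an uncompressed real stream precisely isolates micro-defects, decoupling macroscopic structure from micro-defect signals. Using a pre-trained diffusion model as a non-linear low-pass filter to extract low-frequency representations, together with the coordinate-aware hierarchical feature mapping and the multiplicative scoring mechanism, are further effective design choices.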
Abstract: Few-shot anomaly detection remains challenging due to limited training data. Multi-modal anomaly detection (MAD) offers a viable solution, leveraging 3D geometric cues to enrich 2D RGB representations and compensate for this scarcity. However, existing MAD methods apply spatially uniform feature processing, conflating stable macroscopic structures with high-frequency localized defect signals, exacerbating cross-modal misalignment and inflating false-positive rates. To overcome this, we present CMDS-AD, a Cross-Modal Dual-Stream Anomaly Detection framework. A LoRA-guided diffusion model generates diverse RGB samples to mitigate extreme data scarcity. For 3D normal augmentation, we employ a pre-trained diffusion model as a normal estimator. Crucially, this estimator inherently acts as a non-linear low-pass filter, directly extracting low-frequency normal representations from RGB inputs. This establishes an auxiliary estimated stream of purely low-frequency information, anchoring robust structural templates and assisting the uncompressed real stream, containing coupled high- and low-frequency components, to precisely isolate micro-defects. A Coordinate-Aware Hierarchical Feature Mapper adaptively aligns cross-modal semantics, while a multiplicative scoring mechanism filters modality-specific noise. Under the extreme 1-shot setting, CMDS-AD achieves absolute performance gains of 5.7% (I-AUROC) and 2.0% (AUPRO) on MVTec 3D-AD, alongside 7.7% and 5.6% improvements on EyeCandies, establishing a new state-of-the-art.
[74] U$^2$Mamba: A Two-level Nested U-structure Mamba for Salient Object Detection cs.CVPDF
Junhui Li, Jialu Li, Youshan Zhang
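TL;DR: This paper proposes U$^2$Mamba, a new network architecture for salient object detection. Built on Mamba, the model introduces multiscale Mamba U-blocks that increase network depth to improve local feature extraction, and a novel two-level nested U-structure that integrates shallow and deep features to capture richer contextual information and longer-range data. The paper also proposes a hierarchical training supervision method that replaces the traditional deep supervision scheme.
Details
Motivation: Existing Mamba-based salient object detection models do not adequately explore contextual information or the depth of the overall architecture. The paper addresses this with a deeper network structure and a more effective feature fusion mechanism.
Result: Extensive experiments show that U$^2$Mamba achieves highly competitive performance on salient object detection, comparable to current state-of-the-art methods.
Insight: The main contributions are: 1) multiscale Mamba U-blocks that increase model depth and improve local feature extraction; 2) a two-level nested U-structure that effectively integrates shallow and deep features with different receptive fields; 3) a hierarchical training supervision method that computes the loss at each level during training, replacing the traditional deep supervision scheme.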
Abstract: Mamba-based models have emerged as a promising alternative for salient object detection (SOD), offering significant advantages in modeling long sequences. However, existing models often fail to explore contextual information and the depth of the entire architecture. This paper introduces U$^2$Mamba, a powerful and innovative U-structured network for salient object detection. We propose multiscale Mamba U-blocks (MMUBs) that enhance the model depth to improve local feature extraction capabilities. Our newly developed nested U-structure, incorporating MMUBs, enables the network to integrate various receptive fields from shallow and deep layers, thereby collecting richer contextual information and longer-range data without being constrained by resolution. Instead of using the traditional deep supervision scheme and top-level supervised training, we propose a hierarchical training supervision method where the loss is computed at each level during the training process. Extensive experiments demonstrate that U$^2$Mamba achieves highly competitive performance against state-of-the-art methods. The source code is available at https://github.com/JL021/U2Mamba.
[75] CUPID: Reconstructing UV Texture Maps for Interpretable Person-of-Interest Deepfake Detection cs.CVPDF
Giovanni Affatato, Sara Mandelli, Edoardo Daniele Cannas, Paolo Bestagini, Stefano Tubaro
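TL;DR: This paper proposes CUPID, a deepfake detector for a specific Person-of-Interest (POI). It combines UV texture maps extracted from 3D face reconstructions with the representation learning capabilities of the Masked Autoencoder (MAE). Training requires no deepfake videos and does not even need to include the specific POI. The method assesses authenticity by matching embeddings of a query video against pristine reference videos, and it provides interpretable residual maps in UV space that highlight anomalous facial regions.
Details
Motivation: Deepfakes targeting high-profile individuals (POI deepfakes) are a threat to modern society. Existing POI detection methods still fall short in robustness, efficiency, and interpretability, so a solution that combines these key aspects is needed.
Result: Experiments on four deepfake datasets show that CUPID outperforms the current state of the art on most datasets, achieves the best overall robustness under strong downscaling and compression, and offers substantially faster inference.
Insight: The contribution is the combination of UV texture maps with MAE, enabling zero-shot detection without deepfake videos or POI-specific training, along with interpretable residual visualization in UV space. The fusion of a 3D facial representation with self-supervised learning improves cross-identity generalization and detection efficiency.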
Abstract: Deepfakes targeting a high-profile individual, known as Person-of-Interest (POI), are a threat to modern democracies and societies. Current POI deepfake detection methods still struggle to combine robustness to post-processing, efficiency and interpretability, focal aspects of modern deepfake detectors. In this paper we propose CUPID, a POI video deepfake detector that combines UV texture maps, a facial appearance representation derived from 3D face reconstructions, with the representation learning capabilities of the Masked Autoencoder (MAE). Our method does not require any deepfake videos in its training phase. Moreover, it does not even require to include a specific POI in the training set: the combination of UV texture maps extracted from real video frames and the MAE context-guided reconstruction yields a latent space that captures rich and discriminative facial features also for identities unseen during training. In the testing phase, the embeddings extracted from a query video depicting the POI can be matched against pristine reference videos to assess the video authenticity. Furthermore, operating in the UV space naturally provides an additional layer of interpretability. Specifically, we can extract decoded residual maps that highlight which facial regions of a test video deviate most from the identity representation of the corresponding POI. Experiments on four deepfake datasets show that CUPID outperforms current state of the art on most datasets and achieves the best overall robustness against strong downscaling and compression, providing also substantially faster inference. Our experimental code will be released at https://github.com/polimi-ispl/CUPID.
[76] GEN-Guard: Correcting Generalization Failures for Deployable Federated Surgical AI cs.CVPDF
Julia Alekseenko, Pietro Mascagni, AI4SafeChole Consortium, Nicolas Padoy
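TL;DR: This paper proposes GEN-Guard, a framework for detecting and correcting generalization failures of federated learning in surgical video AI. It detects failures with Client-Blocked Evaluation (CBE) and applies feature-level corrections with Disagreement-Aware Distillation (DAD) to improve zero-shot generalization to unseen institutions.
Details
Motivation: Standard federated learning evaluation selects the "best" global model using only validation data from participating hospitals. The selected model may overfit the federation's internal data and fail to generalize to unseen institutions, a problem the paper calls performance leakage.
Result: On two multi-center clinical tasks (surgical phase recognition in laparoscopic cholecystectomy and polyp segmentation in colonoscopy), GEN-Guard improves in-federation F1 by up to 2 points, unseen-institution performance by up to 3 points, and worst-case institutional performance by 3-9 points.
Insight: The contribution is to identify performance leakage as a systematic risk and to design the post-hoc framework GEN-Guard, whose CBE and DAD components run after standard federated learning has converged, supporting zero-shot adaptation and strengthening cross-institutional robustness.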
Abstract: Federated Learning (FL) in surgical video AI enables collaborative model training without sharing sensitive data. However, standard evaluation practices - selecting the “best” global model based only on validation data from participating hospitals - can lead to suboptimal deployment choices. We identify this critical failure mode as performance leakage, where the selected model overfits internal federation data and fails to generalize to unseen institutions. We propose GEN-Guard, a practical post-hoc framework to detect and correct generalization failures in federated surgical AI. It integrates Generalization Detection via Client-Blocked Evaluation (CBE), which validates performance on isolated client distributions to prevent performance leakage, and Generalization Correction through Disagreement-Aware Distillation (DAD), which learns adaptive feature-level corrections for cross-institutional robustness. Both components operate after standard FL convergence while providing robust support for zero-shot adaptation to unseen environments. We first quantify the severity of performance leakage, observing Model Selection Failures (MSFs) exceeding 80% under standard evaluation. GEN-Guard is evaluated on two multi-center clinical challenges: surgical phase recognition in laparoscopic cholecystectomy and polyp segmentation in colonoscopy. Across both datasets, GEN-Guard consistently corrects these failures, improving in-federation F1 scores by up to 2 points, unseen-institution performance by up to 3 points, and worst-case institutional performance by 3-9 points. Performance leakage represents a systematic and previously under-recognized risk in federated surgical AI. GEN-Guard provides a practical solution for detecting and correcting such failures. By improving cross-institutional robustness and zero-shot generalization, it strengthens the reliability of FL for real-world surgical deployment.
[77] Through the PRISM: Preference Representation in Intermediate States of Video Diffusion Models cs.CVPDF
Haoxuan Wu, Lai Man Po, Mengyang Liu, Kun Li, Hongzheng Yang
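TL;DR: This paper proposes PRISM, which uses a lightweight query-based aggregation head to decode preference signals directly from the noisy latent space of a video diffusion model, enabling efficient evaluation of video generation and early-stage sampling optimization. The method achieves state-of-the-art preference prediction accuracy and strong noise robustness, substantially reducing compute while improving video quality.
Details
Motivation: Conventional pixel-based reward models are disconnected from the noisy diffusion process when evaluating video generation and incur high VAE decoding costs. The paper therefore explores whether preferences can be discriminated directly from noisy latents.
Result: On video generation benchmarks, PRISM achieves state-of-the-art preference prediction accuracy and enables early-stage Best-of-N sampling, greatly reducing denoising computation while improving video quality.
Insight: The paper reveals a positive correlation between a backbone's generative performance and its inherent evaluative ability, which supports self-improving video backbones. The lightweight query-based aggregation head decodes preferences directly from noisy latents, departing from the conventional evaluation paradigm.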
Abstract: Evaluating video generation with clean, pixel-based reward models disconnects evaluation from the noisy diffusion process and incurs massive VAE decoding costs. In this paper, we challenge this paradigm by asking a fundamental question: Can a powerful video generator inherently discriminate preferences directly from noisy latents? To answer this, we introduce PRISM (Preference Representation in Intermediate States of Diffusion Models). PRISM employs a lightweight Query-based Aggregation head with a frozen video diffusion backbone to decode preference signals from noisy latents. Surprisingly, PRISM not only achieves SOTA preference accuracy but also unlocks strong noise-robustness, which enables early-stage Best-of-$N$ sampling. This allows for filtering suboptimal candidates at the very beginning of denoising, drastically reducing computation while boosting video quality. We also reveal a strong positive correlation between a backbone’s generative performance and its inherent evaluative power, enabling self-improving video backbones.
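The compute saving from early-stage Best-of-N sampling can be sketched as follows. This is a toy illustration under stated assumptions: `denoise_step` and `score_latent` are hypothetical callables standing in for the diffusion backbone and a preference head, and latents are plain floats rather than tensors.

```python
def early_best_of_n(init_latents, denoise_step, score_latent, early_steps, total_steps):
    """Denoise all candidates for a few steps, keep the best-scored one,
    then finish denoising only that candidate.

    Returns (final_latent, number_of_denoise_calls).
    """
    calls = 0
    latents = list(init_latents)
    # Phase 1: every candidate is denoised for the first `early_steps` steps.
    for t in range(early_steps):
        latents = [denoise_step(z, t) for z in latents]
        calls += len(latents)
    # Selection happens on still-noisy latents, so no decoding is needed here.
    best = max(latents, key=score_latent)
    # Phase 2: only the selected candidate is denoised to completion.
    for t in range(early_steps, total_steps):
        best = denoise_step(best, t)
        calls += 1
    return best, calls


# Toy usage: three candidates, selection after 1 of 3 steps.
final, calls = early_best_of_n(
    init_latents=[4.0, -1.0, 2.0],
    denoise_step=lambda z, t: z * 0.5,   # stand-in for one denoising step
    score_latent=lambda z: -abs(z),      # stand-in for a preference score
    early_steps=1,
    total_steps=3,
)
```

With N candidates, T total steps, and selection after k steps, this scheme uses N*k + (T - k) denoising calls instead of the N*T calls of full Best-of-N; in the toy usage that is 5 calls instead of 9.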
[78] Geometry-Aware Superpixel Graph Transformer with Metadata for Skin Lesion Classification cs.CVPDF
Muhammad Azeem, Tanveer Hussain, Amr Ahmed, Ardhendu Behera
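TL;DR: This paper proposes a novel region-based graph learning framework for skin lesion classification. It models a lesion image as a graph whose nodes are superpixel regions represented by frozen CNN features, enriched with edge attributes that encode inter-regional geometry and with a metadata context node connected to all regions. Node representations are updated with an edge-aware graph transformer and attention-driven propagation, and a graph-level embedding is used for the final classification.
Details
Motivation: Automated skin cancer classification is challenging because of heterogeneous lesion structure, large intra-class variability, and subtle visual differences between benign and malignant cases. Existing CNN/ViT methods typically rely on global or patch-level features and combine patient metadata via late fusion, which limits spatially grounded multimodal reasoning.
Result: Experiments on four public benchmarks show that explicit region-level relational modeling and graph-native multimodal fusion yield consistent gains over existing state-of-the-art methods.
Insight: The contribution is a graph-centric perspective in which CNN features are modeled as relational nodes and enriched through contextual integration (geometric relations and metadata), giving more expressive and robust classification. Specifically, inter-regional geometry between superpixels is modeled explicitly as edge attributes, and a dedicated metadata context node provides structured fusion within the same relational space.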
Abstract: Automated skin cancer classification from dermoscopic images remains challenging due to heterogeneous lesion structure, strong intra-class variability, and subtle visual differences between benign and malignant cases. Existing CNN/ViT pipelines typically rely on global or patch-level features and often combine patient metadata via late fusion, which limits spatially grounded multimodal reasoning. We present a novel region-based graph learning framework that explicitly models lesions as graphs of spatially coherent superpixel regions represented as frozen CNN features. To capture fine-grained lesion arrangements, we encode inter-regional geometry as edge attributes and introduce a dedicated metadata context node connected to all regions, providing structured integration of demographic/clinical variables within the same relational space. Node representations are updated using our edge-aware graph transformer followed by attention-driven propagation, and a final graph-level embedding is used for benign-malignant classification. Experiments on four public benchmarks demonstrate that explicit region-level relational modeling and graph-native multimodal fusion yield consistent gains over the state-of-the-art. Consequently, we establish a new graph-centric perspective in which CNN features are modeled as relational nodes and improved through contextual integration, yielding more expressive and robust classifications.
[79] Reliability-Aware Prototype Calibration for Frozen Pose-Flow Video Anomaly Detection cs.CVPDF
Ning Dong, Yingna Su, Xin Dong, Ziyun Jiao, Xinnian Guo
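TL;DR: This paper proposes Reliability-Aware Prototype Calibration (RPC), a post-hoc score calibration method for frozen pose-flow video anomaly detection, where a single likelihood score may hide multimodal normal behavior and is sensitive to pose-observation noise. In the frozen latent space, RPC adds a standardized nearest-prototype deviation to the standardized flow score and uses keypoint confidence only to gate this geometric evidence. It thereby preserves the original density signal while correcting the ranking with empirical normal-mode structure under pose reliability.
Details
Motivation: Pose-flow video anomaly detectors are attractive for one-class surveillance, but a single likelihood score may hide multimodal normal behavior and is sensitive to pose-observation noise. The paper studies a frozen-detector setting in which the pose-flow backbone, cached skeleton tracks, and evaluation pipeline are all fixed, and aims to strengthen existing systems through lightweight post-hoc calibration without retraining or reproducing the full pose pipeline.
Result: Across two frozen pose-flow backbones and four datasets, RPC improves frame-level AUROC in all eight backbone-dataset pairs, with gains ranging from 0.34 to 4.49 percentage points and averaging 2.03 points. Ablation and reliability analyses show that prototype deviation is the main corrective signal, while reliability gating is most useful when pose observations are less trustworthy.
Insight: The contribution is RPC, a lightweight post-hoc calibration method that combines density estimation with geometric prototype deviation and gates the latter by pose reliability, effectively correcting anomaly rankings. By using cached data and latent-space structure in a frozen-system setting, it offers a practical performance improvement for deployed systems that are difficult to retrain.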
Abstract: Pose-flow video anomaly detectors are attractive for one-class surveillance because they provide likelihood-based rankings for tracked skeleton windows. However, a single likelihood score may hide multimodal normal behavior and be sensitive to pose-observation noise. We study a frozen-detector setting in which the pose-flow backbone, cached skeleton tracks, and evaluation pipeline are fixed. Reliability-Aware Prototype Calibration (RPC) is a post-hoc score calibration method for this setting. It adds a standardized nearest-prototype deviation in the frozen latent space to the standardized flow score, and uses keypoint confidence only to gate this added geometric evidence. Thus, RPC preserves the original density signal while correcting the ranking with empirical normal-mode structure under pose reliability. Across two frozen pose-flow backbones and four datasets, RPC improves frame-level AUROC in all eight backbone-dataset pairs, with gains ranging from 0.34 to 4.49 percentage points and averaging 2.03 points. Ablation and reliability analyses show that prototype deviation is the main corrective signal, while reliability gating is most useful when pose observations are less trustworthy. These results suggest that lightweight post-hoc calibration can strengthen cached pose-flow systems when retraining or reproducing the full pose pipeline is impractical.
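The score combination described in this entry can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the exact gate is not given in the abstract, so the mean keypoint confidence is used as a stand-in, the flow score is assumed to be oriented so that higher means more anomalous, and the standardization statistics are assumed to come from normal training data.

```python
import math

def standardize(x, mean, std):
    """Z-score a raw value using statistics estimated on normal data."""
    return (x - mean) / std

def nearest_prototype_deviation(z, prototypes):
    """Euclidean distance from latent z to its closest normal prototype."""
    return min(math.dist(z, p) for p in prototypes)

def rpc_score(flow_score, z, prototypes, keypoint_conf, flow_stats, dev_stats):
    """Calibrated score = standardized flow score
                        + gate * standardized nearest-prototype deviation.

    flow_stats and dev_stats are (mean, std) tuples.
    The gate (mean keypoint confidence, in [0, 1]) is an assumption; it only
    scales the added geometric term and leaves the flow score untouched.
    """
    dev = nearest_prototype_deviation(z, prototypes)
    gate = sum(keypoint_conf) / len(keypoint_conf)
    return standardize(flow_score, *flow_stats) + gate * standardize(dev, *dev_stats)
```

When the keypoint confidence is zero the score reduces to the standardized flow score alone, which mirrors the stated goal of preserving the original density signal when pose observations are unreliable.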
[80] Spectral Query-Key Product Weight Steering for Training-Free VLM Hallucination Mitigation cs.CVPDF
Karn Tiwari, Varnith Chordia, Prathosh A P
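TL;DR: This paper proposes QK Product Steering, a training-free, data-free, zero-inference-cost weight edit for mitigating object hallucination in vision-language models (VLMs). The method directly edits the per-head query-key product (the operator that produces the pre-softmax attention logits) by suppressing a small number of dominant singular modes in selected middle layers. The edited product is mapped back to the query weights through a closed-form query-only update while the shared key weights stay fixed, which makes the edit compatible with grouped-query attention.
Details
Motivation: Vision-language models (VLMs) often generate fluent but visually unsupported descriptions, especially by mentioning objects absent from the image (object hallucination). The paper aims to mitigate this without additional data, fine-tuning, or inference overhead.
Result: Across three GQA-based VLMs, QK Product Steering achieves an average relative CHAIR_s reduction of 4.0%, while matched random-mode controls show negligible change. The method reduces hallucination while largely preserving general multimodal capability.
Insight: The contribution is to decompose the QK product into symmetric and antisymmetric components, distinguishing mutual content-similarity patterns from directional attention patterns, and to suppress the hallucination signal by editing dominant singular modes. This offers a simple alternative to decoding-time mitigation.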
Abstract: Vision-language models (VLMs) often generate fluent but visually unsupported descriptions, especially by mentioning objects absent from the image. We propose QK Product Steering, a data-free, training-free, and zero-inference-cost weight edit for reducing object hallucination. The method directly edits the per-head query-key product, the operator that produces pre-softmax attention logits, by suppressing a small number of dominant singular modes in selected middle layers. The edited product is then mapped back to the query weights through a closed-form query-only update while keeping shared key weights fixed, making the edit compatible with grouped-query attention. We further decompose the QK product into symmetric and antisymmetric components to distinguish mutual content-similarity patterns from directional attention patterns. Across three GQA-based VLMs, QK Product Steering achieves an average relative CHAIR$_s$ reduction of 4.0%, while matched random-mode controls show negligible change. Interpretability ablations show that the hallucination signal is specific to dominant QK modes and is primarily localized to the symmetric mutual-attention channel. Overall, QK Product Steering offers a simple alternative to decoding-time mitigation, requiring no additional data, fine-tuning, or inference-time overhead while largely preserving general multimodal capability.
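One way to read the edit described here is as an SVD-based damping of the product $W_Q W_K^\top$ followed by a pseudo-inverse solve for the new query weights. The NumPy sketch below follows that reading; it is an assumption about the mechanics, not the authors' code. The function names, the `num_modes` and `scale` parameters, and the convention that logits are computed as $x W_Q W_K^\top x'^\top$ with weights of shape `(d_model, d_head)` are all hypothetical.

```python
import numpy as np

def steer_query_weights(W_q, W_k, num_modes=1, scale=0.5):
    """Damp the top singular modes of M = W_q @ W_k.T, then recover a new
    W_q with W_k held fixed (query-only update).

    W_q, W_k : arrays of shape (d_model, d_head) for one attention head.
    Returns (W_q_new, M_edit).
    """
    M = W_q @ W_k.T                              # (d_model, d_model), rank <= d_head
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    S_edit = S.copy()
    S_edit[:num_modes] *= scale                  # suppress the dominant modes
    M_edit = (U * S_edit) @ Vt                   # rebuild the edited product
    # Least-squares solution of W_q_new @ W_k.T = M_edit. It is exact here
    # because the rows of M_edit stay inside the row space of W_k.T.
    W_q_new = M_edit @ np.linalg.pinv(W_k.T)
    return W_q_new, M_edit

def sym_antisym(M):
    """Split a square matrix into symmetric and antisymmetric parts."""
    return 0.5 * (M + M.T), 0.5 * (M - M.T)


# Toy usage with random weights (d_model=8, d_head=4).
rng = np.random.default_rng(0)
W_q = rng.normal(size=(8, 4))
W_k = rng.normal(size=(8, 4))
W_q_new, M_edit = steer_query_weights(W_q, W_k, num_modes=1, scale=0.5)
```

Because only `W_q` changes, the same key projection can continue to serve every query head in a grouped-query attention group, which is the compatibility property the abstract highlights.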
[81] InfantFace: Detecting infant faces in neonatal clinical environments cs.CVPDF
Abdullah Bin-Obaid, Maria M. Cobo, Rebeccah Slater, Lionel Tarassenko, Mauricio Villarroel
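TL;DR: This paper proposes InfantFace, a one-stage YOLOv11m-based model tailored to infant face detection in neonatal clinical environments. The model is trained and evaluated on several public datasets and then fine-tuned on a neonatal research dataset of 228 videos from 113 independent infants, to cope with cluttered backgrounds, illumination changes, and facial occlusion in clinical settings.
Details
Motivation: Reliable localisation of the neonatal face is the key first step for video-based non-contact assessments (such as pain-related facial expression analysis, pain scoring, cardiorespiratory signal extraction, and cessation-of-breathing alerts), but cluttered backgrounds, illumination changes, and equipment occlusion in clinical environments reduce the accuracy of existing face detection models.
Result: Before fine-tuning, the model achieves an AP50 of 0.87, surpassing three state-of-the-art general face detectors; after clinical-domain adaptation, performance improves further to an AP50 of 0.96.
Insight: The novelty lies in tailoring a YOLOv11m model to neonatal clinical environments and substantially improving performance through domain adaptation. The study also highlights the importance of creating public neonatal datasets, with privacy and ethical standards upheld, for progress in this field.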
Abstract: Reliable localisation of the neonatal face is the first step for several video-camera based non-contact assessments such as pain and distress related facial expression analysis, pain scoring, cardiorespiratory signal extraction and cessation of breathing alerts. However, major challenges persist in neonatal clinical environments. Cluttered backgrounds, illumination changes and poor lighting conditions can reduce the accuracy of face detection models. Clinical interventions, monitoring equipment and, in some cases, medical devices can obstruct the face, making visual assessment difficult. We propose a one-stage YOLOv11m-based model tailored for face detection of infants in neonatal clinical environments. We combined multiple publicly available datasets (VGGFace2, CelebA, FDDB, WIDER FACE) to train and evaluate our proposed model. We then fine-tuned our model on a neonatal research dataset involving 228 videos from 114 recording sessions of 113 independent infants. Before fine-tuning, our model achieved an AP50 of 0.87, surpassing the performance of three state-of-the-art general face detectors. Performance improved further to an AP50 of 0.96 after clinical-domain adaptation. Evaluating face detection performance across different datasets remains a challenge due to the lack of publicly available neonatal datasets. Prioritising the creation of such datasets, while upholding appropriate privacy safeguards and ethical standards in their creation and use, would greatly support further progress in this field.
[82] Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology cs.CV | cs.CL | cs.LGPDF
Yusuf Salcan, Simon Ging, Robin Schirrmeister, Philipp Arnold, Elmar Kotter
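TL;DR: This paper proposes a method for training radiology vision-language models without manual spatial annotations and builds RefRad2D, a large-scale bilingual dataset. RadGrounder, the model trained on it, jointly performs report generation, visual question answering, and spatial grounding. It is competitive with specialized medical VLMs on external VQA benchmarks, and spatial supervision does not harm language quality.
Details
Motivation: Training radiology VLMs typically depends on manual spatial annotations; the paper aims to train spatially grounded medical VLMs without manual labeling.
Result: On external VQA benchmarks (Slake, VQA-RAD), RadGrounder achieves results competitive with specialized medical VLMs. Adding the clinical data to the training mixture improves open-ended VQA performance, demonstrating the dataset's transferability.
Insight: Contributions include building a large-scale dataset whose task-specific subsets are generated automatically via LLM-based curation and automated segmentation, and a joint training framework that yields spatially verifiable outputs without sacrificing VQA performance, offering an efficient data solution for medical VLMs.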
Abstract: We study how to train visually grounded vision-language models (VLMs) for radiology without manual spatial annotations. We introduce RefRad2D, a large-scale bilingual (German/English) dataset of 1.2M CT and MR image-text pairs derived from clinical practice, with task-specific VQA and spatial grounding subsets generated automatically via LLM-based curation and automated segmentation. Trained on this data, our model RadGrounder jointly performs report generation, visual question answering, and spatial grounding via bounding-box detection or segmentation. On external VQA benchmarks (Slake, VQA-RAD), RadGrounder achieves competitive results with specialized medical VLMs. Adding our clinical data to the training mixture improves open-ended VQA over fine-tuning on the downstream datasets alone, showing the transferability of our dataset. Crucially, adding grounding supervision does not degrade language quality, enabling spatially verifiable outputs at no cost to VQA performance.
[83] PCFootprint: A Large-Scale Dataset and Benchmark for Vectorized Building Footprint Extraction from Aerial LiDAR Point Clouds cs.CVPDF
Haoyuan Shen, Kuihao Wang, Ruisheng Wang, Yujun Liu
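TL;DR: This paper presents PCFootprint, the first large-scale public dataset for extracting vectorized building footprints from airborne laser scanning point clouds. It comprises 33,000 tiles covering diverse urban and rural landscapes, including a 3,000-tile test set for evaluating cross-region generalization. The authors also establish a comprehensive benchmark, evaluate mainstream methods, and expose the challenges of the task.
Details
Motivation: Existing building footprint extraction methods based on optical imagery are susceptible to occlusions, perspective distortions, and residual relief displacement, which leads to incomplete or misaligned footprints; the lack of explicit elevation information also limits their direct use in Level of Detail building modeling.
Result: Experiments show that mainstream methods face significant challenges on PCFootprint, including high intra-class variance, data imbalance, and noise across complex geospatial environments.
Insight: The main contribution is the first large-scale public dataset in which point clouds and vectorized footprints are precisely aligned, together with a cross-region generalization test set, providing a new benchmark and direction for point-cloud-based building modeling research.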
Abstract: Building footprint extraction is a fundamental task in photogrammetry, remote sensing, and computer vision. Recent image-based methods have achieved remarkable progress in extracting vectorized footprints from high-resolution optical imagery. However, optical imagery is inherently susceptible to occlusions, perspective distortions, and residual relief displacement, yielding incomplete or misaligned footprint extraction. Furthermore, the lack of explicit elevation information limits its direct applicability to Level of Detail building modeling. In this paper, we present PCFootprint, the first large-scale public dataset for footprint extraction from airborne laser scanning point clouds. PCFootprint comprises 33,000 tiles derived from the Estonian Land and Spatial Development Board, covering diverse urban and rural landscapes. Each tile spans 128 m × 128 m, with vectorized footprints systematically aligned to the point clouds. The dataset includes a 3,000-tile cross-domain test set for evaluating generalization across geographic regions. We establish comprehensive benchmarks by evaluating mainstream methods. Experimental results reveal significant challenges including high intra-class variance, data imbalance, and noise across complex geospatial environments. We believe PCFootprint will advance future research in building modeling, urban scene understanding, and geospatial analysis. The PCFootprint dataset is publicly available at https://huggingface.co/datasets/Haoyuan-Shen/PCFootprint.
[84] FreeStyle: Free Control of Style-Content Dual-Reference Generation from Community LoRA Mining cs.CV | cs.AIPDF
Jinghong Lan, Wei Cheng, Yunuo Chen, Ziqi Ye, Peng Xing
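TL;DR: This paper proposes FreeStyle, a framework that mines community LoRA resources to build a large-scale style-content dual-reference dataset and uses a two-stage curriculum to suppress semantic leakage from the style reference, enabling dual-reference image generation that transfers style while preserving content structure.
Details
Motivation: The paper addresses the difficulty of balancing content fidelity, style alignment, and instruction following in style-content dual-reference generation, as well as the lack of large-scale triplet data with clean content-style separation and long-tail style coverage.
Result: On a benchmark covering style-reference and dual-reference generation, the model performs well on style similarity, content preservation, aesthetics, instruction following, and leakage rejection, achieving a strong balance among style alignment, content preservation, and leakage suppression.
Insight: Contributions include using community LoRAs as compositional anchors for style and content to build the dataset, and a two-stage curriculum (an attention-level enrichment constraint and frequency-aware RoPE modulation) that targets the semantic leakage specific to each stage. The paper also introduces a style-invariant Content Alignment Score and a calibrated VLM-based Rejection Score for reliable evaluation.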
Abstract: Style-content dual-reference generation aims to synthesize an image that preserves the structure and semantics of a content reference while adopting the style of a separate style reference. Despite recent progress, this setting remains challenging because models must balance content fidelity, style alignment, and instruction following while avoiding semantic leakage from the style reference. A key bottleneck is the lack of large-scale triplet data with clean content-style separation and broad long-tail style coverage. In this work, we propose FreeStyle, a scalable dual-reference generation framework based on community LoRA mining. We treat community LoRAs as compositional anchors for style and content, and design a rigorous generation and filtering pipeline to construct large-scale Style-Reference and Content-Reference triplets across multiple base models. To address content leakage, we adopt a two-stage curriculum with stage-specific disentanglement mechanisms: an attention-level enrichment constraint that suppresses style-reference leakage in the style-transfer stage, and a frequency-aware RoPE modulation strategy that targets positional-correspondence-based leakage in the harder dual-reference stage. We also introduce a benchmark covering both style-reference and dual-reference generation, with evaluations on style similarity, content preservation, aesthetics, instruction following, and leakage rejection. The benchmark incorporates a style-invariant Content Alignment Score (CAS) and introduces a calibrated VLM-based Rejection Score for evaluating generation reliability and leakage suppression. Extensive experiments show that our model achieves a strong balance among style alignment, content preservation, and leakage suppression.
[85] S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence cs.CVPDF
Yalun Dai, Hao Li, Shulin Tian, Runmao Yao, Yuhao Dong
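TL;DR: This paper proposes S-Agent, an agentic paradigm for spatial reasoning that understands and reasons over continuous multi-view images and videos through tool use and evidence accumulation. It casts the vision-language model (VLM) as a semantic planner, combined with a hierarchy of spatial tools (such as 2D object grounding and lifting to 3D geometric evidence) and a temporal memory mechanism, reshaping spatial perception into scene-centric understanding. Experiments show that the approach consistently improves open- and closed-source VLMs on spatial reasoning tasks, and that supervised fine-tuning yields the compact S-Agent-8B model, which surpasses similar-scale baselines and performs comparably to advanced closed-source models.
Details
Motivation: Existing VLMs and tool-augmented agents are limited in spatial reasoning because they typically reason from static, isolated visual observations, whereas real-world spatial intelligence requires reasoning over a continuously evolving 3D environment.
Result: On multi-view and video spatial reasoning benchmarks, S-Agent consistently improves both open- and closed-source VLMs in a training-free manner. S-Agent-8B, obtained via supervised fine-tuning, significantly surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4 and Gemini 3).
Insight: The novelty lies in reframing spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction, and in introducing a scene-centric understanding paradigm, a hierarchical spatial toolchain, and a temporal memory mechanism comprising Scene Memory and Agent Memory, which together integrate evidence across frames and reasoning steps.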
Abstract: Real-world spatial intelligence requires reasoning over a continuous and evolving 3D world, yet existing VLMs and tool-augmented agents largely remain tied to static, stateless inference from isolated visual observations. We introduce S-Agent, a spatial tool-use agentic paradigm for understanding and reasoning over continuous multi-view images and videos. By formulating spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction, S-Agent reshapes spatial perception into scene-centric understanding beyond frame-centric recognition. Specifically, S-Agent casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge (e.g., counting, measurement, orientation, and relative position). Additionally, a temporal memory mechanism, including Scene Memory for maintaining the evolving scene state and Agent Memory for accumulating reasoning context, enables evidence integration across frames and reasoning steps. Comprehensive experiments on multi-view and video spatial reasoning benchmarks show that S-Agent consistently improves both open-source and closed-source VLMs in a training-free manner. Beyond inference-time augmentation, supervised fine-tuning (SFT) on S-Agent-generated spatial trajectories (S-300K) yields S-Agent-8B, a compact spatial agent that significantly surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4 and Gemini 3).
[86] HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining cs.CVPDF
Juncheng Ma, Jianxin Bi, Yufan Deng, Xuanran Zhai, Kewei Zhang
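TL;DR: This paper systematically compares egocentric human video with teleoperated real-robot trajectories as pretraining data sources for embodied foundation models. It finds that egocentric human video, once processed through a carefully designed filtering and labeling pipeline, is not only a viable substitute for robot data but can even yield better performance.
Details
Motivation: Embodied foundation models face a much tighter data bottleneck than large language models, and the dominant teleoperated real-robot data scales poorly because of high collection cost, acquisition difficulty, and low behavioral and environmental diversity. The paper examines whether lower-cost, more diverse egocentric human video is an effective alternative pretraining source.
Result: With the same amount of pretraining data, models pretrained on egocentric human video achieve a 24% lower validation loss on real-robot action prediction, and 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot task execution, respectively.
Insight: The core contribution is verifying a scalable pretraining paradigm for embodied foundation models: first learn world representations from diverse egocentric human video, then align the action space with only a small amount of labeled real-robot data. This offers guidance for assessing data quality before costly robot data collection and encourages broader exploration of egocentric data.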
Abstract: Embodied foundation models are expected to benefit from data scaling like large language models, but face a much tighter data bottleneck. Teleoperated real-robot trajectories remain the dominant pretraining source due to their precise action supervision and embodiment alignment, yet their scalability is limited by high collection cost, acquisition difficulty, and low behavioral and environmental diversity. These limitations have sparked interest in egocentric human video as a scalable, substantially lower-cost, and more diverse alternative for embodied model pretraining. However, its effectiveness compared to teleoperated real-robot data remains underexplored. To address this question, we conduct a systematic study comparing egocentric human video and teleoperated real-robot trajectories as pretraining data sources for embodied foundation models, under fixed post-training and validation protocols. Surprisingly, we find that egocentric data, when processed through a carefully designed filtering and labeling pipeline, is not merely a viable substitute for model pretraining but can lead to superior performance. With the same amount of pretraining data, models pretrained on egocentric data achieve a 24% lower validation loss on real-robot action prediction, as well as 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot task execution, respectively. This finding verifies a scalable paradigm for embodied foundation models: pretrain on egocentric human video to learn diverse world representations, then adapt with a small amount of labeled real-robot data for action-space alignment. We hope this study encourages broader exploration of egocentric data and offers guidance for data quality assessment before costly robot data collection.
[87] SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm cs.CV | cs.AI | cs.DBPDF
Solène Debuysère, Nicolas Trouvé, Nathan Letheule, Elise Colin, Georgia Channing
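TL;DR: This paper presents SARLO-80, a worldwide multimodal slant-range SAR-optical-text dataset at 80 cm resolution. Built from open-access Umbra spotlight SAR data (SICD format), it uses standardized processing to produce complex-valued SAR patches, pixel-aligned optical patches, and natural-language descriptions at three lengths, and is intended to support multimodal alignment research in native SAR geometry.
Details
Motivation: Progress in multimodal foundation models has relied mainly on large optical benchmarks, while synthetic aperture radar (SAR) lacks comparable large-scale, high-resolution multimodal datasets that preserve complex-valued information and native geometry, which limits physically grounded multimodal learning.
Result: The dataset contains 119,566 triplets (complex and amplitude slant-range SAR patch, aligned optical patch, natural-language description) from 257 locations across 72 countries. It ships with fixed train/validation/test splits plus the full preprocessing and baseline code to support reproducible benchmarks on tasks such as cross-modal retrieval and conditional generation.
Insight: The main contribution is a large-scale public dataset that combines very-high-resolution (VHR) SAR SLC data, pixel-aligned optical imagery, and natural-language descriptions while preserving complex-valued measurements and native acquisition geometry, giving SAR multimodal learning a more physically grounded data resource.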
Abstract: Multimodal foundation models have advanced rapidly thanks to large optical benchmarks, but comparable resources for synthetic aperture radar (SAR) remain limited. Existing SAR–optical datasets largely rely on low-resolution, intensity-only Ground Range Detected (GRD) products and do not preserve complex-valued SAR measurements or native acquisition geometry, which restricts physically grounded multimodal learning. In particular, large-scale public datasets combining very-high-resolution (VHR) SAR SLC, aligned optical imagery, and natural-language descriptions are still lacking. We present a VHR SAR–optical–text dataset built from open-access Umbra spotlight acquisitions distributed as Sensor Independent Complex Data (SICD). From around 2,500 worldwide scenes (VV/HH, 20 cm–2 m native resolution), we standardize all SAR data to an 80 cm slant-range grid via band-limited FFT resampling and tile the imagery into 1024 by 1024 patches. For each SAR patch, we retrieve a high-resolution optical tile and warp it into the SAR grid using local coordinate correspondences for local pixel-level alignment. We further generate three caption variants (SHORT/MID/LONG) per sample to support vision–language training and evaluation. Our dataset contains 119,566 triplets (complex and amplitude slant-range SAR patch, aligned optical patch, natural-language description) covering 257 locations across 72 countries and a broad range of land types and infrastructures. We release fixed train/validation/test splits and the full preprocessing and baseline code to enable reproducible benchmarks for multimodal alignment on cross-modal retrieval and conditional generation in native SAR geometry. The dataset is publicly available on the Hugging Face Hub at https://huggingface.co/datasets/ONERA/SARLO-80.
[88] VisDom: Sparse Novel View Synthesis with Visible Domain Constraint cs.CVPDF
Mariia Gladkova*, Tarun Yenamandra*, Edmond Boyer, Robert Maier, Tony Tung
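TL;DR: This paper proposes VisDom, a learning-free geometric constraint for improving sparse-view novel view synthesis. It strengthens visual-hull-based reconstruction by enforcing a multi-view visibility requirement and is integrated into both NeRF and Gaussian Splatting pipelines, enabling high-quality reconstruction from as few as four input images.
Details
Motivation: Sparse-view novel view synthesis is challenging because recovering 3D geometry from few views is ambiguous. Methods such as NeRF and Gaussian Splatting tend to overfit in sparse settings, producing floating artifacts and inconsistent geometry, and silhouette consistency alone is an insufficient regularizer.
Result: Experiments on three challenging datasets show consistent improvements in sparse-view novel view synthesis. On Omni3D and MipNeRF360, applying VisDom on top of GaussianObject further improves performance, while matching or surpassing it at 22x lower training cost.
Insight: The novelty lies in introducing the visible-domain constraint as a spatial prior that strengthens silhouette-based reconstruction by requiring visibility from at least K views. The method is domain-agnostic, has no learned parameters, needs only silhouettes, and is a simple complement to existing implicit and explicit reconstruction pipelines.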
Abstract: Sparse novel view synthesis (NVS) remains challenging due to the ambiguity of recovering 3D geometry from few input views. While NeRF- and Gaussian Splatting (GS)-based methods perform well with dense supervision, they often overfit in sparse settings, producing floating artifacts and inconsistent geometry. Silhouette consistency is commonly used as a regularizer, but it remains insufficient, as silhouette-consistent regions can extend beyond the true object geometry. We introduce VisDom, a learning-free geometric constraint that augments classical carving-based visual hull reconstruction by enforcing a minimum multi-view visibility requirement. Specifically, we define a visible domain as the subset of 3D space observed by at least $K$ views and use it as an additional filtering criterion on top of standard silhouette-based reconstruction. This provides a stronger spatial prior in sparse-view settings. We integrate VisDom into both implicit (NeRF) and explicit (GS) pipelines by restricting volumetric sampling and guiding Gaussian placement during optimization. Experiments on three challenging datasets show consistent improvements in sparse-view NVS, enabling high-quality object-centric reconstruction from as few as four input images. Our method is domain-agnostic, requires only silhouettes, and introduces no learned parameters, making it a simple complement to existing approaches. Applying VisDom on top of GaussianObject further improves performance on Omni3D and MipNeRF360, while matching or surpassing it at 22 $\times$ lower training cost.
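The visible-domain filter described in the abstract can be sketched on a toy example. This is a minimal illustration, not the authors' code: the two orthographic "views" over 2D points, the field-of-view and silhouette intervals, and the convention that a view which does not see a point cannot carve it are all assumptions.

```python
def visible_count(point, views):
    """Number of views whose field of view contains the point."""
    return sum(1 for view in views if view["sees"](point))

def silhouette_consistent(point, views):
    """Carving test: no view that sees the point places it outside its silhouette."""
    return all(view["in_silhouette"](point) for view in views if view["sees"](point))

def in_visdom_hull(point, views, k):
    """Keep a point only if it survives carving AND is observed by at least k views."""
    return silhouette_consistent(point, views) and visible_count(point, views) >= k

# Two hypothetical orthographic views of 2D points (x, y).
views = [
    # View A looks along x and observes the y coordinate.
    {"sees": lambda p: -2.0 <= p[1] <= 2.0,
     "in_silhouette": lambda p: -1.0 <= p[1] <= 1.0},
    # View B looks along y and observes the x coordinate.
    {"sees": lambda p: -2.0 <= p[0] <= 2.0,
     "in_silhouette": lambda p: -1.0 <= p[0] <= 1.0},
]

inside = (0.0, 0.0)       # seen by both views, inside both silhouettes
ambiguous = (0.5, 3.0)    # outside view A's field of view, only view B constrains it
carved = (1.5, 0.0)       # seen by view B but outside its silhouette
```

The point `ambiguous` is silhouette-consistent yet seen by only one view, so silhouette carving alone keeps it while the `k=2` visibility requirement rejects it. This is the kind of region the abstract describes as extending beyond the true object geometry.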
[89] SSD: Spatially Speculative Decoding Accelerates Autoregressive Image Generation cs.CVPDF
Shilong Xiang, Zirui Zhang, Lijun Yu, Chengzhi Mao
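TL;DR: The paper proposes Spatially Speculative Decoding (SSD), a framework for accelerating autoregressive image generation. It exploits the 2D spatial locality of images by simultaneously predicting the horizontally adjacent token and the token directly below, thereby overcoming the memory wall in visual inference. Experiments show speedups of up to 13.3x with high fidelity maintained on DPG-Bench and GenEval.
Details
Motivation: Autoregressive models generate images by treating them as 1D sequences of discrete tokens, but this flattening discards the intrinsic 2D spatial locality of visual signals and creates severe computational bottlenecks at inference time.
Result: SSD accelerates autoregressive image generation by up to 13.3x while maintaining high fidelity on DPG-Bench and GenEval.
Insight: The novelty lies in aligning the prediction objective with the natural geometry of images, exploiting 2D spatial correlation by predicting horizontal and vertical neighbors at once, which opens a new route to efficient real-time, high-resolution autoregressive generation.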
Abstract: Autoregressive models excel in visual generation by treating images as 1D sequences of discrete tokens, mirroring language modeling. However, this flattening discards the intrinsic 2D spatial locality of visual signals, creating severe computational bottlenecks during inference. We introduce Spatially Speculative Decoding (SSD), a framework that aligns the predictive objective with the natural geometry of images. Rather than predicting only the immediate next token in a 1D sequence, our model simultaneously predicts the adjacent horizontal token and the token directly below it. By capitalizing on this 2D spatial correlation, spatially speculative decoding overcomes the memory wall in visual inference. Our approach accelerates autoregressive image generation by up to 13.3x while maintaining high fidelity on DPG-Bench and GenEval. Our results suggest that respecting the underlying geometry of vision unlocks massive computational efficiencies, paving the way for real-time, high-resolution autoregressive generative models.
[90] CalTennis: Large Multi-View Tennis Video Dataset and Benchmark of Monocular-to-3D Pose Estimation cs.CVPDF
Ilona Demler, Xinran Xie, Blake Werner, Anna Szczuka, Pietro Perona
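TL;DR: This paper introduces CalTennis, a large-scale tennis video benchmark for evaluating monocular-to-3D pose estimation. It contains over 11 million frames (51 hours) of tennis practice and match play from 40 players, captured with 2-6 synchronized cameras at 60 Hz. It is 10 times larger than existing in-the-wild human motion video datasets and is the first large-scale benchmark to provide synchronized multi-view recordings of expert athletic motion.
Details
Motivation: Existing datasets fall short in scale, synchronized multi-view recording, and capture of expert athletic motion; the paper aims to provide a more realistic and challenging benchmark for monocular-to-3D pose estimation.
Result: Benchmarking state-of-the-art monocular-to-3D pose methods on CalTennis shows that 3D joint angle recovery is now quite accurate, but all models struggle to estimate depth and foot contact consistently.
Insight: The paper proposes two novel performance metrics (footwork and stability) and a qualitative study of body shape inconsistency; these expose previously underexplored failure modes and point to concrete directions for improving pose estimation and action analysis. It also describes a simple, standardized data collection protocol that needs no specialized equipment or expertise, with fully automated video calibration and synchronization.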
Abstract: The Caltech Tennis Dataset (CalTennis) is a large-scale video benchmark for evaluating monocular-to-3D pose estimation in the wild. CalTennis comprises over 11 million frames (51 hours) of tennis practice and match play from 40 players, captured with 2-6 synchronized cameras at 60 Hz. It is 10 times larger than existing in-the-wild human motion video datasets and 3 times larger than existing MOCAP-ground-truthed datasets, and it is the first large-scale benchmark to provide synchronized multi-view recordings of expert athletic motion. The multi-view setup enables inexpensive, label-free evaluation of monocular-to-3D pose estimation algorithms. We describe a simple, standardized protocol that enables data collection without specialized equipment or expertise, along with fully automated video calibration and synchronization. Benchmarking state-of-the-art monocular-to-3D pose methods on CalTennis, we find that while 3D joint angle recovery is now quite accurate, all models struggle to estimate depth and foot contact consistently. We further propose two novel performance metrics, footwork and stability, as well as qualitatively study body shape inconsistency. These metrics expose previously underexplored failure modes and point to concrete opportunities for improvement in pose estimation and action analysis.
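As a quick sanity check on the reported scale, the frame count can be reproduced from the duration and frame rate. The sketch assumes the 51 hours refers to total footage summed over cameras, which the abstract does not state explicitly.

```python
hours = 51          # reported total duration
fps = 60            # reported capture rate in Hz
frames = hours * 3600 * fps   # seconds per hour times frames per second
```

Under that assumption the product is 11,016,000 frames, consistent with "over 11 million frames".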
[91] Thinking in Boxes: 3D Editing in Real Images Made Easy cs.CVPDF
Pradhaan S Bhat, Naveen Chandra R, Rishubh Parihar, Vaibhav Vavilala, R. Venkatesh Babu
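TL;DR: This paper proposes "Thinking in Boxes", a 3D image editing method that uses 3D bounding boxes as structured specifications: the user only specifies the input and output boxes to obtain precise translation, rotation, scaling, and viewpoint changes of objects in real images while preserving scene and object identity.
Details
Motivation: Existing text- and 2D-conditioned image editing methods offer weak, ambiguous control over spatial transformations (particularly large object motions and camera viewpoint changes) and cannot specify geometric transformations precisely.
Result: The method substantially outperforms recent state-of-the-art methods on large 3D edits in real images.
Insight: The core idea is to treat 3D bounding boxes as an explicit specification that casts editing as a well-posed geometry problem, and to introduce a depth-aligned planar floor as a global reference frame. Combined with two-stage training (synthetic scenes, then real-world videos), this generalizes to complex real images.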
Abstract: Text and 2D-conditioning interfaces provide weak, ambiguous control over spatial transformations in image editing – particularly under large object motions and camera changes. Prior work has used 3D primitives such as boxes, but only as loose conditioning signals indicating approximate object location rather than specifying the transformation. We instead use 3D boxes as structured specifications: the user provides the input and output boxes of the edit, casting editing as a well-posed geometry problem. This "thinking in boxes" interface, where each box face is color-coded to convey 3D orientation, gives precise control over translation, rotation, scaling, and viewpoint changes in real images while preserving scene and object identity, and recovering previously unseen object regions. To ground transformations in scene appearance, we introduce a depth-aligned planar floor as a global reference frame, shaded with depth-aware cues. Conditioned on this structure, an image generator produces consistent results under large transformations. Trained in two stages – on synthetic multi-object scenes and a small set of real-world videos from Objectron – the system generalizes to complex, in-the-wild real images. Our method operates directly on real photographs and substantially outperforms recent state-of-the-art methods on large 3D edits.
[92] TimeProVe: Propose, then Verify for Efficient Long Video Temporal Reasoning in Activities of Daily Living cs.CVPDF
Arkaprava Sinha, Dominick Reilly, Siddharth Krishnan, Hieu Le, Srijan Das
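TL;DR: This paper proposes TimeProVe, a framework for efficient long video question answering. It follows a hybrid propose-then-verify strategy: lightweight modules first generate action-grounded answer-evidence hypotheses, and an expensive vision-language model is then invoked only to verify targeted segments.
Details
Motivation: Existing long video question answering methods have two problems: dense processing of the whole video is computationally prohibitive, while sparse caption-based reasoning tends to miss temporally localized and motion-centric evidence.
Result: On the proposed OpenTSUBench benchmark, TimeProVe outperforms the strongest baseline by 7.3% while reducing VLM calls by 75% and inference cost by 93%. It achieves competitive performance on Charades-STA and reaches state-of-the-art results when combined with grounding VLMs.
Insight: The core contribution is the Action-based Candidate Evidence module, which converts temporally localized actions into query-conditioned candidate answers and evidence windows through lightweight LLM reasoning. The framework balances computational efficiency against reasoning accuracy.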
Abstract: Long Video Question Answering (LVQA) requires identifying sparse, query-relevant evidence within hours-long untrimmed videos. Existing approaches either process videos densely with large vision-language models (VLMs), incurring prohibitive computational cost, or rely on sparse caption-based reasoning, which often misses temporally localized and motion-centric evidence. We introduce TimeProVe, a cost-efficient hybrid framework for temporally grounded reasoning in long videos. TimeProVe first employs lightweight modules to generate action-grounded answer–evidence hypotheses and subsequently invokes an expensive VLM only for targeted verification. The core of our framework lies in the Action-based Candidate Evidence (ACE) module, which converts temporally localized actions into query-conditioned candidate answers and supporting evidence windows through lightweight LLM reasoning. We further introduce OpenTSUBench (OTB), an open-ended benchmark designed to evaluate temporally grounded reasoning in real-world Activities of Daily Living (ADL) scenarios. Experiments show that TimeProVe outperforms the strongest baseline on OTB by 7.3%, while reducing VLM calls by 75% and inference cost by 93%. Furthermore, without explicit temporal grounding training, TimeProVe achieves competitive performance on Charades-STA, and reaches state-of-the-art results when enhanced with grounding VLMs.
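The propose-then-verify control flow can be sketched as follows. Everything here is a stand-in chosen for illustration: the keyword-overlap proposer replaces the paper's LLM-based ACE module, the `verify` callback replaces the expensive VLM, and the action list and `top_k` budget are hypothetical.

```python
def propose(actions, query_keywords, top_k):
    """Cheap proposal step: rank localized actions by keyword overlap with the query."""
    scored = []
    for label, start, end in actions:
        overlap = len(set(label.split()) & query_keywords)
        if overlap > 0:
            scored.append((overlap, label, start, end))
    scored.sort(key=lambda item: -item[0])   # stable sort keeps temporal order among ties
    return [(label, start, end) for _, label, start, end in scored[:top_k]]

def answer(actions, query_keywords, verify, top_k=2):
    """Call the expensive verifier only on proposed windows; count how often it runs."""
    calls = 0
    for label, start, end in propose(actions, query_keywords, top_k):
        calls += 1
        if verify(label, start, end):
            return label, (start, end), calls
    return None, None, calls

# Hypothetical temporally localized actions: (label, start_sec, end_sec).
actions = [
    ("open fridge", 10, 15),
    ("pour milk", 40, 48),
    ("wash cup", 90, 100),
    ("drink milk", 120, 126),
]

# Stub verifier standing in for a VLM that inspects only the given window.
result = answer(actions, {"milk"}, verify=lambda label, s, e: label == "drink milk")
```

In this toy run the verifier is called on two of the four windows rather than on the whole video, which is the source of the savings the abstract reports.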
[93] Current World Models Lack a Persistent State Core cs.CVPDF
Jinpeng Lu, Dexu Zhu, Haoyuan Shi, Linghan Cai, Guo Tang
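TL;DR: This paper argues that current world models lack a persistent state core: they cannot keep the world state evolving while it is unobserved. The authors introduce WRBench, the first systematic diagnostic benchmark, which treats camera motion as an intervention on observability and evaluates whether a model keeps advancing an event after the target leaves the field of view. Experiments cover 9,600 videos from 23 models and find that existing systems maintain the world as a tracking shot and fail to keep events consistent when the target reappears.
Details
Motivation: Existing world-model benchmarks reward only surface properties (such as fidelity, motion, and controllability) and ignore the core requirement that the world keeps evolving when unobserved. Modeling the physical world requires an internal world state that evolves over time independently of observation, which is a blind spot of current benchmarks.
Result: WRBench evaluates 23 models spanning four control paradigms, with 9,600 generated videos analyzed. The key finding is that all systems maintain the observed world as a tracking shot: when a target re-enters the view, it resumes in the state at which it left rather than having advanced during the unseen interval. This failure persists across control paradigms, model families, and increments of scale.
Insight: The contribution is WRBench, the first diagnostic benchmark focused on world-state persistence, which treats camera motion as an intervention to assess worldline consistency. The central insight is that cleaner imagery, tighter control, richer geometric priors, or sheer parameter count do not yield robust world-state evolution, so the stability of the physical state kernel and worldline consistency under viewpoint intervention should become first-class objectives of world-model design.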
Abstract: World models are increasingly regarded as a decisive step toward artificial general intelligence, yet modeling the physical world demands more than rendering convincing frames on demand: it requires an internal world state that keeps evolving over time, decoupled from observation, so that objects endure and events run to their conclusions whether or not a camera is watching, much as the moon holds to its orbit when no one is looking. This requirement is a blind spot of existing benchmarks, which reward surface properties such as fidelity, motion, and camera controllability while never asking whether a generated world keeps evolving once it is unobserved. We introduce WRBench, the first systematic diagnostic benchmark that treats camera motion as an intervention on observability and resolves evaluation into a human-calibrated chain that asks whether the camera executes the requested interaction, whether the scene stays continuous and identifiable while in view, and whether a returning target remains consistent with the event that was set in motion. Across 9,600 videos from 23 models spanning four control paradigms, one finding proves stubborn: current systems maintain the observed world as a tracking shot, resuming a returning target in the state at which it was abandoned rather than advancing the event while it went unseen. Because this failure recurs across control paradigms, model families, and increments of scale, robust world-state evolution does not follow from cleaner imagery, tighter control, richer geometric priors, or sheer parameter count. We therefore argue that the stability of the physical state kernel and the consistency of worldlines under viewpoint intervention should become first-class objectives of world-model design, so that a world model captures how the world will unfold rather than how the next frame appears.
[94] UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning cs.CV | cs.LGPDF
Wenhao Chi, Arkaprava Sinha, Dominick Reilly, Hieu Le, Srijan Das
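TL;DR: This paper proposes UNIEGO, a framework for unified egocentric video representation learning. At its core is a hierarchical multi-teacher distillation framework that introduces a layer of representation-specific proxy models to translate heterogeneous teacher knowledge from different viewpoints, modalities, and foundation models into a homogeneous egocentric space, and then adaptively integrates reliable supervision through Selective Proxy Distillation.
Details
Motivation: Egocentric video understanding struggles to capture the richness of human action because of its single viewpoint and limited modalities. The paper aims to build a unified representation model that integrates complementary knowledge across viewpoints, modalities, and foundation models yet can be deployed from egocentric video alone.
Result: On three challenging ego-exo benchmarks, UNIEGO achieves state-of-the-art performance on action recognition, video retrieval, and action segmentation, outperforming naive multi-teacher distillation baselines.
Insight: The novelty lies in a hierarchical distillation framework mediated by proxy models together with a selective proxy distillation mechanism, which handles the gradient conflicts caused by heterogeneous teachers, stabilizes training through a proxy-parameter initialization strategy, and yields richer, more discriminative unified representations.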
Abstract: Egocentric video understanding is inherently limited by the narrow perspective of wearable cameras: a single viewpoint, a single modality, a single model cannot capture the full richness of human action. We argue that a truly expressive egocentric representation must subsume complementary knowledge across viewpoints, modalities, and foundation model representations, yet remain deployable from egocentric video alone. To this end, we introduce a hierarchical multi-teacher distillation framework that produces UNIEGO, a unified egocentric encoder trained with nine teachers spanning ego-exo viewpoints, RGB, depth, and skeleton modalities, and four foundation models. Rather than distilling directly from heterogeneous teachers whose incompatible architectures and feature geometries induce conflicting gradients, our framework interposes a layer of representation-specific Proxy models that translate diverse teacher knowledge into a homogeneous egocentric space. A second distillation stage, Selective Proxy Distillation (SPD), then adaptively selects, for each training sample, the subset of proxies that are both correct and confident, distilling exclusively from reliable supervision and suppressing erroneous signals. SPD is further stabilized by initializing UNIEGO as a learned convex combination of proxy parameters, placing the unified model in a well-conditioned region of the loss landscape before distillation begins. UNIEGO achieves state-of-the-art performance across three egocentric video understanding tasks - action recognition, video retrieval, and action segmentation on three challenging ego-exo benchmarks, outperforming naive multi-teacher distillation baselines and demonstrating that structured, proxy-mediated knowledge transfer yields richer and more discriminative egocentric representations.
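The per-sample selection rule in Selective Proxy Distillation ("correct and confident") can be sketched as below. The confidence threshold `min_conf`, the proxy names, and the simple averaging of the selected proxies into one target are assumptions; the abstract does not specify how confidence is measured or how the selected proxies are combined.

```python
def select_proxies(proxy_probs, label, min_conf=0.6):
    """Return names of proxies that predict the true label with enough confidence."""
    selected = []
    for name, probs in proxy_probs.items():
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted == label and probs[predicted] >= min_conf:
            selected.append(name)
    return selected

def distillation_target(proxy_probs, label, min_conf=0.6):
    """Average the class distributions of the selected proxies (None if none qualify)."""
    chosen = select_proxies(proxy_probs, label, min_conf)
    if not chosen:
        return None
    n_classes = len(next(iter(proxy_probs.values())))
    return [sum(proxy_probs[name][c] for name in chosen) / len(chosen)
            for c in range(n_classes)]

# Hypothetical class probabilities from four proxies for one training sample.
proxy_probs = {
    "proxy_a": [0.8, 0.1, 0.1],   # correct and confident
    "proxy_b": [0.6, 0.2, 0.2],   # correct and confident
    "proxy_c": [0.5, 0.3, 0.2],   # correct but below the threshold
    "proxy_d": [0.2, 0.7, 0.1],   # confident but wrong
}
chosen = select_proxies(proxy_probs, label=0)
target = distillation_target(proxy_probs, label=0)
```

Here only `proxy_a` and `proxy_b` contribute to the target, so the wrong and the unsure proxies supply no supervision for this sample.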
eess.IV [Back]
[95] Full-Self Diagnostics (FSD): Physics-Grounded Visual Biomarker Inference from Smartphone Video via Inverse Problems and Operator Learning eess.IV | cs.CV | cs.LGPDF
Jonathan Thomas, Harsh Thaker
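TL;DR: This paper presents Full-Self Diagnostics (FSD), a unified mathematical framework for recovering latent physiological states from unconstrained 9-second facial videos captured by consumer smartphones. The approach integrates five mutually reinforcing components: a physics-based forward model, an information-theoretic observability theory, a regularized inverse problem, operator learning, and supervised learning. Empirical validation indicates that the framework can infer clinically relevant biomarkers such as blood glucose from ordinary video, with performance improving predictably as the amount of paired data grows.
Details
Motivation: The paper addresses the challenge of non-invasively and reliably inferring physiological biomarkers (such as blood glucose) from unconstrained facial video captured by consumer smartphones, with the goal of convenient, low-cost health monitoring.
Result: Validation uses 38,812 real-world paired scans across 59 subjects. On self-collected data from the lead author (glucose range 35-550 mg/dL), the mean absolute relative difference (MARD) is 29.86%, with 97.57% of predictions in Clarke Error Grid Zones A+B and only 0.27% in the dangerous Zone E. A well-managed diabetic participant reaches a MARD of 17% in the narrower 70-180 mg/dL band.
Insight: The novelty lies in combining a physical model (the radiative transfer equation and chromophore absorption) with information theory, inverse-problem theory, operator learning, and supervised learning (interpretable as stochastic variational inference) into a unified framework with theoretical guarantees such as identifiability. The framework is formulated to generalize across devices, resolutions, and populations, and the authors report performance improving in proportion to one over the square root of the number of paired observations.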
Abstract: We present Full-Self Diagnostics (FSD), a unified mathematical framework for recovering latent physiological states from unconstrained 9-second facial videos captured by consumer smartphones. The approach integrates five mutually reinforcing components: (1) a physics-based forward model derived from the radiative transfer equation and chromophore absorption that maps camera observables to biomarker concentrations; (2) an information-theoretic observability theory proving that multi-channel visual signals (spectral, pulse, respiratory, micro-expression, and oculomotor) contain strictly increasing mutual information with physiological state; (3) a stable, Tikhonov-regularized inverse problem with domain-uniform identifiability guarantees; (4) an operator-learning formulation that enables generalization across devices, resolutions, and populations; and (5) a supervised learning procedure, interpretable as stochastic variational inference, that continuously refines the model from paired biosensor ground truth with performance improving proportionally to one over the square root of the number of paired observations. Empirical validation on 38812 real-world paired scans across 59 subjects demonstrates practical performance. Self-collected data from the lead author (glucose range 35-550 mg/dL) yields MARD of 29.86 percent with 97.57 percent of predictions in Clarke Error Grid Zones A+B and only 0.27 percent in the dangerous Zone E. A well-managed diabetic participant achieves MARD of 17 percent in the narrower 70-180 mg/dL band. These results confirm that consumer-grade facial video encodes sufficient structured information for clinically relevant, non-invasive biomarker inference under fully unconstrained conditions, with performance scaling predictably as more paired data becomes available.
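For readers unfamiliar with the headline metric, MARD (mean absolute relative difference) and the stated one-over-square-root scaling can be written out as a short sketch. The glucose values are made up for illustration, and reading the scaling claim as "error shrinks like 1/sqrt(N)" is an interpretation of the abstract rather than something it spells out.

```python
import math

def mard(predicted, reference):
    """Mean absolute relative difference, in percent, against reference readings."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - r) / r for p, r in zip(predicted, reference))
    return 100.0 * total / len(reference)

def error_scale(n_before, n_after):
    """Relative error after growing the paired data, assuming error ~ 1/sqrt(N)."""
    return math.sqrt(n_before / n_after)

# Illustrative glucose values in mg/dL (not data from the paper).
reference = [100.0, 100.0, 160.0]
predicted = [110.0, 90.0, 200.0]
score = mard(predicted, reference)      # relative errors 10%, 10%, 25%
shrink = error_scale(100, 400)          # quadrupling the data halves the error
```

Under the assumed scaling, moving from 100 to 400 paired observations would halve the error, so each further halving needs four times as much paired data.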
[96] FrequencyFormer: A Co-Designed Sensor-to-Processor Pipeline for Frequency-Domain Vision Transformer Inference eess.IV | cs.CVPDF
Chengwei Zhou, Ovishake Sen, Xuming Chen, Rishith Paramasivam, Shaahin Angizi
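TL;DR: This paper presents FrequencyFormer, a co-designed sensor-to-processor pipeline for efficiently deploying vision transformers (ViTs) on sensor-edge systems. A multi-scale DCT tokenizer compresses images into compact frequency-domain tokens; combined with a lookup-table (LUT) based near-sensor hardware implementation and a modified MIPI communication architecture, it substantially reduces data transfer and energy consumption.
Details
Motivation: Deploying ViTs on sensor-edge systems is limited both by on-device compute and by the energy and bandwidth needed to move high-dimensional image data from the sensor to the processor. Existing near-sensor computing methods offer only modest compression, whereas the frequency domain gives a naturally compact representation of visual information that can be exploited at the sensor level to reduce data movement.
Result: Across classification, detection, and segmentation tasks, the method serves as a drop-in replacement for standard ViT patch embedding and stays compatible with pretrained backbones. The pipeline achieves an energy efficiency of 28.8 TOPS/W, cuts communication energy by 230x, and lowers total sensor-side energy by 2.22x.
Insight: The contributions are a multi-scale DCT tokenizer that uses the compactness of the frequency domain for sensor-level compression; a multiplier-free, energy-efficient LUT hardware implementation built on fixed DCT coefficients; and a modified low-power MIPI communication architecture. Together they provide a scalable frequency-domain tokenization foundation for in-sensor ViT deployment.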
Abstract: Deploying vision transformers (ViTs) on sensor-edge systems is limited not only by on-device compute, but also by the energy and bandwidth required to transmit high-dimensional image data from the sensor to the processor. While in-sensor and near-sensor computing reduce this cost through early feature extraction, existing methods often provide only modest compression. We observe that the frequency domain provides a naturally compact representation of visual information and can be exploited at the sensor level to reduce sensor-to-processor data movement. Building on this insight, we present FrequencyFormer, a co-designed sensor-to-processor pipeline for efficient ViT inference. FrequencyFormer includes: (1) a multi-scale DCT tokenizer that compresses a 224x224 image into compact frequency-domain tokens, achieving up to 128x reduction in off-chip data volume with modest accuracy loss; (2) a LUT-based near-sensor hardware implementation that leverages fixed DCT coefficients for multiplier-free, energy- and area-efficient tokenization; and (3) a modified MIPI-based low-power communication architecture that further reduces transfer energy. FrequencyFormer serves as a drop-in replacement for standard ViT patch embedding and remains compatible with pretrained backbones across classification, detection, and segmentation tasks. The pipeline achieves 28.8 TOPS/W, reduces communication energy by 230x, and lowers total sensor-side energy by 2.22x, demonstrating frequency-domain tokenization as a scalable foundation for in-sensor ViT deployment.
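The idea of frequency-domain tokenization can be illustrated with a single-scale sketch: split the image into patches, apply a 2D DCT to each, and keep only the low-frequency corner as the token. This is not the paper's multi-scale tokenizer or its LUT hardware path; the function names, the 16-pixel patch size, and the 4x4 coefficient budget are assumptions chosen for illustration.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix; row k is frequency k."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)  # DC row uses a different normalization
    return mat


def dct_tokens(image, patch=16, keep=4):
    """Turn a (H, W) grayscale image into low-frequency DCT tokens.

    Returns an array of shape (num_patches, keep * keep).
    """
    h, w = image.shape
    c = dct_matrix(patch)
    # Non-overlapping patches, shape (H/patch, W/patch, patch, patch).
    blocks = image.reshape(h // patch, patch, w // patch, patch).swapaxes(1, 2)
    coeffs = c @ blocks @ c.T          # 2D DCT applied to every patch
    low = coeffs[..., :keep, :keep]    # keep only the low-frequency corner
    return low.reshape(-1, keep * keep)


rng = np.random.default_rng(0)
image = rng.random((224, 224))
tokens = dct_tokens(image)
print(tokens.shape)  # (196, 16)
```

With these assumed settings each 16x16 patch (256 values) becomes 16 coefficients, a 16x reduction before any quantization. The DCT basis is fixed rather than learned, which is the property the abstract credits for making a multiplier-free LUT implementation possible.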
cs.AI [Back]
[97] DeXposure-Claw: An Agentic System for DeFi Risk Supervision cs.AI | cs.CL | cs.LG | q-fin.RMPDF
Aijie Shu, Bowei Chen, Wenbin Wu, Cathy Yi-Hsuan Chen, Fengxiang He
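TL;DR: DeXposure-Claw is an agentic system for decentralized finance (DeFi) risk supervision that routes LLM decisions through structured evidence. A graph time-series foundation model forecasts future exposure networks; deterministic monitors and stress scenarios turn those forecasts into alerts and evidence; data-health and confidence gates constrain escalation; and the system finally emits auditable supervisory tickets.
Details
Motivation: Fast-moving, networked credit risk in decentralized finance is hard for supervisors to track. General-purpose LLM agents fit this setting poorly: they over-read weak evidence and recommend high-stakes interventions, and existing evaluations offer no regulator-aligned way to measure the resulting false alarms.
Result: Experiments on five years of weekly real-world data are reported to support the system. The paper also introduces DeXposure-Bench, a six-axis evaluation harness whose decision axis scores tickets against a regulator-aligned absolute-loss ground truth and an explicit false-intervention rate.
Insight: The core contribution is a forecast-grounded, evidence-structured architecture for agentic supervision. It ties LLM decisions to the forecasts of a graph time-series model, deterministic monitoring rules, and explicit confidence gates, which improves the reliability and auditability of decisions, and it adds supervision-specific evaluation metrics that quantify the false-alarm rate.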
Abstract: Decentralized finance exposes supervisors to fast-moving, networked credit risks. General-purpose LLM agents fit this setting poorly: they over-read weak evidence and recommend high-stakes interventions, while existing evaluations offer no regulator-aligned way to measure the resulting false alarms. We introduce DeXposure-Claw, a forecast-grounded agentic supervision system that routes LLM decisions through structured evidence: (1) DeXposure-FM, a graph time-series foundation model, forecasts future exposure networks; (2) deterministic monitors and stress scenarios then turn those forecasts into typed alerts, attribution signals, and scenario evidence; and (3) data-health and confidence gates constrain escalation before DeXposure-Claw emits auditable supervisory tickets with rationales. We further develop DeXposure-Bench, a six-axis evaluation harness, whose decision axis scores tickets against a regulator-aligned absolute-loss ground truth and an explicit false-intervention rate. Experiments on five years of weekly real data fully support our system. Code is at https://github.com/EVIEHub/DeXposure-Claw.
[98] Diffusion Language Models: An Experimental Analysis cs.AI | cs.CLPDF
Thomas Bertolani, Davide Bucciarelli, Leonardo Zini, Marcella Cornia, Lorenzo Baraldi
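TL;DR: This paper presents a systematic experimental analysis of diffusion language models (DLMs). It evaluates eight state-of-the-art DLMs on eight benchmarks covering reasoning, coding, translation, knowledge, and structured problem solving, considering both generation quality and computational efficiency. It also analyzes key inference-time factors, including denoising steps, context length, block size, and parallel unmasking strategies, and identifies the strengths and limitations of DLMs across tasks, architectures, and inference budgets.
Details
Motivation: Diffusion language models are an alternative to autoregressive generation that produce text by iterative denoising. Differences in evaluation protocols, datasets, inference budgets, and generation hyperparameters across prior work make it hard to compare their capabilities and understand their trade-offs. The paper analyzes modern DLMs under a systematic experimental setup to give practical insight into their capabilities and deployment characteristics.
Result: Eight state-of-the-art DLMs are evaluated on eight benchmarks (reasoning, coding, translation, knowledge, and structured problem solving). The results show that DLM behavior is strongly influenced by generation-time design choices, which leads to distinct trade-offs between performance and computational efficiency.
Insight: The paper offers a comprehensive, systematic experimental analysis of modern DLMs. It highlights how strongly generation-time design (denoising steps, context length, and so on) affects model performance, and it complements the large-scale experiments with controlled comparisons of smaller models trained under identical conditions, giving practical guidance for deploying DLMs.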
Abstract: Large Language Models (LLMs) have revolutionized language modeling through autoregressive generation, enabling strong performance across a wide range of tasks. Recently, Diffusion Language Models (DLMs) have emerged as an alternative paradigm that generates text through iterative denoising rather than next-token prediction, allowing parallel refinement of entire sequences. While numerous diffusion-based architectures have been proposed, differences in evaluation protocols, datasets, inference budgets, and generation hyperparameters make it difficult to compare their capabilities and understand the trade-offs they offer. In this work, we present a systematic experimental analysis of modern DLMs. Specifically, we evaluate eight state-of-the-art DLMs across eight benchmarks spanning reasoning, coding, translation, knowledge, and structured problem solving, while explicitly considering both generation quality and computational efficiency. Beyond downstream evaluation, we analyze the impact of key inference-time factors, including denoising steps, context length, block size, and parallel unmasking strategies, and complement large-scale experiments with controlled comparisons of smaller models trained under identical conditions. Our analysis highlights the strengths and limitations of diffusion-based language modeling across different tasks, architectures, and inference budgets. We show that the behavior of DLMs is strongly influenced by generation-time design choices, leading to distinct trade-offs between performance and computational efficiency. Overall, our study provides practical insights into the capabilities and deployment characteristics of contemporary DLMs.
[99] CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models cs.AI | cs.CLPDF
Yuxu Zhou, Ondřej Kuželka, Yuyi Wang, Yuanhong Wang, Yi Chang
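TL;DR: This paper presents CombEval, a dynamic benchmark framework for evaluating combinatorial counting in large language models. Problems are defined as typed Cofola specifications, which enables controlled generation of natural-language counting problems with exact, solver-verified answers and systematic variation of object type, entity scale, constraint count, and reasoning depth. The authors evaluate 11 LLMs and find that models remain brittle on ordered objects, indistinguishable elements, relative positional constraints, and nested object dependencies; error analysis further reveals failures in constraint interpretation and counting principles.
Details
Motivation: Existing static benchmark collections cannot systematically evaluate LLMs on combinatorial counting, a core reasoning task, and in particular cannot expose how models behave and fail as problem complexity (object type, constraints, scale) varies. A new controllable, dynamic benchmark is needed to diagnose when and why LLMs fail at combinatorial reasoning.
Result: Eleven LLMs are evaluated under direct and code-augmented settings. Models prove brittle on ordered objects, indistinguishable elements, relative positional constraints, and nested object dependencies, and do not reach reliable performance. The code and the generated benchmark suites are publicly available.
Insight: The main contribution is a dynamic benchmark generation framework built on a formal specification (Cofola) that systematically controls the combinatorial structure and complexity of problems, enabling fine-grained, interpretable evaluation and diagnosis of LLM combinatorial reasoning. This provides new methodology and tooling for understanding the limits of LLMs on this capability.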
Abstract: We present CombEval, a dynamic benchmark for evaluating combinatorial counting in large language models. CombEval represents each problem as a typed Cofola specification over entities, combinatorial objects, object dependencies, and constraints, enabling controlled generation of natural-language counting problems with exact solver-verified answers. Unlike static collections, CombEval supports systematic variation of object type, entity scale, constraint count, and reasoning depth. We evaluate 11 LLMs under direct and code-augmented settings and find that models remain brittle on ordered objects, indistinguishable elements, relative positional constraints, and nested object dependencies. Error analysis further identifies failures in constraint interpretation and counting principles. CombEval provides a diagnostic testbed for studying when and why LLMs fail at combinatorial reasoning. The code and generated benchmark suites are publicly available at https://github.com/YuxuZhou-CN/combination-problem-generation.
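To make "solver-verified answers" concrete, the sketch below checks two small counting problems by exhaustive enumeration, one with indistinguishable elements and one with a relative positional constraint. It does not use Cofola or the CombEval solver; the helper name and both example problems are assumptions for illustration.

```python
from itertools import permutations
from math import factorial


def count_arrangements(items, constraint=lambda seq: True):
    """Count distinct orderings of `items` that satisfy `constraint`.

    Wrapping the permutations in set() merges orderings that differ only
    by swapping indistinguishable items.
    """
    return sum(1 for seq in set(permutations(items)) if constraint(seq))


# Indistinguishable elements: distinct arrangements of the letters A, A, B, B, C.
multiset_count = count_arrangements("AABBC")
closed_form = factorial(5) // (factorial(2) * factorial(2))

# Relative positional constraint: five distinct people in a row,
# with A somewhere to the left of B.
left_of_count = count_arrangements(
    "ABCDE", lambda seq: seq.index("A") < seq.index("B")
)

print(multiset_count, closed_form, left_of_count)  # 30 30 60
```

Brute force only scales to small instances, but it gives an independent reference answer against which a model's closed-form reasoning can be checked.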
[100] Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning cs.AI | cs.CLPDF
Sajib Acharjee Dip, Dawei Zhou, Liqing Zhang
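TL;DR: This paper proposes SEVRA (Selective Verification for Reasoning Allocation), a serving-layer controller that decides, under a compute budget, whether to keep a frozen solver's initial answer or invoke verification. It trains recoverability-aware gates and, on MATH-5 and in frozen transfer to GSM8K, maintains or improves accuracy while substantially reducing verification compute; on CommonsenseQA, always-on verification hurts accuracy.
Details
Motivation: Extra test-time reasoning is not uniformly valuable: verification can repair wrong answers, waste compute on problems that are already correct, or introduce harmful answer changes. The paper targets this allocation problem to improve resource efficiency at deployment time.
Result: On MATH-5, selective verification reaches 76.3% accuracy (versus 75.5% for always verifying) while cutting post-generation tokens by 26.8% and harmful flips from 2.2% to 1.0%. In frozen transfer to GSM8K, it verifies only 3.0% of examples, raises accuracy from 93.4% to 94.5%, and uses 91.2% fewer verification tokens than always verifying. However, a longer initial solve reaches comparable accuracy with fewer total tokens.
Insight: The paper reframes verification as a deployment allocation problem rather than a new-verifier design problem, using a serving-layer controller to choose the verification strategy dynamically. Its practical recommendation for budget-aware reasoning is to tune the initial solve budget first, then add selective recovery when explicit checks, bounded retries, auditability, or regression-risk control matter.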
Abstract: Test-time reasoning is increasingly used as a serving-time control knob, but extra reasoning is not uniformly valuable: it can repair failed attempts, waste compute on already-correct answers, or introduce harmful answer changes. We study this as a deployment allocation problem rather than a new-verifier problem. We introduce SEVRA, Selective Verification for Reasoning Allocation, a serving-layer controller that decides whether to preserve a frozen solver’s initial answer or invoke active verification. Using a frozen Qwen3-4B solver, we log intervention outcomes and train recoverability-aware gates from serving-visible attempt state. On MATH-5, selective verification reaches 76.3% accuracy, compared with 75.5% for always verifying, while reducing post-generation tokens by 26.8% and harmful flips from 2.2% to 1.0%. However, an 8,192-token initial solve reaches 76.0% accuracy with 28% fewer total model tokens, showing that selective recovery is useful but not the best tested cost frontier. In frozen transfer to GSM8K, the selective policy verifies only 3.0% of examples, improves accuracy from 93.4% to 94.5%, and reduces verification tokens by 91.2% relative to always verifying; again, a longer initial solve matches its accuracy with fewer realized tokens. On CommonsenseQA, always-on verification hurts, while Self-Consistency@5 improves accuracy at about five times the realized token cost. The resulting deployment rule is: tune the initial budget first, then use selective recovery when explicit checks, bounded retries, auditability, or regression-risk control matter.
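The trade-off described here (verification can repair, waste compute, or harm) can be replayed on logged outcomes. The sketch below scores a threshold policy over such a log; the record fields, the gate score, and the four toy records are hypothetical and are not SEVRA's actual features, gates, or data.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    initial_correct: bool    # was the solver's first answer right?
    verified_correct: bool   # would the answer be right after verification?
    gate_score: float        # hypothetical estimate that verification helps
    verify_tokens: int       # tokens verification would cost


def evaluate(attempts, threshold):
    """Replay a 'verify when gate_score >= threshold' policy over a log."""
    n = len(attempts)
    correct = verified = flips = tokens = 0
    for a in attempts:
        if a.gate_score >= threshold:
            verified += 1
            tokens += a.verify_tokens
            correct += a.verified_correct
            # Harmful flip: verification turned a right answer into a wrong one.
            flips += a.initial_correct and not a.verified_correct
        else:
            correct += a.initial_correct
    return {
        "accuracy": correct / n,
        "verify_rate": verified / n,
        "harmful_flip_rate": flips / n,
        "verify_tokens": tokens,
    }


# Toy log (illustrative only).
log = [
    Attempt(True, True, 0.1, 400),    # already correct: verification wastes tokens
    Attempt(True, False, 0.2, 500),   # verification would break a correct answer
    Attempt(False, True, 0.9, 600),   # recoverable failure
    Attempt(False, False, 0.3, 700),  # unrecoverable failure
]

always = evaluate(log, threshold=0.0)
selective = evaluate(log, threshold=0.5)
print(always["accuracy"], selective["accuracy"])  # 0.5 0.75
```

In this toy log the selective policy verifies one attempt in four, avoids the harmful flip, and spends 600 verification tokens instead of 2,200. A comparison against simply giving the solver a longer initial budget, which the abstract finds competitive, would need a separate log.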
[101] GLARE: A Natural Language Interface for Querying Global Explanations cs.AI | cs.CVPDF
Bhavan Vasu, Rajesh Mangannavar
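TL;DR: This paper presents GLARE, an LLM-based natural language interface for querying global explanations of image classifiers. The LLM acts as a mediator that translates users' natural language questions into structured SQL queries over local explanation data, producing statistics-augmented natural language answers and visualizations, with the aim of making global explanations more accessible and useful.
Details
Motivation: Global explanations are crucial for understanding vision models, but their complex, monolithic nature hinders practical exploration. Users usually want answers to specific questions rather than static explanation artifacts, so an interactive interface offering flexible natural language access is needed.
Result: The system is evaluated on intent interpretation, query mapping accuracy, generalization to novel queries and datasets, and robustness to linguistic errors. The authors report that LLM-mediated querying substantially improves the accessibility and usability of global explanations for human-centered explainable AI (XAI).
Insight: The key idea is to use the LLM as a mediator that dynamically converts natural language questions into SQL queries over the underlying local explanation data. This allows flexible aggregation without exposing users to low-level representations and provides intent-aligned visual output, offering a new design direction for interactive XAI systems.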
Abstract: While global explanations are crucial for understanding vision models across datasets, classes, and decision contexts, their complex and monolithic nature often hinders practical exploration. Because users typically seek targeted answers to specific questions rather than static artifacts, we present an LLM-based interactive interface that provides natural language access to global explanations for black-box image classifiers. The system’s core LLM acts as a mediator, translating natural language questions into structured SQL queries over local explanation data. This enables flexible aggregation without exposing users to low-level representations. For each query, the interface outputs statistics-augmented natural language responses, supporting local explanations, and intent-aligned visualizations. We evaluate the system on intent interpretation, query mapping accuracy, generalization to novel queries and datasets, and robustness to linguistic errors. Our results demonstrate that LLM-mediated querying substantially improves the accessibility and usability of global explanations for human-centered XAI.
cs.IR [Back]
[102] SAFE-Cascade: Cost-Adaptive Vision-Language Routing for Chart Question Answering cs.IR | cs.CVPDF
Ayush Dwivedi, Qixin Wang, Ashvi Soni, Ruoteng Wang, Han Li
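TL;DR: SAFE-Cascade is an interactive, cost-adaptive vision-language routing system for chart question answering. It first extracts chart text with OCR and obtains a provisional answer from a text-only language model, then uses a learned router to decide whether to accept that answer or escalate to a more expensive vision-language model (VLM), lowering compute cost while preserving performance.
Details
Motivation: Invoking an expensive VLM for every chart question can be unnecessary overhead, because many questions can be answered from OCR text and lightweight language reasoning alone.
Result: On a held-out ChartQA test split of 375 examples, SAFE-Cascade achieves 69.1% unified accuracy while invoking the VLM on only 73.1% of queries. Compared with the full-VLM baseline (67.7% accuracy, 100% VLM invocation), it matches performance while reducing VLM calls by 26.9% and estimated cost by 9.3%; the authors treat the +1.4 percentage-point accuracy difference as statistically uncertain.
Insight: The contribution is a cascaded routing framework in which a learned router makes dynamic, cost-adaptive decisions between the text-only path and the VLM path, together with a transparent interactive interface that lets users inspect and adjust the accuracy-cost trade-off.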
Abstract: Vision-language models (VLMs) are powerful for chart question answering, but invoking a VLM for every query can be unnecessarily expensive when many questions are answerable from OCR text and lightweight language reasoning. We demonstrate SAFE-Cascade, an interactive system for cost-adaptive chart question answering. Given a chart image and a natural-language question, SAFE-Cascade first extracts chart text with OCR, obtains a provisional answer from a text-only language model, and then uses a learned router to decide whether to accept the text answer or escalate to a VLM. The demo exposes this decision process to users: OCR evidence, text-only answer, routing probability, escalation decision, final answer, estimated cost, and estimated latency are shown side by side. SAFE-Cascade is designed as a transparent interface for understanding when visual grounding is actually needed. Users can upload or select charts, ask questions, inspect the evidence used by each pathway, compare text-only and VLM answers, and adjust the escalation threshold to explore the accuracy-cost frontier. The system is implemented with Azure Document Intelligence for OCR, gpt-5-mini as the text-only model, gemini-2.5-flash-image as the VLM, and a Random Forest router trained on inference-time features. On a held-out ChartQA test split of 375 examples from a 2,500-example experiment, SAFE-Cascade achieves 69.1% unified accuracy with 73.1% VLM invocation, compared with 67.7% accuracy and 100% VLM invocation for the full-VLM baseline. The observed +1.4 percentage-point difference is statistically uncertain, so we interpret SAFE-Cascade as matching full-VLM performance while reducing VLM calls by 26.9% and estimated cost by 9.3%. The demonstration shows how selective modality routing can make multimodal knowledge systems more transparent, tunable, and cost-aware.
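The control flow described in the abstract (OCR, text-only answer, router, optional escalation) can be sketched with plain callables. The stub components below stand in for the OCR service, the text model, the Random Forest router, and the VLM; their names, signatures, and outputs are assumptions for illustration, not the system's real interfaces.

```python
def cascade_answer(image, question, ocr, text_model, router, vlm, threshold=0.5):
    """Answer from OCR text first; escalate to the VLM only when the router asks."""
    ocr_text = ocr(image)
    text_answer = text_model(question, ocr_text)
    p_escalate = router(question, ocr_text, text_answer)
    escalate = p_escalate >= threshold
    final = vlm(image, question) if escalate else text_answer
    return {
        "ocr_text": ocr_text,
        "text_answer": text_answer,
        "p_escalate": p_escalate,
        "used_vlm": escalate,
        "answer": final,
    }


# Hypothetical stand-ins for the real components.
def stub_ocr(image):
    return "Sales 2021: 40, 2022: 55"


def stub_text_model(question, ocr_text):
    return "55"


def stub_router(question, ocr_text, text_answer):
    # Low escalation probability when the answer is literally present in the OCR text.
    return 0.2 if text_answer in ocr_text else 0.9


def stub_vlm(image, question):
    return "vlm-answer"


components = (stub_ocr, stub_text_model, stub_router, stub_vlm)
cheap = cascade_answer("chart.png", "What were sales in 2022?", *components)
strict = cascade_answer("chart.png", "What were sales in 2022?", *components,
                        threshold=0.1)
print(cheap["used_vlm"], strict["used_vlm"])  # False True
```

Lowering `threshold` sends more queries to the VLM, which is the knob the demo exposes for exploring the accuracy-cost frontier. The reported 26.9% reduction in VLM calls is simply 100% minus the 73.1% invocation rate.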
cs.RO [Back]
[103] 3D Scene Graphs: Open Challenges and Future Directions cs.RO | cs.CVPDF
Dennis Rotondi, Francesco Argenziano, Sebastian Koch, Nathan Hughes, Martin Buechner
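TL;DR: This survey of 3D scene graphs (3DSGs) aims to unify the field and critically review its current state, with particular emphasis on open challenges and future directions. It first formalizes 3DSGs, analyzes the main modeling choices in existing formulations, then reviews how 3DSGs are built from raw sensory data, and finally examines downstream applications and evaluation strategies.
Details
Motivation: 3D scene graphs combine geometric grounding with semantic and relational abstraction, making them a powerful spatial AI representation that is widely used in robotics and computer vision. The field is fragmented, however: different communities adopt different formulations, construction pipelines, and evaluation protocols, which makes it hard to compare methods and assess remaining challenges.
Result: As a survey, the paper proposes no new method and therefore reports no quantitative experimental results. It systematically reviews existing work and establishes a unified framework for analyzing and comparing approaches.
Insight: The main contribution is a unified formal definition and taxonomy that brings together fragmented 3DSG research and spells out open challenges and future directions, giving the community a clear roadmap. The authors also maintain a dedicated website that organizes and extends the surveyed content.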
Abstract: 3D Scene Graphs (3DSGs) have emerged as a powerful representation for spatial AI by combining geometric grounding with semantic and relational abstractions of the environment. Their expressiveness has made them relevant to a broad range of problems in robotics and computer vision, including manipulation, navigation, task planning, scene understanding, and many others. However, the field remains fragmented: different communities adopt distinct formulations, construction pipelines, and evaluation protocols, making it difficult to compare methods, identify common assumptions, and assess remaining challenges for robust real-world deployment. This survey provides a unified and critical review of 3DSGs, with particular emphasis on open challenges and future directions. We first formalize 3DSGs under a common definition and analyze the principal modeling choices that characterize existing formulations, including node and edge attributes, hierarchical structure, dynamic scene representations, and affordance-aware extensions. We then review how 3DSGs are built from raw sensory observations, discussing the most common terminologies, conventions, and techniques. Finally, we examine downstream applications and evaluation strategies, from intrinsic graph quality to task-level performance. To support the community, we also provide a dedicated website that organizes and extends the surveyed content, accessible at https://3dscenegraphs.com/.
[104] Scaling Self-Play for End-to-End Driving cs.RO | cs.CVPDF
Luke Rowe, Roger Girgis, Rodrigue de Schaetzen, Daphne Cornelisse, Alaap Grandhi
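TL;DR: This paper proposes training end-to-end autonomous driving models through large-scale self-play, to overcome the limits of training on human demonstration data. It performs self-play from pixel observations in Gigapixel, a high-throughput batched driving simulator, distills policies from a privileged reinforcement learning teacher, and then transfers them to real-world sensor data through lightweight perception adaptation.
Details
Motivation: Conventional end-to-end driving models rely on offline human-demonstration datasets that offer limited state coverage and no closed-loop feedback, so they are prone to compounding errors at deployment and brittle in long-tail interaction scenarios.
Result: Policies trained in Gigapixel and adapted to real sensor data achieve competitive performance on the HUGSIM and NAVSIM-v2 benchmarks without human trajectory supervision, and scaling self-play training yields proportional gains in policy performance.
Insight: Contributions include a large-scale pixel-level self-play training framework; Gigapixel, a high-throughput simulator with simplified rendering that balances efficiency against scene structure; and self-play DAgger training with a privileged RL teacher to address sample inefficiency. The core idea is scalable end-to-end model training through simulated self-play, with evidence that performance improves as training scale grows.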
Abstract: End-to-end autonomous driving models are typically trained on offline human-demonstration datasets that provide limited state coverage and often no closed-loop feedback, making them prone to compounding errors when deployed in closed-loop and brittle to long-tail agent interactions. To overcome these limitations, we propose an alternative strategy for training end-to-end driving models: large-scale self-play directly from pixels in simulation. While prior self-play approaches have shown promising transfer to real-world driving, they typically assume vectorized Bird’s-Eye-View (BEV) observations that are incompatible with end-to-end policies operating directly on sensor observations. To this end, we introduce Gigapixel, a high-throughput batched driving simulator with perspective rendering, enabling scalable self-play directly from pixel observations. Rather than targeting compute-costly photorealistic sensor simulation, Gigapixel renders a simplified bounding-box world that preserves essential scene structure while achieving throughput at 50k agent steps per second. Since direct pixel-space self-play RL is prohibitively sample-inefficient at end-to-end model scale, we propose self-play DAgger training: we train pixel-based policies in self-play via on-policy distillation from a privileged RL teacher. To bridge the sim-to-real gap, we subsequently transfer the self-play trained policies to real-world sensor data through lightweight perception adaptation. Policies trained in Gigapixel and adapted to real-world sensor data achieve competitive performance on the HUGSIM and NAVSIM-v2 benchmarks without human trajectory supervision. Moreover, scaling self-play training yields proportional gains in policy performance, establishing self-play as a practical and scalable strategy for training end-to-end models.
[105] World Engine: Towards the Era of Post-Training for Autonomous Driving cs.RO | cs.CVPDF
Tianyu Li, Li Chen, Caojun Wang, Haochen Liu, Kashyap Chitta
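TL;DR: This paper presents World Engine, a generative framework that addresses the limited reliability of autonomous driving policies in safety-critical long-tail events. It reconstructs high-fidelity interactive environments from real driving logs and systematically extrapolates them into safety-critical scenario variations, enabling reinforcement-based post-training that improves policy safety.
Details
Motivation: Autonomous vehicles must operate safely in the real world. Current end-to-end driving policies perform well in routine scenarios but are limited by the scarcity of safety-critical long-tail events in real datasets. These rare events define the practical safety boundary of a learned policy, yet they are hard to collect at scale in the real world.
Result: On a public benchmark built on nuPlan, World Engine substantially reduces failures in rare safety-critical scenarios and yields larger gains than scaling pre-training data alone. When deployed on a production-scale autonomous driving system, the resulting policy reduces simulated collisions and shows measurable improvements in on-road testing.
Insight: The paper proposes a scalable post-training paradigm based on synthesized safety-critical interactions. A generative framework systematically creates high-fidelity, high-stakes scenario variations, which avoids the physical risks of real-world exploration and extends the safety boundary of driving policies.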
Abstract: Autonomous vehicles must operate safely in the real world, where errors can have severe consequences. Although modern end-to-end driving policies excel in routine scenarios, their reliability is limited by the scarcity of safety-critical "long-tail" events in real driving datasets. These rare interactions define the practical safety boundary of the learned policy, yet they are difficult to collect at scale in the real world. Here we show that this fundamental limitation can be addressed by post-training pre-trained driving models on synthesized high-stakes interactions. We introduce World Engine, a generative framework that reconstructs high-fidelity interactive environments from real-world logs and systematically extrapolates them into realistic safety-critical variations. This paradigm enables reinforcement-based post-training to align policies with safety constraints, circumventing the physical risks inherent in real-world exploration. On a public benchmark built on nuPlan, World Engine substantially reduces failures in rare safety-critical scenarios and yields significantly larger gains than scaling pre-training data alone. Furthermore, when deployed on a production-scale autonomous driving system, the resulting policy reduces simulated collisions and demonstrates measurable improvements in on-road testing, showing that post-training on synthesized, safety-critical interactions offers a scalable and effective pathway to safer autonomous driving. The full codebase suite, including training, is released to the public.
[106] Efficiently Linking Real Scenes with Synthetic Data Generation for AI-based Cognitive Robotics and Computer Vision Applications cs.RO | cs.CVPDF
Paul Koch, Vivek Chavan, André Sers, Adem Karakurt, Paul Hofmann
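TL;DR: This paper discusses the current limitations of and trends in AI vision models for cognitive robotics and computer vision applications, focusing on how training data generation can bridge the domain gap between simulation and real-world applications.
Details
Motivation: AI vision models for cognitive robotics face insufficient training data, precision limits, and poor scalability across domain gaps, in particular the data gap between simulation and the real world.
Result: The paper reports no concrete experimental results; it discusses the limitations of current state-of-the-art methods and the authors' work in progress.
Insight: Linking simulation and real-world data at the training data generation stage is proposed as a more efficient way to bridge the domain gap, offering a new direction for generating training data for AI vision models.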
Abstract: AI vision models are a driving factor for the potential use cases of cognitive robotics in industrial and household applications. A large array of methods, from semantic environment analysis to 6D and grasping pose estimation, has been proposed based on the latest AI achievements. However, such advancements require further strong and efficient methods with respect to training data and AI architectures that can work in synergy to tackle current challenges, precision limits, and scalability beyond domain gaps. In this paper, we discuss these current limits and the trends in the related state of the art that are challenging them. Further, we discuss our current work in progress on bridging the domain gap between simulations and real-world applications by linking the two in training data generation.
[107] Tri-Info: Generalizable, Interpretable Failure Prediction for VLA Models via Information Theory cs.RO | cs.AI | cs.CV | cs.LGPDF
Jinghan Yang, Yunchao Zhang, Wang Yuan, Haolun Wan, Jiaming Zhang
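TL;DR: This paper proposes Tri-Info, a generalizable and interpretable information-theoretic method for predicting failures of vision-language-action (VLA) models. It formalizes VLA control as a closed-loop information pipeline and derives three information-theoretic signals that capture whether actions remain diverse, temporally consistent, and coupled to state transitions. Experiments show strong cross-domain generalization across multiple VLA models and benchmark environments, reaching 83% accuracy on real-world tasks without retraining.
Details
Motivation: VLA models are increasingly deployed across diverse tasks, but their physical interactions can cause irreversible harm and the models themselves are black boxes, so a generalizable and interpretable failure detection method is needed.
Result: Across six VLA models and three benchmark environments, Tri-Info matches the strongest baselines in-domain. More importantly, it transfers across architectures, environments, and the sim-to-real gap without retraining, reaching 83% accuracy on real-world tasks where prior detectors collapse to chance.
Insight: The core contribution is to formalize VLA control from an information-theoretic perspective and extract three interpretable signals as failure predictors. This yields strong cross-domain generalization as well as interpretable diagnostics of the underlying failure modes, giving a new angle on reliability assessment for black-box models.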
Abstract: Vision-Language-Action (VLA) models are increasingly deployed across diverse tasks, yet they remain black boxes whose physical interactions can cause irreversible harm, making generalizable and interpretable failure detection essential. We observe that successful and failed rollouts carry systematically different information-theoretic signatures. Building on this, we formalize VLA control as a closed-loop information pipeline and derive the Triple Information-theoretic (Tri-Info) signals that capture whether actions remain diverse, temporally consistent, and coupled to state transitions. Across six VLA models and three benchmark environments, Tri-Info matches the strongest baselines in-domain. Moreover, Tri-Info transfers across architectures, environments, and the sim-to-real gap without retraining, reaching 83% accuracy on real-world tasks where prior detectors collapse to chance. This establishes Tri-Info as a simple yet powerful method that not only detects failures with strong cross-domain generalization, but also delivers interpretable diagnostics of the underlying failure modes.
[108] Fast Human Attention Prediction for Fixation-guided Active Perception in Autonomous Navigation cs.RO | cs.CVPDF
Fatma Youssef Mohammed, Grzegorz Malczyk, Kostas Alexis
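TL;DR: This paper presents GazeLNN, a computationally lightweight model for predicting human scanpaths, intended for active perception in autonomous robot navigation. It combines Liquid Neural Networks with MobileNetV3 and predicts sequential fixation heatmaps auto-regressively, sharply reducing compute cost while maintaining high performance.
Details
Motivation: Existing models for predicting human visual attention are computationally expensive, which hinders their use in robot autonomy. The paper targets this efficiency bottleneck.
Result: On the MIT Low Resolution dataset, GazeLNN reaches state-of-the-art performance with a ScanMatch score of 0.47, while reducing computational cost by 99.40% and accelerating inference by up to six times.
Insight: The contribution is to use Liquid Neural Networks as the recurrent engine for scanpath prediction, combining high performance with very low compute overhead. The model is integrated into an active camera-robot control policy trained with reinforcement learning, which demonstrates its practical utility.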
Abstract: Human visual attention relies on structured scanpaths to efficiently process scenes, yet instilling this behavior into robot autonomy is in its infancy and hindered by the high computational costs of existing predictive models. To address this, we introduce GazeLNN, a computationally lightweight scanpath prediction model that leverages Liquid Neural Networks as its recurrent engine and employs MobileNetV3 for feature extraction. Operating auto-regressively, the architecture predicts sequential fixation heatmaps conditioned on the current visual stimulus and fixation history. Despite requiring only 0.61 GFLOPs, GazeLNN achieves state-of-the-art performance on the MIT Low Resolution dataset, with a ScanMatch score of 0.47. It outperforms existing recurrent baselines across diverse evaluation metrics, while reducing computational costs by 99.40% and accelerating inference by up to six times. To investigate the role of human attention modeling in robot autonomy and demonstrate the practical utility of this highly efficient architecture, we integrate GazeLNN into an active camera-robot control policy trained via Reinforcement Learning. This integration enables human-fixation-guided perception during autonomous navigation, validated through successful real-world deployments on an aerial robot.
cs.LG [Back]
[109] Thermodynamic Signatures of Reasoning: Free-Energy and Spectral-Form-Factor Diagnostics for Hallucination Detection in Large Language Models cs.LG | cs.CLPDF
Salim Khazem
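TL;DR: This paper proposes Free-Energy Signatures (Fes), a spectral descriptor for detecting hallucinations in large language models (LLMs). It treats the attention Laplacian as a Hamiltonian, extracts its thermodynamic potentials (such as the partition function and free energy) together with the random-matrix-theory (RMT) spectral form factor, and proves theoretical guarantees on stability, expressiveness, and detection performance. Across six open-weight LLMs and six benchmarks, a lightweight probe on Fes achieves the best AUROC among attention-spectral baselines and also yields a usable detector in the unsupervised setting.
Details
Motivation: Existing spectral diagnostics such as LapEig summarize the Laplacian spectrum with a handful of eigenvalues or hand-picked scalars and leave most of its structure unused. A more complete, theoretically grounded spectral descriptor is needed to improve hallucination detection.
Result: Across six open-weight LLMs and six benchmarks, a lightweight probe on Fes achieves the strongest aggregate AUROC among attention-spectral baselines, improving on LapEig by 6.5 AUROC points and on GoR-4 by 2.4 points on average. In the fully unsupervised setting, the RMT-deviation score reaches a mean AUROC of 0.71.
Insight: The contribution is to interpret the attention Laplacian as a Hamiltonian and systematically extract its thermodynamic potentials and RMT spectral form factor as descriptors, which capture richer spectral structure. The approach brings thermodynamic concepts from physics and random matrix theory to LLM hallucination detection, offering a new theoretical framework and diagnostic tool for the spectral signatures of reasoning, and it requires no update to the underlying LLM.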
Abstract: Hallucination detection in large language models (LLMs) is deployment-critical, and recent work shows that the spectrum of attention-derived graph Laplacians carries strong signal about reasoning quality. Prior spectral diagnostics, however, summarize the Laplacian spectrum by a handful of eigenvalues or hand-picked scalars, leaving most of its structure unused. We propose Free-Energy Signatures (Fes), a spectral descriptor that treats each layer’s attention Laplacian as a Hamiltonian and extracts its thermodynamic potentials (partition function, free energy, spectral entropy, heat capacity) together with the random-matrix-theory (RMT) spectral form factor. We prove three results: (i) Lipschitz stability of Fes under attention perturbation; (ii) an expressiveness result showing that Fes enriches finite spectral summaries and approximates moment-derived spectral functionals under explicit regularity and grid-resolution assumptions; and (iii) a finite-sample PAC bound on the AUROC of a training-free detector built from Fes. Empirically, across six open-weight LLMs and six benchmarks, a lightweight probe on Fes descriptors achieves the strongest aggregate AUROC among attention-spectral baselines, improving over LapEig by +6.5 AUROC points and over GoR-4 by +2.4 points on average, while requiring no update to the underlying LLM. In the fully unsupervised setting, an RMT-deviation score achieves mean AUROC 0.71, providing a label-free but weaker detector. A complementary RMT analysis shows that correct generations exhibit more Wigner-Dyson-like spectral statistics, whereas hallucinations exhibit more Poisson-like statistics. The anonymized code and config are provided in the supplementary material.
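The thermodynamic quantities named in the abstract have standard statistical-mechanics forms when the Laplacian eigenvalues are read as energy levels. The sketch below uses those textbook definitions; the paper's exact normalizations, its choice of inverse temperature, and how it builds the Laplacian from attention are not given in this entry, so the symmetrization step, the function names, and `beta` are assumptions.

```python
import numpy as np


def attention_laplacian(attn):
    """Graph Laplacian L = D - W of a symmetrized attention matrix (assumed construction)."""
    w = 0.5 * (attn + attn.T)
    return np.diag(w.sum(axis=1)) - w


def free_energy_signature(eigvals, beta=1.0):
    """Textbook thermodynamic potentials of a spectrum at inverse temperature beta."""
    lam = np.asarray(eigvals, dtype=float)
    a = -beta * lam
    # log Z = log sum_i exp(-beta * lambda_i), computed stably via log-sum-exp.
    log_z = a.max() + np.log(np.exp(a - a.max()).sum())
    p = np.exp(a - log_z)                        # Boltzmann weights
    energy = float((p * lam).sum())              # U = <lambda>
    variance = float((p * lam**2).sum()) - energy**2
    free_energy = -log_z / beta                  # F = -(1/beta) log Z
    return {
        "log_partition": float(log_z),
        "free_energy": float(free_energy),
        "entropy": float(beta * (energy - free_energy)),  # S = beta * (U - F)
        "heat_capacity": float(beta**2 * variance),       # C = beta^2 * Var(lambda)
    }


def spectral_form_factor(eigvals, t):
    """K(t) = |sum_i exp(-i * lambda_i * t)|^2 / n, one common normalization."""
    lam = np.asarray(eigvals, dtype=float)
    return float(np.abs(np.exp(-1j * lam * t).sum()) ** 2 / lam.size)


# Toy 2-token attention matrix (illustrative only).
attn = np.array([[0.5, 0.5], [0.5, 0.5]])
lam = np.linalg.eigvalsh(attention_laplacian(attn))
signature = free_energy_signature(lam, beta=1.0)
print(np.round(lam, 6), round(signature["free_energy"], 4))
```

Sweeping `beta` and `t` over a grid and concatenating the results gives a fixed-length descriptor per layer, which is the kind of input a lightweight probe could be trained on.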
[110] Efficiently Representing Algorithms With Chain-of-Thought Transformers cs.LG | cs.AI | cs.CLPDF
Yanhong Li, Anej Svete, Ashish Sabharwal, William Merrill
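TL;DR: This paper proves that chain-of-thought (CoT) transformers can efficiently simulate Word RAM algorithms with only poly-logarithmic overhead. The result is first established for finite-precision transformers with poly-logarithmic width and rightmost unique hard attention, then extended to two settings with finite width and log-precision: continuous CoT, where reasoning takes the form of vectors, and a hybrid architecture in which transformer layers sit atop a recurrent layer.
Details
Motivation: Existing theory shows that CoT transformers can simulate Turing machines, but Turing machines are ill-suited to discussing algorithms efficiently, whereas the Word RAM model is closer to how algorithms are actually designed and analyzed. The paper therefore asks whether CoT transformers can efficiently simulate Word RAM algorithms, for example sorting in O(n log n) steps or running Dijkstra's algorithm.
Result: The answer is affirmative: CoT transformers can simulate any Word RAM algorithm with poly-logarithmic overhead. The overhead drops to log-square when the Word RAM has a "flat" instruction set, and to logarithmic for multiplication-free flat instructions, in stark contrast to known CoT simulations of Turing machines, which require quadratic overhead over Word RAM.
Insight: The contribution is to extend the theoretical capabilities of CoT transformers from Turing machines to the more efficient Word RAM model, showing poly-logarithmic overhead when simulating practical algorithms. This gives a new view of the computational expressiveness of reasoning models and may inform the design of architectures that simulate algorithms more efficiently.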
Abstract: The increasing popularity of reasoning models – language models that output a series of reasoning or thought tokens before producing an answer – is justified, in part, by theoretical results showing that chain-of-thought (CoT) transformers can simulate Turing machines, and thus perform arbitrary computation. However, the Turing machine, while suitable for complexity-theoretic analysis, is not convenient, intuitive, or efficient for discussing algorithms. Algorithms are typically designed and analyzed at a higher level of abstraction, captured by the Word RAM model with random-access memory and unit-cost operations on $O(\log n)$-bit words. As a result, Word RAM algorithms can be substantially more efficient than their Turing machine counterparts, raising the question: Can CoT transformers efficiently simulate Word RAM algorithms? For instance, can they sort $n$ items in $O(n \log n)$ steps or run Dijkstra’s algorithm in $O(E + V \log V)$ steps? We answer affirmatively, up to poly-logarithmic overhead. We first establish this for finite-precision transformers with poly-logarithmic width and rightmost unique hard attention, then strengthen the result to two more practical settings with finite width and log-precision: continuous CoT, where reasoning takes the form of vectors rather than tokens, and a hybrid architecture in which transformer layers sit atop a recurrent (linear RNN) layer. In all three cases, we find that CoT can efficiently simulate any Word RAM algorithm with only a poly-logarithmic overhead in $n$. This overhead reduces to log-square when the Word RAM has a "flat" instruction set, and only logarithmic for multiplication-free flat instructions – in stark contrast to known CoT simulations of Turing machines, which require quadratic overhead over Word RAM.
[111] Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models cs.LG | cs.AI | cs.CLPDF
Darrien McKenzie, Nicklas Hansen, Xiaolong Wang
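TL;DR: This paper proposes Bayesian Manifold Curriculum (BMC), a manifold-structured curriculum learning method. It models problem sampling in reinforcement learning for large language models as a manifold bandit problem with endogenous non-stationarity, and uses Bayesian learning over the latent representation space to guide sampling so as to balance learning signal, task diversity, and evaluation relevance.
Details
Motivation: Existing adaptive curriculum learning methods typically focus only on problem difficulty and treat problem selection as a classical bandit problem with independent arms. They ignore the structured, heterogeneous nature of the task space, which limits training efficiency.
Result: Experiments show that different sampling strategies induce non-trivial trade-offs between learning signal, coverage of the task manifold, and evaluation relevance, and that prioritizing difficulty alone is insufficient for strong downstream performance.
Insight: The contribution is to frame problem sampling as a manifold-structured bandit problem and to introduce a hierarchical task tree with Bayesian learning that explicitly models the latent geometric relations among tasks, underlining the importance of structure-awareness and type-awareness in curriculum learning.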
Abstract: Reinforcement learning (RL) is a central approach for improving reasoning capabilities in large language models (LLMs), where training efficiency depends critically on how problems are sampled during optimization. Existing adaptive curriculum learning methods typically prioritize prompts of intermediate difficulty, treating problem selection as a standard bandit problem with independent arms and overlooking the structured, heterogeneous nature of the task space. In this work, we frame problem sampling as a manifold-structured bandit problem with endogenous non-stationarity: problems are related through the model’s latent representation space, and sampling decisions can steer how learning signals evolve across that space. To operationalize this perspective, we introduce Bayesian Manifold Curriculum (BMC), a structure-aware framework that organizes problems into a hierarchical task tree and applies Bayesian learning to guide sampling. Empirically, we find that different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance). These results show that prioritizing difficulty alone is insufficient for strong downstream performance, highlighting the importance of incorporating structure and type-awareness into problem sampling.
[112] Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning cs.LG | cs.AI | cs.CL
Yanxi Chen, Weijie Shi, Yuexiang Xie, Boyi Hu, Yaliang Li
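TL;DR: This paper proposes "Connect the Dots" (CoD), a general framework for training large language models (LLMs) to acquire a meta-capability: over a long deployment, an LLM-based agent connects the dots of its experience by continuously exploring the environment, learning from its own experiences, and iteratively updating its contextual understanding of the environment, so that its performance on future tasks improves as it works through a sequence of tasks. The framework comprises end-to-end reinforcement learning algorithm design, infrastructure that supports long sequences of task and environment interaction, and tasks and environments built specifically to incentivize and evaluate this meta-capability.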
Details
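Motivation: The core challenge for long-lifecycle agents is that an agent must improve steadily during continuous deployment by accumulating experience and updating its own contextual knowledge. The paper aims to train LLMs to acquire this "connect the dots" meta-capability, as opposed to task ability in a specific domain.
Result: Empirical results validate the efficacy of end-to-end reinforcement learning in the CoD setting and demonstrate the potential of the elicited meta-capability for out-of-distribution generalization within the training domains, across different domains, and from CoD to Ralph-loop settings.
Insight: The novelty lies in a general framework focused on training LLMs for a long-term self-improvement meta-capability instead of optimizing performance on specific tasks. It combines a GRPO-style reinforcement learning algorithm with fine-grained credit assignment with tasks and environments designed for this meta-capability, promotes cross-domain generalization, and connects several lines of prior work.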
Abstract: This work presents a general framework for training large language models (LLMs) to “Connect the Dots” (CoD), a meta-capability required by long-lifecycle agents: as an LLM-based AI agent gets deployed in an environment, it solves a long sequence of tasks while continuously exploring the environment, learning from its own experiences, and iteratively self-updating its context about the environment, thereby achieving progressively better performance on future tasks conditioned on the updated context. Major components of the CoD framework include: (1) algorithm design and infrastructure for end-to-end reinforcement learning (RL) with long rollout sequences interleaving solve-task and update-context episodes; (2) tasks and environments for incentivizing and eliciting the targeted meta-capability in LLMs during training, as well as for faithfully measuring progress during evaluation. We present proof-of-concept implementations of the CoD framework, including a GRPO-style RL algorithm with fine-grained credit assignment, as well as tasks and environments tailored to the targeted meta-capability (rather than domain-specific LLM capabilities or standard task-by-task RL). Empirical results validate the efficacy of end-to-end RL training in the CoD setting, and demonstrate the potential for out-of-distribution generalization – within the training domains, across different domains, and from CoD to Ralph-loop settings – of the elicited meta-capability. Our investigation of CoD connects several lines of prior work, and opens up new opportunities for advancing LLMs and AI agents. To facilitate further research and applications, we release our implementations at https://github.com/agentscope-ai/Trinity-RFT/tree/research/cod/examples/research_cod.
[113] What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis cs.LG | cs.CL
Xinghao Chen, Chak Tou Leong, Wenjin Guo, Jian Wang, Wenjie Li
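TL;DR: This paper analyzes effective supervision for Latent Chain-of-Thought (Latent CoT) from an information-theoretic perspective and shows that conventional outcome supervision leads to a dual collapse: gradient attenuation and representational drift. The authors decompose process supervision into two complementary dimensions, Trajectory Supervision and Space Supervision, and propose the Unified Latent Probe (ULP) to quantify the mutual information between latent trajectories and explicit reasoning steps. Experiments show that reasoning accuracy is closely tied to the information fidelity preserved in the latent chain, giving a principled framework for latent reasoning supervision.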
Details
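Motivation: Latent CoT internalizes reasoning within continuous hidden states and so avoids verbose discrete reasoning traces. However, outcome supervision alone provides a weak learning signal and easily leads to semantic drift, so it is necessary to study how to achieve robust latent reasoning.
Result: Experiments reveal an Information-Performance Binding: reasoning accuracy depends on the information fidelity preserved in the latent chain. Quantifying mutual information with ULP confirms that generative reconstruction preserves information capacity better than rigid geometric compression, which improves performance.
Insight: The novelty lies in decomposing process supervision, from an information-theoretic angle, into Trajectory Supervision and Space Supervision, and in proposing ULP as an evaluation tool. The core insight is to shift from geometric imitation toward mutual information maximization, offering new theoretical guidance for latent reasoning supervision.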
Abstract: Latent Chain-of-Thought (CoT) internalizes reasoning within continuous hidden states, offering a promising alternative to verbose discrete reasoning traces. However, robust latent reasoning remains difficult because outcome supervision provides weak learning signals and leaves latent trajectories prone to semantic drift. In this work, we analyze Latent CoT from an information-theoretic perspective and identify this failure as a dual collapse: gradient attenuation along the optimization path and representational drift in the latent space. We further decompose process supervision into two complementary dimensions: Trajectory Supervision, which injects dense stepwise reasoning signals, and Space Supervision, which preserves the semantic structure of the latent manifold. Our analysis shows that rigid geometric compression can collapse the reasoning space, whereas generative reconstruction provides a more flexible semantic anchor that better preserves information capacity. To measure these effects, we introduce the Unified Latent Probe (ULP), which quantifies the mutual information between latent trajectories and explicit reasoning steps. Experiments reveal a clear Information-Performance Binding: reasoning accuracy depends on the information fidelity preserved in the latent chain. These findings provide a principled framework for latent reasoning supervision and suggest shifting from geometric imitation toward mutual information maximization. Our code is available at https://github.com/EIT-NLP/Supervision-in-Latent-CoT.
[114] ProMUSE: Progressive Multi-modal Uncertainty-guided Staged Evidential Alzheimer Disease Classification cs.LG | cs.AI | cs.CV
Long Doan, Branden Chen, Ethan Litton, Huan Huang, Jiajing Huang
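TL;DR: The paper proposes ProMUSE, a progressive multi-modal uncertainty-guided staged evidential network for early diagnosis of Alzheimer's disease (AD). The method first performs evidential classification on low-cost clinical data and quantifies uncertainty. When the uncertainty exceeds a learned threshold, it progressively brings in features from expensive imaging modalities such as MRI or PET and fuses per-modality belief and uncertainty through Dempster-Shafer theory, which substantially reduces reliance on expensive imaging data while maintaining diagnostic accuracy.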
Details
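Motivation: Early diagnosis of Alzheimer's disease is critical, but multimodal data such as MRI and PET are costly and not readily accessible, which makes full-modality inference impractical in real-world clinical workflows. A method is therefore needed that adaptively decides when additional modalities are required, lowering the total cost of data acquisition while maintaining accuracy.
Result: Experiments on the ADNI, AIBL, and OASIS datasets across the CN-AD, CN-MCI, and MCI-AD classification tasks show that ProMUSE achieves accuracy comparable or superior to full-modality baselines while reducing MRI/PET usage by 50-90%, yielding substantial cost savings.
Insight: The novelty lies in an uncertainty-guided progressive multimodal fusion framework that quantifies uncertainty with a Dirichlet-based subjective logic model, applies a learned threshold, and fuses beliefs through Dempster-Shafer theory, giving a resource-efficient and practical AD screening scheme. Viewed objectively, its staged data acquisition strategy and uncertainty-driven modality selection offer a reusable approach to the cost-accuracy tradeoff in multimodal medical diagnosis.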
Abstract: Alzheimer’s disease (AD) is a fatal disorder that destroys memory and cognitive skills in the elderly population. Most treatments for AD are effective in the early stage, leading to an increasing demand for early AD diagnosis. AD diagnosis increasingly relies on multimodal data such as clinical assessments, structural Magnetic Resonance Imaging (MRI), and Positron Emission Tomography (PET) imaging. However, MRI and PET acquisition remain costly and not universally accessible, making full-modality inference impractical in real-world clinical workflows. We propose ProMUSE, a Progressive Multi-modal Uncertainty Guided Staged Evidential Network that adaptively determines when additional modalities are necessary, helping reduce the overall cost of data acquisition while maintaining accuracy. ProMUSE first performs evidential classification using low-cost clinical data and quantifies uncertainty via a Dirichlet-based subjective logic model. When uncertainty exceeds a learned threshold, ProMUSE progressively incorporates MRI or PET features, fusing modality-wise belief and uncertainty through Dempster-Shafer theory to obtain a calibrated multimodal prediction. This staged acquisition strategy enables accurate diagnosis while minimizing reliance on expensive imaging. Experiments on ADNI, AIBL, and OASIS across CN-AD, CN-MCI, and MCI-AD tasks demonstrate that ProMUSE achieves competitive or superior accuracy compared to full-modality baselines while reducing MRI/PET usage by 50-90%, yielding substantial cost savings. These results highlight ProMUSE as a practical, uncertainty-aware, and resource-efficient solution for real-world AD screening.
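The staged procedure described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: it uses the standard subjective-logic mapping from non-negative class evidence to belief and uncertainty masses (Dirichlet parameters alpha = evidence + 1) and the reduced Dempster combination rule commonly used in evidential multi-view classification. The function names, the evidence values, and the fixed threshold are hypothetical; in the paper the threshold is learned.

```python
from typing import List, Tuple

# An opinion is (belief mass per class, uncertainty mass); the masses sum to 1.
Opinion = Tuple[List[float], float]


def opinion_from_evidence(evidence: List[float]) -> Opinion:
    """Map non-negative class evidence to a subjective-logic opinion."""
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes  # Dirichlet strength S = sum(e_k + 1)
    belief = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return belief, uncertainty


def dempster_combine(first: Opinion, second: Opinion) -> Opinion:
    """Fuse two opinions with the reduced Dempster combination rule."""
    (b1, u1), (b2, u2) = first, second
    # Conflict: mass that the two sources assign to different classes.
    conflict = sum(
        b1[i] * b2[j]
        for i in range(len(b1))
        for j in range(len(b2))
        if i != j
    )
    scale = 1.0 / (1.0 - conflict)
    belief = [scale * (p * q + p * u2 + q * u1) for p, q in zip(b1, b2)]
    uncertainty = scale * u1 * u2
    return belief, uncertainty


def staged_predict(
    clinical_evidence: List[float],
    imaging_evidence: List[float],
    threshold: float,  # hypothetical fixed value; learned in the paper
) -> Tuple[int, float, bool]:
    """Use cheap clinical evidence first; add imaging only when too uncertain."""
    opinion = opinion_from_evidence(clinical_evidence)
    used_imaging = False
    if opinion[1] > threshold:
        imaging_opinion = opinion_from_evidence(imaging_evidence)
        opinion = dempster_combine(opinion, imaging_opinion)
        used_imaging = True
    belief, uncertainty = opinion
    label = max(range(len(belief)), key=belief.__getitem__)
    return label, uncertainty, used_imaging
```

In this sketch, weak clinical evidence such as `[1.0, 1.0]` yields an uncertainty of 0.5 and triggers fusion with the imaging opinion, whereas strong clinical evidence such as `[18.0, 0.0]` yields 0.1 and skips imaging at a threshold of 0.3. Fusion with a confident second source lowers the combined uncertainty, which is the behaviour the staged acquisition strategy relies on.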
[115] 3D-DLP: Self-Supervised 3D Object-Centric Scene Representation Learning cs.LG | cs.CV | cs.RO
Ellina Zhang, Madhaven Iyengar, Amir Zadeh, Chuan Li, Deepak Pathak
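TL;DR: This paper proposes 3D-DLP, a self-supervised object-centric scene representation learning model. It decomposes scene-level RGB-D or voxel observations into a set of 3D latent particles, each encoding disentangled attributes such as 3D keypoint position, bounding box dimensions, and appearance features, and it learns interpretable per-particle segmentation maps through an end-to-end self-supervised reconstruction objective. The model's latent space is shown to be interpretable and controllable on simulated and real-world datasets, and it improves performance on downstream robotic manipulation tasks.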
Details
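Motivation: The goal is to learn interpretable, controllable, object-centric 3D representations from scene observations, providing a better representational basis for downstream tasks such as robotic manipulation.
Result: The model is validated on simulated and real-world datasets. Its learned latent space is interpretable and controllable and can be used to generate novel scenes. On downstream robotic manipulation tasks, it outperforms baselines that either lack explicit 3D information or rely on dense 3D inputs without object-centric structure.
Insight: The core contribution is extending the DLP framework to 3D. Self-supervised learning decomposes a scene into 3D latent particles with disentangled attributes, yielding an interpretable, controllable object-level scene representation, and the work shows that this compact representation benefits downstream tasks such as robotic manipulation.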
Abstract: We introduce 3D-DLP, a self-supervised object-centric representation learning model that decomposes scene-level RGB-D or voxel observations into a set of 3D latent particles. Building on the Deep Latent Particles (DLP) framework, each particle encodes disentangled attributes, including 3D keypoint position, bounding box dimensions, and appearance features, and represents a distinct entity in the scene. The model learns interpretable per-particle segmentation maps through an end-to-end self-supervised reconstruction objective. We demonstrate on both simulated and real-world datasets that the learned latent space is interpretable and controllable: by manipulating particle positions and decoding, we can generate novel scene configurations. Furthermore, we show that leveraging these compact 3D latent particles for downstream robotic manipulation improves performance over baselines that either lack explicit 3D information or rely on memory-intensive dense 3D inputs without object-centric structure. Code and videos are available at https://eubooks3003.github.io/3d-dlp.
[116] Flow Map Denoisers: Traversing the Distortion-Perception Plane for Inverse Problems cs.LG | cs.CV
Nicolas Zilberstein, Morteza Mardani, Santiago Segarra
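TL;DR: This paper proposes Flow Map Denoisers, an image denoising approach based on flow map models that addresses the tradeoff between distortion and perceptual quality in image restoration. A lookahead parameter t moves continuously between the minimum mean squared error (MMSE) regime and the perceptually optimized regime, traversing multiple operating points on the distortion-perception plane. The method needs no paired-data supervision, auxiliary models, or sampler hyperparameter tuning, applies to general inverse problems, and is validated experimentally on the CelebA and AFHQ datasets.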
Details
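Motivation: Image restoration faces a fundamental tradeoff: methods that minimize error produce blurry reconstructions, while methods that maximize perceptual quality produce sharp but less faithful images. Existing approaches either commit to a single operating point on this distortion-perception frontier or require paired-data supervision, auxiliary models, or sampler hyperparameter tuning to reach different points. The paper aims to provide a continuously adjustable solution through flow map models.
Result: Extensive experiments cover several linear and nonlinear inverse tasks on CelebA (128×128) and AFHQ (256×256). A single trained flow map model spans the distortion-perception tradeoff and matches or exceeds specialized baselines at both extremes. For Gaussian targets, the paper proves that varying t exactly recovers the optimal distortion-perception frontier; for natural images, similar behavior is observed empirically.
Insight: The novelty is that a flow map model implicitly defines a one-parameter family of denoisers, with the lookahead parameter t acting as a control knob that moves continuously between distortion and perceptual quality. This removes the dependence of existing methods on paired data or extra tuning and extends to general inverse problems within a Plug-and-Play solver, where it controls the tradeoff between perceptual alignment and data consistency. Viewed objectively, the method offers a flexible and efficient framework in which a single model spans the distortion-perception tradeoff, with potential for practical deployment.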
Abstract: Image restoration faces a fundamental tradeoff: methods that minimize error produce blurry reconstructions, while those that maximize perceptual quality yield sharp but less faithful images. Existing approaches either commit to a single operating point on this distortion perception (DP) frontier or require paired-data supervision, auxiliary models, or hyperparameter tuning of the sampler to access different points. We show that flow map models, a recent extension of flow matching for few-step sampling that learns an average field, implicitly define a one-parameter family of denoisers that continuously spans the DP frontier. The lookahead parameter t acts as a control knob between the MMSE and perceptual regimes. For Gaussian targets, we prove that varying t exactly recovers the optimal DP frontier; for natural images, we observe similar behavior empirically. Within a Plug-and-Play solver, the same mechanism extends to general inverse problems, where it controls a tradeoff between perceptual alignment and data consistency. Despite the lack of exact optimality guarantees in this setting, a single trained flow map spans the DP tradeoff, matching or exceeding specialized baselines at both extremes. Extensive experiments on CelebA ($128\times 128$) and AFHQ ($256\times 256$) across several linear and nonlinear inverse tasks validate our findings.
[117] Integrating national forest inventory, airborne lidar, and satellite imagery for wall-to-wall mapping of forest structure with computer vision cs.LG | cs.CV
Luke J. Zachmann, David D. Diaz, Vincent A. Landau, Chelsey Walden-Schreiner, Tony Chang
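TL;DR: This paper presents the VibrantForests framework, which integrates national forest inventory, airborne lidar, and satellite imagery data and uses computer vision to produce wall-to-wall, annually updated, 10-meter-resolution maps of forest structure across the contiguous United States, covering key attributes such as canopy cover, canopy height, and aboveground live tree biomass.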
Details
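Motivation: In current forest and wildfire risk management, data sources that differ in purpose, vintage, and quality produce confounding behavior in operational planning systems. Coherent, wall-to-wall, annually updated maps of forest structure are needed.
Result: The model shows predictive capability across the full spectrum of forest conditions, from sparse-canopy/low-biomass to dense-canopy/high-biomass. Relative to comparable passive-sensor models, it delays the onset of saturation and reduces regression-to-the-mean behavior, which avoids overestimation in small/sparse conditions and underestimation in large/dense conditions.
Insight: The novelty lies in integrating multi-source data (national inventory, lidar, satellite imagery) and training a satellite-based model to estimate forest attributes wall-to-wall at high resolution with annual updates, providing a coherent data foundation for forest and wildfire planning.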
Abstract: Remote sensing is increasingly relied upon to deliver actionable science for forest and wildfire risk management across large landscapes. Wall-to-wall, annually updated maps are a persistent need for effective forest management. Many planning systems and data collections combine disparate data sources with different purposes, vintages, and prediction quality, which leads to confounding behavior in operational planning systems. We introduce the VibrantForests framework, developed and applied to map forest attributes and provide a coherent foundation for effective forest and wildfire planning. VibrantForests includes a satellite-based forest structure model trained on lidar-derived samples and applied across the contiguous United States to concurrently generate estimates of canopy cover, canopy height, aboveground live tree biomass, basal area, and quadratic mean diameter at 10-meter resolution. We demonstrate predictive capability spanning the full spectrum of forest conditions ranging from sparse-canopy/low-biomass to dense-canopy/high-biomass. Results show that our model extends the range at which saturation is commonly encountered in comparable passive-sensor models, and reduces regression-to-mean behavior that commonly produces overestimation of forest attributes in small/sparse conditions and underestimation in large/dense conditions. The VibrantForests framework addresses a key limitation in large-area forest and wildfire planning by delivering coherent wall-to-wall estimates of management-relevant attributes at annual cadence and 10m resolution.